# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions will incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a recent movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [None]:
import requests
import csv
import time

# Base parameters
query_base = "machine learning"  # base query string
fields = "paperId,title,abstract,year"  # fields to retrieve
limit = 100  # records per request
max_results_needed = 10000  # total records to collect
base_url = "https://api.semanticscholar.org/graph/v1/paper/search"

# Define a range of years to split the query.
years = list(range(2000, 2025))

csv_filename = "semantic_scholar_abstracts.csv"

# Function to perform GET requests with exponential backoff for 429 errors
def get_with_backoff(url, params, max_retries=5):
    backoff_time = 5  # initial backoff in seconds
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code == 429:
            print(f"Received 429 (Too Many Requests). Backing off for {backoff_time} seconds.")
            time.sleep(backoff_time)
            backoff_time *= 2  # exponential backoff
        else:
            return response
    return response  # return the last response if all retries fail

global_count = 0  # count of total papers collected

with open(csv_filename, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    # Write header row
    writer.writerow(["paperId", "title", "abstract", "year"])

    # Loop over each year to split the query
    for year in years:
        offset = 0
        # Restrict results to this publication year via the API's `year` parameter
        # instead of appending the year to the query text.
        query = query_base
        print(f"\nStarting collection for year: {year}")

        while True:
            if global_count >= max_results_needed:
                break  # Stop if we've reached the target

            params = {
                "query": query,
                "year": str(year),
                "offset": offset,
                "limit": limit,
                "fields": fields
            }

            print(f"Fetching records for year {year}: {offset+1} to {offset+limit}...")
            response = get_with_backoff(base_url, params)

            # Check for 400 error
            if response.status_code == 400:
                print(f"Error: Received status code {response.status_code} for offset {offset} in year {year}. Moving to next year.")
                break

            if response.status_code != 200:
                print(f"Error: Received status code {response.status_code}. Stopping.")
                break

            data = response.json()
            papers = data.get("data", [])
            if not papers:
                print(f"No more papers returned for year {year}.")
                break

            # Write each paper's data to CSV
            for paper in papers:
                writer.writerow([
                    paper.get("paperId", ""),
                    paper.get("title", ""),
                    paper.get("abstract", ""),
                    paper.get("year", "")
                ])
                global_count += 1
                if global_count >= max_results_needed:
                    break  # Stop once target is reached

            offset += limit  # update offset for next page
            # Brief pause to help avoid rate limits
            time.sleep(1)

        if global_count >= max_results_needed:
            break  # Stop processing further years if 10,000 records are collected
print(f"\nData collection complete. {global_count} records saved to {csv_filename}")


Starting collection for year: 2000
Fetching records for year 2000: 1 to 100...
Fetching records for year 2000: 101 to 200...
Fetching records for year 2000: 201 to 300...
Received 429 (Too Many Requests). Backing off for 5 seconds.
Received 429 (Too Many Requests). Backing off for 10 seconds.
Fetching records for year 2000: 301 to 400...
Fetching records for year 2000: 401 to 500...
Received 429 (Too Many Requests). Backing off for 5 seconds.
Received 429 (Too Many Requests). Backing off for 10 seconds.
Received 429 (Too Many Requests). Backing off for 20 seconds.
Received 429 (Too Many Requests). Backing off for 40 seconds.
Received 429 (Too Many Requests). Backing off for 80 seconds.
Error: Received status code 429. Stopping.

Starting collection for year: 2001
Fetching records for year 2001: 1 to 100...
Fetching records for year 2001: 101 to 200...
Fetching records for year 2001: 201 to 300...
Received 429 (Too Many Requests). Backing off for 5 seconds.
Received 429 (Too Many Reque
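
The log above shows the retry schedule (5, 10, 20, 40, 80 seconds) being exhausted before a request is abandoned. A minimal sketch of a capped backoff schedule with optional jitter, which also prefers a `Retry-After` header when one is present. The helper names `backoff_delays` and `delay_from_headers` are hypothetical, and whether the Semantic Scholar API sends `Retry-After` is an assumption that the log does not confirm.

```python
import random

def backoff_delays(initial=5.0, factor=2.0, max_retries=5, cap=120.0, jitter=0.0, rng=None):
    """Return the sleep time in seconds for each successive retry.

    Each delay is min(cap, initial * factor**attempt), plus up to a `jitter`
    fraction of extra random time so that parallel clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        base = min(cap, initial * factor ** attempt)
        delays.append(base + base * jitter * rng.random())
    return delays

def delay_from_headers(headers, fallback):
    """Prefer a numeric Retry-After header (seconds) over the computed fallback."""
    value = headers.get("Retry-After", "")
    try:
        return max(float(value), 0.0)
    except ValueError:
        # Header missing or given as an HTTP date: use the computed delay instead.
        return fallback

print(backoff_delays())                                    # schedule without jitter
print(delay_from_headers({"Retry-After": "30"}, fallback=5.0))
```

With the defaults and no jitter, the schedule matches the one in the log; raising `max_retries` or adding jitter only changes the arguments, not the request loop.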

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
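
The order of these steps matters: stopword lists such as NLTK's are lowercase, so lowercasing (4) has to happen before stopword removal (3). Below is a minimal standard-library sketch of steps (1) to (4) only. `TOY_STOPWORDS` is a tiny illustrative set rather than the NLTK list, `basic_clean` is a hypothetical helper, and stemming and lemmatization are left out because they need NLTK resources. Unlike a plain deletion, this sketch replaces punctuation with a space so that tokens such as `post-processing` are split instead of being glued together.

```python
import re

TOY_STOPWORDS = {"the", "a", "an", "in", "of", "and", "is", "on"}  # illustrative only

def basic_clean(text, stopwords=TOY_STOPWORDS):
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # (1) punctuation and symbols -> space
    text = re.sub(r"\d+", " ", text)               # (2) remove numbers
    text = text.lower()                            # (4) lowercase before stopword matching
    tokens = [tok for tok in text.split() if tok not in stopwords]  # (3) drop stopwords
    return " ".join(tokens)

print(basic_clean("The KDD-2000 workshop, in Boston: post-processing of models!"))
```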

In [None]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file
df = pd.read_csv("semantic_scholar_abstracts.csv")

# Define a function that cleans a text and prints intermediate outputs for demonstration.
def clean_text_verbose(text):
    if not isinstance(text, str):
        return text  # if text is NaN or not a string, return it unchanged.

    print("=== Original Text ===")
    print(text)

    # (1) Remove noise: special characters and punctuation
    text_no_special = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    print("\n=== After Removing Special Characters and Punctuation ===")
    print(text_no_special)

    # (2) Remove numbers
    text_no_numbers = re.sub(r'\d+', '', text_no_special)
    print("\n=== After Removing Numbers ===")
    print(text_no_numbers)

    # (4) Lowercase all texts
    text_lower = text_no_numbers.lower()
    print("\n=== After Converting to Lowercase ===")
    print(text_lower)

    # (3) Remove stopwords using the stopwords list
    stop_words = set(stopwords.words('english'))
    words = text_lower.split()
    words_no_stop = [word for word in words if word not in stop_words]
    text_no_stop = ' '.join(words_no_stop)
    print("\n=== After Removing Stopwords ===")
    print(text_no_stop)

    # (5) Stemming
    stemmer = PorterStemmer()
    words_stemmed = [stemmer.stem(word) for word in text_no_stop.split()]
    text_stemmed = ' '.join(words_stemmed)
    print("\n=== After Stemming ===")
    print(text_stemmed)

    # (6) Lemmatization
    lemmatizer = WordNetLemmatizer()
    words_lemmatized = [lemmatizer.lemmatize(word) for word in text_stemmed.split()]
    text_lemmatized = ' '.join(words_lemmatized)
    print("\n=== After Lemmatization ===")
    print(text_lemmatized)

    return text_lemmatized

# Demonstrate the cleaning steps on one sample abstract
sample_abstract = df['abstract'].dropna().iloc[0]
print("\n\n--- Cleaning a Sample Abstract ---")
cleaned_sample = clean_text_verbose(sample_abstract)

# Now define a streamlined cleaning function to apply to the entire dataset.
# The stopword set, stemmer, and lemmatizer are built once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

def final_clean_text(text):
    if not isinstance(text, str):
        return text
    # Remove special characters and punctuation
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Lowercase
    text = text.lower()
    # Remove stopwords
    words = [word for word in text.split() if word not in STOP_WORDS]
    # Stemming
    words = [STEMMER.stem(word) for word in words]
    # Lemmatization (applied to the stems, as in the verbose function above)
    words = [LEMMATIZER.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply the cleaning to the "abstract" column and store results in a new column "clean_abstract"
df['clean_abstract'] = df['abstract'].apply(final_clean_text)

# Save the updated DataFrame to a new CSV file
output_filename = "semantic_scholar_abstracts_clean.csv"
df.to_csv(output_filename, index=False)

print(f"\nCleaned data has been saved to '{output_filename}'")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...




--- Cleaning a Sample Abstract ---
=== Original Text ===
This article surveys the contents of the workshop Post-Processing in Machine Learning and Data Mining: Interpretation, Visualization, Integration, and Related Topics within KDD-2000: The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20-23 August 2000. The corresponding web site is on www.acm.org/sigkdd/kdd2000 First, this survey paper introduces the state of the art of the workshop topics, emphasizing that postprocessing forms a significant component in Knowledge Discovery in Databases (KDD). Next, the article brings up a report on the contents, analysis, discussion, and other aspects regarding this workshop. Afterwards, we survey all the workshop papers. They can be found at (and downloaded from) www.cas.mcmaster.ca/~bruha/kdd2000/kddrep.html The authors of this report worked as the organizers of the workshop; the programme committee was formed by additional three researches

# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
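
For part (1), the four totals come down to grouping fine-grained tags into coarse classes. The sketch below assumes Penn Treebank tags, where noun tags start with `NN`, verb tags with `VB`, adjective tags with `JJ`, and adverb tags with `RB`. The hand-written `(word, tag)` pairs stand in for real tagger output, and `count_coarse_pos` is a hypothetical helper.

```python
from collections import Counter

# Penn Treebank tag prefix -> coarse class requested in part (1)
PREFIX_TO_CLASS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in a list of (word, tag) pairs."""
    counts = Counter({name: 0 for name in PREFIX_TO_CLASS.values()})
    for _word, tag in tagged_tokens:
        coarse = PREFIX_TO_CLASS.get(tag[:2])
        if coarse:  # tags outside the four classes (e.g. DT) are ignored
            counts[coarse] += 1
    return counts

# Hand-tagged example sentence; not produced by a tagger.
tagged = [
    ("This", "DT"), ("article", "NN"), ("briefly", "RB"), ("surveys", "VBZ"),
    ("recent", "JJ"), ("workshop", "NN"), ("papers", "NNS"),
]
print(dict(count_coarse_pos(tagged)))
```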

In [None]:
import spacy
import pandas as pd
from collections import Counter
import nltk
from nltk import RegexpParser

# Ensure required NLTK data is downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# -------------------------------------------
# Step 1: Load spaCy's English model
# -------------------------------------------
nlp = spacy.load("en_core_web_sm")

# -------------------------------------------
# Step 2: Load the clean abstracts CSV file
# -------------------------------------------
df = pd.read_csv("semantic_scholar_abstracts_clean.csv")
clean_texts = df['clean_abstract'].dropna()

# =============================================================================
# (1) Parts of Speech (POS) Tagging and Count Calculation
# =============================================================================
print("=== Part-of-Speech (POS) Tagging and Count Calculation ===")

total_noun = 0
total_verb = 0
total_adj  = 0
total_adv  = 0

for text in clean_texts:
    doc = nlp(text)
    for token in doc:
        if token.pos_ in ["NOUN", "PROPN"]:
            total_noun += 1
        elif token.pos_ == "VERB":
            total_verb += 1
        elif token.pos_ == "ADJ":
            total_adj += 1
        elif token.pos_ == "ADV":
            total_adv += 1

print(f"Total Nouns:      {total_noun}")
print(f"Total Verbs:      {total_verb}")
print(f"Total Adjectives: {total_adj}")
print(f"Total Adverbs:    {total_adv}")

# =============================================================================
# (2) Constituency Parsing and Dependency Parsing
# =============================================================================
print("\n=== Constituency Parsing and Dependency Parsing ===")

# For demonstration, select one sample sentence from the first clean abstract.
sample_text = clean_texts.iloc[0]
doc_sample = nlp(sample_text)
sample_sentence = list(doc_sample.sents)[0]

print("\nSample Sentence for Parsing:")
print(sample_sentence.text)

# --- Constituency Parsing using NLTK's RegexpParser ---
# Convert the sample sentence into a list of tuples using spaCy tokens.
tokens = [(token.text, token.tag_) for token in sample_sentence]

# Define a simple grammar for chunking into noun phrases (NP), verb phrases (VP), and prepositional phrases (PP)
grammar = r"""
  NP: {<DT>?<JJ.*>*<NN.*>+}   # Noun Phrase: optional determiner, adjectives, and one or more nouns
  PP: {<IN><NP>}             # Prepositional Phrase: preposition followed by a noun phrase
  VP: {<VB.*><NP|PP>*}       # Verb Phrase: verb followed by optional noun or prepositional phrases
"""

# Create the RegexpParser and parse the tokens
constituency_parser = RegexpParser(grammar)
constituency_tree = constituency_parser.parse(tokens)

print("\nConstituency Parse Tree (via NLTK RegexpParser):")
print(constituency_tree)
constituency_tree.pretty_print()

# --- Dependency Parsing using spaCy ---
print("\nDependency Parse (token, dependency relation, head):")
for token in sample_sentence:
    print(f"{token.text:15} {token.dep_:10} {token.head.text}")

print("\nExplanation:")
print("1. Constituency Parse Tree: The tree above is generated using a simple grammar to chunk the sentence into constituents like NP (noun phrase), VP (verb phrase), and PP (prepositional phrase).")
print("   It provides a rough hierarchical structure of the sentence, showing how words combine into larger phrases.")
print("2. Dependency Parse: Each token is printed along with its dependency relation and head token, illustrating the grammatical relationships (e.g., subject, object) between words.")

# =============================================================================
# (3) Named Entity Recognition (NER)
# =============================================================================
print("\n=== Named Entity Recognition (NER) ===")
# Define the entity labels of interest: PERSON, ORG, GPE (locations), PRODUCT, and DATE.
entity_labels = ["PERSON", "ORG", "GPE", "PRODUCT", "DATE"]
entity_counter = Counter()

for text in clean_texts:
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in entity_labels:
            entity_counter[ent.label_] += 1

print("Entity Counts:")
for label in entity_labels:
    print(f"{label}: {entity_counter[label]}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


=== Part-of-Speech (POS) Tagging and Count Calculation ===
Total Nouns:      813783
Total Verbs:      121065
Total Adjectives: 105515
Total Adverbs:    12023

=== Constituency Parsing and Dependency Parsing ===

Sample Sentence for Parsing:
articl survey content workshop postprocess machin learn data mine interpret visual integr relat topic within kdd sixth acm sigkdd intern confer knowledg discoveri data mine boston usa august correspond web site wwwacmorgsigkddkdd first survey paper introduc state art workshop topic emphas postprocess form signific compon knowledg discoveri databas kdd next articl bring report content analysi discus aspect regard workshop afterward survey workshop paper found download wwwcasmcmastercabruhakddkddrephtml author report work organ workshop programm committe form addit three research field

Constituency Parse Tree (via NLTK RegexpParser):
(S
  (NP
    articl/NNP
    survey/NN
    content/NN
    workshop/NNP
    postprocess/NN
    machin/NN)
  (VP learn/VB

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART 1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.
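
The pagination-with-delay loop described above can be sketched independently of the HTML parsing. In this sketch `fetch` is a hypothetical callable supplied by the caller (it returns a list of items, or an empty list on failure), `page_url` and `scrape_pages` are illustrative names, and the `type` and `page` query parameters are taken from the Marketplace URL used in this question. Pages that come back empty are retried a limited number of times and then recorded as failed rather than silently skipped.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://github.com/marketplace"

def page_url(page_num, listing_type="actions"):
    """Build the listing URL for one page number."""
    return f"{BASE_URL}?{urlencode({'type': listing_type, 'page': page_num})}"

def scrape_pages(fetch, first_page, last_page, retries=2, delay=1.0, sleep=time.sleep):
    """Fetch each page in order, retrying empty results, and pause between requests."""
    items, failed = [], []
    for page in range(first_page, last_page + 1):
        for attempt in range(retries + 1):
            result = fetch(page_url(page))
            if result:
                items.extend(result)
                break
            sleep(delay * (attempt + 1))  # wait a little longer after each failure
        else:
            failed.append(page)           # every attempt returned nothing
        sleep(delay)                      # polite pause before the next page
    return items, failed

# Demo with a stand-in fetcher that fails once on page 2 (no network access).
calls = {}
def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("page=2") and calls[url] == 1:
        return []
    return [url]

items, failed = scrape_pages(fake_fetch, 1, 3, sleep=lambda seconds: None)
print(len(items), failed)
```

Passing `sleep` as a parameter keeps the demo instant; a real run would leave the default `time.sleep` in place.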

(PART 2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. They also involve identifying and removing duplicates, handling missing values, and ensuring the data accurately reflects the true content.
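
The completeness and duplicate checks described above can be sketched with the standard library alone (pandas offers equivalents such as `dropna` and `drop_duplicates`). The column names follow the CSV described in Part 1. The sample rows and the helper `quality_filter` are made up for illustration, and treating the URL as the duplicate key and `"N/A"` as a missing-value placeholder are assumptions.

```python
REQUIRED = ("Product Name", "Description", "URL")

def quality_filter(rows, required=REQUIRED, key="URL"):
    """Drop rows with missing required fields or a repeated key; report what was dropped."""
    seen, kept = set(), []
    report = {"missing": 0, "duplicates": 0}
    for row in rows:
        values = [(row.get(col) or "").strip() for col in required]
        if not all(values) or "N/A" in values:
            report["missing"] += 1      # empty field or placeholder value
            continue
        if row[key] in seen:
            report["duplicates"] += 1   # same key already kept
            continue
        seen.add(row[key])
        kept.append(row)
    return kept, report

# Made-up sample rows for illustration.
rows = [
    {"Product Name": "Alpha", "Description": "Runs tests", "URL": "https://example.com/a"},
    {"Product Name": "Alpha", "Description": "Runs tests", "URL": "https://example.com/a"},
    {"Product Name": "Beta", "Description": "N/A", "URL": "https://example.com/b"},
    {"Product Name": "Gamma", "Description": "", "URL": "https://example.com/c"},
]
kept, report = quality_filter(rows)
print(len(kept), report)
```

Returning the report alongside the kept rows makes the quality step auditable: the counts can be printed in the notebook as evidence of what was removed.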


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [None]:
pip install requests-html nest_asyncio pandas nltk

Collecting requests-html
  Downloading requests_html-0.10.0-py3-none-any.whl.metadata (15 kB)
Collecting pyquery (from requests-html)
  Downloading pyquery-2.0.1-py3-none-any.whl.metadata (9.0 kB)
Collecting fake-useragent (from requests-html)
  Downloading fake_useragent-2.0.3-py3-none-any.whl.metadata (17 kB)
Collecting parse (from requests-html)
  Downloading parse-1.20.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting bs4 (from requests-html)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting w3lib (from requests-html)
  Downloading w3lib-2.3.1-py3-none-any.whl.metadata (2.3 kB)
Collecting pyppeteer>=0.0.14 (from requests-html)
  Downloading pyppeteer-2.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting appdirs<2.0.0,>=1.4.3 (from pyppeteer>=0.0.14->requests-html)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting pyee<12.0.0,>=11.0.0 (from pyppeteer>=0.0.14->requests-html)
  Downloading pyee-11.1.1-py3-none-any.whl.metadata (2.8

In [None]:
pip install "lxml[html_clean]"


Collecting lxml_html_clean (from lxml[html_clean])
  Downloading lxml_html_clean-0.4.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.4.1-py3-none-any.whl (14 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.4.1


In [None]:
# Part 1
import requests
from bs4 import BeautifulSoup
import time
import random
import csv

def fetch_products_from_page(page_num):
    """
    Fetch and parse a single GitHub Marketplace Actions page.

    Args:
        page_num (int): The page number to fetch.

    Returns:
        list: A list of dictionaries, each containing:
              - Product Name
              - Description
              - URL (absolute)
              - Page Number
    """
    url = f"https://github.com/marketplace?type=actions&page={page_num}"

    # Set headers to mimic a real browser request
    headers = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/115.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://github.com/"
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Page {page_num}: Error fetching page (Status code: {response.status_code}).")
        return []

    soup = BeautifulSoup(response.text, "html.parser")

    # Find product cards using the data-testid attribute (as per inspected HTML)
    product_cards = soup.find_all("div", attrs={"data-testid": "marketplace-item"})
    products = []

    for card in product_cards:
        # Extract product name and URL from the <h3> element containing the <a> tag
        h3_tag = card.find("h3")
        if h3_tag:
            a_tag = h3_tag.find("a", href=True)
            if a_tag:
                product_name = a_tag.get_text(strip=True)
                # Construct absolute URL (assuming href is a relative link)
                product_url = "https://github.com" + a_tag.get("href", "")
            else:
                product_name = "N/A"
                product_url = "N/A"
        else:
            product_name = "N/A"
            product_url = "N/A"

        # Extract the product description from the <p> tag with known classes
        p_tag = card.find("p", class_="mt-1 mb-0 text-small fgColor-muted line-clamp-2")
        description = p_tag.get_text(strip=True) if p_tag else "N/A"

        products.append({
            "Product Name": product_name,
            "Description": description,
            "URL": product_url,
            "Page Number": page_num
        })

    return products

def main():
    """
    Loops through pages 1 to 500, scrapes product data from each page,
    and saves the combined results to a CSV file.
    """
    all_products = []
    max_pages = 500  # Loop through 500 pages regardless of product count

    for page in range(1, max_pages + 1):
        print(f"Loading page {page}...")
        products = fetch_products_from_page(page)
        if products:
            print(f"Page {page} complete. Extracted {len(products)} products.")
            all_products.extend(products)
        else:
            print(f"Skipping page {page} due to error.")

        # Pause randomly between 1 and 3 seconds to avoid server overload
        time.sleep(random.uniform(1, 3))

    # Save all extracted products to a CSV file
    csv_filename = "scraped_products.csv"
    fieldnames = ["Product Name", "Description", "URL", "Page Number"]

    try:
        with open(csv_filename, "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for product in all_products:
                writer.writerow(product)
        print(f"Product data saved to '{csv_filename}'.")
    except Exception as e:
        print("Error saving CSV:", e)

if __name__ == "__main__":
    main()


Loading page 1...
Page 1 complete. Extracted 20 products.
Loading page 2...
Page 2 complete. Extracted 20 products.
Loading page 3...
Page 3 complete. Extracted 20 products.
Loading page 4...
Page 4 complete. Extracted 20 products.
Loading page 5...
Page 5 complete. Extracted 20 products.
Loading page 6...
Page 6 complete. Extracted 20 products.
Loading page 7...
Page 7 complete. Extracted 20 products.
Loading page 8...
Skipping page 8 due to error.
Loading page 9...
Page 9 complete. Extracted 20 products.
Loading page 10...
Skipping page 10 due to error.
Loading page 11...
Page 11 complete. Extracted 20 products.
Loading page 12...
Page 12 complete. Extracted 20 products.
Loading page 13...
Page 13 complete. Extracted 20 products.
Loading page 14...
Page 14 complete. Extracted 20 products.
Loading page 15...
Page 15 complete. Extracted 20 products.
Loading page 16...
Skipping page 16 due to error.
Loading page 17...
Page 17 complete. Extracted 20 products.
Loading page 18...
Page 18 c

In [None]:
# part 2
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

# Read the CSV data
data = pd.read_csv("scraped_products.csv")
print("Original Data Loaded.")

# -------------------------------
# Data Quality Operations
# -------------------------------

# Remove duplicate rows
data.drop_duplicates(inplace=True)

# Drop rows where critical columns are missing
data.dropna(subset=["Product Name", "Description"], inplace=True)

# Reset index after cleaning
data.reset_index(drop=True, inplace=True)

# -------------------------------
# Text Preprocessing Functions
# -------------------------------

def clean_text(text):
    """ Remove HTML tags, special characters, and digits, then convert text to lowercase. """
    text = re.sub(r'<[^>]+>', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabet characters
    return text.lower().strip()  # Convert to lowercase and strip whitespace

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    """ Preprocess text by cleaning, tokenizing, removing stopwords, and lemmatizing. """
    text = clean_text(text)
    tokens = word_tokenize(text)  # Tokenization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]  # Stopword removal & Lemmatization
    return " ".join(tokens)  # Join tokens back into a string

# -------------------------------
# Apply Preprocessing to Columns
# -------------------------------

data["Product Name Processed"] = data["Product Name"].astype(str).apply(preprocess_text)
data["Description Processed"] = data["Description"].astype(str).apply(preprocess_text)

# Save the cleaned data to a new CSV file
output_filename = "cleaned_github_marketplace_data.csv"
data.to_csv(output_filename, index=False, encoding='utf-8')

print(f"Cleaned data saved as: {output_filename}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Original Data Loaded.
Cleaned data saved as: cleaned_github_marketplace_data.csv


In [None]:
from google.colab import files  # Colab-only helper; needed before calling files.download
files.download("cleaned_github_marketplace_data.csv")


# Question 5 (20 points)

PART 1:
Collect tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.
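
When hashtags are combined with other filters in a Twitter API v2 search query, the OR group generally needs parentheses, because a space (implicit AND) binds more tightly than `OR`. A sketch of a small query builder follows; `build_query` is a hypothetical helper, and the `-is:retweet` and `lang:` operators are the ones used in this notebook.

```python
def build_query(hashtags, exclude_retweets=True, lang="en"):
    """Join hashtags with OR, wrap the group in parentheses, then append the filters."""
    tags = ["#" + tag.lstrip("#") for tag in hashtags]
    if not tags:
        raise ValueError("at least one hashtag is required")
    # Parentheses keep the trailing filters applied to every hashtag in the group.
    group = tags[0] if len(tags) == 1 else "(" + " OR ".join(tags) + ")"
    parts = [group]
    if exclude_retweets:
        parts.append("-is:retweet")
    if lang:
        parts.append(f"lang:{lang}")
    return " ".join(parts)

print(build_query(["machinelearning", "#artificialintelligence"]))
```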

PART 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [None]:
!pip install tweepy
import tweepy
import pandas as pd
import re

# PART 1: DATA EXTRACTION
# Set the Bearer Token. It is read from an environment variable so the secret is not
# stored in the notebook; the variable name TWITTER_BEARER_TOKEN is an assumption.
import os
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]

# Create a Tweepy client using the bearer token
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Define the query to search for tweets that include the hashtags for machine learning or artificial intelligence.
# Also, exclude retweets and limit results to English-language tweets.
# The parentheses make the filters apply to both hashtags, since AND binds more tightly than OR.
query = "(#machinelearning OR #artificialintelligence) -is:retweet lang:en"

# Request tweet fields and expand the author details to get the username.
response = client.search_recent_tweets(
    query=query,
    tweet_fields=["id", "text", "author_id"],
    expansions="author_id",
    user_fields=["username"],
    max_results=100  # Maximum allowed per request
)

# Create a mapping of user IDs to their corresponding usernames
tweets_data = []
if response.data and response.includes and "users" in response.includes:
    users = {u["id"]: u for u in response.includes["users"]}
    for tweet in response.data:
        user_info = users.get(tweet.author_id)
        if user_info:  # Ensure we have user details
            tweets_data.append({
                "tweet_id": tweet.id,
                "username": user_info.username,
                "text": tweet.text
            })

print(f"Extracted {len(tweets_data)} tweets.")

# PART 2: DATA CLEANING & QUALITY CHECK

# Convert the extracted data to a DataFrame
df = pd.DataFrame(tweets_data)

# Define a function to clean tweet text
def clean_text(text):
    text = re.sub(r"http\S+", "", text)       # Remove URLs
    text = re.sub(r"@\w+", "", text)          # Remove user mentions
    text = re.sub(r"#", "", text)             # Remove hashtag symbol
    text = re.sub(r"\s+", " ", text)          # Remove extra whitespace
    return text.strip()

# Apply cleaning to the tweet text and add as a new column
df["clean_text"] = df["text"].apply(clean_text)

# Final quality check: remove any rows with missing or empty values in critical columns.
df.dropna(subset=["tweet_id", "username", "clean_text"], inplace=True)
df = df[(df["tweet_id"] != "") & (df["username"] != "") & (df["clean_text"] != "")]

# Save the cleaned data to a CSV file for further analysis.
csv_filename = "cleaned_tweets.csv"
df.to_csv(csv_filename, index=False)

print(f"Cleaned data saved to {csv_filename}.")


Extracted 99 tweets.
Cleaned data saved to cleaned_tweets.csv.


In [None]:
from google.colab import files
files.download("cleaned_tweets.csv")


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

# Write your response below:
I found some questions in the assignment challenging, such as scraping products from the GitHub Marketplace. I enjoyed scraping data, though; it felt like hacking! It took me 12 hours to complete the assignment.


Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog