<a href="https://colab.research.google.com/github/Abhignya-Jagathpally/Abhignya_INFO5731_Fall2025/blob/Assignments/Jagathpally_Abhignya_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [None]:
# Your code here
!pip -q install pandas tqdm

import urllib.request
import json
import time
import pandas as pd
from tqdm.auto import tqdm

def simple_fetch(url):
    """
    Simplest possible fetch - no fancy headers, no compression handling
    """
    try:
        # Create the most basic request possible
        request = urllib.request.Request(url)

        # Just add a basic User-Agent
        request.add_header('User-Agent', 'Mozilla/5.0 (compatible; DataCollector 1.0)')

        with urllib.request.urlopen(request, timeout=30) as response:
            # Read raw bytes
            content = response.read()

            # Try to decode as UTF-8, with fallback
            try:
                text = content.decode('utf-8')
            except UnicodeDecodeError:
                # If UTF-8 fails, try latin-1 (which accepts any byte)
                text = content.decode('latin-1')

            # Parse JSON
            return json.loads(text)

    except Exception as e:
        print(f"Error: {e}")
        print(f"URL: {url}")
        return None

def collect_all_narrators():
    """
    Collect all narrator data with minimal complexity
    """
    base_url = "https://ddr.densho.org/api/0.2/narrator/"
    all_narrators = []
    offset = 0
    limit = 50  # Start with smaller batches

    print("Starting simple data collection...")

    # First, get total count
    first_url = f"{base_url}?limit=1&offset=0"
    first_data = simple_fetch(first_url)

    if not first_data:
        print(" Could not fetch initial data")
        return []

    total_records = first_data.get('total', 0)
    print(f"Total records to collect: {total_records}")

    # Setup progress bar
    pbar = tqdm(total=total_records, desc="Collecting narrators")

    while True:
        # Build URL with parameters
        url = f"{base_url}?limit={limit}&offset={offset}"

        print(f"Fetching: {url}")
        data = simple_fetch(url)

        if not data:
            print(f"Failed to fetch data at offset {offset}")
            break

        # Get the narrator objects
        narrators = data.get('objects', [])

        if not narrators:
            print("No more narrators found")
            break

        # Process each narrator
        for narrator in narrators:
            links = narrator.get('links', {})

            narrator_data = {
                'id': narrator.get('id'),
                'name': narrator.get('display_name'),
                'bio_text': narrator.get('bio'),
                'generation': narrator.get('generation'),
                'birth_location': narrator.get('birth_location'),
                'birth_date': narrator.get('b_date'),
                'death_date': narrator.get('d_date'),
                'page_url': links.get('html'),
                'json_url': links.get('json'),
                'interviews_api': links.get('interviews'),
                'image_url': links.get('img'),
                'thumb_url': links.get('thumb'),
            }

            all_narrators.append(narrator_data)

        # Update progress
        pbar.update(len(narrators))

        # Check if we're done
        next_offset = data.get('next_offset')
        if next_offset is None or next_offset <= offset:
            print("Reached end of data")
            break

        # Be polite - wait between requests
        time.sleep(1)
        offset = next_offset

    pbar.close()
    return all_narrators

# ===== RUN THE COLLECTION =====

print(" Starting simple Densho data collection...")

narrators = collect_all_narrators()

if narrators:
    print(f"\n Successfully collected {len(narrators)} narrator records!")

    # Create DataFrame
    df = pd.DataFrame(narrators)

    # Save to CSV
    filename = "/content/sample_data/densho_narrators_raw.csv"
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f" Saved to {filename}")

    # Show some stats
    print(f"\n Data Summary:")
    print(f"   Total records: {len(df)}")
    print(f"   Columns: {len(df.columns)}")
    print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

    # Show sample data
    print(f"\n Sample Data:")
    pd.set_option('display.max_colwidth', 50)
    print(df[['id', 'name', 'generation', 'birth_location']].head())

    # Check for missing values
    print(f"\n Missing Values:")
    missing = df.isnull().sum()
    for col, count in missing.items():
        if count > 0:
            pct = (count / len(df)) * 100
            print(f"   {col}: {count} ({pct:.1f}%)")

    print(f"\n Data collection complete! File saved as '{filename}'")

else:
    print(" No data was collected")
    print("\nTroubleshooting suggestions:")
    print("1. Try running this code on your local machine")
    print("2. Check your internet connection")
    print("3. The API might be temporarily unavailable")

 Starting simple Densho data collection...
Starting simple data collection...
Total records to collect: 1009


Collecting narrators:   0%|          | 0/1009 [00:00<?, ?it/s]

Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=0
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=25
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=50
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=75
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=100
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=125
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=150
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=175
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=200
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=225
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=250
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=275
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=300
Fetching: https://ddr.densho.org/api/0.2/narrator/?limit=50&offset=325
Fetching: h
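
The loop in `collect_all_narrators` stops at the first failed request. A minimal sketch of one way to make it more tolerant, assuming transient failures are worth retrying, is to wrap the fetch in a retry helper with exponential backoff. `fetch_with_retry`, `max_attempts`, and `base_delay` are hypothetical names that do not appear in the notebook; `fetch` stands for any callable that returns parsed JSON or `None`, as `simple_fetch` does.

```python
import time

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out.

    Hypothetical helper: `fetch` is any callable returning parsed data,
    or None on failure. The wait before retry n is base_delay * 2**n.
    """
    for attempt in range(max_attempts):
        data = fetch(url)
        if data is not None:
            return data
        # Wait before the next attempt, but not after the final one
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

Inside the collection loop, `data = fetch_with_retry(simple_fetch, url)` could stand in for the direct `simple_fetch(url)` call. The `sleep` parameter is injectable only so the helper can be exercised without real delays.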

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:

# Your code here
!pip -q install pandas nltk unidecode

import re
import pandas as pd
from unidecode import unidecode
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# NLTK assets (first run downloads)
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer tables that word_tokenize needs in newer NLTK releases
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

# Load the CSV from Cell 1
in_csv = "/content/sample_data/densho_narrators_raw.csv"
df = pd.read_csv(in_csv)

# We'll clean 'bio_text' (fall back to name if bio is empty)
def pick_source_text(row):
    txt = str(row.get("bio_text") or "").strip()
    return txt if txt else str(row.get("name") or "")

df["source_text"] = df.apply(pick_source_text, axis=1)

# Prepare resources
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalize_basic(text: str) -> str:
    """
    (4) Lowercase
    + ASCII fold (normalize accents)
    + (1)(2) Remove punctuation/special chars AND numbers (keep spaces)
    """
    text = str(text)
    text = text.lower()
    text = unidecode(text)
    text = re.sub(r"[^a-z\s]", " ", text)   # drop digits and punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

def remove_stop_words(tokens):
    return [t for t in tokens if t not in stop_words and len(t) > 1]

def clean_pipeline(text: str):
    # Normalize
    norm = normalize_basic(text)
    # Tokenize
    tokens = nltk.word_tokenize(norm) if norm else []
    # (3) Remove stopwords
    nostop = remove_stop_words(tokens)
    # (5) Stemming
    stemmed = [stemmer.stem(t) for t in nostop]
    # (6) Lemmatization
    lemmatized = [lemmatizer.lemmatize(t) for t in nostop]
    return {
        "clean_norm": norm,
        "clean_nostop": " ".join(nostop),
        "stemmed_text": " ".join(stemmed),
        "lemmatized_text": " ".join(lemmatized),
        # Final preferred cleaned text:
        "clean_text": " ".join(lemmatized),
    }

# Apply the cleaning pipeline.
# Note: this assignment overrides the name fallback computed above, so only
# 'bio_text' is cleaned; rows with a missing bio get empty cleaned columns.
df["source_text"] = df["bio_text"]
clean = df["source_text"].fillna("").map(clean_pipeline)
clean_df = pd.DataFrame(list(clean))

# Merge and save
out_df = pd.concat([df, clean_df], axis=1)
out_csv = "densho_narrators_clean.csv"
out_df.to_csv(out_csv, index=False, encoding="utf-8")
print(f"Saved cleaned data with new columns to {out_csv}")

# Show a preview (code + visible output)
cols = ["id","name","source_text","clean_text","stemmed_text","lemmatized_text"]
pd.set_option("display.max_colwidth", 100)
display(out_df.head(5)[cols])

Saved cleaned data with new columns to densho_narrators_clean.csv


Unnamed: 0,id,name,source_text,clean_text,stemmed_text,lemmatized_text
0,361,Kay Aiko Abe,"Nisei female. Born May 9, 1927, in Selleck, Washington. Spent much of childhood in Beaverton, Or...",nisei female born may selleck washington spent much childhood beaverton oregon father owned farm...,nisei femal born may selleck washington spent much childhood beaverton oregon father own farm in...,nisei female born may selleck washington spent much childhood beaverton oregon father owned farm...
1,291,Art Abe,"Nisei male. Born June 12, 1921, in Seattle, Washington. Grew up in an area of Seattle with few o...",nisei male born june seattle washington grew area seattle japanese american attending university...,nisei male born june seattl washington grew area seattl japanes american attend univers washingt...,nisei male born june seattle washington grew area seattle japanese american attending university...
2,293,Sharon Tanagi Aburano,"Nisei female. Born October 31, 1925, in Seattle, Washington. Family owned and operated a success...",nisei female born october seattle washington family owned operated successful grocery store prio...,nisei femal born octob seattl washington famili own oper success groceri store prior world war i...,nisei female born october seattle washington family owned operated successful grocery store prio...
3,597,Toshiko Aiboshi,"Nisei female. Born July 8, 1928, in Boyle Heights, California. At an early age, went to live wit...",nisei female born july boyle height california early age went live family friend father passed a...,nisei femal born juli boyl height california earli age went live famili friend father pass away ...,nisei female born july boyle height california early age went live family friend father passed a...
4,1014,Douglas L. Aihara,"Sansei male. Born March 15, 1950, in Torrance, California. Grew up in the Los Angeles area, wher...",sansei male born march torrance california grew los angeles area father sold insurance active lo...,sansei male born march torranc california grew lo angel area father sold insur activ lo angel ko...,sansei male born march torrance california grew los angeles area father sold insurance active lo...
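
One detail worth checking in `pick_source_text`: pandas represents a missing `bio_text` as `NaN`, and `NaN` is truthy, so `str(value or "")` yields the string `"nan"` rather than an empty string. A minimal sketch of a safer conversion, using only the standard library; `safe_text` is a hypothetical helper name, not something defined in the notebook.

```python
import math

def safe_text(value) -> str:
    """Return a stripped string, treating None and float NaN as empty."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value).strip()

# NaN is truthy, so the `or ""` idiom does not catch it
print(repr(str(float("nan") or "")))   # 'nan'
print(repr(safe_text(float("nan"))))   # ''
```

With such a helper, the fallback could be written as `safe_text(row.get("bio_text")) or safe_text(row.get("name"))`.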


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.

In [None]:
# === Setup Stanza-only stack & restart (Colab) ===
!pip -q uninstall -y numpy || true
!pip -q install "numpy==1.26.4"

# CPU wheel for broad compatibility; remove index-url if you want CUDA/GPU
!pip -q install --index-url https://download.pytorch.org/whl/cpu "torch==2.3.1"

!pip -q install "stanza==1.8.2" "pandas==2.2.2" "tqdm==4.66.5"

import stanza
# Download all needed processors in one go (includes constituency)
stanza.download('en', processors='tokenize,pos,lemma,depparse,ner,constituency', verbose=False)

# Force a clean restart so pinned wheels load properly
import os, time
print("Restarting runtime to finalize installations...")
time.sleep(1)
os.kill(os.getpid(), 9)


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
torchvision 0.23.0+cpu requires torch==2.8.0, but you have torch 2.3.1+cpu which is incompatible.
Restarting runtime to finalize installations...


In [None]:
# Your code here
# === Analysis with Stanza (POS, DEP, Constituency, NER) ===
import os
import pandas as pd
from collections import Counter
from tqdm.auto import tqdm
import stanza

# Build Stanza pipeline (CPU), now that runtime has restarted
nlp = stanza.Pipeline(
    'en',
    processors='tokenize,pos,lemma,depparse,ner,constituency',
    use_gpu=False,
    verbose=False
)

# -------- Inputs / outputs --------
IN_CSV = "densho_narrators_clean.csv"   # produced earlier
TEXT_COL = "clean_text"

OUT_DIR = "analysis_outputs_stanza"
os.makedirs(OUT_DIR, exist_ok=True)
CONSTIT_FILE = os.path.join(OUT_DIR, "constituency_trees.txt")
DEPEND_FILE  = os.path.join(OUT_DIR, "dependency_trees.txt")
POS_COUNTS_CSV = os.path.join(OUT_DIR, "pos_totals.csv")
NER_COUNTS_CSV = os.path.join(OUT_DIR, "ner_counts.csv")

PRINT_CAP_SENTENCES = None     # None prints ALL sentences (very long); set an int to cap the printout
EXAMPLE_SENTENCE_INDEX = 0   # which sentence to explain

# -------- Load data --------
df = pd.read_csv(IN_CSV)
if TEXT_COL not in df.columns:
    raise ValueError(f"Column '{TEXT_COL}' not found in {IN_CSV}. Run the cleaning step first.")
texts = [str(t).strip() for t in df[TEXT_COL].fillna("") if str(t).strip()]

# -------- Helpers --------
def pos_bucket(upos):
    if upos in ("NOUN", "PROPN"): return "NOUN"
    if upos == "VERB": return "VERB"
    if upos == "ADJ":  return "ADJ"
    if upos == "ADV":  return "ADV"
    return None

def dependency_lines(sent):
    # stanza Sentence.words: 1-based ids; head==0 means ROOT
    id2txt = {w.id: w.text for w in sent.words}
    lines = []
    for w in sent.words:
        head_text = "ROOT" if w.head == 0 else id2txt.get(w.head, "ROOT")
        lines.append(f"{w.text}\t({w.deprel})\t-->\t{head_text}")
    return lines

def constituency_string(sent):
    # sent.constituency is a stanza parse tree; str() gives its bracketed form
    tree = getattr(sent, "constituency", None)
    if tree is None:
        return "(NO_PARSE)"
    return str(tree)

# -------- Aggregations --------
pos_counter = Counter()
ner_counter = Counter()
ner_labels_of_interest = {"PERSON", "ORG", "GPE", "LOC", "PRODUCT", "DATE"}

all_const_lines, all_dep_lines = [], []
printed = 0
global_sent_idx = 0

example_sent_text = None
example_const = None
example_dep = None

print("Processing documents... (POS/DEP/Constituency/NER via Stanza)")
for text in tqdm(texts, total=len(texts)):
    if not text:
        continue
    doc = nlp(text)

    # POS & NER aggregates are per-doc
    for sent in doc.sentences:
        for w in sent.words:
            b = pos_bucket(w.upos)
            if b:
                pos_counter[b] += 1

    # NER: Stanza collects entities at the document level in doc.entities
    if getattr(doc, "entities", None):
        for ent in doc.entities:
            if ent.type in ner_labels_of_interest:
                ner_counter[ent.type] += 1

    # Trees per sentence
    for sent in doc.sentences:
        cstr = constituency_string(sent)
        dlines = dependency_lines(sent)
        all_const_lines.append(cstr)
        all_dep_lines.extend(dlines)
        all_dep_lines.append("")  # spacer

        if PRINT_CAP_SENTENCES is None or printed < PRINT_CAP_SENTENCES:
            print("\n--- Sentence ---")
            print(" ".join([w.text for w in sent.words]))
            print("\nConstituency Tree:")
            print(cstr)
            print("\nDependency Tree (token (dep) --> head):")
            for ln in dlines:
                print(ln)
            printed += 1

        if example_sent_text is None and global_sent_idx == EXAMPLE_SENTENCE_INDEX:
            example_sent_text = " ".join([w.text for w in sent.words])
            example_const = cstr
            example_dep = dlines
        global_sent_idx += 1

# -------- Save artifacts --------
with open(CONSTIT_FILE, "w", encoding="utf-8") as f:
    for line in all_const_lines:
        f.write(line + "\n")

with open(DEPEND_FILE, "w", encoding="utf-8") as f:
    for line in all_dep_lines:
        f.write(line + "\n")

# (1) POS totals
pos_totals = {
    "NOUN_total": pos_counter.get("NOUN", 0),
    "VERB_total": pos_counter.get("VERB", 0),
    "ADJ_total":  pos_counter.get("ADJ", 0),
    "ADV_total":  pos_counter.get("ADV", 0),
}
pd.DataFrame([pos_totals]).to_csv(POS_COUNTS_CSV, index=False)
print("\n=== (1) POS Totals ===")
display(pd.DataFrame([pos_totals]))

# (3) NER counts (selected)
ner_df = pd.DataFrame(
    [{"label": k, "count": v} for k, v in sorted(ner_counter.items())]
)
ner_df.to_csv(NER_COUNTS_CSV, index=False)
print("\n=== (3) NER Counts (PERSON/ORG/GPE/LOC/PRODUCT/DATE) ===")
display(ner_df)

print(f"\nSaved constituency trees to: {CONSTIT_FILE}")
print(f"Saved dependency trees to:   {DEPEND_FILE}")
print(f"Saved POS totals to:         {POS_COUNTS_CSV}")
print(f"Saved NER counts to:         {NER_COUNTS_CSV}")

# (2) Explanation
if example_sent_text:
    print("\n=== (2) Example Sentence & Explanation ===")
    print("Sentence:")
    print(example_sent_text)

    print("\nConstituency Tree (bracketed):")
    print(example_const)

    print("\nDependency Tree (token (dep) --> head):")
    for ln in example_dep:
        print(ln)

    print("\nExplanation:")
    print(
        "• Constituency parsing groups words into nested phrases (NP, VP, PP, etc.), "
        "showing the sentence as a hierarchy from S down to terminals.\n"
        "• Dependency parsing links each word to its syntactic head with labeled relations "
        "(e.g., nsubj, obj, obl). It emphasizes who governs whom rather than explicit phrase boundaries."
    )
else:
    print("\n[Note] No sentence found for the example explanation.")



Processing documents... (POS/DEP/Constituency/NER via Stanza)


  0%|          | 0/1005 [00:00<?, ?it/s]

Streaming output truncated to the last 5000 lines.
minidoka	(compound)	-->	camp
concentration	(compound)	-->	camp
camp	(compound)	-->	idaho
idaho	(parataxis)	-->	spent
resettled	(acl)	-->	idaho
attended	(amod)	-->	school
grade	(compound)	-->	school
school	(compound)	-->	school
high	(amod)	-->	school
school	(appos)	-->	idaho
chicago	(compound)	-->	illinois
illinois	(parataxis)	-->	spent
completed	(acl)	-->	school
dental	(amod)	-->	school
school	(compound)	-->	chicago
university	(compound)	-->	chicago
chicago	(nsubj)	-->	opening
eventually	(advmod)	-->	opening
opening	(advcl)	-->	completed
private	(amod)	-->	practice
dental	(amod)	-->	practice
practice	(obj)	-->	opening
chicago	(obj)	-->	opening

--- Sentence ---
nisei male born january manzanar concentration camp california world war ii left camp family los angeles area parent lived war established career engineer active little tokyo service center

Constituency Tree:
(NO_PARSE)

Dependency Tree (token (dep) --> head):
nis

Unnamed: 0,NOUN_total,VERB_total,ADJ_total,ADV_total
0,31102,8719,5175,750



=== (3) NER Counts (PERSON/ORG/GPE/LOC/PRODUCT/DATE) ===


Unnamed: 0,label,count
0,DATE,523
1,GPE,1363
2,LOC,10
3,ORG,7
4,PERSON,137



Saved constituency trees to: analysis_outputs_stanza/constituency_trees.txt
Saved dependency trees to:   analysis_outputs_stanza/dependency_trees.txt
Saved POS totals to:         analysis_outputs_stanza/pos_totals.csv
Saved NER counts to:         analysis_outputs_stanza/ner_counts.csv

=== (2) Example Sentence & Explanation ===
Sentence:
nisei female born may selleck washington spent much childhood beaverton oregon father owned farm influenced early age parent conversion christianity world war ii removed portland assembly center oregon minidoka concentration camp idaho war worked establish successful volunteer program feed homeless seattle washington

Constituency Tree (bracketed):
(NO_PARSE)

Dependency Tree (token (dep) --> head):
nisei	(amod)	-->	female
female	(nsubj)	-->	spent
born	(acl)	-->	female
may	(aux)	-->	selleck
selleck	(compound)	-->	washington
washington	(obj)	-->	born
spent	(root)	-->	ROOT
much	(amod)	-->	farm
childhood	(compound)	-->	father
beaverton	(compound)	-->	fat
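
The flat `token (dep) --> head` listing can be hard to read as a tree. A minimal sketch, assuming each word is available as an `(id, text, head, deprel)` tuple with 1-based ids and `head == 0` for the root (the convention noted in `dependency_lines`), that prints the same information as an indented tree. `render_dependency_tree` is a hypothetical helper, and the sample words are an illustrative fragment modeled on the example output, with ids renumbered.

```python
from collections import defaultdict

def render_dependency_tree(words):
    """Return indented lines, one per word, with dependents nested under heads.

    `words` is a list of (id, text, head, deprel) tuples; head == 0 marks the root.
    """
    children = defaultdict(list)
    for wid, text, head, deprel in words:
        children[head].append((wid, text, deprel))

    lines = []

    def visit(head_id, depth):
        # Visit dependents in surface order (sorted by word id)
        for wid, text, deprel in sorted(children[head_id]):
            lines.append("  " * depth + f"{text} ({deprel})")
            visit(wid, depth + 1)

    visit(0, 0)
    return lines

# Illustrative fragment, ids renumbered for the example
sample = [
    (1, "nisei", 2, "amod"),
    (2, "female", 4, "nsubj"),
    (3, "born", 2, "acl"),
    (4, "spent", 0, "root"),
]
print("\n".join(render_dependency_tree(sample)))
```

Read this way, `spent` is the root, `female` is its subject, and `nisei` and `born` both modify `female`, which is the "who governs whom" view that dependency parsing emphasizes.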

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details to extract are the product name, a short description, and the URL.

Store the extracted data in a structured CSV format with columns for product name, description, URL, and page number. Introduce a time delay between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports from the collected data.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.
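
One way to approach the usage-guidelines point is to consult a site's `robots.txt` before requesting a path. A minimal sketch using the standard library's `urllib.robotparser`; the rules below are a made-up example for illustration, not GitHub's actual `robots.txt`, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration only (not GitHub's real robots.txt)
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(EXAMPLE_RULES, "DataCollector", "https://example.com/marketplace"))
print(is_allowed(EXAMPLE_RULES, "DataCollector", "https://example.com/private/page"))
```

Against a live site, the parser can load the file itself with `set_url(...)` followed by `read()` instead of `parse(...)`. A `robots.txt` check complements, and does not replace, reading the site's terms of service.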

(PART-2)

1. **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.
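
The preprocessing steps described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's Part 2 implementation: `preprocess` and the tiny `STOPWORDS` set are hypothetical, and lemmatization is left out because it needs an external resource such as NLTK's WordNet.

```python
import html
import re

# Tiny illustrative stopword set; a real run would use a fuller list (e.g. NLTK's)
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "on", "with", "your"}

def preprocess(text: str) -> list:
    """Strip HTML tags and entities, lowercase, tokenize, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = html.unescape(text)              # decode entities such as &amp;
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)    # keep alphabetic runs only
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

print(preprocess("<b>Build &amp; test</b> your Node.js app on 3 platforms!"))
# ['build', 'test', 'node', 'js', 'app', 'platforms']
```

Keeping only alphabetic runs also drops digits and punctuation, which splits names like `Node.js` into two tokens; whether that is acceptable depends on the analysis that follows.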

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
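
A minimal sketch of the completeness and duplicate checks described above, written against plain dictionaries so it runs without pandas. `quality_report` is a hypothetical helper; the column names follow the CSV layout from Part 1 (`product_name`, `description`, `url`).

```python
def quality_report(rows, required=("product_name", "description", "url")):
    """Summarize missing required fields and duplicate URLs in a list of dict rows."""
    missing = {col: 0 for col in required}
    seen_urls = set()
    duplicates = 0
    for row in rows:
        # Completeness: count empty or whitespace-only required fields
        for col in required:
            if not str(row.get(col) or "").strip():
                missing[col] += 1
        # Uniqueness: count repeated non-empty URLs
        url = str(row.get("url") or "").strip()
        if url:
            if url in seen_urls:
                duplicates += 1
            seen_urls.add(url)
    return {"rows": len(rows), "missing": missing, "duplicate_urls": duplicates}

rows = [
    {"product_name": "A", "description": "d", "url": "https://x/a"},
    {"product_name": "", "description": "d", "url": "https://x/a"},
    {"product_name": "C", "description": None, "url": ""},
]
print(quality_report(rows))
```

Rows read with `csv.DictReader` from the Part 1 output file could be passed in directly as `list(reader)`.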


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [None]:
# PART-1 (pagination fixed): Scrape GitHub Marketplace Actions → CSV
#   pip install requests beautifulsoup4 pandas

import csv
import time
import random
from datetime import datetime, timezone
from typing import List, Dict, Tuple
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup, FeatureNotFound
from requests.adapters import HTTPAdapter, Retry
import pandas as pd

BASE = "https://github.com"
# Use empty query to get the paginated listing (not just featured)
BASE_LIST = "https://github.com/marketplace?type=actions&query="
OUTPUT_CSV = "github_actions_marketplace_raw.csv"

# Controls
MAX_PRODUCTS = 1000                # target cap
TIME_LIMIT_SECONDS = 10 * 60       # stop after N seconds
REQUEST_DELAY_RANGE = (0.6, 1.2)   # polite jitter
TIMEOUT = 25
MAX_PAGES = 500                    # hard stop in case of long pagination
EMPTY_PAGE_TOLERANCE = 2           # stop after N consecutive empty pages

def make_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({
        "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": BASE_LIST,
        "Connection": "keep-alive",
    })
    retries = Retry(
        total=5,
        connect=5,
        read=5,
        backoff_factor=0.7,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=frozenset(["GET"]),
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retries, pool_connections=20, pool_maxsize=40)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    return s

def get_soup(session, url, timeout=TIMEOUT):
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    html = resp.text
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")

def strict_action_cards(soup: BeautifulSoup) -> List[Dict]:
    """
    Prefer strong matches: anchor href starts with /marketplace/actions/
    Pull name from heading/anchor, description from nearby muted/paragraph text.
    """
    results = []
    seen = set()
    containers = soup.select("article, li, div")
    for node in containers:
        a = node.select_one('a[href^="/marketplace/actions/"]')
        if not a:
            continue
        href = a.get("href", "").strip()
        if not href or href in seen:
            continue
        seen.add(href)

        # Product name
        name = ""
        for tag in ("h3", "h2", "h4"):
            h = node.find(tag)
            if h:
                name = h.get_text(" ", strip=True)
                break
        if not name:
            name = a.get_text(" ", strip=True)

        # Description
        desc = ""
        cand = node.select_one("p, div.color-fg-muted, div.text-small, div.f6, span.color-fg-muted")
        if cand:
            desc = cand.get_text(" ", strip=True)
        if not desc:
            p = node.find("p")
            if p:
                desc = p.get_text(" ", strip=True)

        results.append({
            "name": name,
            "description": desc,
            "url": urljoin(BASE, href),
        })
    return results

def fallback_cards(soup: BeautifulSoup) -> List[Dict]:
    """
    Looser heuristic: any marketplace link containing '/marketplace/actions/'
    Useful in case markup differs across pages.
    """
    results = []
    seen = set()
    for a in soup.select('a[href*="/marketplace/actions/"]'):
        href = a.get("href", "").strip()
        if not href or href in seen:
            continue
        seen.add(href)
        name = a.get_text(" ", strip=True)
        if not name or len(name) < 2:
            continue
        # try close-by description
        desc = ""
        parent = a
        for _ in range(4):
            parent = parent.parent
            if not parent:
                break
            if parent.name in ("article", "li", "div", "section"):
                cand = parent.select_one("p, div.color-fg-muted, div.text-small, div.f6, span.color-fg-muted")
                if cand:
                    desc = cand.get_text(" ", strip=True)
                    break
        results.append({
            "name": name,
            "description": desc,
            "url": urljoin(BASE, href),
        })
    return results

def build_page_url(page_num: int) -> str:
    # Explicit page parameter
    if page_num <= 1:
        return BASE_LIST
    return f"{BASE_LIST}&page={page_num}"

def scrape_actions(max_products=MAX_PRODUCTS, time_limit_s=TIME_LIMIT_SECONDS) -> Tuple[pd.DataFrame, dict]:
    session = make_session()
    t0 = time.perf_counter()

    rows = []
    seen_urls = set()
    errors = 0
    empty_streak = 0

    page_counter = 0
    for page in range(1, MAX_PAGES + 1):
        if time.perf_counter() - t0 > time_limit_s:
            print(f"[INFO] Time limit reached. Stopping.")
            break
        if len(rows) >= max_products:
            break

        url = build_page_url(page)
        try:
            soup = get_soup(session, url)
        except Exception as e:
            errors += 1
            print(f"[WARN] Failed {url}: {e}")
            # small backoff and continue; if many fail in a row, time limit will bail out
            time.sleep(random.uniform(*REQUEST_DELAY_RANGE))
            continue

        page_counter += 1

        # Try strict first; fall back if nothing found
        cards = strict_action_cards(soup)
        if not cards:
            cards = fallback_cards(soup)

        added = 0
        for c in cards:
            if c["url"] in seen_urls:
                continue
            seen_urls.add(c["url"])
            rows.append({
                "product_name": c["name"],
                "description": c["description"],
                "url": c["url"],
                "page_number": page,
            })
            added += 1
            if len(rows) >= max_products:
                break

        print(f"[INFO] Page {page}: found {len(cards)} action cards, added {added}, total {len(rows)}")

        # empty-page logic
        if added == 0:
            empty_streak += 1
            if empty_streak >= EMPTY_PAGE_TOLERANCE:
                print(f"[INFO] {EMPTY_PAGE_TOLERANCE} consecutive empty pages. Stopping.")
                break
        else:
            empty_streak = 0

        # polite pause
        time.sleep(random.uniform(*REQUEST_DELAY_RANGE))

    df = pd.DataFrame(rows, columns=["product_name", "description", "url", "page_number"])
    df.to_csv(OUTPUT_CSV, index=False, encoding="utf-8")

    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pages_visited": page_counter,
        "total_collected": int(len(df)),
        "unique_urls": int(df["url"].nunique()) if not df.empty else 0,
        "errors": int(errors),
        "elapsed_seconds": round(time.perf_counter() - t0, 2),
        "output_csv": OUTPUT_CSV,
    }

    print("\n=== SCRAPE REPORT ===")
    for k, v in report.items():
        print(f"{k}: {v}")

    if not df.empty:
        print("\nSAMPLE ROWS:")
        print(df.head(5).to_string(index=False))

    return df, report

if __name__ == "__main__":
    scrape_actions()


[INFO] Page 1: found 20 action cards, added 20, total 20
[INFO] Page 2: found 20 action cards, added 20, total 40
[INFO] Page 3: found 20 action cards, added 20, total 60
[INFO] Page 4: found 0 action cards, added 0, total 60
[INFO] Page 5: found 20 action cards, added 20, total 80
[INFO] Page 6: found 20 action cards, added 20, total 100
[INFO] Page 7: found 0 action cards, added 0, total 100
[INFO] Page 8: found 20 action cards, added 20, total 120
[INFO] Page 9: found 20 action cards, added 20, total 140
[INFO] Page 10: found 20 action cards, added 20, total 160
[INFO] Page 11: found 0 action cards, added 0, total 160
[INFO] Page 12: found 20 action cards, added 20, total 180
[INFO] Page 13: found 20 action cards, added 20, total 200
[INFO] Page 14: found 0 action cards, added 0, total 200
[INFO] Page 15: found 20 action cards, added 20, total 220
[INFO] Page 16: found 20 action cards, added 20, total 240
[INFO] Page 17: found 20 action cards, added 20, total 260
[INFO] Page 18: fou
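
The log above shows pages 4, 7, 11, and 14 returning zero cards between otherwise full pages. One possible explanation (an assumption, not something the log confirms) is a transient response rather than a genuinely empty page. A minimal sketch of retrying an empty page with exponential backoff follows; `fetch_cards_with_retry` and its `fetch_cards` callback are hypothetical names and are not part of the scraper above.

```python
import time
from typing import Callable, List


def fetch_cards_with_retry(
    fetch_cards: Callable[[int], List[dict]],
    page: int,
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call fetch_cards(page) until it returns a non-empty list or attempts run out."""
    cards: List[dict] = []
    for attempt in range(max_attempts):
        cards = fetch_cards(page)
        if cards:
            break
        if attempt < max_attempts - 1:
            # Exponential backoff between attempts: base, 2*base, 4*base, ...
            sleep(base_delay_s * (2 ** attempt))
    return cards


if __name__ == "__main__":
    # Hypothetical fetcher that is empty on the first call and succeeds on the second.
    state = {"calls": 0}

    def flaky_fetch(page: int) -> List[dict]:
        state["calls"] += 1
        return [] if state["calls"] < 2 else [{"url": f"/marketplace/actions/demo-{page}"}]

    print(fetch_cards_with_retry(flaky_fetch, page=4, sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. Retries add requests, so the attempt limit should stay small to remain polite to the site.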

In [None]:
# PART-2: Preprocess + Data Quality for GitHub Marketplace Actions CSV (model-free tokenizer)
#   pip install pandas nltk beautifulsoup4

import re
import json
from typing import Dict, List
import pandas as pd
from bs4 import BeautifulSoup, FeatureNotFound, MarkupResemblesLocatorWarning
import warnings
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import wordpunct_tokenize  # <-- doesn't require punkt / punkt_tab

# ---------- Setup NLTK resources ----------
try:
    _ = stopwords.words("english")
except LookupError:
    nltk.download("stopwords")
try:
    nltk.data.find("corpora/wordnet")
except LookupError:
    nltk.download("wordnet")

RAW_CSV   = "github_actions_marketplace_raw.csv"
CLEAN_CSV = "github_actions_marketplace_clean.csv"
REPORT_JSON = "github_actions_marketplace_quality_report.json"

REQUIRED_COLS = ["product_name", "description", "url", "page_number"]
STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def strip_html(text: str) -> str:
    if not isinstance(text, str):
        return ""
    t = text.strip()
    # If it doesn’t look like HTML, don’t parse it at all
    if "<" not in t and ">" not in t:
        return t
    try:
        return BeautifulSoup(t, "lxml").get_text(" ", strip=True)
    except FeatureNotFound:
        return BeautifulSoup(t, "html.parser").get_text(" ", strip=True)

def normalize_spaces(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

def normalize_url(u: str) -> str:
    if not isinstance(u, str):
        return ""
    u = u.strip()
    if not u:
        return ""
    if u.startswith(("http://","https://")):
        return u
    return f"https://github.com{u if u.startswith('/') else '/' + u}"

def tokenize_lower(text: str) -> List[str]:
    """Lowercase + remove punctuation to spaces + split with wordpunct tokenizer (no models)."""
    text = (text or "").lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep letters/digits/spaces
    text = normalize_spaces(text)
    return wordpunct_tokenize(text) if text else []

def remove_stop_and_short(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

def lemmatize_tokens(tokens: List[str]) -> List[str]:
    return [LEMMATIZER.lemmatize(t) for t in tokens]

def preprocess_text(name: str, desc: str) -> Dict[str, str]:
    base = f"{strip_html(name or '')} {strip_html(desc or '')}".strip()
    tokens = tokenize_lower(base)
    tokens_ns = remove_stop_and_short(tokens)
    lemmas = lemmatize_tokens(tokens_ns)
    return {
        "clean_norm": " ".join(tokens),
        "tokens_no_stop": " ".join(tokens_ns),
        "lemmatized_text": " ".join(lemmas),
        "clean_text": " ".join(lemmas),
    }

def quality_checks(df: pd.DataFrame) -> Dict[str, int]:
    metrics = {}
    for col in REQUIRED_COLS:
        metrics[f"missing_{col}"] = int(df[col].isna().sum() + (df[col].astype(str).str.len() == 0).sum())
    metrics["duplicate_url_rows"] = int(df.duplicated(subset=["url"]).sum())
    # Negate the boolean mask first, then sum, to count URLs that lack an http(s) scheme
    invalid_url_mask = ~df["url"].astype(str).str.startswith(("http://", "https://"))
    metrics["invalid_url_rows"] = int(invalid_url_mask.sum())
    metrics["non_numeric_page_rows"] = int(pd.to_numeric(df["page_number"], errors="coerce").isna().sum())
    return metrics


def run_pipeline():
    df = pd.read_csv(RAW_CSV)

    # Validate the schema before touching any column
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns in {RAW_CSV}: {missing}")

    # Quick sanity check on the raw data (clean_text does not exist yet)
    print("Rows:", len(df))
    print("Unique URLs:", df["url"].nunique())
    print("Invalid URLs:", (~df["url"].astype(str).str.startswith(("http://","https://"))).sum())
    print("Missing names:", (df["product_name"].fillna("").astype(str).str.strip().str.len()==0).sum())
    print("\nTop pages by count:\n", df["page_number"].value_counts().head(10))

    # ----- Quality & preprocessing pipeline -----
    # fillna("") first: astype(str) alone would turn NaN into the literal string "nan"
    for c in ["product_name","description","url"]:
        df[c] = df[c].fillna("").astype(str).str.strip()
    df["url"] = df["url"].map(normalize_url)

    df["page_number"] = pd.to_numeric(df["page_number"], errors="coerce")

    initial_quality = quality_checks(df)

    req_mask = (df["product_name"].astype(str).str.len() > 0) & (df["url"].astype(str).str.len() > 0)
    dropped_missing_required = int((~req_mask).sum())
    df = df.loc[req_mask].copy()

    df["description"] = df["description"].fillna("")
    na_pages = int(df["page_number"].isna().sum())
    df["page_number"] = df["page_number"].fillna(-1).astype(int)

    before = len(df)
    df = df.drop_duplicates(subset=["url"], keep="first").reset_index(drop=True)
    dups_removed = before - len(df)

    # Create cleaned text columns
    pre = df.apply(lambda r: preprocess_text(r["product_name"], r["description"]), axis=1, result_type="expand")
    df_clean = pd.concat([df, pre], axis=1)

    # Save
    df_clean.to_csv(CLEAN_CSV, index=False, encoding="utf-8")

    report = {
        "source_csv": RAW_CSV,
        "clean_csv": CLEAN_CSV,
        "total_rows_source": int(len(pd.read_csv(RAW_CSV))),
        "initial_quality": initial_quality,
        "dropped_rows_missing_required_fields": dropped_missing_required,
        "page_number_na_fixed": na_pages,
        "duplicates_removed_by_url": int(dups_removed),
        "final_row_count": int(len(df_clean)),
        "required_columns_present": all(c in df_clean.columns for c in REQUIRED_COLS),
        "notes": (
            "Preprocessing: strip HTML (with lxml fallback), lowercase, remove punctuation & extra spaces, "
            "tokenize (wordpunct_tokenize), remove stopwords, lemmatize. 'clean_text' is canonical."
        ),
    }

    with open(REPORT_JSON, "w", encoding="utf-8") as f:
        json.dump(report, f, ensure_ascii=False, indent=2)

    print("\n=== DATA QUALITY REPORT ===")
    for k, v in report.items():
        print(f"{k}: {v}")

    # 'clean_text' exists now that preprocessing has run
    print("\nSAMPLE (product_name, url, clean_text):")
    print(df_clean[["product_name", "url", "clean_text"]].head(5).to_string(index=False))


if __name__ == "__main__":
    run_pipeline()


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Rows: 1000
Unique URLs: 1000
Invalid URLs: 0
Missing names: 0

Top pages by count:
 page_number
1     20
2     20
3     20
5     20
6     20
8     20
9     20
10    20
12    20
13    20
Name: count, dtype: int64

=== DATA QUALITY REPORT ===
source_csv: github_actions_marketplace_raw.csv
clean_csv: github_actions_marketplace_clean.csv
total_rows_source: 1000
initial_quality: {'missing_product_name': 0, 'missing_description': 0, 'missing_url': 0, 'missing_page_number': 0, 'duplicate_url_rows': 0, 'invalid_url_rows': 0, 'non_numeric_page_rows': 0}
dropped_rows_missing_required_fields: 0
page_number_na_fixed: 0
duplicates_removed_by_url: 0
final_row_count: 1000
required_columns_present: True
notes: Preprocessing: strip HTML (with lxml fallback), lowercase, remove punctuation & extra spaces, tokenize (wordpunct_tokenize), remove stopwords, lemmatize. 'clean_text' is canonical.

SAMPLE (product_name, url, clean_text):
                product_name                                              
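
The `normalize_url` helper in the cell above builds absolute links by string concatenation. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves relative paths against a base. `resolve_url` and `BASE_URL` are hypothetical names, and the example paths are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"  # assumed base, matching the prefix used by normalize_url


def resolve_url(href: str, base: str = BASE_URL) -> str:
    """Return an absolute URL; empty or non-string input yields an empty string."""
    if not isinstance(href, str) or not href.strip():
        return ""
    # The trailing slash makes the base a directory, so relative paths attach to the root.
    return urljoin(base + "/", href.strip())


if __name__ == "__main__":
    for href in ["/marketplace/actions/example", "marketplace/actions/example",
                 "https://example.com/already-absolute"]:
        print(resolve_url(href))
```

One behavioural difference worth noting: `urljoin` treats a protocol-relative link such as `//host/path` as pointing to another host, whereas plain concatenation would prepend the GitHub base to it.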

In [None]:
###################
# Token + bigram frequency from the clean CSV
import itertools, collections, pandas as pd

df = pd.read_csv("github_actions_marketplace_clean.csv")
tokens = (df["clean_text"].fillna("").str.split().tolist())
flat = list(itertools.chain.from_iterable(tokens))

# top unigrams
uni = collections.Counter([t for t in flat if len(t) > 2]).most_common(25)

# top bigrams
bigrams = []
for toks in tokens:
    bigrams.extend(zip(toks, toks[1:]))
bi = collections.Counter([" ".join(bg) for bg in bigrams if all(len(w)>2 for w in bg)]).most_common(25)

print("Top 25 tokens:")
for w,c in uni: print(f"{w:20s} {c}")
print("\nTop 25 bigrams:")
for w,c in bi: print(f"{w:25s} {c}")

###################
# Adds 'stemmed_text' to your clean CSV
import pandas as pd
import nltk
from nltk.stem import PorterStemmer

df = pd.read_csv("github_actions_marketplace_clean.csv")
stemmer = PorterStemmer()

def stem_line(s: str) -> str:
    toks = str(s or "").split()
    return " ".join(stemmer.stem(t) for t in toks)

df["stemmed_text"] = df["clean_text"].fillna("").map(stem_line)
df.to_csv("github_actions_marketplace_clean.csv", index=False, encoding="utf-8")
print("Added 'stemmed_text' to github_actions_marketplace_clean.csv")
print(df[["product_name","clean_text","stemmed_text"]].head(3).to_string(index=False))


###################
# Simple asserts to catch regressions when you re-run scraping
import pandas as pd
df = pd.read_csv("github_actions_marketplace_clean.csv")

assert set(["product_name","description","url","page_number","clean_text"]).issubset(df.columns), "Missing expected columns"
assert len(df) >= 500, "Too few rows collected"
assert df["url"].nunique() == len(df), "Duplicate URLs present"
assert (~df["url"].astype(str).str.startswith(("http://","https://"))).sum() == 0, "Invalid URLs found"
assert df["product_name"].astype(str).str.len().gt(0).all(), "Empty product names found"

print("All assertions passed")



Top 25 tokens:
action               550
github               428
run                  142
request              112
pull                 106
build                101
setup                99
file                 98
code                 95
deploy               86
release              83
workflow             73
using                65
check                61
docker               56
install              56
issue                55
version              54
every                53
read                 51
repository           51
input                50
feedback             47
take                 47
piece                46

Top 25 bigrams:
github action             262
pull request              104
action github             47
read every                46
every piece               46
piece feedback            46
feedback take             46
take input                46
input seriously           46
action run                30
github page               23
action workflow           19
github relea
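
In the bigram table above, six bigrams that chain into "read every piece feedback take input seriously" each occur 46 times. That pattern suggests a repeated site phrase was captured alongside the action descriptions (an inference from the counts, not something verified against the raw pages). A minimal sketch for removing a known token sequence before counting follows; `drop_phrase` and `BOILERPLATE` are hypothetical names.

```python
from typing import List, Sequence


def drop_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> List[str]:
    """Remove every contiguous occurrence of `phrase` from `tokens`."""
    if not phrase:
        return list(tokens)
    out: List[str] = []
    i, n = 0, len(phrase)
    while i < len(tokens):
        if list(tokens[i:i + n]) == list(phrase):
            i += n  # skip the whole phrase
        else:
            out.append(tokens[i])
            i += 1
    return out


# Phrase reconstructed from the chained bigrams in the output above (an assumption).
BOILERPLATE = "read every piece feedback take input seriously".split()

if __name__ == "__main__":
    sample = "deploy site github page read every piece feedback take input seriously".split()
    print(" ".join(drop_phrase(sample, BOILERPLATE)))  # deploy site github page
```

Applying this to each row's `clean_text` tokens before building the counters would keep the phrase from crowding out genuine terms in the top-25 lists.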

# Question 5 (20 points)

PART 1:
Scrape tweets from Twitter using the Tweepy API, targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

PART 2:
Perform data cleaning procedures on the collected tweets.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved to a CSV file for further analysis.


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.
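
Once an API key has been obtained, it is generally safer to keep it out of the notebook itself. The sketch below reads a bearer token from an environment variable and fails early if it is absent; the variable name `X_BEARER_TOKEN` and the helper `load_bearer_token` are assumptions for illustration, not names required by Tweepy or the assignment.

```python
import os


def load_bearer_token(env_var: str = "X_BEARER_TOKEN") -> str:
    """Read the API bearer token from the environment and fail early if it is absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return token


if __name__ == "__main__":
    os.environ.setdefault("X_BEARER_TOKEN", "placeholder-token")  # demo value only
    print(len(load_bearer_token()) > 0)
```

In Colab, the variable could be set in a separate cell that is cleared before sharing, so the token never appears in the submitted notebook.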


In [None]:
!pip install tweepy

In [None]:


import os, time, requests, datetime as dt, tweepy, pandas as pd

# Read the token from the environment instead of hard-coding a secret in the notebook.
# X_BEARER_TOKEN is an assumed variable name; set it before running this cell.
BEARER_TOKEN = os.environ.get("X_BEARER_TOKEN", "")
QUERY = "(#machinelearning OR #MachineLearning OR #AI OR #ArtificialIntelligence) -is:retweet lang:en"
MAX_TWEETS = 200       # keep low on tight quotas
REQUESTS_DELAY = 1.2
OUT_CSV = "tweets_raw.csv"

def preflight_or_wait(token):
    r = requests.get(
        "https://api.x.com/2/tweets/search/recent",
        params={"query": "AI -is:retweet", "max_results": 10},
        headers={"Authorization": f"Bearer {token}"},
        timeout=20,
    )
    if r.status_code == 429:
        reset = int(r.headers.get("x-rate-limit-reset", "0"))
        wait = max(reset - int(time.time()), 0)
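        # Timezone-aware conversion; utcfromtimestamp is deprecated in recent Python versions
        ts = dt.datetime.fromtimestamp(reset, tz=dt.timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")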
        print(f"[Preflight] 429. Waiting {wait}s until {ts} …")
        time.sleep(wait + 2)
    elif r.status_code != 200:
        print("Probe failed:", r.status_code, r.text[:400])
        raise SystemExit("Fix token/plan before continuing.")

preflight_or_wait(BEARER_TOKEN)

client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=False)

tweets, next_token, total = [], None, 0
while total < MAX_TWEETS:
    try:
        resp = client.search_recent_tweets(
            query=QUERY,
            expansions=["author_id"],
            tweet_fields=["id","text","created_at","lang","author_id"],
            user_fields=["username","name"],
            max_results=100,
            next_token=next_token
        )
    except tweepy.TooManyRequests:
        # Stop instead of long sleeps inside notebooks
        print("Hit rate limit during pagination. Stopping.")
        break

    if not resp.data:
        break

    users = {u.id: u.username for u in (resp.includes.get("users", []) if resp.includes else [])}
    for t in resp.data:
        tweets.append({"tweet_id": t.id, "username": users.get(t.author_id, ""), "text": t.text})
        total += 1
        if total >= MAX_TWEETS:
            break

    next_token = resp.meta.get("next_token")
    if not next_token:
        break
    time.sleep(REQUESTS_DELAY)

pd.DataFrame(tweets, columns=["tweet_id","username","text"]).to_csv(OUT_CSV, index=False)
print(f"Saved {len(tweets)} tweets → {OUT_CSV}")




[Preflight] 429. Waiting 881s until 2025-09-30 02:14:57 UTC …
Hit rate limit during pagination. Stopping.
Saved 100 tweets → tweets_raw.csv
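
The preflight message above derives its wait from the `x-rate-limit-reset` response header, which the cell treats as a Unix timestamp. A small sketch of that arithmetic, using a timezone-aware conversion, follows. `rate_limit_wait` is a hypothetical helper; the example values mirror the 881-second wait shown in the output.

```python
import datetime as dt
import time
from typing import Mapping, Optional, Tuple


def rate_limit_wait(headers: Mapping[str, str], now: Optional[float] = None) -> Tuple[int, str]:
    """Return (seconds to wait, reset time as a UTC string) from an x-rate-limit-reset header."""
    reset = int(headers.get("x-rate-limit-reset", "0"))
    current = int(time.time() if now is None else now)
    wait = max(reset - current, 0)  # never negative, even if the reset time has passed
    # fromtimestamp with an explicit tz avoids the deprecated utcfromtimestamp
    stamp = dt.datetime.fromtimestamp(reset, tz=dt.timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return wait, stamp


if __name__ == "__main__":
    reset_epoch = int(dt.datetime(2025, 9, 30, 2, 14, 57, tzinfo=dt.timezone.utc).timestamp())
    headers = {"x-rate-limit-reset": str(reset_epoch)}
    print(rate_limit_wait(headers, now=reset_epoch - 881))
    # (881, '2025-09-30 02:14:57 UTC')
```

Accepting `now` as a parameter makes the calculation reproducible in tests; in the notebook it would default to the current time.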


In [None]:
# PART 2: Clean tweet text, run data-quality checks, save clean CSV
# pip install pandas nltk beautifulsoup4

import re, json, warnings
import pandas as pd
from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

# one-time downloads (safe to re-run)
try: _ = stopwords.words("english")
except LookupError: nltk.download("stopwords")

warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

IN_CSV   = "tweets_raw.csv"
OUT_CSV  = "tweets_clean.csv"
REPORT   = "tweets_quality_report.json"

STOP = set(stopwords.words("english"))

def strip_html(text: str) -> str:
    t = str(text or "").strip()
    if "<" not in t and ">" not in t:  # skip BS if not HTML-like
        return t
    return BeautifulSoup(t, "html.parser").get_text(" ", strip=True)

def clean_text(s: str) -> str:
    s = strip_html(s)
    s = s.lower()
    # remove URLs
    s = re.sub(r"https?://\S+|www\.\S+", " ", s)
    # remove mentions and hashtag symbols (keep the term after #)
    s = re.sub(r"@\w+", " ", s)
    s = re.sub(r"#", " ", s)
    # remove non-alphanumeric except spaces
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    # collapse whitespace
    s = re.sub(r"\s+", " ", s).strip()
    return s

def tokenize_no_stop(s: str):
    toks = wordpunct_tokenize(s) if s else []
    toks = [t for t in toks if t not in STOP and len(t) > 1]
    return " ".join(toks)

# ----- LOAD -----
df = pd.read_csv(IN_CSV)

# ----- BASIC QUALITY & CONSISTENCY -----
required = ["tweet_id","username","text"]
missing_cols = [c for c in required if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing columns: {missing_cols}")

# drop rows with empty critical fields
# fillna("") first: astype(str) alone would turn NaN into the literal string "nan"
df["tweet_id"] = pd.to_numeric(df["tweet_id"], errors="coerce")
df["username"] = df["username"].fillna("").astype(str).str.strip()
df["text"] = df["text"].fillna("").astype(str).str.strip()
before = len(df)
df = df.dropna(subset=["tweet_id"])
df = df[(df["username"] != "") & (df["text"] != "")]
dropped_missing = before - len(df)

# remove duplicates by tweet_id
before = len(df)
df = df.drop_duplicates(subset=["tweet_id"]).reset_index(drop=True)
dups_removed = before - len(df)

# ----- CLEANING -----
df["text_clean"] = df["text"].map(clean_text)
df["text_tokens"] = df["text_clean"].map(tokenize_no_stop)

# Final sanity: no empty text after cleaning
before = len(df)
df = df[df["text_clean"].str.len() > 0].reset_index(drop=True)
empties_removed = before - len(df)

# ----- SAVE CLEAN -----
df.to_csv(OUT_CSV, index=False, encoding="utf-8")

# ----- REPORT -----
report = {
    "rows_input": int(pd.read_csv(IN_CSV).shape[0]),
    # rows remaining after the missing-field drop, before de-duplication and empty-text removal
    "rows_after_drop_missing": int(len(df) + dups_removed + empties_removed),
    "dropped_missing_required_fields": int(dropped_missing),
    "duplicates_removed_by_tweet_id": int(dups_removed),
    "rows_removed_empty_after_clean": int(empties_removed),
    "final_rows": int(len(df)),
    "columns": list(df.columns),
    "notes": "Cleaned with HTML strip, lowercase, URL/mention/hashtag removal, punctuation trim, whitespace collapse; tokenized and stopwords removed."
}
with open(REPORT, "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2)

print(f"Saved clean data → {OUT_CSV}")
print("Sample:")
print(df.head(5).to_string(index=False))
print("\nReport:", json.dumps(report, indent=2))


Saved clean data → tweets_clean.csv
Sample:
           tweet_id        username                                                                                                                                                                                                                                                                                                             text                                                                                                                                                                                                                                     text_clean                                                                                                                                                                                                  text_tokens
1972847537167204576    PremiumKwame                                                                                                                                        
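
To make the cleaning steps in the cell above concrete, here is a standalone sketch that applies the same sequence of regular expressions to a single made-up tweet. The HTML-stripping step is omitted, and `clean_tweet`, the sample text, and the handle in it are hypothetical.

```python
import re


def clean_tweet(text: str) -> str:
    """Lowercase, then drop URLs, @mentions, '#' symbols, and punctuation."""
    s = text.lower()
    s = re.sub(r"https?://\S+|www\.\S+", " ", s)  # URLs
    s = re.sub(r"@\w+", " ", s)                    # mentions
    s = s.replace("#", " ")                        # drop the symbol, keep the hashtag word
    s = re.sub(r"[^a-z0-9\s]", " ", s)             # any remaining non-alphanumeric character
    return re.sub(r"\s+", " ", s).strip()          # collapse whitespace


if __name__ == "__main__":
    sample = "Loving #MachineLearning! Thanks @example_user https://t.co/abc123 🚀"
    print(clean_tweet(sample))  # loving machinelearning thanks
```

Because the character class keeps only `a-z` and `0-9`, emoji and accented letters are removed as well (for example, "café" becomes "caf"). That is acceptable for English-only hashtag analysis but would need a different pattern for multilingual text.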

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

Overall, the assignment was well-scoped, and I enjoyed the hands-on mix of API retrieval, text cleaning, and syntactic/semantic NLP. The main challenges were practical: handling edge cases in cleaning, managing dependencies and models for parsing, and paginating marketplace pages robustly. The only significant pain point was Question 5 (Twitter/X), where API access limits, token setup, and rate-limit workarounds required noticeably more time and resources than the course forum guidance implied. Given that, the provided time felt appropriate for Q1–Q4 but tight once Q5's overhead was included; extending the window or allowing a pre-provided dataset for Q5 would better match the expected effort.

# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog