<a href="https://colab.research.google.com/github/AnjaliAleti/Aleti_INFO5731_Fall2024/blob/main/Aleti_Anjali_Assignment_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 1**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100


**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2024 or 2025 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [None]:

!pip -q install pandas tqdm requests

import time
import requests
import pandas as pd
from tqdm.auto import tqdm


API_KEY = ""

QUERY = "machine learning"
TARGET_N = 10000
SLEEP_SECONDS = 1.1
OUT_CSV = f"semantic_scholar_{QUERY.replace(' ','_')}_{TARGET_N}.csv"

BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"
FIELDS = "paperId,title,abstract,year,venue,url,authors"


headers = {}
if isinstance(API_KEY, str) and API_KEY.strip():
    headers["x-api-key"] = API_KEY.strip()

def fetch_with_retries(url, params, headers, max_retries=8):
    """Robust GET with retries/backoff for rate limits (429) and server errors."""
    backoff = 1.0
    for attempt in range(max_retries):
        try:
            r = requests.get(url, params=params, headers=headers, timeout=60)

            if r.status_code == 200:
                return r.json()

            elif r.status_code == 429:

                print(f"429 Rate limit hit. Sleeping {backoff:.1f}s...")
                time.sleep(backoff)
                backoff = min(backoff * 2, 60)

            elif 500 <= r.status_code < 600:

                print(f"{r.status_code} Server error. Sleeping {backoff:.1f}s...")
                time.sleep(backoff)
                backoff = min(backoff * 2, 60)

            else:

                raise RuntimeError(f"HTTP {r.status_code}: {r.text[:300]}")

        except requests.RequestException as e:
            print("Request exception:", e, f"| Sleeping {backoff:.1f}s...")
            time.sleep(backoff)
            backoff = min(backoff * 2, 60)

    raise RuntimeError("Max retries exceeded (API not responding or rate limited too long).")


all_rows = []
seen_ids = set()

params = {
    "query": QUERY,
    "fields": FIELDS
}

print(f"Starting bulk search for query: {QUERY}")
resp = fetch_with_retries(BASE_URL, params=params, headers=headers)

estimated_total = resp.get("total", None)
if estimated_total is not None:
    print(f"Estimated matches: {estimated_total}")

pbar = tqdm(total=TARGET_N, desc="Collected abstracts", unit="paper")

while True:
    data = resp.get("data", [])
    if not data:
        print("No more data returned. Stopping.")
        break

    for paper in data:
        if len(all_rows) >= TARGET_N:
            break

        pid = paper.get("paperId")
        if not pid or pid in seen_ids:
            continue

        seen_ids.add(pid)

        # The API can return null for these fields, so fall back to empty values.
        authors = paper.get("authors") or []
        author_names = ", ".join(a["name"].strip() for a in authors if a.get("name"))

        row = {
            "paperId": pid,
            "title": paper.get("title") or "",
            "abstract": paper.get("abstract") or "",
            "year": paper.get("year") or "",
            "venue": paper.get("venue") or "",
            "url": paper.get("url") or "",
            "authors": author_names
        }

        all_rows.append(row)
        pbar.update(1)

    if len(all_rows) >= TARGET_N:
        print("Reached TARGET_N. Stopping.")
        break

    token = resp.get("token")
    if not token:
        print("No token for next page. Stopping.")
        break

    params = {
        "query": QUERY,
        "fields": FIELDS,
        "token": token
    }

    time.sleep(SLEEP_SECONDS)
    resp = fetch_with_retries(BASE_URL, params=params, headers=headers)

pbar.close()


df = pd.DataFrame(all_rows)
df.to_csv(OUT_CSV, index=False, encoding="utf-8")

print(f"\n✅ Saved {len(df)} records to: {OUT_CSV}")
df.head(5)

Starting bulk search for query: machine learning
Estimated matches: 1017265


Collected abstracts:   0%|          | 0/10000 [00:00<?, ?paper/s]

No token for next page. Stopping.

✅ Saved 1836 records to: semantic_scholar_machine_learning_10000.csv


Unnamed: 0,paperId,title,abstract,year,venue,url,authors
0,00000c33779acab142af6c7a6dae8b36fac0805d,Insights into Household Electric Vehicle Charg...,In the era of burgeoning electric vehicle (EV)...,2024.0,Energies,https://www.semanticscholar.org/paper/00000c33...,"Ahmad Almaghrebi, Kevin James, Fares al Juhesh..."
1,0000238f07f151172cf2602588ba762b55c8464b,Personalized Prediction of Response to Smartph...,Background Meditation apps have surged in popu...,2021.0,Journal of Medical Internet Research,https://www.semanticscholar.org/paper/0000238f...,"Christian A. Webb, M. Hirshberg, R. Davidson, ..."
2,00002d31a8c758062a51d9a259313d81a5eaf399,A Machine Learning Method to Quantify the Role...,,2020.0,International Conference on Information System...,https://www.semanticscholar.org/paper/00002d31...,"L. Szczyrba, Yang Zhang, D. Pamukçu, D. Eroglu"
3,0000315635be19f6278dbc72597b3065fac405f0,Abstractive text summarization of low-resource...,Background Humans must be able to cope with th...,2023.0,PeerJ Computer Science,https://www.semanticscholar.org/paper/00003156...,"Nida Shafiq, Isma Hamid, Muhammad Asif, Qamar ..."
4,00005d68c6c7eb4d3c27da8242a30b9a498f991e,Detection of DDoS Attacks on Clouds Computing ...,The growing number of cloud-based services has...,2023.0,International Conference on Communication and ...,https://www.semanticscholar.org/paper/00005d68...,"Iehab Alrassan, Asma Alqahtani"
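
The token-based paging loop used in the cell above can be tried out without network access. The sketch below is only an illustration: `FAKE_PAGES` and `fake_fetch` are hypothetical stand-ins for the real API responses, and `iter_records` is a helper name introduced here, not part of the notebook.

```python
from typing import Callable, Dict, Iterator, Optional

# Hypothetical in-memory "API": maps a continuation token to a response.
FAKE_PAGES: Dict[Optional[str], dict] = {
    None: {"data": [{"paperId": "a"}, {"paperId": "b"}], "token": "t1"},
    "t1": {"data": [{"paperId": "b"}, {"paperId": "c"}], "token": "t2"},
    "t2": {"data": [{"paperId": "d"}]},  # no token: this is the last page
}


def fake_fetch(token: Optional[str]) -> dict:
    """Stand-in for one HTTP request; returns the page for this token."""
    return FAKE_PAGES[token]


def iter_records(fetch: Callable[[Optional[str]], dict], limit: int) -> Iterator[dict]:
    """Yield unique records page by page until `limit` is hit or no token is left."""
    seen = set()
    token = None
    while True:
        resp = fetch(token)
        for rec in resp.get("data", []):
            pid = rec.get("paperId")
            if not pid or pid in seen:
                continue  # skip records without an id and repeats across pages
            seen.add(pid)
            yield rec
            if len(seen) >= limit:
                return
        token = resp.get("token")
        if not token:
            return  # the server signalled that there are no more pages


all_ids = [r["paperId"] for r in iter_records(fake_fetch, limit=10)]
first_three = [r["paperId"] for r in iter_records(fake_fetch, limit=3)]
print(all_ids)
print(first_three)
```

Writing the loop as a generator keeps the stop conditions (limit reached, token missing) in one place, which makes them easier to check than when they are spread through a `while True` block.
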


In [None]:
from google.colab import files
files.download(OUT_CSV)


# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
!pip -q install pandas nltk

import re
import pandas as pd
import nltk


nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("wordnet")
nltk.download("omw-1.4")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer


INPUT_CSV = "semantic_scholar_machine_learning_10000.csv"
TEXT_COL = "abstract"

df = pd.read_csv(INPUT_CSV)

print("Rows:", len(df))
print("Columns:", df.columns.tolist())


sample_series = df[TEXT_COL].fillna("").astype(str).head(3)
print("\nSample raw text (first 3):")
for i, t in enumerate(sample_series, start=1):
    print(f"\n--- Sample {i} ---\n{t[:600]}")


def remove_noise(text: str) -> str:
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

step1 = sample_series.apply(remove_noise)

print("\n\n==============================")
print("STEP 1 OUTPUT (Noise removed)")
print("==============================")
for i, t in enumerate(step1, start=1):
    print(f"\n--- Step1 Sample {i} ---\n{t[:600]}")


def remove_numbers(text: str) -> str:
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

step2 = step1.apply(remove_numbers)

print("\n\n==============================")
print("STEP 2 OUTPUT (Numbers removed)")
print("==============================")
for i, t in enumerate(step2, start=1):
    print(f"\n--- Step2 Sample {i} ---\n{t[:600]}")


STOPWORDS = set(stopwords.words("english"))

def remove_stopwords(text: str) -> str:
    tokens = word_tokenize(text)
    filtered = [w for w in tokens if w.lower() not in STOPWORDS]
    return " ".join(filtered)

step3 = step2.apply(remove_stopwords)

print("\n\n==============================")
print("STEP 3 OUTPUT (Stopwords removed)")
print("==============================")
for i, t in enumerate(step3, start=1):
    print(f"\n--- Step3 Sample {i} ---\n{t[:600]}")


def to_lower(text: str) -> str:
    return text.lower()

step4 = step3.apply(to_lower)

print("\n\n==============================")
print("STEP 4 OUTPUT (Lowercased)")
print("==============================")
for i, t in enumerate(step4, start=1):
    print(f"\n--- Step4 Sample {i} ---\n{t[:600]}")


stemmer = PorterStemmer()

def stem_text(text: str) -> str:
    tokens = word_tokenize(text)
    stemmed = [stemmer.stem(w) for w in tokens]
    return " ".join(stemmed)

step5 = step4.apply(stem_text)

print("\n\n==============================")
print("STEP 5 OUTPUT (Stemmed)")
print("==============================")
for i, t in enumerate(step5, start=1):
    print(f"\n--- Step5 Sample {i} ---\n{t[:600]}")


lemmatizer = WordNetLemmatizer()

def lemmatize_text(text: str) -> str:
    tokens = word_tokenize(text)
    # lemmatize() defaults to pos="n", so only noun forms are reduced here.
    lemmas = [lemmatizer.lemmatize(w) for w in tokens]
    return " ".join(lemmas)

step6 = step4.apply(lemmatize_text)

print("\n\n==============================")
print("STEP 6 OUTPUT (Lemmatized)")
print("==============================")
for i, t in enumerate(step6, start=1):
    print(f"\n--- Step6 Sample {i} ---\n{t[:600]}")


def full_clean_pipeline(text: str) -> str:
    """Apply steps 1-4 plus lemmatization; stemming (step 5) is not applied here."""
    if pd.isna(text):
        text = ""
    text = str(text)

    text = remove_noise(text)
    text = remove_numbers(text)
    text = remove_stopwords(text)
    text = to_lower(text)
    text = lemmatize_text(text)

    return text

df["clean_text"] = df[TEXT_COL].apply(full_clean_pipeline)

OUTPUT_CSV = INPUT_CSV.replace(".csv", "_cleaned.csv")
df.to_csv(OUTPUT_CSV, index=False, encoding="utf-8")

print("\n✅ Saved cleaned CSV with new column 'clean_text' to:", OUTPUT_CSV)


df[[TEXT_COL, "clean_text"]].head(5)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Rows: 1836
Columns: ['paperId', 'title', 'abstract', 'year', 'venue', 'url', 'authors']

Sample raw text (first 3):

--- Sample 1 ---
In the era of burgeoning electric vehicle (EV) popularity, understanding the patterns of EV users’ behavior is imperative. This paper examines the trends in household charging sessions’ timing, duration, and energy consumption by analyzing real-world residential charging data. By leveraging the information collected from each session, a novel framework is introduced for the efficient, real-time prediction of important charging characteristics. Utilizing historical data and user-specific features, machine learning models are trained to predict the connection duration, charging duration, chargin

--- Sample 2 ---
Background Meditation apps have surged in popularity in recent years, with an increasing number of individuals turning to these apps to cope with stress, including during the COVID-19 pandemic. Meditation apps are the most commonly used mental hea

Unnamed: 0,abstract,clean_text
0,In the era of burgeoning electric vehicle (EV)...,era burgeoning electric vehicle ev popularity ...
1,Background Meditation apps have surged in popu...,background meditation apps surged popularity r...
2,,
3,Background Humans must be able to cope with th...,background human must able cope huge amount in...
4,The growing number of cloud-based services has...,growing number cloud based service led rising ...
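
To see what the two regex steps do to individual tokens, the functions can be run on a short string. This is a small check using only the standard library; the sample strings are made up for illustration and are not taken from the dataset.

```python
import re


def remove_noise(text: str) -> str:
    # Same pattern as the notebook: anything that is not an ASCII letter,
    # a digit, or whitespace becomes a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_numbers(text: str) -> str:
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample; \u2019 is a typographic apostrophe.
sample = "EV users\u2019 behavior during the COVID-19 pandemic (2020)!"
step1 = remove_noise(sample)
step2 = remove_numbers(step1)
print(step1)
print(step2)

# Accented letters fall outside A-Za-z, so they split the word.
accented = remove_noise("Pamuk\u00e7u")
print(accented)
```

Two side effects are worth knowing before analysing `clean_text`: hyphenated terms such as "COVID-19" lose their numeric part, and words with non-ASCII letters are broken into fragments.
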


In [None]:
from google.colab import files
files.download(OUTPUT_CSV)


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
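
The difference between the two tree types in part (2) can be made concrete with plain Python data structures. The sketch below is hand-annotated for a made-up four-word sentence; it is not parser output, and the helper names `leaves` and `bracketed` are introduced here only for illustration. The relation labels follow the style printed by spaCy's English models.

```python
tokens = ["models", "predict", "charging", "duration"]

# Constituency: nested (label, children...) tuples; a leaf is (POS, word).
constituency = (
    "S",
    ("NP", ("NNS", "models")),
    ("VP",
     ("VBP", "predict"),
     ("NP", ("NN", "charging"), ("NN", "duration"))),
)

# Dependency: one (dependent, relation, head) triple per token.
# The root has no head, so its head is None.
dependency = [
    ("models", "nsubj", "predict"),
    ("predict", "ROOT", None),
    ("charging", "compound", "duration"),
    ("duration", "dobj", "predict"),
]


def leaves(tree):
    """Return the words of a constituency tree in left-to-right order."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def bracketed(tree):
    """Render a constituency tree in bracket notation."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(bracketed(c) for c in children) + ")"


tree_text = bracketed(constituency)
roots = [dep for dep, rel, head in dependency if head is None]
print(tree_text)
for dep, rel, head in dependency:
    print(f"{dep:<10} --{rel}--> {head}")
```

The constituency tree adds phrase nodes (S, NP, VP) above the words, while the dependency list has exactly one entry per word and no extra nodes: each word simply points at its head.
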

In [None]:
!pip -q install pandas spacy nltk

import pandas as pd
from collections import Counter
import nltk


nltk.download("punkt")
nltk.download("punkt_tab")

nltk.download("averaged_perceptron_tagger")
nltk.download("averaged_perceptron_tagger_eng")

nltk.download("maxent_ne_chunker")
nltk.download("maxent_ne_chunker_tab")
nltk.download("words")

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag
from nltk.chunk import ne_chunk


import spacy
!python -m spacy download en_core_web_sm -q
nlp = spacy.load("en_core_web_sm")


CLEAN_CSV = "semantic_scholar_machine_learning_10000_cleaned.csv"
TEXT_COL = "clean_text"

df = pd.read_csv(CLEAN_CSV)
df[TEXT_COL] = df[TEXT_COL].fillna("").astype(str)

print("Rows:", len(df))
print("Columns:", df.columns.tolist())


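# clean_text has no punctuation left, so sent_tokenize usually returns the
# whole abstract as a single "sentence".
example_sentence = None
for text in df[TEXT_COL].head(200):
    if text.strip():
        sents = sent_tokenize(text)
        if sents:
            example_sentence = sents[0]
            break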

if not example_sentence:
    raise ValueError("No valid clean_text sentence found. Check your CSV.")

print("\n==============================")
print("EXAMPLE SENTENCE (used for parsing)")
print("==============================")
print(example_sentence)


N_ROWS = 1000
texts_for_pos = df[TEXT_COL].head(N_ROWS).tolist()

noun_tags = {"NN", "NNS", "NNP", "NNPS"}
verb_tags = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
adj_tags  = {"JJ", "JJR", "JJS"}
adv_tags  = {"RB", "RBR", "RBS"}

noun_count = verb_count = adj_count = adv_count = 0

for txt in texts_for_pos:
    tokens = word_tokenize(txt)
    tags = pos_tag(tokens)
    for _, t in tags:
        if t in noun_tags:
            noun_count += 1
        elif t in verb_tags:
            verb_count += 1
        elif t in adj_tags:
            adj_count += 1
        elif t in adv_tags:
            adv_count += 1

print("\n==============================")
print("PART (1) POS TAGGING COUNTS")
print("==============================")
print("Processed rows:", N_ROWS)
print("Total Nouns:", noun_count)
print("Total Verbs:", verb_count)
print("Total Adjectives:", adj_count)
print("Total Adverbs:", adv_count)

print("\nPOS tags for example sentence:")
ex_tokens = word_tokenize(example_sentence)
ex_pos = pos_tag(ex_tokens)
print(ex_pos)


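print("\n==============================")
print("PART (2) CONSTITUENCY-STYLE TREE (NLTK ne_chunk, shallow chunks only)")
print("==============================")
# ne_chunk only groups named entities over the POS tags; it is not a full constituency parser.
chunk_tree = ne_chunk(ex_pos)
print(chunk_tree)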

print("\n==============================")
print("PART (2) DEPENDENCY PARSE (spaCy)")
print("==============================")
doc_sent = nlp(example_sentence)
for token in doc_sent:
    if token.is_space:
        continue
    print(f"{token.text:<15} --> {token.head.text:<15} ({token.dep_})")

print("\n==============================")
print("EXPLANATION (One sentence)")
print("==============================")
print("Constituency tree groups words into nested phrase units (hierarchical structure).")
print("Dependency tree links each word to a head word and labels the grammatical relation (subject, object, modifier).")


N_ROWS_NER = 1000
texts_for_ner = df[TEXT_COL].head(N_ROWS_NER).tolist()

entity_type_counts = Counter()
entity_text_counts = Counter()

for doc in nlp.pipe(texts_for_ner, batch_size=30):
    for ent in doc.ents:
        entity_type_counts[ent.label_] += 1
        entity_text_counts[ent.text] += 1

print("\n==============================")
print("PART (3) NER ENTITY COUNTS (by type)")
print("==============================")
print("Processed rows:", N_ROWS_NER)
for label, cnt in entity_type_counts.most_common():
    print(f"{label:<10}: {cnt}")

print("\nTop 20 entity strings:")
for ent, cnt in entity_text_counts.most_common(20):
    print(f"{ent:<30} {cnt}")


ner_df = pd.DataFrame(entity_type_counts.items(), columns=["entity_type", "count"]).sort_values("count", ascending=False)
ner_df.to_csv("ner_entity_type_counts.csv", index=False)

print("\n✅ Saved NER counts file: ner_entity_type_counts.csv")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may nee

In [None]:
from google.colab import files
files.download("ner_entity_type_counts.csv")


# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to successive pages of the marketplace until 1000 products are collected, handling pagination through the page number in the request. Extract the product name, a short description, and the URL for each product.

Store the extracted data in a structured CSV file with columns for product name, description, URL, and page number. Add a time delay between requests to avoid overloading the server. ChatGPT can help with parsing the HTML, handling errors, and generating reports from the collected data.

The goal is to complete the scraping within a specified time limit while keeping the process efficient and within GitHub's usage guidelines.
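
The pagination and delay ideas described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the `type`, `sort`, and `page` query parameters are the ones this notebook sends, and whether the marketplace honours them is not verified here. `page_url` and `backoff_delays` are hypothetical helper names.

```python
from urllib.parse import urlencode

MARKET_URL = "https://github.com/marketplace"


def page_url(page: int, listing_type: str = "actions", sort: str = "popularity") -> str:
    """Build the URL for one listing page from its page number."""
    query = urlencode({"type": listing_type, "sort": sort, "page": page})
    return f"{MARKET_URL}?{query}"


def backoff_delays(max_retries: int, start: float = 1.0, cap: float = 30.0):
    """Yield the wait before each retry: doubling from `start`, limited to `cap`."""
    delay = start
    for _ in range(max_retries):
        yield delay
        delay = min(delay * 2, cap)


second_page = page_url(2)
delays = list(backoff_delays(6))
print(second_page)
print(delays)
```

Capping the delay keeps a long outage from stalling the run indefinitely, while the doubling still eases pressure on the server after repeated failures.
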

(PART-2)

1. **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. **Data Quality**: Perform data quality operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. They also involve identifying and removing duplicates, handling missing values, and confirming that the data accurately reflects the source content.
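
The checks described above (completeness, duplicates, format) can be sketched on plain dictionaries before moving to pandas. The rows below are made up for illustration, and `quality_report` and `drop_duplicate_urls` are hypothetical helper names; the URL prefix is the one used for action pages in this notebook.

```python
from collections import Counter

REQUIRED = ("product_name", "description", "url")
URL_PREFIX = "https://github.com/marketplace/actions/"


def quality_report(rows):
    """Count missing required fields, repeated URLs, and badly formed URLs."""
    report = {}
    for col in REQUIRED:
        report[f"missing_{col}"] = sum(1 for r in rows if not (r.get(col) or "").strip())
    url_counts = Counter(r.get("url", "") for r in rows)
    report["duplicate_urls"] = sum(n - 1 for n in url_counts.values() if n > 1)
    report["invalid_url_format"] = sum(
        1 for r in rows if not r.get("url", "").startswith(URL_PREFIX)
    )
    return report


def drop_duplicate_urls(rows):
    """Keep the first row seen for each URL."""
    seen, kept = set(), []
    for r in rows:
        if r["url"] not in seen:
            seen.add(r["url"])
            kept.append(r)
    return kept


# Made-up rows: one duplicate URL, two blank descriptions, one foreign URL.
rows = [
    {"product_name": "Checkout", "description": "Check out a repository", "url": URL_PREFIX + "checkout"},
    {"product_name": "Checkout", "description": "", "url": URL_PREFIX + "checkout"},
    {"product_name": "Example Tool", "description": "  ", "url": "https://example.com/tool"},
]

report = quality_report(rows)
deduped = drop_duplicate_urls(rows)
print(report)
print(len(deduped))
```

Running the report before removing duplicates records what was wrong with the raw data, which is usually what a quality check is meant to document.
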


Part 1

GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [None]:
!pip -q install requests beautifulsoup4 pandas tqdm

import time, re, requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
from urllib.parse import urljoin

BASE = "https://github.com"
MARKET_URL = "https://github.com/marketplace"

TYPE = "actions"
SORT = "popularity"
TARGET_N = 1000
SLEEP_SECONDS = 1.2
MAX_PAGES = 600

OUT_CSV = "github_marketplace_actions_1000.csv"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

def clean_ws(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "")).strip()

def fetch_page(page_num: int, max_retries: int = 6):
    params = {"type": TYPE, "sort": SORT, "page": page_num}
    backoff = 1.0
    for _ in range(max_retries):
        try:
            r = requests.get(MARKET_URL, params=params, headers=HEADERS, timeout=30)
            if r.status_code == 200:
                return r.text
            if r.status_code in (429, 500, 502, 503, 504):
                time.sleep(backoff)
                backoff = min(backoff * 2, 30)
                continue
            print(f"HTTP {r.status_code} on page {page_num}")
            return None
        except requests.RequestException:
            time.sleep(backoff)
            backoff = min(backoff * 2, 30)
    return None

def parse_actions(html: str, page_num: int):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    seen = set()


    anchors = soup.select('a[href^="/marketplace/actions/"]')

    for a in anchors:
        href = (a.get("href") or "").strip()
        name = clean_ws(a.get_text(" ", strip=True))
        if not href or not name:
            continue

        key = (href, name.lower())
        if key in seen:
            continue
        seen.add(key)

        full_url = urljoin(BASE, href)


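        # NOTE: in the recorded run every description came back empty, so this
        # lookup (first <p> in the nearest container) did not match the page markup.
        desc = ""
        container = a.find_parent(["li", "article", "div", "section"])
        if container:
            p = container.find("p")
            if p:
                desc = clean_ws(p.get_text(" ", strip=True))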

        rows.append({
            "product_name": name,
            "description": desc,
            "url": full_url,
            "page_number": page_num
        })


    uniq = {}
    for r in rows:
        uniq[r["url"]] = r
    return list(uniq.values())

all_rows = []
seen_urls = set()

pbar = tqdm(total=TARGET_N, desc="Scraping actions", unit="action")

for page in range(1, MAX_PAGES + 1):
    if len(seen_urls) >= TARGET_N:
        break

    html = fetch_page(page)
    if not html:
        time.sleep(SLEEP_SECONDS)
        continue

    page_rows = parse_actions(html, page)

    for row in page_rows:
        if len(seen_urls) >= TARGET_N:
            break
        if row["url"] in seen_urls:
            continue
        seen_urls.add(row["url"])
        all_rows.append(row)
        pbar.update(1)

    time.sleep(SLEEP_SECONDS)

pbar.close()

df_actions = pd.DataFrame(all_rows)
df_actions.to_csv(OUT_CSV, index=False, encoding="utf-8")

print("\n✅ PART-1 COMPLETE")
print("Total actions collected:", len(df_actions))
print("Saved CSV:", OUT_CSV)

df_actions.head(10)

Scraping actions:   0%|          | 0/1000 [00:00<?, ?action/s]


✅ PART-1 COMPLETE
Total actions collected: 1000
Saved CSV: github_marketplace_actions_1000.csv


Unnamed: 0,product_name,description,url,page_number
0,TruffleHog OSS,,https://github.com/marketplace/actions/truffle...,1
1,Metrics embed,,https://github.com/marketplace/actions/metrics...,1
2,yq - portable yaml processor,,https://github.com/marketplace/actions/yq-port...,1
3,Super-Linter,,https://github.com/marketplace/actions/super-l...,1
4,Rebuild Armbian and Kernel,,https://github.com/marketplace/actions/rebuild...,1
5,Gosec Security Checker,,https://github.com/marketplace/actions/gosec-s...,1
6,Checkout,,https://github.com/marketplace/actions/checkout,1
7,OpenCommit — improve commits with AI 🧙,,https://github.com/marketplace/actions/opencom...,1
8,SSH Remote Commands,,https://github.com/marketplace/actions/ssh-rem...,1
9,Claude Code Action Official,,https://github.com/marketplace/actions/claude-...,1


Part 2

In [None]:
!pip -q install pandas nltk beautifulsoup4

import re
import pandas as pd
from bs4 import BeautifulSoup
from collections import Counter

import nltk
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("wordnet")
nltk.download("omw-1.4")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

INPUT_CSV = "github_marketplace_actions_1000.csv"
CLEAN_CSV = "github_marketplace_actions_1000_cleaned.csv"
QUALITY_CSV = "github_marketplace_actions_quality_report.csv"

df = pd.read_csv(INPUT_CSV).fillna("")

print("Rows:", len(df))
print("Columns:", df.columns.tolist())


STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def remove_html(text: str) -> str:
    return BeautifulSoup(text, "html.parser").get_text(" ")

def remove_noise(text: str) -> str:
    text = remove_html(str(text))
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str):
    """
    Clean noise -> lowercase -> tokenize -> remove stopwords -> lemmatize
    Returns tokens list.
    """
    text = remove_noise(text).lower()
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens


df["product_name_tokens"] = df["product_name"].apply(preprocess_text)
df["description_tokens"]  = df["description"].apply(preprocess_text)

df["product_name_clean"] = df["product_name_tokens"].apply(lambda x: " ".join(x))
df["description_clean"]  = df["description_tokens"].apply(lambda x: " ".join(x))

print("\nPreview after preprocessing:")
# A bare expression in the middle of a cell is not rendered, so display it explicitly.
display(df[["product_name", "product_name_clean", "description", "description_clean"]].head(5))

quality = {}


quality["missing_product_name"] = int((df["product_name"].str.strip() == "").sum())
quality["missing_description"]  = int((df["description"].str.strip() == "").sum())
quality["missing_url"]          = int((df["url"].str.strip() == "").sum())
quality["missing_page_number"]  = int(df["page_number"].isna().sum())


quality["duplicate_urls"]  = int(df.duplicated(subset=["url"]).sum())
quality["duplicate_names"] = int(df.duplicated(subset=["product_name"]).sum())


quality["invalid_url_format"] = int((~df["url"].astype(str).str.startswith("https://github.com/marketplace/actions/")).sum())


quality["very_short_names(<3chars)"] = int((df["product_name"].astype(str).str.len() < 3).sum())
quality["very_short_desc(<10chars)"] = int((df["description"].astype(str).str.len() < 10).sum())

quality_df = pd.DataFrame(list(quality.items()), columns=["check", "count"]).sort_values("count", ascending=False)
quality_df.to_csv(QUALITY_CSV, index=False, encoding="utf-8")

print("\n✅ Data Quality Report:")
display(quality_df)


df = df.drop_duplicates(subset=["url"]).reset_index(drop=True)


df.to_csv(CLEAN_CSV, index=False, encoding="utf-8")

print("\n✅ PART-2 COMPLETE")
print("Saved cleaned CSV:", CLEAN_CSV)
print("Saved quality report CSV:", QUALITY_CSV)

df.head(5)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Rows: 1000
Columns: ['product_name', 'description', 'url', 'page_number']

Preview after preprocessing:

✅ Data Quality Report:


Unnamed: 0,check,count
1,missing_description,1000
8,very_short_desc(<10chars),1000
0,missing_product_name,0
3,missing_page_number,0
2,missing_url,0
4,duplicate_urls,0
5,duplicate_names,0
6,invalid_url_format,0
7,very_short_names(<3chars),0



✅ PART-2 COMPLETE
Saved cleaned CSV: github_marketplace_actions_1000_cleaned.csv
Saved quality report CSV: github_marketplace_actions_quality_report.csv


Unnamed: 0,product_name,description,url,page_number,product_name_tokens,description_tokens,product_name_clean,description_clean
0,TruffleHog OSS,,https://github.com/marketplace/actions/truffle...,1,"[trufflehog, os]",[],trufflehog os,
1,Metrics embed,,https://github.com/marketplace/actions/metrics...,1,"[metric, embed]",[],metric embed,
2,yq - portable yaml processor,,https://github.com/marketplace/actions/yq-port...,1,"[yq, portable, yaml, processor]",[],yq portable yaml processor,
3,Super-Linter,,https://github.com/marketplace/actions/super-l...,1,"[super, linter]",[],super linter,
4,Rebuild Armbian and Kernel,,https://github.com/marketplace/actions/rebuild...,1,"[rebuild, armbian, kernel]",[],rebuild armbian kernel,


# Question 5 (20 points)

Part 1:
Collect tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.
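
One possible shape for the Part 2 cleaning step is sketched below using only the standard library. The choice of steps (dropping URLs, @mentions, the `#` symbol, digits, and punctuation, then lowercasing) is an assumption, since the prompt does not list them, and `clean_tweet` is a hypothetical helper name. The sample tweet is made up.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text: str) -> str:
    """Strip links, mentions, and non-letters; keep hashtag words; lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")             # keep the hashtag word, drop the symbol
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()


# Made-up sample tweet.
sample = "AI is everywhere.\n\nRead more: https://example.com/post @some_user #MachineLearning 2024"
cleaned = clean_tweet(sample)
print(cleaned)
```

URLs are removed before the punctuation step so that their fragments do not survive as stray words.
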


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [None]:
!pip -q install tweepy pandas

In [None]:
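import os
from getpass import getpass

import pandas as pd
import tweepy

# Keep the secret out of the notebook: read it from an environment variable
# (the name TWITTER_BEARER_TOKEN is an assumption) or prompt for it.
BEARER_TOKEN = os.environ.get("TWITTER_BEARER_TOKEN") or getpass("Twitter bearer token: ")

# wait_on_rate_limit=True makes the client sleep until the rate-limit window resets.
client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)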

In [None]:
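# Match either hashtag, skip retweets, and keep English tweets only.
query = "(#machinelearning OR #artificialintelligence) -is:retweet lang:en"

rows = []
next_token = None

TOTAL_TO_COLLECT = 300
MAX_PER_REQUEST = 100  # recent search accepts max_results between 10 and 100

while len(rows) < TOTAL_TO_COLLECT:
    remaining = TOTAL_TO_COLLECT - len(rows)
    resp = client.search_recent_tweets(
        query=query,
        # Clamp to the API's allowed range; any surplus is trimmed after the loop.
        max_results=max(10, min(MAX_PER_REQUEST, remaining)),
        tweet_fields=["id", "text", "author_id", "created_at"],
        expansions=["author_id"],
        user_fields=["username"],
        next_token=next_token
    )

    # Stop when the search returns no more tweets.
    if resp.data is None:
        break

    # Map author IDs to usernames from the expanded user objects.
    user_map = {}
    if resp.includes and "users" in resp.includes:
        user_map = {u.id: u.username for u in resp.includes["users"]}

    for t in resp.data:
        rows.append({
            "tweet_id": t.id,
            "username": user_map.get(t.author_id, None),
            "text": t.text
        })

    # Follow the pagination token until the results run out.
    next_token = resp.meta.get("next_token")
    if not next_token:
        break

df = pd.DataFrame(rows[:TOTAL_TO_COLLECT])
print("Collected:", len(df))
df.head()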

Collected: 300


Unnamed: 0,tweet_id,username,text
0,2028340880508088487,wittoast42,Gratitude is a visibility strategy.\n\nRead mo...
1,2028339718048317829,CarterVr3,You don’t need to be a programmer to use AI.\n...
2,2028338322825691536,ekascloud,Why Choose Eka Cloud Full Course Explanation &...
3,2028338266810834966,Timothy_Hughes,The AI Easy-Button: Why thinking still matters...
4,2028337959343206567,flarestartcom,Building Sacred: A Privacy-First Period Tracke...


In [None]:
df.to_csv("tweets_raw.csv", index=False)
print("File saved successfully.")

File saved successfully.


In [None]:
from google.colab import files
files.download("tweets_raw.csv")


In [None]:
import pandas as pd
import re
from google.colab import files

df = pd.read_csv("tweets_raw.csv")

# Drop exact duplicate rows and rows missing any required field.
df = df.drop_duplicates()
df = df.dropna(subset=["tweet_id", "username", "text"])

# Store IDs as strings so they are treated as identifiers, not numbers.
df["tweet_id"] = df["tweet_id"].astype(str)

# Remove URLs, collapse whitespace, then trim and lowercase the text.
df["clean_text"] = df["text"].str.replace(r"http\S+|www\.\S+", "", regex=True)
df["clean_text"] = df["clean_text"].str.replace(r"\s+", " ", regex=True)
df["clean_text"] = df["clean_text"].str.strip().str.lower()

# Final data quality check: row count, missing values, duplicates, empty text.
print("Total rows after cleaning:", len(df))
print("\nMissing values:\n", df.isnull().sum())
print("\nDuplicate tweet_id count:", df["tweet_id"].duplicated().sum())
print("Empty clean_text rows:", (df["clean_text"].str.len() == 0).sum())

df.to_csv("tweets_cleaned.csv", index=False)

files.download("tweets_cleaned.csv")

Total rows after cleaning: 300

Missing values:
 tweet_id      0
username      0
text          0
clean_text    0
dtype: int64

Duplicate tweet_id count: 0
Empty clean_text rows: 0



# Mandatory Question (5 points)

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

For me, this assignment was an extensive, practical exercise in web scraping, text preprocessing, and natural language processing. Dealing with API limitations and library compatibility in Google Colab was one of the most difficult parts. Managing pagination, rate limits, and occasional package dependency errors required patience and careful debugging. Syntactic analysis, such as constituency and dependency parsing, was also technically challenging because it requires both linguistic theory and an understanding of how the tools work.
Nevertheless, I enjoyed the practical nature of the assignment. Collecting real-world data from sources such as Semantic Scholar, GitHub Marketplace, and Twitter made the learning hands-on. The preprocessing and data quality steps were especially rewarding because they turned unstructured, messy data into organized, interpretable data. I also liked that the assignment combined several NLP tasks (POS tagging, parsing, and named entity recognition), which helped me understand how different text analytics methods interact.

The time provided was adequate, but the assignment demanded steady effort and good time management. Because the tasks involved API setup, debugging, and processing large datasets, I had to set aside enough time for testing and troubleshooting. Overall, the assignment was well designed for building practical skills in both NLP and data engineering, and it was a helpful experience.
