# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software products from G2 and Capterra.


In [6]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd

# Number of pages with narrators
total_pages = 41
narrator_info = {}

for page in range(1, total_pages + 1):
    url = f"https://ddr.densho.org/narrators/?page={page}"
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    response = urlopen(request)
    soup = BeautifulSoup(response.read(), "html.parser")

    # Find narrator entries on the page
    narrator_cards = soup.select("#list_tab .media-body")

    for card in narrator_cards:
        if len(narrator_info) >= 904:  # stop once all narrators are collected
            break
        name_tag = card.find("a")
        details_tag = card.find(class_="source muted")

        name = name_tag.get_text(strip=True) if name_tag else "Unknown"
        details = details_tag.get_text(strip=True) if details_tag else ""

        narrator_info[name] = details

    if len(narrator_info) >= 904:
        break

# Convert dictionary to DataFrame
df = pd.DataFrame(list(narrator_info.items()), columns=["Name", "Details"])

# Save to CSV
df.to_csv("Narrator_Information.csv", index=False, encoding="utf-8")
print(f"✅ Saved {len(df)} narrators into Narrator_Information.csv")


✅ Saved 904 narrators into Narrator_Information.csv


# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
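
The order of steps (3) and (4) matters in practice: NLTK's English stopword list is all lowercase, so a case-sensitive membership test leaves capitalised stopwords such as "The" in place. A minimal sketch of a case-insensitive filter follows, assuming a tiny hand-written stopword set; `SAMPLE_STOPWORDS` and `remove_stopwords` are illustrative names, not part of the assignment.

```python
# Tiny stand-in for stopwords.words("english"); an assumption for illustration only.
SAMPLE_STOPWORDS = {"the", "in", "of", "a", "and"}


def remove_stopwords(text, stop_words=SAMPLE_STOPWORDS, case_sensitive=False):
    """Drop stopwords from whitespace-separated text, keeping the original casing."""
    kept = []
    for word in text.split():
        # Compare on the lowercased form unless a case-sensitive test is requested.
        key = word if case_sensitive else word.lower()
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)


sample = "The family farmed in Oregon"
print(remove_stopwords(sample, case_sensitive=True))   # "The" survives the filter
print(remove_stopwords(sample, case_sensitive=False))  # "The" is removed
```

Lowercasing before filtering would have the same effect; the point is only that the comparison should not depend on capitalisation.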

In [7]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word

df_ab = pd.read_csv("Narrator_Information.csv")  # file name must match the CSV saved in Question 1
df_ab = df_ab.dropna()

# (1) Remove noise: strip special characters and punctuation
df_ab["CleanedName"] = df_ab["Name"].str.replace(r"[^\w\s]", "", regex=True)
df_ab["CleanedDetails"] = df_ab["Details"].str.replace(r"[^\w\s]", "", regex=True)

# (2) Remove numbers
df_ab["CleanedName"] = df_ab["CleanedName"].str.replace(r"\d+", "", regex=True)
df_ab["CleanedDetails"] = df_ab["CleanedDetails"].str.replace(r"\d+", "", regex=True)

# (3) Remove stopwords using the NLTK English stopword list
nltk.download("stopwords")
stop_words_ab = set(stopwords.words("english"))

df_ab["CleanedName"] = df_ab["CleanedName"].apply(
    lambda text_ab: " ".join(word_ab for word_ab in text_ab.split() if word_ab not in stop_words_ab)
)
df_ab["CleanedDetails"] = df_ab["CleanedDetails"].apply(
    lambda text_ab: " ".join(word_ab for word_ab in text_ab.split() if word_ab not in stop_words_ab)
)

# (4) Lowercase all text
df_ab["CleanedName"] = df_ab["CleanedName"].str.lower()
df_ab["CleanedDetails"] = df_ab["CleanedDetails"].str.lower()

# (5) Stemming with the Porter stemmer
stemmer_ab = PorterStemmer()

df_ab["CleanedName"] = df_ab["CleanedName"].apply(
    lambda text_ab: " ".join(stemmer_ab.stem(word_ab) for word_ab in text_ab.split())
)
df_ab["CleanedDetails"] = df_ab["CleanedDetails"].apply(
    lambda text_ab: " ".join(stemmer_ab.stem(word_ab) for word_ab in text_ab.split())
)

# (6) Lemmatization with TextBlob (backed by WordNet)
nltk.download("wordnet")

df_ab["CleanedName"] = df_ab["CleanedName"].apply(
    lambda text_ab: " ".join(Word(word_ab).lemmatize() for word_ab in text_ab.split())
)
df_ab["CleanedDetails"] = df_ab["CleanedDetails"].apply(
    lambda text_ab: " ".join(Word(word_ab).lemmatize() for word_ab in text_ab.split())
)

df_ab.to_csv("Narrators_Information_Cleaned.csv", index=False)
print(df_ab.head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


                    Name                                            Details  \
0           Kay Aiko Abe  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1                Art Abe  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Sharon Tanagi Aburano  Nisei female. Born October 31, 1925, in Seattl...   
3        Toshiko Aiboshi  Nisei female. Born July 8, 1928, in Boyle Heig...   
4      Douglas L. Aihara  Sansei male. Born March 15, 1950, in Torrance,...   

             CleanedName                                     CleanedDetails  
0           kay aiko abe  nisei femal born may selleck washington spent ...  
1                art abe  nisei male born june seattl washington grew ar...  
2  sharon tanagi aburano  nisei femal born octob seattl washington famil...  
3        toshiko aiboshi  nisei femal born juli boyl height california a...  
4        dougla l aihara  sansei male born march torranc california grew...  


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.
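
For part (1), the four totals can be derived from Penn Treebank tags by grouping on the first two characters of each tag (`NN*`, `VB*`, `JJ*`, `RB*`). The sketch below assumes the tokens are already tagged; the `tagged` sample and the helper name `coarse_pos_counts` are hypothetical and stand in for the output of a real tagger.

```python
from collections import Counter

# Two-letter Penn Treebank prefixes for the four categories the question asks about.
COARSE_BY_PREFIX = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def coarse_pos_counts(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _, tag in tagged_tokens:
        category = COARSE_BY_PREFIX.get(tag[:2])
        if category is not None:  # tags such as MD or IN are ignored
            counts[category] += 1
    return counts


# Hypothetical tagged sample, written by hand for illustration.
tagged = [
    ("family", "NN"), ("moved", "VBD"), ("quickly", "RB"),
    ("new", "JJ"), ("farms", "NNS"), ("may", "MD"),
]
counts = coarse_pos_counts(tagged)
print(dict(counts))
```

Grouping by prefix covers the same tags as listing `NN`, `NNS`, `NNP`, `NNPS` and so on explicitly, with less room for a missed variant.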

In [18]:
import pandas as pd
import nltk
from nltk import pos_tag, word_tokenize
from collections import Counter

def ensure_nltk_ready():
    try:
        nltk.data.find("taggers/averaged_perceptron_tagger")
    except LookupError:
        nltk.download("averaged_perceptron_tagger")
    # NLTK ≥ 3.9 may require the *_eng variant
    try:
        nltk.data.find("taggers/averaged_perceptron_tagger_eng")
    except LookupError:
        try:
            nltk.download("averaged_perceptron_tagger_eng")
        except Exception:
            pass

ensure_nltk_ready()

# Load cleaned narrators CSV
narrators_df = pd.read_csv("/content/Narrators_Information_Cleaned.csv")

# Use whichever cleaned-details column exists
details_col = "CleanedDetails_ab" if "CleanedDetails_ab" in narrators_df.columns else "CleanedDetails"
if details_col not in narrators_df.columns:
    raise KeyError("Expected 'CleanedDetails' or 'CleanedDetails_ab' in the CSV.")
narrators_df[details_col] = narrators_df[details_col].fillna("")

# Function to count POS categories (preserve_line avoids punkt/punkt_tab)
def extract_pos_counts(text):
    tokens = word_tokenize(str(text), preserve_line=True)
    tagged_tokens = pos_tag(tokens)
    tag_counts = Counter(tag for _, tag in tagged_tokens)

    noun_count = sum(tag_counts.get(tag, 0) for tag in ["NN", "NNS", "NNP", "NNPS"])
    verb_count = sum(tag_counts.get(tag, 0) for tag in ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"])
    adjective_count = sum(tag_counts.get(tag, 0) for tag in ["JJ", "JJR", "JJS"])
    adverb_count = sum(tag_counts.get(tag, 0) for tag in ["RB", "RBR", "RBS"])

    return noun_count, verb_count, adjective_count, adverb_count

# Apply POS tagging on the cleaned details column
narrators_df[["NounCount", "VerbCount", "AdjectiveCount", "AdverbCount"]] = pd.DataFrame(
    narrators_df[details_col].apply(extract_pos_counts).to_list(),
    index=narrators_df.index
)

# Compute totals across dataset
total_nouns = int(narrators_df["NounCount"].sum())
total_verbs = int(narrators_df["VerbCount"].sum())
total_adjectives = int(narrators_df["AdjectiveCount"].sum())
total_adverbs = int(narrators_df["AdverbCount"].sum())

print(f"Total Nouns: {total_nouns}")
print(f"Total Verbs: {total_verbs}")
print(f"Total Adjectives: {total_adjectives}")
print(f"Total Adverbs: {total_adverbs}")


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Total Nouns: 8890
Total Verbs: 2058
Total Adjectives: 2737
Total Adverbs: 464


In [20]:
!pip install benepar


Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Building wheels for collected packages: benepar
  Building wheel for benepar (setup.py) ... done
  Created wheel for benepar: filename=benepar-0.2.0-py3-none-any.whl size=37625 sha256=4e731471c13d1fbab1791405629e556294b21f3cd750dd1449f5a9ed9f69c4b7
  Stored in directory: /root/.cache/pip/wheels/9b/84/c1/f2ac877f519e2864e7dfe52a1c17fe5cdd50819cb8d1f1945f
Successfully built benepar
Installing collected packages: torch-struct, benepar
Successfully installed benepar-0.2.0 torch-struct-0.5


In [21]:
import pandas as pd
import spacy
import benepar
from spacy import displacy
from nltk import Tree

# Load spaCy model and benepar
nlp_ab = spacy.load("en_core_web_sm")
benepar.download("benepar_en3")
nlp_ab.add_pipe("benepar", config={"model": "benepar_en3"})

def plot_dependency_tree_ab(doc_ab):
    print("\nDependency Parsing Tree (Text Representation):")
    for token_ab in doc_ab:
        print(f"{token_ab.text} --({token_ab.dep_} → {spacy.explain(token_ab.dep_)})--> {token_ab.head.text}")
    displacy.render(doc_ab, style="dep", jupyter=True, options={"compact": True, "distance": 90})

def plot_constituency_tree_ab(doc_ab):
    for sent_ab in doc_ab.sents:
        print("\nConstituency Parsing Tree (Text Representation):")
        print(sent_ab._.parse_string)
        tree_ab = Tree.fromstring(sent_ab._.parse_string)
        tree_ab.pretty_print()

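# Example: take the first cleaned detail from the cleaned CSV
# (df_pos_ab was not defined earlier in the notebook, so load it here)
df_pos_ab = pd.read_csv("/content/Narrators_Information_Cleaned.csv")
sentence_ab = str(df_pos_ab["CleanedDetails"].iloc[0])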
doc_ab = nlp_ab(sentence_ab)

print(f"Sentence: {sentence_ab}")

plot_dependency_tree_ab(doc_ab)
plot_constituency_tree_ab(doc_ab)


[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Sentence: nisei femal born may selleck washington spent much childhood beaverton oregon father own farm influenc earli

Dependency Parsing Tree (Text Representation):
nisei --(compound → compound)--> femal
femal --(nsubj → nominal subject)--> spent
born --(acl → clausal modifier of noun (adjectival clause))--> femal
may --(aux → auxiliary)--> spent
selleck --(compound → compound)--> washington
washington --(nsubj → nominal subject)--> spent
spent --(ROOT → root)--> spent
much --(amod → adjectival modifier)--> beaverton
childhood --(compound → compound)--> beaverton
beaverton --(compound → compound)--> father
oregon --(compound → compound)--> father
father --(compound → compound)--> earli
own --(amod → adjectival modifier)--> earli
farm --(compound → compound)--> influenc
influenc --(compound → compound)--> earli
earli --(dobj → direct object)--> spent
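
One way to read the dependency output above: each line is an edge from a token to its head, and the token labelled ROOT points to itself. The sketch below regroups those edges by head so the structure is easier to inspect. The triples are copied from the output above; keying by token text only works here because no word repeats in this sentence, and `group_dependents` is an illustrative helper name.

```python
from collections import defaultdict

# (token, relation, head) triples copied from the dependency output above.
EDGES = [
    ("nisei", "compound", "femal"),
    ("femal", "nsubj", "spent"),
    ("born", "acl", "femal"),
    ("may", "aux", "spent"),
    ("selleck", "compound", "washington"),
    ("washington", "nsubj", "spent"),
    ("spent", "ROOT", "spent"),
    ("much", "amod", "beaverton"),
    ("childhood", "compound", "beaverton"),
    ("beaverton", "compound", "father"),
    ("oregon", "compound", "father"),
    ("father", "compound", "earli"),
    ("own", "amod", "earli"),
    ("farm", "compound", "influenc"),
    ("influenc", "compound", "earli"),
    ("earli", "dobj", "spent"),
]


def group_dependents(edges):
    """Return the root token and a head -> [(relation, dependent)] mapping."""
    root = None
    dependents = defaultdict(list)
    for token, relation, head in edges:
        if relation == "ROOT":
            root = token
        else:
            dependents[head].append((relation, token))
    return root, dict(dependents)


root, dependents = group_dependents(EDGES)
print("root:", root)
for relation, token in dependents[root]:
    print(f"  {relation}: {token}")
```

The root verb `spent` has two `nsubj` dependents and a direct object that is one long chain of `compound` links. That shape is likely a side effect of parsing stemmed, stopword-free text; running the parser on the raw `Details` column would probably give more meaningful trees.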





Constituency Parsing Tree (Text Representation):
(S (NP (NP (FW nisei) (NN femal)) (VP (VP (VBN born) (MD may) (RB selleck)) (NNP washington))) (VP (VBD spent) (NP (JJ much) (NN childhood)) (NP (NP (UCP (NN beaverton) (FW oregon)) (NN father)) (JJ own) (NN farm) (FW influenc) (FW earli))))
                                                       S                                                                    
                  _____________________________________|_________________________________                                    
                 |                                                                       VP                                 
                 |                                 ______________________________________|____________                       
                 NP                               |         |                                         NP                    
        _________|________                        |         |                    
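
The bracketed constituency string above can also be read without any parser library: every `(LABEL ...)` group is a phrase, and the innermost groups pair a POS tag with a word. The following is a minimal sketch that uses only the standard library; `read_tree` and `leaves` are illustrative helpers, and the string is copied from the output above.

```python
# Bracketed parse copied from the constituency output above.
PARSE = (
    "(S (NP (NP (FW nisei) (NN femal)) "
    "(VP (VP (VBN born) (MD may) (RB selleck)) (NNP washington))) "
    "(VP (VBD spent) (NP (JJ much) (NN childhood)) "
    "(NP (NP (UCP (NN beaverton) (FW oregon)) (NN father)) "
    "(JJ own) (NN farm) (FW influenc) (FW earli))))"
)


def read_tree(bracketed):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                  # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word (leaf)
                position += 1
        position += 1                  # skip ")"
        return label, children

    return read_node()


def leaves(node):
    """Collect the words of a (label, children) tree from left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = read_tree(PARSE)
top_label, top_children = tree
print(top_label, "->", [child[0] for child in top_children])
print(" ".join(leaves(tree)))
```

At the top level the sentence `S` splits into a subject noun phrase `NP` and a verb phrase `VP`, and reading the leaves left to right reproduces the input sentence.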

In [22]:
#3
import spacy
import pandas as pd
from collections import Counter

# Load a small English model from spaCy
nlp_ab = spacy.load("en_core_web_sm")

# Read the preprocessed narrator information
df_ner_ab = pd.read_csv("/content/Narrators_Information_Cleaned.csv")

# Create a counter for entity frequencies and a dictionary for storing sample entities
entity_counter_ab = Counter()
entity_examples_ab = {"PERSON": [], "ORG": [], "GPE": [], "PRODUCT": [], "DATE": []}

# Loop through each record in the CleanedDetails column and perform NER
for detail_ab in df_ner_ab["CleanedDetails"]:
    doc_ab = nlp_ab(str(detail_ab))
    for ent_ab in doc_ab.ents:
        if ent_ab.label_ in entity_examples_ab:
            entity_counter_ab[ent_ab.label_] += 1
            # entity_examples_ab[ent_ab.label_].append(ent_ab.text)  # optional storage

# Display the overall counts of recognized entity types
print("Named Entity Recognition Results:")
for entity_ab, count_ab in entity_counter_ab.items():
    print(f"{entity_ab}: {count_ab} Entities")


Named Entity Recognition Results:
PERSON: 1200 Entities
GPE: 1402 Entities
DATE: 228 Entities
ORG: 205 Entities


# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products in total), handling pagination through the page number in the URL. Extract the product name, a short description, and the URL of each action.

Store the extracted data in a structured CSV file with columns for product name, description, URL, and page number. Introduce a time delay between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports from the collected data.

The goal is to complete the scraping within a specified time limit while keeping the process efficient and compliant with GitHub's usage guidelines.
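
The delay and error handling described in Part 1 can be combined into a small retry helper with exponential backoff. This is only a sketch under stated assumptions: `fetch_with_retry` is a hypothetical helper, and `fetch` is any callable returning a `(status_code, body)` pair, so the example runs without network access by using a fake fetcher.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    Waits base_delay, then 2 * base_delay, and so on between failed attempts.
    Returns the body on success and None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return None


# Demo with a fake fetcher: two failures, then a success.
responses = iter([(429, ""), (503, ""), (200, "<html>ok</html>")])
waits = []  # record the delays instead of actually sleeping
page = fetch_with_retry(
    lambda url: next(responses),
    "https://example.com/page/1",  # placeholder URL
    sleep=waits.append,
)
print(page, waits)
```

With the `requests` library, `fetch` could be a small wrapper that returns `(response.status_code, response.text)`; a retried request is usually preferable to abandoning the whole crawl on the first non-200 status.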

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
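
The checks described above (missing values, duplicates, and format validation) can be sketched on plain dictionaries before reaching for pandas. The rows below are hypothetical, apart from the Checkout entry, and `quality_report` is an illustrative helper name.

```python
import re

# Expected shape of a Marketplace action URL.
URL_PATTERN = re.compile(r"^https://github\.com/marketplace/actions/.+")


def quality_report(rows):
    """Count missing descriptions, invalid URLs and duplicate URLs in a list of dicts."""
    seen_urls = set()
    report = {"rows": len(rows), "missing_description": 0, "invalid_url": 0, "duplicate_url": 0}
    for row in rows:
        description = (row.get("Description") or "").strip()
        url = (row.get("URL") or "").strip()
        if description in ("", "N/A"):
            report["missing_description"] += 1
        if not URL_PATTERN.match(url):
            report["invalid_url"] += 1
        if url in seen_urls:
            report["duplicate_url"] += 1
        seen_urls.add(url)
    return report


# Hypothetical sample rows: one duplicate, one missing description, one bad URL.
rows = [
    {"Product Name": "Checkout", "Description": "Checkout a Git repository at a particular version",
     "URL": "https://github.com/marketplace/actions/checkout"},
    {"Product Name": "Checkout", "Description": "Checkout a Git repository at a particular version",
     "URL": "https://github.com/marketplace/actions/checkout"},
    {"Product Name": "Example", "Description": "N/A",
     "URL": "https://github.com/marketplace/actions/example-action"},
    {"Product Name": "Broken", "Description": "Row with a bad link",
     "URL": "https://example.com/not-an-action"},
]
report = quality_report(rows)
print(report)
```

Reporting the counts before dropping or filling anything makes it clear how much of the dataset each cleaning decision affects.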


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [40]:
import requests, time, random, re
from bs4 import BeautifulSoup
import pandas as pd

BASE_ab = "https://github.com"
INDEX_ab = BASE_ab + "/marketplace?type=actions&page={}&sort=popularity"
HEADERS_ab = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
}

rows_ab = []
seen_ab = set()
max_pages_ab = 600
target_min_ab = 1000
consec_empty_ab = 0

for page_ab in range(1, max_pages_ab + 1):
    url_ab = INDEX_ab.format(page_ab)
    r_ab = requests.get(url_ab, headers=HEADERS_ab, timeout=30)
    if r_ab.status_code != 200:
        print(f"⚠️ Page {page_ab} status {r_ab.status_code}; stopping.")
        break

    soup_ab = BeautifulSoup(r_ab.text, "html.parser")
    links_ab = soup_ab.select('a[href^="/marketplace/actions/"]')

    hrefs_ab = []
    for a_ab in links_ab:
        href_ab = a_ab.get("href", "")
        if re.fullmatch(r"/marketplace/actions/[a-zA-Z0-9\-._]+", href_ab):
            hrefs_ab.append(href_ab)
    hrefs_ab = list(dict.fromkeys(hrefs_ab))

    added_this_page_ab = 0
    for href_ab in hrefs_ab:
        full_ab = BASE_ab + href_ab
        if full_ab in seen_ab:
            continue
        a_ab = soup_ab.find("a", href=href_ab)
        title_ab = a_ab.get_text(strip=True) if a_ab else ""
        desc_ab = ""
        if a_ab:
            h3_ab = a_ab.find_parent("h3")
            if h3_ab:
                sib_p_ab = h3_ab.find_next_sibling("p")
                if sib_p_ab:
                    desc_ab = sib_p_ab.get_text(" ", strip=True)
                else:
                    p_ab = h3_ab.find_next("p")
                    if p_ab:
                        desc_ab = p_ab.get_text(" ", strip=True)

        rows_ab.append(
            {
                "Product Name": title_ab or "N/A",
                "Description": desc_ab or "N/A",
                "URL": full_ab,
                "Page": page_ab,
            }
        )
        seen_ab.add(full_ab)
        added_this_page_ab += 1

    if added_this_page_ab == 0:
        consec_empty_ab += 1
    else:
        consec_empty_ab = 0

    print(f" Page {page_ab}: {added_this_page_ab} items (total {len(seen_ab)})")
    time.sleep(random.uniform(1.2, 2.5))

    if len(seen_ab) >= target_min_ab:
        print(" Reached target minimum items.")
        break
    if consec_empty_ab >= 2:
        print(" Two consecutive empty pages; stopping.")
        break

df_git_ab = pd.DataFrame(rows_ab).drop_duplicates(subset=["URL"]).reset_index(drop=True)
df_git_ab.to_csv("GitHub_Actions.csv", index=False, encoding="utf-8")
print(f" Saved {len(df_git_ab)} products to GitHub_Actions.csv")
display(df_git_ab.head(10))


 Page 1: 20 items (total 20)
 Page 2: 20 items (total 40)
 Page 3: 20 items (total 60)
 Page 4: 0 items (total 60)
 Page 5: 20 items (total 80)
 Page 6: 20 items (total 100)
 Page 7: 0 items (total 100)
 Page 8: 20 items (total 120)
 Page 9: 20 items (total 140)
 Page 10: 20 items (total 160)
 Page 11: 0 items (total 160)
 Page 12: 20 items (total 180)
 Page 13: 20 items (total 200)
 Page 14: 0 items (total 200)
 Page 15: 20 items (total 220)
 Page 16: 20 items (total 240)
 Page 17: 20 items (total 260)
 Page 18: 0 items (total 260)
 Page 19: 20 items (total 280)
 Page 20: 20 items (total 300)
 Page 21: 20 items (total 320)
 Page 22: 20 items (total 340)
 Page 23: 19 items (total 359)
 Page 24: 20 items (total 379)
 Page 25: 20 items (total 399)
 Page 26: 20 items (total 419)
 Page 27: 20 items (total 439)
 Page 28: 20 items (total 459)
 Page 29: 20 items (total 479)
 Page 30: 20 items (total 499)
 Page 31: 20 items (total 519)
 Page 32: 20 items (total 539)
 Page 33: 20 items (total 5

| | Product Name | Description | URL | Page |
|---|---|---|---|---|
| 0 | TruffleHog OSS | Find and verify leaked credentials in your sou... | https://github.com/marketplace/actions/truffle... | 1 |
| 1 | Metrics embed | An infographics generator with 40+ plugins and... | https://github.com/marketplace/actions/metrics... | 1 |
| 2 | yq - portable yaml processor | create, read, update, delete, merge, validate ... | https://github.com/marketplace/actions/yq-port... | 1 |
| 3 | Super-Linter | Super-linter is a ready-to-run collection of l... | https://github.com/marketplace/actions/super-l... | 1 |
| 4 | Gosec Security Checker | Runs the gosec security checker | https://github.com/marketplace/actions/gosec-s... | 1 |
| 5 | Rebuild Armbian and Kernel | Support Amlogic, Rockchip and Allwinner boxes | https://github.com/marketplace/actions/rebuild... | 1 |
| 6 | Checkout | Checkout a Git repository at a particular version | https://github.com/marketplace/actions/checkout | 1 |
| 7 | OpenCommit — improve commits with AI 🧙 | Replaces lame commit messages with meaningful ... | https://github.com/marketplace/actions/opencom... | 1 |
| 8 | SSH Remote Commands | Executing remote ssh commands | https://github.com/marketplace/actions/ssh-rem... | 1 |
| 9 | generate-snake-game-from-github-contribution-grid | Generates a snake game from a github user cont... | https://github.com/marketplace/actions/generat... | 1 |


In [42]:
import pandas as pd
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

INPUT_CSV_ab = "GitHub_Actions.csv"
df_ab = pd.read_csv(INPUT_CSV_ab)

print(f"Loaded {len(df_ab)} rows from {INPUT_CSV_ab}")

df_ab = df_ab.fillna("")
for c_ab in ["Product Name", "Description", "URL"]:
    df_ab[c_ab] = df_ab[c_ab].astype(str).str.strip()

before_dedup_ab = len(df_ab)
df_ab = df_ab.drop_duplicates(subset=["URL"]).reset_index(drop=True)
print(f"Deduplicated by URL: {before_dedup_ab} -> {len(df_ab)}")

stop_words_ab = set(stopwords.words("english"))
lemm_ab = WordNetLemmatizer()

def clean_text_ab(text_ab: str) -> str:
    text_ab = re.sub(r"<.*?>", " ", text_ab)
    text_ab = re.sub(r"&[a-z]+;", " ", text_ab)
    text_ab = re.sub(r"[^a-zA-Z\s]", " ", text_ab)
    text_ab = re.sub(r"\s+", " ", text_ab).strip().lower()
    toks_ab = nltk.word_tokenize(text_ab)
    toks_ab = [t_ab for t_ab in toks_ab if t_ab not in stop_words_ab]
    toks_ab = [lemm_ab.lemmatize(t_ab) for t_ab in toks_ab]
    return " ".join(toks_ab)

df_ab["Cleaned_Name"] = df_ab["Product Name"].map(clean_text_ab)
df_ab["Cleaned_Description"] = df_ab["Description"].map(clean_text_ab)

df_ab["Valid_URL"] = df_ab["URL"].str.contains(
    r"^https://github\.com/marketplace/actions/.*", regex=True, na=False
)

df_ab["Missing_Desc"] = df_ab["Description"].eq("") | (df_ab["Description"].str.len() < 3) | df_ab["Description"].eq("N/A")

n_urls_invalid_ab = (~df_ab["Valid_URL"]).sum()
n_missing_desc_ab = df_ab["Missing_Desc"].sum()
print(f"Invalid URL pattern: {n_urls_invalid_ab} rows")
print(f"Missing/weak descriptions: {n_missing_desc_ab} rows")

FULL_OUT_ab = "GitHub_Actions_Cleaned.csv"
df_ab.to_csv(FULL_OUT_ab, index=False, encoding="utf-8")
print(f" Saved full cleaned data with flags to {FULL_OUT_ab} (rows: {len(df_ab)})")

subset_ab = df_ab[df_ab["Valid_URL"]].reset_index(drop=True)
SUB_OUT_ab = "GitHub_Actions_Cleaned_ValidOnly.csv"
subset_ab.to_csv(SUB_OUT_ab, index=False, encoding="utf-8")
print(f" Saved valid-URL subset to {SUB_OUT_ab} (rows: {len(subset_ab)})")

display(df_ab.head(10))


Loaded 1014 rows from GitHub_Actions.csv
Deduplicated by URL: 1014 -> 1014
Invalid URL pattern: 0 rows
Missing/weak descriptions: 0 rows
 Saved full cleaned data with flags to GitHub_Actions_Cleaned.csv (rows: 1014)
 Saved valid-URL subset to GitHub_Actions_Cleaned_ValidOnly.csv (rows: 1014)


| | Product Name | Description | URL | Page | Cleaned_Name | Cleaned_Description | Valid_URL | Missing_Desc |
|---|---|---|---|---|---|---|---|---|
| 0 | TruffleHog OSS | Find and verify leaked credentials in your sou... | https://github.com/marketplace/actions/truffle... | 1 | trufflehog os | find verify leaked credential source code | True | False |
| 1 | Metrics embed | An infographics generator with 40+ plugins and... | https://github.com/marketplace/actions/metrics... | 1 | metric embed | infographics generator plugins option display ... | True | False |
| 2 | yq - portable yaml processor | create, read, update, delete, merge, validate ... | https://github.com/marketplace/actions/yq-port... | 1 | yq portable yaml processor | create read update delete merge validate yaml | True | False |
| 3 | Super-Linter | Super-linter is a ready-to-run collection of l... | https://github.com/marketplace/actions/super-l... | 1 | super linter | super linter ready run collection linters code... | True | False |
| 4 | Gosec Security Checker | Runs the gosec security checker | https://github.com/marketplace/actions/gosec-s... | 1 | gosec security checker | run gosec security checker | True | False |
| 5 | Rebuild Armbian and Kernel | Support Amlogic, Rockchip and Allwinner boxes | https://github.com/marketplace/actions/rebuild... | 1 | rebuild armbian kernel | support amlogic rockchip allwinner box | True | False |
| 6 | Checkout | Checkout a Git repository at a particular version | https://github.com/marketplace/actions/checkout | 1 | checkout | checkout git repository particular version | True | False |
| 7 | OpenCommit — improve commits with AI 🧙 | Replaces lame commit messages with meaningful ... | https://github.com/marketplace/actions/opencom... | 1 | opencommit improve commits ai | replaces lame commit message meaningful ai gen... | True | False |
| 8 | SSH Remote Commands | Executing remote ssh commands | https://github.com/marketplace/actions/ssh-rem... | 1 | ssh remote command | executing remote ssh command | True | False |
| 9 | generate-snake-game-from-github-contribution-grid | Generates a snake game from a github user cont... | https://github.com/marketplace/actions/generat... | 1 | generate snake game github contribution grid | generates snake game github user contribution ... | True | False |


# Question 5 (20 points)

PART 1:
Collect tweets from Twitter using the Tweepy API, targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

PART 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.
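
Tweet text carries noise that ordinary prose does not: links, @handles, and hashtags. A minimal, standard-library sketch of stripping that noise before tokenization is shown below. The sample tweet is partly hypothetical (the URL and handle are made up), and `strip_tweet_noise` is an illustrative helper name.

```python
import re


def strip_tweet_noise(text):
    """Lowercase a tweet and remove URLs, @handles, #hashtags and non-letters."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)           # @handles and #hashtags
    text = re.sub(r"[^a-z\s]", " ", text)          # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()       # collapse runs of whitespace


# Sample tweet; the link and the handle are invented for illustration.
sample = "It is #AI.\n\nIt is not real. See https://t.co/xyz via @someone"
cleaned = strip_tweet_noise(sample)
print(cleaned)
```

Removing the whole hashtag also removes its word (here `AI`), which can discard the topic of the tweet. Stripping only the `#` character is an alternative when the hashtag text should be kept.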


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [50]:
#1
import pandas as pd
import tweepy  # required for OAuth1UserHandler, API and Client below

import os

# === Your credentials ===
# Read secrets from environment variables instead of hard-coding them in the notebook.
# The variable names below are assumptions; set them to match your own setup.
api_key_ab = os.environ["TWITTER_API_KEY"]
api_key_secret_ab = os.environ["TWITTER_API_KEY_SECRET"]
access_token_ab = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret_ab = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Bearer token for the v2 client
bearer_token_ab = os.environ["TWITTER_BEARER_TOKEN"]

# OAuth 1 for legacy support
auth_ab = tweepy.OAuth1UserHandler(
    consumer_key=api_key_ab,
    consumer_secret=api_key_secret_ab,
    access_token=access_token_ab,
    access_token_secret=access_token_secret_ab
)
api_ab = tweepy.API(auth_ab)

# v2 client
client_ab = tweepy.Client(bearer_token=bearer_token_ab, wait_on_rate_limit=True)

# Assignment: target hashtags (only 1 request, max 100 tweets)
query_ab = '(#machinelearning OR #artificialintelligence OR #generativeAI) -is:retweet lang:en'
resp_ab = client_ab.search_recent_tweets(
    query=query_ab,
    max_results=100,
    tweet_fields=["created_at", "text", "author_id"],
    expansions=["author_id"],
    user_fields=["username"]
)

# Map author_id -> username
user_lookup_ab = {}
if resp_ab.includes and "users" in resp_ab.includes:
    for u in resp_ab.includes["users"]:
        user_lookup_ab[str(u.id)] = u.username

# Store required fields
tweetDict_ab = {'tweet_id': [], 'username': [], 'tweet_time': [], 'tweetText': []}
if resp_ab.data:
    for t in resp_ab.data:
        tweetDict_ab['tweet_id'].append(t.id)
        tweetDict_ab['username'].append(user_lookup_ab.get(str(t.author_id), ""))
        tweetDict_ab['tweet_time'].append(t.created_at)
        tweetDict_ab['tweetText'].append(t.text)

# DataFrame
dfTweets_ab = pd.DataFrame(tweetDict_ab)
print(dfTweets_ab.head())

# Save raw
dfTweets_ab.to_csv("Generative_AI_Tweets_ab.csv", index=False)
print(f"Saved {len(dfTweets_ab)} rows to Generative_AI_Tweets_ab.csv")


              tweet_id        username                tweet_time  \
0  1972900178131898867  NexusLinkIndia 2025-09-30 05:43:51+00:00   
1  1972899833297224046    AmarMohanJha 2025-09-30 05:42:29+00:00   
2  1972899726078149050     SteveKlinko 2025-09-30 05:42:04+00:00   
3  1972899506179145860         monjere 2025-09-30 05:41:11+00:00   
4  1972898594391794078   sakshibhavita 2025-09-30 05:37:34+00:00   

                                           tweetText  
0  Machine Learning is transforming the way busin...  
1  I'm committed to growing my career in AI/ML, b...  
2  Please go visit: https://t.co/b27KFRj08m (Musi...  
3  It is #AI.\n\nIt is not real.\n\nLooks real to...  
4  Job Opportunity at FACE Prep | Salary Rs 6.0 L...  
Saved 100 rows to Generative_AI_Tweets_ab.csv


In [51]:
import pandas as pd
import nltk, re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

df_Tweet_ab = pd.read_csv('Generative_AI_Tweets_ab.csv')

# NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

lemmatizer_ab = WordNetLemmatizer()
stop_words_ab = set(stopwords.words('english'))

def preprocess_text_1_ab(text_ab):
    if not isinstance(text_ab, str):
        text_ab = str(text_ab)
    text_ab = text_ab.lower()
    text_ab = re.sub(r'http\S+|www\.\S+', ' ', text_ab)   # URLs
    text_ab = re.sub(r'[@#]\w+', ' ', text_ab)            # @handles, #hashtags
    text_ab = re.sub(r'[^a-zA-Z\s]', ' ', text_ab)        # keep letters only
    text_ab = re.sub(r'\s+', ' ', text_ab).strip()
    toks_ab = word_tokenize(text_ab)
    toks_ab = [w for w in toks_ab if w not in stop_words_ab]
    toks_ab = [lemmatizer_ab.lemmatize(w) for w in toks_ab]
    return ' '.join(toks_ab)

# Clean tweets
df_Tweet_ab['tweetText'] = df_Tweet_ab['tweetText'].apply(preprocess_text_1_ab)

# Data quality: missing values + duplicates
missing_data_ab = df_Tweet_ab.isnull().sum()
before_ab = len(df_Tweet_ab)
df_Tweet_ab = df_Tweet_ab.drop_duplicates(subset=['tweet_id'])
df_Tweet_ab = df_Tweet_ab.drop_duplicates(subset=['username', 'tweetText'])
after_ab = len(df_Tweet_ab)

if missing_data_ab.any():
    print("There is missing data; filling defaults.")
    df_Tweet_ab = df_Tweet_ab.fillna('unknown')
else:
    print("No missing data.")

# Save cleaned
df_Tweet_ab.to_csv('cleaned_Generative_AI_Tweets_ab.csv', index=False)
print(f"Cleaned rows: {after_ab} (from {before_ab}). Saved to cleaned_Generative_AI_Tweets_ab.csv")
print("\nCleaned Data Sample:")
print(df_Tweet_ab.head())


No missing data.
Cleaned rows: 97 (from 100). Saved to cleaned_Generative_AI_Tweets_ab.csv

Cleaned Data Sample:
              tweet_id        username                 tweet_time  \
0  1972900178131898867  NexusLinkIndia  2025-09-30 05:43:51+00:00   
1  1972899833297224046    AmarMohanJha  2025-09-30 05:42:29+00:00   
2  1972899726078149050     SteveKlinko  2025-09-30 05:42:04+00:00   
3  1972899506179145860         monjere  2025-09-30 05:41:11+00:00   
4  1972898594391794078   sakshibhavita  2025-09-30 05:37:34+00:00   

                                           tweetText  
0  machine learning transforming way business ope...  
1  committed growing career ai ml finding right l...  
2  please go visit music video find understanding...  
3  real look real told using would know introduci...  
4             job opportunity face prep salary r lpa  


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

I think this assignment was useful because it taught me how to collect, clean, and analyze real text data. The most difficult part was getting the scraping and API setup to work properly, since small errors could stop the code from running. I enjoyed the text cleaning part, where I could clearly see messy text turn into clean data that was easier to work with. The time given for the assignment was enough, and I was able to finish it step by step without feeling rushed. Overall, it gave me good practice with real-world data tasks.

# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog