
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions are penalized 10% for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software products from G2 and Capterra.


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# IMDb review page for the new movie
url_reviews = "https://www.imdb.com/title/tt6263850/reviews/"
headers_req = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

# Initialize storage for reviews and a counter for pages processed
reviews_collected = []
page_count = 1

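# NOTE: every iteration requests the same url_reviews, so the same reviews are
# collected repeatedly; real pagination would need a new URL or page token per iteration.
while len(reviews_collected) < 1000:
    try:
        response = requests.get(url_reviews, headers=headers_req, timeout=30)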
        if response.status_code != 200:
            print(f"Error on page {page_count}: Status {response.status_code}")
            break
        soup = BeautifulSoup(response.text, 'html.parser')
        review_blocks = soup.find_all('div', class_='sc-8c7aa573-5 gBEznl')
        if not review_blocks:
            print("No review blocks detected; the IMDb structure might have changed.")
            break

        for block in review_blocks:
            try:
                header = block.find("h3", class_="ipc-title__text")
                rev_title = header.get_text(strip=True) if header else "No Title"
                body = block.find("div", class_="ipc-html-content-inner-div")
                rev_text = body.get_text(strip=True) if body else "No Text Available"
                if rev_text == "No Text Available":
                    alt_body = block.find_next_sibling("div", {"data-testid": "review-overflow"})
                    rev_text = alt_body.get_text(strip=True) if alt_body else "No Text"
                reviews_collected.append({
                    "Review Title": rev_title,
                    "Review Text": rev_text
                })
                if len(reviews_collected) >= 1000:
                    break
            except Exception as inner_err:
                print("Skipping one review due to:", inner_err)
        page_count += 1
    except Exception as outer_err:
        print("Error on page", page_count, ":", outer_err)
        break  # stop instead of retrying the same failing request indefinitely
    time.sleep(2)

reviews_df = pd.DataFrame(reviews_collected)
reviews_df.to_csv("assignment2_reviews.csv", index=False)
reviews_df.head(5)


Unnamed: 0,Review Title,Review Text
0,Till you're 90 Wolverine!,Hugh Jackman is the perfect Wolverine. What a ...
1,Weak story but a fun ride.,No Text
2,"""You were always the wrong one, till you weren...",What a crazy blast ! Bonkers !!Sooo !...\nWhat...
3,Easter Egg Heaven,"So many Easter Eggs, so true to the comic char..."
4,Meh:(,No Text


In [None]:
print(reviews_df.shape)

(1000, 2)
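
Because the loop above requests the same `url_reviews` on every iteration, the 1,000 rows may contain repeated reviews. A minimal sketch of how exact repeats could be dropped before saving, assuming each review is a dict with the `Review Title` and `Review Text` keys used above. `dedupe_reviews` is a hypothetical helper, not part of the original notebook, and the sample rows are toy data.

```python
def dedupe_reviews(reviews):
    """Return reviews with exact (title, text) repeats removed, keeping first occurrences."""
    seen = set()
    unique = []
    for review in reviews:
        key = (review["Review Title"], review["Review Text"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


# Toy data standing in for two fetches of the same page
sample = [
    {"Review Title": "Easter Egg Heaven", "Review Text": "So many Easter Eggs"},
    {"Review Title": "Meh:(", "Review Text": "No Text"},
] * 2

unique_reviews = dedupe_reviews(sample)
print(len(sample), "collected,", len(unique_reviews), "unique")
```

Comparing `len(reviews_collected)` with the deduplicated length is a quick way to check whether the 1,000-review target was actually met.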


# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download required NLTK datasets (if not already available)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:

#Data load
assignment2_df = pd.read_csv("assignment2_reviews.csv")
print("Original Sample of 'Review Text':")
print(assignment2_df['Review Text'].head(), "\n")

Original Sample of 'Review Text':
0    Hugh Jackman is the perfect Wolverine. What a ...
1                                              No Text
2    What a crazy blast ! Bonkers !!Sooo !...\nWhat...
3    So many Easter Eggs, so true to the comic char...
4                                              No Text
Name: Review Text, dtype: object 



In [None]:
# (1) Eliminate unwanted symbols, punctuation, and non-alphanumeric noise
assignment2_df['clean1'] = assignment2_df['Review Text'].apply(
    lambda txt: re.sub(r'[^\w\s]', '', txt))
print("After Noise Removal (Special Characters & Punctuation):")
print(assignment2_df['clean1'].head(), "\n")



After Noise Removal (Special Characters & Punctuation):
0    Hugh Jackman is the perfect Wolverine What a f...
1                                              No Text
2    What a crazy blast  Bonkers Sooo \nWhat I can ...
3    So many Easter Eggs so true to the comic chara...
4                                              No Text
Name: clean1, dtype: object 



In [None]:
# (2) Remove any numeric digits from the text
assignment2_df['clean2'] = assignment2_df['clean1'].apply(
    lambda txt: re.sub(r'\d+', '', txt))
print("After Number Elimination:")
print(assignment2_df['clean2'].head(), "\n")



After Number Elimination:
0    Hugh Jackman is the perfect Wolverine What a f...
1                                              No Text
2    What a crazy blast  Bonkers Sooo \nWhat I can ...
3    So many Easter Eggs so true to the comic chara...
4                                              No Text
Name: clean2, dtype: object 



In [None]:
# (3) Filter out stopwords using NLTK's stopwords list
eng_stopwords = set(stopwords.words('english'))
assignment2_df['clean3'] = assignment2_df['clean2'].apply(
    lambda txt: " ".join([word for word in txt.split() if word.lower() not in eng_stopwords]))
print("After Stopword Removal:")
print(assignment2_df['clean3'].head(), "\n")



After Stopword Removal:
0    Hugh Jackman perfect Wolverine fun movie like ...
1                                                 Text
2    crazy blast Bonkers Sooo say movie whole team ...
3    many Easter Eggs true comic characters may pos...
4                                                 Text
Name: clean3, dtype: object 
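
The `Text` rows in the output above come from the scraper's `No Text` placeholder: `no` is a stopword, so only `Text` survives. A hedged sketch of one way to blank out placeholders before cleaning, assuming the two placeholder strings written by the scraper in Question 1. `strip_placeholder` is a hypothetical helper.

```python
# Placeholder strings written by the scraper when no review body was found
PLACEHOLDERS = {"No Text", "No Text Available"}


def strip_placeholder(text):
    """Return an empty string for placeholder reviews, otherwise the text unchanged."""
    return "" if text.strip() in PLACEHOLDERS else text


raw_texts = ["Hugh Jackman is the perfect Wolverine.", "No Text"]
prepared = [strip_placeholder(t) for t in raw_texts]
print(prepared)
```

With pandas, this could be applied as `assignment2_df['Review Text'].apply(strip_placeholder)` before step (1), so that placeholder rows end up empty instead of contributing the token `text`.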



In [None]:
# (4) Transform all text into lowercase format
assignment2_df['clean4'] = assignment2_df['clean3'].apply(lambda txt: txt.lower())
print("After Converting to Lowercase:")
print(assignment2_df['clean4'].head(), "\n")



After Converting to Lowercase:
0    hugh jackman perfect wolverine fun movie like ...
1                                                 text
2    crazy blast bonkers sooo say movie whole team ...
3    many easter eggs true comic characters may pos...
4                                                 text
Name: clean4, dtype: object 



In [None]:
# (5) Apply stemming to reduce words to their base forms using PorterStemmer
porter = PorterStemmer()
assignment2_df['clean5'] = assignment2_df['clean4'].apply(
    lambda txt: " ".join([porter.stem(word) for word in txt.split()]))
print("After Stemming:")
print(assignment2_df['clean5'].head(), "\n")



After Stemming:
0    hugh jackman perfect wolverin fun movi like di...
1                                                 text
2    crazi blast bonker sooo say movi whole team be...
3    mani easter egg true comic charact may possibl...
4                                                 text
Name: clean5, dtype: object 



In [None]:
# (6) Lemmatize with WordNetLemmatizer. The input is the already-stemmed text (clean5), so most tokens are no longer dictionary words and pass through unchanged.
lemmatizer = WordNetLemmatizer()
assignment2_df['Cleaned_Review'] = assignment2_df['clean5'].apply(
    lambda txt: " ".join([lemmatizer.lemmatize(word) for word in txt.split()]))
print("After Lemmatization (Final Cleaned Data):")
print(assignment2_df['Cleaned_Review'].head(), "\n")

# Save the DataFrame with the new cleaned text column back to a CSV file
assignment2_df.to_csv("assignment2_reviews_clean.csv", index=False)

After Lemmatization (Final Cleaned Data):
0    hugh jackman perfect wolverin fun movi like di...
1                                                 text
2    crazi blast bonker sooo say movi whole team be...
3    mani easter egg true comic charact may possibl...
4                                                 text
Name: Cleaned_Review, dtype: object 




# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
# Your code here
import pandas as pd
import nltk
import re
from collections import Counter

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
# Load the cleaned reviews dataset
data_path = "assignment2_reviews_clean.csv"
df_reviews = pd.read_csv(data_path)
print("Sample of Clean Reviews:")
print(df_reviews.head())

Sample of Clean Reviews:
                                        Review Title  \
0                          Till you're 90 Wolverine!   
1                         Weak story but a fun ride.   
2  "You were always the wrong one, till you weren...   
3                                  Easter Egg Heaven   
4                                              Meh:(   

                                         Review Text  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1                                            No Text   
2  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
3  So many Easter Eggs, so true to the comic char...   
4                                            No Text   

                                              clean1  \
0  Hugh Jackman is the perfect Wolverine What a f...   
1                                            No Text   
2  What a crazy blast  Bonkers Sooo \nWhat I can ...   
3  So many Easter Eggs so true to the comic chara...   
4                    

In [None]:
import pandas as pd
import nltk
import re
from collections import Counter

# Ensure necessary NLTK resources are available
try:
    nltk.word_tokenize("test")
except LookupError:
    nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load the cleaned review data
data_file = "assignment2_reviews_clean.csv"
df_clean = pd.read_csv(data_file)
print("Sample Clean Review Data:")
print(df_clean.head())

Sample Clean Review Data:
                                        Review Title  \
0                          Till you're 90 Wolverine!   
1                         Weak story but a fun ride.   
2  "You were always the wrong one, till you weren...   
3                                  Easter Egg Heaven   
4                                              Meh:(   

                                         Review Text  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1                                            No Text   
2  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
3  So many Easter Eggs, so true to the comic char...   
4                                            No Text   

                                              clean1  \
0  Hugh Jackman is the perfect Wolverine What a f...   
1                                            No Text   
2  What a crazy blast  Bonkers Sooo \nWhat I can ...   
3  So many Easter Eggs so true to the comic chara...   
4                   

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
nltk.download('averaged_perceptron_tagger_eng')
# ---------------------------------------------------
# (1) POS Tagging: Count Nouns, Verbs, Adjectives, Adverbs
# ---------------------------------------------------
def analyze_syntax(sentence):
    try:
        tokens = nltk.word_tokenize(sentence)
    except LookupError as e:
        if "punkt_tab" in str(e):
            nltk.download('punkt_tab')
        else:
            nltk.download('punkt')
        tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    counts = {"Noun": 0, "Verb": 0, "Adj": 0, "Adv": 0}
    for word, tag in tagged:
        if tag.startswith("NN"):
            counts["Noun"] += 1
        elif tag.startswith("VB"):
            counts["Verb"] += 1
        elif tag.startswith("JJ"):
            counts["Adj"] += 1
        elif tag.startswith("RB"):
            counts["Adv"] += 1
    return counts

# Apply POS analysis on each cleaned review and aggregate the counts
pos_analysis = df_clean["Cleaned_Review"].dropna().apply(analyze_syntax)
total_pos = Counter()
for pos_dict in pos_analysis:
    total_pos.update(pos_dict)
print("\nOverall Part-of-Speech Totals:")
print(total_pos)

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.



Overall Part-of-Speech Totals:
Counter({'Noun': 42252, 'Adj': 16362, 'Verb': 11368, 'Adv': 3999})
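
The prefix test in `analyze_syntax` relies on Penn Treebank tag names: noun tags start with `NN`, verb tags with `VB`, adjective tags with `JJ`, and adverb tags with `RB`. A small sketch of the same counting on hand-written `(word, tag)` pairs, so that it runs without any NLTK data. The tagged pairs are illustrative, not tagger output, and `count_coarse_pos` is a hypothetical helper.

```python
from collections import Counter

# Two-letter Penn Treebank prefixes mapped to the four coarse classes
PREFIX_TO_CLASS = {"NN": "Noun", "VB": "Verb", "JJ": "Adj", "RB": "Adv"}


def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in a list of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        coarse = PREFIX_TO_CLASS.get(tag[:2])
        if coarse:
            counts[coarse] += 1
    return counts


tagged = [
    ("movie", "NN"), ("was", "VBD"), ("really", "RB"),
    ("funny", "JJ"), ("actors", "NNS"), ("the", "DT"),
]
pos_counts = count_coarse_pos(tagged)
print(pos_counts)
```

Tags outside the four classes, such as `DT`, are simply skipped, which matches the behavior of the `if`/`elif` chain above.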


In [None]:




# ---------------------------------------------------
# (2) Constituency & Dependency Parsing (Simple Rule-based Example)
# ---------------------------------------------------
def parse_constituency(sentence):
    # A simplistic tree: if too short, group as one NP; else, split into NP, VP, NP
    words = sentence.split()
    if len(words) < 6:
        return "(S (NP " + " ".join(words) + "))"
    return f"(S (NP {words[0]} {words[1]}) (VP {words[2]} {words[3]}) (NP {' '.join(words[4:6])}))"

def parse_dependency(sentence):
    # Build a chain dependency: each word 'points' to the next
    tokens = sentence.split()
    dependencies = [(tokens[i], "->", tokens[i+1]) for i in range(len(tokens)-1)]
    return dependencies

# Demonstrate parsing on a sample sentence from the dataset
example_sentence = df_clean["Cleaned_Review"].dropna().iloc[0]
const_tree = parse_constituency(example_sentence)
dep_tree = parse_dependency(example_sentence)

print("\nExample Constituency Parse Tree:")
print(const_tree)
print("\nExample Dependency Parse Tree:")
print(dep_tree)
print("\nExplanation: The constituency tree represents the sentence in hierarchical parts (e.g., noun phrases and verb phrases), while the dependency tree illustrates direct links between consecutive words.")

# ---------------------------------------------------
# (3) Named Entity Recognition: Extract Entities
# ---------------------------------------------------
def extract_entities(text):
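    # Define patterns for entity types (these can be refined as needed).
    # NOTE: Cleaned_Review is stemmed and digit-free, so \d{4} never matches and
    # a name matches only when its stemmed form equals one of the listed names.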
    person_pat = r'\b(?:Alice|Bob|Charlie|Nolan|Murphy)\b'
    org_pat = r'\b(?:IMDB|WarnerBros|Universal|Sony|Disney)\b'
    loc_pat = r'\b(?:London|NewYork|Paris|Berlin|LosAngeles|USA)\b'
    date_pat = r'\b(?:\d{4}|January|February|March|April|May|June|July|August|September|October|November|December)\b'

    persons = re.findall(person_pat, text, flags=re.IGNORECASE)
    organizations = re.findall(org_pat, text, flags=re.IGNORECASE)
    locations = re.findall(loc_pat, text, flags=re.IGNORECASE)
    dates = re.findall(date_pat, text, flags=re.IGNORECASE)

    return {"Person": len(persons), "Organization": len(organizations), "Location": len(locations), "Date": len(dates)}

# Process each review and accumulate entity counts
ner_analysis = df_clean["Cleaned_Review"].dropna().apply(extract_entities)
total_entities = Counter()
for ner in ner_analysis:
    total_entities.update(ner)
print("\nNamed Entity Counts:")
print(total_entities)



Example Constituency Parse Tree:
(S (NP hugh jackman) (VP perfect wolverin) (NP fun movi))

Example Dependency Parse Tree:
[('hugh', '->', 'jackman'), ('jackman', '->', 'perfect'), ('perfect', '->', 'wolverin'), ('wolverin', '->', 'fun'), ('fun', '->', 'movi'), ('movi', '->', 'like'), ('like', '->', 'dialogu'), ('dialogu', '->', 'clever'), ('clever', '->', 'quip'), ('quip', '->', 'f'), ('f', '->', 'bomb'), ('bomb', '->', 'sprinkl'), ('sprinkl', '->', 'definit'), ('definit', '->', 'take'), ('take', '->', 'serious'), ('serious', '->', 'ton'), ('ton', '->', 'fun'), ('fun', '->', 'cameo'), ('cameo', '->', 'didnt'), ('didnt', '->', 'expect'), ('expect', '->', 'normal'), ('normal', '->', 'watch'), ('watch', '->', 'spoiler'), ('spoiler', '->', 'video'), ('video', '->', 'ahead'), ('ahead', '->', 'time'), ('time', '->', 'didnt'), ('didnt', '->', 'occas'), ('occas', '->', 'im'), ('im', '->', 'glad'), ('glad', '->', 'didnt'), ('didnt', '->', 'oh'), ('oh', '->', 'snap'), ('snap', '->', 'moment'),

In [None]:
!pip install benepar


In [None]:
import pandas as pd
import spacy
import benepar
from spacy import displacy
from nltk import Tree

nlp = spacy.load("en_core_web_sm")
benepar.download('benepar_en3')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
sample_sentence = df_reviews["Cleaned_Review"].dropna().iloc[0]  # Adjust as needed
doc = nlp(sample_sentence)
print(f"Sentence: {sample_sentence}")
print("\nDependency Parsing Tree (Text Representation):")
for token in doc:
    explanation = spacy.explain(token.dep_) or "no explanation"
    print(f"{token.text} --({token.dep_} → {explanation})--> {token.head.text}")

displacy.render(doc, style='dep', jupyter=True, options={'compact': True, 'distance': 90})
for sent in doc.sents:
    print("\nConstituency Parsing Tree (Text Representation):")
    print(sent._.parse_string)
    tree = Tree.fromstring(sent._.parse_string)
    tree.pretty_print()

print("\nExplanation: The constituency tree organizes the sentence into hierarchical segments (e.g., noun and verb phrases) based on learned structures, while the dependency tree shows direct relationships between words in the sentence.")


[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


Sentence: hugh jackman perfect wolverin fun movi like dialogu clever quip f bomb sprinkl definit take serious ton fun cameo didnt expect normal watch spoiler video ahead time didnt occas im glad didnt oh snap moment good action pack fun film break fox joke speak camera joke funni definit see sequel two horizon promot movi hard watch two hot one eat chicken wing make dynam duo wolverin lol

Dependency Parsing Tree (Text Representation):
hugh --(compound → compound)--> jackman
jackman --(nsubj → nominal subject)--> perfect
perfect --(ROOT → root)--> perfect
wolverin --(amod → adjectival modifier)--> fun
fun --(compound → compound)--> movi
movi --(dobj → direct object)--> perfect
like --(prep → prepositional modifier)--> movi
dialogu --(amod → adjectival modifier)--> quip
clever --(amod → adjectival modifier)--> quip
quip --(pobj → object of preposition)--> like
f --(compound → compound)--> bomb
bomb --(nsubj → nominal subject)--> sprinkl
sprinkl --(conj → conjunct)--> perfect
definit --(


Constituency Parsing Tree (Text Representation):
(S (NP (NN hugh) (FW jackman)) (JJ perfect) (FW wolverin) (NP (JJ fun) (FW movi)) (PP (IN like) (FW dialogu)) (JJ clever) (NN quip) (FW f) (NN bomb) (VBP sprinkl) (RB definit) (VP (VB take) (JJ serious) (NN ton) (NN fun)) (NN cameo) (VP (VBD did) (RB nt) (VP (VP (VB expect) (NP (NP (JJ normal) (VB watch) (NN spoiler) (NN video)) (ADVP (RB ahead) (NN time)))) (VP (VBD did) (RB nt) (VB occas) (PRP i) (RB m) (RB glad) (VP (VBD did) (RB nt))))))
                                                                                                                       S                                                                                                        
       ________________________________________________________________________________________________________________|______________________________________________________________                                           
      |             |       |          |             
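
To make the difference between the two outputs above concrete, here is a minimal sketch that re-reads a few of the printed rows by hand. The `arcs` list copies the first dependency rows shown above; the `parse` string is the start of the bracketed constituency output, shortened and closed at a balanced point for illustration. Nothing here calls spaCy or benepar.

```python
import re

# First rows of the dependency output above: (token, relation, head)
arcs = [
    ("hugh", "compound", "jackman"),
    ("jackman", "nsubj", "perfect"),
    ("perfect", "ROOT", "perfect"),
    ("fun", "compound", "movi"),
    ("movi", "dobj", "perfect"),
]

# Dependency view: every token has exactly one head; the root is listed as its own head.
root = next(tok for tok, rel, head in arcs if rel == "ROOT")
children_of_root = [tok for tok, rel, head in arcs if head == root and tok != root]

# Constituency view: words are leaves grouped under nested phrase labels (S, NP, ...).
parse = "(S (NP (NN hugh) (FW jackman)) (JJ perfect))"  # shortened for illustration
# A leaf is a "(TAG word)" pair with no nested bracket inside it
leaves = re.findall(r"\((\S+) ([^()\s]+)\)", parse)

print("root:", root, "| direct dependents:", children_of_root)
print("leaves:", leaves)
```

In the dependency view the structure is a set of word-to-word links that meet at `perfect`; in the constituency view the same words sit at the bottom of nested phrases, with `hugh jackman` grouped as an `NP`. Both parses are unreliable here because the input is stemmed text without punctuation.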

In [None]:
import re
from collections import Counter

def extract_named_entities(text):
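    # Extended patterns for various entity types.
    # NOTE: Cleaned_Review is stemmed and digit-free, so \d{4} never matches and
    # a name matches only when its stemmed form equals one of the listed names.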
    person_pattern = r'\b(?:Alice|Bob|Charlie|Nolan|Murphy|Tom|Scarlett|Hanks|Downey|DiCaprio)\b'
    org_pattern = r'\b(?:IMDB|Warner Bros|Universal|Sony|Disney|Netflix|HBO)\b'
    location_pattern = r'\b(?:London|New\sYork|Paris|Berlin|Los\sAngeles|USA|Toronto|Sydney)\b'
    date_pattern = r'\b(?:\d{4}|January|February|March|April|May|June|July|August|September|October|November|December)\b'
    # New entity categories: Product names (typically movie titles) and Awards
    product_pattern = r'\b(?:Inception|Interstellar|Dunkirk|Tenet|Oppenheimer|Avengers|Batman|Wolverine)\b'
    award_pattern = r'\b(?:Oscars|Academy\sAwards|Golden\sGlobes|BAFTA|Cannes)\b'

    persons = re.findall(person_pattern, text, flags=re.IGNORECASE)
    organizations = re.findall(org_pattern, text, flags=re.IGNORECASE)
    locations = re.findall(location_pattern, text, flags=re.IGNORECASE)
    dates = re.findall(date_pattern, text, flags=re.IGNORECASE)
    products = re.findall(product_pattern, text, flags=re.IGNORECASE)
    awards = re.findall(award_pattern, text, flags=re.IGNORECASE)

    return {
        "Person": len(persons),
        "Organization": len(organizations),
        "Location": len(locations),
        "Date": len(dates),
        "Product": len(products),
        "Award": len(awards)
    }

# Apply NER extraction on the cleaned reviews and sum the results
ner_results = df_clean["Cleaned_Review"].dropna().apply(extract_named_entities)
aggregated_ner = Counter()
for entity_counts in ner_results:
    aggregated_ner.update(entity_counts)
print("\nNamed Entity Counts:")
print(aggregated_ner)



Named Entity Counts:
Counter({'Date': 92, 'Organization': 91, 'Person': 0, 'Location': 0, 'Product': 0, 'Award': 0})
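
The zero counts above are what one would expect when these patterns are matched against `Cleaned_Review`: the text has been stemmed and stripped of digits, so `Wolverine` no longer matches `wolverin` and the `\d{4}` alternative can never fire, while a month name such as `may` is counted as a date even when it is an ordinary verb. A short sketch of these effects on two made-up strings (neither is taken from the dataset, and the patterns are cut down for illustration):

```python
import re

date_pattern = r'\b(?:\d{4}|May|March)\b'
product_pattern = r'\b(?:Wolverine)\b'

texts = {
    "cleaned": "wolverin may return sequel",   # stemmed, lowercased, digits removed
    "raw": "Wolverine may return in 2026",     # made-up raw sentence
}

results = {}
for label, text in texts.items():
    dates = re.findall(date_pattern, text, flags=re.IGNORECASE)
    products = re.findall(product_pattern, text, flags=re.IGNORECASE)
    results[label] = (dates, products)
    print(label, "dates:", dates, "products:", products)
```

On the cleaned string only the false positive `may` is found; on the raw string the product and the year are found as well. Running entity extraction on the `Review Text` column, or using a trained NER model, would be one way around this.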


# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products in total), handling pagination through the page number in each request. Extract the product name, a short description, and the URL of each product.

Store the extracted data in a CSV file with columns for product name, description, URL, and page number. Add a time delay between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports from the collected data.

Complete the scraping within the specified time limit, keeping the process efficient and compliant with GitHub’s usage guidelines.
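
The pagination described above comes down to simple arithmetic. A sketch, assuming the GitHub REST search endpoint is used rather than HTML scraping, with its documented limits of at most 100 results per page and at most 1,000 results per query. `pages_needed` is a hypothetical helper.

```python
import math

MAX_PER_PAGE = 100          # upper bound for the per_page parameter
MAX_SEARCH_RESULTS = 1000   # the search API returns at most 1,000 results per query


def pages_needed(target_items, per_page):
    """Number of page requests needed to reach target_items within the API limits."""
    per_page = min(per_page, MAX_PER_PAGE)
    reachable = min(target_items, MAX_SEARCH_RESULTS)
    return math.ceil(reachable / per_page)


print(pages_needed(1000, 30))    # 34
print(pages_needed(1000, 100))   # 10
```

With 30 results per page, 10 pages yield at most 300 items, so reaching 1,000 products requires either more pages or a larger page size.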

(PART-2)

1. **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
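
The completeness and duplicate checks described above can be sketched with plain dictionaries, one per CSV row. The rows below are toy data, the column names follow the CSV layout described for this question, and `quality_report` is a hypothetical helper that treats two rows as duplicates when all required columns are equal.

```python
def quality_report(rows, required):
    """Count rows, rows with an empty required field, and repeated rows."""
    missing = 0
    duplicates = 0
    seen = set()
    for row in rows:
        values = [(row.get(col) or "").strip() for col in required]
        if any(value == "" for value in values):
            missing += 1
        key = tuple(values)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


rows = [
    {"Product Name": "metrics", "Description": "infographics generator"},
    {"Product Name": "metrics", "Description": "infographics generator"},
    {"Product Name": "action-tmate", "Description": ""},
]
report = quality_report(rows, ["Product Name", "Description"])
print(report)
```

A report with non-zero `missing` or `duplicates` indicates rows to drop or repair before saving the cleaned file.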


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [None]:
import requests
import pandas as pd
import time

# API endpoint for searching repositories tagged as "github-action"
github_api_endpoint = "https://api.github.com/search/repositories"

# HTTP headers for the request
api_headers = {
    "Accept": "application/vnd.github.v3+json",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

def build_query_parameters(page_number, results_per_page=30):
    """
    Construct the query parameters for the GitHub API.
    """
    return {
        "q": "topic:github-action",
        "sort": "stars",
        "order": "desc",
        "per_page": results_per_page,
        "page": page_number
    }

def request_github_data(page_number, results_per_page=30):
    """
    Make an API request for a specific page and return the list of repository items.
    """
    params = build_query_parameters(page_number, results_per_page)
    response = requests.get(github_api_endpoint, headers=api_headers, params=params, timeout=30)
    if response.status_code != 200:
        print(f"Error retrieving page {page_number}: Status code {response.status_code}")
        return None
    json_data = response.json()
    return json_data.get("items", [])

def transform_repository_item(repo_item, current_page):
    """
    Transform a single repository item into the desired dictionary format.
    """
    return {
        "Product Name": repo_item.get("name"),
        "Description": repo_item.get("description") or "No Description Available",
        "URL": repo_item.get("html_url"),
        "Page Number": current_page
    }

def gather_github_actions_data(total_pages=5, results_per_page=30):
    """
    Loop through pages of results and aggregate repository data.
    """
    collected_data = []
    for page in range(1, total_pages + 1):
        repos = request_github_data(page, results_per_page)
        if not repos:
            break  # Stop if no items are returned
        for repo in repos:
            collected_data.append(transform_repository_item(repo, page))
        time.sleep(2)  # Delay to respect rate limits
    return collected_data

# Main execution block
if __name__ == "__main__":
    # Retrieve 10 pages of 30 results each (at most 300 repositories)
    github_actions_list = gather_github_actions_data(total_pages=10)

    # Create a DataFrame from the collected data and save as CSV
    df_github_actions = pd.DataFrame(github_actions_list)
    df_github_actions.to_csv("github_actions_api.csv", index=False)

    # Show the first 5 rows of the DataFrame
    print(df_github_actions.head(5))


                   Product Name  \
0                       metrics   
1    github-pages-deploy-action   
2                  action-tmate   
3  github-profile-summary-cards   
4           create-pull-request   

                                         Description  \
0  📊 An infographics generator with 30+ plugins a...   
1  🚀 Automatically deploy your project to GitHub ...   
2  Debug your GitHub Actions via SSH by using tma...   
3  A tool to generate your github summary card fo...   
4  A GitHub action to create a pull request for c...   

                                                 URL  Page Number  
0              https://github.com/lowlighter/metrics            1  
1  https://github.com/JamesIves/github-pages-depl...            1  
2          https://github.com/mxschmitt/action-tmate            1  
3  https://github.com/vn7n24fzkq/github-profile-s...            1  
4  https://github.com/peter-evans/create-pull-req...            1  


In [None]:
import pandas as pd
import re

# Load the raw dataset from CSV
df_raw = pd.read_csv("github_actions_api.csv")

# Eliminate duplicate entries
df_raw = df_raw.drop_duplicates()

# Remove rows missing either 'Product Name' or 'Description'
df_raw = df_raw.dropna(subset=['Product Name', 'Description'])

# Function to sanitize text by lowercasing, removing HTML tags, special chars, and extra whitespace
def sanitize_text(input_text):
    if isinstance(input_text, str):
        sanitized = input_text.lower()                                 # Convert to lowercase
        sanitized = re.sub(r'<.*?>', '', sanitized)                     # Strip HTML tags
        sanitized = re.sub(r'[^a-zA-Z0-9\s]', '', sanitized)            # Remove non-alphanumeric characters
        sanitized = re.sub(r'\s+', ' ', sanitized).strip()              # Clean up extra spaces
        return sanitized
    return ""

# Clean the 'Product Name' and 'Description' columns using the sanitize_text function
df_raw['Product Name'] = df_raw['Product Name'].apply(sanitize_text)
df_raw['Description'] = df_raw['Description'].apply(sanitize_text)

# Write the cleaned DataFrame to a new CSV file
df_raw.to_csv("github_actions_api_cleaned.csv", index=False)

# Confirm completion and display the first few rows of the cleaned data
print("Data cleaning complete. The cleaned dataset is saved as 'github_actions_api_cleaned.csv'.")
df_raw.head()


Data cleaning complete. The cleaned dataset is saved as 'github_actions_api_cleaned.csv'.


                Product Name  \
0                    metrics   
1    githubpagesdeployaction   
2                actiontmate   
3  githubprofilesummarycards   
4          createpullrequest   

                                         Description  \
0  an infographics generator with 30 plugins and ...   
1  automatically deploy your project to github pa...   
2  debug your github actions via ssh by using tma...   
3  a tool to generate your github summary card fo...   
4  a github action to create a pull request for c...   

                                                 URL  Page Number  
0              https://github.com/lowlighter/metrics            1  
1  https://github.com/JamesIves/github-pages-depl...            1  
2          https://github.com/mxschmitt/action-tmate            1  
3  https://github.com/vn7n24fzkq/github-profile-s...            1  
4  https://github.com/peter-evans/create-pull-req...            1  
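
The cleaned output above shows names such as `githubpagesdeployaction`: `sanitize_text` deletes hyphens outright, so hyphenated words are fused together. If word boundaries matter for later analysis, one option is to replace symbols with a space instead of an empty string. The sketch below is only an illustration of that idea; the function name `sanitize_keep_word_breaks` is hypothetical and not part of the original notebook.

```python
import re

def sanitize_keep_word_breaks(text):
    """Hypothetical variant of sanitize_text: symbols become spaces, so word breaks survive."""
    if not isinstance(text, str):
        return ""
    text = text.lower()                         # Convert to lowercase
    text = re.sub(r'<.*?>', ' ', text)          # Replace HTML tags with a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)    # Replace other symbols (hyphens, emoji, '+') with a space
    return re.sub(r'\s+', ' ', text).strip()    # Collapse runs of whitespace

print(sanitize_keep_word_breaks("github-pages-deploy-action"))
# prints: github pages deploy action
```

Whether to keep or drop the hyphens depends on how the names are used afterwards; for matching against the original repository names, the uncleaned column may be the better key.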


# Question 5 (20 points)

Part 1:
Collect tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures on the collected tweets.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved to a CSV file for further analysis.
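
As a rough illustration of what a completeness and consistency check might look like, the sketch below counts empty fields and repeated tweet IDs. It uses plain Python so it runs without extra packages; the function name `quality_report`, the dictionary keys (`tweet_id`, `username`, `text`), and the sample rows are all assumptions made for the example, not part of the assignment.

```python
def quality_report(rows, required=("tweet_id", "username", "text")):
    """Summarise missing values and duplicate IDs for a list of tweet dicts."""
    missing = {field: 0 for field in required}
    seen_ids = set()
    duplicate_ids = 0
    for row in rows:
        # Completeness: count fields that are absent, None, or blank
        for field in required:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1
        # Consistency: each tweet ID should appear only once
        tweet_id = row.get("tweet_id")
        if tweet_id is not None:
            if tweet_id in seen_ids:
                duplicate_ids += 1
            seen_ids.add(tweet_id)
    return {"rows": len(rows), "missing": missing, "duplicate_ids": duplicate_ids}

# Made-up sample rows for demonstration only
sample = [
    {"tweet_id": 1, "username": "alice", "text": "ml is fun"},
    {"tweet_id": 2, "username": "", "text": "ai news"},
    {"tweet_id": 1, "username": "alice", "text": "ml is fun"},
]
report = quality_report(sample)
print(report)
```

With a pandas DataFrame, the same two checks correspond roughly to `df.isnull().sum()` and `df.duplicated(subset=[...]).sum()`.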


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.
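
One way to honor both notes (save the file, avoid repeated requests) is to treat the saved CSV as a cache: call the API only when the file does not exist yet. The sketch below shows the pattern with the standard library; `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for a real API call.

```python
import csv
import os
import tempfile

def load_or_fetch(path, fetch_rows):
    """Return rows from `path` if it exists; otherwise call `fetch_rows` once and save the result."""
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    rows = fetch_rows()  # the only place a network request would happen
    if rows:
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows

# Demonstration with a stand-in fetcher that counts how often it is called
calls = []

def fake_fetch():
    calls.append(1)
    return [{"tweet_id": "1", "text": "hello"}]

demo_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
first = load_or_fetch(demo_path, fake_fetch)   # file missing: fetches and saves
second = load_or_fetch(demo_path, fake_fetch)  # file present: reads from disk
print(len(calls))
```

Values read back from a CSV are strings, so numeric IDs would need converting if the types matter downstream.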


In [None]:
!pip install tweepy



In [None]:
# ----------------------- PART I: Data Collection -----------------------

import os

import tweepy
import pandas as pd

def twitter_authentication():
    """
    Authenticate with Twitter using OAuth1 and return an API object.
    Credentials are read from environment variables so that no secrets
    are stored in the notebook (the variable names are placeholders).
    """
    auth = tweepy.OAuth1UserHandler(
        consumer_key=os.environ["TWITTER_API_KEY"],
        consumer_secret=os.environ["TWITTER_API_SECRET"],
        access_token=os.environ["TWITTER_ACCESS_TOKEN"],
        access_token_secret=os.environ["TWITTER_ACCESS_SECRET"]
    )
    return tweepy.API(auth)

def fetch_tweets():
    """
    Use Tweepy Client to search recent tweets for the hashtag "#generativeAI"
    (excluding retweets) and return a DataFrame with selected tweet fields.
    """
    hashtag_query = '#generativeAI -is:retweet'
    max_results = 100  # Maximum tweets per request

    # Initialize the Client with a bearer token read from the environment
    # (TWITTER_BEARER_TOKEN is a placeholder variable name)
    client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])

    tweets = client.search_recent_tweets(query=hashtag_query,
                                         tweet_fields=["created_at", "text", "author_id"],
                                         max_results=max_results)

    tweet_data = {
        'tweet_id': [],
        'user_id': [],
        'tweet_time': [],
        'tweetText': []
    }

    if tweets.data:
        for tweet in tweets.data:
            tweet_data['tweet_id'].append(tweet.id)
            tweet_data['user_id'].append(tweet.author_id)
            tweet_data['tweet_time'].append(tweet.created_at)
            tweet_data['tweetText'].append(tweet.text)

    return pd.DataFrame(tweet_data)

def save_raw_data(df, filename="Generative_AI_Tweets.csv"):
    """
    Save the raw tweets DataFrame to a CSV file.
    """
    df.to_csv(filename, index=False)
    print(f"Raw tweets have been saved to '{filename}'.")

# ----------------------- PART II: Data Preprocessing -----------------------

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

def download_nltk_resources():
    """
    Download required NLTK datasets.
    """
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')

def preprocess_tweet_text(text):
    """
    Clean tweet text by converting to lowercase, removing special characters,
    tokenizing, filtering out stopwords, and applying lemmatization.
    """
    # Ensure text is a string
    if not isinstance(text, str):
        text = str(text)

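    text = text.lower()  # Convert to lowercase
    text = re.sub(r'https?://\S+', ' ', text)  # Drop URLs first so they do not collapse into tokens like "httpstco"
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters and digits
    tokens = word_tokenize(text)  # Tokenize the text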

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    # Remove stopwords and apply lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return ' '.join(tokens)

def clean_tweet_data(df):
    """
    Apply preprocessing to the tweet text, check for missing values,
    and remove duplicates.
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Check for missing values before preprocessing; otherwise str() would
    # turn NaN into the literal text "nan" and hide the gap
    missing_data = df.isnull().sum()
    if missing_data.any():
        print("Missing data detected. Filling missing values with 'Unknown'.")
        # Assign the result back instead of using inplace=True on a column
        df['tweetText'] = df['tweetText'].fillna('Unknown')
        df['user_id'] = df['user_id'].fillna('Unknown')
    else:
        print("No missing data found.")

    # Preprocess the tweetText column
    df['tweetText'] = df['tweetText'].apply(preprocess_tweet_text)

    # Remove duplicate rows based on tweetText and user_id
    df = df.drop_duplicates(subset=['tweetText', 'user_id'])

    return df

def save_cleaned_data(df, filename="cleaned_Generative_AI_Tweets.csv"):
    """
    Save the cleaned DataFrame to a CSV file.
    """
    df.to_csv(filename, index=False)
    print(f"Cleaned tweets have been saved to '{filename}'.")

# ----------------------- Main Execution Flow -----------------------

def main():
    # PART I: Fetch and Save Raw Tweets
    raw_df = fetch_tweets()
    print("Sample of Raw Tweets:")
    print(raw_df.head())
    save_raw_data(raw_df)

    # PART II: Preprocess and Clean the Tweet Data
    download_nltk_resources()
    # Load the raw data CSV (if needed)
    tweet_df = pd.read_csv("Generative_AI_Tweets.csv")

    # Clean the tweet text and perform data quality checks
    cleaned_df = clean_tweet_data(tweet_df)
    save_cleaned_data(cleaned_df)

    print("\nCleaned Data Sample:")
    print(cleaned_df.head())

if __name__ == "__main__":
    main()


Sample of Raw Tweets:
              tweet_id              user_id                tweet_time  \
0  1892441507560640861  1778689024346980352 2025-02-20 05:09:49+00:00   
1  1892441157482766579           2540286386 2025-02-20 05:08:26+00:00   
2  1892438874414354746            142113030 2025-02-20 04:59:22+00:00   
3  1892437697555231064  1724183495122079744 2025-02-20 04:54:41+00:00   
4  1892437280264016077   830144033917767680 2025-02-20 04:53:02+00:00   

                                           tweetText  
0  🚀 Become a Full Stack Developer with Generativ...  
1  See how Dolphin Fitness utilized @Oracle HeatW...  
2  AI pushed into overthinking: https://t.co/Ck8R...  
3  Xbox announces 'a generative AI model for game...  
4  KritiKal helps businesses to revolutionize app...  
Raw tweets have been saved to 'Generative_AI_Tweets.csv'.
No missing data found.
Cleaned tweets have been saved to 'cleaned_Generative_AI_Tweets.csv'.

Cleaned Data Sample:
              tweet_id              

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

-- The assignment was interesting and a great way to learn about web scraping. It was challenging, but at the same time, it gave me hands-on experience in extracting data from websites, which was really valuable.

-- I really enjoyed seeing my code successfully scrape data; it felt rewarding after troubleshooting and fixing errors. However, I think the time given for the assignment was a bit short. More time would have helped in experimenting with different techniques and improving the efficiency of the scraping process.

-- To improve the assignment, it would be helpful to provide small practice tasks before the actual assignment. This would help in gradually building confidence and understanding before working on a complete web scraping project. Overall, it was a great learning experience, and I appreciate the practical approach.



# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog