<a href="https://colab.research.google.com/github/NikhilaArutla/Nikhila_INFO5731_Spring2025/blob/main/Arutla_Nikhila_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [30]:
import requests  # To get data from the internet
from bs4 import BeautifulSoup  # To read and extract data from web pages
import pandas as pd  # To handle data and save it in CSV format
import time  # To manage time (for delays if needed)

# URL of the IMDb reviews page for a specific movie (Barbie in this case)
URL = "https://www.imdb.com/title/tt1517268/reviews"
# This is the browser info to avoid being blocked by the website
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

def fetch_page_data(page_num):
    """ Get the data from the review page. """
    try:
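        # Get the page content using the page number.
        # NOTE: confirm that the site honors the "PAGE-NO" query parameter; if it is ignored, every request returns the same page
        response = requests.get(URL, headers=HEADERS, params={"PAGE-NO": page_num}, timeout=30)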
        response.raise_for_status()  # Check if the request is successful
        return response.text  # Return the page content
    except requests.RequestException as e:
        print(f"Error fetching page {page_num}: {e}")
        return None  # If there's an error, return None

def extract_review_data(review, user_block, rating_tag):
    """ Extract details from each review. """
    # Get the user ID (look the link up once, then fall back to a placeholder)
    author_link = user_block.find("a", {"data-testid": "author-link"})
    user_id = author_link.text.strip() if author_link else "No User ID"

    # Get the review title and text
    title_tag = review.find("h3", class_="ipc-title__text")
    title = title_tag.text.strip() if title_tag else "No title"
    text_tag = review.find("div", class_="ipc-html-content-inner-div")
    text = text_tag.text.strip() if text_tag else "No detailed text"

    # If text is missing, get alternate text
    if text == "No detailed text":
        alt_text = review.find_next_sibling("div", {"data-testid": "review-overflow"})
        text = alt_text.text.strip() if alt_text else "*No Review added by viewer*"

    # Get the review rating
    rating = rating_tag.text.strip() if rating_tag else "No Rating by viewer"

    # Return all the data in a dictionary
    return {"USER-ID": user_id, "REVIEW-TITLE": title, "REVIEW-COMMENTS": text, "RATINGS": rating}

def fetch_reviews(target_reviews=1000):
    """ Collect the reviews until we reach the target number. """
    reviews_data = []  # List to store review data
    page = 1  # Start with the first page

    # Keep fetching pages until we get enough reviews
    while len(reviews_data) < target_reviews:
        page_content = fetch_page_data(page)  # Get the page content
        if not page_content:  # If no content, stop
            break

        # Read and parse the page content
        soup = BeautifulSoup(page_content, 'html.parser')
        reviews = soup.find_all('div', class_='sc-8c7aa573-5 gBEznl')  # Find all reviews
        users = soup.find_all('div', {'data-testid': 'reviews-author'})  # Find all user info
        ratings = soup.find_all('span', class_='ipc-rating-star--rating')  # Find all ratings

        if not reviews:  # If no reviews, stop
            print("No more reviews found.")
            break

        # Loop through each review, user, and rating
        for review, user_block, rating_tag in zip(reviews, users, ratings):
            try:
                reviews_data.append({**extract_review_data(review, user_block, rating_tag), "PAGE-NO": page})
                if len(reviews_data) >= target_reviews:  # If enough reviews, stop
                    break
            except Exception as e:
                print(f"Skipping review due to error: {e}")  # Skip any reviews with errors

        page += 1  # Go to the next page
        time.sleep(1)  # Pause briefly between requests to avoid hammering the server

    # Return the collected data as a DataFrame
    return pd.DataFrame(reviews_data)

# Get 1000 reviews
df_reviews = fetch_reviews(1000)

# Save the reviews data to a CSV file
df_reviews.to_csv('IMDB_REVIEWS_BARBIE.csv', index=False)

# Print success message
print('Data has been successfully saved to IMDB_REVIEWS_BARBIE.csv')

# Show the first few rows of the data
df_reviews.head()


Data has been successfully saved to IMDB_REVIEWS_BARBIE.csv


Unnamed: 0,USER-ID,REVIEW-TITLE,REVIEW-COMMENTS,RATINGS,PAGE-NO
0,Natcat87,Too heavy handed,*No Review added by viewer*,6,1
1,LoveofLegacy,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,1
2,fscsgxp,"Amazing Cast & Set, but the political message ...",*No Review added by viewer*,6,1
3,aherdofbeautifulwildponies,A Hot Pink Mess,*No Review added by viewer*,6,1
4,L3MM3,People are missing the point,Seeing a lot of reviews saying that this movie...,9,1
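
A paginated scraper can silently collect the same reviews over and over if the site ignores the page parameter. The sketch below shows one possible guard: skip rows already seen and stop as soon as a page contributes nothing new. The function name `collect_unique`, the choice of user plus title as the identity key, and the sample rows are all hypothetical and only illustrate the idea.

```python
def collect_unique(pages, target=1000):
    """Gather rows page by page, skipping repeats and stopping when a page adds nothing new."""
    seen = set()
    collected = []
    for page_rows in pages:
        added = 0
        for row in page_rows:
            # Assumption: user plus title identifies a review well enough for this check
            key = (row["USER-ID"], row["REVIEW-TITLE"])
            if key in seen:
                continue
            seen.add(key)
            collected.append(row)
            added += 1
            if len(collected) >= target:
                return collected
        if added == 0:
            # The page held only repeats, so further requests are unlikely to help
            break
    return collected


# Hypothetical data: the second "page" is a repeat of the first
page_one = [
    {"USER-ID": "user_a", "REVIEW-TITLE": "Great"},
    {"USER-ID": "user_b", "REVIEW-TITLE": "Too long"},
]
demo_pages = [page_one, list(page_one), [{"USER-ID": "user_c", "REVIEW-TITLE": "Fine"}]]
unique_rows = collect_unique(demo_pages, target=10)
print(len(unique_rows))  # 2: collection stops at the repeated page
```

Comparing the row count of the saved CSV with the number of unique rows is a quick way to check whether the pagination actually advanced.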


# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
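
The order of the first four steps matters, because each one changes what the next one sees. The following is a minimal standard-library sketch of that order, not the cell used in this notebook: `basic_clean` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's English list. Stemming and lemmatization are left out because they need NLTK.

```python
import re

# Hypothetical, tiny stopword set for illustration only
DEMO_STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "it", "was"}


def basic_clean(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, drop digits, drop punctuation, then drop stopwords."""
    text = str(text).lower()              # (4) lowercase first so later matching is case-insensitive
    text = re.sub(r"\d+", "", text)       # (2) remove numbers
    text = re.sub(r"[^a-z\s]", "", text)  # (1) remove special characters and punctuation
    tokens = [tok for tok in text.split() if tok not in stop_words]  # (3) remove stopwords
    return " ".join(tokens)


cleaned = basic_clean("The 2nd half of the movie was GREAT!!!")
print(cleaned)  # nd half movie great
```

Note the leftover `nd` from `2nd`: removing digits character by character can leave fragments behind, which is worth checking for in the cleaned column.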

In [31]:
import re  # To use regular expressions for text cleaning
import pandas as pd  # To handle data in tabular form
import nltk  # To use NLTK library for text processing
from nltk.corpus import stopwords  # To remove common stopwords like "and", "the", etc.
from nltk.stem import PorterStemmer, WordNetLemmatizer  # To perform stemming and lemmatization

# Download necessary NLTK resources (stopwords, tokenization, and wordnet)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the dataset (ensure the CSV file is correct)
df = pd.read_csv("IMDB_REVIEWS_BARBIE.csv")

# Define text cleaning functions

def remove_noise(text):
    """ Remove special characters and punctuation from text (digits are handled by remove_numbers). """
    return re.sub(r'[^a-zA-Z0-9\s]', '', str(text))

def remove_numbers(text):
    """ Remove any numbers from the text. """
    return re.sub(r'\d+', '', str(text))

STOP_WORDS = set(stopwords.words('english'))  # Build the stopword set once instead of on every call

def remove_stopwords(text):
    """ Remove common words like "the", "is", etc. from the text, but keep "no". """
    words = str(text).split()  # Split the text into words
    return " ".join([word for word in words if word.lower() not in STOP_WORDS or word.lower() == "no"])  # Join back the words after filtering out stopwords

def apply_stemming(text):
    """ Reduce words to their root form using stemming. """
    stemmer = PorterStemmer()  # Create a Porter stemmer object
    words = str(text).split()  # Split the text into words
    return " ".join([stemmer.stem(word) for word in words])  # Stem each word and join them back

def apply_lemmatization(text):
    """ Convert words to their base form using lemmatization. """
    lemmatizer = WordNetLemmatizer()  # Create a lemmatizer object
    words = str(text).split()  # Split the text into words
    return " ".join([lemmatizer.lemmatize(word) for word in words])  # Lemmatize each word and join them back

# Apply the cleaning steps to the review title and comments

# First, make everything lowercase, remove noise, numbers, and stopwords
df['processed_title'] = df['REVIEW-TITLE'].astype(str).str.lower().apply(remove_noise).apply(remove_numbers).apply(remove_stopwords)
df['processed_comment'] = df['REVIEW-COMMENTS'].astype(str).str.lower().apply(remove_noise).apply(remove_numbers).apply(remove_stopwords)

# Apply stemming (reduce words to their root form)
df['stemmed_headline'] = df['processed_title'].apply(apply_stemming)
df['stemmed_review'] = df['processed_comment'].apply(apply_stemming)

# Apply lemmatization (convert words to their base form)
df['lemmatized_headline'] = df['processed_title'].apply(apply_lemmatization)
df['lemmatized_review'] = df['processed_comment'].apply(apply_lemmatization)

# Save the cleaned data into a new CSV file
df.to_csv("CLEANED_IMDB_REVIEWS.csv", index=False)

# Show the cleaned data (first few rows)
df.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,USER-ID,REVIEW-TITLE,REVIEW-COMMENTS,RATINGS,PAGE-NO,processed_title,processed_comment,stemmed_headline,stemmed_review,lemmatized_headline,lemmatized_review
0,Natcat87,Too heavy handed,*No Review added by viewer*,6,1,heavy handed,no review added viewer,heavi hand,no review ad viewer,heavy handed,no review added viewer
1,LoveofLegacy,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,1,beautiful film preachy,margot best shes given film disappointing mark...,beauti film preachi,margot best she given film disappoint market f...,beautiful film preachy,margot best shes given film disappointing mark...
2,fscsgxp,"Amazing Cast & Set, but the political message ...",*No Review added by viewer*,6,1,amazing cast set political message strong,no review added viewer,amaz cast set polit messag strong,no review ad viewer,amazing cast set political message strong,no review added viewer
3,aherdofbeautifulwildponies,A Hot Pink Mess,*No Review added by viewer*,6,1,hot pink mess,no review added viewer,hot pink mess,no review ad viewer,hot pink mess,no review added viewer
4,L3MM3,People are missing the point,Seeing a lot of reviews saying that this movie...,9,1,people missing point,seeing lot reviews saying movie preachy men po...,peopl miss point,see lot review say movi preachi men portray id...,people missing point,seeing lot review saying movie preachy men por...


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [35]:
import pandas as pd  # For handling data in a table format (like Excel or CSV)
import spacy  # A library to analyze text
from collections import Counter  # To count specific words easily

# Load spaCy's language model for English
nlp = spacy.load("en_core_web_sm")

# Load the cleaned dataset of IMDb reviews
df_imdb = pd.read_csv("CLEANED_IMDB_REVIEWS.csv")

# Function to analyze parts of speech (POS) tags and count nouns, verbs, adjectives, and adverbs
def pos_analysis(text):
    doc = nlp(text)  # Process the text

    # Count different parts of speech (POS)
    pos_counts = Counter(token.pos_ for token in doc)
    noun_count = pos_counts["NOUN"]  # Count nouns
    verb_count = pos_counts["VERB"]  # Count verbs
    adj_count = pos_counts["ADJ"]   # Count adjectives
    adv_count = pos_counts["ADV"]   # Count adverbs

    return [(token.text, token.pos_) for token in doc], noun_count, verb_count, adj_count, adv_count

# Apply POS tagging on the first 10 comments in the dataset
df_imdb_sample = df_imdb[['processed_comment']].dropna().head(10)  # Only take first 10 rows

imdb_pos_results = []
for comment in df_imdb_sample['processed_comment']:
    pos_tags, noun_count, verb_count, adj_count, adv_count = pos_analysis(comment)
    imdb_pos_results.append((pos_tags, noun_count, verb_count, adj_count, adv_count))

df_imdb_pos = pd.DataFrame(imdb_pos_results, columns=['POS_TAGS', 'NOUNS', 'VERBS', 'ADJECTIVES', 'ADVERBS'])

# Show the POS results
print(df_imdb_pos.head())

# Summarize the counts of POS tags (nouns, verbs, adjectives, adverbs)
pos_counts = df_imdb_pos[['NOUNS', 'VERBS', 'ADJECTIVES', 'ADVERBS']].sum()
print("\nPOS Counts Summary:")
print(pos_counts)

# Save the results to a CSV file for later use
df_imdb_pos.to_csv("POS_TAGGING_RESULTS.csv", index=False)

# Show the POS results again
df_imdb_pos.head()


                                            POS_TAGS  NOUNS  VERBS  \
0  [(no, DET), (review, NOUN), (added, VERB), (vi...      2      1   
1  [(margot, ADV), (best, ADJ), (she, PRON), (s, ...     29     18   
2  [(no, DET), (review, NOUN), (added, VERB), (vi...      2      1   
3  [(no, DET), (review, NOUN), (added, VERB), (vi...      2      1   
4  [(seeing, VERB), (lot, NOUN), (reviews, NOUN),...     55     29   

   ADJECTIVES  ADVERBS  
0           0        0  
1          15        4  
2           0        0  
3           0        0  
4          12        6  

POS Counts Summary:
NOUNS         205
VERBS         110
ADJECTIVES     68
ADVERBS        29
dtype: int64


Unnamed: 0,POS_TAGS,NOUNS,VERBS,ADJECTIVES,ADVERBS
0,"[(no, DET), (review, NOUN), (added, VERB), (vi...",2,1,0,0
1,"[(margot, ADV), (best, ADJ), (she, PRON), (s, ...",29,18,15,4
2,"[(no, DET), (review, NOUN), (added, VERB), (vi...",2,1,0,0
3,"[(no, DET), (review, NOUN), (added, VERB), (vi...",2,1,0,0
4,"[(seeing, VERB), (lot, NOUN), (reviews, NOUN),...",55,29,12,6
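
The summary above adds up per-comment counts. The same totals can be computed directly from the tagged pairs, which makes it easy to verify the numbers by hand. This is a small sketch with hypothetical tagged tokens; in the notebook the pairs come from spaCy's `token.pos_`, and `total_pos_counts` is an illustrative name.

```python
from collections import Counter

TARGET_TAGS = ("NOUN", "VERB", "ADJ", "ADV")

# Hypothetical tagged comments, one list of (word, tag) pairs per comment
tagged_docs = [
    [("movie", "NOUN"), ("looks", "VERB"), ("beautiful", "ADJ")],
    [("story", "NOUN"), ("really", "ADV"), ("drags", "VERB"), ("plot", "NOUN")],
]


def total_pos_counts(docs):
    """Sum tag frequencies over all comments and keep only the four target tags."""
    totals = Counter()
    for pairs in docs:
        totals.update(tag for _word, tag in pairs)
    return {tag: totals[tag] for tag in TARGET_TAGS}


pos_totals = total_pos_counts(tagged_docs)
print(pos_totals)  # {'NOUN': 3, 'VERB': 2, 'ADJ': 1, 'ADV': 1}
```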


In [40]:
import pandas as pd  # For handling tabular data like CSV files
import spacy  # A library to process and analyze text
from collections import Counter  # To count entities or items
from spacy import displacy  # To visualize the parsing results

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Load the cleaned dataset of IMDb reviews
df_imdb = pd.read_csv("CLEANED_IMDB_REVIEWS.csv")

# Function to do dependency parsing, which shows relationships between words
def dependency_parsing(sentence):
    doc = nlp(sentence)  # Process the sentence
    return [(token.text, token.dep_, token.head.text) for token in doc]  # Return word, its dependency relation, and the word it depends on

# Function to do Named Entity Recognition (NER) and extract entities (like names, locations)
def named_entity_recognition(text):
    doc = nlp(text)  # Process the text
    entities = [(ent.text, ent.label_) for ent in doc.ents]  # Get all entities
    entity_counts = Counter(ent.label_ for ent in doc.ents)  # Count the types of entities
    return entities, entity_counts

# Apply dependency parsing and NER on the first 5 comments from the dataset
df_imdb_sample = df_imdb[['processed_comment']].dropna().head(5)  # Only first 5 rows for analysis

dependency_trees = []
ner_results = []

for comment in df_imdb_sample['processed_comment']:
    # Get dependency parsing results
    dependency_tree = dependency_parsing(comment)
    dependency_trees.append(dependency_tree)

    # Get Named Entity Recognition results
    named_entities, entity_counts = named_entity_recognition(comment)
    ner_results.append((named_entities, entity_counts))

# Convert NER results to a DataFrame for easy viewing
df_ner = pd.DataFrame(ner_results, columns=['NAMED_ENTITIES', 'ENTITY COUNTS'])

# Save NER results to CSV
df_ner.to_csv("NAMED_ENTITY_RECOGNITION_RESULTS.csv", index=False)

# Flatten the dependency tree data to save it into a DataFrame
dependency_data = []

for tree in dependency_trees:
    for token, dep, head in tree:
        dependency_data.append((token, dep, head))

# Create a DataFrame for the flattened dependency data
df_dependency = pd.DataFrame(dependency_data, columns=['TOKEN', 'DEPENDENCY', 'HEAD'])

# Save the dependency tree results to a CSV file
df_dependency.to_csv("DEPENDENCY_PARSING_RESULTS.csv", index=False)

# Visualize the dependency parsing results (relationships between words)
for comment in df_imdb_sample['processed_comment']:
    doc = nlp(comment)
    displacy.render(doc, style="dep", jupyter=True)  # Render dependency parsing tree in Jupyter notebook

# Show the NER results
print(df_ner.head())

# Visualize the Named Entities found in the text
for comment in df_imdb_sample['processed_comment']:
    doc = nlp(comment)
    displacy.render(doc, style="ent", jupyter=True)  # Render the highlighted named entities in Jupyter notebook


                                     NAMED_ENTITIES  \
0                                                []   
1             [(second, ORDINAL), (half, CARDINAL)]   
2                                                []   
3                                                []   
4  [(validin, ORG), (ken accessory barbie, PERSON)]   

                   ENTITY COUNTS  
0                             {}  
1  {'ORDINAL': 1, 'CARDINAL': 1}  
2                             {}  
3                             {}  
4        {'ORG': 1, 'PERSON': 1}  
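
To explain a dependency parse with one sentence, it helps to turn the flat (token, relation, head) rows into an indented tree. The sketch below does that with the standard library only. The parse of "people missing point" is hypothetical and hand-written for illustration; it is not output from `en_core_web_sm`, and the helper names are assumptions.

```python
from collections import defaultdict

# Hypothetical parse: (index, token, relation, head index); the root points at itself
demo_tokens = [
    (0, "people", "nsubj", 1),
    (1, "missing", "ROOT", 1),
    (2, "point", "dobj", 1),
]


def build_children(tokens):
    """Map each head index to the indices of its dependents and find the root."""
    children = defaultdict(list)
    root = None
    for idx, _text, dep, head in tokens:
        if dep == "ROOT":
            root = idx
        else:
            children[head].append(idx)
    return root, children


def render(tokens, node, children, depth=0):
    """Return the subtree under `node` as indented lines, one token per line."""
    _idx, text, dep, _head = tokens[node]
    lines = ["  " * depth + f"{text} ({dep})"]
    for child in children[node]:
        lines.extend(render(tokens, child, children, depth + 1))
    return lines


root, children = build_children(demo_tokens)
tree_lines = render(demo_tokens, root, children)
print("\n".join(tree_lines))
```

Read top down: the root verb governs the sentence, and each indented line is a word that depends on the word above it through the relation in parentheses. Token indices are used instead of token text so that repeated words do not collide.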




# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products in total), handling pagination through the page number in the URL. For each product, extract the name, a short description, and the URL.

Store the extracted data in a structured CSV file with columns for product name, description, URL, and page number. Add a time delay between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports from the collected data.

The goal is to complete the scraping within a specified time limit while keeping the process efficient and within GitHub’s usage guidelines.

(PART-2)

1. **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [2]:
# PART-1 Web Scraping with Enhancements

import pandas as pd  # Importing pandas to store and manipulate scraped data
import requests  # Importing requests to send HTTP requests to web pages
from bs4 import BeautifulSoup  # Importing BeautifulSoup to parse HTML content
import time  # Importing time to add delays between requests

# GitHub Marketplace base URL (with pagination)
BASE_URL = "https://github.com/marketplace?type=actions&page="

# Headers to mimic a real browser and avoid getting blocked
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://github.com/marketplace",
}

# Function to scrape GitHub Marketplace pages
def scrape_github_marketplace(pages=50):
    data = []  # List to store scraped data
    session = requests.Session()  # Creating a session to manage requests efficiently

    # Loop through multiple pages (pagination handling)
    for page in range(1, pages + 1):
        url = BASE_URL + str(page)  # Construct URL with the page number

        try:
            response = session.get(url, headers=HEADERS, timeout=30)  # Sending HTTP request (with a timeout so a stalled page cannot hang the loop)
            if response.status_code != 200:  # Checking if request was successful
                print(f"Failed to retrieve page {page}. Status Code: {response.status_code}")
                continue  # Skip to the next page if there's an error

            # Parsing the HTML content of the page
            soup = BeautifulSoup(response.text, 'html.parser')

            # Finding all elements that contain product details
            actions = soup.find_all('div', class_='position-relative border rounded-2 d-flex marketplace-common-module__marketplace-item--MohVH gap-3 p-3')

            # Extracting details from each product
            for action in actions:
                try:
                    # Extracting product name
                    name_tag = action.find('h3', class_='d-flex f4 lh-condensed prc-Heading-Heading-6CmGO')
                    name = name_tag.text.strip() if name_tag else "Unknown"

                    # Extracting description
                    desc_tag = action.find('p', class_='mt-1 mb-0 text-small fgColor-muted line-clamp-2')
                    description = desc_tag.text.strip() if desc_tag else "No description"

                    # Extracting product URL
                    link_tag = action.find('a', href=True)
                    link = "https://github.com" + link_tag['href'].strip() if link_tag else "No link"

                    # Storing extracted data
                    data.append([name, description, link, page])

                except Exception as e:
                    print(f"Skipping item due to error: {e}")  # Handling errors in individual items

        except Exception as e:
            print(f"Error scraping page {page}: {e}")  # Handling errors for a page

        time.sleep(5)  # Adding a delay to prevent server overload

    # Returning scraped data as a DataFrame
    return pd.DataFrame(data, columns=['PRODUCT_NAME', 'DESCRIPTION', 'URL', 'PAGE_NO'])

# Running the scraping function for 50 pages
df = scrape_github_marketplace(50)

# Saving raw scraped data to a CSV file
df.to_csv("GITHUB_MARKETPLACE_RAW.csv", index=False)

# Displaying the first few rows of the scraped data
df.head()


Unnamed: 0,PRODUCT_NAME,DESCRIPTION,URL,PAGE_NO
0,TruffleHog OSS,Scan Github Actions with TruffleHog,https://github.com/marketplace/actions/truffle...,1
1,Metrics embed,An infographics generator with 40+ plugins and...,https://github.com/marketplace/actions/metrics...,1
2,yq - portable yaml processor,"create, read, update, delete, merge, validate ...",https://github.com/marketplace/actions/yq-port...,1
3,Super-Linter,Super-linter is a ready-to-run collection of l...,https://github.com/marketplace/actions/super-l...,1
4,Gosec Security Checker,Runs the gosec security checker,https://github.com/marketplace/actions/gosec-s...,1


In [14]:
import pandas as pd
import nltk
from textblob import TextBlob
from nltk.corpus import stopwords

# Download necessary NLP datasets
nltk.download('punkt')
nltk.download('stopwords')

# Define stopwords
stop = set(stopwords.words('english'))

# Function to clean and tokenize text
def clean_and_tokenize_text(text):
    if pd.isna(text):
        return []
    words = list(TextBlob(text.lower()).words)
    words = [word for word in words if word not in stop]
    return words

# Load data
df = pd.read_csv("GITHUB_MARKETPLACE_RAW.csv")  # Raw data saved by the scraping cell above

# Drop existing tokenized columns to prevent duplication
df = df.drop(columns=['product_tokens', 'description_tokens'], errors='ignore')

# Apply text cleaning function
df['product_tokens'] = df['PRODUCT_NAME'].apply(clean_and_tokenize_text)
df['description_tokens'] = df['DESCRIPTION'].apply(clean_and_tokenize_text)

# Save cleaned and tokenized data
df.to_csv("GH_MP_CLEANED_TOKENS.csv", index=False, mode='w')

# Display column names and first few rows to check output
print(df.columns)
df.head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Index(['PRODUCT_NAME', 'DESCRIPTION', 'URL', 'PAGE_NO', 'product_tokens',
       'description_tokens'],
      dtype='object')


Unnamed: 0,PRODUCT_NAME,DESCRIPTION,URL,PAGE_NO,product_tokens,description_tokens
0,TruffleHog OSS,Scan Github Actions with TruffleHog,https://github.com/marketplace/actions/truffle...,1,"[trufflehog, oss]","[scan, github, actions, trufflehog]"
1,Metrics embed,An infographics generator with 40+ plugins and...,https://github.com/marketplace/actions/metrics...,1,"[metrics, embed]","[infographics, generator, 40, plugins, 300, op..."
2,yq - portable yaml processor,"create, read, update, delete, merge, validate ...",https://github.com/marketplace/actions/yq-port...,1,"[yq, portable, yaml, processor]","[create, read, update, delete, merge, validate..."
3,Super-Linter,Super-linter is a ready-to-run collection of l...,https://github.com/marketplace/actions/super-l...,1,[super-linter],"[super-linter, ready-to-run, collection, linte..."
4,Gosec Security Checker,Runs the gosec security checker,https://github.com/marketplace/actions/gosec-s...,1,"[gosec, security, checker]","[runs, gosec, security, checker]"
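
The data quality checks described for Part 2 (completeness, consistency, duplicates) can be sketched without any extra libraries. The example below is an assumption about how such checks might look, not code from this notebook: `quality_report` is a hypothetical helper, the rows are made up, and the placeholder strings mirror the fallbacks used by the scraper ("Unknown", "No description", "No link").

```python
REQUIRED = ("PRODUCT_NAME", "DESCRIPTION", "URL")
PLACEHOLDERS = {"", "Unknown", "No description", "No link"}
URL_PREFIX = "https://github.com/marketplace/"


def quality_report(rows):
    """Split rows into clean rows and a tally of the problems found."""
    seen_urls = set()
    clean = []
    report = {"missing": 0, "bad_url": 0, "duplicates": 0}
    for row in rows:
        # Completeness: every required field must hold a real value
        if any(str(row.get(col, "")).strip() in PLACEHOLDERS for col in REQUIRED):
            report["missing"] += 1
            continue
        # Consistency: URLs should point into the marketplace
        if not row["URL"].startswith(URL_PREFIX):
            report["bad_url"] += 1
            continue
        # Uniqueness: keep the first row for each URL
        if row["URL"] in seen_urls:
            report["duplicates"] += 1
            continue
        seen_urls.add(row["URL"])
        clean.append(row)
    return clean, report


# Hypothetical rows covering each case
demo_rows = [
    {"PRODUCT_NAME": "Demo Action", "DESCRIPTION": "Runs a demo", "URL": URL_PREFIX + "actions/demo-action"},
    {"PRODUCT_NAME": "Demo Action", "DESCRIPTION": "Runs a demo", "URL": URL_PREFIX + "actions/demo-action"},
    {"PRODUCT_NAME": "Unknown", "DESCRIPTION": "No description", "URL": URL_PREFIX + "actions/nameless"},
    {"PRODUCT_NAME": "Other", "DESCRIPTION": "Does things", "URL": "https://github.com/sponsors"},
]
clean_rows, report = quality_report(demo_rows)
print(len(clean_rows), report)
```

With a DataFrame, the same rows can be obtained from `df.to_dict("records")`, or the checks can be rewritten with `isna`, `duplicated`, and `str.startswith`.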


# Question 5 (20 points)

PART 1:
Collect tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics (machine learning or artificial intelligence).
The extracted data includes the tweet ID, username, and text.

PART 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.
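
One cleaning decision deserves a closer look: deleting hashtags entirely also deletes the topic words they carry, which can leave gaps in the sentence. The sketch below contrasts the two options. `clean_tweet`, its `keep_hashtag_words` flag, and the sample tweet are hypothetical and only meant to show the trade-off.

```python
import re


def clean_tweet(text, keep_hashtag_words=True):
    """Strip URLs and mentions; either unwrap hashtags or delete them."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # Remove URLs
    text = re.sub(r"@\w+", "", text)                   # Remove mentions
    if keep_hashtag_words:
        text = re.sub(r"#(\w+)", r"\1", text)          # Drop only the "#", keep the word
    else:
        text = re.sub(r"#\w+", "", text)               # Drop the whole hashtag
    return re.sub(r"\s+", " ", text).strip()           # Collapse whitespace


sample = "Women in #AI: @someone builds #MachineLearning tools https://example.com/x"
kept = clean_tweet(sample, keep_hashtag_words=True)
dropped = clean_tweet(sample, keep_hashtag_words=False)
print(kept)     # Women in AI: builds MachineLearning tools
print(dropped)  # Women in : builds tools
```

Which option is better depends on the analysis: keeping the words preserves meaning, while dropping them avoids counting the search hashtags themselves as content.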


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [15]:
# Install required libraries (Tweepy, Pandas, NLTK)
!pip install tweepy pandas nltk




In [16]:
# Import necessary libraries for the task
import tweepy
import pandas as pd
import time
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK data files
nltk.download("stopwords")
nltk.download("punkt")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [21]:
import os  # To read credentials from the environment

# Read API keys and tokens from environment variables instead of hard-coding secrets in the notebook.
# The variable names below are placeholders; use whatever names you set in your environment.
API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET = os.environ["TWITTER_API_SECRET"]
ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_SECRET = os.environ["TWITTER_ACCESS_SECRET"]
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]


# Authenticate using OAuth 1.0a
auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Initialize the Tweepy client with the bearer token
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Note: this message only confirms that the client objects were created; no request has been sent to the API yet
print("✅ AUTHENTICATION SUCCESSFUL!")


✅ AUTHENTICATION SUCCESSFUL!


In [22]:
# Reuse the bearer token from the previous cell to (re)initialize the Tweepy client
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Define a function to scrape tweets based on the given query
def scrape_tweets(query, num_tweets=100):
    tweets_data = []

    try:
        # Use Tweepy Paginator to fetch tweets
        for tweet in tweepy.Paginator(
            client.search_recent_tweets,
            query=query,
            tweet_fields=["id", "text", "created_at", "author_id"],
            max_results=50
        ).flatten(limit=num_tweets):

            # Append tweet data to the list
            tweets_data.append({
                "Tweet_ID": tweet.id,
                "Author_ID": tweet.author_id,
                "Text": tweet.text,
                "Created_At": tweet.created_at
            })

    except tweepy.TooManyRequests:
        print("🔴 API RATE LIMIT REACHED! WAITING 15 MINUTES BEFORE RETRYING.....")
        time.sleep(900)
        return scrape_tweets(query, num_tweets)
    except tweepy.TweepyException as e:
        print(f"🔴 AN ERROR OCCURRED: {e}")
        return pd.DataFrame()

    # Return the data in DataFrame format
    return pd.DataFrame(tweets_data)

# Scrape tweets related to AI & Machine Learning hashtags
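# Parentheses make "-is:retweet" apply to both hashtags; without them it binds only to the second one
df_tweets = scrape_tweets("(#MachineLearning OR #ArtificialIntelligence) -is:retweet", num_tweets=50)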

# Save the scraped data to a CSV file
df_tweets.to_csv("AI_ML_TWEETS.csv", index=False)

print(f"✅ SUCCESSFULLY COLLECTED AND SAVED {len(df_tweets)} TWEETS IN 'AI_ML_TWEETS.csv'!! ")


✅ SUCCESSFULLY COLLECTED AND SAVED 50 TWEETS IN 'AI_ML_TWEETS.csv'!! 
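
Waiting and retrying after a rate-limit error is reasonable, but a retry with no upper bound can loop for a very long time. Below is a minimal sketch of a bounded retry wrapper. Everything in it is an assumption for illustration: `call_with_retries` is a hypothetical helper, `DemoRateLimit` stands in for an error such as `tweepy.TooManyRequests`, and the sleep function is injected so the demo runs instantly.

```python
import time


def call_with_retries(fetch, max_attempts=3, wait_seconds=900,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(); on a listed error wait and retry, giving up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the error
            sleep(wait_seconds)


class DemoRateLimit(Exception):
    """Stand-in for a rate-limit error raised by an API client."""


attempts = {"count": 0}


def flaky_fetch():
    # Fails twice, then succeeds, to exercise the retry path
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise DemoRateLimit("slow down")
    return ["tweet 1", "tweet 2"]


waits = []  # Records requested waits instead of actually sleeping
result = call_with_retries(flaky_fetch, max_attempts=3, wait_seconds=900,
                           retry_on=(DemoRateLimit,), sleep=waits.append)
print(result, waits)  # ['tweet 1', 'tweet 2'] [900, 900]
```

In real use the default `sleep=time.sleep` would pause for the full wait, and `retry_on` would name the client library's rate-limit exception.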


In [24]:
# Load and display the first few rows of the scraped tweet data
df = pd.read_csv("AI_ML_TWEETS.csv")
df.head()


Unnamed: 0,Tweet_ID,Author_ID,Text,Created_At
0,1892394495611752858,705539763349164032,Women in #AI: @SarahBitamazire helps companie...,2025-02-20 02:03:01+00:00
1,1892394313272717804,4879174121,Check out my article on DeepSeek-V3! This 671B...,2025-02-20 02:02:17+00:00
2,1892394013941985515,1563886639587692544,RT @Python_Dv: Build AI Model From Scratch htt...,2025-02-20 02:01:06+00:00
3,1892393864494813313,1892388757883801600,I'm thrilled to announce the launch of my jour...,2025-02-20 02:00:30+00:00
4,1892393845960196386,47254260,Watch on-demand to explore what steps organiza...,2025-02-20 02:00:26+00:00


In [25]:
# Import necessary libraries for data cleaning
import pandas as pd
import re

# Load the raw tweet data
df = pd.read_csv("AI_ML_TWEETS.csv")

# Step 1: Remove retweets (tweets starting with "RT @"); na=False keeps the mask boolean when Text is missing
df = df[~df["Text"].str.startswith("RT @", na=False)]

# Step 2: Handle missing values by dropping rows with any null values
df = df.dropna()

# Step 3: Remove duplicate tweets based on tweet ID
df = df.drop_duplicates(subset=["Tweet_ID"])

# Step 4: Clean the tweet text (remove URLs, mentions, hashtags, and extra whitespace)
def clean_text(text):
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs (http\S+ also matches https links)
    text = re.sub(r"@\w+", "", text)  # Remove mentions (@username)
    text = re.sub(r"#\w+", "", text)  # Remove hashtags together with the tagged word
    text = re.sub(r"\s+", " ", text).strip()  # Collapse newlines and repeated whitespace into single spaces
    return text

df["Text"] = df["Text"].apply(clean_text)

# Step 5: Convert tweet creation date to DateTime format
df["Created_At"] = pd.to_datetime(df["Created_At"])

# Step 6: Perform data quality checks
print("\nCleaned Data:")
print(df.head())
print("\nData Info:")
df.info()  # info() prints its summary directly and returns None
print("\nMissing Values:")
print(df.isnull().sum())
print("\nDuplicate Rows:")
print(df.duplicated().sum())

# Save the cleaned data to a new CSV file
df.to_csv("CLEANED_AI_ML_TWEETS.csv", index=False)
print("\n✅ SUCCESSFULLY CLEANED AND SAVED THE DATA TO 'CLEANED_AI_ML_TWEETS.csv' !! ")



Cleaned Data:
              Tweet_ID            Author_ID  \
0  1892394495611752858   705539763349164032   
1  1892394313272717804           4879174121   
3  1892393864494813313  1892388757883801600   
4  1892393845960196386             47254260   
5  1892393816272834948   875579780506066945   

                                                Text                Created_At  
0            Women in : helps companies implement Cc 2025-02-20 02:03:01+00:00  
1  Check out my article on DeepSeek-V3! This 671B... 2025-02-20 02:02:17+00:00  
3  I'm thrilled to announce the launch of my jour... 2025-02-20 02:00:30+00:00  
4  Watch on-demand to explore what steps organiza... 2025-02-20 02:00:26+00:00  
5  Explore this insightful on the role of the Met... 2025-02-20 02:00:19+00:00  

Data Info:
<class 'pandas.core.frame.DataFrame'>
Index: 43 entries, 0 to 47
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 

# Mandatory Question

**Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.**

For this assignment, I collected user reviews of the movie 'BARBIE' from IMDB, which falls within the required 2023-2024 release window. The work began with web scraping to gather 1,000 reviews, followed by text cleaning and syntactic analysis.

The main difficulty was handling IMDB's page structure. Because the reviews are paginated, I had to build a scraper that moved smoothly across multiple review pages. The reviews also contained HTML tags, special characters, and emojis, all of which required careful cleaning. Stopword removal and lemmatization were tricky as well, because both steps had to be implemented precisely so that the meaningful content of the text was retained.
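
The cleaning steps mentioned above (HTML tags, special characters, emojis, and stopwords) can be sketched with the standard library alone. This is a minimal illustration, not the exact code used for the assignment: the function name `clean_review`, the tiny `STOPWORDS` set, and the sample review are all assumptions made for the example, and lemmatization is left out because it needs an external library such as NLTK.

```python
import html
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real pipeline would likely use a fuller list (e.g. from NLTK).
STOPWORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it", "this"}

TAG_RE = re.compile(r"<[^>]+>")            # matches HTML tags such as <p> or </b>
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # anything that is not a lowercase letter, digit, or whitespace

def clean_review(text: str) -> str:
    """Strip HTML, special characters, emojis, and stopwords from one review."""
    text = html.unescape(text)          # decode entities, e.g. "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)        # replace tags with a space
    text = text.lower()                 # normalize case before filtering
    text = NON_ALNUM_RE.sub(" ", text)  # drop punctuation, symbols, and emojis
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

sample = "<p>This movie was <b>AMAZING</b> &amp; fun! 😍</p>"
cleaned = clean_review(sample)
print(cleaned)
```

Filtering on an ASCII allow-list is the simplest way to drop emojis, but it also splits contractions (for example, "don't" becomes "don t") and removes accented letters, so a gentler pattern may be preferable for some datasets.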

Web scraping was the most interesting part for me. Using BeautifulSoup and requests, I retrieved real reviewer opinions and could see how they varied across the collection. Data cleaning was satisfying because it turned messy raw text into an organized structure that was easier to analyze. The syntactic analysis was also engaging because it combined part-of-speech tagging with the extraction of entities such as movie names, actors, and locations.

The time provided was reasonable for students who organized their work efficiently. Still, web scraping and data processing take a long time on large datasets, and handling missing data, debugging, and validating the CSV file structure added to the total workload. Overall, the assignment was instructive because it brought together several aspects of data science, from data gathering to natural language processing.

# Write your response below
**Fill out survey and provide your valuable feedback.**

**https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog**

I have filled out the survey.