# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie (released in 2023 or 2024; you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software products from G2 and Capterra.


In [16]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time


# Base IMDb URL for reviews
base_url = "https://www.imdb.com/title/tt2911666/reviews/?ref_=ttrt_sa_3"

# Headers to mimic a real user browser and avoid blocks
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
}

# Function to scrape IMDb reviews
def scrape_imdb_reviews(num_reviews=100):
    reviews_data = []
    page = 1

    while len(reviews_data) < num_reviews:
        try:
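            # Pagination offset; requests merges it into base_url's existing query
            # string instead of appending a second "?"
            params = {"start": (page - 1) * 10}

            # Send a request to IMDb
            response = requests.get(base_url, headers=HEADERS, params=params, timeout=10)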
            if response.status_code != 200:
                print(f"Failed to fetch page {page}. Status Code: {response.status_code}")
                break

            # Parse the page content
            soup = BeautifulSoup(response.text, "html.parser")

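            # Find review elements. Note: this selector matches the review headline
            # (h3 title), so the collected text is the short title, not the full review body
            review_blocks = soup.find_all('h3', class_='ipc-title__text')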

            # If no reviews are found, stop the loop
            if not review_blocks:
                print("No more reviews found. IMDb may have changed the structure.")
                break

            # Extract and store reviews
            for review in review_blocks:
                review_text = review.text.strip()
                reviews_data.append({"Review": review_text})

                # Stop if we reach the required number of reviews
                if len(reviews_data) >= num_reviews:
                    break

            page += 1  # Move to the next page
            time.sleep(1)  # Delay to avoid being blocked

        except Exception as e:
            print(f"Error scraping page {page}: {e}")
            break

    return pd.DataFrame(reviews_data)

# Scrape 100 reviews (adjust as needed)
df = scrape_imdb_reviews(100)

# Add a word count column
df["Word_Count"] = df["Review"].apply(lambda x: len(str(x).split()))

# Save to CSV
df.to_csv("John_Wick_4_Reviews.csv", index=False)

# Display the first few rows
print(df.head())


                                              Review  Word_Count
0  Keanu gets pissed and shoots people in the fac...          12
1  Kinetic, concise, and stylish; John Wick kicks...           8
2      Story: 3 minutes; Entertainment: 101 minutes.           6
3                        Yeah I'm Thinking He's Back           5
4  The best action revenge film of all time from ...          12


In [17]:
# Diagnostic request sent without the browser-like HEADERS; IMDb rejects it (403 below)
req = requests.get(base_url)
print(req.text)  # Check if IMDb is returning the expected HTML

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>



In [18]:
df.shape

(100, 2)

In [None]:
df["Reviews"]

0       Imagine a video game where you are shooting ba...
1       The Table, the international crminal brotherho...
2       The first three John Wick films came in fairly...
3       These John Wick movies can be sort of fun in t...
4       I went to the cinema with great expectations. ...
                              ...                        
1245    HORRIBLE movie. I love John Wick. I mean I wou...
1246    In a world where movie sequels seem to be loat...
1247    May or may not count as a spoiler but John Wic...
1248    Indiana jones, terminator, predator, Jurassic ...
1249    John Wick: Chapter 4 is almost three hours of ...
Name: Reviews, Length: 1250, dtype: object

In [19]:
# Save only the review text column to CSV
df["Review"].to_csv("John Wick_Chapter 4_reviews.csv", index=False)

In [20]:
df1 = pd.read_csv("John Wick_Chapter 4_reviews.csv")
df1.head()

Unnamed: 0,Review
0,Keanu gets pissed and shoots people in the fac...
1,"Kinetic, concise, and stylish; John Wick kicks..."
2,Story: 3 minutes; Entertainment: 101 minutes.
3,Yeah I'm Thinking He's Back
4,The best action revenge film of all time from ...


In [21]:
df1.shape

(100, 1)

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
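
As a rough illustration of steps (1) to (4), the sketch below uses only the Python standard library. The function name `basic_clean` and the tiny `STOP_WORDS` set are assumptions made for this example. A real solution would use a full stopword list (for example NLTK's) and add stemming and lemmatization, which need a library and are not shown here.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for this illustration
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def basic_clean(text):
    """Apply steps (1)-(4): strip noise, strip digits, lowercase, drop stopwords."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # (1) punctuation and special characters
    text = re.sub(r"\d+", " ", text)              # (2) numbers
    tokens = text.lower().split()                 # (4) lowercase, then split on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # (3) stopwords
    return " ".join(tokens)

print(basic_clean("Story: 3 minutes; Entertainment: 101 minutes."))
```

Lowercasing happens before the stopword filter so that capitalized stopwords such as "The" are also removed.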

In [22]:
# Set options to display full column content
pd.set_option("display.max_colwidth", None)

In [23]:
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [24]:
# Write your code here
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

# Load the CSV file
# Read the CSV file containing the movie reviews into a DataFrame
df1 = pd.read_csv("John Wick_Chapter 4_reviews.csv")

# Build the stopword set, stemmer, and lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to preprocess and clean a single review
def review_preprocessing(review):
    # Treat missing values (NaN) as empty text
    review = "" if pd.isna(review) else str(review)

    # Remove punctuation and special characters
    review = ''.join(ch for ch in review if ch not in string.punctuation)

    # Remove numbers
    review = ''.join(ch for ch in review if not ch.isdigit())

    # Tokenize the review
    words = nltk.word_tokenize(review)

    # Lowercase all words, then remove stopwords
    words = [w.lower() for w in words]
    words = [w for w in words if w not in STOP_WORDS]

    # Stemming
    words = [stemmer.stem(w) for w in words]

    # Lemmatization
    words = [lemmatizer.lemmatize(w) for w in words]

    # Join the words back into a cleaned sentence
    return ' '.join(words)

# Apply the preprocessing function to the "Review" column and store the result in a new column
df1['Reviews_After_Cleaning'] = df1['Review'].apply(review_preprocessing)


In [25]:
df1.head()

Unnamed: 0,Review,Reviews_After_Cleaning
0,Keanu gets pissed and shoots people in the face for 101 minutes*,keanu get piss shoot peopl face minut
1,"Kinetic, concise, and stylish; John Wick kicks ass.",kinet concis stylish john wick kick as
2,Story: 3 minutes; Entertainment: 101 minutes.,stori minut entertain minut
3,Yeah I'm Thinking He's Back,yeah im think he back
4,The best action revenge film of all time from 2014 so far!,best action reveng film time far


In [26]:
print(df1["Reviews_After_Cleaning"])

0         keanu get piss shoot peopl face minut
1        kinet concis stylish john wick kick as
2                   stori minut entertain minut
3                         yeah im think he back
4              best action reveng film time far
                        ...                    
95                                             
96                                        excel
97             dont mess anoth person dog simpl
98                   love movi highli recommend
99    keanu bring quiet believ action throwback
Name: Reviews_After_Cleaning, Length: 100, dtype: object


In [27]:
# Save the DataFrame to a new CSV file with the cleaned data
df1.to_csv("Cleaned_text_of_John Wick_Chapter 4_reviews.csv", index=False)

In [35]:
from google.colab import files
# Save the cleaned DataFrame (df1, which holds the Reviews_After_Cleaning column) to a CSV file
cleaned_file_Johnwick = "Cleaned_text_of_JohnWick_Chapter4_reviews.csv"
df1.to_csv(cleaned_file_Johnwick, index=False)

files.download(cleaned_file_Johnwick)


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
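
To make the counting in part (1) concrete, here is a minimal sketch that tallies coarse categories from Penn Treebank-style tags by their two-letter prefix. The sample tags are copied from the tagger output for the first review later in this notebook. The names `COARSE` and `count_pos` are assumptions made for this example.

```python
from collections import Counter

# Sample (word, tag) pairs copied from the first review's tagger output
tagged = [("keanu", "NN"), ("get", "VB"), ("piss", "JJ"), ("shoot", "NN"),
          ("peopl", "NN"), ("face", "NN"), ("minut", "NN")]

# Two-letter Penn Treebank prefixes mapped to the four requested categories
COARSE = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in a list of (word, tag) pairs."""
    counts = Counter({name: 0 for name in COARSE.values()})
    for _word, tag in tagged_tokens:
        category = COARSE.get(tag[:2])
        if category:
            counts[category] += 1
    return counts

pos_counts = count_pos(tagged)
print(dict(pos_counts))
```

Matching on `RB` rather than the single letter `R` keeps particles (tag `RP`) out of the adverb count.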

In [28]:
df1.head()

Unnamed: 0,Review,Reviews_After_Cleaning
0,Keanu gets pissed and shoots people in the face for 101 minutes*,keanu get piss shoot peopl face minut
1,"Kinetic, concise, and stylish; John Wick kicks ass.",kinet concis stylish john wick kick as
2,Story: 3 minutes; Entertainment: 101 minutes.,stori minut entertain minut
3,Yeah I'm Thinking He's Back,yeah im think he back
4,The best action revenge film of all time from 2014 so far!,best action reveng film time far


In [29]:
df1.shape

(100, 2)

In [30]:
df1["Reviews_After_Cleaning"]

Unnamed: 0,Reviews_After_Cleaning
0,keanu get piss shoot peopl face minut
1,kinet concis stylish john wick kick as
2,stori minut entertain minut
3,yeah im think he back
4,best action reveng film time far
...,...
95,
96,excel
97,dont mess anoth person dog simpl
98,love movi highli recommend


In [31]:
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [32]:
from nltk.tree import Tree


In [33]:
from collections import Counter

# "Reviews_After_Cleaning" column (missing values become empty strings)
cleaned_review = df1["Reviews_After_Cleaning"].fillna("")

# Define a simple chunk grammar for shallow constituency parsing
elements = r"""
    NP: {<DT>?<JJ>*<NN>}  # Noun phrase
    VP: {<VB.*><NP|PP|CLAUSE>+$}  # Verb phrase
    PP: {<IN><NP>}  # Prepositional phrase
    CLAUSE: {<NP><VP>}  # Clause
"""
parser = nltk.RegexpParser(elements)

# Function to conduct syntax and structure analysis
def perform_syntax_structure_analysis(review):
    # Tokenize the review into words and tag each word once
    words = nltk.word_tokenize(review)
    tags = nltk.pos_tag(words)

    constituency_tree = parser.parse(tags)
    named_entities = nltk.ne_chunk(tags)

    return tags, constituency_tree, named_entities

# Initialize counters for POS categories
total_nouns = 0
total_verbs = 0
total_adjectives = 0
total_adverbs = 0

# Count entities by the label NLTK assigns. The labels are upper-case
# (e.g. PERSON, ORGANIZATION, GPE), so title-case keys would never match.
entity_counts = Counter()

# Perform syntax and structure analysis for the first 15 reviews in the column
for index, review in enumerate(cleaned_review[:15], start=1):
    print(f"Analysis for review {index}:")
    tags, constituency_tree, named_entities = perform_syntax_structure_analysis(review)

    # Count Nouns (N), Verbs (V), Adjectives (Adj), and Adverbs (Adv) for each review
    nouns = sum(1 for word, pos in tags if pos.startswith('N'))
    verbs = sum(1 for word, pos in tags if pos.startswith('V'))
    adjectives = sum(1 for word, pos in tags if pos.startswith('J'))
    adverbs = sum(1 for word, pos in tags if pos.startswith('R'))

    total_nouns += nouns
    total_verbs += verbs
    total_adjectives += adjectives
    total_adverbs += adverbs

    # Extract entities and count them
    for entity in named_entities:
        if isinstance(entity, nltk.Tree):
            entity_counts[entity.label()] += 1

    print("Tags:")
    print(tags)

    print("Constituency Parsing Tree:")
    # pretty_print() prints the tree itself and returns None, so do not wrap it in print()
    constituency_tree.pretty_print()

    print("Named Entities:")
    print(named_entities)
    print("="*50)

# Print the totals
print(f"Total Nouns (N): {total_nouns}")
print(f"Total Verbs (V): {total_verbs}")
print(f"Total Adjectives (Adj): {total_adjectives}")
print(f"Total Adverbs (Adv): {total_adverbs}")

# Print entity counts
print("Entity Counts:")
for entity_type, count in entity_counts.items():
    print(f"{entity_type}: {count}")


Analysis for review 1:
Tags:
[('keanu', 'NN'), ('get', 'VB'), ('piss', 'JJ'), ('shoot', 'NN'), ('peopl', 'NN'), ('face', 'NN'), ('minut', 'NN')]
Constituency Parsing Tree:
           S                                                  
           |                                                   
         CLAUSE                                               
    _______|___________________                                
   |                           VP                             
   |        ___________________|_________________________      
   NP      |             NP             NP       NP      NP   
   |       |        _____|_____         |        |       |     
keanu/NN get/VB piss/JJ     shoot/NN peopl/NN face/NN minut/NN

None
Named Entities:
(S keanu/NN get/VB piss/JJ shoot/NN peopl/NN face/NN minut/NN)
Analysis for review 2:
Tags:
[('kinet', 'NN'), ('concis', 'NN'), ('stylish', 'JJ'), ('john', 'NN'), ('wick', 'NN'), ('kick', 'NN'), ('as', 'IN')]
Constituency Parsing Tree:

In [34]:
#alternative
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tree import ParentedTree
import string

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# review text
review_text = df1["Reviews_After_Cleaning"].iloc[0]

#Parts of Speech (POS) Tagging
doc = nlp(review_text)
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

for token in doc:
    if token.pos_ == 'NOUN':
        noun_count += 1
    elif token.pos_ == 'VERB':
        verb_count += 1
    elif token.pos_ == 'ADJ':
        adj_count += 1
    elif token.pos_ == 'ADV':
        adv_count += 1

print(f"Number of Nouns: {noun_count}")
print(f"Number of Verbs: {verb_count}")
print(f"Number of Adjectives: {adj_count}")
print(f"Number of Adverbs: {adv_count}")

#Constituency Parsing
def nltk_constituency_parsing(text):
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(words)
        grammar = "NP: {<DT>?<JJ>*<NN>}"
        cp = nltk.RegexpParser(grammar)
        tree = cp.parse(tagged)
        print("Constituency Parsing Tree:")
        tree.pretty_print()

# Call the constituency parsing function on the review  text
nltk_constituency_parsing(review_text)

#Named Entity Recognition
entities = {}
for entity in doc.ents:
    entities[entity.label_] = entities.get(entity.label_, 0) + 1

print("\nNamed Entity Recognition (NER):")
for label, count in entities.items():
    print(f"{label}: {count}")


Number of Nouns: 5
Number of Verbs: 1
Number of Adjectives: 1
Number of Adverbs: 0
Constituency Parsing Tree:
                               S                              
   ____________________________|_________________________      
  |       NP             NP             NP       NP      NP   
  |       |         _____|_____         |        |       |     
get/VB keanu/NN piss/JJ     shoot/NN peopl/NN face/NN minut/NN


Named Entity Recognition (NER):


# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape data from the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products in total), handling pagination through the page number in the URL. The key details to extract are the product name, a short description, and the URL.

Store the extracted data in a structured CSV file with columns for product name, description, URL, and page number. Add a time delay between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports based on the collected data.

The goal is to complete the scraping within a specified time limit while keeping the process efficient and compliant with GitHub’s usage guidelines.
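
The pagination and CSV layout described above can be sketched without any network access. In the sketch below, the `page` query parameter, the figure of 20 listings per page, and the helper names `page_urls` and `rows_to_csv` are assumptions for illustration. Inspect the live site before relying on them.

```python
import csv
import io
import math
from urllib.parse import urlencode

BASE_URL = "https://github.com/marketplace"  # listing page named in this notebook
ITEMS_PER_PAGE = 20  # assumption: roughly 20 listings per page

def page_urls(total_items, items_per_page=ITEMS_PER_PAGE):
    """Return the listing URLs needed to cover total_items products."""
    pages = math.ceil(total_items / items_per_page)
    return [f"{BASE_URL}?{urlencode({'type': 'actions', 'page': n})}"
            for n in range(1, pages + 1)]

def rows_to_csv(rows):
    """Serialize scraped rows into CSV text with the four required columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["product_name", "description", "url", "page"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

urls = page_urls(1000)
print(len(urls), urls[0])
```

A real scraper would fetch each URL in turn, pause between requests, and pass the extracted dictionaries to `rows_to_csv` or write them straight to a file.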

(PART-2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. They also involve identifying and removing duplicates, handling missing values, and confirming that the data accurately reflects the source content.
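
A minimal sketch of such checks on in-memory rows, using plain Python dictionaries, might look like the following. Treating the URL as the identity of a listing and the function name `quality_report` are assumptions for this example. The sample rows are made up.

```python
def quality_report(rows, required=("product_name", "description", "url")):
    """Split rows into clean rows plus counts of duplicates and incomplete rows."""
    seen_urls = set()
    clean, duplicates, incomplete = [], 0, 0
    for row in rows:
        # Completeness: every required field must be present and non-blank
        if any(not str(row.get(field) or "").strip() for field in required):
            incomplete += 1
            continue
        # Uniqueness: treat the URL as the identity of a listing (an assumption)
        if row["url"] in seen_urls:
            duplicates += 1
            continue
        seen_urls.add(row["url"])
        clean.append(row)
    return clean, {"duplicates": duplicates, "incomplete": incomplete}

# Made-up sample rows: one valid, one exact duplicate, one with a blank description
sample = [
    {"product_name": "Action A", "description": "Does A", "url": "https://example.com/a"},
    {"product_name": "Action A", "description": "Does A", "url": "https://example.com/a"},
    {"product_name": "Action B", "description": "", "url": "https://example.com/b"},
]
clean_rows, report = quality_report(sample)
print(len(clean_rows), report)
```

Reporting the counts alongside the cleaned rows makes it easy to state in the write-up how many records each check removed.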


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_github_marketplace_actions(
    start_page=1,
    end_page=50,    # <-- 50 pages * 20 items/page = 1000 items
    delay=1.0,
    output_file='github_actions_1000.csv'
):
    """
    Scrapes GitHub Marketplace Actions listings from start_page to end_page.
    Each page contains around 20 products, so 50 pages ~ 1000 products.

    :param start_page: The first page number to scrape
    :param end_page: The last page number to scrape
    :param delay: Delay (seconds) between requests to avoid overloading the server
    :param output_file: CSV file to write the results
    """

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/108.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        # Accept-Encoding is left to requests, which only advertises encodings it can decode
    }

    # Potential CSS selectors (change if needed by inspecting actual HTML)
    possible_selectors = [
        "article.py-4.border-bottom",
        "div.col-12.d-flex.flex-wrap.py-4.border-bottom.color-border-muted",
        "li.Box-row",
        "div[class*='marketplace-item']"
    ]

    fieldnames = ['product_name', 'description', 'url', 'page']

    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        for page_num in range(start_page, end_page + 1):
            # Marketplace Actions URL, changing the 'page' parameter
            url = f"https://github.com/marketplace/actions?page={page_num}"
            print(f"Scraping page {page_num}: {url}")

            # Make the request
            try:
                response = requests.get(url, headers=headers, timeout=10)
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f"Request error on page {page_num}: {e}")
                continue

            # Parse with BeautifulSoup
            soup = BeautifulSoup(response.text, "html.parser")

            # Try each selector until we find one that matches items
            action_cards = []
            for selector in possible_selectors:
                test_cards = soup.select(selector)
                if test_cards:
                    action_cards = test_cards
                    print(f"  Found {len(action_cards)} items using selector: {selector}")
                    break
            else:
                # If no selector matched
                print("  No items found with any known selector on this page.")
                # Optional: print snippet of HTML for debugging
                continue

            # Extract product details
            for card in action_cards:
                name_tag = card.find('h3')
                desc_tag = card.find('p')
                link_tag = card.find('a', href=True)

                if not (name_tag and link_tag):
                    continue

                product_name = name_tag.get_text(strip=True)
                description = desc_tag.get_text(strip=True) if desc_tag else ""
                action_url = link_tag['href']
                if action_url.startswith('/'):
                    action_url = "https://github.com" + action_url

                writer.writerow({
                    'product_name': product_name,
                    'description': description,
                    'url': action_url,
                    'page': page_num
                })

            # Small delay between pages
            time.sleep(delay)

    print(f"Scraping completed. Data saved to '{output_file}'.")


if __name__ == "__main__":
    # Scrape pages 1 through 50 (~1000 products) with a 1-second delay
    scrape_github_marketplace_actions(start_page=1, end_page=50, delay=1.0)



Scraping page 1: https://github.com/marketplace/actions?page=1
  Found 20 items using selector: div[class*='marketplace-item']
Scraping page 2: https://github.com/marketplace/actions?page=2
  Found 20 items using selector: div[class*='marketplace-item']
Scraping page 3: https://github.com/marketplace/actions?page=3
  Found 20 items using selector: div[class*='marketplace-item']
Scraping page 4: https://github.com/marketplace/actions?page=4
  Found 20 items using selector: div[class*='marketplace-item']
Scraping page 5: https://github.com/marketplace/actions?page=5
  Found 20 items using selector: div[class*='marketplace-item']
Scraping page 6: https://github.com/marketplace/actions?page=6
  Found 20 items using selector: div[class*='marketplace-item']
Scraping page 7: https://github.com/marketplace/actions?page=7
  Found 20 items using selector: div[class*='marketplace-item']
Scraping page 8: https://github.com/marketplace/actions?page=8
  Found 20 items using selector: div[class*='mar

In [2]:
import pandas as pd

def read_csv_with_pandas(csv_file='github_actions_1000.csv'):
    """Reads and displays rows from the CSV using pandas."""
    df = pd.read_csv(csv_file)
    print(df)  # Print the entire DataFrame
    print("\nDataFrame shape:", df.shape)  # (num_rows, num_columns)

if __name__ == "__main__":
    read_csv_with_pandas()


                     product_name  \
0                  TruffleHog OSS   
1                   Metrics embed   
2    yq - portable yaml processor   
3                    Super-Linter   
4          Gosec Security Checker   
..                            ...   
995               Rebuild Armbian   
996                 GitHub Script   
997        Deploy to GitHub Pages   
998          ChatGPT CodeReviewer   
999                    FTP Deploy   

                                           description  \
0                  Scan Github Actions with TruffleHog   
1    An infographics generator with 40+ plugins and...   
2    create, read, update, delete, merge, validate ...   
3    Super-linter is a ready-to-run collection of l...   
4                      Runs the gosec security checker   
..                                                 ...   
995                                Build Armbian Linux   
996         Run simple scripts using the GitHub client   
997  This action will handle the 

In [10]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [14]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [15]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import time

# If you're in Google Colab, import this module to enable file download:
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Build these once; recreating them on every call is slow for large datasets
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    """
    Clean text by removing HTML tags and special characters, converting to
    lowercase, tokenizing, removing stopwords, and lemmatizing.
    """
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove non-alphanumeric characters (preserve whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Lowercase
    text = text.lower().strip()

    # Tokenize
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    tokens = [t for t in tokens if t not in STOP_WORDS]

    # Lemmatize
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]

    # Join tokens back into a single string
    return " ".join(tokens)

def preprocess_and_download(input_csv='github_actions_1000.csv', output_csv='github_actions_cleaned.csv'):
    # 1) Load the data
    df = pd.read_csv(input_csv)
    print(f"Initial dataset shape: {df.shape}")

    # 2) Basic data checks
    print("\nHEAD of the original data:")
    print(df.head(3))

    # 3) Remove duplicates
    before = len(df)
    df.drop_duplicates(subset=['product_name', 'description', 'url', 'page'], inplace=True)
    after = len(df)
    print(f"\nRemoved {before - after} duplicates. Current shape: {df.shape}")

    # 4) Handle missing values
    #    Replace empty strings with NaN so we can drop them easily
    df['product_name'] = df['product_name'].replace('', np.nan)
    df['description'] = df['description'].replace('', np.nan)

    # Drop rows missing product_name or description
    before = len(df)
    df.dropna(subset=['product_name', 'description'], how='any', inplace=True)
    after = len(df)
    print(f"Dropped {before - after} rows with missing product_name/description. Shape: {df.shape}")

    # If 'page' is missing, fill with 0
    if 'page' in df.columns:
        df['page'] = df['page'].fillna(0)

    # 5) Preprocess text columns
    start_time = time.time()
    df['cleaned_product_name'] = df['product_name'].apply(clean_text)
    df['cleaned_description']  = df['description'].apply(clean_text)
    end_time = time.time()
    print(f"\nText cleaning took {end_time - start_time:.2f} seconds.")

    # 6) Show a sample of the cleaned data
    print("\nHEAD of cleaned data:")
    print(df[['product_name', 'cleaned_product_name', 'description', 'cleaned_description']].head(3))

    # 7) Save to CSV
    df.to_csv(output_csv, index=False)
    print(f"\nCleaned data saved to: {output_csv} - Final shape: {df.shape}")

    # 8) Download the file if in Google Colab
    if IN_COLAB:
        print("\nDownloading the cleaned file to your local machine...")
        files.download(output_csv)

preprocess_and_download('github_actions_1000.csv', 'github_actions_cleaned.csv')


Initial dataset shape: (1000, 4)

HEAD of the original data:
                   product_name  \
0                TruffleHog OSS   
1                 Metrics embed   
2  yq - portable yaml processor   

                                         description  \
0                Scan Github Actions with TruffleHog   
1  An infographics generator with 40+ plugins and...   
2  create, read, update, delete, merge, validate ...   

                                                 url  page  
0  https://github.com/marketplace/actions/truffle...     1  
1  https://github.com/marketplace/actions/metrics...     1  
2  https://github.com/marketplace/actions/yq-port...     1  

Removed 0 duplicates. Current shape: (1000, 4)
Dropped 0 rows with missing product_name/description. Shape: (1000, 4)

Text cleaning took 4.30 seconds.

HEAD of cleaned data:
                   product_name        cleaned_product_name  \
0                TruffleHog OSS               trufflehog os   
1                 Metrics e


# Question 5 (20 points)

PART 1:
Collect tweets from Twitter using the Tweepy library for the Twitter API, specifically targeting hashtags related to the subtopics (machine learning or artificial intelligence).
The extracted data includes the tweet ID, username, and text.

PART 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.
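
One possible shape for the cleaning step, sketched with the standard library only, is shown below. The function name `clean_tweet` and the sample tweet are made up for illustration. Whether hashtags should be removed entirely or only stripped of the `#` sign is a design choice left to the analyst.

```python
import re

def clean_tweet(text):
    """Remove URLs, @mentions, hashtags, and special characters from a tweet."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)          # mentions and hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remaining special characters
    return re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace

# Made-up sample tweet
sample = "Loving #MachineLearning! Thanks @someone, see https://example.com/post"
print(clean_tweet(sample))
```

Collapsing whitespace at the end avoids the double spaces that are otherwise left behind where a URL or hashtag was removed.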


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [6]:
import os
import re

import pandas as pd
import tweepy

# 🔹 Step 1: Twitter API Credentials
# Never hard-code credentials in a shared notebook; revoke any key that was published.
# The environment variable name below is an assumption; set it before running this cell.
# Only the bearer token is needed for the read-only search used here.
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]

# 🔹 Step 2: Authenticate with Twitter API
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# 🔹 Step 3: Define the Search Query
# Parentheses group the hashtags so the retweet and language filters apply to all of them
query = "(#MachineLearning OR #ArtificialIntelligence OR #AI) -is:retweet lang:en"

# 🔹 Step 4: Fetch Tweets
tweets = client.search_recent_tweets(query=query, tweet_fields=["id", "text", "author_id", "created_at"], max_results=100)

# 🔹 Step 5: Extract Data
tweet_data = []

if tweets.data:
    for tweet in tweets.data:
        tweet_data.append({
            "Tweet_ID": tweet.id,
            "Username": f"user_{tweet.author_id}",  # Usernames require an additional API call
            "Created_At": tweet.created_at,
            "Text": tweet.text
        })

# 🔹 Step 6: Convert to DataFrame
df = pd.DataFrame(tweet_data)

# 🔹 Step 7: Data Cleaning Function
def clean_text(text):
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"@\w+", "", text)  # Remove mentions
    text = re.sub(r"#\w+", "", text)  # Remove hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # Remove special characters
    return text.strip()

df["Cleaned_Text"] = df["Text"].apply(clean_text)

# 🔹 Step 8: Data Quality Check
df.drop_duplicates(subset=["Tweet_ID"], keep="first", inplace=True)  # Remove duplicate tweets
df.dropna(inplace=True)  # Remove missing values

# 🔹 Step 9: Save to CSV
df.to_csv("twitter_ai_ml_tweets.csv", index=False, encoding="utf-8")

print(" Data scraping and cleaning complete! File saved: twitter_ai_ml_tweets.csv")


 Data scraping and cleaning complete! File saved: twitter_ai_ml_tweets.csv
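
The cell above stores a placeholder of the form `user_<author_id>` because the search response does not carry usernames by default. A possible alternative, sketched here under the assumption that `client` is a `tweepy.Client`, is to request the author profiles in the same call through the `expansions` and `user_fields` parameters. `build_rows` and `fetch_rows` are hypothetical helper names.

```python
def build_rows(tweets, users):
    """Join tweets to usernames through an author_id -> username lookup."""
    username_by_id = {user.id: user.username for user in users}
    rows = []
    for tweet in tweets:
        rows.append({
            "Tweet_ID": tweet.id,
            # Fall back to the placeholder when the profile was not returned
            "Username": username_by_id.get(tweet.author_id, f"user_{tweet.author_id}"),
            "Created_At": tweet.created_at,
            "Text": tweet.text,
        })
    return rows


def fetch_rows(client, query, max_results=100):
    """Make one search request that also asks for the authors' profiles."""
    response = client.search_recent_tweets(
        query=query,
        tweet_fields=["author_id", "created_at"],
        expansions=["author_id"],
        user_fields=["username"],
        max_results=max_results,
    )
    users = response.includes.get("users", []) if response.includes else []
    return build_rows(response.data or [], users)
```

Calling `fetch_rows(client, query)` would return a list of dicts that can be passed straight to `pd.DataFrame`, still using a single request.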


In [7]:
from google.colab import files
files.download("twitter_ai_ml_tweets.csv")



In [8]:
import pandas as pd

# Load the uploaded file
file_path = "twitter_ai_ml_tweets.csv"
df = pd.read_csv(file_path)

# Display basic information about the dataset
df.info()
print(df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Tweet_ID      100 non-null    int64 
 1   Username      100 non-null    object
 2   Created_At    100 non-null    object
 3   Text          100 non-null    object
 4   Cleaned_Text  100 non-null    object
dtypes: int64(1), object(4)
memory usage: 4.0+ KB

              Tweet_ID                  Username                 Created_At  \
0  1891678317633315207  user_1891022278533709824  2025-02-18 02:37:11+00:00   
1  1891678309819301954  user_1849144255274713088  2025-02-18 02:37:09+00:00   
2  1891678288025718879  user_1422938193339420682  2025-02-18 02:37:04+00:00   
3  1891678273496662092  user_1748991548455563264  2025-02-18 02:37:00+00:00   
4  1891678257667318099  user_1517080755356065795  2025-02-18 02:36:56+00:00   

                                                Text  \
0  @stellardeployer @Ashcryptoreal @pumpdotfun @f...   
1  Rain Girl \n#雨水 #ai #aiart #aigirl #aiwomen #a...   
2  Abridge @AbridgeHQ Raises $250 Million in Seri...   
3             Bullish\nAwesome\n\n@Ammo_AI #AI #ammo   
4  @Ammo_AI #AI #ammo \nThis is a great project, ...   

                                        Cleaned_Text  
0  we have all the elements to make this the next...  
1                                          Rain Girl  
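
The `clean_text` function removes URLs, mentions, hashtags, and special characters, but it leaves the newlines and repeated spaces that sat between the removed tokens. A possible variant is sketched below; `clean_text_normalized` is a hypothetical name, and the extra whitespace step is an assumption about what a downstream analysis would want rather than part of the original cell.

```python
import re


def clean_text_normalized(text):
    """Same steps as clean_text, plus collapsing runs of whitespace."""
    text = re.sub(r"http\S+", "", text)         # Remove URLs
    text = re.sub(r"@\w+", "", text)            # Remove mentions
    text = re.sub(r"#\w+", "", text)            # Remove hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # Remove special characters
    text = re.sub(r"\s+", " ", text)            # Collapse newlines and repeated spaces
    return text.strip()


sample = "Bullish\nAwesome\n\n@Ammo_AI #AI #ammo"
print(clean_text_normalized(sample))  # prints: Bullish Awesome
```

A tweet made up only of hashtags or mentions is reduced to an empty string by either version, which is worth checking for before analysis.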

In [9]:
# Save the cleaned DataFrame to a CSV file
cleaned_file_path_tweeter = "twitter_ai_ml_tweets_cleaned_final.csv"
df.to_csv(cleaned_file_path_tweeter, index=False)

files.download(cleaned_file_path_tweeter)
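
The final data quality check described in the question can be made explicit with a few counts. This is a minimal sketch that assumes the column names used in this notebook (`Tweet_ID`, `Cleaned_Text`); `quality_report` is a hypothetical helper.

```python
import pandas as pd


def quality_report(df, id_col="Tweet_ID", text_col="Cleaned_Text"):
    """Summarise completeness and consistency of the cleaned tweets."""
    return {
        "rows": len(df),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
        # Empty strings are read back from CSV as NaN, so count both as empty
        "empty_cleaned_text": int((df[text_col].fillna("").str.strip() == "").sum()),
    }
```

For a clean dataset, `duplicate_ids`, `missing_values`, and `empty_cleaned_text` should all be 0. Running `quality_report(df)` after reloading the CSV is useful because a cleaned text that became empty is stored as a blank field and comes back as a missing value.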


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# All the questions were challenging for me. I found Question 5 somewhat easy, but the rest of the questions were difficult to solve.

# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog