<a href="https://colab.research.google.com/github/LGChalla/Laxmigayathri_INFO5731_Spring2025/blob/main/INFO5731_Assignment_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


#cleaned amazon review file https://drive.google.com/file/d/182VReHd1bMlOm5XHArqqoirKSLn1hngM/view?usp=drive_link

In [13]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.microsoft import EdgeChromiumDriverManager
import pandas as pd
import time
from bs4 import BeautifulSoup

def setup_driver():
    options = webdriver.EdgeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Edge(options=options)
    return driver

In [11]:
# Fetch the HTML content of the URL
def fetch_html(driver, url):
    driver.get(url)
    time.sleep(5)  # Wait for the page to load completely
    return driver.page_source

# Parse the reviews from the HTML
def parse_reviews(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    reviews = []

    # Find all review elements
    review_elements = soup.find_all("li", {"data-hook": "review"})

    for review in review_elements:
        try:
            # Extract reviewer name
            reviewer = review.find("span", {"class": "a-profile-name"}).text.strip()
        except:
            reviewer = "Anonymous"

        try:
            # Extract review title
            title = review.find("a", {"data-hook": "review-title"}).text.strip()
        except:
            title = "No Title"

        try:
            # Extract review body
            body = review.find("span", {"data-hook": "review-body"}).text.strip()
        except:
            body = "No Review Text"

        try:
            # Extract star rating
            rating = review.find("i", {"data-hook": "review-star-rating"}).text.strip()
        except:
            rating = "No Rating"

        try:
            # Extract review date
            date = review.find("span", {"data-hook": "review-date"}).text.strip()
        except:
            date = "No Date"

        # Append the extracted data to the reviews list
        reviews.append({
            "Reviewer": reviewer,
            "Title": title,
            "Body": body,
            "Rating": rating,
            "Date": date
        })

    return reviews

# Save the reviews to a CSV file
def save_to_csv(reviews, output_file):
    df = pd.DataFrame(reviews)
    df.to_csv(output_file, index=False, encoding="utf-8")
    print(f"Reviews saved to {output_file}")

# Main function to scrape reviews
def scrape_amazon_reviews(url, max_pages, output_file):
    driver = setup_driver()
    all_reviews = []

    for page in range(1, max_pages + 1):
        print(f"Scraping page {page}...")
        try:
            # Update the URL to include the page number
            page_url = f"{url}&pageNumber={page}"
            html_content = fetch_html(driver, page_url)
            page_reviews = parse_reviews(html_content)

            if not page_reviews:
                print(f"No reviews found on page {page}. Stopping pagination.")
                break

            all_reviews.extend(page_reviews)
        except Exception as e:
            print(f"Error on page {page}: {e}")
            break

    # Save all reviews to a CSV file
    save_to_csv(all_reviews, output_file)
    driver.quit()

# Run the scraper
if __name__ == "__main__":
    # URL of the Amazon reviews page
    BASE_URL = "https://www.amazon.com/Bubble-Extra-Gentle-Ounce-473ml/dp/B073KM53NG/ref=cm_cr_arp_d_product_top?ie=UTF8&th=1"  # Replace with your product's reviews URL
    MAX_PAGES = 100  # Adjust this based on the number of pages to scrape
    OUTPUT_FILE = "amazon_reviews.csv"

    scrape_amazon_reviews(BASE_URL, MAX_PAGES, OUTPUT_FILE)

Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Scraping page 21...
Scraping page 22...
Scraping page 23...
Scraping page 24...
Scraping page 25...
Scraping page 26...
Scraping page 27...
Scraping page 28...
Scraping page 29...
Scraping page 30...
Scraping page 31...
Scraping page 32...
Scraping page 33...
Scraping page 34...
Scraping page 35...
Scraping page 36...
Scraping page 37...
Scraping page 38...
Scraping page 39...
Scraping page 40...
Scraping page 41...
Scraping page 42...
Scraping page 43...
Scraping page 44...
Scraping page 45...
Scraping page 46...
Scraping page 47...
Scraping page 48...
Scraping page 49...
Scraping page 50...
Scraping 

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [1]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file
try:
    df = pd.read_csv("amazon_reviews.csv")
    print("CSV file loaded successfully.")
except FileNotFoundError:
    print("Error: The file 'amazon_reviews.csv' was not found. Please check the file path.")
    exit()

# Check the columns in the DataFrame
print("Columns in the DataFrame:")
print(df.columns)

# Verify the DataFrame is not empty
if df.empty:
    print("The DataFrame is empty. Please check the file content.")
    exit()

# Check if the 'Body' column exists
if 'Body' not in df.columns:
    print("Error: The column 'Body' does not exist in the DataFrame.")
    print("Available columns:", df.columns)
    exit()

# Initialize tools for stemming and lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define stopwords
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    # Step 1: Remove noise (special characters and punctuations)
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters and punctuations
    print(f"After removing special characters: {text}")

    # Step 2: Remove numbers
    text = re.sub(r'\d+', '', text)  # Remove numbers
    print(f"After removing numbers: {text}")

    # Step 3: Remove stopwords
    words = text.split()
    words = [word for word in words if word.lower() not in stop_words]
    print(f"After removing stopwords: {' '.join(words)}")

    # Step 4: Convert text to lowercase
    words = [word.lower() for word in words]
    print(f"After converting to lowercase: {' '.join(words)}")

    # Step 5: Perform stemming
    stemmed_words = [stemmer.stem(word) for word in words]
    print(f"After stemming: {' '.join(stemmed_words)}")

    # Step 6: Perform lemmatization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
    print(f"After lemmatization: {' '.join(lemmatized_words)}")

    # Join the cleaned words back into a single string
    cleaned_text = ' '.join(lemmatized_words)
    return cleaned_text

# Apply the cleaning function to the 'Body' column and save the cleaned data in a new column
df['cleaned_text'] = df['Body'].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv("cleaned_amazon_reviews.csv", index=False)

# Display the first few rows of the cleaned dataset
print("\nCleaned Data:")
print(df.head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
After converting to lowercase: kids love bubble bath makes lot bubbles tear free wish came larger size go pretty quickly order two four pack keep around read
After stemming: kid love bubbl bath make lot bubbl tear free wish came larger size go pretti quickli order two four pack keep around read
After lemmatization: kid love bubbl bath make lot bubbl tear free wish came larger size go pretti quickli order two four pack keep around read
After removing special characters: My 18 month old loves a good bubble bath and these are perfect and gentle on the skin
Read more
After removing numbers: My  month old loves a good bubble bath and these are perfect and gentle on the skin
Read more
After removing stopwords: month old loves good bubble bath perfect gentle skin Read
After converting to lowercase: month old loves good bubble bath perfect gentle skin read
After stemming: month old love good bubbl bath perfect gentl skin read
Aft

# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

#https://drive.google.com/file/d/1Znd8-xlVJKSUKDxdIMbrwvK7CHudRWde/view?usp=drive_link
#https://drive.google.com/file/d/1E3JyJg1p_GTWNPtuxCbt_t8uhZfmYgI0/view?usp=drive_link

In [1]:
!pip uninstall nltk
!pip install nltk==3.8.1
from nltk.tokenize import word_tokenize
text = "This is a test sentence."
tokens = word_tokenize(text)
print(tokens)


Found existing installation: nltk 3.8.1
Uninstalling nltk-3.8.1:
  Would remove:
    /usr/local/bin/nltk
    /usr/local/lib/python3.11/dist-packages/nltk-3.8.1.dist-info/*
    /usr/local/lib/python3.11/dist-packages/nltk/*
Proceed (Y/n)? y
  Successfully uninstalled nltk-3.8.1
Collecting nltk==3.8.1
  Using cached nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textblob 0.19.0 requires nltk>=3.9, but you have nltk 3.8.1 which is incompatible.[0m[31m
[0mSuccessfully installed nltk-3.8.1
['This', 'is', 'a', 'test', 'sentence', '.']


In [9]:
import pandas as pd
import nltk
from nltk import pos_tag, word_tokenize
from collections import Counter

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load the cleaned CSV file
try:
    df = pd.read_csv("cleaned_amazon_reviews.csv")
    print("CSV file loaded successfully.")
except FileNotFoundError:
    print("Error: The file 'cleaned_amazon_reviews.csv' was not found. Please check the file path.")
    exit()

# Initialize a Counter to store POS counts
pos_counts = Counter()

# Function to perform POS tagging and count specific parts of speech
def pos_tagging_analysis(text):
    global pos_counts
    tokens = word_tokenize(text)  # Tokenize the text
    pos_tags = pos_tag(tokens)  # Perform POS tagging
    # Count specific POS tags
    for _, tag in pos_tags:
        if tag.startswith('N'):  # Nouns
            pos_counts['Noun'] += 1
        elif tag.startswith('V'):  # Verbs
            pos_counts['Verb'] += 1
        elif tag.startswith('J'):  # Adjectives
            pos_counts['Adjective'] += 1
        elif tag.startswith('R'):  # Adverbs
            pos_counts['Adverb'] += 1
    return pos_tags

# Apply POS tagging to each row in the 'cleaned_text' column
df['pos_tags'] = df['cleaned_text'].apply(pos_tagging_analysis)

# Print the total counts of Nouns, Verbs, Adjectives, and Adverbs
print("\nPOS Tagging Analysis:")
print(f"Total Nouns: {pos_counts['Noun']}")
print(f"Total Verbs: {pos_counts['Verb']}")
print(f"Total Adjectives: {pos_counts['Adjective']}")
print(f"Total Adverbs: {pos_counts['Adverb']}")

# Save the DataFrame with POS tags to a new CSV file
df.to_csv("pos_tagged_amazon_reviews.csv", index=False)

# Display the first few rows of the DataFrame with POS tags
print("\nData with POS Tags:")
print(df[['cleaned_text', 'pos_tags']].head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


CSV file loaded successfully.

POS Tagging Analysis:
Total Nouns: 8800
Total Verbs: 1300
Total Adjectives: 1600
Total Adverbs: 400

Data with POS Tags:
                                        cleaned_text  \
0  mr bubbl realli help reduc dryness skin cold w...   
1  work great love scentless use cap half make pe...   
2  kid love bubbl bath make lot bubbl tear free w...   
3  month old love good bubbl bath perfect gentl s...   
4  good valu product thought would produc bubbl read   

                                            pos_tags  
0  [(mr, NN), (bubbl, NN), (realli, NN), (help, N...  
1  [(work, NN), (great, JJ), (love, NN), (scentle...  
2  [(kid, NN), (love, NN), (bubbl, NN), (bath, NN...  
3  [(month, NN), (old, JJ), (love, NN), (good, JJ...  
4  [(good, JJ), (valu, NN), (product, NN), (thoug...  


In [5]:
import pandas as pd
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

# Load the cleaned Amazon reviews CSV file
try:
    df = pd.read_csv("cleaned_amazon_reviews.csv")
    print("CSV file loaded successfully.")
except FileNotFoundError:
    print("Error: The file 'cleaned_amazon_reviews.csv' was not found. Please check the file path.")
    exit()

# Check if the 'cleaned_text' column exists
if 'cleaned_text' not in df.columns:
    print("Error: The column 'cleaned_text' does not exist in the DataFrame.")
    print("Available columns:", df.columns)
    exit()

# Function to perform dependency parsing and simulated constituency parsing
def parse_review(text):
    # Process the text using spaCy
    doc = nlp(text)

    # Dependency Parsing
    print("\nDependency Parsing Tree:")
    for token in doc:
        print(f"{token.text} ({token.dep_}) --> {token.head.text}")

    # Simulated Constituency Parsing
    print("\nSimulated Constituency Parsing Tree:")
    for chunk in doc.noun_chunks:
        print(f"Noun Phrase: {chunk.text}")
    for token in doc:
        if token.pos_ == "VERB":
            print(f"Verb Phrase: {token.text} {' '.join([child.text for child in token.children])}")

# Apply parsing to the first 5 reviews in the 'cleaned_text' column
print("\nParsing Example for Cleaned Amazon Reviews:")
for i, review in enumerate(df['cleaned_text'][:5]):
    print(f"\nReview {i + 1}: {review}")
    parse_review(review)

CSV file loaded successfully.

Parsing Example for Cleaned Amazon Reviews:

Review 1: mr bubbl realli help reduc dryness skin cold winter isnt child doesnt like bubbl softer skin recommend read

Dependency Parsing Tree:
mr (compound) --> realli
bubbl (compound) --> realli
realli (compound) --> help
help (nsubj) --> is
reduc (xcomp) --> help
dryness (compound) --> skin
skin (dobj) --> reduc
cold (amod) --> winter
winter (nsubj) --> is
is (ROOT) --> is
nt (neg) --> is
child (attr) --> is
does (aux) --> like
nt (neg) --> like
like (intj) --> read
bubbl (nmod) --> recommend
softer (amod) --> recommend
skin (compound) --> recommend
recommend (nsubj) --> read
read (attr) --> is

Simulated Constituency Parsing Tree:
Noun Phrase: mr bubbl realli help
Noun Phrase: dryness skin
Noun Phrase: cold winter
Noun Phrase: child
Noun Phrase: bubbl softer skin recommend
Verb Phrase: reduc skin
Verb Phrase: like does nt
Verb Phrase: bubbl 
Verb Phrase: read like recommend

Review 2: work great love scentl

In [12]:
# Load the cleaned Amazon reviews CSV file
try:
    df = pd.read_csv("cleaned_amazon_reviews.csv")
    print("CSV file loaded successfully.")
except FileNotFoundError:
    print("Error: The file 'cleaned_amazon_reviews.csv' was not found. Please check the file path.")
    exit()

# Function to perform Named Entity Recognition (NER)
def extract_entities(text):
    doc = nlp(text)  # Process the text using spaCy
    entities = [(ent.text, ent.label_) for ent in doc.ents]  # Extract entities and their labels
    return entities

# Apply the NER function to each review and store the results in a new column
df['entities'] = df['cleaned_text'].apply(extract_entities)

# Flatten the list of all entities and their labels
all_entities = [entity for entities in df['entities'] for entity in entities]

# Count the occurrences of each entity type
entity_counts = Counter([label for _, label in all_entities])

# Print the extracted entities and their counts
print("\nNamed Entity Recognition (NER):")
print("Extracted Entities:")
for entity, label in all_entities:
    print(f"{entity} ({label})")

print("\nEntity Counts:")
for label, count in entity_counts.items():
    print(f"{label}: {count}")

# Save the DataFrame with extracted entities to a new CSV file
df.to_csv("ner_amazon_reviews.csv", index=False)

# Display the first few rows of the DataFrame with entities
print("\nData with Extracted Entities:")
print(df[['cleaned_text', 'entities']].head())

CSV file loaded successfully.

Named Entity Recognition (NER):
Extracted Entities:
bubbl realli help reduc (PERSON)
winter (DATE)
bubbl softer (PERSON)
half (CARDINAL)
bubbl bath (PERSON)
two four (CARDINAL)
month old (DATE)
bubbl bath (PERSON)
bubbl bath (PERSON)
issu (ORG)
lo recomiendo read (PERSON)
bubbl bath (PERSON)
nunca sent tan estafada con algo (ORG)
la peor compra (ORG)
jétai jeun read (PERSON)
bubbl realli help reduc (PERSON)
winter (DATE)
bubbl softer (PERSON)
half (CARDINAL)
bubbl bath (PERSON)
two four (CARDINAL)
month old (DATE)
bubbl bath (PERSON)
bubbl bath (PERSON)
issu (ORG)
lo recomiendo read (PERSON)
bubbl bath (PERSON)
nunca sent tan estafada con algo (ORG)
la peor compra (ORG)
jétai jeun read (PERSON)
bubbl realli help reduc (PERSON)
winter (DATE)
bubbl softer (PERSON)
half (CARDINAL)
bubbl bath (PERSON)
two four (CARDINAL)
month old (DATE)
bubbl bath (PERSON)
bubbl bath (PERSON)
issu (ORG)
lo recomiendo read (PERSON)
bubbl bath (PERSON)
nunca sent tan estafada 

# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


#https://drive.google.com/file/d/1WsAaQNbSSOqkXbeD1nqNmqNM4Ip2sZc0/view?usp=drive_link

Github MarketPlace page:
https://github.com/marketplace?type=actions

In [8]:
#The first prompt was asking to write a code to scrape the web using selenium
#had to ask to use edge web driver and not chrome
#next asked it print scraped with the page numer to see if it was scraping
#This contiued with directly inputting the part 2 and asking to input the file from the previously written code



In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.microsoft import EdgeChromiumDriverManager
import pandas as pd
import time

# Set up Selenium WebDriver for Edge using WebDriver Manager
def setup_driver():
    options = webdriver.EdgeOptions()
    options.add_argument("--headless")  # Run in headless mode
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    # Use EdgeChromiumDriverManager to automatically manage the WebDriver
    driver = webdriver.Edge(options=options)
    return driver

# Function to scrape a single page
def scrape_page(driver, page_number):
    url = f"https://github.com/marketplace?type=actions&page={page_number}"
    print(f"Scraping page {page_number}...")
    driver.get(url)

    # Wait for the page to load
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.mt-4.marketplace-common-module__marketplace-list-grid--vCk7D"))
        )
    except Exception as e:
        print(f"Failed to load page {page_number}: {e}")
        return []

    # Extract product cards
    product_cards = driver.find_elements(By.CSS_SELECTOR, "div.mt-4.marketplace-common-module__marketplace-list-grid--vCk7D > div")
    data = []

    for card in product_cards:
        try:
            # Extract product name
            name = card.find_element(By.CSS_SELECTOR, "div.flex-1 > div > h3").text.strip()

            # Extract product description
            description = card.find_element(By.CSS_SELECTOR, "div.flex-1 > p").text.strip()

            # Extract product URL
            product_url = card.find_element(By.CSS_SELECTOR, "a").get_attribute("href")

            # Append data
            data.append({
                "Product Name": name,
                "Description": description,
                "URL": product_url,
                "Page Number": page_number
            })
        except Exception as e:
            # Skip if any element is missing
            print(f"Error extracting data from a card: {e}")
            continue

    return data

# Main function to scrape multiple pages
def scrape_marketplace(max_pages, output_file):
    driver = setup_driver()
    all_data = []

    for page in range(1, max_pages + 1):
        # Scrape data from the current page
        page_data = scrape_page(driver, page)

        # If no data is found, stop scraping (end of pagination)
        if not page_data:
            print(f"No data found on page {page}. Stopping pagination.")
            break

        all_data.extend(page_data)

        # Introduce a delay to avoid overloading the server
        time.sleep(2)

    # Save data to CSV
    df = pd.DataFrame(all_data)
    df.to_csv(output_file, index=False, encoding="utf-8")
    print(f"Scraping completed. Data saved to {output_file}")

    driver.quit()

# Run the scraper
if __name__ == "__main__":
    MAX_PAGES = 50  # Adjust this based on the number of pages to scrape
    OUTPUT_FILE = "github_marketplace_actions.csv"
    scrape_marketplace(MAX_PAGES, OUTPUT_FILE)

Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Scraping page 21...
Scraping page 22...
Scraping page 23...
Scraping page 24...
Scraping page 25...
Scraping page 26...
Scraping page 27...
Scraping page 28...
Scraping page 29...
Scraping page 30...
Scraping page 31...
Scraping page 32...
Scraping page 33...
Scraping page 34...
Scraping page 35...
Scraping page 36...
Scraping page 37...
Scraping page 38...
Scraping page 39...
Scraping page 40...
Scraping page 41...
Scraping page 42...
Scraping page 43...
Scraping page 44...
Scraping page 45...
Scraping page 46...
Scraping page 47...
Scraping page 48...
Scraping page 49...
Scraping page 50...
Scraping 

In [None]:
!pip uninstall nltk
!pip install nltk==3.8.1
from nltk.tokenize import word_tokenize
text = "This is a test sentence."
tokens = word_tokenize(text)
print(tokens)


Found existing installation: nltk 3.8.1
Uninstalling nltk-3.8.1:
  Would remove:
    /usr/local/bin/nltk
    /usr/local/lib/python3.11/dist-packages/nltk-3.8.1.dist-info/*
    /usr/local/lib/python3.11/dist-packages/nltk/*
Proceed (Y/n)? y
  Successfully uninstalled nltk-3.8.1
Collecting nltk==3.8.1
  Using cached nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textblob 0.19.0 requires nltk>=3.9, but you have nltk 3.8.1 which is incompatible.[0m[31m
[0mSuccessfully installed nltk-3.8.1
['This', 'is', 'a', 'test', 'sentence', '.']


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re

# Ensure required NLTK resources are downloaded
try:
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('omw-1.4')
except Exception as e:
    print(f"Error downloading NLTK resources: {e}")

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Define stopwords
stop_words = set(stopwords.words('english'))

# Function to clean and preprocess text
def preprocess_text(text):
    # Handle non-string or missing values
    if not isinstance(text, str):
        return ""
    # Convert to lowercase
    text = text.lower()
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back into a single string
    return ' '.join(tokens)

# Load the dataset
data = pd.read_csv("github_marketplace_actions.csv")

# Handle missing values in the 'Description' column
data['Description'] = data['Description'].fillna("")

# Apply preprocessing to the 'Description' column
data['Cleaned Description'] = data['Description'].apply(preprocess_text)

# Save the cleaned data
data.to_csv("github_marketplace_cleaned.csv", index=False)
print("Preprocessing completed. Cleaned data saved to 'github_marketplace_cleaned.csv'.")

Preprocessing completed. Cleaned data saved to 'github_marketplace_cleaned.csv'.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


##prompt: I directly gave the assignment question, part 1 and it gave the follwoing code , had to replace the bearer token
#follwowing this I prompted saying part 2 and gave the question in, for the data cleaning

#https://drive.google.com/file/d/1aaKL3Z-_WnOJt-yj8Dilo9FwuIGzCiDi/view?usp=drive_link

In [None]:
pip install tweepy



In [None]:
import tweepy
import pandas as pd

# Step 1: Authenticate with the Twitter API
def authenticate_twitter(api_key, api_key_secret, bearer_token):
    client = tweepy.Client(bearer_token=bearer_token)
    return client

# Step 2: Scrape tweets based on hashtags
def scrape_tweets(client, query, max_tweets=200):
    tweets_data = []
    response = client.search_recent_tweets(query=query, max_results=10, tweet_fields=["id", "text", "author_id"])
    for tweet in response.data:
        tweets_data.append({
            "Tweet ID": tweet.id,
            "Username": tweet.author_id,
            "Text": tweet.text
        })
    return tweets_data

# Step 3: Save tweets to a CSV file
def save_to_csv(tweets_data, filename):
    df = pd.DataFrame(tweets_data)
    df.to_csv(filename, index=False, encoding="utf-8")
    print(f"Saved {len(tweets_data)} tweets to {filename}")

# Main function
if __name__ == "__main__":
    # Replace with your Twitter API credentials
    BEARER_TOKEN = "AAAAAAAAAAAAAAAAAAAAAD6gzQEAAAAAaWvc5HQu8ZXFFoj9zKswLY%2BGFj0%3DQPxF8sSF20hxmMtilv8YiiVMj6BiN3qiieFdZIHlZfp9zenYUI"

    # Authenticate with the Twitter API
    client = authenticate_twitter(None, None, BEARER_TOKEN)

    # Define the hashtag and number of tweets to scrape
    QUERY = "#machinelearning OR #artificialintelligence"
    MAX_TWEETS = 100  # Adjust the number of tweets to scrape

    # Scrape tweets
    tweets = scrape_tweets(client, QUERY, MAX_TWEETS)

    # Save tweets to a CSV file
    save_to_csv(tweets, "tweets_data.csv")

Saved 10 tweets to tweets_data.csv


In [None]:
import pandas as pd

# Step 4: Perform data cleaning
def clean_data(input_filename, output_filename):
    # Load the dataset
    df = pd.read_csv(input_filename)

    # Display initial dataset information
    print("Initial Dataset Info:")
    print(df.info())
    print("\nInitial Dataset Preview:")
    print(df.head())

    # 1. Remove duplicate rows
    df = df.drop_duplicates()
    print(f"\nNumber of duplicates removed: {len(df)}")

    # 2. Handle missing values
    # Check for missing values
    print("\nMissing values before cleaning:")
    print(df.isnull().sum())

    # Drop rows with missing values (if any)
    df = df.dropna()

    # Verify no missing values remain
    print("\nMissing values after cleaning:")
    print(df.isnull().sum())

    # 3. Standardize text data
    # Remove leading/trailing whitespace
    df['Text'] = df['Text'].str.strip()

    # Remove special characters (optional)
    df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)

    # Convert text to lowercase
    df['Text'] = df['Text'].str.lower()

    # 4. Validate data types
    # Ensure Tweet ID is a string
    df['Tweet ID'] = df['Tweet ID'].astype(str)

    # Display cleaned dataset information
    print("\nCleaned Dataset Info:")
    print(df.info())
    print("\nCleaned Dataset Preview:")
    print(df.head())

    # Save the cleaned dataset to a new CSV file
    df.to_csv(output_filename, index=False, encoding="utf-8")
    print(f"\nCleaned data saved to '{output_filename}'")

# Main function
if __name__ == "__main__":
    # Define input and output filenames
    input_filename = "tweets_data.csv"  # The raw data file
    output_filename = "cleaned_tweets_data.csv"  # The cleaned data file

    # Perform data cleaning
    clean_data(input_filename, output_filename)

Initial Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Tweet ID  10 non-null     int64 
 1   Username  10 non-null     int64 
 2   Text      10 non-null     object
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes
None

Initial Dataset Preview:
              Tweet ID             Username  \
0  1892333426964881545  1829231426673266688   
1  1892333419092205607  1557455754973544450   
2  1892333352843198589  1423805384259784718   
3  1892333046122086872  1885832559873122304   
4  1892332838243987617            555031989   

                                                Text  
0  Connected Intelligence, Boomi Locks Down APIs ...  
1  Artificial Intelligence (AI) focuses on creati...  
2  Machine Learning-Driven Optimization of Spent ...  
3  💼 AI in the Workplace\n\nBoss: "We need to hir...  
4  Explainable #AI for a Trustworthy Workpl

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog