<a href="https://colab.research.google.com/github/CharmikaSadhula/Charmika_INFO5731_Spring2025/blob/main/Charmika_Sadhula_Assignment_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [1]:
from IPython.display import Markdown

colab_link = "https://colab.research.google.com/drive/1proHiPs3_n9_qtKK9HXaCNO-BVgWMngV?usp=sharing"

md_text = f"## IMDB Reviews Scraping Project\nClick the link below to access the full project:\n\n[Open Project in Google Colab]({colab_link})"
display(Markdown(md_text))


## IMDB Reviews Scraping Project
Click the link below to access the full project:

[Open Project in Google Colab](https://colab.research.google.com/drive/1proHiPs3_n9_qtKK9HXaCNO-BVgWMngV?usp=sharing)

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [2]:
# Install required libraries
!pip install pandas nltk

# Import necessary libraries
import pandas as pd
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK resources
nltk.download('wordnet')
nltk.download('omw-1.4')

# Load dataset
df = pd.read_csv("/content/imdb_reviews_1500_fixed.csv")

# Ensure "Review" column exists
if "Review" not in df.columns:
    raise KeyError("The 'Review' column is missing in the dataset!")

# Remove rows with missing or empty reviews
df = df[df["Review"].notna()]  # Remove NaN values
df = df[df["Review"].str.strip() != "No review text"]  # Remove placeholder reviews

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define stopwords list
stop_words = set([
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
    "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her",
    "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs",
    "themselves", "what", "which", "who", "whom", "this", "that", "these", "those",
    "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had",
    "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if",
    "or", "because", "as", "until", "while", "of", "at", "by", "for", "with",
    "about", "against", "between", "into", "through", "during", "before", "after",
    "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over",
    "under", "again", "further", "then", "once", "here", "there", "when", "where",
    "why", "how", "all", "any", "both", "each", "few", "more", "most", "other",
    "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
    "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"
])

# Function to remove special characters and punctuation
def remove_special_characters(text):
    return re.sub(r'[^\w\s]', '', str(text))

# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', str(text))

# Function to remove stopwords
def remove_stopwords(text):
    words = text.split()
    words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(words)

# Function to convert text to lowercase
def convert_to_lowercase(text):
    return str(text).lower()

# Function for stemming
def apply_stemming(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

# Function for lemmatization
def apply_lemmatization(text):
    words = text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

# Step 1: Remove special characters and punctuation
df["No_Special_Characters"] = df["Review"].apply(remove_special_characters)

# Step 2: Remove numbers
df["No_Numbers"] = df["No_Special_Characters"].apply(remove_numbers)

# Step 3: Remove stopwords
df["No_Stopwords"] = df["No_Numbers"].apply(remove_stopwords)

# Step 4: Convert text to lowercase
df["Lowercased"] = df["No_Stopwords"].apply(convert_to_lowercase)

# Step 5: Apply stemming
df["Stemmed"] = df["Lowercased"].apply(apply_stemming)

# Step 6: Apply lemmatization
df["Lemmatized"] = df["Lowercased"].apply(apply_lemmatization)

print(df)



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


            Movie Reviewer  Rating Review Date  \
0          Salaar  Unknown     6.0     Unknown   
1          Salaar  Unknown     7.0     Unknown   
2          Salaar  Unknown     7.0     Unknown   
4          Salaar  Unknown    10.0     Unknown   
5          Salaar  Unknown     6.0     Unknown   
...           ...      ...     ...         ...   
1493  Oppenheimer  Unknown     8.0     Unknown   
1495  Oppenheimer  Unknown    10.0     Unknown   
1496  Oppenheimer  Unknown     6.0     Unknown   
1497  Oppenheimer  Unknown     5.0     Unknown   
1498  Oppenheimer  Unknown     NaN     Unknown   

                                                 Review  \
0     Synopsis: The film invests significant time in...   
1     Salaar has been my most-awaited film for 2023....   
2     This time Prashanth's magic didn't worked out....   
4     Full of action and storylines.If you like acti...   
5     There is no reason or point to watch the first...   
...                                          

In [3]:
# Replace 'df' with the name of your DataFrame if different
csv_file_path = "/content/Cleaned_IMDB_Reviews_dataset.csv"
df.to_csv(csv_file_path, index=False)

# Print the file path for reference
print(f"CSV file saved at: {csv_file_path}")


CSV file saved at: /content/Cleaned_IMDB_Reviews_dataset.csv


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [12]:
!pip install benepar


Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.6.0->benepar)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.6.0->benepar)
  Downloading nvidia_cublas_

In [4]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m85.5 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [1]:
import benepar
benepar.download('benepar_en3')


[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


True

In [6]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m43.6 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
print("SpaCy model loaded successfully!")


SpaCy model loaded successfully!


In [3]:
import pandas as pd
import spacy
from collections import Counter

# Load the cleaned dataset
file_path = "/content/Cleaned_IMDB_Reviews_dataset.csv"
df = pd.read_csv(file_path)

# Load the SpaCy model
nlp = spacy.load("en_core_web_sm")

# Function to get POS counts
def pos_counts(text):
    doc = nlp(text)
    pos_tags = [token.pos_ for token in doc]
    return dict(Counter(pos_tags))

# Apply POS tagging
df["POS_Tags"] = df["Review"].apply(pos_counts)

# Calculate total Noun (N), Verb (V), Adjective (Adj), Adverb (Adv)
total_pos_counts = Counter()
for pos_dict in df["POS_Tags"]:
    total_pos_counts.update(pos_dict)

# Display POS counts
print("Total POS Counts:")
print(f"Nouns (N): {total_pos_counts['NOUN']}")
print(f"Verbs (V): {total_pos_counts['VERB']}")
print(f"Adjectives (Adj): {total_pos_counts['ADJ']}")
print(f"Adverbs (Adv): {total_pos_counts['ADV']}")


Total POS Counts:
Nouns (N): 63100
Verbs (V): 38540
Adjectives (Adj): 33740
Adverbs (Adv): 21640


In [4]:
import benepar
import nltk
from spacy import displacy

# Load benepar model
nltk.download('benepar')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

# Choose one example sentence
example_sentence = df["Review"].iloc[0]  # First review

# Parse the sentence
doc = nlp(example_sentence)

# Constituency Parsing
constituency_parse = list(doc.sents)[0]._.parse_string
print("Constituency Parsing Tree:")
print(constituency_parse)

# Dependency Parsing Visualization
displacy.render(doc, style="dep", jupyter=True)


[nltk_data] Error loading benepar: Package 'benepar' not found in
[nltk_data]     index
  state_dict = torch.load(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache

Constituency Parsing Tree:
(NP (NP (NN Synopsis)) (: :) (S (NP (DT The) (NN film)) (VP (VP (VBZ invests) (NP (JJ significant) (NN time)) (PP (IN in) (NP (NN world) (HYPH -) (NN building)))) (CC but) (VP (VBZ falls) (ADVP (RB short) (PP (IN of) (S (VP (VBG captivating) (NP (DT the) (NN audience)) (PP (IN within) (NP (PRP$ its) (NN narrative)))))))))) (. .))




In [15]:
import spacy
import pandas as pd
from collections import Counter

# Load the cleaned dataset
file_path = "/content/Cleaned_IMDB_Reviews_dataset.csv"  # Update with the correct path
df = pd.read_csv(file_path)

# Load a lightweight SpaCy model for faster processing
nlp = spacy.load("en_core_web_sm")  # Faster than "en_core_web_trf"

# Function to extract named entities efficiently
def extract_named_entities(doc):
    entities = {"PERSON": [], "ORG": [], "GPE": [], "PRODUCT": [], "DATE": []}
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_].append(ent.text)
    return entities

# Process all reviews in batches using nlp.pipe() for faster execution
docs = list(nlp.pipe(df["Review"].astype(str), batch_size=50))  # Adjust batch_size for performance
df["NER_Entities"] = [extract_named_entities(doc) for doc in docs]

# Function to count entity occurrences
def count_entities(entity_label):
    return Counter(ent for sublist in df["NER_Entities"].apply(lambda x: x[entity_label]) for ent in sublist)

# Aggregate entity counts
entity_counts = {label: count_entities(label) for label in ["PERSON", "ORG", "GPE", "PRODUCT", "DATE"]}

# Convert entity counts to DataFrame for display
entity_counts_df = pd.DataFrame.from_dict(entity_counts, orient='index').transpose().fillna(0)

# Save the dataset with NER results
ner_file_path = "/content/imdb_reviews_1500_with_ner.csv"
df.to_csv(ner_file_path, index=False)

print(f" Data saved to {ner_file_path}")


 Data saved to /content/imdb_reviews_1500_with_ner.csv


# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [8]:
!pip install requests beautifulsoup4 pandas




In [16]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Base URL for GitHub Marketplace Actions
BASE_URL = "https://github.com/marketplace?type=actions"

# Headers to mimic real browser behavior
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://github.com/",
    "Connection": "keep-alive"
}

# Create a session to maintain cookies
session = requests.Session()

# Function to scrape a single page
def scrape_page(page_num):
    url = f"{BASE_URL}&page={page_num}"

    print(f"Scraping page {page_num}...")

    response = session.get(url, headers=HEADERS)

    # Handling bad request errors
    if response.status_code == 400:
        print(f"Failed to load page {page_num}, Status Code: 400. Retrying with new session...")
        time.sleep(5)  # Pause before retrying
        session.cookies.clear()  # Clear cookies before retrying
        return scrape_page(page_num)  # Retry with a new request

    if response.status_code != 200:
        print(f"Failed to load page {page_num}, Status Code: {response.status_code}")
        return []  # Skip failed pages

    soup = BeautifulSoup(response.text, "html.parser")
    actions = []

    for item in soup.find_all("div", class_="marketplace-common-module__marketplace-item--MohVH"):
        try:
            name = item.find("h3").get_text(strip=True)
            description = item.find("p").get_text(strip=True)
            link = "https://github.com" + item.find("a")["href"]

            actions.append({
                "Product Name": name,
                "Description": description,
                "URL": link,
                "Page Number": page_num
            })
        except Exception as e:
            print(f"Error parsing item on page {page_num}: {e}")

    return actions

# Scrape 50 pages (20 products per page = 1000 products)
all_actions = []
for page in range(1, 51):
    all_actions.extend(scrape_page(page))

    if len(all_actions) >= 1000:  # Stop at 1000 products
        break

    time.sleep(random.uniform(5, 10))  # Increase delay to avoid blocking

# Convert to DataFrame
df = pd.DataFrame(all_actions[:1000])

# Save CSV
csv_file_path = "/content/github_marketplace_1000_products.csv"
df.to_csv(csv_file_path, index=False)

print(f"Scraping completed! 1000 products saved to {csv_file_path}")


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Scraping page 21...
Scraping page 22...
Scraping page 23...
Scraping page 24...
Scraping page 25...
Scraping page 26...
Scraping page 27...
Scraping page 28...
Scraping page 29...
Scraping page 30...
Scraping page 31...
Scraping page 32...
Scraping page 33...
Scraping page 34...
Scraping page 35...
Scraping page 36...
Scraping page 37...
Scraping page 38...
Scraping page 39...
Scraping page 40...
Scraping page 41...
Scraping page 42...
Scraping page 43...
Scraping page 44...
Scraping page 45...
Scraping page 46...
Scraping page 47...
Scraping page 48...
Scraping page 49...
Scraping page 50...
Scraping 

In [2]:
!pip uninstall -y nltk
!pip install --no-cache-dir nltk


Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m26.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: nltk
Successfully installed nltk-3.9.1


In [3]:
!pip uninstall -y nltk
!pip install --no-cache-dir nltk


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


NLTK Data Downloaded Successfully.


In [1]:
import nltk

# Remove existing nltk_data if needed
nltk.data.path.append('/usr/share/nltk_data/')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

print("NLTK data successfully downloaded!")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


NLTK data successfully downloaded!


[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
import nltk
from nltk.data import find

# Set custom path if necessary
nltk.data.path.append('/usr/share/nltk_data/')

# Check if Punkt tokenizer is available
try:
    find('tokenizers/punkt.zip')
    print("Punkt tokenizer is installed correctly.")
except LookupError:
    print("Punkt tokenizer not found! Downloading manually...")
    nltk.download('punkt')


Punkt tokenizer is installed correctly.


In [4]:
!pip install spacy

import spacy
nlp = spacy.load("en_core_web_sm")

def tokenize_text(text):
    doc = nlp(text)
    return [token.text for token in doc]

test_text = "This is a test sentence for tokenization."
tokens = tokenize_text(test_text)

print(tokens)  # Expected output: ['This', 'is', 'a', 'test', 'sentence', 'for', 'tokenization', '.']


['This', 'is', 'a', 'test', 'sentence', 'for', 'tokenization', '.']


In [5]:
import spacy
import pandas as pd

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Function to clean text
def clean_text(text):
    if not isinstance(text, str):
        return ""  # Return empty string if not a valid text input

    # Process text with spaCy
    doc = nlp(text.lower())  # Convert to lowercase

    # Remove stopwords and punctuation, keep only meaningful words
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

    # Join tokens back into a clean sentence
    return " ".join(tokens)

# Load GitHub Products dataset
df_github = pd.read_csv("/content/github_marketplace_1000_products.csv")

# Apply text cleaning function to the "Description" column
df_github["cleaned_description"] = df_github["Description"].apply(clean_text)

# Save cleaned data for further analysis
df_github.to_csv("/content/github_cleaned_products.csv", index=False)

print("Data cleaning completed successfully! ")


Data cleaning completed successfully! 


#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [6]:
pip install tweepy pandas python-dotenv


Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [9]:
import tweepy
import pandas as pd
import os
from dotenv import load_dotenv

# Load API credentials from .env file
load_dotenv('/content/API KEY.env')

API_KEY = os.getenv("API_KEY")
API_SECRET_KEY = os.getenv("API_SECRET_KEY")
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")
ACCESS_TOKEN_SECRET = os.getenv("ACCESS_TOKEN_SECRET")
BEARER_TOKEN = os.getenv("BEARER_TOKEN")

# Authenticate with Twitter API v2
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Define search parameters
query = "machine learning OR artificial intelligence -is:retweet lang:en"
max_results = 100  # Maximum per request

# Fetch tweets
tweets = client.search_recent_tweets(query=query, max_results=max_results, tweet_fields=["id", "text", "author_id"])

# Store tweets in a structured DataFrame
tweet_data = []
for tweet in tweets.data:
    tweet_data.append({
        "Tweet ID": tweet.id,
        "Username": tweet.author_id,  # Fetching the author ID instead of username
        "Text": tweet.text
    })

# Convert to Pandas DataFrame
df = pd.DataFrame(tweet_data)

# Save to CSV
df.to_csv("\content\tweets_data.csv", index=False)

print("Scraping completed. Tweets saved to 'tweets_data.csv'.")


TooManyRequests: 429 Too Many Requests
Too Many Requests

In [12]:
import pandas as pd
import re
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Load the tweets dataset
file_path = "/content/tweets_data.csv"  # Update the file path if needed
df = pd.read_csv(file_path)

# Display column names to verify
print("Columns in dataset:", df.columns)

# Use the correct column name for text
text_column = "Text"  # Adjust if necessary

# Function to clean tweet text
def clean_tweet_text(text):
    if pd.isna(text):  # Handle missing values
        return ""

    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)

    # Remove special characters, numbers, and punctuations
    text = re.sub(r"[^a-zA-Z\s]", "", text)

    # Tokenization using spaCy
    doc = nlp(text.lower())  # Convert to lowercase and process text

    # Remove stopwords and lemmatize
    cleaned_text = " ".join([token.lemma_ for token in doc if not token.is_stop])

    return cleaned_text

# Apply cleaning function to the correct text column
df["cleaned_text"] = df[text_column].apply(clean_tweet_text)

# Save cleaned data
df.to_csv("/content/cleaned_tweets.csv", index=False)
print("Cleaned data saved as 'cleaned_tweets.csv'")


Columns in dataset: Index(['Tweet ID', 'Username', 'Text'], dtype='object')
Cleaned data saved as 'cleaned_tweets.csv'


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog