<a href="https://colab.research.google.com/github/DhannajayaPaliwal12/Dhananjaya_INFO5731_Fall2025/blob/main/Paliwal_Dhananjaya_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [None]:
# ! pip install selenium
# ! pip install webdriver-manager

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.firefox import GeckoDriverManager
import time
import pandas as pd

def scrape_imdb_reviews(url, scroll_pause=2, max_scrolls=20):
    ### Setting up configurations for the webdriver
    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")
    driver = webdriver.Firefox(
        options=options
    )
    wait = WebDriverWait(driver, 10)

    print(f"Opening page: {url}")
    driver.get(url)

    ### Click the main "See all" button
    try:
        see_all_btn = wait.until(
            EC.element_to_be_clickable((By.XPATH, "//span[contains(@class,'ipc-see-more__text') and text()='See all']"))
        )
        driver.execute_script("arguments[0].click();", see_all_btn)
        print("Clicked the 'See all' button to view all reviews")
        time.sleep(3)
    except Exception:
        print("Could not find 'See all' button, scraping current page only")

    ### Scrolling to load all reviews
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            print("Reached bottom of page")
            break
        last_height = new_height
        scrolls += 1
        print(f"Completed {scrolls} scrolls")

    ### Expand the spoiler reviews
    print("Expanding spoiler reviews")
    spoiler_buttons = driver.find_elements(By.CSS_SELECTOR, "button.review-spoiler-button")
    expanded = 0
    for btn in spoiler_buttons:
        try:
            driver.execute_script("arguments[0].click();", btn)
            expanded += 1
            time.sleep(0.3)
        except Exception:
            pass
    print(f"✅ Expanded {expanded} spoiler reviews")

    ### Collecting all review containers (capped at the first 1000)
    print("Extracting review data from all containers")
    reviews = []
    review_blocks = driver.find_elements(By.CSS_SELECTOR, "article.user-review-item")

    ### CSS selector for each field; a field missing from a review is stored as None
    selectors = {
        "rating": ".ipc-rating-star--rating",
        "title": "div[data-testid='review-summary']",
        "author": "a[data-testid='author-link']",
        "date": "li.review-date",
        "text": "div.ipc-html-content",
        "helpful": ".ipc-voting__label__count--up",
    }

    for idx, block in enumerate(review_blocks[:1000], 1):
        review = {}
        for field, selector in selectors.items():
            try:
                review[field] = block.find_element(By.CSS_SELECTOR, selector).text
            except Exception:
                review[field] = None
        reviews.append(review)

        if idx % 100 == 0:
            print(f"Extracted {len(reviews)} reviews...")

    driver.quit()
    print(f"Finished Extracting {len(reviews)} reviews.")
    return reviews


if __name__ == "__main__":
    imdb_url = "https://www.imdb.com/title/tt15398776/reviews/"
    data = scrape_imdb_reviews(imdb_url, scroll_pause=2, max_scrolls=30)
    data = pd.DataFrame(data)

    data.to_csv("imdb_reviews.csv", index=False)


Opening page: https://www.imdb.com/title/tt15398776/reviews/
Clicked the 'See all' button to view all reviews
Completed 1 scrolls
Completed 2 scrolls
Completed 3 scrolls
Completed 4 scrolls
Completed 5 scrolls
Completed 6 scrolls
Completed 7 scrolls
Completed 8 scrolls
Completed 9 scrolls
Completed 10 scrolls
Completed 11 scrolls
Completed 12 scrolls
Completed 13 scrolls
Completed 14 scrolls
Completed 15 scrolls
Completed 16 scrolls
Completed 17 scrolls
Completed 18 scrolls
Completed 19 scrolls
Completed 20 scrolls
Completed 21 scrolls
Completed 22 scrolls
Completed 23 scrolls
Completed 24 scrolls
Completed 25 scrolls
Completed 26 scrolls
Completed 27 scrolls
Completed 28 scrolls
Completed 29 scrolls
Completed 30 scrolls
Expanding spoiler reviews
✅ Expanded 140 spoiler reviews
Extracting review data from all containers
Extracted 100 reviews...


# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.
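
Steps (1) to (4) can be illustrated on a single string using only the standard library. This is a minimal sketch, not the assignment solution: `TINY_STOPWORDS` is a small hypothetical stand-in for the NLTK stopword list, and `basic_clean` and the sample sentence are illustrative names and data. Stemming and lemmatization are left out because they need NLTK resources.

```python
import re

# Hypothetical, deliberately tiny stand-in for NLTK's English stopword list.
TINY_STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to", "was"}


def basic_clean(text):
    """Strip noise, strip digits, lowercase, then drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # (1) punctuation and symbols -> space
    text = re.sub(r"\d+", " ", text)             # (2) runs of digits -> space
    tokens = text.lower().split()                # (4) lowercase, then split on whitespace
    kept = [t for t in tokens if t not in TINY_STOPWORDS]  # (3) stopword filter
    return " ".join(kept)


sample = "The film was released in 2023, and it's a 3-hour epic!"
print(basic_clean(sample))  # film released s hour epic
```

Lowercasing before the stopword check means capitalised stopwords such as "The" are removed as well. The stray `s` left over from `it's` shows why stopword lists usually also contain contraction fragments.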

In [4]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word

imdb_data = pd.read_csv("imdb_reviews.csv")

### Removing noise from the reviews text.
def remove_special_characters(text):
    return re.sub(r"[^a-zA-Z0-9\s]", " ", str(text))

imdb_data["no_noise"] = imdb_data["text"].apply(remove_special_characters)
print("Removing noise from the reviews text\n", imdb_data[["text", "no_noise"]])


### Removing numbers from the reviews text.
def remove_numbers(text):
    return re.sub(r"\d+", " ", str(text))

imdb_data["no_numbers"] = imdb_data["no_noise"].apply(remove_numbers)
print("Removing numbers from the reviews text\n", imdb_data[["text", "no_numbers"]])

### Removing stopwords by using the stopwords list.
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
def remove_stopwords(text):
    tokens = str(text).split()
    filtered = [word for word in tokens if word.lower() not in stop_words]
    return " ".join(filtered)

imdb_data["no_stopwords"] = imdb_data["no_numbers"].apply(remove_stopwords)
print("Removing stopwords from the reviews text\n", imdb_data[["text", "no_stopwords"]])

### Lowercase all texts
imdb_data["lowercase"] = imdb_data["no_stopwords"].str.lower()
print("Lowercasing all texts\n", imdb_data[["text", "lowercase"]])

### Stemming
st = PorterStemmer()
imdb_data["stemmed"] = imdb_data['lowercase'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
print("Stemming\n", imdb_data[["text", "stemmed"]])

### Lemmatization
nltk.download('wordnet')
imdb_data['lemmatized'] = imdb_data['stemmed'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
print("Lemmatizing\n", imdb_data[["text", "lemmatized"]])

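### Keeping only the final cleaned column and saving the result to a CSV file
imdb_data = imdb_data.drop(columns=["no_noise", "no_numbers", "no_stopwords", "lowercase", "stemmed"])
imdb_data = imdb_data.rename(columns={"lemmatized": "cleaned_reviews"})
imdb_data.to_csv("imdb_reviews_cleaned.csv", index=False)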

Removing noise from the reviews text
                                                   text  \
0    You'll have to have your wits about you and yo...   
1    One of the most anticipated films of the year ...   
2    I'm a big fan of Nolan's work so was really lo...   
3    I'm still collecting my thoughts after experie...   
4    "Oppenheimer" is a biographical thriller film ...   
..                                                 ...   
995  Simply extraordinary, a film that goes beyond ...   
996  I was fully in since the first scene, what an ...   
997  For films like Oppenheimer, we fell in love wi...   
998  "Oppenheimer," directed by Christopher Nolan, ...   
999  In summary, Oppenheimer fulfils its promise of...   

                                              no_noise  
0    You ll have to have your wits about you and yo...  
1    One of the most anticipated films of the year ...  
2    I m a big fan of Nolan s work so was really lo...  
3    I m still collecting my thoughts

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Stemming
                                                   text  \
0    You'll have to have your wits about you and yo...   
1    One of the most anticipated films of the year ...   
2    I'm a big fan of Nolan's work so was really lo...   
3    I'm still collecting my thoughts after experie...   
4    "Oppenheimer" is a biographical thriller film ...   
..                                                 ...   
995  Simply extraordinary, a film that goes beyond ...   
996  I was fully in since the first scene, what an ...   
997  For films like Oppenheimer, we fell in love wi...   
998  "Oppenheimer," directed by Christopher Nolan, ...   
999  In summary, Oppenheimer fulfils its promise of...   

                                               stemmed  
0    wit brain fulli switch watch oppenheim could e...  
1    one anticip film year mani peopl includ oppenh...  
2    big fan nolan work realli look forward underst...  
3    still collect thought experienc film cillian m...  
4    opp

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatizing
                                                   text  \
0    You'll have to have your wits about you and yo...   
1    One of the most anticipated films of the year ...   
2    I'm a big fan of Nolan's work so was really lo...   
3    I'm still collecting my thoughts after experie...   
4    "Oppenheimer" is a biographical thriller film ...   
..                                                 ...   
995  Simply extraordinary, a film that goes beyond ...   
996  I was fully in since the first scene, what an ...   
997  For films like Oppenheimer, we fell in love wi...   
998  "Oppenheimer," directed by Christopher Nolan, ...   
999  In summary, Oppenheimer fulfils its promise of...   

                                            lemmatized  
0    wit brain fulli switch watch oppenheim could e...  
1    one anticip film year mani peopl includ oppenh...  
2    big fan nolan work realli look forward underst...  
3    still collect thought experienc film cillian m...  
4    

# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.
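
For the counting part of (1), one common approach with Penn Treebank style tags is to group fine-grained tags by prefix (`NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives, `RB*` for adverbs). The sketch below assumes the tokens are already tagged. The `tagged` list is hand-written for illustration and is not output from any tagger, and `count_coarse_pos` is a hypothetical helper name.

```python
from collections import Counter

# Prefix of the Penn Treebank tag -> coarse category requested in the question.
PREFIX_TO_CATEGORY = {"NN": "Noun", "VB": "Verb", "JJ": "Adj", "RB": "Adv"}


def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (word, tag) pairs."""
    counts = Counter({category: 0 for category in PREFIX_TO_CATEGORY.values()})
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category is not None:  # skip determiners, punctuation, and so on
            counts[category] += 1
    return dict(counts)


# Hand-labelled example tokens (illustrative only).
tagged = [
    ("the", "DT"), ("film", "NN"), ("quickly", "RB"),
    ("won", "VBD"), ("many", "JJ"), ("awards", "NNS"),
]
print(count_coarse_pos(tagged))
```

spaCy's `token.pos_` already returns coarse tags such as `NOUN` and `VERB`, so this grouping only matters when working with fine-grained tags such as `token.tag_` or the output of NLTK's tagger.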

In [5]:
# !pip install benepar

import spacy
from collections import Counter
import benepar

benepar.download('benepar_en3')
nlp = spacy.load("en_core_web_sm")

# add benepar (this sets up constituency parsing)
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

texts = imdb_data["cleaned_reviews"].dropna().tolist()[:4]

docs = list(nlp.pipe(texts))  # these are Doc objects

### POS tagging
pos_counts = Counter([token.pos_ for doc in docs for token in doc])
print("POS counts:", {k: pos_counts[k] for k in ["NOUN", "VERB", "ADJ", "ADV"]})

### Dependency Parsing (first sentence of first doc)
print("\nDependency Parsing Tree (example sentence)")
first_doc = docs[0]
first_sent = list(first_doc.sents)[0]  # take first sentence span
for token in first_sent:
    print(f"{token.text:<12} {token.dep_:<10} {token.head.text:<12} POS={token.pos_}")

### Constituency Parsing (first sentence only)
print("\nConstituency Parsing Tree (example sentence)")
print(first_sent._.parse_string)   # benepar exposes parse_string on sentence Spans

print("""
Explanation:
- Dependency Parsing shows grammatical links, with 'say' as the ROOT and other words
like 'quit' or 'make' connected as clauses. It also captures relations such as 'viewer'
being the object of 'get' and 'film' the object of 'watch'.
- Constituency Parsing groups words into phrases, e.g., (NP viewer) as a noun phrase
and (VP could get away) as a verb phrase. This reveals how nouns, verbs, and modifiers
combine to build the sentence structure.
""")

### Named Entity Recognition
ner_counts = Counter([ent.label_ for doc in docs for ent in doc.ents])
print("NER counts:", dict(ner_counts))

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


POS counts: {'NOUN': 166, 'VERB': 83, 'ADJ': 61, 'ADV': 30}

Dependency Parsing Tree (example sentence)
wit          compound   switch       POS=PROPN
brain        compound   switch       POS=NOUN
fulli        compound   switch       POS=PROPN
switch       nsubj      watch        POS=NOUN
watch        csubj      absolut      POS=VERB
oppenheim    dobj       watch        POS=PROPN
could        punct      watch        POS=AUX
easili       conj       watch        POS=NOUN
get          advcl      watch        POS=VERB
away         advmod     get          POS=ADV
nonattent    compound   viewer       POS=NOUN
viewer       dobj       get          POS=NOUN
intellig     advcl      watch        POS=NOUN
filmmak      nsubj      show         POS=PROPN
show         ccomp      watch        POS=PROPN
audienc      prep       watch        POS=PROPN
great        amod       respect      POS=ADJ
respect      compound   fire         POS=NOUN
fire         dobj       watch        POS=NOUN
dialogu      amod  

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape data from the GitHub Marketplace to gather details about popular actions. Using Python, begin by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details to extract are the product name, a short description, and the URL.

Store the extracted data in a structured CSV format with columns for product name, description, URL, and page number. Introduce a time delay between requests to avoid server overload. ChatGPT can assist with parsing the HTML, handling errors, and generating reports based on the data collected.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.
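
The pagination and CSV layout described above can be sketched without a browser. This is an illustration only: `build_page_urls` and `rows_to_csv` are hypothetical helper names, the `page` query parameter is an assumption about how the marketplace paginates, and the sample row is made up.

```python
import csv
import io

BASE_URL = "https://github.com/marketplace?type=actions"
FIELDNAMES = ["name", "description", "url", "page"]


def build_page_urls(pages):
    """Return one listing URL per page, numbered from 1."""
    return [f"{BASE_URL}&page={n}" for n in range(1, pages + 1)]


def rows_to_csv(rows):
    """Serialize scraped rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


urls = build_page_urls(3)
sample_rows = [  # made-up row, for illustration only
    {"name": "Example Action", "description": "Made-up description",
     "url": "https://example.com/action", "page": 1},
]
print(urls[0])
print(rows_to_csv(sample_rows))
```

In a real run, a `time.sleep` call between page loads would provide the delay between requests.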

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
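
The completeness and duplicate checks described above can be sketched on plain dictionaries. This is a minimal sketch under assumptions: the `REQUIRED` column set, the `quality_report` helper, and the sample records are all hypothetical, and duplicates are defined here as rows sharing the same `url`.

```python
REQUIRED = ("name", "url")


def quality_report(records):
    """Split records into clean rows and counts of rejected rows."""
    seen_urls = set()
    clean, missing, duplicates = [], 0, 0
    for row in records:
        # Completeness: every required field must be present and non-empty.
        if any(not (row.get(col) or "").strip() for col in REQUIRED):
            missing += 1
            continue
        # Uniqueness: keep only the first row seen for each URL.
        if row["url"] in seen_urls:
            duplicates += 1
            continue
        seen_urls.add(row["url"])
        clean.append(row)
    return clean, {"missing": missing, "duplicates": duplicates}


records = [  # made-up records, for illustration only
    {"name": "Action A", "url": "https://example.com/a"},
    {"name": "Action A", "url": "https://example.com/a"},  # duplicate URL
    {"name": "", "url": "https://example.com/b"},          # missing name
    {"name": "Action C", "url": None},                     # missing url
]
clean, report = quality_report(records)
print(len(clean), report)
```

With pandas, the same checks correspond to `isna().sum()`, `drop_duplicates()`, and `dropna(subset=[...])`.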


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [6]:
def scrape_github_marketplace(pages=20, delay=2):
    ### Setting up configurations for the webdriver
    options = webdriver.FirefoxOptions()
    options.set_preference("general.useragent.override",
                       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/122.0.0.0 Safari/537.36")
    options.add_argument("--headless")

    driver = webdriver.Firefox(
        options=options
    )
    wait = WebDriverWait(driver, 10)
    base_url = "https://github.com/marketplace?type=actions&page={}"

    data = []
    ### Starting to go through each page
    for page in range(1, pages + 1):
        url = base_url.format(page)
        print(f"Opening page {page}: {url}")
        driver.get(url)
        time.sleep(delay)  # polite delay between page loads

        try:
            ### Waiting until the action links have been rendered
            wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a[href*='/actions/']")))
        except Exception as e:
            print(f"Timeout loading page {page}: {e}")
            continue

        cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-testid="non-featured-item"]')
        print("Found:", len(cards))

        for card in cards:
            try:
                ### Getting the product link and product name
                link_elem = card.find_element(By.CSS_SELECTOR, "a.marketplace-common-module__marketplace-item-link--AeQSq")
                product_name = link_elem.text.strip()
                product_url = link_elem.get_attribute("href")
            except Exception:
                product_name, product_url = None, None

            try:
                ### Getting the short description
                description = card.find_element(By.CSS_SELECTOR, "p.fgColor-muted").text.strip()
            except Exception:
                description = None

            data.append({
                "name": product_name,
                "url": product_url,
                "description": description,
                "page": page
            })

        print(f"Extracted {len(cards)} items from page {page}")

    driver.quit()
    print(f"\n Finished scraping {len(data)} items.")
    return data

if __name__ == "__main__":
    scraped_data = scrape_github_marketplace(pages=50, delay=3)  # ~1000 items
    df = pd.DataFrame(scraped_data)
    df.to_csv("github_marketplace_actions.csv", index=False)
    print("Data saved to github_marketplace_actions.csv")


Opening page 1: https://github.com/marketplace?type=actions&page=1
Found: 20
Extracted 20 items from page 1
Opening page 2: https://github.com/marketplace?type=actions&page=2
Found: 20
Extracted 20 items from page 2
Opening page 3: https://github.com/marketplace?type=actions&page=3
Found: 20
Extracted 20 items from page 3
Opening page 4: https://github.com/marketplace?type=actions&page=4
Found: 20
Extracted 20 items from page 4
Opening page 5: https://github.com/marketplace?type=actions&page=5
Found: 20
Extracted 20 items from page 5
Opening page 6: https://github.com/marketplace?type=actions&page=6
Found: 20
Extracted 20 items from page 6
Opening page 7: https://github.com/marketplace?type=actions&page=7
Found: 20
Extracted 20 items from page 7
Opening page 8: https://github.com/marketplace?type=actions&page=8
Found: 20
Extracted 20 items from page 8
Opening page 9: https://github.com/marketplace?type=actions&page=9
Found: 20
Extracted 20 items from page 9
Opening page 10: https://git

In [7]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt_tab')

print("Initial shape of the dataframe:", df.shape)

### Performing Data Quality Checks
print("\nMissing values per column:")
print(df.isna().sum())

print("\nDuplicate rows:", df.duplicated().sum())
df.drop_duplicates(inplace=True)

### Dropping rows missing critical fields (name or url)
df.dropna(subset=["name", "url"], inplace=True)

print("Shape of dataframe after cleaning:", df.shape)

### Preprocessing text
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    if pd.isna(text):
        return ""
    ### Removing HTML tags
    text = re.sub(r"<.*?>", " ", text)
    ### Keeping only letters
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    ### Lowercase
    text = text.lower()
    ### Tokenize
    tokens = word_tokenize(text)
    ### Remove stopwords + lemmatize
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return " ".join(tokens)

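### Apply preprocessing (clean_text handles missing values itself, so no astype(str),
### which would turn NaN into the literal string "nan")
df["description_clean"] = df["description"].apply(clean_text)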

### Getting a quick summary of cleaned text
all_words = " ".join(df["description_clean"]).split()
freq = Counter(all_words).most_common(20)

print("\nTop 20 most common words in descriptions:")
print(freq)

df.to_csv("github_marketplace_actions_clean.csv", index=False)
print("Saved cleaned data to github_marketplace_actions_clean.csv")


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Initial shape of the dataframe: (1000, 4)

Missing values per column:
name           0
url            0
description    0
page           0
dtype: int64

Duplicate rows: 0
Shape of dataframe after cleaning: (1000, 4)

Top 20 most common words in descriptions:
[('github', 354), ('action', 321), ('run', 118), ('request', 91), ('pull', 85), ('code', 82), ('file', 82), ('build', 67), ('workflow', 63), ('using', 61), ('release', 51), ('deploy', 50), ('repository', 48), ('pr', 45), ('automatically', 45), ('issue', 42), ('project', 41), ('install', 41), ('version', 40), ('check', 40)]
Saved cleaned data to github_marketplace_actions_clean.csv


# Question 5 (20 points)

PART 1:
Collect tweets from Twitter using the Tweepy library, specifically targeting hashtags related to the subtopics (machine learning or artificial intelligence).
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [8]:
import tweepy
from google.colab import userdata

BEARER_TOKEN = userdata.get('X_BEARER_TOKEN')

client = tweepy.Client(bearer_token=BEARER_TOKEN)

def fetch_tweets_for_hashtag(hashtag, max_tweets=100):
    query = f"{hashtag} -is:retweet"

    resp = client.search_recent_tweets(
        query=query,
        tweet_fields=["id", "text", "author_id", "created_at"],
        expansions=["author_id"],
        user_fields=["username"],
        max_results=max(10, min(max_tweets, 100))  # the endpoint accepts 10 to 100 per request
    )

    results = []
    if not resp.data:
        return results

    ### Build a dict from user_id → user object to map usernames
    users = {}
    if resp.includes and "users" in resp.includes:
        for user in resp.includes["users"]:
            users[user.id] = user

    ### For each tweet, combine with user info
    for tw in resp.data:
        user = users.get(tw.author_id)
        username = user.username if user else None
        results.append({
            "tweet_id": tw.id,
            "username": username,
            "text": tw.text
        })

    return results

if __name__ == "__main__":
    # hashtags = ["#machinelearning", "#artificialintelligence"]
    hashtags = ["#machinelearning"]
    all_tweets = []
    for i, tag in enumerate(hashtags):
        if i > 0:  # Add a delay before fetching the next hashtag
            print(f"\nPausing for 60 seconds before fetching tweets for {tag}\n")
            time.sleep(60)  # Pause for 60 seconds to avoid rate limits
        tweets = fetch_tweets_for_hashtag(tag, max_tweets=100)
        print(f"\nFetched {len(tweets)} tweets for {tag}\n")
        all_tweets.extend(tweets)
    all_tweets = pd.DataFrame(all_tweets)
    all_tweets.to_csv("tweets.csv", index=False)

TooManyRequests: 429 Too Many Requests
Usage cap exceeded: Monthly product cap

In [9]:
tweets = pd.read_csv("tweets.csv")
print("Dataset Description\n")
print(tweets.info())

print("\nFirst 5 rows\n", tweets.head())

tweets = tweets.drop_duplicates(subset="tweet_id", keep="first")
print("\nNumber of Null values in each column\n", tweets.isnull().sum())

### Dropping rows where tweet_id or text is missing
tweets = tweets.dropna(subset=["tweet_id", "text"])

def clean_text(text):
    text = re.sub(r"http\S+", "", text)         # remove URLs
    text = re.sub(r"@\w+", "", text)            # remove mentions
    text = re.sub(r"#", "", text)               # remove hashtag symbol
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text) # remove emojis/punctuation
    text = re.sub(r"\s+", " ", text).strip()    # remove extra spaces
    return text.lower()

tweets["clean_text"] = tweets["text"].astype(str).apply(clean_text)

tweets.to_csv("tweets_cleaned.csv", index=False)
print("\nCleaned data saved as tweets_cleaned.csv")

Dataset Description

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tweet_id  100 non-null    int64 
 1   username  100 non-null    object
 2   text      100 non-null    object
dtypes: int64(1), object(2)
memory usage: 2.5+ KB
None

First 5 rows
               tweet_id         username  \
0  1972832514160153027     rasangarocks   
1  1972831317160288326        cyberkeyx   
2  1972830137579352403  EffieRitts70463   
3  1972830025889292446     Radiology_AI   
4  1972830009120453009        Sargol_MD   

                                                text  
0  Python Become a Master: 120 ‘Real World’ Pytho...  
1  If your account is hacked, or your Account is ...  
2  @PythonPr That's a great way to approach it! V...  
3  Deep learning method for lumbar paraspinal mus...  
4  We developed a Machine Learning framework to p...  

Number of Null values in e

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

The web scraping part of the assignment felt difficult and time-consuming, and I struggled through it. Most websites block scraping attempts or limit the number of records that can be scraped. It was a good learning exercise: I had not tried web scraping before, so I learned a lot from this assignment. The time provided was sufficient to complete it.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog