# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [2]:
# Your code here
import requests
from bs4 import BeautifulSoup
import pandas as pd



In [3]:
def scrape_imdb_reviews(no_of_reviews=1000):
    base_url = 'https://www.imdb.com/title/tt5296406/reviews'
    reviews = []
    page_num = 0  # Adjust pagination based on IMDb's structure

    while len(reviews) < no_of_reviews:
        url = f'{base_url}?start={page_num}'
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        if response.status_code != 200:
            break  # Stop instead of repeating a failing request forever
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract review text using the identified class
        review_containers = soup.find_all('div', class_='ipc-html-content ipc-html-content--base')
        count_before = len(reviews)
        for container in review_containers:
            inner_div = container.find('div', class_='ipc-html-content-inner-div')
            if inner_div:
                reviews.append(inner_div.text.strip())

        if len(reviews) == count_before:
            break  # No reviews found on this page: avoid an infinite loop

        page_num += 1  # Increment by 1 for pagination

    return reviews[:no_of_reviews]

# Scraping top 1000 user reviews
designated_survivor_reviews = scrape_imdb_reviews(1000)
print(designated_survivor_reviews[:5])
print(len(designated_survivor_reviews))

["I enjoyed the first two seasons of Designated Survivor it was entertaining and kept your interest. If your a 24 fan you will enjoy the first two seasons it's basically if Jack Bauer became president. The third season it fell apart with the removal of some of the cast with no explanation. They also added lots of unnecessary cussing that did not fit the vibe it totally changed the mood of the show.", 'Season One and Two were fantastic: Interesting, gripping, great cast and production. I couldn\'t wait for the next episode. After Netflix took over Season Three, I stopped watching after about one-third of the episodes. Season Three is obsessed and strategized with misguided identity politics, "politically correct" content, and crass language. The screenplays for Season Three are amateurish and nowhere match those of Season One and Two. Strange, when Netflix produces, things get sleazy.', 'I really liked "Designated Survivor" for the first two seasons (gave them a rating of 9) and was hap

In [4]:
reviews_df = pd.DataFrame(designated_survivor_reviews, columns=['reviews'])
reviews_df.head()

Unnamed: 0,reviews
0,I enjoyed the first two seasons of Designated ...
1,Season One and Two were fantastic: Interesting...
2,"I really liked ""Designated Survivor"" for the f..."
3,I've liked this show even though it's a little...
4,This show is one of my favourite Netflix Origi...


In [5]:
reviews_df.to_csv('Designated_reviews.csv', index=False)
print(' Reviews are collected, and saved successfully ')

 Reviews are collected, and saved successfully 


# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.
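
As a rough illustration of how steps (1) to (4) can be chained, here is a minimal sketch that uses only the standard library. The `basic_clean` helper and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins so the example runs without downloads; a real solution would use the NLTK stopwords list, and stemming and lemmatization are left out here.

```python
import re

# Hypothetical, deliberately tiny stopword set so the sketch runs without downloads.
# A real run would use nltk.corpus.stopwords.words('english') instead.
DEMO_STOPWORDS = {"i", "the", "of", "it", "was", "and", "a"}

def basic_clean(text, stop_words=DEMO_STOPWORDS):
    """Apply steps (1)-(4): strip noise, strip digits, lowercase, drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # (1) punctuation and special characters
    text = re.sub(r"\d+", " ", text)             # (2) numbers
    tokens = text.lower().split()                # (4) lowercase, then split on whitespace
    kept = [tok for tok in tokens if tok not in stop_words]  # (3) stopwords
    return " ".join(kept)

print(basic_clean("I enjoyed the first 2 seasons of Designated Survivor!"))
# prints: enjoyed first seasons designated survivor
```

Lowercasing before the stopword check matters because stopword lists are lowercase; otherwise a capitalized "The" would slip through.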

In [6]:
# Write code for each of the sub parts with proper comments.
#importing all the required modules
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import re

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [7]:
data = "/content/Designated_reviews.csv"
df = pd.read_table(data,names=['text'])
df.head()

Unnamed: 0,text
0,reviews
1,I enjoyed the first two seasons of Designated ...
2,Season One and Two were fantastic: Interesting...
3,"I really liked ""Designated Survivor"" for the f..."
4,I've liked this show even though it's a little...


In [8]:
#1:Remove noise, such as special characters and punctuations.
def remove_noise(text):
    clean_text = re.sub('[^a-zA-Z0-9]', ' ', text)
    return clean_text
df['clean_text'] = df['text'].apply(remove_noise)
# Displaying the Data Frame after removing noise
print("\nData Frame after removing noise:")
df.head(10)


Data Frame after removing noise:


Unnamed: 0,text,clean_text
0,reviews,reviews
1,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...
2,Season One and Two were fantastic: Interesting...,Season One and Two were fantastic Interesting...
3,"I really liked ""Designated Survivor"" for the f...",I really liked Designated Survivor for the f...
4,I've liked this show even though it's a little...,I ve liked this show even though it s a little...
5,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...
6,"Started off brilliant, season one is fantastic...",Started off brilliant season one is fantastic...
7,"This show, which I stumbled upon when trying t...",This show which I stumbled upon when trying t...
8,I've just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...
9,I wish Tom Kirkman were our actual president! ...,I wish Tom Kirkman were our actual president ...


In [9]:
#2: Remove Numbers
def remove_numbers(text):
    clean_text = re.sub(r'\d+', '', text)
    return clean_text
df['remove_numbers'] = df['clean_text'].apply(remove_numbers)
print("\nData Frame after removing numbers:")
df.head(10)


Data Frame after removing numbers:


Unnamed: 0,text,clean_text,remove_numbers
0,reviews,reviews,reviews
1,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...
2,Season One and Two were fantastic: Interesting...,Season One and Two were fantastic Interesting...,Season One and Two were fantastic Interesting...
3,"I really liked ""Designated Survivor"" for the f...",I really liked Designated Survivor for the f...,I really liked Designated Survivor for the f...
4,I've liked this show even though it's a little...,I ve liked this show even though it s a little...,I ve liked this show even though it s a little...
5,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...
6,"Started off brilliant, season one is fantastic...",Started off brilliant season one is fantastic...,Started off brilliant season one is fantastic...
7,"This show, which I stumbled upon when trying t...",This show which I stumbled upon when trying t...,This show which I stumbled upon when trying t...
8,I've just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...
9,I wish Tom Kirkman were our actual president! ...,I wish Tom Kirkman were our actual president ...,I wish Tom Kirkman were our actual president ...


In [10]:
# 3: Remove stopwords by using the stopwords List
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = nltk.word_tokenize(text)
    filter_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filter_words)
df['remove_stopwords'] = df['remove_numbers'].apply(remove_stopwords)
print("\nData Frame after removing stopwords without lowercase:")
df.head(10)


Data Frame after removing stopwords without lowercase:


Unnamed: 0,text,clean_text,remove_numbers,remove_stopwords
0,reviews,reviews,reviews,reviews
1,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,enjoyed first two seasons Designated Survivor ...
2,Season One and Two were fantastic: Interesting...,Season One and Two were fantastic Interesting...,Season One and Two were fantastic Interesting...,Season One Two fantastic Interesting gripping ...
3,"I really liked ""Designated Survivor"" for the f...",I really liked Designated Survivor for the f...,I really liked Designated Survivor for the f...,really liked Designated Survivor first two sea...
4,I've liked this show even though it's a little...,I ve liked this show even though it s a little...,I ve liked this show even though it s a little...,liked show even though little liberal excited ...
5,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,show one favourite Netflix Originals kind comb...
6,"Started off brilliant, season one is fantastic...",Started off brilliant season one is fantastic...,Started off brilliant season one is fantastic...,Started brilliant season one fantastic Season ...
7,"This show, which I stumbled upon when trying t...",This show which I stumbled upon when trying t...,This show which I stumbled upon when trying t...,show stumbled upon trying find something watch...
8,I've just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,binge watched end Episode initial premise inte...
9,I wish Tom Kirkman were our actual president! ...,I wish Tom Kirkman were our actual president ...,I wish Tom Kirkman were our actual president ...,wish Tom Kirkman actual president Great charac...


In [11]:
# 4: Lowercase all texts
df['lowercase_texts'] = df['remove_stopwords'].apply(lambda x: x.lower())
print("\nData Frame after converting texts to lowercase:")
df.head(10)


Data Frame after converting texts to lowercase:


Unnamed: 0,text,clean_text,remove_numbers,remove_stopwords,lowercase_texts
0,reviews,reviews,reviews,reviews,reviews
1,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,enjoyed first two seasons Designated Survivor ...,enjoyed first two seasons designated survivor ...
2,Season One and Two were fantastic: Interesting...,Season One and Two were fantastic Interesting...,Season One and Two were fantastic Interesting...,Season One Two fantastic Interesting gripping ...,season one two fantastic interesting gripping ...
3,"I really liked ""Designated Survivor"" for the f...",I really liked Designated Survivor for the f...,I really liked Designated Survivor for the f...,really liked Designated Survivor first two sea...,really liked designated survivor first two sea...
4,I've liked this show even though it's a little...,I ve liked this show even though it s a little...,I ve liked this show even though it s a little...,liked show even though little liberal excited ...,liked show even though little liberal excited ...
5,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,show one favourite Netflix Originals kind comb...,show one favourite netflix originals kind comb...
6,"Started off brilliant, season one is fantastic...",Started off brilliant season one is fantastic...,Started off brilliant season one is fantastic...,Started brilliant season one fantastic Season ...,started brilliant season one fantastic season ...
7,"This show, which I stumbled upon when trying t...",This show which I stumbled upon when trying t...,This show which I stumbled upon when trying t...,show stumbled upon trying find something watch...,show stumbled upon trying find something watch...
8,I've just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,binge watched end Episode initial premise inte...,binge watched end episode initial premise inte...
9,I wish Tom Kirkman were our actual president! ...,I wish Tom Kirkman were our actual president ...,I wish Tom Kirkman were our actual president ...,wish Tom Kirkman actual president Great charac...,wish tom kirkman actual president great charac...


In [12]:
#5. Stemming
stemmer = PorterStemmer()
def apply_stemming(text):
    words = nltk.word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)
df['after_stemming'] = df['lowercase_texts'].apply(apply_stemming)
print("\nData Frame after applying stemming:")
df.head(10)


Data Frame after applying stemming:


Unnamed: 0,text,clean_text,remove_numbers,remove_stopwords,lowercase_texts,after_stemming
0,reviews,reviews,reviews,reviews,reviews,review
1,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,enjoyed first two seasons Designated Survivor ...,enjoyed first two seasons designated survivor ...,enjoy first two season design survivor enterta...
2,Season One and Two were fantastic: Interesting...,Season One and Two were fantastic Interesting...,Season One and Two were fantastic Interesting...,Season One Two fantastic Interesting gripping ...,season one two fantastic interesting gripping ...,season one two fantast interest grip great cas...
3,"I really liked ""Designated Survivor"" for the f...",I really liked Designated Survivor for the f...,I really liked Designated Survivor for the f...,really liked Designated Survivor first two sea...,really liked designated survivor first two sea...,realli like design survivor first two season g...
4,I've liked this show even though it's a little...,I ve liked this show even though it s a little...,I ve liked this show even though it s a little...,liked show even though little liberal excited ...,liked show even though little liberal excited ...,like show even though littl liber excit netfli...
5,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,show one favourite Netflix Originals kind comb...,show one favourite netflix originals kind comb...,show one favourit netflix origin kind combin h...
6,"Started off brilliant, season one is fantastic...",Started off brilliant season one is fantastic...,Started off brilliant season one is fantastic...,Started brilliant season one fantastic Season ...,started brilliant season one fantastic season ...,start brilliant season one fantast season good...
7,"This show, which I stumbled upon when trying t...",This show which I stumbled upon when trying t...,This show which I stumbled upon when trying t...,show stumbled upon trying find something watch...,show stumbled upon trying find something watch...,show stumbl upon tri find someth watch grab ri...
8,I've just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,binge watched end Episode initial premise inte...,binge watched end episode initial premise inte...,bing watch end episod initi premis interest wh...
9,I wish Tom Kirkman were our actual president! ...,I wish Tom Kirkman were our actual president ...,I wish Tom Kirkman were our actual president ...,wish Tom Kirkman actual president Great charac...,wish tom kirkman actual president great charac...,wish tom kirkman actual presid great charact k...


In [13]:
#6. Lemmatization
lemmatizer = WordNetLemmatizer()
def apply_lemmatization(text):
    words = nltk.word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)
df['after_lemmatization'] = df['after_stemming'].apply(apply_lemmatization)
print("\nData Frame after applying lemmatization:")
df.head(10)


Data Frame after applying lemmatization:


Unnamed: 0,text,clean_text,remove_numbers,remove_stopwords,lowercase_texts,after_stemming,after_lemmatization
0,reviews,reviews,reviews,reviews,reviews,review,review
1,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,I enjoyed the first two seasons of Designated ...,enjoyed first two seasons Designated Survivor ...,enjoyed first two seasons designated survivor ...,enjoy first two season design survivor enterta...,enjoy first two season design survivor enterta...
2,Season One and Two were fantastic: Interesting...,Season One and Two were fantastic Interesting...,Season One and Two were fantastic Interesting...,Season One Two fantastic Interesting gripping ...,season one two fantastic interesting gripping ...,season one two fantast interest grip great cas...,season one two fantast interest grip great cas...
3,"I really liked ""Designated Survivor"" for the f...",I really liked Designated Survivor for the f...,I really liked Designated Survivor for the f...,really liked Designated Survivor first two sea...,really liked designated survivor first two sea...,realli like design survivor first two season g...,realli like design survivor first two season g...
4,I've liked this show even though it's a little...,I ve liked this show even though it s a little...,I ve liked this show even though it s a little...,liked show even though little liberal excited ...,liked show even though little liberal excited ...,like show even though littl liber excit netfli...,like show even though littl liber excit netfli...
5,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,This show is one of my favourite Netflix Origi...,show one favourite Netflix Originals kind comb...,show one favourite netflix originals kind comb...,show one favourit netflix origin kind combin h...,show one favourit netflix origin kind combin h...
6,"Started off brilliant, season one is fantastic...",Started off brilliant season one is fantastic...,Started off brilliant season one is fantastic...,Started brilliant season one fantastic Season ...,started brilliant season one fantastic season ...,start brilliant season one fantast season good...,start brilliant season one fantast season good...
7,"This show, which I stumbled upon when trying t...",This show which I stumbled upon when trying t...,This show which I stumbled upon when trying t...,show stumbled upon trying find something watch...,show stumbled upon trying find something watch...,show stumbl upon tri find someth watch grab ri...,show stumbl upon tri find someth watch grab ri...
8,I've just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,I ve just binge watched it up to the end of Ep...,binge watched end Episode initial premise inte...,binge watched end episode initial premise inte...,bing watch end episod initi premis interest wh...,bing watch end episod initi premis interest wh...
9,I wish Tom Kirkman were our actual president! ...,I wish Tom Kirkman were our actual president ...,I wish Tom Kirkman were our actual president ...,wish Tom Kirkman actual president Great charac...,wish tom kirkman actual president great charac...,wish tom kirkman actual presid great charact k...,wish tom kirkman actual presid great charact k...


In [15]:
# Save the cleaned data to a new CSV file
df.to_csv('cleaned_data_.csv', index=False)
print("\nCleaned data saved ")


Cleaned data saved 


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
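
For the totals requested in part (1), one possible approach is to map Penn Treebank tag prefixes to the four coarse categories and accumulate a `Counter` across all rows. The sketch below assumes input in the `(word, tag)` shape that `nltk.pos_tag` returns; `count_pos_categories` and the hand-written `sample` are hypothetical and stand in for real tagger output.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse categories in the task.
PREFIX_TO_CATEGORY = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def count_pos_categories(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in a list of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category:
            counts[category] += 1
    return counts

# Hand-written sample in the same shape as nltk.pos_tag output (not real tagger output).
sample = [("The", "DT"), ("pilot", "NN"), ("episode", "NN"),
          ("was", "VBD"), ("really", "RB"), ("good", "JJ")]

# Accumulate corpus-wide totals instead of resetting the counts for each row.
totals = Counter()
for row_tags in [sample, sample]:
    totals.update(count_pos_categories(row_tags))

print(dict(totals))
# prints: {'Noun': 4, 'Verb': 2, 'Adverb': 2, 'Adjective': 2}
```

Matching on the two-letter prefix keeps particles (`RP`) out of the adverb count, which a looser single-letter check on `R` would not.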

In [16]:
# Your code here
# Part 1: Parts of Speech (POS) Tagging
from collections import Counter
nltk.download('averaged_perceptron_tagger_eng')

# Read the data from cleaned_data_.csv
df = pd.read_csv('cleaned_data_.csv')

def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return pos_tags
# Iterate through each row and print the POS tagging on a new line
for index, row in df.iterrows():
    pos_tags = pos_tagging(row['clean_text'])
    print(f"POS tagging for row {index + 1}:\n{pos_tags}\n")
    noun_count = verb_count = adj_count = adv_count = 0
    for _, pos in pos_tags:
        if pos.startswith('N'):
            noun_count += 1
        elif pos.startswith('V'):
            verb_count += 1
        elif pos.startswith('JJ'):
            adj_count += 1
        elif pos.startswith('RB'):
            adv_count += 1
    print(f"Total No of Nouns: {noun_count}")
    print(f"Total No of Verbs: {verb_count}")
    print(f"Total No of Adjectives: {adj_count}")
    print(f"Total No of Adverbs: {adv_count}")

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Streaming output truncated to the last 5000 lines.
Total No of Adjectives: 5
Total No of Adverbs: 3
POS tagging for row 288:
[('The', 'DT'), ('pilot', 'NN'), ('episode', 'NN'), ('was', 'VBD'), ('intriguing', 'VBG'), ('but', 'CC'), ('as', 'IN'), ('Designated', 'NNP'), ('Survivor', 'NNP'), ('just', 'RB'), ('watched', 'VBD'), ('episode', 'RB'), ('5', 'CD'), ('goes', 'VBZ'), ('on', 'IN'), ('it', 'PRP'), ('s', 'VBZ'), ('really', 'RB'), ('hitting', 'VBG'), ('it', 'PRP'), ('s', 'JJ'), ('stride', 'JJ'), ('Kiefer', 'NNP'), ('s', 'NN'), ('acting', 'NN'), ('is', 'VBZ'), ('compelling', 'VBG'), ('and', 'CC'), ('all', 'DT'), ('of', 'IN'), ('the', 'DT'), ('characters', 'NNS'), ('are', 'VBP'), ('beginning', 'VBG'), ('to', 'TO'), ('get', 'VB'), ('fleshed', 'VBN'), ('out', 'RP'), ('as', 'IN'), ('the', 'DT'), ('story', 'NN'), ('begins', 'VBZ'), ('to', 'TO'), ('come', 'VB'), ('together', 'RB'), ('The', 'DT'), ('idea', 'NN'), ('of', 'IN'), ('an', 'DT'), ('every', 'DT'), ('man', 'NN'), ('becom

In [17]:
!pip install benepar
!pip install benepar nltk
!pip install tensorflow
!pip install tensorflow==2.8.0

Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.6.0->benepar)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.6.0->benepar)
  Downloading nvidia_cublas_

In [18]:
#Part2:Constituency Parsing and Dependency Parsing:
# Importing the libraries
import benepar
import sys
import spacy
from spacy import displacy
benepar.download('benepar_en3')
parser = benepar.Parser("benepar_en3")
nlp = spacy.load('en_core_web_sm')
options = {'compact': True, 'font': 'Arial black', 'distance': 100}
# Parsing a sentence in clean_text
try:
    tree = parser.parse(df['clean_text'][2])
    print(tree)
except Exception:  # Catch parser errors without swallowing KeyboardInterrupt
    print("No Parse Tree")

doc = nlp(df['clean_text'][2])
displacy.render(doc, style='dep', options=options, jupyter=True)

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
  state_dict = torch.load(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_

(TOP
  (S
    (NP
      (NP (NN Season) (CD One) (CC and) (CD Two))
      (VBD were)
      (ADJP (JJ fantastic))
      (JJ Interesting)
      (VBG gripping)
      (JJ great)
      (NN cast)
      (CC and)
      (NN production))
    (NP (PRP I))
    (VP
      (VP
        (MD couldn)
        (RB t)
        (VP
          (VB wait)
          (PP
            (IN for)
            (NP
              (NP (DT the) (JJ next) (NN episode))
              (SBAR
                (IN After)
                (S
                  (NP (NNP Netflix))
                  (VP
                    (VBD took)
                    (PRT (RP over))
                    (NP (NNP Season) (NNP Three)))))))))
      (NP (PRP I))
      (VP
        (VBD stopped)
        (VBG watching)
        (SBAR
          (IN after)
          (S
            (NP
              (NP
                (NP
                  (NP (QP (RB about) (CD one) (NN third)))
                  (PP (IN of) (NP (DT the) (NNS episodes))))
                (SBAR
 



# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details to extract are the product name, a short description, and the URL.

Store the extracted data in a structured CSV format with columns for product name, description, URL, and page number. Introduce a time delay between requests to avoid server overload. ChatGPT can assist with parsing the HTML, error handling, and generating reports based on the data collected.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.
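
The pagination-with-delay pattern described above can be sketched independently of any particular site. In this minimal example, `collect_paginated` and `fake_fetch` are hypothetical names: `fake_fetch` stands in for the real HTTP request and HTML parsing, so the sketch runs offline and makes no claim about how the Marketplace pages are actually structured.

```python
import time

def collect_paginated(fetch_page, target_count, delay_seconds=1.0, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until enough items are collected."""
    items = []
    for page in range(1, max_pages + 1):
        page_items = fetch_page(page)
        if not page_items:         # empty page: assume we ran past the last page
            break
        items.extend(page_items)
        if len(items) >= target_count:
            break
        time.sleep(delay_seconds)  # pause between requests to avoid server overload
    return items[:target_count]

# Stand-in for a real fetch-and-parse step: three pages of two fake rows each.
def fake_fetch(page):
    if page > 3:
        return []
    return [{"name": f"action-{page}-{i}", "page": page} for i in range(2)]

rows = collect_paginated(fake_fetch, target_count=5, delay_seconds=0)
print(len(rows), rows[-1]["name"])
# prints: 5 action-3-0
```

The `max_pages` cap and the empty-page check both guarantee termination, which an open-ended `while len(data) < target` loop does not when a page fails or the listing runs out.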

(PART-2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
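
One way to make these checks concrete is to count the problems before removing them, so the effect of the cleanup can be reported. The sketch below assumes pandas is available; `quality_report` and the four-row `demo` frame are hypothetical and only mirror the column names used in this question.

```python
import pandas as pd

def quality_report(df, required_columns):
    """Summarise duplicate rows and missing or blank values per required column."""
    report = {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing": {},
    }
    for col in required_columns:
        values = df[col]
        # A value counts as missing if it is NaN/None or only whitespace.
        is_missing = values.isna() | (values.astype(str).str.strip() == "")
        report["missing"][col] = int(is_missing.sum())
    return report

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "Product Name": ["checkout", "checkout", "setup-node", None],
    "Description": ["check out a repo", "check out a repo", "  ", "cache deps"],
})

print(quality_report(demo, ["Product Name", "Description"]))
# prints: {'rows': 4, 'duplicate_rows': 1, 'missing': {'Product Name': 1, 'Description': 1}}
```

Running the report once before and once after `drop_duplicates` and `dropna` gives a simple record of how many rows each step removed.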


In [19]:
#Part 1
import time

# Base URL of GitHub Marketplace (Actions section)
BASE_URL = "https://github.com/marketplace?type=actions&page="

# Headers to mimic a browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://www.google.com/",
}

# Initialize list to store data
data = []
page = 0
# Scrape multiple pages (Assuming 40 products per page, we need ~25 pages for 1000 products)
while len(data) < 1000:
    page += 1
    url = f"{BASE_URL}{page}"

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        print(f"Failed to fetch page {page}")
        break  # Stop rather than requesting further pages indefinitely

    soup = BeautifulSoup(response.text, "html.parser")

    # Find all product containers
    items = soup.find_all('div', {'data-testid': 'non-featured-item'})
    if not items:
        break  # No more products listed: avoid an infinite loop
    for item in items:
        try:
            name = item.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf').text.strip()
            item_url = "https://github.com" + item.find('a')['href']
            description = item.find('p', class_='text-small').text.strip()
            data.append([name, description, item_url, page])
        except (AttributeError, TypeError, KeyError):
            continue  # Skip items that are missing a name, link, or description

    time.sleep(1)  # Pause between requests to avoid overloading the server

# Save to CSV
df = pd.DataFrame(data, columns=["Product Name", "Description", "URL", "Page Number"])

df.to_csv("github_marketplace_actions.csv", index=False)
print("Scraping completed! Data saved to github_marketplace_actions.csv")

Scraping completed! Data saved to github_marketplace_actions.csv


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [20]:
#Part2
from nltk.tokenize import word_tokenize
df = pd.read_csv('/content/github_marketplace_actions.csv')

# Data Cleaning Function
def clean_text(text):
    text = str(text)  # Ensure it's a string
    text = re.sub(r"<.*?>", "", text)  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

# Apply text cleaning to Name & Description columns
df["Product Name"] = df["Product Name"].apply(clean_text)
df["Description"] = df["Description"].apply(clean_text)

# Tokenization, Stopword Removal, Lemmatization
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatization
    return " ".join(tokens)

df["Processed Description"] = df["Description"].apply(preprocess_text)

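# Data Quality Checks
df.drop_duplicates(inplace=True)  # Remove duplicates
df.dropna(inplace=True)  # Handle missing values
# clean_text() stringifies missing values to "nan", so drop those rows as well
df = df[df["Description"] != "nan"]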

# Save cleaned data
df.to_csv("cleaned_github_marketplace_actions.csv", index=False)

print("Data preprocessing and quality checks completed. Cleaned data saved.")

Data preprocessing and quality checks completed. Cleaned data saved.
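
One pitfall worth checking in a cleaning pipeline of this kind: calling `str()` on a missing value produces the literal text `nan`, which a later `dropna()` no longer recognises as missing. The sketch below illustrates this with plain Python. `clean_text` mirrors the cleaning function used in this notebook, while `clean_or_none` is a hypothetical helper added only for illustration.

```python
import math
import re


def clean_text(text):
    """Same cleaning steps as the notebook: strip tags, punctuation, case, extra spaces."""
    text = str(text)
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def clean_or_none(value):
    """Hypothetical helper: return cleaned text, or None if missing or empty after cleaning."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return clean_text(value) or None


print(clean_text(float("nan")))                # nan
print(clean_or_none(float("nan")))             # None
print(clean_or_none("<b>Deploy</b> to AWS!"))  # deploy to aws
```

Returning `None` for missing input keeps the value detectable by a later missing-value check.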


# Question 5 (20 points)

Part 1:
Collect tweets from Twitter using the Tweepy API, targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data should include the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures on the collected tweets.

A final data quality check should confirm the completeness and consistency of the dataset. Save the cleaned data to a CSV file for further analysis.
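
As a rough illustration of what a completeness and consistency check could look like, the sketch below counts rows with missing fields and repeated tweet IDs. `quality_report` is a hypothetical helper, and the column names `Tweet_ID`, `Username`, and `Text` are assumed to match the three fields listed in Part 1.

```python
def quality_report(rows, required=("Tweet_ID", "Username", "Text")):
    """Summarise completeness and consistency for a list of row dictionaries."""
    # Completeness: a row is incomplete if any required field is absent or blank
    missing = sum(
        1 for row in rows
        if any(not str(row.get(key) or "").strip() for key in required)
    )
    # Consistency: each tweet ID should appear only once
    ids = [row.get("Tweet_ID") for row in rows]
    duplicate_ids = len(ids) - len(set(ids))
    return {
        "rows": len(rows),
        "rows_with_missing_fields": missing,
        "duplicate_ids": duplicate_ids,
    }


sample = [
    {"Tweet_ID": "1", "Username": "a", "Text": "x"},
    {"Tweet_ID": "1", "Username": "a", "Text": "x"},  # Duplicate ID
    {"Tweet_ID": "2", "Username": "", "Text": "y"},   # Missing username
]
print(quality_report(sample))
```

A report with zero rows is itself a signal that collection failed and that nothing should be saved as cleaned data.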


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.
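
Once the API keys are obtained, one way to keep them out of a shared notebook is to read them from environment variables. This is only a sketch under assumptions: `load_credentials` is a hypothetical helper and the variable names are placeholders, not names required by Tweepy or Twitter.

```python
import os

# Placeholder variable names chosen for this sketch
REQUIRED = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials from the environment, failing early if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


# Usage with a stand-in mapping instead of the real environment
fake_env = {name: "dummy-value" for name in REQUIRED}
print(sorted(load_credentials(fake_env)))
```

Failing early with the list of missing names is easier to debug than an authentication error later in the run.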


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [44]:
pip install tweepy



In [25]:
import tweepy

# Set your keys and tokens here
api_key = ''
api_key_secret = ''
access_token = ''
access_token_secret =''
# Authenticate with Twitter
auth = tweepy.OAuth1UserHandler(
    consumer_key=api_key,
    consumer_secret=api_key_secret,
    access_token=access_token,
    access_token_secret=access_token_secret
)
api = tweepy.API(auth)

In [21]:
import os
import re
import time

import pandas as pd
import tweepy

# Step 1: Set up Twitter API authentication
# Credentials are read from environment variables instead of being hardcoded in the notebook.
# The variable names below are placeholders chosen for this notebook.
API_KEY = os.environ.get("TWITTER_API_KEY", "")
API_SECRET = os.environ.get("TWITTER_API_SECRET", "")
ACCESS_TOKEN = os.environ.get("TWITTER_ACCESS_TOKEN", "")
ACCESS_SECRET = os.environ.get("TWITTER_ACCESS_SECRET", "")

# Authenticate with the Twitter API using OAuth 1.0a user context
auth = tweepy.OAuth1UserHandler(
    consumer_key=API_KEY,
    consumer_secret=API_SECRET,
    access_token=ACCESS_TOKEN,
    access_token_secret=ACCESS_SECRET
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Step 2: Scrape tweets based on hashtags
def scrape_tweets(hashtag, count=100):
    tweets = []
    try:
        for tweet in tweepy.Cursor(api.search_tweets, q=hashtag, lang="en", tweet_mode="extended").items(count):
            tweets.append([tweet.id_str, tweet.user.screen_name, tweet.full_text])
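        time.sleep(2)  # Short pause after each hashtag search
    except tweepy.TooManyRequests:
        # wait_on_rate_limit=True normally waits automatically; this is a fallback
        print("Rate limit exceeded. Waiting for 15 minutes...")
        time.sleep(900)  # Wait out the 15-minute window (no retry is attempted here)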
    except Exception as e:
        print(f"Error: {e}")
    return tweets

# Define hashtags
hashtags = ["#MachineLearning", "#ArtificialIntelligence"]

data = []
for tag in hashtags:
    data.extend(scrape_tweets(tag, count=100))

# Convert to DataFrame
df = pd.DataFrame(data, columns=["Tweet_ID", "Username", "Text"])

# Step 3: Data Cleaning
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # Remove mentions
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)  # Remove hashtags
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)  # Remove special characters
    text = text.lower().strip()  # Convert to lowercase and trim spaces
    return text

df["Cleaned_Text"] = df["Text"].apply(clean_text)

# Step 4: Final Data Quality Check
print("Total rows before cleaning:", len(df))
df.drop_duplicates(subset="Cleaned_Text", inplace=True)  # Remove duplicates
print("Total rows after cleaning:", len(df))

# Step 5: Save cleaned data to CSV
df.to_csv("cleaned_tweets.csv", index=False)
print("Cleaned data saved successfully!")


Error: 403 Forbidden
453 - You currently have access to a subset of X API V2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.x.com/en/portal/product
Error: 403 Forbidden
453 - You currently have access to a subset of X API V2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.x.com/en/portal/product
Total rows before cleaning: 0
Total rows after cleaning: 0
Cleaned data saved successfully!
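
The 403 response above says the account can only reach a subset of the v2 endpoints, so the v1.1 search call never returns data. A possible alternative, sketched below, is Tweepy's v2 `Client.search_recent_tweets`. Whether this endpoint is available depends on the account's access level, which is an assumption here, as are the `TWITTER_BEARER_TOKEN` variable name and the helper names `build_query`, `rows_from_response`, and `fetch_recent`.

```python
import os


def build_query(hashtag, lang="en"):
    """Build a v2 search query: the hashtag, a language filter, and no retweets."""
    return f"{hashtag} lang:{lang} -is:retweet"


def rows_from_response(response):
    """Turn a v2 search response into [tweet_id, username, text] rows."""
    if not response.data:  # .data is None when nothing matched
        return []
    # Usernames arrive separately in the "users" expansion, keyed by author id
    users = {user.id: user.username for user in response.includes.get("users", [])}
    return [
        [str(tweet.id), users.get(tweet.author_id, ""), tweet.text]
        for tweet in response.data
    ]


def fetch_recent(hashtag, max_results=100):
    """Fetch one page of recent tweets (requires network access and a bearer token)."""
    import tweepy  # Imported lazily so the helpers above work without Tweepy installed

    # TWITTER_BEARER_TOKEN is a placeholder environment variable name
    client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])
    response = client.search_recent_tweets(
        query=build_query(hashtag),
        max_results=max_results,  # The v2 endpoint accepts 10 to 100 per request
        expansions=["author_id"],
        user_fields=["username"],
    )
    return rows_from_response(response)


print(build_query("#MachineLearning"))
```

The rows have the same three fields as the original DataFrame columns, so the cleaning step can be reused unchanged if the request succeeds.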


# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog