# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
# Your code here

def scrape_imdb_reviews(no_of_reviews=1000):
    base_url = 'https://www.imdb.com/title/tt15239678/reviews'
    reviews = []
    page_num = 0  # Adjust pagination based on IMDb's structure

    while len(reviews) < no_of_reviews:
        url = f'{base_url}?start={page_num}'
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract review text using the identified class
        review_containers = soup.find_all('div', class_='ipc-html-content ipc-html-content--base')
        for container in review_containers:
            inner_div = container.find('div', class_='ipc-html-content-inner-div')
            if inner_div:
                reviews.append(inner_div.text.strip())

        page_num += 25  # Increment by 25 for pagination

    return reviews[:no_of_reviews]

# Scraping top 1000 user reviews
dune_Part2_reviews = scrape_imdb_reviews(1000)
print(dune_Part2_reviews[:5])
print(len(dune_Part2_reviews))


['Had the pleasure to watch this film in an early screening and was completely blown away."Dune: Part 2" is everything one could ask for from a film of its kind. As a big fan of the Game of Thrones franchise, it\'s been a long time since iv\'e encountered this level of world-building and epicness. The plot and story development are carried out in an awe-inspiring manner throughout the movie, progressing at a precise pace toward a spectacular climax that is executed perfectly.Denis Villeneuve continues to prove himself as one of the most promising filmmakers of our time, and if it was up to me I would keep him in these high-budget epic tales such as these since there are very few directors working today that can tackle this genre as good as he does. The film received praise from many great filmmakers, the most notable being Christopher Nolan (The Dark Knight Trilogy, Oppenheimer), who very accurately compared Villeneuve\'s achievement in this film to the "Empire Strikes Back" of the mod

In [6]:
# Creating a DataFrame
reviews_df = pd.DataFrame(dune_Part2_reviews, columns=['reviews'])
reviews_df

Unnamed: 0,reviews
0,Had the pleasure to watch this film in an earl...
1,"If you liked or loved the first one, the same ..."
2,"Dune Part 2 is an epic movie; slickly made, an..."
3,This is the kind of movie that is impossible t...
4,"Like the first part, the second part is visual..."
...,...
995,This was a perfect sequel to Denis' part one. ...
996,I just got out of an early access showing and ...
997,"We have waited many, many years for a movie of..."
998,This movie has the same problems as part one. ...


In [7]:
# Saving to the CSV file
reviews_df.to_csv('dune_Part2_reviews.csv', index=False)
print(' Reviews are collected successfully ')

 Reviews are collected successfully 


# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [8]:
# Write code for each of the sub parts with proper comments.
#importing all the required modules
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import re


# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [10]:
data_url="/content/dune_Part2_reviews.csv"
df = pd.read_table(data_url,names=['text'])
df

Unnamed: 0,text
0,reviews
1,Had the pleasure to watch this film in an earl...
2,"If you liked or loved the first one, the same ..."
3,"Dune Part 2 is an epic movie; slickly made, an..."
4,This is the kind of movie that is impossible t...
...,...
996,This was a perfect sequel to Denis' part one. ...
997,I just got out of an early access showing and ...
998,"We have waited many, many years for a movie of..."
999,This movie has the same problems as part one. ...


In [11]:
#1: Remove Noise like special characters and punctuations
def remove_noise(text):
    clean_text = re.sub('[^a-zA-Z0-9]', ' ', text)
    return clean_text
df['clean_text'] = df['text'].apply(remove_noise)
# Printing data frame after noise
print("\nData Frame:")
df


Data Frame:


Unnamed: 0,text,clean_text
0,reviews,reviews
1,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...
2,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same ...
3,"Dune Part 2 is an epic movie; slickly made, an...",Dune Part 2 is an epic movie slickly made an...
4,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...
...,...,...
996,This was a perfect sequel to Denis' part one. ...,This was a perfect sequel to Denis part one ...
997,I just got out of an early access showing and ...,I just got out of an early access showing and ...
998,"We have waited many, many years for a movie of...",We have waited many many years for a movie of...
999,This movie has the same problems as part one. ...,This movie has the same problems as part one ...


In [12]:
#2:- Remove Numbers
#creating remove_number function
def remove_numbers(text):
    clean_text = re.sub(r'\d+', '', text)
    return clean_text
df['clean_text_without_numbers'] = df['clean_text'].apply(remove_numbers)
print("\nData Frame:")
df


Data Frame:


Unnamed: 0,text,clean_text,clean_text_without_numbers
0,reviews,reviews,reviews
1,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...
2,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same ...,If you liked or loved the first one the same ...
3,"Dune Part 2 is an epic movie; slickly made, an...",Dune Part 2 is an epic movie slickly made an...,Dune Part is an epic movie slickly made and...
4,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...
...,...,...,...
996,This was a perfect sequel to Denis' part one. ...,This was a perfect sequel to Denis part one ...,This was a perfect sequel to Denis part one ...
997,I just got out of an early access showing and ...,I just got out of an early access showing and ...,I just got out of an early access showing and ...
998,"We have waited many, many years for a movie of...",We have waited many many years for a movie of...,We have waited many many years for a movie of...
999,This movie has the same problems as part one. ...,This movie has the same problems as part one ...,This movie has the same problems as part one ...


In [13]:
# 3: Remove stopwords by using the stopwords List
#creating remove_stopwords function
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = nltk.word_tokenize(text)
    filter_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filter_words)
df['clean_text_without_stopwords'] = df['clean_text_without_numbers'].apply(remove_stopwords)
print("\nData Frame after removing stopwords without lowercase:")
df


Data Frame after removing stopwords without lowercase:


Unnamed: 0,text,clean_text,clean_text_without_numbers,clean_text_without_stopwords
0,reviews,reviews,reviews,reviews
1,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,pleasure watch film early screening completely...
2,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same ...,If you liked or loved the first one the same ...,liked loved first one apply one Personally lov...
3,"Dune Part 2 is an epic movie; slickly made, an...",Dune Part 2 is an epic movie slickly made an...,Dune Part is an epic movie slickly made and...,Dune Part epic movie slickly made visually stu...
4,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,kind movie impossible justice talking kind exp...
...,...,...,...,...
996,This was a perfect sequel to Denis' part one. ...,This was a perfect sequel to Denis part one ...,This was a perfect sequel to Denis part one ...,perfect sequel Denis part one accomplished mai...
997,I just got out of an early access showing and ...,I just got out of an early access showing and ...,I just got out of an early access showing and ...,got early access showing absolutely incredible...
998,"We have waited many, many years for a movie of...",We have waited many many years for a movie of...,We have waited many many years for a movie of...,waited many many years movie caliber years old...
999,This movie has the same problems as part one. ...,This movie has the same problems as part one ...,This movie has the same problems as part one ...,movie problems part one editing terrible Scene...


In [14]:
# 4: Lowercase all texts
df['clean_text_lowercase'] = df['clean_text_without_stopwords'].apply(lambda x: x.lower())
print("\nData Frame after converting texts to lowercase:")
df


Data Frame after converting texts to lowercase:


Unnamed: 0,text,clean_text,clean_text_without_numbers,clean_text_without_stopwords,clean_text_lowercase
0,reviews,reviews,reviews,reviews,reviews
1,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,pleasure watch film early screening completely...,pleasure watch film early screening completely...
2,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same ...,If you liked or loved the first one the same ...,liked loved first one apply one Personally lov...,liked loved first one apply one personally lov...
3,"Dune Part 2 is an epic movie; slickly made, an...",Dune Part 2 is an epic movie slickly made an...,Dune Part is an epic movie slickly made and...,Dune Part epic movie slickly made visually stu...,dune part epic movie slickly made visually stu...
4,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,kind movie impossible justice talking kind exp...,kind movie impossible justice talking kind exp...
...,...,...,...,...,...
996,This was a perfect sequel to Denis' part one. ...,This was a perfect sequel to Denis part one ...,This was a perfect sequel to Denis part one ...,perfect sequel Denis part one accomplished mai...,perfect sequel denis part one accomplished mai...
997,I just got out of an early access showing and ...,I just got out of an early access showing and ...,I just got out of an early access showing and ...,got early access showing absolutely incredible...,got early access showing absolutely incredible...
998,"We have waited many, many years for a movie of...",We have waited many many years for a movie of...,We have waited many many years for a movie of...,waited many many years movie caliber years old...,waited many many years movie caliber years old...
999,This movie has the same problems as part one. ...,This movie has the same problems as part one ...,This movie has the same problems as part one ...,movie problems part one editing terrible Scene...,movie problems part one editing terrible scene...


In [15]:
#5. Stemming
stemmer = PorterStemmer()
def apply_stemming(text):
    words = nltk.word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)
df['clean_text_stemmed'] = df['clean_text_lowercase'].apply(apply_stemming)
print("\nData Frame after applying stemming:")
df


Data Frame after applying stemming:


Unnamed: 0,text,clean_text,clean_text_without_numbers,clean_text_without_stopwords,clean_text_lowercase,clean_text_stemmed
0,reviews,reviews,reviews,reviews,reviews,review
1,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,pleasure watch film early screening completely...,pleasure watch film early screening completely...,pleasur watch film earli screen complet blown ...
2,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same ...,If you liked or loved the first one the same ...,liked loved first one apply one Personally lov...,liked loved first one apply one personally lov...,like love first one appli one person love one ...
3,"Dune Part 2 is an epic movie; slickly made, an...",Dune Part 2 is an epic movie slickly made an...,Dune Part is an epic movie slickly made and...,Dune Part epic movie slickly made visually stu...,dune part epic movie slickly made visually stu...,dune part epic movi slickli made visual stun e...
4,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,kind movie impossible justice talking kind exp...,kind movie impossible justice talking kind exp...,kind movi imposs justic talk kind experi never...
...,...,...,...,...,...,...
996,This was a perfect sequel to Denis' part one. ...,This was a perfect sequel to Denis part one ...,This was a perfect sequel to Denis part one ...,perfect sequel Denis part one accomplished mai...,perfect sequel denis part one accomplished mai...,perfect sequel deni part one accomplish main p...
997,I just got out of an early access showing and ...,I just got out of an early access showing and ...,I just got out of an early access showing and ...,got early access showing absolutely incredible...,got early access showing absolutely incredible...,got earli access show absolut incred see imax ...
998,"We have waited many, many years for a movie of...",We have waited many many years for a movie of...,We have waited many many years for a movie of...,waited many many years movie caliber years old...,waited many many years movie caliber years old...,wait mani mani year movi calib year old femal ...
999,This movie has the same problems as part one. ...,This movie has the same problems as part one ...,This movie has the same problems as part one ...,movie problems part one editing terrible Scene...,movie problems part one editing terrible scene...,movi problem part one edit terribl scene conne...


In [16]:
#6. Lemmatization
lemmatizer = WordNetLemmatizer()
def apply_lemmatization(text):
    words = nltk.word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)
df['clean_text_lemmatized'] = df['clean_text_stemmed'].apply(apply_lemmatization)
print("\nData Frame after applying lemmatization:")
df


Data Frame after applying lemmatization:


Unnamed: 0,text,clean_text,clean_text_without_numbers,clean_text_without_stopwords,clean_text_lowercase,clean_text_stemmed,clean_text_lemmatized
0,reviews,reviews,reviews,reviews,reviews,review,review
1,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...,pleasure watch film early screening completely...,pleasure watch film early screening completely...,pleasur watch film earli screen complet blown ...,pleasur watch film earli screen complet blown ...
2,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same ...,If you liked or loved the first one the same ...,liked loved first one apply one Personally lov...,liked loved first one apply one personally lov...,like love first one appli one person love one ...,like love first one appli one person love one ...
3,"Dune Part 2 is an epic movie; slickly made, an...",Dune Part 2 is an epic movie slickly made an...,Dune Part is an epic movie slickly made and...,Dune Part epic movie slickly made visually stu...,dune part epic movie slickly made visually stu...,dune part epic movi slickli made visual stun e...,dune part epic movi slickli made visual stun e...
4,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,This is the kind of movie that is impossible t...,kind movie impossible justice talking kind exp...,kind movie impossible justice talking kind exp...,kind movi imposs justic talk kind experi never...,kind movi imposs justic talk kind experi never...
...,...,...,...,...,...,...,...
996,This was a perfect sequel to Denis' part one. ...,This was a perfect sequel to Denis part one ...,This was a perfect sequel to Denis part one ...,perfect sequel Denis part one accomplished mai...,perfect sequel denis part one accomplished mai...,perfect sequel deni part one accomplish main p...,perfect sequel deni part one accomplish main p...
997,I just got out of an early access showing and ...,I just got out of an early access showing and ...,I just got out of an early access showing and ...,got early access showing absolutely incredible...,got early access showing absolutely incredible...,got earli access show absolut incred see imax ...,got earli access show absolut incred see imax ...
998,"We have waited many, many years for a movie of...",We have waited many many years for a movie of...,We have waited many many years for a movie of...,waited many many years movie caliber years old...,waited many many years movie caliber years old...,wait mani mani year movi calib year old femal ...,wait mani mani year movi calib year old femal ...
999,This movie has the same problems as part one. ...,This movie has the same problems as part one ...,This movie has the same problems as part one ...,movie problems part one editing terrible Scene...,movie problems part one editing terrible scene...,movi problem part one edit terribl scene conne...,movi problem part one edit terribl scene conne...


In [17]:
# Saving the clean data to a new CSV file
df.to_csv('clean_data_.csv', index=False)
print("\nCleaned data saved ")


Cleaned data saved 


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [6]:
# Your code here
import pandas as pd
import nltk
from collections import Counter
# Download NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [7]:
# Read the data from cleaned_data_.csv
df = pd.read_csv('clean_data_.csv')

# Part 1: Parts of Speech (POS) Tagging
def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return pos_tags
# Iterate through each row and print the POS tagging on a new line
for index, row in df.iterrows():
    pos_tags = pos_tagging(row['clean_text'])
    print(f"POS tagging for row {index + 1}:\n{pos_tags}\n")
    noun_count = verb_count = adj_count = adv_count = 0
    for _, pos in pos_tags:
        if pos.startswith('N'):
            noun_count += 1
        elif pos.startswith('V'):
            verb_count += 1
        elif pos.startswith('JJ'):
            adj_count += 1
        elif pos.startswith('RB'):
            adv_count += 1
    print(f"Total Nouns: {noun_count}")
    print(f"Total Verbs: {verb_count}")
    print(f"Total Adjectives: {adj_count}")
    print(f"Total Adverbs: {adv_count}")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Total Adjectives: 21
Total Adverbs: 13
POS tagging for row 288:
[('Had', 'VBD'), ('the', 'DT'), ('pleasure', 'NN'), ('to', 'TO'), ('watch', 'VB'), ('this', 'DT'), ('film', 'NN'), ('in', 'IN'), ('an', 'DT'), ('early', 'JJ'), ('screening', 'NN'), ('and', 'CC'), ('was', 'VBD'), ('completely', 'RB'), ('blown', 'VBN'), ('away', 'RP'), ('Dune', 'NNP'), ('Part', 'NNP'), ('2', 'CD'), ('is', 'VBZ'), ('everything', 'NN'), ('one', 'NN'), ('could', 'MD'), ('ask', 'VB'), ('for', 'IN'), ('from', 'IN'), ('a', 'DT'), ('film', 'NN'), ('of', 'IN'), ('its', 'PRP$'), ('kind', 'NN'), ('As', 'IN'), ('a', 'DT'), ('big', 'JJ'), ('fan', 'NN'), ('of', 'IN'), ('the', 'DT'), ('Game', 'NNP'), ('of', 'IN'), ('Thrones', 'NNP'), ('franchise', 'NN'), ('it', 'PRP'), ('s', 'VBD'), ('been', 'VBN'), ('a', 'DT'), ('long', 'JJ'), ('time', 'NN'), ('since', 'IN'), ('iv', 'NN'), ('e', 'NN'), ('encountered', 'VBD'), ('this', 'DT'), ('level', 'NN'), ('of', 'IN'), (

In [8]:
# Install the required modules
!pip install benepar
!pip install tensorflow
!pip install tensorflow==2.8.0

# Downloading the required models
import benepar
import spacy.cli
benepar.download('benepar_en3')
spacy.cli.download("en_core_web_sm")

# Importing the libraries
import sys
import spacy
from spacy import displacy
parser = benepar.Parser("benepar_en3")
nlp = spacy.load('en_core_web_sm')
options = {'compact': True, 'font': 'Arial black', 'distance': 100}
# Parsing 3rd sentence in clean_text
try:
    tree = parser.parse(df['clean_text'][3])
    print("Constituency Parsing Tree:")
    print(tree)
except Exception as e:
    print(f"Error in constituency parsing: {e}")

# Dependency Parsing
doc = nlp(df['clean_text'][3])
print("Dependency Parsing Tree (Displayed Below):")
displacy.render(doc, style='dep', options=options, jupyter=True)

[31mERROR: Could not find a version that satisfies the requirement tensorflow==2.8.0 (from versions: 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0, 2.13.1, 2.14.0rc0, 2.14.0rc1, 2.14.0, 2.14.1, 2.15.0rc0, 2.15.0rc1, 2.15.0, 2.15.0.post1, 2.15.1, 2.16.0rc0, 2.16.1, 2.16.2, 2.17.0rc0, 2.17.0rc1, 2.17.0, 2.17.1, 2.18.0rc0, 2.18.0rc1, 2.18.0rc2, 2.18.0)[0m[31m
[0m[31mERROR: No matching distribution found for tensorflow==2.8.0[0m[31m
[0m

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


  state_dict = torch.load(
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Constituency Parsing Tree:
(TOP
  (S
    (S
      (NP (NNP Dune) (NNP Part) (CD 2))
      (VP
        (VBZ is)
        (NP
          (NP (DT an) (JJ epic) (NN movie))
          (VP
            (VP (ADVP (RB slickly)) (VBN made))
            (CC and)
            (ADJP (RB visually) (JJ stunning))))))
    (CC But)
    (S
      (S
        (NP (PRP I))
        (VP
          (VBD had)
          (S
            (VP
              (TO to)
              (VP
                (VB explain)
                (NP (RB quite) (DT a) (NN bit))
                (PP
                  (IN to)
                  (NP
                    (NP
                      (NP
                        (NP (DT the) (NNS friends))
                        (PP (IN around) (NP (PRP me))))
                      (SBAR
                        (WHNP (WP who))
                        (S
                          (VP
                            (VBD had)
                            (RB not)
                            (VP
             

In [9]:
# (3) Named Entity Recognition
import en_core_web_sm
# Loading the spaCy English model
nlp=en_core_web_sm.load()
# Named Entity Recognition (NER)
for x in df['clean_text']:
    doc = nlp(x)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    if entities:
        print(entities)

[('2', 'CARDINAL'), ('Thrones', 'ORG'), ('Denis Villeneuve', 'PERSON'), ('today', 'DATE'), ('Christopher Nolan', 'PERSON'), ('The Dark Knight Trilogy  ', 'ORG'), ('Oppenheimer', 'ORG'), ('Villeneuve', 'PERSON'), ('first', 'ORDINAL'), ('first', 'ORDINAL'), ('second', 'ORDINAL'), ('first', 'ORDINAL'), ('one', 'CARDINAL'), ('a few years ago', 'DATE'), ('don', 'PERSON'), ('2', 'CARDINAL'), ('first', 'ORDINAL'), ('one', 'CARDINAL'), ('first', 'ORDINAL'), ('Timoth e Chalamet', 'ORG'), ('Zendaya', 'PERSON'), ('first', 'ORDINAL'), ('Austin Butler', 'PERSON'), ('Rebecca Ferguson', 'PERSON'), ('one', 'CARDINAL'), ('the year', 'DATE'), ('Javier Bardem', 'PERSON'), ('Hans Zimmer', 'PERSON'), ('one', 'CARDINAL'), ('Oscar', 'PERSON'), ('Grammy', 'PERSON'), ('VFX  Production Design  Sound', 'ORG'), ('Denis', 'NORP'), ('1', 'CARDINAL'), ('first', 'ORDINAL'), ('first', 'ORDINAL'), ('2', 'CARDINAL'), ('one', 'CARDINAL'), ('recent decades', 'DATE'), ('Frank Herbert', 'PERSON'), ('many years ago', 'DATE')

# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [26]:
import time

# Base URL of GitHub Marketplace (Actions section)
BASE_URL = "https://github.com/marketplace?type=actions&page="

# Headers to mimic a browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,/;q=0.8",
    "Referer": "https://www.google.com/",
}

# Initialize list to store data
data = []
page = 0
# Scrape multiple pages (Assuming 20 products per page, we need ~50 pages for 1000 products)
while len(data)<1000:
    page+=1
    url = f"{BASE_URL}{page}"

    response = requests.get(url, headers=headers)
    #print(response.text)
    if response.status_code != 200:
        print(f"Failed to fetch page {page}")
        continue

    soup = BeautifulSoup(response.text, "html.parser")

    # Find all product containers

    items = soup.find_all('div', {'data-testid': 'non-featured-item'})
    for item in items:
        try:
            name = item.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf').text.strip()
            url = "https://github.com" + item.find('a')['href']
            description = item.find('p', class_='text-small').text.strip()
            data.append([name, description, url, page])
        except AttributeError:
            continue

# Save to CSV
df = pd.DataFrame(data, columns=["Product Name", "Description", "URL", "Page Number"])

df.to_csv("github_marketplace_actions.csv", index=False)
print("Scraping completed! Data saved to github_marketplace_actions.csv")

Scraping completed! Data saved to github_marketplace_actions.csv


In [27]:
#Part2
from nltk.tokenize import word_tokenize
df = pd.read_csv('/content/github_marketplace_actions.csv')

# Data Cleaning
def clean_text(text):
    text = str(text)  # making sure it's a string
    text = re.sub(r"<.*?>", "", text)  # Removing HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Removing special characters
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

# Apply text cleaning to Name & Description columns
df["Product Name"] = df["Product Name"].apply(clean_text)
df["Description"] = df["Description"].apply(clean_text)

# Tokenization, Stopword Removal, Lemmatization
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatization
    return " ".join(tokens)

df["Processed Description"] = df["Description"].apply(preprocess_text)

# Data Quality Checks
df.drop_duplicates(inplace=True)  # Remove duplicates
df.dropna(inplace=True)  # Handle missing values

# Save cleaned data
df.to_csv("cleaned_github_marketplace_actions.csv", index=False)

print("Cleaned data saved.")

Cleaned data saved.


#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [75]:
pip install tweepy



In [77]:
import tweepy

# Set your keys and tokens here
api_key =
api_key_secret =
access_token =
access_token_secret =

# Authenticate with Twitter
auth = tweepy.OAuth1UserHandler(
    consumer_key=api_key,
    consumer_secret=api_key_secret,
    access_token=access_token,
    access_token_secret=access_token_secret
)
api = tweepy.API(auth)

In [81]:
import tweepy

# Set up client with bearer token
client = tweepy.Client(bearer_token='')

# Define the query to target machine learning and AI-related hashtags
query = "#machinelearning OR #artificialintelligence -is:retweet"  # Exclude retweets

# Function to fetch tweets with pagination
def fetch_tweets(query, max_tweets):
    all_tweets = []
    next_token = None

    while len(all_tweets) < max_tweets:
        # Fetch tweets
        tweets = client.search_recent_tweets(
            query=query,
            tweet_fields=["id", "author_id", "text"],
            max_results=100,  # max allowed in one request
            next_token=next_token
        )

        # Add tweets to the list
        if tweets.data:
            all_tweets.extend(tweets.data)

        # If there are more tweets, get the next token for pagination
        next_token = tweets.meta.get('next_token') if 'next_token' in tweets.meta else None

        # Break if no next_token is available (i.e., no more tweets to fetch)
        if not next_token:
            break

    return all_tweets

# Fetch up to 1000 tweets
tweets = fetch_tweets(query, 1000)

# Print tweet ID, username, and text if data is available
for tweet in tweets:
    user = client.get_user(id=tweet.author_id)  # Fetch user details by author_id
    print(f"Tweet ID: {tweet.id}, Username: {user.data.username}, Tweet: {tweet.text}")

TooManyRequests: 429 Too Many Requests
Too Many Requests

In [82]:
import pandas as pd

# Function to clean tweet data
def clean_data(tweets):
    # Create an empty list to store cleaned data
    cleaned_data = []

    # Iterate through each tweet and clean
    for tweet in tweets:
        tweet_id = tweet.id
        username = tweet.author_id  # username will be fetched later
        text = tweet.text

        # Perform basic text cleaning: remove unwanted spaces, special characters
        text = text.strip()  # Remove leading/trailing whitespaces

        # Append the cleaned data to the list
        cleaned_data.append([tweet_id, username, text])

    # Create a DataFrame from the cleaned data
    df = pd.DataFrame(cleaned_data, columns=["tweet_id", "author_id", "text"])

    # Remove duplicates based on tweet ID or text
    df = df.drop_duplicates(subset=["tweet_id", "text"], keep="first")

    # Drop any rows with missing values (if any)
    df = df.dropna(subset=["tweet_id", "author_id", "text"])

    # Ensure consistency, e.g., removing excessive spaces in text
    df["text"] = df["text"].apply(lambda x: ' '.join(x.split()))  # Remove extra spaces between words

    return df

# Fetch tweets and clean data
tweets = fetch_tweets(query, 1000)

# Clean the data
cleaned_df = clean_data(tweets)

# Save the cleaned data to a CSV file
cleaned_df.to_csv("cleaned_tweets.csv", index=False)

print("Cleaned data has been saved to 'cleaned_tweets.csv'.")


TooManyRequests: 429 Too Many Requests
Too Many Requests

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [84]:
"""
Processing large datasets can be challenging due to extended runtimes and occasional crashes, making it important to efficiently manage and optimize code for better performance. It's valuable to gain a thorough understanding of these concepts.
Also, coming to 5th question the code was giving time out error  too many requests
"""

"\nProcessing large datasets can be challenging due to extended runtimes and occasional crashes, making it important to efficiently manage and optimize code for better performance. It's valuable to gain a thorough understanding of these concepts.\nAlso, coming to 5th question the code was giving time out error  too many requests\n"

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog