<a href="https://colab.research.google.com/github/Dheepthi-Reddy/DheepthiReddy_INFO5731_Fall2024/blob/main/Vangeti_Dheepthi_Assignment_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Your code here

# importing necessary libraries
import requests                   # 'requests' sends HTTP requests to fetch web pages
from bs4 import BeautifulSoup     # 'BeautifulSoup' parses HTML content and extracts data from web pages
import pandas as pd               # 'pandas' library for data manipulation operations

# define a 'getReviews' function to fetch IMDb reviews
def getReviews(totalReviews=1000):

    # empty list to store the fetched reviews
    reviews = []

    # IMDb URL of the user reviews page; the pagination key is appended to it
    url = "https://www.imdb.com/title/tt6791350/reviews/_ajax?ref_=undefined&paginationKey="

    # 'last_key' holds the pagination key; it is empty for the first page
    last_key = ""

    # loop until the requested number of reviews has been fetched
    while len(reviews) < totalReviews:

        # making the GET request to the URL with the pagination key
        response = requests.get(f"{url}{last_key}", timeout=30)

        if response.status_code != 200:
            break  # to break the loop if the request failed

        # to parse html content of the page using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
        # find every div whose class is 'text show-more__control' (one per review)
        reviewDiv = soup.find_all("div", class_="text show-more__control")

        # to loop through each review
        for review in reviewDiv:
            # reviews are appended to the 'reviews' list after removing any unwanted whitespace
            reviews.append(review.get_text(strip=True))
            # the loop stops once we reach the desired number of reviews
            if len(reviews) >= totalReviews:
                break
        # IMDb uses a pagination key for loading more reviews
        paginationKey = soup.find("div", class_="load-more-data")
        # if a pagination key is found, update last_key to request the next page
        if paginationKey:
            last_key = paginationKey["data-key"]
        else:
            # if not found we break the loop
            break

    # return at most 'totalReviews' reviews
    return reviews[:totalReviews]

# executing the defined function
imdb_reviews = getReviews(1000)

# saving the review to CSV file
df = pd.DataFrame(imdb_reviews, columns=["Reviews"])
df.to_csv("UserReviews.csv", index=False)

print("A total of", len(imdb_reviews), "reviews saved to UserReviews.csv file.")

A total of 1000 reviews saved to UserReviews.csv file.
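
Some cleaned rows later in the notebook contain fused words such as `trilogyguardian`. One likely cause (an assumption, since the raw HTML is not shown here) is that `get_text(strip=True)` joins text fragments separated by `<br/>` tags without any space between them. The sketch below uses a made-up HTML snippet to show what passing a separator changes; it is an illustration only and does not alter the scraper above.

```python
from bs4 import BeautifulSoup

# Hypothetical review markup: two paragraphs separated by <br/> tags.
html = (
    '<div class="text show-more__control">'
    "A great trilogy.<br/><br/>Guardians returns."
    "</div>"
)
review_div = BeautifulSoup(html, "html.parser").find(
    "div", class_="text show-more__control"
)

# Without a separator, the stripped fragments are concatenated directly.
joined = review_div.get_text(strip=True)
# With a separator, a single space is placed between the fragments.
spaced = review_div.get_text(" ", strip=True)

print(joined)  # A great trilogy.Guardians returns.
print(spaced)  # A great trilogy. Guardians returns.
```

With the spaced form, removing punctuation afterwards would leave `trilogy` and `Guardians` as separate words.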


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
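
As a rough illustration of the first, second, and fourth steps, the sketch below chains the regular-expression cleaning on one sample sentence using only the standard library. The helper names and the extra whitespace-collapsing step are assumptions added for the example; they are not part of the assignment's required steps.

```python
import re

def remove_noise(text):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", str(text))

def remove_numbers(text):
    # Drop runs of digits.
    return re.sub(r"\d+", "", text)

def collapse_spaces(text):
    # Hypothetical extra step: squeeze repeated whitespace left by the removals.
    return re.sub(r"\s+", " ", text).strip()

sample = "I loved the first GotG, and enjoyed Vol. 2 a lot"
step1 = remove_noise(sample)
step2 = remove_numbers(step1)
step3 = collapse_spaces(step2).lower()

print(step1)  # I loved the first GotG and enjoyed Vol 2 a lot
print(step3)  # i loved the first gotg and enjoyed vol a lot
```

Removing the digit leaves two consecutive spaces in `step2`, which is why the sketch collapses whitespace before lowercasing.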

In [2]:
# Write code for each of the sub parts with proper comments.


# importing necessary libraries
import pandas as pd       # 'pandas' library for data manipulation operations
import re                 # regular expressions library for string manipulation operations
import nltk               # Natural Language Toolkit for text processing and analysis
from nltk.corpus import stopwords                         # lists of common words to filter out for a given language
from nltk.stem import PorterStemmer, WordNetLemmatizer    # 'PorterStemmer' reduces words to their stems, 'WordNetLemmatizer' converts words to their dictionary form
from nltk.tokenize import word_tokenize                   # 'word_tokenize' splits text into individual words

# downloading NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

# reading the reviews file using pandas dataframe
df = pd.read_csv("UserReviews.csv")

# defined a function using regular expressions library to remove noise like special characters and punctuations
def removeNoise(review):
    return re.sub(r'[^\w\s]', '', str(review))

# defined a function using regular expressions to remove numbers
def removeNumbers(review):
    return re.sub(r'\d+', '', review)

# defined a function to convert text to lowercase
def lowercase(review):
    return review.lower()

# defined function to remove stopwords
def removeStopwords(review):

    # set of common English stopwords
    stopWords = set(stopwords.words('english'))
    # split the input string into separate words
    tokens = word_tokenize(review)
    # keep only the words that are not stopwords
    filtered_words = [word for word in tokens if word not in stopWords]
    # joining all the filtered words into a single string
    return ' '.join(filtered_words)

# Function for stemming the words
def stemmingFunc(review):
    # created a PorterStemmer object
    stemmer = PorterStemmer()

    # split the input string into separate words
    tokens = word_tokenize(review)
    # stem each word
    stemmed_words = [stemmer.stem(word) for word in tokens]

    # joining all the stemmed words into a single string
    return ' '.join(stemmed_words)

# defined function for lemmatization
def lemmatizationFunc(review):

    # created a WordNetLemmatizer object
    lemmatizer = WordNetLemmatizer()

    # split the input string into separate words
    tokens = word_tokenize(review)
    # lemmatize each word (treated as a noun, the default part of speech)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

    # joining all the lemmatized words into a single string
    return ' '.join(lemmatized_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [20]:
# 1. Removing the Noise and displaying after noise removal
df['Clean_Review'] = df['Reviews'].apply(removeNoise)
print("\n1. After Removing Noise:\n")
df.head(10)


1. After Removing Noise:



Index,Reviews,Clean_Review
0,"Guardians of the Galaxy Volume 3 is chaotic, w...",Guardians of the Galaxy Volume 3 is chaotic we...
1,Having sat through some phase 4 films that fai...,Having sat through some phase 4 films that fai...
2,This. This is what I've wanted. Yeah some of t...,This This is what Ive wanted Yeah some of the ...
3,It all leads back to where we once started off...,It all leads back to where we once started off...
4,"Up to this point, there has been one trilogy i...",Up to this point there has been one trilogy in...
5,"I'm not one to cry at movies often, but this o...",Im not one to cry at movies often but this one...
6,"""There is no God. That's why I stepped in."" I ...",There is no God Thats why I stepped in I have ...
7,"So first of all, everything to do with Rocket ...",So first of all everything to do with Rocket i...
8,A near perfect end to an incredible Marvel tri...,A near perfect end to an incredible Marvel tri...
9,"I loved the first GotG, and enjoyed Vol. 2 a l...",I loved the first GotG and enjoyed Vol 2 a lot...


In [21]:
# 2. Removing the numbers and displaying
df['Clean_Review'] = df['Clean_Review'].apply(removeNumbers)
print("\n2. After Removing Numbers:\n")
df.head(10)


2. After Removing Numbers:



Index,Reviews,Clean_Review
0,"Guardians of the Galaxy Volume 3 is chaotic, w...",Guardians of the Galaxy Volume is chaotic wei...
1,Having sat through some phase 4 films that fai...,Having sat through some phase films that fail...
2,This. This is what I've wanted. Yeah some of t...,This This is what Ive wanted Yeah some of the ...
3,It all leads back to where we once started off...,It all leads back to where we once started off...
4,"Up to this point, there has been one trilogy i...",Up to this point there has been one trilogy in...
5,"I'm not one to cry at movies often, but this o...",Im not one to cry at movies often but this one...
6,"""There is no God. That's why I stepped in."" I ...",There is no God Thats why I stepped in I have ...
7,"So first of all, everything to do with Rocket ...",So first of all everything to do with Rocket i...
8,A near perfect end to an incredible Marvel tri...,A near perfect end to an incredible Marvel tri...
9,"I loved the first GotG, and enjoyed Vol. 2 a l...",I loved the first GotG and enjoyed Vol a lot ...


In [22]:
# 3. Converting the text to lowercase
df['Clean_Review'] = df['Clean_Review'].apply(lowercase)
print("\n3. After Converting to Lowercase:\n")
df.head(10)


3. After Converting to Lowercase:



Index,Reviews,Clean_Review
0,"Guardians of the Galaxy Volume 3 is chaotic, w...",guardians of the galaxy volume is chaotic wei...
1,Having sat through some phase 4 films that fai...,having sat through some phase films that fail...
2,This. This is what I've wanted. Yeah some of t...,this this is what ive wanted yeah some of the ...
3,It all leads back to where we once started off...,it all leads back to where we once started off...
4,"Up to this point, there has been one trilogy i...",up to this point there has been one trilogy in...
5,"I'm not one to cry at movies often, but this o...",im not one to cry at movies often but this one...
6,"""There is no God. That's why I stepped in."" I ...",there is no god thats why i stepped in i have ...
7,"So first of all, everything to do with Rocket ...",so first of all everything to do with rocket i...
8,A near perfect end to an incredible Marvel tri...,a near perfect end to an incredible marvel tri...
9,"I loved the first GotG, and enjoyed Vol. 2 a l...",i loved the first gotg and enjoyed vol a lot ...


In [23]:
# 4. Removing Stopwords
df['Clean_Review'] = df['Clean_Review'].apply(removeStopwords)
print("\n4. After Removing Stopwords:\n")
df.head(10)


4. After Removing Stopwords:



Index,Reviews,Clean_Review
0,"Guardians of the Galaxy Volume 3 is chaotic, w...",guardians galaxy volume chaotic weird oftentim...
1,Having sat through some phase 4 films that fai...,sat phase films failed inspire guardians feels...
2,This. This is what I've wanted. Yeah some of t...,ive wanted yeah jokes bit silly tone bit confu...
3,It all leads back to where we once started off...,leads back started great trilogies indicated p...
4,"Up to this point, there has been one trilogy i...",point one trilogy mcu excellent start finish t...
5,"I'm not one to cry at movies often, but this o...",im one cry movies often one broke four merely ...
6,"""There is no God. That's why I stepped in."" I ...",god thats stepped admit one best lines ever sp...
7,"So first of all, everything to do with Rocket ...",first everything rocket movie absolutely incre...
8,A near perfect end to an incredible Marvel tri...,near perfect end incredible marvel trilogyguar...
9,"I loved the first GotG, and enjoyed Vol. 2 a l...",loved first gotg enjoyed vol lot original quir...


In [24]:
# 5. Apply Stemming
df['Clean_Review'] = df['Clean_Review'].apply(stemmingFunc)
print("\n5. After Stemming the text:\n")
df.head(10)


5. After Stemming the text:



Index,Reviews,Clean_Review
0,"Guardians of the Galaxy Volume 3 is chaotic, w...",guardian galaxi volum chaotic weird oftentim r...
1,Having sat through some phase 4 films that fai...,sat phase film fail inspir guardian feel like ...
2,This. This is what I've wanted. Yeah some of t...,ive want yeah joke bit silli tone bit confus t...
3,It all leads back to where we once started off...,lead back start great trilog indic past gotg s...
4,"Up to this point, there has been one trilogy i...",point one trilog mcu excel start finish time a...
5,"I'm not one to cry at movies often, but this o...",im one cri movi often one broke four mere esti...
6,"""There is no God. That's why I stepped in."" I ...",god that step admit one best line ever spoken ...
7,"So first of all, everything to do with Rocket ...",first everyth rocket movi absolut incred heart...
8,A near perfect end to an incredible Marvel tri...,near perfect end incred marvel trilogyguardian...
9,"I loved the first GotG, and enjoyed Vol. 2 a l...",love first gotg enjoy vol lot origin quirki fu...


In [25]:
# 6. Apply Lemmatization
df['Clean_Review'] = df['Clean_Review'].apply(lemmatizationFunc)
print("\n6. After Lemmatization:\n")
df.head(10)


6. After Lemmatization:



Index,Reviews,Clean_Review
0,"Guardians of the Galaxy Volume 3 is chaotic, w...",guardian galaxi volum chaotic weird oftentim r...
1,Having sat through some phase 4 films that fai...,sat phase film fail inspir guardian feel like ...
2,This. This is what I've wanted. Yeah some of t...,ive want yeah joke bit silli tone bit confus t...
3,It all leads back to where we once started off...,lead back start great trilog indic past gotg s...
4,"Up to this point, there has been one trilogy i...",point one trilog mcu excel start finish time a...
5,"I'm not one to cry at movies often, but this o...",im one cri movi often one broke four mere esti...
6,"""There is no God. That's why I stepped in."" I ...",god that step admit one best line ever spoken ...
7,"So first of all, everything to do with Rocket ...",first everyth rocket movi absolut incred heart...
8,A near perfect end to an incredible Marvel tri...,near perfect end incred marvel trilogyguardian...
9,"I loved the first GotG, and enjoyed Vol. 2 a l...",love first gotg enjoy vol lot origin quirki fu...
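
The first ten rows look the same after stemming and after lemmatization. That is consistent with the stems no longer being dictionary words: `WordNetLemmatizer.lemmatize` returns its input unchanged when it finds no match in WordNet. The minimal sketch below reproduces only the stemming side on a few words taken from the reviews above, to show the truncated forms the lemmatizer receives; the word list is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words as they appear in the reviews before stemming.
words = ["guardians", "galaxy", "volume", "movies", "absolutely"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

If dictionary forms are wanted, one option would be to lemmatize the text as it was before stemming and keep the result in a separate column.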


In [9]:
# Save cleaned dataset
df.to_csv("UserReviews_Cleaned.csv", index = False)
print("\nCleaned data saved to UserReviews_Cleaned.csv file")


Cleaned data saved to UserReviews_Cleaned.csv file


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [10]:
# Your code here

# 1. Parts of Speech (POS) Tagging:

# importing necessary libraries
import pandas as pd                           # 'pandas' library for data manipulation operations
import nltk                                   # Natural Language Toolkit for text processing and analysis
from nltk.tokenize import word_tokenize       # 'word_tokenize' to split the text into separate words
from nltk import pos_tag                      # 'pos_tag' from nltk for parts of speech tagging
from collections import Counter               # to count items

# downloading necessary nltk data
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

# defined a function to perform POS tagging and counting
def POS_Tagging(reviews):
    # split the input string into separate words
    tokens = word_tokenize(reviews)
    # to perform POS tagging
    tags = pos_tag(tokens)
    # counter for each part of speech of interest
    posCounts = Counter()
    # maps the first two characters of a Penn Treebank tag to the part of speech of interest
    posInterest = {'NN': 'Noun', 'VB': 'Verb', 'JJ': 'Adjective', 'RB': 'Adverb'}

    # to count occurrences of each POS of interest
    # word is the actual word from the input text, tag is the POS assigned to that word
    for word, tag in tags:
        # the first two characters of the tag give the broad category (e.g. NNS -> NN)
        if tag[:2] in posInterest:
            # incrementing the count
            posCounts[posInterest[tag[:2]]] += 1
    return posCounts

# applying the POS tagging and counting function to each review and collecting the results
list_POS_Counts = df['Clean_Review'].apply(POS_Tagging).tolist()
# initialized a list for each part of speech
nouns = []
verbs = []
adjectives = []
adverbs = []
# loop to iterate through the list of counts and update the separate list for each part of speech
for counts in list_POS_Counts:
    nouns.append(counts.get('Noun', 0))          # to get count of Nouns else 0 if not present
    verbs.append(counts.get('Verb', 0))          # to get count of Verbs else 0 if not present
    adjectives.append(counts.get('Adjective', 0))  # to get count of Adjectives else 0 if not present
    adverbs.append(counts.get('Adverb', 0))      # to get count of Adverbs else 0 if not present

# created a new DataFrame using the above lists
POS_df = pd.DataFrame({
    'Nouns': nouns,
    'Verbs': verbs,
    'Adjectives': adjectives,
    'Adverbs': adverbs,
})

# adding the Reviews column to the new DataFrame
POS_df.insert(0, 'Clean_Review', df['Clean_Review'].values)

# Display the new DataFrame
print("\n",POS_df.head())

# total count of each part of speech across all reviews
total_nouns = sum(nouns)
total_verbs = sum(verbs)
total_adjectives = sum(adjectives)
total_adverbs = sum(adverbs)

# printing the final counts
print(f"\nTotal Nouns: {total_nouns}")
print(f"Total Verbs: {total_verbs}")
print(f"Total Adjectives: {total_adjectives}")
print(f"Total Adverbs: {total_adverbs}")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



                                         Clean_Review  Nouns  Verbs  \
0  guardian galaxi volum chaotic weird oftentim r...     89     20   
1  sat phase film fail inspir guardian feel like ...     58      8   
2  ive want yeah joke bit silli tone bit confus t...     66     11   
3  lead back start great trilog indic past gotg s...     36     12   
4  point one trilog mcu excel start finish time a...     90     14   

   Adjectives  Adverbs  
0          26        9  
1          20        4  
2          25        6  
3          15        7  
4          38        4  

Total Nouns: 62905
Total Verbs: 15004
Total Adjectives: 24950
Total Adverbs: 5710
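
The counts above come from grouping Penn Treebank tags by their first two characters, so that `NN`, `NNS`, and `NNP` all count as nouns. A minimal sketch of that grouping on hand-written `(word, tag)` pairs follows; the pairs are illustrative and are not tagger output, and the function name is made up for the example.

```python
from collections import Counter

# Hand-written (word, tag) pairs in Penn Treebank style; illustrative only.
tagged = [
    ("guardians", "NNS"),
    ("are", "VBP"),
    ("really", "RB"),
    ("chaotic", "JJ"),
    ("heroes", "NNS"),
    ("returned", "VBD"),
    ("the", "DT"),
]

PREFIX_TO_POS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def count_pos(pairs):
    """Count tags by their two-character prefix, ignoring other categories."""
    counts = Counter()
    for _, tag in pairs:
        label = PREFIX_TO_POS.get(tag[:2])
        if label is not None:
            counts[label] += 1
    return counts

counts = count_pos(tagged)
print(dict(counts))
```

The determiner `DT` falls outside the four categories and is skipped.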


In [11]:
# 2. Constituency Parsing and Dependency Parsing:

# installing the spaCy library
!pip install spacy
#downloading the English language model for spaCy
!python -m spacy download en_core_web_sm
#installing the Benepar library, used for constituency parsing
!pip install benepar

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 61.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Building wheels for collected packages: benepar
  Building whe

In [12]:
# Constituency Parsing:

import pandas as pd                     # 'pandas' library for data manipulation operations
import spacy                            # NLP library for text processing
import benepar                          # library for generating constituency parse trees
from nltk import Tree                   # importing the Tree class from NLTK for tree structures

# Download the Benepar model for generating constituency parse trees
benepar.download('benepar_en3')

userReviews_df = pd.read_csv("UserReviews_Cleaned.csv")    # loading the csv file
nlp = spacy.load("en_core_web_sm")        # to load the English language model from spaCy
# add the benepar component to the spaCy pipeline so that constituency parsing runs after the standard parser
nlp.add_pipe("benepar", config={"model": "benepar_en3"}, after='parser')

# defined a function to split text into small chunks
def split_into_chunks(text, max_length=500):
    words = text.split()        # list of words after splitting
    chunks = []                 # list to store the finished chunks
    current_chunk = []          # words of the chunk currently being built

    # for loop to iterate over each word in the list
    for word in words:
        current_chunk.append(word)    # each word is appended to the current chunk
        # check whether this word pushed the chunk to max_length characters or more;
        # the length guard avoids emitting an empty chunk for a single very long word
        if len(" ".join(current_chunk)) >= max_length and len(current_chunk) > 1:
            chunks.append(" ".join(current_chunk[:-1]))  # adding the current chunk without the last word
            current_chunk = [word]  # starting a new chunk with the last word

    # add any remaining words as the final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

print("\nConstituency Parsing:")

# for loop over the user reviews (only the first review here, to keep the output short)
for review in userReviews_df['Clean_Review'][:1]:

    # splitting the review into chunks of under 500 characters so each stays below the 512-token limit
    chunks = split_into_chunks(review, max_length=500)
    # for loop to iterate through each chunk
    for chunk in chunks:
        try:
            doc = nlp(chunk)
            for sent in doc.sents:

                constituency_tree = sent._.parse_string    # constituency parse of the sentence as a bracketed string
                tree = Tree.fromstring(constituency_tree)  # convert the bracketed string to an NLTK Tree object
                tree.pretty_print()                        # print the tree as ASCII art
        # exception handling for any errors
        except ValueError as e:
            print(f"Error processing chunk: {chunk}")
            print(e)

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
  state_dict = torch.load(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Constituency Parsing:




[Constituency parse tree printed by `tree.pretty_print()`, with root node S; the remainder of the tree is truncated in the captured output.]

**Constituency Parsing:**

Constituency parsing analyzes a sentence by breaking it down into nested units called constituents (phrases), showing how words group together into progressively larger grammatical units.


Example sentence: guardian galaxi volum chaotic weird

In the above sentence, the constituency tree groups the words into phrases, where:

-> "guardian": this is a noun (NN) and the head noun of a noun phrase (NP).

-> "galaxi": this is also a noun (NN); it modifies "guardian", describing a specific type of guardian.

-> "volum": this is a noun (NN) and can be considered a further modifier of "guardian".

-> "chaotic": this is an adjective (JJ) describing the qualities of "guardian".

-> "weird": this is another adjective (JJ), adding to the description of the nouns.

The fragment I took here forms a noun phrase (NP) with "guardian" (NN) as the head, modified by additional nouns ("galaxi" and "volum") and adjectives ("chaotic" and "weird"). The tree shows how these individual elements combine into a single phrase describing the "guardian".
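
To make the grouping concrete, the sketch below builds an NLTK `Tree` from a hand-written bracketed string that mirrors the analysis above. The bracketing is an assumption written for illustration; it is not the output that benepar produced for this review.

```python
from nltk import Tree

# Hand-written bracketing mirroring the explanation above (not benepar output).
bracketed = "(NP (NN guardian) (NN galaxi) (NN volum) (JJ chaotic) (JJ weird))"
tree = Tree.fromstring(bracketed)

print(tree.label())    # label of the top-level constituent
print(tree.leaves())   # the words, left to right
print(tree.pos())      # (word, tag) pairs read off the tree
tree.pretty_print()    # ASCII drawing of the tree
```

Each parenthesized group is one constituent: the outer `NP` spans all five words, and each inner group pairs a tag with a single word.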



In [26]:
# Dependency Parsing:

from spacy import displacy  # displaCy, spaCy's visualizer, for rendering dependency trees

# for loop over the cleaned reviews (only the first review here)
for review in userReviews_df['Clean_Review'][:1]:

    doc = nlp(review)  # creating a doc object for processing the review text with spaCy
    print("Dependency Parsing:\n")
    displacy.render(doc, style='dep', jupyter=True)  # for rendering the dependency parse tree

    # for loop to iterate through each token in the processed document
    for token in doc:
        print(f"{token.text} -> {token.dep_} -> {token.pos_}")



Dependency Parsing:



guardian -> compound -> PROPN
galaxi -> compound -> PROPN
volum -> nmod -> PROPN
chaotic -> amod -> ADJ
weird -> amod -> PROPN
oftentim -> compound -> PROPN
ridicul -> nsubj -> PROPN
also -> advmod -> ADV
full -> amod -> ADJ
heart -> compound -> NOUN
emot -> nmod -> NOUN
great -> amod -> ADJ
themesi -> nsubj -> NOUN
must -> aux -> AUX
say -> ROOT -> VERB
best -> amod -> ADJ
marvel -> compound -> PROPN
movi -> compound -> PROPN
sinc -> compound -> PROPN
endgam -> nsubj -> PROPN
that -> det -> SCONJ
necessarili -> nsubj -> NOUN
hard -> advmod -> ADV
though -> mark -> ADV
need -> ccomp -> VERB
surpass -> amod -> ADJ
way -> npadvmod -> NOUN
home -> advmod -> ADV
amaz -> prep -> ADV
high -> amod -> ADJ
moment -> pobj -> NOUN
lazi -> compound -> PROPN
other -> amod -> ADJ
marvel -> compound -> NOUN
desper -> compound -> NOUN
need -> nsubj -> NOUN
hit -> ccomp -> VERB
theyv -> compound -> PROPN
final -> amod -> ADJ
got -> compound -> VERB
ithighlightseveri -> compound -> PROPN
member -> compo

**Dependency Parsing:**

Dependency parsing describes the grammatical structure of a sentence as a set of directed relations between words, in which each word depends on a head word.

For the same sentence, the dependency labels printed above are:

-> "guardian": labeled as a compound modifier (compound) and tagged PROPN, so the parser attaches it to another noun rather than treating it as the root.

-> "galaxi": also labeled as a compound modifier (compound) and tagged PROPN.

-> "volum": labeled as a nominal modifier (nmod), a noun that modifies another noun.

-> "chaotic": labeled as an adjectival modifier (amod), an adjective describing a noun.

-> "weird": also labeled as an adjectival modifier (amod), although it is tagged PROPN rather than ADJ.

None of these five words is the root of the parse; the ROOT label is assigned to "say" later in the review. The printed output lists each word's relation and part of speech but not its head, so the word that each modifier attaches to cannot be read from this output alone.
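
A dependency parse is easier to read when each word is listed with its head. The sketch below works on `(word, relation, head)` triples, using a hand-written example sentence ("The movie surprised me") rather than parser output; the function names are made up for illustration. With spaCy, such triples could be collected as `[(t.text, t.dep_, t.head.text) for t in doc]`, where the root token is its own head.

```python
from collections import defaultdict

# Hand-written arcs for "The movie surprised me"; illustrative, not parser output.
arcs = [
    ("The", "det", "movie"),
    ("movie", "nsubj", "surprised"),
    ("surprised", "ROOT", "surprised"),
    ("me", "dobj", "surprised"),
]

def find_root(triples):
    """Return the word carrying the ROOT relation."""
    return next(word for word, dep, _ in triples if dep == "ROOT")

def children_by_head(triples):
    """Group every non-root word under the head it depends on."""
    children = defaultdict(list)
    for word, dep, head in triples:
        if dep != "ROOT":
            children[head].append((word, dep))
    return dict(children)

root = find_root(arcs)
children = children_by_head(arcs)

print("root:", root)
for head, deps in children.items():
    print(head, "<-", deps)
```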

In [14]:
# 3. Named Entity Recognition

import spacy                      # NLP library for named entity recognition (NER)
from collections import Counter   # for counting hashable objects
import pandas as pd               # 'pandas' library for data manipulation operations

nlp = spacy.load("en_core_web_sm")    # Loading the spaCy model

def namedEntityRecognition(text):
    doc = nlp(text)   # processing the text using spaCy
    entities = []     # initializing a list to store the  entities
    for entity in doc.ents:    # for loop to extract entities and their labels
        entities.append((entity.text.strip(), entity.label_))  # Append the entity text and label as a tuple
    return entities

entity_list = df['Clean_Review'].apply(namedEntityRecognition)   # Apply NER to each review and calculating the results
allEntities = []     # initializing a list to store all the entities

# Iterate through each sublist of entities and append its items to the allEntities list
for sublist in entity_list:
    for entity in sublist:
        allEntities.append(entity)

entity_count = Counter(allEntities)     # counting each unique entity using Counter
entity_df = pd.DataFrame(entity_count.items(), columns=['Entity', 'Count'])    # converting the Counter to a DataFrame for formatting
# Split the Entity into two separate columns
entity_df[['Entity Text', 'Entity Type']] = pd.DataFrame(entity_df['Entity'].tolist(), index=entity_df.index)

# formatting the final dataframe properly
entity_df = entity_df.drop(columns=['Entity'])
entity_df = entity_df[['Entity Text', 'Entity Type', 'Count']]
entity_df = entity_df.sort_values(by='Count', ascending=False).reset_index(drop=True)   # Sort the DataFrame by Count in descending order

# final entity count dataframe
print(entity_df)

                      Entity Text Entity Type  Count
0                             one    CARDINAL    704
1                           first     ORDINAL    440
2                             two    CARDINAL    251
3                            seri        NORP    158
4                          second     ORDINAL    131
...                           ...         ...    ...
2562                    tensionth     ORDINAL      1
2563                   chang pace      PERSON      1
2564  filmmak motiv inspir vision      PERSON      1
2565                       soonwa         ORG      1
2566                        greek        NORP      1

[2567 rows x 3 columns]
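
The nested loop that flattens `entity_list` and the `Counter` call that follows it can be expressed more compactly. This is a minimal sketch, assuming each review yields a list of `(text, label)` tuples as `namedEntityRecognition` returns; the `sample_entity_lists` data is made up purely for illustration and is not taken from the review dataset.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-review NER results: one list of (text, label) tuples per review.
sample_entity_lists = [
    [("one", "CARDINAL"), ("first", "ORDINAL")],
    [],
    [("one", "CARDINAL"), ("two", "CARDINAL")],
]

# chain.from_iterable flattens the per-review lists without an explicit nested loop.
entity_count = Counter(chain.from_iterable(sample_entity_lists))

# most_common() returns ((text, label), count) pairs sorted by descending count.
rows = [(text, label, count) for (text, label), count in entity_count.most_common()]
for row in rows:
    print(row)
```

With pandas available, `pd.DataFrame(rows, columns=['Entity Text', 'Entity Type', 'Count'])` would give a table with the same column layout as the output above, without the intermediate `Entity` column.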


In [15]:
# Separate entities by type for counting
typesOfEntity = {}        # initialized empty dictionary to store different entity types as keys and their respective counts

# for loop to iterate through allEntities
for entity, labelName in allEntities:

    # if labelName is not yet a key in typesOfEntity, initialize a new Counter for that label
    if labelName not in typesOfEntity:
        typesOfEntity[labelName] = Counter()
    typesOfEntity[labelName][entity] += 1

totalCount = {}     # initialized an empty dictionary to store total counts by entity type

# for loop to iterate over each entity type and its corresponding Counter of entities
for label, entities in typesOfEntity.items():
    totalCount[label] = sum(entities.values())  # total number of entity mentions under this label
    print(f"Entity Type: {label} => Total: {totalCount[label]}")
    for entity, count in entities.most_common():
        print(f"  {entity}: {count}")

Entity Type: CARDINAL => Total: 1167
  one: 704
  two: 251
  three: 72
  half: 37
  ten: 15
  four: 9
  million: 7
  togeth: 6
  five: 6
  everyon: 6
  zero: 6
  someth: 5
  thousand: 3
  famou: 3
  billion: 3
  one two: 3
  six: 2
  togeth one: 2
  two three: 1
  occas: 1
  tear least five: 1
  someth rewatch million: 1
  nearli half: 1
  dozen: 1
  eleven: 1
  ton: 1
  twofold: 1
  half dozen: 1
  beauti: 1
  someth beauti: 1
  throughoutth: 1
  balnk: 1
  almost nine: 1
  half billion: 1
  howev: 1
  nine: 1
  showcas: 1
  two film third: 1
  one billion: 1
  seven: 1
  nico: 1
  fightsom: 1
  one million: 1
  thisgunn: 1
  waseven: 1
Entity Type: ORG => Total: 1026
  funni: 52
  issu: 29
  starlord: 27
  abus: 27
  chukwudi iwuji: 18
  superhero movi: 16
  plu: 15
  guardian: 11
  narr: 11
  qualiti: 10
  tissu: 9
  genr: 8
  gunn movi: 7
  seri: 7
  closur: 6
  replac: 6
  surpris movi: 6
  confus: 5
  funni time: 5
  horrif: 5
  primari: 4
  legendari: 4
  gunn guardian: 4
  spec
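
The per-type tally in the cell above can also be sketched with `defaultdict(Counter)`, which removes the explicit membership check. This is an illustrative variant, assuming `allEntities` is a flat list of `(text, label)` tuples; the `sample_entities` list below is invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical flat list of (text, label) tuples standing in for allEntities.
sample_entities = [
    ("one", "CARDINAL"),
    ("two", "CARDINAL"),
    ("one", "CARDINAL"),
    ("first", "ORDINAL"),
]

# defaultdict creates an empty Counter the first time a label is seen.
counts_by_type = defaultdict(Counter)
for text, label in sample_entities:
    counts_by_type[label][text] += 1

# Total number of mentions per entity type.
total_by_type = {label: sum(counter.values()) for label, counter in counts_by_type.items()}

for label, counter in counts_by_type.items():
    print(f"Entity Type: {label} => Total: {total_by_type[label]}")
    for text, count in counter.most_common():
        print(f"  {text}: {count}")
```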

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

https://drive.google.com/drive/u/1/folders/1IKO2k1pW--2ewtz0duqDDzLsgEbB8FLy

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [19]:
# Write your response below
'''
The assignment is good yet challenging. I really enjoyed scraping reviews of different movies into a dataset.
I found the 3rd question quite challenging; it took most of the time I spent on the whole assignment.
The time provided to complete the assignment was good, but having both an in-class exercise and an assignment
in the same week made it even more challenging to meet the deadlines.

'''

'\nThe assignment is good yet challenging. I really enjoyed scraping reviews of different movies into a dataset.\nI found the 3rd question quite challenging; it took most of the time I spent on the whole assignment.\nThe time provided to complete the assignment was good, but having both an in-class exercise and an assignment\nin the same week made it even more challenging to meet the deadlines.\n\n'