<a href="https://colab.research.google.com/github/Grishma5278/Info-5731/blob/main/Tallapareddy_Grishma_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [27]:
import requests
from bs4 import BeautifulSoup
import csv

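def fetch_reviews(imdb_id):
    """Return the user reviews found on the first reviews page of an IMDb title."""
    reviews = []
    url = f"https://www.imdb.com/title/{imdb_id}/reviews?ref_=tt_ql_3"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Stop early on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Each review lives in its own container div
    review_divs = soup.find_all('div', class_='lister-item mode-detail imdb-user-review collapsable')
    for div in review_divs:
        title_tag = div.find('a', class_='title')
        text_tag = div.find('div', class_='text show-more__control')
        if title_tag is None or text_tag is None:
            continue  # Skip containers that lack a title or a review body
        title = title_tag.text.strip()
        text = text_tag.text.strip()
        try:
            rating = div.find('span', class_='rating-other-user-rating').span.text.strip()
        except AttributeError:
            # Reviews without a star rating have no rating span
            rating = 'No Rating'

        reviews.append({'review_title': title, 'review_text': text, 'rating': rating})

    return reviews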

def save_reviews_to_csv(reviews, filename):
    """Write the review dicts to a CSV file with a header row."""
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['review_title', 'review_text', 'rating'])
        writer.writeheader()
        writer.writerows(reviews)

# Example usage
imdb_id = 'tt8178634'  # IMDb ID for "RRR", the movie whose reviews are collected here
reviews = fetch_reviews(imdb_id)
save_reviews_to_csv(reviews, 'rrr_reviews.csv')
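
The function above requests a single page, so it returns only the reviews contained in that one response, while the question asks for 1000. Below is a minimal sketch of a collection loop. It assumes a hypothetical `fetch_page(page_key)` callable that returns a batch of review dicts together with the key of the next page (or `None` when no page is left). How IMDb actually exposes further pages is not shown in this notebook and is not assumed here, so a fake page source stands in for it.

```python
def collect_reviews(fetch_page, target=1000, max_pages=200):
    """Call fetch_page repeatedly until `target` reviews are gathered.

    fetch_page is a hypothetical callable:
        fetch_page(page_key) -> (list_of_review_dicts, next_page_key_or_None)
    """
    reviews = []
    page_key = None  # None means "start at the first page"
    for _ in range(max_pages):
        batch, page_key = fetch_page(page_key)
        reviews.extend(batch)
        # Stop when enough reviews are collected or no further page exists.
        if len(reviews) >= target or page_key is None:
            break
    return reviews[:target]


# Stand-in page source for demonstration only: 3 pages of 25 fake reviews each.
def fake_fetch_page(page_key):
    page = 0 if page_key is None else page_key
    batch = [{'review_title': f't{page}-{i}', 'review_text': 'text', 'rating': '9'}
             for i in range(25)]
    next_key = page + 1 if page < 2 else None
    return batch, next_key


sample = collect_reviews(fake_fetch_page, target=60)
print(len(sample))  # 60
```

Passing the page source in as an argument keeps the loop independent of any particular site and makes it easy to test without network access.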

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
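
As a rough illustration of how steps (1) to (4) chain together, here is a minimal standard-library sketch. The stopword set is a tiny hypothetical stand-in for the NLTK list, and stemming and lemmatization are left out because they need NLTK. Lowercasing is applied before stopword removal so that capitalized words such as "I" match a lowercase stopword list.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses nltk.corpus.stopwords.
STOP_WORDS = {'i', 'have', 'a', 'of', 'in', 'my', 'the', 'and'}


def clean_text(text, stop_words=STOP_WORDS):
    """Strip punctuation, strip digits, lowercase, then drop stopwords."""
    text = re.sub(r'[^\w\s]', '', text)   # (1) punctuation and special characters
    text = re.sub(r'\d+', '', text)       # (2) numbers
    text = text.lower()                   # (4) lowercase before matching stopwords
    tokens = [w for w in text.split() if w not in stop_words]  # (3) stopwords
    return ' '.join(tokens)


print(clean_text("I have seen 2 movies in my time, and the best was RRR!"))
# seen movies time best was rrr
```

Note that `[^\w\s]` deletes punctuation without inserting a space, so hyphenated or contracted words are fused ("mash-up" becomes "mashup").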

In [28]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('rrr_reviews.csv')

df

Unnamed: 0,review_title,review_text,rating
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9
2,The best superhero movie in years,I have to try and review this without comparin...,9
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9
4,Wow........,This was an incredible film. I never heard of ...,10
5,I wish I could've seen this in a theater,There is officially ZERO reason to watch Gray ...,10
6,"As An American, This is Everything Modern Star...","SO long story short, word of mouth happened an...",10
7,"Rousing, Rampaging, Revolutionary...",It strikes me that in recent times not many fi...,8
8,SS Rajamouli Delivers A Power Packed Action En...,The last time director SS Rajamouli managed to...,10
9,Three outrageous hours.,"Don't ask me for a plot, because I didn't have...",9


In [29]:
df['noise_removed'] = df['review_text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...


In [30]:
df['numbers_removed'] = df['noise_removed'].apply(lambda x: re.sub(r'\d+', '', x))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...,I have seen a lot of movies in my time made in...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...,I bet youd never think the mashup the heavyhan...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...,I have to try and review this without comparin...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...,When I pushed play I did not really believe th...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...,This was an incredible film I never heard of t...


In [31]:
df['lowercased'] = df['numbers_removed'].apply(lambda x: x.lower())
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...,I have seen a lot of movies in my time made in...,i have seen a lot of movies in my time made in...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...,I bet youd never think the mashup the heavyhan...,i bet youd never think the mashup the heavyhan...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...,I have to try and review this without comparin...,i have to try and review this without comparin...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...,When I pushed play I did not really believe th...,when i pushed play i did not really believe th...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...,This was an incredible film I never heard of t...,this was an incredible film i never heard of t...


In [32]:
# Download the stopword list before using it (no-op if it is already present)
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
df['stopwords_removed'] = df['lowercased'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...,I have seen a lot of movies in my time made in...,i have seen a lot of movies in my time made in...,seen lot movies time made lot different styles...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...,I bet youd never think the mashup the heavyhan...,i bet youd never think the mashup the heavyhan...,bet youd never think mashup heavyhanded rambo ...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...,I have to try and review this without comparin...,i have to try and review this without comparin...,try review without comparing anything directly...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...,When I pushed play I did not really believe th...,when i pushed play i did not really believe th...,pushed play really believe would ever watch wh...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...,This was an incredible film I never heard of t...,this was an incredible film i never heard of t...,incredible film never heard film netflix broug...


In [33]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [34]:
ps = PorterStemmer()
df['stemmed'] = df['stopwords_removed'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...,I have seen a lot of movies in my time made in...,i have seen a lot of movies in my time made in...,seen lot movies time made lot different styles...,seen lot movi time made lot differ style diffe...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...,I bet youd never think the mashup the heavyhan...,i bet youd never think the mashup the heavyhan...,bet youd never think mashup heavyhanded rambo ...,bet youd never think mashup heavyhand rambo my...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...,I have to try and review this without comparin...,i have to try and review this without comparin...,try review without comparing anything directly...,tri review without compar anyth directli reall...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...,When I pushed play I did not really believe th...,when i pushed play i did not really believe th...,pushed play really believe would ever watch wh...,push play realli believ would ever watch whole...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...,This was an incredible film I never heard of t...,this was an incredible film i never heard of t...,incredible film never heard film netflix broug...,incred film never heard film netflix brought s...


In [35]:
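# Download WordNet before lemmatizing (no-op if it is already present)
nltk.download('wordnet', quiet=True)
lemmatizer = WordNetLemmatizer()
# The input is the stemmed column, so most tokens are already stems that WordNet does not
# recognize and the lemmatizer returns them unchanged (compare the two columns in the output).
df['lemmatized'] = df['stemmed'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))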
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...,I have seen a lot of movies in my time made in...,i have seen a lot of movies in my time made in...,seen lot movies time made lot different styles...,seen lot movi time made lot differ style diffe...,seen lot movi time made lot differ style diffe...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...,I bet youd never think the mashup the heavyhan...,i bet youd never think the mashup the heavyhan...,bet youd never think mashup heavyhanded rambo ...,bet youd never think mashup heavyhand rambo my...,bet youd never think mashup heavyhand rambo my...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...,I have to try and review this without comparin...,i have to try and review this without comparin...,try review without comparing anything directly...,tri review without compar anyth directli reall...,tri review without compar anyth directli reall...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...,When I pushed play I did not really believe th...,when i pushed play i did not really believe th...,pushed play really believe would ever watch wh...,push play realli believ would ever watch whole...,push play realli believ would ever watch whole...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...,This was an incredible film I never heard of t...,this was an incredible film i never heard of t...,incredible film never heard film netflix broug...,incred film never heard film netflix brought s...,incred film never heard film netflix brought s...


In [36]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [37]:
df['cleaned_text'] = df['lemmatized']
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized,cleaned_text
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...,I have seen a lot of movies in my time made in...,i have seen a lot of movies in my time made in...,seen lot movies time made lot different styles...,seen lot movi time made lot differ style diffe...,seen lot movi time made lot differ style diffe...,seen lot movi time made lot differ style diffe...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...,I bet youd never think the mashup the heavyhan...,i bet youd never think the mashup the heavyhan...,bet youd never think mashup heavyhanded rambo ...,bet youd never think mashup heavyhand rambo my...,bet youd never think mashup heavyhand rambo my...,bet youd never think mashup heavyhand rambo my...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...,I have to try and review this without comparin...,i have to try and review this without comparin...,try review without comparing anything directly...,tri review without compar anyth directli reall...,tri review without compar anyth directli reall...,tri review without compar anyth directli reall...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...,When I pushed play I did not really believe th...,when i pushed play i did not really believe th...,pushed play really believe would ever watch wh...,push play realli believ would ever watch whole...,push play realli believ would ever watch whole...,push play realli believ would ever watch whole...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...,This was an incredible film I never heard of t...,this was an incredible film i never heard of t...,incredible film never heard film netflix broug...,incred film never heard film netflix brought s...,incred film never heard film netflix brought s...,incred film never heard film netflix brought s...


In [38]:
df.to_csv('rrr_reviews_cleaned.csv', index=False)
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized,cleaned_text
0,Have Never Seen Anything Quite Like This,"I have seen a lot of movies in my time, made i...",10,I have seen a lot of movies in my time made in...,I have seen a lot of movies in my time made in...,i have seen a lot of movies in my time made in...,seen lot movies time made lot different styles...,seen lot movi time made lot differ style diffe...,seen lot movi time made lot differ style diffe...,seen lot movi time made lot differ style diffe...
1,Rambo Meshed with Crouching Tiger + Musical......,I bet you'd never think the mash-up the heavy-...,9,I bet youd never think the mashup the heavyhan...,I bet youd never think the mashup the heavyhan...,i bet youd never think the mashup the heavyhan...,bet youd never think mashup heavyhanded rambo ...,bet youd never think mashup heavyhand rambo my...,bet youd never think mashup heavyhand rambo my...,bet youd never think mashup heavyhand rambo my...
2,The best superhero movie in years,I have to try and review this without comparin...,9,I have to try and review this without comparin...,I have to try and review this without comparin...,i have to try and review this without comparin...,try review without comparing anything directly...,tri review without compar anyth directli reall...,tri review without compar anyth directli reall...,tri review without compar anyth directli reall...
3,Weirdly spectacular,"When I pushed play, I did not really believe t...",9,When I pushed play I did not really believe th...,When I pushed play I did not really believe th...,when i pushed play i did not really believe th...,pushed play really believe would ever watch wh...,push play realli believ would ever watch whole...,push play realli believ would ever watch whole...,push play realli believ would ever watch whole...
4,Wow........,This was an incredible film. I never heard of ...,10,This was an incredible film I never heard of t...,This was an incredible film I never heard of t...,this was an incredible film i never heard of t...,incredible film never heard film netflix broug...,incred film never heard film netflix brought s...,incred film never heard film netflix brought s...,incred film never heard film netflix brought s...


In [39]:
df.columns

Index(['review_title', 'review_text', 'rating', 'noise_removed',
       'numbers_removed', 'lowercased', 'stopwords_removed', 'stemmed',
       'lemmatized', 'cleaned_text'],
      dtype='object')

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
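
The tally in part (1) can be sketched without any tagger by starting from tokens that are already tagged. The sample pairs below are hand-tagged and hypothetical. The mapping relies on the Penn Treebank convention that noun, verb, adjective and adverb tags begin with `N`, `V`, `J` and `R` respectively.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse classes in part (1).
PREFIX_TO_CLASS = {'N': 'Noun', 'V': 'Verb', 'J': 'Adjective', 'R': 'Adverb'}


def count_pos_classes(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        coarse = PREFIX_TO_CLASS.get(tag[:1])  # tag[:1] is '' for an empty tag
        if coarse is not None:
            counts[coarse] += 1
    return counts


# Hand-tagged sample (hypothetical); in the notebook the tags come from a tagger.
sample_tags = [('director', 'NN'), ('made', 'VBD'), ('truly', 'RB'),
               ('spectacular', 'JJ'), ('films', 'NNS'), ('in', 'IN')]
pos_counts = count_pos_classes(sample_tags)
for name in ('Noun', 'Verb', 'Adjective', 'Adverb'):
    print(f"Total {name}s: {pos_counts[name]}")
```

One caveat of prefix matching: the particle tag `RP` also starts with `R`, so it is counted as an adverb unless it is excluded explicitly.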

In [41]:
!pip install nltk
!python -m nltk.downloader averaged_perceptron_tagger

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [42]:
import nltk
nltk.download('punkt')
nltk.download('words')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [44]:
import pandas as pd
import spacy
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tree import Tree
from collections import Counter

# Load the spaCy English model for dependency parsing and named entity recognition.
nlp = spacy.load("en_core_web_sm")

# Function to print a shallow chunk tree using NLTK.
def print_constituency_tree(text):
    """Prints POS-tagged tokens and named-entity chunks for each sentence using NLTK.

    Note: ne_chunk yields a shallow chunk tree, not a full constituency parse.
    """
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        chunked = ne_chunk(tagged)
        for subtree in chunked:
            if isinstance(subtree, Tree):
                # Named-entity chunk: print its label and the words it spans
                print(subtree.label(), " ".join(word for word, pos in subtree.leaves()))
            else:
                # Plain token: print the word and its POS tag
                print(subtree[0], subtree[1])

# Function to print dependency parsing tree using spaCy.
def print_dependency_tree(text):
    """Prints the dependency parsing tree of a text using spaCy."""
    doc = nlp(text)
    for token in doc:
        print(f"{token.text} --> {token.dep_} --> {token.head.text}, children: {[child for child in token.children]}")

# Function to extract named entities and count their occurrences.
def extract_named_entities(text):
    """Extracts named entities from the text and counts their occurrences."""
    doc = nlp(text)
    entity_counter = Counter()
    for ent in doc.ents:
        entity_counter[ent.label_] += 1
    return entity_counter

# Load the cleaned text from the CSV file, focusing on the first 40 rows.
df = pd.read_csv("rrr_reviews_cleaned.csv", nrows=40)

# Combine the cleaned text from these rows into a single string for analysis.
combined_text = " ".join(df['cleaned_text'].tolist())

# (1) POS Tagging
tokens = word_tokenize(combined_text)
pos_tags = pos_tag(tokens)
noun_count = len([word for word, pos in pos_tags if pos.startswith('N')])
verb_count = len([word for word, pos in pos_tags if pos.startswith('V')])
adj_count = len([word for word, pos in pos_tags if pos.startswith('J')])
adv_count = len([word for word, pos in pos_tags if pos.startswith('R')])

print(f"Total Nouns: {noun_count}")
print(f"Total Verbs: {verb_count}")
print(f"Total Adjectives: {adj_count}")
print(f"Total Adverbs: {adv_count}")

# (2) Constituency Parsing and Dependency Parsing
print("Constituency Parsing Trees:")
print_constituency_tree(combined_text)
print("\nDependency Parsing Tree:")
print_dependency_tree(combined_text)

# (3) Named Entity Recognition
entity_counter = extract_named_entities(combined_text)
print("Named Entities:")
for entity, count in entity_counter.items():
    print(f"{entity}: {count}")

Streaming output truncated to the last 5000 lines.
rrr VBP
stori NN
that WDT
complet VBZ
fiction NN
delhi NN
becom VBD
new JJ
canva NN
bheem NN
might MD
fought VB
nizam RB
much JJ
one CD
find VBP
imper JJ
warn NN
british NN
he PRP
taken VBN
lightli NN
also RB
find VBP
shelter JJ
muslim NN
delhi NN
ramaraju NN
might MD
seem VB
like IN
welltrain NN
soldier NN
follow JJ
instruct NN
blindli NN
also RB
seem VBP
past JJ
one CD
uncl NN
samuthirakani NN
know VBP
scott NN
ray NN
stevenson NN
might MD
believ VB
brown JJ
rubbish JJ
deserv NNS
even RB
bullet VBP
wast JJS
jennif NN
olivia NN
morri NN
seem VBP
empathet NN
freedom NN
movement NN
turn NN
cheek VBP
one CD
use NN
hand NN
weaponsth VBD
first JJ
half NN
rrr JJ
run NN
like IN
clockwork NN
there RB
emot VBZ
core NN
malli NN
there RB
song JJ
danc NN
naatu NN
naatu JJ
itll NNS
make VBP
smile JJ
friendship NN
explor NN
dosti NN
there RB
even RB
laugh IN
whenev NN
bheem VBP
tri JJ
befriend NN
jennif NN
cinemat NN
liberti NN
taken 

Explanation of the constituency and dependency parsing trees:

Constituency parsing and dependency parsing are two basic natural language processing (NLP) techniques for examining the structure of a sentence and the links between its words. Constituency parsing breaks a sentence into its component phrases and arranges them in a hierarchical tree, which shows how words group into phrases and clauses. Dependency parsing instead links each word directly to the word it depends on, labeling each link with a grammatical relation. In the given text sample, for example, the verbs "seen" and "made" combine with the nouns "lot," "movie," and "time" to form phrases such as "seen a lot," "made a lot," and "time around the world." These nouns and verbs are further modified by adjectives and adverbs like "different," "mainstream," and "even," which add descriptive detail to the text.
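
To make the contrast concrete, here is a small hand-written illustration for the sentence "The director made a film". Both structures are written by hand for illustration and are not the output of any parser. The constituency tree is stored as nested tuples, and the dependency parse as (dependent, relation, head) arcs. The relation names follow the label style spaCy's English models print, where the root token points to itself.

```python
# Constituency tree as nested tuples: (label, child, child, ...); leaves are strings.
constituency = (
    'S',
    ('NP', ('DT', 'The'), ('NN', 'director')),
    ('VP', ('VBD', 'made'), ('NP', ('DT', 'a'), ('NN', 'film'))),
)

# Dependency parse as (dependent, relation, head) arcs; the root points to itself.
dependencies = [
    ('The', 'det', 'director'),
    ('director', 'nsubj', 'made'),
    ('made', 'ROOT', 'made'),
    ('a', 'det', 'film'),
    ('film', 'dobj', 'made'),
]


def bracketed(node):
    """Render a nested-tuple tree in bracket notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return '(' + label + ' ' + ' '.join(bracketed(c) for c in children) + ')'


def leaves(node):
    """Return the words of a tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


print(bracketed(constituency))
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head}")
```

The bracketed form groups words into nested phrases (a noun phrase inside a verb phrase inside the sentence), while the arc list has no phrase nodes at all: every word attaches to exactly one head, and the main verb "made" is the root.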







# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

The assignment covered a range of natural language processing (NLP) techniques, which I found both challenging and enjoyable. Implementing POS tagging, constituency parsing, dependency parsing, and named entity recognition required a good understanding of NLP concepts and of tools like NLTK and spaCy. Managing the flow of data from web scraping to text cleaning and then to syntactic analysis demanded careful planning and attention to detail. The variety of tasks made the assignment intellectually stimulating, and I appreciated the opportunity to apply and deepen my knowledge of NLP. However, ensuring the accuracy and efficiency of each step, especially when handling large volumes of text data, added complexity. Overall, I found the time allotted adequate, although debugging and refining the code could have benefited from more time.

