
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:


import requests
from bs4 import BeautifulSoup
import csv

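def fetch_reviews(imdb_id):
    """Fetch the user reviews on the first reviews page of an IMDb title."""
    reviews = []
    url = f"https://www.imdb.com/title/{imdb_id}/reviews?ref_=tt_urv"
    # Send a browser-like User-Agent and a timeout; fail loudly on HTTP errors.
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Assuming each review is within a 'div' tag with a specific class.
    # This is a placeholder and might need adjustment based on actual page structure.
    # Note: only the first page of reviews is read; no pagination is performed.
    review_divs = soup.find_all('div', class_='review-container')
    for div in review_divs:
        title_tag = div.find('a', class_='title')
        text_tag = div.find('div', class_='text show-more__control')
        if title_tag is None or text_tag is None:
            continue  # skip containers that lack a title or review body
        title = title_tag.text.strip()
        text = text_tag.text.strip()
        try:
            rating = div.find('span', class_='rating-other-user-rating').span.text.strip()
        except AttributeError:
            rating = 'No Rating'

        reviews.append({'review_title': title, 'review_text': text, 'rating': rating})

    return reviews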

def save_reviews_to_csv(reviews, filename):
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['review_title', 'review_text', 'rating'])
        writer.writeheader()
        for review in reviews:
            writer.writerow(review)

# Example usage
imdb_id = 'tt5433140'  # Example IMDb ID
reviews = fetch_reviews(imdb_id)
save_reviews_to_csv(reviews, 'Fastx_reviews.csv')
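
Review text often contains commas, quotes, and line breaks, so it is worth checking that the CSV step preserves such fields. The following is a minimal sketch of that check using only the standard library; the sample rows are hypothetical (not scraped data), and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical sample rows (not real scraped data) with characters that need quoting.
sample_reviews = [
    {'review_title': 'Fast X', 'review_text': 'Loud, silly, fun.', 'rating': '5'},
    {'review_title': 'Two lines', 'review_text': 'First line\nsecond "quoted" line', 'rating': 'No Rating'},
]
fieldnames = ['review_title', 'review_text', 'rating']

# Write to an in-memory buffer instead of a file so the check has no side effects.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(sample_reviews)

# Read the same text back and confirm that every field survived unchanged.
buffer.seek(0)
round_tripped = [dict(row) for row in csv.DictReader(buffer)]
print(round_tripped == sample_reviews)
```

The same `newline=''` setting is used when opening the real file in `save_reviews_to_csv`, which is what lets the `csv` module handle embedded line breaks itself.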


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.
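
As a rough illustration of how steps (1) to (4) chain together, here is a minimal standard-library sketch. The tiny `STOPWORDS` set and the `basic_clean` helper are assumptions made for this example only; the notebook itself uses NLTK's English stopword list, and stemming and lemmatization are left out because they need NLTK.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
STOPWORDS = {"a", "an", "the", "of", "did", "what", "do", "to", "i"}

def basic_clean(text):
    """Apply noise removal, number removal, lowercasing, and stopword removal."""
    text = re.sub(r"[^\w\s]", "", text)   # (1) drop punctuation and special characters
    text = re.sub(r"\d+", "", text)       # (2) drop digits
    text = text.lower()                   # (4) lowercase before matching stopwords
    tokens = [t for t in text.split() if t not in STOPWORDS]  # (3) drop stopwords
    return " ".join(tokens)

print(basic_clean("Fast & Furious 9 did what a lot of franchises do!"))
# prints: fast furious lot franchises
```

Lowercasing before the stopword filter matters because stopword lists are lowercase, so a capitalised "The" would otherwise slip through.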

In [2]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [3]:
df = pd.read_csv('Fastx_reviews.csv')

In [4]:
df['noise_removed'] = df['review_text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...


In [5]:
df['numbers_removed'] = df['noise_removed'].apply(lambda x: re.sub(r'\d+', '', x))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...,I thought they couldnt possibly write somethin...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...,I can write the exact same review for the last...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...,Fast Furious did what a lot of franchises do...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...,By this point I went to see Fast X without a c...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...,The movie starts it story from the first ten m...


In [6]:
df['lowercased'] = df['numbers_removed'].apply(lambda x: x.lower())
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...,I thought they couldnt possibly write somethin...,i thought they couldnt possibly write somethin...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...,I can write the exact same review for the last...,i can write the exact same review for the last...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...,Fast Furious did what a lot of franchises do...,fast furious did what a lot of franchises do...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...,By this point I went to see Fast X without a c...,by this point i went to see fast x without a c...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...,The movie starts it story from the first ten m...,the movie starts it story from the first ten m...


In [7]:
# Download the NLTK stopword list if it is not already available
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
df['stopwords_removed'] = df['lowercased'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...,I thought they couldnt possibly write somethin...,i thought they couldnt possibly write somethin...,thought couldnt possibly write something even ...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...,I can write the exact same review for the last...,i can write the exact same review for the last...,write exact review last transformers last indi...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...,Fast Furious did what a lot of franchises do...,fast furious did what a lot of franchises do...,fast furious lot franchises point took action ...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...,By this point I went to see Fast X without a c...,by this point i went to see fast x without a c...,point went see fast x without clue happened la...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...,The movie starts it story from the first ten m...,the movie starts it story from the first ten m...,movie starts story first ten minutes fast five...


In [8]:
ps = PorterStemmer()
df['stemmed'] = df['stopwords_removed'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...,I thought they couldnt possibly write somethin...,i thought they couldnt possibly write somethin...,thought couldnt possibly write something even ...,thought couldnt possibl write someth even wors...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...,I can write the exact same review for the last...,i can write the exact same review for the last...,write exact review last transformers last indi...,write exact review last transform last indiana...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...,Fast Furious did what a lot of franchises do...,fast furious did what a lot of franchises do...,fast furious lot franchises point took action ...,fast furiou lot franchis point took action out...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...,By this point I went to see Fast X without a c...,by this point i went to see fast x without a c...,point went see fast x without clue happened la...,point went see fast x without clue happen last...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...,The movie starts it story from the first ten m...,the movie starts it story from the first ten m...,movie starts story first ten minutes fast five...,movi start stori first ten minut fast five sto...


In [9]:
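# Download the WordNet data the lemmatizer needs if it is not already available
nltk.download('wordnet', quiet=True)
# Note: the input is already stemmed, so the lemmatizer leaves most tokens unchanged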
lemmatizer = WordNetLemmatizer()
df['lemmatized'] = df['stemmed'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...,I thought they couldnt possibly write somethin...,i thought they couldnt possibly write somethin...,thought couldnt possibly write something even ...,thought couldnt possibl write someth even wors...,thought couldnt possibl write someth even wors...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...,I can write the exact same review for the last...,i can write the exact same review for the last...,write exact review last transformers last indi...,write exact review last transform last indiana...,write exact review last transform last indiana...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...,Fast Furious did what a lot of franchises do...,fast furious did what a lot of franchises do...,fast furious lot franchises point took action ...,fast furiou lot franchis point took action out...,fast furiou lot franchis point took action out...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...,By this point I went to see Fast X without a c...,by this point i went to see fast x without a c...,point went see fast x without clue happened la...,point went see fast x without clue happen last...,point went see fast x without clue happen last...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...,The movie starts it story from the first ten m...,the movie starts it story from the first ten m...,movie starts story first ten minutes fast five...,movi start stori first ten minut fast five sto...,movi start stori first ten minut fast five sto...


In [10]:
df['cleaned_text'] = df['lemmatized']
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized,cleaned_text
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...,I thought they couldnt possibly write somethin...,i thought they couldnt possibly write somethin...,thought couldnt possibly write something even ...,thought couldnt possibl write someth even wors...,thought couldnt possibl write someth even wors...,thought couldnt possibl write someth even wors...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...,I can write the exact same review for the last...,i can write the exact same review for the last...,write exact review last transformers last indi...,write exact review last transform last indiana...,write exact review last transform last indiana...,write exact review last transform last indiana...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...,Fast Furious did what a lot of franchises do...,fast furious did what a lot of franchises do...,fast furious lot franchises point took action ...,fast furiou lot franchis point took action out...,fast furiou lot franchis point took action out...,fast furiou lot franchis point took action out...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...,By this point I went to see Fast X without a c...,by this point i went to see fast x without a c...,point went see fast x without clue happened la...,point went see fast x without clue happen last...,point went see fast x without clue happen last...,point went see fast x without clue happen last...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...,The movie starts it story from the first ten m...,the movie starts it story from the first ten m...,movie starts story first ten minutes fast five...,movi start stori first ten minut fast five sto...,movi start stori first ten minut fast five sto...,movi start stori first ten minut fast five sto...


In [11]:
df.to_csv('Fastx_cleaned_reviews.csv', index=False)
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized,cleaned_text
0,Excruciatingly Awful,I thought they couldn't possibly write somethi...,1,I thought they couldnt possibly write somethin...,I thought they couldnt possibly write somethin...,i thought they couldnt possibly write somethin...,thought couldnt possibly write something even ...,thought couldnt possibl write someth even wors...,thought couldnt possibl write someth even wors...,thought couldnt possibl write someth even wors...
1,What Happened?,I can write the exact same review for the last...,1,I can write the exact same review for the last...,I can write the exact same review for the last...,i can write the exact same review for the last...,write exact review last transformers last indi...,write exact review last transform last indiana...,write exact review last transform last indiana...,write exact review last transform last indiana...
2,The worst one yet.,Fast & Furious 9 did what a lot of franchises ...,3,Fast Furious 9 did what a lot of franchises d...,Fast Furious did what a lot of franchises do...,fast furious did what a lot of franchises do...,fast furious lot franchises point took action ...,fast furiou lot franchis point took action out...,fast furiou lot franchis point took action out...,fast furiou lot franchis point took action out...
3,Fast X,"By this point, I went to see Fast X without a ...",5,By this point I went to see Fast X without a c...,By this point I went to see Fast X without a c...,by this point i went to see fast x without a c...,point went see fast x without clue happened la...,point went see fast x without clue happen last...,point went see fast x without clue happen last...,point went see fast x without clue happen last...
4,They need to stop,The movie starts it story from the first ten m...,4,The movie starts it story from the first ten m...,The movie starts it story from the first ten m...,the movie starts it story from the first ten m...,movie starts story first ten minutes fast five...,movi start stori first ten minut fast five sto...,movi start stori first ten minut fast five sto...,movi start stori first ten minut fast five sto...


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
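
The counting in part (1) amounts to grouping Penn Treebank tags by their two-letter prefix. Below is a minimal sketch of that grouping; the `(word, tag)` pairs are hand-copied from the tagger output shown in this notebook rather than produced by running a tagger here, and the `PREFIX_TO_CLASS` mapping is a name introduced for this example.

```python
from collections import Counter

# Hand-copied (word, tag) pairs from the tagger output shown in this notebook.
tagged = [
    ("move", "NN"), ("street", "NN"), ("get", "VB"), ("ridicul", "JJ"),
    ("would", "MD"), ("kept", "VBD"), ("even", "RB"), ("one", "CD"),
]

# Penn Treebank prefixes: NN* nouns, VB* verbs, JJ* adjectives, RB* adverbs.
PREFIX_TO_CLASS = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

# Count only the four word classes of interest; other tags (MD, CD, ...) are skipped.
counts = Counter(
    PREFIX_TO_CLASS[tag[:2]] for _, tag in tagged if tag[:2] in PREFIX_TO_CLASS
)
print(dict(counts))
```

Matching on two letters rather than one keeps the particle tag `RP` out of the adverb count.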

In [12]:
!pip install nltk
!python -m nltk.downloader averaged_perceptron_tagger

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [13]:
import nltk
nltk.download('punkt')
nltk.download('words')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [14]:
import pandas as pd
import spacy
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tree import Tree
from collections import Counter

# Load the spaCy English model for dependency parsing and named entity recognition.
nlp = spacy.load("en_core_web_sm")

# Function to print a shallow chunk tree using NLTK.
# Note: ne_chunk groups named-entity chunks over POS tags; it is not a full constituency parser.
def print_constituency_tree(text):
    """Prints NLTK's POS-tagged, named-entity-chunked tree as a shallow stand-in for a constituency parse."""
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        chunked = ne_chunk(tagged)
        for subtree in chunked:
            if isinstance(subtree, Tree):
                print(subtree.label(), " ".join(word for word, pos in subtree.leaves()))
            else:
                print(subtree[0], subtree[1])

# Function to print dependency parsing tree using spaCy.
def print_dependency_tree(text):
    """Prints the dependency parsing tree of a text using spaCy."""
    doc = nlp(text)
    for token in doc:
        print(f"{token.text} --> {token.dep_} --> {token.head.text}, children: {[child for child in token.children]}")

# Function to extract named entities and count their occurrences.
def extract_named_entities(text):
    """Extracts named entities from the text and counts their occurrences."""
    doc = nlp(text)
    entity_counter = Counter()
    for ent in doc.ents:
        entity_counter[ent.label_] += 1
    return entity_counter

# Load the cleaned text from the CSV file, focusing on the first 20 rows.
df = pd.read_csv("Fastx_cleaned_reviews.csv", nrows=20)

# Combine the cleaned text from these rows into a single string for analysis.
combined_text = " ".join(df['cleaned_text'].tolist())

# (1) POS Tagging
tokens = word_tokenize(combined_text)
pos_tags = pos_tag(tokens)
noun_count = len([word for word, pos in pos_tags if pos.startswith('N')])
verb_count = len([word for word, pos in pos_tags if pos.startswith('V')])
adj_count = len([word for word, pos in pos_tags if pos.startswith('J')])
adv_count = len([word for word, pos in pos_tags if pos.startswith('RB')])  # 'RB' excludes the particle tag 'RP'

print(f"Total Nouns: {noun_count}")
print(f"Total Verbs: {verb_count}")
print(f"Total Adjectives: {adj_count}")
print(f"Total Adverbs: {adv_count}")

# (2) Constituency Parsing and Dependency Parsing
print("Constituency Parsing Trees:")
print_constituency_tree(combined_text)
print("\nDependency Parsing Tree:")
print_dependency_tree(combined_text)

# (3) Named Entity Recognition
entity_counter = extract_named_entities(combined_text)
print("Named Entities:")
for entity, count in entity_counter.items():
    print(f"{entity}: {count}")


Streaming output truncated to the last 5000 lines.
move NN
street NN
race NN
world NN
jame NN
bond NN
territori JJ
action NN
get VB
ridicul JJ
chapter NN
see NN
movi NN
would MD
top VB
one CD
term NN
crazi NN
kept VBD
entertain JJ
fast JJ
x NN
dumb VBD
even RB
dumb JJ
funth JJ
stunt NN
preposter NN
ever RB
film NN
never RB
thrill VB
everi JJ
intermin JJ
action NN
scene NN
enhanc VBP
rather RB
ropey JJ
cgi NN
flame NN
particular JJ
look NN
realli NN
bad JJ
dread JJ
attempt NN
humour NN
mostli NN
courtesi NN
tyre NN
gibson NN
ludacri NN
get VBP
inevit JJ
crap NN
import NN
famili VBD
see VB
gang JJ
anoth DT
barbecu JJ
think NN
ill NN
hurland VBP
there EX
jason NN
momoa NN
villain NN
dantei NN
see VBP
go VB
kooki JJ
flamboy NN
psycho NN
capabl NN
almost RB
anyth JJ
momoa NN
absolut NN
aw NN
perform VB
border NN
camp NN
help NN
charact VB
dubiou JJ
choic NN
attir RB
hard JJ
find VBP
dant JJ
menac NN
he PRP
flounc VBZ
silki JJ
shirt NN
ill JJ
doubt NN
watch NN
next IN
two CD
fi

In natural language processing, there are two common ways to analyse the syntactic structure of a sentence: constituency parsing and dependency parsing. A constituency parse tree depicts a sentence's hierarchical structure by breaking it into sub-phrases, or "constituents", that belong to grammatical categories such as noun phrases (NP) or verb phrases (VP). The tree shows how words group into phrases, which in turn combine to form the sentence. A dependency parse tree, on the other hand, is concerned with relations between individual words: it identifies "head" words (verbs, nouns, and so on) and their "dependents" (words that modify or complement the head). It depicts the sentence as a graph of nodes (words) linked by labelled edges (dependencies), showing how each word relates to another. While constituency parsing shows the sentence's layered phrase structure, dependency parsing gives a direct map of word-to-word grammatical relations, which is useful for identifying the function of specific words.
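
To make the contrast concrete, the sketch below writes out both analyses by hand for one short sentence, "The dog chased the cat". These structures are hand-built for illustration and are not parser output; the tuple layout and the helper names `leaves` and `phrases` are assumptions made for this example.

```python
# Hand-built analyses of one sentence (illustrative, not parser output).
sentence = ["The", "dog", "chased", "the", "cat"]

# Constituency: nested (label, child, ...) tuples; leaves are plain strings.
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "dog")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))),
)

# Dependency: (dependent_index, relation, head_index); head -1 marks the root.
dependencies = [(0, "det", 1), (1, "nsubj", 2), (2, "ROOT", -1), (3, "det", 4), (4, "dobj", 2)]

def leaves(node):
    """Return the words covered by a constituency node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def phrases(node, label):
    """Return the text of every constituent carrying the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found

print(phrases(constituency, "NP"))  # noun phrases as multi-word units
for dep, rel, head in dependencies:
    head_word = sentence[head] if head >= 0 else "ROOT"
    print(f"{sentence[dep]} --{rel}--> {head_word}")
```

The constituency structure yields the noun phrases "The dog" and "the cat" as units, while the dependency list attaches each word directly to its head: "dog" is the subject of "chased" and "cat" is its object.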

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

The assignment appears to be comprehensive and multifaceted, targeting both practical skills in data collection via web scraping or APIs and theoretical understanding of natural language processing (NLP) concepts. The challenges likely arise from the initial phase of collecting a large dataset, which requires proficiency in handling web data extraction techniques and understanding API usage. Additionally, cleaning and structuring the data for analysis can be time-consuming. The syntactic analysis part demands a good grasp of NLP concepts like POS tagging, parsing, and named entity recognition, which can be complex but are fundamental to text analysis. Enjoyable aspects might include the satisfaction of successfully collecting and analyzing data, gaining insights from raw text, and applying theoretical knowledge to practical scenarios. The time provided for the assignment seems fair, although it heavily depends on prior experience with Python programming, NLP, and data manipulation techniques. Completing this assignment would offer a holistic view of the text data processing pipeline, from collection to syntactic analysis, which is valuable for anyone interested in data science or NLP.