<a href="https://colab.research.google.com/github/AnusreePutta/Info-5731/blob/main/Anusree_Putta_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **either of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [5]:
# Your code here
import requests
from bs4 import BeautifulSoup
import csv

def fetch_reviews(imdb_id):
    """Fetch the first page of user reviews for the given IMDb title ID."""
    reviews = []
    # Build the URL from the argument instead of hard-coding one title.
    url = f"https://www.imdb.com/title/{imdb_id}/reviews?ref_=tt_urv"
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(response.content, 'html.parser')

    # Assuming each review is within a 'div' tag with a specific class.
    # This is a placeholder and might need adjustment based on actual page structure.
    review_divs = soup.find_all('div', class_='review-container')
    for div in review_divs:
        title = div.find('a', class_='title').text.strip()
        text = div.find('div', class_='text show-more__control').text.strip()
        try:
            rating = div.find('span', class_='rating-other-user-rating').span.text.strip()
        except AttributeError:
            # Reviews without a star rating have no rating span.
            rating = 'No Rating'

        reviews.append({'review_title': title, 'review_text': text, 'rating': rating})

    return reviews

def save_reviews_to_csv(reviews, filename):
    """Write the list of review dicts to a CSV file with a header row."""
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['review_title', 'review_text', 'rating'])
        writer.writeheader()
        for review in reviews:
            writer.writerow(review)

# Example usage
imdb_id = 'tt1517268'  # IMDb ID used for the Barbie reviews
reviews = fetch_reviews(imdb_id)
save_reviews_to_csv(reviews, 'barbie_reviews.csv')
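
The cell above requests a single reviews page, so it returns only the reviews contained in that one response, while the task asks for about 1000. A minimal sketch of the accumulation loop follows. `fetch_page` is a hypothetical callable (it is not part of the notebook or of any IMDb API) that returns the review dicts for a given page index; how further pages are actually requested from IMDb is left as an assumption.

```python
def collect_reviews(fetch_page, target=1000, max_pages=200):
    """Accumulate unique reviews page by page until `target` is reached.

    fetch_page: hypothetical callable taking a page index and returning a
    list of dicts with 'review_title', 'review_text' and 'rating' keys.
    """
    collected = []
    seen = set()
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:          # an empty page means nothing is left to fetch
            break
        for review in batch:
            key = (review['review_title'], review['review_text'])
            if key in seen:    # skip reviews repeated across pages
                continue
            seen.add(key)
            collected.append(review)
            if len(collected) >= target:
                return collected
    return collected


# Offline stand-in for a real page fetcher, used only to exercise the loop.
def fake_fetch_page(page):
    if page >= 3:
        return []
    return [
        {'review_title': f'title {page}-{i}',
         'review_text': f'text {page}-{i}',
         'rating': '6'}
        for i in range(4)
    ]


reviews = collect_reviews(fake_fetch_page, target=10)
print(len(reviews))
```

Passing the fetcher in as an argument keeps the stopping and de-duplication logic testable without network access; the result can then be handed to `save_reviews_to_csv` unchanged.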

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.
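
The first four steps can be sketched with the standard library alone. The stopword set below is a tiny illustrative subset (an assumption, not NLTK's list), and `basic_clean` is a hypothetical helper name. Lowercasing is applied before stopword removal because stopword lists are normally lowercase. Stemming and lemmatization are left out of this sketch since they depend on NLTK.

```python
import re

# Tiny illustrative stopword set; a real run would use a full stopword list.
SAMPLE_STOPWORDS = {'the', 'a', 'an', 'and', 'but', 'is', 'was', 'it', 'so'}


def basic_clean(text, stop_words=SAMPLE_STOPWORDS):
    """Apply noise removal, number removal, lowercasing and stopword removal."""
    text = re.sub(r'[^\w\s]', '', text)   # (1) drop punctuation and special characters
    text = re.sub(r'\d+', '', text)       # (2) drop digits
    text = text.lower()                   # (4) lowercase before matching stopwords
    tokens = [word for word in text.split() if word not in stop_words]  # (3)
    return ' '.join(tokens)


# Made-up sample sentence, used only to show the effect of each step.
print(basic_clean("The movie was FUN, but the 2nd half is so preachy!"))
```

Note that removing digits from a token such as `2nd` leaves the fragment `nd` behind; whether that residue matters depends on the later analysis.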

In [6]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

df = pd.read_csv('barbie_reviews.csv')
df

Unnamed: 0,review_title,review_text,rating
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6
5,Well this really did come as a surprise.,"It pains me to say it, but I enjoyed this movi...",8
6,3 reasons FOR seeing it and 1 reason AGAINST.,The first reason to go see it:It's good fun. I...,7
7,It was depressing,I thought this would be so much different. The...,8
8,My mom and I saw this yesterday. Here are my t...,I don't know if I put spoilers in here. I am s...,6
9,The marketing was more entertaining than the a...,"I went to see this today, everyone in my group...",4


In [7]:
df['noise_removed'] = df['review_text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...


In [8]:
df['numbers_removed'] = df['noise_removed'].apply(lambda x: re.sub(r'\d+', '', x))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...,Margot does the best with what shes given but ...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...,Before making Barbie Greta Gerwig singlehande...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...,The quality the humor and the writing of the m...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...,As much as it pains me to give a movie called ...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...,As a woman that grew up with Barbie I was very...


In [9]:
df['lowercased'] = df['numbers_removed'].apply(lambda x: x.lower())
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...,Margot does the best with what shes given but ...,margot does the best with what shes given but ...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...,Before making Barbie Greta Gerwig singlehande...,before making barbie greta gerwig singlehande...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...,The quality the humor and the writing of the m...,the quality the humor and the writing of the m...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...,As much as it pains me to give a movie called ...,as much as it pains me to give a movie called ...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...,As a woman that grew up with Barbie I was very...,as a woman that grew up with barbie i was very...


In [10]:
stop_words = set(stopwords.words('english'))
df['stopwords_removed'] = df['lowercased'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...,Margot does the best with what shes given but ...,margot does the best with what shes given but ...,margot best shes given film disappointing mark...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...,Before making Barbie Greta Gerwig singlehande...,before making barbie greta gerwig singlehande...,making barbie greta gerwig singlehandedly dire...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...,The quality the humor and the writing of the m...,the quality the humor and the writing of the m...,quality humor writing movie fun quirky unique ...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...,As much as it pains me to give a movie called ...,as much as it pains me to give a movie called ...,much pains give movie called barbie brilliantl...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...,As a woman that grew up with Barbie I was very...,as a woman that grew up with barbie i was very...,woman grew barbie excited movie curious see wo...


In [11]:
ps = PorterStemmer()
df['stemmed'] = df['stopwords_removed'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...,Margot does the best with what shes given but ...,margot does the best with what shes given but ...,margot best shes given film disappointing mark...,margot best she given film disappoint market f...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...,Before making Barbie Greta Gerwig singlehande...,before making barbie greta gerwig singlehande...,making barbie greta gerwig singlehandedly dire...,make barbi greta gerwig singlehandedli direct ...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...,The quality the humor and the writing of the m...,the quality the humor and the writing of the m...,quality humor writing movie fun quirky unique ...,qualiti humor write movi fun quirki uniqu get ...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...,As much as it pains me to give a movie called ...,as much as it pains me to give a movie called ...,much pains give movie called barbie brilliantl...,much pain give movi call barbi brilliantli han...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...,As a woman that grew up with Barbie I was very...,as a woman that grew up with barbie i was very...,woman grew barbie excited movie curious see wo...,woman grew barbi excit movi curiou see would e...


In [12]:
lemmatizer = WordNetLemmatizer()
df['lemmatized'] = df['stemmed'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...,Margot does the best with what shes given but ...,margot does the best with what shes given but ...,margot best shes given film disappointing mark...,margot best she given film disappoint market f...,margot best she given film disappoint market f...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...,Before making Barbie Greta Gerwig singlehande...,before making barbie greta gerwig singlehande...,making barbie greta gerwig singlehandedly dire...,make barbi greta gerwig singlehandedli direct ...,make barbi greta gerwig singlehandedli direct ...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...,The quality the humor and the writing of the m...,the quality the humor and the writing of the m...,quality humor writing movie fun quirky unique ...,qualiti humor write movi fun quirki uniqu get ...,qualiti humor write movi fun quirki uniqu get ...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...,As much as it pains me to give a movie called ...,as much as it pains me to give a movie called ...,much pains give movie called barbie brilliantl...,much pain give movi call barbi brilliantli han...,much pain give movi call barbi brilliantli han...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...,As a woman that grew up with Barbie I was very...,as a woman that grew up with barbie i was very...,woman grew barbie excited movie curious see wo...,woman grew barbi excit movi curiou see would e...,woman grew barbi excit movi curiou see would e...


In [13]:
df['cleaned_review'] = df['lemmatized']
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized,cleaned_review
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...,Margot does the best with what shes given but ...,margot does the best with what shes given but ...,margot best shes given film disappointing mark...,margot best she given film disappoint market f...,margot best she given film disappoint market f...,margot best she given film disappoint market f...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...,Before making Barbie Greta Gerwig singlehande...,before making barbie greta gerwig singlehande...,making barbie greta gerwig singlehandedly dire...,make barbi greta gerwig singlehandedli direct ...,make barbi greta gerwig singlehandedli direct ...,make barbi greta gerwig singlehandedli direct ...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...,The quality the humor and the writing of the m...,the quality the humor and the writing of the m...,quality humor writing movie fun quirky unique ...,qualiti humor write movi fun quirki uniqu get ...,qualiti humor write movi fun quirki uniqu get ...,qualiti humor write movi fun quirki uniqu get ...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...,As much as it pains me to give a movie called ...,as much as it pains me to give a movie called ...,much pains give movie called barbie brilliantl...,much pain give movi call barbi brilliantli han...,much pain give movi call barbi brilliantli han...,much pain give movi call barbi brilliantli han...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...,As a woman that grew up with Barbie I was very...,as a woman that grew up with barbie i was very...,woman grew barbie excited movie curious see wo...,woman grew barbi excit movi curiou see would e...,woman grew barbi excit movi curiou see would e...,woman grew barbi excit movi curiou see would e...


In [14]:
df.to_csv('barbie_cleaned_reviews.csv', index=False)
df.head()

Unnamed: 0,review_title,review_text,rating,noise_removed,numbers_removed,lowercased,stopwords_removed,stemmed,lemmatized,cleaned_review
0,"Beautiful film, but so preachy","Margot does the best with what she's given, bu...",6,Margot does the best with what shes given but ...,Margot does the best with what shes given but ...,margot does the best with what shes given but ...,margot best shes given film disappointing mark...,margot best she given film disappoint market f...,margot best she given film disappoint market f...,margot best she given film disappoint market f...
1,A Hot Pink Mess,"Before making Barbie (2023), Greta Gerwig sing...",6,Before making Barbie 2023 Greta Gerwig singleh...,Before making Barbie Greta Gerwig singlehande...,before making barbie greta gerwig singlehande...,making barbie greta gerwig singlehandedly dire...,make barbi greta gerwig singlehandedli direct ...,make barbi greta gerwig singlehandedli direct ...,make barbi greta gerwig singlehandedli direct ...
2,Could Have Been Great. 2nd Half Brings It Down.,"The quality, the humor, and the writing of the...",6,The quality the humor and the writing of the m...,The quality the humor and the writing of the m...,the quality the humor and the writing of the m...,quality humor writing movie fun quirky unique ...,qualiti humor write movi fun quirki uniqu get ...,qualiti humor write movi fun quirki uniqu get ...,qualiti humor write movi fun quirki uniqu get ...
3,"As a guy I felt some discomfort, and that's ok.",As much as it pains me to give a movie called ...,10,As much as it pains me to give a movie called ...,As much as it pains me to give a movie called ...,as much as it pains me to give a movie called ...,much pains give movie called barbie brilliantl...,much pain give movi call barbi brilliantli han...,much pain give movi call barbi brilliantli han...,much pain give movi call barbi brilliantli han...
4,Too heavy handed,"As a woman that grew up with Barbie, I was ver...",6,As a woman that grew up with Barbie I was very...,As a woman that grew up with Barbie I was very...,as a woman that grew up with barbie i was very...,woman grew barbie excited movie curious see wo...,woman grew barbi excit movi curiou see would e...,woman grew barbi excit movi curiou see would e...,woman grew barbi excit movi curiou see would e...


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
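
Part (3) asks for a count per entity. As a sketch, given `(text, label)` pairs such as those read from spaCy's `ent.text` and `ent.label_`, both a per-label and a per-entity tally can be built with `Counter`. The helper name `tally_entities` and the sample pairs are hypothetical and invented for illustration only; they are not results from the review data.

```python
from collections import Counter


def tally_entities(entities):
    """Return (count per label, count per (text, label) pair)."""
    by_label = Counter(label for _, label in entities)
    by_entity = Counter(entities)
    return by_label, by_entity


# Hypothetical (text, label) pairs, invented for illustration only.
sample_entities = [
    ('greta gerwig', 'PERSON'),
    ('barbie', 'PRODUCT'),
    ('greta gerwig', 'PERSON'),
    ('two', 'CARDINAL'),
]
by_label, by_entity = tally_entities(sample_entities)
print(by_label['PERSON'])
print(by_entity[('greta gerwig', 'PERSON')])
```

Counting by label alone answers "how many persons were found", while counting by `(text, label)` answers "how often was each specific entity mentioned".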

In [15]:
# Your code here

!pip install nltk
!python -m nltk.downloader averaged_perceptron_tagger

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
import nltk
nltk.download('punkt')
nltk.download('words')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [17]:
import pandas as pd
import spacy
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tree import Tree
from collections import Counter

# Load the spaCy English model for dependency parsing and named entity recognition.
nlp = spacy.load("en_core_web_sm")

# Function to print a shallow chunk tree using NLTK.
# Note: ne_chunk groups named entities over POS-tagged tokens; it is not a full constituency parser.
def print_constituency_tree(text):
    """Prints POS-tagged tokens and named-entity chunks of a text using NLTK."""
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        chunked = ne_chunk(tagged)
        for subtree in chunked:
            if isinstance(subtree, Tree):
                # Named-entity chunk: print its label and the words it spans.
                print(subtree.label(), " ".join(word for word, pos in subtree.leaves()))
            else:
                # Plain token: print the word and its POS tag.
                print(subtree[0], subtree[1])

# Function to print dependency parsing tree using spaCy.
def print_dependency_tree(text):
    """Prints the dependency parsing tree of a text using spaCy."""
    doc = nlp(text)
    for token in doc:
        print(f"{token.text} --> {token.dep_} --> {token.head.text}, children: {[child for child in token.children]}")

# Function to extract named entities and count their occurrences.
def extract_named_entities(text):
    """Extracts named entities from the text and counts their occurrences."""
    doc = nlp(text)
    entity_counter = Counter()
    for ent in doc.ents:
        entity_counter[ent.label_] += 1
    return entity_counter

# Load the cleaned text from the CSV file, focusing on the first 2 rows.
df = pd.read_csv("barbie_cleaned_reviews.csv", nrows=2)

# Combine the cleaned text from these rows into a single string for analysis.
combined_text = " ".join(df['cleaned_review'].tolist())

# (1) POS Tagging
tokens = word_tokenize(combined_text)
pos_tags = pos_tag(tokens)
noun_count = len([word for word, pos in pos_tags if pos.startswith('N')])
verb_count = len([word for word, pos in pos_tags if pos.startswith('V')])
adj_count = len([word for word, pos in pos_tags if pos.startswith('J')])
adv_count = len([word for word, pos in pos_tags if pos.startswith('R')])

print(f"Total Nouns: {noun_count}")
print(f"Total Verbs: {verb_count}")
print(f"Total Adjectives: {adj_count}")
print(f"Total Adverbs: {adv_count}")

# (2) Constituency Parsing and Dependency Parsing
print("Constituency Parsing Trees:")
print_constituency_tree(combined_text)
print("\nDependency Parsing Tree:")
print_dependency_tree(combined_text)

# (3) Named Entity Recognition
entity_counter = extract_named_entities(combined_text)
print("Named Entities:")
for entity, count in entity_counter.items():
    print(f"{entity}: {count}")

Total Nouns: 229
Total Verbs: 46
Total Adjectives: 77
Total Adverbs: 14
Constituency Parsing Trees:
margot NN
best JJS
she PRP
given VBN
film NN
disappoint NN
market NN
fun NN
quirki VBP
satir NN
homag NN
movi JJ
start NN
way NN
end NN
overdramat NN
speech JJ
end NN
clearli NN
tri NNS
make VBP
audienc JJ
feel NN
someth NN
left VBD
everyon JJ
feel NN
confus NNS
say VBP
im JJ
crotcheti NN
old JJ
man NN
im VBZ
woman NN
im JJ
pretti JJ
sure JJ
im NN
movi NN
target NN
audienc JJ
saddest JJS
part NN
parent NN
kid NN
theater NN
victim NN
poor JJ
market NN
kid NN
movi VBZ
overal JJ
humor NN
fun NN
occas NN
film NN
beauti NN
look NN
whole JJ
concept NN
fall NN
apart RB
second JJ
half NN
film NN
becom NN
piti NN
parti NN
strong JJ
woman NN
make VBP
barbi NN
greta NN
gerwig NN
singlehandedli NN
direct JJ
two CD
film NN
ladi NNS
bird VBP
littl JJ
woman NN
girl NN
precipic NN
adolesc NN
nuanc NN
layer NN
portray NN
motherdaught VBD
relationship NN
combin NN
imagin VBP
visual JJ
clever NN
dialogu NN

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [18]:
# Write your response below
# I found it challenging to work with CSV files and to save data in them.
# In the first question, it was difficult to collect 1000 reviews for a movie at once,
# and in the second question, cleaning the data took many steps.
# I think one week is too little time for this assignment; the given tasks need more time.