<a href="https://colab.research.google.com/github/DasireddyMeghana/Meghana_INFO5731_Spring2024/blob/main/Assignments/Dasireddy_Meghana_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [25]:
import requests
from bs4 import BeautifulSoup
import csv

def get_movie_reviews(movie_url, num_reviews=1000):
    reviews = []
    review_count = 0

    while review_count < num_reviews:
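        # Fetch the webpage. NOTE: str.format() only substitutes a value when the
        # URL contains a "{}" placeholder; without one, every iteration requests
        # the same page, so the same reviews are collected repeatedly.
        response = requests.get(movie_url.format(review_count), timeout=30)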
        # Check if the request was successful
        if response.status_code == 200:
            # Parse the HTML content of the webpage
            soup = BeautifulSoup(response.content, 'html.parser')
            # Find all review elements on the webpage
            review_elements = soup.find_all('div', class_='text show-more__control')
            if not review_elements:
                break

            for review_element in review_elements:
                # Extract the text of the review and strip whitespace
                review_text = review_element.text.strip()
                reviews.append(review_text)
                review_count += 1
        else:
            # If the request fails, print an error message and stop collecting reviews
            print("Failed to fetch page")
            break

    # Return the list of reviews
    return reviews

def save_to_csv(reviews, filename='movie_reviews.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        # Write the header row
        writer.writerow(['Review'])
        # Write each review to a separate row in the CSV file
        for review in reviews:
            writer.writerow([review])

if __name__ == "__main__":
    all_reviews = []

    reviews = get_movie_reviews("https://www.imdb.com/title/tt0468569/reviews?ref_=tt_ql_3")

    # Extend the list of all reviews
    all_reviews.extend(reviews)

    # Save all collected reviews to a CSV file
    save_to_csv(all_reviews)
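
The URL passed to `get_movie_reviews` above has no `{}` placeholder, so `movie_url.format(review_count)` returns the same address on every iteration and the saved CSV may contain repeated rows. A minimal sketch of removing repeats before saving follows; the helper name `deduplicate` and the sample strings are hypothetical and not part of the original notebook.

```python
def deduplicate(reviews):
    """Return reviews with repeated entries removed, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys() keeps the first copy of each review
    return list(dict.fromkeys(reviews))

# Hand-written sample standing in for scraped review texts
sample = ["Great film.", "Too long.", "Great film."]
unique_reviews = deduplicate(sample)
print(len(sample), "->", len(unique_reviews))
```

Comparing `len(reviews)` with `len(deduplicate(reviews))` is a quick way to check whether the scraper actually advanced past the first page.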


In [26]:
import pandas as pd

df = pd.read_csv("movie_reviews.csv")
df

Unnamed: 0,Review
0,"Confidently directed, dark, brooding, and pack..."
1,I got to see The Dark Knight on Wednesday nigh...
2,"Dark, yes, complex, ambitious. Christopher Nol..."
3,I had the pleasure to watch this movie in an I...
4,This movie is a work of art. The finest sequel...
...,...
995,I thought Batman Begins was a very well concei...
996,I thought I was getting a clever comic book mo...
997,This is the best superhero movie ever made!!!!...
998,"There's a villain, full of colour, takes to th..."


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [31]:
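import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used in the cells below (skipped if already present)
nltk.download('stopwords')
nltk.download('wordnet')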

# Load the CSV file
df = pd.read_csv("movie_reviews.csv")

# Remove noise, such as special characters and punctuation
df['Cleaned_Review'] = df['Review'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df

Unnamed: 0,Review,Cleaned_Review
0,"Confidently directed, dark, brooding, and pack...",Confidently directed dark brooding and packed ...
1,I got to see The Dark Knight on Wednesday nigh...,I got to see The Dark Knight on Wednesday nigh...
2,"Dark, yes, complex, ambitious. Christopher Nol...",Dark yes complex ambitious Christopher Nolan a...
3,I had the pleasure to watch this movie in an I...,I had the pleasure to watch this movie in an I...
4,This movie is a work of art. The finest sequel...,This movie is a work of art The finest sequel ...
...,...,...
995,I thought Batman Begins was a very well concei...,I thought Batman Begins was a very well concei...
996,I thought I was getting a clever comic book mo...,I thought I was getting a clever comic book mo...
997,This is the best superhero movie ever made!!!!...,This is the best superhero movie ever made If ...
998,"There's a villain, full of colour, takes to th...",Theres a villain full of colour takes to the s...


In [32]:
# Remove numbers
df['Cleaned_Review'] = df['Cleaned_Review'].apply(lambda x: re.sub(r'\d+', '', x))
df

Unnamed: 0,Review,Cleaned_Review
0,"Confidently directed, dark, brooding, and pack...",Confidently directed dark brooding and packed ...
1,I got to see The Dark Knight on Wednesday nigh...,I got to see The Dark Knight on Wednesday nigh...
2,"Dark, yes, complex, ambitious. Christopher Nol...",Dark yes complex ambitious Christopher Nolan a...
3,I had the pleasure to watch this movie in an I...,I had the pleasure to watch this movie in an I...
4,This movie is a work of art. The finest sequel...,This movie is a work of art The finest sequel ...
...,...,...
995,I thought Batman Begins was a very well concei...,I thought Batman Begins was a very well concei...
996,I thought I was getting a clever comic book mo...,I thought I was getting a clever comic book mo...
997,This is the best superhero movie ever made!!!!...,This is the best superhero movie ever made If ...
998,"There's a villain, full of colour, takes to th...",Theres a villain full of colour takes to the s...


In [33]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
df['Cleaned_Review'] = df['Cleaned_Review'].apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop_words))
df

Unnamed: 0,Review,Cleaned_Review
0,"Confidently directed, dark, brooding, and pack...",Confidently directed dark brooding packed impr...
1,I got to see The Dark Knight on Wednesday nigh...,got see Dark Knight Wednesday night reason tho...
2,"Dark, yes, complex, ambitious. Christopher Nol...",Dark yes complex ambitious Christopher Nolan c...
3,I had the pleasure to watch this movie in an I...,pleasure watch movie IMAX theatre London adver...
4,This movie is a work of art. The finest sequel...,movie work art finest sequel ever made dont th...
...,...,...
995,I thought Batman Begins was a very well concei...,thought Batman Begins well conceived put toget...
996,I thought I was getting a clever comic book mo...,thought getting clever comic book movie Instea...
997,This is the best superhero movie ever made!!!!...,best superhero movie ever made like watch supe...
998,"There's a villain, full of colour, takes to th...",Theres villain full colour takes stage double ...
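
In the output above, tokens such as "dont" and "Theres" survive the stopword filter. A likely reason is the order of the steps: punctuation is stripped first, so a contraction like "don't" becomes "dont" and no longer matches its stopword entry. The sketch below compares the two orderings. It uses a tiny hypothetical stopword set rather than NLTK's list, and both function names are illustrative.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself uses NLTK's English stopword list.
STOP_WORDS = {"i", "the", "a", "is", "don't", "there's"}

def remove_punct_then_stopwords(text):
    """Order used in the notebook: strip punctuation, then filter stopwords."""
    stripped = re.sub(r"[^\w\s]", "", text)
    return " ".join(w for w in stripped.split() if w.lower() not in STOP_WORDS)

def remove_stopwords_then_punct(text):
    """Alternative order: filter stopwords while apostrophes are still present."""
    kept = [w for w in text.split() if w.lower().strip(".,!?") not in STOP_WORDS]
    return re.sub(r"[^\w\s]", "", " ".join(kept))

sample = "I don't think there's a better sequel."
print(remove_punct_then_stopwords(sample))
print(remove_stopwords_then_punct(sample))
```

With the second ordering the contractions are removed as stopwords instead of being left behind as apostrophe-less tokens.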


In [34]:
# Lowercase all texts
df['Cleaned_Review'] = df['Cleaned_Review'].apply(lambda x: x.lower())
df

Unnamed: 0,Review,Cleaned_Review
0,"Confidently directed, dark, brooding, and pack...",confidently directed dark brooding packed impr...
1,I got to see The Dark Knight on Wednesday nigh...,got see dark knight wednesday night reason tho...
2,"Dark, yes, complex, ambitious. Christopher Nol...",dark yes complex ambitious christopher nolan c...
3,I had the pleasure to watch this movie in an I...,pleasure watch movie imax theatre london adver...
4,This movie is a work of art. The finest sequel...,movie work art finest sequel ever made dont th...
...,...,...
995,I thought Batman Begins was a very well concei...,thought batman begins well conceived put toget...
996,I thought I was getting a clever comic book mo...,thought getting clever comic book movie instea...
997,This is the best superhero movie ever made!!!!...,best superhero movie ever made like watch supe...
998,"There's a villain, full of colour, takes to th...",theres villain full colour takes stage double ...


In [37]:
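# Stemming (disabled; the output below reflects lemmatization only)
# porter = PorterStemmer()
# df['Cleaned_Review'] = df['Cleaned_Review'].apply(lambda x: ' '.join(porter.stem(word) for word in x.split()))

# Lemmatization (lemmatize() treats each word as a noun unless a pos argument is passed)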
lemmatizer = WordNetLemmatizer()
df['Cleaned_Review'] = df['Cleaned_Review'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))

print(df)

df.to_csv("cleaned_movie_reviews.csv", index=False)

                                                Review  \
0    Confidently directed, dark, brooding, and pack...   
1    I got to see The Dark Knight on Wednesday nigh...   
2    Dark, yes, complex, ambitious. Christopher Nol...   
3    I had the pleasure to watch this movie in an I...   
4    This movie is a work of art. The finest sequel...   
..                                                 ...   
995  I thought Batman Begins was a very well concei...   
996  I thought I was getting a clever comic book mo...   
997  This is the best superhero movie ever made!!!!...   
998  There's a villain, full of colour, takes to th...   
999  Well here it is, one of the most anticipated m...   

                                        Cleaned_Review  
0    confidently directed dark brooding packed impr...  
1    got see dark knight wednesday night reason tho...  
2    dark yes complex ambitious christopher nolan c...  
3    pleasure watch movie imax theatre london adver...  
4    movie work ar
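
The stemming lines in the cell above are commented out, so the saved column contains lemmatized text only. One way to show both steps without one overwriting the other is to keep the stemmed text separately. The sketch below uses NLTK's `PorterStemmer`, which needs no corpus download; the function name `stem_text` and the column name `Stemmed_Review` are hypothetical.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stem_text(text):
    """Apply the Porter stemmer to every whitespace-separated token."""
    return " ".join(porter.stem(word) for word in text.split())

# Small hand-written sample standing in for one cleaned review
stemmed = stem_text("running reviews")
print(stemmed)
```

In the notebook this could be applied as `df['Stemmed_Review'] = df['Cleaned_Review'].apply(stem_text)` before the lemmatization step, assuming a separate column is acceptable for the assignment.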

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [28]:
# Parts of Speech (POS) tagging
import pandas as pd
import nltk
from collections import Counter

# Load the cleaned CSV file
df = pd.read_csv("cleaned_movie_reviews.csv")

# Function to calculate POS tags and count POS
def calculate_pos_tags(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_counts

# Calculate POS tags and count POS for each review
df['POS_Counts'] = df['Cleaned_Review'].apply(calculate_pos_tags)

# Aggregate POS counts across all reviews
total_pos_counts = Counter()
for pos_counts in df['POS_Counts']:
    total_pos_counts += pos_counts

# Print total number of Noun, Verb, Adjective, and Adverb
total_nouns = total_pos_counts['NN'] + total_pos_counts['NNS'] + total_pos_counts['NNP'] + total_pos_counts['NNPS']
total_verbs = total_pos_counts['VB'] + total_pos_counts['VBD'] + total_pos_counts['VBG'] + total_pos_counts['VBN'] + total_pos_counts['VBP'] + total_pos_counts['VBZ']
total_adjectives = total_pos_counts['JJ'] + total_pos_counts['JJR'] + total_pos_counts['JJS']
total_adverbs = total_pos_counts['RB'] + total_pos_counts['RBR'] + total_pos_counts['RBS']

print("Total number of Nouns (N):", total_nouns)
print("Total number of Verbs (V):", total_verbs)
print("Total number of Adjectives (Adj):", total_adjectives)
print("Total number of Adverbs (Adv):", total_adverbs)



Total number of Nouns (N): 71400
Total number of Verbs (V): 30120
Total number of Adjectives (Adj): 33440
Total number of Adverbs (Adv): 15640
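
The four totals above are built by summing individual Penn Treebank tags. Since every noun tag starts with `NN`, every verb tag with `VB`, every adjective tag with `JJ`, and every adverb tag with `RB`, the same totals can be computed by tag prefix. The following is a small sketch of that idea; the names `PREFIXES` and `coarse_totals` and the sample counts are made up for illustration.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four coarse word classes
PREFIXES = {"Noun": "NN", "Verb": "VB", "Adjective": "JJ", "Adverb": "RB"}

def coarse_totals(tag_counts):
    """Sum fine-grained tag counts into Noun/Verb/Adjective/Adverb totals."""
    return {
        name: sum(n for tag, n in tag_counts.items() if tag.startswith(prefix))
        for name, prefix in PREFIXES.items()
    }

# Hand-written counts standing in for total_pos_counts
sample_counts = Counter({"NN": 5, "NNS": 2, "VBD": 3, "JJ": 4, "RBR": 1, "DT": 6})
totals = coarse_totals(sample_counts)
print(totals)
```

Calling `coarse_totals(total_pos_counts)` in the notebook should give the same four numbers as the explicit sums, with less repetition.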


In [29]:
# Dependency parsing (spaCy's en_core_web_sm pipeline provides a dependency parser but no constituency parser)
import spacy
from spacy import displacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Get the first cleaned review from the DataFrame
text = df.iloc[0]['Cleaned_Review']

# Plot the dependency graph
doc = nlp(text)
displacy.render(doc, style='dep', jupyter=True)

# Print token information
for token in doc:
    print(token.text, "-->", token.dep_, "-->", token.pos_)


confidently --> advmod --> ADV
directed --> amod --> VERB
dark --> npadvmod --> NOUN
brooding --> amod --> VERB
packed --> amod --> VERB
impressive --> amod --> ADJ
action --> nmod --> NOUN
sequence --> compound --> NOUN
complex --> amod --> NOUN
story --> compound --> NOUN
dark --> compound --> ADJ
knight --> nsubj --> PROPN
includes --> ROOT --> VERB
careerdefining --> xcomp --> VERB
turn --> compound --> NOUN
heath --> compound --> NOUN
ledger --> dobj --> NOUN
well --> advmod --> INTJ
oscar --> conj --> VERB
worthy --> amod --> ADJ
performance --> compound --> NOUN
tdk --> nsubj --> NOUN
remains --> ccomp --> VERB
best --> amod --> ADJ
batman --> compound --> NOUN
movie --> compound --> NOUN
comic --> amod --> ADJ
book --> compound --> NOUN
movie --> attr --> NOUN
ever --> advmod --> ADV
created --> acl --> VERB
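
The cell above prints dependency relations only: each token is linked to a head word by a labelled arc, with `includes` as the ROOT and `knight` as its `nsubj`. A constituency tree instead groups words into nested phrases such as NP and VP. As a rough illustration, the sketch below parses one short sentence with a hand-written toy grammar using NLTK's `CFG` and `ChartParser`. The grammar and the sentence are assumptions made for this example; a real constituency parser for the reviews would need a trained model.

```python
import nltk

# Toy context-free grammar written for this one example sentence
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N
VP -> V NP
Det -> 'the'
Adj -> 'dark'
N -> 'knight' | 'villain'
V -> 'defeats'
""")

parser = nltk.ChartParser(grammar)
tokens = "the dark knight defeats the villain".split()

# Collect every parse tree the grammar licenses for the sentence
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

The printed tree has `S` at the top, which splits into a noun phrase ("the dark knight") and a verb phrase ("defeats the villain"). The dependency view of the same sentence would instead attach "knight" and "villain" directly to the verb "defeats".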


In [30]:
# Named Entity Recognition
import spacy
from collections import Counter

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to extract entities and count their occurrences
def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Get the clean text from the DataFrame
clean_texts = df['Cleaned_Review']

# Initialize counters for different entity types
person_counter = Counter()
org_counter = Counter()
loc_counter = Counter()
product_counter = Counter()
date_counter = Counter()

for text in clean_texts:
    entities = extract_entities(text)
    for entity, label in entities:
        if label == 'PERSON':
            person_counter[entity] += 1
        elif label == 'ORG':
            org_counter[entity] += 1
        elif label in ('LOC', 'GPE'):  # spaCy tags cities and countries as GPE, other locations as LOC
            loc_counter[entity] += 1
        elif label == 'PRODUCT':
            product_counter[entity] += 1
        elif label == 'DATE':
            date_counter[entity] += 1

# Print counts of each entity type
print("Count of Person Names:")
print(person_counter.most_common())

print("\nCount of Organizations:")
print(org_counter.most_common())

print("\nCount of Locations:")
print(loc_counter.most_common())

print("\nCount of Product Names:")
print(product_counter.most_common())

print("\nCount of Dates:")
print(date_counter.most_common())


Count of Person Names:
[('christopher nolan', 280), ('aaron eckhart', 280), ('bruce wayne', 280), ('gotham city', 200), ('harvey dent', 200), ('gotham', 200), ('joker', 200), ('dark knight', 160), ('michael caine', 120), ('jack nicholson', 80), ('freeman lucius fox', 80), ('james gordon', 80), ('tommy lee jones', 80), ('jack nicolson', 40), ('jack', 40), ('joker definitely scary', 40), ('gotham crazy', 40), ('knight shot', 40), ('gordon harvey rachel', 40), ('harvey denttwo', 40), ('jonathan nolan deserve', 40), ('gore', 40), ('christopher', 40), ('michael cane', 40), ('jonathan nolan', 40), ('tim burton', 40), ('alias nick beal show', 40), ('michael', 40), ('freeman', 40), ('gordon batman', 40), ('scum gotham city', 40), ('gary oldmans', 40), ('james newton', 40), ('howard helm', 40), ('jonathon nolan darkest', 40), ('christopher nolan good', 40), ('jim gordon district', 40), ('frank miller', 40), ('street gotham', 40), ('javier bardem role joker', 40), ('michael caine alfred', 40), (
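
The five separate counters and the `if`/`elif` chain above can be folded into one mapping from entity label to counter. The sketch below works on plain `(text, label)` pairs, so it runs without a spaCy model. The names `LABEL_GROUPS` and `count_entities` are hypothetical, and grouping `GPE` with `LOC` is an assumption about what should count as a location.

```python
from collections import Counter, defaultdict

# Labels of interest. GPE is grouped with LOC on the assumption that
# cities and countries should count as locations.
LABEL_GROUPS = {
    "PERSON": "person", "ORG": "organization", "LOC": "location",
    "GPE": "location", "PRODUCT": "product", "DATE": "date",
}

def count_entities(entities):
    """Tally (text, label) pairs into one Counter per entity group."""
    counts = defaultdict(Counter)
    for text, label in entities:
        group = LABEL_GROUPS.get(label)
        if group is not None:  # ignore labels outside the groups above
            counts[group][text] += 1
    return counts

# Hand-written pairs standing in for the output of extract_entities()
sample = [("christopher nolan", "PERSON"), ("gotham", "GPE"),
          ("christopher nolan", "PERSON"), ("imax", "ORG"), ("two", "CARDINAL")]
entity_counts = count_entities(sample)
for group, counter in entity_counts.items():
    print(group, counter.most_common())
```

Because the reviews were lowercased and stripped of punctuation before this step, the model has fewer cues to work with, which may explain entries such as "gotham city" appearing under person names in the output above.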

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [None]:
# Write your response below
# The most time-consuming parts were scraping a dynamic webpage and cleaning the data. Working on these assignments
# helps improve my knowledge, which I enjoy. We had enough time to complete the task.