<a href="https://colab.research.google.com/github/AiswaryaGoriparthi/Aiswarya_INFO5731_Fall2024/blob/main/Goriparthi_Aiswarya_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [13]:
!pip install requests beautifulsoup4 pandas




In [21]:
#your code here

import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_reviews(movie_id):
    reviews = []
    page = 1

    # Set headers once to mimic a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    }

    while True:
        # Construct the URL for the reviews page
        url = f"https://www.imdb.com/title/{movie_id}/reviews?ref_=tt_ql_3&page={page}"
        print(f"Fetching page: {url}")

        # A timeout keeps a stalled connection from hanging the loop
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 200:
            print(f"Failed to retrieve page {page}. HTTP Status Code: {response.status_code}")
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='review-container')

        if not review_elements:
            print(f"No more reviews found for {movie_id} on page {page}.")
            break

        # Extract reviews from the page, skipping entries that lack a title or body
        for review in review_elements:
            title_tag = review.find('a', class_='title')
            content_tag = review.find('div', class_='text')
            if title_tag is None or content_tag is None:
                continue
            reviews.append({'Title': title_tag.text.strip(), 'Content': content_tag.text.strip()})

        print(f"Fetched {len(review_elements)} reviews from page {page}. Total reviews collected: {len(reviews)}")
        page += 1

    return reviews

# List of movie IDs to scrape
movie_ids = ["tt0439572", "tt8178634","tt0111161","tt9389998",
             "tt0068646", "tt0468569", "tt5074352", "tt15330776",
             "tt2561572", "tt1189073","tt0111161","tt0068646", "tt0108052", "tt0468569","tt0110912",
             "tt0109830", "tt1375666","tt0137523","tt0133093","tt0167260", "tt0111161", "tt0068646",
             "tt6751668", "tt0050083", "tt0108052", "tt0167260", "tt0110912","tt0060196", "tt0120737", "tt0137523", "tt0109830", "tt0080684", "tt0167261",
             "tt0133093","tt0099685", "tt0073486", "tt8503618", "tt0047478", "tt0114369", "tt0317248", "tt0816692", "tt0245429",
             "tt0114814", "tt0120815", "tt0120689", "tt0110413", "tt0110357","tt0172495","tt0102926","tt0038650", "tt0253474","tt0120586"]  # Example movie IDs
all_reviews = []

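# The list above repeats several IDs; dict.fromkeys drops the repeats while keeping order,
# so no movie is scraped twice
for movie_id in dict.fromkeys(movie_ids):
    all_reviews.extend(fetch_reviews(movie_id))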

# Save to CSV
df = pd.DataFrame(all_reviews)
df.to_csv('imdb_reviews.csv', index=False)
print(f"Saved {len(all_reviews)} reviews to imdb_reviews.csv")


Fetching page: https://www.imdb.com/title/tt0439572/reviews?ref_=tt_ql_3&page=1
Fetched 25 reviews from page 1. Total reviews collected: 25
Fetching page: https://www.imdb.com/title/tt0439572/reviews?ref_=tt_ql_3&page=2
No more reviews found for tt0439572 on page 2.
Fetching page: https://www.imdb.com/title/tt8178634/reviews?ref_=tt_ql_3&page=1
Fetched 25 reviews from page 1. Total reviews collected: 25
Fetching page: https://www.imdb.com/title/tt8178634/reviews?ref_=tt_ql_3&page=2
Fetched 25 reviews from page 2. Total reviews collected: 50
Fetching page: https://www.imdb.com/title/tt8178634/reviews?ref_=tt_ql_3&page=3
Fetched 25 reviews from page 3. Total reviews collected: 75
Fetching page: https://www.imdb.com/title/tt8178634/reviews?ref_=tt_ql_3&page=4
Fetched 25 reviews from page 4. Total reviews collected: 100
Fetching page: https://www.imdb.com/title/tt8178634/reviews?ref_=tt_ql_3&page=5
Fetched 25 reviews from page 5. Total reviews collected: 125
Fetching page: https://www.imdb
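
The `movie_ids` list above names some titles more than once, so it is worth guarding against repeated rows before saving. A minimal sketch of that check, assuming each review is a dict with the `Title` and `Content` keys used by `fetch_reviews`; the helper name `dedupe_reviews` and the sample rows are hypothetical and only illustrate the idea:

```python
def dedupe_reviews(reviews):
    """Return reviews with exact (Title, Content) repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for review in reviews:
        key = (review["Title"], review["Content"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


# Hypothetical sample rows; the Content values are placeholders, not scraped text
sample = [
    {"Title": "Keaton Steals The Show", "Content": "first review text"},
    {"Title": "Fun and Emotional", "Content": "second review text"},
    {"Title": "Keaton Steals The Show", "Content": "first review text"},  # repeat
]
unique_reviews = dedupe_reviews(sample)
print(len(sample), "->", len(unique_reviews))
```

On a DataFrame, `df.drop_duplicates(subset=['Title', 'Content'])` should give the same result without a helper.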

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [22]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Load the dataset
df = pd.read_csv('imdb_reviews.csv')

# (1) Remove special characters and punctuations
def remove_special_characters(text):
    return re.sub(r'[^A-Za-z\s]', '', text)

# (2) Remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# (3) Remove stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    words = text.split()
    return ' '.join([word for word in words if word.lower() not in stop_words])

# (4) Convert to lowercase
def convert_to_lowercase(text):
    return text.lower()

# (5) Stemming
ps = PorterStemmer()
def apply_stemming(text):
    words = text.split()
    return ' '.join([ps.stem(word) for word in words])

# (6) Lemmatization (WordNetLemmatizer treats each word as a noun unless a POS tag is passed)
lemmatizer = WordNetLemmatizer()
def apply_lemmatization(text):
    words = text.split()
    return ' '.join([lemmatizer.lemmatize(word) for word in words])

# Apply the cleaning steps to each review
def clean_text(text):
    text = remove_special_characters(text)  # Step 1
    text = remove_numbers(text)             # Step 2
    text = remove_stopwords(text)           # Step 3
    text = convert_to_lowercase(text)       # Step 4
    stemmed_text = apply_stemming(text)     # Step 5 (stemming)
    lemmatized_text = apply_lemmatization(text)  # Step 6 (lemmatization)
    return stemmed_text, lemmatized_text

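# Create new columns for cleaned text (missing reviews are treated as empty strings,
# since re.sub raises a TypeError on NaN)
df['Stemmed_Content'], df['Lemmatized_Content'] = zip(*df['Content'].fillna('').astype(str).map(clean_text))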

# Save the cleaned data back to a new CSV file
df.to_csv('imdb_reviews_cleaned.csv', index=False)
print("Cleaned data saved to 'imdb_reviews_cleaned.csv'.")



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Cleaned data saved to 'imdb_reviews_cleaned.csv'.
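
One side effect of step (1) is easier to see on a small input: deleting punctuation outright fuses words that were separated only by punctuation. A minimal sketch comparing that behaviour with replacing punctuation by a space; both function names are hypothetical, and the second variant is an alternative for comparison, not what the cell above ran:

```python
import re


def strip_punct_delete(text):
    # Same pattern as remove_special_characters above: non-letters are deleted
    return re.sub(r'[^A-Za-z\s]', '', text)


def strip_punct_space(text):
    # Alternative: replace non-letters with a space, then collapse whitespace runs
    spaced = re.sub(r'[^A-Za-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', spaced).strip()


sample = "As a fan of the Flashpoint:Paradox film"
print(strip_punct_delete(sample))  # As a fan of the FlashpointParadox film
print(strip_punct_space(sample))   # As a fan of the Flashpoint Paradox film
```

The space variant has its own cost: a contraction such as `didn't` becomes `didn t`, so which option is better depends on what the later steps expect.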


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [23]:
# Your code here

# Import necessary libraries
import pandas as pd
import re
import spacy
import nltk
from nltk.corpus import stopwords
from collections import Counter
from spacy import displacy
nltk.download('stopwords')

# Load the spaCy model for English
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

# Text cleaning function
def clean_text(text):
    # (1) Remove special characters and punctuations
    text = re.sub(r'[^\w\s]', '', text)

    # (2) Remove numbers
    text = re.sub(r'\d+', '', text)

    # (3) Remove stopwords
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])

    # (4) Lowercase the text
    text = text.lower()

    return text

# Stemming (using nltk's Porter Stemmer)
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def stem_text(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

# Lemmatization (using spaCy's built-in lemmatizer)
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

# Load the collected reviews
df = pd.read_csv('imdb_reviews.csv')

# Clean the 'Content' column and save in a new column
df['Cleaned_Content'] = df['Content'].apply(clean_text)

# Apply stemming and save in a new column
df['Stemmed_Content'] = df['Cleaned_Content'].apply(stem_text)

# Apply lemmatization and save in a new column
df['Lemmatized_Content'] = df['Cleaned_Content'].apply(lemmatize_text)

# Save the cleaned, stemmed, and lemmatized data to a new CSV
df.to_csv('imdb_reviews_cleaned.csv', index=False)

# (1) POS Tagging: Counting Nouns, Verbs, Adjectives, and Adverbs
def pos_tagging(text):
    doc = nlp(text)
    pos_counts = Counter([token.pos_ for token in doc])
    return pos_counts['NOUN'], pos_counts['VERB'], pos_counts['ADJ'], pos_counts['ADV']

df['Nouns'], df['Verbs'], df['Adjectives'], df['Adverbs'] = zip(*df['Lemmatized_Content'].apply(pos_tagging))

# Save the dataframe with POS tags
df.to_csv('imdb_reviews_with_pos.csv', index=False)

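# (2) Constituency Parsing and Dependency Parsing (Example sentence parsing)
def parse_and_visualize(text):
    doc = nlp(text)
    for sent in doc.sents:
        print(f"Sentence: {sent.text}")
        print("Dependency Parsing:")
        # jupyter=True draws the tree in the notebook; with jupyter=False the markup
        # is only returned, so nothing is shown unless the return value is used
        displacy.render(sent, style="dep", jupyter=True)  # Dependency parsing
        print("Constituency Parsing: ")
        # NOTE: style="dep" draws a dependency tree again, not a constituency tree.
        # The en_core_web_sm pipeline has no constituency parser, so a true
        # constituency parse needs a separate parser that this notebook does not load.
        displacy.render(sent, style="dep", jupyter=True)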

# Example: Parse and visualize the first review
first_review = df['Lemmatized_Content'].iloc[0]
parse_and_visualize(first_review)

# (3) Named Entity Recognition (NER): Extract and count entities
def perform_ner(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

df['Entities'] = df['Lemmatized_Content'].apply(perform_ner)

# Count entity types
entity_counter = Counter([entity[1] for sublist in df['Entities'] for entity in sublist])
print("Entity Counts:", entity_counter)

# Save the dataframe with NER data
df.to_csv('imdb_reviews_with_ner.csv', index=False)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Sentence: get flashok live positive buzz early screening nowhere near one good comic book movie time even conversation one good dc movie lot fun ita lot may depend feeling ezra miller younger barry allen really annoying think that s point issue rolethe stand michael keaton batman thing danny elfman iconic score boom joy make mistake batman flash movie feature keaton batmansupergirl good enjoy get stand alone
Dependency Parsing:
Constituency Parsing: 
Sentence: i d watch surprise I m amazed leakedit feel run time time cgi dodgy especially certain moment consider go can not believe release like cgi moment big complaintthey get away lot rating include nudity fair amount bad language include f bombim sure ill watch good time watch one time
Dependency Parsing:
Constituency Parsing: 
Entity Counts: Counter({'PERSON': 2124, 'CARDINAL': 1819, 'NORP': 720, 'ORDINAL': 712, 'DATE': 602, 'ORG': 472, 'GPE': 413, 'TIME': 290, 'EVENT': 61, 'LOC': 36, 'PRODUCT': 17, 'LANGUAGE': 14, 'FAC': 10, 'LAW': 8
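
The cell above relies on spaCy's dependency parser for both printouts. To make the contrast between the two tree types concrete, here is a minimal sketch of a constituency parse for one short sentence, built with a hand-written toy grammar and a small CKY parser in plain Python, next to hand-annotated dependency arcs for the same sentence. The grammar, lexicon, arcs, and function name are all illustrative assumptions, not output of spaCy or of this notebook:

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, covers only this example)
BINARY_RULES = {            # (left child, right child) -> parent
    ("NP", "VP"): "S",
    ("DT", "NN"): "NP",
    ("VBZ", "NP"): "VP",
}
LEXICON = {"keaton": "NP", "steals": "VBZ", "the": "DT", "show": "NN"}


def cky_parse(tokens):
    """Return a bracketed constituency tree for tokens, or None if no S spans them."""
    n = len(tokens)
    chart = defaultdict(dict)  # (start, end) -> {label: bracketed subtree}
    for i, tok in enumerate(tokens):
        tag = LEXICON[tok]
        chart[i, i + 1][tag] = f"({tag} {tok})"
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for l_label, l_tree in chart[start, mid].items():
                    for r_label, r_tree in chart[mid, end].items():
                        parent = BINARY_RULES.get((l_label, r_label))
                        if parent:
                            chart[start, end][parent] = f"({parent} {l_tree} {r_tree})"
    return chart[0, n].get("S")


tokens = "keaton steals the show".split()
tree = cky_parse(tokens)
print(tree)

# Hand-annotated dependency arcs for the same sentence: (head, relation, dependent)
dependency_arcs = [
    ("steals", "nsubj", "keaton"),
    ("steals", "dobj", "show"),
    ("show", "det", "the"),
]
for head, relation, dependent in dependency_arcs:
    print(f"{dependent} <-{relation}- {head}")
```

The constituency tree groups words into nested phrases (`the show` forms an NP, which joins `steals` to form a VP), while the dependency arcs link each word directly to its head word with a labeled relation and introduce no phrase nodes.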

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [24]:

import pandas as pd

# Load the CSV file
df = pd.read_csv('/content/imdb_reviews_cleaned.csv')

# Display the entire CSV content
print("Contents of imdb_reviews_cleaned:")
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

# Display the DataFrame
display(df)  # Use display() for better formatting in Colab





Contents of imdb_reviews_cleaned:


Unnamed: 0,Title,Content,Cleaned_Content,Stemmed_Content,Lemmatized_Content
0,Keaton Steals The Show,I just got out of The FlashOK. So it does not ...,got flashok live positive buzz early screening...,got flashok live posit buzz earli screen nowhe...,get flashok live positive buzz early screening...
1,Michael Keaton is Batman,Ok so for me the Grant Gustin will always be m...,ok grant gustin always live action flash ezra ...,ok grant gustin alway live action flash ezra m...,ok grant gustin always live action flash ezra ...
2,"Good film, let down by poor CGI","As a fan of the Flashpoint:Paradox film, i was...",fan flashpointparadox film looking forward sto...,fan flashpointparadox film look forward stori ...,fan flashpointparadox film look forward story ...
3,Come on Barbie. Let's go party?,Finally after the early screening on 6th June ...,finally early screening th june showed unfinis...,final earli screen th june show unfinish print...,finally early screen th june show unfinished p...
4,Terrible CGI was distracting,I really didn't know what to make of this.. I ...,really didnt know make surprised tone much les...,realli didnt know make surpris tone much less ...,really do not know make surprised tone much le...
5,"Come for the fan-service, stay for the surpris...",I always start any review of a superhero movie...,always start review superhero movie making cle...,alway start review superhero movi make clear t...,always start review superhero movie make clear...
6,I actually enjoyed this,Much better than I thought and reviews led me ...,much better thought reviews led believe superw...,much better thought review led believ superwom...,much well think review lead believe superwoman...
7,Fun and Emotional,"When Justice League hit theatres in 2017, I'll...",justice league hit theatres ill admit ezra mil...,justic leagu hit theatr ill admit ezra miller ...,justice league hit theatre ill admit ezra mill...
8,"Overhyped, but a good film with horrible CGI.","So after being in development for years, multi...",development years multiple rewrites director c...,develop year multipl rewrit director chang dra...,development year multiple rewrite director cha...
9,Best DC film by a mile,"Best DC film by a mile..\nFunny, endearing, fi...",best dc film mile funny endearing finally dc f...,best dc film mile funni endear final dc film p...,good dc film mile funny endear finally dc film...


In [25]:

import pandas as pd

# Load the CSV file
df1 = pd.read_csv('/content/imdb_reviews_with_pos.csv')

# Display the contents with a title
print("Contents of cleaned imdb with parts of speech tagging:")

# Set display options to show all rows and columns
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

# Display the entire DataFrame
display(df1)  # Use display() for better formatting in Colab




Contents of cleaned imdb with parts of speech tagging:


Unnamed: 0,Title,Content,Cleaned_Content,Stemmed_Content,Lemmatized_Content,Nouns,Verbs,Adjectives,Adverbs
0,Keaton Steals The Show,I just got out of The FlashOK. So it does not ...,got flashok live positive buzz early screening...,got flashok live posit buzz earli screen nowhe...,get flashok live positive buzz early screening...,36,21,17,7
1,Michael Keaton is Batman,Ok so for me the Grant Gustin will always be m...,ok grant gustin always live action flash ezra ...,ok grant gustin alway live action flash ezra m...,ok grant gustin always live action flash ezra ...,25,12,6,6
2,"Good film, let down by poor CGI","As a fan of the Flashpoint:Paradox film, i was...",fan flashpointparadox film looking forward sto...,fan flashpointparadox film look forward stori ...,fan flashpointparadox film look forward story ...,18,6,16,5
3,Come on Barbie. Let's go party?,Finally after the early screening on 6th June ...,finally early screening th june showed unfinis...,final earli screen th june show unfinish print...,finally early screen th june show unfinished p...,56,27,22,11
4,Terrible CGI was distracting,I really didn't know what to make of this.. I ...,really didnt know make surprised tone much les...,realli didnt know make surpris tone much less ...,really do not know make surprised tone much le...,33,15,17,16
5,"Come for the fan-service, stay for the surpris...",I always start any review of a superhero movie...,always start review superhero movie making cle...,alway start review superhero movi make clear t...,always start review superhero movie make clear...,51,29,19,16
6,I actually enjoyed this,Much better than I thought and reviews led me ...,much better thought reviews led believe superw...,much better thought review led believ superwom...,much well think review lead believe superwoman...,47,28,29,23
7,Fun and Emotional,"When Justice League hit theatres in 2017, I'll...",justice league hit theatres ill admit ezra mil...,justic leagu hit theatr ill admit ezra miller ...,justice league hit theatre ill admit ezra mill...,110,61,34,21
8,"Overhyped, but a good film with horrible CGI.","So after being in development for years, multi...",development years multiple rewrites director c...,develop year multipl rewrit director chang dra...,development year multiple rewrite director cha...,124,57,45,23
9,Best DC film by a mile,"Best DC film by a mile..\nFunny, endearing, fi...",best dc film mile funny endearing finally dc f...,best dc film mile funni endear final dc film p...,good dc film mile funny endear finally dc film...,23,5,14,12
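
The task asks for the total number of nouns, verbs, adjectives, and adverbs, while the table above holds per-review counts. A minimal sketch of summing them, using only the ten rows displayed above as literal data (the variable names are hypothetical, and the totals cover those ten rows, not the full file):

```python
# (Nouns, Verbs, Adjectives, Adverbs) for the ten reviews displayed above
rows = [
    (36, 21, 17, 7), (25, 12, 6, 6), (18, 6, 16, 5), (56, 27, 22, 11),
    (33, 15, 17, 16), (51, 29, 19, 16), (47, 28, 29, 23), (110, 61, 34, 21),
    (124, 57, 45, 23), (23, 5, 14, 12),
]
labels = ("Nouns", "Verbs", "Adjectives", "Adverbs")

# zip(*rows) turns the list of rows into one tuple per column
totals = dict(zip(labels, (sum(column) for column in zip(*rows))))
print(totals)
# {'Nouns': 523, 'Verbs': 261, 'Adjectives': 219, 'Adverbs': 140}
```

On the full DataFrame, `df1[['Nouns', 'Verbs', 'Adjectives', 'Adverbs']].sum()` should give the corpus-level totals directly.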


In [26]:
import pandas as pd

# Load the CSV file
df2 = pd.read_csv('/content/imdb_reviews_with_ner.csv')

# Display the contents with a title
print("Contents of cleaned imdb with Named Entity Recognition:")

# Set display options to show all rows and columns
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

# Display the entire DataFrame
display(df2)  # Use display() for better formatting in Colab





Contents of cleaned imdb with Named Entity Recognition:


Unnamed: 0,Title,Content,Cleaned_Content,Stemmed_Content,Lemmatized_Content,Nouns,Verbs,Adjectives,Adverbs,Entities
0,Keaton Steals The Show,I just got out of The FlashOK. So it does not ...,got flashok live positive buzz early screening...,got flashok live posit buzz earli screen nowhe...,get flashok live positive buzz early screening...,36,21,17,7,"[('one', 'CARDINAL'), ('one', 'CARDINAL'), ('e..."
1,Michael Keaton is Batman,Ok so for me the Grant Gustin will always be m...,ok grant gustin always live action flash ezra ...,ok grant gustin alway live action flash ezra m...,ok grant gustin always live action flash ezra ...,25,12,6,6,"[('grant gustin', 'PERSON'), ('ezra miller', '..."
2,"Good film, let down by poor CGI","As a fan of the Flashpoint:Paradox film, i was...",fan flashpointparadox film looking forward sto...,fan flashpointparadox film look forward stori ...,fan flashpointparadox film look forward story ...,18,6,16,5,"[('michael keaton', 'PERSON'), ('ezra miller',..."
3,Come on Barbie. Let's go party?,Finally after the early screening on 6th June ...,finally early screening th june showed unfinis...,final earli screen th june show unfinish print...,finally early screen th june show unfinished p...,56,27,22,11,"[('june', 'DATE'), ('one', 'CARDINAL'), ('barr..."
4,Terrible CGI was distracting,I really didn't know what to make of this.. I ...,really didnt know make surprised tone much les...,realli didnt know make surpris tone much less ...,really do not know make surprised tone much le...,33,15,17,16,"[('sillinessezra miller', 'PERSON'), ('michael..."
5,"Come for the fan-service, stay for the surpris...",I always start any review of a superhero movie...,always start review superhero movie making cle...,alway start review superhero movi make clear t...,always start review superhero movie make clear...,51,29,19,16,"[('one', 'CARDINAL'), ('half hour', 'TIME'), (..."
6,I actually enjoyed this,Much better than I thought and reviews led me ...,much better thought reviews led believe superw...,much better thought review led believ superwom...,much well think review lead believe superwoman...,47,28,29,23,"[('michael keaton', 'PERSON'), ('ezra miller',..."
7,Fun and Emotional,"When Justice League hit theatres in 2017, I'll...",justice league hit theatres ill admit ezra mil...,justic leagu hit theatr ill admit ezra miller ...,justice league hit theatre ill admit ezra mill...,110,61,34,21,"[('justice league', 'ORG'), ('ezra miller', 'P..."
8,"Overhyped, but a good film with horrible CGI.","So after being in development for years, multi...",development years multiple rewrites director c...,develop year multipl rewrit director chang dra...,development year multiple rewrite director cha...,124,57,45,23,"[('one', 'CARDINAL'), ('ezra miller', 'PERSON'..."
9,Best DC film by a mile,"Best DC film by a mile..\nFunny, endearing, fi...",best dc film mile funny endearing finally dc f...,best dc film mile funni endear final dc film p...,good dc film mile funny endear finally dc film...,23,5,14,12,"[('rightezra miller', 'PERSON'), ('michael kea..."
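
When `imdb_reviews_with_ner.csv` is read back, each `Entities` cell holds the text form of a list rather than a list, because CSV stores only strings. A minimal sketch of turning such cells back into tuples and counting labels; the sample cells below are shortened, hand-written stand-ins built from entities visible in the table, not the full stored values:

```python
import ast
from collections import Counter

# Stand-in cells in the same format as the Entities column
cells = [
    "[('grant gustin', 'PERSON'), ('ezra miller', 'PERSON')]",
    "[('june', 'DATE'), ('one', 'CARDINAL')]",
    "[('justice league', 'ORG'), ('ezra miller', 'PERSON')]",
]

label_counts = Counter()
for cell in cells:
    # literal_eval safely parses the Python-literal text back into a list of tuples
    entities = ast.literal_eval(cell)
    label_counts.update(label for _text, label in entities)

print(label_counts)
```

With pandas, the same conversion can be applied per column with `df2['Entities'].apply(ast.literal_eval)` before counting.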


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [20]:
# Write your response below
'''

Challenges:

Information gathering: Extracting reviews from IMDb was difficult because of possible request blocks and changing page layouts.
Data cleaning: Applying techniques such as stemming and lemmatization required a solid grasp of natural language processing.
Analysis techniques: Syntax and structure analysis, including parsing and POS tagging, took a while to understand.

What I enjoyed:

Learning opportunities: It was satisfying to get practical experience with Python packages such as pandas and nltk.
Data insights: Running NER on the cleaned data and examining the results was enjoyable.
Problem solving: Overcoming the scraping and analysis obstacles was interesting and instructive.

Time provided:

The time allotted was mostly fair, although more would have helped, particularly for people less experienced with these techniques.
Overall, the assignment offered thorough and rewarding practice in data analysis.
'''