# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) on IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Your code here

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_imdb_reviews(movie_id, num_reviews=1000):
    """Collect review texts for the given IMDb title ID."""
    base_url = f"https://www.imdb.com/title/{movie_id}/reviews"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
    reviews = []
    # Request the review listing in steps of 10 via the `start` query parameter
    for start in range(1, num_reviews+1, 10):
        url = f"{base_url}?start={start}"
        response = requests.get(url, headers=headers, timeout=30)
        # Fail loudly on HTTP errors instead of parsing an error page
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Each review body sits in a div carrying both of these classes
        review_containers = soup.find_all("div", class_="text show-more__control")
        for review in review_containers:
            reviews.append(review.text.strip())
    return reviews

# Movie ID for "Inception"
movie_id = "tt1375666"

# Scrape top 1000 user reviews
reviews = scrape_imdb_reviews(movie_id, num_reviews=1000)

# Convert data to DataFrame
df = pd.DataFrame({"Review": reviews})

# Save data to CSV
df.to_csv("inception_reviews.csv", index=False)

print("CSV file saved successfully.")


CSV file saved successfully.
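
One quick sanity check before cleaning is whether the scraper returned distinct reviews. The Question 3 output further down lists every POS dictionary exactly 100 times, which would be consistent with the same page being returned on each request. A minimal sketch, assuming a list of review strings like the one returned by `scrape_imdb_reviews`; the helper name `dedupe_reviews` and the sample strings are hypothetical:

```python
def dedupe_reviews(reviews):
    """Return the reviews with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for text in reviews:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

# Made-up sample standing in for scraped output
sample = ["Great film", "Mind-bending", "Great film", "Mind-bending"]
unique_sample = dedupe_reviews(sample)
print(f"{len(sample)} scraped, {len(unique_sample)} unique")
```

In the notebook this could be applied as `reviews = dedupe_reviews(reviews)` before building the DataFrame; a large gap between the two numbers would indicate that pagination is not advancing.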


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Load NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file with the collected text data
df = pd.read_csv("inception_reviews.csv")

# Display the first few rows of the DataFrame
print("Original Data:")
print(df.head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Original Data:
                                              Review
0  My 3rd time watching this movie! Yet, it still...
1  You only get to watch this for the first time ...
2  When you wake up from a good dream, you feel t...
3  When I first watch this movie, I was just shoc...
4  The 20th Century had Casablanca, Star Wars, th...


Remove noise, such as special characters and punctuation


In [3]:
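# Strip every character that is not a letter, digit, or whitespace
# (\w also matches "_", so underscores are removed explicitly)
df['clean_text'] = df['Review'].apply(lambda x: re.sub(r'[^\w\s]|_', '', x))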

# Display the data after removing noise
print("\nData after removing noise:")
print(df.head())


Data after removing noise:
                                              Review  \
0  My 3rd time watching this movie! Yet, it still...   
1  You only get to watch this for the first time ...   
2  When you wake up from a good dream, you feel t...   
3  When I first watch this movie, I was just shoc...   
4  The 20th Century had Casablanca, Star Wars, th...   

                                          clean_text  
0  My 3rd time watching this movie Yet it still s...  
1  You only get to watch this for the first time ...  
2  When you wake up from a good dream you feel th...  
3  When I first watch this movie I was just shock...  
4  The 20th Century had Casablanca Star Wars the ...  


Remove numbers


In [4]:
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub(r'\d+', '', x))

# Display the data after removing numbers
print("\nData after removing numbers:")
print(df.head())


Data after removing numbers:
                                              Review  \
0  My 3rd time watching this movie! Yet, it still...   
1  You only get to watch this for the first time ...   
2  When you wake up from a good dream, you feel t...   
3  When I first watch this movie, I was just shoc...   
4  The 20th Century had Casablanca, Star Wars, th...   

                                          clean_text  
0  My rd time watching this movie Yet it still st...  
1  You only get to watch this for the first time ...  
2  When you wake up from a good dream you feel th...  
3  When I first watch this movie I was just shock...  
4  The th Century had Casablanca Star Wars the Go...  
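
In the output above, stripping only the digits turns `3rd` into `rd` and `20th` into `th`. If such leftovers are unwanted, one option is to drop any token that contains a digit. A minimal sketch; the function name `remove_number_tokens` is hypothetical:

```python
import re

def remove_number_tokens(text):
    """Drop every whitespace-delimited token that contains at least one digit."""
    kept = [tok for tok in text.split() if not re.search(r"\d", tok)]
    return " ".join(kept)

print(remove_number_tokens("My 3rd time watching this movie"))
print(remove_number_tokens("The 20th Century had Casablanca"))
```

Whether ordinals such as `3rd` should vanish entirely or be kept as words depends on the downstream analysis, so this is a design choice rather than a correction.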


Remove stopwords by using the stopwords list.

In [5]:
stop_words = set(stopwords.words('english'))
df['clean_text'] = df['clean_text'].apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop_words))

# Display the data after removing stopwords
print("\nData after removing stopwords:")
print(df.head())


Data after removing stopwords:
                                              Review  \
0  My 3rd time watching this movie! Yet, it still...   
1  You only get to watch this for the first time ...   
2  When you wake up from a good dream, you feel t...   
3  When I first watch this movie, I was just shoc...   
4  The 20th Century had Casablanca, Star Wars, th...   

                                          clean_text  
0  rd time watching movie Yet still stunned mind ...  
1  get watch first time choose state mind careful...  
2  wake good dream feel reality harsh wake bad dr...  
3  first watch movie shocked twist Im gonna menti...  
4  th Century Casablanca Star Wars Godfather Blad...  
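
Row 3 above still contains `Im`: once the apostrophe has been stripped, the token no longer matches any entry in the stopword list. If the list in use contains contracted forms, one way around this is to filter stopwords before removing punctuation. A minimal sketch; `STOP_WORDS` is a small hand-written stand-in (an assumption, not the NLTK list), and `remove_stopwords_then_punct` is a hypothetical name. Note that this version also lowercases the text as part of the comparison:

```python
import re

# Small stand-in stopword set for illustration only (not the NLTK list)
STOP_WORDS = {"i", "i'm", "was", "just", "this", "the", "by", "when"}

def remove_stopwords_then_punct(text):
    """Filter stopwords while apostrophes are intact, then strip punctuation."""
    kept = []
    for tok in text.split():
        # Trim punctuation at the token edges so "movie," compares as "movie"
        core = tok.strip(".,!?;:\"()").lower()
        if core and core not in STOP_WORDS:
            kept.append(core)
    # Remove any remaining non-word characters inside the surviving tokens
    return re.sub(r"[^\w\s]", "", " ".join(kept))

print(remove_stopwords_then_punct("When I first watch this movie, I'm shocked by the twist!"))
```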


Lowercase all texts

In [6]:
df['clean_text'] = df['clean_text'].apply(lambda x: x.lower())

# Display the data after lowercasing
print("\nData after lowercasing:")
print(df.head())


Data after lowercasing:
                                              Review  \
0  My 3rd time watching this movie! Yet, it still...   
1  You only get to watch this for the first time ...   
2  When you wake up from a good dream, you feel t...   
3  When I first watch this movie, I was just shoc...   
4  The 20th Century had Casablanca, Star Wars, th...   

                                          clean_text  
0  rd time watching movie yet still stunned mind ...  
1  get watch first time choose state mind careful...  
2  wake good dream feel reality harsh wake bad dr...  
3  first watch movie shocked twist im gonna menti...  
4  th century casablanca star wars godfather blad...  


In [7]:
# Initialize stemming and lemmatization objects
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Stemming

In [8]:
df['stemmed_text'] = df['clean_text'].apply(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))

# Display the data after stemming
print("\nData after stemming:")
print(df.head())


Data after stemming:
                                              Review  \
0  My 3rd time watching this movie! Yet, it still...   
1  You only get to watch this for the first time ...   
2  When you wake up from a good dream, you feel t...   
3  When I first watch this movie, I was just shoc...   
4  The 20th Century had Casablanca, Star Wars, th...   

                                          clean_text  \
0  rd time watching movie yet still stunned mind ...   
1  get watch first time choose state mind careful...   
2  wake good dream feel reality harsh wake bad dr...   
3  first watch movie shocked twist im gonna menti...   
4  th century casablanca star wars godfather blad...   

                                        stemmed_text  
0  rd time watch movi yet still stun mind kept en...  
1  get watch first time choos state mind care fil...  
2  wake good dream feel realiti harsh wake bad dr...  
3  first watch movi shock twist im gonna mention ...  
4  th centuri casablanca star

Lemmatization

In [9]:
df['lemmatized_text'] = df['clean_text'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))

# Display the data after lemmatization
print("\nData after lemmatization:")
print(df.head())


Data after lemmatization:
                                              Review  \
0  My 3rd time watching this movie! Yet, it still...   
1  You only get to watch this for the first time ...   
2  When you wake up from a good dream, you feel t...   
3  When I first watch this movie, I was just shoc...   
4  The 20th Century had Casablanca, Star Wars, th...   

                                          clean_text  \
0  rd time watching movie yet still stunned mind ...   
1  get watch first time choose state mind careful...   
2  wake good dream feel reality harsh wake bad dr...   
3  first watch movie shocked twist im gonna menti...   
4  th century casablanca star wars godfather blad...   

                                        stemmed_text  \
0  rd time watch movi yet still stun mind kept en...   
1  get watch first time choos state mind care fil...   
2  wake good dream feel realiti harsh wake bad dr...   
3  first watch movi shock twist im gonna mention ...   
4  th centuri casab

In [11]:
# Save the cleaned data with new columns to CSV
df.to_csv("cleaned_inception_reviews.csv", index=False)

print("\nCleaned inception data saved successfully.")


Cleaned inception data saved successfully.


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [15]:
import pandas as pd
import nltk
import spacy

# Load NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

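# Load SpaCy resources
nlp = spacy.load("en_core_web_sm")

# sent._.parse_string (used in constituency_parsing below) comes from the
# benepar extension, which en_core_web_sm does not include on its own.
# Assumes the benepar package is installed (pip install benepar).
import benepar
benepar.download("benepar_en3")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})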

# Load the cleaned data from the CSV file
df = pd.read_csv("cleaned_inception_reviews.csv")

# Define function for Parts of Speech (POS) tagging
def pos_tagging(text):
    pos_tags = nltk.pos_tag(nltk.word_tokenize(text))
    pos_counts = {"Noun": 0, "Verb": 0, "Adjective": 0, "Adverb": 0}
    for word, pos in pos_tags:
        if pos.startswith("N"):
            pos_counts["Noun"] += 1
        elif pos.startswith("V"):
            pos_counts["Verb"] += 1
        elif pos.startswith("J"):
            pos_counts["Adjective"] += 1
        elif pos.startswith("R"):
            pos_counts["Adverb"] += 1
    return pos_counts

# Define function for constituency parsing
def constituency_parsing(text):
    doc = nlp(text)
    constituency_trees = [sent._.parse_string for sent in doc.sents]
    return constituency_trees

# Define function for dependency parsing
def dependency_parsing(text):
    doc = nlp(text)
    dependency_trees = [(token.text, token.dep_, token.head.text) for sent in doc.sents for token in sent]
    return dependency_trees

# Define function for Named Entity Recognition (NER)
def ner(text):
    doc = nlp(text)
    entities = {}
    for ent in doc.ents:
        entity_type = ent.label_
        entity_text = ent.text
        if entity_type not in entities:
            entities[entity_type] = []
        entities[entity_type].append(entity_text)
    return entities

# Apply functions to each review
df['POS_Tagging'] = df['clean_text'].apply(pos_tagging)
df['Constituency_Parsing'] = df['clean_text'].apply(constituency_parsing)
df['Dependency_Parsing'] = df['clean_text'].apply(dependency_parsing)
df['NER'] = df['clean_text'].apply(ner)

# Save the DataFrame with analysis results to a new CSV file
df.to_csv("analysis_results.csv", index=False)

print("Analysis results saved to 'analysis_results.csv'")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Analysis results saved to 'analysis_results.csv'
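
Part (3) also asks for a count of each entity. Since `ner` returns a dictionary mapping an entity label to a list of entity texts, the per-review results can be aggregated with `collections.Counter`. A minimal sketch; `count_entities` is a hypothetical helper and the sample dictionaries are made up for illustration, not taken from the dataset:

```python
from collections import Counter

def count_entities(ner_results):
    """Aggregate per-review NER dicts into label counts and (label, text) counts."""
    label_counts = Counter()
    entity_counts = Counter()
    for entities in ner_results:
        for label, texts in entities.items():
            label_counts[label] += len(texts)
            entity_counts.update((label, text) for text in texts)
    return label_counts, entity_counts

# Made-up results in the same shape as the output of ner()
sample_results = [
    {"PERSON": ["nolan", "leonardo dicaprio"], "DATE": ["2010"]},
    {"PERSON": ["nolan"]},
]
label_counts, entity_counts = count_entities(sample_results)
print(label_counts)
print(entity_counts.most_common(1))
```

In the notebook this would be called as `count_entities(df['NER'])` before the DataFrame is written to CSV, while the cells still hold dictionaries rather than strings.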


In [16]:
import pandas as pd

# Load the analysis results from the CSV file
df = pd.read_csv("analysis_results.csv")

# Count the occurrences of each attribute value in the dataset
pos_counts = df['POS_Tagging'].apply(pd.Series).stack().value_counts()
constituency_counts = df['Constituency_Parsing'].apply(pd.Series).stack().value_counts()
dependency_counts = df['Dependency_Parsing'].apply(pd.Series).stack().value_counts()

# Display the count for each attribute
print("POS Tagging Counts:")
print(pos_counts)
print("\nConstituency Parsing Counts:")
print(constituency_counts)
print("\nDependency Parsing Counts:")
print(dependency_counts)

POS Tagging Counts:
{'Noun': 297, 'Verb': 127, 'Adjective': 132, 'Adverb': 57}    100
{'Noun': 180, 'Verb': 85, 'Adjective': 84, 'Adverb': 30}      100
{'Noun': 145, 'Verb': 82, 'Adjective': 59, 'Adverb': 18}      100
{'Noun': 72, 'Verb': 31, 'Adjective': 33, 'Adverb': 8}        100
{'Noun': 90, 'Verb': 41, 'Adjective': 31, 'Adverb': 25}       100
{'Noun': 152, 'Verb': 61, 'Adjective': 72, 'Adverb': 28}      100
{'Noun': 136, 'Verb': 66, 'Adjective': 68, 'Adverb': 32}      100
{'Noun': 35, 'Verb': 23, 'Adjective': 17, 'Adverb': 9}        100
{'Noun': 152, 'Verb': 52, 'Adjective': 53, 'Adverb': 24}      100
{'Noun': 101, 'Verb': 45, 'Adjective': 50, 'Adverb': 22}      100
{'Noun': 81, 'Verb': 43, 'Adjective': 31, 'Adverb': 12}       100
{'Noun': 66, 'Verb': 32, 'Adjective': 38, 'Adverb': 16}       100
{'Noun': 115, 'Verb': 47, 'Adjective': 59, 'Adverb': 24}      100
{'Noun': 21, 'Verb': 10, 'Adjective': 12, 'Adverb': 6}        100
{'Noun': 262, 'Verb': 98, 'Adjective': 100, 'Adverb': 54
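
The counts above are frequencies of identical rows rather than tag totals: after the CSV round trip each `POS_Tagging` cell is a string, so `value_counts()` tallies how often each stringified dictionary occurs. A minimal sketch of summing the four tag counts instead, assuming each cell holds a Python dict literal in the form printed above; `total_pos_counts` is a hypothetical helper:

```python
import ast
from collections import Counter

def total_pos_counts(cells):
    """Sum Noun/Verb/Adjective/Adverb counts over stringified dict cells."""
    totals = Counter()
    for cell in cells:
        # literal_eval safely turns "{'Noun': 1, ...}" back into a dict
        totals.update(ast.literal_eval(cell))
    return dict(totals)

# Two rows copied from the output above
rows = [
    "{'Noun': 297, 'Verb': 127, 'Adjective': 132, 'Adverb': 57}",
    "{'Noun': 180, 'Verb': 85, 'Adjective': 84, 'Adverb': 30}",
]
print(total_pos_counts(rows))
```

Applied to the notebook's data this would be `total_pos_counts(df['POS_Tagging'])`, giving one total per part of speech across all reviews.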

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [18]:
# Write your response below
#The task provided a thorough opportunity to utilise numerous NLP approaches, such as data cleaning, POS tagging, parsing, and NER. The intricacy of language structures and entity recognition made parsing and NER very difficult. However, it was entertaining to watch the data transformation and analysis take place. The time allotted appeared appropriate for finishing the job, balancing exploration and implementation within a fair time schedule.