# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.imdb.com/title/tt0137523/reviews/?ref_=tt_ov_rt'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    load_more_button = soup.find('div', class_='load-more-data')
    reviews = []
    while load_more_button and len(reviews) < 1000:
        data_key = load_more_button['data-key']
        load_more_url = f"https://www.imdb.com/title/tt0137523/reviews/_ajax?ajax=1&ref_=undefined&paginationKey={data_key}"
        response = requests.get(load_more_url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='review-container')
        for review_element in review_elements:
            review_text = review_element.find('div', class_='text show-more__control').get_text(strip=True)
            reviews.append(review_text)
        load_more_button = soup.find('div', class_='load-more-data')
    if reviews:
        df = pd.DataFrame({'Review': reviews})
        df.to_csv('1000_reviews.csv', index=False)
        print("Reviews saved to 1000_reviews.csv")
    else:
        print("No reviews found.")
else:
    print("Error accessing movie page:", response.status_code)


Reviews saved to 1000_reviews.csv


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

#Reading the dataset which is collected in the previous question.
df = pd.read_csv('1000_reviews.csv')

# Function to clean text data
def clean_text(text):
  print("Original Text: ")
  print(text)

# Step 1: Removing Noise such as sepcial characters and punctuations
  text = ''.join([char for char in text if char not in string.punctuation])
  print(" ")
  print("Step 1: Result after removing special characters and punctuation is: ")
  print(text)

# Step 2: Remove numbers
  text = ''.join([char for char in text if not char.isdigit()])
  print(" ")
  print("Step 2: Result after removing numbers in the text is: ")
  print(text)

# Step 3: Tokenizing text and removing stop words
  tokens = word_tokenize(text)
  stop_words = set(stopwords.words('english'))
  tokens = [word for word in tokens if word not in stop_words]
  print(" ")
  print("Step 3: Result after tokenizing and removing stopwords: ")
  print(tokens)

# Step 4: Tranforming all sentences into lowercase text
  text = text.lower()
  print(" ")
  print("Step 4: LowerCase text is: ")
  print(text)

# Step 5: Stemming
  stemmer = PorterStemmer()
  stemmed_tokens = [stemmer.stem(word) for word in tokens]
  print(" ")
  print("Step 5: Ouput of the text after stemming is: ")
  print(stemmed_tokens)

# Step 6: Lemmatization
  lemmatizer = WordNetLemmatizer()
  lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
  print(" ")
  print("Step 6: Result after Lemmatization is: ")
  print(lemmatized_tokens)
  print(" ")

# Joining tokens back into text to obtain cleaned version of original text
  cleaned_text = ' '.join(lemmatized_tokens)
  return cleaned_text

# Applying data cleaning to the 'Review' column
df['Cleaned_Review'] = df['Review'].apply(clean_text)

# Saving the cleaned data to a new column in the CSV file
df.to_csv('1000_reviews_cleaned.csv', index=False)
print("Cleaned data saved to 1000_reviews_cleaned.csv")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Original Text: 
David Fincher's cult classic has transcended its popularity upon its initial release to become etched into the fabric of pop culture, to the point where almost any stranger on the street would know the first two rules of 'Fight Club.'Where 'Fight Club' succeeds is in its originality. From the witty dialogue to the Tarantino-esque editing style and the character evolution, the part-time drama, part-time dark comedy is a portrait of a man who experiences an existential crisis and incidentally creates a movement. The duo of Edward Norton and Brad Pitt are iconic, as they couldn't be any more of an odd couple. Their "oil and water" dynamic is a major reason for why 'Fight Club' works - they are able to find function in a dysfunctional world.But it's not just the Norton and Pitt show. The "third wheel" that is Marla Singer (Helena Bonham Carter) is perfectly cast as a kooky free spirit whose temper is as short 

# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [3]:
import pandas as pd
import nltk
from collections import Counter

# Download NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')

# Read cleaned text
df = pd.read_csv('1000_reviews_cleaned.csv')

# Function for POS tagging and counting
def pos_tag_and_count(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_counts

# Function for Named Entity Recognition (NER) and counting
def ner_and_count(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    ner_tags = nltk.ne_chunk(pos_tags)
    ner_counts = Counter(chunk.label() for chunk in ner_tags if hasattr(chunk, 'label'))
    return ner_counts

# Apply functions to each review
for index, row in df.iterrows():
    print(f"Named Entity Recognition for Review {index + 1}:")
    ner_counts = ner_and_count(row['Cleaned_Review'])
    print(ner_counts)
    print("-" * 50)

    print(f"POS Tagging for Review {index + 1}:")
    pos_counts = pos_tag_and_count(row['Cleaned_Review'])
    print(pos_counts)
    print("-" * 50)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Counter({'NN': 46, 'JJ': 21, 'RB': 12, 'NNP': 9, 'DT': 5, 'VBN': 3, 'CD': 3, 'VBP': 3, 'PRP': 3, 'VBZ': 3, 'VBD': 3, 'VBG': 2, 'IN': 2, 'VB': 2, 'PRP$': 1, 'EX': 1, 'CC': 1})
--------------------------------------------------
Named Entity Recognition for Review 185:
Counter({'ORGANIZATION': 3, 'GPE': 1})
--------------------------------------------------
POS Tagging for Review 185:
Counter({'NN': 44, 'JJ': 27, 'VBG': 7, 'VBP': 7, 'RB': 6, 'NNP': 5, 'DT': 4, 'IN': 4, 'VB': 4, 'PRP': 3, 'VBD': 3, 'JJS': 3, 'VBN': 3, 'MD': 2, 'VBZ': 1, 'RBR': 1, 'UH': 1, 'NNS': 1})
--------------------------------------------------
Named Entity Recognition for Review 186:
Counter({'GSP': 1, 'ORGANIZATION': 1})
--------------------------------------------------
POS Tagging for Review 186:
Counter({'NN': 29, 'JJ': 15, 'VBD': 7, 'RB': 6, 'NNP': 3, 'EX': 1, 'VBZ': 1, 'NNS': 1, 'VBG': 1, 'MD': 1, 'VB': 1, 'IN': 1})
-------------------------------

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [4]:
# Write your response below
""" I worked on extracting the movie reviews from imdb. Extracting reviews is quite interesting
Faced some difficulties in building the code for all the questions.
One thing is when extracting the reviews from the imbd, it first scrapped the reviews that present in the page 1, I stuggled
with it as i'm getting only 27 reviews, later I recognized that the code is not loading the more reviews which I need to write code
for load more option. After that it worked.
After scraping all the reviews, the 2nd question was some what easy as it included the step to step process mentioned in
question which is easy for me to work on this.
Question 3 is bit challenging for me."""


" I worked on extracting the movie reviews from imdb. Extracting reviews is quite interesting\nFaced some difficulties in building the code for all the questions.\nOne thing is when extracting the reviews from the imbd, it first scrapped the reviews that present in the page 1, I stuggled \nwith it as i'm getting only 27 reviews, later I recognized that the code is not loading the more reviews which I need to write code \nfor load more option. After that it worked.\nAfter scraping all the reviews, the 2nd question was some what easy as it included the step to step process mentioned in \nquestion which is easy for me to work on this. \nQuestion 3 is bit challenging for me."