# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# Movie URL on IMDB for the movie Oppenheimer released in 2023 which has around 4k+ reviews
url_of_the_movie = "https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ov_ql_2"

# Headers for the browser request
request_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36'
}

# List to store the reviews
reviews_list = []

# Function to scrape reviews from a url
def scrape_reviews(url,request_headers):
    response = requests.get(url, headers=request_headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Finding all review containers
    review_containers = soup.find_all('div', class_='text show-more__control')

    for review in review_containers:
        reviews_list.append(review.get_text(strip=True))

# Iterating through the review pages
for page in range(1, 110):  # assuming 10 reviews per page is the standard
    print(f"Scraping page {page}...")
    scrape_reviews(url_of_the_movie + f"&page={page}",request_headers)

    # Adjusting request throttle and pause to avoid overloading the server
    time.sleep(2)

    if len(reviews_list) >= 1000:
        break

# Saving the reviews to a CSV file
csv_file = "imdb_movie_reviews.csv"

with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Review"])

    for review in reviews_list[:1000]:  # Limiting to 1000 reviews
        writer.writerow([review])

print(f"Successfully scraped and saved {len(reviews_list[:1000])} reviews to {csv_file}")


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Scraping page 21...
Scraping page 22...
Scraping page 23...
Scraping page 24...
Scraping page 25...
Scraping page 26...
Scraping page 27...
Scraping page 28...
Scraping page 29...
Scraping page 30...
Scraping page 31...
Scraping page 32...
Scraping page 33...
Scraping page 34...
Scraping page 35...
Scraping page 36...
Scraping page 37...
Scraping page 38...
Scraping page 39...
Scraping page 40...
Scraping page 41...
Scraping page 42...
Scraping page 43...
Scraping page 44...
Scraping page 45...
Scraping page 46...
Scraping page 47...
Scraping page 48...
Scraping page 49...
Scraping page 50...
Scraping 

# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
!pip install pandas nltk



In [None]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Downloading required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Loading the CSV file with the reviews collected
df = pd.read_csv('imdb_movie_reviews.csv')


stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean text
def clean_text(text):
    # (1) Removing noise (special characters, punctuations, etc.)
    text = re.sub(r'[^\w\s]', '', text)

    # (2) Removing numbers
    text = re.sub(r'\d+', '', text)

    # (3) Tokenize and Removing stopwords
    words = text.split()  # Tokenize by splitting on spaces
    words = [word for word in words if word.lower() not in stop_words]  # Removing stopwords

    # (4) Lowercaseing all texts
    words = [word.lower() for word in words]

    # (5) Stemming
    words_stemmed = [stemmer.stem(word) for word in words]

    # (6) Lemmatization
    words_lemmatized = [lemmatizer.lemmatize(word) for word in words_stemmed]

    return ' '.join(words_lemmatized)  # Return cleaned text

# Applying the cleaning function to the 'Review' column and saving them in a new column 'Cleaned Review'
df['Cleaned Review'] = df['Review'].apply(clean_text)

# Saving the updated DataFrame with the new column back as a CSV
cleaned_csv = 'imdb_movie_reviews_cleaned.csv'
df.to_csv(cleaned_csv, index=False)

print(f"Data cleaning completed and saved to {cleaned_csv}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Data cleaning completed and saved to imdb_movie_reviews_cleaned.csv


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [None]:
!pip install pandas spacy nltk benepar
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m46.0 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# Your code here
import pandas as pd
import spacy
from collections import Counter
import nltk
from nltk import pos_tag, word_tokenize
import benepar

# Loading cleaned data from the CSV file
df = pd.read_csv('imdb_movie_reviews_cleaned.csv')

# Loading SpaCy model for POS tagging, Dependency Parsing, and NER
nlp = spacy.load('en_core_web_sm')

# Initializing Benepar for Constituency Parsing
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
benepar.download('benepar_en3')
parser = benepar.Parser("benepar_en3")

# Function for POS tagging and counting N, V, Adj, Adv
def pos_analysis(text):
    pos_counts = Counter()  # Dictionary to hold the counts of POS tags
    tokens = word_tokenize(text)  # Tokenize the text
    tagged_words = pos_tag(tokens)  # POS tagging

    for word, tag in tagged_words:
        if tag.startswith('N'):  # Noun
            pos_counts['Noun'] += 1
        elif tag.startswith('V'):  # Verb
            pos_counts['Verb'] += 1
        elif tag.startswith('JJ'):  # Adjective
            pos_counts['Adjective'] += 1
        elif tag.startswith('RB'):  # Adverb
            pos_counts['Adverb'] += 1

    return pos_counts

# Function for Constituency Parsing and Dependency Parsing
def parse_trees(text):
    doc = nlp(text)

    print("\n=== Constituency Parsing Tree ===")
    # Applying constituency parsing on the first sentence as an example
    sentence = list(doc.sents)[0].text
    tree = parser.parse(sentence.split())
    print(tree)

    print("\n=== Dependency Parsing Tree ===")
    for token in doc:
        print(f'{token.text} ({token.dep_}) <-- {token.head.text}')

    return sentence

# Function for Named Entity Recognition and counting entity types
def ner_analysis(text):
    doc = nlp(text)
    entity_counts = Counter()

    for ent in doc.ents:
        entity_counts[ent.label_] += 1

    return entity_counts

# Applying the POS tagging, parsing, and NER on cleaned data
all_pos_counts = Counter()
all_entity_counts = Counter()

for index, row in df.iterrows():
    clean_text = row['Cleaned Review']

    # (1) POS Tagging
    pos_counts = pos_analysis(clean_text)
    all_pos_counts.update(pos_counts)

    # (2) Parsing (only for the first review to avoid too many trees)
    if index == 0:
        sentence = parse_trees(clean_text)
        print(f"\nExample Sentence for Parsing: {sentence}")

    # (3) NER
    entity_counts = ner_analysis(clean_text)
    all_entity_counts.update(entity_counts)

# Printing POS counts
print("\n=== POS Tagging Results ===")
print(f"Total Nouns: {all_pos_counts['Noun']}")
print(f"Total Verbs: {all_pos_counts['Verb']}")
print(f"Total Adjectives: {all_pos_counts['Adjective']}")
print(f"Total Adverbs: {all_pos_counts['Adverb']}")

# Printing NER counts
print("\n=== Named Entity Recognition Results ===")
for entity, count in all_entity_counts.items():
    print(f"{entity}: {count}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!
  state_dict = torch.load(
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



=== Constituency Parsing Tree ===




(TOP
  (S
    (MD youll)
    (VB wit)
    (NP (NN brain) (JJ fulli) (NN switch))
    (VB watch)
    (NP (NNP oppenheim))
    (VP
      (MD could)
      (VP
        (NP (PRP easili))
        (VB get)
        (PRT (RP away))
        (NP (JJ nonattent) (NN viewer))
        (JJ intellig)
        (NN filmmak)
        (NN show)
        (FW audienc)
        (JJ great)
        (NN respect)
        (NN fire)
        (FW dialogu)
        (NN pack)
        (VB inform)
        (JJ relentless)
        (NN pace)
        (NN jump)
        (VB differ)
        (NN time)
        (FW oppenheim)
        (NN life)
        (NP (JJ continu) (NN hour))
        (JJ runtim)
        (JJ visual)
        (NN clue)
        (JJ guid)
        (NN viewer)
        (NN time)
        (MD youll)
        (VP
          (VB get)
          (NP (NN grip))
          (VB quit)
          (FW quickli)
          (JJ relentless)
          (NN help)
          (VB express)
          (NN urgenc)
          (FW u)
          (NN attack)
 

#**Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [16]:
# Write your response below
'''
This assignment was a great exercise in both data processing and natural language analysis. The challenges mainly revolved around balancing simplicity with the depth of analysis. For instance, ensuring that each step of the pipeline, from text cleaning to advanced syntax parsing, worked smoothly required integrating several libraries, each with its nuances. Constituency parsing, in particular, was an exciting challenge, as it forced me to consider sentence structures more deeply than I typically would. What I enjoyed most was the Named Entity Recognition (NER) part, as it reveals how much structured information can be extracted from unstructured text. Seeing entities like organizations, dates, and locations surface from the raw data was satisfying. It emphasizes how much insight can be gained from seemingly ordinary text. This kind of task can be time-intensive, especially for students new to NLP or Python. Depending on their familiarity with the libraries used (like `spacy`, `nltk`, and `benepar`), students may need additional time for installation, testing, and debugging. A few extra days would ensure deeper understanding and a more robust analysis.

'''

'\nThis assignment was a great exercise in both data processing and natural language analysis. The challenges mainly revolved around balancing simplicity with the depth of analysis. For instance, ensuring that each step of the pipeline, from text cleaning to advanced syntax parsing, worked smoothly required integrating several libraries, each with its nuances. Constituency parsing, in particular, was an exciting challenge, as it forced me to consider sentence structures more deeply than I typically would. What I enjoyed most was the Named Entity Recognition (NER) part, as it reveals how much structured information can be extracted from unstructured text. Seeing entities like organizations, dates, and locations surface from the raw data was satisfying. It emphasizes how much insight can be gained from seemingly ordinary text. This kind of task can be time-intensive, especially for students new to NLP or Python. Depending on their familiarity with the libraries used (like `spacy`, `nltk`