# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [55]:
# Your code here
import requests
from bs4 import BeautifulSoup
import csv

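def scrape_imdb_reviews(movie_id, max_reviews=1000):
    # Note: this requests a single reviews page per title, so the number of
    # reviews returned may be well below max_reviews.
    url = f"https://www.imdb.com/title/{movie_id}/reviews"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, headers=headers, timeout=30)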
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        reviews = soup.find_all('div', class_='text show-more__control')
        reviews_data = []
        for review in reviews[:max_reviews]:
            text = review.get_text(strip=True)
            reviews_data.append(text)
        return reviews_data
    else:
        print("Failed to fetch reviews")
        return []
# Write the collected reviews to a one-column CSV file
def save_to_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Review'])
        writer.writerows([[review] for review in data])

def main():
    # Example movie IDs
    movie_ids = ["tt0172495", "tt0111161"]  # Gladiator, The Shawshank Redemption

    all_reviews = []
    for movie_id in movie_ids:
        reviews = scrape_imdb_reviews(movie_id)
        all_reviews.extend(reviews)

    if all_reviews:
        save_to_csv(all_reviews, 'movie_reviews.csv')
        print("Reviews saved to movie_reviews.csv")
    else:
        print("No reviews found")

if __name__ == "__main__":
    main()


Reviews saved to movie_reviews.csv
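
Because the question allows combining two or three movies to reach the target, it can help to record which movie each review came from and to skip duplicates while accumulating. The following is a minimal offline sketch, not the notebook's method: `collect_reviews`, `rows_to_csv`, `fake_fetch`, the `Movie_ID` column, and the placeholder IDs are all hypothetical names introduced for illustration, and `fake_fetch` stands in for a real scraper such as `scrape_imdb_reviews`.

```python
import csv
import io

def collect_reviews(movie_ids, fetch, target=1000):
    """Gather (movie_id, review) rows across movies until `target` unique reviews are held."""
    seen = set()
    rows = []
    for movie_id in movie_ids:
        for text in fetch(movie_id):
            # Normalise whitespace and case so near-identical copies count as duplicates
            key = " ".join(text.split()).lower()
            if not key or key in seen:
                continue
            seen.add(key)
            rows.append((movie_id, text.strip()))
            if len(rows) >= target:
                return rows
    return rows

def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Movie_ID", "Review"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical stand-in for a real scraper, so the sketch runs without network access
def fake_fetch(movie_id):
    samples = {
        "tt0000001": ["Great film.", "great  film.", "Too long."],
        "tt0000002": ["Too long.", "Loved the score."],
    }
    return samples.get(movie_id, [])

rows = collect_reviews(["tt0000001", "tt0000002"], fake_fetch, target=3)
print(rows_to_csv(rows))
```

Passing the fetch function in as a parameter keeps the accumulation logic testable without touching the website; in the notebook, the real scraper could be supplied in its place.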


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [56]:
# Write code for each of the sub parts with proper comments.
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
#(1) Remove noise, such as special characters and punctuations.
def remove_special_characters(text):
    # Keep only ASCII letters and whitespace; note that this also strips digits
    return re.sub(r'[^a-zA-Z\s]', '', text)
#(2) Remove numbers.
def remove_numbers(text):
    # Remove numbers
    return re.sub(r'\d+', '', text)
#(3) Remove stopwords by using the stopwords list.
def remove_stopwords(text):
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    filtered_text = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(filtered_text)
#(5) Stemming.
def stemming(text):
    # Stemming
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(text)
    stemmed_text = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_text)
#(6) Lemmatization.
def lemmatization(text):
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    lemmatized_text = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(lemmatized_text)
#(4) Lowercase all texts (applied as the first step of the full pipeline below)
def clean_text(text):
    text = text.lower()
    text = remove_special_characters(text)
    text = remove_numbers(text)
    text = remove_stopwords(text)
    text = stemming(text)
    text = lemmatization(text)
    return text

def clean_and_save_csv(input_file, output_file):
    with open(input_file, 'r', newline='', encoding='utf-8') as infile, \
            open(output_file, 'w', newline='', encoding='utf-8') as outfile:
        reader = csv.DictReader(infile)
        fieldnames = reader.fieldnames + ['Cleaned_Review']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            cleaned_review = clean_text(row['Review'])
            row['Cleaned_Review'] = cleaned_review
            writer.writerow(row)

def main():
    input_file = 'movie_reviews.csv'
    output_file = 'cleaned_movie_reviews.csv'
    clean_and_save_csv(input_file, output_file)
    print("Cleaned data saved to cleaned_movie_reviews.csv")

if __name__ == "__main__":
    main()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data saved to cleaned_movie_reviews.csv
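
Since the question asks for output from each part, one option is to record the text after every step rather than only the final column. The sketch below is an assumption-laden illustration using only the standard library: `cleaning_trace` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and stemming and lemmatization are left out because they need NLTK.

```python
import re

# Tiny illustrative stopword list (an assumption; the cell above uses NLTK's English list)
DEMO_STOPWORDS = {"the", "a", "was", "and", "it", "i"}

def cleaning_trace(text, stop_words=DEMO_STOPWORDS):
    """Return an ordered list of (step name, text after that step)."""
    trace = [("original", text)]
    text = text.lower()
    trace.append(("lowercase", text))
    # Replace punctuation with a space so hyphenated words are not glued together
    text = re.sub(r"[^\w\s]|_", " ", text)
    trace.append(("remove punctuation", text))
    text = re.sub(r"\d+", " ", text)
    trace.append(("remove numbers", text))
    # Splitting on whitespace also collapses the extra spaces left by earlier steps
    text = " ".join(word for word in text.split() if word not in stop_words)
    trace.append(("remove stopwords", text))
    return trace

for step, value in cleaning_trace("The movie was GREAT, and I watched it 2 times!"):
    print(f"{step:>20}: {value}")
```

Replacing removed characters with a space instead of an empty string is a design choice: it prevents a token such as "well-made" from collapsing into "wellmade".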


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
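
For part (1), a tagger usually reports finer-grained tags than the four classes requested, so the raw counts need to be collapsed. The sketch below shows one possible mapping from the coarse tags that spaCy exposes as `token.pos_`; `TAG_TO_CLASS` and `summarize_pos` are hypothetical names, and counting `PROPN` as a noun and `AUX` as a verb is a judgment call rather than a rule. The example counter is copied from the cell output shown in this notebook.

```python
from collections import Counter

# Assumed mapping from coarse POS tags to the four requested classes.
# Treating PROPN as a noun and AUX as a verb is a choice, not a requirement.
TAG_TO_CLASS = {
    "NOUN": "Noun", "PROPN": "Noun",
    "VERB": "Verb", "AUX": "Verb",
    "ADJ": "Adjective",
    "ADV": "Adverb",
}

def summarize_pos(pos_counts):
    """Collapse a tag -> count mapping into totals for the four classes."""
    totals = Counter({"Noun": 0, "Verb": 0, "Adjective": 0, "Adverb": 0})
    for tag, count in pos_counts.items():
        word_class = TAG_TO_CLASS.get(tag)
        if word_class:
            totals[word_class] += count
    return dict(totals)

# Tag counts copied from the POS tagging output in this notebook
example = Counter({'PROPN': 5, 'ADJ': 4, 'DET': 3, 'NOUN': 3,
                   'PUNCT': 3, 'VERB': 2, 'ADP': 2, 'AUX': 1})
print(summarize_pos(example))
# → {'Noun': 8, 'Verb': 3, 'Adjective': 4, 'Adverb': 0}
```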

In [54]:
# Your code here
!pip install nltk
!python -m spacy download en_core_web_sm
import nltk
from collections import Counter
import spacy
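# Load the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")
# Toy context-free grammar used by the constituency parser below.
# ADJ never appears on the right-hand side of a rule, so a sentence
# containing 'quick', 'brown', or 'lazy' produces no tree.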
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | N
VP -> V NP | V PP
PP -> P NP
Det -> 'the' | 'a' | 'an' | 'The' | '.'
N -> 'dog' | 'cat' | 'bird' | 'man' | 'woman' | 'fox' | 'John' | 'New' | 'York' | 'City'| 'at' | 'Google'| 'is' | 'bustling' | 'metropolis'
V -> 'jumps' | 'runs' | 'flies' | 'walks' | 'over' | 'works'
P -> 'over'
ADJ -> 'quick' | 'brown' | 'lazy'
""")
# Count coarse part-of-speech tags with spaCy
def pos_tagging(text):
    doc = nlp(text)
    pos_counts = Counter(token.pos_ for token in doc)
    return pos_counts

def constituency_parsing(text):
    # Build the chart parser once and reuse it for every sentence
    parser = nltk.ChartParser(grammar)
    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text)
    for sent in sentences:
        # Tokenize the sentence into words
        words = nltk.word_tokenize(sent)
        # Print every tree the grammar licenses (nothing is printed if there is none)
        for tree in parser.parse(words):
            print(tree)

def dependency_parsing(text):
    doc = nlp(text)
    print("\nDependency Parsing Trees:")
    for sent in doc.sents:
        for token in sent:
            print(f"{token.text} <--{token.dep_}-- {token.head.text}")

def named_entity_recognition(text):
    doc = nlp(text)
    entities = Counter(ent.label_ for ent in doc.ents)
    return entities

def main():
    # Example cleaned text
    cleaned_text = "The quick brown fox jumps over the lazy dog. John works at Google. New York City is a bustling metropolis."

    # Parts of Speech (POS) Tagging
    pos_counts = pos_tagging(cleaned_text)
    print("Parts of Speech (POS) Tagging:")
    print(pos_counts)

    # Constituency Parsing
    print("\nConstituency Parsing Trees:")
    constituency_parsing(cleaned_text)

    # Dependency Parsing
    dependency_parsing(cleaned_text)

    # Named Entity Recognition (NER)
    entities = named_entity_recognition(cleaned_text)
    print("\nNamed Entity Recognition (NER):")
    print(entities)

if __name__ == "__main__":
    main()

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 66.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Parts of Speech (POS) Tagging:
Counter({'PROPN': 5, 'ADJ': 4, 'DET': 3, 'NOUN': 3, 'PUNCT': 3, 'VERB': 2, 'ADP': 2, 'AUX': 1})

Constituency Parsing Trees:

Dependency Parsing Trees:
The <--det-- fox
quick <--amod-- fox
brown <--amod-- fox
fox <--nsubj-- jumps
jumps <--ROOT-- jumps
over <--prep-- jumps
the <--det-- dog
lazy <--amod-- dog
d
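
To explain a dependency parse with one sentence, it can help to regroup the flat `token <--relation-- head` lines into an indented tree, so that each word sits under the word it depends on. The sketch below is a hedged illustration using only the first six relations printed above; `EDGES` and `render_tree` are hypothetical names. It keys nodes by token text, which is an assumption that breaks when a word repeats in a sentence; a token index would be safer there.

```python
from collections import defaultdict

# (token, relation, head) triples copied from the first lines of the output above
EDGES = [
    ("The", "det", "fox"),
    ("quick", "amod", "fox"),
    ("brown", "amod", "fox"),
    ("fox", "nsubj", "jumps"),
    ("jumps", "ROOT", "jumps"),
    ("over", "prep", "jumps"),
]

def render_tree(edges):
    """Return the dependency relations as indented lines, starting from the root."""
    children = defaultdict(list)
    root = None
    for token, relation, head in edges:
        if relation == "ROOT":
            root = token  # the root is the token that heads itself
        else:
            children[head].append((token, relation))

    lines = []

    def walk(token, relation, depth):
        lines.append("  " * depth + f"{token} ({relation})")
        for child, child_relation in children[token]:
            walk(child, child_relation, depth + 1)

    if root is not None:
        walk(root, "ROOT", 0)
    return lines

print("\n".join(render_tree(EDGES)))
```

Read this way, the verb "jumps" is the root, "fox" is its subject with the determiner and both adjectives attached to it, and "over" hangs off the verb as a prepositional modifier.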

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# The code above was the most challenging part to execute, and I learned a lot from it.