# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
# Installing packages
!pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to fetch page content from IMDB
def fetch_imdb_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print("Failed to retrieve page.")
        return None
    return response.content

# Extracting user reviews from IMDB
def extract_imdb_reviews(content):
    soup = BeautifulSoup(content, 'html.parser')
    review_elements = soup.find_all('div', class_='review-container')
    reviews = []

    for review in review_elements:
        try:
            title = review.find('a', class_='title')
            rating = review.find('span', class_='rating-other-user-rating')
            body = review.find('div', class_='text show-more__control')

            title_text = title.text.strip() if title else "No Title"
            rating_text = rating.text.strip() if rating else "No Rating"
            body_text = body.text.strip() if body else "No Review"

            reviews.append({
                'Title': title_text,
                'Rating': rating_text,
                'Review': body_text
            })
        except Exception as e:
            print(f"Error while processing a review: {e}")

    return reviews

# Function to collect user reviews from multiple IMDB movie URLs
def collect_imdb_reviews(movie_urls, target_reviews=1000):
    all_reviews = []

    for url in movie_urls:
        print(f"Collecting reviews from {url}...")
        page_content = fetch_imdb_page(url)
        if page_content is None:
            continue

        reviews_on_page = extract_imdb_reviews(page_content)
        all_reviews.extend(reviews_on_page)

        if len(all_reviews) >= target_reviews:
            break

    return all_reviews[:target_reviews]

# List of IMDB review-page URLs for the chosen titles
imdb_movie_urls = [
    "https://www.imdb.com/title/tt15435876/reviews/?ref_=ttrt_sa_3",
    "https://www.imdb.com/title/tt0412142/reviews/?ref_=ttrt_ql_2",
]

# Collecting reviews from IMDB
imdb_reviews = collect_imdb_reviews(imdb_movie_urls, target_reviews=1000)

# Creating a DataFrame and saving the reviews to a CSV file
imdb_reviews_df = pd.DataFrame(imdb_reviews)
imdb_reviews_df.to_csv('imdb_reviews.csv', index=False)

# Displaying the first few rows of the DataFrame
print(imdb_reviews_df.head())
print(f"Total reviews collected: {len(imdb_reviews_df)}. Saved to 'imdb_reviews.csv'")


Collecting reviews from https://www.imdb.com/title/tt15435876/reviews/?ref_=ttrt_sa_3...
Collecting reviews from https://www.imdb.com/title/tt0412142/reviews/?ref_=ttrt_ql_2...
                                               Title Rating  \
0                           Wow...this was terrific!   9/10   
1                                    Fantastic Start  10/10   
2                                     So Far So Good   8/10   
3  The Penguin continues the grimy and bleak Goth...   9/10   
4  A gritty, violent crime drama that is so fun t...   9/10   

                                              Review  
0  Wow. The Penguin is just terrific. Everyone kn...  
1  Absolutely nailed the tone and atmosphere. Gri...  
2  The first episode is a direct continuation of ...  
3  The Batman, for my taste had simply the greate...  
4  I've been waiting for The Penguin ever since i...  
Total reviews collected: 50. Saved to 'imdb_reviews.csv'
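
The run above collected 50 reviews against a target of 1000, and `fetch_imdb_page` gives up after a single failed request. One possible way to make fetching more tolerant of intermittent failures is to retry with an increasing delay. The sketch below is only an illustration: `fetch_with_retries`, `flaky_fetch`, and the example URL are hypothetical names that are not part of the notebook, and the helper assumes a fetch function that returns `None` on failure, as `fetch_imdb_page` does. It does not address pagination, which would also be needed to reach more reviews per title.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str], Optional[bytes]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(url) until it returns content or the attempts run out."""
    for attempt in range(max_attempts):
        content = fetch(url)
        if content is not None:
            return content
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * (2 ** attempt))
    return None


# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url: str) -> Optional[bytes]:
    calls["count"] += 1
    return b"<html>ok</html>" if calls["count"] >= 3 else None


# Record the requested delays instead of actually sleeping.
delays = []
result = fetch_with_retries(
    flaky_fetch, "https://example.invalid/reviews", sleep=delays.append
)
print(result, delays)
```

In the notebook, a call such as `fetch_with_retries(fetch_imdb_page, url)` could stand in for the direct call inside `collect_imdb_reviews`, assuming the site's terms permit repeated requests.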


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write code for each of the sub parts with proper comments.
# Install necessary libraries
!pip install nltk pandas

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Load the data
df = pd.read_csv('imdb_reviews.csv')

# Build these once instead of recreating them on every call
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to remove special characters and punctuation
# (this pattern also drops digits, so remove_numbers below acts as a safeguard)
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word.lower() not in STOP_WORDS)

# Function to lowercase all text
def lowercase_text(text):
    return text.lower()

# Function for stemming
def stem_text(text):
    return ' '.join(stemmer.stem(word) for word in text.split())

# Function for lemmatization
def lemmatize_text(text):
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

# Apply the cleaning functions step by step
df['Cleaned_Review'] = df['Review'].apply(remove_special_characters)
df['Cleaned_Review'] = df['Cleaned_Review'].apply(remove_numbers)
df['Cleaned_Review'] = df['Cleaned_Review'].apply(remove_stopwords)
df['Cleaned_Review'] = df['Cleaned_Review'].apply(lowercase_text)
df['Stemmed_Review'] = df['Cleaned_Review'].apply(stem_text)
df['Lemmatized_Review'] = df['Cleaned_Review'].apply(lemmatize_text)

# Save the cleaned data to a new CSV file
df.to_csv('cleaned_imdb_reviews.csv', index=False)

# Display the first few rows of the cleaned data
df[['Review', 'Cleaned_Review', 'Stemmed_Review', 'Lemmatized_Review']].head()





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | Review | Cleaned_Review | Stemmed_Review | Lemmatized_Review |
|---|---|---|---|---|
| 0 | Wow. The Penguin is just terrific. Everyone kn... | wow penguin terrific everyone knows great acto... | wow penguin terrif everyon know great actor co... | wow penguin terrific everyone know great actor... |
| 1 | Absolutely nailed the tone and atmosphere. Gri... | absolutely nailed tone atmosphere gritty grimy... | absolut nail tone atmospher gritti grimi dark ... | absolutely nailed tone atmosphere gritty grimy... |
| 2 | The first episode is a direct continuation of ... | first episode direct continuation matt reeves ... | first episod direct continu matt reev batman i... | first episode direct continuation matt reef ba... |
| 3 | The Batman, for my taste had simply the greate... | batman taste simply greatest portrayal gotham ... | batman tast simpli greatest portray gotham cit... | batman taste simply greatest portrayal gotham ... |
| 4 | I've been waiting for The Penguin ever since i... | ive waiting penguin ever since first announced... | ive wait penguin ever sinc first announc go se... | ive waiting penguin ever since first announced... |
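
The cleaning code above deletes punctuation outright, so words separated only by punctuation (for example `case...but`) are fused into a single token. A minimal sketch of steps (1) to (4) that substitutes a space instead is shown below. It uses only the standard library; the five-word `STOPWORDS` set and the `clean_text` helper are hypothetical stand-ins for illustration, and a real run would use a full stopword list such as NLTK's.

```python
import re

# Tiny illustrative stopword set (an assumption, not NLTK's list).
STOPWORDS = {"the", "but", "in", "i", "on"}


def clean_text(text: str) -> str:
    # Steps 1 and 2: replace every non-letter, non-whitespace character
    # (punctuation, digits, symbols) with a space so neighbours stay separate.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Step 4: lowercase before the stopword lookup.
    text = text.lower()
    # Step 3: drop stopwords; split() also collapses repeated whitespace.
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


sample = "Solves the case...but one day, in 2024, I switched on the TV!"
cleaned = clean_text(sample)
print(cleaned)  # solves case one day switched tv
```

Lowercasing before the stopword check also removes the need for a separate `.lower()` call inside the filter.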


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
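
To make the difference between the two tree types in part (2) concrete, the sketch below encodes both analyses of one sentence as plain Python data. Both structures are hand-written for illustration and are not the output of any parser; the phrase labels follow Penn Treebank conventions and the relation names follow spaCy-style labels, both of which are assumptions here. The helper names `bracketed` and `leaves` are hypothetical.

```python
# Constituency tree: nested (label, child, ...) tuples; a leaf is (POS, word).
constituency = (
    "S",
    ("NP", ("DT", "This"), ("NN", "movie")),
    ("VP", ("VBZ", "is"), ("ADJP", ("RB", "very"), ("JJ", "interesting"))),
    (".", "."),
)

# Dependency tree: one (word, relation, head) triple per token.
# The root token is listed as its own head.
dependency = [
    ("This", "det", "movie"),
    ("movie", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("very", "advmod", "interesting"),
    ("interesting", "acomp", "is"),
    (".", "punct", "is"),
]


def bracketed(node) -> str:
    """Render a constituency node in bracketed notation."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(bracketed(child) for child in children) + ")"


def leaves(node) -> list:
    """Collect the words of a constituency node from left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]


print(bracketed(constituency))
for word, relation, head in dependency:
    print(f"{word} --{relation}--> {head}")
```

The constituency tree groups words into nested phrases (`This movie` forms an NP), while the dependency tree has no phrase nodes and instead links each word directly to the word it depends on.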

In [None]:
# Your code here
# Libraries
!pip install spacy
!pip install nltk
import spacy
import pandas as pd
from collections import Counter
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

#  NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load spaCy model for parsing and NER
nlp = spacy.load("en_core_web_sm")

# Load the cleaned text data
df = pd.read_csv('cleaned_imdb_reviews.csv')

# Use all non-empty lemmatized reviews for analysis
texts = df['Lemmatized_Review'].dropna().tolist()

# (1) Parts of Speech (POS) Tagging and Counting
def pos_tagging_and_count(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)

    # Counting Nouns, Verbs, Adjectives, and Adverbs by Penn Treebank tag prefix
    noun_count = sum(1 for word, tag in tagged if tag.startswith('NN'))
    verb_count = sum(1 for word, tag in tagged if tag.startswith('VB'))
    adj_count = sum(1 for word, tag in tagged if tag.startswith('JJ'))
    adv_count = sum(1 for word, tag in tagged if tag.startswith('RB'))

    return tagged, noun_count, verb_count, adj_count, adv_count

# Applying POS tagging to the first review
sample_review = texts[0]
pos_tags, noun_count, verb_count, adj_count, adv_count = pos_tagging_and_count(sample_review)

print(f"POS Tags for Sample Review: {pos_tags}")
print(f"Noun Count: {noun_count}, Verb Count: {verb_count}, Adjective Count: {adj_count}, Adverb Count: {adv_count}")

# (2) Constituency Parsing and Dependency Parsing
def print_parsing_analysis(text):
    doc = nlp(text)

    # Dependency Parsing Tree
    print("\nDependency Parsing Tree:")
    for token in doc:
        print(f"{token.text} -> {token.dep_} (head: {token.head.text})")

    # spaCy has no built-in constituency parser. The token -> head pairs below are
    # dependency links, so they only approximate how words group into phrases.
    print("\nMimicked Constituency Parsing Tree (token -> head relation):")
    for token in doc:
        print(f"{token.text} -> {token.head.text}")

# Applying parsing to the same review
print_parsing_analysis(sample_review)

# Explanation of Parsing (Using one sentence as a sample):
sample_sentence = "This movie is very interesting."
doc_example = nlp(sample_sentence)

print("\nExplanation of Dependency and Constituency Parsing:")
for token in doc_example:
    print(f"Token: {token.text}, Dependency: {token.dep_}, Head: {token.head.text}")

# Constituency Parsing: Breaks the sentence into nested sub-phrases (noun phrase, verb phrase).
# Dependency Parsing: Shows the syntactic structure where each word is linked to its head (governing word).

# (3) Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Extract Named Entities from all reviews
all_entities = []
for review in texts:
    entities = named_entity_recognition(review)
    all_entities.extend(entities)

# Count occurrences of each entity type
entity_counter = Counter([entity[1] for entity in all_entities])
print(f"\nNamed Entity Count: {entity_counter}")

# Save the named entities and their types to a CSV file
entities_df = pd.DataFrame(all_entities, columns=['Entity', 'Type'])
entities_df.to_csv('imdb_named_entities.csv', index=False)

# Display the first few named entities
print("\nNamed Entities from Reviews:")
print(entities_df.head())





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


POS Tags for Sample Review: [('never', 'RB'), ('paid', 'VBN'), ('much', 'JJ'), ('attention', 'NN'), ('house', 'NN'), ('md', 'NN'), ('first', 'RB'), ('premiered', 'VBD'), ('heard', 'JJ'), ('couple', 'NN'), ('people', 'NNS'), ('basically', 'RB'), ('thing', 'NN'), ('every', 'DT'), ('episode', 'NN'), ('impossible', 'JJ'), ('disease', 'NN'), ('diagnose', 'JJ'), ('house', 'NN'), ('mess', 'NN'), ('team', 'NN'), ('house', 'NN'), ('suddenly', 'RB'), ('solves', 'VBZ'), ('casebut', 'JJ'), ('one', 'CD'), ('day', 'NN'), ('bored', 'VBD'), ('switched', 'JJ'), ('tv', 'NN'), ('house', 'NN'), ('happened', 'VBD'), ('season', 'NN'), ('finale', 'NN'), ('titled', 'VBD'), ('help', 'NN'), ('going', 'VBG'), ('honest', 'JJS'), ('blew', 'VB'), ('away', 'RB'), ('know', 'JJ'), ('happening', 'VBG'), ('character', 'NN'), ('point', 'NN'), ('story', 'NN'), ('acting', 'VBG'), ('fantastic', 'JJ'), ('atmosphere', 'RB'), ('superb', 'JJ'), ('complexity', 'NN'), ('dr', 'NN'), ('gregory', 'NN'), ('house', 'NN'), ('intrigued'
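
The output above covers only the first review, whereas part (1) asks for totals over the text. A minimal sketch of aggregating the four classes across several tagged reviews follows. It assumes each review has already been tagged into `(word, tag)` pairs with Penn Treebank tags, as `pos_tag` returns; the helper name `count_pos_classes` is hypothetical, and the sample pairs are copied from the output above, split into two short lists purely for illustration.

```python
from collections import Counter

# Map two-letter Penn Treebank tag prefixes to the classes in the question.
PREFIX_TO_CLASS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def count_pos_classes(tagged_reviews):
    """Sum noun, verb, adjective, and adverb counts over all tagged reviews."""
    totals = Counter()
    for tagged in tagged_reviews:
        for _word, tag in tagged:
            pos_class = PREFIX_TO_CLASS.get(tag[:2])
            if pos_class is not None:
                totals[pos_class] += 1
    return totals


tagged_reviews = [
    [("never", "RB"), ("paid", "VBN"), ("much", "JJ"), ("attention", "NN")],
    [("couple", "NN"), ("people", "NNS"), ("basically", "RB"), ("every", "DT")],
]
totals = count_pos_classes(tagged_reviews)
print(dict(totals))
```

Tags outside the four classes, such as `DT`, are ignored.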

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
from google.colab import files
files.download('cleaned_imdb_reviews.csv')


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
'''
The assignment was demanding, particularly the first part on data gathering. The number of reviews collected was inconsistent
each time I ran the code: sometimes it returned 0 reviews, other times 25 or even 50. This was likely because the
website, or possibly my code, did not allow scraping to continue for long. I feel that some of the suggested websites,
such as Capterra and G2, had issues with allowing scraping. All in all, it was a very challenging assignment
but worth it. The time provided was also not quite enough for much exploration.
'''
