# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
!pip install requests beautifulsoup4 pandas




In [35]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_movie_reviews(movie_id, max_reviews=1000):
    reviews_list = []
    headers = {'User-Agent': 'Mozilla/5.0'}
    page_number = 1
    review_count = 0

    while review_count < max_reviews:
        # IMDb user review page pattern
        url = f'https://www.imdb.com/title/tt{movie_id}/reviews?ref_=nmawd_awd_1&start={page_number * 10}'
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            print(f"Failed to retrieve reviews from {url}")
            break
        soup = BeautifulSoup(response.content, 'html.parser')

        # Parsing review content
        review_blocks = soup.find_all('div', class_='review-container')

        for review_block in review_blocks:
            review_data = {}

            # Get the review title
            title_tag = review_block.find('a', class_='title')
            review_data['title'] = title_tag.text.strip() if title_tag else 'No Title'

            # Get the rating (guard against a missing nested <span> as well as a missing tag)
            rating_tag = review_block.find('span', class_='rating-other-user-rating')
            review_data['rating'] = rating_tag.span.text.strip() if rating_tag and rating_tag.span else 'No Rating'

            # Get the full review text
            review_text_tag = review_block.find('div', class_='text show-more__control')
            review_data['review'] = review_text_tag.text.strip() if review_text_tag else 'No Review Text'

            # Add to the list
            reviews_list.append(review_data)
            review_count += 1

            if review_count >= max_reviews:
                break

        # Check if there are more pages to scrape
        if len(review_blocks) < 10:  # If less than 10 reviews were found, stop scraping
            break

        page_number += 1

    return reviews_list

def save_reviews_to_csv(reviews, movie_name):
    df = pd.DataFrame(reviews)
    df.to_csv(f'{movie_name}_imdb_reviews.csv', index=False)
    print(f"Saved {len(reviews)} reviews to {movie_name}_imdb_reviews.csv")

if __name__ == "__main__":
    # Movie IDs for the selected movies
    movies = [
        {"title": "Bahubali", "id": "4849438"},
        {"title": "RRR", "id": "8178634"},
        {"title": "Salaar", "id": "13927994"}
    ]

    for movie in movies:
        print(f"Collecting reviews for: {movie['title']}")
        reviews = get_movie_reviews(movie['id'])
        save_reviews_to_csv(reviews, movie['title'])


Collecting reviews for: Bahubali
Saved 50 reviews to Bahubali_imdb_reviews.csv
Collecting reviews for: RRR
Saved 100 reviews to RRR_imdb_reviews.csv
Collecting reviews for: Salaar
Saved 50 reviews to Salaar_imdb_reviews.csv


In [36]:
import pandas as pd

# Define the file names of the CSV files to be combined
files = [
    'Bahubali_imdb_reviews.csv',
    'RRR_imdb_reviews.csv',
    'Salaar_imdb_reviews.csv'
]

# Initialize an empty list to hold the DataFrames
dataframes = []

# Loop through each file and read it into a DataFrame
for file in files:
    try:
        # Read the CSV file into a DataFrame
        df = pd.read_csv(file)
        # Add a column for the movie name extracted from the file name
        df['movie'] = file.split('_')[0]  # Extracting movie name from the file name
        # Append the DataFrame to the list
        dataframes.append(df)
    except FileNotFoundError:
        print(f"File {file} not found. Please check the file path.")

# Concatenate all DataFrames into a single DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

# Save the combined DataFrame to a new CSV file
combined_df.to_csv('combined_imdb_reviews.csv', index=False)

print(f"Combined reviews saved to 'combined_imdb_reviews.csv' with {len(combined_df)} reviews.")


Combined reviews saved to 'combined_imdb_reviews.csv' with 200 reviews.
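
Because the scraper pages through results by offset, the same review could in principle be collected more than once (for example, if the site ignores the offset parameter). A minimal sketch of a de-duplication pass before saving, assuming each review is a dict with the `title` and `review` keys used above; the helper name `dedupe_reviews` and the sample rows are hypothetical:

```python
def dedupe_reviews(reviews):
    """Return reviews with exact (title, review) duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for item in reviews:
        key = (item.get("title"), item.get("review"))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

# Made-up rows for illustration only; they are not taken from the scraped data.
sample_reviews = [
    {"title": "Epic", "rating": "9", "review": "Loved the battle scenes."},
    {"title": "Too long", "rating": "5", "review": "Dragged in the middle."},
    {"title": "Epic", "rating": "9", "review": "Loved the battle scenes."},
]
unique_reviews = dedupe_reviews(sample_reviews)
print(f"{len(sample_reviews)} collected, {len(unique_reviews)} unique")
```

On the combined DataFrame, `combined_df.drop_duplicates(subset=['title', 'review'])` would have the same effect.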


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [37]:
!pip install pandas nltk






In [41]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file containing the reviews
df = pd.read_csv('combined_imdb_reviews.csv')

# Ensure the review column exists
if 'review' not in df.columns:
    raise ValueError("The 'review' column was not found in the CSV file.")

# Initialize objects for stemming and lemmatization
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define stopwords
stop_words = set(stopwords.words('english'))

# Function to clean the text data
def clean_text(text):
    # (1) Remove noise, such as special characters and punctuations
    text = re.sub(r'[^A-Za-z\s]', '', text)

    # (2) Remove numbers (the pattern above already drops digits; kept as an explicit step)
    text = re.sub(r'\d+', '', text)

    # (3) Tokenize and remove stopwords
    words = word_tokenize(text)
    words = [word for word in words if word.lower() not in stop_words]

    # (4) Lowercase all text
    words = [word.lower() for word in words]

    # (5) Stemming
    words_stemmed = [ps.stem(word) for word in words]

    # (6) Lemmatization (applied to the unstemmed tokens, for comparison with stemming)
    words_lemmatized = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words), ' '.join(words_stemmed), ' '.join(words_lemmatized)

# Apply the clean_text function to the review column and save to new columns
df['clean_text'], df['stemmed_text'], df['lemmatized_text'] = zip(*df['review'].apply(clean_text))

# Save the cleaned data to a new CSV file
df.to_csv('combined_imdb_reviews_cleaned.csv', index=False)

print("Cleaned data has been saved to 'combined_imdb_reviews_cleaned.csv'")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data has been saved to 'combined_imdb_reviews_cleaned.csv'
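
One consequence of the step order in `clean_text`: punctuation is stripped before stopwords are filtered, so a contraction such as "don't" becomes "dont" and no longer matches its stopword entry. A minimal sketch of the effect and of one possible reordering, assuming a tiny hypothetical stopword set in place of NLTK's list and plain whitespace splitting in place of `word_tokenize`:

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"i", "the", "don't"}

def clean_punct_first(text):
    """Mirror the notebook's order: strip punctuation, then filter stopwords."""
    text = re.sub(r"[^A-Za-z\s]", "", text)
    return [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]

def clean_stopwords_first(text):
    """Alternative order: filter stopwords on raw tokens, then strip punctuation."""
    kept = [w for w in text.lower().split() if w.strip(".,!?;:\"'") not in STOP_WORDS]
    cleaned = [re.sub(r"[^a-z]", "", w) for w in kept]
    return [w for w in cleaned if w]

sample = "I don't like the acting."
print(clean_punct_first(sample))      # the contraction survives as 'dont'
print(clean_stopwords_first(sample))  # the contraction is filtered out
```

Whether a given contraction is caught this way depends on the stopword list and tokenizer actually used.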


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
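
The two parse types in (2) differ in what they record: a constituency tree groups words into nested phrases, while a dependency parse links each word to a head word with a labeled relation. A minimal sketch of both as plain Python data, assuming a hand-written analysis of a made-up sentence (not parser output), with Penn Treebank phrase labels and relation names in the style spaCy prints:

```python
# Constituency tree as nested tuples: (label, child, child, ...); leaves are plain strings.
constituency = (
    "S",
    ("NP", ("DT", "the"), ("NN", "hero")),
    ("VP", ("VBZ", "saves"), ("NP", ("DT", "the"), ("NN", "queen"))),
)

def to_brackets(node):
    """Render a nested-tuple tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"

# Dependency parse: one (word, relation, head_index) per token; head -1 marks the root.
tokens = ["the", "hero", "saves", "the", "queen"]
dependencies = [
    ("the", "det", 1),
    ("hero", "nsubj", 2),
    ("saves", "ROOT", -1),
    ("the", "det", 4),
    ("queen", "dobj", 2),
]

print(to_brackets(constituency))
for word, rel, head in dependencies:
    head_word = tokens[head] if head >= 0 else "ROOT"
    print(f"{word} --{rel}--> {head_word}")
```

The constituency view has internal nodes (NP, VP) that are not words, whereas every node in the dependency view is a word of the sentence.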

In [42]:
# Your code here
!pip install nltk spacy
!python -m spacy download en_core_web_sm



Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 79.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [43]:
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk import pos_tag, word_tokenize

# Load the spaCy model for parsing and named entity recognition
nlp = spacy.load("en_core_web_sm")

# Load the cleaned data
df = pd.read_csv("combined_imdb_reviews_cleaned.csv")

# Ensure the clean text column exists
if 'clean_text' not in df.columns:
    raise ValueError("The 'clean_text' column was not found in the CSV file.")

# Initialize NLTK POS tagger
nltk.download('averaged_perceptron_tagger')

# Function to conduct POS tagging and count specific POS tags
def pos_tagging_and_count(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)

    # Count nouns, verbs, adjectives, and adverbs
    pos_count = Counter([tag for word, tag in pos_tags])

    # Calculate total for Nouns (NN, NNS, NNP, NNPS), Verbs (VB, VBD, VBG, etc.)
    noun_count = sum(pos_count[tag] for tag in ['NN', 'NNS', 'NNP', 'NNPS'])
    verb_count = sum(pos_count[tag] for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    adj_count = sum(pos_count[tag] for tag in ['JJ', 'JJR', 'JJS'])
    adv_count = sum(pos_count[tag] for tag in ['RB', 'RBR', 'RBS'])

    return pos_tags, noun_count, verb_count, adj_count, adv_count

# Function to perform constituency parsing and dependency parsing
def parse_sentence(sentence):
    doc = nlp(sentence)

    # Dependency parsing
    print("\nDependency Parsing Tree:")
    for token in doc:
        print(f"{token.text} -> {token.dep_} (Head: {token.head.text})")

    # Constituency parsing: spaCy has no built-in constituency parser, so only the
    # dependency parse is printed here. An external tool such as the Berkeley Neural
    # Parser would be needed to produce constituency trees.

# Function to perform Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # Count entities
    entity_count = Counter([ent.label_ for ent in doc.ents])

    return entities, entity_count

# Iterate through the reviews and perform analysis on the first review for demo purposes
for i, row in df.iterrows():
    clean_text = row['clean_text']

    print(f"\n### Review {i+1} ###")
    print(f"Clean Text: {clean_text}")

    # (1) POS Tagging
    pos_tags, noun_count, verb_count, adj_count, adv_count = pos_tagging_and_count(clean_text)
    print("\nPOS Tagging:")
    print(pos_tags)
    print(f"\nNoun Count: {noun_count}, Verb Count: {verb_count}, Adjective Count: {adj_count}, Adverb Count: {adv_count}")

    # (2) Dependency Parsing and Constituency Parsing
    print("\nConstituency and Dependency Parsing:")
    parse_sentence(clean_text)

    # (3) Named Entity Recognition (NER)
    entities, entity_count = named_entity_recognition(clean_text)
    print("\nNamed Entity Recognition:")
    print(entities)
    print(f"Entity Counts: {entity_count}")

    # For demo purposes, we are analyzing only the first review
    break


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



### Review 1 ###
Clean Text: baahubali continues epic silliness first baahubali even bigger battles dance scenes menacing looks muscles surprise reveal end part explained circle completed transition hero role amarendra baahubali son mahendra baahubali played prabhas infant saved murderous uncle queen seen prologue first film like antecedent film swings wildly almost slapstick comedy eg baahubali passes blockhead get near princess devasena bloody cartoonish violence including multiple impalements immolations decapitations buzzsaw chariot back equipped sort volleygun arrow launcher tactical highlight palmtree catapults launch teams soldiers use shields form armored balls midair even wile e coyote would dismissed strategy improbable laws physics outrageously ignored throughout action scenes particularly egregious fun example surprisingly film cgi good places times looking like videogame epic liveaction film watched subtitled version cant really comment acting leads still look sound part 
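
The loop above stops after the first review, whereas the question asks for total counts. A minimal sketch of accumulating coarse POS totals across several reviews, assuming the tokens have already been tagged with Penn Treebank tags (the tag set `nltk.pos_tag` returns by default); the tagged sample and the names `PREFIX_TO_CATEGORY` and `coarse_pos_counts` are illustrative:

```python
from collections import Counter

# Two-letter Penn Treebank tag prefixes mapped to coarse categories.
PREFIX_TO_CATEGORY = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_pos_counts(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in a list of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category:
            counts[category] += 1
    return counts

# Hand-tagged sample for illustration only (not tagger output).
tagged_reviews = [
    [("film", "NN"), ("looks", "VBZ"), ("really", "RB"), ("good", "JJ")],
    [("battles", "NNS"), ("were", "VBD"), ("bigger", "JJR")],
]

totals = Counter()
for tagged in tagged_reviews:
    totals.update(coarse_pos_counts(tagged))  # add this review's counts to the running totals
print(dict(totals))
```

The same `Counter.update` pattern can accumulate the per-review `entity_count` values returned by `named_entity_recognition`.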

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [6]:
'/content/combined_imdb_reviews_cleaned.csv'

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [45]:
# Write your response below
'''
I found it quite interesting to do this assignment while scraping. I enjoyed cleaning the data and downloading the CSV file. I found it challenging to write the code.
We need more time for these assignments.'''

'\nI found it quite interesting to do this assignment while scraping. I enjoyed cleaning the data and downloading the CSV file. I found it challenging to write the code.\nWe need more time for these assignments.'