<a href="https://colab.research.google.com/github/Sahithi530/Sahithi_INFO5731_Fall2024/blob/main/Tummala_Sahithi_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from datetime import datetime

# Fetch reviews from IMDb
def fetch_imdb_reviews(movie_id, max_reviews):
    reviews = []
    page_number = 1
    while len(reviews) < max_reviews:
        url = f"https://www.imdb.com/title/{movie_id}/reviews?ref_=tt_ql_3&start={page_number*10+1}&count=10"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
        }
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        review_elements = soup.find_all('div', class_='review-container')

        for review_element in review_elements:
            review_text_element = review_element.find('div', class_='text show-more__control')
            if review_text_element is None:
                continue
            review_text = review_text_element.get_text(strip=True)
            date_element = review_element.find('div', class_='display-name-date')
            if date_element is None:
                continue

            date_span = date_element.find_all('span')
            if not date_span or len(date_span) < 2:
                continue
            date_text = date_span[1].get_text(strip=True)
            try:
                review_date = datetime.strptime(date_text, "%d %B %Y")
            except ValueError:
                continue

            if review_date.year == 2024:
                # %m is the month number; %M would insert the minutes instead
                reviews.append((review_text, review_date.strftime("%m-%d-%Y")))

            if len(reviews) >= max_reviews:
                break

        print(f"Page {page_number}: {len(reviews)} reviews collected.")
        if len(reviews) >= max_reviews:
            break

        page_number += 1
        time.sleep(2)

    return reviews[:max_reviews]

# Main execution
if __name__ == "__main__":
    kalki_movie_id = "tt12735488"
    total_reviews_needed = 1000
    reviews = fetch_imdb_reviews(kalki_movie_id, total_reviews_needed)
    reviews = reviews[:total_reviews_needed]

    # Save reviews to CSV
    df = pd.DataFrame(reviews, columns=['Review', 'Date'])
    df.to_csv('reviews.csv', index=False)
    print(f"Saved {len(reviews)} reviews to reviews.csv.")

    # Download the CSV
    from google.colab import files
    files.download('reviews.csv')


Page 1: 25 reviews collected.
Page 2: 25 reviews collected.
Page 3: 25 reviews collected.
Page 4: 50 reviews collected.
Page 5: 50 reviews collected.
Page 6: 50 reviews collected.
Page 7: 50 reviews collected.
Page 8: 75 reviews collected.
Page 9: 100 reviews collected.
Page 10: 125 reviews collected.
Page 11: 125 reviews collected.
Page 12: 150 reviews collected.
Page 13: 175 reviews collected.
Page 14: 175 reviews collected.
Page 15: 175 reviews collected.
Page 16: 175 reviews collected.
Page 17: 200 reviews collected.
Page 18: 200 reviews collected.
Page 19: 200 reviews collected.
Page 20: 200 reviews collected.
Page 21: 225 reviews collected.
Page 22: 225 reviews collected.
Page 23: 250 reviews collected.
Page 24: 250 reviews collected.
Page 25: 275 reviews collected.
Page 26: 300 reviews collected.
Page 27: 325 reviews collected.
Page 28: 350 reviews collected.
Page 29: 375 reviews collected.
Page 30: 400 reviews collected.
Page 31: 400 reviews collected.
Page 32: 425 reviews coll
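The scraper parses each review date with `strptime` and writes it back out with `strftime`. As a minimal sketch of how the format directives behave (the sample date string and the helper name `reformat_review_date` are made up for illustration), note that `%m` is the zero-padded month while `%M` is the minute:

```python
from datetime import datetime

def reformat_review_date(date_text: str) -> str:
    """Parse a date such as '27 June 2024' and return it as MM-DD-YYYY."""
    parsed = datetime.strptime(date_text, "%d %B %Y")  # %B = full month name
    return parsed.strftime("%m-%d-%Y")                 # %m = month, %d = day

sample = "27 June 2024"  # hypothetical input, not taken from the scraped data
print(reformat_review_date(sample))

# With %M the first field is the minute of the parsed time (midnight), not the month.
print(datetime.strptime(sample, "%d %B %Y").strftime("%M-%d-%Y"))
```

Parsing with `%B` depends on the active locale, so this sketch assumes English month names.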


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [9]:

import pandas as pd
import re

def clean_text(text):
    """Cleans the text data by removing noise like special characters and punctuation.

    Args:
        text (str): The text to be cleaned.

    Returns:
        str: The cleaned text.
    """

    # Remove every character that is not an ASCII letter, digit, or whitespace
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)


    return cleaned_text

# Loading the CSV file
df = pd.read_csv('reviews.csv')

# Clean the 'Review' column by using the clean function and make a new 'Cleaned Review' column.
df['Cleaned Review'] = df['Review'].apply(clean_text)

# Save the DataFrame to a new CSV file
df.to_csv('kalki_reviews_cleaned.csv', index=False)

print("Cleaned data saved to kalki_reviews_cleaned.csv")


Cleaned data saved to kalki_reviews_cleaned.csv
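To make the effect of the regular expression concrete, here is a small sketch on a made-up sentence. Because the pattern deletes punctuation outright, words separated only by punctuation are joined together. The second function, `strip_noise_keep_gaps`, is an assumed alternative and not the method used in this notebook: it replaces noise with a space instead.

```python
import re

def strip_noise(text: str) -> str:
    """Delete every character that is not an ASCII letter, digit, or whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

def strip_noise_keep_gaps(text: str) -> str:
    """Assumed variant: replace noise with a space so that words separated
    only by punctuation do not fuse together, then collapse whitespace."""
    spaced = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()

sample = "unnecessary subplots.The props... 10/10!"  # made-up example text
print(strip_noise(sample))            # punctuation deleted, neighbours fused
print(strip_noise_keep_gaps(sample))  # punctuation turned into word gaps
```

Which variant is preferable depends on how the source text is punctuated; the second one changes token boundaries, so later steps would see different words.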


In [10]:
import pandas as pd
import re

def remove_numbers(text):
    """Remove numbers from the text."""
    text = re.sub(r'\d+', '', text)  # Removes all numbers
    text = re.sub(r'\s+', ' ', text)  # Normalizing whitespace
    return text.strip()  # Removing leading and trailing whitespace

# Load the cleaned CSV file
input_filename = 'kalki_reviews_cleaned.csv'  # Adjust the path if needed
df = pd.read_csv(input_filename)

# Print the columns to verify
print("Columns in DataFrame:", df.columns.tolist())

# Check if 'Cleaned Review' exists in the DataFrame
if 'Cleaned Review' in df.columns:
    # Remove numbers and store the result in a new column
    df['Final_Cleaned_Review'] = df['Cleaned Review'].apply(remove_numbers)

    # Save the final data to a new CSV file
    output_filename = 'kalki_reviews_final_cleaned.csv'
    df.to_csv(output_filename, index=False)

    print(f"Final cleaned data saved to {output_filename}.")
else:
    print("Column 'Cleaned Review' not found. Please check the column names.")


Columns in DataFrame: ['Review', 'Date', 'Cleaned Review']
Final cleaned data saved to kalki_reviews_final_cleaned.csv.


In [11]:
import pandas as pd

#input the CSV file
input_filename = 'kalki_reviews_final_cleaned.csv'  # Adjust the path if needed
df = pd.read_csv(input_filename)

# Print the columns
print("Columns in DataFrame:", df.columns.tolist())

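# Keep only the 'Cleaned Review' column. Note: this drops 'Final_Cleaned_Review',
# so the number-free text from the previous cell is not carried forward.
cleaned_review_df = df[['Cleaned Review']]  # Double brackets keep it as a DataFrame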

# Save data to a CSV file
output_filename = 'cleaned_reviews_only.csv'
cleaned_review_df.to_csv(output_filename, index=False)

print(f"Cleaned Review column saved to {output_filename}.")


Columns in DataFrame: ['Review', 'Date', 'Cleaned Review', 'Final_Cleaned_Review']
Cleaned Review column saved to cleaned_reviews_only.csv.


In [12]:
import pandas as pd

# Sample list of stopwords
stopwords = set([
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
    'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him',
    'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its',
    'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
    'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am',
    'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
    'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',
    'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
    'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to',
    'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
    'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
    'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same',
    'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
    'should', 'now'
])

# input the csv
input_filename = 'cleaned_reviews_only.csv'  # Adjust the path if needed
df = pd.read_csv(input_filename)

# Print the columns
print("Columns in DataFrame:", df.columns.tolist())

# Remove stopwords
df['Reviews_No_Stopwords'] = df['Cleaned Review'].apply(
    lambda text: ' '.join(word for word in text.split() if word.lower() not in stopwords)
)

# Save the data to a new CSV
output_filename = 'kalki_reviews_no_stopwords.csv'
df.to_csv(output_filename, index=False)

print(f"Cleaned reviews without stopwords saved to {output_filename}.")


Columns in DataFrame: ['Cleaned Review']
Cleaned reviews without stopwords saved to kalki_reviews_no_stopwords.csv.


In [13]:
import pandas as pd

# input CSV file containing reviews without stopwords
input_filename = 'kalki_reviews_no_stopwords.csv'  # Adjust the path if needed
df = pd.read_csv(input_filename)

# Print the columns
print("Columns in DataFrame:", df.columns.tolist())

# Lowercase all text
df['Lowercase_Review'] = df['Reviews_No_Stopwords'].str.lower()

# Save the final data with lowercased
output_filename = 'kalki_reviews_lowercase.csv'
df.to_csv(output_filename, index=False)

print(f"Lowercased reviews saved to {output_filename}.")


Columns in DataFrame: ['Cleaned Review', 'Reviews_No_Stopwords']
Lowercased reviews saved to kalki_reviews_lowercase.csv.


In [14]:
import pandas as pd
from nltk.stem import PorterStemmer
import nltk

# NLTK resources are downloaded
nltk.download('punkt')

# Load the CSV file containing lowercase reviews
input_filename = 'kalki_reviews_lowercase.csv'  # Adjust the path if needed
df = pd.read_csv(input_filename)

# verify
print("Columns in DataFrame:", df.columns.tolist())

# Initialize the Porter stemmer
stemmer = PorterStemmer()

# Function to stem the words in the column
def stem_review(text):
    words = text.split()  # Split the text into words
    stemmed_words = [stemmer.stem(word) for word in words]  # Stem each word
    return ' '.join(stemmed_words)  # Join the stemmed words back into a single string

# Apply stemming to the 'Lowercase_Review' column
df['Stemmed_Review'] = df['Lowercase_Review'].apply(stem_review)

# Save the data with the new 'Stemmed_Review' column to a CSV file
output_filename = 'kalki_reviews_stemmed.csv'
df.to_csv(output_filename, index=False)

print(f"Stemmed reviews saved to {output_filename}.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Columns in DataFrame: ['Cleaned Review', 'Reviews_No_Stopwords', 'Lowercase_Review']
Stemmed reviews saved to kalki_reviews_stemmed.csv.


In [15]:
import pandas as pd
from nltk.stem import WordNetLemmatizer
import nltk

# Ensure NLTK resources are downloaded
nltk.download('wordnet')
nltk.download('punkt')

# Load the CSV file containing the stemmed reviews
input_filename = 'kalki_reviews_stemmed.csv'  # Adjust the path if needed
df = pd.read_csv(input_filename)

# Print the columns to verify
print("Columns in DataFrame:", df.columns.tolist())

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize the words in the text
def lemmatize_review(text):
    words = text.split()  # Split the text into words
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]  # Lemmatize each word
    return ' '.join(lemmatized_words)  # Join the lemmatized words back into a single string

# Apply lemmatization to the 'Stemmed_Review' column (the stemmed text, not the lowercase text)
df['Lemmatized_Review'] = df['Stemmed_Review'].apply(lemmatize_review)

# Save the final data with lemmatized reviews to a new CSV file
output_filename = 'kalki_reviews_lemmatized.csv'
df.to_csv(output_filename, index=False)

print(f"Lemmatized reviews saved to {output_filename}.")




[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Columns in DataFrame: ['Cleaned Review', 'Reviews_No_Stopwords', 'Lowercase_Review', 'Stemmed_Review']
Lemmatized reviews saved to kalki_reviews_lemmatized.csv.
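The cells above each read the previous cell's CSV and add one column. As a rough sketch of the same idea in a single function, steps (1) to (4) can be chained on one string. The `STOPWORDS` set and the sample sentence are placeholders, and the text is lowercased before the stopword filter, which is a small reordering of the steps. Stemming and lemmatization, which need NLTK, would then be applied to the returned string.

```python
import re

# Tiny illustrative stopword set; the notebook uses a much longer list.
STOPWORDS = frozenset({"the", "in", "were", "a", "is", "and", "of"})

def clean_review(text: str, stopwords=STOPWORDS) -> str:
    """Apply cleaning steps (1)-(4) and return the cleaned string."""
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # (1) special characters, punctuation
    text = re.sub(r"\d+", "", text)             # (2) numbers
    text = text.lower()                         # (4) lowercase before filtering
    words = [w for w in text.split() if w not in stopwords]  # (3) stopwords
    return " ".join(words)                      # split/join also normalizes whitespace

sample = "The VFX in Kalki 2898 AD were GREAT!!!"  # made-up example text
print(clean_review(sample))
```

Keeping the steps in one function makes it harder to lose an intermediate result between files, at the cost of not having a separate column for each stage.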


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [23]:

import pandas as pd
import nltk
from collections import Counter


# Download the averaged_perceptron_tagger
nltk.download('averaged_perceptron_tagger')


# Load the file, reading only the 'Lemmatized_Review' column; the analysis is
# performed on the cleaned text as saved after lemmatization
df = pd.read_csv('kalki_reviews_lemmatized.csv', usecols=['Lemmatized_Review'])

# Display the first few rows of the DataFrame
print("Lowercase_Review")
print(df.head())

# Define the POS analysis function
def pos_analysis(text):
    # Tokenize the text
    words = nltk.word_tokenize(text)

    # Tag each token with its part of speech
    pos_tags = nltk.pos_tag(words)

    # Count occurrences of each part of speech
    pos_counts = Counter(tag for word, tag in pos_tags)

    # Define counts for Nouns, Verbs, Adjectives, and Adverbs
    noun_count = sum(pos_counts.get(tag, 0) for tag in ['NN', 'NNS', 'NNP', 'NNPS'])
    verb_count = sum(pos_counts.get(tag, 0) for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    adj_count = sum(pos_counts.get(tag, 0) for tag in ['JJ', 'JJR', 'JJS'])
    adv_count = sum(pos_counts.get(tag, 0) for tag in ['RB', 'RBR', 'RBS'])

    return noun_count, verb_count, adj_count, adv_count

# Initialize counts
total_nouns, total_verbs, total_adjectives, total_adverbs = 0, 0, 0, 0

# Apply POS analysis to each lemmatized review in the specified column
for review in df['Lemmatized_Review']:
    nouns, verbs, adjectives, adverbs = pos_analysis(review)
    total_nouns += nouns
    total_verbs += verbs
    total_adjectives += adjectives
    total_adverbs += adverbs

# Print results
print("\nTotal Counts from All Reviews:")
print(f"Nouns: {total_nouns}")
print(f"Verbs: {total_verbs}")
print(f"Adjectives: {total_adjectives}")
print(f"Adverbs: {total_adverbs}")


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Lowercase_Review
                                   Lemmatized_Review
0  soar epic second part film excel climax taint ...
1  replica star war movi supream leader look like...
2  dont understand ob hero entri peopl good stori...
3  didnt go big hope expect better adipurush way ...
4  tricki justic big stori time limit regular mov...

Total Counts from All Reviews:
Nouns: 70120
Verbs: 13400
Adjectives: 25360
Adverbs: 3920
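The four totals come from grouping Penn Treebank tags into coarse classes. As a minimal sketch of that grouping, the tags can be matched by their two-letter prefix, which covers the same tag lists used in `pos_analysis`. The tagged sample below is written by hand for illustration and is not output from `nltk.pos_tag`; the names `PREFIX_TO_CLASS` and `coarse_pos_counts` are hypothetical.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four classes counted above.
PREFIX_TO_CLASS = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_pos_counts(tagged_words):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_words:
        pos_class = PREFIX_TO_CLASS.get(tag[:2])  # e.g. 'NNS' -> 'NN' -> 'noun'
        if pos_class is not None:
            counts[pos_class] += 1
    return counts

# Hand-tagged sample for illustration only.
tagged = [("movie", "NN"), ("looks", "VBZ"), ("really", "RB"),
          ("great", "JJ"), ("visuals", "NNS"), ("the", "DT")]
print(coarse_pos_counts(tagged))
```

Tags outside the four classes, such as `DT` or `WRB`, are ignored because their two-letter prefix is not in the mapping.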


In [24]:
import spacy
import pandas as pd
from nltk import pos_tag, word_tokenize, RegexpParser

nlp = spacy.load('en_core_web_sm')

# This CSV file contains every cleaning stage, including the lemmatized text
df = pd.read_csv('kalki_reviews_lemmatized.csv')  # Adjust the filename as necessary

# Check the column names to select the correct one
print("Columns in the DataFrame:", df.columns)

# Example sentence
example_sentence = "Don't understand obsession; hero entry people good story."

# Function for dependency parsing (spaCy)
def dependency_parse(sentence):
    doc = nlp(sentence)
    # Collecting the dependency parse output
    parse_output = []
    for token in doc:
        parse_output.append(f"{token.text} --> {token.dep_} --> {token.head.text}")
    return "\n".join(parse_output)

# Function for constituency parsing (regex-based chunking with NLTK's RegexpParser)
def constituency_parse(sentence):
    tokens = word_tokenize(sentence)
    tagged = pos_tag(tokens)

    # Define a simple grammar for parsing
    grammar = """
        NP: {<DT>?<JJ>*<NN.*>+}   # Noun Phrase
        VP: {<VB.*><NP|PP|CLAUSE>+$}  # Verb Phrase
        PP: {<IN><NP>}   # Prepositional Phrase
        CLAUSE: {<NP><VP>}  # Clause
    """
    cp = RegexpParser(grammar)
    tree = cp.parse(tagged)

    # output
    return str(tree)

# Create new columns for dependency and constituency parsing results
df['Dependency_Parse'] = ""
df['Constituency_Parse'] = ""

# Iterate through each review in the DataFrame
for index, row in df.iterrows():
    # Parse the lemmatized text, since the task asks for analysis of the cleaned data
    review = row['Lemmatized_Review']  # Use the actual column name here
    print(f"\nParsing Review {index + 1}: {review}")

    #  results
    dep_parse_result = dependency_parse(review)
    const_parse_result = constituency_parse(review)

    # save the results
    df.at[index, 'Dependency_Parse'] = dep_parse_result
    df.at[index, 'Constituency_Parse'] = const_parse_result

# Save the results to a new CSV file
df.to_csv('kalki_reviews_parsed.csv', index=False)
print("Saved parsed reviews to 'kalki_reviews_parsed.csv'.")

from google.colab import files
files.download('kalki_reviews_parsed.csv')


Columns in the DataFrame: Index(['Cleaned Review', 'Reviews_No_Stopwords', 'Lowercase_Review',
       'Stemmed_Review', 'Lemmatized_Review'],
      dtype='object')

Parsing Review 1: soar epic second part film excel climax taint bloat bore technic terribl first half cring dialogu poorli choreograph action scene unnecessari subplotsth prop use throughout movi look like cheap foam much inspir drawn hollywood moviesth first attempt nag ashwin averag best hope improv significantli second part film focu mahabharata adapt faith possibl scifi element movi disappoint wherea charact great epic thoroughli exhilar

Parsing Review 2: replica star war movi supream leader look like emperor palpatin technolog scene armi build alreadi star war seri watch lot hollywood movi feel watch kind movi first timegood side kalki first movi indian cinema level vfx complet give better competit hollywood scienc fiction moviesstori line movi could much better taken 3 hour thing experienc stori

Parsing Review 3: do
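To illustrate the difference between the two kinds of parse on one sentence, here is a small hand-annotated sketch for "The movie has great visuals". Both structures are written by hand for illustration and are not parser output; the dependency labels follow the style printed by `dependency_parse` above. A constituency tree groups words into nested phrases (NP, VP), while a dependency parse links each word to a single head word with a labelled relation.

```python
# Constituency tree as nested tuples: (label, child, child, ...); leaves are words.
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "movie")),
    ("VP", ("VBZ", "has"), ("NP", ("JJ", "great"), ("NNS", "visuals"))),
)

# Dependency parse as (token, relation, head) triples; the root points to itself.
dependencies = [
    ("The", "det", "movie"),
    ("movie", "nsubj", "has"),
    ("has", "ROOT", "has"),
    ("great", "amod", "visuals"),
    ("visuals", "dobj", "has"),
]

def to_brackets(node):
    """Render a nested-tuple tree in bracketed form."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"

def dependents_of(head, triples):
    """Return the tokens attached directly to `head`, excluding the root itself."""
    return [tok for tok, rel, h in triples if h == head and rel != "ROOT"]

print(to_brackets(constituency))
for token, relation, head in dependencies:
    print(f"{token} --> {relation} --> {head}")
print(dependents_of("has", dependencies))
```

In the constituency view, "great visuals" is a noun phrase inside the verb phrase. In the dependency view, the same relationship appears as "visuals" being the object of "has" and "great" modifying "visuals".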


In [20]:
import spacy
import pandas as pd

# Loading the spaCy model
nlp = spacy.load('en_core_web_sm')

# Load the cleaned CSV
input_filename = 'kalki_reviews_parsed.csv'  # Adjust the path if needed
df = pd.read_csv(input_filename)

# Print the columns to verify
print("Columns in DataFrame:", df.columns.tolist())

# Extract entities of the selected types and count them
def extract_entities(text):
    doc = nlp(text)
    entities = {}
    for ent in doc.ents:
        # Count specific entity types
        if ent.label_ in ['PERSON', 'ORG', 'GPE', 'PRODUCT', 'DATE']:
            entities[ent.text] = entities.get(ent.text, 0) + 1  # Simplified counting
    return entities

# Dictionary to hold all entities from all reviews
all_entities = {}

# Iterate through each review in the DataFrame; the lowercase review is our cleaned data, without stemming or lemmatization
for index, row in df.iterrows():
    review = row['Lowercase_Review']  # Ensure the correct column name
    entities = extract_entities(review)

    # Update the dictionary
    for entity, count in entities.items():
        all_entities[entity] = all_entities.get(entity, 0) + count  # Simplified counting

# print the output
print("\nNamed Entity Counts:")
for entity, count in sorted(all_entities.items(), key=lambda item: item[1], reverse=True):
    print(f"{entity}: {count}")


Columns in DataFrame: ['Cleaned Review', 'Reviews_No_Stopwords', 'Lowercase_Review', 'Stemmed_Review', 'Lemmatized_Review', 'Dependency_Parse', 'Constituency_Parse']

Named Entity Counts:
2898: 760
sci: 160
max: 120
6000 years: 80
last 30mins: 80
kamal hassan deepika: 80
nagi: 80
replica star: 40
betteri: 40
london: 40
sharad kelkar hindi: 40
2nd half: 40
karna: 40
heartkamal hassan: 40
2015: 40
sr ab: 40
2021: 40
mahabharatas depictionin conclusion: 40
jack master: 40
aswin: 40
kali karna: 40
amitabh bachchan: 40
amitabh bachchan kamal haasan: 40
mahabharata ashwatthama: 40
kaasi: 40
supreme leader: 40
prabhas bhairava: 40
haasan supreme one: 40
anna ben kyra: 40
angel dune: 40
max indian matrix: 40
kamal hasan: 40
1st half: 40
mahabharata kamal hasan: 40
weekend: 40
india: 40
710kalki 2898: 40
supreme yaskin played kamal: 40
spans mahabharata era: 40
hindi tamil malayalam kannada: 40
dulquer: 40


# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
https://myunt-my.sharepoint.com/:x:/r/personal/sahithitummala_my_unt_edu/_layouts/15/Doc.aspx?sourcedoc=%7B17085926-69C3-4E8B-BA2D-922663CC3D3B%7D&file=kalki_reviews_parsed%20(2).csv&action=default&mobileredirect=true

The link above is to the CSV file, which contains the cleaned data (without numbers, without stopwords, and in lowercase, which is the actual cleaned data), the stemmed text, and the dependency and constituency parsing results.

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
The real challenge lay in deciding which data to pick, and I observed how the outputs varied for each individual column.
