
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests
from bs4 import BeautifulSoup
import csv

base_url = "https://www.imdb.com/title/tt15398776/reviews"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
reviews = []
num_reviews = 1000
page = 1

while len(reviews) < num_reviews:
    url = f"{base_url}?sort=submissionDate&dir=desc&ratingFilter=0&page={page}"
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        review_containers = soup.find_all("div", class_="text show-more__control")
        if not review_containers:
            break
        for container in review_containers:
            review_text = container.get_text(strip=True)
            reviews.append(review_text)
        page += 1
    else:
        print("Failed to retrieve data")
        break

print(f"Collected {len(reviews)} reviews.")

filename = "movie_reviews.csv"
with open(filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Review"])
    for review in reviews:
        writer.writerow([review])

print("Reviews saved to movie_reviews.csv")


Collected 1000 reviews.
Reviews saved to movie_reviews.csv
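
One thing worth checking on a scrape like this is whether successive pages actually returned new reviews, since a paginated endpoint may ignore the `page` parameter and serve the same page again. The following is a minimal sketch, assuming the pages have already been fetched and parsed into lists of strings. The helper name `collect_unique` and the sample data are hypothetical and not part of the original notebook.

```python
def collect_unique(pages, limit):
    """Gather review texts page by page, skipping duplicates and stopping at `limit`."""
    seen = set()
    collected = []
    for page_reviews in pages:
        added = 0
        for text in page_reviews:
            if text in seen:
                continue
            seen.add(text)
            collected.append(text)
            added += 1
            if len(collected) >= limit:
                return collected
        if added == 0:
            # The page contributed nothing new, so further requests would likely repeat it.
            break
    return collected


# Hypothetical sample: the second "page" repeats the first one.
sample_pages = [["great film", "too long"], ["great film", "too long"], ["never reached"]]
unique_reviews = collect_unique(sample_pages, limit=1000)
print(unique_reviews)
```

Comparing `len(unique_reviews)` with the number of raw reviews collected would show how many duplicates, if any, the scrape contains.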


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
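
Steps (1), (2), and (4) can be sketched with two regex passes and a lowercase call. This is only an illustration under stated assumptions: the function name `strip_noise_and_numbers` and the sample sentence are made up here. It also shows a subtlety: when the noise pattern keeps only letters and whitespace, the digits are already gone before the number-removal step runs.

```python
import re

NOISE = re.compile(r"[^a-zA-Z\s]")  # anything that is not a letter or whitespace
DIGITS = re.compile(r"\d+")


def strip_noise_and_numbers(text):
    """Apply steps (1) and (2), then lowercase as in step (4)."""
    no_noise = NOISE.sub("", text)
    no_digits = DIGITS.sub("", no_noise)  # no-op here: NOISE already removed the digits
    return no_noise, no_digits.lower()


after_noise, cleaned = strip_noise_and_numbers("A 75-minute cut? Yes, please!")
print(after_noise)  # A minute cut Yes please
print(cleaned)      # a minute cut yes please
```

If the output of each step needs to differ visibly, one option would be a noise pattern that keeps digits (for example `[^\w\s]`, which also keeps underscores), so that the number-removal step has something left to remove.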

In [2]:
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

input_file = "movie_reviews.csv"
output_file = "cleaned_movie_reviews.csv"

print("Starting data cleaning process...")

with open(input_file, mode="r", encoding="utf-8") as infile, \
     open(output_file, mode="w", newline="", encoding="utf-8") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader)
    header.append("Cleaned Review")
    writer.writerow(header)

    for i, row in enumerate(reader, start=1):
        review = row[0]

        print(f"\nReview {i} before cleaning:\n{review}")  # Print original review

        # Remove non-alphabetic characters
        review = re.sub(r'[^a-zA-Z\s]', '', review)
        print(f"Review {i} after removing non-alphabetic characters:\n{review}")

        # Remove digits
        review = re.sub(r'\d+', '', review)
        print(f"Review {i} after removing digits:\n{review}")

        review = review.lower()  # Convert to lowercase
        print(f"Review {i} after converting to lowercase:\n{review}")

        words = review.split()  # Tokenize
        words = [word for word in words if word not in stop_words]  # Remove stopwords
        print(f"Review {i} after removing stopwords:\n{' '.join(words)}")

        stemmed_words = [stemmer.stem(word) for word in words]  # Stemming
        print(f"Review {i} after stemming:\n{' '.join(stemmed_words)}")

        lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]  # Lemmatization
        cleaned_review = ' '.join(lemmatized_words)  # Join tokens

        print(f"\nReview {i} after lemmatization:\n{cleaned_review}")  # Print cleaned review

        row.append(cleaned_review)  # Append cleaned review

        writer.writerow(row)  # Write row to output file

print("\nData cleaning completed and saved to cleaned_movie_reviews.csv")  # Print completion message


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Streaming output truncated to the last 5000 lines.
Review 341 before cleaning:
This is more of a court room drama than an exciting movie about the man behind the Los Alamos laboratory. Two and a half hours in, we are still in character development and jumping between 3 different timelines. A confusing, boring slog. Nolan has mastered the manipulation of timelines in his previous brilliant films, but here it just fails miserably until the pieces and timelines FINALLY coalesce in the last 10 minutes. Einstein is in the movie for 5 minutes and without giving any spoilers, he steals the show in the end.Oppenheimer could have been a 75 minute movie instead if 180 and it would have been so much more effective. Clearly, I am in the minority and it will probably win all kinds of awards, but I don't think it deserves it.
Review 341 after removing non-alphabetic characters:
This is more of a court room drama than an exciting movie about the man behind the Los Alamos laboratory Two 



Review 938 after removing stopwords:
christopher nolan best impossible follow first half hour different times running simultaneously one might mistake time travel doc cilian murphy given chance shine incredibly demanding role showcases every aspact complex character j robert oppenheimer doesnt win oscar best actor im sure point isall supporting actors fabulous film thats designed lure thats exactly happened first time saw second thirdi could go far say one greatest films century time
Review 938 after stemming:
christoph nolan best imposs follow first half hour differ time run simultan one might mistak time travel doc cilian murphi given chanc shine incred demand role showcas everi aspact complex charact j robert oppenheim doesnt win oscar best actor im sure point isal support actor fabul film that design lure that exactli happen first time saw second thirdi could go far say one greatest film centuri time

Review 938 after lemmatization:
christoph nolan best imposs follow first half hou

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
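
The counting in part (1) can be sketched independently of any tagger by grouping Penn Treebank tags by their first letter. This is a minimal sketch, assuming the input is a list of `(word, tag)` pairs in the shape returned by `nltk.pos_tag`. The function name `count_pos_categories` and the hand-written sample are hypothetical.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four categories asked for in part (1).
PREFIX_TO_CATEGORY = {"N": "Noun", "V": "Verb", "J": "Adjective", "R": "Adverb"}


def count_pos_categories(tagged_words):
    """Count nouns, verbs, adjectives, and adverbs in a list of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_words:
        category = PREFIX_TO_CATEGORY.get(tag[:1])
        if category is not None:
            counts[category] += 1
    return counts


# Hand-written (word, tag) pairs; illustrative only.
sample = [
    ("Nolan", "NNP"), ("quickly", "RB"), ("made", "VBD"),
    ("a", "DT"), ("great", "JJ"), ("film", "NN"),
]
pos_counts = count_pos_categories(sample)
print(dict(pos_counts))
```

Note that matching on the first letter alone also counts the particle tag `RP` as an adverb. Checking for the `RB` prefix instead would exclude it.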

In [3]:
# Import and download the required libraries
import csv
import nltk
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.downloader import download
import spacy

# Download NLTK resources
download('averaged_perceptron_tagger')
download('maxent_ne_chunker')
download('words')
download('punkt')

# Loading spaCy model
nlp = spacy.load("en_core_web_sm")

def analyze_syntax_structure(input_file):
    # Load cleaned text from CSV file
    with open(input_file, mode="r", encoding="utf-8") as infile:
        reader = csv.reader(infile)
        next(reader)  # Skip header
        cleaned_text = ' '.join([row[1] for row in reader])  # Concatenate all cleaned reviews

    # Tokenize text into sentences
    sentences = sent_tokenize(cleaned_text)

    # Initialize counters for POS tagging
    noun_count = 0
    verb_count = 0
    adj_count = 0
    adv_count = 0

    # Initialize counters for Named Entity Recognition
    person_count = 0
    organization_count = 0
    location_count = 0
    product_count = 0
    date_count = 0

    # Function to extract named entities from a sentence
    def extract_entities(sentence):
        # The counters are locals of the enclosing function, so they need `nonlocal`, not `global`
        nonlocal person_count, organization_count, location_count, product_count, date_count
        entities = ne_chunk(pos_tag(word_tokenize(sentence)))
        for entity in entities:
            if isinstance(entity, nltk.tree.Tree):
                if entity.label() == 'PERSON':
                    person_count += 1
                elif entity.label() == 'ORGANIZATION':
                    organization_count += 1
                elif entity.label() == 'GPE':
                    location_count += 1
                elif entity.label() == 'PRODUCT':
                    product_count += 1
                elif entity.label() == 'DATE':
                    date_count += 1

    # Conduct POS tagging and NER for each sentence
    for sentence in sentences:
        # Tokenize the sentence
        tokens = word_tokenize(sentence)

        # Perform POS tagging
        tagged_words = pos_tag(tokens)

        # Count POS tags
        for word, tag in tagged_words:
            if tag.startswith('N'):
                noun_count += 1
            elif tag.startswith('V'):
                verb_count += 1
            elif tag.startswith('J'):
                adj_count += 1
            elif tag.startswith('R'):
                adv_count += 1

        # Perform named entity recognition with NLTK
        extract_entities(sentence)

        # Perform named entity recognition with spaCy
        doc = nlp(sentence)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                person_count += 1
            elif ent.label_ == 'ORG':
                organization_count += 1
            elif ent.label_ == 'GPE':
                location_count += 1
            elif ent.label_ == 'PRODUCT':
                product_count += 1
            elif ent.label_ == 'DATE':
                date_count += 1

    # Print total counts of POS tags and named entities
    print("\nTotal counts of POS tags:")
    print("Noun:", noun_count)
    print("Verb:", verb_count)
    print("Adjective:", adj_count)
    print("Adverb:", adv_count)
    print("\nTotal counts of named entities:")
    print("Person:", person_count)
    print("Organization:", organization_count)
    print("Location:", location_count)
    print("Product:", product_count)
    print("Date:", date_count)

# Specify input file path
input_file = "cleaned_movie_reviews.csv"

# Analyze syntax and structure of the clean text
analyze_syntax_structure(input_file)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



Total counts of POS tags:
Noun: 55200
Verb: 13601
Adjective: 22560
Adverb: 5359

Total counts of named entities:
Person: 2160
Organization: 1240
Location: 1000
Product: 80
Date: 360
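
The per-category totals could also be tallied with a `Counter` keyed by label instead of one variable per category. This is a minimal sketch, assuming the entities arrive as `(text, label)` pairs using spaCy-style labels. The names `LABEL_TO_CATEGORY` and `count_entity_categories` and the sample pairs are hypothetical and not taken from the real run.

```python
from collections import Counter

# spaCy-style labels mapped to the category names used in the printed totals.
LABEL_TO_CATEGORY = {
    "PERSON": "Person",
    "ORG": "Organization",
    "GPE": "Location",
    "PRODUCT": "Product",
    "DATE": "Date",
}


def count_entity_categories(entities):
    """Tally (text, label) pairs per category, ignoring labels outside the mapping."""
    counts = Counter()
    for _text, label in entities:
        category = LABEL_TO_CATEGORY.get(label)
        if category is not None:
            counts[category] += 1
    return counts


# Hand-written (text, label) pairs; illustrative only.
sample_entities = [
    ("christopher nolan", "PERSON"),
    ("japan", "GPE"),
    ("today", "DATE"),
    ("world war ii", "EVENT"),
    ("hiroshima", "GPE"),
]
entity_counts = count_entity_categories(sample_entities)
for category in LABEL_TO_CATEGORY.values():
    print(f"{category}: {entity_counts[category]}")
```

In the notebook's loop the pairs would come from `(ent.text, ent.label_)` for each `ent` in `doc.ents`. NLTK reports organizations as `ORGANIZATION` rather than `ORG`, so its chunks would need an extra mapping entry.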


In [None]:
#Here we are printing the frequency of individual entities identified:

# Loading spaCy model
nlp = spacy.load("en_core_web_sm")

def analyze_syntax_structure(input_file):
    # Load cleaned text from CSV file
    with open(input_file, mode="r", encoding="utf-8") as infile:
        reader = csv.reader(infile)
        next(reader)  # Skip header
        cleaned_text = ' '.join([row[1] for row in reader])  # Concatenate all cleaned reviews

    # Tokenize the text
    sentences = sent_tokenize(cleaned_text)

    # Initialize the dictionary
    entities_count = {}

    # Helper to extract named entities from a sentence
    def extract_entities(sentence):
        entities = ne_chunk(pos_tag(word_tokenize(sentence)))
        for entity in entities:
            if isinstance(entity, nltk.tree.Tree):
                # Each leaf is a (word, tag) pair, so zip(*entity) yields the words first, then the tags
                words, tags = zip(*entity)
                entity_name = ' '.join(words)
                entities_count.setdefault(entity.label(), Counter())[entity_name] += 1

    # Run NER on each sentence
    for sentence in sentences:
        # Perform named entity recognition with NLTK
        extract_entities(sentence)

        # Perform named entity recognition with spaCy
        doc = nlp(sentence)
        for ent in doc.ents:
            entities_count.setdefault(ent.label_, Counter())[ent.text] += 1

    # Print named entity counts
    print("\nNamed Entity Counts:")
    for label, entity_counter in entities_count.items():
        print(label + ":")
        for entity, freq in entity_counter.items():
            print(f"{entity}: {freq}")

# Specify input file path
input_file = "cleaned_movie_reviews.csv"

# Analyze syntax and structure of the clean text
analyze_syntax_structure(input_file)



Named Entity Counts:
ORG:
oppenheim realli: 40
christoph: 40
favourit nolan film rewatch film great: 40
oppenheim movi: 80
narr: 160
oppenheim experiencedon movi strength: 40
societ implic scientif advanc revoc: 40
oppenheim secur clearanc: 40
significantli movi: 40
oppenheim charact navig intellectu brillianc: 40
historybas movi: 40
surpris movi: 40
bygon: 40
christoph nolan: 80
weaponsth: 40
abus: 40
horrif: 40
cnn: 40
blinder bank: 40
oppenheim review: 40
lot nolan movi: 40
incandesc movi: 40
american prometheu: 40
substanti: 40
wwii: 40
DATE:
week: 40
today: 120
year: 80
last half: 40
first time year: 40
four day: 40
GPE:
japan: 80
manhattan: 160
oppenheim: 480
moviego: 40
figur: 40
mayb: 40
storylin: 40
gaza: 80
hiroshima: 80
nagasaki: 40
japanin: 40
believ: 40
mostli: 40
onlin: 40
EVENT:
world war ii: 80
world war ii direct: 40
wwii: 40
black dark dank: 40
PERSON:
film oppenheim believ: 40
multipl timelin: 40
nolan absolut: 40
oppenheim associ: 40
jargon: 40
chang: 40
jacobson p

In [4]:
# Loading the English language model for spaCy
nlp = spacy.load('en_core_web_sm')

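# Function to print the parse information for a sentence.
# Note: en_core_web_sm provides a dependency parser only, so the block labelled
# "Constituency parsing tree" actually lists dependency relations (token, relation, head),
# and the block labelled "Dependency parsing tree" lists noun chunks with their root tokens.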
def print_parsing_trees(sentence):
    doc = nlp(sentence)
    print("\nSentence:", sentence)
    print("\nConstituency parsing tree:")
    for token in doc:
        print(token.text, token.dep_, token.head.text)
    print("\nDependency parsing tree:")
    for chunk in doc.noun_chunks:
        print(chunk.text, "-->", chunk.root.text)

# Read cleaned text from CSV file
def read_cleaned_text_from_csv(file_path):
    cleaned_text = ''
    with open(file_path, mode="r", encoding="utf-8") as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # Skip the header row
        for row in reader:
            # row[0] is the original review; the cleaned text is in row[1].
            # Reading row[0] keeps the punctuation that sentence splitting relies on.
            cleaned_text += row[0] + ' '
    return cleaned_text

# Specify the path to the CSV file containing cleaned text
csv_file_path = "cleaned_movie_reviews.csv"

# Read cleaned text from CSV file
cleaned_text = read_cleaned_text_from_csv(csv_file_path)

# Tokenizing our text into sentences
sentences = sent_tokenize(cleaned_text)

# Printing constituency parsing and dependency parsing trees for each sentence
for sentence in sentences:
    print_parsing_trees(sentence)


Streaming output truncated to the last 5000 lines.
Dependency parsing tree:
This film --> film
the moment --> moment
Christopher Nolan's masterpiece --> masterpiece

Sentence: That's right.

Constituency parsing tree:
That nsubj 's
's ROOT 's
right acomp 's
. punct 's

Dependency parsing tree:
That --> That

Sentence: Christopher Nolan, master of epic action movies, has made his best work in what should be considered a hybrid of historical biopic and political thriller.

Constituency parsing tree:
Christopher compound Nolan
Nolan nsubj made
, punct Nolan
master appos Nolan
of prep master
epic amod movies
action compound movies
movies pobj of
, punct Nolan
has aux made
made ROOT made
his poss work
best amod work
work dobj made
in prep work
what nsubjpass considered
should aux considered
be auxpass considered
considered pcomp in
a det hybrid
hybrid oprd considered
of prep hybrid
historical amod biopic
biopic pobj of
and cc biopic
political amod thriller
thriller conj biopic
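
The rows printed under the "Constituency parsing tree" label come from spaCy's `token.dep_` and `token.head`, so each row is a dependency relation of the form (token, relation, head). The sketch below shows how such rows can be assembled into a tree, using the rows for "That's right." copied from the output above. The helper names `build_tree` and `render` are hypothetical, and keying nodes by token text is a simplifying assumption that only holds when no token repeats in the sentence. Keying by token index (`token.i`) would be more robust.

```python
from collections import defaultdict

# (token, relation, head) rows copied from the output for the sentence "That's right."
rows = [
    ("That", "nsubj", "'s"),
    ("'s", "ROOT", "'s"),
    ("right", "acomp", "'s"),
    (".", "punct", "'s"),
]


def build_tree(rows):
    """Return the root token and a head -> [(relation, child), ...] mapping."""
    root = None
    children = defaultdict(list)
    for token, relation, head in rows:
        if relation == "ROOT":
            root = token
        else:
            children[head].append((relation, token))
    return root, dict(children)


def render(node, children, depth=0, relation="ROOT"):
    """Produce one indented line per token, with children nested under their head."""
    lines = [f"{'  ' * depth}{relation}: {node}"]
    for child_relation, child in children.get(node, []):
        lines.extend(render(child, children, depth + 1, child_relation))
    return lines


root, children = build_tree(rows)
print("\n".join(render(root, children)))
```

Here the verb `'s` is the root, and the subject `That`, the complement `right`, and the final punctuation all attach directly to it. A constituency tree would instead group the words into nested phrases such as a noun phrase and a verb phrase.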

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
'''
I found it interesting and thought-provoking. The task involved different parts, like getting text data, cleaning it up, and analyzing its structure. One part that was a
bit hard was making sure the code worked well and was efficient, especially with big amounts of data. Understanding and using advanced concepts, like how sentences are
structured, took some time to figure out.

But overall, I enjoyed working on it. It was cool to learn how computers can understand and work with human language. I liked trying out different ways to handle the data
and see how it affected the results.

The time given to finish the assignment was okay. The tasks were laid out in a clear way, which helped with planning. But since some parts were tricky, having a bit more
time would have been helpful to experiment more and make sure everything was done right.
'''
