<a href="https://colab.research.google.com/github/SaiTejaMunja/SaiTeja_INFO5731_Fall2023/blob/main/Munja_SaiTeja_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 User Reviews of a film released in 2023 or 2022 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [None]:
# Write your code here

# Using Second Source #2

# 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

# The Batman (2022) - Movie Name

import requests
from bs4 import BeautifulSoup
import csv

# Defining the URL of the IMDb page for the film's user reviews
film_url = "https://www.imdb.com/title/tt1877830/reviews"

# Creating a list to store all user reviews
all_user_reviews = []

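# Defining the number of pages to scrape (adjust as needed)
num_pages_to_scrape = 800  # You can change this to collect more pages

# NOTE: this loop assumes the reviews page honours a `start` offset of 10 per page.
# If IMDb ignores the parameter, every request returns the same reviews and
# all_user_reviews fills up with duplicates, so check the collected rows afterwards.
for page_num in range(1, num_pages_to_scrape + 1):
    # Creating the URL for the current page
    page_url = f"{film_url}?start={(page_num - 1) * 10}"

    # Sending an HTTP GET request to the IMDb page (the timeout avoids hanging forever)
    response = requests.get(page_url, timeout=30)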

    if response.status_code == 200:
        # Parsing the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Finding the review elements on the page
        review_elements = soup.find_all("div", class_="text show-more__control")

        for review_element in review_elements:
            review_text = review_element.get_text(strip=True)
            all_user_reviews.append(review_text)

    else:
        print(f"Failed to retrieve IMDb page {page_num}. Check the URL or your internet connection.")

# Creating a CSV file to store all user reviews
with open("imdb_user_reviews.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)

    # Writing the header row
    writer.writerow(["User Review"])

    # Writing the data for each user review
    for review in all_user_reviews:
        writer.writerow([review])

print(f"Collected {len(all_user_reviews)} user reviews and saved to 'imdb_user_reviews.csv'.")


Collected 20000 user reviews and saved to 'imdb_user_reviews.csv'.


In [None]:
#importing libraries and reading csv

import numpy as np
import pandas as pd
imdb = pd.read_csv('imdb_user_reviews.csv')

In [None]:
#Checking head of the dataset

imdb.head()

| | User Review |
|---|---|
| 0 | Detective Batman at its peak! Great storyline.... |
| 1 | I just got out of The BatmanThis movie really ... |
| 2 | A serial killer strikes in Gotham City, killin... |
| 3 | I have been absolutely fizzing to see 'The Bat... |
| 4 | The Riddler(Paul Dano, spot-on. How did it tak... |


In [None]:
#checking the shape of the data

imdb.shape

(20000, 1)
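
Since the scraper requests many pages and appends whatever each response contains, it may be worth checking how many of the collected rows are actually distinct before cleaning them. The following is a minimal sketch using only the standard library; the helper name `count_duplicates` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import csv
import io
from collections import Counter

def count_duplicates(csv_file, column="User Review"):
    """Return (total_rows, distinct_rows) for one column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[column] for row in reader)
    return sum(counts.values()), len(counts)

# Hypothetical sample standing in for imdb_user_reviews.csv
sample = io.StringIO(
    "User Review\n"
    "Great film\n"
    "Too long\n"
    "Great film\n"
)
total, distinct = count_duplicates(sample)
print(f"{total} rows, {distinct} distinct")
```

With the real file, the sample can be replaced by `open("imdb_user_reviews.csv", newline="", encoding="utf-8")`. A large gap between the two numbers would suggest that the same reviews were fetched repeatedly.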

# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write your code here

# importing necessary libraries

import csv
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Loading the stopwords list into a set (fast lookups, and the name no longer
# shadows the imported nltk `stopwords` module)
stop_words = set(stopwords.words("english"))

# Creating the stemmer and lemmatizer once instead of on every call
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Defining a function to clean the text data
def clean_text(text):
    # Removing noise, such as special characters and punctuations
    text = re.sub(r"[^\w\s]", "", text)

    # Removing numbers
    text = re.sub(r"[0-9]", "", text)

    # Removing stopwords
    # NOTE: this runs before lowercasing, so capitalized stopwords such as "The"
    # are kept; lowercase first if they should be removed as well
    text = " ".join([word for word in text.split() if word not in stop_words])

    # applying Lowercase to all texts
    text = text.lower()

    # applying Stemming
    text = " ".join([stemmer.stem(word) for word in text.split()])

    # applying Lemmatization
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])

    return text

# Opening the CSV file
with open("imdb_user_reviews.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader) # Skip the header row

    # Creating a new CSV file to store the cleaned data
    with open("imdb_reviews_clean.csv", "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["Review", "Cleaned Review"])

        for row in reader:
            review = row[0]

            # Cleaning the review text
            cleaned_review = clean_text(review)

            # Writing the cleaned review to the new CSV file
            writer.writerow([review, cleaned_review])

# Both files are closed automatically when the with blocks exit

print("Successfully cleaned and saved the text data to imdb_reviews_clean.csv")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Successfully cleaned and saved the text data to imdb_reviews_clean.csv
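
In `clean_text`, stopwords are filtered before the text is lowercased, so a capitalized stopword such as "The" does not match the lowercase list. The sketch below illustrates the effect of the step order on one sentence. It uses a tiny hard-coded stopword set (an assumption for illustration, not the NLTK list), hypothetical helper names, and it omits stemming and lemmatization.

```python
import re

# Tiny illustrative stopword set (assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "is", "a", "of", "in"}

def strip_noise(text):
    """Remove punctuation and special characters, then digits."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"[0-9]", "", text)

def filter_then_lower(text):
    """Same order as clean_text: stopword filter first, lowercase second."""
    words = [w for w in strip_noise(text).split() if w not in STOP_WORDS]
    return " ".join(words).lower()

def lower_then_filter(text):
    """Alternative order: lowercase first, then filter stopwords."""
    words = [w for w in strip_noise(text).lower().split() if w not in STOP_WORDS]
    return " ".join(words)

sentence = "The Batman (2022) is a dark, gritty film!"
print(filter_then_lower(sentence))  # the batman dark gritty film
print(lower_then_filter(sentence))  # batman dark gritty film
```

Which order is preferable depends on whether sentence-initial stopwords should survive into the later analysis steps.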


# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [None]:
# 1. Write Your Code here

# Importing necessary libraries

import spacy

# Loading the spaCy model
nlp = spacy.load("en_core_web_sm")

# Defining a function to perform POS tagging and count parts of speech
def count_parts_of_speech(text):
    doc = nlp(text)
    pos_counts = {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}

    for token in doc:
        if token.pos_ == 'NOUN':
            pos_counts['Noun'] += 1
        elif token.pos_ == 'VERB':
            pos_counts['Verb'] += 1
        elif token.pos_ == 'ADJ':
            pos_counts['Adjective'] += 1
        elif token.pos_ == 'ADV':
            pos_counts['Adverb'] += 1

    return pos_counts

# Opening  the cleaned data file
with open("imdb_reviews_clean.csv", "r") as f:
    reader = csv.reader(f)
    next(reader)  # Skip the header row

    # Initializing POS counts
    total_pos_counts = {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}

    for row in reader:
        cleaned_review = row[1]

        # Counting POS in the cleaned review
        pos_counts = count_parts_of_speech(cleaned_review)

        # Updating the total POS counts
        for pos, count in pos_counts.items():
            total_pos_counts[pos] += count

# Printing the total counts of each POS
print("Total Parts of Speech Counts:")
for pos, count in total_pos_counts.items():
    print(f"{pos}: {count}")

# The file is closed automatically when the with block exits

Total Parts of Speech Counts:
Noun: 996000
Verb: 526400
Adjective: 372800
Adverb: 144000
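
The if/elif chain in `count_parts_of_speech` can also be expressed as a lookup table plus a `collections.Counter`. A minimal sketch, assuming the tokens have already been tagged with spaCy's coarse-grained tags; the `(text, tag)` pairs below are hypothetical stand-ins for `(token.text, token.pos_)`, and `tally_pos` is an illustrative name.

```python
from collections import Counter

# Map coarse-grained tags to the four categories the question asks for
POS_NAMES = {"NOUN": "Noun", "VERB": "Verb", "ADJ": "Adjective", "ADV": "Adverb"}

def tally_pos(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (text, tag) pairs."""
    counts = Counter({name: 0 for name in POS_NAMES.values()})
    for _text, tag in tagged_tokens:
        if tag in POS_NAMES:
            counts[POS_NAMES[tag]] += 1
    return counts

# Hypothetical tagged tokens, not real model output
tagged = [("great", "ADJ"), ("storylin", "NOUN"), ("just", "ADV"),
          ("dark", "ADJ"), ("univers", "NOUN"), ("come", "VERB"), ("the", "DET")]
totals = tally_pos(tagged)
print(dict(totals))
```

Note that spaCy tags proper nouns as `PROPN`, so they are not included in the Noun count unless that tag is added to the table.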


In [None]:
# 2. Write Your Code here

# Importing necessary libraries

import csv
import spacy
from spacy import displacy

# Loading the spaCy model
nlp = spacy.load("en_core_web_sm")

# Opening the cleaned data file (imdb_reviews_clean.csv)
with open("imdb_reviews_clean.csv", "r") as f:
    reader = csv.reader(f)
    next(reader)  # Skipping the header row

    # Processing a single review for parsing analysis
    for row in reader:
        cleaned_review = row[1]  # retrieving the cleaned review text

        # Tokenizing and parsing the cleaned review
        doc = nlp(cleaned_review)

        # Constituency Parsing
        # NOTE: en_core_web_sm has no constituency parser, so this block prints the
        # same dependency arcs as the next one; a real constituency tree needs a
        # separate constituency parser.
        print("Constituency Parsing Tree:")
        for token in doc:
            print(f"{token.text} [{token.dep_}] -> {token.head.text} [{token.head.dep_}]")

        # Dependency Parsing: each line is "token [relation] -> head [head's relation]"
        print("\nDependency Parsing Tree:")
        for token in doc:
            print(f"{token.text} [{token.dep_}] -> {token.head.text} [{token.head.dep_}]")

        # Visualizing the dependency parsing tree inline
        # (displacy.serve starts a blocking web server, which hangs a notebook cell)
        displacy.render(doc, style="dep", jupyter=True)
        break

# The file is closed automatically when the with block exits


Constituency Parsing Tree:
detect [ROOT] -> detect [ROOT]
batman [compound] -> storylin [dobj]
peak [nmod] -> storylin [dobj]
great [amod] -> storylin [dobj]
storylin [dobj] -> detect [ROOT]
just [advmod] -> dark [amod]
dark [amod] -> univers [dobj]
univers [dobj] -> detect [ROOT]
we [nsubj] -> come [relcl]
ve [aux] -> come [relcl]
come [relcl] -> univers [dobj]
expect [ccomp] -> detect [ROOT]
dc [ccomp] -> expect [ccomp]
the [det] -> gloomi [compound]
gloomi [compound] -> exactli [npadvmod]
gritti [compound] -> exactli [npadvmod]
dark [compound] -> tone [compound]
tone [compound] -> exactli [npadvmod]
film [compound] -> exactli [npadvmod]
exactli [npadvmod] -> detect [ROOT]
i [nsubj] -> want [ROOT]
want [ROOT] -> want [ROOT]
when [advmod] -> think [ccomp]
think [ccomp] -> want [ROOT]
movi [dobj] -> think [ccomp]
there [advmod] -> movi [dobj]
beauti [nsubj] -> cinematographi [ccomp]
cinematographi [ccomp] -> think [ccomp]
great [amod] -> score [dobj]
score [dobj] -> cinematographi [cco

In [None]:
# 3. Write Your Code here

# Importing necessary libraries

import csv
import spacy

# Loading the spaCy model
nlp = spacy.load("en_core_web_sm")

# Opening the cleaned data file (imdb_reviews_clean.csv)

with open("imdb_reviews_clean.csv", "r") as f:
    reader = csv.reader(f)
    next(reader)  # Skip the header row

    # Initializing entity counts
    entity_counts = {}

    for row in reader:
        cleaned_review = row[1]  # Retrieving the cleaned review text

        # Extracting entities from the cleaned review
        doc = nlp(cleaned_review)
        for ent in doc.ents:
            entity_type = ent.label_
            entity_text = ent.text

            if entity_type in entity_counts:
                entity_counts[entity_type].append(entity_text)
            else:
                entity_counts[entity_type] = [entity_text]

# Calculating the count of each entity type
entity_type_counts = {entity_type: len(entities) for entity_type, entities in entity_counts.items()}

# Printing the entity type counts
print("Entity Type Counts:")
for entity_type, count in entity_type_counts.items():
    print(f"{entity_type}: {count}")

# The file is closed automatically when the with block exits


Entity Type Counts:
ORDINAL: 9600
CARDINAL: 29600
PERSON: 137600
NORP: 8800
ORG: 24000
DATE: 5600
TIME: 8800
GPE: 12800
PRODUCT: 800
WORK_OF_ART: 800
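
The cell above counts entity mentions per type. Since the question also asks for the count of each entity, one possible extension is to tally the individual entity strings within each type. A minimal sketch, assuming `(text, label)` pairs like those obtained from `ent.text` and `ent.label_`; the sample pairs are hypothetical and `tally_entities` is an illustrative name.

```python
from collections import Counter, defaultdict

def tally_entities(entities):
    """Group (text, label) pairs by label and count each distinct text."""
    by_label = defaultdict(Counter)
    for text, label in entities:
        by_label[label][text] += 1
    return by_label

# Hypothetical entity mentions, not real model output
mentions = [("batman", "PERSON"), ("gotham", "GPE"), ("batman", "PERSON"),
            ("paul dano", "PERSON"), ("dc", "ORG")]
counts = tally_entities(mentions)
for label, texts in counts.items():
    # Total mentions of this type, followed by its two most frequent entities
    print(label, sum(texts.values()), texts.most_common(2))
```

Summing each inner counter reproduces the per-type totals, while `most_common` shows which entities account for them.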


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

A) **Constituency parsing trees** describe the syntactic structure of a sentence as a hierarchy of nested phrases. The leaves are the words, each internal node is a phrase (constituent) such as a noun phrase or a verb phrase, and each edge links a phrase to one of its parts. The root node represents the complete sentence.
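
As a rough illustration of this structure, a constituency tree can be stored as nested tuples whose first element is a phrase or tag label. The representation, the helper names, and the short example sentence below are assumptions made for this sketch.

```python
# A node is (label, child, child, ...); a leaf is (tag, "word")
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "film")),
        ("VP", ("VBZ", "is"), ("ADJP", ("JJ", "dark"))))

def is_leaf(node):
    """A leaf holds exactly one child, and that child is the word itself."""
    return len(node) == 2 and isinstance(node[1], str)

def leaves(node):
    """Return the words of the sentence in left-to-right order."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def phrases(node):
    """Yield (label, text) for every phrasal (non-leaf) node, top down."""
    if is_leaf(node):
        return
    yield node[0], " ".join(leaves(node))
    for child in node[1:]:
        yield from phrases(child)

print(leaves(tree))
for label, text in phrases(tree):
    print(label, "->", text)
```

Every phrasal node covers a contiguous span of words, which is the defining property of a constituent.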

**Dependency parsing trees** describe the syntactic structure of a sentence as a directed graph in which every node is a word and every edge is a grammatical relation from a head word to one of its dependents. The root node is the head of the whole sentence, usually its main verb.
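
A dependency tree needs only one entry per word: the index of its head and the relation label. The following is a minimal sketch; the sentence, the hand-assigned arcs, and the helper name `tree_lines` are illustrative assumptions, with relation labels in the style of the spaCy output shown earlier.

```python
# Each entry: (word, relation, head_index); the root points to itself,
# mirroring how the root token is its own head in the spaCy output
arcs = [
    ("Batman", "nsubj", 1),
    ("protects", "ROOT", 1),
    ("the", "det", 3),
    ("city", "dobj", 1),
]

def tree_lines(arcs, head=None, depth=0, lines=None):
    """Collect an indented, top-down view of the tree starting at the root."""
    if lines is None:
        lines = []
    if head is None:
        # The root is the only word whose head index is its own index
        head = next(i for i, (_, _, h) in enumerate(arcs) if h == i)
    word, relation, _ = arcs[head]
    lines.append("  " * depth + f"{word} [{relation}]")
    for i, (_, _, h) in enumerate(arcs):
        if h == head and i != head:
            tree_lines(arcs, i, depth + 1, lines)
    return lines

for line in tree_lines(arcs):
    print(line)
```

Unlike the constituency representation, there are no phrase nodes here: the tree has exactly as many nodes as the sentence has words.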

**For Example:**

Consider the sentence "Everything Everywhere All at Once is the best movie of the year."

**The constituency parsing tree for this sentence is as follows:**

[S
  [NP [NNP Everything] [NNP Everywhere] [NNP All] [NNP at] [NNP Once]]
  [VP [VBZ is]
    [NP
      [NP [DT the] [JJS best] [NN movie]]
      [PP [IN of] [NP [DT the] [NN year]]]]]
  [. .]]

This tree shows that the sentence consists of a subject noun phrase (NP Everything Everywhere All at Once) followed by a verb phrase (VP is the best movie of the year). The subject noun phrase is the five-word film title, tagged here word by word as a proper noun (NNP) because the title acts as a single name. The verb phrase consists of the verb (VBZ is) and a noun phrase, which in turn contains a determiner (DT the), a superlative adjective (JJS best), a noun (NN movie), and a prepositional phrase (PP of the year). This is one plausible analysis; an automatic parser may tag the words inside the title differently.


**The dependency parsing tree for this sentence is as follows:**

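(is
  (nsubj Everything Everywhere All at Once)
  (attr movie
    (det the)
    (amod best)
    (prep of
      (pobj year
        (det the))))
  (punct .))

In this dependency tree, the root is the verb "is". Its dependents are the subject (the film title, treated here as a single unit, since parsers differ in how they attach the words inside a title), the attribute "movie", and the final period. The word "movie" is modified by the determiner "the", the superlative adjective "best", and the preposition "of", whose object "year" has its own determiner "the". The relation labels follow the style used by spaCy's English models; under the Universal Dependencies scheme, the copula "is" would instead depend on "movie", which would become the root.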


**Differences between constituency parsing trees and dependency parsing trees:**



1.   Constituency parsing trees and dependency parsing trees represent the syntactic structure of a sentence in different ways. Constituency parsing trees focus on the hierarchical relationships between phrases, while dependency parsing trees focus on the grammatical relations between individual words.

2.   Constituency parsing trees make phrase boundaries explicit, which is convenient for describing phenomena such as coordination and subordination in terms of whole phrases. Dependency parsing trees, in contrast, are more compact because they have exactly one node per word and no additional phrase nodes, which often makes them simpler to process.

**Applications of constituency parsing trees and dependency parsing trees:**

1.   Constituency parsing trees and dependency parsing trees are used in a variety of natural language processing (NLP) applications such as machine translation, text summarization, and question answering.

2.   In machine translation, constituency parsing trees can be used to guide how the phrases of a sentence are reordered and translated into another language. In text summarization and question answering, dependency parsing trees can be used to extract the key relations in a text (for example, who did what to whom) and to match them against a question.