<a href="https://colab.research.google.com/github/Tharunchandubatla/Tharun_INFO5731_Fall2023/blob/main/Chandubatla_Tharun_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the IMDb page for the movie "Top Gun: Maverick"
url = "https://www.imdb.com/title/tt1745960/reviews"

# Function to scrape user reviews, one results page per request
def scrape_imdb_reviews(url, num_reviews=10000):
    reviews = []
    page = 1

    while len(reviews) < num_reviews:
        page_url = f"{url}?sort=helpfulnessScore&dir=desc&ratingFilter=0&spoiler=hide&ref_=tt_ov_rt&page={page}"
        response = requests.get(page_url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(response.text, "html.parser")
        user_reviews = soup.find_all("div", class_="text show-more__control")

        # Stop when a page has no reviews; otherwise this loop would never end
        if not user_reviews:
            break

        for review in user_reviews:
            text = review.get_text(strip=True)
            reviews.append(text)

        page += 1

    return reviews[:num_reviews]

# Movie name
movie_name = "Top Gun: Maverick"

# Scrape reviews
reviews = scrape_imdb_reviews(url)

# Create a DataFrame
data = pd.DataFrame(reviews, columns=["User Reviews"])

# Save the data to a CSV file
csv_file_name = f"{movie_name}_user_reviews.csv"
data.to_csv(csv_file_name, index=False)
print(f"Collected {len(reviews)} user reviews and saved to {csv_file_name}")

Collected 10000 user reviews and saved to Top Gun: Maverick_user_reviews.csv
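
The scraping loop above keeps requesting pages until it has enough reviews, so it needs a stopping rule for the case where the site runs out of reviews or keeps returning the same page. The following is a minimal, network-free sketch of one such rule. `FAKE_PAGES`, `fetch_page`, and `collect_reviews` are hypothetical names used only for illustration; they stand in for the HTTP request and HTML parsing and are not part of the notebook or of IMDb's site.

```python
# Hypothetical stand-in data: each inner list is the review text found on one page.
FAKE_PAGES = [
    ["Great film", "Loved the flying scenes"],
    ["Loved the flying scenes", "Better than the original"],  # one repeat
    ["Great film"],                                           # nothing new
]

def fetch_page(page):
    """Hypothetical fetcher: return the review texts on a 1-indexed page."""
    if page <= len(FAKE_PAGES):
        return FAKE_PAGES[page - 1]
    return []

def collect_reviews(num_reviews):
    reviews = []
    seen = set()
    page = 1
    while len(reviews) < num_reviews:
        # Keep only texts that were not collected from earlier pages
        new_texts = [t for t in fetch_page(page) if t not in seen]
        if not new_texts:
            break  # empty or fully repeated page: stop instead of looping forever
        seen.update(new_texts)
        reviews.extend(new_texts)
        page += 1
    return reviews[:num_reviews]

collected = collect_reviews(num_reviews=10)
print(len(collected), collected)
```

Tracking the texts already seen also guards against counting the same review many times if a `page` parameter turns out to be ignored by the server.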


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [3]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download NLTK data (stopwords and lemmatization data)
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file with user reviews
csv_file_name = "Top Gun: Maverick_user_reviews.csv"
df = pd.read_csv(csv_file_name)

# Build the stopword set, stemmer, and lemmatizer once and reuse them for every review
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define functions for text cleaning
def clean_text(text):
    # Remove special characters and punctuation
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])

    # Remove numbers
    text = ''.join([char for char in text if not char.isdigit()])

    # Lowercase the text
    text = text.lower()

    return text

def remove_stopwords(text):
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)

def stem_text(text):
    words = [stemmer.stem(word) for word in text.split()]
    return ' '.join(words)

def lemmatize_text(text):
    words = [lemmatizer.lemmatize(word) for word in text.split()]
    return ' '.join(words)

# Apply the cleaning functions to the "User Reviews" column
df['Cleaned Reviews'] = df['User Reviews'].apply(clean_text)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(remove_stopwords)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(stem_text)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(lemmatize_text)

# Save the cleaned data to a new CSV file
cleaned_csv_file_name = "Top_Gun_Maverick_cleaned_reviews.csv"
df.to_csv(cleaned_csv_file_name, index=False)
print(f"Cleaned data saved to {cleaned_csv_file_name}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data saved to Top_Gun_Maverick_cleaned_reviews.csv
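
To make the first four cleaning steps concrete, here is a small standard-library sketch that runs them on a single made-up sentence. It is an illustration under stated assumptions, not the notebook's implementation: `basic_clean` and the tiny `STOP_WORDS` set are hypothetical, the notebook uses NLTK's full English stopword list, and stemming and lemmatization are left out because they need NLTK.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
STOP_WORDS = {"the", "a", "is", "of", "and", "it", "in", "was"}

def basic_clean(text):
    # Replace everything except ASCII letters and whitespace with a space
    # (drops punctuation, special characters, and digits in one pass)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Lowercase, split on whitespace, and drop stopwords
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

sample = "The sequel is AMAZING: 10/10, and a well-made film!"
print(basic_clean(sample))  # sequel amazing well made film
```

One design difference is worth noting: this sketch replaces punctuation with a space, so "well-made" becomes two words, whereas deleting the characters outright, as `clean_text` does, turns it into the single token "wellmade".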


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [9]:
import pandas as pd
import spacy

# Load the cleaned data from the CSV file
cleaned_csv_file_name = "Top_Gun_Maverick_cleaned_reviews.csv"
df = pd.read_csv(cleaned_csv_file_name)

# Initialize spaCy
nlp = spacy.load("en_core_web_sm")

# Function to perform POS tagging and count POS categories
def pos_tagging(text):
    doc = nlp(text)
    pos_counts = {"Noun": 0, "Verb": 0, "Adjective": 0, "Adverb": 0}

    for token in doc:
        if token.pos_ == "NOUN":
            pos_counts["Noun"] += 1
        elif token.pos_ == "VERB":
            pos_counts["Verb"] += 1
        elif token.pos_ == "ADJ":
            pos_counts["Adjective"] += 1
        elif token.pos_ == "ADV":
            pos_counts["Adverb"] += 1

    return pos_counts

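# Function to build a bracketed string of each token and its dependency label.
# Note: en_core_web_sm has no constituency parser, so this is not a true constituency tree.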
def constituency_parsing(text):
    doc = nlp(text)
    constituency_tree = ""

    for sent in doc.sents:
        for token in sent:
            constituency_tree += f"({token.text} ({token.dep_} "
        constituency_tree += ")"

    return constituency_tree

# Function to perform dependency parsing
def dependency_parsing(text):
    doc = nlp(text)

    for sent in doc.sents:
        for token in sent:
            print(token.text, token.dep_, token.head.text)

# Function to perform Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = {}

    for ent in doc.ents:
        entity_type = ent.label_
        if entity_type in entities:
            entities[entity_type] += 1
        else:
            entities[entity_type] = 1

    return entities

# Example sentence for explanation
example_sentence = df['Cleaned Reviews'][0]

# Perform POS tagging and count POS categories
pos_counts = pos_tagging(example_sentence)
print("POS Tagging:", pos_counts)

# Perform constituency parsing
constituency_tree = constituency_parsing(example_sentence)
print("Constituency Parsing Tree:", constituency_tree)

# Perform dependency parsing
print("Dependency Parsing:")
dependency_parsing(example_sentence)

# Perform Named Entity Recognition (NER)
entities = named_entity_recognition(example_sentence)
print("Named Entity Recognition:", entities)


POS Tagging: {'Noun': 91, 'Verb': 33, 'Adjective': 25, 'Adverb': 9}
Constituency Parsing Tree: (one (nummod (memor (compound (line (compound (origin (nmod (top (amod (gun (compound (maverick (nsubj (get (ROOT (chew (nsubj (superior (amod (tell (ccomp (son (compound (ego (compound (write (compound (check (compound (bodi (dobj (ca (aux (nt (neg (cash (advcl (sometim (compound (wonder (dobj (tom (compound (cruis (nsubj (took (conj (putdown (amod (person (compound (challeng (compound (movi (compound (star (nsubj (seem (ccomp (work (nsubj (harder (advmod (push (xcomp (cruis (amod (day (npadvmod (ridicul (compound (entertain (nmod (top (compound (gun (compound (maverick (compound (cruis (compound (earli (nsubj (first (advmod (play (conj (pete (amod (maverick (compound (mitchel (compound (cocki (compound (young (compound (navi (compound (pilot (compound (aviat (compound (sunglass (compound (kawasaki (compound (motorcycl (nsubj (need (ccomp (speed (compound (sequel (dobj (he (nsubj (arrog (rel
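
The cell above analyses only the first review, while the task asks for totals. Because `pos_tagging` and `named_entity_recognition` each return a dictionary of counts for one text, corpus-wide totals can be obtained by summing those dictionaries. The following is a minimal sketch using `collections.Counter`. The per-review dictionaries are made-up illustrative values, not results from the dataset, and `total_counts` is a hypothetical helper.

```python
from collections import Counter

# Made-up per-review results in the same shape that pos_tagging() returns
per_review_pos = [
    {"Noun": 5, "Verb": 2, "Adjective": 1, "Adverb": 0},
    {"Noun": 3, "Verb": 4, "Adjective": 2, "Adverb": 1},
]

# Made-up per-review results in the same shape that named_entity_recognition() returns
per_review_entities = [
    {"PERSON": 2, "DATE": 1},
    {"PERSON": 1, "ORG": 1},
]

def total_counts(dicts):
    """Sum a sequence of {label: count} dictionaries into one Counter."""
    totals = Counter()
    for d in dicts:
        totals.update(d)  # Counter.update adds counts rather than replacing them
    return totals

pos_totals = total_counts(per_review_pos)
entity_totals = total_counts(per_review_entities)
print(dict(pos_totals))
print(dict(entity_totals))
```

In the notebook, the input lists could plausibly come from `df['Cleaned Reviews'].apply(pos_tagging)` and `df['Cleaned Reviews'].apply(named_entity_recognition)`, since `apply` returns one dictionary per review.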

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

**Constituency Parsing Tree:** Constituency parsing exposes the hierarchical structure of a sentence by breaking it into nested phrases, called constituents. Each phrase is a node in the constituency parse tree, and each node is labeled with the type of phrase it represents, such as a noun phrase (NP), verb phrase (VP), or prepositional phrase (PP). The root node usually represents the complete sentence, the internal nodes show how phrases are nested and ordered within it, and the leaves are the individual words. Reading the tree from the root down therefore shows how the words group into phrases and how those phrases combine to form the sentence.
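
As a concrete illustration of the structure described above, the sketch below encodes a constituency tree for the short sentence "The pilot flew the jet" as nested tuples and prints it in bracketed notation. The tree is annotated by hand for illustration; it is not the output of a parser, and the helper names `to_brackets`, `leaves`, and `phrases` are hypothetical.

```python
# Hand-annotated constituency tree for "The pilot flew the jet".
# Each node is (label, child, child, ...); a leaf is (POS tag, word).
tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "pilot")),
    ("VP", ("VBD", "flew"), ("NP", ("DT", "the"), ("NN", "jet"))),
)

def is_leaf(node):
    """A leaf holds a tag and a single word string."""
    return len(node) == 2 and isinstance(node[1], str)

def to_brackets(node):
    """Render a nested-tuple tree in bracketed notation."""
    if is_leaf(node):
        return f"({node[0]} {node[1]})"
    label, *children = node
    return f"({label} " + " ".join(to_brackets(c) for c in children) + ")"

def leaves(node):
    """Return the words under a node, left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]

def phrases(node, wanted):
    """Collect the word sequences covered by every phrase labeled `wanted`."""
    if is_leaf(node):
        return []
    found = [" ".join(leaves(node))] if node[0] == wanted else []
    for child in node[1:]:
        found.extend(phrases(child, wanted))
    return found

print(to_brackets(tree))
print(phrases(tree, "NP"))
```

The sentence node S splits into a noun phrase and a verb phrase, and the verb phrase in turn contains a second noun phrase, which is the nesting that the explanation refers to.
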

**Dependency Parsing Tree:** Dependency parsing is another method of syntactic analysis. It focuses on the grammatical relations between individual words rather than on phrases. Each word is a node in the dependency tree, and each edge links a dependent word to its head. A verb, for instance, typically heads its subject, its object, and its modifiers, and a preposition attaches to the word that its phrase modifies. Each edge carries a label naming the relation, such as "nsubj" for nominal subject or "amod" for adjectival modifier. In the output printed above, for example, "get" carries the ROOT label and "maverick" carries the "nsubj" label, which marks it as a nominal subject. Reading the labeled edges shows which role each word plays in the sentence.
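
To show how the head and label information can be read, the sketch below stores dependencies for "The pilot flew the jet" as (word, relation, head) triples, the same three fields that `dependency_parsing` prints for each token. The annotations are written by hand for illustration rather than produced by spaCy, and `find_root` and `dependents_of` are hypothetical helpers.

```python
# Hand-annotated dependencies for "The pilot flew the jet":
# (word, relation label, head word)
dependencies = [
    ("The", "det", "pilot"),
    ("pilot", "nsubj", "flew"),
    ("flew", "ROOT", "flew"),  # the root is listed as its own head, as spaCy reports it
    ("the", "det", "jet"),
    ("jet", "dobj", "flew"),
]

def find_root(deps):
    """Return the word carrying the ROOT relation."""
    return next(word for word, rel, head in deps if rel == "ROOT")

def dependents_of(deps, head_word):
    """Return (word, relation) pairs for the direct dependents of head_word."""
    return [
        (word, rel)
        for word, rel, head in deps
        if head == head_word and rel != "ROOT"
    ]

root = find_root(dependencies)
print(root, dependents_of(dependencies, root))
print(dependents_of(dependencies, "jet"))
```

Matching heads by word text works here only because no word repeats in the sentence; with real parser output, token positions are a safer key.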