<a href="https://colab.research.google.com/github/Maheshwar405/Maheshwar-Reddy_INFO5731_Fall2023/blob/main/Boyalla_MaheshwarReddy_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API, then clean the text and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [1]:
# Write your code here
import requests
from bs4 import BeautifulSoup
import csv

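def scrape_reviews(url, num_pages, output_file):
    """Scrape username, date, and text for each review and save them to a CSV file."""
    reviews_data = []

    for page_num in range(1, num_pages + 1):
        # NOTE: this assumes the site honors a `page` query parameter
        page_url = f"{url}&page={page_num}"

        # A timeout keeps a stalled request from hanging the whole run
        response = requests.get(page_url, timeout=30)
        if response.status_code != 200:
            print(f"Failed to retrieve page {page_num}. Exiting.")
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        review_containers = soup.find_all('div', class_='lister-item-content')

        for container in review_containers:
            text_tag = container.find('div', class_='text')
            user_tag = container.find('span', class_='display-name-link')
            date_tag = container.find('span', class_='review-date')

            # Skip containers that lack a field instead of raising AttributeError
            if text_tag is None or user_tag is None or date_tag is None:
                continue

            reviews_data.append([user_tag.get_text(), date_tag.get_text(), text_tag.get_text()])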

    if reviews_data:
        with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Username', 'Review Date', 'Review Text'])  # Header
            csv_writer.writerows(reviews_data)

        print(f"{len(reviews_data)} reviews have been successfully scraped and saved to '{output_file}'.")
    else:
        print("No reviews found on the pages.")

if __name__ == "__main__":
    movie_url = 'https://www.imdb.com/title/tt9603212/reviews/?ref_=tt_ql_2'
    num_pages_to_scrape = 500  # Adjust this number as needed
    output_csv_file = 'movie_reviews.csv'

    scrape_reviews(movie_url, num_pages_to_scrape, output_csv_file)




12500 reviews have been successfully scraped and saved to 'movie_reviews.csv'.
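
The count above equals exactly 500 pages times 25 reviews, which is also what one would see if some requests returned overlapping results. As a precaution, exact duplicates can be dropped before the rows are written to the CSV file. The following is a minimal sketch and not part of the original solution: `dedupe_reviews` is a hypothetical helper, and it assumes each row is a `[username, review_date, review_text]` list as built in `scrape_reviews`.

```python
def dedupe_reviews(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Made-up sample rows for illustration only
sample_rows = [
    ["alice", "1 July 2023", "Great stunts."],
    ["bob", "2 July 2023", "Too long."],
    ["alice", "1 July 2023", "Great stunts."],  # repeated row
]
unique = dedupe_reviews(sample_rows)
print(len(sample_rows), "rows in,", len(unique), "rows out")
```

Comparing the two lengths gives a quick check on whether the scraped pages actually contained distinct reviews.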


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write your code here
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re

# Download NLTK resources if not already installed
nltk.download('stopwords')
nltk.download('wordnet')

# Read the CSV file containing the raw data
df = pd.read_csv('movie_reviews.csv')

# Build the stopword set, stemmer, and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define a function for text cleaning
def clean_text(text):
    # Remove special characters and punctuation (keeps only letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove numbers (already covered by the pattern above; kept to mirror step 2)
    text = re.sub(r'\d+', '', text)

    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Apply stemming, then lemmatize the stemmed tokens
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]

    # Join the cleaned tokens to form the cleaned text
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text

# Apply the clean_text function to the 'Review Text' column
df['Cleaned Text'] = df['Review Text'].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv('amazon_reviews_cleaned.csv', index=False)

print("Text data has been cleaned and saved to 'amazon_reviews_cleaned.csv'.")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Text data has been cleaned and saved to 'amazon_reviews_cleaned.csv'.
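
The first four cleaning steps (noise, numbers, lowercasing, stopwords) can also be written with the standard library alone, which makes the order of operations easy to check on a single string. This is a simplified sketch, not the code used above: `clean_basic` and the three-word `DEMO_STOPWORDS` set are hypothetical, whitespace splitting stands in for `nltk.word_tokenize`, and stemming and lemmatization are omitted.

```python
import re

# Tiny illustrative stopword set (an assumption; pass the full list in practice)
DEMO_STOPWORDS = {"the", "was", "and"}


def clean_basic(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, drop non-letters (punctuation and digits), and remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # keep only letters and whitespace
    tokens = text.split()  # simple whitespace tokenization
    return " ".join(tok for tok in tokens if tok not in stop_words)


cleaned = clean_basic("The 3rd act was AMAZING, and the stunts... wow!")
print(cleaned)  # rd act amazing stunts wow
```

Deleting digits inside a token leaves fragments such as `rd` from `3rd`. The regular expression in `clean_text` has the same effect.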


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.

In [5]:
# Write your code here
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Read the cleaned text data
df = pd.read_csv('amazon_reviews_cleaned.csv')

# (1) Parts of Speech (POS) Tagging
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

for text in df['Cleaned Text']:
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "NOUN":
            noun_count += 1
        elif token.pos_ == "VERB":
            verb_count += 1
        elif token.pos_ == "ADJ":
            adj_count += 1
        elif token.pos_ == "ADV":
            adv_count += 1

print(f"Noun Count: {noun_count}")
print(f"Verb Count: {verb_count}")
print(f"Adjective Count: {adj_count}")
print(f"Adverb Count: {adv_count}")

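# (2) Constituency Parsing and Dependency Parsing (using one review as an example)
sample_text = df['Cleaned Text'].iloc[0]  # Take the first cleaned review as an example

# NOTE: spaCy's en_core_web_sm pipeline provides a dependency parser, not a constituency parser.
# The loop under the "Constituency" heading therefore prints each token's dependency
# label (token.dep_), not a phrase-structure tree.
sample_doc = nlp(sample_text)
print("\nConstituency Parsing Tree:")
for token in sample_doc:
    print(f"{token.text} [{token.dep_}]", end=" -> ")
print()

# Dependency Parsing Tree: each token is printed with its syntactic head
print("\nDependency Parsing Tree:")
for token in sample_doc:
    print(f"{token.text} [{token.head.text}]", end=" -> ")
print()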

# (3) Named Entity Recognition
entities = {
    "PERSON": 0,
    "ORG": 0,
    "LOC": 0,
    "PRODUCT": 0,
    "DATE": 0
}

for text in df['Cleaned Text']:
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_] += 1

print("\nNamed Entity Counts:")
for entity, count in entities.items():
    print(f"{entity}: {count}")






Noun Count: 707500
Verb Count: 300000
Adjective Count: 297500
Adverb Count: 83500

Constituency Parsing Tree:
man [nsubj] -> wish [ROOT] -> love [compound] -> movi [nsubj] -> do [aux] -> nt [neg] -> get [ccomp] -> wrong [amod] -> solid [amod] -> action [compound] -> movi [compound] -> jawdrop [compound] -> stunt [dobj] -> best [amod] -> seri [amod] -> mission [compound] -> imposs [compound] -> movi [nsubj] -> felt [conj] -> like [prep] -> small [amod] -> step [pobj] -> backward [advmod] -> franchis [det] -> fallout [npadvmod] -> mindblow [amod] -> action [compound] -> sequenc [compound] -> stunt [compound] -> work [dobj] -> along [prep] -> develop [xcomp] -> ethan [compound] -> relationship [dobj] -> ilsa [compound] -> provid [compound] -> closur [compound] -> julia [compound] -> show [compound] -> length [compound] -> ethan [nsubj] -> would [aux] -> go [ccomp] -> protect [advcl] -> closest [amod] -> battl [compound] -> impos [compound] -> villain [nsubj] -> dead [nsubj] -> reckon [cco

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency parsing and dependency parsing are two different approaches to analyzing the grammatical structure of a sentence.

# Constituency Parsing Tree:

Constituency parsing, also known as phrase structure parsing, represents a sentence as a hierarchical tree. The root node stands for the entire sentence, the internal nodes stand for phrases such as noun phrases (NP) and verb phrases (VP), and the leaves are the individual words. Each phrase branches into smaller phrases until only words remain.

The constituency parsing tree shows how words group into phrases and how those phrases nest inside one another.

# Dependency Parsing Tree:

Dependency parsing, on the other hand, represents a sentence as a directed graph in which the words (tokens) are nodes and the grammatical relationships between them are labeled arcs. Each word typically has a single head (the word it depends on), and each arc label names the relation, such as subject, object, or modifier.

Dependency parsing focuses on word-to-word relations rather than on phrases, which makes it convenient for tasks such as finding the subject and object of a verb. For example, in the output above, `wish [ROOT]` marks the root of the parse and `man [nsubj]` marks a nominal subject.
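
To make the contrast concrete, the sentence "Peter prefers the flight from Denver" can be written out in both representations using plain Python data structures. This is a hand-built sketch rather than parser output: the constituency tree follows a small toy grammar (S -> NP VP, VP -> V NP, NP -> NP PP | Det N, PP -> P NP), and the dependency arcs and their labels are a manual annotation that should be read as an assumption.

```python
# Constituency tree as nested tuples: (label, child, child, ...); leaves are strings
constituency = (
    "S",
    ("NP", "Peter"),
    ("VP",
        ("V", "prefers"),
        ("NP",
            ("NP", ("Det", "the"), ("N", "flight")),
            ("PP", ("P", "from"), ("NP", "Denver")))),
)

# Dependency arcs as (dependent, relation, head); hand-annotated, not parser output
dependencies = [
    ("Peter", "nsubj", "prefers"),
    ("prefers", "ROOT", "prefers"),
    ("the", "det", "flight"),
    ("flight", "dobj", "prefers"),
    ("from", "prep", "flight"),
    ("Denver", "pobj", "from"),
]


def leaves(tree):
    """Collect the words of a nested-tuple tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the phrase label
        words.extend(leaves(child))
    return words


def bracketed(tree):
    """Render a nested-tuple tree in bracket notation."""
    if isinstance(tree, str):
        return tree
    return "(" + tree[0] + " " + " ".join(bracketed(c) for c in tree[1:]) + ")"


print(bracketed(constituency))
print(leaves(constituency))
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head}")
```

The tree groups "the flight from Denver" into a single NP, while the arcs express the same attachment word by word: "from" depends on "flight", and "Denver" depends on "from".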


In [9]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [11]:
pip install svgling

Collecting svgling
  Downloading svgling-0.4.0-py3-none-any.whl (23 kB)
Collecting svgwrite (from svgling)
  Downloading svgwrite-1.4.3-py3-none-any.whl (67 kB)
Installing collected packages: svgwrite, svgling
Successfully installed svgling-0.4.0 svgwrite-1.4.3


In [13]:
import spacy
import nltk
from nltk import word_tokenize, pos_tag, pos_tag_sents
from collections import Counter
from nltk.parse import ChartParser
import svgling

# Load the spaCy model for Named Entity Recognition
nlp = spacy.load("en_core_web_sm")

# Load your cleaned data from the CSV file
df = pd.read_csv('amazon_reviews_cleaned.csv')

# (1) Parts of Speech (POS) Tagging
# Tag Parts of Speech of each word in the text
pos = [nltk.pos_tag(word_tokenize(sent)) for sent in df['Cleaned Text']]
print(pos, end="\n\n")

# Count Penn Treebank tags per review (one Counter per review, not yet corpus-wide totals for N, V, Adj, Adv)
counts = [Counter(tag for _, tag in tags) for tags in pos]
print(counts, end="\n\n")

# (2) Constituency Parsing
# Define a simple context-free grammar for constituency parsing
cgrammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP
    PP -> P NP
    NP -> NP PP | Det N | 'Peter' | 'Denver'
    V -> 'prefers'
    P -> 'from'
    N -> 'flight'
    Det -> 'the'
""")

# Example sentence for constituency parsing
sentence = ['Peter', 'prefers', 'the', 'flight', 'from', 'Denver']

# Using Chart Parser for constituency parsing
cparser = ChartParser(cgrammar)
for tree in cparser.parse(sentence):
    print(tree, end="\n")

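# Drawing the constituency parsing tree; display() is needed because this is not the cell's last expression
from IPython.display import display
display(svgling.draw_tree(tree))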

# (3) Named Entity Recognition
named_entities = []
for sentence in df['Cleaned Text']:
    doc = nlp(sentence)
    for entity in doc.ents:
        if entity.text and entity.label_:
            named_entities.append((entity.text, entity.label_))

# Count the occurrences of each named entity type
entity_counts = Counter(entity[1] for entity in named_entities)
print(named_entities, end="\n")
print(entity_counts)


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[Counter({'NN': 58, 'JJ': 19, 'VBP': 8, 'VB': 6, 'CD': 5, 'VBD': 3, 'NNS': 3, 'FW': 2, 'JJS': 2, 'IN': 2, 'RB': 2, 'MD': 2, 'VBZ': 1, 'VBN': 1}), Counter({'NN': 35, 'JJ': 6, 'VBD': 4, 'VBP': 4, 'JJS': 3, 'CD': 2, 'VB': 2, 'VBN': 2, 'RB': 1, 'IN': 1, 'VBZ': 1}), Counter({'NN': 141, 'JJ': 46, 'RB': 11, 'VBP': 11, 'VB': 8, 'JJS': 6, 'NNS': 5, 'CD': 4, 'VBD': 3, 'VBN': 3, 'IN': 3, 'VBZ': 2, 'FW': 2, 'MD': 1, 'RBR': 1, 'WP': 1, 'EX': 1, 'JJR': 1}), Counter({'NN': 35, 'JJ': 18, 'VBD': 3, 'RB': 3, 'IN': 1, 'MD': 1, 'VB': 1, 'VBP': 1, 'VBN': 1, 'VBG': 1}), Counter({'NN': 69, 'JJ': 25, 'VBP': 8, 'RB': 5, 'VB': 5, 'IN': 4, 'VBD': 2, 'VBN': 1, 'VBG': 1, 'NNS': 1, 'PRP$': 1, 'MD': 1}), Counter({'NN': 58, 'JJ': 19, 'VBP': 8, 'RB': 7, 'VB': 4, 'IN': 4, 'JJS': 3, 'VBD': 3, 'CD': 3, 'VBN': 2, 'NNS': 2, 'VBG': 2, 'MD': 1, 'FW': 1, 'CC': 1}), Counter({'NN': 87, 'JJ': 31, 'RB': 8, 'VBP': 7, 'VBD': 6, 'IN': 5, 'VB': 4, 'VBZ': 2, 'CD': 2, 'FW': 1, 'CC': 1, 'MD': 1, 'DT': 1, 'NNS': 1, 'RBR': 1, 'VBN': 1, 'W

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
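
The per-review `Counter` objects printed above use Penn Treebank tags, so the four totals that the question asks for still have to be summed. One possible way to do that is sketched below, under the assumption that noun, verb, adjective, and adverb tags start with `NN`, `VB`, `JJ`, and `RB` respectively. `coarse_totals` is a hypothetical helper, and the demo input is only the first two counters from the output above, not the full dataset.

```python
from collections import Counter

# Assumed mapping from Penn Treebank tag prefixes to coarse classes
PREFIX_TO_CLASS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def coarse_totals(per_review_counts):
    """Sum per-review tag counters into overall N/V/Adj/Adv totals."""
    totals = Counter()
    for counts in per_review_counts:
        for tag, n in counts.items():
            coarse = PREFIX_TO_CLASS.get(tag[:2])  # e.g. 'VBD' -> 'VB' -> 'Verb'
            if coarse is not None:
                totals[coarse] += n
    return totals


# First two counters copied from the printed output
demo_counts = [
    Counter({'NN': 58, 'JJ': 19, 'VBP': 8, 'VB': 6, 'CD': 5, 'VBD': 3, 'NNS': 3,
             'FW': 2, 'JJS': 2, 'IN': 2, 'RB': 2, 'MD': 2, 'VBZ': 1, 'VBN': 1}),
    Counter({'NN': 35, 'JJ': 6, 'VBD': 4, 'VBP': 4, 'JJS': 3, 'CD': 2, 'VB': 2,
             'VBN': 2, 'RB': 1, 'IN': 1, 'VBZ': 1}),
]
totals = coarse_totals(demo_counts)
for name in ("Noun", "Verb", "Adjective", "Adverb"):
    print(f"{name}: {totals[name]}")
```

Printing only these four totals instead of the full tag lists also keeps the cell output small, which avoids the data rate limit reported above.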

