
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film from 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [1]:
# Write your code here


import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the IMDb page with user reviews
url = 'https://www.imdb.com/title/tt1877830/reviews/?ref_=tt_ql_2'

# Initialize empty lists to store data
usernames = []
ratings = []
review_dates = []
reviews = []

# Loop to collect data from multiple pages (adjust the range accordingly)
for page in range(1, 501):  # 500 requests in total
    # The offset assumes 10 reviews per page; url already contains '?', so '&' is the right separator
    page_url = f"{url}&start={10 * (page - 1)}"

    # Send an HTTP GET request to the page
    response = requests.get(page_url)

    if response.status_code != 200:
        print(f"Failed to retrieve data from page {page}")
        continue

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract user reviews and related data
    user_review_elements = soup.find_all('div', class_='text show-more__control')

    for element in user_review_elements:
        username = element.find_previous('a', href=True).text
        rating = element.find_previous('span', class_='point-scale').text.strip()
        review_date = element.find_previous('span', class_='review-date').text.strip()
        review_text = element.text.strip()

        usernames.append(username)
        ratings.append(rating)
        review_dates.append(review_date)
        reviews.append(review_text)

# Create a list of dictionaries to store the data
data = [
    {'Username': u, 'Rating': r, 'Review Date': d, 'Review Text': t}
    for u, r, d, t in zip(usernames, ratings, review_dates, reviews)
]

# Write the data to a CSV file
with open('film_reviews.csv', 'w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['Username', 'Rating', 'Review Date', 'Review Text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data)

print(f"Data from {len(usernames)} user reviews collected and saved to film_reviews.csv.")


Data from 12500 user reviews collected and saved to film_reviews.csv.
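
The reported total of 12500 equals 500 requests times 25 reviews, so it may be worth checking whether some requests returned the same reviews. A minimal sketch of such a check, assuming rows are dictionaries with the same keys as the CSV above; `dedupe_reviews` and the sample rows are hypothetical and not part of the original script:

```python
def dedupe_reviews(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        # Identify a review by its author, date, and text (an assumed key).
        key = (row['Username'], row['Review Date'], row['Review Text'])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows for illustration only.
sample_rows = [
    {'Username': 'a', 'Rating': '8', 'Review Date': '1 May 2022', 'Review Text': 'Great film'},
    {'Username': 'a', 'Rating': '8', 'Review Date': '1 May 2022', 'Review Text': 'Great film'},
    {'Username': 'b', 'Rating': '6', 'Review Date': '2 May 2022', 'Review Text': 'Too long'},
]
unique_rows = dedupe_reviews(sample_rows)
print(f"{len(sample_rows)} rows collected, {len(unique_rows)} unique")
```

If the unique count is much smaller than the collected count, the pagination parameter is probably not advancing through the reviews.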


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write your code here

import csv
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download NLTK stopwords
nltk.download('stopwords')
nltk.download('punkt')

# Initialize the NLTK Porter Stemmer
stemmer = PorterStemmer()

# Initialize spaCy for lemmatization
nlp = spacy.load("en_core_web_sm")

# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

# Define a function for text cleaning
def clean_text(text):
    # Remove special characters, punctuation, and numbers
    text = re.sub(r'[^A-Za-z]+', ' ', text)

    # Tokenize the text
    words = word_tokenize(text)

    # Remove stopwords
    words = [word for word in words if word.lower() not in stop_words]

    # Convert to lowercase
    words = [word.lower() for word in words]

    # Apply stemming
    words = [stemmer.stem(word) for word in words]

    # Apply lemmatization
    doc = nlp(" ".join(words))
    words = [token.lemma_ for token in doc]

    return " ".join(words)

# Read the CSV file with the original data
input_file = 'film_reviews.csv'
output_file = 'cleaned_film_reviews.csv'

with open(input_file, 'r', newline='', encoding='utf-8') as csv_file:
    reader = csv.DictReader(csv_file)
    rows = list(reader)

# Create a new list of dictionaries with cleaned data
cleaned_data = []
for row in rows:
    cleaned_text = clean_text(row['Review Text'])
    row['Cleaned Review'] = cleaned_text
    cleaned_data.append(row)

# Write the cleaned data to a new CSV file
with open(output_file, 'w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['Username', 'Rating', 'Review Date', 'Review Text', 'Cleaned Review']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(cleaned_data)

print(f"Cleaned data saved to {output_file}.")




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Cleaned data saved to cleaned_film_reviews.csv.
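
To make the effect of steps (1) to (4) visible on a single string, here is a small sketch that uses only the standard library. The `STOP_WORDS` set is a tiny stand-in chosen for illustration (the cell above uses NLTK's English list), and `basic_clean` is a hypothetical helper that stops before stemming and lemmatization:

```python
import re

# A tiny stand-in stopword set; the real pipeline would load a full list.
STOP_WORDS = {"the", "a", "is", "was", "and", "of", "in", "it"}


def basic_clean(text):
    """Apply steps (1)-(4): strip non-letters, lowercase, drop stopwords."""
    # Steps (1) and (2): replace anything that is not a letter with a space.
    letters_only = re.sub(r'[^A-Za-z]+', ' ', text)
    # Step (4): lowercase before the stopword check so matching is case-insensitive.
    tokens = letters_only.lower().split()
    # Step (3): remove stopwords.
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(basic_clean("The Batman (2022) was GREAT, and it's 3 hours of dark fun!"))
# → ['batman', 'great', 's', 'hours', 'dark', 'fun']
```

Note the stray `s` left behind by `it's`: removing the apostrophe splits contractions, which is one reason a full stopword list usually includes such fragments.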


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean text, and calculate the count of each entity.

In [3]:
# Write your code here

import spacy
import pandas as pd

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Read the cleaned text data
df = pd.read_csv('cleaned_film_reviews.csv')

# Initialize counts for POS tagging
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Select the example text for constituency and dependency parsing
sample_text = df['Cleaned Review'].iloc[0]  # Take the first cleaned review as the example

# (1) Parts of Speech (POS) Tagging
for text in df['Cleaned Review']:
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "NOUN":
            noun_count += 1
        elif token.pos_ == "VERB":
            verb_count += 1
        elif token.pos_ == "ADJ":
            adj_count += 1
        elif token.pos_ == "ADV":
            adv_count += 1

# (2) Constituency Parsing and Dependency Parsing (using one review as an example)
sample_doc = nlp(sample_text)

# Constituency Parsing Tree
# Note: spaCy has no built-in constituency parser, so this loop prints each
# token's dependency label (token.dep_) rather than phrase-structure groupings.
print("\nConstituency Parsing Tree:")
for token in sample_doc:
    print(f"{token.text} [{token.dep_}]", end=" -> ")
print()

# Dependency Parsing Tree: print each token next to its syntactic head
print("\nDependency Parsing Tree:")
for token in sample_doc:
    print(f"{token.text} [{token.head.text}]", end=" -> ")
print()

# (3) Named Entity Recognition
entities = {
    "PERSON": 0,
    "ORG": 0,
    "GPE": 0,
    "PRODUCT": 0,
    "DATE": 0
}

for text in df['Cleaned Review']:
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_] += 1

# Print Named Entity Counts
print("\nNamed Entity Counts:")
for entity, count in entities.items():
    print(f"{entity}: {count}")

# Print POS tagging counts
print(f"Noun Count: {noun_count}")
print(f"Verb Count: {verb_count}")
print(f"Adjective Count: {adj_count}")
print(f"Adverb Count: {adv_count}")






Constituency Parsing Tree:
detect [ROOT] -> batman [compound] -> peak [compound] -> great [amod] -> storylin [compound] -> dark [dobj] -> univer [ccomp] -> come [conj] -> expect [dep] -> dc [compound] -> gloomi [nsubj] -> gritti [compound] -> dark [compound] -> tone [compound] -> film [compound] -> exactli [compound] -> want [nsubj] -> think [ccomp] -> movi [nsubj] -> beauti [nsubj] -> cinematographi [ccomp] -> great [amod] -> score [dobj] -> 

Dependency Parsing Tree:
detect [detect] -> batman [peak] -> peak [dark] -> great [storylin] -> storylin [dark] -> dark [detect] -> univer [detect] -> come [detect] -> expect [detect] -> dc [gloomi] -> gloomi [gritti] -> gritti [want] -> dark [tone] -> tone [exactli] -> film [exactli] -> exactli [want] -> want [think] -> think [expect] -> movi [cinematographi] -> beauti [cinematographi] -> cinematographi [think] -> great [score] -> score [cinematographi] -> 

Named Entity Counts:
PERSON: 95000
ORG: 13500
GPE: 5500
PRODUCT: 500
DATE: 4000
Noun C
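
The four `if`/`elif` counters in the cell above can also be tallied with `collections.Counter`. The sketch below assumes the tokens arrive as `(text, pos)` pairs; `count_pos` and the hand-tagged sample are hypothetical, chosen so the example runs without loading a spaCy model:

```python
from collections import Counter

# Coarse tags the question asks about (spaCy's universal POS names).
TARGET_TAGS = ("NOUN", "VERB", "ADJ", "ADV")


def count_pos(tagged_tokens):
    """Count target POS tags in an iterable of (text, pos) pairs."""
    counts = Counter(pos for _, pos in tagged_tokens if pos in TARGET_TAGS)
    # Report every target tag, including those that never occurred.
    return {tag: counts[tag] for tag in TARGET_TAGS}


# Hand-tagged example; with spaCy the pairs would come from
# ((token.text, token.pos_) for token in nlp(text)).
tagged = [("dark", "ADJ"), ("tone", "NOUN"), ("film", "NOUN"),
          ("really", "ADV"), ("works", "VERB"), (".", "PUNCT")]
print(count_pos(tagged))
# → {'NOUN': 2, 'VERB': 1, 'ADJ': 1, 'ADV': 1}
```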

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Dependency parse trees and constituency parse trees both play a key role in analyzing sentence structure. Dependency parsing treats each word of a sentence as a node and the grammatical relations between words as directed links from a head word to its dependents. Constituency parsing, by contrast, is based on a context-free (phrase-structure) grammar and groups words into nested phrases.

A constituency parse tree shows how the words in a sentence group into phrases (constituents), such as noun phrases and verb phrases, which in turn combine to form the sentence. It follows a hierarchical model: the leaves are the words with their POS tags, and the internal nodes are phrase labels that cluster those words together.
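
As an illustration of this hierarchy, the sketch below stores a constituency tree as nested tuples and prints it in bracketed notation. The tree is built by hand for the sentence "The film has a dark tone" and is not parser output; `to_brackets` and `leaves` are hypothetical helpers:

```python
def to_brackets(node):
    """Render a nested-tuple constituency tree in bracketed notation."""
    label, *children = node
    # A leaf is (POS tag, word); an internal node is (phrase label, subtree, ...).
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(to_brackets(child) for child in children) + ")"


def leaves(node):
    """Return the words of the tree in sentence order."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]


# Hand-built tree for "The film has a dark tone" (not parser output).
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "film")),
        ("VP", ("VBZ", "has"),
               ("NP", ("DT", "a"), ("JJ", "dark"), ("NN", "tone"))))

print(to_brackets(tree))
# → (S (NP (DT The) (NN film)) (VP (VBZ has) (NP (DT a) (JJ dark) (NN tone))))
print(leaves(tree))
# → ['The', 'film', 'has', 'a', 'dark', 'tone']
```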

Unlike a constituency parse tree, a dependency parse tree emphasizes the relations between individual words and labels each relation with a type, such as the subject or object of a verb. This is useful in tasks such as information extraction and sentiment analysis. Both kinds of tree help to better understand the structure of a sentence.
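
A dependency tree can be sketched as a list of tokens that each point to their head. The example below is labeled by hand for the sentence "The film has a dark tone" and is not parser output; `render_tree` is a hypothetical helper that indents each word beneath its head:

```python
# Each token: (word, index of its head, dependency label); the root's head is None.
# Hand-labeled for "The film has a dark tone" (not parser output).
TOKENS = [
    ("The", 1, "det"),
    ("film", 2, "nsubj"),
    ("has", None, "ROOT"),
    ("a", 5, "det"),
    ("dark", 5, "amod"),
    ("tone", 2, "dobj"),
]


def render_tree(tokens):
    """Return indented lines showing each word beneath its head."""
    # Map each head index to the indices of its dependents.
    children = {}
    for i, (_, head, _) in enumerate(tokens):
        children.setdefault(head, []).append(i)

    lines = []

    def visit(i, depth):
        word, _, label = tokens[i]
        lines.append("  " * depth + f"{word} [{label}]")
        for child in children.get(i, []):
            visit(child, depth + 1)

    # Start from the root token(s), whose head is None.
    for root in children.get(None, []):
        visit(root, 0)
    return lines


print("\n".join(render_tree(TOKENS)))
```

Every word appears exactly once, and the verb "has" sits at the top because it is the only token without a head.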

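In the trees above, the output printed under "Constituency Parsing Tree" lists each token with its spaCy dependency label rather than with phrase groupings. It shows that the root word is "detect"; the remaining words are nodes attached beneath it, and they describe the film's storyline, cinematography, and so on.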

The dependency parsing output shows the grammatical structure of the sentence by pairing each word with its head, for example "great" with "storylin" and "batman" with "peak". Stemmed words such as "gloomi" and "gritti" describe the dark and gritty nature of the film, and their heads show which words they attach to. This gives us an understanding of the roles that different words play in the sentence.