<a href="https://colab.research.google.com/github/MPrasanna14/prasanna_INFO5731_Fall2023/blob/main/PrasannaMalreddy_INFO5731_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. After that, you will clean the text data and conduct a syntactic analysis of it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film from 2022 or 2023 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [1]:
# Write your code here
import requests
from bs4 import BeautifulSoup
import csv


In [2]:
# Define the IMDB movie URL
movie_url = "https://www.imdb.com/title/tt10648342/reviews/?ref_=tt_ov_rt"

# Function to scrape reviews from a single page
def scrape_reviews(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    reviews = []

    review_containers = soup.find_all("div", class_="lister-item-content")
    for review_container in review_containers:
        title = review_container.find("a", class_="title").get_text(strip=True)
        username = review_container.find("span", class_="display-name-link").find("a").get_text(strip=True)
        review_date = review_container.find("span", class_="review-date").get_text(strip=True)
        review_text = review_container.find("div", class_="text").get_text(strip=True)

        reviews.append([title, username, review_date, review_text])

    return reviews

# Function to scrape reviews from multiple pages
def scrape_reviews_from_multiple_pages(base_url, num_pages):
    all_reviews = []
    for page_num in range(1, num_pages + 1):
        page_url = f"{base_url}&start={10 * (page_num - 1)}"
        reviews = scrape_reviews(page_url)
        all_reviews.extend(reviews)
    return all_reviews

# Specify the number of pages to scrape (adjust as needed)
num_pages_to_scrape = 400  # 400 pages; the run recorded below saved 10000 reviews (25 per page)

# Scrape reviews from multiple pages
all_reviews = scrape_reviews_from_multiple_pages(movie_url, num_pages_to_scrape)

# Save the reviews to a CSV file
csv_file = "movie_reviews.csv"
with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Username", "Review Date", "Review Text"])
    writer.writerows(all_reviews)

print(f"{len(all_reviews)} reviews have been saved to {csv_file}.")

10000 reviews have been saved to movie_reviews.csv.
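
The page count, the `start` step, and the reported total are easy to get out of step, so it can help to check the requested offsets and look for repeated rows before writing the CSV. The sketch below is a minimal illustration and not part of the original notebook: `page_offsets` and `dedupe_reviews` are hypothetical helpers, and the default page size of 10 simply mirrors the `start` step used above.

```python
def page_offsets(num_pages, page_size=10):
    """Return the `start` offset requested for each page (0, 10, 20, ...)."""
    return [page_size * (page_num - 1) for page_num in range(1, num_pages + 1)]


def dedupe_reviews(rows):
    """Drop repeated reviews, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so compare rows as tuples
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    offsets = page_offsets(400)
    print(len(offsets), offsets[:3], offsets[-1])

    # Toy rows in the [title, username, date, text] layout used above.
    sample = [
        ["Good", "alice", "1 July 2022", "Liked it."],
        ["Bad", "bob", "2 July 2022", "Did not like it."],
        ["Good", "alice", "1 July 2022", "Liked it."],
    ]
    print(len(dedupe_reviews(sample)))
```

If a site ignores the `start` parameter and returns the same page each time, the deduplicated count will be much smaller than the raw count, which is worth checking before any analysis.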


In [3]:
# Print the head of the dataset
head_reviews = all_reviews[:5]  # Print the first 5 reviews as an example
for review in head_reviews:
    print("Title:", review[0])
    print("Username:", review[1])
    print("Review Date:", review[2])
    print("Review Text:", review[3])
    print()

Title: Enjoyable but empty
Username: masonsaul
Review Date: 7 July 2022
Review Text: Thor: Love and Thunder does attempt to explore themes of love and loss whilst introducing the Mighty Thor and putting Thor on a journey of self discovery. However, it sadly doesn't work as well as it should due to a rushed pace and way too many jokes that are almost never funny.Chris Hemsworth is still going strong as Thor but the extreme goofiness is getting a little stale. Natalie Portman has never been better as this character and Tessa Thompson is still great as Valkyrie, even though she doesn't really get much to do.Taika Waititi massively overstays his welcome as Korg this time, who becomes very annoying really fast. Christian Bale is one of the better MCU villains with a good motivation and an unsettling presence but is let down by limited screen time.Takia's direction on the other hand is stronger, there's some nice visual imagery and the colour palette is pretty vibrant but the MCU grey is sti

# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.
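
Steps (1) to (4) can be sketched with the standard library alone, which makes the order of operations easy to see. This is a minimal sketch under stated assumptions: `basic_clean` is a hypothetical helper, the stopword set is a tiny illustrative sample rather than the linked list, and stemming and lemmatization are left out because they need a library such as NLTK.

```python
import re

# Tiny illustrative stopword set (an assumption); the linked list is much longer.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it", "was"}


def basic_clean(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    text = text.lower()                     # step (4): lowercase
    # Steps (1) and (2): keep only ASCII letters and whitespace.
    # Note that this also removes accented letters and splits contractions.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in stopwords]  # step (3)


if __name__ == "__main__":
    print(basic_clean("The movie was GREAT!!! 10/10, loved it in 2022."))
```

Lowercasing first means the stopword comparison does not need a separate case check.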

In [4]:
import csv
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer





In [5]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Define the CSV file paths
input_csv_file = "movie_reviews.csv"
output_csv_file = "movie_reviews_cleaned.csv"

# Initialize NLTK components
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Function to clean and preprocess text
def clean_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove noise, numbers, and stopwords, and lowercase
    cleaned_tokens = [word.lower() for word in tokens if word.isalpha() and word.lower() not in stop_words]

    # Apply stemming and lemmatization
    stemmed_tokens = [stemmer.stem(word) for word in cleaned_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]

    return ' '.join(lemmatized_tokens)

# Read the input CSV and write to the output CSV with cleaned data
with open(input_csv_file, 'r', newline='', encoding='utf-8') as input_file:
    with open(output_csv_file, 'w', newline='', encoding='utf-8') as output_file:
        reader = csv.reader(input_file)
        writer = csv.writer(output_file)

        header = next(reader)
        header.append("Cleaned Text")  # Add a new column for cleaned text
        writer.writerow(header)

        for row in reader:
            title, username, review_date, review_text = row[:4]
            cleaned_text = clean_text(review_text)
            row.append(cleaned_text)
            writer.writerow(row)

print(f"Data has been cleaned and saved to {output_csv_file}.")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Data has been cleaned and saved to movie_reviews_cleaned.csv.


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.

In [7]:
pip install spacy




In [12]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [9]:
import csv
import nltk
import spacy
from nltk import pos_tag, ne_chunk
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet
from spacy import displacy


In [10]:
# Load spaCy's English language model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Load the cleaned text from the CSV file
input_csv_file = "movie_reviews_cleaned.csv"

# Initialize counters for POS tagging
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Initialize counters for named entities
person_count = 0
organization_count = 0
location_count = 0
product_count = 0
date_count = 0

# Function to perform POS tagging and entity recognition
def analyze_text(text):
    global noun_count, verb_count, adj_count, adv_count
    global person_count, organization_count, location_count, product_count, date_count

    # Tokenize text into sentences
    sentences = sent_tokenize(text)

    for sentence in sentences:
        # POS tagging
        words = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(words)

        for word, pos in pos_tags:
            if pos.startswith('N'):  # Noun
                noun_count += 1
            elif pos.startswith('V'):  # Verb
                verb_count += 1
            elif pos.startswith('J'):  # Adjective
                adj_count += 1
            elif pos.startswith('R'):  # Adverb
                adv_count += 1

        # Named Entity Recognition with spaCy
        doc = nlp(sentence)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                person_count += 1
            elif ent.label_ == 'ORG':
                organization_count += 1
            elif ent.label_ == 'GPE':
                location_count += 1
            elif ent.label_ == 'PRODUCT':
                product_count += 1
            elif ent.label_ == 'DATE':
                date_count += 1

# Read the CSV file and analyze the text
with open(input_csv_file, 'r', newline='', encoding='utf-8') as input_file:
    reader = csv.reader(input_file)
    header = next(reader)  # Skip the header row

    for row in reader:
        cleaned_text = row[-1]
        analyze_text(cleaned_text)

# Display the analysis results
print("POS Tagging Results:")
print(f"Total Nouns: {noun_count}")
print(f"Total Verbs: {verb_count}")
print(f"Total Adjectives: {adj_count}")
print(f"Total Adverbs: {adv_count}")

print("\nNamed Entity Recognition Results:")
print(f"Persons: {person_count}")
print(f"Organizations: {organization_count}")
print(f"Locations: {location_count}")
print(f"Products: {product_count}")
print(f"Dates: {date_count}")

POS Tagging Results:
Total Nouns: 481200
Total Verbs: 125600
Total Adjectives: 180400
Total Adverbs: 47200

Named Entity Recognition Results:
Persons: 19600
Organizations: 6000
Locations: 800
Products: 0
Dates: 2000
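
The prefix-based counting above can also be written without global counters. The sketch below is an illustrative alternative, not the notebook's code: `count_coarse_pos` is a hypothetical helper that takes already tagged `(word, tag)` pairs with Penn Treebank tags, and the sample tags are written by hand rather than produced by a tagger.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four categories the question asks for.
# As in the prefix checks above, "R" matches RB/RBR/RBS and also RP (particle).
PREFIX_TO_CATEGORY = {"N": "noun", "V": "verb", "J": "adjective", "R": "adverb"}


def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:1])
        if category is not None:
            counts[category] += 1
    return counts


if __name__ == "__main__":
    # Hand-written tags for illustration only.
    sample = [("thor", "NN"), ("love", "VBP"), ("funny", "JJ"),
              ("really", "RB"), ("the", "DT")]
    print(count_coarse_pos(sample))
```

Returning a `Counter` keeps each call independent, so counts from several reviews can be combined with `+` instead of relying on shared state.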


In [None]:
import spacy
import pandas as pd
from spacy import displacy
from nltk.tokenize import sent_tokenize

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Load your CSV file using Pandas
df = pd.read_csv('movie_reviews_cleaned.csv')

# Function to analyze text
def analyze_text(text):
    # Split the text into sentences
    sentences = sent_tokenize(text)

    for sentence in sentences:
        doc = nlp(sentence)

        # Display the dependency parse (displacy has no constituency-tree view)
        displacy.render(doc, style="dep", jupyter=True, options={'compact': True})

        # Display the named entities highlighted in the sentence
        displacy.render(doc, style="ent", jupyter=True)

# Process each row in the DataFrame
for index, row in df.iterrows():
    # Parse the raw 'Review Text' column: the cleaned column has no punctuation,
    # so sent_tokenize could not split it into sentences
    review_text = row['Review Text']
    analyze_text(review_text)





**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency parsing trees and dependency parsing trees are natural language processing techniques for examining the grammatical structure of sentences. Constituency parsing breaks a sentence into its constituent phrases and shows their hierarchical relationships in a tree structure. Each node represents a word or phrase and carries a syntactic label, so the tree as a whole gives a visual representation of the sentence's grammatical structure. This approach highlights how the phrases are arranged within the sentence and how they combine into a coherent whole.

Dependency parsing, on the other hand, focuses on the grammatical relationships between individual words in a sentence. It uses nodes to represent words, directed edges to express dependencies, and labels to describe the relationship on each edge. The root node is usually the main verb, and the other words are attached to it, directly or indirectly, as dependents, which makes this a more direct representation of grammatical connections. While dependency parsing stresses direct word-to-word links, constituency parsing emphasizes hierarchy and phrase organization, so both parsing strategies are useful for understanding sentence syntax.
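
The two representations described above can be compared on one short sentence. The sketch below is a hand-written illustration, not parser output: the sentence "The dog chased the cat", the nested-tuple layout, and the helper names `bracket` and `describe_arcs` are all assumptions chosen for this example, and the relation labels (`det`, `nsubj`, `dobj`) follow the style used by spaCy's English models.

```python
# Hand-written analyses of one sentence (not produced by a parser).
SENTENCE = ["The", "dog", "chased", "the", "cat"]

# Constituency tree: (label, child, child, ...); leaves are plain strings.
CONSTITUENCY = (
    "S",
    ("NP", ("DT", "The"), ("NN", "dog")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))),
)

# Dependency arcs: (head index, relation, dependent index); head -1 marks the root.
DEPENDENCIES = [
    (1, "det", 0),
    (2, "nsubj", 1),
    (-1, "ROOT", 2),
    (4, "det", 3),
    (2, "dobj", 4),
]


def bracket(tree):
    """Render a nested-tuple constituency tree in bracketed notation."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(bracket(child) for child in children) + ")"


def describe_arcs(words, arcs):
    """Return one 'head --relation--> dependent' string per arc."""
    lines = []
    for head, relation, dependent in arcs:
        head_word = "ROOT" if head == -1 else words[head]
        lines.append(f"{head_word} --{relation}--> {words[dependent]}")
    return lines


if __name__ == "__main__":
    print(bracket(CONSTITUENCY))
    for line in describe_arcs(SENTENCE, DEPENDENCIES):
        print(line)
```

In the constituency tree, "the cat" exists as an NP node inside the VP, while in the dependency arcs there is no phrase node at all: "cat" attaches directly to "chased" and "the" attaches to "cat".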