
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pan
import time
import os

def parse_review_block(block):

    try:
        rev_txt = block.find("span", {"data-hook": "review-body"}).text.strip()
        ratng = block.find("i", {"data-hook": "review-star-rating"}).text.strip()
        date = block.find("span", {"data-hook": "review-date"}).text.strip()
        title = block.find("a", {"data-hook": "review-title"}).text.strip()

        return {
            "Title": title,
            "Rating": ratng,
            "Review_text": rev_txt,
            "Date": date
        }
    except AttributeError:
        return None

def review_block_generator(soup):
    rev_blk = soup.find_all("div", {"data-hook": "review"})
    for block in rev_blk:
        parsed_review = parse_review_block(block)
        if parsed_review:
            yield parsed_review

def scrape_single_page(url, headers):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        # Treat HTTP error statuses (e.g. a blocked request) as failures instead of parsing an error page
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        return review_block_generator(soup)
    except requests.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
        return []

def scrape_reviews(product_base_url, headers, max_pages=20, max_rev=1000):
    reviews_data = []
    total_reviews = 0

    for page_num in range(1, max_pages + 1):
        page_url = f"{product_base_url}&pageNumber={page_num}"
        print(f"Scraping page {page_num}: {page_url}")

        reviews_gen = scrape_single_page(page_url, headers)

        for review in reviews_gen:
            reviews_data.append(review)
            total_reviews += 1

            if total_reviews >= max_rev:
                return reviews_data

        time.sleep(2)

    return reviews_data

def save_reviews_to_csv(rev, filename):
    dat_frm = pan.DataFrame(rev)
    dat_frm.to_csv(filename, index=False, header=True)
    print(f"Reviews saved to {filename}")

if __name__ == '__main__':

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Accept-Language": "en-US, en;q=0.5",
    }

    base_url = "https://www.amazon.com/Dove-Body-Wash-Pump-Moisture/dp/B00MEDOY2G/ref=sr_1_5?crid=3S099968VXMXA&dib=eyJ2IjoiMSJ9.w9FjOgRJLM0vYdIVImsScUafugbNLSs5DshepgWg8oT-U-iYhsc89jpMVDMGQ0crEEj7joKGCKZCPzcJ4YnVL1YXnfjX9zHmLf7RDM7I9hxOMkUwBb5_dem7Mm1pJKG9atvE48H397MgYFMyfSBF2fRICZJUixqmPsOBhufU3q2KEqoQhKwrkM-UZjsnQfz0DaJgmLAnBYt2ljFEDFf6DPGDztKAGyePD08yjQ0nP0vtKjCWSrk5NT7Wpu6jYB6-W1A-wApfVlXNmWqmvoAIcM_6EUlWeQAq_7YSSSVjkKU.S65d_duoQ2duZ0JW_O-BrC4slV0OqymKzdzobkWORoA&dib_tag=se&keywords=body+wash&qid=1727489826&rdc=1&sprefix=Body+%2Caps%2C92&sr=8-5"

    scraped_reviews = scrape_reviews(base_url, HEADERS, max_pages=100)

    if scraped_reviews:
        csv_file = os.path.join(os.getcwd(), "Amazon_Product_Reviews_Alt.csv")
        save_reviews_to_csv(scraped_reviews, csv_file)
    else:
        print("No reviews were scraped.")


Scraping page 1: https://www.amazon.com/Dove-Body-Wash-Pump-Moisture/dp/B00MEDOY2G/ref=sr_1_5?crid=3S099968VXMXA&dib=eyJ2IjoiMSJ9.w9FjOgRJLM0vYdIVImsScUafugbNLSs5DshepgWg8oT-U-iYhsc89jpMVDMGQ0crEEj7joKGCKZCPzcJ4YnVL1YXnfjX9zHmLf7RDM7I9hxOMkUwBb5_dem7Mm1pJKG9atvE48H397MgYFMyfSBF2fRICZJUixqmPsOBhufU3q2KEqoQhKwrkM-UZjsnQfz0DaJgmLAnBYt2ljFEDFf6DPGDztKAGyePD08yjQ0nP0vtKjCWSrk5NT7Wpu6jYB6-W1A-wApfVlXNmWqmvoAIcM_6EUlWeQAq_7YSSSVjkKU.S65d_duoQ2duZ0JW_O-BrC4slV0OqymKzdzobkWORoA&dib_tag=se&keywords=body+wash&qid=1727489826&rdc=1&sprefix=Body+%2Caps%2C92&sr=8-5&pageNumber=1
Scraping page 2: https://www.amazon.com/Dove-Body-Wash-Pump-Moisture/dp/B00MEDOY2G/ref=sr_1_5?crid=3S099968VXMXA&dib=eyJ2IjoiMSJ9.w9FjOgRJLM0vYdIVImsScUafugbNLSs5DshepgWg8oT-U-iYhsc89jpMVDMGQ0crEEj7joKGCKZCPzcJ4YnVL1YXnfjX9zHmLf7RDM7I9hxOMkUwBb5_dem7Mm1pJKG9atvE48H397MgYFMyfSBF2fRICZJUixqmPsOBhufU3q2KEqoQhKwrkM-UZjsnQfz0DaJgmLAnBYt2ljFEDFf6DPGDztKAGyePD08yjQ0nP0vtKjCWSrk5NT7Wpu6jYB6-W1A-wApfVlXNmWqmvoAIcM_6EUlWeQAq_7YSSSVjkKU.
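
The page URLs printed above are built by appending `&pageNumber=N` to a long product URL that still carries its search-tracking parameters. A minimal sketch of setting the page number through the query string instead, using only the standard library, is given here. The helper name `with_page_number` and the example URL are hypothetical, and whether a given site honours a `pageNumber` parameter on a particular URL is an assumption that should be checked in the browser first.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page_number(url, page, param="pageNumber"):
    """Return `url` with the query parameter `param` set to `page`."""
    parts = urlsplit(url)
    # Keep every existing parameter except an earlier copy of `param`.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Hypothetical URL used only for illustration.
example = "https://example.com/reviews?sort=recent&pageNumber=1"
print(with_page_number(example, 3))
```

Replacing the parameter rather than concatenating it avoids duplicate `pageNumber` entries and also works when the base URL has no query string yet.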

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.
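
Steps (1) to (4) can be sketched on a single string with the standard library alone. This is an illustrative sketch, not the notebook's pipeline: the function names and the tiny `STOPWORDS` set are assumptions, and stemming and lemmatization are left out because they need an NLP library. Lowercasing is applied before the stopword filter so that capitalised stopwords such as "It" are matched.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "and", "the", "it", "is", "has", "my", "s"}

def strip_punctuation(text):
    # Replace every character whose Unicode category starts with "P" (punctuation)
    # or "S" (symbol) with a space; this also covers curly quotes such as ’.
    return "".join(" " if unicodedata.category(ch)[0] in "PS" else ch for ch in text)

def remove_digits(text):
    # Replace runs of digits with a space so neighbouring words do not merge.
    return re.sub(r"\d+", " ", text)

def clean(text):
    text = strip_punctuation(text)
    text = remove_digits(text)
    tokens = text.lower().split()  # split() also collapses repeated whitespace
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean("It’s a great body wash, 5 stars!"))  # prints: great body wash stars
```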

In [2]:
# Write code for each of the sub parts with proper comments.
import pandas as pan
import re
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

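# Load the dataset; empty reviews are read back as NaN, so normalize them to strings
dat_frm = pan.read_csv('Amazon_Product_Reviews_Alt.csv')
dat_frm['Review_text'] = dat_frm['Review_text'].fillna('').astype(str)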

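# 1. Remove noise (punctuation and special characters)
def removeNoise(text):
    # string.punctuation covers ASCII punctuation only, so characters such as the curly apostrophe (’) are kept
    translator = str.maketrans('', '', string.punctuation)  # Translation table to remove punctuation
    return text.translate(translator)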

dat_frm['cleaned_text'] = dat_frm['Review_text'].apply(removeNoise)

# Display the DataFrame with cleaned text
print(dat_frm[['Review_text', 'cleaned_text']].head())

# 2. Remove numbers
def removeNum(text):
    return ''.join([char for char in text if not char.isdigit()])  # List comprehension to remove digits

dat_frm['cleaned_text'] = dat_frm['cleaned_text'].apply(removeNum)

# Display the DataFrame with numbers removed
print(dat_frm[['Review_text', 'cleaned_text']].head())

# 3. Remove stopwords using `sklearn`
stpWord = ENGLISH_STOP_WORDS  # Using `sklearn`'s stop words set

def removeStopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stpWord]
    return ' '.join(filtered_words)

dat_frm['cleaned_text'] = dat_frm['cleaned_text'].apply(removeStopwords)

# Display the DataFrame with stopwords removed
print(dat_frm[['Review_text', 'cleaned_text']].head())

# 4. Lowercasing text
dat_frm['cleaned_text'] = dat_frm['cleaned_text'].str.lower()  # Direct use of pandas string method

# Display the DataFrame with lowercase text
print(dat_frm[['Review_text', 'cleaned_text']].head())

# 5. Stemming using `nltk`'s Snowball Stemmer (alternative to PorterStemmer)
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

def stemText(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

dat_frm['stemmed_text'] = dat_frm['cleaned_text'].apply(stemText)

# Display the DataFrame with stemming
print(dat_frm[['Review_text', 'stemmed_text']].head())

# 6. Lemmatization using `spacy` (alternative to NLTK's WordNetLemmatizer)
import spacy

# Load the SpaCy English language model
nlp = spacy.load('en_core_web_sm')

def lemmatizeText(text):
    doc = nlp(text)
    lemmatized_words = [token.lemma_ for token in doc]
    return ' '.join(lemmatized_words)

dat_frm['lemmatized_text'] = dat_frm['cleaned_text'].apply(lemmatizeText)

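# Display the DataFrame with lemmatization
print(dat_frm[['Review_text', 'lemmatized_text']].head())

# Save the cleaned columns back to the CSV file, as the question requires
dat_frm.to_csv('Amazon_Product_Reviews_Alt.csv', index=False)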


                                         Review_text  \
0  It smells good and makes my skin soft. It lath...   
1  The pump is handy in the shower or at the sink...   
2  I have been purchasing the Dove Body Wash with...   
3  I love dove. It’s a great body wash. It has a ...   
4  I normally buy this body wash and have been us...   

                                        cleaned_text  
0  It smells good and makes my skin soft It lathe...  
1  The pump is handy in the shower or at the sink...  
2  I have been purchasing the Dove Body Wash with...  
3  I love dove It’s a great body wash It has a ni...  
4  I normally buy this body wash and have been us...  
                                         Review_text  \
0  It smells good and makes my skin soft. It lath...   
1  The pump is handy in the shower or at the sink...   
2  I have been purchasing the Dove Body Wash with...   
3  I love dove. It’s a great body wash. It has a ...   
4  I normally buy this body wash and have been us... 

# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [3]:
# Your code here
#Parts of Speech (POS) Tagging
import spacy
from collections import defaultdict

# Load the SpaCy English model
nlp = spacy.load("en_core_web_sm")

# Function to process text and count POS tags
def count_pos_alternative(doc):
    pos_counts = defaultdict(int)  # Using defaultdict for simpler counting logic
    for token in doc:
        pos_counts[token.pos_] += 1  # Increment the count for each POS tag
    return dict(pos_counts)

# Process the text data with SpaCy and count POS tags in one go
dat_frm['doc'] = dat_frm['lemmatized_text'].apply(lambda x: nlp(x))  # Apply SpaCy processing

# Apply the alternative POS counting function
dat_frm['pos_counts'] = dat_frm['doc'].apply(count_pos_alternative)

# Aggregating the total POS counts across all rows
total_counts_alternative = defaultdict(int)

for pos_count in dat_frm['pos_counts']:
    for pos, count in pos_count.items():
        total_counts_alternative[pos] += count  # Update overall counts

# Display results for specific POS tags
print("Total POS counts across all reviews (alternative):")
print(f"Nouns (NOUN): {total_counts_alternative['NOUN']}")
print(f"Verbs (VERB): {total_counts_alternative['VERB']}")
print(f"Adjectives (ADJ): {total_counts_alternative['ADJ']}")
print(f"Adverbs (ADV): {total_counts_alternative['ADV']}")

Total POS counts across all reviews (alternative):
Nouns (NOUN): 12084
Verbs (VERB): 3922
Adjectives (ADJ): 5936
Adverbs (ADV): 1060
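
The nested loop that sums the per-review dictionaries can also be written with `collections.Counter`, whose `update` method adds counts key by key. The sketch below uses a small hypothetical list in place of `dat_frm['pos_counts']`; its numbers are made up for illustration and are unrelated to the totals printed above.

```python
from collections import Counter

# Hypothetical per-review tag counts standing in for dat_frm['pos_counts'].
per_review_counts = [
    {"NOUN": 3, "VERB": 1, "ADJ": 2},
    {"NOUN": 1, "ADV": 1},
    {"VERB": 2, "ADJ": 1},
]

totals = Counter()
for counts in per_review_counts:
    totals.update(counts)  # adds the values of matching keys

# A Counter returns 0 for tags that never occurred, so missing tags are safe to print.
for tag in ("NOUN", "VERB", "ADJ", "ADV"):
    print(f"{tag}: {totals[tag]}")
```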


In [4]:
#Constituency Parsing and Dependency Parsing
!pip install benepar
import benepar
benepar.download('benepar_en3')

if "benepar" not in nlp.pipe_names:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})

# Example DataFrame (replace with actual DataFrame)
dat_frm = pan.DataFrame({
    'lemmatized_text': ["The band has just a few months to find a new guitarist",
                        "Very few people have been to the South Pole",
                        "The restaurant has quite a few vegetarian options."]
})

# Apply NLP processing to the lemmatized text
dat_frm['doc'] = dat_frm['lemmatized_text'].apply(nlp)
for doc in dat_frm['doc']:
    for sent in doc.sents:
        # Check for constituency parse tree if benepar is integrated
        if hasattr(sent._, "parse_string"):  # Safe check for benepar parsing
            print("Constituency Parse Tree (Alternative Approach):")
            print(sent._.parse_string)

        # Print dependency parsing in a structured format
        print("\nDependency Parse Tree (Alternative):")
        dep_tree = [(token.text, token.dep_, token.head.text) for token in sent]
        for dep_rel in dep_tree:
            print(f"{dep_rel[0]} --> {dep_rel[1]} --> {dep_rel[2]}")
        print()

# Example sentence for detailed constituency and dependency parsing
example_sentence = dat_frm['lemmatized_text'].iloc[0]
doc = nlp(example_sentence)
sent = list(doc.sents)[0]

if hasattr(sent._, "parse_string"):
    print("\nExample Constituency Parsing Tree (Alternative Approach):")
    print(sent._.parse_string)

print("\nExample Dependency Parsing Tree (Alternative Approach):")
for token in sent:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")
print()

Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Building wheels for collected packages: benepar
  Building wheel for benepar (setup.py) ... done
  Created wheel for benepar: filename=benepar-0.2.0-py3-none-any.whl size=37626 sha256=9be9f7308d6196f0ecef0b3921cf3112098aaf858e934c727c8739e7320c52df
  Stored in directory: /root/.cache/pip/wheels/8d/4d/c1/a5af726368d5dbaaaa0b2dd36ed39b9da8cec46279a49bd6db
Successfully built benepar
Installing collected packages: torch-struct, benepar
Successfully installed benepar-0.2.0 torch-struct-0.5


[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Constituency Parse Tree (Alternative Approach):
(S (NP (DT The) (NN band)) (VP (VBZ has) (NP (NP (RB just) (DT a) (JJ few) (NNS months)) (SBAR (S (VP (TO to) (VP (VB find) (NP (DT a) (JJ new) (NN guitarist)))))))))

Dependency Parse Tree (Alternative):
The --> det --> band
band --> nsubj --> has
has --> ROOT --> has
just --> advmod --> few
a --> quantmod --> few
few --> nummod --> months
months --> dobj --> has
to --> aux --> find
find --> xcomp --> has
a --> det --> guitarist
new --> amod --> guitarist
guitarist --> dobj --> find

Constituency Parse Tree (Alternative Approach):
(S (NP (ADJP (RB Very) (JJ few)) (NNS people)) (VP (VBP have) (VP (VBN been) (PP (IN to) (NP (DT the) (NNP South) (NNP Pole))))))

Dependency Parse Tree (Alternative):
Very --> advmod --> few
few --> amod --> people
people --> nsubj --> been
have --> aux --> been
been --> ROOT --> been
to --> prep --> been
the --> det --> Pole
South --> compound --> Pole
Pole --> pobj --> to

Constituency Parse Tree (Alternativ
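
The bracketed string printed for the first sentence is a constituency tree: the sentence `S` splits into a noun phrase `NP` ("The band") and a verb phrase `VP` ("has ..."), and each phrase nests smaller phrases down to single tagged words. The dependency output instead links individual words: `has` is the `ROOT`, `band` is its `nsubj`, and `months` is its `dobj`. A minimal sketch of walking the bracketed string with plain Python is given here; the helper names `parse_brackets`, `leaves`, and `phrases` are hypothetical and are not part of benepar or spaCy.

```python
def parse_brackets(s):
    """Turn a bracketed parse string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def leaves(tree):
    """Return the words covered by a constituent, left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def phrases(tree, wanted):
    """Return the text of every constituent labelled `wanted`, outermost first."""
    found = [" ".join(leaves(tree))] if tree[0] == wanted else []
    for child in tree[1]:
        if isinstance(child, tuple):
            found.extend(phrases(child, wanted))
    return found

# First parse string from the cell output above.
parse = ("(S (NP (DT The) (NN band)) (VP (VBZ has) (NP (NP (RB just) (DT a) "
         "(JJ few) (NNS months)) (SBAR (S (VP (TO to) (VP (VB find) "
         "(NP (DT a) (JJ new) (NN guitarist)))))))))")
tree = parse_brackets(parse)
for np_text in phrases(tree, "NP"):
    print(np_text)
```

Listing the `NP` constituents shows the nesting directly: the object noun phrase contains both "just a few months" and, inside the embedded clause, "a new guitarist".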



In [5]:
#Named Entity Recognition
def extract_entity_info(doc):
    entity_info = defaultdict(int)
    for ent in doc.ents:
        entity_info[ent.label_] += 1
    return dict(entity_info)

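# Apply NER to each document
# Note: dat_frm was reassigned to the three example sentences in the previous cell,
# so the counts below describe those sentences, not the scraped reviews.
# Rebuild dat_frm['doc'] from the review data to count entities in the reviews.
dat_frm['entities'] = dat_frm['doc'].apply(extract_entity_info)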

# Summing the entity counts across all reviews
total_entity_counts = defaultdict(int)
for entity_dict in dat_frm['entities']:
    for label, count in entity_dict.items():
        total_entity_counts[label] += count

print("Total entity counts across all reviews are :")
print(f"Person (PERSON): {total_entity_counts['PERSON']}")
print(f"Organizations (ORG): {total_entity_counts['ORG']}")
print(f"Locations (GPE): {total_entity_counts['GPE']}")
print(f"Products (PRODUCT): {total_entity_counts['PRODUCT']}")
print(f"Dates (DATE): {total_entity_counts['DATE']}")

Total entity counts across all reviews are :
Person (PERSON): 0
Organizations (ORG): 0
Locations (GPE): 0
Products (PRODUCT): 0
Dates (DATE): 1


# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
https://myunt-my.sharepoint.com/:x:/r/personal/lavanyanidamanuri_my_unt_edu/Documents/Amazon_Product_Reviews_Alt.xlsx?d=w87a69b99bcce4823960f7e3c8bed28ba&csf=1&web=1&e=HfOGGi

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
# The assignment took a long time because of the errors I had to solve and the code I had to correct. We need more time to complete lengthy assignments like this one.