<a href="https://colab.research.google.com/github/NityaVattam2002/Nitya_INFO5731_Fall2024/blob/main/Vattam_Nitya_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Vital libraries
!pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_amazon_reviews(url, max_reviews=1000):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    reviews = []
    page = 1

    while len(reviews) < max_reviews:
        # Constructing the review page URL
        review_url = f"{url}&pageNumber={page}"
        print(f"Scraping {review_url}...")

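        response = requests.get(review_url, headers=headers, timeout=30)
        if response.status_code != 200:
            print("Failed to retrieve page. Exiting...")
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        review_elements = soup.find_all('div', {'data-hook': 'review'})
        if not review_elements:
            # Stop rather than loop forever when a page contains no reviews
            print("No reviews found on this page. Exiting...")
            break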

        for review in review_elements:
            try:
                title = review.find('a', {'data-hook': 'review-title'}).text.strip()
                author = review.find('span', {'class': 'a-profile-name'}).text.strip()
                rating = review.find('i', {'data-hook': 'review-star-rating'}).text.strip().split(" ")[0]
                review_text = review.find('span', {'data-hook': 'review-body'}).text.strip()

                reviews.append({
                    'Title': title,
                    'Author': author,
                    'Rating': rating,
                    'Review': review_text
                })

                if len(reviews) >= max_reviews:
                    break
            except Exception as e:
                print(f"Error while parsing review: {e}")

        page += 1

    return reviews

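# URL of the Amazon product reviews page
# Caveat: everything after '#' is a URL fragment, which is not sent to the server,
# so the '&pageNumber=...' suffix appended in get_amazon_reviews() does not change
# which page is returned. Pagination needs a URL whose page number is in the query string.
product_url = "https://www.amazon.com/dp/B0D8MWDJH2#customerReviews/"
reviews_data = get_amazon_reviews(product_url, max_reviews=1000)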

# Save to CSV
df_reviews = pd.DataFrame(reviews_data)
df_reviews.to_csv('amazon_reviews.csv', index=False)
print(f"Saved {len(df_reviews)} reviews to 'amazon_reviews.csv'")


Scraping https://www.amazon.com/dp/B0D8MWDJH2#customerReviews/&pageNumber=1...
Scraping https://www.amazon.com/dp/B0D8MWDJH2#customerReviews/&pageNumber=2...
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no attribute 'text'
Scraping https://www.amazon.com/dp/B0D8MWDJH2#customerReviews/&pageNumber=3...
Scraping https://www.amazon.com/dp/B0D8MWDJH2#customerReviews/&pageNumber=4...
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no attribute 'text'
Error while parsing review: 'NoneType' object has no a
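
The logged URLs above place `&pageNumber=` after the `#`, which makes it part of the URL fragment rather than the query string. The sketch below uses only the standard library to show where the parameter ends up and how it could be moved into the query. The helper names `describe` and `with_page_number` are hypothetical, and whether this particular `/dp/` page honours a `pageNumber` query parameter is an assumption that is not confirmed here.

```python
from urllib.parse import urlsplit, urlunsplit, urlencode

def describe(url):
    """Split a URL and report its path, query string, and fragment."""
    parts = urlsplit(url)
    return {"path": parts.path, "query": parts.query, "fragment": parts.fragment}

def with_page_number(url, page):
    """Return url with pageNumber in the query string, dropping any fragment."""
    parts = urlsplit(url)
    query = urlencode({"pageNumber": page})
    if parts.query:
        # Keep any query parameters that were already present
        query = f"{parts.query}&{query}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

# One of the URLs printed by the scraper
logged = "https://www.amazon.com/dp/B0D8MWDJH2#customerReviews/&pageNumber=2"
info = describe(logged)
print(info)  # the query is empty; the page number sits in the fragment

fixed = with_page_number("https://www.amazon.com/dp/B0D8MWDJH2#customerReviews/", 2)
print(fixed)
```

Because the fragment never reaches the server, every request in the log would have asked for the same page, which may explain why the same parsing errors recur from page to page.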

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
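
Since output is required for each part, one option is to record the text after every step. The following is a minimal sketch of steps (1) to (4) using only the standard library; stemming and lemmatization are left out because they need NLTK. The names `clean_steps`, `squeeze`, and `DEMO_STOPWORDS` are hypothetical, the stopword set is a tiny illustrative stand-in for a real list, and the sample sentence is made up.

```python
import re

# Hypothetical miniature stopword list, for illustration only;
# a real run would use a full list such as NLTK's English stopwords.
DEMO_STOPWORDS = {"these", "are", "we", "the", "a", "and", "is"}

def squeeze(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

def clean_steps(text, stopwords=DEMO_STOPWORDS):
    """Return the text after each of steps (1)-(4), keyed by step name."""
    steps = {}
    # (1) Replace punctuation and special characters with a space
    steps["no_noise"] = squeeze(re.sub(r"[^\w\s]", " ", text))
    # (2) Remove digits
    steps["no_numbers"] = squeeze(re.sub(r"\d+", " ", steps["no_noise"]))
    # (4) Lowercase before stopword removal so that "We" matches "we"
    steps["lowercased"] = steps["no_numbers"].lower()
    # (3) Drop stopwords
    kept = [w for w in steps["lowercased"].split() if w not in stopwords]
    steps["no_stopwords"] = " ".join(kept)
    return steps

sample = "These sheets are SUPER cute!! We bought 2 sets."
result = clean_steps(sample)
for name, value in result.items():
    print(f"{name}: {value}")
```

Lowercasing is applied before the stopword filter here, which is a design choice rather than a requirement: it lets a lowercase stopword list match capitalized words.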

In [2]:
# Write code for each of the sub parts with proper comments.
#  Libraries
!pip install nltk pandas

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv('amazon_reviews.csv')

# Initializing NLTK's stopwords, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean text data
def clean_text(text):
    # Guard against missing values (pandas reads empty cells as NaN)
    if not isinstance(text, str):
        return ''

    # (1) Remove noise (special characters and punctuation)
    text = re.sub(r'[^\w\s]', '', text)

    # (2) Remove numbers
    text = re.sub(r'\d+', '', text)

    # Tokenize the text so the remaining steps work word by word
    tokens = nltk.word_tokenize(text)

    # (3) and (4) Lowercase all words and remove stopwords
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words]

    # (5) Stemming
    stemmed_tokens = [ps.stem(word) for word in tokens]

    # (6) Lemmatization (applied to the stemmed tokens)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]

    # Join the tokens back into a single string
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text

# Apply the cleaning function to the "Review" column and create a new "Cleaned Review" column
df['Cleaned Review'] = df['Review'].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv('amazon_reviews_cleaned.csv', index=False)

# Display the first few rows of the cleaned data
print(df[['Review', 'Cleaned Review']].head())





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


                                              Review  \
0  These sheets are super cute. We bought them fo...   
1  I am a fan of stripes. I purchased an ocean th...   
2  Really nice, good quality bedsheets!The design...   
3               top sheet not wide enough\nRead more   
4  Bought this because I needed six yards of mate...   

                                      Cleaned Review  
0  sheet super cute bought son love shark good pr...  
1  fan stripe purchas ocean theme quilt sham pack...  
2  realli nice good qualiti bedsheetsth design su...  
3                         top sheet wide enough read  
4  bought need six yard materi bought mimic seers...  
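
Two artifacts are visible in the table above: row 2 contains the fused token `bedsheetsth`, and row 3 ends in `read`, which appears to come from a trailing "Read more" label in the scraped text. The sketch below, using only the standard library, shows one way both could be avoided. The helper names `strip_punctuation` and `drop_read_more` are hypothetical, and treating "Read more" as page residue rather than review text is an assumption.

```python
import re

def strip_punctuation(text, replacement):
    """Replace every non-word, non-space character with `replacement`."""
    return re.sub(r"[^\w\s]", replacement, text)

def drop_read_more(text):
    """Remove a trailing 'Read more' label, assumed to be page residue."""
    return re.sub(r"\s*Read more\s*$", "", text)

# Visible prefix of row 2 in the table above
sample = "Really nice, good quality bedsheets!The design"

# Deleting punctuation glues "bedsheets" and "The" together
fused = strip_punctuation(sample, "")
# Substituting a space keeps the words apart; split/join tidies the spacing
separated = " ".join(strip_punctuation(sample, " ").split())

print(fused)
print(separated)
print(drop_read_more("top sheet not wide enough\nRead more"))
```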


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
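
Part (1) asks for totals over the whole text, so per-review tag counts have to be summed. A minimal sketch of that aggregation follows, using only `collections.Counter`. The function name `total_pos_counts` is hypothetical and the tagged documents are hand-written for illustration; with spaCy, each document could be built as `[(t.text, t.pos_) for t in nlp(text)]`.

```python
from collections import Counter

# Coarse POS tags for nouns, verbs, adjectives, and adverbs
TARGET_POS = ("NOUN", "VERB", "ADJ", "ADV")

def total_pos_counts(tagged_docs):
    """Sum target POS tag counts over an iterable of tagged documents.

    Each document is an iterable of (token_text, pos_tag) pairs.
    """
    totals = Counter()
    for doc in tagged_docs:
        totals.update(pos for _, pos in doc if pos in TARGET_POS)
    # Report every target tag, including those that never occurred
    return {pos: totals[pos] for pos in TARGET_POS}

# Hand-written tags, for illustration only
tagged_docs = [
    [("sheet", "NOUN"), ("super", "ADJ"), ("cute", "NOUN"), ("bought", "VERB")],
    [("top", "ADJ"), ("sheet", "NOUN"), ("wide", "ADJ"), ("enough", "ADV")],
]
totals = total_pos_counts(tagged_docs)
print(totals)
```

The same pattern applies to part (3): updating one `Counter` with the entity labels of every document gives corpus-wide entity counts.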

In [3]:
# Your code here
# Install the necessary libraries
!pip install spacy
!python -m spacy download en_core_web_sm

import spacy
import pandas as pd
from collections import Counter

# Load the cleaned Amazon reviews CSV file
df = pd.read_csv('amazon_reviews_cleaned.csv')

# Load the small English model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to perform POS tagging, dependency parsing, and NER
def analyze_syntax_structure(text):
    doc = nlp(text)

    # (1) POS Tagging: Count the total number of Nouns, Verbs, Adjectives, Adverbs
    pos_counts = Counter([token.pos_ for token in doc])
    noun_count = pos_counts['NOUN']
    verb_count = pos_counts['VERB']
    adj_count = pos_counts['ADJ']
    adv_count = pos_counts['ADV']

    # (2) Dependency Parsing: Display dependency tree
    dependency_tree = [(token.text, token.dep_, token.head.text) for token in doc]

    # (3) Named Entity Recognition (NER): Extract entities like person names, organizations, locations, product names, and dates
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts = Counter([ent.label_ for ent in doc.ents])

    return {
        'pos': {'Nouns': noun_count, 'Verbs': verb_count, 'Adjectives': adj_count, 'Adverbs': adv_count},
        'dependency_tree': dependency_tree,
        'entities': entities,
        'entity_counts': entity_counts
    }

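# Function to print the head, dependency relation, and POS tag of each token
# (en_core_web_sm provides a dependency parser only; constituency trees need a separate parser)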
def print_parsing_info(doc):
    for token in doc:
        print(f"Token: {token.text}, Head: {token.head.text}, Dependency: {token.dep_}, POS: {token.pos_}")

# Select the cleaned reviews, skipping rows whose cleaned text is empty (read back as NaN)
cleaned_reviews = df['Cleaned Review'].dropna().astype(str)

# Analyze one review as an example
example_review = cleaned_reviews.iloc[0]
doc = nlp(example_review)

# Part 1: POS Tagging
syntax_structure = analyze_syntax_structure(example_review)
print(f"POS Tagging Results:\nNouns: {syntax_structure['pos']['Nouns']}\nVerbs: {syntax_structure['pos']['Verbs']}")
print(f"Adjectives: {syntax_structure['pos']['Adjectives']}\nAdverbs: {syntax_structure['pos']['Adverbs']}\n")

# Part 2: Dependency Parsing - print dependency tree
print(f"Dependency Parsing Tree for the Review:\n")
print_parsing_info(doc)

# Part 3: Named Entity Recognition
print(f"\nNamed Entity Recognition (NER):\n")
for entity, label in syntax_structure['entities']:
    print(f"Entity: {entity}, Label: {label}")

print(f"\nEntity Counts:\n{syntax_structure['entity_counts']}")



Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 49.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
POS Tagging Results:
Nouns: 7
Verbs: 2
Adjectives: 7
Adverbs: 0

Dependency Parsing Tree for the Review:

Token: sheet, Head: cute, Dependency: nmod, POS: NOUN
Token: super, Head: cute, Dependency: amod, POS: ADJ
Token: cute, Head: bought, Dependency: nsubj, POS: NOUN
Token: bought, Head: bought, Dependency: ROOT, POS: VERB
Token: son, Hea
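
The token rows above list each word with its head, which is a flat encoding of a tree. As a rough sketch of how such rows can be read, the code below rebuilds an indented tree from the first four rows. The function name `render_tree` is hypothetical, and the head indices are written by hand from the printed head words; in spaCy they would be available as `token.head.i`, with the root token being its own head.

```python
def render_tree(tokens):
    """Render (text, head_index, dep) triples as an indented tree.

    The root is the token whose head index is its own position.
    """
    children = {i: [] for i in range(len(tokens))}
    roots = []
    for i, (_, head, _) in enumerate(tokens):
        if head == i:
            roots.append(i)
        else:
            children[head].append(i)

    lines = []

    def walk(i, depth):
        text, _, dep = tokens[i]
        lines.append(f"{'  ' * depth}{text} ({dep})")
        for child in children[i]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return "\n".join(lines)

# First four rows of the output above, with head words converted to indices by hand
tokens = [
    ("sheet", 2, "nmod"),
    ("super", 2, "amod"),
    ("cute", 3, "nsubj"),
    ("bought", 3, "ROOT"),
]
tree = render_tree(tokens)
print(tree)
```

Read this way, `bought` is the root of the sentence, `cute` attaches to it as its subject, and `sheet` and `super` both modify `cute`. The labels reflect how the parser handled stemmed, stopword-free text, so they are not necessarily the relations it would assign to the original sentence.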

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [4]:
from google.colab import files
files.download('amazon_reviews_cleaned.csv')


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
'''
The assignment was fair enough. It gave me a chance to experience how data is gathered stealthily from sites without
human intervention. The only issue was that at some point, it was not easy to gather the 1000 reviews needed since
the Amazon website kept blocking and delaying the scraping of data. The time to complete the assignment was not sufficient
due to the persistent errors that kept popping up while working on the assignment.
'''
