# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
#question 1
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_amazon_reviews(product_url, num_reviews):
    all_reviews = []
    page_number = 1

    while len(all_reviews) < num_reviews:
        # URL to scrape the data from
        url = f"{product_url}&pageNumber={page_number}"

        # Using the requests module to get the web page content
        # (the timeout keeps a stalled request from hanging the cell)
        page = requests.get(url, timeout=30)

        # Parsing the HTML content using BeautifulSoup
        soup = BeautifulSoup(page.content, 'html.parser')

        # Finding all the review elements
        review_elements = soup.find_all('div', class_='a-section review aok-relative')

        # Remember how many reviews we had before processing this page
        count_before = len(all_reviews)

        for review in review_elements:
            review_text = review.find('span', class_='review-text')
            if review_text:
                review_text = review_text.get_text().strip()
                all_reviews.append(review_text)

                # Break the loop if the desired number of reviews is collected
                if len(all_reviews) == num_reviews:
                    break

        # Stop if this page added no reviews; otherwise the loop would never end
        if len(all_reviews) == count_before:
            break

        # If all reviews are not collected yet, increment the page number for the next iteration
        page_number += 1

    return all_reviews[:num_reviews]

if __name__ == "__main__":
    # URL of the product on Amazon
    product_url = "https://www.amazon.com/Hisense-58-inch-Quantum-Smart-58U6HF/dp/B0B7CLH7RW/ref=sxin_15_pa_sp_search_thematic_sspa?content-id=amzn1.sym.92181fe7-c843-4c1b-b489-84c087a93895%3Aamzn1.sym.92181fe7-c843-4c1b-b489-84c087a93895&crid=3V8LZF3PHD5RP&cv_ct_cx=tv&dib=eyJ2IjoiMSJ9.dq_EN1DBXJDI8EpF3NLWssCkwIoIHm5vVO2vBRewqJerLhUxWIx5AEk0nq__JnTAX4ALmW4jY5mwgaD2HvoBEA.5JOuSEYz3vmlPqIf7b1DdUknCWsI0EZbMcKy26fOu5Y&dib_tag=se&keywords=tv&pd_rd_i=B0B7CLH7RW&pd_rd_r=61c33673-4e08-45f1-808b-634c3f553839&pd_rd_w=BQiRe&pd_rd_wg=EbZRU&pf_rd_p=92181fe7-c843-4c1b-b489-84c087a93895&pf_rd_r=4TWVDP4K82K9M3MG1X5Y&qid=1708729010&s=electronics&sbo=RZvfv%2F%2FHxDF%2BO5021pAnSA%3D%3D&sprefix=tv%2Celectronics%2C610&sr=1-2-364cf978-ce2a-480a-9bb0-bdb96faa0f61-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9zZWFyY2hfdGhlbWF0aWM&th=1"
    # Number of reviews to scrape
    num_reviews = 1000  # Adjust the number of reviews as needed

    reviews = scrape_amazon_reviews(product_url, num_reviews)

    # Create a DataFrame
    df = pd.DataFrame({'Reviews': reviews})

    # Print the DataFrame
    print(df)


                                               Reviews
0    I was one of the lucky few to pick up one of t...
1    Had some Amazon card points and figured I'd us...
2    I got this for the price and the reviews I'd r...
3    Deluxe Delivery and Unpacking was the best ser...
4    And now I love them! This TV freaking rocks. O...
..                                                 ...
995  I was one of the lucky few to pick up one of t...
996  Had some Amazon card points and figured I'd us...
997  I got this for the price and the reviews I'd r...
998  Deluxe Delivery and Unpacking was the best ser...
999  And now I love them! This TV freaking rocks. O...

[1000 rows x 1 columns]
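
In the printed DataFrame, rows 995-999 appear to repeat rows 0-4, which suggests the same reviews may have been collected more than once. A minimal sketch of removing exact duplicates before saving, assuming the reviews are held in a plain Python list; the sample strings below are hypothetical and stand in for scraped text.

```python
# Hypothetical scraped list in which the same page was fetched twice
reviews = ["Great TV", "Sound is weak", "Great TV", "Sound is weak", "Easy setup"]

# dict.fromkeys keeps the first occurrence of each review and preserves order
unique_reviews = list(dict.fromkeys(reviews))

print(len(reviews), "->", len(unique_reviews))
print(unique_reviews)
```

If the de-duplicated list is shorter than the required number of reviews, more pages (or another product) would be needed to reach the target.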


In [None]:
# Save DataFrame to CSV file
df.to_csv('amazon_reviews.csv', index=False)

# Print message indicating where the file is saved
print("CSV file saved successfully as 'amazon_reviews.csv'")


CSV file saved successfully as 'amazon_reviews.csv'


In [None]:
from google.colab import files

# Download the CSV file to your local system
files.download('amazon_reviews.csv')



# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
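
As a rough illustration of steps (1) to (4) on a single sentence, here is a minimal sketch that uses only the standard library. The `STOP_WORDS` set is a tiny hypothetical stand-in for a real stopword list, the sample sentence is made up, and stemming and lemmatization are left out because they need an NLP library.

```python
import re

# Hypothetical, tiny stand-in for a full English stopword list
STOP_WORDS = {"i", "and", "the", "this", "was", "for", "a", "it"}

def clean_text(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text)  # (1) drop punctuation and special characters
    text = re.sub(r"\d+", "", text)      # (2) drop digits
    # (3) drop stopwords, comparing case-insensitively
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(words).lower()       # (4) lowercase the result

sample = "And now I love this TV! It was $499 for a 58-inch screen."
print(clean_text(sample))
```

Note that `\w` also matches the underscore, so underscores survive step (1), and that removing digits turns `58-inch` into `inch`.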

In [None]:
import nltk
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file into a DataFrame
df = pd.read_csv('amazon_reviews.csv')

# Function to remove punctuation and special characters from text
def remove_punctuation_and_special_chars(text):
    return re.sub(r'[^\w\s]', '', text)

# Function to remove numbers from text
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Function to remove stopwords from text
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

# Function to convert text to lowercase
def lowercase(text):
    return text.lower()

# Function to apply stemming to text
def stemming(text):
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(word) for word in text.split())

# Function to apply lemmatization to text
def lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

if __name__ == "__main__":
    # Apply text preprocessing steps to the Amazon reviews
    print("Original DataFrame:")
    print(df.head())

    df['cleaned_reviews'] = df['Reviews'].copy()
    df['cleaned_reviews'] = df['cleaned_reviews'].apply(remove_punctuation_and_special_chars)
    print("\nAfter removing punctuation and special characters:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(remove_numbers)
    print("\nAfter removing numbers:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(remove_stopwords)
    print("\nAfter removing stopwords:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(lowercase)
    print("\nAfter converting text to lowercase:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(stemming)
    print("\nAfter applying stemming:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(lemmatization)
    print("\nAfter applying lemmatization:")
    print(df.head())

    # Save the modified DataFrame to CSV file
    df.to_csv('cleaned_amazon_reviews.csv', index=False)

    # Print message indicating successful completion
    print("Text data cleaned and saved in 'cleaned_amazon_reviews.csv'")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original DataFrame:
                                             Reviews
0  I was one of the lucky few to pick up one of t...
1  Had some Amazon card points and figured I'd us...
2  I got this for the price and the reviews I'd r...
3  Deluxe Delivery and Unpacking was the best ser...
4  And now I love them! This TV freaking rocks. O...

After removing punctuation and special characters:
                                             Reviews  \
0  I was one of the lucky few to pick up one of t...   
1  Had some Amazon card points and figured I'd us...   
2  I got this for the price and the reviews I'd r...   
3  Deluxe Delivery and Unpacking was the best ser...   
4  And now I love them! This TV freaking rocks. O...   

                                     cleaned_reviews  
0  I was one of the lucky few to pick up one of t...  
1  Had some Amazon card points and figured Id use...  
2  I got this for the price and the reviews Id re...  
3  Deluxe Delivery and Unpacking was the best ser... 

In [None]:
df.head()

                                             Reviews                                    cleaned_reviews
0  I was one of the lucky few to pick up one of t...  one lucki pick one prime day amaz price decent...
1  Had some Amazon card points and figured I'd us...  amazon card point figur id use replac samsung ...
2  I got this for the price and the reviews I'd r...  got price review id read given price realli go...
3  Deluxe Delivery and Unpacking was the best ser...  delux deliveri unpack best servic guy fast eff...
4  And now I love them! This TV freaking rocks. O...  love tv freak rock ok review mention sound yea...


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
#1) POS Tagging
# Only a sample of the results is printed, using head(), instead of the full output
import pandas as pd
import nltk

# Download NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load the cleaned text data
cleaned_df = pd.read_csv('cleaned_amazon_reviews.csv')

# Function to perform POS tagging for a sentence
def pos_tagging(text):
    pos_tags = nltk.pos_tag(nltk.word_tokenize(text))
    return pos_tags

# Apply POS tagging to each review
cleaned_df['POS_Tags'] = cleaned_df['cleaned_reviews'].apply(pos_tagging)

# Create a list to store (word, POS) tuples
word_pos_list = []
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

for pos_tags in cleaned_df['POS_Tags']:
    for word, pos in pos_tags:
        word_pos_list.append((word, pos))
        if pos.startswith('N'):
            noun_count += 1
        elif pos.startswith('V'):
            verb_count += 1
        elif pos.startswith('J'):
            adj_count += 1
        elif pos.startswith('R'):
            adv_count += 1

# Create a DataFrame from the list of (word, POS) tuples
word_pos_df = pd.DataFrame(word_pos_list, columns=['Word', 'POS'])

# Save the DataFrame to a CSV file
word_pos_df.to_csv('word_pos_tags.csv', index=False)

# Print the DataFrame
print("Words with their corresponding POS tags:")
print(word_pos_df.head())

# Print the total counts of each POS
total_counts = {'Noun': noun_count, 'Verb': verb_count, 'Adjective': adj_count, 'Adverb': adv_count}
print("Total counts of each POS:")
print(total_counts)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Words with their corresponding POS tags:
    Word POS
0    one  CD
1  lucki  NN
2   pick  NN
3    one  CD
4  prime  JJ
Total counts of each POS:
{'Noun': 86600, 'Verb': 21800, 'Adjective': 33800, 'Adverb': 7200}
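
The counting loop above matches on the first letter of each Penn Treebank tag. One thing to keep in mind is that `startswith('R')` also matches `RP` (particle), not only the adverb tags `RB`, `RBR`, and `RBS`. A minimal sketch of counting by two-letter prefix instead, using hypothetical hand-written `(word, tag)` pairs in the shape that `nltk.pos_tag` returns:

```python
from collections import Counter

# Hypothetical tagged tokens; not produced by a tagger
tagged = [("tv", "NN"), ("looks", "VBZ"), ("really", "RB"), ("bright", "JJ"),
          ("set", "VBD"), ("up", "RP"), ("quickly", "RB")]

# Two-letter Penn Treebank prefixes for the four classes of interest
PREFIX_TO_CLASS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def count_pos(tagged_tokens):
    counts = Counter()
    for _, tag in tagged_tokens:
        label = PREFIX_TO_CLASS.get(tag[:2])
        if label:  # tags outside the four classes (e.g. RP, CD) are skipped
            counts[label] += 1
    return counts

print(dict(count_pos(tagged)))
```

Here the particle `up` is left out of the adverb total, which a first-letter match on `R` would have included.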


In [12]:
#Constituency parsing
import pandas as pd
import spacy
from spacy import displacy

# Load the CSV file into a DataFrame
df = pd.read_csv('amazon_reviews.csv')

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Select one review from the DataFrame
sample_text = df['Reviews'].iloc[0]  # Adjust column name here

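# Parse the selected review with spaCy
doc = nlp(sample_text)

# Note: en_core_web_sm has no constituency parser, and displacy's "dep" style
# draws dependency arcs, so this renders a dependency tree, not a constituency tree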
displacy.render(doc, style="dep", options={'compact': True, 'color': 'blue'})

# Optionally, you can save the visualization to a file
# displacy.render(doc, style="dep", options={'compact': True, 'color': 'blue'}, jupyter=False, page=True)
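
To make the difference between the two kinds of tree concrete, here is a minimal sketch that writes both analyses of one short sentence by hand. The sentence "I love this TV" and both annotations are illustrative assumptions, not the output of any parser, and the helper names `bracketed` and `leaves` are hypothetical.

```python
# Hand-written (hypothetical) analyses of "I love this TV"; not parser output.

# Constituency: nested phrases, written as (label, children...) tuples
constituency = ("S",
                ("NP", ("PRP", "I")),
                ("VP", ("VBP", "love"),
                       ("NP", ("DT", "this"), ("NN", "TV"))))

# Dependency: one (dependent, relation, head) triple per word
dependency = [("I", "nsubj", "love"), ("love", "ROOT", "love"),
              ("this", "det", "TV"), ("TV", "dobj", "love")]

def is_leaf(children):
    return len(children) == 1 and isinstance(children[0], str)

def bracketed(node):
    """Render a constituency tree in bracket notation."""
    label, *children = node
    if is_leaf(children):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(bracketed(c) for c in children) + ")"

def leaves(node):
    """Collect the words of a constituency tree from left to right."""
    label, *children = node
    if is_leaf(children):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

print(bracketed(constituency))
for dep, rel, head in dependency:
    print(f"{dep:5} --{rel}--> {head}")
```

The constituency tree groups words into nested phrases (the noun phrase "this TV" sits inside the verb phrase), while the dependency analysis has no phrase nodes and instead links each word directly to its head word.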


In [None]:
#Dependency parsing
import pandas as pd
import spacy

# Load the CSV file into a DataFrame
df = pd.read_csv('amazon_reviews.csv')

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Perform dependency parsing for each review
dep_data = []
for text in df["Reviews"]:  # Adjust column name here
    doc = nlp(text)
    for token in doc:
        dep_data.append([token.text, token.dep_, token.head.text, token.head.pos_,
                         [child for child in token.children]])

# Create a DataFrame from the dependency parsing data
dep_df = pd.DataFrame(dep_data, columns=["text", "dependency", "head_text", "head_pos", "children"])

# Print the first few rows of the dependency parsing DataFrame
print("Dependency parsing results:")
print(dep_df.head())


Dependency parsing results:
  text dependency head_text head_pos     children
0    I      nsubj       was      AUX           []
1  was       ROOT       was      AUX  [I, one, .]
2  one       attr       was      AUX         [of]
3   of       prep       one      NUM        [few]
4  the        det       few      ADJ           []


In [None]:
#Dependency Parsing Tree
import pandas as pd
import spacy
from spacy import displacy

# Load the CSV file into a DataFrame
df = pd.read_csv('amazon_reviews.csv')

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Select one review from the DataFrame
review_text = df["Reviews"].iloc[0]  # Selecting the first review, you can change the index as needed

# Perform dependency parsing for the selected review
doc = nlp(review_text)

# Visualize dependency parsing tree for the selected review
displacy.render(doc, style='dep', jupyter=True)


In [None]:
#3)Named Entity Recognition
import pandas as pd
import spacy

# Load the CSV file into a DataFrame
df = pd.read_csv('cleaned_amazon_reviews.csv')

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Perform Named Entity Recognition (NER) for each review
ner_data = []
for text in df["cleaned_reviews"]:  # Assuming "cleaned_reviews" is the column containing the cleaned text data
    doc = nlp(text)
    for ent in doc.ents:
        ner_data.append((ent.text, ent.label_))

# Create a DataFrame from the NER data
ner_df = pd.DataFrame(ner_data, columns=["entity", "entity_type"])

# Print the count of each entity type
print("Count of each entity type:")
print(ner_df['entity_type'].value_counts())


Count of each entity type:
PERSON      2200
ORG         2200
CARDINAL    1400
DATE         400
GPE          400
ORDINAL      400
TIME         400
PRODUCT      200
Name: entity_type, dtype: int64
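
The output above counts entity types. The question also asks for the count of each entity, which could be sketched as follows, assuming a list of `(entity text, label)` tuples in the same shape as `ner_data`; the sample values here are hypothetical.

```python
from collections import Counter

# Hypothetical sample in the same (entity text, label) shape as ner_data
ner_data = [("amazon", "ORG"), ("samsung", "ORG"), ("amazon", "ORG"),
            ("prime day", "DATE"), ("samsung", "ORG"), ("amazon", "ORG")]

# Count each distinct (entity, label) pair
entity_counts = Counter(ner_data)

for (text, label), n in entity_counts.most_common():
    print(f"{text:10} {label:5} {n}")
```

Counting the pair rather than the text alone keeps the same string apart when it is tagged with different labels.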


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
#It was very challenging for me to go through all the questions, but I used some demo programs to learn.
#I also struggled with collecting and preprocessing the data, as writing code is not easy for me.
#I faced a lot of challenges with syntax and modules, but I got some help from Google in clearing the errors.