<a href="https://colab.research.google.com/github/180030814-GnaneshwarReddy/GnaneswaraReddy_INFO5731_Fall2024/blob/main/Palem_Gnaneswara_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [2]:
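import os
import requests
import csv
import time
from bs4 import BeautifulSoup

# Read the ScraperAPI key from an environment variable instead of hard-coding the secret.
# SCRAPERAPI_KEY is an assumed variable name; set it before running this cell.
My_Scrapper_Api_Key = os.environ.get('SCRAPERAPI_KEY', '')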

amazon_product_url = 'https://www.amazon.com/Apple-iPhone-13-128GB-Midnight/dp/B09LNW3CY2'  # Change to any product you like

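def scrape_amazon_reviews(url, api_key):
    # Pass the target URL as a query parameter so requests percent-encodes it;
    # otherwise the target's own "?pageNumber=" query string is sent unencoded.
    params = {'api_key': api_key, 'url': url, 'render': 'true'}
    response = requests.get('http://api.scraperapi.com', params=params)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to retrieve the page: {response.status_code}")
        return None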

def extract_reviews(page_content):
    soup = BeautifulSoup(page_content, 'html.parser')
    review_blocks = soup.select('.review')
    reviews = []

    for block in review_blocks:
        try:
            rating = block.select_one('.review-rating').text
            title = "Apple iPhone 13(128GB,Midnight) "
            content = block.select_one('.review-text-content').text.strip()
            date = block.select_one('.review-date').text
            reviews.append({
                'Rating': rating,
                'Title': title,
                'Content': content,
                'Date': date
            })
        except Exception as e:
            print(f"Error extracting review: {e}")

    return reviews

def scrape_and_save_reviews(amazon_product_url, api_key, max_reviews=1000, output_file='amazon_reviews_1000.csv'):
    all_reviews = []
    page = 1
    base_url = f'{amazon_product_url}/ref=cm_cr_arp_d_paging_btm_next_'

    while len(all_reviews) < max_reviews:
        print(f"Scraping page {page}...")
        page_url = f"{base_url}?pageNumber={page}"

        page_content = scrape_amazon_reviews(page_url, api_key)

        if page_content:
            reviews = extract_reviews(page_content)
            all_reviews.extend(reviews)
            print(f"Collected {len(reviews)} reviews from page {page}. Total: {len(all_reviews)}")
            if len(reviews) == 0:
                break
        else:
            break

        page += 1
        time.sleep(2)

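    # Keep at most max_reviews rows; the last page can push the total past the cap.
    all_reviews = all_reviews[:max_reviews]

    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile: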
        fieldnames = ['Rating', 'Title', 'Content', 'Date']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_reviews)

    print(f"Saved {len(all_reviews)} reviews to {output_file}")

scrape_and_save_reviews(amazon_product_url, My_Scrapper_Api_Key)


Scraping page 1...
Collected 11 reviews from page 1. Total: 11
Scraping page 2...
Collected 11 reviews from page 2. Total: 22
Scraping page 3...
Collected 11 reviews from page 3. Total: 33
Scraping page 4...
Collected 11 reviews from page 4. Total: 44
Scraping page 5...
Collected 11 reviews from page 5. Total: 55
Scraping page 6...
Collected 11 reviews from page 6. Total: 66
Scraping page 7...
Collected 11 reviews from page 7. Total: 77
Scraping page 8...
Collected 11 reviews from page 8. Total: 88
Scraping page 9...
Collected 11 reviews from page 9. Total: 99
Scraping page 10...
Collected 11 reviews from page 10. Total: 110
Scraping page 11...
Collected 11 reviews from page 11. Total: 121
Scraping page 12...
Collected 11 reviews from page 12. Total: 132
Scraping page 13...
Collected 11 reviews from page 13. Total: 143
Scraping page 14...
Collected 11 reviews from page 14. Total: 154
Scraping page 15...
Collected 11 reviews from page 15. Total: 165
Scraping page 16...
Collected 11 revi
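
Every page in the log above reports the same number of reviews. If consecutive pages happen to return overlapping reviews (an assumption; the log alone cannot confirm it), the running total would overstate how many distinct reviews were collected. A minimal sketch of removing repeats before saving, using a hypothetical helper `dedupe_reviews` and made-up sample rows in the same shape as the scraper's dictionaries:

```python
def dedupe_reviews(reviews, max_reviews=1000):
    """Drop repeated reviews, keep first-seen order, and cap the total."""
    seen = set()
    unique = []
    for review in reviews:
        # Two reviews count as duplicates when both text and date match.
        key = (review['Content'].strip(), review['Date'].strip())
        if key in seen:
            continue
        seen.add(key)
        unique.append(review)
        if len(unique) >= max_reviews:
            break
    return unique


# Hypothetical sample rows, not taken from the scraped data.
sample = [
    {'Rating': '5.0 out of 5 stars', 'Title': 'T', 'Content': 'Great phone', 'Date': 'd1'},
    {'Rating': '5.0 out of 5 stars', 'Title': 'T', 'Content': 'Great phone ', 'Date': 'd1'},
    {'Rating': '4.0 out of 5 stars', 'Title': 'T', 'Content': 'Battery is fine', 'Date': 'd2'},
]
unique_sample = dedupe_reviews(sample)
print(len(sample), '->', len(unique_sample))
```

Keying on content plus date is only one possible choice; a review ID from the page, if one is available, would be a stricter key.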

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [4]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

data = pd.read_csv("/content/amazon_reviews_1000.csv")
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # 1. Remove noise
    text = re.sub(r'[^\w\s]', '', text)

    # 2. Remove numbers
    text = re.sub(r'\d+', '', text)

    # 3. Remove stopwords
    words = nltk.word_tokenize(text)
    words = [word for word in words if word.lower() not in stop_words]

    # 4. Lowercase all texts
    words = [word.lower() for word in words]

    # 5. Stemming
    stemmed_words = [ps.stem(word) for word in words]

    # 6. Lemmatization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
    return ' '.join(lemmatized_words)

data['Cleaned_Content'] = data['Content'].apply(clean_text)
data.to_csv('cleaned_amazon_reviews_1000.csv', index=False)
data[['Content', 'Cleaned_Content']].head()


| | Content | Cleaned_Content |
|---|---|---|
| 0 | This review is based on the product quality an... | review base product qualiti condit phone came ... |
| 1 | iPhone 13, midnight, 256gb, unlocked version, ... | iphon midnight gb unlock version seller direct... |
| 2 | The phone is in a very good condition:No damag... | phone good conditionno damag screen clean fine... |
| 3 | This was a great buy for me and the phone work... | great buy phone work extrem well far refurbish... |
| 4 | Ordered the iPhone 13 in Midnight for T-Mobile... | order iphon midnight tmobil phone came near pe... |
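
Row 2 of the output shows a side effect of step 1: because punctuation is deleted outright, `condition:No` collapses into the single token `conditionno`. A minimal sketch of the difference between deleting punctuation and replacing it with a space, using only the standard library. The helper `strip_noise` and the sample sentence are hypothetical (the sentence is modeled on row 2, not copied from the data), and the whitespace collapsing and lowercasing are added here only to make the two results easy to compare:

```python
import re


def strip_noise(text, replacement=''):
    """Remove punctuation/special characters, then digits, then tidy whitespace."""
    text = re.sub(r'[^\w\s]', replacement, text)  # punctuation and symbols
    text = re.sub(r'\d+', '', text)               # numbers
    return re.sub(r'\s+', ' ', text).strip().lower()


raw = 'The phone is in a very good condition:No damage'
deleted = strip_noise(raw)        # punctuation deleted, as in the cell above
spaced = strip_noise(raw, ' ')    # punctuation replaced by a space
print(deleted)
print(spaced)
```

With the space replacement, `condition` and `no` stay separate tokens, so the later stopword, stemming, and lemmatization steps see real words.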


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [5]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 94.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [6]:
import spacy
import pandas as pd
import nltk
from collections import Counter

nlp = spacy.load('en_core_web_sm')
data1 = pd.read_csv('cleaned_amazon_reviews_1000.csv')

def analyze_text(text):
    doc = nlp(text)

    # 1. POS Tagging
    pos_counts = Counter([token.pos_ for token in doc])
    noun_count = pos_counts.get("NOUN", 0)
    verb_count = pos_counts.get("VERB", 0)
    adj_count = pos_counts.get("ADJ", 0)
    adv_count = pos_counts.get("ADV", 0)

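    # 2. Constituency Parsing and Dependency Parsing
    # Note: en_core_web_sm has no constituency parser, so this list holds only the
    # dependency labels as a stand-in; a true constituency tree needs a separate parser.
    constituency_parse = [token.dep_ for token in doc]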
    dependency_parse = [(token.text, token.dep_, token.head.text) for token in doc]

    # 3. Named Entity Recognition
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    return {
        "noun_count": noun_count,
        "verb_count": verb_count,
        "adj_count": adj_count,
        "adv_count": adv_count,
        "constituency_parse": constituency_parse,
        "dependency_parse": dependency_parse,
        "entities": entities
    }

sample_review = data1['Cleaned_Content'].iloc[0]
analysis_result = analyze_text(sample_review)

print("Parts of Speech (POS) Counts:")
print(f"Nouns: {analysis_result['noun_count']}")
print(f"Verbs: {analysis_result['verb_count']}")
print(f"Adjectives: {analysis_result['adj_count']}")
print(f"Adverbs: {analysis_result['adv_count']}")

print("\nConstituency Parsing (Dependency Roles):")
print(analysis_result['constituency_parse'])

print("\nDependency Parsing (Word -> Head Relation):")
for dep in analysis_result['dependency_parse']:
    print(f"Word: {dep[0]}, Dependency: {dep[1]}, Head: {dep[2]}")

print("\nNamed Entity Recognition (Entities):")
for entity in analysis_result['entities']:
    print(f"Entity: {entity[0]}, Type: {entity[1]}")

def analyze_all_reviews(df):
    total_nouns = 0
    total_verbs = 0
    total_adj = 0
    total_adv = 0
    all_entities = Counter()

    for text in df['Cleaned_Content']:
        result = analyze_text(text)
        total_nouns += result['noun_count']
        total_verbs += result['verb_count']
        total_adj += result['adj_count']
        total_adv += result['adv_count']
        all_entities.update([ent[1] for ent in result['entities']])

    return {
        "total_nouns": total_nouns,
        "total_verbs": total_verbs,
        "total_adjectives": total_adj,
        "total_adverbs": total_adv,
        "entity_counts": dict(all_entities)
    }

overall_results = analyze_all_reviews(data1)

print("\nTotal POS Counts Across All Reviews:")
print(f"Total Nouns: {overall_results['total_nouns']}")
print(f"Total Verbs: {overall_results['total_verbs']}")
print(f"Total Adjectives: {overall_results['total_adjectives']}")
print(f"Total Adverbs: {overall_results['total_adverbs']}")

print("\nEntity Counts Across All Reviews:")
for entity_type, count in overall_results['entity_counts'].items():
    print(f"{entity_type}: {count}")


Parts of Speech (POS) Counts:
Nouns: 22
Verbs: 9
Adjectives: 6
Adverbs: 3

Constituency Parsing (Dependency Roles):
['compound', 'compound', 'compound', 'compound', 'compound', 'nsubj', 'ROOT', 'amod', 'amod', 'compound', 'npadvmod', 'advcl', 'compound', 'compound', 'compound', 'compound', 'nsubj', 'ccomp', 'compound', 'nsubj', 'ccomp', 'ccomp', 'neg', 'compound', 'compound', 'attr', 'nsubj', 'nsubj', 'ROOT', 'nsubj', 'advmod', 'ccomp', 'xcomp', 'compound', 'advmod', 'aux', 'neg', 'compound', 'compound', 'compound', 'nmod', 'compound', 'compound', 'compound', 'compound', 'compound', 'aux', 'neg', 'amod', 'nsubj', 'ccomp', 'amod', 'nsubj', 'ccomp', 'compound', 'dobj', 'advmod']

Dependency Parsing (Word -> Head Relation):
Word: review, Dependency: compound, Head: product
Word: base, Dependency: compound, Head: product
Word: product, Dependency: compound, Head: phone
Word: qualiti, Dependency: compound, Head: phone
Word: condit, Dependency: compound, Head: phone
Word: phone, Dependency: 
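
A dependency parse links each word to a single head word, whereas a constituency parse nests words into phrases (noun phrase, verb phrase, and so on). To make the head-to-dependent structure easier to read, the flat rows above can be regrouped into an indented tree. The following is a minimal sketch using only the five complete rows printed above; the helpers `children_by_head` and `format_subtree` are hypothetical, and grouping by word text assumes each word occurs only once in the sentence:

```python
from collections import defaultdict

# The five complete (word, relation, head) rows from the output above.
triples = [
    ('review', 'compound', 'product'),
    ('base', 'compound', 'product'),
    ('product', 'compound', 'phone'),
    ('qualiti', 'compound', 'phone'),
    ('condit', 'compound', 'phone'),
]


def children_by_head(rows):
    """Map each head word to the (dependent, relation) pairs attached to it."""
    tree = defaultdict(list)
    for word, relation, head in rows:
        # In spaCy the ROOT token is its own head; skip such rows to avoid a cycle.
        if word != head:
            tree[head].append((word, relation))
    return dict(tree)


def format_subtree(tree, head, relation='head', depth=0):
    """Return indented lines for a word and everything that depends on it."""
    lines = ['  ' * depth + f'{head} ({relation})']
    for word, rel in tree.get(head, []):
        lines.extend(format_subtree(tree, word, rel, depth + 1))
    return lines


tree = children_by_head(triples)
lines = format_subtree(tree, 'phone')
print('\n'.join(lines))
```

Read from the top, the tree shows `phone` as the head of the compound, with `product` attached to it and `review` and `base` attached in turn to `product`.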

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [7]:
# Write your response below
# "I found the first question quite challenging, but ChatGPT and the course materials helped me a bit.
# I was unsure which option to choose, but I finally chose option 1 of Question 1 and used a ScraperAPI key to collect the reviews from
# Amazon. The most interesting part was cleaning the Amazon reviews collected in Question 1.
# Given the difficulty of the assignment, the time was too short to complete it, but the extended deadline made things easier."