<a href="https://colab.research.google.com/github/Chaitanya1081/Chaitanya_INFO5731_FALL2024/blob/main/INFO5731_Assignment_2_1_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [15]:
!pip install requests beautifulsoup4 pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Your ScraperAPI key
SCRAPER_API_KEY = 'YOUR_SCRAPERAPI_KEY'  # Replace with your ScraperAPI key (do not commit a real key)

# Function to fetch the HTML content using ScraperAPI
def fetch_html_with_scraperapi(url):
    # Pass the target URL via params so requests URL-encodes it correctly
    params = {'api_key': SCRAPER_API_KEY, 'url': url, 'render': 'true'}
    response = requests.get('http://api.scraperapi.com', params=params)
    return response.text

# Function to parse and extract reviews
def parse_amazon_reviews(soup):
    reviews = soup.find_all('div', {'data-hook': 'review'})
    review_list = []

    for review in reviews:
        try:
            review_title = review.find('a', {'data-hook': 'review-title'}).text.strip()
            review_content = review.find('span', {'data-hook': 'review-body'}).text.strip()
            review_author = review.find('span', {'class': 'a-profile-name'}).text.strip()
            review_rating = review.find('i', {'data-hook': 'review-star-rating'}).text.strip()

            review_list.append({
                'Title': review_title,
                'Content': review_content,
                'Author': review_author,
                'Rating': review_rating
            })
        except AttributeError:
            # find() returned None for a missing field; skip this review
            continue

    return review_list

# Amazon product reviews URL (example URL, replace with the actual product reviews link)
product_url = 'https://www.amazon.com/Samsung-Galaxy-Note-10-Unlocked/dp/B07Z3XZDT5'  # Replace with the actual product reviews URL

# Initialize a list to hold all reviews
all_reviews = []

# Loop through review pages (pagination)
for page_num in range(1, 51):  # Adjust the range to collect more pages (50 pages x 10 reviews per page = 500 reviews)
    print(f"Fetching page {page_num}...")

    paginated_url = f"{product_url}?pageNumber={page_num}"
    html_content = fetch_html_with_scraperapi(paginated_url)
    soup = BeautifulSoup(html_content, 'html.parser')

    # Parse the reviews from the current page
    reviews = parse_amazon_reviews(soup)
    all_reviews.extend(reviews)

    # Sleep for a bit to avoid overloading the server
    time.sleep(2)

    # Stop when you reach the target number of reviews
    if len(all_reviews) >= 1000:
        break

# Save the reviews to a CSV file
df = pd.DataFrame(all_reviews)
df.to_csv('amazon_reviews.csv', index=False, encoding='utf-8')

print(f"Saved {len(all_reviews)} reviews to 'amazon_reviews.csv'.")






Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Fetching page 10...
Fetching page 11...
Fetching page 12...
Fetching page 13...
Fetching page 14...
Fetching page 15...
Fetching page 16...
Fetching page 17...
Fetching page 18...
Fetching page 19...
Fetching page 20...
Fetching page 21...
Fetching page 22...
Fetching page 23...
Fetching page 24...
Fetching page 25...
Fetching page 26...
Fetching page 27...
Fetching page 28...
Fetching page 29...
Fetching page 30...
Fetching page 31...
Fetching page 32...
Fetching page 33...
Fetching page 34...
Fetching page 35...
Fetching page 36...
Fetching page 37...
Fetching page 38...
Fetching page 39...
Fetching page 40...
Fetching page 41...
Fetching page 42...
Fetching page 43...
Fetching page 44...
Fetching page 45...
Fetching page 46...
Fetching page 47...
Fetching page 48...
Fetching page 49...
Fetching page 50...
Saved 250 reviews to 'amazon_reviews.csv'.
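
The run above requested 50 pages but saved 250 reviews, so it is worth checking how many of those rows are distinct before cleaning them. The sketch below is one way to do that with the standard library. The `dedupe_reviews` helper and the sample rows are hypothetical, and it assumes that the `Author` and `Content` keys produced by the scraper are enough to identify a review.

```python
def dedupe_reviews(reviews, keys=('Author', 'Content')):
    """Return reviews with duplicates removed, keeping first occurrences.

    Two reviews count as duplicates when all fields in `keys` match
    (an assumption; adjust the keys if a stricter identity is needed).
    """
    seen = set()
    unique = []
    for review in reviews:
        fingerprint = tuple(str(review.get(key, '')).strip() for key in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(review)
    return unique


# Hypothetical sample rows shaped like the scraper's output
sample = [
    {'Title': 'Great', 'Content': 'Works well', 'Author': 'A', 'Rating': '5.0 out of 5 stars'},
    {'Title': 'Great', 'Content': 'Works well', 'Author': 'A', 'Rating': '5.0 out of 5 stars'},
    {'Title': 'Okay', 'Content': 'Battery is weak', 'Author': 'B', 'Rating': '3.0 out of 5 stars'},
]
unique_sample = dedupe_reviews(sample)
print(f"{len(sample)} rows, {len(unique_sample)} unique")
```

Applied to `all_reviews` before building the DataFrame, a large gap between the two counts would suggest that the pagination parameter is not changing the page being returned.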

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
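
Steps (1), (2), and (4) can be sketched with the standard library alone. One detail to watch: deleting punctuation outright can fuse neighbouring words when a sentence boundary has no space after it, so this sketch substitutes a space and then collapses the whitespace. The `strip_noise` name and the sample string are illustrative assumptions, not part of the assignment.

```python
import re


def strip_noise(text):
    """Lowercase text and remove punctuation, special characters, and digits."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Replace digit runs with a space
    text = re.sub(r'\d+', ' ', text)
    # Collapse the extra whitespace left behind
    return re.sub(r'\s+', ' ', text).strip()


sample = "Renewed item.The phone arrived in 2 days!"
print(strip_noise(sample))  # renewed item the phone arrived in days
```

Stopword removal, stemming, and lemmatization would then operate on `strip_noise(text).split()`.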

In [16]:
!pip install nltk pandas
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Load the data from the CSV file
df = pd.read_csv('amazon_reviews.csv')

# Initialize stopwords, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean text
def clean_text(text):
    # 1. Remove special characters and punctuation (replace with a space so
    #    words on either side of a symbol are not fused together)
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

    # 2. Remove numbers
    text = re.sub(r'\d+', ' ', text)

    # 3. Tokenize and remove stopwords
    tokens = text.split()
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # 4. Lowercase all text
    tokens = [word.lower() for word in tokens]

    # 5. Stemming
    tokens = [stemmer.stem(word) for word in tokens]

    # 6. Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join tokens back to string
    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Apply the clean_text function to the 'Content' column and create a new 'Cleaned_Content' column
df['Cleaned_Content'] = df['Content'].fillna('').astype(str).apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv('amazon_reviews_cleaned.csv', index=False, encoding='utf-8')

print("Cleaned data saved to 'amazon_reviews_cleaned.csv'.")




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Cleaned data saved to 'amazon_reviews_cleaned.csv'.


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
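
For part (2), spaCy's `en_core_web_sm` pipeline provides dependency parses but no constituency parser. As a small illustration of what a constituency tree looks like, the sketch below parses one sentence with NLTK's `ChartParser` and a hand-written toy grammar. The grammar is an assumption written for this single sentence and will not cover the review text; a trained constituency parser would be needed for that.

```python
import nltk

# Toy grammar written for this one sentence (an assumption, not a general grammar)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Adj Adj N | Det Adj N
VP -> V PP
PP -> P NP
Det -> 'the'
Adj -> 'quick' | 'brown' | 'lazy'
N -> 'fox' | 'dog'
V -> 'jumps'
P -> 'over'
""")

parser = nltk.ChartParser(grammar)
tokens = "the quick brown fox jumps over the lazy dog".split()

# parse() yields every tree licensed by the grammar for these tokens
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

The constituency tree groups words into nested phrases (`the quick brown fox` forms one NP, `over the lazy dog` one PP inside the VP), whereas a dependency tree has no phrase nodes and instead links each word directly to its head, for example `fox` to `jumps` as its subject.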

In [7]:
!pip install pandas nltk
!pip install -U spacy
!python -m spacy download en_core_web_sm



Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 68.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [8]:
import spacy
from collections import Counter

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Example clean text
clean_text = """
edit august two year later phone still go strong even use phone throughout day need charg everi hour phone case kept protect still look brand new still run like new even though phone renew ive never issu requir troubleshoot repair ever replac phone im buy renewedorigin review product review first wanna discus satisfact fact renew itemth phone arriv neat littl box gener brand charg cabl adapt neither item suit fastcharg capabl phone bought item phone im gonna knock cord adapt phone look brand new scratch damag kind even much fingerprint itth phone power fine abl get setup without hitch far im concern excel job renew there way tell even renew except gener box cabl charger btw product come earbud anyth get phone gener cord gener adapt overal im happi buy renew save brand new pricea review phone first samsung phone ive never much seen galaxi note eye therefor im still learn thing say love stereo speaker great play music game sound nice punch better phone speaker ive ever heard worri wouldnt get use lack headphon jack dont actual need bought nice pair bluetooth headphon cant think reason would need headphon jackth screen realli nice amaz fingerprint scanner insid screen make unlock phone quit easi free hold phone awkwardli like phone hole punch camera screen much le distract thought would end hardli notic sinc fingerprint scanner locat insid screen cant get tradit otterbox case phone heavi duti screen protector doesnt allow fingerprint scan buy specif stickon screen protector phone case bought armadillotek case love iti like extra storag phone past alway felt like phone fill stuff quickli also previou phone prone becom slow fill app photo music phoneiv given everyth use like gig plenti space well fast chip set phone make phone lightn fasta batteri life last long time your text listen music period surf web phone last approxim hour singl charg your play game powerconsum thing believ phone might get pretti rigor hoursthi phone came preload useless app didnt come preload 
use app gener come android like music player amazon suchload end load samsung musicwhich alreadi love way itun appl musicim still learn use stylu pen believ ton way use pen ive tri far great especi like abl handwrit note convert text wish could say phone im still learn overal im happi purchas well worth money highli recommend read
"""

# Function for Parts of Speech tagging
def pos_tagging(text):
    doc = nlp(text)
    pos_counts = Counter(token.pos_ for token in doc)
    return pos_counts

# Function for Dependency Parsing
def dependency_parsing(text):
    doc = nlp(text)
    dependency_trees = []

    for sent in doc.sents:
        dependency_trees.append([(token.text, token.dep_, token.head.text) for token in sent])

    return dependency_trees

# Function for Named Entity Recognition
def named_entity_recognition(text):
    doc = nlp(text)
    entities = Counter((ent.text, ent.label_) for ent in doc.ents)
    return entities

# Conduct analysis
pos_counts = pos_tagging(clean_text)
dependency_trees = dependency_parsing(clean_text)
entities = named_entity_recognition(clean_text)

# Display results
print("\nParts of Speech Counts:")
print(f"Nouns: {pos_counts['NOUN']}")
print(f"Verbs: {pos_counts['VERB']}")
print(f"Adjectives: {pos_counts['ADJ']}")
print(f"Adverbs: {pos_counts['ADV']}")

print("\nDependency Parsing Trees:")
for sent_tree in dependency_trees:
    print(sent_tree)

print("\nNamed Entities:")
for entity, count in entities.items():
    print(f"{entity[0]}: {entity[1]}, Count: {count}")

# Example Explanation
print("\nExample Explanation:")
example_sentence = "The quick brown fox jumps over the lazy dog."
example_doc = nlp(example_sentence)
print("Example Sentence:", example_sentence)
print("Dependency Tree:", [(token.text, token.dep_, token.head.text) for token in example_doc])



Parts of Speech Counts:
Nouns: 162
Verbs: 62
Adjectives: 37
Adverbs: 25

Dependency Parsing Trees:
[('\n', 'dep', 'edit'), ('edit', 'npadvmod', 'go'), ('august', 'npadvmod', 'edit'), ('two', 'nummod', 'year'), ('year', 'npadvmod', 'later'), ('later', 'amod', 'phone'), ('phone', 'nsubj', 'go'), ('still', 'advmod', 'go'), ('go', 'ROOT', 'go'), ('strong', 'acomp', 'go'), ('even', 'advmod', 'use'), ('use', 'advcl', 'go'), ('phone', 'dobj', 'use'), ('throughout', 'prep', 'use'), ('day', 'pobj', 'throughout'), ('need', 'conj', 'go'), ('charg', 'amod', 'hour'), ('everi', 'amod', 'hour'), ('hour', 'compound', 'case'), ('phone', 'compound', 'case'), ('case', 'nsubj', 'kept'), ('kept', 'conj', 'go'), ('protect', 'xcomp', 'kept'), ('still', 'advmod', 'look'), ('look', 'xcomp', 'kept'), ('brand', 'npadvmod', 'new'), ('new', 'acomp', 'look'), ('still', 'advmod', 'run'), ('run', 'dep', 'go'), ('like', 'prep', 'run'), ('new', 'pobj', 'like'), ('even', 'advmod', 'though'), ('though', 'prep', 'go'), (
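
The flat list of `(token, dependency, head)` triples above is hard to read as a tree. The sketch below indents each token under its head. It assumes heads are given as token indices rather than text (in spaCy, `token.i` and `token.head.i`), since repeated words such as `phone` make text-based heads ambiguous. The `render_dependency_tree` helper is hypothetical, and the sample is a hand-written four-token fragment using labels that appear in the output above.

```python
def render_dependency_tree(tokens):
    """Render (text, dep, head_index) triples as indented lines.

    The root is the token whose head index equals its own index,
    which mirrors how spaCy marks the ROOT token.
    """
    children = {}
    root = None
    for index, (_, _, head) in enumerate(tokens):
        if head == index:
            root = index
        else:
            children.setdefault(head, []).append(index)

    lines = []

    def visit(index, depth):
        text, dep, _ = tokens[index]
        lines.append(f"{'  ' * depth}{text} ({dep})")
        for child in children.get(index, []):
            visit(child, depth + 1)

    if root is not None:
        visit(root, 0)
    return lines


# Hand-written fragment: "phone still go strong"
sample = [
    ('phone', 'nsubj', 2),
    ('still', 'advmod', 2),
    ('go', 'ROOT', 2),
    ('strong', 'acomp', 2),
]
print('\n'.join(render_dependency_tree(sample)))
```

With spaCy, the input could be built per sentence as `[(t.text, t.dep_, t.head.i - sent.start) for t in sent]`, so that indices are relative to the sentence.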

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [None]:
# Write your response below