<a href="https://colab.research.google.com/github/Sireesha-cloud/Sireesha_INFO5731_Fall2024/blob/main/INFO5731_Assignment_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>













































































































# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [2]:
!pip install requests beautifulsoup4




In [3]:
import requests
import csv
import time
from bs4 import BeautifulSoup

# Scraper API Configuration
SCRAPER_API_KEY = '06de0be700dfe414381d19c077edc7ac'  # Replace with your Scraper API key
BASE_URL = 'http://api.scraperapi.com'

# Amazon Product URL (You can change this to any product URL)
amazon_product_url = "https://www.amazon.com/SAMSUNG-Smartphone-Unlocked-Android-Titanium/dp/B0CMDM65JH"  # Example URL for a product

# Function to get Amazon reviews via Scraper API
def get_reviews(page):
    params = {
        'api_key': SCRAPER_API_KEY,
        'url': f'{amazon_product_url}?pageNumber={page}'
    }
    response = requests.get(BASE_URL, params=params)

    if response.status_code == 200:
        return response.text
    else:
        print(f"Error fetching page {page}: Status Code {response.status_code}")
        return None

# Function to parse reviews from HTML response
def parse_reviews(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    reviews = []

    # Select the review blocks (this selector may need to be adjusted based on the actual HTML structure)
    review_blocks = soup.select('.review')

    for block in review_blocks:
        try:
            rating = block.select_one('.review-rating').text.strip()
            title = block.select_one('.review-title').text.strip()
            content = block.select_one('.review-text-content').text.strip()
            date = block.select_one('.review-date').text.strip()
            reviews.append({
                'Rating': rating,
                'Title': title,
                'Content': content,
                'Date': date
            })
        except Exception as e:
            print(f"Error extracting review: {e}")

    return reviews

# Function to save reviews to CSV file
def save_reviews_to_csv(reviews, output_file='amazon_reviews_1000.csv'):
    if reviews:
        keys = reviews[0].keys()
        with open(output_file, 'w', newline='', encoding='utf-8') as output_file:
            dict_writer = csv.DictWriter(output_file, fieldnames=keys)
            dict_writer.writeheader()
            dict_writer.writerows(reviews)

# Main function to collect reviews
def collect_amazon_reviews():
    all_reviews = []
    page = 1
    while len(all_reviews) < 1000:  # Collect at least 1000 reviews
        print(f"Fetching reviews from page {page}...")
        html_content = get_reviews(page)

        if html_content:
            reviews = parse_reviews(html_content)
            all_reviews.extend(reviews)
            page += 1
            time.sleep(2)  # Adding delay between requests to avoid getting blocked
        else:
            break  # Stop if there are issues fetching pages

    print(f"Collected {len(all_reviews)} reviews.")
    save_reviews_to_csv(all_reviews[:1000])

if __name__ == '__main__':
    collect_amazon_reviews()


Fetching reviews from page 1...
Fetching reviews from page 2...
Fetching reviews from page 3...
Fetching reviews from page 4...
Fetching reviews from page 5...
Fetching reviews from page 6...
Fetching reviews from page 7...
Fetching reviews from page 8...
Fetching reviews from page 9...
Fetching reviews from page 10...
Fetching reviews from page 11...
Fetching reviews from page 12...
Fetching reviews from page 13...
Fetching reviews from page 14...
Fetching reviews from page 15...
Fetching reviews from page 16...
Fetching reviews from page 17...
Fetching reviews from page 18...
Fetching reviews from page 19...
Fetching reviews from page 20...
Fetching reviews from page 21...
Fetching reviews from page 22...
Fetching reviews from page 23...
Fetching reviews from page 24...
Fetching reviews from page 25...
Fetching reviews from page 26...
Fetching reviews from page 27...
Fetching reviews from page 28...
Fetching reviews from page 29...
Fetching reviews from page 30...
Fetching reviews fr

# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [4]:
!pip install pandas nltk




In [5]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the CSV file with Amazon reviews
input_file = 'amazon_reviews_1000.csv'  # Path to your input CSV file
output_file = 'cleaned_amazon_reviews.csv'

# Read the CSV file
reviews_df = pd.read_csv(input_file)

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define a function to clean the text
def clean_text(text):
    # Step 1: Remove noise (special characters and punctuations)
    text = re.sub(r'[^\w\s]', '', text)

    # Step 2: Remove numbers
    text = re.sub(r'\d+', '', text)

    # Step 3: Remove stopwords
    stop_words = set(stopwords.words('english'))
    text_tokens = nltk.word_tokenize(text)
    text = ' '.join([word for word in text_tokens if word.lower() not in stop_words])

    # Step 4: Lowercase all texts
    text = text.lower()

    # Step 5: Stemming
    text_stemmed = ' '.join([stemmer.stem(word) for word in text.split()])

    # Step 6: Lemmatization
    text_lemmatized = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

    return text_stemmed, text_lemmatized

# Apply the cleaning function to the 'Content' column and create new columns
reviews_df['Cleaned_Text_Stemmed'], reviews_df['Cleaned_Text_Lemmatized'] = zip(*reviews_df['Content'].apply(clean_text))

# Save the cleaned data to a new CSV file
reviews_df.to_csv(output_file, index=False)

# Display the first few rows of the cleaned data
print(reviews_df[['Content', 'Cleaned_Text_Stemmed', 'Cleaned_Text_Lemmatized']].head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


                                             Content  \
0  ⭐️⭐️⭐️⭐️⭐️The SAMSUNG Galaxy S24 Ultra is an a...   
1  I had a s20FE which was a very good phone at a...   
2  PLEEEEEASE people, stop complaining about the ...   
3  The best 1,100 dollars I've spent this year. I...   
4  Me encantó. Nuevo, original y en empaque sella...   

                                Cleaned_Text_Stemmed  \
0  samsung galaxi ultra absolut powerhous smartph...   
1  sfe good phone great price yr batteri would dr...   
2  pleeeeeas peopl stop complain display thing li...   
3  best dollar ive spent year went note plu two u...   
4  encantó nuevo origin en empaqu sellado sin dud...   

                             Cleaned_Text_Lemmatized  
0  samsung galaxy ultra absolute powerhouse smart...  
1  sfe good phone great price yr battery would dr...  
2  pleeeeease people stop complaining display thi...  
3  best dollar ive spent year went note plus two ...  
4  encantó nuevo original en empaque sellado sin ..

# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [None]:
!pip install pandas nltk spacy
!python -m spacy download en_core_web_sm  # Download spaCy's English model


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m79.6 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [7]:
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk.tree import Tree
from nltk import pos_tag, word_tokenize, ne_chunk

# Load the cleaned data
input_file = 'cleaned_amazon_reviews.csv'  # Path to your cleaned CSV file
reviews_df = pd.read_csv(input_file)

# Load spaCy model for dependency parsing and NER
nlp = spacy.load("en_core_web_sm")

# Download NLTK resources for POS tagging and chunking
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Function to conduct POS tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    return pos_tags

# Function to perform Named Entity Recognition (NER)
def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Function to perform constituency parsing using NLTK
def constituency_parsing(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    chunked = ne_chunk(pos_tags)
    return chunked

# Function to perform dependency parsing
def dependency_parsing(text):
    doc = nlp(text)
    return doc

# Analyzing the cleaned text
pos_counts = {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}
entity_counts = Counter()

# Loop through each cleaned text entry
for text in reviews_df['Cleaned_Text_Lemmatized']:
    # POS tagging
    pos_tags = pos_tagging(text)
    for word, pos in pos_tags:
        if pos.startswith('N'):
            pos_counts['Noun'] += 1
        elif pos.startswith('V'):
            pos_counts['Verb'] += 1
        elif pos.startswith('J'):
            pos_counts['Adjective'] += 1
        elif pos.startswith('R'):
            pos_counts['Adverb'] += 1

    # Named Entity Recognition
    entities = extract_entities(text)
    entity_counts.update([entity[1] for entity in entities])  # Update the counts for the entity types

# Print the total counts of POS tags
print("Total Parts of Speech Counts:")
print(pos_counts)

# Print the entity counts
print("\nTotal Named Entity Counts:")
for entity, count in entity_counts.items():
    print(f"{entity}: {count}")

# Example of constituency and dependency parsing
example_text = reviews_df['Cleaned_Text_Lemmatized'].iloc[0]  # Example sentence

# Constituency Parsing
print("\nConstituency Parsing Tree:")
constituency_tree = constituency_parsing(example_text)
print(constituency_tree)  # Prints the chunked tree

# Dependency Parsing
dependency_doc = dependency_parsing(example_text)
print("\nDependency Parsing Tree:")
for sent in dependency_doc.sents:
    print(sent.root)  # Prints the root of the dependency parse tree
    for token in sent:
        print(f"{token.text} --> {token.dep_} --> {token.head.text}")

# Visualizing the dependency tree
# Uncomment the following line if you have Graphviz installed
# spacy.displacy.serve(dependency_doc, style="dep")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


Total Parts of Speech Counts:
{'Noun': 41245, 'Verb': 12785, 'Adjective': 19348, 'Adverb': 7889}

Total Named Entity Counts:
ORG: 1224
PERSON: 1444
DATE: 334
TIME: 333
CARDINAL: 333
ORDINAL: 222
GPE: 111

Constituency Parsing Tree:
(S
  samsung/NN
  galaxy/NN
  ultra/JJ
  absolute/NN
  powerhouse/NN
  smartphone/NN
  moment/NN
  started/VBD
  using/VBG
  device/NN
  clear/JJ
  samsung/NN
  truly/RB
  outdone/JJ
  model/NN
  gb/JJ
  storage/NN
  ample/JJ
  apps/JJ
  photo/NN
  video/NN
  performance/NN
  lightning/VBG
  fast/JJ
  thanks/NNS
  latest/JJS
  ai/NN
  technology/NN
  powerful/JJ
  processingthe/JJ
  display/NN
  simply/RB
  stunningvivid/JJ
  color/NN
  sharp/JJ
  detail/NN
  deep/JJ
  black/JJ
  make/NN
  everything/NN
  streaming/VBG
  video/NN
  scrolling/VBG
  social/JJ
  medium/NN
  visually/RB
  immersive/JJ
  experience/NN
  build/VBP
  quality/NN
  topnotch/NN
  design/NN
  sleek/JJ
  premium/NN
  giving/VBG
  luxurious/JJ
  feel/NN
  handone/NN
  standout/NN
  featu

#**Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

https://myunt-my.sharepoint.com/:x:/g/personal/sireesharusum_my_unt_edu/EVysDQExkAdKvQ1fN0K5XCwBj7W1vHQFVuEGqsJX5OFddw?e=G9ZNLG

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
Challenges: The assignment's main hurdles were data collecting and cleaning, especially from sites like IMDb and Amazon, where uneven formatting and noise in user-generated material made preprocessing difficult.
Enjoyable Aspects: Working with strong NLP tools like spaCy made the job more interesting because it streamlined complicated chores like dependency parsing and POS labelling. The hands-on approach reinforced core NLP ideas, allowing for practical application while strengthening theoretical comprehension.
Provided time to complete the assignment is sufficient.
