<a href="https://colab.research.google.com/github/Bhavyamadhuri/Bhavya_INFO5731_Fall2024/blob/main/Devarakonda_Bhavya_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions will incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [12]:
import requests  # For sending HTTP requests to access web pages
from bs4 import BeautifulSoup  # For parsing HTML content from those pages
import pandas as pd  # To organize collected data and save it into a CSV file
import time  # To manage delays between requests to avoid overwhelming the server

# Welcome to the Amazon Reviews Scraper! 🎉
# This program retrieves customer reviews from a specified Amazon product page.
# You can enter the product URL and specify how many reviews you'd like to collect.

def fetch_amazon_reviews(product_url, max_reviews=1000):
    all_reviews = []  # List to hold all the reviews we collect
    page_number = 1  # Starting at the first page of reviews

    while len(all_reviews) < max_reviews:
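        # Construct the URL for the current page of reviews
        # (use '?' when the product URL has no query string yet, '&' otherwise)
        separator = '&' if '?' in product_url else '?'
        current_url = f"{product_url}{separator}pageNumber={page_number}"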
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
        }

        try:
            # Make a request to the Amazon page (time out instead of hanging indefinitely)
            response = requests.get(current_url, headers=headers, timeout=10)
            response.raise_for_status()  # Check for HTTP errors

            # Parse the page content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            # Locate all the review elements on the page
            review_elements = soup.find_all('div', {'data-hook': 'review'})

            # Stop if no reviews are found
            if not review_elements:
                print("No more reviews found. We can stop here!")
                break

            # Extract each review from the page
            for review in review_elements:
                body = review.find('span', {'data-hook': 'review-body'})
                if body is None:
                    continue  # Skip review blocks that have no body text
                all_reviews.append(body.get_text(strip=True))  # Add the review to our list

                # Check if we've reached the maximum number of reviews
                if len(all_reviews) >= max_reviews:
                    break

            print(f"Collected {len(all_reviews)} reviews so far...")  # Track progress
            page_number += 1  # Move to the next page
            time.sleep(1)  # Pause to be considerate to the server

        except requests.exceptions.RequestException as error:
            print(f"An error occurred: {error}")
            break

    return all_reviews[:max_reviews]  # Return the reviews we've collected

def save_reviews_to_csv(reviews, filename='amazon_reviews.csv'):
    # Create a DataFrame to organize the reviews
    reviews_df = pd.DataFrame(reviews, columns=['Review'])

    # Save the DataFrame to a CSV file
    reviews_df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Successfully saved {len(reviews)} reviews to {filename}! 🎉")  # Confirmation message

if __name__ == "__main__":
    # Prompt the user for the product URL and maximum number of reviews
    product_url = input("Enter the Amazon product URL: ")
    max_reviews = int(input("Enter the maximum number of reviews to collect (default 1000): ") or 1000)
    output_filename = input("Enter the output CSV filename (default 'amazon_reviews.csv'): ") or 'amazon_reviews.csv'

    # Start collecting reviews!
    reviews = fetch_amazon_reviews(product_url, max_reviews)

    # Save the collected reviews to a CSV file
    save_reviews_to_csv(reviews, output_filename)

    # Summary of collected reviews
    print("\nSummary of collected reviews:")
    print(f"Total reviews collected: {len(reviews)}")
    print("Sample reviews:")
    for review in reviews[:5]:  # Display the first 5 reviews as a sample
        print(f"- {review}")


Enter the Amazon product URL: https://www.amazon.com/TracFone-Samsung-Galaxy-A03s-Black/dp/B0CHK6LWKZ?asc_campaign=d069222cb4fc5a7bbdc984dfa1958b21&asc_source=01GTER1X19F274KT48MCTDGTQG&tag=namespacebran383-20
Enter the maximum number of reviews to collect (default 1000): 100
Enter the output CSV filename (default 'amazon_reviews.csv'): amazon_reviews.csv
Collected 8 reviews so far...
Collected 16 reviews so far...
No more reviews found. We can stop here!
Successfully saved 16 reviews to amazon_reviews.csv! 🎉

Summary of collected reviews:
Total reviews collected: 16
Sample reviews:
- My resident at the nursing home broke his screen (same phone) he didn't have enough money so I'm blessed I came across this to replace his since I work paycheck to paycheck. Works as should, he's happy! Amen..Read more
- I use these cheaper phones for my children as a better option than tablets. They are cheaper and are easier for the kids to store. My kids have had phone and it is still in the best of sh
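
The scraper above builds each page URL by string concatenation. As a minimal sketch of a more defensive alternative, the hypothetical helper `with_page_number` below (not part of the original notebook) uses the standard library's `urllib.parse` to set the `pageNumber` parameter whether or not the URL already has a query string. The example.com URLs are placeholders, and whether a given Amazon page honors `pageNumber` is an assumption that the run above does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_page_number(url: str, page_number: int) -> str:
    """Return `url` with its pageNumber query parameter set to `page_number`.

    Hypothetical helper: existing query parameters are preserved, and an
    earlier pageNumber value is replaced rather than duplicated.
    """
    parts = urlparse(url)
    # Keep every existing parameter except an older pageNumber, if present.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "pageNumber"
    ]
    query.append(("pageNumber", str(page_number)))
    return urlunparse(parts._replace(query=urlencode(query)))


# Placeholder URLs, not real product pages.
print(with_page_number("https://example.com/dp/ITEM", 2))
print(with_page_number("https://example.com/dp/ITEM?tag=abc&pageNumber=1", 3))
```

Rebuilding the query from parsed pairs avoids both a missing `?` and a repeated `pageNumber` when the function is called on a URL that was already paginated.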

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [13]:
import pandas as pd  # For handling data in tabular format
import re  # For regular expressions to process text
import nltk  # For various natural language processing tasks
from nltk.corpus import stopwords  # To filter out common words (stopwords)
from nltk.stem import PorterStemmer, WordNetLemmatizer  # For stemming and lemmatization

# Download NLTK resources if they haven't been downloaded yet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

def load_reviews(file_path='amazon_reviews.csv'):
    """Load reviews from a CSV file and return as a DataFrame."""
    return pd.read_csv(file_path)

def clean_text(review_text):
    """Clean the provided text through multiple steps."""

    # Step 1: Remove special characters and punctuation
    review_text = re.sub(r'[^\w\s]', '', review_text)
    print(f"After removing noise: {review_text}")

    # Step 2: Remove numbers
    review_text = re.sub(r'\d+', '', review_text)
    print(f"After removing numbers: {review_text}")

    # Step 3: Convert text to lowercase (done before stopword removal because the NLTK stopword list is lowercase)
    review_text = review_text.lower()
    print(f"After lowercasing: {review_text}")

    # Step 4: Remove stopwords
    stop_words = set(stopwords.words('english'))
    review_text = ' '.join(word for word in review_text.split() if word not in stop_words)
    print(f"After removing stopwords: {review_text}")

    # Step 5: Stemming the words
    stemmer = PorterStemmer()
    stemmed_text = ' '.join(stemmer.stem(word) for word in review_text.split())
    print(f"After stemming: {stemmed_text}")

    # Step 6: Lemmatization of the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_text = ' '.join(lemmatizer.lemmatize(word) for word in review_text.split())
    print(f"After lemmatization: {lemmatized_text}")

    return stemmed_text, lemmatized_text

def process_reviews(df):
    """Apply the cleaning process to each review in the DataFrame."""
    reviews = df['Review'].fillna('').astype(str)  # Guard against missing or non-string reviews
    df[['Stemmed_Review', 'Lemmatized_Review']] = reviews.apply(clean_text).apply(pd.Series)
    return df

def save_cleaned_reviews(df, output_file='cleaned_amazon_reviews.csv'):
    """Save the cleaned reviews to a new CSV file."""
    df.to_csv(output_file, index=False, encoding='utf-8')
    print(f"Cleaned reviews saved to {output_file}")

if __name__ == "__main__":
    # Load reviews from the specified CSV file
    reviews_df = load_reviews()

    # Clean the reviews using the defined process
    cleaned_reviews_df = process_reviews(reviews_df)

    # Save the cleaned reviews to a new CSV file
    save_cleaned_reviews(cleaned_reviews_df)


After removing noise: My resident at the nursing home broke his screen same phone he didnt have enough money so Im blessed I came across this to replace his since I work paycheck to paycheck Works as should hes happy AmenRead more
After removing numbers: My resident at the nursing home broke his screen same phone he didnt have enough money so Im blessed I came across this to replace his since I work paycheck to paycheck Works as should hes happy AmenRead more
After lowercasing: my resident at the nursing home broke his screen same phone he didnt have enough money so im blessed i came across this to replace his since i work paycheck to paycheck works as should hes happy amenread more
After removing stopwords: resident nursing home broke screen phone didnt enough money im blessed came across replace since work paycheck paycheck works hes happy amenread
After stemming: resid nurs home broke screen phone didnt enough money im bless came across replac sinc work paycheck paycheck work he hap

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
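
In the logged run above, the scraped text `Amen..Read more` ends up as the single token `amenread`, because punctuation is deleted rather than replaced with a space and the trailing link label is kept. The following is a minimal sketch of one way to avoid that, not the notebook's method: the hypothetical `remove_noise` helper assumes English-only reviews and assumes the label is exactly `Read more` at the very end of the text.

```python
import re

# Assumption: scraped review bodies may end with the page's "Read more" label.
READ_MORE_SUFFIX = re.compile(r"\s*Read more\s*$")


def remove_noise(text: str) -> str:
    """Drop a trailing "Read more" label, then replace punctuation and digits with spaces."""
    text = READ_MORE_SUFFIX.sub("", text)
    # Replace (rather than delete) non-letters so neighbouring words stay separate.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(remove_noise("Works as should, he's happy! Amen..Read more"))
```

One trade-off of this choice: replacing with spaces also splits contractions, so `he's` becomes the two fragments `he` and `s` instead of `hes`.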


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [14]:
import pandas as pd  # For handling data in tables
import nltk  # For natural language processing tasks
import spacy  # For advanced NLP tasks
from nltk import pos_tag, word_tokenize  # For POS tagging and tokenization
from nltk.tree import Tree  # For handling parse trees
from nltk import RegexpParser  # For creating custom parsers

# Download necessary NLTK resources if not already downloaded
nltk.download('punkt')  # For tokenization
nltk.download('averaged_perceptron_tagger')  # For POS tagging

# Load the SpaCy model for Named Entity Recognition (NER)
nlp = spacy.load("en_core_web_sm")

def load_cleaned_reviews(file_path='cleaned_amazon_reviews.csv'):
    """Load cleaned reviews from a CSV file."""
    return pd.read_csv(file_path)

def pos_tagging(text):
    """Tag parts of speech for each word in the text."""
    tokens = word_tokenize(text)  # Tokenize the text into individual words
    tagged = pos_tag(tokens)  # Tag each token with its part of speech

    # Initialize counts for different parts of speech
    pos_counts = {'N': 0, 'V': 0, 'Adj': 0, 'Adv': 0}

    # Count occurrences of each part of speech
    for word, tag in tagged:
        if tag.startswith('N'):
            pos_counts['N'] += 1  # Count nouns
        elif tag.startswith('V'):
            pos_counts['V'] += 1  # Count verbs
        elif tag.startswith('J'):
            pos_counts['Adj'] += 1  # Count adjectives
        elif tag.startswith('RB'):  # 'RB' prefix excludes 'RP' (particle)
            pos_counts['Adv'] += 1  # Count adverbs

    return pos_counts, tagged  # Return counts and tagged words

def constituency_parsing(sentence):
    """Create a constituency parsing tree for a given sentence."""
    # Define grammar for noun phrases (NP) and verb phrases (VP)
    grammar = r"""
        NP: {<DT>?<JJ>*<NN.*>}   # Noun Phrase
        VP: {<VB.*><NP|PP>+$}    # Verb Phrase
        PP: {<IN><NP>}           # Prepositional Phrase
    """
    cp = RegexpParser(grammar)  # Create a parser with the defined grammar
    tokens = word_tokenize(sentence)  # Tokenize the sentence
    tagged = pos_tag(tokens)  # Tag the tokens

    tree = cp.parse(tagged)  # Parse the tagged tokens to create the constituency tree
    return tree  # Return the parse tree

def dependency_parsing(sentence):
    """Create a dependency parsing tree for a given sentence using SpaCy."""
    doc = nlp(sentence)  # Process the sentence with SpaCy
    return [(token.text, token.dep_, token.head.text) for token in doc]  # Return a list of (word, dependency, head)

def named_entity_recognition(text):
    """Extract named entities and their counts from the text."""
    doc = nlp(text)  # Process the text with SpaCy
    # Initialize counts for different entity types (spaCy labels places as GPE or LOC)
    entities = {'PERSON': 0, 'ORG': 0, 'GPE': 0, 'LOC': 0, 'PRODUCT': 0, 'DATE': 0}

    # Count each type of entity found in the text
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_] += 1

    return entities  # Return the counts of named entities

if __name__ == "__main__":
    try:
        # Load cleaned reviews from the CSV file
        reviews_df = load_cleaned_reviews()

        # Ensure there are reviews to analyze
        if reviews_df.empty or 'Lemmatized_Review' not in reviews_df.columns:
            raise ValueError("No cleaned reviews found or the column 'Lemmatized_Review' is missing.")

        # Analyze the first review for demonstration
        sample_review = reviews_df['Lemmatized_Review'].iloc[0]
        print(f"Sample Review: {sample_review}\n")

        # (1) Parts of Speech (POS) Tagging
        pos_counts, tagged_words = pos_tagging(sample_review)
        print("POS Tagging Counts:", pos_counts)
        print("Tagged Words:", tagged_words)

        # (2) Constituency Parsing
        constituency_tree = constituency_parsing(sample_review)
        print("\nConstituency Parsing Tree:")
        print(constituency_tree)

        # (2) Dependency Parsing
        dep_tree = dependency_parsing(sample_review)
        print("\nDependency Parsing Tree:")
        for token, dep, head in dep_tree:
            print(f"{token} -> {dep} -> {head}")

        # (3) Named Entity Recognition
        entities = named_entity_recognition(sample_review)
        print("\nNamed Entity Counts:", entities)

    except Exception as e:
        print(f"An error occurred: {e}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Sample Review: resident nursing home broke screen phone didnt enough money im blessed came across replace since work paycheck paycheck work he happy amenread

POS Tagging Counts: {'N': 11, 'V': 4, 'Adj': 4, 'Adv': 0}
Tagged Words: [('resident', 'JJ'), ('nursing', 'NN'), ('home', 'NN'), ('broke', 'VBD'), ('screen', 'JJ'), ('phone', 'NN'), ('didnt', 'NN'), ('enough', 'JJ'), ('money', 'NN'), ('im', 'NNS'), ('blessed', 'VBD'), ('came', 'VBD'), ('across', 'IN'), ('replace', 'VB'), ('since', 'IN'), ('work', 'NN'), ('paycheck', 'NN'), ('paycheck', 'NN'), ('work', 'NN'), ('he', 'PRP'), ('happy', 'JJ'), ('amenread', 'NN')]

Constituency Parsing Tree:
(S
  (NP resident/JJ nursing/NN)
  (NP home/NN)
  broke/VBD
  (NP screen/JJ phone/NN)
  (NP didnt/NN)
  (NP enough/JJ money/NN)
  (NP im/NNS)
  blessed/VBD
  came/VBD
  across/IN
  replace/VB
  (PP since/IN (NP work/NN))
  (NP paycheck/NN)
  (NP paycheck/NN)
  (NP work/NN)
  he/PRP
  (NP happy/JJ amenread/NN))

Dependency Parsing Tree:
resident -> 
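
The cell above tags only the first review, while part (1) asks for totals over the text. A minimal sketch of aggregating the counts over many reviews follows. `count_pos` is a hypothetical helper that takes already-tagged reviews (for example, the lists returned by `nltk.pos_tag`), and the second review in the example is invented for illustration. It matches two-letter Penn Treebank prefixes so that `RP` (particle) is not counted as an adverb.

```python
# Two-letter Penn Treebank tag prefixes mapped to the categories in part (1).
PREFIX_TO_CATEGORY = {"NN": "N", "VB": "V", "JJ": "Adj", "RB": "Adv"}


def count_pos(tagged_reviews):
    """Sum noun/verb/adjective/adverb counts over an iterable of tagged reviews.

    Each review is a list of (word, tag) pairs; tags outside the four
    categories are ignored.
    """
    totals = {category: 0 for category in PREFIX_TO_CATEGORY.values()}
    for tagged in tagged_reviews:
        for _word, tag in tagged:
            category = PREFIX_TO_CATEGORY.get(tag[:2])
            if category is not None:
                totals[category] += 1
    return totals


tagged_reviews = [
    # First four tokens of the sample output above.
    [("resident", "JJ"), ("nursing", "NN"), ("home", "NN"), ("broke", "VBD")],
    # Invented second review, for illustration only.
    [("works", "VBZ"), ("really", "RB"), ("well", "RB"), ("up", "RP")],
]
print(count_pos(tagged_reviews))
```

Taking tagged input keeps the counting logic separate from the tagger, so the same function works whichever tagger produced the pairs.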

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [15]:
import pandas as pd  # Importing pandas for data manipulation and analysis

# Create a dictionary with some cleaned reviews as an example
student_cleaned_data = {
    "Lemmatized_Review": [
        "this is a cleaned review example one",
        "this is a cleaned review example two",
        "this is a cleaned review example three",
        # You can add more cleaned reviews here as needed
    ]
}

# Convert the dictionary into a DataFrame
cleaned_reviews_df = pd.DataFrame(student_cleaned_data)

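# Save the DataFrame to a CSV file
# Note: this reuses the filename from Question 2, so it overwrites that file with the example rows above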
cleaned_reviews_df.to_csv('cleaned_amazon_reviews.csv', index=False)
print("Successfully saved cleaned data to 'cleaned_amazon_reviews.csv'")

Successfully saved cleaned data to 'cleaned_amazon_reviews.csv'


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion of the time provided to complete the assignment?

This assignment provided a comprehensive introduction to text processing and analysis using Python.
It offered practical experience with different libraries and techniques,
building technical skills and giving deeper insight into natural language processing.