
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.
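
Review listings are commonly served one page at a time, so reaching a target such as 1000 reviews usually means looping over pages, dropping duplicates, and then writing the result to CSV. A minimal sketch of that loop, assuming a hypothetical `fetch_page(page_number)` callable that returns the review strings for one page (how pages are actually requested depends on the site and is not shown here). The names `collect_reviews`, `save_reviews`, `target`, `max_pages`, and `fake_fetch` are illustrative only.

```python
import csv
from typing import Callable, List


def collect_reviews(fetch_page: Callable[[int], List[str]],
                    target: int = 1000,
                    max_pages: int = 200) -> List[str]:
    """Accumulate unique review texts page by page until `target` is reached."""
    seen = set()
    reviews: List[str] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)      # hypothetical: one page of review strings
        if not batch:                 # an empty page means nothing is left
            break
        for text in batch:
            text = text.strip()
            if text and text not in seen:
                seen.add(text)
                reviews.append(text)
        if len(reviews) >= target:
            break
    return reviews[:target]


def save_reviews(reviews: List[str], path: str = "all_reviews.csv") -> None:
    """Write one review per row under a 'review' header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["review"])
        writer.writerows([text] for text in reviews)


# Stand-in for a real page fetcher, used only to exercise the loop.
def fake_fetch(page: int) -> List[str]:
    pages = {1: ["Great film", "Too long "], 2: ["Great film", "Loved the music"]}
    return pages.get(page, [])


sample = collect_reviews(fake_fetch, target=3)
print(sample)
```

Passing the fetcher in as a parameter keeps the loop testable without network access. In a real scraper, a short pause between requests is a reasonable courtesy to the site.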


In [8]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the website you want to scrape
url = 'https://www.imdb.com/title/tt11821912/reviews/?ref_=tt_urv_sm'

# Custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}

# Send a GET request to the URL with custom headers; the timeout keeps the
# cell from hanging indefinitely if the server does not respond
response = requests.get(url, headers=headers, timeout=30)

all_reviews = []

# Check if the request was successful
if response.status_code == 200:
    # Parse the page content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all review summaries
    review_summary = soup.find_all('span', {'data-testid': 'review-summary'})

    # Extract the text of each review and store in the list
    for review in review_summary:
        review_text = review.get_text(strip=True)
        all_reviews.append(review_text)

    # Print reviews (optional)
    print(all_reviews)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

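# Save the reviews to a CSV file (skipped when nothing was collected, so a
# failed request does not produce an empty file and a misleading message)
if all_reviews:
    df = pd.DataFrame(all_reviews, columns=['review'])
    df.to_csv('all_reviews.csv', index=False)
    print("All reviews have been saved to 'all_reviews.csv'.")
    print(df)
else:
    print("No reviews were collected, so no CSV file was written.")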

['TIGER Reigns: RIP to the Rajamouli Myth', "A Visual and Musical Extravaganza Elevated by NTR's Performance", 'NTR electrifying movements on screen and Anirudh awesome music and BGM', 'Energetic movie. Very good bgm.', '"A Gripping Epic with Stellar Performances"', 'Pirates of Paadhaghattam', 'The Devara Review: A Half-Baked Action Spectacle', "Would've been a disaster if not for NTR", 'Watch for other wordly experience', '1 time watch', 'Time wast, money wast, visuvals not clear, hero like katikapari roteen dram there is no high. Voltage actions', 'Superb movie and awesome story and terrific performance by ntr', 'DEVARA: Fear of the Red Sea', 'Worst movie ever!', 'Blockbuster movie', 'Fantastic Story line', 'Fights 1 to 50 , Ok Movie with some good scenes on the sea', 'Save your money and time', 'SUPER MOVIE AND BGM AWESOME', 'Avg content.. the will ride on JR NTRs popularity', 'Middling Action Drama', 'A Disappointing Watch with Poor VFX, Characters, and Story', 'Blockbuster movie, 

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
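
Because code and output are required for each part, it can help to keep the intermediate result of every step rather than only the final string. A minimal sketch of steps (1) to (4) using only the standard library follows. The tiny `STOPWORDS` set is a stand-in for a real list such as NLTK's, and `clean_steps` is an illustrative name. Stemming and lemmatization are left out because they need an external library.

```python
import re

# Illustrative stand-in for a full stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "and", "for", "of", "the", "to", "very"}


def clean_steps(text: str) -> dict:
    """Apply steps (1)-(4) in order and keep the result of each one."""
    steps = {}
    # (1) Drop punctuation and other special characters
    #     (underscores count as word characters and are kept).
    text = re.sub(r"[^\w\s]", "", text)
    steps["no_noise"] = text
    # (2) Drop digits, then collapse the whitespace they leave behind.
    text = " ".join(re.sub(r"\d+", "", text).split())
    steps["no_numbers"] = text
    # (3) Drop stopwords, comparing case-insensitively.
    text = " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
    steps["no_stopwords"] = text
    # (4) Lowercase what remains.
    text = text.lower()
    steps["lowercase"] = text
    return steps


for name, value in clean_steps("1 time watch: Very good BGM!").items():
    print(f"{name:>12}: {value}")
```

Each key in the returned dictionary could become its own CSV column if the per-step output needs to be submitted alongside the final cleaned text.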

In [11]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the CSV file containing the reviews
df = pd.read_csv('all_reviews.csv')

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define stopwords
stop_words = set(stopwords.words('english'))

# Function to clean the text
def clean_text(text):
    # (1) Remove noise (special characters and punctuation)
    text = re.sub(r'[^\w\s]', '', text)

    # (2) Remove numbers
    text = re.sub(r'\d+', '', text)

    # (3) Remove stopwords
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])

    # (4) Lowercase all texts
    text = text.lower()

    # (5) Stemming
    text = ' '.join([stemmer.stem(word) for word in text.split()])

    # (6) Lemmatization
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

    return text

# Apply the cleaning function to the review column
df['clean_review'] = df['review'].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv('cleaned_reviews.csv', index=False)

print("Cleaned reviews have been written to 'cleaned_reviews.csv'.")
print(df)

Cleaned reviews have been written to 'cleaned_reviews.csv'.
                                               review  \
0             TIGER Reigns: RIP to the Rajamouli Myth   
1   A Visual and Musical Extravaganza Elevated by ...   
2   NTR electrifying movements on screen and Aniru...   
3                     Energetic movie. Very good bgm.   
4         "A Gripping Epic with Stellar Performances"   
5                            Pirates of Paadhaghattam   
6    The Devara Review: A Half-Baked Action Spectacle   
7             Would've been a disaster if not for NTR   
8                   Watch for other wordly experience   
9                                        1 time watch   
10  Time wast, money wast, visuvals not clear, her...   
11  Superb movie and awesome story and terrific pe...   
12                        DEVARA: Fear of the Red Sea   
13                                  Worst movie ever!   
14                                  Blockbuster movie   
15                          



# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
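
Part (1) asks for totals per broad category, whereas a tagger such as NLTK's `pos_tag` returns fine-grained Penn Treebank tags (`NN`, `NNS`, `VBD`, `JJR`, and so on). One way to roll them up, sketched here with the standard library only, is to group tags by prefix. The `CATEGORY_PREFIXES` mapping and `summarize_pos` are illustrative names, and the sample counters are the tag counts printed for reviews 1 to 4 in this notebook's output.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four categories the question asks about.
CATEGORY_PREFIXES = {
    "Noun": "NN",       # NN, NNS, NNP, NNPS
    "Verb": "VB",       # VB, VBD, VBG, VBN, VBP, VBZ
    "Adjective": "JJ",  # JJ, JJR, JJS
    "Adverb": "RB",     # RB, RBR, RBS
}


def summarize_pos(tag_counts: Counter) -> Counter:
    """Collapse fine-grained tag counts into Noun/Verb/Adjective/Adverb totals."""
    totals = Counter({name: 0 for name in CATEGORY_PREFIXES})
    for tag, count in tag_counts.items():
        for name, prefix in CATEGORY_PREFIXES.items():
            if tag.startswith(prefix):
                totals[name] += count
                break
    return totals


# Per-review tag counts copied from the notebook output for reviews 1 to 4.
per_review = [
    Counter({"NN": 5}),
    Counter({"NN": 4, "JJ": 2}),
    Counter({"NN": 6, "JJ": 2}),
    Counter({"NN": 2, "VB": 1, "JJ": 1}),
]

overall = Counter()
for counts in per_review:
    overall.update(counts)

print(dict(summarize_pos(overall)))
# → {'Noun': 17, 'Verb': 1, 'Adjective': 5, 'Adverb': 0}
```

Tags outside the four categories (determiners, prepositions, wh-adverbs such as `WRB`, and so on) are simply ignored by this grouping.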

In [14]:
import pandas as pd
import nltk
from nltk import pos_tag, word_tokenize
import spacy
from collections import Counter

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')

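# Load the cleaned reviews; a review that became empty after cleaning is read
# back as NaN, so replace it with an empty string before tokenizing
df = pd.read_csv('cleaned_reviews.csv')
df['clean_review'] = df['clean_review'].fillna('')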

# Initialize spaCy
nlp = spacy.load('en_core_web_sm')

# Parts of Speech Tagging and Counting
def pos_analysis(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    pos_counts = Counter(tag for word, tag in tagged)
    return pos_counts

# Dependency Parsing
def parse_sentence(text):
    doc = nlp(text)
    print("\nDependency Parsing Tree:")
    for sent in doc.sents:
        for token in sent:
            print(f"{token.text} --> {token.dep_} --> {token.head.text}")

# Named Entity Recognition
def named_entity_recognition(text):
    doc = nlp(text)
    entities = Counter((ent.text, ent.label_) for ent in doc.ents)
    return entities

# Analyze each review, accumulating entity counts in the same pass so that
# NER runs only once per review
all_entities = Counter()
for index, row in df.iterrows():
    text = row['clean_review']

    # POS Analysis
    pos_counts = pos_analysis(text)
    print(f"\nPOS counts for review {index + 1}: {pos_counts}")

    # Parse Sentence
    parse_sentence(text)

    # Named Entity Recognition
    entities = named_entity_recognition(text)
    print(f"\nEntities for review {index + 1}: {entities}")
    all_entities.update(entities)

# Summarize entity counts
print("\nOverall entity counts:", all_entities)





POS counts for review 1: Counter({'NN': 5})

Dependency Parsing Tree:
tiger --> compound --> reign
reign --> compound --> rajamouli
rip --> compound --> rajamouli
rajamouli --> compound --> myth
myth --> ROOT --> myth

Entities for review 1: Counter()

POS counts for review 2: Counter({'NN': 4, 'JJ': 2})

Dependency Parsing Tree:
visual --> amod --> music
music --> compound --> extravaganza
extravaganza --> compound --> elev
elev --> nsubj --> ntr
ntr --> aux --> perform
perform --> ROOT --> perform

Entities for review 2: Counter()

POS counts for review 3: Counter({'NN': 6, 'JJ': 2})

Dependency Parsing Tree:
ntr --> compound --> bgm
electrifi --> compound --> movement
movement --> compound --> screen
screen --> compound --> anirudh
anirudh --> compound --> bgm
awesom --> compound --> music
music --> compound --> bgm
bgm --> ROOT --> bgm

Entities for review 3: Counter({('ntr electrifi', 'ORG'): 1})

POS counts for review 4: Counter({'NN': 2, 'VB': 1, 'JJ': 1})

Dependency Parsing T
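
The flat `token --> relation --> head` lines above can be hard to read as a tree. A minimal sketch that rebuilds the hierarchy from such triples and prints it with indentation, using the triples printed for review 1. `render_tree` is an illustrative helper, and the sketch assumes each word appears only once in the sentence, since it keys nodes by their text.

```python
from collections import defaultdict

# (token, relation, head) triples copied from the review 1 output.
triples = [
    ("tiger", "compound", "reign"),
    ("reign", "compound", "rajamouli"),
    ("rip", "compound", "rajamouli"),
    ("rajamouli", "compound", "myth"),
    ("myth", "ROOT", "myth"),
]


def render_tree(triples):
    """Return indented lines, one per token, with each dependent under its head."""
    children = defaultdict(list)
    relation = {}
    root = None
    for token, dep, head in triples:
        relation[token] = dep
        if dep == "ROOT":
            root = token          # the root is listed as its own head
        else:
            children[head].append(token)

    lines = []

    def walk(token, depth):
        lines.append(f"{'  ' * depth}{token} ({relation[token]})")
        for child in children[token]:
            walk(child, depth + 1)

    if root is not None:
        walk(root, 0)
    return lines


print("\n".join(render_tree(triples)))
```

Read this way, `myth` is the root and every other token hangs beneath it as a `compound` modifier. That suggests the parser sees the stemmed, stopword-free text as a bare string of nouns rather than a grammatical sentence, which is worth keeping in mind when interpreting the trees.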

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [15]:
from google.colab import files

# Download the cleaned_reviews.csv file
files.download('cleaned_reviews.csv')



# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
# Challenges:
# Data Scraping: Setting up web scraping can be tricky due to varying HTML structures and potential anti-scraping measures. Debugging issues can be time-consuming.
# Data Cleaning: The multiple steps in cleaning text data require attention to detail, and managing different libraries like NLTK and spaCy can lead to confusion.
# Syntax Analysis: Understanding POS tagging and parsing can be complex, needing a solid grasp of linguistic concepts.

# Enjoyable Aspects:
# Learning Opportunities: Each step offers a chance to learn new techniques and tools, particularly in NLP.
# Problem Solving: Overcoming errors and achieving accurate results can be fulfilling and enhance critical thinking.
# Visualization: Seeing cleaned data and analysis results is satisfying and demonstrates effective processing.

# Opinion on Time Allocation:
# Time Considerations: The time needed varies; experienced individuals may find it adequate, while beginners might feel rushed.
# Iterative Nature: Additional time for experimentation and debugging could enhance learning.