<a href="https://colab.research.google.com/github/Sireesha-cloud/Sireesha_INFO5731_Fall2024/blob/main/INFO5731_Assignment_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
# Your code here
!pip install requests beautifulsoup4 pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Function to get reviews from a single IMDb page
def get_reviews_from_page(movie_id, page_num):
    url = f"https://www.imdb.com/title/{movie_id}/reviews?ref_=tt_ql_3&paginationKey={page_num}"
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch page {page_num}. Status code: {response.status_code}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")
    reviews = []

    # Find all review containers
    review_containers = soup.find_all("div", class_="lister-item-content")

    for container in review_containers:
        review_title = container.find("a", class_="title").get_text(strip=True)
        review_text = container.find("div", class_="text show-more__control").get_text(strip=True)
        rating_tag = container.find("span", class_="rating-other-user-rating")
        review_rating = rating_tag.find("span").get_text(strip=True) if rating_tag else "N/A"
        review_date = container.find("span", class_="review-date").get_text(strip=True)

        reviews.append({
            "Title": review_title,
            "Text": review_text,
            "Rating": review_rating,
            "Date": review_date
        })

    return reviews

# Function to scrape multiple pages until reaching the target number of reviews
def scrape_imdb_reviews(movie_id, max_reviews=1000):
    reviews = []
    page_num = 0  # IMDb doesn't use traditional pagination numbers; we use pagination keys
    pagination_key = ''
    while len(reviews) < max_reviews:
        print(f"Scraping page {page_num + 1}...")
        new_reviews = get_reviews_from_page(movie_id, pagination_key)
        if not new_reviews:
            break
        reviews.extend(new_reviews)
        if len(reviews) >= max_reviews:
            break
        page_num += 1
        time.sleep(1)  # Avoid getting blocked by IMDb

    return reviews[:max_reviews]

# Save reviews to CSV
def save_reviews_to_csv(reviews, filename="imdb_reviews.csv"):
    df = pd.DataFrame(reviews)
    df.to_csv(filename, index=False)
    print(f"Saved {len(reviews)} reviews to {filename}")

# Example usage
if __name__ == "__main__":
    # Example movie IDs: Oppenheimer (2023) -> tt15398776, Barbie (2023) -> tt1517268
    movie_ids = ["tt15398776", "tt1517268"]  # Replace with the desired movie IDs
    all_reviews = []
    max_reviews = 1000
    for movie_id in movie_ids:
        reviews = scrape_imdb_reviews(movie_id, max_reviews=len(all_reviews) < max_reviews)
        all_reviews.extend(reviews)
        if len(all_reviews) >= max_reviews:
            break
    save_reviews_to_csv(all_reviews)




Scraping page 1...
Scraping page 1...
Saved 1 reviews to imdb_reviews.csv


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write code for each of the sub parts with proper comments.
!pip install pandas nltk
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# Load the sample dataset
df = pd.read_csv('/content/movie_reviews_sample.csv')
df['Text']
# Remove special characters and punctuation
df['Cleaned_Text'] = df['Text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df[['Text', 'Cleaned_Text']].head()
# Remove numbers
df['Cleaned_Text'] = df['Cleaned_Text'].apply(lambda x: re.sub(r'\d+', '', x))
df[['Text', 'Cleaned_Text']].head()
# Define stopwords
stop_words = set(stopwords.words('english'))
# Remove stopwords
df['Cleaned_Text'] = df['Cleaned_Text'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))
df[['Text', 'Cleaned_Text']].head()
# Convert text to lowercase
df['Cleaned_Text'] = df['Cleaned_Text'].apply(lambda x: x.lower())
df[['Text', 'Cleaned_Text']].head()
# Apply stemming
stemmer = PorterStemmer()
df['Stemmed_Text'] = df['Cleaned_Text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
df[['Text', 'Cleaned_Text', 'Stemmed_Text']].head()
# Apply lemmatization
lemmatizer = WordNetLemmatizer()
df['Lemmatized_Text'] = df['Cleaned_Text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
df[['Text', 'Cleaned_Text', 'Stemmed_Text', 'Lemmatized_Text']].head()













[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Text,Cleaned_Text,Stemmed_Text,Lemmatized_Text
0,The film was visually stunning and emotionally...,film visually stunning emotionally powerful,film visual stun emot power,film visually stunning emotionally powerful
1,"While the film had great visuals, the plot was...",film great visuals plot bit slow,film great visual plot bit slow,film great visuals plot bit slow
2,The direction was top-notch and the acting was...,direction topnotch acting superb,direct topnotch act superb,direction topnotch acting superb
3,"The story lacked depth in some areas, but over...",story lacked depth areas overall good watch,stori lack depth area overal good watch,story lacked depth area overall good watch
4,The actors delivered brilliant performances. O...,actors delivered brilliant performances one be...,actor deliv brilliant perform one best movi year,actor delivered brilliant performance one best...


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [19]:
!pip install spacy textblob nltk
!python -m spacy download en_core_web_sm
import spacy
from textblob import TextBlob
from collections import Counter
import nltk

# Download NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load spaCy English language model
nlp = spacy.load("en_core_web_sm")

# Example cleaned texts (replace these with your cleaned text data)
cleaned_texts = [
    "film visually stunning emotionally powerful",
    "film great visuals plot bit slow",
    "direction topnotch acting superb"
]

# 1. Parts of Speech (POS) Tagging and Counting Nouns, Verbs, Adjectives, and Adverbs
pos_counts = Counter({'Nouns': 0, 'Verbs': 0, 'Adjectives': 0, 'Adverbs': 0})

# Function to map NLTK POS tags to broader categories
def get_pos_category(tag):
    if tag.startswith('N'):
        return 'Nouns'
    elif tag.startswith('V'):
        return 'Verbs'
    elif tag.startswith('J'):
        return 'Adjectives'
    elif tag.startswith('R'):
        return 'Adverbs'
    return None

# Perform POS tagging and count relevant categories using TextBlob
for text in cleaned_texts:
    blob = TextBlob(text)
    tags = blob.tags
    for word, pos in tags:
        category = get_pos_category(pos)
        if category:
            pos_counts[category] += 1

print("POS Tagging Counts:", pos_counts)

# 2. Constituency Parsing and Dependency Parsing
def analyze_parsing(text):
    doc = nlp(text)

    # Constituency Parsing (spaCy does not directly support constituency parsing)
    print(f"Constituency Parsing (approximated as dependency parsing): {doc}")

    # Dependency Parsing
    print("\nDependency Parsing:")
    for token in doc:
        print(f"{token.text} --> {token.dep_} ({token.head.text})")

    print("\nDependency Tree:")
    for token in doc:
        print(f"{token.text} ({token.dep_}) <-- {token.head.text} ({token.head.dep_})")

# Example parsing for one sentence
example_text = "The direction was top-notch and the acting was superb."
analyze_parsing(example_text)

# 3. Named Entity Recognition (NER)
ner_counts = Counter()

for text in cleaned_texts:
    doc = nlp(text)
    for ent in doc.ents:
        ner_counts[ent.label_] += 1
        print(f"Entity: {ent.text}, Label: {ent.label_}")

print("NER Entity Counts:", ner_counts)# Your code here



Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m52.0 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


POS Tagging Counts: Counter({'Nouns': 6, 'Verbs': 3, 'Adjectives': 3, 'Adverbs': 3})
Constituency Parsing (approximated as dependency parsing): The direction was top-notch and the acting was superb.

Dependency Parsing:
The --> det (direction)
direction --> nsubj (was)
was --> ROOT (was)
top --> amod (notch)
- --> punct (notch)
notch --> acomp (was)
and --> cc (was)
the --> det (acting)
acting --> nsubj (was)
was --> conj (was)
superb --> acomp (was)
. --> punct (was)

Dependency Tree:
The (det) <-- direction (nsubj)
direction (nsubj) <-- was (ROOT)
was (ROOT) <-- was (ROOT)
top (amod) <-- notch (acomp)
- (punct) <-- notch (acomp)
notch (acomp) <-- was (ROOT)
and (cc) <-- was (ROOT)
the (det) <-- acting (nsubj)
acting (nsubj) <-- was (conj)
was (conj) <-- was (ROOT)
superb (acomp) <-- was (conj)
. (punct) <-- was (conj)
NER Entity Counts: Counter()


#**Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below