<a href="https://colab.research.google.com/github/GatlaDeepthi/Deepthi_INFO5731_Spring2024/blob/main/Deepthi_Gatla_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [13]:
# Write your code here
#Importing all required libraries and modules
import requests
from bs4 import BeautifulSoup
import csv

In [14]:
# Define the IMDB movie URL
movie_url = "https://www.imdb.com/title/tt6718170/reviews/?ref_=tt_ql_2"

# Function to scrape reviews from a single page for the provided url
def Scrapereviews(url):
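    response = requests.get(url, timeout=30)  # timeout avoids hanging on a stalled connection
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page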
    soup = BeautifulSoup(response.text, 'html.parser')
    reviews = []

    review_containers = soup.find_all("div", class_="lister-item-content")
    for reviewcontainer in review_containers:
        Title = reviewcontainer.find("a", class_="title").get_text(strip=True)
        Review_username = reviewcontainer.find("span", class_="display-name-link").find("a").get_text(strip=True)
        date_of_review = reviewcontainer.find("span", class_="review-date").get_text(strip=True)
        Text_review = reviewcontainer.find("div", class_="text").get_text(strip=True)

        reviews.append([Title, Review_username, date_of_review, Text_review])

    return reviews

# Function to scrape reviews from multiple pages, given the base URL and the number of pages to scrape
def scrape_reviews_from_multiple_pages(base_url, num_pages):
    all_movie_reviews = []
    for page_num in range(1, num_pages + 1):
        page_url = f"{base_url}&start={10 * (page_num - 1)}"
        page_reviews = Scrapereviews(page_url)
        all_movie_reviews.extend(page_reviews)
    return all_movie_reviews

# Specify the number of pages to scrape
num_pages_to_scrape = 40  # enough pages to collect at least 1000 reviews in total

# Scrape reviews from the provided URL for the requested number of pages
all_movie_reviews = scrape_reviews_from_multiple_pages(movie_url, num_pages_to_scrape)

# Saving the obtained reviews in the csv file
csv_file = "movie_reviews.csv"
with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Username", "Review_Date", "Review_Text"])
    writer.writerows(all_movie_reviews)

print(f"{len(all_movie_reviews)} reviews have been saved to {csv_file}.")


1000 reviews have been saved to movie_reviews.csv.
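
The run above made 40 requests and saved exactly 1000 rows, i.e. 25 reviews per request, while the `start` offset advances by only 10 per page. If the site honors `start`, consecutive pages overlap; if it ignores it, every request returns the same page. Either way it is worth checking how many rows are actually distinct before cleaning. A minimal sketch, assuming rows shaped like those in `all_movie_reviews`; the helper name `duplicate_report` and the `sample` data are hypothetical:

```python
from collections import Counter

def duplicate_report(rows):
    """Return (total rows, unique rows, dict of rows seen more than once)."""
    counts = Counter(tuple(row) for row in rows)  # lists are unhashable, so use tuples
    repeated = {row: n for row, n in counts.items() if n > 1}
    return len(rows), len(counts), repeated

# Hypothetical sample standing in for all_movie_reviews
sample = [
    ["Great", "user_a", "5 April 2023", "Loved it."],
    ["Okay", "user_b", "6 April 2023", "It was fine."],
    ["Great", "user_a", "5 April 2023", "Loved it."],
]

total, unique, repeated = duplicate_report(sample)
print(f"{total} rows, {unique} unique, {len(repeated)} repeated")
```

Calling `duplicate_report(all_movie_reviews)` in the notebook would show whether the 1000 saved rows are 1000 distinct reviews.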


In [15]:
# Print the first 10 movie reviews obtained from above provided url
head_reviews = all_movie_reviews[:10]
for review in head_reviews:
    print("Title:", review[0])
    print("Username:", review[1])
    print("Review_Date:", review[2])
    print("Review_Text:", review[3])
    print()

Title: Illumination's best since Despicable Me 1.
Username: benjaminskylerhill
Review_Date: 5 April 2023
Review_Text: Granted, this film is not going to appeal to people who have never been fans of any games featuring the titular brothers. But given the fact that these filmmakers were tasked with making a movie that caters to a fan base that spans four decades and thus have a myriad of tastes and expectations, they did a pretty good job. In fact, a slightly better job than I was expecting.The voice cast was the subject of much controversy, and I thought that for the most part it was actually spot-on. A couple of characters were horribly miscast in my opinion, but in particular Jack Black as Bowser and Charlie Day as Luigi shine as perfectly encapsulating the spirit of the characters.Platforming elements from the games was very efficiently worked into the action sequences, and some of them were quite thrilling. This is easily Illumination's most visually stunning film, and a massive ste

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [16]:
# Write code for each of the sub parts with proper comments.
#Importing all the required modules and libraries
import csv
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer


In [17]:
#downloading the NLTK resources

nltk.download('punkt') # downloading the punkt
nltk.download('stopwords') #downloading the stopwords corpus
nltk.download('wordnet') #downloading the wordnet data

# Defining the paths for CSV Files
input_csv_file = "movie_reviews.csv"
output_csv_file = "movie_reviews_cleanedData.csv"

# Initializing the NLTK components
stop_words = set(stopwords.words('english'))  # set of English stopwords for fast membership tests
stemmer = SnowballStemmer('english')  # English Snowball stemmer
lemmatizer = WordNetLemmatizer()  # WordNet lemmatizer

# Defining a function for cleaning and preprocessing the text
def Text_Cleaning(text):
    # Tokenize the text
    tokens_text = word_tokenize(text)

    # Keep alphabetic tokens only (drops punctuation, special characters, and numbers),
    # remove stopwords, and lowercase what remains
    cleaned_tokens = [word.lower() for word in tokens_text if word.isalpha() and word.lower() not in stop_words]

    # Apply stemming. Lemmatization is left disabled here; if enabled, it should run on
    # cleaned_tokens, because stems are often not dictionary words that WordNet can look up
    stemmed_tokens = [stemmer.stem(word) for word in cleaned_tokens]
    #lemmatized_tokens = [lemmatizer.lemmatize(word) for word in cleaned_tokens]

    return ' '.join(stemmed_tokens)

# Read the input CSV and write to the output CSV with cleaned data
with open(input_csv_file, 'r', newline='', encoding='utf-8') as input_file:
    with open(output_csv_file, 'w', newline='', encoding='utf-8') as output_file:
        reader = csv.reader(input_file)
        writer = csv.writer(output_file)

        header = next(reader)
        header.append("Cleaned_Text")  # Add a new column for cleaned text in the csv file
        writer.writerow(header)

        for row in reader:
            title, username, review_date, review_text = row[:4]
            cleaned_text = Text_Cleaning(review_text)
            row.append(cleaned_text)
            writer.writerow(row)

print(f"Data has been cleaned and saved to {output_csv_file}.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Data has been cleaned and saved to movie_reviews_cleanedData.csv.
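
`Text_Cleaning` performs the noise, number, stopword, and lowercase steps in a single list comprehension, so the effect of each step is not visible on its own. The sketch below separates them using only the standard library. It is an illustration under assumptions, not the notebook's method: `TINY_STOPWORDS` is a small hypothetical stand-in for NLTK's English list, `clean_steps` is a hypothetical helper, and the regular expressions only handle ASCII letters.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list
TINY_STOPWORDS = {"the", "a", "is", "of", "and", "in"}

def clean_steps(text):
    """Return the text after each cleaning step so every stage can be inspected."""
    steps = {}
    # (1) Replace punctuation and special characters with spaces (ASCII-only pattern)
    steps["no_noise"] = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # (2) Replace digit runs with spaces
    steps["no_numbers"] = re.sub(r"\d+", " ", steps["no_noise"])
    # (4) Lowercase everything
    steps["lowercase"] = steps["no_numbers"].lower()
    # (3) Drop stopwords after lowercasing so "The" and "the" are treated alike
    tokens = [w for w in steps["lowercase"].split() if w not in TINY_STOPWORDS]
    steps["no_stopwords"] = " ".join(tokens)
    return steps

result = clean_steps("The 2 brothers saved the Kingdom in 2023!")
for name, value in result.items():
    print(f"{name}: {value!r}")
```

Printing each stage with `!r` makes leftover whitespace visible, which helps when deciding whether a step should delete characters or replace them with spaces.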


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [18]:
# Your code here
#Importing NLTK and downloading the 'averaged_perceptron_tagger' module required for POS tagging
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [19]:
#Importing the required libraries and modules
import csv
import nltk
import spacy
from nltk import pos_tag, ne_chunk
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet
from spacy import displacy

In [20]:
# Loading spaCy's English language model (it must already be installed)
nlp = spacy.load("en_core_web_sm")

# Loading the above cleaned and preprocessed text from the CSV file
input_csv_file = "movie_reviews_cleanedData.csv"

# Initializing the required counters ---- POS tagging
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Initializing the counters for named entities
person_count = 0
organization_count = 0
location_count = 0
product_count = 0
date_count = 0

# defining function to perform POS tagging and entity recognition
def Text_analyze(text):
    global noun_count, verb_count, adj_count, adv_count
    global person_count, organization_count, location_count, product_count, date_count

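    # Tokenizing text into sentences (the cleaned text has no punctuation left,
    # so each review is effectively treated as a single sentence)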
    Tokenized_sentences = sent_tokenize(text)

    for sentence in Tokenized_sentences:
        # performing POS tagging
        words = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(words) #tagging each word

        #Using for loop to get the count of each Noun,Verb,Adjective and Adverb
        for word, pos in pos_tags:
            if pos.startswith('N'):  # Noun
                noun_count += 1
            elif pos.startswith('V'):  # Verb
                verb_count += 1
            elif pos.startswith('J'):  # Adjective
                adj_count += 1
            elif pos.startswith('R'):  # Adverb
                adv_count += 1

        # Named Entity Recognition with spaCy
        doc = nlp(sentence)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                person_count += 1
            elif ent.label_ == 'ORG':
                organization_count += 1
            elif ent.label_ == 'GPE':
                location_count += 1
            elif ent.label_ == 'PRODUCT':
                product_count += 1
            elif ent.label_ == 'DATE':
                date_count += 1

# Reading the CSV file and analyzing the text
with open(input_csv_file, 'r', newline='', encoding='utf-8') as input_file:
    reader = csv.reader(input_file)
    header = next(reader)  # Skipping the header row

    for row in reader:
        cleaned_text = row[-1]
        Text_analyze(cleaned_text)

# Display the analysis results
print("POS Tagging Results:")
print(f"Total Nouns: {noun_count}")
print(f"Total Verbs: {verb_count}")
print(f"Total Adjectives: {adj_count}")
print(f"Total Adverbs: {adv_count}")

print("\nNamed Entity Recognition Results:")
print(f"Persons: {person_count}")
print(f"Organizations: {organization_count}")
print(f"Locations: {location_count}")
print(f"Products: {product_count}")
print(f"Dates: {date_count}")

POS Tagging Results:
Total Nouns: 47480
Total Verbs: 11880
Total Adjectives: 18360
Total Adverbs: 4640

Named Entity Recognition Results:
Persons: 1600
Organizations: 640
Locations: 520
Products: 40
Dates: 360
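
The four global counters and the `startswith` chain in `Text_analyze` can also be written with a `collections.Counter` keyed on the first letter of the Penn Treebank tag, which avoids `global` state. A minimal sketch, assuming hand-written `(word, tag)` pairs in place of real `nltk.pos_tag` output; `PREFIX_TO_CLASS` and `count_pos` are hypothetical names:

```python
from collections import Counter

# First letter of a Penn Treebank tag mapped to the word class counted in this assignment
PREFIX_TO_CLASS = {"N": "noun", "V": "verb", "J": "adjective", "R": "adverb"}

def count_pos(tagged):
    """Count nouns, verbs, adjectives, and adverbs in a list of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged:
        word_class = PREFIX_TO_CLASS.get(tag[:1])
        if word_class is not None:  # tags such as DT or IN are ignored
            counts[word_class] += 1
    return counts

# Hand-written tags for illustration, not actual tagger output
tagged = [
    ("film", "NN"), ("looks", "VBZ"), ("visually", "RB"),
    ("stunning", "JJ"), ("the", "DT"), ("brothers", "NNS"),
]

counts = count_pos(tagged)
for word_class in ("noun", "verb", "adjective", "adverb"):
    print(f"{word_class}: {counts[word_class]}")
```

Because `Counter` objects can be added together, per-review counts could be summed across all rows instead of mutating module-level variables.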


In [21]:
# Importing the required libraries and modules
import spacy
import pandas as pd
from spacy import displacy
from nltk.tokenize import sent_tokenize

# Loading the English language model
nlp = spacy.load("en_core_web_sm")

# Loading my CSV file using Pandas and storing in a dataframe
df = pd.read_csv('movie_reviews_cleanedData.csv')

# Defining a function to analyze text
def analyze_text(text):
    # Split the text into sentences
    sentences = sent_tokenize(text)

    for sentence in sentences:
        doc = nlp(sentence)

        # Render the dependency parse tree (displaCy's "dep" style);
        # spaCy's core pipeline does not include a constituency parser
        displacy.render(doc, style="dep", jupyter=True, options={'compact': True})

        # Highlight the named entities found in the sentence (displaCy's "ent" style)
        displacy.render(doc, style="ent", jupyter=True)

# Processing each row in the DataFrame
#for index, row in df.iterrows():
    #cleaned_text = row['Review_Text']
    #analyze_text(cleaned_text)

first_row = df.loc[0]['Review_Text'] #taking only first row into consideration
analyze_text(first_row)
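
To make the difference between the two tree types concrete, the sketch below encodes both analyses of one short sentence as plain Python data. Everything in it is hand-annotated for illustration and is not output from spaCy or NLTK: the example sentence, the labels, and the helper names `bracketed` and `leaves` are assumptions. A constituency tree groups words into nested phrases (NP, VP), while a dependency tree links each word directly to its head word with a labeled relation.

```python
# Hand-written constituency tree: (label, child, child, ...); leaves are plain strings
constituency = (
    "S",
    ("NP", ("NNP", "Mario")),
    ("VP", ("VBZ", "saves"), ("NP", ("DT", "the"), ("NN", "kingdom"))),
)

# Hand-written dependency arcs: (dependent, relation, head); the root points to itself
dependencies = [
    ("saves", "ROOT", "saves"),
    ("Mario", "nsubj", "saves"),
    ("kingdom", "dobj", "saves"),
    ("the", "det", "kingdom"),
]

def bracketed(node):
    """Render a constituency tree in bracket notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"

def leaves(node):
    """Collect the words of a constituency tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

tree_text = bracketed(constituency)
print(tree_text)
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head}")
```

In the constituency tree, "the kingdom" forms a noun phrase inside the verb phrase. In the dependency arcs, "kingdom" attaches directly to the verb "saves" as its object and "the" attaches to "kingdom" as its determiner, with no phrase nodes in between.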



# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

The assignment questions were very useful, as they covered topics such as stopwords, stemming, lemmatization, and tokenization.
From the first question, I learned about the structure of the page, how to scrape the reviews properly, and how to send an HTTP request to the page.
From the second question, I learned how to clean and preprocess the data, along with normalization techniques and the NLTK library.
From the third question, I learned about text analysis: how to assign grammatical categories to words, and how to parse sentences by breaking them down and identifying the grammatical relationships between words.
Overall, I learned data cleaning, data preprocessing, text analysis, and NLP techniques. The time provided to complete the assignment was sufficient.