<a href="https://colab.research.google.com/github/Grishma5278/Info-5731/blob/main/Tallapareddy_Grishma_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

def get_soup(url):
    """Fetch a page and return it parsed, or None if the request fails."""
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'}
    try:
        # A timeout keeps a stalled connection from hanging the whole scrape
        data = requests.get(url, headers=headers, timeout=30)
        if data.status_code != 200:
            raise Exception(f"Failed to fetch data (HTTP {data.status_code})")
    except Exception as ex:
        print(f"Exception occurred: {ex}")
        return None
    soup = BeautifulSoup(data.text, "html.parser")
    return soup

def get_review_data(soup):
    final_reviews = []
    try:
        imdb_reviews = soup.find_all("div", class_="text show-more__control")
        final_reviews = [review.text for review in imdb_reviews]
    except Exception as ex:
        print(ex)
    return final_reviews

def web_scrape_imdb(url):
    final_reviews = []
    soup = get_soup(url)
    if not soup:
        return final_reviews

    final_reviews = get_review_data(soup)

    load_more = soup.select(".load-more-data")
    flag = False
    base_url = "https://www.imdb.com/"

    if load_more:
        ajaxurl = load_more[0]['data-ajaxurl']
        url = base_url + ajaxurl + "?ref_=undefined&paginationKey="
        try:
            key = load_more[0]['data-key']
            flag = True
        except KeyError:
            pass

    while flag:
        url_new = url + key
        soup = get_soup(url_new)
        if not soup:
            break

        reviews_new = get_review_data(soup)
        final_reviews.extend(reviews_new)
        load_more = soup.select(".load-more-data")
        # Stop when the "load more" element is absent or carries no pagination key
        key = load_more[0].get('data-key') if load_more else None
        if not key:
            flag = False
        time.sleep(1)  # Add a delay to avoid overwhelming the server

    return final_reviews

def save_to_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Review"])
        for review in data:
            writer.writerow([review])

# IMDb movie URL for 'The Godfather (1972)'
imdb_url = "https://www.imdb.com/title/tt0068646/reviews/"
final_reviews = web_scrape_imdb(imdb_url)

# Save reviews to CSV file
save_to_csv(final_reviews, "imdb_reviews.csv")
print("Reviews saved to imdb_reviews.csv")

print("Total reviews scraped:", len(final_reviews))
for idx, review in enumerate(final_reviews, start=1):
    print(f"Review {idx}: {review}\n")


Reviews saved to imdb_reviews.csv
Total reviews scraped: 5488
Review 1: 'The Godfather' is the pinnacle of flawless films! The first time I viewed 'The Godfather' I was in my early teens and it was the most astounding film I had ever seen, and has since then stood as my all-time favourite film. It is due to this that I have been looking forward to writing a review of this unforgettable classic. So let's start from the beginning. The film opens to four words, 'I believe in America', it's crazy to think that this simple line has become a resonant quote solely due to the impact it made on the entrance to the film's "threshold". This is just one of the many renowned quotes that litter the film, and believe me, there are a lot. After the first take we are then absorbed into the life of Vito Corleone, brilliantly portrayed by the Oscar- winning performance of Marlon Brando. Vito is a feared man, he is a criminal, he is a mafioso, but above all he is a respected family man, his three sons are

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
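
The `IOPub data rate exceeded` message above appears because the cell prints every full-length review at once. One possible way to avoid it is to print the total and only a short, truncated preview. This is a minimal sketch: the `preview_reviews` helper and its `limit` and `width` parameters are hypothetical names introduced here, and `sample_reviews` is placeholder data standing in for `final_reviews`.

```python
def preview_reviews(reviews, limit=3, width=80):
    """Return at most `limit` preview lines, each review cut to `width` characters."""
    lines = []
    for idx, review in enumerate(reviews[:limit], start=1):
        text = " ".join(review.split())  # collapse newlines and repeated spaces
        if len(text) > width:
            text = text[:width].rstrip() + "..."
        lines.append(f"Review {idx}: {text}")
    return lines


# Placeholder data standing in for the scraped `final_reviews` list
sample_reviews = [
    "A short review.",
    "word " * 50,
    "Another\nreview on two lines.",
    "This fourth review is not shown with limit=3.",
]

print("Total reviews scraped:", len(sample_reviews))
for line in preview_reviews(sample_reviews):
    print(line)
```

The full text is still available in `imdb_reviews.csv`, so nothing is lost by keeping the notebook output short.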



# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call to clean_text
STOP_WORDS = set(stopwords.words('english'))
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean and preprocess the text
def clean_text(text):
    # Remove punctuation and numbers
    text = ''.join(char for char in text if char not in string.punctuation and not char.isdigit())
    # Tokenize the text
    tokens = word_tokenize(text)
    # Lowercase and remove stopwords
    tokens = [word.lower() for word in tokens if word.lower() not in STOP_WORDS]
    # Stemming
    stemmed_tokens = [porter.stem(word) for word in tokens]
    # Lemmatization (applied to the stemmed tokens)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    # Join the cleaned tokens
    return ' '.join(lemmatized_tokens)

# Load the IMDb reviews into a DataFrame
data = pd.DataFrame({'Review': final_reviews})

# Apply the cleaning function to each review
data['Cleaned_Review'] = data['Review'].apply(clean_text)

# Save the DataFrame with cleaned data to a CSV file
data.to_csv('imdb_reviews_cleaned.csv', index=False)

print('Data cleaning and saving to CSV completed.')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Data cleaning and saving to CSV completed.
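
The question asks for output after each cleaning step, while the cell above applies every step in one pass and prints only a completion message. As a rough sketch, the first four steps can be run one at a time on a single sentence with the intermediate results printed. The `clean_steps` helper and the tiny `DEMO_STOPWORDS` set are illustrative assumptions (a real run would use NLTK's English stopword list), and stemming and lemmatization are left out so the example needs only the standard library.

```python
import re
import string

# Tiny illustrative stopword set; a stand-in for NLTK's stopwords.words('english')
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "in", "it", "was"}


def clean_steps(text):
    """Apply the first four cleaning steps and keep every intermediate result."""
    steps = {}
    # (1) Replace punctuation with spaces so hyphenated words stay separate
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    steps["1_no_punctuation"] = " ".join(text.translate(table).split())
    # (2) Remove digits
    no_digits = re.sub(r"\d+", " ", steps["1_no_punctuation"])
    steps["2_no_numbers"] = " ".join(no_digits.split())
    # (4) Lowercase before the stopword lookup so "The" matches "the"
    steps["4_lowercase"] = steps["2_no_numbers"].lower()
    # (3) Drop stopwords
    kept = [w for w in steps["4_lowercase"].split() if w not in DEMO_STOPWORDS]
    steps["3_no_stopwords"] = " ".join(kept)
    return steps


for name, value in clean_steps("The Godfather (1972) is an Oscar-winning film!").items():
    print(f"{name}: {value}")
```

Note one deliberate difference from the cell above: this sketch replaces punctuation with a space, whereas deleting it outright turns `Oscar-winning` into the single token `Oscarwinning`.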


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
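
Part (2) contrasts two views of sentence structure: a constituency tree groups words into nested phrases, while a dependency tree links each word directly to the word it depends on. The sketch below illustrates both for one short sentence. The parses are hand-annotated for illustration and are not the output of spaCy, NLTK, or any other parser; the sentence, the data structures, and the helper names are all assumptions made for this example.

```python
# Hand-annotated toy parses for one sentence; not produced by a parser.
SENTENCE = ["The", "film", "won", "three", "Oscars"]

# Constituency tree as nested tuples: (label, child, child, ...)
CONSTITUENCY = (
    "S",
    ("NP", ("DT", "The"), ("NN", "film")),
    ("VP", ("VBD", "won"), ("NP", ("CD", "three"), ("NNS", "Oscars"))),
)

# Dependency arcs as (dependent index, relation, head index); head -1 marks the root
DEPENDENCIES = [
    (0, "det", 1),
    (1, "nsubj", 2),
    (2, "ROOT", -1),
    (3, "nummod", 4),
    (4, "dobj", 2),
]


def bracketed(tree):
    """Render a nested-tuple constituency tree in bracket notation."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"


def leaves(tree):
    """Collect the words at the leaves of a constituency tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


def describe_arcs(tokens, arcs):
    """Describe each dependency arc as 'dependent <-relation- head'."""
    lines = []
    for dep, rel, head in arcs:
        head_word = "ROOT" if head == -1 else tokens[head]
        lines.append(f"{tokens[dep]} <-{rel}- {head_word}")
    return lines


print(bracketed(CONSTITUENCY))
for line in describe_arcs(SENTENCE, DEPENDENCIES):
    print(line)
```

In the constituency view, "The film" and "three Oscars" are noun phrases nested inside the sentence and the verb phrase. In the dependency view there are no phrase nodes: "won" is the root, with "film" as its subject and "Oscars" as its object.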

In [None]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from collections import Counter
import spacy

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Load the spaCy English language model
nlp = spacy.load('en_core_web_sm')

# Function to clean and preprocess the text
def clean_text(text):
    # Remove punctuation and numbers
    text = ''.join([char for char in text if char not in string.punctuation and not char.isdigit()])
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    # Stemming
    porter = PorterStemmer()
    stemmed_tokens = [porter.stem(word) for word in tokens]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    # Join the cleaned tokens
    cleaned_text = ' '.join(lemmatized_tokens)
    return cleaned_text

# Function to perform Parts of Speech (POS) tagging and count POS types
def tags_count(text, tag, model=nlp):
    '''This function returns the number of specific POS tags in an item'''
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    # Return number of specific POS tags
    return pos.count(tag)

# Function to perform Named Entity Recognition (NER) and count entities
def ner_analysis(text):
    doc = nlp(text)
    entity_counts = Counter(ent.label_ for ent in doc.ents)
    return entity_counts

# Load the cleaned reviews saved in Question 2 instead of scraping the site again
df = pd.read_csv('imdb_reviews_cleaned.csv')

# A review that is empty after cleaning is read back as NaN; treat it as an empty string
cleaned_text_list = df['Cleaned_Review'].fillna('').tolist()

# Perform syntax and structure analysis
# Parse each review once and tally all four tags from the same pass
pos_counts = Counter()
for doc in nlp.pipe(cleaned_text_list):
    pos_counts.update(token.pos_ for token in doc)

noun, verb = pos_counts["NOUN"], pos_counts["VERB"]
adj, adv = pos_counts["ADJ"], pos_counts["ADV"]

print(f"No of nouns: {noun}")
print(f"No of verbs: {verb}")
print(f"No of adjectives: {adj}")
print(f"No of adverbs: {adv}")

# Perform Named Entity Recognition (NER) and count entities
entity_counts_sample = ner_analysis(cleaned_text_list[0])
print("\nNamed Entity Recognition (Sample Sentence):", entity_counts_sample)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


No of nouns: 136844
No of verbs: 58105
No of adjectives: 47866
No of adverbs: 15757

Named Entity Recognition (Sample Sentence): Counter({'PERSON': 6, 'CARDINAL': 5, 'ORDINAL': 2, 'ORG': 2})
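
The output above counts entities for the first review only, whereas the question asks for counts over all the clean texts. A minimal sketch of one way to get corpus-wide totals is to merge the per-review counters. The `total_entity_counts` helper is a hypothetical name, and `example_counts` is placeholder data: its first entry mirrors the sample output above and its second is invented for illustration.

```python
from collections import Counter


def total_entity_counts(per_review_counts):
    """Merge an iterable of per-review Counters into one corpus-wide Counter."""
    totals = Counter()
    for counts in per_review_counts:
        totals.update(counts)  # update() adds counts instead of replacing them
    return totals


# Placeholder per-review results; in the notebook these would come from
# ner_analysis(text) for each text in cleaned_text_list
example_counts = [
    Counter({"PERSON": 6, "CARDINAL": 5, "ORDINAL": 2, "ORG": 2}),
    Counter({"PERSON": 1, "GPE": 3}),
]

totals = total_entity_counts(example_counts)
for label, count in totals.most_common():
    print(f"{label}: {count}")
```

In the notebook, the same helper could be called as `total_entity_counts(ner_analysis(t) for t in cleaned_text_list)`, since `ner_analysis` already returns a `Counter`.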


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion of the time provided to complete the assignment?

The assignment presented a comprehensive task involving various natural language processing (NLP) techniques, which I found both challenging and enjoyable. Implementing functionalities like POS tagging, constituency parsing, dependency parsing, and named entity recognition required a good understanding of NLP concepts and tools like NLTK and Stanford CoreNLP. Managing the flow of data from web scraping to text cleaning and then to syntactic analysis demanded careful planning and attention to detail. The diversity of tasks within the assignment made it intellectually stimulating, and I appreciated the opportunity to apply and deepen my knowledge of NLP. However, ensuring the accuracy and efficiency of each step, especially in handling large volumes of text data, added complexity to the assignment. Overall, I found the time allotted for the assignment adequate, although certain aspects, like debugging and refining the code, could have benefited from more time.

