<a href="https://colab.research.google.com/github/SatyaA-dev/SatyaAditya_INFO5731_Fall2024/blob/main/Masimukku_SatyaAditya_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [5]:
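import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

reviews = []
min_num_reviews = 1000
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
# Keep the base URL fixed. The original loop reassigned `url` on every pass,
# so each request appended another "reviews?..." suffix to the previous URL.
base_url = "https://www.imdb.com/title/tt15398776/reviews"
for start in range(0, min_num_reviews, 25):
  # The `start` offset is carried over from the original code; whether the
  # site honours it for pagination is an assumption.
  page_url = f"{base_url}?ref_=tt_ql_3&start={start}"
  time.sleep(2)  # Pause between requests to avoid hammering the server
  response = requests.get(page_url, headers=headers)
  soup = BeautifulSoup(response.text, 'html.parser')
  # Each review body sits in a <div class="text show-more__control">
  review_containers = soup.find_all('div', class_='text show-more__control')
  print(review_containers)
  for review in review_containers:
    reviews.append(review.text)
  if len(reviews) >= min_num_reviews:
    break

# Save the collected reviews to a CSV file with a single 'review' column
df = pd.DataFrame(reviews, columns=['review'])
df.to_csv('movie_reviews.csv', index=False)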

[<div class="text show-more__control">You'll have to have your wits about you and your brain fully switched on watching Oppenheimer as it could easily get away from a nonattentive viewer. This is intelligent filmmaking which shows it's audience great respect. It fires dialogue packed with information at a relentless pace and jumps to very different times in Oppenheimer's life continuously through it's 3 hour runtime. There are visual clues to guide the viewer through these times but again you'll have to get to grips with these quite quickly. This relentlessness helps to express the urgency with which the US attacked it's chase for the atomic bomb before Germany could do the same. An absolute career best performance from (the consistenly brilliant) Cillian Murphy anchors the film. This is a nailed on Oscar performance. In fact the whole cast are fantastic (apart maybe for the sometimes overwrought Emily Blunt performance). RDJ is also particularly brilliant in a return to proper acting 
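
The pagination used in the cell above can be sketched as a small helper that derives every page URL from one fixed base, so no request depends on the URL built for the previous page. This is only a minimal sketch: `build_page_urls` is a hypothetical name, and the `ref_` and `start` query parameters are copied from the cell above on the assumption that the site accepts them.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, total, page_size=25):
    """Return one URL per page, each derived from the same fixed base URL.

    Hypothetical helper: the `ref_` and `start` parameters mirror the cell
    above and are assumed, not confirmed, to be honoured by the site.
    """
    urls = []
    for start in range(0, total, page_size):
        query = urlencode({"ref_": "tt_ql_3", "start": start})
        urls.append(f"{base_url}?{query}")
    return urls


page_urls = build_page_urls("https://www.imdb.com/title/tt15398776/reviews", 1000)
print(len(page_urls))  # number of pages that would be requested
print(page_urls[1])    # URL of the second page
```

Building the list up front also makes it easy to inspect the URLs before sending any request.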

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [12]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize


# Load stopwords
stop_words = set(stopwords.words('english'))

# Initialize stemmer and lemmatizer
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean the text
def clean_text(text):
    # Step 1: Remove special characters and punctuations
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    print(f"After removing special characters and punctuations: {text}")

    # Step 2: Remove numbers
    text = re.sub(r'\d+', '', text)  # Remove numbers
    print(f"After removing numbers: {text}")

    # Steps 3 and 4: Lowercase the text, tokenize it, and remove stopwords
    tokens = word_tokenize(text.lower())  # Lowercase, then tokenize
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    print(f"After removing stopwords: {' '.join(tokens)}")

    # Step 5: Stemming (printed for comparison only; the stems are not saved)
    stemmed = [ps.stem(word) for word in tokens]
    print(f"After stemming: {' '.join(stemmed)}")

    # Step 6: Lemmatization, applied to the unstemmed tokens so that WordNet
    # can still look the words up
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    print(f"After lemmatization: {' '.join(lemmatized)}")

    # The lemmatized text is what goes into the new column
    return ' '.join(lemmatized)

# Load the previously scraped data from CSV
df = pd.read_csv('movie_reviews.csv')

# Create a new column for cleaned data
df['Cleaned_Review'] = df['review'].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv('Cleaned_movie_reviews.csv', index=False)

print("Cleaned data saved to 'Cleaned_movie_reviews.csv'")



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


After removing special characters and punctuations: Youll have to have your wits about you and your brain fully switched on watching Oppenheimer as it could easily get away from a nonattentive viewer This is intelligent filmmaking which shows its audience great respect It fires dialogue packed with information at a relentless pace and jumps to very different times in Oppenheimers life continuously through its 3 hour runtime There are visual clues to guide the viewer through these times but again youll have to get to grips with these quite quickly This relentlessness helps to express the urgency with which the US attacked its chase for the atomic bomb before Germany could do the same An absolute career best performance from the consistenly brilliant Cillian Murphy anchors the film This is a nailed on Oscar performance In fact the whole cast are fantastic apart maybe for the sometimes overwrought Emily Blunt performance RDJ is also particularly brilliant in a return to proper acting afte
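
Since output is required for each part, one option is to return the text after every stage instead of only printing it, so each stage can be stored or inspected later. The sketch below covers the regex-based stages only (noise removal, number removal, lowercasing, stopword removal). `clean_stages` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stopword set is an illustrative stand-in for NLTK's English list used in the notebook.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"a", "and", "is", "its", "the", "to"}


def clean_stages(text, stop_words=DEMO_STOP_WORDS):
    """Return the text after each cleaning stage instead of printing it."""
    no_punct = re.sub(r"[^\w\s]", "", text)       # drop punctuation and symbols
    no_numbers = re.sub(r"\d+", "", no_punct)     # drop digit runs
    lowered = no_numbers.lower()
    tokens = [word for word in lowered.split() if word not in stop_words]
    return {
        "no_punct": no_punct,
        # split/join collapses the double spaces left behind by removed digits
        "no_numbers": " ".join(no_numbers.split()),
        "lowercase": " ".join(lowered.split()),
        "no_stopwords": " ".join(tokens),
    }


stages = clean_stages("It's 3 hours long, and GREAT!")
for name, value in stages.items():
    print(f"{name}: {value}")
```

Each key of the returned dictionary could then become its own DataFrame column, which keeps the per-step results in the CSV rather than in the cell output.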

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
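
The totals asked for in part (1) amount to grouping fine-grained Penn Treebank tags into four coarse categories. A minimal sketch of that grouping, assuming the tokens have already been tagged; `count_coarse_pos` and `COARSE_PREFIXES` are hypothetical names, and the example tags are hand-written for illustration rather than produced by a tagger here.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four coarse categories in part (1):
# NN* = nouns, VB* = verbs, JJ* = adjectives, RB* = adverbs.
COARSE_PREFIXES = {"Noun": "NN", "Verb": "VB", "Adjective": "JJ", "Adverb": "RB"}


def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        for category, prefix in COARSE_PREFIXES.items():
            if tag.startswith(prefix):
                counts[category] += 1
                break  # a tag belongs to at most one category
    return counts


# Hand-written tags for illustration; in practice they would come from a tagger.
example = [("This", "DT"), ("product", "NN"), ("is", "VBZ"), ("amazing", "JJ"),
           (",", ","), ("highly", "RB"), ("recommended", "VBN"), (".", ".")]
counts = count_coarse_pos(example)
for category in COARSE_PREFIXES:
    print(category, counts[category])
```

Matching on prefixes covers the same tags as listing them one by one (for example `NN`, `NNS`, `NNP`, `NNPS`), while tags such as `WRB` are left out because they do not start with `RB`.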

In [14]:
# Your code here
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from spacy import displacy

# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Download the NLTK tokenizer and POS tagger models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load cleaned data from CSV file
df = pd.read_csv('Cleaned_movie_reviews.csv')

# Join all the cleaned reviews into a single string
text = ' '.join(df['Cleaned_Review'].tolist())

# (1) Parts of Speech (POS) Tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)

    # Count Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB)
    pos_counts = Counter(tag for word, tag in tagged)
    noun_count = sum(pos_counts[tag] for tag in ['NN', 'NNS', 'NNP', 'NNPS'])
    verb_count = sum(pos_counts[tag] for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    adj_count = sum(pos_counts[tag] for tag in ['JJ', 'JJR', 'JJS'])
    adv_count = sum(pos_counts[tag] for tag in ['RB', 'RBR', 'RBS'])

    return noun_count, verb_count, adj_count, adv_count

noun_count, verb_count, adj_count, adv_count = pos_tagging(text)

print(f"Nouns: {noun_count}, Verbs: {verb_count}, Adjectives: {adj_count}, Adverbs: {adv_count}")

# (2) Constituency Parsing and Dependency Parsing
def constituency_and_dependency_parsing(text):
    doc = nlp(text)

    # Display the dependency tree inline in the notebook. With jupyter=False,
    # displacy.render only returns the markup as a string, which was discarded.
    print("Dependency Parsing:")
    displacy.render(doc, style='dep', jupyter=True)

    # Approximate constituency parsing with NLTK's RegexpParser. This is shallow
    # chunking (noun phrases only), not a full constituency parse.
    grammar = "NP: {<DT>?<JJ>*<NN>}"  # A simple grammar for NP (Noun Phrases)
    cp = nltk.RegexpParser(grammar)  # Build the chunker once, outside the loop
    for sent in nltk.sent_tokenize(text):
        words = nltk.word_tokenize(sent)
        tagged = nltk.pos_tag(words)
        tree = cp.parse(tagged)
        print("Constituency Parsing:")
        tree.pretty_print()  # Print the chunk tree

constituency_and_dependency_parsing("This product is amazing, highly recommended.")

# (3) Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # Count entities by type
    entity_count = Counter(ent.label_ for ent in doc.ents)

    return entities, entity_count

entities, entity_count = named_entity_recognition(text)

print(f"Entities extracted: {entities}")
print(f"Entity count by type: {entity_count}")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Nouns: 29761, Verbs: 13232, Adjectives: 14767, Adverbs: 6944
Dependency Parsing:
Constituency Parsing:
                          S                                               
   _______________________|__________________________________              
  |        |       |      |            |         |           NP           
  |        |       |      |            |         |      _____|______       
is/VBZ amazing/JJ ,/, highly/RB recommended/VBN ./. This/DT     product/NN

Entities extracted: [('germany', 'GPE'), ('cillian murphy', 'PERSON'), ('decade', 'DATE'), ('one', 'CARDINAL'), ('two three hour', 'TIME'), ('christopher', 'PERSON'), ('dunkirk', 'ORG'), ('second', 'ORDINAL'), ('one', 'CARDINAL'), ('one year', 'DATE'), ('one', 'CARDINAL'), ('one', 'CARDINAL'), ('third', 'ORDINAL'), ('wayim', 'PERSON'), ('day', 'DATE'), ('cillian', 'NORP'), ('robert downey jr', 'PERSON'), ('wwii', 'EVENT'), ('christopher nolan dark knight trilogy', 'PERSON'), ('american prometheus kai', 'ORG'), ('m
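
Part (3) asks for the count of each entity, which can be read as counting every entity string within its label and not just the number of entities per label. A minimal sketch of that grouping, working on `(text, label)` pairs like those printed above; `summarize_entities` is a hypothetical name, and the sample pairs are copied from the output.

```python
from collections import Counter, defaultdict


def summarize_entities(entities):
    """Group (text, label) pairs into per-label counters of entity strings."""
    by_label = defaultdict(Counter)
    for text, label in entities:
        by_label[label][text] += 1
    return by_label


# A few pairs copied from the output above.
sample = [("germany", "GPE"), ("cillian murphy", "PERSON"), ("one", "CARDINAL"),
          ("christopher", "PERSON"), ("one", "CARDINAL"),
          ("robert downey jr", "PERSON")]
summary = summarize_entities(sample)
for label, counter in summary.items():
    # total entities for the label, then its most frequent entity strings
    print(label, sum(counter.values()), counter.most_common(2))
```

Because the cleaned text is lowercased and stripped of punctuation, some labels in the output look unreliable (for example `'cillian'` tagged as `NORP`); listing entity strings per label makes such cases easier to spot.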

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
# Collecting the data and cleaning it was the biggest challenge of the assignment. It was challenging because the topic is relatively new to me.