<a href="https://colab.research.google.com/github/Likhi2001/Likhitha_Jarugula_INFO5731_SPRING2024/blob/main/Jarugula__Likhitha_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException

In [None]:
import google_colab_selenium as gs
from selenium.webdriver.chrome.options import Options

# Instantiate options
options = Options()

# Add extra options
options.add_argument("--window-size=1920,1080")  # Set the window size
options.add_argument("--disable-infobars")  # Disable the infobars
options.add_argument("--disable-popup-blocking")  # Disable pop-ups
options.add_argument("--ignore-certificate-errors")  # Ignore certificate errors
options.add_argument("--incognito")  # Use Chrome in incognito mode


driver = gs.Chrome(options=options)
driver.get('https://www.imdb.com/title/tt15398776/reviews?sort=userRating&dir=desc&ratingFilter=0')

In [None]:
from bs4 import BeautifulSoup

# Click the "Load More" button up to 41 times so that enough reviews are on the page
count = 0
reviews = []
try:
    while count <= 40:
        # Wait for the button to become clickable, then click it
        load_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "load-more-trigger"))
        )
        load_more_button.click()
        count += 1
except Exception as e:
    # A timeout here usually means the button is gone, i.e. there is nothing more to load
    print("All content loaded or an error occurred:", e)

# Parse the loaded page with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
review_containers = soup.find_all('div', class_='review-container')

# Extract the title, rating, and body from each review container
for review in review_containers:
    title_tag = review.find('a', class_='title')
    title = title_tag.text.strip() if title_tag else "No Title"

    # Read the rating from this review's own container (e.g. "10/10");
    # looking it up through the driver would return the first rating on the page every time
    rating_tag = review.find('span', class_='rating-other-user-rating')
    rating = rating_tag.get_text(strip=True) if rating_tag else "No Rating"

    body_tag = review.find('div', class_='text show-more__control')
    body = body_tag.text.strip() if body_tag else "No Body"

    reviews.append({'Title': title, 'Rating': rating, 'Body': body})

# Close the browser once scraping is finished
driver.quit()

In [None]:
reviews = reviews[:1000]

In [None]:
import csv

In [None]:
csv_file_path = "movie_reviews.csv"

# Writing data to CSV
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['Rating', 'Title', 'Body'])
    writer.writeheader()  # Write the header row

    # Write each review as a row in the CSV file
    for review in reviews:
        writer.writerow(review)

# Inform the user about the successful creation of the file
print(f"Data successfully written to {csv_file_path}")

Data successfully written to movie_reviews.csv
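
The review dictionaries are built with their keys in the order Title, Rating, Body, while the CSV header is Rating, Title, Body. The following minimal sketch suggests why this is harmless: `csv.DictWriter` places each value by key name according to `fieldnames`. The sample row and the in-memory buffer are illustrative assumptions, not scraped data.

```python
import csv
import io

# Hypothetical sample row (not scraped data); key order differs from the header order
sample_reviews = [{'Title': 'Sample title', 'Rating': '10/10', 'Body': 'Sample body, with a comma'}]

buffer = io.StringIO()  # in-memory stand-in for movie_reviews.csv
writer = csv.DictWriter(buffer, fieldnames=['Rating', 'Title', 'Body'])
writer.writeheader()
writer.writerows(sample_reviews)

# Read the data back to confirm that columns and values line up
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(buffer.getvalue())
print(rows[0]['Body'])
```

The comma inside the body survives the round trip because the writer quotes that field automatically.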


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.
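
As a rough illustration of how steps (1) to (4) can be chained, here is a minimal standard-library sketch. The function name `basic_clean` and the tiny stopword set are assumptions made for the example; the notebook itself uses NLTK's English stopword list, and stemming and lemmatization are left out here because they depend on NLTK. The sketch lowercases before filtering stopwords, so capitalized stopwords such as "The" are dropped as well.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list
SAMPLE_STOPWORDS = {"the", "a", "is", "it", "of", "and", "in"}

def basic_clean(text, stop_words=SAMPLE_STOPWORDS):
    """Apply steps (1)-(4): strip noise and numbers, lowercase, drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # (1) remove punctuation and special characters
    text = re.sub(r"\d+", "", text)             # (2) remove numbers
    text = text.lower()                         # (4) lowercase before stopword matching
    tokens = [tok for tok in text.split() if tok not in stop_words]  # (3) remove stopwords
    return " ".join(tokens)

cleaned = basic_clean("The film is 3 hours long, and it's GREAT!")
print(cleaned)
```

With pandas, a function like this could be applied column-wise, for example `df['Body'].apply(basic_clean)`.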

In [None]:
!pip install pandas nltk
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')


#Step 1: Remove Noise (Special Characters and Punctuation)

In [None]:
import pandas as pd
import re

# Load your dataset
df = pd.read_csv('movie_reviews.csv')

In [None]:
# Function to remove noise: keep only ASCII letters and whitespace
# (digits are dropped by this pattern as well)
def remove_noise(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Applying the function
df['Clean_Body'] = df['Body'].apply(remove_noise)
print(df)


    Rating                                            Title  \
0    10/10                              Everything you need   
1    10/10    An Authentic Masterpiece of Human Exploration   
2    10/10              It's clever but it also has a heart   
3    10/10                    This is a cinema masterpiece.   
4    10/10                             Alluring Jaw Dropper   
..     ...                                              ...   
995  10/10                                      PEAK CINEMA   
996  10/10  The most important film of the current century.   
997  10/10                  The best movie I have ever seen   
998  10/10                                       Nolanesque   
999  10/10                                    AMAZING MOVIE   

                                                  Body  \
0    It's just what you think when going to watch a...   
1    Christopher Nolan has once again delivered a c...   
2    Nolan strangely moves from petentious films, b...   
3    The st

#Step 2: Remove Numbers

In [None]:
# Function to remove numbers
def remove_numbers(text):
    text = re.sub(r'\d+', '', text)  # Remove numbers
    return text

# Applying the function
df['Clean_Body'] = df['Clean_Body'].apply(remove_numbers)
print(df)


#Step 3: Remove Stopwords

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load English stopwords
stop_words = set(stopwords.words('english'))

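# Function to remove stopwords
# Note: NLTK's stopword list is lowercase and this comparison is case-sensitive,
# so capitalized stopwords (e.g. "The") pass through; lowercasing happens in Step 4
def remove_stopwords(text):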
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return ' '.join(filtered_text)

# Applying the function
df['Clean_Body'] = df['Clean_Body'].apply(remove_stopwords)
print(df)


#Step 4: Lowercase All Texts

In [None]:
# Function to lowercase all texts
def lowercase_text(text):
    return text.lower()

# Applying the function
df['Clean_Body'] = df['Clean_Body'].apply(lowercase_text)
print(df)


#Step 5: Stemming

In [None]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Function for stemming
def stem_text(text):
    word_tokens = word_tokenize(text)
    stemmed_text = [stemmer.stem(word) for word in word_tokens]
    return ' '.join(stemmed_text)

# Applying the function
df['Clean_Body'] = df['Clean_Body'].apply(stem_text)
print(df)


#Step 6: Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function for lemmatization
def lemmatize_text(text):
    word_tokens = word_tokenize(text)
    lemmatized_text = [lemmatizer.lemmatize(word) for word in word_tokens]
    return ' '.join(lemmatized_text)

# Applying the function
df['Clean_Body'] = df['Clean_Body'].apply(lemmatize_text)
print(df)


In [None]:
df.to_csv('clean_movie_reviews.csv', index=False)

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

#Parts of Speech (POS) Tagging:

In [None]:
import spacy
from collections import Counter

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Load the cleaned text data
df = pd.read_csv('clean_movie_reviews.csv')

# Function to tag POS and count Nouns, Verbs, Adjectives, Adverbs
def pos_tagging(text):
    doc = nlp(text)
    pos_counts = Counter(token.pos_ for token in doc)
    return pos_counts

# Apply POS tagging to the first cleaned review for demonstration
pos_counts_example = pos_tagging(df['Clean_Body'][0])
print(pos_counts_example)


Counter({'NOUN': 23, 'VERB': 15, 'PROPN': 15, 'ADJ': 10, 'PRON': 5, 'ADV': 3, 'DET': 1, 'SCONJ': 1, 'NUM': 1, 'ADP': 1})
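
The question asks for totals of nouns, verbs, adjectives, and adverbs. One possible way to reduce per-review tag counts to those four totals is sketched below. The helper `summarize_pos` and the mapping `CATEGORY_OF` are hypothetical names, and folding `PROPN` into the noun total is an assumption; the sample input is the Counter printed above for the first review.

```python
from collections import Counter

# Map spaCy coarse POS tags to the four requested categories.
# Folding PROPN into nouns is an assumption; remove it to count common nouns only.
CATEGORY_OF = {"NOUN": "Noun", "PROPN": "Noun", "VERB": "Verb",
               "ADJ": "Adjective", "ADV": "Adverb"}

def summarize_pos(per_review_counts):
    """Sum per-review POS Counters into totals for Noun, Verb, Adjective, Adverb."""
    totals = Counter()
    for counts in per_review_counts:
        for tag, n in counts.items():
            if tag in CATEGORY_OF:
                totals[CATEGORY_OF[tag]] += n
    return totals

# The Counter printed above for the first review, used here as sample input
first_review = Counter({'NOUN': 23, 'VERB': 15, 'PROPN': 15, 'ADJ': 10, 'PRON': 5,
                        'ADV': 3, 'DET': 1, 'SCONJ': 1, 'NUM': 1, 'ADP': 1})
totals = summarize_pos([first_review])
print(dict(totals))
```

To cover the whole dataset, the input could be the notebook's own function applied to every review, for example `summarize_pos(pos_tagging(t) for t in df['Clean_Body'])`.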


#Dependency Parsing with spaCy

In [None]:
import pandas as pd
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Load the cleaned text data
df = pd.read_csv('clean_movie_reviews.csv')

# Select a sentence for demonstration
example_sentence = df['Clean_Body'][0]

# Dependency Parsing with spaCy
doc = nlp(example_sentence)
for token in doc:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")


it --> nsubj --> think
think --> ROOT --> think
go --> xcomp --> think
watch --> xcomp --> go
biographi --> amod --> movi
movi --> nsubj --> done
everyth --> advmod --> done
done --> ccomp --> watch
amazingli --> dobj --> done
the --> det --> nolan
use --> compound --> nolan
soundtrack --> compound --> cinematographi
cinematographi --> compound --> spot
spot --> compound --> christoph
christoph --> compound --> nolan
nolan --> nsubj --> want
want --> ccomp --> think
make --> xcomp --> want
simpl --> dobj --> make
that --> nsubj --> beauti
beauti --> ccomp --> make
although --> mark --> matter
first --> amod --> lot
two --> nummod --> hour
hour --> nmod --> lot
slow --> amod --> lot
lot --> compound --> matter
matter --> advcl --> think
shouldv --> nsubj --> explain
explain --> ccomp --> matter
audienc --> amod --> howev
howev --> dobj --> explain
last --> amod --> hour
hour --> npadvmod --> explain
movi --> nsubj --> move
move --> ccomp --> think
fast --> amod --> realli
realli --> nsu

#Constituency Parsing

In [None]:
# spaCy's en_core_web_sm pipeline provides a dependency parser but no constituency parser,
# so this cell prints the dependency relations of the first sentence of each review
for index, row in df.head(5).iterrows():  # limit the output to the first 5 reviews
    doc = nlp(row['Clean_Body'])
    for sent in doc.sents:
        print(f"Sentence: {sent.text}")
        for token in sent:
            print(f'{token.text:{12}} {token.dep_:{10}} {token.head.text:{12}} {token.head.pos_:{10}}')
        # Stop after the first sentence to keep the output concise
        break


Sentence: it think go watch biographi movi everyth done amazingli the use soundtrack cinematographi spot christoph nolan want make simpl that beauti although first two hour slow lot matter shouldv explain audienc howev last hour movi move fast realli feel anyway
it           nsubj      think        VERB      
think        ROOT       think        VERB      
go           xcomp      think        VERB      
watch        xcomp      go           VERB      
biographi    amod       movi         NOUN      
movi         nsubj      done         VERB      
everyth      advmod     done         VERB      
done         ccomp      watch        VERB      
amazingli    dobj       done         VERB      
the          det        nolan        PROPN     
use          compound   nolan        PROPN     
soundtrack   compound   cinematographi PROPN     
cinematographi compound   spot         PROPN     
spot         compound   christoph    PROPN     
christoph    compound   nolan        PROPN     
nolan        

As an example, let's look at the text "The quick brown fox jumps over the lazy dog" to illustrate dependency parsing trees and constituency parsing.

### Explanation of Constituency Parsing Tree

In a constituency parsing tree, the root node (S) represents the entire sentence, and the tree breaks the sentence down into its constituent phrases:

- "The quick brown fox" is a noun phrase (NP): the determiner "the" (DT) and the adjectives "quick" and "brown" (JJ) modify the noun "fox" (NN).
- "jumps over the lazy dog" is a verb phrase (VP) headed by the verb "jumps" (V).
  - The preposition "over" (P) introduces a prepositional phrase that contains another noun phrase.
  - "the lazy dog" is that noun phrase (NP): "the" is a determiner (DT), "lazy" is an adjective (JJ), and "dog" is the noun (NN) they modify.

The constituency parsing tree thus highlights the sentence's phrases and subphrases and shows how the parts of speech are arranged hierarchically.
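
The structure described above can be written down as nested tuples and printed with indentation. This is a hand-built sketch, not parser output: the bracketing is an assumption, and the `PP` node for the prepositional phrase headed by "over" is a standard label added here for completeness.

```python
# Hand-written constituency tree as nested tuples: (label, child, child, ...).
# A leaf is (POS tag, word). The bracketing is an illustrative assumption, not parser output.
TREE = ("S",
        ("NP", ("DT", "The"), ("JJ", "quick"), ("JJ", "brown"), ("NN", "fox")),
        ("VP", ("V", "jumps"),
         ("PP", ("P", "over"),
          ("NP", ("DT", "the"), ("JJ", "lazy"), ("NN", "dog")))))

def is_leaf(node):
    """A leaf pairs a POS tag with a single word string."""
    return len(node) == 2 and isinstance(node[1], str)

def leaves(node):
    """Return the words of the tree in sentence order."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]

def show(node, depth=0):
    """Print one node per line, indented by its depth in the tree."""
    if is_leaf(node):
        print("  " * depth + f"{node[0]} -> {node[1]}")
    else:
        print("  " * depth + node[0])
        for child in node[1:]:
            show(child, depth + 1)

show(TREE)
print(" ".join(leaves(TREE)))
```

Reading the leaves from left to right recovers the original sentence, which is a quick check that the tree covers every word exactly once.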

### Explaining Dependency Parsing Trees

In a dependency parsing tree for the same sentence, each word is linked to its "head" word, and the arrows indicate the direction of each dependency. The verb "jumps" is the root of the sentence because it expresses the main action. Every other word has a function relative to its head:

- "The," "quick," and "brown" modify "fox," adding detail to the noun.
- "Fox" is the subject of "jumps," indicating who performs the action.
- "Over" is a preposition attached to "jumps"; it takes "dog" as its object and indicates the direction of the action.
- "The" and "lazy" modify "dog," giving additional information about the object of the preposition.

The dependency parsing tree does not represent hierarchical phrase structure; it concentrates on the relationships between individual words and shows how each word relates to the others to convey the meaning of the sentence.
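
The same relations can be listed as (word, relation, head) triples and printed as a tree rooted at the main verb. The arcs below are assigned by hand for illustration; the relation names follow the style of the spaCy output shown earlier (`det`, `amod`, `nsubj`, and so on), but they are an assumption rather than actual parser output.

```python
# Hand-assigned dependency arcs: (word, relation, head).
# The labels are an illustrative assumption, not parser output.
ARCS = [
    ("The", "det", "fox"),
    ("quick", "amod", "fox"),
    ("brown", "amod", "fox"),
    ("fox", "nsubj", "jumps"),
    ("jumps", "ROOT", "jumps"),
    ("over", "prep", "jumps"),
    ("the", "det", "dog"),
    ("lazy", "amod", "dog"),
    ("dog", "pobj", "over"),
]

def children_of(head):
    """Return (word, relation) pairs for all words attached to the given head."""
    return [(word, rel) for word, rel, h in ARCS if h == head and word != head]

def show(word, relation="ROOT", depth=0):
    """Print a word, then its dependents indented one level deeper."""
    print("  " * depth + f"{word} ({relation})")
    for child, rel in children_of(word):
        show(child, rel, depth + 1)

root = next(word for word, rel, _ in ARCS if rel == "ROOT")
show(root)
```

Unlike the constituency tree, every node here is a word of the sentence, and there are no phrase-level nodes such as NP or VP.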


#Named Entity Recognition (NER)

In [None]:
import pandas as pd
import spacy
from collections import Counter

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the cleaned text data
df = pd.read_csv('clean_movie_reviews.csv')

# Initialize counters for entities and entity labels
total_entities = Counter()
total_entity_labels = Counter()

# Function to process a single document and update global counters
def process_doc(doc):
    entities = [ent.text for ent in doc.ents]
    entity_labels = [ent.label_ for ent in doc.ents]
    total_entities.update(entities)
    total_entity_labels.update(entity_labels)

# Iterate over the cleaned text, processing each one
for review in nlp.pipe(df['Clean_Body']):
    process_doc(review)

# Display the most common entities and entity labels
print("Most Common Entities:")
for entity, count in total_entities.most_common(10):
    print(f"{entity}: {count}")

print("\nMost Common Entity Labels:")
for label, count in total_entity_labels.most_common():
    print(f"{label}: {count}")


Most Common Entities:
oppenheim: 361
one: 303
cillian: 293
first: 129
robert oppenheim: 88
second: 76
hour: 55
manhattan: 54
three hour: 48
two: 46

Most Common Entity Labels:
PERSON: 1536
ORG: 611
GPE: 514
NORP: 493
CARDINAL: 438
ORDINAL: 221
DATE: 155
TIME: 151
FAC: 40
EVENT: 36
PRODUCT: 34
WORK_OF_ART: 10
LOC: 10
QUANTITY: 2
LAW: 1
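
The question names five entity categories (person names, organizations, locations, product names, and dates), while spaCy reports finer-grained labels. A minimal sketch of one way to roll the label counts above into those categories follows. The mapping `CATEGORY_LABELS` is an assumption; in particular, treating `GPE`, `LOC`, and `FAC` together as locations is a judgment call.

```python
# Entity-label counts copied from the output above
label_counts = {"PERSON": 1536, "ORG": 611, "GPE": 514, "NORP": 493, "CARDINAL": 438,
                "ORDINAL": 221, "DATE": 155, "TIME": 151, "FAC": 40, "EVENT": 36,
                "PRODUCT": 34, "WORK_OF_ART": 10, "LOC": 10, "QUANTITY": 2, "LAW": 1}

# Assumed mapping from the requested categories to spaCy labels;
# grouping GPE, LOC, and FAC as "location" is a judgment call.
CATEGORY_LABELS = {
    "person": ["PERSON"],
    "organization": ["ORG"],
    "location": ["GPE", "LOC", "FAC"],
    "product": ["PRODUCT"],
    "date": ["DATE"],
}

category_counts = {
    category: sum(label_counts.get(label, 0) for label in labels)
    for category, labels in CATEGORY_LABELS.items()
}

for category, count in category_counts.items():
    print(f"{category}: {count}")
```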


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
This assignment required writing a Selenium script that repeatedly loads more reviews and scrapes review data, including titles, ratings, and body text, from an IMDb reviews page. Handling the dynamically loaded content was the difficult part, because the site's structure can change over time. It took some thought and testing to make sure the script could reliably find and extract the needed data as more reviews were loaded. Handling potential failures and exceptions, such as buttons that cannot be clicked or elements that cannot be located, made the implementation more complex.

All things considered, I enjoyed the work, especially working out how to use Selenium effectively to scrape data from IMDb. It gave me a chance to apply my problem-solving skills and to use Selenium's ability to interact with page elements dynamically. I also valued the chance to improve the script's reliability and performance so that it handles a range of situations and edge cases.

The time needed to finish the assignment depends mainly on its difficulty and on how familiar one is with Selenium and web scraping. Someone familiar with these tools could probably finish it in a few hours or a day. It may take longer if the project calls for more functionality or extensive testing, or if the person working on it is new to these technologies. Overall, the time provided should allow for sufficient exploration, implementation, and refinement of the script to meet the requirements.