<a href="https://colab.research.google.com/github/MohanaSrinitha/Mohana_INF05731_Spring2024/blob/main/Shaga_Mohana_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [23]:
!pip install playwright



In [24]:
!playwright install

In [25]:
from playwright.async_api import async_playwright

In [32]:
import csv
from playwright.async_api import async_playwright


movie_id = 'tt15398776'  # Oppenheimer
url = f'https://www.imdb.com/title/{movie_id}/reviews'


reviews = []
async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True)
    page = await browser.new_page()
    await page.goto(url)

    page_counter = 0
    while page_counter < 43:
        load_more_button = await page.query_selector('.ipl-load-more__button')
        if load_more_button:
            await load_more_button.click()
            await page.wait_for_load_state('networkidle', timeout=60000)
            page_counter += 1
        else:
            break

    review_elements = await page.query_selector_all('.text.show-more__control')
    for review_element in review_elements:
        reviews.append(await review_element.inner_text())

    await browser.close()

    # Save reviews to a CSV file'
    csv_file='imdb_reviews.csv'
    with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(['Review'])
        for review in reviews:
            csvwriter.writerow([review])
    print(f"{len(review)} reviews have been saved to {csv_file}.")



1900 reviews have been saved to imdb_reviews.csv.


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [33]:
import csv
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the IMDb reviews from the CSV file
reviews = []
with open('imdb_reviews.csv', 'r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)
    for row in csvreader:
        reviews.append(row[0])

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove punctuation and numbers
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = ''.join([i for i in text if not i.isdigit()])

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove english stopwords
    stop_words = set(stopwords.words('english'))

    # lowercase all tokens
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words]

    # perform stemming
    stemmed_tokens = [stemmer.stem(word) for word in tokens]

    # Lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join the tokens back into a single string
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text

# Clean the reviews and save them in a new column in the CSV file
cleaned_reviews = [clean_text(review) for review in reviews]

with open('imdb_reviews.csv', 'w', newline='', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['Review', 'Cleaned Review'])
    for i in range(len(reviews)):
        csvwriter.writerow([reviews[i], cleaned_reviews[i]])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [35]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

# Load the cleaned reviews from the CSV file
cleaned_reviews = []
with open('imdb_reviews.csv', 'r', newline='', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)  # Skip header
    for row in csvreader:
        cleaned_reviews.append(row[1])

# Tokenize and POS tag each review
total_nouns = 0
total_verbs = 0
total_adjectives = 0
total_adverbs = 0
for review in cleaned_reviews:
    tokens = word_tokenize(review)
    tagged = pos_tag(tokens)
    for _, tag in tagged:
        if tag.startswith('N'):  # Noun
            total_nouns += 1
        elif tag.startswith('V'):  # Verb
            total_verbs += 1
        elif tag.startswith('J'):  # Adjective
            total_adjectives += 1
        elif tag.startswith('R'):  # Adverb
            total_adverbs += 1

# Print the total counts of each POS
print(f"Total Nouns: {total_nouns}")
print(f"Total Verbs: {total_verbs}")
print(f"Total Adjectives: {total_adjectives}")
print(f"Total Adverbs: {total_adverbs}")


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Total Nouns: 70349
Total Verbs: 29051
Total Adjectives: 34540
Total Adverbs: 14052


In [36]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

sentences = text.split('\n')

# Process each sentence and print out the constituency and dependency parsing trees
for sentence in sentences:
    doc = nlp(sentence)
    print("Sentence:", sentence)
    print("Constituency parsing tree:", [token for token in doc])
    print("Dependency parsing tree:")
    for token in doc:
        print(token.text, token.dep_, token.head.text, token.head.pos_,
              [child for child in token.children])
    print("\n")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
who nsubj were AUX []
were relcl Soviets PROPN [who, allies]
allies attr were AUX [to, in]
to prep allies NOUN [Americans]
the det Americans PROPN []
Americans pobj to ADP [the]
in prep allies NOUN [WW2]
WW2 pobj in ADP []
... punct emphasize VERB []
) punct emphasize VERB []
. punct emphasize VERB []


Sentence: 
Constituency parsing tree: []
Dependency parsing tree:


Sentence: I feel sad that this will be an award winning movie and that most people in the audience will say (or feel they have to say) ""A! Oh!! Great movie!"", because the director, writers, cast are famous and the topic is important - all these superlatives about a somewhat boring movie, decorated with some nudity (not clear either why this was needed and related to the topic) and overall pointless movie... Or, at least, much more pointless than it should be!",watch oppenheimer leave movie theater ask point unfortunately answer clear strongly beleive mov

In [38]:
import spacy
from collections import Counter

# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")

# Load the clean text from the CSV file
with open('imdb_reviews.csv', 'r') as file:
    text = file.read()

# Split the text into smaller chunks
chunk_size = 100000
chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Process each chunk with spaCy
entities = Counter()
for chunk in chunks:
    doc = nlp(chunk)

    # Named Entity Recognition
    entities.update([(ent.text, ent.label_) for ent in doc.ents])

# Print Named Entities
print("Named Entities:")
for entity, count in entities.items():
    print(f"{entity[0]} ({entity[1]}): {count}")


Named Entities:
Cleaned Review (ORG): 1
One (CARDINAL): 81
the year (DATE): 47
Oppenheimer (ORG): 2047
two (CARDINAL): 341
three hours (TIME): 65
the other hour (TIME): 1
Christopher Nolan's (PERSON): 126
Dunkirk (ORG): 60
second (ORDINAL): 263
one (CARDINAL): 1106
year (DATE): 80
Cillian Murphy (PERSON): 336
the final hour (TIME): 12
third (ORDINAL): 115
2.5 hours (TIME): 4
3 (CARDINAL): 59
Babylon (PERSON): 1
two three hour (TIME): 1
christopher (PERSON): 80
dunkirk (ORG): 8
one year (DATE): 4
3 hour (TIME): 58
US (GPE): 32
Germany (GPE): 27
Oscar (PERSON): 148
Emily Blunt (PERSON): 235
RDJ (PERSON): 47
decade (DATE): 18
Bible (WORK_OF_ART): 2
Florence Pugh (ORG): 34
germany (GPE): 23
cillian murphy (PERSON): 77
Nolan (ORG): 1239
the days (DATE): 7
day (DATE): 20
Oppenheimer (WORK_OF_ART): 189
Christopher Nolan (PERSON): 394
"The Dark Knight (WORK_OF_ART): 4
Interstellar (WORK_OF_ART): 9
American Prometheus (WORK_OF_ART): 12
Kai Bird (PERSON): 22
Martin J. Sherwin (PERSON): 10
Starri

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

The assignment was quite interesting and provided a good opportunity to apply various concepts and tools in natural language processing and text analysis from webscraping to entity recognition. One of the challenges I found was managing the large text input for processing, especially when dealing with memory constraints in spaCy. Splitting the text into smaller chunks helped to overcome this challenge, but it required careful handling to ensure that the text was processed accurately. Anther one was running asycio functions within google Colab. With this project I understood that you can't run another asyncio program within another and that the colab runtime already runs in asyncio process thus a loop. And you can't have another loop within it. What I enjoyed the most was the data collection (scraping with playwright) and the data cleansing part in preparation for the analysis.


