# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [5]:
# Your code here

import requests
from bs4 import BeautifulSoup
import csv

def scrape_imdb_movie_reviews(movie_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
    response = requests.get(movie_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    reviews = []
    review_count = 0

    # IMDb loads further reviews through a "load more" control. Rather than
    # clicking it, follow the pagination key it exposes until 1000 reviews are
    # collected or no further page is available.
    base_url = movie_url.split('/reviews')[0]
    while review_count < 1000:
        for review in soup.find_all('div', class_='review-container'):
            if review_count >= 1000:
                break
            text_div = review.find('div', class_='text show-more__control')
            if text_div is None:
                continue  # skip containers without a review body
            reviews.append(text_div.get_text().strip())
            review_count += 1

        next_button = soup.find('div', class_='load-more-data')
        next_page = next_button.get('data-key') if next_button else None
        if not next_page:
            break
        # The title URL may already end in '/reviews/?...', so build the AJAX
        # URL from the part before '/reviews'.
        next_url = f'{base_url}/reviews/_ajax?ref_=undefined&paginationKey={next_page}'
        response = requests.get(next_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

    return reviews

def save_to_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Review'])
        for item in data:
            writer.writerow([item])

if __name__ == "__main__":
    # Example URLs for movies released in 2023 or 2024
    movie_urls = [
        'https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ql_2',
    ]

    all_reviews = []
    for movie_url in movie_urls:
        reviews = scrape_imdb_movie_reviews(movie_url)
        all_reviews.extend(reviews)

    save_to_csv(all_reviews, 'Oppenheimer_imdb_movie_reviews.csv')
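
The pagination step above depends on turning a title's reviews URL into the `_ajax` URL for the next page. As a minimal sketch, assuming the `/reviews/_ajax` pattern used in the cell above is still served by IMDb (not verified here), the URL can be assembled with the standard library. The helper name `build_reviews_ajax_url` and the key `abc123` are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_reviews_ajax_url(movie_url, pagination_key):
    """Return the '_ajax' reviews URL for a title URL and a pagination key.

    Assumption: the endpoint pattern '/title/<id>/reviews/_ajax' is taken
    from the scraping cell above and is not confirmed against the live site.
    """
    parts = urlsplit(movie_url)
    # Keep only '/title/<id>' and drop '/reviews/' plus any trailing segments.
    segments = [segment for segment in parts.path.split("/") if segment]
    title_path = "/" + "/".join(segments[:2])
    query = urlencode({"ref_": "undefined", "paginationKey": pagination_key})
    return urlunsplit(
        (parts.scheme, parts.netloc, f"{title_path}/reviews/_ajax", query, "")
    )


# Hypothetical pagination key, for illustration only.
example_url = build_reviews_ajax_url(
    "https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ql_2", "abc123"
)
print(example_url)
```

Parsing the URL instead of concatenating strings keeps the result valid whether or not the input already contains `/reviews/` or a query string.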



# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.

In [7]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from google.colab import files
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Read the CSV file
df = pd.read_csv("Oppenheimer_imdb_movie_reviews.csv")

# Define stopwords, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean text
def clean_text(text):
    # Remove noise
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = ''.join([i for i in text if not i.isdigit()])
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])
    # Lowercase all texts
    text = text.lower()
    # Stemming
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    # Lemmatization
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return text

# Clean the text data (missing reviews are treated as empty strings)
df['Cleaned_Review'] = df['Review'].fillna('').apply(clean_text)

# Save the clean data to a new CSV file
df.to_csv("Oppenheimer_imdb_movie_reviews_Cleaned.csv", index=False)

files.download("Oppenheimer_imdb_movie_reviews_Cleaned.csv")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
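
Since output is required for each part, it can help to print the text after every step rather than only the final result. The following is a minimal sketch covering steps (1) to (4) with the standard library only; stemming and lemmatization are left to NLTK as in the cell above. The tiny `STOPWORDS` set, the step names, and the sample review are assumptions made for illustration.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); the assignment expects a
# full list such as NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "in", "it", "this"}


def remove_noise(text):
    """Step (1): strip ASCII punctuation and special characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_numbers(text):
    """Step (2): drop every run of digits."""
    return re.sub(r"\d+", "", text)


def remove_stopwords(text):
    """Step (3): drop stopwords, compared case-insensitively."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


def lowercase(text):
    """Step (4): lowercase the remaining text."""
    return text.lower()


STEPS = [
    ("noise removed", remove_noise),
    ("numbers removed", remove_numbers),
    ("stopwords removed", remove_stopwords),
    ("lowercased", lowercase),
]


def clean_with_trace(text):
    """Apply each step in order and record the text after every step."""
    trace = []
    for name, step in STEPS:
        text = step(text)
        trace.append((name, text))
    return trace


# Made-up sample review, not taken from the collected data.
sample = "This movie was AMAZING!!! 10/10, a masterpiece of 2023."
for name, result in clean_with_trace(sample):
    print(f"{name}: {result}")
```

Keeping each step as its own function makes it straightforward to show the intermediate output for every sub-part and to reorder the steps if needed.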



# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
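
For part (1), taggers that use the Penn Treebank tag set emit several tags per word class (for example `NN`, `NNS`, `NNP` for nouns), so the four totals have to be aggregated. A minimal sketch, assuming `(word, tag)` pairs shaped like the output of `nltk.pos_tag`; the helper name `count_major_pos` and the hand-written sample pairs are illustrative assumptions, not tagger output.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four requested word classes.
PREFIX_TO_CATEGORY = {
    "NN": "Noun",       # NN, NNS, NNP, NNPS
    "VB": "Verb",       # VB, VBD, VBG, VBN, VBP, VBZ
    "JJ": "Adjective",  # JJ, JJR, JJS
    "RB": "Adverb",     # RB, RBR, RBS
}


def count_major_pos(tagged_words):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter({category: 0 for category in PREFIX_TO_CATEGORY.values()})
    for _, tag in tagged_words:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category:
            counts[category] += 1
    return counts


# Hand-written pairs for illustration (an assumption, not tagger output).
tagged = [
    ("The", "DT"), ("clean", "JJ"), ("text", "NN"), ("was", "VBD"),
    ("just", "RB"), ("saved", "VBN"), ("reviews", "NNS"),
]
pos_totals = count_major_pos(tagged)
print(dict(pos_totals))
```

Matching on the first two characters of the tag groups all inflected variants under one class while ignoring tags such as `DT` or `IN`.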

In [10]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.parse import DependencyGraph
from nltk.chunk import ne_chunk
from collections import Counter

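# Download the NLTK resources used below (resource names may differ in newer NLTK releases)
nltk.download('large_grammars')
nltk.download('averaged_perceptron_tagger')  # required by nltk.pos_tag
nltk.download('maxent_ne_chunker')           # required by ne_chunk
nltk.download('words')                       # required by ne_chunk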

# Sample clean text
clean_text = """
Write a python program to conduct syntax and structure analysis of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.
"""

# Tokenize the text into sentences
sentences = sent_tokenize(clean_text)

# Initialize counts for POS tagging
pos_counts = Counter()

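# Define a chunk grammar (regular-expression patterns over POS tags) for nltk.RegexpParser.
# This gives shallow NP/VP chunks, an approximation of a full constituency parse rather than a CFG.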
custom_cfg = """
    NP: {<DT>?<JJ>*<NN>} # NP
    VP: {<VB.*><DT>?<JJ>*<NN>*} # VP
"""

# Perform syntax and structure analysis
for sentence in sentences:
    # Tokenize each sentence into words and tag the parts of speech
    words = word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)

    # POS tagging
    for word, pos in tagged_words:
        pos_counts[pos] += 1

    # Constituency Parsing using the custom CFG
    cp_parser = nltk.RegexpParser(custom_cfg)
    cp_tree = cp_parser.parse(tagged_words)
    print("\nConstituency Parsing Tree:")
    print(cp_tree)

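    # Dependency Parsing
    # Note: these CoNLL rows leave the head and relation columns as '_', so the
    # graph carries no dependency arcs; a trained dependency parser is needed to fill them in.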
    dep_input = "\n".join([f"{i+1}\t{word}\t{tag}\t_\t_\t_\t_\t_\t_\t_" for i, (word, tag) in enumerate(tagged_words)])
    dep_graph = DependencyGraph(dep_input)
    print("\nDependency Parsing Tree:")
    print(dep_graph.to_conll(4))

    # Named Entity Recognition
    ne_tree = ne_chunk(tagged_words)
    entities = [entity for entity in ne_tree if isinstance(entity, nltk.Tree)]
    entity_labels = [entity.label() for entity in entities]
    print("\nNamed Entities:")
    print(Counter(entity_labels))

# Print POS tagging results
print("\nTotal POS Tag Counts:")
print(pos_counts)



Constituency Parsing Tree:
(S
  (VP Write/VB)
  (NP a/DT python/NN)
  (NP program/NN)
  to/TO
  (VP conduct/VB)
  (NP syntax/NN)
  and/CC
  (NP structure/NN)
  (NP analysis/NN)
  of/IN
  (NP the/DT clean/JJ text/NN)
  you/PRP
  just/RB
  (VP saved/VBN)
  above/RB
  ./.)

Dependency Parsing Tree:


Named Entities:
Counter()

Constituency Parsing Tree:
(S
  (NP The/DT syntax/NN)
  and/CC
  (NP structure/NN)
  (NP analysis/NN)
  (VP includes/VBZ)
  :/:
  (/(
  1/CD
  )/)
  Parts/NNS
  of/IN
  Speech/NNP
  (/(
  POS/NNP
  )/)
  (NP Tagging/NN)
  :/:
  Tag/NNP
  Parts/NNP
  of/IN
  Speech/NNP
  of/IN
  (NP each/DT word/NN)
  in/IN
  (NP the/DT text/NN)
  ,/,
  and/CC
  (VP calculate/VB)
  (NP the/DT total/JJ number/NN)
  of/IN
  N/NNP
  (/(
  (NP oun/NN)
  )/)
  ,/,
  V/NNP
  (/(
  (NP erb/NN)
  )/)
  ,/,
  Adj/NNP
  (/(
  ective/JJ
  )/)
  ,/,
  Adv/NNP
  (/(
  (NP erb/NN)
  )/)
  ,/,
  respectively/RB
  ./.)

Dependency Parsing Tree:


Named Entities:
Counter({'GPE': 6, 'ORGANIZATION': 1

[nltk_data] Downloading package large_grammars to /root/nltk_data...
[nltk_data]   Package large_grammars is already up-to-date!
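
To explain the two tree types with one sentence, it can help to write both structures out by hand. The sketch below uses the made-up sentence "The movie tells a gripping story"; both the constituency tree and the dependency arcs are hand-annotated assumptions for illustration, not the output of any parser, and the helper names are hypothetical.

```python
# Constituency tree as nested tuples: (label, child, child, ...).
# A leaf is (POS tag, word). Hand-annotated, not parser output.
TREE = (
    "S",
    ("NP", ("DT", "The"), ("NN", "movie")),
    ("VP", ("VBZ", "tells"),
     ("NP", ("DT", "a"), ("JJ", "gripping"), ("NN", "story"))),
)

# Dependency arcs: (index, word, head index, relation). Head 0 is the root.
# Hand-annotated, not parser output.
ARCS = [
    (1, "The", 2, "det"),
    (2, "movie", 3, "nsubj"),
    (3, "tells", 0, "root"),
    (4, "a", 6, "det"),
    (5, "gripping", 6, "amod"),
    (6, "story", 3, "obj"),
]


def to_bracketed(node):
    """Render a nested-tuple constituency tree in bracketed notation."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(to_bracketed(c) for c in children) + ")"


def describe_arcs(arcs):
    """Render each dependency arc as 'head --relation--> dependent'."""
    words = {index: word for index, word, _, _ in arcs}
    words[0] = "ROOT"
    return [f"{words[head]} --{rel}--> {word}" for _, word, head, rel in arcs]


print(to_bracketed(TREE))
for line in describe_arcs(ARCS):
    print(line)
```

The constituency tree groups words into nested phrases (the noun phrase "a gripping story" sits inside the verb phrase), whereas the dependency arcs link each word directly to the word it modifies, with the main verb "tells" as the root.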


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below