# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing rights from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:


import requests
from bs4 import BeautifulSoup
import csv

def scrape_imdb_reviews(movie_urls):
    """Collect up to 1000 review texts per IMDb reviews URL."""
    all_reviews = []
    for url in movie_urls:
        reviews = []
        page = 1
        while len(reviews) < 1000:
            # Assumption: the reviews page accepts a 'start' offset. If the site
            # ignores it, each request may return the same reviews again.
            response = requests.get(url + f'&start={page * 10}', timeout=30)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')
                # Each review body sits in a div carrying these two CSS classes
                review_containers = soup.find_all('div', class_='text show-more__control')
                if not review_containers:
                    break  # nothing left to read on this page
                for container in review_containers:
                    review_text = container.text.strip()
                    reviews.append(review_text)
                    if len(reviews) >= 1000:
                        break
                print(f"Collected {len(reviews)} reviews for {url}")
                page += 1
            else:
                print(f"Failed to retrieve data from {url}")
                break
        all_reviews.extend(reviews)
    return all_reviews


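def get_movie_title(url):
    """Return the page's og:title value, or None if it cannot be read."""
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        title_tag = soup.find('meta', property='og:title')
        # Guard against pages that lack the og:title meta tag
        if title_tag is None or not title_tag.get('content'):
            print(f"No og:title found at {url}")
            return None
        return title_tag['content']
    else:
        print(f"Failed to retrieve data from {url}")
        return None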


def save_reviews_to_csv(reviews, filename, movie_name):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Movie Name', 'Review'])
        writer.writerows([[movie_name, review] for review in reviews])


def main():
    movie_url = 'https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2'  # Replace with the IMDb URL of your chosen movie
    movie_name = get_movie_title(movie_url)

    if movie_name:
        reviews = scrape_imdb_reviews([movie_url])
        if len(reviews) >= 1000:
            save_reviews_to_csv(reviews[:1000], 'movie_reviews.csv', movie_name)
            print("Successfully saved 1000 reviews to movie_reviews.csv")
        else:
            print(f"Failed to collect 1000 reviews; only {len(reviews)} were collected.")
    else:
        print("Failed to get movie name.")

if __name__ == '__main__':
    main()


Collected 25 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 50 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 75 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 100 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 125 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 150 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 175 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 200 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 225 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 250 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 275 reviews for https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2
Collected 300 reviews for https://www.imdb.com/title/tt6791350/revie
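
The scraping loop stops once its list holds 1000 entries, so it relies on every request returning reviews that have not been seen before. A minimal sketch of a guard against repeated pages follows. The helper name `add_new_reviews` and the sample strings are hypothetical and not part of the original cell, and whether IMDb actually repeats pages for this URL pattern is an assumption that would need checking.

```python
def add_new_reviews(collected, seen, page_reviews):
    """Append only unseen review texts; return how many were added."""
    added = 0
    for text in page_reviews:
        key = text.strip()
        if key and key not in seen:
            seen.add(key)
            collected.append(key)
            added += 1
    return added


# Made-up sample pages: the second page repeats the first one.
collected, seen = [], set()
first = add_new_reviews(collected, seen, ["Great film.", "Too long.", "Great film."])
second = add_new_reviews(collected, seen, ["Too long.", "Great film."])
print(first, second, len(collected))
```

Inside the `while` loop, a page that adds zero new reviews would be a reasonable signal to `break` instead of requesting further pages.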

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [11]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK resources if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

def clean_text(input_text):
    # Remove noise (special characters and punctuations)
    cleaned_text = re.sub(r'[^\w\s]', '', input_text)

    # Remove numbers
    cleaned_text = re.sub(r'\d+', '', cleaned_text)

    # Tokenize the text
    words = word_tokenize(cleaned_text)

    # Lowercase all text first so the stopword comparison is case-insensitive
    words = [word.lower() for word in words]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join the cleaned words back into a sentence
    cleaned_text = ' '.join(words)

    return cleaned_text

def clean_and_save_data(input_csv_filename):
    # Read the CSV file
    df = pd.read_csv(input_csv_filename)

    # Apply the cleaning function to the 'Review' column and create a new 'Cleaned Review' column
    df['Cleaned Review'] = df['Review'].apply(clean_text)

    # Save the cleaned data to the same CSV file
    df.to_csv(input_csv_filename, index=False)

    # Print the head of the DataFrame
    print(df.head())

    return df


input_csv_file = 'movie_reviews.csv'
cleaned_df = clean_and_save_data(input_csv_file)


print(cleaned_df)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                     Movie Name  \
0  Guardians of the Galaxy Vol. 3 (2023) - IMDb   
1  Guardians of the Galaxy Vol. 3 (2023) - IMDb   
2  Guardians of the Galaxy Vol. 3 (2023) - IMDb   
3  Guardians of the Galaxy Vol. 3 (2023) - IMDb   
4  Guardians of the Galaxy Vol. 3 (2023) - IMDb   

                                              Review  \
0  Guardians of the Galaxy Volume 3 is chaotic, w...   
1  Having sat through some phase 4 films that fai...   
2  Up to this point, there has been one trilogy i...   
3  "There is no God. That's why I stepped in." I ...   
4  Firstly Adam warlock's intro was marvelous, wh...   

                                      Cleaned Review  
0  guardian galaxi volum chaotic weird oftentim r...  
1  sat phase film fail inspir guardian feel like ...  
2  point one trilog mcu excel start finish time a...  
3  god that step admit one best line ever spoken ...  
4  firstli adam warlock intro marvel made hate ka...  
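
Deleting punctuation outright can fuse two words that were separated only by a punctuation mark, for example a sentence boundary with no following space. A minimal sketch of one alternative, replacing punctuation and digits with a space and then collapsing whitespace, is shown here. `strip_noise` is a hypothetical helper name and the sample sentence is made up for illustration; this is not the cleaning function used in the cell above.

```python
import re


def strip_noise(text):
    """Replace punctuation and digits with spaces, then collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes a space
    no_digits = re.sub(r"\d+", " ", no_punct)  # numbers become a space
    return re.sub(r"\s+", " ", no_digits).strip()


sample = "Great themes.I must say: 10/10!"

# Deleting instead of replacing, as a comparison
fused = re.sub(r"\d+", "", re.sub(r"[^\w\s]", "", sample))

print(strip_noise(sample))
print(fused)
```

The trade-off is that contractions such as "that's" split into two tokens, leaving a stray `s` that a later filtering step would need to drop.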
                          

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [14]:
import nltk
from nltk import pos_tag, word_tokenize, RegexpParser
import spacy

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Use the first cleaned review; assumes cleaned_df from Question 2 is still in memory
text = cleaned_df['Cleaned Review'].iloc[0]
print(text)

# Tokenize the text
tokens = word_tokenize(text)

# Part-of-speech tagging using NLTK
pos_tags = pos_tag(tokens)

# Print POS tags obtained using NLTK
print("\nPOS Tags (NLTK):")
print(pos_tags)

# Named Entity Recognition using spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Print NER
print("\nNamed Entity Recognition:")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

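# NLTK's default tagger returns Penn Treebank tags (NN, VBD, JJ, RB, ...),
# so map each tag prefix onto the four coarse categories before counting.
prefix_to_category = {'NN': 'NOUN', 'VB': 'VERB', 'JJ': 'ADJ', 'RB': 'ADV'}
pos_counts = {'NOUN': 0, 'VERB': 0, 'ADJ': 0, 'ADV': 0}
for _, tag in pos_tags:
    category = prefix_to_category.get(tag[:2])
    if category:
        pos_counts[category] += 1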

# Print POS counts
print("\nPOS Counts:")
for pos, count in pos_counts.items():
    print(f"{pos}: {count}")

# Dependency Parsing using spaCy
print("\nDependency Parsing Tree:")
for token in doc:
    print(f"{token.text} -- {token.dep_} --> {token.head.text}")




'''In this sentence, the constituency parsing tree breaks down the grammatical structure and organization of the text into phrases and their syntactic relationships. For instance, it identifies noun phrases like "the Guardian of the Galaxy volume" and "the story of family loss," as well as verb phrases like "handled with care" and "help hammer home themes." Each node in the constituency parsing tree represents a phrase, and the tree structure illustrates how these phrases are nested within one another.

On the other hand, the dependency parsing tree focuses on the grammatical dependencies between words, revealing how they relate to each other in terms of syntactic roles. For example, it identifies the subject-verb relationships like "sat phase film," "fail inspires guardian," and "feel like a breath of fresh air." The tree illustrates the dependencies between words, emphasizing the core elements of the sentence and showcasing the connections between them. Together, constituency and dependency parsing provide a comprehensive understanding of the sentence's syntactic structure and the relationships between its constituent parts. '''
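
The cell imports `RegexpParser` but does not print a constituency tree. As a rough stand-in, NLTK's `RegexpParser` can group tagged tokens into noun phrases. Note that this is shallow chunking with a hand-written grammar, not a full constituency parse, and both the grammar and the pre-tagged sample sentence below are illustrative assumptions.

```python
from nltk import RegexpParser

# Hypothetical pre-tagged sentence (Penn Treebank tags), so no tagger download is needed.
tagged = [
    ("the", "DT"), ("chaotic", "JJ"), ("movie", "NN"),
    ("has", "VBZ"), ("great", "JJ"), ("themes", "NNS"),
]

# NP = optional determiner, any number of adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
tree = RegexpParser(grammar).parse(tagged)

# Collect the words under every NP node of the resulting tree.
noun_phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(tree)
print(noun_phrases)
```

Each `NP` node in the printed tree is a phrase nested under the root `S` node, which is the nesting idea the explanation above describes. A dependency parse has no such phrase nodes: every word points directly at its head word.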


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


guardian galaxi volum chaotic weird oftentim ridicul also full heart emot great themesi must say best marvel movi sinc endgam that necessarili hard though need surpass way home amaz high moment lazi other marvel desper need hit theyv final got ithighlightseveri member crew got time shine rocket definit one stood though see friend die wail pain grief made bawl high evolutionari mock man heavi stuff proce rip face two friend shot well thu rocket traumat pastim revealedchukwudi iwuji fantast villain certain point downright terrifi realli like line god that took place convinc villainth moment starlord scream agoni rocket live moment reson lost mani peopl close cant stand lose anyon elsegamora get lot charact develop im glad didnt kiss quill end would felt littl cheapdrax manti nebula adam warlock other also got moment well im surpris nobodi die honeston problem movi humor lot time good passabl time undercut realli emot scenesi could say lot review alreadi long

POS Tags (NLTK):
[('guardian
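
Question 3 asks for the count of each entity, while the cell above only prints the entities. A minimal sketch of the tally using `collections.Counter` follows. To keep it runnable without a spaCy model, it takes plain `(text, label)` pairs, which with spaCy could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The function name `count_entities` is hypothetical, and the sample pairs are invented for illustration; they are not output from this notebook.

```python
from collections import Counter


def count_entities(pairs):
    """Return (counts per label, counts per (text, label) pair)."""
    by_label = Counter(label for _, label in pairs)
    by_entity = Counter(pairs)
    return by_label, by_entity


# Invented sample data standing in for spaCy entities.
sample_pairs = [
    ("Rocket", "PERSON"), ("Marvel", "ORG"),
    ("Rocket", "PERSON"), ("2023", "DATE"),
]
by_label, by_entity = count_entities(sample_pairs)
print(by_label.most_common())
print(by_entity[("Rocket", "PERSON")])
```

Entity recognizers usually lean on capitalization and intact word forms, so running NER on the original `Review` column instead of the lowercased, stemmed text may surface more entities.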

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
## Web scraping was interesting, but selecting a website was a challenging task: a few websites, such as Amazon, required an access code and permission to use their API. If possible, could you please let us know how we can gain access to such websites? Constituency parsing and dependency parsing were also challenging. Everything else was easy, and the time provided for the assignment was sufficient.
