<a href="https://colab.research.google.com/github/RST0310/INFO-5731/blob/main/Rayabarapu_SaiTeja_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
pip install requests beautifulsoup4




In [None]:
import requests as rq
from bs4 import BeautifulSoup as Soup
import csv
import time

def get_movie_reviews(movie_id, max_reviews=1000):
    reviews = []
    url = f"https://www.imdb.com/title/{movie_id}/reviews"
    review_count = 0
    while review_count < max_reviews:
        response = rq.get(url)
        if response.status_code == 200:
            soup = Soup(response.text, 'html.parser')
            reviews_data = soup.find_all('div', class_='text show-more__control')
            reviews.extend([review.text.strip() for review in reviews_data])
            review_count = len(reviews)
            load_more_button = soup.find('div', class_='load-more-data')
            if load_more_button:
                load_more_url = f"https://www.imdb.com{load_more_button['data-ajaxurl']}"
                response = rq.get(load_more_url)
                if response.status_code == 200:
                    time.sleep(2)  # Add a delay to avoid overwhelming the server
                else:
                    break  # Stop if unable to load more reviews
            else:
                break  # Stop if no more reviews to load
        else:
            break  # Stop if unable to fetch the initial page
    return reviews[:max_reviews]

def save_to_csv(reviews, file_name):
    with open(file_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Review'])
        for review in reviews:
            writer.writerow([review])

def fetch_movie_reviews(movie_id, file_name, max_reviews=1000):
    reviews = get_movie_reviews(movie_id, max_reviews)
    save_to_csv(reviews, file_name)

movie_id = 'tt6791350'
file_name = 'movie_reviews.csv'
fetch_movie_reviews(movie_id, file_name, max_reviews=1000)


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
import csv
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

# Download NLTK resources if not already downloaded
nltk.download('stopwords')
nltk.download('wordnet')

def remove_noise(text):
    # Remove special characters and punctuation
    cleaned_text = ''.join([char for char in text if char not in string.punctuation])
    return cleaned_text

def remove_numbers(text):
    # Remove numbers
    cleaned_text = ''.join([char for char in text if not char.isdigit()])
    return cleaned_text

def remove_stopwords(text):
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    cleaned_text = ' '.join([word for word in text.split() if word.lower() not in stop_words])
    return cleaned_text

def lowercase(text):
    # Convert text to lowercase
    return text.lower()

def stemming(text):
    # Perform stemming
    stemmer = PorterStemmer()
    stemmed_text = ' '.join([stemmer.stem(word) for word in text.split()])
    return stemmed_text

def lemmatization(text):
    # Perform lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return lemmatized_text

# Read the original CSV file containing the reviews
original_file_name = 'movie_reviews.csv'
cleaned_file_name = 'cleaned_movie_reviews.csv'

with open(original_file_name, 'r', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    reviews = list(reader)

# Remove header
header = reviews[0]
reviews = reviews[1:]

# Choose either stemming or lemmatization
use_stemming = True
use_lemmatization = False

# Apply cleaning steps to each review
cleaned_reviews = []
for review in reviews:
    cleaned_review = review[:]  # Make a copy of the original review
    text = review[0]  # Extract the review text from the row

    # Step 1: Remove noise
    text = remove_noise(text)

    # Step 2: Remove numbers
    text = remove_numbers(text)

    # Step 3: Remove stopwords
    text = remove_stopwords(text)

    # Step 4: Lowercase
    text = lowercase(text)

    # Step 5: Apply stemming or lemmatization
    if use_stemming:
        text = stemming(text)
    elif use_lemmatization:
        text = lemmatization(text)

    cleaned_review.append(text)  # Add cleaned text as a new column
    cleaned_reviews.append(cleaned_review)

# Write cleaned data to a new CSV file
with open(cleaned_file_name, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header + ['Cleaned Review'])  # Write header with new column name
    writer.writerows(cleaned_reviews)

print("Data cleaning completed and saved to", cleaned_file_name)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Data cleaning completed and saved to cleaned_movie_reviews.csv


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [None]:
!pip install stanfordnlp


Collecting stanfordnlp
  Downloading stanfordnlp-0.2.0-py3-none-any.whl (158 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/158.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m158.8/158.8 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: stanfordnlp
Successfully installed stanfordnlp-0.2.0


In [None]:
import stanfordnlp
stanfordnlp.download('en')


Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
Y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: en_ewt
Download location: /root/stanfordnlp_resources/en_ewt_models.zip


100%|██████████| 235M/235M [00:39<00:00, 5.92MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.


In [52]:
import os
from stanfordnlp.server import CoreNLPClient

# Set the CoreNLP path
corenlp_dir = '/root/stanfordnlp_resources'  # Update to the correct directory
corenlp_zip = 'en_ewt_models.zip'  # Update to the correct zip file

# Set CORENLP_HOME environment variable
os.environ["CORENLP_HOME"] = corenlp_dir

# Start the server
server = CoreNLPClient(
    annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse'],
    timeout=30000,
    memory='4G',
    endpoint='http://localhost:9000',
    be_quiet=True,
    corenlp_jars=[corenlp_dir + '/' + corenlp_zip])
server.start()


Starting server with command: java -Xmx4G -cp /root/stanfordnlp_resources/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-69609c95f8d14a0f.props -preload tokenize,ssplit,pos,lemma,ner,parse,depparse


In [53]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.parse import CoreNLPParser
from nltk.tree import Tree
import requests

# Download NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('omw')
nltk.download('universal_tagset')
nltk.download('wordnet')

def is_corenlp_server_running():
    try:
        response = requests.get('http://localhost:9000')
        return response.status_code == 200
    except (requests.exceptions.ConnectionError, requests.exceptions.RequestException):
        return False

def pos_tagging(text):
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens, tagset='universal')
    return pos_tags

def print_constituency_parsing_trees(text):
    if not is_corenlp_server_running():
        print("Error: CoreNLP server is not running at http://localhost:9000")
        return

    try:
        parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
        sentences = sent_tokenize(text)
        for sentence in sentences:
            parsed_tree = list(parser.parse(sentence.split()))
            if parsed_tree:
                print(parsed_tree[0])
            else:
                print("Failed to parse constituency tree for sentence:", sentence)
    except Exception as e:
        print("Error:", e)

def print_dependency_parsing_trees(text):
    if not is_corenlp_server_running():
        print("Error: CoreNLP server is not running at http://localhost:9000")
        return

    try:
        parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
        sentences = sent_tokenize(text)
        for sentence in sentences:
            parsed_tree = list(parser.parse(sentence.split()))
            if parsed_tree:
                dependency_tree = Tree.fromstring(parsed_tree[0].pretty_print())
                print(dependency_tree)
            else:
                print("Failed to parse dependency tree for sentence:", sentence)
    except Exception as e:
        print("Error:", e)

def named_entity_recognition(text):
    sentences = sent_tokenize(text)
    entities = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens = nltk.pos_tag(tokens)
        named_entities = nltk.ne_chunk(tagged_tokens)
        for entity in named_entities:
            if isinstance(entity, Tree):
                entities.append(' '.join([leaf[0] for leaf in entity]))
    return entities

# Read the cleaned text from CSV
cleaned_file_name = 'cleaned_movie_reviews.csv'
cleaned_text = ""
with open(cleaned_file_name, 'r', encoding='utf-8') as f:
    for line in f:
        split_line = line.split(',')
        if len(split_line) > 1:  # Check if split operation produces at least two elements
            cleaned_text += split_line[1]  # Assuming the second column contains the cleaned text

# (1) Parts of Speech (POS) Tagging
if cleaned_text:
    pos_tags = pos_tagging(cleaned_text)
    noun_count = len([tag for _, tag in pos_tags if tag == 'NOUN'])
    verb_count = len([tag for _, tag in pos_tags if tag == 'VERB'])
    adj_count = len([tag for _, tag in pos_tags if tag == 'ADJ'])
    adv_count = len([tag for _, tag in pos_tags if tag == 'ADV'])

    print("(1) Parts of Speech (POS) Tagging:")
    print("Noun Count:", noun_count)
    print("Verb Count:", verb_count)
    print("Adjective Count:", adj_count)
    print("Adverb Count:", adv_count)
    print()
else:
    print("No text found for POS tagging.")

# (2) Constituency Parsing and Dependency Parsing
if cleaned_text:
    print("(2) Constituency Parsing Trees:")
    print_constituency_parsing_trees(cleaned_text)
    print()

    print("(3) Dependency Parsing Trees:")
    print_dependency_parsing_trees(cleaned_text)
    print()
else:
    print("No text found for parsing.")

# (3) Named Entity Recognition
if cleaned_text:
    entities = named_entity_recognition(cleaned_text)
    entity_counts = {}
    for entity in entities:
        entity_counts[entity] = entity_counts.get(entity, 0) + 1

    print("(3) Named Entity Recognition:")
    for entity, count in entity_counts.items():
        print(f"{entity}: {count}")
else:
    print("No text found for named entity recognition.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data]   Package omw is already up-to-date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


(1) Parts of Speech (POS) Tagging:
Noun Count: 10522
Verb Count: 5361
Adjective Count: 3999
Adverb Count: 2320

(2) Constituency Parsing Trees:
Error: CoreNLP server is not running at http://localhost:9000

(3) Dependency Parsing Trees:
Error: CoreNLP server is not running at http://localhost:9000

(3) Named Entity Recognition:
Review: 1
Galaxy: 80
MCU: 120
Galaxy Vol: 80
James: 40
Gunn: 40
Thanos: 40
Kang: 80
Quantumania: 40
Adam: 80
Adam Warlock: 40
James Gunn: 80
Rocket: 80
GOTG: 40
Marvel: 40
Marvel Cinematic Universe: 40
Vol: 40
Disney: 40
CGI: 40
High: 40
Guardians: 40


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
#The assignment is good enough to gain enough knowledge on webscraping from open sourses. The challenging thing was to get the error free code and get the desired out in the csv format. The provided time is really sufficient to complete the assignmnet. I was suprised to see the assignment opened up in the middle of the week and the deadline was on wednesday.