
# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

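def get_imdb_reviews(url, num_pages=200):  # Adjusted for 200 pages to attempt 10,000 reviews
    """Collect review texts page by page, skipping duplicates.

    Note: whether IMDb honors the `start` offset is not verified here; if every
    request returns the same page, the loop stops once nothing new arrives.
    """
    reviews = []
    seen = set()

    for page in range(1, num_pages + 1):
        # Pass the offset via `params` so it merges with the URL's existing query string
        response = requests.get(url, params={'start': (page - 1) * 50},
                                headers=headers, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        review_divs = soup.find_all('div', {'class': 'text show-more__control'})

        new_reviews = 0
        for review in review_divs:
            text = review.text.strip()
            if text not in seen:
                seen.add(text)
                reviews.append(text)
                new_reviews += 1

        # Break if the page added nothing new (no reviews, or only repeats)
        if new_reviews == 0:
            break

    return reviews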

def save_to_csv(reviews, filename="reviews.csv"):
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Review Number", "Review Text"])
        for i, review in enumerate(reviews, 1):
            writer.writerow([i, review])

# Example usage:
url = 'https://www.imdb.com/title/tt13927994/reviews/?ref_=ttrt_ql_2'
reviews = get_imdb_reviews(url)

# Save at most the first 1000 reviews to a CSV file
saved = reviews[:1000]
save_to_csv(saved)

# Report the number actually written rather than assuming 1000
print(f"{len(saved)} reviews saved to reviews.csv")


1000 reviews saved to reviews.csv
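
The review URL in the cell above already carries a query string (`?ref_=ttrt_ql_2`), so pagination parameters are safer merged into it than appended as text. A minimal standard-library sketch, assuming `start` is the offset parameter (as in the cell above; whether IMDb honors it is not verified here) and using `with_offset` as a hypothetical helper name:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_offset(url, start):
    """Return `url` with a `start` parameter merged into its existing query.

    `with_offset` is a hypothetical helper; treating `start` as an offset is an
    assumption carried over from the scraping cell, not a documented IMDb API.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query['start'] = str(start)
    return urlunsplit(parts._replace(query=urlencode(query)))

base = 'https://www.imdb.com/title/tt13927994/reviews/?ref_=ttrt_ql_2'
for page in range(3):
    # Each URL keeps the original `ref_` parameter and contains a single "?"
    print(with_offset(base, page * 50))
```

Building the URL this way keeps exactly one `?` in the request, whatever parameters the base URL already has.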


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write code for each of the sub parts with proper comments.
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download stopwords and wordnet data
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Load stopwords
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # (1) Remove noise, such as special characters and punctuation.
    # Replace with a space so "well-made" splits into two words instead of fusing.
    text = re.sub(r'[^\w\s]|_', ' ', text)

    # (2) Remove numbers.
    text = re.sub(r'\d+', '', text)

    # (4) Lowercase all texts.
    text = text.lower()

    # Tokenize the text for further processing
    words = text.split()

    # (3) Remove stopwords.
    words = [word for word in words if word not in stop_words]

    # (5) Stemming.
    words = [stemmer.stem(word) for word in words]

    # (6) Lemmatization (applied to the stemmed tokens, so it changes little).
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

def clean_reviews_in_csv(input_filename, output_filename):
    with open(input_filename, 'r', encoding='utf-8') as infile, open(output_filename, 'w', newline='', encoding='utf-8') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Write header
        header = next(reader)
        writer.writerow(header + ['Cleaned Review'])

        for row in reader:
            review = row[1]  # Assuming review text is in the second column
            cleaned_review = clean_text(review)
            writer.writerow(row + [cleaned_review])

    print(f"Cleaned reviews saved to {output_filename}")

clean_reviews_in_csv('reviews.csv', 'cleaned_reviews.csv')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned reviews saved to cleaned_reviews.csv
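
The question asks for output at each cleaning step. Below is a minimal standard-library sketch of steps (1) to (4) that prints every intermediate result. The sample sentence and the tiny `DEMO_STOPWORDS` set are illustrative assumptions standing in for a real review and the NLTK stopword list. Stemming and lemmatization are left out because they need NLTK. Punctuation is replaced by a space here so that hyphenated words split instead of fusing.

```python
import re

# Illustrative stand-in for NLTK's English stopword list (an assumption, not the real list)
DEMO_STOPWORDS = {"the", "was", "and", "it", "a", "of", "in"}

def clean_steps(text):
    """Apply steps (1)-(4) in order and return every intermediate result."""
    steps = {}
    # (1) Replace punctuation and special characters with a space
    steps["no_punctuation"] = re.sub(r"[^\w\s]", " ", text)
    # (2) Remove digits
    steps["no_numbers"] = re.sub(r"\d+", "", steps["no_punctuation"])
    # (4) Lowercase
    steps["lowercased"] = steps["no_numbers"].lower()
    # (3) Drop stopwords (after lowercasing, so "The" matches "the")
    steps["no_stopwords"] = " ".join(
        word for word in steps["lowercased"].split() if word not in DEMO_STOPWORDS
    )
    return steps

sample = "The movie was released in 2023, and it felt over-long!"
for name, value in clean_steps(sample).items():
    print(f"{name:>15}: {value}")
```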


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.
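
For part (2), a constituency tree is commonly printed as nested brackets, where each bracket names a phrase and lists its children. A minimal sketch of reading such a string, assuming a hand-written example tree (not parser output from the reviews) and the hypothetical helper names `parse_brackets`, `leaves`, and `show`:

```python
def parse_brackets(text):
    """Parse a bracketed tree such as "(S (NP (NN movie)))" into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())        # nested phrase
            else:
                children.append(tokens[position])   # a word (leaf)
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return read_node()

def leaves(node):
    """Return the words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def show(node, depth=0):
    """Print one label per line, indented by depth, with words next to their tags."""
    label, children = node
    words = [child for child in children if isinstance(child, str)]
    print("  " * depth + label + (": " + " ".join(words) if words else ""))
    for child in children:
        if isinstance(child, tuple):
            show(child, depth + 1)

# Hand-written example tree (an assumption for illustration only)
demo = "(ROOT (S (NP (DT The) (NN movie)) (VP (VBD felt) (ADJP (JJ random)))))"
tree = parse_brackets(demo)
show(tree)
print(leaves(tree))
```

Reading the indented output top-down shows how the sentence `S` splits into a noun phrase and a verb phrase, which is the grouping a constituency parse expresses.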

In [None]:
# Your code here
import csv
import spacy
from collections import defaultdict

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

def pos_analysis(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header

        pos_counts = defaultdict(int)

        for row in reader:
            cleaned_review = row[2]
            doc = nlp(cleaned_review)

            # (1) Parts of Speech (POS) Tagging
            for token in doc:
                if token.pos_ in ["NOUN", "VERB", "ADJ", "ADV"]:
                    pos_counts[token.pos_] += 1

        # Print POS counts
        print("Parts of Speech Counts:")
        for pos, count in pos_counts.items():
            print(f"{pos}: {count}")

pos_analysis('cleaned_reviews.csv')



Parts of Speech Counts:
NOUN: 40320
VERB: 16320
ADJ: 16320
ADV: 3480


In [None]:
!pip install stanza
import pandas as pd
import spacy
import stanza

# Load spaCy's English model
nlp_spacy = spacy.load("en_core_web_sm")

# Initialize Stanza with English model, ensuring constituency parsing is enabled
stanza.download('en', processors='tokenize,pos,constituency')
nlp_stanza = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')

# Function to perform dependency parsing with spaCy
def dependency_parsing(text):
    doc = nlp_spacy(text)
    for sentence in doc.sents:
        print("Dependency Parsing for sentence:", sentence.text)
        for token in sentence:
            print(f"{token.text} <--{token.dep_}-- {token.head.text}")
        print("\n")

# Function to perform constituency parsing with Stanza
def constituency_parsing(text):
    doc = nlp_stanza(text)
    for sentence in doc.sentences:
        print("Constituency Parsing for sentence:", sentence.text)
        print(sentence.constituency)
        print("\n")

# Load the CSV file
file_path = 'cleaned_reviews.csv'
df = pd.read_csv(file_path)

# Ensure the 'Cleaned Review' column exists
if 'Cleaned Review' not in df.columns:
    raise ValueError("The 'Cleaned Review' column is not present in the CSV.")

# Iterate through each non-empty review in the DataFrame
# (reviews that became empty after cleaning are read back as NaN)
for review in df['Cleaned Review'].dropna():
    # Perform dependency parsing
    dependency_parsing(review)
    # Perform constituency parsing
    constituency_parsing(review)
    # Break after one review for demonstration; remove this to process all reviews
    break




INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading these customized packages for language: en (English)...
| Processor       | Package             |
-----------------------------------------
| tokenize        | combined            |
| mwt             | combined            |
| pos             | combined_charlm     |
| constituency    | ptb3-revised_charlm |
| backward_charlm | 1billion            |
| forward_charlm  | 1billion            |
| pretrain        | conll17             |

INFO:stanza:File exists: /root/stanza_resources/en/tokenize/combined.pt
INFO:stanza:File exists: /root/stanza_resources/en/mwt/combined.pt
INFO:stanza:File exists: /root/stanza_resources/en/pos/combined_charlm.pt
INFO:stanza:File exists: /root/stanza_resources/en/constituency/ptb3-revised_charlm.pt
INFO:stanza:File exists: /root/stanza_resources/en/backward_charlm/1billion.pt
INFO:stanza:File exists: /root/stanza_resources/en/forward_charlm/1billion.pt
INFO:stanza:Fil

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| constituency | ptb3-revised_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: constituency
INFO:stanza:Done loading processors!


Dependency Parsing for sentence: movi disappoint exampl telugu cinema lack origin first half movi clear stori line confus audienc feel like random video clip stitch togeth audienc fill gap part drag audienc lose interest point second
movi <--nsubj-- disappoint
disappoint <--compound-- exampl
exampl <--compound-- origin
telugu <--nmod-- origin
cinema <--compound-- origin
lack <--compound-- origin
origin <--nsubj-- feel
first <--advmod-- movi
half <--amod-- movi
movi <--nmod-- line
clear <--amod-- stori
stori <--compound-- line
line <--compound-- audienc
confus <--compound-- audienc
audienc <--appos-- origin
feel <--ROOT-- feel
like <--mark-- stitch
random <--amod-- clip
video <--compound-- clip
clip <--nsubj-- stitch
stitch <--advcl-- feel
togeth <--nmod-- gap
audienc <--compound-- gap
fill <--compound-- gap
gap <--dobj-- stitch
part <--compound-- audienc
drag <--compound-- audienc
audienc <--nsubj-- lose
lose <--conj-- stitch
interest <--compound-- point
point <--dobj-- lose
second <--
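
Each printed line reads `child <--relation-- head`, and the token whose relation is `ROOT` is the head of the whole sentence. Below is a minimal sketch that rebuilds the head-to-children structure from a few of the lines above. The subset is chosen so every word is unique, since matching by text alone would be ambiguous for repeated tokens such as `audienc`. `parse_arrows` and `print_tree` are hypothetical helpers.

```python
import re
from collections import defaultdict

ARROW = re.compile(r"^(\S+) <--(\S+)-- (\S+)$")

def parse_arrows(lines):
    """Return (root, {head: [(relation, child), ...]}) from printed arrow lines."""
    root = None
    children = defaultdict(list)
    for line in lines:
        match = ARROW.match(line.strip())
        if not match:
            continue  # skip truncated or malformed lines
        child, relation, head = match.groups()
        if relation == "ROOT":
            root = child
        else:
            children[head].append((relation, child))
    return root, dict(children)

def print_tree(word, children, depth=0, relation="ROOT"):
    """Print the tree from the root down, one token per line."""
    print("  " * depth + f"{word} ({relation})")
    for child_relation, child in children.get(word, []):
        print_tree(child, children, depth + 1, child_relation)

# A subset of the printed output, plus the truncated last line to show it is skipped
lines = [
    "like <--mark-- stitch",
    "random <--amod-- clip",
    "video <--compound-- clip",
    "clip <--nsubj-- stitch",
    "stitch <--advcl-- feel",
    "feel <--ROOT-- feel",
    "second <--",
]
root, children = parse_arrows(lines)
print_tree(root, children)
```

In this view `feel` governs `stitch`, which in turn governs its subject `clip` and the modifiers of `clip`. A dependency parse encodes this word-to-word structure, in contrast to the phrase grouping of a constituency parse.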

In [None]:
import spacy
import pandas as pd
from collections import defaultdict

# Load the English NER model
nlp = spacy.load("en_core_web_sm")

# Read the cleaned_reviews.csv
df = pd.read_csv('cleaned_reviews.csv')

# Ensure the 'Cleaned Review' column exists
if 'Cleaned Review' not in df.columns:
    raise ValueError("The 'Cleaned Review' column is not present in the CSV.")

# Dictionary to store counts of each entity
entity_counts = defaultdict(int)

# Process each review in the DataFrame
for review in df['Cleaned Review']:
    doc = nlp(str(review))  # Convert to string in case there are NaN values
    for ent in doc.ents:
        entity_counts[ent.label_] += 1

# Print the counts of each entity
for entity, count in entity_counts.items():
    print(f"{entity}: {count}")


ORDINAL: 960
CARDINAL: 1480
PERSON: 2080
ORG: 1120
NORP: 240
QUANTITY: 40
PRODUCT: 40
GPE: 200
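
Every POS and entity count reported above is a multiple of 40, which is what one would expect if the 1000 saved rows were a small set of reviews repeated many times. This is a hypothesis, not a confirmed finding. A minimal sketch of the arithmetic check, with the counts copied from the outputs above:

```python
from functools import reduce
from math import gcd

# Counts copied from the POS and NER outputs above
pos_counts = {"NOUN": 40320, "VERB": 16320, "ADJ": 16320, "ADV": 3480}
entity_counts = {"ORDINAL": 960, "CARDINAL": 1480, "PERSON": 2080, "ORG": 1120,
                 "NORP": 240, "QUANTITY": 40, "PRODUCT": 40, "GPE": 200}

all_counts = list(pos_counts.values()) + list(entity_counts.values())

# A common divisor well above 1 suggests the same texts were counted repeatedly
common = reduce(gcd, all_counts)
print(f"Greatest common divisor of all counts: {common}")
```

If the divisor is well above 1, counting distinct rows in `reviews.csv` (for example with pandas `df['Review Text'].nunique()`) would confirm or rule out duplication.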


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
'''The assignment is lengthy, but it gave me a working understanding of data collection, data cleaning, and text analysis.
The hardest part was parsing: I struggled to understand both constituency parsing and dependency parsing.
Initially I also found web scraping difficult, but once I worked through it I found the trick and solved it easily.
Overall the assignment was very interesting because the results of each step can be observed directly. I think the time given is enough if the concepts are understood; otherwise it is difficult.'''
