<a href="https://colab.research.google.com/github/MoulikaGudipally/Moulika_INFO5731_Fall2023/blob/main/Gudipally_Moulika_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film released in 2022 or 2023 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [7]:
# Question 1, option (2): collect user reviews of a film from IMDB

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the movie's reviews on IMDB
url = 'https://www.imdb.com/title/tt0111161/reviews?ref_=tt_ql_3'

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find and extract review elements
    review_elements = soup.find_all('div', class_='text show-more__control')

    # Create a list to store the reviews
    reviews = []

    for review in review_elements:
        reviews.append(review.get_text().strip())

    # Create a DataFrame from the collected data
    df = pd.DataFrame({'Review': reviews})

    # Save the data to a CSV file
    df.to_csv('imdb_reviews.csv', index=False)
    print(f'Saved {len(reviews)} reviews to imdb_reviews.csv')

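else:
    # Report the HTTP status code so that a failed request is easier to diagnose
    print(f'Failed to fetch data from IMDB (status code {response.status_code})')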



Saved 25 reviews to imdb_reviews.csv


In [9]:
df.head(10)



| | Review |
|---|---|
| 0 | The Shawshank Redemption is written and direct... |
| 1 | It is no wonder that the film has such a high ... |
| 2 | I'm trying to save you money; this is the last... |
| 3 | This movie is not your ordinary Hollywood flic... |
| 4 | In its Oscar year, Shawshank Redemption (writt... |
| 5 | The best movie in history and the best ending ... |
| 6 | Shawshank Redemption is without doubt one of t... |
| 7 | One of the finest films made in recent years. ... |
| 8 | Misery and Stand By Me were the best adaptatio... |
| 9 | I've lost count of the number of times I have ... |


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [10]:
# Write your code here
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the stopword list and the WordNet data used by the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file
df = pd.read_csv('imdb_reviews.csv')

# Initialize stopwords, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean and preprocess the text
def clean_text(text):
    # Remove special characters and punctuation
    text = ''.join([char for char in text if char not in string.punctuation])

    # Remove numbers
    text = ''.join([char for char in text if not char.isdigit()])

    # Tokenize the text (split it into words)
    words = text.split()

    # Remove stopwords and lowercase the words
    words = [word.lower() for word in words if word.lower() not in stop_words]

    # Stemming and lemmatization
    words = [stemmer.stem(word) for word in words]
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

# Apply the cleaning function to the 'Review' column and store the result in a new column 'Cleaned_Review'
df['Cleaned_Review'] = df['Review'].apply(clean_text)

# Save the cleaned data to the CSV file
df.to_csv('imdb_reviews_cleaned.csv', index=False)

print('Saved cleaned data to imdb_reviews_cleaned.csv')





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Saved cleaned data to imdb_reviews_cleaned.csv
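
One side effect of deleting punctuation character by character is that words separated only by punctuation become fused. The sample output further down contains tokens such as `themther` and `statusit`, which look like two words joined across a sentence boundary. The following is a minimal sketch of an alternative that replaces punctuation and digits with spaces instead of removing them. The names `basic_clean` and `DEMO_STOPWORDS` are hypothetical and not part of the notebook, and the tiny stopword set is for illustration only (the notebook uses NLTK's English list). Stemming and lemmatization are left out to keep the example dependency-free.

```python
import re
import string

# Hypothetical mini stopword list for illustration only; the notebook itself
# uses NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "a", "of", "is", "and", "it"}

# Matches any single ASCII punctuation character or digit.
_PUNCT_OR_DIGIT = re.compile(f"[{re.escape(string.punctuation)}0-9]")

def basic_clean(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, replace punctuation and digits with spaces, drop stopwords."""
    text = _PUNCT_OR_DIGIT.sub(" ", text.lower())
    # split() with no arguments collapses the extra spaces introduced above
    return " ".join(word for word in text.split() if word not in stop_words)

print(basic_clean("It changes things.There is no fanfare, and 2 of the 10 agree."))
# prints: changes things there no fanfare agree
```

Because each removed character leaves a space behind, `things.There` becomes two tokens rather than one fused token.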


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [4]:
import pandas as pd
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Load the cleaned text from the CSV file
df = pd.read_csv('imdb_reviews_cleaned.csv')

# Process the cleaned text using spaCy
cleaned_text = df['Cleaned_Review']

# (2) Dependency Parsing
sample_sentence = cleaned_text.iloc[0]  # Using the first cleaned review as an example
sample_doc = nlp(sample_sentence)

print(f"Sample Sentence: {sample_sentence}\n")

print("Dependency Parsing Tree:")
for token in sample_doc:
    print(f"{token.text} ({token.dep_} --> {token.head.text})")

# (3) Named Entity Recognition
entity_counts = {}

for text in cleaned_text:
    doc = nlp(text)
    for ent in doc.ents:
        entity_type = ent.label_
        entity_text = ent.text
        if entity_type not in entity_counts:
            entity_counts[entity_type] = {}
        if entity_text not in entity_counts[entity_type]:
            entity_counts[entity_type][entity_text] = 0
        entity_counts[entity_type][entity_text] += 1

# Print entity counts
for entity_type, entities in entity_counts.items():
    print(f"Entities of type '{entity_type}':")
    for entity, count in entities.items():
        print(f"{entity}: {count}")

# Example explanation for dependency parsing:
# Dependency Parsing Tree:
# shawshank (amod --> redempt)
# redempt (nsubj --> written)
# written (amod --> darabont)
# ...

# The code above prints the dependency parse of the first cleaned review only
# and counts named entities across the entire dataset.


Sample Sentence: shawshank redempt written direct frank darabont adapt stephen king novella rita hayworth shawshank redempt star tim robbin morgan freeman film portray stori andi dufresn robbin banker sentenc two life sentenc shawshank state prison appar murder wife lover andi find tough go find solac friendship form fellow inmat elli red red freeman thing start pick warden find andi prison job befit talent banker howev arriv anoth inmat go vastli chang thing themther fanfar bunt put releas film back titl didnt give much inkl anyon columbia pictur unsur market shawshank redempt bare regist box offic howev come academi award time film receiv sever nomin although none stir interest film home entertain releas rest say histori film final found audienc saw film propel almost mythic proport endear modern day classic someth delight fan whilst simultan baffl detractor one thing sure though ever side shawshank fenc sit film continu gather new fan simpli never go away loo mythic statusit possibl
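
The cell above does not cover the POS counts asked for in part (1). The following is a minimal sketch of one way to compute them, assuming spaCy's coarse `token.pos_` tags. The helper names `count_pos_groups` and `pos_tags_for_texts` are hypothetical, and folding `PROPN` into nouns and `AUX` into verbs is a choice made for this sketch, not something the assignment specifies. Tags predicted on stemmed text with stopwords removed are likely to be less reliable than tags predicted on the original reviews.

```python
# Coarse universal POS tags (as exposed by spaCy's token.pos_) mapped to the
# four categories named in Question 3 (1). Folding PROPN into nouns and AUX
# into verbs is an assumption of this sketch.
POS_GROUPS = {
    "NOUN": "Noun", "PROPN": "Noun",
    "VERB": "Verb", "AUX": "Verb",
    "ADJ": "Adjective",
    "ADV": "Adverb",
}

def count_pos_groups(pos_tags):
    """Count how many tags fall into each of the four categories."""
    counts = {"Noun": 0, "Verb": 0, "Adjective": 0, "Adverb": 0}
    for tag in pos_tags:
        group = POS_GROUPS.get(tag)
        if group is not None:  # tags such as PUNCT or DET are ignored
            counts[group] += 1
    return counts

def pos_tags_for_texts(texts, model_name="en_core_web_sm"):
    """Yield the coarse POS tag of every token in every text, using spaCy."""
    import spacy  # imported lazily so the counting helper works without spaCy
    nlp = spacy.load(model_name)
    for doc in nlp.pipe(texts):
        for token in doc:
            yield token.pos_

# Tiny hand-written tag sequence, so the example runs without a language model.
print(count_pos_groups(["PROPN", "NOUN", "VERB", "ADJ", "ADV", "PUNCT", "NOUN"]))
# prints: {'Noun': 3, 'Verb': 1, 'Adjective': 1, 'Adverb': 1}
```

With the model installed, the counts for the whole dataset could be obtained with `count_pos_groups(pos_tags_for_texts(df['Cleaned_Review']))`.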

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency parsing and dependency parsing are two different approaches to analyzing the structure of a sentence in natural language processing.

1. **Constituency Parsing Tree:**

   Constituency parsing, often referred to as phrase structure parsing, is a method that aims to find the grammatical structure of a sentence by breaking it down into smaller constituents. These constituents can be phrases or words that are grouped together based on their syntactic roles. The result is represented as a tree structure.

   In a constituency parsing tree:
   - Each word is represented as a leaf node.
   - Phrases, such as noun phrases (NP) and verb phrases (VP), are represented as non-terminal nodes.
   - The root node represents the entire sentence, and it is usually labeled as 'S' (for sentence).
   - The edges in the tree represent constituency: each child node is a part of the phrase labeled by its parent node.
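
To make these points concrete, here is a small hand-built constituency tree, written as nested tuples. The sentence "The warden finds Andy", the Penn Treebank-style labels, and the helper names `leaves` and `to_brackets` are illustrative assumptions, not output from a parser. The spaCy pipeline used in the code cell above produces dependency parses, so a constituency tree for the real data would need a separate tool.

```python
# A node is either a plain string (a word) or a tuple (label, child, child, ...).
# Hand-built tree for the illustrative sentence "The warden finds Andy".
tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "warden")),
    ("VP", ("VBZ", "finds"), ("NP", ("NNP", "Andy"))),
)

def leaves(node):
    """Return the words of the sentence, read left to right from the leaf nodes."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [word for child in children for word in leaves(child)]

def to_brackets(node):
    """Render the tree in the usual bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"

print(to_brackets(tree))
# prints: (S (NP (DT The) (NN warden)) (VP (VBZ finds) (NP (NNP Andy))))
print(leaves(tree))
# prints: ['The', 'warden', 'finds', 'Andy']
```

The root `S` covers the whole sentence, the non-terminal nodes `NP` and `VP` group words into phrases, and the words themselves appear only at the leaves.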


2. **Dependency Parsing Tree:**

   Dependency parsing focuses on identifying the relationships between words in a sentence by assigning a directed link (dependency) between words. In a dependency parsing tree:
   - Each word is a node in the tree.
   - The arrows between words represent grammatical relationships, with one word serving as the head and the other as the dependent.
   - The root of the tree is typically the main verb of the sentence, and every other word depends, directly or indirectly, on the root.
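
The same illustrative sentence can be written as a dependency tree, stored as `(word, relation, head)` triples in the style of the `token.text (token.dep_ --> token.head.text)` lines printed by the code cell above. The triples below are hand-written for "The warden finds Andy", not parser output, and the helper names `find_root` and `dependents_of` are hypothetical. Following spaCy's convention, the root carries the relation `ROOT` and is its own head.

```python
# Hand-written (word, relation, head) triples for "The warden finds Andy".
deps = [
    ("The", "det", "warden"),
    ("warden", "nsubj", "finds"),
    ("finds", "ROOT", "finds"),  # the root is its own head
    ("Andy", "dobj", "finds"),
]

def find_root(triples):
    """Return the word labeled ROOT."""
    return next(word for word, rel, _head in triples if rel == "ROOT")

def dependents_of(head_word, triples):
    """Return (word, relation) pairs for the direct dependents of head_word."""
    return [
        (word, rel)
        for word, rel, head in triples
        if head == head_word and rel != "ROOT"
    ]

print(find_root(deps))
# prints: finds
print(dependents_of("finds", deps))
# prints: [('warden', 'nsubj'), ('Andy', 'dobj')]
```

Unlike the constituency tree, there are no phrase nodes here: every node is a word, and each arc links a dependent directly to its head.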

   
Both constituency parsing and dependency parsing provide valuable insights into the grammatical structure of a sentence, and each has its own advantages and use cases in natural language processing tasks. Constituency parsing is well-suited for understanding phrases and hierarchical structure, while dependency parsing is useful for capturing syntactic dependencies between words.