
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [8]:
# Write your code here

import requests
from bs4 import BeautifulSoup
import csv

def get_imdb_reviews(film_ids, num_reviews=10000):
    reviews = []
    for film_id in film_ids:
        base_url = f'https://www.imdb.com/title/{film_id}/reviews?ref_=tt_ql_3'
        while len(reviews) < num_reviews:
            # A timeout keeps a stalled connection from hanging the whole run
            response = requests.get(base_url, timeout=30)
            if response.status_code != 200:
                print(f"Failed to retrieve the webpage for film ID: {film_id}.")
                break

            soup = BeautifulSoup(response.content, 'html.parser')
            review_containers = soup.find_all('div', class_='text show-more__control')

            for review in review_containers:
                reviews.append(review.text)

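            # Check for a 'Next' link to continue to the next page of reviews.
            # The class name depends on the page markup; if no matching link
            # with an href is found, only the first page per film is collected.
            next_link = soup.find('a', {'class': 'ipl-load-more__button'})
            if next_link and next_link.get('href'):
                base_url = 'https://www.imdb.com' + next_link['href']
            else:
                break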

        # If we've reached or surpassed our desired number of reviews, break out of the loop
        if len(reviews) >= num_reviews:
            break

    return reviews[:num_reviews]

def save_to_csv(reviews, filename="imdb_reviews.csv"):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Review'])
        for review in reviews:
            writer.writerow([review])

if __name__ == "__main__":
    film_ids = ["tt6710474", "tt1745960", "tt1877830", "tt19770238", "tt1630029", "tt15398776", "tt9603212"]
    reviews = get_imdb_reviews(film_ids)
    save_to_csv(reviews)
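
The loop in `get_imdb_reviews` follows a next-page link until enough reviews are collected or no further page exists. The control flow can be sketched without any network access, which makes it easier to check the stopping conditions. Everything below is hypothetical and for illustration only: `collect_reviews`, the `fetch_page` callable, and the in-memory `FAKE_PAGES` dictionary stand in for the real HTTP request and HTML parsing.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical page shape for this sketch: (review texts, URL of the next page or None).
Page = Tuple[List[str], Optional[str]]


def collect_reviews(start_urls: List[str],
                    fetch_page: Callable[[str], Page],
                    limit: int) -> List[str]:
    """Follow next-page links from each start URL until `limit` reviews are collected."""
    reviews: List[str] = []
    for url in start_urls:
        next_url: Optional[str] = url
        seen = set()  # guards against a page that links back to itself
        while next_url and next_url not in seen and len(reviews) < limit:
            seen.add(next_url)
            texts, next_url = fetch_page(next_url)
            reviews.extend(texts)
        if len(reviews) >= limit:
            break
    return reviews[:limit]


# Fake in-memory "site" standing in for real HTTP responses.
FAKE_PAGES = {
    "film-a/p1": (["a1", "a2"], "film-a/p2"),
    "film-a/p2": (["a3"], None),
    "film-b/p1": (["b1", "b2"], None),
}

collected = collect_reviews(["film-a/p1", "film-b/p1"], FAKE_PAGES.__getitem__, limit=4)
print(collected)
```

With `limit=4` the sketch exhausts both pages of the first film, moves on to the second film, and truncates the result to four reviews. Injecting the fetch function keeps the paging logic testable separately from the site-specific selectors.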


In [20]:
import pandas as pd

data_url = "imdb_reviews.csv"
df = pd.read_csv(data_url)
df

|     | Review |
| --- | --- |
| 0 | I have trouble turning off my brain. Anxieties... |
| 1 | Profoundly deep, genuinely moving, utterly hil... |
| 2 | If you take drugs for the first time and imagi... |
| 3 | "Be kind, especially when you don't know what'... |
| 4 | Everything Everywhere All At Once is even craz... |
| ... | ... |
| 170 | After the first 30 minutes that promised an in... |
| 171 | No one liked my indy review, so they're really... |
| 172 | All of the instalments in this franchise are v... |
| 173 | There are no major spoilers but some minor plo... |


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [10]:
!pip install nltk



In [11]:
# Write your code here

import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download resources
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize tools
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # 1. Remove noise: special characters and punctuations
    text = re.sub(r'[^\w\s]', '', text)

    # 2. Remove numbers
    text = re.sub(r'\d+', '', text)

    # 3. Remove stopwords
    words = text.split()
    words = [word for word in words if word.lower() not in stop_words]

    # 4. Lowercase all texts
    words = [word.lower() for word in words]

    # 5. Stemming
    words = [ps.stem(word) for word in words]

    # 6. Lemmatization
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

def save_cleaned_data(filename="imdb_reviews.csv", cleaned_filename="cleaned_imdb_reviews.csv"):
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        headers = next(reader)

        cleaned_data = []
        cleaned_data.append(headers + ["Cleaned_Review"])

        for row in reader:
            review = row[0]
            cleaned_review = clean_text(review)
            cleaned_data.append([review, cleaned_review])

    with open(cleaned_filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerows(cleaned_data)

if __name__ == "__main__":
    save_cleaned_data()




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
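
In `clean_text`, punctuation is stripped before stopwords are removed. Assuming the stopword list stores contractions with their apostrophe (such as "don't"), a stripped form like `dont` no longer matches and stays in the output. The sketch below compares the two orderings on one sentence. The tiny `STOPWORDS` set and both function names are stand-ins invented for this illustration, not the NLTK list or part of the assignment code.

```python
import re

# Tiny stand-in stopword set for illustration only; the real list is much longer.
STOPWORDS = {"i", "don't", "the", "they're", "a"}


def strip_then_filter(text: str) -> str:
    """Order used in clean_text: remove punctuation first, then drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in STOPWORDS)


def filter_then_strip(text: str) -> str:
    """Alternative order: drop stopwords while apostrophes are intact, then remove punctuation."""
    words = [w for w in text.lower().split() if w.strip(".,!?;:") not in STOPWORDS]
    return re.sub(r"[^\w\s]", "", " ".join(words))


sample = "I don't like the ending, they're wrong!"
print(strip_then_filter(sample))
print(filter_then_strip(sample))
```

The first call keeps the stripped contractions, while the second removes them along with the other stopwords. Which ordering is preferable depends on whether the assignment's step order is meant to be followed literally.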


In [21]:
import pandas as pd

data_url = "cleaned_imdb_reviews.csv"
df = pd.read_csv(data_url)
df


|     | Review | Cleaned_Review |
| --- | --- | --- |
| 0 | I have trouble turning off my brain. Anxieties... | troubl turn brain anxieti worri mundan todo ev... |
| 1 | Profoundly deep, genuinely moving, utterly hil... | profoundli deep genuin move utterli hilari hig... |
| 2 | If you take drugs for the first time and imagi... | take drug first time imagin jacki chan femal d... |
| 3 | "Be kind, especially when you don't know what'... | kind especi dont know what go onif could recog... |
| 4 | Everything Everywhere All At Once is even craz... | everyth everywher even crazier trailer would l... |
| ... | ... | ... |
| 170 | After the first 30 minutes that promised an in... | first minut promis intellectu action thriller ... |
| 171 | No one liked my indy review, so they're really... | one like indi review theyr realli gonna hate l... |
| 172 | All of the instalments in this franchise are v... | instal franchis wellmad highli entertain decen... |
| 173 | There are no major spoilers but some minor plo... | major spoiler minor plot exposit reviewth miss... |


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the cleaned text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the cleaned text, and calculate the count of each entity.

In [12]:
!pip install nltk spacy




In [13]:
!python -m spacy download en_core_web_sm


2023-10-17 17:57:57.634849: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [27]:
# Write your code here

import csv
import nltk
import spacy
from collections import defaultdict
from nltk import pos_tag
from nltk.parse.stanford import StanfordParser
from spacy import displacy

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

def pos_analysis(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = pos_tag(tokens)
    pos_counts = defaultdict(int)
    for _, tag in pos_tags:
        if tag.startswith('N'):
            pos_counts['Noun'] += 1
        elif tag.startswith('V'):
            pos_counts['Verb'] += 1
        elif tag.startswith('J'):
            pos_counts['Adjective'] += 1
        elif tag.startswith('RB'):  # RB, RBR, RBS; a bare 'R' prefix would also count RP (particle)
            pos_counts['Adverb'] += 1
    return pos_counts

def constituency_parsing(text):
    # NOTE: Ensure you've set up Stanford Parser and its paths for this to work.
    parser = StanfordParser()
    trees = list(parser.raw_parse(text))
    for tree in trees:
        tree.pretty_print()
    return trees

def dependency_parsing(text):
    doc = nlp(text)
    displacy.render(doc, style='dep')
    return doc

def named_entity_recognition(text):
    doc = nlp(text)
    entity_counts = defaultdict(int)
    for ent in doc.ents:
        entity_counts[ent.label_] += 1
    return entity_counts

def perform_syntax_structure_analysis(filename="cleaned_imdb_reviews.csv"):
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header

        for row in reader:
            cleaned_review = row[1]

            # 1. POS Tagging
            pos_counts = pos_analysis(cleaned_review)
            print(f"POS Counts: {pos_counts}")

            # 2. Constituency Parsing
            # Uncomment the below line when you have Stanford Parser set up
            #constituency_par = constituency_parsing(cleaned_review)
            #print(f"constituency parsing: {constituency_par}")

            # 2. Dependency Parsing
            dependency_par = dependency_parsing(cleaned_review)
            print(f"dependency parsing: {dependency_par}")

            # 3. Named Entity Recognition
            entities = named_entity_recognition(cleaned_review)
            print(f"Entity Counts: {entities}")

if __name__ == "__main__":
    perform_syntax_structure_analysis()





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


POS Counts: defaultdict(<class 'int'>, {'Noun': 66, 'Adjective': 31, 'Verb': 14, 'Adverb': 5})
dependency parsing: troubl turn brain anxieti worri mundan todo even posit thing sometim feel like theyr swirl around chaotic funnel cloud would like noth sit physic mental silenceeveryth everywher felt like insid head world nonstop news bad person like troubl filter thing affect directli thing happen gener control suppos copeon answer decid noth matter anyway give care mean decid wife doesnt matter kid dont matter art natur thing bring joy life dont matteranoth way decid thing ok mayb thing dont matter thing thing make worth get decid thing areth first approach nihilist second approach empow film explor approach sob mess endi say time bit exhaust movi throw lot screen viewer occasion cant keep ambit mostli home runmichel yeoh terrif work mvp ke huy quan short round indiana jone moviesgrad
Entity Counts: defaultdict(<class 'int'>, {'ORDINAL': 2, 'GPE': 2, 'PERSON': 2})
POS Counts: defaultdict
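
Question 3 asks for total counts, while `perform_syntax_structure_analysis` prints one dictionary per review. A minimal sketch of summing those per-review dictionaries into corpus-wide totals follows. The `total_counts` helper and the `per_review` numbers are made up for illustration; in the notebook the inputs would be the dictionaries returned by `pos_analysis` or `named_entity_recognition`.

```python
from collections import Counter
from typing import Dict, Iterable


def total_counts(per_review_counts: Iterable[Dict[str, int]]) -> Counter:
    """Sum per-review count dictionaries into one corpus-wide Counter."""
    totals: Counter = Counter()
    for counts in per_review_counts:
        totals.update(counts)  # Counter.update adds counts rather than replacing them
    return totals


# Made-up per-review results, shaped like the dictionaries returned by pos_analysis.
per_review = [
    {"Noun": 5, "Verb": 2},
    {"Noun": 3, "Adjective": 4},
    {"Adverb": 1, "Verb": 1},
]

totals = total_counts(per_review)
for label, count in totals.most_common():
    print(f"{label}: {count}")
```

The same helper would work for entity labels, since both analyses return a mapping from label to count.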

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency parsing divides a sentence into nested phrases, called constituents, such as noun phrases and verb phrases. It exposes the hierarchical syntactic structure of the sentence. In the resulting parse tree, the words are the leaves, each word sits under its part-of-speech tag, and the internal nodes are phrase labels that group the words below them according to the grammar rules. For instance, nouns such as "insights", "abilities", and "illness" would appear as leaves inside noun-phrase branches, which shows how they combine with neighboring words into larger units. A dependency parsing tree, by contrast, has no phrase nodes: every word is a node, and each labeled arc links a word to the head word it depends on.

Constituency parsing is also useful in natural language processing tasks such as machine translation and sentiment analysis. Knowing the hierarchical structure of a sentence can help algorithms capture its meaning more accurately, for example when translating a sentence into another language or determining the sentiment of a text.
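
To make the difference between the two tree types concrete, the sketch below encodes both analyses of one short sentence as plain Python data. The sentence "The dog chased the cat" and its analyses are hand-written for illustration, not output from the Stanford parser or spaCy, and the names `CONSTITUENCY`, `DEPENDENCIES`, `bracketed`, and `leaves` are hypothetical.

```python
from typing import List, Tuple, Union

# A constituency tree as nested tuples: (label, child, child, ...); leaves are plain words.
Tree = Union[str, Tuple]

# Hand-written analysis of "The dog chased the cat" (not parser output).
CONSTITUENCY: Tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "dog")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))),
)

# A dependency parse as (word, index of head word, relation); the root has head -1.
DEPENDENCIES: List[Tuple[str, int, str]] = [
    ("The", 1, "det"),
    ("dog", 2, "nsubj"),
    ("chased", -1, "ROOT"),
    ("the", 4, "det"),
    ("cat", 2, "dobj"),
]


def bracketed(tree: Tree) -> str:
    """Render a constituency tree in bracketed notation."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"


def leaves(tree: Tree) -> List[str]:
    """Collect the words at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


print(bracketed(CONSTITUENCY))
for word, head, rel in DEPENDENCIES:
    head_word = DEPENDENCIES[head][0] if head >= 0 else "ROOT"
    print(f"{word} --{rel}--> {head_word}")
```

The constituency tree introduces phrase nodes (`S`, `NP`, `VP`) above the words, whereas the dependency list contains only the five words, each pointing at its head, with the verb as the single root.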