
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 tweets by using a hashtag (you can use any hashtag) from Twitter. 


In [192]:
import requests
from bs4 import BeautifulSoup
import csv

# Set the URL of the page with movie reviews
url = 'https://www.imdb.com/title/tt8178634/reviews/?ref_=tt_ql_urv'

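# Send a GET request to the URL and parse the HTML content.
# A browser-like User-Agent is sent because some sites reject the default one.
request_headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=request_headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

# Find the total number of reviews (the header div may be missing if the page layout changes)
header_div = soup.find('div', {'class': 'header'})
header_words = header_div.text.split() if header_div else []
total_reviews = header_words[0] if header_words else 'unknown'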

# Initialize an empty list to store the reviews
reviews = []

# Find all the review containers on the page (only the reviews present in this single response)
review_containers = soup.find_all('div', {'class': 'imdb-user-review'})

# Extract the review text and add it to the list of reviews
for review in review_containers:
    reviews.append(review.text.strip())

# Open a CSV file for writing
with open('reviews.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['Review'])

    # Write each review as a row in the CSV file
    for review in reviews:
        writer.writerow([review])

# Report the total separately so that every data row in the CSV is a review
print('Total reviews reported on the page:', total_reviews)
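
The extraction and CSV steps above can be tried offline before requesting the live site. The sketch below is a minimal illustration that uses only the standard library's `html.parser` in place of BeautifulSoup. The sample markup, the `ReviewParser` class, and the `reviews_to_csv` helper are hypothetical names introduced here, and real IMDb pages may use different class names.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup; real pages are larger and may use other class names.
SAMPLE_HTML = """
<div class="imdb-user-review"><a class="title">Great fun</a><div class="text">Loved every minute.</div></div>
<div class="imdb-user-review"><a class="title">Too long</a><div class="text">Three hours is a lot.</div></div>
"""

class ReviewParser(HTMLParser):
    """Collects the text inside every <div class="imdb-user-review">."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._depth = 0    # div nesting depth inside the current review container
        self._parts = []   # text fragments of the current review

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if self._depth == 0 and tag == 'div' and 'imdb-user-review' in classes:
            self._depth = 1
            self._parts = []
        elif self._depth > 0 and tag == 'div':
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth > 0 and tag == 'div':
            self._depth -= 1
            if self._depth == 0:
                self.reviews.append(' '.join(self._parts))

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._parts.append(data.strip())

def reviews_to_csv(html):
    """Return the extracted reviews and the CSV text with a 'Review' header."""
    parser = ReviewParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['Review'])
    for review in parser.reviews:
        writer.writerow([review])
    return parser.reviews, buffer.getvalue()

reviews, csv_text = reviews_to_csv(SAMPLE_HTML)
print(len(reviews), 'reviews extracted')
print(csv_text)
```

Writing to an in-memory buffer keeps the check free of network and file-system side effects; the same rows could be written to `reviews.csv` once the selectors are confirmed against the real page.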


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [40]:
# Write your code here
!pip install pandas nltk textblob






In [194]:
import csv
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Load the stop words into a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Define a function to clean the text data
def clean_text(text):
    # Remove ASCII punctuation and special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = text.translate(str.maketrans('', '', string.digits))
    
    # Tokenize the text into words
    words = nltk.word_tokenize(text)
    
    # Remove stop words
    words = [word.lower() for word in words if word.lower() not in stop_words]
    
    # Stem the words (currently disabled; uncomment to apply the Porter stemmer)
    #words = [stemmer.stem(word) for word in words]
    
    # Lemmatize the words
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the cleaned words into a string
    cleaned_text = ' '.join(words)
    
    return cleaned_text

# Open the original CSV file for reading
with open('reviews.csv', mode='r', newline='', encoding='utf-8') as input_file:
    reader = csv.reader(input_file)
    
    # Open a new CSV file for writing
    with open('cleaned_reviews.csv', mode='w', newline='', encoding='utf-8') as output_file:
        writer = csv.writer(output_file)
        
        # Loop through each row in the original CSV file
        for row in reader:
            # Skip empty rows, which would otherwise raise an IndexError below
            if not row:
                continue

            # Replace the original header row with the two-column header
            if row[0] == 'Review':
                header_row = ['Cleaned Review', 'Review']
                writer.writerow(header_row)
                continue
            
            # Clean the text in the first column
            cleaned_text = clean_text(row[0])
            
            # Write the cleaned text and the original text to the new CSV file
            new_row = [cleaned_text, row[0]]
            writer.writerow(new_row)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
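
One point worth checking in the cleaning order: stripping punctuation before filtering stop words turns contractions such as "don't" into "dont", which may no longer match the stop-word list. The sketch below illustrates the difference. It assumes a tiny hand-picked stop-word set (`STOP_WORDS`, not the full list) and plain whitespace tokenization instead of `nltk.word_tokenize`; both function names are illustrative.

```python
import string

# Small illustrative stop-word subset (an assumption; substitute the full list you use).
STOP_WORDS = {"i", "don't", "the", "this", "it", "is", "a", "at", "all"}

# Translation table that deletes ASCII punctuation and digits.
STRIP_TABLE = str.maketrans('', '', string.punctuation + string.digits)

def clean_punct_first(text):
    """Strip punctuation and digits first, then filter stop words."""
    text = text.translate(STRIP_TABLE)
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def clean_stopwords_first(text):
    """Filter stop words first (ignoring edge punctuation), then strip the rest."""
    kept = [w for w in text.lower().split()
            if w.strip(string.punctuation) not in STOP_WORDS]
    cleaned = [w.translate(STRIP_TABLE) for w in kept]
    return [w for w in cleaned if w]  # drop tokens that became empty, e.g. "2"

sample = "I don't like this movie at all, 2 stars!"
print(clean_punct_first(sample))      # the contraction survives as "dont"
print(clean_stopwords_first(sample))  # the contraction is removed as a stop word
```

Which order is preferable depends on the stop-word list in use; the point is only that the two orders can produce different tokens.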


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [34]:
# Write your code here

!pip install nltk textblob







In [198]:
import csv
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load the stop words list
stop_words = stopwords.words('english')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to clean the text data
def clean_text(text):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    
    cleaned_sentences = []
    
    # Loop through each sentence in the text
    for sentence in sentences:
        # Tokenize the sentence into words
        words = word_tokenize(sentence)
        
        # Remove stop words
        words = [word.lower() for word in words if word.lower() not in stop_words]
        
        # Lemmatize the words
        words = [lemmatizer.lemmatize(word) for word in words]
        
        # Join the cleaned words into a string
        cleaned_sentence = ' '.join(words)
        
        cleaned_sentences.append(cleaned_sentence)
    
    # Join the cleaned sentences into a string
    cleaned_text = ' '.join(cleaned_sentences)
    
    return cleaned_text

# Open the CSV file for reading
with open('cleaned_reviews.csv', mode='r', newline='', encoding='utf-8') as input_file:
    reader = csv.reader(input_file)
    
    # Loop through each row in the CSV file
    for row in reader:
        # Skip the header row
        if row[0] == 'Cleaned Review':
            continue
        
        # Clean the text in the first column
        cleaned_text = clean_text(row[0])
        
        # Print the cleaned text
        print('Cleaned Review:', cleaned_text)
        
        # Tokenize the text into words
        words = word_tokenize(cleaned_text)
        
        # Perform POS tagging
        pos_tags = pos_tag(words)
        
        # Initialize the count variables
        noun_count = 0
        verb_count = 0
        adj_count = 0
        adv_count = 0
        
        # Loop through each word and count the POS tags
        for word, pos in pos_tags:
            if pos.startswith('N'):
                noun_count += 1
            elif pos.startswith('V'):
                verb_count += 1
            elif pos.startswith('J'):
                adj_count += 1
            elif pos.startswith('RB'):  # 'RB' excludes particles, which are tagged 'RP'
                adv_count += 1
        
        # Print the POS tag counts
        print('Noun count:', noun_count)
        print('Verb count:', verb_count)
        print('Adjective count:', adj_count)
        print('Adverb count:', adv_count)
        
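        # Constituency parsing (not implemented): the draft below depends on a
        # `grammar` that is never defined, so it is left commented out
        #parse_tree = Tree.fromstring(nltk.parse.generate(grammar, [cleaned_text])[0])
        
        # Print the constituency parse tree heading (no tree is produced yet)
        print('Constituency parse tree:')
        #print(parse_tree.pformat())
        
        # Dependency parsing (not implemented): the draft below depends on parse_tree
        #dependency_parser = nltk.parse.DependencyGraph(tree_str=parse_tree.pformat())
        
        # Print the dependency parse tree heading (no tree is produced yet)
        print('Dependency parse tree:')
        #print(dependency_parser.tree().pprint())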
        
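        # Perform named entity recognition on the POS-tagged tokens
        # (ne_chunk relies partly on capitalization, so lowercased text may yield few entities)
        ne_tree = ne_chunk(pos_tags)
        
        # Count each entity type found in this review
        entity_counts = {}
        for subtree in ne_tree:
            if isinstance(subtree, Tree):
                label = subtree.label()
                entity_counts[label] = entity_counts.get(label, 0) + 1
        print('Named entity counts:', entity_counts)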


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Cleaned Review: never seen anything quite like evanstondad august seen lot movie time made lot different style different genre around world ive seen everything mainstream movie imaginable experimental cant even remember last time came away movie thinking id never seen anything like thats felt rrrthis movie much muchness may turn almost wife nearly bailed minute mark film top ridiculous got hooked totally ride point disappointed hour behemoth endingdo like see musclebound slickedup men fighting tiger check public flogging turn musical number got evil british people extremely evil england sue filmmaker defamation sure evil british people mauled rampaging jungle animal betcha beheading yep romance course homoeroticism intense watching movie may turn gay hooboy let say anything already movie isnt worth anywaywatch rrr make sure whatever first movie watch one dont care much certainly feel like palest imitation movie ever seen seriously movie deliriously bonkers unafraid absolutely absurd al
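
The question asks for total counts across the whole text, while the loop above reports counts per review. The sketch below shows one way to accumulate totals with `collections.Counter`. It assumes hand-tagged sample tokens in place of real `pos_tag` output; `PREFIX_TO_CATEGORY`, `count_pos`, and `tagged_reviews` are illustrative names, not part of NLTK.

```python
from collections import Counter

# Map two-letter Penn Treebank tag prefixes to the four categories in the question.
PREFIX_TO_CATEGORY = {'NN': 'noun', 'VB': 'verb', 'JJ': 'adjective', 'RB': 'adverb'}

def count_pos(tagged_tokens, totals=None):
    """Add the category counts of one tagged review to a running Counter."""
    totals = Counter() if totals is None else totals
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category:
            totals[category] += 1
    return totals

# Hand-tagged sample reviews (hypothetical); in the notebook these would come from pos_tag(words).
tagged_reviews = [
    [('movie', 'NN'), ('deliriously', 'RB'), ('bonkers', 'JJ')],
    [('watch', 'VB'), ('great', 'JJ'), ('film', 'NN'), ('give', 'VB'), ('up', 'RP')],
]

totals = Counter()
for tagged in tagged_reviews:
    count_pos(tagged, totals)

print(dict(totals))
```

Matching on the two-letter prefix keeps particles (`RP`) and modals (`MD`) out of the adverb and verb totals.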

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):** 

In [None]:
A constituency parse tree, also known as a phrase structure tree, represents the syntactic structure of a sentence. 
It breaks a sentence down into its constituents, such as noun phrases, verb phrases, prepositional phrases, and clauses. 
Each internal node represents a constituent, each leaf is a word, and the branches show which smaller constituents make up a larger one. 
For instance, the tree can show how a determiner and a noun form a noun phrase, or how a noun phrase and a verb phrase form a clause. 
Constituency parse trees are used in tasks such as natural language understanding, machine translation, and text-to-speech conversion.


A dependency parse tree, on the other hand, represents the grammatical relationships between the individual words in a sentence. 
It identifies the head word of the sentence, usually the main verb, and the dependent words that modify it or add information about it.
Each node represents a word, and each branch is a directed, labeled edge from a head to one of its dependents. 
The head word of the sentence is the root node, and every other word is attached to exactly one head. 
For example, a preposition such as "in" is connected by a directed edge to the word it is grammatically related to, such as the noun "house". 
Dependency parse trees are used in natural language processing tasks such as sentiment analysis, question answering, and information extraction.
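
To make the two descriptions concrete, here is a small sketch using the sentence "The cat sat on the mat". Both analyses are written by hand for illustration and are not parser output. The dependency labels follow Universal Dependencies style naming as an assumption, and the helper names (`bracketed`, `leaves`, `dependents_of`) are hypothetical.

```python
# Constituency tree as nested tuples: (label, child, child, ...); leaves are plain strings.
constituency = (
    'S',
    ('NP', ('DT', 'The'), ('NN', 'cat')),
    ('VP', ('VBD', 'sat'),
        ('PP', ('IN', 'on'), ('NP', ('DT', 'the'), ('NN', 'mat')))),
)

def bracketed(node):
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return '(' + label + ' ' + ' '.join(bracketed(c) for c in children) + ')'

def leaves(node):
    """Return the words of the sentence in order."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]

# Dependency tree as (head, relation, dependent) edges; every word has exactly one head.
dependencies = [
    ('ROOT', 'root', 'sat'),
    ('sat', 'nsubj', 'cat'),
    ('cat', 'det', 'The'),
    ('sat', 'obl', 'mat'),
    ('mat', 'case', 'on'),
    ('mat', 'det', 'the'),
]

def dependents_of(head):
    """Return the words directly attached to the given head."""
    return [dep for h, _rel, dep in dependencies if h == head]

print(bracketed(constituency))
print(leaves(constituency))
print(dependents_of('sat'))
```

The constituency tree groups words into nested phrases (the prepositional phrase "on the mat" sits inside the verb phrase), while the dependency edges link words directly: "sat" is the root, with "cat" and "mat" as its dependents.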

