<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film released in 2022 or 2023 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [1]:
# Write your code here

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape IMDB user reviews
def scrape_imdb_reviews(movie_url):
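    # Send a browser-like User-Agent; some sites reject the default one used by requests
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.87 Safari/537.36"
    }

    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(movie_url, headers=headers, timeout=10)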

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the review text blocks; adjust the selector if the page layout changes
        review_divs = soup.find_all('div', class_='text show-more__control')

        reviews = []
        for review_div in review_divs:
            review_text = review_div.get_text(strip=True)
            reviews.append({'Review': review_text})

        return reviews

    else:
        print("Failed to retrieve the webpage.")
        return None

# Example IMDB movie URL
movie_url = "https://www.imdb.com/title/tt0111161/reviews"

# Scrape the user reviews
reviews_data = scrape_imdb_reviews(movie_url)

if reviews_data:
    # Save the data to a CSV file
    df = pd.DataFrame(reviews_data)
    df.to_csv('imdb_movie_reviews.csv', index=False)
    print("Data saved to imdb_movie_reviews.csv")

Data saved to imdb_movie_reviews.csv
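
The function above fetches a single page, while the task asks for up to 10000 reviews. Below is a minimal sketch of a collection loop that keeps requesting pages until a target count is reached. It assumes a hypothetical `fetch_page(page_number)` callable that returns the review texts of one page and an empty list when no pages remain. How IMDB exposes further pages is not covered here; the stand-in `fake_fetch` only simulates it with made-up data.

```python
import csv
import time


def collect_reviews(fetch_page, target=10000, delay=0.0):
    """Call fetch_page(1), fetch_page(2), ... until `target` unique reviews
    are gathered or a page comes back empty. `fetch_page` is a hypothetical
    callable supplied by the caller; it is not part of any library."""
    seen = set()
    reviews = []
    page = 1
    while len(reviews) < target:
        batch = fetch_page(page)
        if not batch:              # no more pages
            break
        for text in batch:
            if text not in seen:   # skip duplicates across pages
                seen.add(text)
                reviews.append(text)
                if len(reviews) == target:
                    break
        page += 1
        time.sleep(delay)          # pause between requests to stay polite
    return reviews


def save_reviews(reviews, path):
    """Write one review per row under a 'Review' header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Review"])
        writer.writerows([text] for text in reviews)


# Stand-in for a real scraper: three pages of made-up reviews, then nothing.
def fake_fetch(page):
    pages = {1: ["great", "slow"], 2: ["slow", "moving"], 3: ["classic"]}
    return pages.get(page, [])


sample = collect_reviews(fake_fetch, target=4)
print(sample)
```

Passing the page fetcher in as an argument keeps the loop testable without network access; a real scraper would replace `fake_fetch` and use a non-zero `delay`.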


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column of the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write your code here

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

# Download NLTK data if not already downloaded
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once rather than on every call to clean_text
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean the text
def clean_text(text):
    # Treat missing values (e.g. NaN from an empty CSV cell) as empty text
    if not isinstance(text, str):
        return ''

    # Replace punctuation and special characters with spaces so that
    # words joined by punctuation (e.g. "good.But") are not merged
    text = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))

    # Remove numbers
    text = ''.join([char for char in text if not char.isdigit()])

    # Lowercase all text
    text = text.lower()

    # Tokenize the text
    words = text.split()

    # Remove stopwords
    words = [word for word in words if word not in stop_words]

    # Stemming
    words = [stemmer.stem(word) for word in words]

    # Lemmatization (applied to the stemmed tokens)
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

# Load the CSV file with the collected text data
df = pd.read_csv('imdb_movie_reviews.csv')  # Replace with your actual CSV file

# Clean the 'Review' column and store it in a new column 'Cleaned_Review'
df['Cleaned_Review'] = df['Review'].apply(clean_text)

# Save the data with cleaned text to a new CSV file
df.to_csv('imdb_movie_reviews_cleaned.csv', index=False)

print("Data with cleaned reviews saved to imdb_movie_reviews_cleaned.csv")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Data with cleaned reviews saved to imdb_movie_reviews_cleaned.csv
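
As a worked example of steps (1) to (4) on one review, here is a small sketch using only the standard library. The function name `basic_clean`, the sample sentence, and the tiny stopword set are illustrative assumptions. A real run would use the full stopword list linked above and add stemming and lemmatization.

```python
import re

# Tiny illustrative stopword sample (assumption); the real list is much longer.
SAMPLE_STOPWORDS = {"the", "a", "is", "it", "in", "of", "and", "i", "s", "t", "don"}


def basic_clean(text):
    """Apply steps (1)-(4): drop punctuation and digits, lowercase, drop stopwords."""
    text = text.lower()
    # Replace anything that is not an ASCII letter or whitespace with a space.
    # This removes punctuation, special characters, and numbers in one pass
    # (it also drops accented letters, which is a limitation of this sketch).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in SAMPLE_STOPWORDS]


example = "It's a 10/10 film; I don't regret watching it in 2023!"
print(basic_clean(example))  # ['film', 'regret', 'watching']
```

Because punctuation becomes a space, "don't" splits into "don" and "t", and both fragments are then dropped as stopwords. Note that this also discards the negation, which may matter for later sentiment analysis.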


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the cleaned text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities, such as person names, organizations, locations, product names, and dates, from the cleaned texts, and calculate the count of each entity.

In [3]:
# Write your code here

import spacy
from collections import Counter

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Function for syntax and structure analysis
def analyze_text(text):
    # (1) Parts of Speech (POS) Tagging and counting
    doc = nlp(text)
    pos_counts = Counter(token.pos_ for token in doc)

    # (2) Dependency parsing with spaCy: keep the root token of each sentence.
    # spaCy's built-in pipeline does not produce constituency parses.
    dependency_trees = [sent.root for sent in doc.sents]

    # (3) Named Entity Recognition (NER)
    named_entities = [(ent.text, ent.label_) for ent in doc.ents]

    return pos_counts, dependency_trees, named_entities

# Example text (three sentences) for analysis
example_sentence = "John works at Apple Inc. He lives in New York and loves Python programming. The meeting is scheduled for March 25, 2023."

# Analyze the example text
pos_counts, dependency_trees, named_entities = analyze_text(example_sentence)

# Print the results
print("Parts of Speech (POS) Counts:")
print(pos_counts)

print("\nDependency Parsing Trees:")
for tree in dependency_trees:
    print(tree)

print("\nNamed Entities:")
for entity, label in named_entities:
    print(f"{label}: {entity}")


Parts of Speech (POS) Counts:
Counter({'PROPN': 7, 'VERB': 4, 'ADP': 3, 'PUNCT': 3, 'NOUN': 2, 'NUM': 2, 'PRON': 1, 'CCONJ': 1, 'DET': 1, 'AUX': 1})

Dependency Parsing Trees:
works
lives
scheduled

Named Entities:
PERSON: John
ORG: Apple Inc.
GPE: New York
DATE: March 25, 2023
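
Question 3 (1) asks for totals of nouns, verbs, adjectives, and adverbs, while the `Counter` above reports every tag separately. One way to collapse it is sketched below. It assumes proper nouns count as nouns and auxiliaries are left out of the verb total; both are choices of this sketch, and `POS_GROUPS` and `summarize_pos` are hypothetical names.

```python
from collections import Counter

# Assumed grouping of Universal POS tags into the four requested classes.
# Counting proper nouns (PROPN) as nouns and leaving auxiliaries (AUX) out
# of the verb total are choices of this sketch, not requirements.
POS_GROUPS = {
    "Noun": ("NOUN", "PROPN"),
    "Verb": ("VERB",),
    "Adjective": ("ADJ",),
    "Adverb": ("ADV",),
}


def summarize_pos(pos_counts):
    """Collapse a tag -> count mapping into Noun/Verb/Adjective/Adverb totals."""
    return {name: sum(pos_counts.get(tag, 0) for tag in tags)
            for name, tags in POS_GROUPS.items()}


# Counts copied from the output printed above.
counts = Counter({'PROPN': 7, 'VERB': 4, 'ADP': 3, 'PUNCT': 3, 'NOUN': 2,
                  'NUM': 2, 'PRON': 1, 'CCONJ': 1, 'DET': 1, 'AUX': 1})
print(summarize_pos(counts))  # {'Noun': 9, 'Verb': 4, 'Adjective': 0, 'Adverb': 0}
```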


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

**Constituency Parsing Tree:**

Constituency parsing, also known as phrase structure parsing, is a natural language processing technique that aims to analyze the grammatical structure of a sentence. It involves breaking down a sentence into its constituent phrases or subparts. The primary elements in constituency parsing are:

**Root Node:** At the top of the tree is the root node, which represents the entire sentence.

**Non-Terminal Nodes:** These nodes represent phrases or subparts of the sentence, such as noun phrases (NP) or verb phrases (VP). Non-terminal nodes have children that are other non-terminal nodes or terminal nodes (words).

**Terminal Nodes:** These nodes represent individual words in the sentence.

Constituency parsing provides a hierarchical representation of the sentence's grammatical structure. For example, consider the sentence "The quick brown fox jumps over the lazy dog." In a constituency parsing tree, you might have a structure like:

(S
  (NP (DT The) (JJ quick) (JJ brown) (NN fox))
  (VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))




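In this tree, "S" represents the entire sentence, "NP" marks a noun phrase, "VP" a verb phrase, and "PP" a prepositional phrase. The innermost labels (DT, JJ, NN, VBZ, IN) are the part-of-speech tags of the individual words. The nesting shows how the sentence is organized into phrases and how those phrases contain one another: the PP "over the lazy dog" sits inside the VP and itself contains the NP "the lazy dog".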
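
To make the nesting concrete, the bracketed tree above can be read into a nested structure and its phrases listed. This is a minimal standard-library sketch; `parse_tree`, `leaves`, and `phrases` are hypothetical helper names, and the parser assumes well-formed input with balanced parentheses.

```python
import re


def parse_tree(s):
    """Parse a bracketed tree string into nested (label, children) tuples.
    Leaves (words) are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok                 # a word (terminal node)
        label = tokens[pos]            # the label follows the opening bracket
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                       # consume ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def phrases(node, wanted=("NP", "VP", "PP")):
    """Yield (label, text) for every phrase-level constituent, top-down."""
    if isinstance(node, str):
        return
    label, children = node
    if label in wanted:
        yield label, " ".join(leaves(node))
    for child in children:
        yield from phrases(child, wanted)


tree = parse_tree("""
(S
  (NP (DT The) (JJ quick) (JJ brown) (NN fox))
  (VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))
""")
for label, text in phrases(tree):
    print(label, "->", text)
```

The loop prints each phrase with the words it spans, so the NP "the lazy dog" appears after the PP and VP that contain it.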

**Dependency Parsing Tree:**

Dependency parsing is another method for analyzing the grammatical structure of a sentence, but it focuses on the relationships between words in a sentence. In a dependency parsing tree:

(1) Each word in the sentence is a node in the tree.

(2) The relationships between words are represented by directed edges between the nodes.

(3) One word is typically the root of the tree, and all other words depend on it directly or indirectly.

For example, in the sentence "She eats pizza," the dependency parsing tree might look like this:

eats ─► She

eats ─► pizza

In this tree, "eats" is the root, and "She" and "pizza" depend on "eats" as its subject and object, respectively.
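
The same tree can be written down as data: each word stores the position of its head and the name of the relation. This is a minimal sketch; the labels `nsubj` and `dobj` are common dependency labels added here for illustration, and `find_root` and `edges` are hypothetical helper names.

```python
# Each token: (word, index of its head, relation label). A head of -1 marks
# the root. The labels nsubj and dobj are added for illustration.
sentence = [
    ("She", 1, "nsubj"),    # subject of "eats"
    ("eats", -1, "ROOT"),
    ("pizza", 1, "dobj"),   # direct object of "eats"
]


def find_root(tokens):
    """Return the single word whose head index is -1."""
    roots = [word for word, head, _ in tokens if head == -1]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


def edges(tokens):
    """List (head word, relation, dependent word) for every non-root token."""
    return [(tokens[head][0], rel, word)
            for word, head, rel in tokens if head != -1]


print("root:", find_root(sentence))
for head, rel, dep in edges(sentence):
    print(f"{head} --{rel}--> {dep}")
```

Unlike the constituency tree, there are no phrase nodes here: every node is a word, and the structure is carried entirely by the labeled edges.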

Dependency parsing is particularly useful for understanding the grammatical relationships in a sentence, such as subject-verb-object relationships, and it's often used in tasks like information extraction and named entity recognition.

Both constituency parsing and dependency parsing provide valuable insights into the grammatical structure of sentences, and the choice between them depends on the specific needs of a natural language processing task.