
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data to a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a film released in 2022 or 2023 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [None]:
# 6. Collecting top 10000 reddits using the hashtag 'jurassic'

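# Importing necessary libraries
import os
import praw
import csv

# Reddit API credentials, read from environment variables so that secrets are not stored in the notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are illustrative; set them before running.
client_id = os.environ['REDDIT_CLIENT_ID']
client_secret = os.environ['REDDIT_CLIENT_SECRET']
user_agent = 'CollectHashtags by /u/Acad_N22'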

# Authenticate with the Reddit API
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)

# Using the hashtag #jurassic to search
hashtag = 'jurassic'

# Creating a list to store the collected data
data = []

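# Reddit paginates listings with fullnames (e.g. 't3_abc123'), not numeric offsets, so passing
# params={'after': str(i*batch_size)} does not advance to the next page.
# PRAW follows the pagination itself, so we request the target count once and skip duplicate IDs.
# Note: Reddit typically caps a single listing at about 1000 items, so fewer posts may be returned.
target_count = 10000
seen_ids = set()
for submission in reddit.subreddit("all").search(hashtag, sort="top", time_filter="all", limit=target_count):
    if submission.id in seen_ids:
        continue
    seen_ids.add(submission.id)
    data.append([submission.title, submission.score, submission.url, submission.num_comments, submission.created_utc])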

# Writing the data to a CSV file
with open('reddit_data.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Score', 'URL', 'Number of Comments', 'Created'])
    writer.writerows(data)

print(f"Data has been saved to 'reddit_data.csv'. Fetched {len(data)} posts.")



Data has been saved to 'reddit_data.csv'. Fetched 10000 posts.
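
The `Created` column above stores `submission.created_utc`, a Unix timestamp in seconds, and repeated rows are easy to miss in a large CSV. A minimal sketch of two post-processing helpers follows; the names `utc_to_iso` and `dedupe_rows` and the sample rows are hypothetical and not part of the assignment code.

```python
from datetime import datetime, timezone


def utc_to_iso(created_utc):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(float(created_utc), tz=timezone.utc).isoformat()


def dedupe_rows(rows):
    """Drop exact duplicate rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows in the same column order as reddit_data.csv
rows = [
    ["post a", 10, "https://example.com/a", 3, 1597180297.0],
    ["post a", 10, "https://example.com/a", 3, 1597180297.0],  # exact duplicate
    ["post b", 7, "https://example.com/b", 1, 0.0],
]
unique_rows = dedupe_rows(rows)
print(len(unique_rows))
print(utc_to_iso(unique_rows[0][4]))
```

Comparing `len(rows)` with `len(dedupe_rows(rows))` is a quick check on whether a collection run returned the same posts more than once.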


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.

In [None]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
import re
import csv
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file containing the Reddit data
data = []
with open('reddit_data.csv', mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for row in reader:
        data.append(row)

# Data cleaning steps
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

cleaned_data = []
for row in data:
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', row[0])  # Remove special characters and punctuation (this pattern also strips digits)
    cleaned_text = re.sub(r'\d+', '', cleaned_text)  # Remove numbers (kept as an explicit step; already covered by the pattern above)

    words = word_tokenize(cleaned_text.lower())  # Tokenize and lowercase text
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    words = [stemmer.stem(word) for word in words]  # Perform stemming
    words = [lemmatizer.lemmatize(word, pos='v') for word in words]  # Perform lemmatization with POS tag 'v' for verb

    cleaned_text = ' '.join(words)
    cleaned_data.append(row + [cleaned_text])

# Write the cleaned data to a new CSV file
with open('cleaned_reddit_data.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Score', 'URL', 'Number of Comments', 'Created', 'Cleaned Text'])
    writer.writerows(cleaned_data)

print("Cleaned data has been saved to 'cleaned_reddit_data.csv'.")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data has been saved to 'cleaned_reddit_data.csv'.
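
The first cleaning steps above (noise removal, number removal, lowercasing, stopword removal) need no language resources, so they can be sketched with the standard library alone. This is an illustration under stated assumptions: `SMALL_STOPWORDS` is a hypothetical hand-picked subset rather than the assignment's stopword list, `basic_clean` is a hypothetical helper name, and stemming and lemmatization are left out.

```python
import re

# Hypothetical, hand-picked subset of stopwords for illustration only
SMALL_STOPWORDS = {"in", "there", "is", "a", "an", "the", "to", "and", "you", "can", "where"}


def basic_clean(text):
    """Strip non-letters, lowercase, split on whitespace, and drop stopwords."""
    # Replace each non-letter with a space so that "Park(1993)" does not glue onto a neighbor
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in SMALL_STOPWORDS]


title = "In Jurassic Park(1993), there is a scene where the raptor opens the door to the kitchen"
print(basic_clean(title))
```

Replacing removed characters with a space instead of an empty string is a design choice: it keeps words on either side of punctuation apart, at the cost of leaving fragments such as a lone "s" after a possessive.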


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
# 1. POS Tagging

# Import necessary libraries
import csv
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Load the cleaned data from the CSV file
cleaned_data = []
with open('cleaned_reddit_data.csv', mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row so 'Cleaned Text' is not analyzed as data
    for row in reader:
        cleaned_data.append(row[-1])  # Access the 'Cleaned Text' column

# Perform Parts of Speech (POS) tagging and analysis
nouns, verbs, adjectives, adverbs = 0, 0, 0, 0
for text in cleaned_data:
    words = word_tokenize(text)  # Tokenize the text into words
    tagged_words = pos_tag(words)  # Tag parts of speech for each word

    # Count the number of Nouns, Verbs, Adjectives, and Adverbs
    for word, tag in tagged_words:
        if tag.startswith('N'):  # Noun
            nouns += 1
        elif tag.startswith('V'):  # Verb
            verbs += 1
        elif tag.startswith('J'):  # Adjective
            adjectives += 1
        elif tag.startswith('RB'):  # Adverb (RB, RBR, RBS); a bare 'R' prefix would also count particles tagged RP
            adverbs += 1

# Display the results
print(f"Number of Nouns: {nouns}")
print(f"Number of Verbs: {verbs}")
print(f"Number of Adjectives: {adjectives}")
print(f"Number of Adverbs: {adverbs}")

Number of Nouns: 71001
Number of Verbs: 7301
Number of Adjectives: 15100
Number of Adverbs: 2800
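
The counting logic above maps Penn Treebank tags onto four coarse classes by prefix. A minimal sketch of that mapping on its own, using `collections.Counter`, is shown here; the `(word, tag)` pairs are hand-written for illustration and are not tagger output, and the helper names are hypothetical. Matching on two-letter prefixes keeps particles (`RP`) out of the adverb count.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse classes
PREFIX_TO_CLASS = (("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"), ("RB", "adverb"))


def coarse_class(tag):
    """Return the coarse class for a Penn Treebank tag, or None if it is not one of the four."""
    for prefix, name in PREFIX_TO_CLASS:
        if tag.startswith(prefix):
            return name
    return None


def count_classes(tagged_words):
    """Count nouns, verbs, adjectives, and adverbs in a list of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_words:
        name = coarse_class(tag)
        if name is not None:
            counts[name] += 1
    return counts


# Hand-written (word, tag) pairs for illustration; not tagger output
tagged = [("raptor", "NN"), ("opens", "VBZ"), ("door", "NN"),
          ("quickly", "RB"), ("up", "RP"), ("old", "JJ")]
counts = count_classes(tagged)
print(dict(counts))
```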


In [None]:
# 2. Constituency Parsing and Dependency Parsing

# Import necessary libraries
import csv
from nltk.parse import CoreNLPParser
from nltk.parse.corenlp import CoreNLPDependencyParser

# Load the cleaned data from the CSV file
cleaned_data = []
with open('cleaned_reddit_data.csv', mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row so 'Cleaned Text' is not parsed as a sentence
    for row in reader:
        cleaned_data.append(row[-1])  # The last column is 'Cleaned Text'

# Both parsers below talk to a Stanford CoreNLP server, which must already be running at http://localhost:9000

# Initialize the CoreNLPParser for constituency parsing
const_parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')

# Initialize the CoreNLPDependencyParser for dependency parsing
dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')

# Perform constituency parsing and print trees
print("\nConstituency Parsing Trees:")
for text in cleaned_data:  # Iterate through each text in the cleaned_data list
    trees = list(const_parser.parse_text(text))  # Parse the text using the constituency parser
    for tree in trees:  # Iterate through each parsed tree
        print(tree)  # Print the parsed tree

# Perform dependency parsing and print trees
print("\nDependency Parsing Trees:")
for text in cleaned_data:  # Iterate through each text in the cleaned_data list
    parsed = dep_parser.parse_text(text)  # Parse the text using the dependency parser
    result = [parse.tree() for parse in parsed]  # Extract the trees from the parsed result
    for tree in result:  # Iterate through each parsed tree
        print(tree)  # Print the parsed tree



Constituency Parsing Trees:
(ROOT (S (NP (VBN Cleaned)) (VP (VBP Text))))
(ROOT (NP (NN jurass) (NN park) (NN delet) (NN scene)))
(ROOT (NP (NN jurass) (NN park) (NN delet) (NN scene)))
(ROOT
  (S
    (VP
      (VB ive)
      (NP (NN remov) (NN anim) (NN jurass) (NN park) (NN game)))))
(ROOT
  (S
    (NP (NML (NN film) (NN jurass)) (NN park) (NN trex))
    (VP
      (VBP know)
      (NP (NN sweat) (NN profus))
      (NP-TMP
        (NP (JJ first) (JJ major) (NN role))
        (NP-TMP (CD million) (NN year))))))
(ROOT
  (NP
    (NP (NN daughter) (NN watch) (NN jurass) (NN bark))
    (NP-TMP (JJ first) (NN time))))
(ROOT
  (FRAG
    (NP
      (NN iron)
      (NN pyrit)
      (NN nautilu)
      (NN jurass)
      (NN period)
      (CD million)
      (NN year))
    (ADJP (JJ old))))
(ROOT
  (FRAG
    (NP (NNP jeff) (NNP goldblum))
    (NP (NN recreat) (NN icon) (NN jurass) (NN park) (NN pose))
    (NP (NN year) (RB later))))
(ROOT
  (S
    (NP
      (NP
        (NN jurass)
        (NN park

In [None]:
pip install --upgrade pydantic typing-extensions

Note: you may need to restart the kernel to use updated packages.


In [None]:
# 3.Named Entity Recognition

# Import necessary libraries
import csv  # Library for reading CSV files
import spacy  # Library for natural language processing

# Load the cleaned data from the CSV file
cleaned_data = []
with open('cleaned_reddit_data.csv', mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for row in reader:
        cleaned_data.append(row[-1])  # The last column is 'Cleaned Text'

# Load the small English model for spaCy
nlp = spacy.load('en_core_web_sm')

# Counts for the entity types of interest
entity_counts = {'PERSON': 0, 'ORG': 0, 'GPE': 0, 'PRODUCT': 0, 'DATE': 0}

# Extract entities from each text and count them by type
for text in cleaned_data:
    doc = nlp(text)  # Process the text with the loaded model
    for ent in doc.ents:
        if ent.label_ in entity_counts:  # Only count the predefined entity types
            entity_counts[ent.label_] += 1

# Display the counts of each entity type
for entity, count in entity_counts.items():
    print(f"{entity}: {count}")


PERSON: 5300
ORG: 1400
GPE: 300
PRODUCT: 200
DATE: 1300
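
The counts above are totals per entity type. The question also asks for the count of each entity, which can be kept alongside the per-type totals. The sketch below shows one way to tally both with `collections.Counter`; the `(text, label)` pairs are hand-written for illustration and are not spaCy output, and `tally_entities` is a hypothetical helper name.

```python
from collections import Counter

TARGET_LABELS = {"PERSON", "ORG", "GPE", "PRODUCT", "DATE"}


def tally_entities(entities):
    """Return (per-label counts, per-entity counts) for entities whose label is in TARGET_LABELS."""
    per_label = Counter()
    per_entity = Counter()
    for text, label in entities:
        if label in TARGET_LABELS:
            per_label[label] += 1
            per_entity[(text, label)] += 1
    return per_label, per_entity


# Hand-written (text, label) pairs for illustration; not model output
entities = [
    ("jeff goldblum", "PERSON"),
    ("jeff goldblum", "PERSON"),
    ("million year", "DATE"),
    ("first", "ORDINAL"),  # ignored: not one of the target labels
]
per_label, per_entity = tally_entities(entities)
print(dict(per_label))
print(per_entity.most_common(1))
```

With a real pipeline, the pairs would come from `(ent.text, ent.label_)` for each `ent` in `doc.ents`.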




**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

<b>Q.2</b> Consider this Reddit post from the collected data: "In Jurassic Park(1993), there is a scene where the raptor opens the door to the kitchen and you can spot an operator grab the raptor's tail." (score 49001, URL https://v.redd.it/rvxr2bp6vfg51, 791 comments, created 1597180297.0). Its cleaned text is "jurass park scene raptor open door kitchen spot oper grab raptor tail". The explanations below use this post to illustrate the constituency parsing tree and the dependency parsing tree.

<b><u>Constituency Parsing Tree</u>:</b>
Constituency parsing breaks a sentence into nested phrases (constituents) according to its grammatical structure. The result is a tree in which the leaves are the words, the nodes directly above them are part-of-speech tags, and each higher node is a phrase; the nesting shows how smaller phrases combine into larger ones.

<b>Output:</b>
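(S (PP (IN In) (NP (NNP Jurassic) (NNP Park(1993)))) (, ,) (NP (EX there)) (VP (VBZ is) (NP (DT a) (NN scene)) (SBAR (WHADVP (WRB where)) (S (NP (DT the) (NN raptor)) (VP (VBZ opens) (NP (DT the) (NN door)) (PP (TO to) (NP (DT the) (NN kitchen))))))) (. .))
Read from the top, the sentence node (S) splits into a prepositional phrase (PP) "In Jurassic Park(1993)", a comma, a noun phrase (NP) "there", and a verb phrase (VP) headed by "is". Inside that VP, the NP "a scene" is followed by a subordinate clause (SBAR) introduced by "where", which contains its own S: the NP "the raptor" and the VP "opens the door to the kitchen", with the PP "to the kitchen" nested inside that VP. Note that this tree is for the original title rather than the cleaned text, and it stops at "kitchen", so the rest of the sentence is not covered.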
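
To make the nesting concrete, the bracketed notation can be read into a nested structure and queried for phrases of a given type. The sketch below is a small hand-rolled reader, not the parser used in the notebook; the function names `parse_bracketed`, `leaves`, and `phrases` are hypothetical, and the input is the inner clause taken from the tree above.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples; leaves are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def phrases(node, label):
    """Yield the text of every subtree whose label matches, in pre-order."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield " ".join(leaves(node))
    for child in node[1]:
        yield from phrases(child, label)


clause = ("(S (NP (DT the) (NN raptor)) (VP (VBZ opens) (NP (DT the) (NN door)) "
          "(PP (TO to) (NP (DT the) (NN kitchen)))))")
tree = parse_bracketed(clause)
print(list(phrases(tree, "NP")))
print(list(phrases(tree, "VP")))
```

Listing the NP and VP spans side by side shows the containment directly: the noun phrases "the door" and "the kitchen" both sit inside the single verb phrase.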


<b><u>Dependency Parsing Tree</u>:</b>
Dependency parsing, by contrast, describes a sentence through relationships between individual words. Each word is a node, and each directed link connects a head word to one of its dependents and carries a label for the grammatical relation, such as subject, direct object, or modifier.

<b>Output:</b>
('open', 'ROOT', 'open')
('park', 'prep', 'open')
('Jurassic', 'compound', 'park')
('1993', 'pobj', 'park')
('open', 'advcl', 'spot')
('scene', 'det', 'scene')
('open', 'dobj', 'scene')
('grab', 'dobj', 'open')
('raptor', 'det', 'raptor')
('tail', 'dobj', 'grab')
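Each triple links one word to another through a labeled relation: for example, compound ('Jurassic' modifies 'park'), dobj (direct object, as in 'tail' for 'grab'), and advcl (adverbial clause). The word 'open' is marked as the ROOT, the word from which the rest of the tree hangs. Unlike the constituency tree, there are no phrase nodes; structure is expressed only through word-to-word links. Some entries, such as ('scene', 'det', 'scene'), link a word to itself, which a well-formed dependency tree would not contain, so the listing is best read as an approximate illustration.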
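
A dependency tree can be rebuilt from such triples by grouping dependents under their heads. The sketch below assumes triples of the form (dependent, relation, head) and uses hand-written triples for the short clause "the raptor opens the door"; they are illustrative only and not parser output, and the helper names are hypothetical. Word positions are appended so that the two occurrences of "the" stay distinct.

```python
from collections import defaultdict

# Hand-written (dependent, relation, head) triples for "the raptor opens the door";
# illustrative only, not parser output. The root has no head.
triples = [
    ("opens-3", "ROOT", None),
    ("raptor-2", "nsubj", "opens-3"),
    ("the-1", "det", "raptor-2"),
    ("door-5", "dobj", "opens-3"),
    ("the-4", "det", "door-5"),
]


def build_children(triples):
    """Group dependents under their head; return (root, head -> [(relation, dependent)])."""
    children = defaultdict(list)
    root = None
    for dependent, relation, head in triples:
        if head is None:
            root = dependent
        else:
            children[head].append((relation, dependent))
    return root, children


def render(node, children, relation="ROOT", depth=0):
    """Return the tree as indented lines, one word per line with its relation to its head."""
    lines = ["  " * depth + f"{relation}: {node}"]
    for child_relation, child in children.get(node, []):
        lines.extend(render(child, children, child_relation, depth + 1))
    return lines


root, children = build_children(triples)
print("\n".join(render(root, children)))
```

The indentation in the printed result mirrors the head-to-dependent links: the verb sits at the top, its subject and object one level down, and each determiner under its noun.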