<a href="https://colab.research.google.com/github/Ashikagade333/Ashikagade_INFO5371_Fall2023/blob/main/INFO5731_Assignment_2_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a film released in 2022 or 2023 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [None]:
import pandas as pd

# Read the review data from the CSV file
data = pd.read_csv("appl_1_amazon_pc.csv")

# Add a sentiment column: 4- and 5-star reviews are Positive (1),
# all other ratings are Negative (0)
data['sentiment'] = data['star_rating'].isin([4, 5]).astype(int)

# Get unique values of the product title column
product_titles = data["product_title"].unique()

# Choose a particular product for analysis (for example, "Fire HD 7, 7" HD Display, Wi-Fi, 8 GB")
selected_product = 'Fire HD 7, 7" HD Display, Wi-Fi, 8 GB'
ashika = data.loc[data["product_title"] == selected_product]

# Print or analyze the selected product reviews as needed
print(ashika.head())


  marketplace  customer_id       review_id  product_id  product_parent  \
0          US     11555559  R1QXC7AHHJBQ3O  B00IKPX4GY         2693241   
1          US     31469372  R175VSRV6ZETOP  B00IKPYKWG         2693241   
2          US     26843895  R2HRFF78MWGY19  B00IKPW0UA         2693241   
3          US     19844868   R8Q39WPKYVSTX  B00LCHSHMS         2693241   
4          US      1189852  R3RL4C8YP2ZCJL  B00IKPZ5V6         2693241   

                           product_title product_category  star_rating  \
0  Fire HD 7, 7" HD Display, Wi-Fi, 8 GB               PC            5   
1  Fire HD 7, 7" HD Display, Wi-Fi, 8 GB               PC            3   
2  Fire HD 7, 7" HD Display, Wi-Fi, 8 GB               PC            5   
3  Fire HD 7, 7" HD Display, Wi-Fi, 8 GB               PC            4   
4  Fire HD 7, 7" HD Display, Wi-Fi, 8 GB               PC            5   

   helpful_votes  total_votes vine verified_purchase  \
0              0            0    N                 Y  

# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
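
Steps (1) to (4) can be illustrated on a single string before applying them to a whole column. The sketch below is only a minimal illustration using the standard library: the `STOPWORDS` set, the `clean_text` name, and the sample sentence are hypothetical and are not part of the assignment, which uses the full stopword list linked above.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the assignment
# uses the full list from the linked gist.
STOPWORDS = {"the", "is", "a", "and", "it", "i", "this", "for", "my"}

def clean_text(text: str) -> str:
    """Apply steps (1)-(4): lowercase, strip non-letters, drop stopwords."""
    # Lowercase first so stopword matching is case-insensitive
    text = text.lower()
    # Replace everything that is not a letter or whitespace with a space;
    # this removes punctuation, special characters, and digits in one pass
    text = re.sub(r"[^a-z\s]", " ", text)
    # Split on whitespace and keep only non-stopword tokens
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

sample = "I bought this tablet for $99, and it is GREAT!!!"
cleaned = clean_text(sample)
print(cleaned)  # bought tablet great
```

Replacing removed characters with a space, rather than deleting them, is one possible design choice: it keeps words separated when punctuation sits between them with no surrounding whitespace.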

In [None]:
import re

import nltk
import pandas as pd
import requests
from nltk.stem import PorterStemmer, WordNetLemmatizer

# WordNet data is required by WordNetLemmatizer
nltk.download('wordnet')
# Download the stopwords list from the given URL
stopwords_url = "https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords"
response = requests.get(stopwords_url)
stopwords_list = set(response.text.splitlines())

# Load the collected text data; the file is expected to contain a 'text' column
df = pd.read_csv("ashika.csv")

# Define functions for data cleaning
def ashika_clean_text(text):
    # Remove noise (special characters and punctuations)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove numbers
    text = re.sub(r'\d', '', text)

    # Remove stopwords
    text = ' '.join(word for word in text.split() if word.lower() not in stopwords_list)

    # Lowercase all texts
    text = text.lower()

    return text

def ashika_stem_text(text):
    # Stemming
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(word) for word in text.split())

def ashika_lemmatize_text(text):
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

# Apply cleaning functions to the 'text' column
df['cleaned_ashika_text'] = df['text'].apply(ashika_clean_text)
df['stemmed_ashika_text'] = df['cleaned_ashika_text'].apply(ashika_stem_text)
df['lemmatized_ashika_text'] = df['cleaned_ashika_text'].apply(ashika_lemmatize_text)

# Save the cleaned data to a new CSV file
df.to_csv('cleaned_data_ashika.csv', index=False)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
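
Part (1) asks for totals per coarse category, whereas a tagger such as NLTK's `pos_tag` reports fine-grained Penn Treebank tags (`NN`, `NNP`, `VBZ`, and so on). One possible way to collapse them is sketched below, assuming the usual Penn Treebank prefixes. The `collapse_pos_counts` helper and `COARSE_PREFIXES` mapping are hypothetical names introduced here; the tag counts are the ones this notebook reports for its example sentence.

```python
from collections import Counter

# Tag counts that the notebook's tagger reports for its example sentence
tag_counts = Counter({"NNP": 6, "NN": 2, "IN": 2, ",": 2, "VBZ": 1,
                      "DT": 1, "VBD": 1, "VBN": 1, ".": 1})

# Penn Treebank prefixes: NN* = nouns, VB* = verbs, JJ* = adjectives, RB* = adverbs
COARSE_PREFIXES = {"Noun": "NN", "Verb": "VB", "Adjective": "JJ", "Adverb": "RB"}

def collapse_pos_counts(counts):
    """Sum fine-grained tag counts into the four coarse categories."""
    totals = {name: 0 for name in COARSE_PREFIXES}
    for tag, n in counts.items():
        for name, prefix in COARSE_PREFIXES.items():
            if tag.startswith(prefix):
                totals[name] += n
    return totals

totals = collapse_pos_counts(tag_counts)
print(totals)  # {'Noun': 8, 'Verb': 3, 'Adjective': 0, 'Adverb': 0}
```

Tags outside these four families (prepositions, determiners, punctuation) are simply ignored by the mapping.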

In [None]:
from collections import Counter

import nltk
import spacy
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

# Download NLTK resources
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load spaCy model for dependency parsing and NER
nlp = spacy.load("en_core_web_sm")

# Example sentence for explanation with named entities
example_sentence_with_entities = "Apple Inc. is a technology company headquartered in Cupertino, California, founded by Steve Jobs."

# Function for Parts of Speech (POS) Tagging
def ashika_pos_tagging(text):
    # Tag each token, then count how often each Penn Treebank tag occurs
    pos_tags = pos_tag(word_tokenize(text))
    pos_counts = Counter(tag for _, tag in pos_tags)
    return pos_counts

# Function for Constituency Parsing and Dependency Parsing
def ashika_parse_syntax_structure(text):
    # Constituency Parsing: a hard-coded illustrative tree, not a parse of `text`
    # (the spaCy pipeline loaded above does not include a constituency parser)
    constituency_tree_string = "(S (NP (NNP Ashika)) (VP (VBZ is) (JJ happy)))"
    ashika_constituency_parsing_tree = Tree.fromstring(constituency_tree_string)

    # Dependency Parsing: one (word, relation, head) triple per token
    doc = nlp(text)
    ashika_dependency_parsing_tree = [(token.text, token.dep_, token.head.text) for token in doc]

    return ashika_constituency_parsing_tree, ashika_dependency_parsing_tree

# Function for Named Entity Recognition (NER)
def ashika_named_entity_recognition(text):
    # Collect (entity text, entity label) pairs, then count entities per label
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts = Counter(label for _, label in entities)
    return entities, entity_counts

# Example sentence for explanation
print("Example Sentence:")
print(example_sentence_with_entities)
print("\n")

# (1) Parts of Speech (POS) Tagging
ashika_pos_counts_example = ashika_pos_tagging(example_sentence_with_entities)
print("(1) Parts of Speech (POS) Tagging:")
print(ashika_pos_counts_example)
print("\n")

# (2) Constituency Parsing and Dependency Parsing
ashika_constituency_tree_example, ashika_dependency_tree_example = ashika_parse_syntax_structure(example_sentence_with_entities)
print("(2) Constituency Parsing Tree:")
print(ashika_constituency_tree_example)
print("\n")
print("(2) Dependency Parsing Tree:")
print(ashika_dependency_tree_example)
print("\n")

# (3) Named Entity Recognition (NER)
ashika_entities_example, ashika_entity_counts_example = ashika_named_entity_recognition(example_sentence_with_entities)
print("(3) Named Entity Recognition (NER):")
print(ashika_entities_example)
print(ashika_entity_counts_example)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Example Sentence:
Apple Inc. is a technology company headquartered in Cupertino, California, founded by Steve Jobs.


(1) Parts of Speech (POS) Tagging:
Counter({'NNP': 6, 'NN': 2, 'IN': 2, ',': 2, 'VBZ': 1, 'DT': 1, 'VBD': 1, 'VBN': 1, '.': 1})


(2) Constituency Parsing Tree:
(S (NP (NNP Ashika)) (VP (VBZ is) (JJ happy)))


(2) Dependency Parsing Tree:
[('Apple', 'compound', 'Inc.'), ('Inc.', 'nsubj', 'is'), ('is', 'ROOT', 'is'), ('a', 'det', 'company'), ('technology', 'compound', 'company'), ('company', 'attr', 'is'), ('headquartered', 'acl', 'company'), ('in', 'prep', 'headquartered'), ('Cupertino', 'pobj', 'in'), (',', 'punct', 'Cupertino'), ('California', 'appos', 'Cupertino'), (',', 'punct', 'Cupertino'), ('founded', 'acl', 'company'), ('by', 'agent', 'founded'), ('Steve', 'compound', 'Jobs'), ('Jobs', 'pobj', 'by'), ('.', 'punct', 'is')]


(3) Named Entity Recognition (NER):
[('Apple Inc.', 'ORG'), ('Cupertino', 'GPE'), ('California', 'GPE'), ('Steve Jobs', 'PERSON')]
Counter({

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

### Constituency Parsing Tree:

Constituency parsing involves dissecting a sentence into grammatical components, known as constituents. The constituency parsing tree showcases the hierarchical structure of a sentence, where each node corresponds to a grammatical unit. The highest node represents the sentence, and the branches delineate phrases and sub-phrases.


### Example Constituency Parsing Tree:
```
(S
  (NP (NNP Ashika))
  (VP (VBZ is)
    (JJ happy)))
```

Explanation:
- **S (Sentence):** The top-level node representing the entire sentence.
- **NP (Noun Phrase):** A noun together with its modifiers. Here it consists of the single proper noun (NNP) "Ashika."
- **VP (Verb Phrase):** A verb together with its arguments. Here the phrase "is happy" is the verb phrase.
  - **VBZ (Verb, third person singular present):** The verb "is."
  - **JJ (Adjective):** The adjective "happy."

This tree structure encapsulates the notion that "Ashika is happy" forms a sentence with a subject ("Ashika") and a predicate ("is happy").
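
The hierarchy can also be made explicit by listing every constituent together with the words it spans. The sketch below uses a small standard-library bracket parser as a stand-in for `nltk.tree.Tree`; `parse_brackets`, `leaves`, and `phrases` are hypothetical helper names introduced only for this illustration.

```python
def parse_brackets(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "(", "each constituent must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse()

def leaves(node):
    """Return the words covered by a node, left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def phrases(node):
    """Yield (label, covered text) for every constituent, top-down."""
    label, children = node
    yield label, " ".join(leaves(node))
    for child in children:
        if isinstance(child, tuple):
            yield from phrases(child)

tree = parse_brackets("(S (NP (NNP Ashika)) (VP (VBZ is) (JJ happy)))")
constituents = list(phrases(tree))
for label, words in constituents:
    print(f"{label:4} -> {words}")
```

Each printed row pairs a node label with its span, so `S` covers "Ashika is happy", `NP` covers "Ashika", and `VP` covers "is happy".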

### Dependency Parsing Tree:

Dependency parsing elucidates the grammatical relationships between words in a sentence, forming a tree structure where each word serves as a node, and edges indicate syntactic dependencies. This tree structure aids in identifying grammatical relationships and the overall sentence structure.

### Example Dependency Parsing Tree:
```
[('Apple', 'compound', 'Inc.'),
 ('Inc.', 'nsubj', 'is'),
 ('is', 'ROOT', 'is'),
 ('a', 'det', 'company'),
 ('technology', 'compound', 'company'),
 ('company', 'attr', 'is'),
 ('headquartered', 'acl', 'company'),
 ('in', 'prep', 'headquartered'),
 ('Cupertino', 'pobj', 'in'),
 (',', 'punct', 'Cupertino'),
 ('California', 'appos', 'Cupertino'),
 (',', 'punct', 'Cupertino'),
 ('founded', 'acl', 'company'),
 ('by', 'agent', 'founded'),
 ('Steve', 'compound', 'Jobs'),
 ('Jobs', 'pobj', 'by'),
 ('.', 'punct', 'is')]
```

Explanation:
- Each tuple in the list has the form (word, dependency_relation, head_word).
- **'Apple'** is a compound modifier (compound) of 'Inc.', forming "Apple Inc."
- **'Inc.'** is the nominal subject (nsubj) of the main verb 'is.'
- **'is'** is the root (ROOT) of the sentence; the root is listed with itself as its head.
- **'a'** is the determiner (det) of 'company.'
- **'technology'** is a compound modifier (compound) of 'company.'
- **'company'** is the attribute (attr) of 'is.'
- **'headquartered'** is a clausal modifier of a noun (acl) attached to 'company.'
- **'in'** is a prepositional modifier (prep) of 'headquartered', introducing the location.
- **'Cupertino'** is the object of the preposition (pobj) 'in.'
- **'California'** is an appositional modifier (appos) of 'Cupertino.'
- **'founded'** is a second clausal modifier (acl) attached to 'company.'
- **'by'** is the agent (agent) of 'founded', introducing who performed the action of the passive verb.
- **'Steve'** is a compound modifier (compound) of 'Jobs.'
- **'Jobs'** is the object of the preposition (pobj) 'by.'
- **','** and **'.'** are punctuation (punct): both commas attach to 'Cupertino', and the final period attaches to the root 'is.'
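
To see the flat list of triples as an actual tree, the words can be grouped by their head and printed with indentation. This is a minimal sketch that reuses the triples from the output above with punctuation omitted; the `children` map and the `render` helper are hypothetical names. Keying by word text works here only because every word in the example is unique; with real documents a token index (for example spaCy's `token.i`) would be a safer key.

```python
from collections import defaultdict

# (word, relation, head) triples from the dependency output; punctuation omitted
triples = [
    ("Apple", "compound", "Inc."), ("Inc.", "nsubj", "is"), ("is", "ROOT", "is"),
    ("a", "det", "company"), ("technology", "compound", "company"),
    ("company", "attr", "is"), ("headquartered", "acl", "company"),
    ("in", "prep", "headquartered"), ("Cupertino", "pobj", "in"),
    ("California", "appos", "Cupertino"), ("founded", "acl", "company"),
    ("by", "agent", "founded"), ("Steve", "compound", "Jobs"), ("Jobs", "pobj", "by"),
]

# Group dependents under their head; the ROOT token has no parent
children = defaultdict(list)
root = None
for word, rel, head in triples:
    if rel == "ROOT":
        root = word
    else:
        children[head].append((word, rel))

def render(word, rel="ROOT", depth=0):
    """Return indented lines for `word` and all of its dependents."""
    lines = [f"{'  ' * depth}{word} [{rel}]"]
    for child, child_rel in children[word]:
        lines.extend(render(child, child_rel, depth + 1))
    return lines

tree_lines = render(root)
print("\n".join(tree_lines))
```

The first line printed is the root `is [ROOT]`, with its subject `Inc.` and attribute `company` indented one level below it.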


