<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon.

(2) Collect the top 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 tweets by using a hashtag (you can use any hashtag) from Twitter. 


In [5]:
!pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
import urllib
from urllib.request import urlopen
import re
import pandas as pd

text = []
headers = {'User-Agent': 'Edge/110.0.1587.56'}
def ref_url(url):
    r = requests.get(url, headers = headers, proxies=urllib.request.getproxies())
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup


def rev_find(soup):
    reviews = soup.find_all('div', {'data-hook': 'review'})
    try:
        for item in reviews:
            review = {'review': item.find('span', {'data-hook': 'review-body'}).text.strip(),}
            text.append(review)
    except:
        pass

for page in range(1,150):
    soup = ref_url(f'https://www.amazon.in/New-Apple-iPhone-12-128GB/product-reviews/B08L5TNJHG/ref=cm_cr_arp_d_paging_btm_next_{page}?ie=UTF8&reviewerType=all_reviews&pageNumber={page}')
    rev_find(soup)
    if not soup.find('li', {'class': 'a-disabled a-last'}):
        pass
    else:
        break

ws = pd.DataFrame(text)
ws.to_csv('apple_reviews_1.csv', index=False)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming. 

(6) Lemmatization.

In [7]:
import pandas as pd
import numpy as np
import nltk, string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
csv_link = "/content/apple_reviews_1.csv"
ws = pd.read_csv(csv_link)
ws.head(5)

Unnamed: 0,review
0,Camera superb
1,Good product
2,camera quality is good
3,Battery not goodCamera is fine onlyvery very o...
4,Nice product


In [8]:
ws.review.dtype

dtype('O')

In [9]:
#Remove noise, such as special characters and punctuations
from string import punctuation
def remove_punc_mark(text):
    for punc_mark in string.punctuation:
        text = str(text).replace(punctuation, '')
    return text
ws['rev_punc_rem'] = ws['review'].apply(remove_punc_mark)

In [10]:
#2 Removing numbers
import re
ws['rev_num_rem'] = ws["rev_punc_rem"].replace('d+', '', regex=True)
ws.head()

Unnamed: 0,review,rev_punc_rem,rev_num_rem
0,Camera superb,Camera superb,Camera superb
1,Good product,Good product,Goo prouct
2,camera quality is good,camera quality is good,camera quality is goo
3,Battery not goodCamera is fine onlyvery very o...,Battery not goodCamera is fine onlyvery very o...,Battery not gooCamera is fine onlyvery very ok...
4,Nice product,Nice product,Nice prouct


In [13]:
#3 Removing stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stopws = stopwords.words('english')
ws['rev_stop_rem'] = ws['rev_num_rem'].apply(
    lambda x : ' '.join([w for w in str(x).split() if w in (stopws)]))
ws['rev_stop_rem']

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0                                                       
1                                                       
2                                                     is
3                                            not is very
4                                                       
                             ...                        
347                                    an but this is an
348                           my from to it was was very
349                                        just the with
350                            with an an the with as in
351    there is than in has you in a the you'll not i...
Name: rev_stop_rem, Length: 352, dtype: object

In [14]:
#4. Lowercasing the text
ws["rev_lower"] = ws["rev_stop_rem"].str.lower()
ws["rev_lower"] 

0                                                       
1                                                       
2                                                     is
3                                            not is very
4                                                       
                             ...                        
347                                    an but this is an
348                           my from to it was was very
349                                        just the with
350                            with an an the with as in
351    there is than in has you in a the you'll not i...
Name: rev_lower, Length: 352, dtype: object

In [15]:
#5. Stemming
stemmer = PorterStemmer()
ws['rev_stemmed'] = ws['rev_lower'].apply(lambda i: [stemmer.stem(j) for j in i.split()])
ws['rev_stemmed']

0                                                     []
1                                                     []
2                                                   [is]
3                                        [not, is, veri]
4                                                     []
                             ...                        
347                               [an, but, thi, is, an]
348                     [my, from, to, it, wa, wa, veri]
349                                    [just, the, with]
350                    [with, an, an, the, with, as, in]
351    [there, is, than, in, ha, you, in, a, the, you...
Name: rev_stemmed, Length: 352, dtype: object

In [19]:
#6. Lemmatization
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
lemmatization = WordNetLemmatizer()
ws['rev_lemmatized'] = ws['rev_lower'].apply(lambda i: [lemmatization.lemmatize(j) for j in i.split()])
ws['rev_lemmatized']

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0                                                     []
1                                                     []
2                                                   [is]
3                                        [not, is, very]
4                                                     []
                             ...                        
347                              [an, but, this, is, an]
348                     [my, from, to, it, wa, wa, very]
349                                    [just, the, with]
350                     [with, an, an, the, with, a, in]
351    [there, is, than, in, ha, you, in, a, the, you...
Name: rev_lemmatized, Length: 352, dtype: object

In [20]:
ws.head()

Unnamed: 0,review,rev_punc_rem,rev_num_rem,rev_stop_rem,rev_lower,rev_stemmed,rev_lemmatized
0,Camera superb,Camera superb,Camera superb,,,[],[]
1,Good product,Good product,Goo prouct,,,[],[]
2,camera quality is good,camera quality is good,camera quality is goo,is,is,[is],[is]
3,Battery not goodCamera is fine onlyvery very o...,Battery not goodCamera is fine onlyvery very o...,Battery not gooCamera is fine onlyvery very ok...,not is very,not is very,"[not, is, veri]","[not, is, very]"
4,Nice product,Nice product,Nice prouct,,,[],[]


# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes: 

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [21]:
ws.columns

Index(['review', 'rev_punc_rem', 'rev_num_rem', 'rev_stop_rem', 'rev_lower',
       'rev_stemmed', 'rev_lemmatized'],
      dtype='object')

In [23]:
#1. POS Tagging
import nltk
nltk.download('punkt')
from nltk import pos_tag
from nltk.tokenize import word_tokenize

ws['rev_tokenized'] = ws['rev_lower'].apply(nltk.word_tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [25]:
import nltk
nltk.download('averaged_perceptron_tagger')
ws['rev_tags'] = ws['rev_tokenized'].apply(nltk.pos_tag)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [26]:
ws['Noun'] = ws['rev_tags'].apply(lambda x: [word for word, tag in x if tag in ['NN']])
ws['Verb'] = ws['rev_tags'].apply(lambda x: [word for word, tag in x if tag in ['VB']])
ws['Adjective'] = ws['rev_tags'].apply(lambda x: [word for word, tag in x if tag in ['JJ']])
ws['Adverb'] = ws['rev_tags'].apply(lambda x: [word for word, tag in x if tag in ['RB']])

In [27]:
cols = ['Noun', 'Verb', 'Adjective', 'Adverb']

for col in cols:
    ws[str(col) + '_nos'] = ws[col].apply(len)

In [28]:
!pip install stanza
import stanza
stanza.download('en')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stanza
  Downloading stanza-1.4.2-py3-none-any.whl (691 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m691.3/691.3 KB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
Collecting emoji
  Downloading emoji-2.2.0.tar.gz (240 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m240.9/240.9 KB[0m [31m25.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... [?25l[?25hdone
  Created wheel for emoji: filename=emoji-2.2.0-py3-none-any.whl size=234926 sha256=2c5eeaf391d36a1c3f1fdc8e75def9f956beb97059bbcc39eb50304bd9666365
  Stored in directory: /root/.cache/pip/wheels/86/62/9e/a6b27a681abcde69970dbc0326ff51955f3beac72f15696984
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-2.2.0 stan

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


In [29]:
ws.iloc[[15],[0]].values

array([['Battery life is poor. No VoLTE calling. Excellent camera.']],
      dtype=object)

In [30]:
# Constituency Parsing and Dependency Parsing
pipeline = stanza.Pipeline('en') # initialize English neural pipeline
doc = pipeline("Camera is awesome")

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: constituency
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


In [31]:
for sen in doc.sentences:
    print(sen.constituency)

(ROOT (S (NP (NN Camera)) (VP (VBZ is) (ADJP (JJ awesome)))))


In [32]:
for sen in doc.sentences:
    print(sen.dependencies)

[({
  "id": 3,
  "text": "awesome",
  "lemma": "awesome",
  "upos": "ADJ",
  "xpos": "JJ",
  "feats": "Degree=Pos",
  "head": 0,
  "deprel": "root",
  "start_char": 10,
  "end_char": 17
}, 'nsubj', {
  "id": 1,
  "text": "Camera",
  "lemma": "camera",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 3,
  "deprel": "nsubj",
  "start_char": 0,
  "end_char": 6
}), ({
  "id": 3,
  "text": "awesome",
  "lemma": "awesome",
  "upos": "ADJ",
  "xpos": "JJ",
  "feats": "Degree=Pos",
  "head": 0,
  "deprel": "root",
  "start_char": 10,
  "end_char": 17
}, 'cop', {
  "id": 2,
  "text": "is",
  "lemma": "be",
  "upos": "AUX",
  "xpos": "VBZ",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
  "head": 3,
  "deprel": "cop",
  "start_char": 7,
  "end_char": 9
}), ({
  "id": 0,
  "text": "ROOT"
}, 'root', {
  "id": 3,
  "text": "awesome",
  "lemma": "awesome",
  "upos": "ADJ",
  "xpos": "JJ",
  "feats": "Degree=Pos",
  "head": 0,
  "deprel": "root",
  "start_cha

In [33]:
#Named Entity Recognition
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = pipeline("Camera is awesome")
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


token: Camera	ner: O
token: is	ner: O
token: awesome	ner: O


In [None]:
# The below mentioned are some referneces sources where I used in the code for better understanding

# 1.
#Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. 
# Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. 
# In Association for Computational Linguistics (ACL) System Demonstrations. 2020.

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):** 

constituency Parsing is the process of breaking down a sentence into its constituent pieces, or phrases, using formal grammar principles. It generates a hierarchical structure known as a parse tree or a syntax tree, with the topmost node representing the full sentence and the lower nodes representing the phrases that comprise the sentence. Each node in the parse tree represents one of the sentence's constituents or sub-constituents, such as a noun phrase, verb phrase, or clause. Constituency parsing aids in the identification of a sentence's syntactic structure and the extraction of significant phrases for use in natural language processing activities such as machine translation, sentiment analysis, and text production.

Dependency parsing, on the other hand, is concerned with the relationships that exist between words in a phrase. It depicts the sentence as a directed graph, with nodes representing words and edges representing grammatical relationships between them. The graph has a root node that represents the sentence's main verb, and all other nodes are connected to it via a series of directed edges.

In the above code here we are using a sentence as :Camera is Awesome " is used for analysing the constituency and depenedency parsing tree. The result is clear seen in the result.


