
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform a syntactic analysis of it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of the product [2019 Dell laptop](https://www.amazon.com/Dell-Inspiron-5000-5570-Laptop/dp/B07N49F51N/ref=sr_1_11?crid=1IJ7UWF2F4GHH&keywords=dell%2Bxps%2B15&qid=1580173569&sprefix=dell%2Caps%2C181&sr=8-11&th=1) on Amazon.

(2) Collect the top 100 User Reviews of the film [Joker](https://www.imdb.com/title/tt7286456/reviews?ref_=tt_urv) from IMDB.

(3) Collect the abstracts of the top 100 research papers by using the query [natural language processing](https://citeseerx.ist.psu.edu/search?q=natural+language+processing&submit.x=0&submit.y=0&sort=rlv&t=doc) from CiteSeerX.

(4) Collect the top 100 tweets using the hashtag ["#CovidVaccine"](https://twitter.com/hashtag/CovidVaccine) from Twitter.


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [28]:
abstract_text = []
# Base search URL; the result offset is appended after `start=` for each page
link = 'https://citeseerx.ist.psu.edu/search;jsessionid=87FF6C66EA09F22314C131669600CF98?q=natural+language+processing&t=doc&sort=rlv&start='
# Request 10 result pages at offsets 0, 10, ..., 90
for start in range(0, 100, 10):
  page = requests.get(link + str(start))
  abstracts = BeautifulSoup(page.text, 'html.parser').find_all(class_='pubabstract')
  # Use a distinct loop variable so the page loop variable is not shadowed
  for abstract in abstracts:
    abstract_text.append(abstract.text.replace('\n', '').strip())
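The paging in the cell above can also be written without the hard-coded session id. The following is only a sketch under assumptions: the parameter names (`q`, `t`, `sort`, `start`) are copied from the link used above, the helper name `build_page_urls` is hypothetical, 10 results per page is assumed from the loop above, and the site may respond differently when no session id is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://citeseerx.ist.psu.edu/search"

def build_page_urls(query, total=100, page_size=10):
    """Return one search URL per result page (hypothetical helper)."""
    urls = []
    for start in range(0, total, page_size):
        # Parameter names are taken from the link in the notebook cell
        params = {"q": query, "t": "doc", "sort": "rlv", "start": start}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls

urls = build_page_urls("natural language processing")
print(len(urls))
print(urls[0])
```

Each URL could then be passed to `requests.get` in place of the string concatenation.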

In [37]:
df = pd.DataFrame((abstract_text), columns =['Abstract'])
df.to_csv('abstracts.csv')
df

Unnamed: 0,Abstract
0,Abstract not found
1,describe a method for statistical modeling bas...
2,Scaling conditional random fields for natural ...
3,The paper addresses the issue of cooperation b...
4,In most natural language processing applicatio...
...,...
95,This paper presents a workbench built by Pribe...
96,Abstract—Natural Language Processing (NLP) is ...
97,"ABSTRACT: After twenty years of disfavor, a te..."
98,Text statistics are frequently used in stylome...


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column of the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming. 

(6) Lemmatization.
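As a rough illustration of steps (1) to (4) on a single string, the sketch below uses only the standard library. The function name `clean_text` and the tiny stopword set are assumptions made for the example; a real run would load the full stopword list linked above. Stemming and lemmatization are left out because they need NLTK or TextBlob.

```python
import re

# Tiny illustrative stopword set (assumption); a real run would use the full list
STOPWORDS = {"the", "a", "of", "in", "is", "for", "and", "we"}

def clean_text(text):
    """Apply noise removal, number removal, stopword removal, and lowercasing."""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # (1) special characters and punctuation
    text = re.sub(r"\d+", " ", text)             # (2) numbers
    # (3) stopwords, compared case-insensitively so "We" and "The" are also dropped
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(t.lower() for t in tokens)   # (4) lowercase

cleaned = clean_text("We present 3 methods for NLP, in the year 2020!")
print(cleaned)
```

Comparing the lowercased token against the stopword set is a design choice here: it lets stopword removal run before lowercasing, in the order the steps are listed, without missing capitalized stopwords.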

In [53]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
import re
from nltk.stem import PorterStemmer
from textblob import TextBlob
from textblob import Word
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [55]:
# Special characters removal: replace every run of non-alphanumeric characters with a space
df['After noise removal'] = df['Abstract'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]+", ' ', x))
# Punctuation removal (regex=True is needed for pattern matching in recent pandas versions)
df['Punctuation removal'] = df['After noise removal'].str.replace(r'[^\w\s]', '', regex=True)
# Remove numbers
df['Remove numbers'] = df['Punctuation removal'].str.replace(r'\d+', '', regex=True)
# Stopwords removal (the NLTK list is lowercase, so capitalized words such as "The" are kept here)
stop_word = stopwords.words('english')
df['Stopwords removal'] = df['Remove numbers'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_word))
# Lower casing
df['Lower casing'] = df['Stopwords removal'].str.lower()
# Stemming: split into words first; iterating over the string itself would stem single characters
st = PorterStemmer()
df['Stemming'] = df['Lower casing'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
# Lemmatization
df['Lemmatization'] = df['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the cleaned text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the cleaned texts, and calculate the count of each entity.

In [75]:
import spacy
nltk.download('punkt')
from spacy import displacy
from nltk import word_tokenize, pos_tag, pos_tag_sents
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [76]:
# POS Tagging
from collections import Counter
parts_of_speech = []
tag_counts = []
texts = df['Stopwords removal'].tolist()
# Tag every token of every abstract with its Penn Treebank POS tag
for text in texts:
  parts_of_speech.append(nltk.pos_tag(word_tokenize(text)))
print(parts_of_speech)
# Count how often each tag occurs in each abstract
for tagged in parts_of_speech:
  tag_counts.append(Counter(tag for word, tag in tagged))
print(tag_counts)

[[('Abstract', 'NNP'), ('found', 'VBD')], [('describe', 'NN'), ('method', 'NN'), ('statistical', 'JJ'), ('modeling', 'NN'), ('based', 'VBN'), ('maximum', 'JJ'), ('entropy', 'NN'), ('We', 'PRP'), ('present', 'JJ'), ('maximum', 'JJ'), ('likelihood', 'NN'), ('approach', 'NN'), ('automatically', 'RB'), ('constructing', 'VBG'), ('maximum', 'JJ'), ('entropy', 'JJ'), ('models', 'NNS'), ('describe', 'VBP'), ('implement', 'JJ'), ('approach', 'NN'), ('efficiently', 'RB'), ('using', 'VBG'), ('examples', 'NNS'), ('several', 'JJ'), ('problems', 'NNS'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN')], [('Scaling', 'VBG'), ('conditional', 'JJ'), ('random', 'NN'), ('fields', 'NNS'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('Terms', 'NNS'), ('Conditions', 'NNPS'), ('Terms', 'NNS'), ('Conditions', 'NNPS'), ('Copyright', 'NNP'), ('works', 'NNS'), ('deposited', 'VBD'), ('Minerva', 'NNP'), ('Access', 'NNP'), ('retained', 'VBD')], [('The', 'DT'), ('paper', 'NN'), ('addresse
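The per-abstract tag counts above use fine-grained Penn Treebank tags, while the question asks for totals of nouns, verbs, adjectives, and adverbs. One possible way to aggregate them is sketched below. The helper name `coarse_pos_totals` and the prefix mapping are assumptions made for illustration, and tags outside the four prefixes (for example `WRB`) are ignored. The sample is a small excerpt of the tagged output shown above.

```python
from collections import Counter

# Map Penn Treebank tag prefixes to the four coarse classes asked for
COARSE = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def coarse_pos_totals(tagged_docs):
    """Sum noun/verb/adjective/adverb counts over all tagged documents."""
    totals = Counter()
    for doc in tagged_docs:
        for _word, tag in doc:
            label = COARSE.get(tag[:2])  # e.g. NNS -> NN -> Noun
            if label:
                totals[label] += 1
    return totals

# Excerpt of the (word, tag) pairs printed above
sample = [
    [("Abstract", "NNP"), ("found", "VBD")],
    [("approach", "NN"), ("automatically", "RB"), ("constructing", "VBG"),
     ("maximum", "JJ"), ("entropy", "JJ"), ("models", "NNS")],
]
totals = coarse_pos_totals(sample)
print(totals)
```

In the notebook, the same function could be called on `parts_of_speech` instead of `sample`.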

In [77]:
pip install benepar



In [67]:
%tensorflow_version 1.x
import benepar
benepar.download('benepar_en2')

TensorFlow 1.x selected.
[nltk_data] Downloading package benepar_en2 to /root/nltk_data...


True

In [70]:
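from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en')
# Add the benepar constituency parser to the spaCy pipeline
nlp.add_pipe(BeneparComponent("benepar_en2"))
for text in texts:
  doc = nlp(text)
  # Print the constituency parse of the first sentence of each abstract
  sent = list(doc.sents)[0]
  print(sent._.parse_string)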

(NP (NP (NN Abstract)) (VBN found))
(S (VP (VB describe) (NP (NP (NN method) (JJ statistical) (NN modeling)) (VP (VBN based)))))
(S (S (VP (VBG Scaling) (NP (JJ conditional) (JJ random) (NNS fields)))) (NP (JJ natural) (NN language) (NNP processing) (NNS Terms) (NP (NNS Conditions) (NP (NNS Terms)))))
(S (NP (DT The) (NN paper)) (VP (VBZ addresses) (NP (NP (NNP issue) (NNP cooperation) (NNP linguistics)) (NP (NNP natural) (NNP language) (NNP processing) (NNP NLP) (NNP general) (NNP linguistics) (NNP machine) (NNP translation) (NNP MT) (NNP particular)))))
(S (PP (IN In) (NP (JJ natural) (NN language) (VBG processing) (NNS applications))) (NP (NP (NP (NN Description) (NNS Logics)) (VBN used) (VBP encode) (NP (NN knowledge) (NN base) (JJ syntactic) (JJ semantic) (JJ pragmatic) (NNS elements))) (VP (VP (VBN needed) (S (VP (VB drive) (NP (JJ semantic) (NN interpretation))))) (NP (NP (JJ natural) (NN language) (NN generation) (NNS processes)) (ADVP (ADVP (RBR More) (RB recently)) (NP (NN De

In [72]:
# Render the dependency tree of each abstract
for text in texts:
  displacy.render(nlp(text), style = 'dep', jupyter = True, options={'distance': 150})

In [74]:
# Print the named entities (text, label) found in each abstract
for text in texts:
  print([(ent.text, ent.label_) for ent in nlp(text).ents])

[]
[]
[('Minerva Access', 'PERSON')]
[('NLP', 'ORG'), ('one', 'CARDINAL'), ('NLP', 'ORG')]
[('Description Logics', 'PERSON')]
[('algorithm', 'ORG')]
[]
[]
[]
[('ABSTRACT Ambiguity', 'ORG'), ('one', 'CARDINAL'), ('one', 'CARDINAL')]
[('Introduction Statistical', 'ORG'), ('SNLP', 'PRODUCT')]
[]
[('NLP', 'ORG')]
[('NLP', 'ORG')]
[]
[]
[]
[('Schank', 'ORG'), ('Kass Leake', 'PERSON')]
[('NLP', 'ORG'), ('NLP', 'ORG'), ('Target', 'ORG'), ('NLP', 'ORG')]
[('MARIE', 'ORG'), ('Descriptive', 'ORG')]
[('NLP', 'ORG'), ('Motivation', 'PERSON')]
[('NLP', 'ORG')]
[('Natural Language Processing NLP', 'LAW')]
[('one', 'CARDINAL')]
[]
[]
[('NLP', 'ORG'), ('Kolmogorov', 'PERSON')]
[]
[]
[('NLP', 'ORG'), ('Applications NLP', 'PERSON'), ('NLP', 'ORG')]
[]
[('Booth Brandwood Cleave', 'GPE'), ('four decades', 'DATE')]
[('MFCC', 'ORG')]
[]
[]
[('us', 'GPE')]
[]
[('recent years', 'DATE'), ('Data Mining Information', 'ORG')]
[('NLP', 'ORG'), ('NLP', 'ORG'), ('WASPS', 'ORG'), ('Thesaurus', 'ORG'), ('NLP', 'ORG')]
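The output above lists entities per abstract, but the question also asks for the count of each entity. A minimal sketch of that counting step follows, using a few of the `(text, label)` pairs printed above as sample input. The helper name `count_entities` is hypothetical. In the notebook, the input could be built from `nlp(text).ents` for every abstract.

```python
from collections import Counter

def count_entities(docs_ents):
    """Count entity labels and individual (text, label) entities across documents."""
    label_counts = Counter()
    entity_counts = Counter()
    for ents in docs_ents:
        for text, label in ents:
            label_counts[label] += 1
            entity_counts[(text, label)] += 1
    return label_counts, entity_counts

# A few per-abstract entity lists copied from the output above
sample = [
    [],
    [("Minerva Access", "PERSON")],
    [("NLP", "ORG"), ("one", "CARDINAL"), ("NLP", "ORG")],
    [("Booth Brandwood Cleave", "GPE"), ("four decades", "DATE")],
]
label_counts, entity_counts = count_entities(sample)
print(label_counts)
print(entity_counts.most_common(3))
```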

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):** 

In [None]:
'''
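Constituency Parsing:
Constituency parsing breaks a sentence into nested sub-phrases (constituents), such as noun phrases (NP) and verb phrases (VP).
The tree is built by applying phrase-structure grammar rules until the whole sentence is covered by a single root node (S).
Example: in "(S (NP (DT The) (NN paper)) (VP (VBZ addresses) ...))", "The paper" forms the subject NP and "addresses ..." forms the VP.

Dependency Parsing:
Dependency parsing represents a sentence's grammatical structure as directed relations between individual words.
Each word depends on exactly one head word (the main verb is usually the root), so, unlike constituency parsing, no sub-phrase nodes are used.
Example: in "The paper addresses ...", "addresses" is the root, "paper" is its subject, and "The" is the determiner of "paper".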

'''

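To make the notion of constituents more concrete, the sketch below reads one of the bracketed parse strings printed earlier and lists every phrase-level node with the words it spans. The small parser and the names `parse_tree`, `leaves`, and `constituents` are hypothetical helpers written for this example and are not part of benepar or spaCy; `nltk.Tree.fromstring` offers similar functionality.

```python
import re

def parse_tree(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse()

def leaves(node):
    """Return the words covered by a node, left to right."""
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def constituents(node):
    """Yield (label, text) for every node that contains sub-nodes, in pre-order."""
    label, children = node
    if any(isinstance(c, tuple) for c in children):
        yield label, " ".join(leaves(node))
        for c in children:
            if isinstance(c, tuple):
                yield from constituents(c)

# Second parse string from the constituency parsing output above
parse = "(S (VP (VB describe) (NP (NP (NN method) (JJ statistical) (NN modeling)) (VP (VBN based)))))"
result = list(constituents(parse_tree(parse)))
for label, text in result:
    print(label, "->", text)
```

Each printed line is one constituent: larger phrases such as the outer NP contain smaller ones, which is the nesting that a dependency tree does not represent.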