
## The fifth In-class-exercise (9/30/2020, 20 points in total)

In exercise-03, I asked you to collect 500 text samples based on your own information needs (if you didn't collect the textual data, you should recollect it for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 types of features, but necessarily more than 10 individual features) from the following feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (density can be calculated as the ratio of the following function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (i.e., pre-defined patterns)
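As an illustration of feature (1), here is a minimal pure-Python sketch of tf-idf weighting. It assumes whitespace tokenization, tf as a relative term frequency, and an unsmoothed idf of log(N / df); the function name `tfidf` and the three sample reviews are hypothetical and not part of the exercise data.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        rows.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return rows

# Hypothetical sample texts, for illustration only.
sample_reviews = [
    "great store great service",
    "terrible service",
    "great price",
]
weights = tfidf(sample_reviews)
```

Terms that appear in fewer documents get a larger weight: in the second sample, "terrible" outweighs "service" because "service" also occurs in the first sample. Library implementations such as scikit-learn's `TfidfVectorizer` apply a smoothed idf and L2 normalization by default, so their numbers will differ from this sketch.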

In [None]:
# import all the required libraries and packages
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [None]:
# base URL of the review pages that we want to scrape
base_link = 'https://www.trustpilot.com/review/www.bestbuy.com'
text = []
# loop over 26 pages; each page holds 20 reviews, so this yields at least 500 rows
for page in range(1, 27):
  # page 1 is the base URL; later pages add a single ?page=N query parameter
  scrape_link = base_link if page == 1 else base_link + '?page=' + str(page)
  # download the page and parse out the review bodies
  response = requests.get(scrape_link)
  soup = BeautifulSoup(response.text, 'html.parser')
  reviews = soup.find_all(class_='review-content__text')
  # strip surrounding whitespace and remove newline characters
  for review in reviews:
    removing_white_spaces = review.text.strip()
    text.append(removing_white_spaces.replace('\n', ''))
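When paginating, each page URL should be derived from an unchanged base so that query strings do not pile up from one iteration to the next. The sketch below shows one way to do this without touching the network; the helper name `page_urls` is hypothetical, and it assumes page 1 is served at the bare URL and later pages at `?page=N`.

```python
BASE_URL = 'https://www.trustpilot.com/review/www.bestbuy.com'

def page_urls(base_url, n_pages):
    """Return the URLs for pages 1..n_pages, each built from the unchanged base."""
    urls = []
    for page in range(1, n_pages + 1):
        # Page 1 is the bare URL; later pages add exactly one ?page= parameter.
        urls.append(base_url if page == 1 else f'{base_url}?page={page}')
    return urls

urls = page_urls(BASE_URL, 26)
```

Building the list up front also makes it easy to inspect the URLs before sending any requests.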

In [None]:
# creating a data frame from the list
reviews_df = pd.DataFrame((text), columns =['Best Buy Review'])
reviews_df.head(10)

Unnamed: 0,Best Buy Review
0,I wish I read the reviews first! I was recomme...
1,"Purchased two 55 "" Samsung picture tv during t..."
2,"Need more products of same kind , and Covid is..."
3,Bought a defective product from Best Buy in Ro...
4,BestBuy promised they would deliver in 1 month...
5,If I could give Best Buy a zero star Rating I ...
6,Purchased Open Box dryer. Picked it up from s...
7,This company is the worst! The only thing good...
8,I bought a ROKU device from bestbuy's online s...
9,This is to ALERT you to Best Buy's incompetent...


In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from collections import Counter

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
# Tokenization
reviews_df['Tokenization'] = reviews_df['Best Buy Review'].apply(lambda x: TextBlob(x).words)
# Stemming
st = PorterStemmer()
reviews_df['Stemming'] = reviews_df['Tokenization'].apply(lambda x: " ".join([st.stem(word) for word in x]))
# Lemmatization (applied to the stemmed text, so most tokens are already reduced)
reviews_df['Lemmatization'] = reviews_df['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
reviews_df.head(10)

Unnamed: 0,Best Buy Review,Tokenization,Stemming,Lemmatization
0,I wish I read the reviews first! I was recomme...,"[I, wish, I, read, the, reviews, first, I, was...",I wish I read the review first I wa recommend ...,I wish I read the review first I wa recommend ...
1,"Purchased two 55 "" Samsung picture tv during t...","[Purchased, two, 55, Samsung, picture, tv, dur...",purchas two 55 samsung pictur tv dure the covi...,purchas two 55 samsung pictur tv dure the covi...
2,"Need more products of same kind , and Covid is...","[Need, more, products, of, same, kind, and, Co...",need more product of same kind and covid is a ...,need more product of same kind and covid is a ...
3,Bought a defective product from Best Buy in Ro...,"[Bought, a, defective, product, from, Best, Bu...",bought a defect product from best buy in rockw...,bought a defect product from best buy in rockw...
4,BestBuy promised they would deliver in 1 month...,"[BestBuy, promised, they, would, deliver, in, ...",bestbuy promis they would deliv in 1 month fro...,bestbuy promis they would deliv in 1 month fro...
5,If I could give Best Buy a zero star Rating I ...,"[If, I, could, give, Best, Buy, a, zero, star,...",If I could give best buy a zero star rate I wo...,If I could give best buy a zero star rate I wo...
6,Purchased Open Box dryer. Picked it up from s...,"[Purchased, Open, Box, dryer, Picked, it, up, ...",purchas open box dryer pick it up from store h...,purchas open box dryer pick it up from store h...
7,This company is the worst! The only thing good...,"[This, company, is, the, worst, The, only, thi...",thi compani is the worst the onli thing good a...,thi compani is the worst the onli thing good a...
8,I bought a ROKU device from bestbuy's online s...,"[I, bought, a, ROKU, device, from, bestbuy, 's...",I bought a roku devic from bestbuy 's onlin st...,I bought a roku devic from bestbuy 's onlin st...
9,This is to ALERT you to Best Buy's incompetent...,"[This, is, to, ALERT, you, to, Best, Buy, 's, ...",thi is to alert you to best buy 's incompet an...,thi is to alert you to best buy 's incompet an...


In [13]:
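# POS Tagging
parts_of_speech = []
texts = reviews_df['Lemmatization'].tolist()
# tag each review separately; the loop variable is not named `iter`, which would shadow the built-in
for review_text in texts:
  parts_of_speech.append(nltk.pos_tag(word_tokenize(review_text)))
print(parts_of_speech)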

[[('I', 'PRP'), ('wish', 'VBP'), ('I', 'PRP'), ('read', 'VBP'), ('the', 'DT'), ('review', 'NN'), ('first', 'RB'), ('I', 'PRP'), ('wa', 'VBP'), ('recommend', 'VB'), ('to', 'TO'), ('use', 'VB'), ('them', 'PRP'), ('they', 'PRP'), ('sent', 'VBD'), ('a', 'DT'), ('messag', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('had', 'VBD'), ('spoke', 'VBN'), ('to', 'TO'), ('someon', 'VB'), ('and', 'CC'), ('have', 'VB'), ('the', 'DT'), ('whole', 'JJ'), ('email', 'NN'), ('when', 'WRB'), ('I', 'PRP'), ('order', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('had', 'VBD'), ('to', 'TO'), ('have', 'VB'), ('my', 'PRP$'), ('thing', 'NN'), ('deliv', 'NN'), ('saturday', 'NN'), ('they', 'PRP'), ('call', 'VBP'), ('saturday', 'JJ'), ('to', 'TO'), ('reschedul', 'VB'), ('for', 'IN'), ('the', 'DT'), ('11th', 'CD'), ('and', 'CC'), ('I', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('have', 'VB'), ('my', 'PRP$'), ('hous', 'JJ'), ('go', 'VB'), ('on', 'IN'), ('the', 'DT'), ('market', 'NN'), ('the', 'DT'), ('week', 'NN'), ('of', 'IN'), ('the',

In [None]:
# POS features
adjective = []
adverb = []
noun = []
possessor = []
preposition = []
pronoun = []
interjection = []
verb = []
coordinating_conjunction = []
determiner = []
adj = adv = n = posr = prepn = pron = interjectn = v = cc = dt = 0
for ele in parts_of_speech:
  for value in ele:
    if(value[1]=='JJ'):
      adj = adj+1
    elif (value[1]=='RB'):
      adv = adv+1
    elif (value[1]=='NN'):
      n = n+1
    elif (value[1]=='POS'):
      posr = posr+1
    elif (value[1]=='IN'):
      prepn = prepn+1
    elif (value[1]=='PRP'):
      pron = pron+1
    elif (value[1]=='UH'):
      interjectn = interjectn+1
    elif (value[1]=='VB'):
      v = v+1
    elif (value[1]=='CC'):
      cc = cc+1
    elif (value[1]=='DT'):
      dt = dt+1
  adjective.append(adj)
  adverb.append(adv)
  noun.append(n)
  possessor.append(posr)
  preposition.append(prepn)
  pronoun.append(pron)
  interjection.append(interjectn)
  verb.append(v)
  coordinating_conjunction.append(cc)
  determiner.append(dt)
  adj = adv = n = posr = prepn = pron = interjectn = v = cc = dt = 0
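The loop above compares each tag to one exact Penn Treebank tag, so `VBP` or `VBD` tokens are not counted as verbs and `NNS` tokens are not counted as nouns. If broader categories are wanted, one option is to match on tag prefixes with a `Counter`. This is a minimal sketch: the mapping `PREFIX_TO_FEATURE` and the helper `coarse_pos_counts` are hypothetical names, and grouping `PRP$` with pronouns is an assumption.

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tag prefixes to coarse feature names.
PREFIX_TO_FEATURE = {
    'JJ': 'Adjective',   # JJ, JJR, JJS
    'RB': 'Adverb',      # RB, RBR, RBS
    'NN': 'Noun',        # NN, NNS, NNP, NNPS
    'VB': 'Verb',        # VB, VBD, VBG, VBN, VBP, VBZ
    'PRP': 'Pronoun',    # PRP, PRP$
}

def coarse_pos_counts(tagged_tokens):
    """Count tokens per coarse category for one review's (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        for prefix, feature in PREFIX_TO_FEATURE.items():
            if tag.startswith(prefix):
                counts[feature] += 1
                break
    return counts

# First tokens of the first tagged review shown in the POS tagging output.
tagged = [('I', 'PRP'), ('wish', 'VBP'), ('I', 'PRP'), ('read', 'VBP'),
          ('the', 'DT'), ('review', 'NN'), ('first', 'RB')]
counts = coarse_pos_counts(tagged)
```

Here `counts['Verb']` is 2, whereas an exact match on `'VB'` would give 0 for the same tokens. A `Counter` returns 0 for categories that never occur, so no explicit reset between reviews is needed.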

In [None]:
# Converting list of lists to dataframe
df_pos_features = pd.DataFrame([adjective, adverb, noun, possessor, preposition, pronoun, interjection, verb, coordinating_conjunction, determiner])
df_pos_features = df_pos_features.transpose()
df_pos_features.columns = ['Adjective', 'Adverb', 'Noun', 'Possessor', 'Preposition', 'Pronoun', 'Interjection', 'Verb', 'Coordinating Conjunction', 'Determiner']
df_pos_features.head(10)

Unnamed: 0,Adjective,Adverb,Noun,Possessor,Preposition,Pronoun,Interjection,Verb,Coordinating Conjunction,Determiner
0,6,5,20,0,10,25,0,17,6,9
1,9,12,50,0,14,17,0,17,2,11
2,2,1,7,0,2,0,0,1,1,1
3,3,0,7,0,5,4,0,3,1,3
4,3,4,33,0,18,8,0,4,4,9
5,34,13,85,0,27,19,0,24,12,24
6,11,7,49,0,17,15,0,12,7,9
7,9,7,34,0,17,16,0,6,1,11
8,8,6,30,2,9,13,0,5,6,13
9,8,5,24,1,9,10,0,16,6,7


In [12]:
POS_df = pd.concat([reviews_df['Best Buy Review'], df_pos_features],ignore_index=True, sort=False, axis=1)
POS_df.columns = ['Sentences', 'Adjective', 'Adverb', 'Noun', 'Possessor', 'Preposition', 'Pronoun', 'Interjection', 'Verb', 'Coordinating Conjunction', 'Determiner']
POS_df.head(10)

Unnamed: 0,Sentences,Adjective,Adverb,Noun,Possessor,Preposition,Pronoun,Interjection,Verb,Coordinating Conjunction,Determiner
0,I wish I read the reviews first! I was recomme...,6,5,20,0,10,25,0,17,6,9
1,"Purchased two 55 "" Samsung picture tv during t...",9,12,50,0,14,17,0,17,2,11
2,"Need more products of same kind , and Covid is...",2,1,7,0,2,0,0,1,1,1
3,Bought a defective product from Best Buy in Ro...,3,0,7,0,5,4,0,3,1,3
4,BestBuy promised they would deliver in 1 month...,3,4,33,0,18,8,0,4,4,9
5,If I could give Best Buy a zero star Rating I ...,34,13,85,0,27,19,0,24,12,24
6,Purchased Open Box dryer. Picked it up from s...,11,7,49,0,17,15,0,12,7,9
7,This company is the worst! The only thing good...,9,7,34,0,17,16,0,6,1,11
8,I bought a ROKU device from bestbuy's online s...,8,6,30,2,9,13,0,5,6,13
9,This is to ALERT you to Best Buy's incompetent...,8,5,24,1,9,10,0,16,6,7
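The density features from the feature list can be derived from these counts, since density is described as the ratio of a function-word class to content words. The sketch below assumes that content words are adjectives, adverbs, nouns, and verbs; that choice, the column groupings, and the helper name `densities` are assumptions for illustration. The input is row 0 of the table above.

```python
# Assumed grouping of the POS count columns into content and function classes.
CONTENT_COLUMNS = ['Adjective', 'Adverb', 'Noun', 'Verb']
FUNCTION_COLUMNS = ['Determiner', 'Pronoun', 'Preposition',
                    'Coordinating Conjunction']

def densities(row):
    """Return function-word counts divided by the content-word total for one row."""
    content_total = sum(row[col] for col in CONTENT_COLUMNS)
    if content_total == 0:
        # Avoid division by zero for texts without any content words.
        return {f'{col} density': 0.0 for col in FUNCTION_COLUMNS}
    return {f'{col} density': row[col] / content_total
            for col in FUNCTION_COLUMNS}

# Row 0 of the POS feature table.
row0 = {'Adjective': 6, 'Adverb': 5, 'Noun': 20, 'Possessor': 0,
        'Preposition': 10, 'Pronoun': 25, 'Interjection': 0, 'Verb': 17,
        'Coordinating Conjunction': 6, 'Determiner': 9}
result = densities(row0)
```

For row 0 the content-word total is 6 + 5 + 20 + 17 = 48, so the determiner density is 9 / 48 = 0.1875. With pandas, the same helper could be applied to every row via `POS_df.apply(densities, axis=1, result_type='expand')`.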
