
## The fifth In-class-exercise (9/30/2020, 20 points in total)

In exercise-03, I asked you to collect 500 pieces of textual data based on your own information needs (if you didn't collect the textual data, you should recollect it for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features from the following feature list (10 types of features, which means well over 10 individual features), then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density of determiners/quantifiers (density can be calculated as the ratio of the following function words to content words)
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
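
The tf-idf features in item (1) weight each term by how often it occurs in one text and how rare it is across all texts. Below is a minimal pure-Python sketch of the unsmoothed textbook formula, tf × log(N / df). The function name `tfidf` and the three sample reviews are made up for illustration, and library implementations usually add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (plain tf * idf, no smoothing)."""
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical stand-ins for the scraped reviews.
sample_reviews = [
    "best buy lost my order",
    "best buy refused my refund",
    "great service at best buy",
]
sample_weights = tfidf(sample_reviews)
```

Terms that occur in every text, such as "best" and "buy" here, get a weight of zero, which is why tf-idf tends to highlight the words that distinguish one review from another.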

In [2]:
# import all the required libraries and packages
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [8]:
# base URL of the review pages that we want to scrape
base_link = 'https://www.trustpilot.com/review/www.bestbuy.com'
text = []
# loop through 26 pages; each page has about 20 reviews, which gives at least 500 rows
for page in range(1, 27):
  # build each page URL from the base link so that '?page=' is not appended repeatedly
  scrape_link = base_link + '?page=' + str(page)
  # download the page and parse the HTML
  response = requests.get(scrape_link)
  soup = BeautifulSoup(response.text, 'html.parser')
  reviews = soup.find_all(class_='review-content__text')
  # strip surrounding white space and remove new line characters
  # (removing '\n' joins the words on either side of a line break)
  for review in reviews:
    removing_white_spaces = review.text.strip()
    text.append(removing_white_spaces.replace('\n', ''))

In [9]:
# creating a data frame from the list
reviews_df = pd.DataFrame((text), columns =['Best Buy Review'])
reviews_df

Unnamed: 0,Best Buy Review
0,This is to ALERT you to Best Buy's incompetent...
1,Best Buy in GeneralI have a fraud case ($400+ ...
2,I spent many hours researching washer dryer se...
3,After looking for a computer I found BestBuy h...
4,My experience with best buy was frustrating. B...
...,...
508,I bought a refrigerator at the Best Buy in Har...
509,DO NOT SHOP AT BEST BUY! If I could give it a ...
510,I really liked best buy re-releasing films but...
511,My Complaint is about the horrendous process o...


In [44]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from collections import Counter

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [45]:
# Tokenization
reviews_df['Tokenization'] = reviews_df['Best Buy Review'].apply(lambda x: TextBlob(x).words)
# Stemming (applied to the tokens)
st = PorterStemmer()
reviews_df['Stemming'] = reviews_df['Tokenization'].apply(lambda x: " ".join([st.stem(word) for word in x]))
# Lemmatization (applied to the already stemmed text, so it changes little:
# the two columns look the same in the rows shown in the output)
reviews_df['Lemmatization'] = reviews_df['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
reviews_df

Unnamed: 0,Best Buy Review,Tokenization,Stemming,Lemmatization
0,This is to ALERT you to Best Buy's incompetent...,"[This, is, to, ALERT, you, to, Best, Buy, 's, ...",thi is to alert you to best buy 's incompet an...,thi is to alert you to best buy 's incompet an...
1,Best Buy in GeneralI have a fraud case ($400+ ...,"[Best, Buy, in, GeneralI, have, a, fraud, case...",best buy in generali have a fraud case 400 del...,best buy in generali have a fraud case 400 del...
2,I spent many hours researching washer dryer se...,"[I, spent, many, hours, researching, washer, d...",I spent mani hour research washer dryer set be...,I spent mani hour research washer dryer set be...
3,After looking for a computer I found BestBuy h...,"[After, looking, for, a, computer, I, found, B...",after look for a comput I found bestbuy had th...,after look for a comput I found bestbuy had th...
4,My experience with best buy was frustrating. B...,"[My, experience, with, best, buy, was, frustra...",My experi with best buy wa frustrat bought new...,My experi with best buy wa frustrat bought new...
...,...,...,...,...
508,I bought a refrigerator at the Best Buy in Har...,"[I, bought, a, refrigerator, at, the, Best, Bu...",I bought a refriger at the best buy in hartsda...,I bought a refriger at the best buy in hartsda...
509,DO NOT SHOP AT BEST BUY! If I could give it a ...,"[DO, NOT, SHOP, AT, BEST, BUY, If, I, could, g...",DO not shop AT best buy If I could give it a m...,DO not shop AT best buy If I could give it a m...
510,I really liked best buy re-releasing films but...,"[I, really, liked, best, buy, re-releasing, fi...",I realli like best buy re-releas film but i pr...,I realli like best buy re-releas film but i pr...
511,My Complaint is about the horrendous process o...,"[My, Complaint, is, about, the, horrendous, pr...",My complaint is about the horrend process of c...,My complaint is about the horrend process of c...
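
The feature list also includes sentence length, which the flat token list above cannot provide because it drops sentence boundaries. The sketch below is one rough way to get it with the standard library only. The regex-based splitter and the helper names are assumptions for illustration; a proper sentence tokenizer would handle abbreviations and missing spaces better.

```python
import re

def sentence_lengths(text):
    """Split on ., ! or ? followed by white space and count tokens per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]

def sentence_length_features(text):
    lengths = sentence_lengths(text)
    if not lengths:
        return {"n_sentences": 0, "avg_len": 0.0, "max_len": 0}
    return {
        "n_sentences": len(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }

# Hypothetical review text used only to exercise the helpers.
example_features = sentence_length_features(
    "DO NOT SHOP AT BEST BUY! I ordered a fridge. It never came."
)
```

Applied to the `Best Buy Review` column, a helper like this would add one numeric column per statistic alongside the POS counts.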


In [46]:
# POS Tagging
iterns = []
texts = reviews_df['Lemmatization'].tolist()
# tag the tokens of each review (the input is the stemmed and lemmatized text)
parts_of_speech = [nltk.pos_tag(word_tokenize(review)) for review in texts]
parts_of_speech

[[('thi', 'NN'),
  ('is', 'VBZ'),
  ('to', 'TO'),
  ('alert', 'VB'),
  ('you', 'PRP'),
  ('to', 'TO'),
  ('best', 'VB'),
  ('buy', 'NN'),
  ("'s", 'POS'),
  ('incompet', 'NN'),
  ('and', 'CC'),
  ('mislead', 'JJ'),
  ('tacticsi', 'NN'),
  ("'ve", 'VBP'),
  ('place', 'NN'),
  ('no', 'DT'),
  ('fewer', 'JJR'),
  ('than', 'IN'),
  ('9', 'CD'),
  ('phone', 'NN'),
  ('call', 'NN'),
  ('over', 'IN'),
  ('4', 'CD'),
  ('day', 'NN'),
  ('where', 'WRB'),
  ('they', 'PRP'),
  ('have', 'VBP'),
  ('refus', 'VBN'),
  ('to', 'TO'),
  ('engag', 'VB'),
  ('they', 'PRP'),
  ("'ve", 'VBP'),
  ('repeatedli', 'VBN'),
  ('hung', 'NN'),
  ('up', 'RP'),
  ('on', 'IN'),
  ('me', 'PRP'),
  ('after', 'IN'),
  ('hold', 'NN'),
  ('for', 'IN'),
  ('25min+they', 'CD'),
  ("'ve", 'VBP'),
  ('refus', 'VBN'),
  ('to', 'TO'),
  ('deliv', 'VB'),
  ('two', 'CD'),
  ('sono', 'JJ'),
  ('speaker', 'NN'),
  ('to', 'TO'),
  ('the', 'DT'),
  ('correct', 'JJ'),
  ('address', 'NN'),
  ('they', 'PRP'),
  ("'re", 'VBP'),
  ('hold'

In [48]:
# Count how often each POS tag occurs in each review
iterns = [Counter(tag for word, tag in tagged) for tagged in parts_of_speech]
iterns

[Counter({'CC': 6,
          'CD': 4,
          'DT': 7,
          'IN': 9,
          'JJ': 8,
          'JJR': 1,
          'NN': 24,
          'NNS': 2,
          'POS': 1,
          'PRP': 10,
          'PRP$': 1,
          'RB': 5,
          'RP': 1,
          'TO': 10,
          'VB': 16,
          'VBD': 2,
          'VBN': 3,
          'VBP': 9,
          'VBZ': 1,
          'WRB': 1}),
 Counter({'CC': 19,
          'CD': 3,
          'DT': 17,
          'IN': 30,
          'JJ': 13,
          'JJS': 4,
          'MD': 4,
          'NN': 60,
          'NNP': 3,
          'NNS': 1,
          'PDT': 1,
          'PRP': 40,
          'PRP$': 4,
          'RB': 12,
          'RBS': 1,
          'RP': 3,
          'TO': 6,
          'VB': 16,
          'VBD': 5,
          'VBN': 1,
          'VBP': 24,
          'VBZ': 5,
          'WDT': 2,
          'WP': 2,
          'WRB': 3}),
 Counter({'CC': 11,
          'CD': 5,
          'DT': 31,
          'IN': 23,
          'JJ': 22,
    

In [82]:
# POS features: per-review counts of selected Penn Treebank tags.
# Only the exact tag is counted, e.g. 'JJ' but not 'JJR'/'JJS', and 'NN' but not 'NNS'.
tag_counts = [Counter(tag for word, tag in ele) for ele in parts_of_speech]
adjective = [c['JJ'] for c in tag_counts]
adverb = [c['RB'] for c in tag_counts]
noun = [c['NN'] for c in tag_counts]
possessor = [c['POS'] for c in tag_counts]
preposition = [c['IN'] for c in tag_counts]
pronoun = [c['PRP'] for c in tag_counts]
interjection = [c['UH'] for c in tag_counts]
verb = [c['VB'] for c in tag_counts]
coordinating_conjunction = [c['CC'] for c in tag_counts]
determiner = [c['DT'] for c in tag_counts]
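
The cell above counts only exact tags, so comparative adjectives (`JJR`), plural nouns (`NNS`), and inflected verbs (`VBD`, `VBP`, ...) are left out of the totals. If broader categories are wanted, one option is to group tags by prefix. The sketch below shows the idea; the `COARSE_PREFIXES` mapping and the `coarse_counts` helper are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter

# Assumed grouping of Penn Treebank tag prefixes into coarse categories.
COARSE_PREFIXES = {
    "JJ": "Adjective",   # JJ, JJR, JJS
    "RB": "Adverb",      # RB, RBR, RBS
    "NN": "Noun",        # NN, NNS, NNP, NNPS
    "VB": "Verb",        # VB, VBD, VBG, VBN, VBP, VBZ
    "PRP": "Pronoun",    # PRP, PRP$
}

def coarse_counts(tagged_tokens):
    """Count tokens per coarse category for one list of (word, tag) pairs."""
    counts = Counter()
    for _, tag in tagged_tokens:
        for prefix, name in COARSE_PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
                break
    return counts

# Hypothetical tagged tokens in the same (word, tag) shape that nltk.pos_tag returns.
example = [("fewer", "JJR"), ("phone", "NN"), ("calls", "NNS"),
           ("refused", "VBD"), ("my", "PRP$"), ("to", "TO")]
example_counts = coarse_counts(example)
```

Tags that match no prefix, such as `TO`, are simply skipped, so the coarse counts need not sum to the number of tokens.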

In [101]:
# Converting list of lists to dataframe
df_pos_features = pd.DataFrame([adjective, adverb, noun, possessor, preposition, pronoun, interjection, verb, coordinating_conjunction, determiner])
df_pos_features = df_pos_features.transpose()
df_pos_features.columns = ['Adjective', 'Adverb', 'Noun', 'Possessor', 'Preposition', 'Pronoun', 'Interjection', 'Verb', 'Coordinating Conjunction', 'Determiner']
df_pos_features.head(10)

Unnamed: 0,Adjective,Adverb,Noun,Possessor,Preposition,Pronoun,Interjection,Verb,Coordinating Conjunction,Determiner
0,8,5,24,1,9,10,0,16,6,7
1,13,12,60,0,30,40,0,16,19,17
2,22,13,87,0,23,28,0,10,11,31
3,25,27,125,0,51,48,0,36,17,56
4,2,10,10,0,4,6,0,5,3,1
5,10,7,24,0,10,8,0,5,2,9
6,23,10,57,2,25,24,0,15,14,19
7,8,8,26,0,4,18,0,6,6,10
8,2,10,17,0,12,16,0,8,4,3
9,9,10,39,0,16,11,0,13,3,12


In [105]:
POS_df = pd.concat([reviews_df['Best Buy Review'], df_pos_features],ignore_index=True, sort=False, axis=1)
POS_df.columns = ['Sentences', 'Adjective', 'Adverb', 'Noun', 'Possessor', 'Preposition', 'Pronoun', 'Interjection', 'Verb', 'Coordinating Conjunction', 'Determiner']
POS_df.head(10)

Unnamed: 0,Sentences,Adjective,Adverb,Noun,Possessor,Preposition,Pronoun,Interjection,Verb,Coordinating Conjunction,Determiner
0,This is to ALERT you to Best Buy's incompetent...,8,5,24,1,9,10,0,16,6,7
1,Best Buy in GeneralI have a fraud case ($400+ ...,13,12,60,0,30,40,0,16,19,17
2,I spent many hours researching washer dryer se...,22,13,87,0,23,28,0,10,11,31
3,After looking for a computer I found BestBuy h...,25,27,125,0,51,48,0,36,17,56
4,My experience with best buy was frustrating. B...,2,10,10,0,4,6,0,5,3,1
5,mmhmm Best Buy. If there are other option...go...,10,7,24,0,10,8,0,5,2,9
6,"Today is 9/29/20, and we have been without a f...",23,10,57,2,25,24,0,15,14,19
7,"I ordered an instant Pot to be picked up, I ca...",8,8,26,0,4,18,0,6,6,10
8,I ordered a handbag and purse back in November...,2,10,17,0,12,16,0,8,4,3
9,Bought a item online wich said it would be rea...,9,10,39,0,16,11,0,13,3,12
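
The density features in the list are described as ratios of function words to content words, and they can be derived from the same per-review tag counts. The sketch below uses the tag counts of the first review from the Counter output earlier in this notebook. Which tags count as content words and which tags belong to each function-word class are assumptions made here for illustration.

```python
from collections import Counter

# Tag counts of the first review, copied from the Counter output shown earlier.
first_review_tags = Counter({
    'CC': 6, 'CD': 4, 'DT': 7, 'IN': 9, 'JJ': 8, 'JJR': 1, 'NN': 24,
    'NNS': 2, 'POS': 1, 'PRP': 10, 'PRP$': 1, 'RB': 5, 'RP': 1, 'TO': 10,
    'VB': 16, 'VBD': 2, 'VBN': 3, 'VBP': 9, 'VBZ': 1, 'WRB': 1,
})

# Assumption: content words are nouns, verbs, adjectives and adverbs.
CONTENT_PREFIXES = ('NN', 'VB', 'JJ', 'RB')

# Assumption: which Penn Treebank tags make up each function-word class.
# Note that 'IN' also covers subordinating conjunctions in this tagset.
FUNCTION_TAGS = {
    'determiner_density': ('DT', 'PDT', 'WDT'),
    'pronoun_density': ('PRP', 'PRP$', 'WP', 'WP$'),
    'preposition_density': ('IN',),
    'conjunction_density': ('CC',),
}

def density_features(tag_counts):
    """Return function-word counts divided by the content-word count."""
    content = sum(n for tag, n in tag_counts.items()
                  if tag.startswith(CONTENT_PREFIXES))
    if content == 0:
        return {name: 0.0 for name in FUNCTION_TAGS}
    return {
        name: sum(tag_counts.get(tag, 0) for tag in tags) / content
        for name, tags in FUNCTION_TAGS.items()
    }

densities = density_features(first_review_tags)
```

Because the Penn Treebank tagset does not mark auxiliary verbs separately, this sketch treats every verb tag as a content word; the auxiliary density from the list would need a word list or a different tagset.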
