## The fifth In-class-exercise (9/30/2020, 20 points in total)

In exercise-03, I asked you to collected 500 textual data based on your own information needs (If you didn't collect the textual data, you should recollect for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 types of features but absolutely more than 10 features) in the followings feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (density can be calculated using the ratio of folowing function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features in your mind (ie., pre-defined patterns)

# **Extracting Data**

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
main_text = [] # List to store Review headings
sub_text =[] #List to store reviews
for number in range(52):
  link = "https://www.flipkart.com/apple-iphone-11-black-64-gb/product-reviews/itm4e5041ba101fd?pid=MOBFWQ6BXGJCEYNY&lid=LSTMOBFWQ6BXGJCEYNYZE3ENS&marketplace=FLIPKART&page=" + str(number) # Generating link dynamically
  page = requests.get(link) # Accessing the webpage
  soup = BeautifulSoup(page.text, 'html.parser')
  main_reviews = soup.find_all(class_='_2-N8zT') # Getting the Review Heading by using the class name
  text_reviews = soup.find_all(class_='t-ZTKy') # Getting the full reviews by using the class name
  for ele, sub_ele in zip(main_reviews, text_reviews) : # Iterating through the list
      main_text.append(ele.text) #Appending to empty list
      sub_text.append(sub_ele.text)
df = pd.DataFrame(list(zip(main_text, sub_text)), columns =['Glimpse of Review', 'Full Review'])  # Creating Dataframe
print("Length of data frame is {0}".format(len(df)))
df

Length of data frame is 510


Unnamed: 0,Glimpse of Review,Full Review
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...
1,Perfect product!,Amazing phone with great cameras and better ba...
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...
3,Worth every penny,Previously I was using one plus 3t it was a gr...
4,Good choice,So far it’s been an AMAZING experience coming ...
...,...,...
505,Perfect product!,Original product for apple and security seal p...
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...
507,Brilliant,Love itREAD MORE
508,Perfect product!,All are good not giving charger and earphones ...


# **Preprocessing Data**

**Converting to lower case**

In [2]:
df['After Preprocessing'] = df['Full Review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...,the best phone for the moneythe iphone 11 offe...
1,Perfect product!,Amazing phone with great cameras and better ba...,amazing phone with great cameras and better ba...
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...,amazing powerful and durable gadget.i’m am ver...
3,Worth every penny,Previously I was using one plus 3t it was a gr...,previously i was using one plus 3t it was a gr...
4,Good choice,So far it’s been an AMAZING experience coming ...,so far it’s been an amazing experience coming ...
...,...,...,...
505,Perfect product!,Original product for apple and security seal p...,original product for apple and security seal p...
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...,good phone by apple battery is far better ( on...
507,Brilliant,Love itREAD MORE,love itread more
508,Perfect product!,All are good not giving charger and earphones ...,all are good not giving charger and earphones ...


**Removing Punctuation**

In [3]:
df['After Preprocessing'] = df['After Preprocessing'].str.replace('[^\w\s]','')
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...,the best phone for the moneythe iphone 11 offe...
1,Perfect product!,Amazing phone with great cameras and better ba...,amazing phone with great cameras and better ba...
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...,amazing powerful and durable gadgetim am very ...
3,Worth every penny,Previously I was using one plus 3t it was a gr...,previously i was using one plus 3t it was a gr...
4,Good choice,So far it’s been an AMAZING experience coming ...,so far its been an amazing experience coming b...
...,...,...,...
505,Perfect product!,Original product for apple and security seal p...,original product for apple and security seal p...
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...,good phone by apple battery is far better one...
507,Brilliant,Love itREAD MORE,love itread more
508,Perfect product!,All are good not giving charger and earphones ...,all are good not giving charger and earphones ...


**Removing Numerics**

In [4]:
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

**Removing Special Characters**

In [5]:
import re
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', charctr) for charctr in x ))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...,the best phone for the moneythe iphone offers...
1,Perfect product!,Amazing phone with great cameras and better ba...,amazing phone with great cameras and better ba...
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...,amazing powerful and durable gadgetim am very ...
3,Worth every penny,Previously I was using one plus 3t it was a gr...,previously i was using one plus t it was a gre...
4,Good choice,So far it’s been an AMAZING experience coming ...,so far its been an amazing experience coming b...
...,...,...,...
505,Perfect product!,Original product for apple and security seal p...,original product for apple and security seal p...
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...,good phone by apple battery is far better one...
507,Brilliant,Love itREAD MORE,love itread more
508,Perfect product!,All are good not giving charger and earphones ...,all are good not giving charger and earphones ...


In [8]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> c

Data Server:
  - URL: <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml>
  - 7 Package Collections Available
  - 107 Individual Packages Available

Local Machine:
  - Data directory: /root/nltk_data

---------------------------------------------------------------------------
    s) Show Config   u) Set Server URL   d) Set Data Dir   m) Main Menu
---------------------------------------------------------------------------
Config> m

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifie

True

**Removing Stop Words**

In [9]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...,best phone moneythe iphone offers superb camer...
1,Perfect product!,Amazing phone with great cameras and better ba...,amazing phone great cameras better battery giv...
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...,amazing powerful durable gadgetim happy camera...
3,Worth every penny,Previously I was using one plus 3t it was a gr...,previously using one plus great phone decided ...
4,Good choice,So far it’s been an AMAZING experience coming ...,far amazing experience coming back ios nearly ...
...,...,...,...
505,Perfect product!,Original product for apple and security seal p...,original product apple security seal pack flip...
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...,good phone apple battery far better one kidney...
507,Brilliant,Love itREAD MORE,love itread
508,Perfect product!,All are good not giving charger and earphones ...,good giving charger earphones read


**Spelling Correction**

In [10]:
from textblob import TextBlob
df['After Preprocessing'].apply(lambda x: str(TextBlob(x).correct()))

0      best phone moneythe phone offers superb camera...
1      amazing phone great camera better battery give...
2      amazing powerful unable gadgetim happy camera ...
3      previously using one plus great phone decided ...
4      far amazing experience coming back is nearly d...
                             ...                        
505    original product apple security seal pack flip...
506    good phone apple battery far better one kidney...
507                                           love tread
508                    good giving charge earphones read
509    decided buy phone phone or using week dont reg...
Name: After Preprocessing, Length: 510, dtype: object

**Stemming**

In [11]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['After Preprocessing'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0      best phone moneyth iphon offer superb camera d...
1      amaz phone great camera better batteri give be...
2      amaz power durabl gadgetim happi camera pictur...
3      previous use one plu great phone decid upgrad ...
4      far amaz experi come back io nearli decad vers...
                             ...                        
505    origin product appl secur seal pack flipcart i...
506    good phone appl batteri far better one kidney ...
507                                          love itread
508                       good give charger earphon read
509    decid buy iphon iphon xr use week dont regret ...
Name: After Preprocessing, Length: 510, dtype: object

**Lemmatization**

In [12]:
from textblob import Word
import nltk
nltk.download('wordnet')

df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...,best phone moneythe iphone offer superb camera...
1,Perfect product!,Amazing phone with great cameras and better ba...,amazing phone great camera better battery give...
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...,amazing powerful durable gadgetim happy camera...
3,Worth every penny,Previously I was using one plus 3t it was a gr...,previously using one plus great phone decided ...
4,Good choice,So far it’s been an AMAZING experience coming ...,far amazing experience coming back io nearly d...
...,...,...,...
505,Perfect product!,Original product for apple and security seal p...,original product apple security seal pack flip...
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...,good phone apple battery far better one kidney...
507,Brilliant,Love itREAD MORE,love itread
508,Perfect product!,All are good not giving charger and earphones ...,good giving charger earphone read


# **Parts of Speech Tagging and Features**

In [14]:
from nltk.tokenize import word_tokenize
pos = []
for sentence in df['After Preprocessing']:
  text = word_tokenize(sentence)
  pos.append(nltk.pos_tag(text))
pos

[[('best', 'JJS'),
  ('phone', 'NN'),
  ('moneythe', 'NN'),
  ('iphone', 'NN'),
  ('offer', 'NN'),
  ('superb', 'NN'),
  ('camera', 'NN'),
  ('durable', 'JJ'),
  ('design', 'NN'),
  ('excellent', 'JJ'),
  ('battery', 'NN'),
  ('life', 'NN'),
  ('affordable', 'JJ'),
  ('pricecompelling', 'VBG'),
  ('ultrawide', 'JJ'),
  ('cameranew', 'NN'),
  ('night', 'NN'),
  ('mode', 'NN'),
  ('excellentlong', 'JJ'),
  ('battery', 'NN'),
  ('liferead', 'NN')],
 [('amazing', 'VBG'),
  ('phone', 'NN'),
  ('great', 'JJ'),
  ('camera', 'NN'),
  ('better', 'RBR'),
  ('battery', 'NN'),
  ('give', 'JJ'),
  ('best', 'RBS'),
  ('performance', 'NN'),
  ('love', 'NN'),
  ('camera', 'NN'),
  ('read', 'NN')],
 [('amazing', 'VBG'),
  ('powerful', 'JJ'),
  ('durable', 'JJ'),
  ('gadgetim', 'NN'),
  ('happy', 'JJ'),
  ('camera', 'NN'),
  ('picture', 'NN'),
  ('quality', 'NN'),
  ('amazing', 'VBG'),
  ('face', 'NN'),
  ('id', 'NN'),
  ('unlocked', 'VBD'),
  ('dark', 'JJ'),
  ('room', 'NN'),
  ('strong', 'JJ'),
  ('ba

In [15]:
Adjective = []
Adverb = []
CordinatingConjunction = []
SubordinatingConjuction = []
Interjection = []
Noun = []
Verb = []
PersonalPronoun = []
predeterminer = []
Determiner = []

In [16]:
for value in pos:
  AdjectiveCount = 0
  AdverbCount = 0
  CordinatingConjunctionCount = 0
  SubordinatingConjuctionCount = 0
  InterjectionCount = 0
  NounCount = 0
  VerbCount = 0
  PersonalPronounCount = 0
  predeterminerCount = 0
  DeterminerCount = 0
  for word,tag in value:
    if tag == 'JJ':
      AdjectiveCount = AdjectiveCount + 1
    elif tag == 'RB':
      AdverbCount = AdverbCount + 1
    elif tag == 'CC':
      CordinatingConjunctionCount = CordinatingConjunctionCount + 1
    elif tag == 'UH':
      InterjectionCount = InterjectionCount + 1
    elif tag == 'NN':
      NounCount = NounCount + 1
    elif tag == 'VR':
      VerbCount = VerbCount + 1
    elif tag == 'PRP':
      PersonalPronounCount = PersonalPronounCount + 1
    elif tag == 'PDT':
      predeterminerCount = predeterminerCount + 1
    elif tag == 'DT':
      DeterminerCount = DeterminerCount + 1
    elif tag == 'IN':
      SubordinatingConjuctionCount = SubordinatingConjuctionCount + 1
  Adjective.append(AdjectiveCount)
  Adverb.append(AdverbCount)
  CordinatingConjunction.append(CordinatingConjunctionCount)
  Interjection.append(InterjectionCount)
  Noun.append(NounCount)
  Verb.append(VerbCount)
  PersonalPronoun.append(PersonalPronounCount)
  predeterminer.append(predeterminerCount)
  Determiner.append(DeterminerCount)
  SubordinatingConjuction.append(SubordinatingConjuctionCount)

In [17]:
df['Number of Adjectives'] = Adjective
df['Number of Adverbs'] = Adverb
df['Number of Cordinating Conjunctions'] = CordinatingConjunction
df['Number of Interjections'] = Interjection
df['Number of Nouns'] = Noun
df['Number of Verbs'] = Verb
df['Number of Personal Pronouns'] = PersonalPronoun
df['Number of Predeterminers'] = predeterminer
df['Number of Determiners'] = Determiner
df['Number of Subordinating Conjuctions'] = SubordinatingConjuction
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing,Number of Adjectives,Number of Adverbs,Number of Cordinating Conjunctions,Number of Interjections,Number of Nouns,Number of Verbs,Number of Personal Pronouns,Number of Predeterminers,Number of Determiners,Number of Subordinating Conjuctions
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...,best phone moneythe iphone offer superb camera...,5,0,0,0,14,0,0,0,0,0
1,Perfect product!,Amazing phone with great cameras and better ba...,amazing phone great camera better battery give...,2,0,0,0,7,0,0,0,0,0
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...,amazing powerful durable gadgetim happy camera...,14,4,0,0,25,0,0,0,0,0
3,Worth every penny,Previously I was using one plus 3t it was a gr...,previously using one plus great phone decided ...,6,4,2,0,19,0,0,0,0,3
4,Good choice,So far it’s been an AMAZING experience coming ...,far amazing experience coming back io nearly d...,5,6,0,0,7,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
505,Perfect product!,Original product for apple and security seal p...,original product apple security seal pack flip...,2,0,0,0,13,0,0,0,0,0
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...,good phone apple battery far better one kidney...,1,1,0,0,6,0,0,0,0,0
507,Brilliant,Love itREAD MORE,love itread,0,0,0,0,2,0,0,0,0,0
508,Perfect product!,All are good not giving charger and earphones ...,good giving charger earphone read,1,0,0,0,3,0,0,0,0,0


# **Linguistic features**

**Number of right-branching nodes**

In [18]:
RightBranchingNodes = []
import spacy
nlp = spacy.load("en_core_web_sm")

for sentence in df['After Preprocessing']:
  doc = nlp(sentence)
  try:
    RightBranchingNodes.append(doc[0].n_rights)
  except:
    RightBranchingNodes.append('No')
df['Number of Right Branching Nodes'] = RightBranchingNodes

**Sentence Length**

In [19]:
df['Sentenece Length'] = df['Full Review'].apply(lambda x: len(x))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing,Number of Adjectives,Number of Adverbs,Number of Cordinating Conjunctions,Number of Interjections,Number of Nouns,Number of Verbs,Number of Personal Pronouns,Number of Predeterminers,Number of Determiners,Number of Subordinating Conjuctions,Number of Right Branching Nodes,Sentenece Length
0,Brilliant,The Best Phone for the MoneyThe iPhone 11 offe...,best phone moneythe iphone offer superb camera...,5,0,0,0,14,0,0,0,0,0,0,219
1,Perfect product!,Amazing phone with great cameras and better ba...,amazing phone great camera better battery give...,2,0,0,0,7,0,0,0,0,0,0,123
2,Great product,Amazing Powerful and Durable Gadget.I’m am ver...,amazing powerful durable gadgetim happy camera...,14,4,0,0,25,0,0,0,0,0,0,501
3,Worth every penny,Previously I was using one plus 3t it was a gr...,previously using one plus great phone decided ...,6,4,2,0,19,0,0,0,0,3,0,504
4,Good choice,So far it’s been an AMAZING experience coming ...,far amazing experience coming back io nearly d...,5,6,0,0,7,0,0,0,0,1,0,241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
505,Perfect product!,Original product for apple and security seal p...,original product apple security seal pack flip...,2,0,0,0,13,0,0,0,0,0,0,179
506,Mind-blowing purchase,Good phone by apple battery is far better ( on...,good phone apple battery far better one kidney...,1,1,0,0,6,0,0,0,0,0,0,69
507,Brilliant,Love itREAD MORE,love itread,0,0,0,0,2,0,0,0,0,0,1,16
508,Perfect product!,All are good not giving charger and earphones ...,good giving charger earphone read,1,0,0,0,3,0,0,0,0,0,0,56
