Machine learning models cannot work with raw text directly, so we need to convert text into numerical vectors. As part of this practice exercise, you will implement several techniques for doing so.

In this notebook we will go through some basic text cleaning steps and techniques for encoding text data. We are going to learn about
1. **Understanding the data** - See what the data is all about and what should be considered when cleaning it (punctuation, stopwords, etc.).
2. **Basic Cleaning** - We will see which parameters need to be considered when cleaning the data (such as punctuation and stopwords) and the code for it.
3. **Techniques for Encoding** - The popular encoding techniques that I have personally come across.
    *           **Bag of Words**
    *           **Binary Bag of Words**
    *           **Bigram, Ngram**
    *           **TF-IDF**( **T**erm  **F**requency - **I**nverse **D**ocument **F**requency)


In [20]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

 **Importing Libraries**

In [0]:
import numpy as np                                  #for large and multi-dimensional arrays
import pandas as pd                                 #for data manipulation and analysis
import nltk                                         #Natural language processing tool-kit

from nltk.corpus import stopwords                   #Stopwords corpus
from nltk.stem import PorterStemmer                 # Stemmer

from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
data_path = "/content/drive/My Drive/Natural Language Processing/Week 1/Practise Ex/Reviews.csv"
data = pd.read_csv(data_path)
data_sel = data.head(10000)                                #Considering only top 10000 rows

In [5]:
data_sel.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [4]:
# Column names of our data
data_sel.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [6]:
data_sel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
Id                        10000 non-null int64
ProductId                 10000 non-null object
UserId                    10000 non-null object
ProfileName               10000 non-null object
HelpfulnessNumerator      10000 non-null int64
HelpfulnessDenominator    10000 non-null int64
Score                     10000 non-null int64
Time                      10000 non-null int64
Summary                   10000 non-null object
Text                      10000 non-null object
dtypes: int64(5), object(5)
memory usage: 781.4+ KB


1. **Understanding the data**

Our main objective with this dataset is to predict whether a review is **Positive** or **Negative** based on its Text.
 
The Score column takes the values 1, 2, 3, 4 and 5. We treat scores of 1 and 2 as Negative reviews and scores of 4 and 5 as Positive reviews.
A review with Score = 3 is considered Neutral, so we delete those rows and predict only Positive or Negative.
 
HelpfulnessNumerator is the number of people who found the review useful, and HelpfulnessDenominator is the number of people who found it useful plus the number who did not.
It follows that HelpfulnessNumerator must always be less than or equal to HelpfulnessDenominator.

In [0]:
# Write the code to remove all the rows from the dataset that have a neutral review, i.e. a Score value of 3


data_sel=data_sel[data_sel['Score']!=3]

In [8]:
data_sel.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9138 entries, 0 to 9999
Data columns (total 10 columns):
Id                        9138 non-null int64
ProductId                 9138 non-null object
UserId                    9138 non-null object
ProfileName               9138 non-null object
HelpfulnessNumerator      9138 non-null int64
HelpfulnessDenominator    9138 non-null int64
Score                     9138 non-null int64
Time                      9138 non-null int64
Summary                   9138 non-null object
Text                      9138 non-null object
dtypes: int64(5), object(5)
memory usage: 785.3+ KB


Converting Score values into a class label, either Positive or Negative.

In [10]:
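# Write the code to replace the values of the Score column with "Positive" or "Negative" depending on the Score value


# Assign the result back instead of calling replace(inplace=True) on a column selection,
# which may act on a temporary copy and leave data_sel unchanged.
data_sel = data_sel.assign(Score=data_sel["Score"].replace({1: "Negative", 2: "Negative", 4: "Positive", 5: "Positive"}))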
data_sel.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,Positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,Negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,Positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,Negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,Positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


2. **Basic Cleaning**
 
**Deduplication** means removing duplicate rows. Removing duplicates is necessary to get unbiased results. We check for duplicates based on UserId, ProfileName, Time and Text: if all of these values are equal, we keep one record and remove the rest. (No user can post a review at the exact same time for different products.)


We have seen that HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator, so we check this condition and remove any records that violate it.


In [0]:
# Write the code to remove duplicates from the data and remove the rows where HelpfulnessNumerator is greater than 
# HelpfulnessDenominator. Store the resultant in a dataframe variable called "final"


# Keep the first occurrence of each (UserId, ProfileName, Text) combination.
# Note: 'Time' is not part of the subset used here.
data_sel = data_sel.drop_duplicates(subset=['UserId', 'ProfileName', 'Text'], keep='first')

# Keep only the rows where the numerator does not exceed the denominator.
data_sel = data_sel[data_sel['HelpfulnessNumerator'] <= data_sel['HelpfulnessDenominator']]

In [15]:
final=data_sel
#final.head()
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8716 entries, 0 to 9999
Data columns (total 10 columns):
Id                        8716 non-null int64
ProductId                 8716 non-null object
UserId                    8716 non-null object
ProfileName               8716 non-null object
HelpfulnessNumerator      8716 non-null int64
HelpfulnessDenominator    8716 non-null int64
Score                     8716 non-null object
Time                      8716 non-null int64
Summary                   8716 non-null object
Text                      8716 non-null object
dtypes: int64(4), object(6)
memory usage: 749.0+ KB


In [0]:
final_X = final['Text']
final_y = final['Score']

In [29]:
print(final_X[1])
print(final_y[1])

Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
Negative


Next we convert all words to lowercase and remove punctuation and HTML tags, if any.

**Stemming** - Converting words into their base or stem form (Ex - 'tasty' is converted to the stem 'tasti'). This reduces the vector dimension because different forms of a word are mapped to a single stem instead of being treated as separate words.

**Stopwords** - Stopwords are words that can be removed without changing the sentiment of the sentence.

Ex -    This pasta is so tasty ==> pasta tasty    (This, is, so are stopwords, so they are removed)

To see all the stopwords, see the code cell below.

In [30]:
stop = set(stopwords.words('english')) 
print(stop)

{'doing', 'yourself', 'until', 'wasn', 'at', 'any', 't', 'their', 'some', 'shan', 'up', 'if', "needn't", 'so', 'y', 'your', "it's", 'on', 'him', 'his', 'from', 'down', 'own', "shan't", 'o', 'my', 'whom', 'only', 'been', 'a', 'just', 'same', 'i', 'do', 'or', "that'll", 'has', "didn't", "won't", 'yourselves', 'she', 'that', 'they', 'he', 'there', 'not', 'aren', 'what', 'doesn', 'myself', 'each', 'over', "shouldn't", 'be', 'was', 'very', 'nor', 'its', 'themselves', 'will', 've', 'theirs', 'did', 'can', 'then', 'ma', "mightn't", 'needn', 'between', "isn't", 'her', 'while', 'most', 'below', 'which', 'll', 'our', 'out', 'before', 'by', 'as', "she's", 'having', 'hadn', 'those', "you'll", 'and', 'after', 'against', 'we', "you'd", 'me', 'himself', 'itself', 'again', 'few', "hasn't", 're', 'had', "should've", 'above', 'should', 'wouldn', "you've", 'both', 'through', "hadn't", 'shouldn', 'no', 'why', 'isn', 'other', 'into', "don't", 'you', 'mustn', "couldn't", 'once', 'to', 'more', 'haven', 'hasn

For each sentence in final_X, perform the following operations in sequence
* Convert every character in the sentence to lowercase
* Remove HTML tags
* Remove punctuation
* Remove stopwords
* Stem each word using SnowballStemmer from the nltk library

Hint: 
* Use regular expressions
* Use nltk.stem.SnowballStemmer
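
The steps listed above can be sketched on a single sentence using only the standard library. This is a minimal illustration, not the exercise solution: `TOY_STOPWORDS`, `clean_text` and the `stem` parameter are hypothetical names, the stopword list is deliberately tiny, the punctuation pattern is an assumption (it removes every non-word, non-space character), and stemming defaults to a no-op so that the sketch runs without nltk.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook uses nltk's much larger English list.
TOY_STOPWORDS = {"this", "is", "so", "the", "a"}

HTML_TAG = re.compile(r"<.*?>")        # non-greedy match of anything between < and >
PUNCTUATION = re.compile(r"[^\w\s]")   # any character that is neither a word character nor whitespace


def clean_text(sentence, stopwords=TOY_STOPWORDS, stem=lambda word: word):
    """Lowercase, strip HTML tags and punctuation, drop stopwords, then stem."""
    sentence = sentence.lower()
    sentence = HTML_TAG.sub(" ", sentence)
    sentence = PUNCTUATION.sub(" ", sentence)
    return [stem(word) for word in sentence.split() if word not in stopwords]


print(clean_text("This pasta is <b>SO</b> tasty!!"))  # ['pasta', 'tasty']
```

To apply real stemming, a stemmer's `stem` method (for example from nltk's SnowballStemmer) could be passed as the `stem` argument.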

In [0]:
from nltk.tokenize import word_tokenize

In [0]:
# Solution

import re

stop = set(stopwords.words('english'))          # Build the stopword set once; set lookups are fast
snow = nltk.stem.SnowballStemmer('english')
html_tag = re.compile(r'<.*?>')                 # Compile the HTML tag pattern once, outside the loop

temp = []
for sentence in final_X:
    sentence = sentence.lower()                         # Converting to lowercase
    sentence = html_tag.sub(' ', sentence)              # Removing HTML tags
    sentence = re.sub(r'[?|!\'"#]', '', sentence)       # Deleting these punctuation characters
    sentence = re.sub(r'[.,)(|/]', ' ', sentence)       # Replacing these punctuation characters with a space

    # Removing stopwords, then stemming the remaining words
    words = [snow.stem(word) for word in sentence.split() if word not in stop]
    temp.append(words)

final_X = temp

In [33]:
print(final_X[1])

['product', 'arriv', 'label', 'jumbo', 'salt', 'peanut', 'peanut', 'actual', 'small', 'size', 'unsalt', 'sure', 'error', 'vendor', 'intend', 'repres', 'product', 'jumbo']


In [34]:
sent = []
for row in final_X:
    sequ = ''
    for word in row:
        sequ = sequ + ' ' + word
    sent.append(sequ)

final_X = sent
print(final_X[1])

 product arriv label jumbo salt peanut peanut actual small size unsalt sure error vendor intend repres product jumbo


3. **Techniques for Encoding**

**BAG OF WORDS**

In BoW we construct a dictionary that contains the set of all unique words in our review dataset, and we count how often each word occurs. If there are **d** unique words in the dictionary, then every sentence or review is represented by a vector of length **d**, where the count of each word in the review is stored at that word's position in the vector. Since a review contains only a small fraction of the dictionary, the vector is highly sparse.

Ex. pasta is tasty and pasta is good

| Dictionary word | a | and | good | is | pasta | tasty | (all other words) |
|---|---|---|---|---|---|---|---|
| Count in review | 0 | 1 | 1 | 2 | 2 | 1 | 0 |

Using scikit-learn's CountVectorizer we can get the BoW. Check out all the parameters it accepts; one of them is max_features. Setting max_features=5000 keeps only the 5000 most frequent words in the dictionary, so the dictionary length, and therefore the vector length, is at most 5000.
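
To make the counting concrete, here is a from-scratch sketch of the pasta example using only the standard library. The helper names `build_vocabulary` and `bow_vector` are hypothetical, and the second document is made up so that the word "a" appears in the dictionary. Note that CountVectorizer's default tokenizer ignores single-character tokens, so its dictionary would not contain "a".

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted list of unique words across all documents."""
    return sorted({word for doc in documents for word in doc.split()})


def bow_vector(document, vocabulary):
    """Return the word counts of the document, ordered like the vocabulary."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


docs = ["pasta is tasty and pasta is good", "a good pasta"]
vocab = build_vocabulary(docs)
print(vocab)                       # ['a', 'and', 'good', 'is', 'pasta', 'tasty']
print(bow_vector(docs[0], vocab))  # [0, 1, 1, 2, 2, 1]
```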
    


**BINARY BAG OF WORDS**

In binary BoW we do not count the frequency of a word; we just place **1** if the word appears in the review and **0** otherwise. CountVectorizer has a parameter **binary**; setting **binary=True** turns our BoW into a binary BoW.
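
As a small illustration, the binary variant only records presence. The function name `binary_bow_vector` and the hand-written vocabulary below are assumptions made for this sketch.

```python
def binary_bow_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(document.split())
    return [1 if word in present else 0 for word in vocabulary]


vocabulary = ["a", "and", "good", "is", "pasta", "tasty"]
print(binary_bow_vector("pasta is tasty and pasta is good", vocabulary))  # [0, 1, 1, 1, 1, 1]
```

Compared with the count-based vector for the same sentence, the entries for "is" and "pasta" drop from 2 to 1.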
   
  

In [0]:
# Here we use the CountVectorizer from sklearn to create bag of words
count_vect = CountVectorizer(max_features=5000)
bow_data = count_vect.fit_transform(final_X)

In [39]:
final_X[1]

' product arriv label jumbo salt peanut peanut actual small size unsalt sure error vendor intend repres product jumbo'

In [38]:
print(bow_data[1])

  (0, 3438)	2
  (0, 339)	1
  (0, 2479)	1
  (0, 2398)	2
  (0, 3778)	1
  (0, 3230)	2
  (0, 167)	1
  (0, 4024)	1
  (0, 3991)	1
  (0, 4672)	1
  (0, 4322)	1
  (0, 1513)	1
  (0, 4729)	1
  (0, 2304)	1
  (0, 3648)	1


 **Drawbacks of BoW/ Binary BoW**
 
Our main objective with these text-to-vector encodings is that texts with similar meanings should have vectors that are close to each other, but in some cases BoW cannot achieve this.
 
For example, consider the two reviews **This pasta is very tasty** and **This pasta is not tasty**. After stopword removal both sentences are converted to **pasta tasty**, so both get exactly the same vector even though their meanings are opposite.

The main problem is that we are not considering the words before and after each word. This is where the Bigram and Ngram techniques come in.

**BI-GRAM BOW**

Building the dictionary from pairs of consecutive words is called Bi-Gram; Tri-Gram uses three consecutive words, and NGram generalizes this to n consecutive words.

CountVectorizer has a parameter **ngram_range**. If it is set to (1,2), the dictionary contains both single words and Bi-Grams.

But this massively increases our dictionary size.
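
The following sketch shows how word pairs are formed and why they can separate the two pasta reviews. It assumes that "very" and "not" are kept in the text; both are in nltk's English stopword list, so in the pipeline above they would be removed before vectorizing. The helper names `ngrams` and `unigrams_and_bigrams` are hypothetical.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(sentence):
    """Features comparable to ngram_range=(1, 2): single words plus word pairs."""
    tokens = sentence.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


positive = unigrams_and_bigrams("pasta very tasty")
negative = unigrams_and_bigrams("pasta not tasty")
print(positive)  # ['pasta', 'very', 'tasty', 'pasta very', 'very tasty']
print(negative)  # ['pasta', 'not', 'tasty', 'pasta not', 'not tasty']
```

The pair "not tasty" appears only in the negative review, which gives a model a feature that plain BoW on "pasta tasty" cannot provide.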

In [0]:
final_B_X = final_X

In [41]:
count_vect = CountVectorizer(ngram_range=(1,2))
Bigram_data = count_vect.fit_transform(final_B_X)
print(Bigram_data[1])

  (0, 142748)	2
  (0, 11784)	1
  (0, 100430)	1
  (0, 97859)	2
  (0, 155850)	1
  (0, 133854)	2
  (0, 3831)	1
  (0, 165423)	1
  (0, 164485)	1
  (0, 193558)	1
  (0, 177092)	1
  (0, 60852)	1
  (0, 196632)	1
  (0, 95076)	1
  (0, 151689)	1
  (0, 142800)	1
  (0, 11861)	1
  (0, 100490)	1
  (0, 97865)	1
  (0, 155987)	1
  (0, 133898)	1
  (0, 133855)	1
  (0, 4021)	1
  (0, 165627)	1
  (0, 164722)	1
  (0, 193567)	1
  (0, 177168)	1
  (0, 60866)	1
  (0, 196648)	1
  (0, 95087)	1
  (0, 151696)	1
  (0, 143171)	1


**TF-IDF**

**Term Frequency - Inverse Document Frequency** gives less importance to words that occur in most reviews and more importance to words that are frequent in a review but rare across the dataset.

**Term Frequency** is the number of times a **particular word (W)** occurs in a review divided by the total number of words **(Wr)** in the review. The term frequency value ranges from 0 to 1.

**Inverse Document Frequency** is calculated as **log(Total Number of Docs (N) / Number of Docs which contain the particular word (n))**. Here Docs refers to Reviews.


**TF-IDF** is **TF * IDF**, that is **(W/Wr) * log(N/n)**
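
The formula can be checked by hand with a short sketch. It assumes the natural logarithm (the text does not state a base), and the function names and the three toy reviews are made up for illustration. As far as I know, TfidfVectorizer with default settings uses raw counts for TF, a smoothed IDF, and normalizes each row to unit length, so its numbers will not match this textbook version.

```python
import math


def term_frequency(word, document):
    """W / Wr: occurrences of the word divided by the number of words in the review."""
    words = document.split()
    return words.count(word) / len(words)


def inverse_document_frequency(word, documents):
    """log(N / n): N reviews in total, n of which contain the word.

    The word must occur in at least one review, otherwise n is zero.
    """
    containing = sum(1 for doc in documents if word in doc.split())
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    """TF * IDF for one word in one review."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


reviews = ["pasta tasty", "pasta good", "pizza tasty pizza"]
print(tf_idf("pasta", reviews[0], reviews))  # (1/2) * log(3/2)
print(tf_idf("pizza", reviews[2], reviews))  # (2/3) * log(3/1)
```

"pizza" scores higher than "pasta" because it is frequent within its review and occurs in only one of the three reviews.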


 Using scikit-learn's TfidfVectorizer we can get the TF-IDF.

So here too we get a TF-IDF value for every word, and in some cases reviews with different meanings may still be treated as similar after stopword removal. To overcome this we can use Bi-Gram or NGram.

In [42]:
final_tf = final_X
tf_idf = TfidfVectorizer(max_features=5000)
tf_data = tf_idf.fit_transform(final_tf)
print(tf_data[1])

  (0, 3648)	0.27632712135962173
  (0, 2304)	0.25859653983415387
  (0, 4729)	0.22110118037757334
  (0, 1513)	0.2676445880029433
  (0, 4322)	0.14376307914308592
  (0, 4672)	0.27031268989556007
  (0, 3991)	0.14758383587179663
  (0, 4024)	0.14731004824244134
  (0, 167)	0.14731004824244134
  (0, 3230)	0.372643738302882
  (0, 3778)	0.15376521824831518
  (0, 2398)	0.5671036965848731
  (0, 2479)	0.18769096566284565
  (0, 339)	0.15742580595200475
  (0, 3438)	0.18223349846935735
