Machine learning models cannot work with raw text directly, so we need to convert text into numerical vectors. As part of this practice exercise you will implement different techniques to do so.

In this notebook we are going to understand some basic text cleaning steps and techniques for encoding text data. We are going to learn about:
1. **Understanding the data** - See what the data is all about and what should be considered when cleaning it (punctuation, stopwords, etc.).
2. **Basic Cleaning** - We will see which parameters need to be considered when cleaning the data (punctuation, stopwords, etc.) and the code for it.
3. **Techniques for Encoding** - The popular encoding techniques that I have personally come across.
    *           **Bag of Words**
    *           **Binary Bag of Words**
    *           **Bigram, Ngram**
    *           **TF-IDF**( **T**erm  **F**requency - **I**nverse **D**ocument **F**requency)


# 1.Importing Libraries

Libraries used in this notebook along with their version:

google	2.0.3

nltk	3.2.5

numpy	1.18.3

pandas	1.0.3

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from six import string_types
from nltk.corpus import reuters
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize

# 2.Reading the data

We will employ a text categorization dataset based on Reviews. Each article is assigned a specific category.
### Implement the code to load the dataset. (Hint: Use the pandas library to load the csv file.)

In [2]:
df = pd.read_csv("Reviews.csv")

In [3]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [4]:
df.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [5]:
df.shape

(568454, 10)

In [6]:
df = df.dropna()

In [7]:
df.shape

(568411, 10)

In [8]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [9]:
df = df.iloc[:100000,:]

In [10]:
df.shape

(100000, 10)

1. **Understanding the data**

Our main objective with this dataset is to predict whether a review is **Positive** or **Negative** based on the Text.

If we look at the Score column, it has the values 1, 2, 3, 4 and 5. We consider 1 and 2 as Negative reviews and 4 and 5 as Positive reviews.
A Score of 3 is treated as a Neutral review; we delete the neutral rows so that we predict only Positive or Negative.

HelpfulnessNumerator is the number of people who found the review useful, and HelpfulnessDenominator is the number who found it useful plus the number who did not.
It follows that HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator.

In [11]:
df.Score.unique()

array([5, 1, 4, 2, 3], dtype=int64)

In [12]:
df.Score.value_counts()

5    62415
4    14643
1     9317
3     8060
2     5565
Name: Score, dtype: int64

Converting Score values into class label either Positive or Negative.

In [13]:
data = df[df.Score != 3]

In [14]:
data.shape

(91940, 10)

In [15]:
data = data[data.HelpfulnessNumerator<= data.HelpfulnessDenominator]

In [16]:
data.shape

(91938, 10)

In [17]:
data.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...


In [18]:
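data = data.copy()  # work on an explicit copy so adding a column does not raise SettingWithCopyWarning
data['label'] = data['Score'].apply(lambda x: 0 if x in (1, 2) else 1)  # 0 = Negative, 1 = Positive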

In [19]:
data.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,label
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,1
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0


2. **Basic Cleaning**
 
**Deduplication** means removing duplicate rows. It is necessary to remove duplicates in order to get unbiased results. We check for duplicates based on UserId, ProfileName, Time and Text; if all of these values are equal, we remove the duplicate records. (No user can post a review at the exact same time for different products.)
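The deduplication rule described above can be sketched with pandas `drop_duplicates`. This is a minimal illustration on a hypothetical toy frame (the values are made up; only the column names follow the Reviews dataset), not a cell from the original notebook.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Reviews dataset.
toy = pd.DataFrame({
    "ProductId": ["P1", "P2", "P3"],
    "UserId": ["U1", "U1", "U2"],
    "ProfileName": ["Ann", "Ann", "Bob"],
    "Time": [1303862400, 1303862400, 1346976000],
    "Text": ["Great taffy", "Great taffy", "Not as advertised"],
})

# Rows that agree on all four columns are treated as duplicates;
# only the first occurrence is kept.
deduped = toy.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)
print(deduped.shape)
```

On the real data the same call would be applied to `data` before tokenizing.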


We have seen that HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator, so we check this condition and remove the records that violate it.


We convert all words to lowercase and remove punctuation and HTML tags, if any.

**Stemming** - Converting words into their base or stem word (for example, related words such as tastefully and tasty are reduced toward a common stem like 'tasti'). This reduces the vector dimension because similar words are no longer counted separately.

**Stopwords** - Stopwords are unnecessary words: even if they are removed, the sentiment of the sentence doesn't change.

Ex - This pasta is so tasty ==> pasta tasty (This, is and so are stopwords, so they are removed)

To see all the stopwords, see the code cell below.
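As a quick check of the example above, here is a minimal sketch of stopword filtering. It assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` list as a stand-in, because it needs no download; the notebook itself uses NLTK's English list, which differs in size and content.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "This pasta is so tasty"

# Lowercase first, because the stopword list is all lowercase.
kept = [w for w in sentence.lower().split() if w not in ENGLISH_STOP_WORDS]
print(kept)

# Peek at a few entries of the stopword list.
print(sorted(ENGLISH_STOP_WORDS)[:10])
```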

### Create a function called "complaint_to_words" to convert each consumer complaint narrative to individual tokens. (Hint: Use a regular-expression-based tokenizer.)
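One possible shape for such a function is sketched below using only Python's `re` module. The two patterns are assumptions chosen for illustration (a crude HTML tag matcher and a word pattern that keeps apostrophes); the notebook does not specify them.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")        # crude HTML tag matcher (assumption)
TOKEN_RE = re.compile(r"[a-z0-9']+")   # letters, digits and apostrophes (assumption)

def complaint_to_words(text):
    """Strip HTML tags, lowercase the text and split it into tokens."""
    text = TAG_RE.sub(" ", text).lower()
    return TOKEN_RE.findall(text)

print(complaint_to_words("Great taffy at a great price.<br />Don't miss it!"))
```

Applying it to every row, for example `[complaint_to_words(t) for t in data["Text"]]`, gives one token list per review.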

# 3.Basic Cleaning

We will use the above function here to create a list of lists that stores each complaint tokenized into separate words.

## 3.1.Tokenize

In [21]:
Report = data["Text"]
Report.shape

(91938,)

In [22]:
Report

0         I have bought several of the Vitality canned d...
1         Product arrived labeled as Jumbo Salted Peanut...
2         This is a confection that has been around a fe...
3         If you are looking for the secret ingredient i...
4         Great taffy at a great price.  There was a wid...
                                ...                        
100001    The taste is great! especially when you cook i...
100002    This is the best instant noodle I have tried. ...
100003    I don't see how anyone could say anything bad ...
100004    These are very good noodles - better than the ...
100005    Very spicy packaged ramen. Good for someone wh...
Name: Text, Length: 91938, dtype: object

In [23]:
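word = []  # flat list of every token in the whole corpus
for file_id in Report:
    words = word_tokenize(file_id)  # tokens of the current review
    word.extend(words)
len(words)  # token count of the last review only; len(word) gives the corpus total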

40

In [26]:
len(word)

8806686

## 3.2.Lower Case

In [27]:
text_tokens = [w.lower() for w in word]
len(text_tokens)

8806686

## 3.3.Removing Stopwords

### 3.3.1.Removing Punctuation

In [28]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bapan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [29]:
puncList = [";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]

In [30]:
filtered_sentence = [w for w in text_tokens if w not in puncList]

In [31]:
print(len(text_tokens) )
print(len(filtered_sentence))

8806686
8240301


### 3.3.2.Removing the Stop Words

In [32]:
stop_words = set(stopwords.words('english')) 

In [34]:
len(stop_words)

179

In [35]:
filtered_sentence = [w for w in filtered_sentence if w not in stop_words]

In [36]:
len(filtered_sentence)

4628870

## 3.4.Stemming & Lemmatization

### 3.4.1.Stemming

In [38]:
from nltk.stem import PorterStemmer

In [39]:
porter = PorterStemmer()

In [40]:
filtered_sentence = list(map(lambda x : porter.stem(x),filtered_sentence))

In [41]:
len(filtered_sentence)

4628870

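### 3.4.2.Lemmatization

The cells below apply a second stemmer, LancasterStemmer, on top of the Porter stems; a dictionary-based lemmatizer such as NLTK's WordNetLemmatizer is not used here.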

In [42]:
from nltk.stem import LancasterStemmer

In [43]:
lancaster=LancasterStemmer()

In [44]:
filtered_sentence = list(map(lambda x : lancaster.stem(x),filtered_sentence))

In [45]:
len(filtered_sentence)

4628870
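To see how the two stemmers used in this section behave, here is a small side-by-side sketch. The sample words are chosen only for illustration. Both classes are rule-based stemmers; a true lemmatizer (for example NLTK's `WordNetLemmatizer`, which needs the WordNet corpus to be downloaded) would instead map words to dictionary forms.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Print each word with its Porter stem and its Lancaster stem.
for w in ["tasty", "several", "canned", "running"]:
    print(w, porter.stem(w), lancaster.stem(w))
```

Lancaster is generally the more aggressive of the two, which is consistent with the very short stems (such as `sev` and `vit`) that show up after both are applied in sequence.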

## 3.5.PoS

In [49]:
#nltk.pos_tag(filtered_sentence)

# Save the data

In [47]:
filter1 = nltk.pos_tag(filtered_sentence)
df_filter = pd.DataFrame(filter1,columns=['pos_tag', 'tag_type'])

In [48]:
df_filter.head()

Unnamed: 0,pos_tag,tag_type
0,bought,VBN
1,sev,NNS
2,vit,NN
3,can,MD
4,dog,VB


# 4.**Techniques for Encoding**

**BAG OF WORDS**

In BoW we construct a dictionary that contains the set of all unique words in our text review dataset, and we count how often each word occurs. If there are **d** unique words in our dictionary, then every sentence or review is represented by a vector of length **d**, and the count of each word in the review is stored at that word's position in the vector. Such vectors are highly sparse.

Ex. pasta is tasty and pasta is good

**[a]..[and]..[good]..[is]..[pasta]..[tasty]..**   <== This is the dictionary

**[0]..[1]..[1]..[2]..[2]..[1]..**   <== Its vector representation (all the remaining positions, shown as dots, are zeroes)

Using scikit-learn's CountVectorizer we can get the BoW. It is worth checking out all of its parameters; one of them is max_features. Setting max_features=5000 keeps only the 5000 most frequent words in the dictionary, so the dictionary length, and therefore the vector length, will be only 5000.

**BINARY BAG OF WORDS**

In binary BoW we do not count the frequency of a word; we just place **1** if the word appears in the review and **0** otherwise. CountVectorizer has a parameter **binary=True** that turns our BoW into a binary BoW.
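The pasta example can be reproduced with a small sketch like the one below. The second review is a hypothetical addition so that the dictionary is built from more than one document. Note that CountVectorizer's default tokenizer drops one-letter tokens, so a word such as "a" would not enter the dictionary.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["pasta is tasty and pasta is good", "pasta is good"]

# Plain BoW: each cell holds the number of times the word occurs.
bow = CountVectorizer()
counts = bow.fit_transform(reviews).toarray()
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)  # words in column order
print(vocab)
print(counts[0])

# Binary BoW: each cell is 1 if the word occurs at all, else 0.
binary_bow = CountVectorizer(binary=True)
flags = binary_bow.fit_transform(reviews).toarray()
print(flags[0])
```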
   
  

In [51]:
#filtered_sentence

In [53]:
from sklearn.feature_extraction.text import CountVectorizer

In [54]:
vectorizer  = CountVectorizer()
vectorizer.fit(filtered_sentence)

CountVectorizer()

In [58]:
len(vectorizer.vocabulary_)

37733

In [59]:
vector = vectorizer.transform(filtered_sentence)
vector.shape

(4628870, 37733)

In [65]:
vector

<4628870x37733 sparse matrix of type '<class 'numpy.int64'>'
	with 3821044 stored elements in Compressed Sparse Row format>

In [61]:
binary_vectorizer  = CountVectorizer(binary=True)
binary_vectorizer.fit(filtered_sentence)

CountVectorizer(binary=True)

In [63]:
len(binary_vectorizer.vocabulary_)

37733

In [67]:
binary_vector = binary_vectorizer.transform(filtered_sentence)
binary_vector.shape

(4628870, 37733)

In [69]:
binary_vector

<4628870x37733 sparse matrix of type '<class 'numpy.int64'>'
	with 3821044 stored elements in Compressed Sparse Row format>
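Because `filtered_sentence` is a flat list of tokens, the vectorizers above treat every token as a separate document, which is why the matrices have 4,628,870 rows rather than one row per review. A minimal sketch of the per-review alternative follows; the three strings are a hypothetical stand-in for `data["Text"]`, and lowercasing and stopword removal are delegated to CountVectorizer's own options.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for data["Text"]: one raw string per review.
reviews = [
    "I have bought several cans of this dog food.",
    "Product arrived labeled as Jumbo Salted Peanuts.",
    "Great taffy at a great price.",
]

per_review = CountVectorizer(lowercase=True, stop_words="english", max_features=5000)
matrix = per_review.fit_transform(reviews)
print(matrix.shape)  # one row per review, one column per dictionary word
```

With this layout each row can be paired directly with the review's `label` when training a classifier.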

**Drawbacks of BoW / Binary BoW**

Our main objective with these text-to-vector encodings is that texts with similar meanings should have vectors that are close to each other, but in some cases this is not possible with BoW.

For example, consider the two reviews **This pasta is very tasty** and **This pasta is not tasty**. After stopword removal both sentences are reduced to **pasta tasty**, so both receive exactly the same vector even though their meanings are opposite.

The main problem is that we do not consider the words before and after each word. This is where the Bigram and Ngram techniques come in.

## 4.1.N-gram 

### 4.1.1.**BI-GRAM BOW**

Using pairs of consecutive words to build the dictionary is called Bi-Gram; Tri-Gram uses three consecutive words, and NGram generalizes this to n consecutive words.

CountVectorizer has a parameter **ngram_range**; setting it to (1,2) builds the dictionary from both single words and Bi-Grams, while (2,2) uses Bi-Grams only.

However, this massively increases the dictionary size.

In [70]:
CountVec = CountVectorizer(ngram_range=(0,1)) # note: (0,1) yields single words only (same vocabulary size as before); Bi-Grams need ngram_range=(1,2)
bi = CountVec.fit(filtered_sentence)

In [72]:
#bi.vocabulary_

In [73]:
bigram_vector = CountVec.transform(filtered_sentence)
print(bigram_vector.shape)

(4628870, 37733)
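A small sketch of what `ngram_range=(1, 2)` adds, using the two pasta reviews from the drawbacks discussion. Stopwords are deliberately kept here, since words such as "not" and "very" are exactly what the Bi-Grams need to capture.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["this pasta is very tasty", "this pasta is not tasty"]

# (1, 2) keeps single words and adds every pair of consecutive words.
bigram_bow = CountVectorizer(ngram_range=(1, 2))
X = bigram_bow.fit_transform(reviews).toarray()
vocab = sorted(bigram_bow.vocabulary_, key=bigram_bow.vocabulary_.get)
print(vocab)
print(X.shape)
```

The dictionary now contains entries such as "not tasty" and "very tasty", so the two reviews no longer map to the same vector, at the cost of doubling the dictionary size even for this tiny corpus.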


## 4.2.**TF-IDF**

**Term Frequency - Inverse Document Frequency** gives less weight to words that are frequent across the whole corpus and more weight to words that are rare.

**Term Frequency** is the number of times a **particular word (W)** occurs in a review divided by the total number of words **(Wr)** in that review. The term frequency value ranges from 0 to 1.

**Inverse Document Frequency** is calculated as **log(Total Number of Docs (N) / Number of Docs that contain the particular word (n))**. Here Docs refers to Reviews.


**TF-IDF** is **TF * IDF**, that is **(W/Wr) * log(N/n)**.


Using scikit-learn's TfidfVectorizer we can get the TF-IDF.

Even here we get one TF-IDF value per word, so in some cases reviews with different meanings may still look similar after stopword removal. To overcome this we can use Bi-Gram or NGram.
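The formula above can be worked through by hand on a toy corpus. The sketch below implements the textbook form **(W/Wr) * log(N/n)** with the natural logarithm; the three tokenized reviews are hypothetical. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this version.

```python
import math

# Hypothetical toy corpus: each review is already cleaned and tokenized.
reviews = [
    ["pasta", "tasty"],
    ["pasta", "good"],
    ["pizza", "tasty", "tasty"],
]

def term_frequency(word, review):
    # W / Wr: occurrences of the word divided by the review length
    return review.count(word) / len(review)

def inverse_document_frequency(word, corpus):
    # log(N / n); assumes the word occurs in at least one review
    containing = sum(1 for review in corpus if word in review)
    return math.log(len(corpus) / containing)

def tf_idf(word, review, corpus):
    return term_frequency(word, review) * inverse_document_frequency(word, corpus)

for word in ["pizza", "tasty"]:
    print(word, round(tf_idf(word, reviews[2], reviews), 4))
```

In the third review "pizza" occurs once and "tasty" twice, yet "pizza" gets the higher score because it appears in only one review of the corpus.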

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [75]:
# Note: ngram_range=(0,1) yields single words only; use (1,2) to add Bi-Grams
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True,
                        smooth_idf=True,
                        ngram_range=(0,1),stop_words='english')
tf_idf_vec_smooth.fit(filtered_sentence)
len(tf_idf_vec_smooth.vocabulary_)

37456

In [76]:
tf_idf_vector = tf_idf_vec_smooth.transform(filtered_sentence)
tf_idf_vector.shape 

(4628870, 37456)

## 4.3.Word2Vec

Gensim is a free-to-use Python library. It provides APIs for various natural language processing problems. It is fast, scalable and robust.

In this practice exercise we will train our own Word2Vec model using the gensim Word2Vec API. The objectives of this practice exercise are:


1.   Train your word2vec word embedding model.
2.   Visualize the trained word embedding model using principal component analysis.


The first step is to load the corpus, clean it and tokenize it.

Libraries used in this notebook along with their version:

google	2.0.3

matplotlib	3.2.1

numpy	1.18.3

pandas	1.0.3

In [78]:
pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-win_amd64.whl (24.0 MB)
Collecting Cython==0.29.23
  Downloading Cython-0.29.23-cp38-cp38-win_amd64.whl (1.7 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Installing collected packages: Cython, smart-open, gensim
  Attempting uninstall: Cython
    Found existing installation: Cython 0.29.21
    Uninstalling Cython-0.29.21:
      Successfully uninstalled Cython-0.29.21
Successfully installed Cython-0.29.23 gensim-4.1.2 smart-open-5.2.1
Note: you may need to restart the kernel to use updated packages.


In [79]:
from gensim.test.utils import common_texts

In [80]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Next step is to import the Word2Vec model from gensim.

In [81]:
from gensim.models import Word2Vec

##### Create your own model using the data_list defined above and gensim Word2Vec API. (Hint: https://radimrehurek.com/gensim/models/word2vec.html)

In [82]:
type(Report)

pandas.core.series.Series

In [83]:
train = Report.head(2000)

In [84]:
document = []
for val in train:
    # Word2Vec expects each sentence as a list of tokens, not a raw string
    document.append(word_tokenize(val.lower()))


In [85]:
len(document)

2000

In [91]:
model = Word2Vec(sentences=document, vector_size=4, window=1, min_count=1, workers=4)
model.save("word2vec.model")

##### Use the PCA algorithm from sklearn to convert the high-dimensional word embeddings to two dimensions and save them in the variable "results".
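A minimal sketch of the PCA step is shown below. To keep it self-contained it uses a random matrix as a stand-in for the trained embeddings (an assumption for illustration only); with a trained gensim 4.x model the matrix would be `model.wv.vectors` and the matching words `model.wv.index_to_key`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a trained embedding matrix (hypothetical random values):
# 6 "words", each with a 4-dimensional vector, matching vector_size=4.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))
words = ["pasta", "tasty", "good", "taffy", "dog", "food"]  # hypothetical labels

# Project the vectors onto their first two principal components.
pca = PCA(n_components=2)
results = pca.fit_transform(embeddings)
print(results.shape)
```

For the visualization, `results[:, 0]` and `results[:, 1]` can be passed to `plt.scatter`, with each point labeled through `plt.annotate` using the corresponding entry of `words`.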

##### Visualizing the word embeddings.

# 5.Emotion and Sentiment Analysis