# Feedback Classification using Word2Vec and LSTM

**What is Word2Vec?**

Word2vec is a group of shallow neural network models developed at Google that learn word representations capturing the contexts in which words appear, while also offering an efficient way to turn raw text into numeric features. The model takes a large corpus of documents as input, such as tweets, news articles or customer feedback, and generates a vector space of typically several hundred dimensions. Each unique word in the corpus is assigned its own vector in that space.


The powerful idea behind word2vec is that word vectors that lie close to each other in the vector space represent words that not only have similar meanings but also appear in similar contexts.

What is interesting about the vector representation of words is that it automatically embeds several features that we would normally have to handcraft ourselves. Because word2vec learns these vectors from word co-occurrence patterns with a shallow neural network, a single vector can encode several kinds of relationships at once.

The following image shows some word vectors projected onto 2D space after dimensionality reduction.

![title](w2v_1.jpg)

A couple of things to notice:

On the right chart, words of similar meaning, concept and context are grouped together.

The chart on the left is similar to the one on the right, except that it shows the syntactic relationships between words. For example, slow - slowest = short - shortest.

So, on a more general level, **word2vec** embeds non-trivial semantic and syntactic relationships between words, which preserves a rich context.
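The slow - slowest = short - shortest relationship can be illustrated with a few hand-made vectors. This is only a toy sketch: the 3-dimensional vectors below are hypothetical values chosen for illustration, not embeddings learned by word2vec.

```python
import numpy as np

# Hypothetical toy vectors; the axes loosely mean (slowness, length, superlative).
vectors = {
    "slow":     np.array([1.0, 0.0, 0.0]),
    "slowest":  np.array([1.0, 0.0, 1.0]),
    "short":    np.array([0.0, 1.0, 0.0]),
    "shortest": np.array([0.0, 1.0, 1.0]),
    "long":     np.array([0.0, -1.0, 0.0]),
    "longest":  np.array([0.0, -1.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# slowest - slow + short should land closest to shortest
query = vectors["slowest"] - vectors["slow"] + vectors["short"]
candidates = [w for w in vectors if w not in ("slowest", "slow", "short")]
best = max(candidates, key=lambda w: cosine(query, vectors[w]))
print(best)
```

With real embeddings the match is rarely exact, so the nearest neighbour by cosine similarity is used instead of strict equality.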

Problem Statement:
    
Can we use Word2Vec and an LSTM model to identify the Theme of each feedback based on the semantic context of its text?

The answer is **YES**.

********************************************************

Data used:

1000 raw feedback entries from members regarding their issues.

We labelled 500 of them, and we will now try to learn the hidden embeddings needed to predict the **Theme** of the other 500.

*************************************************

# Approach we will be using:

## Step 1: Data Shuffling

## Step 2: Data Cleaning [Lemmatization, Stop Words Removal, Special Character Removal]

## Step 3: Create a Word2Vec Model from the corpus

## Step 4: Make a feature embedding vector for each feedback

## Step 5: Train the model and predict on test feedback


Important Libraries to import:
    
We will make a virtual environment, and we'll need to install these libraries:

**gensim** is a natural language processing Python library. It makes text mining, cleaning and modeling very easy. Besides, it provides an implementation of the word2vec model.

**Keras** is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano. We'll be using it to train our Theme classifier. In this tutorial, it will run on top of TensorFlow.

**nltk** stands for Natural Language Toolkit. It is one of the most widely used NLP libraries and provides tools for tokenization, stemming, lemmatization, punctuation handling, and character and word counts. We use it here for lemmatization.

**tqdm** is a progress bar utility package that we can use to monitor long-running loops.

In [9]:
# Necessary imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import keras
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers.core import Dense
from keras.layers import Embedding, LSTM
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers.merge import concatenate
from keras.models import Model
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint
#from keras.layers import Conv1D, MaxPooling1D, Embedding
import nltk

## Reading the Data

In [14]:
import pandas as pd
df = pd.read_csv('HS_Theme.csv', encoding = 'latin-1')
df.head()
df['Theme'].nunique()

12

In [3]:
df.columns = ['respId','feedback','rating', 'Theme']  # Renaming the column Names
len(set(df['Theme']))    

13

In [4]:
set(df['Theme'])

{'Dissatisfied with service provided even after multiple visits',
 'Dissatisfied with warranty terms and conditions',
 'Extremely dissatisifed with experience',
 'Others',
 'Parts unavailability or Wrong Part delivered',
 'Repaired product still not working',
 'Request for contact back by customer',
 'Scheduling issue (Call center experience)',
 'Service charges are too costly',
 "Technician code of conduct\n(Didn't inspected thoroughly/Racist comments/nasty/came drunk/improper afterworks)",
 'Technician not having required expertise',
 'Technician not turning up at all or on scheduled time',
 nan}

In [5]:
df.head()

Unnamed: 0,respId,feedback,rating,Theme
0,p3084928113_486992,The repairman left oil smears all over the was...,3,Technician code of conduct\n(Didn't inspected ...
1,p3084928113_457095,The entire process was Long and drawn out th...,1,Dissatisfied with service provided even after ...
2,p3084928113_381320,"To whom may correspond, the technician came in...",1,Request for contact back by customer
3,p3084928113_428015,the dishwasher is still not working,2,Repaired product still not working
4,p3084928113_499227,I have filled out two of these today.is everyt...,1,Others


So this data has 4 columns. <br />
**respId** : Identifier of each feedback <br />
**feedback** : raw feedback that we received from the member <br />
**rating** : rating on a scale of 1-5 provided by the member <br />
**Theme** : 12 Themes defined by us
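
The two counts shown earlier differ (12 from `nunique()`, 13 from `len(set(...))`) because unlabelled rows hold `NaN` in the Theme column: `nunique()` skips missing values by default, while a Python `set` keeps `nan` as one more element. A minimal sketch with a made-up Series (the values are illustrative, not taken from HS_Theme.csv):

```python
import numpy as np
import pandas as pd

# Illustrative data: two distinct Themes plus one missing label.
themes = pd.Series(["Others", "Service charges are too costly", np.nan, "Others"])

n_unique = themes.nunique()                       # NaN is dropped by default
n_set = len(set(themes))                          # NaN counts as one element
n_unique_with_nan = themes.nunique(dropna=False)  # include NaN explicitly

print(n_unique, n_set, n_unique_with_nan)
```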

In [6]:
data = df.copy()
data['feedback'] = data['feedback'].fillna('none')  # Replacing blank/missing feedback with the placeholder 'none' (rows are kept)

### List of all possible Themes defined by us and their count in train data

In [7]:
df['Theme'].value_counts()  # number of labelled feedbacks per Theme

Scheduling issue (Call center experience)                                                                         89
Technician not turning up at all or on scheduled time                                                             69
Extremely dissatisifed with experience                                                                            48
Dissatisfied with service provided even after multiple visits                                                     45
Others                                                                                                            43
Repaired product still not working                                                                                43
Technician not having required expertise                                                                          38
Parts unavailability or Wrong Part delivered                                                                      36
Service charges are too costly                                  

### Data Preprocessing Starts

#### Convert all feedback to lowercase
#### Expand frequently occurring contractions and abbreviations
#### Use a regex to remove special characters, if any
#### Remove custom stop words: words that occur very frequently in our data, like 'the', 'he', 'she', as they add nothing to the context

In [128]:
# Apply Data Filtering
data['feedback'] = data['feedback'].str.lower()  # Converting all text into lower case

# (pattern, replacement) pairs, applied in order as regular expressions.
# Note: each pattern matches anywhere in the text, including inside longer words.
replacements = [
    (r"doesnt", "does not"),
    (r"whats", "what is"),
    (r"canst", "cannot"),
    (r"can't", "cannot"),
    (r"n't", "not"),
    (r"dont", "do not"),
    (r"'s", " is"),
    (r"n\?t", "not"),   # presumably apostrophes mis-encoded as '?'
    (r"\?s", " is"),
    (r"ism", "i am"),
    # Strip punctuation, digits and typographic quotes
    (r'[!"#%\()*+,-./:;<=>?@\[\]^_`{|}~1234567890’”“′‘\\]', ""),
]
for pattern, repl in replacements:
    data['feedback'] = data['feedback'].replace(pattern, repl, regex=True)

#Our custom Stop Words
custom_stop=['a','about','above','after','again','against','all','am','an','and','any','are','aren','as','at','be','because','been','before','being','below','between','both','but','by','d','during','each','few','for','from','further','he','her','here','hers','herself','him','himself','his','i','if','in','into','is','isn','it','its','itself','just','ll','m','ma','me','mightn','mustn','my','myself','needn','now','o','of','on','once','or','other','our','ours','ourselves','out','over','own','re','s','shan','she','should','so','some','such','t','than','that','the','their','theirs','them','themselves','then','there','these','they','this','those','through','to','too','under','until','up','ve','very','wasn','we','were','will','with','won','wouldn','y','yo','your','yours','yourself','yourselves','s','x','n','j','k']
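
To see the cleaning steps on a single string, here is a minimal standalone sketch using only the `re` module. The function name `clean_feedback`, the shortened stop-word set and the sample sentence are assumptions made for illustration; unlike the cell above, this version inserts a space when expanding `n't` and keeps only letters.

```python
import re

# Hypothetical, shortened stop-word set for illustration only.
STOP_WORDS = {"the", "a", "is", "it", "to", "my", "and"}

def clean_feedback(text, stop_words=STOP_WORDS):
    """Lowercase, expand a few contractions, strip non-letters, drop stop words."""
    text = text.lower()
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)      # doesn't -> does not
    text = re.sub(r"'s", " is", text)        # dishwasher's -> dishwasher is
    text = re.sub(r"[^a-z\s]", " ", text)    # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

cleaned = clean_feedback("The dishwasher's STILL not working, and it doesn't drain!")
print(cleaned)  # dishwasher still not working does not drain
```

Note that "not" is deliberately absent from the stop words, since negation matters for Themes such as "Repaired product still not working".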

##### Word lemmatizer to reduce inflected forms of the same verb (e.g. reading, read) to a common root word

In [129]:
from nltk.stem.wordnet import WordNetLemmatizer 
lmtzr = WordNetLemmatizer()

#### Convert each row into a list of words, then apply the WordNetLemmatizer and the stop-word filter to each word

In [130]:
data['feedback'] = data['feedback'].apply(lambda x: list(filter(None, x.split(" "))))  # Creating a list of words
data['feedback'] = data['feedback'].apply(lambda x: [lmtzr.lemmatize(word, 'v') for word in x])  # Applying the lemmatizer (treating words as verbs)
data['feedback'] = data['feedback'].apply(lambda x: [word for word in x if word not in custom_stop])  # Removing stop words
data['feedback'] = data['feedback'].apply(lambda x: " ".join(x))  # Joining the words to get back the feedback

In [131]:
data.head()

Unnamed: 0,respId,feedback,rating,Theme
0,p3084928113_486992,repairman leave oil smear wash machine,3,Technician code of conduct\n(Didn't inspected ...
1,p3084928113_457095,entire process long draw technician order part...,1,Dissatisfied with service provided even after ...
2,p3084928113_381320,whom may correspond technician come look appli...,1,Request for contact back by customer
3,p3084928113_428015,dishwasher still not work,2,Repaired product still not working
4,p3084928113_499227,have fill two todayis everything screw service...,1,Others


### We will remove words that have very few occurrences in the whole corpus, because they might be spelling mistakes. Every word must occur at least three times to be kept in our corpus.

In [132]:
import collections
minimum_count = 3
import itertools
str_frequencies = pd.DataFrame(list(collections.Counter(filter(None,list(itertools.chain(*data['feedback'].str.split(' '))))).items()),columns=['word','count'])
low_frequency_words = set(str_frequencies[str_frequencies['count'] < minimum_count]['word'])
len(low_frequency_words)

1883

In [133]:
data['stemmed_text_data'] = [' '.join(filter(None,filter(lambda word: word not in low_frequency_words, line))) for line in data['feedback'].str.split(' ')]
data.feedback = data['stemmed_text_data']
del data['stemmed_text_data']
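
The frequency filter above can be followed more easily on a tiny corpus. This is a sketch with made-up feedback strings; only `minimum_count` mirrors the notebook.

```python
from collections import Counter

# Made-up, already cleaned feedback strings.
feedbacks = [
    "technician late technician rude",
    "technician late part missing",
    "technician late again",
]
minimum_count = 3

# Count every word across the whole corpus.
counts = Counter(word for line in feedbacks for word in line.split())

# Words seen fewer than minimum_count times are treated as noise.
rare = {word for word, n in counts.items() if n < minimum_count}

filtered = [" ".join(w for w in line.split() if w not in rare) for line in feedbacks]
print(filtered)
```

Only "technician" (4 occurrences) and "late" (3 occurrences) survive here; every other word appears once and is dropped.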

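Next we wrap each list of tokens in a LabeledSentence object (from gensim's doc2vec module), which tags every feedback with a label such as `TRAIN_0`. Word2Vec itself is trained on the plain token lists; the wrappers are only used later, when we build one vector per feedback.
Here's how to do it: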

In [134]:
def labelizeFeedbacks(feedback, label_type):
    labelized = []
    for i,v in tqdm(enumerate(feedback)):
        label = '%s_%s'%(label_type,i)
        labelized.append(LabeledSentence(v, [label]))
    return labelized

In [135]:
data.loc[data.index[499:], 'Theme'] = 'Others'  # Unlabelled rows get the placeholder Theme 'Others'

Encoding each Theme into a number to pass into the model. We will decode it once we get the results to get our original theme.

In [136]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(data['Theme'].astype(str))
data.Theme = le.transform(data['Theme'].astype(str))
#df[cat].astype(str)
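
`LabelEncoder` assigns integer ids to the class names in sorted order, and `inverse_transform` maps the ids back. The plain-Python sketch below is a stand-in that mimics this behaviour on a few Theme names, so the mapping is easy to inspect; it is not the scikit-learn implementation.

```python
# A few Theme names, in arbitrary order.
themes = [
    "Others",
    "Service charges are too costly",
    "Others",
    "Repaired product still not working",
]

classes = sorted(set(themes))                       # sorted distinct class names
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[t] for t in themes]                # like le.transform(...)
decoded = [classes[i] for i in encoded]             # like le.inverse_transform(...)

print(encoded)  # [0, 2, 0, 1]
```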

In [137]:
data.Theme[1:10]

1     0
2     6
3     5
4     3
5    10
6     2
7     8
8     3
9    10
Name: Theme, dtype: int64

In [138]:
data['Theme'].value_counts()

3     542
7      89
11     69
2      48
0      45
5      43
10     38
4      36
8      32
9      28
6      15
1      13
Name: Theme, dtype: int64

In [139]:
data['feedback']=data['feedback'].apply(lambda x : list(filter(None,x.split(" "))))

In [140]:
from gensim.models.doc2vec import LabeledSentence, Doc2Vec

In [141]:
X_train = data.feedback[0:499]
X_test = data.feedback[499:]
Y_train = data.Theme[0:499]
Y_test = data.Theme[499:]

In [142]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

In [143]:
x_train = labelizeFeedbacks(X_train, 'TRAIN')
x_test = labelizeFeedbacks(X_test, 'TEST')
x_whole = labelizeFeedbacks(data.feedback, 'TEST_1')

499it [00:00, 119905.91it/s]
499it [00:00, 264864.30it/s]
998it [00:00, 279284.45it/s]


In [144]:
import gensim                                       # importing the word2vec model gensim class
from gensim.models.word2vec import Word2Vec

### Training word2vec Model

In [145]:
input_size = 300
model1 = Word2Vec(data.feedback , min_count=10 , size = input_size ,  alpha=0.2988 , window=100 , workers=8, seed=2178977)

In [146]:
model1.wv.vocab

{'repairman': <gensim.models.keyedvectors.Vocab at 0x7f0d947de390>,
 'leave': <gensim.models.keyedvectors.Vocab at 0x7f0d947de080>,
 'wash': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2be0>,
 'machine': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2c50>,
 'entire': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2cf8>,
 'process': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2da0>,
 'long': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2e48>,
 'technician': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2f28>,
 'order': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2f60>,
 'part': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2f98>,
 'sear': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2fd0>,
 'cancel': <gensim.models.keyedvectors.Vocab at 0x7f0d947e25c0>,
 'have': <gensim.models.keyedvectors.Vocab at 0x7f0d947e2668>,
 'come': <gensim.models.keyedvectors.Vocab at 0x7f0d947e26a0>,
 'back': <gensim.models.keyedvectors.Vocab at 0x7f0d947e26d8>,
 'look': <gensim.models.keyedvec

In [147]:
model1.wv.most_similar('repairman')  # Checking the words most similar to repairman

[('return', 0.38222774863243103),
 ('representative', 0.38058820366859436),
 ('situation', 0.3534719944000244),
 ('heater', 0.34667131304740906),
 ('report', 0.34176570177078247),
 ('instal', 0.3282763361930847),
 ('enough', 0.3186523914337158),
 ('&', 0.3119112253189087),
 ('finally', 0.30898115038871765),
 ('horrible', 0.2751517593860626)]

In [148]:
model1.wv['repairman']  # The 300-dimensional vector learned for 'repairman'

array([ 1.6827319 , -0.53736347, -0.4222089 ,  1.8111184 ,  0.5805491 ,
       -0.53974885,  1.0798132 , -0.51003444, -0.6268802 , -2.4994645 ,
       -3.2260056 ,  1.9444085 ,  0.47631308,  0.39567935,  1.2158549 ,
       -1.0989057 , -0.5542868 ,  0.0828223 ,  0.5541401 , -1.4545407 ,
        1.6008999 ,  0.951255  , -0.53193015, -0.9689449 , -0.28741553,
       -1.3887663 , -0.33563244, -0.7965994 , -0.28676262,  1.1787198 ,
       -0.94548726,  1.1061553 , -0.9657425 ,  0.51316065,  1.8412879 ,
       -0.01422593,  0.5889897 ,  1.0532249 , -0.44548166,  0.12657626,
       -0.03266321, -0.61646193,  0.6188718 , -0.6116894 , -0.8054796 ,
       -0.13376845, -0.26679343, -0.58641165,  0.44420508, -0.69058216,
        0.3834735 ,  0.2798166 ,  1.2926862 ,  0.58021736, -0.9665022 ,
        1.1386018 ,  0.48485073,  0.5862532 ,  0.30891368, -0.6403976 ,
        0.7439217 ,  0.32564458, -1.5680337 ,  0.68737924,  0.16987216,
       -3.3778696 , -2.25833   ,  2.0282166 , -1.4312823 ,  0.73

Now we will write a function that builds a single vector for each feedback by summing the Word2Vec embeddings of its words.

In [149]:
def buildWordVector(tokens, size):
    """Sum the Word2Vec vectors of all tokens found in the model vocabulary."""
    vec = np.zeros(size).reshape((1, size))
    for word in tokens:
        try:
            vec += model1.wv[word].reshape((1, size))
        except KeyError:  # handling the case where the token is not
                          # in the Word2Vec vocabulary; skip it
            continue
    return vec
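
The function above returns the sum of the word vectors, so longer feedback tends to produce vectors with larger magnitude. A common alternative is to average instead. The sketch below shows that variant with a plain dictionary standing in for the Word2Vec model; `average_word_vector` and `toy_embeddings` are hypothetical names used only for this illustration.

```python
import numpy as np

def average_word_vector(tokens, embeddings, size):
    """Mean of the vectors of known tokens; all zeros if no token is known."""
    vec = np.zeros(size)
    count = 0
    for word in tokens:
        if word in embeddings:      # skip out-of-vocabulary tokens
            vec += embeddings[word]
            count += 1
    return vec / count if count else vec

# Toy 2-dimensional embeddings with made-up values.
toy_embeddings = {
    "technician": np.array([1.0, 0.0]),
    "late": np.array([0.0, 1.0]),
}

avg = average_word_vector(["technician", "late", "unknownword"], toy_embeddings, 2)
empty = average_word_vector(["unknownword"], toy_embeddings, 2)
print(avg)
```

Whether summing or averaging works better for this dataset is an empirical question; the results shown in this notebook use the sum.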

Now we convert x_train and x_test into arrays of feedback vectors using the buildWordVector function. <br />
Scaling each column to zero mean and unit standard deviation is optional; the `scale` calls are left commented out below.

In [150]:
from sklearn.preprocessing import scale 
n_dim = input_size
# To create vector corresponding to each feedback

train_vecs_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, x_train))])
#train_vecs_w2v = scale(train_vecs_w2v) # if we want to scale

test_vecs_w2v =  np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, x_test))])
#test_vecs_w2v = scale(test_vecs_w2v)  # if we want to scale

  
499it [00:00, 5069.60it/s]
499it [00:00, 5574.81it/s]


In [151]:
print(train_vecs_w2v.shape)
print(test_vecs_w2v.shape)

(499, 300)
(499, 300)


In [152]:
# One-hot encoding of the target 
Y_train = pd.get_dummies(Y_train).values

In [153]:
Y_train.shape

(499, 12)
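
`pd.get_dummies` creates one column per value that actually occurs, so the shape (499, 12) relies on all 12 encoded Themes appearing in the training rows. As a hedged alternative, the sketch below builds the one-hot matrix with a fixed number of classes using NumPy; the label values are examples only.

```python
import numpy as np

labels = np.array([0, 6, 5, 3])   # example encoded Themes
n_classes = 12                    # fixed number of Themes

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes, dtype=int)[labels]
print(one_hot.shape)
```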

In [154]:
model = Sequential()
model.add(Dense(196,activation='tanh',input_dim=input_size, kernel_initializer='random_uniform')) # random_uniform
model.add(Dense(12, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 196)               58996     
_________________________________________________________________
dense_8 (Dense)              (None, 12)                2364      
Total params: 61,360
Trainable params: 61,360
Non-trainable params: 0
_________________________________________________________________
None
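
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A small sketch of that arithmetic (`dense_params` is a helper name introduced here for illustration):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(300, 196)   # input_size=300 -> 196 hidden units
output = dense_params(196, 12)    # 196 hidden units -> 12 Themes
total = hidden + output
print(hidden, output, total)      # 58996 2364 61360
```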


In [155]:
model.fit(train_vecs_w2v, Y_train, epochs=20, batch_size=196, verbose=2) #, class_weight = 'auto')

Epoch 1/20
 - 0s - loss: 2.6364 - acc: 0.1623
Epoch 2/20
 - 0s - loss: 1.8475 - acc: 0.3848
Epoch 3/20
 - 0s - loss: 1.4503 - acc: 0.5371
Epoch 4/20
 - 0s - loss: 1.2358 - acc: 0.5972
Epoch 5/20
 - 0s - loss: 1.0961 - acc: 0.6613
Epoch 6/20
 - 0s - loss: 0.9710 - acc: 0.7074
Epoch 7/20
 - 0s - loss: 0.8643 - acc: 0.7635
Epoch 8/20
 - 0s - loss: 0.7910 - acc: 0.7836
Epoch 9/20
 - 0s - loss: 0.7157 - acc: 0.8096
Epoch 10/20
 - 0s - loss: 0.6500 - acc: 0.8477
Epoch 11/20
 - 0s - loss: 0.5988 - acc: 0.8717
Epoch 12/20
 - 0s - loss: 0.5505 - acc: 0.8938
Epoch 13/20
 - 0s - loss: 0.5102 - acc: 0.9018
Epoch 14/20
 - 0s - loss: 0.4737 - acc: 0.9178
Epoch 15/20
 - 0s - loss: 0.4391 - acc: 0.9339
Epoch 16/20
 - 0s - loss: 0.4087 - acc: 0.9479
Epoch 17/20
 - 0s - loss: 0.3818 - acc: 0.9479
Epoch 18/20
 - 0s - loss: 0.3582 - acc: 0.9599
Epoch 19/20
 - 0s - loss: 0.3360 - acc: 0.9679
Epoch 20/20
 - 0s - loss: 0.3157 - acc: 0.9679


<keras.callbacks.History at 0x7f0d9c180470>

Now we compute the predictions on our test data.

In [156]:
np.set_printoptions(suppress=True)
prob = model.predict(test_vecs_w2v)    # class probabilities, one row per feedback
out_ = np.argmax(prob, axis=1)         # index of the most probable class
out_put = le.inverse_transform(out_)   # decode indices back to Theme names

The model gives a probability score for every Theme of each feedback.
We will pick the top 3 Themes for each feedback based on these probability scores.

In [157]:
a = np.argmax(prob,axis=1)
p = prob.max(axis=1)
label_1 = le.inverse_transform(a)

for i in range(test_vecs_w2v.shape[0]):
    prob[i,a[i]] = -1
    
b = np.argmax(prob,axis=1)
q = prob.max(axis=1)
label_2 = le.inverse_transform(b)

for i in range(test_vecs_w2v.shape[0]):
    prob[i,b[i]] = -1

c = np.argmax(prob,axis=1)
r = prob.max(axis=1)
label_3 = le.inverse_transform(c)

labels=np.column_stack((label_1,label_2,label_3))
probabilities=np.column_stack((p,q,r))
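
The cell above finds the top 3 Themes by repeatedly taking the maximum and overwriting it with -1, which modifies `prob` in place. A more compact sketch that leaves the input untouched uses `np.argsort`; the helper name `top_k` and the toy probabilities are assumptions made for this illustration.

```python
import numpy as np

def top_k(prob, k=3):
    """Return (indices, probabilities) of the k largest entries per row, best first."""
    idx = np.argsort(prob, axis=1)[:, ::-1][:, :k]   # sort descending, keep k
    top_p = np.take_along_axis(prob, idx, axis=1)    # gather matching scores
    return idx, top_p

# Toy probabilities for 2 feedbacks over 4 classes.
toy = np.array([[0.1, 0.6, 0.3, 0.0],
                [0.5, 0.2, 0.05, 0.25]])
idx, top_p = top_k(toy)
print(idx)
print(top_p)
```

The columns of `idx` can then be decoded with `le.inverse_transform`, one column at a time, to get the Theme names.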

In [158]:
output= np.column_stack((df.feedback[499:],labels,probabilities))

In [159]:
output = pd.DataFrame(output)

In [162]:
output.columns =['Feedback','Theme_1','Theme_2','Theme_3','prob_Theme_1','prob_Theme_2','prob_Theme_3']

In [163]:
output

Unnamed: 0,Feedback,Theme_1,Theme_2,Theme_3,prob_Theme_1,prob_Theme_2,prob_Theme_3
0,"I get a bit frustrated , when I ha e to spend ...",Technician not having required expertise,Parts unavailability or Wrong Part delivered,Scheduling issue (Call center experience),0.420471,0.178976,0.149998
1,I called the morning of appt. and they said th...,Scheduling issue (Call center experience),Technician not turning up at all or on schedul...,Technician code of conduct\n(Didn't inspected ...,0.739004,0.195259,0.0160344
2,Tech was rude and cursed my family. Did not sp...,Repaired product still not working,Technician code of conduct\n(Didn't inspected ...,Technician not having required expertise,0.38967,0.347107,0.171259
3,I ordered my dish washer fixed. Your computer...,Dissatisfied with service provided even after ...,Extremely dissatisifed with experience,Scheduling issue (Call center experience),0.23868,0.218975,0.176128
4,additional service visit required to address i...,Dissatisfied with service provided even after ...,Scheduling issue (Call center experience),Dissatisfied with warranty terms and conditions,0.396233,0.130867,0.107617
5,The technician from A&E called at 8:45 today f...,Technician not turning up at all or on schedul...,Scheduling issue (Call center experience),Extremely dissatisifed with experience,0.585341,0.377898,0.0111576
6,Call me at (989) 839 0436,Scheduling issue (Call center experience),Technician not turning up at all or on schedul...,Service charges are too costly,0.296956,0.260585,0.133381
7,Tech ' sass. All of them ordered parts for ...,Repaired product still not working,Parts unavailability or Wrong Part delivered,Technician not having required expertise,0.362215,0.308458,0.245629
8,Fifth visit now and 2 hours after the tech lef...,Technician not turning up at all or on schedul...,Others,Scheduling issue (Call center experience),0.959888,0.0127786,0.00879363
9,Tom was not helpful in solving the problem. Af...,Repaired product still not working,Technician not having required expertise,Others,0.496128,0.170389,0.140685
