# Live Chat Natural Language Processing

From the conversations recorded on Live chat, we are going to develop a model which can classify whether or not a conversation should be rated good.

This should help reclassify the conversations which were not rated, which in turn supports the overall text analysis (done in a later notebook). That analysis should lead to improved customer support and help to create automation by preempting customer requests.

It will also help in the generation of macros and may lead to the start of an Answer Bot.

Let's get started by loading up the data with our usual imports (and perhaps a few others, like re for delimiting) and creating our dataframe.

The text analysis model we will be following for today can be found here

https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

Unfortunately because of the size of the table and for data protection I can't load the data set for you to repeat the process like I usually do. Following the comments though should provide enough insight for you to complete the process on your own dataset.
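Since the real table can't be shared, one way to follow along is to build a tiny stand-in with the same column names. This is only a sketch: every id, message and rating below is invented, and `df_demo` is a hypothetical name that does not appear in the original notebook.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in this notebook.
# The ids, text and ratings are invented for illustration only.
rows = [
    ("AAA0000001", "Hello. How may I help you?", "operator", "rated_good"),
    ("AAA0000001", "I need to reset my password", "visitor", "rated_good"),
    ("AAA0000002", "Hello. How may I help you?", "operator", "not_rated"),
    ("AAA0000002", "My order has not arrived", "visitor", "not_rated"),
    ("AAA0000003", "Hello. How may I help you?", "operator", "rated_bad"),
    ("AAA0000003", "Nobody answered my last chat", "visitor", "rated_bad"),
]
df_demo = pd.DataFrame(rows, columns=["id", "text", "user_type", "rate"])
print(df_demo.head())
```

A frame like this is far too small to train anything useful, but it is enough to check that each step runs before pointing the code at a real export.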


In [1]:
#this will keep all our graphs in the page
%matplotlib inline

# a few libraries that we will need

import numpy as np 
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime

#set up pandas table display
pd.set_option ("display.width", 500)
pd.set_option ("display.max_columns", 100)
pd.set_option ("display.notebook_repr_html", True)
import seaborn as sns #gives us a bit more style in our plots

#We need this to look at excel files
import openpyxl

from pandas import DataFrame, read_excel, merge

file_name = r"C:\Users\Mrs Farrelly\Documents\James\LiveChat\Live_chat_conversations_public.xlsx"

#table1 = chats_report_October18
#this line is key...it turns a 'dd/mm/yyyy hh:mm' string into an actual datetime
#(pd.datetime no longer exists in recent pandas, so we call datetime.strptime directly)
dateparse = lambda x: datetime.strptime(x, '%d/%m/%Y %H:%M')
#df_chats = pd.read_csv(file_name, header = 0, parse_dates = ["chat start date Europe/London"], date_parser = dateparse)


#let's create a dataframe from Sheet2 of our workbook
#date_parser is dropped here: it had no effect without parse_dates, so the date column still loads as text
df_text = pd.read_excel(file_name, sheet_name = "Sheet2", header = 0)

df_text.head()

Unnamed: 0,Page,id,text,date,timestamp,user_type,rate,duration,skill
0,1114,PI83U5GYOV,Hello. How may I help you?,"Mon, 11/12/18 03:06:31 pm",1542035191,operator,not_rated,317,11
1,1114,PI83U5GYOV,I’m trying to change my password,"Mon, 11/12/18 03:06:51 pm",1542035211,visitor,not_rated,317,11
2,1114,PI83U5GYOV,Okay Drew,"Mon, 11/12/18 03:07:07 pm",1542035227,operator,not_rated,317,11
3,1114,PI83U5GYOV,I’ve gone ho,"Mon, 11/12/18 03:07:37 pm",1542035257,visitor,not_rated,317,11
4,1114,PI83U5GYOV,page but nothing on account to say about pass...,"Mon, 11/12/18 03:07:58 pm",1542035278,visitor,not_rated,317,11
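The date strings in the output above do not appear to match the '%d/%m/%Y %H:%M' pattern used by dateparse. If real datetimes are needed later, something like the sketch below may work. The format string is an assumption read off the sample rows (weekday, month/day/two-digit year, 12-hour clock), and it is cross-checked against the `timestamp` column, which looks like seconds since 1970.

```python
import pandas as pd

# Two date strings and their epoch timestamps, copied from the sample rows above.
date_text = pd.Series(["Mon, 11/12/18 03:06:31 pm", "Mon, 11/12/18 03:06:51 pm"])
epoch_seconds = pd.Series([1542035191, 1542035211])

# Assumed layout: weekday, month/day/two-digit year, 12-hour clock with am/pm.
parsed = pd.to_datetime(date_text, format="%a, %m/%d/%y %I:%M:%S %p")

# The timestamp column gives the same instants as seconds since 1970 (UTC).
from_epoch = pd.to_datetime(epoch_seconds, unit="s")

print(parsed)
print((parsed == from_epoch).all())
```

If both columns agree on your own export as well, the numeric `timestamp` column is probably the simpler one to convert.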


In [2]:
list(df_text)

['Page',
 'id',
 'text',
 'date',
 'timestamp',
 'user_type',
 'rate',
 'duration',
 'skill']

Create a new dataframe based on just three columns

In [3]:
#df.groupby('id').agg(lambda x: x.tolist())
#s.str.cat(sep=', ')
#new = old.filter(['A','B','D'], axis=1)
#'I, will, hereby, am, gonna, going, far, to, do, this'
df_text2 = df_text.filter(["id", "text","rate"])
df_text2.head()

Unnamed: 0,id,text,rate
0,PI83U5GYOV,Hello. How may I help you?,not_rated
1,PI83U5GYOV,I’m trying to change my password,not_rated
2,PI83U5GYOV,Okay Drew,not_rated
3,PI83U5GYOV,I’ve gone ho,not_rated
4,PI83U5GYOV,page but nothing on account to say about pass...,not_rated


Okay, so now we need to get all the text for each conversation onto one line and replace the words 'not_rated' with a 1, 'rated_good' with a 2 and 'rated_bad' with a 3. The code below will help with this.

In [5]:
#df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
#df_text2 = df_text2.groupby(["id"])["text"].s.str.cat(sep = " ")
df_text2 = df_text2.groupby(["id","rate"])["text"].apply(lambda x: ' '.join(x)).reset_index()
df_text2 = df_text2.replace(['not_rated','rated_good','rated_bad'], ['1','2','3'])
#df['BrandName'].replace(['ABC', 'AB'], 'A')
df_text2.head()

Unnamed: 0,id,rate,text
0,NAVJWP07JZ,2,"Hello, would you like to talk about our produc..."
1,NAYI9C8DOY,1,Hello Cliff. How may I help you? What is it li...
2,NBILEBQTKW,1,"Hello, how may I help you? Lee here Hi Lee :) ..."
3,NBVFJEWUEY,2,Hello ben. How may I help you? I'd like help s...
4,NE456RGIDZ,1,Hello George. How may I help you? Hello George...
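The same two steps can be tried on a handful of invented messages. This sketch uses a dictionary with `Series.map` on the `rate` column only, rather than `DataFrame.replace` on the whole frame, so that chat text can never be altered by accident. The names `messages`, `chats` and `label_map` are made up for the example.

```python
import pandas as pd

messages = pd.DataFrame({
    "id":   ["c1", "c1", "c2", "c2", "c2"],
    "text": ["Hello", "I need help", "Hi", "My login fails", "Thanks"],
    "rate": ["rated_good", "rated_good", "not_rated", "not_rated", "not_rated"],
})

# One row per conversation: join every message of a chat with a space.
chats = (
    messages.groupby(["id", "rate"])["text"]
    .apply(" ".join)
    .reset_index()
)

# Map the labels on the rate column only, so chat text is never touched.
label_map = {"not_rated": "1", "rated_good": "2", "rated_bad": "3"}
chats["rate"] = chats["rate"].map(label_map)
print(chats)
```

Note that the labels stay strings ('1', '2', '3'), exactly as in the cell above; scikit-learn classifiers accept string labels.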


In [6]:
print (df_text2.shape)

(27829, 3)


In [7]:
#Let's take a look at the text of one conversation (row 3)
print("\n".join(df_text2.text[3].split("\n")[:3])) #prints the first three lines of that conversation's text


Hello ben. How may I help you? I'd like help sorting something please ok Hello ben. How may I help you? I'd like help sorting something please ok Hello ben. How may I help you? I'd like help sorting something please ok


# Training

The first thing we have to do is assign part of the data to a training set and the rest to a testing set. We don't need to worry about order or anything like that to begin with, since the rows are sorted by "id": the first 90% of rows go to training and the last 10% are held back for testing.

In [8]:
#df.iloc[int(len(df)*0.33):int(len(df)*0.66)]
#df[(df.index>np.percentile(df.index, 33))
text_train = df_text2[(df_text2.index < np.percentile(df_text2.index, 90))]
text_train.shape

(25046, 3)
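The positional split relies on the ids being unrelated to the rating. If that is in doubt, a shuffled, stratified split is a common alternative. The sketch below uses scikit-learn's `train_test_split` on an invented frame; `demo` and its contents are assumptions, not the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in for df_text2: 20 chats, 3 in 4 of them labelled "1".
demo = pd.DataFrame({
    "text": [f"chat number {i}" for i in range(20)],
    "rate": ["1", "1", "1", "2"] * 5,
})

# Hold out 10% at random, keeping the label mix similar in both parts.
train, test = train_test_split(
    demo, test_size=0.1, stratify=demo["rate"], random_state=42
)
print(train.shape, test.shape)
```

Fixing `random_state` keeps the split reproducible between runs.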

Scikit-learn has a high-level component, ‘CountVectorizer’, which will create feature vectors for us.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
#we must remember to pick the text column from our training data
X_train_counts =count_vect.fit_transform(text_train.text)
X_train_counts.shape

(25046, 39399)

In [10]:
X_train_counts

<25046x39399 sparse matrix of type '<class 'numpy.int64'>'
	with 1871393 stored elements in Compressed Sparse Row format>
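To see what those 39399 columns mean, it can help to run CountVectorizer on a few invented sentences. In this sketch each column is one distinct token, and with the default settings single-character tokens such as "I" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "hello how may I help you",
    "I need help with my password",
    "password reset help please",
]
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# vocabulary_ maps each token to its column index in the count matrix.
help_col = vect.vocabulary_["help"]
print(counts.shape)
print(counts[:, help_col].toarray().ravel())
```

The matrix is sparse because most chats use only a small fraction of the vocabulary.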

In [11]:
#the amended original script I am following
#from sklearn.feature_extraction.text import CountVectorizer
#count_vect = CountVectorizer()
#X_train_counts = count_vect.fit_transform(twenty_train.data)
#X_train_counts.shape

TF: Just counting the number of words in each document has one issue: it gives more weight to longer documents than to shorter ones. To avoid this, we can use term frequencies (TF), i.e. #count(word) / #total words, in each document.

TF-IDF: Finally, we can also reduce the weight of common words (the, is, an etc.) which occur in almost every document. This is called TF-IDF, i.e. term frequency times inverse document frequency.

We can achieve both using the lines of code below:

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(25046, 39399)
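With its default settings (`smooth_idf=True`, `norm='l2'`), TfidfTransformer weights each count by idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales every row to unit length, rather than dividing by the total word count. The sketch below re-derives the numbers by hand on an invented 2 x 2 count matrix to make that concrete.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 2 documents x 2 terms.
# Term 0 occurs in both documents, term 1 only in the first.
counts = np.array([[1, 1],
                   [1, 0]])

tfidf = TfidfTransformer().fit(counts)
weights = tfidf.transform(counts).toarray()

# Re-derive the same numbers by hand.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.round(weights, 3))
```

In the first document the rarer term ends up with the larger weight, which is the whole point of the IDF factor.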

There are various algorithms which can be used for text classification. We will start with the simplest one, ‘Naive Bayes (NB)’ (don’t think it is too naive! 😃)

You can easily build an NB classifier in scikit-learn using the 2 lines of code below (note: there are many variants of NB, but discussing them is out of scope):

In [13]:
from sklearn.naive_bayes import MultinomialNB
#sometimes the rate column is referred to as the target
clf = MultinomialNB().fit(X_train_tfidf, text_train.rate)

This will train the NB classifier on the training data we provided.

Building a pipeline: We can write less code and do all of the above, by building a pipeline as follows:

In [14]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB()), ])

text_clf = text_clf.fit(text_train.text, text_train.rate)

The names ‘vect’ , ‘tfidf’ and ‘clf’ are arbitrary but will be used later.
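The step names matter because they are how individual steps and their parameters are addressed afterwards. A short sketch: `named_steps` looks a step up by name, and parameters are written as the step name, a double underscore, then the parameter name. The variable `pipe` is made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Each step is reachable by the name it was given.
print(type(pipe.named_steps["clf"]).__name__)

# Parameters of a step are addressed as <step name>__<parameter name>.
pipe.set_params(vect__ngram_range=(1, 2), clf__alpha=0.01)
print(pipe.get_params()["clf__alpha"])
```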

# Test

Performance of NB Classifier: Now we will test the performance of the NB classifier on the test set.

In [15]:
#let's start by assigning the test data: the last 10% of rows
#no shuffling is done here; the split is purely positional
#remember the column names: data = text and target = rate
text_test = df_text2[(df_text2.index >= np.percentile(df_text2.index, 90))]
predicted = text_clf.predict(text_test.text)
np.mean(predicted == text_test.rate)

0.803090190441969

The accuracy we get is ~80.31%, which is not bad for a start and for a naive classifier. Also, congrats!!! You have now successfully written a text classification algorithm 👍

This is actually way better than I thought I would get

# Support Vector Machines (SVM):

We can try a new approach using SVM and see if we can do any better

In [16]:
from sklearn.linear_model import SGDClassifier
#loss='hinge' makes SGDClassifier behave as a linear SVM
#n_iter has been removed from scikit-learn; max_iter=5 with tol=None runs the same five passes
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, max_iter=5, tol=None, random_state=42)),
])
_ = text_clf_svm.fit(text_train.text, text_train.rate)
predicted_svm = text_clf_svm.predict(text_test.text)
np.mean(predicted_svm == text_test.rate)



0.803090190441969

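Wow, exactly the same value... that's weird. Two different models rarely agree to this many decimal places, so it is worth checking whether both are simply predicting the most common rating for every chat.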
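Whether that is what happened here can't be told from the accuracy alone. A minimal sketch of the check, using invented labels: compare the accuracy with the share of the most common label, and look at the confusion matrix. On the real data, `y_true` would be `text_test.rate` and `y_pred` would be `predicted` or `predicted_svm`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels: 8 of 10 test chats are "1" (not_rated).
y_true = np.array(["1", "1", "1", "1", "1", "1", "1", "1", "2", "3"])

# A model that has collapsed to the majority class predicts "1" everywhere.
y_pred = np.full_like(y_true, "1")

accuracy = np.mean(y_pred == y_true)
majority_share = np.mean(y_true == "1")
print(accuracy, majority_share)

# Rows are true labels, columns are predicted labels, in the order given.
matrix = confusion_matrix(y_true, y_pred, labels=["1", "2", "3"])
print(matrix)
```

If the accuracy equals the majority share and only one column of the matrix is non-zero, the model has learned nothing beyond the class frequencies.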

# GridSearchCV

Almost all the classifiers will have various parameters which can be tuned to obtain optimal performance. Scikit gives an extremely useful tool ‘GridSearchCV’.

In [17]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

Here, we are creating a dictionary of parameters for which we would like to do performance tuning. Every parameter name starts with the name of a pipeline step (remember the arbitrary names we gave), followed by a double underscore and the parameter itself. E.g. vect__ngram_range; here we are telling the search to try unigrams only and unigrams plus bigrams, and to choose whichever is optimal.

Next, we create an instance of the grid search by passing the classifier, the parameters and n_jobs=-1, which tells it to use all available cores on the user's machine.

In [18]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(text_train.text, text_train.rate)

This might take a few minutes to run, depending on the machine configuration.

Lastly, to see the best mean score and the params, run the following code:

In [19]:
print (gs_clf.best_score_)
gs_clf.best_params_

0.9450610875988181


{'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}

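The best mean cross-validated accuracy for the NB classifier is ~94.51% (not so naive anymore! 😄) and the corresponding parameters are {‘clf__alpha’: 0.01, ‘tfidf__use_idf’: False, ‘vect__ngram_range’: (1, 1)}. Note that best_score_ is averaged over cross-validation folds of the training data, so it is not directly comparable with the ~80.31% measured on the held-out test set.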
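To compare like with like, the tuned pipeline can be scored on chats it has never seen. The sketch below does this end to end on invented chats; `cv=2` and the tiny corpus are assumptions chosen only to keep the example fast, and the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented chats: label "2" for thankful ones, "3" for complaints.
good = ["thanks that fixed it", "great help thank you", "perfect thanks a lot",
        "thank you so much", "brilliant thanks"]
bad = ["this is useless", "still broken and nobody helps", "terrible service",
       "not working at all", "useless and slow"]
train_text = good[:4] + bad[:4]
train_rate = ["2"] * 4 + ["3"] * 4
test_text = [good[4], bad[4]]
test_rate = ["2", "3"]

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])
grid = {"vect__ngram_range": [(1, 1), (1, 2)], "clf__alpha": (1e-2, 1e-3)}

search = GridSearchCV(pipe, grid, cv=2).fit(train_text, train_rate)

# best_score_ is the mean cross-validated accuracy on the training data.
print(search.best_score_, search.best_params_)

# score() evaluates the refitted best pipeline on chats it has not seen.
held_out_accuracy = search.score(test_text, test_rate)
print(held_out_accuracy)
```

On the real data the equivalent call would be `gs_clf.score(text_test.text, text_test.rate)`.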

That's pretty much it for now. You can now use this model to make fairly accurate predictions against new Live chats, or go back over a different data set and reclassify existing chats to be used in either response automation or personnel management. The next step from here is to get NLTK and start breaking down the text.
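As a rough illustration of that last step, a fitted pipeline takes raw strings directly, so classifying new chats is a single call. The training chats below are invented and `model` is a hypothetical name; in the notebook the fitted `text_clf` or `gs_clf` would take its place.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented training chats: "2" = rated_good, "3" = rated_bad.
train_text = ["thanks great help", "thanks perfect", "great thanks",
              "useless service", "terrible useless", "terrible service"]
train_rate = ["2", "2", "2", "3", "3", "3"]

model = Pipeline([("vect", CountVectorizer()),
                  ("tfidf", TfidfTransformer()),
                  ("clf", MultinomialNB())]).fit(train_text, train_rate)

new_chats = ["thanks great", "terrible service"]
labels = model.predict(new_chats)
# One probability per class, in the order given by model.classes_.
probabilities = model.predict_proba(new_chats)

for chat, label, probs in zip(new_chats, labels, probabilities):
    print(chat, label, probs.round(2))
```

The probabilities give a rough sense of how confident the model is, which can help decide which reclassified chats deserve a manual look.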