In [70]:
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

import datetime
from sklearn.model_selection import train_test_split, cross_val_score
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier


# Reading CSV file

The input file is a .csv file with two columns of interest: data and label.

In [71]:
train = pd.read_csv('round2_task_data.csv')

In [72]:
train.head()

Unnamed: 0,data,label
0,"{'id': 'KG0OUA', 'data': 'Good morning', 'mess...",location
1,"{'id': 'L9DC9H', 'data': 'Location', 'message_...",whoAreYou
2,"{'id': 'ZQR6R5', 'data': 'hi', 'message_order'...",whoAreYou
3,"{'id': 'RH0M4E', 'data': 'Hi', 'message_order'...",greeting
4,"{'id': 'WLVX8I', 'data': 'Hello', 'message_ord...",greeting


In [73]:
train['label'].value_counts()

whoAreYou               627
greeting                619
notInterested           474
dontMeetRequirements    160
location                120
Name: label, dtype: int64

In [98]:
train.loc[train.label == 'location' ]

Unnamed: 0,data,label
0,"{'id': 'KG0OUA', 'data': 'Good morning', 'mess...",location
27,"{'id': '2YAC8M', 'data': 'Iska location kya ho...",location
30,"{'id': 'YBEI5V', 'data': 'Ok', 'message_order'...",location
31,"{'id': '0C9Y5K', 'data': 'K', 'message_order':...",location
35,"{'id': 'N84XF2', 'data': 'How r u', 'message_o...",location
58,"{'id': 'GE2L4N', 'data': '📷 G😊😊D Morning', 'me...",location
76,"{'id': 'U237AR', 'data': 'Ok', 'message_order'...",location
84,"{'id': 'AX28KL', 'data': 'Hi', 'message_order'...",location
85,"{'id': 'K7VCBE', 'data': 'I applied', 'message...",location
87,"{'id': 'ID97YE', 'data': 'No', 'message_order'...",location


In [75]:
X = train[['data']]
Y = train['label']
np.shape(Y)

(2000,)

# Extracting required data
Each entry in the data column is a dictionary stored as a string. We are only interested in its 'data' key, whose value is the chat message, so we extract that value from every row and collect the messages in a list.

In [97]:

import ast

X_lst = X.values.tolist()
Y_lst = Y.values.tolist()
datas = []
for dt in X_lst:
    x_l = ast.literal_eval(dt[0])
    datas.append(x_l['data'])
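
As a minimal sketch of what this loop does to a single row, the snippet below parses one dictionary-like string with ast.literal_eval and pulls out the 'data' value. The sample string is made up for illustration (the real rows are truncated in the output above), and extract_message is a hypothetical helper name.

```python
import ast

def extract_message(raw):
    """Parse a dict-like string and return the text stored under 'data'."""
    record = ast.literal_eval(raw)  # evaluates Python literals only, not arbitrary code
    return record['data']

# Made-up row in the same shape as the 'data' column (not taken from the file).
sample_row = "{'id': 'ABC123', 'data': 'Good morning', 'message_order': 1}"
message = extract_message(sample_row)
print(message)
```

ast.literal_eval is used rather than eval because it only accepts literals such as strings, numbers, lists and dictionaries.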


# Bag of Words
To perform machine learning on text documents, we first need to turn the text content into numerical feature vectors that scikit-learn can use. The simplest and most intuitive way to do so is the bag-of-words representation, which ignores structure and simply counts how often each word occurs. CountVectorizer implements the bag-of-words approach by converting a collection of text documents into a matrix of token counts.

In [77]:
vect = CountVectorizer().fit(datas)
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

The default configuration converts each string to lowercase and tokenizes it by extracting words of at least 2 letters or digits, separated by word boundaries. It then builds a vocabulary from these tokens. We can inspect part of the vocabulary with the get_feature_names method:

In [78]:
vect.get_feature_names()[::100]

['0lz', 'bhr', 'drive', 'hr', 'lagana', 'nearest', 'prasent', 'ther', 'దయ']

Looking at this sample of the vocabulary gives a small sense of what the messages are about. Checking the length of get_feature_names shows that we are working with 804 features.

In [79]:
len(vect.get_feature_names())

804
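
To make the default tokenization concrete, here is a small sketch on toy messages (made up for illustration, though short replies such as 'K' and 'Ok' do appear in the location rows above). Because the default token_pattern only keeps tokens of 2 or more characters, single-character words produce no feature at all.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages (illustrative, not read from the dataset).
toy_messages = ['K', 'Ok', 'Hi', 'I applied']
toy_vect = CountVectorizer().fit(toy_messages)

# vocabulary_ maps each kept token to its column index.
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)
```

The tokens 'K' and 'I' are dropped, so a message that consists only of such a token has no features for the classifier to use.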

Next, we transform the documents in datas into a document-term matrix, which gives us the bag-of-words representation of datas. The result is stored in a SciPy sparse matrix, where each row corresponds to a document and each column to a word from the vocabulary.

In [80]:
datas_vectorized = vect.transform(datas)
datas_vectorized.shape

(2000, 804)

In [81]:
datas_vectorized.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
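
The printed array is mostly zeros, which is expected for short chat messages. A sketch of how one might check for rows that are entirely zero (messages with no token in the vocabulary) is shown below on toy data; the variable names are hypothetical, and the same row-sum check could be applied to datas_vectorized.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages (illustrative, not read from the dataset).
toy_messages = ['K', 'Ok', 'Ok ok', 'How r u']
toy_matrix = CountVectorizer().fit_transform(toy_messages)

# Sum the counts in each row; a total of 0 means no token was kept.
row_totals = np.asarray(toy_matrix.sum(axis=1)).ravel()
empty_rows = np.flatnonzero(row_totals == 0)

print(toy_matrix.toarray())
print(empty_rows)
```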

# Machine Learning - Logistic Regression

Now that the data is in vectorized form, we can apply machine learning algorithms and measure the accuracy of the resulting model. Logistic regression is a classification technique that can be described in simple terms: a linear model is fitted to the features, and its output is passed through a logistic function to predict the categorical target variable. Ref.(1)

To fit logistic regression on the five classes, I have used the OneVsRestClassifier. This strategy consists of fitting one classifier per class, where each class is fitted against all the other classes. The same strategy can also be used for multilabel learning, where a classifier predicts multiple labels per instance by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise. Here every message has exactly one label, so this is the multiclass case.

3-fold cross-validation is used: the training data is split into 3 parts (folds). Each fold in turn is treated as the validation set while the model is fitted on the remaining 2 folds, and the 3 validation scores are averaged. This is K-fold cross-validation with k = 3.
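
The splitting can be sketched on nine toy samples with scikit-learn's KFold. This is only an illustration of the idea: when cross_val_score is given an integer cv and a classifier, it uses the stratified variant (StratifiedKFold), which additionally keeps the class proportions similar in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(9).reshape(9, 1)  # nine toy samples, one feature each

fold_sizes = []
for train_idx, valid_idx in KFold(n_splits=3).split(toy_X):
    # Every sample lands in the validation set exactly once.
    fold_sizes.append((len(train_idx), len(valid_idx)))
    print('train:', train_idx, 'validate:', valid_idx)
```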

In [82]:
logreg = LogisticRegression()
ovr = OneVsRestClassifier(logreg)

In [83]:
%%time
ovr.fit(datas_vectorized, Y)

CPU times: user 80.5 ms, sys: 0 ns, total: 80.5 ms
Wall time: 79.7 ms


OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)
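
As a small sketch of what the one-vs-rest wrapper produces, the toy example below (made-up two-feature data with three of the label names) fits the same kind of classifier and inspects it: one binary estimator is stored per class.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data with three classes (illustrative only).
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
toy_y = ['greeting', 'greeting', 'location', 'location', 'whoAreYou', 'whoAreYou']

toy_ovr = OneVsRestClassifier(LogisticRegression()).fit(toy_X, toy_y)

# One binary classifier is fitted per class.
print(toy_ovr.classes_)
print(len(toy_ovr.estimators_))
```

For the five labels in this dataset, the fitted ovr object would hold five such binary classifiers.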

# Model Dumping
The fitted model is saved to a file with cloudpickle.

In [84]:
import cloudpickle
ouf = open('model_LR.txt', 'wb')
cloudpickle.dump(ovr, ouf)
cloudpickle.dump(range(19), ouf)
ouf.close(  )
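
Reading the file back is not shown in this notebook. As a sketch, files written with cloudpickle can be read with the standard pickle module, and because two objects are dumped one after the other, two load calls are needed in the same order. The example uses an in-memory buffer and a placeholder dictionary in place of the file and the fitted model, both of which are assumptions for illustration.

```python
import io
import pickle

buffer = io.BytesIO()                # stands in for the model file
placeholder_model = {'name': 'ovr'}  # stands in for the fitted classifier

# Write two objects back to back, as the cell above does.
pickle.dump(placeholder_model, buffer)
pickle.dump(range(19), buffer)

# Read them back in the order they were written.
buffer.seek(0)
loaded_model = pickle.load(buffer)
loaded_range = pickle.load(buffer)
print(loaded_model, loaded_range)
```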

# Accuracy Calculation

In [85]:
scores = cross_val_score(ovr, datas_vectorized, Y, scoring='accuracy', n_jobs=-1, cv=3)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

Cross-validation mean accuracy 31.40%, std 0.13.
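
To put this number in context, a useful reference point is the accuracy of always predicting the most frequent label. The sketch below computes it from the class counts printed by value_counts() earlier; the variable names are hypothetical.

```python
# Class counts copied from the value_counts() output above.
label_counts = {
    'whoAreYou': 627,
    'greeting': 619,
    'notInterested': 474,
    'dontMeetRequirements': 160,
    'location': 120,
}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total
print('{0}: {1:.2f}%'.format(majority_label, baseline_accuracy * 100))
```

The majority class covers 627 of 2000 messages, about 31.35%, so the cross-validated accuracy is roughly at the level of this trivial baseline.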


# TfidfVectorizer
The features are processed further with the TfidfVectorizer, which gives more weight to informative words and less weight to words that occur in most documents.

The bag-of-words model only counts how often each word occurs in a document, so very frequent words such as 'the' and 'if' can dominate words such as 'buy' and 'product', which matter more because they carry the context. Tf-idf reweights the counts to reduce this effect. Ref.(2)

The cross-validation accuracy with raw counts is 31.40%, which is quite low.

In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer().fit(datas)
len(vect.get_feature_names())

804
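
The reweighting can be seen on a toy corpus (made up for illustration): a word that occurs in every document receives a lower inverse document frequency than a word that occurs in only one. The sketch reads the learned weights from the idf_ attribute of a fitted TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: 'the' occurs in every document, the other words once each.
toy_docs = ['the job', 'the location', 'the salary']
toy_tfidf = TfidfVectorizer().fit(toy_docs)

# Pair each term with its inverse document frequency weight.
idf_by_term = {term: toy_tfidf.idf_[col] for term, col in toy_tfidf.vocabulary_.items()}
for term in sorted(idf_by_term):
    print(term, round(float(idf_by_term[term]), 3))
```

The vocabulary size is unchanged by tf-idf, which is why the length above is still 804; only the values in the matrix differ.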

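The n-gram range is set to (1, 2), so features are extracted for both unigrams and bigrams. Note that the vectorizer fitted in this cell is a CountVectorizer, so the matrix named datas_tfidf holds n-gram counts rather than tf-idf weights.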

In [87]:
vect = CountVectorizer(ngram_range = (1,2)).fit(datas)
datas_tfidf = vect.transform(datas)
len(vect.get_feature_names())

2270
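
The growth from 804 to 2270 features comes from the added bigrams. A minimal sketch on one toy message (made up for illustration) shows which features each setting of ngram_range produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_message = ['good morning sir']
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_message)
uni_and_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(toy_message)

# Single words only, then single words plus adjacent word pairs.
unigram_features = sorted(unigrams.vocabulary_)
bigram_features = sorted(uni_and_bigrams.vocabulary_)
print(unigram_features)
print(bigram_features)
```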

In [88]:
logreg = LogisticRegression()
ovr = OneVsRestClassifier(logreg)

In [89]:
%%time
ovr.fit(datas_tfidf, Y)

CPU times: user 54.5 ms, sys: 3.62 ms, total: 58.1 ms
Wall time: 59.2 ms


OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)

# Model Dumping
The fitted model is saved to a file with cloudpickle.

In [90]:
import cloudpickle
ouf = open('model_LR_tfidf.txt', 'wb')
cloudpickle.dump(ovr, ouf)
cloudpickle.dump(range(19), ouf)
ouf.close(  )

In [91]:
scores = cross_val_score(ovr, datas_tfidf, Y, scoring='accuracy', n_jobs=-1, cv=3)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

Cross-validation mean accuracy 32.00%, std 0.75.


I was expecting better results with tf-idf and n-grams, but unfortunately the accuracy of the model barely changed.

# Naive Bayes classifier for multinomial models

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. Ref.(3)

In [92]:
%%time
model = MultinomialNB()

scores =  cross_val_score(model, datas_tfidf, Y, scoring='accuracy', n_jobs=-1, cv=3)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

Cross-validation mean accuracy 30.90%, std 0.41.
CPU times: user 76.5 ms, sys: 27.2 ms, total: 104 ms
Wall time: 209 ms
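
In the cells above, the vectorizer is fitted on all messages before cross-validation, so its vocabulary is built from messages that later serve as validation data. One common alternative, which is not what this notebook does, is to wrap the vectorizer and the classifier in a scikit-learn Pipeline so that the vectorizer is refitted on the training part of every fold. The sketch below uses toy messages and labels that are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages and labels (illustrative, not read from the dataset).
toy_messages = ['hi', 'hello', 'hi there',
                'where is it', 'which location', 'location please']
toy_labels = ['greeting'] * 3 + ['location'] * 3

# The vectorizer is refitted on the training part of every fold.
toy_pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
toy_scores = cross_val_score(toy_pipeline, toy_messages, toy_labels,
                             scoring='accuracy', cv=3)
print(toy_scores)
```

With raw text passed to cross_val_score, the whole feature extraction step is evaluated together with the classifier.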


# Model Dumping
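The Naive Bayes model is fitted on the full data and saved to a file with cloudpickle.

In [93]:
import cloudpickle
model.fit(datas_tfidf, Y)  # cross_val_score fits clones, so fit the model itself before saving
with open('model_NB_tfidf.txt', 'wb') as ouf:
    cloudpickle.dump(model, ouf)  # save the Naive Bayes model rather than ovr
    cloudpickle.dump(range(19), ouf)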

# Support Vector Classifier

The maximal margin classifier is a very natural way to perform classification if a separating hyperplane exists. However, the existence of such a hyperplane is not guaranteed, and even if it exists, the data may be noisy, so that the maximal margin classifier provides a poor solution. In such cases, the concept can be extended to a hyperplane that almost separates the classes, using what is known as a soft margin.

The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier, where a small proportion of the training sample is allowed to cross the margins, or even the separating hyperplane. Rather than looking for the largest possible margin so that every observation is on the correct side of the margin, thereby making the margins very narrow or non-existent, some observations are allowed to be on the incorrect side of the margins. 

The margin is soft as a small number of observations violate the margin. The softness is controlled by slack variables which control the position of the observations relative to the margins and separating hyperplane. The support vector classifier maximizes a soft margin. Ref.(4)
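
One common way to write the soft-margin problem described above is given below. The notation is an assumption made here rather than taken from Ref.(4), which may use an equivalent form: labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, slack variables $\xi_i$ and a penalty parameter $C > 0$.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n.
```

An observation with $\xi_i = 0$ lies on the correct side of the margin, $0 < \xi_i \le 1$ means it is inside the margin but still on the correct side of the hyperplane, and $\xi_i > 1$ means it is on the wrong side of the hyperplane. A larger $C$ penalizes violations more heavily. LinearSVC exposes a parameter C that plays this role, although by default it penalizes the squared slack (squared hinge loss).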

In [94]:
from sklearn.svm import LinearSVC

In [95]:

%%time
svc = LinearSVC(dual=False)
scores = cross_val_score(svc, datas_tfidf, Y, scoring='accuracy', n_jobs=-1, cv=3)
print('Cross-validation mean accuracy {0:.2f}%, std {1:.2f}.'.format(np.mean(scores) * 100, np.std(scores) * 100))

Cross-validation mean accuracy 31.45%, std 0.92.
CPU times: user 81.1 ms, sys: 51.6 ms, total: 133 ms
Wall time: 231 ms


# Model Dumping
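The support vector classifier is fitted on the full data and saved to a file with cloudpickle.

In [96]:
import cloudpickle
svc.fit(datas_tfidf, Y)  # cross_val_score fits clones, so fit the model itself before saving
with open('model_SVC_tfidf.txt', 'wb') as ouf:
    cloudpickle.dump(svc, ouf)  # save the support vector classifier rather than ovr
    cloudpickle.dump(range(19), ouf)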

Conclusion: With the various ML techniques tried here, the accuracy stays low, at around 30-32%. The labeling of the training data could be one issue. Nevertheless, further preprocessing, such as resampling the data to balance the classes or lemmatization, could lead to better results. We could also try a word2vec model to check whether the accuracy improves any further.

# References
1. https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
2. https://medium.com/@josephroy/when-i-decided-to-work-on-sentiment-analysis-amazon-fine-food-review-kaggle-project-was-quite-3575721a8849
3. http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
4. https://onlinecourses.science.psu.edu/stat857/node/241/