# Proof of Concept of different binary text classifiers on our data

In the world of ML, there is a vast amount of choice when it comes to which classification method to use.
In this notebook, I will be demonstrating how 4 of the most popular classifiers perform:

- <b>Multinomial Naive Bayes</b>, the 'punching bag' benchmark classifier of the ML world, which estimates per-class word probabilities from the training data and applies Bayes' rule to predict the class of a new data point,

- <b>Logistic Regression</b>, a method that learns a linear decision boundary and uses the sigmoid function to turn a new data point's signed distance from that boundary into a class probability,

- <b>K-nearest-neighbours</b>, where the classification of X is a vote of the K nearest items to X,

- <b>SVM</b>, a method that treats each data point as coordinates in an $m$-dimensional space, $m$ being the number of features, and tries to find a hyperplane that separates the classes.
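To make two of the ideas above concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `knn_vote` are my own for illustration and are not part of scikit-learn; the real classifiers do considerably more than this.

```python
import math
from collections import Counter

def sigmoid(z):
    """Map a signed score (distance from the decision boundary) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def knn_vote(neighbour_labels):
    """Return the most common label among the K nearest neighbours."""
    return Counter(neighbour_labels).most_common(1)[0][0]

# A point exactly on the boundary gets probability 0.5
print(sigmoid(0.0))

# A point far on the positive side gets a probability close to 1
print(sigmoid(3.0))

# Three of the five nearest neighbours are labelled 1, so the vote is 1
print(knn_vote([1, 0, 1, 1, 0]))
```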

The dataset that we are using in this notebook is very small, containing only 1600 data points, 800 of each class.

However, it is a good dataset for a POC, because feature extraction is fast, the classes are balanced, and the labels are gold standard. This means we can quickly compare multiple classifiers and feature sets to get a feel for how they perform.

For producing a reliable model to serve on our API, it is not a good choice, as a model trained on so little data is unlikely to generalize well.

Without further ado, let us begin.

I start by importing some modules for data processing and manipulation, NumPy and pandas.

I also import a helper function to process the raw data into a frame for us, for the sake of clarity.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from get_df import get_data_frame

Great. Let's take a look at our data:

In [2]:
df = get_data_frame()
df.head(12)

     sentiment                                             review deceptive
0            0  We stayed at the Schicago Hilton for 4 days an...         1
1            0  Hotel is located 1/2 mile from the train stati...         1
2            0  I made my reservation at the Hilton Chicago be...         1
3            0  When most people think Hilton, they think luxu...         1
4            0  My husband and I recently stayed stayed at the...         1
5            0  My wife and I booked a room at the Hilton Chic...         1
6            0  For a hotel rated with four diamonds by AAA, o...         1
7            0  I had high hopes for the Hilton Chicago, but I...         1
8            0  We booked a room at the Hilton Chicago for two...         1
9            0  I've stayed at other hotels in Chicago, but th...         1
10           0  During my stay at the Hilton Chicago it has be...         1
11           0  I stayed two nights at the Hilton Chicago.

Sweet. We have 3 columns: 
- Sentiment (0 is negative, 1 is positive)
- Review, our review text,
- Deceptive (0 is genuine, 1 is deceptive)

Sentiment and Deceptive are pre-labelled for us.
We will be focusing on the deceptive column, the label we wish to predict.

Let's separate our data from the labels:

In [3]:
X = df['review']
y = np.asarray(df['deceptive'], dtype=int)

These classifiers only work on numeric features, not the strings that our reviews are currently represented by. To represent our reviews as numeric features, we use a Bag of Words model. 

CountVectorizer takes our reviews and returns an $m$-dimensional vector for each one, where $m$ is our vocabulary size and entry $i$ is the number of times word $i$ appears in that review (0 if it does not appear).
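As a quick illustration of what these count vectors look like, here is a small sketch on two made-up reviews. The names `toy_reviews`, `toy_cv` and `toy_counts` are hypothetical and chosen so they do not clash with the notebook's own variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, not taken from our dataset
toy_reviews = ["great hotel great staff", "terrible hotel"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews)

# vocabulary_ maps each word to its column index
print(sorted(toy_cv.vocabulary_.items(), key=lambda kv: kv[1]))

# One row per review, one column per word; entries are counts, not just 0/1
print(toy_counts.toarray())
```

Note that "great" gets a 2 in the first row, because it appears twice in the first review.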

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()


Let's take a look at the shape of our data after this transformation.

In [13]:
print(cv.fit_transform(X).shape)

(1600, 9571)


In this example, 1600 is the number of reviews in our data, and 9571 the number of distinct words (the vocabulary size).
Let's see what happens if we remove stop words:

In [38]:
cv = CountVectorizer(stop_words='english')
print(cv.fit_transform(X).shape)

(1600, 9284)


That should help our classifier slightly. Stop words such as "the" and "and" appear in almost every review regardless of class, so they mostly add noise.
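To see which words actually get dropped, here is a small sketch on a single made-up sentence. The variable names are hypothetical; only `CountVectorizer` and its `stop_words` parameter come from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up sentence, not taken from our dataset
toy_reviews = ["the room was clean and the staff were friendly"]

with_stop = CountVectorizer().fit(toy_reviews)
without_stop = CountVectorizer(stop_words='english').fit(toy_reviews)

# Compare the two vocabularies side by side
print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```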

Here's what our first review looks like in count vector format:

In [39]:
print(cv.fit_transform(X)[0])

  (0, 3266)	1
  (0, 337)	1
  (0, 7482)	1
  (0, 9210)	1
  (0, 1754)	1
  (0, 9023)	1
  (0, 4085)	1
  (0, 3744)	1
  (0, 4770)	1
  (0, 2829)	1
  (0, 6507)	1
  (0, 2535)	1
  (0, 4343)	1
  (0, 5252)	1
  (0, 2172)	1
  (0, 699)	1
  (0, 4902)	1
  (0, 2859)	1
  (0, 8777)	1
  (0, 1514)	1
  (0, 6982)	1
  (0, 5775)	1
  (0, 2603)	1
  (0, 3518)	1
  (0, 4727)	1
  :	:
  (0, 6979)	2
  (0, 731)	1
  (0, 4143)	4
  (0, 7848)	2
  (0, 8566)	1
  (0, 6886)	1
  (0, 9231)	1
  (0, 387)	1
  (0, 8374)	1
  (0, 8174)	1
  (0, 889)	2
  (0, 3154)	1
  (0, 4839)	1
  (0, 1719)	1
  (0, 571)	1
  (0, 3728)	1
  (0, 2849)	1
  (0, 5563)	1
  (0, 7104)	1
  (0, 1943)	1
  (0, 5528)	1
  (0, 2311)	1
  (0, 4050)	3
  (0, 7137)	1
  (0, 7849)	1


i.e. the word at index 4050 in the vocabulary appears 3 times in the first review.
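The indices are not very readable on their own. One way to turn them back into words, sketched here on made-up reviews with hypothetical variable names, is to invert the vectorizer's `vocabulary_` dictionary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, not taken from our dataset
toy_reviews = ["great hotel great staff", "terrible hotel"]

toy_cv = CountVectorizer()
row = toy_cv.fit_transform(toy_reviews)[0]

# vocabulary_ maps word -> column index; invert it to map index -> word
index_to_word = {idx: word for word, idx in toy_cv.vocabulary_.items()}

# row.indices holds the non-zero columns, row.data the matching counts
for col, count in zip(row.indices, row.data):
    print(index_to_word[int(col)], int(count))
```

In the notebook, the same lookup on the fitted `cv` would show which word sits at index 4050.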

Now, let's apply another transformation. 

$tf$ is term frequency: how many times a word appears in a particular review.
On its own, this gives a lot of weight to words that are common in every review, so we scale it by $idf$, inverse document frequency, which shrinks the weight of words that appear in many reviews. Each review's vector is then normalised to unit length, so that longer reviews do not dominate.
Together, this is $tf$-$idf$, and it usually provides a better representation of our data than raw word counts.

In [40]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()

Let's transform our first count vector and see what it looks like now.

In [46]:
print(tfidf.fit_transform(cv.fit_transform(X)[0]))

  (0, 9246)	0.07738232325341368
  (0, 9231)	0.07738232325341368
  (0, 9210)	0.07738232325341368
  (0, 9201)	0.07738232325341368
  (0, 9023)	0.07738232325341368
  (0, 9014)	0.07738232325341368
  (0, 8848)	0.07738232325341368
  (0, 8813)	0.07738232325341368
  (0, 8777)	0.07738232325341368
  (0, 8566)	0.07738232325341368
  (0, 8419)	0.07738232325341368
  (0, 8401)	0.15476464650682736
  (0, 8374)	0.07738232325341368
  (0, 8316)	0.15476464650682736
  (0, 8174)	0.07738232325341368
  (0, 8059)	0.07738232325341368
  (0, 7849)	0.07738232325341368
  (0, 7848)	0.15476464650682736
  (0, 7761)	0.07738232325341368
  (0, 7580)	0.07738232325341368
  (0, 7482)	0.07738232325341368
  (0, 7249)	0.07738232325341368
  (0, 7201)	0.07738232325341368
  (0, 7137)	0.07738232325341368
  (0, 7104)	0.07738232325341368
  :	:
  (0, 2311)	0.07738232325341368
  (0, 2172)	0.07738232325341368
  (0, 1943)	0.07738232325341368
  (0, 1879)	0.07738232325341368
  (0, 1754)	0.07738232325341368
  (0, 1721)	0.15476464650682736
  

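Each review vector is now scaled to unit length, so its squared weights sum to 1. Note that because the transformer was fitted on a single review here, every word gets the same $idf$, and the values are simply normalised counts: a word that appears twice has exactly double the weight of a word that appears once. Fitted on the full corpus, as in the pipelines below, $idf$ also down-weights words that appear in many reviews.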
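The difference between fitting on one review and fitting on the whole corpus can be seen on a toy example. This is a sketch with made-up reviews and hypothetical variable names, not the notebook's data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two made-up reviews; columns come out as great, hotel, staff, terrible
toy_reviews = ["great hotel great staff", "terrible hotel"]
toy_counts = CountVectorizer().fit_transform(toy_reviews)

# Fitted on one review only: idf is constant, so this is just count / length
single = TfidfTransformer().fit_transform(toy_counts[0]).toarray()[0]

# Fitted on all reviews: "hotel" appears in both, so it is down-weighted
full = TfidfTransformer().fit(toy_counts).transform(toy_counts[0]).toarray()[0]

print(np.round(single, 3))
print(np.round(full, 3))

# Both vectors have unit length
print(np.linalg.norm(single), np.linalg.norm(full))
```

In `single`, "hotel" and "staff" get the same weight because both appear once. In `full`, "hotel" is weighted lower than "staff" because it appears in both reviews.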

Now for the fun stuff! Let's import all of our classifiers listed above:

In [49]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import sklearn.svm as svm

Pipelines allow us to combine feature extractors and classifiers to make life easier.

In [50]:
from sklearn.pipeline import Pipeline

Because we want to use the same features for every classifier, we initialize our CountVectorizer and TfidfTransformer beforehand, and pass them into every pipeline.

In [51]:
# ngram_range=(1, 2) uses single words (unigrams) and word pairs (bigrams)
cv = CountVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf = TfidfTransformer()

Now we create all of our pipelines, one for each classifier.

In [52]:
nbayes = Pipeline([ ('cv', cv), ('tfidf', tfidf), ('nbayes', MultinomialNB()) ])
logreg = Pipeline([ ('cv', cv), ('tfidf', tfidf), ('logreg', LogisticRegression(random_state=42, solver='lbfgs')) ])
knn = Pipeline([ ('cv', cv), ('tfidf', tfidf), ('knn', KNeighborsClassifier(n_neighbors=6)) ])
svc = Pipeline([ ('cv', cv), ('tfidf', tfidf), ('svm', svm.LinearSVC(random_state=42)) ])

models = {'nbayes': nbayes, 'logreg':logreg, 'knn':knn, 'svm':svc}
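Once built, a pipeline behaves like a single classifier that accepts raw text. Here is a minimal sketch on made-up reviews and labels; `toy_pipe` and the toy data are hypothetical, and the steps mirror the `nbayes` pipeline above without stop word removal or n-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data: label 0 for the first two reviews, 1 for the last two
toy_reviews = [
    "great stay lovely staff",
    "lovely room great view",
    "dirty room rude staff",
    "rude service dirty lobby",
]
toy_labels = [0, 0, 1, 1]

toy_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nbayes', MultinomialNB()),
])

# fit() runs the vectorizer and transformer, then trains the classifier
toy_pipe.fit(toy_reviews, toy_labels)

# predict() applies the same fitted transformations to new text
predictions = toy_pipe.predict(["lovely great stay", "rude dirty lobby"])
print(predictions)
```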

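To get a reliable estimate of how each model performs on unseen data, we use 10-fold cross-validation: the data is split into 10 parts, and each part in turn is held out for testing while the model is trained on the other 9.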

In [53]:
from sklearn.model_selection import cross_val_score
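The splitting itself is simple to sketch in plain Python. The helper `k_fold_indices` below is hypothetical and only illustrates the idea. With an integer `cv` and a classifier, scikit-learn uses stratified folds that also keep the class proportions equal in each part, which this sketch ignores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread any remainder over the first few folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(1600, 10))

# 10 folds, each training on 1440 reviews and testing on 160
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```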

Now let's get our accuracy scores!

In [54]:
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print("%s Accuracy: %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))

nbayes Accuracy: 0.86 (+/- 0.06)
logreg Accuracy: 0.87 (+/- 0.06)
knn Accuracy: 0.52 (+/- 0.09)
svm Accuracy: 0.88 (+/- 0.06)


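As we can see, the linear SVM, logistic regression and Naive Bayes all perform similarly (0.86 to 0.88, well within each other's spread), whereas the k-NN algorithm does very poorly: at 0.52 it is barely better than chance on a balanced dataset.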