# Proof of Concept of different binary text classifiers on our data

In the world of ML, there is a vast amount of choice when it comes to which classification method to use.
In this notebook, I will be demonstrating how 4 of the most popular classifiers perform:

- <b>Multinomial Naive Bayes</b>, the 'punching bag' benchmark classifier of the ML world, which applies Bayes' rule to per-class word probabilities estimated from the feature vectors, assuming features are independent given the class, to predict the class of a new data point,

- <b>Logistic Regression</b>, a method that uses the sigmoid function to transform a data point's signed distance from a learned linear decision boundary into a class probability,

- <b>K-nearest-neighbours</b>, where the classification of X is a vote of the K nearest items to X,

- <b>SVM</b>, a method that tries to find a hyperplane that separates the classes by treating data points as coordinates in an $m$-dimensional space, $m$ being the number of features.
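
To make two of these descriptions concrete, here is a minimal pure-Python sketch of the sigmoid step used by logistic regression and the majority vote used by k-nearest-neighbours. The weights, bias, and toy points are made up for illustration and are not learned from our data; `sigmoid_probability` and `knn_vote` are hypothetical helper names.

```python
import math
from collections import Counter

def sigmoid_probability(weights, bias, x):
    """Map the signed score w.x + b to a probability of class 1."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def knn_vote(points, labels, query, k=3):
    """Return the majority label among the k points closest to query."""
    by_distance = sorted(
        zip(points, labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy values, chosen only for illustration.
p_on_boundary = sigmoid_probability([1.0, -1.0], 0.0, [2.0, 2.0])   # score is 0
p_far_positive = sigmoid_probability([1.0, -1.0], 0.0, [5.0, 0.0])  # score is 5

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = [0, 0, 0, 1, 1]
vote = knn_vote(points, labels, query=(4, 4), k=3)
```

A point sitting exactly on the boundary gets probability 0.5, and the probability approaches 1 as the point moves further onto the positive side.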

The dataset that we are using in this notebook is very small, containing only 1600 data points, 800 of each class.

However, it is a good dataset to use for a POC in this notebook, because feature extraction is fast, the classes are balanced, and the labels are gold standard. This means we can quickly compare multiple classifiers with multiple feature sets to get a feel for how they perform.

In terms of producing a reliable model to serve on our API, it is not a good choice, as it does not generalize well.

Without further ado, let us begin.

## Data Processing & Feature Extraction

I start by importing some modules for data processing and manipulation, NumPy and pandas, along with Matplotlib.

I also import TensorFlow, whose `get_file` helper downloads and caches the raw data for us.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

Great. Let's take a look at our data:

In [2]:
dataset = tf.keras.utils.get_file(
      fname="opspam.pkl", 
      origin="https://storage.googleapis.com/lucas0/opspam.pkl", 
      extract=False)
df = pd.read_pickle(dataset)
df.head()

Unnamed: 0,sentiment,review,deceptive
0,0,We stayed at the Schicago Hilton for 4 days an...,1
1,0,Hotel is located 1/2 mile from the train stati...,1
2,0,I made my reservation at the Hilton Chicago be...,1
3,0,"When most people think Hilton, they think luxu...",1
4,0,My husband and I recently stayed stayed at the...,1


Sweet. We have 3 columns: 
- Sentiment (0 is negative, 1 is positive)
- Review, our review text,
- Deceptive (0 is genuine, 1 is deceptive)

Sentiment and Deceptive are pre-labelled for us.
We will be focusing on the deceptive column, the label we wish to predict.

Let's separate our data from the labels:

In [3]:
X = df['review']
y = np.asarray(df['deceptive'], dtype=int)

These classifiers only work on numeric features, not the strings that our reviews are currently represented by. To represent our reviews as numeric features, we use a Bag of Words model. 

`CountVectorizer` takes our reviews and returns an $m$-dimensional vector per review, where $m$ is our vocabulary size and element $i$ is the number of times word $i$ appears in the review (0 if it does not appear).

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
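
As a small illustration on two made-up sentences (not taken from our dataset), the sketch below shows what `CountVectorizer` returns for a repeated word, with and without `binary=True`. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, used only for illustration.
toy_reviews = [
    "the room was clean",
    "the room was small and the bed was small",
]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_reviews)

binary_vectorizer = CountVectorizer(binary=True)
indicators = binary_vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each word to its column index.
small_col = count_vectorizer.vocabulary_["small"]
small_count = counts[1, small_col]          # occurrences of "small" in review 2
small_indicator = indicators[1, small_col]  # presence flag for the same word
vocab_size = len(count_vectorizer.vocabulary_)
```

With the default settings the repeated word is stored as a count; with `binary=True` it is clipped to a 0/1 presence flag.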

Let's take a look at the shape of our data after this transformation.

In [5]:
print(cv.fit_transform(X).shape)

(1600, 9571)


In this example, 1600 represents the number of reviews in our data, and 9571 the number of words (size of vocab.)
Let's see what happens if we remove stop words:

In [6]:
cv = CountVectorizer(stop_words='english')
print(cv.fit_transform(X).shape)

(1600, 9284)


That will help our classifier slightly. Stop words generally don't contribute anything to class membership, and add noise.

Here's what our first review looks like in count vector format:

In [7]:
print((cv.fit_transform(X)[0]))

  (0, 3266)	1
  (0, 337)	1
  (0, 7482)	1
  (0, 9210)	1
  (0, 1754)	1
  (0, 9023)	1
  (0, 4085)	1
  (0, 3744)	1
  (0, 4770)	1
  (0, 2829)	1
  (0, 6507)	1
  (0, 2535)	1
  (0, 4343)	1
  (0, 5252)	1
  (0, 2172)	1
  (0, 699)	1
  (0, 4902)	1
  (0, 2859)	1
  (0, 8777)	1
  (0, 1514)	1
  (0, 6982)	1
  (0, 5775)	1
  (0, 2603)	1
  (0, 3518)	1
  (0, 4727)	1
  :	:
  (0, 6979)	2
  (0, 731)	1
  (0, 4143)	4
  (0, 7848)	2
  (0, 8566)	1
  (0, 6886)	1
  (0, 9231)	1
  (0, 387)	1
  (0, 8374)	1
  (0, 8174)	1
  (0, 889)	2
  (0, 3154)	1
  (0, 4839)	1
  (0, 1719)	1
  (0, 571)	1
  (0, 3728)	1
  (0, 2849)	1
  (0, 5563)	1
  (0, 7104)	1
  (0, 1943)	1
  (0, 5528)	1
  (0, 2311)	1
  (0, 4050)	3
  (0, 7137)	1
  (0, 7849)	1


i.e. the word at index 4050 in the vocabulary appeared 3 times in the first review.
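
To read such output, it helps to map column indices back to words. The sketch below does this for a made-up review, assuming a fitted `CountVectorizer`; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["great hotel great staff", "terrible hotel"]  # made-up examples

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(toy_reviews)

# Invert the word -> column mapping so columns can be looked up by index.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

first_row = matrix[0]
# A sparse row stores only the non-zero columns and their values.
word_counts = {
    index_to_word[col]: int(value)
    for col, value in zip(first_row.indices, first_row.data)
}
```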

Now, let's apply another transformation. 

$tf$ is the term frequency: how many times a word appears in a review. The higher it is, the more relevant the word is likely to be to that review.

$df$ is the document frequency: the number of reviews the word occurred in. The higher it is, the less weight the word should carry, because a word that occurs in most reviews does little to distinguish one review from another.

Therefore, we scale each count by an inverse document frequency, conceptually $tf \cdot \frac{1}{df}$. scikit-learn's `TfidfTransformer` uses a smoothed logarithmic version by default, $tf \cdot \left(\ln\frac{1 + n}{1 + df} + 1\right)$ with $n$ the number of reviews, and then scales each review vector to unit length.

This is known as tf-idf, and usually provides a better representation of our data than raw word counts, because words that are distinctive for a review are weighted higher.
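
As a rough worked example, the sketch below computes tf-idf by hand using the smoothed idf, $\ln\frac{1+n}{1+df} + 1$, and the L2 normalisation that scikit-learn's `TfidfTransformer` applies by default. The three tiny documents are made up, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Three made-up, already-tokenised reviews.
docs = [
    ["hotel", "clean", "clean"],
    ["hotel", "dirty"],
    ["hotel", "staff", "rude"],
]
n_docs = len(docs)

# df: number of documents each word occurs in.
document_frequency = Counter(word for doc in docs for word in set(doc))

def smoothed_idf(word):
    """idf = ln((1 + n) / (1 + df)) + 1, scikit-learn's default smoothing."""
    return math.log((1 + n_docs) / (1 + document_frequency[word])) + 1

def tfidf_vector(doc):
    """Return L2-normalised tf-idf weights for one tokenised document."""
    term_frequency = Counter(doc)
    raw = {word: tf * smoothed_idf(word) for word, tf in term_frequency.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

weights = tfidf_vector(docs[0])
```

Because "hotel" occurs in every document, its idf is exactly 1, so the rarer and repeated word "clean" ends up with the larger weight.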

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()

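Let's transform our first count vector and see what it looks like now. Note that the transformer is fitted on this single review only, so every word gets the same idf and the values below are simply length-normalised counts (a word that appears twice has twice the weight).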

In [9]:
print(tfidf.fit_transform(cv.fit_transform(X)[0]))

  (0, 9246)	0.07738232325341368
  (0, 9231)	0.07738232325341368
  (0, 9210)	0.07738232325341368
  (0, 9201)	0.07738232325341368
  (0, 9023)	0.07738232325341368
  (0, 9014)	0.07738232325341368
  (0, 8848)	0.07738232325341368
  (0, 8813)	0.07738232325341368
  (0, 8777)	0.07738232325341368
  (0, 8566)	0.07738232325341368
  (0, 8419)	0.07738232325341368
  (0, 8401)	0.15476464650682736
  (0, 8374)	0.07738232325341368
  (0, 8316)	0.15476464650682736
  (0, 8174)	0.07738232325341368
  (0, 8059)	0.07738232325341368
  (0, 7849)	0.07738232325341368
  (0, 7848)	0.15476464650682736
  (0, 7761)	0.07738232325341368
  (0, 7580)	0.07738232325341368
  (0, 7482)	0.07738232325341368
  (0, 7249)	0.07738232325341368
  (0, 7201)	0.07738232325341368
  (0, 7137)	0.07738232325341368
  (0, 7104)	0.07738232325341368
  :	:
  (0, 2311)	0.07738232325341368
  (0, 2172)	0.07738232325341368
  (0, 1943)	0.07738232325341368
  (0, 1879)	0.07738232325341368
  (0, 1754)	0.07738232325341368
  (0, 1721)	0.15476464650682736
  

## Finding Optimal Classifier Parameters

Now for the fun stuff! Let's import all of our classifiers listed above:

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import sklearn.svm as svm

Pipelines allow us to combine feature extractors with classifiers to make life easier.

In [11]:
from sklearn.pipeline import Pipeline
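
As a minimal sketch of what a pipeline does, the example below chains a count vectorizer, a tf-idf transformer, and a Naive Bayes classifier on a handful of made-up reviews. The texts, labels, and step names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training reviews and labels, for illustration only.
train_texts = ["great clean room", "lovely friendly staff",
               "terrible dirty room", "awful rude staff"]
train_labels = [0, 0, 1, 1]

toy_pipeline = Pipeline([
    ("counts", CountVectorizer()),   # text -> word counts
    ("tfidf", TfidfTransformer()),   # counts -> tf-idf weights
    ("clf", MultinomialNB()),        # tf-idf weights -> class
])

# fit() runs fit_transform on each feature step, then fits the classifier.
toy_pipeline.fit(train_texts, train_labels)
# predict() reuses the fitted vocabulary and idf weights on unseen text.
predictions = toy_pipeline.predict(["clean friendly room", "dirty rude room"])
```

Because the feature steps are fitted inside the pipeline, refitting it on each training split keeps the vocabulary and idf weights from leaking information out of the test split.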

In [12]:
scoring = {
    'acc': 'accuracy',
    'auroc': 'roc_auc',
    'f1': 'f1'
}
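
A dictionary like this has the shape expected by the `scoring` argument of scikit-learn's `cross_validate`. The notebook's own evaluation uses custom loops instead, so the following is only a sketch of that alternative on made-up data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_scoring = {"acc": "accuracy", "auroc": "roc_auc", "f1": "f1"}

# Made-up reviews: six per class so that each of 3 folds holds both classes.
texts = ["clean room", "friendly staff", "great stay", "lovely view",
         "comfortable bed", "quiet floor",
         "dirty room", "rude staff", "awful stay", "no view",
         "broken bed", "noisy floor"]
labels = [0] * 6 + [1] * 6

toy_pipeline = Pipeline([("counts", CountVectorizer()), ("clf", MultinomialNB())])
results = cross_validate(toy_pipeline, texts, labels, scoring=toy_scoring, cv=3)

# Each scorer name is reported under a "test_<name>" key, one value per fold.
mean_scores = {name: results[f"test_{name}"].mean() for name in toy_scoring}
```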

### Naive Bayes

There aren't many different things we can try with Naive Bayes, but we can compare unigram features against unigram-plus-bigram features.
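
The difference between the two settings is easiest to see on a single tokenised sentence. The sketch below mimics what an `ngram_range` of `(1, 1)` versus `(1, 2)` produces; `word_ngrams` is an illustrative helper, not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["room", "was", "not", "clean"]  # made-up tokenised review
unigrams = word_ngrams(tokens, 1, 1)
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
```

Bigrams such as "not clean" keep a little local word order that unigrams alone lose, at the cost of a much larger vocabulary.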

In [13]:
from sklearn.metrics import roc_auc_score

def auroc_score_from_probabilities(probabilities, labels):
  true_probabilities = [probabilities[i][1] for i in range(0, len(labels))]
  return roc_auc_score(labels, true_probabilities)
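
AUROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it that way on made-up scores; it is meant to build intuition, not to replace `roc_auc_score`, and `pairwise_auroc` is a hypothetical helper name.

```python
def pairwise_auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and class-1 probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = pairwise_auroc(toy_labels, toy_scores)
```

Here three of the four positive/negative pairs are ordered correctly, so the score is 0.75.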

In [14]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, log_loss
import numpy as np
def run_cross_validate1(model, X, y, get_features_fn, cv=5):
  splitter = StratifiedShuffleSplit(n_splits=cv, test_size=0.2)
  metrics = {
    "accuracies": [],
    "auroc": [],
    "f1 scores": [],
    "log loss": []
  }
    
  false_negatives = 0
  false_positives = 0
  for train_indices, test_indices in splitter.split(X, y):
    training_X = get_features_fn([X[x] for x in train_indices], fit=True)
    training_y = [y[x] for x in train_indices]
    test_X = get_features_fn([X[x] for x in test_indices])
    test_y = [y[x] for x in test_indices]

    model.fit(training_X, training_y)
    probabilities = model.predict_proba(test_X)
    
    predicted = [0 if x[0] > x[1] else 1 for x in probabilities]
    for i in range(0, len(predicted)):
      if (predicted[i] == 0 and test_y[i] == 1):
        false_negatives+=1
      if (predicted[i] == 1 and test_y[i] == 0):
        false_positives+=1
    
    metrics["accuracies"].append(accuracy_score(test_y, predicted))
    metrics["auroc"].append(auroc_score_from_probabilities(probabilities, test_y))
    metrics["f1 scores"].append(f1_score(test_y, predicted))
    metrics["log loss"].append(log_loss(test_y, probabilities))

  # All counts above come from test splits, so normalise by the test split size.
  num_samples = len(test_y)
  metrics["mean accuracy"] = np.mean(metrics["accuracies"])
  metrics["mean variance"] = np.var(metrics["accuracies"])
  metrics["mean auroc"] = np.mean(metrics["auroc"])
  metrics["mean f1 scores"] = np.mean(metrics["f1 scores"])
  metrics["mean log loss"] = np.mean(metrics["log loss"])
  metrics["false negatives rate"] = (false_negatives / cv) / num_samples
  metrics["false positives rate"] = (false_positives / cv) / num_samples
  return metrics
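
For reference, the conventional false negative rate divides by the number of actual positives, and the false positive rate by the number of actual negatives. The sketch below shows those definitions on made-up predictions; `error_rates` is a hypothetical helper, not part of the notebook.

```python
def error_rates(true_labels, predicted_labels):
    """Return (false_negative_rate, false_positive_rate) for binary 0/1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
    actual_positives = sum(1 for t, _ in pairs if t == 1)
    actual_negatives = sum(1 for t, _ in pairs if t == 0)
    # FNR = FN / (FN + TP); FPR = FP / (FP + TN).
    fnr = false_negatives / actual_positives if actual_positives else 0.0
    fpr = false_positives / actual_negatives if actual_negatives else 0.0
    return fnr, fpr

# Made-up labels: 4 deceptive (1) and 4 genuine (0) reviews.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 1]
toy_fnr, toy_fpr = error_rates(toy_true, toy_pred)
```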

In [15]:
cv_unigram = CountVectorizer(stop_words='english', ngram_range = (1, 1))  # unigrams only
cv_bigram = CountVectorizer(stop_words='english', ngram_range = (1, 2))
mnb = MultinomialNB(alpha=2)

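def get_features(X, fit=False):
    # Identity function: each Pipeline does its own feature extraction,
    # so the raw review text is passed straight through.
    return X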

models = []

models.append(('Bigram MNB tfidf', Pipeline([ ('cv', cv_bigram), ('tfidf', tfidf), ('mnb', mnb) ])))
models.append(('Unigram MNB tfidf', Pipeline([ ('cv', cv_unigram), ('tfidf', tfidf), ('mnb', mnb) ])))

CV = 10  # number of stratified shuffle splits
entries = []
for model_name, model in models:
  metrics = run_cross_validate1(model, list(X), y, get_features, cv=CV)
  entries.append((model_name, metrics['mean accuracy'], metrics['mean auroc'], metrics['mean f1 scores']))

In [16]:
cv_df = pd.DataFrame(entries, columns=['model_name', 'accuracy', 'auroc', 'f1'])

In [17]:
cv_df.groupby('model_name').accuracy.mean()

model_name
Bigram MNB tfidf     0.865937
Unigram MNB tfidf    0.851250
Name: accuracy, dtype: float64

In [18]:
cv_df.groupby('model_name').auroc.mean()

model_name
Bigram MNB tfidf     0.954734
Unigram MNB tfidf    0.948762
Name: auroc, dtype: float64

In [19]:
cv_df.groupby('model_name').f1.mean()

model_name
Bigram MNB tfidf     0.876378
Unigram MNB tfidf    0.863841
Name: f1, dtype: float64

Looks like our bigram-trained MNB is slightly better on all three metrics (for example, 0.866 vs 0.851 mean accuracy).

### Support Vector Machine

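There are a few different types of SVM in scikit-learn, each with its own parameters. `NuSVC` is similar to the classic `SVC`, except that it is controlled by a parameter `nu` that bounds the fraction of margin errors and support vectors. `LinearSVC` only supports a linear kernel. Let's compare the different types, using bigram tf-idf vectors as features.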

In [20]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, log_loss
import numpy as np
def run_cross_validate2(model, X, y, get_features_fn, cv=5):
  splitter = StratifiedShuffleSplit(n_splits=cv, test_size=0.2)
  metrics = { "accuracies": [], "auroc": [], "f1 scores": [] }
    
  for train_indices, test_indices in splitter.split(X, y):
    training_X = get_features_fn([X[x] for x in train_indices], fit=True)
    training_y = [y[x] for x in train_indices]
    test_X = get_features_fn([X[x] for x in test_indices])
    test_y = [y[x] for x in test_indices]
  
    model.fit(training_X, training_y)
    scores = model.decision_function(test_X)
    
    predicted = [1 if score >= 0 else 0 for score in scores]
    metrics["accuracies"].append(accuracy_score(test_y, predicted))
    metrics["auroc"].append(roc_auc_score(test_y, scores))
    metrics["f1 scores"].append(f1_score(test_y, predicted))

  num_samples = len(X)
  metrics["mean accuracy"] = np.mean(metrics["accuracies"])
  metrics["mean variance"] = np.var(metrics["accuracies"])
  metrics["mean auroc"] = np.mean(metrics["auroc"])
  metrics["mean f1 scores"] = np.mean(metrics["f1 scores"])
  return metrics
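
The function above turns SVM scores into labels by thresholding `decision_function` at zero. As a small sanity check of that idea, the sketch below fits a `LinearSVC` on made-up, well-separated 2-D points and compares the thresholded scores with `predict`; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up, linearly separable 2-D points.
toy_X = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.6],
                  [3.0, 3.0], [3.5, 2.8], [2.9, 3.4]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_svm = LinearSVC(random_state=42)
toy_svm.fit(toy_X, toy_y)

# decision_function returns a signed score: positive means the class 1 side
# of the hyperplane, negative means the class 0 side.
toy_scores = toy_svm.decision_function(toy_X)
thresholded = (toy_scores >= 0).astype(int)
matches_predict = np.array_equal(thresholded, toy_svm.predict(toy_X))
```

The raw scores are also what gets passed to `roc_auc_score`, since AUROC only needs a ranking of the examples rather than calibrated probabilities.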

In [21]:
# Build one pipeline per SVM variant, all using bigram tf-idf features
cv = CountVectorizer(stop_words='english', ngram_range = (1, 2))

models = []

models.append(('SVC', Pipeline([ ('cv', cv), ('tfidf', tfidf), ('svm', svm.SVC(kernel="sigmoid", gamma='scale', random_state=42)) ])))
models.append(('LinearSVC', Pipeline([ ('cv', cv), ('tfidf', tfidf), ('svm', svm.LinearSVC(random_state=42)) ])))
models.append(('NuSVC', Pipeline([ ('cv', cv), ('tfidf', tfidf), ('svm', svm.NuSVC(kernel='sigmoid', gamma='scale', random_state=42)) ])))

CV = 10  # number of stratified shuffle splits
entries = []
for model_name, model in models:
  metrics = run_cross_validate2(model, list(X), y, get_features, cv=CV)
  entries.append((model_name, metrics['mean accuracy'], metrics['mean auroc'], metrics['mean f1 scores']))

In [22]:
cv_df = pd.DataFrame(entries, columns=['model_name', 'accuracy', 'auroc', 'f1'])

In [23]:
cv_df.groupby('model_name').accuracy.mean()

model_name
LinearSVC    0.887188
NuSVC        0.883750
SVC          0.860313
Name: accuracy, dtype: float64

In [24]:
cv_df.groupby('model_name').auroc.mean()

model_name
LinearSVC    0.957461
NuSVC        0.956477
SVC          0.947082
Name: auroc, dtype: float64

In [25]:
cv_df.groupby('model_name').f1.mean()

model_name
LinearSVC    0.890778
NuSVC        0.888324
SVC          0.850551
Name: f1, dtype: float64

Looks like the LinearSVC is our best option.

## Cross Classifier Comparison

Now we create pipelines for the two remaining classifiers evaluated here, Naive Bayes and logistic regression.

In [26]:
nbayes = Pipeline([ ('cv', cv), ('tfidf', tfidf), ('nbayes', MultinomialNB()) ])
logreg = Pipeline([ ('cv', cv), ('tfidf', tfidf), ('logreg', LogisticRegression(random_state=42, solver='liblinear')) ])

models = {'Naive Bayes': nbayes, 'Log. Regression':logreg}

To get a more reliable estimate than a single train/test split, we again use 10 stratified shuffle splits, each holding out 20% of the data for testing.
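
Repeated shuffle splits and k-fold cross validation are easy to confuse. The sketch below contrasts scikit-learn's `StratifiedShuffleSplit` (the splitter used in this notebook) with `StratifiedKFold` on made-up labels; the array names and the `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

toy_y = np.array([0] * 10 + [1] * 10)   # made-up balanced labels
toy_X = np.zeros((len(toy_y), 1))       # features are irrelevant for splitting

k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shuffle_split = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

k_fold_test = np.concatenate([test for _, test in k_fold.split(toy_X, toy_y)])
shuffle_test = np.concatenate([test for _, test in shuffle_split.split(toy_X, toy_y)])

# K-fold: every sample lands in a test fold exactly once.
k_fold_covers_all_once = sorted(k_fold_test.tolist()) == list(range(len(toy_y)))
# Shuffle split: 5 independent draws of 4 test samples, so repeats are possible.
shuffle_unique_tested = len(set(shuffle_test.tolist()))
```

With shuffle splits some reviews may be tested several times and others never, which is fine for a quick comparison but differs from a strict 10-fold scheme.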

Now let's get our accuracy scores!

In [27]:
CV = 10  # number of stratified shuffle splits
entries = []
for model_name, model in models.items():
  metrics = run_cross_validate1(model, list(X), y, get_features, cv=CV)
  entries.append((model_name, metrics['mean accuracy'], metrics['mean auroc'], metrics['mean f1 scores']))

cv_df = pd.DataFrame(entries, columns=['model_name', 'accuracy', 'auroc', 'f1'])

Let's compare the two classifiers on mean accuracy, AUROC and F1.

In [28]:
cv_df.groupby('model_name').accuracy.mean()

model_name
Log. Regression    0.871250
Naive Bayes        0.865312
Name: accuracy, dtype: float64

In [29]:
cv_df.groupby('model_name').auroc.mean()

model_name
Log. Regression    0.945820
Naive Bayes        0.951773
Name: auroc, dtype: float64

In [30]:
cv_df.groupby('model_name').f1.mean()

model_name
Log. Regression    0.875232
Naive Bayes        0.874346
Name: f1, dtype: float64

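As we can see, the SVM worked best: LinearSVC reached the highest mean accuracy (0.887), AUROC (0.957) and F1 (0.891), with logistic regression and Naive Bayes close behind. Note that k-NN was imported but not evaluated in this notebook, and that each model was scored on its own random splits, so small differences should be read with caution.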