# A deeper look at SVM architectures

In the last experiment that we ran, where we used a LinearSVM without changing any of the default parameters or doing any feature engineering, we achieved an accuracy of 63%. It also gave us convergence warnings for the linear kernel.
We expect that SVMs should be more capable than this, so in this notebook we will iterate over the SVM design and try different approaches to the problem using this classifier. The recommended SVM kernel for text classification is the linear kernel, so we will attempt to obtain a result from this.
Let's import what we need.

In [3]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [5]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

In [6]:
all_reviews = reviews_set

X = [x.review_content for x in all_reviews]
y = [1 if x.label else 0 for x in all_reviews]

## Start with Logistic Regression
Logistic Regression is simpler and faster than an SVM. It also correlates well with SVMs with a linear kernel, since both attempt to divide the samples with a linear boundary. SVMs attempt to produce a wide margin in the division, and although logistic regression does not do this by design, it is known to produce results of this nature. Logistic regression is also a little easier to obtain metrics for, as we can also view the log loss metric. Because of all of this we will start with logistic regression to do our initial experimentation, and then switch to SVMs at the end when we are ready to change the penalty variable.

First are some helper functions to get the metrics:

In [7]:
from sklearn.metrics import roc_auc_score

def auroc_score_from_probabilities(probabilities, labels):
  true_probabilities = [probabilities[i][1] for i in range(0, len(labels))]
  return roc_auc_score(labels, true_probabilities)

In [8]:
import unittest

class TestRocScoreFromProbabilities(unittest.TestCase):
  
  def test_uses_correct_probabilities(self):
    probabilities = [[0.9, 0.1], [0.1, 0.9]]
    labels = [0, 1]
    self.assertEqual(1, auroc_score_from_probabilities(probabilities, labels))

unittest.main(argv=[''], verbosity=2, exit=False)

ok

----------------------------------------------------------------------
Ran 1 test in 0.006s

OK


<unittest.main.TestProgram at 0x7f885ddb6be0>

We define a function to automate the process of cross validation and finding the accuracy, the mean and the variance:

In [17]:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, log_loss
import numpy as np
def run_cross_validate(model, X, y, get_features_fn, cv=5):
  skfSplitter = StratifiedKFold(n_splits=cv)
  metrics = {
    "accuracies": [],
    "auroc": [],
    "f1 scores": [],
    "log loss": []
  }
    
  false_negatives = 0
  false_positives = 0
  for train_indices, test_indices in skfSplitter.split(X, y):
    training_X = get_features_fn([X[x] for x in train_indices], fit=True)
    training_y = [y[x] for x in train_indices]
    test_X = get_features_fn([X[x] for x in test_indices])
    test_y = [y[x] for x in test_indices]
  
    model.fit(training_X, training_y)
    probabilities = model.predict_proba(test_X)
    
    predicted = [0 if x[0] > x[1] else 1 for x in probabilities]
    for i in range(0, len(predicted)):
      if (predicted[i] == 0 and test_y[i] == 1):
        false_negatives+=1
      if (predicted[i] == 1 and test_y[i] == 0):
        false_positives+=1
    
    metrics["accuracies"].append(accuracy_score(test_y, predicted))
    metrics["auroc"].append(auroc_score_from_probabilities(probabilities, test_y))
    metrics["f1 scores"].append(f1_score(test_y, predicted))
    metrics["log loss"].append(log_loss(test_y, probabilities))

  num_samples = len(X)
  metrics["mean accuracy"] = np.mean(metrics["accuracies"])
  metrics["mean variance"] = np.var(metrics["accuracies"])
  metrics["mean auroc"] = np.mean(metrics["auroc"])
  metrics["mean f1 scores"] = np.mean(metrics["f1 scores"])
  metrics["mean log loss"] = np.mean(metrics["log loss"])
  metrics["false negatives rate"] = (false_negatives / cv) / num_samples
  metrics["false positives rate"] = (false_positives / cv) / num_samples
  return metrics

In [20]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver="liblinear")

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
cVec = CountVectorizer()
def get_features(predictor_set, fit=False):
  if fit:
    return cVec.fit_transform(predictor_set)
  return cVec.transform(predictor_set)
run_cross_validate(classifier, X, y, get_features)

NameError: name 'run_cross_validate' is not defined

## Metrics to evaluate

### Accuracy
This is the percentage of correct classifications. The higher the better, but it is not appropriate in all cases.

Accuracy is vulnerable to the imbalanced classification problem. When there are many more samples of one class, the classifier can choose that class most of the time, or all of the time, and still achieve a high accuracy.
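
To make the imbalance problem concrete, here is a minimal sketch on made-up labels (not our dataset, which is balanced). A degenerate "classifier" that always answers genuine scores well on accuracy while finding no fake reviews at all:

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels, made up for illustration: 95 genuine (0) and 5 fake (1) reviews.
y_true = [0] * 95 + [1] * 5

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * 100

majority_accuracy = accuracy_score(y_true, y_pred)
majority_recall = recall_score(y_true, y_pred)

print(majority_accuracy)  # 95 of 100 labels are correct
print(majority_recall)    # none of the 5 fake reviews were found
```

This is why accuracy is only trusted here alongside the other metrics below.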

### False Positives, False Negatives
False positives are samples that were classified as fake, but were in fact genuine. False negatives are samples that were classified as genuine, but were in fact fake. We want to reduce both of these values as much as possible. True positives and true negatives are the complementary values, which we want to maximise, but they follow directly from the false positives and false negatives, so there's no need to include them here.

One important question here is **false negatives vs false positives**: is it worse to falsely suggest that something is fake, or to falsely suggest that something is genuine? In this system a human would probably be paid to read the suspicious reviews. It would be good to catch all the fake reviews, plus some genuine ones, because this acts as a filter over the huge number of reviews that makes the human's job easier. In this case it is better to reduce false negatives. If humans are not reviewing the flagged reviews, the situation is different, and it would probably be better to reduce false positives.
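
As a small illustration of how these counts can be read off, here is a sketch using scikit-learn's `confusion_matrix` on made-up labels. The rates below are per class (for example, false negatives divided by all fake reviews), which is an assumption of this sketch and not necessarily how the rates are computed in `run_cross_validate`:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 1 = fake, 0 = genuine.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Share of fake reviews that were missed.
false_negative_rate = fn / (fn + tp)
# Share of genuine reviews that were wrongly flagged.
false_positive_rate = fp / (fp + tn)

print(tn, fp, fn, tp)
print(false_negative_rate, false_positive_rate)
```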

### Recall
Of all the fake reviews, what percentage were identified as fake? This is not subject to the imbalanced classification problem. We aim to maximise it as an indication of how well we are really identifying our fake reviews.

We cannot focus solely on recall, because we could identify all reviews as fake and achieve 100% recall. Precision must be included in the consideration.

### Precision
Of all the reviews identified as fake, what percentage are actually fake? If we classify all reviews as fake, then our precision will be low. If we classify all reviews as genuine, then precision is undefined, because nothing was identified as fake.

In our case it might be more important to have a high recall, if we don't want to miss any fake reviews. Otherwise if we want to be as accurate as possible we can balance recall and precision.

### F1 Score
This is the harmonic mean of precision and recall. Because of this it punishes extreme values: if either precision or recall is close to 0.0, the F1 score will also be close to 0.0.

This also acts as a single number metric representing precision and recall.
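
The following sketch shows the effect on made-up labels. Flagging every review as fake gives perfect recall but poor precision, and the harmonic mean ends up well below the plain average of the two:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, made up for illustration: 2 fake (1) and 8 genuine (0) reviews.
y_true = [1] * 2 + [0] * 8

# Flag every review as fake.
y_pred = [1] * 10

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)

print(precision, recall)
print(f1, harmonic_mean, arithmetic_mean)
```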

### Area Under Curve (AUC) 
This gives us a measure of discrimination: how well we separate the two classes. It is computed from the classifier's scores rather than from hard 'Yes' or 'No' decisions, which can make it more informative than accuracy.

At different classification thresholds, how well do we rank fake reviews as 'more fake' than genuine reviews? We plot the true positive rate against the false positive rate at each threshold to get a curve. Lowering the threshold means more reviews are classified as fake, which increases the true positive rate. It also means fewer reviews are classified as genuine, which decreases the true negative rate and therefore increases the false positive rate.

An AUC of 0.8 means that a randomly chosen fake review has an 80% chance of being scored as more fake than a randomly chosen genuine review.
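
The ranking interpretation can be checked by hand. This sketch, on made-up scores with no ties, compares `roc_auc_score` with the fraction of (fake, genuine) pairs in which the fake review receives the higher score:

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy data, made up for illustration: 1 = fake, 0 = genuine.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.7, 0.9]

auc = roc_auc_score(labels, scores)

fake_scores = [s for s, label in zip(scores, labels) if label == 1]
genuine_scores = [s for s, label in zip(scores, labels) if label == 0]

# Every (fake, genuine) pair, and the share where the fake one scores higher.
pairs = list(product(fake_scores, genuine_scores))
pairwise_fraction = sum(f > g for f, g in pairs) / len(pairs)

print(auc, pairwise_fraction)
```

Both numbers agree here because there are no tied scores.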

### Mean Squared Error
The average of the squared differences between the original values and the predicted values. Squaring adds focus to large errors, and gives a smooth function that is easier to work with than mean absolute error.

The closer the mean squared error is to zero the better. It incorporates both the variance and the bias of the estimator.

### Logarithmic loss
This takes into account the uncertainty of a prediction, by measuring how far the predicted probability is from the actual label. As the probability approaches the correct label, the log loss shrinks only slightly. As the probability approaches the incorrect label, the log loss increases rapidly. This means that confident incorrect predictions are highly penalized.

We aim to minimize log loss.
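
A minimal sketch of this asymmetry, using made-up probabilities. `sample_log_loss` is a helper defined only for this illustration; it is checked against scikit-learn's `log_loss`, which averages the per-sample values:

```python
import math

from sklearn.metrics import log_loss

def sample_log_loss(p_correct):
    """Log loss of one sample, given the probability assigned to its true class."""
    return -math.log(p_correct)

confident_right = sample_log_loss(0.99)  # roughly 0.01
unsure = sample_log_loss(0.5)            # roughly 0.69
confident_wrong = sample_log_loss(0.01)  # roughly 4.61

# Toy check against sklearn: two samples, columns are P(class 0), P(class 1).
toy_labels = [0, 1]
toy_probabilities = [[0.9, 0.1], [0.2, 0.8]]
sk_value = log_loss(toy_labels, toy_probabilities)
manual_value = (sample_log_loss(0.9) + sample_log_loss(0.8)) / 2

print(confident_right, unsure, confident_wrong)
print(sk_value, manual_value)
```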

### Cohen's Kappa
A reliability metric that measures how much two raters (here, classifiers) agree, corrected for the agreement expected by chance. It is out of scope at this stage.

## Current Results

Our results look quite normal. The AUROC score and F1 score above are as expected, reflective of similar work in this field. We can see that the variance is not very large, so we can depend on our scores.

Since there is nothing alarming, the next thing to consider is improving the features or finding more predictive ones. We might see better results if we convert our bag of words to tf-idf. This is especially important because logistic regression is regularised, and regularisation depends on the features being on a comparable scale. Tf-idf will scale our features for us.
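
To see what "scale our features" means in practice, here is a small sketch on made-up documents. With its default settings `TfidfTransformer` L2-normalises each row, so long and short reviews end up with vectors of the same length, unlike raw counts:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents, made up for illustration.
docs = [
    "great food great service",
    "terrible food",
    "great place would visit again and again and again",
]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

# Euclidean length of each document vector before and after tf-idf.
count_norms = np.linalg.norm(counts.toarray(), axis=1)
tfidf_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(count_norms)  # varies with document length
print(tfidf_norms)  # all equal to 1 under the default norm="l2"
```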

In [19]:
def tf_idf_bag_of_words(cv, tfidf, predictor_set, fit):
  if fit:
    return tfidf.fit_transform(cv.fit_transform(predictor_set))
  else:
    return tfidf.transform(cv.transform(predictor_set))

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer

count_vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
def get_features(predictor_set, fit=False):
  return tf_idf_bag_of_words(count_vectorizer, tfidf_transformer, predictor_set, fit)

run_cross_validate(classifier, X, y, get_features)

NameError: name 'classifier' is not defined

Looks like switching to Tfidf had a positive effect on our results, which is not unexpected. Let's have a look at our vocabulary, to see what words are actually in our bag of words. Let's also try to clean our features more: hopefully a smaller number of features, more concentrated with quality words, will improve accuracy. Currently our bag of words has the following shape:

In [23]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(X).shape

(20000, 33653)

Let's set a low limit for the min document frequency. This cleans out everything from typos to random gibberish:

In [24]:
count_vectorizer = CountVectorizer(min_df=4)
count_vectorizer.fit_transform(X).shape

(20000, 11533)
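
The effect of `min_df` is easier to see on a tiny corpus. This sketch uses made-up sentences and `min_df=2` (chosen only to suit the toy size), so any token that appears in a single document, including a deliberate typo, is dropped from the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration; 'servce' is a deliberate typo.
docs = [
    "the food was great",
    "the food was awful",
    "the servce was great",
    "the service was slow",
]

full_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Keep only tokens that occur in at least 2 documents.
pruned_vocab = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(full_vocab)
print(pruned_vocab)
```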

In [25]:
run_cross_validate(classifier, X, y, get_features)

{'accuracies': [0.6518370407398151,
  0.6550862284428893,
  0.66525,
  0.6559139784946236,
  0.6636659164791198],
 'auroc': [0.71634402750229,
  0.7159391043564783,
  0.7199524948931135,
  0.7156260114447396,
  0.7168274889857265],
 'f1 scores': [0.6573185731857318,
  0.6563745019920318,
  0.6681536555142503,
  0.6575410652065704,
  0.6624843161856964],
 'log loss': [0.6160163377452522,
  0.6176232360162895,
  0.6149526840588501,
  0.6183856677609213,
  0.6180224249695648],
 'mean accuracy': 0.6583506328312895,
 'mean variance': 2.6974801977417865e-05,
 'mean auroc': 0.7169378254364696,
 'mean f1 scores': 0.6603744224168562,
 'mean log loss': 0.6170000701101755,
 'false negatives rate': 0.0337,
 'false positives rate': 0.03463}

Now let's try adding reviewer features, which have been proven to be predictive:

In [21]:
from sklearn.preprocessing import StandardScaler

def scaled_reviewer_features(reviews):
  reviewer_reviews = reviews_by_reviewer(reviews)
  reviewer_predictors = [list(reviewer_features(x.user_id, reviewer_reviews)) for x in reviews]
  return StandardScaler().fit_transform(reviewer_predictors)
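
Note that `scaled_reviewer_features` fits a new `StandardScaler` on whatever it is given, so each test fold is scaled with its own statistics. One possible alternative, sketched here on made-up numbers and not used in this notebook, is to fit the scaler on the training fold only and reuse it for the test fold. The names `reviewer_scaler` and `scale_reviewer_matrix` are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical shared scaler, fitted on the training fold only.
reviewer_scaler = StandardScaler()

def scale_reviewer_matrix(raw_features, fit=False):
    """Scale a feature matrix, learning mean and variance only when fit=True."""
    if fit:
        return reviewer_scaler.fit_transform(raw_features)
    return reviewer_scaler.transform(raw_features)

# Toy feature matrices, made up for illustration.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 30.0], [7.0, 70.0]])

train_scaled = scale_reviewer_matrix(train, fit=True)
test_scaled = scale_reviewer_matrix(test)

# For comparison: refitting on the test fold maps the same row elsewhere.
refit_scaled = StandardScaler().fit_transform(test)

print(test_scaled[0])   # scaled with the training statistics
print(refit_scaled[0])  # scaled with the test fold's own statistics
```

The first test row equals the training mean, so it maps to zero with the shared scaler but to a different value when the scaler is refitted on the test fold.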

In [23]:
from exp2_feature_extraction import reviewer_features, reviews_by_reviewer
from scipy.sparse import coo_matrix, hstack

count_vectorizer = CountVectorizer(max_features=1000, min_df=4)
tfidf_transformer = TfidfTransformer()
def get_features(predictor_features, fit=False):
  predictor_features_text = [x.review_content for x in predictor_features]
  bow = tf_idf_bag_of_words(count_vectorizer, tfidf_transformer, predictor_features_text, fit)
  reviewer_scaled = scaled_reviewer_features(predictor_features)
  return hstack([coo_matrix(reviewer_scaled), bow])

run_cross_validate(LogisticRegression(solver="liblinear"), all_reviews, y, get_features)

NameError: name 'LogisticRegression' is not defined

Let's try adding some more features, this time for review details like date posted, the id of the reviewer and the id of the product:

In [25]:
count_vectorizer = CountVectorizer(max_features=1000, min_df=4)
tfidf_transformer = TfidfTransformer()

from datetime import datetime as dt
def extract_date_ordinals(reviews):
  return [dt.strptime(x.date, '%Y-%m-%d').date().toordinal() for x in reviews]

def get_features(predictor_features, fit=False):
  predictor_features_text = [x.review_content for x in predictor_features]
  bow = tf_idf_bag_of_words(count_vectorizer, tfidf_transformer, predictor_features_text, fit)
  reviewer_scaled = scaled_reviewer_features(predictor_features)

  date_ordinals = extract_date_ordinals(predictor_features)
  review_details = StandardScaler().fit_transform([[date_ordinals[i], predictor_features[i].user_id, predictor_features[i].product_id] for i in range(len(predictor_features))])
    
  return hstack([coo_matrix(reviewer_scaled), coo_matrix(review_details), bow])

run_cross_validate(LogisticRegression(solver="liblinear"), all_reviews, y, get_features)

NameError: name 'LogisticRegression' is not defined

Looks like this gave us a boost. Now let's try adding bigrams:

In [134]:
from exp2_feature_extraction import sentiment_features, find_words
from nltk.sentiment.vader import SentimentIntensityAnalyzer

count_vectorizer = CountVectorizer(max_features=10000, min_df=4, ngram_range=(1, 2))
tfidf_transformer = TfidfTransformer()

def get_features(predictor_features, fit=False):
  predictor_features_text = [x.review_content for x in predictor_features]
  bow = tf_idf_bag_of_words(count_vectorizer, tfidf_transformer, predictor_features_text, fit)
  reviewer_scaled = scaled_reviewer_features(predictor_features)

  date_ordinals = extract_date_ordinals(predictor_features)
  
  review_details = [[date_ordinals[i], predictor_features[i].user_id, predictor_features[i].product_id] for i in range(len(predictor_features))]
  review_scaled = StandardScaler().fit_transform(review_details)

  return hstack([coo_matrix(reviewer_scaled), coo_matrix(review_scaled), bow])

run_cross_validate(LogisticRegression(solver="liblinear"), all_reviews, y, get_features)

{'accuracies': [0.7207344351932398,
  0.7172150246994128,
  0.7180451127819549,
  0.719753930280246,
  0.7177654881004163],
 'auroc': [0.7941542649374345,
  0.7921109198984456,
  0.7930493710459263,
  0.7926693562443541,
  0.791249333855305],
 'f1 scores': [0.7200124591185173,
  0.7167486151739592,
  0.7184562404988677,
  0.7193877551020409,
  0.718098311817279],
 'mean accuracy': 0.7187027982110539,
 'mean variance': 1.7513995686725878e-06,
 'mean auroc': 0.792646649196293,
 'mean f1 scores': 0.7185406763421328}

Adding bigrams gives us a tiny boost, and now we're closing in on a good statistical benchmark. Let's switch to SVC as we should have more tweakable options. We have to drop some of our metrics based on probabilities because the underlying implementation of LinearSVC does not expose them to us. We will disable dual because our number of samples exceeds our number of features:

In [27]:
def run_cross_validate(model, X, y, get_features_fn, cv=5):
  skfSplitter = StratifiedKFold(n_splits=cv)
  metrics = { "accuracies": [], "auroc": [], "f1 scores": [] }
    
  for train_indices, test_indices in skfSplitter.split(X, y):
    training_X = get_features_fn([X[x] for x in train_indices], fit=True)
    training_y = [y[x] for x in train_indices]
    test_X = get_features_fn([X[x] for x in test_indices])
    test_y = [y[x] for x in test_indices]
  
    model.fit(training_X, training_y)
    scores = model.decision_function(test_X)
    
    predicted = [1 if score >= 0 else 0 for score in scores]
    metrics["accuracies"].append(accuracy_score(test_y, predicted))
    metrics["auroc"].append(roc_auc_score(test_y, scores))
    metrics["f1 scores"].append(f1_score(test_y, predicted))

  num_samples = len(X)
  metrics["mean accuracy"] = np.mean(metrics["accuracies"])
  metrics["mean variance"] = np.var(metrics["accuracies"])
  metrics["mean auroc"] = np.mean(metrics["auroc"])
  metrics["mean f1 scores"] = np.mean(metrics["f1 scores"])
  return metrics

In [137]:
run_cross_validate(LinearSVC(max_iter=2500, dual=False), all_reviews, y, get_features)

{'accuracies': [0.7110413818814465,
  0.7094168453102184,
  0.7098117193810973,
  0.7104020381532343,
  0.7101845522898155],
 'auroc': [0.783419792835397,
  0.7821408596582322,
  0.7823409564325181,
  0.7830343720488236,
  0.7806408518058535],
 'f1 scores': [0.7117131078944922,
  0.7090371752994246,
  0.71113997649533,
  0.7115223917551298,
  0.7109389525875426],
 'mean accuracy': 0.7101713074031626,
 'mean variance': 3.017916598229258e-07,
 'mean auroc': 0.7823153665561648,
 'mean f1 scores': 0.7108703208063838}
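
If the probability-based metrics were needed again, one option (not used in this notebook) would be to wrap the SVM in scikit-learn's `CalibratedClassifierCV`, which learns a mapping from decision scores to probabilities. A minimal sketch on synthetic data from `make_classification`, with `cv=3` chosen arbitrarily:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data, standing in for the review features.
X_toy, y_toy = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# The wrapper fits the SVM and a score-to-probability mapping with internal CV.
calibrated = CalibratedClassifierCV(LinearSVC(dual=False), cv=3)
calibrated.fit(X_train, y_train)

# predict_proba is now available, so log loss can be computed again.
probabilities = calibrated.predict_proba(X_test)
toy_log_loss = log_loss(y_test, probabilities)

print(probabilities[:3])
print(toy_log_loss)
```

The trade-off is extra training time, since the SVM is refitted once per internal fold.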

It looks like the accuracy is pretty similar, but a little lower. Now let's try automatic tuning of hyperparameters. Because the search tunes the parameters towards the data it sees, we will use a dev set for it, and hold back a test set to check how the tuned model works on totally unseen data. A major parameter to grid search on is C, the penalty parameter of the error term: smaller values of C mean stronger regularisation.

In [58]:
from sklearn.model_selection import GridSearchCV

train_dev_set = all_reviews[:-5000]
train_dev_y = y[:-5000]

model = LinearSVC(max_iter=2500, dual=False)

grid_search = GridSearchCV(cv=5, estimator=model, param_grid={"C": [0.01, 0.025, 0.05, 0.075, 0.1, 0.3, 0.7, 1.0, 10.0]})
grid_search.fit(get_features(train_dev_set), train_dev_y).best_params_

{'C': 0.025}
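
A variation worth considering is to put the feature extraction inside a `Pipeline`, so that the vectorizer is refitted on the training part of every fold during the search. This is a sketch on a made-up corpus, not the notebook's setup: the step names `tfidf` and `svm`, the small grid and `cv=3` are all assumptions chosen for the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus, made up for illustration: 1 = fake, 0 = genuine.
fake_texts = [
    "best place ever amazing", "amazing amazing best ever",
    "best food ever amazing deal", "amazing deal best place",
    "ever the best amazing service", "best best amazing place ever",
]
genuine_texts = [
    "food was fine service slow", "slow service but fine food",
    "fine place food was okay", "okay food slow service",
    "service was okay food fine", "the place was fine okay",
]
texts = fake_texts + genuine_texts
labels = [1] * len(fake_texts) + [0] * len(genuine_texts)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(dual=False)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
search = GridSearchCV(pipeline, param_grid={"svm__C": [0.01, 0.1, 1.0]}, cv=3)
search.fit(texts, labels)

best_c = search.best_params_["svm__C"]
print(best_c, search.best_score_)
```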

In [57]:
print(len(train_dev_set))

5000


In [59]:
from sklearn.model_selection import cross_validate
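# cv and return_train_score are set explicitly so the output has three folds and train scores
cross_validate(LinearSVC(max_iter=2500, dual=False, C=0.025), get_features(train_dev_set), train_dev_y, cv=3, return_train_score=True)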



{'fit_time': array([7.38575029, 8.26084542, 7.40343952]),
 'score_time': array([0.03568912, 0.03565645, 0.02719092]),
 'test_score': array([0.74381962, 0.74452162, 0.74346345]),
 'train_score': array([0.76037478, 0.7602832 , 0.76128362])}

In [52]:
run_cross_validate(LinearSVC(max_iter=2500, dual=False, C=0.1), all_reviews, y, get_features)

{'accuracies': [0.721604324593016,
  0.7199179793084164,
  0.7190393338718698,
  0.7199092773255452,
  0.7161188094202449],
 'auroc': [0.7923979350902117,
  0.7927542076308028,
  0.794105893125735,
  0.792860062528262,
  0.790102489819773],
 'f1 scores': [0.7223203495398346,
  0.7208373331681789,
  0.7202474864655839,
  0.7224360355922289,
  0.7169893139228743],
 'mean accuracy': 0.7193179449038184,
 'mean variance': 3.2498678704489683e-06,
 'mean auroc': 0.792444117638957,
 'mean f1 scores': 0.7205661037377401}

In [None]:
from sklearn.svm import NuSVC
run_cross_validate(NuSVC(kernel="sigmoid", gamma="scale"), all_reviews, y, get_features)

In [None]:
import string

from exp2_feature_extraction import find_words
x_words = [find_words(x) for x in X]

count_vectorizer.fit_transform([" ".join([y for y in x if y not in string.punctuation]) for x in x_words]).shape

In [None]:
from sklearn.preprocessing import StandardScaler

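# Start from a fresh vectorizer with default settings.
count_vectorizer = CountVectorizer()
model = Pipeline([
  ('cv', count_vectorizer),
  ('tfidf', TfidfTransformer()),
  ('classifier', LogisticRegression(solver="liblinear"))
])

# The pipeline extracts its own features, so pass the raw text straight through.
def passthrough_features(predictor_set, fit=False):
  return predictor_set

run_cross_validate(model, X, y, passthrough_features)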

In [None]:
from exp2_feature_extraction import find_words, preprocess_words
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

Results for lemmatized words with all stopwords and all words with <= 3 characters removed:

In [None]:
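# The pipeline does its own feature extraction, so the feature function returns the text unchanged.
run_cross_validate(model, X_lemmatized, y, lambda predictor_set, fit=False: predictor_set)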

This shows only a slightly better result. Perhaps the different versions of words people use are actually important, and perhaps stopwords are important here too, at least more so than in other tasks such as identifying sentiment or topic. Let's try lemmatizing, but without removing stopwords.

In [None]:
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False, stopwords=[]))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
training_helpers.get_accuracy(model, [preprocess(x.review_content) for x in all_reviews], y, 5)