# A deeper look at SVM architectures

In the last experiment that we ran, where we used a LinearSVM without changing any of the default parameters or doing any feature engineering, we achieved an accuracy of 63%. It also gave us convergence warnings for the linear kernel.
In this notebook we will iterate over the SVM design and try different approaches to the problem using this classifier.
Let's import what we need.

In [2]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [4]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

In [5]:
all_reviews = reviews_set[:20000]

In [75]:
X = [x.review_content for x in all_reviews]
y = [1 if x.label else 0 for x in all_reviews]

## Start with Logistic Regression


~~## No AUROC or F1 Score with LinearSVC~~

~~Since LinearSVC uses LibLinear it is not possible to retrieve the probabilities to calculate the AUROC or F1 score. The SVC implementations in sklearn have an option to return probabilities which can be used to calculate the AUC, however Liblinear does not provide this. It is possible to use CalibratedClassifierCV to obtain probabilities, however this also tries to tune the hyperparameters and severely limits the amount of data we can use. If we want to use LinearSVC we will have to leave this out for now. It is also possible to use SVC with a linear kernel, but the LinearSVM is known to perform better.~~

~~Instead of this, I will use Logistic Regression which is very similar to LinearSVC, and performs similarly. Logistic regression will allow me to view these metrics and once I have explored how to improve this I can switch back to LinearSVC. Both classifiers attempt to divide the samples, and logistic regression is known to do a good job at producing a wide margin in its division, which is what SVMs try to do.~~

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
model = Pipeline([
  ('cv', CountVectorizer()),
  ('classifier', LogisticRegression())
])

We define functions to automate the process of cross validation and to report the accuracy of each fold along with its mean and variance, starting with a helper for the AUROC:

In [76]:
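from sklearn.metrics import roc_auc_score

def auroc_score_from_probabilities(probabilities, labels):
  # Each row is [P(class 0), P(class 1)]; AUROC needs the class 1 scores.
  true_probabilities = [probabilities[i][1] for i in range(0, len(labels))]
  return roc_auc_score(labels, true_probabilities)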

In [77]:
import unittest

class TestRocScoreFromProbabilities(unittest.TestCase):
  
  def test_uses_correct_probabilities(self):
    probabilities = [[0.9, 0.1], [0.1, 0.9]]
    labels = [0, 1]
    self.assertEqual(1, auroc_score_from_probabilities(probabilities, labels))

unittest.main(argv=[''], verbosity=2, exit=False)

  
ok

----------------------------------------------------------------------
Ran 1 test in 0.003s

OK


<unittest.main.TestProgram at 0x7f425daecb70>

In [92]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss
import numpy as np

def run_cross_validate(model, X, y, cv=5):
  skfSplitter = StratifiedKFold(n_splits=cv)
  metrics = {
    "accuracies": [],
    "auroc": [],
    "f1 scores": [],
    "log loss": []
  }
    
  false_negatives = 0
  false_positives = 0
  for train_indices, test_indices in skfSplitter.split(X, y):
    training_X = [X[x] for x in train_indices]
    training_y = [y[x] for x in train_indices]
    test_X = [X[x] for x in test_indices]
    test_y = [y[x] for x in test_indices]
  
    model.fit(training_X, training_y)
    probabilities = model.predict_proba(test_X)
    
    predicted = [0 if x[0] > x[1] else 1 for x in probabilities]
    for i in range(0, len(predicted)):
      if (predicted[i] == 0 and test_y[i] == 1):
        false_negatives+=1
      if (predicted[i] == 1 and test_y[i] == 0):
        false_positives+=1
    
    metrics["accuracies"].append(accuracy_score(test_y, predicted))
    metrics["auroc"].append(auroc_score_from_probabilities(probabilities, test_y))
    metrics["f1 scores"].append(f1_score(test_y, predicted))
    metrics["log loss"].append(log_loss(test_y, probabilities))

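  num_samples = len(X)
  metrics["mean accuracy"] = np.mean(metrics["accuracies"])
  # Despite the key name, this is the variance of the per-fold accuracies.
  metrics["mean variance"] = np.var(metrics["accuracies"])
  # The counters are summed over all folds and every sample is tested exactly once,
  # so dividing by cv as well reports 1/cv of the overall error fraction.
  metrics["% false negatives"] = (false_negatives / cv) / num_samples
  metrics["% false positives"] = (false_positives / cv) / num_samples
  return metrics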

In [94]:
run_cross_validate(model, X, y)



{'accuracies': [0.6335916020994752,
  0.6163459135216196,
  0.62775,
  0.6186546636659165,
  0.6244061015253813],
 'auroc': [0.6717081872765418,
  0.6616381399886253,
  0.6740033891220084,
  0.6594462348431714,
  0.6690801197038021],
 'f1 scores': [0.6465766634522662,
  0.6273367322165574,
  0.6393799951562121,
  0.6326186461093712,
  0.6366715045960329],
 'log loss': [0.842830348638973,
  0.8295839561356689,
  0.8308771414279466,
  0.8298968201343846,
  0.8273542307425769],
 'mean accuracy': 0.6241496561624785,
 'mean variance': 3.865438495834681e-05,
 '% false negatives': 0.0336,
 '% false positives': 0.041569999999999996}

## Metrics to evaluate

### Accuracy
This is the percentage of correct classifications. The higher the better, but it is not appropriate in all cases.

Accuracy is vulnerable to the imbalanced classification problem. When one class is far more common than the other, a classifier can predict that class most of the time, or all of the time, and still achieve a high accuracy.
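
As a minimal sketch of the imbalance problem, the snippet below uses made-up labels (not our dataset, which `get_balanced_dataset` is presumably already balancing) and a "classifier" that always predicts genuine. All names here are hypothetical and only for illustration.

```python
# Hypothetical, made-up labels: 95 genuine reviews (0) and 5 fake reviews (1).
labels = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts genuine.
predictions = [0] * len(labels)

# Accuracy is the fraction of predictions that match the labels.
correct = sum(1 for truth, guess in zip(labels, predictions) if truth == guess)
accuracy = correct / len(labels)
print(accuracy)  # 0.95
```

The accuracy looks excellent even though not a single fake review was found, which is why accuracy alone can be misleading on imbalanced data.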

### False Positives, False Negatives
False positives are samples that were classified as fake, but were in fact genuine. False negatives are samples that were classified as genuine, but were in fact fake. We want to reduce both of these values as much as possible. True positives and true negatives are the complementary values which we want to maximise, but they follow directly from the false positives and false negatives, so there's no need to include them here.

One important question here is **false negatives vs false positives**: is it worse to falsely suggest that something is fake, or to falsely suggest that something is genuine? In this system a human would probably be paid to read the reviews flagged as suspicious. The classifier then acts as a filter over the huge number of reviews to make the human's job easier, so it is good to catch all the fake reviews even if some genuine ones are flagged too. In that case it is better to reduce false negatives. If no human checks the flagged reviews, the situation is different and it would probably be better to reduce false positives.
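
One way to trade false negatives for false positives is to move the probability threshold at which a review is flagged as fake. The sketch below is only an illustration with made-up probabilities; `count_errors` is a hypothetical helper, not part of this notebook.

```python
def count_errors(labels, fake_probabilities, threshold):
    """Return (false_negatives, false_positives) when flagging p >= threshold as fake."""
    false_negatives = sum(
        1 for label, p in zip(labels, fake_probabilities) if label == 1 and p < threshold
    )
    false_positives = sum(
        1 for label, p in zip(labels, fake_probabilities) if label == 0 and p >= threshold
    )
    return false_negatives, false_positives

# Made-up example: 1 = fake, 0 = genuine, with a predicted probability of being fake.
labels = [1, 1, 1, 0, 0, 0]
fake_probabilities = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

print(count_errors(labels, fake_probabilities, 0.5))  # (1, 1)
print(count_errors(labels, fake_probabilities, 0.3))  # (0, 2)
```

Lowering the threshold removes the false negative at the cost of an extra false positive, which suits the scenario where a human reads the flagged reviews.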

### Recall
Of all the fake reviews, what percentage were identified as fake? This is not subject to the imbalanced
classification problem. We aim to maximise it as an indication of how well we are really identifying our fake reviews. 

We cannot focus solely on recall, because we could identify all reviews as fake and achieve 100% recall. Precision must be included in the consideration.

### Precision
Of all the reviews identified as fake, what percentage are actually fake? If we classify all reviews as fake, our precision is only as high as the proportion of fake reviews in the data. If we classify all reviews as genuine, nothing is identified as fake, so precision is undefined (and usually reported as 0).

In our case it might be more important to have a high recall, if we don't want to miss any fake reviews. Otherwise, if we want to be as accurate as possible, we can balance recall and precision.
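
The two degenerate classifiers mentioned above can be checked with a small sketch. `precision_recall` is a hypothetical helper written for illustration, and returning 0.0 when a ratio is undefined is a convention chosen here; in practice `precision_score` and `recall_score` from `sklearn.metrics` would be used.

```python
def precision_recall(labels, predictions):
    """Compute (precision, recall) for the positive class 1 (fake)."""
    pairs = list(zip(labels, predictions))
    true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)
    flagged = true_positives + false_positives
    actual_fake = true_positives + false_negatives
    # Assumption: report 0.0 when the ratio is undefined (nothing flagged / no fakes).
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual_fake if actual_fake else 0.0
    return precision, recall

# Made-up balanced labels: 4 fake (1) and 4 genuine (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
all_fake = [1] * len(labels)
all_genuine = [0] * len(labels)

print(precision_recall(labels, all_fake))     # (0.5, 1.0)
print(precision_recall(labels, all_genuine))  # (0.0, 0.0)
```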

### F1 Score
This is the harmonic mean of precision and recall. Because of this it punishes extreme values, such as a recall or a precision of 0.0.

It also acts as a single-number metric representing both precision and recall.
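
A short sketch of why the harmonic mean punishes extremes, compared with a plain arithmetic mean. The function name `f1` is hypothetical; `f1_score` from `sklearn.metrics` computes the same quantity directly from labels and predictions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced precision and recall: harmonic and arithmetic means agree.
print(f1(0.5, 0.5))  # 0.5

# One extreme value: the arithmetic mean would be 0.5, the harmonic mean is 0.
print(f1(1.0, 0.0))  # 0.0
```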

### Area Under Curve (AUC) 
This gives us a measure of discrimination: how well the model separates the two classes. It works on the predicted scores rather than a hard 'Yes' or 'No', which can make it more informative than accuracy.

At different classification thresholds, how well do we rank fake reviews as 'more fake' than genuine reviews? We plot the true positive rate against the false positive rate at each threshold to get a curve. Lowering the threshold means more reviews are classified as fake, which increases the true positive rate. It also means fewer reviews are classified as genuine, which decreases the true negative rate and therefore increases the false positive rate.

An AUC of 0.8 means that there is an 80% chance that a randomly chosen fake review is scored as more likely to be fake than a randomly chosen genuine review.
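
The ranking interpretation can be sketched without plotting a curve: compare every (fake, genuine) pair and count how often the fake review gets the higher score. `pairwise_auc` is a hypothetical helper for illustration, with ties counted as half a win; `roc_auc_score` from `sklearn.metrics` is what the notebook actually uses.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive has the higher score."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Toy values: three of the four (fake, genuine) pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```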

### Mean Squared Error
The average of the squared differences between the actual values and the predicted values. Squaring puts the focus on large errors, and it is often considered easier to work with than mean absolute error.

The closer the mean squared error is to zero the better. It incorporates both the variance and the bias of the estimator.
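
To illustrate the focus on large errors, the sketch below compares two made-up sets of predictions with the same mean absolute error but very different mean squared error. The helper names are hypothetical and chosen only for this example.

```python
def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [0.0, 0.0, 0.0, 0.0]
spread_out = [0.25, 0.25, 0.25, 0.25]   # four small errors
concentrated = [1.0, 0.0, 0.0, 0.0]     # one large error

print(mean_absolute_error(actual, spread_out))    # 0.25
print(mean_absolute_error(actual, concentrated))  # 0.25
print(mean_squared_error(actual, spread_out))     # 0.0625
print(mean_squared_error(actual, concentrated))   # 0.25
```

The single large error is penalized four times as heavily under mean squared error, while mean absolute error cannot tell the two cases apart.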

### Logarithmic loss
This takes into account the uncertainty of a prediction, based on how far the predicted probability is from the actual label. As the predicted probability of the correct label approaches 1, the log loss shrinks slowly towards 0. As it approaches 0, the log loss increases rapidly. This means that confident incorrect predictions are heavily penalized.

We aim to minimize log loss.
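
A small sketch of the per-sample log loss for a review whose true label is fake (1), at several predicted probabilities. `log_loss_single` is a hypothetical helper for illustration; `log_loss` from `sklearn.metrics` averages this quantity over all samples.

```python
import math

def log_loss_single(label, fake_probability):
    """Negative log of the probability assigned to the correct label."""
    p_correct = fake_probability if label == 1 else 1.0 - fake_probability
    return -math.log(p_correct)

# True label is fake (1); the loss grows quickly as the prediction becomes
# confidently wrong, and shrinks slowly as it becomes confidently right.
for p in (0.99, 0.9, 0.6, 0.4, 0.1, 0.01):
    print(p, round(log_loss_single(1, p), 3))

# Reference point: always predicting 0.5 gives ln(2) per sample.
baseline = log_loss_single(1, 0.5)
```

As a reference point, always predicting 0.5 gives a log loss of ln 2, roughly 0.693 per sample, which is useful to keep in mind when reading the log loss values reported by `run_cross_validate`.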

### Cohen's Kappa
A reliability metric used when there is more than one classifier. It measures how much the classifiers agree with each other, corrected for the agreement expected by chance. It is out of scope at this stage.

SVM with a linear kernel is actually supposed to be well suited to text classification. We might however see better results if we preprocess our text to lemmatize and remove stopwords. Since bag of words is our main feature here, this should hopefully be influential. In this case we are removing all of the stopwords, which may be a bad idea. We can't know for sure unless we experiment.

In [None]:
from exp2_feature_extraction import find_words, preprocess_words
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

In [None]:
X_set = [x.review_content for x in all_reviews[:100]]

cv = CountVectorizer()
training_X2 = cv.fit_transform(X_set)
training_X2.shape

In [None]:
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)

In [None]:
run_cross_validate(model, X_lemmatized, y)

This shows only a slightly better result. Perhaps the different forms of words people use are actually important, and perhaps stopwords are important here too, or at least more important than in other tasks such as identifying sentiment or topic. Let's try lemmatizing, but without removing stopwords.

In [None]:
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False, stopwords=[]))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
training_helpers.get_accuracy(model, [preprocess(x.review_content) for x in all_reviews], y, 5)