# A deeper look at SVM architectures

In the last experiment that we ran, where we used a LinearSVM without changing any of the default parameters or doing any feature engineering, we achieved an accuracy of 63%. It also gave us convergence warnings for the linear kernel.
In this notebook we will iterate over the SVM design and try different approaches to the problem using this classifier.
Let's import what we need.

In [1]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [3]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

In [115]:
all_reviews = reviews_set[:20000]

X = [x.review_content for x in all_reviews]
y = [1 if x.label else 0 for x in all_reviews]

## Start with Logistic Regression


~~## No AUROC or F1 Score with LinearSVC~~

~~Since LinearSVC uses LibLinear it is not possible to retrieve the probabilities to calculate the AUROC or F1 score. The SVC implementations in sklearn have an option to return probabilities which can be used to calculate the AUC, however Liblinear does not provide this. It is possible to use CalibratedClassifierCV to obtain probabilities, however this also tries to tune the hyperparameters and severely limits the amount of data we can use. If we want to use LinearSVC we will have to leave this out for now. It is also possible to use SVC with a linear kernel, but the LinearSVM is known to perform better.~~

~~Instead of this, I will use Logistic Regression which is very similar to LinearSVC, and performs similarly. Logistic regression will allow me to view these metrics and once I have explored how to improve this I can switch back to LinearSVC. Both classifiers attempt to divide the samples, and logistic regression is known to do a good job at producing a wide margin in its division, which is what SVMs try to do.~~

In [11]:
from sklearn.metrics import roc_auc_score

def auroc_score_from_probabilities(probabilities, labels):
  true_probabilities = [probabilities[i][1] for i in range(0, len(labels))]
  return roc_auc_score(labels, true_probabilities)

In [12]:
import unittest

class TestRocScoreFromProbabilities(unittest.TestCase):
  
  def test_uses_correct_probabilities(self):
    probabilities = [[0.9, 0.1], [0.1, 0.9]]
    labels = [0, 1]
    self.assertEqual(1, auroc_score_from_probabilities(probabilities, labels))

unittest.main(argv=[''], verbosity=2, exit=False)

  
ok

----------------------------------------------------------------------
Ran 1 test in 0.003s

OK


<unittest.main.TestProgram at 0x7f5b66394d30>

We define a function to automate the process of cross validation and finding the accuracy, the mean and the variance:

In [63]:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, log_loss
import numpy as np
def run_cross_validate(model, X, y, get_features_fn, cv=5):
  skfSplitter = StratifiedKFold(n_splits=cv)
  metrics = {
    "accuracies": [],
    "auroc": [],
    "f1 scores": [],
    "log loss": []
  }
    
  false_negatives = 0
  false_positives = 0
  for train_indices, test_indices in skfSplitter.split(X, y):
    training_X = get_features_fn([X[x] for x in train_indices], fit=True)
    training_y = [y[x] for x in train_indices]
    test_X = get_features_fn([X[x] for x in test_indices])
    test_y = [y[x] for x in test_indices]
  
    model.fit(training_X, training_y)
    probabilities = model.predict_proba(test_X)
    
    predicted = [0 if x[0] > x[1] else 1 for x in probabilities]
    for i in range(0, len(predicted)):
      if (predicted[i] == 0 and test_y[i] == 1):
        false_negatives+=1
      if (predicted[i] == 1 and test_y[i] == 0):
        false_positives+=1
    
    metrics["accuracies"].append(accuracy_score(test_y, predicted))
    metrics["auroc"].append(auroc_score_from_probabilities(probabilities, test_y))
    metrics["f1 scores"].append(f1_score(test_y, predicted))
    metrics["log loss"].append(log_loss(test_y, probabilities))

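  num_samples = len(X)
  metrics["mean accuracy"] = np.mean(metrics["accuracies"])
  # Despite the key name, this is the variance of the per-fold accuracies.
  metrics["mean variance"] = np.var(metrics["accuracies"])
  metrics["mean auroc"] = np.mean(metrics["auroc"])
  metrics["mean f1 scores"] = np.mean(metrics["f1 scores"])
  metrics["mean log loss"] = np.mean(metrics["log loss"])
  # Note: the error counts are summed over all folds and then divided by
  # cv * num_samples, so these values are the share of all samples that were
  # misclassified in each way, divided again by the number of folds.
  metrics["false negatives rate"] = (false_negatives / cv) / num_samples
  metrics["false positives rate"] = (false_positives / cv) / num_samples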
  return metrics

In [69]:
classifier = LogisticRegression(solver="liblinear")

In [64]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
cVec = CountVectorizer()
def get_features(predictor_set, fit=False):
  if fit:
    return cVec.fit_transform(predictor_set)
  return cVec.transform(predictor_set)
run_cross_validate(classifier, X, y, get_features)

{'accuracies': [0.6298425393651587,
  0.6223444138965258,
  0.62375,
  0.6246561640410102,
  0.6266566641660415],
 'auroc': [0.674034894913445,
  0.6668230945534108,
  0.6716134280834111,
  0.6671050696325319,
  0.6693794205076535],
 'f1 scores': [0.6502951593860685,
  0.639809296781883,
  0.6402103753287114,
  0.6427041180671269,
  0.6392848514133849],
 'log loss': [0.8279193869383008,
  0.8392649040012382,
  0.8308313138743578,
  0.8469880476800975,
  0.828149195756153],
 'mean accuracy': 0.6254499562937472,
 'mean variance': 6.783056210080569e-06,
 'mean auroc': 0.6697911815380906,
 'mean f1 scores': 0.642460760195435,
 'mean log loss': 0.8346305696500295,
 'false negatives rate': 0.03331,
 'false positives rate': 0.0416}

## Metrics to evaluate

### Accuracy
This is the percentage of correct classifications. The higher the better; however, it is not appropriate in all cases.

Accuracy is misleading on imbalanced data. When one class is much more common than the other, the classifier can predict it most of the time, or all of the time, and still achieve a high accuracy.
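Our dataset here is balanced, so this is only an illustration, but a small sketch with made-up labels shows how accuracy can hide a useless classifier. The 95/5 split and the always-genuine "classifier" are assumptions for the example:

```python
# Hypothetical imbalanced test set: 95 genuine reviews (0) and 5 fake reviews (1).
labels = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

correct = sum(1 for truth, guess in zip(labels, predictions) if truth == guess)
accuracy = correct / len(labels)
fakes_found = sum(1 for truth, guess in zip(labels, predictions)
                  if truth == 1 and guess == 1)

print(f"accuracy: {accuracy:.2f}")           # accuracy: 0.95
print(f"fake reviews found: {fakes_found}")  # fake reviews found: 0
```

The accuracy looks excellent even though not a single fake review was caught.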

### False Positives, False Negatives
False positives are samples that were classified as fake but were in fact genuine. False negatives are samples that were classified as genuine but were in fact fake. We want to reduce both of these values as much as possible. True positives and true negatives are the complementary values, which we want to maximise, but they are determined by the false positives and negatives, so there's no need to include them here.

One important question here is **False negatives vs False positives**: is it worse to falsely suggest that something is fake, or to falsely suggest that something is genuine? In this system a human would probably be paid to read the suspicious reviews. It would be good to catch all the fake reviews, plus some genuine ones, because the classifier then acts as a filter over the huge number of reviews that makes the human's job easier. In this case it is better to reduce false negatives. If no human checks the flagged reviews, the situation is different and it would probably be better to reduce false positives.
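As a minimal sketch of how these two error types can be counted, the helper below (`error_rates` is a hypothetical name, not part of this notebook's scripts) normalises each error count by the size of the class it comes from. This is the conventional definition and differs from the per-sample figures reported by `run_cross_validate` above. The labels are made up:

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate) for binary labels.

    Assumes 1 means fake (positive) and 0 means genuine (negative).
    """
    false_positives = sum(1 for t, p in zip(labels, predictions) if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 0)
    negatives = sum(1 for t in labels if t == 0)
    positives = sum(1 for t in labels if t == 1)
    # Each rate is normalised by the size of the class the errors come from.
    return false_positives / negatives, false_negatives / positives

labels      = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 1, 0, 0]
fpr, fnr = error_rates(labels, predictions)
print(fpr, fnr)  # 0.5 0.25
```

Here two of the four genuine reviews were flagged as fake, and one of the four fake reviews slipped through.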

### Recall
Of all the fake reviews, what percentage were identified as fake? This is not subject to the imbalanced classification problem. We aim to maximise it as an indication of how well we are really identifying our fake reviews.

We cannot focus solely on recall, because we could identify all reviews as fake and achieve 100% recall. Precision must be included in the consideration.

### Precision
Of all the reviews identified as fake, what percentage are actually fake? If we classify all reviews as fake, then our precision will only equal the proportion of fake reviews in the data. If we classify all reviews as genuine, then precision is undefined, because no reviews were identified as fake.

In our case it might be more important to have a high recall, if we don't want to miss any fake reviews. Otherwise, if we want to be as accurate as possible, we can balance recall and precision.
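The two degenerate strategies described above can be checked with a small sketch. The helper name `precision_recall` and the toy labels are assumptions for illustration, and undefined values are returned as `None`:

```python
def precision_recall(labels, predictions):
    """Return (precision, recall), with 1 = fake as the positive class."""
    tp = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, predictions) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 0)
    # Precision is undefined when nothing was predicted as fake.
    precision = tp / (tp + fp) if (tp + fp) else None
    # Recall is undefined when there are no fake reviews at all.
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

labels = [1, 1, 0, 0, 0, 0, 0, 0]   # 2 fake, 6 genuine (made-up data)

all_fake = precision_recall(labels, [1] * len(labels))
all_genuine = precision_recall(labels, [0] * len(labels))
print(all_fake)     # (0.25, 1.0)
print(all_genuine)  # (None, 0.0)
```

Calling everything fake buys perfect recall at the price of precision, while calling everything genuine finds nothing.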

### F1 Score
This is the harmonic mean of precision and recall. Because of this it punishes extreme values: if either recall or precision is close to 0.0, the F1 score will be low as well.

This also acts as a single number metric representing precision and recall.
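To see why the harmonic mean punishes lopsided results, here is a small sketch comparing it with the plain arithmetic mean. The precision and recall values are made up, and `f1_from` is a hypothetical helper name:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.25, 1.0   # e.g. a model that flags almost everything as fake
harmonic = f1_from(precision, recall)
arithmetic = (precision + recall) / 2

print(f"F1 (harmonic mean): {harmonic:.3f}")    # 0.400
print(f"arithmetic mean:    {arithmetic:.3f}")  # 0.625
```

The harmonic mean stays much closer to the weaker of the two values, so a model cannot earn a good F1 score by excelling at only one of them.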

### Area Under Curve (AUC) 
This gives us a measure of discrimination: how well the model separates the two classes. It uses the predicted scores rather than a hard 'Yes' or 'No', which can make it more informative than accuracy.

The ROC curve plots the true positive rate against the false positive rate at different classification thresholds, showing how well we rank fake reviews as 'more fake' than genuine reviews. Changing the threshold traces out the curve: at a low threshold we classify more reviews as fake, which increases the true positive rate. The same low threshold also classifies fewer reviews as genuine, which decreases the true negative rate and therefore increases the false positive rate.

An AUC of 0.8 means that a randomly chosen fake review has an 80% chance of receiving a higher score than a randomly chosen genuine review.
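The ranking interpretation can be sketched directly: compare every (fake, genuine) pair and count how often the fake review gets the higher score. `pairwise_auc` is a hypothetical helper written for this illustration, and the scores are made up. On the same inputs it should agree with sklearn's `roc_auc_score`:

```python
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the share of (fake, genuine) pairs ranked correctly; ties count half."""
    fake_scores = [s for l, s in zip(labels, scores) if l == 1]
    genuine_scores = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for fake, genuine in product(fake_scores, genuine_scores):
        if fake > genuine:
            wins += 1.0
        elif fake == genuine:
            wins += 0.5
    return wins / (len(fake_scores) * len(genuine_scores))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # predicted probability of "fake"
auc = pairwise_auc(labels, scores)
print(auc)  # 0.75
```

Three of the four pairs are ordered correctly; only the fake review scored 0.35 is ranked below the genuine review scored 0.4.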

### Mean Squared Error
The average of the squared differences between the original values and the predicted values. Squaring puts more weight on large errors, and the result is differentiable everywhere, which makes it easier to optimise than mean absolute error.

The closer the mean squared error is to zero the better. It incorporates both the variance and the bias of the estimator.
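A short sketch on made-up predicted probabilities shows how squaring shifts the weight towards large errors. The helper names are hypothetical, and mean absolute error is included only for comparison:

```python
def mean_squared_error(labels, predicted):
    return sum((t - p) ** 2 for t, p in zip(labels, predicted)) / len(labels)

def mean_absolute_error(labels, predicted):
    return sum(abs(t - p) for t, p in zip(labels, predicted)) / len(labels)

labels    = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.6, 0.8]   # last prediction is confidently wrong

mse = mean_squared_error(labels, predicted)
mae = mean_absolute_error(labels, predicted)
# Share of each total that comes from the single worst prediction.
worst_share_mse = (0 - 0.8) ** 2 / (mse * len(labels))
worst_share_mae = abs(0 - 0.8) / (mae * len(labels))

print(f"MSE: {mse:.3f}, MAE: {mae:.3f}")
print(f"worst prediction's share of MSE: {worst_share_mse:.0%}, of MAE: {worst_share_mae:.0%}")
```

The single bad prediction accounts for a noticeably larger share of the squared error than of the absolute error.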

### Logarithmic loss
This takes into account the uncertainty of a prediction by measuring how far the predicted probability is from the actual label. As the predicted probability approaches the correct label, the log loss decreases only slightly. As it approaches the incorrect label, the log loss increases rapidly. This means that confident incorrect predictions are heavily penalized.

We aim to minimize log loss.
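The asymmetry is easy to see for a single fake review (label 1), where the loss is just the negative log of the probability assigned to "fake". This is a minimal sketch, and the probabilities are made up:

```python
import math

def log_loss_for_fake(probability_of_fake):
    """Log loss contribution of one review whose true label is fake (1)."""
    return -math.log(probability_of_fake)

# From confidently right to confidently wrong.
probabilities = [0.99, 0.9, 0.6, 0.4, 0.1, 0.01]
losses = [log_loss_for_fake(p) for p in probabilities]

for p, loss in zip(probabilities, losses):
    print(f"p(fake) = {p:<4}  log loss = {loss:.3f}")
```

Going from 0.9 to 0.99 barely changes the loss, while going from 0.1 to 0.01 roughly doubles it.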

### Cohen's Kappa
A reliability metric used when there is more than one classifier. It measures how much the classifiers agree with each other, corrected for the agreement that would be expected by chance. It is out of scope at this stage.

## Current Results

Our results look quite normal. The AUROC and F1 scores above are as expected and are in line with similar work in this field. The variance across folds is small, so we can depend on our scores.

Since there is nothing alarming, the next thing to consider is improving the features or finding more predictive ones. We might see better results if we convert our bag of words to TF-IDF. ~~We might see better results if we preprocess our text to lemmatize and remove stopwords. Since bag of words is our main feature here, this should hopefully be influential. In this case we are removing all of the stopwords, which may be a bad idea. We can't know for sure unless we experiment.~~

In [65]:
from sklearn.feature_extraction.text import TfidfTransformer

count_vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
def get_features(predictor_features, fit=False):
  if fit:
    return tfidf_transformer.fit_transform(count_vectorizer.fit_transform(predictor_features))
  else:
    return tfidf_transformer.transform(count_vectorizer.transform(predictor_features))

run_cross_validate(classifier, X, y, get_features)

{'accuracies': [0.6628342914271432,
  0.6585853536615845,
  0.6715,
  0.6624156039009752,
  0.6609152288072018],
 'auroc': [0.7180413579339924,
  0.7220066549320763,
  0.7318238456584436,
  0.7173215112945878,
  0.7198641317670225],
 'f1 scores': [0.6678158089140607,
  0.659351620947631,
  0.6755555555555556,
  0.6666666666666666,
  0.6618453865336659],
 'log loss': [0.615781638358499,
  0.6121420811627103,
  0.6065944124939086,
  0.6172954348356238,
  0.6146133744730216],
 'mean accuracy': 0.663250095559381,
 'mean variance': 1.922832248143387e-05,
 'mean auroc': 0.7218115003172245,
 'mean f1 scores': 0.666247007723516,
 'mean log loss': 0.6132853882647528,
 'false negatives rate': 0.033389999999999996,
 'false positives rate': 0.033960000000000004}

Switching to TF-IDF had a positive effect on our results, which is not unexpected. Let's have a look at our vocabulary to see which words are actually in our bag of words. Let's also try to clean our features: hopefully a smaller set of features, more concentrated on quality words, will improve accuracy. ~~Let's be adventurous and try using a bag of words that only includes stopwords, on the hypothesis that it is the way people say things that is predictive of a fake review, rather than what they talk about.~~ Currently our bag of words has the following shape:

In [17]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(X).shape

(20000, 33772)

Let's set a low limit for the minimum document frequency. This cleans out everything from typos to random gibberish:

In [18]:
count_vectorizer = CountVectorizer(min_df=4)
count_vectorizer.fit_transform(X).shape

(20000, 11548)
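To make the effect of `min_df` concrete, here is a toy sketch on four made-up reviews. With an integer value, `min_df` is the minimum number of documents a term must appear in to be kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews; "grate" is a typo and "asdfgh" is gibberish, each seen once.
docs = [
    "great food great service",
    "great food slow service",
    "food was grate",
    "asdfgh service",
]

full = CountVectorizer().fit(docs)
pruned = CountVectorizer(min_df=2).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))  # ['food', 'great', 'service']
```

Only the terms that occur in at least two reviews survive. Note that rare but legitimate words ("slow", "was") are dropped along with the noise.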

In [70]:
run_cross_validate(classifier, X, y, get_features)

{'accuracies': [0.6628342914271432,
  0.6585853536615845,
  0.6715,
  0.6624156039009752,
  0.6609152288072018],
 'auroc': [0.7180413579339924,
  0.7220066549320763,
  0.7318238456584436,
  0.7173215112945878,
  0.7198641317670225],
 'f1 scores': [0.6678158089140607,
  0.659351620947631,
  0.6755555555555556,
  0.6666666666666666,
  0.6618453865336659],
 'log loss': [0.615781638358499,
  0.6121420811627103,
  0.6065944124939086,
  0.6172954348356238,
  0.6146133744730216],
 'mean accuracy': 0.663250095559381,
 'mean variance': 1.922832248143387e-05,
 'mean auroc': 0.7218115003172245,
 'mean f1 scores': 0.666247007723516,
 'mean log loss': 0.6132853882647528,
 'false negatives rate': 0.033389999999999996,
 'false positives rate': 0.033960000000000004}

Note that these numbers are identical to the previous TF-IDF run. The most likely explanation is that the `get_features` function used here still relied on a `CountVectorizer` created without `min_df=4`, so the smaller vocabulary was not actually applied in this run.

Now let's try adding reviewer features, which have been shown to be predictive:

In [118]:
from exp2_feature_extraction import reviewer_features, reviews_by_reviewer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import coo_matrix, hstack

count_vectorizer = CountVectorizer(max_features=100, min_df=4)
tfidf_transformer = TfidfTransformer()
def get_features(predictor_features, fit=False):
  bow = None
  predictor_features_text = [x.review_content for x in predictor_features]
  if fit:
    bow = tfidf_transformer.fit_transform(count_vectorizer.fit_transform(predictor_features_text))
  else:
    bow = tfidf_transformer.transform(count_vectorizer.transform(predictor_features_text))

  reviewer_reviews = reviews_by_reviewer(predictor_features)
  reviewer_predictors = [list(reviewer_features(x.user_id, reviewer_reviews)) for x in predictor_features]
  # Note: the scaler is refit on every call, so test folds are scaled with their own statistics.
  reviewer_scaled = StandardScaler().fit_transform(reviewer_predictors)
  return hstack([coo_matrix(reviewer_scaled), bow])

run_cross_validate(LogisticRegression(solver="liblinear"), all_reviews, y, get_features)

{'accuracies': [0.6518370407398151,
  0.6425893526618346,
  0.64775,
  0.6524131032758189,
  0.6509127281820455],
 'auroc': [0.7053698245109004,
  0.7013279378206647,
  0.7032494419799112,
  0.7002054107163762,
  0.7096352433961907],
 'f1 scores': [0.6528781460254175,
  0.6405228758169934,
  0.644640605296343,
  0.6599804305283757,
  0.6511744127936032],
 'log loss': [0.6252394261910932,
  0.6304959625433099,
  0.6293422524297301,
  0.6293933869266951,
  0.6227539646349712],
 'mean accuracy': 0.6491004449719029,
 'mean variance': 1.3193011312318335e-05,
 'mean auroc': 0.7039575716848085,
 'mean f1 scores': 0.6498392940921466,
 'mean log loss': 0.6274449985451599,
 'false negatives rate': 0.035480000000000005,
 'false positives rate': 0.0347}

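Although reviewer features are predictive alone, here they do very little to increase the accuracy. In fact, our AUROC is worse! Note that this run also caps the vocabulary at 100 terms (`max_features=100`), so the drop cannot be attributed to the reviewer features alone. Let's try switching to LinearSVC now.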
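One thing to keep in mind for that switch: `LinearSVC` has no `predict_proba`, so `run_cross_validate` as written would not work with it. A possible workaround, sketched here on synthetic data from `make_classification` rather than on our reviews, is to rank with `decision_function`, since `roc_auc_score` accepts non-thresholded decision values. Log loss would still need calibrated probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption for the sketch, not the review features).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

svm = LinearSVC(max_iter=10000)
svm.fit(X_train, y_train)

# Signed distance to the separating hyperplane; larger means "more positive".
scores = svm.decision_function(X_test)
predicted = svm.predict(X_test)

auroc = roc_auc_score(y_test, scores)
print("accuracy:", accuracy_score(y_test, predicted))
print("f1:", f1_score(y_test, predicted))
print("auroc:", auroc)
```

Accuracy and F1 only need the hard predictions, so they carry over unchanged.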

In [None]:
import string

from exp2_feature_extraction import find_words
x_words = [find_words(x) for x in X]

count_vectorizer.fit_transform([" ".join([y for y in x if y not in string.punctuation]) for x in x_words]).shape

In [None]:
count_vectorizer = CountVectorizer()
model = Pipeline([
  ('cv', count_vectorizer),
  ('tfidf', TfidfTransformer()),
  ('classifier', LogisticRegression(solver="liblinear"))
])

# The pipeline vectorizes the text itself, so the feature function
# just passes the raw reviews through unchanged.
def passthrough_features(predictor_set, fit=False):
  return predictor_set

run_cross_validate(model, X, y, passthrough_features)

In [None]:
from exp2_feature_extraction import find_words, preprocess_words
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

Results for lemmatized words with all stopwords and all words with <= 3 characters removed:

In [None]:
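# run_cross_validate requires a feature function; the pipeline does its own
# vectorizing, so we pass the preprocessed text through unchanged.
run_cross_validate(model, X_lemmatized, y, lambda texts, fit=False: texts)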

This shows only a slightly better result. Perhaps the different forms of the words people use are actually important, and perhaps stopwords are important here too, or at least more important than in other tasks such as identifying sentiment or topic. Let's try lemmatizing, but without removing stopwords.

In [None]:
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False, stopwords=[]))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
training_helpers.get_accuracy(model, [preprocess(x.review_content) for x in all_reviews], y, 5)