# A deeper look at SVM architectures

In the last experiment that we ran, where we used a LinearSVM without changing any of the default parameters or feature engineering, we achieved an accuracy of 63%. It also gave us convergence warnings for the linear kernel.
In this notebook we will iterate over the SVM design and try different approaches to the problem using this classifier.
Let's import what we need.

In [2]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [4]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

In [121]:
all_reviews = reviews_set[:20000]

In [75]:
X = [x.review_content for x in all_reviews]
y = [1 if x.label else 0 for x in all_reviews]

## Start with Logistic Regression


~~## No AUROC or F1 Score with LinearSVC~~

~~Since LinearSVC uses LibLinear it is not possible to retrieve the probabilities to calculate the AUROC or F1 score. The SVC implementations in sklearn have an option to return probabilities which can be used to calculate the AUC, however Liblinear does not provide this. It is possible to use CalibratedClassifierCV to obtain probabilities, however this also tries to tune the hyperparameters and severely limits the amount of data we can use. If we want to use LinearSVC we will have to leave this out for now. It is also possible to use SVC with a linear kernel, but the LinearSVM is known to perform better.~~

~~Instead of this, I will use Logistic Regression which is very similar to LinearSVC, and performs similarly. Logistic regression will allow me to view these metrics and once I have explored how to improve this I can switch back to LinearSVC. Both classifiers attempt to divide the samples, and logistic regression is known to do a good job at producing a wide margin in its division, which is what SVMs try to do.~~

In [76]:
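from sklearn.metrics import roc_auc_score

def auroc_score_from_probabilities(probabilities, labels):
  # Column 1 holds the predicted probability of the positive (fake) class.
  true_probabilities = [probabilities[i][1] for i in range(0, len(labels))]
  return roc_auc_score(labels, true_probabilities)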

In [77]:
import unittest

class TestRocScoreFromProbabilities(unittest.TestCase):
  
  def test_uses_correct_probabilities(self):
    probabilities = [[0.9, 0.1], [0.1, 0.9]]
    labels = [0, 1]
    self.assertEqual(1, auroc_score_from_probabilities(probabilities, labels))

unittest.main(argv=[''], verbosity=2, exit=False)

  
ok

----------------------------------------------------------------------
Ran 1 test in 0.003s

OK


<unittest.main.TestProgram at 0x7f425daecb70>

We define a function to automate cross validation. For each fold it records the accuracy, AUROC, F1 score and log loss, and it then reports their means together with the variance of the accuracy:

In [95]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss
import numpy as np

def run_cross_validate(model, X, y, cv=5):
  skfSplitter = StratifiedKFold(n_splits=cv)
  metrics = {
    "accuracies": [],
    "auroc": [],
    "f1 scores": [],
    "log loss": []
  }
    
  false_negatives = 0
  false_positives = 0
  for train_indices, test_indices in skfSplitter.split(X, y):
    training_X = [X[x] for x in train_indices]
    training_y = [y[x] for x in train_indices]
    test_X = [X[x] for x in test_indices]
    test_y = [y[x] for x in test_indices]
  
    model.fit(training_X, training_y)
    probabilities = model.predict_proba(test_X)
    
    predicted = [0 if x[0] > x[1] else 1 for x in probabilities]
    for i in range(0, len(predicted)):
      if (predicted[i] == 0 and test_y[i] == 1):
        false_negatives+=1
      if (predicted[i] == 1 and test_y[i] == 0):
        false_positives+=1
    
    metrics["accuracies"].append(accuracy_score(test_y, predicted))
    metrics["auroc"].append(auroc_score_from_probabilities(probabilities, test_y))
    metrics["f1 scores"].append(f1_score(test_y, predicted))
    metrics["log loss"].append(log_loss(test_y, probabilities))

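  num_samples = len(X)
  metrics["mean accuracy"] = np.mean(metrics["accuracies"])
  # Despite the key name, this is the variance of the per-fold accuracies.
  metrics["mean variance"] = np.var(metrics["accuracies"])
  metrics["mean auroc"] = np.mean(metrics["auroc"])
  metrics["mean f1 scores"] = np.mean(metrics["f1 scores"])
  metrics["mean log loss"] = np.mean(metrics["log loss"])
  # Note: each sample is tested exactly once across the folds, so the extra
  # division by cv makes these values 1/cv of the overall error fractions.
  metrics["% false negatives"] = (false_negatives / cv) / num_samples
  metrics["% false positives"] = (false_positives / cv) / num_samples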
  return metrics

In [106]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
model = Pipeline([
  ('cv', CountVectorizer()),
  ('classifier', LogisticRegression(solver="liblinear"))
])

In [103]:
run_cross_validate(model, X, y)

{'accuracies': [0.6335916020994752,
  0.6163459135216196,
  0.62775,
  0.6186546636659165,
  0.6244061015253813],
 'auroc': [0.6717081872765418,
  0.6616381399886253,
  0.6740033891220084,
  0.6594462348431714,
  0.6690801197038021],
 'f1 scores': [0.6465766634522662,
  0.6273367322165574,
  0.6393799951562121,
  0.6326186461093712,
  0.6366715045960329],
 'log loss': [0.842830348638973,
  0.8295839561356689,
  0.8308771414279466,
  0.8298968201343846,
  0.8273542307425769],
 'mean accuracy': 0.6241496561624785,
 'mean variance': 3.865438495834681e-05,
 'mean auroc': 0.6671752141868298,
 'mean f1 scores': 0.6365167083060881,
 'mean log loss': 0.83210849941591,
 '% false negatives': 0.0336,
 '% false positives': 0.041569999999999996}

## Metrics to evaluate

### Accuracy
This is the percentage of correct classifications. The higher the better, however it is not appropriate in all cases.

This metric suffers from the imbalanced classification problem. When there are many more samples of one class, the classifier can choose that class much more often, or all the time, and still achieve a high accuracy.
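As a small illustration of the point above, the sketch below uses made-up labels (95 genuine, 5 fake) and a classifier that always answers "genuine". The data and the helper `accuracy` are hypothetical and only meant to show the effect. Our own dataset is balanced, so accuracy is less misleading here.

```python
# Hypothetical imbalanced labels: 95 genuine (0) and 5 fake (1) reviews.
labels = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts genuine.
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
  """Fraction of predictions that match the true labels."""
  correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
  return correct / len(y_true)

majority_accuracy = accuracy(labels, predictions)
fakes_found = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 1)

print(majority_accuracy)  # 0.95
print(fakes_found)        # 0
```

The accuracy looks excellent even though not a single fake review was caught.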

### False Positives, False Negatives
False positives are samples that were classified as fake, but were in fact genuine. False negatives are samples that were classified as genuine, but were in fact fake. We want to reduce both of these values as much as possible. True positives and true negatives are the complementary values, which we want to maximise. For a fixed dataset they are determined by the false negatives and false positives, so there's no need to include them here.

One important question here is **False negatives vs False positives**: is it worse to falsely suggest that something is fake, or to falsely suggest that something is genuine? In this system a human would probably be paid to read the suspicious reviews. It would be good to catch all the fake reviews, plus some genuine ones, because this filters the huge number of reviews and makes the human's job easier. In this case it is better to reduce false negatives. If humans are not checking the flagged reviews, the situation is different and it would probably be better to reduce false positives.
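One way to act on this trade-off is to move the decision threshold applied to the predicted probability of "fake". The sketch below uses invented scores and labels, and the helper `count_errors` is hypothetical. It only shows how a lower threshold trades false negatives for false positives.

```python
# Hypothetical predicted probabilities of "fake" and true labels (1 = fake).
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def count_errors(scores, labels, threshold):
  """Return (false_negatives, false_positives) at the given threshold."""
  false_negatives = 0
  false_positives = 0
  for score, label in zip(scores, labels):
    predicted = 1 if score >= threshold else 0
    if predicted == 0 and label == 1:
      false_negatives += 1
    elif predicted == 1 and label == 0:
      false_positives += 1
  return false_negatives, false_positives

for threshold in (0.7, 0.5, 0.35):
  print(threshold, count_errors(scores, labels, threshold))
# 0.7 (2, 0)
# 0.5 (1, 1)
# 0.35 (0, 2)
```

With a human reviewer in the loop, the lowest threshold would be the natural choice here: no fake review is missed, at the cost of two genuine reviews being flagged.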

### Recall
Of all the fake reviews, what percentage were identified as fake? This is not subject to the imbalanced classification problem. We aim to maximise it as an indication of how well we are really identifying our fake reviews.

We cannot focus solely on recall, because we could identify all reviews as fake and achieve 100% recall. Precision must be included in the consideration.

### Precision
Of all the reviews identified as fake, what percentage are actually fake? If we classify all reviews as fake, then our precision will be low. If we classify all reviews as genuine, then precision is undefined, because no review was identified as fake.

In our case it might be more important to have a high recall, if we don't want to miss any fake reviews. Otherwise if we want to be as accurate as possible we can balance recall and precision.
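The two degenerate cases described above can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels. The helper `precision_recall` is hypothetical and returns `None` where a metric is undefined.

```python
def precision_recall(y_true, y_pred):
  """Return (precision, recall) for the positive class (1 = fake)."""
  tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
  fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
  fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
  # Precision is undefined when nothing was predicted as fake.
  precision = tp / (tp + fp) if (tp + fp) > 0 else None
  # Recall is undefined when there are no fake reviews at all.
  recall = tp / (tp + fn) if (tp + fn) > 0 else None
  return precision, recall

labels = [1, 1, 0, 0, 0, 0]   # hypothetical: 2 fake, 4 genuine
all_fake = [1] * len(labels)
all_genuine = [0] * len(labels)

print(precision_recall(labels, all_fake))     # precision 2/6, recall 1.0
print(precision_recall(labels, all_genuine))  # (None, 0.0)
```

Classifying everything as fake gives perfect recall but poor precision, which is why the two are read together.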

### F1 Score
This is the harmonic mean of precision and recall. Because of this it punishes extreme values, such as a recall or a precision of 0.0.

This also acts as a single number metric representing precision and recall.
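To see how the harmonic mean punishes an extreme value, the sketch below compares it with the arithmetic mean for a few invented precision and recall pairs. The helper `f1` is hypothetical and simply applies the standard formula 2PR / (P + R).

```python
def f1(precision, recall):
  """Harmonic mean of precision and recall (0.0 when both are 0)."""
  if precision + recall == 0:
    return 0.0
  return 2 * precision * recall / (precision + recall)

balanced = f1(0.5, 0.5)    # 0.5, same as the arithmetic mean
lopsided = f1(0.9, 0.1)    # about 0.18, arithmetic mean would be 0.5
degenerate = f1(1.0, 0.0)  # 0.0, arithmetic mean would be 0.5

print(balanced, lopsided, degenerate)
```

All three pairs have the same arithmetic mean, yet only the balanced pair keeps a reasonable F1 score.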

### Area Under Curve (AUC) 
This gives us a measure of discrimination: how well we separate the two classes. It works on the predicted scores rather than a hard 'Yes' or 'No', which can make it more interesting than accuracy.

To build the ROC curve we sweep the classification threshold and, at each value, plot the true positive rate against the false positive rate. Lowering the threshold labels more reviews as fake, which raises the true positive rate. It also labels fewer reviews as genuine, which lowers the true negative rate and therefore raises the false positive rate. The area under this curve summarises, across all thresholds, how well we score fake reviews as 'more fake' than genuine reviews.

An AUC of 0.8 means that a randomly chosen fake review has an 80% chance of receiving a higher score than a randomly chosen genuine review.
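The AUC can also be computed directly as a ranking statistic: the fraction of (fake, genuine) pairs in which the fake review receives the higher score, with ties counted as half. The sketch below does this on invented scores. The helper `pairwise_auc` is hypothetical and quadratic in the number of samples, so it is for illustration rather than for the full dataset.

```python
def pairwise_auc(scores, labels):
  """Fraction of (fake, genuine) pairs ranked correctly; ties count 0.5."""
  fake_scores = [s for s, t in zip(scores, labels) if t == 1]
  genuine_scores = [s for s, t in zip(scores, labels) if t == 0]
  wins = 0.0
  for fake in fake_scores:
    for genuine in genuine_scores:
      if fake > genuine:
        wins += 1.0
      elif fake == genuine:
        wins += 0.5
  return wins / (len(fake_scores) * len(genuine_scores))

# Hypothetical predicted probabilities of "fake" and true labels (1 = fake).
scores = [0.9, 0.7, 0.6, 0.4, 0.3]
labels = [1, 0, 1, 1, 0]

auc = pairwise_auc(scores, labels)
print(auc)  # 4 of the 6 pairs are ranked correctly
```

No threshold appears anywhere in the calculation, which is why AUC does not depend on a 'Yes' or 'No' decision.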

### Mean Squared Error
The average of the squared differences between the actual values and the predicted values. Squaring puts extra weight on large errors, and the result is easier to differentiate than mean absolute error.

The closer the mean squared error is to zero the better. It incorporates the variance and the bias of the estimator.
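The extra weight on large errors is easiest to see next to mean absolute error. In the sketch below, both sets of invented predictions have the same mean absolute error, but the one with a single large mistake has a much higher mean squared error. The helper functions are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
  """Average of the squared differences."""
  return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
  """Average of the absolute differences."""
  return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 0]
small_errors = [0.8, 0.2, 0.8, 0.2]     # every prediction is off by 0.2
one_large_error = [1.0, 0.0, 1.0, 0.8]  # one prediction is off by 0.8

# Both have a mean absolute error of about 0.2 ...
print(mean_absolute_error(y_true, small_errors))
print(mean_absolute_error(y_true, one_large_error))
# ... but the mean squared error is about 0.04 versus about 0.16.
print(mean_squared_error(y_true, small_errors))
print(mean_squared_error(y_true, one_large_error))
```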

### Logarithmic loss
This takes into account the uncertainty of a prediction by measuring how far the predicted probability is from the actual label. As the predicted probability approaches the correct label, the log loss decreases only slightly. As it approaches the incorrect label, the log loss increases rapidly. This means that confident incorrect predictions are heavily penalized.

We aim to minimize log loss.
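The asymmetry described above shows up clearly when the loss of a single prediction is tabulated. The sketch below uses the standard definition, the negative log of the probability assigned to the true class. The helper `log_loss_single` is hypothetical and does no clipping, so it would fail for a probability of exactly 0.

```python
import math

def log_loss_single(true_label, prob_fake):
  """Negative log of the probability assigned to the true class."""
  prob_of_truth = prob_fake if true_label == 1 else 1.0 - prob_fake
  return -math.log(prob_of_truth)

# Loss for a review that really is fake, at different predicted probabilities.
losses = {p: log_loss_single(1, p) for p in (0.99, 0.9, 0.6, 0.4, 0.1, 0.01)}
for prob, loss in losses.items():
  print(prob, round(loss, 3))

# A model that always answers 0.5 scores ln(2), roughly 0.693.
uninformed = log_loss_single(1, 0.5)
print(round(uninformed, 3))
```

Moving from 0.9 to 0.99 barely changes the loss, while moving from 0.1 to 0.01 roughly doubles it. The ln(2) value for a constant 0.5 prediction is a useful reference point when reading the log loss figures in this notebook.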

### Cohen's Kappa
A reliability metric that measures how much two raters (here, two classifiers) agree, corrected for the agreement expected by chance. It is out of scope at this stage.
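Although it is out of scope for now, the calculation itself is short. The sketch below applies the usual formula, kappa = (p_o - p_e) / (1 - p_e), to the invented outputs of two hypothetical classifiers, where p_o is the observed agreement and p_e the agreement expected by chance.

```python
def cohens_kappa(rater_a, rater_b):
  """Chance-corrected agreement between two lists of labels."""
  n = len(rater_a)
  observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
  # Chance agreement: probability both pick the same class independently.
  expected = 0.0
  for cls in set(rater_a) | set(rater_b):
    expected += (rater_a.count(cls) / n) * (rater_b.count(cls) / n)
  if expected == 1.0:
    return 1.0  # both raters always give the same single class
  return (observed - expected) / (1.0 - expected)

# Hypothetical predictions from two classifiers on six reviews (1 = fake).
classifier_a = [1, 1, 0, 0, 1, 0]
classifier_b = [1, 0, 0, 0, 1, 1]

kappa = cohens_kappa(classifier_a, classifier_b)
print(kappa)  # agreement 4/6, chance 0.5, so kappa is about 0.33
```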

## Current Results

Our results are quite normal. The AUROC and F1 scores above are as expected and reflect similar work in this field. The variance across folds is small, so we can depend on our scores.

Since there is nothing alarming, the next thing to consider is improving the features or finding more predictive ones. We might see better results if we convert our bag of words (BoW) to TF-IDF. ~~We might see better results if we preprocess our text to lemmatize and remove stopwords. Since bag of words is our main feature here, this should hopefully be influential. In this case we are removing all of the stopwords, which may be a bad idea. We can't know for sure unless we experiment.~~

In [122]:
from sklearn.feature_extraction.text import TfidfTransformer

count_vectorizer = CountVectorizer()
model = Pipeline([
  ('cv', count_vectorizer),
  ('tfidf', TfidfTransformer()),
  ('classifier', LogisticRegression(solver="liblinear"))
])
run_cross_validate(model, X, y)

{'accuracies': [0.6690827293176705,
  0.6510872281929517,
  0.66775,
  0.6546636659164791,
  0.6619154788697175],
 'auroc': [0.722337793250552,
  0.7113747879740162,
  0.7300764077506791,
  0.7134590236596106,
  0.7257707107888505],
 'f1 scores': [0.6659939455095862,
  0.6478304742684157,
  0.666164280331575,
  0.6546636659164792,
  0.6551020408163266],
 'log loss': [0.6156789102877588,
  0.6212551438499597,
  0.6076363719266309,
  0.6188976615989417,
  0.6118785210913428],
 'mean accuracy': 0.6608998204593638,
 'mean variance': 5.001862191968383e-05,
 'mean auroc': 0.7206037446847416,
 'mean f1 scores': 0.6579508813684766,
 'mean log loss': 0.6150693217509268,
 '% false negatives': 0.03419,
 '% false positives': 0.03363}

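Looks like switching to TF-IDF had a positive effect on our results, which is not unexpected: mean accuracy rose from 0.624 to 0.661, mean AUROC from 0.667 to 0.721, and mean log loss fell from 0.832 to 0.615.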
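To make the reweighting concrete, here is a rough pure-Python sketch of TF-IDF on three invented sentences. It assumes the smoothed formula documented for scikit-learn's `TfidfTransformer` defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation. The documents and helper names are hypothetical, and the real transformer works on sparse matrices rather than dictionaries.

```python
import math
from collections import Counter

# Hypothetical mini corpus; whitespace tokenisation for simplicity.
docs = [
  "the food was great",
  "the service was slow",
  "the price was fair",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
  """Smoothed inverse document frequency (assumed sklearn default)."""
  return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
  """Term count times idf, scaled to unit L2 length."""
  counts = Counter(tokens)
  weights = {term: count * idf(term) for term, count in counts.items()}
  norm = math.sqrt(sum(w * w for w in weights.values()))
  return {term: w / norm for term, w in weights.items()}

vector = tfidf_vector(tokenized[0])
for term, weight in vector.items():
  print(term, round(weight, 3))
```

Words that occur in every document ("the", "was") end up with a lower weight than words specific to one review ("food", "great"), which is the effect the raw counts were missing.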

In [97]:
from exp2_feature_extraction import find_words, preprocess_words
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [98]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

All I wanted was a quick and healthy dinner.  When I walked in there were 2 staff members and no customers at all.  I ordered a "designer" shrimp salad but asked to sub the dressing.  I looked up to read the dressing menu, decided on chilled avocado out loud and turned to see he'd walked out of sight. After a minute he came back and slowly put the items in a bowl leaving me ample time to see some selections were a bit dried out, others also looked unappealing, I asked to add black olives but they don't have any. I told him 3 times which dressing I'd like on top.       Second salad I asked for a build your own and he asked me over and over what I'd said (though I only said 2 items at a time).  As he slowly put it together I tallied up the salads add ons and felt it was too pricey for what I was getting; a ton of lettuce and def less than a full serving of turkey.     Then I waited at the counter to pay from the girl who didn't seem to know how to work the register.  She asked for help a

'want quick healthi dinner when walk staff member custom order design shrimp salad ask dress look read dress menu decid chill avocado loud turn walk sight after minut come slowli item bowl leav ampl time select dri look unapp ask black oliv tell time dress like second salad ask build ask say say item time slowli talli salad felt pricey get lettuc serv turkey then wait counter girl know work regist ask help wait minut manag come stand minut wait realiz avocado dress salad person order avocado explain remak after minut leav home realiz charg premium salad turkey coke run home give mistak salad fix mean turkey salad price shrimp salad minor melt blood sugar level frustrat sharp headach develop wast effort wind scroung leav over want place kill salad fresh salad want home coupl dollar time go return salad order money'

In [100]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

Results for lemmatized words with all stopwords and all words with <= 3 characters removed:

In [104]:
run_cross_validate(model, X_lemmatized, y)

{'accuracies': [0.6285928517870533,
  0.6183454136465883,
  0.62775,
  0.6286571642910728,
  0.6311577894473619],
 'auroc': [0.6680352775253957,
  0.6601132271229335,
  0.6667615034141229,
  0.6678853578957584,
  0.6720064294268845],
 'f1 scores': [0.6438159156279961,
  0.6312484907027288,
  0.6360303104375459,
  0.6463443677065968,
  0.6451768101996632],
 'log loss': [0.7759444849708552,
  0.776865274259934,
  0.7745017827134671,
  0.7678159432315894,
  0.7639525810825661],
 'mean accuracy': 0.6269006438344153,
 'mean variance': 1.9597118020436023e-05,
 'mean auroc': 0.666960359077019,
 'mean f1 scores': 0.6405231789349062,
 'mean log loss': 0.7718160132516825,
 '% false negatives': 0.03293,
 '% false positives': 0.04169}

This shows only a slightly better result than the first bag-of-words run (mean accuracy 0.627 vs 0.624, with mean AUROC unchanged at 0.667). Perhaps the different forms of words people use are actually important, and perhaps stopwords are important here too, at least more so than in other tasks such as identifying sentiment or topic. Let's try lemmatizing, but without removing stopwords.

In [None]:
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False, stopwords=[]))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
training_helpers.get_accuracy(model, [preprocess(x.review_content) for x in all_reviews], y, 5)