# A deeper look at SVM architectures

In the last experiment that we ran, where we used a LinearSVM without changing any of the deafult parameters or feaure engineering, we achieved an accuracy of 63%. It also gave us convergence warnings for the linear kernel.
In this notebook we will iterate over the SVM design and try different approaches to the problem using this classifier.
Let's import what we need.

In [2]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [4]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

In [5]:
all_reviews = reviews_set[:20000]

In [6]:
X = [x.review_content for x in all_reviews]
y = [x.label for x in all_reviews]

## No AUROC or F1 Score with LinearSVC

Since LinearSVC uses LibLinear it is not possible to retrieve the probabilities to calculate the AUROC or F1 score. The SVC implementations in sklearn have an option to return probabilities which can be used to calculate the AUC, however Liblinear does not provide this. It is possible to use CalibratedClassifierCV to obtain probabilities, however this also tries to tune the hyperparameters and severely limits the amount of data we can ue. If we want to use LinearSVC we will have to leave this out for now. It is also possible to use SVC with a linear kernel, but the LinearSVM is known to perform better.

Instead of this, I will use Logistic Regression which is very similar to LinearSVC, and performs similarly. Logistic regression will allow me to view these metrics and once I have explored how to improve this I can switch back to LinearSVC. Both classifiers attempt to divide the samples, and logistic regression is known to do a good job at producing a wide margin in it's division, which is what SVMs try to do.

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
model = Pipeline([
  ('cv', CountVectorizer()),
  ('classifier', LogisticRegression())
])

We define a function to automate the process of cross validation and finding the accuracy, the mean and the variance:

In [45]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.calibration import CalibratedClassifierCV
import numpy as np

def run_cross_validate(model, X, y, cv=5):
  skfSplitter = StratifiedKFold(n_splits=cv)

  scores = []
  auc = []
  false_negatives = 0
  false_positives = 0
  for train_indices, test_indices in skfSplitter.split(X, y):
    training_X = [X[x] for x in train_indices]
    training_y = [y[x] for x in train_indices]
    test_X = [X[x] for x in test_indices]
    test_y = [y[x] for x in test_indices]
  
    model.fit(training_X, training_y)
    probabilities = model.predict_proba(test_X)
    
    predicted = [0 if x[0] > x[1] else 1 for x in probabilities]
    for i in range(0, len(predicted)):
      if (predicted[i] == 0 and test_y[i] == 1):
        false_negatives+=1
      if (predicted[i] == 1 and test_y[i] == 0):
        false_positives+=1
    
    true_probabilities = [probabilities[i][(1 if test_y[i] else 0)] for i in range(0, len(test_y))]
    
    scores.append(accuracy_score(test_y, predicted))
    auc.append(roc_auc_score(test_y, true_probabilities))

  num_samples = len(X)
  return {
    "scores": scores,
    "mean": np.mean(scores),
    "variance": np.var(scores),
    "% false negatives": (false_negatives / cv) / num_samples,
    "% false positives": (false_positives / cv) / num_samples,
    "auc": auc
  }

In [46]:
run_cross_validate(model, X, y)



{'scores': [0.6335916020994752,
  0.6163459135216196,
  0.62775,
  0.6186546636659165,
  0.6244061015253813],
 'mean': 0.6241496561624785,
 'variance': 3.865438495834681e-05,
 '% false negatives': 0.0336,
 '% false positives': 0.041569999999999996,
 'auc': [0.4899456953602668,
  0.49318293357002274,
  0.4782727178178415,
  0.5032204665086492,
  0.48034827563091126]}

## Metrics to evaluate

### False Positives, False Negatives
False positives are samples that were classified as fake, but were in fact genuine. False negatives are samples that were classified as genuine, but were in fact fake. We want to reduce both of these values as much as possible. True positives and true negatives are the adverse values which we want to maximise, although they correlate to false positives and negatives so there's no need to include them here.

One important question here is **False negatives vs False positives**? Is it worse to falsely suggest that something is fake, or is it worse to falsely suggest that something is genuine? Probably in this system

### Area Under Curve (AUC) 

### F1 Score
In this case we can't use these metrics metric, as long as we want to continue with LinearSVC. More later.

SVM with a linear kernel is actually supposed to be well suited to text classification. We might however see better results if we preprocess our text to lemmatize and remove stopwords. Since bag of words is our main feature here, this should hopefully be influential. In this case we are removing all of the stopwords, which may be a bad idea. We can't know for sure unless we experiment.

In [None]:
from exp2_feature_extraction import find_words, preprocess_words
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

In [None]:
X_set = [x.review_content for x in all_reviews[:100]]

cv = CountVectorizer()
training_X2 = cv.fit_transform(X_set)
training_X2.shape

In [None]:
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)

In [None]:
run_cross_validate(model, X_lemmatized, y)

This shows only a slightly better result. Perhaps the different versions of words people use are actually important, and perhaps stopwords are important here too. At least more important than other tasks, for example identifying sentiment or topic. Let's try lemmatizing, but without removing stopwords.

In [None]:
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False, stopwords=[]))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
training_helpers.get_accuracy(model, [preprocess(x.review_content) for x in all_reviews], y, 5)