# A deeper look at SVM architectures

In the last experiment, where we used a LinearSVC without changing any of the default parameters and without any feature engineering, we achieved an accuracy of 63%. It also gave us convergence warnings for the linear kernel.
In this notebook we will iterate over the SVM design and try different approaches to the problem using this classifier.
Let's import what we need.

In [2]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [4]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

In [5]:
all_reviews = reviews_set[:20000]

In [None]:
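from sklearn.svm import SVC  # SVC is not among the imports above

# ngram_range starts at 1: a lower bound of 0 would add empty-string n-grams
cv = CountVectorizer(stop_words='english', ngram_range=(1, 2))
classifier = SVC(random_state=42) # Starting seed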

In [6]:
X = [x.review_content for x in all_reviews]
y = [x.label for x in all_reviews]

In [23]:
cv = CountVectorizer()
classifier = LinearSVC(random_state=42) # Starting seed

In [18]:
model = Pipeline([ ('cv', cv), ('classifier', classifier) ])

To measure the accuracy without any preprocessing, we first define a function that automates cross-validation and reports the per-fold accuracies, their mean and their variance:

In [19]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.calibration import CalibratedClassifierCV
import numpy as np

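def run_cross_validate(model, X, y, cv=5):
  """Cross-validate the model and summarise the per-fold accuracy."""
  # With the default scoring, a classifier is scored on accuracy
  scores = cross_val_score(model, X, y, cv=cv)

  return {
    "scores": scores,
    "mean": np.mean(scores),
    "variance": np.var(scores)
  }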

In [21]:
classifier = LinearSVC(random_state=42) # Starting seed
run_cross_validate(model, X, y)



{'scores': array([0.60409898, 0.58660335, 0.59925   , 0.58489622, 0.59589897]),
 'mean': 0.594149504643719,
 'variance': 5.412706563136708e-05}
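
When `cv` is an integer and the model is a classifier, `cross_val_score` uses stratified folds without shuffling. If the reviews happen to be ordered in some way, it may be worth passing an explicit shuffled splitter instead. The following is a minimal sketch of that idea, not the setup used above: `toy_X`, `toy_y`, `toy_model` and `splitter` are hypothetical names, and the tiny corpus only stands in for the real reviews.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the review texts and labels
toy_X = [
    "great food and friendly staff", "terrible service, never again",
    "loved the atmosphere", "awful and overpriced",
    "best pizza in town", "rude waiter and cold food",
    "amazing experience overall", "worst meal I have had",
    "fantastic desserts", "bad coffee and slow service",
]
toy_y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

toy_model = Pipeline([
    ("cv", CountVectorizer()),
    ("classifier", LinearSVC(random_state=42)),
])

# Shuffled, stratified folds with a fixed seed so the splits are repeatable
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_scores = cross_val_score(toy_model, toy_X, toy_y, cv=splitter)

summary = {
    "scores": toy_scores,
    "mean": np.mean(toy_scores),
    "std": np.std(toy_scores),  # same units as the accuracy, unlike the variance
}
print(summary)
```

The same `splitter` object could be passed as the `cv` argument of `run_cross_validate`, since that argument is forwarded to `cross_val_score` unchanged.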

## Metrics to evaluate

### Area Under Curve (AUC) and F1 Score
LinearSVC does not output probabilities, so we leave these metrics out for now. More on this below.

### Mean Squared Error

## No AUROC or F1 Score

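Since LinearSVC is built on LIBLINEAR, it has no `predict_proba` method, so we cannot retrieve class probabilities from it directly. The SVC implementations in sklearn have a `probability` option that returns probabilities, which can be used to calculate the AUC, but LIBLINEAR does not provide this. It is possible to wrap the classifier in CalibratedClassifierCV to obtain probabilities. However, this fits an additional calibration step using its own internal cross-validation, which limits the amount of data available to train the classifier itself. Strictly speaking, the F1 score only needs predicted labels and the AUROC can also be computed from the `decision_function` scores, so both remain options if we revisit this. We will have to leave this out for now. It is also possible to use SVC with a linear kernel, but LinearSVC generally scales better to a large number of samples.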
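
Even without probabilities, ranking metrics can be computed from the signed distances that a linear SVM exposes through `decision_function`, and F1 only needs the predicted labels. The sketch below illustrates this with `cross_validate` and sklearn's built-in `"f1"` and `"roc_auc"` scorers. It is an illustration under assumptions, not part of the experiments in this notebook: `toy_X`, `toy_y` and `toy_model` are hypothetical names and the small corpus is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the review texts and labels
toy_X = [
    "great food and friendly staff", "terrible service, never again",
    "loved the atmosphere", "awful and overpriced",
    "best pizza in town", "rude waiter and cold food",
    "amazing experience overall", "worst meal I have had",
    "fantastic desserts", "bad coffee and slow service",
]
toy_y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

toy_model = Pipeline([
    ("cv", CountVectorizer()),
    ("classifier", LinearSVC(random_state=42)),
])

# Stratified folds keep both classes in every test fold,
# which the AUROC needs in order to be defined
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# "roc_auc" ranks the test samples by their decision_function scores
# when the estimator offers no probabilities
results = cross_validate(
    toy_model, toy_X, toy_y, cv=splitter,
    scoring=["accuracy", "f1", "roc_auc"],
)

for name in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(name, results[name].mean())
```

On a corpus this small the numbers themselves mean nothing. The point is only that the call runs with a plain LinearSVC in the pipeline.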

SVM with a linear kernel is actually supposed to be well suited to text classification. We might however see better results if we preprocess our text to lemmatize and remove stopwords. Since bag of words is our main feature here, this should hopefully be influential. In this case we are removing all of the stopwords, which may be a bad idea. We can't know for sure unless we experiment.
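
To get a feel for what stopword removal discards, the following sketch compares the vocabulary that CountVectorizer builds with and without its built-in English stopword list. This is a stand-in for illustration only: the `preprocess_words` helper used in this notebook may rely on a different stopword list, and `toy_docs` is a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences, not taken from the dataset
toy_docs = [
    "The food was great and the staff were friendly",
    "It was not the best place I have been to",
]

# Same tokenizer, with and without sklearn's built-in English stopword list
with_stopwords = CountVectorizer().fit(toy_docs)
without_stopwords = CountVectorizer(stop_words="english").fit(toy_docs)

kept = set(without_stopwords.vocabulary_)
dropped = set(with_stopwords.vocabulary_) - kept

print("kept:   ", sorted(kept))
print("dropped:", sorted(dropped))
```

The dropped set is worth reading through before committing to a list. Built-in lists tend to include function words such as negations, which could plausibly carry signal for this task.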

In [None]:
from exp2_feature_extraction import find_words, preprocess_words
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

In [None]:
X_set = [x.review_content for x in all_reviews[:100]]

cv = CountVectorizer()
training_X2 = cv.fit_transform(X_set)
training_X2.shape

In [None]:
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)

In [None]:
run_cross_validate(model, X_lemmatized, y)

This shows only a slightly better result. Perhaps the different forms of words that people use are actually important, and perhaps stopwords are important here too, at least more so than in other tasks such as identifying sentiment or topic. Let's try lemmatizing, but without removing stopwords.

In [None]:
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False, stopwords=[]))

In [None]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

In [None]:
training_helpers.get_accuracy(model, [preprocess(x.review_content) for x in all_reviews], y, 5)