# A deeper look at SVM architectures

In the last experiment that we ran, where we used a LinearSVM without changing any of the deafult parameters or feaure engineering, we achieved an accuracy of 63%. It also gave us convergence warnings for the linear kernel.
In this notebook we will iterate over the SVM design and try different approaches to the problem using this classifier.
Let's import what we need.

In [1]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [3]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

We wish to look further into how an SVC peforms on our data, by tweaking the kernel and parameters.
Let's run the classic SVC without changing any parameters first (the default kernel is 'rbf'.)

In [4]:
from sklearn.svm import SVC

In [23]:
all_reviews = reviews_set[:20000]

In [24]:
cv = CountVectorizer(stop_words='english', ngram_range=(0, 2))
classifier = SVC(random_state=42) # Starting seed

X = [x.review_content for x in all_reviews]
y = [x.label for x in all_reviews]

In [None]:
model = Pipeline([ ('cv', cv), ('classifier', classifier) ])
training_helpers.get_accuracy(model, X, y, 5)

SVM with a linear kernel is actually supposed to be well suited to text classification. We should however see better results if we preprocess our text to lemmatize and remove stopwords. Since bag of words is our main feature here, this should hopefully be influential. In this case we are removing all of the stopwords, which may be a bad idea. We can't know for sure unless we experiment.

In [7]:
from exp2_feature_extraction import find_words, preprocess_words
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [8]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

My husband and myself went to Lo Monaco's for dinner.  We ordered an appetizer, salads and entrées.  The appetizer was fine but my salad contained a very large dead moth.  We tried to discreetly to get our server's attention. Finally she came over and looked at me as if I had put the moth in the salad.  She took the salad into the kitchen, everyone in the kitchen came to the door to stare at me one at a time.  No apology or restitution was offered.  I spread the word about the service and the moth.  We will never return.



'husband go monaco dinner order appet salad entré appet fine salad contain larg dead moth tri discreet server attent final come look moth salad take salad kitchen kitchen come door stare time apolog restitut offer spread word servic moth return'

In [10]:
cv = CountVectorizer()
classifier = LinearSVC(random_state=42) # Starting seed

In [11]:
model = Pipeline([ ('cv', cv), ('classifier', classifier) ])

Without any processing, we have the following accuracy:

In [77]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.calibration import CalibratedClassifierCV
import numpy as np

def run_cross_validate(model, X, y, cv=5):
  skfSplitter = StratifiedKFold(n_splits=cv)
  clf = CalibratedClassifierCV(model)
  scores = []
  auc = []
  for train_indices, test_indices in skfSplitter.split(X, y):
    training_X = [X[x] for x in train_indices]
    training_y = [y[x] for x in train_indices]
    test_X = [X[x] for x in test_indices]
    test_y = [y[x] for x in test_indices]
    
    cv = CountVectorizer(stop_words='english', ngram_range=(0, 2))
    training_X2 = cv.fit_transform(training_X)
    
    clf.fit(training_X2, training_y)
    print(model.predict_proba(training_X))
    
    #scores.append(.score(test_X, test_y))
    auc.append(roc_auc_score(training_y, model.predict(training_X)))
        
  return {
    "scores": scores,
    "auc": auc,
    "mean": np.mean(scores),
    "variance": np.var(scores)
  }

In [78]:
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)

0.75

In [80]:
run_cross_validate(classifier, X[:10000], y[:10000])



AttributeError: predict_proba is not available when  probability=False

In [9]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

In [37]:
run_cross_validate(model, X_lemmatized, y)

ValueError: Found input variables with inconsistent numbers of samples: [10000, 20000]

This shows only a slightly better result. Perhaps the different versions of words people use are actually important, and perhaps stopwords are important here too. At least more important than other tasks, for example identifying sentiment or topic. Let's try lemmatizing, but without removing stopwords.

In [14]:
def preprocess(review_content): # Not adding bigrams yet
  return " ".join(preprocess_words(find_words(review_content), bigrams=False, stopwords=[]))

In [15]:
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

My husband and myself went to Lo Monaco's for dinner.  We ordered an appetizer, salads and entrées.  The appetizer was fine but my salad contained a very large dead moth.  We tried to discreetly to get our server's attention. Finally she came over and looked at me as if I had put the moth in the salad.  She took the salad into the kitchen, everyone in the kitchen came to the door to stare at me one at a time.  No apology or restitution was offered.  I spread the word about the service and the moth.  We will never return.



'husband myself go monaco dinner order appet salad entré appet fine salad contain veri larg dead moth tri discreet server attent final come over look moth salad take salad into kitchen everyon kitchen come door stare time apolog restitut offer spread word about servic moth will never return'

In [16]:
training_helpers.get_accuracy(model, [preprocess(x.review_content) for x in all_reviews], y, 5)



0.5962042695260673