# A deeper look at SVM architectures

In the last experiment, we ran a linear SVM with the default parameters and no additional feature engineering, and achieved an accuracy of 63%. The run also produced convergence warnings for the linear kernel.
In this notebook we will iterate on the SVM design and try different approaches to the problem using this classifier.
Let's import what we need.

In [1]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

In [4]:
reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()

We want to look further into how an SVC performs on our data by tweaking the kernel and parameters.
Let's first run the classic SVC without changing any parameters (the default kernel is 'rbf').
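
As a quick check of the defaults mentioned above, the sketch below inspects a freshly constructed SVC and lists a few variants we could try later. The specific C and gamma values are illustrative assumptions, not settings used in this notebook.

```python
from sklearn.svm import SVC

# A freshly constructed SVC exposes its defaults as attributes.
default_svc = SVC()
print(default_svc.kernel, default_svc.C)

# Hypothetical candidates for later experiments (values are assumptions).
candidate_svcs = {
    "rbf_default": SVC(random_state=42),
    "linear": SVC(kernel="linear", random_state=42),
    # A smaller C means stronger regularization (a softer margin).
    "rbf_soft_margin": SVC(kernel="rbf", C=0.1, gamma="scale", random_state=42),
}
```

Keeping the candidates in a dictionary makes it easy to loop over them with the same vectorizer and compare scores side by side.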

In [5]:
from sklearn.svm import SVC

In [7]:
all_reviews = reviews_set[:10000]

In [17]:
cv = CountVectorizer(stop_words='english', ngram_range=(1, 2)) # Unigrams and bigrams; n-gram sizes start at 1
classifier = SVC(random_state=42) # Starting seed

X = [x.review_content for x in all_reviews]
y = [x.label for x in all_reviews]

In [None]:
model = Pipeline([ ('cv', cv), ('classifier', classifier) ])
training_helpers.get_accuracy(model, X, y, 5)
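
The body of training_helpers.get_accuracy is not shown in this notebook. A plausible reading of the call above is "mean accuracy over 5 cross-validation folds", and the sketch below shows what that could look like with scikit-learn. The function name cv_accuracy_sketch and the toy data are assumptions for illustration only. The real helper may behave differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def cv_accuracy_sketch(model, X, y, folds):
    """Hypothetical stand-in for get_accuracy: mean k-fold accuracy."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean()


# Toy data, only so that the sketch runs on its own.
toy_X = ["great food", "awful service", "great value", "awful food"] * 5
toy_y = [1, 0, 1, 0] * 5

toy_model = Pipeline([("cv", CountVectorizer()), ("classifier", SVC())])
toy_score = cv_accuracy_sketch(toy_model, toy_X, toy_y, 5)
print(toy_score)
```

Passing the whole Pipeline to cross_val_score means the vectorizer is refit on each training fold, so vocabulary from the held-out fold does not leak into training.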

SVMs with a linear kernel are generally considered well suited to text classification. We should, however, see better results if we preprocess our text to reduce words to a root form (lemmatizing or stemming) and remove stopwords. Since bag of words is our main feature here, this should hopefully make a difference. In this case we are removing all of the stopwords, which may be a bad idea, for example because negations such as "never" get dropped. We can't know for sure unless we experiment.
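
To illustrate why removing every stopword can be risky, the sketch below compares the tokens that CountVectorizer produces with and without scikit-learn's built-in English stop word list. This is an assumption-laden illustration: our own preprocess_words helper may use a different stop word list, and the example sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

text = "They never disappoint and I have never had better"

# build_analyzer() returns the tokenizing function the vectorizer would use.
analyze_all = CountVectorizer().build_analyzer()
analyze_filtered = CountVectorizer(stop_words="english").build_analyzer()

tokens_all = analyze_all(text)
tokens_filtered = analyze_filtered(text)

print(tokens_all)
print(tokens_filtered)

# Negations are part of the built-in list, so they disappear after filtering.
print("never" in ENGLISH_STOP_WORDS)
```

Once "never" is gone, "never disappoint" is reduced to "disappoint", which reads as the opposite sentiment in a bag-of-words model.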

In [21]:
def preprocess(review_content): # Not adding bigrams yet
    """Return the review as one space-separated string of preprocessed words."""
    return " ".join(preprocess_words(find_words(review_content), bigrams=False))

In [23]:
from exp2_feature_extraction import find_words, preprocess_words
review_content = all_reviews[0].review_content
print(review_content)
preprocess(review_content)

Great food. I always get their octopus appetizer because it's so good and I've never had better! Great value too...everything is well-priced. I've been here a handful of times and they never disappoint. Service is usually excellent though it can get spotty when they're busy. Take your date here. Take your family here.  Come here by yourself. I see that too.



'great food octopus appet good better great valu hand time disappoint servic usual excel spotti busi take date take famili come'

In [None]:
X_lemmatized = [preprocess(x.review_content) for x in all_reviews]

In [None]:
cv = CountVectorizer()
classifier = LinearSVC(random_state=42) # Starting seed

In [None]:
model = Pipeline([ ('cv', cv), ('classifier', classifier) ])
training_helpers.get_accuracy(model, X_lemmatized, y, 5)
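
The convergence warnings mentioned at the start of this notebook can be checked programmatically. The sketch below is one possible approach, not the method used in our experiments: it fits the same kind of pipeline and reports whether scikit-learn emitted a ConvergenceWarning, so that max_iter can be raised if needed. The helper name fit_and_check_convergence, the toy data, and the max_iter value are assumptions for illustration.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def fit_and_check_convergence(texts, labels, max_iter):
    """Fit a CountVectorizer + LinearSVC pipeline and report convergence."""
    model = Pipeline([
        ("cv", CountVectorizer()),
        ("classifier", LinearSVC(random_state=42, max_iter=max_iter)),
    ])
    # Record warnings raised during fitting instead of printing them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(texts, labels)
    converged = not any(
        issubclass(w.category, ConvergenceWarning) for w in caught
    )
    return model, converged


# Toy data, only so that the sketch runs on its own.
toy_texts = ["great food", "awful service", "great value", "awful food"] * 5
toy_labels = [1, 0, 1, 0] * 5

toy_model, toy_converged = fit_and_check_convergence(
    toy_texts, toy_labels, max_iter=10000
)
print(toy_converged)
```

If the solver still fails to converge with a larger max_iter, rescaling the features (for example with TF-IDF weighting instead of raw counts) is another option worth trying.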