# Creating Support Vector Machines in Python

We will now continue building our 20 Newsgroups classifier, this time using Support Vector Machines (SVMs), and compare its performance with the Multinomial Naive Bayes classifier from the previous section.

Here we will use a linear SVM, with an L2 regularizer to control overfitting. These terms may not make much sense to you yet, but we will revisit them later in this module.

Finally, we will use "hinge loss" to measure how well our model fits the data during training.
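To make these two ideas a little more concrete, here is a minimal sketch of hinge loss and an L2 penalty for a binary linear classifier with labels -1 and +1. The function names (`hinge_loss`, `l2_penalty`, `svm_objective`), the toy data, and the parameter `C` as used here are assumptions for illustration only. This is not how `LinearSVC` is implemented internally; among other things, `LinearSVC` handles many classes with a one-vs-rest scheme and also fits an intercept, both of which are left out here.

```python
import numpy as np

def hinge_loss(scores, labels):
    """Mean hinge loss for raw decision scores and labels in {-1, +1}."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # A point costs nothing once it is on the correct side with margin >= 1.
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))

def l2_penalty(weights):
    """Half the squared Euclidean norm of the weight vector."""
    weights = np.asarray(weights, dtype=float)
    return 0.5 * np.dot(weights, weights)

def svm_objective(weights, X, labels, C=1.0):
    """L2 penalty plus C times the summed hinge loss (no intercept term)."""
    weights = np.asarray(weights, dtype=float)
    scores = np.asarray(X, dtype=float) @ weights
    margins = np.maximum(0.0, 1.0 - np.asarray(labels, dtype=float) * scores)
    return l2_penalty(weights) + C * margins.sum()

# Toy data (hypothetical): three points, two features.
X_toy = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y_toy = np.array([1, 1, -1])
w_toy = np.array([1.0, 0.0])

print("Mean hinge loss:", hinge_loss(X_toy @ w_toy, y_toy))
print("Objective value:", svm_objective(w_toy, X_toy, y_toy))
```

Only the second point contributes to the loss: it is classified correctly, but it lies inside the margin. Larger weights could push it outside the margin, but the L2 penalty makes large weights expensive, which is the trade-off that helps control overfitting.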

Let's begin! As always, we start by importing our packages, and we will use a Pipeline to simplify our work. (Note: this will take several minutes to run; SVMs are slower to train than Naive Bayes.)

In [None]:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset = 'train', shuffle = True)
twenty_test = fetch_20newsgroups(subset = 'test', shuffle = True)

# The penalty here is "l2": a lower-case letter "L" followed by the digit 2.
# It is NOT the number 12.

svc_clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                   ('tfidf', TfidfTransformer()), 
                   ('svm', LinearSVC(loss = 'hinge', penalty = 'l2',
                    random_state = 42, max_iter = 5000))])
svc_clf.fit(twenty_train.data, twenty_train.target)
predicted_svc = svc_clf.predict(twenty_test.data)
perf_svc = np.mean(predicted_svc == twenty_test.target)


print("Our 20 Newsgroups SVM classifier correctly classified %3.4f%% of the articles."
      % (perf_svc * 100.0))

We now have an accuracy of 85.24%! (This result was obtained when this document was written; since the datasets are shuffled, you may get slightly different results.) Here are our results so far:

- NB with raw word counts: 77.28%
- NB with tf.idf: 77.39%
- NB with tf.idf and stop words: 81.69%
- SVM: 85.24%
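
To see how large the improvement is, we can compute the gap between the SVM and each Naive Bayes variant. The short sketch below simply hard-codes the accuracies listed above; the dictionary `results` and the helper `gain_over` are hypothetical names introduced for illustration, and your own numbers may differ slightly.

```python
# Accuracies (in percent) copied from the list above.
results = {
    "NB with raw word counts": 77.28,
    "NB with tf.idf": 77.39,
    "NB with tf.idf and stop words": 81.69,
    "SVM": 85.24,
}

def gain_over(results, model):
    """Percentage-point gain of `model` over every other entry."""
    best = results[model]
    return {name: round(best - acc, 2)
            for name, acc in results.items() if name != model}

for name, gain in gain_over(results, "SVM").items():
    print("SVM vs. %s: +%.2f percentage points" % (name, gain))
```

These gaps are in percentage points of accuracy on the test set, not relative improvements.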
