# CS229 Notes: SVM
An auxiliary notebook for the blog essay.

## Example: text classification
This example follows a *scikit-learn* tutorial. It shows how to use a linear SVM for text classification and compares it with a multinomial naive Bayes classifier.

Source: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

#### Import the 20newsgroups dataset

The dataset is downloaded by scikit-learn's fetch function on first use, which takes some time. Later calls reuse the locally cached copy.

In [1]:
import numpy as np

In [2]:
from sklearn.datasets import fetch_20newsgroups

Use only four of the twenty newsgroup categories rather than the full dataset.

In [3]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [4]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)

#### Extract tf-idf features from the text
First count the word occurrences in each document, then reweight the counts with tf-idf:

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
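
To make the two steps above concrete, here is a minimal sketch on a tiny corpus. The documents and the `toy_*` names are made up for illustration and are not part of the tutorial. It assumes the default settings of both transformers, under which each row is L2-normalized and terms that occur in every document get the smallest idf weight.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "the patient needs a doctor",
    "the image needs a renderer",
    "the doctor saw the patient",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)            # raw term counts

toy_transformer = TfidfTransformer()
toy_tfidf = toy_transformer.fit_transform(toy_counts)    # tf-idf weights

# Both matrices are sparse, with shape (n_documents, n_vocabulary_terms).
print(toy_counts.shape, toy_tfidf.shape)

# With the default norm='l2', every document row has unit Euclidean length.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)

# One idf weight per vocabulary term; "the" occurs in every document,
# so no other term has a smaller weight.
vocab = toy_vect.vocabulary_
idf = toy_transformer.idf_
print(idf[vocab["the"]], idf.min())
```

Note that the default tokenizer drops single-character tokens such as "a", so they do not appear in the vocabulary.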

#### Combine the above with the classifier into a pipeline

In [7]:
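from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

# SGDClassifier's default loss is 'hinge', so this trains a linear SVM
# with stochastic gradient descent.
text_clf_with_svm = Pipeline([('vect', CountVectorizer()),
                              ('tfidf', TfidfTransformer()),
                              ('clf', SGDClassifier(max_iter=5, tol=None))])
# The pipeline accepts raw text and applies the vectorizer and the
# tf-idf transformer before fitting the classifier.
text_clf_with_svm.fit(twenty_train.data, twenty_train.target)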

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False))])
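
The link between the fitted pipeline and the SVM decision rule can be sketched on a small binary problem. Everything below (the corpus, the labels, the `toy_*` names, and `random_state=0`) is an assumption made for illustration. The sketch also uses `TfidfVectorizer`, which merges the counting and tf-idf steps into one, instead of the two separate steps used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical two-class toy corpus (0 = graphics, 1 = medicine).
toy_docs = [
    "render the image with shaders",
    "the polygon mesh and texture image",
    "the doctor prescribed medicine",
    "patient symptoms and medicine dosage",
]
toy_labels = [0, 0, 1, 1]

# Hinge loss with an L2 penalty is the linear SVM objective.
toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", random_state=0)),
])
toy_clf.fit(toy_docs, toy_labels)

new_docs = ["texture for the image", "medicine for the patient"]
margins = toy_clf.decision_function(new_docs)  # signed scores w.x + b
labels = toy_clf.predict(new_docs)             # class 1 where the score is positive
print(margins, labels)
```

In the binary case the predicted label depends only on the sign of the score. With four classes, as in this notebook, `decision_function` returns one score per class and the highest one wins.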

In [9]:
docs_test = twenty_test.data
predicted = text_clf_with_svm.predict(docs_test)

Classification accuracy on the test set:

In [10]:
np.mean(predicted == twenty_test.target)

0.92410119840213045
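
A single accuracy number hides which classes get confused with each other. As a sketch, `sklearn.metrics` provides per-class views of the same predictions. The label arrays below are invented for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])

# Fraction of matching entries, the same quantity as np.mean(y_pred == y_true).
acc = accuracy_score(y_true, y_pred)
print(acc)

# Per-class precision, recall and F1 score.
print(classification_report(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In this notebook the corresponding call would take `twenty_test.target` and `predicted`, and `classification_report` accepts `target_names=twenty_test.target_names` to print newsgroup names instead of integer labels.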

#### Compare the linear SVM with a multinomial naive Bayes classifier

In [12]:
from sklearn.naive_bayes import MultinomialNB
text_clf_with_MNB = Pipeline([('vect', CountVectorizer()),
                              ('tfidf', TfidfTransformer()),
                              ('clf', MultinomialNB())])
text_clf_with_MNB.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [14]:
predicted_MNB = text_clf_with_MNB.predict(docs_test)

In [15]:
np.mean(predicted_MNB == twenty_test.target)

0.83488681757656458
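
Since the two pipelines differ only in their final step, the comparison can be written as a loop. The sketch below is one possible way to do this. `make_text_clf`, the toy corpus, and `random_state=0` are hypothetical and chosen only so the example runs without downloading the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_text_clf(clf):
    """Hypothetical helper: wrap a classifier in the vect -> tfidf -> clf pipeline."""
    return Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", clf)])


# Hypothetical toy data (0 = graphics, 1 = medicine).
toy_train = [
    "render the image with shaders",
    "the polygon mesh and texture image",
    "graphics card renders the scene",
    "the doctor prescribed medicine",
    "patient symptoms and medicine dosage",
    "the nurse treated the patient",
]
toy_train_y = [0, 0, 0, 1, 1, 1]
toy_test = ["texture and shaders", "doctor and nurse"]
toy_test_y = [0, 1]

candidates = {
    "linear SVM (SGD)": SGDClassifier(loss="hinge", random_state=0),
    "multinomial NB": MultinomialNB(),
}

scores = {}
for name, clf in candidates.items():
    model = make_text_clf(clf).fit(toy_train, toy_train_y)
    # Pipeline.score returns the mean accuracy of the final classifier.
    scores[name] = model.score(toy_test, toy_test_y)
print(scores)
```

On the real data, `toy_train` and `toy_test` would be replaced by `twenty_train.data` and `twenty_test.data` with their targets.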