### Text Classification using Support Vector Machines (SVM)

In this document, we use an SVM to classify sentences into categories from the [20 newsgroups text dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html). 
Through this evaluation, it will become clear that the SVM is more accurate than the naive Bayes classifier on this dataset.

The initial steps are the same as for the naive Bayes classifier; see the [Naive Bayes classification on 20newsgroups dataset](https://github.com/ooduor/machine-learning/blob/master/Naive%20Bayes%20classification%20on%2020newsgroups%20dataset%20.ipynb) notebook.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
from sklearn import metrics
import numpy as np

# specify the categories to train on, chosen from the list of 20
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print(twenty_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


#### Count the words in the training data and turn the text into numerical feature vectors

In [2]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
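
To make the two steps above more concrete, here is a minimal pure-Python sketch of counting followed by TF-IDF weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are the `TfidfTransformer` defaults as far as we know. The `tfidf` helper, the whitespace tokenizer and the three toy sentences are illustrative assumptions, not part of the notebook; `CountVectorizer` tokenizes differently.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Simplified tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term counts, like CountVectorizer
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Toy corpus, made up for illustration only.
docs = [
    "the patient took antibiotics",
    "the gpu renders graphics",
    "the gpu is fast",
]
vectors = tfidf(docs)
print(vectors[0])
```

A word that occurs in every document, such as "the", ends up with a lower weight than a word that occurs in only one, such as "antibiotics".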

#### Testing the model

We give the classifier two new sentences as input and expect it to assign each one to the correct category. We use the [Support Vector Classification](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) (SVC) strategy with a linear kernel. See all [classification strategies](http://scikit-learn.org/stable/modules/svm.html).
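
As a rough intuition for what a linear-kernel classifier does at prediction time, the sketch below scores a document with a weighted sum of its features plus a bias and picks a class from the sign. It covers only the two-class case; for several classes, SVC combines a number of such pairwise classifiers. The weights, the bias, the feature values and the `decision_function` helper are all hypothetical and are not taken from the trained model.

```python
def decision_function(weights, bias, features):
    """Linear score w . x + b over sparse {term: value} features."""
    return sum(w * features.get(term, 0.0) for term, w in weights.items()) + bias


def classify(weights, bias, features):
    # Positive scores fall on the sci.med side of the hyperplane.
    score = decision_function(weights, bias, features)
    return "sci.med" if score > 0 else "comp.graphics"


# Hypothetical parameters for a sci.med vs comp.graphics classifier.
weights = {"antibiotics": 1.2, "opengl": -1.5, "gpu": -0.9}
bias = 0.1

# Hypothetical TF-IDF style feature values for two short documents.
medical_doc = {"antibiotics": 0.7, "common": 0.5}
graphics_doc = {"opengl": 0.6, "gpu": 0.6}

print(classify(weights, bias, medical_doc))
print(classify(weights, bias, graphics_doc))
```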

In [3]:
clf = svm.SVC(kernel='linear')
clf.fit(X_train_tfidf, twenty_train.target)

# we will write two sentences to test the model.
docs_new = ['Abuse of antibiotics is very common', 
            'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

# show the category predicted by the model
predicted = clf.predict(X_new_tfidf)
print(predicted)

for doc, category in zip(docs_new, predicted):
    print('{} => {}'.format(doc, twenty_train.target_names[category]))

[2 1]
Abuse of antibiotics is very common => sci.med
OpenGL on the GPU is fast => comp.graphics


#### Verify the classification

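We use the [F1 score](https://en.wikipedia.org/wiki/F1_score) to measure the quality of the classification on the test set. The F1 score is the harmonic mean of precision and recall; its best value is 1.0 and its worst is 0. The code below also reports plain accuracy, the fraction of test documents assigned to the correct category.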
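
The relationship between precision, recall and F1 can be sketched in a few lines of plain Python. The helper names and the true-positive, false-positive and false-negative counts are made up for illustration; the last call reuses the rounded precision and recall reported for `alt.atheism` later in this notebook.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single class.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(round(f1_score(p, r), 2))

# Rounded values reported for alt.atheism: precision 0.96, recall 0.83.
atheism_f1 = f1_score(0.96, 0.83)
print(round(atheism_f1, 2))
```

Because it is a harmonic mean, the F1 score sits closer to the lower of the two values, so a class cannot score well on precision alone.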

In [4]:
# get the test data from test dataset
twenty_test = fetch_20newsgroups(
    subset='test', 
    categories=categories, 
    shuffle=True, 
    random_state=42
)
docs_test = twenty_test.data

# vectorize test data
X_test_counts = count_vect.transform(docs_test)

# apply the TF-IDF weighting fitted on the training data
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# use the model to predict the category 
predicted = clf.predict(X_test_tfidf)

# get the precision, recall, f1-score and support of this model
print(metrics.classification_report(
    twenty_test.target,
    predicted,
    target_names=twenty_test.target_names,
))

# get the accuracy of the model
print("Accuracy:\t {}".format(np.mean(predicted == twenty_test.target)))

                        precision    recall  f1-score   support

           alt.atheism       0.96      0.83      0.89       319
         comp.graphics       0.90      0.96      0.93       389
               sci.med       0.94      0.91      0.93       396
soc.religion.christian       0.89      0.96      0.93       398

           avg / total       0.92      0.92      0.92      1502

Accuracy:	 0.9207723035952063


#### We can see the accuracy of this test is 0.92.
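
The `avg / total` row of the report appears to be a support-weighted average of the per-class rows, at least in the scikit-learn version that produced this output. The sketch below recomputes it from the rounded per-class values copied from the report, so small rounding differences from the exact figures are expected. The `report` dictionary and the `weighted_average` helper are illustrative names, not part of the notebook.

```python
# (precision, recall, f1-score, support) copied from the report above.
report = {
    "alt.atheism": (0.96, 0.83, 0.89, 319),
    "comp.graphics": (0.90, 0.96, 0.93, 389),
    "sci.med": (0.94, 0.91, 0.93, 396),
    "soc.religion.christian": (0.89, 0.96, 0.93, 398),
}

total_support = sum(row[3] for row in report.values())


def weighted_average(column):
    """Average one metric column, weighting each class by its support."""
    weighted_sum = sum(row[column] * row[3] for row in report.values())
    return weighted_sum / total_support


avg_precision = weighted_average(0)
avg_recall = weighted_average(1)
avg_f1 = weighted_average(2)

print(total_support)
print(round(avg_precision, 2), round(avg_recall, 2), round(avg_f1, 2))
```

The support-weighted recall is the same quantity as overall accuracy, which is why it agrees with the 0.92 printed by the accuracy line once rounded.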