# Mini-Project - Support Vector Machine

##### Student Tags

Author: Anderson Hitoshi Uyekita    
Mini-Project: Support Vector Machine  
Course: Data Science - Foundations II  
COD: ND111  
Date: 15/01/2019    

***

## Table of Contents
- [Introduction](#intro)
- [Given Code](#code)
    - [Function](#function)
- [Exercise 1](#exercise_1)
- [Exercise 2](#exercise_2)
- [Exercise 3](#exercise_3)
- [Exercise 4](#exercise_4)
- [Exercise 5](#exercise_5)
- [Exercise 6](#exercise_6)
- [Exercise 7](#exercise_7)
- [Exercise 8](#exercise_8)
- [Exercise 9](#exercise_9)
- [Exercise 10](#exercise_10)

### General Information

This Jupyter Notebook (written in Python 2) records all the exercises coded for the Support Vector Machine Mini-Project.

## Introduction <a id='intro'></a>

In this mini-project, we’ll tackle the exact same email author ID problem as the Naive Bayes mini-project, but now with an SVM. What we find will help clarify some of the practical differences between the two algorithms. This project also gives us a chance to play around with parameters a lot more than Naive Bayes did, so we will do that too.


## Given Code <a id='code'></a>

In [1]:
#!/usr/bin/python

""" 
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use a SVM to identify emails from the Enron corpus by their authors:    
    Sara has label 0
    Chris has label 1
"""
    
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


### Defining a function <a id='function'></a>

This helper function fits the classifier, times training and prediction, and prints the accuracy, so each exercise needs only a single call.

In [14]:
# Function to calculate the accuracy.
def my_SVC(kernel, features_train = features_train, labels_train = labels_train,
                   features_test = features_test, labels_test = labels_test,
           C = 0, gamma = 0, prediction = False):
    """
    Fit an SVC with the given kernel, C, and gamma, then print the training
    time, the prediction time, and the accuracy on the test set.
    A value of 0 for C or gamma means "use the scikit-learn default".
    """
    
    # Creating the classifier, passing only the parameters that were set.
    params = {'kernel': kernel}
    if C != 0:
        params['C'] = C
    if gamma != 0:
        params['gamma'] = gamma
    clf = SVC(**params)

    # Saving the start time to compute the elapsed time of the fitting process.
    t0 = time()

    # Fitting/training clf on the training data.
    clf.fit(features_train, labels_train)

    # Printing the elapsed time of the fit.
    print "training time:", round(time()-t0, 3), "s"

    # Saving the start time to compute the elapsed time of the prediction process.
    t1 = time()

    # Storing the predictions for features_test in pred.
    pred = clf.predict(features_test)

    # Printing the elapsed time of the prediction. The label still reads
    # "training time:" to match the recorded outputs below; the second
    # "training time:" line of each output is the prediction time.
    print "training time:", round(time()-t1, 3), "s"

    # Calculating the accuracy (true labels first, then predictions).
    acc = accuracy_score(labels_test, pred)

    # Printing the accuracy.
    print "Accuracy:", round(acc,4)
    
    # Returning the predictions only when requested.
    if prediction:
        return pred
    

### Importing Packages

In [3]:
# Importing packages.

# Importing the Scikit Learn package of Support Vector Machine
from sklearn.svm import SVC

# Importing the Scikit Learn function to calculate the accuracy.
from sklearn.metrics import accuracy_score

## Exercise 1 - SVM Author ID Accuracy <a id='exercise_1'></a>

Go to the svm directory to find the starter code (svm/svm_author_id.py).

Import, create, train and make predictions with the sklearn SVC classifier. When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy of the classifier?

In [4]:
my_SVC(kernel = 'linear', C = 0, gamma = 0)

training time: 176.859 s
training time: 17.849 s
Accuracy: 0.9841


>What is the accuracy of your author identification SVM?

Accuracy: 0.9841
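
For reference, the accuracy reported here is the fraction of test emails whose predicted label matches the true label. The sketch below illustrates that definition in plain Python on made-up labels; `accuracy` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def accuracy(pred, truth):
    """Return the fraction of positions where pred and truth agree."""
    matches = sum(1 for p, t in zip(pred, truth) if p == t)
    return float(matches) / len(truth)

# Made-up labels for illustration only (0 = Sara, 1 = Chris).
toy_pred = [1, 0, 1, 1]
toy_truth = [1, 0, 0, 1]
toy_acc = accuracy(toy_pred, toy_truth)
print("toy accuracy: %.2f" % toy_acc)
```

Three of the four toy predictions match, so the helper returns 0.75.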

## Exercise 2 - SVM Author ID Timing <a id='exercise_2'></a>

Place timing code around the fit and predict functions, like you did in the Naive Bayes mini-project. How do the training and prediction times compare to Naive Bayes?

>Are the SVM training and predicting times faster or slower than Naïve Bayes?

Slower.
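
The timing pattern placed around `fit` and `predict` can be factored into a small helper. This is only a sketch: `timed` is a hypothetical name, and the workload below is a stand-in so the example runs without the Enron data.

```python
from time import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    t0 = time()
    result = func(*args, **kwargs)
    return result, time() - t0

# Stand-in workload; in the notebook this would be clf.fit or clf.predict.
total, elapsed = timed(sum, range(100000))
print("elapsed: %.3f s" % elapsed)
```

With a classifier in hand, the same helper could wrap `clf.fit(features_train, labels_train)` and `clf.predict(features_test)` to compare the two algorithms on equal terms.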

## Exercise 3 - A Smaller Training Set <a id='exercise_3'></a>

One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier. 

features_train = features_train[:len(features_train)/100] 
labels_train = labels_train[:len(labels_train)/100] 

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?

In [5]:
my_SVC(kernel = 'linear', features_train = features_train[:len(features_train)/100] ,
                          labels_train = labels_train[:len(labels_train)/100])

training time: 0.184 s
training time: 1.125 s
Accuracy: 0.8845


>What's the accuracy of your SVM now?

Accuracy: 0.8845
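
The slicing lines above rely on Python 2 integer division. Under Python 3, `len(features_train)/100` would be a float and could not be used as a slice bound, so floor division (`//`) is the portable form. A minimal sketch on stand-in lists, assuming the training-set size is the sum of the two counts printed by `preprocess()` (7936 + 7884 = 15820):

```python
# Stand-in data; the real arrays come from preprocess().
# Assumption: 15820 training emails, the sum of the two printed counts.
n_train = 7936 + 7884
features = list(range(n_train))
labels = [i % 2 for i in range(n_train)]

# Floor division keeps the slice bound an integer on Python 2 and 3.
n_keep = len(features) // 100
features_small = features[:n_keep]
labels_small = labels[:n_keep]

print("kept %d of %d samples" % (len(features_small), len(features)))
```

Under that assumption, the 1% slice keeps only 158 emails, which is consistent with the sharp drop in training time.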

## Exercise 4 - Speed-Accuracy Tradeoff <a id='exercise_4'></a>

If speed is a major consideration (and for many real-time machine learning applications, it certainly is) then you may want to sacrifice a bit of accuracy if it means you can train/predict faster. Which of these are applications where you can imagine a very quick-running algorithm is especially important?

* predicting the author of an email
* flagging credit card fraud, and blocking a transaction before it goes through
* voice recognition, like Siri

>Which of these are applications where you can imagine a very quick-running algorithm is especially important?

* [ ] Predicting the author of an email;
* [x] Flagging credit card fraud, and blocking a transaction before it goes through;
* [x] Voice recognition, like Siri.

We agree!  Voice recognition and transaction blocking need to happen in real time, with almost no delay.  There's no obvious need to predict an email author instantly.

## Exercise 5 - Deploy an RBF Kernel <a id='exercise_5'></a>

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?

In [6]:
my_SVC(kernel = 'rbf', features_train = features_train[:len(features_train)/100] ,
                          labels_train = labels_train[:len(labels_train)/100])

training time: 0.165 s
training time: 1.176 s
Accuracy: 0.616


>What is the accuracy of your SVM now, with this more complex kernel?

Accuracy: 0.6160
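
For context, the RBF kernel scores the similarity of two feature vectors as `exp(-gamma * ||x - y||^2)`: identical points score 1, and the score decays toward 0 as the points move apart. The sketch below computes this in plain Python; `rbf_similarity` is a hypothetical helper, and the toy vectors and `gamma` value are chosen only for illustration.

```python
import math

def rbf_similarity(x, y, gamma):
    """Return exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy vectors, not taken from the email features.
same = rbf_similarity([1.0, 2.0], [1.0, 2.0], gamma=0.5)
near = rbf_similarity([1.0, 2.0], [1.0, 3.0], gamma=0.5)
far = rbf_similarity([1.0, 2.0], [4.0, 6.0], gamma=0.5)
print("same=%.3f near=%.3f far=%.3f" % (same, near, far))
```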

## Exercise 6 - Optimize C Parameter <a id='exercise_6'></a>

Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?

In [7]:
# List of C values to be tested.
list_C = [10, 100, 1000, 10000]

# Loop to calculate the accuracy given the C values.
for index in list_C:
    my_SVC(kernel = 'rbf', features_train = features_train[:len(features_train)/100] ,
                          labels_train = labels_train[:len(labels_train)/100], C = index)

training time: 0.151 s
training time: 1.158 s
Accuracy: 0.616
training time: 0.101 s
training time: 1.161 s
Accuracy: 0.616
training time: 0.122 s
training time: 1.141 s
Accuracy: 0.8214
training time: 0.121 s
training time: 0.919 s
Accuracy: 0.8925


>Which of these values of C gives the best SVM accuracy?

C = 10000 gives the best accuracy: 0.8925.
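
Picking the winner can also be done programmatically. The sketch below hard-codes the accuracies printed by the loop above into a dictionary (`results` and `best_C` are names introduced here for illustration) and selects the `C` with the highest accuracy.

```python
# Accuracies copied from the recorded output above, keyed by C.
results = {10: 0.616, 100: 0.616, 1000: 0.8214, 10000: 0.8925}

# max() with key=results.get returns the key whose value is largest.
best_C = max(results, key=results.get)
print("best C: %d (accuracy %.4f)" % (best_C, results[best_C]))
```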

## Exercise 7 - Accuracy after Optimizing C <a id='exercise_7'></a>

Once you've optimized the C value for your RBF kernel, what accuracy does it give? Does this C value correspond to a simpler or more complex decision boundary?

(If you're not sure about the complexity, go back a few videos to the "SVM C Parameter" part of the lesson. The result that you found there is also applicable here, even though it's now much harder or even impossible to draw the decision boundary in a simple scatterplot.)

>What's the accuracy of your SVM now? Is the boundary more or less complex than with C at its default value of 1.0?

* Accuracy: 0.8925
* More complex

A low `C` makes the decision surface smooth, while a high `C` aims at classifying all training examples correctly.
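
The role of `C` is easiest to see in the standard soft-margin objective that an SVM classifier minimizes. It is written here with labels $y_i \in \{-1, +1\}$ rather than the 0/1 labels used in this notebook, with $\phi$ the feature map implied by the kernel:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\left(w^\top \phi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0 .
$$

The slack variables $\xi_i$ measure how badly each training point violates the margin. A large `C` makes those violations expensive, so the optimizer bends the boundary to fit the training points; a small `C` tolerates violations in exchange for a wider, smoother margin.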

## Exercise 8 - Optimized RBF vs. Linear SVM: Accuracy <a id='exercise_8'></a>

Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?

In [8]:
my_SVC(kernel = 'rbf', C = 10000)

training time: 121.597 s
training time: 12.316 s
Accuracy: 0.9909


>What's the accuracy of your optimized SVM?

Accuracy: 0.9909

## Exercise 9 - Extracting Predictions from an SVM <a id='exercise_9'></a>

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]

In [20]:
pred_9 = my_SVC(kernel = 'rbf', features_train = features_train[:len(features_train)/100] ,
                                labels_train = labels_train[:len(labels_train)/100], C = 10000, prediction = True)

training time: 0.122 s
training time: 0.901 s
Accuracy: 0.8925


In [19]:
# Results of 10, 26 and 50.
pred_9[10], pred_9[26], pred_9[50]

(1, 0, 1)

>What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th?

* 10: 1 (Chris);
* 26: 0 (Sara); and
* 50: 1 (Chris).

## Exercise 10 - How Many Chris Emails Predicted? <a id='exercise_10'></a>

There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)

In [28]:
pred_10 = my_SVC(kernel = 'rbf', C = 10000, prediction = True)

sum(pred_10)

training time: 113.98 s
training time: 10.663 s
Accuracy: 0.9909


877

>There are over 1700 test events; how many are predicted to be in the "Chris" (1) class?

877 emails are classified as Chris.
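
Summing the predictions works as a count because the labels are 0 (Sara) and 1 (Chris), so each Chris prediction adds exactly 1. The sketch below shows the same idea on a made-up prediction list, with `collections.Counter` as an alternative that reports both classes; `toy_pred` and `names` are illustrative and not taken from the real output.

```python
from collections import Counter

# Made-up predictions for illustration only (0 = Sara, 1 = Chris).
toy_pred = [1, 0, 1, 1, 0]
names = {0: "Sara", 1: "Chris"}

# Because the labels are 0/1, the sum equals the number of Chris predictions.
n_chris = sum(toy_pred)

# Counter tallies every class at once.
counts = Counter(toy_pred)
for label in sorted(counts):
    print("%s: %d" % (names[label], counts[label]))
```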