# Support Vector Machines

1. [The Tasks](#tt) <br>
2. [Loading our Data and Libraries](#ld) <br>
3. [SVM Linear Kernel](#lk) <br>
4. [SVM RBF Kernel](#rbf) <br>
5. [Tuned Models](#tm) <br>

***

## The Tasks
<a id="tt"></a>

For the SVM model with a linear kernel, using the high-level OverFeat features:

- Tune the C parameter using grid search with cross-validation.
- Collect the results in a DataFrame as described.
- Find the C value with the best mean accuracy and print it.


For the SVM model with an RBF kernel, using the high-level OverFeat features:

- Tune the C and γ parameters using grid search with cross-validation.
- Collect the results in a DataFrame as described.
- Find the combination of C and γ with the best mean accuracy and print it.

For both models:

- You might want to use PCA as a preprocessing step. In any case, justify your choice.
- Justify the choice of estimator, e.g., `SVC`, `LinearSVC`, or `SGDClassifier`.
- Evaluate and report the accuracy on the 1,000 points from the test set.

***

## Loading/Preparing our Data and Libraries
<a id="ld"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC, SVC

In [2]:
with np.load('cifar4-train.npz', allow_pickle=False) as npz_file:
    cifar4 = dict(npz_file.items())
print('Our data contains {}'.format(cifar4.keys()))

Our data contains dict_keys(['pixels', 'overfeat', 'labels', 'names', 'allow_pickle'])


In [3]:
X_of = cifar4['overfeat']
y = cifar4['labels']

# Split the data into a training set and a test set of 1,000 points
X_tr, X_te, y_tr, y_te = train_test_split(X_of, y, test_size=1000, random_state=0)

***

## SVM Linear Kernel
<a id="lk"></a>

In [4]:
# PCA speeds up training and can reduce overfitting; 200 components retain over 90% of the variance
pca = PCA(n_components=200)
pipe = Pipeline([('pca', pca),
                 ('linearsvc', LinearSVC())])
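
The comment above states that 200 components keep over 90% of the variance. One way to check a claim like this is to fit a full PCA once and read off the cumulative explained variance ratio. The sketch below does so on synthetic data: `X_demo` and the helper `n_components_for_variance` are illustrative names that do not appear in the notebook, and with the real features one would pass `X_tr` instead.

```python
import numpy as np
from sklearn.decomposition import PCA


def n_components_for_variance(X, threshold=0.9):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold` (hypothetical helper)."""
    pca_full = PCA(n_components=None).fit(X)
    cumulative = np.cumsum(pca_full.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))


# Synthetic stand-in for the features: 50 columns driven by 5 latent factors
rng = np.random.RandomState(0)
latent = rng.normal(size=(300, 5))
X_demo = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(300, 50))

k = n_components_for_variance(X_demo, threshold=0.9)
print('Components needed for 90% of the variance:', k)
```

If the number returned for the real training data is well below 200, a smaller `n_components` would be worth trying, since it shortens every fit inside the grid search.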

In [5]:
grid_cv = GridSearchCV(pipe, {'linearsvc__C': [0.0001, 0.001, 0.01, 0.1, 1]},
                       cv=5,
                       n_jobs=-1,
                       return_train_score=True  # recent scikit-learn versions omit train scores by default
                      )

grid_cv.fit(X_tr, y_tr)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=200, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('linearsvc', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'linearsvc__C': [0.0001, 0.001, 0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [6]:
df_lin_svp = pd.DataFrame(grid_cv.cv_results_)[['param_linearsvc__C', 
                                                'mean_test_score', 
                                                'std_test_score',
                                                'mean_train_score',
                                                'std_train_score']]
df_lin_svp.head()

| | param_linearsvc__C | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|
| 0 | 0.0001 | 0.83475 | 0.009764 | 0.860187 | 0.002414 |
| 1 | 0.001 | 0.83325 | 0.00897 | 0.877438 | 0.002323 |
| 2 | 0.01 | 0.8275 | 0.011505 | 0.890126 | 0.00335 |
| 3 | 0.1 | 0.825 | 0.010461 | 0.8885 | 0.003649 |
| 4 | 1.0 | 0.7835 | 0.013469 | 0.839378 | 0.005229 |


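import matplotlib.pyplot as plt  # not imported in the first cell
# Mean CV accuracy against C; x and y take column names, and C is plotted on a log axis
df_lin_svp.astype({'param_linearsvc__C': float}).plot(x='param_linearsvc__C', y='mean_test_score', logx=True)
plt.show()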

In [7]:
best = df_lin_svp['mean_test_score'].idxmax()

print('Our best mean test accuracy was {:.3f} and we achieved this with a C value of {}'
      .format(df_lin_svp.loc[best, 'mean_test_score'], 
              df_lin_svp.loc[best, 'param_linearsvc__C']))

Our best mean test accuracy was 0.835 and we achieved this with a C value of 0.0001
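
A fitted `GridSearchCV` also stores the winning setting directly in `best_params_` and `best_score_`, so the lookup in the results table can be cross-checked. The following is a minimal sketch on synthetic data (the names `X_demo`, `y_demo` and `grid_demo` are made up for this example, and `dual=False` is only chosen to keep the small fit fast); it shows that both routes give the same answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic 4-class problem standing in for the OverFeat features (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=8,
                                     n_classes=4, random_state=0)

grid_demo = GridSearchCV(LinearSVC(dual=False, random_state=0),
                         {'C': [0.001, 0.01, 0.1, 1]}, cv=3)
grid_demo.fit(X_demo, y_demo)

# Route 1: look the winner up in cv_results_, as in the cell above
results = pd.DataFrame(grid_demo.cv_results_)
best_row = results.loc[results['mean_test_score'].idxmax()]

# Route 2: read the attributes GridSearchCV fills in after fitting
print('best C via cv_results_: ', best_row['param_C'])
print('best C via best_params_:', grid_demo.best_params_['C'])
print('best mean CV accuracy:   {:.3f}'.format(grid_demo.best_score_))
```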


***

## SVM RBF Kernel
<a id="rbf"></a>

In [9]:
svc_rbf = SVC(kernel='rbf', random_state=0)

pipe = Pipeline([('pca', pca),
                 ('svc_rbf', svc_rbf)])

In [10]:
grid_cv_rbf = GridSearchCV(pipe, {'svc_rbf__C': [0.01, 0.1, 1, 10],
                                  'svc_rbf__gamma': [0.0001, 0.001, 0.01, 0.1, 1]},
                           cv=5,
                           n_jobs=-1,
                           return_train_score=True)  # recent scikit-learn versions omit train scores by default

grid_cv_rbf.fit(X_tr, y_tr)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=200, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('svc_rbf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'svc_rbf__C': [0.01, 0.1, 1, 10], 'svc_rbf__gamma': [0.0001, 0.001, 0.01, 0.1, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [11]:
# We save the columns we care about in a DataFrame
df_rbf_svp = pd.DataFrame(grid_cv_rbf.cv_results_)[['param_svc_rbf__C', 
                                                    'param_svc_rbf__gamma',
                                                    'mean_test_score', 
                                                    'std_test_score',
                                                    'mean_train_score',
                                                    'std_train_score']]
df_rbf_svp.head()

| | param_svc_rbf__C | param_svc_rbf__gamma | mean_test_score | std_test_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.0001 | 0.45 | 0.004404 | 0.450813 | 0.002341 |
| 1 | 0.01 | 0.001 | 0.25575 | 0.00025 | 0.25575 | 6.3e-05 |
| 2 | 0.01 | 0.01 | 0.25575 | 0.00025 | 0.25575 | 6.3e-05 |
| 3 | 0.01 | 0.1 | 0.25575 | 0.00025 | 0.25575 | 6.3e-05 |
| 4 | 0.01 | 1.0 | 0.25575 | 0.00025 | 0.25575 | 6.3e-05 |
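
`head()` shows only the first five of the twenty C/γ combinations, all with C = 0.01. To see the whole grid at once, the results can be pivoted so that rows are C values and columns are γ values. This is a sketch on synthetic data with a smaller grid; `X_demo`, `y_demo`, `grid_demo` and `score_grid` are illustrative names, not objects from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 4-class problem standing in for the OverFeat features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)

grid_demo = GridSearchCV(SVC(kernel='rbf'),
                         {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]},
                         cv=3)
grid_demo.fit(X_demo, y_demo)

results = pd.DataFrame(grid_demo.cv_results_)
# The parameter columns may be stored as objects; cast them so they sort numerically
results = results.astype({'param_C': float, 'param_gamma': float})

# One row per C, one column per gamma, cells hold the mean CV accuracy
score_grid = results.pivot(index='param_C', columns='param_gamma',
                           values='mean_test_score')
print(score_grid.round(3))
```

Laid out this way, it is easier to see whether the best cell sits on the edge of the grid, which would suggest extending the search range in that direction.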


In [12]:
best = df_rbf_svp['mean_test_score'].idxmax()

print('Our best accuracy was {:.3f} and we achieved this with a C value of {} and a gamma of {}'
      .format(df_rbf_svp.loc[best, 'mean_test_score'], 
              df_rbf_svp.loc[best, 'param_svc_rbf__C'], 
              df_rbf_svp.loc[best, 'param_svc_rbf__gamma'] ))

Our best accuracy was 0.840 and we achieved this with a C value of 10 and a gamma of 0.0001


***

## Tuned Models
<a id="tm"></a>

In [14]:
# Get a baseline accuracy by always predicting the most frequent category
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_tr, y_tr)
accuracy_tr = dummy.score(X_tr, y_tr)
accuracy_te = dummy.score(X_te, y_te)

print('Our Baseline (most frequent) accuracy on the training set is {:.3f}'.format(accuracy_tr))
print('Our Baseline (most frequent) accuracy on the test set is {:.3f}'.format(accuracy_te))

Our Baseline (most frequent) accuracy on the training set is 0.256
Our Baseline (most frequent) accuracy on the test set is 0.227


In [15]:
# Evaluate the tuned models on the test set
acc_lin = grid_cv.score(X_te, y_te)
acc_rbf = grid_cv_rbf.score(X_te, y_te)

print('Our Linear SVM model gives us an accuracy on the test set of {:.3f}'.format(acc_lin))
print('Our RBF SVM model gives us an accuracy on the test set of {:.3f}'.format(acc_rbf))

Our Linear SVM model gives us an accuracy on the test set of 0.817
Our RBF SVM model gives us an accuracy on the test set of 0.823
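
Calling `score` on a fitted `GridSearchCV` evaluates the pipeline that was refit on the full training set with the best parameters (`refit=True` in the output above), so the PCA step is applied to the test data automatically. The sketch below illustrates this on synthetic data. The names `X_a`, `X_b`, `pipe_demo` and `grid_demo` are made up for the example, and the small PCA size is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic 4-class problem standing in for the OverFeat features (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=8,
                                     n_classes=4, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=100, random_state=0)

pipe_demo = Pipeline([('pca', PCA(n_components=10, random_state=0)),
                      ('svc', SVC(kernel='rbf'))])
grid_demo = GridSearchCV(pipe_demo, {'svc__C': [0.1, 1, 10]}, cv=3)
grid_demo.fit(X_a, y_a)

# Both calls score the same refit pipeline on the held-out split
acc_via_grid = grid_demo.score(X_b, y_b)
acc_via_best = grid_demo.best_estimator_.score(X_b, y_b)
print('{:.3f} {:.3f}'.format(acc_via_grid, acc_via_best))
```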
