# 04 Support Vector Machines

## Perform the following tasks

### Using the OverFeat features:


#### For the SVM classifier with a linear kernel:

- Create an SVM classifier with a linear kernel. Tune its C parameter.
- Tune the parameters using grid search with cross-validation. Use the stratified 5-fold strategy.
- Collect the results in a DataFrame with a column for the mean and the standard deviation of the accuracy values across all folds.

#### For the SVM classifier with an RBF kernel:

- Create an SVM classifier with an RBF kernel. Tune its C and γ parameters.
- The DataFrame for the RBF kernel will have an additional column for the γ values. 


#### For both:

- There are many ways to create an SVM classifier with Scikit-learn, e.g., LinearSVC, SVC or even SGDClassifier with hinge loss. Briefly explain in your code your choice of estimator in both cases.

- You might want to use PCA as a preprocessing step before your SVM estimators to improve the results, e.g., speed or accuracy. In any case, justify your choice in a comment or a markdown cell.

- In both cases, find (using code) the parameters that maximize the mean accuracy and print them.

- Finally, evaluate and report the accuracy of your (tuned) estimators on the 1,000 points from the test set.

## Load data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC, SVC

In [2]:
# Load the data from the .npz file
with np.load('cifar4-train.npz', allow_pickle=False) as npz_file:
    cifar4 = dict(npz_file.items())
# Overview of the data content    
print('Data keys {}'.format(cifar4.keys()))

Data keys dict_keys(['pixels', 'overfeat', 'labels', 'names', 'allow_pickle'])


In [3]:
# Generate the features matrices with pixels and overfeat
# Create X/y arrays
Xo = cifar4['overfeat']
y = cifar4['labels']

In [4]:
# Partition the data set to assess model performance and over/underfitting issues
# (train_test_split is already imported in the first cell)

# Split the data into train/test sets and ensure balanced classes (stratify)
Xo_train, Xo_test, y_train, y_test = train_test_split(Xo, y, train_size=4000, test_size=1000,
    random_state=0, stratify=y)
#print ('Size of Xo_train, y_train :', Xo_train.shape, y_train.shape)
#print ('Size of Xo_test, y_test :', Xo_test.shape, y_test.shape)

## SVM classifier with a linear kernel
Our data set has many features (4,096 OverFeat features per image), so the classifier may tend to overfit.
To limit this, PCA is used as a preprocessing step to reduce the dimensionality, and therefore the number of features.
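
One way to pick a starting number of components is to look at the cumulative explained variance. The sketch below is only an illustration on synthetic data: `X_demo` and `n_components_for` are hypothetical names, and the real OverFeat matrix would give different counts.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix (assumption: not the real OverFeat data)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 50)) * np.linspace(5.0, 0.1, 50)

# Fit a full PCA and accumulate the explained-variance ratios
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches each threshold
n_components_for = {}
for threshold in (0.90, 0.95):
    n_components_for[threshold] = int(np.searchsorted(cumulative, threshold) + 1)
    print('{:.0%} of variance: {} components'.format(threshold, n_components_for[threshold]))
```

The two counts can then bound the `pca__n_components` values explored by the grid search.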

In [16]:
# Create an SVM classifier with a linear kernel
linSVC = LinearSVC()
# Create a PCA object
# The data exploration stage showed that the first 162 and 387 components explain
# about 90% and 95% of the variance, so a value in that range (mean ~275) is a reasonable start
n_compo = 210  # adjusted after the grid search
pca = PCA(n_components=n_compo)

# Create a pipeline; the scaler step is commented out so both variants can be tested
pipe = Pipeline([
    #('scaler', StandardScaler()), # to test with/without
    ('pca', pca),
    ('linearsvc', linSVC)
])

### Grid search

In [17]:
# Define the grid of parameters and values to assess
# Define a set of C values (equivalent to np.logspace(-4, 1, num=6))
C_val = [0.0001, 0.001, 0.01, 0.1, 1, 10]
n_compo = np.arange(160, 260, 25)  # 160, 185, 210, 235

# Create the grid search object; with a classifier, an integer cv uses stratified k-fold
grid_cv = GridSearchCV(pipe, {'pca__n_components': n_compo,
                              'linearsvc__C': C_val,
                              },
                       cv=5,
                       n_jobs=-1,
                       return_train_score=True  # needed for the train score columns used below
                       )
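
The task asks for a stratified 5-fold strategy. For a classifier, `cv=5` already selects `StratifiedKFold`, but the splitter can also be passed explicitly, which additionally allows shuffling with a fixed seed. The sketch below uses hypothetical toy labels (`y_demo`) only to show that every validation fold keeps the class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 4 balanced classes, 25 samples each (assumption: stand-in for y_train)
y_demo = np.repeat([0, 1, 2, 3], 25)
X_demo = np.zeros((100, 1))  # feature values do not matter for the split itself

# Explicit stratified 5-fold splitter with reproducible shuffling
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the samples of each class in every validation fold
fold_counts = [np.bincount(y_demo[val_idx], minlength=4)
               for _, val_idx in skf.split(X_demo, y_demo)]
for counts in fold_counts:
    print(counts)
```

Such an object could be passed as `cv=skf` to `GridSearchCV` in place of the integer.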


In [18]:
# Fit estimator
grid_cv.fit(Xo_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=210, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('linearsvc', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'pca__n_components': array([160, 185, 210, 235]), 'linearsvc__C': [0.0001, 0.001, 0.01, 0.1, 1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [19]:
# Print the number of combinations
print('Number of combinations:', len(grid_cv.cv_results_['params']))

Number of combinations: 24


### Score dataframe & best results

In [20]:
# Overview of all the keys names
grid_cv.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_linearsvc__C', 'param_pca__n_components', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])

In [21]:
# Create a DataFrame with the desired results for the different values of C
df_linSVC_results = pd.DataFrame(grid_cv.cv_results_)[['param_linearsvc__C', 
                                                'mean_test_score', 
                                                'std_test_score',
                                                'mean_train_score',
                                                'std_train_score',
                                                'param_pca__n_components'
                                                      ]]

df_linSVC_results.sort_values(by='mean_test_score', ascending= False).head()

| | param_linearsvc__C | mean_test_score | std_test_score | mean_train_score | std_train_score | param_pca__n_components |
|---|---|---|---|---|---|---|
| 6 | 0.001 | 0.83875 | 0.015752 | 0.8815 | 0.00272 | 210 |
| 7 | 0.001 | 0.83625 | 0.013555 | 0.887063 | 0.00282 | 235 |
| 5 | 0.001 | 0.8345 | 0.01336 | 0.876375 | 0.003209 | 185 |
| 11 | 0.01 | 0.8345 | 0.013477 | 0.897813 | 0.003137 | 235 |
| 9 | 0.01 | 0.8335 | 0.012309 | 0.886063 | 0.002813 | 185 |


In [22]:
# Print out the best configuration and its cross-validation score
linSVC_idx = df_linSVC_results['mean_test_score'].idxmax()
print ('Linear SVM - top accuracy across folds: {:.3f} (std: {:.3f}) with C: {:.4f}'
       .format(df_linSVC_results.loc[linSVC_idx,'mean_test_score'],
               df_linSVC_results.loc[linSVC_idx,'std_test_score'],
               df_linSVC_results.loc[linSVC_idx,'param_linearsvc__C']))

# Note: this is the cross-validation accuracy, not the accuracy on the held-out test set
print ('Top accuracy across folds obtained with {} PCA components.'
       .format(df_linSVC_results.loc[linSVC_idx,'param_pca__n_components']))

Linear SVM - top accuracy across folds: 0.839 (std: 0.016) with C: 0.0010
Top accuracy across folds obtained with 210 PCA components.
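
The same information is also exposed directly by the fitted search object through `best_params_` and `best_score_`. The sketch below compares both routes on hypothetical synthetic data (`X_demo`, `y_demo`, `demo_grid` are illustrative names, not the CIFAR-4 objects).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in data (assumption: not the OverFeat features)
X_demo, y_demo = make_classification(n_samples=300, n_features=40, n_informative=10,
                                     n_classes=4, random_state=0)

demo_pipe = Pipeline([('pca', PCA(random_state=0)),
                      ('linearsvc', LinearSVC(max_iter=5000, random_state=0))])
demo_grid = GridSearchCV(demo_pipe,
                         {'pca__n_components': [10, 20],
                          'linearsvc__C': [0.001, 0.01, 0.1]},
                         cv=5)
demo_grid.fit(X_demo, y_demo)

# Route 1: attributes filled in by the search itself
print('Best parameters:', demo_grid.best_params_)
print('Best mean CV accuracy: {:.3f}'.format(demo_grid.best_score_))

# Route 2: the DataFrame lookup used in the cell above
demo_df = pd.DataFrame(demo_grid.cv_results_)
best_row = demo_df.loc[demo_df['mean_test_score'].idxmax()]
print('Best mean CV accuracy (DataFrame): {:.3f}'.format(best_row['mean_test_score']))
```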


## SVM classifier with an RBF kernel
As for the linear kernel, the large number of features means the classifier may tend to overfit.
To limit this, PCA is again used as a preprocessing step to reduce the dimensionality, and therefore the number of features.

In [12]:
# Create an SVM classifier with an RBF kernel
Svc_Rbf = SVC(kernel='rbf', random_state=0)

# Create a PCA object
# The data exploration stage showed that the first 162 and 387 components explain
# about 90% and 95% of the variance; n_components is left unset here because
# the grid search below sets it
pca = PCA()

# Create a pipeline; the scaler step is commented out so both variants can be tested
pipe = Pipeline([
    #('scaler', StandardScaler()), # to test with/without
    ('pca', pca),
    ('svc_rbf', Svc_Rbf) # SVM with an RBF kernel
])

### Grid search

In [13]:
# Define the grid of parameters and values to assess
# Define a set of C, gamma and component values
# (reduced from np.logspace(-4, 1, num=6) and np.logspace(-4, 0, num=5) to limit computation time)
C_val = [0.001, 0.01, 0.1]
gam_val = [0.0001, 0.001]
n_compo = np.arange(160, 260, 25)  # 160, 185, 210, 235

# Create the grid search object; with a classifier, an integer cv uses stratified k-fold
grid_cv2 = GridSearchCV(pipe, {'svc_rbf__C': C_val,
                               'svc_rbf__gamma': gam_val,
                               'pca__n_components': n_compo
                               },
                        cv=5, n_jobs=-1,
                        return_train_score=True)  # needed for the train score columns used below

In [14]:
# Fit estimator
grid_cv2.fit(Xo_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('svc_rbf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'svc_rbf__C': [0.001, 0.01, 0.1], 'svc_rbf__gamma': [0.0001, 0.001], 'pca__n_components': array([160, 185, 210, 235])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [7]:
# Print the number of combinations
print('Number of combinations:', len(grid_cv2.cv_results_['params']))

Number of combinations: 24


### Score dataframe & best results

In [8]:
# Overview of all the keys names
grid_cv2.cv_results_.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_pca__n_components', 'param_svc_rbf__C', 'param_svc_rbf__gamma', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])

In [9]:
# Create a DataFrame with the desired results for the different values of C and gamma
df_Svc_Rbf_results = pd.DataFrame(grid_cv2.cv_results_)[['param_svc_rbf__C',
                                                        'param_svc_rbf__gamma',
                                                        'mean_test_score', 
                                                        'std_test_score',
                                                        'mean_train_score',
                                                        'std_train_score',
                                                        'param_pca__n_components'
                                                       ]]

df_Svc_Rbf_results.sort_values(by='mean_test_score', ascending= False).head()

| | param_svc_rbf__C | param_svc_rbf__gamma | mean_test_score | std_test_score | mean_train_score | std_train_score | param_pca__n_components |
|---|---|---|---|---|---|---|---|
| 22 | 0.1 | 0.0001 | 0.78825 | 0.011308 | 0.805625 | 0.002476 | 235 |
| 4 | 0.1 | 0.0001 | 0.788 | 0.010914 | 0.80525 | 0.002196 | 160 |
| 16 | 0.1 | 0.0001 | 0.788 | 0.011253 | 0.805625 | 0.00233 | 210 |
| 10 | 0.1 | 0.0001 | 0.78725 | 0.010794 | 0.805125 | 0.002259 | 185 |
| 14 | 0.01 | 0.0001 | 0.69325 | 0.007608 | 0.697375 | 0.003116 | 210 |


In [10]:
# Print out the best configuration and its cross-validation score
Svc_Rbf_idx = df_Svc_Rbf_results['mean_test_score'].idxmax()
print ('RBF SVM - top accuracy across folds: {:.3f} (std: {:.3f}) with C: {:.4f} and gamma: {:.4f}'
       .format(df_Svc_Rbf_results.loc[Svc_Rbf_idx,'mean_test_score'],
               df_Svc_Rbf_results.loc[Svc_Rbf_idx,'std_test_score'],
               df_Svc_Rbf_results.loc[Svc_Rbf_idx,'param_svc_rbf__C'],
               df_Svc_Rbf_results.loc[Svc_Rbf_idx,'param_svc_rbf__gamma']))

# Note: this is the cross-validation accuracy, not the accuracy on the held-out test set
print ('Top accuracy across folds obtained with {} PCA components.'
       .format(df_Svc_Rbf_results.loc[Svc_Rbf_idx,'param_pca__n_components']))

RBF SVM - top accuracy across folds: 0.788 (std: 0.011) with C: 0.1000 and gamma: 0.0001
Top accuracy across folds obtained with 235 PCA components.


## Tuned estimators on the 1,000 points from the test set

In [23]:
# Evaluate the tuned models on the test set
linSVC_acc = grid_cv.score(Xo_test, y_test)
Svc_Rbf_acc = grid_cv2.score(Xo_test, y_test)

print ('Linear SVM accuracy (test set): {:.3f}'.format(linSVC_acc))
print ('RBF SVM accuracy (test set): {:.3f}'.format(Svc_Rbf_acc))

Linear SVM accuracy (test set): 0.810
RBF SVM accuracy (test set): 0.796


In [24]:
# Evaluate again through "best_estimator_" to see whether it changes anything
linSVC_acc = grid_cv.best_estimator_.score(Xo_test, y_test)
Svc_Rbf_acc = grid_cv2.best_estimator_.score(Xo_test, y_test)
# Same result: with refit=True (the default), grid_cv.score() already delegates to best_estimator_

print ('Linear SVM accuracy (test set): {:.3f}'.format(linSVC_acc))
print ('RBF SVM accuracy (test set): {:.3f}'.format(Svc_Rbf_acc))

Linear SVM accuracy (test set): 0.810
RBF SVM accuracy (test set): 0.796
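
The identical numbers in the two cells above are expected: with `refit=True` and the default scoring, `GridSearchCV.score` evaluates the refitted best estimator. A minimal check on hypothetical synthetic data (all names below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in data (assumption: not the OverFeat features)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=8,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

# refit=True is the default: the best configuration is refitted on all of X_tr
demo_grid = GridSearchCV(SVC(kernel='rbf'), {'C': [0.1, 1, 10]}, cv=5)
demo_grid.fit(X_tr, y_tr)

# Both calls score the same refitted estimator
score_via_grid = demo_grid.score(X_te, y_te)
score_via_best = demo_grid.best_estimator_.score(X_te, y_te)
print(score_via_grid == score_via_best)
```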


## Note:

For the linear classifier I chose the LinearSVC estimator because it uses the "liblinear" solver (based on a coordinate descent algorithm), which is faster than the "libsvm" solver used by SVC. Even with this choice, my CPU (8 cores) ran at 100% for several minutes.

Moreover, 4,000 rows with 4,096 features is a lot of data, so I guess SVC with a linear kernel would not handle it well.

I could not fine-tune the RBF SVM much: my CPU was struggling, so I had to reduce the number of configurations in the grid search. My PC has an Intel Core i7-6700 CPU @ 3.4 GHz, 16 GB of RAM and a 64-bit OS.

The RBF SVM grid search took around 9 min, with CPU usage at 100% and physical memory up to 40%. Any advice on how to speed up such computations?
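
One option that may reduce the cost is the `memory` argument of `Pipeline`, which caches fitted transformers so that the PCA for a given fold and `n_components` is not refitted for every C/gamma combination. The sketch below runs on hypothetical synthetic data; whether it helps on the real features has not been measured here, and it does not speed up the SVC fits themselves.

```python
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in data (assumption: not the OverFeat features)
X_demo, y_demo = make_classification(n_samples=300, n_features=40, n_informative=10,
                                     n_classes=4, random_state=0)

cache_dir = tempfile.mkdtemp()  # temporary directory holding the cached PCA fits
try:
    cached_pipe = Pipeline([('pca', PCA(random_state=0)),
                            ('svc_rbf', SVC(kernel='rbf'))],
                           memory=cache_dir)
    cached_grid = GridSearchCV(cached_pipe,
                               {'pca__n_components': [10, 20],
                                'svc_rbf__C': [0.1, 1],
                                'svc_rbf__gamma': [0.001, 0.01]},
                               cv=5)
    cached_grid.fit(X_demo, y_demo)
    cached_best = cached_grid.best_params_
    print(cached_best)
finally:
    shutil.rmtree(cache_dir)  # remove the cache once the search is done
```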

Could you please check whether a mistake in my code slowed down the calculation? I don't know whether it is recommended to pass the number of PCA components as a parameter to GridSearchCV when the pipe object is already built with a fixed number of components. I also tried starting with an empty PCA() object, but it did not help.
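
On the question of the fixed number of components: `GridSearchCV` clones the pipeline and sets each candidate value on the clone, so the value given when building the pipeline is only a placeholder for parameters listed in the grid. The sketch below illustrates this on hypothetical synthetic data (`demo_pipe` and `demo_grid` are illustrative names).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in data (assumption: not the OverFeat features)
X_demo, y_demo = make_classification(n_samples=300, n_features=40, n_informative=10,
                                     n_classes=4, random_state=0)

# The pipeline is built with n_components=2, a value that is NOT in the grid
demo_pipe = Pipeline([('pca', PCA(n_components=2)),
                      ('svc', SVC(kernel='rbf'))])
demo_grid = GridSearchCV(demo_pipe, {'pca__n_components': [5, 10]}, cv=5)
demo_grid.fit(X_demo, y_demo)

# The refitted best estimator uses a value from the grid ...
fitted_n = demo_grid.best_estimator_.named_steps['pca'].n_components_
# ... while the original pipeline object is left untouched
original_n = demo_pipe.named_steps['pca'].n_components
print('Grid value used:', fitted_n, '| original pipeline value:', original_n)
```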