
# Evaluating SVM on Multiple Datasets


---

In this lab you will explore several datasets, comparing SVM classifiers with logistic regression and kNN classifiers.

Your datasets folder has these three datasets to choose from for the lab:

**Breast cancer**

    resource-datasets/breast_cancer_wisconsin

**Spambase**

    resource-datasets/spam

**Car evaluation**

    resource-datasets/car_evaluation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

## A: Breast cancer data

### 1. Load and prepare the data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
- Determine the baseline for accuracy.
- Rescale the data.
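
The breast cancer data used below turns out to have no missing values, so the imputation step never runs in this lab. As a minimal sketch of what it could look like, assuming a hypothetical toy frame `toy` with a few gaps and median imputation as the chosen strategy:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries (not part of the lab datasets).
toy = pd.DataFrame({
    'radius': [14.2, np.nan, 12.5, 20.1],
    'texture': [19.0, 21.5, np.nan, 17.3],
})

# Count missing values per column.
print(toy.isnull().sum())

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy='median')
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```

When a train/test split is involved, the imputer would normally be fitted on the training part only and then applied to the test part, in the same way as the scaler in this lab.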

In [3]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X = pd.DataFrame(data.data,columns=data.feature_names)
y = pd.Series(data.target)

In [4]:
# check for missing values
X.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64

In [5]:
y.unique()

array([0, 1])

In [6]:
# baseline accuracy (share of the majority class) => 62.7%
y.value_counts(normalize=True)

1    0.627417
0    0.372583
dtype: float64
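
The baseline is the accuracy of always predicting the majority class. One way to obtain the same number through the scikit-learn API is a `DummyClassifier`; the sketch below reloads the data under the hypothetical names `X_demo` and `y_demo` so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

# Reload the data under separate names so the notebook's X and y stay untouched.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Any classifier in the sections below should beat this value to be considered useful.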

### 2. Build an SVM classifier on the data

For details on the SVM classifier, see [SVM-classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

- Initialize and train a linear SVM with the default settings. What is the average accuracy score with 5-fold cross validation?
- Repeat using a radial basis function (rbf) classifier. Compare the scores. Which one is better?
- Print the confusion matrix and classification report for your models.

- [Classification report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

- Confusion matrix:

 ```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
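
The first two tasks can be sketched as follows. This is one possible setup, not the solution used in the cells below: it assumes the scaler is placed inside a pipeline so that it is refitted on every cross-validation fold, and the names `X_demo`, `y_demo` and `cv_means` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

cv_means = {}
for kernel in ['linear', 'rbf']:
    # Scaling happens inside the pipeline, so each fold is scaled
    # using only its own training portion.
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
    cv_means[kernel] = scores.mean()
    print("{:>6}: mean 5-fold accuracy = {:.3f}".format(kernel, cv_means[kernel]))
```

Putting the scaler in the pipeline avoids leaking information from the validation folds into the scaling step; the cells below instead scale the training set once before cross-validating, which is simpler but slightly optimistic.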

In [7]:
# create train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=10)

In [8]:
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train),columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test),columns=X.columns)

In [9]:
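def get_accuracy(model, X_train, y_train, X_test, y_test, cv=5):
    """prints the mean cross-validated accuracy on the training set and the
    accuracy on the test set, returns the test set predictions"""
    # fit on the full training set (used for the test predictions below)
    model.fit(X_train,y_train)
    # cross_val_score refits clones of the model on each fold
    scores_train = cross_val_score(model, X_train, y_train, cv=cv)
    predictions_test = model.predict(X_test)
    sm = scores_train.mean()
    # the "training score" is the mean of the cv fold scores
    print("Average training score: {:0.3}".format(sm))
    print("Test score: {:0.3}".format(model.score(X_test, y_test)))
    return predictions_test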


def print_cm_cr(y_true, y_pred):
    """prints the confusion matrix and the classification report"""
    confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=[
                            'Predicted'], margins=True)
    print(confusion)
    print()
    print(metrics.classification_report(y_true, y_pred))

In [10]:
model_1 = LinearSVC(loss='hinge')
model_2 = SVC(kernel='linear')

In [11]:
model_1.fit(X_train,y_train)
model_1.score(X_train,y_train)

0.9868131868131869

In [12]:
model_2.fit(X_train,y_train)
model_2.score(X_train,y_train)

0.9868131868131869

In [13]:
model_lin = LinearSVC()
model_rbf = SVC(kernel='rbf')

In [14]:
predictions = get_accuracy(model_lin, X_train, y_train, X_test, y_test, cv=5);
print_cm_cr(y_test, predictions)

Average training score: 0.969
Test score: 0.956
Predicted   0   1  All
Actual                
0          40   2   42
1           3  69   72
All        43  71  114

             precision    recall  f1-score   support

          0       0.93      0.95      0.94        42
          1       0.97      0.96      0.97        72

avg / total       0.96      0.96      0.96       114



In [15]:
predictions = get_accuracy(model_rbf, X_train, y_train, X_test, y_test, cv=5);
print_cm_cr(y_test, predictions)

Average training score: 0.969
Test score: 0.965
Predicted   0   1  All
Actual                
0          39   3   42
1           1  71   72
All        40  74  114

             precision    recall  f1-score   support

          0       0.97      0.93      0.95        42
          1       0.96      0.99      0.97        72

avg / total       0.97      0.96      0.96       114



### 3. Tune the SVM classifiers with gridsearch

- Check in the documentation which parameters can be tuned in combination with different kernels.
- Create a further train-test split to obtain a hold-out validation set.
- Cross-validate scores.
- Examine confusion matrices and classification reports.
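
Different kernels accept different hyperparameters: `C` applies to all of them, `gamma` is used by the rbf kernel (and also by poly and sigmoid), and `degree` only by the poly kernel. A minimal sketch of searching over kernels with their own parameter sets, assuming a list of parameter dictionaries and hypothetical grid values and variable names:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Keep a hold-out set that the grid search never sees.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=10)

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# One dictionary per kernel, so only valid combinations are tried.
param_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [0.01, 0.1, 1, 10]},
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10],
     'svc__gamma': [0.001, 0.01, 0.1]},
    {'svc__kernel': ['poly'], 'svc__C': [0.1, 1, 10],
     'svc__degree': [2, 3]},
]

gs_kernels = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
gs_kernels.fit(X_tr, y_tr)

print(gs_kernels.best_params_)
print("CV accuracy: {:.3f}".format(gs_kernels.best_score_))

# Score the refitted best model on the hold-out set.
holdout_accuracy = gs_kernels.score(X_hold, y_hold)
print("Hold-out accuracy: {:.3f}".format(holdout_accuracy))
```

The grid values here are deliberately coarse to keep the search fast; a real search would probably use a finer, log-spaced grid.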

In [16]:
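def grid_search_func(estimator, params, X_train, y_train, X_test, y_test,
                     scoring_function=metrics.accuracy_score,
                     scoring='accuracy'):
    """Grid search `params` for `estimator` on the training data, then print
    the best cross-validated score, the best parameters and the test set
    performance of the refitted best estimator.

    `scoring` is the scorer name used inside the grid search,
    `scoring_function` the matching metric applied to the test predictions.
    """
    gs = GridSearchCV(
        estimator=estimator,
        param_grid=params,
        return_train_score=True,
        scoring=scoring)

    gs.fit(X_train, y_train)

    print("Best score")
    print(gs.best_score_)
    print()
    print("Best estimator")
    print(gs.best_estimator_.get_params())
    print()

    predictions = gs.best_estimator_.predict(X_test)
    print('Test score: ', scoring_function(y_test, predictions))
    print()
    print_cm_cr(y_test, predictions)

    return gs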
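
Because the function returns the fitted `GridSearchCV` object, the full search history stays available through its `cv_results_` attribute. A minimal sketch of inspecting it, assuming a small hypothetical search `gs_demo` over `C` for an rbf SVM:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
gs_demo = GridSearchCV(
    pipe,
    param_grid={'svc__C': [0.1, 1, 10]},
    cv=5,
    return_train_score=True)
gs_demo.fit(X_demo, y_demo)

# One row per parameter combination.
results = pd.DataFrame(gs_demo.cv_results_)
cols = ['param_svc__C', 'mean_train_score',
        'mean_test_score', 'rank_test_score']
ranked = results[cols].sort_values('rank_test_score')
print(ranked)
```

A large gap between `mean_train_score` and `mean_test_score` for a given setting is a hint that this setting overfits.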

In [17]:
params_lin = {'C': np.logspace(-10, 10, 21),
              'fit_intercept': [True, False]}
gs_lin = grid_search_func(model_lin, params_lin,
                          X_train, y_train, X_test, y_test,
                          scoring_function=metrics.accuracy_score,
                          scoring='accuracy')

Best score
0.9802197802197802

Best estimator
{'C': 0.01, 'class_weight': None, 'dual': True, 'fit_intercept': False, 'intercept_scaling': 1, 'loss': 'squared_hinge', 'max_iter': 1000, 'multi_class': 'ovr', 'penalty': 'l2', 'random_state': None, 'tol': 0.0001, 'verbose': 0}

Test score:  0.9824561403508771

Predicted   0   1  All
Actual                
0          42   0   42
1           2  70   72
All        44  70  114

             precision    recall  f1-score   support

          0       0.95      1.00      0.98        42
          1       1.00      0.97      0.99        72

avg / total       0.98      0.98      0.98       114



In [18]:
params_rbf = {'C': np.logspace(-10, 10, 21),
          'gamma':np.linspace(0.01,2,20)}
gs_rbf = grid_search_func(model_rbf,params_rbf,X_train,y_train,X_test,y_test)

Best score
0.967032967032967

Best estimator
{'C': 1.0, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.01, 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}

Test score:  0.956140350877193

Predicted   0   1  All
Actual                
0          38   4   42
1           1  71   72
All        39  75  114

             precision    recall  f1-score   support

          0       0.97      0.90      0.94        42
          1       0.95      0.99      0.97        72

avg / total       0.96      0.96      0.96       114



### 4. Compare kNN and logistic regression on the dataset.


- Grid search for optimal parameters.
- Cross-validate scores.
- Examine confusion matrices and classification reports.

#### kNN

In [19]:
model_knn = KNeighborsClassifier()

params_knn = {
    'n_neighbors':list(range(1,21,3)),
    'weights':['distance','uniform']
}

gs_knn = grid_search_func(model_knn,params_knn,X_train,y_train,X_test,y_test)

Best score
0.9648351648351648

Best estimator
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': 1, 'n_neighbors': 4, 'p': 2, 'weights': 'uniform'}

Test score:  0.9824561403508771

Predicted   0   1  All
Actual                
0          41   1   42
1           1  71   72
All        42  72  114

             precision    recall  f1-score   support

          0       0.98      0.98      0.98        42
          1       0.99      0.99      0.99        72

avg / total       0.98      0.98      0.98       114



#### Logistic regression

In [20]:
model_lr = LogisticRegression()
params_lr = {
    'penalty':['l1','l2'],
    'C':np.logspace(-4, 2, 20),
    'solver':['liblinear']
}
gs_lr = grid_search_func(model_lr,params_lr,X_train,y_train,X_test,y_test)

Best score
0.9758241758241758

Best estimator
{'C': 0.06951927961775606, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': 1, 'penalty': 'l2', 'random_state': None, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}

Test score:  0.9824561403508771

Predicted   0   1  All
Actual                
0          41   1   42
1           1  71   72
All        42  72  114

             precision    recall  f1-score   support

          0       0.98      0.98      0.98        42
          1       0.99      0.99      0.99        72

avg / total       0.98      0.98      0.98       114



### 5. Consider different scores in the gridsearch

In [21]:
gs_lin_pr = grid_search_func(model_lin, params_lin,
                          X_train, y_train, X_test, y_test,
                          scoring_function=metrics.precision_score,
                          scoring='precision')

Best score
0.9791781371549413

Best estimator
{'C': 0.01, 'class_weight': None, 'dual': True, 'fit_intercept': False, 'intercept_scaling': 1, 'loss': 'squared_hinge', 'max_iter': 1000, 'multi_class': 'ovr', 'penalty': 'l2', 'random_state': None, 'tol': 0.0001, 'verbose': 0}

Test score:  1.0

Predicted   0   1  All
Actual                
0          42   0   42
1           2  70   72
All        44  70  114

             precision    recall  f1-score   support

          0       0.95      1.00      0.98        42
          1       1.00      0.97      0.99        72

avg / total       0.98      0.98      0.98       114
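
The search above optimizes a single metric. `GridSearchCV` can also track several metrics in one run when `scoring` is a list and `refit` names the metric used to pick the final model. The sketch below is one way to set this up; the pipeline, the grid values and the names `gs_multi` and `summary` are assumptions for illustration. In scikit-learn's copy of this dataset the label 1 stands for benign, so precision and recall here refer to the benign class.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel='linear'))
gs_multi = GridSearchCV(
    pipe,
    param_grid={'svc__C': [0.01, 0.1, 1, 10]},
    scoring=['accuracy', 'precision', 'recall'],
    refit='recall',  # metric that selects the best parameters
    cv=5)
gs_multi.fit(X_demo, y_demo)

# Each metric gets its own mean_test_<metric> column.
results = pd.DataFrame(gs_multi.cv_results_)
summary = results[['param_svc__C', 'mean_test_accuracy',
                   'mean_test_precision', 'mean_test_recall']]
print(summary)
print("Best C by recall:", gs_multi.best_params_['svc__C'])
```

Comparing the columns side by side shows whether the setting that maximizes one metric also does well on the others.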



In [22]:
gs_rbf_pr = grid_search_func(model_rbf, params_rbf,
                          X_train, y_train, X_test, y_test,
                          scoring_function=metrics.precision_score,
                          scoring='precision')



## B: Car data

### 1. Load and prepare the data

In [23]:
car = pd.read_csv('../../../../../resource-datasets/car_evaluation/car.csv')

In [24]:
car.head(3)

|   | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |


In [25]:
car.buying.unique()

array(['vhigh', 'high', 'med', 'low'], dtype=object)

In [26]:
car.maint.unique()

array(['vhigh', 'high', 'med', 'low'], dtype=object)

In [27]:
car.lug_boot.unique()

array(['small', 'med', 'big'], dtype=object)

In [28]:
car.safety.unique()

array(['low', 'med', 'high'], dtype=object)

In [29]:
car.acceptability.unique()

array(['unacc', 'acc', 'vgood', 'good'], dtype=object)

In [30]:
# any na?
car.isnull().sum()

buying           0
maint            0
doors            0
persons          0
lug_boot         0
safety           0
acceptability    0
dtype: int64

In [31]:
y = car.acceptability.map(lambda x: 1 if x in ['vgood','good'] else 0)

In [32]:
categorical = [col for col in car.columns if col!='acceptability']

In [33]:
X = pd.get_dummies(car,columns=categorical,drop_first=True)
X.drop('acceptability',inplace=True,axis=1)
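
With `drop_first=True`, `pd.get_dummies` drops the first level of each column in sorted order, and that level becomes the implicit baseline. A small sketch on a hypothetical toy frame `toy` with two of the car features:

```python
import pandas as pd

# Hypothetical toy rows using the same category labels as the car data.
toy = pd.DataFrame({
    'safety': ['low', 'med', 'high', 'med'],
    'lug_boot': ['small', 'big', 'med', 'small'],
})

encoded = pd.get_dummies(toy, columns=['safety', 'lug_boot'], drop_first=True)

# 'high' and 'big' sort first, so they are dropped and act as baselines.
print(encoded.columns.tolist())
print(encoded)
```

A row with `safety == 'high'` is therefore encoded as zeros in both `safety_low` and `safety_med`.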

In [34]:
X.head()

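|   | buying_low | buying_med | buying_vhigh | maint_low | maint_med | maint_vhigh | doors_3 | doors_4 | doors_5more | persons_4 | persons_more | lug_boot_med | lug_boot_small | safety_low | safety_med |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |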


In [35]:
y.value_counts(normalize=True)
# baseline is 92.2%

0    0.922454
1    0.077546
Name: acceptability, dtype: float64

### 2. Build an SVM classifier

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=10)

In [37]:
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train),columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test),columns=X.columns)

In [38]:
predictions = get_accuracy(model_lin, X_train, y_train, X_test, y_test, cv=5);
print_cm_cr(y_test, predictions)

Average training score: 0.986
Test score: 0.986
Predicted    0   1  All
Actual                 
0          317   2  319
1            3  24   27
All        320  26  346

             precision    recall  f1-score   support

          0       0.99      0.99      0.99       319
          1       0.92      0.89      0.91        27

avg / total       0.99      0.99      0.99       346



In [39]:
predictions = get_accuracy(model_rbf, X_train, y_train, X_test, y_test, cv=5);
print_cm_cr(y_test, predictions)

Average training score: 0.97
Test score: 0.968
Predicted    0   1  All
Actual                 
0          317   2  319
1            9  18   27
All        326  20  346

             precision    recall  f1-score   support

          0       0.97      0.99      0.98       319
          1       0.90      0.67      0.77        27

avg / total       0.97      0.97      0.97       346
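
With a baseline of 92.2%, accuracy alone says little about the minority class. The class 1 figures in the report can be recomputed by hand from the confusion matrix of the RBF model; the sketch below only restates the counts printed above.

```python
# Counts from the RBF confusion matrix (rows = actual, columns = predicted).
tn, fp = 317, 2   # actual 0: predicted 0, predicted 1
fn, tp = 9, 18    # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)            # 18 / 20
recall = tp / (tp + fn)               # 18 / 27
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 335 / 346

print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1-score:  {:.2f}".format(f1))
print("accuracy:  {:.3f}".format(accuracy))
```

The model misses a third of the acceptable cars (recall 0.67) even though its accuracy of 0.968 looks strong next to the baseline.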



### 3. Grid search SVM

In [40]:
gs_lin = grid_search_func(model_lin,params_lin,X_train,y_train,X_test,y_test)

Best score
0.9898697539797395

Best estimator
{'C': 1.0, 'class_weight': None, 'dual': True, 'fit_intercept': True, 'intercept_scaling': 1, 'loss': 'squared_hinge', 'max_iter': 1000, 'multi_class': 'ovr', 'penalty': 'l2', 'random_state': None, 'tol': 0.0001, 'verbose': 0}

Test score:  0.9855491329479769

Predicted    0   1  All
Actual                 
0          317   2  319
1            3  24   27
All        320  26  346

             precision    recall  f1-score   support

          0       0.99      0.99      0.99       319
          1       0.92      0.89      0.91        27

avg / total       0.99      0.99      0.99       346



In [41]:
gs_rbf = grid_search_func(model_rbf,params_rbf,X_train,y_train,X_test,y_test)

Best score
0.9985528219971056

Best estimator
{'C': 1000.0, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.01, 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}

Test score:  0.9971098265895953

Predicted    0   1  All
Actual                 
0          319   0  319
1            1  26   27
All        320  26  346

             precision    recall  f1-score   support

          0       1.00      1.00      1.00       319
          1       1.00      0.96      0.98        27

avg / total       1.00      1.00      1.00       346



### 4. Compare with kNN and logistic regression

#### kNN 

In [42]:
gs_knn = grid_search_func(model_knn,params_knn,X_train,y_train,X_test,y_test)

Best score
0.9319826338639653

Best estimator
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': 1, 'n_neighbors': 7, 'p': 2, 'weights': 'distance'}

Test score:  0.9335260115606936

Predicted    0  1  All
Actual                
0          318  1  319
1           22  5   27
All        340  6  346

             precision    recall  f1-score   support

          0       0.94      1.00      0.97       319
          1       0.83      0.19      0.30        27

avg / total       0.93      0.93      0.91       346



#### Logistic regression

In [43]:
gs_lr = grid_search_func(model_lr,params_lr,X_train,y_train,X_test,y_test)

Best score
0.9905933429811867

Best estimator
{'C': 1.2742749857031321, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': 1, 'penalty': 'l1', 'random_state': None, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}

Test score:  0.9855491329479769

Predicted    0   1  All
Actual                 
0          318   1  319
1            4  23   27
All        322  24  346

             precision    recall  f1-score   support

          0       0.99      1.00      0.99       319
          1       0.96      0.85      0.90        27

avg / total       0.99      0.99      0.99       346



## C: Spam data

### 1. Load and prepare the data

In [44]:
spam = pd.read_csv('../../../../../resource-datasets/spam/spambase.csv')
spam.head()

|   | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [45]:
print(spam.shape)
print(1-spam['class'].mean())

(4601, 58)
0.6059552271245381


In [46]:
y = spam['class']
X = spam.iloc[:,:-1]

### 2. Build an SVM classifier

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)

In [48]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [49]:
predictions = get_accuracy(model_lin, X_train, y_train, X_test, y_test, cv=5);
print_cm_cr(y_test, predictions)

Average training score: 0.925
Test score: 0.916
Predicted    0    1  All
Actual                  
0          523   35  558
1           42  321  363
All        565  356  921

             precision    recall  f1-score   support

          0       0.93      0.94      0.93       558
          1       0.90      0.88      0.89       363

avg / total       0.92      0.92      0.92       921



In [50]:
predictions = get_accuracy(model_rbf, X_train, y_train, X_test, y_test, cv=5);
print_cm_cr(y_test, predictions)

Average training score: 0.93
Test score: 0.934
Predicted    0    1  All
Actual                  
0          537   21  558
1           40  323  363
All        577  344  921

             precision    recall  f1-score   support

          0       0.93      0.96      0.95       558
          1       0.94      0.89      0.91       363

avg / total       0.93      0.93      0.93       921



### 3. Grid search SVM

In [51]:
gs_lin = grid_search_func(model_lin,params_lin,X_train,y_train,X_test,y_test)

Best score
0.9239130434782609

Best estimator
{'C': 10.0, 'class_weight': None, 'dual': True, 'fit_intercept': True, 'intercept_scaling': 1, 'loss': 'squared_hinge', 'max_iter': 1000, 'multi_class': 'ovr', 'penalty': 'l2', 'random_state': None, 'tol': 0.0001, 'verbose': 0}

Test score:  0.9196525515743756

Predicted    0    1  All
Actual                  
0          524   34  558
1           40  323  363
All        564  357  921

             precision    recall  f1-score   support

          0       0.93      0.94      0.93       558
          1       0.90      0.89      0.90       363

avg / total       0.92      0.92      0.92       921



In [52]:
gs_rbf = grid_search_func(model_rbf,params_rbf,X_train,y_train,X_test,y_test)

Best score
0.9290760869565218

Best estimator
{'C': 10.0, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.01, 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}

Test score:  0.9348534201954397

Predicted    0    1  All
Actual                  
0          536   22  558
1           38  325  363
All        574  347  921

             precision    recall  f1-score   support

          0       0.93      0.96      0.95       558
          1       0.94      0.90      0.92       363

avg / total       0.93      0.93      0.93       921



### 4. Compare to kNN and logistic regression

#### kNN

In [53]:
model_knn = KNeighborsClassifier()

params_knn = {
    'n_neighbors':list(range(1,21,3)),
    'weights':['distance','uniform']
}

gs_knn = grid_search_func(model_knn,params_knn,X_train,y_train,X_test,y_test)

Best score
0.9141304347826087

Best estimator
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': 1, 'n_neighbors': 10, 'p': 2, 'weights': 'distance'}

Test score:  0.9131378935939196

Predicted    0    1  All
Actual                  
0          526   32  558
1           48  315  363
All        574  347  921

             precision    recall  f1-score   support

          0       0.92      0.94      0.93       558
          1       0.91      0.87      0.89       363

avg / total       0.91      0.91      0.91       921



#### Logistic regression

In [None]:
model_lr = LogisticRegression()
params_lr = {
    'penalty':['l1','l2'],
    'C':np.logspace(-4, 2, 20),
    'solver':['liblinear']
}
gs_lr = grid_search_func(model_lr,params_lr,X_train,y_train,X_test,y_test)