<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating SVM on Multiple Datasets


---

In this lab you will explore several datasets, comparing SVM classifiers with logistic regression and kNN classifiers.

The lab uses these three datasets. Breast cancer ships with scikit-learn; the other two are in your datasets folder:

**Breast cancer**

    from sklearn.datasets import load_breast_cancer
    
**Spambase**

    resource-datasets/spam

**Car evaluation**

    resource-datasets/car_evaluation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

## A: Breast cancer data

### 1. Load and prepare the data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
- Determine the baseline for accuracy.
- Rescale the data.

In [3]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X = pd.DataFrame(data.data,columns=data.feature_names)
y = data.target

In [4]:
X.shape

(569, 30)
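The preparation steps listed above can be sketched as follows. This is a minimal sketch rather than a reference solution; the names `n_missing`, `baseline`, and `X_std` are illustrative and not part of the lab.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# Count missing values across all predictors.
n_missing = int(X.isnull().sum().sum())

# Baseline accuracy: the share of the most frequent class,
# i.e. the accuracy of always predicting that class.
baseline = y.value_counts(normalize=True).max()

# Rescale every predictor to zero mean and unit variance.
scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

print(f"missing values: {n_missing}")
print(f"baseline accuracy: {baseline:.3f}")
```

Standardizing the full dataset up front is fine for exploration, but within cross-validation the scaler is better fitted on the training folds only, for example by placing it in a pipeline.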

### 2. Build an SVM classifier on the data

For details on the SVM classifier, see [SVM-classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

- Initialize and train a linear SVM with the default settings. What is the mean accuracy under 5-fold cross-validation?
- Repeat with a radial basis function (RBF) kernel. Compare the scores. Which kernel performs better?
- Print the confusion matrix and classification report for your models.

- [Classification report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

- Confusion matrix:

 ```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
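One possible way to work through this step is sketched below. It assumes the scaler is wrapped in a pipeline so that each cross-validation fold is standardized using its own training data; the `results` dictionary is an illustrative name, not something the lab prescribes.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y_true = load_breast_cancer(return_X_y=True)

results = {}
for kernel in ["linear", "rbf"]:
    # Default SVC settings apart from the kernel.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))

    # Mean accuracy over 5 folds.
    scores = cross_val_score(model, X, y_true, cv=5)
    results[kernel] = scores.mean()

    # Out-of-fold predictions for the confusion matrix and report.
    y_pred = cross_val_predict(model, X, y_true, cv=5)

    print(f"{kernel}: mean CV accuracy = {scores.mean():.3f}")
    print(pd.crosstab(y_true, y_pred, rownames=['Actual'],
                      colnames=['Predicted'], margins=True))
    print(classification_report(y_true, y_pred))
```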

### 3. Tune the SVM classifiers with grid search

- Check in the documentation which parameters can be tuned in combination with different kernels.
- Create a further train-test split to obtain a hold-out validation set.
- Cross-validate scores.
- Examine confusion matrices and classification reports.
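A rough sketch of these tuning steps follows. The parameter values, the 80/20 split, and `random_state=1` are arbitrary assumptions chosen for illustration. The grid is a list of dictionaries so that each kernel is only combined with parameters it actually uses (`gamma` for RBF, `degree` for polynomial).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Keep a hold-out set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# One dictionary per kernel, each with its own relevant parameters.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.01, 0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": ["scale", 0.001, 0.01, 0.1]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10],
     "svc__degree": [2, 3]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")

# Evaluate the refitted best model on the hold-out set.
holdout_accuracy = search.score(X_test, y_test)
y_pred = search.predict(X_test)
print(f"hold-out accuracy: {holdout_accuracy:.3f}")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```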

### 4. Compare kNN and logistic regression on the dataset

- Grid search the optimal parameters.
- Cross-validate scores.
- Examine confusion matrices and classification reports.
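The comparison could be set up along the following lines. The parameter grids, the split settings, and the `max_iter=5000` value are assumptions picked for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Each entry pairs an estimator with the grid to search over.
candidates = {
    "knn": (KNeighborsClassifier(),
            {"clf__n_neighbors": [3, 5, 7, 11, 15],
             "clf__weights": ["uniform", "distance"]}),
    "logreg": (LogisticRegression(max_iter=5000),
               {"clf__C": [0.01, 0.1, 1, 10, 100]}),
}

holdout_scores = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X_train, y_train)

    y_pred = search.predict(X_test)
    holdout_scores[name] = search.score(X_test, y_test)

    print(name, search.best_params_)
    print(f"  CV accuracy: {search.best_score_:.3f}")
    print(f"  hold-out accuracy: {holdout_scores[name]:.3f}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```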

### 5. Bonus: Consider different scoring metrics in the grid search
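As one way to approach the bonus, `GridSearchCV` accepts several metrics at once through its `scoring` argument. The sketch below is an assumption about what might be useful here: in `load_breast_cancer` the label 0 is malignant, so recall on that class is built with `make_scorer` and used to pick the final model. The key `recall_malignant` and the grid values are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Several metrics are tracked; pos_label=0 targets the malignant class.
scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "recall_malignant": make_scorer(recall_score, pos_label=0),
}

search = GridSearchCV(
    pipe,
    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
    scoring=scoring,
    refit="recall_malignant",  # metric used to select the best model
    cv=5,
)
search.fit(X, y)

# cv_results_ holds one mean_test_<name> column per metric.
metric_cols = ["mean_test_accuracy", "mean_test_roc_auc",
               "mean_test_recall_malignant"]
cv_table = pd.DataFrame(search.cv_results_)[["params"] + metric_cols]
print(cv_table)
print("selected by malignant recall:", search.best_params_)
```

Comparing the columns shows whether the parameter setting that maximizes accuracy is also the one that misses the fewest malignant cases.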

## B: Car data

- Repeat the steps from part A on the car evaluation data.

### 1. Load and prepare the data

In [6]:
car = pd.read_csv('../../../../resource-datasets/car_evaluation/car.csv')
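If the car evaluation predictors are categorical, they need to be encoded before an SVM can use them. The sketch below shows one-hot encoding inside a pipeline on a tiny made-up frame; the column names (`buying`, `safety`, `acceptability`) and all values are hypothetical stand-ins, so check `car.columns` and `car.dtypes` for the real ones.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data standing in for the real car.csv contents.
toy_car = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low", "low", "med"],
    "safety": ["low", "med", "high", "high", "low", "high"],
    "acceptability": ["unacc", "unacc", "acc", "acc", "unacc", "acc"],
})

X_car = toy_car.drop(columns="acceptability")
y_car = toy_car["acceptability"]

# Baseline accuracy: share of the most frequent class.
baseline = y_car.value_counts(normalize=True).max()

# One-hot encode every categorical predictor, then fit a linear SVM.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      SVC(kernel="linear"))
model.fit(X_car, y_car)

encoded = model.named_steps["onehotencoder"].transform(X_car)
predictions = model.predict(X_car)
print("encoded shape:", encoded.shape)
print("baseline accuracy:", baseline)
```

Because the encoder sits inside the pipeline, the same object can be passed to `cross_val_score` or `GridSearchCV` unchanged.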

### 2. Build an SVM classifier

### 3. Grid search SVM

### 4. Compare with kNN and logistic regression

## C: Spam data

- Repeat the steps from part A on the spam data.

### 1. Load and prepare the data

In [7]:
spam = pd.read_csv('../../../../resource-datasets/spam/spambase.csv')
spam.head()

| Unnamed: 0 | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
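The `Unnamed: 0` column in the preview looks like a saved row index rather than a feature, so it is probably worth dropping before modelling. The sketch below illustrates that on a tiny frame whose values are made up for illustration; only the column names are taken from the preview.

```python
import pandas as pd

# Toy stand-in for the spam frame; the values are invented.
toy_spam = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "word_freq_make": [0.0, 0.21, 0.0, 0.0],
    "char_freq_!": [0.778, 0.372, 0.0, 0.0],
    "class": [1, 1, 0, 0],
})

# Drop the leftover index column so it is not used as a predictor.
# (Passing index_col=0 to pd.read_csv is an alternative.)
toy_spam = toy_spam.drop(columns="Unnamed: 0")

# Separate the target from the predictors.
X_spam = toy_spam.drop(columns="class")
y_spam = toy_spam["class"]

# Baseline accuracy: share of the most frequent class.
spam_baseline = y_spam.value_counts(normalize=True).max()
print(list(X_spam.columns), spam_baseline)
```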


### 2. Build an SVM classifier

### 3. Grid search SVM

### 4. Compare with kNN and logistic regression