# Support Vector Machines Lab

In this lab we will explore several datasets with SVMs. The assets folder contains several datasets (in order of complexity):

1. Breast cancer
2. Spambase
3. Car evaluation
4. Mushroom

For each of these, a `.names` file is provided with details on the origin of the data.

In [59]:
import pandas as pd
import numpy as np

# Exercise 1: Breast Cancer



## 1.a: Load the Data
Use `pandas.read_csv` to load the data and assess the following:
- Are there any missing values? (How are they encoded? Do we impute them?)
- Are the features categorical or numerical?
- Are the values normalized?
- How many classes are there in the target?

Perform what's necessary to get to a point where you have a feature matrix `X` and a target vector `y`, both with only numerical entries.

In [60]:
bc = pd.read_csv('../../assets/datasets/breast_cancer.csv')
bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null object
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


In [61]:
# Bare_Nuclei was read as `object`: list the entries that are not integers
for val in bc.Bare_Nuclei:
    try:
        int(val)
    except ValueError:
        print("Cannot cast %s" % val)

Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?
Cannot cast ?


In [62]:
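# Drop the 16 rows where Bare_Nuclei is '?', then cast the column to int.
# .copy() avoids assigning into a view of the original frame.
bc = bc[bc.Bare_Nuclei != '?'].copy()
bc['Bare_Nuclei'] = bc.Bare_Nuclei.astype(int)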
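
The lab drops the incomplete rows. As a hedged alternative for the "do we impute them?" question, the sketch below parses `?` as missing at load time and compares dropping with median imputation. The tiny inline CSV is a made-up stand-in for the real file; only the column names come from the lab.

```python
import io
import pandas as pd

# Made-up stand-in for breast_cancer.csv (values are illustrative only).
csv_text = "Bare_Nuclei,Class\n1,2\n?,2\n10,4\n3,4\n"

# Treat '?' as missing so the column is parsed as numeric directly.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
n_missing = int(df["Bare_Nuclei"].isna().sum())

# Option 1: drop incomplete rows (the approach taken in this lab).
dropped = df.dropna(subset=["Bare_Nuclei"])

# Option 2: keep every row and fill the gap with the column median.
imputed = df.fillna({"Bare_Nuclei": df["Bare_Nuclei"].median()})

print(n_missing, len(dropped), imputed["Bare_Nuclei"].tolist())
```

With only 16 of 699 rows affected, dropping costs little here; imputation becomes more attractive as the share of missing values grows.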

In [63]:
bc.describe()

| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 |
| mean | 1076720.0 | 4.442167 | 3.150805 | 3.215227 | 2.830161 | 3.234261 | 3.544656 | 3.445095 | 2.869693 | 1.603221 | 2.699854 |
| std | 620644.0 | 2.820761 | 3.065145 | 2.988581 | 2.864562 | 2.223085 | 3.643857 | 2.449697 | 3.052666 | 1.732674 | 0.954592 |
| min | 63375.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 877617.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171795.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238705.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 6.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


In [64]:
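# Encode the target: 2 (benign) -> 0, 4 (malignant) -> 1
y = bc.Class.map({2: 0, 4: 1})
# Features: every column except the sample ID (first) and the class (last).
# .iloc replaces the removed .ix indexer.
X = bc.iloc[:, 1:-1]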

## 1.b: Model Building

- What's the baseline for the accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 3-fold cross-validation?
- Repeat using an RBF kernel. Compare the scores. Which one is better?
- Are your features normalized? If not, try normalizing and repeat the test. Does the score improve?
- What's the best model?
- Print a confusion matrix and classification report for your best model using:
        train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

**Check:** To decide which model is best, look at the average cross-validation score. Are the scores significantly different from one another?

In [67]:
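# Baseline: the mean of the 0/1 target is the fraction of malignant samples.
# Always predicting the majority class (benign) gives an accuracy of 1 - mean.
y.mean()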

0.34992679355783307

In [71]:
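from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split

# cross_val_score clones and refits the model on each fold,
# so a separate fit on the full data is not needed here
model = SVC(kernel='linear')
np.mean(cross_val_score(model, X, y, cv=3))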

0.96489295927042285

In [72]:
model = SVC(kernel='rbf')
model.fit(X,y)
np.mean(cross_val_score(model,X,y,cv=3))

0.95758301774995491
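
The normalization question above is not answered by the cells in this section. One way to test it, sketched here on synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the lab's data), is to put a `StandardScaler` and the SVM in a pipeline so the scaler is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 9 features, like the breast cancer feature matrix.
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    # Scaling inside the pipeline keeps test-fold statistics out of training.
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(pipe, X_demo, y_demo, cv=3)
    print(kernel, scores[kernel].mean(), scores[kernel].std())
```

Comparing the per-fold standard deviation with the gap between the two means gives a rough sense of whether a difference between kernels is meaningful.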

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

In [78]:
model = SVC(kernel='linear')
model.fit(X_train,y_train)
y_pred=model.predict(X_test)

from sklearn.metrics import confusion_matrix
conmat = np.array(confusion_matrix(y_test, y_pred))
confusion = pd.DataFrame(conmat, index=['is_benign', 'is_malignant'],columns=['predicted_benign', 'predicted_malignant'])
confusion

| | predicted_benign | predicted_malignant |
|---|---|---|
| is_benign | 142 | 5 |
| is_malignant | 3 | 76 |


In [79]:
from sklearn.metrics import classification_report
cls_rep = classification_report(y_test, y_pred)
print(cls_rep)

             precision    recall  f1-score   support

          0       0.98      0.97      0.97       147
          1       0.94      0.96      0.95        79

avg / total       0.96      0.96      0.96       226



**Check:** Are there more false positives or false negatives? Is this good or bad?
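
As a small worked example for this check, the four cells of the confusion matrix printed above can be unpacked directly. The numbers are copied from that output; treating malignant as the positive class is an assumption that follows the 0/1 encoding used earlier.

```python
import numpy as np

# Rows are true classes (benign, malignant); columns are predicted classes.
cm = np.array([[142, 5],
               [3, 76]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

recall_malignant = tp / (tp + fn)      # share of malignant cases that were caught
precision_malignant = tp / (tp + fp)   # share of malignant predictions that were right

print("false positives:", fp, "false negatives:", fn)
print("recall: %.2f, precision: %.2f" % (recall_malignant, precision_malignant))
```

A false negative here is a malignant sample predicted as benign, which is usually the costlier mistake in a screening setting.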

## 1.c: Feature Selection

Use any of the strategies offered by `sklearn` to select the 5 most important features.

Repeat the cross-validation with only those 5 features. Does the score change?

In [93]:
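from sklearn.feature_selection import SelectKBest

# Column positions of the 5 highest-scoring features (default score: ANOVA F-value)
mask = SelectKBest(k=5).fit(X, y).get_support(indices=True)
# The mask holds integer positions, so select columns with .iloc
X_best = X.iloc[:, mask]

model = SVC(kernel='linear')
model.fit(X_best, y)
np.mean(cross_val_score(model, X_best, y, cv=3))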

0.96050699435814213

## 1.d: Learning Curves

Learning curves are useful to study the behavior of training and test errors as a function of the number of datapoints available.

- Plot learning curves for train sizes between 10% and 100% (use StratifiedKFold with 5 folds as cross validation)
- What can you say about the dataset? Do you need more data or a better model?

In [132]:
# learning_curve lives in sklearn.model_selection (sklearn.learning_curve was removed)
from sklearn.model_selection import learning_curve, StratifiedKFold
from bokeh.plotting import figure, output_notebook, show
output_notebook()

def do_learning_curve(model, X, y, sizes=np.linspace(0.1, 1.0, 10), y_range=(0.9, 1)):
    # Train and validation scores at each training-set size, with stratified 5-fold CV
    sizes, tr_scores, te_scores = learning_curve(model,
                                                 X,
                                                 y,
                                                 train_sizes=sizes,
                                                 cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                                                 n_jobs=-1)
    # Create our base figure
    p = figure(title='Learning Curve', y_range=y_range)

    # Create our Training score line (mean over the folds)
    p.line(x=sizes,
           y=tr_scores.mean(axis=1),
           color='red',
           legend_label="Train Scores")

    # Create our Testing score line (mean over the folds)
    p.line(x=sizes,
           y=te_scores.mean(axis=1),
           color='blue',
           legend_label="Test Scores")

    # Move our legend around
    p.legend.location = "top_right"

    # Render the plot
    show(p)
    
do_learning_curve(model, X, y)

## 1.e: Grid Search

Use `GridSearchCV` (from `sklearn.model_selection`) to explore different kernels and values for the C parameter.

- Can you improve on your best previous score?
- Print the best parameters and the best score
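
A minimal sketch of such a search, assuming synthetic stand-in data (`X_demo`, `y_demo`) and an illustrative grid. The specific values of `C` and `gamma` are arbitrary choices, not recommendations from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the breast cancer feature matrix and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=5, random_state=0)

# A list of dicts lets each kernel have its own parameters:
# gamma only matters for the RBF kernel, so it is searched only there.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated score of the winning combination, so it is directly comparable with the `cross_val_score` averages from 1.b when the same `cv` is used.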

# Exercise 2
Now that you've completed steps 1.a through 1.e it's time to tackle some harder datasets. But before we do that, let's encapsulate a few things into functions so that it's easier to repeat the analysis.

## 2.a: Cross Validation
Implement a function `do_cv(model, X, y, cv)` that does the following:
- Calculates the cross validation scores
- Prints the model
- Prints and returns the mean and the standard deviation of the cross validation scores

> Answer: see above
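
One possible shape for `do_cv`, offered as a sketch rather than the reference answer. The demo data and the printed format are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def do_cv(model, X, y, cv):
    """Cross-validate `model`, print it, and return (mean, std) of the scores."""
    scores = cross_val_score(model, X, y, cv=cv)
    print(model)
    mean, std = float(np.mean(scores)), float(np.std(scores))
    print("Average score: %.3f +/- %.3f" % (mean, std))
    return mean, std


# Usage on synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=5, random_state=0)
cv_mean, cv_std = do_cv(SVC(kernel="linear"), X_demo, y_demo, cv=3)
```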

## 2.b: Confusion Matrix and Classification Report
Implement a function `do_cm_cr(model, X, y, names)` that automates the following:
- Split the data using `train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)`
- Fit the model
- Print the confusion matrix and classification report in a nice format

**Hint:** names is the list of target classes

> Answer: see above
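
A hedged sketch of `do_cm_cr` that follows the steps listed above. Returning the confusion-matrix frame is an addition for convenience, and `names` is assumed to be given in the sorted order of the class labels.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def do_cm_cr(model, X, y, names):
    """Fit on a stratified split, then print a labeled confusion matrix and report."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.33, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Rows are true classes, columns are predicted classes.
    cm = pd.DataFrame(confusion_matrix(y_test, y_pred),
                      index=["is_%s" % n for n in names],
                      columns=["predicted_%s" % n for n in names])
    print(cm)
    print(classification_report(y_test, y_pred, target_names=names))
    return cm


# Usage on synthetic stand-in data with two classes.
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=5, random_state=0)
cm_demo = do_cm_cr(SVC(kernel="linear"), X_demo, y_demo, ["benign", "malignant"])
```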

## 2.c: Learning Curves
Implement a function `do_learning_curve(model, X, y, sizes)` that automates drawing the learning curves:
- Allow for sizes input
- Use 5-fold StratifiedKFold cross validation

> Answer: see above

## 2.d: Grid Search
Implement a function `do_grid_search(model, parameters)` that automates the grid search by doing:
- Calculate grid search
- Print best parameters
- Print best score
- Return best estimator


> Answer: see above
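
A possible wrapper for the grid search, as a sketch only. The exercise's signature is `do_grid_search(model, parameters)`; the extra `X`, `y`, and `cv` arguments below are an assumption so that the function does not rely on global variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def do_grid_search(model, parameters, X, y, cv=5):
    """Run a grid search, print the best parameters and score, return the best estimator."""
    search = GridSearchCV(model, parameters, cv=cv)
    search.fit(X, y)
    print("Best parameters:", search.best_params_)
    print("Best score: %.3f" % search.best_score_)
    return search.best_estimator_


# Usage on synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9,
                                     n_informative=5, random_state=0)
best_model = do_grid_search(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
                            X_demo, y_demo, cv=3)
```

The returned estimator is already refitted on all of `X` and `y`, so it can be passed straight to the other helper functions.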

# Exercise 3
Using the functions above, analyze the Spambase dataset.

Notice that now you have many more features. Focus your attention on step 1.c, feature selection.

- Load the data and get to X, y
- Select the 15 best features
- Perform grid search to determine best model
- Display learning curves
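
The selection and search steps can be combined in one pipeline, sketched below on synthetic data with 57 columns (an assumed stand-in for Spambase's feature matrix). Placing `SelectKBest` inside the pipeline means the features are re-selected on each training fold, so the cross-validated score is not inflated by selecting on the full dataset first.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: many features, only some of them informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=57,
                                     n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=15)),  # keep the 15 best features
    ("svc", SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)

n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print(search.best_params_, search.best_score_, n_selected)
```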

# Exercise 4
Repeat steps 1.a - 1.e for the car dataset. Notice that now features are categorical, not numerical.
- Find a suitable way to encode them
- How does this change our modeling strategy?

Also notice that the target variable `acceptability` has 4 classes. How do we encode them?
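
One common way to handle both questions is sketched below: one-hot encode the categorical features and map the ordered target to integers. The miniature frame is hypothetical; its column names and category labels are assumptions for illustration and should be checked against the dataset's `.names` file.

```python
import pandas as pd

# Hypothetical miniature of the car data (made-up rows).
cars = pd.DataFrame({
    "buying": ["vhigh", "low", "med", "high"],
    "safety": ["low", "high", "med", "high"],
    "acceptability": ["unacc", "vgood", "acc", "good"],
})

# One-hot encode the features: one indicator column per category value.
X_cars = pd.get_dummies(cars[["buying", "safety"]])

# Map the four target classes to integers, assuming this ordering.
order = {"unacc": 0, "acc": 1, "good": 2, "vgood": 3}
y_cars = cars["acceptability"].map(order)

print(list(X_cars.columns))
print(y_cars.tolist())
```

`SVC` accepts a multiclass target directly and treats the integer codes as plain class labels, so the ordering in `order` affects readability of the reports rather than the fit.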


# Bonus
Repeat steps 1.a - 1.e for the mushroom dataset. Notice that now features are categorical, not numerical. This dataset is quite large.
- How does this change our modeling strategy?
- Can we use feature selection to improve this?
