<a href="https://colab.research.google.com/github/HaidyGiratallah/Intro_to_Machine_Learning_2019/blob/master/Intro_to_Machine_Learning_classification_code_along.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification using SVMs


In this notebook we'll explore how machine learning models perform classification tasks, focusing on support vector machines (SVMs). As with the regression module, we'll use validation to check that our results generalize well. We'll also look at evaluation methods for classification models, such as sensitivity, specificity, and receiver operating characteristic (ROC) curves.

First, as always, import the required libraries.

## Importing Breast Cancer Dataset

In [0]:
# Import the breast cancer data from the sklearn library
# The dataset can also be found here: (http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29)
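
A minimal sketch of what this cell might contain, assuming the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) rather than a download from the UCI link. The names `cancer` and `df` are illustrative choices, not taken from the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled Wisconsin breast cancer (diagnostic) dataset
cancer = load_breast_cancer()

# Put the features in a table and append the binary target column
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

print(df.shape)              # (569, 31): 30 features plus the target
print(cancer.target_names)   # class names for target values 0 and 1
print(df["target"].value_counts())
```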


It's good practice when working with a new dataset to perform some visualization. While we won't have this luxury with high-dimensional data, which is probably the setting for most classification problems, playing with a low-dimensional case is good for building intuition:

Let's view the dataset as a table. Note that our target column is coded as binary, 1 for benign and 0 for cancerous (malignant), which makes this a binary classification problem.

Scatterplots or pairplots are very useful for visualizing all of your features. You can also plot correlations between your features to help identify important ones for feature selection or feature engineering. Here, let's assume we decided that 'mean radius' and 'mean concavity' are the most important features for distinguishing cancerous from benign tumors.
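
As a numeric companion to the plots, the sketch below computes the correlation of the two chosen features with the target. It assumes the scikit-learn copy of the dataset; `df`, `selected`, and `corr` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

# Pearson correlations among the two chosen features and the target
selected = ["mean radius", "mean concavity"]
corr = df[selected + ["target"]].corr()
print(corr.round(2))
```

Both features correlate negatively with the target because benign tumors (coded 1) tend to have smaller values of each.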

As you can see from the data, there's some separation between the two classes. Our goal is to train a support vector machine classifier to model that separation. As with most machine learning tools, <code>sklearn</code> provides a support vector machine classifier:

### Splitting your data into 'Training' and 'Testing' datasets

First let's split the dataset into a training and a testing set.

N.B. There are multiple ways to split your data, which are explained further under validation methods. The main purpose of splitting is to hold out some data for testing your model's performance on new, unseen data, which gives a better estimate of generalizability.
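
One possible split is sketched below with `train_test_split`. The 70/30 ratio, the fixed `random_state`, and stratifying on the target are assumptions made for illustration; the original notebook may have used different settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Keep only the two features chosen above
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target

# Hold out 30% of the rows for testing (assumed ratio);
# stratify so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```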

## Training your model using the training dataset
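
A minimal sketch of the training step, assuming `sklearn.svm.SVC` with a linear kernel and default regularization. The split settings and variable names are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit a linear SVM on the training rows only
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

print("weights:", clf.coef_[0])
print("intercept:", clf.intercept_[0])
print("training accuracy:", clf.score(X_train, y_train))
```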

Now we can inspect some properties of this model to get a better idea of how it performed on our training data. We'll start by visualizing the dividing line generated by this model:

Note that since this particular SVM is a linear model, fitting it produces a line, much like linear regression. The only difference is that this line is chosen to separate the two classes rather than to minimize the mean squared error, as we did with linear regression:

Plotting the boundary requires us to write out the equation of the line learned by the SVM model and rearrange it to solve for $x_1$ or $x_2$ (both are equivalent). Here $a$ and $b$ are the learned feature weights and $c$ is the intercept:

$$ax_1 + bx_2 + c = 0$$
$$x_2 = \frac{-ax_1 - c}{b}$$
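
The rearranged equation can be evaluated directly from a fitted model. This sketch assumes a linear `SVC`, where `coef_` holds $(a, b)$ and `intercept_` holds $c$; the split settings and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)

# a, b are the feature weights and c is the intercept
a, b = clf.coef_[0]
c = clf.intercept_[0]

# Solve a*x1 + b*x2 + c = 0 for x2 over the observed range of x1
x1 = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 50)
x2 = (-a * x1 - c) / b

# Sanity check: the decision function is (numerically) zero on the line
on_line = np.column_stack([x1, x2])
print(np.abs(clf.decision_function(on_line)).max())
```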

Now that we've computed our linear boundary, let's visualize what it looks like!

We can also visualize which training points were used as support vectors!
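
A fitted `SVC` exposes its support vectors through the `support_vectors_`, `support_`, and `n_support_` attributes. The sketch below shows how they relate to the training data; the split settings and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)

sv = clf.support_vectors_   # coordinates of the support vectors
idx = clf.support_          # their row indices in X_train
print("support vectors per class:", clf.n_support_)
print("fraction of training points:", len(sv) / len(X_train))
```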

We can plot the margins used in the SVM model as well. The margins of the SVM are described by the following equations:

For the top margin:
$$ ax_1 + bx_2 + c = 1 $$

For the bottom margin:
$$ ax_1 + bx_2 + c = -1 $$

Re-arranging the equations to solve for $x_2$ as usual (for the top margin):

$$x_2 = \frac{1- ax_1 - c}{b}$$
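
Both margin lines can be computed the same way. The bottom margin follows from the same rearrangement with $-1$ in place of $1$, that is $x_2 = \frac{-1 - ax_1 - c}{b}$. This sketch assumes a linear `SVC`; the split settings and names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)

a, b = clf.coef_[0]
c = clf.intercept_[0]
x1 = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 50)

# a*x1 + b*x2 + c = +1 and = -1, each solved for x2
x2_top = (1 - a * x1 - c) / b
x2_bottom = (-1 - a * x1 - c) / b

# The decision function should equal +1 and -1 on the two lines
top_vals = clf.decision_function(np.column_stack([x1, x2_top]))
bottom_vals = clf.decision_function(np.column_stack([x1, x2_bottom]))
print(top_vals[:3], bottom_vals[:3])
```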

Now we can visualize the full SVM result!
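
One way to draw the full picture is sketched below with matplotlib, which is an assumed plotting library here. It overlays the training points, the support vectors, the decision boundary, and both margins. Colors, marker sizes, and split settings are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)

a, b = clf.coef_[0]
c = clf.intercept_[0]
x1 = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 50)
sv = clf.support_vectors_

fig, ax = plt.subplots()
# Training points colored by class
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm", s=15)
# Support vectors circled in black
ax.scatter(sv[:, 0], sv[:, 1], facecolors="none", edgecolors="k", s=60,
           label="support vectors")
# Boundary (offset 0, solid) and margins (offsets +1 and -1, dashed)
for offset, style in [(0, "-"), (1, "--"), (-1, "--")]:
    ax.plot(x1, (offset - a * x1 - c) / b, "k" + style)
ax.set_ylim(X_train[:, 1].min(), X_train[:, 1].max())
ax.set_xlabel("mean radius")
ax.set_ylabel("mean concavity")
ax.legend()
plt.show()
```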

This visualization will become increasingly useful as we start thinking about regularization!

## Let's use our trained model to make a prediction using our test dataset

In [0]:
# predict test set using your trained model
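
A sketch of the prediction step, assuming the two-feature linear `SVC` described above. The split settings and the name `y_pred` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)

# Predict a class label (0 or 1) for every held-out row
y_pred = clf.predict(X_test)
print(y_pred[:10])
print(y_test[:10])
```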


### Computing Classification Metrics

Now that we've fit our model, we can start to calculate classification metrics on the test dataset. *Only metrics calculated on the test dataset are useful for evaluating the expected performance of your model on unseen data!*

An easy way to generate these metrics is to predict the classes for the test set, then use <code>sklearn.metrics.confusion_matrix</code> to generate our 2x2 table.

In [0]:
# Import metric libraries



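Recall that the confusion matrix summarizes the prediction counts in a table (in <code>sklearn</code>, rows are true classes and columns are predicted classes):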

<table>
    <tr>
        <td> True Negatives </td>
        <td> False Positives </td>
    </tr>
    <tr>
        <td> False Negatives </td>
        <td> True Positives </td>
    </tr>
</table>



In [0]:
# create confusion matrix
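
A sketch of building the 2x2 table with `sklearn.metrics.confusion_matrix`, using the same assumed two-feature linear `SVC` and split settings as before. For binary labels 0 and 1, flattening the matrix yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```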



In [0]:
# print classification report
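
A sketch using `sklearn.metrics.classification_report`, which prints precision, recall, and F1 score per class. The model and split settings are the same assumptions as in the earlier sketches.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Label the rows with the dataset's class names (0 = malignant, 1 = benign)
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)
```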


***

### Exercise:
Using the confusion matrix table, calculate:

1. Accuracy on test set
2. Specificity on test set
3. Sensitivity on test set

### Solution:

The following equations are used:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$


$$\text{Specificity} = \frac{TN}{TN + FP}$$


$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
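
The three formulas can be checked against the confusion matrix counts as sketched below. This assumes the same two-feature linear `SVC` and split settings as earlier. Note that in scikit-learn's encoding class 1 is benign, so "positive" here means benign and sensitivity measures how many benign cases were recovered.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)   # recall of class 0
sensitivity = tp / (tp + fn)   # recall of class 1
print(f"accuracy={accuracy:.3f}, specificity={specificity:.3f}, "
      f"sensitivity={sensitivity:.3f}")
```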

Looks like our model did pretty well! The final step is to explore the sensitivity/specificity trade-off and to plot the ROC curve. To build the ROC curve, we must first generate a score for each data point in the test set. Sweeping the threshold at which we classify a point as class 0 or class 1 then traces out the ROC curve:
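
A sketch of the score-and-threshold procedure, assuming the signed distance returned by `decision_function` is used as the score and `sklearn.metrics.roc_curve` performs the threshold sweep. The model and split settings are the same assumptions as before.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = SVC(kernel="linear").fit(X_train, y_train)

# Score each test point by its signed distance from the boundary
scores = clf.decision_function(X_test)

# Each threshold gives one (false positive rate, true positive rate) pair;
# the true positive rate is sensitivity and 1 - fpr is specificity
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print("number of thresholds:", len(thresholds))
print("area under the ROC curve:", round(auc, 3))
```

Plotting `tpr` against `fpr` gives the ROC curve itself.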

As you can see, SVMs perform quite well in high-dimensional space. There are some theoretical reasons why this is the case, but that topic is too advanced for an intro course. We could potentially do better by applying dimensionality reduction or by tuning regularization (which SVMs have built in; see the $C$ parameter).
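
To get a feel for the $C$ parameter, the sketch below compares a few values using 5-fold cross-validation on the training set. The candidate values, the fold count, and the split settings are arbitrary assumptions for illustration. A smaller $C$ means stronger regularization, that is, a wider margin that tolerates more misclassified training points.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean concavity")]
X, y = data.data[:, cols], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Mean cross-validated accuracy for each candidate C (assumed values)
results = {}
for C in (0.01, 0.1, 1.0):
    cv_scores = cross_val_score(SVC(kernel="linear", C=C), X_train, y_train, cv=5)
    results[C] = cv_scores.mean()
    print(f"C={C}: mean CV accuracy = {results[C]:.3f}")
```

Selecting $C$ on the training folds keeps the test set untouched for the final evaluation.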

Finally, you might have noticed that our SVM is a linear function. However, we can extend the SVM to non-linear cases using something called the **Kernel Trick**. We won't get into it in this course but the **Kernel Trick** is an extraordinary property of the SVM that allows it to be widely applicable!