In this notebook, we'll cover one of the major algorithms used in Supervised Learning: Support Vector Machines (or SVMs for short!). We'll start by playing around with a visual implementation to gain an intuition for how SVMs work, and then we'll grab an SVM implementation from sklearn and use it to make some classifications on a real-world data set.

At first glance, SVMs are similar to other supervised learning algorithms such as Logistic Regression, because the algorithm finds a line to use as a decision boundary. However, unlike Logistic Regression, SVMs don't just find any line for the decision boundary: they try to maximize the margin between the boundary and the closest points of each class.

The points that touch the sides of the margin are called support vectors. Maximizing the margin defined by the support vectors has the effect of "balancing" the decision boundary so that it evenly splits the area between the two classes. This is not always the case with Logistic Regression; see the image below for a visual example.

Notice that in the image on the right, the line is a bit skewed through the data points. This is a problem that can occur with Logistic Regression, since its job is only to fit a line that separates the two classes. The line in the image on the right technically accomplishes this task, but we can see by looking at the decision boundary that it is not optimal. Contrast this with the decision boundary on the left, which splits the area between the two classes evenly.
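To make the idea of margins and support vectors concrete, here is a minimal sketch using sklearn's SVC with a linear kernel. The six data points are made up purely for illustration, and the very large C value is an assumption used to approximate a "hard" margin that tolerates no misclassified points.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (invented for illustration only).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# For a linear kernel the boundary is w . x + b = 0,
# and the distance between the two margin lines is 2 / ||w||.
w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b)
print("margin width:", margin_width)
```

Only the points listed in `support_vectors_` determine the boundary; moving any other point (without crossing the margin) would leave the fitted line unchanged.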

SVMs are not perfect, however. In the form described so far, they only work when the data is linearly separable, that is, when the two classes can be split by a decision boundary drawn as a straight line. Take a look at the picture below, and consider where you would draw the ideal decision boundary to split the two. Remember, it has to be a straight line!

The data is not linearly separable, so we can't draw a straight-line decision boundary. Or can we? This is where the cool part of SVMs comes in: what if we mapped the data to a higher-dimensional space? Maybe we could draw a decision boundary there.

Ah, there it is! In this higher-dimensional space, we can see an easy place to draw a linear decision boundary. It's important to note that in 2 dimensions, our decision boundary looks like a straight line, but for this data, in its current 3-dimensional form, our decision boundary will need to look like a piece of paper (with no thickness), that is, a flat plane. This is because a linear decision boundary always has one less dimension than the data it separates. If our data has 4 dimensions (which we can't visualize), then our decision boundary would be a 3-dimensional hyperplane. We can generalize this rule to say that for any dataset with n dimensions, our decision boundary will have n - 1 dimensions.
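The lift to a higher dimension can be sketched by hand. The example below is only an illustration under stated assumptions: it uses sklearn's make_circles to generate ring-shaped data similar to the picture, and it picks the extra feature z = x1^2 + x2^2 (squared distance from the origin) as the third dimension. Neither choice comes from the original figures.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: a small inner cluster surrounded by an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# In the original 2D space, a straight line cannot separate the classes.
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Hand-crafted lift to 3D: add z = x1^2 + x2^2 as a third feature.
z = (X ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack([X, z])

# In 3D, a flat plane (a linear boundary) can now split the classes.
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)

print(f"linear SVM accuracy in 2D: {acc_2d:.2f}")
print(f"linear SVM accuracy in 3D: {acc_3d:.2f}")
```

The inner points end up with small z values and the outer ring with large ones, so a horizontal-looking plane between them does the job.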

The process of mapping data to a higher-dimensional space is called the Kernel Method. There are several different kernels, but the most common ones you'll need to know are the Polynomial Kernel and the Radial Basis Function (RBF) Kernel. These are complicated data transformations that any ML library worth its salt can handle for you. You don't need to know the math behind them, but you should definitely be aware that they exist, and that they are tools in your ML toolbox for SVMs!
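In sklearn, switching kernels is a matter of changing the `kernel` argument of SVC. The sketch below compares three kernels on synthetic ring-shaped data; the dataset, the split, and the choice of `degree=2` for the polynomial kernel are assumptions made for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data (illustrative only).
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Kernel settings to compare; degree only applies to the polynomial kernel.
kernel_settings = {
    "linear": {"kernel": "linear"},
    "poly": {"kernel": "poly", "degree": 2},
    "rbf": {"kernel": "rbf"},
}

scores = {}
for name, params in kernel_settings.items():
    model = SVC(**params).fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name:>6}: test accuracy = {acc:.2f}")
```

On data like this, the linear kernel should struggle while the polynomial and RBF kernels handle the ring shape without any manual feature engineering.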

Let's review what we've done so far:

Determined that the data is not linearly separable in its current form.
Mapped the data to a higher dimensional space using a kernel method.
Found a linear decision boundary in the higher dimensional space.
Now what?

Now that we've identified support vectors that allow us to linearly separate the data in a higher dimensional space, all that we need to do is to bring the data (and the decision boundary) back to our original, lower-dimensional space. If we visualize the decision boundary for our data in the lower-dimensional space, it will appear as a circle:

It's important to understand that although our decision boundary isn't linear in this lower-dimensional space, that's okay--we found a linear decision boundary in a higher-dimensional space and made our classifications, so we didn't actually break the rules of Support Vector Machines.
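To see why a flat boundary in the higher-dimensional space can show up as a circle in the original space, consider a hand-made lift z = x1^2 + x2^2 (an assumption for illustration; real kernels do this implicitly). A plane w1*x1 + w2*x2 + w3*z + b = 0 becomes, after substituting z and completing the square, a circle in (x1, x2). The sketch below fits a linear SVM on synthetic lifted data and recovers that circle's center and radius.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic ring-shaped data: inner cluster radius ~0.3, outer ring radius ~1.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.03, random_state=0)

# Lift to 3D with z = x1^2 + x2^2 and fit a linear boundary (a plane) there.
z = (X ** 2).sum(axis=1, keepdims=True)
clf = SVC(kernel="linear", C=10).fit(np.hstack([X, z]), y)

w1, w2, w3 = clf.coef_[0]
b = clf.intercept_[0]

# Plane: w1*x1 + w2*x2 + w3*(x1^2 + x2^2) + b = 0.
# Completing the square gives a circle in the original 2D space.
center = np.array([-w1 / (2 * w3), -w2 / (2 * w3)])
radius = np.sqrt((w1**2 + w2**2) / (4 * w3**2) - b / w3)

print("circle center:", center)
print("circle radius:", radius)
```

The recovered radius should fall between the inner cluster and the outer ring, which is exactly the circular boundary we would draw by eye in 2D.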

To make learning how SVMs work a bit easier, the sklearn community has built an awesome interactive visualization that lets users plot points and fit an SVM for binary classification. We highly recommend running this Python script and getting a feel for how SVMs work: plot different data points and see how the decision boundary changes, try different kernel methods, visualize the decision surface of the SVM, etc. You'll find all of these activities very useful, and very interesting.

Check out this link to see the page on scikit-learn.org that gives an example of how everything works. To try it yourself, download and run the Python script linked at the bottom of that page (use the script version, not the Jupyter notebook!).

For the remainder of this notebook, you'll use everything you've learned in DS2 to fit a Support Vector Classifier on the Wisconsin Breast Cancer Dataset. Note that you do not need to download the dataset, as it comes preloaded as a sample in sklearn. To get the data, just use the load_breast_cancer() function found within sklearn.datasets.

Challenge:

Import and explore the dataset. Recall that the load_breast_cancer() function returns an object that contains the data in .data, the labels in .target, and the column names in .feature_names.
Build a Correlation Heatmap using Seaborn to check for each feature's correlation with the labels.
Build a second Correlation Heatmap using Seaborn to check for multicollinearity between features.
Scale and transform the data using a StandardScaler() object and any appropriate methods it contains.
Split the newly scaled data into training and testing sets using train_test_split().
Create an SVC() object, which can be found in sklearn.svm.
Fit the model to the scaled training data.
Use your testing data to check the accuracy metrics for your model.
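One possible shape for the modeling steps above is sketched here. It is not the only valid solution, and it makes a few assumptions not stated in the challenge: a 25% test split, `random_state=42`, stratified sampling, and default SVC parameters. It also deliberately splits before scaling and fits the StandardScaler on the training data only, a common variation on the listed order that keeps test-set statistics out of the scaler.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()

# Hold out a test set (split sizes and random_state are arbitrary choices).
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=42
)

# Learn the scaling from the training data, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a Support Vector Classifier with default settings (RBF kernel).
model = SVC()
model.fit(X_train_scaled, y_train)

# Evaluate on the held-out data.
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```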
Stretch Challenge:

Try different parameters, such as different kernels, to see how they affect the overall performance of the model. For a full list of the tunable parameters you can use with an SVC, see the documentation on scikit-learn.org.
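One way to approach the stretch challenge is a small grid search. The sketch below is an assumption-laden starting point: the parameter grid (three kernels, three values of C) and 5-fold cross-validation are arbitrary choices, and the scaler is placed inside a pipeline so it is refit on each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# make_pipeline names each step after its class in lowercase, hence "svc__...".
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; extend it with gamma, degree, etc. as needed.
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```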

In [15]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.datasets import load_breast_cancer

# Load the dataset; the returned object exposes .data, .target and .feature_names
breast_cancer = load_breast_cancer()
print(breast_cancer.data)
print(breast_cancer.data.shape)

# Build a DataFrame with named feature columns, then append the labels
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df['TARGET'] = breast_cancer.target

print(df.head())

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
(569, 30)
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0

In [16]:
df.describe()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,TARGET
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [19]:
# .corr() is a DataFrame method, so call it on df rather than on the raw dataset object
bc_corr_map = df.corr()
sns.heatmap(bc_corr_map)
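For the two correlation checks in the challenge, it can help to look at the numbers behind the heatmaps. The sketch below is one possible approach, computed with pandas only; the variable names `target_corr` and `feature_corr` are made up for illustration, and either result could be passed to sns.heatmap for plotting.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df["TARGET"] = breast_cancer.target

# Full pairwise correlation matrix (features plus the label column).
corr = df.corr()

# 1) Correlation of each feature with the label, sorted from most negative.
target_corr = corr["TARGET"].drop("TARGET").sort_values()

# 2) Feature-to-feature correlations only, for spotting multicollinearity.
feature_corr = corr.drop(index="TARGET", columns="TARGET")

print(target_corr.head())
print("mean radius vs mean perimeter:",
      feature_corr.loc["mean radius", "mean perimeter"])

# Optional plots (require seaborn):
# sns.heatmap(target_corr.to_frame())
# sns.heatmap(feature_corr)
```

Pairs of features with correlations close to 1, such as radius and perimeter measurements, carry largely the same information.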