<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# KNN and Support Vector Machines

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score,accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from matplotlib.colors import ListedColormap
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings("ignore")

## Setup

We will use the Breast Cancer Wisconsin (Diagnostic) dataset from the University of Wisconsin, which contains features of the cell nuclei present in biopsies of breast masses. The target to predict is whether the mass is malignant or benign. A description of the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

In [None]:
from sklearn.datasets import load_breast_cancer
data=load_breast_cancer(as_frame=True)
X,y=data.data,data.target
# The dataset encodes 0=malignant, 1=benign; flip the labels so that malignant is the positive class (1)
y=(y==0).astype(int)
X.head()

In [None]:
# Let's set aside a test set and use the remainder for training and cross-validation
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.2)

# Let's scale the inputs: KNN and SVMs rely on distances between points, so unscaled features with large ranges would dominate
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test),columns=X_train.columns)

# Let's create a model using just two features so we can visualize it
X_train_2feats = X_train_scaled[['worst concave points','worst area']]
X_test_2feats = X_test_scaled[['worst concave points','worst area']]

In [None]:
def plot_decision_boundaries(X,y,model):
    """
    Plots the 2D decision boundary of a classification model
    Parameters:
    X (pandas dataframe): input features
    y (pandas series): target values
    model: trained scikit-learn model object
    """
    markers = ['^','s','v','o','x']
    colors = ['yellow','green','purple','blue','orange']
    cmap = ListedColormap(colors[:len(np.unique(y))])
    
    for i,k in enumerate(np.unique(y)):
        plt.scatter(X.loc[y.values==k].iloc[:,0],X.loc[y.values==k].iloc[:,1],
                    c=colors[i],marker=markers[i],label=k,edgecolor='black')

    # Build a 500x500 grid spanning the full range of both features
    xgrid = np.linspace(X.iloc[:,0].min(),X.iloc[:,0].max(),500)
    ygrid = np.linspace(X.iloc[:,1].min(),X.iloc[:,1].max(),500)
    xx,yy = np.meshgrid(xgrid,ygrid)
    
    # Predict the class at every grid point and shade the regions accordingly
    mesh_preds = model.predict(np.c_[xx.ravel(),yy.ravel()])
    mesh_preds = mesh_preds.reshape(xx.shape)
    plt.contourf(xx,yy,mesh_preds,alpha=0.2,cmap=cmap)
    plt.legend()

## PART 1: KNN
In this part we will use 5-fold cross-validation, with accuracy as the evaluation metric, to find the optimal value for `n_neighbors`. The search space we will evaluate for `n_neighbors` is [1,3,5,10]. After you find the optimal `n_neighbors`, plot the decision boundaries of your model and calculate its accuracy on the test set.
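
One way to approach this search is sketched below. It is a minimal, self-contained illustration rather than the reference solution: it rebuilds the two-feature data so it can run on its own, and names such as `feats`, `X_train_2`, `cv_scores`, and `best_k` are placeholders introduced here. Passing `cv=5` to `cross_val_score` relies on scikit-learn's default stratified 5-fold split for classifiers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled two-feature data so this sketch runs on its own
data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)  # 1 = malignant
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
feats = ['worst concave points', 'worst area']
scaler = StandardScaler().fit(X_train[feats])
X_train_2 = scaler.transform(X_train[feats])
X_test_2 = scaler.transform(X_test[feats])

# Mean 5-fold cross-validation accuracy for each candidate n_neighbors
cv_scores = {}
for k in [1, 3, 5, 10]:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_train_2, y_train, cv=5, scoring='accuracy').mean()

# Refit the best candidate on the full training set and score it on the test set
best_k = max(cv_scores, key=cv_scores.get)
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_2, y_train)
test_acc = final_knn.score(X_test_2, y_test)
print(cv_scores, best_k, test_acc)
```

Because `StandardScaler` standardizes each column independently, `X_train_2` holds the same values as `X_train_2feats` above, so the fitted model can be passed to `plot_decision_boundaries(X_train_2feats, y_train, final_knn)`.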

In [None]:
### BEGIN SOLUTION ###



### END SOLUTION ###

## PART 2: Support Vector Classifiers
### 2.1
We will now try an SVC on our simplified two-feature dataset.  In the cells below, create two different SVC models:  
- SVC with a linear kernel. 
- SVC with an RBF kernel. 

For each model, keep the value of C fixed at 1.  Use k-fold cross-validation with k=10 to compare the performance of the two models.  Also, display the decision boundary for each model.  Then, select the kernel that gives the better cross-validation accuracy as your final model and calculate its accuracy on the test set.
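
The comparison could be sketched roughly as follows. This is an illustration under stated assumptions, not the reference solution: the data setup is repeated so the block runs on its own, the `KFold` splitter is shuffled with `random_state=0` (a choice made here, not specified above), and `kernel_scores` and `best_kernel` are placeholder names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Rebuild the scaled two-feature data so this sketch runs on its own
data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)  # 1 = malignant
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
feats = ['worst concave points', 'worst area']
scaler = StandardScaler().fit(X_train[feats])
X_train_2 = scaler.transform(X_train[feats])
X_test_2 = scaler.transform(X_test[feats])

# Assumption: shuffled 10-fold split with a fixed seed, shared by both kernels
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Mean 10-fold cross-validation accuracy for each kernel, with C fixed at 1
kernel_scores = {}
for kernel in ['linear', 'rbf']:
    svc = SVC(kernel=kernel, C=1)
    kernel_scores[kernel] = cross_val_score(svc, X_train_2, y_train, cv=kfold, scoring='accuracy').mean()

# Refit the better kernel on the full training set and score it on the test set
best_kernel = max(kernel_scores, key=kernel_scores.get)
final_svc = SVC(kernel=best_kernel, C=1).fit(X_train_2, y_train)
test_acc = final_svc.score(X_test_2, y_test)
print(kernel_scores, best_kernel, test_acc)
```

Reusing the same `kfold` object for both kernels means each model is evaluated on identical folds, which keeps the comparison like-for-like.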

In [None]:
### BEGIN SOLUTION ###



### END SOLUTION ###

### 2.2
Now, let's try a polynomial kernel.  Vary the polynomial degree from 2 through 4, training an SVC model with a polynomial kernel of each degree.  Keep the value of C constant at 1.  For each model, display the resulting decision boundary and calculate the cross-validation accuracy using k=10.  Visually compare the decision boundaries and how well each model classifies the data.  Then, determine which degree gives the best cross-validation accuracy and calculate the accuracy of an SVC model with that degree of polynomial kernel on the test set.
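
A rough sketch of the degree sweep is shown below, again as an illustration rather than the reference solution. The data setup is repeated so the block runs on its own; the shuffled `KFold` splitter with `random_state=0` is an assumption made here, and `degree_scores` and `best_degree` are placeholder names. Other `SVC` parameters such as `gamma` and `coef0` are left at their scikit-learn defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Rebuild the scaled two-feature data so this sketch runs on its own
data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)  # 1 = malignant
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
feats = ['worst concave points', 'worst area']
scaler = StandardScaler().fit(X_train[feats])
X_train_2 = scaler.transform(X_train[feats])
X_test_2 = scaler.transform(X_test[feats])

# Assumption: shuffled 10-fold split with a fixed seed, shared by all degrees
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Mean 10-fold cross-validation accuracy for each polynomial degree, with C fixed at 1
degree_scores = {}
for degree in [2, 3, 4]:
    svc = SVC(kernel='poly', degree=degree, C=1)
    degree_scores[degree] = cross_val_score(svc, X_train_2, y_train, cv=kfold, scoring='accuracy').mean()

# Refit the best degree on the full training set and score it on the test set
best_degree = max(degree_scores, key=degree_scores.get)
final_poly = SVC(kernel='poly', degree=best_degree, C=1).fit(X_train_2, y_train)
test_acc = final_poly.score(X_test_2, y_test)
print(degree_scores, best_degree, test_acc)
```

To compare the boundaries visually, each fitted model can be passed in turn to `plot_decision_boundaries(X_train_2feats, y_train, model)`, calling `plt.figure()` before each plot so the boundaries do not overlap.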

In [None]:
### BEGIN SOLUTION ###



### END SOLUTION ###