## Importing libraries

We'll need the following libraries for today's lecture:
1. `pandas`
2. `KNeighborsClassifier` from `sklearn`'s `neighbors` module
3. The `load_breast_cancer` function from `sklearn`'s `datasets` module
4. `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module
5. `StandardScaler` from `sklearn`'s `preprocessing` module
6. The `confusion_matrix` function from `sklearn`'s `metrics` module

In [50]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split

# new
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix

## Create dataset

Similar to `load_iris` from this morning, we'll call the `load_breast_cancer()` function to create our dataset.

In [2]:
data = load_breast_cancer()

## Create `X` and `y`

The dataset labels benign tumors as 1 and malignant tumors as 0. This runs counter to the usual convention, in which the class of greater interest (here, malignant) is labeled 1, so we flip the labels when creating `y`.

In [8]:
X = data.data
y = 1 - data.target
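
As a quick sanity check on the label flip, the sketch below inspects `target_names` (which is ordered by label value) and counts each class after flipping. The names `label_names`, `n_malignant`, and `n_benign` are ours, introduced only for illustration.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by label value: index 0 is the name for label 0, etc.
label_names = [str(name) for name in data.target_names]

# Flip the labels so that malignant becomes the positive class (1)
y = 1 - data.target

# Illustrative variable names (not part of the lecture code)
n_malignant = int(y.sum())
n_benign = int((y == 0).sum())

print(label_names)
print("malignant:", n_malignant, "benign:", n_benign)
```

After the flip, `y.mean()` is the share of malignant cases in the data, which is why the means checked after the split are easy to interpret.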

## Train/Test Split

In the cell below, split your `X` and `y` variables into training and testing sets.

**Note:** we'll want to create a stratified split, so that both sets keep roughly the same proportion of malignant cases.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [47]:
y_test.mean()

0.3706293706293706

In [48]:
y_train.mean()

0.3732394366197183
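
To see what `stratify` buys us, here is a minimal sketch on made-up labels (370 positives out of 1,000, an assumption chosen only to resemble our class balance). `X_demo`, `y_demo`, and the other names are hypothetical. The stratified split keeps the positive rate nearly identical in both sets, while the unstratified one is free to drift.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 370 positives out of 1000 (37%)
y_demo = np.array([1] * 370 + [0] * 630)
X_demo = np.arange(1000).reshape(-1, 1)  # placeholder feature, values don't matter

# Stratified: class proportions are preserved in both sets
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)
gap = abs(y_tr.mean() - y_te.mean())

# Unstratified: proportions depend on the luck of the shuffle
_, _, y_tr_plain, y_te_plain = train_test_split(X_demo, y_demo, random_state=42)
gap_plain = abs(y_tr_plain.mean() - y_te_plain.mean())

print("stratified gap:  ", gap)
print("unstratified gap:", gap_plain)
```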

## Scaling our features

Because KNN relies on distances between observations, features measured on larger scales would dominate the model, so we'll want to scale our training and testing sets.
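
As a small illustration on toy data (the arrays `X_tr` and `X_te` are made up for this sketch), the scaler learns each column's mean and standard deviation from the training set only, then applies those same values to the testing set. After scaling, each training column has mean 0 and standard deviation 1. The testing set generally will not, because it is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (hypothetical values)
X_tr = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])
X_te = np.array([[2.5, 400.0], [10.0, 900.0]])

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)  # fit on the training set, then transform it
X_te_sc = scaler.transform(X_te)      # reuse the training mean and std

print(X_tr_sc.mean(axis=0), X_tr_sc.std(axis=0))
print(X_te_sc)
```

The first row of `X_te` happens to equal the training means, so it is mapped to zeros.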

In [20]:
ss = StandardScaler()
ss.fit(X_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

## Instantiating and fitting our model

In the cells provided, create and fit an instance of `KNeighborsClassifier`. You can use the default parameters.

In [21]:
knn = KNeighborsClassifier()

In [22]:
knn.fit(X_train_sc, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

## Predictions

Use our newly fitted KNN model to create predictions from `X_test_sc`.

In [23]:
predictions = knn.predict(X_test_sc)

In [26]:
predictions

array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

## Confusion Matrix

We'll create a confusion matrix using the `confusion_matrix` function from `sklearn`'s `metrics` module.

In [46]:
cm = confusion_matrix(y_test, predictions)
cm

array([[89,  1],
       [ 5, 48]])
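
In `sklearn`'s layout, rows are the actual classes and columns are the predicted classes, both in sorted label order (0, then 1). For a binary problem, a common idiom is to flatten the matrix with `.ravel()` to pull out the four counts by name. The sketch below copies the counts from the output above into a hypothetical `cm_demo` array so that it runs on its own.

```python
import numpy as np

# Counts copied from the confusion matrix output above
cm_demo = np.array([[89, 1],
                    [5, 48]])

# Row-major flattening gives: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm_demo.ravel()

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```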

## Confusion DataFrame

The confusion matrix we just created has no row or column labels, so let's drop it into a pandas `DataFrame` that names them.

In [32]:
cm_df = pd.DataFrame(data=cm, columns=['predicted benign', 'predicted malignant'], index=['actual benign', 'actual malignant'])
cm_df

| | predicted benign | predicted malignant |
|---|---|---|
| actual benign | 89 | 1 |
| actual malignant | 5 | 48 |


## Calculate recall

<details>
    <summary>Need a hint?</summary>
    Recall = Sensitivity, and there are no p's in sensitivity.
</details>

In [36]:
48 / 53

0.9056603773584906
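
Rather than typing the numbers in by hand, recall can also be computed with `recall_score` from `sklearn`'s `metrics` module. The sketch below rebuilds label lists that match the confusion matrix counts (`y_true_demo` and `y_pred_demo` are stand-ins we made up, not the real `y_test` and `predictions`) so that it runs on its own.

```python
from sklearn.metrics import recall_score

# Stand-in labels reconstructed from the confusion matrix counts:
# 90 actual benign (0), then 53 actual malignant (1)
y_true_demo = [0] * 90 + [1] * 53
# 89 true negatives, 1 false positive, 5 false negatives, 48 true positives
y_pred_demo = [0] * 89 + [1] * 1 + [0] * 5 + [1] * 48

# By default, recall_score treats label 1 as the positive class
recall = recall_score(y_true_demo, y_pred_demo)
print(recall)
```

In the notebook itself, the equivalent call would be `recall_score(y_test, predictions)`.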

## How many Type I errors are there?

<details>
    <summary>Need a hint?</summary>
    Type I = False positive
</details>

In [38]:
1

1

## How many Type II errors are there?

<details>
    <summary>Need a hint?</summary>
    Type II = False negative
</details>
</details>

In [39]:
5

5

## Which error is worse (Type I vs Type II)?

In [40]:
# Type II, because the patient has a malignant tumor and we told them they don't.

## Calculate the sensitivity

<details>
    <summary>Need a hint?</summary>
    There are no p's in sensitivity: TP/P
</details>

In [41]:
48/53

0.9056603773584906

## Calculate the specificity

<details>
    <summary>Need a hint?</summary>
    There is a p in specificity, therefore there are no p's in the calculation: TN/N
</details>

In [43]:
89/90

0.9888888888888889
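
To tie the two formulas together, here is a minimal sketch of a helper that takes the four confusion matrix counts and returns both metrics. The function name `sensitivity_specificity` is ours, not something from `sklearn`.

```python
def sensitivity_specificity(tn, fp, fn, tp):
    """Return (sensitivity, specificity) from confusion matrix counts."""
    sensitivity = tp / (tp + fn)  # TP / P: share of actual positives we caught
    specificity = tn / (tn + fp)  # TN / N: share of actual negatives we cleared
    return sensitivity, specificity


# Counts taken from the confusion matrix above
sens, spec = sensitivity_specificity(tn=89, fp=1, fn=5, tp=48)
print("sensitivity:", sens)
print("specificity:", spec)
```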