### Codio Activity 12.3: Decision Boundaries 

**Estimated Time: 60 Minutes**

**Total Points: 55**

This activity focuses on how changing the decision threshold affects the resulting predictions.  You will again use the `KNeighborsClassifier`, but this time you will use the `predict_proba` method of the fitted estimator to change the threshold for classifying observations.  You will explore how the decision threshold affects the number of false negatives the classifier makes on the default data.  Here, we suppose the most important thing is to avoid predicting that somebody will not default when they actually do.  

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn import set_config

set_config(display="diagram")

### The Dataset

You continue to use the default example, and the data is again loaded and split for you below. 

In [5]:
default = pd.read_csv('data/default.csv')

In [6]:
default.head(2)

Unnamed: 0.1,Unnamed: 0,default,student,balance,income
0,1,No,No,729.526495,44361.625074
1,2,No,Yes,817.180407,12106.1347


In [7]:
X_train, X_test, y_train, y_test = train_test_split(default.drop('default', axis = 1), 
                                                    default['default'],
                                                   random_state=42)

In [8]:
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['student']),
                                     remainder = StandardScaler())
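
To see what this transformer produces, here is a minimal sketch on a small, made-up frame. The frame `toy`, the name `toy_transformer`, and the `sparse_threshold=0` setting (used only to force a dense array for easy inspection) are assumptions for illustration and are not part of the activity.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame that mimics the columns of the default data
toy = pd.DataFrame({
    'student': ['No', 'Yes', 'No', 'Yes'],
    'balance': [100.0, 200.0, 300.0, 400.0],
    'income': [10.0, 20.0, 30.0, 40.0],
})

# Same recipe as the activity's transformer; sparse_threshold=0 is an
# illustrative choice that guarantees a dense NumPy array as output
toy_transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['student']),
    remainder=StandardScaler(),
    sparse_threshold=0,
)

toy_out = toy_transformer.fit_transform(toy)

# Column 0 is a single 0/1 indicator for student == 'Yes';
# the remaining columns are standardized to mean 0 and unit variance
print(toy_out.shape)
print(toy_out[:, 0])
```

Because `drop='if_binary'` keeps one indicator column for a two-level feature, the output has three columns rather than four.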

[Back to top](#Index)

### Problem 1

#### Basic Pipeline

**10 Points**

Build a pipeline `base_pipe` with named steps `transformer` and `knn` that implements a `KNeighborsClassifier` with `n_neighbors = 10`. 

In [9]:
### GRADED

base_pipe = ''

### BEGIN SOLUTION
base_pipe = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
### END SOLUTION

# Answer check
base_pipe

In [10]:
### BEGIN HIDDEN TESTS
base_pipe_ = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
names = list(base_pipe.named_steps.keys())
names_ = list(base_pipe_.named_steps.keys())
#
#
#
assert names == names_
### END HIDDEN TESTS

[Back to top](#Index)

### Problem 2

#### Accuracy of KNN with 50% probability boundary

**10 Points**

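The default decision boundary for classification is 50%.  Fit `base_pipe` on the training data, determine its accuracy on the test data under this default setting, and assign it to `base_acc`.  Also count the false negatives here, that is, the test observations predicted `No` whose true label is `Yes`.  Assign this count to `base_fp`.  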

In [11]:
### GRADED

base_acc = ''
base_fp = ''

### BEGIN SOLUTION
base_pipe = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
base_pipe.fit(X_train, y_train)
base_acc = base_pipe.score(X_test, y_test)
preds = base_pipe.predict(X_test)
base_fp = 0
for i, j in zip(preds, y_test):
    if i == 'No':
        if j == 'Yes':
            base_fp += 1
### END SOLUTION

# Answer check
print(base_acc, base_fp)

In [12]:
### BEGIN HIDDEN TESTS
base_pipe_ = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
base_pipe_.fit(X_train, y_train)
base_acc_ = base_pipe_.score(X_test, y_test)
preds_ = base_pipe_.predict(X_test)
base_fp_ = 0
for i, j in zip(preds_, y_test):
    if i == 'No':
        if j == 'Yes':
            base_fp_ += 1
#
#
#
#
assert base_acc == base_acc_
assert base_fp == base_fp_
### END HIDDEN TESTS
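
The explicit loop used above to count false negatives can be cross-checked with `sklearn.metrics.confusion_matrix`. The following is a minimal sketch on made-up labels; `y_true_toy` and `y_pred_toy` are hypothetical arrays, not the activity's data, and `Yes` is treated as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six observations
y_true_toy = np.array(['No', 'No', 'Yes', 'Yes', 'No', 'Yes'])
y_pred_toy = np.array(['No', 'Yes', 'No', 'Yes', 'No', 'No'])

# Rows are true labels and columns are predicted labels, in the order
# given by `labels`, so ravel() yields tn, fp, fn, tp with 'Yes' positive
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=['No', 'Yes'])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

# Same count as the loop: predicted 'No' while the truth is 'Yes'
loop_fn = sum(1 for p, t in zip(y_pred_toy, y_true_toy)
              if p == 'No' and t == 'Yes')

print(tn, fp, fn, tp)  # 2 1 2 1
print(loop_fn)         # 2
```

Passing `labels` explicitly fixes the row and column order, which avoids mixing up false negatives and false positives when reading the matrix.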

[Back to top](#Index)

### Problem 3

#### Prediction probabilities

**10 Points**

As demonstrated in Video 12.5, your fitted estimator has a `predict_proba` method that outputs a probability for each class for every observation.  Assign the predicted probabilities for the test data, as an array, to `base_probs` below. Note that the first column represents the probability for `No`.  

In [13]:
### GRADED

base_probs = ''

### BEGIN SOLUTION
base_pipe = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
base_pipe.fit(X_train, y_train)
base_probs = base_pipe.predict_proba(X_test)
### END SOLUTION

# Answer check
pd.DataFrame(base_probs[:5], columns = ['p_no', 'p_yes'])

Unnamed: 0,p_no,p_yes
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0


In [14]:
### BEGIN HIDDEN TESTS
base_pipe_ = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
base_pipe_.fit(X_train, y_train)
base_probs_ = base_pipe_.predict_proba(X_test)
#
#
#
#
np.testing.assert_array_equal(base_probs, base_probs_)
### END HIDDEN TESTS
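
The column order of `predict_proba` follows the fitted estimator's `classes_` attribute, which is why the first column corresponds to `No` here. A minimal sketch on a made-up one-feature dataset (the names `X_toy`, `y_toy`, and `knn_toy` are hypothetical) shows how to look the column up by label instead of hard-coding its position:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, well-separated toy data: small values are 'No', large are 'Yes'
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes'])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# classes_ is sorted, and predict_proba columns follow the same order
print(knn_toy.classes_)

toy_probs = knn_toy.predict_proba(np.array([[0.05], [1.15]]))

# Look up the column for 'Yes' by label rather than by position
yes_col = list(knn_toy.classes_).index('Yes')
p_yes = toy_probs[:, yes_col]
print(p_yes)
```

With a `KNeighborsClassifier` using uniform weights, each probability is the fraction of the `n_neighbors` nearest training points that carry that label.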

[Back to top](#Index)

### Problem 4

#### A Stricter `default` estimation

**10 Points**

As discussed in the previous assignment, if you aim to minimize the number of predictions that miss actual defaults, you may consider raising the probability threshold required to predict `No`.  Accordingly, use your probabilities from the last problem to predict 'No' only if the probability of that label is higher than 70%.  Assign your new predictions as an array to `strict_preds`.  Count the false negatives (predicted `No`, actual `Yes`) and assign the count to `strict_fp` below.  

In [15]:
### GRADED

strict_fp = ''

### BEGIN SOLUTION
base_pipe = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
base_pipe.fit(X_train, y_train)
base_probs = base_pipe.predict_proba(X_test)
strict_preds = np.where(base_probs[:, 0] > .7, 'No', 'Yes')
strict_fp = 0
for i, j in zip(strict_preds, y_test):
    if i == 'No':
        if j == 'Yes':
            strict_fp += 1
### END SOLUTION

# Answer check
print(strict_fp)

44


In [16]:
### BEGIN HIDDEN TESTS
base_pipe_ = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
base_pipe_.fit(X_train, y_train)
base_probs_ = base_pipe_.predict_proba(X_test)
strict_preds_ = np.where(base_probs_[:, 0] > .7, 'No', 'Yes')
strict_fp_ = 0
for i, j in zip(strict_preds_, y_test):
    if i == 'No':
        if j == 'Yes':
            strict_fp_ += 1
#
#
#
#
np.testing.assert_array_equal(strict_preds,strict_preds_)
assert strict_fp == strict_fp_
### END HIDDEN TESTS
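
The thresholding step can be illustrated on a handful of made-up probabilities. This is a sketch only: `toy_probs` and `toy_true` are hypothetical values chosen to make the trade-off visible, with columns assumed to be ordered as `['No', 'Yes']`.

```python
import numpy as np

# Hypothetical predicted probabilities, columns ordered as ['No', 'Yes']
toy_probs = np.array([[1.0, 0.0],
                      [0.8, 0.2],
                      [0.6, 0.4],
                      [0.7, 0.3],
                      [0.3, 0.7]])
toy_true = np.array(['No', 'Yes', 'Yes', 'No', 'Yes'])

def count_errors(preds, truth):
    """Return (false negatives, false positives) with 'Yes' as the positive class."""
    fn = int(np.sum((preds == 'No') & (truth == 'Yes')))
    fp = int(np.sum((preds == 'Yes') & (truth == 'No')))
    return fn, fp

# Predict 'No' only when its probability is strictly above the threshold
default_preds = np.where(toy_probs[:, 0] > 0.5, 'No', 'Yes')
strict_toy_preds = np.where(toy_probs[:, 0] > 0.7, 'No', 'Yes')

print(count_errors(default_preds, toy_true))     # (2, 0)
print(count_errors(strict_toy_preds, toy_true))  # (1, 1)
```

Raising the bar for predicting `No` removes one false negative in this toy example but introduces a false positive, which is the trade-off the stricter threshold accepts.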

[Back to top](#Index)

### Problem 5

#### Minimizing False Negatives

**10 Points**

Consider a 50%, 70%, and 90% decision boundary for predicting "No".  Which of these minimizes the number of false negatives?  Assign your solution as an integer -- 50, 70, or 90 -- to `ans5` below.



In [17]:
### GRADED

ans5 = ''

### BEGIN SOLUTION
base_pipe = Pipeline([('transformer', transformer), ('knn', KNeighborsClassifier(n_neighbors=10))])
base_pipe.fit(X_train, y_train)
base_probs = base_pipe.predict_proba(X_test)
stricter_preds = np.where(base_probs[:, 0] > .9, 'No', 'Yes')
stricter_fp = 0
for i, j in zip(stricter_preds, y_test):
    if i == 'No':
        if j == 'Yes':
            stricter_fp += 1
ans5 = 90
### END SOLUTION

# Answer check
print(ans5)

90


In [18]:
### BEGIN HIDDEN TESTS
ans5_ = 90
#
#
#
#
assert ans5 == ans5_
### END HIDDEN TESTS
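
One way to compare the three boundaries side by side is to sweep over them in a loop. The sketch below uses a synthetic, imbalanced dataset from `make_classification` as a stand-in, so the counts it produces are not the activity's results; all names (`X_syn`, `y_syn`, `knn_syn`, `fn_counts`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption): roughly 10% of labels are 'Yes'
X_syn, y_syn = make_classification(n_samples=500, n_features=4,
                                   weights=[0.9, 0.1], random_state=0)
y_syn = np.where(y_syn == 1, 'Yes', 'No')
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          random_state=0, stratify=y_syn)

knn_syn = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

# Probability of 'No', located by label rather than by column position
p_no = knn_syn.predict_proba(X_te)[:, list(knn_syn.classes_).index('No')]

fn_counts = {}
for pct in (50, 70, 90):
    # Predict 'No' only when its probability exceeds the boundary
    preds = np.where(p_no > pct / 100, 'No', 'Yes')
    fn_counts[pct] = int(np.sum((preds == 'No') & (y_te == 'Yes')))

print(fn_counts)
```

Every observation predicted `No` at a higher boundary is also predicted `No` at a lower one, so the false negative count cannot increase as the boundary for `No` rises.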

[Back to top](#Index)

### Problem 6

#### Visualizing decision boundaries

**5 Points**

For this exercise, a visualization of the decision boundary using a synthetic dataset is created and plotted below.  Which of these would you choose for minimizing the number of false positives?  Assign your choice as an integer -- 1, 20, or 50 -- to `ans6` below.

<center>
    <img src = images/dbounds.png />
</center>

In [19]:
### GRADED

ans6 = ''

### BEGIN SOLUTION
ans6 = 1
### END SOLUTION

# Answer check
print(ans6)

1


In [20]:
### BEGIN HIDDEN TESTS
ans6_ = 1
#
#
#
#
assert ans6 == ans6_
### END HIDDEN TESTS