### Codio Assignment 12.4: KNN for Regression and Imputation

**Expected Time = 60 minutes** 

**Total Points = 50** 

This activity extends K Nearest Neighbors (KNN) to the problem of regression.  Although KNN regression is typically not among the highest-performing predictive models, it can be an effective approach to imputing missing data.  You will explore both of these ideas using scikit-learn's `KNeighborsRegressor` and `KNNImputer`.
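As a rough illustration of what KNN regression does, the sketch below fits a `KNeighborsRegressor` on a small made-up dataset (not the possum data) and checks that, with the default uniform weights, the prediction is the mean of the targets of the `n_neighbors` closest training points. The names `X_toy`, `y_toy`, and `manual` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data invented for illustration: one feature, one numeric target.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_toy = np.array([10.0, 12.0, 14.0, 50.0, 54.0])

knn = KNeighborsRegressor(n_neighbors=3)  # uniform weights by default
knn.fit(X_toy, y_toy)

# The three training points closest to x = 2.2 are x = 2, 3, and 1,
# whose targets are 12, 14, and 10.
pred = knn.predict(np.array([[2.2]]))[0]
manual = np.mean([12.0, 14.0, 10.0])

print(pred, manual)  # both are the average of the three neighbor targets
```

Because the prediction is an average of nearby observed targets, the model has no coefficients to inspect, a point that Problem 5 returns to.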

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import set_config
set_config(display="diagram")  # render pipelines as diagrams in notebook output

### The Data

To begin, you will use a dataset from the R language's DAAG package containing information on possums trapped at seven different sites in Australia.  It is loaded and displayed below.  Your regression task will be to predict the skull width (`skullw`) using the other features.  The training and testing data are created for you below as well.

In [2]:
possums_missing = pd.read_csv('data/possum.csv')

In [3]:
possums_missing.info() #note the missing values -- we will drop these to begin

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   case      104 non-null    int64  
 1   site      104 non-null    int64  
 2   Pop       104 non-null    object 
 3   sex       104 non-null    object 
 4   age       102 non-null    float64
 5   hdlngth   104 non-null    float64
 6   skullw    104 non-null    float64
 7   totlngth  104 non-null    float64
 8   taill     104 non-null    float64
 9   footlgth  103 non-null    float64
 10  earconch  104 non-null    float64
 11  eye       104 non-null    float64
 12  chest     104 non-null    float64
 13  belly     104 non-null    float64
dtypes: float64(10), int64(2), object(2)
memory usage: 11.5+ KB


In [4]:
possums = possums_missing.dropna()

In [5]:
possums.head()

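| | case | site | Pop | sex | age | hdlngth | skullw | totlngth | taill | footlgth | earconch | eye | chest | belly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 | 74.5 | 54.5 | 15.2 | 28.0 | 36.0 |
| 1 | 2 | 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 | 72.5 | 51.2 | 16.0 | 28.5 | 33.0 |
| 2 | 3 | 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 | 75.4 | 51.9 | 15.5 | 30.0 | 34.0 |
| 3 | 4 | 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 | 76.1 | 52.2 | 15.2 | 28.0 | 34.0 |
| 4 | 5 | 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 | 71.0 | 53.2 | 15.1 | 28.5 | 33.0 |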


In [6]:
X = possums.drop(['skullw', 'Pop'], axis = 1)

In [7]:
y = possums.skullw

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 34)

[Back to top](#-Index)

### Problem 1

#### A Basic Regression Pipeline

**10 Points**

Use the `make_column_transformer` function to define a transformer instance named `transformer`. Apply a `OneHotEncoder` transformation with `drop = 'if_binary'` to the `sex` column. Transform the `remainder` columns using `StandardScaler()`.


Next, build a basic regression pipeline with steps named `transformer` and `knn`. The first step encodes the categorical feature and scales the remaining columns, and the second feeds the result into a `KNeighborsRegressor` with all default settings. Assign your pipeline to `knn_pipe`.

Use the `fit` method to fit the pipeline to the training data, `X_train` and `y_train`.

Use the `predict` function on `knn_pipe` to make predictions on `X_test`. Assign the result to `preds`.

Finally, use the `mean_squared_error` function to compute the MSE between `y_test` and `preds`. Assign the results to `test_mse`.
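To see what a column transformer of this shape produces, here is a minimal sketch on a hypothetical two-column frame (`toy`, not the possum data). With `drop = 'if_binary'`, a two-category column such as `sex` becomes a single 0/1 column, and the `remainder` columns are standardized.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame used only for illustration.
toy = pd.DataFrame({'sex': ['m', 'f', 'f', 'm'],
                    'length': [1.0, 2.0, 3.0, 4.0]})

toy_transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['sex']),  # binary column -> one indicator
    remainder=StandardScaler())                  # all other columns are scaled

toy_out = toy_transformer.fit_transform(toy)

# First column: indicator for 'm' (the first category, 'f', is dropped).
# Second column: 'length' with zero mean and unit variance.
print(toy_out)
```

The encoded columns come first in the output and the remainder columns follow, which is worth keeping in mind when reading the transformed array.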

In [9]:
### GRADED

transformer = ''
knn_pipe = ''


test_mse = ''

### BEGIN SOLUTION
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['sex']),
                                     remainder = StandardScaler())
knn_pipe = Pipeline([('transformer', transformer), ('knn', KNeighborsRegressor())])
knn_pipe.fit(X_train, y_train)
preds = knn_pipe.predict(X_test)
test_mse = mean_squared_error(y_test, preds)
### END SOLUTION

# Answer check
print(test_mse)

9.236092307692314


[Back to top](#-Index)

### Problem 2

#### GridSearch the Pipeline

**10 Points**

Define a dictionary `params` with a single key, `'knn__n_neighbors'`, whose value is `range(1, len(y_test), 2)`.

Use the `GridSearchCV` function to perform a grid search on `knn_pipe` with `param_grid` equal to `params`. Assign the result to `knn_grid`.

Use the `fit` method to fit `knn_grid` to the training data, `X_train` and `y_train`.

Use the `best_params_` attribute of `knn_grid`, indexed with the key `'knn__n_neighbors'`, to retrieve the best number of neighbors. Assign the result to `best_k` below.
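The naming convention for pipeline parameters is `<step name>__<parameter name>`. The sketch below shows the pattern on synthetic data. The names `toy_pipe`, `toy_params`, and `toy_grid` are hypothetical, and the data is randomly generated rather than taken from the possum dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 2))
y_syn = X_syn[:, 0] + rng.normal(scale=0.1, size=60)

toy_pipe = Pipeline([('scale', StandardScaler()),
                     ('knn', KNeighborsRegressor())])

# 'knn' is the step name; 'n_neighbors' is the regressor's parameter.
toy_params = {'knn__n_neighbors': range(1, 16, 2)}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params)  # 5-fold CV by default
toy_grid.fit(X_syn, y_syn)

# best_params_ is a dictionary attribute, so it is indexed rather than called.
print(toy_grid.best_params_)
print(toy_grid.best_params_['knn__n_neighbors'])
```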

In [10]:
### GRADED

params = {}
knn_grid = ''
best_k = ''

### BEGIN SOLUTION
params = {'knn__n_neighbors': range(1, len(y_test), 2)}
knn_grid = GridSearchCV(knn_pipe, param_grid=params)
knn_grid.fit(X_train, y_train)
best_k = knn_grid.best_params_['knn__n_neighbors']
### END SOLUTION

# Answer check
print(best_k)

3


[Back to top](#-Index)

### Problem 3

#### Handling the missing data

**10 Points**

Earlier, we dropped the rows containing missing data.  To keep these rows for our model, we need to decide what values to fill in.  The `KNNImputer` uses the K Nearest Neighbors algorithm to determine these values.  The intuition is that similar observations can stand in for the missing entries.  The scikit-learn documentation describes the approach as follows:

```
Each sample's missing values are imputed using the mean value from `n_neighbors` nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
```
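A small example can make this concrete. The sketch below applies `KNNImputer` with `n_neighbors=2` to a tiny made-up array (`X_gap`, a hypothetical name) and fills each missing entry with the mean of that feature over the two nearest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small array with two missing entries, for illustration only.
X_gap = np.array([[1.0, 2.0, np.nan],
                  [3.0, 4.0, 3.0],
                  [np.nan, 6.0, 5.0],
                  [8.0, 8.0, 7.0]])

imputed = KNNImputer(n_neighbors=2).fit_transform(X_gap)

# Row 0: the two nearest rows with a third feature are rows 1 and 2,
# so the gap becomes mean(3, 5) = 4.
# Row 2: the two nearest rows with a first feature are rows 1 and 3,
# so the gap becomes mean(3, 8) = 5.5.
print(imputed)
```

Distances are computed only over the features that both rows have observed, which is why a row with a gap can still be compared with its neighbors.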



Use the `make_column_transformer` function to define a transformer instance named `transformer`. Apply a `OneHotEncoder` transformation with `drop = 'if_binary'` to the `sex` column. Transform the `remainder` columns using `StandardScaler()`.


Next, build a basic regression pipeline with steps `'transform'`, `'impute'`, and `'model'`. Assign `transformer` to `'transform'`, `KNNImputer()` to `'impute'`, and `KNeighborsRegressor()` to `'model'`. Assign your pipeline to `imputer_pipe`.

Use the `fit` method to fit the pipeline to `X_train_missing` and `y_train_missing`.

Use the `predict` function on `imputer_pipe` to make predictions on `X_test_missing`. Assign the result to `preds`.

Finally, use the `mean_squared_error` function to compute the MSE between `y_test_missing` and `preds`. Assign the results to `test_mse`.

In [11]:
X = possums_missing.drop(['skullw', 'Pop'], axis = 1)
y = possums_missing.skullw
X_train_missing, X_test_missing, y_train_missing, y_test_missing = train_test_split(X, y, random_state = 43)

In [12]:
### GRADED
imputer_pipe = ''

### BEGIN SOLUTION
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['sex']), 
                                     remainder = StandardScaler())
imputer_pipe = Pipeline([('transform', transformer), ('impute', KNNImputer()), ('model', KNeighborsRegressor())])
imputer_pipe.fit(X_train_missing, y_train_missing)
preds = imputer_pipe.predict(X_test_missing)
test_mse = mean_squared_error(y_test_missing, preds)
### END SOLUTION

# Answer check
print(test_mse)

3.4073538461538493


[Back to top](#-Index)

### Problem 4

#### Grid Searching the Pipeline

**10 Points**


Define a dictionary `params`. The keys of this dictionary will be `'model__n_neighbors'` and `'impute__n_neighbors'` with values  `range(1, len(y_test), 2)` and `[1, 2, 3, 4, 5]`, respectively.

Use the `GridSearchCV` function to perform a grid search on `imputer_pipe` with `param_grid` equal to `params`. Assign the result to `imputer_grid`.

Use the `fit` function to fit `imputer_grid` to `X_train_missing` and `y_train_missing`.

Use the `best_params_` attribute of `imputer_grid`. Assign the result to `best_ks` below.

Use the `predict` method on `imputer_grid` to calculate the predictions on `X_test_missing`. Assign the result to `preds`.

Finally, use the `mean_squared_error` function to calculate the MSE between `y_test_missing` and `preds`. Assign the mean squared error to `imputer_mse` below.

In [13]:
### GRADED

params = {}
imputer_grid = ''
best_ks = ''
imputer_mse = ''

### BEGIN SOLUTION
params = {'model__n_neighbors': range(1, len(y_test), 2),
         'impute__n_neighbors': [1, 2, 3, 4, 5]}
imputer_grid = GridSearchCV(imputer_pipe, param_grid=params)
imputer_grid.fit(X_train_missing, y_train_missing)
best_ks = imputer_grid.best_params_
preds = imputer_grid.predict(X_test_missing)
imputer_mse = mean_squared_error(y_test_missing, preds)
### END SOLUTION

# Answer check
print(best_ks)
print(imputer_mse)

{'impute__n_neighbors': 2, 'model__n_neighbors': 5}
3.4073538461538493


[Back to top](#-Index)

### Problem 5

#### Interpreting the model

**10 Points**

Unlike linear regression, a fitted KNN model has no coefficients that we can inspect to understand the effect of increasing or decreasing a feature.  All hope is not lost, however: you can simulate this behavior by running through different values of each feature and exploring how the predictions from the model change.

This is the idea behind partial dependence in scikit-learn, available through the `partial_dependence` function and the `PartialDependenceDisplay` class.  The display works in a similar manner to the confusion matrix display from earlier.  For a deeper discussion and example of partial dependence plots, see the user guide [here](https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence). Below, the partial dependence plots for six features are shown.  Based on these plots, which feature seems more important: `hdlngth` or `footlgth`?  Assign your response as a string to `ans5` below.

Again, the big idea is that the x-axis represents increasing values of the feature, and the y-axis represents the predicted value of the target.  The code that produced the plots is shown below, followed by the plot itself.

```python
from sklearn.inspection import PartialDependenceDisplay, partial_dependence

# `pipe` is assumed to be an already fitted pipeline and `X` the feature DataFrame
fig, ax = plt.subplots(figsize = (20, 6))
PartialDependenceDisplay.from_estimator(pipe, X, features = ['hdlngth', 'totlngth', 'footlgth', 'earconch', 'eye', 'chest'], ax = ax)
ax.set_title('Partial Dependence Plots for 6 Features')
```

<center>
    <img src = 'images/part_dep.png'/>
</center>
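The idea of running through different values of a feature can also be sketched by hand. The example below is a simplified illustration on synthetic data, not the code behind the plot above: the helper `manual_partial_dependence` is a hypothetical name, and scikit-learn's own implementation handles grids and options differently. For each grid value, the chosen feature is set to that value in every row, the model predicts, and the predictions are averaged.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: the target depends on feature 0 only.
rng = np.random.default_rng(1)
X_pd = rng.uniform(0, 10, size=(80, 2))
y_pd = 3.0 * X_pd[:, 0] + rng.normal(scale=0.5, size=80)

model = KNeighborsRegressor().fit(X_pd, y_pd)

def manual_partial_dependence(model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite the feature in every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(1, 9, 5)
curve_0 = manual_partial_dependence(model, X_pd, 0, grid)
curve_1 = manual_partial_dependence(model, X_pd, 1, grid)

# A feature that matters produces a curve that moves a lot; an
# unimportant feature produces a nearly flat curve.
print(curve_0.max() - curve_0.min(), curve_1.max() - curve_1.min())
```

Reading the plots in the same way, a steeper or wider-ranging curve suggests that the model's predictions depend more strongly on that feature.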


In [14]:
### GRADED

ans5 = ''

### BEGIN SOLUTION
ans5 = 'hdlngth'
### END SOLUTION

# Answer check
print(ans5)

hdlngth


In a similar way, you could use partial dependence plots to understand the features and their importance in KNN for classification, another situation where fitting the model does not give you parameters.  In the next module, you will explore a classification method called Logistic Regression, which does provide coefficients after fitting.