In [1]:
%reload_ext nb_black


# KNN Regression

### Warm-up 🥵

* How are weights implemented in K Nearest Neighbors when using the default in `sklearn` (aka the `'uniform'` option)?
* How are weights implemented in K Nearest Neighbors when using the `'distance'` option in `sklearn`?
  
* What type of supervised learning problem were we using KNN on up until now?
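
To check your answers to the first two questions, here is a minimal sketch on made-up 1-D data (the arrays, the query point, and `n_neighbors=2` are all illustrative assumptions, not part of the auto MPG data). It fits `KNeighborsRegressor` twice, once per `weights` option, and predicts at the same query point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up toy data: three points on a line with known targets.
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([10.0, 20.0, 40.0])

# Query point is closer to x=0 than to x=1; x=3 is not among the 2 nearest.
query = np.array([[0.25]])

# 'uniform': every one of the k neighbors counts equally (plain average).
uniform_knn = KNeighborsRegressor(n_neighbors=2, weights="uniform")
uniform_pred = uniform_knn.fit(X_toy, y_toy).predict(query)[0]

# 'distance': each neighbor is weighted by the inverse of its distance,
# so closer neighbors pull the prediction harder.
distance_knn = KNeighborsRegressor(n_neighbors=2, weights="distance")
distance_pred = distance_knn.fit(X_toy, y_toy).predict(query)[0]

print(f"uniform:  {uniform_pred}")
print(f"distance: {distance_pred}")
```

The two predictions differ because the query point is three times closer to one neighbor than to the other.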

## Data Import and General EDA 🚗

We'll be looking at the auto MPG dataset from UCI, which can be found [here](https://archive.ics.uci.edu/ml/datasets/auto+mpg).  From the description we see:

```
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
```

Our target variable will be `mpg`.

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"
names = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "year",
    "origin",
    "model",
]

# r"\s+" is a regex meaning "one or more whitespace characters". You can download
# the data from data_url and inspect it to see why this makes sense.
auto = pd.read_csv(data_url, sep=r"\s+", names=names)
auto.head()

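|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | model |
|---|-----|-----------|--------------|------------|--------|--------------|------|--------|-------|
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 | chevrolet chevelle malibu |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 | buick skylark 320 |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 | plymouth satellite |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 | amc rebel sst |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 | ford torino |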



Do some general EDA.

Display rows with any NAs in them.
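
One way to do this, sketched on a small made-up frame rather than on `auto` itself (the `toy` frame and its values are assumptions for illustration): build a boolean mask with `isna().any(axis=1)` and use it to filter rows.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data, with one NA in each of two columns.
toy = pd.DataFrame(
    {
        "mpg": [18.0, np.nan, 25.0],
        "horsepower": [130.0, 165.0, np.nan],
        "weight": [3504.0, 3693.0, 2200.0],
    }
)

# True for every row that has at least one missing value.
has_na = toy.isna().any(axis=1)
rows_with_na = toy[has_na]

# Per-column NA counts show which columns are responsible.
na_counts = toy.isna().sum()

print(rows_with_na)
print(na_counts)
```

The same two expressions should work unchanged with `auto` in place of `toy`.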

## Data Cleaning and Feature Engineering

### Handling NAs

Since our target variable is `mpg`, we probably don't want to impute it.  We should drop rows with NAs in the target unless we have some domain expertise that tells us otherwise.

The `horsepower` column is responsible for the rest of the NAs.  In practice, we might look up this info somehow, but that would probably take too much time for this demo.

So how do you want to handle these?  By dropping?  With imputation?  If imputing, what value should we impute?
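
As one possible answer (not the only reasonable one), the sketch below drops rows with a missing target and fills missing `horsepower` values with the column median. The `toy` frame is made up, and choosing the median is an assumption; the mean or a model-based imputation would be equally valid things to try.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data.
toy = pd.DataFrame(
    {
        "mpg": [18.0, np.nan, 25.0, 30.0],
        "horsepower": [130.0, 165.0, np.nan, 90.0],
    }
)

# Drop rows where the target is missing; we don't want to invent labels.
cleaned = toy.dropna(subset=["mpg"]).copy()

# Impute the remaining feature NAs with the median of the observed values.
hp_median = cleaned["horsepower"].median()
cleaned["horsepower"] = cleaned["horsepower"].fillna(hp_median)

print(cleaned)
```

To avoid leaking information from the test set, the imputation value would ideally be computed on the training split only, for example with `sklearn.impute.SimpleImputer` inside the pipeline.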

### Handling Categorical Variables

* From the description of the columns above, we can see that `origin` is a 'discrete' value.
* The `model` column is also a categorical variable.
* You can also see that `year` is 'discrete' from the description, but in practice we'll treat year variables as ordinal, so we don't need to make any changes.

Show the value counts for each of our categorical columns (don't include year).  How do we want to handle each?

The `model` column has perhaps too much variation.  How can we transform it into a higher-level categorical variable with fewer distinct levels?

After making that transformation, we still have some category levels with very little support (i.e. not a lot of observations to learn from).  Create an `'other'` category to hold these low-occurring category levels.
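
A minimal sketch of one way to do both steps, assuming the first word of each `model` string is the manufacturer. The `make` column name, the `min_count` threshold, and the toy rows are all hypothetical choices for illustration.

```python
import pandas as pd

# Made-up sample of model strings in the same style as the dataset.
toy = pd.DataFrame(
    {
        "model": [
            "ford torino",
            "ford pinto",
            "ford maverick",
            "amc rebel sst",
            "amc hornet",
            "buick skylark 320",
            "saab 99e",
        ]
    }
)

# Higher-level category: keep only the first word (assumed to be the make).
toy["make"] = toy["model"].str.split().str[0]

# Levels seen fewer than min_count times get lumped into 'other'.
min_count = 2  # hypothetical threshold; tune it from the value counts
counts = toy["make"].value_counts()
rare_levels = counts[counts < min_count].index
toy["make"] = toy["make"].where(~toy["make"].isin(rare_levels), "other")

print(toy["make"].value_counts())
```

On the real data the first word may contain misspellings or abbreviations of the same manufacturer, so it is worth inspecting the value counts before choosing a threshold.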

## Modeling

Perform a train/test split with 20% of the data in the test set.

In [None]:
X = auto.drop(columns=["mpg", "model"])
y = auto["mpg"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

We're going to build... a modeling pipeline for KNN.

In [None]:
cat_cols = ____  # What categorical columns do we have?
drop_cats = ____  # Which categories from those columns do we want to drop?

# The rest are numeric
num_cols = ____  # What numeric columns do we have?

In [None]:
preprocessing = ColumnTransformer(
    [
        ("scale", ____, ____),
        ("one_hot_encode", ____, ____),
    ]
)

In [None]:
pipeline = Pipeline(
    [
        # ("name of step", sklearn object with a fit method)
        ("preprocessing", ____),
        ("knn", ____),
    ]
)
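
To see how a `ColumnTransformer` and a `Pipeline` fit together, here is a self-contained sketch on made-up data. The toy columns, the `handle_unknown="ignore"` setting, and `n_neighbors=2` are illustrative assumptions and not necessarily the choices you should make for the blanks above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric feature and one categorical feature.
toy_X = pd.DataFrame(
    {
        "weight": [3500.0, 3400.0, 2200.0, 2100.0, 2600.0, 2500.0],
        "origin": ["usa", "usa", "japan", "japan", "europe", "europe"],
    }
)
toy_y = pd.Series([18.0, 17.0, 30.0, 32.0, 25.0, 26.0])

# Each tuple is (step name, transformer, columns to apply it to).
toy_preprocessing = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["weight"]),
        ("one_hot_encode", OneHotEncoder(handle_unknown="ignore"), ["origin"]),
    ]
)

# The pipeline runs preprocessing first, then feeds the result to KNN.
toy_pipeline = Pipeline(
    [
        ("preprocessing", toy_preprocessing),
        ("knn", KNeighborsRegressor(n_neighbors=2)),
    ]
)

toy_pipeline.fit(toy_X, toy_y)
toy_pred = toy_pipeline.predict(toy_X.iloc[[0]])
print(toy_pred)
```

Scaling matters for KNN because distances are otherwise dominated by whichever feature has the largest numeric range.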

In [None]:
pipeline.fit(____, ____)

How does the model perform?  What metric is being shown? What other metrics might we consider?

In [None]:
pipeline.score(____, ____)

In [None]:
pipeline.score(____, ____)
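
For a regressor, `.score` returns the coefficient of determination, R². As a sketch of other metrics worth considering, the block below computes R², mean absolute error, and root mean squared error with `sklearn.metrics` on made-up true and predicted values (the numbers are assumptions chosen only to make the arithmetic easy to follow).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, in mpg.
y_true = np.array([20.0, 30.0, 40.0])
y_pred = np.array([22.0, 28.0, 40.0])

# R^2: share of target variance explained (unitless, 1.0 is perfect).
r2 = r2_score(y_true, y_pred)

# MAE: average absolute miss, in the target's own units.
mae = mean_absolute_error(y_true, y_pred)

# RMSE: like MAE but penalizes large misses more heavily.
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(f"R^2:  {r2:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

MAE and RMSE are often easier to explain than R² because they are expressed in mpg.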

Optimize the value of k in your pipeline.

In [None]:
grid = {"knn__n_neighbors": ____}

In [None]:
pipeline_cv = GridSearchCV(____, ____, verbose=1)
pipeline_cv.fit(X_train, y_train)

What is the best value of k?

In [None]:
pipeline_cv.____
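
As a reference for what a fitted search exposes, here is a self-contained sketch on synthetic data. The data, the candidate values of k, and `cv=5` are assumptions for illustration. The grid key uses the `<step name>__<parameter>` convention, and a fitted `GridSearchCV` stores the winning setting in `best_params_` and its mean cross-validated score in `best_score_`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends linearly on the first feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(60, 2))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 0.5, size=60)

toy_pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("knn", KNeighborsRegressor()),
    ]
)

# "knn__n_neighbors" targets the n_neighbors parameter of the "knn" step.
toy_grid = {"knn__n_neighbors": [1, 3, 5, 7]}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=5)
toy_search.fit(X_toy, y_toy)

best_k = toy_search.best_params_["knn__n_neighbors"]
best_cv_r2 = toy_search.best_score_

print(f"best k: {best_k}")
print(f"mean CV R^2 at best k: {best_cv_r2:.3f}")
```

Because the search refits the best pipeline on all of the training data by default, calling `.predict` or `.score` on the search object uses that refit model.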

How does the model perform?

In [None]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"train_score: {train_score}")
print(f"test_score: {test_score}")