In [1]:
%reload_ext nb_black

# KNN Regression

### Warm-up 🥵

* How are weights implemented in K Nearest Neighbors when using the default in `sklearn` (aka the `'uniform'` option)?
* How are weights implemented in K Nearest Neighbors when using the `'distance'` option in `sklearn`?
  
* What type of supervised learning problem were we using KNN on up until now?
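As a reference point for the first two warm-up questions, here is a minimal sketch on made-up toy data (the arrays and the query point are assumptions, not part of the auto dataset). With `weights="uniform"` each of the k neighbors contributes equally to the prediction; with `weights="distance"` each neighbor is weighted by the inverse of its distance to the query point. The sketch uses `KNeighborsRegressor`, so the prediction is a plain or weighted mean of the neighbors' targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D training set (hypothetical values, chosen so the arithmetic is easy to follow).
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([10.0, 20.0, 40.0])
x_query = np.array([[2.0]])  # distances to the training points: 2, 1, 1

uniform_knn = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

uniform_pred = uniform_knn.predict(x_query)[0]
distance_pred = distance_knn.predict(x_query)[0]

# The same two predictions computed by hand.
dists = np.abs(X_toy[:, 0] - x_query[0, 0])
manual_uniform = y_toy.mean()                                    # plain average
manual_distance = np.sum(y_toy / dists) / np.sum(1.0 / dists)    # inverse-distance weighted average

print(uniform_pred, manual_uniform)
print(distance_pred, manual_distance)
```

The farthest point (x = 0) pulls the uniform prediction down more than it pulls the distance-weighted one, because its weight is only half that of the two closer points.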

## Data Import and General EDA 🚗

We'll be looking at the Auto MPG dataset from UCI, which can be found [here](https://archive.ics.uci.edu/ml/datasets/auto+mpg).  From the description we see:

```
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
```

Our target variable will be `mpg`.

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"
names = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "year",
    "origin",
    "model",
]

# r"\s+" is a regex meaning "one or more whitespace characters". You can download the
# data from data_url to inspect it and see why this makes sense.
auto = pd.read_csv(data_url, sep=r"\s+", names=names)
auto.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | model |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 | chevrolet chevelle malibu |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 | buick skylark 320 |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 | plymouth satellite |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 | amc rebel sst |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 | ford torino |


Do some general EDA.

In [3]:
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     406 non-null    float64
 2   displacement  406 non-null    float64
 3   horsepower    400 non-null    float64
 4   weight        406 non-null    float64
 5   acceleration  406 non-null    float64
 6   year          406 non-null    float64
 7   origin        406 non-null    float64
 8   model         406 non-null    object 
dtypes: float64(8), object(1)
memory usage: 28.7+ KB


In [4]:
auto.isna().sum()

mpg             8
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
year            0
origin          0
model           0
dtype: int64

Display rows with any NAs in them.

In [5]:
auto[auto["mpg"].isna()]

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | model |
|---|---|---|---|---|---|---|---|---|---|
| 10 | NaN | 4.0 | 133.0 | 115.0 | 3090.0 | 17.5 | 70.0 | 2.0 | citroen ds-21 pallas |
| 11 | NaN | 8.0 | 350.0 | 165.0 | 4142.0 | 11.5 | 70.0 | 1.0 | chevrolet chevelle concours (sw) |
| 12 | NaN | 8.0 | 351.0 | 153.0 | 4034.0 | 11.0 | 70.0 | 1.0 | ford torino (sw) |
| 13 | NaN | 8.0 | 383.0 | 175.0 | 4166.0 | 10.5 | 70.0 | 1.0 | plymouth satellite (sw) |
| 14 | NaN | 8.0 | 360.0 | 175.0 | 3850.0 | 11.0 | 70.0 | 1.0 | amc rebel sst (sw) |
| 17 | NaN | 8.0 | 302.0 | 140.0 | 3353.0 | 8.0 | 70.0 | 1.0 | ford mustang boss 302 |
| 39 | NaN | 4.0 | 97.0 | 48.0 | 1978.0 | 20.0 | 71.0 | 2.0 | volkswagen super beetle 117 |
| 367 | NaN | 4.0 | 121.0 | 110.0 | 2800.0 | 15.4 | 81.0 | 2.0 | saab 900s |


In [6]:
auto[auto["horsepower"].isna()]

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | model |
|---|---|---|---|---|---|---|---|---|---|
| 38 | 25.0 | 4.0 | 98.0 | NaN | 2046.0 | 19.0 | 71.0 | 1.0 | ford pinto |
| 133 | 21.0 | 6.0 | 200.0 | NaN | 2875.0 | 17.0 | 74.0 | 1.0 | ford maverick |
| 337 | 40.9 | 4.0 | 85.0 | NaN | 1835.0 | 17.3 | 80.0 | 2.0 | renault lecar deluxe |
| 343 | 23.6 | 4.0 | 140.0 | NaN | 2905.0 | 14.3 | 80.0 | 1.0 | ford mustang cobra |
| 361 | 34.5 | 4.0 | 100.0 | NaN | 2320.0 | 15.8 | 81.0 | 2.0 | renault 18i |
| 382 | 23.0 | 4.0 | 151.0 | NaN | 3035.0 | 20.5 | 82.0 | 1.0 | amc concord dl |


In [7]:
auto.describe()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 406.0 | 406.0 | 400.0 | 406.0 | 406.0 | 406.0 | 406.0 |
| mean | 23.514573 | 5.475369 | 194.779557 | 105.0825 | 2979.413793 | 15.519704 | 75.921182 | 1.568966 |
| std | 7.815984 | 1.71216 | 104.922458 | 38.768779 | 847.004328 | 2.803359 | 3.748737 | 0.797479 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 105.0 | 75.75 | 2226.5 | 13.7 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 151.0 | 95.0 | 2822.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 302.0 | 130.0 | 3618.25 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


## Data Cleaning and Feature Engineering

### Handling NAs

Since `mpg` is our target variable, we probably don't want to impute it.  We should drop rows with a missing target unless we have some domain expertise that tells us otherwise.

In [8]:
auto[auto.isna().any(axis=1)]

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | model |
|---|---|---|---|---|---|---|---|---|---|
| 10 | NaN | 4.0 | 133.0 | 115.0 | 3090.0 | 17.5 | 70.0 | 2.0 | citroen ds-21 pallas |
| 11 | NaN | 8.0 | 350.0 | 165.0 | 4142.0 | 11.5 | 70.0 | 1.0 | chevrolet chevelle concours (sw) |
| 12 | NaN | 8.0 | 351.0 | 153.0 | 4034.0 | 11.0 | 70.0 | 1.0 | ford torino (sw) |
| 13 | NaN | 8.0 | 383.0 | 175.0 | 4166.0 | 10.5 | 70.0 | 1.0 | plymouth satellite (sw) |
| 14 | NaN | 8.0 | 360.0 | 175.0 | 3850.0 | 11.0 | 70.0 | 1.0 | amc rebel sst (sw) |
| 17 | NaN | 8.0 | 302.0 | 140.0 | 3353.0 | 8.0 | 70.0 | 1.0 | ford mustang boss 302 |
| 38 | 25.0 | 4.0 | 98.0 | NaN | 2046.0 | 19.0 | 71.0 | 1.0 | ford pinto |
| 39 | NaN | 4.0 | 97.0 | 48.0 | 1978.0 | 20.0 | 71.0 | 2.0 | volkswagen super beetle 117 |
| 133 | 21.0 | 6.0 | 200.0 | NaN | 2875.0 | 17.0 | 74.0 | 1.0 | ford maverick |
| 337 | 40.9 | 4.0 | 85.0 | NaN | 1835.0 | 17.3 | 80.0 | 2.0 | renault lecar deluxe |


The `horsepower` column is responsible for the rest of the NAs.  In practice, we might look up this info somehow, but that would probably take too much time for this demo.

So how do you want to handle these?  By dropping?  With imputation?  If imputing, what should we impute?

In [9]:
auto = auto.dropna(subset=["mpg"])

In [10]:
# median imputation
auto['horsepower'] = auto['horsepower'].fillna(auto['horsepower'].median())


In [11]:
auto.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
year            0
origin          0
model           0
dtype: int64
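The imputation cell above computes the median over every remaining row, including rows that will later land in the test set. One alternative, sketched here on a hypothetical toy frame rather than the real data, is to learn the median from the training rows only with `SimpleImputer` and reuse it on the test rows. The frame `toy` and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the auto data: one feature with gaps, plus the target.
toy = pd.DataFrame(
    {
        "horsepower": [130.0, np.nan, 150.0, 95.0, np.nan, 110.0, 88.0, 165.0],
        "mpg": [18.0, 25.0, 16.0, 27.0, 21.0, 24.0, 30.0, 15.0],
    }
)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["horsepower"]], toy["mpg"], test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)  # median is learned from training rows only
X_te_filled = imputer.transform(X_te)      # the same training median fills test rows

train_median = imputer.statistics_[0]
print(train_median)
```

The same imputer could also be placed as a step inside a `Pipeline`, so that cross-validation refits it on each training fold.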

### Handling Categorical Variables

* From the description of the columns above, we can see that `origin` is a 'discrete' value.
* The `model` column is also a categorical variable.
* You can also see that `year` is 'discrete' from the description, but in practice we'll treat year as ordinal, so we don't need to make any changes.

Show the value counts for each of our categorical columns (don't include year).  How do we want to handle each?

In [12]:
auto["origin"].value_counts()

1.0    249
3.0     79
2.0     70
Name: origin, dtype: int64

In [13]:
auto["model"].value_counts()

ford pinto           6
amc matador          5
ford maverick        5
toyota corolla       5
amc gremlin          4
                    ..
dodge dart custom    1
mercury marquis      1
ford pinto (sw)      1
amc pacer d/l        1
dodge st. regis      1
Name: model, Length: 305, dtype: int64

The `model` column has perhaps too much variation (305 distinct values across 398 rows).  How can we transform it into a higher-level categorical variable with fewer levels?

After making that transformation, we still have some category levels with very little support (i.e. not a lot of observations to learn from).  Create an `'other'` category to hold these rarely occurring category levels.

In [14]:
auto["make"] = auto["model"].str.split(" ").str[0]

In [15]:
auto["make"].value_counts()
# Note: several makes are misspelled or abbreviated (e.g. 'chevroelt', 'vw') and still need to be fixed

ford             51
chevrolet        43
plymouth         31
amc              28
dodge            28
toyota           25
datsun           23
buick            17
pontiac          16
volkswagen       15
honda            13
mercury          11
oldsmobile       10
mazda            10
peugeot           8
fiat              8
audi              7
volvo             6
vw                6
chrysler          6
renault           5
opel              4
saab              4
subaru            4
chevy             3
cadillac          2
maxda             2
mercedes-benz     2
bmw               2
toyouta           1
vokswagen         1
mercedes          1
chevroelt         1
triumph           1
capri             1
nissan            1
hi                1
Name: make, dtype: int64
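The counts above include several spellings of the same make. One way to merge them, sketched here on a small stand-in Series, is an explicit replacement mapping. The mapping `make_fixes` is an assumption based on reading the counts by eye, it is not applied in the cells that follow, and ambiguous entries such as `hi` and `capri` are deliberately left alone.

```python
import pandas as pd

# Small stand-in for auto["make"] containing some of the variants listed above.
make = pd.Series(
    ["ford", "vw", "vokswagen", "chevy", "chevroelt", "maxda", "toyouta", "toyota"]
)

# Assumed corrections; extend or adjust after inspecting the real data.
make_fixes = {
    "vw": "volkswagen",
    "vokswagen": "volkswagen",
    "chevy": "chevrolet",
    "chevroelt": "chevrolet",
    "maxda": "mazda",
    "toyouta": "toyota",
}

# With a dict, Series.replace swaps exact matching values and leaves the rest untouched.
make_clean = make.replace(make_fixes)
print(make_clean.value_counts())
```

If a mapping like this were applied before the `'other'` step, some makes would gain support (for example `vw` would fold into `volkswagen`), so the counts shown below would change.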

In [16]:
make_counts = auto['make'].value_counts()
above_thresh_counts = make_counts[make_counts > 4]
keep_makes = above_thresh_counts.index


In [40]:
keep_makes

Index(['ford', 'chevrolet', 'plymouth', 'amc', 'dodge', 'toyota', 'datsun',
       'buick', 'pontiac', 'volkswagen', 'honda', 'mercury', 'oldsmobile',
       'mazda', 'peugeot', 'fiat', 'audi', 'volvo', 'vw', 'chrysler',
       'renault'],
      dtype='object')

In [17]:
make_filter = auto["make"].isin(keep_makes)
auto.loc[~make_filter, "make"] = "other"
auto["make"].value_counts()

ford          51
chevrolet     43
other         31
plymouth      31
amc           28
dodge         28
toyota        25
datsun        23
buick         17
pontiac       16
volkswagen    15
honda         13
mercury       11
mazda         10
oldsmobile    10
peugeot        8
fiat           8
audi           7
volvo          6
chrysler       6
vw             6
renault        5
Name: make, dtype: int64

In [18]:
auto.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | model | make |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 | chevrolet chevelle malibu | chevrolet |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 | buick skylark 320 | buick |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 | plymouth satellite | plymouth |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 | amc rebel sst | amc |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 | ford torino | ford |


## Modeling

Perform a train/test split with 20% of the data in the test set.

In [19]:
X = auto.drop(columns=["mpg", "model", "displacement", "weight"])
y = auto["mpg"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

We're going to build... a modeling pipeline for KNN.

In [20]:
cat_cols = [
    'make',
    'origin',
]
# Reference level to drop from each categorical column, in the same order as cat_cols
drop_cats = ['other', 1.0]

# The rest are numeric
num_cols = ["cylinders", "horsepower", "acceleration", "year"]


In [24]:
preprocessing = ColumnTransformer(
    [
        ("scale", StandardScaler(), num_cols),
        ("one_hot_encode", OneHotEncoder(drop=drop_cats), cat_cols),
    ]
)

In [27]:
pipeline = Pipeline(
    [
        # ("name of step", sklearn object with a fit method)
        ("preprocessing", preprocessing),
        ("knn", KNeighborsRegressor()),
    ]
)

In [28]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('scale',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True),
                                                  ['cylinders', 'horsepower',
                                                   'acceleration', 'year']),
                                                 ('one_hot_encode',
                                                  OneHotEncoder(categories='auto',
                                                                drop=['other',
                                                                      1.0],
  

How does the model perform?  What metric is being shown? What other metrics might we consider?

In [29]:
pipeline.score(X_train, y_train)

0.864233642763084

In [30]:
pipeline.score(X_test, y_test)

0.8763469454867936
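For a regressor, `score` reports the coefficient of determination, R². Error metrics in the units of the target, such as MAE and RMSE, are often easier to interpret alongside it. A minimal sketch follows; `y_true` reuses the first five `mpg` values from the data, while `y_pred` is a set of made-up predictions used purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([18.0, 15.0, 18.0, 16.0, 17.0])  # first five mpg values
y_pred = np.array([17.0, 16.0, 18.5, 15.0, 17.5])  # hypothetical predictions

r2 = r2_score(y_true, y_pred)                       # what .score() returns for a regressor
mae = mean_absolute_error(y_true, y_pred)           # average absolute error, in mpg
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error, in mpg

print(f"R^2: {r2:.3f}  MAE: {mae:.3f}  RMSE: {rmse:.3f}")
```

With the fitted pipeline, the same functions could be called as, for example, `mean_absolute_error(y_test, pipeline.predict(X_test))`.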

Optimize the value of k in your pipeline.

In [36]:
grid = {
    "knn__n_neighbors": np.arange(1, 100, 5),
    "knn__weights": ["uniform", "distance"],
}
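The grid keys follow scikit-learn's `<step name>__<parameter>` convention: `knn__n_neighbors` addresses the `n_neighbors` parameter of the pipeline step named `"knn"`. The sketch below uses a smaller stand-in pipeline (the names `pipe` and `"scale"` are assumptions for illustration) to list the addressable parameters and set two of them.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline with the same final step name as the one in this notebook.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])

# Every tunable parameter of the "knn" step appears with the "knn__" prefix.
knn_params = sorted(name for name in pipe.get_params() if name.startswith("knn__"))
print(knn_params)

# set_params uses the same naming, which is what GridSearchCV relies on.
pipe.set_params(knn__n_neighbors=11, knn__weights="distance")
print(pipe.named_steps["knn"])
```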

In [37]:
pipeline_cv = GridSearchCV(pipeline, grid, verbose=1)
pipeline_cv.fit(X_train, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    2.4s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessing',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('scale',
                                                                         StandardScaler(copy=True,
                                                                                        with_mean=True,
                                                                                        with_std=True),
                                                                         ['cylinders',
                                                                        

What is the best value of k?

In [38]:
pipeline_cv.best_params_

{'knn__n_neighbors': 21, 'knn__weights': 'distance'}

How does the model perform?

In [39]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"train_score: {train_score}")
print(f"test_score: {test_score}")

train_score: 0.9998746075136313
test_score: 0.8608755154632207
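The near-perfect train score is most likely a side effect of `weights="distance"`: when a training row is scored, it is its own nearest neighbor at distance zero, so its own target dominates the prediction. The train score therefore says little about generalization here; the cross-validated score (available on a fitted search as `pipeline_cv.best_score_`) is a fairer comparison. The sketch below reproduces the effect on synthetic data, which is an assumption made for illustration and unrelated to the auto dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn[:, 0] + rng.normal(scale=0.5, size=200)

knn = KNeighborsRegressor(n_neighbors=21, weights="distance").fit(X_syn, y_syn)

# Scoring on the rows the model was fit on: each row finds itself at distance 0.
train_r2 = knn.score(X_syn, y_syn)

# Scoring on held-out folds: no row can be its own neighbor.
cv_r2 = cross_val_score(knn, X_syn, y_syn, cv=5).mean()

print(f"train R^2: {train_r2:.4f}")
print(f"5-fold CV R^2: {cv_r2:.4f}")
```

In the notebook's run the train score falls slightly short of 1, which would be consistent with a few cars sharing identical feature values but different `mpg`.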

