# Exercise 06: k-nearest neighbors

Welcome to the sixth exercise for Applied Machine Learning. 

Your objectives for this session are to: 
- understand and apply feature scaling as an extra preprocessing step, 
- implement `KNeighborsRegressor`, and
- tune model hyperparameters with `GridSearchCV`.

---------------------

### Part 1: Data exploration and feature engineering

We will again be looking at `HomesSoldHellerup.csv`, so the data is already familiar to you if you did the exercises over the last two weeks. Instead of using parametric models to predict `Price`, today we'll use *instance-based learning*.

Let's start by importing our libraries and exploring the dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

Read in the dataset and inspect the dimensions.

In [2]:
homes_df = pd.read_csv('HomesSoldHellerup.csv', sep=';')
homes_df.shape


(2160, 10)

Take a look at the column names and the first five instances.

In [3]:
homes_df.head()

| | Road name | Road Number | Type | m2 | Build Year | ZipCode | City | Date of Sale | Type of Sale | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tuborgvej | 54 | Lejlighed | 54 | 1932 | 2900 | Hellerup | 20-07-15 | Alm. Salg | 1700000 |
| 1 | Tuborgvej | 54 | Lejlighed | 87 | 1932 | 2900 | Hellerup | 12-05-15 | Alm. Salg | 2815000 |
| 2 | Tuborgvej | 54 | Lejlighed | 63 | 1932 | 2900 | Hellerup | 29-12-10 | Alm. Salg | 1575000 |
| 3 | Tuborgvej | 54 | Lejlighed | 54 | 1932 | 2900 | Hellerup | 10-04-12 | Alm. Salg | 1340000 |
| 4 | Tuborgvej | 54 | Lejlighed | 63 | 1932 | 2900 | Hellerup | 04-02-12 | Alm. Salg | 1435000 |


The `Date of Sale` column records the exact date each home was sold. It seems unlikely that many homes would sell on the exact same date, and the date format can be tricky to work with.

Let's see how many unique values there are in `Date of Sale`.

In [4]:
homes_df['Date of Sale'].nunique()

1218

1218 unique dates across the 2160 total instances suggests that `Date of Sale` is a *high-cardinality* attribute: it contains a large number of unique values relative to the total number of instances, which can make it challenging for machine learning models to use effectively.
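One rough way to quantify cardinality is the share of unique values per column. For `Date of Sale` that is 1218 / 2160, or roughly 0.56. The sketch below computes the same ratio on a tiny made-up frame; `toy_df` and `cardinality_ratio` are illustrative names, not part of the exercise.

```python
import pandas as pd

# Tiny made-up frame: one constant column, one column where every value is unique
toy_df = pd.DataFrame({
    'City': ['Hellerup', 'Hellerup', 'Hellerup', 'Hellerup'],
    'Date of Sale': ['20-07-15', '12-05-15', '29-12-10', '10-04-12'],
})

# Share of unique values per column: close to 1 means high cardinality
cardinality_ratio = toy_df.nunique() / len(toy_df)
print(cardinality_ratio)

# The same ratio for `Date of Sale` in the full dataset, from the counts above
print(1218 / 2160)
```

Running `homes_df.nunique() / len(homes_df)` would give the same view for every column of the real dataset.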

But, there could still be useful information in the `Date of Sale`. For example, maybe homes sell for more in summer than in winter, or maybe homes sell for more in recent years relative to older years. So, to make use of such information, we could do some feature engineering.

Use the code below to create a `Year of Sale` feature by extracting the year from the `Date of Sale` attribute.

In [5]:
homes_df['Year of Sale'] = homes_df['Date of Sale'].apply(lambda x: '20' + x.split('-')[-1])
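The string-splitting approach above assumes every sale happened in the 2000s. As an alternative sketch, `pd.to_datetime` can parse the dates directly, which also makes the month available for seasonal features. The format string `%d-%m-%y` is an assumption based on the sample rows (day, month, two-digit year), and `sample_dates` is a stand-in for the real column.

```python
import pandas as pd

# Stand-in for homes_df['Date of Sale'], using the first five rows shown above
sample_dates = pd.Series(['20-07-15', '12-05-15', '29-12-10', '10-04-12', '04-02-12'])

# Assumed format: day-month-two-digit-year
parsed = pd.to_datetime(sample_dates, format='%d-%m-%y')

year_of_sale = parsed.dt.year    # integer year, no string handling needed
month_of_sale = parsed.dt.month  # could capture summer-vs-winter effects

print(year_of_sale.tolist())
print(month_of_sale.tolist())
```

Because `.dt.year` already returns integers, this route would not need a separate `astype` conversion.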

Check that there's the new `Year of Sale` column.

In [6]:
homes_df.head()

| | Road name | Road Number | Type | m2 | Build Year | ZipCode | City | Date of Sale | Type of Sale | Price | Year of Sale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tuborgvej | 54 | Lejlighed | 54 | 1932 | 2900 | Hellerup | 20-07-15 | Alm. Salg | 1700000 | 2015 |
| 1 | Tuborgvej | 54 | Lejlighed | 87 | 1932 | 2900 | Hellerup | 12-05-15 | Alm. Salg | 2815000 | 2015 |
| 2 | Tuborgvej | 54 | Lejlighed | 63 | 1932 | 2900 | Hellerup | 29-12-10 | Alm. Salg | 1575000 | 2010 |
| 3 | Tuborgvej | 54 | Lejlighed | 54 | 1932 | 2900 | Hellerup | 10-04-12 | Alm. Salg | 1340000 | 2012 |
| 4 | Tuborgvej | 54 | Lejlighed | 63 | 1932 | 2900 | Hellerup | 04-02-12 | Alm. Salg | 1435000 | 2012 |


What we can also see in the data shown above is that there are different types of variables in the dataset. For example, `Road name`, `Type`, `City`, and `Type of Sale` are all categorical variables, whereas `m2`, `Build Year`, and `Year of Sale` are numeric. 

Use the code below to inspect the variable type assigned to each column. 

In [7]:
homes_df.dtypes

Road name       object
Road Number     object
Type            object
m2               int64
Build Year       int64
ZipCode          int64
City            object
Date of Sale    object
Type of Sale    object
Price            int64
Year of Sale    object
dtype: object

`Year of Sale`, the feature we just added ourselves, is currently considered as an `object` (i.e., a categorical variable). Use the code below to change it to `int64` (i.e., a continuous, numeric variable).

In [8]:
homes_df['Year of Sale'] = homes_df['Year of Sale'].astype('int64')

Now let's look at some descriptive statistics for the continuous variables in the dataset.

In [9]:
homes_df.describe()

| | m2 | Build Year | ZipCode | Price | Year of Sale |
|---|---|---|---|---|---|
| count | 2160.0 | 2160.0 | 2160.0 | 2160.0 | 2160.0 |
| mean | 137.58287 | 1941.846296 | 2920.328241 | 5214775.0 | 2012.676389 |
| std | 67.275942 | 33.95288 | 309.679072 | 4470072.0 | 1.70072 |
| min | 37.0 | 1850.0 | 2900.0 | 133333.0 | 2010.0 |
| 25% | 86.0 | 1919.0 | 2900.0 | 2450000.0 | 2011.0 |
| 50% | 124.0 | 1933.0 | 2900.0 | 4000000.0 | 2013.0 |
| 75% | 170.0 | 1960.0 | 2900.0 | 6600000.0 | 2014.0 |
| max | 570.0 | 2015.0 | 8800.0 | 50000000.0 | 2015.0 |


In the output above, notice how the attributes have different scales. For example, `m2` has a mean of 138 and ranges between 37 and 570, whereas `Build Year` has a mean of 1942 and ranges between 1850 and 2015. That makes sense given what the variables represent, but it could affect the performance of an instance-based learning algorithm, which relies on measures of similarity or distance.
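To see why scale matters for distance-based methods, here is a small sketch with three hypothetical homes (made-up rows whose values fall within the ranges shown above). The columns are `m2`, `Build Year`, and `ZipCode`, and the distance is plain Euclidean distance.

```python
import numpy as np

# Columns: m2, Build Year, ZipCode (hypothetical homes, not rows from the dataset)
home_a = np.array([54, 1932, 2900])
home_b = np.array([570, 1932, 2900])  # ten times larger, same year and zip code
home_c = np.array([54, 1932, 8800])   # identical size and year, different zip code

# Euclidean distance on the raw, unscaled values
dist_ab = np.linalg.norm(home_a - home_b)
dist_ac = np.linalg.norm(home_a - home_c)

print("a vs. b (size differs):", dist_ab)
print("a vs. c (zip code differs):", dist_ac)
```

On the raw values, the zip code difference dwarfs even the largest possible difference in `m2`, so the feature with the widest numeric range ends up deciding which homes count as "neighbors".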

### Part 2: Feature scaling

Now that we've taken a look at our data and done a little feature engineering, it's time to make a train-test split. Then, once we've put aside a test set, we can do some feature scaling to address the issue of varying scales to suit an instance-based learning algorithm like `KNeighborsRegressor`. 

# <font color='red'>TASK 1</font>

Define your feature matrix `X` and target `y`. 

Create `X` with all the available attributes, except for `Date of Sale`, from which we already extracted the year to create the new attribute, `Year of Sale`. 

`y` should be the `Price` of a home.

In [15]:
# your code here
X = pd.get_dummies(homes_df.drop(columns=['Date of Sale', 'Price']))
y = homes_df['Price']
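`KNeighborsRegressor` needs numeric inputs, so categorical columns have to be encoded before fitting. One common option is one-hot encoding with `pd.get_dummies`. The sketch below shows what it does on a tiny made-up frame; `toy_df` and the `Villa` category are illustrative assumptions, not rows taken from the dataset.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column
toy_df = pd.DataFrame({
    'Type': ['Lejlighed', 'Villa', 'Lejlighed'],  # 'Villa' is a hypothetical category
    'm2': [54, 170, 63],
})

# Object columns become one indicator column per category; numeric columns pass through
encoded = pd.get_dummies(toy_df)

print(encoded.columns.tolist())
print(encoded)
```

Note that a high-cardinality column such as `Road name` produces one indicator column per unique value, so the encoded feature matrix can become very wide.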

Then use the code below to make a train-test split.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
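Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. A quick sketch with placeholder arrays of the same length as the dataset (`X_demo` and `y_demo` are stand-ins, not the real features) shows the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 2160 rows, matching the number of instances in the dataset
X_demo = np.arange(2160).reshape(-1, 1)
y_demo = np.arange(2160)

# No test_size given, so the default 75/25 split applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=4)

print("train rows:", X_tr.shape[0])
print("test rows:", X_te.shape[0])
```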

Now, in addition to our unscaled feature matrices, `X_train` and `X_test`, let's use the `StandardScaler` to standardize the scales of the attributes. 

`StandardScaler` standardizes the scales across features by removing the mean and scaling to unit variance. For example, for each `m2` value, `StandardScaler` will re-scale that value by subtracting the mean `m2` and then dividing that difference by the standard deviation of `m2`. This can be useful because it centers all variables at 0. However, `StandardScaler` is sensitive to outliers.

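Use the code below to fit `StandardScaler` on `X_train` only, and then apply the fitted scaler to both `X_train` and `X_test`. Fitting on the training data alone keeps information about the test set from leaking into the preprocessing.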

In [17]:
scaler = StandardScaler() # define the scaler
X_train_scaled = scaler.fit_transform(X_train) # apply to the training feature matrix
X_test_scaled = scaler.transform(X_test) # apply to the testing feature matrix
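To make the standardization concrete, the sketch below reproduces `StandardScaler` by hand on synthetic data: subtract the training mean, then divide by the training standard deviation. The arrays `train` and `test` are randomly generated stand-ins for a single feature, and `demo_scaler` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (stand-in for a column like m2)
rng = np.random.default_rng(0)
train = rng.normal(loc=140, scale=65, size=(100, 1))
test = rng.normal(loc=140, scale=65, size=(20, 1))

demo_scaler = StandardScaler().fit(train)  # statistics come from the training data only

mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses

# The test data is scaled with the *training* mean and std
manual_train = (train - mu) / sigma
manual_test = (test - mu) / sigma

print(np.allclose(manual_train, demo_scaler.transform(train)))
print(np.allclose(manual_test, demo_scaler.transform(test)))
```

Because the test set is transformed with the training statistics, its scaled mean will generally be close to, but not exactly, 0.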

In addition to `StandardScaler`, let's also try out `RobustScaler` so we can check if different scaling methods affect model performance.

`RobustScaler` applies a different standardization by subtracting the median (instead of the mean) and then dividing by the inter-quartile range (instead of the standard deviation). In effect, this means `RobustScaler` is less sensitive to outliers.
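The following sketch reproduces `RobustScaler` by hand on five made-up values, one of which is an extreme outlier, and compares the result with `StandardScaler`. The array `values` is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up feature values; the last one is an extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual robust scaling: subtract the median, divide by the inter-quartile range
median = np.median(values, axis=0)
q1, q3 = np.percentile(values, [25, 75], axis=0)
manual_robust = (values - median) / (q3 - q1)

robust = RobustScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)

print("matches RobustScaler:", np.allclose(manual_robust, robust))

# Spread of the four ordinary values after each kind of scaling
print("robust spread:  ", robust[:4].max() - robust[:4].min())
print("standard spread:", standard[:4].max() - standard[:4].min())
```

The outlier inflates the mean and standard deviation, so `StandardScaler` squeezes the four ordinary values into a narrow band, while `RobustScaler` keeps them well separated.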

# <font color='red'>TASK 2</font>

Apply `RobustScaler` to the `X_train` and `X_test` to create `X_train_robust` and `X_test_robust`.

In [18]:
# your code here
robustScaler = RobustScaler() # define the scaler
X_train_robust = robustScaler.fit_transform(X_train) # apply to the training feature matrix
X_test_robust = robustScaler.transform(X_test) # apply to the testing feature matrix

### Part 3: Implementing `KNeighborsRegressor`

We now have three different kinds of feature matrices:
* `X_train` and `X_test` have the original features without any scaling,
* `X_train_scaled` and `X_test_scaled` have the features with `StandardScaler`, and
* `X_train_robust` and `X_test_robust` have the features with `RobustScaler`.

Let's experiment and see which kind of scaling leads to the best results with `KNeighborsRegressor`.

# <font color='red'>TASK 3</font>

Fit a `KNeighborsRegressor` to `X_train` and `y_train` and define it as `knn`. Then print the model's training and test scores.

In [19]:
# your code here - fit knn to training data
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

In [20]:
# your code here - print scores
print("KNN training score:", knn.score(X_train, y_train))
print("KNN testing score:", knn.score(X_test, y_test))

KNN training score: 0.6632970249705157
KNN testing score: 0.579975169614829


# <font color='red'>TASK 4</font>

Fit a `KNeighborsRegressor` to `X_train_scaled` and `y_train` and define it as `knn_scaled`. Then print the model's training and test scores.

In [21]:
# your code here - fit knn to scaled training data
knn_scaled = KNeighborsRegressor()
knn_scaled.fit(X_train_scaled, y_train)

In [22]:
# your code here - print scores on scaled data
print("KNN (scaled) training score:", knn_scaled.score(X_train_scaled, y_train))
print("KNN (scaled) testing score:", knn_scaled.score(X_test_scaled, y_test))

KNN (scaled) training score: 0.662846676393098
KNN (scaled) testing score: 0.4850255945332237


# <font color='red'>TASK 5</font>

Fit a `KNeighborsRegressor` to `X_train_robust` and `y_train` and define it as `knn_robust`. Then print the model's training and test scores.

In [23]:
# your code here - fit knn to robust scaled training data
knn_robust = KNeighborsRegressor()
knn_robust.fit(X_train_robust, y_train)

In [24]:
# your code here - print scores on robust scaled data
print("KNN (robust scaled) training score:", knn_robust.score(X_train_robust, y_train))
print("KNN (robust scaled) testing score:", knn_robust.score(X_test_robust, y_test))

KNN (robust scaled) training score: 0.7832536798259411
KNN (robust scaled) testing score: 0.7529845442680467


# <font color='red'>TASK 6</font>

Which kind of feature scaling led to the best performance with `KNeighborsRegressor`? Why do you think that is?

### Part 4: Tuning hyperparameters with `GridSearchCV`

From our experiment above, we learned which scaling method was best... but can we improve performance even further?

Let's see if tuning hyperparameters helps. With k-nearest neighbor models we can tune things like the number of neighbors to consider for the final prediction (`n_neighbors`), whether all neighbors should be weighted equally (`weights`), and the distance metric used (`metric` and `p`).

Use the code below to set up a parameter grid.

In [25]:
param_grid = {
    'n_neighbors': [1, 5, 10, 20, 30, 40, 50, 60], # what should we set k to? 
    'weights': ['uniform', 'distance'], # uniform weighting vs. distance weighting
    'p': [1, 2], # 1 for manhattan vs. 2 for euclidean distance
}
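To make these hyperparameters concrete, here is a minimal from-scratch sketch of a k-nearest neighbors prediction, checked against `KNeighborsRegressor` on a tiny made-up dataset. `knn_predict` is a hypothetical helper written for illustration; it assumes no ties and no zero distances, which scikit-learn handles more carefully.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up dataset: four points in 2D with one target value each
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0], [10.0, 10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0])
query = np.array([[2.0, 1.0]])

def knn_predict(X, y, q, k, p, weights):
    """Hypothetical helper: predict for one query point q (assumes no zero distances)."""
    # Minkowski distance: p=1 is Manhattan, p=2 is Euclidean
    dists = np.sum(np.abs(X - q) ** p, axis=1) ** (1 / p)
    nearest = np.argsort(dists)[:k]  # indices of the k closest training points
    if weights == 'uniform':
        return y[nearest].mean()     # plain average of the neighbors' targets
    w = 1.0 / dists[nearest]         # closer neighbors get larger weights
    return np.sum(w * y[nearest]) / np.sum(w)

for p in (1, 2):
    for weights in ('uniform', 'distance'):
        manual = knn_predict(X_toy, y_toy, query, k=2, p=p, weights=weights)
        model = KNeighborsRegressor(n_neighbors=2, p=p, weights=weights).fit(X_toy, y_toy)
        print(p, weights, manual, model.predict(query)[0])
```

With uniform weights both neighbors count equally, whereas distance weighting pulls the prediction toward the target of the closer neighbor.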

# <font color='red'>TASK 7</font>

Use `GridSearchCV` to figure out what the optimal hyperparameter settings are for `knn_robust`. Define the grid search as `grid_search`.

*HINT: Check out the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for how to implement `GridSearchCV`.*

In [26]:
# your code here - define `grid_search` and fit it to the robust scaled data
grid_search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train_robust, y_train)

Use the code below to print the best hyperparameter settings found, and the scores for `best_knn`, the `KNeighborsRegressor` with tuned hyperparameters.

In [27]:
print("Best hyperparameter settings found: ", grid_search.best_params_)

Best hyperparameter settings found:  {'n_neighbors': 10, 'p': 1, 'weights': 'distance'}


In [28]:
best_knn = grid_search.best_estimator_
print("Score on training set: {:.3f}".format(best_knn.score(X_train_robust, y_train)))
print("Score on test set: {:.3f}".format(best_knn.score(X_test_robust, y_test)))

Score on training set: 1.000
Score on test set: 0.751


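Did tuning help much in this case? Compare the test score with the untuned `knn_robust` from Task 5 (0.751 vs. 0.753): not really. Also note that the perfect training score is a side effect of `weights='distance'`: each training point is its own nearest neighbor at distance 0, so it is predicted exactly. It tells us nothing about how well the model generalizes.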
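With distance weighting, the training score is not informative, so the cross-validated score stored in `best_score_` is a better guide to what the grid search found. The sketch below illustrates both points on synthetic data from `make_regression`; the data, the small grid, and the names `knn_dist` and `search` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (not the housing dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Distance weighting: every training point has itself as a zero-distance neighbor
knn_dist = KNeighborsRegressor(n_neighbors=10, weights='distance').fit(X_tr, y_tr)
train_r2 = knn_dist.score(X_tr, y_tr)
test_r2 = knn_dist.score(X_te, y_te)

# Small illustrative grid; best_score_ is the mean cross-validated R^2 of the best setting
search = GridSearchCV(
    KNeighborsRegressor(),
    {'n_neighbors': [1, 5, 10], 'weights': ['uniform', 'distance']},
    cv=5,
)
search.fit(X_tr, y_tr)

print("training R^2:", train_r2)
print("test R^2:", test_r2)
print("best cross-validated R^2:", search.best_score_)
```

The cross-validated score is computed on held-out folds, so it is the fairer number to compare against the test score.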

### Bonus section: Including scaling methods in `GridSearchCV`

In this notebook, we applied different scalers to the feature matrix one at a time for the sake of demonstration. However, you can also include different scalers *within* a tuning method like `GridSearchCV` to find the optimal configuration of scaler + hyperparameters all in one go. The code below shows you how to do this by creating a modelling `Pipeline`.

In [None]:
from sklearn.pipeline import Pipeline

# define the pipeline with a placeholder scaler and the KNN regressor
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Placeholder scaler
    ('knn', KNeighborsRegressor())
])

# define the parameter grid to include different scalers and KNN parameters
param_grid = {
    'scaler': [StandardScaler(),  RobustScaler(), 'passthrough'],  # different scaling methods (`passthrough` means no scaling)
    'knn__n_neighbors': [3, 6, 12, 24, 48, 96],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],
}

# set up GridSearchCV with the pipeline and parameter grid
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=3,
    scoring='r2',
    return_train_score=True,
    verbose=3
)

# fit the model with the original training data (without prior scaling)
grid_search.fit(X_train, y_train)

In [None]:
print("Best hyperparameter settings found: ", grid_search.best_params_)
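Beyond the single best setting, `cv_results_` holds the scores for every combination tried, which makes it easy to compare scalers side by side. The sketch below shows one way to tabulate it with pandas, using synthetic data and a deliberately small grid; `demo_pipe`, `demo_grid`, and `demo_search` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic regression data (not the housing dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

demo_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsRegressor())])
demo_grid = {
    'scaler': [StandardScaler(), RobustScaler(), 'passthrough'],
    'knn__n_neighbors': [3, 6],
}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3, scoring='r2')
demo_search.fit(X_syn, y_syn)

# One row per scaler/hyperparameter combination, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ['param_scaler', 'param_knn__n_neighbors', 'mean_test_score', 'rank_test_score']
].sort_values('rank_test_score')

print(summary)
```

Because the scaler sits inside the pipeline, it is refit on the training folds of each split, so no information from the validation fold leaks into the scaling.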

---------
**That's it for this week! Next week we'll try out some unsupervised learning techniques.**