# Distance Based Metrics

### Introduction

### Loading the Data

In [283]:
import pandas as pd
from sklearn.datasets import fetch_california_housing

cal = fetch_california_housing()
X = pd.DataFrame(cal['data'], columns = cal['feature_names'])
y = pd.Series(cal['target'])

In [284]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1, test_size = .2)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, random_state = 1, test_size = .5)

### Baseline model

In [285]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state = 1, min_samples_leaf = 7)
rfr.fit(X_train, y_train)

RandomForestRegressor(min_samples_leaf=7, random_state=1)

In [286]:
rfr.score(X_validate, y_validate)

0.7927389769247034

### Feature Engineering

In [287]:
X[:2]

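|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |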


In [288]:
X_train.shape

(16512, 8)

To start, let's apply a k-nearest neighbors (KNN) model.  Here, we'll estimate the price of a house by averaging the prices of the houses closest to it.  For now, "closest" means closest by latitude and longitude only; we are not yet using the other features to find the most *similar* houses.
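
To make the averaging concrete, here is a minimal sketch on made-up coordinates and prices (not the housing data). It computes a prediction by hand and compares it with `KNeighborsRegressor`, assuming the defaults of uniform weights and Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data, invented for illustration: four houses and their prices.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 10.0]])
prices = np.array([1.0, 2.0, 3.0, 10.0])

query = np.array([[0.0, 0.4]])
k = 2

# By hand: Euclidean distance to every training point, then the mean
# price of the k closest points.
distances = np.linalg.norm(coords - query, axis=1)
nearest = np.argsort(distances)[:k]
manual_prediction = prices[nearest].mean()

# The same calculation with scikit-learn's defaults.
knn = KNeighborsRegressor(n_neighbors=k).fit(coords, prices)
sklearn_prediction = knn.predict(query)[0]

print(manual_prediction, sklearn_prediction)
```

The two closest houses are the first two, so both approaches average their prices.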

In [289]:
from sklearn.neighbors import KNeighborsRegressor
geo_cols = ['Latitude', 'Longitude']
knns = [KNeighborsRegressor(n_neighbors = i).fit(X_train[geo_cols], y_train) for i in range(2, 20, 1)]

In [290]:
scores = [knn.score(X_validate[geo_cols], y_validate) for knn in knns]

In [291]:
knn_scores = pd.DataFrame({
    'neighbors': range(2, 20, 1),
    'scores': scores
})

In [292]:
knn_scores[:10]

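|   | neighbors | scores |
|---|-----------|--------|
| 0 | 2 | 0.773056 |
| 1 | 3 | 0.789243 |
| 2 | 4 | 0.80264 |
| 3 | 5 | 0.80564 |
| 4 | 6 | 0.808214 |
| 5 | 7 | 0.802368 |
| 6 | 8 | 0.80063 |
| 7 | 9 | 0.798597 |
| 8 | 10 | 0.79751 |
| 9 | 11 | 0.794712 |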


Here, among the values shown, the validation score peaks at six neighbors (about 0.808).  So we can choose six as our hyperparameter.
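
The best-scoring row can also be read off programmatically rather than by eye. A minimal sketch, re-entering by hand the ten validation scores displayed in the table above (the remaining rows are not shown, so they are not included here):

```python
import pandas as pd

# Scores copied from the displayed rows of knn_scores (neighbors 2 through 11).
knn_scores = pd.DataFrame({
    'neighbors': range(2, 12),
    'scores': [0.773056, 0.789243, 0.80264, 0.80564, 0.808214,
               0.802368, 0.80063, 0.798597, 0.79751, 0.794712],
})

# idxmax returns the index label of the highest score.
best_row = knn_scores.loc[knn_scores['scores'].idxmax()]
best_k = int(best_row['neighbors'])
print(best_k, best_row['scores'])
```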

Now, beyond using KNN as a model in its own right, we can also use it for feature engineering.  We can get the average price of an observation's six nearest neighbors with the following:

In [293]:
# Despite the variable name, this model uses six neighbors, the best value in the table above.
knn_nine = KNeighborsRegressor(n_neighbors = 6).fit(X_train[geo_cols], y_train)

In [294]:
train_neighbor_prices = knn_nine.predict(X_train[geo_cols])
train_neighbor_prices

array([2.65366667, 1.196     , 0.8385    , ..., 0.95133333, 1.20983333,
       1.4315    ])

So let's now use this as a new feature, and see if it improves our random forest model.

In [296]:
X_train_neighbors_price = X_train.assign(price_neighbors = knn_nine.predict(X_train[geo_cols]))

In [297]:
rfr_neighbor = RandomForestRegressor(random_state = 1, min_samples_leaf = 7, max_features = 'log2')
rfr_neighbor.fit(X_train_neighbors_price, y_train)

RandomForestRegressor(max_features='log2', min_samples_leaf=7, random_state=1)

In [298]:
X_validate_neighbors_price = X_validate.assign(price_neighbors = knn_nine.predict(X_validate[geo_cols]))

In [299]:
rfr_neighbor.score(X_validate_neighbors_price, y_validate)

0.8637763810819373

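So the validation score rises from about 0.793 to about 0.864 once the neighbor price is included.  Note that this model also sets `max_features = 'log2'`, which the baseline did not, so the improvement may not be due to the new feature alone.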

### KNN of other features

Now let's see if we can add further features with KNN.  Take another look at our current list of features.

In [301]:
X_train[:2]

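|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 15961 | 3.1908 | 52.0 | 5.0 | 1.014184 | 879.0 | 3.117021 | 37.71 | -122.43 |
| 1771 | 3.6094 | 42.0 | 4.90099 | 0.957096 | 971.0 | 3.20462 | 37.95 | -122.35 |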


So we could add the median income of the nearest neighbors, their house age, their average rooms and bedrooms, and so on.

In [302]:
def neighbor_model(feature, train_dataset, n_neighbors = 9):
    geo_data = train_dataset[['Latitude', 'Longitude']]
    return KNeighborsRegressor(n_neighbors=n_neighbors).fit(geo_data, train_dataset[feature])

In [303]:
knn_model = neighbor_model('MedInc', X_train)

In [304]:
# X_med_inc_neighbors = add_neighbor_feature(knn_model, 'MedInc', X_train)
# X_med_inc_neighbors[:2]
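
The commented-out cell above calls an `add_neighbor_feature` helper that is never defined in this notebook. One plausible shape for it is sketched below; the function body, the `_neighbors` column suffix, and the toy data are all assumptions for illustration, not the author's implementation.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

def add_neighbor_feature(model, feature, dataset):
    """Hypothetical helper: return a copy of dataset with a
    '<feature>_neighbors' column predicted from latitude and longitude."""
    geo_data = dataset[['Latitude', 'Longitude']]
    return dataset.assign(**{f'{feature}_neighbors': model.predict(geo_data)})

# Made-up rows standing in for X_train.
toy = pd.DataFrame({
    'Latitude': [37.0, 37.1, 37.3, 40.0],
    'Longitude': [-122.0, -122.1, -122.3, -120.0],
    'MedInc': [2.0, 4.0, 6.0, 10.0],
})
toy_model = KNeighborsRegressor(n_neighbors=2).fit(
    toy[['Latitude', 'Longitude']], toy['MedInc'])

# When predicting on the rows the model was fit on, each row is its own
# nearest neighbor, so its own value is part of the average.
toy_with_neighbors = add_neighbor_feature(toy_model, 'MedInc', toy)
print(toy_with_neighbors)
```

Because `assign` returns a new frame, the original dataset is left unchanged.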

### Adding multiple new features

Now, let's do this for all of the features except longitude and latitude.

In [313]:
selected_features = X_train_neighbors_price.columns[:6]
selected_features

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
       'AveOccup'],
      dtype='object')

In [310]:
def build_feature_models(dataset, n_neighbors = 6):
    return [neighbor_model(feature, dataset, n_neighbors = n_neighbors) for feature in selected_features]
        

In [311]:
train_feature_models = build_feature_models(X_train_neighbors_price)

In [312]:
train_feature_models

[KNeighborsRegressor(n_neighbors=6),
 KNeighborsRegressor(n_neighbors=6),
 KNeighborsRegressor(n_neighbors=6),
 KNeighborsRegressor(n_neighbors=6),
 KNeighborsRegressor(n_neighbors=6),
 KNeighborsRegressor(n_neighbors=6)]

In [278]:
def new_features_from(models, dataset, selected_features):
    geo_dataset = dataset[['Latitude', 'Longitude']]
    return pd.concat([pd.Series(model.predict(geo_dataset), name = f'{feature}_neighbors') 
                      for model, feature in zip(models, selected_features)], axis = 1)

Now let's use these models to build the new features for our training data.

In [314]:
train_neighbor_features = new_features_from(train_feature_models, X_train_neighbors_price, selected_features)

In [317]:
train_neighbor_features.index

RangeIndex(start=0, stop=16512, step=1)

In [328]:
# Reset to a RangeIndex (dropping the old index) so rows line up with train_neighbor_features.
X_train_neighbors = pd.concat([X_train_neighbors_price.reset_index(drop = True), train_neighbor_features], axis = 1)
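
The index reset is needed because `new_features_from` builds its columns on a fresh `RangeIndex`, while the training rows keep their shuffled labels from `train_test_split`. An alternative, sketched below under the assumption that preserving the original index is acceptable, is to build the new columns on the dataset's own index so that `pd.concat` aligns without a reset. The name `new_features_keep_index` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

def new_features_keep_index(models, dataset, selected_features):
    """Like new_features_from, but the result reuses dataset's index."""
    geo_dataset = dataset[['Latitude', 'Longitude']]
    return pd.DataFrame(
        {f'{feature}_neighbors': model.predict(geo_dataset)
         for model, feature in zip(models, selected_features)},
        index=dataset.index,
    )

# Made-up rows with a shuffled, non-contiguous index, as after train_test_split.
toy = pd.DataFrame({
    'Latitude': [37.0, 37.1, 37.3, 40.0],
    'Longitude': [-122.0, -122.1, -122.3, -120.0],
    'MedInc': [2.0, 4.0, 6.0, 10.0],
}, index=[15961, 1771, 42, 7])

features = ['MedInc']
models = [KNeighborsRegressor(n_neighbors=2).fit(toy[['Latitude', 'Longitude']], toy[f])
          for f in features]

neighbor_features = new_features_keep_index(models, toy, features)
# Both frames share the same index, so concat aligns row by row.
combined = pd.concat([toy, neighbor_features], axis=1)
print(combined)
```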

In [329]:
X_train_neighbors.isna().sum()

MedInc                  0
HouseAge                0
AveRooms                0
AveBedrms               0
Population              0
AveOccup                0
Latitude                0
Longitude               0
price_neighbors         0
MedInc_neighbors        0
HouseAge_neighbors      0
AveRooms_neighbors      0
AveBedrms_neighbors     0
Population_neighbors    0
AveOccup_neighbors      0
dtype: int64

In [330]:
rfr_med_neighbors = RandomForestRegressor(random_state = 1,
                                          min_samples_leaf = 7,
                                          max_features='log2').fit(X_train_neighbors, y_train)

### Score on the Validation Set

In [331]:
validate_neighbors_features = new_features_from(train_feature_models, X_validate_neighbors_price, selected_features)

In [341]:
# Reset to a RangeIndex (dropping the old index) so rows line up with validate_neighbors_features.
validate_X_combined = pd.concat([X_validate_neighbors_price.reset_index(drop = True), validate_neighbors_features], axis = 1)

In [342]:
rfr_med_neighbors.score(validate_X_combined, y_validate)

0.8643596775867174

So we see only a *small* increase, from about 0.8638 to about 0.8644, by including these other features.

### Wrapping up

In this lesson, we saw how to use distance-based features to enhance our model.  We did so with the KNN model, by finding the closest observations by latitude and longitude and averaging their values.  There are a couple of variations we could use to add even more features to the model:

1. Change the number of nearest neighbors

We don't have to add just one set of nearest-neighbor features to our dataframe; we could add the same statistics computed over different numbers of neighbors.

2. Change the statistic

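KNN allowed us to quickly group our observations by location and take the mean of each group, but we could compute other statistics over the same neighbors, such as the median.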
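
Both variations can be sketched together. The example below uses `NearestNeighbors.kneighbors` to look up neighbor indices once and then computes the mean and median over the first `k` neighbors for several values of `k`. The helper name `neighbor_stats`, the choice of statistics, and the random toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor, NearestNeighbors

STATS = {'mean': np.mean, 'median': np.median}

def neighbor_stats(train_geo, train_values, query_geo, ks=(3, 6)):
    """Hypothetical helper: one column per (statistic, k) pair."""
    nn = NearestNeighbors(n_neighbors=max(ks)).fit(train_geo)
    # indices has shape (n_queries, max(ks)), ordered nearest first.
    _, indices = nn.kneighbors(query_geo)
    neighbor_values = np.asarray(train_values)[indices]
    columns = {
        f'{name}_{k}_neighbors': func(neighbor_values[:, :k], axis=1)
        for k in ks
        for name, func in STATS.items()
    }
    return pd.DataFrame(columns, index=query_geo.index)

# Random made-up locations and prices, not the housing data.
rng = np.random.default_rng(0)
geo = pd.DataFrame({'Latitude': rng.uniform(32, 42, 50),
                    'Longitude': rng.uniform(-124, -114, 50)})
prices = pd.Series(rng.uniform(0.5, 5.0, 50))

stats = neighbor_stats(geo, prices, geo)

# Sanity check: the k=3 mean should match KNeighborsRegressor with 3 neighbors.
knn_mean = KNeighborsRegressor(n_neighbors=3).fit(geo, prices).predict(geo)
print(stats.head())
```

Looking up the neighbors once for the largest `k` and slicing avoids refitting a separate model for each setting.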

### Summary

### Resources

[Kaggle Competitions](http://www.chioka.in/kaggle-competition-solutions/)