## Regression based on k-nearest neighbors: finding the best metric for a particular task
We use the Minkowski distance $$ \rho(x,z)=\left(\sum_{j=1}^{d}|x_{j}-z_{j}|^{p}\right)^{1/p} $$
and search for the value of the parameter $p$ that works best for the problem, i.e. the one that gives the smallest cross-validated mean squared error (MSE).
<br>
Note: $p = 1$ corresponds to the Manhattan distance and $p = 2$ to the Euclidean distance.
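
The two special cases in the note can be checked numerically. The snippet below is a minimal sketch in plain Python; the helper name `minkowski_distance` and the sample points are made up for illustration and are not part of the notebook.

```python
def minkowski_distance(x, z, p):
    """Minkowski distance between two equal-length sequences, for p >= 1."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    return sum(abs(xj - zj) ** p for xj, zj in zip(x, z)) ** (1.0 / p)


# Hypothetical sample points, chosen so the distances are easy to verify by hand
x = [0.0, 0.0]
z = [3.0, 4.0]

manhattan = minkowski_distance(x, z, 1)   # |3| + |4|
euclidean = minkowski_distance(x, z, 2)   # sqrt(3^2 + 4^2)
large_p = minkowski_distance(x, z, 10)    # approaches max(|3|, |4|) as p grows

print(manhattan, euclidean, large_p)
```

As $p$ increases, the largest coordinate difference dominates the sum, which is why the choice of $p$ can change which points count as nearest neighbors.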

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

# Importing the dataset (Boston house-prices dataset)
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
data_set = load_boston()
# Scaling the features
X = scale(data_set.data)
y = data_set.target

In [2]:
# Parameter p
mink_param = np.linspace(1, 10, 200)
# Number of neighbors
num_neigh = 5
# Array of cross-validation scores (negative MSE, so higher is better)
error_arr = np.zeros(len(mink_param))
# K-Folds cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for ii in range(len(mink_param)):
    # Regression based on k-nearest neighbors
    k_neigh = KNeighborsRegressor(n_neighbors=num_neigh, weights='distance', metric='minkowski', p=mink_param[ii])
    # Mean negative MSE over the cross-validation folds
    error_arr[ii] = (cross_val_score(estimator=k_neigh, X=X, y=y, cv=kf, scoring='neg_mean_squared_error')).mean()

# Scores are negative MSE, so the largest score corresponds to the smallest MSE
best_p = mink_param[np.argmax(error_arr)]
print('The best parameter p is %g' % best_p)

The best parameter p is 1
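
The same kind of search can also be written with scikit-learn's `GridSearchCV` instead of a manual loop. This is a different technique from the one used in the cell above, shown only as a hedged sketch: the data comes from `make_regression` as a synthetic stand-in (an assumption, not the house-prices dataset), the grid is coarser, and all `*_demo` names are hypothetical, so the selected `p` need not match the result above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import scale

# Synthetic stand-in data (an assumption, not the dataset used in the notebook)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_demo = scale(X_demo)

# Same cross-validation scheme and regressor settings as in the loop above
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    estimator=KNeighborsRegressor(n_neighbors=5, weights='distance', metric='minkowski'),
    param_grid={'p': np.linspace(1, 10, 20)},
    scoring='neg_mean_squared_error',
    cv=kf_demo,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE, so flip the sign to report the MSE
best_p_demo = search.best_params_['p']
best_mse_demo = -search.best_score_
print('Best p: %g, cross-validated MSE: %g' % (best_p_demo, best_mse_demo))
```

`GridSearchCV` picks the parameter with the highest mean score, which for `neg_mean_squared_error` is the smallest MSE, the same criterion as `np.argmax` on the score array.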
