## KNN ALGORITHM
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning method that uses proximity to make predictions about an individual data point. While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.

- KNN regression works by finding the k nearest data points to the point we want to predict, and then taking the mean of their target values as the prediction.

1) Choose k, the number of nearest neighbors that will be used to make the prediction.
2) Compute the distance between the point to be predicted and every point in the training set, then select the k training points that are closest to it.
3) Take the mean of the target values of those k nearest points. This mean is the prediction for the new point.
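
The three steps can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the scikit-learn implementation used in the cells below: the function name `knn_regress` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict the target of `query` as the mean target of its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the prediction is the mean of their target values
    return sum(y for _, y in nearest) / k

# Hypothetical toy data: four training points with two features each
train_X = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (10.0, 10.0)]
train_y = [10.0, 20.0, 30.0, 100.0]

# Step 1: choose k
prediction = knn_regress(train_X, train_y, query=(2.0, 2.0), k=3)
print(prediction)  # 20.0 (mean of 10, 20 and 30; the far point is ignored)
```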


# Differences between MinMax and StandardScaler (SS)
First, we run the algorithm on two differently normalized versions of the dataset to see which one works better, since the best choice depends on the distribution of the data.

In [28]:
from sklearn.neighbors import KNeighborsRegressor
import pandas as pd

# Load the MinMax-scaled train/test splits
X_train = pd.read_csv('x_train_preprocessed_minmax.csv')
X_test = pd.read_csv('x_test_preprocessed_minmax.csv')
y_train = pd.read_csv('y_train_preprocessed_minmax.csv')
y_test = pd.read_csv('y_test_preprocessed_minmax.csv')

# Sweep k and print R^2 (the default `score` of a regressor) on train and test data.
# Note: the "n_iter_" label in the output is the value of k.
for k in range(1, 50):
    kNN = KNeighborsRegressor(n_neighbors=k, weights='uniform', algorithm='auto', leaf_size=50, n_jobs=-1)

    kNN.fit(X_train, y_train)
    print("n_iter_", k, "TRAIN: ", kNN.score(X_train, y_train), ", TEST: ", kNN.score(X_test, y_test))

n_iter_ 1 TRAIN:  0.999995556064519 , TEST:  0.6548162930324317
n_iter_ 2 TRAIN:  0.9040245526188013 , TEST:  0.7306352206587241
n_iter_ 3 TRAIN:  0.8686925706979628 , TEST:  0.7438862499080493
n_iter_ 4 TRAIN:  0.8432623458644879 , TEST:  0.7614987131490388
n_iter_ 5 TRAIN:  0.8286652591564905 , TEST:  0.7619037890590099
n_iter_ 6 TRAIN:  0.8181674269402867 , TEST:  0.7720531882337277
n_iter_ 7 TRAIN:  0.8129233702437663 , TEST:  0.7815060639119966
n_iter_ 8 TRAIN:  0.8039981380895762 , TEST:  0.7815980177695694
n_iter_ 9 TRAIN:  0.7981602884602009 , TEST:  0.7844442765562456
n_iter_ 10 TRAIN:  0.7925707344398562 , TEST:  0.7845447223803126
n_iter_ 11 TRAIN:  0.7871256105650429 , TEST:  0.7844038103053651
n_iter_ 12 TRAIN:  0.78268666867862 , TEST:  0.7895855174808046
n_iter_ 13 TRAIN:  0.7775245014153158 , TEST:  0.7874934790378751
n_iter_ 14 TRAIN:  0.7735572484377018 , TEST:  0.7860394075985018
n_iter_ 15 TRAIN:  0.7691026873392809 , TEST:  0.7841251587065841
n_iter_ 16 TRAIN:  0.7

In [29]:
# Load the second set of preprocessed splits (the SS-normalized version compared below)
X_train = pd.read_csv('x_train_preprocessed.csv')
X_test = pd.read_csv('x_test_preprocessed.csv')
y_train = pd.read_csv('y_train_preprocessed.csv')
y_test = pd.read_csv('y_test_preprocessed.csv')

# Same sweep over k as in the previous cell
for k in range(1, 50):
    kNN = KNeighborsRegressor(n_neighbors=k, weights='uniform', algorithm='auto', leaf_size=50, n_jobs=-1)

    kNN.fit(X_train, y_train)
    print("n_iter_", k, "TRAIN: ", kNN.score(X_train, y_train), ", TEST: ", kNN.score(X_test, y_test))

n_iter_ 1 TRAIN:  0.999995556064519 , TEST:  0.7341107423159402
n_iter_ 2 TRAIN:  0.9239282321866633 , TEST:  0.7843391542310382
n_iter_ 3 TRAIN:  0.8845033020614547 , TEST:  0.8047429091469638
n_iter_ 4 TRAIN:  0.8557651900725336 , TEST:  0.8108380255069882
n_iter_ 5 TRAIN:  0.8385545348405409 , TEST:  0.8181521804541428
n_iter_ 6 TRAIN:  0.8343931856569102 , TEST:  0.8136496625454154
n_iter_ 7 TRAIN:  0.8285442112490753 , TEST:  0.8134753915884895
n_iter_ 8 TRAIN:  0.8209914177815101 , TEST:  0.8143971882725571
n_iter_ 9 TRAIN:  0.8195809914274905 , TEST:  0.8097621894962198
n_iter_ 10 TRAIN:  0.8142898417549622 , TEST:  0.8080765603696098
n_iter_ 11 TRAIN:  0.8108614148288164 , TEST:  0.8104540302623253
n_iter_ 12 TRAIN:  0.8088114323221752 , TEST:  0.8138451555359058
n_iter_ 13 TRAIN:  0.8061429403552252 , TEST:  0.8119898223464308
n_iter_ 14 TRAIN:  0.8013940744444299 , TEST:  0.8096252888758164
n_iter_ 15 TRAIN:  0.7989302271772308 , TEST:  0.806962282102901
n_iter_ 16 TRAIN:  0.

- The best scaling depends on the data.
- **StandardScaler** does not remap every feature into the interval 0-1; it centers each feature and rescales it to unit variance, so values are unbounded and a feature with extreme values may still strongly affect the overall distance.
- **MinMaxScaling** is usually more sensitive to outliers, but weights features more evenly.
- Euclidean distance assumes all features are equally important, and this is usually not the case.

**In our case the algorithm seems to work better with SS normalization, so we continue with that.**
We reach this conclusion by comparing the values obtained in the two tests: with SS we get better accuracy at the cost of slight overfitting, while with MinMax the TRAIN and TEST values are similar in all observations, so the algorithm is more stable.
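
To make the difference between the two scalings concrete, here is a minimal plain-Python sketch of what each one computes on a single feature. The helper names and the toy feature (with one outlier) are hypothetical; the formulas are the usual min-max rescaling and the zero-mean, unit-variance standardization with the population standard deviation.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Center values to zero mean and rescale them to unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical feature with one outlier (100.0)
feature = [1.0, 2.0, 3.0, 4.0, 100.0]

mm = min_max_scale(feature)
ss = standard_scale(feature)

print([round(v, 3) for v in mm])  # the non-outlier values are squeezed close to 0
print([round(v, 3) for v in ss])  # values are not bounded to [0, 1]
```

With min-max scaling the outlier fixes the upper end of the range, so the remaining values end up compressed near 0. With standardization the outlier lands outside the 0-1 interval and can still dominate a distance computation.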

# Remove some features to see if the algorithm works better
Based on the idea of the k-NN algorithm, we might expect that dropping some features that should not be important will not change the behavior or the R^2 we get.
Note, however, that k-NN does not select features: every feature contributes to the distance, and the k parameter only sets how many neighbors are averaged.

In [30]:
# One-hot encoded neighborhood columns
oh_neighbor = [col for col in X_train.columns if 'Neighborhood_b' in col]

porch = ['Wood_Deck_SF', 'Open_Porch_SF', 'Enclosed_Porch', 'Three_season_porch', 'Screen_Porch']
surface = ['Total_Finished_Bsmt_SF', 'First_Flr_SF', 'Second_Flr_SF', 'Garage_Area']
baths = ['Full_Bath', 'Half_Bath', 'Bsmt_Full_Bath', 'Bsmt_Half_Bath']

# Drop all four feature groups from both splits
to_drop = oh_neighbor + porch + surface + baths
X_train_modified = X_train.drop(columns=to_drop)
X_test_modified = X_test.drop(columns=to_drop)

# Sweep k on the reduced feature set and print R^2 on train and test data
for k in range(1, 20):
    kNN = KNeighborsRegressor(n_neighbors=k, weights='uniform', algorithm='auto', leaf_size=50, n_jobs=-1)

    kNN.fit(X_train_modified, y_train)
    print("TRAIN ", k, ": ", kNN.score(X_train_modified, y_train), ", TEST ", k, ": ", kNN.score(X_test_modified, y_test))

TRAIN  1 :  0.999995556064519 , TEST  1 :  0.680007639337778
TRAIN  2 :  0.9131524907229598 , TEST  2 :  0.7390203275564162
TRAIN  3 :  0.8635235026374533 , TEST  3 :  0.757524765293857
TRAIN  4 :  0.8289474291787493 , TEST  4 :  0.7587567914732702
TRAIN  5 :  0.8126653527470866 , TEST  5 :  0.7645733030090183
TRAIN  6 :  0.801824033807873 , TEST  6 :  0.7716116991319467
TRAIN  7 :  0.7935786550103925 , TEST  7 :  0.78034358534564
TRAIN  8 :  0.7904745795250236 , TEST  8 :  0.7824203665578333
TRAIN  9 :  0.7861069957706673 , TEST  9 :  0.7773828713800537
TRAIN  10 :  0.7826504655019403 , TEST  10 :  0.7808421776185573
TRAIN  11 :  0.7814800838603049 , TEST  11 :  0.7829142622677039
TRAIN  12 :  0.7767103348571357 , TEST  12 :  0.7822102280051103
TRAIN  13 :  0.7739589704751667 , TEST  13 :  0.7818693128276317
TRAIN  14 :  0.7722847686959927 , TEST  14 :  0.7839116947625162
TRAIN  15 :  0.7685102795821933 , TEST  15 :  0.7862896016239149
TRAIN  16 :  0.7689706780268877 , TEST  16 :  0.7

We can see that the R^2 is reduced: some of the removed features probably helped explain the target.
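
A small plain-Python sketch can illustrate why removing a feature can change the result: all features enter the distance, so dropping an informative one can change which training point is nearest. The data, column meanings and helper name below are hypothetical and are not taken from the housing dataset.

```python
import math

def nearest_index(points, query):
    """Return the index of the training point closest to `query` (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

# Hypothetical toy data: columns are (informative_feature, other_feature)
points = [(0.0, 0.5), (1.0, 0.1), (0.1, 0.9)]
targets = [100.0, 300.0, 110.0]  # the target follows the first column
query = (0.05, 0.1)

# 1-NN prediction using both features
full = nearest_index(points, query)

# 1-NN prediction after dropping the first (informative) column
reduced = nearest_index([p[1:] for p in points], query[1:])

print(full, targets[full])        # 0 100.0
print(reduced, targets[reduced])  # 1 300.0
```

In this toy case the neighbor found without the informative column has a very different target, which is the kind of effect that could lower R^2 after removing useful features.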

# Hyperparameter tuning


In [31]:
from sklearn.model_selection import GridSearchCV

parameters = {
    "n_neighbors": range(5, 15),  # range chosen from the k sweeps observed above
    # uniform = all points in each neighborhood are weighted equally;
    # distance = points are weighted by the inverse of their distance, so closer neighbors have more influence
    "weights": ["uniform"],
    "algorithm": ["auto"],  # neighbor-search algorithm, chosen automatically
    "leaf_size": range(10, 50),  # tree leaf size: affects search speed and memory
    "p": [1, 2],  # Minkowski power: 1 = Manhattan distance, 2 = Euclidean distance
    "n_jobs": [-1],  # use all CPU cores
}

reg_decision_model = KNeighborsRegressor()

tuning_model = GridSearchCV(
    reg_decision_model, param_grid=parameters,
    scoring='r2',
    cv=2,  # few CV folds: for a fixed number of neighbors the algorithm is deterministic
    verbose=3, n_jobs=-1)

tuning_model.fit(X_train, y_train)


Fitting 2 folds for each of 800 candidates, totalling 1600 fits


GridSearchCV(cv=2, estimator=KNeighborsRegressor(), n_jobs=-1,
             param_grid={'algorithm': ['auto'], 'leaf_size': range(10, 50),
                         'n_jobs': [-1], 'n_neighbors': range(5, 15),
                         'p': [1, 2], 'weights': ['uniform']},
             scoring='r2', verbose=3)
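
The candidate count in the log above follows directly from the grid: it is the product of the number of values per parameter, multiplied by the number of folds. The sketch below reproduces that arithmetic with the standard library only; it does not call scikit-learn, and the variable names are illustrative.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "n_neighbors": range(5, 15),  # 10 values
    "weights": ["uniform"],       # 1 value
    "algorithm": ["auto"],        # 1 value
    "leaf_size": range(10, 50),   # 40 values
    "p": [1, 2],                  # 2 values
    "n_jobs": [-1],               # 1 value
}
n_folds = 2

# Every combination of parameter values is one candidate
candidates = list(product(*param_grid.values()))
print(len(candidates), len(candidates) * n_folds)  # 800 1600

# Grid size if leaf_size were fixed to a single value instead of searched
reduced_grid = {name: values for name, values in param_grid.items() if name != "leaf_size"}
reduced_candidates = list(product(*reduced_grid.values()))
print(len(reduced_candidates))  # 20
```

Most of the 800 candidates come from `leaf_size`. According to the scikit-learn documentation that parameter affects the speed and memory of the tree search, so fixing it would be one way to shrink the search to 20 candidates.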

# Final results
Show the parameters selected by the grid search and the resulting scores on the train and test data.

In [32]:
print("Best parameter selected by CV grid execution",tuning_model.best_params_)

print("Score R2 on train data",tuning_model.score(X_train,y_train)) # Score the best model on our train split
print("Score R2 on test data",tuning_model.score(X_test,y_test)) # Test model on our test split

Best parameter selected by CV grid execution {'algorithm': 'auto', 'leaf_size': 10, 'n_jobs': -1, 'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}
Score R2 on train data 0.8631381022347369
Score R2 on test data 0.8628416503608851
