### KNN
Regression based on k-nearest neighbors.

The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.
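To make the idea concrete, here is a minimal pure-Python sketch of uniform-weight KNN regression on a single feature. The function name `knn_predict` and the toy rainfall and water-level values are made up for illustration; they are not taken from the dataset, and this is not how scikit-learn implements the search internally.

```python
def knn_predict(train_x, train_y, query, k=5):
    """Predict the target at `query` as the mean target of its k nearest training points."""
    # Rank training indices by absolute distance to the query (1-D feature).
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    # Uniform weights: a plain average of the neighbours' targets.
    return sum(train_y[i] for i in nearest) / len(nearest)


# Toy values, invented for illustration only (not from data.csv).
rain = [0.0, 2.0, 5.0, 10.0, 20.0, 40.0]
level = [6.2, 6.3, 6.5, 6.9, 7.4, 8.1]

# The three closest rainfall values to 6.0 are 5.0, 2.0 and 10.0,
# so the prediction is the mean of their water levels.
print(knn_predict(rain, level, query=6.0, k=3))
```

With `k=1` the prediction is simply the target of the single closest training point, which is why small `k` tends to follow the training data closely while larger `k` smooths it.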

One thing to note: KNN doesn't require any transformation of the data, so we are opting out of the power transformation here!

In [1]:
# let's import some dependencies

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot')
import pandas as pd
import numpy as np

import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import GridSearchCV, cross_val_score

from sklearn.neighbors import KNeighborsRegressor


from sklearn.metrics import mean_absolute_error, r2_score

In [2]:
# loading our data
df = pd.read_csv("./Data/data.csv",sep=",")
df.drop(['Unnamed: 0'], axis=1, inplace=True) # There were some formatting issues while
                                              # writing the csv

In [3]:
df.head()

|   | DISTRICT | UPAZILA | STATION_ID | STATION_NAME | DATE | RAIN_FALL(mm) | LATITUDE | LONGITUDE | WATER_LEVEL(m) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bandarban | Lama | CL317 | Lama | 01-jan-2017 | 0.0 | 21.81 | 92.19 | 6.22 |
| 1 | Bandarban | Lama | CL317 | Lama | 02-jan-2017 | 0.0 | 21.81 | 92.19 | 6.22 |
| 2 | Bandarban | Lama | CL317 | Lama | 03-jan-2017 | 0.0 | 21.81 | 92.19 | 6.22 |
| 3 | Bandarban | Lama | CL317 | Lama | 04-jan-2017 | 0.0 | 21.81 | 92.19 | 6.21 |
| 4 | Bandarban | Lama | CL317 | Lama | 05-jan-2017 | 0.0 | 21.81 | 92.19 | 6.21 |


Defining our X and y

In [5]:
X = df['RAIN_FALL(mm)'].values.reshape(-1,1) # input feature
y = df['WATER_LEVEL(m)'].values.reshape(-1,1) # target feature

In [6]:
X.shape, y.shape

((1826, 1), (1826, 1))

Making the train test split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size=0.2, random_state=17, shuffle=True
)

#### Model Building

Initialize our CV

In [8]:
kflod = KFold(n_splits=5,shuffle=True, random_state=17)

Initialize the KNN model with default parameters:
- n_neighbors: 5
- weights: uniform
- algorithm: auto
- leaf_size: 30
- p: 2 (power parameter of Minkowski metric)
- metric: minkowski
- metric_params: None

In [9]:
knn = KNeighborsRegressor(n_jobs=-1)
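For reference, the same estimator can be written with every default from the list above spelled out. The variable name `knn_explicit` is only illustrative; the notebook itself keeps using `knn`.

```python
from sklearn.neighbors import KNeighborsRegressor

# Same configuration as the estimator above, with the defaults made explicit.
knn_explicit = KNeighborsRegressor(
    n_neighbors=5,        # number of neighbours averaged per prediction
    weights="uniform",    # every neighbour contributes equally
    algorithm="auto",     # let scikit-learn choose the search structure
    leaf_size=30,         # only relevant for tree-based search
    p=2,                  # Minkowski power parameter: 2 means Euclidean distance
    metric="minkowski",
    metric_params=None,
    n_jobs=-1,            # use all available cores for the neighbour search
)

# get_params() returns the constructor arguments as a dict.
print(knn_explicit.get_params())
```

Writing the defaults out is optional, but it makes it easier to see which values a later hyperparameter search would be changing.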

As usual, let's first check our CV scores with the default parameters!

In [10]:
results = cross_val_score(
    knn,
    X_train,
    y_train,
    cv=kflod,
    scoring='neg_mean_absolute_error'
)
-results.mean()

0.4486728767123287
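The sign flip is needed because scikit-learn scorers follow a "higher is better" convention, so the MAE is reported as a negative number. As a rough sketch of what `cross_val_score` computes, the loop below reproduces the same mean MAE by hand. It runs on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical, generated here only so the example is self-contained), so its numbers will not match the value above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (hypothetical, not data.csv).
rng = np.random.default_rng(17)
X_demo = rng.uniform(0, 50, size=(200, 1))
y_demo = 6.0 + 0.05 * X_demo[:, 0] + rng.normal(0, 0.3, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=17)
model = KNeighborsRegressor()

# Manual cross-validation: fit on four folds, score on the held-out fold.
fold_mae = []
for train_idx, val_idx in cv.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    pred = model.predict(X_demo[val_idx])
    fold_mae.append(mean_absolute_error(y_demo[val_idx], pred))
manual_mae = np.mean(fold_mae)

# Same quantity via cross_val_score; negate to get back a positive MAE.
sklearn_mae = -cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
).mean()

print(manual_mae, sklearn_mae)
```

Both numbers should agree because a `KFold` with a fixed integer `random_state` yields the same splits each time it is used.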

Note:
- With a cross-validated MAE of about 0.45 (in metres, the unit of the target), this is better than our linear models!