## K Nearest Neighbors

*The nearest neighbors method* (k-Nearest Neighbors, or k-NN) follows the intuition that you look like your neighbors. More formally, the method follows the compactness hypothesis: if the distance between the examples is measured well enough, then similar examples are much more likely to belong to the same class.

Many distance metrics can be used to measure how close two examples are. A common choice is the Euclidean distance:

In [None]:
import numpy as np

def dist(x, y):
    # Euclidean distance: square root of the sum of squared coordinate differences
    sum2 = np.sum((x - y) ** 2)
    distance = np.sqrt(sum2)
    return distance
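As a quick sanity check, the distance function can be applied to two points whose distance is known in advance. The points `a` and `b` below are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import numpy as np

def dist(x, y):
    # Euclidean distance: square root of the summed squared differences
    return np.sqrt(np.sum((x - y) ** 2))

# hypothetical points forming a 3-4-5 right triangle
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d = dist(a, b)
print(d)  # 5.0
```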

![Knn Example](../../img/knn1.png)

![Knn Example 2](../../img/knn2.png)

### KNN Algorithm

```python
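import numpy as np

def knn(X_train, y_train, x_query, K):
    # X_train and y_train are NumPy arrays; x_query is a single point
    # distance from the query point to every training point
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # indices of the K nearest training points
    neighbors = np.argsort(distances)[:K]
    # t = average of the target values of the neighbors
    t = np.mean(y_train[neighbors])
    return t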
```

$$\Large Predicted = \frac{1}{K} \sum\limits_{x_i \in N} y_i $$
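To make the formula concrete, the sketch below averages the targets of the K nearest points by hand and compares the result with scikit-learn's `KNeighborsRegressor`. The toy data and the query point are made-up values, and the comparison assumes the regressor's default uniform weighting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy 1-D training set (made-up values, for illustration only)
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

K = 3
x_query = np.array([[2.5]])

# manual prediction: average the targets of the K closest points
distances = np.abs(X[:, 0] - x_query[0, 0])
nearest = np.argsort(distances)[:K]
manual_pred = y[nearest].mean()

# the same prediction from scikit-learn (uniform weights by default)
model = KNeighborsRegressor(n_neighbors=K).fit(X, y)
sklearn_pred = model.predict(x_query)[0]

print(manual_pred, sklearn_pred)
```

Here the three nearest neighbors of 2.5 are the points at 1, 2 and 3, so both predictions are the mean of their targets.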

### Code Dictionary

code | description
-----|------------
`KNeighborsRegressor` | Regression model based on k-nearest neighbors.
`.arange()` | Generates evenly spaced values within a given interval.
`.reshape()` | Gives a new shape to an array without changing its data.

In [None]:
import pandas as pd
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error 
from math import sqrt
%matplotlib inline

## Get the Data

The goal is to predict Big Mart sales from various item and outlet attributes. The training and test sets are read from separate CSV files.

In [None]:
df = pd.read_csv("Bigmart_sales_Train.csv")
df_test = pd.read_csv("Bigmart_sales_Test.csv")

In [None]:
df.head()

## Impute missing values

In [None]:
print(df.isnull().sum())

# missing values in Item_Weight and Outlet_Size need to be imputed
mean = df['Item_Weight'].mean()  # impute Item_Weight with the mean
df['Item_Weight'] = df['Item_Weight'].fillna(mean)  # assign back; chained inplace fillna is deprecated in recent pandas

mode = df['Outlet_Size'].mode()  # impute Outlet_Size with the mode
df['Outlet_Size'] = df['Outlet_Size'].fillna(mode[0])

In [None]:
# do the same for the test data
mean = df_test['Item_Weight'].mean()  # impute Item_Weight with the mean
df_test['Item_Weight'] = df_test['Item_Weight'].fillna(mean)  # assign back instead of chained inplace fillna

mode = df_test['Outlet_Size'].mode()  # impute Outlet_Size with the mode
df_test['Outlet_Size'] = df_test['Outlet_Size'].fillna(mode[0])
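A common alternative is to compute the imputation statistics on the training data only and reuse them for the test data, so both sets are filled with the same values. The sketch below shows that variant on two hypothetical miniature frames; it is not what the cells above do, and the values are made up.

```python
import numpy as np
import pandas as pd

# hypothetical miniature frames with the two columns imputed above
train = pd.DataFrame({"Item_Weight": [10.0, np.nan, 14.0],
                      "Outlet_Size": ["Small", "Small", None]})
test = pd.DataFrame({"Item_Weight": [np.nan, 20.0],
                     "Outlet_Size": [None, "High"]})

# statistics are computed on the training data only
weight_mean = train["Item_Weight"].mean()
size_mode = train["Outlet_Size"].mode()[0]

# fill both frames with the training statistics
for frame in (train, test):
    frame["Item_Weight"] = frame["Item_Weight"].fillna(weight_mean)
    frame["Outlet_Size"] = frame["Outlet_Size"].fillna(size_mode)

print(test)
```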

#### Remove unnecessary columns and deal with categorical variables

In [None]:
df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1, inplace=True)
df = pd.get_dummies(df)
df.head()

In [None]:
df_test.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1, inplace=True)
df_test = pd.get_dummies(df_test)
df_test.head()
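When `pd.get_dummies` is applied to two frames separately, a category that appears in only one of them produces a column the other frame lacks. One way to guard against this, sketched below on hypothetical miniature frames, is to reindex the test frame to the training columns. Any target column that exists only in the training frame would need to be dropped before aligning.

```python
import pandas as pd

# hypothetical frames; the category "High" is absent from the test set
train_raw = pd.DataFrame({"Item_Weight": [10.0, 20.0, 30.0],
                          "Outlet_Size": ["Small", "Medium", "High"]})
test_raw = pd.DataFrame({"Item_Weight": [15.0, 25.0],
                         "Outlet_Size": ["Small", "Medium"]})

train_enc = pd.get_dummies(train_raw)
test_enc = pd.get_dummies(test_raw)

# give the test frame exactly the training columns, filling absent dummies with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```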

#### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
train , test = train_test_split(df, test_size = 0.2)

x_train = train.drop('Item_Outlet_Sales', axis=1)
y_train = train['Item_Outlet_Sales']

x_test = test.drop('Item_Outlet_Sales', axis=1)
y_test = test['Item_Outlet_Sales']

### Standardize the Variables

Because KNN predicts the target of a given test observation from the observations that are nearest to it, the scale of the variables matters. Variables on a large scale have a much larger effect on the distance between observations, and hence on the KNN model, than variables on a small scale.
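The effect of scale on distances can be seen on a tiny made-up array in which one feature spans 0 to 1 and the other spans 0 to 10000. This is only an illustrative sketch; the numbers are not taken from the sales data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up data: column 0 spans 0..1, column 1 spans 0..10000
X = np.array([
    [0.0, 0.0],
    [1.0, 100.0],
    [0.0, 10000.0],
])

def dist(x, y):
    # Euclidean distance between two points
    return np.sqrt(np.sum((x - y) ** 2))

# before scaling, the second feature dominates the distance
raw_d = dist(X[0], X[1])

# after min-max scaling, each feature lies in [0, 1]
X_scaled = MinMaxScaler().fit_transform(X)
scaled_d = dist(X_scaled[0], X_scaled[1])

print(raw_d, scaled_d)
```

Between the first two rows, the small-scale feature changes across its whole range, yet it contributes only 1 of the 10001 units of squared distance before scaling. After scaling it accounts for almost all of the distance.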

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

In [None]:
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)

In [None]:
# reuse the scaler fitted on the training data; do not refit it on the test set
x_test_scaled = scaler.transform(x_test)
x_test = pd.DataFrame(x_test_scaled)
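The usual pattern is for a scaler to learn its minimum and maximum from the training data and then apply that same mapping to any other data. A minimal sketch on made-up single-feature arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30 from the training data
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values beyond the training range map outside [0, 1], which is expected, because the test data is measured on the training scale.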

## Using KNN

Remember that we are trying to come up with a model that predicts `Item_Outlet_Sales` from the other attributes. We'll start with k=1.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
knn = KNeighborsRegressor(n_neighbors=1)

In [None]:
knn.fit(x_train,y_train)

In [None]:
pred = knn.predict(x_test)
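With k=1 every training point is its own nearest neighbor, so the error on the training set is zero whenever the training points are distinct, and only held-out data shows how well the model generalizes. The sketch below illustrates this on synthetic data; the data-generating function, the seed and the split are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

# RMSE on the data the model has seen and on held-out data
train_rmse = np.sqrt(mean_squared_error(y_tr, knn.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, knn.predict(X_te)))

print(train_rmse, test_rmse)
```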

## Choosing a K Value

Let's go ahead and use the elbow method to pick a good K Value:

In [None]:
rmse_val = []  # to store RMSE values for different k
for K in range(16, 31):  # K = 16, 17, ..., 30
    model = KNeighborsRegressor(n_neighbors=K)

    model.fit(x_train, y_train)  # fit the model
    pred = model.predict(x_test)  # make predictions on the test set
    error = sqrt(mean_squared_error(y_test, pred))  # calculate RMSE
    rmse_val.append(error)  # store RMSE values
    print('RMSE value for k =', K, 'is:', error)

In [None]:
plt.figure(figsize=(10,6))
# the x-axis must match the K values evaluated in the loop (16 to 30)
plt.plot(range(16,31),rmse_val,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('RMSE vs. K Value')
plt.xlabel('K')
plt.ylabel('RMSE')
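Besides reading the value off the plot, the K with the lowest RMSE can be picked programmatically. The sketch below does this on synthetic data; the data, the range of K values and the names `k_values` and `best_k` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] + rng.normal(0, 0.2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

k_values = list(range(1, 31))
rmse_val = []
for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    rmse_val.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

# K with the smallest held-out RMSE
best_k = k_values[int(np.argmin(rmse_val))]
print(best_k, min(rmse_val))
```

Storing the evaluated K values in a list keeps them aligned with the errors, so the same list can be used for the plot's x-axis.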