# k-NN regression clearly explained

![](https://i.imgflip.com/6kxcbl.jpg)

## Need statement:

**Consider the following scenario:** A friend of yours would like to sell his house and asks you to help him work out what price it should sell for. The problem is that you don't know how to calculate that price (you've never bought or sold a house). **So what do you do?**


* a) You simply tell your friend that you cannot help him 😕🙄😢
* b) You use your knowledge of mathematics, statistics and astrology to help him, thus recognizing **the value of sincere and true friendship**!!! 😊😊😊

I knew you would choose option b)! That's exactly why we're here!!! 


Now comes the question: how can you help him? Your challenge now is to develop a technique that takes into account **all aspects of the house** so that the price is **as fair as possible**. To help you, I will introduce you to the k-nearest neighbors algorithm (**k-NN** for short); this way you can help your friend, okay? But first... **upvote my notebook, please!**

Part of the work presented here is based on the following books:

![](https://images-na.ssl-images-amazon.com/images/I/41RgG05lZaL._SY344_BO1,204,203,200_.jpg)
[An Introduction to Statistical Learning (link to Amazon)](https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1071614177/)

![](https://images-na.ssl-images-amazon.com/images/I/41TmbdP0EZL._SY344_BO1,204,203,200_.jpg)
[The Elements of Statistical Learning (link to Amazon)](https://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/)

![](https://thumbs.dreamstime.com/b/lets-go-handwritten-white-background-169989567.jpg)

## Please upvote me if you like, ok? (this is really really important to me)

# 1. What is the k-NN algorithm?

In a nutshell, k-NN is a *memory-based* algorithm and requires no model to be fit. It can be used for both **classification** and **regression** problems (with small changes of course). Here I will focus only on the regression problem. If you want I can make another notebook for classification problems, **just ask in the comments, ok?**
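
To make the "small changes" concrete, here is a minimal sketch on made-up data (the arrays and the helper `k_nearest_indices` are hypothetical, not part of this notebook): the neighbor search is identical, and only the last step differs, a mean for regression and a majority vote for classification.

```python
from collections import Counter

import numpy as np

# Hypothetical toy data: five 1-D reference samples
x_ref = np.array([1.0, 2.0, 3.0, 6.0, 7.0])
y_values = np.array([10.0, 12.0, 14.0, 30.0, 32.0])  # numeric targets (regression)
y_labels = np.array(["a", "a", "b", "b", "b"])        # class labels (classification)


def k_nearest_indices(x_ref, x_0, k):
    """Indices of the k reference samples closest to the query point x_0."""
    return np.argsort(np.abs(x_ref - x_0), kind="stable")[:k]


idx = k_nearest_indices(x_ref, x_0=2.2, k=3)

# Regression: average the neighbors' targets
regression_estimate = y_values[idx].mean()

# Classification: majority vote among the neighbors' labels
classification_estimate = Counter(y_labels[idx]).most_common(1)[0][0]

print(regression_estimate, classification_estimate)
```

Nothing is "trained" here: the reference samples are simply kept in memory and consulted at query time.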


Despite its simplicity, the k-NN algorithm is quite competitive. And what do we need to implement this method? Just three things: i) the **k** parameter, ii) some **reference samples** and iii) a **distance** measurement.


## 1.1 k-NN regression algorithm

The algorithm consists of "only" 1 equation:

<img src="https://latex.codecogs.com/svg.image?\LARGE&space;\hat{f}(x_0)&space;=\frac{1}{k}&space;\sum_{x_i&space;\in&space;\mathit{N}_0}y_i" />
[https://latex.codecogs.com/](https://latex.codecogs.com/)

Where:
* $\hat{f}()$ is the estimate of the true (and unknown) function $f()$
* $k$ is the **k** parameter
* $x_0$ is the query point
* $\mathit{N}_0$ is the set of the $k$ points nearest to $x_0$ (taken from the **reference samples**)
* $y_i$ is the value of $f(x_i)$
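
A tiny worked example may help, using made-up numbers: suppose $k=3$ and the three reference samples nearest to $x_0$ have the values $y_1=4$, $y_2=6$ and $y_3=11$. Then the estimate is just their average:

$$\hat{f}(x_0) = \frac{1}{3}\,(4 + 6 + 11) = 7$$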

And now? What is missing? The **distance** measurement...

According to Wikipedia: "*Distance is a numerical measurement of how far apart objects or points are.*" And how can we calculate it? To keep things simple I will use the **Euclidean distance**. Considering an $n$-dimensional space, the Euclidean distance is given by:

<img src="https://latex.codecogs.com/svg.image?\LARGE&space;d(p,&space;q)&space;=&space;\sqrt{\sum_{i=1}^{n}(p_i&space;-&space;q_i)^2}" />

Where:
* $n$ is the dimension of the space
* $p$ and $q$ are the two points whose distance we want to know
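
As a quick sanity check, the formula can be translated directly into NumPy. The two points below are made up for illustration; the built-in `np.linalg.norm` should give the same number as the hand-written version.

```python
import numpy as np

# Two hypothetical points in a 3-dimensional space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Direct translation of the formula: square root of the summed squared differences
d_manual = np.sqrt(np.sum((p - q) ** 2))

# Same quantity via NumPy's built-in vector norm
d_norm = np.linalg.norm(p - q)

print(d_manual, d_norm)  # both are 5.0 (a 3-4-5 triangle)
```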

Other distance measures can be used. You can learn more about it here: [4 Distance Measures for Machine Learning](https://machinelearningmastery.com/distance-measures-for-machine-learning/)
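
For a feeling of how the choice of distance changes the numbers, here is a small sketch with two common alternatives, the Manhattan and Chebyshev distances, computed for the same pair of made-up points. It relies on the `ord` argument of `np.linalg.norm`; these measures are shown only as examples and are not used in the rest of this notebook.

```python
import numpy as np

# Two hypothetical points in a 3-dimensional space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])
diff = p - q

euclidean = np.linalg.norm(diff)              # square root of the sum of squares
manhattan = np.linalg.norm(diff, ord=1)       # sum of absolute differences
chebyshev = np.linalg.norm(diff, ord=np.inf)  # largest absolute difference

print(euclidean, manhattan, chebyshev)
```

Different distances can rank the neighbors differently, so the distance is effectively one more choice to make alongside *k*.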


Okay, now we have everything we need to implement our own k-NN algorithm. **Here we go**?

In [None]:
# general imports
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from matplotlib import pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

        
# sklearn
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

# yellow bricks
from yellowbrick.regressor import residuals_plot

# constant for reproducibility
np.random.seed(42)

# 2 Implementing our own k-NN

Suppose you want to estimate (regress) the following function:

<img src="https://latex.codecogs.com/svg.image?\LARGE&space;f(x)=x^3-2x^2-11x&plus;12" />


As it is a 1-dimensional function, it can be shown in a simple graph. For this we will generate **100** random points (uniformly distributed) in the interval [-5, 5]. Of these **100** points I will display only **80** for now; the other **20** I will use to check the performance of the k-NN algorithm.

In [None]:
def f(x):
    """
    Didactic function
    """
    return x**3 - 2*(x**2) - 11*x + 12


# Generation of random points 
lower_bound = -5
upper_bound = 5
n_points = 100
x = np.random.uniform(lower_bound, upper_bound, n_points)

#Points that will be displayed
x_train = x[0:80]
y_train = f(x_train)

# Points that will be used to measure the performance of the algorithm
x_test = x[80:100]
y_test = f(x_test)

# Function graph
plt.figure(figsize=(8,8))
sns.scatterplot(x=x_train, y=y_train)
plt.legend(['Train points'])
plt.show()

## 2.1 How does knn work?


Suppose I wanted to estimate the value of $f(-2)$ using the k-NN algorithm. The first step is to choose the parameter *k* (and trust me, this is not an easy task...). In our example we are going to experiment with k=2, 3 and 5 (this means that we will use the average of the 2 nearest points, the 3 nearest points and so on...). First with k=2.




In [None]:
def knn(x_train, y_train, x_0, k=2):
    """
    Simulate the k-NN algorithm 
    
    :params:
    x_train: x train samples
    
    y_train: y train samples
    
    x_0: query point
    
    k: Number of nearest neighbors taken into account
    """
    
    # calculates the Euclidean distance between the query point and the training set
    distances = [np.linalg.norm(x - x_0) for x in x_train] 
    
    # for each point in the set x_train, create a record containing the distance (to point x_0)
    # and the corresponding value in y_train
    result = []
    for d, y in zip(distances, y_train):
        result.append((d, y))
    
    # sort the list by distance
    result.sort(key=lambda tup: tup[0]) 
    
    # transform to a numpy object to facilitate operations
    result = np.array(result)
    
    # get the first k results (only the second column,
    # which contains the values of y_train)
    k_results = result[:k, 1]
    
    # calculate the mean of the k_results
    return np.mean(k_results)


$\hat{f}(-2)$ for $k=2$

In [None]:
y_hat = knn(x_train, y_train, -2, k=2)
y_hat

Real value for $f(-2)$

In [None]:
y_real = f(-2)
y_real

Let's check the difference between them

In [None]:
y_real - y_hat

Note that the difference between the estimated value and the actual value of the function is very small. Now let's test with $k=3$

In [None]:
y_hat = knn(x_train, y_train, -2, k=3)
y_hat

In [None]:
y_real - y_hat

Notice that the difference changes a little bit. Now let's try $k=5$

In [None]:
y_hat = knn(x_train, y_train, -2, k=5)
y_hat

In [None]:
y_real - y_hat

Note that the algorithm **is very sensitive** to the parameter k. 

Generally, low values of *k* make the algorithm more flexible but give it a greater variance, that is, **low bias but high variance**.

As *k* grows, the algorithm becomes less flexible, that is, **high bias but low variance**.

To learn more about bias and variance, visit [Bias–variance tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)
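
The trade-off can also be seen numerically. The sketch below is self-contained and uses hypothetical helpers (`knn_predict`, `rmse`) and freshly generated points, not the notebook's own variables: it sweeps a few values of *k* on the same cubic function and records the RMSE on the reference points and on unseen points.

```python
import numpy as np


def f(x):
    # same didactic cubic used in this notebook
    return x**3 - 2 * x**2 - 11 * x + 12


def knn_predict(x_ref, y_ref, x_query, k):
    """Average the targets of the k reference points nearest to each query point (1-D inputs)."""
    dist = np.abs(x_query[:, None] - x_ref[None, :])  # shape (n_query, n_ref)
    nearest = np.argsort(dist, axis=1)[:, :k]          # k closest reference points per query
    return y_ref[nearest].mean(axis=1)


def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


rng = np.random.default_rng(0)  # arbitrary seed, just for repeatability
x_ref = rng.uniform(-5, 5, 80)
x_new = rng.uniform(-5, 5, 20)
y_ref, y_new = f(x_ref), f(x_new)

errors = {}
for k in (1, 3, 10, 40, 80):
    train_rmse = rmse(y_ref, knn_predict(x_ref, y_ref, x_ref, k))
    test_rmse = rmse(y_new, knn_predict(x_ref, y_ref, x_new, k))
    errors[k] = (train_rmse, test_rmse)
    print(f"k={k:2d}  train RMSE={train_rmse:8.3f}  test RMSE={test_rmse:8.3f}")
```

With $k=1$ the error on the reference points is exactly zero, because every point is its own nearest neighbor (maximum flexibility). With $k=80$ all 80 reference points are used for every query, so the prediction collapses to a single constant, the mean of the targets (maximum bias).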

To demonstrate this statement, let's do the following experiment: let's use $k=80$ (the size of the whole training set) and calculate $\hat{f}(-3)$, $\hat{f}(0)$, $\hat{f}(2)$

In [None]:
knn(x_train, y_train, -3, k=80)

In [None]:
knn(x_train, y_train, 0, k=80)

In [None]:
knn(x_train, y_train, 2, k=80)

Interestingly, all three values are essentially the same: with $k=80$ every training point is a neighbor, so each estimate is simply the mean of *y_train*

In [None]:
np.mean(y_train)

## Applying the algorithm to all test set points

Now let's see how the algorithm behaves for **all points in the test set**. After the calculation I will show the graph with the test data and estimated values. 

In [None]:
# Calculates the estimated value of y for all points in the test set
# uses k=3

y_hat = [knn(x_train, y_train, x, k=3) for x in x_test]

# Function graph
plt.figure(figsize=(8,8))
sns.scatterplot(x=x_test,y=y_test)
sns.scatterplot(x=x_test, y=y_hat)
plt.legend(['Test points', 'Estimated point'])
plt.show()

But is our estimator any good? Let's use two metrics: the root mean squared error (RMSE), computed from sklearn's mean squared error regression loss, and the $R^2$ (coefficient of determination) regression score function.
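
Before calling sklearn, it may help to see what these two metrics compute. The sketch below uses made-up values (`y_true` and `y_pred` are hypothetical) and compares the hand-written formulas with `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean of the squared errors
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The same quantities through sklearn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

RMSE is expressed in the same unit as the target (lower is better), while $R^2$ has no unit: it is 1 for a perfect fit and 0 for a model that always predicts the mean of the true values.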


### RMSE

To learn more about the RMSE metric: [Mean squared error regression loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)

In [None]:
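# RMSE is the square root of the MSE; taking the root explicitly avoids the
# `squared` argument, which newer scikit-learn versions no longer accept
np.sqrt(mean_squared_error(y_test, y_hat))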

### $R^2$


To learn more about the $R^2$ metric: [coefficient of determination](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)

In [None]:
r2_score(y_test, y_hat)

### Oh, it looks really good!!!!


### Now let's implement in sklearn style

In [None]:
class MyOwnKnnRegression(BaseEstimator):
    """A didactic k-NN regressor with an sklearn-style interface."""

    def __init__(self, k_neighbors):
        """
        :params:
        k_neighbors: number of nearest neighbors averaged for each prediction
        """
        # stored under the same name as the argument (sklearn convention)
        self.k_neighbors = k_neighbors

    def fit(self, X, y):
        # k-NN is memory-based: "fitting" only stores the training points
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        # estimates the outcome for every query point in X
        return np.array([self.knn(x) for x in X])

    def knn(self, x_0):
        """
        Estimate the outcome for a single query point

        :params:
        x_0: query point
        """

        # calculates the Euclidean distance between the query point and the training set
        distances = np.array([np.linalg.norm(x - x_0) for x in self.X_train])

        # indices of the k nearest training points (sorted by distance)
        nearest = np.argsort(distances, kind="stable")[:self.k_neighbors]

        # calculate the mean outcome of the k nearest neighbors
        return np.mean(self.y_train[nearest])
    

Does it work? Let's check it out...

In [None]:
knn = MyOwnKnnRegression(k_neighbors=3)
knn.fit(x_train, y_train)
y_hat = knn.predict(x_test)

# Function graph
plt.figure(figsize=(8,8))
sns.scatterplot(x=x_test,y=y_test)
sns.scatterplot(x=x_test, y=y_hat)
plt.legend(['Test points', 'Estimated point'])
plt.show()

### RMSE

To learn more about the RMSE metric: [Mean squared error regression loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)

In [None]:
# take the square root so that the value is the RMSE (not the MSE)
np.sqrt(mean_squared_error(y_test, y_hat))

### $R^2$


To learn more about the $R^2$ metric: [coefficient of determination](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)

In [None]:
r2_score(y_test, y_hat)

#### Looks like it worked... now let's apply it to a slightly more difficult problem... 

# 3. House Prices - Advanced Regression Techniques

![](https://azbigmedia.com/wp-content/uploads/2020/08/selling-home.jpg)



Let's go back to the original problem of helping your friend sell his house at a fair price. 

Now that you know the k-NN algorithm you can apply it to this problem!

First, let's see how the data is...

In [None]:
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
print(df.shape)
df.head()

Let's do a little exploratory data analysis + data engineering

In [None]:
count = 1
for c in df.columns:
    print(f'{count} - {c}')
    print(f'- # of unique elements: {df[c].nunique()}')
    print(f'- Sample: {df[c].unique()[0:20]}')
    print(f'- Dtype: {df[c].dtype}')
    print(f'- # of missing values: {df[c].isnull().sum()} of {df.shape[0]}')
    print(f'- % of missing values: {np.round(df[c].isnull().sum() / df.shape[0], 3)}')
    
    
    if pd.api.types.is_numeric_dtype(df[c]):
        s = "- Statistics:\n"

        me = np.round(df[c].mean(), 2)
        st = np.round(df[c].std(), 2)
        s += f"-- Mean (std): {me} ({st})\n"

        q1 = np.round(df[c].quantile(0.25), 2)
        q2 = np.round(df[c].quantile(0.5), 2)
        q3 = np.round(df[c].quantile(0.75), 2)
        s += f"-- Quantiles: q1={q1}, q2={q2}, q3={q3}\n"
        s += f"-- Min {df[c].min()}\n"
        s += f"-- Max {df[c].max()}"    
        print(s)
        
    print('='*30)
    count += 1

What do we have?

* Numerical variables:
    * LotFrontage
    * LotArea
    * OverallQual
    * OverallCond
    * YearBuilt
    * YearRemodAdd
    * MasVnrArea
    * BsmtFinSF2
    * BsmtUnfSF
    * TotalBsmtSF
    * 1stFlrSF
    * 2ndFlrSF
    * LowQualFinSF
    * GrLivArea
    * BsmtFullBath
    * BsmtHalfBath
    * FullBath
    * HalfBath
    * BedroomAbvGr
    * KitchenAbvGr
    * TotRmsAbvGrd
    * Fireplaces
    * GarageYrBlt
    * GarageCars
    * GarageArea
    * WoodDeckSF
    * OpenPorchSF
    * EnclosedPorch
    * 3SsnPorch
    * ScreenPorch
    * PoolArea
    * MiscVal
    * MoSold
    * YrSold 
* Categorical variables:
    * MSSubClass
    * MSZoning
    * LotShape
    * Alley
    * LandContour
    * LotConfig
    * LandSlope
    * Neighborhood
    * Condition1
    * Condition2
    * BldgType
    * HouseStyle 
    * RoofStyle
    * RoofMatl
    * Exterior1st
    * Exterior2nd
    * MasVnrType
    * ExterQual
    * ExterCond
    * Foundation
    * BsmtQual
    * BsmtCond
    * BsmtExposure
    * BsmtFinType1
    * BsmtFinType2
    * Heating
    * HeatingQC
    * Electrical
    * KitchenQual
    * Functional
    * FireplaceQu
    * GarageType
    * GarageFinish
    * GarageQual
    * GarageCond
    * PavedDrive
    * PoolQC
    * Fence
    * MiscFeature
    * SaleType
    * SaleCondition
* Binary variables:
    * Street
    * Utilities
    * CentralAir
* Columns to drop:
    * Id (Not interesting for the model)
* Outcome:
    * SalePrice

Let's create a (very very very simple) pipeline for the predictor variables, which performs the following tasks:

* Numeric features:
    * Imputer: KNNImputer (k=5)
    * Scaler: StandardScaler
* Categorical features:
    * Imputer: Most frequent
    * Encoder: One Hot Encoder
* Binary features:
    * Imputer: Most frequent
    * Encoder: Ordinal Encoder

This pipeline will be used later, at the time of the experiments, ok?
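
Why bother with the scaler? Because k-NN relies on distances, a feature measured in large units can dominate all the others. The sketch below uses three made-up houses and a made-up query (all hypothetical values) to show that the nearest neighbor can change once the features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [lot area in square feet, number of bedrooms]
houses = np.array([
    [8000.0, 2.0],
    [8100.0, 5.0],
    [9000.0, 2.0],
])
query = np.array([[8400.0, 2.0]])

# Without scaling, the lot area (hundreds of units) swamps the bedrooms (a few units)
raw_dist = np.linalg.norm(houses - query, axis=1)
raw_nearest = int(np.argmin(raw_dist))

# With standardization, both features contribute on a comparable scale
scaler = StandardScaler().fit(houses)
scaled_dist = np.linalg.norm(scaler.transform(houses) - scaler.transform(query), axis=1)
scaled_nearest = int(np.argmin(scaled_dist))

print(raw_nearest, scaled_nearest)
```

On the raw data the closest house is the 5-bedroom one (index 1), simply because its lot area is 100 square feet closer; after scaling, the closest house is the 2-bedroom one (index 0).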

In [None]:
# numerical features
numeric_features = ["LotFrontage", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd",
                    "MasVnrArea", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF",
                    "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath","FullBath","HalfBath","BedroomAbvGr",
                    "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageYrBlt","GarageCars", "GarageArea","WoodDeckSF",
                    "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch","PoolArea","MiscVal","MoSold","YrSold"]
numeric_transformer = Pipeline(
    steps=[("imputer", KNNImputer(n_neighbors=5)), 
           ("scaler", StandardScaler())]
)

# Alternative (not used here): median imputation instead of KNNImputer
# Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
#                 ("scaler", StandardScaler())])

# categorical features
categorical_features = ["MSSubClass", "MSZoning", "LotShape", "Alley", "LandContour", "LotConfig", "LandSlope",
                        "Neighborhood","Condition1", "Condition2","BldgType","HouseStyle", "RoofStyle","RoofMatl",
                        "Exterior1st", "Exterior2nd", "MasVnrType", "ExterQual", "ExterCond","Foundation",
                        "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1","BsmtFinType2","Heating",
                        "HeatingQC", "Electrical", "KitchenQual", "Functional", "FireplaceQu","GarageType",
                        "GarageFinish", "GarageQual", "GarageCond","PavedDrive","PoolQC","Fence","MiscFeature","SaleType",
                        "SaleCondition"
]
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), 
           ("ohe", OneHotEncoder(handle_unknown="ignore"))])
    
# binary features
binary_features = ["Street", "Utilities", "CentralAir"]
binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), 
           ("ordinal", OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("bin", binary_transformer, binary_features),
    ]
)

Now let's separate the predictor variables ( X ) and the outcome ( y )...

In [None]:
X = df.drop(columns=['SalePrice'])
y = df['SalePrice']



Now let's split the data into training and test sets, with 70% and 30% of the samples respectively. These sets will be used in the following experiments...

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'X_train shape {X_train.shape}')
print(f'y_train shape {y_train.shape}')
print('-'*20)
print(f'X_test shape {X_test.shape}')
print(f'y_test shape {y_test.shape}')

## Now let's apply the k-NN method we created and see the result...

In [None]:
X_train_transformed = preprocessor.fit_transform(X_train).toarray()
X_test_transformed = preprocessor.transform(X_test).toarray()


knn = MyOwnKnnRegression(k_neighbors=3)
knn.fit(X_train_transformed, y_train)
y_hat = knn.predict(X_test_transformed)

### RMSE

To learn more about the RMSE metric: [Mean squared error regression loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)

In [None]:
# RMSE is the square root of the MSE
np.sqrt(mean_squared_error(y_test, y_hat))

### $R^2$


To learn more about the $R^2$ metric: [coefficient of determination](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)

In [None]:
r2_score(y_test, y_hat)

### Residuals plot

To learn more about the Residuals plot: [Residuals Plot](https://www.scikit-yb.org/en/latest/api/regressor/residuals.html)

In [None]:
# Residuals of a linear fit of the predictions (y_hat) on the actual values (y_test)
sns.residplot(x=y_test, y=y_hat)
plt.title('Model Residuals')
plt.xlabel('Actual value')
plt.ylabel('Residual')

### Histogram of prediction errors


In [None]:
diff = y_test - y_hat
diff.hist(bins = 40)
plt.title('Histogram of prediction errors')
plt.xlabel('House price prediction error')
plt.ylabel('Frequency')

### Prediction Error Plot

To learn more about the prediction error plot: [Prediction Error Plot](https://www.scikit-yb.org/en/latest/api/regressor/peplot.html)


In [None]:
plt.scatter(y_test,y_hat)
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.plot(np.unique(y_test), np.poly1d(np.polyfit(y_test, y_hat, 1))(np.unique(y_test)))
plt.show()

## 3.1 Selecting the best parameter k

Now let's choose, in a somewhat rudimentary way, the best parameter *k*. For this we will iterate over the odd values $k=3, 5, 7, ..., 95, 97$. The one with the lowest value for the **RMSE metric** (on the test set) will be chosen. Keep in mind that choosing *k* on the test set makes the reported test RMSE a bit optimistic.


*Later I will write something about gridsearch and cross-validation*
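
In the meantime, here is a rough sketch of what that could look like. It is not the method used in this notebook: it swaps in sklearn's own `KNeighborsRegressor` and `GridSearchCV`, runs on synthetic data generated from the cubic function of section 2, and the names `X_demo`, `y_demo` and `search` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D regression problem (same cubic as in section 2)
rng = np.random.default_rng(0)  # arbitrary seed
X_demo = rng.uniform(-5, 5, size=(200, 1))
y_demo = X_demo[:, 0] ** 3 - 2 * X_demo[:, 0] ** 2 - 11 * X_demo[:, 0] + 12

# Try the same odd values of k, scoring each one with 5-fold cross-validation
search = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(3, 98, 2))},
    scoring="neg_root_mean_squared_error",  # sklearn maximizes scores, hence the negated RMSE
    cv=5,
)
search.fit(X_demo, y_demo)

best_k_cv = search.best_params_["n_neighbors"]
best_rmse_cv = -search.best_score_
print(f"Best k: {best_k_cv}")
print(f"Cross-validated RMSE: {best_rmse_cv}")
```

The point of this design is that *k* is selected using only held-out folds of the training data, so the test set stays untouched for the final evaluation.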

In [None]:
results = []
results_train = []
possibles_k = np.arange(3,99, 2)

for k in tqdm(possibles_k):
    knn = MyOwnKnnRegression(k_neighbors=k)
    knn.fit(X_train_transformed, y_train)
    y_hat = knn.predict(X_test_transformed)
    
    results.append(np.sqrt(mean_squared_error(y_test, y_hat)))
    
    y_hat = knn.predict(X_train_transformed)
    results_train.append(np.sqrt(mean_squared_error(y_train, y_hat)))
    
    
idx = np.argmin(results)   
print(f"Best k: {possibles_k[idx]}")
print(f"Best RMSE: {results[idx]}")

In [None]:
best_k = possibles_k[idx]

plt.figure(figsize=(8,8))
sns.lineplot(x=possibles_k, y=results)
sns.lineplot(x=possibles_k, y=results_train)
plt.title('RMSE variation in relation to parameter k (lower is better)')
plt.xlabel('K')
plt.ylabel('RMSE')
plt.axvline(x = best_k, color = 'black', label = 'axvline - full height')
plt.legend(['Test points', 'Train point', 'best-k'])
plt.show()

Let's check the result...

In [None]:


knn = MyOwnKnnRegression(k_neighbors=best_k)
knn.fit(X_train_transformed, y_train)
y_hat = knn.predict(X_test_transformed)

### RMSE

To learn more about the RMSE metric: [Mean squared error regression loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)

In [None]:
# RMSE is the square root of the MSE
np.sqrt(mean_squared_error(y_test, y_hat))

### $R^2$


To learn more about the $R^2$ metric: [coefficient of determination](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)

In [None]:
r2_score(y_test, y_hat)

### Residuals plot

To learn more about the Residuals plot: [Residuals Plot](https://www.scikit-yb.org/en/latest/api/regressor/residuals.html)

In [None]:
# Residuals of a linear fit of the predictions (y_hat) on the actual values (y_test)
sns.residplot(x=y_test, y=y_hat)
plt.title('Model Residuals')
plt.xlabel('Actual value')
plt.ylabel('Residual')

### Histogram of prediction errors

In [None]:
diff = y_test - y_hat
diff.hist(bins = 40)
plt.title('Histogram of prediction errors')
plt.xlabel('House price prediction error')
plt.ylabel('Frequency')

### Prediction Error Plot

To learn more about the prediction error plot: [Prediction Error Plot](https://www.scikit-yb.org/en/latest/api/regressor/peplot.html)


In [None]:
plt.scatter(y_test,y_hat)
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.plot(np.unique(y_test), np.poly1d(np.polyfit(y_test, y_hat, 1))(np.unique(y_test)))
plt.show()

# 4. (Some) Conclusions:

1. I tried to show here the main concepts related to the k-NN algorithm
1. Obviously the algorithm I implemented is a rudimentary version and needs optimization
1. If you have any suggestions (or criticisms), leave them in the comments...
1. We can see that the algorithm is very simple and, at the same time, very competitive!
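
About the optimization mentioned above: one option (an assumption on my side, it is not used anywhere in this notebook) is sklearn's `KNeighborsRegressor`. The sketch below checks, on random made-up data, that with its default settings it returns the same predictions as a brute-force average of the nearest neighbors; `brute_force_knn` is a hypothetical helper written just for this comparison.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: 50 reference samples with 4 features, 5 query points
rng = np.random.default_rng(0)  # arbitrary seed
X_ref = rng.normal(size=(50, 4))
y_ref = rng.normal(size=50)
X_query = rng.normal(size=(5, 4))


def brute_force_knn(X_ref, y_ref, X_query, k):
    """Vectorized k-NN regression: mean target of the k nearest reference samples."""
    # Euclidean distance between every query point and every reference sample
    dist = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_ref[nearest].mean(axis=1)


manual = brute_force_knn(X_ref, y_ref, X_query, k=3)

# Defaults: uniform weights and Euclidean distance
model = KNeighborsRegressor(n_neighbors=3)
library = model.fit(X_ref, y_ref).predict(X_query)

print(np.allclose(manual, library))
```

Both versions compute the same averages; the library version is mainly more convenient, since it plugs directly into sklearn pipelines and model selection tools.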

# 5. Generating the prediction for the test data



In [None]:
# Load test data
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
df_test

In [None]:
_ids = df_test['Id']

X_test_transformed = preprocessor.transform(df_test).toarray()
y_hat = knn.predict(X_test_transformed)

result = {
    'Id':[],
    'SalePrice':[]
}


for _id, price in zip(_ids, y_hat):
    result['Id'].append(_id)
    result['SalePrice'].append(price)

result_df = pd.DataFrame(result)
result_df

In [None]:
# Save submission
result_df.to_csv('submission.csv', index=False, header=True)