In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Hands-on implementation of K-NN in Python**

Now that we have a good understanding of the K-NN algorithm, let us see how it can be applied to a real-life problem. Here, we will implement K-NN for a regression problem, where we need to predict the values of a continuous variable.

To implement K-NN regression in Python, let us take an example where the marketing expenses and profits of 200 companies are given. We need to fit a regression model that predicts profit given marketing spend as input. To do so, we will first train a K-NN regression model on the marketing spend and profits of a subset of these companies.

Once the model is trained, it captures the relationship between marketing spend and profit. K-NN does not fit an explicit formula; it stores the training records and, when we give a marketing spend as input, predicts the profit from the profits of the companies with the most similar marketing spend.
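
To make the prediction step concrete, here is a minimal from-scratch sketch of K-NN regression with NumPy. The function name `knn_predict` and the toy arrays `X_toy` and `y_toy` are hypothetical and only illustrate the idea; the rest of this walkthrough uses scikit-learn instead.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=2):
    """Predict the target for x_query as the mean target of its k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Unweighted average of the neighbours' targets
    return y_train[nearest].mean()

# Hypothetical toy data: marketing spend (input) and profit (output)
X_toy = np.array([[100.0], [200.0], [300.0], [400.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

# The two nearest spends to 210 are 200 and 300, so the prediction is (20 + 30) / 2
prediction = knn_predict(X_toy, y_toy, np.array([210.0]), k=2)
print(prediction)  # 25.0
```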

This hands-on implementation follows the steps below.

**Step 1: Reading the dataset**

As the first step of the implementation, we will read the dataset described above into the program.

In [2]:
data = pd.read_csv(
    'https://gitlab.com/AnalyticsIndiaMagazine/practicedatasets/-/raw/main/bootcamp/decision_tree_regression/Profit.csv'
    )

In [3]:
data.head()

   Marketing Spend     Profit
0         471784.1  192261.83
1        443898.53  191792.06
2        407934.54  191050.39
3        383199.62  182901.99
4        366168.42  166187.94


As we can see, the dataset has two columns: the marketing spend and the profit. Let’s check the shape of this dataset to find the total number of records.

In [4]:
# Checking the shape of the data
print('Shape of the dataset (No. of rows, No. of columns):', data.shape)

Shape of the dataset (No. of rows, No. of columns): (200, 2)


As we can see in the output, this dataset has 200 records and two features.

**Step 2: Defining the input-output features**

After reading the dataset into the program, we will define the input and output features. As discussed above, the input feature is the marketing spend and the output feature is the profit earned. So, let's define them now.

In [5]:
# Defining input and output features
X = data.iloc[:, 0:-1].values
y = data.iloc[:, -1].values

The input feature, i.e., marketing spend, has been defined as X and the output feature, i.e., profit, has been defined as y. Now, let us check the shapes of the input and output features.

In [6]:
# Checking the shape of input and output features
print('Shape of the input features:', X.shape)
print('Shape of the output features:', y.shape)

Shape of the input features: (200, 1)
Shape of the output features: (200,)


As we can see in the above output, the input feature matrix has a shape of (200, 1) while the output feature is an array with 200 items. 
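
The shapes differ because of how the columns were selected. The sketch below uses a small hypothetical DataFrame named `toy` (not the real dataset) to show that a column slice such as `iloc[:, 0:-1]` keeps a two-dimensional array, while a single column index such as `iloc[:, 0]` returns a one-dimensional one. scikit-learn estimators expect the input features as a 2-D array, which is why the slice form is used for X.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Profit dataset
toy = pd.DataFrame({
    'Marketing Spend': [100.0, 200.0, 300.0],
    'Profit': [10.0, 20.0, 30.0],
})

X_2d = toy.iloc[:, 0:-1].values  # column slice: stays 2-D, shape (3, 1)
X_1d = toy.iloc[:, 0].values     # single column index: 1-D, shape (3,)
y_1d = toy.iloc[:, -1].values    # last column as a 1-D target, shape (3,)

print(X_2d.shape, X_1d.shape, y_1d.shape)
```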

**Step 3: Defining the training-test features**

Now that we have defined the input and output features, let us define the training and test patterns on which the K-NN model will be trained and tested. We will use 90% of the data for training and 10% for testing.

In [7]:
# Defining the training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Here, we have defined the training patterns as X_train and y_train which are the training input pattern and training output pattern respectively. Similarly, we have defined the test patterns as X_test and y_test which are test input patterns and test output patterns respectively. 

After having the training and test patterns defined, let's check their shape.

In [8]:
# Checking the shape of the training and test sets
print('Shape of the training input data:', X_train.shape)
print('Shape of the training output data:', y_train.shape)
print('Shape of the test input data:', X_test.shape)
print('Shape of the test output data:', y_test.shape)

Shape of the training input data: (180, 1)
Shape of the training output data: (180,)
Shape of the test input data: (20, 1)
Shape of the test output data: (20,)


As we can see in the output, there are 180 records that will be used for training and 20 records that will be used for testing. 
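
The `random_state` argument is what makes this split reproducible. As a small sketch on synthetic data (the arrays `X_demo` and `y_demo` are hypothetical stand-ins with the same number of rows as the dataset), two calls with the same `random_state` return identical partitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 200 rows and one input feature
X_demo = np.arange(200, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel()

# Same test_size and random_state as in the walkthrough, called twice
split_a = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(split_a[0].shape, split_a[1].shape, same_split)
```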

**Step 4: Defining and training a K-NN regression model**

As we have the training patterns ready, let's build the K-NN model for regression.

**Step 4.1: Initializing a K-NN Regression model**

First of all, we will import the K-NN regressor from scikit-learn and initialize the K-NN regression model.

In [9]:
# Defining a KNN Regression model
from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor()

The K-NN regression model has been initialized and stored in the object regressor.

**Step 4.2: Hyperparameter tuning**

To define the model, we need to know how many neighbours give the best results. For this purpose, we will use grid search with 10-fold cross-validation for hyperparameter tuning, trying every value of K from 1 to 20 and scoring each one by R-squared.

In [10]:
# Finding the optimal value of K
from sklearn.model_selection import GridSearchCV

k_range = list(range(1, 21))
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(regressor, param_grid, cv=10, scoring='r2', return_train_score=False, verbose=0)
grid.fit(X_train, y_train)
print(grid.best_params_)

{'n_neighbors': 2}


As we can see in the output, the optimal number of neighbours is 2.
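
Conceptually, the grid search scores each candidate K by cross-validation and keeps the one with the highest mean score. The loop below is a rough sketch of that idea using `cross_val_score` on synthetic data; `X_demo`, `y_demo`, and `mean_scores` are hypothetical names, and the best K found here says nothing about the Profit dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(180, 1))
y_demo = 3.0 * X_demo.ravel() + rng.normal(0, 5, size=180)

# Mean 10-fold cross-validated R-squared for each candidate K
mean_scores = {}
for k in range(1, 21):
    model = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring='r2')
    mean_scores[k] = scores.mean()

# Keep the K with the highest mean score
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

The full table of scores that `GridSearchCV` itself collects is available in its `cv_results_` attribute after fitting.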

**Step 4.3: Defining and training the K-NN Regression model**

Using the optimal number of neighbours obtained in the last step, we will define the K-NN regression model and fit it to the training patterns.

In [11]:
# Defining the KNN regressor with optimal value of K
regressor = KNeighborsRegressor(n_neighbors=2)
regressor.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=2)
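
As an alternative to re-creating the regressor by hand, a fitted `GridSearchCV` object exposes the winning model as `best_estimator_`, because by default (`refit=True`) it refits the best configuration on the whole training set. The sketch below, on hypothetical synthetic data (`X_demo`, `y_demo`), checks that this model and a manually defined one give the same predictions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(180, 1))
y_demo = 3.0 * X_demo.ravel() + rng.normal(0, 5, size=180)

# Same search settings as in the walkthrough
grid_demo = GridSearchCV(
    KNeighborsRegressor(),
    {'n_neighbors': list(range(1, 21))},
    cv=10,
    scoring='r2',
)
grid_demo.fit(X_demo, y_demo)

# Model refitted by the grid search on all of X_demo
best_model = grid_demo.best_estimator_
# Model defined manually from the best parameters, as in Step 4.3
manual_model = KNeighborsRegressor(**grid_demo.best_params_).fit(X_demo, y_demo)

same_predictions = np.allclose(
    best_model.predict(X_demo[:5]), manual_model.predict(X_demo[:5])
)
print(grid_demo.best_params_, same_predictions)
```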

**Step 5: Predicting and evaluating the predictions**

Now that the K-NN model is trained, we will make predictions with it using the test data.

In [12]:
# Making predictions on the test data
y_pred = regressor.predict(X_test)

The predicted results on the test data are saved in y_pred. We will put these predicted profits alongside the actual profits in a data frame to see the differences.

In [13]:
# Comparing the predicted profits with actual profits
pd.DataFrame(data={'Predicted Profit': y_pred, 'Actual Profit': y_test})

   Predicted Profit  Actual Profit
0          64921.08       64926.08
1        128429.985      129917.04
2          99957.59       99937.59
3         152211.77      152161.77
4         103257.38      103322.38
5        128429.985      129957.04
6         79225.135      122776.86
7         118474.03      118424.03
8         108552.04      108502.04
9          64921.08       64926.08


As we can see, for most records there is not much difference between the actual and predicted values, although row 6 is a clear exception (79225.135 predicted against 122776.86 actual). Let's quantify the accuracy of the predictions using evaluation metrics.
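
One way to spot such outliers is to add an absolute-error column to the comparison. The sketch below re-enters the ten rows shown in the output above by hand; the names `comparison` and `worst_row` are hypothetical, and in the notebook the same column could be computed from `y_pred` and `y_test` directly.

```python
import pandas as pd

# The ten rows from the comparison output above, entered by hand
comparison = pd.DataFrame({
    'Predicted Profit': [64921.08, 128429.985, 99957.59, 152211.77, 103257.38,
                         128429.985, 79225.135, 118474.03, 108552.04, 64921.08],
    'Actual Profit': [64926.08, 129917.04, 99937.59, 152161.77, 103322.38,
                      129957.04, 122776.86, 118424.03, 108502.04, 64926.08],
})

# Size of each prediction error, ignoring its sign
comparison['Absolute Error'] = (
    comparison['Actual Profit'] - comparison['Predicted Profit']
).abs()

# Index label of the row with the largest error
worst_row = comparison['Absolute Error'].idxmax()
print(worst_row, comparison.loc[worst_row, 'Absolute Error'])
```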

First, we will obtain the mean squared error (MSE) between the actual (y_test) and predicted (y_pred) values.

In [14]:
# Mean Squared Error (MSE)
from sklearn.metrics import mean_squared_error
MSE=mean_squared_error(y_test, y_pred)
print('Mean Squared Error is:', MSE)

Mean Squared Error is: 95169375.35473497


Because this is the mean of the squared errors, it is expressed in squared profit units, which makes its magnitude hard to interpret. Let's find the root mean squared error (RMSE), which is the square root of the MSE.

In [15]:
# Root Mean Squared Error (RMSE)
import math
RMSE = math.sqrt(MSE)
print('Root Mean Squared Error is:', RMSE)

Root Mean Squared Error is: 9755.479247824525


The RMSE is expressed in the same units as profit, so it indicates the typical size of a prediction error: about 9755.48 between the actual and predicted profits.
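
Both metrics are simple to compute by hand, which can help in interpreting them. The following sketch uses small hypothetical arrays (`y_true_demo`, `y_pred_demo`) rather than the notebook's test data.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 300.0, 420.0])

# Per-record errors: -10, 10, 0, -20
errors = y_true_demo - y_pred_demo

# MSE: mean of the squared errors, (100 + 100 + 0 + 400) / 4
mse_demo = np.mean(errors ** 2)
# RMSE: square root of the MSE, back in the units of the target
rmse_demo = np.sqrt(mse_demo)

print(mse_demo)   # 150.0
print(rmse_demo)  # approximately 12.247
```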

To find how well the K-NN regression model fits the data, we will obtain R-squared, which is a measure of the goodness of fit of regression models.

In [16]:
# R-Squared
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print('R-Squared is:', r2)

R-Squared is: 0.8858429403678624


The value of R-squared is close to 1, which is its maximum (on test data it can even be negative for a model that predicts worse than the mean). We can therefore say that the model fits well: it explains about 89% of the variance in the test profits, so its predictions should be satisfactory.
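
R-squared compares the model's squared errors with those of a baseline that always predicts the mean of the actual values. The sketch below computes it by hand on hypothetical arrays and checks the result against `r2_score`; the helper name `r_squared` is an assumption made for illustration. It also shows that a poor model can score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # errors of the mean baseline
    return 1.0 - ss_res / ss_tot

# Hypothetical actual values and two sets of predictions
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_good = np.array([110.0, 190.0, 300.0, 420.0])   # close to the actual values
y_bad = np.array([400.0, 300.0, 200.0, 100.0])    # worse than predicting the mean

r2_good = r_squared(y_true_demo, y_good)
r2_bad = r_squared(y_true_demo, y_bad)

print(r2_good, r2_score(y_true_demo, y_good))  # both approximately 0.988
print(r2_bad, r2_score(y_true_demo, y_bad))    # both -3.0
```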

So this is how we can use a K-NN regression model in a real-life regression problem.