# KNN Regression Model

This notebook provides the code for the KNN Regression model.

At the top of each notebook, all necessary libraries are imported before the actual coding starts. \
Afterwards, functions are defined that are later called to simulate the data, print metrics, and create visualisations. \
The functions are the same for each model within a notebook file.

In [1]:
#importing all necessary packages
import pandas as pd
import numpy as np
import random
import ipywidgets as widgets
from IPython.display import Javascript, display

## Function for Data Simulation

In [2]:
def DataFunction(alpha, rho, intercept, n, min=0, max=10):
    """
    This function generates a dataset with three variables.
    The variables ln(L) and ln(K) are drawn from a uniform distribution between min and max, which is the range from 0 to 10 by default.
    The seed is set to 0 to make the data reproducible.
    The variable ln(Y) is then computed with the Translog function, using the randomly generated values and the parameters alpha, rho, and intercept.
    Afterwards, a dictionary with all values is created.
    The function returns a pandas.DataFrame object with n samples.
    """
    #setting the seed
    np.random.seed(0)
    
    #draw random values
    l_rand = np.random.uniform(min, max, n)
    k_rand = np.random.uniform(min, max, n)
    
    #computing the values for ln(Y) with the Translog production function
    y_TL = intercept + alpha*l_rand + (1-alpha)*k_rand - 1/2*rho*alpha*(1-alpha)*((k_rand-l_rand)**2)
    
    #create a dictionary with all variables
    TL_dict = {'ln(Y)': y_TL, 'ln(L)': l_rand, 'ln(K)': k_rand}
    
    return(pd.DataFrame(TL_dict))
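
As a quick sanity check on the formula used in `DataFunction`, the following sketch re-implements the Translog expression as a standalone helper. The name `translog` and the input values are hypothetical and not part of the notebook. The sketch illustrates that for `rho = 0` the quadratic term vanishes, so ln(Y) is linear in ln(L) and ln(K).

```python
import numpy as np

def translog(l, k, alpha, rho, intercept):
    # same expression as in DataFunction, written for arbitrary inputs
    return intercept + alpha*l + (1 - alpha)*k - 0.5*rho*alpha*(1 - alpha)*(k - l)**2

# made-up input values for illustration
l = np.array([1.0, 2.0, 4.0])
k = np.array([3.0, 2.0, 0.0])

# rho = 0: the quadratic term drops out and only the linear part remains
linear_part = 0.1 + 0.5*l + 0.5*k
y_rho0 = translog(l, k, alpha=0.5, rho=0.0, intercept=0.1)

# rho = 1: subtracts 0.5 * 1 * 0.5 * 0.5 * (k - l)**2 from the linear part
y_rho1 = translog(l, k, alpha=0.5, rho=1.0, intercept=0.1)
```

The larger the gap between ln(K) and ln(L), the more the quadratic term pulls ln(Y) below the linear part when `rho` is positive.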

## Error term 

In [3]:
def error_term(sigma, n, mu=0):
    """
    This function randomly draws n values from a normal distribution.
    When the function is called, the standard deviation and the number of values have to be defined.
    The mean of the distribution is 0 by default.
    Values are returned in form of a numpy.array.
    """
    np.random.seed(0)
    
    u = np.array(np.random.normal(mu, sigma, n))
    
    return(u)
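
Both `DataFunction` and `error_term` reset NumPy's global generator to seed 0, so the noise is drawn from the same starting state as the regressors. If independent streams are preferred, one possible alternative is NumPy's `Generator` API with a separate seed. The function name `error_term_rng` and the seed value below are assumptions for illustration and not part of the notebook. Swapping this in would change the simulated values and therefore the reported metrics.

```python
import numpy as np

def error_term_rng(sigma, n, mu=0, seed=1):
    # hypothetical variant: a local generator leaves the global NumPy state untouched
    rng = np.random.default_rng(seed)
    return rng.normal(mu, sigma, n)

# two calls with the same seed return identical draws, so the result stays reproducible
u1 = error_term_rng(sigma=1.0, n=125)
u2 = error_term_rng(sigma=1.0, n=125)
```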

## Summary function

In [4]:
from sklearn import metrics

def summary(test_values, predicted_values):
    """
    This function computes the root mean squared error (RMSE) and the mean absolute error (MAE).
    It uses the values from a test set and the fitted values to return a pandas.DataFrame object.
    """
    #computing the RMSE and the MAE with the respective functions from the sklearn library
    #the function arguments are used here instead of the global variables y_test and y_pred
    RMSE = (metrics.mean_squared_error(test_values, predicted_values))**(0.5)
    MAE = metrics.mean_absolute_error(test_values, predicted_values)
    
    #create a dictionary with the metrics
    summary_dict = {'Metric': ['RMSE','MAE'],
                       'Value': [RMSE, MAE]}
    
    return(pd.DataFrame(summary_dict))
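
To make the two metrics concrete, here is a small sketch that computes them directly with NumPy. The arrays are made-up numbers for illustration and are not taken from the notebook.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])

errors = y_true - y_hat
rmse = np.sqrt(np.mean(errors**2))   # square, average, then take the root
mae = np.mean(np.abs(errors))        # average of the absolute errors
```

Because the errors are squared before averaging, the single large error weighs more heavily in the RMSE than in the MAE.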

## 3D Plot function

In [5]:
import plotly.express as px
import plotly.graph_objects as go

def Plot_function(model, data):
    """
    This function visualises a regression model for a data set in a 3 dimensional space.
    The plotly package is used and enables interaction with the plot.
    """
    #defining the size of the mesh grid and the margins
    mesh_size = 0.09
    margin = 0
    
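    #extracting the exogenous and endogenous variables from the data argument
    X = data[['ln(L)','ln(K)']].values
    y = data['ln(Y)'].values
    
    #fitting the model to the exogenous and endogenous variables of the whole dataset.
    model.fit(X, y)
    
    #create a mesh grid to later run the model on, with one axis per exogenous variable
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin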
    xrange = np.arange(x_min, x_max, mesh_size)
    yrange = np.arange(y_min, y_max, mesh_size)
    xx, yy = np.meshgrid(xrange, yrange)
    
    #run model
    pred = model.predict(np.c_[xx.ravel(), yy.ravel()])
    pred = pred.reshape(xx.shape)

    #generate the plot
    fig = px.scatter_3d(data, x='ln(L)', y='ln(K)', z='ln(Y)')
    fig.update_traces(marker=dict(size=2))
    fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred, name='pred_surface'))
    fig.show()
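
The mesh grid step in `Plot_function` can be hard to picture, so the following sketch repeats it on a tiny grid. The stand-in prediction (the row sum) is an assumption for illustration; in the notebook, `model.predict` takes its place.

```python
import numpy as np

xrange = np.arange(0, 3, 1.0)   # 3 grid points on the ln(L) axis
yrange = np.arange(0, 2, 1.0)   # 2 grid points on the ln(K) axis
xx, yy = np.meshgrid(xrange, yrange)

# every row is one (ln(L), ln(K)) combination, ready for model.predict
grid_points = np.c_[xx.ravel(), yy.ravel()]

# stand-in for model.predict: any function returning one value per row
fake_pred = grid_points.sum(axis=1)

# back to the grid layout that go.Surface expects for z
surface = fake_pred.reshape(xx.shape)
```

The reshaped array has one row per value in `yrange` and one column per value in `xrange`.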

## Heatmap function

In [6]:
#To prepare the grid of the heatmap, a list is created to represent the pixels on the grid.
#The list holds all value combinations between 0 and 10 in steps of 0.1.
#np.linspace with rounding avoids the floating-point drift of repeatedly adding 0.1
grid_values = np.round(np.linspace(0, 10, 101), 1)
liste = [[i, j] for i in grid_values for j in grid_values]

In [7]:
#The previously created list is then transformed into a numpy.array
data = np.asarray(liste)
#A pandas.DataFrame object is then created and the generated grid values are assigned to the Variables ln(L) and ln(K)
columns = ['ln(L)', 'ln(K)']
df_heatmap = pd.DataFrame(data = data, columns = columns)

In [8]:
import plotly.express as px

def Heatmap_function(model):
    '''
    Returns a heatmap of the given regression model for values between 0 and 10.
    '''
    #defining the grid values
    X_heatmap = df_heatmap[['ln(L)','ln(K)']].values
    
    #compute predictions of the model
    y_heatmap = model.predict(X_heatmap)
    
    #create a pandas.DataFrame with the grid values and the fitted values
    df_heatmap['predicted ln(Y)'] = y_heatmap
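    #keyword arguments are used because newer pandas versions no longer accept them positionally
    heatmap = df_heatmap.pivot(index='ln(L)', columns='ln(K)', values='predicted ln(Y)')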
    
    #displaying the heatmap
    fig = px.imshow(heatmap,labels=dict(color="predicted value"))
    fig.update_yaxes(autorange=True)
    fig.show()
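
The pivot step in `Heatmap_function` turns the long list of grid points into a matrix that `px.imshow` can display. A minimal sketch on four made-up grid points (the values are illustrative only):

```python
import pandas as pd

long_df = pd.DataFrame({
    'ln(L)': [0.0, 0.0, 0.1, 0.1],
    'ln(K)': [0.0, 0.1, 0.0, 0.1],
    'predicted ln(Y)': [1.0, 2.0, 3.0, 4.0],
})

# rows are indexed by ln(L), columns by ln(K), cells hold the predictions
wide = long_df.pivot(index='ln(L)', columns='ln(K)', values='predicted ln(Y)')
```

Each (ln(L), ln(K)) pair must occur only once, otherwise `pivot` raises an error.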

## KNN Regression Model

This is where the actual regression model begins. \
The user defines values for the given parameters by adjusting the sliders and the input field. \
Do not re-execute the cell that creates the sliders afterwards, because this resets the parameters to their defaults. \
Instead, press the "Run all cells below" button to execute the code below with the chosen set of parameters.

As mentioned in the written part of the thesis, the scikit-learn library by Pedregosa et al. (2011) is used to code the different regression models. \
The documentation of the library can be accessed with the following link:
https://scikit-learn.org/stable/ \
Functions that are taken from this package are explained where they occur.

In [9]:
intercept_slider = widgets.FloatSlider(value=0.1, min=0.1, max=1, step=0.1, description='Intercept')
alpha_slider = widgets.FloatSlider(value=0.5, min=0.5, max=1, step=0.1, description='α')

rho_slider = widgets.FloatSlider(value=0, min= 0, max= 1 , step=0.1, description='ρ')
sigma_slider = widgets.FloatSlider(value=1, min=0.5, max=1.5, step=0.25, description='σ')
n_input = widgets.IntText(value = 125, description = 'Samples')

display(intercept_slider,alpha_slider,rho_slider,sigma_slider, n_input)

FloatSlider(value=0.1, description='Intercept', max=1.0, min=0.1)

FloatSlider(value=0.5, description='α', max=1.0, min=0.5)

FloatSlider(value=0.0, description='ρ', max=1.0)

FloatSlider(value=1.0, description='σ', max=1.5, min=0.5, step=0.25)

IntText(value=125, description='Samples')

In [10]:
def run_all(ev):
    display(Javascript('IPython.notebook.execute_cell_range(IPython.notebook.get_selected_index()+1, IPython.notebook.ncells())'))

button = widgets.Button(description="Run all cells below")
button.on_click(run_all)
display(button)

Button(description='Run all cells below', style=ButtonStyle())

## Regression on the Dataset

In [11]:
#calling the data function on the parameter values and assign it to the variable data
data = DataFunction(alpha = alpha_slider.value,
                    intercept = intercept_slider.value,
                    n = n_input.value,
                    rho = rho_slider.value)

In [12]:
#getting the first 5 rows of the DataFrame
data.head()

| | ln(Y) | ln(L) | ln(K) |
|---|---|---|---|
| 0 | 5.876034 | 5.488135 | 6.063932 |
| 1 | 3.771913 | 7.151894 | 0.191932 |
| 2 | 4.621691 | 6.027634 | 3.015748 |
| 3 | 6.125284 | 5.448832 | 6.601735 |
| 4 | 3.668662 | 4.236548 | 2.900776 |


In [13]:
#adding the error term to the values of the endogenous variable
data['ln(Y)'] += error_term(sigma = sigma_slider.value,
                            n = n_input.value)

In [14]:
#getting the first 5 rows of the new DataFrame
data.head()

| | ln(Y) | ln(L) | ln(K) |
|---|---|---|---|
| 0 | 7.640086 | 5.488135 | 6.063932 |
| 1 | 4.17207 | 7.151894 | 0.191932 |
| 2 | 5.600429 | 6.027634 | 3.015748 |
| 3 | 8.366177 | 5.448832 | 6.601735 |
| 4 | 5.53622 | 4.236548 | 2.900776 |


In [15]:
#assigning the exogenous variables to X and the endogenous variable to y
X = data[['ln(L)','ln(K)']].values
y = data[['ln(Y)']].values

The `train_test_split` function splits vectors and matrices into a training set and a test set. \
In this case, the matrix X and the vector y are split. \
The `test_size` is set to 25% of the whole data. \
`random_state = 0` sets a random seed to make the split reproducible, since the data is shuffled before splitting (Pedregosa et al., 2011).

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
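
The effect of `test_size` and `random_state` can be checked on a small stand-in dataset. The arrays `X_demo` and `y_demo` below are hypothetical placeholders with the notebook's default of 125 rows. They are not the simulated data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data: row i of X_demo is [2*i, 2*i + 1], y_demo[i] is i
X_demo = np.arange(250).reshape(125, 2)
y_demo = np.arange(125)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# a second call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
```

With 125 rows, this yields 93 training rows and 32 test rows, and the rows of X and y stay paired after shuffling.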

The KNN regression model is initiated using the `KNeighborsRegressor` class. \
The parameter values are chosen arbitrarily at this point, since they are tuned in the grid search below. \
`n_neighbors` defines the number of nearest neighbours in the predictor space that are used for a prediction. \
`weights` decides whether the neighbours' values are weighted by the inverse of their distance before averaging. \
`metric` specifies the measure of distance. \
`p` is the power parameter of the Minkowski distance and is only relevant when `metric` is `'minkowski'`. \
The model is fitted to the training data later, as part of the grid search (Pedregosa et al., 2011).

In [17]:
from sklearn.neighbors import KNeighborsRegressor
KNNReg = KNeighborsRegressor(n_neighbors = 2,
                             weights = 'uniform',
                             metric = 'euclidean',
                             p = 1)
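
To illustrate what these parameters do, the following sketch predicts a single point with a hand-written KNN rule in NumPy. The helper `knn_predict` and the toy data are hypothetical and only mimic the idea. They are not scikit-learn's implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k, weights='uniform', p=2):
    # Minkowski distance of order p between x_new and every training point
    dist = np.sum(np.abs(X_train - x_new)**p, axis=1)**(1/p)
    nearest = np.argsort(dist)[:k]
    if weights == 'uniform':
        # plain average over the k nearest neighbours
        return y_train[nearest].mean()
    # inverse-distance weighting; assumes no neighbour has distance zero
    w = 1.0 / dist[nearest]
    return np.sum(w * y_train[nearest]) / np.sum(w)

X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
y_toy = np.array([1.0, 2.0, 4.0, 10.0])
x_new = np.array([0.0, 1.0])

pred_uniform = knn_predict(X_toy, y_toy, x_new, k=2)
pred_distance = knn_predict(X_toy, y_toy, x_new, k=3, weights='distance', p=1)
```

With `weights='distance'`, closer neighbours pull the prediction towards their own value, whereas `'uniform'` treats all selected neighbours equally.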

## Hyperparameter Tuning and Cross-Validation

Initiate a grid with parameters of interest.

In [18]:
#create a list from 1 to 100 that will be used for the parameter grid
number_neighbors = list(range(1, 101))

In [19]:
params_KNN = {'n_neighbors':number_neighbors,
              'weights':['uniform','distance'],
              'metric':['euclidean','manhattan','minkowski'],
              'p':[1,2]}
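
The size of the search follows from the grid: every combination of the listed values is one candidate. The sketch below counts them with the standard library. The reduced grid is a hypothetical alternative, not the notebook's setup. It rests on the fact that the Minkowski distance with `p = 1` is the Manhattan distance and with `p = 2` the Euclidean distance, so the three metric entries overlap.

```python
from itertools import product

full_grid = {'n_neighbors': list(range(1, 101)),
             'weights': ['uniform', 'distance'],
             'metric': ['euclidean', 'manhattan', 'minkowski'],
             'p': [1, 2]}

# every combination of the listed values is one candidate
n_full = len(list(product(*full_grid.values())))

# hypothetical reduced grid: 'minkowski' with p in [1, 2] already covers
# the Manhattan and the Euclidean distance
reduced_grid = dict(full_grid, metric=['minkowski'])
n_reduced = len(list(product(*reduced_grid.values())))
```

The full grid has 1200 candidates and the reduced one has 400, which would cut the number of fits by two thirds.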

The `GridSearchCV` class is initiated to tune the previously initiated model. \
The class performs "Exhaustive search over specified parameter values for an estimator." (Pedregosa et al., 2011) \
`estimator` specifies the regression model that should be tuned and cross-validated. \
`param_grid` defines the grid of parameters. \
`scoring` defines the metric that is used to identify the best model. \
`neg_root_mean_squared_error` selects the model with the lowest RMSE as the best model. \
`cv` specifies the number of folds. In this case, 10-fold cross-validation is performed. \
`verbose` controls the verbosity of the messages. \
`verbose = 1` prints only a short summary of the number of folds, candidates, and fits. \
`n_jobs` defines the number of computational jobs that run in parallel. \
`n_jobs = -1` specifies that all available processors are used in parallel.

In [20]:
from sklearn.model_selection import GridSearchCV

grid_KNN = GridSearchCV(estimator = KNNReg,
                           param_grid = params_KNN,
                           scoring = 'neg_root_mean_squared_error',
                           cv = 10,
                           verbose = 1,
                           n_jobs = -1)
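
The sign convention of `neg_root_mean_squared_error` is easy to overlook: scikit-learn maximises scores, so the RMSE is negated. The following sketch runs a much smaller search on made-up data. The names `X_demo`, `y_demo`, and `demo_search` and the three-value grid are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# made-up data: 60 points with a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(60, 2))
y_demo = X_demo.sum(axis=1) + rng.normal(0, 1, 60)

demo_search = GridSearchCV(estimator=KNeighborsRegressor(),
                           param_grid={'n_neighbors': [1, 3, 5]},
                           scoring='neg_root_mean_squared_error',
                           cv=5)
demo_search.fit(X_demo, y_demo)

# scores are negated RMSE values, so the best (largest) score is closest to zero
best_rmse = -demo_search.best_score_
```

After fitting, `best_params_` holds the winning parameter combination and `best_estimator_` the model refitted with it on all data passed to `fit`.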

In [21]:
#fitting the grid search class to the training data
grid_KNN.fit(X_train, y_train)
#accessing the best cross-validated and tuned model
best_model = grid_KNN.best_estimator_

Fitting 10 folds for each of 1200 candidates, totalling 12000 fits
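
The grid contains values of `n_neighbors` up to 100, while the training part of each cross-validation fold may hold fewer rows than that. The arithmetic below is a sketch under two assumptions: the default of 125 samples, and a 10-fold split in which the first folds receive the extra rows. A candidate that asks for more neighbours than a training fold contains cannot be scored on that fold. Depending on the scikit-learn version, this either raises an error or is recorded as a `nan` score.

```python
import math

n_samples = 125
n_test = math.ceil(0.25 * n_samples)      # size of the test set
n_train = n_samples - n_test              # rows passed to the grid search

n_folds = 10
# KFold-style split: the first n_train % n_folds folds get one extra row
fold_sizes = [n_train // n_folds + (1 if i < n_train % n_folds else 0)
              for i in range(n_folds)]
smallest_train_fold = n_train - max(fold_sizes)

# candidates that ask for more neighbours than the smallest training fold holds
too_large = [k for k in range(1, 101) if k > smallest_train_fold]
```

Under these assumptions, the smallest training fold has 83 rows, so the 17 candidates from 84 to 100 neighbours cannot be evaluated on every fold. Capping the grid at the training-fold size or increasing the sample size would avoid this.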




In [22]:
#best cross-validated and tuned model
best_model

KNeighborsRegressor(metric='manhattan', n_neighbors=9, p=1)

In [27]:
#fitting the best model to training set
best_model.fit(X_train, y_train)
#compute the fitted values for the test set
y_pred = best_model.predict(X_test)

In [24]:
summary(test_values=y_test, predicted_values=y_pred)

| | Metric | Value |
|---|---|---|
| 0 | RMSE | 1.194599 |
| 1 | MAE | 0.986091 |


## 3D Regression Plot and Heatmap

In [28]:
Plot_function(model=best_model, data=data)

In [29]:
Heatmap_function(model=best_model)