# KNN in `Python`

More articles in my blog:   $\hspace{0.1cm}$   [Estadistica4all](https://fabioscielzoortiz.github.io/Estadistica4all.github.io/)

&nbsp;

How to reference this article ? 

Scielzo Ortiz, F. (2022). Optimizando código en Python. *Estadistica4all*.  [Optimizando código en Python](https://fabioscielzoortiz.github.io/Estadistica4all.github.io/Articulos/Optimizando%20codigo%20en%20Python.html )

## Index

* [KNN for supervised classification   ](#1)
* * [Toy example](#2)
* * [ KNN for supervised classification in `Python` with `Sklearn`](#3)
* * * [Simple validation with own validation function and `sklearn` KNN classification function](#3.1)
* * * [Simple validation with `sklearn` validation function](#3.2)
* * [ KNN for classification in `Python` with own algorithm](#4)
* * * [Simple validation with own validation function and own KNN classification function](#4.1)


* [KNN for regression](#5)
* * [KNN for regression in `Python` with `Sklearn`](#6)
* * * [Simple validation with own validation function and `sklearn` KNN regression function](#6.1)
* * * [Simple validation with `sklearn` validation function](#6.2)
* * [KNN for regression in Python with own algorithm](#7)
* * * [Simple validation with own validation function and own KNN regression function](#7.1)

* [Selecting an optimal k with cross-validation](#8)

---

## KNN for supervised classification <a class="anchor" id="1"></a>  

- We have $\hspace{0.1cm} p \hspace{0.1cm}$ variables $\hspace{0.1cm} X=(X_1,...,X_p) \hspace{0.1cm}$ measurements on a $n$ size sample.

- We also have a **categorical** response variable $\hspace{0.1cm} Y \hspace{0.1cm}$ with $\hspace{0.1cm} g \hspace{0.1cm}$  categories that indicates  the group to which each element of the sample belongs  $ ( \hspace{0.05cm} Range(Y)=\lbrace c_1 ,..., c_g \rbrace \hspace{0.05cm})$

- The groups generated by $\hspace{0.1cm} Y \hspace{0.1cm}$ are denoted as $\hspace{0.1cm} \Omega_1 ,..., \Omega_g \hspace{0.15cm}$   $\hspace{0.15cm}( \hspace{0.1cm} y_i = c_r \hspace{0.15cm} \Leftrightarrow \hspace{0.15cm}$  $ i \in \Omega_r \hspace{0.1cm})$



The supervised classification problem consists in, for a new observation  of the variables $X_1,...,X_p  \hspace{0.1cm}$, $\hspace{0.1cm} x_{new} = (x_{new,1}\hspace{0.1cm},\hspace{0.1cm}x_{new,2}\hspace{0.1cm},\dots,\hspace{0.1cm}x_{new,p}) \hspace{0.1cm}$, predict it's $\hspace{0.1cm} Y \hspace{0.1cm}$ value $\hspace{0.1cm} (y_{new})\hspace{0.1cm}$  using the available information of $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$ and $ \hspace{0.1cm} Y$

So , the problem is to classify a new element/individual in one of the $\hspace{0.1cm} g \hspace{0.1cm}$ groups generated by $\hspace{0.1cm} Y \hspace{0.1cm}$ using the information available of $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$ and $Y$, and also  $\hspace{0.1cm} x_{new} = (x_{new,1}\hspace{0.1cm},\hspace{0.1cm}x_{new,2}\hspace{0.1cm},\dots,\hspace{0.1cm}x_{new,p}) \hspace{0.1cm}$

Note that if we haven't information about $\hspace{0.1cm} Y \hspace{0.1cm}$ this would be an unsupervised classification problem.

----

The KNN (K-nearest neighbors) algorithm for supervised classification have the following steps:



 $1. \hspace{0.15cm}$ Define a **distance** measure between the observations of the original sample respect to the variables $X_1,...,X_p$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\delta$



 $2. \hspace{0.15cm}$ Compute the distances between $\hspace{0.1 cm}x_{new}\hspace{0.1 cm}$ and the initial observations $\hspace{0.1cm} \lbrace x_1,...,x_n \rbrace$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\lbrace \hspace{0.1 cm}  \delta(x_{new}\hspace{0.03 cm},\hspace{0.03 cm} x_i) \hspace{0.1 cm} / \hspace{0.1 cm}  i=1,...,n \hspace{0.1 cm}  \rbrace$

  
 $3. \hspace{0.15cm}$ Select the  $\hspace{0.03 cm} k \hspace{0.03 cm}$ nearest observation to $\hspace{0.06 cm} x_{new}\hspace{0.06 cm}$ based on $\hspace{0.05cm} \delta \hspace{0.12cm}$ $(k$ nearest neighbors of $x_{new})$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$   The set of these observation will be denote by $KNN(x_{new})$ 

 $4. \hspace{0.15cm}$ Compute the proportion of these observation (neighbors) that belongs to each group $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$  
 
 $\hspace{0.65cm} \Rightarrow \hspace{0.15cm}$ The proportion of $KNN$ that belongs to the group $\hspace{0.15cm} \Omega_r$ $\hspace{0.1cm}(Y=c_r)\hspace{0.1cm}$ will be denote by $\hspace{0.1 cm} f^{knn}_{r}  $



   $$ \hspace{0.1 cm} f^{KNN(x_{new})}_{r} \hspace{0.15cm}=\hspace{0.15cm} \dfrac{ \# \hspace{0.1cm}\lbrace\hspace{0.1cm} i \in KNN(x_{new}) \hspace{0.1cm}/\hspace{0.1cm} i \in \Omega_r \hspace{0.1cm}\rbrace  }{\# \hspace{0.1cm} KNN(x_{new}) = k} \hspace{0.15cm}=\hspace{0.15cm}  \dfrac{ \# \hspace{0.1cm}\lbrace\hspace{0.1cm} i \in KNN \hspace{0.1cm}/\hspace{0.1cm} y_i = r \hspace{0.1cm}\rbrace  }{ k}$$
   


$5. \hspace{0.15cm}$ Classify $\hspace{0.1cm} x_{new} \hspace{0.1cm}$ in that group/class $($ defined by $Y)$ more frequent in KNN :



 $\hspace{0.25cm} \hspace{0.2cm}$ $\text{If}   \hspace{0.15cm} \underbrace{ f^{knn}_{s} \geqslant f^{knn}_{r} \hspace{0.15cm},\hspace{0.15cm} \forall r = 1,...,g  }_{\Omega_s \hspace{0.1cm}\text{is the most frequent group in}\hspace{0.1cm} KNN } $    $\hspace{0.1cm} \hspace{0.15cm}  \Rightarrow \hspace{0.15cm} x_{new} \hspace{0.1cm}$ is classify in $\hspace{0.1cm} \Omega_s$ $ \hspace{0.25cm}  \Rightarrow \hspace{0.15cm} \widehat{y}_{new} = s \hspace{0.1cm}$

$\hspace{0.2 cm}$ In other words:

$\hspace{0.6 cm} \text{If} \hspace{0.4 cm} r^*  \hspace{0.05 cm}= \hspace{0.05 cm} \underset{\hspace{0.7 cm} r}{arg \hspace{0.1 cm} Max} \hspace{0.05 cm} \left(\hspace{0.1 cm} f^{KNN(x_{new})}_{r} \hspace{0.1 cm}\right) \hspace{0.2 cm} \hspace{0.15cm}  \Rightarrow \hspace{0.25cm} \widehat{y}_{new} = r^* \hspace{0.1cm}$



-----

Why KNN is a supervised classification method and not an unsupervised ?



Because in this problem we have a vector of observations of the response variable $Y$

The fact that we haven't $\hspace{0.1 cm} y_{new} \hspace{0.1 cm}$ doesn't transform it in a unsupervised problem

---

### Toy example <a class="anchor" id="2"></a>



- Sample: $n=3$



- Predictors: $\hspace{0.15cm} X1 = (10 , 2 , 4)$ , $\hspace{0.15cm} X2 = (20 , 25, 40)$



- Observations: $\hspace{0.15cm} x_1 =(10,20)$ , $\hspace{0.15cm} x_2=(2,25)$ , $\hspace{0.15cm} x_3=(4,40)$



- Response  $\hspace{0.1cm}(2$ categories $(0,1)$, then $2$ groups $\hspace{0.1cm}\Omega_0 , \Omega_1)$ : $\hspace{0.18cm} Y =( 1 , 1 , 0 )$



- Distance $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$  $ \delta_{Euclidean}$



- New observation:  $\hspace{0.15cm} x_{new}=(6, 20)$



- Computing the distances:

$\hspace{0.85cm} \delta(x_{new}, x_1)_{Euclidean} = (10-6)^2 + (20-20)^2 = 16$

$\hspace{0.85cm} \delta(x_{new}, x_2)_{Euclidean} = (2-6)^2 + (25-20)^2 = 16 + 25 = 41$

$\hspace{0.85cm} \delta(x_{new}, x_3)_{Euclidean} = (4-6)^2 + (40-20)^2 = 4 + 400 = 404$



- Selecting $\hspace{0.05cm} k=2 \hspace{0.05cm}$ nearest neighbor to $\hspace{0.05cm}x_{new}$ $\hspace{0.2cm}\Rightarrow\hspace{0.215cm}$ $ KNN \hspace{0.01cm}=\hspace{0.01cm} \lbrace\hspace{0.1cm} x_1 , x_2 \hspace{0.1cm}\rbrace \hspace{0.01cm}=\hspace{0.01cm} \lbrace \hspace{0.1cm} individual 1 , individual 2  \hspace{0.1cm}\rbrace $



- Computing the proportions $f^{knn}$ : 

$\hspace{0.85cm}$ Note that $\hspace{0.1cm} y_1 = 1 \hspace{0.1cm}$ and $ \hspace{0.1cm} y_2 = 1\hspace{0.2cm} \Rightarrow \hspace{0.2cm} f^{knn}_0 =  0/2 = 0\hspace{0.1cm} $ and $\hspace{0.1cm} f^{knn}_1 =  2/2 = 1$



- So, the algorithm classify $\hspace{0.1 cm}x_{new} \hspace{0.1 cm}$ in the group $\hspace{0.1 cm}\Omega_1 \hspace{0.1 cm}$ , so the algorithm predict that $\hspace{0.15cm} \hat{y}_{new} = 1$


---

### KNN for supervised classification in `Python` with `sklearn`<a class="anchor" id="3"></a>

In [431]:
import warnings
warnings.filterwarnings("ignore")

In [432]:
import pandas as pd
import numpy as np

Loading data:

In [433]:
Gender_classification = pd.read_csv('gender_classification.csv')

In [434]:
Gender_classification.head()

Unnamed: 0,long_hair,forehead_width_cm,forehead_height_cm,nose_wide,nose_long,lips_thin,distance_nose_to_lip_long,gender
0,1,11.8,6.1,1,0,1,1,Male
1,0,14.0,5.4,0,0,1,0,Female
2,0,11.8,6.3,1,1,1,1,Male
3,0,14.4,6.1,0,1,1,1,Male
4,1,13.5,5.9,0,0,0,0,Female


Recoding gender to the standard encode $\lbrace 0,1,2,... \rbrace $

In [435]:
from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder()

In [436]:
Gender_classification['gender'] = ord_enc.fit_transform(Gender_classification[['gender']])

In [437]:
Gender_classification.head()

Unnamed: 0,long_hair,forehead_width_cm,forehead_height_cm,nose_wide,nose_long,lips_thin,distance_nose_to_lip_long,gender
0,1,11.8,6.1,1,0,1,1,1.0
1,0,14.0,5.4,0,0,1,0,0.0
2,0,11.8,6.3,1,1,1,1,1.0
3,0,14.4,6.1,0,1,1,1,1.0
4,1,13.5,5.9,0,0,0,0,0.0


In [551]:
Gender_classification.dtypes

long_hair                      int64
forehead_width_cm            float64
forehead_height_cm           float64
nose_wide                      int64
nose_long                      int64
lips_thin                      int64
distance_nose_to_lip_long      int64
gender                       float64
dtype: object


To later do simple validation we are going to divide the dataset into a train part and a test part:

In [438]:
Gender_classification_Train = Gender_classification.sample(frac=0.8, replace=False, weights=None, random_state=222, axis=None, ignore_index=False)

Gender_classification_Test = Gender_classification.drop( Gender_classification_Train.index , )

In [439]:
## TEST

X_test = Gender_classification_Test.loc[: , Gender_classification_Test.columns != 'gender']
Y_test = Gender_classification_Test.loc[: , 'gender']

Data_Test = pd.concat([Y_test , X_test], axis=1)

##################################################################################################

## TRAIN

X_train = Gender_classification_Train.loc[: , Gender_classification_Test.columns != 'gender']
Y_train = Gender_classification_Train.loc[: , 'gender']

Data_Train = pd.concat([Y_train , X_train], axis=1)

In [441]:
# Como ejemplo de x_new (nueva observacion de los predictores) cogemos la sexta (5 en python) observacion de X_test

x_new = X_test.iloc[ 5 , :]

---

In [442]:
import sklearn

from sklearn.neighbors import NearestNeighbors

In [443]:
## sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None) 

It's advisable  to see the sklearn documentation first: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [444]:
knn_classification = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', p=2, metric='minkowski')

In [445]:
knn_classification.fit(X_train, Y_train)

KNeighborsClassifier(n_neighbors=10)

In [446]:
print( knn_classification.predict( [x_new] ) )

[1.]


In [447]:
print( knn_classification.predict_proba([x_new]) )

[[0. 1.]]


---

### Simple validation with own validation function and `sklearn` KNN classification function<a class="anchor" id="3.1"></a>


In [448]:
def Simple_Validation_Classification(Data_Test, X_train, Y_train, Y_test) :

    ##########################

    from joblib import Parallel, delayed
    import multiprocessing

    n_jobs  = multiprocessing.cpu_count()

    ##########################

    knn_classification = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', p=2, metric='minkowski')

    ##########################

    def prediction(i, Data_Test, X_train, Y_train ):

     x_new = Data_Test.iloc[ i , range(1, Data_Test.shape[1])]

     knn_classification.fit(X_train, Y_train)
     
     y_new_predict = knn_classification.predict( [x_new] )

     return(y_new_predict)

    ##########################

    y_predictions_vector = []

    # Paralelizamos el siguiente bucle for :

    # for i in  range(0, len(Data_Test)):

        # y_new_predict = prediction(i, Data_Test, X_train, Y_train )

        # y_predictions_vector.append( y_new_predict )

    
    y_predictions_vector = Parallel(n_jobs=n_jobs)( delayed(prediction)( i, Data_Test, X_train, Y_train) for i in range(0, len(Data_Test)) )

    #########################

    from itertools import chain

    y_predictions_vector = list(chain(*y_predictions_vector))

    TEC = 1 - sum(y_predictions_vector == Y_test)/len(Y_test)     

 
    return(y_predictions_vector , TEC)

Simple validation with sklearn function for KNN classification with metric=Minkowski and q=2

In [449]:
y_predictions_vector , TEC = Simple_Validation_Classification(Data_Test, X_train, Y_train, Y_test)

In [450]:
TEC

0.03700000000000003

---

### Simple validation with `sklearn` validation function  <a class="anchor" id="3.2"></a>


In [451]:
knn_classification = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', p=2, metric='minkowski')

knn_classification.fit(X_train, Y_train)

KNeighborsClassifier(n_neighbors=10)

In [452]:
TEC_sklearn = 1 - knn_classification.score(X_test, Y_test)

TEC_sklearn

0.03700000000000003

---

Testing `KNeighborsClassifier` with **other distances** :

In [453]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform',  metric='cityblock')

In [454]:
knn.fit(X_train, Y_train)

print( knn.predict( [x_new] ) )
 
print( knn.predict_proba([x_new]) )

[1.]
[[0. 1.]]


In [455]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', metric='cosine')

In [456]:
knn.fit(X_train, Y_train)

print( knn.predict( [x_new] ) ) 

print( knn.predict_proba([x_new]) )

[1.]
[[0. 1.]]


In [457]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', metric='nan_euclidean')

In [458]:
knn.fit(X_train, Y_train)

print( knn.predict( [x_new] ) ) 

print( knn.predict_proba([x_new]) )

[1.]
[[0. 1.]]


In [459]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', metric='manhattan')

In [460]:
knn.fit(X_train, Y_train)

print( knn.predict( [x_new] ) ) 

print( knn.predict_proba([x_new]) )

[1.]
[[0. 1.]]


----

### KNN for classification in `Python` with own algorithm <a class="anchor" id="4"></a>

We are going to develop our own algorithm so as not to depend on sklearn

In [461]:
def KNN_classification( X , Y , x_new, k, distance = "Minkowski" , q = 0, p1=0, p2=0, p3=0 ):

    
## Para paralelizar el algoritmo 

    from joblib import Parallel, delayed
    import multiprocessing

    n_jobs  = multiprocessing.cpu_count()

####################################################################################################################################################################################################################################################

    # Y, X y x_new deben ser objetos Pandas ya que luego seran convertidos a objetos Numpy automaticamente por el algoritmo
    
    # Y tiene que ser un Pandas data frame con la variable respuesta (que en este caso debe ser categorica y con categorias estandar {0,1,2,...}) 

    # X tiene que ser un Pandas data frame con los predictotres (X1,...,Xp). 

    # x_new tiene que ser un vector con una nueva observacion de los predictores. 


####################################################################################################################################################################################################################################################

    Y = Y.to_numpy()

    X = X.to_numpy() 

    x_new = x_new.to_numpy()

    X = np.concatenate((X, [x_new]), axis=0)


    distances = []

    groups_knn = []

##########################################################################################
    
    def a(Binary_Data) :

            X = Binary_Data

            a = X @ X.T

            return(a)

##########################################################################################

    def d(Binary_Data):

            X = Binary_Data

            ones_matrix = np.ones(( X.shape[0] , X.shape[1])) 

            d = (ones_matrix - X) @ (ones_matrix - X).T

            return(d)

##########################################################################################

    def alpha_py(i,j, Multiple_Categorical_Data):

        X = Multiple_Categorical_Data

        alpha = np.repeat(0, X.shape[1])

        for k in range(0, X.shape[1]) :

            if X[i-1, k] == X[j-1, k] :

                alpha[k] = 1

            else :

                alpha[k] = 0

        alpha = alpha.sum()

        return(alpha)

####################################################################################################################################################################################################################################################
    
    if distance == "Euclidean":

        def Dist_Euclidea_Python(i, j, Quantitative_Data_set): 

            Dist_Euclidea = ( ( Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :] )**2 ).sum()

            Dist_Euclidea = np.sqrt(Dist_Euclidea)

            return Dist_Euclidea

    ###################################################################
           
        ## PARTE DEL CODIGO A PARALELIZAR

        #for j in range(1, len(X)):

          # distances.append( Dist_Euclidea_Python( len(X), i , X ) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Euclidea_Python)( len(X), s , X ) for s in range(1, len(X)) )
           

    ###################################################################

    if distance == "Minkowski":

        def Dist_Minkowski_Python(i,j, q , Quantitative_Data_set):

            Dist_Minkowski = ( ( ( abs( Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :] ) )**q ).sum() )**(1/q)

            return Dist_Minkowski

    ###################################################################

        ## PARTE DEL CODIGO A PARALELIZAR

        # for i in range(1, len(X)):

          #  distances.append( Dist_Minkowski_Python( len(X), i , q , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Minkowski_Python)( len(X), s , q , X) for s in range(1, len(X)) )

    ###################################################################

    if distance == "Canberra":

        def Dist_Canberra_Python(i,j, Quantitative_Data_set):

            numerator =  abs( Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :] )  

            denominator =  ( abs(Quantitative_Data_set[i-1, :]) + abs(Quantitative_Data_set[j-1, :]) )
       
            numerator=np.array([numerator], dtype=float)

            denominator=np.array([denominator], dtype=float)

            Dist_Canberra = ( np.divide( numerator , denominator , out=np.zeros_like(numerator), where=denominator!=0) ).sum()

            return Dist_Canberra

    ###################################################################

        ## PARTE DEL CODIGO A PARALELIZAR

        # for i in range(1, len(X)):

          #  distances.append( Dist_Canberra_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Canberra_Python)( len(X), s , X) for s in range(1, len(X)) )
                

    ###################################################################
   
    if distance == "Pearson":

        def Dist_Pearson_Python(i, j, Quantitative_Data_set):

            Dist_Pearson = ( ( Quantitative_Data_set[i-1, ] - Quantitative_Data_set[j-1, ] )**2 / Quantitative_Data_set.var() ).sum()

            Dist_Pearson = np.sqrt(Dist_Pearson)

            return Dist_Pearson

    ###################################################################

       ## PARTE DEL CODIGO A PARALELIZAR
       
       # for i in range(1, len(X)):

        #   distances.append( Dist_Pearson_Python( len(X), i , X) )

        
        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Pearson_Python)( len(X), s , X) for s in range(1, len(X)) )

    ###################################################################
    
    if distance == "Mahalanobis":

        def Dist_Mahalanobis_Python(i, j, Quantitative_Data_set):

            # All the columns of Quantitative_Data_set must be type = 'float' or 'int' (specially not 'object'), in other case we will find 
            # dimensional problems when Python compute   x @ S_inv @ x.T

            x = (Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :])

            x = np.array([x]) # necessary step to transpose a 1D array

            S_inv = np.linalg.inv( np.cov(Quantitative_Data_set , rowvar=False) ) # inverse of covariance matrix

            Dist_Maha = np.sqrt( x @ S_inv @ x.T )  # x @ S_inv @ x.T = np.matmul( np.matmul(x , S_inv) , x.T )

            Dist_Maha = float(Dist_Maha)

            return Dist_Maha

        
    ###################################################################

    ## PARTE DEL CODIGO A PARALELIZAR

       # for i in range(1, len(X)):

        #    distances.append( Dist_Mahalanobis_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Mahalanobis_Python)( len(X), s , X) for s in range(1, len(X)) )
       

    ###################################################################
    
    if distance == "Sokal":

        a = X @ X.T
        n = X.shape[0]
        p = X.shape[1]
        ones_matrix = np.ones((n, p))
        b = (ones_matrix - X) @ X.T
        c = b.T
        d = (ones_matrix - X) @ (ones_matrix - X).T


        def Sokal_Similarity_Py(i, j):

            Sokal_Similarity = ( a[i-1 , j-1] + d[i-1 , j-1] ) / p

            return Sokal_Similarity


        def Dist_Sokal_Python(i, j, Binary_Data_set):

            dist_Sokal = np.sqrt( 2 - 2*Sokal_Similarity_Py(i,j, Binary_Data_set) )

            return dist_Sokal

    ###################################################################

    ## PARTE DEL CODIGO A PARALELIZAR

      #  for i in range(1, len(X)):

        #    distances.append( Dist_Sokal_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Sokal_Python)( len(X), s , X) for s in range(1, len(X)) )

    ###################################################################
   
    if distance == "Jaccard":


        a = X @ X.T
        n = X.shape[0]
        p = X.shape[1]
        ones_matrix = np.ones((n, p))
        b = (ones_matrix - X) @ X.T
        c = b.T
        d = (ones_matrix - X) @ (ones_matrix - X).T


        def Jaccard_Similarity_Py(i, j):

            Jaccard_Similarity = a[i-1,j-1] / (a[i-1,j-1] + b[i-1,j-1] + c[i-1,j-1])
            
            return Jaccard_Similarity


        def Dist_Jaccard_Python(i, j):

            dist_Jaccard = np.sqrt( Jaccard_Similarity_Py(i,i) + Jaccard_Similarity_Py(i,i) - 2*Jaccard_Similarity_Py(i,j) )

            return dist_Jaccard

    ###################################################################

    ## PARTE DEL CODIGO A PARALELIZAR

       # for i in range(1, len(X)):

        #    distances.append( Dist_Jaccard_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Jaccard_Python)( len(X), s , X) for s in range(1, len(X)) )

    ###################################################################
    
    if distance == "Matches":

        def matches_similarity_py(i, j, Multiple_Categorical_Data):

            p = Multiple_Categorical_Data.shape[1]

            matches_similarity = alpha_py(i,j, Multiple_Categorical_Data) / p

            return(matches_similarity)


        def Dist_Matches_Py(i,j, Multiple_Categorical_Data):

            Dist_Matches = np.sqrt( matches_similarity_py(i, i, Multiple_Categorical_Data) +  matches_similarity_py(j, j, Multiple_Categorical_Data) - 2*matches_similarity_py(i, j, Multiple_Categorical_Data) )

            return( Dist_Matches )

    ###################################################################

        # for i in range(1, len(X)):

          #  distances.append( Dist_Matches_Py( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Matches_Py)( len(X), s , X) for s in range(1, len(X)) )

 ##############################################################################################################################################   
   
    if distance == "Gower":

        # The data matrix X have to be order in the following way:
        # The p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.



##########################################################################################


        def Gower_Similarity_Python(i,j, Mixed_Data_Set, p1, p2, p3):

            X = Mixed_Data_Set

   # The data matrix X have to be order in the following way:
   # The p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.

   #####################################################################################
        
            def G(k, X):

                range = X[:,k].max() - X[:,k].min() 

                return(range)

            G_vector = np.repeat(0.5, p1)

            for r in range(0, p1):

                G_vector[r] = G(r, X)
                
      
    ##########################################################################################
    
            ones = np.repeat(1, p1)

            Quantitative_Data = X[: , 0:p1]

            Binary_Data = X[: , (p1):(p1+p2)]
            
            Multiple_Categorical_Data = X[: , (p1+p2):(p1+p2+p3) ]

    ##########################################################################################

            numerator_part_1 = ( ones - ( abs(Quantitative_Data[i-1,:] - Quantitative_Data[j-1,:]) / G_vector ) ).sum() 

            numerator_part_2 = a(Binary_Data)[i-1,j-1] + alpha_py(i,j, Multiple_Categorical_Data)

            numerator = numerator_part_1 + numerator_part_2
 
            denominator = p1 + (p2 - d(Binary_Data)[i-1,j-1]) + p3

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)

##########################################################################################

        def Dist_Gower_Py(i, j, Mixed_Data , p1, p2, p3):

            Dist_Gower = np.sqrt( 1 - Gower_Similarity_Python(i, j, Mixed_Data , p1, p2, p3) )

            return(Dist_Gower)    

    ###################################################################

        # for i in range(1, len(X)):

            # distances.append( Dist_Gower_Py( len(X), i , X, p1, p2, p3) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Gower_Py)( len(X), s , X, p1, p2, p3) for s in range(1, len(X)) )

##############################################################################################################################################

    if distance == "Gower-BM" :

        def GowerBM_Similarity_Python(i,j, BM_Data_Set, p2, p3):

            X = BM_Data_Set

          # The data matrix X have to be order in the following way:
          # The p2 first are binary categorical, and the following p3 are multiple categorical.

##########################################################################################
       
            Binary_Data = X[: , 0:p2]

            Multiple_Categorical_Data = X[: , (p2):(p2+p3)]
 
##########################################################################################

 
            numerator_part_2 = a(Binary_Data)[i-1,j-1] + alpha_py(i,j, Multiple_Categorical_Data)

            numerator = numerator_part_2

            denominator = (p2 - d(Binary_Data)[i-1,j-1]) + p3

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)

##############################################################################################################################################
        
        def Dist_GowerBM_Py(i, j, BM_Data ,  p2, p3):

            Dist_Gower = np.sqrt( 1 - GowerBM_Similarity_Python(i, j, BM_Data , p2, p3) )

            return(Dist_Gower)

##############################################################################################################################################

        # for i in range(1, len(X)):

            # distances.append( Dist_GowerBM_Py( len(X), i , X, p2, p3) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_GowerBM_Py)( len(X), s , X, p2, p3) for s in range(1, len(X)) )

##############################################################################################################################################
    
    if distance == "Gower-BQ" :

        def GowerBQ_Similarity_Python(i,j, BQ_Data_Set, p1, p2):

            X = BQ_Data_Set


        # The data matrix X have to be order in the following way:
        # The p1 first are quantitative, the following p2 are binary categorical 

##########################################################################################
        
            def G(k, X):

                range = X[:,k].max() - X[:,k].min() 

                return(range)

            G_vector = np.repeat(0.5, p1)

            for r in range(0, p1):

                G_vector[r] = G(r, X)
##########################################################################################
    
            ones = np.repeat(1, p1)

            Quantitative_Data = X[: , 0:p1]

            Binary_Data = X[: , (p1):(p1+p2)]
         
 
##########################################################################################

            numerator_part_1 = ( ones - ( abs(Quantitative_Data[i-1,:] - Quantitative_Data[j-1,:]) / G_vector ) ).sum() 

            numerator_part_2 = a(Binary_Data)[i-1,j-1] 
     
            numerator = numerator_part_1 + numerator_part_2

            denominator = p1 + (p2 - d(Binary_Data)[i-1,j-1])  

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)

###############################################################################

        def Dist_GowerBQ_Py(i, j, BQ_Data ,  p1, p2):

            Dist_Gower = np.sqrt( 1 - GowerBQ_Similarity_Python(i, j, BQ_Data , p1, p2) )

            return(Dist_Gower)

##############################################################################################################################################

        # for i in range(1, len(X)):

        # distances.append( Dist_GowerBQ_Py( len(X), i , X, p1, p2) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_GowerBQ_Py)( len(X), s , X, p1, p2) for s in range(1, len(X)) )


##############################################################################################################################################
    
    if distance == "Gower-MQ" :
        
        def GowerMQ_Similarity_Python(i,j, MQ_Data_Set, p1, p3):

            X = MQ_Data_Set

   # The data matrix X have to be order in the following way:
   # The p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.

##########################################################################################
            
            def G(k, X):

                range = X[:,k].max() - X[:,k].min() 

                return(range)

            G_vector = np.repeat(0.5, p1)

            for r in range(0, p1):

                G_vector[r] = G(r, X)

##########################################################################################
    
            ones = np.repeat(1, p1)

            Quantitative_Data = X[: , 0:p1]
    
            Multiple_Categorical_Data = X[: , (p1):(p1+p3)]
 
    
##########################################################################################

            numerator_part_1 = ( ones - ( abs(Quantitative_Data[i-1,:] - Quantitative_Data[j-1,:]) / G_vector ) ).sum() 

            numerator_part_2 =   alpha_py(i,j, Multiple_Categorical_Data)

            numerator = numerator_part_1 + numerator_part_2

            denominator = p1 + p3

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)



############################################################################################

        def Dist_GowerMQ_Py(i, j, MQ_Data ,  p1, p3):

                Dist_Gower = np.sqrt( 1 - GowerMQ_Similarity_Python(i, j, MQ_Data , p1, p3) )

                return(Dist_Gower)


######################################################################################################################################
        # for i in range(1, len(X)):

        # distances.append( Dist_GowerMQ_Py( len(X), i , X, p1, p3) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_GowerMQ_Py)( len(X), s , X, p1, p3) for s in range(1, len(X)) )

######################################################################################################################################

######################################################################################################################################

    distances = pd.DataFrame({'distances': distances})

    distances = distances.sort_values(by=["distances"]).reset_index(drop=False)
        
    knn = distances.iloc[0:k , :]

    for i in knn.iloc[:,0]:

        groups_knn.append(Y[i])

    unique, counts = np.unique(groups_knn , return_counts=True)

    unique_Y , counts_Y = np.unique(Y , return_counts=True)

    if len(unique) == len(unique_Y) :

        proportions_groups_knn = pd.DataFrame({'proportions_groups': counts/k, 'groups': unique_Y })
    
    elif len(unique) < len(unique_Y) :

        proportions_groups_knn = pd.DataFrame({'proportions_groups': counts/k, 'groups': unique })



    prediction_group = proportions_groups_knn.sort_values(by=["proportions_groups"], ascending=False).iloc[0,:]['groups']
                                      

    return prediction_group, proportions_groups_knn   

Testing our `KNN_classification` function in a binary classification problem:

In [462]:
prediction_group, proportions_groups_knn  = KNN_classification( X_train , Y_train , x_new, 10 , distance = "Euclidean" )

In [463]:
prediction_group

1.0

In [464]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,1.0,1.0


Using another distance function  :

In [465]:
prediction_group, proportions_groups_knn  =  KNN_classification( X_train , Y_train , x_new, 10 , distance = "Minkowski" , q = 2 )

In [466]:
prediction_group

1.0

In [467]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,1.0,1.0


In [468]:
prediction_group, proportions_groups_knn  =  KNN_classification( X_train , Y_train , x_new, 10 , distance = "Minkowski" , q = 1 )

In [469]:
prediction_group

1.0

In [470]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,1.0,1.0


In [471]:
prediction_group, proportions_groups_knn  =  KNN_classification( X_train , Y_train , x_new, 10 , distance = "Canberra")

In [472]:
prediction_group

1.0

In [473]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,1.0,1.0


In [474]:
prediction_group, proportions_groups_knn  =  KNN_classification( X_train , Y_train , x_new, 10 , distance = "Pearson")

In [475]:
prediction_group

1.0

In [476]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,1.0,1.0


In [477]:
prediction_group, proportions_groups_knn  =  KNN_classification( X_train , Y_train , x_new, 10 , distance = "Mahalanobis")

In [478]:
prediction_group

1.0

In [479]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,1.0,1.0


In [480]:
# KNN_classification( X , Y , x_new, 10 , distance = "Gower"  )

In this case, the Gower distance cannot be implemented because $X$ is not a mixed data matrix (it only has quantitative and binary, the multiclasses are missing).
Neither can the Sokal and Jaccard distances be used because $X$ is not a binary data matrix, nor can the matches coefficient be used because $X$ is not a multiclass categorical data matrix.

---

Since in this case $X$ is a matrix of **Binary-Quantitative data**, the most suitable distance allowed by our `KNN_regression` function is the **Gower-BQ distance**.

To use our Gower distance we must re-order the columns of $X_{train}$ appropriately. The first $p1$ will be the quantitative variables, and the next $p2$ the binary ones.

In [481]:
X_train_2 = X_train.loc[: , ['forehead_width_cm', 'forehead_height_cm',   # Quantitative (2)

                 'long_hair', 'nose_wide', 'nose_long', 'lips_thin', 'distance_nose_to_lip_long'     # Binary (5)
                 
                            ]] 

In [482]:
X_train_subset = X_train_2.iloc[0:2000 , ]
Y_train_subset = X_train_2.iloc[0:2000 , ]

In [483]:
prediction_group, proportions_groups_knn  = KNN_classification( X_train_subset, Y_train_subset , x_new, 10 ,  "Gower-BQ" , 2, 5 )

In [484]:
prediction_group

1.0

In [485]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,0.2,0.0
1,4.8,1.0
2,0.1,6.7
3,0.1,6.8
4,0.3,6.9
5,0.3,7.0
6,0.2,7.1
7,0.2,11.5
8,0.4,11.6
9,0.1,11.7


Note: with 3000 observations it takes 2 minutes, but with 4001 (all) it takes 8.45 minutes.

----

### Simple validation with own validation function and own KNN classification function <a class="anchor" id="4.1"></a>

In [486]:
def Simple_Validation_Classification(Data_Test, X_train, Y_train, Y_test) :

    ##########################

    from joblib import Parallel, delayed
    import multiprocessing

    n_jobs  = multiprocessing.cpu_count()

    ##########################

    ##########################

    def prediction(i, Data_Test, X_train, Y_train ):

     x_new = Data_Test.iloc[ i , range(1, Data_Test.shape[1])]

     prediction_group, proportions_groups_knn  =  KNN_classification( X_train , Y_train , x_new, 10 , distance = "Minkowski" , q = 2 )
     
     y_new_predict = prediction_group

     return(y_new_predict)

    ##########################

    y_predictions_vector = []

    # Paralelizamos el siguiente bucle for :

    # for i in  range(0, len(Data_Test)):

        # y_new_predict = prediction(i, Data_Test, X_train, Y_train )

        # y_predictions_vector.append( y_new_predict )

    
    y_predictions_vector = Parallel(n_jobs=n_jobs)( delayed(prediction)( i, Data_Test, X_train, Y_train) for i in range(0, len(Data_Test)) )

    #########################

    from itertools import chain

    TEC = 1 - sum(y_predictions_vector == Y_test)/len(Y_test)     

 
    return(y_predictions_vector , TEC)

Simple validation with our function of KNN classification with metric=Minkowski and q=2

In [487]:
y_predictions_vector , TEC = Simple_Validation_Classification(Data_Test, X_train, Y_train, Y_test)

In [488]:
TEC

0.03700000000000003

----

Testing our `KNN_classification` function in a multi-class classification problem:

In [489]:
Wine_Classification = pd.read_csv('WineQT.csv')

In [490]:
Wine_Classification.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4


In [491]:
Wine_Classification_Train = Wine_Classification.sample(frac=0.8, replace=False, weights=None, random_state=222, axis=None, ignore_index=False)

Wine_Classification_Test = Wine_Classification.drop( Wine_Classification_Train.index , )

In [492]:
## TEST

X_test = Wine_Classification_Test.loc[: , Wine_Classification_Test.columns != 'quality']
Y_test = Wine_Classification_Test.loc[: , 'quality']

Data_Test = pd.concat([Y_test , X_test], axis=1)

##################################################################################################

## TRAIN

X_train = Wine_Classification_Train.loc[: , Wine_Classification_Train.columns != 'quality']
Y_train = Wine_Classification_Train.loc[: , 'quality']

Data_Train= pd.concat([Y_train , X_train], axis=1)

In [493]:
x_new = X_test.iloc[5, :]

In this case response variable $Y$ has $10$ categories, so we are in a multi-class classification problem.

In [494]:
prediction_group, proportions_groups_knn  = KNN_classification( X_train , Y_train , x_new, 10 , distance = "Euclidean" )

In [495]:
prediction_group

5.0

In [496]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,0.2,4
1,0.7,5
2,0.1,6


We tried with another distance :

In [497]:
prediction_group, proportions_groups_knn  = KNN_classification( X_train , Y_train , x_new, 10 , distance = "Minkowski" , q = 1 )

In [498]:
prediction_group

5.0

In [499]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,0.1,4
1,0.8,5
2,0.1,6


In [500]:
prediction_group, proportions_groups_knn  = KNN_classification( X_train , Y_train , x_new, 10 , distance = "Mahalanobis" )

In [501]:
prediction_group

6.0

In [502]:
proportions_groups_knn

Unnamed: 0,proportions_groups,groups
0,0.4,5
1,0.6,6


Comparing with `Sklearn` function:

In [503]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform',  metric='euclidean')

In [504]:
knn.fit(X_train , Y_train)

print( knn.predict( [x_new] ) )

print( knn.predict_proba([x_new]) )

[5]
[[0.  0.2 0.7 0.1 0.  0. ]]


Each proportion correspond to the following categories of the response variables:

In [505]:
np.sort(Y_train.unique())

array([3, 4, 5, 6, 7, 8], dtype=int64)

In [506]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform',   p=1, metric='minkowski')

In [507]:
knn.fit(X_train , Y_train)

print( knn.predict( [x_new] ) )

print( knn.predict_proba([x_new]) )

[5]
[[0.  0.1 0.8 0.1 0.  0. ]]



**Simple validation with KNN classification (metric=Minkowski , q=2) using our validation function**

In [508]:
y_predictions_vector , TEC = Simple_Validation_Classification(Data_Test, X_train, Y_train, Y_test)

In [509]:
TEC

0.5502183406113537

**Simple validation with KNN classification (metric=Minkowski , q=2) using sklearn validation function**

In [510]:
knn_classification = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', p=2, metric='minkowski')

knn_classification.fit(X_train, Y_train)

KNeighborsClassifier(n_neighbors=10)

In [511]:
TEC_sklearn = 1 - knn_classification.score(X_test, Y_test)

TEC_sklearn

0.5502183406113537

----

### KNN for regression <a class="anchor" id="5"></a>

- We have $\hspace{0.1cm} p \hspace{0.1cm}$ variables $\hspace{0.1cm} X=(X_1,...,X_p) \hspace{0.1cm}$ measurements on a $n$ size sample.

- We also have a **quantitative** response variable $\hspace{0.1cm} Y $ 


The regression problem consists in, for a new observation $\hspace{0.1cm} x_{new} = (x_{new,1},x_{new,2},...,x_{new,p}) \hspace{0.1cm}$ of the variables $X_1,...,X_p  \hspace{0.1cm}$, predict it's $\hspace{0.1cm} Y \hspace{0.05cm}$ value $\hspace{0.1cm} (y_{new})\hspace{0.1cm}$  using the information of $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$ and $ \hspace{0.1cm} Y$

So , the problem is to get $\hspace{0.1cm} \hat{y}_{new} \hspace{0.1cm}$  using the information available of $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$ , $Y$ and  $\hspace{0.1cm} x_{new} = (x_{new,1},x_{new,2},...,x_{new,p})$

---

The KNN (K-nearest neighbors) algorithm for regression have the following steps:



 $1. \hspace{0.15cm}$ Define a distance measure between the observation of the original sample respect to the variables $X_1,...,X_p$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\delta$



 $2. \hspace{0.15cm}$ Compute the distances between $x_{new}$ and the initial observations $\hspace{0.1cm} \lbrace x_1,...,x_n \rbrace$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\lbrace \hspace{0.1 cm}  \delta(x_{new}, x_i) \hspace{0.1 cm} / \hspace{0.1 cm}  i=1,...,n \hspace{0.1 cm}  \rbrace$

  
 $3. \hspace{0.15cm}$ Select the  $k$ nearest observation to $x_{new}$ based on $\hspace{0.05cm} \delta \hspace{0.12cm}$ $(k$ nearest neighbors of $x_{new})$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$   The set of these observation will be denote by $KNN$ 




$5. \hspace{0.15cm}$ The method predict $\hspace{0.1cm} y_{new} \hspace{0.1cm}$  as follows:



$$
   
 \widehat{y}_{new} =  \dfrac{1}{KNN }\cdot \sum_{i \in KNN}  y_i
  
$$

----

### KNN for  regression  in `Python` with `sklearn`<a class="anchor" id="6"></a>

Loading data:

In [512]:
House_Price_Regression = pd.read_csv('House_Price_Regression.csv')

In [513]:
House_Price_Regression.head()

Unnamed: 0,neighborhood_recode,latitude,longitude,price,no_of_bedrooms,no_of_bathrooms,quality_recode,maid_room_recode,unfurnished_recode,balcony_recode,...,private_garden_recode,private_gym_recode,private_jacuzzi_recode,private_pool_recode,security_recode,shared_gym_recode,shared_pool_recode,shared_spa_recode,view_of_water_recode,size_in_m_2
0,46.0,25.113208,55.138932,2700000,1,2,2.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,100.242337
1,46.0,25.106809,55.151201,2850000,2,2,2.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,146.972546
2,36.0,25.063302,55.137728,1150000,3,5,2.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,181.253753
3,11.0,25.227295,55.341761,2850000,2,3,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,187.66406
4,46.0,25.114275,55.139764,1729200,0,1,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,47.101821


In [514]:
House_Price_Regression_Train = House_Price_Regression.sample(frac=0.8, replace=False, weights=None, random_state=222, axis=None, ignore_index=False)

House_Price_Regression_Test = House_Price_Regression.drop( House_Price_Regression_Train.index , )

In [552]:
House_Price_Regression_Train

Unnamed: 0,neighborhood_recode,latitude,longitude,price,no_of_bedrooms,no_of_bathrooms,quality_recode,maid_room_recode,unfurnished_recode,balcony_recode,...,private_garden_recode,private_gym_recode,private_jacuzzi_recode,private_pool_recode,security_recode,shared_gym_recode,shared_pool_recode,shared_spa_recode,view_of_water_recode,size_in_m_2
925,22.0,25.083483,55.148688,3195000,3,4,2.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,233.558142
1502,6.0,25.065872,55.232983,430000,0,1,2.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,43.478604
867,15.0,25.191285,55.275202,3499900,3,3,2.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,221.016237
746,34.0,25.083858,55.138714,8000000,3,4,2.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,240.340061
729,37.0,25.046296,55.200783,635356,1,2,3.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,58.435987
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,22.0,25.073593,55.137516,2175000,2,3,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,128.670655
1324,34.0,25.072573,55.131009,2300000,2,3,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,127.277110
1391,34.0,25.079895,55.134126,1396000,1,2,2.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,78.874647
1191,15.0,25.191719,55.270677,3100000,3,5,2.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,185.341485


In [515]:
## TEST

X_test = House_Price_Regression_Test.loc[: , House_Price_Regression_Test.columns != 'price']
Y_test = House_Price_Regression_Test.loc[: , 'price']

Data_Test = pd.concat([Y_test , X_test], axis=1)

##################################################################################################

## TRAIN

X_train = House_Price_Regression_Train.loc[: , House_Price_Regression_Train.columns != 'price']
Y_train = House_Price_Regression_Train.loc[: , 'price']

Data_Train = pd.concat([Y_train , X_train], axis=1)

In [516]:
# Como ejemplo de x_new (nueva observacion de los predictores) cogemos la sexta (5 en python) observacion de X_test

x_new = X_test.iloc[ 8 , :]

-----

In [517]:
import sklearn

from sklearn.neighbors import NearestNeighbors

In [518]:
## sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

More informetion about the function params  in: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [519]:
knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=10,  weights='uniform', p=2, metric='minkowski')

In [520]:
knn.fit(X_train , Y_train)

print( knn.predict( [x_new] ) )

[2026203.4]


In [521]:
knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=10,  weights='uniform', p=5, metric='minkowski')

In [522]:
knn.fit(X_train , Y_train)

print( knn.predict( [x_new] ) )

[2026203.4]


In [523]:
knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=10,  weights='uniform', metric='cityblock')

In [524]:
knn.fit(X_train , Y_train)

print( knn.predict( [x_new] ) )

[1789314.6]


In [525]:
knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=10,  weights='uniform', metric='cosine')

In [526]:
knn.fit(X_train , Y_train)

print( knn.predict( [x_new] ) )

[1750914.6]


---

### Simple validation with own validation function and `sklearn` KNN regression function <a class="anchor" id="6.1"></a>

In [527]:
def Simple_Validation_Regression(Data_Test, X_train, Y_train, Y_test) :

    ##########################

    from joblib import Parallel, delayed
    import multiprocessing

    n_jobs  = multiprocessing.cpu_count()

    ##########################

    knn_regression = sklearn.neighbors.KNeighborsRegressor(n_neighbors=10,  weights='uniform', p=2, metric='minkowski')

    ##########################

    def prediction(i, Data_Test, X_train, Y_train ):

     x_new = Data_Test.iloc[ i , range(1, Data_Test.shape[1])]

     knn_regression.fit(X_train, Y_train)
     
     y_new_predict = knn_regression.predict( [x_new] )

     return(y_new_predict)

    ##########################

    y_predictions_vector = []

    # Paralelizamos el siguiente bucle for :

    # for i in  range(0, len(Data_Test)):

        # y_new_predict = prediction(i, Data_Test, X_train, Y_train )

        # y_predictions_vector.append( y_new_predict )

    
    y_predictions_vector = Parallel(n_jobs=n_jobs)( delayed(prediction)( i, Data_Test, X_train, Y_train) for i in range(0, len(Data_Test)) )

    #########################

    from itertools import chain

    y_predictions_vector = list(chain(*y_predictions_vector))

    ECM = sum( (Y_test - y_predictions_vector)**2 )/len(Y_test)     

 
    return(y_predictions_vector , ECM)

In [528]:
y_predictions_vector , ECM = Simple_Validation_Regression(Data_Test, X_train, Y_train, Y_test)

In [529]:
ECM

1705355050546.1501

---

### Simple validation with `sklearn` validation function <a class="anchor" id="6.2"></a>

In this case `Sklearn` use $R^2$ as validation metric instead of ECM

In [530]:
knn_regression = sklearn.neighbors.KNeighborsRegressor(n_neighbors=10,  weights='uniform', p=2, metric='minkowski')

knn_regression.fit(X_train , Y_train)

KNeighborsRegressor(n_neighbors=10)

In [531]:
R2_sklearn = knn_regression.score(X_test, Y_test)

R2_sklearn

0.745616988261137

---

### KNN for regression in `Python` with own algorithm <a class="anchor" id="7"></a>

In [532]:
def KNN_regression( X , Y , x_new, k, distance = "Minkowski" , q = 0, p1=0, p2=0, p3=0 ):

    
## Para paralelizar el algoritmo 

    from joblib import Parallel, delayed
    import multiprocessing

    n_jobs  = multiprocessing.cpu_count()

####################################################################################################################################################################################################################################################

    # Y, X y x_new deben ser objetos Pandas ya que luego seran convertidos a objetos Numpy automaticamente por el algoritmo
    
    # Y tiene que ser un Pandas data frame con la variable respuesta (que en este caso debe ser cuantitativa) 

    # X tiene que ser un Pandas data frame con los predictotres (X1,...,Xp). 

    # x_new tiene que ser un vector con una nueva observacion de los predictores. 


####################################################################################################################################################################################################################################################

    Y = Y.to_numpy()

    X = X.to_numpy() 

    x_new = x_new.to_numpy()

    X = np.concatenate((X, [x_new]), axis=0)


    distances = []

    Y_values_knn = []

##########################################################################################
    
    def a(Binary_Data) :

            X = Binary_Data

            a = X @ X.T

            return(a)

##########################################################################################

    def d(Binary_Data):

            X = Binary_Data

            ones_matrix = np.ones(( X.shape[0] , X.shape[1])) 

            d = (ones_matrix - X) @ (ones_matrix - X).T

            return(d)

##########################################################################################

    def alpha_py(i,j, Multiple_Categorical_Data):

            X = Multiple_Categorical_Data

            alpha = np.repeat(0, X.shape[1])

            def argumento_bucle_for(k):

                if X[i-1, k] == X[j-1, k] :

                    alpha[k] = 1

                else :

                    alpha[k] = 0

                return(alpha) 
    
            alpha=Parallel(n_jobs=n_jobs)( delayed(argumento_bucle_for)( k ) for k in range(0, X.shape[1]) )
    
            alpha = sum(alpha)

            return(alpha)

####################################################################################################################################################################################################################################################
    
    if distance == "Euclidean":

        def Dist_Euclidea_Python(i, j, Quantitative_Data_set): 

            Dist_Euclidea = ( ( Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :] )**2 ).sum()

            Dist_Euclidea = np.sqrt(Dist_Euclidea)

            return Dist_Euclidea

    ###################################################################
           
        ## PARTE DEL CODIGO A PARALELIZAR

        #for j in range(1, len(X)):

          # distances.append( Dist_Euclidea_Python( len(X), i , X ) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Euclidea_Python)( len(X), s , X ) for s in range(1, len(X)) )
           

    ###################################################################

    if distance == "Minkowski":

        def Dist_Minkowski_Python(i,j, q , Quantitative_Data_set):

            Dist_Minkowski = ( ( ( abs( Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :] ) )**q ).sum() )**(1/q)

            return Dist_Minkowski

    ###################################################################

        ## PARTE DEL CODIGO A PARALELIZAR

        # for i in range(1, len(X)):

          #  distances.append( Dist_Minkowski_Python( len(X), i , q , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Minkowski_Python)( len(X), s , q , X) for s in range(1, len(X)) )

    ###################################################################

    if distance == "Canberra":

        def Dist_Canberra_Python(i,j, Quantitative_Data_set):

            numerator =  abs( Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :] )  

            denominator =  ( abs(Quantitative_Data_set[i-1, :]) + abs(Quantitative_Data_set[j-1, :]) )
       
            numerator=np.array([numerator], dtype=float)

            denominator=np.array([denominator], dtype=float)

            Dist_Canberra = ( np.divide( numerator , denominator , out=np.zeros_like(numerator), where=denominator!=0) ).sum()

            return Dist_Canberra

    ###################################################################

        ## PARTE DEL CODIGO A PARALELIZAR

        # for i in range(1, len(X)):

          #  distances.append( Dist_Canberra_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Canberra_Python)( len(X), s , X) for s in range(1, len(X)) )
                

    ###################################################################
   
    if distance == "Pearson":

        def Dist_Pearson_Python(i, j, Quantitative_Data_set):

            Dist_Pearson = ( ( Quantitative_Data_set[i-1, ] - Quantitative_Data_set[j-1, ] )**2 / Quantitative_Data_set.var() ).sum()

            Dist_Pearson = np.sqrt(Dist_Pearson)

            return Dist_Pearson

    ###################################################################

       ## PARTE DEL CODIGO A PARALELIZAR
       
       # for i in range(1, len(X)):

        #   distances.append( Dist_Pearson_Python( len(X), i , X) )

        
        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Pearson_Python)( len(X), s , X) for s in range(1, len(X)) )

    ###################################################################
    
    if distance == "Mahalanobis":

        def Dist_Mahalanobis_Python(i, j, Quantitative_Data_set):

            # All the columns of Quantitative_Data_set must be type = 'float' or 'int' (specially not 'object'), in other case we will find 
            # dimensional problems when Python compute   x @ S_inv @ x.T

            x = (Quantitative_Data_set[i-1, :] - Quantitative_Data_set[j-1, :])

            x = np.array([x]) # necessary step to transpose a 1D array

            S_inv = np.linalg.inv( np.cov(Quantitative_Data_set , rowvar=False) ) # inverse of covariance matrix

            Dist_Maha = np.sqrt( x @ S_inv @ x.T )  # x @ S_inv @ x.T = np.matmul( np.matmul(x , S_inv) , x.T )

            Dist_Maha = float(Dist_Maha)

            return Dist_Maha

        
    ###################################################################

    ## PARTE DEL CODIGO A PARALELIZAR

       # for i in range(1, len(X)):

        #    distances.append( Dist_Mahalanobis_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Mahalanobis_Python)( len(X), s , X) for s in range(1, len(X)) )
       

    ###################################################################
    
    if distance == "Sokal":

        a = X @ X.T
        n = X.shape[0]
        p = X.shape[1]
        ones_matrix = np.ones((n, p))
        b = (ones_matrix - X) @ X.T
        c = b.T
        d = (ones_matrix - X) @ (ones_matrix - X).T

        def Sokal_Similarity_Py(i, j):

            Sokal_Similarity = ( a[i-1 , j-1] + d[i-1 , j-1] ) / p

            return Sokal_Similarity


        def Dist_Sokal_Python(i, j, Binary_Data_set):

            dist_Sokal = np.sqrt( 2 - 2*Sokal_Similarity_Py(i,j, Binary_Data_set) )

            return dist_Sokal

    ###################################################################

    ## PARTE DEL CODIGO A PARALELIZAR

      #  for i in range(1, len(X)):

        #    distances.append( Dist_Sokal_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Sokal_Python)( len(X), s , X) for s in range(1, len(X)) )

    ###################################################################
   
    if distance == "Jaccard":

        def Jaccard_Similarity_Py(i, j, Binary_Data_Matrix):

            X = Binary_Data_Py
            a = X @ X.T
            n = X.shape[0]
            p = X.shape[1]
            ones_matrix = np.ones((n, p)) 
            b = (ones_matrix - X) @ X.T
            c = b.T
            d = (ones_matrix - X) @ (ones_matrix - X).T

            Jaccard_Similarity = a[i-1,j-1] / (a[i-1,j-1] + b[i-1,j-1] + c[i-1,j-1])
            
            return Jaccard_Similarity


        def Dist_Jaccard_Python(i, j, Binary_Data_set):

            dist_Jaccard = np.sqrt( Jaccard_Similarity_Py(i,i, Binary_Data_set) + Jaccard_Similarity_Py(i,i, Binary_Data_set) - 2*Jaccard_Similarity_Py(i,j, Binary_Data_set) )

            return dist_Jaccard

    ###################################################################

    ## PARTE DEL CODIGO A PARALELIZAR

       # for i in range(1, len(X)):

        #    distances.append( Dist_Jaccard_Python( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Jaccard_Python)( len(X), s , X) for s in range(1, len(X)) )

    ###################################################################
    
    if distance == "Matches":

        def matches_similarity_py(i, j, Multiple_Categorical_Data):

            p = Multiple_Categorical_Data.shape[1]

            matches_similarity = alpha_py(i,j, Multiple_Categorical_Data) / p

            return(matches_similarity)


        def Dist_Matches_Py(i,j, Multiple_Categorical_Data):

            Dist_Matches = np.sqrt( matches_similarity_py(i, i, Multiple_Categorical_Data) +  matches_similarity_py(j, j, Multiple_Categorical_Data) - 2*matches_similarity_py(i, j, Multiple_Categorical_Data) )

            return( Dist_Matches )

    ###################################################################

        # for i in range(1, len(X)):

          #  distances.append( Dist_Matches_Py( len(X), i , X) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Matches_Py)( len(X), s , X) for s in range(1, len(X)) )

 #######################################################################   
   
    if distance == "Gower":

        # The data matrix X have to be order in the following way:
        # The p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.

##########################################################################################


        def Gower_Similarity_Python(i,j, Mixed_Data_Set, p1, p2, p3):

            X = Mixed_Data_Set

   # The data matrix X have to be order in the following way:
   # The p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.

   #####################################################################################
        
            def G(k, X):

                range = X[:,k].max() - X[:,k].min() 

                return(range)

            G_vector = np.repeat(0.5, p1)

            for r in range(0, p1):

                G_vector[r] = G(r, X)
                
      
    ##########################################################################################
    
            ones = np.repeat(1, p1)

            Quantitative_Data = X[: , 0:p1]

            Binary_Data = X[: , (p1):(p1+p2)]
            
            Multiple_Categorical_Data = X[: , (p1+p2):(p1+p2+p3) ]

    ##########################################################################################

            numerator_part_1 = ( ones - ( abs(Quantitative_Data[i-1,:] - Quantitative_Data[j-1,:]) / G_vector ) ).sum() 

            numerator_part_2 = a(Binary_Data)[i-1,j-1] + alpha_py(i,j, Multiple_Categorical_Data)

            numerator = numerator_part_1 + numerator_part_2
 
            denominator = p1 + (p2 - d(Binary_Data)[i-1,j-1]) + p3

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)

##########################################################################################

        def Dist_Gower_Py(i, j, Mixed_Data , p1, p2, p3):

            Dist_Gower = np.sqrt( 1 - Gower_Similarity_Python(i, j, Mixed_Data , p1, p2, p3) )

            return(Dist_Gower)    

    ###################################################################

        # for i in range(1, len(X)):

            # distances.append( Dist_Gower_Py( len(X), i , X, p1, p2, p3) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_Gower_Py)( len(X), s , X, p1, p2, p3) for s in range(1, len(X)) )

##############################################################################################################################################

    if distance == "Gower-BM" :

        def GowerBM_Similarity_Python(i,j, BM_Data_Set, p2, p3):

            X = BM_Data_Set

          # The data matrix X have to be order in the following way:
          # The p2 first are binary categorical, and the following p3 are multiple categorical.

##########################################################################################
       
            Binary_Data = X[: , 0:p2]

            Multiple_Categorical_Data = X[: , (p2):(p2+p3)]
 
##########################################################################################

 
            numerator_part_2 = a(Binary_Data)[i-1,j-1] + alpha_py(i,j, Multiple_Categorical_Data)

            numerator = numerator_part_2

            denominator = (p2 - d(Binary_Data)[i-1,j-1]) + p3

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)

##############################################################################################################################################
        
        def Dist_GowerBM_Py(i, j, BM_Data ,  p2, p3):

            Dist_Gower = np.sqrt( 1 - GowerBM_Similarity_Python(i, j, BM_Data , p2, p3) )

            return(Dist_Gower)

##############################################################################################################################################

        # for i in range(1, len(X)):

            # distances.append( Dist_GowerBM_Py( len(X), i , X, p2, p3) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_GowerBM_Py)( len(X), s , X, p2, p3) for s in range(1, len(X)) )

##############################################################################################################################################
    
    if distance == "Gower-BQ" :

        def GowerBQ_Similarity_Python(i,j, BQ_Data_Set, p1, p2):

            X = BQ_Data_Set


        # The data matrix X have to be order in the following way:
        # The p1 first are quantitative, the following p2 are binary categorical 

##########################################################################################
        
            def G(k, X):

                range = X[:,k].max() - X[:,k].min() 

                return(range)

            G_vector = np.repeat(0.5, p1)

            for r in range(0, p1):

                G_vector[r] = G(r, X)
##########################################################################################
    
            ones = np.repeat(1, p1)

            Quantitative_Data = X[: , 0:p1]

            Binary_Data = X[: , (p1):(p1+p2)]
         
 
##########################################################################################

            numerator_part_1 = ( ones - ( abs(Quantitative_Data[i-1,:] - Quantitative_Data[j-1,:]) / G_vector ) ).sum() 

            numerator_part_2 = a(Binary_Data)[i-1,j-1] 
     
            numerator = numerator_part_1 + numerator_part_2

            denominator = p1 + (p2 - d(Binary_Data)[i-1,j-1])  

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)

###############################################################################

        def Dist_GowerBQ_Py(i, j, BQ_Data ,  p1, p2):

            Dist_Gower = np.sqrt( 1 - GowerBQ_Similarity_Python(i, j, BQ_Data , p1, p2) )

            return(Dist_Gower)

##############################################################################################################################################

        # for i in range(1, len(X)):

        # distances.append( Dist_GowerBQ_Py( len(X), i , X, p1, p2) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_GowerBQ_Py)( len(X), s , X, p1, p2) for s in range(1, len(X)) )


##############################################################################################################################################
    
    if distance == "Gower-MQ" :
        
        def GowerMQ_Similarity_Python(i,j, MQ_Data_Set, p1, p3):

            X = MQ_Data_Set

   # The data matrix X have to be order in the following way:
   # The p1 first are quantitative, the following p2 are binary categorical, and the following p3 are multiple categorical.

##########################################################################################
            
            def G(k, X):

                range = X[:,k].max() - X[:,k].min() 

                return(range)

            G_vector = np.repeat(0.5, p1)

            for r in range(0, p1):

                G_vector[r] = G(r, X)

##########################################################################################
    
            ones = np.repeat(1, p1)

            Quantitative_Data = X[: , 0:p1]
    
            Multiple_Categorical_Data = X[: , (p1):(p1+p3)]
 
    
##########################################################################################

            numerator_part_1 = ( ones - ( (Quantitative_Data[i-1,:] - Quantitative_Data[j-1,:]).abs() / G_vector ) ).sum() 

            numerator_part_2 =   alpha_py(i,j, Multiple_Categorical_Data)

            numerator = numerator_part_1 + numerator_part_2

            denominator = p1 + p3

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)



############################################################################################

        def Dist_GowerMQ_Py(i, j, MQ_Data ,  p1, p3):

                Dist_Gower = np.sqrt( 1 - GowerMQ_Similarity_Python(i, j, MQ_Data , p1, p3) )

                return(Dist_Gower)


######################################################################################################################################
        # for i in range(1, len(X)):

        # distances.append( Dist_GowerMQ_Py( len(X), i , X, p1, p3) )

        n_jobs  = multiprocessing.cpu_count()

        distances = Parallel(n_jobs=n_jobs)( delayed(Dist_GowerMQ_Py)( len(X), s , X, p1, p3) for s in range(1, len(X)) )
    
######################################################################################################################################
    
    distances = pd.DataFrame({'distances': distances})

    distances = distances.sort_values(by=["distances"]).reset_index(drop=False)
        
    knn = distances.iloc[0:k , :]

    for i in knn.iloc[:,0] :

        Y_values_knn.append(Y[i , ])


    y_predict = sum(Y_values_knn)/k

    
                                     
    return y_predict , distances 

Testing our `KNN_regression` function:

In [533]:
y_predict , distances  = KNN_regression( X_train, Y_train , x_new, k=10, distance = "Minkowski" , q = 2 )

In [534]:
y_predict

2026203.4

In [535]:
distances

Unnamed: 0,index,distances
0,622,3.092918
1,141,4.384152
2,841,4.455426
3,457,4.823216
4,1414,5.331320
...,...,...
1519,922,515.588851
1520,1276,571.648778
1521,598,572.971905
1522,1401,625.342887


In [536]:
y_predict , distances  = KNN_regression( X_train , Y_train , x_new, k=10, distance = "Minkowski" , q = 5 )

In [537]:
y_predict

2026203.4

In [538]:
y_predict , distances  = KNN_regression( X_train , Y_train , x_new, k=10, distance = "Pearson" )

In [539]:
y_predict

2026203.4

In [540]:
y_predict , distances  = KNN_regression( X_train , Y_train , x_new, k=10, distance = "Canberra" )

In [541]:
y_predict

1500199.9

In [542]:
y_predict , distances  = KNN_regression( X_train , Y_train , x_new, k=10, distance = "Mahalanobis" )

In [543]:
y_predict

1706999.8

Since in this case $X$ is a matrix of **mixed data**, the most suitable distance allowed by our `KNN_regression` function is the **Gower distance**.

To use our Gower distance we must order the columns of $X$ appropriately. The first $p1$ will be the quantitative variables, the next $p2$ the binary ones and the last $p3$ the multi class ones

In [544]:
X_train_new = X_train.loc[ : , ['size_in_m_2',  'latitude', 'longitude', 'no_of_bedrooms', 'no_of_bathrooms',           # Quantitatives (5)
                                    
                                     'maid_room_recode', 'unfurnished_recode', 'balcony_recode', 'barbecue_area_recode', 'central_ac_recode',
                                     'childrens_play_area_recode', 'childrens_pool_recode', 'concierge_recode', 'covered_parking_recode',
                                     'kitchen_appliances_recode', 'maid_service_recode', 'pets_allowed_recode', 'private_garden_recode',         # Binary (21)
                                     'private_gym_recode', 'private_jacuzzi_recode', 'private_pool_recode', 'security_recode',
                                     'shared_gym_recode', 'shared_pool_recode', 'shared_spa_recode', 'view_of_water_recode' ,
                                       
                                     'neighborhood_recode', 'quality_recode'         # Multi-class (2)
                                       
                            ]]    

In [545]:
y_predict , distances  = KNN_regression( X_train_new , Y_train , x_new, 10,  "Gower",  5, 21, 2  )

In [546]:
y_predict

2261610.0

In [547]:
distances

Unnamed: 0,index,distances
0,771,0.467648
1,992,0.467826
2,712,0.474961
3,357,0.479484
4,508,0.485017
...,...,...
1519,28,0.772241
1520,1504,0.789965
1521,1496,0.793064
1522,364,0.811451


----

### Simple validation with own validation function and own KNN regression function  <a class="anchor" id="7.1"></a>

In [548]:
def Simple_Validation_Regression(Data_Test, X_train, Y_train, Y_test) :

    ##########################

    from joblib import Parallel, delayed
    import multiprocessing

    n_jobs  = multiprocessing.cpu_count()

    ##########################

    # Aqui usamos ahora nuestra funcion en lugar de la de Sklearn

    ##########################

    def prediction(i, Data_Test, X_train, Y_train ):

     x_new = Data_Test.iloc[ i , range(1, Data_Test.shape[1])]

     y_predict , distances  = KNN_regression( X_train , Y_train , x_new, k=10, distance = "Minkowski" , q = 2 )

     return(y_predict)

    ##########################

    y_predictions_vector = []

    # Paralelizamos el siguiente bucle for :

    # for i in  range(0, len(Data_Test)):

        # y_new_predict = prediction(i, Data_Test, X_train, Y_train )

        # y_predictions_vector.append( y_new_predict )

    
    y_predictions_vector = Parallel(n_jobs=n_jobs)( delayed(prediction)( i, Data_Test, X_train, Y_train) for i in range(0, len(Data_Test)) )

    #########################

    ECM = sum( (Y_test - y_predictions_vector)**2 )/len(Y_test)     

 
    return(y_predictions_vector , ECM)

In [549]:
y_predictions_vector , ECM = Simple_Validation_Regression(Data_Test, X_train, Y_train, Y_test)

In [550]:
ECM

1705111679187.3896

----

### Selecting an optimal $k$ with cross-validation <a class="anchor" id="8"></a>

---

## Bibliography

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

apuntes aurea

apuntes nogales
