# KNN in `Python`

* [KNN for supervised classification](#1)
* * [Toy example](#2)
* * [KNN for supervised classification in `Python` with `sklearn`](#3)
* * [KNN for classification in `Python` with own algorithm](#4)

* [KNN for regression](#5)
* * [KNN for regression in `Python` with `sklearn`](#6)
* * [KNN for regression in `Python` with own algorithm](#7)

* [Selecting an optimal k with cross-validation](#8)

---

## KNN for supervised classification <a class="anchor" id="1"></a>  

- We have $\hspace{0.1cm} p \hspace{0.1cm}$ variables $\hspace{0.1cm} X=(X_1,...,X_p) \hspace{0.1cm}$ measured on a sample of size $n$.

- We also have a **categorical** response variable $\hspace{0.1cm} Y \hspace{0.1cm}$ with $\hspace{0.1cm} g \hspace{0.1cm}$  categories that indicates  the group to which each element of the sample belongs  $ ( \hspace{0.05cm} Range(Y)=\lbrace c_1 ,..., c_g \rbrace \hspace{0.05cm})$

- The groups generated by $\hspace{0.1cm} Y \hspace{0.1cm}$ are denoted as $\hspace{0.1cm} \Omega_1 ,..., \Omega_g \hspace{0.15cm}$   $\hspace{0.15cm}( \hspace{0.1cm} y_i = c_r \hspace{0.15cm} \Leftrightarrow \hspace{0.15cm}$  $ i \in \Omega_r \hspace{0.1cm})$



The supervised classification problem consists of predicting, for a new observation $\hspace{0.1cm} x_{new} = (x_{new,1}\hspace{0.1cm},\hspace{0.1cm}x_{new,2}\hspace{0.1cm},\dots,\hspace{0.1cm}x_{new,p}) \hspace{0.1cm}$ of the variables $X_1,...,X_p  \hspace{0.1cm}$, its $\hspace{0.1cm} Y \hspace{0.1cm}$ value $\hspace{0.1cm} (y_{new})\hspace{0.1cm}$ using the available information on $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$ and $ \hspace{0.1cm} Y$.

In other words, the problem is to classify a new element/individual into one of the $\hspace{0.1cm} g \hspace{0.1cm}$ groups generated by $\hspace{0.1cm} Y \hspace{0.1cm}$, using the available information on $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$ and $Y$ together with $\hspace{0.1cm} x_{new}$.

Note that if we had no information about $\hspace{0.1cm} Y \hspace{0.1cm}$, this would be an unsupervised classification problem.

----

The KNN (K-nearest neighbors) algorithm for supervised classification has the following steps:



 $1. \hspace{0.15cm}$ Define a **distance** measure between the observations of the original sample with respect to the variables $X_1,...,X_p$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\delta$



 $2. \hspace{0.15cm}$ Compute the distances between $\hspace{0.1 cm}x_{new}\hspace{0.1 cm}$ and the initial observations $\hspace{0.1cm} \lbrace x_1,...,x_n \rbrace$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\lbrace \hspace{0.1 cm}  \delta(x_{new}\hspace{0.03 cm},\hspace{0.03 cm} x_i) \hspace{0.1 cm} / \hspace{0.1 cm}  i=1,...,n \hspace{0.1 cm}  \rbrace$

  
 $3. \hspace{0.15cm}$ Select the  $\hspace{0.03 cm} k \hspace{0.03 cm}$ nearest observations to $\hspace{0.06 cm} x_{new}\hspace{0.06 cm}$ based on $\hspace{0.05cm} \delta \hspace{0.12cm}$ $(k$ nearest neighbors of $x_{new})$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$   The set of these observations will be denoted by $KNN$ 

 $4. \hspace{0.15cm}$ Compute the proportion of these observations (neighbors) that belong to each group $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$  
 
 $\hspace{0.65cm} \Rightarrow \hspace{0.15cm}$ The proportion of $KNN$ that belongs to the group $\hspace{0.15cm} \Omega_r$ $\hspace{0.1cm}(Y=c_r)\hspace{0.1cm}$ will be denoted by $\hspace{0.1 cm} f^{knn}_{r}  $



   $$ \hspace{0.1 cm} f^{knn}_{r} \hspace{0.15cm}=\hspace{0.15cm} \dfrac{ \# \hspace{0.1cm}\lbrace\hspace{0.1cm} i \in KNN \hspace{0.1cm}/\hspace{0.1cm} i \in \Omega_r \hspace{0.1cm}\rbrace  }{\# \hspace{0.1cm} KNN} \hspace{0.15cm}=\hspace{0.15cm} \dfrac{ \# \hspace{0.1cm}\lbrace\hspace{0.1cm} i \in KNN \hspace{0.1cm}/\hspace{0.1cm} i \in \Omega_r \hspace{0.1cm}\rbrace  }{k} $$
   


$5. \hspace{0.15cm}$ Classify $\hspace{0.1cm} x_{new} \hspace{0.1cm}$ into the group with the largest proportion of nearest neighbors $\hspace{0.18cm} \Rightarrow \hspace{0.2cm}$ $\text{If}   \hspace{0.15cm} \underbrace{ f^{knn}_{s} \geqslant f^{knn}_{r} \hspace{0.15cm},\hspace{0.15cm} \forall r = 1,...,g  }_{\Omega_s \hspace{0.1cm}\text{is the most frequent group in}\hspace{0.1cm} KNN } $    $\hspace{0.15cm}  \Rightarrow \hspace{0.15cm} x_{new} \hspace{0.1cm}$ is classified in $\hspace{0.1cm} \Omega_s$
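
The five steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming the Euclidean distance as $\delta$ and resolving ties by taking the first group; the function name `knn_classify` and the demo data are hypothetical and are not part of the notebook.

```python
import numpy as np

def knn_classify(X, y, x_new, k):
    """Classify x_new by majority vote among its k nearest neighbors (Euclidean)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    x_new = np.asarray(x_new, dtype=float)

    # Steps 1-2: Euclidean distance from x_new to every observation of the sample
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))

    # Step 3: positions of the k smallest distances (stable sort keeps input order on ties)
    knn_idx = np.argsort(distances, kind="stable")[:k]

    # Step 4: proportion of neighbors that belong to each group
    groups, counts = np.unique(y[knn_idx], return_counts=True)
    proportions = dict(zip(groups.tolist(), (counts / k).tolist()))

    # Step 5: group with the largest proportion (the first one if there is a tie)
    prediction = groups[np.argmax(counts)].item()
    return prediction, proportions

# Hypothetical data: three points near the origin (group "a") and two far away (group "b")
X_demo = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
y_demo = ["a", "a", "a", "b", "b"]

prediction, proportions = knn_classify(X_demo, y_demo, [0.5, 0.5], k=3)
print(prediction, proportions)
```

With `k=3` the three neighbors all belong to group `"a"`, so the proportion of that group is 1 and it is the predicted group.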

-----

Why is KNN a supervised classification method and not an unsupervised one?



Because in this problem we have a vector of observations of the response variable $Y$.

The fact that we do not have $\hspace{0.1 cm} y_{new} \hspace{0.1 cm}$ does not turn it into an unsupervised problem.

---

### Toy example <a class="anchor" id="2"></a>



- Sample: $n=3$



- Predictors:

$X_1 = (10 , 2 , 4)$

$X_2 = (20 , 25, 40)$



- Observations:

$x_1 =(10,20)$

$x_2=(2,25)$

$x_3=(4,40)$



- Response: $\hspace{0.1cm}(2$ categories $(0,1)$, then $2$ groups $\hspace{0.1cm}\Omega_0 , \Omega_1)$

$Y =( 1 , 1 , 0 )$



- Distance $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$  $ \delta_{Euclidean}$



- New observation:  

$x_{new}=(6, 20)$



- Computing the distances:

$\delta(x_{new}, x_1)_{Euclidean} = \sqrt{(10-6)^2 + (20-20)^2} = \sqrt{16} = 4$

$\delta(x_{new}, x_2)_{Euclidean} = \sqrt{(2-6)^2 + (25-20)^2} = \sqrt{16 + 25} = \sqrt{41} \approx 6.40$

$\delta(x_{new}, x_3)_{Euclidean} = \sqrt{(4-6)^2 + (40-20)^2} = \sqrt{4 + 400} = \sqrt{404} \approx 20.10$



- Selecting the $\hspace{0.05cm} k=2 \hspace{0.05cm}$ nearest neighbors of $\hspace{0.05cm}x_{new}$ $\hspace{0.2cm}\Rightarrow\hspace{0.215cm}$ $ KNN \hspace{0.01cm}=\hspace{0.01cm} \lbrace\hspace{0.1cm} x_1 , x_2 \hspace{0.1cm}\rbrace \hspace{0.01cm}=\hspace{0.01cm} \lbrace \hspace{0.1cm} \text{individual 1} , \text{individual 2}  \hspace{0.1cm}\rbrace $



- Computing the proportions $f^{knn}$ : 

Note that $\hspace{0.1cm} y_1 = 1 \hspace{0.1cm}$ and $ \hspace{0.1cm} y_2 = 1\hspace{0.2cm} \Rightarrow \hspace{0.2cm} f^{knn}_0 =  0/2 = 0\hspace{0.1cm} $ and $\hspace{0.1cm} f^{knn}_1 =  2/2 = 1$



- So the algorithm classifies $x_{new}$ in the group $\Omega_1$, that is, it predicts $\hat{y}_{new} = 1$
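
The toy example can be checked by hand with a few lines of plain Python. The sketch below uses only the standard library and the numbers given above; the variable names are chosen for illustration. It computes both the squared differences and their square roots, since the ranking of the neighbors is the same for either.

```python
import math
from collections import Counter

X = [(10, 20), (2, 25), (4, 40)]   # observations x_1, x_2, x_3
Y = [1, 1, 0]                      # group of each observation
x_new = (6, 20)
k = 2

# Sum of squared differences, and its square root (the Euclidean distance)
squared = [sum((a - b) ** 2 for a, b in zip(x, x_new)) for x in X]
distances = [math.sqrt(s) for s in squared]

# Positions of the k nearest neighbors (0-based: 0 is individual 1)
knn = sorted(range(len(X)), key=lambda i: distances[i])[:k]

# Proportion of neighbors in each group, and the most frequent group
counts = Counter(Y[i] for i in knn)
proportions = {group: counts.get(group, 0) / k for group in sorted(set(Y))}
y_hat = max(proportions, key=proportions.get)

print(squared)      # [16, 41, 404]
print(knn)          # [0, 1]
print(proportions)  # {0: 0.0, 1: 1.0}
print(y_hat)        # 1
```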


---

### KNN for supervised classification in `Python` with `sklearn`<a class="anchor" id="3"></a>

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np

In [3]:
Gender_classification = pd.read_csv('gender_classification.csv')

In [4]:
Gender_classification.head()

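| | long_hair | forehead_width_cm | forehead_height_cm | nose_wide | nose_long | lips_thin | distance_nose_to_lip_long | gender |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 11.8 | 6.1 | 1 | 0 | 1 | 1 | Male |
| 1 | 0 | 14.0 | 5.4 | 0 | 0 | 1 | 0 | Female |
| 2 | 0 | 11.8 | 6.3 | 1 | 1 | 1 | 1 | Male |
| 3 | 0 | 14.4 | 6.1 | 0 | 1 | 1 | 1 | Male |
| 4 | 1 | 13.5 | 5.9 | 0 | 0 | 0 | 0 | Female |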


In [5]:
import sklearn

from sklearn.neighbors import NearestNeighbors

In [6]:
X = Gender_classification.iloc[ : , 0:7]

Y = Gender_classification.iloc[ : , 7]

x_new = [1, 10, 5, 1, 0, 1, 1]

In [7]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', p=2, metric='minkowski')

It is advisable to consult the sklearn documentation first: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [8]:
knn.fit(X, Y)

KNeighborsClassifier(n_neighbors=10)

In [9]:
print( knn.predict( [x_new] ) )

['Male']


In [10]:
print( knn.predict_proba([x_new]) )

[[0.2 0.8]]
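
The columns returned by `predict_proba` follow the order of the fitted estimator's `classes_` attribute, which holds the sorted class labels. The sketch below illustrates this on hypothetical toy data (not the gender data set used above); the names `X_toy`, `y_toy` and `clf` are chosen for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-dimensional data: three "Female" points on the left, three "Male" on the right
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array(["Female", "Female", "Female", "Male", "Male", "Male"])

clf = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)

# The 5 nearest neighbors of 8.0 are 10, 11, 12 (Male) and 2, 1 (Female)
proba = clf.predict_proba([[8.0]])[0]

# Pair each probability with its class label
by_class = dict(zip(clf.classes_, proba))
print(clf.classes_)
print(by_class)
```

Here three of the five neighbors are `Male`, so the second column (the one for `Male`) is 0.6 and the first is 0.4.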


With other distances:

In [11]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform',  metric='cityblock')

In [12]:
knn.fit(X, Y)

print( knn.predict( [x_new] ) )
 
print( knn.predict_proba([x_new]) )

['Male']
[[0.1 0.9]]


In [13]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', metric='cosine')

In [14]:
knn.fit(X, Y)

print( knn.predict( [x_new] ) ) 

print( knn.predict_proba([x_new]) )

['Male']
[[0. 1.]]


In [15]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', metric='nan_euclidean')

In [16]:
knn.fit(X, Y)

print( knn.predict( [x_new] ) ) 

print( knn.predict_proba([x_new]) )

['Male']
[[0.2 0.8]]


In [17]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform', metric='manhattan')

In [18]:
knn.fit(X, Y)

print( knn.predict( [x_new] ) ) 

print( knn.predict_proba([x_new]) )

['Male']
[[0.1 0.9]]


----

### KNN for classification in `Python` with own algorithm <a class="anchor" id="4"></a>

We are going to develop our own algorithm so as not to depend on sklearn

In [19]:
def KNN_classification( X , Y , x_new, k, distance = "Minkowski" , q=2, p1=0, p2=0, p3=0 ):  # q must be > 0: q=0 would divide by zero in the exponent 1/q

####################################################################################################################################################################################################################################################

    # Y must be a categorical variable with standard categories (0,1,2,...)

    # Example of Y :  Y = Gender_classification.iloc[0:20 , 7]

    # Example of how to recode Y into standard categories (not needed if it is already in that format):
      
      # for i in range(0, len(Y)):
          # if Y[i]=='Male':
          #   Y[i] = 0
          # elif Y[i]=='Female':
          #   Y[i]=1

    # X must be a pandas data frame with the predictors (X1,...,Xp). Example: X = Gender_classification.iloc[0:20 , 0:7]

    # x_new must be a pandas Series. Example: x_new = pd.Series({'long_hair': 1, 'forehead_width_cm': 4, 'forehead_height_cm': 6, 'nose_wide': 1 , 'nose_long': 1 , 'lips_thin':1, 'distance_nose_to_lip_long': 1 })

####################################################################################################################################################################################################################################################

    X = pd.concat([X, x_new.to_frame().T], ignore_index=True)

    distances = []

    groups_knn = []


####################################################################################################################################################################################################################################################
    
    if distance == "Euclidean":

        def Dist_Euclidea_Python(i, j, Quantitative_Data_set): 

            Dist_Euclidea = ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] )**2 ).sum()

            Dist_Euclidea = np.sqrt(Dist_Euclidea)

            return Dist_Euclidea

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Euclidea_Python( len(X), i , X) )

    ###################################################################

    if distance == "Minkowski":

        def Dist_Minkowski_Python(i,j, q , Quantitative_Data_set):

            Dist_Minkowski = ( ( ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] ).abs() )**q ).sum() )**(1/q)

            return Dist_Minkowski

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Minkowski_Python( len(X), i , q , X) )

        

    ###################################################################

    if distance == "Canberra":

        def Dist_Canberra_Python(i,j, Quantitative_Data_set):

            numerator =  ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] ).abs()  

            denominator =  ( (Quantitative_Data_set.iloc[i-1, ]).abs() + (Quantitative_Data_set.iloc[j-1, ]).abs() )
       
            numerator=np.array([numerator], dtype=float)

            denominator=np.array([denominator], dtype=float)

            Dist_Canberra = ( np.divide( numerator , denominator , out=np.zeros_like(numerator), where=denominator!=0) ).sum()

            return Dist_Canberra

    ###################################################################
    
        for i in range(1, len(X)):

            distances.append( Dist_Canberra_Python( len(X), i , X) )

    ###################################################################
   
    if distance == "Pearson":

        def Dist_Pearson_Python(i, j, Quantitative_Data_set):

            Dist_Pearson = ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] )**2 / Quantitative_Data_set.var() ).sum()

            Dist_Pearson = np.sqrt(Dist_Pearson)

            return Dist_Pearson

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Pearson_Python( len(X), i , X) )

    ###################################################################
    
    if distance == "Mahalanobis":

        def Dist_Mahalanobis_Python(i, j, Quantitative_Data_set):

            x = (Quantitative_Data_set.to_numpy()[i-1, ] - Quantitative_Data_set.to_numpy()[j-1, ])

            x = np.array([x]) # necessary step to transpose a 1D array

            S_inv = np.linalg.inv( Quantitative_Data_set.cov() ) # inverse of covariance matrix

            Dist_Maha = np.sqrt( x @ S_inv @ x.T )  # x @ S_inv @ x.T = np.matmul( np.matmul(x , S_inv) , x.T )

            Dist_Maha = float(Dist_Maha[0, 0])  # extract the scalar from the 1x1 matrix

            return Dist_Maha

        
    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Mahalanobis_Python( len(X), i , X) )

       

    ###################################################################
    
    if distance == "Sokal":

        a = X @ X.T
        n = X.shape[0]
        p = X.shape[1]
        ones_matrix = np.ones((n, p))
        b = (ones_matrix - X) @ X.T
        c = b.T
        d = (ones_matrix - X) @ (ones_matrix - X).T

        def Sokal_Similarity_Py(i, j):

            Sokal_Similarity = (a.iloc[i-1,j-1] + d.iloc[i-1,j-1])/p

            return Sokal_Similarity


        def Dist_Sokal_Python(i, j, Binary_Data_set):

            dist_Sokal = np.sqrt( 2 - 2*Sokal_Similarity_Py(i,j) )  # Sokal_Similarity_Py takes only (i, j)

            return dist_Sokal

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Sokal_Python( len(X), i , X) )

    ###################################################################
   
    if distance == "Jaccard":

        def Jaccard_Similarity_Py(i, j, Binary_Data_Matrix):

            X = Binary_Data_Matrix
            a = X @ X.T
            n = X.shape[0]
            p = X.shape[1]
            ones_matrix = np.ones((n, p)) 
            b = (ones_matrix - X) @ X.T
            c = b.T
            d = (ones_matrix - X) @ (ones_matrix - X).T

            Jaccard_Similarity = a.iloc[i-1,j-1] / (a.iloc[i-1,j-1] + b.iloc[i-1,j-1] + c.iloc[i-1,j-1])
            
            return Jaccard_Similarity


        def Dist_Jaccard_Python(i, j, Binary_Data_set):

            dist_Jaccard = np.sqrt( Jaccard_Similarity_Py(i,i, Binary_Data_set) + Jaccard_Similarity_Py(j,j, Binary_Data_set) - 2*Jaccard_Similarity_Py(i,j, Binary_Data_set) )

            return dist_Jaccard

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Jaccard_Python( len(X), i , X) )

    ###################################################################
    
    if distance == "Matches":

        def alpha_py(i,j, Multiple_Categorical_Data):

            X = Multiple_Categorical_Data
            alpha = np.repeat(0, X.shape[1])

            for k in range(0, X.shape[1]) :

                if X.iloc[i-1, k] == X.iloc[j-1, k] :

                    alpha[k] = 1

                else :

                    alpha[k] = 0

            alpha = alpha.sum()

            return(alpha)


        def matches_similarity_py(i, j, Multiple_Categorical_Data):

            p = Multiple_Categorical_Data.shape[1]

            matches_similarity = alpha_py(i,j, Multiple_Categorical_Data) / p

            return(matches_similarity)


        def Dist_Matches_Py(i,j, Multiple_Categorical_Data):

            Dist_Matches = np.sqrt( matches_similarity_py(i, i, Multiple_Categorical_Data) +  matches_similarity_py(j, j, Multiple_Categorical_Data) - 2*matches_similarity_py(i, j, Multiple_Categorical_Data) )

            return( Dist_Matches )

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Matches_Py( len(X), i , X) )

    ###################################################################
   
    if distance == "Gower":

        # The variables must be ordered as follows: the first p1 are quantitative,
        # the next p2 are binary categorical, and the last p3 are multiclass categorical.

        def Gower_Similarity_Python(i, j, Mixed_Data_Set, p1, p2, p3):

            # i and j are 1-based row numbers, as in the other distance functions

            X = Mixed_Data_Set

            Quantitative_Data = X.iloc[: , 0:p1]
            Binary_Data = X.iloc[: , p1:(p1+p2)]
            Multiple_Categorical_Data = X.iloc[: , (p1+p2):(p1+p2+p3)]

            # Range of each quantitative variable, stored as floats so it is not truncated
            G_vector = ( Quantitative_Data.max() - Quantitative_Data.min() ).to_numpy(dtype=float)

            quant_i = Quantitative_Data.iloc[i-1, :].to_numpy(dtype=float)
            quant_j = Quantitative_Data.iloc[j-1, :].to_numpy(dtype=float)

            binary_i = Binary_Data.iloc[i-1, :].to_numpy(dtype=float)
            binary_j = Binary_Data.iloc[j-1, :].to_numpy(dtype=float)

            a = ( binary_i * binary_j ).sum()              # positive matches (1,1)
            d = ( (1 - binary_i) * (1 - binary_j) ).sum()  # negative matches (0,0)

            # Number of multiclass variables on which both rows coincide
            alpha = ( Multiple_Categorical_Data.iloc[i-1, :].to_numpy() == Multiple_Categorical_Data.iloc[j-1, :].to_numpy() ).sum()

            numerator = ( 1 - np.abs(quant_i - quant_j) / G_vector ).sum() + a + alpha

            denominator = p1 + (p2 - d) + p3

            return numerator / denominator

        #################################################################################
        
        def Dist_Gower_Py(i, j, Mixed_Data , p1, p2, p3):

            return np.sqrt( 1 - Gower_Similarity_Python(i, j, Mixed_Data , p1, p2, p3) )

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Gower_Py( len(X), i , X, p1, p2, p3) )

    ###################################################################
    
    distances = pd.DataFrame({'distances': distances})

    # Sort by distance; the 'index' column keeps the original row position of each observation
    distances = distances.sort_values(by=["distances"]).reset_index(drop=False)
        
    knn = distances.iloc[0:k , :]

    for i in knn.iloc[:,0]:

        groups_knn.append(Y.iloc[i])  # positional lookup, so Y may have any index

    # Groups present among the k nearest neighbors and how many neighbors fall in each
    unique, counts = np.unique(groups_knn , return_counts=True)

    proportions_groups_knn = pd.DataFrame({'proportions_groups': counts/k, 'groups': unique })

    prediction_group = proportions_groups_knn.sort_values(by=["proportions_groups"], ascending=False).iloc[0,:]['groups']

    # print() returns None, so message is always None; the text is only printed
    message = print( "x_new is classify in the group", prediction_group , ". So KNN algorithm predict y_new =",  prediction_group )                                      
                                       

    return proportions_groups_knn , message  
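
One possible way to shorten a function like this is to keep the distances in a dictionary and look up the one requested, instead of repeating an `if` block and a loop per distance. The following is only a sketch of that design under simplifying assumptions: the registry `DISTANCES` and the function `knn_proportions` are hypothetical names, only two distances are included, and `Y` is assumed to share its index with `X`.

```python
import numpy as np
import pandas as pd

# Hypothetical registry: each name maps to a function of two 1-D float arrays
DISTANCES = {
    "Euclidean": lambda u, v: np.sqrt(((u - v) ** 2).sum()),
    "Manhattan": lambda u, v: np.abs(u - v).sum(),
}

def knn_proportions(X, Y, x_new, k, distance="Euclidean"):
    """Proportion of each group among the k nearest neighbors of x_new."""
    dist = DISTANCES[distance]

    # Align x_new with the column order of X before converting to arrays
    x = x_new[X.columns].to_numpy(dtype=float)

    # Distance from x_new to every row of X, labelled with the index of X
    d = pd.Series([dist(row, x) for row in X.to_numpy(dtype=float)], index=X.index)

    # Index labels of the k smallest distances (mergesort is stable on ties)
    nearest = d.sort_values(kind="mergesort").index[:k]

    return Y.loc[nearest].value_counts(normalize=True).sort_index()

# Hypothetical data
X_demo = pd.DataFrame({"a": [0, 1, 5, 6], "b": [0, 1, 5, 6]})
Y_demo = pd.Series([0, 0, 1, 1])
x_demo = pd.Series({"a": 0.5, "b": 0.5})

props = knn_proportions(X_demo, Y_demo, x_demo, k=3)
print(props)
```

Adding a new distance then only requires a new entry in the dictionary.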

Testing our `KNN_classification` function in a binary classification problem:

In [20]:
Gender_classification = pd.read_csv('gender_classification.csv')

X = Gender_classification.iloc[: , 0:7]

Y = Gender_classification.iloc[: , 7]

x_new = pd.Series({'long_hair': 1, 'forehead_width_cm': 4, 'forehead_height_cm': 6, 'nose_wide': 1 , 'nose_long': 1 , 'lips_thin':1, 'distance_nose_to_lip_long': 1 })


# Recoding Y using the standard format (Male -> 0, Female -> 1):

Y = Y.map({'Male': 0, 'Female': 1})

In [21]:
proportions_groups_knn , message = KNN_classification( X , Y , x_new, 10 , distance = "Euclidean" )

x_new is classify in the group 0 . So KNN algorithm predict y_new = 0


In [22]:
proportions_groups_knn

| | proportions_groups | groups |
|---|---|---|
| 0 | 0.9 | 0 |
| 1 | 0.1 | 1 |


Using other distance functions and values of $k$:

In [23]:
KNN_classification( X , Y , x_new, 5 , distance = "Minkowski" , q = 2 )

x_new is classify in the group 0 . So KNN algorithm predict y_new = 0


(   proportions_groups groups
 0                 0.9      0
 1                 0.1      1,
 None)

In [24]:
KNN_classification( X , Y , x_new, 2 , distance = "Minkowski" , q = 1 )

x_new is classify in the group 0.0 . So KNN algorithm predict y_new = 0.0


(   proportions_groups  groups
 0                 1.0       0,
 None)

In [25]:
KNN_classification( X , Y , x_new, 10 , distance = "Minkowski" , q = 3 )

x_new is classify in the group 1.0 . So KNN algorithm predict y_new = 1.0


(   proportions_groups  groups
 0                 1.0       1,
 None)

In [26]:
KNN_classification( X , Y , x_new, 3 , distance = "Canberra"  )

x_new is classify in the group 0.0 . So KNN algorithm predict y_new = 0.0


(   proportions_groups  groups
 0                 1.0       0,
 None)

In [27]:
KNN_classification( X , Y , x_new, 15 , distance = "Mahalanobis"  )

x_new is classify in the group 0.0 . So KNN algorithm predict y_new = 0.0


(   proportions_groups  groups
 0                 1.0       0,
 None)

In [28]:
KNN_classification( X , Y , x_new, 6 , distance = "Pearson"  )

x_new is classify in the group 0.0 . So KNN algorithm predict y_new = 0.0


(   proportions_groups  groups
 0                 1.0       0,
 None)
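
The Pearson distance used above divides each squared difference by the variance of its variable, while the Mahalanobis distance uses the whole inverse covariance matrix, so it also accounts for the correlation between variables. The sketch below computes both for one pair of rows of a hypothetical data matrix and checks that they coincide when the covariance matrix is replaced by its diagonal.

```python
import numpy as np

# Hypothetical data matrix: 4 observations of 2 quantitative variables
X_demo = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]])
diff = X_demo[0] - X_demo[3]

# Pearson distance: Euclidean distance after scaling by the variances
variances = X_demo.var(axis=0, ddof=1)   # ddof=1 matches the pandas .var() default
pearson = np.sqrt((diff ** 2 / variances).sum())

# Mahalanobis distance: uses the inverse of the full covariance matrix
S = np.cov(X_demo, rowvar=False)         # normalizes by n-1, like pandas .cov()
mahalanobis = np.sqrt(diff @ np.linalg.inv(S) @ diff)

# With a diagonal covariance matrix, Mahalanobis reduces to Pearson
S_diag = np.diag(variances)
mahalanobis_diag = np.sqrt(diff @ np.linalg.inv(S_diag) @ diff)

print(pearson, mahalanobis, mahalanobis_diag)
```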

In [29]:
# KNN_classification( X , Y , x_new, 10 , distance = "Gower"  )

In this case the Gower distance cannot be applied because X is not a mixed data matrix (it only has quantitative and binary variables; the multiclass ones are missing).
The Sokal and Jaccard distances cannot be used either, because X is not a binary data matrix, nor can the matches (coincidences) coefficient, because X is not a multiclass categorical data matrix.

----

Testing our `KNN_classification` function in a multi-class classification problem:

In [40]:
Wine_Classification = pd.read_csv('WineQT.csv')

In [41]:
Wine_Classification.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |


In [51]:
X = Wine_Classification.iloc[: , 0:11]

Y = Wine_Classification.iloc[: , 11]

x_new = pd.Series({'fixed acidity': 7, 'volatile acidity': 0.8, 'citric acid': 0.04, 'residual sugar': 2.1 , 'chlorides': 0.070 , 'free sulfur dioxide': 12 , 'total sulfur dioxide':37 , 'density': 0.998 , 'pH':3.45, 'sulphates':0.60, 'alcohol':9.6 })

In this case the response variable $Y$ (wine quality) has more than two categories (`predict_proba` returns six columns, one per observed quality level), so this is a multi-class classification problem.

In [52]:
proportions_groups_knn , message = KNN_classification( X , Y , x_new, 10 , distance = "Euclidean" )

x_new is classify in the group 6.0 . So KNN algorithm predict y_new = 6.0


In [53]:
proportions_groups_knn

| | proportions_groups | groups |
|---|---|---|
| 0 | 0.3 | 5 |
| 1 | 0.7 | 6 |


We now try another $k$ and another distance:

In [58]:
proportions_groups_knn , message = KNN_classification( X , Y , x_new, 5 , distance = "Minkowski" , q = 1 )

x_new is classify in the group 5.0 . So KNN algorithm predict y_new = 5.0


In [59]:
proportions_groups_knn

| | proportions_groups | groups |
|---|---|---|
| 0 | 0.6 | 5 |
| 1 | 0.4 | 6 |


Comparing with the sklearn function:

In [56]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=10 ,  weights='uniform',  metric='euclidean')

In [57]:
knn.fit(X, Y)

print( knn.predict( [x_new] ) )

print( knn.predict_proba([x_new]) )

[6]
[[0.  0.  0.3 0.7 0.  0. ]]


In [60]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5 ,  weights='uniform',   p=1, metric='minkowski')

In [61]:
knn.fit(X, Y)

print( knn.predict( [x_new] ) )

print( knn.predict_proba([x_new]) )

[5]
[[0.  0.  0.6 0.4 0.  0. ]]
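
Our function lists only the groups that appear among the neighbors, whereas `predict_proba` returns one column for every class seen during fitting. To compare both outputs in the same layout, the proportions can be reindexed over the full set of classes. A sketch with made-up neighbor labels; the six quality levels 3 to 8 are an assumption made for the example, not read from the data set.

```python
import numpy as np
import pandas as pd

# Assumed full set of classes (hypothetical: six quality levels)
all_classes = np.array([3, 4, 5, 6, 7, 8])

# Hypothetical labels of the 10 nearest neighbors: three 5s and seven 6s
neighbor_labels = pd.Series([5, 6, 6, 5, 6, 6, 6, 5, 6, 6])

# Proportions per observed group, then one entry per class (0 for absent classes)
proportions = (
    neighbor_labels.value_counts(normalize=True)
    .reindex(all_classes, fill_value=0.0)
)

print(proportions.to_numpy())
```

The resulting vector has zeros for the classes without neighbors and 0.3 and 0.7 in the positions of classes 5 and 6.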


----

## KNN for regression <a class="anchor" id="5"></a>

- We have $\hspace{0.1cm} p \hspace{0.1cm}$ variables $\hspace{0.1cm} X=(X_1,...,X_p) \hspace{0.1cm}$ measured on a sample of size $n$.

- We also have a **quantitative** response variable $\hspace{0.1cm} Y $ 


The regression problem consists of predicting, for a new observation $\hspace{0.1cm} x_{new} = (x_{new,1},x_{new,2},...,x_{new,p}) \hspace{0.1cm}$ of the variables $X_1,...,X_p  \hspace{0.1cm}$, its $\hspace{0.1cm} Y \hspace{0.05cm}$ value $\hspace{0.1cm} (y_{new})\hspace{0.1cm}$ using the information on $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$ and $ \hspace{0.1cm} Y$.

In other words, the problem is to obtain $\hspace{0.1cm} \hat{y}_{new} \hspace{0.1cm}$ using the available information on $\hspace{0.1cm} X_1,...,X_p \hspace{0.1cm}$, $Y$ and $\hspace{0.1cm} x_{new}$.

---

The KNN (K-nearest neighbors) algorithm for regression has the following steps:



 $1. \hspace{0.15cm}$ Define a distance measure between the observations of the original sample with respect to the variables $X_1,...,X_p$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\delta$



 $2. \hspace{0.15cm}$ Compute the distances between $x_{new}$ and the initial observations $\hspace{0.1cm} \lbrace x_1,...,x_n \rbrace$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$ $\lbrace \hspace{0.1 cm}  \delta(x_{new}, x_i) \hspace{0.1 cm} / \hspace{0.1 cm}  i=1,...,n \hspace{0.1 cm}  \rbrace$

  
 $3. \hspace{0.15cm}$ Select the  $k$ nearest observations to $x_{new}$ based on $\hspace{0.05cm} \delta \hspace{0.12cm}$ $(k$ nearest neighbors of $x_{new})$ $\hspace{0.15cm} \Rightarrow \hspace{0.15cm}$   The set of these observations will be denoted by $KNN$ 



 $4. \hspace{0.15cm}$ The method predicts $\hspace{0.1cm} y_{new} \hspace{0.1cm}$ as the mean of the response over the $k$ nearest neighbors:



$$
\widehat{y}_{new} = \dfrac{1}{\# \hspace{0.1cm} KNN} \cdot \sum_{i \in KNN} y_i = \dfrac{1}{k} \cdot \sum_{i \in KNN} y_i
$$
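
The regression version differs from classification only in the last step, where the responses of the neighbors are averaged. A minimal NumPy sketch, assuming the Euclidean distance; the function name `knn_regress` and the demo data are hypothetical.

```python
import numpy as np

def knn_regress(X, y, x_new, k):
    """Predict y_new as the mean response of the k nearest neighbors (Euclidean)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Distance from x_new to every observation of the sample
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))

    # Positions of the k nearest neighbors
    knn_idx = np.argsort(distances, kind="stable")[:k]

    # Mean of the response over the neighbors
    return y[knn_idx].mean()

# Hypothetical data with a single predictor
X_demo = [[1.0], [2.0], [3.0], [10.0]]
y_demo = [1.0, 2.0, 3.0, 10.0]

y_hat = knn_regress(X_demo, y_demo, [2.2], k=2)
print(y_hat)  # neighbors are 2.0 and 3.0, so the mean is 2.5
```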

### KNN for regression in `Python` with `sklearn` <a class="anchor" id="6"></a>

### KNN for regression in `Python` with own algorithm <a class="anchor" id="7"></a>

In [30]:
def KNN_regression( X , Y , x_new, k, distance = "Minkowski" , q = 2, p1=0, p2=0, p3=0 ):  # q must be > 0: q=0 would divide by zero in the exponent 1/q

####################################################################################################################################################################################################################################################

    # Y must be a quantitative variable

    # Example of Y :  

 
    # X must be a pandas data frame with the predictors (X1,...,Xp). Example of X :

    # x_new must be a pandas Series. Example   

####################################################################################################################################################################################################################################################

    X = pd.concat([X, x_new.to_frame().T], ignore_index=True)

    distances = []

    Y_values_knn = []


####################################################################################################################################################################################################################################################
    
    if distance == "Euclidean":

        def Dist_Euclidea_Python(i, j, Quantitative_Data_set): 

            Dist_Euclidea = ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] )**2 ).sum()

            Dist_Euclidea = np.sqrt(Dist_Euclidea)

            return Dist_Euclidea

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Euclidea_Python( len(X), i , X) )

    ###################################################################

    if distance == "Minkowski":

        def Dist_Minkowski_Python(i,j, q , Quantitative_Data_set):

            Dist_Minkowski = ( ( ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] ).abs() )**q ).sum() )**(1/q)

            return Dist_Minkowski

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Minkowski_Python( len(X), i , q , X) )

        

    ###################################################################

    if distance == "Canberra":

        def Dist_Canberra_Python(i,j, Quantitative_Data_set):

            numerator =  ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] ).abs()  

            denominator =  ( (Quantitative_Data_set.iloc[i-1, ]).abs() + (Quantitative_Data_set.iloc[j-1, ]).abs() )
       
            numerator=np.array([numerator], dtype=float)

            denominator=np.array([denominator], dtype=float)

            Dist_Canberra = ( np.divide( numerator , denominator , out=np.zeros_like(numerator), where=denominator!=0) ).sum()

            return Dist_Canberra

    ###################################################################
    
        for i in range(1, len(X)):

            distances.append( Dist_Canberra_Python( len(X), i , X) )

    ###################################################################
   
    if distance == "Pearson":

        def Dist_Pearson_Python(i, j, Quantitative_Data_set):

            Dist_Pearson = ( ( Quantitative_Data_set.iloc[i-1, ] - Quantitative_Data_set.iloc[j-1, ] )**2 / Quantitative_Data_set.var() ).sum()

            Dist_Pearson = np.sqrt(Dist_Pearson)

            return Dist_Pearson

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Pearson_Python( len(X), i , X) )

    ###################################################################
    
    if distance == "Mahalanobis":

        def Dist_Mahalanobis_Python(i, j, Quantitative_Data_set):

            x = (Quantitative_Data_set.to_numpy()[i-1, ] - Quantitative_Data_set.to_numpy()[j-1, ])

            x = np.array([x]) # necessary step to transpose a 1D array

            S_inv = np.linalg.inv( Quantitative_Data_set.cov() ) # inverse of covariance matrix

            Dist_Maha = np.sqrt( x @ S_inv @ x.T )  # x @ S_inv @ x.T = np.matmul( np.matmul(x , S_inv) , x.T )

            Dist_Maha = float(Dist_Maha[0, 0])  # extract the scalar from the 1x1 matrix

            return Dist_Maha

        
    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Mahalanobis_Python( len(X), i , X) )

       
    ###################################################################
    
    if distance == "Sokal":

        a = X @ X.T
        n = X.shape[0]
        p = X.shape[1]
        ones_matrix = np.ones((n, p))
        b = (ones_matrix - X) @ X.T
        c = b.T
        d = (ones_matrix - X) @ (ones_matrix - X).T

        def Sokal_Similarity_Py(i, j):

            Sokal_Similarity = (a.iloc[i-1,j-1] + d.iloc[i-1,j-1])/p

            return Sokal_Similarity


        def Dist_Sokal_Python(i, j, Binary_Data_set):

            dist_Sokal = np.sqrt( 2 - 2*Sokal_Similarity_Py(i,j) )  # Sokal_Similarity_Py takes only (i, j)

            return dist_Sokal

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Sokal_Python( len(X), i , X) )

    ###################################################################
   
    if distance == "Jaccard":

        def Jaccard_Similarity_Py(i, j, Binary_Data_Matrix):

            X = Binary_Data_Matrix
            a = X @ X.T
            n = X.shape[0]
            p = X.shape[1]
            ones_matrix = np.ones((n, p)) 
            b = (ones_matrix - X) @ X.T
            c = b.T
            d = (ones_matrix - X) @ (ones_matrix - X).T

            Jaccard_Similarity = a.iloc[i-1,j-1] / (a.iloc[i-1,j-1] + b.iloc[i-1,j-1] + c.iloc[i-1,j-1])
            
            return Jaccard_Similarity


        def Dist_Jaccard_Python(i, j, Binary_Data_set):

            dist_Jaccard = np.sqrt( Jaccard_Similarity_Py(i,i, Binary_Data_set) + Jaccard_Similarity_Py(j,j, Binary_Data_set) - 2*Jaccard_Similarity_Py(i,j, Binary_Data_set) )

            return dist_Jaccard

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Jaccard_Python( len(X), i , X) )

    ###################################################################
    
    if distance == "Matches":

        def alpha_py(i,j, Multiple_Categorical_Data):

            X = Multiple_Categorical_Data
            alpha = np.repeat(0, X.shape[1])

            for k in range(0, X.shape[1]) :

                if X.iloc[i-1, k] == X.iloc[j-1, k] :

                    alpha[k] = 1

                else :

                    alpha[k] = 0

            alpha = alpha.sum()

            return(alpha)


        def matches_similarity_py(i, j, Multiple_Categorical_Data):

            p = Multiple_Categorical_Data.shape[1]

            matches_similarity = alpha_py(i,j, Multiple_Categorical_Data) / p

            return(matches_similarity)


        def Dist_Matches_Py(i,j, Multiple_Categorical_Data):

            Dist_Matches = np.sqrt( matches_similarity_py(i, i, Multiple_Categorical_Data) +  matches_similarity_py(j, j, Multiple_Categorical_Data) - 2*matches_similarity_py(i, j, Multiple_Categorical_Data) )

            return( Dist_Matches )

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Matches_Py( len(X), i , X) )

    ###################################################################
   
    if distance == "Gower":

        def Gower_Similarity_Python(i,j, Mixed_Data_Set, p1, p2, p3):

            X = Mixed_Data_Set

            # The variables must be ordered as follows: 
            # the first p1 are quantitative, the next p2 are binary categorical, and the last p3 are multiclass categorical.

        #################################################################################
      
            def G(k, X):

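                value_range = X.iloc[:,k].max() - X.iloc[:,k].min()  # renamed to avoid shadowing the built-in range
                
                return(value_range)

            G_vector = np.zeros(p1)  # float array, so the ranges are not truncated to integers

            for r in range(0, p1):

                G_vector[r] = G(r, X)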
            
        ########################################################################################
    
            ones = np.repeat(1, p1)

            Quantitative_Data = X.iloc[: , 0:p1]
            Binary_Data = X.iloc[: , (p1):(p1+p2)]
            Multiple_Categorical_Data = X.iloc[: , (p1+p2):(p1+p2+p3) ]

            a = Binary_Data @ Binary_Data.T

            ones_matrix = np.ones(( Binary_Data.shape[0] , Binary_Data.shape[1])) 
   
            d = (ones_matrix - Binary_Data) @ (ones_matrix - Binary_Data).T

        #################################################################################

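            # i and j are 1-based row positions (as in the other distance functions), hence the i-1 and j-1 below

            numerator_part_1 = ( ones - ( (Quantitative_Data.iloc[i-1,:] - Quantitative_Data.iloc[j-1,:]).abs() / G_vector ) ).sum() 

            # number of matching multiclass categorical variables, computed inline because alpha_py only exists inside the "Matches" branch

            numerator_part_2 = a.iloc[i-1,j-1] + (Multiple_Categorical_Data.iloc[i-1,:] == Multiple_Categorical_Data.iloc[j-1,:]).sum()

            numerator = numerator_part_1 + numerator_part_2

            denominator = p1 + (p2 - d.iloc[i-1,j-1]) + p3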

            Similarity_Gower = numerator / denominator  

            return(Similarity_Gower)

        #################################################################################
        
        def Dist_Gower_Py(i, j, Mixed_Data , p1, p2, p3):

            Dist_Gower = np.sqrt( 1 - Gower_Similarity_Python(i, j, Mixed_Data , p1, p2, p3) )

            return(Dist_Gower)    

    ###################################################################

        for i in range(1, len(X)):

            distances.append( Dist_Gower_Py( len(X), i , X, p1, p2, p3) )

    ######################################################################################################################################
    
    distances = pd.DataFrame({'distances': distances})

    distances = distances.sort_values(by=["distances"]).reset_index(drop=False)
        
    knn = distances.iloc[0:k , :]

    for i in knn.iloc[:,0]:

        Y_values_knn.append(Y.iloc[i , :])

    
    from itertools import chain

    Y_values_knn = list(chain(*Y_values_knn)) # To unlist a list


    prediction = sum(Y_values_knn)/k

    message = f"KNN algorithm predict y_new = {prediction}"  # print() returns None, so build the string first

    print(message)
                                       

    return prediction , message  
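
The last step of the function above (sort the distances, keep the k smallest, average the corresponding responses) can also be written in vectorized form. The following is a minimal sketch, not the function used in this post: the name `knn_average` and the toy values are assumptions made for illustration only.

```python
import numpy as np

def knn_average(distances, y, k):
    """Average the responses of the k observations with the smallest distances."""
    distances = np.asarray(distances, dtype=float)
    y = np.asarray(y, dtype=float)
    # Positions of the k smallest distances; a stable sort breaks ties by original order.
    nearest = np.argsort(distances, kind="stable")[:k]
    return float(y[nearest].mean())

# Toy data (hypothetical): distances from x_new to five training rows and their responses.
toy_distances = [0.9, 0.1, 0.5, 0.3, 2.0]
toy_y = [10.0, 12.0, 14.0, 16.0, 100.0]

# The three nearest rows have responses 12, 16 and 14, so the prediction is their mean.
prediction_demo = knn_average(toy_distances, toy_y, k=3)
print(prediction_demo)
```

Only the averaging step is covered here; the distances themselves would still come from whichever distance function is selected.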

Testing our `KNN_regression` function:

TODO: use another data set that is better suited to regression.

In [62]:
url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Linear%20Regression%20in%20Python%20and%20R/properties_data.csv'

House_Price_Regression = pd.read_csv(url)

In [63]:
House_Price_Regression.head()

| | id | neighborhood | latitude | longitude | price | size_in_sqft | price_per_sqft | no_of_bedrooms | no_of_bathrooms | quality | ... | private_pool | security | shared_gym | shared_pool | shared_spa | study | vastu_compliant | view_of_landmark | view_of_water | walk_in_closet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5528049 | Palm Jumeirah | 25.113208 | 55.138932 | 2700000 | 1079 | 2502.32 | 1 | 2 | Medium | ... | False | False | True | False | False | False | False | False | True | False |
| 1 | 6008529 | Palm Jumeirah | 25.106809 | 55.151201 | 2850000 | 1582 | 1801.52 | 2 | 2 | Medium | ... | False | False | True | True | False | False | False | False | True | False |
| 2 | 6034542 | Jumeirah Lake Towers | 25.063302 | 55.137728 | 1150000 | 1951 | 589.44 | 3 | 5 | Medium | ... | False | True | True | True | False | False | False | True | True | True |
| 3 | 6326063 | Culture Village | 25.227295 | 55.341761 | 2850000 | 2020 | 1410.89 | 2 | 3 | Low | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 6356778 | Palm Jumeirah | 25.114275 | 55.139764 | 1729200 | 507 | 3410.65 | 0 | 1 | Medium | ... | False | True | True | True | True | False | False | True | True | False |


In [65]:
House_Price_Regression['size_in_m_2'] = 0.092903*House_Price_Regression['size_in_sqft']
House_Price_Regression['price_per_m_2'] = House_Price_Regression['price_per_sqft']/0.092903

In [66]:
House_Price_Regression.loc[: , ['size_in_m_2', 'no_of_bedrooms']]

| | size_in_m_2 | no_of_bedrooms |
|---|---|---|
| 0 | 100.242337 | 1 |
| 1 | 146.972546 | 2 |
| 2 | 181.253753 | 3 |
| 3 | 187.664060 | 2 |
| 4 | 47.101821 | 0 |
| ... | ... | ... |
| 1900 | 100.985561 | 2 |
| 1901 | 70.606280 | 1 |
| 1902 | 179.302790 | 3 |
| 1903 | 68.748220 | 1 |


In [31]:
Gender_classification = pd.read_csv('gender_classification.csv')

# Encode gender numerically (Male -> 0, Female -> 1) in one vectorized step, avoiding chained assignment
Gender_classification['gender'] = Gender_classification['gender'].map({'Male': 0, 'Female': 1})


X = Gender_classification.loc[ : , ['long_hair', 'forehead_height_cm', 'nose_wide', 'nose_long', 'lips_thin', 'distance_nose_to_lip_long', 'gender']]

# forehead_width_cm is the response here because it is quantitative (this is now a regression problem)

Y = Gender_classification.loc[: , ['forehead_width_cm']]  

x_new = pd.Series({'long_hair': 1, 'forehead_height_cm': 6, 'nose_wide': 1 , 'nose_long': 1 , 'lips_thin':1, 'distance_nose_to_lip_long': 1 , 'gender': 0})

In [32]:
prediction , message = KNN_regression( X , Y , x_new, k=10, distance = "Minkowski" , q = 2 )

KNN algorithm predict y_new = 13.11


In [33]:
prediction

13.11

In [34]:
prediction , message = KNN_regression( X , Y , x_new, k=10, distance = "Minkowski" , q = 1 )

KNN algorithm predict y_new = 12.95


In [35]:
prediction , message = KNN_regression( X , Y , x_new, k=10, distance = "Canberra" )

KNN algorithm predict y_new = 13.48


In [36]:
X['gender'] = X['gender'].astype(int) # the Mahalanobis distance function requires every column of X to be numeric (int or float)

In [37]:
prediction , message = KNN_regression( X , Y , x_new, k=10, distance = "Mahalanobis"  )

KNN algorithm predict y_new = 13.37


In [38]:
prediction , message = KNN_regression( X , Y , x_new, k=10, distance = "Euclidean"  )

KNN algorithm predict y_new = 13.11


In [39]:
prediction , message = KNN_regression( X , Y , x_new, k=10, distance = "Pearson"  )

KNN algorithm predict y_new = 13.820000000000002
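
The Minkowski call with `q = 2` and the Euclidean call return the same prediction (13.11) because the two distances coincide, while `q = 1` gives the Manhattan distance and can select different neighbors. A minimal check, assuming the usual definition of the Minkowski distance; the function name `minkowski` and the vectors `u` and `v` are illustrative and are not taken from `KNN_regression`:

```python
import numpy as np

def minkowski(u, v, q):
    """Minkowski distance of order q between two numeric vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float((np.abs(u - v) ** q).sum() ** (1.0 / q))

# Two hypothetical observations.
u = [1.0, 6.0, 1.0]
v = [0.0, 2.0, 4.0]

d_q2 = minkowski(u, v, q=2)                                   # order 2
d_q1 = minkowski(u, v, q=1)                                   # order 1 (Manhattan)
d_euclid = float(np.linalg.norm(np.asarray(u) - np.asarray(v)))  # Euclidean norm of the difference

print(d_q2, d_euclid, d_q1)
```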


## Selecting an optimal k with cross-validation <a class="anchor" id="8"></a>
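
One common way to choose k is K-fold cross-validation: for each candidate k, the data are split into folds, each fold is predicted from the remaining ones, and the k with the lowest average validation error is kept. The sketch below illustrates the idea under stated assumptions: synthetic data, Euclidean distance, and mean squared error as the criterion. The names `knn_predict`, `cv_mse`, `X_demo` and `y_demo` are hypothetical and are not part of the functions defined earlier.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """KNN regression with Euclidean distance: mean response of the k nearest training rows."""
    diffs = X_test[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))          # shape (n_test, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]          # indices of the k nearest rows
    return y_train[nearest].mean(axis=1)

def cv_mse(X, y, k, n_folds=5, seed=0):
    """Average validation MSE of KNN regression over n_folds folds."""
    rng = np.random.default_rng(seed)
    # The same seed gives the same folds for every k, so the comparison is fair.
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(X)), fold)
        pred = knn_predict(X[train], y[train], X[fold], k)
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data (assumption): two predictors and a noisy nonlinear response.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=200)

candidate_k = list(range(1, 31))
scores = {k: cv_mse(X_demo, y_demo, k) for k in candidate_k}
best_k = min(scores, key=scores.get)

print("best k:", best_k, "CV MSE:", round(scores[best_k], 4))
```

For classification the same loop applies, with the misclassification rate replacing the mean squared error and a majority vote replacing the mean.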

## Bibliography

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Course notes by Aurea

Course notes by Nogales
