# k-NN with Hyperparameters

### k-NN (k-Nearest Neighbours)


- The k nearest cases to the new example are examined.

- Either the average distance per class is computed, or the example is assigned to the class with the most elements among those k neighbours (majority vote).

- The value of k is often set heuristically as $k=\sqrt{n}$, where $n$ is the number of examples. (This rule of thumb has some theoretical motivation.)
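The bullets above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `heuristic_k` and `knn_predict` and the toy points are hypothetical, Euclidean distance is assumed, and rounding k up to an odd number is a common convention for binary problems rather than something the notebook prescribes.

```python
import math
from collections import Counter


def heuristic_k(n):
    """Rule-of-thumb k: sqrt(n) rounded, then bumped to an odd number to avoid tied votes."""
    k = max(1, round(math.sqrt(n)))
    return k if k % 2 == 1 else k + 1


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

print(heuristic_k(1138))  # the dataset used in this notebook has 1138 rows
print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 0.95), k=3))
```

For the 1138 rows used here, $\sqrt{1138} \approx 33.7$, so the heuristic lands on 35, which is inside the range of k values explored later with the grid search.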

## 1. Libraries and initial setup


In [None]:
# Data handling
# ==============================================================================
import pandas as pd
import numpy as np


# Cache results and persist objects (scalers, models) to disk
# ==============================================================================
import joblib


# Mathematics and statistics
# ==============================================================================
import math


# Preprocessing and modelling
# ==============================================================================

# Split the data into training and test sets
from sklearn.model_selection import train_test_split


# Scale variables
from sklearn.preprocessing import MinMaxScaler


# Model evaluation
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve


# Model creation
from sklearn.neighbors import KNeighborsClassifier


# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV


# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns


# Warning configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

In [None]:
## Load data with Colab
## =============================================================================

from google.colab import drive
import sys

# Base path in Google Drive
PATH = '/gdrive/MyDrive/01_Academia/02_Cursos/20251001_AprendizajeAutomatico_UdeA/'

UTILS_PATH = PATH + 'utils/'
DATASET_PATH = PATH + 'datasets/'
MODELOS_PATH = PATH + 'modelos/'


# Mount Google Drive
drive.mount('/gdrive')

# Add utils to sys.path
sys.path.append(UTILS_PATH)

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


## 2. Functions

In [None]:
# External functions
# ==============================================================================
from funciones import multiple_plot, plot_roc_curve

## 3. Loading the dataset

In [None]:
# Create a DataFrame d from the input file
d = pd.read_csv(DATASET_PATH + '02_GermanCredit_Prep.csv')

In [None]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1138 entries, 0 to 1137
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   checking_account_status   1138 non-null   object
 1   loan_duration_mo          1138 non-null   int64 
 2   credit_history            1138 non-null   object
 3   purpose                   1138 non-null   object
 4   loan_amount               1138 non-null   int64 
 5   savings_account_balance   1138 non-null   object
 6   time_employed_yrs         1138 non-null   object
 7   payment_pcnt_income       1138 non-null   int64 
 8   gender_status             1138 non-null   object
 9   other_signators           1138 non-null   object
 10  time_in_residence         1138 non-null   int64 
 11  property                  1138 non-null   object
 12  age_yrs                   1138 non-null   int64 
 13  other_credit_outstanding  1138 non-null   object
 14  home_ownership          

## 4. Data visualization

### Input variables

In [None]:
# List of categorical variables
catCols = d.select_dtypes(include = ["object", 'category']).columns.tolist()

d[catCols].head(2)

|   | checking_account_status | credit_history | purpose | savings_account_balance | time_employed_yrs | gender_status | other_signators | property | other_credit_outstanding | home_ownership | job_category | telephone | foreign_worker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | < 0 DM | critical account - other non-bank loans | car | < 100 DM | 1 - 4 years | female-divorced/separated/married | co-applicant | real estate | none | own | skilled | none | yes |
| 1 | < 0 DM | current loans paid | car | < 100 DM | 1 - 4 years | male-married/widowed | none | real estate | none | own | unskilled-resident | none | yes |


In [None]:
# List of numeric variables

numCols = d.select_dtypes(include = ['float32','float64','int32','int64']).columns.tolist()

d[numCols].head(2)

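|   | loan_duration_mo | loan_amount | payment_pcnt_income | time_in_residence | age_yrs | number_loans | dependents | bad_credit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12 | 3499 | 3 | 2 | 29 | 2 | 1 | 1 |
| 1 | 12 | 1168 | 4 | 3 | 27 | 1 | 1 | 0 |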


In [None]:
## Frequency plots for the categorical variables
#multiple_plot(3, d , catCols, None, 'countplot', 'Frequency of instances for categorical variables',30)

In [None]:
## Plots of the numeric variables
#multiple_plot(1, d , numCols, None, 'scatterplot', 'Relationship between the numeric variables',30)

In [None]:
# Remove the output variable from the list of numeric variables
numCols.remove('bad_credit')

### Output variable

In [None]:
# Distribution of the output variable

d.groupby('bad_credit').bad_credit.count().sort_values(ascending=False)

| bad_credit | count |
| --- | --- |
| 0 | 569 |
| 1 | 569 |


In [None]:
## Plot of the output variable
#multiple_plot(1, d , None, 'bad_credit', 'countplot', 'Frequency plot of bad_credit',0)

## 5. Data transformation

### Creating dummy variables

In [None]:
# One-hot encoding with pandas get_dummies; drop_first=True drops one level per variable

d = pd.get_dummies(d, drop_first=True)

d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1138 entries, 0 to 1137
Data columns (total 46 columns):
 #   Column                                                  Non-Null Count  Dtype
---  ------                                                  --------------  -----
 0   loan_duration_mo                                        1138 non-null   int64
 1   loan_amount                                             1138 non-null   int64
 2   payment_pcnt_income                                     1138 non-null   int64
 3   time_in_residence                                       1138 non-null   int64
 4   age_yrs                                                 1138 non-null   int64
 5   number_loans                                            1138 non-null   int64
 6   dependents                                              1138 non-null   int64
 7   bad_credit                                              1138 non-null   int64
 8   checking_account_status_< 0 DM                          11
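To make the effect of `drop_first` concrete, here is a minimal sketch on a toy frame. The values are hypothetical and only the column names echo the dataset; the point is that the first category in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric column
toy = pd.DataFrame({
    "purpose": ["car", "radio/television", "car", "furniture"],
    "age_yrs": [29, 27, 40, 35],
})

# drop_first=True drops the first category in sorted order ("car"),
# so a row with all purpose_* columns equal to 0 means purpose == "car"
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Numeric columns are left untouched and the new indicator columns are appended after them, which matches the column order shown in the `d.info()` output above.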

## 6. Building the model

### Selecting the dataset

In [None]:
# Define the input variables 'X' and the output variable 'y'

X = d.drop(columns ='bad_credit')
y = d['bad_credit']

# Cross-validation is performed on the full dataset
# (copy X so that scaling X_Completo later does not modify X as well)
X_Completo = X.copy()
y_Completo = y

### Scaling variables

In [None]:
# Define the numeric variables to scale

# num_vars holds the list of numeric variables that will be scaled below
num_vars = numCols

print(num_vars)

['loan_duration_mo', 'loan_amount', 'payment_pcnt_income', 'time_in_residence', 'age_yrs', 'number_loans', 'dependents']


In [None]:
# Scale the numeric variables

pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Scale the numeric columns of the full dataset to the [0, 1] range
X_Completo[num_vars] = scaler.fit_transform(X_Completo[num_vars])


X_Completo[num_vars].head(2)

|   | loan_duration_mo | loan_amount | payment_pcnt_income | time_in_residence | age_yrs | number_loans | dependents |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.1176 | 0.2356 | 0.6667 | 0.3333 | 0.1667 | 0.3333 | 0.0 |
| 1 | 0.1176 | 0.0619 | 1.0 | 0.6667 | 0.1296 | 0.0 | 0.0 |
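One caveat worth keeping in mind: the scaler here is fitted on the full dataset before cross-validation, so every validation fold contributes to the minimum and maximum used for scaling. A possible alternative, sketched below on synthetic data, is to put the scaler inside a `Pipeline` so that it is refitted on the training part of each fold. The frame `X_demo`, its columns, and the small grid are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the credit data (hypothetical columns and values)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "loan_amount": rng.integers(250, 18000, size=200),
    "age_yrs": rng.integers(19, 75, size=200),
    "is_foreign": rng.integers(0, 2, size=200),  # already 0/1, left unscaled
})
y_demo = (X_demo["loan_amount"] > 8000).astype(int)

# Scale only the numeric columns; pass the dummy column through unchanged
preprocess = ColumnTransformer(
    transformers=[("num", MinMaxScaler(), ["loan_amount", "age_yrs"])],
    remainder="passthrough",
)

pipe = Pipeline(steps=[("prep", preprocess), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
grid_demo = GridSearchCV(
    estimator=pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7], "knn__metric": ["euclidean", "manhattan"]},
    cv=5,
    scoring="f1",
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
```

With this layout the fitted pipeline carries its own scaler, so a single saved object would be enough for later predictions.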


In [None]:
# Save the scaler
joblib.dump(scaler, MODELOS_PATH + '/scaler/minmaxFull_GermanCredits.pkl')

['/gdrive/MyDrive/01_Academia/02_Cursos/20251001_AprendizajeAutomatico_UdeA/modelos//scaler/minmaxFull_GermanCredits.pkl']

### Creating the model

#### Creating and training the model

In [None]:
np.random.seed(4)


# Model definition
modelKNN = KNeighborsClassifier()

# Numbers of neighbours to evaluate
k=[21, 25, 31, 35, 37]

# Number of cross-validation folds
CV = 10

# Scoring metric used to evaluate the model
scoring = 'f1' # Other possible values: accuracy, precision, recall, f1, roc_auc, balanced_accuracy

# Definition of the hyperparameter grid
parameters = {'n_neighbors':k, 'metric':['euclidean','manhattan','chebyshev']}


# Create the GridSearchCV object with the hyperparameter grid
grid_knn = GridSearchCV(estimator=modelKNN
                    , param_grid = parameters
                    , cv=CV
                    , scoring=scoring
                    , return_train_score=True
                    , verbose=4)


grid_knn.fit(X_Completo, y_Completo)

Fitting 10 folds for each of 15 candidates, totalling 150 fits
[CV 1/10] END metric=euclidean, n_neighbors=21;, score=(train=0.717, test=0.698) total time=   0.1s
[CV 2/10] END metric=euclidean, n_neighbors=21;, score=(train=0.729, test=0.703) total time=   0.0s
[CV 3/10] END metric=euclidean, n_neighbors=21;, score=(train=0.720, test=0.726) total time=   0.0s
[CV 4/10] END metric=euclidean, n_neighbors=21;, score=(train=0.726, test=0.730) total time=   0.0s
[CV 5/10] END metric=euclidean, n_neighbors=21;, score=(train=0.718, test=0.708) total time=   0.0s
[CV 6/10] END metric=euclidean, n_neighbors=21;, score=(train=0.726, test=0.631) total time=   0.0s
[CV 7/10] END metric=euclidean, n_neighbors=21;, score=(train=0.726, test=0.694) total time=   0.0s
[CV 8/10] END metric=euclidean, n_neighbors=21;, score=(train=0.729, test=0.739) total time=   0.0s
[CV 9/10] END metric=euclidean, n_neighbors=21;, score=(train=0.722, test=0.655) total time=   0.0s
[CV 10/10] END metric=euclidean, n_ne
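The three metrics in the grid differ only in how they combine the per-coordinate differences between two points. The following sketch computes them by hand for a pair of 2-D points; the function names are illustrative, not scikit-learn's API.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


a = (0.0, 0.0)
b = (3.0, 4.0)

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(chebyshev(a, b))  # 4.0
```

Because the dataset mixes scaled numeric columns with many 0/1 dummy columns, the choice of metric changes how much a mismatch in a single dummy variable weighs against small differences spread over several columns.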

### Model evaluation

In [None]:
#grid_knn.cv_results_

In [None]:
# Results
resultados = pd.DataFrame(grid_knn.cv_results_)
resultados.filter(regex = '(param.*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(4)

|   | param_metric | param_n_neighbors | mean_test_score | std_test_score | mean_train_score | std_train_score |
| --- | --- | --- | --- | --- | --- | --- |
| 7 | manhattan | 31 | 0.7073 | 0.0358 | 0.718 | 0.0065 |
| 5 | manhattan | 21 | 0.6991 | 0.0325 | 0.7274 | 0.0083 |
| 4 | euclidean | 37 | 0.6969 | 0.0287 | 0.7174 | 0.0057 |
| 0 | euclidean | 21 | 0.6962 | 0.0323 | 0.724 | 0.0042 |


In [None]:
#grid_knn.cv_results_
#grid_knn.best_score_


In [None]:
# Get the grid search results for grid_knn
results_grid_knn = pd.DataFrame(grid_knn.cv_results_)

# Select the columns of interest
columns_grid_knn = ['param_metric', 'param_n_neighbors']  + \
               ['mean_test_score', 'std_test_score']  + \
               [f'split{i}_test_score' for i in range(CV)]

# Filter and display the results
results_grid_knn_filtered = results_grid_knn[columns_grid_knn]

results_grid_knn_filtered.sort_values(by='mean_test_score', ascending=False).head(10)

|   | param_metric | param_n_neighbors | mean_test_score | std_test_score | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | manhattan | 31 | 0.7073 | 0.0358 | 0.7023 | 0.7679 | 0.7257 | 0.6607 | 0.7193 | 0.6422 | 0.7241 | 0.7458 | 0.6964 | 0.6891 |
| 5 | manhattan | 21 | 0.6991 | 0.0325 | 0.6875 | 0.7027 | 0.7143 | 0.713 | 0.7521 | 0.6609 | 0.6891 | 0.7227 | 0.6306 | 0.7179 |
| 4 | euclidean | 37 | 0.6969 | 0.0287 | 0.6929 | 0.7273 | 0.735 | 0.6726 | 0.6949 | 0.6429 | 0.7119 | 0.7227 | 0.708 | 0.661 |
| 0 | euclidean | 21 | 0.6962 | 0.0323 | 0.6984 | 0.7027 | 0.7257 | 0.7304 | 0.708 | 0.6306 | 0.6942 | 0.7395 | 0.6549 | 0.678 |
| 3 | euclidean | 35 | 0.6954 | 0.0276 | 0.7132 | 0.7321 | 0.713 | 0.6607 | 0.7009 | 0.6429 | 0.6949 | 0.7107 | 0.7193 | 0.6667 |
| 2 | euclidean | 31 | 0.6949 | 0.0344 | 0.7031 | 0.7434 | 0.7257 | 0.6606 | 0.7119 | 0.6364 | 0.6667 | 0.7395 | 0.6957 | 0.6667 |
| 1 | euclidean | 25 | 0.6935 | 0.0378 | 0.7143 | 0.708 | 0.7257 | 0.6957 | 0.7179 | 0.625 | 0.6504 | 0.7581 | 0.6786 | 0.661 |
| 9 | manhattan | 37 | 0.692 | 0.0393 | 0.6875 | 0.7568 | 0.6964 | 0.6486 | 0.6726 | 0.6364 | 0.7288 | 0.6897 | 0.7478 | 0.6552 |
| 8 | manhattan | 35 | 0.6885 | 0.0407 | 0.687 | 0.7748 | 0.6667 | 0.6372 | 0.6957 | 0.6306 | 0.7059 | 0.7009 | 0.7257 | 0.6609 |
| 6 | manhattan | 25 | 0.687 | 0.0293 | 0.6822 | 0.7037 | 0.7273 | 0.6842 | 0.7179 | 0.6364 | 0.6897 | 0.7193 | 0.6549 | 0.6549 |


In [None]:
# grid_knn results
print("grid_knn results:")
print("Best validation score (", scoring, "):"  ,grid_knn.best_score_)
print("Best hyperparameter set:", grid_knn.best_params_)

grid_knn results:
Best validation score ( f1 ): 0.7073430146777449
Best hyperparameter set: {'metric': 'manhattan', 'n_neighbors': 31}
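The imports at the top of the notebook include `classification_report` and `confusion_matrix`, which can complement the single F1 figure. One way to use them without a separate test set is to collect out-of-fold predictions with `cross_val_predict`. The sketch below runs on synthetic data from `make_classification`, standing in for `X_Completo` and `y_Completo` (an assumption), with the hyperparameters reported as best.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for X_Completo / y_Completo (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=4)

# Same hyperparameters as the best candidate reported by the grid search
knn_demo = KNeighborsClassifier(n_neighbors=31, metric="manhattan")

# Each prediction comes from a model that did not see that row during fitting
y_pred = cross_val_predict(knn_demo, X_demo, y_demo, cv=10)

cm = confusion_matrix(y_demo, y_pred)
print(cm)
print(classification_report(y_demo, y_pred))
```

The confusion matrix shows how the errors split between false positives and false negatives, which the averaged F1 score alone does not reveal.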


### Creating the final model

In [None]:
# Fit the model on the full dataset with the best hyperparameters
# (grid_knn.best_estimator_ holds an equivalent model, since refit=True by default)
modelKNN.set_params(**grid_knn.best_params_)
modelKNN.fit(X_Completo, y_Completo)

### Saving the model

In [None]:
# Save the k-NN model
joblib.dump(modelKNN, MODELOS_PATH + '/clasificacion/KNN_CV_manhattan.pkl')

['/gdrive/MyDrive/01_Academia/02_Cursos/20251001_AprendizajeAutomatico_UdeA/modelos//clasificacion/KNN_CV_manhattan.pkl']
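Saving the scaler alongside the model matters because new cases have to be scaled with the same fitted scaler before prediction. A minimal round-trip sketch follows; the tiny training set, the file names, and the temporary directory are hypothetical, and the real notebook would additionally need the same dummy columns in the same order.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Tiny synthetic training set (hypothetical values): two numeric features, binary target
X_train = np.array(
    [[12, 3499], [12, 1168], [48, 9000], [36, 7500], [6, 800], [60, 12000]],
    dtype=float,
)
y_train = np.array([0, 0, 1, 1, 0, 1])

scaler_demo = MinMaxScaler().fit(X_train)
model_demo = KNeighborsClassifier(n_neighbors=3, metric="manhattan").fit(
    scaler_demo.transform(X_train), y_train
)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_file = os.path.join(tmp_dir, "scaler.pkl")
    model_file = os.path.join(tmp_dir, "knn.pkl")
    joblib.dump(scaler_demo, scaler_file)
    joblib.dump(model_demo, model_file)

    # Later session: restore both artifacts from disk
    loaded_scaler = joblib.load(scaler_file)
    loaded_model = joblib.load(model_file)

# New cases must go through the SAME fitted scaler before prediction
X_new = np.array([[10, 1000], [54, 11000]], dtype=float)
predictions = loaded_model.predict(loaded_scaler.transform(X_new))
print(predictions)
```

Using `os.path.join` to build the file paths also avoids the doubled slash visible in the saved paths above.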
