# K-Nearest Neighbors (KNN) Classifier
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used mainly for classification. It finds the k closest points ("neighbors") to a given input and predicts the majority class among them or, for regression, the average of their values.
Here k is simply the number of nearby points the algorithm consults when it makes a decision.
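
The voting rule described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how scikit-learn implements it: the function name `knn_predict` and the toy points are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Distance from the query to every stored training point (Euclidean assumed).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # A
print(knn_predict(points, labels, (5.0, 5.2), k=3))  # B
```

An odd k is a common choice for two-class problems because it avoids tied votes; this sketch does not handle ties explicitly.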
### Distance Metrics Used in KNN Algorithm
KNN uses a distance metric to identify the nearest neighbors, which are then used for the classification or regression task. The most common metrics are:

1. Euclidean Distance: the straight-line distance between two points.
2. Manhattan Distance: the total distance you would travel if you could only move along horizontal and vertical lines, as on a city grid.
3. Minkowski Distance: a family of distances controlled by a parameter p; it reduces to Manhattan distance when p = 1 and to Euclidean distance when p = 2.
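
As a rough illustration of how the three metrics relate, the sketch below computes the Minkowski distance for a given p and derives the other two from it. The helper names are hypothetical and the points are chosen only so the arithmetic is easy to check by hand.

```python
def minkowski(a, b, p):
    """Minkowski distance: (sum over i of |a_i - b_i|^p) ^ (1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    # Special case p = 1: sum of absolute coordinate differences.
    return minkowski(a, b, 1)

def euclidean(a, b):
    # Special case p = 2: straight-line distance.
    return minkowski(a, b, 2)

a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b))  # 7.0  (3 + 4)
print(euclidean(a, b))  # 5.0  (sqrt(9 + 16))
```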

In [22]:
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV

# 1. Load the dataset
dataset = fetch_ucirepo(id=544)

X = dataset.data.features
y = dataset.data.targets

# Check the first rows
print(X.head())
print(y.head())

# 2. Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")


   Gender   Age  Height  Weight family_history_with_overweight FAVC  FCVC  \
0  Female  21.0    1.62    64.0                            yes   no   2.0   
1  Female  21.0    1.52    56.0                            yes   no   3.0   
2    Male  23.0    1.80    77.0                            yes   no   2.0   
3    Male  27.0    1.80    87.0                             no   no   3.0   
4    Male  22.0    1.78    89.8                             no   no   2.0   

   NCP       CAEC SMOKE  CH2O  SCC  FAF  TUE        CALC  \
0  3.0  Sometimes    no   2.0   no  0.0  1.0          no   
1  3.0  Sometimes   yes   3.0  yes  3.0  0.0   Sometimes   
2  3.0  Sometimes    no   2.0   no  2.0  1.0  Frequently   
3  3.0  Sometimes    no   2.0   no  2.0  0.0  Frequently   
4  1.0  Sometimes    no   2.0   no  0.0  0.0   Sometimes   

                  MTRANS  
0  Public_Transportation  
1  Public_Transportation  
2  Public_Transportation  
3                Walking  
4  Public_Transportation  

### Advantages of KNN
1. Simple to use: easy to understand and implement.
2. No training step: the model just stores the data and defers all computation to prediction time.
3. Few parameters: only the number of neighbors (k) and a distance metric need to be set.
4. Versatile: works for both classification and regression problems.
### Disadvantages of KNN
1. Slow with large data: a brute-force search compares the query against every stored point at prediction time.
2. Struggles with many features: accuracy drops as the number of features grows, because distances become less informative in high dimensions.
3. Can overfit: especially with a small k, or when the data is high-dimensional or noisy.
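
One way to see why many features hurt is to measure how much farther the farthest point is than the nearest one as the dimension grows. The sketch below is an illustration on uniformly random synthetic points, not on the notebook's dataset; the function name `relative_contrast` and its defaults are assumptions made for this example.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as dimensions are added, so "nearest" means less and less.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

When the contrast is close to zero, all points are almost equally far from the query, and the k nearest neighbors are barely more similar to it than any other point.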

In [23]:

# 3. Create the transformers
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# 4. Build the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

# 5. Build the full pipeline with KNN
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])

# 6. Split the data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y.values.ravel(), test_size=0.2, random_state=42)

# 7. Train the model
clf.fit(X_train, y_train)

# 8. Evaluate the model
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8203309692671394
Confusion Matrix:
 [[53  2  0  0  0  1  0]
 [15 19  8  2  0 10  8]
 [ 0  0 74  2  0  0  2]
 [ 0  0  2 56  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 2  5  0  0  0 46  3]
 [ 0  1  6  3  1  3 36]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.76      0.95      0.84        56
      Normal_Weight       0.70      0.31      0.43        62
     Obesity_Type_I       0.82      0.95      0.88        78
    Obesity_Type_II       0.89      0.97      0.93        58
   Obesity_Type_III       0.98      1.00      0.99        63
 Overweight_Level_I       0.77      0.82      0.79        56
Overweight_Level_II       0.73      0.72      0.73        50

           accuracy                           0.82       423
          macro avg       0.81      0.82      0.80       423
       weighted avg       0.81      0.82      0.80       423
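
The figures in the report can be recomputed by hand from the confusion matrix printed above, which helps explain the weak `Normal_Weight` row. The sketch below copies that matrix and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the helper name `per_class_metrics` is made up for this example.

```python
class_names = [
    "Insufficient_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_II",
    "Obesity_Type_III", "Overweight_Level_I", "Overweight_Level_II",
]
# Confusion matrix from the baseline model (rows: true class, columns: predicted).
cm = [
    [53, 2, 0, 0, 0, 1, 0],
    [15, 19, 8, 2, 0, 10, 8],
    [0, 0, 74, 2, 0, 0, 2],
    [0, 0, 2, 56, 0, 0, 0],
    [0, 0, 0, 0, 63, 0, 0],
    [2, 5, 0, 0, 0, 46, 3],
    [0, 1, 6, 3, 1, 3, 36],
]

def per_class_metrics(matrix, names):
    """Precision and recall per class from a square confusion matrix."""
    n = len(matrix)
    result = {}
    for i, name in enumerate(names):
        true_count = sum(matrix[i])                       # row sum: actual members
        pred_count = sum(matrix[r][i] for r in range(n))  # column sum: predicted members
        result[name] = {
            "precision": matrix[i][i] / pred_count,
            "recall": matrix[i][i] / true_count,
        }
    return result

metrics = per_class_metrics(cm, class_names)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

print(f"accuracy = {accuracy:.4f}")  # 0.8203 (347 / 423)
print(f"Normal_Weight recall = {metrics['Normal_Weight']['recall']:.2f}")  # 0.31 (19 / 62)
print(f"Normal_Weight precision = {metrics['Normal_Weight']['precision']:.2f}")  # 0.70 (19 / 27)
```

Only 19 of the 62 true `Normal_Weight` samples are classified correctly; most of the rest are predicted as `Insufficient_Weight` (15) or `Overweight_Level_I` (10).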



In [21]:
# Define the hyperparameter grid for KNN
param_grid = {
    'classifier__n_neighbors': [3, 5, 7, 9, 11],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__metric': ['euclidean', 'manhattan']
}

# GridSearchCV with F1 score as the metric (micro, macro, or weighted, depending on your data)
grid_search = GridSearchCV(
    clf,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='f1_weighted',  # use 'f1_macro' or 'f1_micro' if you prefer
    n_jobs=-1,  # use all available cores
    verbose=2
)

# Run the grid search
grid_search.fit(X_train, y_train)

# Results for the best model
print("Best hyperparameters:", grid_search.best_params_)
print("Best F1 score (validation):", grid_search.best_score_)
# Evaluate on the test set
y_pred = grid_search.predict(X_test)
print("\nF1 score on test:", f1_score(y_test, y_pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best hyperparameters: {'classifier__metric': 'manhattan', 'classifier__n_neighbors': 3, 'classifier__weights': 'distance'}
Best F1 score (validation): 0.8785808148883518

F1 score on test: 0.8717558146191585
Confusion Matrix:
 [[54  1  0  0  0  1  0]
 [ 9 34  5  0  0 10  4]
 [ 0  0 73  2  0  1  2]
 [ 0  0  1 57  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 0  5  0  0  0 47  4]
 [ 0  1  2  0  1  3 43]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.86      0.96      0.91        56
      Normal_Weight       0.83      0.55      0.66        62
     Obesity_Type_I       0.90      0.94      0.92        78
    Obesity_Type_II       0.97      0.98      0.97        58
   Obesity_Type_III       0.98      1.00      0.99        63
 Overweight_Level_I       0.76      0.84      0.80        56
Overweight_Level_II       0.81      0.86      0.83        50

           accuracy 

 0.78379578 0.80917206 0.76757613 0.79995529        nan 0.87858081
        nan 0.87288615        nan 0.86799686        nan 0.8594353
        nan 0.85224345]
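
The "20 candidates, totalling 100 fits" line in the grid search log follows directly from the size of the grid. The sketch below enumerates the same combinations with `itertools.product`. It only illustrates the counting and is not a description of how `GridSearchCV` orders or schedules its candidates internally.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {
    "classifier__n_neighbors": [3, 5, 7, 9, 11],
    "classifier__weights": ["uniform", "distance"],
    "classifier__metric": ["euclidean", "manhattan"],
}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_folds = 5  # cv=5 in the search above
print(len(candidates))            # 20  (5 * 2 * 2)
print(len(candidates) * n_folds)  # 100 (each candidate is fitted once per fold)
```

Because the cost grows multiplicatively with each added parameter, it is worth keeping the value lists short.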
