## Distance-Based and Bayesian Classifiers <a class="anchor" id="1"></a>

This notebook applies three classification models to a dataset:

a. 1NN
b. KNN with K = 3, 5, 7, and 9
c. Naive Bayes

Each model is evaluated with three validation methods:

a. Hold-Out 70/30
b. 10-Fold Cross-Validation
c. Leave-One-Out

The dataset is the glass classification dataset, `glass.xls`. We start with a quick preprocessing pass so the data is ready to use afterwards.
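
As a rough illustration of what a distance-based classifier does, the following sketch implements k-NN in plain Python: it ranks training points by Euclidean distance to a query and takes a majority vote among the `k` closest. The helper `knn_predict` and the toy points are made up for this example; the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up 2-D points (not taken from glass.xls).
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (0.4, 0.6)]
train_y = [1, 1, 2, 2, 2]

pred_k1 = knn_predict(train_X, train_y, (0.2, 0.1), k=1)
pred_k5 = knn_predict(train_X, train_y, (0.2, 0.1), k=5)
print(pred_k1, pred_k5)
```

With `k=1` the single closest point decides the label; with `k=5` every toy point votes, so the larger class wins. This is why the choice of K can change the prediction for the same query.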

## 1. Import libraries <a class="anchor" id="2"></a>

In [40]:
# Env: Python 3.12
import os  # for file and directory manipulation
import matplotlib
matplotlib.use('qt5agg')  # or 'tkagg' if this backend fails
import matplotlib.pyplot as plt  # for data visualization
import numpy as np  # for linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px  # for interactive plots
import seaborn as sns  # for statistical data visualization
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

In [41]:
import warnings

warnings.filterwarnings('ignore')

## 2. Import dataset <a class="anchor" id="3"></a>

The dataset is a glass classification dataset.

In [42]:
# The file is read as comma-separated text even though it carries an .xls extension
data = './glass.xls'

df = pd.read_csv(data)
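
Since `pd.read_csv` succeeds on a file named `glass.xls`, the file presumably holds plain comma-separated text. The sketch below shows one way to check a file's real format from its first bytes rather than its extension. The helper `sniff_table_format` and the temporary `demo.xls` file are hypothetical and only serve the demonstration.

```python
import tempfile
from pathlib import Path

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # header of legacy binary .xls files
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx files are ZIP archives


def sniff_table_format(path):
    """Guess the on-disk format from the leading bytes instead of the extension."""
    head = Path(path).read_bytes()[:8]
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "text"  # e.g. CSV, which pd.read_csv can parse


# Demo on a throwaway file: CSV content saved under a misleading .xls name.
with tempfile.TemporaryDirectory() as tmp:
    fake_xls = Path(tmp) / "demo.xls"
    fake_xls.write_text("RI,Na,Type\n1.52101,13.64,1\n")
    detected = sniff_table_format(fake_xls)

print(detected)
```

A result of `"text"` suggests `pd.read_csv` is the right reader; a true Excel file would need `pd.read_excel` instead.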

### View the dimensions of the dataset <a class="anchor" id="4.1"></a>

In [43]:
df_shape = df.shape
print(f"DataFrame registries: {df_shape[0]}")
print(f"DataFrame variables: {df_shape[1]}")

DataFrame registries: 214
DataFrame variables: 10


### Preview the dataset <a class="anchor" id="4.2"></a>

In [44]:
df.head()

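|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |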


### View dataset summary <a class="anchor" id="4.5"></a>

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RI      214 non-null    float64
 1   Na      214 non-null    float64
 2   Mg      214 non-null    float64
 3   Al      214 non-null    float64
 4   Si      214 non-null    float64
 5   K       214 non-null    float64
 6   Ca      214 non-null    float64
 7   Ba      214 non-null    float64
 8   Fe      214 non-null    float64
 9   Type    214 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


### View statistical properties of the dataset <a class="anchor" id="4.6"></a>

In [46]:
df.describe()

|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 |
| mean | 1.518365 | 13.40785 | 2.684533 | 1.444907 | 72.650935 | 0.497056 | 8.956963 | 0.175047 | 0.057009 | 2.780374 |
| std | 0.003037 | 0.816604 | 1.442408 | 0.49927 | 0.774546 | 0.652192 | 1.423153 | 0.497219 | 0.097439 | 2.103739 |
| min | 1.51115 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 | 1.0 |
| 25% | 1.516522 | 12.9075 | 2.115 | 1.19 | 72.28 | 0.1225 | 8.24 | 0.0 | 0.0 | 1.0 |
| 50% | 1.51768 | 13.3 | 3.48 | 1.36 | 72.79 | 0.555 | 8.6 | 0.0 | 0.0 | 2.0 |
| 75% | 1.519157 | 13.825 | 3.6 | 1.63 | 73.0875 | 0.61 | 9.1725 | 0.0 | 0.1 | 3.0 |
| max | 1.53393 | 17.38 | 4.49 | 3.5 | 75.41 | 6.21 | 16.19 | 3.15 | 0.51 | 7.0 |


#### Check for missing values.

In [47]:
df['Type'].isna().sum()


0

#### Check unique values.

In [48]:
print(df['Type'].nunique())
df['Type'].unique()

6


array([1, 2, 3, 5, 6, 7], dtype=int64)

#### Visualize the frequency distribution of the `Type` variable

In [49]:
df['Type'].value_counts(normalize=False, dropna=False)

Type
2    76
1    70
7    29
3    17
5    13
6     9
Name: count, dtype: int64
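
The class counts above are quite unbalanced, so it helps to know what accuracy a trivial classifier would reach by always predicting the most frequent class. The short calculation below uses only the counts printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {2: 76, 1: 70, 7: 29, 3: 17, 5: 13, 6: 9}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always answers with the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"samples={total}, majority class={majority_class}, baseline={baseline_accuracy:.4f}")
```

Any model evaluated later should be read against this baseline of roughly 0.36 rather than against zero.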

In [50]:
plt.figure(figsize=(6, 8))
sns.countplot(x="Type", data=df, palette="Set1")
plt.show()

### Explore categorical variables <a class="anchor" id="6.2"></a>

In [51]:
categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :', categorical)


There are 0 categorical variables

The categorical variables are : []


### Explore numerical variables <a class="anchor" id="6.5"></a>

In [52]:
numerical = [var for var in df.columns if df[var].dtype!='O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :', numerical)

There are 10 numerical variables

The numerical variables are : ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Type']


### 6. Explore problems within numerical variables <a class="anchor" id="6.7"></a>


### Missing values in numerical variables

In [53]:
df[numerical].isnull().sum()

RI      0
Na      0
Mg      0
Al      0
Si      0
K       0
Ca      0
Ba      0
Fe      0
Type    0
dtype: int64

### Outliers in numerical variables

In [54]:
# Coarse summary rounded to whole numbers; call round(..., 2) to keep two decimals
print(round(df[numerical].describe()))

          RI     Na     Mg     Al     Si      K     Ca     Ba     Fe   Type
count  214.0  214.0  214.0  214.0  214.0  214.0  214.0  214.0  214.0  214.0
mean     2.0   13.0    3.0    1.0   73.0    0.0    9.0    0.0    0.0    3.0
std      0.0    1.0    1.0    0.0    1.0    1.0    1.0    0.0    0.0    2.0
min      2.0   11.0    0.0    0.0   70.0    0.0    5.0    0.0    0.0    1.0
25%      2.0   13.0    2.0    1.0   72.0    0.0    8.0    0.0    0.0    1.0
50%      2.0   13.0    3.0    1.0   73.0    1.0    9.0    0.0    0.0    2.0
75%      2.0   14.0    4.0    2.0   73.0    1.0    9.0    0.0    0.0    3.0
max      2.0   17.0    4.0    4.0   75.0    6.0   16.0    3.0    1.0    7.0


### Visualize outliers with boxplots


In [55]:
numer_cols = numerical  # all numerical columns, including the target Type

# draw boxplots to visualize outliers
plt.figure(figsize=(20, 40))

# Create boxplots for each numerical column
for i, col in enumerate(numer_cols, 1):
    plt.subplot(len(numer_cols), 2, i)
    sns.boxplot(data=df, x=col)
    plt.title(f"Boxplot for {col}", fontsize=10)
    plt.xlabel(col, fontsize=12)

# Overall title for the entire figure
plt.suptitle("Boxplots to Identify Outliers in Numerical Data", fontsize=16, weight='bold')

# Adjust layout for better spacing
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Leave space for the suptitle
plt.show()

### Visualize histograms

In [56]:
# plot histogram to check distribution

plt.figure(figsize=(15, 40))

for i, col in enumerate(numer_cols, 1):
    plt.subplot(len(numer_cols), 2, i)
    sns.histplot(df[col], bins=10, kde=True)
    plt.title(f"Histogram for {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)

# Overall title for the entire figure
plt.suptitle("Distribution in Numerical Data", fontsize=16, weight='bold')

# Adjust layout for better spacing
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Leave space for the suptitle
plt.show()

### Count outliers

In [57]:
def detect_outliers_only(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    outlier_info = []
    for col in numeric_cols:
        non_null = df[col].dropna()
        if len(non_null)==0: continue
        Q1 = non_null.quantile(0.25)
        Q3 = non_null.quantile(0.75)
        IQR = Q3 - Q1
        if IQR>0:
            lb = Q1 - 1.5*IQR
            ub = Q3 + 1.5*IQR
            outliers = non_null[(non_null < lb) | (non_null > ub)]
            pct = 100 * len(outliers) / len(non_null)
            outlier_info.append((col, len(outliers), pct, lb, ub))
    out_df = pd.DataFrame(outlier_info, columns=['Variable','Outliers','Porcentaje','Límite_Inferior','Límite_Superior'])
    return out_df
print(detect_outliers_only(df))

  Variable  Outliers  Porcentaje  Límite_Inferior  Límite_Superior
0       RI        17    7.943925          1.51257          1.52311
1       Na         7    3.271028         11.53125         15.20125
2       Mg         0    0.000000         -0.11250          5.82750
3       Al        18    8.411215          0.53000          2.29000
4       Si        12    5.607477         71.06875         74.29875
5        K         7    3.271028         -0.60875          1.34125
6       Ca        26   12.149533          6.84125         10.57125
7       Fe        12    5.607477         -0.15000          0.25000
8     Type        29   13.551402         -2.00000          6.00000
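
The limits in the table follow the usual 1.5 × IQR rule that `detect_outliers_only` implements. The sketch below recomputes a few of them from the quartiles shown earlier by `df.describe()`, and illustrates why `Ba` is missing from the table: its first and third quartiles are both 0, so the IQR is 0 and the function skips it. The helper `iqr_fences` is a hypothetical name used only here.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences, or None when the IQR is not positive."""
    iqr = q3 - q1
    if iqr <= 0:
        return None  # mirrors the `if IQR>0` guard in detect_outliers_only
    return q1 - k * iqr, q3 + k * iqr


# (Q1, Q3) pairs copied from the 25% and 75% rows of df.describe().
quartiles = {
    "RI": (1.516522, 1.519157),
    "Na": (12.9075, 13.825),
    "Ba": (0.0, 0.0),
}

fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, bounds in fences.items():
    print(name, bounds)
```

Note that `Type` also appears in the outlier table because it is numeric, but it is the class label, so its "outliers" are simply the samples of class 7.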


## 7. Multivariate Analysis <a class="anchor" id="7"></a>


In [58]:
# corr() method computes the pairwise correlation of columns
correlation = df.corr(numeric_only=True)

### Heat Map <a class="anchor" id="7.1"></a>

In [59]:
plt.figure(figsize=(16,12))
plt.title('Correlation Heatmap of the Glass Dataset')
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', cmap='coolwarm_r')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)           
plt.show()

## 8. Declare feature vector and target variable

In [60]:
# Delete the rows with missing values in the target variable
df.dropna(subset=['Type'], inplace=True)

# Features
X = df.drop(['Type'], axis=1)

# Target
y = df['Type']

## 9. Split data into separate training and test set <a class="anchor" id="9"></a>

In [61]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)


In [62]:
# check the shape of X_train and X_test
X_train.shape, X_test.shape

((171, 9), (43, 9))

### 10. Feature Scaling

In [63]:
# do not truncate columns in the output
pd.set_option('display.max_columns', None)
X_train.describe()

|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|---|
| count | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 |
| mean | 1.518483 | 13.386023 | 2.795322 | 1.417836 | 72.595088 | 0.500468 | 8.942164 | 0.17345 | 0.060468 |
| std | 0.003061 | 0.762327 | 1.380812 | 0.477435 | 0.768149 | 0.560228 | 1.417021 | 0.515266 | 0.100889 |
| min | 1.51131 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 |
| 25% | 1.51654 | 12.895 | 2.4 | 1.19 | 72.255 | 0.165 | 8.225 | 0.0 | 0.0 |
| 50% | 1.51778 | 13.3 | 3.49 | 1.35 | 72.75 | 0.56 | 8.59 | 0.0 | 0.0 |
| 75% | 1.5193 | 13.785 | 3.61 | 1.595 | 73.025 | 0.61 | 9.235 | 0.0 | 0.1 |
| max | 1.53393 | 15.79 | 4.49 | 3.5 | 75.18 | 6.21 | 16.19 | 3.15 | 0.51 |


In [64]:
cols = X_train.columns
cols

Index(['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe'], dtype='object')

In [65]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_test

array([[ 0.27586207,  0.4486166 ,  0.80178174,  0.33333333,  0.59217877,
         0.09178744,  0.2760223 ,  0.        ,  0.21568627],
       [ 0.26348364,  0.78458498,  0.        ,  0.63862928,  0.6461825 ,
         0.        ,  0.32713755,  0.20952381,  0.        ],
       [ 0.45137047,  0.45849802,  0.81291759,  0.18068536,  0.44878957,
         0.03059581,  0.41078067,  0.        ,  0.33333333],
       [ 0.20822281,  0.81422925,  0.        ,  0.65109034,  0.6461825 ,
         0.        ,  0.30947955,  0.2031746 ,  0.17647059],
       [ 0.255084  ,  0.59090909,  0.81959911,  0.47352025,  0.41899441,
         0.10305958,  0.22769517,  0.        ,  0.        ],
       [ 0.71087533,  0.05731225,  0.        ,  0.14330218,  0.60893855,
         0.        ,  0.88568773,  0.        ,  0.        ],
       [ 0.27851459,  0.41106719,  0.78841871,  0.29283489,  0.63873371,
         0.09339775,  0.27509294,  0.        ,  0.        ],
       [ 0.19363395,  0.49604743,  0.77728285,  0.36760125,  0

In [66]:
# Pass the Index directly; wrapping it in a list would create MultiIndex columns
X_train = pd.DataFrame(X_train, columns=cols)

In [67]:
X_test = pd.DataFrame(X_test, columns=cols)

In [68]:
X_train.describe()

|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|---|
| count | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 | 171.0 |
| mean | 0.317108 | 0.524906 | 0.622566 | 0.351351 | 0.518638 | 0.080591 | 0.326409 | 0.055064 | 0.118564 |
| std | 0.135335 | 0.150657 | 0.307531 | 0.148734 | 0.143044 | 0.090214 | 0.131693 | 0.163577 | 0.197822 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.231211 | 0.427866 | 0.534521 | 0.280374 | 0.455307 | 0.02657 | 0.259758 | 0.0 | 0.0 |
| 50% | 0.28603 | 0.507905 | 0.777283 | 0.330218 | 0.547486 | 0.090177 | 0.29368 | 0.0 | 0.0 |
| 75% | 0.353227 | 0.603755 | 0.804009 | 0.406542 | 0.598696 | 0.098229 | 0.353625 | 0.0 | 0.196078 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
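
`MinMaxScaler` maps each feature with `(x - min) / (max - min)`, where `min` and `max` come from the training split only. The worked example below applies that formula to `Na` using the training minimum and maximum reported by `X_train.describe()` (10.73 and 15.79). The value 13.00 is a hypothetical sample, while 17.38 is the full-dataset maximum from `df.describe()`, which must belong to the test split because it exceeds the training maximum.

```python
def min_max_scale(value, train_min, train_max):
    """Scale a value with min-max statistics learned on the training split."""
    return (value - train_min) / (train_max - train_min)


NA_TRAIN_MIN, NA_TRAIN_MAX = 10.73, 15.79  # from X_train.describe() before scaling

inside = min_max_scale(13.00, NA_TRAIN_MIN, NA_TRAIN_MAX)   # hypothetical sample
outside = min_max_scale(17.38, NA_TRAIN_MIN, NA_TRAIN_MAX)  # dataset-wide Na maximum

print(f"Na=13.00 -> {inside:.7f}")
print(f"Na=17.38 -> {outside:.4f}")
```

Test rows can therefore land outside `[0, 1]` after scaling. That is expected behavior when the scaler is fit on the training data alone.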


In [69]:
df_train = X_train.copy()
df_train['Type'] = y_train.values

df_test = X_test.copy()
df_test['Type'] = y_test.values

df_final = pd.concat([df_train, df_test], axis=0)

df_final.to_csv("glass_preprocessed.csv", index=False)
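
The saved file mixes rows that were scaled as training data with rows that were scaled as test data, and later cells re-split and cross-validate it. The scaling statistics may then have been learned from rows that end up in a test fold. One common way to avoid this, sketched below, is to put the scaler inside a scikit-learn pipeline so it is refit on the training portion of every fold. The sketch assumes scikit-learn is installed and uses its bundled wine dataset purely as a stand-in, since `glass.xls` is not available here; `X_demo`, `y_demo`, and `pipeline` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): any numeric feature matrix and label vector would do.
X_demo, y_demo = load_wine(return_X_y=True)

# The scaler is fit only on the training part of each fold, then applied to the held-out part.
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.4f}")
```

With this layout the raw, unscaled features are passed to `cross_val_score`, and no preprocessed CSV is needed as an intermediate step.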


In [70]:
# Configuration
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

### 12. Load and prepare the data

In [71]:
# Load the preprocessed dataset
df = pd.read_csv('glass_preprocessed.csv')

# The target is assumed to be the last column; adjust if your file differs
X = df.iloc[:, :-1]  # all columns except the last
y = df.iloc[:, -1]   # last column as the target

print(f"Dimensiones de X: {X.shape}")
print(f"Dimensiones de y: {y.shape}")
print(f"Distribución de clases: {y.value_counts()}")

Dimensiones de X: (214, 9)
Dimensiones de y: (214,)
Distribución de clases: Type
2    76
1    70
7    29
3    17
5    13
6     9
Name: count, dtype: int64
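
With only 9 samples in class 6, a 10-fold split cannot place that class in every fold, however the folds are built. The sketch below makes this concrete with a simplified round-robin assignment that keeps class proportions roughly equal across folds. It is not scikit-learn's exact `StratifiedKFold` algorithm, and `stratified_folds` is a hypothetical helper.

```python
def stratified_folds(labels, n_splits):
    """Assign sample indices to folds class by class, in round-robin order."""
    folds = [[] for _ in range(n_splits)]
    next_fold = 0
    for cls in sorted(set(labels)):
        for idx, label in enumerate(labels):
            if label == cls:
                folds[next_fold % n_splits].append(idx)
                next_fold += 1  # keep rotating so fold sizes stay balanced
    return folds


# Rebuild a label vector from the class counts printed above.
class_counts = {2: 76, 1: 70, 7: 29, 3: 17, 5: 13, 6: 9}
labels = [cls for cls, count in class_counts.items() for _ in range(count)]

folds = stratified_folds(labels, n_splits=10)
fold_sizes = [len(fold) for fold in folds]
folds_with_class_6 = sum(any(labels[i] == 6 for i in fold) for fold in folds)

print("fold sizes:", fold_sizes)
print("folds containing class 6:", folds_with_class_6)
```

At least one fold has no class 6 sample in its test portion, so per-fold accuracies for rare classes rest on very few examples.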


### 13. Evaluation functions

In [72]:
def print_confusion_matrix_details(y_true, y_pred, class_names=None):
    """Print the confusion matrix and basic metrics; return the accuracy."""
    cm = confusion_matrix(y_true, y_pred)
    
    print("Matriz de Confusión:")
    print(cm)
    print()
    
    # Binary problems: unpack the four cells of the 2x2 matrix
    if cm.shape == (2, 2):
        TN, FP, FN, TP = cm.ravel()
        print(f"True Positives (TP): {TP}")
        print(f"False Positives (FP): {FP}")
        print(f"False Negatives (FN): {FN}")
        print(f"True Negatives (TN): {TN}")
    else:
        # Multiclass problems: only the full matrix is shown
        print("Problema multiclase - mostrando matriz completa")
    
    accuracy = accuracy_score(y_true, y_pred)
    print(f"\nPrecisión (Accuracy): {accuracy:.4f}")
    print("-" * 50)
    
    return accuracy

def evaluate_holdout(model, X_train, X_test, y_train, y_test, model_name=""):
    """Evaluate a model on a 70/30 Hold-Out split."""
    print(f"{model_name} - HOLD-OUT 70/30")
    print("=" * 40)
    
    # Train and predict
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Show results
    accuracy = print_confusion_matrix_details(y_test, y_pred)
    return accuracy

def evaluate_kfold(model, X, y, k=10, model_name=""):
    """Evaluate a model with stratified k-fold cross-validation (k=10 by default)."""
    print(f"{model_name} - 10-FOLD CROSS VALIDATION")
    print("=" * 40)
    
    # Stratified K-Fold
    kfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=RANDOM_STATE)
    scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    
    print(f"Precisión por fold: {['%.4f' % s for s in scores]}")
    print(f"Precisión promedio: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
    print("-" * 50)
    
    return scores.mean()

def evaluate_loocv(model, X, y, model_name="", max_samples=100):
    """Evaluate with Leave-One-Out, on a random subsample of at most max_samples rows."""
    print(f"{model_name} - LEAVE-ONE-OUT")
    print("=" * 40)
    
    # Use a random subset when the dataset has more than max_samples rows
    if len(X) > max_samples:
        indices = np.random.choice(len(X), max_samples, replace=False)
        X_sub = X.iloc[indices] if hasattr(X, 'iloc') else X[indices]
        y_sub = y.iloc[indices] if hasattr(y, 'iloc') else y[indices]
        print(f"Usando {max_samples} muestras de {len(X)} totales")
    else:
        X_sub, y_sub = X, y
    
    # LOOCV
    loo = LeaveOneOut()
    scores = cross_val_score(model, X_sub, y_sub, cv=loo, scoring='accuracy')
    
    print(f"Precisión: {scores.mean():.4f}")
    print(f"Número de iteraciones: {len(scores)}")
    print("-" * 50)
    
    return scores.mean()
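
`evaluate_loocv` draws its subsample with `np.random.choice` from NumPy's global random stream, which the notebook seeds only once. Successive calls, one per model, therefore most likely pick different rows, which makes the Leave-One-Out scores harder to compare across models. The sketch below illustrates the difference using the standard-library `random` module as a stand-in for NumPy; the same idea would apply to a dedicated NumPy generator. The function names are hypothetical.

```python
import random

N_SAMPLES, SUBSET_SIZE, SEED = 214, 100, 42


def shared_stream_subsets(n_models):
    """One generator seeded once and shared by all calls (analogous to the notebook)."""
    stream = random.Random(SEED)
    return [sorted(stream.sample(range(N_SAMPLES), SUBSET_SIZE)) for _ in range(n_models)]


def per_call_seed_subsets(n_models):
    """A fresh generator with the same seed on every call, so every model sees the same rows."""
    return [
        sorted(random.Random(SEED).sample(range(N_SAMPLES), SUBSET_SIZE))
        for _ in range(n_models)
    ]


shared = shared_stream_subsets(3)
fixed = per_call_seed_subsets(3)

print("shared stream, same subset for models 1 and 2:", shared[0] == shared[1])
print("per-call seed, same subset for models 1 and 2:", fixed[0] == fixed[1])
```

Since the dataset has only 214 rows, another option is to call `evaluate_loocv(..., max_samples=len(X))`, which skips the subsampling branch and runs a true Leave-One-Out over every sample.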

### 14. Data split for Hold-Out

In [73]:
# Hold-Out 70/30
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE, stratify=y
)

print(f"Training: {X_train.shape[0]} muestras")
print(f"Testing: {X_test.shape[0]} muestras")
print()

Training: 149 muestras
Testing: 65 muestras



### 15. 1-NN evaluation

In [74]:
# Dictionary that stores all results
results = {}

print("EVALUACIÓN 1-NN")
print("=" * 60)

# 1-NN model
model_1nn = KNeighborsClassifier(n_neighbors=1)

# Hold-Out
acc_ho_1nn = evaluate_holdout(model_1nn, X_train, X_test, y_train, y_test, "1-NN")

# 10-Fold CV
acc_kf_1nn = evaluate_kfold(model_1nn, X, y, model_name="1-NN")

# LOOCV
acc_loocv_1nn = evaluate_loocv(model_1nn, X, y, model_name="1-NN")

# Store results
results['1-NN'] = {
    'Hold-Out': acc_ho_1nn,
    '10-Fold CV': acc_kf_1nn,
    'LOOCV': acc_loocv_1nn
}

EVALUACIÓN 1-NN
1-NN - HOLD-OUT 70/30
Matriz de Confusión:
[[16  2  3  0  0  0]
 [ 5 17  0  1  0  0]
 [ 2  1  2  0  0  0]
 [ 0  0  0  4  0  0]
 [ 0  1  0  0  2  0]
 [ 0  0  0  0  1  8]]

Problema multiclase - mostrando matriz completa

Precisión (Accuracy): 0.7538
--------------------------------------------------
1-NN - 10-FOLD CROSS VALIDATION
Precisión por fold: ['0.6818', '0.6818', '0.8636', '0.6364', '0.7143', '0.8571', '0.7619', '0.8095', '0.6190', '0.7619']
Precisión promedio: 0.7387 (+/- 0.1644)
--------------------------------------------------
1-NN - LEAVE-ONE-OUT
Usando 100 muestras de 214 totales
Precisión: 0.6500
Número de iteraciones: 100
--------------------------------------------------
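
For multiclass problems the helper only prints the matrix, but per-class figures can be read off it directly. The sketch below recomputes the accuracy of the 1-NN Hold-Out matrix above and derives the recall of each class, assuming the scikit-learn convention that rows are true classes and columns are predicted classes, in sorted label order.

```python
# 1-NN Hold-Out confusion matrix copied from the output above.
cm = [
    [16, 2, 3, 0, 0, 0],
    [5, 17, 0, 1, 0, 0],
    [2, 1, 2, 0, 0, 0],
    [0, 0, 0, 4, 0, 0],
    [0, 1, 0, 0, 2, 0],
    [0, 0, 0, 0, 1, 8],
]
labels = [1, 2, 3, 5, 6, 7]  # sorted class labels of the glass dataset

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
accuracy = correct / total

# Recall per class: correct predictions divided by the number of true samples in that row.
recall = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for label, value in recall.items():
    print(f"class {label}: recall = {value:.2f}")
```

The overall accuracy hides large differences between classes: class 3, for example, is recovered in only 2 of its 5 test samples.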


### 16. KNN evaluation for different values of K

In [75]:
# Evaluate KNN for several values of k
k_values = [3, 5, 7, 9]

for k in k_values:
    print(f"\nEVALUACIÓN KNN con k={k}")
    print("=" * 60)
    
    # KNN model
    model_knn = KNeighborsClassifier(n_neighbors=k)
    
    # Hold-Out
    acc_ho = evaluate_holdout(model_knn, X_train, X_test, y_train, y_test, f"KNN (k={k})")
    
    # 10-Fold CV
    acc_kf = evaluate_kfold(model_knn, X, y, model_name=f"KNN (k={k})")
    
    # LOOCV
    acc_loocv = evaluate_loocv(model_knn, X, y, model_name=f"KNN (k={k})")
    
    # Store results
    results[f'KNN (k={k})'] = {
        'Hold-Out': acc_ho,
        '10-Fold CV': acc_kf,
        'LOOCV': acc_loocv
    }


EVALUACIÓN KNN con k=3
KNN (k=3) - HOLD-OUT 70/30
Matriz de Confusión:
[[19  1  1  0  0  0]
 [ 5 18  0  0  0  0]
 [ 3  1  1  0  0  0]
 [ 0  1  0  3  0  0]
 [ 0  1  0  0  2  0]
 [ 0  1  0  0  1  7]]

Problema multiclase - mostrando matriz completa

Precisión (Accuracy): 0.7692
--------------------------------------------------
KNN (k=3) - 10-FOLD CROSS VALIDATION
Precisión por fold: ['0.6364', '0.7273', '0.7727', '0.5909', '0.6667', '0.7143', '0.8095', '0.6190', '0.6190', '0.7619']
Precisión promedio: 0.6918 (+/- 0.1436)
--------------------------------------------------
KNN (k=3) - LEAVE-ONE-OUT
Usando 100 muestras de 214 totales
Precisión: 0.6400
Número de iteraciones: 100
--------------------------------------------------

EVALUACIÓN KNN con k=5
KNN (k=5) - HOLD-OUT 70/30
Matriz de Confusión:
[[19  2  0  0  0  0]
 [ 5 18  0  0  0  0]
 [ 3  2  0  0  0  0]
 [ 0  3  0  1  0  0]
 [ 0  1  0  0  2  0]
 [ 0  1  0  0  1  7]]

Problema multiclase - mostrando matriz completa

Precisión (Accur

### 17. Naive Bayes evaluation

In [76]:
print("\nEVALUACIÓN NAIVE BAYES")
print("=" * 60)

# Naive Bayes model
model_nb = GaussianNB()

# Hold-Out
acc_ho_nb = evaluate_holdout(model_nb, X_train, X_test, y_train, y_test, "Naive Bayes")

# 10-Fold CV
acc_kf_nb = evaluate_kfold(model_nb, X, y, model_name="Naive Bayes")

# LOOCV
acc_loocv_nb = evaluate_loocv(model_nb, X, y, model_name="Naive Bayes")

# Store results
results['Naive Bayes'] = {
    'Hold-Out': acc_ho_nb,
    '10-Fold CV': acc_kf_nb,
    'LOOCV': acc_loocv_nb
}


EVALUACIÓN NAIVE BAYES
Naive Bayes - HOLD-OUT 70/30
Matriz de Confusión:
[[19  0  1  0  0  1]
 [17  4  0  1  0  1]
 [ 5  0  0  0  0  0]
 [ 0  4  0  0  0  0]
 [ 0  0  0  0  3  0]
 [ 0  0  0  1  0  8]]

Problema multiclase - mostrando matriz completa

Precisión (Accuracy): 0.5231
--------------------------------------------------
Naive Bayes - 10-FOLD CROSS VALIDATION
Precisión por fold: ['0.5000', '0.6364', '0.5455', '0.5000', '0.3810', '0.5238', '0.5714', '0.4762', '0.3333', '0.3810']
Precisión promedio: 0.4848 (+/- 0.1797)
--------------------------------------------------
Naive Bayes - LEAVE-ONE-OUT
Usando 100 muestras de 214 totales
Precisión: 0.4900
Número de iteraciones: 100
--------------------------------------------------
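
The 10-fold summary line reports the mean fold accuracy plus or minus twice the standard deviation. Because NumPy's `std()` defaults to the population formula, the figures can be reproduced with `statistics.pstdev`. The check below uses the Naive Bayes fold scores printed above, rounded to four decimals as shown, so small rounding differences are possible.

```python
from statistics import fmean, pstdev

# Naive Bayes per-fold accuracies copied from the output above.
fold_scores = [0.5000, 0.6364, 0.5455, 0.5000, 0.3810,
               0.5238, 0.5714, 0.4762, 0.3333, 0.3810]

mean_accuracy = fmean(fold_scores)
spread = 2 * pstdev(fold_scores)  # population standard deviation, as in ndarray.std()

print(f"mean accuracy: {mean_accuracy:.4f} (+/- {spread:.4f})")
```

The interval is wide relative to the mean, which is typical when each fold holds only about 21 test samples.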


### 18. Comparative summary

In [77]:
print("\n" + "=" * 100)
print("RESUMEN COMPARATIVO FINAL - PRECISIÓN")
print("=" * 100)

# Build the comparison table
comparison_data = []
models = ['1-NN', 'KNN (k=3)', 'KNN (k=5)', 'KNN (k=7)', 'KNN (k=9)', 'Naive Bayes']

for model_name in models:
    comparison_data.append({
        'Modelo': model_name,
        'Hold-Out 70/30': f"{results[model_name]['Hold-Out']:.4f}",
        '10-Fold CV': f"{results[model_name]['10-Fold CV']:.4f}",
        'Leave-One-Out': f"{results[model_name]['LOOCV']:.4f}"
    })

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

# Comparison chart
plt.figure(figsize=(14, 8))

model_names = comparison_df['Modelo']
holdout_scores = [float(x) for x in comparison_df['Hold-Out 70/30']]
kfold_scores = [float(x) for x in comparison_df['10-Fold CV']]
loocv_scores = [float(x) for x in comparison_df['Leave-One-Out']]

x = np.arange(len(model_names))
width = 0.25

plt.bar(x - width, holdout_scores, width, label='Hold-Out 70/30', alpha=0.8, color='skyblue')
plt.bar(x, kfold_scores, width, label='10-Fold CV', alpha=0.8, color='lightgreen')
plt.bar(x + width, loocv_scores, width, label='Leave-One-Out', alpha=0.8, color='salmon')

plt.xlabel('Classification Models')
plt.ylabel('Accuracy')
plt.title('Model Comparison - All Validation Methods', fontsize=14, weight='bold')
plt.xticks(x, model_names, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)

# Add the values on top of the bars
for i, (h, k, l) in enumerate(zip(holdout_scores, kfold_scores, loocv_scores)):
    plt.text(i - width, h + 0.01, f'{h:.3f}', ha='center', va='bottom', fontsize=8)
    plt.text(i, k + 0.01, f'{k:.3f}', ha='center', va='bottom', fontsize=8)
    plt.text(i + width, l + 0.01, f'{l:.3f}', ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.show()


RESUMEN COMPARATIVO FINAL - PRECISIÓN
     Modelo Hold-Out 70/30 10-Fold CV Leave-One-Out
       1-NN         0.7538     0.7387        0.6500
  KNN (k=3)         0.7692     0.6918        0.6400
  KNN (k=5)         0.7231     0.6729        0.6100
  KNN (k=7)         0.7077     0.6721        0.5200
  KNN (k=9)         0.7077     0.6675        0.6200
Naive Bayes         0.5231     0.4848        0.4900
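
To read the table at a glance, the snippet below picks the best-scoring model under each validation method. The numbers are copied from the summary above; `summary` and `best_per_method` are illustrative names.

```python
# Accuracies copied from the comparison table above.
summary = {
    "1-NN":        {"Hold-Out 70/30": 0.7538, "10-Fold CV": 0.7387, "Leave-One-Out": 0.6500},
    "KNN (k=3)":   {"Hold-Out 70/30": 0.7692, "10-Fold CV": 0.6918, "Leave-One-Out": 0.6400},
    "KNN (k=5)":   {"Hold-Out 70/30": 0.7231, "10-Fold CV": 0.6729, "Leave-One-Out": 0.6100},
    "KNN (k=7)":   {"Hold-Out 70/30": 0.7077, "10-Fold CV": 0.6721, "Leave-One-Out": 0.5200},
    "KNN (k=9)":   {"Hold-Out 70/30": 0.7077, "10-Fold CV": 0.6675, "Leave-One-Out": 0.6200},
    "Naive Bayes": {"Hold-Out 70/30": 0.5231, "10-Fold CV": 0.4848, "Leave-One-Out": 0.4900},
}

methods = ["Hold-Out 70/30", "10-Fold CV", "Leave-One-Out"]

# For each validation method, keep the model with the highest accuracy.
best_per_method = {
    method: max(summary, key=lambda model: summary[model][method])
    for method in methods
}

for method, model in best_per_method.items():
    print(f"{method}: {model} ({summary[model][method]:.4f})")
```

The Leave-One-Out column should be compared with some caution, because the logs show it was computed on a 100-sample random subset rather than on all 214 rows.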
