Classification and Hyperparameter Optimization (Core)

Objective: Implement a complete machine learning pipeline for a classification problem using preprocessing, modeling, and hyperparameter optimization techniques. Focus especially on data cleaning and on optimization with GridSearchCV and RandomizedSearchCV.

Dataset: Medical Cost Personal Dataset

Dataset description: The Medical Cost Personal Dataset contains information on several factors that affect health insurance costs, such as age, sex, body mass index, and smoking habit. It is well suited for practicing preprocessing and model optimization techniques because it contains dirty data and a mix of categorical and numerical variables.

## Data Loading and Initial Exploration:
* Load the dataset from Kaggle.
* Perform an initial exploration to understand the structure of the dataset and the available features.
* Identify and document the missing values and outliers in the dataset.
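
The cells below check for missing values but do not show an outlier check. A minimal sketch of one common approach, the 1.5 × IQR rule, is given here. The helper `iqr_outliers` and the toy frame are hypothetical and only stand in for the insurance data; the same call could be applied to the numeric columns of `data`.

```python
import pandas as pd


def iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index with the columns
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()


# Toy data standing in for the insurance columns (values are illustrative only)
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 60.0],
    "charges": [1200.0, 2500.0, 3100.0, 4000.0, 4800.0, 5200.0, 5600.0],
})
outlier_counts = iqr_outliers(toy)
print(outlier_counts)
```

Counting flagged values per column is enough to document outliers; whether to drop, cap, or keep them is a separate modeling decision.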

In [71]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor


In [72]:
# Load the dataset
data = pd.read_csv('../data/insurance.csv')
print(data.shape)
print(data.columns)
data.info()
data.head(5)

(1338, 7)
Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552


In [73]:
print("Missing values:", data.isnull().sum())

Missing values: age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


In [74]:
data.describe()

             age        bmi  children       charges
count     1338.0     1338.0    1338.0        1338.0
mean   39.207025  30.663397  1.094918  13270.422265
std     14.04996   6.098187  1.205493  12110.011237
min         18.0      15.96       0.0     1121.8739
25%         27.0   26.29625       0.0    4740.28715
50%         39.0       30.4       1.0      9382.033
75%         51.0   34.69375       2.0  16639.912515
max         64.0      53.13       5.0   63770.42801


In [75]:
# Fix the data types: store the text columns as categories
data_type = {
    'smoker' : 'category',
    'region' : 'category',   
    'sex' : 'category' 
}
data = data.astype(data_type)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       1338 non-null   int64   
 1   sex       1338 non-null   category
 2   bmi       1338 non-null   float64 
 3   children  1338 non-null   int64   
 4   smoker    1338 non-null   category
 5   region    1338 non-null   category
 6   charges   1338 non-null   float64 
dtypes: category(3), float64(2), int64(2)
memory usage: 46.3 KB


In [76]:
data['sex'].unique()


['female', 'male']
Categories (2, object): ['female', 'male']

In [77]:
data['region'].unique()

['southwest', 'southeast', 'northwest', 'northeast']
Categories (4, object): ['northeast', 'northwest', 'southeast', 'southwest']

In [78]:
data['smoker'].unique()

['yes', 'no']
Categories (2, object): ['no', 'yes']

## Data Preprocessing:
* Impute missing values using suitable techniques (mean, median, mode, advanced imputation).
* Encode categorical variables using One-Hot Encoding.
* Scale numerical features using StandardScaler.
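
The dataset has no missing values, so the `preprocessor` defined in the next cell skips imputation. As a sketch of how the three steps listed above could be combined if missing values were present, the example below assumes median imputation for numeric columns and most-frequent imputation for categorical ones. The name `preprocessor_with_imputation` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the mode, then one-hot encode
categorical_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor_with_imputation = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["age", "bmi"]),
    ("cat", categorical_steps, ["smoker"]),
])

# Toy rows with gaps (illustrative values, not taken from insurance.csv)
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 51],
    "bmi": [27.9, np.nan, 30.4, 34.7],
    "smoker": pd.Series(["yes", "no", np.nan, "no"], dtype=object),
})

transformed = preprocessor_with_imputation.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert for inspection
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed
print(dense.shape)
```

Keeping the imputers inside the pipeline means their statistics are learned from the training folds only, which avoids leaking information from validation data during cross-validation.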

In [79]:
# Select the feature columns and the target
X = data[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
y = data['charges']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the numerical and categorical columns
numerical_columns = ['age', 'bmi', 'children']
categorical_columns = ['sex', 'smoker', 'region']

# Define a specific transformation for each type of column
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),  # Numerical scaling
        ('cat', OneHotEncoder(), categorical_columns)  # Categorical encoding
    ]
)

## Implementation of Classification Models:

In [None]:

# Create a pipeline for KNN
knn_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', KNeighborsRegressor())
])

# Create a pipeline for Random Forest
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('rf', RandomForestRegressor())
])

# Create a pipeline for the decision tree
dt_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('dt', DecisionTreeRegressor())
])

# Train and evaluate the models

# KNN
knn_model.fit(X_train, y_train)
knn_predictions = knn_model.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_predictions)
knn_r2 = r2_score(y_test, knn_predictions)
print(f'KNN - MSE: {knn_mse:.4f}, R²: {knn_r2:.4f}')

# Random Forest
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_r2 = r2_score(y_test, rf_predictions)
print(f'Random Forest - MSE: {rf_mse:.4f}, R²: {rf_r2:.4f}')

# Decision Tree
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_predictions)
dt_r2 = r2_score(y_test, dt_predictions)
print(f'Decision Tree - MSE: {dt_mse:.4f}, R²: {dt_r2:.4f}')

KNN - MSE: 34260377.0476, R²: 0.7663
Random Forest - MSE: 21317568.6867, R²: 0.8546
Decision Tree - MSE: 37005481.0817, R²: 0.7476


## Hyperparameter Optimization

In [81]:

# Define the hyperparameters to tune for each model

# KNN: try different numbers of neighbors and weighting schemes
knn_param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],  # Number of neighbors
    'knn__weights': ['uniform', 'distance'],  # Weighting scheme
}

# Random Forest: hyperparameters to tune
rf_param_grid = {
    'rf__n_estimators': [50, 100, 150, 200],  # Number of trees in the forest
    'rf__max_depth': [3, 5, 10, None],  # Maximum depth of each tree
    'rf__min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'rf__min_samples_leaf': [1, 2, 4],  # Minimum number of samples per leaf
    'rf__bootstrap': [True, False]  # Whether to use bootstrap sampling
}

# Decision Tree: try different depths and split criteria
dt_param_grid = {
    'dt__max_depth': [3, 5, 7, 10, None],  # Maximum depth
    'dt__min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'dt__criterion': ['squared_error', 'friedman_mse'],  # Split criterion
}
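
GridSearchCV fits one model per parameter combination per fold, so the cost of a grid is the product of its list lengths times the number of folds. One way to check this before launching a search is `ParameterGrid`. The sketch below counts the candidates for the three grids above, assuming `cv=3` as in the searches that follow; the names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

knn_param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
}
rf_param_grid = {
    'rf__n_estimators': [50, 100, 150, 200],
    'rf__max_depth': [3, 5, 10, None],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4],
    'rf__bootstrap': [True, False],
}
dt_param_grid = {
    'dt__max_depth': [3, 5, 7, 10, None],
    'dt__min_samples_split': [2, 5, 10],
    'dt__criterion': ['squared_error', 'friedman_mse'],
}

cv = 3  # number of folds assumed for the searches
grids = {'KNN': knn_param_grid, 'Random Forest': rf_param_grid, 'Decision Tree': dt_param_grid}

# Number of parameter combinations and total model fits per grid
n_candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
n_fits = {name: count * cv for name, count in n_candidates.items()}

for name in grids:
    print(f'{name}: {n_candidates[name]} candidates, {n_fits[name]} fits')
```

The Random Forest grid is by far the largest of the three, which is a plausible reason for trimming it before running the search.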



In [82]:
# Create a pipeline for each model with its preprocessor

# KNN pipeline
knn_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', KNeighborsRegressor())
])

# Reduced Random Forest grid (replaces the larger grid from the previous cell to shorten the search)
rf_param_grid = {
    'rf__n_estimators': [100, 150, 200],  # Limit the number of trees
    'rf__max_depth': [5, 10, None],  # More moderate depths
    'rf__min_samples_split': [2, 5],  # Only a few representative values
    'rf__min_samples_leaf': [1, 2],  # Only a few values
    'rf__bootstrap': [True]  # Only try bootstrap sampling
}
# Random Forest pipeline
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('rf', RandomForestRegressor())
])

# Decision Tree pipeline
dt_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('dt', DecisionTreeRegressor())
])

In [83]:
# Run GridSearchCV for KNN
grid_knn = GridSearchCV(knn_model, knn_param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_knn.fit(X_train, y_train)
grid_knn_best_params = grid_knn.best_params_
grid_knn_best_score = grid_knn.best_score_

In [84]:
# Run GridSearchCV for Random Forest
grid_rf = GridSearchCV(rf_model, rf_param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_rf.fit(X_train, y_train)
grid_rf_best_params = grid_rf.best_params_
grid_rf_best_score = grid_rf.best_score_

In [85]:
# Run GridSearchCV for the Decision Tree
grid_dt = GridSearchCV(dt_model, dt_param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_dt.fit(X_train, y_train)
grid_dt_best_params = grid_dt.best_params_
grid_dt_best_score = grid_dt.best_score_

In [86]:
# Run RandomizedSearchCV for KNN. The parameters are lists, not distributions, and the grid
# holds only 10 combinations, so n_iter=10 samples all of them and matches GridSearchCV.
random_knn = RandomizedSearchCV(knn_model, knn_param_grid, n_iter=10, cv=3, scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
random_knn.fit(X_train, y_train)
random_knn_best_params = random_knn.best_params_
random_knn_best_score = random_knn.best_score_

# Run RandomizedSearchCV for Random Forest
random_rf = RandomizedSearchCV(rf_model, rf_param_grid, n_iter=10, cv=3, scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
random_rf.fit(X_train, y_train)
random_rf_best_params = random_rf.best_params_
random_rf_best_score = random_rf.best_score_

# Run RandomizedSearchCV for the Decision Tree
random_dt = RandomizedSearchCV(dt_model, dt_param_grid, n_iter=10, cv=3, scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
random_dt.fit(X_train, y_train)
random_dt_best_params = random_dt.best_params_
random_dt_best_score = random_dt.best_score_
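
The searches above pass the same discrete lists to RandomizedSearchCV that GridSearchCV uses. RandomizedSearchCV also accepts distributions, which lets it sample values that a fixed grid would never try. The sketch below is a hedged illustration on synthetic data from `make_regression`, not on the insurance data; `X_demo`, `y_demo`, `demo_pipeline`, and `knn_param_distributions` are hypothetical names.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

demo_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('knn', KNeighborsRegressor()),
])

# Mix a distribution (integers 1..30) with a discrete list of choices
knn_param_distributions = {
    'knn__n_neighbors': randint(1, 31),
    'knn__weights': ['uniform', 'distance'],
}

search = RandomizedSearchCV(
    demo_pipeline,
    knn_param_distributions,
    n_iter=10,  # number of sampled candidates
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With a distribution in the search space, `n_iter` controls the budget directly, independent of how many values the parameter could take.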

In [87]:
# Compare GridSearchCV and RandomizedSearchCV results for KNN
print("KNN - GridSearchCV Best Params:", grid_knn_best_params)
print("KNN - GridSearchCV Best Score:", grid_knn_best_score)
print("KNN - RandomizedSearchCV Best Params:", random_knn_best_params)
print("KNN - RandomizedSearchCV Best Score:", random_knn_best_score)

# Compare GridSearchCV and RandomizedSearchCV results for Random Forest
print("\nRandom Forest - GridSearchCV Best Params:", grid_rf_best_params)
print("Random Forest - GridSearchCV Best Score:", grid_rf_best_score)
print("Random Forest - RandomizedSearchCV Best Params:", random_rf_best_params)
print("Random Forest - RandomizedSearchCV Best Score:", random_rf_best_score)

# Compare GridSearchCV and RandomizedSearchCV results for the Decision Tree
print("\nDecisionTree - GridSearchCV Best Params:", grid_dt_best_params)
print("DecisionTree - GridSearchCV Best Score:", grid_dt_best_score)
print("DecisionTree - RandomizedSearchCV Best Params:", random_dt_best_params)
print("DecisionTree - RandomizedSearchCV Best Score:", random_dt_best_score)

KNN - GridSearchCV Best Params: {'knn__n_neighbors': 5, 'knn__weights': 'distance'}
KNN - GridSearchCV Best Score: -40757730.68560166
KNN - RandomizedSearchCV Best Params: {'knn__weights': 'distance', 'knn__n_neighbors': 5}
KNN - RandomizedSearchCV Best Score: -40757730.68560166

Random Forest - GridSearchCV Best Params: {'rf__bootstrap': True, 'rf__max_depth': 5, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'rf__n_estimators': 100}
Random Forest - GridSearchCV Best Score: -22237202.17169947
Random Forest - RandomizedSearchCV Best Params: {'rf__n_estimators': 200, 'rf__min_samples_split': 2, 'rf__min_samples_leaf': 2, 'rf__max_depth': 5, 'rf__bootstrap': True}
Random Forest - RandomizedSearchCV Best Score: -22211853.941315006

DecisionTree - GridSearchCV Best Params: {'dt__criterion': 'friedman_mse', 'dt__max_depth': 3, 'dt__min_samples_split': 5}
DecisionTree - GridSearchCV Best Score: -23494632.197413202
DecisionTree - RandomizedSearchCV Best Params: {'dt__min_samples_split
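
The scores above are negative because `scoring='neg_mean_squared_error'` reports the negated MSE so that larger values are better. Negating the score and taking the square root gives an RMSE in the same units as `charges`, which is easier to read. The sketch below applies that conversion to the GridSearchCV scores printed above; the dictionary names are introduced here for illustration.

```python
import math

# Best cross-validated scores reported by GridSearchCV above (negated MSE)
cv_neg_mse = {
    'KNN': -40757730.68560166,
    'Random Forest': -22237202.17169947,
    'Decision Tree': -23494632.197413202,
}

# Flip the sign to recover the MSE, then take the square root to get the RMSE
cv_rmse = {name: math.sqrt(-score) for name, score in cv_neg_mse.items()}

for name, rmse in cv_rmse.items():
    print(f'{name} - CV RMSE: {rmse:.2f}')
```

For context, the mean of `charges` in this dataset is about 13270, so these errors can be read against that scale.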

## Model Evaluation:

In [None]:
# Create a pipeline for KNN
knn_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', KNeighborsRegressor(n_neighbors=5, weights='distance'))  # Best parameters from GridSearchCV
])

# Create a pipeline for Random Forest
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('rf', RandomForestRegressor(
        bootstrap=True, max_depth=5, min_samples_leaf=2,
        min_samples_split=5, n_estimators=100))  # Best parameters from GridSearchCV
])

# Create a pipeline for the decision tree
dt_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('dt', DecisionTreeRegressor(criterion='friedman_mse', max_depth=3, min_samples_split=5))  # Best parameters from GridSearchCV
])

# Train the models with the tuned parameters

# KNN
knn_model.fit(X_train, y_train)

# Random Forest
rf_model.fit(X_train, y_train)

# Decision Tree
dt_model.fit(X_train, y_train)

# Make predictions with each model
knn_predictions = knn_model.predict(X_test)
rf_predictions = rf_model.predict(X_test)
dt_predictions = dt_model.predict(X_test)

# Evaluate the models with MSE and R²
from sklearn.metrics import mean_squared_error, r2_score

knn_mse = mean_squared_error(y_test, knn_predictions)
knn_r2 = r2_score(y_test, knn_predictions)

rf_mse = mean_squared_error(y_test, rf_predictions)
rf_r2 = r2_score(y_test, rf_predictions)

dt_mse = mean_squared_error(y_test, dt_predictions)
dt_r2 = r2_score(y_test, dt_predictions)

# Print the results
print(f"KNN - MSE: {knn_mse:.4f}, R²: {knn_r2:.4f}")
print(f"Random Forest - MSE: {rf_mse:.4f}, R²: {rf_r2:.4f}")
print(f"Decision Tree - MSE: {dt_mse:.4f}, R²: {dt_r2:.4f}")



KNN - MSE: 32949566.4971, R²: 0.7753
Random Forest - MSE: 18929926.8504, R²: 0.8709
Decision Tree - MSE: 22877590.7905, R²: 0.8440
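
The cell above rebuilds each pipeline by copying the best parameters by hand. An alternative is to reuse the fitted search object: with the default `refit=True`, GridSearchCV retrains the best configuration on the whole training set and exposes it as `best_estimator_`. The sketch below shows this on synthetic data from `make_regression`; the variable names and the small grid are hypothetical and only illustrate the pattern.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the insurance features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': [3, 5, None]},
    cv=3,
    scoring='neg_mean_squared_error',
)
grid.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with the best parameters
best_model = grid.best_estimator_
preds = best_model.predict(X_te)

test_mse = mean_squared_error(y_te, preds)
test_r2 = r2_score(y_te, preds)
print(f'Best params: {grid.best_params_}')
print(f'Test MSE: {test_mse:.4f}, R²: {test_r2:.4f}')
```

This avoids transcription mistakes when copying parameters and keeps the evaluated model tied to the search that selected it.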


## Conclusions:
* Random Forest is the most suitable model for this dataset: after tuning its hyperparameters with GridSearchCV it reached an R² of 0.87 on the test set, ahead of the Decision Tree (0.84) and KNN (0.78).
* Because the target (`charges`) is continuous, this is a regression problem rather than a classification problem, so the models are compared using R² and MSE.