<a href="https://colab.research.google.com/github/andres-merino/AprendizajeAutomaticoInicial-05-N0105/blob/main/2-Ejercicios/10-Optimizacion-Hiperparametros.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table style="border: none; border-collapse: collapse;">
    <tr>
        <td style="width: 20%; vertical-align: middle; padding-right: 10px;">
            <img src="https://i.imgur.com/nt7hloA.png" width="100">
        </td>
        <td style="width: 2px; text-align: center;">
            <font color="#0030A1" size="7">|</font><br>
            <font color="#0030A1" size="7">|</font>
        </td>
        <td>
            <p style="font-variant: small-caps;"><font color="#0030A1" size="5">
                <b>Escuela de Ciencias Físicas y Matemática</b>
            </font> </p>
            <p style="font-variant: small-caps;"><font color="#0030A1" size="4">
                Introductory Machine Learning &bull; Hyperparameter Optimization
            </font></p>
            <p style="font-style: oblique;"><font color="#0030A1" size="3">
                Isaac Porras &bull; 2024-02
            </font></p>
        </td>  
    </tr>
</table>

---
## <font color='264CC7'> Introduction </font>

Throughout this workshop, we will apply hyperparameter optimization to a model of your choice.

The required packages are:

In [21]:
# Required packages

import pandas as pd  # Data handling
import matplotlib.pyplot as plt  # Visualization

from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Evaluation metrics

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold  # Hyperparameter search
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler  # Feature scaling
import joblib  # Model persistence
import pickle  # Model persistence


---
## <font color='264CC7'> Classification </font>


### <font color='264CC7'> Data preprocessing </font>

First you need the dataset. Use the data selected in the previous class.

<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">
<strong>Exercise:</strong><br>
    Load the dataset and process it:
<ul>
  <li>Show some of the data.</li>
  <li>Show a description of the data.</li>
  <li>Scale the data if necessary.</li>
</ul>
</div>

In [22]:
# Install gdown in case it is not available
!pip install gdown --quiet

# Download the file from the public Google Drive link
import gdown

# Use the ID of the provided file
file_id = "1xSk99J1KVFeIYvwfW-xxgNrNUpOhZF2j"
url = f"https://drive.google.com/uc?id={file_id}"

# Name of the file to save
output = "breast-cancer.csv"  # Change the name as needed

# Download the file
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1xSk99J1KVFeIYvwfW-xxgNrNUpOhZF2j
To: /content/breast-cancer.csv
100%|██████████| 125k/125k [00:00<00:00, 68.0MB/s]


'breast-cancer.csv'

In [23]:
# Load the dataset
data = pd.read_csv(output)

# Drop the 'id' column, since it carries no predictive information
data_cleaned = data.drop(columns=['id'])

# Select the columns for the model
X = data_cleaned.drop(columns=['diagnosis'])  # Predictor variables
y = data_cleaned['diagnosis']  # Target variable

# Check the structure of the cleaned dataset
print("\nFiltered dataset:")
display(data_cleaned.head(10))

print(f"\nNumber of rows (filtered): {data_cleaned.shape[0]}")
print(f"Number of columns (filtered): {data_cleaned.shape[1]}")

data_cleaned.describe()


Filtered dataset:


Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
5,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244
6,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,...,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368
7,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,...,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151
8,M,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,...,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072
9,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,...,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075



Number of rows (filtered): 569
Number of columns (filtered): 31


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [24]:
# Split the data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
# Create the scaler
scaler = StandardScaler()

In [26]:
# Fit the scaler ONLY on the training set
X_train = scaler.fit_transform(X_train)

# Use the same scaler to transform the test set
X_test = scaler.transform(X_test)
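
Fitting the scaler on the training set keeps the test set untouched, but during cross-validation each validation fold has still been scaled with statistics computed from the whole training set. One common way to avoid that is to put the scaler inside a `Pipeline`, so it is refit on the training part of every fold. The following is only a sketch under assumptions: it uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for the downloaded CSV, and the names `X_demo`, `pipe`, and `search` are illustrative. Random forests are largely insensitive to feature scale, so this matters more for models such as `SVC` or `KNeighborsClassifier`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset,
# not the CSV downloaded in this notebook.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so it is refit on the training part of each fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=62)),
])

# Hyperparameters of a step are addressed as "<step name>__<parameter>".
param_grid = {"clf__n_estimators": [10, 20], "clf__max_depth": [None, 3]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 3))
```

With this layout the raw, unscaled features are passed to `fit` and `predict`, and the pipeline applies the scaling itself.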


### <font color='264CC7'> Model </font>


<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">
<strong>Exercise:</strong><br>
    Select the best model from the previous classes.
<ul>
  <li>Show the model's hyperparameters.</li>
  <li>Look up what at least 4 of the hyperparameters mean.</li>
  <li>Select the hyperparameters you want to optimize, at least 3.</li>
</ul>
</div>

In [27]:
# Create the base random forest classifier (not trained yet)
modelo_base = RandomForestClassifier(random_state=62)

# Model parameters
modelo_base.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 62,
 'verbose': 0,
 'warm_start': False}

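I will optimize the hyperparameters 'n_estimators' (number of trees in the forest), 'max_depth' (maximum depth of each tree), and 'criterion' (function used to measure the quality of a split).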

### <font color='264CC7'> Optimization with GridSearch </font>

<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">
<strong>Exercise:</strong><br>
    Apply GridSearch to optimize the model's hyperparameters.
<ul>
  <li>For each hyperparameter, select at least 3 values, if possible.</li>
  <li>Use cross-validation with at least 5 folds.</li>
  <li>Show the optimal parameters and their score.</li>
</ul>
</div>

In [28]:
parametros = {'n_estimators': [10, 20, 30],
              'max_depth': [None, 3, 5],
              'criterion': ['gini', 'entropy', 'log_loss']}
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

modelo = GridSearchCV(modelo_base, parametros, cv=k_fold, scoring='accuracy')
modelo
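
Before running the search it can help to estimate its cost. A grid search evaluates every combination in the grid once per fold, so three hyperparameters with three values each and 5 folds give 27 candidates and 135 fits (plus one final refit of the best candidate). A small sketch using `ParameterGrid` to count them; the variable names `grid`, `n_candidates`, and `n_folds` are illustrative:

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the grid defined above
grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 3, 5],
    "criterion": ["gini", "entropy", "log_loss"],
}

n_candidates = len(ParameterGrid(grid))  # 3 * 3 * 3 combinations
n_folds = 5

print("Candidates:", n_candidates)
print("Model fits during cross-validation:", n_candidates * n_folds)
```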

In [29]:
# Run the grid search on the training set
modelo.fit(X_train, y_train)

# Show the results
pd.DataFrame(modelo.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.033652,0.004678,0.003885,0.000938,gini,,10,"{'criterion': 'gini', 'max_depth': None, 'n_es...",0.945055,0.956044,0.956044,0.912088,0.945055,0.942857,0.01615,17
1,0.071672,0.032632,0.006316,0.004613,gini,,20,"{'criterion': 'gini', 'max_depth': None, 'n_es...",0.956044,0.967033,0.945055,0.923077,0.934066,0.945055,0.015541,15
2,0.117965,0.016573,0.006771,0.001103,gini,,30,"{'criterion': 'gini', 'max_depth': None, 'n_es...",0.956044,0.978022,0.967033,0.912088,0.934066,0.949451,0.023671,8
3,0.040414,0.014397,0.003791,0.000127,gini,3.0,10,"{'criterion': 'gini', 'max_depth': 3, 'n_estim...",0.934066,0.967033,0.923077,0.923077,0.923077,0.934066,0.017024,25
4,0.110202,0.042086,0.006688,0.002933,gini,3.0,20,"{'criterion': 'gini', 'max_depth': 3, 'n_estim...",0.934066,0.967033,0.956044,0.945055,0.923077,0.945055,0.015541,15
5,0.1527,0.037431,0.009536,0.005432,gini,3.0,30,"{'criterion': 'gini', 'max_depth': 3, 'n_estim...",0.945055,0.978022,0.956044,0.934066,0.923077,0.947253,0.018906,12
6,0.051498,0.012972,0.008639,0.009403,gini,5.0,10,"{'criterion': 'gini', 'max_depth': 5, 'n_estim...",0.956044,0.945055,0.967033,0.934066,0.945055,0.949451,0.011207,8
7,0.127791,0.047383,0.011933,0.004406,gini,5.0,20,"{'criterion': 'gini', 'max_depth': 5, 'n_estim...",0.956044,0.956044,0.967033,0.901099,0.934066,0.942857,0.023466,17
8,0.29982,0.037197,0.014608,0.012657,gini,5.0,30,"{'criterion': 'gini', 'max_depth': 5, 'n_estim...",0.956044,0.956044,1.0,0.923077,0.934066,0.953846,0.026374,3
9,0.061891,0.018535,0.003847,0.000274,entropy,,10,"{'criterion': 'entropy', 'max_depth': None, 'n...",0.967033,0.967033,0.956044,0.934066,0.956044,0.956044,0.012038,1


In [30]:
# Best parameters
print("Best parameters", modelo.best_params_)
print("Best score", modelo.best_score_)

Best parameters {'criterion': 'entropy', 'max_depth': None, 'n_estimators': 10}
Best score 0.956043956043956
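
The best mean score above (about 0.956) comes with a standard deviation of roughly 0.012 across folds, and several other rows of the results table fall within that margin, so the "best" combination is not clearly better than its neighbours. The sketch below shows one way to rank the candidates and count how many are within one standard deviation of the top one. It is an illustration under assumptions: it uses scikit-learn's bundled `load_breast_cancer` data and a smaller grid so that it runs on its own, and the names `ranked` and `close` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Stand-in data (assumption), not the CSV used in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=62),
    {"n_estimators": [10, 20], "criterion": ["gini", "entropy"]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Keep only the columns needed to compare candidates, best rank first
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranked = results[cols].sort_values("rank_test_score").reset_index(drop=True)
print(ranked)

# Candidates whose mean score is within one standard deviation of the best
best = ranked.iloc[0]
threshold = best["mean_test_score"] - best["std_test_score"]
close = ranked[ranked["mean_test_score"] >= threshold]
print(len(close), "of", len(ranked), "candidates are within one std of the best")
```

When many candidates are this close, a simpler or cheaper configuration (fewer trees, shallower depth) may be a reasonable choice.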


### <font color='264CC7'> Optimization with RandomSearch </font>

<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">
<strong>Exercise:</strong><br>
    Apply RandomSearch to optimize the model's hyperparameters.
<ul>
  <li>For each hyperparameter, select at least 5 values, if possible.</li>
  <li>Use cross-validation with at least 5 folds.</li>
  <li>Use RandomizedSearchCV with 25 iterations.</li>
  <li>Show the optimal parameters and their score.</li>
</ul>
</div>

In [31]:
parametros = {'n_estimators': [10, 20, 30, 40, 50],
              'max_depth': [None, 3, 5, 10, 15, 20, 25, 30, 35, 40],
              'criterion': ['gini', 'entropy', 'log_loss']}
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

modelo = RandomizedSearchCV(modelo_base, parametros, cv=k_fold, scoring='accuracy', n_iter=25, random_state=42)
modelo

In [32]:
# Run the randomized search on the training set
modelo.fit(X_train, y_train)

# Show the results
pd.DataFrame(modelo.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_max_depth,param_criterion,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.252431,0.074721,0.017621,0.014767,40,15.0,entropy,"{'n_estimators': 40, 'max_depth': 15, 'criteri...",0.967033,0.967033,0.967033,0.934066,0.934066,0.953846,0.01615,5
1,0.161386,0.050967,0.010363,0.004259,40,10.0,gini,"{'n_estimators': 40, 'max_depth': 10, 'criteri...",0.956044,0.978022,0.967033,0.89011,0.934066,0.945055,0.031082,19
2,0.284856,0.083662,0.016159,0.008794,40,10.0,log_loss,"{'n_estimators': 40, 'max_depth': 10, 'criteri...",0.967033,0.967033,0.967033,0.934066,0.934066,0.953846,0.01615,5
3,0.354736,0.052519,0.017441,0.005053,40,20.0,entropy,"{'n_estimators': 40, 'max_depth': 20, 'criteri...",0.967033,0.967033,0.967033,0.934066,0.934066,0.953846,0.01615,5
4,0.179953,0.055889,0.006144,0.001982,20,20.0,entropy,"{'n_estimators': 20, 'max_depth': 20, 'criteri...",0.967033,0.967033,0.956044,0.923077,0.945055,0.951648,0.016447,13
5,0.124443,0.033486,0.013212,0.008369,20,25.0,gini,"{'n_estimators': 20, 'max_depth': 25, 'criteri...",0.956044,0.967033,0.945055,0.923077,0.934066,0.945055,0.015541,19
6,0.699023,0.318209,0.049111,0.029843,50,5.0,entropy,"{'n_estimators': 50, 'max_depth': 5, 'criterio...",0.967033,0.967033,0.989011,0.945055,0.923077,0.958242,0.022413,1
7,0.34699,0.136566,0.019747,0.013401,20,35.0,log_loss,"{'n_estimators': 20, 'max_depth': 35, 'criteri...",0.967033,0.967033,0.956044,0.923077,0.945055,0.951648,0.016447,13
8,0.528598,0.234812,0.01855,0.007032,40,10.0,entropy,"{'n_estimators': 40, 'max_depth': 10, 'criteri...",0.967033,0.967033,0.967033,0.934066,0.934066,0.953846,0.01615,5
9,0.26697,0.051068,0.012688,0.005324,30,25.0,entropy,"{'n_estimators': 30, 'max_depth': 25, 'criteri...",0.967033,0.967033,0.967033,0.923077,0.945055,0.953846,0.017582,5


In [33]:
# Best parameters
print("Best parameters", modelo.best_params_)
print("Best score", modelo.best_score_)

Best parameters {'n_estimators': 50, 'max_depth': 5, 'criterion': 'entropy'}
Best score 0.9582417582417584
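
The search above samples 25 of the 150 combinations (5 × 10 × 3) defined by the lists. `RandomizedSearchCV` also accepts distributions instead of lists, which lets it draw any integer in a range rather than a few preselected values. The following sketch assumes `scipy.stats.randint` for the integer hyperparameters and scikit-learn's bundled `load_breast_cancer` data as a stand-in. The ranges and the small `n_iter` are illustrative choices, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Stand-in data (assumption), not the CSV used in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True)

param_distributions = {
    "n_estimators": randint(10, 51),   # integers from 10 to 50 (upper bound is exclusive)
    "max_depth": randint(2, 21),       # integers from 2 to 20
    "criterion": ["gini", "entropy"],  # lists are sampled uniformly
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=62),
    param_distributions,
    n_iter=5,  # kept small so the sketch runs quickly
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```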


### <font color='264CC7'> Saving the model </font>

<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">
<strong>Exercise:</strong><br>
  Using the optimal parameters that gave the best result, retrain the model, show its score, and save it.
</div>

In [34]:
# Make predictions and evaluate the model on the test set
# (predict uses the best estimator, which the search refits on the full training set)
y_pred = modelo.predict(X_test)

# Model accuracy, rounded to two decimals
accuracy = round(accuracy_score(y_test, y_pred), 2)
print("Model accuracy:", accuracy)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Classification report
print("Classification report:")
print(classification_report(y_test, y_pred))

Model accuracy: 0.96
Confusion matrix:
[[70  1]
 [ 3 40]]
Classification report:
              precision    recall  f1-score   support

           B       0.96      0.99      0.97        71
           M       0.98      0.93      0.95        43

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



In [35]:
# Save the model (the fitted search object, which includes the refitted best estimator)
with open('modelo_optimizado.pkl', 'wb') as file:
    pickle.dump(modelo, file)

with open('modelo_optimizado.joblib', 'wb') as file:
    joblib.dump(modelo, file)
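
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows a round trip with `joblib.dump` and `joblib.load`. It is self-contained under assumptions: it trains a small forest on scikit-learn's bundled `load_breast_cancer` data and writes to a temporary directory, and the file name `modelo_demo.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in data and model (assumption), not the search object saved above
X_demo, y_demo = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=62).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modelo_demo.joblib")  # hypothetical file name
    joblib.dump(clf, path)         # write the fitted model to disk
    restored = joblib.load(path)   # read it back into a new object

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reloading:", same_predictions)
```

The model saved in this notebook was trained on scaled features, so whoever loads it also needs the fitted `scaler` (saved the same way) to transform new data first. Pickle and joblib files should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.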

### <font color='264CC7'> Publishing </font>

<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">
<strong>Exercise:</strong><br>
  Upload this notebook and the model to your GitHub repository. Add an MIT license and a README.md that explains the contents of the repository, the data used, and the results obtained.
</div>