# Lab 2 - Face classifier

**Authors**

*   [214205] Enrique, Oliva
*   [192680] Martina, Severo
*   [229484] Santiago, Tonarelli

**Submission format**:

* This same notebook in .ipynb format.
* Rename the notebook to NumEst1_NumEst2_NumEst3_Lab_1.
* The notebook must run without errors when 'Run all' is selected.
* Keep in mind that your data may be stored in a different location.


**Deadline**: Sunday 16/06 at 23:59, via Aulas.

**Goal**: implement a classification algorithm that predicts whether a given image is a face or not.

## Libraries

In [1]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.





In [2]:
import os
from tqdm import tqdm
from time import time

import random
import numpy as np
import matplotlib.pyplot as plt

from skimage.exposure import equalize_hist

from skimage.transform import integral_image
from skimage.feature import haar_like_feature, haar_like_feature_coord

from sklearn.feature_selection import SelectPercentile, f_classif

## Helper functions

In [3]:
def extract_feature_image(img, feature_type=None, feature_coord=None):
    """Extract the Haar-like features of an image."""
    ii = integral_image(img)
    # haar_like_feature expects (width, height), i.e. (columns, rows)
    return haar_like_feature(ii, 0, 0, ii.shape[1], ii.shape[0],
                             feature_type=feature_type,
                             feature_coord=feature_coord)
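
To make the helper less of a black box, here is a minimal sketch of what the integral image buys: any rectangle sum costs at most four lookups, and a two-rectangle Haar-like feature is the difference of two such sums. The names `integral`, `rect_sum` and `two_rect_feature` are hypothetical helpers, not part of scikit-image, and the sign convention (right half minus left half) is an assumption made for illustration.

```python
import numpy as np

def integral(img):
    """Cumulative sum over rows and then columns (the idea behind an integral image)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] using at most four lookups in the integral image."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(ii, r, c, height, width):
    """Right half minus left half of a (height x 2*width) window (illustrative convention)."""
    left = rect_sum(ii, r, c, r + height - 1, c + width - 1)
    right = rect_sum(ii, r, c + width, r + height - 1, c + 2 * width - 1)
    return right - left

img = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
ii = integral(img)
feature = two_rect_feature(ii, 0, 0, 4, 2)       # columns 2-3 minus columns 0-1
```

Because every feature reduces to a handful of lookups, evaluating tens of thousands of them per image stays affordable.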

## Data

**CBCL FACE DATABASE #1**:

*   19 x 19 Grayscale PGM format images
*   Training set:  2429 faces, 4548 non-faces
*   Test set: 472 faces, 23573 non-faces



In [4]:
# !unzip /content/CBCL.zip
# colab
# !tar -xvzf /content/face.test.tar.gz
# !tar -xvzf /content/face.train.tar.gz

# vscode
!tar -xvzf content/face.test.tar.gz
!tar -xvzf content/face.train.tar.gz

x test/
x test/face/
x test/face/cmu_0000.pgm
x test/face/cmu_0001.pgm
x test/face/cmu_0002.pgm
...

In [5]:
suffix = '.pgm'

# train_faces = os.listdir('/content/train/face') # colab
train_faces = os.listdir('train/face') # vscode
train_faces = [filename for filename in train_faces if filename.endswith(suffix)]

# train_background = os.listdir('/content/train/non-face') # colab
train_background = os.listdir('train/non-face') # vscode
train_background = [filename for filename in train_background if filename.endswith(suffix)]

# test_faces = os.listdir('/content/test/face') # colab
test_faces = os.listdir('test/face') # vscode
test_faces = [filename for filename in test_faces if filename.endswith(suffix)]

# test_background = os.listdir('/content/test/non-face') # colab
test_background = os.listdir('test/non-face') # vscode
test_background = [filename for filename in test_background if filename.endswith(suffix)]

In [6]:
print(f'# Train Faces: {len(train_faces)}')
print(f'# Train Back: {len(train_background)}')
print(f'# Test Faces: {len(test_faces)}')
print(f'# Test Back: {len(test_background)}')

# Train Faces: 2429
# Train Back: 4548
# Test Faces: 472
# Test Back: 23573


In [7]:
# Use only a fraction of the training data; adjust this parameter as needed
f = 0.2
n_face = int(f*len(train_faces))
n_back = int(f*len(train_background))

# Number of background test images that gives the test set the same non-face/face ratio as the full training set:
m = int(np.round(len(test_faces)*len(train_background)/len(train_faces)))

print(f'# Train Faces Sample Size: {n_face}')
print(f'# Train Back Sample Size: {n_back}')
print(f'# m: {m}')

# Train Faces Sample Size: 485
# Train Back Sample Size: 909
# m: 884
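
Since `random.sample` draws from the global generator, each run of the notebook picks a different subset. One possible way to make the subsets repeatable is sketched below, assuming a hypothetical helper `sample_fraction` and stand-in file names with the same counts as the CBCL training set; the seed value is arbitrary.

```python
import random

def sample_fraction(names, fraction, seed=0):
    """Return a reproducible random subset holding int(fraction * len(names)) items."""
    rng = random.Random(seed)   # local generator: leaves the global random state untouched
    return rng.sample(names, int(fraction * len(names)))

# Stand-in file names with the same counts as the CBCL training set.
faces = [f"face_{i:05d}.pgm" for i in range(2429)]
background = [f"back_{i:05d}.pgm" for i in range(4548)]

subset_a = sample_fraction(faces, 0.2, seed=42)
subset_b = sample_fraction(faces, 0.2, seed=42)   # same seed, same subset

# Background test size that preserves the training non-face/face ratio.
m = round(472 * len(background) / len(faces))
```

With the counts above this reproduces the sample sizes printed by the cell: 485 faces, 909 background images and `m = 884`.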


In [8]:
sample_train_faces = random.sample(train_faces,n_face)

Im_train = []
for filename in tqdm(sample_train_faces):
    # path = '/content/train/face/' + filename # colab
    path = 'train/face/' + filename # vscode
    with open(path, 'rb') as pgmf:
        image = plt.imread(pgmf)
    Im_train.append(image)

n_train_faces = len(Im_train)
y_train = [1]*n_train_faces # Each face image is labeled 1

100%|██████████| 485/485 [00:01<00:00, 387.72it/s]


In [9]:
sample_train_background = random.sample(train_background,n_back)

for filename in tqdm(sample_train_background):
    # path = "/content/train/non-face/" + filename # colab
    path = "train/non-face/" + filename # vscode
    with open(path, 'rb') as pgmf:
        image = plt.imread(pgmf)
    Im_train.append(image)

n_train_background = len(Im_train)-n_train_faces
y_train = y_train + [0]*n_train_background # Each non-face image is labeled 0

100%|██████████| 909/909 [00:02<00:00, 418.12it/s]


In [10]:
print(f'# Train: {len(Im_train)}, {len(y_train)}')

# Train: 1394, 1394


In [11]:
Im_test = []
for filename in tqdm(test_faces):
    # path = "/content/test/face/" + filename # colab
    path = "test/face/" + filename # vscode
    with open(path, 'rb') as pgmf:
        image = plt.imread(pgmf)
    Im_test.append(image)

n_test_faces = len(Im_test)
y_test = [1]*n_test_faces

100%|██████████| 472/472 [00:01<00:00, 415.73it/s]


In [12]:
sample_test_background = random.sample(test_background,m)

for filename in tqdm(sample_test_background):
    # path = "/content/test/non-face/" + filename # colab
    path = "test/non-face/" + filename # vscode
    with open(path, 'rb') as pgmf:
        image = plt.imread(pgmf)
    Im_test.append(image)

n_test_background = len(Im_test)-n_test_faces
y_test = y_test + [0]*n_test_background

100%|██████████| 884/884 [00:02<00:00, 426.67it/s]


In [13]:
print(f'# Test: {len(Im_test)}, {len(y_test)}')

# Test: 1356, 1356


## Histogram equalization

In [14]:
# Histogram equalization of the training and test images
Im_train_norm = [equalize_hist(image) for image in Im_train]
Im_test_norm = [equalize_hist(image) for image in Im_test]
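
As a rough illustration of what histogram equalization does, the sketch below maps every pixel through the empirical cumulative distribution of the image intensities, which stretches a low-contrast patch over the whole [0, 1] range. `equalize_sketch` is a hypothetical helper written for this note and the input patch is synthetic; it is not claimed to reproduce `equalize_hist` exactly.

```python
import numpy as np

def equalize_sketch(img, nbins=256):
    """Map each pixel through the empirical CDF of the image intensities."""
    hist, bin_edges = np.histogram(img.ravel(), bins=nbins)
    cdf = hist.cumsum() / hist.sum()                 # rises monotonically to 1.0
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    out = np.interp(img.ravel(), centers, cdf)       # look up the CDF value of each pixel
    return out.reshape(img.shape)

rng = np.random.default_rng(0)
# Low-contrast stand-in for a 19x19 grayscale patch: values squeezed into [100, 120).
dark = rng.integers(100, 120, size=(19, 19)).astype(float)
flat = equalize_sketch(dark)
```

The output keeps the ordering of the pixel values but spreads them out, which reduces the effect of lighting differences between images before the Haar-like features are computed.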

## Feature matrix

### Compute and select the best features on the training set

In [15]:
X_train = [extract_feature_image(img) for img in tqdm(Im_train_norm)]
X_train = np.array(X_train)

100%|██████████| 1394/1394 [01:24<00:00, 16.43it/s]


In [16]:
# You can save the matrix if you wish
np.save('X_train', X_train)

In [17]:
# And load it again later
X_train = np.load('X_train.npy')

In [18]:
X_train.shape

(1394, 63960)

In [19]:
# Feature selection on the training set
print("Selecting the features with the strongest linear dependence on y")
t_start = time()
# SelectPercentile: keeps the top-scoring features according to a statistical test.
# f_classif: computes the ANOVA F-value between each feature and the class label.
# percentile=1: keeps the top 1% of the features.
# fit(X_train, y_train): fits the feature selector on the training data X_train and y_train.
# get_support(indices=True): returns the indices of the selected features.
f_indices = SelectPercentile(f_classif, percentile=1).fit(X_train, y_train).get_support(indices=True)
t = time() - t_start
X_train = X_train[:,f_indices]
print("Selected %d candidate features" % X_train.shape[1])
print(f'Time: {t} seconds')

Selecting the features with the strongest linear dependence on y
Selected 640 candidate features
Time: 0.7616219520568848 seconds
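
The score behind this selection is the one-way ANOVA F statistic, computed separately for every feature: the variance between the class means divided by the variance inside each class. A minimal sketch for a single feature follows; `anova_f` is a hypothetical helper and the numbers are toy values, not taken from the dataset.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a single feature split into class groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# A feature whose values separate the two classes well...
f_separated = anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
# ...and one whose values overlap heavily between classes.
f_overlapping = anova_f([[1.0, 5.0, 3.0], [2.0, 6.0, 4.0]])
```

Features with a large F value have class means that are far apart relative to their spread, so keeping the top percentile favours features that discriminate faces from non-faces on their own.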


### Compute those features for the test set

In [20]:
# haar_like_feature_coord(): generates the coordinates and types of the Haar-like features for a given window size, here 19x19 pixels.
feature_coord, feature_type = haar_like_feature_coord(width=19,
                                                      height=19,
                                                      )

In [21]:
t_start = time()
X_test = [extract_feature_image(img,
                                feature_type=feature_type[f_indices],
                                feature_coord=feature_coord[f_indices]) for img in tqdm(Im_test_norm)]
t = time() - t_start
X_test = np.array(X_test)

100%|██████████| 1356/1356 [00:00<00:00, 3304.28it/s]


In [22]:
print(f'Time: {t} seconds')
print(f'Shape X_test: {X_test.shape}')

Time: 0.41188621520996094 seconds
Shape X_test: (1356, 640)
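
The test extraction is much faster than the training one (about 3304 it/s against 16 it/s in the progress bars) because only the 640 selected features are evaluated instead of all 63960. The sketch below checks, on a tiny random patch, that passing a subset of `feature_type` and `feature_coord` yields the same values as computing everything and indexing afterwards. The 6x6 window and the indices in `keep` are arbitrary choices for illustration.

```python
import numpy as np
from skimage.transform import integral_image
from skimage.feature import haar_like_feature, haar_like_feature_coord

rng = np.random.default_rng(0)
img = rng.random((6, 6))          # tiny stand-in patch; the notebook uses 19x19
ii = integral_image(img)

# All features of the window, plus their coordinates and types in the same order.
full = haar_like_feature(ii, 0, 0, 6, 6)
coord, ftype = haar_like_feature_coord(width=6, height=6)

# Pretend a selector kept these positions (arbitrary indices for illustration).
keep = np.array([0, 5, 17, 42])
subset = haar_like_feature(ii, 0, 0, 6, 6,
                           feature_type=ftype[keep],
                           feature_coord=coord[keep])
```

This relies on the indices returned by the selector referring to the same feature ordering as `haar_like_feature_coord`, which is the assumption the notebook makes when it indexes with `f_indices`.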


# Classifiers - Holdout evaluation

### Random Forest

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, classification_report, confusion_matrix

# Create the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=600, random_state=42)

# Train the classifier on the selected training features
rf_classifier.fit(X_train, y_train)

# Predict the labels of the test images
y_pred = rf_classifier.predict(X_test)

# Model evaluation
accuracy_rf = accuracy_score(y_test, y_pred)
f1_score_rf = f1_score(y_test, y_pred)
precision_rf = precision_score(y_test, y_pred)

print("Accuracy:", accuracy_rf)
print("F1 Score:", f1_score_rf)
print("Precision:", precision_rf)

print("\nClassification report:")
print(classification_report(y_test, y_pred))

print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7168141592920354
F1 Score: 0.4146341463414634
Precision: 0.7391304347826086

Classification report:
              precision    recall  f1-score   support

           0       0.71      0.95      0.81       884
           1       0.74      0.29      0.41       472

    accuracy                           0.72      1356
   macro avg       0.73      0.62      0.61      1356
weighted avg       0.72      0.72      0.67      1356


Confusion matrix:
[[836  48]
 [336 136]]
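
All the scalar metrics above can be recovered from the confusion matrix alone, which helps when reading the reports of the remaining classifiers. A small sketch, assuming the scikit-learn layout `[[TN, FP], [FN, TP]]` and a hypothetical helper `binary_metrics`:

```python
def binary_metrics(cm):
    """Accuracy, precision, recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # of the predicted faces, how many are faces
    recall = tp / (tp + fn)             # of the real faces, how many were found
    f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# Confusion matrix reported for the Random Forest.
accuracy, precision, recall, f1 = binary_metrics([[836, 48], [336, 136]])
```

Read this way, the Random Forest result says that 336 of the 472 test faces were classified as non-faces, which is why the recall and F1 of class 1 are low even though the accuracy is above 0.7.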


### Gradient Boosting

In [24]:
from sklearn.ensemble import GradientBoostingClassifier

gb_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

gb_classifier.fit(X_train, y_train)

# Model evaluation
y_pred_gb = gb_classifier.predict(X_test)

accuracy_gb = accuracy_score(y_test, y_pred_gb)
f1_score_gb = f1_score(y_test, y_pred_gb)
precision_gb = precision_score(y_test, y_pred_gb)

print("Accuracy:", accuracy_gb)
print("F1 Score:", f1_score_gb)
print("Precision:", precision_gb)

print("\nClassification report:")
print(classification_report(y_test, y_pred_gb))

print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred_gb))


Accuracy: 0.7286135693215339
F1 Score: 0.4604105571847507
Precision: 0.7476190476190476

Classification report:
              precision    recall  f1-score   support

           0       0.73      0.94      0.82       884
           1       0.75      0.33      0.46       472

    accuracy                           0.73      1356
   macro avg       0.74      0.64      0.64      1356
weighted avg       0.73      0.73      0.69      1356


Confusion matrix:
[[831  53]
 [315 157]]


### Decision trees

In [25]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(random_state=42)

dt_classifier.fit(X_train, y_train)

# Model evaluation
y_pred_dt = dt_classifier.predict(X_test)

accuracy_dt = accuracy_score(y_test, y_pred_dt)
f1_score_dt = f1_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)

print("Accuracy:", accuracy_dt)
print("F1 Score:", f1_score_dt)
print("Precision:", precision_dt)

print("\nClassification report:")
print(classification_report(y_test, y_pred_dt))

print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred_dt))


Accuracy: 0.726401179941003
F1 Score: 0.48686030428769017
Precision: 0.701195219123506

Classification report:
              precision    recall  f1-score   support

           0       0.73      0.92      0.81       884
           1       0.70      0.37      0.49       472

    accuracy                           0.73      1356
   macro avg       0.72      0.64      0.65      1356
weighted avg       0.72      0.73      0.70      1356


Confusion matrix:
[[809  75]
 [296 176]]


### Logistic regression

In [26]:
from sklearn.linear_model import LogisticRegression

lr_classifier = LogisticRegression(max_iter=1000, random_state=42)

lr_classifier.fit(X_train, y_train)

# Model evaluation
y_pred_lr = lr_classifier.predict(X_test)

accuracy_lr = accuracy_score(y_test, y_pred_lr)
f1_score_lr = f1_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr)

print("Accuracy:", accuracy_lr)
print("F1 Score:", f1_score_lr)
print("Precision:", precision_lr)

print("\nClassification report:")
print(classification_report(y_test, y_pred_lr))

print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred_lr))


Accuracy: 0.7382005899705014
F1 Score: 0.4613050075872534
Precision: 0.8128342245989305

Classification report:
              precision    recall  f1-score   support

           0       0.73      0.96      0.83       884
           1       0.81      0.32      0.46       472

    accuracy                           0.74      1356
   macro avg       0.77      0.64      0.64      1356
weighted avg       0.76      0.74      0.70      1356


Confusion matrix:
[[849  35]
 [320 152]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
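
The warning above means the solver stopped at `max_iter` without converging, and it suggests scaling the data. One way to follow that suggestion is to standardise the features inside a pipeline, as sketched below on synthetic data with columns of very different magnitudes. `X_demo` and `y_demo` are made up for the example; whether this removes the warning on the Haar features has not been verified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: columns with very different scales.
X_demo = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 1e3, 1e4])
y_demo = (X_demo[:, 0] > 0).astype(int)

# Standardising inside the pipeline means the scaler is fitted on the training data only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

train_accuracy = scaled_lr.score(X_demo, y_demo)
iterations = scaled_lr[-1].n_iter_[0]   # iterations actually used by the solver
```

Comparing `iterations` with `max_iter` is a quick check of whether the optimiser converged before hitting the limit.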


### Neural networks

In [27]:
from sklearn.neural_network import MLPClassifier

mlp_classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)

mlp_classifier.fit(X_train, y_train)

# Model evaluation
y_pred_mlp = mlp_classifier.predict(X_test)

accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
f1_score_mlp = f1_score(y_test, y_pred_mlp)
precision_mlp = precision_score(y_test, y_pred_mlp)

print("Accuracy:", accuracy_mlp)
print("F1 Score:", f1_score_mlp)
print("Precision:", precision_mlp)

print("\nClassification report:")
print(classification_report(y_test, y_pred_mlp))

print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred_mlp))


Accuracy: 0.782448377581121
F1 Score: 0.6050870147255689
Precision: 0.8218181818181818

Classification report:
              precision    recall  f1-score   support

           0       0.77      0.94      0.85       884
           1       0.82      0.48      0.61       472

    accuracy                           0.78      1356
   macro avg       0.80      0.71      0.73      1356
weighted avg       0.79      0.78      0.76      1356


Confusion matrix:
[[835  49]
 [246 226]]


# Classifiers - Repeated Holdout evaluation

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Number of repetitions
n_repeats = 5
accuracies_rf = []
accuracies_gb = []
accuracies_dt = []
accuracies_lr = []
accuracies_mlp = []


for _ in range(n_repeats):
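    # Split the data into training and validation sets
    # NOTE: random_state is fixed, so every repetition uses the same partition
    # and the averaged scores below come from a single split.
    X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42)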

    # Train and evaluate Random Forest
    rf_classifier.fit(X_train_split, y_train_split)
    y_pred_rf = rf_classifier.predict(X_val_split)
    accuracies_rf.append(accuracy_score(y_val_split, y_pred_rf))

    # Train and evaluate Gradient Boosting
    gb_classifier.fit(X_train_split, y_train_split)
    y_pred_gb = gb_classifier.predict(X_val_split)
    accuracies_gb.append(accuracy_score(y_val_split, y_pred_gb))

    # Train and evaluate the decision tree
    dt_classifier.fit(X_train_split, y_train_split)
    y_pred_dt = dt_classifier.predict(X_val_split)
    accuracies_dt.append(accuracy_score(y_val_split, y_pred_dt))

    # Train and evaluate logistic regression
    lr_classifier.fit(X_train_split, y_train_split)
    y_pred_lr = lr_classifier.predict(X_val_split)
    accuracies_lr.append(accuracy_score(y_val_split, y_pred_lr))

    # Train and evaluate the neural network
    mlp_classifier.fit(X_train_split, y_train_split)
    y_pred_mlp = mlp_classifier.predict(X_val_split)
    accuracies_mlp.append(accuracy_score(y_val_split, y_pred_mlp))

print("Repeated Holdout Accuracy Scores (RF):", np.mean(accuracies_rf))
print("Repeated Holdout Accuracy Scores (GB):", np.mean(accuracies_gb))
print("Repeated Holdout Accuracy Scores (DT):", np.mean(accuracies_dt))
print("Repeated Holdout Accuracy Scores (LR):", np.mean(accuracies_lr))
print("Repeated Holdout Accuracy Scores (MLP):", np.mean(accuracies_mlp))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Repeated Holdout Accuracy Scores (RF): 0.9605734767025089
Repeated Holdout Accuracy Scores (GB): 0.9605734767025089
Repeated Holdout Accuracy Scores (DT): 0.9605734767025089
Repeated Holdout Accuracy Scores (LR): 0.9713261648745519
Repeated Holdout Accuracy Scores (MLP): 0.9498207885304659


# Model selection (hyperparameter search)

### Random Forest

In [29]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

random_search_rf = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=100,
                                   cv=5,  # number of cross-validation folds
                                   scoring='f1',
                                   verbose=1,
                                   n_jobs=-1)

random_search_rf.fit(X_train, y_train)
cv_results_rf = random_search_rf.cv_results_
best_rf_model = random_search_rf.best_estimator_



Fitting 5 folds for each of 100 candidates, totalling 500 fits


140 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
76 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Santiago\.pyenv\pyenv-win\versions\3.12.1\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Santiago\.pyenv\pyenv-win\versions\3.12.1\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "c:\Users\Santiago\.pyenv\pyenv-win\versions\3.12.1\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\Santiago\.pyenv\pyenv-win\versions\3.12.1\Lib\site-packages\

In [30]:
print("Best model:", best_rf_model)
print("cv_results:", cv_results_rf)
print("Best hyperparameters:", random_search_rf.best_params_)

Best model: RandomForestClassifier(bootstrap=False, max_depth=20, min_samples_split=5,
                       n_estimators=310, random_state=42)
cv_results: {'mean_fit_time': array([1.00487876e+00, 3.01889925e+00, 5.79584494e+00, 5.29840217e+00,
       1.12447767e+00, 2.57172441e+00, 3.50708961e-03, 3.40795517e-03,
       6.88584280e-01, 3.04485226e+00, 4.06122327e+00, 3.61061096e-03,
       4.37812724e+00, 6.16612144e+00, 3.20682526e-03, 3.73173738e+00,
       3.51149654e+00, 2.98760905e+00, 2.08809781e+00, 1.11353035e+00,
       3.89292440e+00, 6.89807324e+00, 3.40900421e-03, 3.30882072e-03,
       3.58047485e-03, 7.88552809e-01, 1.03409400e+00, 2.90880532e+00,
       1.90961633e+00, 1.36529026e+00, 3.73231654e+00, 3.12110844e+00,
       3.51683955e+00, 3.96851964e+00, 3.51099968e-03, 3.30700874e-03,
       1.27660193e+00, 3.10768580e+00, 5.58064704e+00, 6.71215057e-03,
       4.17479239e+00, 4.52673755e+00, 6.94846535e-01, 1.00561709e+00,
       1.79302688e+00, 7.06861649e+00, 6.5
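
Printing the whole `cv_results_` dictionary produces a dump that is hard to read. A compact alternative is to list only the best candidates, as in the sketch below. `top_candidates` is a hypothetical helper and `fake_results` contains made-up numbers that only mimic the keys `mean_test_score`, `std_test_score` and `params` of a real `cv_results_`.

```python
import math

def top_candidates(cv_results, k=3):
    """Return the k best (mean, std, params) rows, skipping candidates whose fits failed."""
    rows = zip(cv_results["mean_test_score"],
               cv_results["std_test_score"],
               cv_results["params"])
    valid = [row for row in rows if not math.isnan(row[0])]   # failed fits score nan
    return sorted(valid, key=lambda row: row[0], reverse=True)[:k]

# Made-up results, only to show the shape of the output.
fake_results = {
    "mean_test_score": [0.91, float("nan"), 0.94, 0.89],
    "std_test_score": [0.02, float("nan"), 0.01, 0.03],
    "params": [{"max_depth": 10}, {"max_depth": None}, {"max_depth": 20}, {"max_depth": 30}],
}
best = top_candidates(fake_results, k=2)
```

Filtering out `nan` scores matters in this search because part of the fits failed and their scores were set to `nan`.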

### Gradient Boosting

In [31]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.ensemble import GradientBoostingClassifier

param_dist_gb = {
    'n_estimators': randint(100, 500),
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0]
}

random_search_gb = RandomizedSearchCV(estimator=GradientBoostingClassifier(random_state=42),
                                      param_distributions=param_dist_gb,
                                      n_iter=20,
                                      cv=3,
                                      scoring='f1',
                                      n_jobs=-1,
                                      verbose=1)
random_search_gb.fit(X_train, y_train)
cv_results_gb = random_search_gb.cv_results_
best_gb_model = random_search_gb.best_estimator_



Fitting 3 folds for each of 20 candidates, totalling 60 fits


In [32]:
print("Best model:", best_gb_model)
print("cv_results:", cv_results_gb)
print("Best hyperparameters:", random_search_gb.best_params_)

Best model: GradientBoostingClassifier(learning_rate=0.2, n_estimators=341, random_state=42,
                           subsample=0.8)
cv_results: {'mean_fit_time': array([ 23.36819641,  50.61555616,  52.07078624,  43.25534201,
       145.96107729,  45.205966  ,  51.19608529,  28.81203389,
        15.79275608,  36.12298743,  38.5975097 ,  42.33074315,
        42.9019378 ,  52.84341709, 128.99863625,  70.93444173,
        15.64875189,  16.69801505,  48.03074145, 110.14257431]), 'std_fit_time': array([0.95541841, 0.44889155, 1.60469194, 1.60371599, 4.68963057,
       1.8156573 , 1.01996765, 1.31935447, 0.50635849, 1.93035821,
       0.27961853, 1.35688431, 1.29731819, 1.86871021, 3.44719981,
       1.30184992, 0.55844265, 0.22284851, 0.15048978, 2.42749272]), 'mean_score_time': array([0.00200391, 0.00317089, 0.00667612, 0.00383798, 0.00434462,
       0.00333953, 0.00400583, 0.00216945, 0.00200113, 0.00317383,
       0.00299962, 0.0023369 , 0.00317327, 0.0041728 , 0.00433969,
       0.0

### Decision trees

In [33]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid_dt = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                              param_grid=param_grid_dt,
                              cv=5,
                              scoring='f1',
                              n_jobs=-1,
                              verbose=1)

grid_search_dt.fit(X_train, y_train)
cv_result_dt = grid_search_dt.cv_results_
best_dt_model = grid_search_dt.best_estimator_



Fitting 5 folds for each of 36 candidates, totalling 180 fits
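
The numbers in the log line follow directly from the grid: an exhaustive search fits every combination of the listed values once per fold. A short sketch of that arithmetic, with `grid_size` as a hypothetical helper:

```python
from itertools import product

def grid_size(param_grid):
    """Number of hyperparameter combinations in an exhaustive grid."""
    return len(list(product(*param_grid.values())))

param_grid_dt = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
n_candidates = grid_size(param_grid_dt)   # 4 * 3 * 3 combinations
n_fits = n_candidates * 5                 # one fit per candidate and fold (cv=5)
```

The same count explains why the grid cost grows quickly: every extra value in one list multiplies the number of fits.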


In [34]:
print("Best model:", best_dt_model)
print("cv_results:", cv_result_dt)
print("Best hyperparameters:", grid_search_dt.best_params_)

Best model: DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
cv_results: {'mean_fit_time': array([0.58889036, 0.60422626, 0.58637843, 0.52386632, 0.58603482,
       0.57658124, 0.50543876, 0.51109452, 0.52436705, 0.573839  ,
       0.54246373, 0.52290387, 0.5315793 , 0.52878222, 0.51015081,
       0.46557355, 0.46223464, 0.47403021, 0.56375146, 0.55383554,
       0.53568711, 0.50393062, 0.51974978, 0.51004238, 0.48164926,
       0.47005091, 0.46982541, 0.56865349, 0.52399378, 0.53746715,
       0.54879284, 0.51732512, 0.52587543, 0.47178826, 0.45728192,
       0.43347759]), 'std_fit_time': array([0.08911513, 0.07958572, 0.09730099, 0.07677173, 0.12184302,
       0.08432953, 0.07616103, 0.05168765, 0.03918019, 0.05672123,
       0.06156875, 0.06164823, 0.10915625, 0.09756513, 0.09085803,
       0.03512029, 0.04077135, 0.05172413, 0.08957599, 0.09286034,
       0.09855536, 0.0626382 , 0.07248742, 0.08892229, 0.06443133,
       0.05882365, 0.03983977, 0.10551931, 0.07006561, 

### Logistic regression

In [35]:
from sklearn.linear_model import LogisticRegression

param_grid_lr = {
    'C': [0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid_search_lr = GridSearchCV(estimator=LogisticRegression(random_state=42, max_iter=1000),
                              param_grid=param_grid_lr,
                              cv=5,
                              scoring='f1',
                              n_jobs=-1,
                              verbose=1)

grid_search_lr.fit(X_train, y_train)
cv_results_lr = grid_search_lr.cv_results_
best_lr_model = grid_search_lr.best_estimator_



Fitting 5 folds for each of 8 candidates, totalling 40 fits


In [36]:
print("Best model:", best_lr_model)
print("cv_results:", cv_results_lr)
print("Best hyperparameters:", grid_search_lr.best_params_)

Best model: LogisticRegression(C=0.1, max_iter=1000, penalty='l1', random_state=42,
                   solver='liblinear')
cv_results: {'mean_fit_time': array([0.34547267, 0.52077346, 1.41295695, 0.58335948, 3.63921428,
       0.78123193, 3.85490003, 0.77725592]), 'std_fit_time': array([0.08119214, 0.06800885, 0.41592309, 0.18027224, 0.71150212,
       0.20934677, 0.87910884, 0.17060406]), 'mean_score_time': array([0.00270553, 0.00421438, 0.00160003, 0.00160546, 0.00119996,
       0.00200648, 0.00110435, 0.00110173]), 'std_score_time': array([0.00177893, 0.00157109, 0.00049027, 0.00085579, 0.00040019,
       0.00031651, 0.00020671, 0.00020421]), 'param_C': masked_array(data=[0.1, 0.1, 1.0, 1.0, 10.0, 10.0, 100.0, 100.0],
             mask=[False, False, False, False, False, False, False, False],
       fill_value=1e+20), 'param_penalty': masked_array(data=['l1', 'l2', 'l1', 'l2', 'l1', 'l2', 'l1', 'l2'],
             mask=[False, False, False, False, False, False, False, False],
    

### Neural networks

In [37]:
from sklearn.neural_network import MLPClassifier

param_grid_mlp = {
    'hidden_layer_sizes': [(50,), (100,), (50,50), (100,50)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive']
}

grid_search_mlp = GridSearchCV(estimator=MLPClassifier(random_state=42, max_iter=1000),
                               param_grid=param_grid_mlp,
                               cv=5,
                               scoring='f1',
                               n_jobs=-1,
                               verbose=1)

grid_search_mlp.fit(X_train, y_train)
cv_results_mlp = grid_search_mlp.cv_results_
best_mlp_model = grid_search_mlp.best_estimator_




Fitting 5 folds for each of 64 candidates, totalling 320 fits


In [38]:
print("Best model:", best_mlp_model)
print("cv_results:", cv_results_mlp)
print("Best hyperparameters:", grid_search_mlp.best_params_)

Best model: MLPClassifier(alpha=0.05, hidden_layer_sizes=(50, 50), max_iter=1000,
              random_state=42)
cv_results: {'mean_fit_time': array([ 4.64865022,  1.20101533,  6.27039537,  0.85465598, 11.70641217,
        1.88662715, 14.92387919,  1.64228635,  2.92532229,  0.97855191,
        5.00194402,  1.05477309, 13.55455179,  1.22929583, 13.18780613,
        1.07722416,  3.74709148,  0.74640336,  4.50567341,  0.77324767,
       10.08429451,  1.77366805, 11.7151588 ,  1.74468951,  3.53605342,
        0.95975957,  4.86544614,  0.90825391, 11.02834306,  1.42625937,
       16.12508655,  1.18277521,  2.82587442,  0.9461904 ,  4.01022892,
        0.85559525,  6.58965187,  1.47816591,  7.52377567,  1.4520308 ,
        2.96926746,  1.2689137 ,  4.30944571,  1.10298395,  4.08492994,
        3.21179028,  7.02790794,  3.3431272 ,  3.58361535,  0.88293972,
        4.22084708,  0.66116629,  6.98290181,  1.35048079,  7.00012221,
        1.25089555,  2.48039746,  1.42464747,  5.45376482,  1.3

# Cross-validation evaluation

In [39]:
# TODO: check whether this is really needed. In principle it was already done in the previous cells through the 'cv' parameter

# from sklearn.model_selection import cross_val_score

# # Random Forest
# scores_rf = cross_val_score(best_rf_model, X_train, y_train, cv=5, scoring='accuracy')

# # Gradient Boosting
# scores_gb = cross_val_score(best_gb_model, X_train, y_train, cv=5, scoring='accuracy')

# # Decision trees
# scores_dt = cross_val_score(best_dt_model, X_train, y_train, cv=5, scoring='accuracy')

# # Logistic regression
# scores_lr = cross_val_score(best_lr_model, X_train, y_train, cv=5, scoring='accuracy')

# # Neural networks
# scores_mlp = cross_val_score(best_mlp_model, X_train, y_train, cv=5, scoring='accuracy')

# print("Cross-Validation Accuracy Scores (RF):", np.mean(scores_rf))
# print("Cross-Validation Accuracy Scores (GB):", np.mean(scores_gb))
# print("Cross-Validation Accuracy Scores (DT):", np.mean(scores_dt))
# print("Cross-Validation Accuracy Scores (LR):", np.mean(scores_lr))
# print("Cross-Validation Accuracy Scores (MLP):", np.mean(scores_mlp))
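
For reference, the sketch below shows what a 5-fold partition of the 1394 training samples looks like: each sample is used for validation exactly once and for training in the other four folds. `k_fold_indices` is a hypothetical helper that produces contiguous, unshuffled and unstratified folds. With `y_train` ordered as faces first and background second, such folds would be badly unbalanced, so this only illustrates the bookkeeping and is not a recommended splitter for this data.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(1394, k=5))
val_sizes = [len(val) for _, val in folds]
```

Note that the searches above optimise `scoring='f1'`, while the commented-out calls use `scoring='accuracy'`, so the two sets of numbers would not be directly comparable.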
