# LOGISTIC REGRESSION
 ***
 <code> **HYPERPARAMETER TUNING** </code>


This notebook tunes the hyperparameters of a regularized logistic regression model. Six models were tested:

- BASELINE (all variables, default parameters, no transformations, evaluated with CV)
- Model with all variables, robust scaling and Grid Search CV
- Model with selected variables, robust scaling, Grid Search CV
- Model with selected variables, robust scaling, Randomized Search CV
- Model with selected variables, robust scaling, SMOTE-NC, Grid Search CV
- Model with selected variables, robust scaling, SMOTE-NC, Randomized Search CV

[Logistic regression (official documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [1]:
# Import the required libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, KFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import FunctionTransformer

from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline

import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from scipy.stats import uniform

In [2]:
import os
import sys

# Add the 'drive' folder to the path
ruta_carpeta_drive = os.path.abspath('../../drive')
if ruta_carpeta_drive not in sys.path:
    sys.path.insert(0, ruta_carpeta_drive)

import drive

### 1. Importing the data

To train this model we start from two datasets: one with all the variables and another with the variables selected in the previous stage.

In [3]:
# Download the data in parquet format from Google Drive
drive.descargar_archivos_concretos('datosEntrenamiento.parquet', '../../drive')
drive.descargar_archivos_concretos('datosEntrenamientoRL.parquet', '../../drive')

Archivo datosEntrenamiento.parquet guardado en: ../data/clean/datosEntrenamiento.parquet
Archivo datosEntrenamientoRL.parquet guardado en: ../data/clean/datosEntrenamientoRL.parquet


In [4]:
# DataFrame with all the variables
# DF_ALL = pd.read_parquet('../data/clean/datosEntrenamiento.parquet')
DF_ALL = pd.read_csv('ENTRENAMIENTO_FINAL.csv', index_col = 0)

# DataFrame with the selected variables
# DF = pd.read_parquet('../data/clean/datosEntrenamientoRL.parquet')
DF = pd.read_csv('ENTRENAMIENTO_FINAL_RL.csv')

In [5]:
DF_ALL.head()

Unnamed: 0,Bestseller,NumPages,SagaNumber,RedPerc,BluePerc,BelongsSaga,Price,WordsTitle,PriceFormat,BookInterest1M,...,Womens,Womens Fiction,World War I,World War II,Young Adult,Young Adult Contemporary,Young Adult Fantasy,Young Adult Romance,Young Adult Science Fiction,Zombies
0,0.0,329.0,1.0,0.51,0.4,0,19.99,1.0,paperback,0.0,...,0,0,0,0,0,0,0,0,0,0
1,0.0,269.0,2.0,0.61,0.54,1,3.99,2.0,ebook,0.0,...,0,0,0,0,0,0,0,0,0,0
2,0.0,2335.0,1.0,0.72,0.57,1,20.99,7.0,ebook,0.0,...,0,0,0,0,1,0,0,0,0,0
3,0.0,40.0,1.0,0.83,0.35,0,25.0,1.0,hardcover,0.0,...,0,0,0,0,1,0,0,0,0,0
4,0.0,189.0,1.0,0.59,0.26,0,15.0,4.0,paperback,0.0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
DF.head()

Unnamed: 0,Bestseller,RedPerc,BluePerc,BelongsSaga,WordsTitle,BookInterest1M,HasTwitter,HasWikipedia,PrevBestSellAuthor,19th Century,...,Whodunit,Witches,Womens,World War I,World War II,Young Adult,Young Adult Contemporary,Young Adult Fantasy,Young Adult Romance,PriceFormat
0,0.0,0.51,0.4,0,1.0,0.0,1.0,0.0,0.0,1,...,0,0,0,0,0,0,0,0,0,paperback
1,0.0,0.61,0.54,1,2.0,0.0,1.0,1.0,0.0,0,...,0,0,0,0,0,0,0,0,0,ebook
2,0.0,0.72,0.57,1,7.0,0.0,0.0,1.0,0.0,0,...,0,0,0,0,0,1,0,0,0,ebook
3,0.0,0.83,0.35,0,1.0,0.0,1.0,1.0,0.0,0,...,0,0,0,0,0,1,0,0,0,hardcover
4,0.0,0.59,0.26,0,4.0,0.0,0.0,1.0,0.0,0,...,0,0,0,0,0,0,0,0,0,paperback


### 2. Train/test split

In this notebook we only use the training set to tune the hyperparameters. The test set is reserved for model evaluation.

In [7]:
# Seed
SEED = 22

# Proportion of the test set
TEST_SIZE = 0.3

# Number of folds for cross-validation
CV_FOLDS = 5

**With variable selection**

In [8]:
y = DF["Bestseller"]
X = DF.iloc[:, 1:]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, stratify=y, random_state=SEED)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2534, 221), (1086, 221), (2534,), (1086,))

**With all the variables**

In [9]:
y = DF_ALL["Bestseller"]
X = DF_ALL.iloc[:, 1:]

# Split the data into training and test sets
X_train_ALL, X_test_ALL, y_train_ALL, y_test_ALL = train_test_split(X, y, test_size=TEST_SIZE, stratify=y, random_state=SEED)
X_train_ALL.shape, X_test_ALL.shape, y_train_ALL.shape, y_test_ALL.shape

((2534, 310), (1086, 310), (2534,), (1086,))

### Creating the K-folds

To evaluate the models fairly, we use stratified cross-validation with k = 5, shared by all models.

In [10]:
# Initialize the StratifiedKFold object
kf = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=SEED)
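
A minimal sketch of what stratification guarantees, assuming a synthetic label vector with a 20/80 class split (the arrays and names below are illustrative, not the notebook's data): every validation fold keeps the overall positive rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels: 20 positives and 80 negatives (assumption for illustration)
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=22)

# Positive rate inside each validation fold
fold_positive_rates = [float(y[val_idx].mean()) for _, val_idx in cv.split(X, y)]
print(fold_positive_rates)
```

With a plain `KFold`, the positive rate could vary from fold to fold, which matters when the minority class is small.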

In [11]:
def codificarPriceFormat(df):
    return pd.get_dummies(df, columns=['PriceFormat'], dtype=int)

### Evaluation metrics

In [12]:
# Function to compute sensitivity (true positive rate)
def sensitivity(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn)

# Function to compute specificity (true negative rate)
def specificity(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tn / (tn + fp)

# Wrap the functions as scorers for use in GridSearchCV and RandomizedSearchCV
sensitivity_scorer = make_scorer(sensitivity)
specificity_scorer = make_scorer(specificity)

METRICS = {'balanced_accuracy': 'balanced_accuracy',
           'sensitivity': sensitivity_scorer,
           'specificity': specificity_scorer}
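
As a worked example of how the three metrics relate, the sketch below computes them by hand on a small made-up prediction vector (the labels and the helper `confusion_counts` are hypothetical). For binary problems, balanced accuracy is the mean of sensitivity and specificity.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
sensitivity_value = tp / (tp + fn)   # 1 / 4 = 0.25
specificity_value = tn / (tn + fp)   # 5 / 6
balanced_accuracy = (sensitivity_value + specificity_value) / 2
print(sensitivity_value, specificity_value, balanced_accuracy)
```

A model that rarely predicts the positive class can therefore show high specificity and low sensitivity at the same time, which is why all three values are tracked.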

### Setting up the MLflow environment

In [None]:
!mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db

[2024-04-28 12:09:34 +0200] [18720] [INFO] Starting gunicorn 21.2.0
[2024-04-28 12:09:34 +0200] [18720] [INFO] Listening at: http://127.0.0.1:8080 (18720)
[2024-04-28 12:09:34 +0200] [18720] [INFO] Using worker: sync
[2024-04-28 12:09:34 +0200] [18721] [INFO] Booting worker with pid: 18721
[2024-04-28 12:09:34 +0200] [18722] [INFO] Booting worker with pid: 18722
[2024-04-28 12:09:34 +0200] [18723] [INFO] Booting worker with pid: 18723
[2024-04-28 12:09:34 +0200] [18724] [INFO] Booting worker with pid: 18724


In [13]:
# !mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db

In [14]:
# Set the SQLite database as the MLFLOW_TRACKING_URI
os.environ['MLFLOW_TRACKING_URI'] = 'sqlite:///mlruns.db'

# WARNING: to view the local server, start the UI with the matching backend store:
# mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db

# mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Set the tracking URI
mlflow.set_tracking_uri('sqlite:///mlruns.db')

# Get the experiment IDs of the runs returned by search_runs()
experiment_ids = mlflow.search_runs().experiment_id.unique()

# Print the experiment IDs
for exp_id in experiment_ids:
    print(exp_id)

# Set the experiment in which all runs will be stored
mlflow.set_experiment(experiment_name = 'Regresión logística')

<Experiment: artifact_location='/Users/javimartinfuentes/Documents/GitHub/NOVELLA/modelos/regresionLogistica/mlruns/1', creation_time=1714262707661, experiment_id='1', last_update_time=1714262707661, lifecycle_stage='active', name='Regresión logística', tags={}>

# BASELINE MODEL

***

We build a baseline model to compare against the other models, which are trained with additional preprocessing and transformations.

Baseline model:
* Default parameters
* All variables
* No transformations
* No SMOTE-NC

In [15]:
# Copy the dataset with all the variables
DF_BASE = DF_ALL.copy()
DF_BASE = codificarPriceFormat(DF_BASE)

In [16]:
# Split the data into training and test sets

X_BASE = DF_BASE.drop('Bestseller', axis=1)
y_BASE = DF_BASE['Bestseller']

X_base_train, X_base_test, y_base_train, y_base_test = train_test_split(X_BASE, y_BASE, test_size=TEST_SIZE, stratify=y_BASE, random_state=SEED)

In [17]:
# Create the model
RL = LogisticRegression(random_state = SEED)

In [18]:
# Apply cross-validation
scores = cross_validate(RL, X_base_train, y_base_train, scoring=METRICS, cv=CV_FOLDS,
                        return_train_score=True, verbose=1, n_jobs=-1)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    1.1s remaining:    1.6s
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or

In [19]:
scores

{'fit_time': array([0.06189299, 0.06321502, 0.06350803, 0.06599784, 0.06404305]),
 'score_time': array([0.00528002, 0.00422311, 0.00476789, 0.00438094, 0.00411892]),
 'test_balanced_accuracy': array([0.61702733, 0.60739649, 0.63680825, 0.67523114, 0.62977255]),
 'train_balanced_accuracy': array([0.64573334, 0.66076667, 0.64955997, 0.65850032, 0.63978544]),
 'test_sensitivity': array([0.25      , 0.23529412, 0.29411765, 0.38235294, 0.26865672]),
 'train_sensitivity': array([0.30627306, 0.34317343, 0.31734317, 0.33579336, 0.29779412]),
 'test_specificity': array([0.98405467, 0.97949886, 0.97949886, 0.96810934, 0.99088838]),
 'train_specificity': array([0.98519362, 0.97835991, 0.98177677, 0.98120729, 0.98177677])}

In [20]:
# Log the results to MLflow
with mlflow.start_run():

    # Metrics
    m = ["balanced_accuracy", "sensitivity", "specificity"]

    for metric in m:

        for fold in range(len(scores[f"train_{metric}"])):

            # Get the metrics for this fold
            train_fold_metric = scores[f"train_{metric}"][fold]
            test_fold_metric = scores[f"test_{metric}"][fold]

            # Log the metric for each fold
            mlflow.log_metric(f"train_{metric}_fold_{fold+1}", train_fold_metric)
            mlflow.log_metric(f"test_{metric}_fold_{fold+1}", test_fold_metric)

        # Compute the mean across folds
        train_mean = np.mean(scores[f"train_{metric}"])
        test_mean = np.mean(scores[f"test_{metric}"])

        # Log the mean values for the train and validation folds
        mlflow.log_metric(f"train_{metric}_mean", train_mean)
        mlflow.log_metric(f"test_{metric}_mean", test_mean)

    # Set tags that describe this run
    mlflow.set_tag("variables", "311")
    mlflow.set_tag("transformación", "NO")
    mlflow.set_tag("estrategia", "CV")
    mlflow.set_tag("smote", "NO")

    # Fit the model on the training set and infer its signature, which describes the model's input and output types
    RL.fit(X_base_train, y_base_train)
    signature = infer_signature(X_base_train, RL.predict(X_base_train))

    # Log and register the model
    model_info = mlflow.sklearn.log_model(
        sk_model=RL,
        artifact_path="rl_model",
        signature=signature,
        input_example=X_base_train,
        registered_model_name="BASELINE",
    )


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Registered model 'BASELINE' already exists. Creating a new version of this model...
Created version '3' of model 'BASELINE'.


# MODELS WITH SELECTED VARIABLES AND SMOTE-NC
***

SMOTE-NC is an oversampling technique that handles both categorical and numerical variables. Here we apply it to each training set generated in the cross-validation iterations. Otherwise, synthetic samples derived from validation data would end up in the training folds, which is a form of data leakage.
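
The sketch below illustrates resampling inside each fold. To stay within NumPy and scikit-learn it swaps SMOTE-NC for plain random duplication of minority rows, so it is a stand-in for the idea rather than the notebook's method; the data are synthetic and the helper `oversample_minority` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

def oversample_minority(X, y, rng):
    """Duplicate random minority rows until both classes have the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    n_extra = counts.max() - counts.min()
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

rng = np.random.default_rng(22)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.15).astype(int)  # roughly 15% positives (synthetic)
X[y == 1] += 1.0                          # give the positive class some signal

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=22)
scores, val_sizes, train_class_counts = [], [], []

for train_idx, val_idx in cv.split(X, y):
    # Resample ONLY the training part of the fold; the validation part stays untouched
    X_res, y_res = oversample_minority(X[train_idx], y[train_idx], rng)
    train_class_counts.append(np.bincount(y_res))

    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    scores.append(balanced_accuracy_score(y[val_idx], model.predict(X[val_idx])))
    val_sizes.append(len(val_idx))

print(np.mean(scores))
```

Placing the sampler as a step of an `imblearn` pipeline, as this notebook does, achieves the same effect automatically during the search.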

### Variable scaling

To help the logistic regression converge and perform better, we scale the variables. Because the data contain many outliers, standardization would be heavily influenced by extreme values. We therefore apply [robust scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#plot-all-scaling-robust-scaler-section), which uses the median and the interquartile range. This makes the transformation more resistant to the variation introduced by outliers.

In [21]:
data_scaled = DF.copy()
X_scaled = data_scaled.drop('Bestseller', axis=1)
y_scaled = data_scaled['Bestseller']

# Split into train and test
X_scaled_train, X_scaled_test, y_scaled_train, y_scaled_test = train_test_split(X_scaled, y_scaled, test_size=TEST_SIZE, stratify=y_scaled, random_state=SEED)

In [22]:
# Initialize RobustScaler
scaler = RobustScaler()

# Apply it only to the numeric variables

variables_numericas = ['RedPerc', 'BluePerc', 'WordsTitle', 'BookInterest1M', 'PrevBestSellAuthor']

# Fit the RobustScaler on the training data and apply it to both train and test
X_scaled_train[variables_numericas] = scaler.fit_transform(X_scaled_train[variables_numericas])
X_scaled_test[variables_numericas] = scaler.transform(X_scaled_test[variables_numericas])
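
To make the transformation concrete, here is a small NumPy sketch of the formula behind robust scaling with the default quantile range, `(x - median) / (Q3 - Q1)`, next to ordinary standardization. The five values are made up, with one deliberate outlier.

```python
import numpy as np

# Made-up feature with a single extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Robust scaling: center on the median, divide by the interquartile range
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

# Standardization: center on the mean, divide by the standard deviation
standard = (x - x.mean()) / x.std()

print(robust)
print(standard)
```

In the robust version the four ordinary values keep a spread of 1.5 units, whereas standardization squeezes them into a narrow band because the outlier inflates both the mean and the standard deviation.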

### Building the pipeline

We build a pipeline with the operations to apply to each fold during training:
* Oversampling (SMOTENC)
* Rounding integer variables
* One-hot encoding of the categorical variable `PriceFormat`
* Classifier (logistic regression)

In [23]:
def redondearVariables(X):
    variablesRedondeo = ["WordsTitle"]
    # Iterate over the specified columns and round their values
    for v in variablesRedondeo:
        X[v] = np.round(X[v])
    return X
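
Synthetic samples produced by oversampling can contain fractional values in columns that are counts, which is what the rounding step addresses. Note that `redondearVariables` modifies the DataFrame it receives. A possible non-mutating variant is sketched below; `round_columns` and the toy DataFrame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def round_columns(X, columns=("WordsTitle",)):
    """Return a copy of X with the given columns rounded to the nearest integer."""
    X = X.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        X[col] = np.round(X[col])
    return X

original = pd.DataFrame({"WordsTitle": [1.2, 3.7, 2.5],
                         "RedPerc": [0.51, 0.61, 0.72]})
rounded = round_columns(original)

print(rounded["WordsTitle"].tolist())
print(original["WordsTitle"].tolist())
```

`np.round` rounds halves to the nearest even value, so 2.5 becomes 2.0 rather than 3.0.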

In [24]:
# Genre columns
columnas_generos = X_scaled_train.columns[10:-1]

# Categorical columns
categoricalColumns = ["BelongsSaga", "PriceFormat", "HasTwitter", "HasWikipedia"] + list(columnas_generos)

In [25]:
smote = SMOTENC(categorical_features = categoricalColumns, random_state = SEED)

# Define the classifier
RL = LogisticRegression(random_state = SEED)

# Define the transformer that encodes the categorical variable 'PriceFormat'
column_transformer = ColumnTransformer([
    ('ohe', OneHotEncoder(), ['PriceFormat'])
], remainder='passthrough')

# Define the function transformer used for rounding
transformador_funcion = FunctionTransformer(func=redondearVariables)

# Build the pipeline
pipeline = Pipeline([
    ('smote', smote),
    ('redondear_variables', transformador_funcion),
    ('encoder', column_transformer),
    ('classifier', RL)
])


### Grid Search CV

In [26]:
# Define the hyperparameters to tune
param_grid = {
    'classifier__solver': ['saga'],
    'classifier__penalty': ['elasticnet'],
    'classifier__max_iter': [2000],
    'classifier__C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], 
    'classifier__l1_ratio': np.linspace(0.0, 1.0, 10),
    'classifier__fit_intercept': [True, False]
}

# Initialize GridSearch
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=kf,
                           scoring=METRICS, refit = "balanced_accuracy", return_train_score=True, n_jobs=-1, error_score="raise")

grid_search.fit(X_scaled_train, y_scaled_train)

# Results
cv_results = grid_search.cv_results_
best_params = grid_search.best_params_
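
As a rough check on the cost of this search, the sketch below counts the candidates in the grid using only `itertools` and NumPy (the variable names are illustrative). It also shows that `np.linspace(0.0, 1.0, 10)` yields steps of 1/9, not 0.1.

```python
from itertools import product

import numpy as np

C_values = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
l1_ratios = np.linspace(0.0, 1.0, 10)   # 0, 1/9, 2/9, ..., 1
fit_intercept_options = [True, False]
n_folds = 5

# solver, penalty and max_iter each have a single value, so they do not multiply the grid
candidates = list(product(C_values, l1_ratios, fit_intercept_options))
n_cv_fits = len(candidates) * n_folds

print(len(candidates), n_cv_fits)
print(l1_ratios[1])
```

On top of the cross-validation fits, `GridSearchCV` performs one more fit on the whole training set when `refit` is set.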

In [27]:
best_params

{'classifier__C': 0.01,
 'classifier__fit_intercept': False,
 'classifier__l1_ratio': 0.1111111111111111,
 'classifier__max_iter': 2000,
 'classifier__penalty': 'elasticnet',
 'classifier__solver': 'saga'}
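
To read these values: in scikit-learn, `C` is the inverse of the regularization strength (smaller `C` means stronger regularization), and `l1_ratio` (written ρ here) mixes the two penalties as (1 - ρ)/2 · ‖w‖₂² + ρ · ‖w‖₁. The sketch below evaluates that penalty for a made-up weight vector; `elastic_net_penalty` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def elastic_net_penalty(w, l1_ratio):
    """Elastic-net penalty: (1 - rho) / 2 * ||w||_2^2 + rho * ||w||_1."""
    w = np.asarray(w, dtype=float)
    l2_term = 0.5 * (1.0 - l1_ratio) * np.dot(w, w)
    l1_term = l1_ratio * np.abs(w).sum()
    return l2_term + l1_term

w = [0.5, -2.0, 0.0]  # made-up coefficients

pure_l2 = elastic_net_penalty(w, l1_ratio=0.0)    # ridge-like penalty only
pure_l1 = elastic_net_penalty(w, l1_ratio=1.0)    # lasso-like penalty only
mixed = elastic_net_penalty(w, l1_ratio=1 / 9)    # mostly L2, as in the result above

print(pure_l2, mixed, pure_l1)
```

An `l1_ratio` of about 0.111 therefore means the selected model is penalized mostly through the L2 term, with a small L1 component.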

In [28]:
# Convert the cross-validation results into a DataFrame
df_results = pd.DataFrame(grid_search.cv_results_)

# Select the row that matches the best parameters
filtered_row = df_results.loc[
    (df_results['param_classifier__C'] == best_params['classifier__C']) &
    (df_results['param_classifier__l1_ratio'] == best_params['classifier__l1_ratio']) &
    (df_results['param_classifier__max_iter'] == best_params['classifier__max_iter']) &
    (df_results['param_classifier__penalty'] == best_params['classifier__penalty']) &
    (df_results['param_classifier__solver'] == best_params['classifier__solver']) &
    (df_results['param_classifier__fit_intercept'] == best_params['classifier__fit_intercept'])
]

index_row = filtered_row.index[0]

# Log the results to MLflow
with mlflow.start_run():

    # Store the hyperparameter values
    for key, value in best_params.items():
        mlflow.log_param(key, value)

    # Log the mean, standard deviation and per-fold values of each metric
    for metric in METRICS:

        M = metric.replace(" ", "_")

        # Mean

        mlflow.log_metric(f"mean_train_{M}", df_results[f"mean_train_{M}"][index_row])
        mlflow.log_metric(f"mean_test_{M}", df_results[f"mean_test_{M}"][index_row])

        # Standard deviation
        mlflow.log_metric(f"std_train_{M}", df_results[f"std_train_{M}"][index_row])
        mlflow.log_metric(f"std_test_{M}", df_results[f"std_test_{M}"][index_row])

        for i in range(CV_FOLDS):

            # Training results for each fold
            mlflow.log_metric(f"train_{M}fold{i}", df_results[f"split{i}_train_{M}"][index_row])
            # Validation results for each fold
            mlflow.log_metric(f"test_{M}fold{i}", df_results[f"split{i}_test_{M}"][index_row])

    # Set tags that describe this run
    mlflow.set_tag("TIPO", "RL_SMOTE_GRID_SEARCH")

    mlflow.set_tag("variables", "222")
    mlflow.set_tag("transformación", "ESCALADO ROBUSTO")
    mlflow.set_tag("estrategia", "GRID SEARCH CV")
    mlflow.set_tag("smote", "SI")

    # Infer the model signature, which describes the model's input and output types
    signature = infer_signature(X_scaled_train, grid_search.best_estimator_.predict(X_scaled_train))

    # Log and register the model
    model_info = mlflow.sklearn.log_model(
        sk_model=grid_search,
        artifact_path="rl_model",
        signature=signature,
        input_example=X_scaled_train,
        registered_model_name="RL_SMOTE_GRID_SEARCH",
    )

Successfully registered model 'RL_SMOTE_GRID_SEARCH'.
Created version '1' of model 'RL_SMOTE_GRID_SEARCH'.


### Random Search CV

In [29]:
# Define the hyperparameters to tune

param_grid = {
    'classifier__solver': ['saga'],
    'classifier__penalty': ['elasticnet'],
    'classifier__max_iter': [2000],
    'classifier__C': uniform(loc=0.0001, scale=9999.9999),  # Uniform distribution between 0.0001 and 10000
    'classifier__l1_ratio': uniform(loc=0.0, scale=1.0),  # Uniform distribution between 0.0 and 1.0
    'classifier__fit_intercept': [True, False]
}
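
One property of a uniform distribution over such a wide range is that it almost never samples small values of `C`. The sketch below compares it with SciPy's `loguniform` over the same interval; this is offered as a possible alternative to consider, not as what the notebook ran.

```python
from scipy.stats import loguniform, uniform

low, high = 1e-4, 1e4

linear_scale = uniform(loc=low, scale=high - low)  # same range as in the cell above
log_scale = loguniform(low, high)                  # uniform in log10(C)

# Probability of drawing a value of C below 1.0 under each distribution
p_below_one_linear = linear_scale.cdf(1.0)
p_below_one_log = log_scale.cdf(1.0)

print(p_below_one_linear, p_below_one_log)
```

Under the linear-scale distribution only about one draw in ten thousand falls below 1.0, so the strongly regularized region covered by the grid search (values such as 0.01) is effectively unexplored; the log-scale version spends half of its draws there.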

In [30]:
# Define the randomized search
random_search = RandomizedSearchCV(
    estimator=pipeline, param_distributions=param_grid, 
    n_iter=200, cv=kf, 
    scoring= METRICS, 
    refit = "balanced_accuracy",
    return_train_score=True, n_jobs = -1,
    random_state=SEED  # fix the sampled candidates so the search is reproducible
)

random_search.fit(X_scaled_train, y_scaled_train)

# Results
cv_results = random_search.cv_results_
best_params = random_search.best_params_

In [31]:
best_params

{'classifier__C': 6660.372992132262,
 'classifier__fit_intercept': False,
 'classifier__l1_ratio': 0.9815941906816548,
 'classifier__max_iter': 2000,
 'classifier__penalty': 'elasticnet',
 'classifier__solver': 'saga'}

In [32]:
# Convert the cross-validation results into a DataFrame
df_results = pd.DataFrame(cv_results)

# Select the row that matches the best parameters
filtered_row = df_results.loc[
    (df_results['param_classifier__C'] == best_params['classifier__C']) &
    (df_results['param_classifier__l1_ratio'] == best_params['classifier__l1_ratio']) &
    (df_results['param_classifier__max_iter'] == best_params['classifier__max_iter']) &
    (df_results['param_classifier__penalty'] == best_params['classifier__penalty']) &
    (df_results['param_classifier__solver'] == best_params['classifier__solver']) &
    (df_results['param_classifier__fit_intercept'] == best_params['classifier__fit_intercept'])
]

index_row = filtered_row.index[0]

# Log the results to MLflow
with mlflow.start_run():

    # Store the hyperparameter values
    for key, value in best_params.items():
        mlflow.log_param(key, value)

    # Log the mean, standard deviation and per-fold values of each metric
    for metric in METRICS:

        M = metric.replace(" ", "_")

        # Mean

        mlflow.log_metric(f"mean_train_{M}", df_results[f"mean_train_{M}"][index_row])
        mlflow.log_metric(f"mean_test_{M}", df_results[f"mean_test_{M}"][index_row])

        # Standard deviation
        mlflow.log_metric(f"std_train_{M}", df_results[f"std_train_{M}"][index_row])
        mlflow.log_metric(f"std_test_{M}", df_results[f"std_test_{M}"][index_row])

        for i in range(CV_FOLDS):

            # Training results for each fold
            mlflow.log_metric(f"train_{M}fold{i}", df_results[f"split{i}_train_{M}"][index_row])
            # Validation results for each fold
            mlflow.log_metric(f"test_{M}fold{i}", df_results[f"split{i}_test_{M}"][index_row])

    # Set tags that describe this run
    mlflow.set_tag("variables", "222")
    mlflow.set_tag("transformación", "ESCALADO ROBUSTO")
    mlflow.set_tag("estrategia", "RANDOM SEARCH CV")
    mlflow.set_tag("smote", "SI")

    # Infer the model signature, which describes the model's input and output types
    signature = infer_signature(X_scaled_train, random_search.best_estimator_.predict(X_scaled_train))

    # Log and register the model
    model_info = mlflow.sklearn.log_model(
        sk_model=random_search,
        artifact_path="rl_model",
        signature=signature,
        input_example=X_scaled_train,
        registered_model_name="RL_SMOTE_RANDOM_SEARCH",
    )

Successfully registered model 'RL_SMOTE_RANDOM_SEARCH'.
Created version '1' of model 'RL_SMOTE_RANDOM_SEARCH'.


# MODELS WITH SELECTED VARIABLES AND WITHOUT SMOTE-NC

In [33]:
X_scaled_train = codificarPriceFormat(X_scaled_train)

In [34]:
# Define the model - LogisticRegression
RL = LogisticRegression(random_state = SEED)

### Grid Search CV

In [35]:
# Define the hyperparameters to tune
param_grid = {
    'solver': ['saga'],
    'penalty': ['elasticnet'],
    'max_iter': [2000],
    'C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], 
    'l1_ratio': np.linspace(0.0, 1.0, 10),
    'fit_intercept': [True, False]
}

# Initialize GridSearch
grid_search = GridSearchCV(estimator=RL, param_grid=param_grid, cv=kf,
                           scoring=METRICS, refit = "balanced_accuracy", return_train_score=True, n_jobs=-1, error_score="raise")

grid_search.fit(X_scaled_train, y_scaled_train)

# Results
cv_results = grid_search.cv_results_
best_params = grid_search.best_params_

In [36]:
best_params

{'C': 100.0,
 'fit_intercept': False,
 'l1_ratio': 0.5555555555555556,
 'max_iter': 2000,
 'penalty': 'elasticnet',
 'solver': 'saga'}

In [37]:
# Convert the cross-validation results into a DataFrame
df_results = pd.DataFrame(grid_search.cv_results_)

# Select the row that matches the best parameters
filtered_row = df_results.loc[
    (df_results['param_C'] == best_params['C']) &
    (df_results['param_l1_ratio'] == best_params['l1_ratio']) &
    (df_results['param_max_iter'] == best_params['max_iter']) &
    (df_results['param_penalty'] == best_params['penalty']) &
    (df_results['param_solver'] == best_params['solver']) &
    (df_results['param_fit_intercept'] == best_params['fit_intercept'])
]


index_row = filtered_row.index[0]

# Log the results to MLflow
with mlflow.start_run():

    # Store the hyperparameter values
    for key, value in best_params.items():
        mlflow.log_param(key, value)

    # Log the mean, standard deviation and per-fold values of each metric
    for metric in METRICS:

        M = metric.replace(" ", "_")

        # Mean

        mlflow.log_metric(f"mean_train_{M}", df_results[f"mean_train_{M}"][index_row])
        mlflow.log_metric(f"mean_test_{M}", df_results[f"mean_test_{M}"][index_row])

        # Standard deviation
        mlflow.log_metric(f"std_train_{M}", df_results[f"std_train_{M}"][index_row])
        mlflow.log_metric(f"std_test_{M}", df_results[f"std_test_{M}"][index_row])

        for i in range(CV_FOLDS):

            # Training results for each fold
            mlflow.log_metric(f"train_{M}fold{i}", df_results[f"split{i}_train_{M}"][index_row])
            # Validation results for each fold
            mlflow.log_metric(f"test_{M}fold{i}", df_results[f"split{i}_test_{M}"][index_row])

    # Set tags that describe this run
    mlflow.set_tag("variables", "222")
    mlflow.set_tag("transformación", "ESCALADO ROBUSTO")
    mlflow.set_tag("estrategia", "GRID SEARCH CV")
    mlflow.set_tag("smote", "NO")

    # Infer the model signature, which describes the model's input and output types
    signature = infer_signature(X_scaled_train, grid_search.best_estimator_.predict(X_scaled_train))

    # Log and register the model
    model_info = mlflow.sklearn.log_model(
        sk_model=grid_search,
        artifact_path="rl_model",
        signature=signature,
        input_example=X_scaled_train,
        registered_model_name="RL_SIN_SMOTE_GRID_SEARCH",
    )

Successfully registered model 'RL_SIN_SMOTE_GRID_SEARCH'.
Created version '1' of model 'RL_SIN_SMOTE_GRID_SEARCH'.


# MODELS WITH ALL THE VARIABLES AND SMOTE-NC

In [38]:
data_scaled = DF_ALL.copy()
X_scaled = data_scaled.drop('Bestseller', axis=1)
y_scaled = data_scaled['Bestseller']

# Split into train and test
X_scaled_train, X_scaled_test, y_scaled_train, y_scaled_test = train_test_split(X_scaled, y_scaled, test_size=TEST_SIZE, stratify=y_scaled, random_state=SEED)



In [39]:
X_scaled_train.describe()

Unnamed: 0,NumPages,SagaNumber,RedPerc,BluePerc,BelongsSaga,Price,WordsTitle,BookInterest1M,Rating20Days,HasTwitter,...,Womens,Womens Fiction,World War I,World War II,Young Adult,Young Adult Contemporary,Young Adult Fantasy,Young Adult Romance,Young Adult Science Fiction,Zombies
count,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,...,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0
mean,356.687845,1.829913,0.479574,0.425947,0.37056,18.317388,3.317285,205.985004,4.113062,0.685083,...,0.001973,0.018942,0.001579,0.02723,0.179558,0.006314,0.036306,0.012234,0.000395,0.000395
std,114.388258,3.442477,0.228746,0.203433,0.48305,5.607009,1.617231,419.207405,0.374915,0.464575,...,0.044385,0.136348,0.039707,0.162784,0.383895,0.079226,0.187088,0.109949,0.019865,0.019865
min,11.0,0.0,0.01,0.01,0.0,0.99,1.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,304.0,1.0,0.29,0.26,0.0,14.99,2.0,0.0,3.87,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,350.0,1.0,0.46,0.405,0.0,17.39,3.0,100.0,4.14,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,400.0,1.0,0.66,0.5775,1.0,21.1575,4.0,176.0,4.38,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2335.0,58.0,0.99,0.94,1.0,59.95,14.0,2911.0,5.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [40]:
# Initialize RobustScaler
scaler = RobustScaler()

# Apply it only to the numeric variables

variables_numericas = ['NumPages', 'SagaNumber', 'RedPerc', 'BluePerc', 'Price', 'WordsTitle', 'BookInterest1M',
                     'Rating20Days', 'PrevBestSellAuthor']

# Fit the RobustScaler on the training data and apply it to both train and test
X_scaled_train[variables_numericas] = scaler.fit_transform(X_scaled_train[variables_numericas])
X_scaled_test[variables_numericas] = scaler.transform(X_scaled_test[variables_numericas])

In [41]:
X_scaled_train.describe()

Unnamed: 0,NumPages,SagaNumber,RedPerc,BluePerc,BelongsSaga,Price,WordsTitle,BookInterest1M,Rating20Days,HasTwitter,...,Womens,Womens Fiction,World War I,World War II,Young Adult,Young Adult Contemporary,Young Adult Fantasy,Young Adult Romance,Young Adult Science Fiction,Zombies
count,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,...,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0,2534.0
mean,0.069665,0.829913,0.052902,0.06597518,0.37056,0.150367,0.158642,0.602188,-0.052819,0.685083,...,0.001973,0.018942,0.001579,0.02723,0.179558,0.006314,0.036306,0.012234,0.000395,0.000395
std,1.191544,3.442477,0.618231,0.6407336,0.48305,0.909122,0.808616,2.38186,0.735127,0.464575,...,0.044385,0.136348,0.039707,0.162784,0.383895,0.079226,0.187088,0.109949,0.019865,0.019865
min,-3.53125,-1.0,-1.216216,-1.244094,0.0,-2.6591,-1.0,-0.568182,-4.196078,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.479167,0.0,-0.459459,-0.4566929,0.0,-0.389137,-0.5,-0.568182,-0.529412,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,-8.673617e-17,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.520833,0.0,0.540541,0.5433071,1.0,0.610863,0.5,0.431818,0.470588,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,20.677083,57.0,1.432432,1.685039,1.0,6.900689,5.5,15.971591,1.686275,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [42]:
list(X_scaled_train.columns)

['NumPages',
 'SagaNumber',
 'RedPerc',
 'BluePerc',
 'BelongsSaga',
 'Price',
 'WordsTitle',
 'PriceFormat',
 'BookInterest1M',
 'Rating20Days',
 'HasTwitter',
 'HasWikipedia',
 'PrevBestSellAuthor',
 '19th Century',
 '20th Century',
 'Abuse',
 'Action',
 'Adoption',
 'Adult',
 'Adult Fiction',
 'Adventure',
 'Africa',
 'African American',
 'African American Romance',
 'African Literature',
 'Aliens',
 'Alternate History',
 'Amateur Sleuth',
 'Amazon',
 'American',
 'American History',
 'Americana',
 'Amish',
 'Angels',
 'Animals',
 'Anthologies',
 'Apocalyptic',
 'Art',
 'Arthurian',
 'Artificial Intelligence',
 'Asia',
 'Asian Literature',
 'Audiobook',
 'Australia',
 'Autistic Spectrum Disorder',
 'BDSM',
 'Banned Books',
 'Baseball',
 'Biography Memoir',
 'Boarding School',
 'Book Club',
 'Books About Books',
 'Botswana',
 'Boys Love',
 'British Literature',
 'Buddhism',
 'Bulgaria',
 'Bulgarian Literature',
 'Canada',
 'Cats',
 'Chess',
 'Chick Lit',
 'Childrens',
 'China',
 'Chr

### Building the pipeline

We build a pipeline with the operations to apply to each fold during training:
* Oversampling (SMOTENC)
* Rounding integer variables
* One-hot encoding of the categorical variable `PriceFormat`
* Classifier (logistic regression)

In [43]:
def redondearVariables(X):
    # Work on a copy so the caller's DataFrame is not modified (e.g. at predict time)
    X = X.copy()
    variablesRedondeo = ["WordsTitle", "NumPages", "SagaNumber", "PrevBestSellAuthor"]
    # Iterate over the listed columns and round their values
    for v in variablesRedondeo:
        X[v] = np.round(X[v])
    return X
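
At prediction time the pipeline skips the resampling step, so the function wrapped by `FunctionTransformer` receives the DataFrame passed by the caller. The following is a minimal sketch, using a toy frame with made-up values and two hypothetical helper names (`round_in_place`, `round_on_copy`), of how rounding in place alters the caller's data while rounding on a copy does not:

```python
import numpy as np
import pandas as pd

# Subset of columns used only for this toy example
COLUMNS_TO_ROUND = ["WordsTitle", "NumPages"]

def round_in_place(X):
    # Assigns the rounded values directly on the object it receives
    for col in COLUMNS_TO_ROUND:
        X[col] = np.round(X[col])
    return X

def round_on_copy(X):
    # Copies first, so the object owned by the caller stays untouched
    X = X.copy()
    for col in COLUMNS_TO_ROUND:
        X[col] = np.round(X[col])
    return X

# Made-up values, not taken from the dataset
toy = pd.DataFrame({
    "WordsTitle": [1.4, 2.6],
    "NumPages": [310.2, 98.7],
    "Price": [9.99, 4.5],
})

safe_input = toy.copy()
rounded = round_on_copy(safe_input)

unsafe_input = toy.copy()
round_in_place(unsafe_input)

print("copy version left its input unchanged:", safe_input.equals(toy))
print("in-place version left its input unchanged:", unsafe_input.equals(toy))
```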

In [44]:
# Genre columns
columnas_generos = X_scaled_train.columns[13:]

# Categorical columns
categoricalColumns = ["BelongsSaga", "PriceFormat", "HasTwitter", "HasWikipedia"] + list(columnas_generos)

In [45]:
# Define the oversampler for mixed numerical and categorical features
smote = SMOTENC(categorical_features = categoricalColumns, random_state = SEED)

# Define the classifier
RL = LogisticRegression(random_state = SEED)

# Define the transformer that one-hot encodes the categorical variable 'PriceFormat'
column_transformer = ColumnTransformer([
    ('ohe', OneHotEncoder(), ['PriceFormat'])
], remainder='passthrough')

# Wrap the rounding function in a transformer
transformador_funcion = FunctionTransformer(func=redondearVariables)

# Build the pipeline
pipeline = Pipeline([
    ('smote', smote),
    ('redondear_variables', transformador_funcion),
    ('encoder', column_transformer),
    ('classifier', RL)
])


### Grid Search

In [46]:
# Define the hyperparameters to tune
param_grid = {
    'classifier__solver': ['saga'],
    'classifier__penalty': ['elasticnet'],
    'classifier__max_iter': [2000],
    'classifier__C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], 
    'classifier__l1_ratio': np.linspace(0.0, 1.0, 10),
    'classifier__fit_intercept': [True, False]
}

# Initialize the grid search
grid_search = GridSearchCV(
    estimator=pipeline, param_grid=param_grid, cv=kf,
    scoring=METRICS, refit="balanced_accuracy",
    return_train_score=True, n_jobs=-1, error_score="raise"
)

grid_search.fit(X_scaled_train, y_scaled_train)

# Results
cv_results = grid_search.cv_results_
best_params = grid_search.best_params_

In [47]:
best_params

{'classifier__C': 0.01,
 'classifier__fit_intercept': False,
 'classifier__l1_ratio': 0.0,
 'classifier__max_iter': 2000,
 'classifier__penalty': 'elasticnet',
 'classifier__solver': 'saga'}
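
The grid defined for this search combines 10 values of `C`, 10 values of `l1_ratio` and 2 intercept settings, so the search evaluates 200 candidates, each fitted once per fold, plus one final refit on the full training set. A small sketch of that count follows. It assumes 5 folds purely for illustration, since the notebook's `CV_FOLDS` is defined earlier and may differ:

```python
from itertools import product

# Same grid as in the search, written without NumPy
grid = {
    "solver": ["saga"],
    "penalty": ["elasticnet"],
    "max_iter": [2000],
    "C": [10.0 ** k for k in range(-5, 5)],       # 1e-5 ... 1e4
    "l1_ratio": [i / 9 for i in range(10)],       # same points as np.linspace(0.0, 1.0, 10)
    "fit_intercept": [True, False],
}

# Every combination of one value per hyperparameter is one candidate
n_candidates = len(list(product(*grid.values())))

n_folds = 5  # assumption for illustration; use the notebook's CV_FOLDS
total_fits = n_candidates * n_folds

print(f"candidates: {n_candidates}, fits during cross-validation: {total_fits}")
```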

In [48]:
# Convert the cross-validation results into a DataFrame
df_results = pd.DataFrame(grid_search.cv_results_)

# Select the row that matches the best parameters
filtered_row = df_results.loc[
    (df_results['param_classifier__C'] == best_params['classifier__C']) &
    (df_results['param_classifier__l1_ratio'] == best_params['classifier__l1_ratio']) &
    (df_results['param_classifier__max_iter'] == best_params['classifier__max_iter']) &
    (df_results['param_classifier__penalty'] == best_params['classifier__penalty']) &
    (df_results['param_classifier__solver'] == best_params['classifier__solver']) &
    (df_results['param_classifier__fit_intercept'] == best_params['classifier__fit_intercept'])
]

index_row = filtered_row.index[0]

# Log the results to MLflow
with mlflow.start_run():

    # Store the hyperparameter values
    for key, value in best_params.items():
        mlflow.log_param(key, value)

    # Log the mean, standard deviation and per-fold value of every metric
    for metric in METRICS:

        M = metric.replace(" ", "_")

        # Mean
        mlflow.log_metric(f"mean_train_{M}", df_results[f"mean_train_{M}"][index_row])
        mlflow.log_metric(f"mean_test_{M}", df_results[f"mean_test_{M}"][index_row])

        # Standard deviation
        mlflow.log_metric(f"std_train_{M}", df_results[f"std_train_{M}"][index_row])
        mlflow.log_metric(f"std_test_{M}", df_results[f"std_test_{M}"][index_row])

        for i in range(CV_FOLDS):

            # Training results for each fold
            mlflow.log_metric(f"train_{M}fold{i}", df_results[f"split{i}_train_{M}"][index_row])
            # Validation results for each fold
            mlflow.log_metric(f"test_{M}fold{i}", df_results[f"split{i}_test_{M}"][index_row])

    # Set tags that describe the purpose of this run
    mlflow.set_tag("TIPO", "RL_SMOTE_GRID_SEARCH_ALL_VARIABLES")
    
    mlflow.set_tag("variables", "311")
    mlflow.set_tag("transformación", "ESCALADO ROBUSTO")
    mlflow.set_tag("estrategia", "GRID SEARCH CV")
    mlflow.set_tag("smote", "SI")

    # Infer the model signature, which describes the model's input and output types
    signature = infer_signature(X_scaled_train, grid_search.best_estimator_.predict(X_scaled_train))

    # Log and register the model
    model_info = mlflow.sklearn.log_model(
        sk_model=grid_search,
        artifact_path="rf_model",
        signature=signature,
        input_example=X_scaled_train,
        registered_model_name="RL_SMOTE_GRID_SEARCH_ALL_VARIABLES",
    )

Successfully registered model 'RL_SMOTE_GRID_SEARCH_ALL_VARIABLES'.
Created version '1' of model 'RL_SMOTE_GRID_SEARCH_ALL_VARIABLES'.
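
The row of `cv_results_` that corresponds to the best parameters can also be read from the `best_index_` attribute of a fitted search, which avoids comparing every hyperparameter column by equality. The sketch below uses a toy dataset and a plain `LogisticRegression` rather than the notebook's pipeline, so the data, grid and scoring list are assumptions made only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data, unrelated to the notebook's dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1.0, 100.0]},
    scoring=["balanced_accuracy", "f1"],
    refit="balanced_accuracy",   # best_index_ refers to this metric
    cv=3,
)
search.fit(X_toy, y_toy)

# Position of the best candidate inside cv_results_
best_row = search.best_index_

# The parameters stored at that position are the ones reported as best_params_
same_params = search.cv_results_["params"][best_row] == search.best_params_
best_mean = search.cv_results_["mean_test_balanced_accuracy"][best_row]

print(best_row, same_params, best_mean)
```

With the searches in this notebook, `grid_search.best_index_` or `random_search.best_index_` could play the role of `index_row`.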


### Random Search CV

In [49]:
# Define the hyperparameter distributions to sample from
from scipy.stats import uniform
param_grid = {
    'classifier__solver': ['saga'],
    'classifier__penalty': ['elasticnet'],
    'classifier__max_iter': [2000],
    'classifier__C': uniform(loc=0.0001, scale=9999.9999),  # Uniform distribution over [0.0001, 10000]
    'classifier__l1_ratio': uniform(loc=0.0, scale=1.0),  # Uniform distribution over [0.0, 1.0]
    'classifier__fit_intercept': [True, False]
}
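
A uniform distribution over [0.0001, 10000] places almost all of its mass on large values of `C`, so the strongly regularized region explored by the grid search (for example `C = 0.01`) is rarely sampled. The sketch below uses only the standard library to compare it with a log-uniform draw over the same range. It illustrates the sampling behaviour and is not part of the original experiment:

```python
import math
import random

LOW, HIGH = 1e-4, 1e4
N_DRAWS = 10_000
rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Uniform draws over [LOW, HIGH], as in the distribution used for C
uniform_draws = [rng.uniform(LOW, HIGH) for _ in range(N_DRAWS)]

# Log-uniform draws: uniform in log-space, then mapped back with exp
log_uniform_draws = [
    math.exp(rng.uniform(math.log(LOW), math.log(HIGH))) for _ in range(N_DRAWS)
]

def share_below_one(draws):
    """Fraction of sampled values of C smaller than 1.0."""
    return sum(d < 1.0 for d in draws) / len(draws)

share_uniform = share_below_one(uniform_draws)
share_log_uniform = share_below_one(log_uniform_draws)

print(f"uniform:     {share_uniform:.4f} of draws below 1.0")
print(f"log-uniform: {share_log_uniform:.4f} of draws below 1.0")
```

If a log-scale search is preferred, one option would be `scipy.stats.loguniform(1e-4, 1e4)` as the distribution for `classifier__C`.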

In [50]:
# Define the randomized search
random_search = RandomizedSearchCV(
    estimator=pipeline, param_distributions=param_grid, 
    n_iter=200, cv=kf, 
    scoring= METRICS, 
    refit = "balanced_accuracy",
    return_train_score=True, n_jobs = -1
)

random_search.fit(X_scaled_train, y_scaled_train)

# Results
cv_results = random_search.cv_results_
best_params = random_search.best_params_

In [51]:
best_params

{'classifier__C': 1523.264510478191,
 'classifier__fit_intercept': True,
 'classifier__l1_ratio': 0.08665179135892231,
 'classifier__max_iter': 2000,
 'classifier__penalty': 'elasticnet',
 'classifier__solver': 'saga'}

In [52]:
# Convert the cross-validation results into a DataFrame
df_results = pd.DataFrame(cv_results)

# Select the row that matches the best parameters
filtered_row = df_results.loc[
    (df_results['param_classifier__C'] == best_params['classifier__C']) &
    (df_results['param_classifier__l1_ratio'] == best_params['classifier__l1_ratio']) &
    (df_results['param_classifier__max_iter'] == best_params['classifier__max_iter']) &
    (df_results['param_classifier__penalty'] == best_params['classifier__penalty']) &
    (df_results['param_classifier__solver'] == best_params['classifier__solver']) &
    (df_results['param_classifier__fit_intercept'] == best_params['classifier__fit_intercept'])
]

index_row = filtered_row.index[0]

# Log the results to MLflow
with mlflow.start_run():

    # Store the hyperparameter values
    for key, value in best_params.items():
        mlflow.log_param(key, value)

    # Log the mean, standard deviation and per-fold value of every metric
    for metric in METRICS:

        M = metric.replace(" ", "_")

        # Mean
        mlflow.log_metric(f"mean_train_{M}", df_results[f"mean_train_{M}"][index_row])
        mlflow.log_metric(f"mean_test_{M}", df_results[f"mean_test_{M}"][index_row])

        # Standard deviation
        mlflow.log_metric(f"std_train_{M}", df_results[f"std_train_{M}"][index_row])
        mlflow.log_metric(f"std_test_{M}", df_results[f"std_test_{M}"][index_row])

        for i in range(CV_FOLDS):

            # Training results for each fold
            mlflow.log_metric(f"train_{M}fold{i}", df_results[f"split{i}_train_{M}"][index_row])
            # Validation results for each fold
            mlflow.log_metric(f"test_{M}fold{i}", df_results[f"split{i}_test_{M}"][index_row])

    # Set tags that describe the purpose of this run
    mlflow.set_tag("TIPO", "RL_SMOTE_RANDOM_SEARCH_ALL_VARIABLES")
    mlflow.set_tag("variables", "311")
    mlflow.set_tag("transformación", "ESCALADO ROBUSTO")
    mlflow.set_tag("estrategia", "RANDOM SEARCH CV")
    mlflow.set_tag("smote", "SI")

    # Infer the model signature, which describes the model's input and output types
    signature = infer_signature(X_scaled_train, random_search.best_estimator_.predict(X_scaled_train))

    # Log and register the model
    model_info = mlflow.sklearn.log_model(
        sk_model=random_search,
        artifact_path="rf_model",
        signature=signature,
        input_example=X_scaled_train,
        registered_model_name="RL_SMOTE_RANDOM_SEARCH_ALL_VARIABLES",
    )

Successfully registered model 'RL_SMOTE_RANDOM_SEARCH_ALL_VARIABLES'.
Created version '1' of model 'RL_SMOTE_RANDOM_SEARCH_ALL_VARIABLES'.
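
The grid search and the randomized search log their metrics with the same loop. One way to avoid repeating it would be a small helper that gathers the values for one row of `cv_results_` into a dictionary. The helper name `collect_cv_metrics` and the toy results below are hypothetical, and the key names follow the ones used in the logging cells:

```python
def collect_cv_metrics(cv_results, row, metric_names, n_folds):
    """Build a flat {name: value} dict with the metrics of one cv_results_ row."""
    collected = {}
    for metric in metric_names:
        m = metric.replace(" ", "_")

        # Mean and standard deviation over the folds
        for stat in ("mean", "std"):
            for split in ("train", "test"):
                key = f"{stat}_{split}_{m}"
                collected[key] = float(cv_results[key][row])

        # Individual value of each fold
        for i in range(n_folds):
            for split in ("train", "test"):
                collected[f"{split}_{m}fold{i}"] = float(
                    cv_results[f"split{i}_{split}_{m}"][row]
                )
    return collected

# Toy stand-in for cv_results_ with two candidates and two folds (made-up numbers)
fake_results = {
    "mean_train_balanced_accuracy": [0.80, 0.90],
    "mean_test_balanced_accuracy": [0.70, 0.75],
    "std_train_balanced_accuracy": [0.01, 0.02],
    "std_test_balanced_accuracy": [0.03, 0.04],
    "split0_train_balanced_accuracy": [0.79, 0.88],
    "split0_test_balanced_accuracy": [0.67, 0.71],
    "split1_train_balanced_accuracy": [0.81, 0.92],
    "split1_test_balanced_accuracy": [0.73, 0.79],
}

metrics = collect_cv_metrics(
    fake_results, row=1, metric_names=["balanced_accuracy"], n_folds=2
)
for name, value in metrics.items():
    print(name, value)
```

Inside a `with mlflow.start_run():` block, the resulting dictionary could be passed to `mlflow.log_metrics(metrics)` in a single call.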
