<a href="https://colab.research.google.com/github/JCaballerot/Recommender_Systems/blob/main/Hybrid_Recommender/Book_Crossing_Hybrid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 align=center><font size = 5> Hybrid Recommender</font></h1>

---

<center>
  <img src="https://storage.googleapis.com/kaggle-datasets-images/1661575/2726067/684ac0c4c14cb46d1047ccb620b45cac/dataset-cover.jpg?t=2021-10-21-03-18-09" width="800" height="300">
</center>


## Notebook Objectives

1. Load and preprocess a dataset.
2. Build a recommender system based on hybrid methods.
3. Evaluate the system's performance.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item31">Context</a>  
2. <a href="#item32">Download and prepare the dataset</a>  
3. <a href="#item34">Model training</a>  
4. <a href="#item34">Model validation</a>  

</font>
</div>

### 1. Context


The "Book-Crossing" dataset (also known as BX) is a collection of data about books and book reviews. It captures how users interact with books and the ratings they give, and it is widely used in recommender-system applications.


<b>Data description</b>


The Book-Crossing dataset contains information about:

---


<b>Users (BX-Users):</b>

Contains user information. The fields include:

* User-ID: A unique identifier for each user.
* Location: The user's location.
* Age: The user's age.




<b>Books (BX-Books):</b>

Contains book information. The fields include:

* ISBN: The book's ISBN, which serves as a unique identifier.
* Book-Title: The title of the book.
* Book-Author: The author of the book.
* Year-Of-Publication: The year the book was published.
* Publisher: The publisher of the book.
* Additional book information, such as the cover image URLs that are dropped later in this notebook.




<b>Ratings (BX-Book-Ratings):</b>

Contains the book ratings. The fields include:

* User-ID: The identifier of the user who gave the rating.
* ISBN: The ISBN of the rated book.
* Book-Rating: The book's rating, from 1 to 10 when given explicitly; 0 marks an implicit interaction.




---



<strong>See this [link](https://www.kaggle.com/datasets/syedjaferk/book-crossing-dataset) to read more about the Book-Crossing data source.</strong>

### 2. Download and prepare the dataset

In [810]:
# Download Book-Crossing Dataset
!curl -o dataset.zip "http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip"
!unzip dataset.zip
!ls -la

In [810]:
# Same download from a Google Drive link (alternative to the URL above)
!curl -L -o dataset.zip "https://drive.google.com/uc?id=1P7_nW6mZAVgf7sDqdm3SaR9_ZaCOeyjI&export=download&authuser=0"
!unzip -o dataset.zip
!ls -la

In [466]:
# Main libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore") # Turn off warnings


In [810]:
ratings = pd.read_csv("BX-Book-Ratings.csv", sep=";", encoding="ISO-8859-1")
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
books   = pd.read_csv("BX-Books.csv",        sep=";", encoding="ISO-8859-1", on_bad_lines="skip")
users   = pd.read_csv("BX-Users.csv",        sep=";", encoding="ISO-8859-1")

In [810]:
users.head()

In [810]:
books.head()

<b>Explicit ratings</b>: Expressed on a scale from 1 to 10 (10 being the highest); they represent a rating given explicitly by the user.

<b>Implicit ratings</b>: Expressed as 0, meaning there is no explicit rating. In this dataset, a rating of 0 indicates an implicit interaction with the book (for example, the user bought or read it) without an explicit score for its content.
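
As a minimal sketch of this distinction, the toy table below separates explicit from implicit feedback. The rows are made up for illustration; only the `Book-Rating` column name and the 0 versus 1-10 convention come from the dataset description.

```python
import pandas as pd

# Toy ratings table (hypothetical rows; same columns as BX-Book-Ratings)
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 10, 0],
})

# Explicit feedback: the user scored the book from 1 to 10
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]

# Implicit feedback: an interaction was recorded but no score was given
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]

print(len(explicit), len(implicit))
```

Keeping the two subsets apart matters because a 0 is not a "very low" score; treating it as one would distort any model trained on the rating value.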

In [810]:
ratings.head()

In [471]:
print("  Users: {} \n  Books: {}\n  Ratings: {}".format(len(users), len(books), len(ratings)))


  Users: 278858 
  Books: 271360
  Ratings: 1149780


In [472]:
users.columns = users.columns.str.lower().str.replace('-', '_')
books.columns = books.columns.str.lower().str.replace('-', '_')
ratings.columns = ratings.columns.str.lower().str.replace('-', '_')

### 3. Merging the data

In [473]:
# We analyze only the explicit user-item ratings
ratings = ratings[ratings.book_rating > 0]

In [810]:
ratings.head()

In [810]:
# Join the tables to obtain a single dataset

data = pd.merge(ratings, users, on = 'user_id', how = 'left')
data = pd.merge(data,    books, on = 'isbn', how = 'left')
data.drop(columns = ['image_url_s', 'image_url_m', 'image_url_l'], inplace = True)

data.head()

In [810]:
# Seaborn style
sns.set(style="whitegrid")
# Figure and axes
plt.figure(figsize=(6, 3))
sns.histplot(data.book_rating, bins=30, kde=False, color="skyblue")

In [477]:
# Handle the publication-year field (non-numeric values become NaN)
data.year_of_publication = pd.to_numeric(data.year_of_publication, errors='coerce')


In [478]:
# Example of outlier removal: keep books published between 1964 and 2004
lower_threshold = 1964
upper_threshold = 2004

# .copy() avoids assigning into a view of the original DataFrame
data = data[(data['year_of_publication'] >= lower_threshold) & (data['year_of_publication'] <= upper_threshold)].copy()
data.year_of_publication = data.year_of_publication.astype(int)

In [479]:
# Book age in years (relative to 2008)
data['antiguedad'] = 2008 - data.year_of_publication

In [810]:
# Seaborn style
sns.set(style="whitegrid")

# Figure and axes
plt.figure(figsize=(6, 3))

# Histogram
sns.histplot(data.antiguedad, bins=30, kde=False, color="skyblue")

# Title and axis labels
plt.title('Distribution of book age', fontsize=12)
plt.xlabel('Book age (years)', fontsize=10)
plt.ylabel('Frequency', fontsize=10)

# Show the histogram
plt.show()

In [481]:
books_list = data.groupby('book_title')['user_id'].count().reset_index()
books_list.sort_values(by = 'user_id', ascending = False, inplace = True)

print(f"{len(books_list)} distinct books; we will keep the most popular ones so as not to overload our RecSys")

132690 distinct books; we will keep the most popular ones so as not to overload our RecSys


In [810]:
books_list

In [810]:
books_list[:500]

In [484]:
# Select the 500 most popular books
pop_books = books_list[:500].book_title.tolist()

In [485]:
data_v2 = data[data.book_title.isin(pop_books)].copy()  # copy so new columns can be added safely

In [810]:
data_v2.head()

We will binarize the target variable so that the model learns the probability that the user has an affinity for the book. This strategy is widely used in RecSys applications, but keep in mind that we could also predict the user's rating directly.
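
A minimal sketch of this binarization on a handful of made-up ratings, using the same "strictly above 7" cut-off that the notebook applies. The vectorized comparison is an alternative to `apply` with a lambda, not a change in logic.

```python
import pandas as pd

# Hypothetical explicit ratings on the 1-10 scale
book_rating = pd.Series([3, 7, 8, 10, 5])

# 1 when the rating is strictly above 7, 0 otherwise
target = (book_rating > 7).astype(int)

print(target.tolist())  # [0, 0, 1, 1, 0]
print(target.mean())    # share of positives: 0.4
```

The mean of the binary target is the positive rate, which is a quick way to check how imbalanced the classes are.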

In [487]:
data_v2['target'] = data_v2.book_rating.apply(lambda x: 1 if x > 7 else 0)

In [810]:
data_v2.head()

In [810]:
# Figure and axes
plt.figure(figsize=(6, 3))
# Inspect the target distribution
sns.countplot(x='target', data = data_v2, palette = 'hls')
plt.title('Is the data imbalanced?', fontsize=12)


### 4. Data sampling

In [490]:
# Data sampling: 60% train; the remaining 40% is split evenly into watch (validation) and test
from sklearn.model_selection import train_test_split

train, test = train_test_split(data_v2,
                               stratify = data_v2.target, # Remember to stratify to avoid bias during sampling
                               train_size = 0.6,
                               random_state = 123)

watch, test = train_test_split(test,
                               stratify = test.target, # Remember to stratify to avoid bias during sampling
                               train_size = 0.5,
                               random_state = 123)

# Sampling can also be done by user or by masking, as in previous exercises.
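
The last comment in the cell above mentions sampling by user. A rough sketch of that idea follows, assuming a toy interactions table and a 60% share of users for training (both are illustrative choices, not the notebook's actual split). Every interaction of a given user lands on the same side of the split.

```python
import numpy as np
import pandas as pd

# Toy interactions (hypothetical data)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 5, 5, 6],
    "target":  [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
})

# Shuffle the distinct users reproducibly
rng = np.random.default_rng(123)
users = rng.permutation(toy["user_id"].unique())

# 60% of the users go to train, the rest are held out
n_train = int(round(0.6 * len(users)))
train_users = set(users[:n_train])

toy_train = toy[toy["user_id"].isin(train_users)]
toy_holdout = toy[~toy["user_id"].isin(train_users)]

print(toy_train["user_id"].nunique(), toy_holdout["user_id"].nunique())
```

Splitting by user measures how well the model generalizes to users it has never seen, whereas the row-level stratified split above can place the same user in both train and test.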

# Content-Based Approach

### 5. Feature processing

Location variable

In [809]:
train.head()

In [809]:
temp = train.groupby('location')['user_id'].count().reset_index()
temp.sort_values(by = 'user_id', ascending = False)

In [493]:
# Function that extracts the last n elements and joins them with ', '
def extract_last_n(location, n):
    parts = location.split(', ')
    return ', '.join(parts[-n:])

# Build the aggregated location levels
train['location_level2'] = train['location'].apply(lambda x: extract_last_n(x, 2))
train['location_level3'] = train['location'].apply(lambda x: extract_last_n(x, 1))

test['location_level2'] = test['location'].apply(lambda x: extract_last_n(x, 2))
test['location_level3'] = test['location'].apply(lambda x: extract_last_n(x, 1))

watch['location_level2'] = watch['location'].apply(lambda x: extract_last_n(x, 2))
watch['location_level3'] = watch['location'].apply(lambda x: extract_last_n(x, 1))
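
To make the two aggregation levels concrete, here is the same helper applied to a single made-up location string (the value is hypothetical; the function body mirrors the cell above).

```python
def extract_last_n(location, n):
    # Keep the last n comma-separated parts of the location string
    parts = location.split(', ')
    return ', '.join(parts[-n:])

example = "seattle, washington, usa"  # hypothetical location value

level2 = extract_last_n(example, 2)
level3 = extract_last_n(example, 1)

print(level2)  # washington, usa
print(level3)  # usa
```

When a location has fewer parts than `n`, the slice simply returns everything available, so the function does not fail on short strings.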

In [809]:
train.head()

In [809]:
temp = train.groupby('location_level2')['user_id'].count().reset_index()
temp = temp[temp.user_id > 30]
temp.sort_values(by = 'user_id', ascending = False)

In [809]:
temp = train.groupby('location_level3')['user_id'].count().reset_index()
temp = temp[temp.user_id > 30]
temp.sort_values(by = 'user_id', ascending = False)

In [497]:
# Create a mixed location variable: state level for the USA, country level elsewhere
train['location_f'] = train.apply(lambda row: row['location_level2'] if row['location_level3'] == 'usa' else row['location_level3'], axis=1)
test['location_f']  = test.apply(lambda row: row['location_level2'] if row['location_level3'] == 'usa' else row['location_level3'], axis=1)
watch['location_f'] = watch.apply(lambda row: row['location_level2'] if row['location_level3'] == 'usa' else row['location_level3'], axis=1)


In [809]:
train.head()

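**Encoding**

Encoding categorical variables converts text categories into numbers that machine learning algorithms can use efficiently. Here we use target encoding, which replaces each category with a smoothed estimate of the target mean for that category, fitted on the training sample only.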
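
The sketch below illustrates the idea behind target encoding with a simple additive smoothing toward the global mean. The toy data, the smoothing strength `m` and the formula are assumptions for illustration; `category_encoders.TargetEncoder` applies its own smoothing scheme, so its numbers will differ.

```python
import pandas as pd

# Toy training data (hypothetical)
toy = pd.DataFrame({
    "publisher": ["A", "A", "A", "B", "B", "C"],
    "target":    [1,   1,   0,   0,   0,   1],
})

prior = toy["target"].mean()  # global positive rate
m = 2                         # smoothing strength (assumed value)

# Per-category sum and count of the target
stats = toy.groupby("publisher")["target"].agg(["sum", "count"])

# Smoothed mean: small categories are pulled toward the global rate
encoding = (stats["sum"] + m * prior) / (stats["count"] + m)

# Apply the mapping; unseen categories fall back to the global rate
new_values = pd.Series(["A", "C", "Z"])
encoded = new_values.map(encoding).fillna(prior)

print(encoding.to_dict())
print(encoded.tolist())
```

Fitting the mapping on the training sample and only applying it to the other samples keeps target information from leaking into validation and test.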


In [809]:
train.head()

In [500]:
catergory_features = ['book_title', 'book_author', 'publisher', 'location_f']

In [501]:
%%capture
!pip3 install category_encoders

In [502]:
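# Apply category encoders
from category_encoders import TargetEncoder

# handle_unknown accepts 'error', 'return_nan' or 'value' in category_encoders;
# 'value' encodes unseen categories with the global target mean
encoder = TargetEncoder(handle_unknown = 'value',
                        handle_missing = 'value',
                        min_samples_leaf = 30)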

encoder.fit(train[catergory_features].astype('category'), train['target'])


In [503]:
# Apply the transformations to the variables

train[[x + '_coded' for x in catergory_features]] = encoder.transform(train[catergory_features].astype('category'))
test[[x + '_coded' for x in catergory_features]]  = encoder.transform(test[catergory_features].astype('category'))
watch[[x + '_coded' for x in catergory_features]] = encoder.transform(watch[catergory_features].astype('category'))


In [809]:
train.head()

### 6. XGBoost




In [505]:
import xgboost as xgb
from sklearn.metrics import *

In [506]:
features = ['age', 'antiguedad', 'book_title_coded', 'book_author_coded', 'publisher_coded', 'location_f_coded']

In [507]:
# Define the parameters for the grid search

param_grid = {'objective': ['binary:logistic'],
              'booster' : ['gbtree'],
              'learning_rate': [0.01, 0.05, 0.1],
              'max_depth': [3, 5, 7],
              'colsample_bytree': [0.7, 1],
              'subsample': [0.7, 1]}


In [809]:
%%time
from sklearn.model_selection import GridSearchCV

# Create the classifier. Since XGBoost 1.6, early stopping and the evaluation
# metric are set on the estimator; recent releases no longer accept them in fit()
xgBoost = xgb.XGBClassifier(n_estimators = 500,
                            early_stopping_rounds = 10,
                            eval_metric = "auc")


# Create the GridSearchCV object
grid_search = GridSearchCV(xgBoost,
                           param_grid,
                           scoring = 'roc_auc',  # ROC AUC from predicted probabilities (sklearn.metrics.auc is not a scorer)
                           cv = 3,  # Number of cross-validation folds
                           verbose = 2,  # Output verbosity
                           n_jobs = -1  # Use all available cores
                          )

# Run the parameter search
grid_search.fit(train[features],
                train.target,
                eval_set=[(watch[features], watch.target)],
                verbose = True)



In [509]:
# Get the best model
best_model = grid_search.best_estimator_

# Optionally, extract and display the best parameters found
best_params = grid_search.best_params_
print(f"Best parameters found: {best_params}")


Best parameters found: {'booster': 'gbtree', 'colsample_bytree': 0.7, 'learning_rate': 0.01, 'max_depth': 3, 'objective': 'binary:logistic', 'subsample': 0.7}


In [510]:
%%capture
!pip install --upgrade xgboost

In [809]:
# Train the final model

# Early stopping and the evaluation metric are configured on the estimator (XGBoost >= 1.6)
xgBoost = xgb.XGBClassifier(n_estimators = 500,
                            early_stopping_rounds = 10,
                            eval_metric = "auc",
                            **best_params)

xgBoost.fit(train[features],
            train.target,
            eval_set=[(train[features], train.target), (watch[features], watch.target)],
            verbose=True)


# Extract the evaluation results
results = xgBoost.evals_result()


In [809]:
epochs = len(results['validation_0']['auc'])
x_axis = range(0, epochs)

# Set the figure size
fig, ax = plt.subplots(figsize=(8, 4))

ax.plot(x_axis, results['validation_0']['auc'], label='Train')
ax.plot(x_axis, results['validation_1']['auc'], label='Watch')

ax.set_ylim([0.6, 0.7])  # Zoom the y-axis into the AUC range of interest

ax.legend()
plt.ylabel('AUC')
plt.title('XGBoost AUC')
plt.show()

In [809]:
# Set the figure size
fig, ax = plt.subplots(figsize=(5, 3))

# Plot the feature importance
xgb.plot_importance(xgBoost, importance_type="total_gain", ax=ax, title="Feature Importance (Gain)", show_values=False)

# Show the plot
plt.show()

### 7. Model evaluation

In [514]:
from scipy.stats import ks_2samp

# Define additional metrics
def gini(y_true, y_score):
    auc = roc_auc_score(y_true, y_score)
    return 2*auc - 1

def ks_statistic(y_true, y_score):
    return ks_2samp(y_score[y_true == 1], y_score[y_true == 0]).statistic
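
To show what these two metrics measure, the sketch below recomputes them by hand on six made-up observations, without `roc_auc_score` or `ks_2samp`. The helper names `auc_pairwise` and `ks_by_hand` are hypothetical and exist only for this illustration.

```python
def auc_pairwise(y_true, y_score):
    # AUC = share of (positive, negative) pairs ranked correctly; ties count as half
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_by_hand(y_true, y_score):
    # KS = largest gap between the score CDFs of positives and negatives
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]

    def ecdf(sample, t):
        return sum(s <= t for s in sample) / len(sample)

    return max(abs(ecdf(pos, t) - ecdf(neg, t)) for t in set(y_score))

# Hypothetical labels and scores
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc_value = auc_pairwise(y_true, y_score)
gini_value = 2 * auc_value - 1
ks_value = ks_by_hand(y_true, y_score)

print(round(auc_value, 4), round(gini_value, 4), round(ks_value, 4))
```

Both metrics depend only on how the scores rank the observations, so they are unaffected by the 0.5 threshold used for accuracy, precision and recall.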

In [515]:
# Model predictions
train['prediction_xgboost'] = xgBoost.predict_proba(train[features])[:, 1]
test['prediction_xgboost']  = xgBoost.predict_proba(test[features])[:, 1]
watch['prediction_xgboost'] = xgBoost.predict_proba(watch[features])[:, 1]


In [579]:

def model_eval_metrics(prediction = 'prediction'):
  samples = {'Train': train, 'Test': test, 'Watch': watch}

  metrics = [
      ("Accuracy", accuracy_score),
      ("Precision", precision_score),
      ("Recall", recall_score),
      ("F1 Score", f1_score),
      ("AUC-ROC", roc_auc_score),
      ("Gini", gini),
      ("KS Statistic", ks_statistic),
      ("Jaccard", jaccard_score)
  ]

  # Metrics computed on probabilities; the rest are computed on 0/1 labels
  score_based = {"AUC-ROC", "Gini", "KS Statistic"}

  rows = []
  for metric_name, metric_func in metrics:
      row = {'Metric': metric_name}
      for sample_name, df in samples.items():
          if metric_name in score_based:
              y_pred = df[prediction]
          else:  # Label metrics: threshold the probability at 0.5
              y_pred = (df[prediction] > 0.5).astype(int)
          row[sample_name] = metric_func(df['target'], y_pred)
      rows.append(row)

  pd.set_option('display.float_format', '{:.2f}'.format)

  # DataFrame.append was removed in pandas 2.0, so the table is built from a list of rows
  return pd.DataFrame(rows, columns=['Metric', 'Train', 'Test', 'Watch'])

In [809]:
XGBoost_Results = model_eval_metrics(prediction = 'prediction_xgboost')
XGBoost_Results

### 8. ANN

In [582]:
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler


In [583]:
# Standardization

scaler = StandardScaler()
train_std = scaler.fit_transform(train[features].fillna(0))
watch_std  = scaler.transform(watch[features].fillna(0))
test_std  = scaler.transform(test[features].fillna(0))
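
As a rough illustration of what `fit_transform` and `transform` do here, the sketch below standardizes a tiny made-up matrix by hand. The arrays are hypothetical; the point is that the mean and standard deviation come from the training data only and are then reused for any other sample.

```python
import numpy as np

train_x = np.array([[20.0, 1.0], [30.0, 3.0], [40.0, 5.0]])  # hypothetical training features
new_x = np.array([[50.0, 3.0]])                              # hypothetical unseen row

# Statistics are learned from the training data only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# The same statistics are applied to both samples
train_z = (train_x - mean) / std
new_z = (new_x - mean) / std

print(train_z.mean(axis=0), train_z.std(axis=0))
print(new_z)
```

The unseen row is not forced to have zero mean; it is simply expressed in units of the training distribution.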


In [584]:
# Network architecture
tf.random.set_seed(123)

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', kernel_initializer='glorot_uniform', input_shape=(train_std.shape[1],)),
    #keras.layers.Dropout(0.5),  # Dropout layer
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(8,  activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])


In [585]:
# Compile the model
#optimizer = keras.optimizers.Adam(learning_rate=0.01)  # Adjust the learning rate as needed
optimizer = keras.optimizers.Adagrad(learning_rate = 0.05)

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])


In [809]:

early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=0.0001)

# Train the model with early stopping and learning-rate reduction
history = model.fit(train_std, train.target, epochs=100, batch_size=64, validation_data=(watch_std, watch.target), verbose=1,
                    callbacks=[early_stopping, reduce_lr])


In [809]:
# Model predictions (reuse the standardized matrices; ravel() flattens the (n, 1) output into one column)
train['prediction_rna'] = model.predict(train_std).ravel()
test['prediction_rna']  = model.predict(test_std).ravel()
watch['prediction_rna'] = model.predict(watch_std).ravel()


In [809]:
ANN_Results = model_eval_metrics(prediction = 'prediction_rna')
ANN_Results


# Collaborative Filtering Approach

### 9. K-Nearest Neighbors

In [683]:
from sklearn.neighbors import KNeighborsClassifier

# Create a KNeighborsClassifier object
knn = KNeighborsClassifier(n_neighbors = 100)
knn.fit(train_std, train.target)


In [684]:
# Model predictions (train_std, test_std and watch_std are already standardized, so they are not scaled again)
train['prediction_knn'] = knn.predict_proba(train_std)[:, 1]
test['prediction_knn']  = knn.predict_proba(test_std)[:, 1]
watch['prediction_knn'] = knn.predict_proba(watch_std)[:, 1]

In [809]:
KNN_Results = model_eval_metrics(prediction = 'prediction_knn')
KNN_Results


### 10. Ensemble methods

 <b> Output Fusion </b>

In [809]:
test[['prediction_xgboost', 'prediction_rna', 'prediction_knn']].head()

In [689]:
# Alternative: arithmetic mean of the XGBoost and ANN outputs
#train['prediction_output_fusion'] = (train.prediction_xgboost + train.prediction_rna)/2
#test['prediction_output_fusion']  = (test.prediction_xgboost + test.prediction_rna)/2
#watch['prediction_output_fusion'] = (watch.prediction_xgboost + watch.prediction_rna)/2

# Harmonic mean of the three model outputs
train['prediction_output_fusion'] = 3/(1/train.prediction_xgboost + 1/train.prediction_rna + 1/train.prediction_knn)
test['prediction_output_fusion']  = 3/(1/test.prediction_xgboost  + 1/test.prediction_rna + 1/test.prediction_knn)
watch['prediction_output_fusion'] = 3/(1/watch.prediction_xgboost + 1/watch.prediction_rna + 1/watch.prediction_knn)
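
The expression `3/(1/a + 1/b + 1/c)` is the harmonic mean of the three scores. The sketch below compares it with the arithmetic mean on made-up scores; the helper names are hypothetical, and returning 0 when a score is exactly 0 is an assumption that matches the limiting value of the formula.

```python
def harmonic_mean(scores):
    # n / sum(1/p); treated as 0 when any score is 0 (the limit of the formula)
    if any(p == 0 for p in scores):
        return 0.0
    return len(scores) / sum(1.0 / p for p in scores)

def arithmetic_mean(scores):
    return sum(scores) / len(scores)

agree = [0.8, 0.8, 0.8]      # hypothetical: all models agree
disagree = [0.9, 0.9, 0.1]   # hypothetical: one model dissents

print(harmonic_mean(agree), arithmetic_mean(agree))
print(harmonic_mean(disagree), arithmetic_mean(disagree))
```

The harmonic mean is dominated by the lowest score, so a pair is ranked highly only when all three models agree that it is a good match. This makes the fusion more conservative than a plain average.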

In [809]:
output_fusion_Results = model_eval_metrics(prediction = 'prediction_output_fusion')
output_fusion_Results

In [809]:
XGBoost_Results

In [809]:
ANN_Results

In [809]:
KNN_Results

 <b> Weighted ensemble recommender </b>

In [697]:

train['prediction_weighted_ensemble'] = (0.27*train.prediction_xgboost + 0.28*train.prediction_rna + 0.19*train.prediction_knn)/(0.27 + 0.28 + 0.19)
test['prediction_weighted_ensemble']  = (0.27*test.prediction_xgboost  + 0.28*test.prediction_rna  + 0.19*test.prediction_knn)/(0.27 + 0.28 + 0.19)
watch['prediction_weighted_ensemble'] = (0.27*watch.prediction_xgboost + 0.28*watch.prediction_rna + 0.19*watch.prediction_knn)/(0.27 + 0.28 + 0.19)
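
A small sketch of the weighted average for a single user-book pair. The weights are the ones used in the cell above; the three scores are made up, and `weighted_average` is a hypothetical helper.

```python
def weighted_average(scores, weights):
    # Dividing by the sum of the weights keeps the result on the probability scale
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

weights = [0.27, 0.28, 0.19]  # XGBoost, ANN, KNN weights from the cell above
scores = [0.70, 0.60, 0.40]   # hypothetical predictions for one user-book pair

blended = weighted_average(scores, weights)
print(round(blended, 4))
```

Because the weights are normalized, the blended score always lies between the lowest and highest individual prediction.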


In [809]:
weighted_ensemble_Results = model_eval_metrics(prediction = 'prediction_weighted_ensemble')
weighted_ensemble_Results

 <b> Weighted ensemble recommender (Logistic Regression) </b>

In [809]:
import statsmodels.api as sm

# Fit and summarize a logistic regression (Logit) model
mod = sm.Logit(train.target, sm.add_constant(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']]))
res = mod.fit()
print(res.summary())


In [240]:
# Coefficients hard-coded from a Logit summary; the sigmoid maps the linear score (log-odds) to a probability
train['prediction_linear_ensemble'] = 1/(1 + np.exp(-(1.8720*train.prediction_xgboost + 4.5743*train.prediction_rna - 3.5027)))
test['prediction_linear_ensemble']  = 1/(1 + np.exp(-(1.8720*test.prediction_xgboost + 4.5743*test.prediction_rna - 3.5027)))
watch['prediction_linear_ensemble'] = 1/(1 + np.exp(-(1.8720*watch.prediction_xgboost + 4.5743*watch.prediction_rna - 3.5027)))
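
A Logit model's linear combination of its inputs is a log-odds value, not a probability, so it has to pass through the sigmoid before it can be compared with a 0.5 threshold. The sketch below works this out for one made-up pair of base predictions, using the coefficients that appear in the cell above.

```python
import math

def sigmoid(z):
    # Maps a log-odds value to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients taken from the cell above
const, b_xgb, b_ann = -3.5027, 1.8720, 4.5743

# Hypothetical base-model predictions for one user-book pair
p_xgb, p_ann = 0.5, 0.5

log_odds = const + b_xgb * p_xgb + b_ann * p_ann
probability = sigmoid(log_odds)

print(round(log_odds, 5), round(probability, 4))
```

The sigmoid is monotonic, so ranking metrics such as AUC, Gini and KS are the same for the log-odds and for the probability; only threshold-based metrics change.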


In [809]:
# Stored under its own name so the previous weighted-ensemble results are not overwritten
linear_ensemble_Results = model_eval_metrics(prediction = 'prediction_linear_ensemble')
linear_ensemble_Results

### 11. Meta-level methods

<b> Stacking methods </b>

Random Forest

In [728]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators = 100,
                             max_depth = 6,
                             min_samples_leaf = 0.05,
                             oob_score = True,
                             verbose = 1,
                             n_jobs = 12,
                             random_state = 123)



In [809]:
%%time
rfc = rfc.fit(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']], train.target)


In [809]:
importances = pd.DataFrame({'features' : ['prediction_xgboost', 'prediction_rna', 'prediction_knn'],
                            'importance' : rfc.feature_importances_}).sort_values('importance', ascending = False)

importances.loc[importances.importance > 0]

In [809]:
# Get the predicted probabilities
train['prediction_stacking_rfc'] = rfc.predict_proba(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']])[:, 1]
watch['prediction_stacking_rfc'] = rfc.predict_proba(watch[['prediction_xgboost', 'prediction_rna', 'prediction_knn']])[:, 1]
test['prediction_stacking_rfc']  = rfc.predict_proba(test[['prediction_xgboost', 'prediction_rna', 'prediction_knn']])[:, 1]


In [809]:
stacking_rfc_Results = model_eval_metrics(prediction = 'prediction_stacking_rfc')
stacking_rfc_Results

XGBoost

In [809]:
param_grid = {
    'objective': ['binary:logistic'],
    'booster': ['gbtree'],
    'learning_rate': [0.1, 0.05],
    'max_depth': [4, 5, 6, 7, 8],

    'colsample_bytree': [0.5, 0.7, 0.9, 1],
    'subsample': [0.5, 0.7, 0.9, 1],
    'min_child_weight': [0.05, 30, 100],

    'gamma': [0, 0.2, 0.4],

    'eval_metric': ["auc"],
    'random_state' : [123]}


from sklearn.model_selection import GridSearchCV

# Early stopping is configured on the estimator (XGBoost >= 1.6)
xgBoost = xgb.XGBClassifier(n_estimators = 500, early_stopping_rounds = 10)

grid_search = GridSearchCV(
    xgBoost,
    param_grid,
    scoring='roc_auc',  # ROC AUC computed from predicted probabilities, not hard labels
    cv = 3,  # Consider increasing the number of folds
    n_jobs=-1,
    return_train_score=True
)

# Fit grid_search, using the watch sample as the early-stopping evaluation set
grid_search.fit(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']], train.target,
                eval_set=[(watch[['prediction_xgboost', 'prediction_rna', 'prediction_knn']], watch.target)],verbose=True)


In [809]:
# Get the best model
best_model = grid_search.best_estimator_

# Optionally, extract and display the best parameters found
best_params = grid_search.best_params_
print(f"Best parameters found: {best_params}")

In [703]:
results = pd.DataFrame(grid_search.cv_results_)
selected_columns = [col for col in results.columns if 'param_' in col] + ['mean_train_score', 'mean_test_score']
final_results = results[selected_columns]
final_results.rename(columns={'mean_train_score': 'AUC_Train', 'mean_test_score': 'AUC_Test'}, inplace=True)
final_results = final_results[final_results.AUC_Test > 0].sort_values(by = 'AUC_Test', ascending = False)

In [809]:
final_results.head()

In [710]:
import xgboost as xgb

# Define the XGBoost parameters
params = {'objective': 'binary:logistic',
          'booster': 'gbtree',

          'max_depth': 6,
          'colsample_bytree' : 0.7,
          'subsample': 0.7,
          'eval_metric': 'auc',
          'min_child_weight': 0.05,

          'gamma': 0.4,
          'learning_rate': 0.05}

In [809]:
# Train the final model

# Early stopping is configured on the estimator (XGBoost >= 1.6)
xgBoost = xgb.XGBClassifier(n_estimators = 500, early_stopping_rounds = 10, **params)

xgBoost.fit(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']],
            train.target,
            eval_set = [(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']], train.target),
                        (watch[['prediction_xgboost', 'prediction_rna', 'prediction_knn']], watch.target)],
            verbose = True)


# Extract the evaluation results
results = xgBoost.evals_result()


In [712]:
# Get the predicted probabilities
train['prediction_stacking_xgboost'] = xgBoost.predict_proba(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']])[:, 1]
watch['prediction_stacking_xgboost'] = xgBoost.predict_proba(watch[['prediction_xgboost', 'prediction_rna', 'prediction_knn']])[:, 1]
test['prediction_stacking_xgboost']  = xgBoost.predict_proba(test[['prediction_xgboost', 'prediction_rna', 'prediction_knn']])[:, 1]


In [809]:
stacking_xgboost_Results = model_eval_metrics(prediction = 'prediction_stacking_xgboost')
stacking_xgboost_Results

In [280]:
%%capture
!pip3 install shap

In [716]:
import shap

explainer = shap.Explainer(xgBoost,
                           train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']],
                           feature_names = ['prediction_xgboost', 'prediction_rna', 'prediction_knn'])

shap_values = explainer(train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']])

In [809]:
shap.plots.scatter(shap_values[:, "prediction_xgboost"])

In [809]:
shap.plots.scatter(shap_values[:, "prediction_rna"])

In [809]:
shap.plots.scatter(shap_values[:, "prediction_knn"])

In [809]:
# SHAP summary plot (bar). shap_values were computed on train, so the features passed here must also come from train
shap.summary_plot(shap_values, train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']], plot_type="bar")

In [809]:
shap.summary_plot(shap_values,
                  train[['prediction_xgboost', 'prediction_rna', 'prediction_knn']],
                  feature_names=['prediction_xgboost', 'prediction_rna', 'prediction_knn'])


### 12. Switching method

Clustering

In [736]:
from sklearn.cluster import KMeans


In [739]:
# Try different values of k (number of clusters)

inertia = []
for n in range(1 , 11):
    algorithm = (KMeans(n_clusters = n ,
                        init = 'k-means++',
                        n_init = 10,
                        max_iter=300,
                        tol = 0.0001,
                        random_state= 123 ,
                        algorithm='elkan') )

    algorithm.fit(train_std)
    inertia.append(algorithm.inertia_)


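**Inertia indicator**

Inertia measures how compact the clusters are: it is the sum of squared distances from each point to its assigned centroid. Lower values mean tighter clusters, but inertia tends to decrease as the number of clusters grows, so we look for the "elbow" where the improvement levels off rather than for the minimum. Its value depends on the scale of the data.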

<img src="https://miro.medium.com/v2/resize:fit:822/1*5yf86FgujYyctqkku2M-Kg.png" alt="HTML5 Icon" width= 200 height=150>
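
As a minimal sketch of how inertia is computed, the example below evaluates it by hand for four made-up points, first with two centroids and then with a single one. The coordinates are hypothetical; with a fitted `KMeans` object the same quantity is exposed as `inertia_`.

```python
import numpy as np

# Hypothetical points and centroids
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])
centroids_k1 = np.array([[5.0, 1.0]])

def inertia_and_labels(x, centroids):
    # Squared distance from every point to every centroid: shape (n_points, n_clusters)
    sq_dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each point is assigned to its nearest centroid
    return sq_dist.min(axis=1).sum(), sq_dist.argmin(axis=1)

inertia_k2, labels_k2 = inertia_and_labels(points, centroids_k2)
inertia_k1, labels_k1 = inertia_and_labels(points, centroids_k1)

print(inertia_k1, inertia_k2, labels_k2.tolist())
```

Going from one cluster to two cuts the inertia sharply here because the points form two well-separated groups. In an elbow plot, that kind of drop followed by a flat tail marks a reasonable choice of k.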


In [809]:
# Elbow plot

plt.figure(1 , figsize = (8 ,4))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
plt.show()

In [741]:
# Final clustering model

algorithm = (KMeans(n_clusters = 5 ,init='k-means++', n_init = 10 , max_iter=300,
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
algorithm.fit(train_std)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_

In [742]:
train['cluster_pred'] = algorithm.predict(train_std)
test['cluster_pred'] = algorithm.predict(test_std)
watch['cluster_pred'] = algorithm.predict(watch_std)

In [743]:
#PCA
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)  # Two components for the 2-D scatterplot
pca = pca.fit(train_std)

In [744]:
train[['component1', 'component2']] = pca.transform(train_std)
test[['component1', 'component2']] = pca.transform(test_std)
watch[['component1', 'component2']] = pca.transform(watch_std)


In [746]:
# Setup for the k-means scatterplot

h = 0.02
x_min, x_max = pca.transform(train_std)[:, 0].min() - 1, pca.transform(train_std)[:, 0].max() + 1
y_min, y_max = pca.transform(train_std)[:, 1].min() - 1, pca.transform(train_std)[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z2 = algorithm.predict(train_std)

In [809]:
# k-means scatterplot

plt.figure(1 , figsize = (12 , 5) )
plt.scatter( x = 'component1' ,y = 'component2' , data = train , c = train.cluster_pred, s = 20)
plt.xlabel('Principal component 1') , plt.ylabel('Principal component 2')
plt.show()

In [809]:
# Compute the squared error of each model for each row
train['error_xgboost'] = (train['prediction_xgboost'] - train['target']) ** 2
train['error_rna'] = (train['prediction_rna'] - train['target']) ** 2
train['error_knn'] = (train['prediction_knn'] - train['target']) ** 2

# Group by cluster label and compute the mean error
error_promedio_por_cluster = train.groupby('cluster_pred').agg({
    'error_xgboost': 'mean',
    'error_rna': 'mean',
    'error_knn': 'mean'
}).reset_index()

error_promedio_por_cluster
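
The per-cluster error table can be turned into a cluster-to-model mapping programmatically instead of reading it by eye. The sketch below does this on a made-up error table with the same column names; the numbers are hypothetical and do not reflect the notebook's results.

```python
import pandas as pd

# Hypothetical mean squared errors per cluster (same layout as error_promedio_por_cluster)
toy_errors = pd.DataFrame({
    "cluster_pred":  [0, 1, 2],
    "error_xgboost": [0.20, 0.25, 0.18],
    "error_rna":     [0.19, 0.27, 0.21],
    "error_knn":     [0.22, 0.24, 0.23],
})

# For each cluster, find the column holding the smallest mean error
best_error_col = toy_errors.set_index("cluster_pred").idxmin(axis=1)

# Translate the error column into the matching prediction column
best_model_by_cluster = best_error_col.str.replace("error_", "prediction_", regex=False).to_dict()

print(best_model_by_cluster)
```

A mapping built this way should be derived from the training or watch sample only, and then applied unchanged to the test sample, so the switching rule is not tuned on the data used to evaluate it.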

In [763]:

def switching_prediction(row):
    if row['cluster_pred'] in [0, 1, 2]:
        return row['prediction_rna']
    elif row['cluster_pred'] == 3:
        return row['prediction_xgboost']
    elif row['cluster_pred'] == 4:
        return row['prediction_knn']

train['switching_prediction'] = train.apply(switching_prediction, axis=1)
test['switching_prediction'] = test.apply(switching_prediction, axis=1)
watch['switching_prediction'] = watch.apply(switching_prediction, axis=1)



In [809]:
switching_Results = model_eval_metrics(prediction = 'switching_prediction')
switching_Results


Decision Trees

In [775]:

from sklearn.tree import DecisionTreeRegressor

# Define the model


dtree = DecisionTreeRegressor(max_depth = 6,
                               min_samples_leaf = 0.05,
                               random_state = 123)

dtree = dtree.fit(train[features].fillna(0), train.error_rna)

dtree

In [776]:
from sklearn.tree import export_graphviz
from pydotplus import graph_from_dot_data

dot_data = export_graphviz(dtree,
                           feature_names = features,
                           filled = True,
                           rounded = True,
                           special_characters = True)

graph = graph_from_dot_data(dot_data)
graph.write_png('tree.png')
print(graph)

<pydotplus.graphviz.Dot object at 0x7c90b097c910>


In [779]:
train['error_rna_pred'] = dtree.predict(train[features].fillna(0))
test['error_rna_pred'] = dtree.predict(test[features].fillna(0))
watch['error_rna_pred'] = dtree.predict(watch[features].fillna(0))

In [809]:
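# Squared error of each model on the watch sample (previously computed only for train)
for model_name in ['xgboost', 'rna', 'knn']:
    watch['error_' + model_name] = (watch['prediction_' + model_name] - watch['target']) ** 2

# Group by tree leaf (each distinct predicted error identifies a leaf) and compute the mean error
error_promedio_por_rama = watch.groupby('error_rna_pred').agg({
    'error_xgboost': 'mean',
    'error_rna': 'mean',
    'error_knn': 'mean'
}).reset_index()

error_promedio_por_rama.sort_values(by = 'error_rna_pred')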

In [784]:

def switching_prediction2(row):
    if row['error_rna_pred'] >= 0.21:
        return row['prediction_xgboost']
    else:
        return row['prediction_rna']

train['switching_prediction2'] = train.apply(switching_prediction2, axis=1)
test['switching_prediction2'] = test.apply(switching_prediction2, axis=1)
watch['switching_prediction2'] = watch.apply(switching_prediction2, axis=1)



In [809]:
switching2_Results = model_eval_metrics(prediction = 'switching_prediction2')
switching2_Results


### 13. Cascading method

Random Forest

In [804]:
from sklearn.ensemble import RandomForestClassifier

rfc_cascade = RandomForestClassifier(n_estimators = 200,
                             max_depth = 8,
                             min_samples_leaf = 0.05,
                             oob_score = True,
                             verbose = 1,
                             n_jobs = 12,
                             random_state = 123)



In [809]:
%%time
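from sklearn.base import clone  # clone() returns a fresh, unfitted copy so each cascade layer keeps its own model
rfc_cascade_capa1 = clone(rfc_cascade).fit(train[['prediction_xgboost'] + features].fillna(0), train.target)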


In [809]:
# Get the predicted probabilities
train['prediction_cascade_capa1'] = rfc_cascade_capa1.predict_proba(train[['prediction_xgboost'] + features].fillna(0))[:, 1]
watch['prediction_cascade_capa1'] = rfc_cascade_capa1.predict_proba(watch[['prediction_xgboost'] + features].fillna(0))[:, 1]
test['prediction_cascade_capa1']  = rfc_cascade_capa1.predict_proba(test[['prediction_xgboost'] + features].fillna(0))[:, 1]


In [809]:
from sklearn.base import clone

capa2_features = ['prediction_cascade_capa1', 'prediction_xgboost', 'prediction_rna'] + features

# clone() gives this layer its own forest instead of refitting the shared rfc_cascade object
rfc_cascade_capa2 = clone(rfc_cascade).fit(train[capa2_features].fillna(0), train.target)

train['prediction_cascade_capa2'] = rfc_cascade_capa2.predict_proba(train[capa2_features].fillna(0))[:, 1]
watch['prediction_cascade_capa2'] = rfc_cascade_capa2.predict_proba(watch[capa2_features].fillna(0))[:, 1]
test['prediction_cascade_capa2']  = rfc_cascade_capa2.predict_proba(test[capa2_features].fillna(0))[:, 1]

In [809]:
from sklearn.base import clone

capa3_features = ['prediction_cascade_capa1', 'prediction_cascade_capa2', 'prediction_xgboost', 'prediction_rna', 'prediction_knn'] + features

# clone() gives this layer its own forest; predictions now use rfc_cascade_capa3 rather than rfc_cascade_capa2
rfc_cascade_capa3 = clone(rfc_cascade).fit(train[capa3_features].fillna(0), train.target)

train['prediction_cascade_capa3'] = rfc_cascade_capa3.predict_proba(train[capa3_features].fillna(0))[:, 1]
watch['prediction_cascade_capa3'] = rfc_cascade_capa3.predict_proba(watch[capa3_features].fillna(0))[:, 1]
test['prediction_cascade_capa3']  = rfc_cascade_capa3.predict_proba(test[capa3_features].fillna(0))[:, 1]

In [809]:
cascade_Results = model_eval_metrics(prediction = 'prediction_cascade_capa3')
cascade_Results

---
## Thank you for completing this lab!