# **Deployment** phase | Verne Academy **Kaggle Competition**
### By **Mario Jurado Galán**
This notebook covers:
+ Data loading
+ Transformations
+ Model training
+ Deployment

# 0. Libraries

In [None]:
# ----Data handling----
import functions as func
import pandas as pd
pd.set_option("display.max_rows", None)

# ----Dataset preparation (imputation, encoding, feature selection)----
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.feature_selection import RFECV

# ----Model and training----
import optuna
import lightgbm as lgbm

# ----Deployment----
import pickle

# ----Warnings----
import warnings
warnings.filterwarnings('once')



### -> Optimize hyperparameters and select the best features?

In [None]:
# Set to True to search for the best hyperparameters and the best features; set to False to reuse the results of the last search
realizar_tuneo_HP_y_RFECV=False

# 1. Data loading

In [3]:
dataset_df=pd.read_csv("../Datasets/MasterBI_Train.csv")


# 2. Transformations
+ Removal of the Id
+ Split of the dataset into features and label
+ Manual removal of columns
+ Manual transformation of values and data types
+ Creation of mask and contextual variables
+ Null imputation
    + Simple Imputer (mean for numeric values, most frequent value for categorical values)
+ Application of encoders
    + OneHot Encoder (variables with 5 or fewer unique values)
    + Target Encoder (variables with more than 5 unique values)
+ Hyperparameter tuning
+ Feature selection with RFECV


In [5]:
id="MachineIdentifier"
label="HasDetections"

In [6]:
# Drop the ID, since it adds no value to training
dataset_df=dataset_df.drop(columns=id)

In [7]:
# Split the dataset into features and label
X=dataset_df.drop(columns=label)
y=dataset_df[label]

### Manual transformations

In [8]:
# Drop columns that are more than 90% null, constant, identifiers, or otherwise irrelevant to the model.
delete_features=["PuaMode","Census_ProcessorClass","DefaultBrowsersIdentifier","Census_IsFlightingInternal","Census_InternalBatteryType",
                                "Census_ThresholdOptIn","SmartScreen","OrganizationIdentifier"]
X=X[[col for col in X.columns if col not in delete_features]]
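
The list above is written by hand. As a minimal sketch of how such candidates could be found automatically, the hypothetical helper `find_droppable_columns` below (not part of the original `functions` module) flags columns whose null ratio exceeds a threshold or that hold at most one distinct value. It runs on a toy DataFrame, not on the competition data.

```python
import numpy as np
import pandas as pd


def find_droppable_columns(df: pd.DataFrame, null_threshold: float = 0.9) -> list:
    """Return columns that are mostly null or hold at most one distinct value."""
    null_ratio = df.isna().mean()
    mostly_null = set(null_ratio[null_ratio > null_threshold].index)
    # nunique() ignores NaN by default, so an all-null column has 0 distinct values
    constant = {col for col in df.columns if df[col].nunique() <= 1}
    flagged = mostly_null | constant
    # Keep the original column order
    return [col for col in df.columns if col in flagged]


toy_df = pd.DataFrame({
    "mostly_null": [np.nan] * 19 + [1.0],  # 95% null
    "constant": ["a"] * 20,                # a single distinct value
    "useful": list(range(20)),
})
droppable = find_droppable_columns(toy_df)
print(droppable)
```

Identifier columns and columns judged irrelevant still need a manual decision, so a helper like this would only cover part of the criteria described in the comment.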

In [9]:
# Census_IsWIMBootEnabled is 36% zeros and 63% nulls; judging by its name, we assume that a null is equivalent to the value 1
X["Census_IsWIMBootEnabled"]=X["Census_IsWIMBootEnabled"].fillna(1.0)

In [12]:
# The following columns contain nulls and are numeric, but they should be treated as categorical so that they are imputed with the most frequent value.
new_cat_cols=["CountryIdentifier","CityIdentifier","GeoNameIdentifier","AVProductStatesIdentifier","AVProductsInstalled","AVProductsEnabled","IsProtected","SMode",
            "IeVerIdentifier","Firewall","Census_OEMNameIdentifier","Census_OEMModelIdentifier","Census_ProcessorCoreCount","Census_ProcessorManufacturerIdentifier",
            "Census_ProcessorModelIdentifier","Census_TotalPhysicalRAM","Census_OSInstallLanguageIdentifier","Census_IsFlightsDisabled",
            "Census_FirmwareManufacturerIdentifier","Census_FirmwareVersionIdentifier","Census_IsVirtualDevice","Census_IsAlwaysOnAlwaysConnectedCapable","Wdft_IsGamer","Wdft_RegionIdentifier"] 
X[new_cat_cols]=X[new_cat_cols].astype("category")

### Creation of synthetic mask and contextual variables
+ For variables formatted like a network mask, one column is created for each part of the mask.
+ For every combination of the categorical and numeric columns listed below, one column is created with the mean, one with the maximum and one with the minimum.
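
The helpers in `functions` are not shown in this notebook, so the following is only a sketch of what the two steps might look like. `split_version_column` and `add_group_stats` are hypothetical names, and the output column names (`<col>_part_<n>`, `<stat>_<num>_by_<cat>`) are assumed from the feature names that appear later in the RFECV result.

```python
import pandas as pd


def split_version_column(df: pd.DataFrame, col: str, sep: str = ".") -> pd.DataFrame:
    """Add one numeric column per separator-delimited part of `col` (1-based suffix)."""
    parts = df[col].astype(str).str.split(sep, expand=True)
    out = df.copy()
    for i in parts.columns:
        out[f"{col}_part_{i + 1}"] = pd.to_numeric(parts[i], errors="coerce")
    return out


def add_group_stats(df: pd.DataFrame, num_col: str, cat_col: str) -> pd.DataFrame:
    """Add the mean, max and min of `num_col` within each `cat_col` group."""
    grouped = df.groupby(cat_col)[num_col]
    out = df.copy()
    for stat in ("mean", "max", "min"):
        # transform() broadcasts the group statistic back to every row of the group
        out[f"{stat}_{num_col}_by_{cat_col}"] = grouped.transform(stat)
    return out


toy = pd.DataFrame({
    "EngineVersion": ["1.1.15100.1", "1.1.14600.4", "1.1.15200.1"],
    "CityIdentifier": [10, 10, 20],
    "Census_SystemVolumeTotalCapacity": [100.0, 300.0, 500.0],
})
toy = split_version_column(toy, "EngineVersion")
toy = add_group_stats(toy, "Census_SystemVolumeTotalCapacity", "CityIdentifier")
print(toy.filter(like="_by_").head())
```

Passing the statistic names as strings (`"mean"`, `"max"`, `"min"`) keeps the aggregation inside pandas rather than going through NumPy callables.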

In [None]:
# Create the mask variables
X=func.add_mask_features(X)

In [None]:
# Columns chosen because they ranked among the most relevant in the first iterations; no extra variables were added.
synth_cat_cols=["CityIdentifier","AVProductStatesIdentifier"]
synth_num_cols=["Census_SystemVolumeTotalCapacity","Census_PrimaryDiskTotalCapacity"]

In [None]:
# Create the contextual variables
X, added_cols = func.generate_grouped_stats(X, synth_num_cols, synth_cat_cols)
X = func.generate_synthetic_features(X, added_cols)


In [15]:
print(f"--Dataset--\nRows:{X.shape[0]}\nColumns:{X.shape[1]}\n")
print(f"Null count: \nTrain: {X.isna().sum().sum()}")

--Dataset--
Rows:892148
Columns:118

Null count: 
Train: 822492


### Null imputation
+ Split of the columns by data type
+ Nulls in numeric variables are imputed with the column mean
+ Nulls in categorical variables are imputed with the most frequent value
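
As a small illustration of these steps on toy data (the column names `ram` and `os` are made up for the example), the sketch below splits columns by dtype, fits one `SimpleImputer` per group, and then reuses the fitted numeric imputer on unseen rows. That reuse is why the fitted imputers are worth persisting for inference.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "ram": [4096.0, np.nan, 8192.0, 4096.0],
    "os": pd.Series(["win10", "win10", np.nan, "win8"], dtype="object"),
})

# Split by data type, as in the notebook
numeric_cols = toy.select_dtypes(include=["int64", "float64"]).columns.to_list()
category_cols = toy.select_dtypes(include=["category", "object"]).columns.to_list()

# Mean for numeric columns, most frequent value for categorical columns
numeric_imp = SimpleImputer(strategy="mean")
toy[numeric_cols] = numeric_imp.fit_transform(toy[numeric_cols])

category_imp = SimpleImputer(strategy="most_frequent")
toy[category_cols] = category_imp.fit_transform(toy[category_cols])

# At inference time only transform() is called, so the training mean is reused
new_rows = pd.DataFrame({"ram": [np.nan]})
imputed_new = numeric_imp.transform(new_rows[numeric_cols])
print(toy)
print(imputed_new)
```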

In [16]:
# Cast Census_PrimaryDiskTotalCapacity to category so that it is imputed with the most frequent value, like the other categorical columns
X["Census_PrimaryDiskTotalCapacity"]=X["Census_PrimaryDiskTotalCapacity"].astype("category")

In [17]:
# Split the columns by data type
numeric_cols=X.select_dtypes(include=["int64","float64"]).columns.to_list()
category_cols=X.select_dtypes(include=["category","object"]).columns.to_list()

In [18]:
# Impute numeric variables with the mean
numeric_imp=SimpleImputer(strategy="mean")
X[numeric_cols]=numeric_imp.fit_transform(X[numeric_cols])

In [19]:
# Impute categorical variables with the most frequent value
category_imp=SimpleImputer(strategy="most_frequent")
X[category_cols]=category_imp.fit_transform(X[category_cols])

### Category Encoders
+ Variables are encoded according to their number of unique values:
    + OneHotEncoder for columns with 5 or fewer unique values
    + TargetEncoder for columns with more than 5 unique values

In [21]:
# Split the columns by number of unique values to decide which encoder handles each one
onehot_cols=list(filter(lambda col:X[col].nunique()<=5,X.columns))
target_cols=list(filter(lambda col:X[col].nunique()>5,X.columns))

##### -> OneHot Encoding

In [22]:
# OneHot encoding for variables with few unique values (<=5)
onehot_enc=OneHotEncoder(handle_unknown='ignore')

train_onehot_encoded=onehot_enc.fit_transform(X[onehot_cols])

dataset_onehot_df=pd.DataFrame(data=train_onehot_encoded.toarray(), columns=onehot_enc.get_feature_names_out(onehot_cols),index=X.index)

##### -> Target Encoding

In [24]:
# Target encoding for variables with more unique values (>5)
target_enc=TargetEncoder()

dataset_target_df=target_enc.fit_transform(X[target_cols].astype("category"), y)

In [None]:
# Join the DataFrames produced by the two encoders
X=dataset_onehot_df.join(dataset_target_df)

In [26]:
print(f"--Dataset--\nRows:{X.shape[0]}\nColumns:{X.shape[1]}\n")
print(f"Null count: \nTrain: {X.isna().sum().sum()}")

--Dataset--
Rows:892148
Columns:162

Null count: 
Train: 0


### Hyperparameter tuning
+ Bayesian optimization with Optuna

In [None]:
if realizar_tuneo_HP_y_RFECV:

    # Create the Optuna study, configured to maximize the metric
    study = optuna.create_study(direction='maximize')
    study.optimize(lambda trial: func.objective(trial, X, y), n_trials=50)

    best_params = study.best_params

else:

    # Result of the last hyperparameter tuning run with Optuna
    best_params={'n_estimators': 95,
        'feature_fraction': 0.1,
        'bagging_fraction': 0.8500000000000001,
        'num_leaves': 100,
        'learning_rate': 0.21000000000000002,
        'max_depth': 15,
        'min_child_samples': 6,
        'reg_alpha': 0.0,
        'reg_lambda': 0.4,
        'colsample_bytree': 0.8}


[I 2024-12-05 19:16:33,557] A new study created in memory with name: no-name-d7bc56b8-ad93-4bfe-a9b7-87d2a4a442b8
[I 2024-12-05 19:16:54,374] Trial 0 finished with value: 0.7707321789317789 and parameters: {'n_estimators': 50, 'feature_fraction': 0.30000000000000004, 'bagging_fraction': 0.55, 'num_leaves': 90, 'learning_rate': 0.36000000000000004, 'max_depth': 50, 'min_child_samples': 4, 'reg_alpha': 0.0, 'reg_lambda': 0.8, 'colsample_bytree': 0.2}. Best is trial 0 with value: 0.7707321789317789.
[I 2024-12-05 19:17:11,436] Trial 1 finished with value: 0.7634500922455366 and parameters: {'n_estimators': 25, 'feature_fraction': 0.6, 'bagging_fraction': 0.8500000000000001, 'num_leaves': 60, 'learning_rate': 0.16000000000000003, 'max_depth': 95, 'min_child_samples': 6, 'reg_alpha': 0.4, 'reg_lambda': 1.0, 'colsample_bytree': 0.4}. Best is trial 0 with value: 0.7707321789317789.
[I 2024-12-05 19:17:32,742] Trial 2 finished with value: 0.7685186259896511 and parameters: {'n_estimators': 35,

### Feature selection - RFECV method

In [None]:
if realizar_tuneo_HP_y_RFECV:

    # Feature selection with RFECV, maximizing recall, using the hyperparameters obtained above
    rfe = RFECV(
        estimator=lgbm.LGBMClassifier(**best_params),
        min_features_to_select=10, 
        cv=5, 
        scoring='recall')

    rfe.fit(X,y)

    rfe_features=X.columns[rfe.support_]

else:

    # Result of the last RFECV run: 85 of 162 columns
    rfe_features=['Census_PrimaryDiskTypeName_HDD',
        'Census_PrimaryDiskTypeName_SSD',
        'Census_HasOpticalDiskDrive_0.0',
        'Census_OSArchitecture_x86',
        'Census_GenuineStateName_INVALID_LICENSE',
        'Census_GenuineStateName_IS_GENUINE',
        'Census_GenuineStateName_OFFLINE',
        'Census_IsSecureBootEnabled_0.0',
        'Census_IsSecureBootEnabled_1.0',
        'Census_IsWIMBootEnabled_0.0',
        'Census_IsWIMBootEnabled_1.0',
        'Census_IsPenCapable_1.0',
        'Wdft_IsGamer_0.0',
        'EngineVersion',
        'AppVersion',
        'AvSigVersion',
        'RtpStateBitfield',
        'AVProductStatesIdentifier',
        'CountryIdentifier',
        'CityIdentifier',
        'GeoNameIdentifier',
        'OsBuild',
        'OsSuite',
        'OsPlatformSubRelease',
        'OsBuildLab',
        'SkuEdition',
        'Census_MDC2FormFactor',
        'Census_OEMNameIdentifier',
        'Census_OEMModelIdentifier',
        'Census_ProcessorCoreCount',
        'Census_ProcessorModelIdentifier',
        'Census_PrimaryDiskTotalCapacity',
        'Census_SystemVolumeTotalCapacity',
        'Census_TotalPhysicalRAM',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches',
        'Census_InternalPrimaryDisplayResolutionHorizontal',
        'Census_InternalPrimaryDisplayResolutionVertical',
        'Census_PowerPlatformRoleName',
        'Census_InternalBatteryNumberOfCharges',
        'Census_OSVersion',
        'Census_OSBranch',
        'Census_OSBuildNumber',
        'Census_OSBuildRevision',
        'Census_OSEdition',
        'Census_OSSkuName',
        'Census_OSInstallTypeName',
        'Census_OSInstallLanguageIdentifier',
        'Census_OSUILocaleIdentifier',
        'Census_ActivationChannel',
        'Census_FlightRing',
        'Census_FirmwareManufacturerIdentifier',
        'Census_FirmwareVersionIdentifier',
        'Wdft_RegionIdentifier',
        'EngineVersion_part_3',
        'EngineVersion_part_4',
        'AppVersion_part_2',
        'AppVersion_part_3',
        'AppVersion_part_4',
        'AvSigVersion_part_2',
        'AvSigVersion_part_3',
        'OsBuildLab_part_2',
        'Census_OSVersion_part_3',
        'Census_OSVersion_part_4',
        'mean_Census_SystemVolumeTotalCapacity_by_CityIdentifier',
        'max_Census_SystemVolumeTotalCapacity_by_CityIdentifier',
        'mean_Census_PrimaryDiskTotalCapacity_by_CityIdentifier',
        'max_Census_PrimaryDiskTotalCapacity_by_CityIdentifier',
        'min_Census_PrimaryDiskTotalCapacity_by_CityIdentifier',
        'mean_Census_SystemVolumeTotalCapacity_by_AVProductStatesIdentifier',
        'max_Census_SystemVolumeTotalCapacity_by_AVProductStatesIdentifier',
        'min_Census_SystemVolumeTotalCapacity_by_AVProductStatesIdentifier',
        'max_Census_PrimaryDiskTotalCapacity_by_AVProductStatesIdentifier',
        'min_Census_PrimaryDiskTotalCapacity_by_AVProductStatesIdentifier',
        'Census_SystemVolumeTotalCapacity_ratio_mean_on_CityIdentifier',
        'Census_SystemVolumeTotalCapacity_amplitude_on_CityIdentifier',
        'Census_SystemVolumeTotalCapacity_ratio_max_on_CityIdentifier',
        'Census_PrimaryDiskTotalCapacity_ratio_mean_on_CityIdentifier',
        'Census_PrimaryDiskTotalCapacity_amplitude_on_CityIdentifier',
        'Census_PrimaryDiskTotalCapacity_ratio_max_on_CityIdentifier',
        'Census_SystemVolumeTotalCapacity_ratio_mean_on_AVProductStatesIdentifier',
        'Census_SystemVolumeTotalCapacity_amplitude_on_AVProductStatesIdentifier',
        'Census_SystemVolumeTotalCapacity_ratio_max_on_AVProductStatesIdentifier',
        'Census_PrimaryDiskTotalCapacity_ratio_mean_on_AVProductStatesIdentifier',
        'Census_PrimaryDiskTotalCapacity_amplitude_on_AVProductStatesIdentifier',
        'Census_PrimaryDiskTotalCapacity_ratio_max_on_AVProductStatesIdentifier']

[LightGBM] [Info] Number of positive: 356711, number of negative: 357007
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.041245 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 10345
[LightGBM] [Info] Number of data points in the train set: 713718, number of used features: 149
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499793 -> initscore=-0.000829
[LightGBM] [Info] Start training from score -0.000829
[LightGBM] [Info] Number of positive: 356711, number of negative: 357007
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.040447 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 10343
[LightGBM] [Info] Number of data points in the train set: 713718, number of used features: 148
[LightGB

In [None]:
# Keep only the relevant features selected by RFECV
X=X[rfe_features]
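
To see the RFECV mechanics in isolation, here is a minimal sketch on synthetic data. It swaps the LightGBM estimator for a small `DecisionTreeClassifier` to keep the run fast; any estimator exposing `feature_importances_` or `coef_` can be used. The feature names `f0` to `f11` are generated for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 12 features, of which only 4 carry signal
X_arr, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

# Recursively drop the least important feature and score each subset by cross-validated recall
selector = RFECV(
    estimator=DecisionTreeClassifier(max_depth=4, random_state=0),
    min_features_to_select=3,
    cv=5,
    scoring="recall",
)
selector.fit(X_toy, y_toy)

# support_ is a boolean mask aligned with the input columns
selected = X_toy.columns[selector.support_].to_list()
print(selector.n_features_, selected)
```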

# 4. Model training
+ A LightGBM classification model is trained on all the data with the best hyperparameters

In [32]:
# LightGBM classification model
model=lgbm.LGBMClassifier(**best_params)

# Train the classifier on the entire dataset
model.fit(X,y)

[LightGBM] [Info] Number of positive: 445889, number of negative: 446259
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018934 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9132
[LightGBM] [Info] Number of data points in the train set: 892148, number of used features: 85
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499793 -> initscore=-0.000829
[LightGBM] [Info] Start training from score -0.000829


# 5. Deployment
All the objects created above are saved as pickle files for use in the inference notebook.

##### -> Trained model

In [33]:
with open('output/model.pkl', 'wb') as c:
    pickle.dump(model,c)

##### -> Transformation objects

In [34]:
with open('output/numeric_imputer.pkl', 'wb') as c:
    pickle.dump(numeric_imp, c)
    
with open('output/category_imputer.pkl', 'wb') as c:
    pickle.dump(category_imp, c)
    
with open('output/onehot_encoder.pkl', 'wb') as c:
    pickle.dump(onehot_enc, c)

with open('output/target_encoder.pkl', 'wb') as c:
    pickle.dump(target_enc, c)

##### -> Column lists

In [35]:
columns = X.columns.to_list()

with open('output/columns.pkl', 'wb') as c:
    pickle.dump(columns, c)

with open('output/delete_features.pkl','wb') as c:
    pickle.dump(delete_features,c)

with open('output/synth_num_cols.pkl', 'wb') as c:
    pickle.dump(synth_num_cols, c)

with open('output/synth_cat_cols.pkl', 'wb') as c:
    pickle.dump(synth_cat_cols, c) 

with open('output/num_cols.pkl', 'wb') as c:
    pickle.dump(numeric_cols, c)

with open('output/cat_cols.pkl', 'wb') as c:
    pickle.dump(category_cols, c)

with open('output/onehot_cols.pkl', 'wb') as c:
    pickle.dump(onehot_cols, c)

with open('output/target_cols.pkl', 'wb') as c:
    pickle.dump(target_cols, c)
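
The inference notebook is not part of this file, so the following is only a sketch of how the saved objects might be written and read back in one place. `save_artifacts` and `load_artifacts` are hypothetical helpers; the example uses a temporary directory and two small lists rather than the real model and encoders.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(artifacts: dict, out_dir: Path) -> None:
    """Pickle each object to <out_dir>/<name>.pkl, creating the folder if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(out_dir / f"{name}.pkl", "wb") as fh:
            pickle.dump(obj, fh)


def load_artifacts(out_dir: Path) -> dict:
    """Load every .pkl file in out_dir into a dict keyed by file stem."""
    loaded = {}
    for path in sorted(out_dir.glob("*.pkl")):
        with open(path, "rb") as fh:
            loaded[path.stem] = pickle.load(fh)
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    save_artifacts(
        {"columns": ["OsBuild", "AppVersion"], "delete_features": ["PuaMode"]},
        out_dir,
    )
    restored = load_artifacts(out_dir)

print(restored)
```

Pickle files should only be loaded from trusted sources, and fitted scikit-learn or LightGBM objects are best unpickled with the same library versions that produced them.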