# Train model pipeline

## Rationale:
Based on the previous experiments, we train the models used for the competition submissions. The final models are trained on the full dataset (`train` + `test` + `validation`); the ensemble (stacking) models are trained on the `out of fold` predictions of those models.

## Methodology:
We train each model with its optimized hyperparameters and use the out-of-fold predictions of each one to train two final stackers:

1. A linear model trained on the base models' predictions.
2. A NN model trained on the base models' predictions.

Then we will save the predictions on the private dataset and use them for the final submissions.
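
The stacking idea described above can be sketched as follows. This is a minimal illustration, not the notebook's pipeline: the synthetic data, the two scikit-learn base models, and the names `oof`, `stack_lr`, `stack_mlp` and `stack_avg` are all assumptions made for the example, whereas the notebook itself uses tuned LightGBM learners.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(n_samples=600, n_features=12, random_state=42)

# Stand-in base models (assumption); the notebook trains LightGBM learners.
base_models = {
    "gbm": GradientBoostingClassifier(n_estimators=30, random_state=42),
    "rf": RandomForestClassifier(n_estimators=30, random_state=42),
}

# Out-of-fold probabilities: every row is scored by a model that never saw it.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=3, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Two stackers trained on the out-of-fold predictions.
stack_lr = LogisticRegression().fit(oof, y)
stack_mlp = MLPClassifier(activation="tanh", max_iter=300, random_state=42).fit(oof, y)

# Plain average of the base predictions, as a reference blend.
stack_avg = oof.mean(axis=1)
```

Training the stackers on out-of-fold predictions rather than in-sample predictions keeps the meta-model from learning to trust base models that have simply memorised their training rows.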

## Conclusions:

1. **Consistency in Model Performance**: The private and public scores of each model differ by at most 0.0024, indicating consistent performance.

2. **Stacking Models**: "Stacking MLP," "Stacking LR," and "Stacking AVG" models perform similarly, with scores between 0.7980 and 0.7983 (private) and between 0.8002 and 0.8006 (public).

3. **Boruta + Optuna**: "Boruta + Optuna" model performs competitively with scores of 0.7980 (private) and 0.7992 (public).

4. **Model Selection**: Consider factors beyond just performance, including model complexity, interpretability, training time, and resource requirements.

5. **Ensemble and Model Tuning**: Stacking models and combining feature selection methods with hyperparameter optimization can be effective strategies.

6. **Further Investigation**: Explore feature importance, model interpretability, and experiment with different model architectures or hyperparameters.

In summary, while performance differences are subtle, choose a model considering practical aspects and continue experimentation for optimization.


| Model             | Private Score | Public Score |
|-------------------|---------------|--------------|
| Stacking MLP      | 0.7983        | 0.8006       |
| Stacking LR       | 0.7981        | 0.8005       |
| Stacking AVG      | 0.7980        | 0.8002       |
| Boruta + Optuna   | 0.7980        | 0.7992       |
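
The public/private gap behind the consistency claim can be recomputed directly from the table. The snippet below is only a convenience sketch; the DataFrame and its column names (`private`, `public`, `gap`) are introduced here for illustration.

```python
import pandas as pd

# Scores copied from the table above.
scores = pd.DataFrame(
    {
        "model": ["Stacking MLP", "Stacking LR", "Stacking AVG", "Boruta + Optuna"],
        "private": [0.7983, 0.7981, 0.7980, 0.7980],
        "public": [0.8006, 0.8005, 0.8002, 0.7992],
    }
)

# Gap between leaderboards, rounded to the precision of the reported scores.
scores["gap"] = (scores["public"] - scores["private"]).round(4)

print(scores.sort_values("private", ascending=False).to_string(index=False))
```

The gaps range from 0.0012 (Boruta + Optuna) to 0.0024 (Stacking LR), while the private scores of all four models lie within 0.0003 of each other.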

In [4]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

import cloudpickle as cp

from functools import reduce

from sklearn.neural_network import MLPClassifier

import joblib

from sklearn.linear_model import LogisticRegressionCV

import sys
sys.path.append("../")

# local imports
from src.learner_params import target_column, space_column, MODEL_PARAMS

from utils.functions__training import model_pipeline, lgbm_classification_learner
from src.learner_params import params_all, params_ensemble, params_fw, params_optuna
from utils.feature_selection_lists import fw_features, boruta_features, optuna_features, ensemble_features
from utils.features_lists import all_features_list
from utils.functions__utils import train_binary

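# Columns fed to the stackers; each one is "prediction_<name>" for a name below.
columns = [
    "prediction_MrMr",
    "prediction_Optuna",
    "prediction_all_features",
    "prediction_boruta",
    "prediction_ensemble",
]

# Base-model names, in the same order as the learner and OOF lists built later
# ("MrMr" labels the fw learner).
names = ["all_features", "boruta", "MrMr", "Optuna", "ensemble"]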

In [2]:
train_df = pd.read_pickle("../data/train_df.pkl")
validation_df = pd.read_pickle("../data/validation_df.pkl")
test_df = pd.read_pickle("../data/test_df.pkl")

private_df = pd.read_pickle("../data/private_df.pkl")

# Full dataset used for the final fits: train + test + validation.
data = pd.concat([train_df, test_df, validation_df], ignore_index=True)


### Train the best model (Boruta + Optuna)

In [3]:
save_estimator_path = "../model_files/final__boruta_learner.pkl"
model_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = MODEL_PARAMS,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )
with open(save_estimator_path, "wb") as context:
    cp.dump(model_logs["lgbm_classification_learner"], context)
    
model_logs["data"]["oof_df"].to_pickle("../data/final__boruta_oof_df.pkl")

2023-09-28T17:08:16 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-28T17:08:16 | INFO | Training for fold 1
2023-09-28T17:12:39 | INFO | Training for fold 2
2023-09-28T17:17:03 | INFO | Training for fold 3
2023-09-28T17:21:39 | INFO | CV training finished!
2023-09-28T17:21:39 | INFO | Training the model in the full dataset...
2023-09-28T17:27:31 | INFO | Training process finished!
2023-09-28T17:27:31 | INFO | Calculating metrics...
2023-09-28T17:27:31 | INFO | Full process finished in 19.31 minutes.
2023-09-28T17:27:31 | INFO | Saving the predict function.
2023-09-28T17:27:31 | INFO | Predict function saved.


In [27]:
private_predictions = model_logs["lgbm_classification_learner"]["predict_fn"](private_df, apply_shap = False)

In [28]:
path = "../data/submissions/private_predictions_hyperopt2s.csv"
private_predictions = private_predictions[[space_column, "prediction"]]
private_predictions = private_predictions.rename(columns = {"prediction":"TARGET"})
private_predictions.columns = private_predictions.columns.str.upper()
private_predictions.to_csv(path, index = False)

### Train the ensembles:

1. Average prediction
2. Logistic regression
3. MLP

In [52]:
save_estimator_path = "../model_files/final__fw_learner.pkl"
fw_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_fw,
                            target_column = target_column,
                            features = fw_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(fw_full_logs["lgbm_classification_learner"], context)


fw_full_logs["data"]["oof_df"].to_pickle("../data/final__fw_oof_df.pkl")

2023-09-28T09:13:07 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-28T09:13:08 | INFO | Training for fold 1
2023-09-28T09:20:51 | INFO | Training for fold 2
2023-09-28T09:30:01 | INFO | Training for fold 3
2023-09-28T09:38:01 | INFO | CV training finished!
2023-09-28T09:38:01 | INFO | Training the model in the full dataset...
2023-09-28T09:49:11 | INFO | Training process finished!
2023-09-28T09:49:11 | INFO | Calculating metrics...
2023-09-28T09:49:12 | INFO | Full process finished in 36.39 minutes.
2023-09-28T09:49:12 | INFO | Saving the predict function.
2023-09-28T09:49:12 | INFO | Predict function saved.


In [53]:
save_estimator_path = "../model_files/final__optuna_learner.pkl"
optuna_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_optuna,
                            target_column = target_column,
                            features = optuna_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(optuna_full_logs["lgbm_classification_learner"], context)
    
optuna_full_logs["data"]["oof_df"].to_pickle("../data/final__optuna_oof_df.pkl")

2023-09-28T09:49:36 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-28T09:49:41 | INFO | Training for fold 1
2023-09-28T10:04:04 | INFO | Training for fold 2
2023-09-28T10:18:40 | INFO | Training for fold 3
2023-09-28T10:33:34 | INFO | CV training finished!
2023-09-28T10:33:34 | INFO | Training the model in the full dataset...
2023-09-28T10:50:39 | INFO | Training process finished!
2023-09-28T10:50:39 | INFO | Calculating metrics...
2023-09-28T10:50:39 | INFO | Full process finished in 61.36 minutes.
2023-09-28T10:50:39 | INFO | Saving the predict function.
2023-09-28T10:50:39 | INFO | Predict function saved.


In [54]:
save_estimator_path = "../model_files/final__ensemble_learner.pkl"
ensemble_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_ensemble,
                            target_column = target_column,
                            features = ensemble_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(ensemble_full_logs["lgbm_classification_learner"], context)
    
ensemble_full_logs["data"]["oof_df"].to_pickle("../data/final__ensemble_oof_df.pkl")

2023-09-28T10:54:22 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-28T10:54:32 | INFO | Training for fold 1
2023-09-28T11:11:17 | INFO | Training for fold 2
2023-09-28T11:28:22 | INFO | Training for fold 3
2023-09-28T11:45:42 | INFO | CV training finished!
2023-09-28T11:45:42 | INFO | Training the model in the full dataset...
2023-09-28T12:09:06 | INFO | Training process finished!
2023-09-28T12:09:06 | INFO | Calculating metrics...
2023-09-28T12:09:06 | INFO | Full process finished in 75.00 minutes.
2023-09-28T12:09:06 | INFO | Saving the predict function.
2023-09-28T12:09:06 | INFO | Predict function saved.


In [55]:
save_estimator_path = "../model_files/final__all_learner.pkl"
all_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_all,
                            target_column = target_column,
                            features = all_features_list,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )
with open(save_estimator_path, "wb") as context:
    cp.dump(all_full_logs["lgbm_classification_learner"], context)
    
all_full_logs["data"]["oof_df"].to_pickle("../data/final__all_oof_df.pkl")

2023-09-28T12:09:39 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-28T12:09:44 | INFO | Training for fold 1
2023-09-28T12:34:49 | INFO | Training for fold 2
2023-09-28T13:00:02 | INFO | Training for fold 3
2023-09-28T13:25:54 | INFO | CV training finished!
2023-09-28T13:25:54 | INFO | Training the model in the full dataset...
2023-09-28T16:39:41 | INFO | Training process finished!
2023-09-28T16:39:41 | INFO | Calculating metrics...
2023-09-28T16:39:42 | INFO | Full process finished in 270.39 minutes.
2023-09-28T16:39:42 | INFO | Saving the predict function.
2023-09-28T16:39:42 | INFO | Predict function saved.


In [3]:
all_predict_fn = joblib.load("../model_files/final__all_learner.pkl")
boruta_predict_fn = joblib.load("../model_files/final__boruta_learner.pkl")
fw_predict_fn = joblib.load("../model_files/final__fw_learner.pkl")
optuna_predict_fn = joblib.load("../model_files/final__optuna_learner.pkl")
ensemble_predict_fn = joblib.load("../model_files/final__ensemble_learner.pkl")

# Same order as `names`.
lpf = [
    all_predict_fn,
    boruta_predict_fn,
    fw_predict_fn,
    optuna_predict_fn,
    ensemble_predict_fn,
]

In [14]:
# Score the private set with every learner and merge the results on the id column.
frames = []
for name, predict_fn in zip(names, lpf):
    aux = predict_fn["predict_fn"](private_df)[[space_column, "prediction"]].rename(columns={"prediction": f"prediction_{name}"})
    frames.append(aux)
df_predictions = reduce(lambda x, y: pd.merge(x, y, on=space_column), frames)

In [8]:
df_predictions.loc[:,"prediction_average"] = df_predictions.loc[:,columns].mean(axis = 1)
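
The merge-then-average pattern used for the blend can be seen on a toy example. The frames, the `id` key (a hypothetical stand-in for `space_column`) and the `prediction_a` / `prediction_b` columns are made up for illustration.

```python
from functools import reduce

import pandas as pd

key = "id"  # hypothetical stand-in for space_column

# Two prediction frames whose rows are deliberately in a different order.
frames = [
    pd.DataFrame({key: [1, 2, 3], "prediction_a": [0.2, 0.6, 0.9]}),
    pd.DataFrame({key: [3, 1, 2], "prediction_b": [0.7, 0.4, 0.8]}),
]

# Inner-join on the key so rows are aligned by id, not by position.
merged = reduce(lambda left, right: pd.merge(left, right, on=key), frames)

# Row-wise mean over the per-model prediction columns.
pred_cols = [c for c in merged.columns if c.startswith("prediction_")]
merged["prediction_average"] = merged[pred_cols].mean(axis=1)
```

Merging on the key rather than concatenating column-wise matters whenever the individual prediction frames are not guaranteed to share the same row order.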

In [4]:
tmp_all = pd.read_pickle("../data/final__all_oof_df.pkl")
boruta_all = pd.read_pickle("../data/final__boruta_oof_df.pkl")
ensemble_all = pd.read_pickle("../data/final__ensemble_oof_df.pkl")
fw_all = pd.read_pickle("../data/final__fw_oof_df.pkl")
optuna_all = pd.read_pickle("../data/final__optuna_oof_df.pkl")

ldf = [tmp_all, boruta_all, fw_all, optuna_all, ensemble_all]

In [5]:
# Build the stacker training set from the saved out-of-fold predictions.
frames = []
for name, _df in zip(names, ldf):
    aux = _df[[space_column, "prediction"]].rename(columns={"prediction": f"prediction_{name}"})
    frames.append(aux)
df_predictions_train = reduce(lambda x, y: pd.merge(x, y, on=space_column), frames)

In [19]:
lr = LogisticRegressionCV(cv = 3, random_state=42)
aux = df_predictions_train.merge(data[[space_column, target_column]], on = space_column)
result = train_binary(aux, columns, target_column, lr)

Score on test set for fold 1 is :0.799
Score on test set for fold 2 is :0.793
Score on test set for fold 3 is :0.798
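
The source of `train_binary` is not shown in this notebook. Judging only from how it is called and what it prints, it plausibly scores the estimator per fold and returns a fitted model under the `"model"` key. The sketch below is a hypothetical stand-in written under that assumption: the function name `train_binary_sketch`, the use of ROC AUC as the metric, the stratified 3-fold split, and the toy data are all assumptions, not the project's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def train_binary_sketch(df, features, target, estimator, n_splits=3, seed=42):
    """Hypothetical stand-in: score the estimator per fold, then refit on all rows."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for i, (train_idx, test_idx) in enumerate(folds.split(df[features], df[target]), start=1):
        train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]
        model = clone(estimator).fit(train_part[features], train_part[target])
        proba = model.predict_proba(test_part[features])[:, 1]
        score = roc_auc_score(test_part[target], proba)  # metric is an assumption
        scores.append(score)
        print(f"Score on test set for fold {i} is: {score:.3f}")
    # Final model fitted on every row, used later to score the private set.
    final_model = clone(estimator).fit(df[features], df[target])
    return {"model": final_model, "scores": scores}


# Toy data standing in for the merged out-of-fold predictions (assumption).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"prediction_a": rng.random(300), "prediction_b": rng.random(300)})
noise = rng.normal(0, 0.3, 300)
toy["target"] = (toy["prediction_a"] + toy["prediction_b"] + noise > 1).astype(int)

result = train_binary_sketch(toy, ["prediction_a", "prediction_b"], "target", LogisticRegression())
```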


In [23]:
# Score the private predictions with the fitted stacker; the new column is already named prediction_lr.
df_predictions.loc[:, "prediction_lr"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

In [12]:
mlp = MLPClassifier(random_state=42, activation="tanh", max_iter=300, learning_rate="adaptive")
aux = df_predictions_train.merge(data[[space_column, target_column]], on=space_column)
result = train_binary(aux, columns, target_column, mlp)

Score on test set for fold 1 is :0.799
Score on test set for fold 2 is :0.794
Score on test set for fold 3 is :0.798


In [15]:
# Score the private predictions with the fitted stacker; the new column is already named prediction_mlp.
df_predictions.loc[:, "prediction_mlp"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

### Save the results

In [40]:
path = "../data/submissions/private_predictions_average.csv"
private_predictions = df_predictions[[space_column, "prediction_average"]]
private_predictions = private_predictions.rename(columns = {"prediction_average":"TARGET"})
private_predictions.columns = private_predictions.columns.str.upper()
private_predictions.to_csv(path, index = False)

In [25]:
path = "../data/submissions/private_predictions_lr.csv"
private_predictions = df_predictions[[space_column, "prediction_lr"]]
private_predictions = private_predictions.rename(columns = {"prediction_lr":"TARGET"})
private_predictions.columns = private_predictions.columns.str.upper()
private_predictions.to_csv(path, index = False)

In [16]:
path = "../data/submissions/private_predictions_mlp.csv"
private_predictions = df_predictions[[space_column, "prediction_mlp"]]
private_predictions = private_predictions.rename(columns = {"prediction_mlp":"TARGET"})
private_predictions.columns = private_predictions.columns.str.upper()
private_predictions.to_csv(path, index = False)
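
The three save cells above differ only in the prediction column and the output path, so they could be folded into one helper. The function `save_submission` below is a hypothetical refactor, not part of the project's `utils`; the toy frame and the temporary directory are there only to make the example runnable.

```python
import os
import tempfile

import pandas as pd


def save_submission(df, id_column, prediction_column, path):
    """Write a two-column submission file: upper-cased id column and TARGET."""
    out = df[[id_column, prediction_column]].rename(columns={prediction_column: "TARGET"})
    out.columns = out.columns.str.upper()
    out.to_csv(path, index=False)
    return out


# Toy predictions standing in for df_predictions (assumption).
toy = pd.DataFrame({"id": [1, 2], "prediction_lr": [0.1, 0.9], "prediction_mlp": [0.2, 0.8]})

with tempfile.TemporaryDirectory() as tmp:
    for name in ["lr", "mlp"]:
        path = os.path.join(tmp, f"private_predictions_{name}.csv")
        written = save_submission(toy, "id", f"prediction_{name}", path)
        reloaded = pd.read_csv(path)
```

In the notebook this would be called once per blend, for example with `space_column` as the id column and `"prediction_average"`, `"prediction_lr"` or `"prediction_mlp"` as the prediction column.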