# Train model pipeline

## Rationale:
Based on the previous experiments, we train the final models for the competition submissions. The final models are trained on the full dataset (```train``` + ```test``` + ```validation```); the ensemble models are additionally trained on the ```out of fold``` predictions of the individual models.

## Methodology:
We train each model with its optimized hyperparameters and use the out-of-fold predictions of each one to train two final stackers:

1. A linear model trained on the individual models' predictions.
2. A neural network (MLP) trained on the individual models' predictions.

A plain average of the individual predictions is kept as a third, untrained ensemble. We then save the predictions on the private dataset and use them for the final submissions.
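
As a minimal sketch of this methodology, assuming synthetic data and scikit-learn stand-ins for the LightGBM learners (none of the names below come from the project code), the out-of-fold predictions can be produced with `cross_val_predict` and then used as the only inputs of the stackers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(n_samples=600, n_features=12, random_state=42)

# Stand-in base models; the notebook itself trains LightGBM learners.
base_models = {
    "rf": RandomForestClassifier(n_estimators=20, random_state=42),
    "gb": GradientBoostingClassifier(n_estimators=20, random_state=42),
}

# Out-of-fold probabilities: every row is scored by a model that never saw it.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=3, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# The two stackers see only the out-of-fold predictions, not the raw features.
stacker_lr = LogisticRegression().fit(oof, y)
stacker_mlp = MLPClassifier(activation="tanh", max_iter=1000, random_state=42).fit(oof, y)

# The plain average needs no training at all.
average = oof.mean(axis=1)
```

Training the stackers on out-of-fold predictions rather than in-sample predictions keeps the stacker from learning how much each base model has memorised its own training rows.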

## Conclusions:

1. **Consistency in Model Performance**: For every model the private and public scores differ by at most 0.0024, indicating consistent performance.

2. **Stacking Models**: "Stacking MLP," "Stacking LR," and "Stacking AVG" models perform similarly, with private scores between 0.7980 and 0.7983 and public scores between 0.8002 and 0.8006.

3. **Boruta + Optuna**: The single "Boruta + Optuna" model is competitive: it matches "Stacking AVG" on the private score (0.7980) and reaches 0.7992 on the public score.

4. **Model Selection**: Consider factors beyond just performance, including model complexity, interpretability, training time, and resource requirements.

5. **Ensemble and Model Tuning**: Stacking models and combining feature selection methods with hyperparameter optimization can be effective strategies.

6. **Further Investigation**: Explore feature importance, model interpretability, and experiment with different model architectures or hyperparameters.

In summary, while performance differences are subtle, choose a model considering practical aspects and continue experimentation for optimization.


| Model             | Private Score | Public Score |
|-------------------|---------------|--------------|
| Stacking MLP      | 0.7983        | 0.8006       |
| Stacking LR       | 0.7981        | 0.8005       |
| Stacking AVG      | 0.7980        | 0.8002       |
| Boruta + Optuna   | 0.7980        | 0.7992       |
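
The gaps quoted in the conclusions can be recomputed from the table. The snippet below is only a small check; the column names `private`, `public` and `gap` are chosen here for illustration:

```python
import pandas as pd

# Scores copied from the table above.
scores = pd.DataFrame(
    {
        "model": ["Stacking MLP", "Stacking LR", "Stacking AVG", "Boruta + Optuna"],
        "private": [0.7983, 0.7981, 0.7980, 0.7980],
        "public": [0.8006, 0.8005, 0.8002, 0.7992],
    }
).set_index("model")

# Public minus private score per model, rounded to the table's precision.
scores["gap"] = (scores["public"] - scores["private"]).round(4)

# Difference between the best and the worst private score.
private_spread = round(scores["private"].max() - scores["private"].min(), 4)

print(scores)
print("private spread:", private_spread)
```

The private scores span only 0.0003, which is why the conclusions recommend weighing complexity and training time alongside the score.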

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import warnings
from functools import reduce

import cloudpickle as cp
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")

# Make the project root importable from the notebook directory.
sys.path.append("../")

# local imports
from src.learner_params import (
    MODEL_PARAMS,
    params_all,
    params_ensemble,
    params_fw,
    params_original,
    space_column,
    target_column,
)
from utils.feature_selection_lists import boruta_features, ensemble_features, fw_features
from utils.features_lists import all_features_list, base_features
from utils.functions__training import lgbm_classification_learner, model_pipeline
from utils.functions__utils import train_binary

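# Prediction columns fed to the stackers. The same list is used for fitting
# and for inference, so the column order stays consistent.
columns = ["prediction_MrMr",
           "prediction_base",
           "prediction_all_features",
           "prediction_boruta",
           "prediction_ensemble"]

# Model names, in the same order as the learner list and the OOF list built
# further down ("MrMr" labels the model trained on `fw_features`).
names = ["all_features",
         "boruta",
         "MrMr",
         "base",
         "ensemble"]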

In [2]:
train_df = pd.read_pickle("../data/train_df.pkl")
validation_df = pd.read_pickle("../data/validation_df.pkl")
test_df = pd.read_pickle("../data/test_df.pkl")

private_df = pd.read_pickle("../data/private_df.pkl")

# Final training set: all labelled splits combined.
data = pd.concat([train_df, test_df, validation_df], ignore_index=True)

### Train the best model (Boruta + Optuna)

In [3]:
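save_estimator_path = "../model_files/final__boruta_learner.pkl"
# Note: validation_df is already contained in `data`, so any validation
# metrics of this final fit are not a held-out estimate.
model_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = MODEL_PARAMS,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )
# Persist the fitted learner (including its predict function) with cloudpickle.
with open(save_estimator_path, "wb") as context:
    cp.dump(model_logs["lgbm_classification_learner"], context)

# Keep the out-of-fold predictions; the stackers are trained on them later.
model_logs["data"]["oof_df"].to_pickle("../data/final__boruta_oof_df.pkl")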

2023-10-11T13:38:06 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:38:06 | INFO | Training for fold 1
2023-10-11T13:39:55 | INFO | Training for fold 2
2023-10-11T13:41:47 | INFO | Training for fold 3
2023-10-11T13:43:40 | INFO | CV training finished!
2023-10-11T13:43:40 | INFO | Training the model in the full dataset...
2023-10-11T13:46:23 | INFO | Training process finished!
2023-10-11T13:46:23 | INFO | Calculating metrics...
2023-10-11T13:46:23 | INFO | Full process finished in 8.29 minutes.
2023-10-11T13:46:23 | INFO | Saving the predict function.
2023-10-11T13:46:23 | INFO | Predict function saved.


In [4]:
private_predictions = model_logs["lgbm_classification_learner"]["predict_fn"](private_df, apply_shap = False)

In [5]:
path = "../data/submissions/private_predictions_hyperopt2s.csv"
private_predictions = private_predictions[[space_column, "prediction"]]
private_predictions = private_predictions.rename(columns = {"prediction":"Probability"})
private_predictions.to_csv(path, index = False)

### Train the ensembles:

1. Average prediction
2. Logistic regression
3. MLP

In [6]:
save_estimator_path = "../model_files/final__fw_learner.pkl"
fw_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_fw,
                            target_column = target_column,
                            features = fw_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(fw_full_logs["lgbm_classification_learner"], context)


fw_full_logs["data"]["oof_df"].to_pickle("../data/final__fw_oof_df.pkl")

2023-10-11T13:46:26 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:46:26 | INFO | Training for fold 1
2023-10-11T13:47:12 | INFO | Training for fold 2
2023-10-11T13:47:57 | INFO | Training for fold 3
2023-10-11T13:48:44 | INFO | CV training finished!
2023-10-11T13:48:44 | INFO | Training the model in the full dataset...
2023-10-11T13:49:42 | INFO | Training process finished!
2023-10-11T13:49:42 | INFO | Calculating metrics...
2023-10-11T13:49:42 | INFO | Full process finished in 3.26 minutes.
2023-10-11T13:49:42 | INFO | Saving the predict function.
2023-10-11T13:49:42 | INFO | Predict function saved.


In [7]:
save_estimator_path = "../model_files/final__base_learner.pkl"
base_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_original,
                            target_column = target_column,
                            features = base_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(base_full_logs["lgbm_classification_learner"], context)
    
base_full_logs["data"]["oof_df"].to_pickle("../data/final__base_oof_df.pkl")

2023-10-11T13:49:42 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:49:42 | INFO | Training for fold 1
2023-10-11T13:50:10 | INFO | Training for fold 2
2023-10-11T13:50:37 | INFO | Training for fold 3
2023-10-11T13:51:05 | INFO | CV training finished!
2023-10-11T13:51:05 | INFO | Training the model in the full dataset...
2023-10-11T13:51:37 | INFO | Training process finished!
2023-10-11T13:51:37 | INFO | Calculating metrics...
2023-10-11T13:51:37 | INFO | Full process finished in 1.92 minutes.
2023-10-11T13:51:37 | INFO | Saving the predict function.
2023-10-11T13:51:37 | INFO | Predict function saved.


In [8]:
save_estimator_path = "../model_files/final__ensemble_learner.pkl"
ensemble_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_ensemble,
                            target_column = target_column,
                            features = ensemble_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(ensemble_full_logs["lgbm_classification_learner"], context)
    
ensemble_full_logs["data"]["oof_df"].to_pickle("../data/final__ensemble_oof_df.pkl")

2023-10-11T13:51:37 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:51:38 | INFO | Training for fold 1
2023-10-11T13:52:04 | INFO | Training for fold 2
2023-10-11T13:52:30 | INFO | Training for fold 3
2023-10-11T13:52:57 | INFO | CV training finished!
2023-10-11T13:52:57 | INFO | Training the model in the full dataset...
2023-10-11T13:53:33 | INFO | Training process finished!
2023-10-11T13:53:33 | INFO | Calculating metrics...
2023-10-11T13:53:33 | INFO | Full process finished in 1.92 minutes.
2023-10-11T13:53:33 | INFO | Saving the predict function.
2023-10-11T13:53:33 | INFO | Predict function saved.


In [9]:
save_estimator_path = "../model_files/final__all_learner.pkl"
all_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_all,
                            target_column = target_column,
                            features = all_features_list,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )
with open(save_estimator_path, "wb") as context:
    cp.dump(all_full_logs["lgbm_classification_learner"], context)
    
all_full_logs["data"]["oof_df"].to_pickle("../data/final__all_oof_df.pkl")

2023-10-11T13:53:33 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:53:33 | INFO | Training for fold 1
2023-10-11T13:54:23 | INFO | Training for fold 2
2023-10-11T13:55:13 | INFO | Training for fold 3
2023-10-11T13:56:03 | INFO | CV training finished!
2023-10-11T13:56:03 | INFO | Training the model in the full dataset...
2023-10-11T13:57:06 | INFO | Training process finished!
2023-10-11T13:57:06 | INFO | Calculating metrics...
2023-10-11T13:57:06 | INFO | Full process finished in 3.56 minutes.
2023-10-11T13:57:06 | INFO | Saving the predict function.
2023-10-11T13:57:06 | INFO | Predict function saved.


In [10]:
def load_learner(path):
    # The learners were written with cloudpickle, so read them back the same way.
    with open(path, "rb") as context:
        return cp.load(context)

all_predict_fn = load_learner("../model_files/final__all_learner.pkl")
boruta_predict_fn = load_learner("../model_files/final__boruta_learner.pkl")
fw_predict_fn = load_learner("../model_files/final__fw_learner.pkl")
base_predict_fn = load_learner("../model_files/final__base_learner.pkl")
ensemble_predict_fn = load_learner("../model_files/final__ensemble_learner.pkl")

# Same order as `names`.
lpf = [
    all_predict_fn,
    boruta_predict_fn,
    fw_predict_fn,
    base_predict_fn,
    ensemble_predict_fn,
]

In [26]:
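# Score the private set with every learner and name each prediction column after its model.
frames = []
for name, predict_fn in zip(names, lpf):
    aux = predict_fn["predict_fn"](private_df)[[space_column, "prediction"]].rename(columns = {"prediction":f"prediction_{name}"})
    frames.append(aux)
# Join all per-model predictions on the row identifier.
df_predictions = reduce(lambda x, y: pd.merge(x, y, on = space_column), frames)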

In [27]:
df_predictions.loc[:,"prediction_average"] = df_predictions.loc[:,columns].mean(axis = 1)
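
The merge-and-average step can be illustrated on toy data. This is only a sketch: the key `id` is a hypothetical stand-in for `space_column`, and the frames hold made-up predictions. The `validate="one_to_one"` argument of `pd.merge` is an optional safeguard that raises if an identifier is duplicated in either frame.

```python
from functools import reduce

import pandas as pd

key = "id"  # hypothetical stand-in for `space_column`
names = ["all_features", "boruta", "MrMr"]

# One frame per model, deliberately in different row orders.
frames = [
    pd.DataFrame({key: [1, 2, 3], "prediction": [0.2, 0.6, 0.9]}),
    pd.DataFrame({key: [3, 1, 2], "prediction": [0.8, 0.1, 0.5]}),
    pd.DataFrame({key: [2, 3, 1], "prediction": [0.7, 1.0, 0.3]}),
]

# Give every prediction column a model-specific name before merging.
renamed = [
    df[[key, "prediction"]].rename(columns={"prediction": f"prediction_{name}"})
    for name, df in zip(names, frames)
]

# Inner-join all frames on the key; rows are aligned by identifier, not position.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=key, validate="one_to_one"),
    renamed,
)

prediction_columns = [f"prediction_{name}" for name in names]
merged["prediction_average"] = merged[prediction_columns].mean(axis=1)
print(merged)
```

Because the join is an inner join, a row missing from any single model's output would silently drop out of the result, so comparing `len(merged)` with the length of the input frames is a cheap check.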

In [28]:
tmp_all = pd.read_pickle("../data/final__all_oof_df.pkl")
boruta_all = pd.read_pickle("../data/final__boruta_oof_df.pkl")
ensemble_all = pd.read_pickle("../data/final__ensemble_oof_df.pkl")
fw_all = pd.read_pickle("../data/final__fw_oof_df.pkl")
base_all = pd.read_pickle("../data/final__base_oof_df.pkl")

ldf = [tmp_all, boruta_all, fw_all, base_all, ensemble_all]

In [29]:
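# Name each out-of-fold prediction column after its model, then join on the row identifier.
frames = []
for name, _df in zip(names, ldf):
    aux = _df[[space_column, "prediction"]].rename(columns = {"prediction":f"prediction_{name}"})
    frames.append(aux)
df_predictions_train = reduce(lambda x, y: pd.merge(x, y, on = space_column), frames)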

In [30]:
lr = LogisticRegressionCV(cv = 3, random_state=42)
aux = df_predictions_train.merge(data[[space_column, target_column]], on = space_column)
result = train_binary(aux, columns, target_column, lr)

Score on test set for fold 1 is :0.872
Score on test set for fold 2 is :0.865
Score on test set for fold 3 is :0.862
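
`train_binary` is a project helper whose source is not shown in this notebook. Judging only from how it is called and from its printed output, it might behave roughly like the hypothetical sketch below: score the estimator on each fold, then refit it on all rows and return it under the `"model"` key. The metric (ROC AUC), the fold strategy and the toy data are assumptions, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def train_binary_sketch(df, features, target, estimator, n_splits=3, random_state=42):
    """Hypothetical stand-in for `train_binary`: per-fold scores plus a final refit."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for i, (train_idx, test_idx) in enumerate(folds.split(df[features], df[target]), start=1):
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        fold_model = clone(estimator).fit(train[features], train[target])
        proba = fold_model.predict_proba(test[features])[:, 1]
        scores.append(roc_auc_score(test[target], proba))
        print(f"Score on test set for fold {i} is: {scores[-1]:.3f}")
    # Refit on every row so the returned model uses all available predictions.
    final_model = clone(estimator).fit(df[features], df[target])
    return {"model": final_model, "scores": scores}


# Toy meta-features: two noisy "base model" predictions of a binary target.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
toy = pd.DataFrame({
    "prediction_a": np.clip(y * 0.6 + rng.normal(0.2, 0.2, 300), 0, 1),
    "prediction_b": np.clip(y * 0.4 + rng.normal(0.3, 0.25, 300), 0, 1),
    "target": y,
})
result = train_binary_sketch(toy, ["prediction_a", "prediction_b"], "target", LogisticRegression())
```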


In [31]:
df_predictions.loc[:, "prediction_lr"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

In [32]:
###
# mlp = MLPClassifier(random_state=42,activation="tanh", max_iter=300,learning_rate="adaptive")
mlp = MLPClassifier(random_state=42,activation="tanh", max_iter=1000,learning_rate="adaptive")
aux = df_predictions_train.merge(data[[space_column, target_column]], on = space_column)
result = train_binary(aux, columns, target_column, mlp)
###

Score on test set for fold 1 is :0.872
Score on test set for fold 2 is :0.865
Score on test set for fold 3 is :0.862


In [33]:
df_predictions.loc[:, "prediction_mlp"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

In [35]:
###
estimators = [("tanh",MLPClassifier(random_state=42,activation="tanh", max_iter=300,learning_rate="adaptive")),
             ("relu", MLPClassifier(random_state=101,activation="relu", max_iter=300,learning_rate="adaptive")),
             ("sigmoid",MLPClassifier(random_state=0,activation="logistic", max_iter=300,learning_rate="adaptive"))
             ]
model = StackingClassifier(estimators, final_estimator=LogisticRegressionCV(random_state=42,scoring="roc_auc"))
aux = df_predictions_train.merge(data[[space_column, target_column]], on = space_column)
result = train_binary(aux, columns, target_column, model)
###

Score on test set for fold 1 is :0.872
Score on test set for fold 2 is :0.865
Score on test set for fold 3 is :0.862


In [36]:
df_predictions.loc[:, "prediction_mlp_stacked"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

### Save the results
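We save the private-set predictions of every individual model and every ensemble as a separate submission file, each with the identifier column and a ```Probability``` column.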

In [29]:
path = "../data/submissions/private_predictions_average_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_average"]]
private_predictions = private_predictions.rename(columns = {"prediction_average":"Probability"})
private_predictions.to_csv(path, index = False)

In [30]:
path = "../data/submissions/private_predictions_lr_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_lr"]]
private_predictions = private_predictions.rename(columns = {"prediction_lr":"Probability"})
private_predictions.to_csv(path, index = False)

In [40]:
path = "../data/submissions/private_predictions_mlp_gsc_20231011.csv"
private_predictions = df_predictions[[space_column, "prediction_mlp"]]
private_predictions = private_predictions.rename(columns = {"prediction_mlp":"Probability"})
private_predictions.to_csv(path, index = False)

In [32]:
path = "../data/submissions/private_predictions_boruta_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_boruta"]]
private_predictions = private_predictions.rename(columns = {"prediction_boruta":"Probability"})
private_predictions.to_csv(path, index = False)

In [33]:
path = "../data/submissions/private_predictions_fw_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_MrMr"]]
private_predictions = private_predictions.rename(columns = {"prediction_MrMr":"Probability"})
private_predictions.to_csv(path, index = False)

In [34]:
path = "../data/submissions/private_predictions_base_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_base"]]
private_predictions = private_predictions.rename(columns = {"prediction_base":"Probability"})
private_predictions.to_csv(path, index = False)

In [35]:
path = "../data/submissions/private_predictions_ensemble_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_ensemble"]]
private_predictions = private_predictions.rename(columns = {"prediction_ensemble":"Probability"})
private_predictions.to_csv(path, index = False)

In [36]:
path = "../data/submissions/private_predictions_all_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_all_features"]]
private_predictions = private_predictions.rename(columns = {"prediction_all_features":"Probability"})
private_predictions.to_csv(path, index = False)

In [37]:
path = "../data/submissions/private_predictions_mlp_stacked_gsc.csv"
private_predictions = df_predictions[[space_column, "prediction_mlp_stacked"]]
private_predictions = private_predictions.rename(columns = {"prediction_mlp_stacked":"Probability"})
private_predictions.to_csv(path, index = False)
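
The save cells above all follow the same pattern, so they could be collapsed into one loop. The sketch below assumes a hypothetical helper `save_submissions`, a toy frame, and a temporary output directory; only the file names and the `Probability` column are taken from the cells above.

```python
import tempfile
from pathlib import Path

import pandas as pd

key = "id"  # hypothetical stand-in for `space_column`

# Toy stand-in for `df_predictions`.
df_predictions = pd.DataFrame({
    key: [1, 2, 3],
    "prediction_average": [0.2, 0.6, 0.9],
    "prediction_lr": [0.1, 0.7, 0.8],
})


def save_submissions(df, key, column_to_file, out_dir):
    """Write one CSV per prediction column, renamed to `Probability`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for column, filename in column_to_file.items():
        submission = df[[key, column]].rename(columns={column: "Probability"})
        path = out_dir / filename
        submission.to_csv(path, index=False)
        written.append(path)
    return written


# Temporary directory used here instead of ../data/submissions.
out_dir = tempfile.mkdtemp()
paths = save_submissions(
    df_predictions,
    key,
    {
        "prediction_average": "private_predictions_average_gsc.csv",
        "prediction_lr": "private_predictions_lr_gsc.csv",
    },
    out_dir,
)
```

Keeping the column-to-file mapping in one dictionary also makes it easier to see which prediction column ends up in which submission file.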