# Train model pipeline

## Rationale:
Based on the previous experiments, this notebook trains the models used for the competition's submissions. The final base models are trained on the full dataset (`train` + `test` + `validation`), and the ensemble models are trained on the base models' out-of-fold predictions.

## Methodology:
Each base model is trained with its optimized hyperparameters, and the out-of-fold predictions of each are used to train two final stackers:

1. A linear model trained on the base models' predictions.
2. A neural network (NN) model trained on the base models' predictions.

A plain average of the base models' predictions is kept as a third ensemble that needs no training. The predictions on the private dataset are then saved and used for the final submissions.
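The out-of-fold stacking idea described above can be sketched as follows. This is a minimal illustration, not the project's `model_pipeline`: the synthetic data, the two scikit-learn base models, and all variable names are assumptions standing in for the tuned LightGBM learners.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(n_samples=600, n_features=12, random_state=42)
X_train, y_train = X[:500], y[:500]
X_private = X[500:]

# Stand-ins for the tuned base models (assumption).
base_models = {
    "gbm": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

oof_columns, private_columns = [], []
for name, model in base_models.items():
    # Out-of-fold probabilities: each row is scored by a model that never saw it.
    oof = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    oof_columns.append(oof)
    # Refit on all labelled rows before scoring the private rows.
    model.fit(X_train, y_train)
    private_columns.append(model.predict_proba(X_private)[:, 1])

oof_matrix = np.column_stack(oof_columns)
private_matrix = np.column_stack(private_columns)

# The stacker is trained on out-of-fold predictions only, then scores the private rows.
stacker = LogisticRegression()
stacker.fit(oof_matrix, y_train)
private_probability = stacker.predict_proba(private_matrix)[:, 1]
```

Training the stacker on out-of-fold predictions rather than in-sample predictions keeps it from learning to trust base models that merely memorized their training rows.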

## Conclusions:

1. **Consistency in Model Performance**: The private and public scores of each model differ by at most 0.0024, indicating consistent performance.

2. **Stacking Models**: "Stacking MLP," "Stacking LR," and "Stacking AVG" models perform similarly, with scores between 0.7980 and 0.7983 (private) and between 0.8002 and 0.8006 (public).

3. **Boruta + Optuna**: The single "Boruta + Optuna" model performs competitively: its private score (0.7980) equals that of "Stacking AVG", and its public score (0.7992) is at most 0.0014 below the stacking models.

4. **Model Selection**: Consider factors beyond just performance, including model complexity, interpretability, training time, and resource requirements.

5. **Ensemble and Model Tuning**: Stacking models and combining feature selection methods with hyperparameter optimization can be effective strategies.

6. **Further Investigation**: Explore feature importance, model interpretability, and experiment with different model architectures or hyperparameters.

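In summary, the performance differences are small (at most 0.0003 on the private score and 0.0014 on the public score), so the model should be chosen on practical grounds such as complexity and training time, and further experimentation is still worthwhile.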


| Model             | Private Score | Public Score |
|-------------------|---------------|--------------|
| Stacking MLP      | 0.7983        | 0.8006       |
| Stacking LR       | 0.7981        | 0.8005       |
| Stacking AVG      | 0.7980        | 0.8002       |
| Boruta + Optuna   | 0.7980        | 0.7992       |
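
As a quick check of the consistency claim, the gaps can be recomputed from the table. The snippet below is only a sketch that re-enters the four rows by hand; the column names `private`, `public`, and `gap` are illustrative.

```python
import pandas as pd

# Scores copied from the table above.
scores = pd.DataFrame(
    {
        "model": ["Stacking MLP", "Stacking LR", "Stacking AVG", "Boruta + Optuna"],
        "private": [0.7983, 0.7981, 0.7980, 0.7980],
        "public": [0.8006, 0.8005, 0.8002, 0.7992],
    }
)

# Public-minus-private gap per model, rounded to the table's precision.
scores["gap"] = (scores["public"] - scores["private"]).round(4)

max_gap = scores["gap"].max()
private_spread = round(scores["private"].max() - scores["private"].min(), 4)
public_spread = round(scores["public"].max() - scores["public"].min(), 4)
print(scores)
```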

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import warnings
from copy import deepcopy
from functools import reduce

import cloudpickle as cp
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")

# make the project root importable
sys.path.append("../")

# local imports
from src.learner_params import MODEL_PARAMS, space_column, target_column
from src.learner_params import params_all, params_ensemble, params_fw, params_original
from utils.feature_selection_lists import boruta_features, ensemble_features, fw_features
from utils.features_lists import all_features_list, base_features
from utils.functions__training import lgbm_classification_learner, model_pipeline
from utils.functions__utils import find_constraint, train_binary

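# Base-model prediction columns used as inputs to the ensembles.
columns = [
    "prediction_MrMr",
    "prediction_base",
    "prediction_all_features",
    "prediction_boruta",
    "prediction_ensemble",
]

# Suffixes used to build the column names above. The order must match the
# `lpf` and `ldf` lists defined later (the `fw` model is stored under "MrMr").
names = [
    "all_features",
    "boruta",
    "MrMr",
    "base",
    "ensemble",
]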

In [2]:
train_df = pd.read_pickle("../data/train_df.pkl")
validation_df = pd.read_pickle("../data/validation_df.pkl")
test_df = pd.read_pickle("../data/test_df.pkl")

# private dataset scored for the submissions
private_df = pd.read_pickle("../data/private_df.pkl")

# full dataset (train + test + validation) used to fit the final models
data = pd.concat([train_df, test_df, validation_df], ignore_index=True)

### Monotone constraints

In [5]:
# One monotone constraint per feature, inferred from the training split.
ensemble_monotone_const_dict = {
    feature: find_constraint(train_df, feature, target_column)
    for feature in ensemble_features
}

all_monotone_const_dict = {
    feature: find_constraint(train_df, feature, target_column)
    for feature in all_features_list
}

In [6]:
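# The dicts keep insertion order, so each list of constraints follows the
# order of the corresponding feature list.
ensemble_params_mc = deepcopy(params_ensemble)
ensemble_params_mc["learner_params"]["extra_params"]["monotone_constraints"] = list(ensemble_monotone_const_dict.values())

# NOTE: `all_params_mc` is built here, but the all-features run below passes `params_all`.
all_params_mc = deepcopy(params_all)
all_params_mc["learner_params"]["extra_params"]["monotone_constraints"] = list(all_monotone_const_dict.values())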
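
The implementation of `find_constraint` lives in `utils.functions__utils` and is not shown in this notebook. One plausible way to derive such a constraint is the sign of the rank correlation between a feature and the target. The sketch below is a hypothetical stand-in under that assumption: `infer_monotone_sign`, its `threshold` parameter, and the toy columns are all invented for illustration. LightGBM's `monotone_constraints` parameter expects one value in {-1, 0, 1} per feature, in feature order, which is why the list is built by iterating over the feature list.

```python
import pandas as pd


def infer_monotone_sign(df, feature, target, threshold=0.1):
    """Hypothetical stand-in for find_constraint: returns -1, 0 or 1."""
    # Spearman correlation computed as the Pearson correlation of the ranks.
    corr = df[feature].rank().corr(df[target].rank())
    if pd.isna(corr) or abs(corr) < threshold:
        return 0  # no clear monotone relation: leave the feature unconstrained
    return 1 if corr > 0 else -1


# Toy data (assumption): one increasing, one decreasing, one unrelated feature.
toy = pd.DataFrame(
    {
        "x_up": [1, 2, 3, 4, 5, 6],
        "x_down": [6, 5, 4, 3, 2, 1],
        "x_flat": [1, 2, 3, 1, 2, 3],
        "target": [0, 0, 0, 1, 1, 1],
    }
)

features = ["x_up", "x_down", "x_flat"]
constraint_by_feature = {f: infer_monotone_sign(toy, f, "target") for f in features}

# Same order as `features`, as a per-feature constraint list requires.
monotone_constraints = list(constraint_by_feature.values())
print(monotone_constraints)
```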

### Train the best model (Boruta + Optuna)

In [7]:
save_estimator_path = "../model_files/final__boruta_learner.pkl"
model_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = MODEL_PARAMS,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )
with open(save_estimator_path, "wb") as context:
    cp.dump(model_logs["lgbm_classification_learner"], context)
    
model_logs["data"]["oof_df"].to_pickle("../data/final__boruta_oof_df.pkl")

2023-10-11T12:52:52 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T12:52:52 | INFO | Training for fold 1
2023-10-11T12:54:36 | INFO | Training for fold 2
2023-10-11T12:56:24 | INFO | Training for fold 3
2023-10-11T12:58:19 | INFO | CV training finished!
2023-10-11T12:58:19 | INFO | Training the model in the full dataset...
2023-10-11T13:00:58 | INFO | Training process finished!
2023-10-11T13:00:58 | INFO | Calculating metrics...
2023-10-11T13:00:58 | INFO | Full process finished in 8.10 minutes.
2023-10-11T13:00:58 | INFO | Saving the predict function.
2023-10-11T13:00:58 | INFO | Predict function saved.


In [8]:
private_predictions = model_logs["lgbm_classification_learner"]["predict_fn"](private_df, apply_shap = False)

In [9]:
path = "../data/submissions/private_predictions_hyperopt2s.csv"
private_predictions = private_predictions[[space_column, "prediction"]]
private_predictions = private_predictions.rename(columns = {"prediction":"Probability"})
private_predictions.to_csv(path, index = False)

### Train the ensembles:

1. Average prediction
2. Logistic regression
3. MLP
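
The three ensembling strategies listed above can be compared side by side on a table of base-model predictions. The sketch below uses synthetic predictions and illustrative column names (`prediction_a`, `prediction_b`, `prediction_c`); only the estimator settings mirror the ones used later in this notebook, and the project's `train_binary` helper is replaced by a plain `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
n_rows = 400
y = rng.integers(0, 2, size=n_rows)
cols = ["prediction_a", "prediction_b", "prediction_c"]

# Synthetic out-of-fold predictions: noisy scores correlated with the label (assumption).
oof = pd.DataFrame(
    {c: np.clip(0.5 + 0.3 * (y - 0.5) + rng.normal(0, 0.2, n_rows), 0, 1) for c in cols}
)
# Synthetic base-model predictions for rows without labels (assumption).
new = pd.DataFrame({c: rng.uniform(0, 1, size=50) for c in cols})

# 1. Average prediction: no training needed.
new["prediction_average"] = new[cols].mean(axis=1)

# 2. Logistic regression stacker fitted on the out-of-fold predictions.
lr = LogisticRegressionCV(cv=3, random_state=42).fit(oof[cols], y)
new["prediction_lr"] = lr.predict_proba(new[cols])[:, 1]

# 3. MLP stacker fitted on the same inputs.
mlp = MLPClassifier(random_state=42, activation="tanh", max_iter=300, learning_rate="adaptive")
mlp.fit(oof[cols], y)
new["prediction_mlp"] = mlp.predict_proba(new[cols])[:, 1]
```

The average is a useful baseline because it cannot overfit; the fitted stackers only pay off if they beat it on held-out data.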

In [10]:
save_estimator_path = "../model_files/final__fw_learner.pkl"
fw_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_fw,
                            target_column = target_column,
                            features = fw_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(fw_full_logs["lgbm_classification_learner"], context)


fw_full_logs["data"]["oof_df"].to_pickle("../data/final__fw_oof_df.pkl")

2023-10-11T13:01:01 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:01:01 | INFO | Training for fold 1
2023-10-11T13:01:45 | INFO | Training for fold 2
2023-10-11T13:02:29 | INFO | Training for fold 3
2023-10-11T13:03:14 | INFO | CV training finished!
2023-10-11T13:03:14 | INFO | Training the model in the full dataset...
2023-10-11T13:04:10 | INFO | Training process finished!
2023-10-11T13:04:10 | INFO | Calculating metrics...
2023-10-11T13:04:10 | INFO | Full process finished in 3.16 minutes.
2023-10-11T13:04:10 | INFO | Saving the predict function.
2023-10-11T13:04:10 | INFO | Predict function saved.


In [11]:
save_estimator_path = "../model_files/final__base_learner.pkl"
base_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_original,
                            target_column = target_column,
                            features = base_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(base_full_logs["lgbm_classification_learner"], context)
    
base_full_logs["data"]["oof_df"].to_pickle("../data/final__base_oof_df.pkl")

2023-10-11T13:04:11 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:04:11 | INFO | Training for fold 1
2023-10-11T13:04:37 | INFO | Training for fold 2
2023-10-11T13:05:04 | INFO | Training for fold 3
2023-10-11T13:05:30 | INFO | CV training finished!
2023-10-11T13:05:30 | INFO | Training the model in the full dataset...
2023-10-11T13:06:02 | INFO | Training process finished!
2023-10-11T13:06:02 | INFO | Calculating metrics...
2023-10-11T13:06:02 | INFO | Full process finished in 1.85 minutes.
2023-10-11T13:06:02 | INFO | Saving the predict function.
2023-10-11T13:06:02 | INFO | Predict function saved.


In [12]:
save_estimator_path = "../model_files/final__ensemble_learner.pkl"
ensemble_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = ensemble_params_mc,
                            target_column = target_column,
                            features = ensemble_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )

with open(save_estimator_path, "wb") as context:
    cp.dump(ensemble_full_logs["lgbm_classification_learner"], context)
    
ensemble_full_logs["data"]["oof_df"].to_pickle("../data/final__ensemble_oof_df.pkl")

2023-10-11T13:06:02 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:06:02 | INFO | Training for fold 1
2023-10-11T13:06:29 | INFO | Training for fold 2
2023-10-11T13:06:57 | INFO | Training for fold 3
2023-10-11T13:07:25 | INFO | CV training finished!
2023-10-11T13:07:25 | INFO | Training the model in the full dataset...
2023-10-11T13:08:00 | INFO | Training process finished!
2023-10-11T13:08:00 | INFO | Calculating metrics...
2023-10-11T13:08:00 | INFO | Full process finished in 1.98 minutes.
2023-10-11T13:08:00 | INFO | Saving the predict function.
2023-10-11T13:08:00 | INFO | Predict function saved.


In [13]:
save_estimator_path = "../model_files/final__all_learner.pkl"
all_full_logs = model_pipeline(train_df = data,
                            validation_df = validation_df,
                            params = params_all,
                            target_column = target_column,
                            features = all_features_list,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False,
                            save_estimator_path=None
                          )
with open(save_estimator_path, "wb") as context:
    cp.dump(all_full_logs["lgbm_classification_learner"], context)
    
all_full_logs["data"]["oof_df"].to_pickle("../data/final__all_oof_df.pkl")

2023-10-11T13:08:01 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-11T13:08:01 | INFO | Training for fold 1
2023-10-11T13:08:49 | INFO | Training for fold 2
2023-10-11T13:09:39 | INFO | Training for fold 3
2023-10-11T13:10:28 | INFO | CV training finished!
2023-10-11T13:10:28 | INFO | Training the model in the full dataset...
2023-10-11T13:11:29 | INFO | Training process finished!
2023-10-11T13:11:29 | INFO | Calculating metrics...
2023-10-11T13:11:29 | INFO | Full process finished in 3.47 minutes.
2023-10-11T13:11:29 | INFO | Saving the predict function.
2023-10-11T13:11:29 | INFO | Predict function saved.


In [14]:
# Each loaded learner is a dict that exposes the fitted model under "predict_fn".
all_predict_fn = joblib.load("../model_files/final__all_learner.pkl")
boruta_predict_fn = joblib.load("../model_files/final__boruta_learner.pkl")
fw_predict_fn = joblib.load("../model_files/final__fw_learner.pkl")
base_predict_fn = joblib.load("../model_files/final__base_learner.pkl")
ensemble_predict_fn = joblib.load("../model_files/final__ensemble_learner.pkl")

# same order as `names`
lpf = [
    all_predict_fn,
    boruta_predict_fn,
    fw_predict_fn,
    base_predict_fn,
    ensemble_predict_fn,
]

In [15]:
# Score the private dataset with every base model and join the results on the id column.
frames = []
for name, predict_fn in zip(names, lpf):
    aux = predict_fn["predict_fn"](private_df)[[space_column, "prediction"]].rename(columns={"prediction": f"prediction_{name}"})
    frames.append(aux)
df_predictions = reduce(lambda x, y: pd.merge(x, y, on=space_column), frames)

In [16]:
df_predictions.loc[:,"prediction_average"] = df_predictions.loc[:,columns].mean(axis = 1)

In [17]:
tmp_all = pd.read_pickle("../data/final__all_oof_df.pkl")
boruta_all = pd.read_pickle("../data/final__boruta_oof_df.pkl")
ensemble_all = pd.read_pickle("../data/final__ensemble_oof_df.pkl")
fw_all = pd.read_pickle("../data/final__fw_oof_df.pkl")
base_all = pd.read_pickle("../data/final__base_oof_df.pkl")

ldf = [tmp_all, boruta_all, fw_all, base_all, ensemble_all]

In [18]:
# Build the stackers' training table: one out-of-fold prediction column per base model.
frames = []
for name, _df in zip(names, ldf):
    aux = _df[[space_column, "prediction"]].rename(columns={"prediction": f"prediction_{name}"})
    frames.append(aux)
df_predictions_train = reduce(lambda x, y: pd.merge(x, y, on=space_column), frames)

In [19]:
lr = LogisticRegressionCV(cv = 3, random_state=42)
aux = df_predictions_train.merge(data[[space_column, target_column]], on = space_column)
result = train_binary(aux, columns, target_column, lr)

Score on test set for fold 1 is :0.873
Score on test set for fold 2 is :0.864
Score on test set for fold 3 is :0.862


In [20]:
# The former rename of a "prediction" column was a no-op (no such column exists here).
df_predictions.loc[:, "prediction_lr"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

In [21]:
# MLP stacker trained on the same out-of-fold predictions as the logistic regression.
mlp = MLPClassifier(random_state=42, activation="tanh", max_iter=300, learning_rate="adaptive")
aux = df_predictions_train.merge(data[[space_column, target_column]], on=space_column)
result = train_binary(aux, columns, target_column, mlp)

Score on test set for fold 1 is :0.872
Score on test set for fold 2 is :0.864
Score on test set for fold 3 is :0.862


In [None]:
# The former rename of a "prediction" column was a no-op (no such column exists here).
df_predictions.loc[:, "prediction_mlp"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

In [40]:
from sklearn.ensemble import StackingClassifier

# Stack three MLPs that differ in activation function.
# Note: the "sigmoid" estimator uses random_state=0, unlike the other two.
estimators = [
    ("tanh", MLPClassifier(random_state=42, activation="tanh", max_iter=300, learning_rate="adaptive")),
    ("relu", MLPClassifier(random_state=42, activation="relu", max_iter=300, learning_rate="adaptive")),
    ("sigmoid", MLPClassifier(random_state=0, activation="logistic", max_iter=300, learning_rate="adaptive")),
]
model = StackingClassifier(estimators)
aux = df_predictions_train.merge(data[[space_column, target_column]], on=space_column)
result = train_binary(aux, columns, target_column, model)

Score on test set for fold 1 is :0.872
Score on test set for fold 2 is :0.864
Score on test set for fold 3 is :0.862


In [41]:
# The former rename of a "prediction" column was a no-op (no such column exists here).
df_predictions.loc[:, "prediction_mlp_stack"] = result["model"].predict_proba(df_predictions[columns])[:, 1]

### Save the results

In [26]:
path = "../data/submissions/private_predictions_average_gsc_mc_selected.csv"
private_predictions = df_predictions[[space_column, "prediction_average"]]
private_predictions = private_predictions.rename(columns = {"prediction_average":"Probability"})
private_predictions.to_csv(path, index = False)

In [27]:
path = "../data/submissions/private_predictions_lr_gsc_mc_selected.csv"
private_predictions = df_predictions[[space_column, "prediction_lr"]]
private_predictions = private_predictions.rename(columns = {"prediction_lr":"Probability"})
private_predictions.to_csv(path, index = False)

In [28]:
path = "../data/submissions/private_predictions_mlp_gsc_mc_selected.csv"
private_predictions = df_predictions[[space_column, "prediction_mlp"]]
private_predictions = private_predictions.rename(columns = {"prediction_mlp":"Probability"})
private_predictions.to_csv(path, index = False)

In [47]:
path = "../data/submissions/private_predictions_mlp_stacked_gsc_mc_selected.csv"
private_predictions = df_predictions[[space_column, "prediction_mlp_stack"]]
private_predictions = private_predictions.rename(columns = {"prediction_mlp_stack":"Probability"})
private_predictions.to_csv(path, index = False)