### Demo: how to use the training module for a regression task

This notebook works with the processed dataset for a regression task and shows how to:

- Instantiate models
- Train models with different strategies
- Evaluate the trained models
- Export models
- Load and use models

### Loading modules

In [1]:
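import sys

import pandas as pd

# Make the project's src/ directory importable from this notebook
sys.path.insert(0, "../src/")

from predictive_models.RegressionModels import RegressionModels
# applySplit is the only helper from utilsLib used in this notebook
from utils.utilsLib import applySplit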

### Preparing dataset

- Read the processed dataset
- Separate the response column from the feature values
- Split the dataset into training and independent test sets

In [2]:
df_data = pd.read_csv("../processed_datasets/embedding_low_dataset.csv")
df_data.head(5)

| Unnamed: 0 | p_1 | p_2 | p_3 | p_4 | p_5 | p_6 | p_7 | p_8 | p_9 | p_10 | ... | p_312 | p_313 | p_314 | p_315 | p_316 | p_317 | p_318 | p_319 | p_320 | half_life_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02674 | -0.171403 | 0.178561 | 0.180199 | 0.012636 | -0.108113 | -0.219023 | -0.073262 | -0.268416 | -0.140269 | ... | -0.006349 | 0.071385 | 0.055954 | -0.012171 | 0.190824 | -0.017415 | -0.107225 | 0.284304 | 0.023288 | 4320.0 |
| 1 | 0.091733 | -0.187599 | 0.212225 | 0.060982 | -0.10847 | -0.21831 | -0.286023 | -0.124183 | -0.247224 | -0.191654 | ... | -0.032426 | 0.258178 | 0.081489 | -0.068229 | 0.138499 | -0.003603 | 0.004191 | 0.213267 | 0.018047 | 4320.0 |
| 2 | 0.082563 | -0.210952 | 0.239499 | 0.079224 | -0.103408 | -0.261932 | -0.266437 | -0.122079 | -0.243431 | -0.223874 | ... | -0.066775 | 0.250513 | 0.117581 | -0.046149 | 0.167968 | -0.012323 | -0.017102 | 0.224041 | 0.033795 | 4320.0 |
| 3 | 0.04814 | -0.254203 | 0.179131 | 0.084825 | -0.065147 | -0.232894 | -0.220538 | -0.104856 | -0.261101 | -0.197619 | ... | -0.114089 | 0.284401 | 0.191484 | -0.057476 | 0.218636 | 0.012756 | -0.013462 | 0.272075 | 0.072061 | 4320.0 |
| 4 | 0.006447 | -0.280032 | 0.132136 | 0.142575 | -0.025799 | -0.243614 | -0.186888 | -0.073403 | -0.182603 | -0.222269 | ... | -0.153647 | 0.234304 | 0.180723 | 0.00852 | 0.202126 | -0.052754 | -0.026384 | 0.275791 | 0.025841 | 4320.0 |


In [3]:
dataset = df_data.drop(columns=["half_life_seconds"]).values
response = df_data["half_life_seconds"].values

In [4]:
X_train, X_test, y_train, y_test = applySplit(dataset, response, random_state=42, test_size=0.1)
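
The source of applySplit is not shown in this notebook. As a rough illustration of what a seeded hold-out split does, here is a minimal standard-library sketch. The function name holdout_split and the toy data are hypothetical and are not part of utilsLib; the real helper may shuffle or round differently.

```python
import random


def holdout_split(X, y, test_size=0.1, random_state=42):
    """Shuffle row indices with a fixed seed and hold out a fraction as test.

    Hypothetical stand-in for applySplit; not the project's implementation.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")

    indices = list(range(len(X)))
    # A dedicated Random instance keeps the split reproducible for a given seed
    random.Random(random_state).shuffle(indices)

    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: 20 rows with 2 features each, and one target value per row
X_demo = [[i, i * 2] for i in range(20)]
y_demo = [float(i) for i in range(20)]

X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo, test_size=0.1, random_state=42)
print(len(X_tr), len(X_te))
```

Fixing the seed means the same rows land in the test set on every run, which is what makes the later evaluation repeatable.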

### Instantiate the model

- Instantiate the RegressionModels class and call one of its methods to create a regression model

In [5]:
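# Note: the full dataset is passed here, so the rows held out in X_test above
# are also available to this object. Pass X_train and y_train instead if the
# final evaluation must use a strictly independent test set.
regx_model = RegressionModels(
    dataset,
    response,
    test_size=0.2,
    random_state=42
)

# Select a random forest as the regression model
regx_model.instanceRandomForest()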

### Training process

- Prepare the dataset for the input process (split it into training and validation sets)
- Train the model
- Evaluate the model using the regression metrics

The performance of the model is saved in the performances attribute of the RegressionModels object.

In [6]:
regx_model.processModel()
regx_model.performances

{'validation_metrics': {'R2': -0.32162735891849903,
  'MAE': 2007.2894870538269,
  'MSE': 5914334.091781149,
  'Kendall-tau': -0.11012639093128974,
  'Pearson': -0.07156311122638283,
  'Spearman': -0.15713095360271642}}
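
The R2, MAE and MSE keys in the report above presumably follow the standard definitions of these metrics. The sketch below computes them from scratch on toy values to show how to read them. The function regression_metrics and the toy numbers are illustrative assumptions and are not taken from RegressionModels.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MSE and RMSE from their textbook definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n

    # Residual sum of squares: error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


y_true_demo = [1.0, 2.0, 3.0, 4.0]

perfect = regression_metrics(y_true_demo, [1.0, 2.0, 3.0, 4.0])
mean_baseline = regression_metrics(y_true_demo, [2.5, 2.5, 2.5, 2.5])
reversed_order = regression_metrics(y_true_demo, [4.0, 3.0, 2.0, 1.0])

print(perfect["R2"], mean_baseline["R2"], reversed_order["R2"])
```

Under these definitions, a perfect fit gives an R2 of 1 and a constant prediction at the mean gives 0. A value below zero, like the ones reported in this notebook, means the predictions fit worse than that constant baseline.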

### Training process with k-fold

- Training with k-fold or stratified strategies splits the training dataset into k segments. In this case, two sets of metrics are obtained: the training metrics and the validation metrics
- If you train with k-fold, you don't need to train and fit the model beforehand. The process automatically trains and fits the model once the k-fold is finished

In [7]:
regx_model.processModel(kfold=True, k=5)
regx_model.performances

{'training_metrics': {'MAE': 1905.6051850860538,
  'MSE': 5414830.135497616,
  'R2': -0.22210889986511076,
  'RMSE': 2326.69421234808},
 'validation_metrics': {'R2': -0.2872705453770572,
  'MAE': 1988.3278962128018,
  'MSE': 5760586.008221955,
  'Kendall-tau': -0.09686509778249569,
  'Pearson': -0.034034929070846306,
  'Spearman': -0.13460545842013008}}
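
How processModel builds and aggregates its folds is not shown here. As an illustration of the general idea, the sketch below splits row indices into k contiguous, unshuffled folds, scores a trivial mean predictor on each fold, and averages the per-fold scores into a report with the same two top-level keys. The names kfold_indices, mean_absolute_error and kfold_report are hypothetical, and the mean predictor is only a placeholder for a real model.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy targets; a real run would also index the feature matrix per fold
y_demo = [float(v) for v in range(12)]

train_scores, val_scores = [], []
for train_idx, val_idx in kfold_indices(len(y_demo), k=5):
    y_train_fold = [y_demo[i] for i in train_idx]
    y_val_fold = [y_demo[i] for i in val_idx]

    # Placeholder "model": always predict the mean of the training fold
    baseline = sum(y_train_fold) / len(y_train_fold)

    train_scores.append(mean_absolute_error(y_train_fold, [baseline] * len(y_train_fold)))
    val_scores.append(mean_absolute_error(y_val_fold, [baseline] * len(y_val_fold)))

# Average the per-fold scores into one report
kfold_report = {
    "training_metrics": {"MAE": sum(train_scores) / len(train_scores)},
    "validation_metrics": {"MAE": sum(val_scores) / len(val_scores)},
}
print(kfold_report)
```

Each row is used for validation exactly once, so the averaged validation score draws on the whole training set rather than on a single hold-out split.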

### Export model

In [8]:
regx_model.exportModel(name_export="../demo_trained_models/rf_regx_demo.joblib")

### Load and use the model

In [9]:
regx_model.loadModel(name_model="../demo_trained_models/rf_regx_demo.joblib")
predictions_model = regx_model.makePredictionsWithModel(X_test)
performances_test = regx_model.evalModel(
    type_model="regx",
    y_true=y_test, 
    y_pred=predictions_model)
performances_test

{'R2': -0.292664430343867,
 'MAE': 1913.6271715810399,
 'MSE': 5530559.428491416,
 'Kendall-tau': -0.09740694031689828,
 'Pearson': -0.03984186496410636,
 'Spearman': -0.1352115554636694}
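
The report also contains correlation coefficients between the true and predicted values. The sketch below shows how Pearson and Spearman differ, assuming the standard definitions: Pearson measures linear association, and Spearman is Pearson applied to ranks, with tied values sharing their average rank. The helper names are hypothetical, Kendall's tau is left out for brevity, and evalModel may compute these values differently (for example through a statistics library).

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Monotonic but non-linear relation between toy targets and predictions
y_true_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred_demo = [1.0, 4.0, 9.0, 16.0, 25.0]

pearson_demo = pearson(y_true_demo, y_pred_demo)
spearman_demo = spearman(y_true_demo, y_pred_demo)
print(pearson_demo, spearman_demo)
```

On this toy data the ordering of the predictions matches the ordering of the targets exactly, so the rank-based coefficient is higher than the linear one. Negative coefficients, like those in the reports in this notebook, indicate that predictions tend to decrease as the true values increase.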