# How to train a model on any dataset

## Declare your data

Start by writing a config file for your dataset in the `config/` folder; copying an existing config file is the easiest way. When loading the data, you can also add columns computed from operations on existing columns and delete specific values.
Other options (columns to drop, datetime columns) must be declared directly in the config file.

By default, the preprocessing adds no columns, removes correlated features, and applies ordinal encoding to all categorical features.
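
To make these defaults concrete, here is a minimal sketch of what dropping correlated features and ordinal-encoding categorical columns can look like in plain pandas. It is not BEExAI's implementation: the helper names `drop_correlated` and `ordinal_encode`, the toy columns, and the 0.9 correlation threshold are assumptions chosen for illustration.

```python
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper: drop each numeric column whose absolute Pearson
    correlation with an earlier, kept numeric column exceeds `threshold`."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i, col in enumerate(cols):
        if col in to_drop:
            continue
        for other in cols[i + 1:]:
            if corr.loc[col, other] > threshold:
                to_drop.add(other)
    return df.drop(columns=sorted(to_drop))


def ordinal_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace every non-numeric column with integer codes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Categories are sorted, so codes follow alphabetical order
            out[col] = out[col].astype("category").cat.codes
    return out


toy = pd.DataFrame({
    "goal": [1000.0, 2000.0, 3000.0, 4000.0],
    "goal_usd": [1100.0, 2200.0, 3300.0, 4400.0],  # perfectly correlated with "goal"
    "backers": [5, 1, 7, 2],
    "country": ["US", "GB", "US", "FR"],
})
processed = ordinal_encode(drop_correlated(toy))
print(processed)
```

In this toy example `goal_usd` is removed because it duplicates the information in `goal`, and `country` becomes an integer column.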

In [1]:
import sys

sys.path.append("../")

import pandas as pd
import torch

from beexai.dataset.dataset import Dataset
from beexai.dataset.load_data import load_data
from beexai.training.train import Trainer
from beexai.utils.path import create_dir
from beexai.utils.time_seed import set_seed

seed = 42
set_seed(seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

DATA_NAME = "kickstarter"
MODEL_NAME = "NeuralNetwork"

# Declare columns to add as a list of (col_name, func_to_apply, dtype) tuples
# and values to delete as a list of (col_name, value_to_delete) tuples
add_list = [
    (
        "duration",
        lambda y: (pd.to_datetime(y["deadline"]) - pd.to_datetime(y["launched"])).apply(
            lambda x: x.days
        ),
        None,
    )
]
values_to_delete = [("country", 'N,0"'), ("state", "live")]
# If you don't want to add columns or delete values, don't specify them in `load_data`

create_dir("../output/data")
CONFIG_PATH = f"config/{DATA_NAME}.yml"
data_test, target_col, task, dataCleaner = load_data(
    from_cleaned=False,
    config_path=CONFIG_PATH,
    keep_corr_features=True,
    values_to_delete=values_to_delete,
    add_list=add_list,
)
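
One way to read the two lists: each `add_list` entry names a new column, a function that receives the whole DataFrame and returns that column's values, and an optional dtype; each `values_to_delete` entry names a column and a value to remove. The sketch below applies that reading to a toy DataFrame. `apply_edits` is a hypothetical helper, not part of BEExAI, and both the row-dropping behavior and the order of operations (delete first, then add) are assumptions.

```python
import pandas as pd


def apply_edits(df, add_list=(), values_to_delete=()):
    """Hypothetical helper mirroring the tuple formats used above."""
    out = df.copy()
    # (col_name, value_to_delete): assumed to drop rows where the column equals the value
    for col, value in values_to_delete:
        out = out[out[col] != value].copy()
    # (col_name, func_to_apply, dtype): build a new column from the whole DataFrame
    for col, func, dtype in add_list:
        out[col] = func(out)
        if dtype is not None:
            out[col] = out[col].astype(dtype)
    return out.reset_index(drop=True)


toy = pd.DataFrame({
    "launched": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "deadline": ["2020-01-31", "2020-03-02", "2020-03-11"],
    "state": ["successful", "live", "failed"],
})
toy_add_list = [
    (
        "duration",
        lambda y: (pd.to_datetime(y["deadline"]) - pd.to_datetime(y["launched"])).dt.days,
        None,
    )
]
result = apply_edits(toy, add_list=toy_add_list, values_to_delete=[("state", "live")])
print(result[["state", "duration"]])
```

The `.dt.days` accessor used here is equivalent to the `.apply(lambda x: x.days)` call in the cell above.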

## Get the training and test samples

You can choose the number of folds for k-fold cross-validation and the fraction of the data held out for testing. You can also choose whether to scale the input features.

In [None]:
data = Dataset(data_test, target_col)
scale_params = {
    "x_num_scaler_name": "quantile_normal",
    "x_cat_encoder_name": "ordinalencoder",
    "y_scaler_name": "labelencoder",  # change to minmax or another float scaler for regression
    "cat_not_to_onehot": ["name"],
}
X_train, X_test, y_train, y_test = data.get_train_test(
    test_size=0.2, scaler_params=scale_params
)
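
For readers who want to see what such a split-and-scale step typically involves, the sketch below reproduces a similar pipeline with scikit-learn on synthetic data. The mapping from the scaler names above to `QuantileTransformer(output_distribution="normal")` and `LabelEncoder` is an assumption about BEExAI's internals, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import LabelEncoder, QuantileTransformer

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                  # synthetic numeric features
y = np.array(["failed", "successful"] * 50)    # synthetic class labels

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scalers on the training split only, then reuse them on the test split
x_scaler = QuantileTransformer(output_distribution="normal", n_quantiles=50)
X_tr_scaled = x_scaler.fit_transform(X_tr)
X_te_scaled = x_scaler.transform(X_te)

y_encoder = LabelEncoder()
y_tr_enc = y_encoder.fit_transform(y_tr)
y_te_enc = y_encoder.transform(y_te)

# Index pairs for 5-fold cross-validation over the training split
folds = list(KFold(n_splits=5, shuffle=True, random_state=42).split(X_tr_scaled))
print(X_tr_scaled.shape, X_te_scaled.shape, len(folds))
```

Fitting the scalers on the training split alone keeps information from the test rows out of the preprocessing step.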

## Train the model

You can choose the model you want to train and the hyperparameters you want to use.

In [None]:
NUM_LABELS = data.get_classes_num(task)
NN_PARAMS = {"input_dim": X_train.shape[1], "output_dim": NUM_LABELS}
trainer = Trainer(MODEL_NAME, task, NN_PARAMS, device=device)
# trainer = Trainer("XGBClassifier", task, device=device)
trainer.train(X_train, y_train, loss_file="../output/loss.png")
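
As background on what training a neural network classifier involves, here is a minimal PyTorch training loop on synthetic data. It is a generic sketch, not the code behind `Trainer.train`: the architecture, optimizer, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(42)

# Synthetic binary classification data: the label is the sign of the first feature
X = torch.randn(200, 4)
y = (X[:, 0] > 0).long()

# Small multilayer perceptron with one output logit per class
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
model.train()
for epoch in range(100):
    optimizer.zero_grad()            # reset gradients from the previous step
    loss = loss_fn(model(X), y)      # full-batch forward pass and loss
    loss.backward()                  # backpropagate
    optimizer.step()                 # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Plotting `losses` against the epoch index gives a loss curve of the kind a file such as `loss.png` would usually contain.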

## Evaluation and saving

You can get the metrics on the test set for your model (`accuracy/f1-score` for classification, `mse/rmse/mape/r2-score` for regression). You can also save the model in `.pt` or `.joblib` format.

In [None]:
trainer.model.eval()  # comment out if not using a neural network

metrics = trainer.get_metrics(X_test, y_test)
for k, v in metrics.items():
    print(k, v)
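
The metric names listed above have standard definitions. As a reference, the sketch below computes them by hand on toy values. It is not `Trainer.get_metrics`: the helper names are hypothetical, F1 is shown for the binary case only, and MAPE assumes no target is zero.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper: accuracy and binary F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: MSE, RMSE, MAPE, and R2."""
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in pairs)       # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mape = sum(abs((t - p) / t) for t, p in pairs) / n  # assumes no zero targets
    return {"mse": mse, "rmse": math.sqrt(mse), "mape": mape, "r2": 1 - ss_res / ss_tot}


cls = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
reg = regression_metrics([2.0, 4.0, 6.0], [2.5, 4.0, 5.5])
print(cls)
print(reg)
```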

In [None]:
# Save model
create_dir(f"../output/models/{DATA_NAME}")
trainer.save_model(
    f"../output/models/{DATA_NAME}/{MODEL_NAME}.pt"
)  # change to .joblib for sklearn models

## Next steps

- Go to `2.Explain.ipynb` to get explainability scores for the model you just trained.
- Go to `3.Metrics.ipynb` to get explainability metrics for the method and the model of your choice.