# End-to-end neural network classification

This notebook combines the previous notebooks into a single pipeline. It is a good starting point for understanding how to use the pipeline from end to end.

In this example, we train a neural network on the Kickstarter dataset to predict whether a project succeeds.

### Load data and train the model

In [None]:
import sys

sys.path.append("../")

In [None]:
import torch

from beexai.dataset.dataset import Dataset
from beexai.dataset.load_data import load_data
from beexai.evaluate.metrics.get_results import get_all_metrics
from beexai.explanation.explaining import CaptumExplainer
from beexai.training.train import Trainer
from beexai.utils.path import create_dir
from beexai.utils.sampling import stratified_sampling
from beexai.utils.time_seed import set_seed

For this example, we add a `duration` column, which is the difference between the `deadline` and `launched` columns. We also drop the rows whose `country` is `N,0` and the rows whose `state` is `live`.
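
The cleaning steps described above can be sketched with pandas on a toy table. This is only an illustration, not the code that `load_data` runs: the rows are made up, and measuring `duration` in whole days is an assumption.

```python
import pandas as pd

# Toy rows standing in for the raw Kickstarter table (values are made up).
raw = pd.DataFrame(
    {
        "launched": ["2015-01-01 12:00:00", "2016-03-10 00:00:00", "2017-06-01 08:30:00"],
        "deadline": ["2015-01-31", "2016-04-09", "2017-07-01"],
        "country": ["US", "N,0", "GB"],
        "state": ["successful", "failed", "live"],
    }
)

# Parse both date columns, then store the campaign length in whole days
# (the unit is an assumption; the notebook only says "difference").
launched = pd.to_datetime(raw["launched"])
deadline = pd.to_datetime(raw["deadline"])
raw["duration"] = (deadline - launched).dt.days

# Drop rows with the placeholder country and campaigns that are still live.
keep = (raw["country"] != "N,0") & (raw["state"] != "live")
cleaned = raw[keep].reset_index(drop=True)
print(cleaned[["country", "state", "duration"]])
```

Only the first toy row survives both filters, since the second has the placeholder country and the third is still live.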

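The `load_data` function can also remove correlated features (the default correlation threshold is 70%) and one-hot encode categorical features. High-dimensional categorical features can be exempted from one-hot encoding, since they would produce too many columns. Note that the cell below passes `keep_corr_features=True`, which, as the name suggests, keeps the correlated features.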
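
To make the correlation filter concrete, here is a minimal sketch of one common way to do it: compute absolute Pearson correlations and drop the later column of every pair above the threshold. The helper `drop_correlated` and the toy columns are hypothetical, and the rule `load_data` actually applies may differ.

```python
import pandas as pd


def drop_correlated(df, threshold=0.7):
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    to_drop = []
    for i, col in enumerate(cols):
        if col in to_drop:
            continue  # already removed, so it cannot evict other columns
        for other in cols[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.append(other)
    return df.drop(columns=to_drop), to_drop


# Made-up numeric features: goal_usd is nearly a copy of goal.
toy = pd.DataFrame(
    {
        "goal": [1.0, 2.0, 3.0, 4.0, 5.0],
        "goal_usd": [1.1, 2.1, 2.9, 4.2, 5.0],
        "backers": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)
reduced, dropped = drop_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Keeping the first column of each pair is an arbitrary choice; other tie-breaking rules, such as keeping the column most correlated with the target, are equally plausible.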

In [None]:
seed = 42
set_seed(seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

DATA_NAME = "kickstarter"
MODEL_NAME = "NeuralNetwork"

create_dir("../output/data")
CONFIG_PATH = f"config/{DATA_NAME}.yml"
data_test, target_col, task, dataCleaner = load_data(
    from_cleaned=True, config_path=CONFIG_PATH, keep_corr_features=True
)
scale_params = {
    "x_num_scaler_name": "quantile_normal",
    "x_cat_encoder_name": "ordinalencoder",
    "y_scaler_name": "labelencoder",
    "cat_not_to_onehot": ["name"],
}
data = Dataset(data_test, target_col)
X_train, X_test, y_train, y_test = data.get_train_test(
    test_size=0.2, scaler_params=scale_params
)
X_train, X_val, y_train, y_val = data.get_train_val(X_train, y_train, val_size=0.2)
num_labels = data.get_classes_num(task)
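
The two successive splits above determine how much data ends up in each set. As a quick check, and assuming `val_size` is a fraction of the remaining training portion rather than of the full dataset (not confirmed here), the proportions work out as follows. `split_fractions` is a hypothetical helper written for this note.

```python
def split_fractions(test_size, val_size):
    """Return (train, val, test) fractions of the full dataset."""
    train_val = 1.0 - test_size      # what is left after the test split
    val = train_val * val_size       # validation is carved out of that remainder
    train = train_val - val
    return train, val, test_size


train_frac, val_frac, test_frac = split_fractions(test_size=0.2, val_size=0.2)
print(f"train={train_frac:.2f} val={val_frac:.2f} test={test_frac:.2f}")
```

Under that assumption, 64% of the rows are used for training, 16% for validation and 20% for testing.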

For a neural network, we need to specify the input and output dimensions of the model: the number of features and the number of classes.

In [None]:
NN_PARAMS = {"input_dim": X_train.shape[1], "output_dim": num_labels}
trainer = Trainer(MODEL_NAME, task, NN_PARAMS, device)
trainer.train(
    X_train.values,
    y_train.values,
    loss_file="../output/loss.png",
    x_val=X_val,
    y_val=y_val,
)
trainer.model.eval()
metrics = trainer.get_metrics(X_test, y_test)
for k, v in metrics.items():
    print(f"{k}: {v}")

Two formats are available for saving your model: `pt` for PyTorch models and `joblib` for scikit-learn models.

In [None]:
create_dir(f"../output/models/{DATA_NAME}")
trainer.save_model(f"../output/models/{DATA_NAME}/{MODEL_NAME}.pt")
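
What `save_model` does internally is not shown in this notebook. For reference, a common PyTorch pattern for the `pt` format is to save and reload the model's `state_dict`. The sketch below uses a small stand-in network and an in-memory buffer instead of a file, both of which are assumptions made for illustration.

```python
import io

import torch
from torch import nn


def make_model():
    # Stand-in architecture; the real network is built by the Trainer.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


model = make_model()

# Save only the learned parameters, not the Python object itself.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Reload into a freshly built model with the same architecture.
buffer.seek(0)
restored = make_model()
restored.load_state_dict(torch.load(buffer))
restored.eval()

# Both models should now produce identical outputs.
x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.equal(model(x), restored(x))
print(same)
```

Saving the `state_dict` means the architecture has to be rebuilt before loading, which is why `make_model` is called twice.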

For faster testing, we use the `stratified_sampling` function, which samples a subset of the data while preserving the distribution of the target variable.

In [None]:
X_test, y_test = stratified_sampling(X_test, y_test, 100, task)
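
The idea behind stratified sampling can be sketched in plain Python: group row indices by class, then draw from each group in proportion to its frequency. `stratified_sample` below is a hypothetical helper, not the library function, and it assumes the requested size is a number of rows.

```python
import random
from collections import defaultdict


def stratified_sample(labels, n_samples, seed=42):
    """Return sorted row indices whose class mix mirrors `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    chosen = []
    for indices in by_class.values():
        # Allocate slots proportionally to the class frequency (at least one).
        # Rounding means the total can differ slightly from n_samples.
        quota = max(1, round(n_samples * len(indices) / total))
        chosen.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(chosen)


# Made-up labels: 60% failed, 40% successful.
labels = ["failed"] * 600 + ["successful"] * 400
picked = stratified_sample(labels, 100)
counts = {c: sum(labels[i] == c for i in picked) for c in set(labels)}
print(counts)
```

The 100 sampled rows keep the 60/40 class ratio of the full label list.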

### Captum Models

Captum offers many explainers. We use `IntegratedGradients` in this example, but `DeepLift`, `Lime`, `ShapleyValueSampling` and other XAI methods can be used as well.

In [None]:
explainer = CaptumExplainer(
    trainer.model, task=task, method="IntegratedGradients", sklearn=False, device=device
)
explainer.init_explainer()
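
To give some intuition for what Integrated Gradients computes, here is a hand-written approximation on a tiny logistic model. It is not Captum's implementation: the weights, the input and the midpoint Riemann sum are all choices made for this sketch. The method averages the gradient along the straight path from a baseline to the input and scales it by the input difference.

```python
import math

# Made-up logistic model with an analytic gradient.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1


def model(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral of the gradient with a midpoint sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, -0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness: attributions should sum to model(x) - model(baseline).
gap = abs(sum(attributions) - (model(x) - model(baseline)))
print(attributions, gap)
```

The completeness check at the end is a useful sanity test for any Integrated Gradients approximation: if the gap is not close to zero, more integration steps are needed.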

### Evaluate IG with XAI metrics

Several quantitative metrics are implemented to evaluate the explanations. Sanity checks on the explanations are also possible: training a model on shuffled labels and comparing against a random explainability baseline.

In [None]:
all_preds = trainer.model.predict(X_test.values)
get_all_metrics(
    X_test,
    all_preds,
    trainer.model,
    explainer,
    baseline="zero",
    auc_metric="accuracy",
    print_plot=False,
    save_path=None,
    device=device,
)
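
Which metrics `get_all_metrics` computes is not detailed in this notebook. As a rough illustration of how a zero baseline and an accuracy score can be combined in a removal-based check, the sketch below zeroes out the top-attributed features of each row and measures how often the prediction still matches the original one. The linear model, the data and the helper names are all hypothetical.

```python
WEIGHTS = [2.0, -1.0, 0.5]


def predict(row):
    score = sum(w * v for w, v in zip(WEIGHTS, row))
    return 1 if score > 0 else 0


def attributions(row):
    # For a linear score and a zero baseline, weight * value is a natural attribution.
    return [w * v for w, v in zip(WEIGHTS, row)]


def removal_curve(rows):
    """Accuracy against the original predictions after zeroing the top-k features."""
    reference = [predict(r) for r in rows]
    n_features = len(rows[0])
    curve = []
    for k in range(n_features + 1):
        hits = 0
        for row, ref in zip(rows, reference):
            attr = attributions(row)
            order = sorted(range(n_features), key=lambda i: -abs(attr[i]))
            masked = list(row)
            for i in order[:k]:
                masked[i] = 0.0  # replace the feature with the zero baseline
            hits += predict(masked) == ref
        curve.append(hits / len(rows))
    return curve


rows = [
    [1.0, 0.5, 0.2],
    [0.1, 2.0, 0.4],
    [0.8, 0.1, 1.0],
    [0.2, 1.5, 2.0],
]
curve = removal_curve(rows)
print(curve)
```

A sharp drop after removing the first feature suggests the attributions point at features the model relies on. Running the same loop with a shuffled feature order gives a reference curve to compare against.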