# Training and application of the Deconfounding DualADAE model

This notebook shows how to create and train a DualADAE model from scratch on multi-center, multi-scanner radiomics data.
It requires the DualADAE repository to be installed on the system running the code.

In [None]:
from utils import *

First, import the radiomics dataset, which must be stored in CSV format in the `DATA/` folder.
The `csv_path` variable is loaded automatically from the `utils.py` script. To use a different data directory, change the definition of this variable.

In [None]:
# Import data to process
filepath = csv_path + 'data.csv' # data.csv contains your radiomics matrix
_data_df = pd.read_csv(filepath)

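# Scale your data (zero mean, unit variance per feature)
data_df_std = StandardScaler().fit_transform(_data_df)
# fit_transform returns a NumPy array, so re-attach column names and index
data_df = pd.DataFrame(data_df_std, columns=_data_df.columns, index=_data_df.index)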
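
As a quick sanity check of the scaling step, the sketch below applies the same `StandardScaler` call to a small synthetic matrix and verifies the result. The toy feature names, sample IDs and values are made up for illustration and are not part of the DualADAE repository.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy radiomics matrix: 5 samples x 3 features (synthetic values, illustration only)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.normal(loc=10.0, scale=3.0, size=(5, 3)),
    columns=["feat_a", "feat_b", "feat_c"],
    index=["s1", "s2", "s3", "s4", "s5"],
)

# fit_transform returns a NumPy array, so column names and index are re-attached
scaled = StandardScaler().fit_transform(toy_df)
toy_scaled_df = pd.DataFrame(scaled, columns=toy_df.columns, index=toy_df.index)

# Each column should now have (numerically) zero mean and unit population std
col_means = toy_scaled_df.mean(axis=0)
col_stds = toy_scaled_df.std(axis=0, ddof=0)
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `DataFrame.std()` defaults to `ddof=1`; this is why `ddof=0` is passed explicitly in the check.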

## Model Architecture and Hyperparameters Optimization

The DualADAE library lets the user define a set of hyperparameters (number of nodes per hidden layer and drop-out probability $p$) and runs a grid search for the best Autoencoder (AE) configuration. The best model is the one that minimizes the AE reconstruction error.
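
To make the selection criterion concrete, here is a minimal sketch that scores two hypothetical reconstructions and keeps the one with the lower error. It assumes the reconstruction error is a mean squared error; the metric actually used by the library may differ, and `reconstruction_mse` and the `config_*` names are illustrative only.

```python
import numpy as np

def reconstruction_mse(x, x_hat):
    """Mean squared error between inputs and their reconstructions (assumed metric)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))

# Input matrix and hypothetical reconstructions from two candidate configurations
x = np.array([[1.0, 2.0], [3.0, 4.0]])
candidates = {
    "config_a": np.array([[1.0, 2.0], [3.0, 5.0]]),  # one entry off by 1
    "config_b": np.array([[0.0, 2.0], [3.0, 2.0]]),  # entries off by 1 and by 2
}

# Score every candidate and keep the one with the smallest error
errors = {name: reconstruction_mse(x, x_hat) for name, x_hat in candidates.items()}
best = min(errors, key=errors.get)
```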

By default, the library tests three versions of the AE model:
- Three-layer AE
- Two-layer AE
- One-layer AE


The function `make_AE_optimization()` takes as input the radiomics dataset and four lists of hyperparameters:
- dimensions of the first hidden layer (`dims1`)
- dimensions of the second hidden layer (`dims2`)
- dimensions of the third hidden layer (`dims3`)
- set of drop-out values (`dropouts`)

From these four lists, it generates **three `.json` files**, stored in the `PARAMS/` folder: each file contains the best configuration found for one model version.
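
To get a feel for the size of the search, the sketch below enumerates the candidate configurations with `itertools.product`. How the library combines the lists for each model version is an assumption here (one-layer uses `dims1` only, two-layer uses `dims1` and `dims2`, three-layer uses all three); the real grid may be built differently.

```python
from itertools import product

# Same candidate values as in the optimization cell of this notebook
dims1 = [32, 16, 8]
dims2 = [16, 8, 4]
dims3 = [8, 4, 2]
dropouts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

# ASSUMPTION: each model version only uses the layer lists it needs
grids = {
    "one_layer": list(product(dims1, dropouts)),
    "two_layer": list(product(dims1, dims2, dropouts)),
    "three_layer": list(product(dims1, dims2, dims3, dropouts)),
}

# Number of configurations to train per model version
n_configs = {name: len(grid) for name, grid in grids.items()}
```

Under this assumption the grid grows multiplicatively with every list, so adding values to `dims3` or `dropouts` increases the optimization time quickly.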



**Note:** Model optimization can be skipped by moving directly to the next section. In that case the model is built from a default set of hyperparameters, saved as `default_model.json`.

In [None]:
# Define the hyperparameters to evaluate in grid search of the best model configuration
dims1 = [32, 16, 8]
dims2 = [16, 8, 4]
dims3 = [8, 4, 2]
dropouts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

# Perform model optimization
make_AE_optimization(data_df, dims1, dims2, dims3, dropouts)

## DualADAE training and deconfounding of radiomics data

First, import the confounding-factor data (center and scanner labels) and prepare the training and test sets used to train the model.

In [None]:
# Import data to process
filename_center = csv_path + 'center.csv' # center.csv contains your center labels
center_df = pd.read_csv(filename_center)
filename_scanner = csv_path + 'scanner.csv' # scanner.csv contains your scanner labels
scanner_df = pd.read_csv(filename_scanner)

labels_df = pd.concat([center_df, scanner_df], axis=1)

# Detect common samples
common_samples = np.intersect1d(data_df.index.values, labels_df.index.values)
labels_df = labels_df.loc[common_samples]
input_df = data_df.loc[common_samples]

# prepare your final training data
X = input_df
Y = labels_df
n_centers = len(np.unique(center_df))
n_scanners = len(np.unique(scanner_df))

# Train and test splits
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=12345)
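
With few samples per center or scanner, a random split can leave a label out of the training set. The sketch below, on synthetic labels, shows one possible way to guard against this by stratifying on the combined center/scanner label and then checking coverage. The column names `center` and `scanner` and the `stratify` option are assumptions for illustration; the cell above does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels: 20 samples, 2 centers, 2 scanners (synthetic, illustration only)
toy_Y = pd.DataFrame({
    "center": [0, 1] * 10,
    "scanner": [0, 0, 1, 1] * 5,
})
toy_X = pd.DataFrame(np.arange(40, dtype=float).reshape(20, 2), columns=["f1", "f2"])

# Stratify on the combined center/scanner label so every pair is represented
strata = toy_Y["center"].astype(str) + "_" + toy_Y["scanner"].astype(str)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    toy_X, toy_Y, test_size=0.2, random_state=12345, stratify=strata
)

# Labels that appear in the test set but were never seen during training
missing_centers = set(Y_te["center"]) - set(Y_tr["center"])
missing_scanners = set(Y_te["scanner"]) - set(Y_tr["scanner"])
```

If either set is non-empty on your own data, consider a different `random_state` or a stratified split before training.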

The DualADAE model itself is trained by calling the function `make_DualADAE()`.

This function takes several inputs:
- `X_train`, `Y_train`, `X_test`, `Y_test` are the input datasets prepared above
- `name_dict_param` is a string containing the file name of the chosen model version: users can choose the best one-layer, two-layer or three-layer configuration (if they ran the optimization step), or the default model.
- `iterations`: number of training iterations
- `n_centers`: the number of centers that collected the data
- `n_scanners`: the total number of scanners used across all centers

The function returns a dataframe containing the deconfounded embeddings. By default, the same dataframe is also stored in the `ADV_FILES_DUAL/` folder, which is created automatically in the working directory.
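
Since `name_dict_param` must point to a file that exists, a small helper can fall back to the default model when no optimization was run. This is only a sketch: `choose_param_file` is a hypothetical helper, not part of the DualADAE library, and it assumes the parameter files are looked up by name inside a single folder such as `PARAMS/`.

```python
import tempfile
from pathlib import Path

def choose_param_file(params_dir, preferred, fallback="default_model.json"):
    """Return `preferred` if it exists inside `params_dir`, otherwise `fallback`.

    Hypothetical helper: it only checks for the file, it does not read it.
    """
    if (Path(params_dir) / preferred).is_file():
        return preferred
    return fallback

# Demonstration in a temporary folder standing in for the parameter directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "best_AE_param_two_layer.json").write_text("{}")
    found = choose_param_file(tmp, "best_AE_param_two_layer.json")
    missing = choose_param_file(tmp, "not_there.json")
```

In the notebook, the returned name could then be assigned to `name_dict_param` before calling the training function.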

In [None]:
# Run models
lambda_val = 0.5
name_dict_param = 'best_AE_param_two_layer.json' # replace with 'default_model.json' if no optimization was performed
iterations = 6000

embedding_df = make_DualADAE(X_train, 
                             Y_train, 
                             X_test, 
                             Y_test, 
                             name_dict_param, 
                             lambda_val, 
                             iterations, 
                             n_centers, 
                             n_scanners)

The code above trains the DualADAE model in single-split validation mode. To train the network in cross-validation mode instead, use the code below, which calls `make_DualADAE_crossval()`:

In [None]:
# Run models
lambda_val = 0.5
name_dict_param = 'best_AE_param_two_layer.json' # replace with 'default_model.json' if no optimization was performed
iterations = 100
n_split = 50

# NOTE: `run` is not defined anywhere in this notebook; set it before running this cell
embedding_df = make_DualADAE_crossval(X, Y, 
                                      name_dict_param,
                                      run,
                                      lambda_val,
                                      n_split,
                                      iterations,
                                      n_centers, 
                                      n_scanners)

# Have fun with your embeddings!! :D

where:
- `X`, `Y` are the non-split input datasets
- `iterations`: number of training iterations for each split
- `n_split`: number of cross-validation splits
- the remaining parameters are as above
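
To illustrate what a cross-validation split looks like, the sketch below uses scikit-learn's `KFold` on a toy array. It assumes a plain K-fold scheme; how `make_DualADAE_crossval()` actually builds its splits is not shown in this notebook and may differ.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 2 features (synthetic, illustration only)
n_samples = 10
n_split = 5
toy_X = np.arange(n_samples * 2).reshape(n_samples, 2)

# ASSUMPTION: shuffled K-fold; the library may split differently
kf = KFold(n_splits=n_split, shuffle=True, random_state=12345)

test_counts = np.zeros(n_samples, dtype=int)  # times each sample lands in a test fold
fold_sizes = []                               # (train size, test size) per fold
for train_idx, test_idx in kf.split(toy_X):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
```

In a K-fold scheme every sample is held out exactly once, so the number of samples per test fold shrinks as `n_split` grows.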