# How to Train a LEOPARD

This notebook provides brief instructions for training your own LEOPARD. You can also use the code below to reproduce the results on the MGH COVID dataset reported in our paper. Results may vary slightly between runs because training involves stochastic mechanisms. If you have any questions about the code, please contact the zookeeper: Siyu Han (siyu.han@tum.de).

## Step 1: Import required modules

Before proceeding, ensure that the following dependencies are properly installed:
- python: 3.7.9
- numpy: 1.21.5
- pandas: 1.3.5
- scikit-learn: 1.0.2
- pytorch: 1.11.0
- pytorch_lightning: 1.6.4
- tensorboard: 2.10.0
- cuda (if using a GPU): 11.3

*The listed version numbers are those used during our development. We cannot guarantee compatibility or identical results with other versions.*
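
To compare an environment against the versions listed above, a check along these lines may help. It is only a sketch: `EXPECTED` and `report_versions` are names introduced here for illustration, and the distribution names (for example `torch` for pytorch and `pytorch-lightning` for pytorch_lightning) are assumed to be the usual PyPI names.

```python
try:
    from importlib.metadata import version, PackageNotFoundError  # Python >= 3.8
except ImportError:
    from importlib_metadata import version, PackageNotFoundError  # backport for Python 3.7

# Versions listed in this notebook; the keys are assumed PyPI distribution names.
EXPECTED = {
    "numpy": "1.21.5",
    "pandas": "1.3.5",
    "scikit-learn": "1.0.2",
    "torch": "1.11.0",
    "pytorch-lightning": "1.6.4",
    "tensorboard": "2.10.0",
}


def report_versions(expected):
    """Return {package: (installed_version_or_None, expected_version)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


for name, (installed, wanted) in report_versions(EXPECTED).items():
    # Ignore local build tags such as "+cu113" when comparing.
    matches = installed is not None and installed.split("+")[0] == wanted
    status = "ok" if matches else "differs"
    print(f"{name}: installed={installed}, expected={wanted} ({status})")
```

A mismatch does not necessarily mean the code will fail; it only flags where results might deviate from ours.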

In [None]:
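import os  # used in later cells to build output paths

import pandas as pd  # used in later cells to export results
import torch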
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

from src.data import *
from src.train import TrainLEOPARD

print("current pytorch version: ", torch.__version__)         # 1.11.0
print("current pytorch_lightning version: ", pl.__version__)  # 1.6.4

## Step 2: Prepare dataset

This is done by the function `prepare_dataset()`, which does the following:
1. Loads the data and formats them into training/validation/test sets with the function `load_split_data()`.
2. Scales the data with the function `scale_data()`.
3. Creates an instance of the `OmicsDataset` class from the scaled data for each data split.
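
The split-then-scale idea behind these steps can be sketched with NumPy alone. This is not the code of `load_split_data()` or `scale_data()`; `split_and_scale` is a hypothetical helper that assumes rows are samples, that the scaler is fitted on the training split only, and that missing values are stored as NaN.

```python
import numpy as np


def split_and_scale(data, val_ratio=0.2, seed=1):
    """Split rows into train/validation sets, then standardize with training statistics."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(data.shape[0])           # shuffled row indices
    n_val = int(round(val_ratio * data.shape[0]))    # size of the validation split
    val_idx, train_idx = order[:n_val], order[n_val:]

    train, val = data[train_idx], data[val_idx]
    # NaN-aware statistics, so missing values do not affect the scaler.
    mean = np.nanmean(train, axis=0)
    std = np.nanstd(train, axis=0)
    std[std == 0] = 1.0                              # avoid division by zero for constant variables
    return (train - mean) / std, (val - mean) / std, (mean, std)


toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
train_scaled, val_scaled, (mean, std) = split_and_scale(toy, val_ratio=0.2, seed=1)
print(train_scaled.shape, val_scaled.shape)  # (4, 2) (1, 2)
```

Keeping `mean` and `std` around matters because imputed values have to be transformed back to the original scale with the same statistics.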

The function `prepare_dataset()` receives arguments for the following parameters:
- `data_dir`: a string specifying the folder that contains the data of complete views (vA) and incomplete views (vB) at multiple timepoints (1, 2, ...). The input data are saved in .csv files with names like "vA_train.csv" and "vB_train.csv", where "v" and "t" denote views and timepoints. The incomplete views in vB will be imputed. "v\*_t\*_train.csv" files are used for model training and validation. Missing values can be encoded as NA. Even if the data of view B at a specific timepoint are completely missing, the file must still contain the corresponding sample IDs and variable IDs, with the missing values indicated by NA. "v\*_test.csv" files are used only for performance evaluation; the model never sees these data during training. If you only want to complete missing views in your dataset, you can save the same data into "v\*_train.csv" and "v\*_test.csv" to obtain the imputed result. Default: `"data\MGH_COVID"`
- `missTimepoint`: a string to specify which timepoint is missing and needs to be completed. Default: `"D3"`
- `valSet_ratio`: a numeric value between 0 and 1 to specify the ratio of data from "v\*_t\*_train.csv" used for constructing the validation set. Default: `0.2`
- `trainNum`: a numeric value or `"all"` indicating how many samples will be randomly selected from the training data for training. Default: `"all"`
- `obsNum`: a numeric value or `"all"` indicating how many samples from "vB_t2_train.csv" will be used for training. Default: `0` 
- `use_scaler`: a string indicating which scaler is used to scale the data. Supported values are `"standard"`, `"robust"`, and `"minmax"`. For a description of the scalers, please refer to the User Guide of `sklearn.preprocessing`. Default: `"standard"`
- `save_data_dir`: `None` or a path used for saving the indices of samples used in training. Default: `None`
- `set_seed`: a numeric value to set seed for reproducible results. Default: `1`
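
For a view that is completely missing at a timepoint, a placeholder file that still carries the sample and variable IDs could be written as sketched below. The layout is an assumption, not something this notebook specifies: samples as rows, variables as columns, and a first column named `sample_id`. The IDs and the file name are hypothetical too, so check them against the files in `data/MGH_COVID` before use.

```python
import csv
import os
import tempfile


def write_na_placeholder(path, sample_ids, variable_ids):
    """Write a CSV with one row per sample and every value encoded as NA."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_id"] + list(variable_ids))  # header row with variable IDs
        for sample_id in sample_ids:
            writer.writerow([sample_id] + ["NA"] * len(variable_ids))


# Hypothetical IDs and file name, written to a temporary folder.
out_dir = tempfile.mkdtemp()
placeholder_path = os.path.join(out_dir, "vB_placeholder.csv")
write_na_placeholder(placeholder_path, ["S1", "S2", "S3"], ["prot_1", "prot_2"])

with open(placeholder_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(rows[0])  # ['sample_id', 'prot_1', 'prot_2']
print(rows[1])  # ['S1', 'NA', 'NA']
```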

In [None]:
obsNum = 0
trainNum = "all"

loaded_data = prepare_dataset(data_dir="data/MGH_COVID", missTimepoint="D3",
                              valSet_ratio=0.2, trainNum=trainNum, obsNum=obsNum, 
                              use_scaler="standard", save_data_dir=None, set_seed=1)

## Step 3: Create an instance of the `TrainLEOPARD` class

The PyTorch code of LEOPARD is organized into `TrainLEOPARD`, a `LightningModule`. LEOPARD is fully customizable: you can adapt it with the following parameters when creating an instance of the `TrainLEOPARD` class:

#### arguments for input data:
- `loaded_data`: output of `prepare_dataset()`, including prepared data splits and scalers
- `pre_layers_viewA`, `pre_layers_viewB`: pre-layers for view A and view B to convert them to the same dimension. Default: `[64]`, `[64]`
- `post_layers_viewA`, `post_layers_viewB`: post-layers for view A and view B to convert embeddings back to data in the original dimension. Default: `[64]`, `[64]`

#### arguments for the content encoder:
- `encoder_content_layers`: layers for the content encoder. A list where the length indicates the total number of layers, and each element specifies the size of the corresponding layer. Default: `[64, 64, 64]`
- `encoder_content_norm`: a list indicating which normalization to use for each layer in the content encoder. Supported values are `"instance"`, `"batch"`, and `"none"`. Default: `["instance", "instance", "instance"]`
- `encoder_content_dropout`: a list specifying the dropout rate for each layer in the content encoder. Default: `[0, 0, 0]`
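
Because the layer, normalization, and dropout settings are given per layer, the three lists should have the same length. Whether `TrainLEOPARD` checks this itself is not stated here, so a small check before instantiation may save a confusing error later. `check_layer_config` is a hypothetical helper, not part of LEOPARD.

```python
def check_layer_config(layers, norm, dropout, allowed_norm=("instance", "batch", "none")):
    """Raise ValueError if the per-layer lists are inconsistent; return True otherwise."""
    if not (len(layers) == len(norm) == len(dropout)):
        raise ValueError(
            f"length mismatch: layers={len(layers)}, norm={len(norm)}, dropout={len(dropout)}"
        )
    for i, (size, kind, rate) in enumerate(zip(layers, norm, dropout)):
        if size <= 0:
            raise ValueError(f"layer {i}: size must be positive, got {size}")
        if kind not in allowed_norm:
            raise ValueError(f"layer {i}: unsupported normalization {kind!r}")
        if not 0 <= rate < 1:
            raise ValueError(f"layer {i}: dropout rate must be in [0, 1), got {rate}")
    return True


# The default content-encoder settings pass the check.
print(check_layer_config([64, 64, 64], ["instance", "instance", "instance"], [0, 0, 0]))  # True
```

The same check applies to the temporal encoder and discriminator settings, which follow the same list convention.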

#### arguments for the temporal encoder:
- `encoder_temporal_layers`: layers for the temporal encoder. Default: `[64, 64, 64]`
- `encoder_temporal_norm`: normalization to use for each layer in the temporal encoder. Supported values are `"instance"`, `"batch"`, and `"none"`. Default: `["none", "none", "none"]`
- `encoder_temporal_dropout`: dropout rate for each layer in the temporal encoder. Default: `[0, 0, 0]`

#### arguments for the projection head:
- `use_projection_head`: whether to use a projection head for contrastive learning. Default: `False`
- `projection_output_size`: output size of the projection head. Ignored if `use_projection_head=False`. Default: `0`

#### arguments for the generator:
- `generator_block_num`: number of layers/blocks used in the generator. Default: `3`
- `generator_norm`: normalization to use for each layer in the generator. Supported values are `"instance"`, `"batch"`, and `"none"`. Default: `["none", "none", "none"]`
- `generator_dropout`: dropout rate for each layer in the generator. Default: `[0, 0, 0]`
- `merge_mode`: how to re-entangle content and temporal representations, either by concatenation (`"concat"`) or by AdaIN (`"adain"`). Default: `"adain"`
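
To illustrate the two merge modes, the sketch below shows generic adaptive instance normalization (AdaIN) next to plain concatenation, using NumPy. It is not LEOPARD's generator code. In particular, how the scale (`gamma`) and shift (`beta`) are derived from the temporal representation is not described in this notebook, so they are passed in directly as assumed inputs.

```python
import numpy as np


def adain(content, gamma, beta, eps=1e-5):
    """Normalize each sample over its features, then apply an external scale and shift."""
    mean = content.mean(axis=1, keepdims=True)
    std = content.std(axis=1, keepdims=True)
    normalized = (content - mean) / (std + eps)
    return gamma * normalized + beta


def concat_merge(content, temporal):
    """Concatenate content and temporal representations along the feature axis."""
    return np.concatenate([content, temporal], axis=1)


rng = np.random.default_rng(0)
content = rng.normal(size=(4, 64))   # 4 samples, 64-dimensional content representation
temporal = rng.normal(size=(4, 64))  # matching temporal representation
gamma = np.full((4, 1), 2.0)         # assumed to come from the temporal representation
beta = np.full((4, 1), 0.5)

merged = adain(content, gamma, beta)
print(merged.shape)                           # (4, 64)
print(concat_merge(content, temporal).shape)  # (4, 128)
```

AdaIN keeps the feature dimension unchanged, whereas concatenation doubles it, which affects the input size of the layers that follow.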

#### arguments for the discriminator:
- `discriminator_layers`: layers for the multi-task discriminator. Default: `[128, 128]`
- `discriminator_norm`: normalization to use for each layer in the discriminator. Supported values are `"instance"`, `"batch"`, and `"none"`. Default: `["none", "none"]`
- `discriminator_dropout`: dropout rate for each layer in the discriminator. Default: `[0, 0]`

#### arguments for configuring losses:
- `reconstruction_loss`: loss used to compute the reconstruction loss, either `"MSE"` or `"MAE"`. Default: `"MSE"`
- `adversarial_loss`: loss used to compute the adversarial loss, either `"MSE"` or `"BCE"`. Default: `"MSE"`
- `weight_reconstruction`, `weight_adversarial`, `weight_representation`, `weight_contrastive`: weights for different losses. Default: `1`, `1`, `0.1`, `0.1`
- `contrastive_temperature`: temperature for NT-Xent-based contrastive loss. Default: `0.05`
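
To show what `contrastive_temperature` controls, here is a NumPy sketch of the generic NT-Xent loss (normalized temperature-scaled cross entropy). It is not taken from the LEOPARD source. It assumes that row `i` of `z_a` and row `i` of `z_b` form a positive pair and that all other rows in the batch are negatives; how LEOPARD forms its pairs is not described in this notebook.

```python
import numpy as np


def nt_xent(z_a, z_b, temperature=0.05):
    """NT-Xent loss for two batches of embeddings with row-wise positive pairs."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit length, so dot product = cosine
    sim = z @ z.T / temperature                       # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # a sample is never its own candidate

    # Index of the positive partner for each of the 2n rows.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos_idx].mean()


rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
print(nt_xent(a, a) < nt_xent(a, -a))  # True: aligned pairs give a lower loss
```

A lower temperature sharpens the softmax, so hard negatives contribute more strongly to the loss.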

#### arguments for optimization:
- `lr_G`, `lr_D`: learning rates for the generator process (encoders and generator) and the discrimination process (discriminator). ***You need to tune these for your own datasets.*** Default: `0.0005`, `0.0005`
- `b1_G`, `b1_D`: beta_1 for the Adam optimizer. Default: `0.9`, `0.9`
- `lr_scheduler_G`, `lr_scheduler_D`: learning rate schedulers, one of `"none"`, `"LambdaLR"`, or `"SGDR"`. Default: `"none"`, `"none"`
- `batch_size`: batch size. ***You need to adjust this based on your sample size.*** Default: `32`
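
For intuition about the `"SGDR"` option, the sketch below computes a cosine-annealed learning rate with warm restarts in plain Python. The cycle length and minimum learning rate used by LEOPARD are not given in this notebook, so `cycle_len` and `min_lr` are hypothetical values, and the fixed cycle length is a simplification.

```python
import math


def sgdr_lr(epoch, base_lr=0.0005, min_lr=0.0, cycle_len=10):
    """Cosine annealing with warm restarts, using a fixed cycle length in epochs."""
    t_cur = epoch % cycle_len  # position inside the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / cycle_len))


for epoch in (0, 5, 9, 10):
    print(epoch, sgdr_lr(epoch))
```

The rate starts at `base_lr`, decays along a cosine curve within each cycle, and jumps back to `base_lr` at the start of the next one (epoch 10 in this example).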

#### additional arguments:
- `save_embedding_dir`: `None` or a path used for saving the content and temporal embeddings. 
- `save_embedding_every_n_epoch`: an integer indicating how often (in epochs) to save embeddings. Ignored when `save_embedding_dir=None`.
- `note`: a text label stored as a hyperparameter to identify each run. Default: `"obsNum_" + str(obsNum)`

*Some hyperparameters (especially `lr_G`, `lr_D`, and `batch_size`) may need to be tuned for your own datasets.*

In [None]:
my_leopard = TrainLEOPARD(loaded_data=loaded_data,
                          pre_layers_viewA=[64], pre_layers_viewB=[64],
                          post_layers_viewA=[64], post_layers_viewB=[64],
                          
                          encoder_content_layers=[64, 64, 64],
                          encoder_content_norm=['instance', 'instance', 'instance'],
                          encoder_content_dropout=[0, 0, 0],
                          
                          encoder_temporal_layers=[64, 64, 64],
                          encoder_temporal_norm=['none', 'none', 'none'],
                          encoder_temporal_dropout=[0, 0, 0],
                          
                          use_projection_head=False, 
                          projection_output_size=0,
                          
                          generator_block_num=3,
                          generator_norm=['none', 'none', 'none'],
                          generator_dropout=[0, 0, 0],
                          merge_mode='adain',
                          
                          discriminator_layers=[64, 64],
                          discriminator_norm=['none', 'none'],
                          discriminator_dropout=[0, 0],
                          
                          reconstruction_loss='MSE', adversarial_loss='MSE',
                          weight_reconstruction=1, weight_adversarial=1,
                          weight_representation=0.1,
                          weight_contrastive=0.1,
                          contrastive_temperature=0.05,
                          
                          lr_G=0.0005, lr_D=0.0005, b1_G=0.9, b1_D=0.9,
                          lr_scheduler_G='none', lr_scheduler_D='none',
                          batch_size=32, 
                           
                          save_embedding_dir=os.path.join("lightning_logs", "trainNum_" + str(trainNum), 
                                                          "obsNum_" + str(obsNum),
                                                          "disentangled_embeddings"),
                          save_embedding_every_n_epoch=10,
                          note="obsNum_" + str(obsNum))

## Step 4: Create an instance of the `Trainer` class

This is done by calling `Trainer()` from `pytorch_lightning`, which handles the training loop for your LEOPARD. Here we use the following settings (please refer to the PyTorch Lightning documentation for a comprehensive explanation of the parameters):
- `enable_progress_bar`: whether to show the progress bar. Default: `True`
- `log_every_n_steps`: a numeric value that specifies the interval, in steps, at which metrics should be logged. Default: `3`
- `max_epochs`: a numeric value that defines the maximum number of epochs the training loop should run. Default: `100`
- `gpus`: a value indicating which GPUs to use. Default: `1 if torch.cuda.is_available() else None`
- `logger`: a tensorboard logger responsible for logging training/validation metrics and other experiment details.
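
When choosing `log_every_n_steps`, it helps to know how many optimization steps one epoch contains, because a logging interval larger than that yields very sparse logs. The sketch below is plain arithmetic; the sample size of 200 is a hypothetical value, and `steps_per_epoch` is a helper introduced here for illustration.

```python
import math


def steps_per_epoch(n_train_samples, batch_size, drop_last=False):
    """Number of batches in one epoch for a given sample size and batch size."""
    if drop_last:
        return n_train_samples // batch_size  # the incomplete final batch is skipped
    return math.ceil(n_train_samples / batch_size)


n_steps = steps_per_epoch(n_train_samples=200, batch_size=32)
print(n_steps)       # 7
print(n_steps >= 3)  # True: with log_every_n_steps=3, metrics are logged within every epoch
```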

In [None]:
save_dir = os.path.join("lightning_logs", "trainNum_" + str(trainNum))
name = "obsNum_" + str(obsNum)

trainer = pl.Trainer(
    enable_progress_bar=False,
    log_every_n_steps=3,
    max_epochs=199,
    gpus=1 if torch.cuda.is_available() else None,
    logger=TensorBoardLogger(save_dir=save_dir, name=name)
)

## Step 5: Train your LEOPARD

Now let's train your LEOPARD!

Optional: you can also visualize the training process with the logger. Use the `%tensorboard` magic command or run TensorBoard from the command line: `tensorboard --logdir <save_dir> --port 8080`
(*use your own `save_dir` and port number*)

In tensorboard, you can monitor the losses computed on the training set and validation set (if you have one), which can help mitigate the risk of overfitting. For example, you might want to stop the training process if the reconstruction loss starts to increase.
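
The stopping rule described above can be written down as a simple patience criterion. This is a sketch in plain Python, with `should_stop`, `patience`, and `min_delta` as hypothetical names. PyTorch Lightning also ships an `EarlyStopping` callback, but the name of the metric logged by `TrainLEOPARD` is not stated here, so the sketch works on a plain list of loss values instead.

```python
def should_stop(val_losses, patience=5, min_delta=0.0):
    """Return True if the last `patience` epochs did not improve on the earlier best loss."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta


history = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]    # made-up reconstruction losses per epoch
print(should_stop(history, patience=3))       # True: no improvement over 0.7 in the last 3 epochs
print(should_stop(history[:4], patience=3))   # False: the loss was still improving
```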

In [None]:
# optional: invoke TensorBoard with the %tensorboard magic command.

%load_ext tensorboard
%tensorboard --logdir lightning_logs

In [None]:
# train a LEOPARD

trainer.fit(my_leopard)

## Step 6: Impute and export data

You can impute the missing data and write them to a .csv file. If you have ground truth for "vB_t2_test.csv", you can also export the percent bias of the imputed data.

In [None]:
# impute data
imputed_data = trainer.predict(my_leopard, my_leopard.test_dataloader())[0]

In [None]:
imputed_data

In [None]:
# create folder for output
output_dir = os.path.join(save_dir, name, "version_" + str(trainer.logger.version), "results") 

# create the folder (and any missing parent folders) if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

# export imputed data
pd.DataFrame(imputed_data["generated_data"]).to_csv(
            os.path.join(output_dir, "imputedData_obs" + str(obsNum) + ".csv"), index=False
)

In [None]:
# export percent bias (only when groundtruth is available)
pd.DataFrame(imputed_data["raw_percentBias"]).to_csv(
            os.path.join(output_dir, "PB_obs" + str(obsNum) + ".csv"), index=False
)
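
For readers who want to check the exported values, percent bias can be computed by hand. The sketch below uses one common definition, the absolute difference between imputed and observed values relative to the observed value, in percent. This definition is an assumption: the exact computation behind `raw_percentBias` is not shown in this notebook and may differ, for example in how zeros or signs are handled.

```python
import numpy as np


def percent_bias(imputed, observed):
    """Element-wise absolute percent bias; observed values must be non-zero."""
    imputed = np.asarray(imputed, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.abs(imputed - observed) / np.abs(observed) * 100.0


# Made-up values: both imputed entries are about 10% off.
pb = percent_bias([1.1, 1.8], [1.0, 2.0])
print(np.round(pb, 2))
```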

## End
This manual is based on our analysis and has been tested on our benchmark datasets.
Please let us know if you find any issues.