# SAMVAE - Experiments Configuration Guide

This model allows flexible configuration for training survival models using clinical, omic, and WSI (Whole Slide Image) data from Breast Cancer (BRCA) and Lower Grade Glioma (LGG) cohorts.

Specify the modalities you want to use in a JSON configuration file (`datasets.json`) with the following keys:

- clinical_dataset
- omic_dataset
- wsi_dataset
- time_event (mandatory)

Each key should be assigned either:
- a list of dataset names (to include them), or
- null if the modality is not used.
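
The rules above can be checked before launching an experiment. The following is a minimal sketch, not part of the SAMVAE code base: `validate_entry` is a hypothetical helper, and it assumes that `datasets.json` holds a list of configuration entries (as in the examples in this guide) and that an empty list is not a valid value.

```python
import json

# Keys described in this guide; time_event is the only mandatory one.
MODALITY_KEYS = ("clinical_dataset", "omic_dataset", "wsi_dataset")
REQUIRED_KEY = "time_event"


def validate_entry(entry):
    """Return a list of problems found in one configuration entry (hypothetical helper)."""
    problems = []
    for key in MODALITY_KEYS + (REQUIRED_KEY,):
        if key not in entry:
            problems.append(f"missing key: {key}")
            continue
        value = entry[key]
        if value is None:
            # null disables a modality, but time_event can never be disabled.
            if key == REQUIRED_KEY:
                problems.append("time_event is mandatory and cannot be null")
            continue
        # Assumption: a used modality is a non-empty list of dataset-name strings.
        is_name_list = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(name, str) for name in value)
        )
        if not is_name_list:
            problems.append(f"{key} must be a non-empty list of dataset names or null")
    return problems


# Inline stand-in for the contents of datasets.json.
example = json.loads("""
[
    {
        "clinical_dataset": ["brca_clinical"],
        "omic_dataset": null,
        "wsi_dataset": ["brca_patches"],
        "time_event": ["brca_time_event"]
    }
]
""")

for entry in example:
    print(validate_entry(entry))  # an empty list means no problems were found
```

To check a real file, the inline string can be replaced with `json.load(open("datasets.json"))`.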

The provided `combinations.json` file lists valid combinations of modalities and datasets. Use it as a reference or as a starting point for your own configurations. The table below shows the available datasets:

| **Modality**                    | **BRCA**                                                          | **LGG**                                                       |
| ------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------- |
| **Clinical (Survival)**         | `brca_clinical`                                                   | `lgg_clinical`                                                |
| **Clinical (Competing Risks)**  | `brca_clinical_cr`                                                | `lgg_clinical_cr`                                             |
| **DNA Methylation**             | `brca_adn`                                                        | `lgg_adn`                                                     |
| **microRNA**                    | `brca_miRNA`                                                      | `lgg_miRNA`                                                   |
| **RNAseq**                      | `brca_RNAseq`                                                     | `lgg_RNAseq`                                                  |
| **Copy Number Variation (CNV)** | `brca_cnv`                                                        | `lgg_cnv`                                                     |
| **WSI (Whole Slide Images)**    | `brca_wsi`, `brca_CLAM_mask`, `brca_CLAM_heatmap`, `brca_patches` | `lgg_wsi`, `lgg_CLAM_mask`, `lgg_CLAM_heatmap`, `lgg_patches` |
| **Time/Event (Survival)**       | `brca_time_event`                                                 | `lgg_time_event`                                              |
| **Time/Event (Competing Risks)**| `brca_time_event_cr`                                              | `lgg_time_event_cr`                                           |

## Notes
- You can combine one or more modalities depending on your experiment.
- Hyperparameter optimization can be run either in unimodal mode or by combining another modality with clinical data. In our case, we ran the hyperparameter searches for the omic and WSI modalities in the presence of clinical data. Once the optimal hyperparameters are identified, they can be fixed and reused to train the full multimodal model, so that the final model benefits from a well-optimized configuration across modalities.
- `time_event` must always be specified and should match the cohort (e.g., `brca_time_event` or `lgg_time_event`).
- If using competing risks models, select the `*_clinical_cr` and `*_time_event_cr` datasets.
- For the WSI modality, we used the patch-based representation to enable experiments with multiple image instances per patient, which better captures intra-patient variability. Other image formats, such as CLAM masks and heatmaps, were used in intermediate tests but were not the primary focus of our experiments.
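
The cohort and competing-risks notes can be expressed as a small consistency check. This is only a sketch under stated assumptions: `cohort_of` and `check_consistency` are hypothetical helpers, the cohort is taken to be the prefix before the first underscore (following the naming in the table above), and a single configuration entry is assumed to use only one cohort.

```python
KEYS = ("clinical_dataset", "omic_dataset", "wsi_dataset", "time_event")


def cohort_of(name):
    """Cohort prefix of a dataset name, e.g. 'brca' for 'brca_time_event_cr'."""
    return name.split("_", 1)[0].lower()


def check_consistency(entry):
    """Return a list of consistency problems for one configuration entry."""
    problems = []

    # Collect every dataset name, treating null (None) as "modality not used".
    all_names = [name for key in KEYS for name in (entry.get(key) or [])]

    # Assumption: all datasets in one entry belong to the same cohort.
    cohorts = sorted({cohort_of(name) for name in all_names})
    if len(cohorts) > 1:
        problems.append("datasets from several cohorts: " + ", ".join(cohorts))

    # Competing-risks runs pair *_clinical_cr with *_time_event_cr,
    # so _cr and non-_cr names should not appear together.
    clinical = entry.get("clinical_dataset") or []
    time_event = entry.get("time_event") or []
    cr_flags = {name.endswith("_cr") for name in clinical + time_event}
    if len(cr_flags) > 1:
        problems.append("competing-risks (_cr) and standard survival datasets are mixed")

    return problems


mixed = {
    "clinical_dataset": ["lgg_clinical"],
    "omic_dataset": None,
    "wsi_dataset": None,
    "time_event": ["lgg_time_event_cr"],
}
print(check_consistency(mixed))
```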

Example 1: training on BRCA clinical + WSI data (survival analysis).

```json
[
    {
        "clinical_dataset": ["brca_clinical"],
        "omic_dataset": null,
        "wsi_dataset": ["brca_patches"],
        "time_event": ["brca_time_event"]
    }
]
```

Example 2: training on LGG clinical + microRNA + RNAseq + WSI data (competing risks).

```json
[
    {
        "clinical_dataset": ["lgg_clinical_cr"],
        "omic_dataset": ["lgg_miRNA", "lgg_RNAseq"],
        "wsi_dataset": ["lgg_patches"],
        "time_event": ["lgg_time_event_cr"]
    }
]
```

In the `utils.py` file, under the `# Training and testing configurations` section, you can adjust several key parameters depending on your experimental setup:

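```python
args['train'] = True
args['eval'] = True
args['hyperparameter_optimization'] = True
args['optimizer_with_clinical'] = True
args['batch_size'] = 512  # use a lower value (e.g., 50) with WSI data
args['n_threads'] = 1     # keep low when using large images
args['n_seeds'] = 10      # number of random seeds
args['N_wsi'] = 15        # number of images per patient
```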

- Set `args['train']`, `args['eval']`, `args['hyperparameter_optimization']`, and `args['optimizer_with_clinical']` to `True` or `False` depending on whether you want to enable training, evaluation, or hyperparameter optimization.
- The default `batch_size` is 512, except when using image data (e.g., WSI), in which case a lower batch size (e.g., 50) is used due to computational limitations. If you are using multiple images per patient, reduce the batch size further to avoid GPU memory issues.
- Similarly, keep `n_threads` low (e.g., 1) when using large images, again because of memory constraints on our GPU.
- The `n_seeds` parameter controls the number of random seeds used for reproducibility. We use 5 seeds for regular experiments and 10 seeds for final model combinations.
- The `N_wsi` parameter sets the number of images per patient used during training. We have tested the model with 1, 5, 10, and 15 WSI patches to evaluate the impact of this hyperparameter.
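
The guidance above can be summarized as a small helper that derives these settings from a configuration entry. This is an illustrative sketch only: `suggest_settings` and its `final_run` flag are hypothetical and do not exist in `utils.py`; the numeric values are the ones quoted in this guide.

```python
def suggest_settings(entry, final_run=False, n_wsi=1):
    """Suggest values for the args described above (hypothetical helper).

    entry     -- one configuration entry from datasets.json
    final_run -- True for final model combinations, False for regular experiments
    n_wsi     -- number of images per patient (this guide reports tests with 1, 5, 10, 15)
    """
    uses_wsi = bool(entry.get("wsi_dataset"))
    return {
        # 512 by default, lowered to 50 when image data is used.
        # With several images per patient, an even smaller value may be needed.
        "batch_size": 50 if uses_wsi else 512,
        # Value shown in this guide; recommended to stay low with large images.
        "n_threads": 1,
        # 5 seeds for regular experiments, 10 for final model combinations.
        "n_seeds": 10 if final_run else 5,
        "N_wsi": n_wsi,
    }


entry = {
    "clinical_dataset": ["brca_clinical"],
    "omic_dataset": None,
    "wsi_dataset": ["brca_patches"],
    "time_event": ["brca_time_event"],
}
settings = suggest_settings(entry, final_run=True, n_wsi=15)
print(settings)
```

The returned dictionary could then be copied into `args`, for example with `args.update(settings)`, assuming `args` is a plain dictionary as the snippet in this guide suggests.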

