# Training scDiffusion-X

Clone the scDiffusion-X repository to your local machine:
```
git clone https://github.com/EperLuo/scDiffusion-X.git
cd scDiffusion-X
```
Set up the conda environment (see the Installation section).

**Data preparation**

The training data for scDiffusion-X should contain two modalities and be saved in the muon ```h5mu``` format. For scRNA-seq data, the raw count expression matrix should be stored in ```mdata['rna'].X```. For scATAC-seq data, the binarized chromatin accessibility matrix should be stored in ```mdata['atac'].X```. Meta-information such as cell type can be placed in ```mdata['rna'].obs['cell_type']```. See the example data in xxx for details.
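
The layout described above can be sketched in Python as follows. This is a minimal illustration, not part of the scDiffusion-X code base: it assumes the `numpy`, `pandas`, `anndata`, and `mudata` packages are installed and that the matrices are dense arrays. The helper names `looks_like_raw_counts`, `looks_binary`, and `build_mdata` are hypothetical.

```python
import numpy as np


def looks_like_raw_counts(x):
    """Return True if x holds only non-negative integer values (raw counts)."""
    x = np.asarray(x)
    return bool(np.all(x >= 0) and np.all(x == np.round(x)))


def looks_binary(x):
    """Return True if every entry of x is 0 or 1 (binarized accessibility)."""
    x = np.asarray(x)
    return bool(np.all((x == 0) | (x == 1)))


def build_mdata(rna_counts, atac_binary, cell_types):
    """Assemble a MuData object with 'rna' and 'atac' modalities.

    Hypothetical helper: the scDiffusion-X scripts only need the resulting
    h5mu file, not this function.
    """
    # Imported lazily so the two checks above work without these packages.
    import anndata as ad
    import mudata as md
    import pandas as pd

    if not looks_like_raw_counts(rna_counts):
        raise ValueError("RNA matrix should contain raw counts")
    if not looks_binary(atac_binary):
        raise ValueError("ATAC matrix should be binary")

    # One row per cell; both modalities share the same cell names.
    names = [f"cell_{i}" for i in range(len(cell_types))]
    obs = pd.DataFrame({"cell_type": list(cell_types)}, index=names)

    rna = ad.AnnData(X=np.asarray(rna_counts, dtype=np.float32), obs=obs.copy())
    atac = ad.AnnData(X=np.asarray(atac_binary, dtype=np.float32), obs=obs.copy())
    return md.MuData({"rna": rna, "atac": atac})
```

A call such as `build_mdata(rna_counts, atac_binary, labels).write("my_dataset.h5mu")` would then write the file, where `my_dataset.h5mu` is a placeholder name.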


**Train the Autoencoder**

After organizing the data, you can start training the multimodal autoencoder:
```
cd script/training_autoencoder
bash train_autoencoder_multimodal.sbatch
```
Adjust the data path to your local path. The dataset config files are in `script/training_autoencoder/configs/dataset`; see the comments in `openproblem.yaml` for details. The autoencoder config files are in `script/training_autoencoder/configs/encoder`; see the comments in `encoder_multimodal.yaml` for details. Checkpoints are saved in `script/training_autoencoder/outputs/checkpoints` and log files in `script/training_autoencoder/outputs/logs`.

The autoencoder comes in three sizes: `encoder_multimodal_small`, `encoder_multimodal`, and `encoder_multimodal_large`. We recommend `encoder_multimodal` (corresponding to `encoder_multimodal.yaml`) for most datasets. If the data has more than 50,000 genes and 200,000 peaks, we recommend the larger autoencoder in `encoder_multimodal_large.yaml`. If the data has fewer than 5,000 genes and 15,000 peaks, we recommend the smaller autoencoder in `encoder_multimodal_small.yaml`. The `norm_type` field in the encoder config yaml controls the normalization type. For data generation tasks we recommend `batch_norm`, and for translation tasks we recommend `layer_norm`, since it generalizes better to out-of-distribution (OOD) data.
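
The recommendations above can be written down as a small lookup. This is only a sketch of the rule of thumb: the function names are hypothetical, and reading "genes and peaks" as requiring both thresholds to be crossed at once is an assumption, so a dataset that crosses only one of them falls back to the default config here.

```python
def suggest_encoder_config(n_genes, n_peaks):
    """Suggest an encoder config file from the feature counts.

    Assumption: both the gene and the peak threshold must be crossed
    before switching away from the default config.
    """
    if n_genes > 50_000 and n_peaks > 200_000:
        return "encoder_multimodal_large.yaml"
    if n_genes < 5_000 and n_peaks < 15_000:
        return "encoder_multimodal_small.yaml"
    return "encoder_multimodal.yaml"


def suggest_norm_type(task):
    """Suggest a `norm_type` value for a task ('generation' or 'translation')."""
    recommended = {"generation": "batch_norm", "translation": "layer_norm"}
    if task not in recommended:
        raise ValueError(f"unknown task: {task!r}")
    return recommended[task]


print(suggest_encoder_config(20_000, 100_000))  # encoder_multimodal.yaml
print(suggest_norm_type("translation"))         # layer_norm
```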


**Train the diffusion backbone**

```
cd script/training_diffusion
sh ssh_scripts/multimodal_train.sh
```
Again, adjust the data path and output path to your own, and change `ae_path` and `encoder_config` in the sh file to point to the autoencoder you trained in the autoencoder training step above. `rna_dim` and `atac_dim` are the dimensions of the latent representations; change them to match the autoencoder you used (refer to the encoder's config file). When training with a condition (such as cell type), set `num_class` to the number of unique labels. Training is unconditional when `num_class` is not set.
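
To find the value for `num_class`, count the unique labels in the condition column. The sketch below works on a plain list of labels, for example the values of `mdata['rna'].obs['cell_type']`. The function names are hypothetical, and the sorted label-to-index mapping is an assumption about how labels might be encoded, not a statement about the training script.

```python
def count_classes(labels):
    """Number of unique condition labels, i.e. a candidate `num_class`."""
    return len(set(labels))


def encode_labels(labels):
    """Map each label to an integer index.

    Assumption: indices follow the sorted order of the unique labels.
    Returns the encoded labels and the ordered list of classes.
    """
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels], classes


labels = ["T cell", "B cell", "T cell", "Monocyte"]
print(count_classes(labels))   # 3
print(encode_labels(labels))   # ([2, 0, 2, 1], ['B cell', 'Monocyte', 'T cell'])
```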

Also, set the `devices` and `NUM_GPUS` parameters to match your hardware. The total batch size is the number of GPUs multiplied by the per-GPU `batch_size`.
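
As a quick arithmetic check of the batch size relation, the sketch below computes the total batch size and, going the other way, the per-GPU value needed to hit a target total. Both function names are hypothetical helpers for illustration.

```python
def total_batch_size(num_gpus, batch_size):
    """Total batch size across all GPUs."""
    return num_gpus * batch_size


def per_gpu_batch_size(target_total, num_gpus):
    """Per-GPU batch size needed to reach a target total batch size."""
    if target_total % num_gpus != 0:
        raise ValueError("target total must be divisible by the number of GPUs")
    return target_total // num_gpus


print(total_batch_size(4, 128))     # 512
print(per_gpu_batch_size(512, 2))   # 256
```

When moving a run to a different number of GPUs, adjusting `batch_size` this way keeps the total batch size unchanged.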


**Pretrained model**

We provide a model pretrained on the miniatlas dataset (Wu J, et al. EpiFoundation: A Foundation Model for Single-Cell ATAC-seq via Peak-to-Gene Alignment). This dataset contains more than 130,000 scATAC-seq profiles with paired scRNA-seq, across 57 cell types. The pretrained model weights can be found at: https://cloud.tsinghua.edu.cn/d/144fde6ca47b44d580cd/.

The complete list of cell types:
['Acinar cell', 'Adipocyte', 'Alpha cell', 'Amacrine cell',
       'Astrocyte', 'B cell', 'Beta cell', 'Bipolar cell', 'CD4 T',
       'Capillary EC', 'Cardiomyocyte', 'Colonocyte', 'Cone cell',
       'Delta cell', 'Endocardial cell', 'Endocrine cell',
       'Endothelial cell', 'Enterocyte', 'Epithelial cell',
       'Erythroblast', 'Excitatory neuron', 'Fibroblast', 'Fibroblasts',
       'Glia', 'Goblet cell', 'Hepatocyte', 'Horizontal cell',
       'Inhibitory neuron', 'Leyding cell', 'Luminal cell', 'Macrophage',
       'Mast cell', 'Mesothelial cell', 'Microfold cell', 'Microglia',
       'Monocyte', 'Myofibroblast', 'Neuron', 'Nk cell',
       'Oligodendrocyte', 'Oligodendrocyte progenitor cell', 'PP cell',
       'Paneth cell', 'Pericyte', 'Plasma cell', 'Podocyte',
       'Proerythroblast', 'Renal epithelial cell - Loop of Henle',
       'Renal epithelial cell - distal tubules',
       'Renal epithelial cell - proximal tubules', 'Rod cell',
       'Smooth muscle cell', 'T cell', 'TUBA1A ductal cell', 'Tuft cell',
       'Type A intercalated cell', 'Type B intercalated cell']
When generating a new dataset, the cell type indices follow the order of the list above, e.g. 0 for Acinar cell and 1 for Adipocyte.
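
The name-to-index lookup can be sketched as below. The list is copied from above in the same order; the helper `cell_type_to_index` is hypothetical. Names must match the list exactly, including spellings such as 'Leyding cell' and 'Nk cell', and 'Fibroblast' and 'Fibroblasts' are separate entries.

```python
# Cell types of the pretrained miniatlas model, in the order listed above.
CELL_TYPES = [
    'Acinar cell', 'Adipocyte', 'Alpha cell', 'Amacrine cell',
    'Astrocyte', 'B cell', 'Beta cell', 'Bipolar cell', 'CD4 T',
    'Capillary EC', 'Cardiomyocyte', 'Colonocyte', 'Cone cell',
    'Delta cell', 'Endocardial cell', 'Endocrine cell',
    'Endothelial cell', 'Enterocyte', 'Epithelial cell',
    'Erythroblast', 'Excitatory neuron', 'Fibroblast', 'Fibroblasts',
    'Glia', 'Goblet cell', 'Hepatocyte', 'Horizontal cell',
    'Inhibitory neuron', 'Leyding cell', 'Luminal cell', 'Macrophage',
    'Mast cell', 'Mesothelial cell', 'Microfold cell', 'Microglia',
    'Monocyte', 'Myofibroblast', 'Neuron', 'Nk cell',
    'Oligodendrocyte', 'Oligodendrocyte progenitor cell', 'PP cell',
    'Paneth cell', 'Pericyte', 'Plasma cell', 'Podocyte',
    'Proerythroblast', 'Renal epithelial cell - Loop of Henle',
    'Renal epithelial cell - distal tubules',
    'Renal epithelial cell - proximal tubules', 'Rod cell',
    'Smooth muscle cell', 'T cell', 'TUBA1A ductal cell', 'Tuft cell',
    'Type A intercalated cell', 'Type B intercalated cell',
]


def cell_type_to_index(name):
    """Return the type index for a cell type name (exact match required)."""
    try:
        return CELL_TYPES.index(name)
    except ValueError:
        raise KeyError(f"unknown cell type: {name!r}") from None


print(cell_type_to_index('Acinar cell'))  # 0
print(cell_type_to_index('Adipocyte'))    # 1
```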