## SpaRED Library Quick Start DEMO

This demonstration is a step-by-step guide to using the SpaRED library to train a gene expression prediction model with pretrained weights. This tutorial illustrates how to:

* Load a SpaRED dataset and save it in an adata
* Prepare the data for training a pretrained model for gene expression prediction
* Initialize a pretrained model
* Train the model using PyTorch Lightning

In [2]:
import os
import sys
from pathlib import Path

currentdir = os.getcwd()
parentdir = str(Path(currentdir).parents[2])
sys.path.insert(0, parentdir)

import spared
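
The `parents[2]` index in the cell above walks three directory levels up from the working directory, which is expected to be the repository root when the notebook runs from its tutorial folder. A minimal sketch of that resolution, assuming the working directory is the tutorial path that appears in this notebook's log output (`PurePosixPath` is used so the example does not depend on the local filesystem):

```python
from pathlib import PurePosixPath

# Assumed working directory, taken from the log output of this notebook.
currentdir = PurePosixPath("/media/SSD4/dvegaa/SpaRED/docs/notebooks/tutorials")

# parents[0] is the direct parent, so parents[2] is three levels up.
repo_root = currentdir.parents[2]
print(repo_root)  # /media/SSD4/dvegaa/SpaRED
```

If the notebook is launched from a different depth, the index has to change accordingly; installing the package would make the path manipulation unnecessary.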

### Load Datasets

The `datasets` module has a function, `get_dataset`, that retrieves any desired dataset and returns the filtered and processed adata together with the parameter dictionary. Its *visualize* parameter enables all visualizations when set to True. The function also saves the raw_adata (not processed) in case it is required.

We will begin by loading a dataset with the *visualize* parameter set to False, since no images are required for the functions analyzed in this DEMO.

In [3]:
from spared.datasets import get_dataset

#get dataset
data = get_dataset("vicari_mouse_brain", visualize=False)



Loading vicari_mouse_brain dataset with the following data split:
train data: ['V11L12-038_A1', 'V11L12-038_B1', 'V11L12-038_C1', 'V11L12-038_D1', 'V11L12-109_A1', 'V11L12-109_B1', 'V11L12-109_C1', 'V11L12-109_D1']
val data: ['V11T16-085_A1', 'V11T16-085_B1', 'V11T16-085_C1', 'V11T16-085_D1']
test data: ['V11T17-101_A1', 'V11T17-101_B1']
Parameters already saved in /media/SSD4/dvegaa/SpaRED/spared/processed_data/vicari_data/vicari_mouse_brain/2024-07-08-11-11-47/parameters.json
Loading main adata file from disk (/media/SSD4/dvegaa/SpaRED/docs/notebooks/tutorials/processed_data/vicari_data/vicari_mouse_brain/2024-07-08-11-11-47/adata.h5ad)...
The loaded adata object looks like this:
AnnData object with n_obs × n_vars = 43804 × 128
    obs: 'in_tissue', 'array_row', 'array_col', 'patient', 'slide_id', 'split', 'unique_id', 'n_genes_by_counts', 'total_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'gene_symbol', 'exp_frac', 'glob_exp_frac', 'n_cells_by_counts', 'mean_counts', 'pc

### Prepare data

To train a pretrained model for gene expression prediction, we must first prepare the data. We use dataloaders, a component of machine learning frameworks such as PyTorch that loads data efficiently and flexibly. The function `get_pretrain_dataloaders` receives the following parameters as input:

* **adata (ad.AnnData):** AnnData object to process
* **layer (str):** the layer to use for pre-training
* **batch_size (int):** the batch size of the loaders. Default is set to 128
* **shuffle (bool):** whether to shuffle the data in the loaders
* **use_cuda (bool):** whether to use cuda in the loaders

It returns the train, validation, and test dataloaders as a tuple, ready for training a pretrained model.
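
The batch size determines how many batches make up one epoch, which is the total shown in the training progress bar. A minimal sketch of that arithmetic, assuming the loaders keep the final partial batch (the default behaviour of a PyTorch `DataLoader`); the sample count is hypothetical and `batches_per_epoch` is a helper written for this illustration, not part of SpaRED:

```python
import math


def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches when the final partial batch is kept."""
    return math.ceil(n_samples / batch_size)


n_train = 1000    # hypothetical number of training spots
batch_size = 265  # value used in this tutorial

n_batches = batches_per_epoch(n_train, batch_size)
last_batch = n_train - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 4 205
```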


In [4]:
from spared.dataloaders import get_pretrain_dataloaders

# Build the train, validation, and test dataloaders
train_dataloader, val_dataloader, test_dataloader = get_pretrain_dataloaders(
    adata=data.adata,
    layer='c_d_log1p',
    batch_size=265,
    shuffle=True,
    use_cuda=True
)

Using noisy_delta layer for training. This will probably yield bad results.
Percentage of imputed observations with median filter: 27.503%


### Initialize model

The SpaRED library provides a class to initialize a model with pretrained weights for gene expression prediction. `ImageBackbone` receives as input:

* **args (argparse):** argparse namespace with the specific variables required to initialize the model
* **latent_dim (int):** latent dimensions used as output feature dimensions in the last layer of the defined model.

It returns the model. The backbone whose pretrained weights are used is defined in the argparse parameter. We will use *ShuffleNetV2* in this DEMO; the available backbones are listed in the documentation of the SpaRED library.

In [5]:
from spared.models import ImageBackbone
import argparse
import torch

# Define the model arguments and wrap them in an argparse.Namespace
input_dict = {
    'img_backbone': 'ShuffleNetV2',
    'img_use_pretrained': True,
    'average_test': False,
    'optim_metric': 'MSE',
    'robust_loss': False,
    'optimizer': 'Adam',
    'lr': 0.0001,
    'momentum': 0.9,
}
test_args = argparse.Namespace(**input_dict)

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The output dimension equals the number of genes in the adata
model = ImageBackbone(args=test_args, latent_dim=data.adata.n_vars).to(device)
model


ImageBackbone(
  (test_transforms): Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  (criterion): MSELoss()
  (encoder): ShuffleNetV2(
    (conv1): Sequential(
      (0): Conv2d(3, 24, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
    )
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (stage2): Sequential(
      (0): InvertedResidual(
        (branch1): Sequential(
          (0): Conv2d(24, 24, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=24, bias=False)
          (1): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): Conv2d(24, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (3): BatchNorm2d(24, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (4): ReLU(inplace=True)
        )
        (branch2

### Training the model

Now we will use PyTorch Lightning to train the model. First we define a `ModelCheckpoint` callback that monitors the validation mean squared error (val_MSE) and saves only the best model according to this metric. Then we initialize a `Trainer` and set its parameters. In this case the model runs for a maximum of 100 training steps, performs validation every 10 steps, logs progress every 10 steps, and uses one device (a GPU here) for training. The trainer also displays a progress bar and a model summary during training.
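
To make the schedule concrete, the sketch below works through the arithmetic for the `max_steps=100` and `val_check_interval=10` values passed to the `Trainer` in this tutorial, and shows which checkpoint `mode='min'` with `save_top_k=1` would retain. It is plain Python for illustration, not Lightning's internal logic, and the `val_mse` values are made up:

```python
max_steps = 100
val_check_interval = 10

# Training steps after which a validation pass would run.
validation_steps = [
    step for step in range(1, max_steps + 1) if step % val_check_interval == 0
]
print(len(validation_steps))  # 10

# Made-up val_MSE values, one per validation pass, for illustration only.
val_mse = [2.4, 2.1, 1.9, 1.95, 1.8, 1.85, 1.7, 1.75, 1.72, 1.9]

# mode='min' with save_top_k=1 keeps the checkpoint with the lowest metric.
best_step, best_mse = min(zip(validation_steps, val_mse), key=lambda pair: pair[1])
print(best_step, best_mse)  # 70 1.7
```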

In [6]:
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Define checkpoint callback to save best model in validation
checkpoint_callback = ModelCheckpoint(
    monitor='val_MSE', # Choose your validation metric
    save_top_k=1, # Save only the best model
    mode='min' # Lower values of the metric are better
)

# Define the trainer
trainer = Trainer(
    max_steps=100,
    val_check_interval=10,
    log_every_n_steps=10,
    callbacks=[checkpoint_callback],
    check_val_every_n_epoch=None,
    devices=1,
    enable_progress_bar=True,
    enable_model_summary=True
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Next, we start training by calling the `fit` method of PyTorch Lightning's `Trainer` with the model, the training dataloader (`train_dataloader`), and the validation dataloader (`val_dataloader`). The `Trainer` then manages the training loop: it feeds the model training batches, runs validation at the configured interval, and applies the callbacks defined above, including model checkpointing.

In [7]:
trainer.fit(
    model=model,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader
)

Missing logger folder: /media/SSD4/dvegaa/SpaRED/docs/notebooks/tutorials/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name            | Type         | Params
-------------------------------------------------
0 | test_transforms | Normalize    | 0     
1 | criterion       | MSELoss      | 0     
2 | encoder         | ShuffleNetV2 | 472 K 
-------------------------------------------------
472 K     Trainable params
0         Non-trainable params
472 K     Total params
1.892     Total estimated model params size (MB)


Epoch 1:  11%|█         | 10/90 [00:51<06:53,  0.19it/s, v_num=0]          

`Trainer.fit` stopped: `max_steps=100` reached.




### Evaluate model performance

After training, the path to the best model is obtained from the `checkpoint_callback`. We can use the best model to evaluate performance on the test set (if available). We load the model from the `checkpoint_callback` path and use the `trainer.test` method to evaluate the model's performance on the test data. 
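
In Lightning, `best_model_path` is an empty string until a checkpoint has actually been written, for example when the monitored metric name does not match anything the model logs. A guard before loading can turn that case into a clearer error. The sketch below uses a hypothetical helper, `require_checkpoint`, that is not part of SpaRED or Lightning:

```python
from pathlib import Path


def require_checkpoint(best_model_path: str) -> Path:
    """Return the checkpoint path, or raise if no checkpoint is available."""
    if not best_model_path:
        raise FileNotFoundError(
            "No checkpoint was saved; check the monitored metric name."
        )
    path = Path(best_model_path)
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    return path
```

It could be called as `require_checkpoint(checkpoint_callback.best_model_path)` just before `ImageBackbone.load_from_checkpoint`.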

In [7]:
# Load the best model after training
best_model_path = checkpoint_callback.best_model_path
model = ImageBackbone.load_from_checkpoint(best_model_path)

# Test model if there is a test dataloader
if test_dataloader is not None:
    trainer.test(model, dataloaders=test_dataloader)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]


Testing DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 10.49it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       test_Global          -3.1575732231140137
        test_MAE            1.0347423553466797
        test_MSE            1.7111455202102661
      test_PCC-Gene         0.12955906987190247
     test_PCC-Patch          0.879209578037262
      test_R2-Gene          -1.7244747877120972
      test_R2-Patch         0.3040209114551544
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
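
The gap between `test_PCC-Gene` and `test_PCC-Patch` in the table above may be easier to read with a toy example. The sketch below assumes, based only on the metric names, that the gene-wise score correlates each gene across patches and the patch-wise score correlates each patch across genes, with each set of correlations averaged afterwards. The exact SpaRED definitions may differ, and the matrices are made up:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy matrices: rows are patches (spots), columns are genes.
truth = [[1.0, 5.0, 9.0],
         [2.0, 5.5, 9.5],
         [3.0, 6.0, 10.0]]
pred = [[3.0, 5.0, 9.0],
        [2.0, 5.5, 9.5],
        [1.0, 6.0, 10.0]]

n_patches, n_genes = len(truth), len(truth[0])

# Gene-wise: correlate each column across patches, then average over genes.
pcc_gene = sum(
    pearson([row[j] for row in truth], [row[j] for row in pred])
    for j in range(n_genes)
) / n_genes

# Patch-wise: correlate each row across genes, then average over patches.
pcc_patch = sum(pearson(t, p) for t, p in zip(truth, pred)) / n_patches

print(round(pcc_gene, 3), round(pcc_patch, 3))
```

In this toy case the patch-wise average comes out far higher than the gene-wise average: the predictions capture which genes are highly expressed overall, but the first gene's variation between patches is predicted in the wrong direction.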
