# TRIPLEX Data Processing Tutorial

This tutorial walks through the TRIPLEX pipeline for processing spatial transcriptomics (ST) data together with whole slide images (WSIs). It covers data preprocessing, feature extraction, and data preparation for both training and inference.

## Overview of the Pipeline

The TRIPLEX pipeline integrates several key components:

1. **Data Preprocessing**: Preparing spatial transcriptomics data and whole slide images
2. **Feature Extraction**: Extracting features from WSI patches using deep learning models
3. **Dataset Preparation**: Creating datasets suitable for training and inference


## 1. Installation and Setup

In [None]:
import os

if os.path.basename(os.getcwd()) == 'tutorials':
    # Change to parent directory
    os.chdir('..')

In [None]:
os.getcwd()

In [None]:
from src.preprocess.pipeline import TriplexPipeline, get_config
from src.preprocess.pipeline.utils import get_available_gpus

### 1.1 Check GPU Availability

The pipeline can leverage GPU acceleration for feature extraction.

In [None]:
# Check available GPUs
available_gpus = get_available_gpus()
print(f"Available GPUs: {len(available_gpus)}")

# If running on GPU, show CUDA information
import torch
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Detected GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

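## 2. Configuration System

The pipeline is configured through key-value settings. `get_config` loads a named configuration, such as `default`, relative to a working directory.

### 2.1 Loading the Default Configuration
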
In [None]:
working_dir = os.path.abspath('./src/preprocess')

In [None]:
# Load the default configuration
try:
    default_config = get_config(working_dir, 'default')
    print("Default configuration parameters:")
    for key, value in sorted(default_config.items()):
        print(f"  {key}: {value}")
except FileNotFoundError:
    print("Default configuration file not found. Make sure to create it first.")

### 2.2 Creating a Custom Configuration

You can create a custom configuration by merging dictionaries or loading from a YAML file:

In [None]:
# Create a custom configuration by overriding default values
custom_config = {
    'mode': 'train',
    'platform': 'visium',
    'input_dir': '/path/to/input/data',
    'output_dir': '/path/to/output/data',
    'slide_ext': '.svs',
    'patch_size': 224,
    'slide_level': 0,
    'save_neighbors': True,
    'total_gpus': min(2, len(available_gpus))  # Use at most 2 GPUs
}
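
The dictionary above lists only the values to override. A minimal sketch of layering such overrides onto a set of defaults is shown below; `merge_config` and the example default values are hypothetical and stand in for whatever `get_config` returns in your checkout.

```python
def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where keys in `overrides` replace those in `defaults`."""
    merged = dict(defaults)  # shallow copy so `defaults` is left untouched
    merged.update(overrides)
    return merged


# Hypothetical defaults, for illustration only (not the pipeline's real defaults)
example_defaults = {'mode': 'train', 'patch_size': 224, 'slide_level': 0, 'total_gpus': 1}
example_overrides = {'mode': 'inference', 'slide_level': 1}

merged = merge_config(example_defaults, example_overrides)
print(merged)
```

Copying before updating keeps the defaults reusable across several runs with different overrides.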

## 3. The Data Processing Pipeline


## 3.1 Complete Pipeline Execution


### Example 1: Processing HEST data (Andersson; BC1)

In [None]:
from huggingface_hub import login

# Log in to the Hugging Face Hub; replace the placeholder with your own access token
login(token="YOUR HUGGING FACE TOKEN")

In [None]:
# Example of processing HEST data
hest_config = {
    'mode': 'hest',
    'input_dir': './input/ST/andersson',  
    'output_dir': './input/ST/andersson', 
    'slide_ext': '.tif',
    'save_neighbors': True,
    'model_name': 'cigar'
}

pipeline = TriplexPipeline(hest_config)
pipeline.run_pipeline()  # This will run preprocessing and feature extraction

### Example 2: Processing Visium data (GSE240429)

In [None]:
# Example of processing Visium data for training
visium_config = {
    'mode': 'train',
    'platform': 'visium',
    'input_dir': '/data-hdd/home/shared/spRNAseq/public/GSE240429/',  # Replace with actual path
    'output_dir': 'input/GSE240429',
    'slide_ext': '.tif',
    'save_neighbors': True
}

pipeline = TriplexPipeline(visium_config)
pipeline.run_pipeline()  # This will run preprocessing and feature extraction


### Example 3: Inference on WSI data

In [None]:
# Example of inference on WSI data
inference_config = {
    'mode': 'inference',
    'input_dir': '/path/to/wsi/data',  # Replace with actual path
    'output_dir': '/input/data/path',  # Replace with actual path
    'slide_ext': '.mrxs',  # 3DHISTECH MIRAX format
    'slide_level': 1,  # Use level 1 for faster processing
    'total_gpus': min(4, len(available_gpus))  # Use up to 4 GPUs
}

pipeline = TriplexPipeline(inference_config)
pipeline.run_pipeline()  # This will run preprocessing and feature extraction


## 3.2 Step-by-Step Processing


### Example 1: Processing HEST data (Andersson; BC1)

#### 1) Data Preprocessing

The first step is preprocessing the data. This includes:
- Loading and processing spatial transcriptomics data
- Extracting patches from whole slide images
- Preparing gene sets for training
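
As a rough illustration of what a gene-set step can look like, the sketch below ranks genes by mean expression and by variance across spots and keeps the top `k` of each. The toy values, the helper name `top_k_genes`, and the ranking criteria are assumptions for illustration; the pipeline's own selection logic may differ.

```python
from statistics import mean, pvariance


def top_k_genes(expr: dict, k: int, score) -> list:
    """Return the k gene names with the highest score, ties broken by name."""
    ranked = sorted(expr, key=lambda g: (-score(expr[g]), g))
    return ranked[:k]


# Toy expression values: gene -> counts across four spots (made-up numbers)
toy_expr = {
    'GENE_A': [10, 10, 10, 10],
    'GENE_B': [0, 20, 0, 20],
    'GENE_C': [1, 2, 1, 2],
}

highly_expressed = top_k_genes(toy_expr, k=2, score=mean)
highly_variable = top_k_genes(toy_expr, k=2, score=pvariance)
print(highly_expressed, highly_variable)
```

Here `GENE_A` is highly expressed but constant, so it appears in the first list and not in the second.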

***Understanding Preprocessing Modes***

TRIPLEX supports three main preprocessing modes:

1. **train mode**: Used to prepare data for model training
   - Processes both spatial transcriptomics data and WSIs
   - Creates patch datasets from WSIs
   - Extracts gene sets (highly variable genes and highly expressed genes)
   - Splits data for cross-validation

2. **hest mode**: Used for HEST (Histology-Enhanced Spatial Transcriptomics) data
   - Loads pre-processed HEST data
   - Extracts patches and neighbor information
   - Prepares gene sets for training

3. **inference mode**: Used to prepare data for inference
   - Processes WSIs only (no spatial transcriptomics data required)
   - Extracts patches and coordinates for inference
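
Based on the mode descriptions above, a configuration can be sanity-checked before the pipeline is constructed. `check_config` is a hypothetical helper, not part of TRIPLEX, and the rule that `train` mode needs a `platform` key is an assumption drawn from the examples in this tutorial.

```python
VALID_MODES = {'train', 'hest', 'inference'}


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    mode = config.get('mode')
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    for key in ('input_dir', 'output_dir'):
        if not config.get(key):
            problems.append(f"missing required key: {key}")
    # Assumption: train mode needs to know the ST platform (e.g. 'visium')
    if mode == 'train' and 'platform' not in config:
        problems.append("train mode expects a 'platform' key")
    return problems


print(check_config({'mode': 'train', 'input_dir': 'in', 'output_dir': 'out'}))
```

Returning a list instead of raising on the first problem lets you report every issue in one pass.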

In [None]:
# Example of processing HEST data
hest_config = {
    'mode': 'hest',
    'input_dir': './input/ST/andersson',  
    'output_dir': './input/ST/andersson', 
    'slide_ext': '.tif',
    'save_neighbors': True,
    'model_name': 'cigar'
}

pipeline = TriplexPipeline(hest_config)

In [None]:
pipeline.preprocess()

#### 2) Feature Extraction

After preprocessing, the next step is to extract features from the WSI patches. TRIPLEX extracts two types of features:

1. **Global features**: Features extracted from individual patches
2. **Neighbor features**: Features that incorporate neighborhood context
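
To make the distinction concrete, the sketch below computes the pixel box of a single patch and the boxes of a square grid of patches centred on it. The 5x5 grid size, the centre-based coordinate convention, and the function names are assumptions for illustration; check the pipeline source for the actual neighbor definition.

```python
def patch_box(cx: int, cy: int, patch_size: int = 224) -> tuple:
    """Return (left, top, right, bottom) of a patch centred at (cx, cy)."""
    half = patch_size // 2
    return (cx - half, cy - half, cx + half, cy + half)


def neighbor_boxes(cx: int, cy: int, patch_size: int = 224, grid: int = 5) -> list:
    """Return boxes for a grid x grid block of patches centred on (cx, cy), row by row."""
    offsets = range(-(grid // 2), grid // 2 + 1)
    return [patch_box(cx + dx * patch_size, cy + dy * patch_size, patch_size)
            for dy in offsets for dx in offsets]


boxes = neighbor_boxes(1000, 1000)
print(len(boxes), boxes[12])  # the middle entry is the centre patch itself
```

With these assumptions, a global feature would come from the centre box alone, while a neighbor feature would draw on all boxes in the grid.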


***Feature Extraction Models***

TRIPLEX supports several feature extraction models:

- **cigar**: A self-supervised model trained on histopathology images
- Other models from the HEST model zoo can also be used

The default is `cigar`.

**2-1) Sequential Feature Extraction**


In [None]:
# pipeline.run_extraction('global')  # Extract only global features
# pipeline.run_extraction('neighbor')  # Extract only neighbor features
pipeline.run_extraction('both')  # Extract both global and neighbor features

**2-2) Parallel Feature Extraction**

For large datasets, TRIPLEX can perform feature extraction in parallel across multiple GPUs:

In [None]:
# Configuration for parallel feature extraction
parallel_config = {
    'mode': 'hest',
    'input_dir': './input/ST/andersson',  
    'output_dir': './input/ST/andersson', 
    'slide_ext': '.tif',
    'save_neighbors': True,
    'model_name': 'cigar',
    'total_gpus': len(available_gpus),  # Use all available GPUs
    'batch_size': 1024,
    'num_workers': 4
}

print(f"Parallel extraction will use {len(available_gpus)} GPUs")
pipeline = TriplexPipeline(parallel_config)
pipeline.run_parallel_extraction()
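
One simple way to picture multi-GPU extraction is that the slides are split into one shard per GPU. The round-robin split below is only a sketch of that idea; `shard_round_robin` is a hypothetical helper, and `run_parallel_extraction` may distribute work differently.

```python
def shard_round_robin(items: list, num_shards: int) -> list:
    """Split items into num_shards lists, assigning item i to shard i % num_shards."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [items[i::num_shards] for i in range(num_shards)]


slides = ['slide_a', 'slide_b', 'slide_c', 'slide_d', 'slide_e']  # placeholder IDs
for gpu_id, shard in enumerate(shard_round_robin(slides, 2)):
    print(f"GPU {gpu_id}: {shard}")
```

Round-robin keeps shard sizes within one item of each other, although it does not account for slides of very different sizes.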


## 4. Using the Command Line Interface

The TRIPLEX pipeline can also be run from the command line using predefined configurations:

```bash
# Run the pipeline using a predefined configuration
python script/run_pipeline.py BC1

# Override specific parameters
python script/run_pipeline.py inference --input_dir /path/to/data --total_gpus 2

# Extract features in parallel
bash script/extract_features_parallel.sh /path/to/images /path/to/output .svs 4 both
```

## 5. Understanding the Output Structure

The TRIPLEX pipeline generates several output directories and files. Here's a guide to the output structure:

```
output_dir/
├── patches/           # Extracted patches from WSIs
│   └── neighbor/      # Neighbor patches (if save_neighbors is True)
├── adata/             # Processed gene expression data
│   └── *.h5ad         # AnnData files with gene expression
├── emb/               # Extracted features
│   ├── global/        # Global features
│   │   └── cigar/     # Features from the cigar model
│   └── neighbor/      # Neighbor features
│       └── cigar/     # Features from the cigar model
├── pos/               # Patch positions for inference
├── var_50genes.json   # Highly variable genes
└── mean_1000genes.json # Highly expressed genes
```
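
A quick way to compare a finished run against this layout is to test for the expected entries. The names below are copied from the tree above; `missing_outputs` is a hypothetical helper, and which entries are present likely depends on the mode used.

```python
from pathlib import Path

EXPECTED = ['patches', 'adata', 'emb/global', 'emb/neighbor', 'pos']


def missing_outputs(output_dir: str, expected=EXPECTED) -> list:
    """Return the expected relative paths that do not exist under output_dir."""
    root = Path(output_dir)
    return [rel for rel in expected if not (root / rel).exists()]


print(missing_outputs('./input/ST/andersson'))
```

An empty list means every expected entry was found; a non-empty list names the ones to investigate.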

## 6. Best Practices and Tips


### 6.1 Memory Management

- Adjust `batch_size` based on your GPU memory
- Set `num_workers` based on the number of available CPU cores (4-8 is typically sufficient)

### 6.2 Performance Optimization

- Use multiple GPUs for feature extraction on large datasets
- For very large datasets, process files in batches

### 6.3 Quality Control

- Check intermediate outputs (patches, gene sets) to ensure quality
- If feature extraction fails, try reducing batch size
- Verify that gene expression data properly aligns with WSI patches

## 7. Troubleshooting Common Issues

### 7.1 Memory Errors

If you encounter CUDA out of memory errors:
- Reduce batch size
- Process fewer files simultaneously
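
The first suggestion can be automated by retrying with a smaller batch. The sketch below is framework-agnostic: `extract_fn` is a hypothetical callable standing in for your extraction step, and an out-of-memory condition is detected by looking for "out of memory" in a `RuntimeError` message, which is how PyTorch CUDA out-of-memory errors typically read.

```python
def run_with_backoff(extract_fn, batch_size: int, min_batch_size: int = 1):
    """Call extract_fn(batch_size), halving batch_size after each out-of-memory error."""
    while True:
        try:
            return extract_fn(batch_size), batch_size
        except RuntimeError as err:
            if 'out of memory' not in str(err).lower() or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to reduce
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"Out of memory; retrying with batch_size={batch_size}")


# Stand-in for a real extraction step: pretends only batches <= 256 fit in memory
def fake_extract(batch_size: int) -> str:
    if batch_size > 256:
        raise RuntimeError("CUDA out of memory (simulated)")
    return "done"


result, used = run_with_backoff(fake_extract, batch_size=1024)
print(result, used)
```

When running on a real GPU, calling `torch.cuda.empty_cache()` before each retry may help release cached memory.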

### 7.2 File Format Issues

If you have issues with file formats:
- Ensure your slide extension matches the actual files
- For Aperio SVS files, use '.svs'
- For MRXS files, use '.mrxs'
- For TIFF files, use '.tif' or '.tiff'
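
To see which extensions are actually present before setting `slide_ext`, you can count files by suffix. `count_extensions` is a hypothetical helper, and it inspects only the top level of the directory.

```python
from collections import Counter
from pathlib import Path


def count_extensions(input_dir: str) -> Counter:
    """Count files directly under input_dir by lowercase extension (e.g. '.svs')."""
    root = Path(input_dir)
    if not root.is_dir():
        return Counter()  # nothing to count if the directory is missing
    return Counter(p.suffix.lower() for p in root.iterdir() if p.is_file())


print(count_extensions('/path/to/wsi/data'))
```

If the extension you configured is absent from the result, the pipeline is unlikely to find any slides in that directory.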

### 7.3 Platform-Specific Issues

For platform-specific preprocessing:
- Visium data should have the standard 10X Visium folder structure
- For HEST data, ensure the data is formatted according to HEST specifications
- For custom platforms, you may need to modify the data loading functions