# **TiRank Analysis Pipeline Example**

This notebook demonstrates how to use the **TiRank** library to integrate spatial transcriptomics (ST) data and bulk transcriptomics data to identify phenotype-associated spots and determine significant clusters. The analysis includes data loading, preprocessing, model training, prediction, identification of significant clusters, and visualization of the results.

---

## **Table of Contents**

1. [Setup and Imports](#setup-and-imports)
2. [Load Data](#load-data)
    - 2.1 [Select Save Paths](#select-save-paths)
    - 2.2 [Load Clinical Data](#load-clinical-data)
    - 2.3 [Load Bulk Expression Profile](#load-bulk-expression-profile)
    - 2.4 [Check Data Consistency](#check-data-consistency)
    - 2.5 [Load Spatial Transcriptomics Data](#load-spatial-transcriptomics-data)
3. [Preprocessing](#preprocessing)
    - 3.1 [Load Saved AnnData Object](#load-saved-anndata-object)
    - 3.2 [Preprocess ST Data](#preprocess-st-data)
    - 3.3 [Clinical Data Preparation and Splitting](#clinical-data-preparation-and-splitting)
    - 3.4 [Gene Pair Transformation](#gene-pair-transformation)
4. [Analysis](#analysis)
    - 4.1 [TiRank Analysis](#tirank-analysis)
        - 4.1.1 [Data Loading and Preparation](#data-loading-and-preparation)
        - 4.1.2 [Model Training](#model-training)
        - 4.1.3 [Model Inference](#model-inference)
        - 4.1.4 [Identify Hubs and Significant Clusters](#identify-hubs-and-significant-clusters)
        - 4.1.5 [Visualization](#visualization)
    - 4.2 [Differential Expression and Pathway Enrichment Analysis](#differential-expression-and-pathway-enrichment-analysis)

---

<a id='setup-and-imports'></a>
## **1. Setup and Imports**

First, import the libraries and modules used in the analysis, then set a random seed for reproducibility.

In [None]:
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

# Import standard-library and third-party modules
import torch
import pickle
import os

# Import TiRank modules
from TiRank.Model import setup_seed, initial_model_para
from TiRank.LoadData import (
    load_bulk_clinical,
    load_bulk_exp,
    check_bulk,
    load_st_data,
    transfer_exp_profile,
    view_dataframe
)
from TiRank.SCSTpreprocess import (
    FilteringAnndata,
    Normalization,
    Logtransformation,
    Clustering,
    compute_similarity
)
from TiRank.Imageprocessing import GetPathoClass
from TiRank.GPextractor import GenePairExtractor
from TiRank.Dataloader import generate_val, PackData
from TiRank.TrainPre import (
    tune_hyperparameters,
    Predict,
    Pcluster,
    IdenHub
)
from TiRank.Visualization import (
    plot_score_distribution,
    DEG_analysis,
    DEG_volcano,
    Pathway_Enrichment,
    plot_score_umap,
    plot_label_distribution_among_conditions,
    plot_STmap
)

# Set random seed for reproducibility
setup_seed(619)

---

<a id='load-data'></a>
## **2. Load Data**

In this section, we load the clinical data, bulk expression profiles, and spatial transcriptomics data required for the analysis.

<a id='select-save-paths'></a>
### **2.1 Select Save Paths**

Define the paths where the results and intermediate data will be saved.

In [None]:
# Main directory for saving results
savePath = "./ST_Survival_CRC"

# Directory for results of the data-loading step
savePath_1 = os.path.join(savePath, "1_loaddata")
os.makedirs(savePath_1, exist_ok=True)  # exist_ok=True makes a prior existence check unnecessary
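
The same create-if-missing pattern recurs for each step directory in this notebook. As a minimal sketch, it could be centralized in a small helper; `make_step_dir` is a hypothetical name and not part of TiRank. The example writes into a temporary directory so it can be run without touching real results.

```python
import os
import tempfile

def make_step_dir(root, step_name):
    """Create root/step_name if needed and return its path (hypothetical helper)."""
    path = os.path.join(root, step_name)
    # exist_ok=True makes the call safe to repeat
    os.makedirs(path, exist_ok=True)
    return path

# Use a throwaway directory for the demonstration instead of the real savePath
demo_root = tempfile.mkdtemp()
step_dirs = [
    make_step_dir(demo_root, name)
    for name in ("1_loaddata", "2_preprocessing", "3_Analysis")
]

# Calling the helper again for an existing directory is harmless
make_step_dir(demo_root, "1_loaddata")
print(step_dirs)
```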

<a id='load-clinical-data'></a>
### **2.2 Load Clinical Data**

Load the clinical data from a CSV file.

In [None]:
# Directory containing your data
dataPath = "./CRC_ST_Prog/"

# Path to clinical data CSV
path_to_bulk_cli = os.path.join(dataPath, "GSE39582_clinical_os.csv")

# Load clinical data
bulkClinical = load_bulk_clinical(path_to_bulk_cli)

# Optional: View the clinical data DataFrame
view_dataframe(bulkClinical)

<a id='load-bulk-expression-profile'></a>
### **2.3 Load Bulk Expression Profile**

Load the bulk expression data from a CSV file.

In [None]:
# Path to bulk expression data CSV
path_to_bulk_exp = os.path.join(dataPath, "GSE39582_exp_os.csv")

# Load bulk expression data
bulkExp = load_bulk_exp(path_to_bulk_exp)

# Optional: View the bulk expression DataFrame
view_dataframe(bulkExp)

<a id='check-data-consistency'></a>
### **2.4 Check Data Consistency**

Ensure that the sample names and identifiers are consistent between the bulk expression data and clinical data.

In [None]:
# Check consistency between bulk expression and clinical data
check_bulk(savePath, bulkExp, bulkClinical)
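
To illustrate what a consistency check of this kind involves, the sketch below compares two lists of sample IDs and reports which are shared and which appear in only one table. It is not the implementation of `check_bulk`; the function name `compare_sample_ids` and the toy IDs are made up for the example.

```python
def compare_sample_ids(expression_samples, clinical_samples):
    """Return the shared sample IDs plus those unique to each table."""
    exp_set, cli_set = set(expression_samples), set(clinical_samples)
    shared = sorted(exp_set & cli_set)
    only_expression = sorted(exp_set - cli_set)
    only_clinical = sorted(cli_set - exp_set)
    return shared, only_expression, only_clinical

# Toy IDs standing in for the samples of the expression and clinical tables
expression_samples = ["S1", "S2", "S3", "S4"]
clinical_samples = ["S2", "S3", "S4", "S5"]

shared, only_expression, only_clinical = compare_sample_ids(
    expression_samples, clinical_samples
)
print(f"{len(shared)} shared samples")
print("samples without clinical data:", only_expression)
print("samples without expression data:", only_clinical)
```

Only the shared samples can be used for training, so a large number of unmatched IDs usually points to a naming mismatch between the two files.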

<a id='load-spatial-transcriptomics-data'></a>
### **2.5 Load Spatial Transcriptomics Data**

Load the spatial transcriptomics (ST) data from the specified folder.

In [None]:
# Path to the folder containing ST data
path_to_st_folder = os.path.join(dataPath, "SN048_A121573_Rep1")

# Load ST data
scAnndata = load_st_data(path_to_st_folder, savePath)

# Extract the expression profile from the AnnData object as a DataFrame
st_exp_df = transfer_exp_profile(scAnndata)

# Optional: View the ST expression DataFrame
view_dataframe(st_exp_df)

---

<a id='preprocessing'></a>
## **3. Preprocessing**

This section involves filtering, normalizing, and transforming the ST data. We also perform clustering and obtain pathological classifications.

<a id='load-saved-anndata-object'></a>
### **3.1 Load Saved AnnData Object**

Load the saved AnnData object from the previous step.

In [None]:
# Directory for preprocessing results
savePath_2 = os.path.join(savePath, "2_preprocessing")
os.makedirs(savePath_2, exist_ok=True)

# Load the saved AnnData object
with open(os.path.join(savePath_1, "anndata.pkl"), "rb") as f:
    scAnndata = pickle.load(f)

<a id='preprocess-st-data'></a>
### **3.2 Preprocess ST Data**

Filter the data based on counts and mitochondrial gene proportion, normalize, log-transform, and perform clustering.

In [None]:
# Define the inference mode (e.g., "ST" for spatial transcriptomics)
infer_mode = "ST"  # Optional parameter

# Filtering the data
scAnndata = FilteringAnndata(
    scAnndata,
    max_count=35000,    # Maximum total counts per spot
    min_count=5000,     # Minimum total counts per spot
    MT_propor=10,       # Maximum percentage of counts from mitochondrial genes
    min_cell=10,        # Minimum number of spots expressing the gene
    imgPath=savePath_2  # Path to save images/results
)
# Optional parameters: max_count, min_count, MT_propor, min_cell

# Normalize the data
scAnndata = Normalization(scAnndata)

# Log-transform the data
scAnndata = Logtransformation(scAnndata)

# Perform clustering on the data
scAnndata = Clustering(scAnndata, infer_mode=infer_mode, savePath=savePath)

# Compute similarity matrix (optional distance calculation)
compute_similarity(
    savePath=savePath,
    ann_data=scAnndata,
    calculate_distance=False  # Set to True if distance calculation is needed
)
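
For readers unfamiliar with these steps, the following sketch shows the general idea of count-based filtering, library-size normalization, and log transformation on toy data. It does not reproduce the TiRank functions. The `MT-` prefix for mitochondrial genes, the `target_sum` of 10,000, and the scaled-down thresholds are all assumptions chosen for the example.

```python
import math

def passes_qc(counts, min_count, max_count, max_mito_percent):
    """Keep a spot if its total counts and mitochondrial percentage are in range."""
    total = sum(counts.values())
    if total == 0 or total < min_count or total > max_count:
        return False
    # Assumption: mitochondrial genes are identified by the "MT-" prefix
    mito = sum(v for gene, v in counts.items() if gene.startswith("MT-"))
    return 100.0 * mito / total <= max_mito_percent

def normalize_and_log(counts, target_sum=10000):
    """Scale counts to a common library size, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(v * target_sum / total) for gene, v in counts.items()}

# Toy spots; the thresholds below are scaled down to match these tiny counts
spots = {
    "spot_a": {"EPCAM": 60, "KRT8": 30, "MT-CO1": 10},  # passes
    "spot_b": {"EPCAM": 5, "KRT8": 3, "MT-CO1": 2},     # too few counts
    "spot_c": {"EPCAM": 40, "KRT8": 20, "MT-CO1": 40},  # too much mitochondrial signal
}

kept = [
    name for name, counts in spots.items()
    if passes_qc(counts, min_count=50, max_count=500, max_mito_percent=10)
]
normalized = {name: normalize_and_log(spots[name]) for name in kept}
print(kept)
```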

**Note:** `pretrain_path` must point to the pre-trained image-processing model file (`ctranspath.pth` in this example). Update the path to match where the file is stored on your system.

In [None]:
# Path to the pre-trained image processing model
pretrain_path = "./ctranspath.pth"

# Number of pathological clusters to identify
n_patho_cluster = 7  # Optional variable (adjust based on your data)

# Perform image processing to get pathological classifications
scAnndata = GetPathoClass(
    adata=scAnndata,
    pretrain_path=pretrain_path,
    n_clusters=n_patho_cluster,
    image_save_path=os.path.join(savePath_2, "patho_label.png")
    # Advanced parameters: n_components (PCA components), n_clusters
)

# Save the processed AnnData object
with open(os.path.join(savePath_2, "scAnndata.pkl"), "wb") as f:
    pickle.dump(scAnndata, f)

<a id='clinical-data-preparation-and-splitting'></a>
### **3.3 Clinical Data Preparation and Splitting**

Prepare the clinical data and split the bulk data into training and validation sets.

In [None]:
# Define the analysis mode (e.g., "Cox" for survival analysis)
mode = "Cox"

# Split data into training and validation sets
generate_val(
    savePath=savePath,
    validation_proportion=0.15,  # Optional parameter: proportion of data for validation
    mode=mode
)
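
As a rough illustration of what a 15% hold-out means, the sketch below performs a seeded random split of sample IDs. It is a plain random split under stated assumptions, not the logic of `generate_val`, which may (for example) balance events between the two sets; `split_train_validation` is a hypothetical helper.

```python
import random

def split_train_validation(sample_ids, validation_proportion=0.15, seed=619):
    """Shuffle sample IDs with a fixed seed and hold out a validation subset."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Hold out at least one sample, even for very small datasets
    n_val = max(1, round(len(shuffled) * validation_proportion))
    return shuffled[n_val:], shuffled[:n_val]

# Toy sample IDs
sample_ids = [f"sample_{i:03d}" for i in range(100)]
train_ids, val_ids = split_train_validation(sample_ids)
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the split identical between runs, which matters when comparing hyperparameter settings later.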

<a id='gene-pair-transformation'></a>
### **3.4 Gene Pair Transformation**

Extract informative gene pairs for the analysis.

In [None]:
# Initialize the GenePairExtractor with parameters
GPextractor = GenePairExtractor(
    savePath=savePath,
    analysis_mode=mode,
    top_var_genes=2000,       # Optional: number of top variable genes to select
    top_gene_pairs=1000,      # Optional: number of top gene pairs to select
    p_value_threshold=0.05,   # Optional: p-value threshold for gene pair selection
    max_cutoff=0.8,           # Optional: upper cutoff for correlation coefficient
    min_cutoff=-0.8           # Optional: lower cutoff for correlation coefficient
)

# Load data for gene pair extraction
GPextractor.load_data()

# Run the gene pair extraction process
GPextractor.run_extraction()

# Save the extracted gene pairs
GPextractor.save_data()
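
One common form of gene pair transformation encodes, for each pair (A, B), whether gene A is expressed more highly than gene B within the same sample. Because the feature depends only on within-sample ordering, it is insensitive to differences in scale between platforms such as bulk and ST. The sketch below illustrates that idea on toy values; whether `GenePairExtractor` uses exactly this encoding is an assumption, and the gene names are placeholders.

```python
def gene_pair_features(expression, gene_pairs):
    """Encode each pair as 1 (A > B), -1 (A < B), or 0 (tie) within one sample."""
    features = {}
    for gene_a, gene_b in gene_pairs:
        diff = expression[gene_a] - expression[gene_b]
        features[f"{gene_a}__{gene_b}"] = (diff > 0) - (diff < 0)
    return features

gene_pairs = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_C", "GENE_A")]

# The same ordering of genes measured on two very different scales
bulk_sample = {"GENE_A": 8.2, "GENE_B": 5.1, "GENE_C": 5.1}
st_spot = {"GENE_A": 1.4, "GENE_B": 0.3, "GENE_C": 0.3}

bulk_features = gene_pair_features(bulk_sample, gene_pairs)
spot_features = gene_pair_features(st_spot, gene_pairs)
print(bulk_features)
print(spot_features)
```

Both inputs produce the same features even though their absolute values differ, which is the property that makes this kind of encoding useful for cross-platform integration.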

---

<a id='analysis'></a>
## **4. Analysis**

In this section, we perform the TiRank analysis, including model training, prediction, identification of significant clusters, and visualization.

<a id='tirank-analysis'></a>
### **4.1 TiRank Analysis**

<a id='data-loading-and-preparation'></a>
#### **4.1.1 Data Loading and Preparation**

Load and prepare the data for model training and inference.

In [None]:
# Directory for saving analysis results
savePath_3 = os.path.join(savePath, "3_Analysis")
os.makedirs(savePath_3, exist_ok=True)

# Re-declare the settings from Section 3; the values must stay the same throughout the analysis
mode = "Cox"          # Analysis mode (e.g., "Cox" for survival analysis)
infer_mode = "ST"     # Inference mode (e.g., "ST" for spatial transcriptomics)
device = "cuda" if torch.cuda.is_available() else "cpu"  # Use GPU if available

# Pack the data into DataLoader objects for training and validation
PackData(
    savePath=savePath,
    mode=mode,
    infer_mode=infer_mode,
    batch_size=1024   # Optional parameter: batch size for DataLoader
)

<a id='model-training'></a>
#### **4.1.2 Model Training**

Initialize model parameters and tune hyperparameters.

In [None]:
# Set the encoder type for the model (e.g., "MLP" for multi-layer perceptron)
encoder_type = "MLP"  # Optional parameter (options: "MLP", "Transformer", etc.)

# Initialize model parameters
initial_model_para(
    savePath=savePath,
    nhead=2,           # Optional: number of heads in multi-head attention (if using Transformer)
    nhid1=96,          # Optional: hidden layer size 1
    nhid2=8,           # Optional: hidden layer size 2
    n_output=32,       # Optional: output size
    nlayers=3,         # Optional: number of layers
    n_pred=1,          # Optional: number of predictions (e.g., 1 for regression)
    dropout=0.5,       # Optional: dropout rate
    mode=mode,
    encoder_type=encoder_type,
    infer_mode=infer_mode
)

# Tune hyperparameters using Optuna or other optimization libraries
tune_hyperparameters(
    savePath=savePath,
    device=device,
    n_trials=5    # Optional parameter: number of hyperparameter tuning trials
)

<a id='model-inference'></a>
#### **4.1.3 Model Inference**

Use the trained model to score each spot and, optionally, reject spots whose predictions are too uncertain.

In [None]:
# Predict phenotype-associated spots and perform rejection (uncertainty estimation)
Predict(
    savePath=savePath,
    mode=mode,
    do_reject=True,        # Optional: whether to perform rejection
    tolerance=0.05,        # Optional: tolerance level for rejection
    reject_mode="GMM"      # Optional: rejection mode (e.g., "GMM" for Gaussian Mixture Model)
)

<a id='identify-hubs-and-significant-clusters'></a>
#### **4.1.4 Identify Hubs and Significant Clusters**

Identify hub spots and perform permutation tests to determine significant clusters.

In [None]:
# Identify hub spots based on categorical columns
IdenHub(
    savePath=savePath,
    cateCol1="patho_class",        # First categorical column (e.g., pathological class)
    cateCol2="leiden_clusters",    # Second categorical column (e.g., clustering result)
    min_spots=10                   # Optional: minimum number of spots to consider a hub
)

# Perform permutation tests to identify significant clusters
Pcluster(savePath=savePath, clusterColName="patho_class", perm_n=1001)
Pcluster(savePath=savePath, clusterColName="leiden_clusters", perm_n=1001)
Pcluster(savePath=savePath, clusterColName="combine_cluster", perm_n=1001)
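
The general idea behind a permutation test for cluster significance can be sketched as follows: compute a statistic for the cluster (here, the mean prediction score), shuffle the cluster labels many times, and count how often the shuffled statistic is at least as large as the observed one. This is a generic one-sided test on toy data, not the implementation of `Pcluster`; the choice of statistic and the function name `permutation_p_value` are assumptions.

```python
import random

def permutation_p_value(scores, labels, cluster, n_perm=1000, seed=619):
    """One-sided permutation test: is the cluster's mean score unusually high?"""
    rng = random.Random(seed)

    def cluster_mean(current_labels):
        values = [s for s, lab in zip(scores, current_labels) if lab == cluster]
        return sum(values) / len(values)

    observed = cluster_mean(labels)
    shuffled = list(labels)
    n_extreme = 0
    for _ in range(n_perm):
        # Shuffling keeps cluster sizes fixed but breaks the score-label link
        rng.shuffle(shuffled)
        if cluster_mean(shuffled) >= observed:
            n_extreme += 1
    # Add-one correction so the p-value is never exactly zero
    return observed, (n_extreme + 1) / (n_perm + 1)

# Toy data: 10 high-scoring "tumor" spots and 10 low-scoring "stroma" spots
scores = [0.8 + 0.01 * i for i in range(10)] + [0.1 + 0.02 * i for i in range(10)]
labels = ["tumor"] * 10 + ["stroma"] * 10

tumor_mean, tumor_p = permutation_p_value(scores, labels, "tumor")
stroma_mean, stroma_p = permutation_p_value(scores, labels, "stroma")
print(f"tumor:  mean={tumor_mean:.3f}, p={tumor_p:.4f}")
print(f"stroma: mean={stroma_mean:.3f}, p={stroma_p:.4f}")
```

More permutations give a finer-grained p-value at the cost of runtime, since the smallest attainable value is 1 / (n_perm + 1).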

<a id='visualization'></a>
#### **4.1.5 Visualization**

Visualize the results using various plotting functions.

In [None]:
# Plot the distribution of prediction scores
plot_score_distribution(savePath)  # Displays the probability score distribution

# Plot UMAP embedding colored by prediction scores
plot_score_umap(savePath, infer_mode)

# Plot the distribution of labels among different conditions
plot_label_distribution_among_conditions(savePath, group="patho_class")
plot_label_distribution_among_conditions(savePath, group="leiden_clusters")
plot_label_distribution_among_conditions(savePath, group="combine_cluster")

# Plot spatial maps of the spots with cluster labels
plot_STmap(savePath=savePath, group="combine_cluster")

<a id='differential-expression-and-pathway-enrichment-analysis'></a>
### **4.2 Differential Expression and Pathway Enrichment Analysis**

Perform differential expression analysis and pathway enrichment to understand the biological processes involved.

In [None]:
# Set thresholds for differential expression analysis
fc_threshold = 2          # Optional: fold-change threshold
Pvalue_threshold = 0.05   # Optional: p-value threshold
do_p_adjust = True        # Optional: whether to adjust p-values for multiple testing

# Perform differential expression analysis
DEG_analysis(
    savePath=savePath,
    fc_threshold=fc_threshold,
    Pvalue_threshold=Pvalue_threshold,
    do_p_adjust=do_p_adjust
)

# Plot volcano plots for differential expression results
DEG_volcano(
    savePath=savePath,
    fc_threshold=fc_threshold,
    Pvalue_threshold=Pvalue_threshold,
    do_p_adjust=do_p_adjust
)

# Perform pathway enrichment analysis using specified databases
# Available databases can be found at: https://maayanlab.cloud/Enrichr/#libraries
Pathway_Enrichment(
    savePath=savePath,
    database=["GO_Biological_Process_2023"]  # Optional: replace with desired databases
)
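
To show how the three thresholds interact, the sketch below adjusts toy p-values with the Benjamini-Hochberg procedure and then labels each gene as up, down, or not significant. It rests on several assumptions: that `fc_threshold` is a linear fold change (so 2 corresponds to |log2 FC| >= 1), that adjustment uses Benjamini-Hochberg, and that the input values are invented. It is not the implementation of `DEG_analysis`.

```python
import math

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def classify_genes(results, fc_threshold=2, p_threshold=0.05):
    """Label each gene 'up', 'down', or 'ns' using adjusted p-values."""
    log2_cutoff = math.log2(fc_threshold)
    adjusted = benjamini_hochberg([r["p_value"] for r in results])
    calls = {}
    for r, q in zip(results, adjusted):
        if q < p_threshold and r["log2_fc"] >= log2_cutoff:
            calls[r["gene"]] = "up"
        elif q < p_threshold and r["log2_fc"] <= -log2_cutoff:
            calls[r["gene"]] = "down"
        else:
            calls[r["gene"]] = "ns"
    return calls

# Invented results for five placeholder genes
results = [
    {"gene": "GENE_1", "log2_fc": 2.5, "p_value": 0.001},
    {"gene": "GENE_2", "log2_fc": -1.8, "p_value": 0.004},
    {"gene": "GENE_3", "log2_fc": 0.3, "p_value": 0.010},
    {"gene": "GENE_4", "log2_fc": 1.5, "p_value": 0.045},
    {"gene": "GENE_5", "log2_fc": -0.2, "p_value": 0.600},
]
calls = classify_genes(results)
print(calls)
```

`GENE_4` has a raw p-value below 0.05 but is no longer significant after adjustment, which is the effect that `do_p_adjust=True` is meant to capture.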

---

**Final Remarks:**

- Ensure that all required data files are available at the specified paths.
- Adjust optional parameters such as thresholds, number of clusters, and batch size to suit your dataset and computational resources.
- `pretrain_path` must point to the pre-trained image-processing model file (`ctranspath.pth` in this example).
- Keep `mode` and `infer_mode` consistent across all cells; they are set in Section 3 and again in Section 4.1.1.
- The visualization functions help interpret the results and show the spatial distribution of clusters.
- If you encounter issues with any step, refer to the TiRank documentation.

---

**Note:** To run this notebook interactively:

1. Copy and paste the content into a new Jupyter Notebook.
2. Ensure that the `TiRank` library and all dependencies are properly installed.
3. Update the paths to data files and models according to your local setup.
4. Execute the cells sequentially.

---

# **References**

- **TiRank Documentation:** *[Add link to TiRank documentation]*
- **Enrichr Libraries:** [https://maayanlab.cloud/Enrichr/#libraries](https://maayanlab.cloud/Enrichr/#libraries)
