# Pipeline to run Segger

Part of the pipeline runs on the command line and part runs in Python.

Segger is divided into three major steps: 1- Dataset creation, 2- Training, 3- Prediction.

First create a folder to store the output of the first Segger step (dataset creation). Inside it, create one folder to store the models and another to store the results.
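The folder layout described above can also be created from Python. This is a minimal sketch, assuming the subfolder names models and segment_res that the commands later in this document use. The helper name `make_segger_dirs` is hypothetical and not part of Segger.

```python
from pathlib import Path


def make_segger_dirs(segger_dir):
    """Create the Segger output folder with its models/ and segment_res/ subfolders."""
    root = Path(segger_dir)
    models = root / "models"        # trained models go here (assumed name)
    results = root / "segment_res"  # prediction results go here (assumed name)
    for folder in (models, results):
        # parents=True also creates the root folder; exist_ok=True makes reruns safe
        folder.mkdir(parents=True, exist_ok=True)
    return root, models, results
```

Calling `make_segger_dirs("/path/to/your/segger_output")` returns the three paths, so they can be reused when filling in the commands.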


# Bash part (step-by-step version)

#### conda environment: segger-env

### (Copy and paste the commands below)

## 1- Dataset Creation Input
- Path to the spatial dataset (base_dir)
- Path where the created dataset will be stored (data_dir)
- Path to the single-cell/metacell dataset (scrnaseq_file)
- Name of the column with the cell type annotation in the single-cell/metacell dataset (celltype_column)
- Path of the file where the elapsed time is saved (the path after ; } 2>)

### Code

{ time python3 /mnt/scratch1/Fcarcanholo/cellxgene_python/segger_dev/src/segger/cli/create_dataset_fast.py \
   --base_dir /mnt/scratch1/Touchstone_data/public_data/Xenium_BreastCancer_DuctalCarcinoma \
   --data_dir /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor \
   --sample_type xenium \
   --scrnaseq_file /mnt/scratch1/Fcarcanholo/cellxgene_python/xenium_breast_segger/sc_BRCA_subtype/sc_BRCA_atlas_cells.h5ad \
   --celltype_column celltype_minor \
   --k_bd 3 \
   --dist_bd 15.0 \
   --k_tx 3 \
   --dist_tx 5.0 \
   --tile_width 200 \
   --tile_height 200 \
   --neg_sampling_ratio 5.0 \
   --frac 1.0 \
   --val_prob 0.1 \
   --test_prob 0.2 \
   --n_workers 16 ; } 2> /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor/creating_time.txt
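Because the 2> redirect applies to the whole { ...; } group, the timing file receives both the report of bash's time keyword and anything the Python script writes to stderr. The sketch below shows one way to pull the elapsed times back out of such a file. It assumes bash's default time format (lines like "real 12m34.567s"), and the function name `parse_bash_time` is hypothetical.

```python
import re

# Matches lines such as "real\t12m34.567s" written by bash's `time` keyword.
# A comma is accepted as decimal separator because some locales print one.
_TIME_LINE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.,]+)s\s*$", re.MULTILINE)


def parse_bash_time(text):
    """Return the real/user/sys times in seconds found in a timing file's text."""
    timings = {}
    for label, minutes, seconds in _TIME_LINE.findall(text):
        timings[label] = int(minutes) * 60 + float(seconds.replace(",", "."))
    return timings


# Made-up example content: one stderr line from the script, then the time report.
example = "some warning from the script\n\nreal\t12m34.500s\nuser\t10m0.000s\nsys\t0m5.250s\n"
print(parse_bash_time(example))  # {'real': 754.5, 'user': 600.0, 'sys': 5.25}
```

For a real run, pass the file content instead, for example `parse_bash_time(open("creating_time.txt").read())`.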


## 2- Training Input
- Path where the dataset created in step 1 is stored (dataset_dir)
- Path where the models will be stored (models_dir)
- Path of the file where the elapsed time is saved (the path after ; } 2>)

### Code

{ time python3 /mnt/scratch1/Fcarcanholo/cellxgene_python/segger_dev/src/segger/cli/train_model.py \
    --dataset_dir /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor \
    --models_dir /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor/models \
    --sample_tag first_training \
    --init_emb 8 \
    --hidden_channels 32 \
    --num_tx_tokens 500 \
    --out_channels 8 \
    --heads 2 \
    --num_mid_layers 2 \
    --batch_size 4 \
    --num_workers 2 \
    --accelerator cuda \
    --max_epochs 200 \
    --devices 1 \
    --strategy auto \
    --precision 16-mixed ; } 2> /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor/training_time.txt

## 3- Prediction Input
- Path where the dataset created in step 1 is stored (segger_data_dir)
- Path where the trained models are stored (models_dir)
- Path where the Segger results will be stored (benchmarks_dir)
- Path to the transcripts.parquet file of the spatial dataset (transcripts_file)
- Path of the file where the elapsed time is saved (the path after ; } 2>)


### Code

{ time python3 /mnt/scratch1/Fcarcanholo/cellxgene_python/segger_dev/src/segger/cli/predict_fast.py \
    --segger_data_dir /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor \
    --models_dir /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor/models \
    --benchmarks_dir /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor/segment_res \
    --transcripts_file /mnt/scratch1/Touchstone_data/public_data/Xenium_BreastCancer_DuctalCarcinoma/transcripts.parquet \
    --batch_size 1 \
    --num_workers 1 \
    --model_version 0 \
    --save_tag segger_embedding_1001 \
    --min_transcripts 5 \
    --cell_id_col segger_cell_id \
    --use_cc false \
    --knn_method cuda \
    --file_format anndata \
    --k_bd 4 \
    --dist_bd 12.0 \
    --k_tx 5 \
    --dist_tx 5.0 ; } 2> /mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sc_minor/prediction_time.txt

# Straightforward version

### First define your inputs
- segger_dir is the directory where all the Segger outputs are saved
- spatial_dir is the path to the Xenium dataset to run Segger on
- mtc_or_sc_dataset is the path to the metacell or single-cell dataset used as the reference for Segger
- cell_type_column_name is the name of the column with the cell type annotation

conda activate segger-env
segger_dir="/mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/sanity_check_automatic"
spatial_dir="/mnt/scratch1/Touchstone_data/public_data/Xenium_BreastCancer_DuctalCarcinoma"
mtc_or_sc_dataset="/mnt/scratch1/Fcarcanholo/cellxgene_python/xenium_breast_segger/sc_BRCA_subtype/sc_BRCA_atlas_metacell_without_filter.h5ad"
cell_type_column_name="cell_type"
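The three chained steps can run for a long time, so it can be worth checking the input paths before launching them. The sketch below is an optional check in Python, not part of Segger. The function name `missing_inputs` is hypothetical, and the only assumption is the one the prediction command itself makes: that transcripts.parquet sits directly inside spatial_dir.

```python
from pathlib import Path


def missing_inputs(spatial_dir, mtc_or_sc_dataset):
    """Return the required input paths that do not exist (empty list if all are present)."""
    required = [
        Path(spatial_dir),                          # Xenium dataset folder
        Path(spatial_dir) / "transcripts.parquet",  # read by the prediction step
        Path(mtc_or_sc_dataset),                    # reference .h5ad file
    ]
    return [str(path) for path in required if not path.exists()]
```

Call it with the same values as the shell variables above. An empty list means every input was found; otherwise the list names the paths to fix.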


## Code

mkdir -p "$segger_dir/models" \
&& \
mkdir -p "$segger_dir/segment_res" \
&& \
{ time python3 /mnt/scratch1/Fcarcanholo/cellxgene_python/segger_dev/src/segger/cli/create_dataset_fast.py \
   --base_dir "$spatial_dir" \
   --data_dir "$segger_dir" \
   --sample_type xenium \
   --scrnaseq_file "$mtc_or_sc_dataset" \
   --celltype_column "$cell_type_column_name" \
   --k_bd 3 \
   --dist_bd 15.0 \
   --k_tx 3 \
   --dist_tx 5.0 \
   --tile_width 200 \
   --tile_height 200 \
   --neg_sampling_ratio 5.0 \
   --frac 1.0 \
   --val_prob 0.1 \
   --test_prob 0.2 \
   --n_workers 16 ; } 2> "$segger_dir/creating_time.txt" \
&& \
{ time python3 /mnt/scratch1/Fcarcanholo/cellxgene_python/segger_dev/src/segger/cli/train_model.py \
    --dataset_dir "$segger_dir" \
    --models_dir "$segger_dir/models" \
    --sample_tag first_training \
    --init_emb 8 \
    --hidden_channels 32 \
    --num_tx_tokens 500 \
    --out_channels 8 \
    --heads 2 \
    --num_mid_layers 2 \
    --batch_size 4 \
    --num_workers 2 \
    --accelerator cuda \
    --max_epochs 200 \
    --devices 1 \
    --strategy auto \
    --precision 16-mixed ; } 2> "$segger_dir/training_time.txt" \
&& \
{ time python3 /mnt/scratch1/Fcarcanholo/cellxgene_python/segger_dev/src/segger/cli/predict_fast.py \
    --segger_data_dir "$segger_dir" \
    --models_dir "$segger_dir/models" \
    --benchmarks_dir "$segger_dir/segment_res" \
    --transcripts_file "$spatial_dir/transcripts.parquet" \
    --batch_size 1 \
    --num_workers 1 \
    --model_version 0 \
    --save_tag segger_embedding_1001 \
    --min_transcripts 5 \
    --cell_id_col segger_cell_id \
    --use_cc false \
    --knn_method cuda \
    --file_format anndata \
    --k_bd 4 \
    --dist_bd 12.0 \
    --k_tx 5 \
    --dist_tx 5.0 ; } 2> "$segger_dir/prediction_time.txt"


# Python part

conda environment: segger-env

### Transferring labels from the reference used in dataset creation (first step of Segger) to the Segger output

The prediction step creates a new folder inside the Segger results path (benchmarks_dir). That folder contains the AnnData object you need, called "segger_adata". Use this object for the following steps.
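The name of the result folder is generated by the prediction step (in the example path below it includes the save tag, several parameter values and a date), so it is easier to search for the file than to type the folder name by hand. This is a minimal sketch, assuming the file is named segger_adata.h5ad as in the example path. The helper `find_segger_adata` is hypothetical.

```python
from pathlib import Path


def find_segger_adata(benchmarks_dir):
    """Return every segger_adata.h5ad found under benchmarks_dir, newest first."""
    hits = Path(benchmarks_dir).rglob("segger_adata.h5ad")
    # Sort by modification time so the most recent prediction comes first
    return sorted(hits, key=lambda path: path.stat().st_mtime, reverse=True)
```

The first element of the returned list is the most recently written result, and its path can be passed to ad.read_h5ad.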

In [None]:
import anndata as ad
from segger.validation.utils import annotate_query_with_reference

#Reading segger direct output
mtc_minor_cell_type = ad.read_h5ad("/mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/mtc_minor/segment_res/segger_embedding_1001_0.5_False_4_12.0_5_5.0_20250201/segger_adata.h5ad")
#Reading metacell used as input in Dataset Creation (First step of segger)
metacell = ad.read_h5ad("/mnt/scratch1/Fcarcanholo/cellxgene_python/xenium_breast_segger/sc_BRCA_subtype/sc_BRCA_atlas_metacell_without_filter.h5ad")
#transfer the label
mtc_minor_cell_type_annotated = annotate_query_with_reference(reference_adata=metacell, query_adata=mtc_minor_cell_type, transfer_column="celltype_minor")
# mtc_minor_cell_type_annotated.write_h5ad("/mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/segger_metacell_minor_creat_major_transf.h5ad")

## Analysing the spatial data

conda environment: cellxgene

In [None]:
import spatialdata as sd
from spatialdata_io import xenium
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
import squidpy as sq
import anndata as ad

To use the spatial functions on the Segger output, you have to manually set .obsm["spatial"] from the cell centroid coordinates stored in .obs (cell_centroid_x and cell_centroid_y).

In [None]:
segger_metacell_minor = ad.read_h5ad("/mnt/scratch1/Fcarcanholo/cellxgene_python/other_spatial_xenium_breast/segger/segger_metacell_minor_ductal.h5ad")
spatial = segger_metacell_minor.obs[["cell_centroid_x", "cell_centroid_y"]]
spatial = spatial.to_numpy()
segger_metacell_minor.obsm["spatial"] = spatial

The Segger output also does not have Leiden clusters yet, so you have to run the basic preprocessing pipeline to compute them.

In [None]:
# Keep the raw counts before normalizing
segger_metacell_minor.layers["counts"] = segger_metacell_minor.X.copy()
sc.pp.normalize_total(segger_metacell_minor, inplace=True)
sc.pp.log1p(segger_metacell_minor)
# Dimensionality reduction, neighbor graph and UMAP embedding
sc.pp.pca(segger_metacell_minor)
sc.pp.neighbors(segger_metacell_minor)
sc.tl.umap(segger_metacell_minor)
# Leiden clustering, stored in .obs["leiden_1.5"]
sc.tl.leiden(segger_metacell_minor, resolution=1.5, key_added="leiden_1.5")

Now you can make spatial plots colored by the cell type annotation and/or the Leiden clusters.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))  # Adjust the size as needed
sq.pl.spatial_scatter(
    segger_metacell_minor,
    library_id="spatial",
    shape=None,
    color=[
        "cell_type"
    ],
    ax=ax  # Pass the axis to the function
)