# SimiC Preprocessing Pipeline Tutorial

>*Author: Irene Marín-Goñi, PhD student - ML4BM group (CIMA University of Navarra)*

This notebook demonstrates how to preprocess single-cell RNA-seq data for SimiC analysis.

## Overview
This preprocessing tutorial covers:
1. Package installation and setup
2. MAGIC imputation pipeline
3. Gene selection and experiment setup
4. Preparing input files for SimiC

For running the SimiC analysis, see `Tutorial_SimiCPipeline_full.ipynb`.

## Introduction
Before running SimiC, you need to:
- Impute your scRNA-seq data. We recommend using [MAGIC](https://pypi.org/project/magic-impute/3.0.0/) and provide a wrapper class, `MagicPipeline`, to ease the process.
- Select the top variable genes based on Median Absolute Deviation (MAD), or supply the genes of interest for which you want to infer the gene regulatory network.
- Prepare input files in the correct format for SimiCPipeline.

This tutorial shows you how to do all of this using the SimiC preprocessing modules.

## Setup

The easiest way to configure your environment is to follow the `README` instructions using `poetry` (or `Docker`).


Required packages for this tutorial:
- simicpipeline
- anndata
- pandas
- numpy
- os (Python standard library)
- pickle (Python standard library)

Internally simicpipeline also uses:
- scipy
- sklearn
- scprep (in preprocessing)
- magic-impute (in preprocessing)



## Import Modules
First, import the necessary preprocessing modules.

In [1]:
import os
print(os.getcwd())
print(os.listdir())
import simicpipeline 
print(f"SimiC pipeline version: {simicpipeline.__version__}")

/home/workdir
['data']


SimiC pipeline version: 0.1.0


<a id='part1'></a>
# Part 1: MAGIC Imputation Pipeline

MAGIC (Markov Affinity-based Graph Imputation of Cells) is used to denoise and impute scRNA-seq data. `MagicPipeline` wraps the steps described in the [MAGIC tutorial](https://magic.readthedocs.io/en/stable/tutorial.html).


## Step 1.1: Load Your Data

Load your raw expression data. Note that the `MagicPipeline` class expects AnnData format.

Below are examples of how to build the AnnData object from different input files (including Seurat, if you are more familiar with R).

<div class="alert alert-block alert-info">
<em><b>Note 1:</b> If you have already filtered and transformed your data but want to repeat these steps, make sure the adata object has the raw counts in the `adata.raw.X` slot.</em>
</div>

<div class="alert alert-block alert-info">
<b>Note 2:</b> If you have already processed (imputed) your data, you can jump to Part 2: Experiment Setup.
</div>
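Before repeating the pipeline on an existing object, it can help to confirm that a matrix really holds raw counts. The following is a minimal sketch, assuming that raw counts are non-negative whole numbers; `looks_like_raw_counts` is a hypothetical helper and not part of `simicpipeline`.

```python
import numpy as np


def looks_like_raw_counts(matrix) -> bool:
    """Return True if every entry is a non-negative whole number.

    Hypothetical helper for illustration; not part of simicpipeline.
    """
    values = np.asarray(matrix, dtype=float)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy raw counts (2 cells x 3 genes) and a normalized, square-root transformed version
raw = np.array([[0, 3, 1], [2, 0, 5]])
normalized = np.sqrt(raw / raw.sum(axis=1, keepdims=True) * 10)

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

The helper expects a dense array. For a sparse `adata.raw.X`, one option is to test a small slice converted with `.toarray()`.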

#### From Seurat
If you have your data in a Seurat object (R package) the easiest approach is this:
```r

write.table(data.frame(Cells = colnames(seurat_obj)), file.path("/path/to/data/", "cell_ids.txt"), row.names = FALSE, col.names = FALSE, quote = FALSE)

write.table(data.frame(Genes = rownames(seurat_obj)), file.path("/path/to/data/", "genes_ids.txt"), row.names = FALSE, col.names = FALSE, quote = FALSE)

write.table(seurat_obj@meta.data, file.path("/path/to/data/", "metadata.csv"), sep = ",", row.names = TRUE, col.names = TRUE, quote = FALSE)

m_raw = GetAssayData(seurat_obj, assay = "RNA", layer = "counts") # Take the raw counts

Matrix::writeMM(m_raw, file.path("/path/to/data/", "singlecell_matrix.mtx")) # Save it in MatrixMarket format
```

Then follow the example code below.

In [None]:

print("Load your AnnData object here")

# # Example: Load from Matrix Market format
import pandas as pd
import anndata as ad
from pathlib import Path

# This will return a pd.DataFrame
df = simicpipeline.load_from_matrix_market( 
    matrix_path=Path("./data/singlecell_matrix.mtx"),
    genes_path=Path("./data/genes_ids.txt"),
    cells_path=Path("./data/cell_ids.txt"),
    transpose=True,
    cells_index_name="Cell",
)
adata = ad.AnnData(X=df.values, obs=pd.DataFrame(index=df.index), var=pd.DataFrame(index=df.columns))

obs_meta = pd.read_csv("/path/to/data/metadata.csv", sep=",", index_col=0)  # metadata was written comma-separated
print(obs_meta.shape)
print(obs_meta.head())


Load your AnnData object here
(72650, 8)
                             sample    treatment  cell_line  \
cell                                                          
43_01_92__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_01_73__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_01_94__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_02_41__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_02_56__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   

               final_annotation final_annotation_functional  \
cell                                                          
43_01_92__s1       Cancer cells         Proliferating cells   
43_01_73__s1       Cancer cells                  Basal-like   
43_01_94__s1       Cancer cells         Proliferating cells   
43_02_41__s1  Endothelial cells           Endothelial cells   
43_02_56__s1       Cancer cells                  Basal-like   

                nn_majority_label  nn_majority_frac  flag_misplaced  
cell 

<div class="alert alert-block alert-warning">
<b>WARNING!!:</b> Make sure that the index of obs_meta matches the cell names in df.index. Otherwise it will lead to misalignment of metadata and expression data.
</div>
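The alignment requirement in the warning above can be checked explicitly before reindexing. This is a minimal sketch on toy data using only `pandas`; the names `cell_names` and `aligned` are illustrative stand-ins for `adata.obs_names` and the reordered metadata.

```python
import pandas as pd

# Toy stand-ins: cell names from the expression matrix, and metadata in a different order
cell_names = pd.Index(["cell_a", "cell_b", "cell_c"], name="Cell")
obs_meta = pd.DataFrame(
    {"treatment": ["control", "Combination", "control"]},
    index=["cell_c", "cell_a", "cell_b"],
)

# 1. Every cell in the matrix must have a metadata row
missing = cell_names.difference(obs_meta.index)
if len(missing) > 0:
    raise ValueError(f"{len(missing)} cells have no metadata, e.g. {list(missing[:5])}")

# 2. Cell IDs must be unique, otherwise .loc would return duplicated rows
if obs_meta.index.has_duplicates:
    raise ValueError("Duplicated cell IDs in the metadata index")

# 3. Reorder the metadata so that row i describes cell i of the matrix
aligned = obs_meta.loc[cell_names]
print(aligned)
```

Failing early with an explicit error is usually easier to debug than a `KeyError` raised deep inside `.loc`, or a silent mismatch discovered after imputation.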

In [None]:
# Match the obs metadata index to the adata.obs_names
obs_meta = obs_meta.loc[adata.obs_names]
adata = ad.AnnData(adata.X, obs=obs_meta, var=adata.var)
# If your data is raw, you should set it properly in AnnData object
adata.raw = adata.copy()

Alternative loading options

In [None]:
# print("Load your AnnData object here")

# # Example: Load from CSV files
# import pandas as pd
# import anndata as ad
# # This example assumes that in csv rows are cells and columns are genes
# expression_data = pd.read_csv('path/to/expression_data.csv', index_col = 0) # Column 1 as cell IDs
# # Match the observation metadata
# metadata = pd.read_csv('path/to/metadata.csv', index_col=0)
# metadata = metadata.loc[expression_data.index]
# adata = ad.AnnData(X=expression_data.values, obs=metadata, var=pd.DataFrame(index=expression_data.columns))

# # Example: Load from 10X format (read_10x_mtx is provided by scanpy, not anndata)
# import scanpy as sc
# adata = sc.read_10x_mtx('path/to/10x/directory')

# # Example: Load from h5ad file using simicpipeline function
# adata = simicpipeline.load_from_anndata('path/to/your/data.h5ad')

# If your data is raw, you should set it properly with
# print(adata.raw is None)  # True means the raw slot has not been set yet
# adata.raw = adata.copy()

In [None]:
adata

AnnData object with n_obs × n_vars = 72650 × 36774
    obs: 'sample', 'treatment', 'cell_line', 'final_annotation', 'final_annotation_functional', 'nn_majority_label', 'nn_majority_frac', 'flag_misplaced'

## Step 1.2: Initialize MAGIC Pipeline

Create a MAGIC pipeline instance:
- `input_data`: Your AnnData object. If you run the full pipeline starting from raw counts, they should be in `adata.raw.X`
- `project_dir`: Project directory where the `magic_output` directory will be created and files will be saved
- `magic_output_file`: Filename for the imputed data (default: 'magic_data_allcells_sqrt.pickle')
- `filtered`: Set to True if data is already filtered (low quality cells and genes) (default: False)

In [5]:
# This command will initialize the MAGIC pipeline and generate the output directory if it does not exist
from simicpipeline import MagicPipeline
magic_pipeline = MagicPipeline(
    input_data= adata,
    project_dir='./SimiCExampleRun',
    magic_output_file='magic_imputed.pickle',
    filtered=False
)

print(magic_pipeline)


Creating project directory: SimiCExampleRun
MagicPipeline(
  data = AnnData object with (n_obs × n_vars) = 72650 × 36774,
  filtered = False,
  imputed = False,
  magic_data = None,
  project_dir = 'SimiCExampleRun'
)


In [6]:
magic_pipeline.print_project_info()

SimiCExampleRun/
└── magic_output/


## Step 1.3: Filter Cells and Genes

Remove low-quality cells and lowly-expressed genes:
- `min_cells_per_gene`: Minimum number of cells expressing a gene (default: 10)
- `min_umis_per_cell`: Minimum total UMI counts per cell (default: 500)
<div class="alert alert-block alert-info">
<em><b>Note:</b> If your data was already filtered, you can skip this step and set the `filtered` argument to `True` in the previous step.</em>
</div>

In [98]:
magic_pipeline.filter_cells_and_genes(min_cells_per_gene = 10, min_umis_per_cell = 500)


Filtering cells and genes...
Before filtering: 72650 cells x 36774 genes
Keeping 27837/36774 genes (75.70%)
Keeping 72650/72650 cells (100.00%)
All cells pass the filter!
After filtering: 72650 cells x 27837 genes


MagicPipeline(
  data = AnnData object with (n_obs × n_vars) = 72650 × 27837,
  filtered = True,
  imputed = False,
  magic_data = None,
  project_dir = 'SimiCExampleRun'
)
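The two thresholds used above can be illustrated on a toy count matrix. This is a sketch of the general idea only, not the `MagicPipeline` implementation; in particular, computing per-cell totals on the unfiltered matrix is an assumption.

```python
import numpy as np

# Toy count matrix: 5 cells x 4 genes (rows = cells, columns = genes)
counts = np.array(
    [
        [5, 0, 3, 0],
        [2, 1, 0, 0],
        [0, 0, 4, 0],
        [1, 0, 0, 0],
        [3, 2, 6, 1],
    ]
)
min_cells_per_gene = 2
min_umis_per_cell = 3

# Keep genes detected (count > 0) in at least `min_cells_per_gene` cells
cells_per_gene = (counts > 0).sum(axis=0)
keep_genes = cells_per_gene >= min_cells_per_gene

# Keep cells whose total UMI count reaches `min_umis_per_cell`
# (assumption: totals are computed on the unfiltered matrix)
umis_per_cell = counts.sum(axis=1)
keep_cells = umis_per_cell >= min_umis_per_cell

filtered = counts[keep_cells][:, keep_genes]
print(f"Keeping {keep_genes.sum()}/{counts.shape[1]} genes")
print(f"Keeping {keep_cells.sum()}/{counts.shape[0]} cells")
print(filtered.shape)
```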

## Step 1.4: Normalize Data

Perform library size normalization with `scprep` followed by square root transformation.

<div class="alert alert-block alert-warning">
<em><b>Note:</b> this will overwrite adata.X with the normalized data and remove the adata.raw slot.</em>
</div>

In [99]:
# Note: this will overwrite adata.X with the normalized data and remove the adata.raw slot
magic_pipeline.normalize_data()


Normalizing data...
After normalization: 72650 cells x 27837 genes


MagicPipeline(
  data = AnnData object with (n_obs × n_vars) = 72650 × 27837,
  filtered = True,
  imputed = False,
  magic_data = None,
  project_dir = 'SimiCExampleRun'
)
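For intuition, library size normalization followed by a square root transform can be written in a few lines of NumPy. This is an illustrative re-implementation, not the `scprep` code; the `rescale` target of 10,000 is an assumption, so check the `scprep` documentation for the value actually used.

```python
import numpy as np


def library_size_normalize(counts: np.ndarray, rescale: float = 10_000.0) -> np.ndarray:
    """Scale each cell (row) so that its counts sum to `rescale`.

    Illustrative re-implementation; `rescale` and its default are assumptions.
    """
    library_size = counts.sum(axis=1, keepdims=True).astype(float)
    library_size[library_size == 0] = 1.0  # avoid division by zero for empty cells
    return counts / library_size * rescale


# Cells 0 and 1 have the same composition but a 10-fold difference in sequencing depth
counts = np.array([[1, 3, 0], [10, 30, 0], [0, 0, 4]])
normalized = library_size_normalize(counts)
transformed = np.sqrt(normalized)  # square root transform applied after normalization

print(normalized)
print(transformed)
```

After normalization the first two cells become identical, which is the point of the step: differences in sequencing depth no longer look like differences in expression.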

## Step 1.5: Run MAGIC Imputation
Run MAGIC imputation with default parameters.

In [100]:
magic_pipeline.run_magic(
    random_state=123,
    n_jobs=-2,  # Use all but one CPU core
    save_data=True
)


Running MAGIC imputation...
Calculating MAGIC...
  Running MAGIC on 72650 cells and 27837 genes.
  Calculating graph and diffusion operator...
    Calculating PCA...
    Calculated PCA in 83.78 seconds.
    Calculating KNN search...
    Calculated KNN search in 2.98 seconds.
    Calculating affinities...
    Calculated affinities in 7.15 seconds.
  Calculated graph and diffusion operator in 94.08 seconds.
  Running MAGIC with `solver='exact'` on 27837-dimensional data may take a long time. Consider denoising specific genes with `genes=<list-like>` or using `solver='approximate'`.
  Calculating imputation...
  Calculated imputation in 187.77 seconds.
Calculated MAGIC in 283.46 seconds.
MAGIC imputation complete:  72650 cells x 27837 genes

Saving MAGIC-imputed data to SimiCExampleRun/magic_output/magic_imputed.pickle
Saved successfully to SimiCExampleRun/magic_output/magic_imputed.pickle


MagicPipeline(
  data = AnnData object with (n_obs × n_vars) = 72650 × 27837,
  filtered = True,
  imputed = True,
  magic_data = AnnData object with n_obs × n_vars = 72650 × 27837,
  project_dir = 'SimiCExampleRun'
)

In [118]:
magic_pipeline.magic_adata.write_h5ad(magic_pipeline.magic_output_file.with_suffix('.h5ad'))


To run MAGIC imputation with custom parameters, pass them as `**kwargs`:
- `t`: Number of diffusion steps (default: 'auto')
- `knn`: Number of nearest neighbors (default: 5)
- `decay`: Decay rate for kernel (default: 1)
- `n_jobs`: Number of parallel jobs (default: -2)
- `genes`: Genes to be returned. If None or "all genes", the entire matrix is returned.
- `save_data`: Whether to automatically save the imputed data (default: True). If the `magic_output_file` extension is `.pickle`, the data is saved as a pickle; if it is `.h5ad`, it is saved as an AnnData file.

See [MAGIC documentation](https://magic.readthedocs.io/) for more parameter options.
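To build intuition for what `t`, `knn` and `decay` control, the sketch below runs a heavily simplified diffusion on a toy matrix. It is not the `magic-impute` implementation (which uses PCA, sparse kNN graphs and further refinements); `toy_diffusion_impute` is a hypothetical function written only for illustration.

```python
import numpy as np


def toy_diffusion_impute(data: np.ndarray, knn: int = 2, t: int = 3) -> np.ndarray:
    """Toy Markov-diffusion smoothing over a cell-cell graph.

    Simplified for illustration; not the magic-impute implementation.
    """
    # Pairwise Euclidean distances between cells (rows)
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Adaptive bandwidth: distance from each cell to its knn-th nearest neighbor
    # (column 0 of the sorted distances is the cell itself)
    sigma = np.sort(dist, axis=1)[:, knn]
    sigma[sigma == 0] = 1.0  # guard against duplicated cells
    affinity = np.exp(-(dist / sigma[:, None]))  # exponent 1 plays the role of decay = 1

    # Symmetrize, then row-normalize into a Markov transition matrix
    affinity = (affinity + affinity.T) / 2
    markov = affinity / affinity.sum(axis=1, keepdims=True)

    # Diffuse for t steps and apply the operator to the expression data
    return np.linalg.matrix_power(markov, t) @ data


# Two groups of cells; the zero in the first row mimics a dropout event
data = np.array(
    [[5.0, 0.0], [5.0, 5.0], [4.0, 5.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
)
imputed = toy_diffusion_impute(data, knn=2, t=3)
print(np.round(imputed, 2))
```

In this toy version, a larger `t` smooths more aggressively and a larger `knn` widens each cell's neighborhood. Every imputed value is a weighted average of observed values, so it always stays within the observed range of its gene.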

## Step 1.6: Check Pipeline Status

In [7]:
magic_pipeline.print_project_info(max_depth=2)

SimiCExampleRun/
└── magic_output/
    ├── magic_imputed.h5ad
    └── magic_imputed.pickle


<div class="alert alert-block alert-success">
<b>Success!</b> MAGIC imputation is complete. The imputed data is saved in the magic_output directory.
</div>


# Part 2: Experiment Setup and Gene Selection
<a id='experiment'></a>

Now we will select top variable genes and prepare input files for SimiCPipeline. 
<div class="alert alert-block alert-info">

<em><b>Note:</b> if you have your data filtered, normalized and imputed with alternative methods you can start from here.</em>
</div>


## Step 2.1: Load Imputed Data

In this example we will start from the imputed AnnData object from the MAGIC pipeline. If you saved and stopped your work, you can re-load the object with the following code:

In [None]:
# If you saved it in h5ad format, you can load it back using:
# import simicpipeline
# imputed_data = simicpipeline.load_from_anndata('./SimiCExampleRun/magic_output/magic_imputed.h5ad')

In [None]:
# # If you saved it in pickle format, you can load it back using:
# import pickle
# with open('./SimiCExampleRun/magic_output/magic_imputed.pickle', 'rb') as f:
#     imputed_data = pickle.load(f)

In [None]:
# If you continue from the previous section you can access the MAGIC-imputed AnnData object
imputed_data = magic_pipeline.magic_adata.copy()
imputed_data
# If you have imputed data from other methods, you can load it here. Make sure it is in AnnData format or pandas DataFrame (cells × genes)

AnnData object with n_obs × n_vars = 72650 × 27837
    obs: 'sample', 'treatment', 'cell_line', 'final_annotation', 'final_annotation_functional', 'nn_majority_label', 'nn_majority_frac', 'flag_misplaced'

In [3]:
print(f"Imputed data shape: {imputed_data.shape}")
print(imputed_data.obs.head())

Imputed data shape: (72650, 27837)
                             sample    treatment  cell_line  \
Cell                                                          
43_01_73__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_01_92__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_01_94__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_02_41__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   
43_02_56__s1  KPB25L-UV_Combination  Combination  KPB25L-UV   

               final_annotation final_annotation_functional  \
Cell                                                          
43_01_73__s1       Cancer cells                  Basal-like   
43_01_92__s1       Cancer cells         Proliferating cells   
43_01_94__s1       Cancer cells         Proliferating cells   
43_02_41__s1  Endothelial cells           Endothelial cells   
43_02_56__s1       Cancer cells                  Basal-like   

                nn_majority_label  nn_majority_frac  flag_misplaced  
Cell       

## Step 2.2: Initialize Experiment Setup

Create an experiment setup instance and directories:
- `input_data`: Your imputed AnnData object or pandas DataFrame (cells × genes)
- `tf_path`: Path to transcription factor (TF) list file (.csv or .txt)
- `project_dir`: Directory where experiment files will be saved

<div class="alert alert-block alert-info">

<em><b>Note:</b> In case you do not have a TF list: 
</em>
</div>

We provide a mouse TF list in the package `data` folder that you can copy into your working data directory.
- The mouse TF list was downloaded in December 2024 from [AnimalTFDB4](https://guolab.wchscu.cn/AnimalTFDB4_static/download/TF_list_final/Mus_musculus_TF)
- The human TF list was downloaded in February 2026 from [XX](#)


In this tutorial we are working with mouse data, so we will use the TF list from `AnimalTFDB4`.

In [4]:
from importlib.resources import files
import pandas as pd
p2tf = files("simicpipeline.data").joinpath("Mus_musculus_TF.txt")
mouse_TF_df = pd.read_csv(p2tf, sep='\t')
mouse_TF = mouse_TF_df['Symbol']
mouse_TF.to_csv('./data/TF_list.csv', index=False, header=False)

In [16]:
# Initialize ExperimentSetup
from simicpipeline import ExperimentSetup
experiment = ExperimentSetup(
    input_data = imputed_data, 
    tf_path = "./data/TF_list.csv", # Should have no header
    project_dir='./SimiCExampleRun'
)

print(f"Matrix shape: {experiment.matrix.shape}")
print(f"Number of cells: {len(experiment.cell_names)}")
print(f"Number of genes: {len(experiment.gene_names)}")
print(f"Number of TFs: {len(experiment.tf_list)}")
print(f"... Example TF names: {experiment.tf_list[0:5]}\n")
print("\n" + "="*70)
print(f"Current directory status")
print("="*70 + "\n")
experiment.print_project_info(max_depth=2)

Matrix shape: (72650, 27837)
Number of cells: 72650
Number of genes: 27837
Number of TFs: 1611
... Example TF names: ['Lin28b', 'Tbx2', 'Dmtf1', 'Irx4', 'Irf3']


Current directory status

SimiCExampleRun/
├── inputFiles/
├── magic_output/
│   ├── magic_imputed.h5ad
│   └── magic_imputed.pickle
└── outputSimic/
    ├── figures/
    └── matrices/


The previous code automatically creates the SimiC directory structure:
```
project_dir/
├── inputFiles/       # Input files for SimiC
└── outputSimic/      # Output files generated by SimiCPipeline
    ├── figures/      # For future visualizations
    └── matrices/     # For future results
```

## Step 2.3: Calculate MAD and Select Genes

Select top variable genes based on Median Absolute Deviation (MAD):
- `n_tfs`: Number of top TF genes to select (default: 100)
- `n_targets`: Number of top target genes to select (default: 1000)

The method returns a tuple `(TF_list, TARGET_list)`.

In [6]:
tf_list, target_list = experiment.calculate_mad_genes(
    n_tfs=100,
    n_targets=1000
)

print(f"Selected {len(tf_list)} TFs")
print(f"Selected {len(target_list)} targets")
print(f"\nTop 10 TFs: {tf_list[:10]}")
print(f"\nTop 10 targets: {target_list[:10]}")

Removing 0 targets with MAD = 0
Selecting top 1000 targets based on MAD.
Selected 100 TFs
Selected 1000 targets

Top 10 TFs: ['Hmga2', 'Zfpm2', 'Bnc2', 'Glis3', 'Zeb2', 'Pbx1', 'Ebf1', 'Zeb1', 'Mecom', 'Nfib']

Top 10 targets: ['Rn18s-rs5', 'Malat1', 'Cmss1', 'Lamp2', 'Brinp3', 'Rad51b', 'Ccbe1', 'Tenm4', 'Nop58', 'Xist']
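The MAD of a gene is the median, across cells, of the absolute deviation from that gene's median expression. The sketch below shows this ranking on a toy matrix; `top_mad_genes` is a hypothetical helper and not the `ExperimentSetup` implementation, and how candidates are split into TFs and targets is an assumption.

```python
import pandas as pd


def top_mad_genes(expr: pd.DataFrame, candidates: list, n_top: int) -> list:
    """Rank `candidates` by median absolute deviation across cells, keep the top `n_top`.

    Hypothetical helper for illustration; not the ExperimentSetup implementation.
    """
    sub = expr[candidates]
    mad = (sub - sub.median(axis=0)).abs().median(axis=0)
    mad = mad[mad > 0]  # genes with MAD = 0 carry no ranking information
    return mad.sort_values(ascending=False).head(n_top).index.tolist()


# Toy imputed matrix: 5 cells x 4 genes (rows = cells, columns = genes)
expr = pd.DataFrame(
    {
        "TfA": [0.0, 1.0, 2.0, 3.0, 4.0],
        "TfB": [1.0, 1.0, 1.0, 1.0, 1.0],  # constant, so MAD = 0
        "Gene1": [0.0, 10.0, 20.0, 30.0, 40.0],
        "Gene2": [5.0, 5.5, 6.0, 6.5, 7.0],
    }
)
tf_candidates = ["TfA", "TfB"]
target_candidates = ["Gene1", "Gene2"]

print(top_mad_genes(expr, tf_candidates, n_top=1))
print(top_mad_genes(expr, target_candidates, n_top=2))
```

Because MAD is based on medians, it is less sensitive than the variance to a handful of cells with extreme expression.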


## Step 2.4: Subset Data to Selected Genes

Create a subset of your data containing only the selected TFs and targets.

In [7]:
# Combine TF and target lists
import anndata as ad
selected_genes = tf_list + target_list

# Subset the data
if isinstance(imputed_data, ad.AnnData):
    subset_data = imputed_data[:, selected_genes].copy()
elif isinstance(imputed_data, pd.DataFrame):
    subset_data = imputed_data[selected_genes].copy()

print(f"Subset data shape: {subset_data.shape}")

Subset data shape: (72650, 1100)
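In the run above the two lists did not overlap (100 TFs + 1000 targets = 1100 columns). If your gene lists come from another source, a gene may appear in both, and simple concatenation would then duplicate its column. A minimal sketch of an order-preserving guard, using toy gene lists chosen only for illustration:

```python
import pandas as pd

tf_list = ["Zeb1", "Pbx1"]
target_list = ["Malat1", "Zeb1", "Xist"]  # "Zeb1" is deliberately repeated

# dict.fromkeys keeps the first occurrence of each name and preserves order
selected_genes = list(dict.fromkeys(tf_list + target_list))

# Toy matrix: 1 cell x 5 genes
expr = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0]],
    columns=["Malat1", "Pbx1", "Xist", "Zeb1", "Actb"],
)

# Fail early if any selected gene is absent from the matrix
missing = [gene for gene in selected_genes if gene not in expr.columns]
if missing:
    raise KeyError(f"Genes absent from the matrix: {missing}")

subset = expr[selected_genes]
print(selected_genes)
print(subset.shape)
```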


## Step 2.5: Save Experiment Files

Save the expression matrix and TF names in `.pickle` format and, optionally, the annotation file as `.txt`:
- `run_data`: `ad.AnnData` or `pd.DataFrame` with the data to run in SimiC (imputed and sliced according to the experiment run).
- `matrix_filename`: Filename for `run_data` (saved with row/column headers). Can be `.pickle` or `.csv`.
- `tf_filename`: Filename for the TF names list of the experiment run. Can be `.pickle` or `.csv`. Even if you have a general TF list file, this function saves only the TFs selected by MAD that are found in your `run_data`.
- `annotation`: `str` (optional). If `run_data` is `ad.AnnData` and `annotation` is in `run_data.obs.columns`, a `.txt` file with the phenotype annotations needed for SimiC is created with `index = False`, `header = False`.

All files are saved in the `inputFiles/` directory.

We recommend the pickle format because it loads and saves faster and uses less disk space.
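The file formats described above can be reproduced with plain `pandas`, which may help if you prepare the inputs yourself. This is a sketch under assumptions: the integer encoding of the labels (alphabetically sorted categories) is illustrative, and the encoding used by `save_experiment_files` may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy expression matrix (cells x genes) and one phenotype label per cell
expr = pd.DataFrame(
    [[0.5, 1.2], [0.0, 2.4], [3.1, 0.7]],
    index=["cell_a", "cell_b", "cell_c"],
    columns=["Zeb1", "Malat1"],
)
treatment = pd.Series(["control", "Combination", "control"], index=expr.index)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Pickle keeps dtypes and row/column labels and reloads without text parsing
    expr.to_pickle(out_dir / "expression_matrix.pickle")
    # CSV is human-readable but has to be parsed again on load
    expr.to_csv(out_dir / "expression_matrix.csv")

    # Encode labels as integer codes (assumed encoding: sorted category order),
    # one label per line, in the same order as the matrix rows
    codes = treatment.astype("category").cat.codes
    codes.to_csv(out_dir / "treatment_annotation.txt", index=False, header=False)

    reloaded = pd.read_pickle(out_dir / "expression_matrix.pickle")
    annotation_lines = (out_dir / "treatment_annotation.txt").read_text().split()

print(reloaded.equals(expr))
print(annotation_lines)
```

Only unpickle files that you created yourself or that come from a trusted source, since loading a pickle can execute arbitrary code.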

In [None]:
# This takes 1.5 mins
experiment.save_experiment_files(
    run_data = subset_data,
    matrix_filename = 'expression_matrix.csv',
    tf_filename = 'TF_list.csv',
    annotation = 'groups' # Will raise a warning if the column is not found in subset_data.obs.columns
)

Saved expression matrix to SimiCExampleRun/inputFiles/expression_matrix.csv
Saved 100 TFs to SimiCExampleRun/inputFiles/TF_list.csv


Available columns:
 ['sample', 'treatment', 'cell_line', 'final_annotation', 'final_annotation_functional', 'nn_majority_label', 'nn_majority_frac', 'flag_misplaced']
Please manually provide an appropriate annotation file to SimiCPipeline in SimiCExampleRun/inputFiles

-------

Experiment files saved successfully.

-------



In [9]:
experiment.save_experiment_files(
    run_data = subset_data,
    matrix_filename = 'expression_matrix.pickle',
    tf_filename = 'TF_list.csv', # Will raise warning if file already exists and overwrite
    annotation = 'treatment' 
)

Saved expression matrix to SimiCExampleRun/inputFiles/expression_matrix.pickle

Saved 100 TFs to SimiCExampleRun/inputFiles/TF_list.csv

-------

Annotation 'treatment' found in obs columns!

Annotation distribution: {0: 17259, 1: 20491, 2: 15426, 3: 19474}
Saved annotation to SimiCExampleRun/inputFiles/treatment_annotation.txt

-------

Experiment files saved successfully.

-------



<div class="alert alert-block alert-success">
<b>Success!</b> All preprocessing steps completed. Your files are ready for SimiC analysis.
</div>


## Step 2.6: Verify Saved Files

Check that all files were created correctly.

In [31]:
experiment.print_project_info(max_depth=2)

SimiCExampleRun/
├── inputFiles/
│   ├── TF_list.csv
│   ├── expression_matrix.csv
│   ├── expression_matrix.pickle
│   └── treatment_annotation.txt
├── magic_output/
│   ├── magic_imputed.h5ad
│   └── magic_imputed.pickle
└── outputSimic/
    ├── figures/
    └── matrices/
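The same check can be scripted, for example at the start of an analysis script. A minimal sketch using only the standard library; `missing_inputs` is a hypothetical helper, and the expected file names are the ones used in this tutorial.

```python
import tempfile
from pathlib import Path


def missing_inputs(project_dir: Path, expected: list) -> list:
    """Return the expected input files that are absent from `project_dir/inputFiles`."""
    input_dir = Path(project_dir) / "inputFiles"
    return [name for name in expected if not (input_dir / name).is_file()]


expected_files = ["expression_matrix.pickle", "TF_list.csv", "treatment_annotation.txt"]

# Demonstration on a throwaway directory; point `project_dir` at your own run instead
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp)
    (project_dir / "inputFiles").mkdir()
    (project_dir / "inputFiles" / "TF_list.csv").write_text("Zeb1\n")
    result = missing_inputs(project_dir, expected_files)

print(result)
```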


# Summary

This tutorial covered:

✓ Loading and filtering scRNA-seq data

✓ Running MAGIC imputation

✓ Selecting top variable genes using MAD

✓ Preparing input files for SimiC analysis with proper directory structure

### Output Directory Structure

Your output directory now contains:
```
SimiCExampleRun/
├── inputFiles/
│   ├── TF_list.csv
│   ├── expression_matrix.csv
│   ├── expression_matrix.pickle
│   └── treatment_annotation.txt
├── magic_output/
│   ├── magic_imputed.h5ad
│   └── magic_imputed.pickle
└── outputSimic/
    ├── figures/
    └── matrices/
```


# Alternative Approach

In the previous section we used the whole MAGIC-imputed matrix (obtained in [Part 1](#part1)) and selected the top MAD genes, but you may want to run SimiC on a subset of cells from your data.

In general, we recommend imputing the whole dataset, especially if it was generated in the same sequencing batch, because MAGIC will have more context to impute from. However, we acknowledge that every experiment or dataset is different and may require a different approach.

If you want to run SimiCPipeline on a subset of cells, slice the AnnData object after imputation and before you initialize the `ExperimentSetup` class, so that MAD genes are calculated over your cells of interest.

Following the steps in this tutorial is not required to run `SimiCPipeline`, but it is recommended because it simplifies the process. Just make sure that:

1. You have **cell assignments**: if not created before, prepare a file with cell phenotype labels in the same order as the rows of the expression matrix you will use in SimiC.
2. You have an **expression matrix** in pickle or csv format.
3. You have a **TF list** in pickle or csv format containing the TF genes present in your expression matrix.
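If you prepare these three inputs by hand, a quick consistency check can catch mismatches before a long run. The sketch below uses a hypothetical helper, `find_input_problems`, which is not part of `simicpipeline`, on toy data.

```python
import pandas as pd


def find_input_problems(expr: pd.DataFrame, tf_names: list, labels: list) -> list:
    """Return a list of human-readable problems; an empty list means the inputs agree.

    Hypothetical helper for illustration; not part of simicpipeline.
    """
    problems = []
    if len(labels) != expr.shape[0]:
        problems.append(f"{len(labels)} labels for {expr.shape[0]} cells")
    absent = [tf for tf in tf_names if tf not in expr.columns]
    if absent:
        problems.append(f"TFs missing from the expression matrix: {absent}")
    if expr.columns.has_duplicates:
        problems.append("Duplicated gene names in the expression matrix")
    return problems


# Toy expression matrix: 2 cells x 3 genes
expr = pd.DataFrame(
    [[0.5, 1.2, 0.0], [0.0, 2.4, 1.1]],
    index=["cell_a", "cell_b"],
    columns=["Zeb1", "Malat1", "Xist"],
)

print(find_input_problems(expr, tf_names=["Zeb1"], labels=[0, 1]))
print(find_input_problems(expr, tf_names=["Pbx1"], labels=[0]))
```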


Below we show how to subset your cells and save the files with `ExperimentSetup`.

In [30]:
imputed_data.obs

Cell,sample,treatment,cell_line,final_annotation,final_annotation_functional,nn_majority_label,nn_majority_frac,flag_misplaced
43_01_73__s1,KPB25L-UV_Combination,Combination,KPB25L-UV,Cancer cells,Basal-like,Basal-like,1.0,False
43_01_92__s1,KPB25L-UV_Combination,Combination,KPB25L-UV,Cancer cells,Proliferating cells,Proliferating cells,1.0,False
43_01_94__s1,KPB25L-UV_Combination,Combination,KPB25L-UV,Cancer cells,Proliferating cells,Proliferating cells,1.0,False
43_02_41__s1,KPB25L-UV_Combination,Combination,KPB25L-UV,Endothelial cells,Endothelial cells,Endothelial cells,1.0,False
43_02_56__s1,KPB25L-UV_Combination,Combination,KPB25L-UV,Cancer cells,Basal-like,Basal-like,0.9,False
...,...,...,...,...,...,...,...,...
06_92_89__s8,KPB25L_control,control,KPB25L,Cancer cells,Basal-like,Basal-like,1.0,False
06_94_61__s8,KPB25L_control,control,KPB25L,Macrophages,Macrophages,Macrophages,1.0,False
06_96_48__s8,KPB25L_control,control,KPB25L,Cancer cells,Unknown,Unknown,1.0,False
06_96_58__s8,KPB25L_control,control,KPB25L,Macrophages,Macrophages,Macrophages,1.0,False


In [32]:
from simicpipeline import ExperimentSetup
cell_mask = imputed_data.obs['cell_line'].isin(["KPB25L"]) & imputed_data.obs['final_annotation_functional'].isin(['Proliferating cells','Basal-like'])
print(f"Number of selected cells: {cell_mask.sum()}")
subset_imputed_data = imputed_data[cell_mask,:].copy()

print(f"Subset matrix shape: {subset_imputed_data.shape}")

experiment2 = ExperimentSetup(
    input_data = subset_imputed_data, 
    tf_path = "./data/TF_list.csv", # Should have no header
    project_dir='./SimiCExampleRun/KPB25L/Tumor'
)
experiment2.print_project_info(max_depth=1)
# Then follow the same steps as above to select genes and save experiment files

Number of selected cells: 21490
Subset matrix shape: (21490, 27837)
Creating project directory: SimiCExampleRun/KPB25L/Tumor
Tumor/
├── inputFiles/
└── outputSimic/


However, the initial project directory will now look like this:


In [33]:
experiment.print_project_info(max_depth=4)

SimiCExampleRun/
├── KPB25L/
│   └── Tumor/
│       ├── inputFiles/
│       └── outputSimic/
│           ├── figures/
│           └── matrices/
├── inputFiles/
│   ├── TF_list.csv
│   ├── expression_matrix.csv
│   ├── expression_matrix.pickle
│   └── treatment_annotation.txt
├── magic_output/
│   ├── magic_imputed.h5ad
│   └── magic_imputed.pickle
└── outputSimic/
    ├── figures/
    └── matrices/
        └── full_experiment/


Repeat the steps to calculate MAD genes and save the files:

In [34]:
tf_list, target_list = experiment2.calculate_mad_genes(
    n_tfs=100,
    n_targets=1000
)
# Combine TF and target lists
selected_genes2 = tf_list + target_list
subset_experiment_data = subset_imputed_data[:, selected_genes2].copy()

experiment2.save_experiment_files(
    run_data = subset_experiment_data,
    matrix_filename = 'expression_matrix.pickle',
    tf_filename = 'TF_list.csv', # Will raise warning if file already exists and overwrite
    annotation = 'treatment' 
)
print(f"Subset data shape: {subset_experiment_data.shape}")

Removing 8 targets with MAD = 0
Selecting top 1000 targets based on MAD.
Saved expression matrix to SimiCExampleRun/KPB25L/Tumor/inputFiles/expression_matrix.pickle
Saved 100 TFs to SimiCExampleRun/KPB25L/Tumor/inputFiles/TF_list.csv

-------

Annotation 'treatment' found in obs columns!

Annotation distribution: {0: 4392, 1: 5895, 2: 4964, 3: 6239}
Saved annotation to SimiCExampleRun/KPB25L/Tumor/inputFiles/treatment_annotation.txt

-------

Experiment files saved successfully.

-------

Subset data shape: (21490, 1100)


In [35]:
experiment2.print_project_info()

Tumor/
├── inputFiles/
│   ├── TF_list.csv
│   ├── expression_matrix.pickle
│   └── treatment_annotation.txt
└── outputSimic/
    ├── figures/
    └── matrices/



# Next steps:

1. **Run SimiC**: Use `SimiCPipeline` class to run SimiC.
2. **Explore results**: Use `SimicVisualization` class to analyze GRNs and TF activities.

Check `Tutorial_SimiCPipeline_full.ipynb` or `Tutorial_SimiCPipeline_visualization` for guided walkthroughs.


### Final Notes

<div class="alert alert-block alert-info">
<b>Data Format:</b> All matrices are stored as cells × genes (rows = cells, columns = genes)
</div>

<div class="alert alert-block alert-warning">
<b>Memory Usage:</b> MAGIC imputation can be memory-intensive for large datasets. Consider using a machine with sufficient RAM and adjusting the MAGIC parameters (n_jobs, knn, t).
</div>
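A rough lower bound on the memory needed can be estimated from the matrix dimensions, because the imputed matrix is dense. The sketch below assumes 8 bytes per value (`float64`) unless stated otherwise; MAGIC needs additional working memory on top of this, so treat the result as a minimum.

```python
def dense_matrix_gib(n_cells: int, n_genes: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense cells x genes matrix."""
    return n_cells * n_genes * bytes_per_value / 1024 ** 3


# Dimensions of the filtered matrix used in this tutorial
print(f"float64: {dense_matrix_gib(72650, 27837):.1f} GiB")
print(f"float32: {dense_matrix_gib(72650, 27837, bytes_per_value=4):.1f} GiB")
```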

<div class="alert alert-block alert-info">
<b>Please note:</b> Although you can pass custom file/directory paths, we highly recommend following the directory structure described above and completing this tutorial before running SimiC to avoid errors.
</div>