STIGMA

Introduction

NGS has made gene analysis routine in clinics, yet many variants in genes of unknown function are labeled as variants of uncertain significance (VUSs) and do not contribute to a diagnosis of rare diseases, until further validated by experimental verification. Gene prioritization can help narrow down the list of candidate genes under consideration. .

Here we introduce single-cell tissue-specific gene prioritization using machine learning (STIGMA) to prioritize disease gene for congenital malformations. STIGMA predicts the disease-causing probability of genes based on their expression profiles across cell types, while considering the temporal dynamics during the embryogenesis of a healthy (wild-type) organism.

Dependencies

R:

Seurat
Monocle3
dplyr
splines
biomaRt
AnnotationHub
GenomicRanges
BSgenome
tidyr
GenomicFeatures
TxDb.Hsapiens.UCSC.hg38.knownGene

Python

sklearn
imblern
pandas
numpy
math
sys
itertools
matplotlib

Inorder to run STIGMA we need to prepare the input matrix which consists of single cell features and gene intrinsic propersties, which can be done by following STEP1 and STEP2. STEP3 Formats the input.

Input Feature preprocessing

STEP1a: Fetching Single Cell Features from Seurat:

Load the Seurat object after clustering or subclustering.
Utilize the 'AverageExpression' function to acquire cluster-specific gene expression.
Employ the 'HVFInfo' function to compute variance in expression within each cluster.
Determine the percentage of cells expressing the gene in each sub-cluster (PrctCellExpringGene).
Calculate the fold-change in expression between each sub-cluster and the remaining cells (FoldChange).

STEP1b: Fetching Single Cell temporal Features from Monocle3:

Proceed with the trajectory analysis pipeline, following Monocle's methodology for each partition/cluster.
Bin the generated pseudo time and incorporate it as metadata into the Seurat object.


pseudo <- data.frame(pseudotime(monocleobject, reduction_method = "UMAP"))
colnames(pseudo)[1] <- 'pseudotime'
seuratobject@meta.data$pseudotime<-pseudo$pseudotime[match(colnames(monocleobject), rownames(seuratobject))]
seuratobject@meta.data$pseudotime.bin <- as.numeric(findInterval(seuratobject$pseudotime, quantile(seuratobject$pseudotime, seq(0,1, 1/10),na.rm=T)))

Then, compute the average expression of all genes for each pseudo time bin column in the metadata.


avg_expr <- data.frame(AverageExpression(object=seuratobject, assays='RNA', slot='counts', group.by='pseudotime.bin'))
write.table(avg_expr,'Input_bsplines.tsv', sep='\t')

Fit a spline by parsing Input_bsplines.tsv


Rscript featurePreprocess/Bsplines.R Input_bsplines.tsv.

STEP2: Fetching Gene Intrinsic Properties:

Extract gene intrinsic properties, such as promoter GC content, by executing

  	Rscript featurePreprocess/PromoterGC.R.

Obtain gene constraints, including metrics like pLI, pNull, pRec, syn_Z, mis_Z, and lof_Z, for protein-coding genes from gnomAD. In the paper we have used an older version of gnomad(v.2.1.1).
Retrieve gene GC content from Biomart.

STEP3: Creating the input matrix:

Once the features are obtained, they are stored as a tsv with column as features and rows as genes.
Annotate the genes with positive(Known disease gene) and negative classes(House keeping genes).

STEP4: STIGMA optimization and gene prioritization

STIGMA gene prediction model can be optimized by parsing input.tsv
```
python3 model/RandomForest_optimization.py input.tsv
```

2. STIGMA gene prediction model to predict test genes with [input.tsv](https://github.com/SpielmannLab/STIGMA/blob/main/sample_dataset/input.tsv) by running <br />

python3 model/rf_model.py --inputfeature_matrix=input.tsv --candidategenes=candidate_genes.tsv --n_estimators=<Output from optimization> --max_depth=<Output from optimization> --min_samples_split=<Output from optimization> --min_samples_leaf=<Output from optimization> --max_features=<Output from optimization> --bootstrap=<Output from optimization> --n_neighbors=<Output from optimization>

## STEP5: STIGMA validation <br />
1. To optain the phenotypes associated with the congenital disease based on Monarch Initiative <br />

python3 validation/monarch_analysis.py


[1] Absolute paths are used at certain instances, which will need to be adapted, as needed. <br />
[2] Anaconda was used to set up the necessary environments <br />

## Citations
1. Balachandran, S., Prada-Medina, C.A., Mensah, M.A., Kakar, N., Nagel, I., Pozojevic, J., Audain, E., Hitz, M.-P., Kircher, M., Sreenivasan, V.K.A., et al. (2024). STIGMA: Single-cell tissue-specific gene prioritization using machine learning. Am J Hum Genet, S0002-9297(23)00443-3. https://doi.org/10.1016/j.ajhg.2023.12.011.

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
featurePreprocess		featurePreprocess
model		model
sample_dataset		sample_dataset
validation		validation
GraphicalAbstract.png		GraphicalAbstract.png
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

STIGMA

Introduction

Dependencies

R:

Python

Input Feature preprocessing

STEP1a: Fetching Single Cell Features from Seurat:

STEP1b: Fetching Single Cell temporal Features from Monocle3:

STEP2: Fetching Gene Intrinsic Properties:

STEP3: Creating the input matrix:

STEP4: STIGMA optimization and gene prioritization

About

Releases

Packages

Languages

SpielmannLab/STIGMA

Folders and files

Latest commit

History

Repository files navigation

STIGMA

Introduction

Dependencies

R:

Python

Input Feature preprocessing

STEP1a: Fetching Single Cell Features from Seurat:

STEP1b: Fetching Single Cell temporal Features from Monocle3:

STEP2: Fetching Gene Intrinsic Properties:

STEP3: Creating the input matrix:

STEP4: STIGMA optimization and gene prioritization

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages