
Releases: NLM-DIR/NSForest

NS-Forest v4.1


@BeverlyPeng BeverlyPeng released this 26 Mar 21:49

[Release Note:] Improved documentation and standardized output. No algorithmic change to v4.0.

Changes in the output format

  • results saved in .pkl in addition to .csv
  • "PPV" column --> "precision" column in the results table
  • added "software_version" and "cluster_header" columns in the results table
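For result rows produced before this release, the column changes listed above could be applied with a small helper. This is a sketch only: `upgrade_row`, the example row, and its `clusterName` and `f_score` keys are hypothetical and not part of the NS-Forest API. Only the "PPV" to "precision" rename and the two added column names come from the list above.

```python
def upgrade_row(row, software_version, cluster_header):
    """Return a copy of an older result row using the v4.1 column names.

    Hypothetical helper: renames "PPV" to "precision" and fills in the two
    columns added in v4.1 if they are missing.
    """
    new_row = {("precision" if key == "PPV" else key): value
               for key, value in row.items()}
    new_row.setdefault("software_version", software_version)
    new_row.setdefault("cluster_header", cluster_header)
    return new_row


# Made-up example row; only the "PPV" key is taken from the release note.
old_row = {"clusterName": "Astro", "f_score": 0.9, "PPV": 0.95}
upgraded = upgrade_row(old_row, software_version="4.0", cluster_header="cluster")
print(upgraded)
```

The original row is left untouched, so old and new tables can be compared side by side.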

NS-Forest v4.1

Documentation: https://nsforest.readthedocs.io/en/latest/

BMC Methods Link: https://bmcmethods.biomedcentral.com/articles/10.1186/s44330-024-00015-2

Download and installation

In terminal:

git clone https://github.com/JCVenterInstitute/NSForest.git

cd NSForest

conda env create -f nsforest.yml

conda activate nsforest

pip install .

Tutorial

Follow the tutorial on readthedocs: https://nsforest.readthedocs.io/en/latest/tutorial.html

Pipeline

NS-Forest is an algorithm that identifies, for each cell type cluster from a single cell or single nucleus RNA sequencing experiment, the minimum combination of necessary and sufficient marker genes that optimizes classification accuracy. NS-Forest proceeds through the following steps (default setting):

  1. Data input: An AnnData object (e.g., .h5ad file) with cell type cluster labels.

  2. Binary score calculation: Each gene is assigned a binary score for every cluster. Binary score is a measurement of the binary expression pattern of a gene. A higher binary score means a gene is expressed in one cluster and not others. A lower binary score means a gene is expressed in many clusters and would not be an ideal candidate for a cell type-specific marker gene.

  3. Binary scoring criterion: NS-Forest then filters for genes with high binary scores. Candidate genes are selected if their binary scores are 2 standard deviations above the mean of all genes expressed in the cluster.

  4. Random forest: The top 15 binary score genes are used as input into a random forest classifier, which ranks the genes by Gini Impurity, while producing a classification model for each cluster.

  5. Decision tree evaluation: The top 6 ranked random forest genes are used as input into decision trees where all combinations of input genes are evaluated and the combination with the highest F-beta score is selected.

  6. Output: The NS-Forest algorithm outputs 1-6 marker genes per cluster along with the classification metrics (F-beta, PPV (precision), recall) and the On-Target Fraction expression metric.
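The scoring and filtering in steps 2 and 3 can be illustrated with a minimal sketch. The scoring rule used here (the average, over all other clusters, of max(0, 1 - median_other / median_target)) is an assumption made for illustration and is not taken from this release. The function names `binary_score` and `select_candidates` and the toy medians are likewise hypothetical. Only the "mean plus 2 standard deviations" cutoff comes from the description above.

```python
from statistics import mean, pstdev


def binary_score(medians, target):
    """Score how exclusively a gene is expressed in `target`.

    ASSUMED rule (illustrative only): average over the other clusters of
    max(0, 1 - median_other / median_target). `medians` maps cluster name
    to the gene's median expression in that cluster.
    """
    m_target = medians[target]
    others = [m for cluster, m in medians.items() if cluster != target]
    if m_target <= 0 or not others:
        return 0.0
    return sum(max(0.0, 1.0 - m / m_target) for m in others) / len(others)


def select_candidates(scores, n_sd=2.0):
    """Keep genes whose score exceeds mean + n_sd * SD of all scores."""
    values = list(scores.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return [gene for gene, score in scores.items() if score > cutoff]


# Toy data: one cluster-specific gene and nine uniformly expressed genes.
toy_medians = {"MARKER": {"A": 10.0, "B": 0.0, "C": 0.0}}
for i in range(1, 10):
    toy_medians[f"HK{i}"] = {"A": 5.0, "B": 5.0, "C": 5.0}

scores = {gene: binary_score(m, "A") for gene, m in toy_medians.items()}
candidates = select_candidates(scores)
print(candidates)
```

In this toy data the specific gene scores 1.0 and the uniformly expressed genes score 0.0, so only the specific gene clears the cutoff.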

NS-Forest Marker Gene Evaluation

The final module in the NS-Forest algorithm can also be used to assess the performance of any collection of marker gene combinations identified using any approach. The marker gene evaluation module includes the following steps (default setting):

  1. Data input: 1) An AnnData object (e.g., .h5ad file) with cell type cluster labels. 2) A list of marker genes for every cluster to be evaluated.

  2. Decision tree creation: One-vs-all decision trees are created for each gene in the cluster combination and evaluated for classification accuracy.

  3. Decision tree evaluation: Each gene in the cluster combination is evaluated using these decision trees to determine if the gene gives the correct classification. If even one gene in the cluster combination gives a misclassification, then the prediction is considered incorrect. Note: This strict criterion may lead to PPV = 0 when no true positive (TP) classifications are obtained.

  4. Output: The NS-Forest marker gene evaluation outputs the classification metrics (F-beta, PPV (precision), recall) and On-Target Fraction for every cluster combination, which can be used to compare against other marker gene lists.
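The strict criterion in step 3 can be read as a logical AND across the genes of a combination. The sketch below follows that reading; it is an interpretation, not the module's actual code. The function `evaluate_combination`, the fixed expression cutoffs (which stand in for the thresholds a decision tree would learn), and the toy data are all hypothetical.

```python
def evaluate_combination(expr, labels, target, cutoffs, beta=0.5):
    """Return (precision, recall, f_beta) for one marker combination.

    expr:    dict gene -> list of expression values, one per cell
    labels:  list of cluster labels, one per cell
    cutoffs: dict gene -> expression threshold (hypothetical stand-in for
             decision-tree thresholds)
    A cell is called positive only if every gene exceeds its cutoff.
    """
    tp = fp = fn = 0
    for i, label in enumerate(labels):
        predicted = all(expr[gene][i] > cut for gene, cut in cutoffs.items())
        actual = label == target
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Toy data: four target cells ("T") followed by four other cells ("N").
labels = ["T", "T", "T", "T", "N", "N", "N", "N"]
expr = {
    "A": [5, 6, 4, 0, 0, 0, 3, 0],
    "B": [2, 3, 0, 2, 0, 0, 0, 2],
    "C": [0, 0, 0, 0, 5, 5, 0, 0],
}
pair = evaluate_combination(expr, labels, "T", {"A": 1, "B": 1})
off_target = evaluate_combination(expr, labels, "T", {"C": 1})
print(pair, off_target)
```

Requiring both genes removes the false positives each gene has on its own, at the cost of recall. The last call shows the PPV = 0 case: the gene is expressed only outside the target cluster, so no true positives are obtained.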

Prerequisites

  • This is a python script written and tested in python 3.11, scanpy 1.9.6.
  • Other required libraries: numpy, pandas, sklearn, plotly, time, tqdm.

Versions and citations

Earlier versions are managed in Releases.

Version 4.0:

Liu A, Peng B, Pankajam A, Duong TE, Pryhuber G, Scheuermann RH, Zhang Y. Discovery of optimal cell type classification marker genes from single cell RNA sequencing data. BMC Methods 1, 15 (2024). https://doi.org/10.1186/s44330-024-00015-2

Version 2:

Aevermann BD, Zhang Y, Novotny M, Keshk M, Bakken TE, Miller JA, Hodge RD, Lelieveldt B, Lein ES, Scheuermann RH. A machine learning method for the discovery of minimum marker gene combinations for cell-type identification from single-cell RNA sequencing. Genome Res. 2021 Jun 4:gr.275569.121. doi: 10.1101/gr.275569.121.

Version 1.3/1.0:

Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.

License

This project is licensed under the MIT License.

Acknowledgments

  • Allen Institute of Brain Science
  • Brain Initiative Cell Census Network
  • Chan Zuckerberg Initiative
  • California Institute for Regenerative Medicine

Full Changelog: v4.0...v4.1

NS-Forest v4.0


@BeverlyPeng BeverlyPeng released this 26 Mar 22:13

[Release Note:] This version implemented the "BinaryFirst" strategy in the algorithm. Major code optimizations and modularization. Added the evaluating module.

Full Changelog: v4.0_dev...v4.0

NS-Forest v4.0_dev


@yunzhang813 yunzhang813 released this 29 Mar 16:01
6caf6c3

[Release Note:] Pre-release of NS-Forest v4.0.

Dev version of NS-Forest v4.0

Follow the tutorial to get started.

Download 'NSForest_v4dot0_dev.py' and use it in place of the version referenced in the tutorial. Sample code:

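# Assumes `adata` (an AnnData object) and `cluster_header` (str) are already defined, as in the tutorial.
# Compute per-cluster median expression; the result is read back from .varm["medians_" + cluster_header].
adata_median = preprocessing_medians(adata, cluster_header)
adata_median.varm["medians_" + cluster_header].stack().plot.hist(bins=30, title='cluster medians')

# Compute binary scores from the medians; the result is read back from .varm["binary_scores_" + cluster_header].
adata_median_binary = preprocessing_binary(adata_median, cluster_header, "medians_" + cluster_header)
adata_median_binary.varm["binary_scores_" + cluster_header].stack().plot.hist(bins=30, title='binary scores')

## make a copy of prepared adata
adata_prep = adata_median_binary.copy()

# Run NS-Forest with the "BinaryFirst_high" gene selection strategy.
NSForest(adata_prep, cluster_header=cluster_header, n_trees=1000, n_genes_eval=6,
          medians_header="medians_" + cluster_header, binary_scores_header="binary_scores_" + cluster_header,
          gene_selection="BinaryFirst_high", outputfilename="BinaryFirst_high")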

Full Changelog: v3.9...v4.0_dev

NS-Forest v3.9


@yunzhang813 yunzhang813 released this 28 Feb 21:53
f0b7bb8

[Release Note:] Major code optimizations based on algorithm v3.0. No algorithmic change to v3.0.

Parameter names changed from v3.0
[old name] = [new name]
threads = n_jobs
howManyInformativeGenes2test = n_top_genes
InformativeGenes = n_binary_genes
clusterLabelcolumnHeader = cluster_header
rfTrees = n_trees
Median_Expression_Level = median_cutoff (default set to 0)
Genes_to_testing = n_genes_eval
dataDummy = df_dummies
column = cl
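Scripts written against v3.0 could translate their keyword arguments with a small lookup table. This is a sketch under stated assumptions: `RENAMED` is copied from the list above, while the helper `translate_kwargs` is hypothetical and not part of the nsforest package. Names that are not in the list, such as `betaValue`, are passed through unchanged because the release note does not say whether they were renamed.

```python
# Old v3.0 name -> new v3.9 name, copied from the release note.
RENAMED = {
    "threads": "n_jobs",
    "howManyInformativeGenes2test": "n_top_genes",
    "InformativeGenes": "n_binary_genes",
    "clusterLabelcolumnHeader": "cluster_header",
    "rfTrees": "n_trees",
    "Median_Expression_Level": "median_cutoff",
    "Genes_to_testing": "n_genes_eval",
    "dataDummy": "df_dummies",
    "column": "cl",
}


def translate_kwargs(old_kwargs):
    """Return a new dict with v3.0 keyword names mapped to v3.9 names.

    Hypothetical helper; unknown names are kept as they are.
    """
    return {RENAMED.get(name, name): value for name, value in old_kwargs.items()}


v3_call = {"clusterLabelcolumnHeader": "louvain", "rfTrees": 1000,
           "Genes_to_testing": 6, "betaValue": 0.5}
v39_call = translate_kwargs(v3_call)
print(v39_call)
```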

Download and installation

NS-Forest can be installed using pip:
sudo pip install nsforest

If you are using a machine on which you lack administrative access, NS-Forest can be installed locally using pip:
pip install --user nsforest

NS-Forest can also be installed using conda:
conda install -c ttl074 nsforest

The package will be uploaded to the official conda channel soon.

Prerequisites:

  • This is a python script written and tested in python 3.8, scanpy 1.8.2, anndata 0.8.0.
  • Other required libraries: numpy, pandas, sklearn, itertools, time, tqdm.

Tutorial

Follow the tutorial to get started.

If you download 'NSForest_v3dot9_2.py' directly, replace the version referenced in the tutorial with this most recent one.

If you installed the pip or conda package, use the following in the tutorial:

import nsforest as ns
ns.NSForest()

Versions and citations

Earlier versions are managed in Releases.

Version 2 and beyond:

Aevermann BD, Zhang Y, Novotny M, Keshk M, Bakken TE, Miller JA, Hodge RD, Lelieveldt B, Lein ES, Scheuermann RH. A machine learning method for the discovery of minimum marker gene combinations for cell-type identification from single-cell RNA sequencing. Genome Res. 2021 Jun 4:gr.275569.121. doi: 10.1101/gr.275569.121.

Version 1.3/1.0:

Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.

License

This project is licensed under the MIT License.

Acknowledgments

  • BICCN
  • Allen Institute of Brain Science
  • Chan Zuckerberg Initiative
  • California Institute for Regenerative Medicine

What's Changed

  • Adding pip instructions and packaging files by @ttl074 in #13

Full Changelog: v3.0...v3.9

NS-Forest v3.0


@BAevermann BAevermann released this 17 Jun 18:33
07bbdbe

[Release note:] This new version of NS-Forest was redeveloped to operate directly on a scanpy object. The algorithm is essentially the same and, in testing, returns results identical to NS-Forest v2.0 when the same parameters are used.

Necessary and Sufficient Forest (NS-Forest) for Cell Type Marker Determination from cell type clusters

Getting Started

Install python 3.6 or above. Download the NSForest_v3.py file.

Prerequisites

  • This is a python script written in python 3.6. Required libraries: Numpy, Pandas, Sklearn, graphviz, numexpr, scanpy
  • A scanpy object (adata) with at least one column containing the cluster assignments. The default slot is adata.obs["louvain"]; the column can be changed with the clusterLabelcolumnHeader parameter in the function call.

Using NS-Forest v3.0

from NSForest_v3 import *

import itertools

adata_markers = NS_Forest(adata) #Runs NS_Forest on scanpy object

Markers = list(itertools.chain.from_iterable(adata_markers['NSForest_Markers'])) #gets list of minimal markers from dataframe for display in scanpy plotting functions

Binary_Markers = list(itertools.chain.from_iterable(adata_markers['Binary_Genes'])) #gets list of binary markers from dataframe for display in scanpy plotting functions

NS-Forest v3.0 parameters

NS_Forest(adata, clusterLabelcolumnHeader = "louvain", rfTrees = 1000, Median_Expression_Level = 0, Genes_to_testing = 6, betaValue = 0.5)

  • adata = scanpy object
  • rfTrees = Number of trees
  • clusterLabelcolumnHeader = column header in adata.obs['header_here!'] where cluster assignments reside. Typically 'louvain' if louvain clustering was used.
  • Median_Expression_Level = median expression level for removing negative markers
  • Genes_to_testing = number of top genes, ranked by binary score, whose combinations are evaluated by f-beta score
  • betaValue = weighting for the f-beta score. 1 gives the default f-measure; values close to zero weight toward precision; values greater than 1 weight toward recall
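The effect of betaValue can be made concrete with the standard F-beta definition. This is the general textbook formula, not copied from the NS-Forest code, and the precision and recall values of 0.9 and 0.6 in the worked example are illustrative only.

```latex
\[
F_\beta = (1+\beta^2)\,
  \frac{\mathrm{precision}\cdot\mathrm{recall}}
       {\beta^2\,\mathrm{precision}+\mathrm{recall}}
\]

% Worked example with precision = 0.9 and recall = 0.6
\begin{align*}
\beta = 0.5:&\quad F_{0.5} = \frac{1.25 \times 0.54}{0.25 \times 0.9 + 0.6}
             = \frac{0.675}{0.825} \approx 0.818 \\
\beta = 1:&\quad F_{1} = \frac{2 \times 0.54}{0.9 + 0.6} = 0.72 \\
\beta = 2:&\quad F_{2} = \frac{5 \times 0.54}{4 \times 0.9 + 0.6}
             = \frac{2.7}{4.2} \approx 0.643
\end{align*}
```

When precision is higher than recall, the score falls as beta grows, so a beta of 0.5 rewards marker combinations that are precise even if they miss some cells.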

Description

Necessary and Sufficient Forest is a method that takes cluster results from single cell/nuclei RNAseq experiments
and generates lists of minimal markers needed to define each “cell type cluster”.

The method begins by re-encoding the cluster labels into binary classifications, and Random Forest models are generated comparing each
cluster versus all. The top fifteen genes are then reranked using a score measuring how binary they are, e.g., a gene with expression in
the target cluster but no expression in the other clusters would have a high binary score. Expression cutoffs for the top six genes ranked
by binary score are then determined by generating individual decision trees and extracting the decision path information. Then all combinations
of the top six most binary genes are evaluated using f-beta score as an objective function (the beta value default set at 0.5, which weights the
f-measure score more toward precision as opposed to recall).
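The exhaustive search over the top six genes can be sketched as follows. The helper names `all_combinations` and `best_combination` are hypothetical, and `toy_score` is a made-up lookup that stands in for the f-beta evaluation of each combination. It is a sketch of the enumeration only, not the NS-Forest implementation.

```python
from itertools import combinations


def all_combinations(genes):
    """Yield every non-empty subset of `genes`, smallest subsets first."""
    for size in range(1, len(genes) + 1):
        yield from combinations(genes, size)


def best_combination(genes, score):
    """Return the subset with the highest score.

    `score` is a hypothetical callable (for example an f-beta evaluation).
    On ties, max() keeps the first subset seen, which is the smallest one.
    """
    return max(all_combinations(genes), key=score)


genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
n_combinations = sum(1 for _ in all_combinations(genes))

# Made-up scores for illustration; unlisted combinations score 0.
toy_scores = {("G1",): 0.70, ("G1", "G3"): 0.85, ("G1", "G2", "G3"): 0.80}


def toy_score(combo):
    return toy_scores.get(combo, 0.0)


best = best_combination(genes, toy_score)
print(n_combinations, best)
```

Six genes give 2^6 - 1 = 63 non-empty combinations, which is small enough to evaluate exhaustively.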

See code for detailed comments.

Versioning

This is version 3.0. The earlier releases were described in the publications below.

Version 2

Aevermann BD, Zhang Y, Novotny M, Keshk M, Bakken TE, Miller JA, Hodge RD, Lelieveldt B, Lein ES, Scheuermann RH. A machine learning method for the discovery of minimum marker gene combinations for cell-type identification from single-cell RNA sequencing. Genome Res. 2021 Jun 4:gr.275569.121. doi: 10.1101/gr.275569.121. Epub ahead of print. PMID: 34088715.

Version 1.3/1.0:

Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH.
Cell type discovery using single-cell transcriptomics: implications for ontological representation.
Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.

Authors

License

This project is licensed under the MIT License - see https://opensource.org/licenses/MIT for details

Acknowledgments

  • BICCN
  • Allen Institute of Brain Science
  • Chan Zuckerberg Initiative
  • California Institute for Regenerative Medicine

What's Changed

  • Solve issue #1 (TypeError: 'NoneType' object is not callable) by @e-sollier in #2

Full Changelog: v2.0...v3.0

NS-Forest v2.0


@BAevermann BAevermann released this 21 Nov 21:50

Necessary and Sufficient Forest (NS-Forest) for Cell Type Marker Determination from cell type clusters

Getting Started

Install Jupyter notebook and python 2.7

Prerequisites

  • The script is a Jupyter notebook in python 2.7. Required libraries: Numpy, Pandas, Sklearn, graphviz, numexpr
  • The input data is a tab delimited expression Cell x Gene matrix with one column containing the cluster assignments
  • The cluster-label column must be named "Clusters" and the labels must be non-numeric (if the labels are currently numbers, add a text prefix such as "Cl").
  • The gene identifiers used must avoid special characters such as ./-/@ or beginning with numbers (I prefix identifiers beginning with numbers and substitute all special characters with "_")
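The two naming requirements above could be met with a short preprocessing step such as the one below. It is a sketch: the function names and the "g_" prefix for identifiers that start with a digit are hypothetical choices, since the text does not say which prefix was used. Only the "Cl" label prefix and the "_" substitution come from the list above.

```python
import re


def sanitize_gene_id(name, prefix="g_"):
    """Replace special characters with "_" and prefix a leading digit.

    The default prefix "g_" is a hypothetical choice.
    """
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", name)
    if cleaned[:1].isdigit():
        cleaned = prefix + cleaned
    return cleaned


def label_cluster(label, prefix="Cl"):
    """Turn a purely numeric cluster label into text by adding a prefix."""
    text = str(label)
    return prefix + text if text.isdigit() else text


for raw in ["HLA-DRA", "7SK.2", "GAD1"]:
    print(raw, "->", sanitize_gene_id(raw))
for raw in [3, "12", "Astro"]:
    print(raw, "->", label_cluster(raw))
```

Two different identifiers can collapse to the same cleaned name (for example "A-1" and "A.1"), so it is worth checking for duplicates after the substitution.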

Description

Necessary and Sufficient Forest is a method that takes cluster results from single cell/nuclei RNAseq experiments
and generates lists of minimal markers needed to define each “cell type cluster”.

The method begins by re-encoding the cluster labels into binary classifications, and Random Forest models are generated comparing each
cluster versus all. The top fifteen genes are then reranked using a score measuring how binary they are, e.g., a gene with expression in
the target cluster but no expression in the other clusters would have a high binary score. Expression cutoffs for the top six genes ranked
by binary score are then determined by generating individual decision trees and extracting the decision path information. Then all permutations
of the top six most binary genes are evaluated using f-beta score as an objective function (the beta value default set at 0.5, which weights the
f-measure score more toward precision as opposed to recall).

See code for detailed comments.

Versioning

This is version 2.0. The initial release was version 1.3. Version 1.0 was described in:

Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH.
Cell type discovery using single-cell transcriptomics: implications for ontological representation.
Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

  • Allen Institute of Brain Science
  • Chan Zuckerberg Initiative
  • California Institute for Regenerative Medicine

NS-Forest v1.3


@BAevermann BAevermann released this 02 Nov 20:48
dad5b65

Necessary and Sufficient Forest (NS-Forest) for Cell Type Marker Determination from cell type clusters

Getting Started

Install Jupyter notebook and python 2.7

Prerequisites

  • The script is a Jupyter notebook in python 2.7. Required libraries: Numpy, Pandas, Sklearn, graphviz, numexpr
  • The input data is a tab delimited expression Cell x Gene matrix with one column containing the cluster assignments
  • The cluster-label column must be named "Clusters" and the labels must be non-numeric (if the labels are currently numbers, add a text prefix such as "Cl").
  • The gene identifiers used must avoid special characters such as ./-/@ or beginning with numbers (I prefix identifiers beginning with numbers and substitute all special characters with "_")

Description

Necessary and Sufficient Forest is a method that takes cluster results from single cell/nuclei RNAseq experiments
and generates lists of minimal markers needed to define each “cell type cluster”.

The method begins by re-encoding the cluster labels into binary classifications, and Random Forest models are generated comparing each
cluster versus all. The top ten ranked features from the Random Forest are then tested using f-measure as an objective function.
For example, during the first step all top ten features are independently evaluated for their discriminatory power at an
expression value where 75% of the cells have greater than or equal expression. Given that 25% of the cells are lost de facto,
the maximum f-measure for the first step is estimated to be around 0.87 (there will be cases where it is higher or lower, such
as having equal expression across all cells). After the best f-measure is found classifying with one gene, the remaining
nine genes are tested in combination with that top gene, again using an expression value where 75% of the cells have expression.
After the best pair of genes is found, the remaining 8 genes are tested in third position, and so on until the analysis reaches
combinations of 6 genes.
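The stepwise procedure described above is a greedy forward selection, and the first-step ceiling can be checked with a little arithmetic. The sketch below assumes perfect precision and the unweighted f-measure (F1) for the ceiling. `greedy_forward_selection`, the gene names, and the additive `toy_score` are hypothetical stand-ins for the real f-measure evaluation, not the v1.3 code.

```python
def f_measure(precision, recall):
    """Unweighted F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def greedy_forward_selection(features, score, max_size=6):
    """Add one feature at a time, always keeping the best-scoring addition.

    `score` is a hypothetical callable taking a list of features.
    Returns a list of (combination, score) pairs, one per step.
    """
    chosen, remaining, history = [], list(features), []
    while remaining and len(chosen) < max_size:
        best = max(remaining, key=lambda f: score(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
        history.append((tuple(chosen), score(chosen)))
    return history


# Ceiling for one gene when 25% of target cells fall below the cutoff:
# recall is at most 0.75, and precision is assumed to be perfect.
ceiling = f_measure(precision=1.0, recall=0.75)

# Made-up additive score over ten features, for illustration only.
weights = {f"g{i}": i for i in range(1, 11)}


def toy_score(combo):
    return sum(weights[f] for f in combo)


steps = greedy_forward_selection(list(weights), toy_score)
print(round(ceiling, 3), steps[-1][0])
```

Under these simplifying assumptions the ceiling is about 0.857, in line with the rough estimate of 0.87 quoted above. The greedy search scores 10 + 9 + 8 + 7 + 6 + 5 = 45 candidate combinations rather than every possible subset.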

See code for detailed comments.

Versioning

The initial release is version 1.3. Version 1.0 was described in:

Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH.
Cell type discovery using single-cell transcriptomics: implications for ontological representation.
Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

  • Allen Institute of Brain Science
  • Chan Zuckerberg Initiative
  • California Institute for Regenerative Medicine