MarkerMap

MarkerMap is a generative model for selecting the most informative gene markers by projecting cells into a shared, interpretable embedding without sacrificing accuracy.

Table of Contents

  1. Installation
    1. MacOS
    2. Windows
  2. Quick Start
    1. Simple Example
    2. Benchmark Example
  3. Features
  4. For Developers
    1. Adding Your Model
  5. License

Installation

MacOS

  • The easiest way to install is with pip from https://pypi.org/project/markermap/
  • Simply run pip install markermap
  • To use the persist package, you will need to install it by following the instructions at https://github.com/iancovert/persist/
  • Note: If running on Google Colab, you may have to restart the runtime after installing because scanpy installs a newer version of matplotlib than the default one.
  • Note: Smashpy pins an older version of tensorflow (==2.5.0), which creates installation problems. We therefore do not require Smashpy in order to use this package, but you will need to install it if you want to compare against that method.

Windows

  • Coming soon!

Quick Start

Simple Example

Copy the code from this section, or take a look at scripts/quick_start.py for a Python script or notebooks/quick_start.ipynb for a Jupyter Notebook.

Imports

Data is handled with numpy and scanpy, which is used for many computational biology datasets.

import numpy as np
import scanpy as sc

from markermap.vae_models import MarkerMap, train_model
from markermap.utils import (
    new_model_metrics,
    plot_confusion_matrix,
    split_data,
)

Set Parameters

Define some parameters that we will use when creating the MarkerMap.

  • z_size is the dimension of the latent space in the variational auto-encoder. We always use 16.
  • hidden_layer_size is the dimension of the hidden layers in the auto-encoder that come before and after the latent space layer. This depends on the data; a good rule of thumb is ~10% of the dimension of the input data. For the CITE-seq data, which has 500 columns, we will use 64.
  • k is the number of markers to extract.
  • batch_size is the model batch size.
  • Set file_path to wherever your data is located.

z_size = 16
hidden_layer_size = 64
k = 50
batch_size = 64

file_path = 'data/cite_seq/CITEseq.h5ad'

Data

Set file_path to wherever your data is located. We then read in the data using scanpy, which returns an AnnData object. The gene data is stored in adata.X, and the cell labels are stored in adata.obs['names']. For consistency across different datasets, we store the labels in adata.obs['annotation'], but this can be controlled with the group_by variable.

We then split the data into training, validation, and test sets with a 70%, 10%, 20% split. We can use MarkerMap.prepareData to construct the train_dataloader and val_dataloader that will be used during training.

group_by = 'annotation'
adata = sc.read_h5ad(file_path)
adata.obs[group_by] = adata.obs['names']

# we will use 70% of the data for training and 10% for validation during training of the marker map, then 20% for testing
train_indices, val_indices, test_indices = split_data(
    adata.X,
    adata.obs[group_by],
    [0.7, 0.1, 0.2],
)
train_val_indices = np.concatenate([train_indices, val_indices])

train_dataloader, val_dataloader = MarkerMap.prepareData(
    adata,
    train_indices,
    val_indices,
    group_by,
    None, #layer, just use adata.X
    batch_size=batch_size,
)

Define and Train the Model

Now it is time to define the MarkerMap. There are many hyperparameters that can be tuned here, but the most important are k and loss_tradeoff. The k parameter may require some domain knowledge, but it is fairly easy to benchmark across different levels of k, as we will see in the later examples. loss_tradeoff is also important; see the paper for further discussion. In general, there are three levels: 0 (supervised only), 0.5 (mixed supervised-unsupervised), and 1 (unsupervised only). This step may take a couple of minutes.

supervised_marker_map = MarkerMap(
    adata.X.shape[1],
    hidden_layer_size,
    z_size,
    len(adata.obs[group_by].unique()),
    k,
    loss_tradeoff=0,
)

train_model(supervised_marker_map, train_dataloader, val_dataloader)

Evaluate the model

Finally, we test the model. The new_model_metrics function trains a simple model, such as a RandomForestClassifier, on the training data restricted to the k markers, and then evaluates it on the test data. We then print the misclassification rate and the weighted F1 score, and plot a confusion matrix.

misclass_rate, test_rep, cm = new_model_metrics(
    adata[train_val_indices, :].X,
    adata[train_val_indices, :].obs[group_by],
    adata[test_indices, :].X,
    adata[test_indices, :].obs[group_by],
    markers = supervised_marker_map.markers().clone().cpu().detach().numpy(),
)

print(misclass_rate)
print(test_rep['weighted avg']['f1-score'])
plot_confusion_matrix(cm, adata.obs[group_by].unique())
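
The selected markers are column indices into adata.X; to see which genes they correspond to, you can index into adata.var_names. A small sketch, assuming your AnnData object stores gene names in var_names:

marker_indices = supervised_marker_map.markers().clone().cpu().detach().numpy()
print(adata.var_names[marker_indices])  # gene names of the selected markers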

Benchmark Example

Now we will do an example where we use the benchmarking tools of the package. Follow the steps from the Simple Example through the Data section, then pick up here. Alternatively, check out scripts/quick_start_benchmark.py for a Python script or notebooks/quick_start_benchmark.ipynb for a Jupyter Notebook.
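
In addition to the imports from the Simple Example, the benchmark example uses benchmark, plot_benchmarks, and RandomBaseline. A minimal sketch of the extra imports, assuming benchmark and plot_benchmarks live in markermap.utils and RandomBaseline in markermap.other_models (check scripts/quick_start_benchmark.py if your version differs):

from markermap.other_models import RandomBaseline
from markermap.utils import benchmark, plot_benchmarks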

Define the Models

Now it is time to define all the models that we are benchmarking. For this tutorial, we will benchmark the three versions of MarkerMap: supervised, mixed, and unsupervised. Each model in this repository comes with a getBenchmarker function where we specify all the parameters used for constructing the model and all the parameters used for training it. The benchmark function will then run and evaluate them all. For this tutorial we will also specify a train argument, max_epochs, which limits the number of epochs during training (set in the snippet below).
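
The code below assumes max_epochs has already been set; for example:

max_epochs = 100  # cap on training epochs; adjust to your time budget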

supervised_marker_map = MarkerMap.getBenchmarker(
  create_kwargs = {
    'input_size': adata.shape[1],
    'hidden_layer_size': hidden_layer_size,
    'z_size': z_size,
    'num_classes': len(adata.obs[group_by].unique()),
    'k': k, # because we are benchmarking over k, this will be overwritten
    'loss_tradeoff': 0,
  },
  train_kwargs = {
    'max_epochs': max_epochs,
  }
)

mixed_marker_map = MarkerMap.getBenchmarker(
  create_kwargs = {
    'input_size': adata.shape[1],
    'hidden_layer_size': hidden_layer_size,
    'z_size': z_size,
    'num_classes': len(adata.obs[group_by].unique()),
    'k': k, # because we are benchmarking over k, this will be overwritten
    'loss_tradeoff': 0.5,
  },
  train_kwargs = {
    'max_epochs': max_epochs,
  }
)

unsupervised_marker_map = MarkerMap.getBenchmarker(
  create_kwargs = {
    'input_size': adata.shape[1],
    'hidden_layer_size': hidden_layer_size,
    'z_size': z_size,
    'num_classes': None, #since it is unsupervised, we can just say that the number of classes is None
    'k': k, # because we are benchmarking over k, this will be overwritten
    'loss_tradeoff': 1.0,
  },
  train_kwargs = {
    'max_epochs': max_epochs,
  },
)

Run the Benchmark

Finally, we run the benchmark by passing in all the models as a dictionary, the number of times to run the models, the data and labels, the type of benchmark, and the range of values we are benchmarking over. We will set the range of k values to [10, 25, 50], but you may want to go higher in practice. Note that we also add the RandomBaseline model. This model selects k markers at random, and it is always a good idea to include it because it performs better than might be expected. It is also very fast, so there is little downside.

The benchmark function splits the data, runs each model with the specified k, then trains a simple model on just the k selected markers and evaluates how those markers perform on a test set that was not used to train either the marker selection model or the simple evaluation model. The results report accuracy and F1 score, and we can visualize them in a plot with the plot_benchmarks function.

This function will train each MarkerMap variant three times (once per value of k), for a total of 9 training runs, so it may take over 10 minutes depending on your hardware. Feel free to comment out some of the models.

k_range = [10, 25, 50]

results, benchmark_label, benchmark_range = benchmark(
  {
    'Unsupervised Marker Map': unsupervised_marker_map,
    'Supervised Marker Map': supervised_marker_map,
    'Mixed Marker Map': mixed_marker_map,
    'Baseline': RandomBaseline.getBenchmarker(train_kwargs = { 'k': k }),
  },
  1, # num_times, how many different random train/test splits to run the models on
  adata,
  benchmark='k',
  group_by=group_by,
  benchmark_range=k_range,
)

plot_benchmarks(results, benchmark_label, benchmark_range, mode='accuracy')

Features

  • The MarkerMap package provides functionality to easily benchmark different marker selection methods to evaluate performance under a number of metrics. Each model has a getBenchmarker function which takes model constructor parameters and trainer parameters and returns a model function. The benchmark function then takes all these model functions, a dataset, and the desired type of benchmarking and runs all the models to easily compare performance. See scripts/benchmark_k for examples.
  • Types of benchmarking:
    • k: benchmark over the number of markers to select.
    • label_error: given a range of percentages, randomly relabel that percentage of points in the training + validation sets, drawing the new labels from among the existing labels (see the sketch after this list).
  • To load the data, you can make use of the following functions: get_citeseq, get_mouse_brain, get_paul, and get_zeisel. Note that both get_mouse_brain and get_paul do some pre-processing, including removing outliers and normalizing the data in the case of Mouse Brain.
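
As an illustration of the label_error mode, the same benchmark call from the Benchmark Example can be reused with a different benchmark type and range. This is a sketch under the assumption that the error percentages are passed via benchmark_range in the same way as the k values:

# Sketch: benchmark robustness to label noise instead of sweeping over k.
# Reuses supervised_marker_map, adata, group_by, and k from the Benchmark Example.
label_error_range = [0.0, 0.1, 0.2]  # fraction of training + validation labels to randomize

results, benchmark_label, benchmark_range = benchmark(
  {
    'Supervised Marker Map': supervised_marker_map,
    'Baseline': RandomBaseline.getBenchmarker(train_kwargs = { 'k': k }),
  },
  1, # num_times, how many different random train/test splits to run the models on
  adata,
  benchmark='label_error',
  group_by=group_by,
  benchmark_range=label_error_range,
)

plot_benchmarks(results, benchmark_label, benchmark_range, mode='accuracy')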

For Developers

  • You will want to set up this library as an editable install.
    • Clone the repository: git clone https://github.com/Computational-Morphogenomics-Group/MarkerMap.git
    • Navigate to the MarkerMap directory: cd MarkerMap
    • Locally install the package: pip install -e . (you may have to use pip3 if your system has both python2 and python3 installed)
  • If you are going to be developing this package, also install the following: pip install pre-commit pytest
  • In the root directory, run pre-commit install. You should see a line like pre-commit installed at .git/hooks/pre-commit. Now when you commit to your local branch, the hook will run jupyter nbconvert --clear-output on all the local Jupyter notebooks on that branch. This ensures that only clean notebooks are uploaded to GitHub.
  • To run the tests, simply run pytest.

Adding Your Model

  • The MarkerMap project is built to allow for easy comparison between different types of marker selection algorithms. If you would like to benchmark your method against the others, the process is straightforward:
    • Suppose your method is called Acme. Create a class AcmeWrapper in other_models.py that extends both Acme and BenchmarkableModel (also found in other_models.py).
    • In AcmeWrapper, implement benchmarkerFunctional; check out the other wrappers in that file for the full list of arguments. The two most important are create_kwargs and train_kwargs, which are dictionaries of arguments: create_kwargs holds the arguments used when initializing the class, i.e. Acme(**create_kwargs), and train_kwargs holds the arguments used when training the model. The benchmarkerFunctional method should return an array of the indices of the selected markers (a schematic sketch follows this list).
    • Voila! Now you can define the model with AcmeWrapper.getBenchmarker(create_kwargs, train_kwargs). This will return a model function that you pass to the benchmark function. See quick_start_benchmark.py for an example.
  • If you would like to add your super duper model to the repository, create a branch and submit a Pull Request.
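
Below is a schematic sketch of such a wrapper. Acme here is a hypothetical selector defined purely for illustration (it just picks the k highest-variance genes), and the argument list of benchmarkerFunctional is illustrative; copy the exact signature from the existing wrappers in other_models.py.

import numpy as np

from markermap.other_models import BenchmarkableModel

# Hypothetical marker selector used only for this sketch.
class Acme:
    def __init__(self, k=50):
        self.k = k

    def fit(self, X):
        # pick the k genes with the highest variance
        self.markers_ = np.argsort(np.var(X, axis=0))[-self.k:]
        return self

class AcmeWrapper(Acme, BenchmarkableModel):
    @classmethod
    def benchmarkerFunctional(cls, create_kwargs, train_kwargs, X, y, train_indices, val_indices, **kwargs):
        # Illustrative argument list; match the signature expected by benchmark() in this repository.
        model = cls(**create_kwargs)      # constructed with create_kwargs, i.e. Acme(**create_kwargs)
        model.fit(X[train_indices, :])    # a real model would also use train_kwargs here
        return model.markers_             # an array of the indices of the selected markers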

License

  • This project is licensed under the terms of the MIT License.
