NCBI-Codeathons/automated-sc-RNA-seq-analysis-in-the-cloud

Automated scRNA-seq Analysis in the Cloud

What’s the problem?

Currently, analysing sc-RNA-seq data is difficult without a broad set of technical skills. There is no single workflow that guides the user from data inputs to relevant analysis, and none that produces a user-friendly output.

Using existing tools, we set out to create a linear workflow that performs basic QC (filtering and normalization) followed by automated annotation. We used the existing Tabula Muris Senis database as the starting point for the labelling step, but we intend to use other datasets, in particular those that are part of the Human Cell Atlas (HCA), when they become available.
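The normalization step is not specified in detail here. As a minimal sketch, assuming the common approach of scaling each cell to a fixed total count and then applying log(1 + x) (in Scanpy this would typically be `sc.pp.normalize_total` followed by `sc.pp.log1p`), the arithmetic looks like the following. The function name and the `target_sum` default are illustrative assumptions, not taken from this repository.

```python
import math


def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p.

    `counts` is a list of rows, one row per cell and one column per gene.
    This is an illustrative helper, not code from this repository.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells with no counts stay all-zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized


if __name__ == "__main__":
    toy_counts = [[1, 1], [3, 1], [0, 0]]
    for row in normalize_log1p(toy_counts, target_sum=2):
        print(row)
```

Scaling before the log transform makes cells with different sequencing depths comparable, which is why the per-cell total is used as the divisor.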

Objective: Build a semi-automated sc-RNA-seq analysis workflow in the cloud that takes raw, unprocessed data and outputs a processed file annotated with OnClass, using Tabula Muris Senis as the reference database for the annotations.

What is Tabula Muris Senis?

Tabula Muris Senis is a comprehensive resource for the cell biology community which offers a detailed molecular and cell-type specific portrait of aging.

What is OnClass?

OnClass is a Python package for single-cell cell type annotation. It uses the Cell Ontology to capture the similarity between cell types, which lets it label cells in a new dataset whether or not their cell types are present in the training data.

What's in this repo?

There are three related Python projects here:

  • [webapp][webapp/] is a simple Flask app that uses the Docker containers defined in the other two projects:
  • [context_processing][context_processing]
  • and [context_annotations][context_annotations].

To download sample data:

./download-data.sh

To run the Flask app:

cd webapp
pip install -r requirements.txt
./start.sh

The app uses images we have pushed to Docker Hub. To rebuild the image locally and run it with the samples in data/:

./build-and-run-image.sh

There are some tests:

  • test-ci.sh is fast and is run by GitHub: we should make sure we have a green checkmark before merging!
  • test-local.sh exercises all the scripts, and may be much slower. It should run successfully on a fresh checkout.

Roadmap:

This project began at the Single Cell Hackathon, NYGC, January 15-17, 2020. It runs in a local development environment, but it is a long way from being something that could be deployed in the cloud. We've created issues for some of the next steps.

TMS2

  1. Input gene counts and metadata .h5ad
    1. Preprocessing
  2. Process data using Scanpy
    1. Minimum number of reads
    2. Minimum number of genes
    3. Minimum number of cells
  3. Visualization
    1. Utilizing CZ Biohub cellxgene tool
  4. Annotations
    1. Label Propagation
    2. SCVI & OnClass

Dependencies:

  • Scanpy
  • Docker
  • NumPy
  • IPython
  • Louvain
  • Leidenalg
  • python-igraph
  • OnClass

Input file format

.h5ad
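Because .h5ad files are stored as HDF5, a cheap sanity check before starting the workflow is to confirm the extension and the HDF5 signature bytes. The helper below is a sketch with a hypothetical name, not part of this repository. It assumes the signature sits at the very start of the file, which is the usual case, although the HDF5 format also allows it at later offsets.

```python
from pathlib import Path

# The 8-byte signature that begins an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_h5ad(path):
    """Return True if `path` has a .h5ad suffix and starts with the HDF5 signature.

    Illustrative helper: it does not validate the AnnData layout inside the
    file, only that the file is plausibly HDF5.
    """
    path = Path(path)
    if path.suffix.lower() != ".h5ad" or not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

A check like this fails fast on a mislabelled upload, before any container is started.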

Where is the data?

Codeathon team:

Lead: Angela Oliveira Pisco, PhD - Chan Zuckerberg Biohub
Chuck McCallum - Harvard Medical School
Kyndal Goss - NIH Vaccine Research Center
Sanjana Shah - NIH Vaccine Research Center
Jaqueline Cattell - NIH Office of Data Science Strategy