Tabula muris on AWS

The Tabula muris data was generated by the Chan Zuckerberg Biohub and made available for anyone to use via Amazon S3.

This data collection is the dataset underlying the recent publication Transcriptomic characterization of 20 organs and tissues from mouse at single cell resolution creates a Tabula Muris. The Tabula muris project is a compendium of single cell transcriptomic data from the mouse, containing nearly 100,000 cells from 20 organs and tissues. The data allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as immune cells from distinct anatomical locations. The resource also enables contrasting two distinct technical approaches:

  • microfluidic droplet-based 3'-end counting, which provides a survey of thousands of cells per organ at relatively low coverage.
  • FACS-based full length transcript analysis, which provides higher sensitivity and coverage.

This rich collection of annotated cells will be a useful resource for:

  • Defining gene expression in previously poorly-characterized cell populations.
  • Validating findings in future targeted single-cell studies.
  • Developing methods for integrating datasets (e.g. between the FACS and droplet experiments), characterizing batch effects, and quantifying the variation of gene expression in many cell types between organs and animals.

Since late 2017, Tabula muris data have been made available to all users free of charge. AWS has made the data freely available on Amazon S3 so that anyone can download the resource to perform analysis and advance medical discovery without needing to worry about the cost of storing Tabula muris data or the time required to download it.

Learn more about how Tabula muris data is used in the project vignettes repo.

Accessing Tabula muris on AWS

The data are organized using the directory structure shown below.

czbiohub-tabula-muris
├── 10x_bam_files: BAM files for 10x droplet data
│   ├── *.bam
│   └── *.bam.bai
├── facs_bam_files: BAM files for FACS smartseq2 data
│   ├── *.bam
│   └── *.bam.bai
├── TM_droplet_mat.csv.gz
├── TM_droplet_mat.h5ad
├── TM_droplet_mat.rds
├── TM_droplet_metadata.csv
├── TM_facs_mat.csv.gz
├── TM_facs_mat.h5ad
├── TM_facs_mat.rds
└── TM_facs_metadata.csv

The unprocessed data files are stored in two folders, 10x_bam_files and facs_bam_files, according to the method used to prepare the samples (10x or FACS).

The processed data are provided in three different formats for each of the two methods:

  • .h5ad files to load in Python using anndata
  • .rds files to load in R
  • .csv.gz files for general use
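
The .csv.gz files can be read without R or anndata. The following is a minimal sketch using only the Python standard library. It assumes the matrix has a header row of column labels, followed by rows that each start with a label and continue with integer counts; this layout is not confirmed here, so check the real file before relying on it. The helper names and the tiny demo file are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def read_header(path):
    """Return the column labels from the first line of a gzipped CSV."""
    with gzip.open(path, "rt", newline="") as handle:
        return next(csv.reader(handle))


def iter_count_rows(path):
    """Yield (row_label, counts) pairs, one row at a time.

    Assumption (not confirmed by this README): the first line is a header,
    and every following line is a row label followed by integer counts.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        for row in reader:
            yield row[0], [int(value) for value in row[1:]]


# Write a tiny synthetic file so the sketch runs without downloading anything.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_mat.csv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["", "cell_a", "cell_b"])
    writer.writerow(["gene_1", 0, 3])
    writer.writerow(["gene_2", 5, 1])

demo_header = read_header(demo_path)
demo_rows = list(iter_count_rows(demo_path))
```

Because `iter_count_rows` is a generator, a large matrix can be scanned row by row without holding the whole file in memory.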

A CSV file describing all data is available for each method: s3://czbiohub-tabula-muris/TM_facs_metadata.csv for FACS and s3://czbiohub-tabula-muris/TM_droplet_metadata.csv for droplet.

If you use the AWS Command Line Interface, you can access the bucket with the command: aws s3 ls s3://czbiohub-tabula-muris
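
The same CLI can also copy individual files. As a sketch, the helpers below build `aws s3 cp` commands for the processed files in the directory listing above; they only assemble the argument lists and do not run anything. The function names are hypothetical, and the `--no-sign-request` flag (anonymous access) is included on the assumption that the bucket accepts unsigned requests; pass `anonymous=False` if you use configured credentials.

```python
BUCKET = "czbiohub-tabula-muris"
METHODS = ("facs", "droplet")
MATRIX_FORMATS = ("h5ad", "rds", "csv.gz")


def matrix_key(method, fmt):
    """Return the object key of a processed count matrix, e.g. TM_facs_mat.h5ad."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if fmt not in MATRIX_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"TM_{method}_mat.{fmt}"


def metadata_key(method):
    """Return the object key of the metadata CSV for one method."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    return f"TM_{method}_metadata.csv"


def copy_command(key, dest_dir="data", anonymous=True):
    """Build (but do not run) an `aws s3 cp` command for one object."""
    command = ["aws", "s3", "cp", f"s3://{BUCKET}/{key}", f"{dest_dir}/"]
    if anonymous:
        # Assumption: the bucket accepts unsigned (anonymous) requests.
        command.append("--no-sign-request")
    return command


# Commands for the FACS matrix (h5ad) and its metadata table.
facs_commands = [
    copy_command(matrix_key("facs", "h5ad")),
    copy_command(metadata_key("facs")),
]
```

Each list can be passed to `subprocess.run(command, check=True)` to perform the download.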

How to start using the data

Data files for R

You can download complete count files as sparse matrices in .rds format for easy loading into R. Download TM_facs_mat.rds and TM_droplet_mat.rds into the data folder. The droplet files, for example, can then be loaded as follows:

library(here)   # here() builds paths relative to the project root
library(readr)  # provides read_csv()
tm.droplet.matrix = readRDS(here("data", "TM_droplet_mat.rds"))
tm.droplet.metadata = read_csv(here("data", "TM_droplet_metadata.csv"))

Data files for Python

You can download complete count files as sparse matrices in anndata's h5ad file format for use in Python (TM_facs_mat.h5ad and TM_droplet_mat.h5ad). You can process the resulting AnnData object using, for instance, Scanpy.

import pandas as pd
import scanpy as sc

# Load the metadata table and the count matrix (as an AnnData object).
tm_facs_metadata = pd.read_csv('data/TM_facs_metadata.csv')
tm_facs_data = sc.read_h5ad('data/TM_facs_mat.h5ad')

Recover FASTQ files from BAM files

You will need to download the BAM files and then:

  • if working with the 10x dataset, download 10X's bam2fastq tool, or
  • if working with the FACS/Smart-seq2 dataset, use bamtofastq from bedtools.
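
For the FACS BAM files, the bedtools step can be sketched as below. The helper only builds the `bedtools bamtofastq` argument list (`-i` for the input BAM, `-fq` for the output FASTQ) and does not run it. The function name and the example file name are hypothetical, and the sketch writes a single FASTQ file; paired-end data would additionally need `-fq2` and a BAM sorted by read name.

```python
from pathlib import Path


def bedtools_bamtofastq_command(bam_path, out_dir="fastq"):
    """Build (but do not run) a `bedtools bamtofastq` command for one BAM file.

    The output FASTQ is named after the BAM file and placed in out_dir.
    """
    bam = Path(bam_path)
    fastq = Path(out_dir) / (bam.stem + ".fastq")
    return [
        "bedtools", "bamtofastq",
        "-i", bam.as_posix(),     # input BAM
        "-fq", fastq.as_posix(),  # output FASTQ (single-end sketch)
    ]


# Hypothetical file name, used only to show the shape of the command.
example_command = bedtools_bamtofastq_command("facs_bam_files/example.bam")
```

The output directory must exist before the command is run, for example via `subprocess.run(example_command, check=True)`.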

How to cite this dataset

If you find the Tabula muris data useful for your research, please cite our publication.

Contact

If you have questions about the data, you can create an Issue at the project repo on GitHub.

License

There are no restrictions on the use of data received from the Chan Zuckerberg Biohub, unless expressly identified prior to or at the time of receipt.