pigx pipeline for single-cell RNAseq
Copyright 2017-2020: Vedran Franke, Bora Uyar, Ricardo Wurmus, Altuna Akalin. This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.
PiGx scRNAseq is an analysis pipeline for preprocessing and quality control of single cell RNA sequencing experiments. The inputs are read files from the sequencing experiment and a configuration file which describes the experiment. It produces processed files for downstream analysis and interactive quality reports. The pipeline is designed to work with UMI based methods. It currently supports all methods which output paired adapter and read files. The pipeline was heavily influenced by the Dropseq pipeline from the McCarroll lab.
What does it do
- Performs quality control of reads using FastQC and MultiQC
- Automatically determines the appropriate cell number
- Constructs the digital gene expression matrix
- Calculates per sample and per cell statistics
- Prepares a quality control report
- Normalizes data and does dimensionality reduction
What does it output
- bam files
- bigwig files
- UMI and read count matrices
- Quality control report
- SingleCellExperiment object with pre-calculated statistics and dimensionality reductions
PiGx - scRNA-seq workflow
You can install this pipeline and all of its dependencies through GNU Guix:
guix package -i pigx-scrnaseq
You can also install it manually from source. You can find the latest release here. PiGx uses the GNU build system. Please make sure that all required dependencies are installed and then follow these steps after unpacking the latest release tarball:
./configure --prefix=/some/where
make install

By default the configure script expects tools to be in a directory listed in the PATH environment variable. If the tools are installed in a location that is not on the PATH you can tell the configure script about them with variables. Run ./configure --help for a list of all variables and options.
You can prepare a suitable environment with Conda or with GNU Guix.
Assuming you have Guix installed, the following command spawns a sub-shell in which all dependencies are available:
guix environment -l guix.scm
If you do not use one of these package managers, you will need to ensure that the following software is installed:
To run PiGx on your experimental data, first enter the necessary parameters in the sample sheet and the settings file (see following section). You will also need the appropriate genome sequence in fasta format and the genome annotation in gtf format. Then, from the terminal, type:
$ pigx-scrnaseq [options] sample_sheet.csv -s settings.yaml
To see all available options, type:

$ pigx-scrnaseq --help
usage: pigx-scrnaseq [-h] [-v] -s SETTINGS [-c CONFIGFILE] [--target TARGET]
                     [-n] [--graph GRAPH] [--force] [--reason] [--unlock]
                     samplesheet

PiGx scRNAseq Pipeline.

PiGx scRNAseq is a data processing pipeline for single cell RNAseq read data.

positional arguments:
  samplesheet           The sample sheet containing sample data in CSV format.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -s SETTINGS, --settings SETTINGS
                        A YAML file for settings that deviate from the defaults.
  -c CONFIGFILE, --configfile CONFIGFILE
                        The config file used for calling the underlying
                        snakemake process. By default the file 'config.json'
                        is dynamically created from the sample sheet and the
                        settings file.
  --target TARGET       Stop when the named target is completed instead of
                        running the whole pipeline. The default target is
                        "final-report". Pass "--target=help" to describe all
                        available targets.
  -n, --dry-run         Only show what work would be performed. Do not
                        actually run the pipeline.
  --graph GRAPH         Output a graph in Graphviz dot format showing the
                        relations between rules of this pipeline. You must
                        specify a graph file name such as "graph.pdf".
  --force               Force the execution of rules, even though the outputs
                        are considered fresh.
  --reason              Print the reason why a rule is executed.
  --unlock              Recover after a snakemake crash.

This pipeline was developed by the Akalin group at MDC in Berlin in 2017-2018.
The input parameters
The sample sheet is a tabular file describing the experiment. The table has the following columns:
- name - name for the sample, which will be used to label the sample in all downstream analysis
- barcode - fastq file containing the adapter sequences
- reads - fastq file containing the sequenced reads
- location of these files is specified in
- method - sequencing platform on which the experiment was performed (e.g. dropseq)
- covariates - variables which describe the samples. For example: replicate, time, hour post infection, tissue ...
Additional columns may be included which may be used as covariates in the differential expression analysis (sex, age, different treatments).
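As a rough illustration of the columns described above, the following shell sketch writes a small sample sheet and checks that the required columns are present in its header. The sample names, fastq file names, and the `replicate` covariate are hypothetical placeholders, and the sketch assumes that the header uses exactly the column names listed above (name, barcode, reads, method); compare with the sample sheet in the tests directory before relying on it.

```shell
# Write a hypothetical sample sheet; names and file names are placeholders.
sheet=$(mktemp)
cat > "$sheet" <<'EOF'
name,barcode,reads,method,replicate
WT_1,WT_1_R1.fastq.gz,WT_1_R2.fastq.gz,dropseq,1
WT_2,WT_2_R1.fastq.gz,WT_2_R2.fastq.gz,dropseq,2
EOF

# Print every required column that is missing from the header line.
# Assumes a plain comma-separated header without quoted fields.
missing_columns() {
    head -n 1 "$1" | tr -d '\r' | awk -F, '
        { for (i = 1; i <= NF; i++) seen[$i] = 1 }
        END {
            n = split("name barcode reads method", required, " ")
            for (i = 1; i <= n; i++)
                if (!(required[i] in seen)) print required[i]
        }'
}

missing=$(missing_columns "$sheet")
if [ -z "$missing" ]; then
    echo "sample sheet header looks complete"
else
    echo "missing columns:" $missing
fi
```

The check only inspects the header; it does not verify that the listed fastq files exist in the reads directory.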
The settings file is a YAML file which specifies:
- The locations of the reads (directory where fastq files are located)
- The location of the output directory
- The location of the fasta file with the reference genome (must be prepared by the user)
- The location of a GTF file with genome annotations
- Genome assembly name (e.g. mm10)
In order to get started, enter pigx-scrnaseq --init-settings my_settings.yaml. This will create a file called my_settings.yaml with the default structure. The file will look like this:

locations:
  output-dir: out/
  reads-dir: sample_data/reads/
  tempdir:

covariates: 'covariate1, covariate2, ...'

annotation:
  primary:
    genome:
      name: hg19
      fasta: sample_data/genome.fa
    gtf: sample_data/genome.gtf

execution:
  submit-to-cluster: no
  jobs: 6
  nice: 19
Single cell expression analysis is data intensive and requires substantial computing resources. The pipeline uses the STAR aligner for read mapping, so the memory requirements scale with the size of the genome. Please consult the STAR manual for concrete numbers on the memory requirements. For the human or mouse genome, STAR requires about 40 GB of RAM. The pipeline produces temporary files which require a substantial amount of disk space. Please ensure that you have at least 30 GB of disk space per 100 million sequenced reads. The location of the temporary directory can be controlled using the tempdir: variable in the settings.yaml. By default the tempdir is set to /tmp.
Important: please make sure that the temporary directory has adequate free space.
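The disk space rule of thumb above can be checked before starting a run. The sketch below is only an illustration: the read count is a hypothetical value, the function names are made up for this example, and rounding up to whole blocks of 100 million reads is an assumption chosen to err on the safe side.

```shell
# Estimate temporary disk space from the rule of thumb in the text:
# at least 30 GB per 100 million sequenced reads (rounded up to whole blocks).
required_gb() {
    reads=$1
    echo $(( (reads + 99999999) / 100000000 * 30 ))
}

# Free space, in whole GB, on the file system holding the given directory.
available_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

tempdir=${TMPDIR:-/tmp}     # should match the tempdir set in the settings file
n_reads=250000000           # hypothetical number of sequenced reads

need=$(required_gb "$n_reads")
have=$(available_gb "$tempdir")
if [ "$have" -lt "$need" ]; then
    echo "WARNING: $tempdir has ${have} GB free, about ${need} GB recommended"
else
    echo "$tempdir has ${have} GB free (about ${need} GB recommended)"
fi
```

If the check fails, pointing tempdir in the settings file at a larger volume is usually simpler than freeing space in /tmp.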
Output directory structure
The output directory structure should look like the following tree
|-- Annotation
|   `-- genome_name (i.e. mm10)
|       `-- STAR_INDEX
|-- Log
|-- Mapped
|   |-- Sample_1
|   |   `-- genome_name
|   |-- Sample_2
|   |   `-- genome_name
|   |-- Sample_3
|   |   `-- genome_name
|   `-- Sample_4
|       `-- genome_name

The Annotation folder contains the pre-processed fasta and gtf files, along with the STAR genome index. The genome fasta file is processed into a dict header. The gtf file has gene_names replaced with gene_id.
Important: We strongly advise that you check that the gtf file corresponds to the same organism and genome version as the genome fasta file. The chromosome names have to correspond exactly between the two files.
We encourage users to use both the genome annotation and the fasta file from the ENSEMBL database.
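One way to check that the chromosome names correspond is to compare the sequence names in the two files directly. The sketch below is a minimal example that is not part of the pipeline: the function name is made up, the tiny input files are invented to demonstrate a mismatch, and FASTA names are assumed to be the first word after ">" on each header line.

```shell
# Report sequence names that occur in only one of the two files.
# Arguments: genome FASTA file, GTF annotation file (both uncompressed).
compare_chrom_names() {
    fa_names=$(mktemp)
    gtf_names=$(mktemp)
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort -u > "$fa_names"
    grep -v '^#' "$2" | cut -f 1 | sort -u > "$gtf_names"
    # comm -23: lines only in the first list; comm -13: only in the second.
    comm -23 "$fa_names" "$gtf_names" | sed 's/^/only in fasta: /'
    comm -13 "$fa_names" "$gtf_names" | sed 's/^/only in gtf: /'
    rm -f "$fa_names" "$gtf_names"
}

# Tiny made-up inputs that mix naming styles ("2" versus "chr2").
workdir=$(mktemp -d)
printf '>1 dna:chromosome\nACGT\n>2 dna:chromosome\nACGT\n' > "$workdir/genome.fa"
printf '#!genome-build demo\n1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\nchr2\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$workdir/genome.gtf"

compare_chrom_names "$workdir/genome.fa" "$workdir/genome.gtf"
```

An empty result means the two sets of names match; any printed line points to a name that should be investigated before running the pipeline.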
The Log folder contains execution logs for every step of the pipeline.
The Mapped folder contains the processed single cell data for each sample. Additionally, it contains a loom file with merged expression values from all experiments, an RDS file with a saved SingleCellExperiment object, and a quality control report in html format.
Analysis results for each sample are stored in a separate subdirectory under Mapped. Structure of the analysis results:

|-- Sample1.fastq.bam
|-- Sample1_1_fastqc.html
|-- Sample1_1_fastqc.zip
|-- Sample1_2_fastqc.html
|-- Sample1_2_fastqc.zip
|-- genome_name
|   |-- adapter_trimming_report.txt
|   |-- Sample1_genome_name.bw
|   |-- Sample1_genome_name.m.bw
|   |-- Sample1_genome_name.p.bw
|   |-- Sample1_genome_name_BAMTagHistogram.txt
|   |-- Sample1_genome_name_DownstreamStatistics.txt
|   |-- Sample1_genome_name_READS.Matrix.txt
|   |-- Sample1_genome_name_ReadCutoff.png
|   |-- Sample1_genome_name_ReadCutoff.yaml
|   |-- Sample1_genome_name_ReadStatistics.txt
|   |-- Sample1_genome_name_Summary.txt
|   |-- Sample1_genome_name_UMI.Matrix.loom
|   |-- Sample1_genome_name_UMI.Matrix.txt
|   |-- polyA_trimming_report.txt
|   |-- star.Log.final.out
|   |-- star.Log.out
|   |-- star.Log.progress.out
|   |-- star.SJ.out.tab
|   |-- star_gene_exon_tagged.bai
|   |-- star_gene_exon_tagged.bam
|   |-- unaligned_tagged_Cellular.bam_summary.txt
|   `-- unaligned_tagged_Molecular.bam_summary.txt
Description of relevant output files:
Sample1.fastq.bam - contains the merged barcode and sequence fastq files
Sample1_genome_name.bw - bigWig file constructed from selected cells. Files ending in .m.bw and .p.bw contain strand separated signal
Sample1_genome_name_BAMTagHistogram.txt - number of reads corresponding to each cell barcode
Sample1_genome_name_UMI.Matrix.txt/loom - UMI based digital expression matrix in txt and loom format
Sample1_genome_name_READS.Matrix.txt - read count digital expression matrix
Sample1_genome_name_ReadCutoff.yaml - contains the UMI threshold for selecting high quality cells (obtained using dropbead). The corresponding .png file visualizes the UMI curve and the threshold.
star_gene_exon_tagged.bam - mapped and annotated reads. Each read is tagged with an annotation based on its mapping location.
Sample1_genome_name_ReadStatistics.txt/DownstreamStatistics.txt - quality control statistics used in the html report. They contain values such as the number of reads in exons and introns.
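After a run, the file names described above can be used for a quick completeness check. The following sketch is an illustration only: the function name is made up, the output directory, sample name, and genome name in the example call are hypothetical, and it assumes that genome_name in the file names is replaced by the assembly name given in the settings file.

```shell
# Print the expected per-sample output files that are missing.
# Arguments: output directory, sample name, genome assembly name.
missing_outputs() {
    sample_dir=$1/Mapped/$2
    prefix=$sample_dir/$3/${2}_$3
    for f in \
        "$sample_dir/$2.fastq.bam" \
        "${prefix}.bw" \
        "${prefix}_UMI.Matrix.txt" \
        "${prefix}_UMI.Matrix.loom" \
        "${prefix}_READS.Matrix.txt" \
        "${prefix}_ReadCutoff.yaml" \
        "$sample_dir/$3/star_gene_exon_tagged.bam"
    do
        [ -e "$f" ] || echo "$f"
    done
}

# Hypothetical call; adjust the output directory, sample and genome names.
missing_outputs out Sample1 mm10
```

The function prints nothing when all listed files exist, so it can be used in a loop over all samples in the sample sheet.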
The combined expression data are subsequently processed into a SingleCellExperiment object. SingleCellExperiment is a Bioconductor class for storing expression values, along with the cell and gene data and the experimental metadata, in a single container. It is constructed on top of hdf5 file based arrays (Pagès 2017), which enables exploration even on systems with limited memory capacity. During the object construction, the pipeline performs expression normalization, dimensionality reduction, and identification of significantly variable genes, assigns the cells to the steps of the cell cycle, and calculates the quality statistics. The SingleCellExperiment object contains all of the data needed for further exploration. The object connects the PiGx pipeline with the Bioconductor single cell computing environment, and enables integration with state of the art statistical and machine learning methods (scran, zinbwave, netSmooth, iSEE).
The execution section in the settings file allows the user to specify whether the pipeline is to be submitted to a cluster or run locally, and the degree of parallelism. For a full list of possible parameters, see
An example can be found in the tests directory. The sample_sheet.csv file here specifies the following sample data:
How to contribute?
The easiest way to install all of the dependencies is through the Guix package management system. First, download and install Guix on your computer. The guix.scm file in the root of the project directory contains the recipe for installing all of the necessary tools. The following command will install all of the dependencies into the .guix-profile folder:
guix environment -l guix.scm --root=`pwd`/.guix-profile
Installing PigX-scRNAseq for development
# sets up the directory
basepath="$HOME/pigx-scrnaseq/development"
mkdir -p "$basepath"; cd "$basepath"

# downloads the repository
git clone https://github.com/BIMSBbioinfo/pigx_scrnaseq.git
cd pigx_scrnaseq; mkdir run

# uses guix to install all of the dependencies into a separate environment
guix environment -l guix.scm --root=`pwd`/run/.guix-profile

# sets the temporary directory - needed for storing large temporary files
export TMPDIR=~/Tmp

# installs the pipeline
./bootstrap.sh && ./configure --prefix=`pwd`/run && make install

# runs the pipeline on the test data
./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
Preparing the environment for the development
To prepare the environment for the development set the following variable:
If this variable is not set, pigx-scrnaseq will execute files in the ./run/bin folder (pre-installed files), and will not react to changes to scripts.
If you already have pre-installed dependencies, execute the following commands to set up your environment:

# loads the guix environment
guixr package -p run/.guix-profile --search-path=prefix
export PIGX_UNINSTALLED=1

# runs the pipeline on the test data
./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
loads the dependencies into PATH
guixr package -p ./run/.guix-profile --search-path="prefix"
Scripts and Executables
pigx-scrnaseq is the main driver script for the pipeline (user entry point). It is constructed from pigx-scrnaseq.in during the configuration step. If you want to update pigx-scrnaseq, change pigx-scrnaseq.in and rerun the "installs the pipeline" step of the development installation to apply the changes.
Snake_Dropseq.py is the main Snakemake script, which constructs the execution graph and executes the pipeline. Any changes to Snake_Dropseq.py take effect directly upon execution.
Folder which contains all R and Python scripts. These scripts are called by Snake_Dropseq.py.
To make changes or add improvements to the pipeline, follow these steps:
create a new git branch
switch to the branch
make your updates
check whether the updates work by running the following code:
make install && ./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
run the tests with
check whether there were updates to master. If there were updates, run git pull -r. Again check whether the pipeline works
push the changes to the corresponding branch, and open a pull request.