Extensible single cell analysis pipelines for reproducible and large-scale single cell processing using Viash and Nextflow.
Please find more in-depth documentation on the website.
Openpipelines executes a list of predefined tasks. These discrete steps are also provided as standalone components that can be executed individually, each with a standardized interface. This is especially useful when a particular step wraps a tool that you do not always need to execute in a workflow context.
In terms of workflows, the following functionality is provided:
- Demultiplexing: conversion of raw sequencing data to FASTQ files.
- Ingestion: read mapping and count matrix generation.
- Single sample processing: cell filtering and doublet detection.
- Multisample processing: count transformation, normalization and QC metric calculation.
- Integration: clustering, integration and batch correction using single- and multimodal methods.
- Downstream analysis workflows.
```mermaid
flowchart LR
  demultiplexing["Step 1: Demultiplexing"]
  ingestion["Step 2: Ingestion"]
  process_samples["Step 3: Process Samples"]
  integration["Step 4: Integration"]
  downstream["Step 5: Downstream"]
  demultiplexing --> ingestion --> process_samples --> integration --> downstream
```
Openpipelines is now available on Viash Hub. Viash Hub provides a list of components and workflows, together with a graphical interface that guides you through the steps of running a workflow or standalone component. Instructions are provided for using a local Viash or Nextflow executable (which requires a Linux-based OS), but connecting to a Seqera Cloud instance is also supported.
Executing a workflow is a bit more involved and requires familiarity with the command line interface (CLI).
In order to use the workflows in this package on your local computer, you’ll need to do the following:
- Install Nextflow.
- Install a container engine that Nextflow can use; this repository provides a profile for Docker (see the install sketch below).
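For reference, a minimal setup could look like the following. The first line is Nextflow's documented self-installer (it requires a recent Java runtime); the Docker step varies by distribution, so treat the `apt-get` line as an Ubuntu/Debian-specific assumption:

```bash
# Install Nextflow via its documented self-installer (requires a recent Java runtime)
curl -s https://get.nextflow.io | bash
mv nextflow ~/.local/bin/   # or any other directory on your $PATH

# Install Docker -- distribution-specific; this line assumes Ubuntu/Debian
sudo apt-get install -y docker.io

# Verify both tools are available
nextflow -version
docker --version
```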
Nextflow workflow scripts, schemas and configuration files can be found in the `target/nextflow` folder. Note, however, that the `main` branch only contains the source code that still needs to be built into the functioning workflows and components. Please refer to the `main_build` branch or any of the tags to find the `target` folders. Components and workflows are organized into namespaces, which can be nested. Workflows are located at `target/nextflow/workflows`, while components that execute individual workflow steps are found in the other namespaces under `target/nextflow`.
A reference of workflows and modules is also provided in the documentation.
A list of workflow arguments can be consulted in multiple ways:

- On Viash Hub
- In the reference documentation
- In the config YAML file, which lists the arguments for each workflow and component
- In the `target/nextflow` folder, where a Nextflow schema JSON file (`nextflow_schema.json`) is provided next to each workflow's `main.nf` file
- Using Nextflow on the CLI:
```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/ingestion/demux/main.nf \
  --help
```
Nextflow labels can be used to specify the resources a process is allowed to use. These workflows use the following labels for memory, CPU and disk:
- Memory: `verylowmem`, `lowmem`, `midmem`, `highmem`, `veryhighmem`
- CPU: `verylowcpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
- Disk: `lowdisk`, `middisk`, `highdisk`, `veryhighdisk`
The defaults for these labels can be found at `src/workflows/utils/labels.config`. Nextflow checks that the resources specified for a process do not exceed what is available on the machine, and will not start if they do. Create your own config file to tune the labels to your needs, for example:
```groovy
process {
  // Resource labels
  withLabel: verylowcpu { cpus = 2 }
  withLabel: lowcpu { cpus = 8 }
  withLabel: midcpu { cpus = 16 }
  withLabel: highcpu { cpus = 16 }
  withLabel: verylowmem { memory = 4.GB }
  withLabel: lowmem { memory = 8.GB }
  withLabel: midmem { memory = 16.GB }
  withLabel: highmem { memory = 32.GB }
}
```
When starting Nextflow from the CLI, pass this file with `-c` to override the defaults, as demonstrated in the commands below.
Here, generating FASTQ files from raw sequencing data is demonstrated, based on data generated using 10x Genomics protocols. However, data from BD platforms is also supported by Openpipeline. If you wish to try it out yourself, test data is available at `s3://openpipelines-data/cellranger_tiny_bcl/bcl`.
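To take a look at the test data first, you could list the bucket contents with the AWS CLI; whether anonymous (`--no-sign-request`) access works is an assumption about how the bucket is configured:

```bash
# List the demultiplexing test data (assumes the bucket allows anonymous reads)
aws s3 ls --no-sign-request s3://openpipelines-data/cellranger_tiny_bcl/bcl/
```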
```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/ingestion/demux/main.nf \
  -c "<path to resource config file>" \
  -profile docker \
  --publish_dir "<path to output directory>" \
  --id "cellranger_tiny_bcl" \
  --input "s3://openpipelines-data/cellranger_tiny_bcl/bcl" \
  --sample_sheet "s3://openpipelines-data/cellranger_tiny_bcl/bcl/sample_sheet.csv" \
  --demultiplexer "mkfastq"
```
FASTQ files can be mapped to a reference genome and the resulting mapped reads can be counted in order to generate a count matrix. Both BD Rhapsody and Cell Ranger are supported. Here, we demonstrate using Cell Ranger multi on test data available at `s3://openpipelines-data/10x_5k_anticmv`.
To facilitate passing multiple argument values, the parameters can be specified in a YAML file:
```yaml
input:
  - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_*.fastq.gz"
  - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_*.fastq.gz"
  - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_VDJ_*.fastq.gz"
gex_reference: "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz"
vdj_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz"
feature_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/feature_reference.csv"
library_id:
  - "5k_human_antiCMV_T_TBNK_connect_GEX_1_subset"
  - "5k_human_antiCMV_T_TBNK_connect_AB_subset"
  - "5k_human_antiCMV_T_TBNK_connect_VDJ_subset"
library_type:
  - "Gene Expression"
  - "Antibody Capture"
  - "VDJ"
```
You can pass this file to Nextflow using `-params-file`:
```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/ingestion/cellranger_multi/main.nf \
  -c "<path to resource config file>" \
  -profile docker \
  -params-file "<path to your parameter YAML file>" \
  --publish_dir "<path to output directory>"
```
Filtering, normalization, clustering, dimensionality reduction and QC calculations (without integration)
Once you have a MuData object for each of your samples, you can process them into a single multisample file that is ready for integration and other downstream analyses. This can be done using the `process_samples` workflow. Here is an example, but please keep in mind that the exact parameters that need to be provided differ depending on your data. Much of this pipeline's functionality can be customized, including the names of the output slots where data is stored.
```yaml
param_list:
  - id: "sample_1"
    input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
    rna_min_counts: 2
  - id: "sample_2"
    input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
    rna_min_counts: 1
    rna_max_counts: 1000000
    rna_min_genes_per_cell: 1
    rna_max_genes_per_cell: 1000000
    rna_min_cells_per_gene: 1
    rna_min_fraction_mito: 0.0
    rna_max_fraction_mito: 1.0
```
To provide multiple samples to the pipeline, `param_list` is used. With `param_list`, it is possible to specify arguments per sample. It is still possible to define arguments for all samples at once by listing them outside the `param_list` block, as in the sketch below.
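For example, a minimal sketch of combining per-sample and shared arguments (the sample IDs, input paths and the shared `rna_min_counts` value are hypothetical):

```yaml
param_list:
  - id: "sample_1"
    input: "sample_1.h5mu"   # hypothetical input file
  - id: "sample_2"
    input: "sample_2.h5mu"   # hypothetical input file
# Arguments outside the param_list block apply to every sample
rna_min_counts: 2
```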
```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 \
  -main-script target/nextflow/workflows/multiomics/process_samples/main.nf \
  -c "<path to resource config file>" \
  -profile docker \
  -params-file "<path to your parameter YAML file>" \
  --publish_dir "<path to output directory>"
```
Another option to execute individual modules on the CLI is to use `viash run`. All you need to do is download Viash, clone the Openpipeline repository and point Viash to a config file. Keep in mind, however, that using `viash run` for workflows is currently not supported. Please see `viash run --help` for more information on how to use the command, but here is an example:
```bash
viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml --help
```
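As a sketch of an actual invocation: in Viash, arguments after a `--` separator are forwarded to the component itself. The argument names below mirror the workflow parameters shown earlier but are assumptions about this component's interface, and the paths are placeholders; check the component's `--help` output for its real arguments:

```bash
# Hypothetical invocation; argument names are assumed, not verified against the component
viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml -- \
  --input "<path to fastq directory>" \
  --gex_reference "<path to reference_cellranger.tar.gz>" \
  --output "<path to output directory>"
```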