Skip to content

openpipelines-bio/openpipeline

Repository files navigation

OpenPipeline

Extensible single cell analysis pipelines for reproducible and large-scale single cell processing using Viash and Nextflow.

ViashHub GitHub GitHub License GitHub Issues Viash version

Documentation

Please find more in-depth documentation on the website.

Functionality Overview

Openpipelines execute a list of predefined tasks. These descrete steps are also provided as standalone components that can be executed individually, with a standardized interface. This is especially useful when a particular step wraps a tool that you do not necessarily always need to execute in a workflow context.

In terms of workflows, the following functionality is provided:

  • Demultiplexing: conversion of raw sequencing data to FASTQ objects.
  • Ingestion: Read mapping and generating a count matrix.
  • Single sample processing: cell filtering and doublet detection.
  • Multisample processing: Count transformation, normalization, QC metric calulations.
  • Integration: Clustering, integration and batch correction using single and multimodal methods.
  • Downstream analysis workflows
flowchart LR
  demultiplexing["Step 1: Demultiplexing"]
  ingestion["Step 2: Ingestion"]
  process_samples["Step 3: Process Samples"]
  integration["Step 4: Integration"]
  downstream["Step 5: Downstream"]
  demultiplexing-->ingestion-->process_samples-->integration-->downstream
Loading

Guided execution using Viash Hub (CLI and Seqera cloud)

Openpipelines is now available on Viash Hub. Viash Hub provides a list of components and workflows, together with a graphical interface that guides you through the steps of running a workflow or standalone component. Intstructions are provided for using a local viash or nextflow executable (requires using a linux based OS), but connecting to a Seqera cloud instance is also supported.

Execution using the nextflow executable

Executing a workflow is a bit more involved and requires familiarity with the command line interface (CLI).

Setup

In order to use the workflows in this package on your local computer, you’ll need to do the following:

  • Install nextflow
  • Install a nextflow compatible executor. This workflow provides a profile for docker.

Location of the workflow scripts

Nextflow workflow scripts, schema’s and configuration files can be found in the target/nextflow folder. On the main branch however, only the source code that needs to be build into the functionning workflows and components can be found. Instead, please refer to the main_build branch or any of the tags to find the target folders. Components and workflows are organized into namespaces, which can be nested. Workflows are located at target/nextflow/workflows, while components that execute individual workflow steps are

A reference of workflows and modules is also provided in the documentation.

Retrieving a list of a workflow parameters

A list of workflows arguments can be consulted in multiple ways:

  • On Viash Hub
  • In the reference documentation
  • The config YAML file lists the argument for each workflow and component
  • In the target/nextflow folder, a nextflow schema JSON file (nextflow_schema.json) is provided next to each workflow .nf file.
  • Using nextflow on the CLI:
nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/demux/main.nf \
    --help

Resource usage tuning

Nextflow’s labels can be used to specify the amount of resources a process can use. This workflow uses the following labels for CPU, memory and disk:

  • lowmem, lowmem, midmem, highmem, veryhighmem
  • lowcpu, lowcpu, midcpu, highcpu, veryhighcpu
  • lowdisk, middisk, highdisk, veryhighdisk

The defaults for these labels can be found at src/workflows/utils/labels.config. Nextflow checks that the specified resources for a process do not exceed what is available on the machine and will not start if it does. Create your own config file to tune the labels to your needs, for example:

// Resource labels
withLabel: verylowcpu { cpus = 2 }
withLabel: lowcpu { cpus = 8 }
withLabel: midcpu { cpus = 16 }
withLabel: highcpu { cpus = 16 }

withLabel: verylowmem { memory = 4.GB }
withLabel: lowmem { memory = 8.GB }
withLabel: midmem { memory = 16.GB }
withLabel: highmem { memory = 32.GB }

When starting nextflow using the CLI, you can use -c to provide the file to nextflow and overwrite the defaults.

Demultiplexing example

Here, generating FASTQ files from raw sequencing data is demonstrated, based on data generated using 10X genomic’s protocols. However, BD genomics data is also supported by Openpipeline. If you wish to try it out yourself, test data is available at s3://openpipelines-data/cellranger_tiny_bcl/bcl.

nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/demux/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    --publish_dir "<path to output directory>" \
    --id "cellranger_tiny_bcl" \
    --input "s3://openpipelines-data/cellranger_tiny_bcl/bcl" \
    --sample_sheet "s3://openpipelines-data/cellranger_tiny_bcl/bcl/sample_sheet.csv" \
    --demultiplexer "mkfastq"

Mapping and read counting

FASTQ files can be mapped to a reference genome and the resulting mapped reads can be counted in order to generate a count matrix. Both BD Rhapsody and Cell Ranger are supported. Here, we demonstrate using Cell Ranger multi on test data available at s3://openpipelines-data/10x_5k_anticmv.

In order to facilitate passing multiple argument values, the parameters can be specified using a YAML file.

input:
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_*.fastq.gz"
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_*.fastq.gz"
    - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_VDJ_*.fastq.gz"
gex_reference: "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz"
vdj_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz"
feature_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/feature_reference.csv"
library_id:
    - "5k_human_antiCMV_T_TBNK_connect_GEX_1_subset"
    - "5k_human_antiCMV_T_TBNK_connect_AB_subset"
    - "5k_human_antiCMV_T_TBNK_connect_VDJ_subset"
library_type:
    - "Gene Expression"
    - "Antibody Capture"
    - "VDJ"

You can pass this file to nextflow using -params-file

nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/ingestion/cellranger_multi/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    -params-file "<path to your parameter YAML file>" \
    --publish_dir "<path to output directory>"

Filtering, normalization, clustering, dimensionality reduction and QC calculations (w/o integration)

Once you have an MuData object for each of your samples, you can process it into a multisample file that is ready for integration and other downstream analyses. This can be done using the process_samples workflow. Here is an example, but please keep in mind that the exact parameters that need to be provided differ depending on you data. A lot of functionality for this pipeline can be customized, including the name of the output slots where data is being stored.

param_list:
    - id: "sample_1"
      input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
      rna_min_counts: 2
    - id: "sample_2"
      input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
      rna_min_counts: 1
rna_max_counts: 1000000
rna_min_genes_per_cell: 1
rna_max_genes_per_cell: 1000000
rna_min_cells_per_gene: 1
rna_min_fraction_mito: 0.0
rna_max_fraction_mito: 1.0

In order to provide multiple samples to the pipeline, param_list is used. Using param_list it is possible to specify arguments per sample. However, it is still possible to define arguments for all samples together by listing those outside the param_list block.

nextflow run openpipelines-bio/openpipeline \
    -r 2.1.1 \
    -main-script target/nextflow/workflows/multiomics/process_samples/main.nf \
    -c "<path to resource config file>" \
    -profile docker \
    -params-file "<path to your parameter YAML file>"
    --publish_dir "<path to output directory>"

Executing standalone components using the Viash executable

Another option to execute individual modules on the CLI is to use viash run. All you need to do is download viash, clone the Openpipeline repository and point viash to a config file. However, keep in mind that using viash run for workflows is currently not supported. Please see viash run --help for more information on how to use the command, but here is an example:

viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml --help