Running Panoramic

This page explains how to construct pan-genomes using Panoramic. It starts with a general introduction to the pipelines and their inputs, then explains in detail how to configure and run the pipelines, and finally goes through the pipelines' outputs. A worked example of a run can be found here.

The Panoramic pipelines

Panoramic provides three pipelines implementing different approaches for constructing pan-genomes: the "de-novo" approach, the "map-to-pan" approach, and the "iterative assembly" approach.

Panoramic can handle different types of input samples, namely low-coverage (LC), high-coverage (HC) and reference-level samples. Workflow diagrams for the three pipelines (DN: de-novo, MTP: map-to-pan, IA: iterative assembly) illustrate how each approach constructs a pan-genome from these sample types.
All pipelines are implemented as Snakemake workflows. Snakemake is a workflow management system, and although you don't need to be proficient with it to run Panoramic, it can be useful to take a quick look at this page about executing Snakemake workflows.

Inputs

Sample types

All Panoramic pipelines accept samples (genomes) of three types, or quality levels:

  • Reference - the pipelines expect a single reference genome. The reference sample should have a reference-level genome sequence available (preferably chromosome-level), along with a reference annotation and derived transcript and protein sequences.
  • High-coverage (HC) - optionally, high-coverage samples can be incorporated into the pan-genome. These samples have high-quality genome assemblies and, optionally, annotations available, so they won't be re-assembled or annotated by the pipeline, but will still be compared to other samples. If you wish to annotate your sample using the pipeline, please refer to the HC configuration file definition.
  • Low-coverage (LC) - these usually constitute the bulk of samples when constructing a pan-genome, although a pan-genome can also be constructed without any LC samples. LC samples do not have available assemblies or annotations, and the only available data are sequencing reads. Such samples will be assembled and annotated by Panoramic. Currently, Panoramic expects a single paired-end (PE) Illumina library. The required sequencing depth depends on the analyzed organism and how difficult the genome is to assemble. For most eukaryotes, you'll need > 20x coverage to get a reasonable assembly (see the coverage example after this list).
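
As a rough illustration of the coverage calculation (all numbers here are hypothetical): for a 500 Mb genome sequenced with a 2 x 150 bp PE library, each read pair contributes 300 bp, so 20x coverage requires about (20 * 500,000,000) / 300 ≈ 33 million read pairs. In general, coverage ≈ (number of read pairs * bases per pair) / genome size.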

Sample inputs

All genomic, transcriptomic and proteomic sequences for reference and HC samples should be provided as FASTA files. Genome annotations should be provided as GFF3 files. A few things to note:

  • Make sure GFF3 files do not contain a FASTA section and that genes are made of features of types gene, mRNA, exon, and CDS (five_prime_UTR and three_prime_UTR are optional). It's OK if other types of features are also present.
  • Make sure chromosome names match between FASTA and GFF3 files. Common issues: "chr1" vs. "1", "chr1" vs. "Chr1" etc.
  • Transcript and protein FASTA headers must match the ID attribute of mRNA features in the respective GFF3 files. This is not always the case when downloading data from various DBs, so you may need to process your files before they can go into Panoramic. It is recommended to run the script util/validate_sample.py on the reference and on each HC sample to ensure the GFF3 and FASTA files match. A minimal illustration of matching IDs is shown after this list.
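
To illustrate the ID-matching requirement, here is a minimal example (all names here are hypothetical; GFF3 columns are tab-separated):

genes.gff3:
chr1	example	gene	1000	3000	.	+	.	ID=gene0001
chr1	example	mRNA	1000	3000	.	+	.	ID=gene0001.t1;Parent=gene0001
chr1	example	exon	1000	3000	.	+	.	ID=gene0001.t1.exon1;Parent=gene0001.t1
chr1	example	CDS	1100	2900	.	+	0	ID=gene0001.t1.cds;Parent=gene0001.t1

transcripts.fasta:
>gene0001.t1
ATGGCT...

proteins.fasta:
>gene0001.t1
MA...

Note that the genome FASTA should accordingly contain a sequence named chr1.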

Annotation evidence

For genome annotation, Panoramic performs evidence-based annotation using several software tools. Evidence is provided as FASTA files of transcript or protein sequences. In some cases, you'll only have the reference transcripts and proteins as evidence, but it is recommended to try to enrich the evidence collection, so that more novel genes can be detected. Any assembled transcript sets from the same organism can be used. Protein sequences can also be taken from closely-related organisms.
In addition, Panoramic can also perform ab-initio gene prediction using one or more tools. Some of these need to be pre-trained for your organism. More information is provided in the configuration section.

Constructing a pan-genome

This section assumes that Panoramic is set up and that you have the reference data, HC samples, and annotation evidence ready.

Configuration

Panoramic pipelines expect four configuration files: one with information about LC samples, one with information about HC samples, the main pipeline configuration, and a configuration regarding the annotation process.

  • The LC configuration is a tab-separated values (TSV) file with two columns: name of the sample and the ENA/SRA run accession. Each line represents one sample. The run accession (e.g. ERR1309526) allows Panoramic to automatically download relevant data and is also used as a unique sample identifier.
  • The HC configuration is a TSV file with the following columns: sample name, annotation gff3 path, genome fasta path, and proteins fasta path. The last three columns should contain full paths to the input files per HC sample. If you wish to annotate the HC sample with the pipeline, the 'genome fasta path' column should contain the full path to the input file, and the other input columns should contain the string "--" exactly. Example LC and HC configurations are shown after this list.
  • The main configuration is a YAML file with many parameters as well as paths referring to other input files, including the LC and HC samples TSVs. A full description of all configuration parameters can be found here.
  • The annotation configuration is another YAML file containing parameters related to the annotation process. Further information can be found here.
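
For illustration, here are minimal example configurations (sample names and paths are hypothetical; columns are tab-separated):

LC samples TSV:
sampleA	ERR1309526
sampleB	ERR1309527

HC samples TSV (hc2 has no annotation and will be annotated by the pipeline):
hc1	/data/hc1/annotation.gff3	/data/hc1/genome.fasta	/data/hc1/proteins.fasta
hc2	--	/data/hc2/genome.fasta	--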

Note that configurations slightly differ between the three pipelines, since different parameters are needed.
To create the required configurations, you can use the templates provided for each pipeline:
De-novo:

Panoramic/de_novo/LQ_samples_info_de_novo.tsv
Panoramic/de_novo/HQ_samples_info_de_novo.tsv
Panoramic/de_novo/conf_template_de_novo.yml

Map-to-pan:

Panoramic/map_to_pan/LQ_samples_info_map_to_pan.tsv
Panoramic/map_to_pan/HQ_samples_info_map_to_pan.tsv
Panoramic/map_to_pan/conf_template_map_to_pan.yml

Iterative assembly:

Panoramic/iterative_assembly/LQ_samples_info_iterative_assembly.tsv
Panoramic/iterative_assembly/HQ_samples_info_iterative_assembly.tsv
Panoramic/iterative_assembly/conf_template_iterative_assembly.yml

Annotation configuration:

Panoramic/EVM_annotation/conf_template_EVM.yml

You can also find some configuration examples under the test section (further discussed below). Copy the relevant configuration templates and edit them using any text editor (vim, Notepad, etc.).

Running the pipeline

Once all configuration files are ready, you can run the pipeline of your choice.
Start by activating the Panoramic Conda environment:
conda activate snakemake-panoramic
Then run the pipeline using the command:
snakemake -s </path/to/pipeline.snakefile> --configfile </path/to/main_conf.yml> --use-conda -p -j <number of CPUs>
Snakefiles can be found at:

Panoramic/de_novo/PGC_de_novo.snakefile
Panoramic/map_to_pan/PGC_map_to_pan.snakefile
Panoramic/iterative_assembly/PGC_iterative_assembly.snakefile

Note that you may use any Snakemake CLI option. Two very useful ones are:

  • -n - perform a "dry run" - commands will be printed to the screen but nothing will be run. This is recommended before starting the actual run, to make sure you're running the intended procedure and to detect any issues with the run (see the example after this list).
  • -k - keep running the pipeline as far as possible, even if certain jobs fail.
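
For example, a dry run of the map-to-pan pipeline (paths and CPU count are placeholders) would look like:
snakemake -s Panoramic/map_to_pan/PGC_map_to_pan.snakefile --configfile /path/to/main_conf.yml --use-conda -p -j 16 -n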

Since complete pipeline runs can take hours to days, it is recommended that you run in the background and redirect STDOUT and STDERR to files:
snakemake -s </path/to/pipeline.snakefile> --configfile </path/to/main_conf.yml> --use-conda -p -j <number of CPUs> >out 2>err &

Running on HPC clusters

Constructing a pan-genome for eukaryotes with large genomes usually requires an HPC cluster with a queueing system such as SGE, PBS, TORQUE or other. Panoramic (and Snakemake) can make efficient use of HPC clusters, but you'll need to make some preparations the first time you use one and adjust the command a bit.
The interaction with the queue system is achieved through a wrapper script that sends jobs to the queue. For example, with a PBS-Pro queue, the script can be found at pbs_qsub_snakemake_wrapper.py. You should be able to generate a script that fits your system based on it (a minimal sketch is shown below). Further information and help can be found here.
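
To clarify what such a wrapper does, here is a minimal sketch (not the repository's Python script; the queue name is hypothetical and real wrappers typically add resource and logging options). Snakemake invokes the command passed to --cluster with the path of a generated job script appended as the last argument, and the wrapper simply submits that script to the queue:

#!/bin/bash
# Minimal cluster-wrapper sketch for a PBS-like system.
# Snakemake appends the generated job script path as the last argument.
jobscript="$1"
# Submit the job script to a (hypothetical) queue named my_queue.
qsub -q my_queue "$jobscript"
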
There are several parameters in the main pipeline configuration dedicated to running on HPC clusters, such as the queue name, priority and max number of jobs. More details here.
Once you have the queue wrapper script ready, you can run a pipeline using the HPC cluster like this:
snakemake -s </path/to/pipeline.snakefile> --configfile </path/to/main_conf.yml> --use-conda -p -j <max number of jobs> --cluster "python </path/to/queue_wrapper_script.py>" --latency-wait 60 --jobscript </path/to/jobscript.sh>
For jobscript.sh, you can use util/jobscript.sh.
It is good practice to submit the Snakemake command itself as a queue job (note: some systems do not allow jobs to create new jobs, in which case this will not work). On a PBS-Pro system, the script to be sent to the queue will look something like:

#!/bin/bash
# PBS directives: job name, queue, environment export and log file locations
#PBS -S /bin/bash
#PBS -N Panoramic
#PBS -r y
#PBS -q my_queue
#PBS -V
#PBS -e /path/to/workdir/panoramic.ERR
#PBS -o /path/to/workdir/panoramic.OUT

hostname
conda activate snakemake-panoramic

# Launch the pipeline; individual jobs are submitted to the queue via the wrapper script
cd /path/to/workdir
snakefile="/path/to/Panoramic/map_to_pan/PGC_map_to_pan.snakefile"
qsub_script="/path/to/Panoramic/util/pbs_qsub_snakemake_wrapper.py"
job_script="/path/to/Panoramic/util/jobscript.sh"
snakemake -s $snakefile --configfile conf_test_map_to_pan.yml --cluster "python $qsub_script" --latency-wait 60 --use-conda -p -j 400 --jobscript "$job_script" >out 2>err

The whole process can then be started by running qsub Panoramic.sh (assuming the script above was saved as Panoramic.sh).

Monitoring, logging and error handling

The main log, indicating currently-running steps (rules) and error messages, is printed to STDERR. Further details regarding the progress of each step and possible errors are available in step-specific logs storing the STDOUT and STDERR of specific processes. These can be found under <output directory>/logs/. File names have the format <sample name>_<rule name>.<out/err>. If the step is not specific to a single sample, the prefix all_samples is used.
For example, if the pipeline fails and the main log indicates Error executing rule merge_reads for sample S123, then you should probably look at the file logs/S123_merge_reads.err.
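
You could then inspect the end of that log with, for example:
tail -n 50 <output directory>/logs/S123_merge_reads.err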

Resuming failed runs

Snakemake workflows come with integrated caching abilities, meaning that failed runs can always be resumed by simply rerunning the pipeline. The workflow will automatically detect which steps still need to run, based on which files are present. This means that if a pipeline run failed due to some environment issue (e.g. machine/storage/network downtime), you can simply rerun or re-submit the job. If the failure occurred due to issues with the data, you may need to rerun some completed steps. To do that, manually delete the outputs of the relevant steps (which can be determined from the logs) and rerun, as shown below.
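
For example (the deleted path is hypothetical and depends on the failed step): after fixing the underlying data issue, delete the stale outputs and rerun the same command; Snakemake will regenerate them and everything downstream:

rm <output dir>/S123/assembly/contigs.fasta
snakemake -s /path/to/pipeline.snakefile --configfile /path/to/main_conf.yml --use-conda -p -j 16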

Outputs

The main outputs of the pipelines are:

  • <output dir>/all_samples/pan_genome/pan_genome.fasta - pan-genome genomic sequences. These include the reference genome plus all novel sequences collected from all HC and LC samples.
  • <output dir>/all_samples/pan_genome/all_novel.fasta - only the novel genomic sequences (from HC and LC samples).
  • <output dir>/all_samples/pan_genome/pan_genes.gff - all pan-genome gene models (GFF3 format), including reference and non-reference genes.
  • <output dir>/all_samples/pan_genome/pan_proteome.fasta - protein sequences for all reference and non-reference genes.
  • <output dir>/all_samples/pan_genome/pan_PAV.tsv - the gene presence/absence matrix, as a tab-separated values (TSV) table (see the illustration after this list).
  • <output dir>/all_samples/stats/report.html and report.ipynb - a statistical report about the constructed pan-genome, in HTML and Jupyter notebook formats.
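
To give an idea of what the PAV matrix looks like, here is an illustrative sketch (gene and sample names are hypothetical, and the exact header and value encoding may differ): each row is a pan-genome gene, each column a sample, with 1 marking presence and 0 marking absence:

gene	reference	sampleA	sampleB
gene0001.t1	1	1	1
novel_gene_17	0	1	0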