sequence_accessories

Accessory scripts for `sequence_handling`

What is `sequence_accessories`?

sequence_accessories is a set of scripts that complements sequence_handling. sequence_handling automates the processing of raw sequence data in FASTQ format and provides quality checks so users can diagnose potential errors in the data or processing. Some tasks that can be automated fall outside its scope, however. sequence_accessories was developed to handle those tasks and keep sequence_handling focused.

How does `sequence_accessories` work?

Much like sequence_handling, sequence_accessories is designed to process large amounts of data in parallel. However, sequence_accessories does not have an inherent dependence on the Portable Batch System. Instead of resource-heavy handlers, sequence_accessories uses lighter accessories to do its processing. These accessories have fewer options than the handlers of sequence_handling. As such, sequence_accessories runs entirely on the command line without the help of a configuration file.

Setting parameters

Unlike sequence_handling, sequence_accessories takes its parameters as command-line arguments rather than from a configuration file. Arguments follow the format --parameter-name=value. This differs from the -p value syntax common in many programs, but the equals-delimited form is easier to parse. Some accessories also have flags, which follow the format --flag; to set a flag, simply pass it on the command line.

In the documentation for sequence_accessories, angle brackets around the value denote a required argument, while square brackets around the argument and value denote an optional argument. All flags are optional and are also enclosed in square brackets.
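To illustrate why the equals-delimited syntax is easy to parse, here is a minimal sketch of an argument parser in Bash. It is not the parser that sequence_accessories ships; the function name `parse_args` and the variable names are hypothetical, and the options simply mirror ones documented for SummarizeStats and DumpFastq further down.

```bash
#!/bin/bash

# Hypothetical parser sketch; not taken from the sequence_accessories source.
parse_args() {
    SAMPLE_LIST=''      # required: --sample-list=<sample_list>
    PROJECT='STATS'     # optional: [--project=project]
    PAIRED=false        # optional flag: [--paired]
    local arg
    for arg in "$@"; do
        case "${arg}" in
            --sample-list=*) SAMPLE_LIST="${arg#*=}" ;;  # drop everything through the first '='
            --project=*)     PROJECT="${arg#*=}" ;;
            --paired)        PAIRED=true ;;
            *) echo "Unknown argument: ${arg}" >&2; return 1 ;;
        esac
    done
    if [[ -z "${SAMPLE_LIST}" ]]; then
        echo "Missing required argument: --sample-list" >&2
        return 1
    fi
}

parse_args --sample-list=bam_list.txt --paired
echo "list=${SAMPLE_LIST} project=${PROJECT} paired=${PAIRED}"
# prints: list=bam_list.txt project=STATS paired=true
```

Because each argument arrives as a single word, one `case` pattern and one parameter expansion are enough; there is no need to look ahead to the next word as with the `-p value` style.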

Accessories

Basic usage is done with the following command:

./sequence_accessories <accessory> [options]

Where <accessory> is one of the accessories listed below, and [options] are arguments for that accessory. A simple help message can be found by running:

./sequence_accessories

Detailed help for each accessory is available by running that accessory without any options.

Available accessories

ListGenerator (not yet ready; the dispatcher exits if it is selected)
SummarizeStats
DumpFastq
SRADownloader
MergeBAM
SimpleCoverage
AddBAMLane
FreeBayes_SNP_Calls
SubsampleFastq
UG100_filter
PanDepthCoverage
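As a rough picture of how a single entry point can route to the accessories listed above, the following Bash sketch dispatches on the first argument. The function name `dispatch` and all messages are hypothetical; the real sequence_accessories script may be organized differently, and this sketch only reports what it would run.

```bash
#!/bin/bash

# Hypothetical dispatcher sketch; not taken from the sequence_accessories source.
dispatch() {
    local accessory="${1:-}"
    case "${accessory}" in
        '')
            echo "Usage: ./sequence_accessories <accessory> [options]" >&2
            return 1
            ;;
        ListGenerator)
            # Listed as not ready, so stop before doing any work
            echo "ListGenerator is not ready yet" >&2
            return 1
            ;;
        SummarizeStats|DumpFastq|SRADownloader|MergeBAM|SimpleCoverage|AddBAMLane|FreeBayes_SNP_Calls|SubsampleFastq|UG100_filter|PanDepthCoverage)
            shift   # drop the accessory name, leaving only its options
            # A real dispatcher would call the accessory here
            echo "Would run ${accessory} with $# option(s)"
            ;;
        *)
            echo "Unknown accessory: ${accessory}" >&2
            return 1
            ;;
    esac
}

dispatch SummarizeStats --sample-list=bam_list.txt
# prints: Would run SummarizeStats with 1 option(s)
```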

SummarizeStats

The SummarizeStats accessory runs SAMTools idxstats on BAM files and creates a text file with the sequence length, number of mapped reads, and number of unmapped reads for every sample passed to it. If the BAM files are not indexed, SummarizeStats will generate CSI-format indexes for them.

Arguments:

--sample-list=<sample_list>: required list of BAM files to generate statistics for
[--project=project]: optional name for the output file, defaults to 'STATS'

Dependencies:

SAMTools 1.3 or higher
GNU Parallel
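For orientation, `samtools idxstats` prints one tab-delimited row per reference sequence: name, sequence length, mapped reads, and unmapped reads. The sketch below totals those columns for one sample. The function name `summarize_idxstats` is hypothetical, and summing across references is an assumption about how the per-sample figures are produced, not a description of the SummarizeStats source.

```bash
#!/bin/bash

# Sum the rows of `samtools idxstats` output read from stdin.
# Columns: reference name, sequence length, mapped reads, unmapped reads.
summarize_idxstats() {
    awk -F '\t' '
        { len += $2; mapped += $3; unmapped += $4 }
        END { printf "%.0f\t%.0f\t%.0f\n", len, mapped, unmapped }
    '
}

# Stand-in for `samtools idxstats sample.bam`, using made-up numbers;
# the final "*" row holds unmapped reads with no reference assigned
printf 'chr1H\t1000\t50\t2\nchr2H\t2000\t70\t1\n*\t0\t0\t5\n' | summarize_idxstats
# prints: 3000	120	8
```

With SAMTools installed and an indexed BAM file, the same function could be fed from `samtools idxstats sample.bam | summarize_idxstats`.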

DumpFastq

The DumpFastq accessory creates gzipped FASTQ files from SRA archives. This accessory can handle dumping to either single- or paired-end FASTQ files.

Arguments:

--sample-list=<sample_list>: required list of SRA archives to dump
[--outdirectory=outdirectory]: optional directory to dump the FASTQ files to
[--paired]: flag to dump to paired-end FASTQ files

Dependencies:

fastq-dump from the SRA Toolkit
GNU Parallel
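The difference between single- and paired-end dumping can be sketched as a small command builder. This is an illustration only: the function name `build_dump_command` and the file names are hypothetical, and the choice of `--gzip`, `--outdir`, and `--split-files` (all existing fastq-dump options) is an assumption about what DumpFastq passes, not something confirmed by the source.

```bash
#!/bin/bash

# Print the fastq-dump command for one SRA archive without running it.
# Arguments: <archive> <outdirectory> <paired: true|false>
build_dump_command() {
    local archive="$1" outdir="$2" paired="$3"
    local -a cmd=(fastq-dump --gzip --outdir "${outdir}")
    if [[ "${paired}" == true ]]; then
        # --split-files writes each read of a spot to its own file
        cmd+=(--split-files)
    fi
    cmd+=("${archive}")
    echo "${cmd[*]}"
}

build_dump_command sample.sra fastq true
# prints: fastq-dump --gzip --outdir fastq --split-files sample.sra
```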

SRADownloader

The SRADownloader accessory downloads SRA archives from the SRA FTP server. This accessory takes SR-/ER-/DR- numbers that correspond to an experiment, run, sample, or study.

Arguments:

--sample-list=<sample_list>: required list of SRA accession numbers to download
--sample-type=<sample_type>: required type of accession number given; can choose from: 'experiment', 'run', or 'study'
[--outdirectory=outdirectory]: optional directory to download SRA archives to
[--validate]: flag to use vdb-validate to run a checksum on the SRA archives within SRADownloader

Dependencies:

lftp
GNU Parallel
vdb-validate from the SRA Toolkit if validating within SRADownloader
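When preparing a list of accessions, the accession prefix indicates which --sample-type applies: in the SR-/ER-/DR- numbering, the third letter is R for a run, X for an experiment, S for a sample, and P for a study. The helper below is a hypothetical convenience, not part of SRADownloader, and the accession in the example is a placeholder. Note that the argument list above names only 'experiment', 'run', and 'study', so whether 'sample' is accepted is not confirmed here.

```bash
#!/bin/bash

# Map an SR-/ER-/DR- accession to a sample type based on its prefix.
accession_type() {
    case "$1" in
        [SED]RR[0-9]*) echo 'run' ;;
        [SED]RX[0-9]*) echo 'experiment' ;;
        [SED]RS[0-9]*) echo 'sample' ;;   # may not be a valid --sample-type value
        [SED]RP[0-9]*) echo 'study' ;;
        *) echo "Unrecognized accession: $1" >&2; return 1 ;;
    esac
}

accession_type ERX0000001
# prints: experiment
```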

MergeBAM

The MergeBAM accessory uses BAMtools to merge several BAM files into a single BAM file. This accessory can handle multiple merges at once.

Arguments

--sample-list=<sample_list>: required list of BAM files
--name-table=<table>: required table where the first column is the sample name for the merged BAM file and the remaining columns are the names of the BAM files that make up the merged BAM
[--outdirectory=outdirectory]: optional directory to place the merged BAM
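The name table layout described above can be turned into merge commands row by row. The sketch below assumes a whitespace-delimited table and prints the `bamtools merge` command for each row rather than running it; the function name `build_merge_commands`, the `.bam` suffix on the output name, and the sample names are all hypothetical.

```bash
#!/bin/bash

# Read a name table from stdin and print one `bamtools merge` command per row.
# Column 1 is the merged sample name; the remaining columns are input BAM files.
# Argument: <outdirectory>
build_merge_commands() {
    local outdir="$1" bam
    local -a fields cmd
    while read -r -a fields; do
        # Skip blank rows and rows without at least one input BAM
        if (( ${#fields[@]} < 2 )); then continue; fi
        cmd=(bamtools merge)
        for bam in "${fields[@]:1}"; do
            cmd+=(-in "${bam}")
        done
        cmd+=(-out "${outdir}/${fields[0]}.bam")
        echo "${cmd[*]}"
    done
}

printf 'SampleA\tA_lane1.bam\tA_lane2.bam\n' | build_merge_commands merged
# prints: bamtools merge -in A_lane1.bam -in A_lane2.bam -out merged/SampleA.bam
```

Each row is independent of the others, which is what makes it possible to run several merges at once.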

Dependencies

BAMtools

SimpleCoverage

The SimpleCoverage accessory uses SAMTools to calculate coverage over BAM files. This accessory outputs a table summarizing the average depth across the entire sample.

Arguments

--sample-list=<sample_list>: required list of BAM files
[--genome-size=genome_size]: optional size of the genome in number of base pairs, will calculate automatically if not specified; if you have exome sequencing data, provide the exome size
[--project=project]: optional name for the output file, defaults to 'SimpleCoverage'
[--outdirectory=outdirectory]: optional directory to place output files
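The role of --genome-size is easiest to see in the arithmetic: average depth is the sum of per-position depths divided by the number of positions that should have been covered. The sketch below applies that formula to `samtools depth` output (chromosome, position, depth). It is an assumption about the calculation, not a copy of the SimpleCoverage source, and the function name `average_depth` is hypothetical.

```bash
#!/bin/bash

# Average depth = sum of per-position depths / genome size.
# Reads `samtools depth` output (chromosome, position, depth) from stdin.
# Argument: <genome_size> in base pairs
average_depth() {
    local genome_size="$1"
    awk -F '\t' -v size="${genome_size}" '
        { total += $3 }
        END {
            if (size + 0 <= 0) {
                print "genome size must be positive" > "/dev/stderr"
                exit 1
            }
            printf "%.2f\n", total / size
        }
    '
}

# Stand-in for `samtools depth sample.bam`, using made-up numbers
printf 'chr1H\t1\t10\nchr1H\t2\t20\nchr1H\t3\t30\n' | average_depth 4
# prints: 15.00
```

By default `samtools depth` omits positions with zero coverage, so dividing by the number of reported rows would overstate the average. Dividing by the genome size, or the exome size for exome data, avoids that.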

Dependencies

SAMTools

AddBAMLane

Adds lane/read-group information to BAM files.

Dependencies

FreeBayes_SNP_Calls

Runs a FreeBayes-based SNP calling workflow.

Notes

This accessory script currently contains internal, hard-coded paths and parameters.

Dependencies

GNU Parallel
FreeBayes
SAMTools
BAMTools
ogap
bamleftalign

SubsampleFastq

Subsamples FASTQ reads using seqtk.

Dependencies

seqtk

UG100_filter

Runs quality filtering for the UG100 cohort variant calls.

Arguments

INPUT_FILE: input BCF/VCF file
OUT_DIR: output directory

Dependencies

bcftools 1.21 or higher

PanDepthCoverage

Calculates coverage depth over BAM/CRAM files using PanDepth. Supports whole-genome, gene-level (GFF/GTF), region-level (BED), and windowed coverage modes. The --gff, --bed, and --window-size options are mutually exclusive; if none is given, per-chromosome statistics are reported.

Arguments

--sample-list=<sample_list>: required list of BAM or CRAM files
[--gff=gff_file]: optional GFF/GTF file for gene-level coverage
[--bed=bed_file]: optional BED file for region-level coverage
[--window-size=window_size]: optional window size in bp for windowed coverage
[--feature=feature_type]: optional GFF/GTF feature to parse, either CDS or exon (default: CDS)
[--min-mapq=min_mapq]: minimum mapping quality filter (default: 0)
[--threads=threads]: PanDepth threads per sample (default: 3)
[--project=project]: optional output file prefix (default: PanDepthCoverage)
[--outdirectory=outdirectory]: optional output directory

Dependencies

PanDepth v2.26 or higher
GNU Parallel
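The rule that --gff, --bed, and --window-size are mutually exclusive can be checked before any sample is processed. The following sketch shows one way to do that; the function name `select_mode` and the mode labels it prints are hypothetical and are not taken from the PanDepthCoverage source.

```bash
#!/bin/bash

# Decide the coverage mode from the three mutually exclusive options.
# Arguments: <gff> <bed> <window_size>; pass an empty string for unset options.
select_mode() {
    local gff="$1" bed="$2" window="$3"
    local opt given=0
    for opt in "${gff}" "${bed}" "${window}"; do
        if [[ -n "${opt}" ]]; then given=$(( given + 1 )); fi
    done
    if (( given > 1 )); then
        echo "--gff, --bed, and --window-size are mutually exclusive" >&2
        return 1
    fi
    if [[ -n "${gff}" ]]; then
        echo 'gene'
    elif [[ -n "${bed}" ]]; then
        echo 'region'
    elif [[ -n "${window}" ]]; then
        echo 'window'
    else
        echo 'chromosome'   # none given: per-chromosome statistics
    fi
}

select_mode '' regions.bed ''
# prints: region
```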

Future Accessories

FastSanger

FastSanger functionality is currently provided by an external utility script:

/Users/pmorrell/Library/CloudStorage/Dropbox/Documents/Sandbox/PMorrell/Utilities/phd_to_fastq.py

That script converts .phd.1 files to FASTQ format for use with sequence_handling.
