
2.1. Automated Pipeline Starting with FASTQ or FASTA files

Breon Schmidt edited this page Sep 8, 2021 · 2 revisions

Ah, so you're starting from the beginning? Fantastic! This pipeline will run STAR and ALLSorts for you.

Before We Begin

Download the reference Fasta and GTF used for ALLSorts

ALLSorts runs on hg19 (I know), so we need references related to that.

GTF - ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz

FASTA - ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz

Now decompress (gunzip) both somewhere memorable!
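The download-and-decompress step above can be sketched as a small shell helper. This is only an illustration: it assumes wget and gunzip are available, and REF_DIR, local_path and fetch_ref are hypothetical names, not part of ALLSorts.

```shell
#!/bin/sh
# Sketch: fetch and decompress the two GRCh37 references listed above.
# REF_DIR is a hypothetical location; point it wherever suits you.
REF_DIR="${REF_DIR:-$HOME/references/grch37}"
GTF_URL="ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz"
FASTA_URL="ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz"

# Path a URL will have on disk once downloaded and decompressed.
local_path() {
    printf '%s/%s\n' "$REF_DIR" "$(basename "$1" .gz)"
}

# Download one URL into REF_DIR and decompress it.
# Skipped if the decompressed file is already present.
fetch_ref() {
    mkdir -p "$REF_DIR" || return 1
    if [ -f "$(local_path "$1")" ]; then
        return 0
    fi
    wget -P "$REF_DIR" "$1" && gunzip "$REF_DIR/$(basename "$1")"
}

# Uncomment to actually download (the FASTA is large):
# fetch_ref "$GTF_URL" && fetch_ref "$FASTA_URL"
```

Afterwards, local_path "$GTF_URL" and local_path "$FASTA_URL" print the paths to pass to STAR in the next step.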

Generate STAR Genome - Only needs to be performed once!

We need to align our fastq/fasta reads to a reference genome, so let's make one! The settings below should do, but feel free to adjust them so the genome also works for your other hg19 projects, so long as the FASTA and GTF files stay the same.

Choose --limitGenomeGenerateRAM (64000000000 is 64 GB) and --runThreadN as appropriate for your environment.

STAR --runMode genomeGenerate \
    --genomeDir /path/to/desired/output/ \
    --limitGenomeGenerateRAM 64000000000 \
    --runThreadN 8 \
    --sjdbGTFfile /path/to/Homo_sapiens.GRCh37.87.chr.gtf \
    --genomeFastaFiles /path/to/Homo_sapiens.GRCh37.dna.primary_assembly.fa
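Once genome generation finishes, a quick sanity check of the output directory can save a failed run later. The sketch below checks only a small subset of the files STAR writes there, so it shows the run probably completed rather than proving it. check_star_genome is a hypothetical helper name.

```shell
# Sketch: confirm that a STAR genome directory looks complete.
# The file list is a small subset of what genomeGenerate writes.
check_star_genome() {
    dir="$1"
    for f in Genome SA SAindex genomeParameters.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
    echo "STAR genome at $dir looks complete"
}
```

For example, check_star_genome /path/to/desired/output/ returns non-zero and names the first missing file if the directory is incomplete.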

Install ALLSorts

Just follow the instructions at https://github.com/Oshlack/ALLSorts/wiki.

Running ALLSorts starting with FASTA/FASTQ files

Ok, ALLSorts and the prerequisites above have been installed? You're good to go!
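If you would like to double-check the prerequisites first, a minimal sketch such as the following reports any command that is not on your PATH. require_tools is a hypothetical helper, and which tools to pass in depends on your setup; STAR and bpipe are the two this page calls directly.

```shell
# Sketch: report any required command that is not on PATH.
# Returns 0 if every tool was found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: require_tools STAR bpipe
```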

ALLSorts can be run with the command below; the parameters are described in the next section.

bpipe -p results=$results -p threads=$threads -p a_mem=$mem -p type=$type -p strand=$strand -p format=$format -p genome_dir=$genome_dir $COUNTSDIR/counts.groovy $fasta

Parameters

Feel free to make these environment variables (I tend to) or just directly insert them into the command line snippet above.

$results = /path/to/desired/output

$threads = 8 # choose as appropriate

$mem = 64000000000 # 64GB - choose as appropriate

$type = "fasta" or "fastq" # choose as appropriate for your input

$strand = "yes", "no" or "reverse" # "no" and "reverse" will be the two most used (no = unstranded; reverse = typically stranded)

$format = the format pattern for your input fastq/fasta files, as per bpipe's input spec. As a brief example, inputs such as /path/to/sample1_R1.fastq.gz and /path/to/sample1_R2.fastq.gz would be represented by format = */%_R*.fastq.gz. This will use sample1 as the branch name.

$genome_dir = /path/to/Genome/ # the output from the STAR genome generation step above

$COUNTSDIR/counts.groovy should be the path /your/allsorts/clone/path/tools/counts/counts.groovy

$fasta - the path to your fasta/fastq files. Can be something as simple as /path/to/fastq/*.fastq.gz, so long as the format parameter is set correctly.
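One way to keep the parameters above together is to set them as shell variables and print the resulting command for review before running it. The sketch below does that. Every path is a placeholder, the type, strand and format values are borrowed from the test command further down this page, and allsorts_cmd is a hypothetical helper name.

```shell
# Sketch: collect the parameters as shell variables, then print the
# bpipe command for review. All paths below are placeholders.
results=/path/to/desired/output
threads=8
mem=64000000000
type=fasta
strand=no
format='*/%_*.fasta.gz'   # single-quoted so the shell leaves the wildcards alone
genome_dir=/path/to/Genome/
COUNTSDIR=/your/allsorts/clone/path/tools/counts
fasta='/path/to/fasta/*.fasta.gz'

# Print the command line without running it.
allsorts_cmd() {
    printf 'bpipe -p results=%s -p threads=%s -p a_mem=%s -p type=%s -p strand=%s -p format="%s" -p genome_dir=%s %s/counts.groovy %s\n' \
        "$results" "$threads" "$mem" "$type" "$strand" "$format" "$genome_dir" "$COUNTSDIR" "$fasta"
}

allsorts_cmd
```

Once the printed line looks right, eval "$(allsorts_cmd)" would run it. That shortcut assumes none of your paths contain spaces.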

Run a test to see if it's working!

If you have set up your prerequisite tools correctly, this should output a result fairly quickly! Just change the parameters as suitable for your environment.

bpipe -p results=/output/path/ -p threads=8 -p a_mem=64000000000 -p type="fasta" -p strand="no" -p format="*/%_*.fasta.gz" -p genome_dir="/path/to/Genome/" $COUNTSDIR/counts.groovy /your/allsorts/clone/path/tests/fastq/*.fasta.gz

The output will just be a collection of predictions. It's not a real sample, just a garbled mess of counts.
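When you move from the test data to your own files, it can help to preview which sample (branch) names your inputs would yield. The sketch below is a rough shell approximation for paired files named like the sample1_R1.fastq.gz example in the Parameters section. It is not bpipe's own matcher, and branch_name is a hypothetical helper.

```shell
# Rough approximation of the branch name for files named
# <sample>_R<n>.fastq.gz: drop the directory, then the _R<n>.fastq.gz suffix.
# This is NOT bpipe's matcher, only a quick way to eyeball names.
branch_name() {
    base=${1##*/}
    printf '%s\n' "${base%_R*.fastq.gz}"
}

for f in /path/to/sample1_R1.fastq.gz /path/to/sample1_R2.fastq.gz; do
    branch_name "$f"
done
```

Both files print sample1, matching the branch name described in the Parameters section.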

Issues

Please report any issues at https://github.com/Oshlack/ALLSorts/issues!