UltraFast Expressed Variant Calling
The current tools for identification of expressed variants are either slow or involve several software dependencies which are difficult to install. In addition, they either generate several intermediate files or use significant amounts of RAM which is not highthroughput if you are trying to analyze large datasets.
Our pipeline identifies expressed variants from public databases such as NCBI SRA as well as user’s own RNA-Seq data using one of several RNA aligners and quantification tools.
Presentation: Hackathon end presentation
Alignment/quantification from SRA
Older or newer versions may be used but may not be fully compatible.
Pull from SRA:
- NGS Software Development Kit 1.3.0+: Make sure to install the python SDK
- NCBI VDB Software Development Kit 2.8.0+
- pv (or any pipe-able buffering tool)
Reference data : note that the chromosome/contig identifiers need to match for all of these
- GRCh38 human genome reference fasta
- Create reference indexes for almost all of the above tools
- ClinVar vcf
- GTF file with exome regions
pip install git+https://github.com/NCBI-Hackathons/Ultrafast_Mapping_CSHL.git
There are several main python scripts: stream_sra.py, align.py and call_variants. The first streams a set of reads for a given accession to a file (or stdout by default). The second uses the first and pipes the reads to the aligner of your choice (via a FIFO, or pair of FIFOs for paired-end reads). The last accepts a BAM from an aligner and will call variants using one of several methods.
Save the first 10 reads of SRR1616919 to a pair of fastq files:
stream_sra.py -a SRR1616919 -M 10 head.1.fq head.2.fq
Append the next 10 reads to the same files:
stream_sra.py -a SRR1616919 -f 11 -l 20 -o a head.1.fq head.2.fq
Print the first 10 reads of SRR1616919 to stdout:
align.py -a SRR1616919 -M 10 -p head
Align all reads of SRR1616919 using STAR and generate a sorted BAM using 16 threads...:
align.py -a SRR1616919 -p star -r /path/to/star/index -o aligned.bam -t 16
...or with Hisat2:
align.py -a SRR1616919 -p hisat -r /path/to/hisat/index -o aligned.bam -t 16
Generate counts of SRR1616919 using Kallisto...:
align.py -a SRR1616919 -p kallisto -r /path/to/kallisto/index -o kallisto_output -t 16
...or with Salmon:
align.py -a SRR1616919 -p salmon -r /path/to/salmon/index -o salmon_output -t 16
Call variants with mpileup...:
varcallers.py --caller mpileup \ --samtools PATH_TO/samtools \ --bam mybam.bam \ --out OUTFILE.vcf \ --index hg38.fa \ --regions clinvar.chr.bed
...or with GATK:
varcallers.py --caller gatk \ --gatk PATH_TO/GenomeAnalysisTK.jar \ --bam mybam.bam \ --out OUTDIR \ --index hg38.fa \ --regions clinvar.chr.bed \ --indels Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \ --dbsnp Homo_sapiens_assembly38.dbsnp.vcf.gz \ --mem 16 --threads 12
Expressed variant calling
Install the alignment tools as above, and the run the 'run_pipeline.sh' script in the 'scripts' directory.
- Move all scripts over to srastream https://github.com/jdidion/srastream
- Update to samtools 1.4 dependency and drop Sambamba dependency.