No description, website, or topics provided.
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
VCF Annotator
evac
scripts
slides
tests
.DS_Store
.gitignore
Dockerfile
LICENSE
README.md
convert_formats_gatk.sh
convert_formats_mpileup.sh
run_counts.sh
run_counts_plots.R
run_mapping_sort.sh
setup.py

README.md

UltraFast Expressed Variant Calling

The current tools for identification of expressed variants are either slow or involve several software dependencies which are difficult to install. In addition, they either generate several intermediate files or use significant amounts of RAM which is not highthroughput if you are trying to analyze large datasets.

Our pipeline identifies expressed variants from public databases such as NCBI SRA as well as user’s own RNA-Seq data using one of several RNA aligners and quantification tools.

Presentation: Hackathon end presentation

Alignment/quantification from SRA

Dependencies

Older or newer versions may be used but may not be fully compatible.

Pull from SRA:

RNA Aligners

Transcript quantification

Transcript counting

Variant callers

Reference data : note that the chromosome/contig identifiers need to match for all of these

Other tools

Installation

pip install git+https://github.com/NCBI-Hackathons/Ultrafast_Mapping_CSHL.git

Usage

There are several main python scripts: stream_sra.py, align.py and call_variants. The first streams a set of reads for a given accession to a file (or stdout by default). The second uses the first and pipes the reads to the aligner of your choice (via a FIFO, or pair of FIFOs for paired-end reads). The last accepts a BAM from an aligner and will call variants using one of several methods.

Save the first 10 reads of SRR1616919 to a pair of fastq files:

stream_sra.py -a SRR1616919 -M 10 head.1.fq head.2.fq

Append the next 10 reads to the same files:

stream_sra.py -a SRR1616919 -f 11 -l 20 -o a head.1.fq head.2.fq

Print the first 10 reads of SRR1616919 to stdout:

align.py -a SRR1616919 -M 10 -p head

Align all reads of SRR1616919 using STAR and generate a sorted BAM using 16 threads...:

align.py -a SRR1616919 -p star -r /path/to/star/index -o aligned.bam -t 16

...or with Hisat2:

align.py -a SRR1616919 -p hisat -r /path/to/hisat/index -o aligned.bam -t 16

Generate counts of SRR1616919 using Kallisto...:

align.py -a SRR1616919 -p kallisto -r /path/to/kallisto/index -o kallisto_output -t 16

...or with Salmon:

align.py -a SRR1616919 -p salmon -r /path/to/salmon/index -o salmon_output -t 16

Call variants with mpileup...:

varcallers.py 	--caller mpileup \
		--samtools PATH_TO/samtools \
		--bam mybam.bam \
		--out OUTFILE.vcf \
                --index hg38.fa \
		--regions clinvar.chr.bed 
	

...or with GATK:

varcallers.py 	--caller gatk \
		--gatk PATH_TO/GenomeAnalysisTK.jar \
		--bam mybam.bam \
		--out OUTDIR \
		--index hg38.fa \
		--regions clinvar.chr.bed \
		--indels Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
		--dbsnp Homo_sapiens_assembly38.dbsnp.vcf.gz \
		--mem 16 --threads 12

Expressed variant calling

Usage

Install the alignment tools as above, and the run the 'run_pipeline.sh' script in the 'scripts' directory.

TODO