Skip to content
Griffan(Fan Zhang) edited this page Mar 9, 2020 · 34 revisions

Welcome to the FASTQuick wiki!

Contents

FASTQuick Introduction

Quality assessment of sequence data is so important that it lays the foundation for all the analyses carried out thereafter. But comprehensive assessment always requires fully dataset level reads mapping which takes up to hundreds of hours for single moderate-depth human sample, hence fast sequencing speed and slow QC speed formed a bottleneck that delays timely feedback to avoid further waste of resources. FASTQuick addresses this problem.

Key Features:

  • Collect Basic QC related information rapidly
    • Finish QC analysis hundreds of times faster than existed tools.
    • Generate a report of Alignment, Quality vs Cycle/Empirical Quality Correlation, Genotype Likelihood, Insert Size Distribution, GC distribution, Depth distribution, Pileup
  • Estimate the possible extent of sample contamination
  • Estimate population identity

FASTQuick incorporates the alignment, population identification, contamination estimation and variety of QC analysis tools. FASTQuick can run on a user's computer, on an instance in a compute cloud.

Ultra-Fast Speed

Features comparison with other QC tools:

Performance comparison with full-alignment based QC pipeline:

General Workflow of FASTQuick

A Short Tutorial of FASTQuick

We have prepared an example directory and a resource directory in the top-level source code directory. After the ‘make’ step, you can enter the example directory for a test run. Before we start, we need several database files. You can either use the ones we prepared in resource directory or you can download them from following links:

[1000 genome hg19 reference] (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz)

[dbSNP Variant Sites] (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp//technical/reference/dbsnp132_20101103.vcf.gz)

[1000 genome strict mask region] (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed)

1 Build index

Assuming we are now in example directory, we can build index by:

src/FASTQuick index --siteVCF hapmap.vcf --dbsnpVCF dbsnp.vcf.gz  --ref hs37d5.fa --mask 20141007.all.strict_mask.fasta --out_prefix NA12878

This would generate a bunch of index files starting with prefix “NA12878.FASTQuick.fa”, these files include:

NA12878.FASTQuick.fa.bed
NA12878.FASTQuick.fa.fai
NA12878.FASTQuick.fa.rbwt
NA12878.FASTQuick.fa.rpac
NA12878.FASTQuick.fa.SelectedSite.vcf.gz
NA12878.FASTQuick.fa.amb
NA12878.FASTQuick.fa.bwt
NA12878.FASTQuick.fa.gc
NA12878.FASTQuick.fa.rsa
NA12878.FASTQuick.fa.SelectedSite.vcf.gz.tbi
NA12878.FASTQuick.fa.ann
NA12878.FASTQuick.fa.dpSNP.subset.vcf
NA12878.FASTQuick.fa.pac
NA12878.FASTQuick.fa.rollhash
NA12878.FASTQuick.fa.sa

These files could be reused once your have generated.

2 Summarize basic statistics

Next, we can type:

src/FASTQuick align  --index_prefix NA12878 --fq_list NA12878.fq.list --out_prefix NA12878_output

This would generate files that you needed for analysis, including:

NA12878_output.Bam
NA12878_output.EmpCycleDist
NA12878_output.GCDist
NA12878_output.InsertSizeTable
NA12878_output.Pileup.gz
NA12878_output.SexChromInfo
NA12878_output.DepthDist
NA12878_output.EmpRepDist
NA12878_output.RawInsertSizeDist
NA12878_output.AdjustedInsertSizeDist
NA12878_output.Likelihood
NA12878_output.Pileup.gz.tbi
NA12878_output.Summary

Most of these files’ names are self-explanatory, and later we would generate a graphical report to help you better understand these results.

3 Estimate genetic ancestry and contamination

Next, to estimate this sample’s population identity, we can type

src/FASTQuick pop+con --SVDPrefix resource/hapmap_3.3.b37.dat --Pileup NA12878.Pileup.gz

After it finishes, FASTQuick will print estimated PC coordinates and contamination level:

Estimation from OptimizeHeter:
Contaminating Sample PC1:-0.623602      PC2:0.57292
Intended Sample  PC1:-0.036304  PC2:0.0200112
Alpha:0.0013662

The PC values are used to visualize genetic ancestry in the 1000 genome reference panel.

The Alpha value indicates the estimated contamination level.

5 Generate genetic ancestry report

We also provide a set of auxiliary tools to visualize these results. You can find them in bin/ directory.

Rscript $(FASTQUICK_HOME)/bin/RPlotScript.R <FASTQuick align out_prefix>  <SVDPrefix> <FASTQuickInstallDir>

This would generate a figure looks like this:

However, the recommended way is to use bin/FASTQuick.sh to do a one-stop analysis, which will generate all the results listed above automatically.

FASTQuick Manual

index

FASTQuick index --siteVCF [hapmap site vcf] --dbsnpVCF [dbsnp site vcf]  --ref [reference fasta]  --flank_len [250] --var_short [9000] --flank_long_len [1000] --var_long [1000] --callableRegion [repeat_mask.fasta] --out_prefix [reduced_ref_index]

Index database sequences, using known variant sites to anchor informative regions.

OPTIONS
--siteVCF        [String] VCF file with candidate variant sites(if predefinedVCF not specified)
--predefinedVCF  [String] VCF file with predefined variant sites(if siteVCF not specified)
--regionList     [String] Bed file with target region list
--dbsnpVCF       [String] VCF file with dbsnp site 
--ref            [String] Fasta file with reference genome
--callableRegion [String] Masked fasta or bed file to specify callable regions 
--flank_len      [Int] Flanking region length of short-flanking-region variant
--var_short      [Int] Number of short-flanking-region variant
--flank_long_len [Int] Flanking region length of long-flanking-region variant
--var_long       [Int] Number of long-flanking-region variant
--out_prefix     [String] Prefix of all the output index files

In principle, you can choose any common genetic variants database for --siteVCF; or if you have a specific list of variants, you can specify --predefinedVCF to skip marker selection stage.

To further simplify this step, we also provided a bundle of resource files with the pre-defined genetic variant list in ${FASTQUICK_HOME}/resource/

align

FASTQuick align --index_prefix [reduced_ref_index] --fq_list [sample’s fastq list file] [--bam_out] [--cal_dup] [--I] --t [2]  --out_prefix [NA12878] --frac_samp [1.0] 

Align short reads 70~300 bp to selected reference region to generate comprehensive quality control related statistics in a very short time.

OPTIONS 
--fq_list        [String] Path of fastq file list, format: [path of pair-end 1]\t[path of pair-end 2(optional)]
--bam_in         [String] Input reads in bam format
--index_prefix   [String] Input Prefix of all the index files
--out_prefix     [String] Prefix of variety of output files
--n              [Float] Maximum edit distance if the value is INT or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
--o              [Int] Maximum number of gap opens [1]
--e              [Int] Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]
--i              [Int] Disallow an indel within INT bp towards the ends [5]
--d              [Int] Disallow a long deletion within INT bp towards the 3’-end [16]
--l              [Int] Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’. [32]
--k              [Int] Maximum edit distance in the seed [2]
--m              [Int] Maximum entries in the queue [2000000]
--t              [Int] Number of threads [1]
--R              [Int] Stop searching when there are >INT equally best hits [30]
--q              [Int] Quality threshold for read trimming down to 35bp [0]
--RG             [String] Read group name
--N              [Bool] Non-iterative mode: search for all n-difference hits
--NonI           [Bool] Input fastq quality is sanger format 
--L              [Bool] Log-scaled gap penalty for long deletions 
--max_isize      [Int] Maximal insert size for a read pair to be considered being mapped properly.[500] 
--max_occ        [Int] Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
--is_sw          [Bool] Enable Smith-Waterman for the unmapped mate.[True]
--n_multi        [Int] Maximum hits to output for paired reads.[3]
--N_multi        [Int] Maximum hits to output for discordant pairs.[10]
--ap_prior       [Float] Prior of chimeric rate (lower bound)[1.0e-05]
--force_isize    [Bool] Disable insert size estimate.[False]
--cal_dup        [Bool] Calculate PCR duplicated reads in all the statistics.[False]
--frac_samp      [Float] Overall reads downsampling rate.[1]

This step utilizes an optimized version of BWA to rapidly screen unmap reads without losing sensitivity to potential mismatches on reads.

Each line of fq_list contains the full path of each fastq file. Pair-end fastq files are in the same line and delimited by tab; a single-end fastq file occupies the whole line.

pop+con

FASTQuick pop+con --BamFile [sample bam file] --SVDPrefix [SVD resource files prefix] --Reference [reference genome fasta file]

Jointly estimate sample genetic ancestry and contamination rate(same method as in VerifyBAMID2)

OPTIONS
--SVDPrefix      [String] SVD related files prefix(normally shared by .UD, .mu and .bed files)[Required]
--BamFile        [String] Bam or Cram file for the sample[Required]
--Reference      [String] reference file[Required]
--Seed           [Int] Random number seed(default:12345)
--NumPC          [Int] Number of Principal Components used in the estimation
--NumThread      [Int] Set number of threads in likelihood calculation[default:4]
--FixPC          [String] Specify known PC coordinates for the sample[format PC1:PC2:PC3...]
--FixAlpha       [Double] Specify known contamination level
--WithinAncestry [Bool] Enabling withinAncestry assume target sample and contamination source are from the same populations,[default:betweenAncestry] otherwise")
--KnownAF        [String] A Bed file that provides known allele frequency for each marker, similar behavior with VerifyBamID 1.0
--Epsilon        [Double] Minimization procedure convergence threshold, usually a trade-off between accuracy and running time[default:1e-10]
--OutputPileup   [Bool] If output temp pileup file
--Verbose        [Bool] If print the progress of the method on the screen
/*To construct SVDPrefix auxiliary files*/
--RefVCF         [String] Reference panel VCF with genotype information, for the generation of .UD .mu .bed files[Optional]

Generate Final Report

Usage:

Rscript $(FASTQUICK_HOME)/bin/RPlotScript.R <FASTQuick align out_prefix>  <SVDPrefix> <FASTQuickInstallDir>

Example:

Rscript $(FASTQUICK_HOME)/bin/RPlotScript.R  FASTQuick_align_out_prefix $(FASTQUICK_HOME)/resource/1000g.phase3.10k.b37.vcf.gz.dat $(FASTQUICK_HOME)

Getting Help with FASTQuick

It’s a good place to report bugs and usage using GitHub issues: [https://github.com/Griffan/FASTQuick/issues] Alternatively, if you can’t get a response or need emergent help, you can also email Fan via fanzhang@umich.edu See also FASTQuick FAQs below.

FAQ