From 906564e5bd7b2712ced8b09401e0b903deb2576c Mon Sep 17 00:00:00 2001 From: Google Code Exporter Date: Mon, 3 Aug 2015 16:58:53 -0400 Subject: [PATCH 1/2] Migrating wiki contents from Google Code --- Compiling.md | 36 ++++++++++++++ FastqJoin.md | 58 +++++++++++++++++++++ FastqMcf.md | 124 +++++++++++++++++++++++++++++++++++++++++++++ FastqMultx.md | 112 +++++++++++++++++++++++++++++++++++++++++ ProjectHome.md | 41 +++++++++++++++ SamStats.md | 65 ++++++++++++++++++++++++ SamStatsDetails.md | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++ Varcall.md | 98 ++++++++++++++++++++++++++++++++++++ 8 files changed, 678 insertions(+) create mode 100644 Compiling.md create mode 100644 FastqJoin.md create mode 100644 FastqMcf.md create mode 100644 FastqMultx.md create mode 100644 ProjectHome.md create mode 100644 SamStats.md create mode 100644 SamStatsDetails.md create mode 100644 Varcall.md diff --git a/Compiling.md b/Compiling.md new file mode 100644 index 0000000..541213d --- /dev/null +++ b/Compiling.md @@ -0,0 +1,36 @@ +# Introduction # + +Compiling & installing. + + +# Details # + +You should be able to run + +``` +> make +``` +and +``` +> make test +``` + +... to install. + +Some caveats are: + +The sparse hash library may need updating. + +The new version of varcall requires the GNU scientific library to be installed in order to compile. + +On UBUNTU : + +apt-get install libgsl0-dev + +On CENTOS/REDHAT : + +rpm -i gsl-devel + +On WINDOWS: + +Use MinGW, and use the [Windows port of GSL](http://gnuwin32.sourceforge.net/packages/gsl.htm) \ No newline at end of file diff --git a/FastqJoin.md b/FastqJoin.md new file mode 100644 index 0000000..5fab680 --- /dev/null +++ b/FastqJoin.md @@ -0,0 +1,58 @@ +# Usage # + +``` +Usage: fastq-join [options] [mate.fq] -o + +Joins two paired-end reads on the overlapping ends. + +Options: + +-o FIL See 'Output' below +-v C Verifies that the 2 files probe id's match up to char C + use '/' for Illumina reads +-p N N-percent maximum difference (8) +-m N N-minimum overlap (6) +-r FIL Verbose stitch length report +-R No reverse complement +-V Show version + +Output: + + You can supply 3 -o arguments, for un1, un2, join files, or one +argument as a file name template. The suffix 'un1, un2, or join' is +appended to the file, or they replace a %-character if present. + + If a 'mate' input file is present (barcode read), then the files +'un3' and 'join2' are also created. + + Files named ".gz" are assumed to be compressed, and can be +read/written as long as "gzip" is in the path. +``` + +## Etc ## + +This uses our sqr(distance)/len for anchored alignment quality algorithm. It's a good measure of anchored alignment quality, akin (in my mind) to squared-deviation for means. + +## Overlapping Bases ## + +### When the bases match ### + +The higher quality base is used, and it is increased by up to 3 + +### When the bases don't match ### + +If one quality is greater than "3" (50%), then the the resulting quality is the difference between the two qualities (reduced quality due to mismatch), or "3" )(50%), whichever is greater. + +#### Examples: #### + +``` +40 vs 3 = 37 : second base has low quality... doesn't change top by much +40 vs 40 = 3 : two equal quality bases that don't match = qual of 3 +2 vs 2 = 2 : neither base has a high quality +``` + +#### Some caveats: #### + +Illumina's quality scores are not accurate and estimates vary by chemistry and sequencer. I would recommend using a profiling tool, on PhiX, and adjusting your qualities using the results of the tool. + +For example the quality score "2" has a true quality, typically of "11" ... this is Illumina's code for "quality estimation failure". The quality scores at the high end ("34-40") are often overestimates. \ No newline at end of file diff --git a/FastqMcf.md b/FastqMcf.md new file mode 100644 index 0000000..f7ddc6f --- /dev/null +++ b/FastqMcf.md @@ -0,0 +1,124 @@ +# Introduction # + +fastq-mcf attempts to: + + * Detect & remove sequencing adapters and primers + * Detect limited skewing at the ends of reads and clip + * Detect poor quality at the ends of reads and clip + * Detect Ns, and remove from ends + * Remove reads with CASAVA 'Y' flag (purity filtering) + * Discard sequences that are too short after all of the above + * Keep multiple mate-reads in sync while doing all of the above + +# Usage # + +``` + +Usage: fastq-mcf [options] [mates1.fq ...] +Version: 1.04.636 + +Detects levels of adapter presence, computes likelihoods and +locations (start, end) of the adapters. Removes the adapter +sequences from the fastq file(s). + +Stats go to stderr, unless -o is specified. + +Specify -0 to turn off all default settings + +If you specify multiple 'paired-end' inputs, then a -o option is +required for each. IE: -o read1.clip.q -o read2.clip.fq + +Options: + -h This help + -o FIL Output file (stats to stdout) + -s N.N Log scale for adapter minimum-length-match (2.2) + -t N % occurance threshold before adapter clipping (0.25) + -m N Minimum clip length, overrides scaled auto (1) + -p N Maximum adapter difference percentage (10) + -l N Minimum remaining sequence length (19) + -L N Maximum remaining sequence length (none) + -D N Remove duplicate reads : Read_1 has an identical N bases (0) + -k N sKew percentage-less-than causing cycle removal (2) + -x N 'N' (Bad read) percentage causing cycle removal (20) + -q N quality threshold causing base removal (10) + -w N window-size for quality trimming (1) + -H remove >95% homopolymer reads (no) + -0 Set all default parameters to zero/do nothing + -U|u Force disable/enable Illumina PF filtering (auto) + -P N Phred-scale (auto) + -R Dont remove Ns from the fronts/ends of reads + -n Dont clip, just output what would be done + -C N Number of reads to use for subsampling (300k) + -S Save all discarded reads to '.skip' files + -d Output lots of random debugging stuff + +Quality adjustment options: + --cycle-adjust CYC,AMT Adjust cycle CYC (negative = offset from end) by amount AMT + --phred-adjust SCORE,AMT Adjust score SCORE by amount AMT + +Filtering options*: + --[mate-]qual-mean NUM Minimum mean quality score + --[mate-]qual-gt NUM,THR At least NUM quals > THR + --[mate-]max-ns NUM Maxmium N-calls in a read (can be a %) + --[mate-]min-len NUM Minimum remaining length (same as -l) + --hompolymer-pct PCT Homopolymer filter percent (95) + +If mate- prefix is used, then applies to second non-barcode read only + +Adapter files are 'fasta' formatted: + +Specify n/a to turn off adapter clipping, and just use filters + +Increasing the scale makes recognition-lengths longer, a scale +of 100 will force full-length recognition of adapters. + +Adapter sequences with _5p in their label will match 'end's, +and sequences with _3p in their label will match 'start's, +otherwise the 'end' is auto-determined. + +Skew is when one cycle is poor, 'skewed' toward a particular base. +If any nucleotide is less than the skew percentage, then the +whole cycle is removed. Disable for methyl-seq, etc. + +Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done +for miRNA, amplicon and other low-complexity situations!) + +Duplicate read filtering is appropriate for assembly tasks, and +never when read length < expected coverage. -D 50 will use +4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly. + +*Quality filters are evaluated after clipping/trimming +``` + +## Notes ## + +Adapter file format is fasta. You can set it to /dev/null, and pass "-f" to do skew detection only. + +## Todo ## + + * When discarding one read for being "too short", it has to discard both pairs. For a sequencing run of normal quality this is not an issue. It should, though, write "un-mated" reads (whose mate was skipped) to a separate file. Typically, since these read mates were poor quality, it's not really useful... but it can be for diagnostics. I've seen runs where these provide valuable data. + + * Like any tool that does many things, fastq-mcf can be limited in it's ability to be flexible. The biggest missing feature is for it to be able to read files that are formatted like it's stderr output, and use them to guide the process. Given that feature, fastq-mcf would be complete. + +## Notes ## + + * Default settings are probably too conservative when it comes to trimming poor quality/detecting base-skew. + + * It won't trim the "insides" of a paired-end read. It also will no longer attempt to quality filter a barcode read. No override for these, but I can't think of a reason to. + + * The -x percentage can be a confusing parameter. it causes the **entire** cycle to be removed from **all** reads.... even ones without N's... if more than, say, 20% of that cycle is N's. + +## Cleaning multiple files ## + +Using process substitution, or named pipes, you can clean multiple fastq's in one pass. This is useful for combining multiple MiSeq runs, or multiple lanes for example: + +``` +fastq-mcf \ + -o cleaned.R1.fq.gz \ + -o cleaned.R2.fq.gz \ + adapters.fa \ + <(gunzip -c uncleaned.lane1.R1.fq.gz uncleaned.lane2.R1.fq.gz;) \ + <(gunzip -c uncleaned.lane1.R2.fq.gz uncleaned.lane2.R2.fq.gz;) +``` + +(Many bioinformatic tools are not "stream friendly", and some may require the "buffer" command to work. But fastq-mcf does its own buffering internally.) \ No newline at end of file diff --git a/FastqMultx.md b/FastqMultx.md new file mode 100644 index 0000000..7e33bff --- /dev/null +++ b/FastqMultx.md @@ -0,0 +1,112 @@ +# Introduction # + +The idea behind this is to reduce the amount of "piping" going on in a pipeline. A lot of time, disk space and nail-chewing is spent keeping files in sync, figuring out what barcodes are on what samples, etc. The goal of this program is to make it easier to demultiplex possibly paired-end sequences, and also to allow the "guessing" of barcode sets based on master lists of barcoding protocols (fluidigm, truseq, etc.) + +# Usage # + +``` +Usage: fastq-multx [-g|-l] -o r1.%.fq [mate.fq -o r2.%.fq] ... + +Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file. + +Barcodes file looks like this: + + + ... + +Default is to guess the -bol or -eol based on clear stats. + +If -g is used, then it's parameter is an index lane, and frequently occuring sequences are used. + +If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen. + +Grouped barcodes file looks like this: + + + + ... + +Mated reads, if supplied, are kept in-sync + +Options: + +-o FIL1 [FIL2] Output files (one per input, required) +-g FIL Determine barcodes from indexed read FIL +-l FIL Determine barcodes from any read, using FIL as a master list +-b Force beginning of line +-e Force end of line +-x Don't trim barcodes before writing +-n Don't execute, just print likely barcode list +-v C Verify that mated id's match up to character C ('/' for illumina) +-m N Allow up to N mismatches, as long as they are unique +``` + + +Files named ".gz" are assumed to be compressed, and can be +read/written as long as "gzip" is in the path. + +# Example 1 # + +# this example will read/output files that are gzipped, since -B is used and only 1 sequence files is present, it will look for barcodes on the "ends" of the sequence and will tell you which end it found them on + +`fastq-multx -B barcodes.fil seq.fastq.gz -o %.fq.gz` + +Contents of barcodes.fil: + +``` +mock_a ACCC +salt_a CGTA +mock_b GAGT +salty_b CGGT +``` + +# Example 2 # + +# this example will first determine which "barcode group" to use, will select the most likely set of barcodes from that file, and will then proceed as if only that set was specified. this allows for a single pipeline that works with multiple technologies + +`fastq-multx -l barcodes.grp seq2.fastq.gz seq1.fastq.gz -o n/a -o out%.fq` + +Contents of barcodes.grp: + +``` +id seq style +LB1 ATCACG TruSeq +LB2 CGATGT TruSeq +LB3 TTAGGC TruSeq +LB4 TGACCA TruSeq +LB5 ACAGTG TruSeq +A01_01 TAGCTTGT Fluidigm +B01_02 CGATGTTT Fluidigm +C01_03 GCCAATGT Fluidigm +D01_04 ACAGTGGT Fluidigm +E01_05 ATCACGTT Fluidigm +``` + +Standard error will output: + +`Using Barcode Group: TruSeq on File: seq2.fastq.gz (start), Threshold 0.59%` + +This indicated that The LB1-LB5 barcodes will be used, and that the filess will be named LB1-LB5, and that the barcode was at the "start" of the reads in the seq2 file. + + +Example of Nextera/Dual-Indexed input: + +``` +id seq style +D708_508 TAATGCGC-GTACTGAC TruSeq RNA +D709_501 CGGCTATG-TATAGCCT TruSeq RNA +D709_502 CGGCTATG-ATAGAGGC TruSeq RNA +D709_503 CGGCTATG-CCTATCCT TruSeq RNA +D709_504 CGGCTATG-GGCTCTGA TruSeq RNA +D709_505 CGGCTATG-AGGCGAAG TruSeq RNA +D709_506 CGGCTATG-TAATCTTA TruSeq RNA +D709_507 CGGCTATG-CAGGACGT TruSeq RNA +D709_508 CGGCTATG-GTACTGAC TruSeq RNA +D710_501 TCCGCGAA-TATAGCCT TruSeq RNA +D710_502 TCCGCGAA-ATAGAGGC TruSeq RNA +D710_503 TCCGCGAA-CCTATCCT TruSeq RNA +D710_504 TCCGCGAA-GGCTCTGA TruSeq RNA +D710_505 TCCGCGAA-AGGCGAAG TruSeq RNA +D710_506 TCCGCGAA-TAATCTTA TruSeq RNA +D710_507 TCCGCGAA-CAGGACGT TruSeq RNA +``` \ No newline at end of file diff --git a/ProjectHome.md b/ProjectHome.md new file mode 100644 index 0000000..c6f20b0 --- /dev/null +++ b/ProjectHome.md @@ -0,0 +1,41 @@ +Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc. + +Primarily written to support an Illumina based pipeline - but should work with any FASTQs. + +### Overview: ### + + * [fastq-mcf](FastqMcf.md) +> > Scans a sequence file for adapters, and, based on a log-scaled threshold, determines a set of clipping parameters and performs clipping. Also does skewing detection and quality filtering. + * [fastq-multx](FastqMultx.md) +> > Demultiplexes a fastq. Capable of auto-determining barcode id's based on a master set fields. Keeps multiple reads in-sync during demultiplexing. Can verify that the reads are in-sync as well, and fail if they're not. + * [fastq-join](FastqJoin.md) +> > Similar to audy's stitch program, but in C, more efficient and supports some automatic benchmarking and tuning. It uses the same "squared distance for anchored alignment" as other tools. + * [varcall](Varcall.md) +> > Takes a pileup and calculates variants in a more easily parameterized manner than some other tools. + +### Other Stuff: ### + + * [sam-stats](SamStats.md) - Basic sam/bam stats. Like other tools, but produces what I want to look at, in a format suitable for passing to other programs. (View source) + + * fastq-stats - Basic fastq stats. Counts duplicates. Option for per-cycle stats, or not (irrelevant for many sequencers). (View source) + + * determine-phred - Returns the phred scale of the input file. Works with sams, fastq's or pileups and gzipped files. + + * Chrdex.pm & Sqldex.pm - obsoleted by the cpan module Text::Tidx. Sqldex may not actually be obsolete, because Tidx uses more ram and is slower for very small jobs. But for Exome and RNA-Seq work, [Text::Tidx](http://search.cpan.org/~earonesty/Text-Tidx/) beats both. + + * qsh - Runs a bash script file like a "cluster aware makefile"...only processing newer things, die'ing if things go wrong, and sending jobs to a queue manager if they're big. That way you don't have to write makefiles, or wrap things in "qsub" calls for every little program. Not really ready yet. + + * grun - Fast, lightweight grid queue software. Keeps the job queue on disk at all times. Very fast. Works well by now + + * gwrap - Bash wrapper shell that downloads all dependencies that are not the local system.... good for EC2 nodes. Linux only. Will use it if we ever go to EC2. + + * gtf2bed - Converter that bundles up a GFF's exons and makes a UCSC-styled bed file with thin/thick properly set from the start/stop sites. (Click for source) + + * randomFQ - takes a fastq (can be gzipped or paired-end) and randomly subsets to a user defined number of reads (Click for source) + +### Citing: ### + + +> Erik Aronesty (2011). _ea-utils_ : "Command-line tools for processing biological sequencing data"; http://code.google.com/p/ea-utils + +> Erik Aronesty (2013). _TOBioiJ_ : "Comparison of Sequencing Utility Programs", [DOI:10.2174/1875036201307010001](http://benthamscience.com/open/openaccess.php?tobioij/articles/V007/1TOBIOIJ.htm) \ No newline at end of file diff --git a/SamStats.md b/SamStats.md new file mode 100644 index 0000000..d5de09c --- /dev/null +++ b/SamStats.md @@ -0,0 +1,65 @@ +# Introduction # + +Tool for computing statistics from (possibly compressed) SAM or BAM files. + +See SamStatsDetails for more info on the output fields. + +# Usage # + +``` +Usage: sam-stats [options] [file1] [file2...filen] +Version: 1.32 + +Produces lots of easily digested statistics for the files listed + +Options (default in parens): + +-D Keep track of multiple alignments (slower!) +-M Only overwrite if newer (requires -x, or multiple files) +-A Report all chr sigs, even if there are more than 1000 +-R FIL RNA-Seq stats output (coverage, 3' bias, etc) +-B Input is bam, dont bother looking at magic +-x FIL File extension for multiple files (stats) +-b INT Number of reads to sample for per-base stats (1M) +-S INT Size of ascii-signature (30) +-z Don't fail when zero entries in sam + +OUTPUT: + +If one file is specified, then the output is to standard out. If +multiple files are specified, or if the -x option is supplied, +the output file is .. Default extension is 'stats'. + +Complete Stats: + + : mean, max, stdev, median, Q1 (25 percentile), Q3 + reads : # of entries in the sam file, might not be # reads + phred : phred scale used + bsize : # reads used for qual stats + mapped reads : number of aligned reads (unique probe id sequences) + mapped bases : total of the lengths of the aligned reads + forward : number of forward-aligned reads + reverse : number of reverse-aligned reads + secondary : # of entries with 256-bit set + snp rate : mismatched bases / total bases + ins rate : insert bases / total bases + del rate : deleted bases / total bases + pct mismatch : percent of reads that have mismatches + len : read length stats, ignored if fixed-length + mapq : stats for mapping qualities + insert : stats for insert sizes + % : percentage of mapped bases per chr, followed by a signature + +Subsampled stats (1M reads max): + base qual : stats for base qualities + %A,%T,%C,%G : base percentages + +Meaning of the per-chromosome signature: + A ascii-histogram of mapped reads by chromosome position. + It is only output if the original SAM/BAM has a header. The values + are the log2 of the # of mapped reads at each position + ascii '0'. +``` + +DETAILED OVERVIEW OF STATISTICS: + +Click SamStatsDetails for more information on each stat, how it's calculated and what it means. \ No newline at end of file diff --git a/SamStatsDetails.md b/SamStatsDetails.md new file mode 100644 index 0000000..d90a53d --- /dev/null +++ b/SamStatsDetails.md @@ -0,0 +1,144 @@ +# Sam-stats statistics by name # + +### reads ### + +Incremented for every sequence-containing line in the sam file, regardless of whether it represents an alignment. for some files, this is not actually the number of reads. indeed, this may be a poor name for this stat + +### version ### + +The version of sam-stats used. + +### mapped reads ### + +If duplicate tracking was enabled via -D, then this attempts to recapitulate the number of unique, mapped, probe-id's in the original sam file. It is multiplied by 2 for paired-end data with duplicate read id's. The idea is that if you divide this by the number of reads in the fastq you aligned (possibly from the output of fastq-stats), you will get an accurate "percentage of reads aligned" statistic. + +If duplicate tracking is not enabled, then the result is the number of mapped reads in the sam file. + +The definition of "mapped" is something with a non-negative position, and a "non-asterisk" cigar string. + +### pct ambiguous ### + +Formula is: + +100 ** / ** + +"Aligning more than once" means more than 2 alignments for paired end and more than 1 alignment for single-end. + +_This does NOT, currently, use the BWA XA tag, and support for this will be added in the near future._ + +### max dup align ### + +The maximum number of alignments for a single probe-id. IE: if seq4455 aligns 50 times, and this is more than any other sequence, then the result would be 50. For paired-end reads one is subtracted, since you should get two alignments with the same probe-id for each read, but anything more than that is considered "ambiguous". + +One could argue, rightly, that it should be divided by two for paired end. If this is changed, it would require a major version upgrade, or rename of the statistic since it would invalidate historical comparisons. + + +### singleton mappings ### + +Number of reads that aligned only once, without a pair. Only tracked if -D is enabled. + +### total mappings ### + +The number of lines in the sam file that were "mapped" reads. + +### mapped bases ### + +The total number of bases, derived from the length of the sequence strings, in the sam file that were "mapped" reads. + +### library:paired-end ### + +And indicator that "paired end" semantics were used when running sam-stats, because there were alignments that exhibited these characteristics. + +### discordant mates ### +The number of read-pairs that were aligned to different chromosomes. + +### distant mates ### +The number of read-pairs that were aligned more than 50k bases apart + +### phred ### +The phred scale used when analyzing this sam file. Should always be 33, unless you're looking at some old Illumina aligned output. + + +### forward ### +The number of lines in the sam file that were aligned to the "forward" strand. No accounting is done on duplicates. + +### reverse ### +The number of lines in the sam file that were aligned to the "reverse" strand. No accounting is done on duplicates. + +### len (max/min/mean) ### +"len max" is always output. "len mean, len stdev" are only output if the length of the sequence reads vary. Calculations are based on the the length of the (possibly hard-clipped) sequence in the sam file. + +### mapq (mean/stdev/Q1/median/Q3) ### + +These are all statistics on mapped reads only, with no attempt to de-duplicate multiply aligned reads before calculating the statistic. As a result, a single read that is represented 1000 times in the sam results can skew the mapping quality significantly lower (since they will all be zeroes). + +The statistics presented are: mean and standard deviation, (Q1) first-quartile (R-compatible algorithm used), median, and (Q3) upper-quartile. + +### snp rate ### +The total number of mismatch bases, divided by the total number of mapped based. Calculated using NM tags. + +### ins rate ### +The total number of inserted bases, divided by the total number of mapped based. Calculated using cigar strings. + +### del rate ### +The total number of deleted bases, divided by the total number of mapped based. Calculated using cigar strings. + +### insert mean/stdev/Q1/median/Q3 ### +These are the "isize" statistics. The mean and standard deviation are ["trimmed" by 10%](http://en.wikipedia.org/wiki/Truncated_mean), in accordance with best-practices for producing meaningful fragment size distributions. + +You will have to add the mean length of read2 to this value in order to get the total fragment size for programs like Mosaik and RSEM that require it, or subtract the mean read1 size for programs like tophat that require the "between fragment" size (even if negative). + +Quartiles are computed with an R-compatible algorithm + +### base qual mean/stdev ### + +Total quality score of every mapped base, divided by the total number of mapped bases. + +### %A/%T/%C/%G/%N ### + +Total number of mapped bases of each type, divided by the total number of mapped bases. + +### num ref seqs ### + +The number of reference sequences in the original SAM headers. + +### num ref aligned ### + +The number of unique reference sequences that have at least one alignment. + +### median skew ### + +Only enabled in "rna mode", the median skewness of all transcripts with coverage. An "overall bias" metric, useful for detecting 3'bias. + +### median coverage ### + +Median of nonzero coverage (see rna mode below) + +### median coverage cv ### + +Median of nonzero coverage variability (see rna mode below) + +### %chr1, %gene, and signature values ### + +For each mapped reference chromosome or gene, the total number of mapped bases is totaled. The percentages are the total mapped bases for that gene or chromosome divided by the total mapped bases. + +If "-A" is used, up to a million reference sequences are tracked and output. + +In addition to outputting a percentage, and only if the sam file had a header, a signature representing a histogram of alignments is output. + +Each chromosome/gene is divided in to 30 evenly sized "buckets". + +The values in the signature are the log2 of the # of mapped reads in each position bucket times the length of those reads + ascii '0'. + +Long reads caveat: There is no attempt to break-up long reads that align across multiple regions however. So these signatures are not (yet) useful for very long reads. If a read's first position maps to histogram bucket 1, but spans across to bucket 2, all of the bases are currently counted in bucket 1. For short reads, this is not significant. + + +### "RNA-seq" mode coverage matrix ### + +A "coverage" file is output with one row per reference sequence, the _7 columns_ are: transcript id, length, count, coverage percentage, skewness (bias), and coverage cv, and a signature (see above). + + * Coverage: statistics are 1x, and approximate, and are intended to measure exon bias - not appropriate for assembly! + + * Skewness: is the standard pearsons 3rd order skewness applies to the alignment positions across the transcript. + + * Coverage CV: the coefficient of variation of the coverage histogram (can be computed from the signature). \ No newline at end of file diff --git a/Varcall.md b/Varcall.md new file mode 100644 index 0000000..178aae5 --- /dev/null +++ b/Varcall.md @@ -0,0 +1,98 @@ +# Introduction # + +varcall does two tasks: + + * Calculates error rate statistics in a pileup + * Outputs variants per position + + +The tool is not complete. + +In particular: the VCF output is somewhat broken. The native output formats now work fine, and the stats/outputs collection and use are also working well. + +One of the advantages of using varcall over something like GATK is the extensive parameterization, allowing varcall to be used in any situation, this is also a shortcoming... since users must know the meanings of every parameter. In addition, varcall seems to be particularly fast. + +Recently we added variation specific error rate collection, and it is used, instead of a the global error rate, in the calculation of pvals. This vastly reduces false positives, etc. + +By the time we reach version 1, all of these should be fixed, including better docs, and better "auto-configuration" based on sampling. If you are using varcall before then, carefully experiment with parameters and compare to other tools. + +Version 1 should also have the ability to do a stats collection across a group of samples, and then subsequent calling using those stats. The advantage here is that for some amplicon/pcr assays, there are insufficient locii to get an accurate error variation estimate per-sample. + + +# Usage # + +``` + +Usage: varcall <-s|-v> <-f REF> [options] bam1 [bam2...] +Version: 0.95.794 (BETA) + +Either outputs summry stats for the list of files, or performs variant calling. +(You can specify -s and -v, in which case both are done). + +Options (later options override earlier): + +-s Calculate statistics +-v|version Calculate variants bases on supplied parameters (see -S) +-f Reference fasta (required if using bams, ignored otherwise) +-m Min locii depth (1) +-a Min allele depth (2) +-p Min allele pct by quality (0) +-q Min qual (3) +-Q Min mapping quality (0) +-b Min pct balance (strand/total) (0) +-D FLOAT Max duplicate read fraction (depth/length per position) (1) +-d FLOAT Minimum diversity (CV from optimal depth) (0.25) +-G FLOAT Minimum agreement (Weighted CV of positional variation) (0.25) +-0 Zero out all filters, set e-value filter to 1, report everything +-B If running from a BAM, turn off BAQ correction (false) +-R Homopolymer repeat indel filtering (8) +-e FLOAT Alpha filter to use, requires -l or -S (.05) +-g FLOAT Global minimum error rate (default: assume phred is ok) +-l INT Number of locii in total pileup used for bonferroni (1 mil) +-x CHR:POS Output this pos only, then quit +-S FILE Read in statistics and params from a previous run with -s (do this!) +-A ANNOT Calculate in-target stats using the annotation file (requires -o) +-o PREFIX Output prefix (works with -s or -v) +-F files List of file types to output (var, varsum, eav, vcf) + +Extended Options + +--pcr-annot BED Only include reads adhering to the expected amplicons +--stranded TYPE Can be FR (the default), FF, FR. Used with pcr-annot +--diversity|d FLOAT Alias for -d +--agreement|G FLOAT Alias for -G +--no-indels Ignore all indels + +Input files + +Files must be sorted bam files with bai index files available. +Alternatively, a single pileup file can be supplied. + +Output files + +Varcalls go to stdout. Stats go to stdout, or stderr if varcalling too + +If an output prefix is used, files are created as follows: + PREFIX.var Variant calls in tab delimited 'varcall' format + PREFIX.eav Variant calls in tab delimited 'ea-var' format + PREFIX.cse Variant calls in tab delimited 'varprowl' format + PREFIX.vcf Variant calls, in vcf format + PREFIX.varsum Summary of variant calls + PREFIX.tgt.var On-target version of .var + PREFIX.tgt.cse On-target version of .cse + PREFIX.tgt.varsum On-target version of .varsum + +Stats Output: + +Contains mean, median, quartile information for depth, base quality, read len, +mapping quality, indel levels. Also estimates parameters suitable for +variant calls, and can be passed directly to this program for variant calls + +If an output prefix is used, files are created as follows: + + PREFIX.stats Stats output + PREFIX.noise Non-reference, non-homozygous allele summary + PREFIX.xnoise Like noise, but with context-specific rates + + +``` \ No newline at end of file From ca0d5992d64c2f2e2ecd3f7c07efb600778afc4a Mon Sep 17 00:00:00 2001 From: EA Genomic Services Date: Thu, 8 Oct 2015 15:48:45 -0400 Subject: [PATCH 2/2] Update SamStats.md --- SamStats.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/SamStats.md b/SamStats.md index d5de09c..7be78c3 100644 --- a/SamStats.md +++ b/SamStats.md @@ -2,7 +2,7 @@ Tool for computing statistics from (possibly compressed) SAM or BAM files. -See SamStatsDetails for more info on the output fields. +See [SamStatsDetails](https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/SamStatsDetails.md) for more info on the output fields. # Usage # @@ -62,4 +62,4 @@ Meaning of the per-chromosome signature: DETAILED OVERVIEW OF STATISTICS: -Click SamStatsDetails for more information on each stat, how it's calculated and what it means. \ No newline at end of file +Click [SamStatsDetails](https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/SamStatsDetails.md) for more information on each stat, how it's calculated and what it means.