Switch branches/tags
Nothing to show
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
..
Failed to load latest commit information.
HyperLevelDB
samtools No commit message Mar 1, 2013
sparsehash-2.0.3
t
tidx
CHANGES document changes Sep 4, 2014
Chrdex.pm
Grun.pm perl module allowing remote execution of perl functions Aug 9, 2013
Makefile
README
affy-csv-to-bed.pl
affy-liftover.pl
aggregate-results-by-gene.pl
alc approximate line count Sep 18, 2013
align-sw.cpp seqan example Oct 15, 2011
bam-filter.cpp
bowtie-gzip.patch
bwa-to-bowtie
check-clipper.sh
contig-stats
count-ambig.cpp
debug
determine-phred
ea-bcl2fastq.cpp
ea-utils.spec
ea-utils.spex
fasta-qual-to-fastq
fastq-clipper.cpp
fastq-join.cpp Removed LastChangedRevision SVN magic as it will not only not work, b… Feb 26, 2016
fastq-join.t
fastq-lib.cpp
fastq-lib.h remove google folder Jul 24, 2014
fastq-mcf.cpp
fastq-multx.cpp Some int type variable overflow Jun 20, 2017
fastq-stats.cpp
fastq-to-fasta
fastx-graph fixed Jan 10, 2013
gcModel.cpp
gcModel.h
getgenbankannot
getline.c
gff2gtf
grun
gtf2bed
install-bamtools.sh
master-barcodes.txt
merge-fifo simple select loop, line-buffered Jul 14, 2014
mirna-quant.cpp
mixfastqs.pl
multx.sh renamed to trunk ... Apr 15, 2011
pbi-clean.cpp
qsh new way to specify command line parsing... nicer Jul 18, 2011
qsub
randomFQ
sam-stats.cpp
sam-stats.pl
seqsig.cpp add Aug 26, 2014
testchrdex-speed.pl
testchrdex.pl misc stuff Feb 7, 2013
testsuf.pl
utils.cpp
utils.h
varcall-matrix dup reads Dec 13, 2012
varcall.cpp
xjoin
zhead

README

OVERVIEW:

fastq-mcf

Scans a sequence file for adapters, and, based on a log-scaled threshold, determines a set of clipping parameters and performs clipping. Also does skewing detection and quality filtering.

fastq-multx

Demultiplexes a fastq. Capable of auto-determining barcode id's based on a master set fields. Keeps multiple reads in-sync during demultiplexing. Can verify that the reads are in-sync as well, and fail if they're not.

fastq-join
Similar to audy's stitch program, but in C, more efficient and supports some automatic benchmarking and tuning. It uses the same "squared distance for anchored alignment" as other tools.

fastq-stats
Outputs stats for fastqs

sam-stats
Output stats for sam/bam files

varcall
Variant caller, takes bam or pileup output and does variant calling with advanced features like PCR duplicate filtering, homopolymer repeat filtering, calculation of error rate and dectectibility (minimum percentage) thresholds.

REQUIRES:

On Ubuntu:

sudo apt-get install subversion zlib1g-dev libgsl0-dev

For building sam-stats, please install this first!

https://github.com/pezmaster31/bamtools/wiki/Building-and-installing

QUICK FAQ:

This is based on feedback/emails, etc.

fastq-mcf does a 300k sub-sampling to determine what to do.   There are lots of paramters to play with, but the "automatic" mode should do the right thing most of the time.  If it doesn't, I really would like to hear why/what it did.  The point in this tool is that the basic quality and adapter filtering should be something that's done automagically 90% of the time - not by manually picking paramters for each run.   The fact that it's making decisions "for the user" means it will probably change more over time than the other tools.  

If you want fastq-mcf to be similar to other tools, you need to pass -m XX, and -s 100, so it's a fixed-length.  If you try running with unrealistic, or "test" data, the heuristic won't work.  Instead, try with a subsample of 50000 or so "real" reads.

fastq-mcf doubles as a read-filtering program, it supports a broad range of filtering arguments.

fastq-join produces a "report".  This is just a list of lengths of joined reads.   Also it chooses the "better quality base" when overlapping.  Very stable code at this point.

fastq-multx is intended to keep mates in sync, so you can demultiplex in one-pass.  For single-reads, it's not better than other tools out there, except that you don't need to predefine your sets... which can help logistics in high-volume situations.  Also, notice the output file's "%-sign" substitution... this is instead of lots of prefix and suffix arguments.  Mismatch algorithm is "maximal unique"... ie... if it's possible that 2 barcodes can match, it won't use *either*.   Qualities are *no longer* ignored, you can explicity set low quality reads as mismatches.  Minimum edit distances can be specified, useful for recovering when CASAVA demultiplexing was poor, especially for dual indexed.  Very stable code at this point.

Dual-indexed codes are listed as SEQUENCE-SEQUENCE in the barcode file.  I haven't tried mixing them with others on the autodetect code, I can't imaginge there's a reason to do that.

The latest version can ignores bases that have extremely low qualities (<5), and refuses to match a barcode that isn't a minimum distance from another best match.   It's a lot safer, but for some poor-quality runs these features will need to be disabled.

sam-stats take a lot of options for a variety of reports.   The most important ones to note are -D, which builds a huge hash of probe ids, and -R which produces a coverage matrix.   It could autodetect if reads are sorted by probe ID and save RAM.   It could also reduce RAM by removing common prefixes from the hash after some X reads.   It doesn't do those things now.

INSTALL:

Should be able to run "make install" on most machines that have g++ installed.  On windows, install a copy of the MinGW environment.   You'll need zlib installed for some tools.   fastq-mcf, fastq-stats, etc.  are pretty basic, and work without any external libs.

Example:

PREFIX=/usr make install

OR to a subdir:

BINDIR=/usr/bin/ea-utils make install

Or with other options:

CC=g++ PREFIX=/usr/local make install