HaisiYi/Bayexer

Bayexer

a demultiplexing tool for Illumina sequencers

Author

Haisi Yi yihaisi@gmail.com

How To Cite

Bioinformatics (2015) doi:10.1093/bioinformatics/btv501 http://bioinformatics.oxfordjournals.org/content/early/2015/09/11/bioinformatics.btv501.long

About

Bayexer is a fast and accurate tool for demultiplexing reads generated by Illumina sequencers.

  • Supports reads produced by all Illumina sequencers (GA, HiSeq, MiSeq, NextSeq)
  • Supports both single-index and double-index sequences
  • Supports indexes of any length
  • Supports sequenced indexes whose length differs from that of the reference index
  • Capable of demultiplexing reads with extremely low-quality indexes

Dependencies

  • perl interpreter
  • perl module Getopt::Long
  • perl module List::Util (>=1.35)
  • perl module Data::Dumper

Installation and Running

No installation is needed. In a Unix-like environment, put the Bayexer file anywhere you want and simply type:

./Bayexer

Or, you can put it in a directory included in your PATH variable and run it like this:

Bayexer

If you still cannot run it as described above, or you are on a Windows system, run it with this command:

perl Bayexer

Tips

Use the raw reads generated directly by the sequencer. Do not perform any quality control or trimming before demultiplexing (it is best to keep the reads that fail to pass the filter).

Options

Input/Output Options:

  • -i the fastq file(s) of common reads

One or two fastq files are accepted.

  • -j the fastq file(s) of index reads

One or two fastq files are accepted. If only one file is given to -j, Bayexer treats the input dataset as single-indexed. If two files are given to -j, the dataset is treated as double-indexed.

  • -o the output directory in which the demultiplexed fastq files will be put

Creating a new directory before you run Bayexer is recommended.

  • -x the file of sample-index list

An example of the format of the sample sheet file:

#Name Index1 Index2 amount
Sample1 AATTCAA CATCCGG 3
Sample2 CGCGCAG TCATGGT 2
Sample3 AAGGTCT AGAACCG 1

The first column contains the sample names. The second column contains the sequences of the first index and, if necessary, the third column contains the sequences of the second index (which should have the same number of bases as the first index). The last column is optional and contains the relative amount of each sample. Lines beginning with # are ignored.
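To make the sample sheet layout concrete, here is a minimal Python sketch of a reader for it. It is not Bayexer's own parser (Bayexer is a Perl script); the `Sample` type and `parse_sample_sheet` function are hypothetical names, and the sketch assumes that columns are separated by whitespace and that index sequences contain only the letters A, C, G, T and N.

```python
import re
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    index1: str
    index2: Optional[str]    # None for single-index sheets
    amount: Optional[float]  # None when the optional last column is absent


def parse_sample_sheet(text: str) -> List[Sample]:
    """Parse sample sheet text; assumes whitespace-separated columns."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and lines beginning with '#' are skipped.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        name, index1, rest = fields[0], fields[1], fields[2:]
        index2 = None
        # A third field made only of base letters is taken as the second index.
        if rest and re.fullmatch(r"[ACGTN]+", rest[0], re.IGNORECASE):
            index2 = rest.pop(0)
        # Whatever remains is the optional relative amount.
        amount = float(rest[0]) if rest else None
        samples.append(Sample(name, index1, index2, amount))
    return samples
```

Called on the example sheet above, this returns three `Sample` records, each with two index sequences and a relative amount.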

  • -q the quality score type (phred33 or phred64) [33]

If you are not sure which score type applies to your data, see http://en.wikipedia.org/wiki/FASTQ_format for useful information.
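If the encoding is unknown, a common heuristic is to look at the lowest quality character in the file. The Python sketch below illustrates that heuristic; it is not part of Bayexer, and `guess_phred_offset` is a hypothetical helper. It relies on the fact that ASCII codes below 59 can only occur in phred33 data. Anything else is guessed as phred64, which is wrong for phred33 data that happens to contain only high scores, so it should be run over many reads.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset (33 or 64) from FASTQ quality strings.

    Heuristic only: characters with ASCII code below 59 cannot occur in
    64-offset encodings, so their presence implies phred33. Otherwise the
    data is guessed to be phred64, which may be wrong for a small sample
    of uniformly high-quality phred33 reads.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    return 33 if lowest < 59 else 64
```

The returned value is what would be passed to -q.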

Training Data Extraction Options

  • -a the pre-index1 adapter sequence [GATCGGAAGAGCACACGTCTGAACTCCAGTCAC]
  • -b the pre-index2 adapter sequence [AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT]

The pre-index adapter sequences are CRUCIAL for Bayexer; MAKE SURE they are exactly CORRECT for your input data. Sequence information for the official library preparation kits is available on the Illumina website. Bayexer's default values are compatible with most Illumina TruSeq kits.

  • -u use last N bases of the pre-index sequence in the search [8]

This parameter sets how many bases of the adapter adjacent to the 5'-end of the index are used when searching for training data. Recommendation: 7-12 for double-index data, and 12-18 for single-index data. Tip: the MORE training records found for each sample, the BETTER. If some samples have few training records (e.g. fewer than 100), try lowering the -u and -n parameters (but not too far: never set -u below 7).

  • -d the relative direction of the index 2 and its upstream adapter sequence (ff or fr) [ff]

This parameter indicates the relative direction between the second index and its upstream adapter sequence. For most library preparation and sequencing strategies, the default ff is correct.

  • -n the minimum quality score of the index bases in common reads to be accepted in the training data searching [5]

This parameter sets the lowest quality score that the index bases found in a common read may have for the record to be accepted as training data. If even one of those bases falls below this value, the whole record is excluded from the training set.
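The interplay of -a, -u and -n can be illustrated with a short Python sketch. This is one reading of the option descriptions above, not Bayexer's actual Perl code: `extract_training_index` is a hypothetical function, it uses an exact string match for the adapter tail, and it ignores the -d direction handling for the second index.

```python
def extract_training_index(read_seq, read_qual, adapter, index_len,
                           tail_len=8, min_qual=5, offset=33):
    """Return the index found after the adapter tail in a common read, or None.

    tail_len plays the role of -u, min_qual the role of -n, and offset the
    role of -q. Exact matching of the adapter tail is an assumption.
    """
    tail = adapter[-tail_len:]          # last N bases of the pre-index adapter
    pos = read_seq.find(tail)
    if pos == -1:
        return None                     # adapter tail not present in this read
    start = pos + len(tail)
    end = start + index_len
    if end > len(read_seq):
        return None                     # index runs off the end of the read
    scores = [ord(ch) - offset for ch in read_qual[start:end]]
    if min(scores) < min_qual:
        return None                     # one low-quality base rejects the record
    return read_seq[start:end]
```

A longer tail makes a chance match less likely but also finds fewer records, which is consistent with the advice to lower -u when some samples have too few training records.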

Options Concerning the Estimation Accuracy

  • -p turn on/off the inference of prior probability (auto/infer) [auto]

If -p is set to 'auto', Bayexer first tries to compute the prior probabilities from the values in the last column of the sample sheet file; if they are unavailable, Bayexer infers the prior probabilities from the input data itself. If -p is set to 'infer', Bayexer makes the inference directly and ignores the last column of the sample sheet.
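The difference between the two -p modes can be sketched in Python as follows. The README does not say how Bayexer infers priors from the data, so this sketch assumes, purely for illustration, that inferred priors are proportional to per-sample record counts; `prior_probabilities` and its arguments are hypothetical names.

```python
def prior_probabilities(amounts, observed_counts, mode="auto"):
    """Return a dict mapping sample name to prior probability.

    amounts: relative amounts from the sample sheet (values may be None).
    observed_counts: per-sample counts taken from the data (assumed source
    of the inferred priors; the real inference may differ).
    """
    have_amounts = bool(amounts) and all(a is not None for a in amounts.values())
    if mode == "auto" and have_amounts:
        weights = amounts            # use the last column of the sample sheet
    else:
        weights = observed_counts    # 'infer', or 'auto' without amounts
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}
```

With the example sample sheet (amounts 3, 2 and 1), the 'auto' mode would give priors of 1/2, 1/3 and 1/6 under this assumption.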

  • -f the minimum number of pieces of evidence for a feature to be used in the Naive Bayes classifier [5]

The minimum number of pieces of evidence required during feature selection. If the total amount of training data is very large, you can increase this value; otherwise, keeping the default is the safer choice.

  • -v the maximum occurrence of a barcode to use the one-versus-all-but-one technique [150]

For a sequenced index whose total occurrences are lower than this value, Bayexer uses the one-versus-all-but-one technique to compute the p values.

  • -c the minimum P to be trusted [0.95]

If the maximum posterior probability of a sequenced index is smaller than this value, it is labelled 'untrusted'.

  • -l the minimum occurrence frequency of a sequenced index to be considered in the Bayes Module [20]

If a sequenced index occurs fewer times in total than this value, it is labelled 'untrusted'.
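Taken together, the -c and -l descriptions amount to a simple decision rule for each sequenced index. The Python sketch below shows that rule under the assumption that the posterior probabilities have already been computed; `classify_index` is a hypothetical name, and the order in which the two checks are applied is also an assumption.

```python
def classify_index(posteriors, occurrences, min_p=0.95, min_occurrences=20):
    """Return the best sample name for a sequenced index, or 'untrusted'.

    posteriors: dict mapping sample name to posterior probability.
    occurrences: how many times this sequenced index was seen in total.
    min_p corresponds to -c and min_occurrences to -l.
    """
    if occurrences < min_occurrences:
        return "untrusted"           # too rare to be considered (-l)
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < min_p:
        return "untrusted"           # best posterior is not convincing (-c)
    return best
```

Raising -c trades assigned reads for fewer misassignments, while raising -l discards rarer sequenced indexes.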

  • --help this help information
  • --dev for developer use only
