Commit 2dee6b4

Add tutorial docs, and a bunch of minor doc fixes
1 parent 5164725 commit 2dee6b4

5 files changed

+145
-68
lines changed

docs/algorithm.rst

Lines changed: 25 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -4,65 +4,65 @@ Axe's matching algorithm
44

55
Axe uses an algorithm based on longest-prefix-in-trie matching to match a
66
variable length from the start of each read against a set of 'mutated'
7-
barcodes.
7+
indexes.
88

99
Hamming distance matching
1010
-------------------------
1111

1212
While for most applications in high-throughput sequencing hamming distances are
13-
a frowned-upon metric, it is typical for HTS read barcodes to be designed to
13+
a frowned-upon metric, it is typical for HTS read indexes to be designed to
1414
tolerate a certain level of hamming mismatches. Given these sequences are short
1515
and typically occur at the 5' end of reads, insertions and deletions rarely
1616
need be considered, and the increased rate of assignment of reads with many
17-
errors is offset by the risk of falsely assigning barcodes to an incorrect
17+
errors is offset by the risk of falsely assigning indexes to an incorrect
1818
sample. In any case, reads with more than 1-2 sequencing errors in their first
1919
several bases are likely to be poor quality, and will simply be filtered out
2020
during downstream quality control.
2121

2222
Hamming mismatch tries
2323
----------------------
2424

25-
Typically, reads are matched to a set of barcodes by calculating the hamming
26-
distance between the barcode, and the first :math:`l` bases of a read for a
27-
barcode of length :math:`l`. The "correct" barcode is then selected by
28-
recording either the barcode with the lowest hamming distance to the read
29-
(competitive matching) or by simply accepting the first barcode with a hamming
25+
Typically, reads are matched to a set of indexes by calculating the hamming
26+
distance between the index and the first :math:`l` bases of a read for an
27+
index of length :math:`l`. The "correct" index is then selected by
28+
recording either the index with the lowest hamming distance to the read
29+
(competitive matching) or by simply accepting the first index with a hamming
3030
distance below a certain threshold. These approaches are both very
3131
computationally expensive, and can have lower accuracy than the algorithm I
32-
propose. Additionally, implementations of these methods rarely handle barcodes
33-
of differing length and combinatorial barcoding well, if at all.
32+
propose. Additionally, implementations of these methods rarely handle indexes
33+
of differing length and combinatorial indexing well, if at all.
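For comparison, the traditional per-index approach described above can be sketched as follows. This is an illustrative sketch only, not code from Axe: the function names ``hamming`` and ``naive_match`` are hypothetical, and treating a tied best distance as "no match" is an assumption made here for the example. Every read is compared against every index, which is where the :math:`O(nm)` cost comes from.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def naive_match(read, indexes, max_mismatch=1):
    """Competitive matching: return the index closest to the read's prefix.

    Returns None when no index is within max_mismatch, or when the best
    distance is shared by several indexes (assumed here to be ambiguous).
    """
    best, best_dist, tied = None, max_mismatch + 1, False
    for index in indexes:
        if len(read) < len(index):
            continue  # read too short to contain this index
        dist = hamming(index, read[:len(index)])
        if dist < best_dist:
            best, best_dist, tied = index, dist, False
        elif dist == best_dist and dist <= max_mismatch:
            tied = True
    return None if tied else best
```

Note that this sketch has no principled way to choose between indexes of different lengths, which is one of the weaknesses the text above points out.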
3434

3535
Central to Axe's algorithm is the concept of hamming-mismatch tries. A trie is
3636
an N-ary tree for an N-letter alphabet. In the case of high-throughput
3737
sequencing reads, we have the alphabet ``AGCT``, corresponding to the four
3838
nucleotides of DNA, plus ``N``, used to represent ambiguous base calls. Instead
39-
of matching each barcode to each read, we pre-calculate all allowable sequences
39+
of matching each index to each read, we pre-calculate all allowable sequences
4040
at each mismatch level, and store these in level-wise tries. For example, to
4141
match to a hamming distance of 2, we create three tries: One containing all
42-
barcodes, verbatim, and two tries where every sequence within a hamming
43-
distance of 1 and 2 of each barcode respectively. Hereafter, these tries are
42+
indexes, verbatim, and two tries containing every sequence within a hamming
43+
distance of 1 and 2 of each index respectively. Hereafter, these tries are
4444
referred to as the 0, 1 and 2-mm tries, for a hamming distance (mismatch) of
4545
0, 1 and 2. Then, we find the longest prefix in each sequence read in the 0mm
4646
trie. If this prefix is not a valid leaf in the 0mm trie, we find the longest
4747
prefix in the 1mm trie, and so on for all tries in ascending order. If no
4848
prefix of the read is a complete sequence in any trie, the read is assigned to
49-
an "non-barcoded" output file.
49+
a "non-indexed" output file.
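The level-wise tries described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Axe's implementation: the tries are plain nested dictionaries, each level here holds sequences at exactly that many mismatches, all names (``mutations``, ``build_tries``, ``longest_prefix``, ``assign``) are hypothetical, and clashes between mutated sequences are not handled (a later insert simply overwrites an earlier one).

```python
from itertools import combinations, product

ALPHABET = "ACGTN"


def mutations(index, distance):
    """Yield every sequence at exactly `distance` mismatches from `index`."""
    for positions in combinations(range(len(index)), distance):
        # At each chosen position, substitute any letter except the original.
        choices = [[c for c in ALPHABET if c != index[p]] for p in positions]
        for letters in product(*choices):
            seq = list(index)
            for pos, letter in zip(positions, letters):
                seq[pos] = letter
            yield "".join(seq)


def trie_insert(trie, seq, sample):
    """Insert `seq` into a nested-dict trie, marking its end with the sample."""
    node = trie
    for base in seq:
        node = node.setdefault(base, {})
    node["sample"] = sample  # multi-character key, cannot collide with a base


def build_tries(key, max_mismatch):
    """Build one trie per mismatch level (0..max_mismatch) from {index: sample}."""
    tries = [{} for _ in range(max_mismatch + 1)]
    for index, sample in key.items():
        for level in range(max_mismatch + 1):
            for seq in mutations(index, level):
                trie_insert(tries[level], seq, sample)
    return tries


def longest_prefix(trie, read):
    """Return (sample, length) for the longest complete entry prefixing `read`."""
    node, found = trie, None
    for depth, base in enumerate(read, start=1):
        node = node.get(base)
        if node is None:
            break
        if "sample" in node:
            found = (node["sample"], depth)
    return found


def assign(tries, read):
    """Try the 0mm trie first, then 1mm, and so on; None means unassigned."""
    for level, trie in enumerate(tries):
        hit = longest_prefix(trie, read)
        if hit is not None:
            return hit + (level,)
    return None
```

With a hypothetical key ``{"ACGT": "s1", "ACGTAC": "s2"}`` and ``max_mismatch=1``, a read starting ``ACGTAC`` is assigned to ``s2`` (the longest exact entry), while a read starting ``ACGTAG`` is assigned to ``s1``, because the exact short match in the 0mm trie is found before the 1mm trie is consulted.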
5050

51-
This algorithm ensures optimal barcode matching in many ways, but is also
52-
extremely fast. In situations with barcodes of differing length, we ensure that
53-
the *longest* acceptable barcode at a given hamming distance is chosen;
54-
assuming that sequence is random after the barcode, the probability of false
51+
This algorithm ensures optimal index matching in many ways, but is also
52+
extremely fast. In situations with indexes of differing length, we ensure that
53+
the *longest* acceptable index at a given hamming distance is chosen;
54+
assuming that sequence is random after the index, the probability of false
5555
assignments using this method is low. We also ensure that short perfect matches
56-
are preferred to longer inexact matches, as we firstly only consider barcodes
57-
with no error, then 1 error, and so on. This ensures that reads with barcodes
56+
are preferred to longer inexact matches, as we firstly only consider indexes
57+
with no error, then 1 error, and so on. This ensures that reads with indexes
5858
that are followed by random sequence that happens to inexactly match a longer
59-
barcode in the set are not falsely assigned to this longer barcode.
59+
index in the set are not falsely assigned to this longer index.
6060

6161
The speed of this algorithm is largely due to the constant time matching
62-
algorithm with respect to the number of barcodes to match. The time taken to
63-
match each read is proportional instead to the length of the barcodes, as for a
64-
barcode of length :math:`l`, at most :math:`l + 1` trie level descents are
62+
algorithm with respect to the number of indexes to match. The time taken to
63+
match each read is proportional instead to the length of the indexes, as for an
64+
index of length :math:`l`, at most :math:`l + 1` trie level descents are
6565
required to find an entry in the trie. As this length is more-or-less constant
6666
and small, the overall complexity of axe's algorithm is :math:`O(n)` for
6767
:math:`n` reads, as opposed to :math:`O(nm)` for :math:`n` reads and :math:`m`
68-
barcodes as is typical for traditional matching algorithms
68+
indexes as is typical for traditional matching algorithms.

docs/conf.py

Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@
3131
# ones.
3232
extensions = [
3333
'sphinx.ext.todo',
34-
'sphinx.ext.pngmath',
34+
'sphinx.ext.imgmath',
3535
'sphinx.ext.ifconfig',
3636
]
3737

@@ -188,10 +188,10 @@
188188

189189
latex_elements = {
190190
# The paper size ('letterpaper' or 'a4paper').
191-
'papersize': 'a4paper',
191+
#'papersize': 'a4paper',
192192

193193
# The font size ('10pt', '11pt' or '12pt').
194-
'pointsize': '11pt',
194+
#'pointsize': '11pt',
195195

196196
# Additional stuff for the LaTeX preamble.
197197
#'preamble': '',

docs/index.rst

Lines changed: 4 additions & 3 deletions
@@ -7,16 +7,17 @@ Welcome to axe's documentation!
77
===============================
88

99
Axe is a read de-multiplexer, useful in situations where sequence reads contain
10-
the barcodes that uniquely distinguish samples. Axe uses a rapid and accurate
10+
the indexes that uniquely distinguish samples. Axe uses a rapid and accurate
1111
algorithm based on hamming mismatch tries to competitively match the prefix of
12-
a sequencing read against a set of barcodes. Axe supports combinatorial
13-
barcoding schemes.
12+
a sequencing read against a set of indexes. Axe supports combinatorial
13+
indexing schemes.
1414

1515
Contents:
1616

1717
.. toctree::
1818
:maxdepth: 2
1919

20+
tutorial
2021
usage
2122
algorithm
2223

docs/tutorial.rst

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
************
2+
Axe Tutorial
3+
************
4+
5+
**TODO!**
6+
7+
In this tutorial, we'll use Axe to demultiplex some paired-end,
8+
combinatorially-indexed Genotyping-by-Sequencing reads. The data for this
9+
tutorial is available from figshare
10+
`here <https://figshare.com/articles/axe-tutorial_tar/6143720>`_.
11+
12+
Axe should be run as the initial step of any analysis: don't use sequence QC
13+
tools like AdapterRemoval or Trimmomatic before using axe, as indexes may be
14+
trimmed away, or pairing information removed.
15+
16+
Step 0: Download the trial data
17+
-------------------------------
18+
19+
This will download the trial data, and extract it on the fly:
20+
21+
.. code-block:: bash
22+
23+
curl -LS https://ndownloader.figshare.com/files/11094782 | tar xv
24+
25+
Step 1: prepare a key file
26+
--------------------------
27+
28+
The key file associates index sequences with sample names. A key file can be
29+
prepared in a spreadsheet editor, like LibreOffice Calc, or Excel. The format
30+
is quite strict, and is described in detail in the online usage documentation.
31+
32+
Let's now inspect the keyfile I have provided for the tutorial.
33+
34+
.. code-block:: bash
35+
36+
head axe-keyfile.tsv
37+
38+
39+
Step 2: Demultiplex with Axe
40+
----------------------------
41+
42+
43+
In this step, we will demultiplex our interleaved input file to per-sample
44+
interleaved output files. To see a full range of Axe's options, please run
45+
``axe-demux -h``, or inspect the online usage documentation.
46+
47+
First, let's inspect the input.
48+
49+
.. code-block:: bash
50+
51+
zcat axe-tutorial.fastq.gz | head -n 8
52+
53+
Then, we need to ensure that axe has somewhere to put the demultiplexed reads.
54+
Axe outputs one file (or more, depending on pairing) per sample. Axe does so by
55+
appending the sample name to some prefix (as given by the ``-I``, ``-F``,
56+
and/or ``-R`` options). If this prefix is a directory, then sample fastq files
57+
will be created in that subdirectory, but the directory must exist. Let's make
58+
an output directory:
59+
60+
.. code-block:: bash
61+
62+
mkdir -p output
63+
64+
Now, let's demultiplex the reads!
65+
66+
.. code-block:: bash
67+
68+
axe-demux -i axe-tutorial.fastq.gz -I output/ \
69+
-c -b axe-keyfile.tsv -t demux-stats.tsv -z 1
70+
71+
The command above demultiplexes reads from ``axe-tutorial.fastq.gz`` into
72+
separate files under ``output``, based on the combinatorial (``-c``)
73+
sample-to-index-sequence mapping described in ``axe-keyfile.tsv``, and saves a
74+
file of statistics as ``demux-stats.tsv``. Note that we have enabled
75+
compression of output files using the ``-z`` option, in case you don't have
76+
much disk space available. This will make Axe slightly slower.

docs/usage.rst

Lines changed: 37 additions & 37 deletions
@@ -9,10 +9,10 @@ Axe Usage
99
did not change.
1010

1111
Axe has several usage modes. The primary distinction is between the two
12-
alternate barcoding schemes, single and combinatorial barcoding. Single barcode
13-
matching is used when only the first read contains barcode sequences.
14-
Combinatorial barcoding is used when both reads in a read pair contain
15-
independent (typically different) barcode sequences.
12+
alternate indexing schemes, single and combinatorial indexing. Single index
13+
matching is used when only the first read contains index sequences.
14+
Combinatorial indexing is used when both reads in a read pair contain
15+
independent (typically different) index sequences.
1616

1717
For concise reference, the command-line usage of ``axe-demux`` is reproduced
1818
below:
@@ -37,8 +37,8 @@ default) and 9, where 0 indicates plain text output (``gzopen`` mode "wT"), and
3737
fastest and 9 is most compact.
3838

3939
The output flags should be prefixes that are used to generate the output file
40-
name based on the barcode's (or barcode pair's) ID. The names are generated as:
41-
``prefix`` + ``_`` + ``barcode ID`` + ``_`` + ``read number`` + ``.extension``.
40+
name based on the index's (or index pair's) ID. The names are generated as:
41+
``prefix`` + ``_`` + ``index ID`` + ``_`` + ``read number`` + ``.extension``.
4242
The output file for reads that could not be demultiplexed is ``prefix`` + ``_``
4343
+ ``unknown`` + ``_`` + ``read number`` + ``.extension``. The read number is
4444
omitted unless the paired read file scheme is used, and is "il" for interleaved
@@ -51,65 +51,65 @@ The corresponding CLI flags are:
5151
- ``-r`` and ``-R``: Paired R2 file input and output.
5252
- ``-i`` and ``-I``: Interleaved paired input and output.
5353

54-
The barcode file
54+
The index file
5555
----------------
5656

57-
The barcode file is a tab-separated file with an optional header. It is
57+
The index file is a tab-separated file with an optional header. It is
5858
mandatory, and is always supplied using the ``-b`` command line flag. The exact
59-
format is dependent on barcoding mode, and is described further in the sections
59+
format is dependent on indexing mode, and is described further in the sections
6060
below. If a header is present, the header line must start with either
61-
`Barcode` or ``barcode``, or it will be interpreted as a barcode line, leading
61+
``Barcode`` or ``index``, or it will be interpreted as an index line, leading
6262
to a parsing error. Any line starting with ';' or '#' is ignored, allowing
63-
comments to be added in line with barcodes. Please ensure that the software
64-
used to produce the barcode uses ASCII encoding, and does not insert a
63+
comments to be added in line with indexes. Please ensure that the software
64+
used to produce the index file uses ASCII encoding, and does not insert a
6565
Byte-order Mark (BoM) as many text editors can silently use Unicode-based
6666
encoding schemes. I recommend the use of
6767
`LibreOffice Calc <https://www.libreoffice.org>`_ (part of a free and open source
68-
office suite) to generate barcode tables; Microsoft Excel can also be used.
68+
office suite) to generate index tables; Microsoft Excel can also be used.
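To make the format rules above concrete, here is a rough sketch of a reader for a single-index key file. It is not Axe's parser: the function name ``parse_keyfile`` is hypothetical, and several behaviours are assumptions made for illustration (blank lines are skipped, the header is only recognised as the first non-comment line and only when it begins with "barcode" in any letter case, and a repeated index raises an error).

```python
def parse_keyfile(lines):
    """Parse single-index key lines into a {index sequence: sample ID} dict."""
    key = {}
    first = True
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line[0] in ";#":
            continue  # skip blank lines (assumption) and comment lines
        if first and line.lower().startswith("barcode"):
            first = False
            continue  # optional header line
        first = False
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 tab-separated columns")
        index, sample = fields
        if index in key:
            raise ValueError(f"line {lineno}: duplicate index {index!r}")
        key[index] = sample
    return key
```

Because the function accepts any iterable of lines, it can be used on an open file handle, for example ``parse_keyfile(open("axe-keyfile.tsv", encoding="ascii"))`` for a two-column key file; opening with ASCII encoding would also surface a stray byte-order mark as a decoding error.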
6969

7070
Mismatch level selection
7171
------------------------
7272

73-
Independent of barcode mode, the ``-m`` flag is used to select the maximum
74-
allowable hamming distance between a read's prefix and a barcode to be
75-
considered as a match. As "mutated" barcodes must be unique, a hamming distance
76-
of one is the default as typically barcodes are designed to differ by a hamming
73+
Independent of index mode, the ``-m`` flag is used to select the maximum
74+
allowable hamming distance between a read's prefix and an index to be
75+
considered as a match. As "mutated" indexes must be unique, a hamming distance
76+
of one is the default as typically indexes are designed to differ by a hamming
7777
distance of at least two. Optionally, (using the ``-p`` flag), axe will allow
78-
selective mismatch levels, where, if clashes are observed, the barcode will
79-
only be matched exactly. This allows one to process datasets with barcodes that
78+
selective mismatch levels, where, if clashes are observed, the index will
79+
only be matched exactly. This allows one to process datasets with indexes that
8080
don't have a sufficiently high distance between them.
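The uniqueness requirement on "mutated" indexes can be checked with a short sketch like the one below. It is an illustration under assumptions rather than Axe's own clash detection: ``neighbours`` and ``find_clashes`` are hypothetical names, and the alphabet ``ACGTN`` is taken from the algorithm description.

```python
from itertools import combinations, product


def neighbours(index, distance, alphabet="ACGTN"):
    """Yield every sequence at exactly `distance` mismatches from `index`."""
    for positions in combinations(range(len(index)), distance):
        choices = [[c for c in alphabet if c != index[p]] for p in positions]
        for letters in product(*choices):
            seq = list(index)
            for pos, letter in zip(positions, letters):
                seq[pos] = letter
            yield "".join(seq)


def find_clashes(indexes, max_mismatch):
    """Map each mutated sequence reachable from more than one index to its sources."""
    owners = {}
    for index in indexes:
        for level in range(max_mismatch + 1):
            for seq in neighbours(index, level):
                owners.setdefault(seq, set()).add(index)
    return {seq: srcs for seq, srcs in owners.items() if len(srcs) > 1}
```

An empty result means every mutated sequence up to ``max_mismatch`` belongs to exactly one index. In this sketch, two indexes that differ at exactly two positions still share some 1-mismatch neighbours (``ACGT`` and ``ACCA`` both reach ``ACGA``), which is the kind of clash the ``-p`` flag is described as handling.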
8181

82-
Single barcode mode
82+
Single index mode
8383
-------------------
8484

85-
Single barcode mode is the default mode of operation. Barcodes are matched
86-
against read one (hereafter the forward read), and the barcode is trimmed from
85+
Single index mode is the default mode of operation. Indexes are matched
86+
against read one (hereafter the forward read), and the index is trimmed from
8787
only the forward read, unless the ``-2`` command line flag is given, in which
88-
case a prefix the same length as the matched barcode is also trimmed from the
88+
case a prefix the same length as the matched index is also trimmed from the
8989
second or reverse read. Note that the sequence of this second read is not checked
9090
before trimming.
9191

92-
In single barcode mode, the barcode file has two columns: ``Barcode`` and
92+
In single index mode, the index file has two columns: ``Barcode`` and
9393
``ID``.
9494

95-
Combinatorial barcode mode
95+
Combinatorial index mode
9696
--------------------------
9797

98-
Combinatorial barcode mode is activated by giving the ``-c`` flag on the
99-
command line. Forward read barcodes are matched against the forward read, and
100-
reverse read barcodes are matched against the reverse read. The optimal
101-
barcodes are selected independently, and the barcode pair is selected from
102-
these two barcodes. The respective barcodes are trimmed from both reads; the
103-
``-2`` command line flag has no effect in combinatorial barcode mode.
98+
Combinatorial index mode is activated by giving the ``-c`` flag on the
99+
command line. Forward read indexes are matched against the forward read, and
100+
reverse read indexes are matched against the reverse read. The optimal
101+
indexes are selected independently, and the index pair is selected from
102+
these two indexes. The respective indexes are trimmed from both reads; the
103+
``-2`` command line flag has no effect in combinatorial index mode.
104104

105-
In combinatorial barcode mode, the barcode file has three columns:
106-
``Barcode1``, ``Barcode2`` and ``ID``. Individual barcodes can occur many times
107-
within the forward and reverse barcodes, but barcode pairs must be unique
105+
In combinatorial index mode, the index file has three columns:
106+
``Barcode1``, ``Barcode2`` and ``ID``. Individual indexes can occur many times
107+
within the forward and reverse indexes, but index pairs must be unique
108108
combinations.
109109

110-
The Demultipexing Statistics File
111-
---------------------------------
110+
The Demultiplexing Statistics File
111+
----------------------------------
112112

113113
The ``-t`` option allows the output of per-sample read counts to a
114114
tab-separated file. The file will have a header describing its format, and
115-
includes a line for unbarcoded reads.
115+
includes a line for reads which could not be demultiplexed.
