Add tutorial docs, and a bunch of minor doc fixes

kdm9 · kdm9 · commit 2dee6b4f8d4f · 2018-04-16T20:23:42.000+10:00
diff --git a/docs/algorithm.rst b/docs/algorithm.rst
@@ -4,65 +4,65 @@ Axe's matching algorithm
 
 Axe uses an algorithm based on longest-prefix-in-trie matching to match a
 variable length from the start of each read against a set of 'mutated'
-barcodes.
+indexes.
 
 Hamming distance matching
 -------------------------
 
 While for most applications in high-throughput sequencing hamming distances are
-a frowned-upon metric, it is typical for HTS read barcodes to be designed to
+a frowned-upon metric, it is typical for HTS read indexes to be designed to
 tolerate a certain level of hamming mismatches. Given these sequences are short
 and typically occur at the 5' end of reads, insertions and deletions rarely
 need be considered, and the increased rate of assignment of reads with many
-errors is offset by the risk of falsely assigning barcodes to an incorrect
+errors is offset by the risk of falsely assigning indexes to an incorrect
 sample. In any case, reads with more than 1-2 sequencing errors in their first
 several bases are likely to be poor quality, and will simply be filtered out
 during downstream quality control.
 
 Hamming mismatch tries
 ----------------------
 
-Typically, reads are matched to a set of barcodes by calculating the hamming
-distance between the barcode, and the first :math:`l` bases of a read for a
-barcode of length :math:`l`. The "correct" barcode is then selected by
-recording either the barcode with the lowest hamming distance to the read
-(competitive matching) or by simply accepting the first barcode with a hamming
+Typically, reads are matched to a set of indexes by calculating the hamming
+distance between the index, and the first :math:`l` bases of a read for a
+index of length :math:`l`. The "correct" index is then selected by
+recording either the index with the lowest hamming distance to the read
+(competitive matching) or by simply accepting the first index with a hamming
 distance below a certain threshold.  These approaches are both very
 computationally expensive, and can have lower accuracy than the algorithm I
-propose. Additionally, implementations of these methods rarely handle barcodes
-of differing length and combinatorial barcoding well, if at all.
+propose. Additionally, implementations of these methods rarely handle indexes
+of differing length and combinatorial indexing well, if at all.
 
 Central to Axe's algorithm is the concept of hamming-mismatch tries. A trie is
 a N-ary tree for an N letter alphabet. In the case of high-throughput
 sequencing reads, we have the alphabet ``AGCT``, corresponding to the four
 nucleotides of DNA, plus ``N``, used to represent ambiguous base calls. Instead
-of matching each barcode to each read, we pre-calculate all allowable sequences
+of matching each index to each read, we pre-calculate all allowable sequences
 at each mismatch level, and store these in level-wise tries. For  example, to
 match to a hamming distance of 2, we create three tries: One containing all
-barcodes, verbatim, and two tries where every sequence within a hamming
-distance of 1 and 2 of each barcode respectively. Hereafter, these tries are
+indexes, verbatim, and two tries where every sequence within a hamming
+distance of 1 and 2 of each index respectively. Hereafter, these tries are
 referred to  as the 0, 1 and 2-mm tries, for a hamming distance (mismatch) of
 0, 1 and 2. Then, we find the longest prefix in each sequence read in the 0mm
 trie. If this prefix is not a valid leaf in the 0mm trie, we find the longest
 prefix in the 1mm trie, and so on for all tries in ascending order. If no
 prefix of the read is a complete sequence in any trie, the read is assigned to
-an "non-barcoded" output file.
+an "non-indexd" output file.
 
-This algorithm ensures optimal barcode matching in many ways, but is also
-extremely fast. In situations with barcodes of differing length, we ensure that
-the *longest* acceptable barcode at a given hamming distance is chosen;
-assuming that sequence is random after the barcode, the probability of false
+This algorithm ensures optimal index matching in many ways, but is also
+extremely fast. In situations with indexes of differing length, we ensure that
+the *longest* acceptable index at a given hamming distance is chosen;
+assuming that sequence is random after the index, the probability of false
 assignments using this method is low. We also ensure that short perfect matches
-are preferred to longer inexact matches, as we firstly only consider barcodes
-with no error, then 1 error, and so on. This ensures that reads with barcodes
+are preferred to longer inexact matches, as we firstly only consider indexes
+with no error, then 1 error, and so on. This ensures that reads with indexes
 that are followed by random sequence that happens to inexactly match a longer
-barcode in the set are not falsely assigned to this longer barcode.
+index in the set are not falsely assigned to this longer index.
 
 The speed of this algorithm is largely due to the constant time matching
-algorithm with respect to the number of barcodes to match. The time taken to
-match each read is proportional instead to the length of the barcodes, as for a
-barcode of length :math:`l`, at most :math:`l + 1` trie level descents are
+algorithm with respect to the number of indexes to match. The time taken to
+match each read is proportional instead to the length of the indexes, as for a
+index of length :math:`l`, at most :math:`l + 1` trie level descents are
 required to find an entry in the trie. As this length is more-or-less constant
 and small, the overall complexity of axe's algorithm is :math:`O(n)` for
 :math:`n` reads, as opposed to :math:`O(nm)` for :math:`n` reads and :math:`m`
-barcodes as is typical for traditional matching algorithms
+indexes as is typical for traditional matching algorithms
diff --git a/docs/conf.py b/docs/conf.py
@@ -31,7 +31,7 @@
 # ones.
 extensions = [
     'sphinx.ext.todo',
-    'sphinx.ext.pngmath',
+    'sphinx.ext.imgmath',
     'sphinx.ext.ifconfig',
 ]
 
@@ -188,10 +188,10 @@
 
 latex_elements = {
 # The paper size ('letterpaper' or 'a4paper').
-'papersize': 'a4paper',
+#'papersize': 'a4paper',
 
 # The font size ('10pt', '11pt' or '12pt').
-'pointsize': '11pt',
+#'pointsize': '11pt',
 
 # Additional stuff for the LaTeX preamble.
 #'preamble': '',
diff --git a/docs/index.rst b/docs/index.rst
@@ -7,16 +7,17 @@ Welcome to axe's documentation!
 ===============================
 
 Axe is a read de-multiplexer, useful in situations where sequence reads contain
-the barcodes that uniquely distinguish samples. Axe uses a rapid and accurate
+the indexes that uniquely distinguish samples. Axe uses a rapid and accurate
 algorithm based on hamming mismatch tries to competitively match the prefix of
-a sequencing read against a set of barcodes. Axe supports combinatorial
-barcoding schemes.
+a sequencing read against a set of indexes. Axe supports combinatorial
+indexing schemes.
 
 Contents:
 
 .. toctree::
    :maxdepth: 2
 
+   tutorial
    usage
    algorithm
 
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
@@ -0,0 +1,76 @@
+************
+Axe Tutorial
+************
+
+**TODO!**
+
+In this tutorial, we'll use Axe to demultiplex some paired-end,
+combinatorially-index Genotyping-by-Sequencing reads. The data for this
+tutorial is available from figshare
+`here https://figshare.com/articles/axe-tutorial_tar/6143720`_.
+
+Axe should be run as the initial step of any analysis: don't use sequence QC
+tools like AdapterRemoval or Trimmomatic before using axe, as indexes may be
+trimmed away, or pairing information removed.
+
+Step 0: Download the trial data
+-------------------------------
+
+This will download the trial data, and extract it on the fly:
+
+.. code-block:: bash
+
+   curl -LS https://ndownloader.figshare.com/files/11094782 | tar xv
+
+Step 1: prepare a key file
+--------------------------
+
+The key file associates index sequences with sample names. A key file can be
+prepared in a spreadsheet editor, like LibreOffice Calc, or Excel. The format
+is quite strict, and is described in detail in the online usage documentation.
+
+Let's now inspect the keyfile I have provided for the tutorial.
+
+.. code-block:: bash
+
+   head axe-keyfile.tsv
+
+
+Step 2: Demultiplex with Axe
+----------------------------
+
+
+In this step, we will demultiplex our interleaved input file to per-sample
+interleaved output files. To see a full range of Axe's options, please run
+``axe-demux -h``, or inspect the online usage documentation.
+
+First, let's inspect the input.
+
+.. code-block:: bash
+
+   zcat axe-tutorial.fastq.gz | head -n 8
+
+Then, we need to ensure that axe has somewhere to put the demultiplexed reads.
+Axe outputs one file (or more, depending on pairing) per sample. Axe does so by
+appending the sample name to some prefix (as given by the ``-I``, ``-F``,
+and/or ``-R`` options). If this prefix is a directory, then sample fastq files
+will be created in that subdirectory, but the directory must exist. Let's make
+an output directory:
+
+.. code-block:: bash
+
+   mkdir -p output
+
+Now, let's demultiplex the reads!
+
+.. code-block:: bash
+
+   axe-demux -i axe-tutorial.fastq.gz -I output/ \
+      -c -b axe-keyfile.tsv -t demux-stats.tsv -z 1
+
+The command above demultiplexes reads from ``axe-tutorial.fastq.gz`` into
+separate files under ``output``, based on the combinatorial (``-c``)
+sample-to-index-sequence mapping described in ``axe-keyfile.tsv``, and saves a
+file of statistics as ``demux-stats.tsv``. Note that we have enabled
+compression of output files using the ``-z`` option, in case you don't have
+much disk space available. This will make Axe slightly slower.
diff --git a/docs/usage.rst b/docs/usage.rst
@@ -9,10 +9,10 @@ Axe Usage
   did not change.
 
 Axe has several usage modes. The primary distinction is between the two
-alternate barcoding schemes, single and combinatorial barcoding. Single barcode
-matching is used when only the first read contains barcode sequences.
-Combinatorial barcoding is used when both reads in a read pair contain
-independent (typically different) barcode sequences.
+alternate indexing schemes, single and combinatorial indexing. Single index
+matching is used when only the first read contains index sequences.
+Combinatorial indexing is used when both reads in a read pair contain
+independent (typically different) index sequences.
 
 For concise reference, the command-line usage of ``axe-demux`` is reproduced
 below:
@@ -37,8 +37,8 @@ default) and 9, where 0 indicates plain text output (``gzopen`` mode "wT"), and
 fastest and 9 is most compact.
 
 The output flags should be prefixes that are used to generate the output file
-name based on the barcode's (or barcode pair's) ID. The names are generated as:
-``prefix`` + ``_`` + ``barcode ID`` + ``_`` + ``read number`` + ``.extension``.
+name based on the index's (or index pair's) ID. The names are generated as:
+``prefix`` + ``_`` + ``index ID`` + ``_`` + ``read number`` + ``.extension``.
 The output file for reads that could not be demultiplexed is ``prefix`` + ``_``
 + ``unknown`` + ``_`` + ``read number`` + ``.extension``.  The read number is
 omitted unless the paired read file scheme is used, and is "il" for interleaved
@@ -51,65 +51,65 @@ The corresponding CLI flags are:
  - ``-r`` and ``-R``: Paired R2 file input and output.
  - ``-i`` and ``-I``: Interleaved paired input and output.
 
-The barcode file
+The index file
 ----------------
 
-The barcode file is a tab-separated file with an optional header. It is
+The index file is a tab-separated file with an optional header. It is
 mandatory, and is always supplied using the ``-b`` command line flag. The exact
-format is dependent on barcoding mode, and is described further in the sections
+format is dependent on indexing mode, and is described further in the sections
 below. If a header is present, the header line must start with either
-`Barcode` or ``barcode``, or it will be interpreted as a barcode line, leading
+`Barcode` or ``index``, or it will be interpreted as a index line, leading
 to a parsing error. Any line starting with ';' or '#' is ignored, allowing
-comments to be added in line with barcodes. Please ensure that the software
-used to produce the barcode uses ASCII encoding, and does not insert a
+comments to be added in line with indexes. Please ensure that the software
+used to produce the index uses ASCII encoding, and does not insert a
 Byte-order Mark (BoM) as many text editors can silently use Unicode-based
 encoding schemes. I recommend the use of
 `LibreOffice Calc <www.libreoffice.org>`_ (part of a free and open source
-office suite) to generate barcode tables; Microsoft Excel can also be used.
+office suite) to generate index tables; Microsoft Excel can also be used.
 
 Mismatch level selection
 ------------------------
 
-Independent of barcode mode, the ``-m`` flag is used to select the maximum
-allowable hamming distance between a read's prefix and a barcode to be
-considered as a match. As "mutated" barcodes must be unique, a hamming distance
-of one is the default as typically barcodes are designed to differ by a hamming
+Independent of index mode, the ``-m`` flag is used to select the maximum
+allowable hamming distance between a read's prefix and a index to be
+considered as a match. As "mutated" indexes must be unique, a hamming distance
+of one is the default as typically indexes are designed to differ by a hamming
 distance of at least two. Optionally, (using the ``-p`` flag), axe will allow
-selective mismatch levels, where, if clashes are observed, the barcode will
-only be matched exactly. This allows one to process datasets with barcodes that
+selective mismatch levels, where, if clashes are observed, the index will
+only be matched exactly. This allows one to process datasets with indexes that
 don't have a sufficiently high distance between them.
 
-Single barcode mode
+Single index mode
 -------------------
 
-Single barcode mode is the default mode of operation. Barcodes are matched
-against read one (hereafter the forward read), and the barcode is trimmed from
+Single index mode is the default mode of operation. Barcodes are matched
+against read one (hereafter the forward read), and the index is trimmed from
 only the forward read, unless the ``-2`` command line flag is given, in which
-case a prefix the same length as the matched barcode is also trimmed from the
+case a prefix the same length as the matched index is also trimmed from the
 second or reverse read. Note that sequence of this second read is not checked
 before trimming.
 
-In single barcode mode, the barcode file has two columns: ``Barcode`` and
+In single index mode, the index file has two columns: ``Barcode`` and
 ``ID``.
 
-Combinatorial barcode mode
+Combinatorial index mode
 --------------------------
 
-Combinatorial barcode mode is activated by giving the ``-c`` flag on the
-command line. Forward read barcodes are matched against the forward read, and
-reverse read barcodes are matched against the reverse read. The optimal
-barcodes are selected independently, and the barcode pair is selected from
-these two barcodes. The respective  barcodes are trimmed from both reads; the
-``-2`` command line flag has no effect in combinatorial barcode mode.
+Combinatorial index mode is activated by giving the ``-c`` flag on the
+command line. Forward read indexes are matched against the forward read, and
+reverse read indexes are matched against the reverse read. The optimal
+indexes are selected independently, and the index pair is selected from
+these two indexes. The respective  indexes are trimmed from both reads; the
+``-2`` command line flag has no effect in combinatorial index mode.
 
-In combinatorial barcode mode, the barcode file has three columns:
-``Barcode1``, ``Barcode2`` and ``ID``. Individual barcodes can occur many times
-within the forward and reverse barcodes, but barcode pairs must be unique
+In combinatorial index mode, the index file has three columns:
+``Barcode1``, ``Barcode2`` and ``ID``. Individual indexes can occur many times
+within the forward and reverse indexes, but index pairs must be unique
 combinations.
 
-The Demultipexing Statistics File
----------------------------------
+The Demultiplexing Statistics File
+----------------------------------
 
 The ``-t`` option allows the output of per-sample read counts to a
 tab-separated file. The file will have a header describing its format, and
-includes a line for unbarcoded reads.
+includes a line for reads which could not be demultiplexed.