documentation update

COMBINE-lab · Aug 19, 2020 · 52c1406
1 parent 20885e4
commit 52c1406

Showing 3 changed files with 39 additions and 22 deletions.
diff --git a/docs/source/collate.rst b/docs/source/collate.rst
@@ -11,8 +11,6 @@ dictate how the collation and filtering will be performed.
 
 * ``-i, --input-dir <input-dir>`` : The input directory.  This is the directory that was the *output* of ``generate-permit-list``.  This directory contains information computed by the ``generate-permit-list`` command that will allow successful collation and barcode correction.  This is also the directory where the collated RAD file will be *output*.
 
-* ``-e, --expected-ori <expected-ori>`` : The expected orientation of valid alignments.  Many single-cell protocols generated a strand-aware library, allowing increased precision by filtering out alignments that do not occur in the prescribed orientation.  This flag will filter out alignments that do not match the provided orientation.  The options are 'fw' (filters out alignments to the reverse complement strand), 'rc' (filter out alignments to the forward strand) and 'both' or 'either' (do not filter any alignments).
-
 * ``-m, --max-records <max-records>`` : The maximum number of read records to keep in memory at once during collation. The ``collate`` command will pass over the input RAD file multiple times collecting the records associated with a set of (corrected) cellular barcodes so that they can be written out in collated format to the output RAD file.  This parameter determines (approximately) how many records will be held in memory at once, and therefore determines the memory usage of the ``collate`` command.  The larger the value used the faster the collation process will be, since fewer passes are made.  The smaller this value, the lower the memory usage will be, at the cost of more passes.  The default value is 10,000,000.  Note that this determines the number of records *approximately*, because a specific barcode will never be split across multiple collation passes.  The algorithm employed is to collect the reads associated with different cellular barcodes in the current pass until the number of reads to be collected *first exceeds* this value.
 
 output
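
The pass-planning behaviour described for ``--max-records`` can be illustrated with a minimal sketch. This is not the alevin-fry implementation; the function name ``plan_collation_passes`` and the toy read counts are hypothetical, and the order in which barcodes are visited is an assumption. It only shows why the limit is approximate: a pass is closed when the running total *first exceeds* the limit, and a barcode is never split across passes.

```python
from typing import Dict, List


def plan_collation_passes(reads_per_barcode: Dict[str, int],
                          max_records: int) -> List[List[str]]:
    """Group barcodes into passes without ever splitting a barcode.

    Hypothetical helper for illustration only; not part of alevin-fry.
    """
    passes: List[List[str]] = []
    current: List[str] = []
    collected = 0
    for barcode, n_reads in reads_per_barcode.items():
        # The whole barcode always goes into the current pass.
        current.append(barcode)
        collected += n_reads
        # Close the pass as soon as the total first exceeds the limit.
        if collected > max_records:
            passes.append(current)
            current = []
            collected = 0
    if current:
        passes.append(current)
    return passes


# Toy input: corrected barcode -> number of read records (made-up values).
counts = {"AAAC": 4, "AAAG": 5, "AAAT": 3, "AACA": 2}
print(plan_collation_passes(counts, max_records=8))
```

With a limit of 8, the first pass holds 9 records because the second barcode is kept whole, which is the sense in which the parameter bounds memory only approximately.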

diff --git a/docs/source/generate_permit_list.rst b/docs/source/generate_permit_list.rst
@@ -2,10 +2,16 @@ generate-permit-list
 ====================
 
 This command takes as input a RAD file (created by running alevin with the ``--justAlign`` flag), and it determines what cell 
-barcodes should be associated with "true" cells, which should be corrected to some "true" barcode, and which should simply 
-be ignored / discarded. This command has 3 required arguments; the path to an input RAD file `--input`, 
-the path to an output directory ``--output-dir`` (which will be created if it doesn't exist), and then one of the following mutually 
-exclusive options (which determines how the "true" barcodes are decided):
+barcodes should be associated with "true" cells, which should be corrected to
+some "true" barcode, and which should simply be ignored / discarded. This
+command has 4 required arguments: the path to an input RAD file ``--input``,
+the path to an output directory ``--output-dir`` (which will be created if it
+doesn't exist), the expected orientation of properly mapped reads
+``--expected-ori`` (the options are 'fw', which filters out alignments to the
+reverse complement strand; 'rc', which filters out alignments to the forward
+strand; and 'both' or 'either', which do not filter any alignments), and then one
+of the following mutually exclusive options (which determines how the "true"
+barcodes are decided):
 
 * ``--knee-distance``: This flag will use the distance method that is used in the whitelist command of 
   UMI-tools to attempt to automatically determine the number of true barcodes. Briefly, this 
@@ -44,4 +50,5 @@ relevant to users of ``alevin-fry``, but the files are described here.
 
 # The file ``permit_map.bin`` is a binary file (a serde serialized HashMap) that maps each barcode in the input RAD file that is within an edit distance of 1 to some *true* barcode to the barcode to which it corrects.  This allows the ``collate`` command to group together all of the read records corresponding to the same *corrected* barcode.
 
+# The file ``generate_permit_list.json`` is a JSON file containing information about the run of the command (currently, just the expected orientation).
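
The semantics of the ``--expected-ori`` values documented in this change can be summarised in a short sketch. The function ``keep_alignment`` and its ``is_forward`` argument are hypothetical names chosen for illustration; how alevin-fry represents alignment orientation internally is not stated here.

```python
def keep_alignment(is_forward: bool, expected_ori: str) -> bool:
    """Return True if an alignment survives orientation filtering.

    Hypothetical helper mirroring the documented option values only.
    """
    if expected_ori == "fw":
        # 'fw' filters out alignments to the reverse complement strand.
        return is_forward
    if expected_ori == "rc":
        # 'rc' filters out alignments to the forward strand.
        return not is_forward
    if expected_ori in ("both", "either"):
        # No orientation filtering at all.
        return True
    raise ValueError(f"unknown expected orientation: {expected_ori!r}")


# Each tuple is (is_forward, expected_ori) for a made-up alignment.
for case in [(True, "fw"), (False, "fw"), (False, "rc"), (False, "either")]:
    print(case, keep_alignment(*case))
```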
 
diff --git a/docs/source/quant.rst b/docs/source/quant.rst
@@ -20,23 +20,35 @@ The ``quant`` command exposes a number of different resolution strategies.  They
 
 * ``cr-like`` : This strategy is like the one adopted in cell-ranger, except that it does not first collapse 1-edit-distance UMIs.  Within each cell barcode, a list of (gene, UMI, count) tuples is created. If a read maps to more than one gene, then it generates more than one such tuple.  The tuples are then sorted lexicographically (first by gene id, then by UMI, and then by count).  Any UMI that aligns to only a single gene is assigned to that gene.  UMIs that align to more than one gene are assigned to the gene with the highest count for this UMI.  If there is a tie for the highest count gene for this UMI, then the corresponding reads are simply discarded.
 
+* ``cr-like-em`` : This strategy is like ``cr-like``, except that when a UMI matches multiple genes with equal frequency, the UMI is not discarded.  Instead, the genes are treated as an equivalence class, and the counts for each gene are determined via an expectation-maximization algorithm.
+
 output
 ------
 
-The output of the ``quant`` command consists of 3 files: ``barcodes.txt``,
-``counts.mtx.gz`` and ``gene_names.txt``. The ``counts.mtx.gz`` is a matrix market
-coordinate format file where the number of *rows* is equal to the number of
-genes and the number of columns is equal to the number of *cells*. The header
-line encodes the number of rows, columns and non-zero entries. The subsequent
-lines (1-based indexing) encode the locations and values of the non-zero
-entries.  This entire ``.mtx`` format file is gzipped during output to minimize
-disk space. The two other files provide the labels for the rows and columns of
-this matrix. The ``gene_names.txt`` file is a text file that contains the
-names of the rows of the matrix, in the order in which it is written, with
-one gene name written per line. The ``barcodes.txt`` file is a text file that
-contains the names of the columns of the matrix, in the order in which it is
-written, with one barcode name written per line.
-
-
+The output of the ``quant`` command consists of 5 files: ``barcodes.txt``,
+``counts.eds.gz``, ``gene_names.txt``, ``meta_info.json``, and ``features.txt``. 
+The ``meta_info.json`` file contains information about the quantification run,
+such as the method used for UMI resolution.  The ``features.txt`` file contains
+cell-level information designed to be useful in post-quantification cell filtering
+(better distinguishing "true" cells from background, noise, doublets, etc.).
+The other three files all hold quantification information.
+
+The ``counts.eds.gz`` file is in EDS_ format and stores the gene-by-cell
+expression matrix. The other two files provide the labels for the rows and
+columns of this matrix. The ``gene_names.txt`` file is a text file that
+contains the names of the rows of the matrix, in the order in which it is
+written, with one gene name written per line. The ``barcodes.txt`` file is a
+text file that contains the names of the columns of the matrix, in the order
+in which it is written, with one barcode name written per line.
+
+.. _alevin: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y
+.. _EDS: https://github.com/COMBINE-lab/EDS
+
+..
+  matrix market coordinate format file where the number of *rows* is equal to the number of
+  genes and the number of columns is equal to the number of *cells*. The header
+  line encodes the number of rows, columns and non-zero entries. The subsequent
+  lines (1-based indexing) encode the locations and values of the non-zero
+  entries.  This entire ``.mtx`` format file is gzipped during output to minimize
+  disk space. 
 
-.. _alevin: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y
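
The ``cr-like`` resolution rule described in ``quant.rst`` can be sketched for a single cell barcode as follows. This is an illustrative sketch under stated assumptions, not the alevin-fry code: the function ``resolve_cr_like`` is a hypothetical name, it uses per-UMI counters instead of the sorted (gene, UMI, count) tuple list the documentation describes, and reporting the number of assigned UMIs per gene is an assumption about the final count.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def resolve_cr_like(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Assign each UMI of one cell to a gene using the cr-like rule.

    `records` holds one (gene, umi) pair per read-to-gene mapping, so a
    read that maps to two genes contributes two pairs.
    Hypothetical helper for illustration only.
    """
    per_umi = defaultdict(Counter)
    for gene, umi in records:
        per_umi[umi][gene] += 1

    gene_counts: Counter = Counter()
    for umi, genes in per_umi.items():
        ranked = genes.most_common()  # sorted by count, highest first
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            # Tie for the highest-count gene: cr-like discards this UMI.
            continue
        gene_counts[ranked[0][0]] += 1
    return dict(gene_counts)


# Made-up records for one cell barcode.
records = [
    ("g1", "AAA"), ("g1", "AAA"), ("g2", "AAA"),  # g1 wins UMI AAA (2 vs 1)
    ("g2", "CCC"),                                # single gene for UMI CCC
    ("g1", "GGG"), ("g2", "GGG"),                 # tie for UMI GGG
]
print(resolve_cr_like(records))
```

In this toy input the tied UMI ``GGG`` is dropped. Under ``cr-like-em`` it would instead be kept, with ``g1`` and ``g2`` treated as an equivalence class whose counts are estimated by expectation maximization (not shown here).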