Update quant.rst

describe new resolution modes and 2 of the hidden flags.
COMBINE-lab · Jun 1, 2022 · 4483e62 · 4483e62
1 parent b391839
commit 4483e62
Showing 1 changed file with 20 additions and 9 deletions.
diff --git a/docs/source/quant.rst b/docs/source/quant.rst
@@ -3,7 +3,7 @@ quant
 
 The ``quant`` command takes a collated RAD file and performs feature (e.g. gene) quantification, outputting a sparse matrix of de-duplicated counts as well as a list of labels for the rows and columns.  The ``quant`` command takes an input directory containing the collated RAD file, a transcript-to-gene map, an output directory where the results will be written, and a "resolution strategy" (described below).  Quantification is multi-threaded, so it also, optionally, takes as an arguments the number of threads to use concurrently.
 
-The transcript-to-gene map should be either:
+The transcript-to-gene map (provided using the ``-m`` or ``--tg-map`` option) should be either:
 
 1. A three-column (headerless) tab-separated file where the first column contains a target name, the second column contains the corresponding gene feature to which this target belongs and the third column contains either ``S`` or ``U``, with ``S`` denoting the corresponding feature should be attributed to the spliced status of its parent gene and ``U`` denoting that it should be attributed to the unspliced status of its parent gene.
 
@@ -19,7 +19,11 @@ The ``quant`` command exposes a number of different resolution strategies.  Note
 
 * ``parsimony`` : This strategy is the same as "full", except that it does *not* probabilistically resolve reads that remain as gene-multimapping after applying the parsimony criterion.  Instead, reads that do not have a unique most-parsimonious assignment are discarded. 
 
-* ``trivial`` : This strategy does not search for 1 edit-distance neighbors of UMIs.  Instead, it first discards any reads that multi-map at the gene level.  The reads that remain then all map uniquely to a single gene.  These reads are deduplicated by (exact) UMI, and the number of distinct UMIs mapping to each gene are taken as that gene's count in the current cell.
+* ``parsimony-gene`` : This strategy is the same as ``parsimony`` above, except that mappings of UMIs to transcripts are projected to their corresponding gene *before* the relevant graph (Parsimonious UMI Graph) is constructed.  Thus, the vertices of the graph consist of sets of gene labels rather than sets of transcript labels.  This method may be less precise than the ``parsimony`` method (i.e. may wrongly group together UMIs that arise from different transcripts within the same gene), but it simultaneously likely to be more robust to misannotation or incomplete annotation (e.g. incomplete UTR annotation).
+
+* ``parsimony-gene-em`` : This strategy is the same as ``parsimony-above`` above, except that, like ``parsimony-em`` any graphs that exhibit a multi-gene cover will have the multimapping resolved probabilistically with an EM algorithm..
+
+* ``trivial`` : This strategy does not search for 1 edit-distance neighbors of UMIs.  Instead, it first discards any reads that multi-map at the gene level.  The reads that remain then all map uniquely to a single gene.  These reads are deduplicated by (exact) UMI, and the number of distinct UMIs mapping to each gene are taken as that gene's count in the current cell. **Note**: This resolution strategy is not available in USA mode.
 
 Additionally, this command can optionally take the following flags (note that not all resolution strategies are compatible with these flags):
 
@@ -29,18 +33,26 @@ Additionally, this command can optionally take the following flags (note that no
 
 * ``--summary-stat`` : This flag will write the summary statistics of the bootstrap replicates (i.e. the mean and variance of the inferential replicates).  This provides the most important information for uncertainty-aware downstream analysis, while requiring much less storage space than the full bootstrap replicate information.  This flag is only meaningful when ``--num-bootstraps`` is meaningful.
 
-* ``--quant-subset <sfile>`` : This optional argument provides a file containing list of barcodes to quantify (one barcode per line, written as a string), those not in this list will be ignored during inference and will not appear in the output quantification matrix.  If this argument is not provided, then all of the original barcodes will be quantified.
+* ``--quant-subset <SFILE>`` : This optional argument provides a file containing list of barcodes to quantify (one barcode per line, written as a string), those not in this list will be ignored during inference and will not appear in the output quantification matrix.  If this argument is not provided, then all of the original barcodes will be quantified.
+
+* ``--use-mtx`` : This flag will cause the output to be written in matrix market coordinate format (which is the default).
+
+* ``--use-eds`` : This flag will cause the output to be written in EDS format rather than in matrix market format.
+
+There are also a few flags that are not immediately exposed:
 
-* ``--use-mtx`` : This flag will cause the output to be written in matrix market coordinate format rather than in EDS format.
+* ``--umi-edit-dist <EDIST>`` : This option takes a parameter that sets the Hamming distance within which potentially colliding UMIs will be considered for correction.  With resolution modes ``parsimony``, ``parsimony-em``, ``parsimony-gene`` or ``parsimony-gene-em`` the valid values are 0 and 1 (and the default is 1).  With other resolution modes, the default (and currently the only supported value) is 0.
+
+* ``--large-graph-thresh <NVERT>`` : This option takes a parameter that sets the order (number of nodes) of a PUG above which an alternative (faster) resolution strategy will be applied.  This option only has an effect for ``parsimony``, ``parsimony-em``, ``parsimony-gene`` or ``parsimony-gene-em`` resolution modes.  The default value is 1000.
 
 output
 ------
 
-The output of the ``quant`` command consists of 5 files: ``quants_mat_rows.txt``, ``counts.eds.gz`` (or ``quants_mat.mtx`` if run with the ``--use-mtx`` flag), ``quants_mat_cols.txt``, ``quant.json``, and ``featureDump.txt``.  The ``quant.json`` file contains information about the quantification run, such as the method used for UMI resolution.  The ``featureDump.txt`` file contains cell-level information designed to be useful in post-quantification cell filtering (better determining "true" cells from background, noise, doublets etc.).  The other three files all correspond to quantification information.
+The output of the ``quant`` command consists of 5 files: ``quants_mat_rows.txt``, ``quants_mat.mtx`` (or ``counts.eds.gz`` if run with the ``--use-eds`` flag), ``quants_mat_cols.txt``, ``quant.json``, and ``featureDump.txt``.  The ``quant.json`` file contains information about the quantification run, such as the method used for UMI resolution.  The ``featureDump.txt`` file contains cell-level information designed to be useful in post-quantification cell filtering (better determining "true" cells from background, noise, doublets etc.).  The other three files all correspond to quantification information.
 
-If ``quant`` was executed in USA mode, then the resulting count matrix will be of dimension ``C``x``3G`` where ``C`` is the number of quantified cells (barcodes) and ``G`` is the number of genes.  This is because, in USA mode, ``alevin-fry`` quantifies the UMI count attributable to each splicing state of each gene in each cell, where the splicing state is one of spliced (S), unspliced (U) or ambiguous (A).  If ``quant`` was run with a two-column transcript-to-gene map (not in USA-mode), then the resulting count matrix will be a ``C``x``G`` matrix, as splicing status is not tracked.  For more details on USA mode and its uses, please read the ``alevin-fry`` `preprint <https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1>`__, or the `corresponding tutorial <https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/>`__.
+If ``quant`` was executed in USA mode, then the resulting count matrix will be of dimension ``C``x``3G`` where ``C`` is the number of quantified cells (barcodes) and ``G`` is the number of genes.  This is because, in USA mode, ``alevin-fry`` quantifies the UMI count attributable to each splicing state of each gene in each cell, where the splicing state is one of spliced (S), unspliced (U) or ambiguous (A).  If ``quant`` was run with a two-column transcript-to-gene map (not in USA-mode), then the resulting count matrix will be a ``C``x``G`` matrix, as splicing status is not tracked.  For more details on USA mode and its uses, please read the ``alevin-fry`` `paper https://www.nature.com/articles/s41592-022-01408-3`__ or `preprint <https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1>`__, or the `corresponding tutorial <https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/>`__.
 
-The ``counts.eds.gz`` is a file in EDS_ format that stores the gene-by-cell expression matrix. The two other files provide the labels for the rows and columns of this matrix. The ``quants_mat_cols.txt`` file is a text file that contains the names of the rows of the matrix, in the order in which it is written, with one gene name written per line. The ``quants_mat_rows.txt`` file is a text file that contains the names of the columns of the matrix, in the order in which it is written, with one barcode name written per line.
+The ``quants_mat.mtx`` is a matrix market `coordinate format <https://math.nist.gov/MatrixMarket/formats.html>`__ file (or if running with ``--use-eds`` then ``counts.eds.gz`` is a gzipped file in EDS_ format) that stores the gene-by-cell expression matrix. The two other files provide the labels for the rows and columns of this matrix. The ``quants_mat_cols.txt`` file is a text file that contains the names of the rows of the matrix, in the order in which it is written, with one gene name written per line. The ``quants_mat_rows.txt`` file is a text file that contains the names of the columns of the matrix, in the order in which it is written, with one barcode name written per line.
 
 .. _alevin: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y
 .. _EDS: https://github.com/COMBINE-lab/EDS
@@ -50,6 +62,5 @@ The ``counts.eds.gz`` is a file in EDS_ format that stores the gene-by-cell expr
   genes and the number of columns is equal to the number of *cells*. The header
   line encodes the number of rows, columns and non-zero entries. The subsequent
   lines (1-based indexing) encode the locations and values of the non-zero
-  entries.  This entire ``.mtx`` format file is gzipped during output to minimize
-  disk space. 
+  entries.