Merge branch 'develop'

COMBINE-lab · Aug 2, 2022 · 46de5e4 · 46de5e4
2 parents 45d3d6d + 25306c5
commit 46de5e4
Show file tree

Hide file tree

Showing 16 changed files with 2,118 additions and 394 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "alevin-fry"
-version = "0.6.0"
+version = "0.6.1"
 authors = [
   "Avi Srivastava <avi.srivastava@nyu.edu>",
   "Hirak Sarkar <hirak_sarkar@hms.harvard.edu>",
@@ -46,6 +46,8 @@ bincode = "1.3.3"
 bstr = "0.2.17"
 crossbeam-channel = "0.5.4"
 crossbeam-queue = "0.3.5"
+# derive_builder = "0.11.2"
+typed-builder = "0.10.0"
 indicatif = "0.16.2"
 needletail = "0.4.1"
 petgraph = "0.6.0"
@@ -75,7 +77,9 @@ rust-htslib = { version = "0.39.5", default-features = false, features = [
 ] }
 sce = { git = "https://github.com/parazodiac/SingleCellExperiment", version = "0.1.2" }
 
-clap = { version = ">=3.1.18", features = ["derive", "wrap_help", "cargo"] }
+# no shenanigans; clap makes breaking "fixes" too often to allow variability
+# in the version different from what we tested with
+clap = { version = "=3.2.16", features = ["derive", "wrap_help", "cargo"] }
 
 [profile.release]
 #debug = true

diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -22,7 +22,7 @@
 author = 'Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, Rob Patro'
 
 # The full version, including alpha/beta/rc tags
-release = '0.5.1'
+release = '0.6.0'
 
 master_doc = 'index'
 

diff --git a/docs/source/generate_permit_list.rst b/docs/source/generate_permit_list.rst
@@ -33,7 +33,7 @@ barcodes are decided):
   a number of reads >= to the <ncells>-th barcode will be considered part of the permit list, all others will not 
   (but will be considered for correction to this permit list).
 
-* ``--valid-bc <bcfile>``: This option will read the provided file <bcfile> and treat it as an explicitly-provided list of true, filtered barcodes (i.e. a list of barcodes believed to belong to a set of high-confidence cells truly present in the given sample). Barcodes appearing in this list will be considered to correspond to true and filtered cells, and barcodes will be corrected to this list. This flag is *not* designed to perform unfiltered quantification (i.e. correcting to a list of all *possible* barcodes generated by a technology, like e.g. the `10x v3 permit list <https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz>`_).  To correct against an *unfiltered* permit list, you should use the ``--unfiltered-pl`` flag described below (which is currently in beta).
+* ``--valid-bc <bcfile>``: This option will read the provided file <bcfile> and treat it as an explicitly-provided list of true, filtered barcodes (i.e. a list of barcodes believed to belong to a set of high-confidence cells truly present in the given sample). Barcodes appearing in this list will be considered to correspond to true and filtered cells, and barcodes will be corrected to this list. This flag is *not* designed to perform unfiltered quantification (i.e. correcting to a list of all *possible* barcodes generated by a technology, like e.g. the `10x v3 permit list <https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz>`_).  To correct against an *unfiltered* permit list, you should use the ``--unfiltered-pl`` flag described below.
 
 * ``--unfiltered-pl <plist>``: This option accepts as an argument a list of *possible* barcodes for the sample.  For example, this is the flag you should use if you wish to provide an "external permit list", like the 10x v2 or 10x v3 permit lists. Unilike with the ``--valid-bc`` flag, the list passed to this argument is the set of all possible barcodes for the technology being processed, and it is likely that most of the barcodes in the file may not correspond to cells present in this particular sample.  When using this argument, you may also pass the ``--min-reads`` argument to determine the minimum frequency with which a barcode must be seen in order to be retained.  The algorithm used here will pass over the input records (mapped reads) and count how many times each of the barcodes in the unfiltered permit list occur exactly.  Any barcode ocurring >= ``min-reads`` times will be considered as a present cell.  Subsequently, all barcodes that did not match a present cell will be searched (at an edit distance of up to 1) againt the barcodes determined to correspond to present cells.  If an initially non-matching barcode has a unique neighbor among the barcodes for present cells, it will be corrected to that barcode, but if it has no 1-edit neighbor, or if it has 2 or more 1-edit neighbors among that list (i.e. it's correction would be ambiguous), then the record is discarded.
 

diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst
@@ -8,13 +8,20 @@ First, we need to generate a RAD file using alevin.  The RAD file is created by
 
 .. code:: bash
 
-    $ salmon alevin -lISR --chromium -1 <read1_files> -2 <read2_files> -o <alevin_odir> -i <index> -p <num_threads> --tgMap <tg_map> --sketch
+    $ salmon alevin -lISR --chromium -1 <read1_files> -2 <read2_files> -o <alevin_odir> -i <index> -p <num_threads> --sketch
 
-Given the output directory generated above, the next step is to let alevin-fry generate the permit list.  Here we use the "knee" method `-k`.
+Given the output directory generated above, the next step is to let alevin-fry generate the permit list.  First, we grab the 10x Chromium version 2
+permit list (if we had Chromium v3 chemistry, we would use that permit-list instead):
+
+.. code:: bash
+
+    $ wget https://umd.box.com/shared/static/jbs2wszgbj7k4ic2hass9ts6nhqkwq1p -O 10x_v2_permit.txt
+
+Now, we can use this permit list to scan the cell barcodes actually encountered in our reads and determine a set of cells that were likely present in our sample:
 
 .. code:: bash 
 
-    $ alevin-fry generate-permit-list --input <alevin_odir> --expected-ori fw --output-dir <fry_odir> -k
+    $ alevin-fry generate-permit-list --input <alevin_odir> --expected-ori fw --output-dir <fry_odir> --unfiltered-pl 10x_v2_permit.txt
 
 Next, given the permit list and barcode mapping (which resides in the `<fry_odir>` directory), we collate the original RAD file using the command below.
 

diff --git a/docs/source/quant.rst b/docs/source/quant.rst
@@ -3,7 +3,7 @@ quant
 
 The ``quant`` command takes a collated RAD file and performs feature (e.g. gene) quantification, outputting a sparse matrix of de-duplicated counts as well as a list of labels for the rows and columns.  The ``quant`` command takes an input directory containing the collated RAD file, a transcript-to-gene map, an output directory where the results will be written, and a "resolution strategy" (described below).  Quantification is multi-threaded, so it also, optionally, takes as an arguments the number of threads to use concurrently.
 
-The transcript-to-gene map should be either:
+The transcript-to-gene map (provided using the ``-m`` or ``--tg-map`` option) should be either:
 
 1. A three-column (headerless) tab-separated file where the first column contains a target name, the second column contains the corresponding gene feature to which this target belongs and the third column contains either ``S`` or ``U``, with ``S`` denoting the corresponding feature should be attributed to the spliced status of its parent gene and ``U`` denoting that it should be attributed to the unspliced status of its parent gene.
 
@@ -19,7 +19,11 @@ The ``quant`` command exposes a number of different resolution strategies.  Note
 
 * ``parsimony`` : This strategy is the same as "full", except that it does *not* probabilistically resolve reads that remain as gene-multimapping after applying the parsimony criterion.  Instead, reads that do not have a unique most-parsimonious assignment are discarded. 
 
-* ``trivial`` : This strategy does not search for 1 edit-distance neighbors of UMIs.  Instead, it first discards any reads that multi-map at the gene level.  The reads that remain then all map uniquely to a single gene.  These reads are deduplicated by (exact) UMI, and the number of distinct UMIs mapping to each gene are taken as that gene's count in the current cell.
+* ``parsimony-gene`` : This strategy is the same as ``parsimony`` above, except that mappings of UMIs to transcripts are projected to their corresponding gene *before* the relevant graph (Parsimonious UMI Graph) is constructed.  Thus, the vertices of the graph consist of sets of gene labels rather than sets of transcript labels.  This method may be less precise than the ``parsimony`` method (i.e. may wrongly group together UMIs that arise from different transcripts within the same gene), but it simultaneously likely to be more robust to misannotation or incomplete annotation (e.g. incomplete UTR annotation).
+
+* ``parsimony-gene-em`` : This strategy is the same as ``parsimony-above`` above, except that, like ``parsimony-em`` any graphs that exhibit a multi-gene cover will have the multimapping resolved probabilistically with an EM algorithm..
+
+* ``trivial`` : This strategy does not search for 1 edit-distance neighbors of UMIs.  Instead, it first discards any reads that multi-map at the gene level.  The reads that remain then all map uniquely to a single gene.  These reads are deduplicated by (exact) UMI, and the number of distinct UMIs mapping to each gene are taken as that gene's count in the current cell. **Note**: This resolution strategy is not available in USA mode.
 
 Additionally, this command can optionally take the following flags (note that not all resolution strategies are compatible with these flags):
 
@@ -29,18 +33,26 @@ Additionally, this command can optionally take the following flags (note that no
 
 * ``--summary-stat`` : This flag will write the summary statistics of the bootstrap replicates (i.e. the mean and variance of the inferential replicates).  This provides the most important information for uncertainty-aware downstream analysis, while requiring much less storage space than the full bootstrap replicate information.  This flag is only meaningful when ``--num-bootstraps`` is meaningful.
 
-* ``--quant-subset <sfile>`` : This optional argument provides a file containing list of barcodes to quantify (one barcode per line, written as a string), those not in this list will be ignored during inference and will not appear in the output quantification matrix.  If this argument is not provided, then all of the original barcodes will be quantified.
+* ``--quant-subset <SFILE>`` : This optional argument provides a file containing list of barcodes to quantify (one barcode per line, written as a string), those not in this list will be ignored during inference and will not appear in the output quantification matrix.  If this argument is not provided, then all of the original barcodes will be quantified.
+
+* ``--use-mtx`` : This flag will cause the output to be written in matrix market coordinate format (which is the default).
+
+* ``--use-eds`` : This flag will cause the output to be written in EDS format rather than in matrix market format.
+
+There are also a few flags that are not immediately exposed:
 
-* ``--use-mtx`` : This flag will cause the output to be written in matrix market coordinate format rather than in EDS format.
+* ``--umi-edit-dist <EDIST>`` : This option takes a parameter that sets the Hamming distance within which potentially colliding UMIs will be considered for correction.  With resolution modes ``parsimony``, ``parsimony-em``, ``parsimony-gene`` or ``parsimony-gene-em`` the valid values are 0 and 1 (and the default is 1).  With other resolution modes, the default (and currently the only supported value) is 0.
+
+* ``--large-graph-thresh <NVERT>`` : This option takes a parameter that sets the order (number of nodes) of a PUG above which an alternative (faster) resolution strategy will be applied.  This option only has an effect for ``parsimony``, ``parsimony-em``, ``parsimony-gene`` or ``parsimony-gene-em`` resolution modes.  The default value is 1000.
 
 output
 ------
 
-The output of the ``quant`` command consists of 5 files: ``quants_mat_rows.txt``, ``counts.eds.gz`` (or ``quants_mat.mtx`` if run with the ``--use-mtx`` flag), ``quants_mat_cols.txt``, ``quant.json``, and ``featureDump.txt``.  The ``quant.json`` file contains information about the quantification run, such as the method used for UMI resolution.  The ``featureDump.txt`` file contains cell-level information designed to be useful in post-quantification cell filtering (better determining "true" cells from background, noise, doublets etc.).  The other three files all correspond to quantification information.
+The output of the ``quant`` command consists of 5 files: ``quants_mat_rows.txt``, ``quants_mat.mtx`` (or ``counts.eds.gz`` if run with the ``--use-eds`` flag), ``quants_mat_cols.txt``, ``quant.json``, and ``featureDump.txt``.  The ``quant.json`` file contains information about the quantification run, such as the method used for UMI resolution.  The ``featureDump.txt`` file contains cell-level information designed to be useful in post-quantification cell filtering (better determining "true" cells from background, noise, doublets etc.).  The other three files all correspond to quantification information.
 
-If ``quant`` was executed in USA mode, then the resulting count matrix will be of dimension ``C``x``3G`` where ``C`` is the number of quantified cells (barcodes) and ``G`` is the number of genes.  This is because, in USA mode, ``alevin-fry`` quantifies the UMI count attributable to each splicing state of each gene in each cell, where the splicing state is one of spliced (S), unspliced (U) or ambiguous (A).  If ``quant`` was run with a two-column transcript-to-gene map (not in USA-mode), then the resulting count matrix will be a ``C``x``G`` matrix, as splicing status is not tracked.  For more details on USA mode and its uses, please read the ``alevin-fry`` `preprint <https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1>`__, or the `corresponding tutorial <https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/>`__.
+If ``quant`` was executed in USA mode, then the resulting count matrix will be of dimension ``C``x``3G`` where ``C`` is the number of quantified cells (barcodes) and ``G`` is the number of genes.  This is because, in USA mode, ``alevin-fry`` quantifies the UMI count attributable to each splicing state of each gene in each cell, where the splicing state is one of spliced (S), unspliced (U) or ambiguous (A).  If ``quant`` was run with a two-column transcript-to-gene map (not in USA-mode), then the resulting count matrix will be a ``C``x``G`` matrix, as splicing status is not tracked.  For more details on USA mode and its uses, please read the ``alevin-fry`` `paper <https://www.nature.com/articles/s41592-022-01408-3>`__ or `preprint <https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1>`__, or the `corresponding tutorial <https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/>`__.
 
-The ``counts.eds.gz`` is a file in EDS_ format that stores the gene-by-cell expression matrix. The two other files provide the labels for the rows and columns of this matrix. The ``quants_mat_cols.txt`` file is a text file that contains the names of the rows of the matrix, in the order in which it is written, with one gene name written per line. The ``quants_mat_rows.txt`` file is a text file that contains the names of the columns of the matrix, in the order in which it is written, with one barcode name written per line.
+The ``quants_mat.mtx`` is a matrix market `coordinate format <https://math.nist.gov/MatrixMarket/formats.html>`__ file (or if running with ``--use-eds`` then ``counts.eds.gz`` is a gzipped file in EDS_ format) that stores the gene-by-cell expression matrix. The two other files provide the labels for the rows and columns of this matrix. The ``quants_mat_cols.txt`` file is a text file that contains the names of the rows of the matrix, in the order in which it is written, with one gene name written per line. The ``quants_mat_rows.txt`` file is a text file that contains the names of the columns of the matrix, in the order in which it is written, with one barcode name written per line.
 
 .. _alevin: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y
 .. _EDS: https://github.com/COMBINE-lab/EDS
@@ -50,6 +62,5 @@ The ``counts.eds.gz`` is a file in EDS_ format that stores the gene-by-cell expr
   genes and the number of columns is equal to the number of *cells*. The header
   line encodes the number of rows, columns and non-zero entries. The subsequent
   lines (1-based indexing) encode the locations and values of the non-zero
-  entries.  This entire ``.mtx`` format file is gzipped during output to minimize
-  disk space. 
+  entries.