some updates to docs
rob-p committed Jan 11, 2022
1 parent b6ed32a commit 126603b
Showing 2 changed files with 6 additions and 6 deletions.
8 changes: 4 additions & 4 deletions docs/source/generate_permit_list.rst
@@ -35,7 +35,7 @@ barcodes are decided):

* ``--valid-bc <bcfile>``: This option will read the provided file <bcfile> and treat it as an explicitly-provided list of true, filtered barcodes (i.e. a list of barcodes believed to belong to a set of high-confidence cells truly present in the given sample). Barcodes appearing in this list will be considered to correspond to true and filtered cells, and barcodes will be corrected to this list. This flag is *not* designed to perform unfiltered quantification (i.e. correcting to a list of all *possible* barcodes generated by a technology, like e.g. the `10x v3 permit list <https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz>`_). To correct against an *unfiltered* permit list, you should use the ``--unfiltered-pl`` flag described below (which is currently in beta).

* ``--unfiltered-pl <plist>``: This option accepts as an argument a list of *possible* barcodes for the sample. For example, this is the flag you should use if you wish to provide an "external permit list", like the 10x v2 or 10x v3 permit lists. Unlike with the ``--valid-bc`` flag, the list passed to this argument is the set of all possible barcodes for the technology being processed, and it is likely that most of the barcodes in the file may not correspond to cells present in this particular sample. When using this argument, you may also pass the ``--min-reads`` argument to determine the minimum frequency with which a barcode must be seen in order to be retained. The algorithm used here will pass over the input records (mapped reads) and count how many times each of the barcodes in the unfiltered permit list occur exactly. Any barcode occurring >= ``min-reads`` times will be considered as a present cell. Subsequently, all barcodes that did not match a present cell will be searched (at an edit distance of up to 1) against the barcodes determined to correspond to present cells. If an initially non-matching barcode has a unique neighbor among the barcodes for present cells, it will be corrected to that barcode, but if it has no 1-edit neighbor, or if it has 2 or more 1-edit neighbors among that list (i.e. its correction would be ambiguous), then the record is discarded. *Note* : support for unfiltered permit lists is currently in beta.
* ``--unfiltered-pl <plist>``: This option accepts as an argument a list of *possible* barcodes for the sample. For example, this is the flag you should use if you wish to provide an "external permit list", like the 10x v2 or 10x v3 permit lists. Unlike with the ``--valid-bc`` flag, the list passed to this argument is the set of all possible barcodes for the technology being processed, and it is likely that most of the barcodes in the file may not correspond to cells present in this particular sample. When using this argument, you may also pass the ``--min-reads`` argument to determine the minimum frequency with which a barcode must be seen in order to be retained. The algorithm used here will pass over the input records (mapped reads) and count how many times each of the barcodes in the unfiltered permit list occur exactly. Any barcode occurring >= ``min-reads`` times will be considered as a present cell. Subsequently, all barcodes that did not match a present cell will be searched (at an edit distance of up to 1) against the barcodes determined to correspond to present cells. If an initially non-matching barcode has a unique neighbor among the barcodes for present cells, it will be corrected to that barcode, but if it has no 1-edit neighbor, or if it has 2 or more 1-edit neighbors among that list (i.e. its correction would be ambiguous), then the record is discarded.

* ``--min-reads <threshold>``: This flag is meant to be used (and currently only applied) in conjunction with ``--unfiltered-pl``. Any barcodes from the provided permit list that have >= ``<threshold>`` exact occurrences in the input file will be deemed as present cells and will be passed on to subsequent phases of quantification. Barcodes occurring < ``threshold`` number of times will be corrected against the set of present cells using the procedure described above.
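
The ``--unfiltered-pl`` / ``--min-reads`` procedure described in the bullets above can be illustrated with a small Python sketch. This is not the ``alevin-fry`` implementation: the function names are hypothetical, barcodes are plain strings, and "1-edit neighbor" is simplified to a single-base substitution (the real tool may define neighbors differently).

```python
from collections import Counter

ALPHABET = "ACGT"


def substitution_neighbors(barcode):
    """Yield every barcode that differs from `barcode` at exactly one position.

    Assumption: a 1-edit neighbor is a single substitution (no indels).
    """
    for i, base in enumerate(barcode):
        for alt in ALPHABET:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def build_correction_map(observed, permit_list, min_reads):
    """Map each retained observed barcode to the present-cell barcode it resolves to.

    `observed` holds one barcode per mapped read record (hypothetical input).
    """
    permitted = set(permit_list)
    # Count exact occurrences of barcodes that appear in the unfiltered permit list.
    exact_counts = Counter(bc for bc in observed if bc in permitted)
    # Barcodes seen at least `min_reads` times are treated as present cells.
    present = {bc for bc, n in exact_counts.items() if n >= min_reads}

    correction = {bc: bc for bc in present}
    for bc in set(observed) - present:
        hits = [nb for nb in substitution_neighbors(bc) if nb in present]
        if len(hits) == 1:
            # A unique neighbor among present cells: correct to it.
            correction[bc] = hits[0]
        # Zero hits or two or more hits: the barcode is discarded.
    return correction


if __name__ == "__main__":
    observed = ["AAAA"] * 3 + ["AAAC"] * 2 + ["CCCC"] * 2 + ["AAAG", "CCCG", "GGGG"]
    permit = ["AAAA", "AAAC", "CCCC", "GGGG"]
    for raw, corrected in sorted(build_correction_map(observed, permit, 2).items()):
        print(raw, "->", corrected)
```

In this toy input, ``CCCG`` corrects to ``CCCC``, ``AAAG`` is dropped because both ``AAAA`` and ``AAAC`` are one substitution away, and ``GGGG`` is dropped because it falls below the threshold and has no present-cell neighbor.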

@@ -47,11 +47,11 @@ output
The ``generate-permit-list`` command outputs a number of different files in the output directory. Not all files are
relevant to users of ``alevin-fry``, but the files are described here.

1. The file ``all_freq.tsv`` is a two-column tab-separated file that lists, for each distinct barcode in the input RAD file, the number of read records that were tagged with this barcode.
1. The file ``all_freq.bin`` is a binary file that records, for each distinct barcode in the input RAD file, the number of read records that were tagged with this barcode.

2. The file ``permit_freq.tsv`` is a two-column tab-separated file that lists, for each barcode in the input RAD file that is determined to be a *true* barcode, the number of read records associated with this barcode.
2. The file ``permit_freq.bin`` is a binary file that lists, for each barcode in the input RAD file that is determined to be a *true* barcode, the number of read records associated with this barcode.

3. The file ``permit_map.bin`` is a binary file (a serde serialized HashMap) that maps each barcode in the input RAD file that is within an edit distance of 1 to some *true* barcode to the barcode to which it corrects. This allows the ``collate`` command to group together all of the read records corresponding to the same *corrected* barcode.

4. The file ``generate_permit_list.json`` is a JSON file containing information about the run of the command (currently, just the expected orientation).
4. The file ``generate_permit_list.json`` is a JSON file containing information about the run of the command (currently, just the expected orientation).

4 changes: 2 additions & 2 deletions docs/source/quant.rst
@@ -1,7 +1,7 @@
quant
=====

The ``quant`` command takes a collated RAD file and performs feature (e.g. gene) quantification, outputting a sparse matrix of de-duplicated counts as well as a list of labels for the rows and columns. The ``quant`` command takes an input directory containing the collated RAD file, a transcript-to-gene map, an output directory where the results will be written and a "resolution strategy" (described below). Quantification is multi-threaded, so it also, optionally, takes as an argument the number of threads to use concurrently.
The ``quant`` command takes a collated RAD file and performs feature (e.g. gene) quantification, outputting a sparse matrix of de-duplicated counts as well as a list of labels for the rows and columns. The ``quant`` command takes an input directory containing the collated RAD file, a transcript-to-gene map, an output directory where the results will be written, and a "resolution strategy" (described below). Quantification is multi-threaded, so it also, optionally, takes as an argument the number of threads to use concurrently.

The transcript-to-gene map should be either:

@@ -36,7 +36,7 @@ Additionally, this command can optionally take the following flags (note that no
output
------

The output of the ``quant`` command consists of 5 files: ``quants_mat_rows.txt``, ``counts.eds.gz`` (or ``quants_mat.mtx`` if run with the ``--use-mtx`` flag), ``quants_mat_cols.txt``, ``meta_info.json``, and ``features.txt``. The ``meta_info.json`` file contains information about the quantification run, such as the method used for UMI resolution. The ``features.txt`` file contains cell-level information designed to be useful in post-quantification cell filtering (better determining "true" cells from background, noise, doublets etc.). The other three files all correspond to quantification information.
The output of the ``quant`` command consists of 5 files: ``quants_mat_rows.txt``, ``counts.eds.gz`` (or ``quants_mat.mtx`` if run with the ``--use-mtx`` flag), ``quants_mat_cols.txt``, ``quant.json``, and ``featureDump.txt``. The ``quant.json`` file contains information about the quantification run, such as the method used for UMI resolution. The ``featureDump.txt`` file contains cell-level information designed to be useful in post-quantification cell filtering (better determining "true" cells from background, noise, doublets etc.). The other three files all correspond to quantification information.

If ``quant`` was executed in USA mode, then the resulting count matrix will be of dimension ``C``x``3G`` where ``C`` is the number of quantified cells (barcodes) and ``G`` is the number of genes. This is because, in USA mode, ``alevin-fry`` quantifies the UMI count attributable to each splicing state of each gene in each cell, where the splicing state is one of spliced (S), unspliced (U) or ambiguous (A). If ``quant`` was run with a two-column transcript-to-gene map (not in USA-mode), then the resulting count matrix will be a ``C``x``G`` matrix, as splicing status is not tracked. For more details on USA mode and its uses, please read the ``alevin-fry`` `preprint <https://www.biorxiv.org/content/10.1101/2021.06.29.450377v1>`__, or the `corresponding tutorial <https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/>`__.
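
To make the ``C`` x ``3G`` layout concrete, here is a minimal Python sketch that splits one cell's row of a USA-mode count matrix into its three per-gene vectors. The function name is hypothetical, and the column order (a block of ``G`` spliced columns, then ``G`` unspliced, then ``G`` ambiguous) is an assumption made for illustration; check it against the labels in ``quants_mat_cols.txt`` before relying on it.

```python
def split_usa_row(row, num_genes):
    """Split one cell's 3G-long count row into spliced/unspliced/ambiguous parts.

    Assumption: columns are laid out as [S block | U block | A block],
    each block holding `num_genes` entries in the same gene order.
    """
    if len(row) != 3 * num_genes:
        raise ValueError(
            f"expected {3 * num_genes} columns for {num_genes} genes, got {len(row)}"
        )
    return {
        "S": row[:num_genes],
        "U": row[num_genes:2 * num_genes],
        "A": row[2 * num_genes:],
    }


if __name__ == "__main__":
    # One cell, three genes: nine columns in total.
    parts = split_usa_row([5, 0, 2, 1, 3, 0, 0, 1, 1], num_genes=3)
    print(parts["S"], parts["U"], parts["A"])
```

With a two-column transcript-to-gene map there is nothing to split, since the matrix is already ``C`` x ``G``.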
