resolving conflict

COMBINE-lab · Aug 4, 2020 · 1bf240c · 1bf240c
2 parents 0a0a9fe + 52a0e6c
commit 1bf240c
Show file tree

Hide file tree

Showing 9 changed files with 93 additions and 74 deletions.
diff --git a/README.md b/README.md
@@ -18,54 +18,6 @@ be done (in bash-like shells) using:
 $ export PATH=`pwd`/target/release/:$PATH
 ```
 
-## Running alevin-fry
-
-There are a (growing) number of different sub-commands: `generate-permit-list`, `collate`, and `quant`.  Each of these
-is invoked as a command passed as the first argument to the `alevin-fry` executable.  For example, to run the 
-`generate-permit-list` command, one would run:
-
-```{bash}
-$ alevin-fry generate-permit-list --help
-```
-
-This should then show the following:
-
-```{bash}
-$ alevin-fry generate-permit-list --help
-alevin-fry-generate-permit-list 0.0.1
-Avi Srivastava, Rob Patro
-Generate a permit list of barcodes from a RAD file
-
-USAGE:
-    alevin-fry generate-permit-list [FLAGS] --input <input> --output-dir <output-dir> --expect-cells <expect-cells> --force-cells <force-cells> --valid-bc <valid-bc>
-
-FLAGS:
-    -h, --help             Prints help information
-    -k, --knee-distance    attempt to determine the number of barcodes to keep using the knee distance method
-    -V, --version          Prints version information
-
-OPTIONS:
-    -e, --expect-cells <expect-cells>    defines the expected number of cells to use in determining the (read, not UMI) based cutoff
-    -f, --force-cells <force-cells>      select the top-k most-frequent barcodes, based on read count, as valid (true)
-    -i, --input <input>                  input RAD file
-    -o, --output-dir <output-dir>        output directory
-    -b, --valid-bc <valid-bc>            uses true barcode collected from a provided file
-```
-
-The commands and options are described below.
-
-
-### generate-permit-list
-
-This command takes as input a RAD file (created by running `alevin` with the `--justAlign` flag), and it determines what cell barcodes should
-be associated with "true" cells, which should be corrected to some "true" barcode, and which should simply be ignored / discarded.  This command 
-has 3 required arguments; the path to an input RAD file `--input`, the path to an output directory `--output-dir` (which will be created if it doesn't exist),
-and then **one** of the following mutually exclusive options (which determines how the "true" barcodes are decided):
-
-  * `--knee-distance` : This flag will use the `distance` method that is used in the `whitelist` command of [UMI-tools](https://github.com/CGATOxford/UMI-tools) to attempt to automatically determine the number of true barcodes.  Briefly, this method first counts the number of _reads_ associated with each barcode, and then sorts the barcodes in descending order by their associated read count.  It then constructs the cumulative distribution function from this sorted list of frequencies.  Finally, it applies an iterative algorithm to attempt to determine the optimal number of barcodes to include by looking for a "knee" or "elbow" in the CDF graph.  The algorithm considers each barcode in the CDF where it's x-coordinate is equal to this barcode's rank divided by the total number of barcodes (i.e. its normalized rank) and the y-coordinate is equal to the (normalized) cumulative frequency achieved at this barcode.  It then computes the distance of this barcode from the line x=y (defined by the start and end of the CDF).  The initial knee is predicted as the point that has the maximum distance from the x=y line.  The algorithm is iterative, because experiments with many low-quality barcodes may predict too many valid barcodes using this method.  Thus, the algorithm is run repeatedly, each time considering a prefix of the CDF from index 0 through the previous knee's index * 5.  Once two subsequent iterations of the algorithm return the same knee point, the algorithm terminates.
-  * `--force-cells <ncells>`: This option will count the number of reads associated with each barcode, and sort the barcodes in descending order of frequency.  Then, it will consider the first `<ncells>` barcodes to be valid.  Any barcode that has a number of reads >= to the `<ncells>`-th barcode will be considered part of the permit list, all others will not (but will be considered for _correction_ to this permit list).
-  * `--valid-bc <bcfile>`: This option will read the provided file `<bcfile>` and treat it as an explicitly-provided list of true barcodes.  Barcodes appearing in this list will be considered true, and barcodes will be corrected to this list.
-  * `--expect-cells <ncells>`: Not currently implemnted.
-
-
+## Documentation 
 
+Alevin-fry is under active development.  However, you can find the documentation on [read the docs](https://alevin-fry.readthedocs.io/en/latest/).  We try to keep the documentation up to date with the latest developments in the software.
diff --git a/docs/source/collate.rst b/docs/source/collate.rst
@@ -14,3 +14,9 @@ dictate how the collation and filtering will be performed.
 * ``-e, --expected-ori <expected-ori>`` : The expected orientation of valid alignments.  Many single-cell protocols generated a strand-aware library, allowing increased precision by filtering out alignments that do not occur in the prescribed orientation.  This flag will filter out alignments that do not match the provided orientation.  The options are 'fw' (filters out alignments to the reverse complement strand), 'rc' (filter out alignments to the forward strand) and 'both' or 'either' (do not filter any alignments).
 
 * ``-m, --max-records <max-records>`` : The maximum number of read records to keep in memory at once during collation. The ``collate`` command will pass over the input RAD file multiple times collecting the records associated with a set of (corrected) cellular barcodes so that they can be written out in collated format to the output RAD file.  This parameter determines (approximately) how many records will be held in memory at once, and therefore determines the memory usage of the ``collate`` command.  The larger the value used the faster the collation process will be, since fewer passes are made.  The smaller this value, the lower the memory usage will be, at the cost of more passes.  The default value is 10,000,000.  Note that this determines the number of records *approximately*, because a specific barcode will never be split across multiple collation passes.  The algorithm employed is to collect the reads associated with different cellular barcodes in the current pass until the number of reads to be collected *first exceeds* this value.
+
+output
+------
+
+The ``collate`` command only has a single output.  It will write a file name
+``map.collated.rad`` in the directory specified by ``-i``.
diff --git a/docs/source/generate_permit_list.rst b/docs/source/generate_permit_list.rst
@@ -32,8 +32,8 @@ exclusive options (which determines how the "true" barcodes are decided):
 
 * ``--expect-cells <ncells>``: Not currently implemented.
 
-Outputs
--------
+output
+------
 
 The ``generate-permit-list`` command outputs a number of different files in the output directory.  Not all files are 
 relevant to users of ``alevin-fry``, but the files are described here.

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -21,7 +21,8 @@ It makes use of the selective-alignment and barcode processing framework of
    building
    getting_started
    generate_permit_list  
-   collate 
+   collate
+   quant
    LICENSE.rst
 
 Indices and tables

diff --git a/docs/source/quant.rst b/docs/source/quant.rst
@@ -0,0 +1,42 @@
+quant
+=====
+
+The ``quant`` command takes a collated RAD file and performs feature (e.g. gene) quantification, outputting
+a sparse matrix of de-duplicated counts as well as a list of labels for the rows and columns.  The ``quant``
+command takes an input directory containing the collated RAD file, a transcript-to-gene map, an output directory
+where the results will be written and a "resolution strategy" (described below).  Quantification is 
+multi-threaded, so it also, optionally, takes as an arguments the number of threads to use concurrently.
+
+The transcript-to-gene map should be a two-column (headerless) tab separated file where the first column 
+contains a transcript name and the second column contains the corresponding gene name for this transcript.
+
+The ``quant`` command exposes a number of different resolution strategies.  They are:
+
+* ``full`` : This is the default resolution strategy.  It implements the algorithm described in the alevin_ paper.  Briefly, it builds a graph among the set of reads that align to an overlapping set of transcripts and that have similar (within an edit distance of 1) UMIs.  It then attempts to find a parsimonious cover for this graph using the fewest number of possible transcripts.  If a unique parsimonious cover is found, then the (deduplicated) reads are assigned directly to the genes that yield the most parsimonious cover. If multiple equally-parsimonious covers exist, then the reads are considered multi-mapping at the gene level and they are probabilistically resolved using an expectation maximization (EM) algorithm. 
+
+* ``parsimony`` : This strategy is the same as "full", except that it does *not* probabilistically resolve reads that remain as gene-multimapping after applying the parsimony criterion.  Instead, reads that do not have a unique most-parsimonious assignment are discarded. 
+
+* ``trivial`` : This strategy does not search for 1 edit-distance neighbors of UMIs.  Instead, it first discards any reads that multi-map at the gene level.  The reads that remain then all map uniquely to a single gene.  These reads are deduplicated by (exact) UMI, and the number of distinct UMIs mapping to each gene are taken as that gene's count in the current cell.
+
+* ``cr-like`` : This strategy is like the one adopted in cell-ranger, except that it does not first collapse 1-edit-distance UMIs.  Within each cell barcode, a list of (gene, UMI, count) tuples is created. If a read maps to more than one gene, then it generates more than one such tuple.  The tuples are then sorted lexicographically (first by gene id, then by UMI, and then by count).  Any UMI that aligns to only a single gene is assigned to that gene.  UMIs that align to more than one gene are assigned to the gene with the highest count for this UMI.  If there is a tie for the highest count gene for this UMI, then the corresponding reads are simply discarded.
+
+output
+------
+
+The output of the ``quant`` command consists of 3 files: ``barcodes.txt``,
+``counts.mtx.gz`` and ``gene_names.txt``. The ``counts.mtx.gz`` is a matrix market
+coordinate format file where the number of *rows* is equal to the number of
+genes and the number of columns is equal to the number of *cells*. The header
+line encodes the number of rows, columns and non-zero entries. The subsequent
+lines (1-based indexing) encode the locations and values of the non-zero
+entries.  This entire ``.mtx`` format file is gzipped during output to minimize
+disk space. The two other files provide the labels for the rows and columns of
+this matrix. The ``gene_names.txt`` file is a text file that contains the
+names of the rows of the matrix, in the order in which it is written, with
+one gene name written per line. The ``barcodes.txt`` file is a text file that
+contains the names of the columns of the matrix, in the order in which it is
+written, with one barcode name written per line.
+
+
+
+.. _alevin: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y
diff --git a/libradicl/Cargo.toml b/libradicl/Cargo.toml
@@ -18,10 +18,11 @@ indicatif = "0.15.0"
 crossbeam-channel = "0.4.3"
 petgraph = "0.5.1"
 executors = "0.7"
-sprs = "0.7.1"
 bio-types = "0.6.0"
 quickersort = "3.0.1"
 needletail = "0.4"
 num-format = "0.4.0"
 statrs = "0.13.0"
-rand = "0.7.3"
+rand = "0.7.3"
+sprs = { git = "https://github.com/k3yavi/sprs" } #"0.8"
+sce = { git = "https://github.com/k3yavi/SingleCellExperiment" }
diff --git a/libradicl/src/cellfilter.rs b/libradicl/src/cellfilter.rs
@@ -11,13 +11,13 @@ use crate as libradicl;
 use fasthash::sea::Hash64;
 use fasthash::RandomState;
 use libradicl::exit_codes;
-use num_format::{Locale, ToFormattedString};
 use needletail::bitkmer::*;
-use std::str::from_utf8;
+use num_format::{Locale, ToFormattedString};
 use std::collections::HashMap;
 use std::fs::File;
 use std::io::BufReader;
 use std::io::{BufWriter, Write};
+use std::str::from_utf8;
 
 pub enum CellFilterMethod {
     // cut off at this cell in
@@ -315,15 +315,27 @@ pub fn generate_permit_list(
 
     for (k, v) in permitted_map {
         let bc_mer: BitKmer = (k, ft_vals.bclen as u8);
-        writeln!(&mut writer, "{}\t{}", from_utf8(&bitmer_to_bytes(bc_mer)[..]).unwrap(), v).expect("couldn't write to output file.");
+        writeln!(
+            &mut writer,
+            "{}\t{}",
+            from_utf8(&bitmer_to_bytes(bc_mer)[..]).unwrap(),
+            v
+        )
+        .expect("couldn't write to output file.");
     }
 
     let o_path = parent.join("all_freq.tsv");
     let output = std::fs::File::create(&o_path).expect("could not create output.");
     let mut writer = BufWriter::new(&output);
     for (k, v) in hm {
         let bc_mer: BitKmer = (k, ft_vals.bclen as u8);
-        writeln!(&mut writer, "{}\t{}", from_utf8(&bitmer_to_bytes(bc_mer)[..]).unwrap(), v).expect("couldn't write to output file.");
+        writeln!(
+            &mut writer,
+            "{}\t{}",
+            from_utf8(&bitmer_to_bytes(bc_mer)[..]).unwrap(),
+            v
+        )
+        .expect("couldn't write to output file.");
     }
 
     let s_path = parent.join("permit_map.bin");

diff --git a/libradicl/src/quant.rs b/libradicl/src/quant.rs
@@ -409,7 +409,8 @@ pub fn quantify(
     // TODO: guess capacity better
     // TODO: in the future, we may not want to hold the
     // entire triplet matrix in memory at once?
-    let mut omat = TriMatI::<f32, u32>::new((num_genes, hdr.num_chunks as usize));
+    // @k3yavi: Changing this to usize for testing
+    let mut omat = TriMatI::<f32, usize>::new((num_genes, hdr.num_chunks as usize));
 
     let output_path = std::path::Path::new(&output_dir);
     fs::create_dir_all(output_path)?;
@@ -437,8 +438,11 @@ pub fn quantify(
         c += 1;
     });
 
-    let mat_path = output_path.join("counts.mtx");
-    sprs::io::write_matrix_market(&mat_path, &omat)?;
+    //let mat_path = output_path.join("counts.mtx");
+    //sprs::io::write_matrix_market(&mat_path, &omat)?;
+
+    let mat_path = output_path.join("counts.eds.gz");
+    sce::eds::writer(&mat_path, &omat.to_csr())?;
 
     let gn_path = output_path.join("gene_names.txt");
     let gn_file = File::create(gn_path).expect("couldn't create gene name file.");

diff --git a/src/main.rs b/src/main.rs
@@ -11,7 +11,7 @@ extern crate slog;
 extern crate slog_term;
 
 use bio_types::strand::Strand;
-use clap::{App, Arg};
+use clap::{crate_authors, crate_version, App, Arg};
 use libradicl::cellfilter::{generate_permit_list, CellFilterMethod};
 use libradicl::schema::ResolutionStrategy;
 use mimalloc::MiMalloc;
@@ -22,8 +22,8 @@ use std::unimplemented;
 #[global_allocator]
 static GLOBAL: MiMalloc = MiMalloc;
 
-static VERSION: &str = "0.0.1";
-static AUTHORS: &str = "Avi Srivastava, Rob Patro";
+// static VERSION: &str = "0.0.1";
+// static AUTHORS: &str = "Avi Srivastava, Rob Patro";
 
 #[allow(dead_code)]
 fn gen_random_kmer(k: usize) -> String {
@@ -40,14 +40,15 @@ fn gen_random_kmer(k: usize) -> String {
 
 fn main() {
     let max_num_threads: String = (num_cpus::get() as u32).to_string();
-
+    let crate_authors = crate_authors!("\n");
+    let version = crate_version!();
     // [] add command for just counting barcode frequency
     // [] add other algorithms for determining barcode cutoff
 
     let gen_app = App::new("generate-permit-list")
         .about("Generate a permit list of barcodes from a RAD file")
-        .version(VERSION)
-        .author(AUTHORS)
+        .version(version)
+        .author(crate_authors)
         .arg(Arg::from("-i, --input=<input>  'input RAD file'"))
         .arg(Arg::from(
             "-o, --output-dir=<output-dir>  'output directory'",
@@ -73,8 +74,8 @@ fn main() {
 
     let collate_app = App::new("collate")
     .about("Collate a RAD file by corrected cell barcode")
-    .version(VERSION)
-    .author(AUTHORS)
+    .version(version)
+    .author(crate_authors)
     .arg(Arg::from("-i, --input-dir=<input-dir> 'input directory made by generate-permit-list'"))
     .arg(Arg::from("-r, --rad-file=<rad-file> 'the RAD file to be collated'"))
     .arg(Arg::from("-m, --max-records=[max-records] 'the maximum number of read records to keep in memory at once'")
@@ -84,8 +85,8 @@ fn main() {
 
     let quant_app = App::new("quant")
     .about("Quantify expression from a collated RAD file")
-    .version(VERSION)
-    .author(AUTHORS)
+    .version(version)
+    .author(crate_authors)
     .arg(Arg::from("-i, --input-dir=<input-dir>  'input directory containing collated RAD file'"))
     .arg(Arg::from("-m, --tg-map=<tg-map>  'transcript to gene map'"))
     .arg(Arg::from("-o, --output-dir=<output-dir> 'output directory where quantification results will be written'"))
@@ -97,8 +98,8 @@ fn main() {
         .about("the resolution strategy by which molecules will be counted"));
 
     let opts = App::new("alevin-fry")
-        .version(VERSION)
-        .author(AUTHORS)
+        .version(version)
+        .author(crate_authors)
         .about("Process RAD files from the command line")
         .subcommand(gen_app)
         .subcommand(collate_app)