Merged
17 commits
81d4804
Counter now offers basic support for counting SAM files which were no…
AlexTate Apr 7, 2022
e8d071f
The SAM_reader class has been adapted to handle fastx_collapser QNAME…
AlexTate Apr 8, 2022
0b1dd08
Merge branch 'issue-187' into issue-186
AlexTate Jul 26, 2022
e930584
Minor bugs related to the recent upgrade from Python 3.7 to 3.9 have …
AlexTate Jul 26, 2022
496ed8b
Mild refactor to clean up the method signatures of run_cwltool_subpro…
AlexTate Jul 27, 2022
c19e775
Adding support for the new tmp_directory parameter
AlexTate Jul 27, 2022
c7bcb64
Minor addition to ResumeBase for compatibility with the refactor in e…
AlexTate Jul 27, 2022
219113b
tmp_directory key and placeholder key added to paths.yml and run_conf…
AlexTate Jul 27, 2022
edbeb70
Light refactoring to help clean up SAM_reader
AlexTate Jul 30, 2022
cc32895
Further refinements to SAM_reader. Mostly just experimenting with mov…
AlexTate Jul 31, 2022
fff006c
Added an additional test to check that a read count of 1 is reported …
AlexTate Jul 31, 2022
df56f06
Irrelevant information was accidentally included in a prior commit
AlexTate Jul 31, 2022
e304c19
Detail polish
AlexTate Aug 4, 2022
be435d7
Merge branch 'master' into issue-186
AlexTate Aug 4, 2022
6e7c020
Detail polish after resolving merge conflicts
AlexTate Aug 4, 2022
1c600b7
Documentation update to indicate that third party and non-collapsed S…
AlexTate Aug 9, 2022
4993b88
Expanded on the "non-collapsed SAM file caveats" in tiny-count.md. Th…
AlexTate Aug 11, 2022
3 changes: 3 additions & 0 deletions START_HERE/paths.yml
@@ -17,6 +17,9 @@ features_csv: ./features.csv
##-- The final output directory for files produced by the pipeline --#
run_directory: run_directory

##-- The directory for temporary files. Determined by cwltool if blank. --##
tmp_directory:

######-------------------------------- BOWTIE-BUILD ---------------------------------######
#
# To build bowtie indexes:
1 change: 1 addition & 0 deletions START_HERE/run_config.yml
@@ -328,6 +328,7 @@ dir_name_plotter: plots
######-------------------------------------------------------------------------------######

run_directory: ~
tmp_directory: ~
features_csv: { }
samples_csv: { }
reference_genome_files: [ ]
17 changes: 9 additions & 8 deletions doc/Parameters.md
@@ -75,14 +75,14 @@ By default, tiny-count will only evaluate alignments to features which match a `
|----------------------------|---------------------------|
| counter-normalize-by-hits: | `--normalize-by-hits T/F` |

By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step.
By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step.
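The two normalization steps described above can be sketched as follows. This is an illustrative sketch only, not tiny-count's actual implementation; the function and parameter names are hypothetical.

```python
def normalized_count(collapsed_count: float, num_loci: int, num_selected: int,
                     normalize_by_hits: bool = True) -> float:
    """Divide a sequence's collapsed count across its loci and selected features."""
    count = collapsed_count / num_loci       # step 1: split across alignment loci
    if normalize_by_hits and num_selected > 0:
        count /= num_selected                # step 2: split across features passing selection
    return count

# A sequence counted 12 times, aligning to 2 loci, with 3 features selected at one locus:
print(normalized_count(12, 2, 3))                            # 2.0
print(normalized_count(12, 2, 3, normalize_by_hits=False))   # 6.0
```

With `--normalize-by-hits F`, only the per-locus division is applied, so each selected feature at the locus receives the full per-locus share.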

### Decollapse
| Run Config Key | Commandline Argument |
|---------------------|------------------------|
| counter_decollapse: | `--decollapse` |

The SAM files produced by the tinyRNA pipeline are collapsed by default; alignments sharing a SEQ field are strictly multi-alignments and do not reflect original sequence counts. If this option is switched "on", tiny-count will produce a decollapsed copy of each input SAM file. Each alignment in the decollapsed SAM will be duplicated by the sequence's original count. This is useful for browsing in IGV.
The SAM files produced by the tinyRNA pipeline are collapsed by default; alignments sharing a SEQ field are strictly multi-alignments and do not reflect original sequence counts. If this option is switched "on", tiny-count will produce a decollapsed copy of each input SAM file. Each alignment in the decollapsed SAM will be duplicated by the sequence's original count. This is useful for browsing in IGV. If non-collapsed inputs are provided to tiny-count in standalone mode, this option will be ignored.
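Decollapsing amounts to reading the original count out of each collapsed QNAME (tiny-collapse QNAMEs carry a `count=` suffix, e.g. `0_count=5`) and duplicating the alignment line that many times. The following is a minimal sketch of the idea, not tiny-count's code:

```python
def decollapse_line(sam_line: str) -> list:
    """Duplicate a collapsed SAM alignment line by its original read count."""
    qname = sam_line.split('\t', 1)[0]
    # Non-collapsed QNAMEs carry no count suffix and default to 1
    count = int(qname.rsplit('count=', 1)[1]) if 'count=' in qname else 1
    return [sam_line] * count

aln = "0_count=5\t16\tI\t15064570\t255\t21M\t*\t0\t0\tCAAGACAGAGCTTCACCGTTC\tIIIII"
print(len(decollapse_line(aln)))  # 5
```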

### Filters
| Run Config Key | Commandline Argument |
@@ -108,19 +108,19 @@ Diagnostic information will include intermediate alignment files for each librar

### Full tiny-count Help String
```
tiny-count -i SAMPLES -c CONFIGFILE -o OUTPUTPREFIX [-h]
[-sf [SOURCE [SOURCE ...]]] [-tf [TYPE [TYPE ...]]]
[-nh T/F] [-dc] [-a] [-p] [-d]
tiny-count -i SAMPLES -f FEATURES -o OUTPUTPREFIX [-h]
[-sf [SOURCE ...]] [-tf [TYPE ...]] [-nh T/F] [-dc] [-a]
[-p] [-d]

This submodule assigns feature counts for SAM alignments using a Feature Sheet
ruleset. If you find that you are sourcing all of your input files from a
prior run, we recommend that you instead run `tiny recount` within that run's
directory.

Required arguments:
-i SAMPLES, --input-csv SAMPLES
-i SAMPLES, --samples-csv SAMPLES
your Samples Sheet
-c CONFIGFILE, --config CONFIGFILE
-f FEATURES, --features-csv FEATURES
your Features Sheet
-o OUTPUTPREFIX, --out-prefix OUTPUTPREFIX
output prefix to use for file names
@@ -137,7 +137,8 @@ Optional arguments:
If T/true, normalize counts by (selected) overlapping
feature counts. Default: true.
-dc, --decollapse Create a decollapsed copy of all SAM files listed in
your Samples Sheet.
your Samples Sheet. This option is ignored for non-
collapsed inputs.
-a, --all-features Represent all features in output counts table, even if
they did not match a Select for / with value.
-p, --is-pipeline Indicates that tiny-count was invoked as part of a
4 changes: 3 additions & 1 deletion doc/tiny-count.md
@@ -7,10 +7,12 @@ For an explanation of tiny-count's parameters in the Run Config and by commandli
tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps are resource and time intensive, so it is inconvenient to rerun an end-to-end analysis to test new selection rules. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites.

## Running as a Standalone Tool
If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command requires that you specify the paths to your Samples Sheet and Features Sheet, and a filename prefix for outputs. [All other arguments are optional](Parameters.md#full-tiny-count-help-string). You will need to make a copy of your Samples Sheet and modify it so that the `Input FASTQ Files` column instead contains paths to the corresponding SAM files from a prior end-to-end run. SAM files from non-tinyRNA sources are not currently supported as the tiny-collapse step is critical to its function.
If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command requires that you specify the paths to your Samples Sheet and Features Sheet, and a filename prefix for outputs. [All other arguments are optional](Parameters.md#full-tiny-count-help-string). You will need to make a copy of your Samples Sheet and modify it so that the `Input FASTQ Files` column instead contains paths to the corresponding SAM files from a prior end-to-end run. SAM files from third party sources are also supported, and can be produced from reads collapsed by tiny-collapse or fastx, or from non-collapsed reads.

>**Important:** reusing the same output filename prefix between standalone runs will result in prior outputs being overwritten.

#### Using Non-collapsed Sequence Alignments
While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1).
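A reader can distinguish collapsed from non-collapsed inputs by the shape of the QNAME field. The sketch below is a hedged illustration of that idea, not SAM_reader's implementation: the tiny-collapse pattern matches the `0_count=5` style used in the tests above, while the fastx pattern (`<index>-<count>`) is an assumption about fastx_collapser output.

```python
import re

# Hypothetical QNAME patterns for each supported collapser
TINY_COLLAPSE = re.compile(r"^\d+_count=\d+$")
FASTX = re.compile(r"^\d+-\d+$")  # assumed fastx_collapser format

def collapser_type(qname: str):
    """Classify a QNAME by its collapser of origin, or None if non-collapsed."""
    if TINY_COLLAPSE.match(qname):
        return "tiny-collapse"
    if FASTX.match(qname):
        return "fastx"
    return None  # non-collapsed: each multi-alignment bundle carries a read count of 1

print(collapser_type("0_count=5"))            # tiny-collapse
print(collapser_type("NON_COLLAPSED_QNAME"))  # None
```

When the first QNAME matches neither pattern, decollapsed outputs are skipped, matching the warning asserted in the tests below.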


# Feature Selection
2 changes: 2 additions & 0 deletions tests/testdata/counter/non-collapsed.sam
@@ -0,0 +1,2 @@
@SQ SN:I LN:21
NON_COLLAPSED_QNAME 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2
57 changes: 48 additions & 9 deletions tests/unit_tests_hts_parsing.py
@@ -1,9 +1,11 @@
import collections
import contextlib
import unittest
from random import randint

import HTSeq
import io

from copy import deepcopy
from random import randint
from unittest.mock import patch, mock_open, call

from tiny.rna.counter.features import FeatureSelector
@@ -62,14 +64,14 @@ def exhaust_iterator(self, it):
"""Did SAM_reader correctly skip header values and parse all pertinent info from a single record SAM file?"""

def test_sam_reader(self):
sam_bundle = next(SAM_reader().bundle_multi_alignments(self.short_sam_file))
sam_bundle, read_count = next(SAM_reader().bundle_multi_alignments(self.short_sam_file))
sam_record = sam_bundle[0]

self.assertEqual(sam_record['chrom'], "I")
self.assertEqual(sam_record['start'], 15064569)
self.assertEqual(sam_record['end'], 15064590)
self.assertEqual(sam_record['strand'], '-')
self.assertEqual(sam_record['name'], "0_count=5")
self.assertEqual(sam_record['name'], b"0_count=5")
self.assertEqual(sam_record['seq'], b"CAAGACAGAGCTTCACCGTTC")
self.assertEqual(sam_record['len'], 21)
self.assertEqual(sam_record['nt5'], 'G')
@@ -87,13 +89,13 @@ def test_sam_parser_comparison(self):
ours = SAM_reader().bundle_multi_alignments(file)
theirs = HTSeq.bundle_multiple_alignments(HTSeq.BAM_Reader(file))

for our_bundle, their_bundle in zip(ours, theirs):
for (our_bundle, _), their_bundle in zip(ours, theirs):
self.assertEqual(len(our_bundle), len(their_bundle))
for our, their in zip(our_bundle, their_bundle):
self.assertEqual(our['chrom'], their.iv.chrom)
self.assertEqual(our['start'], their.iv.start)
self.assertEqual(our['end'], their.iv.end)
self.assertEqual(our['name'], their.read.name)
self.assertEqual(our['name'].decode(), their.read.name)
self.assertEqual(our['nt5'], chr(their.read.seq[0])) # See note above
self.assertEqual(our['strand'], their.iv.strand)
if our['strand'] == '-': # See note above
@@ -542,15 +544,15 @@ def test_SAM_reader_get_decollapsed_filename(self):

self.assertEqual(sam_out, "sam_file_decollapsed.sam")

"""Does SAM_reader._read_thru_header() correctly identify header lines and write them to the decollapsed file?"""
"""Does SAM_reader._read_to_first_aln() correctly identify header lines and write them to the decollapsed file?"""

def test_SAM_reader_read_thru_header(self):
reader = SAM_reader(decollapse=True)
reader._decollapsed_filename = "mock_outfile_name.sam"

with open(self.short_sam_file, 'rb') as sam_in:
with patch('builtins.open', mock_open()) as sam_out:
line = reader._read_thru_header(sam_in)
line = reader._read_to_first_aln(sam_in)

expected_writelines = [
call('mock_outfile_name.sam', 'w'),
@@ -560,12 +562,13 @@ def test_SAM_reader_read_thru_header(self):
]

sam_out.assert_has_calls(expected_writelines)
self.assertTrue(len(reader._headers) == 1)
self.assertTrue(len(reader._header_lines) == 1)

"""Does SAM_reader._write_decollapsed_sam() write the correct number of duplicates to the decollapsed file?"""

def test_SAM_reader_write_decollapsed_sam(self):
reader = SAM_reader(decollapse=True)
reader.collapser_type = "tiny-collapse"
reader._decollapsed_reads = [(b"0_count=5", b"mock line from SAM file")]
reader._decollapsed_filename = "mock_outfile_name.sam"

@@ -601,6 +604,42 @@ def test_SAM_reader_parse_alignments_decollapse(self):
self.exhaust_iterator(reader._parse_alignments(sam_in))
write_fn.assert_called_once()

"""Does SAM_reader report a single read count for non-collapsed SAM records?"""

def test_SAM_reader_single_readcount_non_collapsed_SAM(self):
# Read non-collapsed.sam but duplicate its single record twice
with open(f"{resources}/non-collapsed.sam", 'rb') as f:
sam_lines = f.readlines()
sam_lines.extend([sam_lines[1]] * 2)
mock_file = mock_open(read_data=b''.join(sam_lines))

with patch('tiny.rna.counter.hts_parsing.open', new=mock_file):
reader = SAM_reader()
bundle, read_count = next(reader.bundle_multi_alignments('mock_file'))

self.assertEqual(bundle[0]['name'], b'NON_COLLAPSED_QNAME')
self.assertEqual(len(bundle), 3)
self.assertEqual(read_count, 1)

"""Are decollapsed outputs skipped when non-collapsed SAM files are supplied?"""

def test_SAM_reader_no_decollapse_non_collapsed_SAM_files(self):
stdout_capture = io.StringIO()
with patch.object(SAM_reader, "_write_decollapsed_sam") as write_sam, \
patch.object(SAM_reader, "_write_header_for_decollapsed_sam") as write_header:

with contextlib.redirect_stderr(stdout_capture):
reader = SAM_reader(decollapse=True)
records = reader.bundle_multi_alignments(f"{resources}/non-collapsed.sam")
self.exhaust_iterator(records)

write_sam.assert_not_called()
write_header.assert_not_called()
self.assertEqual(reader.collapser_type, None)
self.assertEqual(stdout_capture.getvalue(),
"Alignments do not appear to be derived from a supported collapser input. "
"Decollapsed SAM files will therefore not be produced.\n")

"""Does CaseInsensitiveAttrs correctly store, check membership, and retrieve?"""

def test_CaseInsensitiveAttrs(self):
53 changes: 26 additions & 27 deletions tiny/entry.py
@@ -89,24 +89,16 @@ def run(tinyrna_cwl_path: str, config_file: str) -> None:
# First get the configuration file set up for this run
config_object = Configuration(config_file)
run_directory = config_object.create_run_directory()
cwl_conf_file = config_object.save_run_profile()
config_object.save_run_profile()

workflow = f"{tinyrna_cwl_path}/workflows/tinyrna_wf.cwl"
parallel = config_object['run_parallel']
loudness = config_object['verbosity']

if config_object['run_native']: # experimental
if config_object['run_native']:
# Execute the CWL runner via native Python
return_code = run_native(
config_object, workflow,
run_directory=run_directory,
parallel=parallel, verbosity=loudness)
return_code = run_cwltool_native(config_object, workflow, run_directory)
else:
# Use the cwltool CWL runner via command line
return_code = run_cwltool_subprocess(
cwl_conf_file, workflow,
run_directory=run_directory,
parallel=parallel, verbosity=loudness)
return_code = run_cwltool_subprocess(config_object, workflow, run_directory)

# If the workflow completed without errors, we want to update
# the Paths Sheet to point to the new bowtie index prefix
@@ -152,56 +144,61 @@ def resume(tinyrna_cwl_path: str, config_file: str, step: str) -> None:

if config['run_native']:
# We can pass our config object directly without writing to disk first
run_native(config, resume_wf, verbosity=config['verbosity'])
run_cwltool_native(config, resume_wf)
else:
# Processed Run Config must be written to disk first
resume_conf_file = "resume_" + os.path.basename(config_file)
resume_conf_file = config.get_outfile_path()
config.write_processed_config(resume_conf_file)
run_cwltool_subprocess(resume_conf_file, resume_wf, verbosity=config['verbosity'])
run_cwltool_subprocess(config, resume_wf)

if os.path.isfile(resume_wf):
# We don't want the generated workflow to be returned by a call to setup-cwl
os.remove(resume_wf)


def run_cwltool_subprocess(config_file: str, workflow: str, run_directory=None, parallel=False, verbosity="normal") -> int:
def run_cwltool_subprocess(config_object: 'ConfigBase', workflow: str, run_directory='.') -> int:
"""Executes the workflow using a command line invocation of cwltool

Args:
config_file: the processed configuration file produced by configuration.py
config_object: a constructed ConfigBase-derived object
workflow: the path to the workflow to be executed
run_directory: the destination folder for workflow output subdirectories (default: CWD)
parallel: process libraries in parallel where possible
verbosity: controls the depth of information written to terminal by cwltool

Returns: the exit code from the cwltool subprocess

"""

command = ['cwltool --timestamps --relax-path-checks --on-error continue']
processed_configfile = config_object.get_outfile_path()
tmpdir = config_object['tmp_directory']
parallel = config_object['run_parallel']
verbosity = config_object['verbosity']

command = [f'cwltool --timestamps --relax-path-checks --on-error continue --outdir {run_directory}']
if tmpdir is not None: command.append(f'--tmpdir-prefix {tmpdir}')
if verbosity == 'debug': command.append('--debug --js-console --leave-tmpdir')
if verbosity == 'quiet': command.append('--quiet')
if run_directory: command.append(f'--outdir {run_directory}')
if parallel: command.append('--parallel')

cwl_runner = ' '.join(command + [workflow, config_file])
cwl_runner = ' '.join(command + [workflow, processed_configfile])
return subprocess.run(cwl_runner, shell=True).returncode


def run_native(config_object: 'ConfigBase', workflow: str, run_directory: str = '.', parallel=False, verbosity="normal") -> int:
def run_cwltool_native(config_object: 'ConfigBase', workflow: str, run_directory: str = '.') -> int:
"""Executes the workflow using native Python rather than subprocess "command line"

Args:
config_object: a constructed ConfigBase-derived object
workflow: the path to the workflow to be executed
run_directory: the destination folder for workflow output subdirectories (default: CWD)
parallel: process libraries in parallel where possible
verbosity: controls the depth of information written to terminal by cwltool

Returns: the exit code from the workflow execution

"""

tmpdir = config_object['tmp_directory']
parallel = config_object['run_parallel']
verbosity = config_object['verbosity']

def furnish_if_file_record(file_dict):
if isinstance(file_dict, dict) and file_dict.get('class', None) == 'File':
file_dict['basename'] = os.path.basename(file_dict['path'])
@@ -225,8 +222,10 @@ def furnish_if_file_record(file_dict):
'debug': verbosity == "debug"
})

# Set proper temp directory for Mac users
if sys.platform == "darwin":
# Set temp directory for intermediate outputs
if tmpdir is not None:
runtime_context.tmp_outdir_prefix = tmpdir
elif sys.platform == "darwin":
default_mac_path = "/private/tmp/docker_tmp"
if runtime_context.tmp_outdir_prefix == DEFAULT_TMP_PREFIX:
runtime_context.tmp_outdir_prefix = default_mac_path
7 changes: 4 additions & 3 deletions tiny/rna/configuration.py
@@ -124,8 +124,9 @@ def create_run_directory(self) -> str:

return run_dir

def get_outfile_path(self, infile: str) -> str:
"""Prepend date+time to the Config File name. Change its path to reside in the Run Directory."""
def get_outfile_path(self, infile: str = None) -> str:
"""Returns the path and filename for the processed run config"""
if infile is None: infile = self.inf
return self.joinpath(self['run_directory'], os.path.basename(infile))

def write_processed_config(self, filename: str = None) -> str:
@@ -184,7 +185,7 @@ def to_cwl_file_class(input_file_path):
path_to_input = self.paths.from_here(input_file_path)
return self.cwl_file(path_to_input)

for absorb_key in ['ebwt', 'plot_style_sheet', 'adapter_fasta']:
for absorb_key in ['ebwt', 'plot_style_sheet', 'adapter_fasta', 'tmp_directory']:
self[absorb_key] = self.paths[absorb_key]
self['run_directory'] = self.paths.from_here(self.paths['run_directory'])
