Merged
17 commits
81d4804
Counter now offers basic support for counting SAM files which were no…
AlexTate Apr 7, 2022
e8d071f
The SAM_reader class has been adapted to handle fastx_collapser QNAME…
AlexTate Apr 8, 2022
0b1dd08
Merge branch 'issue-187' into issue-186
AlexTate Jul 26, 2022
e930584
Minor bugs related to the recent upgrade from Python 3.7 to 3.9 have …
AlexTate Jul 26, 2022
496ed8b
Mild refactor to clean up the method signatures of run_cwltool_subpro…
AlexTate Jul 27, 2022
c19e775
Adding support for the new tmp_directory parameter
AlexTate Jul 27, 2022
c7bcb64
Minor addition to ResumeBase for compatibility with the refactor in e…
AlexTate Jul 27, 2022
219113b
tmp_directory key and placeholder key added to paths.yml and run_conf…
AlexTate Jul 27, 2022
edbeb70
Light refactoring to help clean up SAM_reader
AlexTate Jul 30, 2022
cc32895
Further refinements to SAM_reader. Mostly just experimenting with mov…
AlexTate Jul 31, 2022
fff006c
Added an additional test to check that a read count of 1 is reported …
AlexTate Jul 31, 2022
df56f06
Irrelevant information was accidentally included in a prior commit
AlexTate Jul 31, 2022
e304c19
Detail polish
AlexTate Aug 4, 2022
be435d7
Merge branch 'master' into issue-186
AlexTate Aug 4, 2022
6e7c020
Detail polish after resolving merge conflicts
AlexTate Aug 4, 2022
1c600b7
Documentation update to indicate that third party and non-collapsed S…
AlexTate Aug 9, 2022
4993b88
Expanded on the "non-collapsed SAM file caveats" in tiny-count.md. Th…
AlexTate Aug 11, 2022
3 changes: 3 additions & 0 deletions START_HERE/paths.yml
@@ -17,6 +17,9 @@ features_csv: ./features.csv
##-- The final output directory for files produced by the pipeline --#
run_directory: run_directory

##-- The directory for temporary files. Determined by cwltool if blank. --##
tmp_directory:

######-------------------------------- BOWTIE-BUILD ---------------------------------######
#
# To build bowtie indexes:
1 change: 1 addition & 0 deletions START_HERE/run_config.yml
@@ -328,6 +328,7 @@ dir_name_plotter: plots
######-------------------------------------------------------------------------------######

run_directory: ~
tmp_directory: ~
features_csv: { }
samples_csv: { }
reference_genome_files: [ ]
17 changes: 9 additions & 8 deletions doc/Parameters.md
@@ -75,14 +75,14 @@ By default, tiny-count will only evaluate alignments to features which match a `
|----------------------------|---------------------------|
| counter-normalize-by-hits: | `--normalize-by-hits T/F` |

By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step.
By default, tiny-count will divide the number of counts associated with each sequence, twice, before they are assigned to a feature. Each unique sequence's count is determined by tiny-collapse (or a compatible collapsing utility) and is preserved through the alignment process. The original count is divided first by the number of loci that the sequence aligns to, and second by the number of features passing selection at each locus. Switching this option "off" disables the latter normalization step.
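The two normalization steps described above can be sketched as follows. This is an illustrative sketch only, not tiny-count's actual implementation; the function and parameter names are hypothetical.

```python
def normalized_count(collapsed_count: float, num_loci: int, num_selected: int,
                     normalize_by_hits: bool = True) -> float:
    """Divide a sequence's collapsed count across its loci and selected features."""
    count = collapsed_count / num_loci       # step 1: split across alignment loci
    if normalize_by_hits and num_selected > 0:
        count /= num_selected                # step 2: split across features passing selection
    return count

# A sequence counted 12 times, aligning to 2 loci, with 3 features selected at one locus:
print(normalized_count(12, 2, 3))                            # 2.0
print(normalized_count(12, 2, 3, normalize_by_hits=False))   # 6.0
```

With `--normalize-by-hits F`, only the per-locus division is applied, so each selected feature at the locus receives the full per-locus share.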

### Decollapse
| Run Config Key | Commandline Argument |
|---------------------|------------------------|
| counter_decollapse: | `--decollapse` |

The SAM files produced by the tinyRNA pipeline are collapsed by default; alignments sharing a SEQ field are strictly multi-alignments and do not reflect original sequence counts. If this option is switched "on", tiny-count will produce a decollapsed copy of each input SAM file. Each alignment in the decollapsed SAM will be duplicated by the sequence's original count. This is useful for browsing in IGV.
The SAM files produced by the tinyRNA pipeline are collapsed by default; alignments sharing a SEQ field are strictly multi-alignments and do not reflect original sequence counts. If this option is switched "on", tiny-count will produce a decollapsed copy of each input SAM file. Each alignment in the decollapsed SAM will be duplicated by the sequence's original count. This is useful for browsing in IGV. If non-collapsed inputs are provided to tiny-count in standalone mode, this option will be ignored.
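Decollapsing amounts to reading the original count out of each collapsed QNAME (tiny-collapse QNAMEs carry a `count=` suffix, e.g. `0_count=5`) and duplicating the alignment line that many times. The following is a minimal sketch of the idea, not tiny-count's code:

```python
def decollapse_line(sam_line: str) -> list:
    """Duplicate a collapsed SAM alignment line by its original read count."""
    qname = sam_line.split('\t', 1)[0]
    # Non-collapsed QNAMEs carry no count suffix and default to 1
    count = int(qname.rsplit('count=', 1)[1]) if 'count=' in qname else 1
    return [sam_line] * count

aln = "0_count=5\t16\tI\t15064570\t255\t21M\t*\t0\t0\tCAAGACAGAGCTTCACCGTTC\tIIIII"
print(len(decollapse_line(aln)))  # 5
```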

### Filters
| Run Config Key | Commandline Argument |
@@ -108,19 +108,19 @@ Diagnostic information will include intermediate alignment files for each librar

### Full tiny-count Help String
```
tiny-count -i SAMPLES -c CONFIGFILE -o OUTPUTPREFIX [-h]
[-sf [SOURCE [SOURCE ...]]] [-tf [TYPE [TYPE ...]]]
[-nh T/F] [-dc] [-a] [-p] [-d]
tiny-count -i SAMPLES -f FEATURES -o OUTPUTPREFIX [-h]
[-sf [SOURCE ...]] [-tf [TYPE ...]] [-nh T/F] [-dc] [-a]
[-p] [-d]

This submodule assigns feature counts for SAM alignments using a Feature Sheet
ruleset. If you find that you are sourcing all of your input files from a
prior run, we recommend that you instead run `tiny recount` within that run's
directory.

Required arguments:
-i SAMPLES, --input-csv SAMPLES
-i SAMPLES, --samples-csv SAMPLES
your Samples Sheet
-c CONFIGFILE, --config CONFIGFILE
-f FEATURES, --features-csv FEATURES
your Features Sheet
-o OUTPUTPREFIX, --out-prefix OUTPUTPREFIX
output prefix to use for file names
@@ -137,7 +137,8 @@ Optional arguments:
If T/true, normalize counts by (selected) overlapping
feature counts. Default: true.
-dc, --decollapse Create a decollapsed copy of all SAM files listed in
your Samples Sheet.
your Samples Sheet. This option is ignored for non-
collapsed inputs.
-a, --all-features Represent all features in output counts table, even if
they did not match a Select for / with value.
-p, --is-pipeline Indicates that tiny-count was invoked as part of a
4 changes: 3 additions & 1 deletion doc/tiny-count.md
@@ -7,10 +7,12 @@ For an explanation of tiny-count's parameters in the Run Config and by commandli
tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. However, the earlier pipeline steps are resource and time intensive, so it is inconvenient to rerun an end-to-end analysis to test new selection rules. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites.

## Running as a Standalone Tool
If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command requires that you specify the paths to your Samples Sheet and Features Sheet, and a filename prefix for outputs. [All other arguments are optional](Parameters.md#full-tiny-count-help-string). You will need to make a copy of your Samples Sheet and modify it so that the `Input FASTQ Files` column instead contains paths to the corresponding SAM files from a prior end-to-end run. SAM files from non-tinyRNA sources are not currently supported as the tiny-collapse step is critical to its function.
If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command requires that you specify the paths to your Samples Sheet and Features Sheet, and a filename prefix for outputs. [All other arguments are optional](Parameters.md#full-tiny-count-help-string). You will need to make a copy of your Samples Sheet and modify it so that the `Input FASTQ Files` column instead contains paths to the corresponding SAM files from a prior end-to-end run. SAM files from third party sources are also supported, and can be produced from reads collapsed by tiny-collapse or fastx, or from non-collapsed reads.

>**Important:** reusing the same output filename prefix between standalone runs will result in prior outputs being overwritten.

#### Using Non-collapsed Sequence Alignments
While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1).
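A reader can distinguish collapsed from non-collapsed inputs by the shape of the QNAME field. The sketch below is a hedged illustration of that idea, not SAM_reader's implementation: the tiny-collapse pattern matches the `0_count=5` style used in the tests above, while the fastx pattern (`<index>-<count>`) is an assumption about fastx_collapser output.

```python
import re

# Hypothetical QNAME patterns for each supported collapser
TINY_COLLAPSE = re.compile(r"^\d+_count=\d+$")
FASTX = re.compile(r"^\d+-\d+$")  # assumed fastx_collapser format

def collapser_type(qname: str):
    """Classify a QNAME by its collapser of origin, or None if non-collapsed."""
    if TINY_COLLAPSE.match(qname):
        return "tiny-collapse"
    if FASTX.match(qname):
        return "fastx"
    return None  # non-collapsed: each multi-alignment bundle carries a read count of 1

print(collapser_type("0_count=5"))            # tiny-collapse
print(collapser_type("NON_COLLAPSED_QNAME"))  # None
```

When the first QNAME matches neither pattern, decollapsed outputs are skipped, matching the warning asserted in the tests below.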


# Feature Selection
2 changes: 2 additions & 0 deletions tests/testdata/counter/non-collapsed.sam
@@ -0,0 +1,2 @@
@SQ SN:I LN:21
NON_COLLAPSED_QNAME 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2
57 changes: 48 additions & 9 deletions tests/unit_tests_hts_parsing.py
@@ -1,9 +1,11 @@
import collections
import contextlib
import unittest
from random import randint

import HTSeq
import io

from copy import deepcopy
from random import randint
from unittest.mock import patch, mock_open, call

from tiny.rna.counter.features import FeatureSelector
@@ -62,14 +64,14 @@ def exhaust_iterator(self, it):
"""Did SAM_reader correctly skip header values and parse all pertinent info from a single record SAM file?"""

def test_sam_reader(self):
sam_bundle = next(SAM_reader().bundle_multi_alignments(self.short_sam_file))
sam_bundle, read_count = next(SAM_reader().bundle_multi_alignments(self.short_sam_file))
sam_record = sam_bundle[0]

self.assertEqual(sam_record['chrom'], "I")
self.assertEqual(sam_record['start'], 15064569)
self.assertEqual(sam_record['end'], 15064590)
self.assertEqual(sam_record['strand'], '-')
self.assertEqual(sam_record['name'], "0_count=5")
self.assertEqual(sam_record['name'], b"0_count=5")
self.assertEqual(sam_record['seq'], b"CAAGACAGAGCTTCACCGTTC")
self.assertEqual(sam_record['len'], 21)
self.assertEqual(sam_record['nt5'], 'G')
@@ -87,13 +89,13 @@ def test_sam_parser_comparison(self):
ours = SAM_reader().bundle_multi_alignments(file)
theirs = HTSeq.bundle_multiple_alignments(HTSeq.BAM_Reader(file))

for our_bundle, their_bundle in zip(ours, theirs):
for (our_bundle, _), their_bundle in zip(ours, theirs):
self.assertEqual(len(our_bundle), len(their_bundle))
for our, their in zip(our_bundle, their_bundle):
self.assertEqual(our['chrom'], their.iv.chrom)
self.assertEqual(our['start'], their.iv.start)
self.assertEqual(our['end'], their.iv.end)
self.assertEqual(our['name'], their.read.name)
self.assertEqual(our['name'].decode(), their.read.name)
self.assertEqual(our['nt5'], chr(their.read.seq[0])) # See note above
self.assertEqual(our['strand'], their.iv.strand)
if our['strand'] == '-': # See note above
@@ -542,15 +544,15 @@ def test_SAM_reader_get_decollapsed_filename(self):

self.assertEqual(sam_out, "sam_file_decollapsed.sam")

"""Does SAM_reader._read_thru_header() correctly identify header lines and write them to the decollapsed file?"""
"""Does SAM_reader._read_to_first_aln() correctly identify header lines and write them to the decollapsed file?"""

def test_SAM_reader_read_thru_header(self):
reader = SAM_reader(decollapse=True)
reader._decollapsed_filename = "mock_outfile_name.sam"

with open(self.short_sam_file, 'rb') as sam_in:
with patch('builtins.open', mock_open()) as sam_out:
line = reader._read_thru_header(sam_in)
line = reader._read_to_first_aln(sam_in)

expected_writelines = [
call('mock_outfile_name.sam', 'w'),
@@ -560,12 +562,13 @@ def test_SAM_reader_read_thru_header(self):
]

sam_out.assert_has_calls(expected_writelines)
self.assertTrue(len(reader._headers) == 1)
self.assertTrue(len(reader._header_lines) == 1)

"""Does SAM_reader._write_decollapsed_sam() write the correct number of duplicates to the decollapsed file?"""

def test_SAM_reader_write_decollapsed_sam(self):
reader = SAM_reader(decollapse=True)
reader.collapser_type = "tiny-collapse"
reader._decollapsed_reads = [(b"0_count=5", b"mock line from SAM file")]
reader._decollapsed_filename = "mock_outfile_name.sam"

@@ -601,6 +604,42 @@ def test_SAM_reader_parse_alignments_decollapse(self):
self.exhaust_iterator(reader._parse_alignments(sam_in))
write_fn.assert_called_once()

"""Does SAM_reader report a single read count for non-collapsed SAM records?"""

def test_SAM_reader_single_readcount_non_collapsed_SAM(self):
# Read non-collapsed.sam but duplicate its single record twice
with open(f"{resources}/non-collapsed.sam", 'rb') as f:
sam_lines = f.readlines()
sam_lines.extend([sam_lines[1]] * 2)
mock_file = mock_open(read_data=b''.join(sam_lines))

with patch('tiny.rna.counter.hts_parsing.open', new=mock_file):
reader = SAM_reader()
bundle, read_count = next(reader.bundle_multi_alignments('mock_file'))

self.assertEqual(bundle[0]['name'], b'NON_COLLAPSED_QNAME')
self.assertEqual(len(bundle), 3)
self.assertEqual(read_count, 1)

"""Are decollapsed outputs skipped when non-collapsed SAM files are supplied?"""

def test_SAM_reader_no_decollapse_non_collapsed_SAM_files(self):
stdout_capture = io.StringIO()
with patch.object(SAM_reader, "_write_decollapsed_sam") as write_sam, \
patch.object(SAM_reader, "_write_header_for_decollapsed_sam") as write_header:

with contextlib.redirect_stderr(stdout_capture):
reader = SAM_reader(decollapse=True)
records = reader.bundle_multi_alignments(f"{resources}/non-collapsed.sam")
self.exhaust_iterator(records)

write_sam.assert_not_called()
write_header.assert_not_called()
self.assertEqual(reader.collapser_type, None)
self.assertEqual(stdout_capture.getvalue(),
"Alignments do not appear to be derived from a supported collapser input. "
"Decollapsed SAM files will therefore not be produced.\n")

"""Does CaseInsensitiveAttrs correctly store, check membership, and retrieve?"""

def test_CaseInsensitiveAttrs(self):
53 changes: 26 additions & 27 deletions tiny/entry.py
@@ -89,24 +89,16 @@ def run(tinyrna_cwl_path: str, config_file: str) -> None:
# First get the configuration file set up for this run
config_object = Configuration(config_file)
run_directory = config_object.create_run_directory()
cwl_conf_file = config_object.save_run_profile()
config_object.save_run_profile()

workflow = f"{tinyrna_cwl_path}/workflows/tinyrna_wf.cwl"
parallel = config_object['run_parallel']
loudness = config_object['verbosity']

if config_object['run_native']: # experimental
if config_object['run_native']:
# Execute the CWL runner via native Python
return_code = run_native(
config_object, workflow,
run_directory=run_directory,
parallel=parallel, verbosity=loudness)
return_code = run_cwltool_native(config_object, workflow, run_directory)
else:
# Use the cwltool CWL runner via command line
return_code = run_cwltool_subprocess(
cwl_conf_file, workflow,
run_directory=run_directory,
parallel=parallel, verbosity=loudness)
return_code = run_cwltool_subprocess(config_object, workflow, run_directory)

# If the workflow completed without errors, we want to update
# the Paths Sheet to point to the new bowtie index prefix
@@ -152,56 +144,61 @@ def resume(tinyrna_cwl_path: str, config_file: str, step: str) -> None:

if config['run_native']:
# We can pass our config object directly without writing to disk first
run_native(config, resume_wf, verbosity=config['verbosity'])
run_cwltool_native(config, resume_wf)
else:
# Processed Run Config must be written to disk first
resume_conf_file = "resume_" + os.path.basename(config_file)
resume_conf_file = config.get_outfile_path()
config.write_processed_config(resume_conf_file)
run_cwltool_subprocess(resume_conf_file, resume_wf, verbosity=config['verbosity'])
run_cwltool_subprocess(config, resume_wf)

if os.path.isfile(resume_wf):
# We don't want the generated workflow to be returned by a call to setup-cwl
os.remove(resume_wf)


def run_cwltool_subprocess(config_file: str, workflow: str, run_directory=None, parallel=False, verbosity="normal") -> int:
def run_cwltool_subprocess(config_object: 'ConfigBase', workflow: str, run_directory='.') -> int:
"""Executes the workflow using a command line invocation of cwltool

Args:
config_file: the processed configuration file produced by configuration.py
config_object: a constructed ConfigBase-derived object
workflow: the path to the workflow to be executed
run_directory: the destination folder for workflow output subdirectories (default: CWD)
parallel: process libraries in parallel where possible
verbosity: controls the depth of information written to terminal by cwltool

Returns: the exit code from the cwltool subprocess

"""

command = ['cwltool --timestamps --relax-path-checks --on-error continue']
processed_configfile = config_object.get_outfile_path()
tmpdir = config_object['tmp_directory']
parallel = config_object['run_parallel']
verbosity = config_object['verbosity']

command = [f'cwltool --timestamps --relax-path-checks --on-error continue --outdir {run_directory}']
if tmpdir is not None: command.append(f'--tmpdir-prefix {tmpdir}')
if verbosity == 'debug': command.append('--debug --js-console --leave-tmpdir')
if verbosity == 'quiet': command.append('--quiet')
if run_directory: command.append(f'--outdir {run_directory}')
if parallel: command.append('--parallel')

cwl_runner = ' '.join(command + [workflow, config_file])
cwl_runner = ' '.join(command + [workflow, processed_configfile])
return subprocess.run(cwl_runner, shell=True).returncode


def run_native(config_object: 'ConfigBase', workflow: str, run_directory: str = '.', parallel=False, verbosity="normal") -> int:
def run_cwltool_native(config_object: 'ConfigBase', workflow: str, run_directory: str = '.') -> int:
"""Executes the workflow using native Python rather than subprocess "command line"

Args:
config_object: a constructed ConfigBase-derived object
workflow: the path to the workflow to be executed
run_directory: the destination folder for workflow output subdirectories (default: CWD)
parallel: process libraries in parallel where possible
verbosity: controls the depth of information written to terminal by cwltool

Returns: the exit code from the workflow execution

"""

tmpdir = config_object['tmp_directory']
parallel = config_object['run_parallel']
verbosity = config_object['verbosity']

def furnish_if_file_record(file_dict):
if isinstance(file_dict, dict) and file_dict.get('class', None) == 'File':
file_dict['basename'] = os.path.basename(file_dict['path'])
@@ -225,8 +222,10 @@ def furnish_if_file_record(file_dict):
'debug': verbosity == "debug"
})

# Set proper temp directory for Mac users
if sys.platform == "darwin":
# Set temp directory for intermediate outputs
if tmpdir is not None:
runtime_context.tmp_outdir_prefix = tmpdir
elif sys.platform == "darwin":
default_mac_path = "/private/tmp/docker_tmp"
if runtime_context.tmp_outdir_prefix == DEFAULT_TMP_PREFIX:
runtime_context.tmp_outdir_prefix = default_mac_path
7 changes: 4 additions & 3 deletions tiny/rna/configuration.py
@@ -124,8 +124,9 @@ def create_run_directory(self) -> str:

return run_dir

def get_outfile_path(self, infile: str) -> str:
"""Prepend date+time to the Config File name. Change its path to reside in the Run Directory."""
def get_outfile_path(self, infile: str = None) -> str:
"""Returns the path and filename for the processed run config"""
if infile is None: infile = self.inf
return self.joinpath(self['run_directory'], os.path.basename(infile))

def write_processed_config(self, filename: str = None) -> str:
@@ -184,7 +185,7 @@ def to_cwl_file_class(input_file_path):
path_to_input = self.paths.from_here(input_file_path)
return self.cwl_file(path_to_input)

for absorb_key in ['ebwt', 'plot_style_sheet', 'adapter_fasta']:
for absorb_key in ['ebwt', 'plot_style_sheet', 'adapter_fasta', 'tmp_directory']:
self[absorb_key] = self.paths[absorb_key]
self['run_directory'] = self.paths.from_here(self.paths['run_directory'])
