Enable background subtraction / file unzipping #118

christinehc · 2023-12-12T18:59:36Z

Description

This PR (1) enables automatic detection and unzipping of gzipped files as part of the main Snekmer workflow, and (2) overhauls how background files are integrated into Snekmer workflows. Background files can now be supplied to Snekmer, and the kmer profile of the background sequences is used to inform the probability of a kmer appearing in a given family vs. the background, which in turn affects downstream models. For (2), a parallel workflow processes the background files and sums the kmer profiles observed across the background for integration into the scoring and modeling steps. See the full changelog for details.
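As a rough illustration of point (1), gzipped inputs can be recognized by their leading magic bytes rather than by file extension. The sketch below is not the code in this PR; the helper names `is_gzipped` and `unzip_if_needed`, and the `.unzipped` fallback suffix, are assumptions made for the example.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def unzip_if_needed(path):
    """Decompress a gzipped file next to the original and return the usable path."""
    path = Path(path)
    if not is_gzipped(path):
        return path  # already plain text; nothing to do
    if path.suffix == ".gz":
        target = path.with_suffix("")  # e.g. fam.faa.gz -> fam.faa
    else:
        # Hypothetical naming for gzipped files that lack a .gz extension
        target = path.with_name(path.name + ".unzipped")
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Checking content instead of the filename also covers compressed files that were renamed without a `.gz` suffix.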

Issues

Fixes Background file issue #37
Fixes windows- zipped files not detected/unzipped yet #60

Full Changelog

fix: glob all files without exclusion of bg. fix bg file detection
- refactor: all files are streamed to input files, rather than just files without associated background files.
- refactor: background filenames (stripped of extensions) are no longer part of the input stream for rules.score, preventing odd errors
- refactor: updated associated files to pull all input files as desired.
- fix: redo file glob -- file globbing now proceeds through glob_wildcards to more cleanly grab input files
fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch) (potentially fixes windows- zipped files not detected/unzipped yet #60)
fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models.
fix,refactor: pipe background i/o, update filenames
- fix: snakemake now correctly builds DAG for background workflow, including file unzipping
- refactor: some files have been renamed for simplicity
- refactor: some instances of skm.io.load_npz have been replaced with np.load due to KeyError (perhaps due to numpy or pickle version?)
- refactor: rules.combine_background now uses kmer basis set for each family to reshape each background vector. should make files smaller and workflow more compact
feat: update kmer probability scoring for background subtract
- refactor: kmer probability scoring using background subtraction is now the default scoring method
- feat: snekmer.score.feature_class_probabilities now performs either background subtraction based scoring, family label based scoring, or a combination thereof depending on user input
chore: update config, tick version, and clean up files
- chore: new config parameter config['score']['method'] added for compatibility with additional new(!) scoring methods
- chore: uptick version from v1.1.1 -> v1.4.0
  - upticked +3 minor versions in anticipation of two pending PRs
- chore: remove no longer needed files
feat: enable kmer scoring via background subtraction (fixes Background file issue #37)
- feat: kmers can now be scored by a probability score that subtracts the kmers observed in a supplied background set, in a family set, or in both background and family combined
  - (note: some column headers have changed, which may affect downstream analysis (e.g. integration with Kmer association #115 , Functionmotifs #116))
- feat: to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
- feat,refactor: extensive changes have been made to snekmer.score to accommodate the new changes, including:
  - feat: snekmer.score.score now has 3 distinct formulae to compute probability scores according to the desired scoring method
  - feat: snekmer.score.feature_class_probabilities now also integrates the scoring method
  - refactor: extensive code cleanup to remove extraneous functionalities
- refactor: the main scoring rule itself has been significantly altered as follows:
  - refactor: all references to the old and not-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
  - refactor: extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
  - refactor: scoring method now integrated
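
To make the three scoring options in the changelog above more concrete, here is a minimal sketch. The formulae are assumptions for illustration only: the PR text does not spell out the actual equations in `snekmer.score.score`, and the function names, the `weight` parameter, and the method strings `"family"`, `"background"`, and `"combined"` are hypothetical stand-ins.

```python
def kmer_probabilities(presence_rows):
    """Fraction of sequences in which each kmer occurs (column-wise mean).

    presence_rows: one list of 0/1 flags per sequence, one flag per kmer.
    """
    n_seqs = len(presence_rows)
    return [sum(column) / n_seqs for column in zip(*presence_rows)]


def score_kmers(p_family, p_other=None, p_background=None,
                method="background", weight=1.0):
    """Assumed scoring: subtract other-family and/or background probabilities."""
    zeros = [0.0] * len(p_family)
    # Missing inputs are treated as all-zero profiles (assumption)
    p_other = zeros if p_other is None else p_other
    p_background = zeros if p_background is None else p_background

    if method == "family":
        return [f - o for f, o in zip(p_family, p_other)]
    if method == "background":
        return [f - weight * b for f, b in zip(p_family, p_background)]
    if method == "combined":
        return [f - o - weight * b
                for f, o, b in zip(p_family, p_other, p_background)]
    raise ValueError(f"unknown scoring method: {method!r}")


# Two family sequences and two background sequences over three kmers
p_fam = kmer_probabilities([[1, 1, 0], [1, 0, 0]])
p_bg = kmer_probabilities([[1, 0, 0], [1, 0, 1]])
scores = score_kmers(p_fam, p_background=p_bg, method="background")
```

In a workflow, the `method` argument would presumably be read from `config['score']['method']`; how the real code wires this up is not shown in the PR text.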

changelog:
- NOTE: the workflow is currently broken for Snakemake I/O reasons; these are interim changes.
- fix: redo file glob -- file globbing now proceeds through `glob_wildcards` to more cleanly grab input files
- fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch).
- fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models.
- These fixes work piece-by-piece locally but have not been fully tested and may not work ideally yet.

- Note: changes did NOT work, hence the "broken" tag.

changelog:
- snakemake now correctly builds DAG for background workflow, including file unzipping
- some files have been renamed for simplicity
- some instances of `skm.io.load_npz` have been replaced with `np.load` due to KeyError (perhaps due to numpy or pickle version?)
- `rules.combine_background` now uses kmer basis set for each family to reshape each background vector. should make files smaller and workflow more compact
- NOTE: WORKFLOW IS BROKEN AT `rules.score_with_background` due to file load / array shape issues that will be fixed in the next commit.
- addresses #37

changelog:
- kmer probability scoring using background subtraction is now the default scoring method
- `snekmer.score.feature_class_probabilities` now performs either background subtraction based scoring, family label based scoring, or a combination thereof depending on user input
- TODO: integration with `snekmer.score.KmerScorer` object

christinehc · 2023-12-12T21:20:43Z

TODO:

Add background file structure to documentation
Update and/or add to the scoring documentation to reflect the new background and combined options
Merge in changes from Kmer association #115 / Functionmotifs #116 and resolve any incompatibilities
- Should also fix the unsuccessful CI issue which has already been resolved in the respective PRs

changelog:
- fix: `snekmer.utils.get_family` now accepts `regex=None` by default so as not to erroneously truncate filenames.
- fix: small change to `snekmer.utils.get_family` to correctly identify directories.
- refactor: overhaul `snekmer.utils.split_file_ext` to split at the point of an .faa, .fa, .fna, or .fasta extension instead of assuming at most 2 potential extensions
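
The extension-splitting idea can be sketched as follows. This is a hypothetical stand-in, not the actual `snekmer.utils.split_file_ext`; the name `split_at_fasta_ext` and its return format are assumptions.

```python
import re

# Match the first FASTA-style extension plus anything after it (e.g. ".gz")
FASTA_EXT = re.compile(r"\.(fasta|faa|fna|fa)(\..+)?$", re.IGNORECASE)


def split_at_fasta_ext(filename):
    """Split a filename into (stem, extension) at its FASTA extension.

    Dots inside the stem are preserved, so "family.v2.faa.gz" is not
    truncated at the first or second dot.
    """
    match = FASTA_EXT.search(filename)
    if match is None:
        return filename, ""  # no recognized FASTA extension
    return filename[: match.start()], filename[match.start() + 1:]


stem, ext = split_at_fasta_ext("family.v2.faa.gz")
```

Anchoring the split on a known extension avoids guessing how many dot-separated suffixes a filename has.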

changelog: - file unzipping is now handled by top-level unzip code in each snakefile; thus, `process.smk` is outdated and has been deleted as it is no longer needed.

changelog: - file wildcard globbing previously proceeded through `glob.glob`, but had been updated in the model workflow to use snakemake's `glob_wildcards` utility. This method has the added benefit of preventing recursion errors with wildcard retrieval from gzipped files. The changes have now been applied to cluster and search workflows.

changelog:
- refactor: move `cluster_cluster.py` -> `cluster.py`
- refactor: move cluster report generation to separate script directive
- fix: change cluster mode file globbing to mirror model mode changes, i.e. uses snakemake `glob_wildcards` instead of python `glob.glob`. This should also fix unzipping issues and recursion errors related to unzipping.

changelog:
- fix: search file globbing updated to use snakemake's `glob_wildcards` rather than python's `glob.glob` in search mode. Should also resolve issues with file detection for files requiring unzipping and avoid recursion errors. Tested locally with a small subset of small families.
- style: applied snakefmt to `cluster.smk` and `search.smk`

changelog:
- feat: Snakemake `--resources` flag has been added to Snekmer CLI for all modes and tested locally.
- refactor: Wrapped all snakemake command line arguments into a dictionary, which is now passed to all snekmer subcommands. This removes the redundancy of specifying the same command line arguments every time a subcommand is called.
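
One way to picture the dictionary-based refactor is the `argparse` sketch below. It is only an illustration under assumptions: the `SHARED_ARGS` name, the particular flags and defaults, and the `snekmer-sketch` program name are hypothetical and do not reproduce the real Snekmer CLI.

```python
import argparse

# Hypothetical subset of shared options: flag -> add_argument keyword arguments
SHARED_ARGS = {
    "--configfile": {"default": "config.yaml", "help": "path to workflow config"},
    "--cores": {"type": int, "default": 1, "help": "number of cores to use"},
    "--resources": {
        "nargs": "*",
        "default": [],
        "metavar": "NAME=INT",
        "help": "resource limits to forward to Snakemake",
    },
}


def build_parser():
    """Attach the same shared options to every subcommand."""
    parser = argparse.ArgumentParser(prog="snekmer-sketch")
    subparsers = parser.add_subparsers(dest="mode", required=True)
    for mode in ("cluster", "model", "search"):
        sub = subparsers.add_parser(mode)
        for flag, kwargs in SHARED_ARGS.items():
            sub.add_argument(flag, **kwargs)
    return parser


args = build_parser().parse_args(["model", "--cores", "2", "--resources", "mem_mb=4000"])
```

Defining the options once means a new flag such as `--resources` only has to be added in one place to reach every mode.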

changelog:
- fix: resolve error with array shapes due to matrix dimensions (transpose matrix required)
- refactor: renamed variables to streamline code

changelog:
- basis harmonization now accounts for either 1D or 2D array cases
  - 1D arrays are explicitly handled to match expected shape parameters set by the assumption that input arrays are 2D
- `utils.check_n_seqs` now uses boolean input arg to handle gz files rather than inferring from filename
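
A minimal sketch of harmonizing kmer counts onto a target basis while accepting both 1D and 2D input is shown below. It assumes NumPy is available; the function name `harmonize_to_basis` and its signature are hypothetical, not the Snekmer API.

```python
import numpy as np


def harmonize_to_basis(counts, kmers, basis):
    """Re-index kmer counts onto a target basis.

    counts: 1D vector (one profile) or 2D matrix (one row per profile)
    kmers:  kmer labels for the columns of `counts`
    basis:  kmer labels for the columns of the output
    """
    counts = np.asarray(counts)
    was_1d = counts.ndim == 1
    counts = np.atleast_2d(counts)  # a 1D vector becomes a single-row matrix
    column_of = {kmer: i for i, kmer in enumerate(kmers)}
    out = np.zeros((counts.shape[0], len(basis)), dtype=counts.dtype)
    for j, kmer in enumerate(basis):
        if kmer in column_of:  # kmers absent from the input stay at zero
            out[:, j] = counts[:, column_of[kmer]]
    return out[0] if was_1d else out


vector = harmonize_to_basis([3, 1, 2], ["AA", "AC", "CA"], ["CA", "GG", "AA"])
matrix = harmonize_to_basis([[3, 1, 2], [0, 5, 1]], ["AA", "AC", "CA"], ["CA", "GG", "AA"])
```

Promoting 1D input with `np.atleast_2d` lets one code path serve both cases; the result is returned with the same dimensionality it came in with.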

changelog:
- Workflow now accounts for cases where no background files are included when either "combined" or "background" mode is selected. (TODO: raise warning in this case)
- Bypass UnicodeDecodeError for `utils.check_n_seqs`
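
The sequence-counting changes (an explicit boolean for gzipped input, plus tolerance of undecodable bytes) might look roughly like this. The function below is a hypothetical stand-in for `utils.check_n_seqs`, and using `errors="replace"` is one assumed way to sidestep `UnicodeDecodeError`, not necessarily what the PR does.

```python
import gzip


def count_fasta_records(path, gz=False):
    """Count FASTA records (lines starting with '>') in a plain or gzipped file.

    gz: explicit flag saying whether the file is gzip-compressed, so the
        decision does not depend on the filename.
    """
    opener = gzip.open if gz else open
    n_records = 0
    # errors="replace" keeps stray non-UTF-8 bytes from raising UnicodeDecodeError
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                n_records += 1
    return n_records
```

Passing the flag explicitly keeps the function correct for compressed files that lack a `.gz` extension.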

christinehc added 7 commits October 13, 2023 15:15

[!broken!] build: tweak workflow to attempt snakemake debug

be53b22

christinehc self-assigned this Dec 12, 2023

christinehc added the enhancement New feature or request label Dec 12, 2023

christinehc added 20 commits December 21, 2023 14:06

build: reinsert missing config parameter for cluster mode

4ee5d4a

fix: split filename by input extensions

1344d93

refactor: change run to script directive for kmer search

35e0b49

refactor: deprecate process.smk for file unzipping

2625da7

refactor: add process snakefile back to prevent import errors

85a009f

fix: use background weight config spec for scoring

6a8c042

refactor: replace utils.to_feature_matrix -> np.vstack

abaac01

refactor: change list comprehension to numpy operation

15421ad

refactor: substitute list operations to numpy for efficiency

5ed97ad

refactor: use more efficient list comprehension

270f1d9

refactor: purge columns from memory

43a8ccf

build: uptick version

5e39a7f

fix,refactor: resolve array shapes. streamline code

20023fd

refactor: remove extraneous variables and verbose code

ba6df73

refactor: move kmerize to scripts. replace lists with numpy arrays

710d2ba

christinehc added 8 commits April 16, 2024 14:26

chore: update branch with changes to main

679af3d

refactor: remove commented (outdated) code

c40e09f

[skip-ci] in-prog: adapt learn workflow to new filesystem globbing

5d3ffac

build: force new versions of hdbscan to bypass install errors

22dd1c5

[skip ci] docs: add error catching for missing bg files in c/bg mode

163239c

build: fix sklearn/hdbscan versions to account for TypeError bug

0de98d7
