Enable background subtraction / file unzipping #118

Open
wants to merge 35 commits into
base: main

christinehc
Collaborator

@christinehc christinehc commented Dec 12, 2023

Description

This PR (1) enables automatic detection and unzipping of gzipped files as part of the main Snekmer workflow, and (2) overhauls how background files are integrated into Snekmer workflows. Background files can now be supplied to Snekmer, and the kmer profile of the background sequences is used to inform the probability of a kmer appearing in a given family vs. the background, which in turn affects downstream models. For (2), a parallel workflow processes the background files and sums the kmer profiles observed across the background for integration into the scoring and modeling steps. See the full changelog for details.

Issues

Full Changelog

  • fix: glob all files without exclusion of bg. fix bg file detection
    • refactor: all files are streamed to input files, rather than just files without associated background files.
    • refactor: background filenames (stripped of extensions) are no longer part of the input stream for rules.score, preventing odd errors
    • refactor: updated associated files to pull all input files as desired.
    • fix: redo file glob -- file globbing now proceeds through glob_wildcards to more cleanly grab input files
  • fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch) (potentially fixes windows- zipped files not detected/unzipped yet #60)
  • fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models.
  • fix,refactor: pipe background i/o, update filenames
    • fix: snakemake now correctly builds DAG for background workflow, including file unzipping
    • refactor: some files have been renamed for simplicity
    • refactor: some instances of skm.io.load_npz have been replaced with np.load due to KeyError (perhaps due to numpy or pickle version?)
    • refactor: rules.combine_background now uses kmer basis set for each family to reshape each background vector. should make files smaller and workflow more compact
  • feat: update kmer probability scoring for background subtract
    • refactor: kmer probability scoring using background subtraction is now the default scoring method
    • feat: snekmer.score.feature_class_probabilities now performs either background subtraction based scoring, family label based scoring, or a combination thereof depending on user input
  • chore: update config, tick version, and clean up files
    • chore: new config parameter config['score']['method'] added for compatibility with additional new(!) scoring methods
    • chore: uptick version from v1.1.1 -> v1.4.0
      • upticked +3 minor versions in anticipation of two pending PRs
    • chore: remove no longer needed files
  • feat: enable kmer scoring via background subtraction (fixes Background file issue #37)
    • feat: kmers can now be scored by probability score subtracting the observed kmers in a supplied background set, family set, or combining both background and family
    • feat: to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
    • feat,refactor: extensive changes have been made to snekmer.score to accommodate the new changes, including:
      • feat: snekmer.score.score now has 3 distinct formulae to compute probability scores according to the desired scoring method
      • feat: snekmer.score.feature_class_probabilities now also integrates the scoring method
      • refactor: extensive code cleanup to remove extraneous functionalities
    • refactor: the main scoring rule itself has been significantly altered as follows:
      • refactor: all references to the old and not-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
      • refactor: extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
      • refactor: scoring method now integrated

changelog:
- All files are streamed to input files, rather than just files without associated background files.
- Background filenames (stripped of extensions) are no longer part of the input stream for `rules.score`, preventing odd errors
- Updated associated files to pull all input files as desired.
changelog:
- NOTE THAT WORKFLOW IS CURRENTLY BROKEN DUE TO SNAKEMAKE I/O REASONS AND I AM COMMITTING INTERIM CHANGES
- fix: redo file glob -- file globbing now proceeds through `glob_wildcards` to more cleanly grab input files
- fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch).
- fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models. These fixes work piece-by-piece locally but have not been fully tested and may not work ideally yet.
- Note: changes did NOT work, hence the "broken" tag.
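
For readers following the unzip changes: below is a minimal sketch of gzip detection and unzipping, not the code in this PR. It assumes detection by the two-byte gzip magic number rather than by file extension; the helper names `is_gzipped` and `unzip_if_needed`, and the `.out` suffix used when a gzipped file has no `.gz` extension, are hypothetical choices for illustration.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def unzip_if_needed(path, out_dir):
    """Decompress a gzipped file into out_dir (hypothetical helper).

    Files that are not gzipped are left alone and their path is returned.
    """
    path, out_dir = Path(path), Path(out_dir)
    if not is_gzipped(path):
        return path
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop a trailing ".gz"; otherwise add an arbitrary suffix (assumption)
    # so the output can never collide with the input name.
    if path.name.endswith(".gz"):
        name = path.name[:-3]
    else:
        name = f"{path.name}.out"
    target = out_dir / name
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files are not held in memory
    return target
```

Checking the content instead of the extension is one way to handle archives whose names do not end in `.gz`; whether the PR does this is not stated here.
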
changelog:
- snakemake now correctly builds DAG for background workflow, including file unzipping
- some files have been renamed for simplicity
- some instances of `skm.io.load_npz` have been replaced with `np.load` due to KeyError (perhaps due to numpy or pickle version?)
- `rules.combine_background` now uses kmer basis set for each family to reshape each background vector. should make files smaller and workflow more compact
- NOTE: WORKFLOW IS BROKEN AT `rules.score_with_background` due to file load / array shape issues that will be fixed in the next commit.
- addresses #37
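
On the `np.load` change: the cause of the KeyError is not identified above, so the following is only a sketch of how an `.npz` archive can be loaded with plain NumPy while checking its keys up front. The helper name `load_arrays` and the example keys (`kmerlist`, `scores`) are assumptions, not Snekmer's actual file layout.

```python
import numpy as np


def load_arrays(path, required=()):
    """Load an .npz archive into a plain dict, checking keys first.

    allow_pickle=True is needed for object arrays; only use it on
    files you trust, since unpickling can execute arbitrary code.
    """
    with np.load(path, allow_pickle=True) as archive:
        missing = [key for key in required if key not in archive.files]
        if missing:
            raise KeyError(
                f"{path} lacks {missing}; available keys: {archive.files}"
            )
        # Materialize every array before the archive is closed.
        return {key: archive[key] for key in archive.files}
```

Listing `archive.files` in the error message makes a key mismatch visible immediately, which may help when diagnosing errors like the one described.
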
changelog:
- kmer probability scoring using background subtraction is now the default scoring method
- `snekmer.score.feature_class_probabilities` now performs either background subtraction based scoring, family label based scoring, or a combination thereof depending on user input
- TODO: integration with `snekmer.score.KmerScorer` object
changelog:
- new config parameter config['score']['method'] added for compatibility with additional new(!) scoring methods
- uptick version from v1.1.1 -> v1.4.0
  - upticked +3 minor versions in anticipation of two pending PRs
- remove no longer needed files
changelog:
- kmers can now be scored by probability score subtracting the observed kmers in a supplied background set, family set, or combining both background and family
  - note: some column headers have changed, which may affect downstream analysis (e.g. integration with #115, #116)
- to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
- extensive changes have been made to `snekmer.score` to accommodate the new changes, including:
  - `snekmer.score.score` now has 3 distinct formulae to compute probability scores according to the desired scoring method
  - `snekmer.score.feature_class_probabilities` now also integrates the scoring method
- the main scoring rule itself has been significantly altered as follows:
  - all references to the old and not-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
  - extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
  - scoring method now integrated
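
To illustrate the idea of three scoring methods: the actual formulae in `snekmer.score.score` are not given in this PR text, so the sketch below is an assumption-laden stand-in. It assumes a kmer's probability is the fraction of sequences containing it, that the "background" method subtracts the background probability, that a "family" method subtracts the probability across other families, and that "combined" pools both sets. The function names and the method name "family" are hypothetical.

```python
import numpy as np

METHODS = ("background", "family", "combined")  # "family" is an assumed name


def kmer_probabilities(counts):
    """Fraction of sequences (rows) in which each kmer (column) appears."""
    counts = np.atleast_2d(np.asarray(counts))
    return (counts > 0).mean(axis=0)


def _require(name, value, method):
    if value is None:
        raise ValueError(f"method={method!r} needs `{name}` counts")
    return np.atleast_2d(np.asarray(value))


def score_kmers(family, others=None, background=None, method="background"):
    """Illustrative probability scores; NOT the formulae used by Snekmer."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    p_family = kmer_probabilities(family)
    if method == "background":
        reference = _require("background", background, method)
    elif method == "family":
        reference = _require("others", others, method)
    else:  # "combined": pool other-family and background sequences
        reference = np.vstack([
            _require("others", others, method),
            _require("background", background, method),
        ])
    return p_family - kmer_probabilities(reference)
```

Under these assumptions, a kmer that is common in the family but rare in the reference set scores near 1, and a kmer that is ubiquitous in the background scores at or below 0.
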
@christinehc
Collaborator Author

TODO:

  • Add background file structure to documentation
  • Update and/or add to documentation regarding scoring to reflect new changes with background and combined options
  • Merge in changes from Kmer association #115 / Functionmotifs #116 and resolve any incompatibilities
    • Should also fix the unsuccessful CI issue which has already been resolved in the respective PRs

@christinehc christinehc self-assigned this Dec 12, 2023
@christinehc christinehc added the enhancement New feature or request label Dec 12, 2023
changelog:
- fix: `snekmer.utils.get_family` now accepts `regex=None` by default so as not to erroneously truncate filenames.
- fix: small change to `snekmer.utils.get_family` to correctly identify directories.
- refactor: overhaul `snekmer.utils.split_file_ext` to split at the point of an .faa, .fa, .fna, or .fasta extension instead of assuming at most 2 potential extensions
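
A rough sketch of the extension-splitting behavior described above, assuming the split happens at the first `.faa`, `.fa`, `.fna`, or `.fasta` that is followed by either the end of the name or another dot. The function name `split_fasta_ext` and the regex are illustrative and are not the implementation of `snekmer.utils.split_file_ext`.

```python
import re

# Lazily match the stem, then a FASTA extension plus any trailing
# extensions such as ".gz". Longer alternatives are listed first.
_FASTA_EXT = re.compile(
    r"^(.*?)(\.(?:fasta|faa|fna|fa)(?:\..*)?)$", re.IGNORECASE
)


def split_fasta_ext(filename):
    """Split a filename into (stem, extensions) at the FASTA extension.

    Returns (filename, "") when no FASTA extension is found.
    """
    match = _FASTA_EXT.match(filename)
    if match is None:
        return filename, ""
    return match.group(1), match.group(2)
```

Splitting at the FASTA extension rather than counting dots keeps stems such as `fam.v1.2` intact, which a fixed "at most 2 extensions" rule would truncate.
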
changelog:
- file unzipping is now handled by top-level unzip code in each snakefile; thus, `process.smk` is outdated and has been deleted as it is no longer needed.
changelog:
- file wildcard globbing previously proceeded through `glob.glob`, but had been updated in the model workflow to use snakemake's `glob_wildcards` utility. This method has the added benefit of preventing recursion errors with wildcard retrieval from gzipped files. The changes have now been applied to cluster and search workflows.
changelog:
- refactor: move `cluster_cluster.py` -> `cluster.py`
- refactor: move cluster report generation to separate script directive
- fix: change cluster mode file globbing to mirror model mode changes, i.e. uses snakemake `glob_wildcards` instead of python `glob.glob`. This should also fix unzipping issues and recursion errors related to unzipping.
changelog:
- fix: search file globbing updated to use snakemake's `glob_wildcards` rather than python's `glob.glob` in search mode. Should also resolve issues with file detection for files requiring unzipping and avoid recursion errors. Tested locally with a small subset of small families.
- style: applied snakefmt to `cluster.smk` and `search.smk`
changelog:
- feat: Snakemake `--resources` flag has been added to Snekmer CLI for all modes and tested locally.
- refactor: Wrapped all snakemake command line arguments into dictionary which is now passed to all snekmer subcommands. Removes the redundancy in specifying the same command line arguments every time a subcommand is called.
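
The dictionary-of-arguments refactor could look roughly like the sketch below, written with `argparse`. The flag subset, defaults, help strings, the `snekmer-sketch` program name, and the `parse_resources` helper are all assumptions for illustration; only the idea of defining shared Snakemake flags once and attaching them to every subcommand comes from the changelog.

```python
import argparse

# Shared flags, defined once and attached to every subcommand
# (hypothetical subset; the real CLI may differ).
SHARED_ARGS = {
    "--cores": {"type": int, "default": 1, "help": "number of cores"},
    "--resources": {
        "nargs": "*",
        "default": [],
        "metavar": "NAME=INT",
        "help": "resource constraints forwarded to Snakemake",
    },
    "--dry-run": {"action": "store_true", "help": "plan without executing"},
}


def build_parser(modes=("cluster", "model", "search")):
    """Build a parser whose subcommands all accept the shared flags."""
    parser = argparse.ArgumentParser(prog="snekmer-sketch")
    subparsers = parser.add_subparsers(dest="mode", required=True)
    for mode in modes:
        sub = subparsers.add_parser(mode)
        for flag, kwargs in SHARED_ARGS.items():
            sub.add_argument(flag, **kwargs)
    return parser


def parse_resources(pairs):
    """Turn ["mem_mb=4000", "gpu=1"] into {"mem_mb": 4000, "gpu": 1}."""
    return {name: int(value) for name, value in (p.split("=", 1) for p in pairs)}
```

For example, `build_parser().parse_args(["model", "--resources", "mem_mb=4000"])` yields a namespace whose `resources` list can be passed through `parse_resources`.
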
changelog:
- fix: resolve error with array shapes due to matrix dimensions (transpose matrix required)
- refactor: renamed variables to streamline code
changelog:
- basis harmonization now accounts for either 1D or 2D array cases
- 1D arrays are explicitly handled to match expected shape parameters set by the assumption that input arrays are 2D
- `utils.check_n_seqs` now uses boolean input arg to handle gz files rather than inferring from filename
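
As an illustration of basis harmonization that tolerates both 1D and 2D input, here is a small sketch. It assumes "harmonization" means re-expressing kmer counts on a target kmer basis, filling kmers absent from the source with zeros; the function name `harmonize` and its signature are hypothetical.

```python
import numpy as np


def harmonize(data, kmers, basis):
    """Re-express kmer counts on a target basis (hypothetical helper).

    data  : 1D (n_kmers,) or 2D (n_sequences, n_kmers) array of counts
    kmers : kmer labels for the columns of `data`
    basis : target kmer labels; kmers missing from `kmers` become 0
    """
    data = np.asarray(data)
    was_1d = data.ndim == 1
    data = np.atleast_2d(data)  # promote 1D input to a single-row matrix
    if data.shape[1] != len(kmers):
        raise ValueError("number of columns must match number of kmer labels")
    index = {kmer: i for i, kmer in enumerate(kmers)}
    out = np.zeros((data.shape[0], len(basis)), dtype=data.dtype)
    for j, kmer in enumerate(basis):
        i = index.get(kmer)
        if i is not None:
            out[:, j] = data[:, i]
    # Return the same dimensionality the caller passed in.
    return out[0] if was_1d else out
```

Promoting with `np.atleast_2d` and demoting at the end lets one code path serve both cases, so the 2D assumption holds internally without surprising callers who pass a single vector.
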
changelog:
- Workflow now accounts for cases where no background files are included when either "combined" or "background" mode is selected. (TODO: raise warning in this case)
- Bypass UnicodeDecodeError for `utils.check_n_seqs`
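
One way to combine an explicit gzip flag with tolerance for undecodable bytes is sketched below. How this PR actually bypasses the UnicodeDecodeError is not stated, so using `errors="replace"` is an assumption, as are the function name `count_fasta_records` and the `gz` parameter name; this is not the signature of `utils.check_n_seqs`.

```python
import gzip


def count_fasta_records(path, gz=False):
    """Count FASTA records by counting header lines that start with '>'.

    gz : explicit flag for gzipped input, instead of guessing from the name.
    """
    opener = gzip.open if gz else open
    # errors="replace" swaps undecodable bytes for U+FFFD rather than
    # raising UnicodeDecodeError; header detection is unaffected.
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Passing the flag explicitly avoids misclassifying files whose names do not reflect their compression state.
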
Successfully merging this pull request may close these issues.

  • windows- zipped files not detected/unzipped yet (#60)
  • Background file issue (#37)
1 participant