Kmer association #115

Merged
merged 13 commits into from
Dec 13, 2023

Collaborator

@jjacobson95 jjacobson95 commented Oct 16, 2023

Not ready for pull yet, but just wanted this to be on the radar.

Updates:

  • Learn output storage space reduced by 40-50%.
  • Apply output storage space reduced by ~95% when using the new "save_apply_associations" parameter.
  • Small increases in speed and efficiency for both Learn and Apply.
  • Learn now trains on full-length sequences and can evaluate confidence based on user-defined fragment sizes.
  • Code fully restructured to be object-oriented. It is now much more readable and maintainable, and cleaner for a future publication.
    • Docstrings added to all functions. Overall, the purpose of each section of code should be much clearer.

Before merging this pull request, several things must still be done:

  • Confirm that the additive method for kmer count matrix generation is still working for Learn.
  • Update Documentation with new parameters.
  • Lint.

…ct-Oriented Manner. Cleaned and added doc strings. Altered Learn output to take up 40-50% less storage. Much cleaner for a future publication. More readable and manageable for future updates. Currently working for standard usage. To do: Check if this still works for the additive database method in Learn.
@jjacobson95
Collaborator Author

Failing on Model in the test. Is this a known issue? Looks like it may be unrelated to model but maybe just an issue with the test.

@jjacobson95
Collaborator Author

Hi @christinehc, just checking in on this. Do you know if this is a known or previous issue with model?

@christinehc
Collaborator

christinehc commented Nov 7, 2023

Hmm, I took a look and didn't see anything obvious. I would try rerunning the tests and seeing if that works. If it fails again, I'll do some more digging.

Edit: started the rerun a short while ago; we'll see how it goes

Edit 2: failed again, hmm. Not getting much clarity from the debug log itself but let me try a few things. Seems to be an issue with the actions workflow itself

@christinehc
Collaborator

@jjacobson95: Still failing but I found this possibly related issue?

That is, try changing `ubuntu-latest` to `ubuntu-18.04` in the actions.yml and see if that works.

@christinehc christinehc mentioned this pull request Nov 29, 2023
@jjacobson95
Collaborator Author

ubuntu-18.04 failed; it looks like it's no longer supported by GitHub. Only the latest (ubuntu-22.04) and ubuntu-20.04 appear to be available; options are here.
Currently testing ubuntu-20.04.

@jjacobson95
Collaborator Author

Note to all (#116) - ubuntu 20.04 works.
Note to me - the Learn/Apply unit test needs updating, since several new arguments must be added to the config file.

@jjacobson95
Collaborator Author

Ready to merge! @biodataganache

Collaborator

Minor suggestion: I would use an input function to handle the conditional creation of optional files. It's a bit "cleaner" code style-wise.
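
As a minimal sketch of that suggestion (not the code in this PR): a Snakemake input function is an ordinary Python function, so the optional targets can be assembled conditionally in one place and handed to `rule all`. The output paths, the function name `all_targets`, and the placement of `save_apply_associations` at the top level of the config are assumptions made for illustration.

```python
def all_targets(config):
    """Build the list of final targets, adding optional files only when requested."""
    # Hypothetical path: stands in for whatever the workflow always produces.
    targets = ["output/apply/results.csv"]
    # Assumption: `save_apply_associations` is a top-level boolean config key.
    if config.get("save_apply_associations", False):
        # Hypothetical path for the optional output.
        targets.append("output/apply/associations.csv")
    return targets


# In a Snakefile, this could be wired in roughly as:
#   rule all:
#       input: lambda wildcards: all_targets(config)
```

Keeping the condition inside one function means the rule body itself stays free of `if`/`else` branches.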

Collaborator

Minor stylistic comments:

  • See previous comment on apply.smk about input functions for conditional rule all
  • Lines 606/616/903: `if not x` is preferable to `if x == False`
  • Line 791: I would print a more informative error message.
  • Thoughts for future development: I wonder if some aspects of the classes, e.g. the checking function stream, can be streamlined. I do think the object-based approach is great, but the classes are a bit large/unwieldy and in future development I'd consider strategies to simplify, even if that means moving some of the heavy lifting to module-level functions rather than classes.
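
The `if not x` point can be illustrated with a short, generic sketch (the function names are invented for the example and do not come from the PR). Beyond style, the two forms behave differently for falsy values other than `False`, such as `None` or an empty list, so the swap is only a pure style change when `x` is known to be a boolean.

```python
def check_with_equality(x):
    # Only matches values that compare equal to False (False, 0, 0.0).
    return x == False  # noqa: E712


def check_with_not(x):
    # Matches every falsy value: False, 0, None, "", [], {} and so on.
    return not x


# Print both results side by side for a few falsy values.
for value in (False, 0, None, "", []):
    print(repr(value), check_with_equality(value), check_with_not(value))
```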

Collaborator Author

Thanks for the suggestions, @christinehc. I think all of those are good ideas. If not in this version, I'll make these changes in the next version. The classes are pretty large; in a future iteration, I'll think about how to handle the functions. Would you recommend moving them to an additional helper file and importing them, or keeping them within the current scope?

Collaborator

I think we can discuss what makes the most sense when we plan the next major code update, as part of the complexity of the classes arises from the complexity of the workflow itself and we'd have to see which areas would be most ripe for simplification.

Collaborator

@christinehc christinehc left a comment

Including some commentary on minor suggested changes, but everything seems to be working / CI is passing, so I'll submit an approval formally.

@biodataganache
Collaborator

Couple of items:

  1. Please add a script in the Snekmer/resources/tutorial/learnapp_tutorial_files/ folder that will run the learn and apply examples (see Snekmer/resources/tutorial/demo_example/ for the idea).
  2. Please remove the base file in the Snekmer/resources/tutorial/learnapp_tutorial_files/learn/ folder from the repo (this should be generated by running the example)
  3. Please fix the following error when snekmer learn is run:
    /Users/d3p620/lib/Snekmer/resources/tutorial/learnapp_tutorial_files/learn

(snekmer) d3p620@WE48427 learn % snekmer learn
KeyError in line 914 of /Applications/anaconda3/envs/snekmer/lib/python3.10/site-packages/snekmer/rules/learn.smk:
'conf_weight_modifier'
File "/Applications/anaconda3/envs/snekmer/lib/python3.10/site-packages/snekmer/rules/learn.smk", line 914, in
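
One way to make this kind of failure easier to diagnose, sketched here under assumptions rather than taken from the actual fix: look up required parameters through a small helper that reports which key is missing. The helper name `require_param`, the flat config layout, and the example config contents are hypothetical; only the key name `conf_weight_modifier` comes from the traceback above.

```python
def require_param(config, key):
    """Return config[key], or raise a KeyError that names the missing parameter."""
    try:
        return config[key]
    except KeyError:
        raise KeyError(
            f"Missing required parameter '{key}' in the config file; "
            "add it before running the workflow."
        ) from None


# Usage with a config that lacks the key reported in the traceback:
config = {"k": 14}  # hypothetical contents
try:
    require_param(config, "conf_weight_modifier")
except KeyError as err:
    print(err)
```

Whether a missing parameter should raise or fall back to a default is a design choice for the maintainers; the sketch only shows the raising variant.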

@jjacobson95
Collaborator Author

Changes made. But before merging, I should also update the docs to reflect parameter changes. I'll try to have this done by next Tuesday.

@christinehc
Collaborator

Please also remember to update the version here before pushing

christinehc added a commit that referenced this pull request Dec 12, 2023
changelog:
- kmers can now be scored by probability score subtracting the observed kmers in a supplied background set, family set, or combining both background and family
  - note: some column headers have changed, which may affect downstream analysis (e.g. integration with #115, #116)
- to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
- extensive changes have been made to `snekmer.score` to accommodate the new changes, including:
  - `snekmer.score.score` now has 3 distinct formulae to compute probability scores according to the desired scoring method
  - `snekmer.score.feature_class_probabilities` now also integrates the scoring method
- the main scoring rule itself has been significantly altered as follows:
  - all references to the old and not-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
  - extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
  - scoring method now integrated
@christinehc christinehc merged commit 94a1374 into main Dec 13, 2023
3 checks passed

3 participants