-
Notifications
You must be signed in to change notification settings - Fork 49
Updated motif processing to Tangermeme 0.4 #69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
for more information, see https://pre-commit.ci
|
Partially addresses #55 |
|
|
||
| # Trim PPMs based on information content | ||
| start, end = trim_pwm( | ||
| cwm, trim_threshold=trim_threshold, return_indices=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the trim should be done on the pwm and not cwm? cwm values are not normalized in any way generally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I based this on the modiscolite repo where trimming seems to be done on the CWM, e.g.:
https://github.com/jmschrei/tfmodisco-lite/blob/main/modiscolite/report.py#L98
https://github.com/jmschrei/tfmodisco-lite/blob/main/modiscolite/report.py#L236
Is it incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I had a look. Trimming on CWM is fine.
So it looks like within trim_pwm, the function recomputes the threshold as
trim_thresh = np.max(score) * trim_threshold
This is a bit unexpected to me, as the threshold should be used as is. Originally, I imagine it's for PWMs where the range is fixed (probs from 0 to 1).
My suggestion would be to compute exact threshold outside function call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think the threshold is meant to be calculated on the log-odds matrix, not on the probability matrix. Let me find a reference for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you go: https://bioconductor.org/packages/devel/bioc/vignettes/universalmotif/inst/doc/IntroductionToSequenceMotifs.pdf
Specifically, this section on page 4:

While the context here is motif scanning, not trimming, this is the threshold that we're calculating. It's related to entropy of each position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In either case, the exact threshold calculation should happen outside the trim_pwm function. Trim should just take a matrix, some threshold and perform the trimming using that threshold
Replaced pymemesuite with tangermeme for all motif handling.
Specifically
setup.cfgread_meme_fileso that it now reads motifs in the dictionary format that tangermeme expectsmodisco_to_memesince it is no longer necessary to have a meme file input. Instead, added aread_modisco_reportfunction that reads modisco motifs directly into the dictionary format that tangermeme expectsgrelu.io.memetogrelu.io.motifsas this module now involves modisco .h5 files in addition to meme filesmotifs_to_stringsto use the dictionary format as inputscan_sequencesto use tangermeme fimo function as backendscan_sequencesgrelu.transforms.seq_transforms.MotifScoreclass to use the dictionary format as inputpaddingargument fromgrelu.interpret.motifs.trim_pwmsince we no longer have a use case for it