question about prediction on sequences score #85

Closed
ymkng opened this issue Jun 13, 2019 · 6 comments
Comments

ymkng commented Jun 13, 2019

Hi,
I have been reading the documentation and I'm still not sure what the output scores from a trained model's predictions mean. I noticed that the scores all fall between 0 and 1. Is each score the probability that a TF will bind to an input region, and if so, what is this probability based on?

Another question: if I set "center_bin_to_predict" to 200 when training the model and my "feature_thresholds" is 0.5, do my input TF binding regions have to be at least 200bp long for Selene to classify them as a "binding region"?

thanks!

Michelle

Collaborator

kathyxchen commented Jun 13, 2019

Hi Michelle,

Yes, you can consider the scores to be 'probabilities'; however, this is a rather loose definition because these values are really just the outputs of the sigmoid layer, which constrains each value to between 0 and 1 and lets us judge whether a particular chromatin factor is likely to bind at a region.

If the threshold is 0.5, the TF binding region needs to cover at least 100bp of the 200bp center bin (0.5 × 200bp) to be classified as a binding region. You can adjust this threshold or the center bin size based on the size of the peaks in your track files.

Thanks!
Kathy
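
To make the two points in this reply concrete, here is a minimal sketch in plain Python. It is not Selene's code: the function names are hypothetical, and the `>=` comparison is an assumption taken from the "at least 100bp" wording above, so Selene's own comparison may treat the exact boundary differently.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw model output into (0, 1); this is what the scores are."""
    return 1.0 / (1.0 + math.exp(-x))


def min_overlap_bp(center_bin_bp: int, feature_threshold: float) -> int:
    """Base pairs of the center bin a peak must cover (hypothetical helper)."""
    return math.ceil(center_bin_bp * feature_threshold)


def is_labeled_positive(peak_start: int, peak_end: int,
                        bin_start: int, bin_end: int,
                        feature_threshold: float) -> bool:
    """Assumed rule: positive if the peak covers at least
    feature_threshold of the center bin. Coordinates are half-open."""
    overlap = max(0, min(peak_end, bin_end) - max(peak_start, bin_start))
    return overlap / (bin_end - bin_start) >= feature_threshold


# A 200bp center bin with threshold 0.5 requires 100bp of coverage.
print(min_overlap_bp(200, 0.5))
# A 150bp overlap passes; a 90bp peak inside the bin does not.
print(is_labeled_positive(950, 1100, 900, 1100, 0.5))
print(is_labeled_positive(950, 1040, 900, 1100, 0.5))
```

Note that the peak itself does not need to be 200bp long; only its overlap with the center bin matters under this rule.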

Author

ymkng commented Jun 14, 2019

Thanks so much! So if my TF binding regions vary in size, between 90 and 300 base pairs, what would you suggest setting the threshold or center_bin_to_predict to?

@evancofer
Collaborator

I would think 0.45, or somewhere between 0.40 and 0.50. I don't have a good sense of how it will influence performance on your specific data if you set it too small/large, so it might be worth tuning it on some validation data. What do you think @kathyxchen ?

Collaborator

kathyxchen commented Jun 14, 2019

Agree with @evancofer that it could be worth tuning on validation data. Otherwise, you could make the decision by looking at the distribution of TF binding region sizes, how the regions are distributed across the genome, and, if you have multiple TFs in your dataset, whether there are ones at 90bp that you'd be excluding entirely by keeping the threshold this large, etc.
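
One way to check the point about excluded peaks is to ask, for a candidate setting, what fraction of peaks could ever reach the required overlap. The sketch below is only illustrative: `fraction_labelable` is a hypothetical helper, the peak lengths are made-up values in the 90-300bp range mentioned in this thread, and it assumes the "at least threshold × bin size" rule described earlier.

```python
def fraction_labelable(peak_lengths, center_bin_bp, feature_threshold):
    """Fraction of peaks long enough to ever meet the overlap requirement.

    A peak can cover at most min(length, center_bin_bp) of the center bin,
    so shorter peaks can never be labeled positive at a strict threshold.
    """
    required_bp = center_bin_bp * feature_threshold
    usable = sum(1 for n in peak_lengths
                 if min(n, center_bin_bp) >= required_bp)
    return usable / len(peak_lengths)


# Made-up peak lengths (bp), for illustration only.
peak_lengths = [90, 120, 150, 200, 300]

for bin_bp, threshold in [(200, 0.5), (200, 0.4), (100, 0.5)]:
    frac = fraction_labelable(peak_lengths, bin_bp, threshold)
    print(f"bin={bin_bp}bp threshold={threshold}: {frac:.0%} of peaks usable")
```

Running this per TF on real peak lengths would show which factors lose all of their peaks at a given setting. It does not replace tuning on validation data.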

@evancofer
Collaborator

@ymkng @kathyxchen Can this be closed?

Author

ymkng commented Jun 18, 2019

thanks!

ymkng closed this as completed Jun 18, 2019