Generate training data using context #25

Open
10 tasks done
teresa-m opened this issue Jul 30, 2021 · 10 comments

Comments

@teresa-m
Member

teresa-m commented Jul 30, 2021

Idea:
Add context to the 'trusted RRIs' and use these extended sequences to generate the positive and negative instances with IntaRNA.
The positive data will be generated by calling IntaRNA with --seed[Q,T]Range set to the interaction site found by HTS. For the negative RRI instances, IntaRNA is called with the regions that are known to be part of an interaction blocked, so that they cannot be part of the predicted RRI (--[q,t]AccConstr="b:start_1st_blocked_site-end_1st_blocked_site,start_2nd_blocked_site-end_2nd_blocked_site,...")
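A minimal sketch of how the two kinds of calls could be parameterised. It only builds the argument strings and does not run IntaRNA. The flag names are the expanded forms of the bracket shorthand above (--seedQRange/--seedTRange, --qAccConstr/--tAccConstr); the function names and the 1-based, inclusive interval convention are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumption: intervals are 1-based and inclusive, given as (start, end).
Interval = Tuple[int, int]


def blocked_constraint(blocked: Sequence[Interval]) -> str:
    """Format blocked regions as 'b:s1-e1,s2-e2,...' (empty string if none)."""
    if not blocked:
        return ""
    return "b:" + ",".join(f"{start}-{end}" for start, end in sorted(blocked))


def positive_args(q_site: Interval, t_site: Interval) -> List[str]:
    """Restrict the seed to the interaction site observed in the HTS data."""
    return [
        f"--seedQRange={q_site[0]}-{q_site[1]}",
        f"--seedTRange={t_site[0]}-{t_site[1]}",
    ]


def negative_args(
    q_blocked: Sequence[Interval], t_blocked: Sequence[Interval]
) -> List[str]:
    """Block all regions known to be occupied by an interaction."""
    args = []
    for flag, blocked in (("--qAccConstr", q_blocked), ("--tAccConstr", t_blocked)):
        constraint = blocked_constraint(blocked)
        if constraint:  # skip the flag entirely if nothing is blocked
            args.append(f"{flag}={constraint}")
    return args
```

The resulting lists could be appended to a base IntaRNA command line that already carries the sequences and the general parameters.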

image

Tasks:

  • map RBP binding positions from hg18 to hg38
  • find a good format to store the RBP binding sites (add +/- 10 to the single positions)
  • store the RRIs of the positive set and also the previously filtered RRIs in the same format
  • plot RBP and RRI interaction length distributions
  • extract genomic context for trusted RRIs (e.g. 300 on both sides; context length depends on the length of the RRI and RBP interaction sites)
  • IntaRNA calls in general: set the following parameters: --outMaxE=-5, --outOverlap=B, --outNumber=[5,10], --seedBP=5, -t 37
  • IntaRNA calls in general: set the following output parameters: --seedE, EDs and Ehybrid (should already be implemented)
  • find a scheme to transfer the genomic positions to the sequence (trusted RRI + context) positions. See figure below
  • call positive instances using --seed[Q,T]Range
  • call negative instances masking occupied areas --[q,t]AccConstr="b:start_1st_blocked_site-end_1st_blocked_site,start_2nd_blocked_site-end_2nd_blocked_site,..." #28
  • plot energy profiles of pos and neg instances
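For the "+/- 10" step in the task list, a small sketch of how single RBP positions could be turned into binding sites and overlapping sites merged. The function names, the 1-based inclusive convention, and the choice to merge adjacent sites as well are assumptions, not something fixed in this issue.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # assumption: 1-based, inclusive (start, end)


def pad_sites(positions: Iterable[int], pad: int = 10) -> List[Interval]:
    """Extend each single position by `pad` on both sides, clipped at position 1."""
    return [(max(1, pos - pad), pos + pad) for pos in positions]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals into sorted, disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merged, disjoint intervals keep the later blocking constraints short and free of redundant entries.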

image
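One possible scheme for transferring genomic positions to positions within the extracted sequence (trusted RRI + context), sketched under assumptions: coordinates are 1-based and inclusive, the window is only clipped at the chromosome start (the chromosome length is not known here), and minus-strand sequences are numbered from their 5' end, i.e. from the genomic window end. All names are hypothetical.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # assumption: 1-based, inclusive (start, end)


def context_window(rri_start: int, rri_end: int, context: int = 300) -> Interval:
    """Genomic window covering the trusted RRI plus `context` on both sides."""
    return max(1, rri_start - context), rri_end + context


def to_local(
    g_start: int, g_end: int, win_start: int, win_end: int, strand: str = "+"
) -> Optional[Interval]:
    """Map a genomic interval to positions within the window sequence.

    The interval is clipped to the window; None is returned if it lies outside.
    """
    start = max(g_start, win_start)
    end = min(g_end, win_end)
    if start > end:
        return None
    if strand == "+":
        return start - win_start + 1, end - win_start + 1
    # Minus strand: the sequence is the reverse complement, so count from win_end.
    return win_end - end + 1, win_end - start + 1
```

For example, a trusted RRI at genomic positions 1000-1020 with 300 context gives the window 700-1320, and the RRI itself ends up at sequence positions 301-321.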

@teresa-m
Member Author

See later: Could we introduce a bias by placing the trusted RRI in the middle of the context, while possible negative interactions could be at the border of the context? Should we disallow RRIs at the beginning and end of the sequences?

@teresa-m
Member Author

teresa-m commented Aug 2, 2021

Is it a weak spot that we only have the mRNA-binding proteome data? Are there many proteins binding to ncRNAs?

@martin-raden
Member

concerning

extract genomic context for trusted RRIs (e.g. 300 on both sides; context length depends on the length of the RRI and RBP interaction sites)

I would always add the same context length left/right, independently of the RRI/RBP subsequence length. That simplifies the setup, and sequence length does not matter anyway...

@martin-raden
Member

See later: Could we introduce a bias by placing the trusted RRI in the middle of the context, while possible negative interactions could be at the border of the context? Should we disallow RRIs at the beginning and end of the sequences?

constrain seeds to not be at sequence ends

Good point! This can be solved by constraining the seed to the positions +100 to (length-100), i.e. similar to the positive data set but with the additional "blocking constraints". That way, the accessibilities of the RRIs should be more reliable than those around sequence ends!
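A short sketch of the seed range this suggests. Reading "+100 to (length-100)" as excluding the first and last 100 nucleotides (positions 101 to length-100, 1-based) is an interpretation, and the function name is hypothetical.

```python
from typing import Optional


def inner_seed_range(seq_len: int, margin: int = 100) -> Optional[str]:
    """Seed range 'start-end' that skips `margin` nucleotides at both sequence ends.

    Returns None if the sequence is too short to leave any allowed position.
    """
    start = margin + 1
    end = seq_len - margin
    if start > end:
        return None
    return f"{start}-{end}"
```

The returned string could be passed as the value of --seed[Q,T]Range for the negative calls, in addition to the blocking constraints.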

@martin-raden
Member

Is it a weak spot that we only have the mRNA-binding proteome data? Are there many proteins binding to ncRNAs?

I would guess RBPs do not distinguish much between lncRNA and mRNA...

@teresa-m
Member Author

teresa-m commented Aug 4, 2021

extract genomic context for trusted RRIs (e.g. 300 on both sides; context length depends on the length of the RRI and RBP interaction sites)

Yes, sorry, I wanted to check the lengths of the RRI and RBP binding sites to see if 300 context is long enough. But of course, the added context will be the same for all. Maybe I should have made this point clearer.

@teresa-m
Member Author

teresa-m commented Aug 4, 2021

I would guess RBPs do not distinguish much between lncRNA and mRNA...

I was just wondering, since the data from the paper I found only gives us the mRNA-binding proteome, if I understood it correctly.

@martin-raden
Member

I would guess RBPs do not distinguish much between lncRNA and mRNA...

I was just wondering, since the data from the paper I found only gives us the mRNA-binding proteome, if I understood it correctly.

most likely they were just interested in direct gene regulation, i.e. mRNA binding

@teresa-m
Member Author

The two latest attempts to make the positive and negative feature distributions more alike:
(1) Using occupied regions as constraints also for the positive instance generation -> we lose many sequences, but the distributions look a bit more similar than without it.
(2) Not allowing long bulges within the interaction.

@martin-raden
Member

Curious to see how it turns out... :)
