Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display RNA/DNA modification probabilities at single read level. #1869

Closed
a-slide opened this issue Apr 1, 2021 · 35 comments
Closed

Display RNA/DNA modification probabilities at single read level. #1869

a-slide opened this issue Apr 1, 2021 · 35 comments
Assignees

Comments

@a-slide
Copy link

a-slide commented Apr 1, 2021

Hi @GMOD team,

The fields of epigenetics and now epitranscriptomics have evolved very fast in recent years and there are now a few DNA and RNA modifications which can be detected at single molecule resolution, in particular with Oxford Nanopore sequencing.

In order to avoid the proliferation of non-standard formats to store these modifications there was a long discussion with the samtools team on how to represent all modifications within SAM/BAM format (samtools/hts-specs#362), leading to the proposition to use additional flags in the specification to store both modification positions (Mm) and probabilities (Ml) at read level. There can also be multiple modification tracks included within a single read. Here is the link to the Pull Request currently being considered: samtools/hts-specs#418. Hopefully the PR will be merged soon and will be part of the official SAM specification.

Several Oxford Nanopore tools to detect RNA modifications already implement the specification ahead of the official release, including Megalodon (https://github.com/nanoporetech/megalodon) and Guppy basecaller.

An ideal implementation in Jbrowse would display 1 track per modification and per read showing the probability of a modification presence as a color scale, or maybe multiple modification overlaid with different color scales. There is another old issue GMOD/jbrowse#509 discussing a way to display methylation information, but this is per position and not per read.

Does that sound like it could be easily implemented as a Jbrowse plugin ?

@a-slide
Copy link
Author

a-slide commented Apr 1, 2021

I have also raised a similar feature request in IGV igvteam/igv#945

@cmdcolin
Copy link
Collaborator

cmdcolin commented Apr 1, 2021

I think probably this work would occur on our jbrowse 2 project, jbrowse 1 is sort of in maintenance mode (it's not super obvious from our readme but that development is at this repo https://github.com/gmod/jbrowse-components)

I actually started trying this out awhile back in jbrowse 2 but it got tabled

Demo using MM and ML tags on jbrowse 2 for multiple modification types with transparency indicating probabilities:
localhost_3000__config=test_data%2Fvolvox%2Fconfig json session=local-nEY2hTudA

Probably most commonly someone would just look at maybe one type and select that type but this was a fun demo to show multiple

I think this is a great idea though... I'd say, it shall be probably be worked on! things like methplotlib are great references for this already working in the wild

@a-slide
Copy link
Author

a-slide commented Apr 2, 2021

Sorry for the confusion and for missing the PR in the other repo 😖
That sounds like a very good starting point to me. I agree with you that 1 mod is going to be enough for most people. But, Nanopore Seq should soon be able to support multiple types of CpG modifications simultaneously (5mC, 5hmC, 5fC...), so it is a nice feature to be able to show multiple mods with different color scales. And for RNA, it will also be very useful. There will be cases where multiple modifications might be called at the same position, but I think this could be solved by showing only the one with the highest probability.

Feel free to transfer the issue in the jbrowse-components repo.

I am putting together a few test files containing "real" modification data obtained with Guppy and Megalodon.

@cmdcolin cmdcolin transferred this issue from GMOD/jbrowse Apr 2, 2021
@rbuels rbuels added this to Incoming in JBrowse team board via automation Apr 2, 2021
@rbuels
Copy link
Contributor

rbuels commented Apr 7, 2021

@cmdcolin is there a branch with prototype work on this?

@cmdcolin
Copy link
Collaborator

cmdcolin commented Apr 7, 2021

@a-slide
Copy link
Author

a-slide commented Apr 9, 2021

I generated a BAM file with Megalodon containing reads aligned on the human chromosome 20 with around 5X of coverage. It contains predicted modification sites for both 5mC and 5hmC and follows the proposed SAM specification.
It should be mostly 5mC and a few 5hmC.

Files (including the reference) can be downloaded from: https://nanoporetech.box.com/s/82pnw3lhusfs93s0vj7azxiz4x7kxabo

I am also making other larger test files with Zymo community in which we spiked 5mC, so it contains a higher proportion of mods but not in CG context only.

Hope this helps

@cmdcolin
Copy link
Collaborator

cmdcolin commented Apr 9, 2021

awesome, thanks for this...would it be ok if I rehost this in a demo instance on our s3 bucket? can provide some metadata to credit you as the source :)

@a-slide
Copy link
Author

a-slide commented Apr 9, 2021

Yes, feel free to re-host it there with credit to Oxford Nanopore Technologies

@cmdcolin
Copy link
Collaborator

@a-slide thanks...just a note, I think this BAM may be generated on hg19 chr20 but the filename in the box link suggests it is chr20 on hg38

Loading on hg38, the BAM has a "C" mismatch where there is a C in the reference, and converting to CRAM using hg38 makes it pretty confused too

localhost_3000__config=test_data%2Fconfig_demo json session=local-lWC4ThHTe

just a note :)

@cmdcolin
Copy link
Collaborator

hmmm, maybe actually it is not hg19...i'll dig into this a little further

@cmdcolin
Copy link
Collaborator

made a different issue for our incorrect display of SNPs on this dataset, it is indeed hg38 we just have a bug

#1896

@cmdcolin
Copy link
Collaborator

cmdcolin commented Apr 10, 2021

Ok I think maybe I found the reason...I think the issue is that the MD tag was not generated correctly for the file

IGV and tview may ignore this and hence display the correct stuff

JBrowse generally uses trusts the MD tag as written so it displays SNPs wrong. JBrowse can also generate the MD tag if none is supplied in the file but in this case, since it is in the file, does not

I regererated the MD tag with samtools calmd with samtools calmd human_chr20_mod_call_5mC_5hmC_CG.bam Homo_sapiens.GRCh38.dna.chromosome.20.fa --output-fmt BAM >! new.bam 2> /dev/null

That appears to fix it for jbrowse at least...

Do you think the upstream tool will be able to fix this? Just wondering whether jbrowse should distrust MD tag in general

@a-slide
Copy link
Author

a-slide commented Apr 12, 2021

Hi Colin,
Sorry about that, there might be an issue when the tool computes the MD tag, but since it wraps around minimap2, I guess it is sort of out of our hands, but we will investigate to see if there is an easy fix.
I would probably regenerate MD tags by default or at least offer the possibility to do so via the interface.
Thanks for digging into it.
Let me know if you need anything else.

@cmdcolin
Copy link
Collaborator

cmdcolin commented Apr 12, 2021

No worry:)
Did you also happen to know if that test file has multiple modifications? I didn't see any 5hmc from a quick look

@a-slide
Copy link
Author

a-slide commented Apr 12, 2021

I saw a few when looking at the reads, but they are much rarer, and the model to predict them is not as good as the 5mC
You have a greater chance to spot them near promoters.

@rbuels rbuels moved this from Incoming to In progress in JBrowse team board Apr 13, 2021
@cmdcolin
Copy link
Collaborator

@a-slide I was trying to get a useful "Color by methylation" setting with this data which I figured was a little different from just coloring the modified bases (e.g. since drawing unmodified CpG also is important in addition to modified CpG). A bit inspired by igv stuff like http://software.broadinstitute.org/software/igv/interpreting_bisulfite_mode

example screenshot, with the demo data you sent
116168403-479fde80-a6d0-11eb-88c6-f0d742e0877a

live link here
http://s3.amazonaws.com/jbrowse.org/code/jb2/methylation/index.html?config=test_data%2Fconfig_demo.json&session=share-mParXfFi_G&password=1nqZk

still needs a bit of work, but thought it would be of interest...

@a-slide
Copy link
Author

a-slide commented Apr 27, 2021

Hi @cmdcolin,
Nice work. This looks very promising indeed.
I assume red is CpG methylated and blue unmethylated? Did you define an arbitrary cut-off to decide if a position is modified or not ?

What about using a light to dark color scale for individual reads based on the ML tag like in the initial prototype ? Unmodified bases could still be represented I guess in the light side of the color scale. It could also be applied to the coverage track using the mean or median modification score for all reads aligned at a given position.
The current layout is also very informative as it shows sharper difference, so I would probably keep it in any cases.

We are working on a tools to generate bedmethyl files from a modBAM which could actually be used directly to show the overall methylation score (one file per modification).

Hope it helps

@cmdcolin
Copy link
Collaborator

Sorry to not explain :) there is still the original coloring system proposed in the prototype still, I just thought it would be good (at least to my mind) to make a specialized methylation view so that you can see the unmodified CpG (which may not be indicated by the MM tag) so I tried to make this specialized mode scan for CpG and then color the unmodified ones blue and modified ones red (just binary)

But, yes...we do still have the original code that colors all the modifications still, as a separate option (Color by -> "Modifications" vs Color by "Methylation") in the track menu

@cmdcolin
Copy link
Collaborator

Curious about the bedMethyl stuff too...we could try to think about how best to display those if just basic boxes wouldnt work

@a-slide
Copy link
Author

a-slide commented Apr 28, 2021

Oh I see. Great idea then. I think most people will actually be more familiar with that type of binary display, even if it doesn't have the probability information. Weirdly, I cannot get the original mod color map display mode in the online interactive session you shared, but I am not very familiar with JBrowse so I might be doing something wrong.

@a-slide
Copy link
Author

a-slide commented Apr 28, 2021

Actually I forgot that Megalodon does generate a BedMethyl file per modification. The tool in development I mentioned will be for Guppy direct mod base basecalling.
I added the matching Megalodon m5C file in the Box shared folder if you want to have a look https://nanoporetech.box.com/s/82pnw3lhusfs93s0vj7azxiz4x7kxabo

The format is described here https://www.encodeproject.org/data-standards/wgbs/

@cmdcolin
Copy link
Collaborator

cmdcolin commented May 5, 2021

Random update @a-slide I think I found maybe a factor in the calmd issue from earlier that could be of interest...

It appears that after I ran calmd, then only the reads that are marked as "reverse strand" get marked up with tons of mismatches

On the reads that are forward strand, the mismatches look ok

localhost_3000__config=test_data%2Fconfig_demo json session=local-WP-ckdBfz (1)

This could have implications for whatever tool generated these reads. The reads that are on the reverse strand may need to have their seq field reverse complemented or something like this(???)

All the reads on the forward strand have an ok number of mismatches, but all the reads on the reverse strand after running calmd have very high number of mismatches

Here is a zoomed out view showing basically the pattern where about half the reads have very high proportion of mismatches (in coverage graph, half are grey indicating matching the ref, other half are misc colors coming from the reverse strand reads with high mismatch rate)

localhost_3000__config=test_data%2Fconfig_demo json session=local-WP-ckdBfz (2)

@cmdcolin
Copy link
Collaborator

cmdcolin commented May 5, 2021

Here is a share link showing both methylation and modifications-only mode (created a copy of the track, and added separate settings in both)

http://s3.amazonaws.com/jbrowse.org/code/jb2/methylation/index.html?config=test_data%2Fconfig_demo.json&session=share-37YAPKjzzx&password=XkORz

In order to change the colorscheme manually, can use the "three dots" on the track label and navigate to Pileup settings->Color by->Methylation

There is starting to be a lot of options so maybe this is due for a redesign hehe, but that works on the link

Note that if you have files of your own on the web (can't open local files from computer currently) you can also add your own track in that URL

s3 amazonaws com_jbrowse org_code_jb2_methylation_index html_config=test_data%2Fconfig_demo json session=local-cmp6UUQm_

@a-slide
Copy link
Author

a-slide commented May 7, 2021

Hi @cmdcolin.
Thank you very much for getting to the bottom of the MD string issue. I will forward to the Guppy aligner developpers so they can have a look at whatever seems to be going wrong with reverse reads.

@a-slide
Copy link
Author

a-slide commented May 7, 2021

Thanks very much for the link to the interactive session.
Both representation are actually very informative. I browsed a bit and found nice promoter regions showing very clear hyper or hypo methylation signal.

I am not quite sure I understood everything but my understanding is that the methylation display mode (blue/red) shows all the CpGs even if they are not in the MM tag (meaning fully unmethylated). However I found quite a few instances where the CpGs are displayed in the modification mode but not in the methylation one (see below). Is it because they don't fall on a reference CpG site ?

Screenshot_2021-05-07 JBrowse

@cmdcolin
Copy link
Collaborator

cmdcolin commented May 7, 2021

yes...those are not on a reference CpG site

if interested, I could draw those still though even if they don't fall on the reference CpG...or even draw them in a different color...still exploring what is sort of "expected" or the most useful way to depict this data, and also address how to display non-CpG methylation and hydroxymethylation and stuff also (in the Methylation mode with blue and red it only draws the conventional "m" modifications now...but should it draw other types too perhaps?)

@a-slide
Copy link
Author

a-slide commented May 7, 2021

Hi @cmdcolin.
Thank you very much for getting to the bottom of the MD string issue. I will forward to the Guppy aligner developpers so they can have a look at whatever seems to be going wrong with reverse reads.

Apparently this was recently fixed in the last version of guppy aligner (4.5.4).
I gave it a go and added the new files to the shared Box folder https://nanoporetech.box.com/s/82pnw3lhusfs93s0vj7azxiz4x7kxabo (human_chr20_mod_call_5mC_5hmC_CG_new.bam)

@a-slide
Copy link
Author

a-slide commented May 7, 2021

yes...those are not on a reference CpG site

if interested, I could draw those still though even if they don't fall on the reference CpG...or even draw them in a different color...still exploring what is sort of "expected" or the most useful way to depict this data, and also address how to display non-CpG methylation and hydroxymethylation and stuff also (in the Methylation mode with blue and red it only draws the conventional "m" modifications now...but should it draw other types too perhaps?)

I would say it is fine to display everything in the generic modification display mode. I think restricting to a specific reference sequence context should be done by the modification calling tool itself and not the genome browser. That said I think the CpG methylation display mode is very useful as it is. Not quite sure if this should be extended to other modifications. 5hmC maybe. For RNA it's going to be much more complicated as consensus motifs are loosely defined at best.

Any thoughts @marcus1487 ?

@marcus1487
Copy link

I would say that displaying all contexts would be best and leave any context constraints to the caller producing the file. Megalodon has a tool to filter the corresponding database by reference context. It might be worth adding a command to megalodon to perform this filtering on the modbase bam (I'll log an issue).

In terms of what is expected, this is somewhat organism specific, so I would be opposed to coloring based on sequence context. My opinion would be to color based on modification type. Not sure if consistent colors can be coded for some common mods and then use a color wheel for any mods not known. I'm happy with any default convention here really.

@cmdcolin
Copy link
Collaborator

@marcus1487 thanks for the input! as far as the sequence context, this is a separate mode, and the generic "Color by modifications" mode should be ok. I made it so that we can show the color mapping now that was chosen, and could probably allow user selectable color picker or pre-configured config file. we merged this to our main branch pending release now, will probably be more refinements but excited thus far :)

@cmdcolin
Copy link
Collaborator

this will probably get released shortly. may close issue for now but look forward to seeing more data etc. can post to new issues or discussion threads (https://github.com/GMOD/jbrowse-components/discussions) to continue the conversation!

JBrowse team board automation moved this from In progress to Done May 24, 2021
@nextgenusfs
Copy link

@a-slide @cmdcolin found this issue when trying to figure out how to display a bedMethyl track. It was derived from ONT dorado and then modkit pileup https://github.com/nanoporetech/modkit. How/what is the best way to import this as a track?

Format is here https://github.com/nanoporetech/modkit#bedmethyl-column-descriptions and this is a head of the file:

chr2    3       4       h       1       -       3       4       255,0,0 1 0.00 0 1 0 0 0 0 33
chr2    3       4       m       1       -       3       4       255,0,0 1 0.00 0 1 0 0 0 0 33
chr2    9       10      h       3       +       9       10      255,0,0 3 0.00 0 3 0 0 0 0 0
chr2    9       10      m       3       +       9       10      255,0,0 3 0.00 0 3 0 0 0 0 0
chr2    10      11      h       35      -       10      11      255,0,0 35 0.00 0 35 0 0 1 2 2
chr2    10      11      m       35      -       10      11      255,0,0 35 0.00 0 35 0 0 1 2 2
chr2    28      29      h       5       -       28      29      255,0,0 5 0.00 0 5 0 0 2 2 35
chr2    28      29      m       5       -       28      29      255,0,0 5 0.00 0 5 0 0 2 2 35
chr2    36      37      h       1       -       36      37      255,0,0 1 0.00 0 1 0 2 1 0 40
chr2    36      37      m       1       -       36      37      255,0,0 1 0.00 0 1 0 2 1 0 40

@cmdcolin
Copy link
Collaborator

@nextgenusfs good question. the existing work on modifications so far has been displaying the data directly on the reads and not on bedMethyl but it would be good to make a bedMethyl example as well

that can probably be loaded as a plain bed tabix or bigbed file, but would be expected to display in some special way?

could definitely make a new issue for tracking

@nextgenusfs
Copy link

Okay, I'll make a new issue -- in this repo correct?

@cmdcolin
Copy link
Collaborator

yep that'd be great :) thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

5 participants