How to represent Inversions #6

Closed
pmelsted opened this issue Jul 27, 2015 · 11 comments
@pmelsted
Collaborator

This was raised in #3: how can the format represent inversions, and is this something we want?

There are three main use cases for GFA: assembly graphs, long reads, and variation graphs. Note that inversions are only explicitly needed in variation graphs.

An assembler would naturally construct two contigs for an inverted segment and long reads would be unaffected.

What is annoying is that the inverted segments are not complemented, so this would mean we would need to come up with a new symbol or mechanism to denote this.

@sjackman
Collaborator

I'm confused as to how DNA can become reversed without being complemented. Can you give me an example?

@ekg
Collaborator

ekg commented Jul 28, 2015

I am not completely clear how this is possible either.

That said, we should not limit ourselves to things that we know are normal biologically. That would be like making a FASTA spec that says only biologically viable DNA sequences can be represented.

I don't think there is any problem for inversions if each node is like a virtual pair of nodes connected by a hidden link (we could imagine that this link carries the sequence label of the node). Then edges always come from one end and go to another.

You can also represent the deletion of a node without referring to its neighbors, which is very useful.

I also can't see how this would present a problem for other uses. The more simple and general we can keep things the less constraint the various uses will need to work around.

However, if links have only + and - versions, as they do now, then we can't convey enough information to represent this.

Graphviz has dot format, which is generally able to represent any graph you can think up. It is also quite simple to make simple graphs. We should aim for this level of generality.
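
The "virtual pair of nodes" picture above can be sketched in a few lines. This is only one possible reading of it, not something specified in the thread: each segment is assumed to have a `start` side and an `end` side, and a link's `+`/`-` orientations decide which sides it attaches to. The function name `link_to_ends` and the side labels are hypothetical.

```python
def link_to_ends(from_seg, from_orient, to_seg, to_orient):
    """Map a link 'from_seg from_orient -> to_seg to_orient' to segment sides.

    Assumption: a segment read forward (+) is entered at its start and left
    at its end; read in reverse (-), it is entered at its end and left at
    its start.
    """
    for orient in (from_orient, to_orient):
        if orient not in ("+", "-"):
            raise ValueError(f"unknown orientation: {orient!r}")
    out_side = "end" if from_orient == "+" else "start"
    in_side = "start" if to_orient == "+" else "end"
    return (from_seg, out_side), (to_seg, in_side)


# 'L 0 + 1 +' joins the end of 0 to the start of 1.
print(link_to_ends("0", "+", "1", "+"))
# 'L 0 + 1 -' joins the end of 0 to the end of 1.
print(link_to_ends("0", "+", "1", "-"))
```

Under this reading every link joins two sides, so the two orientation symbols already say which end an edge leaves from and which end it arrives at.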

@pmelsted
Collaborator Author

These inversions happen when the molecular machinery goes wrong and it's usually bad for you.

I don't see a clean way of representing this without adding a new operator. We could use ~ (tilde) to denote non-complemented, so ~+ and ~- would mean ... ugh.

Do you have a pointer to the ga4gh graph discussion about this?

@sjackman
Collaborator

I do not believe that it is possible by natural mutation to reverse a DNA sequence without also complementing it. It is not helpful to design a file format to handle cases that are not physically possible.

@ababaian

I've never heard of reverse-non-complement in vivo. Chemically it makes no sense, since it would require breaking each individual 5' to 3' bond and flipping it. The only time I've ever seen it used is as a control when searching for low-information regions within repetitive sequences.

A technical artifact which arises in silico though, that's easy to see.

@pmelsted
Collaborator Author

From this diagram http://ghr.nlm.nih.gov/handbook/illustrations/inversion it looks like the region will be reverse-complemented.

@sjackman
Collaborator

Yes, correct. Reverse and complemented. The - value of the orientation field indicates reversed and complemented.
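
To make the distinction concrete, here is a minimal sketch contrasting a plain reversal with the reverse complement that the `-` orientation stands for. The helper names are made up for illustration and only the four unambiguous bases are handled.

```python
# Translation table for the four unambiguous bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_only(seq):
    """Reverse the string without complementing (the disputed case)."""
    return seq[::-1]


def reverse_complement(seq):
    """Complement every base, then reverse: the opposite strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def oriented(seq, orient):
    """Return the sequence as read in orientation '+' or '-'."""
    if orient == "+":
        return seq
    if orient == "-":
        return reverse_complement(seq)
    raise ValueError(f"unknown orientation: {orient!r}")


print(reverse_only("AAC"))        # CAA
print(reverse_complement("AAC"))  # GTT
print(oriented("AAC", "-"))       # GTT
```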

@ekg
Collaborator

ekg commented Jul 29, 2015

I disagree that we should only design for things that are physically possible. The graphs we are all working with have no natural chemical basis. No genome will ever look like an overlap or de Bruijn graph, so a design rule of this type would preclude everything we are doing. Maybe I am taking the metaphor too far though :)

The use case that makes a lot of sense to me is describing the deletion of an entire node. If we cannot describe which end edges go from and to, then this cannot be done in a node-local sense. You would need to add edges between the inbound and outbound nodes where an intermediary has been deleted and a path that skips it is required.
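
As a rough illustration of the "add edges between the inbound and outbound nodes" step, the sketch below pairs every link entering a node with every link leaving it in the same orientation. The tuple layout and the name `bypass_links` are assumptions made for this example, and it deliberately ignores the fact that a link can also be read on the opposite strand.

```python
def bypass_links(links, node):
    """Return the links needed to skip `node`.

    Each link is a tuple (from_seg, from_orient, to_seg, to_orient).
    Simplification: links are only followed in the direction written.
    """
    inbound = [l for l in links if l[2] == node]
    outbound = [l for l in links if l[0] == node]
    bypass = []
    for src, src_orient, _, in_orient in inbound:
        for _, out_orient, dst, dst_orient in outbound:
            # Only join links that traverse the node in the same orientation.
            if in_orient == out_orient:
                candidate = (src, src_orient, dst, dst_orient)
                if candidate not in links and candidate not in bypass:
                    bypass.append(candidate)
    return bypass


links = [("0", "+", "1", "+"), ("1", "+", "2", "+")]
print(bypass_links(links, "1"))  # [('0', '+', '2', '+')]
```

The number of bypass links grows with the product of inbound and outbound links, which is part of why a node-local way to express a deletion is attractive.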

As for representing non-complemented inversions, it seems correct that another operator would be required to clarify this. I guess an extension of the CIGAR concept would be sufficient? The reason for not duplicating these as reverse-complemented sequences is to enable unambiguous alignment to and annotation of the graph. With minor extensions to the exchange format, the inversion can be encoded in the graph without duplication.

@adamnovak, @benedictpaten, and @haussler have been strong proponents of this idea and maybe could better clarify what I am describing.

@sjackman
Collaborator

> You would need to add edges between the inbound and outbound nodes where an intermediary has been deleted and a path that skips it is required.

Yes, that's correct. A deletion is represented like so:
Path 11 is AAACCCATA
Path 12 is AAAATA

S 0 AAA
S 1 CCC
S 2 ATA
L 0 + 1 + 0M
L 0 + 2 + 0M
L 1 + 2 + 0M
P 11 0+,1+,2+ 0M,0M,0M
P 12 0+,2+ 0M,0M,0M

[Image: del]
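
The deletion example can be checked mechanically. The sketch below is not a GFA parser: it assumes whitespace-separated fields as pasted in this thread, reads only the S and P lines, ignores the CIGAR column, and treats every overlap as 0M so that a path is spelled by plain concatenation. The function names are invented for illustration.

```python
GFA_TEXT = """\
S 0 AAA
S 1 CCC
S 2 ATA
L 0 + 1 + 0M
L 0 + 2 + 0M
L 1 + 2 + 0M
P 11 0+,1+,2+ 0M,0M,0M
P 12 0+,2+ 0M,0M,0M
"""

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_segments_and_paths(text):
    """Collect segment sequences and path steps; other record types are skipped."""
    segments, paths = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            # A step such as '1-' is a segment name followed by an orientation.
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return segments, paths


def spell(segments, steps):
    """Concatenate oriented segments, assuming all overlaps are 0M."""
    parts = []
    for name, orient in steps:
        seq = segments[name]
        if orient == "-":
            seq = seq.translate(COMPLEMENT)[::-1]
        parts.append(seq)
    return "".join(parts)


segments, paths = parse_segments_and_paths(GFA_TEXT)
for name, steps in paths.items():
    print(name, spell(segments, steps))
# 11 AAACCCATA
# 12 AAAATA
```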

@sjackman
Collaborator

> I disagree that we should only design for things that are physically possible.

Biology has enough weirdness as it is. Let's prioritize first handling the cases that are physically possible.

@pmelsted
Collaborator Author

Similarly, a reverse-complemented (RC) inversion can be represented directly:

S 0 AAA
S 1 CCC
S 2 ATA
L 0 + 1 + 0M
L 0 + 1 - 0M
L 1 + 2 + 0M
L 1 - 2 + 0M
P 11 0+,1+,2+ 0M,0M,0M
P 12 0+,1-,2+ 0M,0M,0M

I think the case you have in mind, adding intermediate nodes, does happen in de Bruijn graphs, but since you can specify 0M as the overlap you don't need them here.
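
With segment 1 read as `1-`, path 12 spells AAA + GGG + ATA = AAAGGGATA, since the reverse complement of CCC is GGG. A small sketch can confirm that each path in the inversion example only uses adjacencies that exist as links. The data layout and function names are assumptions for this example; a link is also accepted when it matches the same adjacency read on the opposite strand.

```python
FLIP = {"+": "-", "-": "+"}

# Links from the inversion example: (from_seg, from_orient, to_seg, to_orient).
LINKS = {
    ("0", "+", "1", "+"),
    ("0", "+", "1", "-"),
    ("1", "+", "2", "+"),
    ("1", "-", "2", "+"),
}


def has_link(links, a, b):
    """True if step a can be followed by step b; steps are (segment, orientation)."""
    forward = (a[0], a[1], b[0], b[1])
    # The same adjacency seen from the opposite strand.
    backward = (b[0], FLIP[b[1]], a[0], FLIP[a[1]])
    return forward in links or backward in links


def path_is_supported(links, steps):
    """Check every consecutive pair of steps against the link set."""
    return all(has_link(links, a, b) for a, b in zip(steps, steps[1:]))


print(path_is_supported(LINKS, [("0", "+"), ("1", "+"), ("2", "+")]))  # True
print(path_is_supported(LINKS, [("0", "+"), ("1", "-"), ("2", "+")]))  # True
print(path_is_supported(LINKS, [("0", "+"), ("2", "+")]))              # False
```

The last call fails because this graph, unlike the deletion example, has no link from 0 directly to 2.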

@sjackman sjackman self-assigned this Aug 6, 2015
@sjackman sjackman closed this as completed Aug 6, 2015