Walks #47

Closed
wants to merge 1 commit into from

Conversation

ekg
Collaborator

@ekg ekg commented Sep 23, 2016

This is an alternative way of encoding sequences relative to the graph that is equivalent to the model used in vg for paths and alignments to the graph.

@ekg
Collaborator Author

ekg commented Sep 28, 2016

Bump. Suggest merge if there aren't objections.

@sjackman
Collaborator

Hi, Erik. Can the fragment F records of GFA2 be used to describe your walks? I believe the only missing field is MappingRank. Could that field go in an optional tagged field of the fragment F record?

@sjackman sjackman added this to the 1.1 milestone Sep 28, 2016
@ekg
Collaborator Author

ekg commented Sep 28, 2016

@sjackman here we are trying to maintain compatibility between vg's schema and GFA1, which is broken due to the incomplete implementation of paths in P. I don't know about F records, but maybe they could be expressed as Walks? I don't think we should require the use of tagged fields for critical aspects of the format.

@sjackman
Collaborator

sjackman commented Sep 28, 2016

I believe a walk can be described as an ordered (PO or O) subset of fragment F records in GFA2. Could you please check the GFA2 spec and see whether you agree?

@lh3
Collaborator

lh3 commented Sep 28, 2016

@ekg, could you elaborate on the CIGAR field on your Walk-line? Is it the alignment of a segment against the walk sequence? Do you allow mismatches and gaps between the segment and the walk sequence?

@ekg
Collaborator Author

ekg commented Sep 28, 2016

could you elaborate on the CIGAR field on your Walk-line? Is it the alignment of a segment against the walk sequence? Do you allow mismatches and gaps between the segment and the walk sequence?

Yes. Walks are equivalent to vg "Paths", which are generic alignments between sequences and graphs that allow for mismatches and gaps. They can thus also be used to define alignments to the graph, as well as sequences that are "fully embedded" in the graph.

@sjackman

I believe a walk can be described as an ordered (PO or O) subset of fragment F records in GFA2. Could you please check the GFA2 spec and see whether you agree?

The F records are similar but are not possible for us to use as they do not encode ranks. If we always need ranks, we shouldn't include them in an optional field. So, F is not sufficient in its current form and I do not think we should hack support through an optional field.

The entire space of things you can encode in GFA1/2 should be describable using only sequences and mappings between them. This suggests that F, E, L, G, W, and P have completely overlapping scope. They are just different ways of talking about alignments/mappings between the vertices of the graph and possibly implicit sequences elsewhere. Which one to use seems to depend on whether your sequences are external or internal, whether you want to think of them as edges or not, whether the edges are overlap-tail or internal, and whether your path is fully embedded or not.

Walks, like these other record types, can be partly represented by the other ones. Whether we should merge all these types or not is another conversation. In vg, we effectively have represented all of them except L with vg::Paths, so I'd support merging them. However, this PR is not about resolving this. Here, we just want to maintain compatibility between our working implementation and GFA.

@lh3
Collaborator

lh3 commented Sep 28, 2016

Yes. Walks are equivalent to vg "Paths", which are generic alignments between sequences and graphs that allow for mismatches and gaps. They can thus also be used to define alignments to the graph, as well as sequences that are "fully embedded" in the graph.

Thanks for the clarification. A follow-up question, just to make sure I understand your intention. Suppose we have a GFA1 graph containing only zero-length overlaps (i.e. "blunt" ends). Do you still need to encode mismatches and gaps in CIGAR on your walk lines? I guess the answer is yes?

@ekg
Collaborator Author

ekg commented Sep 28, 2016

@lh3 leaving CIGAR elements (or any complete description of the alignment) allows us to add walks to the graph without modifying the other records (for instance S and L). We could even imagine the paths existing in their own independent file, but referring to another GFA-encoded sequence graph. Does this make sense?

@sjackman sjackman self-assigned this Sep 28, 2016
@lh3
Collaborator

lh3 commented Sep 29, 2016

@ekg I finally see why you need CIGAR. Suppose we have such a graph:

S A * LN:i:100
S B * LN:i:100
L A + B + 10M2I15M

and suppose we have a walk from A+ to B+. What CIGAR will you use at "?" below?

W A foo 1 + 0 100M
W B foo 2 + 0 ?

@ekg
Collaborator Author

ekg commented Sep 29, 2016

@lh3 I think that in this case the W records could be:

W A foo 1 + 0 100M
W B foo 2 + 25 75M

I am assuming that the L CIGAR is defined in terms of A.

Does this make sense?
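
To make the two records above concrete, here is a minimal Python sketch of how a reader might spell the sequence of walk foo from them. The field order (segment, walk name, rank, orientation, offset, CIGAR) is inferred from the example lines only, the helper names are invented for illustration, and the segment sequences are placeholders because the example graph gives lengths (LN:i:100) rather than bases.

```python
import re

def parse_walk_record(line):
    """Split one W line into named fields (field order inferred from the example)."""
    tag, segment, walk, rank, orient, offset, cigar = line.split()
    if tag != "W":
        raise ValueError(f"not a W line: {line!r}")
    return {"segment": segment, "walk": walk, "rank": int(rank),
            "orient": orient, "offset": int(offset), "cigar": cigar}

def spell_walk(records, sequences):
    """Concatenate the matched part of each segment in rank order.

    Sketch only: handles forward-strand records whose CIGAR is a single <n>M.
    """
    pieces = []
    for rec in sorted(records, key=lambda r: r["rank"]):
        match = re.fullmatch(r"(\d+)M", rec["cigar"])
        if rec["orient"] != "+" or match is None:
            raise ValueError("sketch only supports '+' records with an <n>M CIGAR")
        start = rec["offset"]
        pieces.append(sequences[rec["segment"]][start:start + int(match.group(1))])
    return "".join(pieces)

# Placeholder bases: hypothetical, chosen only so the two segments are distinguishable.
sequences = {"A": "A" * 100, "B": "C" * 100}
records = [parse_walk_record(line)
           for line in ("W A foo 1 + 0 100M", "W B foo 2 + 25 75M")]
walk_seq = spell_walk(records, sequences)
print(len(walk_seq))  # 175
```

Under these assumptions the walk spells all 100 bases of A followed by the last 75 bases of B.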

@lh3
Collaborator

lh3 commented Sep 29, 2016

For overlap CIGAR, I would take A as the reference and B as the query. Then B has 27bp in the overlap. That is really the problem with overlap CIGAR. I understand what you mean now.
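
The asymmetry can be checked mechanically. A minimal Python sketch, assuming the usual SAM operator semantics (M, = and X consume both sequences, I and S consume only the query, D and N only the reference); the function name cigar_spans is made up for illustration:

```python
import re

# SAM-style operators, grouped by which side of the alignment they consume.
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")

def cigar_spans(cigar):
    """Return (reference_length, query_length) covered by a CIGAR string."""
    ref_len = query_len = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in CONSUMES_REF:
            ref_len += int(count)
        if op in CONSUMES_QUERY:
            query_len += int(count)
    return ref_len, query_len

# Overlap from the example link: L A + B + 10M2I15M
ref_len, query_len = cigar_spans("10M2I15M")
print(ref_len, query_len)  # 25 27
```

With A as the reference the overlap covers 25 bp of A but 27 bp of B, so under this reading the second W record would presumably start at offset 27 on B rather than 25.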

Your walk is closer to the golden path concept I mentioned earlier. I like that because it explicitly says what sequence a path/walk spells. The current Path definition is ambiguous when there are inexact overlaps.

@ekg I would prefer the following Walk:

W  <walkName>  <rank>  <segName>  <ori>  <segStart>  <segEnd>  [CG:Z:<CIGAR>]

where segStart<segEnd are on the segment strand. I put CIGAR in an optional field because I feel it is a little too specific to vg. Most of the time we only work with paths/walks that are composed of segments.

Each walk has a complement walk, which can be derived by reversing the order and flipping the orientation.
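
As a rough illustration of the complement rule, here is a Python sketch that parses the seven fixed fields of the proposed W line and derives the complement walk. It assumes tab-delimited fields, that ranks are renumbered from 1 in the reversed order, and that segStart/segEnd stay unchanged because they are given on the segment strand; the class and function names are hypothetical, and optional tags are ignored.

```python
from dataclasses import dataclass, replace

FLIP = {"+": "-", "-": "+"}

@dataclass(frozen=True)
class WalkStep:
    walk: str
    rank: int
    segment: str
    orient: str      # "+" or "-"
    seg_start: int   # on the segment strand, seg_start < seg_end
    seg_end: int

def parse_step(line):
    """Parse the seven fixed fields; optional tags such as CG:Z: are ignored here."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7 or fields[0] != "W":
        raise ValueError(f"not a W line: {line!r}")
    return WalkStep(fields[1], int(fields[2]), fields[3], fields[4],
                    int(fields[5]), int(fields[6]))

def complement(steps):
    """Reverse the step order and flip each orientation."""
    ordered = sorted(steps, key=lambda s: s.rank)
    return [replace(step, rank=new_rank, orient=FLIP[step.orient])
            for new_rank, step in enumerate(reversed(ordered), start=1)]

steps = [parse_step("W\tfoo\t1\tA\t+\t0\t100"),
         parse_step("W\tfoo\t2\tB\t+\t27\t100")]
for step in complement(steps):
    print(step.rank, step.segment, step.orient, step.seg_start, step.seg_end)
```

A walk carrying CG tags would also need its CIGAR operations reversed, which this sketch does not attempt.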

@ggonnella
Contributor

ggonnella commented Sep 30, 2016

For overlap CIGAR, I would take A as the reference and B as the query.
That is really the problem with overlap CIGAR.

I think this should be clearly explained in the specification (not only in W lines, of course).

(Issue #53)

@ekg
Collaborator Author

ekg commented Oct 17, 2016

@lh3 given that the CIGAR is a critical part of the walk, why should we put it in the optional ad-hoc SAM-like key/value pairs? If we start putting everything there I begin to wonder why we even have the fixed fields at all. We could just use a structured format for the entire record and avoid the need to write a new parser...

What I've heard (from @richarddurbin and others) is that the tab-delimited format is easy to parse from C, and is preferable to a structured format like JSON for this reason. However, I don't understand how the key/value pairs fit into this perspective. They would seem to make things more complex and more flexible. For a core data exchange format, flexibility looks good on paper but in practice causes years of headaches. We will forever be chasing new characters that are used to specify new data types in the key/value pairs, as has been the case with the endlessly-shifting SAM CIGAR format.

@ekg
Collaborator Author

ekg commented Oct 17, 2016

We are really stuck here, because vg is not going to make Paths in the format that's now fixed, and it doesn't look like there is appetite to merge Walks.

My objective in this forum is to link something that actually works into the data format. I am not being hypothetical about anything, and I'm trying to make clear how I'm basing my arguments in experience. Yet, this perspective doesn't seem to be helpful. For my own sanity, I'm going to drop off for a while until things here stabilize.

vg is not currently emitting valid GFA, and I'm not planning on fixing this unless we get something like Walks or another graphical alignment object in the formats. Going forward we can concern ourselves with projecting from the various GFA-like things into vg's data model. The text output is useful for human and scripted interrogation of graphs, so it shouldn't be disabled. In the vg documentation we can note that this format is "GFA-simple" so that it is clear that it is not a directly-compatible format.

@lh3
Collaborator

lh3 commented Oct 17, 2016

In principle, the sequence spelled from a path should not be different from the segments on the path. Allowing differences between segments and the actual path sequence is an exception needed by vg alone. CIGAR is not a key property of general paths. In this sense, I am not convinced that CIGAR qualifies as a fixed field. In addition, CIGAR does not say how a base is substituted by another base or what sequences are inserted. For your purpose, CIGAR is inadequate by itself. You still need one additional tag anyway.

In practice, having CIGAR as an optional field does not make life much harder. You query the "CG" tag. If it is absent, you assume that part of the path is identical to the corresponding segment (sub)sequence; if present, you probably need to read another tag to generate an edited path.
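
A minimal Python sketch of that lookup, assuming the Walk layout proposed earlier in the thread (W, walkName, rank, segName, ori, segStart, segEnd, then optional SAM-style TAG:TYPE:VALUE fields, tab-delimited). The function names are hypothetical, and returning an all-"=" CIGAR for the absent case is just one way to express "identical to the segment (sub)sequence".

```python
def optional_tags(fields):
    """Map TAG -> (type, value) for SAM-style TAG:TYPE:VALUE fields."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = (type_code, value)
    return tags

def step_cigar(line, n_fixed=7):
    """Return the CG tag of a W line, or an exact-match CIGAR if the tag is absent."""
    fields = line.rstrip("\n").split("\t")
    tags = optional_tags(fields[n_fixed:])
    if "CG" in tags:
        return tags["CG"][1]
    # No CG tag: the step is assumed identical to segment[segStart:segEnd].
    seg_start, seg_end = int(fields[5]), int(fields[6])
    return f"{seg_end - seg_start}="

print(step_cigar("W\tfoo\t1\tA\t+\t0\t100"))                   # 100=
print(step_cigar("W\tfoo\t2\tB\t+\t27\t100\tCG:Z:10M2I61M"))  # 10M2I61M
```

The CG value in the second line is made-up example data, not taken from the thread.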

@ekg
Collaborator Author

ekg commented Oct 17, 2016

In principle, the sequence spelled from a path should not be different from the segments on the path. Allowing differences between segments and the actual path sequence is an exception needed by vg alone.

But this is also the core of GFA2, correct? Wouldn't Gap records be an instance of a path sequence not matching the underlying graph? What about Fragments that aren't fully embedded in the graph? And Edge records define mappings between different parts of the graph or different sequences in a way that isn't embedded in the S and L structure of the graph.

I would prefer not to use CIGAR, but rather this structured format that defines a function that takes a series of sequences to a new sequence via a series of edits to positions in the sequences: https://github.com/vgteam/vg/blob/master/src/vg.proto#L36-L99. However, this seems like a bridge too far for GFA1... So I have been sloppy in handling CIGARs.

Something like the CIGAR does seem necessary to explain which parts of the sequence are in the walk/path when the walk/path does not fully overlap the node or overlap it is aligned against.

As for making life harder, I'm noting that we can make a hack that detects if the CIGAR is somewhere in the optional fields... or we can maintain a full parser for all the SAM key/value stuff. It just seems like overkill to utilize the key/value pairs if we only need to have a cigar-like object, and every single mapping in the walk needs to have one.

@lh3
Collaborator

lh3 commented Oct 17, 2016

Assembly and variation representation are different worlds. In assembly, I don't think spelling a path differently from its segments is an often-requested feature. The bottom line is that we allow edits to be kept. It is really not that hard to work with tags.

I would prefer not to use CIGAR, but rather this structured format that defines a function that takes a series of sequences to a new sequence via a series of edits to positions in the sequences: https://github.com/vgteam/vg/blob/master/src/vg.proto#L36-L99. However, this seems like a bridge too far for GFA1... So I have been sloppy in handling CIGARs.

This is a good argument against having CIGAR as a fixed field. With optional tags, you have the flexibility to use other representations. Once we make CIGAR a fixed field, there is no chance to fix this mistake without breaking compatibility. We could use the CIGAR+MD combo to describe edits, but I admit it is a bad idea (PS: I invented MD). In SAM, I wish I had proposed something like "30=1X{C}2D2I{GT}20=" to keep all edits in one string. I hate the CIGAR+MD combo and I don't want GFA to replicate the same mistake.
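
To show what a single-string edit description could look like in practice, here is a Python sketch that parses and applies a string such as "30=1X{C}2D2I{GT}20=". The semantics are an assumption read off the example, since the thread does not define them: "=" copies n segment bases, "X{...}" replaces n bases with the given ones, "D" skips n segment bases, and "I{...}" inserts the given bases. Function names are hypothetical.

```python
import re

# <count><op>, with the new bases in braces for substitutions and insertions.
EDIT_OP = re.compile(r"(\d+)([=XDI])(?:\{([ACGTN]+)\})?")

def parse_edits(edit_string):
    """Split an edit string into (count, op, bases) tuples."""
    ops, pos = [], 0
    while pos < len(edit_string):
        m = EDIT_OP.match(edit_string, pos)
        if m is None:
            raise ValueError(f"cannot parse edit string at offset {pos}")
        count, op, bases = int(m.group(1)), m.group(2), m.group(3)
        if op in "XI" and (bases is None or len(bases) != count):
            raise ValueError(f"{op} needs exactly {count} bases in braces")
        if op in "=D" and bases is not None:
            raise ValueError(f"{op} does not take bases")
        ops.append((count, op, bases))
        pos = m.end()
    return ops

def apply_edits(segment, ops):
    """Produce the edited sequence from a segment (sub)sequence."""
    out, pos = [], 0
    for count, op, bases in ops:
        if op == "=":            # copy matching bases
            out.append(segment[pos:pos + count])
            pos += count
        elif op == "X":          # substitute
            out.append(bases)
            pos += count
        elif op == "D":          # delete segment bases
            pos += count
        else:                    # "I": insert new bases
            out.append(bases)
    return "".join(out)

ops = parse_edits("30=1X{C}2D2I{GT}20=")
edited = apply_edits("A" * 53, ops)  # placeholder 53 bp segment
print(edited == "A" * 30 + "C" + "GT" + "A" * 20)  # True
```

Unlike CIGAR plus MD, one string here carries both the alignment shape and the replacement bases, so no second tag is needed to reconstruct the path sequence.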

PS:

But this is also the core of GFA2, correct?

No, I don't think so. GFA2 is not using edits to keep variations. I bet we will go through a similar discussion to the one in this thread to refine/redefine GFA2 paths.

@ekg
Collaborator Author

ekg commented Oct 17, 2016

OK, I see your point. There is no need to make CIGAR in its current form a fixed field because of the risk of getting stuck with something sub-optimal.

However, if we had a correctly-defined single description of the edit function, then I'd support making it fixed because it is a fundamental feature of the object. Not having it would be as strange as SAM records without CIGARs.

But taking your point further, why not use the SAM-like key/value pairings for everything? It would be more flexible. Taking this even further, why should we use this particular k/v format at all? We have other mechanisms to encode structured data that are well supported in any programming language and let us have lists, maps, strings, numbers, and booleans. Maybe the data format itself is just a distraction from the data models we want to encode, which are best defined with mathematical formalism.

Assembly and variation representation are different worlds.

This is true. At present, there isn't much overlap between assembly and variation representation. But what if we look to the future and think about the state of affairs that would best resolve standing problems with both worlds?

The problems with variation representation (such as in the realm of VCF) are numerous. Many people acknowledge that it is more natural to represent variation in an assembly graph.

On the other side, assembly (in the sense of obtaining contigs from a set of shorter input sequences) is also rarely possible in a perfect sense, and so we only lose information by collapsing out the uncertainty provided by graphical models.

The two approaches appear to be separate, but I think there is a huge benefit to bringing them together.

In assembly, I don't think spelling a path differently from its segments is an often-requested feature.

You can think of vg as a toolkit for distributed assembly. Because the process is distributed, we can't immediately mutate the state of the graph when we get new information from reads. The base graph is read-only during the alignment phase of the assembly process. As such, we need to have a way to talk about edits to the graph before they are committed into it.

It looks to me that this same situation is acknowledged in GFA2. There, we want to represent various constraints on the graph which have been detected through diverse sources of information but not yet collapsed into the graph itself. We want to feed this uncertainty along and bring in many sources of data to resolve it. As a result we get E, F, and G namespaces, which allow us to encode features of alignments of external or internal sequences to the base graph.

I'll go out on a limb and posit that all of these things can be encoded via a well-designed alignment structure, with the benefit that the model could be decoupled and simplified.

The bottom line is that we allow edits to be kept. It is really not that hard to work with tags.

You're right that the data can be encoded this way, so it's workable.

I'm frustrated that we tend to write and maintain new structured data formats rather than use universally-supported alternatives. To some extent I should be accepting of this, but I can't help but ask for it to change. This state of affairs isn't ergonomic.

@lh3
Collaborator

lh3 commented Oct 17, 2016

But taking your point further, why not use the SAM-like key/value pairings for everything?

In my view, a good format should approximately mirror the in-RAM data representation. In C, we define a struct like:

struct abc {
  int foo;
  char *bar;
  void *ptr;
};

because we use some fields often but others only as meta information. foo and bar here become mandatory fields and the others go to optional tags. Often there is no clear-cut boundary, though.

It looks to me that this same situation is acknowledged in GFA2. There, we want to represent various constraints on the graph which have been detected through diverse sources of information but not yet collapsed into the graph itself.

My understanding is the exact opposite. GFA2 is trying to put all information into the graph itself. E-lines are proposed to encode everything into position-augmented topology. G-lines are part of the assembly graph. They are really L-lines with negative overlap lengths. F-lines only describe how contigs are assembled. They are not intended for vg-like read mapping, where reads often bridge multiple segments. GFA2 is trying to solve the same problem as vg, but is approaching it from a different angle. I actually like the vg approach better. That is partly why I opposed replacing GFA1 with GFA2 right away.

I'm frustrated that we tend to write and maintain new structured data formats rather than use universally-supported alternatives.

This will lead to a much longer discussion. I am conveniently overlooking this point for now ;-)

@stale

stale bot commented Nov 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Nov 15, 2018
@stale stale bot closed this Nov 23, 2018