From 793923b293e46a778422459fc9c88c6968c7f281 Mon Sep 17 00:00:00 2001
From: thegenemyers
Date: Tue, 10 Jan 2017 11:56:12 +0100
Subject: [PATCH] Fixed a number of typos
---
 ChangeLog.md | 7 ++++---
 GFA2-spec.md | 52 +++++++++++++++++++++++++++++++++-------------------
 2 files changed, 37 insertions(+), 22 deletions(-)
diff --git a/ChangeLog.md b/ChangeLog.md
index 0ff2268..e6fe8fb 100644
--- a/ChangeLog.md
+++ b/ChangeLog.md
@@ -1,8 +1,9 @@
 ## CHANGED
-* All ID's are now in *one* name space and every ID in a definitional context must be unique.
-If * is specified in place of an ID on edge, gap, and group lines, then that item does not
-have an ID and is presumably not referred to elsewhere (i.e. U- or O-lines).
+* All ID's, except external Fragment IDs, are now in *one* name space and every ID in a
+definitional context must be unique. External fragment IDs are in another name space.
+If * is specified in place of an ID on edge, gap, and group lines, then that item does
+not have an ID and is presumably not referred to elsewhere (i.e. U- or O-lines).
 * Sequences in S-lines were defined as [a-zA-Z]+. Unecessarily restrictive. Changed to any printable sequence excluding space, i.e [!-~]+.
diff --git a/GFA2-spec.md b/GFA2-spec.md
index eee91e1..c3d3bc3 100644
--- a/GFA2-spec.md
+++ b/GFA2-spec.md
@@ -9,7 +9,7 @@ graph at any stage of assembly, from the graph of all overlaps, to a final resol
 of contig paths with multi-alignments. Apart from meeting these needs, the extensions also supports other assembly and variation graph types.
-The proposal if for a *core standard*. As will be seen later in
+The proposal is for a *core standard*. As will be seen later in
 the technical specification, the format is **extensible** in that additional description lines can be added and additional SAM tags can be appended to core description lines.
@@ -34,7 +34,7 @@ assembly can be described. Finally, one can describe and attach a name to any *
 ## GRAMMAR ```
- <- (
| | | | )+
+ <- (
| | | | | )+
<- H {VN:Z:2.0} {TS:i:} *
@@ -57,9 +57,10 @@ assembly can be described. Finally, one can describe and attach a name to any *
 <- [+-] <- | *
- <- "any syntactically valid SAM tag"
+ <- [A-Za-z0-9][A-Za-z0-9]:[ABHJZif]:[ -~]*
 <- {$}
+ <- [0-9]+
 <- * | [!-~]+ <- * | |
@@ -80,25 +81,28 @@ operator <-, and the following marks:
 Like GFA, GFA2 is tab-delimited in that every lexical token is separated from the next by a single tab.
-Each descriptor line must begin with a letter and lies on a single line with no white space
-before the first symbol. The tokens that generate descriptor lines are \, \,
+Each record line must begin with a letter and lies on a single line with no white space
+before the first symbol. The tokens that generate record lines are \, \,
 \, \, \, and \. Any line that does not begin with a recognized code (i.e. H, S, F, E, G, O, or U) can be ignored.
-This will allow users to have additional descriptor lines specific to their special processes.
-Moreover, the suffix of any GFA2 descriptor line may contain any number of user-specific SAM
-tags which may be ignored by software designed to support the core standard.
+This will allow users to have additional record lines specific to their special processes.
+Moreover, the suffix of any GFA2 record line may contain any number of user-specific SAM
+tags which may be ignored by software designed to support the core standard. Tags with lower-case
+letters are reserved for end-users.
-There is one name space for all identifiers for segments, external fragments, edges, gaps,
-and groups. It is
+There is one name space for all identifiers for segments, edges, gaps,
+and groups. External fragment ID's are assumed to be in a distinct name space. It is
 an error for any identifier to be used twice in a defining context. Note carefully that instead of an identifier, one can use a * for edges, gaps, and groups, implying that an
-id is not needed as the item will not be referred to elsewhere in the specification.
Moreover,
+id is not needed as the item will not be referred to elsewhere in the file. Moreover,
 almost all references to identifiers are oriented, by virtue of a post-fix + or - sign. A +-sign indicates the object is in the orientation it was defined, and a --sign indicates it should be reverse-complemented. ## SEMANTICS
+### Header
+
 The **header** contains an optional 'VN' SAM-tag version number, 2.0, and an optional 'TS' SAM-tag specifying the default the trace point spacing for any Dazzler traces specified to accelerate alignment computation.
@@ -107,6 +111,8 @@ A 'TS' tag can occur after the fixed arguments on any E-, G-, or F-line in which
 the trace spacing to use with the trace on that specific line, otherwise the default spacing is used.
+### Segment
+
 A **segment** is specified by an S-line giving a user-specified ID for the sequence, its length in bases, and the string denoted by the segment or * if absent. The sequence is typically expected to be bases or IUPAC characters, but GFA2 places
@@ -116,6 +122,8 @@ an indication to a drawing program of how long to draw the representation of the
 The segment sequences and any CIGAR strings referring to them if present follow the *unpadded* SAM convention.
+### Fragment
+
 **Fragments**, if present, are encoded in F-lines that give (a) the segment they belong to, (b) an oriented external ID that references a sequence in an external collection (e.g. a database of reads or segments in another GFA2 or SAM file),
@@ -123,6 +131,8 @@ in an external collection (e.g. a database of reads or segments in another GFA2
 the interval of the fragment that contributes to the segment. One concludes with either a trace or CIGAR string detailing the alignment, or a \* if absent.
+### Edge
+
 **Edges** are encoded in E-lines that in general represent a local alignment between arbitrary intervals of the sequences of the two vertices in question.
One gives first an edge ID or * and then the *oriented* segment ID’s of the two vertices involved.
@@ -152,7 +162,7 @@ the second segment to align to the next *TS* characters in the first segment whe
 the *TS* is either the default trace spacing given in a header line with the TS SAM-tag, or the spacing given in a TS SAM-tag on the line of the edge. If a \* is given as the alignment
-note that it is still possible to compute the implied alignment by brute force.
+note that it is still possible to compute the implied alignment from the sequences.
 The GFA2 concept of edge generalizes the link and containment lines of GFA. For example a GFA edge which encodes what is called a dovetail overlap (because two ends overlap) is a GFA2
@@ -178,11 +188,13 @@ encounters models in which there is no overlap (basically edge-labelled models c
 vertex-labelled form). This is captured by edges for which beg1 = end1 = x$ and beg2 = end2 = 0 (i.e. 0-length overlap of the end of segment 1 and the beginning of segment 2)!
+### Gap
+
 While not a concept for pure DeBrujin or long-read assemblers, it is the case that paired end data and external maps often order and orient contigs/vertices into scaffolds with intervening gaps. To this end we introduce a **gap** edge described in G-lines that give the estimated gap distance between the two segment sequences and the variance of that estimate
-or 0 if no estimate is available. The first segment is in the orientation given by the first
+or \* if no estimate is available. The first segment is in the orientation given by the first
 sign indicator and the second segment is in the orientation given by the second sign indicator.
The next integer gives the expected distance between the first and second segment in their respective orientations, and the final field is either an integer giving the variance in this
@@ -191,20 +203,22 @@ Relationships in E-lines are fixed and known, where as in a G-line, the distance
 is an estimate and the line type is intended to allow one to define assembly **scaffolds**.
+### Group
+
 A **group** encoding on a U- or O-line allows one to name and specify a subgraph of the overall graph.
-Such a collection could for example be hilighted by a drawing program on
+Such a collection could for example be highlighted by a drawing program on
 command, or might specify decisions about tours through the graph. U-lines encode *unordered* collections and O-lines encode *ordered* collections (defined in the next paragraph),
-which we alternatively call **paths** and **sets**, respectively.
+which we alternatively call **sets** and **paths**, respectively.
 The remainder of the line then consists of an optional ID for the collection followed by a non-empty list of ID's referring to segments, edges, or other groups that are *separated by single spaces* (i.e. the list is in a single column of the tab-delimited format). In the case of paths every reference must be oriented, and not so in a set.
-U/O-lines with the same name are considered
-to be concatenated together in the order in which they appear, and a group list may refer
-to another group recursively.
+U/O-lines with the same name are considered to be concatenated together in the order in which
+they appear, and a group list may refer to another group recursively. It is an error for a U-line
+and an O-line to have the same name.
 An unordered collection or set defined in a U-line refers to the subgraph induced by the vertices and edges in the collection (i.e. one adds all edges
@@ -215,7 +229,7 @@ and the implied adjacent objects between consecutive objects in the list where t
 orientation of the objects matters (e.g.
the edge between two consecutive segments, the segment between two consecutive edges, etc.) A set can contain a reference to a path, but not vice versa, in which case the orientation
-of the objects in the path become irrevalent.
+of the objects in the path become irrelavent.
 ## BACKWARD COMPATIBILITY WITH GFA
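
Reviewer's note on the hunks above: the patch replaces the informal tag description with an explicit pattern and states that lines not starting with H, S, F, E, G, O, or U can be ignored. Below is a minimal Python sketch of how a reader might check those two rules. The function names `is_valid_tag`, `is_user_tag`, and `split_record` are hypothetical and not part of the specification. Reading "tags with lower-case letters" as "at least one lower-case letter in the two-character tag name" is an assumption, not something the patch states.

```python
import re

# Tag pattern copied from the grammar line added by the patch.
TAG_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9]:[ABHJZif]:[ -~]*")

# Record codes the patched text lists as recognized by the core standard.
CORE_CODES = frozenset("HSFEGOU")


def is_valid_tag(token: str) -> bool:
    """True if the whole token matches the tag pattern from the patch."""
    return TAG_RE.fullmatch(token) is not None


def is_user_tag(token: str) -> bool:
    """True for a valid tag whose two-character name has a lower-case letter.

    Assumption: this is one possible reading of "tags with lower-case
    letters are reserved for end-users"; the patch does not spell it out.
    """
    return is_valid_tag(token) and any(c.islower() for c in token[:2])


def split_record(line: str):
    """Split a tab-delimited line into (code, fields).

    Returns None when the line does not begin with a recognized code,
    which the patched text says can be ignored.
    """
    fields = line.rstrip("\n").split("\t")
    code = fields[0]
    if code not in CORE_CODES:
        return None
    return code, fields[1:]
```

For example, `split_record("H\tVN:Z:2.0")` yields the code `H` with one field, and that field passes `is_valid_tag`. The sketch does not validate the fixed columns of each record type.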