---
title: Graphical Fragment Assembly (GFA) Format Specification
author: The GFA Format Specification Working Group
date: 2022-06-07
---
The master version of this document can be found at
<https://github.com/GFA-spec/GFA-spec>
# The GFA Format Specification
The purpose of the GFA format is to capture sequence graphs as the product of an assembly, a representation of variation in genomes, splice graphs in genes, or even overlap between reads from long-read sequencing technology.
The GFA format is a tab-delimited text format for describing a set of sequences and their overlap. The text is encoded in UTF-8 but is not allowed to use a codepoint value higher than 127. The first field of the line identifies the type of the line. Header lines start with `H`. Segment lines start with `S`. Link lines start with `L`. Jump lines (since v1.2) start with `J`. A containment line starts with `C`. A path line starts with `P`. Walk lines (since v1.1) start with `W`.
## Terminology
+ **Segment**: a continuous sequence or subsequence.
+ **Link**: an overlap between two segments. Each link is from the end of one segment to the beginning of another segment. The link stores the orientation of each segment and the number of overlapping base pairs.
+ **Jump**: (since v1.2) a connection between two oriented segments. Similar to a link, but it does not imply a direct adjacency between the segments and instead provides an estimated distance between them. The main use case is to specify segment relations across assembly gaps.
+ **Containment**: an overlap between two segments where one is contained in the other.
+ **Path**: an ordered list of oriented segments, where each consecutive pair of oriented segments is supported by a link or a jump record.
+ **Walk**: (since v1.1) an ordered list of oriented segments, intended for pangenome use cases. Each consecutive pair of oriented segments must correspond to a 0-overlap link record.
## Line structure
Each line in GFA has tab-delimited fields, and the first field defines the type of the line. The line type determines which required fields follow. The required fields are followed by optional fields.
| Type | Description |
|------|-------------|
| `#` | Comment |
| `H` | Header |
| `S` | Segment |
| `L` | Link |
| `J` | Jump (since v1.2) |
| `C` | Containment |
| `P` | Path |
| `W` | Walk (since v1.1) |
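
The dispatch on the first field can be sketched in a few lines of Python. This is a minimal illustration, not part of the specification: the names `RECORD_TYPES` and `split_record` are hypothetical, and raising an error on an unknown record type is an assumption made for the example.

```python
# Record types taken from the table above; the names on the right are
# illustrative labels, not identifiers defined by the specification.
RECORD_TYPES = {
    "#": "comment",
    "H": "header",
    "S": "segment",
    "L": "link",
    "J": "jump",
    "C": "containment",
    "P": "path",
    "W": "walk",
}


def split_record(line):
    """Return (record_kind, fields) for one line of a GFA file."""
    line = line.rstrip("\n")
    if line.startswith("#"):
        # Comment lines are kept whole; they have no further fields.
        return "comment", [line]
    fields = line.split("\t")
    kind = RECORD_TYPES.get(fields[0])
    if kind is None:
        # Assumption for this sketch: unknown record types are an error.
        raise ValueError(f"unknown record type: {fields[0]!r}")
    return kind, fields
```

A caller would typically loop over the file and pass `fields` to a parser specific to the returned kind.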
## Optional fields
All optional fields follow the `TAG:TYPE:VALUE` format where `TAG` is a two-character string that matches `/[A-Za-z][A-Za-z0-9]/`. Each `TAG` can only appear once in one line. A `TAG` containing lowercase letters is reserved for end users. A `TYPE` is a single case-sensitive letter which defines the format of `VALUE`.
| Type | Regexp | Description
|------|-------------------------------------------------------|------------
| `A` | `[!-~]` | Printable character
| `i` | `[-+]?[0-9]+` | Signed integer
| `f` | `[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?` | Single-precision floating number
| `Z` | `[ !-~]+` | Printable string, including space
| `J` | `[ !-~]+` | [JSON][], excluding new-line and tab characters
| `H` | `[0-9A-F]+` | Byte array in hex format
| `B` | `[cCsSiIf](,[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)+` | Array of integers or floats
[JSON]: http://json.org/
For type `B`, array of integers or floats, the first letter indicates the type of numbers in the following comma separated array. The letter can be one of `cCsSiIf`, corresponding to `int8_t` (signed 8-bit integer), `uint8_t` (unsigned 8-bit integer), `int16_t`, `uint16_t`, `int32_t`, `uint32_t` and `float`, respectively.
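As an illustration of the `TAG:TYPE:VALUE` layout, the following sketch converts one optional field into a Python value. The function name `parse_tag` and the choice of Python result types (for example `bytes` for `H` and a plain list for `B`) are assumptions of this example; it does not check value ranges such as the 8-bit limit implied by `c`.

```python
import json

# Sub-type letters of a B array that denote integers (see the text above).
INTEGER_SUBTYPES = set("cCsSiI")


def parse_tag(field):
    """Split one optional field and convert VALUE according to TYPE."""
    # VALUE may itself contain ':' (e.g. a URI), so split at most twice.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    if type_code in ("A", "Z"):
        return tag, value
    if type_code == "J":
        return tag, json.loads(value)
    if type_code == "H":
        return tag, bytes.fromhex(value)
    if type_code == "B":
        subtype, *items = value.split(",")
        if subtype == "f":
            return tag, [float(item) for item in items]
        if subtype in INTEGER_SUBTYPES:
            return tag, [int(item) for item in items]
        raise ValueError(f"unknown array sub-type: {subtype!r}")
    raise ValueError(f"unknown field type: {type_code!r}")
```

For example, `parse_tag("LN:i:42")` yields the pair `("LN", 42)`.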
## Segment and path names
Path and segment records are identified by a unique name. All record types share the same namespace, so a path may not have the same name as a segment.
Names must not contain whitespace characters nor start with `*` or `=` nor contain the strings `+,` (plus comma) and `-,` (minus comma). All other printable ASCII characters are allowed. Names are case sensitive.
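The naming rules above can be checked mechanically. The sketch below combines the name regular expression used in the field tables of this document with the two forbidden substrings; the helper name `is_valid_name` is hypothetical.

```python
import re

# Same pattern as the Name columns in the field tables of this document:
# printable ASCII without whitespace, not starting with '*' or '='.
NAME_PATTERN = re.compile(r"[!-)+-<>-~][!-~]*")


def is_valid_name(name):
    """Return True if name satisfies the rules for segment and path names."""
    if NAME_PATTERN.fullmatch(name) is None:
        return False
    # '+,' and '-,' are reserved because they delimit steps in P lines.
    return "+," not in name and "-," not in name
```

Uniqueness across the shared namespace is a file-level property, so it would have to be checked separately, for instance with a set of names seen so far.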
# `#` Comment line
Comment lines begin with `#` and are ignored.
## Required fields
| Column | Field | Type | Regexp | Description
|--------|--------------|-----------|--------|------------
| 1 | `RecordType` | Character | `#` | Record type
# `H` Header line
## Required fields
| Column | Field | Type | Regexp | Description
|--------|--------------|-----------|--------|------------
| 1 | `RecordType` | Character | `H` | Record type
## Optional fields
| Tag | Type | Description
|------|------|------------
| `VN` | `Z` | Version number
# `S` Segment line
## Required fields
| Column | Field | Type | Regexp | Description
|--------|--------------|-----------|---------------------|------------
| 1 | `RecordType` | Character | `S` | Record type
| 2 | `Name` | String | `[!-)+-<>-~][!-~]*` | Segment name
| 3 | `Sequence` | String | `\*\|[A-Za-z=.]+` | Optional nucleotide sequence
The Sequence field is optional and can be `*`, meaning that the nucleotide sequence of the segment is not specified. When the sequence is not stored in the GFA file, its length may be specified using the `LN` tag, and the sequence may be stored in an external FASTA file.
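A minimal sketch of how a reader might resolve the segment length under these rules: the length comes from the sequence when it is present and from the `LN` tag when the sequence is `*`. The function name `parse_segment` and the returned dictionary layout are assumptions of this example, and only `i`-typed tags are converted to numbers.

```python
def parse_segment(line):
    """Parse an S line into name, sequence (or None), length and raw tags."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S" or len(fields) < 3:
        raise ValueError("not a complete S line")
    name, sequence = fields[1], fields[2]

    tags = {}
    for field in fields[3:]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value

    if sequence == "*":
        # Sequence not stored in the file: fall back to LN if present.
        sequence = None
        length = tags.get("LN")
    else:
        length = len(sequence)
    return {"name": name, "sequence": sequence, "length": length, "tags": tags}
```

When both a sequence and `LN` are present, a stricter reader could additionally compare the two values; this sketch does not.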
## Optional fields
| Tag | Type | Description |
|-------|------|----------------|
| `LN` | `i` | Segment length |
| `RC` | `i` | Read count |
| `FC` | `i` | Fragment count |
| `KC` | `i` | k-mer count |
| `SH` | `H` | SHA-256 checksum of the sequence |
| `UR` | `Z` | URI or local file-system path of the sequence. If it does not start with a standard protocol (e.g. ftp), it is assumed to be a local path. |
# `L` Link line
Links are the primary mechanism to connect segments. Links connect oriented segments. A link from `A` to `B` means that the end of `A` overlaps with the start of `B`. If either is marked with `-`, we replace the sequence of the segment with its reverse complement, whereas a `+` indicates the segment sequence is used as-is.
The length of the overlap is determined by the `CIGAR` string of the link. When the overlap is `0M` the `B` segment follows directly after `A`. When the `CIGAR` string is `*`, the nature of the overlap is not specified. The `CIGAR` string must be constructed so that the corresponding end of sequence `A` in the orientation given by `FromOrient` is the reference and the start of `B` in the orientation given by `ToOrient` is the query.
## Required fields
| Column | Field | Type | Regexp | Description
|--------|--------------|-----------|--------------------------|------------------
| 1 | `RecordType` | Character | `L` | Record type
| 2 | `From` | String | `[!-)+-<>-~][!-~]*` | Name of segment
| 3 | `FromOrient` | String | `+\|-` | Orientation of `From` segment
| 4 | `To` | String | `[!-)+-<>-~][!-~]*` | Name of segment
| 5 | `ToOrient` | String | `+\|-` | Orientation of `To` segment
| 6 | `Overlap` | String | `\*\|([0-9]+[MIDNSHPX=])+`| Optional `CIGAR` string describing overlap
The Overlap field is optional and can be `*`, meaning that the CIGAR string is not specified.
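To make the reference/query convention concrete, the sketch below computes how many bases of each segment an overlap covers. It assumes the usual SAM interpretation of CIGAR operations (`M`, `D`, `N`, `=`, `X` consume the reference; `M`, `I`, `S`, `=`, `X` consume the query), which this document does not spell out. The names `overlap_lengths` and `parse_link` are hypothetical.

```python
import re

CIGAR_PATTERN = re.compile(r"([0-9]+[MIDNSHPX=])+")
CIGAR_OPERATION = re.compile(r"([0-9]+)([MIDNSHPX=])")
REFERENCE_OPS = set("MDN=X")  # consume bases at the end of the From segment
QUERY_OPS = set("MIS=X")      # consume bases at the start of the To segment


def overlap_lengths(cigar):
    """Return (bases of From, bases of To) covered, or None for '*'."""
    if cigar == "*":
        return None
    if CIGAR_PATTERN.fullmatch(cigar) is None:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    operations = CIGAR_OPERATION.findall(cigar)
    ref_len = sum(int(n) for n, op in operations if op in REFERENCE_OPS)
    query_len = sum(int(n) for n, op in operations if op in QUERY_OPS)
    return ref_len, query_len


def parse_link(line):
    """Parse the required fields of an L line."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "L" or len(fields) < 6:
        raise ValueError("not a complete L line")
    return {
        "from": (fields[1], fields[2]),
        "to": (fields[3], fields[4]),
        "overlap": overlap_lengths(fields[5]),
    }
```

With a plain `4M` overlap both lengths are 4; with indels the two numbers differ, which is why they are returned separately.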
## Optional fields
| Tag | Type | Description
|-------|------|------------
| `MQ` | `i` | Mapping quality
| `NM` | `i` | Number of mismatches/gaps
| `RC` | `i` | Read count
| `FC` | `i` | Fragment count
| `KC` | `i` | k-mer count
| `ID` | `Z` | Edge identifier
# `C` Containment line
A containment line represents an overlap between two segments where one (the `Contained` segment)
is contained in the other (the `Container` segment). The `Pos` field stores the leftmost
position of the contained segment in the container segment in its forward orientation
(i.e. before this is oriented according to the `ContainerOrient` sign).
## Example
The following line describes the containment of segment 2 in the reverse complement of segment 1,
starting at position 110 of segment 1 (in its forward orientation).
```
C 1 - 2 + 110 100M
```
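As a rough sketch, the interval that the contained segment occupies on the forward strand of the container can be derived from `Pos` and the CIGAR string. This assumes that the container plays the role of the CIGAR reference and that reference-consuming operations follow the SAM convention; neither is stated explicitly in this section. The function name `containment_interval` is hypothetical.

```python
import re

CIGAR_OPERATION = re.compile(r"([0-9]+)([MIDNSHPX=])")
REFERENCE_OPS = set("MDN=X")  # assumed to advance along the container


def containment_interval(line):
    """Return (start, end) on the container's forward strand, 0-based.

    end is None when the CIGAR string is '*', because the span on the
    container is then unknown.
    """
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "C" or len(fields) < 7:
        raise ValueError("not a complete C line")
    start = int(fields[5])
    cigar = fields[6]
    if cigar == "*":
        return start, None
    span = sum(
        int(count)
        for count, op in CIGAR_OPERATION.findall(cigar)
        if op in REFERENCE_OPS
    )
    return start, start + span
```

For the example line above, the contained segment would cover positions 110 up to (but not including) 210 of segment 1.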
## Required fields
| Column | Field | Type | Regexp | Description
|--------|-------------------|-----------|--------------------------|------------
| 1 | `RecordType` | Character | `C` | Record type
| 2 | `Container` | String | `[!-)+-<>-~][!-~]*` | Name of container segment
| 3 | `ContainerOrient` | String | `+\|-` | Orientation of container segment
| 4 | `Contained` | String | `[!-)+-<>-~][!-~]*` | Name of contained segment
| 5 | `ContainedOrient` | String | `+\|-` | Orientation of contained segment
| 6 | `Pos` | Integer | `[0-9]*` | 0-based start of contained segment
| 7 | `Overlap` | String | `\*\|([0-9]+[MIDNSHPX=])+` | CIGAR string describing overlap
## Optional fields
| Tag | Type | Description
|-------|------|------------
| `RC` | `i` | Read coverage
| `NM` | `i` | Number of mismatches/gaps
| `ID` | `Z` | Edge identifier
# `P` Path line
## Required fields
| Column | Field | Type | Regexp | Description
|--------|----------------|-----------|---------------------------|--------------------
| 1 | `RecordType` | Character | `P` | Record type
| 2 | `PathName` | String | `[!-)+-<>-~][!-~]*` | Path name
| 3 | `SegmentNames` | String | `[!-)+-<>-~][!-~]*` | A comma-separated list of segment names and orientations
| 4 | `Overlaps` | String | `\*\|([0-9]+[MIDNSHPX=])+` | Optional comma-separated list of CIGAR strings
The CIGAR strings in the `Overlaps` field are optional, and may be replaced by a single `*` character, in which case the `CIGAR` strings are determined by fetching the `CIGAR` string from the corresponding link records, or by performing a pairwise overlap alignment of the two sequences. If specified, the `Overlaps` field must have one fewer values than the number of segment names and orientations in the `SegmentNames` field.
## Optional fields
None specified.
## Example
```
H VN:Z:1.0
S 11 ACCTT
S 12 TCAAGG
S 13 CTTGATT
L 11 + 12 - 4M
L 12 - 13 + 5M
L 11 + 13 + 3M
P 14 11+,12-,13+ 4M,5M
```
The resulting path is:
```
11 ACCTT
12 CCTTGA
13 CTTGATT
14 ACCTTGATT
```
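The way the path sequence is obtained from the oriented segments and their overlaps can be sketched as follows. This is a simplified illustration that only supports overlaps of the form `<n>M` and requires an explicit `Overlaps` field (not `*`); the names `reverse_complement` and `spell_path` are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
MATCH_ONLY = re.compile(r"([0-9]+)M")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def spell_path(segments, segment_names, overlaps):
    """Concatenate oriented segments, skipping the overlapping prefix.

    segments maps segment names to sequences; segment_names and overlaps
    are the raw column 3 and column 4 strings of a P line.
    """
    steps = segment_names.split(",")
    cigars = overlaps.split(",")
    if len(cigars) != len(steps) - 1:
        raise ValueError("Overlaps must have one fewer value than SegmentNames")

    spelled = ""
    for index, step in enumerate(steps):
        name, orientation = step[:-1], step[-1]
        sequence = segments[name]
        if orientation == "-":
            sequence = reverse_complement(sequence)
        skip = 0
        if index > 0:
            match = MATCH_ONLY.fullmatch(cigars[index - 1])
            if match is None:
                raise ValueError("this sketch only handles <n>M overlaps")
            skip = int(match.group(1))
        # The first `skip` bases are already present from the previous step.
        spelled += sequence[skip:]
    return spelled
```

Calling `spell_path({"11": "ACCTT", "12": "TCAAGG", "13": "CTTGATT"}, "11+,12-,13+", "4M,5M")` reproduces the sequence shown for path 14 above.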
## Extension to use jump connections (since v1.2)
Version 1.2 expands the `P`-line format for using jump connections given by the `J`-lines (see "`J` Jump line" section).
A semicolon (`;`) can now be used as a separator in `SegmentNames` in addition to a comma (`,`) to indicate the use of a jump connection (defined by a `J`-line) rather than a link connection (defined by an `L`-line).
If specified, the `Overlaps` field uses the `[-+]?[0-9]+J` format (note the `J` at the end to match the style of a `CIGAR` string) to refer to a jump connection with a particular estimated distance, and `.` if the corresponding `J`-line does not provide a distance estimate.
| Column | Field | Type | Regexp | Description
|--------|----------------|-----------|---------------------------|--------------------
| 1 | `RecordType` | Character | `P` | Record type
| 2 | `PathName` | String | `[!-)+-<>-~][!-~]*` | Path name
| 3 | `SegmentNames` | String | `[!-)+-<>-~][!-~]*` | A comma/semicolon-separated list of segment names and orientations
| 4 | `Overlaps` | String | `\*\|([0-9]+[MIDNSHPX=]\|\[-+]?[0-9]+J\|.)+` | Optional comma-separated list of CIGAR strings and distance estimates
### Example
```
H VN:Z:1.2
S 11 ACCTT
S 12 TCAAGG
S 13 CTTGATT
L 11 + 12 - 4M
J 11 + 12 - * SC:i:1
J 12 - 13 + 10
P first 11+,12- *
P second 11+;12- *
P third 11+;12-;13+ .,10J
```
Note how the use of different delimiters in the first two paths disambiguates between a link and a shortcut jump between the same pair of oriented segments.
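A reader that supports both separators has to remember which one appeared between each pair of steps. The sketch below does this with a regular expression split. It assumes a separator is a `,` or `;` that immediately follows an orientation sign; the function names `split_path_steps` and `parse_overlap_entry` are hypothetical.

```python
import re

# A separator is a ',' or ';' that directly follows an orientation sign.
STEP_SEPARATOR = re.compile(r"(?<=[+-])([,;])")


def split_path_steps(segment_names):
    """Return (steps, connections) for a v1.2 SegmentNames field.

    steps is a list of (name, orientation); connections has one entry,
    'link' or 'jump', for each consecutive pair of steps.
    """
    parts = STEP_SEPARATOR.split(segment_names)
    # With a capturing group, re.split alternates steps and separators.
    steps = [(part[:-1], part[-1]) for part in parts[0::2]]
    connections = ["link" if sep == "," else "jump" for sep in parts[1::2]]
    return steps, connections


def parse_overlap_entry(entry):
    """Classify one comma-separated entry of a v1.2 Overlaps field."""
    if entry == ".":
        return ("jump", None)  # jump without a distance estimate
    if entry.endswith("J"):
        return ("jump", int(entry[:-1]))
    return ("link", entry)  # an ordinary CIGAR string
```

Applied to the paths `first` and `second` above, the steps are identical and only the connection kind differs (`link` versus `jump`).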
# `W` Walk line (since v1.1)
A walk line describes an oriented walk in the graph. It is only intended for a
graph without overlaps between segments. W-line was added in GFA v1.1 and was
not defined in the original GFAv1.
Note that W-lines cannot use jump connections (introduced in v1.2).
## Required fields
| Column | Field | Type | Regexp | Description
|--------|-------------------|-----------|--------------------------|------------
| 1 | `RecordType` | Character | `W` | Record type
| 2 | `SampleId` | String | `[!-)+-<>-~][!-~]*` | Sample identifier
| 3 | `HapIndex` | Integer | `[0-9]+` | Haplotype index
| 4 | `SeqId` | String | `[!-)+-<>-~][!-~]*` | Sequence identifier
| 5 | `SeqStart` | Integer | `\*\|[0-9]+` | Optional Start position
| 6 | `SeqEnd` | Integer | `\*\|[0-9]+` | Optional End position (BED-like half-closed, half-open)
| 7 | `Walk` | String | `([><][!-;=?-~]+)+` | Walk
For a haploid sample, `HapIndex` takes 0. For a diploid or polyploid sample,
`HapIndex` starts with 1. For two W-lines with the same
(`SampleId`,`HapIndex`,`SeqId`), their [`SeqStart`,`SeqEnd`) intervals should
not overlap. A `Walk` is defined as
```txt
<walk> ::= ( ( `>' | `<' ) <segId> )+
```
where `<segId>` corresponds to the identifier of a segment. A valid walk must
exist in the graph.
## Example
```txt
H VN:Z:1.1
S s11 ACCTT
S s12 TC
S s13 GATT
L s11 + s12 - 0M
L s12 - s13 + 0M
L s11 + s13 + 0M
W NA12878 1 chr1 0 11 >s11<s12>s13
```
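A walk string can be decomposed into oriented steps with the same character class as the `Walk` regular expression in the table above. The sketch below also sums the segment lengths, which is only meaningful because walks are intended for graphs without overlaps. The names `parse_walk` and `walk_length` are hypothetical.

```python
import re

# One step: a direction character followed by a segment identifier.
WALK_STEP = re.compile(r"([><])([!-;=?-~]+)")


def parse_walk(walk):
    """Return a list of (segment_id, orientation) pairs for a Walk field."""
    steps = WALK_STEP.findall(walk)
    # Reject strings with anything left over, e.g. a missing leading '>'.
    if not steps or "".join(d + seg for d, seg in steps) != walk:
        raise ValueError(f"malformed walk: {walk!r}")
    return [(seg, "+" if d == ">" else "-") for d, seg in steps]


def walk_length(segment_lengths, walk):
    """Total length of a walk, assuming 0-overlap links between steps."""
    return sum(segment_lengths[seg] for seg, _ in parse_walk(walk))
```

For the example above, the three segments have lengths 5, 2 and 4, so the walk spans 11 bases, consistent with `SeqStart` 0 and `SeqEnd` 11.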
# `J` Jump line (since v1.2)
Jump lines define connections between segments that cannot be associated with a particular overlap or sequence. The basic use case is to represent 'gaps' corresponding to unassembled regions, most commonly due to the absence or low quality of sequencing data.
The `J`-line specification generally follows that of `L`-lines, using columns 2-5 to specify the connected segments and their respective orientations.
The only difference is that the 6th column specifies a signed integer `Distance` (instead of the `Overlap` `CIGAR` string), which is the estimated distance between the segments.
The `Distance` can take a `*` value, meaning that the distance is not specified (estimate is unavailable).
Note that the `Distance` can take negative integer values, hinting at an undetected overlap.
Since v1.2 jump connections can be used in the `P`-lines.
Note that to specify the use of a jump connection rather than a regular link within a path, one should use a different separator (`;` instead of `,`). For details and examples see the "Extension to use jump connections" subsection of the `P`-line description.
`J`-lines can also be used to specify _shortcut_ connections that do not correspond to any missing overlap or absent sequence.
Shortcuts are primarily intended to be used within the `P`-lines to define arbitrary assembly scaffolds.
Shortcut `J`-lines must be marked with a special tag: `SC:i:1`.
## Required fields
| Column | Field | Type | Regexp | Description
|--------|--------------|-----------|--------------------------|------------------
| 1 | `RecordType` | Character | `J` | Record type
| 2 | `From` | String | `[!-)+-<>-~][!-~]*` | Name of segment
| 3 | `FromOrient` | String | `+\|-` | Orientation of `From` segment
| 4 | `To` | String | `[!-)+-<>-~][!-~]*` | Name of segment
| 5 | `ToOrient` | String | `+\|-` | Orientation of `To` segment
| 6 | `Distance` | String | `\*\|[-+]?[0-9]+` | Optional estimated distance between the segments
## Optional fields
| Tag | Type | Description
|------|------|------------
| `SC` | `i` | 1 indicates indirect shortcut connections. Only 0/1 allowed.
## Example
The following lines describe a jump between the reverse complement of segment 1 and segment 2 with an estimated distance of 100, and a 'shortcut' between segment 2 and the reverse complement of segment 3 with an unspecified distance.
```
J 1 - 2 + 100
J 2 + 3 - * SC:i:1
```
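Reading these records can be sketched as below: the distance becomes an integer or `None`, and the shortcut flag is derived from the `SC:i:1` tag. The function name `parse_jump` and the dictionary layout are assumptions of this example.

```python
def parse_jump(line):
    """Parse a J line into endpoints, optional distance and shortcut flag."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "J" or len(fields) < 6:
        raise ValueError("not a complete J line")
    # '*' means that no distance estimate is available.
    distance = None if fields[5] == "*" else int(fields[5])
    return {
        "from": (fields[1], fields[2]),
        "to": (fields[3], fields[4]),
        "distance": distance,  # may be negative (possible undetected overlap)
        "shortcut": "SC:i:1" in fields[6:],
    }
```

Applied to the two example lines, the first yields a distance of 100 with `shortcut` set to `False`, and the second yields no distance with `shortcut` set to `True`.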