haberman / gazelle

A system for creating fast, reusable parsers

This URL has Read+Write access

gazelle / docs / FILEFORMAT
100644 58 lines (44 sloc) 2.034 kb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
 
This is a description of Gazelle's bytecode file format.
 
The file format itself is "Bitcode," a generic bytecode format
developed as part of the LLVM project. It is documented here:
  http://llvm.cs.uiuc.edu/docs/BitCodeFormat.html
 
This file describes the meaning of the information that Gazelle
puts into the bytecode.
 
Bitcode file, application magic number = "GH"
All offsets are 0-based unless otherwise specified.
 
STRINGS -- all strings are put into the STRINGS block, and referenced
           by offset in other parts of the file
  [STRING, <ascii array of chars>]
  [STRING, <ascii array of chars>]
  ...
 
INTFAS -- the list of all IntFAs (lexing DFAs)
  INTFA -- a single DFA
    [INTFA_STATE, # of transitions] <-- first state is assumed to be start state
    [INTFA_STATE, # of transitions]
    [INTFA_FINAL_STATE, # of transitions, terminal_name_int]
    ...
    (states are identified by their offset in this list)
 
    [INTFA_TRANSITION, edge val, dest state offset]
    [INTFA_TRANSITION_RANGE, edge val low, edge val high, dest state offset]
    ...
    (transitions are identified by their offset in this list)
 
GLAS
  GLA
    [GLA_STATE, intfa #, # of transitions]
    [GLA_FINAL_STATE, 1-based rtn alt offset (0 is return from final)]
    ...
 
    [GLA_TRANSITION, 1-based terminal_name_int (0 is EOF), dest GLA state offset]
    ...
 
RTNS -- a list of all the RTNs
  RTN - a single RTN
    [RTN_INFO, nonterm_name_int, # of slots]
 
    [RTN_STATE_WITH_INTFA, # of transitions, is final, intfa#]
    [RTN_STATE_WITH_GLA, # of transitions, is final, gla#]
    [RTN_TRIVIAL_STATE, # of transitions, is final]
    ...
 
    -- slotnums are 0-based in the runtime, but 1-based in the bytecode and in the
    -- compiler. We use 0 in the bytecode to represent "no slotnum" (-1 elsewhere),
    -- as used by subparsers.
    [RTN_TRANSITION_TERMINAL, terminal_name_int, dest RTN state offset, slotname_int, slotnum+1]
    [RTN_TRANSITION_NONTERM, rtn_offset, dest RTN state offset, slotname_int, slotnum+1]
    ...