# ATF Grammar Composition

Some factors to consider when (re-)structuring the Lark grammars for the different ATF flavors.

**TL;DR**: If we really have to tinker with the existing structure, we should use the extension
mechanisms provided by [Lark](https://lark-parser.readthedocs.io/en/stable/), such as `%extend` and
`%override` (see Option 3 below).

## Prepare Test Data

In [1]:
from lark import Lark, UnexpectedCharacters

ebl_lines = [
    "This is an eBL textline",
    "And this is a common linetype",
]

oracc_lines = [
    "This is an oracc-style textline",
    "Here is an oracc-specific linetype",
]

lines = ebl_lines + oracc_lines


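def test_grammar(grammar_path, lines):
    """Parse each line with the grammar at grammar_path and print the result."""
    parser = Lark.open(grammar_path)

    for line in lines:
        try:
            tree = parser.parse(line)
            print(tree.pretty())
        except UnexpectedCharacters:
            # The grammar has no rule matching this line
            print(f"Cannot parse {line!r}")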

## eBL Base Parser

The base parser handles only eBL-flavored ATF and fails on Oracc-style lines.

The test grammar looks like this:

```lark
start: textline | common_line

?textline: "This is an eBL textline"
?common_line: "And this is a common linetype"
```

In [2]:
test_grammar("grammars/ebl_atf.lark", lines)

start
  textline

start
  common_line

Cannot parse 'This is an oracc-style textline'
Cannot parse 'Here is an oracc-specific linetype'


## Option 1 (Currently in Place): Keep Separate Grammars for eBL and Oracc

Maintain a separate grammar for each flavor, i.e., the eBL grammar above plus the Oracc-specific
grammar below. Which grammar to use must be decided at runtime (e.g., use the Oracc grammar when
importing Oracc files).
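The runtime decision can be as small as a lookup keyed by the flavor of the input. The following is only a sketch: the `flavor` strings and the helper `select_grammar_path` are hypothetical, while the grammar paths are the ones used elsewhere in this notebook.

```python
# Hypothetical mapping from ATF flavor to grammar file (paths as used in this notebook)
GRAMMAR_PATHS = {
    "ebl": "grammars/ebl_atf.lark",
    "oracc": "grammars/oracc_atf.lark",
}


def select_grammar_path(flavor: str) -> str:
    """Return the grammar file for the given ATF flavor (hypothetical helper)."""
    try:
        return GRAMMAR_PATHS[flavor]
    except KeyError:
        # Fail loudly instead of silently falling back to the wrong grammar
        raise ValueError(f"Unknown ATF flavor: {flavor!r}") from None


print(select_grammar_path("oracc"))
```

The result could then be passed on, e.g. `test_grammar(select_grammar_path("oracc"), oracc_lines)`.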

### eBL Grammar

```lark
start: textline | common_line

?textline: "This is an eBL textline"
?common_line: "And this is a common linetype"
```

### Oracc Grammar

```lark
start: textline | legacy_line | common_line

?textline: "This is an oracc-style textline"
?common_line: "And this is a common linetype"
?legacy_line: "Here is an oracc-specific linetype"
```

### Pros and Cons

Splitting grammars is common (and best) practice in Computational Linguistics. Keeping the grammars
independent of each other is future-proof, as they can be maintained and extended without unwanted
side effects. In theory, this also prevents an input that is invalid in one flavor from being
accepted merely because it is well-formed in the other.

The resulting duplication/verbosity is not a critical issue. It does not harm the performance or
readability of the code. IMHO, grammar updates, especially updates that affect both eBL and Oracc
simultaneously, are so infrequent that they do not warrant the effort needed to merge the grammars.


In [3]:
# Decide which grammar to use at runtime
test_grammar("grammars/ebl_atf.lark", ebl_lines)
test_grammar("grammars/oracc_atf.lark", oracc_lines)

start
  textline

start
  common_line

start
  textline

start
  legacy_line



## Option 2 (Suggested by [#552](https://github.com/ElectronicBabylonianLiterature/ebl-api/pull/552)): Merge the Grammars

Incorporate Oracc-style structures into the eBL grammar (or vice versa).

### Grammar

```lark
start: textline | common_line | legacy_line

?textline: "This is an eBL textline" | "This is an oracc-style textline"
?common_line: "And this is a common linetype"
?legacy_line: "Here is an oracc-specific linetype"
```

### Pros and Cons

It reduces redundancy. Changes that affect both flavors of the grammar only need to be implemented
once. We end up with a single, monolithic grammar that can (in principle) handle all kinds of inputs.

Reducing redundancy is a good thing in executable code, but in a grammar it means that it becomes
impossible to tell whether an item is shared between the two ATF flavors or specific to eBL or Oracc.
This makes maintenance much harder because updating a rule may have unintended side effects (e.g.,
changing a rule for eBL may cause Oracc-style inputs to become unparsable) that are not immediately
apparent.

Conflicts are another issue to consider. Definitions may be entirely incompatible with each other
(e.g., `foobar` lines must start with `foo` in eBL but with `Foo` in Oracc), requiring manual
disambiguation via rule variants.

Merging can also cause subtle bugs if a line is invalid within one grammar (and *should* throw 
an error) but happens to match a pattern from the other grammar (and thus gets parsed anyway, though 
not as intended). We already have an issue like that with the `translation` vs. `control` lines, 
where malformed translation lines are erroneously parsed as control lines since they happen to match
the more inclusive control line pattern.


In [4]:
test_grammar("grammars/merged_atf.lark", lines)

start
  textline

start
  common_line

start
  textline

start
  legacy_line



## Option 3 (Preferred): Extend a Base Grammar

Use one of the two grammars as the base grammar and implement the other as an *extension* of it.
Alternatively, define a common base grammar that holds only those elements shared by all ATF flavors
and implement both eBL- and Oracc-specific elements as separate extensions.

### eBL Base Grammar (as Before)

```lark
start: textline | common_line

?textline: "This is an eBL textline"
?common_line: "And this is a common linetype"
```

### Oracc Grammar as Extension

```lark
%import .ebl_atf (textline, common_line, start)

?oracc_line: "Here is an oracc-specific linetype"

%extend textline: "This is an oracc-style textline"
%extend start: oracc_line
```

### Pros and Cons

It keeps the grammars modular while minimizing redundancy. It clearly indicates flavor-specific
elements. It is in line with the way Lark is supposed to be used. All of this aids maintainability
and readability.

In [5]:
test_grammar("grammars/oracc_extension.lark", lines)

start
  textline

start
  common_line

start
  textline

start
  oracc_line

