# NenaParser: A parser for Nena Standard Text format

The goal of this parser (under development) is to translate texts in the
[Nena Standard Text format][nenamd] (we should find a better name for that)
into structured groups of morphemes. Those structured morphemes can then be
easily converted to (e.g.) TextFabric format.

For the Nena Standard Text parser, we make use of [Sly][sly], a Python
implementation of the lex/yacc type of parser generators. (This may soon have
to be converted to Sly's predecessor [Ply][ply], as Sly works only with
Python 3.6+ and the NENA website runs on Python 3.5 - but that should not be
difficult).

[nenamd]: https://github.com/CambridgeSemiticsLab/nena_corpus/blob/tomarkdown/docs/text_markup.md
[sly]: https://sly.readthedocs.io/en/latest/
[ply]: http://www.dabeaz.com/ply/index.html

In [1]:
import re

## Lexer

The parser takes 'tokens' as its input: predefined units of characters, provided by the 'lexer'. In Sly (and Ply), each token is defined by a regular expression, and the matching string is returned as the token value. If the token is defined as a method (with its regular expression passed as argument to the `@_` decorator), the method can manipulate the returned value (among other things) before returning the token. For more detailed information, [see the documentation][slydocs].

[slydocs]: https://sly.readthedocs.io/en/latest/sly.html
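
As a rough illustration of the idea (not of Sly itself), the sketch below mimics both kinds of token definition with the standard `re` module: plain patterns whose matched string is the token value, and patterns paired with a function that rewrites the value. The names `TOKEN_SPEC`, `VALUE_FUNCS` and `tokenize` are hypothetical; the patterns are copied from the lexer below.

```python
import re

# Hypothetical token table: (name, pattern, optional function that rewrites the value).
TOKEN_SPEC = [
    ('TITLE', r'^\# .*$', lambda s: ('title', s[2:])),
    ('ATTRIBUTE', r'^[a-z][a-z0-9_]+: .*$', lambda s: tuple(s.split(': ', 1))),
    ('NEWLINES', r'\n\s*\n\s*', None),
    ('SPACE', r'\s+', None),
]

# One master pattern with a named group per token; earlier entries win.
MASTER = re.compile(
    '|'.join(f'(?P<{name}>{pattern})' for name, pattern, _ in TOKEN_SPEC),
    re.MULTILINE,
)
VALUE_FUNCS = {name: func for name, _, func in TOKEN_SPEC}


def tokenize(text):
    """Yield (token_name, token_value) pairs, failing on unknown input."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f'illegal character {text[pos]!r} at index {pos}')
        name = match.lastgroup
        value = match.group(name)
        func = VALUE_FUNCS[name]
        yield name, (func(value) if func else value)
        pos = match.end()


tokens = list(tokenize('# Gozáli and Nozali\n\ntext_id: A8\nplace: ʾƐn-Nune\n'))
```

Only the `TITLE` and `ATTRIBUTE` entries carry a function, so only their values are turned into tuples; the whitespace tokens keep the matched string.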

In [192]:
from sly import Lexer

class NenaLexer(Lexer):
    
    # set of token names
    tokens = {
        TITLE, ATTRIBUTE, LETTER, NEWLINES, SPACE,
        PUNCTUATION, HYPHEN,
        LPAREN_COMMENT, LBRACKET_COMMENT, DIGITS,
        LANG_MARKER, COMMENT, FOOTNOTE
    }
    
    # NB \u207A == superscript +
    literals = {'*', '(', ')', '{', '}', '[', ']', '/', '^'}

    # The '(?m)' part turns on multiline matching, which makes
    # it possible to use ^ and $ for the start/end of the line.
    # Title starts with pound sign. Returns 2-tuple (key, value).
    @_(r'(?m)^\# .*$')
    def TITLE(self, t):
        t.value = ('title', t.value[2:])
        return t

    # Attribute starts with key and colon. Returns 2-tuple (key, value).
    @_(r'(?m)^[a-z][a-z0-9_]+: .*$')
    def ATTRIBUTE(self, t):
        # split on the first ': ' only, so the value may itself contain ': '
        t.value = tuple(t.value.split(': ', 1))
        return t
    
    # Footnote starts with '[^n]: ', where n is a number.
    # Returns a 2-tuple (int: fn_sym, str: footnote_text)
    @_(r'(?m)^\[\^[1-9][0-9]*\]: \D*$')
    def FOOTNOTE(self, t):
        fn_sym, footnote = t.value.split(maxsplit=1)
        t.value = (int(fn_sym[2:-2]), footnote)
        return t

    # How to get combined Unicode characters to be recognized?
    # Matching only Unicode points of letters with pre-combined
    # marks can be done with the 'word' class '\w', but it
    # includes digits and underscore. To remove those, negate
    # the inverted word class along with digits and underscore:
    # '[^\W\d_]'. But that does not include separate combining
    # marks, or the '+' sign.
    # One solution would be unicodedata.normalize('NFC', data),
    # except that not all combinations have pre-combined Unicode
    # points.
    # Another solution is to use an external regex engine such as
    # `regex` (`pip install regex`), which has better Unicode
    # support. However, I would like to avoid extra dependencies.
    # Another (less elegant) solution is to make the '+' symbol
    # and the combining characters [\u0300-\u036F] each its own
    # token, which the parser will have to parse into morphemes
    # and words.
    # Another (also less elegant) solution is to use a 'negative
    # lookahead assertion' for the negation of digits and '_':
    # https://stackoverflow.com/a/12349464/9230612
    # (?!\d_)[\w\u0300-\u036F]+
    # Because combining marks can never appear before the first
    # letter, and because some dialects have a '+' sign at the
    # beginning of some words, we prefix an optional '+' symbol
    # and an obligatory '[^\W\d_]' before the negative lookahead.
    
    # One letter with (or without) combining marks can be matched
    # with: [^\W\d_][\u0300-\u036F]*
    # We also add a superscript plus (U+207A) as part of a letter,
    # since this char is not a letter on its own, but rather
    # modifies the quality of a consonant.
    LETTER = r'[\u207A]?[^\W\d_][\u0300-\u036F]*'
    
    # we try to make a LETTERS token:
#     LETTERS = r'[+]?[^\W\d_](?!\d_)[\w\u0300-\u036F+]*'
    # Unfortunately, with python's `re` it seems impossible to repeat
    # a group like this. So we will group the letters in the parser.
    
    # Newlines: boundaries of paragraphs and metadata are marked
    # with two newlines (meaning an empty line). The empty line
    # may contain whitespace.
    NEWLINES = r'\n\s*\n\s*'
    
    # Space is any successive number of whitespace symbols.
    SPACE = r'\s+'
    # One or more digits, not starting with zero
    DIGITS = r'[1-9][0-9]*'
    # Line id is any number of digits surrounded by round brackets
#     LINE_ID = r'\([0-9]+\)'  # TODO convert to int?
    # Punctuation is any normal punctuation symbol, the vertical bar
    # (U+02C8), and a long hyphen (—).
    PUNCTUATION = r'[.,?!:;–\u02c8\u2014\u2019\u2018]'
    # There are two different hyphens, a single one and a double one.
    # The double one is the 'equals' sign.
    HYPHEN = r'[-=]'
    # Language markers are ASCII letter strings surrounded by
    # angle brackets.
    LANG_MARKER = r'<[A-Za-z]+>'
    # A special comment starts with an opening bracket, capital initials
    # and a colon.
    LPAREN_COMMENT = r'\([A-Za-z]+:'
    LBRACKET_COMMENT = r'\[[A-Za-z]+:'
    # A regular comment is text (at least one character not being a digit)
    # which may not contain a colon (otherwise it becomes a special comment/interruption)
    COMMENT = r'\([^:)]*[^:)\d]+[^:)]*\)'

lexer_test = """
# Gozáli and Nozali

text_id: A8
informant: Nanəs Bənyamən
place: ʾƐn-Nune

(1) a-\u207Aword...[^1] (a-comment) (GK: lalala) bla //

Hello
(2) also[^2] <E>*wórds*<E> [CK: 
broken comment]

(4) ‘new paragraph. Say — what? a\u0300

[^1]: First footnote
continued.
[^2]: Second footnote.
[^3]: Third footnote, not referenced in text.
"""


# demonstration of output results of lexer, to be used by parser below
lexer = NenaLexer()
[(tok.type, tok.value) for tok in lexer.tokenize(lexer_test)]

[('SPACE', '\n'),
 ('TITLE', ('title', 'Gozáli and Nozali')),
 ('NEWLINES', '\n\n'),
 ('ATTRIBUTE', ('text_id', 'A8')),
 ('SPACE', '\n'),
 ('ATTRIBUTE', ('informant', 'Nanəs Bənyamən')),
 ('SPACE', '\n'),
 ('ATTRIBUTE', ('place', 'ʾƐn-Nune')),
 ('NEWLINES', '\n\n'),
 ('(', '('),
 ('DIGITS', '1'),
 (')', ')'),
 ('SPACE', ' '),
 ('LETTER', 'a'),
 ('HYPHEN', '-'),
 ('LETTER', '⁺w'),
 ('LETTER', 'o'),
 ('LETTER', 'r'),
 ('LETTER', 'd'),
 ('PUNCTUATION', '.'),
 ('PUNCTUATION', '.'),
 ('PUNCTUATION', '.'),
 ('[', '['),
 ('^', '^'),
 ('DIGITS', '1'),
 (']', ']'),
 ('SPACE', ' '),
 ('COMMENT', '(a-comment)'),
 ('SPACE', ' '),
 ('LPAREN_COMMENT', '(GK:'),
 ('SPACE', ' '),
 ('LETTER', 'l'),
 ('LETTER', 'a'),
 ('LETTER', 'l'),
 ('LETTER', 'a'),
 ('LETTER', 'l'),
 ('LETTER', 'a'),
 (')', ')'),
 ('SPACE', ' '),
 ('LETTER', 'b'),
 ('LETTER', 'l'),
 ('LETTER', 'a'),
 ('SPACE', ' '),
 ('/', '/'),
 ('/', '/'),
 ('NEWLINES', '\n\n'),
 ('LETTER', 'H'),
 ('LETTER', 'e'),
 ('LETTER', 'l'),
 ('LETTER', 'l')
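
The `LETTER` pattern and the normalization problem discussed in the lexer comments can be tried outside Sly with the standard library alone. This is a minimal sketch: the sample word is made up, and only the pattern itself is taken from the lexer.

```python
import re
import unicodedata

# Pattern copied from the lexer's LETTER token: optional superscript plus,
# one letter (a word character that is not a digit or underscore),
# then any number of combining marks.
LETTER = re.compile(r'[\u207A]?[^\W\d_][\u0300-\u036F]*')

# Made-up sample: 'b', schwa + combining grave, 'c', superscript plus + 'w'
word = 'b\u0259\u0300c\u207Aw'
letters = LETTER.findall(word)

# NFC composes 'a' + combining grave into a single code point, but
# schwa + combining grave has no pre-combined form and stays two code points.
composed_a = unicodedata.normalize('NFC', 'a\u0300')
composed_schwa = unicodedata.normalize('NFC', '\u0259\u0300')

print(letters)
print(len(composed_a), len(composed_schwa))
```

Each element of `letters` is one base letter with its marks attached, which is why normalization alone is not enough and the combining range is matched explicitly.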

## Parsing a `.nena` file

Below is a representation of the tree-like structure of a NENA standard text file. This is the structure that the parser must recognize and reproduce.

```
text
  |
  heading
  |  |
  |  attributes
  |    |
  |    attribute (e.g. title, informant, etc.)
  | 
  paragraphs
    |
    paragraph
    |  |
    |  lines
    |    |
    |    line
    |      |
    |      line elements (in any order)
    |        |
    |        word elements
    |        |  |
    |        |  morpheme normal (+metadata, e.g. trailer, etc.)
    |        |  |
    |        |  morpheme foreign (+metadata)
    |        |  |
    |        |  morpheme language (+metadata)
    |        |
    |        footnote
    |        |
    |        comment
    |        |
    |        interruption
    |      
    orphaned footnote (processed later)
```

### Morpheme class

To conveniently store the morpheme and its features, we prepare a small `Morpheme` class, to be used by the parser.

In [193]:
class Morpheme:
    
    def __init__(self, value, trailer='',
                 footnotes=None, speaker=None,
                 foreign=False, lang=None):
        self.value = value  # list of (combined) characters
        self.trailer = trailer  # str (TODO: make this a list as well?)
        self.footnotes = footnotes if footnotes is not None else {}  # dict
        self.speaker = speaker  # str
        self.foreign = foreign  # boolean
        self.lang = lang  # str
    
    def __str__(self):
        return ''.join(self.value)
    
    def __repr__(self):
        sp = f' speaker {self.speaker!r}' if self.speaker else ''
        fr = ' foreign' if self.foreign else ''
        ln = f' lang {self.lang!r}' if self.lang else ''
        fn = f' fn_anc {self.footnotes!r}' if self.footnotes else ''
        return f'<Morpheme {str(self)!r} trailer {self.trailer!r}{sp}{fr}{ln}{fn}>'

### The parser

The parser processes the tokens provided by the lexer, and tries to combine them into structured units. Those units are defined in the methods of the `NenaParser` class, with the patterns passed as arguments to the `@_` decorator.

The top unit (in this case, `text`) is returned as the result of the parsing, and in this case contains a tuple `(heading, paragraphs)`.

The value `heading` contains a dictionary with the text metadata. The value `paragraphs` is a list, in which each element is a list of `lines` (or, for unreferenced footnotes, a 2-tuple `('footnotes', dict)`). Each element of `lines` is a 2-tuple containing an `int` line identifier and a list of `line_elements`. The values of `line_elements` are `Morpheme` objects, or 2-tuples with comments in the form `('comment', str)`.
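
To make the shape of that result concrete, here is a hand-written sketch of such a `(heading, paragraphs)` value and a small traversal. `SketchMorpheme` and `line_text` are hypothetical helpers introduced only for this illustration; the real parser returns `Morpheme` objects.

```python
from dataclasses import dataclass


@dataclass
class SketchMorpheme:
    """Hypothetical stand-in for the notebook's Morpheme class."""
    value: list        # list of (combined) characters
    trailer: str = ''  # punctuation/whitespace following the morpheme


# Hand-written example of the (heading, paragraphs) tuple described above.
heading = {'title': 'Gozáli and Nozali', 'text_id': 'A8'}
paragraphs = [
    [   # first paragraph: a list of (line_id, line_elements) tuples
        (1, [SketchMorpheme(['a'], '-'),
             SketchMorpheme(['b', 'l', 'a'], ' '),
             ('comment', 'a-comment')]),
        (2, [SketchMorpheme(['a', 'l', 's', 'o'], '.')]),
    ],
]


def line_text(line_elements):
    """Rebuild the surface text of a line, skipping ('comment', str) tuples."""
    return ''.join(
        ''.join(el.value) + el.trailer
        for el in line_elements
        if not isinstance(el, tuple)
    )


lines = {line_id: line_text(elements)
         for paragraph in paragraphs
         for line_id, elements in paragraph}
print(lines)
```

Because comments share the list with morphemes, any consumer has to filter on the element type, as `line_text` does here.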

In [200]:
from sly import Parser

# dict stack to contain footnote anchors,
# until the corresponding footnote is encountered.
fn_anchors = {}

class NenaParser(Parser):
    
    debugfile = 'parser.out'

    # Get the token list from the lexer (required)
    tokens = NenaLexer.tokens
    
    def error(self, t):
        # Sly passes None when the error occurs at the end of the input
        if t is None:
            raise Exception('unexpected end of input')
        raise Exception(f'unexpected string {repr(t.value[0])} at index {t.index}')
    
    @_('heading NEWLINES paragraphs')
    def text(self, p):
        return (p.heading, p.paragraphs)
    
    # -- HEADING --
    
    @_('SPACE TITLE NEWLINES attributes',
       'TITLE NEWLINES attributes')
    def heading(self, p):
        key, value = p.TITLE
        heading = {key: value}
        heading.update(p.attributes)
        return heading
    
    @_('attributes space ATTRIBUTE')
    def attributes(self, p):
        key, value = p.ATTRIBUTE
        p.attributes[key] = value
        return p.attributes 
    
    @_('ATTRIBUTE')
    def attributes(self, p):
        key, value = p.ATTRIBUTE
        return {key: value}
    
    # -- PARAGRAPHS --
    
    @_('paragraphs NEWLINES paragraph')
    def paragraphs(self, p):
        # handle cases of null footnotes
        if p.paragraph is not None:
            return p.paragraphs + [p.paragraph]
        else:
            return p.paragraphs
        
    @_('paragraph')
    def paragraphs(self, p):
        return [p.paragraph]
    
    # paragraph
    @_('paragraph line')
    def paragraph(self, p):
        return p.paragraph + [p.line]
    
    # paragraph from orphaned footnotes
    @_('footnotes')
    def paragraph(self, p):
        if p.footnotes:
            # TODO: issue log warning about
            # unreferenced footnotes?
            return ('footnotes', p.footnotes)
    
    # -- FOOTNOTES -- 
    
    @_('footnotes footnote')
    def footnotes(self, p):
        p.footnotes.update(p.footnote)
        return p.footnotes
    
    @_('footnote')
    def footnotes(self, p):
        return p.footnote
    
    @_('FOOTNOTE space NEWLINES',
       'FOOTNOTE NEWLINES',
       'FOOTNOTE space',
       'FOOTNOTE')
    def footnote(self, p):
        fn_sym, fn_str = p.FOOTNOTE
        footnote = {}
        try:
            # lookup the fn_sym key in the fn_anchors dict,
            # and add the footnote to the appropriate morpheme
            fn_morpheme = fn_anchors.pop(fn_sym)
            fn_morpheme.footnotes[fn_sym] = fn_str
        except KeyError:
            # This means there is no footnote anchor
            # referring to this footnote. So we return
            # the footnote to the text
            footnote = {fn_sym: fn_str}
        return footnote

    # -- LINES --
    
    @_('line')
    def paragraph(self, p):
        return [p.line]
    
    @_('line_id line_elements')
    def line(self, p):
        return (p.line_id, p.line_elements)
    
    @_('"(" DIGITS ")" SPACE')
    def line_id(self, p):
        return int(p.DIGITS)

    @_('line_elements line_element',
       'line_element')
    def line_elements(self, p):
        if len(p) == 2:
            return p.line_elements + p.line_element
        else:
            return p.line_element
    
    # -- MORPHEMES -- 
    
    @_('morphemes',
       'fn_anchor',
       'interruption',
       'morphemes_foreign',
       'morphemes_language',
       'comment')
    def line_element(self, p):
        return p[0]

    # morphemes_language
    @_('lang morphemes_foreign morpheme_trailer lang trailer',
       'lang morphemes_foreign lang trailer',
       'lang morphemes_foreign lang',
       'lang morphemes_foreign')
    def morphemes_language(self, p):
        # check if language markers correspond
        if len(p) > 2:
            lang = p.lang0
            if p.lang0 != p.lang1:
                pass  # TODO issue warning: language markers do not correspond
        else:
            lang = p.lang  # TODO issue warning: missing second language marker
        for m in p.morphemes_foreign:
            m.lang = lang
        if len(p) == 4:
            p.morphemes_foreign[-1].trailer += p.trailer
        elif len(p) == 5:
            p.morpheme_trailer.trailer += p.trailer
            p.morphemes_foreign.append(p.morpheme_trailer)
        return p.morphemes_foreign
    
    # lang
    @_('LANG_MARKER')
    def lang(self, p):
        return p.LANG_MARKER[1:-1]

    # morphemes_foreign
    # last morpheme may not include trailer
    # add trailer after second asterisk to last morpheme
    @_('"*" morphemes letters "*" trailer',
       '"*" morphemes letters "*"',
       '"*" letters "*" trailer',
       '"*" letters "*"',
      )
    def morphemes_foreign(self, p):
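        # A missing symbol raises KeyError or AttributeError,
        # depending on the Sly version.
        try:
            trailer = p.trailer
        except (KeyError, AttributeError):
            trailer = ''
        try:
            morphemes = p.morphemes
        except (KeyError, AttributeError):
            morphemes = []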
        morphemes.append(Morpheme(p.letters, trailer=trailer))
        for m in morphemes:
            m.foreign = True
        return morphemes
    
    # comment
    @_('COMMENT trailer',
       'COMMENT')
    def comment(self, p):
        return [('comment', p.COMMENT[1:-1])]

    # interruption
    @_('LPAREN_COMMENT space morphemes ")" trailer',
       'LPAREN_COMMENT space morphemes ")"',
       'LBRACKET_COMMENT space morphemes "]" trailer',
       'LBRACKET_COMMENT space morphemes "]"')
    def interruption(self, p):
        speaker = p[0][1:-1]
        for m in p.morphemes:
            m.speaker = speaker
        try:
            trailer = p.trailer
            if (p.morphemes[-1].trailer.endswith(' ')
                and trailer.startswith(' ')):
                trailer = trailer[1:]
            p.morphemes[-1].trailer += trailer
        except (KeyError, AttributeError):  # depends on the Sly version
            pass
        return p.morphemes
    
    # morphemes
    @_('morphemes morpheme_trailer',
       'morpheme_trailer')
    def morphemes(self, p):
        if len(p) == 2:
            return p.morphemes + [p.morpheme_trailer]
        else:
            return [p.morpheme_trailer]
    
    # -- MORPHEME ATTRIBUTES --
    
    # morpheme_trailer
    @_('letters trailer',
       'letters')
    def morpheme_trailer(self, p):
        if len(p) == 2:
            trailer = p[1]
        else:
            trailer = ''
        return Morpheme(p.letters, trailer=trailer)

    # morpheme_trailer with footnote anchor
    @_('morpheme_trailer fn_anchor trailer',
       'morpheme_trailer fn_anchor')
    def morpheme_trailer(self, p):
        if len(p) == 3:
            # use a local variable (as in `interruption`)
            # instead of assigning to p.trailer
            trailer = p.trailer
            if (p.morpheme_trailer.trailer.endswith(' ')
                and trailer.startswith(' ')):
                trailer = trailer[1:]
            p.morpheme_trailer.trailer += trailer
        # add dummy value {fn_anc: None} to footnote dict
        p.morpheme_trailer.footnotes[p.fn_anchor] = None
        # add morpheme object to fn_anchors dict,
        # for easy access when footnote text is found
        fn_anchors[p.fn_anchor] = p.morpheme_trailer
        return p.morpheme_trailer
    
    # --VARIOUS--
    
    @_('"[" "^" DIGITS "]"')
    def fn_anchor(self, p):
        return int(p.DIGITS)
        
    @_('letters LETTER')
    def letters(self, p):
        return p.letters + [p.LETTER]
    
    @_('LETTER')
    def letters(self, p):
        return [p[0]]
    
    # trailer
    @_('trailer versebreak',
       'trailer linebreak',
       'trailer PUNCTUATION',
       'trailer space',
       'PUNCTUATION',
       'space',
       'HYPHEN',
      )
    def trailer(self, p):
        return ''.join(p)
    
    # -- LITERALS --
    
    # reduce any number of spaces (\s+)
    # to a single space (' ')
    @_('SPACE')
    def space(self, p):
        return ' '
    
    @_('"/" "/"',
       '"/" "/" space',
       '"/" "/" NEWLINES',
       '"/" "/" space NEWLINES')
    def versebreak(self, p):
        return '//'
    
    @_('"/"',
       '"/" space',
       '"/" NEWLINES',
       '"/" space NEWLINES')
    def linebreak(self, p):
        return '/'
    
parser_test = """
# Gozáli and Nozali

text_id: A8
informant: Nanəs Bənyamən
place: ʾƐn-Nune

(1) a-\u207Atest word...[^1] (a-comment) [GK: lalala fd] bla //

blatwo
(2) also[^2] <E>*wórds*<E>.i.ˈ [GK:
b-mú bəcnàšəva?] b-mù

(3) more wordsčx
(4) new-paragraph — pause. t-wéwa ... ʾáyya  <R>*tséntr*<R>-ət

[^1]: The 
[^2]: Second footnote.
continued.
[^3]: Third footnote, not referenced in text.

"""

# demonstration of output results of parser, to be used by generate_TF loop
parser = NenaParser()
parser.parse(lexer.tokenize(parser_test))

Parser debugging for NenaParser written to parser.out


({'title': 'Gozáli and Nozali',
  'text_id': 'A8',
  'informant': 'Nanəs Bənyamən',
  'place': 'ʾƐn-Nune'},
 [[(1,
    [<Morpheme 'a' trailer '-'>,
     <Morpheme '⁺test' trailer ' '>,
     <Morpheme 'word' trailer '... ' fn_anc {1: 'The '}>,
     ('comment', 'a-comment'),
     <Morpheme 'lalala' trailer ' ' speaker 'GK'>,
     <Morpheme 'fd' trailer ' ' speaker 'GK'>,
     <Morpheme 'bla' trailer ' //'>,
     <Morpheme 'blatwo' trailer ' '>]),
   (2,
    [<Morpheme 'also' trailer ' ' fn_anc {2: 'Second footnote.\ncontinued.'}>,
     <Morpheme 'wórds' trailer '.' foreign lang 'E'>,
     <Morpheme 'i' trailer '.'>,
     <Morpheme 'ˈ' trailer ' '>,
     <Morpheme 'b' trailer '-' speaker 'GK'>,
     <Morpheme 'mú' trailer ' ' speaker 'GK'>,
     <Morpheme 'bəcnàšəva' trailer '? ' speaker 'GK'>,
     <Morpheme 'b' trailer '-'>,
     <Morpheme 'mù' trailer ''>])],
  [(3, [<Morpheme 'more' trailer ' '>, <Morpheme 'wordsčx' trailer ' '>]),
   (4,
    [<Morpheme 'new' trailer '-'>,
     <Morph

In [201]:
'(4)  mə̀rrəˈ ʾáha ɟári ʾàvəˈ ʾo-nášət ʾána ⁺byàyunˈ k̭at-lá-ʾavilə ⁺ʾā̀x,ˈ k̭at-lá-ʾavilə xə̀šša,ˈ lá-ʾavilə taxmànta,ˈ ʾáha ɟánu xá-ʾaxča laxùyma,ˈ xá-ʾaxča zùyza,ˈ ʾá duccána ⁺ɟùrta,ˈ paláxə xut-ʾìdu.ˈ bas-ʾáha lə̀tlə xə́šša.ˈ p-sáp̂rən xázzən mu-p̂ṱ-òya.ˈ'.split()

['(4)',
 'mə̀rrəˈ',
 'ʾáha',
 'ɟári',
 'ʾàvəˈ',
 'ʾo-nášət',
 'ʾána',
 '⁺byàyunˈ',
 'k̭at-lá-ʾavilə',
 '⁺ʾā̀x,ˈ',
 'k̭at-lá-ʾavilə',
 'xə̀šša,ˈ',
 'lá-ʾavilə',
 'taxmànta,ˈ',
 'ʾáha',
 'ɟánu',
 'xá-ʾaxča',
 'laxùyma,ˈ',
 'xá-ʾaxča',
 'zùyza,ˈ',
 'ʾá',
 'duccána',
 '⁺ɟùrta,ˈ',
 'paláxə',
 'xut-ʾìdu.ˈ',
 'bas-ʾáha',
 'lə̀tlə',
 'xə́šša.ˈ',
 'p-sáp̂rən',
 'xázzən',
 'mu-p̂ṱ-òya.ˈ']

### Parser output

The parser prints a warning that there were shift/reduce conflicts, probably caused by ambiguous whitespace. The parser resolves these conflicts automatically (by default in favour of shifting), so this is not a blocking problem, although ideally the grammar should be fixed to remove them.

The output for the example text shows that the parser succeeded in parsing it and structuring it into heading, paragraphs, lines and morphemes, with the features stored in the `Morpheme` objects.

## Testing with Real Texts

In [202]:
from pathlib import Path

In [203]:
# paths
data_dir = Path('../nena/0.01')
dialect_dirs = list(Path(data_dir).glob('*'))

### Run Parse On All Texts

In [235]:
parsed = []
name2text = {}
not_parsed = []

ignore = [
    #'The Adventures Of Two Brothers.nena', # FIX BY MOVING UNEMPHASIZED OUT
]

for dialect in dialect_dirs:
    print(f'--Dialect {dialect}--')
    print()
    for file in sorted(dialect.glob('*.nena')):
        
        if file.name in ignore:
            print('SKIPPING:', file.name, '\n')
            not_parsed.append(file)
            continue
        
        with open(file, 'r', encoding='utf-8') as infile:
            text = infile.read()
            name2text[file.name] = text
            print(f'trying: {file.name}')
            parseit = parser.parse(lexer.tokenize(text))
            print(f'\t√')
            parsed.append(parseit)
                
print(len(parsed), 'parsed...')
print(len(not_parsed), 'not parsed...')

--Dialect ../nena/0.01/Barwar--

trying: A Hundred Gold Coins.nena
	√
trying: A Man Called Čuxo.nena
	√
trying: A Tale Of A Prince And A Princess.nena
	√
trying: A Tale Of Two Kings.nena
	√
trying: Baby Leliθa.nena
	√
trying: Dəmdəma.nena
	√
trying: Gozali And Nozali.nena
	√
trying: I Am Worth The Same As A Blind Wolf.nena
	√
trying: Man Is Treacherous.nena
	√
trying: Measure For Measure.nena
	√
trying: Nanno And Jəndo.nena
	√
trying: Qaṭina Rescues His Nephew From Leliθa.nena
	√
trying: Sour Grapes.nena
	√
trying: Tales From The 1001 Nights.nena
	√
trying: The Battle With Yuwanəs The Armenian.nena
	√
trying: The Bear And The Fox.nena
	√
trying: The Brother Of Giants.nena
	√
trying: The Cat And The Mice.nena
	√
trying: The Cooking Pot.nena
	√
trying: The Crafty Hireling.nena
	√
trying: The Crow And The Cheese.nena
	√
trying: The Daughter Of The King.nena
	√
trying: The Fox And The Lion.nena
	√
trying: The Fox And The Miller.nena
	√
trying: The Fox And The Stork.nena
	√
trying: The Gian

## Conversion to TextFabric

The output of the parser can be converted to TextFabric format very easily: most elements are already separated and structured by the parser, which removes a lot of checking and matching from the conversion script. Some extra structuring, such as the division into sentences, subsentences, prosaic units, and words, must still be done in the conversion script, based on the contents of the `trailer` attribute.

Below is an (incomplete) example of a loop converting the parser output to something TextFabric can work with.

In [4]:
tf_test = """
# Gozáli and Nozali

text_id: A8
informant: Nanəs Bənyamən
place: ʾƐn-Nune

(1) a-\u207Aword...[^1] (a-comment) (GK: lalala) bla //

bla
(2) also[^2] <E>*wórds*<E>.

(4) new paragraph.

[^1]: First footnote.
[^2]: Second footnote.
"""

heading, paragraphs = parser.parse(lexer.tokenize(tf_test))

# raw_features['title'][this_text] = heading['title']
# ... etc.

# initialize counters (will be increased to start from 1)
this_paragraph = 0
this_line = 0
this_sentence = 0
this_subsentence = 0
this_word = 0
this_morpheme = 0
this_prosa = 0

slot = 0 # i.e. chars

# Mark units that are increased upon their 'ending' boundary as 'ended',
# so their counters will be increased on first morpheme
sentence_end = True
subsentence_end = True
prosa_end = True
word_end = True

for p in paragraphs:
    if type(p) is tuple:
        # key, value = p  # unreferenced footnotes are passed as paragraph tuples
        # and can be ignored
        continue
    this_paragraph += 1
    for line_id, line in p:
        this_line += 1
        for m in line:
            if type(m) is tuple:
                # key, value = line_element  # comments are passed as tuples
                # to be ignored?
                continue
            
            # increase counters
            this_morpheme += 1
            
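            # increase counters of ended units, and mark them as
            # started so they are not increased again until they end
            if sentence_end:
                this_sentence += 1
                sentence_end = False
            if subsentence_end:
                this_subsentence += 1
                subsentence_end = False
            if prosa_end:
                this_prosa += 1
                prosa_end = False
            if word_end:
                this_word += 1
                word_end = False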
            
            for c in m.value:
                slot += 1
                
                # add main character features:
                # pretty_c = unicodedata.normalize('NFC', c)  # make pretty utf8 char text
                # trans_c = translate(c, transcr_table)  # character in transcription
                # raw_features['utf8'][slot] = pretty_c
                # raw_features['trans'][slot] = trans_c
                
                # and other char features from Morpheme object `m`:
                # if m.speaker:
                #     raw_features['speaker'][slot] = m.speaker
                # if m.foreign:
                #     raw_features['language'][slot] = m.lang or ''
            
            # the last character of a `morpheme` gets its `trailer` and `footnotes`:
            # raw_features['trailer'][slot] = m.trailer.replace('|', '\u02c8')
            # if any(m.footnotes.values()):
            #     raw_features['footnotes'][slot] = '\n'.join(m.footnotes.values())
                
            # check for unit ends
            if (any(c in m.trailer for c in '.!?')
                or m.trailer.endswith('//')):
                sentence_end = True
                subsentence_end = True
            if (any(c in m.trailer for c in ',;:')
                or m.trailer.endswith('/')):
                subsentence_end = True
            # the lexer's PUNCTUATION token uses U+02C8 for the vertical bar
            if '|' in m.trailer or '\u02c8' in m.trailer:
                prosa_end = True
            if m.trailer not in ('-', '=', ''):
                word_end = True
                # m.trailer == '' should only occur at end of paragraph.
                # TODO issue a warning if it occurs elsewhere? (Better in parser?)
                
    # end of paragraph also ends sentence, subsentence, prosa, and word units
    sentence_end = True
    subsentence_end = True
    prosa_end = True
    word_end = True        
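
The unit-end checks in the loop above can also be isolated into a small function, which makes them easier to test against individual trailers. This is only a sketch: `unit_ends` is a hypothetical helper, and the check for `\u02c8` is an assumption, added because the lexer's `PUNCTUATION` token uses that character rather than `|`.

```python
def unit_ends(trailer):
    """Return the set of units that end after a morpheme with this trailer."""
    ended = set()
    if any(c in trailer for c in '.!?') or trailer.endswith('//'):
        ended.update({'sentence', 'subsentence'})
    if any(c in trailer for c in ',;:') or trailer.endswith('/'):
        ended.add('subsentence')
    # assumption: accept both the ASCII bar and U+02C8 as prosodic boundary
    if '|' in trailer or '\u02c8' in trailer:
        ended.add('prosa')
    if trailer not in ('-', '=', ''):
        ended.add('word')
    return ended


for sample in ('-', ', ', '. ', ' //', '\u02c8 '):
    print(repr(sample), sorted(unit_ends(sample)))
```

A hyphen trailer ends nothing, so the next morpheme stays inside the same word, while a verse break `//` ends the sentence, the subsentence and the word at once.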

TODO

- [x] ~~implement 'foreign' marker `*`~~
- [x] ~~implement language marker `<Marker>`~~
- [x] ~~implement line and verse breaks `/` and `//`~~
- [x] ~~implement footnotes~~

ISSUES

The parser does not enforce all parts of the grammar. For example, as verse and line breaks are just appended to the trailer, nothing stops it from adding other trailer elements (save whitespace) after them. The two `LANG_MARKER`s are compared, but a mismatch is not yet reported. A `+` sign must appear as the first character in a `morpheme`, but that just means that a `+` in the middle of a morpheme breaks it into two morphemes, instead of invalidating it. Undoubtedly there are more issues like this.

Paragraphs in which the first line lacks a `line_id` break the parser. That is the case for e.g. the first line of a text in which the `line_id` is absent, or for poetic-style text with no `//` verse break marker but with an empty line dividing verses. This could (should?) be handled by fixing the issue (default `line_id=1` for the first line, default `//` verse break for empty lines within a `line`), and issuing a warning notifying the user of the automatic fix.
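
One possible shape for the first of those fixes is a pre-processing step on the raw paragraph text, before it reaches the lexer. The sketch below is an assumption about how this might be done, not part of the parser; `ensure_line_id` and `LINE_ID` are hypothetical names.

```python
import re
import warnings

# Hypothetical check: '(n)' followed by whitespace, mirroring the DIGITS token.
LINE_ID = re.compile(r'\([1-9][0-9]*\)\s')


def ensure_line_id(paragraph, default=1):
    """Prefix a default line id if the paragraph text does not start with one."""
    if LINE_ID.match(paragraph):
        return paragraph
    warnings.warn(f'missing line id; inserted default ({default})')
    return f'({default}) {paragraph}'
```

Doing this before tokenizing keeps the grammar unchanged, at the cost of a warning that refers to the raw text rather than to a token position.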

Footnotes can only be one line and the string is not processed (e.g. markup like `*emphasis*` is kept as is).
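
To see what the `FOOTNOTE` token accepts, its pattern can be probed in isolation with `re`. In this sketch (the sample strings are made up) the `\D*` part runs across a continuation line, as in the parser output above, but it rejects footnote text that contains a digit.

```python
import re

# Pattern copied from the lexer's FOOTNOTE token.
FOOTNOTE = re.compile(r'(?m)^\[\^[1-9][0-9]*\]: \D*$')

# Made-up samples: a footnote with a continuation line, and one with a digit.
two_lines = '[^2]: Second footnote.\ncontinued.\n[^3]: Third footnote.'
with_digit = '[^4]: See line 12.'

first = FOOTNOTE.match(two_lines)
matched_text = first.group() if first else None
digit_match = FOOTNOTE.match(with_digit)

print(repr(matched_text))
print(digit_match)
```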

Footnote anchors can now only occur after a `morpheme`, not after other things like `comment` or `interruption`. (note to self: possible solution: include `fn_anc` in `morphemes` instead of `morpheme_trailer`, and put `comment` in `trailer`).

Comments and unreferenced footnotes are returned as tuples for now, and have to be filtered out in the loop.

QUESTIONS

Some questions require answers for implementation. They need not be definitive answers for now, but they should be motivated somehow (even if the motivation is 'random choice'), so it will be clear later why it is done in one way or another.

- How to store hyphen? Now it is stored as a character in a word occurring between morphemes (I think).
  
  Should it be the trailer of the morpheme?


- How to split sentences?

  Now sentences are split on .?! and subsentences on ,
  There are other symbols: ;:– and even .. ... ..., .... ..... (If I recall correctly). Should those split
  sentences or subsentences?


- What to do with poetic line breaks and sentence/paragraph boundaries?

  I think a 'poem' should not be divided into paragraphs. I suggest that a line break '/' is a subsentence division, and a verse break '//' a sentence division (even when in the source it is followed by an empty line). If there is a verse number in between, that automatically starts a new sentence.