# NenaParser: A parser for Nena Standard Text format

This parser reads texts written in the plaintext
[NENA markup format][nenamarkup] and returns a structured
list of words and their features, together with paragraph and line
divisions. The result can be stored in a data format such as
[Text-Fabric][textfabric] or [Text-as-Graph][textasgraph], or (less ideally)
in XML or another hierarchical structure.

The NENA markup parser is built with [SLY][sly], a Python implementation
of the lex/yacc family of parser generators.

[nenamarkup]: ../docs/nena_format.md
[sly]: https://sly.readthedocs.io/en/latest/
[textfabric]: https://github.com/annotation/text-fabric
[textasgraph]: https://www.balisage.net/Proceedings/vol19/print/Dekker01/BalisageVol19-Dekker01.html

In [1]:
import re
from sly import Lexer
import unicodedata

## Example Text

Below is a dummy text we can use to test the parsers on.

In [78]:
example = '''

dialect: Urmi_C
title: When Shall I Die?
encoding: UTF8
informant: Yulia Davudi
interviewer: Geoffrey Khan
place: +Hassar +Baba-čanɟa, N
transcriber: Geoffrey Khan
text_id: A32 

(1@0:00) xá-yuma ⁺malla ⁺Nasrádən váyələ tíva ⁺ʾal-k̭èsa.ˈ xá mən-nášə 
⁺vàrəva,ˈ mə́rrə ⁺màllaˈ ʾátən ʾo-k̭ésa pràmut,ˈ bət-nàplət.ˈ mə́rrə <P>bŏ́ro<P> 
bàbaˈ ʾàtən=daˈ ⁺šúla lə̀tluxˈ tíyyət b-dìyyi k̭ítət.ˈ ⁺šúk̭ si-⁺bar-⁺šùlux.ˈ 
ʾána ⁺šūl-ɟànilə.ˈ náplən nàplən.ˈ (2@0:08) ⁺hàlaˈ ʾo-náša léva xíša xá 
⁺ʾəsrá ⁺pasulyày,ˈ ⁺málla bitáyələ drúm ⁺ʾal-⁺ʾàrra.ˈ bək̭yámələ ⁺bərxáṱələ 
⁺bàru.ˈ màraˈ ⁺maxlèta,ˈ ʾátən ⁺dílux ʾána bət-náplənva m-⁺al-ʾilàna.ˈ 
bas-tánili xázən ʾána ʾíman bət-mètən.ˈ ʾo-náša xzílə k̭at-ʾá ⁺màllaˈ hónu 
xáč̭č̭a ... ⁺basùrələˈ mə́rrə k̭àtuˈ ⁺maxlèta,ˈ mə̀drə,ˈ 
<<Geoffrey Khan: maxlèta?>> ⁺rába ⁺maxlèta.ˈ mə́rrə k̭at-ʾíman xmártux 
⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ ʾó-yuma mètət.ˈ ʾó-yumət xmártux ⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ 
ʾó-yuma mètət.ˈ 

(3@0:16) ⁺málla múttəva ... ⁺ṱànaˈ ⁺yak̭úyra ⁺ʾal-xmàrta.ˈ ⁺ṱànaˈ mə́ndi 
⁺rába múttəva ⁺ʾal-xmàrtaˈ ʾu-xmàrtaˈ ⁺báyyava ʾask̭áva ⁺ʾùllul.ˈ
ʾu-bas-pòxa ⁺plə́ṱlə mənnó.ˈ ṱə̀r,ˈ ⁺riṱàla.ˈ ⁺málla mə́rrə ʾàha,ˈ ʾána dū́n
k̭arbúnə k̭a-myàta.ˈ (4@0:20) xáč̭č̭a=da sə̀k̭laˈ xa-xìta.ˈ ɟánu mudməxxálə
⁺ʾal-⁺ʾàrra.ˈ mə̀rrəˈ xína ⁺dā́n mòtila.ˈ ʾē=t-d-⁺ṱlàˈ ⁺málla mə̀tlə.ˈ nàšə,ˈ
 xuyravàtuˈ xə́šlun tílun mə̀rrunˈ ʾa mù-vadət? k̭a-mú=ivət ⁺tàmma?ˈ mə́rrə 
 xob-ʾána mìtən.ˈ lá bəxzáyətun k̭at-mìtən!ˈ lá mə́rrun ʾat-xàya!ˈ 
 hamzùməvət.ˈ bəšvák̭una ⁺tàmaˈ màraˈ xmàrələ,ˈ lélə ⁺p̂armùyə.ˈ
 '''

## Structure of a text with NENA Markup

Below is a representation of the tree-like structure of a NENA standard text file. This is the structure that the parser must recognize and reproduce.

`+` is used to represent one or more elements.

```
text
  |
  metadata block
  |  |
  |  +attribute
  | 
  text block
    |
    +paragraph
      |   
      +line
        |
        +word
          |
          +letter
```

These items will be returned in the following Pythonic representation:

In [79]:
[ # text
    [ # metadata block
        ('dialect', 'Urmi_C'),
        ('title', 'When Shall I Die?'),
        ('encoding', 'UTF8'),
    ],
    [ # text block
        [ # paragraph
            [ # line
                ('number', '1'),
                ('timestamp', '0:00'),
                ('words', [
                        { # word
                            'text':'xá',
                            'begin':'',
                            'end':'-',
                            'lang':'NENA', 
                            'letters':('x','á'),
                        },
                        # ...
                        { # foreign word
                            'text':'bŏ́ro',
                            'begin':'<P>',
                            'end': '<P> ',
                            'lang': 'P', 
                            'letters':('b','ŏ́','r','o'),
                        }, 
                    ],
                ),
            ],
        ],
    ],
]

[[('dialect', 'Urmi_C'), ('title', 'When Shall I Die?'), ('encoding', 'UTF8')],
 [[[('number', '1'),
    ('timestamp', '0:00'),
    ('words',
     [{'text': 'xá',
       'begin': '',
       'end': '-',
       'lang': 'NENA',
       'letters': ('x', 'á')},
      {'text': 'bŏ́ro',
       'begin': '<P>',
       'end': '<P> ',
       'lang': 'P',
       'letters': ('b', 'ŏ́', 'r', 'o')}])]]]]
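
As a rough illustration of how this nested representation can be consumed, the sketch below walks a text and yields every word together with its line number. The function name `iter_words` and the shortened `sample` value are ours, introduced only for this example; they are not part of the parser.

```python
def iter_words(text):
    """Yield (line_number, word_dict) pairs from the nested representation."""
    metadata, text_block = text
    for paragraph in text_block:
        for line in paragraph:
            # a line is a list of (key, value) pairs
            fields = dict(line)
            for word in fields['words']:
                yield fields['number'], word


# A shortened, hand-written sample in the shape shown above.
sample = [
    [('dialect', 'Urmi_C')],
    [  # text block
        [  # paragraph
            [  # line
                ('number', '1'),
                ('timestamp', '0:00'),
                ('words', [
                    {'text': 'xa', 'begin': '', 'end': '-',
                     'lang': 'NENA', 'letters': ('x', 'a')},
                ]),
            ],
        ],
    ],
]

words = list(iter_words(sample))
```

Turning each line into a `dict` keeps the lookup independent of the order of the `(key, value)` pairs.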

## Lexer

The parser takes 'tokens' as its input: predefined units of one or more characters, produced by the 'lexer'. In SLY (as in PLY), each token is defined by a regular expression, and the matched string becomes the token's value. If a token is instead defined as a method, with its regular expression passed to the `@_` decorator, the method can modify the token (for example its value) before returning it. For more detailed information, [see the documentation][slydocs].

[slydocs]: https://sly.readthedocs.io/en/latest/sly.html
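
To make the idea of "a regular expression plus an optional action" concrete without depending on SLY, here is a minimal sketch using only the standard `re` module. It is not how SLY works internally; `TOKEN_SPECS`, `tokenize` and the `WORDCHAR` token are hypothetical names used only here.

```python
import re

# Each entry: (token name, regex, optional action that rewrites the value).
# Order matters: the first pattern that matches at the current position wins.
TOKEN_SPECS = [
    ('LINENUMBER', r'\((\d+)', lambda m: m.group(1)),
    ('TIMESTAMP', r'@(\d:\d+)\)', lambda m: m.group(1)),
    ('SPACE', r'\s', None),
    ('WORDCHAR', r'\w', None),
]


def tokenize(text):
    """Yield (name, value) pairs, trying the patterns in order."""
    pos = 0
    while pos < len(text):
        for name, pattern, action in TOKEN_SPECS:
            match = re.compile(pattern).match(text, pos)
            if match:
                # without an action the matched string is the value
                value = action(match) if action else match.group(0)
                yield name, value
                pos = match.end()
                break
        else:
            raise ValueError(f'illegal character {text[pos]!r} at index {pos}')


toy_tokens = list(tokenize('(1@0:00) xa'))
```

The tokens with an action return only the captured digits, while the plain tokens return the matched string unchanged.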

In [80]:
# Punctuation that may precede or follow the letters of a word.
begin_punct = '+' '\u207A'
end_punct = '.' ',' '?' '!' ':' ';' '–' '—' r'\-' '\u02c8' '='
# A letter is one alphabetic character plus any combining diacritics.
letters = fr'[^\W\d_{end_punct}{begin_punct}][\u0300-\u036F]*'

class NenaLexer(Lexer):
    
    # set of token names
    tokens = {
        NEWLINE, NEWLINES, ATTRIBUTE, 
        BEGIN, LETTER, END,
        LANG_MARKER, LINENUMBER,
        TIMESTAMP, SPEAKER_START,
        SPEAKER_END, SPACE,
    }
    
    # Text blocks and paragraphs are marked off with 2 newlines
    # spaces are allowed
    NEWLINES = r'\n\s*\n\s*'

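    # An attribute is a key, then a colon and a space, then the value.
    # Returns a 2-tuple (key, value).
    @_(r'[a-z][a-z0-9_]+: .*')
    def ATTRIBUTE(self, t):
        # split on the first ': ' only, so that values may contain ': '
        t.value = tuple(t.value.split(': ', 1))
        return t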
    
    @_(r'\(\d+\)', 
       r'\(\d+')
    def LINENUMBER(self, t):
        t.value = re.findall(r'\d+', t.value).pop()
        return t
    
    @_(r'\@\d:\d+\)')
    def TIMESTAMP(self, t):
        t.value = re.match(r'@(\d:\d+)\)', t.value).group(1)
        return t
    
    BEGIN = fr'[{begin_punct}]'
    LETTER = letters
    END = fr'[{end_punct}]'
    NEWLINE = r'\n'
    SPACE = r'\s'
    
    # Language markers are ASCII letter strings 
    # surrounded by angle brackets.
    LANG_MARKER = r'<[A-Za-z]+>'
    
    @_(r'>>')
    def SPEAKER_END(self, t):
        return t

    @_(r'<<[a-z\sA-Z]+: ')
    def SPEAKER_START(self, t):
        t.value = re.match(r'<<([a-z\sA-Z]+): ', t.value).group(1)
        return t

In [81]:
# demonstration of output results of lexer, to be used by parser below
lexer = NenaLexer()

tokens = [(tok.type, tok.value) for tok in lexer.tokenize(example)]

In [82]:
tokens[0:40]

[('NEWLINES', '\n\n'),
 ('ATTRIBUTE', ('dialect', 'Urmi_C')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('title', 'When Shall I Die?')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('encoding', 'UTF8')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('informant', 'Yulia Davudi')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('interviewer', 'Geoffrey Khan')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('place', '+Hassar +Baba-čanɟa, N')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('transcriber', 'Geoffrey Khan')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('text_id', 'A32 ')),
 ('NEWLINES', '\n\n'),
 ('LINENUMBER', '1'),
 ('TIMESTAMP', '0:00'),
 ('SPACE', ' '),
 ('LETTER', 'x'),
 ('LETTER', 'á'),
 ('END', '-'),
 ('LETTER', 'y'),
 ('LETTER', 'u'),
 ('LETTER', 'm'),
 ('LETTER', 'a'),
 ('SPACE', ' '),
 ('BEGIN', '⁺'),
 ('LETTER', 'm'),
 ('LETTER', 'a'),
 ('LETTER', 'l'),
 ('LETTER', 'l'),
 ('LETTER', 'a'),
 ('SPACE', ' '),
 ('BEGIN', '⁺'),
 ('LETTER', 'N'),
 ('LETTER', 'a'),
 ('LETTER', 's'),
 ('LETTER', 'r')]
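
The token list shows an accented vowel as a single `LETTER` token. The sketch below, which copies the letter pattern from the lexer cell and uses only the standard library, suggests why: the pattern matches one alphabetic character followed by any combining marks in the range U+0300 to U+036F, so it does not matter whether the accent is stored as part of a precomposed code point or as a separate combining mark. The names `precomposed`, `decomposed`, `tokens_nfc` and `tokens_nfd` are ours.

```python
import re
import unicodedata

# Same pattern as in the lexer cell (dashes written as escapes here).
begin_punct = '+\u207A'
end_punct = '.,?!:;\u2013\u2014' r'\-' '\u02c8='
letters = fr'[^\W\d_{end_punct}{begin_punct}][\u0300-\u036F]*'

# 'x' followed by a-acute as ONE code point (U+00E1)
precomposed = 'x\u00e1'
# the same string with the accent split off: 'a' + U+0301
decomposed = unicodedata.normalize('NFD', precomposed)

tokens_nfc = re.findall(letters, precomposed)
tokens_nfd = re.findall(letters, decomposed)

# Both forms give two letters; only the length of the second one differs.
print([len(tok) for tok in tokens_nfc])
print([len(tok) for tok in tokens_nfd])
```

In both cases the accented vowel stays together as one letter, which is what the `letters` feature of a word relies on.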

## Parser

The parser processes the tokens provided by the lexer, and tries to combine them into structured units. Those units are defined in the methods of the `NenaParser` class, with the patterns passed as arguments to the `@_` decorator.

The top unit (in this case, `text`) is returned as the result of the parsing.
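
To give a feel for what "combining tokens into structured units" means at the word level, here is a hand-written sketch that groups `(type, value)` token pairs into word dictionaries. It is a plain loop, not the SLY grammar used by `NenaParser`, and the function name `group_words` and the grouping rule (a `BEGIN` or `LETTER` after trailing material starts a new word) are assumptions made for illustration.

```python
def group_words(tokens):
    """Group BEGIN/LETTER/END/SPACE tokens into word dicts."""
    words = []
    current = None
    for tok_type, value in tokens:
        if tok_type in ('BEGIN', 'LETTER'):
            # start a new word at the beginning, or once the
            # previous word has collected trailing material
            if current is None or current['end']:
                current = {'begin': '', 'letters': [], 'end': ''}
                words.append(current)
            if tok_type == 'BEGIN':
                current['begin'] += value
            else:
                current['letters'].append(value)
        elif tok_type in ('END', 'SPACE') and current is not None:
            # trailing punctuation and whitespace belong to the word
            current['end'] += value
    for word in words:
        word['text'] = ''.join(word['letters'])
        word['letters'] = tuple(word['letters'])
    return words


toy = [
    ('LETTER', 'x'), ('LETTER', 'a'), ('END', '-'),
    ('LETTER', 'y'), ('LETTER', 'u'), ('LETTER', 'm'), ('LETTER', 'a'),
    ('SPACE', ' '),
    ('BEGIN', '+'), ('LETTER', 'm'), ('LETTER', 'a'),
]
grouped = group_words(toy)
```

A grammar-based parser expresses the same grouping declaratively, as rules such as "a word is an optional begin, one or more letters, and an optional end", and adds the higher levels (line, paragraph, text) in the same way.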

In [5]:
from sly import Parser

# dict stack to contain footnote anchors,
# until the corresponding footnote is encountered.
fn_anchors = {}

class NenaParser(Parser):
    
    debugfile = 'parser.out'

    # Get the token list from the lexer (required)
    tokens = NenaLexer.tokens
    
    

In [None]:
# demonstration of output results of parser, to be used by generate_TF loop
parser = NenaParser()
parser.parse(lexer.tokenize(example))

### Parser output

The parser warns about shift/reduce conflicts, probably caused by ambiguous whitespace. These do not prevent parsing: like other yacc-style generators, SLY resolves a shift/reduce conflict automatically, by default in favor of shifting. It is not very elegant, though, and ideally the ambiguity in the grammar should be removed.

The output for the example text shows that the parser succeeded in parsing it and structuring it into heading, paragraphs, lines and morphemes, with the features stored in the Morpheme object.

## Testing with Real Texts

In [None]:
from pathlib import Path

In [None]:
# paths
data_dir = Path('../nena/0.01')
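# keep only directories, so stray files in data_dir are not treated as dialects
dialect_dirs = [path for path in data_dir.glob('*') if path.is_dir()]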

### Run Parse On All Texts

In [None]:
name2parsed = {}
name2text = {}
not_parsed = []

ignore = [
    #'The Adventures Of Two Brothers.nena', # FIX BY MOVING UNEMPHASIZED OUT
]

for dialect in dialect_dirs:
    print(f'--Dialect {dialect}--')
    print()
    for file in sorted(dialect.glob('*.nena')):
        
        if file.name in ignore:
            print('SKIPPING:', file.name, '\n')
            not_parsed.append(file)
            continue
        
        # read explicitly as UTF-8, independent of the platform default
        with open(file, 'r', encoding='utf8') as infile:
            text = infile.read()
            name2text[file.name] = text
            print(f'trying: {file.name}')
            parseit = parser.parse(lexer.tokenize(text))
            print('\t√')
            name2parsed[file.name] = parseit
                
print(len(name2parsed), 'parsed...')
print(len(not_parsed), 'not parsed...')
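
The loop above only records skipped files in `not_parsed`. If failed parses should be collected as well, one possible shape is sketched below. The helper `parse_all`, the stand-in `toy_parse`, and the choice to treat both an exception and a `None` result as a failure are assumptions for illustration, not part of the notebook.

```python
import tempfile
from pathlib import Path


def parse_all(paths, parse_fn):
    """Parse every file; return (parsed, failed)."""
    parsed, failed = {}, []
    for path in sorted(paths):
        text = path.read_text(encoding='utf8')
        try:
            result = parse_fn(text)
        except Exception as error:
            # keep going, but remember what went wrong
            failed.append((path.name, repr(error)))
            continue
        if result is None:
            failed.append((path.name, 'parser returned None'))
        else:
            parsed[path.name] = result
    return parsed, failed


def toy_parse(text):
    """Stand-in for the real parser: splits on whitespace."""
    if not text.strip():
        raise ValueError('empty text')
    return text.split()


# Demonstration on two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / 'good.nena').write_text('xa yuma', encoding='utf8')
    (tmp_dir / 'bad.nena').write_text('', encoding='utf8')
    parsed, failed = parse_all(tmp_dir.glob('*.nena'), toy_parse)
```

With the objects from this notebook, `parse_fn` could be something like `lambda text: parser.parse(lexer.tokenize(text))`.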

In [None]:
#name2text['A Man Called Čuxo.nena'][7100:7120]