# NenaParser: A parser for Nena Standard Text format

This parser reads texts written in the plaintext
[NENA markup format][nenamarkup] and delivers a structured
list of words and their features, together with paragraph and line
marks. The result can be stored in a data format such as [Text-Fabric][textfabric]
or [Text-as-Graph][textasgraph], or (less optimally) in XML or another hierarchical
structure.

For the NENA markup parser we use [Sly][sly], a Python implementation
of the lex/yacc family of parser generators.

[nenamarkup]: ../docs/nena_format.md
[sly]: https://sly.readthedocs.io/en/latest/
[textfabric]: https://github.com/annotation/text-fabric
[textasgraph]: https://www.balisage.net/Proceedings/vol19/print/Dekker01/BalisageVol19-Dekker01.html

In [47]:
import re
import json
from sly import Lexer, Parser
import unicodedata
from pprint import pprint

# prepare alphabet and punctuation standards for processing
alphabet_std = '../standards/alphabet.json'
punctuation_std = '../standards/punctuation.json' 
lang_std = '../standards/foreign_languages.json'

with open(alphabet_std, 'r') as infile:
    alphabet_data = {
        re.compile(data['decomposed_regex']):data 
            for data in json.load(infile)
    }
with open(punctuation_std, 'r') as infile:
    punct_data = {
        re.compile(data['regex']):data 
            for data in json.load(infile)
    }
with open(lang_std, 'r') as infile:
    foreign_data = set(lang['code'] for lang in json.load(infile))
    
# compile regexes for matching
alphabet_re = '|'.join(data['decomposed_regex'] for letter,data in alphabet_data.items())

punct_begin_re = '|'.join(
    data['regex'] for punct, data in punct_data.items()
        if data['position'] == 'begin'
)
punct_end_re = '|'.join(
    data['regex'] for punct, data in punct_data.items()
        if data['position'] == 'end'
)

foreign_codes = '|'.join(foreign_data)
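
The alphabet patterns are taken from the `decomposed_regex` key, so they can only match text that is in decomposed (NFD) form. The sketch below illustrates this with two hypothetical alphabet entries; the real entries live in `../standards/alphabet.json` and may look different.

```python
import re
import unicodedata

# Hypothetical stand-ins for entries in alphabet.json; only the key name
# 'decomposed_regex' is taken from the code above.
sample_alphabet = [
    {'decomposed_regex': 'a[\u0300\u0301]?'},  # a, optionally with grave or acute
    {'decomposed_regex': 'x'},
]
sample_re = re.compile('|'.join(d['decomposed_regex'] for d in sample_alphabet))

composed = 'x\u00e1'                                  # "xá" with a precomposed á
decomposed = unicodedata.normalize('NFD', composed)   # á becomes a + U+0301

letters_nfd = sample_re.findall(decomposed)  # base letter and accent stay together
letters_nfc = sample_re.findall(composed)    # the precomposed á is not recognized
```

With the decomposed input, each letter comes back together with its combining accent; with the precomposed input, the accented letter is silently skipped. Input should therefore be normalized to NFD before it reaches the lexer.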

## Testing Sly

In [48]:
# class TestLexer(Lexer):
#     tokens = {LETTER, SPACE}
#     LETTER = '[A-Za-z]'
#     SPACE = '\s'
    
# class TestParser(Parser):
    
#     debugfile = 'test_parser.out'
#     tokens = TestLexer.tokens
    
#     @_('word words')
#     def words(self, p):
#         return [p.word] + p.words
    
#     @_('word SPACE word')
#     def words(self, p):
#         return [p.word]
    
#     @_('letters')
#     def word(self, p):
#         return {
#             'letters': p.letters,
#             'punct': '',
#         }
    
#     @_('LETTER letters')
#     def letters(self, p):
#         return [p.LETTER] + p.letters 
    
#     @_('LETTER')
#     def letters(self, p):
#         return [p[0]]
    
    
# test_lexer = TestLexer()
# test_parser = TestParser()
# test_string = 'This is'
# test_parser.parse(test_lexer.tokenize(test_string))

In [49]:
class TestLexer(Lexer):
    tokens = {LETTER, SPACE}
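    LETTER = r'[A-Za-z]'

    # SPACE is defined as a function so that its value can be changed;
    # raw strings keep the backslash in \s intact
    @_(r'\s')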
    def SPACE(self, t):
        t.value = (t.value,)
        return t
    
class TestParser(Parser):
    debugfile = 'test_parser.out'
    tokens = TestLexer.tokens
    
    @_('word SPACE words')
    def words(self, p):
        p.word.update({"punc": p.SPACE})
        return [p.word] + p.words
    
    @_('word')
    def words(self, p):
        return [p.word]
    
    @_('letters')
    def word(self, p):
        return {"letters":  p.letters}
    
    @_('LETTER letters')
    def letters(self, p):
        return [p.LETTER] + p.letters 
    
    @_('LETTER')
    def letters(self, p):
        return [p[0]]
    
test_lexer = TestLexer()
test_parser = TestParser()
test_string = 'This is a test'
test_parser.parse(test_lexer.tokenize(test_string))

Parser debugging for TestParser written to test_parser.out


[{'letters': ['T', 'h', 'i', 's'], 'punc': (' ',)},
 {'letters': ['i', 's'], 'punc': (' ',)},
 {'letters': ['a'], 'punc': (' ',)},
 {'letters': ['t', 'e', 's', 't']}]

## Example Text

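Below is a sample text for testing the lexer and parser. It is normalized to NFD so that it matches the decomposed regular expressions of the alphabet standard.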

In [50]:
example = unicodedata.normalize('NFD', '''
dialect: Urmi_C
title: When Shall I Die?
encoding: UTF8
informant: Yulia Davudi
interviewer: Geoffrey Khan
place: +Hassar +Baba-čanɟa, N
transcriber: Geoffrey Khan
text_id: A32 

(1@0:00) xá-yuma "⁺malla ⁺Nasrádən" váyələ tíva ⁺ʾal-k̭èsa.ˈ xá mən-nášə 
⁺vàrəva,ˈ mə́rrə ⁺màllaˈ ʾátən ʾo-k̭ésa pràmut,ˈ bət-nàplət.ˈ mə́rrə <P: bŏ́ro> 
bàbaˈ ʾàtən=daˈ ⁺šúla lə̀tluxˈ tíyyət b-dìyyi k̭ítət.ˈ ⁺šúk̭ si-⁺bar-⁺šùlux.ˈ 
ʾána ⁺šūl-ɟànilə.ˈ náplən nàplən.ˈ (2@0:08) ⁺hàlaˈ ʾo-náša léva xíša xá 
⁺ʾəsrá ⁺pasulyày,ˈ ⁺málla bitáyələ drúm ⁺ʾal-⁺ʾàrra.ˈ bək̭yámələ ⁺bərxáṱələ 
⁺bàru.ˈ màraˈ ⁺maxlèta,ˈ ʾátən ⁺dílux ʾána bət-náplənva m-⁺al-ʾilàna.ˈ 
bas-tánili xázən ʾána ʾíman bət-mètən.ˈ ʾo-náša xzílə k̭at-ʾá ⁺màllaˈ hónu 
xáč̭č̭a ... ⁺basùrələˈ mə́rrə k̭àtuˈ ⁺maxlèta,ˈ mə̀drə,ˈ «GK: maxlèta?» ⁺rába 
⁺maxlèta.ˈ mə́rrə k̭at-ʾíman xmártux ⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ ʾó-yuma mètət.ˈ 
ʾó-yumət xmártux ⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ ʾó-yuma mètət.ˈ 

(3@0:16) ⁺málla múttəva ... ⁺ṱànaˈ ⁺yak̭úyra ⁺ʾal-xmàrta.ˈ ⁺ṱànaˈ mə́ndi 
⁺rába múttəva ⁺ʾal-xmàrtaˈ ʾu-xmàrtaˈ ⁺báyyava ʾask̭áva ⁺ʾùllul.ˈ
ʾu-bas-pòxa ⁺plə́ṱlə mənnó.ˈ ṱə̀r,ˈ ⁺riṱàla.ˈ ⁺málla mə́rrə ʾàha,ˈ ʾána dū́n
k̭arbúnə k̭a-myàta.ˈ (4@0:20) xáč̭č̭a=da sə̀k̭laˈ xa-xìta.ˈ ɟánu mudməxxálə
⁺ʾal-⁺ʾàrra.ˈ mə̀rrəˈ xína ⁺dā́n mòtila.ˈ ʾē=t-d-⁺ṱlàˈ ⁺málla mə̀tlə.ˈ nàšə,ˈ
 xuyravàtuˈ xə́šlun tílun mə̀rrunˈ ʾa mù-vadət? k̭a-mú=ivət ⁺tàmma?ˈ mə́rrə 
 xob-ʾána mìtən.ˈ lá bəxzáyətun k̭at-mìtən!ˈ lá mə́rrun ʾat-xàya!ˈ 
 hamzùməvət.ˈ bəšvák̭una ⁺tàmaˈ màraˈ xmàrələ,ˈ lélə ⁺parmùyə.ˈ
 ''')

## Structure of a text with NENA Markup

Below is a representation of the tree-like structure of a NENA standard text file. This is the structure that the parser must recognize and reproduce.

`+` is used to represent one or more elements.

```
text
  |
  metadata block
  |  |
  |  +attribute
  | 
  text block
    |
    +paragraph
      |   
      +line
        |
        +word
          |
          +letter
```

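These items are to be returned as nested Python lists, tuples and dicts. The cell below illustrates the intended shape; the parser further down does not match it exactly yet (for instance, it uses the keys `word`, `beginnings` and `endings`).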

In [51]:
[ # text
    [ # metadata block
        ('dialect', 'Urmi_C'),
        ('title', 'When Shall I Die?'),
        ('encoding', 'UTF8'),
    ],
    [ # text block
        [ # paragraph
            [ # line
                ('number', '1'),
                ('timestamp', '0:00'),
                ('words', [
                        { # word
                            'text':'xá',
                            'begin':'',
                            'end':'-',
                            'lang':'NENA', 
                            'letters':('x','á'),
                        },
                        # ...
                        { # foreign word
                            'text':'bŏ́ro',
                            'begin':'<P:',
                            'end': '> ',
                            'lang': 'P', 
                            'letters':('b','ŏ́','r','o'),
                        }, 
                    ],
                ),
            ],
        ],
    ],
]

[[('dialect', 'Urmi_C'), ('title', 'When Shall I Die?'), ('encoding', 'UTF8')],
 [[[('number', '1'),
    ('timestamp', '0:00'),
    ('words',
     [{'text': 'xá',
       'begin': '',
       'end': '-',
       'lang': 'NENA',
       'letters': ('x', 'á')},
      {'text': 'bŏ́ro',
       'begin': '<P:',
       'end': '> ',
       'lang': 'P',
       'letters': ('b', 'ŏ́', 'r', 'o')}])]]]]
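
As an illustration of how this nested shape could be consumed, here is a minimal sketch that walks a trimmed copy of the structure. The helper `iter_words` is hypothetical and not part of the parser.

```python
# Trimmed copy of the representation shown above.
text = [
    [('dialect', 'Urmi_C'), ('title', 'When Shall I Die?')],  # metadata block
    [  # text block
        [  # paragraph
            [  # line
                ('number', '1'),
                ('timestamp', '0:00'),
                ('words', [
                    {'text': 'xá', 'begin': '', 'end': '-', 'lang': 'NENA'},
                    {'text': 'bŏ́ro', 'begin': '<P:', 'end': '> ', 'lang': 'P'},
                ]),
            ],
        ],
    ],
]

metadata_block, text_block = text
metadata = dict(metadata_block)  # the (key, value) tuples convert directly


def iter_words(text_block):
    """Yield (line number, word dict) for every word in the text block."""
    for paragraph in text_block:
        for line in paragraph:
            line_data = dict(line)
            for word in line_data['words']:
                yield line_data['number'], word


# collect the words that are tagged with a language other than NENA
foreign = [word['text'] for _, word in iter_words(text_block)
           if word['lang'] != 'NENA']
```

Because both the metadata and the line features are lists of 2-tuples, `dict()` is enough to turn them into lookups.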

## Lexer

The parser takes 'tokens' as its input: predefined units of one or more characters, which are provided by the 'lexer'. In Sly (as in Ply), each token is defined by a regular expression, and the matched string becomes the token value. If a token is defined as a method instead (with its regular expression passed to the `@_` decorator), the method can modify the token, including its value, before returning it. For more detailed information, [see the documentation][slydocs].

[slydocs]: https://sly.readthedocs.io/en/latest/sly.html

In [71]:
class NenaLexer(Lexer):
    
    def error(self, t):
        """Give warning for bad characters"""
        print(f"Illegal character {repr(t.value[0])} @ index {self.index}")
        self.index += 1
    
    # set of token names as required by
    # the Lexer class
    tokens = {
        LETTER, FOREIGN_LETTER,
        PUNCT_BEGIN, PUNCT_END,
        NEWLINE, NEWLINES, ATTRIBUTE,
        LINESTAMP, SPAN_START, SPAN_END
    }

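    # An attribute is a key followed by a colon and a value.
    # Returns a 2-tuple (key, value); split on the first ': ' only,
    # so that values containing ': ' stay in one piece.
    @_(r'[a-z][a-z0-9_]+: .*')
    def ATTRIBUTE(self, t):
        t.value = tuple(t.value.split(': ', 1))
        return t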
    
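    # line number with optional timestamp, e.g. (1@0:00) or (12);
    # \d+ before the colon also allows timestamps of ten minutes or more
    @_(r'\(\d+@\d+:\d+\)\s*', 
       r'\(\d+\)\s*')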
    def LINESTAMP(self, t):
        t.value = t.value.strip()
        return t

    NEWLINES = r'\n\s*\n\s*' # i.e. a blank line; separates the metadata block and the paragraphs
    LETTER = alphabet_re    
    PUNCT_BEGIN = punct_begin_re
    PUNCT_END = punct_end_re
    NEWLINE = r'\n\s*'
        
    # treat the language and speaker tag simultaneously as a "span"
    # this optimizes the code quite a bit since both tags
    # behave identically when they are parsed
    @_(r'[<«][A-Za-z]+:\s*')
    def SPAN_START(self, t):
        if t.value[0] == '<':
            kind = 'language'
            punct_type = 'exclusive'
        else:
            kind = 'speaker'
            punct_type = 'inclusive'
        value = re.match(r'[<«]([A-Za-z]+):', t.value).group(1)
        tag = t.value.strip() + ' ' # ensure spacing
        t.value = (tag, kind, value, punct_type) # tag, key, value, punct_type
        return t
        
    SPAN_END = r'[>»]'
    
    # NB: tokens are matched in the order in which they are defined here,
    # so the foreign letter pattern is tried last
    FOREIGN_LETTER = r'[a-zA-ZðÐɟəƏɛƐʾʿθΘ][\u0300-\u033d]*'
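
Two of the token functions can be tried out in isolation with the standard `re` module. The sketch below is an assumption-laden variant, not the lexer itself: `split_attribute` and `split_linestamp` are hypothetical helpers, the title string is made up, and stripping the attribute value is a design choice the lexer above does not make. It shows why splitting on the first `': '` only matters, and how a line stamp could be broken into the `number` and `timestamp` features of the intended representation.

```python
import re


def split_attribute(raw):
    """Split 'key: value' on the first ': ' only, so values may contain colons."""
    key, value = raw.split(': ', 1)
    return key, value.strip()


# Hypothetical pattern with named groups; the timestamp part is optional,
# mirroring the two LINESTAMP alternatives in the lexer.
LINESTAMP_RE = re.compile(r'\((?P<number>\d+)(?:@(?P<timestamp>\d+:\d+))?\)')


def split_linestamp(raw):
    """Return [('number', ...), ('timestamp', ...)] for a line stamp."""
    match = LINESTAMP_RE.match(raw.strip())
    if match is None:
        raise ValueError(f'not a line stamp: {raw!r}')
    return [
        ('number', match.group('number')),
        ('timestamp', match.group('timestamp')),  # None if there is no timestamp
    ]


attribute = split_attribute('title: Part 1: The Fox')  # made-up title
stamped = split_linestamp('(1@0:00) ')
unstamped = split_linestamp('(12)')
```

A plain `split(': ')` would cut the made-up title into three pieces, which no longer fits a `(key, value)` tuple.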

In [72]:
# demonstration of output results of lexer, to be used by parser below
lexer = NenaLexer()

tokens = [(tok.type, tok.value) for tok in lexer.tokenize(example)]

In [73]:
example[330:440]

' mə́rrə <P: bŏ́ro> \nbàbaˈ ʾàtən=daˈ ⁺šúla lə̀tluxˈ tíyyət b-dìyyi k̭ítət.ˈ ⁺šúk̭ si-⁺bar-⁺šùlux.ˈ '

In [74]:
pprint(tokens[0:100])

[('NEWLINE', '\n'),
 ('ATTRIBUTE', ('dialect', 'Urmi_C')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('title', 'When Shall I Die?')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('encoding', 'UTF8')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('informant', 'Yulia Davudi')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('interviewer', 'Geoffrey Khan')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('place', '+Hassar +Baba-čanɟa, N')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('transcriber', 'Geoffrey Khan')),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', ('text_id', 'A32 ')),
 ('NEWLINES', '\n\n'),
 ('LINESTAMP', '(1@0:00)'),
 ('LETTER', 'x'),
 ('LETTER', 'á'),
 ('PUNCT_END', '-'),
 ('LETTER', 'y'),
 ('LETTER', 'u'),
 ('LETTER', 'm'),
 ('LETTER', 'a'),
 ('PUNCT_END', ' '),
 ('PUNCT_BEGIN', '"'),
 ('PUNCT_BEGIN', '⁺'),
 ('LETTER', 'm'),
 ('LETTER', 'a'),
 ('LETTER', 'l'),
 ('LETTER', 'l'),
 ('LETTER', 'a'),
 ('PUNCT_END', ' '),
 ('PUNCT_BEGIN', '⁺'),
 ('LETTER', 'N'),
 ('LETTER', 'a'),
 ('LETTER', 's'),
 ('LETTER', 'r'),
 ('LETTER', 'á'),
 ('LETT

## Parser

The parser processes the tokens provided by the lexer and tries to combine them into structured units. Those units are defined by the methods of the `NenaParser` class; the grammar pattern for each unit is passed as an argument to the `@_` decorator.

The top unit (in this case `nena`, the first rule defined in the class) is returned as the result of the parsing.

In [75]:

    # build words with their associated punctuators
#     @_('begins letters ends')
#     def word(self, p):
#         print(p.letters)
#         return {
#             'word': ''.join(p.letters),
#             'letters': p.letters,
#             'punct_begin': p.begins,
#             'punct_end': p.ends,
#         }
    
#     @_('begins letters')
#     def word(self, p):
#         print(p.letters)
#         return {
#             'word': ''.join(p.letters),
#             'letters': p.letters,
#             'punct_begin': p.begins,
#             'punct_end': '',
#         }

In [77]:
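def make_word(letters, beginnings=None, endings=None):
    """Return word dictionary"""
    # use None as default: a mutable default list would be shared
    # between all words and changed by later in-place insertions
    return {
        'word': ''.join(letters),
        'letters': letters,
        'beginnings': [] if beginnings is None else beginnings,
        'endings': [] if endings is None else endings,
    }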

def modify_attribute(words, key, value):
    """Modify dict attribute for a list of words"""
    for word in words:
        word[key] = value
    return words

def format_tag_endings(tag, punct_value, endings=[]):
    """Format punctuation around a tag.
    
    Normalizes in case of irregularity. For instance, in the
    cases of both
        words.</> 
        words</>.
    the tags will be normalized to either an in/exclusive order.
    """
    if punct_value == 'inclusive':
        return endings + [tag]
    elif punct_value == 'exclusive':
        return [tag] + endings
    else:
        raise Exception(f'INVALID punct_value supplied: {punct_value}')
    
class NenaParser(Parser):
    
    debugfile = 'nena_parser.out'
    tokens = NenaLexer.tokens
    
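    def error(self, t):
        # Sly passes None when the input ends unexpectedly
        if t is None:
            raise Exception('unexpected end of input')
        raise Exception(f'unexpected {t.type} ({repr(t.value[0])}) at index {t.index}')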
    
    @_('attributes NEWLINES text_block')
    def nena(self, p):
        return [p.attributes, p.text_block]
    
    @_('attributes NEWLINE ATTRIBUTE')
    def attributes(self, p):
        return p.attributes + [p.ATTRIBUTE]
    
    @_('NEWLINE ATTRIBUTE')
    def attributes(self, p):
        return [p.ATTRIBUTE]
    
    @_('text_block NEWLINES paragraph')
    def text_block(self, p):
        return p.text_block + [p.paragraph]
    
    @_('paragraph')
    def text_block(self, p):
        return [p.paragraph]
    
    @_('lines')
    def paragraph(self, p):
        return [p.lines]
    
    @_('lines line')
    def lines(self, p):
        return p.lines + [p.line]
    
    @_('line')
    def lines(self, p):
        return [p.line]
    
    @_('LINESTAMP words')
    def line(self, p):
        return [p.LINESTAMP, p.words]
        
    @_('words span')
    def words(self, p):
        return p.words + p.span
    
    @_('SPAN_START letters SPAN_END',
       'SPAN_START letters SPAN_END endings',
       'SPAN_START letters SPAN_END NEWLINE',)
    def span(self, p):
        begin_tag, kind, value, punct_type = p[0]
        beginnings = [begin_tag]  # keep beginnings a list, as for all other words
        
        # build ends
        trailing_ends = getattr(p, 'endings', [])
        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
            trailing_ends.append(' ')
        endings = format_tag_endings(p[2], punct_type, trailing_ends)
        
        word = make_word(p.letters, beginnings=beginnings, endings=endings)
        word[kind] = value
        return [word]
    
    @_('SPAN_START word letters SPAN_END',
       'SPAN_START word letters SPAN_END endings',
       'SPAN_START word letters SPAN_END NEWLINE',)
    def span(self, p):
        begin_tag, kind, value, punct_type = p[0]        
        p.word['beginnings'].insert(0, begin_tag)
        
        # build ends
        trailing_ends = getattr(p, 'endings', [])
        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
            trailing_ends.append(' ')
        endings = format_tag_endings(p[3], punct_type, trailing_ends)
        
        new_word = make_word(p.letters, endings=endings)
        return modify_attribute([p.word, new_word], kind, value)
    
    @_('SPAN_START words SPAN_END',
       'SPAN_START words SPAN_END endings',
       'SPAN_START words SPAN_END NEWLINE',
       'SPAN_START word SPAN_END',
       'SPAN_START word SPAN_END endings',
       'SPAN_START word SPAN_END NEWLINE',)
    def span(self, p):
        words = getattr(p, 'words', [p[1]])
        begin_tag, kind, value, punct_type = p[0]
        first_word, last_word = words[0], words[-1]
        first_word['beginnings'].insert(0, begin_tag)
        
        # build ends
        trailing_ends = last_word['endings'] + getattr(p, 'endings', [])
        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
            trailing_ends.append(' ')        
        last_word['endings'] = format_tag_endings(p[2], punct_type, trailing_ends)
        
        return modify_attribute(words, kind, value)
    
    @_('words word')
    def words(self, p):
        return p.words + [p.word]
    
    @_('word word')
    def words(self, p):
        return [p[0]] + [p[1]]
    
    @_('beginnings letters endings', 
       'letters endings',
       'letters NEWLINE',)
    def word(self, p):
        beginnings = getattr(p, 'beginnings', [])
        endings =  getattr(p, 'endings', [' '])
        return make_word(p.letters, beginnings, endings)

    @_('PUNCT_BEGIN beginnings')
    def beginnings(self, p):
        return [p.PUNCT_BEGIN] + p.beginnings
    
    @_('PUNCT_BEGIN')
    def beginnings(self, p):
        return [p.PUNCT_BEGIN]
    
    @_('endings NEWLINE')
    def endings(self, p):
        if p.endings[-1] != ' ':
            p.endings.append(' ')
        return p.endings
    
    @_('endings PUNCT_END')
    def endings(self, p):
        return p.endings + [p.PUNCT_END]
    
    @_('PUNCT_END')
    def endings(self, p):
        return [p.PUNCT_END]
        
    @_('LETTER letters', 
       'FOREIGN_LETTER letters')
    def letters(self, p):
        return [p[0]] + p[1]
    
    @_('LETTER', 
       'FOREIGN_LETTER')
    def letters(self, p):
        return [p[0]]

parser = NenaParser()
parser.parse(lexer.tokenize(example))

Parser debugging for NenaParser written to nena_parser.out


[[('dialect', 'Urmi_C'),
  ('title', 'When Shall I Die?'),
  ('encoding', 'UTF8'),
  ('informant', 'Yulia Davudi'),
  ('interviewer', 'Geoffrey Khan'),
  ('place', '+Hassar +Baba-čanɟa, N'),
  ('transcriber', 'Geoffrey Khan'),
  ('text_id', 'A32 ')],
 [[[['(1@0:00)',
     [{'word': 'xá',
       'letters': ['x', 'á'],
       'beginnings': [],
       'endings': ['-']},
      {'word': 'yuma',
       'letters': ['y', 'u', 'm', 'a'],
       'beginnings': [],
       'endings': [' ']},
      {'word': 'malla',
       'letters': ['m', 'a', 'l', 'l', 'a'],
       'beginnings': ['"', '⁺'],
       'endings': [' ']},
      {'word': 'Nasrádən',
       'letters': ['N', 'a', 's', 'r', 'á', 'd', 'ə', 'n'],
       'beginnings': ['⁺'],
       'endings': ['"', ' ']},
      {'word': 'váyələ',
       'letters': ['v', 'á', 'y', 'ə', 'l', 'ə'],
       'beginnings': [],
       'endings': [' ']},
      {'word': 'tíva',
       'letters': ['t', 'í', 'v', 'a'],
       'beginnings': [],
       'endings': [' ']},
  

In [21]:
def get_string_data(string, str_data):
    """Match a string to its correlated data.
    
    A "string" can be a letter or a punctuator.
    
    Args:
        string: str to be matched with data
        str_data: dict of data for strings where each key
            is a compiled regex object that matches
            a given letter to a dict of data (the values).
            
    Returns:
        dict of letter data
    """
    for str_pattern, data in str_data.items():
        if str_pattern.match(string):
            return data
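
A small usage sketch for this lookup, with hypothetical punctuation data. The function is repeated so that the example runs on its own; only the keys `regex` and `position` are taken from the code that reads `punctuation.json` above, the entries themselves are made up.

```python
import re


def get_string_data(string, str_data):
    """Same lookup as above, repeated so the sketch runs on its own."""
    for str_pattern, data in str_data.items():
        if str_pattern.match(string):
            return data


# Hypothetical entries, keyed by compiled regex as in punct_data.
sample_punct = {
    re.compile(r'-'): {'regex': '-', 'position': 'end'},
    re.compile(r'\.'): {'regex': r'\.', 'position': 'end'},
}

hyphen = get_string_data('-', sample_punct)
unknown = get_string_data('?', sample_punct)  # no pattern matches, so None
```

Callers need to handle the `None` that comes back when nothing matches. Note also that `match` only anchors at the start of the string; if longer strings could be passed in, `fullmatch` might be the stricter choice.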

### Parser output

The parser prints a warning about shift/reduce conflicts, probably caused by ambiguous whitespace. The parser resolves these conflicts automatically, so they do not block parsing, but ideally the grammar should be fixed to remove them.

The output for the example text shows that the parser succeeded in parsing it and structuring it into a metadata block, paragraphs, lines and words, with the features of each word stored in a dict.
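
One way to sanity-check the output is to glue the words back together and compare the result with the source text. The sketch below does this for the first four words of the output shown above; `rebuild` is a hypothetical helper, not part of the parser.

```python
# First four words of the parser output shown above.
words = [
    {'word': 'xá', 'letters': ['x', 'á'],
     'beginnings': [], 'endings': ['-']},
    {'word': 'yuma', 'letters': ['y', 'u', 'm', 'a'],
     'beginnings': [], 'endings': [' ']},
    {'word': 'malla', 'letters': ['m', 'a', 'l', 'l', 'a'],
     'beginnings': ['"', '⁺'], 'endings': [' ']},
    {'word': 'Nasrádən', 'letters': ['N', 'a', 's', 'r', 'á', 'd', 'ə', 'n'],
     'beginnings': ['⁺'], 'endings': ['"', ' ']},
]


def rebuild(words):
    """Concatenate beginnings, letters and endings of each word in order."""
    return ''.join(
        ''.join(word['beginnings']) + ''.join(word['letters']) + ''.join(word['endings'])
        for word in words
    )


rebuilt = rebuild(words)
```

The rebuilt string should equal the opening of line 1 of the example text, apart from the line stamp. A mismatch would point at punctuation that was dropped or attached to the wrong word.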

## Testing with Real Texts

In [22]:
from pathlib import Path

In [23]:
# paths
data_dir = Path('../nena/0.01')
# keep directories only, so that stray files in data_dir are skipped
dialect_dirs = sorted(path for path in data_dir.glob('*') if path.is_dir())

### Run Parse On All Texts

In [24]:
name2parsed = {}
name2text = {}
not_parsed = []

ignore = [
    #'The Adventures Of Two Brothers.nena', # FIX BY MOVING UNEMPHASIZED OUT
]

for dialect in dialect_dirs:
    print(f'--Dialect {dialect}--')
    print()
    for file in sorted(dialect.glob('*.nena')):
        
        if file.name in ignore:
            print('SKIPPING:', file.name, '\n')
            not_parsed.append(file)
            continue
        
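        with open(file, 'r', encoding='utf-8') as infile:
            # normalize to NFD, like the example text, so that the
            # decomposed regexes of the lexer can match
            text = unicodedata.normalize('NFD', infile.read())
        name2text[file.name] = text
        print(f'trying: {file.name}')
        parseit = parser.parse(lexer.tokenize(text))
        print('\t√')
        name2parsed[file.name] = parseit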
                
print(len(name2parsed), 'parsed...')
print(len(not_parsed), 'not parsed...')

--Dialect ../nena/0.01/Barwar--

trying: A Hundred Gold Coins.nena


NameError: name 'parser' is not defined
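
The run above stopped at the first file because `parser` was not defined in that session. Independently of that, a single failing text aborts the whole loop. A possible variant, sketched here under the assumption that parsing is wrapped in a callable (`parse_text` is hypothetical, e.g. a small function around the lexer and parser above), collects the failures and continues:

```python
from pathlib import Path


def parse_all(data_dir, parse_text, pattern='*.nena'):
    """Parse every matching file in the dialect folders below data_dir.

    parse_text is any callable that takes the file contents and returns
    the parsed structure (hypothetical; not defined in this notebook).
    Returns two dicts: file name -> parse result, file name -> error message.
    """
    parsed, failed = {}, {}
    for file in sorted(Path(data_dir).glob(f'*/{pattern}')):
        text = file.read_text(encoding='utf-8')
        try:
            parsed[file.name] = parse_text(text)
        except Exception as error:  # keep going, but remember the reason
            failed[file.name] = str(error)
    return parsed, failed
```

Catching `Exception` is deliberately broad here because the parser's `error` method raises a plain `Exception`. The drawback is that a setup problem such as the `NameError` above would be recorded once per file instead of stopping the run, so the `failed` dict should be inspected after every run.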

In [None]:
#name2text['A Man Called Čuxo.nena'][7100:7120]