# NenaParser2: A parser for the NENA Markup standard

This parser reads texts written in the plaintext
[NENA markup format][nenamarkup] and delivers a structured
list of words and their features, along with paragraph and line
marks. The result can be stored in a data format such as [Text-Fabric][textfabric]
or [Text-as-Graph][textasgraph], or (less optimally) in XML or another hierarchical
structure.

For the NENA Markup parser, we use [Sly][sly], a Python implementation
of the lex/yacc family of parser generators.

This parser is a follow-up to the older, more verbose [NenaParser (1)](NenaParser.ipynb),
which contains many shift/reduce conflicts as well as many conventions that will
no longer be in use (see for instance "footnoting").

[nenamarkup]: ../docs/nena_format.md
[sly]: https://sly.readthedocs.io/en/latest/
[textfabric]: https://github.com/annotation/text-fabric
[textasgraph]: https://www.balisage.net/Proceedings/vol19/print/Dekker01/BalisageVol19-Dekker01.html

In [29]:
Path.home()

PosixPath('/Users/cody')

In [65]:
import re
from pathlib import Path
import json
from sly import Lexer, Parser
import unicodedata
from pprint import pprint

VERSION = '2.0'
PROJECT = Path.home().joinpath('github/CambridgeSemiticsLab/nena_corpus')
STANDARDS = PROJECT.joinpath('standards')
DIALECT_STANDARDS = STANDARDS.joinpath('dialect_data')
TEXT_INPUT = PROJECT.joinpath(f'texts/{VERSION}')
JSON_OUTPUT = PROJECT.joinpath(f'parsed_texts/{VERSION}')

# prepare alphabet and punctuation standards for processing
alphabet_std = STANDARDS.joinpath('alphabet.json')
punctuation_std = STANDARDS.joinpath('punctuation.json')
lang_std = STANDARDS.joinpath('foreign_languages.json')

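# read the standards explicitly as UTF-8, since they contain non-ASCII letters
with open(alphabet_std, 'r', encoding='utf8') as infile:
    alphabet_data = json.load(infile)
with open(punctuation_std, 'r', encoding='utf8') as infile:
    punct_data = json.load(infile)
with open(lang_std, 'r', encoding='utf8') as infile:
    lang_data = json.load(infile)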

alphabet_re = '|'.join(let['decomposed_regex'] for let in alphabet_data)
punct_begin_re = '|'.join(punct['regex'] for punct in punct_data
                            if punct['position'] == 'begin')
punct_end_re = '|'.join(punct['regex'] for punct in punct_data 
                            if punct['position'] == 'end')
foreign_codes = '|'.join(lang['code'] for lang in lang_data)
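The alternation patterns built above depend on two details that are easy to miss: the input must be in the same Unicode normalization form as the `decomposed_regex` entries, and a longer alternative must come before any alternative that is its prefix. The sketch below illustrates both points with three invented alphabet entries. Only the `decomposed_regex` key is taken from the code above; the entries and the names `sample_alphabet`, `sample_re`, and `wrong_order` are assumptions, and the actual ordering inside `alphabet.json` is not shown here.

```python
import re
import unicodedata

# Hypothetical stand-in for alphabet.json; the entries are invented.
sample_alphabet = [
    {'decomposed_regex': 'a\u0301'},  # a + combining acute accent
    {'decomposed_regex': 'a'},
    {'decomposed_regex': 'x'},
]

sample_re = re.compile('|'.join(let['decomposed_regex'] for let in sample_alphabet))

# 'x' followed by precomposed U+00E1; NFD splits it into 'a' + U+0301
decomposed = unicodedata.normalize('NFD', 'x\u00e1')
letters = sample_re.findall(decomposed)
print(letters)

# With the shorter alternative listed first, the accent is never consumed
wrong_order = re.compile('a|a\u0301|x')
print(wrong_order.findall(decomposed))
```

In the first pattern the accented letter is returned as one unit; in the second the bare `a` wins and the combining accent is left unmatched.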
                               


In [68]:
DIALECT_STANDARDS.joinpath('barwar')

PosixPath('/Users/cody/github/CambridgeSemiticsLab/nena_corpus/standards/dialect_data/barwar')

In [66]:
for file in DIALECT_STANDARDS.joinpath('barwar').glob('*'):
    print(file)

## Example Text

Below is a dummy text we can use to test the parsers on, written in the
[NENA Markup](../docs/nena_format.md) standard.

In [53]:
example = unicodedata.normalize('NFD', '''
dialect:: Urmi_C
title:: When Shall I Die?
encoding:: UTF8
speakers:: YD=Yulia Davudi, GK=Geoffrey Khan, CK=Cody Kingham
place:: +Hassar +Baba-čanɟa, N
transcriber:: Geoffrey Khan
text_id:: A32 

1 YD    xá-yuma "⁺malla ⁺Nasrádən" váyələ tíva ⁺ʾal-k̭èsa.ˈ xá mən-nášə
⁺vàrəva,ˈ mə́rrə ⁺màllaˈ ʾátən ʾo-k̭ésa pràmut,ˈ bət-nàplət.ˈ mə́rrə <P: bŏ́ro> 
bàbaˈ ʾàtən=daˈ ⁺šúla lə̀tluxˈ tíyyət b-dìyyi k̭ítət.ˈ ⁺šúk̭ si-⁺bar-⁺šùlux
.ˈ ʾána ⁺šūl-ɟànilə.ˈ náplən nàplən.ˈ
2 0:08    ⁺hàlaˈ ʾo-náša léva xíša xá ⁺ʾəsrá ⁺pasulyày,ˈ ⁺málla bitáyələ drúm 
⁺ʾal-⁺ʾàrra.ˈ bək̭yámələ ⁺bərxáṱələ ⁺bàru.ˈ màraˈ ⁺maxlèta,ˈ ʾátən ⁺dílux 
ʾána bət-náplənva m-⁺al-ʾilàna.ˈ bas-tánili xázən ʾána ʾíman bət-mètən.ˈ 
ʾo-náša xzílə k̭at-ʾá ⁺màllaˈ hónu xáč̭č̭a ... ⁺basùrələˈ mə́rrə k̭àtuˈ ⁺maxlèta,ˈ 
mə̀drə,ˈ

3 0:14 GK    maxlèta?
4 0:15 YD    ⁺rába ⁺maxlèta.ˈ mə́rrə k̭at-ʾíman xmártux ⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ 
ʾó-yuma mètət.ˈ ʾó-yumət xmártux ⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ ʾó-yuma mètət.ˈ 
5 0:18    ⁺málla
6 0:19 CK    <E:Why hello there!> 
7 0:21 YD    múttəva ... ⁺ṱànaˈ ⁺yak̭úyra ⁺ʾal-xmàrta.ˈ ⁺ṱànaˈ mə́ndi 
⁺rába múttəva ⁺ʾal-xmàrtaˈ ʾu-xmàrtaˈ ⁺báyyava ʾask̭áva ⁺ʾùllul.ˈ
ʾu-bas-pòxa ⁺plə́ṱlə mənnó.ˈ ṱə̀r,ˈ ⁺riṱàla.ˈ ⁺málla mə́rrə ʾàha,ˈ ʾána dū́n
k̭arbúnə k̭a-myàta.ˈ
8 0:20    <E:ok>
9 0:21 CK    <E:yes?>
10 0:22 YD    xáč̭č̭a=da sə̀k̭laˈ xa-xìta.ˈ ɟánu mudməxxálə ⁺ʾal-⁺ʾàrra.ˈ mə̀rrəˈ 
xína ⁺dā́n mòtila.ˈ ʾē=t-d-⁺ṱlàˈ ⁺málla mə̀tlə.ˈ nàšə,ˈ xuyravàtuˈ xə́šlun tílun 
mə̀rrunˈ: ʾa mù-vadət? k̭a-mú=ivət ⁺tàmma?ˈ mə́rrə xob-ʾána mìtən.ˈ
lá bəxzáyətun k̭at-mìtən!ˈ lá mə́rrun ʾat-xàya!ˈ hamzùməvət.ˈ bəšvák̭una 
⁺tàmaˈ màraˈ xmàrələ,ˈ lélə ⁺parmùyə.ˈ
 ''')

## Structure of a text with NENA Markup

Below is a representation of the tree-like structure of a NENA standard text file. This is the structure that the parser must recognize and reproduce.

`+` is used to represent one or more elements.

```
text
  |
  metadata block
  |  |
  |  +attribute
  | 
  text block
    |
    +paragraph
      |   
      +line
        |
        +word
          |
          +letter
```

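The sketch below shows the intended Pythonic representation of these items. The parser further down returns the same nesting, though with a plain metadata dict and the word keys `word`, `beginnings`, and `endings` in place of `text`, `begin`, and `end`: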

In [54]:
example_text = [ # text
    [ # metadata block
        {
            'dialect': 'Urmi_C',
            'title': 'When Shall I Die?',
            'encoding': 'UTF8',
        }
    ],
    [ # text block
        [ # paragraph
            { # line
                'number': '1', 
                'timestamp': '0:00',
                'words': [
                    { # word
                        'text':'xá',
                        'begin':'',
                        'end':'-',
                        'lang':'NENA', 
                        'letters':('x','á'),
                    },
                    # ...
                    { # foreign word
                        'text':'bŏ́ro',
                        'begin':'<P:',
                        'end': '> ',
                        'lang': 'P', 
                        'letters':('b','ŏ́','r','o'),
                    }, 
                ],
            },
        ],
    ],
]

## Lexer

The parser takes 'tokens' as its input: predefined units of characters supplied by the 'lexer'. In Sly (and Ply), each token is defined by a regular expression, and the matching string is returned as the token value. If a token is defined as a function (with its regular expression passed as the argument to the `@_` decorator), the function can modify the returned value, among other things. For more detailed information, [see the documentation][slydocs].

[slydocs]: https://sly.readthedocs.io/en/latest/sly.html
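The two ideas in the paragraph above (patterns tried in a fixed order, and an optional function that rewrites the matched value) can be illustrated without Sly. The following is a standard-library stand-in, not Sly's implementation; the token names `NUMBER`, `WORD`, and `SPACE` and the helpers `to_int` and `tokenize` are hypothetical and unrelated to the NENA lexer.

```python
import re

def to_int(text):
    """Callback that converts the matched string before it is returned."""
    return int(text)

# (token name, pattern, optional callback); earlier entries are tried first.
TOKEN_SPECS = [
    ('NUMBER', re.compile(r'\d+'), to_int),
    ('WORD', re.compile(r'[a-z]+'), None),
    ('SPACE', re.compile(r'\s+'), None),
]

def tokenize(text):
    """Yield (type, value) pairs, trying each pattern in order."""
    index = 0
    while index < len(text):
        for name, pattern, callback in TOKEN_SPECS:
            match = pattern.match(text, index)
            if match:
                value = match.group()
                if callback is not None:
                    value = callback(value)
                yield (name, value)
                index = match.end()
                break
        else:
            # no pattern matched at this position
            raise ValueError(f'illegal character {text[index]!r} at index {index}')

demo_tokens = list(tokenize('line 12'))
print(demo_tokens)
```

The `NUMBER` value arrives as an `int` rather than a string, which mirrors how a function-style token can hand the parser a processed value.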

In [61]:
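# patterns for classifying the elements of a line indicator;
# raw strings keep the backslash escapes intact
timestamp = re.compile(r'\d+:\d+')
linenum = re.compile(r'\d+')
initials = re.compile(r'\D\D')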

class NenaLexer(Lexer):
    
    def error(self, t):
        """Give warning for bad characters"""
        print(f"Illegal character {repr(t.value[0])} @ index {self.index}")
        self.index += 1
    
    # set of token names as required by
    # the Lexer class
    tokens = {
        LETTER, PUNCT_BEGIN, PUNCT_END, NEWLINES,
        NEWLINE, LINE_INDICATOR, ATTRIBUTE,
        FOREIGN_LETTER, LANG_START, LANG_END
    }

    # Attribute: a key, a double colon, and a value. Returns a dict {key: value}.
    @_(r'[a-z0-9_]+\s*::\s*.*')
    def ATTRIBUTE(self, t):
        # split on the first '::' only, so the value itself may contain '::'
        field, value = t.value.split('::', 1)
        t.value = {field.strip(): value.strip()}
        for attr, val in t.value.items():
            # arrange loaded speakers into dict
            if attr == 'speakers':
                speakers = {}
                for speakset in val.split(','):
                    initials, speaker = speakset.split('=')
                    speakers[initials.strip()] = speaker.strip()
                t.value[attr] = speakers
        return t
    
    @_(r'(?<=\n)\d+.*?\s{2,}',
       r'(?<=\n)\d+.*?\t')
    def LINE_INDICATOR(self, t):
        attribs = {}
        elements = t.value.split()
        attribs['number'] = elements[0]
        for element in elements:
            if timestamp.match(element):
                attribs['timestamp'] = element
            elif linenum.match(element):
                attribs['number'] = element
            elif initials.match(element):
                attribs['speaker'] = element
            else:
                raise Exception(f'invalid element {element} in line indicator {t.value}')
        t.value = attribs
        return t
    
    NEWLINES = r'\n\s*\n\s*' # i.e. marks text-blocks
    LETTER = alphabet_re    
    PUNCT_BEGIN = punct_begin_re
    PUNCT_END = punct_end_re
    NEWLINE = r'\n\s*'
    
    @_(r'<[A-Za-z?]+:\s*')
    def LANG_START(self, t):
        lang = re.match(r'<([A-Za-z?]+):', t.value).group(1)
        tag = t.value.strip() + ' ' # ensure spacing
        t.value = (tag, lang)
        return t
        
    LANG_END = r'>'
    
    # NB: tokens are evaluated in order of appearance here,
    # so foreign letters are matched last
    FOREIGN_LETTER = r'[a-zA-ZðÐɟəƏɛƐʾʿθΘ][\u0300-\u033d]*'
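The classification done inside `LINE_INDICATOR` can be tried on its own, outside the lexer. The sketch below is a standalone approximation under two assumptions: it uses `fullmatch` so that each element must match a pattern completely, and it restricts speaker initials to two ASCII letters, which is stricter than the `\D\D` pattern above. The function name `parse_line_indicator` is hypothetical.

```python
import re

TIMESTAMP = re.compile(r'\d+:\d+')
LINENUM = re.compile(r'\d+')
INITIALS = re.compile(r'[A-Za-z]{2}')  # assumption: exactly two ASCII letters

def parse_line_indicator(indicator):
    """Split a line indicator into number, timestamp and speaker."""
    attribs = {}
    for element in indicator.split():
        if TIMESTAMP.fullmatch(element):
            attribs['timestamp'] = element
        elif LINENUM.fullmatch(element):
            attribs['number'] = element
        elif INITIALS.fullmatch(element):
            attribs['speaker'] = element
        else:
            raise ValueError(f'invalid element {element!r} in line indicator {indicator!r}')
    return attribs

full = parse_line_indicator('4 0:15 YD    ')
partial = parse_line_indicator('2 0:08    ')
print(full)
print(partial)
```

Elements that are absent, such as the speaker on line 2 of the example text, are simply left out of the resulting dict.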

In [62]:
# demonstration of output results of lexer, to be used by parser below
lexer = NenaLexer()
tokens = [(tok.type, tok.value) for tok in lexer.tokenize(example)]

In [63]:
pprint(tokens[:20])

[('NEWLINE', '\n'),
 ('ATTRIBUTE', {'dialect': 'Urmi_C'}),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', {'title': 'When Shall I Die?'}),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', {'encoding': 'UTF8'}),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE',
  {'speakers': {'CK': 'Cody Kingham',
                'GK': 'Geoffrey Khan',
                'YD': 'Yulia Davudi'}}),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', {'place': '+Hassar +Baba-čanɟa, N'}),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', {'transcriber': 'Geoffrey Khan'}),
 ('NEWLINE', '\n'),
 ('ATTRIBUTE', {'text_id': 'A32'}),
 ('NEWLINES', '\n\n'),
 ('LINE_INDICATOR', {'number': '1', 'speaker': 'YD'}),
 ('LETTER', 'x'),
 ('LETTER', 'á'),
 ('PUNCT_END', '-'),
 ('LETTER', 'y')]
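The `ATTRIBUTE` values in the token list above can be reproduced with a small standalone function, which is convenient for checking header lines without running the lexer. This is a sketch that mirrors the logic of the `ATTRIBUTE` method; the name `parse_attribute` is hypothetical, and splitting on only the first `::` and first `=` is an assumption about how values containing those characters should be treated.

```python
def parse_attribute(line):
    """Turn one 'key:: value' header line into a {key: value} dict."""
    field, value = line.split('::', 1)
    field, value = field.strip(), value.strip()
    if field == 'speakers':
        # arrange 'XX=Full Name' pairs into a dict keyed by initials
        speakers = {}
        for speakset in value.split(','):
            initials, name = speakset.split('=', 1)
            speakers[initials.strip()] = name.strip()
        value = speakers
    return {field: value}

header = [
    'dialect:: Urmi_C',
    'title:: When Shall I Die?',
    'speakers:: YD=Yulia Davudi, GK=Geoffrey Khan, CK=Cody Kingham',
]

metadata_sketch = {}
for header_line in header:
    metadata_sketch.update(parse_attribute(header_line))
print(metadata_sketch)
```

Merging the per-line dicts with `update` corresponds to what the parser's `attributes` rule does with successive `ATTRIBUTE` tokens.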


In [7]:
# re.match(r'^[a-z0-9_]+: .*(?=\n)', 're: wáy b-šɛ̀ fsdf\nfesfes')

## Parser

The parser processes the tokens provided by the lexer, and tries to combine them into structured units. Those units are defined in the methods of the `NenaParser` class, with the patterns passed as arguments to the `@_` decorator.

The top unit (in this case, `text`) is returned as the result of the parsing.
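To make "combining tokens into structured units" concrete, the sketch below groups a short, hand-written token stream into word dictionaries with a plain loop. It is not an LALR parser and does not use Sly; it only approximates the outcome of the word-level rules for simple input. The function name `group_words` and the token list are assumptions, including the assumption that a space arrives as a `PUNCT_END` token.

```python
def group_words(tokens):
    """Group (type, value) tokens into word dicts."""
    words = []
    pending_begin = []  # PUNCT_BEGIN values waiting for their word
    current = None      # the word currently being built
    for tok_type, value in tokens:
        if tok_type == 'PUNCT_BEGIN':
            pending_begin.append(value)
        elif tok_type in ('LETTER', 'FOREIGN_LETTER'):
            # a letter after endings or beginnings opens a new word
            if current is None or current['endings'] or pending_begin:
                current = {'letters': [], 'beginnings': pending_begin, 'endings': []}
                pending_begin = []
                words.append(current)
            current['letters'].append(value)
        elif tok_type == 'PUNCT_END':
            if current is None:
                raise ValueError(f'ending {value!r} without a preceding word')
            current['endings'].append(value)
        else:
            raise ValueError(f'unexpected token type {tok_type}')
    for word in words:
        word['word'] = ''.join(word['letters'])
    return words

sample_tokens = (
    [('LETTER', 'x'), ('LETTER', 'á'), ('PUNCT_END', '-')]
    + [('LETTER', letter) for letter in 'yuma'] + [('PUNCT_END', ' ')]
    + [('PUNCT_BEGIN', '"'), ('PUNCT_BEGIN', '⁺')]
    + [('LETTER', letter) for letter in 'malla'] + [('PUNCT_END', ' ')]
)

grouped = group_words(sample_tokens)
for word in grouped:
    print(word['beginnings'], word['word'], word['endings'])
```

The grammar-based parser reaches the same kind of result declaratively: each rule names the sequence of tokens or sub-units it accepts, and the method body builds the unit.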

In [8]:
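def make_word(letters, beginnings=None, endings=None):
    """Return word dictionary"""
    return {
        'word': ''.join(letters),
        'letters': letters,
        # fresh lists avoid sharing one mutable default between words
        'beginnings': [] if beginnings is None else beginnings,
        'endings': [] if endings is None else endings,
    }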

def modify_attribute(words, key, value):
    """Modify dict attribute for a list of words"""
    for word in words:
        word[key] = value
    return words

def format_tag_endings(tag, endings=()):
    """Format punctuation around a closing tag.

    Normalizes in case of irregularity. For instance, in the
    cases of both
        words.</>
        words</>.
    the closing tag is always placed before the trailing
    punctuation (exclusive order).
    """
    return [tag] + list(endings)
    
def tag_speakers(text, speakers_dict):
    """Tag every line in a text with its speaker's full name.

    A line without a speaker code inherits the speaker of the
    preceding line; a line with a code switches the current speaker.
    """
    cur_speaker = None  # stays None until the first speaker code appears
    for paragraph in text:
        for line in paragraph:
            if 'speaker' not in line:
                line['speaker'] = cur_speaker
            else:
                new_speaker = line['speaker']
                try:
                    cur_speaker = speakers_dict[new_speaker]
                    line['speaker'] = cur_speaker
                except KeyError:
                    raise Exception(f'speaker {new_speaker} not specified in speakers metadata')
    
class NenaParser(Parser):
    
    #debugfile = 'nena_parser.out'
    tokens = NenaLexer.tokens
    
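    def error(self, t):
        # Sly passes None when the input ends unexpectedly
        if t is None:
            raise Exception('unexpected end of input')
        raise Exception(f'unexpected {t.type} ({repr(t.value)}) at index {t.index}')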
    
    @_('attributes NEWLINES text_block')
    def nena(self, p):
        tag_speakers(p.text_block, p.attributes['speakers'])
        return [p.attributes, p.text_block]
    
    @_('attributes NEWLINE ATTRIBUTE')
    def attributes(self, p):
        p.attributes.update(p.ATTRIBUTE)
        return p.attributes
    
    @_('NEWLINE ATTRIBUTE', 'ATTRIBUTE')
    def attributes(self, p):
        return p.ATTRIBUTE
    
    @_('text_block NEWLINES paragraph')
    def text_block(self, p):
        return p.text_block + [p.paragraph]
    
    @_('paragraph')
    def text_block(self, p):
        return [p.paragraph]
    
    @_('paragraph line')
    def paragraph(self, p):
        return p.paragraph + [p.line]
    
    @_('line')
    def paragraph(self, p):
        return [p.line]
    
    @_('LINE_INDICATOR span words',
      'LINE_INDICATOR span span words',
      'LINE_INDICATOR span word span words',
      'LINE_INDICATOR span',
      )
    def line(self, p):
        words = []
        for wordtype in list(p)[1:]:
            if isinstance(wordtype, list):
                words += wordtype
            else:
                words.append(wordtype)
        p.LINE_INDICATOR['words'] = words
        return p.LINE_INDICATOR
    
    @_('LINE_INDICATOR word span words')
    def line(self, p):
        p.LINE_INDICATOR['words'] = [p.word] + p.span + p.words
        return p.LINE_INDICATOR
    
    @_('LINE_INDICATOR words')
    def line(self, p):
        p.LINE_INDICATOR['words'] = p.words
        return p.LINE_INDICATOR
    
    @_('LINE_INDICATOR word')
    def line(self, p):
        p.LINE_INDICATOR['words'] = [p.word]
        return p.LINE_INDICATOR
    
    @_('words span')
    def words(self, p):
        return p.words + p.span
    
    @_('LANG_START letters LANG_END',
       'LANG_START letters LANG_END endings',
       'LANG_START letters LANG_END NEWLINE',
       'LANG_START beginnings letters LANG_END endings',
       'LANG_START beginnings letters LANG_END NEWLINE',
      )
    def span(self, p):
        begin_tag, value = p.LANG_START
        beginnings = [begin_tag] + getattr(p, 'beginnings', [])
        
        # build ends
        trailing_ends = getattr(p, 'endings', [])
        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
            trailing_ends.append(' ')
        endings = format_tag_endings(p.LANG_END, trailing_ends)
        
        word = make_word(p.letters, beginnings=beginnings, endings=endings)
        word['lang'] = value
        return [word]
    
    @_('LANG_START word letters LANG_END',
       'LANG_START word letters LANG_END endings',
       'LANG_START word letters LANG_END NEWLINE',
       'LANG_START word beginnings letters LANG_END endings',
       'LANG_START word beginnings letters LANG_END NEWLINE',
       'LANG_START words letters LANG_END endings',
       'LANG_START words letters LANG_END NEWLINE',
       'LANG_START words beginnings letters LANG_END endings',
      )
    def span(self, p):
        begin_tag, value = p[0]
        
        # compile words
        words = []
        if getattr(p, 'word', None):
            p.word['beginnings'].insert(0, begin_tag)
            words.append(p.word)
        elif getattr(p, 'words', None):
            words.extend(p.words)
            
        # build new word from dangling letters and ends
        trailing_ends = getattr(p, 'endings', [])
        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
            trailing_ends.append(' ')
        endings = format_tag_endings(p.LANG_END, trailing_ends)
        beginnings = getattr(p, 'beginnings', [])
        words.append(make_word(p.letters, beginnings=beginnings, endings=endings))
        
        return modify_attribute(words, 'lang', value)
    
    @_('LANG_START words LANG_END',
       'LANG_START words LANG_END endings',
       'LANG_START words LANG_END NEWLINE',
       'LANG_START word LANG_END',
       'LANG_START word LANG_END endings',
       'LANG_START word LANG_END NEWLINE',)
    def span(self, p):
        words = getattr(p, 'words', [p[1]])
        begin_tag, value = p[0]
        first_word, last_word = words[0], words[-1]
        first_word['beginnings'].insert(0, begin_tag)
        
        # build ends
        trailing_ends = last_word['endings'] + getattr(p, 'endings', [])
        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
            trailing_ends.append(' ')        
        last_word['endings'] = format_tag_endings(p[2], trailing_ends)
        
        return modify_attribute(words, 'lang', value)
    
    @_('words word')
    def words(self, p):
        return p.words + [p.word]
    
    @_('word word')
    def words(self, p):
        return [p[0]] + [p[1]]
    
    @_('beginnings letters endings', 
       'letters endings',
       'letters NEWLINE',
       'letters NEWLINE endings',
       'beginnings letters NEWLINE',
       'beginnings letters NEWLINE endings',
      )
    def word(self, p):
        beginnings = getattr(p, 'beginnings', [])
        endings =  getattr(p, 'endings', [' '])
        return make_word(p.letters, beginnings, endings)

    @_('PUNCT_BEGIN beginnings')
    def beginnings(self, p):
        return [p.PUNCT_BEGIN] + p.beginnings
    
    @_('PUNCT_BEGIN')
    def beginnings(self, p):
        return [p.PUNCT_BEGIN]
    
    @_('endings NEWLINE')
    def endings(self, p):
        if p.endings[-1] != ' ':
            p.endings.append(' ')
        return p.endings
    
    @_('endings PUNCT_END')
    def endings(self, p):
        return p.endings + [p.PUNCT_END]
    
    @_('PUNCT_END')
    def endings(self, p):
        return [p.PUNCT_END]
        
    @_('LETTER letters', 
       'FOREIGN_LETTER letters')
    def letters(self, p):
        return [p[0]] + p[1]
    
    @_('LETTER', 
       'FOREIGN_LETTER')
    def letters(self, p):
        return [p[0]]

parser = NenaParser()
test = parser.parse(lexer.tokenize(example))
#test
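The speaker tagging applied in the `nena` rule carries a speaker forward until a new speaker code appears. The sketch below shows that behaviour on a flat list of hypothetical line dicts modelled on lines 1 to 4 of the example text. The function `carry_speakers` is an illustrative re-implementation, not the notebook's `tag_speakers`, and it leaves the speaker as `None` for lines that precede the first speaker code, which is an assumption.

```python
def carry_speakers(lines, speakers):
    """Replace speaker codes with full names, carrying them forward."""
    current = None  # no speaker known until the first code appears
    for line in lines:
        code = line.get('speaker')
        if code is not None:
            if code not in speakers:
                raise KeyError(f'speaker {code} not specified in speakers metadata')
            current = speakers[code]
        line['speaker'] = current
    return lines

speakers = {'YD': 'Yulia Davudi', 'GK': 'Geoffrey Khan', 'CK': 'Cody Kingham'}
sample_lines = [
    {'number': '1', 'speaker': 'YD'},
    {'number': '2'},                   # inherits the speaker of line 1
    {'number': '3', 'speaker': 'GK'},
    {'number': '4', 'speaker': 'YD'},
]

tagged = carry_speakers(sample_lines, speakers)
print([line['speaker'] for line in tagged])
```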

In [9]:
metadata, text = test

In [10]:
metadata

{'dialect': 'Urmi_C',
 'title': 'When Shall I Die?',
 'encoding': 'UTF8',
 'speakers': {'YD': 'Yulia Davudi',
  'GK': 'Geoffrey Khan',
  'CK': 'Cody Kingham'},
 'place': '+Hassar +Baba-čanɟa, N',
 'transcriber': 'Geoffrey Khan',
 'text_id': 'A32'}

In [11]:
# n-paragraphs
len(text)

2

In [12]:
# paragraph 1, n-lines
len(text[0])

2

In [13]:
text[0][:10]

[{'number': '1',
  'speaker': 'Yulia Davudi',
  'words': [{'word': 'xá',
    'letters': ['x', 'á'],
    'beginnings': [],
    'endings': ['-']},
   {'word': 'yuma',
    'letters': ['y', 'u', 'm', 'a'],
    'beginnings': [],
    'endings': [' ']},
   {'word': 'malla',
    'letters': ['m', 'a', 'l', 'l', 'a'],
    'beginnings': ['"', '⁺'],
    'endings': [' ']},
   {'word': 'Nasrádən',
    'letters': ['N', 'a', 's', 'r', 'á', 'd', 'ə', 'n'],
    'beginnings': ['⁺'],
    'endings': ['"', ' ']},
   {'word': 'váyələ',
    'letters': ['v', 'á', 'y', 'ə', 'l', 'ə'],
    'beginnings': [],
    'endings': [' ']},
   {'word': 'tíva',
    'letters': ['t', 'í', 'v', 'a'],
    'beginnings': [],
    'endings': [' ']},
   {'word': 'ʾal',
    'letters': ['ʾ', 'a', 'l'],
    'beginnings': ['⁺'],
    'endings': ['-']},
   {'word': 'k̭èsa',
    'letters': ['k̭', 'è', 's', 'a'],
    'beginnings': [],
    'endings': ['.', 'ˈ', ' ']},
   {'word': 'xá', 'letters': ['x', 'á'], 'beginnings': [], 'endings': [' '

In [14]:
# paragraph 1, line 1
len(text[0][0])

3

In [15]:
# paragraph 1, line 1, n-words
len(text[0][0]['words'])

40

In [16]:
# paragraph 1, line 1, word 1
text[0][0]['words'][0]

{'word': 'xá', 'letters': ['x', 'á'], 'beginnings': [], 'endings': ['-']}
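Because every word keeps its own `beginnings` and `endings`, the surface text of a line can be rebuilt by concatenation. The sketch below does this for the first four words of the output shown above; the helper names `render_word` and `render_line` are hypothetical, and the sample omits the `letters` key for brevity.

```python
def render_word(word):
    """Join a word with its leading and trailing punctuation."""
    return ''.join(word['beginnings']) + word['word'] + ''.join(word['endings'])

def render_line(line):
    """Rebuild the surface text of one parsed line."""
    return ''.join(render_word(word) for word in line['words'])

# First four words of paragraph 1, line 1, copied from the output above
sample_line = {
    'number': '1',
    'speaker': 'Yulia Davudi',
    'words': [
        {'word': 'xá', 'beginnings': [], 'endings': ['-']},
        {'word': 'yuma', 'beginnings': [], 'endings': [' ']},
        {'word': 'malla', 'beginnings': ['"', '⁺'], 'endings': [' ']},
        {'word': 'Nasrádən', 'beginnings': ['⁺'], 'endings': ['"', ' ']},
    ],
}

rendered = render_line(sample_line)
print(repr(rendered))
```

Comparing such a rendering with the source file is one way to spot words whose punctuation was attached to the wrong side.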

In [17]:
# text[1][0]

## Testing with Real Texts

In [18]:
from pathlib import Path

In [19]:
# paths
data_dir = Path('../texts/2.0')
dialect_dirs = list(Path(data_dir).glob('*'))

In [20]:
dialect_dirs

[PosixPath('../texts/2.0/Barwar'), PosixPath('../texts/2.0/Urmi_C')]

### Run Parse On All Texts

In [21]:
import collections

In [22]:
dialect2name2parsed = collections.defaultdict(dict)
name2parsed = {}
name2text = {}
not_parsed = []

ignore = [

#    'Tales From the 1001 Nights.nena',
#    'A Cure for a Husband’s Madness.nena',
#    'A Dragon in the Well.nena',
#     'A Dutiful Son.nena',
#     'A Frog Wants a Husband.nena',
#     'A Painting of the King of Iran.nena',
#     'A Pound of Flesh.nena',
#     'A Thousand Dinars.nena',
#     'A Visit From Harun Ar-Rashid.nena',
#     'Agriculture and Village Life.nena',
]

for dialect in dialect_dirs:
    print(f'--Dialect {dialect}--')
    print()
    for file in sorted(dialect.glob('*.nena')):
        
        if file.name in ignore:
            print('SKIPPING:', file.name, '\n')
            not_parsed.append(file)
            continue
        
        with open(file, 'r', encoding='utf8') as infile:
            text = infile.read()
            name2text[file.name] = text
            print(f'trying: {file.name}')
            parseit = parser.parse(lexer.tokenize(text))
            print(f'\t√')
            name2parsed[file.stem] = parseit
            dialect2name2parsed[dialect.stem][file.stem] = parseit
                
print(len(name2parsed), 'parsed...')
print(len(not_parsed), 'not parsed...')

--Dialect ../texts/2.0/Barwar--

trying: A Hundred Gold Coins.nena
	√
trying: A Man Called Čuxo.nena
	√
trying: A Tale of Two Kings.nena
	√
trying: A Tale of a Prince and a Princess.nena
	√
trying: Baby Leliθa.nena
	√
trying: Dəmdəma.nena
	√
trying: Gozali and Nozali.nena
	√
trying: I Am Worth the Same as a Blind Wolf.nena
	√
trying: Man Is Treacherous.nena
	√
trying: Measure for Measure.nena
	√
trying: Nanno and Jəndo.nena
	√
trying: Qaṭina Rescues His Nephew From Leliθa.nena
	√
trying: Sour Grapes.nena
	√
trying: Tales From the 1001 Nights.nena
	√
trying: The Battle With Yuwanəs the Armenian.nena
	√
trying: The Bear and the Fox.nena
	√
trying: The Brother of Giants.nena
	√
trying: The Cat and the Mice.nena
	√
trying: The Cooking Pot.nena
	√
trying: The Crafty Hireling.nena
	√
trying: The Crow and the Cheese.nena
	√
trying: The Daughter of the King.nena
	√
trying: The Fox and the Lion.nena
	√
trying: The Fox and the Miller.nena
	√
trying: The Fox and the Stork.nena
	√
trying: The Gian
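Since the parser's `error` method raises an exception, the loop above stops at the first text that fails to parse. One possible variation, sketched below, records failures and continues with the remaining files. The function `parse_text` is a hypothetical stand-in for `parser.parse(lexer.tokenize(text))`, and `parse_all` and the temporary sample files are likewise invented for the illustration.

```python
import tempfile
from pathlib import Path

def parse_text(text):
    """Hypothetical stand-in for parser.parse(lexer.tokenize(text))."""
    if '::' not in text:
        raise ValueError('no metadata block found')
    return text.splitlines()

def parse_all(paths, parse_func):
    """Parse every file, collecting failures instead of stopping."""
    parsed, failed = {}, {}
    for path in sorted(paths):
        try:
            parsed[path.stem] = parse_func(path.read_text(encoding='utf8'))
        except Exception as error:
            failed[path.stem] = str(error)
    return parsed, failed

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / 'good.nena').write_text('title:: A\n\n1  xa\n', encoding='utf8')
    (tmp_dir / 'bad.nena').write_text('no header here\n', encoding='utf8')
    parsed, failed = parse_all(tmp_dir.glob('*.nena'), parse_text)

print(sorted(parsed), 'parsed...')
print(failed, 'not parsed...')
```

Keeping the error message per file makes it easier to see afterwards which texts need attention and why.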

In [23]:
# name2text['Axiqar.nena'][22200-30:22200+100]

In [24]:
# def test_token(test_str):
#     norm_str = unicodedata.normalize('NFD', test_str)
#     return list(lexer.tokenize(norm_str))

# test_token('nux màlka.ˈ (4) ʾu-ʾímət ṛi')

## Save the parsed texts for inspection

In [25]:
import json

In [26]:
parsed_dir = Path('../parsed_texts/2.0/')
# parents=True also creates ../parsed_texts if it does not exist yet
parsed_dir.mkdir(parents=True, exist_ok=True)

#dialect2name2parsed['Urmi_C']['Trickster']

for dialect, texts in dialect2name2parsed.items():
    dialect_dir = parsed_dir.joinpath(dialect)
    dialect_dir.mkdir(exist_ok=True)
    for text, parsing in texts.items():
        text_file = dialect_dir.joinpath(f'{text}.json')
        with open(text_file, 'w', encoding='utf8') as outfile:
            json.dump(parsing, outfile, ensure_ascii=False, indent=2)
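A quick round trip can confirm that the saved JSON reloads to the same structure and that `ensure_ascii=False` keeps the transcription characters readable in the file. The sketch below uses a tiny invented parse result and a temporary directory; `sample_parse` and the file name are assumptions, and the explicit `encoding='utf8'` is a choice made here so that the result does not depend on the platform default.

```python
import json
import tempfile
from pathlib import Path

# Tiny invented parse result: [metadata, text_block]
sample_parse = [
    {'title': 'D\u0259md\u0259ma'},          # contains the schwa letter
    [[{'number': '1', 'words': []}]],
]

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / 'sample.json'
    with open(out_file, 'w', encoding='utf8') as outfile:
        json.dump(sample_parse, outfile, ensure_ascii=False, indent=2)

    raw = out_file.read_text(encoding='utf8')
    with open(out_file, 'r', encoding='utf8') as infile:
        reloaded = json.load(infile)

print(reloaded == sample_parse)
print('\\u' in raw)
```

Note that JSON has no tuple type, so any tuples in a parse result would come back as lists after reloading.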

<hr> 

## Old Code

In [2]:
#example = unicodedata.normalize('NFD', '''
#dialect::Urmi_C
#title::When Shall I Die?
#encoding::UTF8
#informant::Yulia Davudi
#interviewer::Geoffrey Khan
#place::+Hassar +Baba-čanɟa, N
#transcriber::Geoffrey Khan
#text_id::A32 
#speakers::GK=Geoffrey Khan, CK=Cody Kingham, YD=Yulia Davudi

#(1@0:00) xá-yuma "⁺malla ⁺Nasrádən" váyələ tíva ⁺ʾal-k̭èsa.ˈ xá mən-nášə
#⁺vàrəva,ˈ mə́rrə ⁺màllaˈ ʾátən ʾo-k̭ésa pràmut,ˈ bət-nàplət.ˈ mə́rrə <P: bŏ́ro> 
#bàbaˈ ʾàtən=daˈ ⁺šúla lə̀tluxˈ tíyyət b-dìyyi k̭ítət.ˈ ⁺šúk̭ si-⁺bar-⁺šùlux
#.ˈ ʾána ⁺šūl-ɟànilə.ˈ náplən nàplən.ˈ (2@0:08) ⁺hàlaˈ ʾo-náša léva xíša xá 
#⁺ʾəsrá ⁺pasulyày,ˈ ⁺málla bitáyələ drúm ⁺ʾal-⁺ʾàrra.ˈ bək̭yámələ ⁺bərxáṱələ 
#⁺bàru.ˈ màraˈ ⁺maxlèta,ˈ ʾátən ⁺dílux ʾána bət-náplənva m-⁺al-ʾilàna.ˈ 
#bas-tánili xázən ʾána ʾíman bət-mètən.ˈ ʾo-náša xzílə k̭at-ʾá ⁺màllaˈ hónu
#xáč̭č̭a ... ⁺basùrələˈ mə́rrə k̭àtuˈ ⁺maxlèta,ˈ mə̀drə,ˈ «GK: maxlèta?» ⁺rába 
#⁺maxlèta.ˈ mə́rrə k̭at-ʾíman xmártux ⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ ʾó-yuma mètət.ˈ 
#ʾó-yumət xmártux ⁺ṱlá ɟáhə ⁺ʾarṱàla,ˈ ʾó-yuma mètət.ˈ 
#(3@0:16, CK) <E:Why hello there!> (4@0:18, YD) ⁺málla! múttəva ... ⁺ṱànaˈ
#⁺yak̭úyra ⁺ʾal-xmàrta.ˈ ⁺ṱànaˈ mə́ndi ⁺rába múttəva ⁺ʾal-xmàrtaˈ ʾu-xmàrtaˈ 
#⁺báyyava ʾask̭áva ⁺ʾùllul.ˈʾu-bas-pòxa ⁺plə́ṱlə mənnó.ˈ ṱə̀r,ˈ ⁺riṱàla.ˈ ⁺málla mə́rrə ʾàha,ˈ 
#ʾána dū́n k̭arbúnə k̭a-myàta.ˈ (4@0:20)<E:ok> «CK:yes?» xáč̭č̭a=da sə̀k̭laˈ xa-xìta.ˈ ɟánu mudməxxálə
#⁺ʾal-⁺ʾàrra.ˈ mə̀rrəˈ xína ⁺dā́n mòtila.ˈ ʾē=t-d-⁺ṱlàˈ ⁺málla mə̀tlə.ˈ nàšə,ˈ
# xuyravàtuˈ xə́šlun tílun mə̀rrunˈ: ʾa mù-vadət? k̭a-mú=ivət ⁺tàmma?ˈ mə́rrə 
# xob-ʾána mìtən.ˈ lá bəxzáyətun k̭at-mìtən!ˈ lá mə́rrun ʾat-xàya!ˈ 
# hamzùməvət.ˈ bəšvák̭una ⁺tàmaˈ màraˈ xmàrələ,ˈ lélə ⁺parmùyə.ˈ
# ''')

In [119]:
#class NenaLexer(Lexer):
#    
#    def error(self, t):
#        """Give warning for bad characters"""
#        print(f"Illegal character {repr(t.value[0])} @ index {self.index}")
#        self.index += 1
#    
#    # set of token names as required by
#    # the Lexer class
#    tokens = {
#        LETTER, PUNCT_BEGIN, PUNCT_END, NEWLINES,
#        NEWLINE, NEWLINES, ATTRIBUTE, 
#        FOREIGN_LETTER,
#        LINESTAMP, SPAN_START, SPAN_END        
#    }

#    # Attribute starts key and colon. Returns 2-tuple (key, value).
#    @_(r'[a-z0-9_]+ = .*')
#    def ATTRIBUTE(self, t):
#        field, value = tuple(t.value.split('='))
#        t.value = {field.strip(): value.strip()}
#        return t
#    
#    @_(r'\(\d+\@\d:\d+\)\s*', 
#       r'\(\d+\)\s*')
#    def LINESTAMP(self, t):
#        number = re.findall('^\((\d+)', t.value)[0]
#        timestamp = re.findall('@(\d+:\d+)', t.value)
#        if timestamp:
#            timestamp = timestamp[0]
#        t.value = {'number': number, 'timestamp': timestamp}
#        return t

#    NEWLINES = r'\n\s*\n\s*' # i.e. marks text-blocks
#    LETTER = alphabet_re    
#    PUNCT_BEGIN = punct_begin_re
#    PUNCT_END = punct_end_re
#    NEWLINE = '\n\s*'
#        
#    # treat the language and speaker tag simultaneously as a "span"
#    # this optimizes the code quite a bit since both tags
#    # behave identically when they are parsed
#    @_(r'[<«][A-Za-z?]+:\s*')
#    def SPAN_START(self, t):
#        if t.value[0] == '<':
#            kind = 'language'
#            punct_type = 'exclusive'
#        else:
#            kind = 'speaker'
#            punct_type = 'inclusive'
#        value = re.match(r'[<«]([A-Za-z?]+):', t.value).group(1)
#        tag = t.value.strip() + ' ' # ensure spacing
#        t.value = (tag, kind, value, punct_type) # tag, key, value, punct_type
#        return t
#        
#    SPAN_END = r'[>»]'
#    
#    # NB: tokens evaluated in order of appearance here
#    # thus foreign string matched lastly
#    FOREIGN_LETTER = r'[a-zA-ZðÐɟəƏɛƐʾʿθΘ][\u0300-\u033d]*'

In [121]:
#def make_word(letters, beginnings=[], endings=[]):
#    """Return word dictionary"""
#    return {
#        'word': ''.join(letters),
#        'letters': letters,
#        'beginnings': beginnings,
#        'endings': endings,
#    }

#def modify_attribute(words, key, value):
#    """Modify dict attribute for a list of words"""
#    for word in words:
#        word[key] = value
#    return words

#def format_tag_endings(tag, punct_value, endings=[]):
#    """Format punctuation around a tag.
#    
#    Normalizes in case of irregularity. For instance, in the
#    cases of both
#        words.</> 
#        words</>.
#    the tags will be normalized to either an in/exclusive order.
#    """
#    if punct_value == 'inclusive':
#        return endings + [tag]
#    elif punct_value == 'exclusive':
#        return [tag] + endings
#    else:
#        raise Exception(f'INVALID punct_value supplied: {punct_value}')
#    
#class NenaParser(Parser):
#    
#    #debugfile = 'nena_parser.out'
#    tokens = NenaLexer.tokens
#    
#    def error(self, t):
#        raise Exception(f'unexpected {t.type} ({repr(t.value)}) at index {t.index}')
#    
#    @_('attributes NEWLINES text_block')
#    def nena(self, p):
#        return [p.attributes, p.text_block]
#    
#    @_('attributes NEWLINE ATTRIBUTE')
#    def attributes(self, p):
#        p.attributes.update(p.ATTRIBUTE)
#        return p.attributes
#    
#    @_('NEWLINE ATTRIBUTE', 'ATTRIBUTE')
#    def attributes(self, p):
#        return p.ATTRIBUTE
#    
#    @_('text_block NEWLINES paragraph')
#    def text_block(self, p):
#        return p.text_block + [p.paragraph]
#    
#    @_('paragraph')
#    def text_block(self, p):
#        return [p.paragraph]
#    
#    @_('paragraph line')
#    def paragraph(self, p):
#        return p.paragraph + [p.line]
#    
#    @_('line')
#    def paragraph(self, p):
#        return [p.line]
#    
#    @_('LINESTAMP span words',
#      'LINESTAMP span span words',
#      'LINESTAMP span word span words')
#    def line(self, p):
#        words = []
#        for wordtype in list(p)[1:]:
#            if type(wordtype) == list: 
#                words += wordtype
#            else:
#                words.append(wordtype)
#        p.LINESTAMP['words'] = words
#        return p.LINESTAMP
#    
#    @_('LINESTAMP word span words')
#    def line(self, p):
#        p.LINESTAMP['words'] = [p.word] + p.span + p.words
#        return p.LINESTAMP
#    
#    @_('LINESTAMP words')
#    def line(self, p):
#        p.LINESTAMP['words'] = p.words
#        return p.LINESTAMP
#    
#    @_('words span')
#    def words(self, p):
#        return p.words + p.span
#    
#    @_('SPAN_START letters SPAN_END',
#       'SPAN_START letters SPAN_END endings',
#       'SPAN_START letters SPAN_END NEWLINE',
#       'SPAN_START beginnings letters SPAN_END endings',
#       'SPAN_START beginnings letters SPAN_END NEWLINE',
#      )
#    def span(self, p):
#        begin_tag, kind, value, punct_type = p.SPAN_START
#        beginnings = [begin_tag] + getattr(p, 'beginnings', [])
#        
#        # build ends
#        trailing_ends = getattr(p, 'endings', [])
#        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
#            trailing_ends.append(' ')
#        endings = format_tag_endings(p.SPAN_END, punct_type, trailing_ends)
#        
#        word = make_word(p.letters, beginnings=beginnings, endings=endings)
#        word[kind] = value
#        return [word]
#    
#    @_('SPAN_START word letters SPAN_END',
#       'SPAN_START word letters SPAN_END endings',
#       'SPAN_START word letters SPAN_END NEWLINE',
#       'SPAN_START word beginnings letters SPAN_END endings',
#       'SPAN_START word beginnings letters SPAN_END NEWLINE',
#       'SPAN_START words letters SPAN_END endings',
#       'SPAN_START words letters SPAN_END NEWLINE',
#       'SPAN_START words beginnings letters SPAN_END endings',
#      )
#    def span(self, p):
#        begin_tag, kind, value, punct_type = p[0]
#        
#        # compile words
#        words = []
#        if getattr(p, 'word', None):
#            p.word['beginnings'].insert(0, begin_tag)
#            words.append(p.word)
#        elif getattr(p, 'words', None):
#            words.extend(p.words)
#            
#        # build new word from dangling letters and ends
#        trailing_ends = getattr(p, 'endings', [])
#        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
#            trailing_ends.append(' ')
#        endings = format_tag_endings(p.SPAN_END, punct_type, trailing_ends)
#        beginnings = getattr(p, 'beginnings', [])
#        words.append(make_word(p.letters, beginnings=beginnings, endings=endings))
#        
#        return modify_attribute(words, kind, value)
#    
#    @_('SPAN_START words SPAN_END',
#       'SPAN_START words SPAN_END endings',
#       'SPAN_START words SPAN_END NEWLINE',
#       'SPAN_START word SPAN_END',
#       'SPAN_START word SPAN_END endings',
#       'SPAN_START word SPAN_END NEWLINE',)
#    def span(self, p):
#        words = getattr(p, 'words', [p[1]])
#        begin_tag, kind, value, punct_type = p[0]
#        first_word, last_word = words[0], words[-1]
#        first_word['beginnings'].insert(0, begin_tag)
#        
#        # build ends
#        trailing_ends = last_word['endings'] + getattr(p, 'endings', [])
#        if getattr(p, 'NEWLINE', '') and not ''.join(trailing_ends).endswith(' '):
#            trailing_ends.append(' ')        
#        last_word['endings'] = format_tag_endings(p[2], punct_type, trailing_ends)
#        
#        return modify_attribute(words, kind, value)
#    
#    @_('words word')
#    def words(self, p):
#        return p.words + [p.word]
#    
#    @_('word word')
#    def words(self, p):
#        return [p[0]] + [p[1]]
#    
#    @_('beginnings letters endings', 
#       'letters endings',
#       'letters NEWLINE',
#       'letters NEWLINE endings',
#       'beginnings letters NEWLINE',
#       'beginnings letters NEWLINE endings',
#      )
#    def word(self, p):
#        beginnings = getattr(p, 'beginnings', [])
#        endings =  getattr(p, 'endings', [' '])
#        return make_word(p.letters, beginnings, endings)

#    @_('PUNCT_BEGIN beginnings')
#    def beginnings(self, p):
#        return [p.PUNCT_BEGIN] + p.beginnings
#    
#    @_('PUNCT_BEGIN')
#    def beginnings(self, p):
#        return [p.PUNCT_BEGIN]
#    
#    @_('endings NEWLINE')
#    def endings(self, p):
#        if p.endings[-1] != ' ':
#            p.endings.append(' ')
#        return p.endings
#    
#    @_('endings PUNCT_END')
#    def endings(self, p):
#        return p.endings + [p.PUNCT_END]
#    
#    @_('PUNCT_END')
#    def endings(self, p):
#        return [p.PUNCT_END]
#        
#    @_('LETTER letters', 
#       'FOREIGN_LETTER letters')
#    def letters(self, p):
#        return [p[0]] + p[1]
#    
#    @_('LETTER', 
#       'FOREIGN_LETTER')
#    def letters(self, p):
#        return [p[0]]

#parser = NenaParser()
#test = parser.parse(lexer.tokenize(example))
##test