# NENA to TF

This notebook develops code for converting texts from the .nena format to Text-Fabric (TF). The parser was written principally by Hannes Vlaardingerbroek; many thanks to him for his hard work on it. Cody Kingham has added updates and refinements.

In [1]:
import os
import sys
import collections
import re
from pathlib import Path
from tf.convert.walker import CV
from tf.fabric import Fabric

# path to parser
parserpath = '../../nena_corpus/parse_nena/'
sys.path.append(parserpath)
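# NB: Morpheme is used by the converter below; it is assumed
# to be defined in nena_parser alongside the lexer and parser
from nena_parser import NenaLexer, NenaParser, Morpheme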

# paths
VERSION = '0.01'
data_dir = Path(f'../../nena_corpus/nena/{VERSION}')
dialect_dirs = list(Path(data_dir).glob('*'))

Parser debugging for NenaParser written to parser.out


## Parse NENA

The NENA parser delivers each text as structured morphemes, which can then be processed into a TF graph. Below, we open each source text, parse it, and store the parsed result by dialect and file name.

In [2]:
lexer = NenaLexer()
parser = NenaParser()

In [3]:
dialect2file2parsed = collections.defaultdict(dict)

nparsed = 0

for dialect in sorted(dialect_dirs):    
    for file in sorted(dialect.glob('*.nena')):
        with open(file, 'r') as infile:
            text = infile.read()
            parse = parser.parse(lexer.tokenize(text))
            nparsed += 1
            dialect2file2parsed[dialect.name][file.name] = parse
            print(f'parsed: {file.name}')

print('\n', nparsed, 'texts ready for conversion')

parsed: A Hundred Gold Coins.nena
parsed: A Man Called Čuxo.nena
parsed: A Tale of Two Kings.nena
parsed: A Tale of a Prince and a Princess.nena
parsed: Baby Leliθa.nena
parsed: Dəmdəma.nena
parsed: Gozali and Nozali.nena
parsed: I Am Worth the Same as a Blind Wolf.nena
parsed: Man Is Treacherous.nena
parsed: Measure for Measure.nena
parsed: Nanno and Jəndo.nena
parsed: Qaṭina Rescues His Nephew From Leliθa.nena
parsed: Sour Grapes.nena
parsed: Tales From the 1001 Nights.nena
parsed: The Battle With Yuwanəs the Armenian.nena
parsed: The Bear and the Fox.nena
parsed: The Brother of Giants.nena
parsed: The Cat and the Mice.nena
parsed: The Cooking Pot.nena
parsed: The Crafty Hireling.nena
parsed: The Crow and the Cheese.nena
parsed: The Daughter of the King.nena
parsed: The Fox and the Lion.nena
parsed: The Fox and the Miller.nena
parsed: The Fox and the Stork.nena
parsed: The Giant’s Cave.nena
parsed: The Girl and the Seven Brothers.nena
parsed: The King With Forty Sons.nena
parsed: The

In [15]:
linenum, elements = dialect2file2parsed['Barwar']['The Tale of Nasimo.nena'][1][0][0]

linenum

1

In [19]:
elements[0].__dict__

{'value': ['ʾ', 'í', 'θ', 'w', 'a'],
 'trailer': ' ',
 'footnotes': {},
 'speaker': None,
 'foreign': False,
 'lang': None}
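
The indexing above suggests that each parse is a `(header, paragraphs)` pair, where every paragraph is a list of `(line_number, elements)` tuples. As a minimal sketch of walking that shape, the snippet below uses a hypothetical `FakeMorpheme` stand-in and invented sample data rather than real parser output; the header key is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMorpheme:
    """Hypothetical stand-in mirroring the attributes shown above."""
    value: list
    trailer: str = ''
    footnotes: dict = field(default_factory=dict)
    speaker: str = None
    foreign: bool = False
    lang: str = None


# invented sample in the (header, paragraphs) shape; not real corpus data
header = {'title': 'Sample Text'}  # header keys are an assumption
paragraphs = [
    [  # paragraph 1: a list of (line_number, elements) tuples
        (1, [FakeMorpheme(['ʾ', 'í', 'θ', 'w', 'a'], ' '),
             FakeMorpheme(['a', 'b'], '. ')]),
        (2, [FakeMorpheme(['c'], ' ')]),
    ],
]
parse = (header, paragraphs)


def iter_morphemes(parse):
    """Yield (paragraph number, line number, morpheme) triples."""
    _, paragraphs = parse
    for para_num, para in enumerate(paragraphs, start=1):
        for line_num, elements in para:
            for elem in elements:
                yield para_num, line_num, elem


rows = [
    (para_num, line_num, ''.join(morph.value) + morph.trailer)
    for para_num, line_num, morph in iter_morphemes(parse)
]
for row in rows:
    print(row)
```

A flat iterator like this can be handy for spot-checking a parse before it is handed to the converter.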

## Converter

Build a director function for the TF walker (`CV`) that walks over the parsed NENA data and builds the text graph.

In [22]:
def make_footnotes(fn_dict):
    """Format footnote dict into string"""
    if fn_dict:
        return '; '.join(
            f'[^{num}]: {txt}' for num, txt in fn_dict.items()
        )
    else:
        return None

def director(cv):
    """Walk the source data and produce a TF graph"""
    
    for dialect in sorted(dialect_dirs):    
        for file in sorted(dialect.glob('*.nena')):
            with open(file, 'r') as infile:
                nena_text = infile.read()
            
            # parse the .nena format
            header, paragraphs = parser.parse(lexer.tokenize(nena_text))
            
            # -- begin TF node creation --
            
            # cv.node initializes a node object
            # all slots added in between its creation and 
            # termination will be considered embedded within
            # this node; same is true of following cv.node calls
            text = cv.node('text')
            cv.feature(text, **header) # adds features to supplied node
            
            for i, para in enumerate(paragraphs):
                
                # make paragraph node
                paragraph = cv.node('paragraph')
                cv.feature(paragraph, number=i+1)
                
                for line_number, line_elements in para:
                    
                    # make line nodes
                    line = cv.node('line')
                    cv.feature(line, number=line_number)
                    
                    # make words by composing morphemes
                    # this must be done iteratively and 
                    # reset once a complete word has been
                    # assembled. This happens with this_word
                    # and the loop below. Other elements are 
                    # also dealt with in the loop
                    this_word = cv.node('word')
                    prev_morph = None
                    for elem in line_elements:
                        
                        # process morphemes as slots
                        # 'slot' being the most basic element
                        if isinstance(elem, Morpheme):
                            
                            # determine which word to assign
                            # morpheme to:
                            
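                            # add morpheme to prev word
                            # (this_word remains active)
                            # NB: the first morpheme of a line has no
                            # predecessor and joins the initial word
                            if prev_morph is None or re.match(r'^\s*$|^-\s*$', prev_morph.trailer):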
                                this_morph = cv.slot()
                                prev_morph = elem
                                
                            # add morpheme to new word
                            # (this_word is terminated and replaced)
                            else:
                                cv.terminate(this_word)
                                this_word = cv.node('word')
                                this_morph = cv.slot()
                                prev_morph = elem
                            
                            # prepare features on morphemes
                            # NB: None values are ignored by CV
                            fs = elem.__dict__
                            feats = {
                                'letters': ' '.join(fs.get('value', [])),
                                'trailer': fs.get('trailer', ' '),
                                'speaker': fs.get('speaker'),
                                'footnotes': make_footnotes(fs.get('footnotes', {})),
                                'lang': fs.get('lang'),
                                'foreign': str(fs.get('foreign')) if fs.get('foreign') else None,
                            }
                            
                            # add features to morpheme
                            cv.feature(this_morph, **feats)
                            
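                        # add all other elements as feature of morpheme
                        else:
                            pass

                    # close the last word and the line so that
                    # later slots are not embedded in them
                    cv.terminate(this_word)
                    cv.terminate(line)

                # close the paragraph once all its lines are done
                cv.terminate(paragraph)

            # close the text once all its paragraphs are done
            cv.terminate(text)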
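
To illustrate the embedding behaviour described in the comments of `director`, the sketch below uses a hypothetical `ToyWalker` class that only mimics the `node`, `slot` and `terminate` calls. It is not the Text-Fabric `CV` class and it ignores features entirely; it simply records every slot as embedded in each node that is open when the slot is created.

```python
class ToyWalker:
    """Hypothetical stand-in for a walker; NOT the Text-Fabric CV API."""

    def __init__(self):
        self.open_nodes = []  # nodes created but not yet terminated
        self.embedded = {}    # node -> list of slot numbers it contains
        self.counts = {}      # node type -> number of nodes made so far
        self.n_slots = 0

    def node(self, node_type):
        """Open a new node of the given type and return it."""
        self.counts[node_type] = self.counts.get(node_type, 0) + 1
        node = (node_type, self.counts[node_type])
        self.open_nodes.append(node)
        self.embedded[node] = []
        return node

    def slot(self):
        """Create a slot; it belongs to every node that is currently open."""
        self.n_slots += 1
        for node in self.open_nodes:
            self.embedded[node].append(self.n_slots)
        return self.n_slots

    def terminate(self, node):
        """Close a node so that later slots are no longer added to it."""
        self.open_nodes.remove(node)


toy = ToyWalker()
line = toy.node('line')
word = toy.node('word')
toy.slot()  # first morpheme of the first word
toy.slot()  # second morpheme of the first word
toy.terminate(word)
word = toy.node('word')
toy.slot()  # only morpheme of the second word
toy.terminate(word)
toy.terminate(line)

print(toy.embedded)
```

In this toy model, a node that is never terminated stays open and keeps collecting every later slot, which is the situation the word-switching logic in `director` is meant to avoid.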

In [30]:
re.match(r'^\s*$|^-\s*$', '- .')
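
The call above returns `None`, because the trailer `'- .'` contains a character other than a hyphen and whitespace. As a quick sketch of how the same pattern classifies other trailers, the check below runs it over a handful of sample strings; the samples are illustrative assumptions, not drawn from the corpus.

```python
import re

# same pattern as in the director, written as a raw string
TRAILER_PATTERN = re.compile(r'^\s*$|^-\s*$')

samples = ['', ' ', '-', '- ', '- .', '. ']
results = {trailer: bool(TRAILER_PATTERN.match(trailer)) for trailer in samples}

for trailer, matched in results.items():
    print(repr(trailer), matched)
```

Note that a trailer consisting only of a space also matches, so under this test a plain space keeps the next morpheme in the current word rather than starting a new one.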

## Corpus and Feature Metadata

In [11]:
# TODO: Pre-requisites
slotType = 'char' # or morpheme?
otext = {
    # text configs here
}
generic = {
    # generic meta data here
}
intFeatures = {
    # features as integers here
}
featureMeta = {
    # feature metadata here
}
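
As a hedged sketch of how these placeholders might be filled in, the values below are derived from the node types and features used in `director`. Everything here is an assumption for illustration: the choice of `morpheme` as slot type, the text format, the corpus name and all descriptions would need to be checked against the parser output and the Text-Fabric documentation. Section settings and header features are left out because the header keys are not shown in this notebook.

```python
# all values are illustrative assumptions, not the project's settled config
slotType = 'morpheme'  # director() creates one slot per morpheme

otext = {
    # a plain text format built from two morpheme features
    'fmt:text-orig-full': '{letters}{trailer}',
    # sectionTypes / sectionFeatures omitted: they depend on
    # header keys that are not shown here
}

generic = {
    'name': 'NENA corpus',  # placeholder name
    'version': '0.01',      # matches VERSION above
}

# features whose values should be stored as integers
intFeatures = {'number'}

featureMeta = {
    'number': {'description': 'sequential number of a paragraph or line'},
    'letters': {'description': 'letters of a morpheme'},
    'trailer': {'description': 'material between a morpheme and the next one'},
    'speaker': {'description': 'speaker of a morpheme, if marked'},
    'footnotes': {'description': 'footnotes attached to a morpheme'},
    'lang': {'description': 'language of a morpheme, if marked'},
    'foreign': {'description': 'whether a morpheme is marked as foreign'},
}

# sanity check: every integer feature has a metadata entry
assert intFeatures <= set(featureMeta)
```

Keeping `intFeatures` as a set of names that also appear in `featureMeta` makes it easy to catch a feature that is typed but never documented.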
