# Experimental Corpus

In this notebook, we experiment with producing a Text-Fabric (TF) resource for the Christian Urmi dialect of North-Eastern Neo-Aramaic (NENA). The text itself comes from Geoffrey Khan.

In [1]:
import re, collections
from IPython.display import display, HTML
from tf.fabric import Fabric
with open('christian_urmi.txt', 'r', encoding='utf-8') as infile: # assumes the transcription file is UTF-8
    urmi = infile.read()

## Building a Text-Fabric Resource

Text-Fabric is a format and tool for the storage, annotation, and analysis of text corpora. The Text-Fabric data model is explained in depth [in its docs](https://annotation.github.io/text-fabric/Model/Data-Model/).
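
As a rough illustration of that model, the sketch below represents a three-word corpus with plain dictionaries. It is not Text-Fabric's own API. The tokens `w1`, `w2`, `w3` and the helper `text_of` are hypothetical placeholders. Slots are the smallest textual units and are numbered from 1; higher-level objects receive node numbers after the last slot and are linked back to their slots.

```python
# Toy corpus: three word slots plus one sentence object (all values hypothetical).
otype = {1: 'word', 2: 'word', 3: 'word', 4: 'sentence'}  # node -> object type
trans = {1: 'w1', 2: 'w2', 3: 'w3'}                       # slot -> transcription
oslots = {4: {1, 2, 3}}                                   # object node -> slots it contains

def text_of(node):
    """Join the transcriptions of all slots that a node covers."""
    slots = oslots.get(node, {node})  # a slot covers only itself
    return ' '.join(trans[s] for s in sorted(slots))

print(text_of(4))
print(text_of(2))
```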

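Below we convert the plain-text file into a TF resource in three steps: map slots to objects and features, renumber the objects above the slot level as nodes, and save the result in TF format.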

## Build Up Node Feature and Oslot Mappings

In [7]:
def iterateKey(dictionary):
    '''
    Return the next unused integer key of a dictionary:
    one more than its current maximum key, or 1 if it is empty.
    '''
    return max(dictionary.keys(), default=0)+1

raw_node_features = collections.defaultdict(lambda:collections.defaultdict(set))
raw_oslots = collections.defaultdict(lambda:collections.defaultdict(set))
slot = 0

this_sentence = 1 # for first iteration since only sentence ends are marked

for line in urmi.split('\n'):
    
    # mark book beginnings, their "code" and title
    if line.startswith('# '): # book code
        this_book = iterateKey(raw_oslots['book'])
        raw_node_features['book_code'][this_book] = line.split()[-1].strip()
        continue
    elif line.startswith('## '): # book title
        raw_node_features['book_title'][this_book] = line[3:].strip() # keep the whole title, not only its last word
        continue
                
    # map slots to objects and features:
    for token in line.split():
        
        if re.match(r'.*\(\d*\)', token): # line start; raw string keeps the backslashes literal
            this_line = iterateKey(raw_oslots['line'])
            raw_node_features['line'][this_line] = token
            continue
            
        # everything up to this point is a valid slot
        # iterate slot up by 1
        slot += 1
            
        # record sentence membership: every slot belongs to the current sentence
        raw_oslots['sentence'][this_sentence].add(slot)
        if re.match(r'.*\.\|', token): # end of sentence
            this_sentence = iterateKey(raw_oslots['sentence']) # get incremented, new sentence ID
            
        raw_node_features['trans'][slot] = token
        raw_oslots['book'][this_book].add(slot)
        raw_oslots['line'][this_line].add(slot)
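
To make the token tests in the loop above easier to follow, here is a small sketch that applies the same two patterns to a made-up string. The sample tokens are hypothetical and only assume what the patterns imply: line markers look like `(1)` and sentence-final tokens end in `.|`.

```python
import re

LINE_MARK = re.compile(r'.*\(\d*\)')  # e.g. "(1)": starts a new line object
SENT_END = re.compile(r'.*\.\|')      # e.g. "bbb.|": last token of a sentence

sample = '(1) aaa bbb.| ccc (2) ddd.|'  # hypothetical input, not from the corpus

kinds = []
for token in sample.split():
    if LINE_MARK.match(token):
        kinds.append('line-start')    # not a slot; only opens a line
    elif SENT_END.match(token):
        kinds.append('sentence-end')  # a slot that closes the current sentence
    else:
        kinds.append('word')          # an ordinary slot

print(kinds)
```

Only the `word` and `sentence-end` tokens would receive slot numbers; the line markers are stored as a feature of the line object instead.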

## Reindex Objects Above Slot Levels

In [15]:
otype2feature = {'book':{'book_code', 'book_title'},
                 'line':{'line'},
                 'sentence':set()} # sentences carry no features of their own

node_features = collections.defaultdict(dict)
node_features['trans'] = raw_node_features['trans'] # add slot features
for slot in node_features['trans']:
    node_features['otype'][slot] = 'word'
    
edge_features = collections.defaultdict(lambda:collections.defaultdict(set)) # oslots will go here

onode = max(raw_node_features['trans']) # max slot, incremented +1 in loop

for otype in raw_oslots.keys():
    for oID, slots in raw_oslots[otype].items():
        
        # make new object node number
        onode += 1
        node_features['otype'][onode] = otype
        
        # remap node features to node number
        for feat in otype2feature[otype]:
            node_features[feat][onode] = raw_node_features[feat][oID]
        edge_features['oslots'][onode] = slots # TF expects this edge feature to be named "oslots"
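
The renumbering can be checked on a toy input. The sketch below is a simplified stand-in for the cell above, with hypothetical data: two slots, one book covering both, and two lines covering one slot each. Object nodes are numbered consecutively after the last slot, one object type at a time, so that nodes of the same type stay contiguous.

```python
# Hypothetical toy input in the same shape as raw_oslots above.
toy_oslots = {'book': {1: {1, 2}},
              'line': {1: {1}, 2: {2}}}
max_slot = 2

toy_otype = {s: 'word' for s in range(1, max_slot + 1)}  # slots come first
toy_edges = {}                                           # object node -> its slots

node = max_slot
for obj_type, objects in toy_oslots.items():
    for obj_id, slots in objects.items():
        node += 1                    # next free node number
        toy_otype[node] = obj_type
        toy_edges[node] = slots

print(toy_otype)
print(toy_edges)
```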

## Save to TF Format

In [None]:
otext = '''

@sectionTypes=book,line
@sectionFeatures=book_code,line
@fmt:text-orig-full={trans}

'''

meta = {'':{'author': 'Geoffrey Khan and Cody Kingham'}}
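
One possible way to fold the `otext` settings into the metadata dictionary is sketched below. It assumes that Text-Fabric's save step accepts metadata as a dictionary keyed by feature name, with `''` holding corpus-wide entries and `'otext'` holding the text configuration; that layout should be checked against the TF documentation before use. The names `otext_meta` and `full_meta` are hypothetical.

```python
otext = '''

@sectionTypes=book,line
@sectionFeatures=book_code,line
@fmt:text-orig-full={trans}

'''

# Turn each "@key=value" line into a dictionary entry.
otext_meta = {}
for line in otext.splitlines():
    line = line.strip()
    if line.startswith('@') and '=' in line:
        key, value = line[1:].split('=', 1)  # split on the first "=" only
        otext_meta[key] = value

full_meta = {'': {'author': 'Geoffrey Khan and Cody Kingham'},
             'otext': otext_meta}

print(full_meta['otext'])
```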