# Experimental Corpus

In this notebook, we experiment with producing a Text-Fabric (TF) resource for the Christian Urmi NENA dialect. The text itself comes from Geoffrey Khan.

In [1]:
import re, collections
from IPython.display import display, HTML
from tf.fabric import Fabric
with open('christian_urmi.txt', 'r') as infile:
    urmi = infile.read()

## Building a Text-Fabric Resource

Text-Fabric is a format and tool for the storage, annotation, and analysis of text corpora. The Text-Fabric data model is explained in depth [in its docs](https://annotation.github.io/text-fabric/Model/Data-Model/).
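
As a rough illustration of that data model, the sketch below mimics its core ideas with plain dictionaries: the smallest text units (slots) are numbered from 1, every larger object gets a node number above the last slot, `otype` records each node's type, and `oslots` records which slots a non-slot node contains. The toy values and the helper `text_of` are illustrative assumptions, not part of Text-Fabric's API.

```python
# Toy corpus: three word slots grouped into one sentence and one line.
# Slots are numbered 1..max_slot; all other nodes are numbered above max_slot.
otype = {1: 'word', 2: 'word', 3: 'word', 4: 'sentence', 5: 'line'}

# oslots maps each non-slot node to the set of slots it contains.
oslots = {4: {1, 2, 3}, 5: {1, 2, 3}}

# An ordinary node feature: values keyed by node number.
trans = {1: 'xa', 2: 'yuma', 3: '+malla'}

max_slot = max(n for n, t in otype.items() if t == 'word')


def text_of(node):
    """Join the `trans` values of the slots under `node` (hypothetical helper)."""
    slots = [node] if node <= max_slot else sorted(oslots[node])
    return ' '.join(trans[s] for s in slots)


print(text_of(4))  # text of the sentence node
print(text_of(2))  # text of a single slot
```

The conversion steps in this notebook build exactly these kinds of mappings (node features, `otype`, and `oslots`) before handing them to Text-Fabric for saving.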

Here we follow a fairly standard approach to converting a plain-text file into a TF resource.

## Build Up Node Feature and Oslot Mappings

In [2]:
def iterateKey(dictionary):
    '''
    Auto increments a key from a dictionary.
    '''
    return max(dictionary.keys(), default=0)+1


def cleanToken(token):
    '''
    Normalize encoding variants in a token.
    Currently handles one case: "a" followed by a combining acute
    accent (U+0301) is replaced by the precomposed "á" (U+00E1).
    '''
    return token.replace(chr(97)+chr(769), chr(225))

raw_node_features = collections.defaultdict(lambda:collections.defaultdict(set))
raw_oslots = collections.defaultdict(lambda:collections.defaultdict(set))
slot = 0

this_sentence = 1 # for first iteration since only sentence ends are marked

for line in urmi.split('\n'):
    
    # mark book beginnings, their "code" and title
    if line.startswith('# '): # book code
        this_book = iterateKey(raw_oslots['book'])
        raw_node_features['book_code'][this_book] = line.split()[-1].strip()
        continue
    elif line.startswith('## '): # book title
        raw_node_features['book_title'][this_book] = line.split('#')[-1]
        continue
                
    # map slots to objects and features:
    for token in line.split():
        
        if re.match(r'.*\(\d*\)', token): # line start, e.g. "(1)"; raw string avoids invalid-escape warnings
            this_line = iterateKey(raw_oslots['line'])
            raw_node_features['line'][this_line] = token
            continue
            
        # everything up to this point is a valid slot
        # iterate slot up by 1
        slot += 1
            
        # record sentence boundaries
        if re.match(r'.*\.\|', token): # end of sentence, token contains ".|"
            raw_oslots['sentence'][this_sentence].add(slot)
            this_sentence = iterateKey(raw_oslots['sentence']) # get incremented, new sentence ID
        else: # beginning/within sentence
            raw_oslots['sentence'][this_sentence].add(slot)
            
        raw_node_features['trans'][slot] = cleanToken(token)
        raw_node_features['trailer'][slot] = ' '
        raw_oslots['book'][this_book].add(slot)
        raw_oslots['line'][this_line].add(slot)
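
To make the token handling in the loop easier to follow, here is a small stand-alone sketch of the same two tests: a token matching the line-marker pattern is skipped and does not become a slot, and a token containing `.|` closes the current sentence. The sample string is a hypothetical line in the style of the corpus; the real file's layout may differ.

```python
import re

# Hypothetical sample input; not taken verbatim from christian_urmi.txt.
sample = '(1) xa yuma +malla +Nasradin| biskalele min svavu.| (2) marele| halli xa dana.|'

LINE_MARK = re.compile(r'.*\(\d*\)')  # line marker such as "(1)"
SENT_END = re.compile(r'.*\.\|')      # token containing ".|"

sentences, current = [], []
for token in sample.split():
    if LINE_MARK.match(token):
        continue                      # line markers are not counted as slots
    current.append(token)
    if SENT_END.match(token):         # sentence-final token: close the sentence
        sentences.append(current)
        current = []

for i, sent in enumerate(sentences, 1):
    print(i, ' '.join(sent))
```

Note that only `.|` is treated as a sentence boundary here, so a token ending in `?|` stays inside the current sentence. Whether that is the desired behaviour is a design decision for the corpus.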

## Reindex Objects Above Slot Levels

In [3]:
otype2feature = {'book':{'book_code', 'book_title'},
                 'line':{'line'},
                 'sentence':set()} # sentences carry no features; set() keeps value types consistent

node_features = collections.defaultdict(lambda:collections.defaultdict())
node_features['trans'] = raw_node_features['trans'] # add slot features
node_features['trailer'] = raw_node_features['trailer']
for slot in node_features['trans']:
    node_features['otype'][slot] = 'word'
    
edge_features = collections.defaultdict(lambda:collections.defaultdict(set)) # oslots will go here

onode = max(raw_node_features['trans']) # max slot, incremented +1 in loop

for otype in raw_oslots.keys():
    for oID, slots in raw_oslots[otype].items():
        
        # make new object node number
        onode += 1
        node_features['otype'][onode] = otype
        
        # remap node features to node number
        for feat in otype2feature[otype]:
            node_features[feat][onode] = raw_node_features[feat][oID]
        edge_features['oslots'][onode] = raw_oslots[otype][oID]
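
The renumbering step can be seen more clearly on toy data. In this sketch (all values are made up for illustration), each object's local ID is replaced by a fresh node number that continues counting after the last slot. The extra `local2node` mapping is an assumption added here only to make the result easy to inspect.

```python
# Toy input: 4 slots and two object types, each with its own local IDs.
raw_oslots = {
    'book': {1: {1, 2, 3, 4}},
    'sentence': {1: {1, 2}, 2: {3, 4}},
}
max_slot = 4

otype, oslots, local2node = {}, {}, {}
node = max_slot  # incremented before each new object node is created
for obj_type, objects in raw_oslots.items():
    for local_id, slots in objects.items():
        node += 1
        otype[node] = obj_type
        oslots[node] = slots
        local2node[(obj_type, local_id)] = node

print(local2node)
```

Because dictionaries preserve insertion order, object types are numbered in the order in which they were first added to `raw_oslots`.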

In [4]:
node_features.keys()

dict_keys(['trans', 'trailer', 'otype', 'book_title', 'book_code', 'line'])

In [5]:
edge_features.keys()

dict_keys(['oslots'])

## Save to TF Format

In [6]:
otext = {
'sectionTypes': 'book,line',
'sectionFeatures':'book_code,line',
'fmt:text-orig-full':'{trans}{trailer}'
}

meta = {'':{'author': 'Geoffrey Khan and Cody Kingham'},
        'oslots':{'edgeValues':False, 'valueType':'int'},
        'otype':{'valueType':'str'},
        'book':{'valueType':'str'},
        'line':{'valueType':'str'},
        'trans':{'valueType':'str'},
        'book_code':{'valueType':'str'},
        'book_title':{'valueType':'str'},
        'trailer':{'valueType':'str'},
        'otext':otext}

TFs = Fabric(locations=['tf/'])

This is Text-Fabric 7.4.11
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

9 features found and 0 ignored


In [7]:
TFs.save(nodeFeatures=node_features, edgeFeatures=edge_features, metaData=meta)

  0.00s Exporting 6 node and 1 edge and 2 config features to tf/:
  0.00s VALIDATING oslots feature
  0.01s maxSlot=       2217
  0.01s maxNode=       2595
  0.01s OK: oslots is valid
   |     0.00s T book_code            to tf
   |     0.00s T book_title           to tf
   |     0.00s T line                 to tf
   |     0.00s T otype                to tf
   |     0.01s T trailer              to tf
   |     0.01s T trans                to tf
   |     0.00s T oslots               to tf
   |     0.00s M book                 to tf
   |     0.00s M otext                to tf
  0.04s Exported 6 node features and 1 edge features and 2 config features to tf/


True
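
The export log lists `book` as a metadata-only (`M`) entry alongside `otext`, which suggests that the `book` key in `meta` has no matching feature data. A small consistency check like the one below can catch such mismatches before saving. The function `check_meta` is a hypothetical helper, not part of Text-Fabric, and the toy dictionaries only mirror the feature names used in this notebook.

```python
def check_meta(node_features, edge_features, meta):
    """Return (features lacking metadata, metadata keys that are not features)."""
    features = set(node_features) | set(edge_features)
    # '' holds general metadata and 'otext' is a config section, not a feature.
    described = set(meta) - {'', 'otext'}
    return sorted(features - described), sorted(described - features)


# Toy stand-ins that mirror the names used in this notebook.
node_features = {name: {} for name in
                 ['otype', 'trans', 'trailer', 'line', 'book_code', 'book_title']}
edge_features = {'oslots': {}}
meta = {name: {'valueType': 'str'} for name in
        ['otype', 'trans', 'trailer', 'line', 'book', 'book_code', 'book_title']}
meta['oslots'] = {'valueType': 'int'}
meta[''] = {'author': '...'}
meta['otext'] = {}

missing, extra = check_meta(node_features, edge_features, meta)
print('features without metadata:', missing)
print('metadata without feature data:', extra)
```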

# NENA Corpus Experiments

In [8]:
from tf.fabric import Fabric
import collections

In [9]:
TF = Fabric(locations='tf/')

N = TF.load('''

book_code trans otype book_title

''')

N.makeAvailableIn(globals())
print()

This is Text-Fabric 7.4.11
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

9 features found and 0 ignored
  0.00s loading features ...
   |     0.00s T otype                from tf
   |     0.00s T book_code            from tf
   |     0.01s T trans                from tf
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.02s C __order__            from otype, oslots, __levels__
   |      |     0.00s C __rank__             from otype, __order__
   |      |     0.01s C __levUp__            from otype, oslots, __levels__, __rank__
   |      |     0.00s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.00s C __boundary__         from otype, oslots, __rank__
   |      |     0.00s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book_code, line
   |     0.00s T book_title           from tf
  0.09s All features loaded/computed - for details use loadLog()



In [10]:
for book in F.otype.s('book'):
    book_words = L.d(book, 'word')
    print(book, F.book_title.v(book))
    print(f'\t{len(book_words)} words')

2218  The Loan of a Cooking Pot (Yulia Davudi, +Hassar +Baba-čanɟa, N)
	180 words
2219  Agriculture and Village Life (Natan Khoshaba, Zumallan, N)
	2037 words


## Make Token Counts

In [11]:
tokens = collections.Counter()

for w in F.otype.s('word'):
    tokens[F.trans.v(w)] += 1
    
tokens.most_common(25)

[('+xárta', 30),
 ('ʾá', 30),
 ('k̭át', 22),
 ('ʾíta', 18),
 ('xá', 16),
 ('+rába', 15),
 ('ʾátxa', 14),
 ('cúllǝ', 11),
 ('+k̭usárta', 10),
 ('xína', 10),
 ('ʾìta|', 9),
 ('míyya', 9),
 ('ʾánnǝ', 9),
 ('dástə', 8),
 ('+bár', 8),
 ('cùllǝ|', 8),
 ('bí', 8),
 ('hál', 8),
 ('tré', 7),
 ('ʾíta', 7),
 ('là', 7),
 ('c-avíva', 7),
 ('ʾǝ́tvalan', 7),
 ('lè', 7),
 ('ʾína', 6)]
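
Two things stand out in these counts: some forms appear twice with visually identical spelling, and forms with a trailing `|` are counted separately from their bare counterparts. One possible cause of the first is mixed precomposed and combining accents. The sketch below, which is an assumption about how one might normalize tokens rather than what this notebook does, applies Unicode NFC normalization and strips trailing boundary marks before counting. The helper `normalize_token` and the sample tokens are hypothetical.

```python
import collections
import unicodedata


def normalize_token(token, strip_chars='|.,?'):
    """NFC-normalize a token and drop trailing boundary and punctuation marks."""
    return unicodedata.normalize('NFC', token).rstrip(strip_chars)


# Hypothetical samples: the same word with a precomposed accent (U+00E1),
# with a combining accent (a + U+0301), and with trailing marks.
samples = ['x\u00e1', 'xa\u0301', 'x\u00e1|', 'c\u00f9ll\u01dd|', 'c\u00f9ll\u01dd.|']

counts = collections.Counter(normalize_token(t) for t in samples)
print(counts.most_common())
```

This keeps accent distinctions such as grave versus acute intact, since those may be linguistically meaningful. Only the encoding and the trailing marks are unified.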

In [12]:
for sent in list(F.otype.s('sentence'))[:10]:
    print(sent, T.text(sent))

2266 xá yuma +málla +Nasràdən| bərrə̀xšələ| bəšk̭álələ +k̭usárta déna mən švàvu.| 
2267 màrǝlə| hálli xá dana +k̭usàrta| +báyyən bášlən ɟávo bušàla.| 
2268 +k̭usárta +ɟúrta lə̀tli.| 
2269 bəšk̭álolə màyolə +k̭usárta| bušála bašùlələ,| labùlolə,| yávolə mə̀drə| k̭à| švàva.| 
2270 ʾína tré +k̭usaryay sùrə| mattúyəl ɟàvo.| 
2271 švàva| màrǝlə| ʾáha tré +k̭usaryàtə| k̭àm muyyévət?| mə̀rrə| +k̭usártət dìyyux| də̀lla| tré xínə mə̀nno.| 
2272 yávəl k̭àtu| ʾávət basìma,| bitàyələ.| 
2273 ʾé-šabta xìta| +málla +Nasrádən bərrə́xšəl mə̀drə.| 
2274 màrələ| +maxlèta,| xa +k̭usárta buš +ɟùrta +byáyəvən.| 
2275 +málla +Nasràdən| +ʾáynu pə́ltəva +ʾal-xa +k̭usartət švàvə.| 


## Basic Search Capacity

In [15]:
find = list(S.search('''

sentence
    word trans=+xárta
    <: word

'''))

print(len(find))

30


In [17]:
for res in find:
    print(T.text(res[1]), T.text(res[2]))

+xárta  pā̀n 
+xárta  ʾǝ́tva 
+xárta  ci-+yasrìvalun.| 
+xárta  ɟəddàla| 
+xárta  +mač̭ràxvalǝ.| 
+xárta  cùllǝ| 
+xárta  +marč̭ìvalǝ| 
+xárta  púmmu 
+xárta  +marč̭ìvalun.| 
+xárta  zìla 
+xárta  ʾé 
+xárta  pardùvvǝ 
+xárta  maštàxla| 
+xárta  pummé 
+xárta  k̭át 
+xárta  +xazdàxvala| 
+xárta  bí 
+xárta  mǝn-dàha| 
+xárta  ʾánnǝ 
+xárta  ɟári 
+xárta  b-ràcxa,| 
+xárta  +ʾànvǝ 
+xárta  +xárta 
+xárta  nášǝ 
+xárta  ʾá 
+xárta  b-ptána 
+xárta  +ṱárpa 
+xárta  b-labláxla 
+xárta  +xàrta,| 
+xárta  b-šatxìvalun.| 


### Suffix Searching

The ending `un` could be a plural verb ending. Here is a query for words with that ending.
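
As a quick sanity check, the regular expression used for this search can first be tried on a few tokens in plain Python. This sketch assumes that the `~` operator in a search template applies the pattern as an unanchored regular-expression search. The sample tokens are simplified, hypothetical forms in the style of the corpus.

```python
import re

# "un" followed by ".", or at the very end of the token, or followed by ","
pattern = re.compile(r'un\.|un$|un,')

# Hypothetical sample tokens; "|" marks a prosodic boundary.
tokens = ['tilun', 'tuttun.|', 'garsivalun,|', '+xarta', 'culle|']

matches = [t for t in tokens if pattern.search(t)]
print(matches)
```

Because the first and third alternatives are not anchored to the end of the token, they would also match `un.` or `un,` in the middle of a longer token. A stricter variant would anchor all alternatives at the end.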

In [23]:
suffix = list(S.search(r'''

word trans~un\.|un$|un,

'''))

print(len(suffix), 'results\n')

for res in suffix:
    print(res[0], T.text(res[0]))

60 results

769 c-avǝ́dvalun 
1921 tílun 
1544 tùttun.| 
2187 túttun 
2190 túttun 
792 ʾax-šatxáxvalun 
1947 +rappívalun 
668 ci-yavvàvalun.| 
797 ci-+yasrìvalun.| 
1824 ci-+pašṱìvalun,| 
1827 ɟaršìvalun,| 
1317 túttun 
678 c-odívalun.| 
680 ʾǝ́tvalun.| 
815 mabrǝzzìvalun,| 
1967 ṱ-+axlìvalun.| 
691 c-odìvalun.| 
692 +mardǝxxívalun 
1588 maštàxvalun.| 
950 +pallìvalun.| 
1976 ʾǝ́tvalun.| 
1081 mayyáxvalun 
698 +palṱìvalun.| 
700 b-šatxìvalun.| 
958 +pallìvalun.| 
449 +rappívalun 
707 šaṱxìvalun.| 
1604 ci-+xalvìvalun.| 
1862 xǝ́šlun 
1865 muyyílun 
202 +ṱrǝ̀plun,| 
204 +ṱripàlun,| 
1484 mattáxlun 
1102 lablívalun 
1868 šk̭ǝ́llun 
1871 +zrílun.| 
977 +jammáxvalun 
850 ɟabìvalun,| 
1996 banìvalun.| 
852 šaṱxìvalun,| 
980 b-lablàxvalun.| 
854 +marč̭ìvalun.| 
1876 tìlun,| 
1879 zǝ̀dlun.| 
2001 ci-banívalun.| 
864 ci-malívalun 
1506 túttun 
1764 mattáxvalun 
1125 +daràxlun.| 
2023 ci-tanáxvalun.| 
1512 túttun 
617 b-+jammáxvalun 
2151 +dávun 
748 +ʾàvun,| 
621 ci-mayyàxvalun.| 
1397 tùttun.