# Generate Text-Fabric Resource from Source Texts

## Text preprocessing for Christian Urmi and Barwar texts


In order to work with the language data in the NENA corpus, it is important to separate the language text from other data, such as titles, authors/informants, and verse numbers.

Our version of the text comes as MS Word documents, which is probably also the version richest in language data. It contains not only the text itself but also meaningful formatting, e.g. word markers set in superscript and foreign (loan) words set in roman type (where the regular text is set in italic type).

To convert that information from a Word document to something that we can use in Python, we first convert the Word documents to HTML, using LibreOffice in headless mode. It is assumed that the Word files are in the subdirectory `texts`, where the converted `.html` files will also be saved.

    $ soffice --headless --convert-to html --outdir texts texts/*.doc

This produces HTML 4.0 documents in the same directory. Earlier attempts with XHTML, using wvWare/AbiWord or LibreOffice with the XHTML conversion filter, produced output that was more difficult to parse or that had lost certain characters in conversion. Although the conversion with LibreOffice takes much longer than with AbiWord, the resulting text seems more reliable.
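
The same conversion can be driven from Python, which is convenient when the notebook is re-run from scratch. This is a minimal sketch, not part of the original workflow: the helper names `build_convert_command` and `convert_docs` are hypothetical, and `--outdir` is passed explicitly so that the HTML files land next to the Word files regardless of the working directory.

```python
import pathlib
import shutil
import subprocess


def build_convert_command(doc_paths, outdir="texts", binary="soffice"):
    """Return the argument list for a headless LibreOffice HTML conversion."""
    return [binary, "--headless", "--convert-to", "html",
            "--outdir", str(outdir), *map(str, doc_paths)]


def convert_docs(source_dir="texts"):
    """Convert every .doc file in source_dir, if LibreOffice is installed."""
    docs = sorted(pathlib.Path(source_dir).glob("*.doc"))
    if not docs or shutil.which("soffice") is None:
        return None  # nothing to convert, or soffice is not on the PATH
    return subprocess.run(build_convert_command(docs, source_dir), check=True)
```

Calling `convert_docs()` runs the conversion and raises `subprocess.CalledProcessError` if LibreOffice exits with an error.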

The custom `nena_corpus` package contains the `Text` class, and several functions that assist in the conversion from HTML, of which we only need the functions `html_to_text()` and `parse_metadata()`.

The function `html_to_text()` is a generator function yielding `Text` objects, each containing one paragraph of text.

The function `parse_metadata()` extracts metadata from heading paragraphs (e.g. `title`, `text_id`, `informant`, `place`).

The `Text` class has an attribute `type` describing the type of paragraph (e.g., `'sectionheading'`, `'p'`, or `'footnote'`) and holds a list of tuples, each containing a piece of text and its text style. A text like `'<i>Normal, </i>foreign<i>, and normal</i>'` becomes `[('Normal, ', ''), ('foreign', 'foreign'), (', and normal', '')]`. Note the inversion: normal text in the source is set in italics, so italic text gets the empty default style and roman text is marked as foreign.

`Text` objects are iterable. New items can be appended with the `append(text, text_style)` method.
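
To make the italic/roman inversion concrete, here is a small sketch built on the standard library's `html.parser`. It is not how `nena_corpus` is implemented (that code is not shown here); the class `StyleRuns` and the function `style_runs` are hypothetical names, and the style label `'foreign'` is taken from the way the styles are used later in this notebook.

```python
from html.parser import HTMLParser


class StyleRuns(HTMLParser):
    """Collect (text, style) tuples, treating italic text as the default."""

    def __init__(self):
        super().__init__()
        self.italic_depth = 0  # > 0 while inside <i>...</i>
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.italic_depth += 1

    def handle_endtag(self, tag):
        if tag == "i" and self.italic_depth:
            self.italic_depth -= 1

    def handle_data(self, data):
        # Inversion: italic is the regular text (''), roman marks a loan word.
        style = "" if self.italic_depth else "foreign"
        self.runs.append((data, style))


def style_runs(html):
    """Parse an HTML fragment into a list of (text, style) tuples."""
    parser = StyleRuns()
    parser.feed(html)
    parser.close()
    return parser.runs


print(style_runs("<i>Normal, </i>foreign<i>, and normal</i>"))
```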

In [1]:
from nena_corpus import Text, html_to_text, parse_metadata

A small demonstration of the `Text` class:

In [2]:
p = Text(p_type='test', default_style='normal')

p.append('Dit is ')
p.append('een test', 'test')
p.append('.')

# str(p) returns concatenated string
print(p)
# repr(p) returns class name, p_type and str(p)
print(repr(p))
# list(p) returns the list of tuples
print(list(p))
# a list comprehension also works
print([e for e in p])

Dit is een test.
<Text 'test' 'Dit is een test.'>
[('Dit is ', 'normal'), ('een test', 'test'), ('.', 'normal')]
[('Dit is ', 'normal'), ('een test', 'test'), ('.', 'normal')]


## Importing the texts

First we import some more useful libraries and set the logging level to DEBUG, to make sure we see all logging messages.

In [14]:
import re
import collections
import pathlib, os
import logging
import unicodedata

from IPython.display import display, HTML
from tf.fabric import Fabric
import pandas as pd

logging.getLogger().setLevel(logging.DEBUG) # for terminal messages

Assuming that the subdirectory `texts` contains the HTML files generated earlier, we can collect the files per dialect by matching their file name patterns under `texts/`. At this point we just want to do language statistics and not look at the actual texts, so it is sufficient to import the paragraphs of all texts in no particular order.

In [4]:
files_barwar = pathlib.Path.cwd().glob('texts/bar text *.html') # get source texts
files_urmi_c = pathlib.Path.cwd().glob('texts/cu *.html')

We also prepare a dictionary with some characters that need to be replaced.

In [5]:
# Characters to be replaced
replace = {
    '\u2011': '\u002d',  # U+2011 NON-BREAKING HYPHEN -> U+002D HYPHEN-MINUS
    '\u01dd': '\u0259',  # U+01DD LATIN SMALL LETTER TURNED E -> U+0259 LATIN SMALL LETTER SCHWA
    '\uf1ea': '\u003d',  # U+F1EA Deprecated SIL character -> U+003D '=' EQUALS SIGN
    '\u2026': '...',  # U+2026 '…' HORIZONTAL ELLIPSIS -> three dots
    'J\u0335': '\u0248',  # 'J' + U+0335 COMBINING SHORT STROKE OVERLAY -> U+0248 'Ɉ' LATIN CAPITAL LETTER J WITH STROKE
    'J\u0336': '\u0248',  # 'J' + U+0336 COMBINING LONG STROKE OVERLAY -> U+0248 'Ɉ' LATIN CAPITAL LETTER J WITH STROKE
    '\u002d\u032d': '\u032d\u002d',  # Switch positions of Hyphen and Circumflex accent below
    '\u2011\u032d': '\u032d\u002d',  # Switch positions of Non-breaking hyphen and Circumflex accent below
}
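
Several keys in this table overlap (for example U+2011 on its own and U+2011 followed by U+032D), so the order in which replacements are tried can matter. How `html_to_text()` applies the table is not shown here; the sketch below is one possible approach, with a hypothetical helper `apply_replacements` and a shortened demo table, that tries longer keys first in a single pass.

```python
import re

# Shortened demo table; the keys and values are copied from the dictionary above.
replace_demo = {
    "\u2011": "\u002d",              # non-breaking hyphen -> hyphen-minus
    "\u2026": "...",                 # horizontal ellipsis -> three dots
    "\u2011\u032d": "\u032d\u002d",  # swap hyphen and circumflex below
}


def apply_replacements(text, table):
    """Replace every key of table in text, preferring the longest match."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A single pass means replaced output is never replaced a second time.
    return pattern.sub(lambda m: table[m.group(0)], text)


print(ascii(apply_replacements("a\u2011\u032db\u2026", replace_demo)))
```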

Now we loop over the HTML files and convert them to a Text-Fabric structure.

In [6]:
def combine_chars(text):
    """Yield letters combined with combining diacritics"""
    
    char = [] # compose string here: letter + diacritic
    
    for c in text:
        
        # add diacritic
        # indicated as 'Mn': non-spacing combining mark
        if unicodedata.category(c) == 'Mn':
            char.append(c)
            continue
        
        # yield the string; at this point will be: letter + (diacritic)
        if char:
            yield ''.join(char)
            
        char = [c] # save in list and get diacritic next if there is one
        
    if char:
        yield ''.join(char) # yield the final letter (+ diacritics), if any

# keep object counts
raw_features = collections.defaultdict(lambda:collections.defaultdict(set))
raw_oslots = collections.defaultdict(lambda:collections.defaultdict(set))

# initialize counters (will be increased to start from 1)
this_text = 0
this_paragraph = 0
this_line = 0
this_sentence = 0
this_subsentence = 0
this_word = 0
this_morpheme = 0
this_foreign = 0
this_prosa = 0

slot = 0 # i.e. chars

process_dialects = {'Barwar': files_barwar,
                    'Urmi_C': files_urmi_c}

text_ids = []

for dialect, files in process_dialects.items():
    
    # TODO At this point record book/publication/dialect?
    # E.g. SSLL_2016_Urmi_C, HOS_2008_Barwar?
    
    for file in files:
        
        logging.info(f'Processing file {file.name} ...')
        
        for p in html_to_text(file, replace=replace):
            # metadata:
            # - dialect
            # - file.name
            
            if p.type.startswith('gp-') and str(p).strip():
                # store metadata from headings:
                # - text_id
                # - title
                # - informant
                # - place
                # - version (if applicable -- only Urmi_C A35)
                if p.type.startswith('gp-sectionheading'):
                    metadata = {}
                for k, v in parse_metadata(p):
                    metadata[k] = v
            
            elif p.type == 'p':
                # regular paragraphs
                
                # format a text_id with version added (if there is one)
                version = metadata.get('version', '')
                version = f'.{version[-1]}' if version else ''
                text_id = metadata.get('text_id', '') + version
                
                # first check if we need to update metadata
                # informant and place are also added as features of text
                if (metadata
                    and (not raw_features['text_id']
                         or raw_features['text_id'][this_text] != text_id)):
                        
                        
                    text_ids.append(text_id)
                        
                    this_text += 1
                    raw_features['text_id'][this_text] = text_id
                    raw_features['title'][this_text] = metadata['title']
                    raw_features['informant'][this_text] = metadata['informant']
                    raw_features['place'][this_text] = metadata['place']
                    raw_features['dialect'][this_text] = dialect
                    raw_features['filename'][this_text] = file.name
                
                # increment paragraph
                this_paragraph += 1
                
                # start paragraph with an empty marker stack
                marker_stack = []
                
                # set end-of-unit markers to True at the beginning of paragraph,
                # so the units can be increased on encounter of first word character
                sentence_end = True
                subsentence_end = True
                word_end = True
                morpheme_end = True
                foreign_end = True
                prosa_end = True
                
                for text, text_style in p:
                    
                    if text_style == 'verse_no':
                        this_line += 1
                        raw_features['line'][this_line] = text.strip(' ()') # TODO int()?
                        metadata['verse_no'] = text.strip(' ()')  # TODO Remove from metadata dict?
                        continue
                        
                    elif text_style == 'fn_anchor':
                        # TODO handle footnotes in some way, discard for now
                        continue
                    
                    elif text_style == 'comment':
                        continue  # TODO handle comments
                    
                    elif text_style == 'marker':
                        if marker_stack and marker_stack[-1] == text:
                            marker_stack.pop()
                        else:
                            marker_stack.append(text)
                        continue
                    
                    elif text_style not in ('', 'foreign'):
                        logging.debug(f'Unhandled text_style: {repr(text_style)}, {repr(text)}')
                        continue
                    
                    elif text_style == 'foreign' and foreign_end:
                        foreign_end = False
                        this_foreign += 1
                        if marker_stack:
                            language = marker_stack[-1]
                        else:
                            language = ''
                        raw_features['language'][this_foreign] = language
                    
                    else: # text_style == '':
                        foreign_end = True
                    
                    if (text_style == '' and marker_stack
                        and any(c.isalpha() for c in text)
                        and not text.isalpha()):
                        # In a few cases the closing marker tag is missing or misplaced,
                        # so force closing the marker:
                        # Urmi_C A42 9: 'RzdànyəlaR' (p.154, r.28) 'zdàny' roman, 'əla' cursive
                        # Urmi_C A43 17: 'ʾe-Rbuk̭ḗṱ' (p. 174, r.14), no closing 'R'
                        # Urmi_C B2 16: 'Pʾafšɑ̄rī̀P' (p.250 r.17), initial 'ʾ' cursive
                        marker = marker_stack.pop()
                        logging.warning(f'Unfinished marker: {repr(marker)}, closed forcibly.')
                        # use .get() so a missing verse number cannot raise KeyError here
                        logging.debug(f'{dialect}, {metadata.get("text_id")}:{metadata.get("verse_no")}')
                        logging.debug(f'Text: {repr(text)}')
                    
                    # If we got this far, we have a text string,
                    # with either text_style '' or 'foreign'.
                    # We will iterate over them character by character.
                    for c in combine_chars(text):
                        
                        if c[0].isalpha() or c == '+':
                            
                            # Increment text units on start of new word
                            if morpheme_end:
                                this_morpheme += 1
                                morpheme_end = False
                            if word_end:
                                this_word += 1
                                word_end = False
                            if subsentence_end:
                                this_subsentence += 1
                                subsentence_end = False
                            if sentence_end:
                                this_sentence += 1
                                sentence_end = False
                            if prosa_end:
                                this_prosa += 1
                                prosa_end = False
                            
                            slot += 1
                            raw_features['utf8'][slot] = c
                            # initialize 'trailer' feature as empty string,
                            # so we can add characters with '+' operator later
                            raw_features['trailer'][slot] = ''
                            
                            raw_oslots['text'][this_text].add(slot)
                            raw_oslots['paragraph'][this_paragraph].add(slot)
                            raw_oslots['line'][this_line].add(slot)
                            raw_oslots['sentence'][this_sentence].add(slot)
                            raw_oslots['subsentence'][this_subsentence].add(slot)
                            raw_oslots['prosa'][this_prosa].add(slot)
                            if not word_end:
                                raw_oslots['word'][this_word].add(slot)
                            if not morpheme_end:
                                raw_oslots['morpheme'][this_morpheme].add(slot)
                            if not foreign_end:
                                raw_oslots['foreign'][this_foreign].add(slot)
                        
                        else:  # if c is anything but a letter or '+':
                            if slot == 0:
                                continue  # discard anything before first word character
                            if not morpheme_end:
                                morpheme_end = True
                            if c == '|':
                                prosa_end = True
                                c = '\u02c8'
                            if c not in ('-', '=') and not word_end:
                                word_end = True
                            if c == ',' and not subsentence_end:
                                subsentence_end = True
                            if c in ('.', '!', '?') and not sentence_end:
                                subsentence_end = True
                                sentence_end = True
                            
                            raw_features['trailer'][slot] += c
                
            else:
                logging.debug(f'Unhandled paragraph type: {repr(p.type)}.')
                logging.debug(f'Text: {repr(str(p))}.')

INFO:root:Processing file bar text a15-A17.html ...
INFO:root:Processing file bar text A45.html ...
INFO:root:Processing file bar text a28.html ...
INFO:root:Processing file bar text A49.html ...
INFO:root:Processing file bar text a24.html ...
DEBUG:root:Unhandled paragraph type: 'footer'.
DEBUG:root:Text: ' 7 '.
INFO:root:Processing file bar text A42-A44.html ...
INFO:root:Processing file bar text a25.html ...
INFO:root:Processing file bar text a48.html ...
DEBUG:root:Unhandled paragraph type: 'footer'.
DEBUG:root:Text: ' 1 '.
INFO:root:Processing file bar text a29.html ...
INFO:root:Processing file bar text A9-A13.html ...
INFO:root:Processing file bar text a46-A47.html ...
INFO:root:Processing file bar text A37-A40.html ...
INFO:root:Processing file bar text a18.html ...
INFO:root:Processing file bar text a1-A7.html ...
DEBUG:root:Unhandled paragraph type: 'sdfootnote1'.
DEBUG:root:Text: ' 1 The name Čuxo means ‘one who wears the woolen čuxa garment’. '.
INFO:root:Processing file ba
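
The boundary rules in the loop above are easier to follow on a single string. The sketch below is a simplified restatement, not the notebook's code: the names `graphemes` and `segment` are made up for illustration, and only word boundaries and trailers are modelled (no sentences, markers, or foreign spans). Letters and `+` become slots; every other character is appended to the trailer of the preceding slot, and only `-` and `=` leave the current word open.

```python
import unicodedata


def graphemes(text):
    """Yield each base character together with its combining marks (Mn)."""
    current = ""
    for ch in text:
        if unicodedata.category(ch) == "Mn" and current:
            current += ch
            continue
        if current:
            yield current
        current = ch
    if current:
        yield current


def segment(text):
    """Return words as lists of [grapheme, trailer] slots."""
    words = []
    word_end = True
    for g in graphemes(text):
        if g[0].isalpha() or g == "+":
            if word_end:  # first slot of a new word
                words.append([])
                word_end = False
            words[-1].append([g, ""])
        elif words:  # anything before the first word character is discarded
            if g not in ("-", "="):
                word_end = True
            # '|' is stored as U+02C8, as in the loop above
            words[-1][-1][1] += "\u02c8" if g == "|" else g
    return words


words = segment("xa-ra\u0300bb\u0259n, ti|")
print(["".join(g for g, _ in w) for w in words])       # word text without trailers
print("".join(g + t for w in words for g, t in w))     # full text with trailers
```

Because the hyphen lives in a trailer, the word text composed from slots alone loses it, while slot text plus trailer reproduces the original string.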

### Reindex Objects Above Slot Levels

So far every object has a preliminary number that is only unique within its own type. These objects now receive their final node numbers, counting upward from the maximum slot number plus one.

In [7]:
otype2feature = {
    'text': {'text_id', 'title', 'dialect', 'filename', 'informant', 'place'},
    'paragraph': set(),
    'line': {'line'},
    'sentence': set(),
    'subsentence': set(),
    # 'trailer' is stored per slot (char) and keyed by slot number,
    # so it must not be looked up here by word ID
    'word': set(),
    'morpheme': set(),
    'foreign': {'language'},
    'prosa': set(),
}

onode = max(raw_features['utf8']) # max slot, incremented +1 in loop
node_features = collections.defaultdict(lambda:collections.defaultdict())
edge_features = collections.defaultdict(lambda:collections.defaultdict(set)) # oslots will go here

# first add slot features
# object features must then be added with otype2feature
node_features['utf8'] = raw_features['utf8']
node_features['trailer'] = raw_features['trailer']

# add slot object types
for slot in node_features['utf8']:
    node_features['otype'][slot] = 'char'    
    
# for objects above slot level, 
# assign new node number and link to feature
for otype in raw_oslots.keys():
    for oID, slots in raw_oslots[otype].items():
        
        # make new object node number
        onode += 1
        node_features['otype'][onode] = otype
        
        # remap node features to node number
        for feat in otype2feature[otype]:
            node_features[feat][onode] = raw_features[feat][oID]
        edge_features['oslots'][onode] = raw_oslots[otype][oID]
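
A toy example shows the effect of this renumbering. The sketch below uses a hypothetical helper `renumber` and invented data with five slots; it follows the same scheme as the cell above but leaves out the feature remapping.

```python
def renumber(max_slot, raw_oslots):
    """Assign node numbers max_slot+1, max_slot+2, ... to non-slot objects."""
    otype_of, oslots = {}, {}
    node = max_slot
    for otype, objects in raw_oslots.items():
        for _old_id, slots in objects.items():
            node += 1
            otype_of[node] = otype
            oslots[node] = set(slots)
    return otype_of, oslots


# Invented data: 5 slots, two words and one sentence, each numbered from 1.
toy = {
    "word": {1: {1, 2}, 2: {3, 4, 5}},
    "sentence": {1: {1, 2, 3, 4, 5}},
}
otype_of, oslots = renumber(5, toy)
print(otype_of)
print(oslots)
```

Both words and the sentence started with the preliminary number 1; after renumbering they occupy nodes 6, 7, and 8, directly above the slots.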

The following node features have been recorded:

In [8]:
node_features.keys()

dict_keys(['utf8', 'trailer', 'otype', 'text_id', 'informant', 'filename', 'place', 'dialect', 'title', 'line', 'language'])

The following edge features have been recorded:

In [9]:
edge_features.keys()

dict_keys(['oslots'])

### Purge Old TF Data

In [16]:
for file in pathlib.Path.cwd().glob('tf/?*.tf'):
    os.remove(file)

### Save New TF Data

In [17]:
otext = {
    'sectionTypes': 'text,line',
    'sectionFeatures': 'text_id,line',
    'fmt:text-orig-full': '{utf8}{trailer}',
    }

mastermeta = {'author': 'Geoffrey Khan, Cody Kingham, and Hannes Vlaardingerbroek'}

meta = {'':mastermeta,
        'oslots':{'edgeValues':False, 'valueType':'int'},
        'otype':{'valueType':'str'},
        'text':{'valueType':'str'},
        'paragraph':{'valueType':'str'},
        'line':{'valueType':'str'},
        'word':{'valueType':'str'},
        'utf8':{'valueType':'str'},
        'text_id':{'valueType':'str'},
        'title':{'valueType':'str'},
        'dialect':{'valueType':'str'},
        'filename':{'valueType':'str'},
        'language':{'valueType':'str'},
        'trailer':{'valueType':'str'},
        'informant':{'valueType':'str'},
        'place':{'valueType':'str'},
        'otext':otext
       }

TFs = Fabric(locations=['tf/'])

TFs.save(nodeFeatures=node_features, edgeFeatures=edge_features, metaData=meta)

This is Text-Fabric 7.8.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

0 features found and 0 ignored
  0.01s Warp feature "otype" not found in
tf//
  0.01s Warp feature "oslots" not found in
tf//
  0.06s Warp feature "otext" not found. Working without Text-API

  0.00s Exporting 11 node and 1 edge and 4 config features to tf/:
  0.00s VALIDATING oslots feature
  0.10s VALIDATING oslots feature
  0.10s maxSlot=     551014
  0.10s maxNode=     846438
  0.16s OK: oslots is valid
   |     0.00s T dialect              to tf
   |     0.00s T filename             to tf
   |     0.00s T informant            to tf
   |     0.00s T language             to tf
   |     0.01s T line                 to tf
   |     0.23s T otype                to tf
   |     0.00s T place                to tf
   |     0.00s T text_id              to tf
   |     0.00s T title                to tf
   |     0.92s T trailer              to tf
   |     0.79s T utf8                 to tf
   |     

True

## Load New TF Resource

In [11]:
TF = Fabric(locations='tf/')

This is Text-Fabric 7.8.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

16 features found and 0 ignored


In [12]:
N = TF.load('''

text_id paragraph line word utf8 
otype title dialect language trailer

''')

N.makeAvailableIn(globals())
print()

  0.00s loading features ...
   |     0.34s T otype                from tf
   |     4.91s T oslots               from tf
   |     0.00s No section config in otext, the section part of the T-API cannot be used
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
   |     1.27s T utf8                 from tf
   |     0.88s T trailer              from tf
   |      |     0.24s C __levels__           from otype, oslots, otext
   |      |     7.06s C __order__            from otype, oslots, __levels__
   |      |     0.39s C __rank__             from otype, __order__
   |      |     8.89s C __levUp__            from otype, oslots, __rank__
   |      |     2.07s C __levDown__          from otype, __levUp__, __rank__
   |      |     2.84s C __boundary__         from otype, oslots, __rank__
   |     0.00s T text_id              from tf
   |     0.01s T line                 from tf
   |     0.00s T title                from tf
   |     0.00s T dialect        

## Enhancements and Extensions

Some features are easier to build once the TF resource exists. The following feature will be constructed:

* `utf8` will be extended to word objects to aid in word searches.

In [13]:
extend_features = collections.defaultdict(lambda: collections.defaultdict())

### Extend `utf8` Feature to Word Objects

In [53]:
# first re-create the feature on characters
for char in F.otype.s('char'):
    extend_features['utf8'][char] = F.utf8.v(char)
    
# add words; compose without trailer
for word in F.otype.s('word'):
    # join the characters of the word, leaving out their trailers
    extend_features['utf8'][word] = ''.join(
        F.utf8.v(char) for char in L.d(word, 'char')
    )

## Export Extended Features

In [54]:
meta = {'':mastermeta,
        'utf8': {'valueType':'str'},
       }

TFs = Fabric(locations=['tf/'])

TFs.save(nodeFeatures=extend_features, metaData=meta)

This is Text-Fabric 7.8.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

16 features found and 0 ignored
  0.00s Exporting 1 node and 0 edge and 0 config features to tf/:
   |     0.95s T utf8                 to tf
  0.96s Exported 1 node features and 0 edge features and 0 config features to tf/


True

## Re-Load Enhanced TF Resource

In [55]:
TF = Fabric(locations='tf/')

This is Text-Fabric 7.8.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

16 features found and 0 ignored


In [56]:
N = TF.load('''

text_id paragraph line word utf8 
otype title dialect language trailer

''')

N.makeAvailableIn(globals())
print()

  0.00s loading features ...
   |     0.00s No section config in otext, the section part of the T-API cannot be used
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
   |     1.53s T utf8                 from tf
  2.97s All features loaded/computed - for details use loadLog()



## Statistics

### Object Types and Counts

In [31]:
for otype in F.otype.all:
    print('{:20}{:>10}'.format(otype, len(list(F.otype.s(otype)))))

text                       126
paragraph                  465
line                      2543
sentence                 16784
subsentence              24541
prosa                    35967
word                     93762
foreign                   1102
morpheme                120134
char                    551014


In [32]:
print('books and their word counts: \n')
for text in F.otype.s('text'):
    text_words = L.d(text, 'word')
    text_morphemes = L.d(text, 'morpheme')
    print(text, F.title.v(text))
    print(f'\t{len(text_words)} words, {len(text_morphemes)} morphemes')

books and their word counts: 

551015 The Monk And The Angel
	680 words, 950 morphemes
551016 THE MONK WHO WANTED TO KNOW WHEN HE WOULD DIE
	368 words, 487 morphemes
551017 THE WISE YOUNG MAN
	1091 words, 1482 morphemes
551018 THE FOX AND THE STORK
	70 words, 102 morphemes
551019 THE TALE OF RUSTAM (1)
	1008 words, 1316 morphemes
551020 THE CROW AND THE CHEESE
	51 words, 71 morphemes
551021 THE TALE OF PARIZADA, WARDA AND NARGIS
	2016 words, 2698 morphemes
551022 THE FOX AND THE LION
	95 words, 124 morphemes
551023 SOUR GRAPES
	62 words, 82 morphemes
551024 THE CAT AND THE MICE
	99 words, 138 morphemes
551025 THE TALE OF FARXO AND SƏTTIYA
	2490 words, 3303 morphemes
551026 THE MAN WHO CRIED WOLF
	199 words, 250 morphemes
551027 THE TALE OF RUSTAM (2)
	1645 words, 2254 morphemes
551028 THE SCORPION AND THE SNAKE
	216 words, 298 morphemes
551029 I AM WORTH THE SAME AS A BLIND WOLF
	528 words, 713 morphemes
551030 DƏMDƏMA
	620 words, 849 morphemes
551031 THE KING WITH FORTY SONS
	2539 wor

In [33]:
text = F.otype.s('text')[0]

In [34]:
len(L.d(text, 'word'))

680

In [57]:
print(F.title.v(text))

for line in F.otype.s('line')[:10]:
    print(line, T.text(line))


The Monk And The Angel
551606 ʾìθwaˈ xa-ràbbən,ˈ tíwɛwa gu-xa-gəppìθa.ˈ θéle xa-náša swarìya,ˈ rakáwa.ˈ ṣléle rəš-xa-ʾɛ̀na.ˈ tìwle,ˈ xílle mə̀ndi,ˈ štéle mìya.ˈ ʾíθwale xákma zùze.ˈ qímle šqilìle.ˈ muttìleˈ rəš-d-ɛ-ʾɛ̀na.ˈ ʾàwwaˈ munšìle zúze díye.ˈ zìlle.ˈ ʾáwwa zílle b-ʾùrxa.ˈ
551607 θéle xá rakáwa xèna,ˈ swarìya.ˈ zílle rəš-ʾɛ̀na.ˈ qəm-xazèlaˈ ʾə̀mma dináre.ˈ šqilíle jal-jàldeˈ muttíle gu-jɛ̀beˈ ʾu-zìlle.ˈ ʾo-qamàyaˈ ʾámər ʾòhˈ zúzi munšìli.ˈ qɛ́mən dɛ̀ṛənˈ ʾázən šáqlən zùziˈ m-rəš-ʾɛ̀na.ˈ
551608 ha-t-ʾáθe ʾo-náša qamàyaˈ máṭe l-ʾɛ̀naˈ ʾáθe xa-náša sàwa.ˈ máṭe rəš-ʾɛ̀naˈ mattúla kàrteˈ ʾu-tíwle manyòxe.ˈ ʾo-qamáya ṱ-íle zúze mùnšyaˈ θéle ʾə́lle dìye.ˈ ʾàmərˈ mpáləṭla zùzi!,ˈ ʾə́mma dináre ʾána hon-mùnšəlla láxxa.ˈ lázəm yawə̀tla.ˈ yába lán-xəzya zùze,ˈ lá ʾáxxa-w tàmmaˈ ʾu-kízle b-ay-gòta.ˈ là.ˈ
551609 mə́re qaṭlə̀nnux.ˈ mə́re qṭùl!ˈ lìtli.ˈ zúze làn-xəzya.ˈ qìmleˈ qəm-qaṭə̀lle.ˈ qəm-qaṭə̀lle.ˈ ràbbənˈ yăðət-mà-yle?ˈ ràbbənˈ ʾáwwa ṱ-i-sàxəð l-ʾálahaˈ ʾu-ṱ-i-mṣàle-uˈ lé-y-ʾaxəl bə̀sr

In [65]:
for w in L.d(551615, 'word'):
    print(w, F.utf8.v(w))

631711 ʾuʾápʾawwa
631712 qəmparmìle
631713 ʾuzìlla
631714 blɛ̀le
631715 zìlla
631716 síqla
631717 xamáθa
631718 xèta
631719 mə́re
631720 ʾáyya
631721 mút
631722 ḥaqqúθa
631723 naḥaqqùθɛla
631724 yaʾaxòni
631725 hátxa
631726 măləpə́tli
631727 ḥaqqúθa
631728 ʾunaḥaqqùθa
631729 zílela
631730 guðamáθa
631731 xèta
631732 màra
631733 ʾímət
631734 sìqla
631735 gudɛ̀maθa
631736 wírra
631737 guxabɛ̀θa


### Testing Query Capability

In [67]:
w = F.utf8.v(631728)

find = list(S.search(f'''

word utf8={w}

'''))

len(find)

3

In [68]:
find

[(631728,), (631609,), (631571,)]
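
What this query asks for can be restated as a plain filter: all nodes of type `word` whose `utf8` value equals the given string. The sketch below expresses that over two small dictionaries with invented node numbers and values; `find_words` is a hypothetical helper, and this is not how Text-Fabric evaluates search templates internally.

```python
def find_words(otype, utf8, value):
    """Return word nodes whose utf8 value equals value, as 1-tuples."""
    return [(node,) for node, kind in otype.items()
            if kind == "word" and utf8.get(node) == value]


# Invented toy data: three words and one sentence node.
otype = {6: "word", 7: "word", 8: "word", 9: "sentence"}
utf8 = {6: "xeta", 7: "mara", 8: "xeta"}

print(find_words(otype, utf8, "xeta"))
```

Each result is a tuple with one node per line of the query template, which is why the search above returns 1-tuples.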