# Generate Text-Fabric Resource from Source Texts

## Text preprocessing for Christian Urmi and Barwar texts


In order to work with the language data in the NENA corpus, it is important to separate the language text from other data, such as titles, authors/informants, and verse numbers.

Our version of the text comes as MS Word documents. That is probably also the version richest in language data: it contains not only the text itself, but also meaningful formatting, e.g. word markers set in superscript, and foreign (loan) words set in roman type (where the regular text is set in italic type).

To get that information from a Word document into something we can use in Python, we first convert the Word documents to HTML, using LibreOffice in headless mode. It is assumed that the Word files are in the subdirectory `texts`, where the converted `.html` files will also be saved.

    $ soffice --headless --convert-to html --outdir texts texts/*.doc

This produces HTML 4.0 documents in the same directory. Earlier attempts with XHTML using wvWare/AbiWord, or LibreOffice using the XHTML conversion filter, produced output that was more difficult to parse or lacked certain characters that were lost in conversion. Although the conversion with LibreOffice takes a very long time compared with AbiWord, the resulting text seems more reliable.
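
The conversion can also be driven from Python, which keeps the whole pipeline in one place. The following is a minimal sketch, not part of the original workflow: the helper names `build_convert_command` and `convert_docs` are hypothetical, and it assumes the `soffice` executable is on the `PATH`. The `--outdir` option tells LibreOffice where to write the converted files.

```python
import pathlib
import shutil
import subprocess


def build_convert_command(doc_files, outdir, soffice='soffice'):
    """Return the argument list for a headless LibreOffice HTML conversion."""
    return [
        soffice,
        '--headless',
        '--convert-to', 'html',
        '--outdir', str(outdir),
        *(str(f) for f in doc_files),
    ]


def convert_docs(texts_dir='texts'):
    """Convert every .doc file in texts_dir to HTML, if LibreOffice is available."""
    texts_dir = pathlib.Path(texts_dir)
    doc_files = sorted(texts_dir.glob('*.doc'))
    if not doc_files or shutil.which('soffice') is None:
        return []  # nothing to convert, or LibreOffice is not on the PATH
    subprocess.run(build_convert_command(doc_files, texts_dir), check=True)
    # expected output paths, assuming LibreOffice keeps the file stem
    return [f.with_suffix('.html') for f in doc_files]
```

Building the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces, such as `bar text a24.doc`.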

The custom `nena_corpus` package contains the `Text` class, and several functions that assist in the conversion from HTML, of which we only need the functions `html_to_text()` and `parse_metadata()`.

The function `html_to_text()` is a generator function yielding `Text` objects, each containing one paragraph of text.

The function `parse_metadata()` extracts metadata from heading paragraphs (e.g. `title`, `text_id`, `informant`, `place`).

The `Text` class has an attribute `type` describing the type of paragraph (e.g., `'gp-sectionheading'`, `'p'`, or `'sdfootnote1'`), and holds a list of tuples, each containing a piece of text and its text style. A text like `'<i>Normal, </i>foreign<i>, and normal</i>'` becomes `[('Normal, ', ''), ('foreign', 'foreign'), (', and normal', '')]`. Note the inversion: regular text in the source is set in italics, so italic spans get the empty default style, while roman spans are marked as foreign.

`Text` objects are iterable. New items can be appended with the `append(text, text_style)` method.

In [1]:
from nena_corpus import Text, html_to_text, parse_metadata

A small demonstration of the `Text` class:

In [2]:
p = Text(p_type='test', default_style='normal')

p.append('Dit is ')
p.append('een test', 'test')
p.append('.')

# str(p) returns concatenated string
print(p)
# repr(p) returns class name, p_type and str(p)
print(repr(p))
# list(p) returns the list of tuples
print(list(p))
# a list comprehension also works
print([e for e in p])

Dit is een test.
<Text 'test' 'Dit is een test.'>
[('Dit is ', 'normal'), ('een test', 'test'), ('.', 'normal')]
[('Dit is ', 'normal'), ('een test', 'test'), ('.', 'normal')]
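
To make the behaviour shown above concrete, here is a minimal stand-in that reproduces it. This is only a sketch: `MiniText` is a hypothetical class written for illustration, and the real `nena_corpus.Text` may be implemented differently.

```python
class MiniText:
    """Hypothetical stand-in for nena_corpus.Text, for illustration only."""

    def __init__(self, p_type='p', default_style=''):
        self.type = p_type
        self.default_style = default_style
        self._items = []  # list of (text, text_style) tuples

    def append(self, text, text_style=None):
        """Add a span; fall back to the default style when none is given."""
        if text_style is None:
            text_style = self.default_style
        self._items.append((text, text_style))

    def __iter__(self):
        return iter(self._items)

    def __str__(self):
        return ''.join(text for text, _ in self._items)

    def __repr__(self):
        return f'<{type(self).__name__} {self.type!r} {str(self)!r}>'


p = MiniText(p_type='test', default_style='normal')
p.append('Dit is ')
p.append('een test', 'test')
p.append('.')
print(list(p))
```

Because the object is iterable, a loop such as `for text, text_style in p:` unpacks each span directly, which is how paragraphs are consumed later in this notebook.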


## Importing the texts

First we import some more useful libraries, and set logging level to DEBUG to make sure we see all logging messages.

In [3]:
import re
import collections
import pathlib
import logging
import unicodedata

from IPython.display import display, HTML
from tf.fabric import Fabric
import pandas as pd

logging.getLogger().setLevel(logging.DEBUG) # for terminal messages

Assuming that the subdirectory `texts` contains the HTML files generated earlier, we can import all files in the pattern `texts/*.html`. At this point we just want to do language statistics and not look at the actual texts, so it is sufficient to import the paragraphs of all texts in no particular order.

In [4]:
files_barwar = pathlib.Path.cwd().glob('texts/bar text *.html') # get source texts
files_urmi_c = pathlib.Path.cwd().glob('texts/cu *.html')

We also prepare a dictionary with some characters that need to be replaced.

In [5]:
# Characters to be replaced
replace = {
    '\u2011': '\u002d',  # U+2011 NON-BREAKING HYPHEN -> U+002D HYPHEN-MINUS
    '\u01dd': '\u0259',  # U+01DD LATIN SMALL LETTER TURNED E -> U+0259 LATIN SMALL LETTER SCHWA
    '\uf1ea': '\u003d',  # U+F1EA Deprecated SIL character -> U+003D '=' EQUALS SIGN
    '\u2026': '...',  # U+2026 '…' HORIZONTAL ELLIPSIS -> three dots
    'J\u0335': '\u0248',  # 'J' + U+0335 COMBINING SHORT STROKE OVERLAY -> U+0248 'Ɉ' LATIN CAPITAL LETTER J WITH STROKE
    'J\u0336': '\u0248',  # 'J' + U+0336 COMBINING LONG STROKE OVERLAY -> U+0248 'Ɉ' LATIN CAPITAL LETTER J WITH STROKE
    '\u002d\u032d': '\u032d\u002d',  # Switch positions of Hyphen and Circumflex accent below
    '\u2011\u032d': '\u032d\u002d',  # Switch positions of Non-breaking hyphen and Circumflex accent below
}
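
How `html_to_text()` applies this table is not shown here. As a hedged sketch of one way to do it, the hypothetical helper `apply_replacements` below replaces all keys in a single pass and tries longer keys first, so that a two-character sequence (hyphen plus combining circumflex below) is handled before the single-character hyphen rule can fire.

```python
import re


def apply_replacements(text, table):
    """Replace every key of `table` found in `text` in a single pass.

    Longer keys come first in the pattern, so multi-character sequences
    are matched as a whole before any single-character rule applies.
    """
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


# A subset of the table above, repeated so the example runs on its own
sample_table = {
    '\u2011': '\u002d',              # non-breaking hyphen -> hyphen-minus
    '\u2026': '...',                 # ellipsis -> three dots
    'J\u0335': '\u0248',             # J + short stroke overlay -> J with stroke
    '\u2011\u032d': '\u032d\u002d',  # swap hyphen and circumflex below
}

print(apply_replacements('xa\u2011m\u00e0lka\u2026', sample_table))
```

A single regular-expression pass also avoids cascading replacements, where the output of one rule is accidentally rewritten by a later rule.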

Now we loop over the HTML files and convert them to a Text-Fabric structure.

In [6]:
def combine_chars(text):
    """Yield letters combined with combining diacritics"""
    
    char = []
    
    for c in text:
        if unicodedata.category(c) == 'Mn':  # 'Mn': non-spacing combining mark
            char.append(c)
            continue
        
        if char:
            yield ''.join(char)
        char = [c]
        
    if char:  # nothing to yield for an empty string
        yield ''.join(char)

raw_node_features = collections.defaultdict(lambda:collections.defaultdict(set))
raw_oslots = collections.defaultdict(lambda:collections.defaultdict(set))

# initialize counters (will be increased to start from 1)
this_text = 0
this_paragraph = 0
this_line = 0
this_sentence = 0
this_subsentence = 0
this_word = 0
this_morpheme = 0
this_foreign = 0
this_prosa = 0

slot = 0

for dialect, files in (('Barwar', files_barwar), ('Urmi_C', files_urmi_c)):
    
    # TODO At this point record book/publication/dialect?
    # E.g. SSLL_2016_Urmi_C, HOS_2008_Barwar?
    
    for file in files:
        
        logging.info(f'Processing file {file.name} ...')
        
        for p in html_to_text(file, replace=replace):
            # metadata:
            # - dialect
            # - file.name
            
            if p.type.startswith('gp-') and str(p).strip():
                # store metadata from headings:
                # - text_id
                # - title
                # - informant
                # - place
                # - version (if applicable -- only Urmi_C A35)
                if p.type.startswith('gp-sectionheading'):
                    metadata = {}
                for k, v in parse_metadata(p):
                    metadata[k] = v
            #
            elif p.type == 'p':
                # regular paragraphs
                
                # first check if we need to update metadata
                # TODO for now we do not store informant, place, and version,
                # since those are not always features of a text, but of a section
                # of the text, and I do not know how to do that.
                # QUESTION -- do we need to add a layer 'subsection'?
                if (metadata
                    and (not raw_node_features['text_id']
                         or raw_node_features['text_id'][this_text] != metadata['text_id'])):
                    this_text += 1
                    raw_node_features['text_id'][this_text] = metadata['text_id']
                    raw_node_features['title'][this_text] = metadata['title']
                    raw_node_features['dialect'][this_text] = dialect
                    raw_node_features['filename'][this_text] = file.name
                
                # increment paragraph
                this_paragraph += 1
                
                # start paragraph with an empty marker stack
                marker_stack = []
                
                # set end-of-unit markers to True at the beginning of paragraph,
                # so the units can be increased on encounter of first word character
                sentence_end = True
                subsentence_end = True
                word_end = True
                morpheme_end = True
                foreign_end = True
                prosa_end = True
                
                for text, text_style in p:
                    
                    if text_style == 'verse_no':
                        this_line += 1
                        raw_node_features['line'][this_line] = text.strip(' ()') # TODO int()?
                        metadata['verse_no'] = text.strip(' ()')  # TODO Remove from metadata dict?
                        continue
                        
                    elif text_style == 'fn_anchor':
                        # TODO handle footnotes in some way, discard for now
                        continue
                    
                    elif text_style == 'comment':
                        continue  # TODO handle comments
                    
                    elif text_style == 'marker':
                        if marker_stack and marker_stack[-1] == text:
                            marker_stack.pop()
                        else:
                            marker_stack.append(text)
                        continue
                    
                    elif text_style not in ('', 'foreign'):
                        logging.debug(f'Unhandled text_style: {repr(text_style)}, {repr(text)}')
                        continue
                    
                    elif text_style == 'foreign' and foreign_end:
                        # start of a new foreign (loan word) unit
                        foreign_end = False
                        this_foreign += 1
                        language = marker_stack[-1] if marker_stack else ''
                        raw_node_features['language'][this_foreign] = language
                    
                    else:
                        # text_style == '', or a 'foreign' span that directly
                        # follows an unfinished foreign unit: in both cases
                        # the current foreign unit is closed here
                        foreign_end = True
                    
                    if (text_style == '' and marker_stack
                        and any(c.isalpha() for c in text)
                        and not text.isalpha()):
                        # In a few cases the closing marker is missing or misplaced,
                        # so force closing the marker:
                        # Urmi_C A42 9: 'RzdànyəlaR' (p.154, r.28) 'zdàny' roman, 'əla' cursive
                        # Urmi_C A43 17: 'ʾe-Rbuk̭ḗṱ' (p. 174, r.14), no closing 'R'
                        # Urmi_C B2 16: 'Pʾafšɑ̄rī̀P' (p.250 r.17), initial 'ʾ' cursive
                        marker = marker_stack.pop()
                        logging.warning(f'Unfinished marker: {repr(marker)}, closed forcibly.')
                        # .get() avoids a KeyError when no verse number has been seen yet
                        logging.debug(f'{dialect}, {metadata["text_id"]}:{metadata.get("verse_no")}')
                        logging.debug(f'Text: {repr(text)}')
                    
                    # If we got this far, we have a text string,
                    # with either text_style '' or 'foreign'.
                    # We will iterate over them character by character.
                    for c in combine_chars(text):
                        
                        if c[0].isalpha() or c == '+':
                            
                            # Increment text units on start of new word
                            if morpheme_end:
                                this_morpheme += 1
                                morpheme_end = False
                            if word_end:
                                this_word += 1
                                word_end = False
                            if subsentence_end:
                                this_subsentence += 1
                                subsentence_end = False
                            if sentence_end:
                                this_sentence += 1
                                sentence_end = False
                            if prosa_end:
                                this_prosa += 1
                                prosa_end = False
                            
                            slot += 1
                            raw_node_features['char'][slot] = c
                            # initialize 'trailer' feature as empty string,
                            # so we can add characters with '+' operator later
                            raw_node_features['trailer'][slot] = ''
                            
                            raw_oslots['text'][this_text].add(slot)
                            raw_oslots['paragraph'][this_paragraph].add(slot)
                            raw_oslots['line'][this_line].add(slot)
                            raw_oslots['sentence'][this_sentence].add(slot)
                            raw_oslots['subsentence'][this_subsentence].add(slot)
                            raw_oslots['prosa'][this_prosa].add(slot)
                            if not word_end:
                                raw_oslots['word'][this_word].add(slot)
                            if not morpheme_end:
                                raw_oslots['morpheme'][this_morpheme].add(slot)
                            if not foreign_end:
                                raw_oslots['foreign'][this_foreign].add(slot)
                        
                        else:  # if c is anything but a letter or '+':
                            if slot == 0:
                                continue  # discard anything before first word character
                            if not morpheme_end:
                                morpheme_end = True
                            if c == '|':
                                prosa_end = True
                                c = '\u02c8'
                            if c not in ('-', '=') and not word_end:
                                word_end = True
                            if c == ',' and not subsentence_end:
                                subsentence_end = True
                            if c in ('.', '!', '?') and not sentence_end:
                                subsentence_end = True
                                sentence_end = True
                            
                            raw_node_features['trailer'][slot] += c
                
            else:
                logging.debug(f'Unhandled paragraph type: {repr(p.type)}.')
                logging.debug(f'Text: {repr(str(p))}.')

INFO:root:Processing file bar text a1-A7.html ...
DEBUG:root:Unhandled paragraph type: 'sdfootnote1'.
DEBUG:root:Text: ' 1 The name Čuxo means ‘one who wears the woolen čuxa garment’. '.
INFO:root:Processing file bar text A14.html ...
INFO:root:Processing file bar text a15-A17.html ...
INFO:root:Processing file bar text a18.html ...
INFO:root:Processing file bar text a19-A23.html ...
INFO:root:Processing file bar text a24.html ...
DEBUG:root:Unhandled paragraph type: 'footer'.
DEBUG:root:Text: ' 7 '.
INFO:root:Processing file bar text a25.html ...
INFO:root:Processing file bar text a26.html ...
INFO:root:Processing file bar text a27.html ...
INFO:root:Processing file bar text a28.html ...
INFO:root:Processing file bar text a29.html ...
INFO:root:Processing file bar text a30.html ...
INFO:root:Processing file bar text a31-A33.html ...
INFO:root:Processing file bar text a34.html ...
INFO:root:Processing file bar text a35.html ...
INFO:root:Processing file bar text a36.html ...
...
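
The character loop above tracks many unit types at once, which makes it hard to follow. As a simplified sketch of the same idea, limited to slots, trailers, and words, the hypothetical functions `graphemes` and `segment` below show how letters become slots, how everything else is appended to the trailer of the preceding slot, and why a hyphen or `=` does not end a word. This is an illustration under those simplifications, not a replacement for the cell above.

```python
import unicodedata


def graphemes(text):
    """Yield each base character together with its combining marks ('Mn')."""
    current = ''
    for ch in text:
        if unicodedata.category(ch) == 'Mn' and current:
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


def segment(text):
    """Split text into slots with trailers and group the slots into words."""
    chars, trailers, words = [], [], []
    word_end = True
    for g in graphemes(text):
        if g[0].isalpha() or g == '+':
            if word_end:                  # first letter of a new word
                words.append([])
                word_end = False
            chars.append(g)
            trailers.append('')
            words[-1].append(len(chars))  # slots are numbered from 1
        elif chars:                       # ignore anything before the first letter
            if g not in ('-', '='):       # hyphen and '=' do not end a word
                word_end = True
            trailers[-1] += g             # non-letters go into the trailer
    return chars, trailers, words


chars, trailers, words = segment('xa-ma\u0300lka. \u02bea\u0301wwa')
for slot, (c, t) in enumerate(zip(chars, trailers), start=1):
    print(slot, repr(c + t))
print(words)
```

In this toy input the hyphen ends up in the trailer of the second slot while the word continues, and the full stop and space after the first word land in the trailer of its last slot.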

## Reindex Objects Above Slot Levels

In [7]:
otype2feature = {
    'text': {'text_id', 'title', 'dialect', 'filename'},
    'paragraph': set(),
    'line': {'line'},
    'sentence': set(),
    'subsentence': set(),
    # 'trailer' is keyed by slot number, not by word number,
    # so it is a slot feature and must not be remapped here
    'word': set(),
    'morpheme': set(),
    'foreign': {'language'},
    'prosa': set(),
}

node_features = collections.defaultdict(dict)

node_features['char'] = raw_node_features['char'] # add slot features
node_features['trailer'] = raw_node_features['trailer']

In [8]:
for slot in node_features['char']:
    node_features['otype'][slot] = 'char'    

In [9]:
edge_features = collections.defaultdict(lambda:collections.defaultdict(set)) # oslots will go here

onode = max(raw_node_features['char'])  # highest slot number; object nodes continue from here

for otype, objects in raw_oslots.items():
    for oID, slots in objects.items():
        
        # make new object node number
        onode += 1
        node_features['otype'][onode] = otype
        
        # remap node features to node number
        for feat in otype2feature[otype]:
            node_features[feat][onode] = raw_node_features[feat][oID]
        edge_features['oslots'][onode] = slots
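
The renumbering step is easier to see on toy data. The sketch below uses a hypothetical helper, `renumber_objects`, that mirrors the loop above on a five-slot example: every object above slot level gets a node number that continues after the highest slot number.

```python
def renumber_objects(raw_oslots, max_slot):
    """Give every object above slot level a node number greater than max_slot.

    raw_oslots maps an object type to {local id: set of slots}. Returns
    (otype, oslots, node_of): the first two are keyed by the new node
    number, the third maps (object type, local id) to that number.
    """
    otype, oslots, node_of = {}, {}, {}
    node = max_slot
    for obj_type, objects in raw_oslots.items():
        for local_id, slots in objects.items():
            node += 1
            otype[node] = obj_type
            oslots[node] = slots
            node_of[(obj_type, local_id)] = node
    return otype, oslots, node_of


# Toy data: five slots, two words and one sentence
toy_oslots = {
    'word': {1: {1, 2}, 2: {3, 4, 5}},
    'sentence': {1: {1, 2, 3, 4, 5}},
}
otype, oslots, node_of = renumber_objects(toy_oslots, max_slot=5)
print(node_of)
```

The `node_of` mapping is what makes it possible to carry features over from the per-type counters (word 1, word 2, ...) to the global node numbers.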

In [10]:
node_features.keys()

dict_keys(['char', 'trailer', 'otype', 'filename', 'text_id', 'title', 'dialect', 'line', 'language'])

In [11]:
edge_features.keys()

dict_keys(['oslots'])

In [12]:
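otext = {
    # NOTE: four section types are listed against two section features.
    # On load, Text-Fabric reports 'No section config in otext' (see the
    # loading output further on), probably because of this mismatch.
    'sectionTypes': 'text,paragraph,line,sentence',
    'sectionFeatures': 'text_id,line',
    'fmt:text-orig-full': '{char}{trailer}',
    }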

meta = {'':{'author': 'Geoffrey Khan, Cody Kingham, and Hannes Vlaardingerbroek'},
        'oslots':{'edgeValues':False, 'valueType':'int'},
        'otype':{'valueType':'str'},
        'text':{'valueType':'str'},
        'paragraph':{'valueType':'str'},
        'line':{'valueType':'str'},
        'word':{'valueType':'str'},
        'char':{'valueType':'str'},
        'text_id':{'valueType':'str'},
        'title':{'valueType':'str'},
        'dialect':{'valueType':'str'},
        'filename':{'valueType':'str'},
        'language':{'valueType':'str'},
        'trailer':{'valueType':'str'},
        'otext':otext
       }

In [13]:
TFs = Fabric(locations=['tf/'])

This is Text-Fabric 7.8.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

14 features found and 0 ignored


In [14]:
TFs.save(nodeFeatures=node_features, edgeFeatures=edge_features, metaData=meta)

  0.00s Exporting 9 node and 1 edge and 4 config features to tf/:
  0.01s VALIDATING oslots feature
  0.18s VALIDATING oslots feature
  0.19s maxSlot=     551014
  0.19s maxNode=     846437
  0.27s OK: oslots is valid
   |     1.31s T char                 to tf
   |     0.00s T dialect              to tf
   |     0.00s T filename             to tf
   |     0.01s T language             to tf
   |     0.02s T line                 to tf
   |     0.45s T otype                to tf
   |     0.00s T text_id              to tf
   |     0.00s T title                to tf
   |     1.52s T trailer              to tf
   |     2.01s T oslots               to tf
   |     0.01s M otext                to tf
   |     0.00s M paragraph            to tf
   |     0.00s M text                 to tf
   |     0.01s M word                 to tf
  5.65s Exported 9 node features and 1 edge features and 4 config features to tf/


True

## Load New TF Resource

In [15]:
TF = Fabric(locations='tf/')

This is Text-Fabric 7.8.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

14 features found and 0 ignored


In [16]:
N = TF.load('''

text_id paragraph line word char otype title
dialect language trailer

''')

N.makeAvailableIn(globals())
print()

  0.00s loading features ...
   |     0.47s T otype                from tf
   |     7.95s T oslots               from tf
   |     0.00s No section config in otext, the section part of the T-API cannot be used
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
   |     2.33s T char                 from tf
   |     1.66s T trailer              from tf
   |      |     0.43s C __levels__           from otype, oslots, otext
   |      |       14s C __order__            from otype, oslots, __levels__
   |      |     0.75s C __rank__             from otype, __order__
   |      |       20s C __levUp__            from otype, oslots, __rank__
   |      |     3.97s C __levDown__          from otype, __levUp__, __rank__
   |      |     5.45s C __boundary__         from otype, oslots, __rank__
   |     0.01s T text_id              from tf
   |     0.03s T line                 from tf
   |     0.01s T title                from tf
   |     0.02s T dialect        

## Statistics

### Object Types and Counts

In [17]:
for otype in F.otype.all:
    print('{:20}{:>10}'.format(otype, len(list(F.otype.s(otype)))))

text                       125
paragraph                  465
line                      2543
sentence                 16784
subsentence              24541
prosa                    35967
word                     93762
foreign                   1102
morpheme                120134
char                    551014


### Slot Gaps Between Words

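Check for gaps between word slots. In the output below the slot numbers run consecutively from one word to the next: spaces and punctuation are stored in the `trailer` feature of the preceding slot, not in slots of their own.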

In [18]:
for word in L.d(list(F.otype.s('sentence'))[0], 'word'):
    print(f'{T.text(word)}')
    for slot in L.d(word, 'char'):
        print(f'\t{slot} {T.text(slot)}')


ʾána 
	1 ʾ
	2 á
	3 n
	4 a 
ʾíwən 
	5 ʾ
	6 í
	7 w
	8 ə
	9 n 
Yúwəl 
	10 Y
	11 ú
	12 w
	13 ə
	14 l 
Yuḥànnaˈ 
	15 Y
	16 u
	17 ḥ
	18 à
	19 n
	20 n
	21 aˈ 
ʾÌsḥaqˈ 
	22 ʾ
	23 Ì
	24 s
	25 ḥ
	26 a
	27 qˈ 
t-máθət 
	28 t-
	29 m
	30 á
	31 θ
	32 ə
	33 t 
Dùre,ˈ 
	34 D
	35 ù
	36 r
	37 e,ˈ 
t-Bɛ̀rwər.ˈ 
	38 t-
	39 B
	40 ɛ̀
	41 r
	42 w
	43 ə
	44 r.ˈ 


### Look at Skipped Slots

In [19]:
for c in range(9, 11):
    print(c, T.text(c))

9 n 
10 Y
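
The trailing space after slot 9 comes from the `trailer` feature: the text format defined earlier is `{char}{trailer}`, so rendering a slot concatenates the two. The sketch below imitates that with plain dictionaries. The helper `render` is hypothetical and is not the Text-Fabric API; the values are copied from the first two words shown above.

```python
def render(slots, char, trailer):
    """Concatenate char and trailer for each slot, like '{char}{trailer}'."""
    return ''.join(char[s] + trailer.get(s, '') for s in slots)


# First two words of the corpus, as listed in the output above
char = {1: 'ʾ', 2: 'á', 3: 'n', 4: 'a', 5: 'ʾ', 6: 'í', 7: 'w', 8: 'ə', 9: 'n'}
trailer = {4: ' ', 9: ' '}  # slots without an entry have an empty trailer

print(repr(render(range(1, 5), char, trailer)))
print(repr(render(range(1, 10), char, trailer)))
print(repr(render([9], char, trailer)))
```

Because the separators travel with the preceding slot, the original text can be rebuilt from any range of slots without a separate layer for spaces or punctuation.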


In [20]:
hex(ord('|'))

'0x7c'

In [21]:
print('books and their word counts: \n')
for text in F.otype.s('text'):
    text_words = L.d(text, 'word')
    text_morphemes = L.d(text, 'morpheme')
    print(text, F.title.v(text))
    print(f'\t{len(text_words)} words, {len(text_morphemes)} morphemes')

books and their word counts: 

551015 The Wise Snake
	779 words, 1020 morphemes
551016 The Priest and the Mullah
	317 words, 435 morphemes
551017 The Selfish Neighbour
	146 words, 210 morphemes
551018 A tale of a prince and a princess
	1763 words, 2437 morphemes
551019 The Cooking Pot
	254 words, 327 morphemes
551020 A Hundred Gold Coins
	343 words, 483 morphemes
551021 A Man Called Čuxo
	789 words, 1041 morphemes
551022 TALES FROM THE 1001 NIGHTS
	3018 words, 4195 morphemes
551023 The Monk And The Angel
	680 words, 950 morphemes
551024 THE MONK WHO WANTED TO KNOW WHEN HE WOULD DIE
	368 words, 487 morphemes
551025 THE WISE YOUNG MAN
	1091 words, 1482 morphemes
551026 BABY LELIΘA
	845 words, 1148 morphemes
551027 THE LELIΘA FROM Č̭ĀL
	252 words, 321 morphemes
551028 THE BEAR AND THE FOX
	363 words, 506 morphemes
551029 THE DAUGHTER OF THE KING
	1264 words, 1716 morphemes
551030 THE SALE OF AN OX
	1294 words, 1711 morphemes
551031 THE MAN WHO WANTED TO WORK
	1066 words, 1461 morphemes

In [22]:
text = F.otype.s('text')[0]

In [23]:
len(L.d(text, 'word'))

779

In [24]:
print(F.title.v(text))

for line in F.otype.s('line')[:10]:  # first ten line (verse) nodes
    print(line, T.text(line))


The Wise Snake
551605 ʾána ʾíwən Yúwəl Yuḥànnaˈ ʾÌsḥaqˈ t-máθət Dùre,ˈ t-Bɛ̀rwər.ˈ ʾáyya hóla θàya,ˈ hóla màra:ˈ
551606 ʾíθwa xa-màlka.ˈ ʾáwwa málka xzéle xa-xə̀lma.ˈ ʾu-qédamta mə̀reˈ kúl náša t-yắðe mòdin xə́zya b-xə́lmiˈ bəd-yawə̀nneˈ ʾə̀mma dáwe.ˈ yáʿni kút yắðe mòdile mhumzə́ma ʾáwwa málka,ˈ mòdile xə́zya b-xə́lme díyeˈ b-yawə́lle ʾə̀mma dáwe.ˈ
551607 šúryela náše xáša kəs-màlka.ˈ ʾáwwa mára ʾàtxət xə́zya-wˈ ʾáwwa ʾàtxət xə́zya-wˈ ʾáwwa ʾàtxət xə́zya.ˈ
551608 xá-naša ʾámər ṭla-báxte dìyeˈ qɛ́mən ʾàzənˈ ʾap-àna mjarbə́nnaˈ ḥàð̣ð̣ díyi.ˈ xázəx qəsmə̀ttila,ˈ bálki qàrmən.ˈ ʾáwwa qímɛle zìlɛle,ˈ zìla,ˈ zìla,ˈ b-ʾúrxa tfíqɛle xá-xuwwe bìye.ˈ
551609 ʾo-xúwwe mə̀reˈ hà-našaˈ lɛ̀kət zála?ˈ mə́re b-álaha hon-zála kəs-màlka.ˈ málka hóle xə́zya xa-xə̀lma.ˈ màraˈ kút-yăðe mòdile xə́zya b-xə́lme w-amə̀rreˈ bəd-šáqəl ʾálpa dàwe.ˈ ʾɛ́-ga ʾáp-ana bắyən ʾàzən,ˈ ṱ-amrə́nne xa-xàbra.ˈ bàlkiˈ képa qítl