# Text preprocessing for Christian Urmi and Barwar texts


In order to work with the language data in the NENA corpus, it is important to separate the language text from other data, such as titles, authors/informants, and verse numbers.

Our version of the text comes as MS Word document files, which is probably also the version that is richest in language data. It contains not only the text itself, but also meaningful formatting, e.g. word markers set in superscript, and loan words set in roman type (where the regular text is set in italic type).

To convert that information from a Word document to something that we can use in Python, we first convert the Word documents to HTML, using LibreOffice in headless mode. It is assumed that the Word files are in the subdirectory `texts`, where the converted `.html` files will also be saved.

    $ soffice --headless --convert-to html --outdir texts texts/*.doc

This produces HTML 4.0 documents in the same directory. Earlier attempts with XHTML using wvWare/AbiWord, or LibreOffice using the XHTML conversion filter, produced output that was more difficult to parse or lacked certain characters that were lost in conversion. Although the conversion with LibreOffice takes a very long time compared with AbiWord, the resulting text seems more reliable.
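
The same conversion could also be started from Python. The following is only a sketch: the helper names `build_convert_command()` and `convert_all()` are made up for illustration, and it assumes that the `soffice` executable is on the `PATH`. The `--outdir` option is passed explicitly, because without it LibreOffice writes the converted files to the current working directory.

```python
import pathlib
import shutil
import subprocess

def build_convert_command(doc_files, outdir='texts', executable='soffice'):
    '''Returns the argument list for a headless LibreOffice HTML conversion.'''
    return [executable, '--headless', '--convert-to', 'html',
            '--outdir', str(outdir), *map(str, doc_files)]

def convert_all(directory='texts'):
    '''Converts all .doc files in directory; returns None if nothing can be done.'''
    doc_files = sorted(pathlib.Path(directory).glob('*.doc'))
    if not doc_files or shutil.which('soffice') is None:
        return None
    return subprocess.run(build_convert_command(doc_files, directory), check=True)

# Only show the command here; call convert_all() to actually run it.
print(build_convert_command(['texts/example.doc']))
```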

The custom `nena_corpus` package contains the `Text` class, and several functions that assist in the conversion from HTML, of which we only need the function `html_to_text()`.

The function `html_to_text()` is a generator function yielding `Text` objects, each containing one paragraph of text.

The `Text` class contains a string `p_type` describing the type of paragraph (e.g., `'sectionheading'`, `'p'`, or `'footnote'`), and a list of tuples, containing the text and text style. A text like `'Normal, <i>cursive,</i> and normal'` becomes `[('Normal, ', ''), ('cursive,', 'italic'), (' and normal', '')]`. `Text` objects are iterable. New items can be appended with the `append(text, text_style)` method.

In [1]:
from nena_corpus import Text, html_to_text

A small demonstration of the `Text` class:

In [2]:
p = Text(p_type='test', default_style='normal')

p.append('Dit is ')
p.append('een test', 'test')
p.append('.')

# str(p) returns concatenated string
print(p)
# repr(p) returns class name, p_type and str(p)
print(repr(p))
# list(p) returns the list of tuples
print(list(p))
# a list comprehension also works
print([e for e in p])

Dit is een test.
<Text 'test' 'Dit is een test.'>
[('Dit is ', 'normal'), ('een test', 'test'), ('.', 'normal')]
[('Dit is ', 'normal'), ('een test', 'test'), ('.', 'normal')]
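
To make the data structure concrete, here is a minimal stand-in that reproduces the behaviour demonstrated above. It is a hypothetical sketch named `SketchText`, not the actual `nena_corpus` implementation, whose internals may differ.

```python
class SketchText:
    '''Hypothetical stand-in for Text: a paragraph type plus (text, style) tuples.'''

    def __init__(self, p_type='p', default_style=''):
        self.p_type = p_type
        self.default_style = default_style
        self._items = []

    def append(self, text, text_style=None):
        # fall back to the default style if none is given
        style = self.default_style if text_style is None else text_style
        self._items.append((text, style))

    def __iter__(self):
        return iter(self._items)

    def __str__(self):
        # concatenate the text parts, ignoring the styles
        return ''.join(text for text, _ in self._items)

    def __repr__(self):
        return f'<{type(self).__name__} {self.p_type!r} {str(self)!r}>'

s = SketchText(p_type='test', default_style='normal')
s.append('Dit is ')
s.append('een test', 'test')
s.append('.')
print(s)
print(list(s))
```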


In the NENA text corpus, word markers are used to indicate loan words and other text attributes. These word markers are set in superscript type. The `html_to_text()` function can recognize them, provided that it knows which ones to look for. A list (or other iterable collection) of markers can be provided in the `markers` keyword argument.

Since in HTML all consecutive whitespace inside a block element is treated like a space character, `html_to_text()` also converts whitespace to spaces. It can also replace other characters or strings, if they are provided as a dictionary in the `replace` keyword argument. Replacing certain characters, such as visually similar but actually different characters (e.g. `U+01DD 'ǝ' LATIN SMALL LETTER TURNED E` and `U+0259 'ə' LATIN SMALL LETTER SCHWA`), is important later for search and comparison.

In [3]:
# Word markers
markers_around = {
    'Arm': 'Arm(enian?)',
    'Az': 'Az(eri?)',
    'E': 'English',
    'F': 'French',
    'Ge': 'Ge(rman?)',
    'P': 'P(ersian?)',
    'R': 'R(ussian?)',
}

markers_before = {
    '+': 'stress?',
}

markers_after = {
    '|': 'end stress?',
}

markers = dict(markers_around.items() | markers_before.items() | markers_after.items())

# Characters to be replaced
replace = {
    # U+2011 'NON-BREAKING HYPHEN' has same function as normal hyphen.
    # Replaced by regular U+002D 'HYPHEN-MINUS'
    '\u2011': '\u002d',
    # U+01DD 'LATIN SMALL LETTER TURNED E'
    # Replaced by U+0259 'LATIN SMALL LETTER SCHWA' (looks the same).
    '\u01dd': '\u0259',
    # Deprecated SIL character U+F1EA 'MODIFIER LETTER SHORT EQUALS SIGN'
    # (https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=PUA_Deprecated)
    # replaced by U+003D '=' 'EQUALS SIGN'
    '\uf1ea': '\u003d',
    # U+2026 '…' HORIZONTAL ELLIPSIS
    # replaced by three dots
    '\u2026': '...',
        
    # digraph 'J' LATIN CAPITAL LETTER J and 
    # U+0335 '̵' COMBINING SHORT STROKE OVERLAY or
    # U+0336 '̶' COMBINING LONG STROKE OVERLAY:
    # represents U+0248 'Ɉ' LATIN CAPITAL LETTER J WITH STROKE?
    # (capital equivalent of:
    # U+025f 'ɟ' LATIN SMALL LETTER DOTLESS J WITH STROKE?)
    # occurs 3x:
    # Urmi_C B1 The Assyrians of Urmi 20 (p.238): ... J̶avìlan ... # (long stroke overlay)
    # Urmi_C B7 Village Life 15 (p.288): ... mən-J̵avìlan ... # (short stroke overlay)
    # Urmi_C B17 Village Life 40 (p.344): ... ɟu-J̵úrjəs-+tan| ... # (short stroke overlay)
    'J\u0335': '\u0248',
    'J\u0336': '\u0248',
    
    # Hyphen and circumflex accent below must switch positions
    '\u002d\u032d': '\u032d\u002d',
    # also '\u2011', since in an unordered dictionary,
    # it is unknown which substitution will take place first
    '\u2011\u032d': '\u032d\u002d',
}
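
How `html_to_text()` applies these replacements internally is not shown here. As an illustration of the ordering concern mentioned in the comments, the sketch below applies a replacement dictionary with the longest keys first, so that a two-character sequence is handled before any of its single characters. The function `apply_replacements()` and the sample dictionary are assumptions made for this example only.

```python
import unicodedata

def apply_replacements(text, replace):
    '''Applies all replacements, longest keys first, so the order is deterministic.'''
    for old in sorted(replace, key=len, reverse=True):
        text = text.replace(old, replace[old])
    return text

# a subset of the replacements defined above
sample_replace = {
    '\u2011': '\u002d',
    '\u01dd': '\u0259',
    '\u002d\u032d': '\u032d\u002d',
    '\u2011\u032d': '\u032d\u002d',
}

# the two look-alike characters really are different code points
print(unicodedata.name('\u01dd'))
print(unicodedata.name('\u0259'))

# non-breaking hyphen + circumflex below, followed by a turned e
result = apply_replacements('k\u2011\u032dm\u01dd', sample_replace)
print([hex(ord(c)) for c in result])
```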

## Importing the texts

Assuming that the subdirectory `texts` contains the HTML files generated earlier, we can import all files in the pattern `texts/*.html`. At this point we just want to do language statistics and not look at the actual texts, so it is sufficient to import the paragraphs of all texts in no particular order.

In [4]:
import pathlib

html_files = pathlib.Path.cwd().glob('texts/*.html')

paragraphs = []

for inputfile in html_files:
    print(inputfile.name, end=' ')
    for p in html_to_text(inputfile, markers=markers, replace=replace):
        print('.', end='')
        paragraphs.append(p)
    print(' done.')

len(paragraphs)

bar text a15-A17.html ............ done.
bar text A45.html .... done.
bar text a28.html ...... done.
bar text A49.html ......................... done.
bar text a24.html ........ done.
bar text A42-A44.html ............ done.
bar text a25.html .................. done.
bar text a48.html ...... done.
bar text a29.html ........... done.
bar text A9-A13.html ........................ done.
bar text a46-A47.html ....... done.
bar text A37-A40.html .................................................................... done.
bar text a18.html .... done.
bar text a1-A7.html ............................... done.
bar text a34.html .......... done.
bar text A14.html .................. done.
bar text a35.html ...... done.
bar text a36.html ... done.
bar text a41.html ... done.
bar text a8.html ........... done.
bar text a19-A23.html ............................... done.
cu vol 4 texts.html .................................................................................................................................................

809

Now we have imported 809 paragraphs of text. We only need the paragraphs containing the actual texts. We can look at the `p_type`s to see what paragraphs we have:

In [5]:
set(p._p_type for p in paragraphs)

{'footer',
 'gp-sectionheading-western',
 'gp-subsectionheading-western',
 'gp-subsubsectionheading-western',
 'p',
 'sdfootnote1',
 'sdfootnote2'}

Only the paragraphs with p_type `'p'` contain the actual text; the others are headings, footers, or footnotes.

The `'p'` paragraphs contain differently styled texts. Besides the normal text styles `'italic'` and `''` (unstyled, roman text), the styles include the markers defined above, brackets, `'comment'`, `'verse_no'`, and `'fn_anchor'`:

In [6]:
set(style for p in paragraphs if p._p_type == 'p' for text, style in p)

{'',
 '(',
 ')',
 '+',
 'Arm',
 'Az',
 'E',
 'F',
 'Ge',
 'P',
 'R',
 '[',
 ']',
 'comment',
 'fn_anchor',
 'italic',
 'verse_no',
 '|'}

### Get Rough Word Count

We are only interested in the normal text styles `''` and `'italic'`, so we filter out all others for the statistics. We want to know all the different characters that occur in the texts. Since many characters consist of combinations of a letter with one or more combining diacritics, we combine those first.

In [7]:
words = [(text, text_style) for p in paragraphs
            if p._p_type == 'p'
            for text, text_style in p
            if text_style in ('', 'italic')]
    
print(f'{len(words)} rough words captured...')

260609 rough words captured...


In [9]:
import collections
import unicodedata
import pandas

def make_hexs(string):
    '''Returns + separated hex string.'''
    return ' + '.join([hex(ord(e)) for e in string])
    
hex2index = collections.defaultdict(list) # maps a given hex string to the indices of the paragraphs it occurs in
characters = collections.Counter()

# enumerate keeps track of the paragraph's index in paragraphs
for index, p in enumerate(paragraphs):
    
    if p._p_type != 'p':
        continue
    for text, text_style in p:
        if text_style in ('', 'italic'):
            char = ''
            for c in text:
            
                # add accentuation (Mn) to char
                if char and unicodedata.category(c) == 'Mn':
                    char += c
                
                # trigger new char on non Mn category char
                elif char: 
                    characters[char] += 1 # count previous char
                    
                    # map hex codes to instances in text
                    hex_codes = make_hexs(char)
                    hex2index[hex_codes].append(index)
                    
                    char = c # reset to this char
                    
                # record first char
                else:
                    char = c
                                
            # retrieve last letter
            if char:
                
                # map hex codes to instances in text
                hex_codes = make_hexs(char)
                hex2index[hex_codes].append(index)
                characters[char] += 1
        
# make rows for each letter, additional columns for accents
rows = sorted(set(c[0] for c in characters))
accent_cols = sorted(set(c[1:] for c in characters if c[1:])) # select accents past first letter

# make pandas not truncate the table rows and columns
pandas.set_option('display.max_rows', 300)
pandas.set_option('display.max_columns', 300)

data = []
for c in sorted(rows):
    
    row = {'character': c,
           'count': characters[c],
           'category': unicodedata.category(c[0]),
           'hex codes': ' + '.join([hex(ord(e)) for e in c]),
          }
    
    # add accent columns
    for a in accent_cols:
        e = c+a
        if e in characters:
            row[a] = e
        else:
            row[a] = ''
            
    data.append(row)

chars = pandas.DataFrame(data)
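# set_properties() takes CSS properties as keyword arguments;
# the resulting Styler is not displayed, since only the last expression is shown
chars.style.set_properties(**{'text-align': 'left'})
chars.sort_values(['category', 'character'])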

Unnamed: 0,category,character,count,hex codes,̀,́,̂,̃̀,̄,̄̀,̄́,̆,̆̀,̆́,̇,̈,̈̀,̈́,̌,̣,̣̌,̭,̭̌
35,Ll,a,63409,0x61,à,á,,ã̀,ā,ā̀,ā́,ă,ằ,ắ,,ä,ä̀,,,,,,
36,Ll,b,14437,0x62,,,,,,,,,,,,,,,,,,,
37,Ll,c,4046,0x63,,,ĉ,,,,,,,,,,,,č,,č̣,c̭,č̭
38,Ll,d,13204,0x64,,,,,,,,,,,,,,,,ḍ,,,
39,Ll,e,13039,0x65,è,é,,,ē,ḕ,ḗ,,,,,,,,,,,,
40,Ll,f,326,0x66,,,,,,,,,,,,,,,,,,,
41,Ll,g,2303,0x67,,,,,,,,,,,ġ,,,,,,,,
42,Ll,h,4341,0x68,,,,,,,,,,,,,,,,ḥ,,,
43,Ll,i,9746,0x69,ì,í,,,ī,ī̀,ī́,,,,,,,,,,,,
44,Ll,j,1409,0x6a,,,,,,,,,,,,,,,,,,,
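
The grouping step inside the loop above can be isolated as a small function, which makes it easier to try out on a single string. This is a sketch with a made-up name (`split_clusters()`); like the loop, it attaches every non-spacing mark (category `Mn`) to the character before it.

```python
import collections
import unicodedata

def split_clusters(text):
    '''Splits text into base characters with their following combining marks.'''
    clusters = []
    for c in text:
        if clusters and unicodedata.category(c) == 'Mn':
            clusters[-1] += c    # attach the mark to the previous character
        else:
            clusters.append(c)   # start a new character
    return clusters

# 'c' + caron + dot below, then 'a' + grave, then another 'a' + grave
sample = 'c\u030c\u0323a\u0300a\u0300'
clusters = split_clusters(sample)
counts = collections.Counter(clusters)
print(len(clusters), counts['a\u0300'])
```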


## Make character table for manual inspection

First, look at what the Unicode character categories mean.

In [10]:
chars.category.unique() # print unique character categories

array(['Zs', 'Po', 'Pd', 'Sm', 'Lu', 'Ll', 'Lm'], dtype=object)

**Below is a breakdown of the Unicode character categories**

> Unicode 6.0 has 7 character categories, and each category has subcategories: <br>
> Letter (L): lowercase (Ll), modifier (Lm), titlecase (Lt), uppercase (Lu), other (Lo)<br>
> Mark (M): spacing combining (Mc), enclosing (Me), non-spacing (Mn)<br>
> Number (N): decimal digit (Nd), letter (Nl), other (No)<br>
> Punctuation (P): connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)<br>
> Symbol (S): currency (Sc), modifier (Sk), math (Sm), other (So)<br>
> Separator (Z): line (Zl), paragraph (Zp), space (Zs)<br>
> Other (C): control (Cc), format (Cf), not assigned (Cn), private use (Co), surrogate (Cs)<br>
> [source](https://unicodebook.readthedocs.io/unicode.html)

Some combinations of a base character with non-spacing marks (Mn) represent distinct letters, while others are vowels that simply carry an accent. Here we assume a very general rule:

* if character is a vowel, consider it a letter with an accent
* if character is a consonant, consider both the consonant and the accent a letter

In [11]:
# vowels with accents are considered accentuated letters
# here a list of vowels is assembled

vowels = {'a', 'e', 'i', 'o', 'u', 'ə'}
for v in list(vowels): # add upper cased vowels (iterate over a copy)
    vowels.add(v.upper())

def is_vowel(char, vowels=vowels):
    '''Checks whether a letter is a vowel, based on its first (base) character.'''
    return char[0] in vowels
    
# map cat codes to readable categories for spreadsheet
cat2category = {'Lu': 'Letter uppercase',
                'Ll': 'Letter lowercase',
                'Lm': 'Letter modifier',
                'Po': 'Punctuation other',
                'Pd': 'Punctuation dash',
                'Zs': 'Separator space',
                'Sm': 'Symbol math'} 
    
inspect = []

for c in characters:
    
    row = {'text': c,
           'count': characters[c],
           'category': cat2category[unicodedata.category(c[0])],
           'hex codes': ' + '.join([hex(ord(e)) for e in c]),
           'note': ''}

    accent = c[1:] if is_vowel(c) else ''
    letter = c[0] if is_vowel(c) else c
    
    row['accent'] = accent
    row['letter'] = letter
    
    inspect.append(row)
    
inspect = pandas.DataFrame(inspect)
inspect = inspect[['text', 'letter', 'accent', 'category', 'count', 'hex codes', 'note']]
inspect = inspect.sort_values(['letter'])

The inspection table contains three main columns for manual inspection:

1. text – this is the plain text of the character
2. letter – this is how the character is analyzed, i.e. it is identified as this letter
3. accent – this is the accent (non-letter character) that is in the plain-text of the character

A blank `note` column is provided for writing notes on how to edit the characters.
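
As a worked example of the rule, the sketch below splits two characters into `letter` and `accent` in the same way as the loop above. The function name `split_letter_accent()` is made up for this illustration.

```python
VOWELS = set('aeiouə') | set('aeiouə'.upper())

def split_letter_accent(cluster, vowels=VOWELS):
    '''Returns (letter, accent) for a base character with combining marks.'''
    if cluster[0] in vowels:
        # vowel: the marks are treated as an accent on the vowel
        return cluster[0], cluster[1:]
    # consonant (or anything else): the marks are part of the letter
    return cluster, ''

vowel_case = split_letter_accent('a\u0300')            # a + grave
consonant_case = split_letter_accent('c\u030c\u0323')  # c + caron + dot below
print(vowel_case, consonant_case)
```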

In [12]:
inspect

Unnamed: 0,text,letter,accent,category,count,hex codes,note
5,,,,Separator space,92267,0x20,
49,!,!,,Punctuation other,560,0x21,
13,",",",",,Punctuation other,7761,0x2c,
7,-,-,,Punctuation dash,25281,0x2d,
20,.,.,,Punctuation other,15191,0x2e,
65,:,:,,Punctuation other,42,0x3a,
147,;,;,,Punctuation other,1,0x3b,
127,=,=,,Symbol math,1100,0x3d,
52,?,?,,Punctuation other,1686,0x3f,
102,A,A,,Letter uppercase,143,0x41,


In [13]:
inspect.to_csv('inspect_consonants.csv') # export spreadsheet for inspection with Geoffrey

### Find Interesting Cases of a Particular Hex String in the Dataset

There are a few cases, such as the diaeresis, that need to be inspected more closely. Here we isolate those cases so that they can be seen in their context.

In [14]:
def isolate_case(hex_string):
    '''Uses paragraphs list to find and return interesting cases'''
    
    found_indices = hex2index[hex_string] # list of paragraph indices from hex2index (one entry per occurrence)
   
    # find all matches
    matches = []
    
    for index in found_indices:
        
        paragraph = paragraphs[index]
        paragraph_text = str(paragraph)
        
        for sentence in paragraph_text.split('.|'):
            
            # make_hexs(sentence) returns long hex string, 
            # just check for simple sequence match
            if hex_string in make_hexs(sentence): 
                matches.append(sentence)
        
    return matches

### Diaeresis (e.g. on ü)

Geoffrey has indicated that these cases could be mistakes. Let's find their context sentences.

In [15]:
umlaut_hex = '0x75 + 0x308'
umlauts = isolate_case(umlaut_hex)
print('\n'.join(umlauts))

 màra| lišánət +hošàrə| Azjǘllen sàtıram,| jǘllen sàtıram.Az| mára Aznèynen dérsinAz| màra| AzyüzinnènAz| bi-+ʾàynə AzyüzinnènAz
 bərrə̀ššələ| mə́drə bəcláyələ k̭am-Rʾak̭ùšk̭aR| Azjǘllen sàtıram,| jǘllen sàtıram.Az| mára Aznèynen dérsinAz| bi-mù yavévət?| vàrdə zabúnən
 bi-mù yavḗt? mára AzyüzinnènAz| b-+ʾàynə
 màra| lišánət +hošàrə| Azjǘllen sàtıram,| jǘllen sàtıram.Az| mára Aznèynen dérsinAz| màra| AzyüzinnènAz| bi-+ʾàynə AzyüzinnènAz
 bərrə̀ššələ| mə́drə bəcláyələ k̭am-Rʾak̭ùšk̭aR| Azjǘllen sàtıram,| jǘllen sàtıram.Az| mára Aznèynen dérsinAz| bi-mù yavévət?| vàrdə zabúnən
 bi-mù yavḗt? mára AzyüzinnènAz| b-+ʾàynə
 màra| lišánət +hošàrə| Azjǘllen sàtıram,| jǘllen sàtıram.Az| mára Aznèynen dérsinAz| màra| AzyüzinnènAz| bi-+ʾàynə AzyüzinnènAz
 bərrə̀ššələ| mə́drə bəcláyələ k̭am-Rʾak̭ùšk̭aR| Azjǘllen sàtıram,| jǘllen sàtıram.Az| mára Aznèynen dérsinAz| bi-mù yavévət?| vàrdə zabúnən
 bi-mù yavḗt? mára AzyüzinnènAz| b-+ʾàynə
 mára mə́drə lišā́n +hošàrəla| màra| Azqabaxčàn arvàt| ... altı́
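
The output above repeats sentences, because a paragraph index is stored once for every occurrence. Matching on the hex string can also give false positives, since `'0x75 + 0x308'` is a substring of `'0x175 + 0x308'`. A possible alternative, sketched here with made-up function names and plain strings instead of `Text` objects, is to decode the hex string back to characters, search for those directly, and keep each sentence only once.

```python
def hexs_to_string(hex_string):
    '''Turns a string like '0x75 + 0x308' back into the characters it encodes.'''
    return ''.join(chr(int(code, 16)) for code in hex_string.split(' + '))

def find_sentences(texts, hex_string, delimiter='.|'):
    '''Returns every sentence containing the encoded sequence once, in order.'''
    needle = hexs_to_string(hex_string)
    seen = {}  # dict keys keep insertion order and drop duplicates
    for text in texts:
        for sentence in text.split(delimiter):
            if needle in sentence:
                seen.setdefault(sentence, None)
    return list(seen)

# sample strings: u + diaeresis, and w-circumflex (0x175) + diaeresis
sample_texts = ['ayu\u0308z.| b\u0175\u0308.| c', 'ayu\u0308z.| d']
found = find_sentences(sample_texts, '0x75 + 0x308')
print(found)
```

Note that this still matches `u` with a diaeresis followed by further marks, such as an additional acute accent.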

<hr>

TODO: isolate words