# Parse Christian Urmi and Barwar texts

To work with the texts, it is useful to separate the NENA text from the metadata (titles, authors/informants, and verse numbers) and to give it some structure.

## Convert to text format

The texts are available in MS-Word format. They can be converted to plain text with a tool like AbiWord:

    # Convert Christian Urmi texts in "cu vol 4 texts.doc" to "urmi_c.txt"
    $ abiword --to=txt "cu vol 4 texts.doc" --to-name=urmi_c.txt

    # Convert Barwar texts in "bar text *.doc" to "bar text *.txt"
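    $ for i in bar\ text*.doc; do abiword --to=txt "${i}"; done
    # Combine the Barwar texts in the correct order in "barwar.txt"
    # (match only the Barwar files, so that neither urmi_c.txt nor
    # barwar.txt itself ends up in the output)
    $ ls -1d -- bar\ text*.txt | sort -Vf | xargs -d "\n" cat -- > barwar.txt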

The conversion removes most of the formatting. Although some formatting is meaningful, such as superscript characters and symbols, its meaning can mostly be inferred from the characters themselves and their position in context.
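
The combining step can also be done in Python. The following is a minimal sketch and not part of the original workflow: it assumes the converted files have names like `bar text 1.txt`, `bar text 2.txt`, and so on (the exact names are an assumption), that they are encoded as UTF-8, and the helper names `natural_key` and `combine` are hypothetical. It orders the names by their numeric parts, which is roughly what `sort -V` does.

```python
import glob
import re


def natural_key(name):
    """Sort key that compares runs of digits as numbers (hypothetical helper)."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]


def combine(pattern, target):
    """Concatenate all files matching pattern into target, in natural order."""
    names = sorted(glob.glob(pattern), key=natural_key)
    # UTF-8 is an assumption about the encoding of the converted files
    with open(target, 'w', encoding='utf-8') as out:
        for name in names:
            with open(name, encoding='utf-8') as part:
                out.write(part.read())
    return names


# Assumed file names, for illustration only
example = ['bar text 10.txt', 'bar text 2.txt', 'bar text 1.txt']
print(sorted(example, key=natural_key))
```

A call such as `combine('bar text*.txt', 'barwar.txt')` would then write the combined file.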

## Import in Python

First, we need to import some libraries.

In [1]:
import collections
import unicodedata
import re

### Defining patterns and functions

Then we define a namedtuple to conveniently store the structured text and metadata.

In [2]:
fields = 'id dialect title informant place text'

Text = collections.namedtuple('Text', fields)
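
As a small illustration of how such a namedtuple can be used, here is a sketch with made-up values (the title, informant, place, and verses below are placeholders, not taken from the corpus). The `text` field is assumed to hold a list of `(verse_no, verse_text)` tuples, as produced later by `split_verses()`.

```python
import collections

Text = collections.namedtuple('Text', 'id dialect title informant place text')

# Made-up values, for illustration only
sample = Text(id='A1', dialect='Urmi_C', title='A title',
              informant='An informant', place='A place',
              text=[('(1)', 'first verse'), ('(2)', 'second verse')])

print(sample.id, sample.dialect)
for verse_no, verse in sample.text:
    print(verse_no, verse)

# namedtuples are immutable; _replace() returns a modified copy
renamed = sample._replace(title='Another title')
print(renamed.title)
```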

In order to structure the text files into meaningful elements, such as metadata (titles, authors/informants, places, verse numbers) and texts, verses, words, and other symbols, we have to recognize some patterns that are either present in the formatting of the Word files, or generated by the conversion to text by AbiWord.

The patterns are different for the Barwar and Urmi texts. We use regular expressions to recognize the patterns.

In [3]:
# Verse numbers are one or more digits in parentheses
re_verse_no = re.compile(r'\s*(\([0-9]+\))\s*')

# Footnotes start with a space after a newline,
# and in our cases end with the first full stop.
# If a footnote would contain more full stops,
# I do not know how we could recognize the end.
re_footnote = re.compile(r'^ [^.]*\.')

# Regexes for Barwar texts
# Title contains identifier (e.g. 'A1') followed by
# at least one TAB character, and the title.
re_title = re.compile(r'^([ABCD][0-9]+) *[\t]+(.*\S)\s*$')
# The informant line starts with 'Informant: ', followed
# by the informant name, and in parentheses the place.
re_info = re.compile(r'^Informant: (.*) \((.*)\)\s*$')

# Regexes for Urmi texts
# Heading only contains identifier (e.g. 'A 1' or 'B2').
re_heading = re.compile(r'^([AB]\s*[0-9]+)\s*$')
# title_info line contains Title, and in parentheses
# the informant name followed by a comma, and the place
re_title_info = re.compile(r'^(.*\S)\s*\(([^,]*), (.*)\)\s*$')
# Ignore other lines with only capitalized text and/or numbers
re_ignore = re.compile(r'^[A-Z][A-Z0-9 ]+\s*$')
# Version regex is for a special case in Urmi text A35
re_version = re.compile(r'^(Version [0-9]+): (.*) \((.*)\)\s*$')
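
To see what these patterns capture, the following sketch applies four of them to sample lines. The sample lines are assumptions modelled on the formats described in the comments above, not lines copied from the actual files.

```python
import re

re_title = re.compile(r'^([ABCD][0-9]+) *[\t]+(.*\S)\s*$')
re_info = re.compile(r'^Informant: (.*) \((.*)\)\s*$')
re_heading = re.compile(r'^([AB]\s*[0-9]+)\s*$')
re_title_info = re.compile(r'^(.*\S)\s*\(([^,]*), (.*)\)\s*$')

# Sample lines are assumptions, for illustration only
barwar_title = 'A1\tSome Title\n'
barwar_info = 'Informant: Some Name (Some Place)\n'
urmi_heading = 'A 1\n'
urmi_title_info = 'Some Title (Some Name, Some Place)\n'

title_groups = re_title.match(barwar_title).groups()
info_groups = re_info.match(barwar_info).groups()
# remove the space in 'A 1', as get_texts() does
heading_id = ''.join(re_heading.match(urmi_heading).group(1).split())
title_info_groups = re_title_info.match(urmi_title_info).groups()

print(title_groups)
print(info_groups)
print(heading_id)
print(title_info_groups)
```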

The function `get_texts()` reads the text files line by line, and tries to match the lines with the regular expressions. It returns a list of `Text` namedtuples, and a list of lines that were ignored (such as front matter, empty lines or lines with one character, and footnotes).

In [4]:
def get_texts(filename, dialect, replace=None, ignore=None):
    """Read and structure NENA texts.
    
    Reads from filename, and structures the texts to fit in the fields
    of the namedtuple Text.
    Optimized for text files extracted from MS-Word files with Barwar
    and Christian Urmi texts.
    """
    
    t_id = None
    version = None
    texts = []
    ignored = []

    with open(filename, 'r') as text_file:
        for line in text_file:
            
            # ignore 'empty' lines of 1 character
            empty = len(line.strip()) < 2
            
            # Normalize the characters: split combined characters such as 'á'
            # into separate characters 'a' and combining acute accent
            line = unicodedata.normalize('NFD', line)
            
            # these replace methods may not be the fastest solution
            if replace is not None:
                for c in replace:
                    line = line.replace(c, replace[c])
            
            if ignore is not None:
                for c in ignore:
                    line = line.replace(c, '')
            
            title_line = re_title.match(line)
            heading_line = re_heading.match(line)
            if title_line or heading_line:
                if t_id:
                    text, ignored_v = split_verses(text)
                    ignored.extend(ignored_v)
                    texts.append(Text(t_id, dialect, title, informant, place, text))
                if title_line:
                    t_id, title = title_line.groups()
                elif heading_line:
                    t_id = ''.join(heading_line.group(1).split()) # remove space in 'A 1'
                    title = None
                informant = None
                place = None
                text = []
                continue

            # Version is special case for Urmi A35
            # Version line also matches re_title_info, so must be checked before
            # Title has been added to text list
            version_line = re_version.match(line)
            if version_line:
                version, v_informant, v_place = version_line.groups()
                if len([e for e in text if e.strip()]) == 1:
                    title = text[0].strip()
                    text = []
                text.append(version)
                
                if informant is not None:
                    informant += '; {}: {}'.format(version, v_informant)
                else:
                    informant = '{}: {}'.format(version, v_informant)
                if place is not None:
                    place += '; {}: {}'.format(version, v_place)
                else:
                    place = '{}: {}'.format(version, v_place)
                continue
            
            informant_line = re_info.match(line)
            title_info_line = re_title_info.match(line)
            if t_id and informant is None and (informant_line or title_info_line):
                if informant_line:
                    informant, place = informant_line.groups()
                elif title_info_line:
                    title, informant, place = title_info_line.groups()
                continue
            
            # if no heading/title has yet been encountered,
            # line is part of front matter, so ignore
            if t_id is None:
                ignored.append(line)
                continue
            
            # if line does not match heading/title
            if re_ignore.match(line):
                ignored.append(line)
                continue
            
            # footnotes are replaced by newline(s), followed by
            # the footnote text preceded by a space.
            # Ignore footnote and preceding newlines
            # TODO: store footnote and its location somewhere?
            footnote = re_footnote.match(line)
            if footnote:
                ignored.append(footnote.group())
                line = re_footnote.sub('', line)
                while text and text[-1].strip() == '':
                    ignored.append(text.pop())
                
            if text or not empty:
                text.append(line)
            elif empty:  # empty line after title or informant lines
                ignored.append(line)

        # add last text
        text, ignored_v = split_verses(text)
        ignored.extend(ignored_v)
        
        texts.append(Text(t_id, dialect, title, informant, place, text))
    
    return (texts, ignored)
   

The function `split_verses()` splits the text into verses, starting with the verse number in parentheses. It keeps all characters such as whitespace and newlines (except the trailing whitespace/newlines at the end of the text), so that special formatting, such as for poetry, is preserved.

In [5]:
def split_verses(text):
    """Split text into verses.
    
    Verses are marked by a verse number in parentheses.
    Returns a tuple (verses, ignored): verses is a list of tuples
    (verse_no, verse_text), ignored is a list of discarded strings.
    """
    
    ignored = []
    
    # strip empty lines and lines with only one character from end of text
    # (single characters are sometimes appended by abiword to end of file)
    while text and len(text[-1].strip()) < 2:
        ignored.append(text.pop())

    verses = []
    cur_verse = []
    verse_no = ''
    
    for string in text:
        for e in re_verse_no.split(string):
            if cur_verse and cur_verse[-1].startswith('Version'):
                if not e.strip():
                    ignored.append(e)
                    continue # ignore newlines after 'Version', it will be added
            if re_verse_no.match(e):
                if cur_verse and cur_verse[-1].startswith('Version'):
                    e = '{}:\n\n{}'.format(cur_verse.pop(), e)
                if verse_no and cur_verse:
                    verses.append((verse_no, ''.join(cur_verse)))
                    cur_verse = []
                verse_no = e
            elif e:
                cur_verse.append(e)
    
    # from last verse in text, strip trailing whitespace
    verses.append((verse_no, ''.join(cur_verse).rstrip()))
    
    return verses, ignored
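
The core of `split_verses()` is that `re.split()` keeps the text matched by a capturing group. A minimal sketch of that idea, using a made-up line with two verses:

```python
import re

re_verse_no = re.compile(r'\s*(\([0-9]+\))\s*')

# Made-up sample line with two verses
line = '(1) first verse.| (2) second verse.|'

parts = re_verse_no.split(line)
# parts alternates between text and verse numbers,
# starting with the (here empty) text before the first number
verses = list(zip(parts[1::2], parts[2::2]))

print(parts)
print(verses)
```

Unlike this sketch, `split_verses()` also handles verses that span several lines and the 'Version' lines of Urmi text A35.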

To split the text of the verses into words, we need to be able to find word boundaries. Splitting on spaces is not enough, since some words are prefixed or suffixed to other words by means of hyphens. In some cases the space is omitted apparently accidentally, but the word boundary is indicated by stress markers that occur only at the beginning or end of words. In all cases, we want to keep the boundary markers (spaces, hyphens, or other) because they may be useful in later analysis.

This cannot be done with `str.split()`, and also not (easily) with `re.split()`, so we use a special function `split_words()`.

In [6]:
def split_words(s, split_after=None, split_before=None):
    """Split string s into words while keeping delimiters.
    
    Splits the string on the delimiters given in split_after
    and split_before, while not removing the delimiters.
    The delimiters in split_after are appended to the preceding
    string, those in split_before to the following string (if
    any).
    If both delimiters are None or empty strings, the result
    of a plain s.split() is returned.
    
    >>> string = 'this-here is- just\na=test=\n'
    >>> split_after = ' -\n'  # space, hyphen, newline
    >>> split_before = '='  # equals sign
    >>> split_words(string, split_after, split_before)
    ['this-', 'here ', 'is- ', 'just\n', 'a', '=test=\n']
    
    """
    
    # Another way could be using re.split, but without after_chars.
    # Example splitting on space and hyphen:
    # (adapted from https://stackoverflow.com/a/7866863/9230612)
    #
    # >>> re.split("([^ -]+[ -]?)", "my first- test-string")[1::2]
    # ['my ', 'first-', 'test-', 'string']
    #
    # but that does not keep all characters (space after hyphen is lost).
    
    split_after = '' if split_after is None else split_after
    split_before = '' if split_before is None else split_before

    delimiters = split_after + split_before
    
    if not delimiters:
        return s.split()
    
    result = []
    start = 0
    pos = 0
    
    while pos < len(s):
        if pos > start and s[pos-1] in delimiters and s[pos] not in delimiters:
            if s[pos-1] in split_after:
                result.append(s[start:pos])
                start = pos
            else: # s[pos-1] in split_before:
                i = 1
                while pos-i > start and s[pos-i-1] in split_before:
                    i += 1
                if pos-i > start:
                    result.append(s[start:pos-i])
                start = pos-i
        # UGLY HACK to split words not playing by the rules:
        # words following vertical line or ellipsis marks without spaces
        elif (pos > start and s[pos].isalpha()
              and (s[pos-1] in '|…' or s[max(pos-2, 0):pos] == '..')):
            result.append(s[start:pos])
            start = pos
        pos += 1
    result.append(s[start:])
    
    return result

### Importing the texts

Assuming that the text files that were generated earlier, `barwar.txt` and `urmi_c.txt`, are present in the current directory, we can now import them. For convenience we also combine them in a list containing all texts.

In [7]:
barwar_texts, ignored_b = get_texts('barwar.txt', 'Barwar')

urmi_texts, ignored_u = get_texts('urmi_c.txt', 'Urmi_C')

ignored = ignored_b + ignored_u # to allow for inspection of ignored text

alltexts = barwar_texts + urmi_texts

## Statistics

Now that we have all texts separated from the metadata and structured into verses, we can start counting all kinds of things.

For example, we can count all characters.

In [8]:
cnt = collections.Counter()

for t in alltexts:
    for v, verse in t.text:
        for c in verse:
            cnt[c] += 1

Use a pandas DataFrame to display the resulting counts, sorted first by Unicode category, and then by character value:

In [9]:
# import numpy as np
import pandas as pd

# make pandas not truncate the table rows
pd.set_option('display.max_rows', 300)

data = []
for c in sorted(cnt):
    data.append({'character': c,
                 'count': cnt[c],
                 'category': unicodedata.category(c[0]),
                 'hex codes': ' '.join([hex(ord(e)) for e in c])})

df = pd.DataFrame(data)
df.sort_values(['category', 'character'])

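| | category | character | count | hex codes |
|---|---|---|---|---|
| 0 | Cc | `\n` | 397 | 0xa |
| 1 | Cc | `\x1e` | 7256 | 0x1e |
| 91 | Cf | `\u200e` | 1 | 0x200e |
| 94 | Cf | `\u202e` | 1 | 0x202e |
| 95 | Co | `\uf1ea` | 1100 | 0xf1ea |
| 41 | Ll | a | 108660 | 0x61 |
| 42 | Ll | b | 14437 | 0x62 |
| 43 | Ll | c | 5986 | 0x63 |
| 44 | Ll | d | 13208 | 0x64 |
| 45 | Ll | e | 18801 | 0x65 |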


This shows us all the Unicode codepoints occurring in the text. Most are letters (Unicode categories 'Ll', 'Lu', and 'Lm') or combining diacritic symbols (category 'Mn'). Others are punctuation ('Pd', 'Po') or brackets/parentheses ('Ps', 'Pe'), or symbols ('Sm'). There is also white space: space (category 'Zs') and newline ('\n', category 'Cc').
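
The categories mentioned here can be checked directly with `unicodedata`. The sketch below also shows why combining marks appear as separate table entries: after NFD normalization (as applied in `get_texts()`), a precomposed letter such as 'á' becomes a base letter followed by a combining accent.

```python
import unicodedata

# NFD splits a precomposed letter into base letter + combining mark
decomposed = unicodedata.normalize('NFD', '\u00e1')  # 'á'
print([hex(ord(c)) for c in decomposed])

# Print the Unicode category of a few representative characters
for c in decomposed + '-.()+ \n':
    print(repr(c), unicodedata.category(c))
```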

### Problematic characters

Since the texts were composed in Word to be read by humans, some inconsistencies in the character encoding that are invisible to humans can confuse automatic analysis. We will look at some problematic characters that are found using the table above.

In category 'Cc', the character with the hexadecimal number 0x1e is a control character named U+001E 'INFORMATION SEPARATOR TWO'. It is (for some reason?) converted by AbiWord from the character shown in Word as U+2011 'NON-BREAKING HYPHEN', and seems to have the same function as the regular U+002D 'HYPHEN-MINUS'.

In category 'Cf' are two characters, 0x200e (U+200E 'LEFT-TO-RIGHT MARK') and 0x202e (U+202E 'RIGHT-TO-LEFT OVERRIDE'). These are invisible control characters that change the text direction. They have no meaning in our text and should be ignored.

The category 'Co' is the 'Private use area' of Unicode. The character 0xf1ea is a [deprecated SIL character](https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=PUA_Deprecated), which looks like a short equals sign or a double hyphen and is apparently used to connect suffixed forms to the end of a word. It can be replaced by the valid Unicode character U+2E40 'DOUBLE HYPHEN'.

Under letters are two symbols that look the same: U+01DD 'LATIN SMALL LETTER TURNED E' and U+0259 'LATIN SMALL LETTER SCHWA'. Assuming that they represent the same character, the 'turned e' can be replaced by the 'schwa'.

In category 'Mn' are two combining strokes, a short and a long one. Inspection of the text shows that they are both combined with the capital J, apparently to form an uppercase counterpart for 'ɟ': U+025F 'LATIN SMALL LETTER DOTLESS J WITH STROKE'. The three occurrences could be replaced by the Unicode character 'Ɉ': U+0248 'LATIN CAPITAL LETTER J WITH STROKE' (which also does have an official lowercase variant, 'ɉ' U+0249 'LATIN SMALL LETTER J WITH STROKE', but that one has a dot).

The character '…' U+2026 'HORIZONTAL ELLIPSIS' stands for three dots and was most likely inserted automatically by Word, so it can be converted back to three separate dots '...'.

The 't' and the combining diacritic U+032d '̭' 'COMBINING CIRCUMFLEX ACCENT BELOW' are in some cases separated by a hyphen (t-̭). Word processors apparently still render this as intended, with the circumflex below the ṱ, but when separating words on hyphens, the circumflex ends up at the beginning of the following word, so the hyphen and the circumflex must switch positions.
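
The character names used in this section can be looked up with `unicodedata.name()`. The sketch below lists the characters discussed here and then shows the hyphen/circumflex swap on a made-up string; control and private-use characters have no name in the Unicode database, so a default is supplied.

```python
import unicodedata

for c in '\u001e\u200e\u202e\uf1ea\u01dd\u0259\u032d\u2e40':
    # control and private-use characters have no name, hence the default
    name = unicodedata.name(c, '(no name)')
    print('U+{:04X}'.format(ord(c)), unicodedata.category(c), name)

# Moving the circumflex below back in front of the hyphen
sample = 't-\u032da'  # made-up string: t, hyphen, circumflex below, a
fixed = sample.replace('-\u032d', '\u032d-')
print([hex(ord(c)) for c in fixed])
```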

To be able to replace or remove the problematic characters, we define a tuple for the characters to ignore, and a dictionary for the characters to replace.

In [10]:
# define characters to be ignored (bidirectional control characters)
ignore = ('\u200e', '\u202e')

# define characters to be replaced
replace = {
    # In MS Word, this is U+2011 'NON-BREAKING HYPHEN'.
    # Somehow in conversion to text, AbiWord converts it to
    # U+001E 'INFORMATION SEPARATOR TWO' (record separator).
    # Replaced by regular U+002D 'HYPHEN-MINUS'
    '\u001e': '\u002d',
    # U+01DD 'LATIN SMALL LETTER TURNED E'
    # replaced by U+0259 'LATIN SMALL LETTER SCHWA' (looks the same).
    '\u01dd': '\u0259',
    # Deprecated SIL character U+F1EA 'MODIFIER LETTER SHORT EQUALS SIGN'
    # (https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=PUA_Deprecated)
    # replaced by U+2E40 'DOUBLE HYPHEN'
    '\uf1ea': '\u2e40',
    # U+2026 '…' HORIZONTAL ELLIPSIS
    # replaced by three dots
    '\u2026': '...',
        
    # digraph 'J' LATIN CAPITAL LETTER J and 
    # U+0335 '̵' COMBINING SHORT STROKE OVERLAY or
    # U+0336 '̶' COMBINING LONG STROKE OVERLAY:
    # represents U+0248 'Ɉ' LATIN CAPITAL LETTER J WITH STROKE?
    # (capital equivalent of:
    # U+025f 'ɟ' LATIN SMALL LETTER DOTLESS J WITH STROKE?)
    # occurs 3x:
    # Urmi_C B1 The Assyrians of Urmi 20 (p.238): ... J̶avìlan ... # (long stroke overlay)
    # Urmi_C B7 Village Life 15 (p.288): ... mən-J̵avìlan ... # (short stroke overlay)
    # Urmi_C B17 Village Life 40 (p.344): ... ɟu-J̵úrjəs-+tan| ... # (short stroke overlay)
    'J\u0335': '\u0248',
    'J\u0336': '\u0248',
    
    # Hyphen and circumflex accent below must switch positions
    '\u002d\u032d': '\u032d\u002d',
    # also '\u001e', so that the result does not depend on the order
    # of the substitutions (dictionaries are unordered before Python 3.7)
    '\u001e\u032d': '\u032d\u002d',
}
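
Because some keys overlap, the order of the substitutions can matter. One way to make the order explicit is to apply the longest keys first. This is only a sketch of an alternative, not what `get_texts()` does; the helper name `apply_replacements` is hypothetical and the table is a small subset of the one above.

```python
def apply_replacements(line, replace):
    """Apply replacements, longest keys first (hypothetical helper)."""
    for key in sorted(replace, key=len, reverse=True):
        line = line.replace(key, replace[key])
    return line


# A small subset of the replacement table, for illustration
subset = {
    '\u001e': '\u002d',
    '\u002d\u032d': '\u032d\u002d',
    '\u001e\u032d': '\u032d\u002d',
}

# made-up string: t, U+001E, circumflex below, a
cleaned = apply_replacements('t\u001e\u032da', subset)
print(ascii(cleaned))
```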

With the characters to be ignored or replaced known, we can now re-read the texts, and generate a cleaner table of characters.

In [11]:
barwar_texts, ignored_b = get_texts('barwar.txt', 'Barwar', replace=replace, ignore=ignore)

urmi_texts, ignored_u = get_texts('urmi_c.txt', 'Urmi_C', replace=replace, ignore=ignore)

ignored = ignored_b + ignored_u # to allow for inspection of ignored text

alltexts = barwar_texts + urmi_texts

In [12]:
cnt = collections.Counter()

for t in alltexts:
    for v, verse in t.text:
        for c in verse:
            cnt[c] += 1

data = []
for c in sorted(cnt):
    data.append({'character': c,
                 'count': cnt[c],
                 'category': unicodedata.category(c[0]),
                 'hex codes': ' '.join([hex(ord(e)) for e in c])})

df = pd.DataFrame(data)
df.sort_values(['category', 'character'])

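| | category | character | count | hex codes |
|---|---|---|---|---|
| 0 | Cc | `\n` | 397 | 0xa |
| 40 | Ll | a | 108660 | 0x61 |
| 41 | Ll | b | 14437 | 0x62 |
| 42 | Ll | c | 5986 | 0x63 |
| 43 | Ll | d | 13208 | 0x64 |
| 44 | Ll | e | 18801 | 0x65 |
| 45 | Ll | f | 326 | 0x66 |
| 46 | Ll | g | 2573 | 0x67 |
| 47 | Ll | h | 4510 | 0x68 |
| 48 | Ll | i | 25237 | 0x69 |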


### Other special characters

#### Hyphens

Hyphens connect prefixed forms to words, and double hyphens connect suffixed forms to words. When splitting words, we would split a form like `pre-+word⹀suff.|` into three parts: `pre-`, `+word`, and `⹀suff.|`.
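
For this simple case, the split can be approximated with a single regular expression. This is a simplified sketch, not a replacement for `split_words()`: it assumes that '+' and '⹀' only occur at the start of a part and it does not cover the special cases for vertical bars and ellipsis marks.

```python
import re

# optional prefix markers, then the word body, then trailing boundary characters
re_word = re.compile(r'[+⹀]*[^ \n+⹀-]+[ \n-]*')

form = 'pre-+word⹀suff.|'
parts = re_word.findall(form)
print(parts)
```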

#### Pauses and ellipsis

There are two kinds of pause or ellipsis markers: consecutive dots and em dashes. The dots vary in number from two to five. In some cases the ellipsis is not followed by a space but directly by the following word. In other cases the ellipsis is followed by either a vertical bar or by a comma.
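
A regular expression can locate both kinds of markers. The sketch below uses a made-up sample string and matches any run of two or more dots (so it would also accept runs longer than the five observed here) as well as runs of em dashes.

```python
import re

# two or more dots, or one or more em dashes (U+2014)
re_pause = re.compile(r'\.{2,}|\u2014+')

sample = 'bla... bla..bla \u2014 bla.|'  # made-up string
pauses = re_pause.findall(sample)
print(pauses)
```

The single full stop at the end of the sample is not matched, since it is ordinary punctuation.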

#### Parentheses and brackets

Sometimes, parentheses or brackets are used to indicate that a text was spoken by the interviewer, and once, to insert a remark (Urmi_C A3 Axiqar 18: "(interruption)").

 - **Question:** how to handle these?
   Just filter out the remarks and brackets, and leave the text?
   Or remove altogether? It can be easily removed with a regular expression, like:

        >>> s = '(18) (GK: bla bla.|) more bla.| [GK bla?] +bla bla.|'
        >>> pattern = r'[\[(][A-Z]+:? ([^\])]+)[\])]'
        >>> ''.join(re.split(pattern, s))
        '(18) bla bla.| more bla.| bla? +bla bla.|'

With the right word boundary characters (white space and hyphens), we can now split the verses into words more reliably (but see the 'UGLY HACK' remark in `split_words()`: spaces are sometimes missing, so a workaround is needed). That means we can also count the texts and the words:

In [13]:
len(alltexts)

125

In [14]:
known_marks = {
    'Arm': 'Arm(enian?)',
    'Az': 'Az(eri?)',
    'E': 'English',
    'F': 'French',
    'Ge': 'Ge(rman?)',
    'P': 'P(ersian?)',
    'R': 'R(ussian?)',
}

fields = [
    'surface',
    'word',
    'before',
    'after',
    'punct',
    'mark_s',
    'mark_e',
]

# TODO This doesn't work yet (because tuples are immutable)
# Word = collections.namedtuple('Word', fields)
# # set all values except 'surface' to empty string by default
# # https://stackoverflow.com/a/18348004/9230612
# Word.__new__.__defaults__ = ('',) * (len(Word._fields) - 1)

# for t in alltexts:
#     for v, verse in t.text:
#         words = []
#         new_verse = []
#         for word in split_words(verse, ' -\n', '+⹀'):
#             w = Word(surface=word)
#             # remove characters before word
#             while word and word[0] in '+⹀':
#                 print(repr(w.before))
#                 w.before = w.before + word[0]
#                 word = word[1:]
#             # starting markers are looked up only to match ending markers
#             # check if there are ending markers by looking for capital letters
#             # occurring after lowercase letters
#             pos = 0
#             while pos < len(word) and not word[pos].islower():
#                 pos += 1
#             while pos < len(word) and not word[pos].isupper():
#                 pos += 1
#             # check if remaining string starts with any of known_marks
#             mark_e = ([m for m in known_marks if word[pos:].startswith(m)]+[''])[0]
#             if mark_e:
#                 # remove marker from word string
#                 word = word[:pos] + word[pos+len(mark_e):]
#                 w.mark_e = mark_e
# #                 # look for starting marker
# #                 if word.startswith(mark_e) and word                       
#             new_verse.append(w)
#         verse = new_verse
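
One possible way around the immutability problem noted in the TODO is a mutable dataclass instead of a namedtuple. The following is only a sketch under that assumption; it reuses the field names from `fields` and shows just the step that moves leading '+' and '⹀' characters into `before`.

```python
from dataclasses import dataclass


@dataclass
class Word:
    surface: str
    # all other fields default to the empty string
    word: str = ''
    before: str = ''
    after: str = ''
    punct: str = ''
    mark_s: str = ''
    mark_e: str = ''


w = Word(surface='+word')
rest = w.surface
# move characters that precede the word into the 'before' field
while rest and rest[0] in '+⹀':
    w.before += rest[0]
    rest = rest[1:]
w.word = rest
print(w)
```

With a namedtuple, the same effect would require `w = w._replace(before=...)` at every step.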


In [15]:
total = 0
result = []

for i, t in enumerate(alltexts):
    for v, verse in t.text:
        words = split_words(verse, ' -\n', '+⹀')
        total += len(words)
        for j, word in enumerate(words):
            if not any(c.isalpha() for c in word):  # or 'GK' in word or 'OK' in word:
                continue
            stripped = word.lstrip('+⹀([ʾʿ').rstrip('-)]|.,?!:; \n')
            # strip off initial capitals
            while stripped and (stripped[0].isupper() or stripped[0] in 'ʾʿ'):
                stripped = stripped[1:]
            has_case = stripped != stripped.lower()
            
            if has_case or any(c in '+⹀([)]|.,?!:;' for c in stripped):
                while stripped and not stripped[0].isupper():
                    stripped = stripped[1:]
                result.append(stripped)
                if stripped not in known_marks:
                    print(t.dialect, t.id, v, word)
#             stripped = word.lstrip('+⹀([')
#             if stripped and  and not stripped[0].isalpha():
#                 print(t.dialect, t.id, v, repr((words[j-1] if j>0 else '') + word))

# problems: Urmi suffixes attached to loan words, with no separation?
# Urmi_C A3 (70) EhotèlEux,| 
# Urmi_C A3 (72) EhotèlEux.| 
# Urmi_C A3 (75) EhotèlEu,| 
# Urmi_C A3 (77) EhotèlEu| 
# Urmi_C A41 (6) EpencílEə 
# Urmi_C A41 (12) ElístEət 
# Urmi_C A43 (22) RpovàrRə 
# Urmi_C A43 (23) RpovàrRə 
# Urmi_C A51 (6) RiRʾ 
# Urmi_C A56 (4) RnèrvRu| 
# Urmi_C B2 (12) PšusèPva| 

print(total)
for w in sorted(set(result)):
    print(repr(w))


Urmi_C A3 (70) EhotèlEux,| 
Urmi_C A3 (72) EhotèlEux.| 
Urmi_C A3 (75) EhotèlEu,| 
Urmi_C A3 (77) EhotèlEu| 
Urmi_C A41 (6) EpencílEə 
Urmi_C A41 (12) ElístEət 
Urmi_C A43 (22) RpovàrRə 
Urmi_C A43 (23) RpovàrRə 
Urmi_C A51 (6) RiRʾ 
Urmi_C A56 (4) RnèrvRu| 
Urmi_C B2 (12) PšusèPva| 
120565
'Arm'
'Az'
'E'
'Eu'
'Eux'
'Eə'
'Eət'
'F'
'Ge'
'P'
'Pva'
'R'
'Ru'
'Rə'
'Rʾ'


TODO: handle word or section markers, such as a capital P around a word (meaning Persian loanword?) or 'Az' around several words. In the Word and PDF versions, the markers are set in superscript and the marked words in roman type (instead of italics), but in our text we will need to recover them in other ways (such as by looking for capital letters that are not at the beginning of a word, and matching them with an earlier capital letter).
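
As a first, tentative step towards this TODO, a regular expression with a backreference can pair an opening marker with the same closing marker inside a single word. This is a rough sketch only: the helper name `split_marked` is hypothetical, it handles markers around one word (not around several words), and it would misfire on an unmarked word that happens to start with a marker and contain it again. The marked examples are taken from the output above; 'xmara' is a made-up unmarked word.

```python
import re

known_marks = ['Arm', 'Az', 'Ge', 'E', 'F', 'P', 'R']

# marker, marked word (as short as possible), same marker again, remainder
re_marked = re.compile(r'^({})(.+?)\1(.*)$'.format('|'.join(known_marks)))


def split_marked(word):
    """Return (marker, marked_word, remainder), or None if no marker pair is found."""
    m = re_marked.match(word)
    return m.groups() if m else None


for w in ['EhotèlEux,|', 'RiRʾ', 'PšusèPva|', 'xmara']:
    print(w, split_marked(w))
```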

Also TODO: make a table of letters, separated into vowels and consonants, with all combinations of diacritic symbols that occur, and consult with Geoffrey about which combinations are significant (e.g., 'a' and 'á' are the same sound, but 'c' and 'č' are different sounds/consonants).

In [16]:
# Unicode range for combining characters
combining_characters = range(0x300, 0x370)
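
This range could serve as a starting point for the table of letters and diacritics mentioned in the TODO above. The sketch below groups each base character with the combining marks that follow it and counts the resulting clusters. The helper name `letter_clusters` is hypothetical and the sample string is made up.

```python
import collections
import unicodedata

combining_characters = range(0x300, 0x370)


def letter_clusters(s):
    """Yield each base character together with its following combining marks."""
    cluster = ''
    for c in unicodedata.normalize('NFD', s):
        if ord(c) in combining_characters and cluster:
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
    if cluster:
        yield cluster


sample = 'c\u030ca\u0301ca'  # made-up string: č á c a
counts = collections.Counter(letter_clusters(sample))
for cluster, n in sorted(counts.items()):
    print(ascii(cluster), n)
```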

To be sure we did not lose anything of value, here is a list of all non-empty strings in the `ignored` list:

In [17]:
[e for e in ignored if e and e != '\n']

[' The name Čuxo means ‘one who wears the woolen čuxa garment’.',
 '4\n',
 '3\n',
 '?\n',
 '1\n',
 'THE NEO-ARAMAIC DIALECT OF\n',
 'THE ASSYRIAN CHRISTIANS OF URMI\n',
 'GEOFFREY KHAN\n',
 '\x0c\x0c\n',
 'VOLUME 4\n',
 'TEXTS\n',
 '\x0c\x0c\n',
 'Contents\n',
 'FOLKTALES\n',
 'A1 The Bald Man and the King (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t9\n',
 'A2 Women are Stronger than Men (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t13\n',
 'A3 Axiqar (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t15\n',
 'A4 Is there a Man with No Worries? (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t21\n',
 'A5 Women do Things Best (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t22\n',
 'A6 The Dead Rise and Return (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t24\n',
 'A7 A Pound of Flesh (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t25\n',
 'A8 The Loan of a Cooking Pot (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t27\n',
 'A9 Much Ado About Nothing (Yulia Davudi, +Hassar +Baba-čanɟa, N)\t27\n',
 'A10 A Visit from Harun ar-Rashid (Yulia Davudi, +