## Experiments with the Text Fabric corpora of Barwar and Christian Urmi

With the Barwar and Christian Urmi (Urmi_C) corpora converted from Word format to Text Fabric, we can perform all kinds of analyses on the texts.

First, let's load the corpus:

In [1]:
from tf.fabric import Fabric

In [2]:
TF = Fabric(locations='tf/')

This is Text-Fabric 7.8.0
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

13 features found and 0 ignored


In [3]:
N = TF.load('''

text_id paragraph line word char otype title dialect filename

''')

N.makeAvailableIn(globals())
print()

  0.00s loading features ...
   |     0.01s B otype                from tf
   |     0.00s Not enough info for sections in otext, section functionality will not work
   |     0.00s Not enough info for structure in otext, structure functionality will not work
   |     0.15s B char                 from tf
   |     0.00s B text_id              from tf
   |     0.00s B line                 from tf
   |     0.00s B title                from tf
   |     0.00s B dialect              from tf
   |     0.00s B filename             from tf
  2.01s All features loaded/computed - for details use loadLog()



## Definitions

The corpus is divided into texts. Texts are divided into paragraphs, (numbered) lines, sentences, subsentences, words, morphemes, and finally characters.

Paragraphs are units of text that are separated by a newline character in the Word files. Lines are units that start with a line (or verse) number in round brackets. Sentences are units of text terminated by a period (full stop), an exclamation mark, or a question mark. Subsentences are parts of sentences separated by commas. Words are separated by whitespace or punctuation. Morphemes are parts of words separated by single or double hyphens. Characters are letters, possibly with combining diacritics, or any other character symbol.

These definitions do not necessarily coincide with accepted linguistic terms. For example, what is called a 'morpheme' here may be a word composed of several morphemes, but for the sake of simplicity we use the term for the separated parts of a word. A word is a group of syllables with one stress marker.
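As a rough illustration of these definitions, the following sketch segments a transcribed string with regular expressions. It is not the conversion code used to build the corpus: the function names, the treatment of `|` as a plain separator, and the reading of "double hyphen" as `--` are all assumptions made for the example.

```python
import re

def sentences(text):
    # Assumption: '|' only separates units here, so replace it with a space.
    text = text.replace('|', ' ')
    # A sentence ends with '.', '!' or '?'; the terminator stays attached.
    return [s.strip() for s in re.split(r'(?<=[.!?])', text) if s.strip()]

def subsentences(sentence):
    # Subsentences are separated by commas.
    return [s.strip() for s in sentence.split(',') if s.strip()]

def words(text):
    # Words are separated by whitespace or punctuation.
    return [w for w in re.split(r'[\s.,!?|]+', text) if w]

def morphemes(word):
    # Morphemes are separated by a single or double hyphen.
    return [m for m in re.split(r'-{1,2}', word) if m]

sample = 'xa-màlka| kút-yum ðà-brata gawə́rwa.| mbádla qayə́mwa qaṭə̀lwala.|'
for sentence in sentences(sample):
    # Print each sentence as a list of words, each word as a list of morphemes.
    print([morphemes(w) for w in words(sentence)])
```

Splitting on a lookbehind keeps the sentence terminator with its sentence, which makes it possible to tell statements, exclamations, and questions apart later on.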

## Example

Let us have a look at a text:

In [4]:
text = F.otype.s('text')[0]
print(f'Text {text}: {F.text_id.v(text)} {F.title.v(text)}\n')

for line in F.otype.s('line')[:2]:
    print(line, T.text(line))

Text 730893: A14 TALES FROM THE 1001 NIGHTS

731561 xa-màlka| kút-yum ðà-brata gawə́rwa.| mbádla qayə́mwa qaṭə̀lwala.| wăzī̀r| xðírre xðìrre,| bnáθa prìqla.| kút-yum ðà,| lìθ.| ʾáwwa wăzī́r ʾíθwale ða-bràta.| ʾa-bráta mə́ra ṭla-wằzir,| ṭla-bába dìya,| mə́ra bábi ʾána nàbəlli| gawrànne ʾáwwa málka| mparqànnux m-áyya qə́ṣṣət.|
731562 qìmtɛla| ʾítwala ða-qàṭu,| nubàltəlla mə́nna díya.| nubáltəlla qáṭu mə́nna dìya,| gwìrtəlle málka.| ʾaw-dmìxɛle,| píštɛla mtanóye ða-qə̀ṣṣət| ṭla-qàṭu.|


Since I find it cumbersome to refer to otypes and nodes while navigating a text,
I wrote a small class to provide a more object-oriented interface to the texts:

In [5]:
class Node:
    
    def __init__(self, node):
        self.node = node
        self.otype = F.otype.v(node)
    
    def __repr__(self):
        return f'<Node {self.node} otype {repr(self.otype)}>'
        
    def __str__(self):
        return T.text(self.node)
        
    def __getattr__(self, name):
        # first try to give the value of a feature called `name`
        if hasattr(F, name):
            f = getattr(F, name)
            if self.node in f.data:
                return f.v(self.node)
            # if node does not have feature, try embedding nodes
            else:
                for node in L.u(self.node):
                    if node in f.data:
                        return f.v(node)
        # check for upward embedding node of otype `name`
        n = L.u(self.node, name)
        if len(n):  # in case of multiple embedding nodes of otype `name`, return first one
            return Node(n[0])
        # if nothing is returned, try looking for embedded nodes        
        if name.endswith('s'): # if name has plural form, look for otype of `name` minus 's'
            # return a list of downward Nodes of otype `name` minus 's'
            # TODO how to check if self.otype has downward nodes of otype `name`?
            return [Node(n) for n in L.d(self.node, name[:-1])]
        else:
            raise AttributeError(f'Otype {self.otype} has no attribute {name}.')
    
    @property
    def text(self):  # make text() a property method to keep it lazy
        return str(self)
            

def nodelist(otype):
    return [Node(n) for n in F.otype.s(otype)]

Let's get a list of all texts:

In [6]:
texts = nodelist('text')

In [7]:
texts[0]

<Node 730893 otype 'text'>

In [8]:
texts[0].title

'TALES FROM THE 1001 NIGHTS'

In [9]:
texts[0].node

730893

In [10]:
texts[0].text[:500] + ' ...'

'xa-màlka| kút-yum ðà-brata gawə́rwa.| mbádla qayə́mwa qaṭə̀lwala.| wăzī̀r| xðírre xðìrre,| bnáθa prìqla.| kút-yum ðà,| lìθ.| ʾáwwa wăzī́r ʾíθwale ða-bràta.| ʾa-bráta mə́ra ṭla-wằzir,| ṭla-bába dìya,| mə́ra bábi ʾána nàbəlli| gawrànne ʾáwwa málka| mparqànnux m-áyya qə́ṣṣət.|qìmtɛla| ʾítwala ða-qàṭu,| nubàltəlla mə́nna díya.| nubáltəlla qáṭu mə́nna dìya,| gwìrtəlle málka.| ʾaw-dmìxɛle,| píštɛla mtanóye ða-qə̀ṣṣət| ṭla-qàṭu.|mə́ra ṭla-d-à-q ...'

In [11]:
print(texts[0].lines[0].line, texts[0].lines[0].text)

1 xa-màlka| kút-yum ðà-brata gawə́rwa.| mbádla qayə́mwa qaṭə̀lwala.| wăzī̀r| xðírre xðìrre,| bnáθa prìqla.| kút-yum ðà,| lìθ.| ʾáwwa wăzī́r ʾíθwale ða-bràta.| ʾa-bráta mə́ra ṭla-wằzir,| ṭla-bába dìya,| mə́ra bábi ʾána nàbəlli| gawrànne ʾáwwa málka| mparqànnux m-áyya qə́ṣṣət.|


In [12]:
line = texts[0].lines[0]
print(line.dialect, line.text_id, line.title, line.filename)
print(line)

Barwar A14 TALES FROM THE 1001 NIGHTS bar text A14.html
xa-màlka| kút-yum ðà-brata gawə́rwa.| mbádla qayə́mwa qaṭə̀lwala.| wăzī̀r| xðírre xðìrre,| bnáθa prìqla.| kút-yum ðà,| lìθ.| ʾáwwa wăzī́r ʾíθwale ða-bràta.| ʾa-bráta mə́ra ṭla-wằzir,| ṭla-bába dìya,| mə́ra bábi ʾána nàbəlli| gawrànne ʾáwwa málka| mparqànnux m-áyya qə́ṣṣət.|


To illustrate that units like paragraph, line, and sentence are not strictly hierarchical, we take a poetic text as an example: Barwar A49, *The crow and the cheese*. (Finding the text may actually be more convenient with the TF search functions than with a list comprehension.)

In [13]:
barwar_a49 = [t for t in texts if t.dialect == 'Barwar' and t.text_id == 'A49'][0]

In [14]:
print(barwar_a49.title)
print('Paragraphs:', len(barwar_a49.paragraphs))
print('Lines:', len(barwar_a49.lines))

THE CROW AND THE CHEESE
Paragraphs: 23
Lines: 6


Whereas most `paragraph`s contain one or more `line`s, here there are fewer lines than paragraphs. Printing first the lines, and then the paragraphs within each line, shows why:

In [15]:
for line in barwar_a49.lines:
    print(line.line, line)

print()

for line in barwar_a49.lines:
    print(f'({line.line}) ', '\n    '.join([p.text for p in line.paragraphs]))

1 qarə́kke ṱ-íla kùmta|   xá-yoma ʾay-tìwta| l-ʾilána b-púmma gùpta| 
2 θéle téla pandàna| mtuxmə́nne ṱ-áwəð nxìlθa| ṭla-madréla b-xerə̀tta| šaqə́lla mə́nna gùpta| 
3 mə́re ʾən-qáləx mdáme ʾə̀lləx| xa-xéna lit-daxwàθəx| 
4 qíx qréla b-gáwət qàla| gúpta mən-púmma npìlla| téla mo-ṭréle ʾə̀lla!| 
5 šeðánta qəm-šaqə̀lla| qəm-ʾaryála pəšmànta| 
6 téla mə́re šeðànta| la-mháymnət kul-maxkɛ́θa basìmta| téla mére šeðànta| la-mháymnət kul-maxkɛ́θa basìmta|

(1)  qarə́kke ṱ-íla kùmta|  
     xá-yoma ʾay-tìwta|
     l-ʾilána b-púmma gùpta|
     
(2)  θéle téla pandàna|
     mtuxmə́nne ṱ-áwəð nxìlθa|
     ṭla-madréla b-xerə̀tta|
     šaqə́lla mə́nna gùpta|
     
(3)  mə́re ʾən-qáləx mdáme ʾə̀lləx|
     xa-xéna lit-daxwàθəx|
     
(4)  qíx qréla b-gáwət qàla|
     gúpta mən-púmma npìlla|
     téla mo-ṭréle ʾə̀lla!|
     
(5)  šeðánta qəm-šaqə̀lla|
     qəm-ʾaryála pəšmànta|
     
(6)  téla mə́re šeðànta|
     la-mháymnət kul-maxkɛ́θa basìmta|
     téla mére šeðànta|
     la-mháymnət kul-maxkɛ́θa basìmta|
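
The non-hierarchical relation can be pictured by treating every unit as a range of positions in the text: a unit embeds another whenever its range covers the other's, regardless of which unit type is nominally "higher". The sketch below uses made-up ranges and a hypothetical `embedded` helper; it only mimics the idea and does not call Text-Fabric.

```python
def embedded(outer, units):
    """Return the units whose (start, end) range lies inside `outer`."""
    start, end = outer
    return [u for u in units if start <= u[0] and u[1] <= end]

# Made-up positions. Prose: one paragraph holds two numbered lines.
prose_paragraph = (1, 20)
prose_lines = [(1, 9), (10, 20)]

# Verse: one numbered line holds three paragraphs (one per verse line).
verse_line = (21, 50)
verse_paragraphs = [(21, 30), (31, 40), (41, 50)]

print(len(embedded(prose_paragraph, prose_lines)))  # lines inside the paragraph
print(len(embedded(verse_line, verse_paragraphs)))  # paragraphs inside the line
```

Because embedding depends only on the ranges, the same lookup works in both directions, which is why `line.paragraphs` and `paragraph.lines` can both be meaningful.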

## Textual problems

After `+` was added as a word character alongside the regular letter characters, the number of `morpheme`s dropped from 120141 to 120134. The cause turned out to be seven `morpheme`s that contain a `+` character at a non-initial position.

Three possible causes:
- the `+` is accidentally placed after initial alaph;
- a space is omitted between two `words`, the second one with initial `+`;
- a hyphen is omitted between two `morphemes`, the second one with initial `+`.

In [16]:
morphemes = nodelist('morpheme')
print(len(morphemes))

120134


In [17]:
morphemes = nodelist('morpheme')

# this search using python string operators is slow.
# maybe TF has builtin string search capabilities which are faster?
for m in [m for m in morphemes if m.text[0] != '+' and '+' in m.text]:
    print(m.dialect, m.filename, m.text_id, m.line, repr(m.text))

Urmi_C cu vol 4 texts.html A 1 38 'ʾ+átrət'
Urmi_C cu vol 4 texts.html A3 53 'ʾəhtiyɑ̄̀j+ʾallux'
Urmi_C cu vol 4 texts.html A4 10 'ʾ+òtax'
Urmi_C cu vol 4 texts.html A47 1 'nùra+bəlláyələ'
Urmi_C cu vol 4 texts.html A47 16 'k̭a+tàla'
Urmi_C cu vol 4 texts.html B1 13 'k̭a+ʾaturáyə'
Urmi_C cu vol 4 texts.html B5 3 'ʾ+al'


There are likely more cases where spaces or hyphens have been omitted.

A way to check this could be to look at the stress markers (`acute accent` and `grave accent`), which should occur exactly once in each `word`. By that criterion, `ʾəhtiyɑ̄̀j+ʾallux`, `k̭a+tàla`, and `k̭a+ʾaturáyə` each carry a single stress marker, so each is one word in which a hyphen is missing, whereas `nùra+bəlláyələ` carries two stress markers, so a space has been omitted between two words.
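
The stress-marker heuristic can be sketched as follows. The helper names `count_stress` and `diagnose` are hypothetical, and the rule is only as reliable as the assumption that each word carries exactly one stress marker. The forms are normalized to NFD first, so that precomposed letters such as `á` are counted as well.

```python
import unicodedata

STRESS = {'\u0300', '\u0301'}  # combining grave and acute accents

def count_stress(form):
    # Decompose first, so precomposed accented letters are counted too.
    return sum(c in STRESS for c in unicodedata.normalize('NFD', form))

def diagnose(form):
    """Guess why a '+' occurs inside `form` (heuristic, not authoritative)."""
    if form.startswith('ʾ+'):
        return 'plus after initial alaph'
    if count_stress(form) > 1:
        return 'space omitted'
    # One stress marker (or none): treated as a single word.
    return 'hyphen omitted'

forms = ['ʾ+átrət', 'ʾəhtiyɑ̄̀j+ʾallux', 'ʾ+òtax', 'nùra+bəlláyələ',
         'k̭a+tàla', 'k̭a+ʾaturáyə', 'ʾ+al']
for form in forms:
    print(form, '->', diagnose(form))
```

A form without any stress marker also falls into the last branch, so the diagnosis should be read as a starting point for manual checking rather than as a verdict.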

Assuming that every word does indeed need exactly one stress marker, any word with no stress marker, or with more than one, contains a mistake.

In [18]:
# search for words with no or multiple stress markers
words = nodelist('word')

In [19]:
no_stress = [w for w in words if not ('\u0300' in str(w) or '\u0301' in str(w))]

print('Number of words with no stress markers:', len(no_stress))
w = no_stress[0]
print('Example:', w.dialect, repr(w.filename), f'{w.text_id}:{w.line}', repr(w.text))

Number of words with no stress markers: 716
Example: Barwar 'bar text A14.html' A14:8 'Kărīm-addīn'


In [20]:
multiple_stress = [w for w in words if len([c for c in w.chars if '\u0300' in str(c) or '\u0301' in str(c)]) > 1]

print('Number of words with multiple stress markers:', len(multiple_stress))
w = multiple_stress[0]
print('Example:', w.dialect, repr(w.filename), f'{w.text_id}:{w.line}', repr(w.text))

Number of words with multiple stress markers: 97
Example: Barwar 'bar text a29.html' A29:20 'hó-b-ʾíðe'


Since 716 + 97 = 813 words have no stress marker or more than one, checking them all for omitted spaces or hyphens will be quite a challenge, especially as in most cases the correct form may not be obvious.