# Narrative Verbs in Barwar
### Cody Kingham and Geoffrey Khan

<a href="https://github.com/CambridgeSemiticsLab"><img src="../docs/images/CambridgeU_BW.png" height="100pt" width="200pt" align='left'></a>

In [173]:
! echo "last updated"; date

last updated
Mon  3 Feb 2020 12:03:34 GMT


## Introduction

In the story corpus, narratives are told using two past verbal forms: qṭilɛle (perfect) and qṭille (preterite). These forms interchange. The qṭille form appears to cluster around the onset of narratives, in the scene-setting section, which also has a concentration of adverbials.

## Research Questions
Search for four sets of verbs: (i) qṭilɛle forms, (ii) qṭille forms, (iii) qəm-qaṭəlle forms, and (iv) qaṭəl forms with initial /ʾ/, i.e. qaṭəl forms beginning with ʾa-.

For each of these groups, investigate the following:
1. How often are such forms clause-initial, i.e. without an explicit subject noun or other constituent before them? This could be established by checking whether the string is immediately preceded by `.|` or `,|`.

2. How often are these strings accompanied by an adverbial in the same sentence? Typical narrative adverbials are:

```
Adverbs containing the word yoma ‘day’ or yome ‘days’ or yomət ‘the day of’
b-lɛle ‘at night’
qedamta ‘in the morning’
mbadla ‘early in the morning’
ʾaṣərta ‘in the evening’
xarθa ‘afterwards’
ga, gaye ‘time, times’
xa-ga
```
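
Both checks can be prototyped on plain token lists before they are wired to the corpus. The sketch below is only illustrative: it assumes tokens are whitespace-separated strings that keep their trailing `.|` or `,|` punctuation, and the helper names (`strip_accents`, `is_clause_initial`, `find_adverbials`) and the sample token list are hypothetical, not corpus data.

```python
import re
import unicodedata

# Adverbials from the list above (matched as whole tokens).
ADVERBIALS = ['yoma', 'yome', 'yomət', 'b-lɛle', 'qedamta', 'mbadla',
              'ʾaṣərta', 'xarθa', 'ga', 'gaye', 'xa-ga']

# Combining stress/length marks to drop before comparing strings.
ACCENTS = '[\u0300\u0301\u0303\u0304\u0306\u0308]'

def strip_accents(text):
    """Decompose to NFD, then remove the combining accents."""
    return re.sub(ACCENTS, '', unicodedata.normalize('NFD', text))

def is_clause_initial(tokens, i):
    """True if token i opens the text or follows a token ending in .| or ,|"""
    return i == 0 or tokens[i - 1].endswith(('.|', ',|'))

def find_adverbials(tokens):
    """Return the accent-stripped tokens that match a listed adverbial."""
    targets = {strip_accents(adv) for adv in ADVERBIALS}
    bare = (strip_accents(tok).rstrip('.,|') for tok in tokens)
    return [tok for tok in bare if tok in targets]

# Made-up token sequence; qṭilɛle and qṭille stand in for real verb forms.
tokens = ['xá-ga', 'qedamta,|', 'qṭilɛle', 'b-lɛle.|', 'qṭille']
print([is_clause_initial(tokens, i) for i in (2, 4)])
print(find_adverbials(tokens))
```

Matching whole tokens means a multi-word adverbial built on yoma is caught through the yoma token itself. Substring matching would be broader but would also pick up unrelated words that happen to contain `ga`.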

## Technical Brief

In this notebook, we will use the [NENA text-corpus](https://github.com/CambridgeSemiticsLab/nena_corpus) in a [Text-Fabric format](https://github.com/CambridgeSemiticsLab/nena_tf) (for TF see [here](https://github.com/annotation/text-fabric)). Our corpus contains a number of linguistic encodings which will be useful for the analysis, especially: 

1. word tokenization
2. intonation group boundaries
3. sentence tokenization

These linguistic units are modeled as nodes within a graph. The nodes have associated features that can be called during the analysis. For example, a word has a plain-text feature that can be called for interacting with string text. The `text-fabric` Python module provides a set of classes and methods for reading in this graph and navigating the nodes and features. We will especially make heavy use of the `F` ("feature"), `L` ("level"), and `T` ("text") classes for navigating features, hierarchical levels, and plain text data.

The principal task is to identify qṭilɛle (perfect) and qṭille (preterite) verbs. To do this, we have a list of endings we can expect to find:

```
(i) For qṭilɛle forms search for strings ending in -ɛle and -ɛla (i.e. 3ms and 3fs). The hyphen means a wild card, i.e. any characters within the same word.

(ii) For qṭille forms search for strings ending in -ele, -ela, -ble, -bla, -dle, -dla, -fle, -fla, -gle, -gla, -jle, -jla, -kle, -kla, -mle, -mla, -nne, -nna, -ple, -pla, -qle, -qla, -rre, -rra, -sle, -sla, -ṣle, -ṣla, -tle, -tla, -ṭle, -ṭla, -wle, -wla, -xle, -xla, -zle, -zla. The hyphen means a wild card, i.e. any characters within the same word. This search may pull out lots of inappropriate examples with -L suffixes that are objects, so one way of refining the results would be to add i before consonants, i.e. -ible, -ibla, -idle, -idla etc. The /i/ is the stem vowel of the peʿal form (the basic form).

(iii) For qəm-qaṭəlle forms search for forms beginning with qəm- in the results of the search for group (ii).

(iv) For qaṭəl forms beginning with ʾa-, I suggest you search for the onsets of the most common verbs ʾazə-, ʾazi (to go), ʾamər, ʾamr- (to say), ʾaθ- (to come), ʾasəq, ʾasq- ‘to go up’, ʾaxəl, ʾaxl- ‘to eat’, ʾarəq, ʾarq- ‘to run’.
```

For the string matching we can use the `re` Python module. All other rules and processing can be done with Python code. We will seek to store and visualize the resulting data using the standard data science modules: `pandas` and `matplotlib`. 
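
As a rough illustration of the string matching for groups (iii) and (iv), the onsets can be joined into a single alternation. This is a sketch under assumptions: it matches plain Unicode strings with stress accents already removed rather than the `trans_f` encoding, it treats every listed form as an onset, and the names `ONSETS`, `a_initial`, and `qam_prefix` are hypothetical.

```python
import re

# Onsets of the common verbs listed under (iv), with the trailing hyphens removed.
ONSETS = ['ʾazə', 'ʾazi', 'ʾamər', 'ʾamr', 'ʾaθ', 'ʾasəq', 'ʾasq',
          'ʾaxəl', 'ʾaxl', 'ʾarəq', 'ʾarq']

# (iv): the word starts with one of the onsets.
a_initial = re.compile('^(?:' + '|'.join(re.escape(o) for o in ONSETS) + ')')

# (iii): the word begins with qəm- (to be applied to the group (ii) results).
qam_prefix = re.compile(r'^qəm-')

for word in ['ʾamər', 'ʾazəl', 'xarθa', 'qəm-qaṭəlle']:
    print(word, bool(a_initial.match(word)), bool(qam_prefix.match(word)))
```

Because `ʾaθ` is short, it will also match any other word that happens to begin with those three characters, so the hits for group (iv) would still need manual inspection.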

<hr>

# Python

In [149]:
# helper modules
import re
import csv
import collections # advanced data containers
import unicodedata

# data science modules
import pandas as pd
import matplotlib.pyplot as plt

# Text-Fabric and corpus load
from tf.app import use
nena = use('nena')
F, L, T = nena.api.F, nena.api.L, nena.api.T

	connecting to online GitHub repo annotation/app-nena ... connected
Using TF-app in /Users/cody/text-fabric-data/annotation/app-nena/code:
	#9ec58f223a6ba6817347279da277c1efadae550a (latest commit)
	connecting to online GitHub repo CambridgeSemiticsLab/nena_tf ... connected
Using data in /Users/cody/text-fabric-data/CambridgeSemiticsLab/nena_tf/tf/0.01:
	rv0.032=#dd02c45f4294b97f9ecd8d4c3809aaf4153e2843 (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


Below we access the first word in the NENA corpus as an example of how we will navigate the data.

In [20]:
first_word = L.u(1, 'word')[0]

print(first_word)

739771


Note the number above, which is a unique identifier for this word. That number is used to look up the word's features. Below is an example of using `T` to get a plain-text representation of this word.

In [26]:
T.text(first_word)

'xá-ga '

Next we access the `trans_f` feature of the same node.

In [27]:
F.trans_f.v(first_word)

"xa'-ga"

Note that the `.v` refers to the "value" of the feature.

To aid the input of verb ending patterns, we will use this one-to-one transcription feature `trans_f`. The validated patterns can be stored in the dictionary below.

In [67]:
verb_patterns = {} # a dictionary for storing all verb patterns
verbs = collections.defaultdict(list)

## qṭilɛle forms

```
(i) For qṭilɛle forms search for strings ending in -ɛle and -ɛla (i.e. 3ms and 3fs). The hyphen means a wild card, i.e. any characters within the same word.
```

In [68]:
barwar = nena.search('dialect dialect=Barwar')[0][0] # get Barwar dialect node

  0.00s 1 result


In [69]:
qtilele = verb_patterns['qtilele'] = re.compile(r'.*\$le$|.*\$la$')

# find words that match the qtilele pattern
for word in L.d(barwar,'word'):
    if qtilele.match(F.trans_f.v(word)): # test match
        verbs['qtilele'].append(word) # save hit
        
print(f'{len(verbs["qtilele"])} matches found...')

1827 matches found...


In order to sample the results, we write a short function to show them in context.

In [122]:
def get_sample(words, **tf_kwargs):
    """Show words and their contexts"""
    for w in words:
        sentence = L.u(w, 'sentence')[0]
        nena.plain(sentence, **tf_kwargs, highlights={w})

We will look at the first 10 results.

In [123]:
get_sample(verbs['qtilele'][:10])

### Further examination

We want to export these forms to spreadsheets so they can be examined more closely. There should be two kinds of spreadsheet. The first is an exhaustive list of all results, sorted alphabetically. The second is a condensed list of all unique tokens (surface forms) along with their frequencies within the results.

We prepare one function for each of these tasks, plus a shared writer function, which we can also use throughout the study.

In [124]:
nena.sectionStrFromNode(1)

'Barwar, A Hundred Gold Coins, Ln. 1'

In [171]:
# accent list to strip out for token counts
# we should make this a TF feature
accents = '\u0300|\u0301|\u0304|\u0306|\u0308|\u0303'

def writecsv(data, filename, sep='\t'):
    """Writer function for dict spreadsheet data"""
    with open(filename, 'w', encoding='utf16', newline='') as outfile:
        header = data[0].keys()
        writer = csv.DictWriter(outfile, delimiter=sep, fieldnames=header)
        writer.writeheader()
        writer.writerows(data)
    
def export_long_results(results, filename, sep='\t'):
    """Export an exhaustive list of results"""
    # gather all data from words
    data = []
    for word in results:
        plain = T.text(word)
        sentence = T.text(L.u(word,'sentence')[0])
        ref = nena.sectionStrFromNode(word)
        data.append({
            'match': plain, 
            'sentence': sentence, 
            'ref': ref,
            'node': word,
        })
    # sort it and export
    data = sorted(data, key=lambda k: k['match'])
    writecsv(data, filename, sep)
        
def export_summary_results(results, filename, sep='\t'):
    """Export a summarized list"""
    token_counts = collections.Counter()
    for word in results:
        plain = F.text.v(word).replace(F.end.v(word),'')
        plain = unicodedata.normalize('NFD', plain) # decompose for accent stripping
        plain = re.sub(accents, '', plain) # strip accents
        token_counts[plain] += 1
    data = [
        {'token': tok, 'frequency': count} 
            for tok, count in token_counts.most_common()
    ]
    writecsv(data, filename, sep)

We make the export below.

In [172]:
export_long_results(verbs['qtilele'], 'inspect/qtilele.tsv')
export_summary_results(verbs['qtilele'], 'inspect/qtilele_summary.tsv')
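
The summary export decomposes each token to NFD before stripping accents. The reason can be shown in isolation: a precomposed character such as `á` contains no separate combining mark for the regular expression to remove. The sketch below uses the form `xá-ga` seen earlier; `strip_accents` is a hypothetical helper name, and the list of variant spellings is constructed for illustration only.

```python
import re
import collections
import unicodedata

# Same combining marks as the accent list used for the summary export.
accents = '\u0300|\u0301|\u0304|\u0306|\u0308|\u0303'

def strip_accents(token):
    """Decompose to NFD so accents become separate code points, then drop them."""
    decomposed = unicodedata.normalize('NFD', token)
    return re.sub(accents, '', decomposed)

precomposed = 'x\u00e1-ga'  # 'xá-ga' with a single precomposed á

# Without decomposition the regex finds nothing to remove.
print(re.sub(accents, '', precomposed) == precomposed)

# With decomposition, accented variants collapse into one token.
forms = [precomposed, 'xa-ga', 'x\u00e0-ga']
counts = collections.Counter(strip_accents(f) for f in forms)
print(counts)
```

Marks outside the list, such as the dot below in ṭ and ṣ (U+0323), are left in place, so emphatic consonants stay distinct from their plain counterparts in the token counts.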

## qṭille forms

```
(ii) For qṭille forms search for strings ending in -ele, -ela, -ble, -bla, -dle, -dla, -fle, -fla, -gle, -gla, -jle, -jla, -kle, -kla, -mle, -mla, -nne, -nna, -ple, -pla, -qle, -qla, -rre, -rra, -sle, -sla, -ṣle, -ṣla, -tle, -tla, -ṭle, -ṭla, -wle, -wla, -xle, -xla, -zle, -zla. The hyphen means a wild card, i.e. any characters within the same word. This search may pull out lots of inappropriate examples with -L suffixes that are objects, so one way of refining the results would be to add i before consonants, i.e. -ible, -ibla, -idle, -idla etc. The /i/ is the stem vowel of the peʿal form (the basic form).
```

In [None]:
qtille = verb_patterns['qtille']
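
One possible way to build a pattern from the endings in (ii) is sketched below. It is not the pattern stored in `verb_patterns`. It assumes matching against NFD-normalized plain text with stress accents already stripped, rather than the `trans_f` encoding, and it assumes the vowel-initial endings -ele and -ela are left unchanged in the refined search. The names `endings`, `qtille_broad`, and `qtille_refined` are hypothetical, and the test strings are synthetic, not corpus forms.

```python
import re
import unicodedata

def nfd(text):
    """Normalize to NFD so ṣ and ṭ compare consistently."""
    return unicodedata.normalize('NFD', text)

# Consonants followed by -le/-la in list (ii).
L_CONSONANTS = ['b', 'd', 'f', 'g', 'j', 'k', 'm', 'p', 'q',
                's', 'ṣ', 't', 'ṭ', 'w', 'x', 'z']

# n and r appear doubled in the list (-nne, -rre) instead of taking -le/-la.
consonant_endings = ['nne', 'nna', 'rre', 'rra']
consonant_endings += [c + v for c in L_CONSONANTS for v in ('le', 'la')]
vowel_endings = ['ele', 'ela']
endings = vowel_endings + consonant_endings

def compile_endings(items):
    """Compile a regex that matches any of the endings at the end of a word."""
    return re.compile('(?:' + '|'.join(nfd(e) for e in items) + ')$')

# Broad search: every ending in the list.
qtille_broad = compile_endings(endings)

# Refined search: require the stem vowel i before the consonant.
qtille_refined = compile_endings(vowel_endings + ['i' + e for e in consonant_endings])

for word in ['ptixle', 'xəxle', 'zwinne']:
    w = nfd(word)
    print(word, bool(qtille_broad.search(w)), bool(qtille_refined.search(w)))
```

The refined pattern trades recall for precision: it drops forms whose final consonant is not preceded by i, which should exclude many forms where the -L suffix marks an object, but it will also miss any qṭille forms with a different stem vowel.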