# Narrative Verbs in Barwar
### Geoffrey Khan and Cody Kingham

<a href="https://github.com/CambridgeSemiticsLab"><img src="../docs/images/CambridgeU_BW.png" height="100pt" width="200pt" align='left'></a>

In [1]:
! echo "last updated"; date

last updated
Mon 10 Feb 2020 16:18:09 GMT


## Introduction

In the Barwar story-corpus, narratives are told using two past verbal forms: qṭilɛle (perfect) and qṭille (preterite). These forms interchange. It seems the qṭille form clusters around the onset of narratives in a section of narrative that sets the scene and so has a concentration of adverbials. 

## Research Questions

**In this notebook we seek to test that hypothesis that the qṭille narrative occurs more frequently with adverbials in the same clause than the qṭilɛle narrative form.** 

In order to test the hypothesis, we look for the following correlations between the two verb groups:

1. **How often are particular verb forms clause-initial, i.e. without an explicit subject noun or other constituent before them?**
    - This could be established by checking whether the string is immediately preceded by .| or ,|
2. **How often are particular verb forms accompanied by an adverbial in the same sentence?**

## Dataset

We will make a selection of key verbs and adverbial phrases for this analysis. The selections will be as follows:

### qṭilɛle

```
zilɛle 'he went', ziltɛla 'she went'
ʾəθyɛle  'he came', θiθɛla 'she came'
qimɛle 'he got up', qimtɛla 'she got up'
ṣəlyɛle 'he went down', ṣliθɛla 'she went down'
siqɛle 'he went up', siqtɛla 'she went up'
diṛɛle 'he returned', diṛtɛla 'she returned'
wirɛle 'he entered', wirtɛla 'she entered'
pliṭɛle 'he went out', pliṭṭɛla 'she went out'
riqɛle 'he ran', riqtɛla 'she ran'
tiwɛle 'he sat', tiwtɛla 'she sat'
pišɛle 'he became', pištɛla 'she became'
šqilɛle 'he took', šqiltɛla 'she took' 
npilɛle 'he fell', npiltɛla 'she fell'
məṭyɛle 'he arrived', mṭiθɛla 'she arrived'
```

### qṭille

```
zille 'he went', zilla 'she went'
θele  'he came', θela 'she came'
qimle 'he got up', qimla 'she got up'
ṣlele 'he went down', ṣlela 'she went down'
siqle 'he went up', siqla 'she went up'
diṛre 'he returned', diṛra she returned'
wirre 'he entered', wirra 'she entered'
pliṭle 'he went out', pliṭla 'she went out'
riqle 'he ran', riqla 'she ran'
tiwle 'he sat', tiwla 'she sat'
pišle 'he became', pišla 'she became'
šqille 'he took', šqilla 'she took' 
npille 'he fell', npilla 'she fell'
mṭele 'he arrived', mṭela 'she arrived'
```

### adverbials

```
- Adverbs containing the word yoma ‘day’ or yome ‘days’ or yomət ‘the day of’
- b-lɛle ‘at night’
- qedamta ‘in the morning’
- mbadla ‘early in the morning’
- ʾaṣərta ‘in the evening’
- xarθa ‘afterwards’
- ga, gaye ‘time, times’
- xa-ga
```

<hr>

# Python

In [2]:
# helper modules
import collections
import re

# data science modules
import pandas as pd
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 100)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# custom modules
from normalize_text import normalize_nena
from significance import apply_fishers, contingency_table

# Text-Fabric and corpus load
from tf.app import use
nena = use('nena')
F, L, T = nena.api.F, nena.api.L, nena.api.T

	connecting to online GitHub repo annotation/app-nena ... connected
Using TF-app in /Users/cody/text-fabric-data/annotation/app-nena/code:
	#9ec58f223a6ba6817347279da277c1efadae550a (latest commit)
	connecting to online GitHub repo CambridgeSemiticsLab/nena_tf ... connected
Using data in /Users/cody/text-fabric-data/CambridgeSemiticsLab/nena_tf/tf/0.01:
	rv0.032=#dd02c45f4294b97f9ecd8d4c3809aaf4153e2843 (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


In [3]:
barwar = nena.search('dialect dialect=Barwar')[0][0] # get Barwar dialect node

  0.00s 1 result


## Select adverbials

We begin by select adverbials first. This will allow us to use the adverbials when we build the verbs dataset, since we are interested in collocation information. 

In [4]:
adverbial_patterns_raw = [
    r'.*-?yom[a,e]$',
    r'^[b,m]-yom',
    r'^yomət$|.*-yomət$',
    r'.*-lɛle$|^ʾədlɛle$',
    r'^qedamta$|.*-qedamta$',   
    r'^mbadla$|.*-mbadla$',
    r'^ʾaṣərta$|.*-ʾaṣərta$',
    r'^xarθa$',
    r'.*-ga$',
    r'^gaye$',
    r'^xa-ga$',
    
    # others:
    r'^zawna$|.*-zawna$',
    r'^dana$|.*-dana$', 
    #r'-?fatra$', # leave out for now since its a duration
]

adverbial_patterns = [re.compile(p) for p in adverbial_patterns_raw]

In [5]:
adverbial_data = []
adverbials = set()
matched_patterns = set()

for word in L.d(barwar,'word'):
    text = normalize_nena(word, nena.api)
    for patt in adverbial_patterns:
        if patt.match(text):
            adverbials.add(word)
            matched_patterns.add(patt.pattern)
            
for advb in adverbials:
    text = normalize_nena(advb, nena.api)
    sentence = L.u(advb,'sentence')[0]
    adverbial_data.append({
        'node': advb,
        'form': text,
        'sentence': T.text(sentence),
    })
            
adverbial_data = pd.DataFrame(adverbial_data).set_index('node')
            
print(f'{len(adverbials)} adverbials found...')

590 adverbials found...


In [6]:
print('unmatched patterns:')
set(adverbial_patterns_raw) - matched_patterns

unmatched patterns:


set()

Examine the adverbials data.

In [7]:
adverbial_data.shape

(590, 2)

In [8]:
advb_forms = adverbial_data.form.value_counts()

advb_forms.head(25)

xa-yoma        55
ʾɛ-ga          46
b-lɛle         44
xa-ga          38
yoma           37
xarθa          35
ʾaṣərta        30
mbadla         27
yomət          20
qedamta        19
dana           15
yome           11
dart-yoma      11
ʾo-yoma        10
ʾədlɛle        9 
gaye           8 
m-lɛle         7 
hal-ʾaṣərta    7 
tre-yome       6 
zawna          6 
b-yoma         6 
ʾo-yomət       6 
ʾaw-lɛle       5 
ʾa-dana        5 
hal-mbadla     5 
Name: form, dtype: int64

In [9]:
advb_forms.describe()

count    102.000000
mean     5.784314  
std      10.623743 
min      1.000000  
25%      1.000000  
50%      1.000000  
75%      5.000000  
max      55.000000 
Name: form, dtype: float64

**All unique forms printed below:**

In [10]:
', '.join(sorted(adverbial_data.form.unique()))

'b-lɛle, b-o-zawna, b-xa-yoma, b-yoma, b-yomaθa, b-ɛ-ga, b-ʾaṣərta, bar-tre-yome, d-o-zawna, d-ɛ-ga, dana, dart-yoma, dartət-yoma, gaye, gu-b-lɛle, gu-d-aw-zawna, gu-d-ɛ-dana, gu-mbadla, hal-mbadla, hal-o-yomət, hal-qedamta, hal-xa-ga, hal-yomət, hal-ʾaṣərta, hal-ʾo-yomət, hal-ʾɛ-ga, har-a-dana, har-b-o-lɛle, har-o-yoma, kli-ʾaw-lɛle, kut-dana, kut-qedamta, kut-yoma, kəma-dana, la-b-lɛle, la-b-yoma, la-zraq-yoma, m-la-gnay-yoma, m-lɛle, m-o-yoma, m-yomə, m-ʾaṣərta, mbadla, mən-d-o-yoma, mən-yoma, pəlgət-lɛle, qa-t-b-lɛle, qam-dana, qam-yoma, qedamta, sab-lɛle, ta-t-ʾo-yomət, tmanya-yome, tre-yome, xa-ga, xa-lɛle, xa-yoma, xarθa, xačča-dana, xu-lɛle, xu-mbadla, yoma, yome, yomət, zawna, ču-ga, šawwa-yome, ʾa-dana, ʾap-dartət-yoma, ʾap-ʾaṣərta, ʾap-ʾo-yoma, ʾap-ʾɛ-dana, ʾaw-lɛle, ʾax-d-ɛ-ga, ʾax-zawna, ʾay-ga, ʾay-xa-yoma, ʾaṣərta, ʾo-lɛle, ʾo-t-lɛle, ʾo-yoma, ʾo-yomət, ʾu-b-lɛle, ʾu-lɛle, ʾu-qedamta, ʾu-xa-ga, ʾu-yoma, ʾu-yomət, ʾu-ʾaṣərta, ʾu-ʾɛ-ga, ʾəd-lɛle, ʾədlɛle, ʾəlli-xa-ga, ʾən-

In [11]:
adverbial_data.head(25)

Unnamed: 0_level_0,form,sentence
node,Unnamed: 1_level_1,Unnamed: 2_level_1
745472,yoma,la-zráqət yòmaˈ y-azìwa.ˈ
770055,xarθa,"qărăčàyeˈ y-áθi hàtxaˈ ʾu-pɛ́ši xa-tre-ṭḷa-yomàne,ˈ xárθa y-àzi,ˈ jàwji.ˈ"
753673,yome,"ʾáp-ʾaw dàməx,ˈ Qaṭína dáməx tmànya yóme,ˈ dmìxle.ˈ"
770058,xarθa,xárθa mšúrela mxáya l-nàše.ˈ
743435,dart-yoma,dárt-yoma náše θéla škèla.ˈ
759820,xa-yoma,xá-yoma mə́re xázən mò ṱ-áwəð ʾáwwa páṛa.ˈ
780297,zawna,záwna wíyɛle t-šlàmaˈ ʾáp-ati píšlux nášət šlàma.ˈ
751637,xarθa,"zìlla,ˈ xárθa m-rə́ḥqa xzéla xa-mə́ndi xwàra.ˈ"
749592,xa-ga,zmìrtɛla qáṭu xá-ga xétaˈ ʾu-qíme rqàðɛla.ˈ
741413,xa-ga,díṛṛa xá-ga xéta l-ʾaθrày.ˈ


## Select verbs

In [12]:
verbs = {
    'qtilele': {
        'zilɛle', 'ziltɛla',
        'ʾəθyɛle', 'θiθɛla',
        'qimɛle', 'qimtɛla',
        'ṣəlyɛle', 'ṣliθɛla', 
        'siqɛle', 'siqtɛla',
        'diṛɛle', 'diṛtɛla',
        'wirɛle', 'wirtɛla',
        'pliṭɛle', 'pliṭṭɛla',
        'riqɛle', 'riqtɛla',
        'tiwɛle', 'tiwtɛla',
        'pišɛle', 'pištɛla',
        'šqilɛle', 'šqiltɛla',
        'npilɛle', 'npiltɛla',
        'məṭyɛle', 'mṭiθɛla',
    },
    'qtille': {
        'zille', 'zilla', 
        'θele', 'θela', 
        'qimle', 'qimla',
        'ṣlele', 'ṣlela', 
        'siqle', 'siqla', 
        'diṛṛe', 'diṛṛa',
        'wirre', 'wirra',
        'pliṭle', 'pliṭla',
        'riqle', 'riqla',
        'tiwle', 'tiwle',
        'pišle', 'pišla',
        'šqille', 'šqilla',
        'npille', 'npilla',
        'mṭele', 'mṭela',
    },
}

In [13]:
verb_data = []

found_forms = collections.defaultdict(set)

for word in L.d(barwar,'word'):
    text = normalize_nena(word, nena.api)
    for vkind, vforms in verbs.items():
        if text in vforms:
            
            # intonation group boundary data
            inton = L.u(word,'inton')[0]
            inton_words = L.d(inton,'word')
            inton_pos = inton_words.index(word) + 1
            
            # sentence data
            sentence = L.u(word,'sentence')[0]
            sentence_words = L.d(sentence,'word')
            sent_pos = sentence_words.index(word) + 1
            
            # paragraph data
            paragraph = L.u(sentence,'paragraph')[0]
            para_sents = L.d(paragraph,'sentence')
            para_pos = para_sents.index(sentence) + 1
                            
            # adverbials data
            sent_adverbials = adverbials & set(sentence_words)
            advb_text = '+'.join(
                normalize_nena(advb, nena.api) for advb in sent_adverbials
            )
            
            verb_data.append({
                'node': word,
                'form': text,
                'vkind': vkind,
                'sentence': T.text(sentence),
                'sent_pos': sent_pos,
                'inton_pos': inton_pos,
                'para_pos': para_pos,
                'sent_first': sent_pos == 1,
                'inton_first': inton_pos == 1,
                'para_first': para_pos == 1 ,
                'advb': bool(sent_adverbials),
                'advb_form': advb_text or np.nan, 
            })
            found_forms[vkind].add(text)
            
verb_data = pd.DataFrame(verb_data).set_index('node')

In [14]:
verb_data

Unnamed: 0_level_0,form,vkind,sentence,sent_pos,inton_pos,para_pos,sent_first,inton_first,para_first,advb,advb_form
node,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
739838,pliṭla,qtille,plíṭla ʾə́č̣č̣i-u ʾə̀č̣č̣a.ˈ,1,1,16,True,True,False,False,
739916,ʾəθyɛle,qtilele,sáʾət ʾə́šta mbàdlaˈ ʾə́θyɛle huðáya wáða ṭəq-ṭəq-ṭə́q l-ṭằra.ˈ,4,1,32,False,True,False,True,mbadla
740004,qimɛle,qtilele,"qímɛle ʾaw-lwíša dašdàšət málla,ˈ ʾu-čak̭àllət málla,ˈ málla lwíšɛle kášxa d-o-huðàyaˈ ʾu-tíwɛle xáṣət xmàrta,ˈ ʾu-síqela kəs-qàzi.ˈ",1,1,51,True,True,False,False,
740019,siqɛle,qtilele,"síqɛle kəs-qàzi,ˈ wírela šarṭ-qàzi.ˈ",1,1,52,True,True,False,False,
740182,zilɛle,qtilele,ʾu-ʾáwwa-ži zìlɛle.ˈ,2,2,16,False,False,False,False,
...,...,...,...,...,...,...,...,...,...,...,...
781725,pliṭla,qtille,plíṭla trè-xure.ˈ,1,1,12,True,True,False,False,
781737,pliṭla,qtille,"kazíwa práməlla har-palṭìwa,ˈ hál plíṭla hàtxa.ˈ",5,2,14,False,False,False,False,
781747,pišla,qtille,yə́mme díye píšla qwára gàna.ˈ,3,3,16,False,False,False,False,
781761,zille,qtille,brónəx zìlle.ˈ,2,2,19,False,False,False,False,


In [15]:
verb_data.to_csv('verb_dataset.tsv', sep='\t')

We inspect the categories of data in our selection.

In [16]:
verb_data.form.describe()

count     1591  
unique    55    
top       qimɛle
freq      194   
Name: form, dtype: object

In [17]:
verb_data.form.value_counts()

qimɛle      194
θele        112
zilɛle      92 
zille       91 
ʾəθyɛle     70 
zilla       67 
pišle       65 
siqɛle      58 
pišla       55 
qimtɛla     45 
ṣəlyɛle     44 
θela        43 
ṣlela       37 
tiwɛle      35 
mṭele       32 
pišɛle      31 
qimle       30 
ṣlele       29 
θiθɛla      28 
siqle       26 
qimla       23 
məṭyɛle     23 
ziltɛla     22 
siqla       21 
pliṭle      21 
pliṭɛle     21 
wirɛle      21 
wirre       19 
šqille      17 
ṣliθɛla     15 
pliṭṭɛla    14 
šqilɛle     12 
mṭela       12 
pištɛla     12 
pliṭla      12 
diṛɛle      12 
tiwle       12 
riqle       11 
siqtɛla     11 
npille      10 
wirra       10 
tiwtɛla     10 
wirtɛla     8  
diṛṛe       8  
riqɛle      7  
npilɛle     7  
riqtɛla     6  
šqilla      5  
šqiltɛla    5  
mṭiθɛla     5  
diṛṛa       4  
npilla      4  
npiltɛla    3  
diṛtɛla     2  
riqla       2  
Name: form, dtype: int64

In [18]:
verb_data.vkind.value_counts()

qtilele    813
qtille     778
Name: vkind, dtype: int64

### Check for missing verb kinds
Below we double check that we are not missing any results we intend to select.

In [19]:
for vkind, foundforms in found_forms.items():
    print(f'not found for {vkind}:')
    print(verbs[vkind] - foundforms)

not found for qtille:
set()
not found for qtilele:
set()


This means we've found all of the forms which we wanted to find.

# Correlation Analysis

Now we will do some analysis of correlations between various features.

## Verb type has adverbial in same sentence

In the table below, we provide a count of how often each verb type occurs with an adverbial.

In [20]:
advb_cor = pd.pivot_table(verb_data, index='vkind', columns=['advb'], aggfunc='size')

advb_cor

advb,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,733,80
qtille,679,99


Next, we convert the rows to ratios ($\%$). The decimal values below are out of $1$ and can be read, e.g. as $0.90$ is $90\%$.

In [21]:
advb_corr_ratio = advb_cor.div(advb_cor.sum(1), axis=0)

advb_corr_ratio

advb,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,0.901599,0.098401
qtille,0.872751,0.127249


Note that $13\%$ of the qtille verb type also contains an adverbial in the same sentence. This is as compared with only $10\%$ for qtilele. **This is in line with our hypothesis**.

We want to see whether the difference in proportion are statistically significant. Below we apply the Fisher's Exact test, which is a test for significance. Following Gries and Stefanowitsch, we also apply a $\log10$ transformation as well as sign change (negative when observed frequency is lower than expected). 

In [22]:
advb_corr_fishers, advb_corr_odds = apply_fishers(advb_cor, 0, 1)

advb_corr_fishers

Unnamed: 0,False,True
qtilele,1.093646,-1.093646
qtille,-1.093646,1.093646


Since $|\log10(0.05)|$ is $1.3$, we expect a value of $1.3$ for significance. The observed values here do not quite meet that threshold. 

It's important to recognize that statistical significance in the context of linguistic data needs to be weighed alongside all relevant factors. The fact that we do not find significance here does not automatically mean the difference in the proportions are not meaningful.

## Verb type position within intonation group, sentence, or paragraph

### With first position in intonation group

Below we present the raw counts.

In [23]:
pos_cor = pd.pivot_table(verb_data, index='vkind', columns=['inton_first'], aggfunc='size')

pos_cor

inton_first,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,261,552
qtille,320,458


And the ratios ($\%$) follow below. Again, they are calculated across the row, for the verb type.

In [24]:
pos_corr_ratio = pos_cor.div(pos_cor.sum(1), axis=0)

pos_corr_ratio

inton_first,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,0.321033,0.678967
qtille,0.411311,0.588689


We see that the **qtilele** verb has a stronger preference for first position at $68\%$ versus **qtille** at $59\%$. Are these differences statistically significant?

In [25]:
pos_corr_fish, pos_corr_odds = apply_fishers(pos_cor, 0, 1)

pos_corr_fish

Unnamed: 0,False,True
qtilele,-3.667736,3.667736
qtille,3.667736,-3.667736


Here we see that the values are indeed statistically significant. **We can conclude that qtilele verb has a preference for first position at a rate that is statistically significant.**

### With first position in sentence

In [26]:
pos_cor_sent = pd.pivot_table(verb_data, index='vkind', columns=['sent_first'], aggfunc='size')

pos_cor_sent

sent_first,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,412,401
qtille,470,308


In [27]:
pos_corr_ratio_sent = pos_cor_sent.div(pos_cor_sent.sum(1), axis=0)

pos_corr_ratio_sent

sent_first,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,0.506765,0.493235
qtille,0.604113,0.395887


### With first position in paragraph

For this count, we observe how often the verb form is found in the first sentence of a paragraph.

In [28]:
pos_cor_para = pd.pivot_table(verb_data, index='vkind', columns=['para_first'], aggfunc='size')

pos_cor_para

para_first,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,781,32
qtille,748,30


In [29]:
pos_corr_ratio_para = pos_cor_para.div(pos_cor_para.sum(1), axis=0)

pos_corr_ratio_para

para_first,False,True
vkind,Unnamed: 1_level_1,Unnamed: 2_level_1
qtilele,0.96064,0.03936
qtille,0.96144,0.03856
