# Time Clause Analysis

This notebook analyzes the annotated time clauses to establish what kind of sample needs to be drawn from non-time clauses, looking at factors such as genre, domain, dependency, and clause type.

That is, the goal is to develop a comparison sample of clauses that do not contain a time item.

In [20]:
import sys
import pandas as pd

from tf.fabric import Fabric
from tf.app import use
from textwrap import dedent

sys.path.append('/Users/cody/github/BH_time_collocations/data/pipeline/legacy_scripts/dataset')
from synvar_carc import in_dep_calc
from modify_domain import permissive_q

In [2]:
CORPUS = '/Users/cody/github/BH_time_collocations/data/data/corpus/'

In [65]:
tf_fabric = Fabric(locations=CORPUS)
tf_api = tf_fabric.loadAll()
tf_app = use('etcbc/bhsa', api=tf_api)
F, E, T, L = (getattr(tf_api, l) for l in 'FETL')

  1.30s Feature overview: 87 for nodes; 6 for edges; 1 configs; 9 computed


**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,11,19279.27,100
chapter,334,634.95,100
verse,10171,20.85,100
half_verse,19786,10.72,100
sentence,29518,7.18,100
sentence_atom,29764,7.13,100
clause,40591,5.22,100
clause_atom,41830,5.07,100
lex,1310,1.97,1
phrase,122585,1.73,100


In [127]:
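def get_verb_from_clause(time_clause):
    """Return the first word node tagged as a verb in a time clause, or None."""
    for word in L.d(time_clause, 'word'):
        if F.target.v(word) == 'verb':
            return word
    return None  # no verb found: the caller treats the clause as nominal (NC)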


def build_dataset():
    rows = []
    for time_clause in F.target.s('time_clause'):
        verse_node = L.u(L.d(time_clause, 'word')[0], 'verse')[0]
        cl_genre = F.genre.v(verse_node)
        cl_type = F.cl_type.v(time_clause)
        cl_dependency = in_dep_calc(time_clause, tf_api)
        cl_domain = permissive_q(time_clause, tf_api)
        cl_verb = get_verb_from_clause(time_clause)
        cl_aspect = F.aspect.v(time_clause)
        if cl_verb:
            cl_kind = 'VC'
            cl_tense = F.tense.v(cl_verb)
        else:
            cl_kind = 'NC'
            cl_tense = 'NA'
        book, ch, verse = T.sectionFromNode(time_clause)
        ref_string = f'{book} {ch}:{verse}'
        syndetic = (
            1 if F.lex.v(L.d(time_clause, 'word')[0]) == 'W'
            else 0
        )
        rows.append({
            'node': time_clause,
            'ref': ref_string,
            'time_clause': T.text(time_clause),
            'genre': cl_genre,
            'domain': cl_domain,
            'cl_type': cl_type,
            'aspect': cl_aspect,
            'depend': cl_dependency,
            'kind': cl_kind,
            'tense': cl_tense,
            'syndetic': syndetic,
        })
    df = pd.DataFrame(rows)
    return df

In [128]:
df = build_dataset()

In [129]:
df.head()

,node,ref,time_clause,genre,domain,cl_type,aspect,depend,kind,tense,syndetic
0,302967,Genesis 1:1,בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖י...,prose,?,x_clause,ach_id,Main,VC,past,0
1,303094,Genesis 2:2,וַיְכַ֤ל אֱלֹהִים֙ בַּיֹּ֣ום הַשְּׁבִיעִ֔י מְל...,prose,N,medial,ach_id,Main,VC,past,1
2,303096,Genesis 2:2,וַיִּשְׁבֹּת֙ בַּיֹּ֣ום הַשְּׁבִיעִ֔י מִכָּל־מ...,prose,N,medial,ach_rd,Main,VC,past,1
3,303152,Genesis 2:17,כִּ֗י בְּיֹ֛ום מֹ֥ות תָּמֽוּת׃,prose,Q,x_clause,ach_id,SubAdv,VC,fut,0
4,303392,Genesis 5:1,בְּיֹ֗ום בִּדְמ֥וּת אֱלֹהִ֖ים עָשָׂ֥ה אֹתֹֽו׃,list,?,x_clause,acc_in,Main,VC,past,0


In [130]:
df.shape

(616, 11)

In [131]:
df.genre.value_counts()

genre
prose          474
instruction    129
list            10
poetry           3
Name: count, dtype: int64

In [132]:
df.domain.value_counts()

domain
N    331
Q    233
?     42
D     10
Name: count, dtype: int64

In [133]:
df.depend.value_counts()

depend
Main      537
SubAdv     46
SubMod     27
SubArg      6
Name: count, dtype: int64

In [135]:
df.cl_type.value_counts()

cl_type
clause_x      300
x_clause      151
wayehi_x       60
medial         50
nmcl           31
ellp           12
wehaya_x        9
x_clause_x      3
Name: count, dtype: int64

In [134]:
df.syndetic.value_counts()

syndetic
1    395
0    221
Name: count, dtype: int64

In [114]:
df.tense.value_counts()

tense
past         363
mod          118
NA            43
fut           30
past perf     14
inf           13
epis mod      13
impv           7
past prog      5
pres           4
hab            2
gnom           2
pres perf      1
ptcp           1
Name: count, dtype: int64

In [115]:
df.aspect.value_counts()

aspect
acc_in         171
ach_id         112
ach_rd         111
sta_tr          80
act_di          43
act_un          34
sta_ac          24
none            12
ach_cy          12
sta_in           9
acc_ru           7
acc_di_iter      1
Name: count, dtype: int64

In [73]:
genre_domain_cts = (
    df.groupby(['genre', 'domain'])
        .size()
        .sort_values(ascending=False)
)

genre_domain_cts

genre        domain
prose        N         305
             Q         129
instruction  Q         100
prose        ?          35
instruction  N          20
list         N           6
instruction  D           5
prose        D           5
instruction  ?           4
list         ?           3
poetry       Q           3
list         Q           1
dtype: int64

In [57]:
genre_domain_cts.iloc[:5].sum()

589

In [59]:
genre_domain_cts.iloc[:5].sum() / genre_domain_cts.sum()  # proportion of the data in the top 5 genre/domain pairs

0.9561688311688312
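
The top-5 figure generalizes to a running total over the ranked genre/domain pairs, which shows how many pairs are needed to reach a given coverage. The following is a minimal, dependency-free sketch using the counts printed above; the names `GENRE_DOMAIN_COUNTS` and `cumulative_coverage` are illustrative and not part of the notebook.

```python
from itertools import accumulate

# Counts copied from the genre/domain table above, already sorted descending.
GENRE_DOMAIN_COUNTS = [
    (('prose', 'N'), 305),
    (('prose', 'Q'), 129),
    (('instruction', 'Q'), 100),
    (('prose', '?'), 35),
    (('instruction', 'N'), 20),
    (('list', 'N'), 6),
    (('instruction', 'D'), 5),
    (('prose', 'D'), 5),
    (('instruction', '?'), 4),
    (('list', '?'), 3),
    (('poetry', 'Q'), 3),
    (('list', 'Q'), 1),
]


def cumulative_coverage(counts):
    """Return (key, running share of the total) for each ranked category."""
    total = sum(n for _, n in counts)
    running = accumulate(n for _, n in counts)
    return [(key, cum / total) for (key, _), cum in zip(counts, running)]


coverage = cumulative_coverage(GENRE_DOMAIN_COUNTS)
for key, share in coverage[:5]:
    print(key, round(share, 3))
```

Inside the notebook, the same running total should be obtainable with `genre_domain_cts.cumsum() / genre_domain_cts.sum()`.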

In [61]:
genre_domain_pc = genre_domain_cts / genre_domain_cts.sum()  # convert counts to proportions

genre_domain_pc

genre        domain
prose        N         0.495130
             Q         0.209416
instruction  Q         0.162338
prose        ?         0.056818
instruction  N         0.032468
list         N         0.009740
instruction  D         0.008117
prose        D         0.008117
instruction  ?         0.006494
list         ?         0.004870
poetry       Q         0.004870
list         Q         0.001623
dtype: float64

In [91]:
# get the proportion of the data kept if we restrict to prose N, prose Q, and instruction Q

genre_domain_pc[[
    ('prose', 'N'),
    ('prose', 'Q'),
    ('instruction', 'Q'),
]].sum()

0.8668831168831169

In [74]:
# unkn_domain_search = tf_app.search('''

# clause target=time_clause
#     domain~.*\?$

# ''')

In [75]:
# tf_app.show(unkn_domain_search, condenseType='clause', withNodes=True)

In [136]:
# subdivide by genre, domain, and clause kind (VC vs. NC)

# restrict the sample to the genres and domains of interest,
# keeping only main verbal clauses
df_dg_restricted = df[
    df.genre.isin(['prose', 'instruction'])
    & df.domain.isin(['Q', 'N'])
    & (df.depend == 'Main')
    & (df.kind == 'VC')
]


cl_type_granular = (
    df_dg_restricted.groupby(['genre', 'domain', 'kind'])
        .size()
        .sort_values(ascending=False)
)

# drop sparse categories (10 or fewer clauses)
cl_type_granular = cl_type_granular[cl_type_granular > 10]

cl_type_granular

genre        domain  kind
prose        N       VC      273
             Q       VC       92
instruction  Q       VC       73
dtype: int64

In [137]:
cl_type_granular.sum() / df.shape[0]  # proportion of the original clauses kept with these selection criteria

0.711038961038961

In [141]:
# estimate the total number of clauses needed for a proposed sample size of 1100 per category;
# the per-category figure also gives a buffer for mis-tagged clauses
len(cl_type_granular) * 1100

3300

The estimate above assumes the following independent variables, which also serve as the selection criteria for the sample:

* `genre = prose|instruction`
* `domain = Q|N`
* `dependency = Main`
* `kind = VC`
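
One way to draw the non-time sample under these criteria is to group the candidate clauses by genre and domain and sample each stratum separately. The sketch below is an assumption about how that could be done, not the notebook's own method: `sample_per_stratum`, `STRATA`, and the record layout are hypothetical, the candidates are assumed to be already filtered to main verbal clauses, and the toy records stand in for real non-time clauses.

```python
import random
from collections import defaultdict

# Strata kept after filtering: (genre, domain) pairs with enough time clauses.
STRATA = [('prose', 'N'), ('prose', 'Q'), ('instruction', 'Q')]


def sample_per_stratum(candidates, strata, n_per_stratum, seed=0):
    """Draw up to n_per_stratum candidate records from each stratum.

    `candidates` is an iterable of dicts with 'genre' and 'domain' keys
    (a hypothetical layout mirroring the DataFrame columns above).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    pools = defaultdict(list)
    for record in candidates:
        key = (record['genre'], record['domain'])
        if key in strata:
            pools[key].append(record)
    sample = {}
    for key in strata:
        pool = pools[key]
        # a stratum smaller than the target is taken whole
        k = min(n_per_stratum, len(pool))
        sample[key] = rng.sample(pool, k)
    return sample


# Toy candidate pool (made-up node numbers, for illustration only).
toy_candidates = [
    {'node': i, 'genre': genre, 'domain': domain}
    for i, (genre, domain) in enumerate(
        [('prose', 'N')] * 8 + [('prose', 'Q')] * 5
        + [('instruction', 'Q')] * 2 + [('poetry', 'Q')] * 4
    )
]

toy_sample = sample_per_stratum(toy_candidates, STRATA, n_per_stratum=3)
for key, records in toy_sample.items():
    print(key, [r['node'] for r in records])
```

With the real candidates, `n_per_stratum` would be the proposed 1100. If the candidates live in a DataFrame, pandas' `groupby(...).sample(n=...)` offers a comparable per-group draw.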

The remaining dependent variables for association tests would be:

* syndeton (`syndetic`) - how does the presence of a time item affect whether a waw is used at the beginning of the clause?
* `cl_type` - how does the presence of a time item affect the clause type?
* `tense` - how does the presence of a time item affect the tense of the clause?
* `aspect` - how does the presence of a time item affect the aspect of the clause?
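
Each of these questions amounts to a test of independence between clause group (time vs. non-time) and one categorical variable. As a minimal sketch, the Pearson chi-square statistic for such a contingency table can be computed as below. `chi_square` is an illustrative helper, and the counts are invented toy values, not results from the corpus.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    `table` is a list of rows of observed counts, e.g. rows = clause group
    (time / non-time), columns = levels of one variable (syndetic / asyndetic).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Toy 2 x 2 table (invented counts, for illustration only):
# rows = time / non-time clauses, columns = syndetic / asyndetic.
toy_table = [
    [10, 20],
    [20, 10],
]
statistic, dof = chi_square(toy_table)
print(round(statistic, 3), dof)
```

A library routine such as `scipy.stats.chi2_contingency` would additionally supply a p-value. The hand-rolled version here is only meant to make the expected-count logic explicit.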