<img align="right" src="tf-small.png"/>

# Atoms and Mothers

Among the trickiest bits in the 
[ETCBC database of the Hebrew Bible](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_home.html)
are the
[*atoms*](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/otype.html#linguistic-types)
within sentences, clauses and phrases, and the
[*mother*](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/mother.html)
relationship between objects.

Yet much of the encoding effort of the ETCBC has gone into precisely these concepts, especially into the treatment of *clause*-atoms.
For example, there is a specific feature
[code](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/code.html)
defined on clause atoms that provides a refined categorization of clauses.

In this notebook, we will explore and highlight what you can do with mothers and clause_atoms.

# Acknowledgement

This notebook owes a lot to the eager questions of Joshua Grauman and the prompt answers by Hendrik-Jan Bosman, spiced with additional insights of Cody Kingham and David van Acker.

In [1]:
import collections
from tf.fabric import Fabric

In [2]:
ETCBC = 'hebrew/etcbc4c'
TF = Fabric( modules=ETCBC )

This is Text-Fabric 2.3.5
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
108 features found and 0 ignored


In [19]:
api = TF.load('''
    typ function
    mother
''')
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.30s B mother               from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s Feature overview: 102 for nodes; 5 for edges; 1 configs; 7 computed
  0.78s All features loaded/computed - for details use loadLog()


# clause and clause_atom

The ETCBC does not work with *embedded* clauses. In the clause

`we'll see whether this works out later`

there is an inner clause `whether this works out`, and an outer clause `we'll see ... later`.

In many types of linguistic analysis, the inner clause is part of the outer clause, in the role of
direct object. The word `works` belongs both to the inner and outer clause.

Not so in the ETCBC analysis of things. 
The inner clause *interrupts* the outer clause, and the outer clause has a *gap*.
The word `works` belongs to the inner clause only.

Because of the gap, the outer clause splits into two segments, one before the gap, and one after the gap.
These parts are called the *clause_atoms*.

The clause_atom before the gap is rather complete: it has a subject and a predicate.
The clause_atom after the gap is, well, defective.
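
The gap idea can be illustrated with a small sketch in plain Python. It does not use the ETCBC data or Text-Fabric at all: the slot numbers and the helper `split_into_atoms` are hypothetical, and the real clause_atoms are encoded in the database rather than computed this way.

```python
def split_into_atoms(slots):
    """Split a sorted list of word slots into maximal runs of consecutive slots."""
    atoms = []
    for slot in slots:
        if atoms and slot == atoms[-1][-1] + 1:
            # the slot continues the current run
            atoms[-1].append(slot)
        else:
            # a gap (or the very first slot) starts a new run
            atoms.append([slot])
    return atoms


# Toy sentence from the text, with 1-based slot numbers (hypothetical data)
words = ["we'll", "see", "whether", "this", "works", "out", "later"]
inner = [3, 4, 5, 6]  # whether this works out
outer = [1, 2, 7]     # we'll see ... later

outer_atoms = split_into_atoms(outer)
inner_atoms = split_into_atoms(inner)

for atom in outer_atoms:
    print(' '.join(words[s - 1] for s in atom))
```

The outer clause falls apart into two runs of slots, one on each side of the inner clause, while the inner clause stays in one piece.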

## Explore

Let us see some clauses that consist of multiple clause atoms.

In [4]:
results = list(S.search('''
clause
  clause_atom
  < clause_atom
'''))

mClauses = sortNodes(set(x[0] for x in results))
info('{} multiple atom clauses'.format(len(mClauses)))

  0.69s 2441 multiple atom clauses


In [5]:
for r in results[0:5]: print(S.glean(r))

clause[וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ ...] clause_atom[וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ ] clause_atom[וּבֵ֣ין הַמַּ֔יִם ]
clause[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ...] clause_atom[תַּֽדְשֵׁ֤א הָאָ֨רֶץ֙ דֶּ֔שֶׁא עֵ֚שֶׂב ] clause_atom[עֵ֣ץ פְּרִ֞י ]
clause[עֹ֤שֶׂה פְּרִי֙ לְמִינֹ֔ו עַל־...] clause_atom[עֹ֤שֶׂה פְּרִי֙ לְמִינֹ֔ו ] clause_atom[עַל־הָאָ֑רֶץ ]
clause[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] clause_atom[וַתֹּוצֵ֨א הָאָ֜רֶץ דֶּ֠שֶׁא ...] clause_atom[וְעֵ֧ץ ]
clause[עֹ֥שֶׂה פְּרִ֛י לְמִינֵ֑הוּ ] clause_atom[עֹ֥שֶׂה פְּרִ֛י ] clause_atom[לְמִינֵ֑הוּ ]


Now count how many clauses have how many atoms.

In [6]:
caCount = collections.Counter()
for c in mClauses:
    caCount[len(L.d(c, 'clause_atom'))] += 1
for (nca, nc) in sorted(caCount.items(), key=lambda x: (-x[1], x[0])):
    print('{:>2} atoms: {:>5} clauses'.format(nca, nc))

 2 atoms:  2334 clauses
 3 atoms:    95 clauses
 4 atoms:    10 clauses
 5 atoms:     2 clauses


The next step: for every multi-atom clause, we list each atom with the slots at which it starts and ends, and whether its 
[clause type](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/typ.html)
is defective or not.

In [7]:
chunks = []
for c in mClauses:
    cas = L.d(c, 'clause_atom')
    cwords = L.d(c, otype='word')
    rep = ['{}-{}'.format(cwords[0], cwords[-1])]
    for ca in cas:
        defc = F.typ.v(ca) == 'Defc'
        slots = L.d(ca, otype='word')
        bs = slots[0]
        es = slots[-1]
        rep.append('\t{}-{}-{}'.format(bs, 'D' if defc else '-', es))
    chunks.append(rep)

for ch in chunks[0:10]: print('\n'.join(ch))

101-115
	101---105
	112-D-115
182-190
	182---186
	189-D-190
191-200
	191---194
	198-D-200
204-215
	204---209
	214-D-215
216-222
	216---217
	221-D-222
380-401
	380---391
	400-D-401
593-611
	593---598
	607-D-611
622-650
	622---640
	645-D-650
989-995
	989-D-991
	994---995
1097-1110
	1097---1103
	1109-D-1110


Is it the case that every clause splits into exactly one non-defective atom, with all remaining atoms defective?
Let's count the profiles of clauses. A profile is a sequence of `-` and `D` characters, corresponding to the
defectiveness of its successive clause_atoms.

In [8]:
profiles = collections.Counter()
for c in F.otype.s('clause'):
    cas = L.d(c, otype='clause_atom')
    profile = ''.join('D' if F.typ.v(ca) == 'Defc' else '-' for ca in cas)
    profiles[profile] += 1
info('{} profiles'.format(len(profiles)))
for (profile, n) in sorted(profiles.items()):
    print('{:<6} : {:>5}x'.format(profile, n))

    15s 10 profiles
-      : 85559x
-D     :  1083x
-DD    :    62x
-DDD   :     7x
-DDDD  :     2x
D-     :  1251x
D-D    :    10x
DD-    :    23x
DD-D   :     1x
DDD-   :     2x


This gives a pretty good picture of the construction of clauses out of their atoms.
Note that we have inspected all clauses, including the single-atom clauses, and that those are never
defective.
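
The profile table can be checked mechanically. The sketch below copies the counts from the output above into a plain dictionary (so it runs without Text-Fabric), tests whether every profile has exactly one `-`, and regroups the multi-atom clauses by number of atoms for comparison with the earlier count.

```python
import collections

# Profile counts copied from the output of the profile cell
profiles = {
    '-': 85559, '-D': 1083, '-DD': 62, '-DDD': 7, '-DDDD': 2,
    'D-': 1251, 'D-D': 10, 'DD-': 23, 'DD-D': 1, 'DDD-': 2,
}

# Profiles that do NOT have exactly one non-defective atom
exceptions = [p for p in profiles if p.count('-') != 1]
print('profiles without exactly one complete atom:', exceptions)

# Regroup the multi-atom clauses by their number of atoms
by_length = collections.Counter()
for profile, n in profiles.items():
    if len(profile) > 1:
        by_length[len(profile)] += n

for length, n in sorted(by_length.items()):
    print('{:>2} atoms: {:>5} clauses'.format(length, n))
print('{} multiple atom clauses'.format(sum(by_length.values())))
```

If the two cells agree, the regrouped numbers should match the atoms-per-clause table and the total of multiple atom clauses found by the search.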

Is it true, then, that defective clause atoms do not contain a predicate, and the others do?
We'll check. A predicate is a phrase with a `function` that is one of a few values.
We count the clause_atoms with and without a predicate, separately for defective and complete ones.

We expect the classes `D-` (defective, no predicate) and `-P` (complete, with predicate) to be represented, 
whilst the classes `DP` (defective with predicate) and `--` (complete, without predicate) should be empty.

In [9]:
predicates = {'Pred', 'PreO', 'PreS', 'PrcS', 'PtcO', 'PreC'}

def classify(clauseSet, predLabels):
    defPred = collections.Counter()

    for c in clauseSet:
        defc = F.typ.v(c) == 'Defc'
        pred = any(F.function.v(p) in predLabels for p in L.d(c, otype='phrase'))
        defPred[('D' if defc else '-')+('P' if pred else '-')] += 1

    for x in sorted(defPred.items()): print('{} x {:>5}'.format(*x))

classify(F.otype.s('clause_atom'), predicates)

-- x  7395
-P x 80605
D- x  2520
DP x    42


It is nearly true that defective atoms do not have a predicate, because the class `DP` is very small.
But there is a fair amount of `--` clause_atoms.
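
To put "very small" and "a fair amount" into numbers, here is a minimal sketch that turns the four counts above into shares. The counts are copied from the output; the variable names are only illustrative.

```python
# Counts copied from the output of classify() on all clause_atoms
counts = {'--': 7395, '-P': 80605, 'D-': 2520, 'DP': 42}

defective = counts['D-'] + counts['DP']  # all defective clause_atoms
complete = counts['--'] + counts['-P']   # all non-defective clause_atoms

dp_share = counts['DP'] / defective      # defective, yet with a predicate
nopred_share = counts['--'] / complete   # complete, yet without a predicate

print('defective atoms with a predicate   : {:>5} of {:>5} ({:.2%})'.format(
    counts['DP'], defective, dp_share))
print('complete atoms without a predicate : {:>5} of {:>5} ({:.2%})'.format(
    counts['--'], complete, nopred_share))
```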

We can determine which function labels of phrases do not occur in defective clause atoms.

In [10]:
allFunctions = {F.function.v(p) for p in F.otype.s('phrase')}
sorted(allFunctions)

['Adju',
 'Cmpl',
 'Conj',
 'EPPr',
 'ExsS',
 'Exst',
 'Frnt',
 'IntS',
 'Intj',
 'Loca',
 'ModS',
 'Modi',
 'NCoS',
 'NCop',
 'Nega',
 'Objc',
 'PrAd',
 'PrcS',
 'PreC',
 'PreO',
 'PreS',
 'Pred',
 'PtcO',
 'Ques',
 'Rela',
 'Subj',
 'Supp',
 'Time',
 'Voct']

In [11]:
defcFunctions = collections.Counter()
completeFunctions = collections.Counter()
for c in F.otype.s('clause_atom'):
    dest = defcFunctions if F.typ.v(c) == 'Defc' else completeFunctions
    for p in L.d(c, otype='phrase'):
        dest[F.function.v(p)] += 1

In [12]:
defcFunctions

Counter({'Adju': 336,
         'Cmpl': 356,
         'Conj': 800,
         'Exst': 1,
         'Intj': 24,
         'Loca': 85,
         'Modi': 35,
         'NCop': 1,
         'Nega': 24,
         'Objc': 319,
         'PrAd': 7,
         'PreC': 42,
         'Ques': 30,
         'Rela': 2,
         'Subj': 507,
         'Time': 210})

In [13]:
completeFunctions

Counter({'Adju': 9177,
         'Cmpl': 29604,
         'Conj': 45343,
         'EPPr': 21,
         'ExsS': 14,
         'Exst': 142,
         'Frnt': 1091,
         'IntS': 251,
         'Intj': 1597,
         'Loca': 2528,
         'ModS': 35,
         'Modi': 3951,
         'NCoS': 101,
         'NCop': 594,
         'Nega': 6023,
         'Objc': 22193,
         'PrAd': 235,
         'PrcS': 8,
         'PreC': 19296,
         'PreO': 5403,
         'PreS': 886,
         'Pred': 57069,
         'PtcO': 162,
         'Ques': 1174,
         'Rela': 6327,
         'Subj': 31354,
         'Supp': 178,
         'Time': 3631,
         'Voct': 1607})

So, there are a few defective clause_atoms with a predicative complement, and there are quite a few 
complete clause_atoms lacking anything that looks like a predicate.
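
The function labels that never occur in defective clause atoms follow from a set difference. The sketch below works on the label sets copied from the two counters above, so it runs without Text-Fabric; inside the notebook the same idea would be `set(completeFunctions) - set(defcFunctions)`.

```python
# Labels copied from the keys of defcFunctions
defc_labels = {
    'Adju', 'Cmpl', 'Conj', 'Exst', 'Intj', 'Loca', 'Modi', 'NCop',
    'Nega', 'Objc', 'PrAd', 'PreC', 'Ques', 'Rela', 'Subj', 'Time',
}

# Labels copied from the keys of completeFunctions
complete_labels = {
    'Adju', 'Cmpl', 'Conj', 'EPPr', 'ExsS', 'Exst', 'Frnt', 'IntS',
    'Intj', 'Loca', 'ModS', 'Modi', 'NCoS', 'NCop', 'Nega', 'Objc',
    'PrAd', 'PrcS', 'PreC', 'PreO', 'PreS', 'Pred', 'PtcO', 'Ques',
    'Rela', 'Subj', 'Supp', 'Time', 'Voct',
}

# The predicate labels used by classify()
predicates = {'Pred', 'PreO', 'PreS', 'PrcS', 'PtcO', 'PreC'}

# Labels that occur in complete atoms only
only_complete = complete_labels - defc_labels
# Predicate labels that do show up in defective atoms
pred_in_defc = predicates & defc_labels

print('never in defective atoms :', ' '.join(sorted(only_complete)))
print('predicates in defective  :', ' '.join(sorted(pred_in_defc)))
```

The only predicate label that turns up in defective atoms is `PreC`, and its count of 42 in `defcFunctions` equals the size of the `DP` class.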

If we restrict ourselves to multiple atom clauses, the picture is this.

In [18]:
mClauseAtoms = set()
for c in mClauses:
    for ca in L.d(c, otype='clause_atom'): mClauseAtoms.add(ca)
classify(mClauseAtoms, predicates)

-- x   181
-P x  2260
D- x  2520
DP x    42


## Conclusion (Atoms)
Defective clause atoms are always part of clauses with multiple atoms.
Such clauses have exactly one non-defective clause_atom.
Defective clause_atoms do not have predicates, but may have a predicative complement or adjunct.
Most non-defective clause atoms have a predicate, but there is a fair collection without.

# Mothers

The `mother` relationship between nodes tells something about linguistic dependency.
We first investigate the extent of the `mother` relationship in terms of node types, and then we concentrate on the mothers and daughters of clause atoms.

In [21]:
motherInventory = collections.Counter()
for n in N():
    for m in E.mother.f(n):
        motherInventory[(F.otype.v(n), F.otype.v(m))] += 1

In [24]:
for ((fr, to), n) in sorted(motherInventory.items()):
    print('{:>12} => {:<12} x {:>6}'.format(fr, to, n))

      clause => clause       x  13970
      clause => phrase       x   5763
      clause => word         x   1104
 clause_atom => clause_atom  x  89508
      phrase => clause       x     22
      phrase => phrase       x    452
      phrase => word         x      8
 phrase_atom => phrase_atom  x  12502
 phrase_atom => word         x   1839
   subphrase => subphrase    x  21129
   subphrase => word         x  34885


Clearly, the `mother` relationship is used most heavily between clause atoms: 89508 edges, more than for any other pair of object types.
Also, mothers and daughters of clause atoms are always clause atoms themselves.
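
Both observations can be read off the inventory programmatically. This is a minimal sketch over the counts copied from the output above (no Text-Fabric needed); the names `inventory`, `mixed` and `ca_share` are just illustrative.

```python
# (daughter type, mother type) -> number of mother edges, copied from the output
inventory = {
    ('clause', 'clause'): 13970,
    ('clause', 'phrase'): 5763,
    ('clause', 'word'): 1104,
    ('clause_atom', 'clause_atom'): 89508,
    ('phrase', 'clause'): 22,
    ('phrase', 'phrase'): 452,
    ('phrase', 'word'): 8,
    ('phrase_atom', 'phrase_atom'): 12502,
    ('phrase_atom', 'word'): 1839,
    ('subphrase', 'subphrase'): 21129,
    ('subphrase', 'word'): 34885,
}

total = sum(inventory.values())
ca_edges = inventory[('clause_atom', 'clause_atom')]
ca_share = ca_edges / total

# Type pairs that link a clause_atom to a node of another type
mixed = [
    pair for pair in inventory
    if 'clause_atom' in pair and pair != ('clause_atom', 'clause_atom')
]

print('{} mother edges in total'.format(total))
print('{} ({:.1%}) between clause atoms'.format(ca_edges, ca_share))
print('pairs mixing clause_atom with another type:', mixed)
```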