# Word patterns

## "Big words"

We want to count the words in the BHSA, not as nodes of the `word` type, but as they are written:
contiguous stretches of letters, in which several `word` nodes may be written together.

In [1]:
from tf.app import use

In [2]:
A = use("ETCBC/bhsa", hoist=globals())

**Locating corpus resources ...**

Name | # of nodes | # slots/node | % coverage
--- | --- | --- | ---
book | 39 | 10938.21 | 100
chapter | 929 | 459.19 | 100
lex | 9230 | 46.22 | 100
verse | 23213 | 18.38 | 100
half_verse | 45179 | 9.44 | 100
sentence | 63717 | 6.7 | 100
sentence_atom | 64514 | 6.61 | 100
clause | 88131 | 4.84 | 100
clause_atom | 90704 | 4.7 | 100
phrase | 253203 | 1.68 | 100


We start by putting all words (in the sense of contiguous stretches of letters without white space)
together in one list.

In order to do that, we'll use the `trailer` feature on words.

Let's first inspect that feature by means of a frequency list.

In [3]:
F.trailer.freqList()

((' ', 236930),
 ('', 121801),
 ('&', 42275),
 ('00 ', 20146),
 ('05 ', 2266),
 ('00_S ', 1892),
 ('00_P ', 1165),
 ('_S ', 76),
 (' 05 ', 17),
 ('_P ', 13),
 ('00_N ', 7),
 ('00_N_P ', 1),
 ('00_N_S ', 1))

The `''` value is used when the word is glued to the next word. We do not take this as a word separator.

The `'&'` is used to indicate a maqqef. We take this as a word separator.
All other values are punctuation marks that imply word separation.

So, we make a list of all words where we glue words together if their trailer is the empty string.

We accumulate the words in a list `words`.
Each item in this list is a list of word nodes: the nodes of words that are glued together,
i.e. joined by an empty trailer.

For each word node that we visit, we examine the trailer of the last node of the last entry.
If it is the empty string, we append the node to that entry.
Otherwise we make a new entry.

In [4]:
words = []

for w in F.otype.s("word"):
    if len(words) == 0:
        words.append([w])
        continue
        
    lastEntry = words[-1]
    lastNode = lastEntry[-1]
    lastTrailer = F.trailer.v(lastNode)
    if lastTrailer == "":
        lastEntry.append(w)
    else:
        words.append([w])
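
The grouping logic in the cell above can be tried out without loading the corpus. The following is a minimal sketch on toy data: `group_by_trailer` and the `toy` list are hypothetical names, not part of Text-Fabric, and the trailer values are written in the same transliteration as in the frequency list above.

```python
def group_by_trailer(nodes_with_trailers):
    """Group consecutive nodes; an empty trailer glues a node to the next one."""
    groups = []
    glue_to_previous = False
    for node, trailer in nodes_with_trailers:
        if glue_to_previous:
            groups[-1].append(node)
        else:
            groups.append([node])
        # Only the empty string counts as "no separation".
        glue_to_previous = trailer == ""
    return groups


# Toy data (hypothetical): node numbers paired with trailer values.
toy = [(1, ""), (2, " "), (3, "&"), (4, ""), (5, ""), (6, "00 ")]
toy_groups = group_by_trailer(toy)
print(toy_groups)  # [[1, 2], [3], [4, 5, 6]]
```

Here the maqqef (`'&'`) after node 3 separates it from node 4, while the empty trailers chain nodes 4, 5 and 6 into one entry.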

Let's see whether we have done this right.

We show the first entries of our word list:

In [5]:
for entry in words[0:11]:
    print(entry)

[1, 2]
[3]
[4]
[5]
[6, 7]
[8, 9]
[10, 11]
[12, 13, 14]
[15]
[16]
[17, 18]


This seems to go well, but let's print out the words themselves:

In [6]:
for entry in words[0:11]:
    A.plainTuple(entry[::-1])

n,p,word,word.1
Genesis 1:1,רֵאשִׁ֖ית,בְּ,


n,p,word
Genesis 1:1,בָּרָ֣א,


n,p,word
Genesis 1:1,אֱלֹהִ֑ים,


n,p,word
Genesis 1:1,אֵ֥ת,


n,p,word,word.1
Genesis 1:1,שָּׁמַ֖יִם,הַ,


n,p,word,word.1
Genesis 1:1,אֵ֥ת,וְ,


n,p,word,word.1
Genesis 1:1,אָֽרֶץ׃,הָ,


n,p,word,word.1,word.2
Genesis 1:2,אָ֗רֶץ,הָ,וְ,


n,p,word
Genesis 1:2,הָיְתָ֥ה,


n,p,word
Genesis 1:2,תֹ֨הוּ֙,


n,p,word,word.1
Genesis 1:2,בֹ֔הוּ,וָ,


Yes, the entries correspond to the words in the new sense.

So, how many do we have?

In [7]:
len(words)

304789
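
As a plausibility check, this number can be compared with the trailer frequencies shown earlier. Every big word ends at a word node with a non-empty trailer, so, assuming the very last word of the corpus has a non-empty trailer, the count should equal the number of word nodes minus the number of empty trailers. The sketch below copies the frequency list from the output above into a variable with the illustrative name `trailer_freqs`.

```python
# Trailer frequencies, copied from the output of F.trailer.freqList() above.
trailer_freqs = (
    (" ", 236930),
    ("", 121801),
    ("&", 42275),
    ("00 ", 20146),
    ("05 ", 2266),
    ("00_S ", 1892),
    ("00_P ", 1165),
    ("_S ", 76),
    (" 05 ", 17),
    ("_P ", 13),
    ("00_N ", 7),
    ("00_N_P ", 1),
    ("00_N_S ", 1),
)

n_words = sum(n for _, n in trailer_freqs)  # all word nodes
n_glued = dict(trailer_freqs)[""]           # word nodes glued to the next one
n_big_words = n_words - n_glued

print(n_words, n_glued, n_big_words)  # 426590 121801 304789
```

The result agrees with `len(words)`.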

## Final consonants

What is the frequency of final consonants relative to their non-final counterparts?

In [8]:
import re
import collections

First a little investment in naming the consonants and linking the final ones to the non-final ones.

In [9]:
# keys: consonant names; a trailing "F" marks the final form
# values: the corresponding Unicode characters

theFive = dict(
    kafF="\u05da",
    kaf="\u05db",
    memF="\u05dd",
    mem="\u05de",
    nunF="\u05df",
    nun="\u05e0",
    peF="\u05e3",
    pe="\u05e4",
    tsadiF="\u05e5",
    tsadi="\u05e6",
)

for (name, ch) in theFive.items():
    print(f"{name} = {ch}")

kafF = ך
kaf = כ
memF = ם
mem = מ
nunF = ן
nun = נ
peF = ף
pe = פ
tsadiF = ץ
tsadi = צ


Here comes the linking.

In [10]:
finalConsonants = {}

for name in theFive:
    if name.endswith("F"):
        finalConsonants[theFive[name]] = theFive[name[0:-1]]
        
for (cF, c) in finalConsonants.items():
    print(f"final {cF} corresponds to {c}")

final ך corresponds to כ
final ם corresponds to מ
final ן corresponds to נ
final ף corresponds to פ
final ץ corresponds to צ


Now we construct a regular expression that matches all of these consonants.

In [11]:
finals = "".join(finalConsonants.keys())
nonFinals = "".join(finalConsonants.values())
theFiveRe = re.compile(fr"([{finals}{nonFinals}])")
theFiveRe

re.compile(r'([ךםןףץכמנפצ])', re.UNICODE)
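
To see what the pattern returns, here is a small self-contained sketch that rebuilds it and applies it to one pointed word. The word is written with Unicode escapes so that the code points are unambiguous; the sample word and the names `five_re` and `sample` are chosen for illustration only.

```python
import re

finals = "\u05da\u05dd\u05df\u05e3\u05e5"      # final kaf, mem, nun, pe, tsadi
non_finals = "\u05db\u05de\u05e0\u05e4\u05e6"  # their non-final counterparts
five_re = re.compile(f"([{finals}{non_finals}])")

# mem, segol, lamed, segol, final kaf, sheva
sample = "\u05de\u05b6\u05dc\u05b6\u05da\u05b0"
found = five_re.findall(sample)

print([f"U+{ord(ch):04X}" for ch in found])  # ['U+05DE', 'U+05DA']
```

Only the mem and the final kaf are reported; the lamed and the vowel points fall outside the character class.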

We are going to match each of the 400,000+ words to this pattern, and collect all matches of it.

In [12]:
freqs = collections.Counter()

for w in F.otype.s("word"):
    text  = F.g_word_utf8.v(w)
    
    chars = theFiveRe.findall(text)
    for ch in chars:
        freqs[ch] += 1

For each pair, we compute the percentage of final forms among all occurrences of the consonant,
and we also give the totals.

We present the result as a markdown table.

In [13]:
from tf.advanced.helpers import dm

In [14]:
text = """
consonant | #total | #final | #nonfinal | %final
--- | --- | --- | --- | ---
"""

for (consF, cons) in sorted(finalConsonants.items()):
    final = freqs[consF]
    nonFinal = freqs[cons]
    total = final + nonFinal
    percent = int(round(final * 100 / total))
    text += f"{cons} and {consF} | {total} | {final} | {nonFinal} | {percent}\n"
    
dm(text)


consonant | #total | #final | #nonfinal | %final
--- | --- | --- | --- | ---
כ and ך | 47469 | 14002 | 33467 | 29
מ and ם | 98929 | 41291 | 57638 | 42
נ and ן | 55093 | 15245 | 39848 | 28
פ and ף | 18284 | 2554 | 15730 | 14
צ and ץ | 14977 | 3288 | 11689 | 22
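
The percentage column can be recomputed from the other columns. As a small check, the sketch below redoes the calculation for the kaf and pe rows; `percent_final` is a hypothetical helper name, and the numbers are copied from the table.

```python
def percent_final(final, non_final):
    """Share of final forms among all occurrences, as a rounded percentage."""
    total = final + non_final
    return int(round(final * 100 / total))


kaf_percent = percent_final(14002, 33467)
pe_percent = percent_final(2554, 15730)
print(kaf_percent, pe_percent)  # 29 14
```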


# Remaining questions

Have we seen all the consonants? The text of the BHSA is mainly stored in the feature
[g_word_utf8](https://etcbc.github.io/bhsa/features/g_word_utf8/),
but there are other features as well that store material:

* qere material that deviates from the ketiv is in
  [qere_utf8](https://etcbc.github.io/bhsa/features/qere_utf8/)
  
* [trailer_utf8](https://etcbc.github.io/bhsa/features/trailer_utf8/) stores inter-word material, but can contain
  consonants. Again, if there is a deviating qere, the inter-word material is in 
  [qere_trailer_utf8](https://etcbc.github.io/bhsa/features/qere_trailer_utf8/).
  
These factors may cause discrepancies between different ways of counting consonants.


# Exploration

Which consonants occur in `trailer_utf8` and `qere_trailer_utf8`?

In [27]:
nonFinalConsonants = set(finalConsonants.values())

for (chars, n) in sorted(F.trailer_utf8.freqList()):
    print(f"{n:>6}x {repr(chars)}")
    for c in chars:
        if c in finalConsonants:
            extra = "final consonant"
        elif c in nonFinalConsonants:
            extra = "non-final consonant"
        else:
            extra = ""
        if extra:
            print(f"\t{extra:<20} {c}")

121801x ''
236930x ' '
    17x ' ׀ '
    76x ' ס '
    13x ' פ '
	non-final consonant  פ
 42275x '־'
  2266x '׀ '
 20146x '׃ '
     7x '׃ ׆ '
     1x '׃ נ ס '
	non-final consonant  נ
     1x '׃ נ פ '
	non-final consonant  נ
	non-final consonant  פ
  1892x '׃ ס '
  1165x '׃ פ '
	non-final consonant  פ


OK, there are consonants, and they include some consonants we are interested in,
the pe and the nun, both non-final.

But they are really the petuha and the nun-hafuka, and they do not function as consonants.
See [writing](https://annotation.github.io/text-fabric/tf/writing/hebrew.html).

# Ketiv/qere sensitive counting

So let's add a layer of sophistication, and redo the counting in several ways:

* `Qp` (Q+): the qere text, with interword material
* `Qm` (Q-): the qere text, without interword material
* `Kp` (K+): the ketiv text, with interword material
* `Km` (K-): the ketiv text, without interword material

In [40]:
freqs = collections.defaultdict(collections.Counter)

for w in F.otype.s("word"):
    ketiv = F.g_word_utf8.v(w)
    qere = F.qere_utf8.v(w) or ketiv
    trailerk = F.trailer_utf8.v(w)
    trailerq = F.qere_trailer_utf8.v(w) or trailerk
    
    texts = {}
    texts["Km"]  = ketiv
    texts["Qm"]  = qere
    texts["Kp"] = f"{ketiv}{trailerk}"
    texts["Qp"] = f"{qere}{trailerq}"
    
    for (kind, text) in texts.items():
        chars = theFiveRe.findall(text)
        for ch in chars:
            freqs[kind][ch] += 1

Now we have to display the results:

In [41]:
text = """
consonant | kind | #total | #final | #nonfinal | %final
--- | --- | --- | --- | --- | ---
"""

for (consF, cons) in sorted(finalConsonants.items()):
    for kind in ("Qp", "Qm", "Kp", "Km"):
        theseFreqs = freqs[kind]

        final = theseFreqs[consF]
        nonFinal = theseFreqs[cons]
        total = final + nonFinal
        percent = int(round(final * 100 / total))
        text += f"{cons} and {consF} | {kind} | {total} | {final} | {nonFinal} | {percent}\n"

dm(text)


consonant | kind | #total | #final | #nonfinal | %final
--- | --- | --- | --- | --- | ---
כ and ך | Qp | 47482 | 14001 | 33481 | 29
כ and ך | Qm | 47482 | 14001 | 33481 | 29
כ and ך | Kp | 47469 | 14002 | 33467 | 29
כ and ך | Km | 47469 | 14002 | 33467 | 29
מ and ם | Qp | 98940 | 41291 | 57649 | 42
מ and ם | Qm | 98940 | 41291 | 57649 | 42
מ and ם | Kp | 98929 | 41291 | 57638 | 42
מ and ם | Km | 98929 | 41291 | 57638 | 42
נ and ן | Qp | 55093 | 15240 | 39853 | 28
נ and ן | Qm | 55091 | 15240 | 39851 | 28
נ and ן | Kp | 55095 | 15245 | 39850 | 28
נ and ן | Km | 55093 | 15245 | 39848 | 28
פ and ף | Qp | 19457 | 2555 | 16902 | 13
פ and ף | Qm | 18278 | 2555 | 15723 | 14
פ and ף | Kp | 19463 | 2554 | 16909 | 13
פ and ף | Km | 18284 | 2554 | 15730 | 14
צ and ץ | Qp | 14977 | 3288 | 11689 | 22
צ and ץ | Qm | 14977 | 3288 | 11689 | 22
צ and ץ | Kp | 14977 | 3288 | 11689 | 22
צ and ץ | Km | 14977 | 3288 | 11689 | 22
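
The differences between the rows with and without interword material can be traced back to the trailer frequencies listed in the exploration above. The following sketch redoes that arithmetic for the ketiv rows; the variable names are illustrative, and the numbers are copied from the two outputs.

```python
# Trailers containing a pe, from the trailer_utf8 output above:
# pe alone (13x), sof pasuq + nun + pe (1x), sof pasuq + pe (1165x).
pe_in_trailers = 13 + 1 + 1165

# Trailers containing a nun:
# sof pasuq + nun + samekh (1x), sof pasuq + nun + pe (1x).
nun_in_trailers = 1 + 1

# Totals from the table: Kp (with trailers) minus Km (without trailers).
pe_diff = 19463 - 18284
nun_diff = 55095 - 55093

print(pe_diff, pe_in_trailers)    # 1179 1179
print(nun_diff, nun_in_trailers)  # 2 2
```

So the extra pe and nun counts in the `Kp` rows are fully accounted for by the paragraph markers in the trailers.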
