<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


# Vocabulary

This notebook creates a list of all Hebrew and Aramaic lexemes, with glosses and frequencies, listed in
descending order of frequency.
The list is stored as
[vocab.tsv](vocab.tsv), a tab-separated, plain Unicode text file.

This is an answer to a [question by Kirk Lowery](http://bhebrew.biblicalhumanities.org/viewtopic.php?f=7&t=946).

In [1]:
import os

from tf.fabric import Fabric

# Load data
We load some features of the
[BHSA](https://github.com/etcbc/bhsa) data.
See the [feature documentation](https://etcbc.github.io/bhsa/features/hebrew/2017/0_home.html) for more info.

In [2]:
BHSA = 'BHSA/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=BHSA)
api = TF.load('''
    language
    lex
    voc_lex_utf8
    gloss
    freq_lex
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored
  0.00s loading features ...
   |     0.14s B language             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.15s B lex                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.01s B voc_lex_utf8         from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.01s B gloss                from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.10s B freq_lex             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.00s Feature overview: 108 for nodes; 5 for edges; 1 configs; 7 computed
  5.56s All features loaded/computed - for details use loadLog()


We walk through all lexemes and collect their frequency, language, lexeme identifier, gloss,
and vocalized lexeme representation.
We combine these into one list of tuples.

In [3]:
vocab = []
for lexNode in F.otype.s('lex'):
    vocab.append((
        F.freq_lex.v(lexNode),
        F.language.v(lexNode),
        F.lex.v(lexNode),
        F.gloss.v(lexNode),
        F.voc_lex_utf8.v(lexNode),
    ))

We sort the list by descending frequency, then by language, then by vocalised lexeme.

In [4]:
vocab = sorted(vocab, key=lambda e: (-e[0], e[1], e[4]))
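
To see how this sort key behaves, here is a small sketch on made-up entries (the tuples and the helper name `sort_key` are illustrative and not taken from the BHSA data). Negating the frequency puts the most frequent lexemes first, while ties fall back to language and then vocalised lexeme in ascending order.

```python
# Toy entries in the same shape as the notebook's tuples:
# (frequency, language, identifier, gloss, vocalised lexeme).
# All values are invented for illustration only.
sample = [
    (5, 'hbo', 'B', 'in', 'b'),
    (9, 'hbo', 'W', 'and', 'w'),
    (5, 'arc', 'L', 'to', 'l'),
    (5, 'hbo', 'A', 'x', 'a'),
]


def sort_key(entry):
    # Negative frequency gives descending order on the first component;
    # language and vocalised lexeme break ties in ascending order.
    return (-entry[0], entry[1], entry[4])


ordered = sorted(sample, key=sort_key)
for entry in ordered:
    print(entry)
```

Among the three entries with frequency 5, the `arc` entry sorts before the `hbo` ones, and the two `hbo` entries are ordered by their last field.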

Here are the first 10 entries of the sorted `vocab` list.

In [5]:
vocab[0:10]

[(50272, 'hbo', 'W', 'and', 'וְ'),
 (30386, 'hbo', 'H', 'the', 'הַ'),
 (20069, 'hbo', 'L', 'to', 'לְ'),
 (15542, 'hbo', 'B', 'in', 'בְּ'),
 (10997, 'hbo', '>T', '<object marker>', 'אֵת'),
 (7562, 'hbo', 'MN', 'from', 'מִן'),
 (6828, 'hbo', 'JHWH/', 'YHWH', 'יְהוָה'),
 (5766, 'hbo', '<L', 'upon', 'עַל'),
 (5517, 'hbo', '>L', 'to', 'אֶל'),
 (5500, 'hbo', '>CR', '<relative>', 'אֲשֶׁר')]

`hbo` and `arc` are 
[ISO codes](https://www.loc.gov/standards/iso639-2/php/code_list.php) for the Hebrew and Aramaic languages.

We store the result in [vocab.tsv](vocab.tsv).

In [6]:
with open('vocab.tsv', 'w', encoding='utf8') as f:  # explicit UTF-8 for the Hebrew text
    f.write('frequency\tlanguage\tidentifier\tgloss\tlexeme\n')  # header row
    for entry in vocab:
        f.write('{}\t{}\t{}\t{}\t{}\n'.format(*entry))
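
A file in this layout can be read back with the standard `csv` module. The following is a minimal sketch, assuming the file is UTF-8 encoded and that no field contains a tab or newline. The helper names `write_vocab` and `read_vocab` and the file name `vocab_sample.tsv` are hypothetical; the two sample rows are copied from the output shown earlier, and the sketch writes to a temporary directory so it does not touch the real `vocab.tsv`.

```python
import csv
import os
import tempfile

HEADER = ('frequency', 'language', 'identifier', 'gloss', 'lexeme')


def write_vocab(path, entries):
    # Same layout as the notebook: one header row, then one line per entry.
    with open(path, 'w', encoding='utf8') as f:
        f.write('\t'.join(HEADER) + '\n')
        for entry in entries:
            f.write('{}\t{}\t{}\t{}\t{}\n'.format(*entry))


def read_vocab(path):
    # QUOTE_NONE keeps quote characters in glosses as literal text.
    with open(path, encoding='utf8', newline='') as f:
        reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        # Convert the frequency column back to an integer.
        return [dict(row, frequency=int(row['frequency'])) for row in reader]


rows = [
    (50272, 'hbo', 'W', 'and', 'וְ'),
    (6828, 'hbo', 'JHWH/', 'YHWH', 'יְהוָה'),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'vocab_sample.tsv')
    write_vocab(path, rows)
    loaded = read_vocab(path)

print(loaded[0]['identifier'], loaded[0]['frequency'])
```

Reading into dictionaries keyed by the header names avoids relying on column positions, which helps if columns are reordered later.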