<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Lexeme spaces

# Research Question

Following an idea of Martijn Naaijer:

In order to research characteristics that separate early biblical Hebrew from late biblical Hebrew and even the intermediate stages, we need to count lexemes in a sophisticated way.

The problem is that there are books with many different lexemes and books with fewer lexemes. If we compare them in a straightforward manner, it is not obvious whether we can ascribe the differences to language variation over time or to genre or just random fluctuation, or even the mere size of the book.

We need to get a better grip on the potential lexeme choice for each book.

## Scope of a lexeme

We approximate the *potential lexeme choice* by investigating *lexeme spaces*.

For each lexeme we count as follows:

The *lexeme scope* of a lexeme is the sum of the lengths (in words) of the books in which the lexeme occurs.
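
As a minimal sketch of this definition, the following computes scopes for made-up data. The names `toy_book_sizes`, `toy_lexeme_books` and `lexeme_scope` are illustrative assumptions and do not occur in the notebook's data.

```python
# Hypothetical toy data: book lengths in words, and the books each lexeme occurs in.
toy_book_sizes = {'BookA': 1000, 'BookB': 500, 'BookC': 200}
toy_lexeme_books = {
    'lex1': {'BookA', 'BookB', 'BookC'},  # occurs in every book
    'lex2': {'BookB'},                    # occurs in one book only
    'lex3': {'BookA', 'BookC'},
}

def lexeme_scope(lexeme_books, book_sizes):
    """Map each lexeme to the summed length of the books it occurs in."""
    return {
        lex: sum(book_sizes[book] for book in books)
        for (lex, books) in lexeme_books.items()
    }

scopes = lexeme_scope(toy_lexeme_books, toy_book_sizes)
print(scopes)
```

A lexeme that occurs in every book gets the largest possible scope (here 1700), while a lexeme confined to a single small book gets a small one.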

## Lexeme profile of a book

Now for each book in the bible, plot its lexemes as a cloud of points,
where the $x$-axis corresponds to the lexeme scope and the $y$-axis to the number of lexemes in that book with that scope.

For convenience, we divide the lexeme scopes on the $x$-axis into intervals, called *buckets*.
The $y$-values are then the numbers of lexemes in the book whose scope falls within the bucket in question.
We also group the $y$-values into intervals, which we call *n-ranges*.

So, on the $x$-axis we have buckets of lexeme scopes, and on the $y$-axis we have *n-ranges*,
such that if a point links a bucket $b$ to an n-range $n$, the number of lexemes in $b$ is within $n$.
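
Before any bucketing, the raw profile of a book is just a count of its lexemes per scope. The sketch below illustrates that step on made-up input. The scopes are arbitrary numbers chosen for the example, and `toy_lexemes` and `book_profiles` are hypothetical names.

```python
import collections

# Hypothetical toy input: lexeme -> (scope, set of books it occurs in).
# The scope values are arbitrary and only serve the illustration.
toy_lexemes = {
    'lex1': (1700, {'BookA', 'BookB'}),
    'lex2': (1700, {'BookA'}),
    'lex3': (500, {'BookB'}),
}

def book_profiles(lexemes):
    """For each book, count how many of its lexemes have each scope."""
    profiles = collections.defaultdict(collections.Counter)
    for (scope, books) in lexemes.values():
        for book in books:
            profiles[book][scope] += 1
    return profiles

profiles = book_profiles(toy_lexemes)
print(dict(profiles['BookA']))
print(dict(profiles['BookB']))
```

Each `(scope, count)` pair of a book is one point of its cloud. Buckets and n-ranges then coarsen both coordinates.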

We generate several book profiles, based on various subsets of lexemes and with different interval settings.

## Subsets of lexemes

We work with the following subsets of lexemes:

* ``all``: all lexemes
* ``c``: all lexemes except proper nouns
* ``nva``: only nouns, verbs and adjectives, no proper nouns
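
Membership of these subsets can be decided from the part-of-speech feature `sp` of a word. The sketch below assumes the values `nmpr`, `verb`, `subs` and `adjv` that the compute cell of this notebook tests for. The helper name `subsets_for` and the example value `prep` are illustrative assumptions.

```python
def subsets_for(sp):
    """Return the names of the lexeme subsets that a word with part of speech sp belongs to."""
    subsets = {'all'}                       # every lexeme is in 'all'
    if sp != 'nmpr':                        # 'c' excludes proper nouns
        subsets.add('c')
    if sp in ('verb', 'subs', 'adjv'):      # 'nva': verbs, nouns, adjectives
        subsets.add('nva')
    return subsets

for sp in ('nmpr', 'verb', 'prep'):
    print(sp, sorted(subsets_for(sp)))
```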

We work with two kinds of intervals: *linear* and *logarithmic*.

The linear intervals are specified by two numbers: the *bucket size* ``B`` and the *n-range size* ``N``.
A nice graph is obtained by taking ``B`` $= 10000$ and ``N`` $=10$.
So we group lexeme scopes in intervals of 10000 and we group the numbers of lexemes in groups of 10.
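
A minimal sketch of linear grouping, using integer division in the same way as the linear branch of the `bucketize` function further down. The profile values are made up, and `bucketize_linear` is a hypothetical name used only here.

```python
import collections

def bucketize_linear(profile, bucket_size, n_range_size):
    """Group scopes into buckets of width bucket_size, then group the
    lexeme count of each bucket into n-ranges of width n_range_size."""
    per_bucket = collections.Counter()
    for (scope, n) in profile.items():
        per_bucket[scope // bucket_size] += n
    return {b: n // n_range_size for (b, n) in per_bucket.items()}

# Hypothetical profile of one book: scope -> number of lexemes with that scope.
toy_profile = {12000: 7, 18500: 8, 250000: 34}
result = bucketize_linear(toy_profile, 10000, 10)
print(result)
```

Here the scopes 12000 and 18500 both fall into bucket 1, which then holds 15 lexemes and lands in n-range 1. The scope 250000 falls into bucket 25, with 34 lexemes in n-range 3.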

Logarithmic intervals start small and increase exponentially: 
take the number of bits used to represent a number instead of that number itself.
So the logarithmic scale goes as follows:

    0 1 2-3 4-7 8-15 16-31 32-63 64-127 128-255 256-511 512-1023 etc
   
We do this both horizontally and vertically.
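
The interval index of a number on this scale is its bit length, which Python exposes as `int.bit_length`. A small sketch (the helper name `log_bucket` is an assumption for illustration):

```python
def log_bucket(number):
    """Index of the logarithmic interval: the number of bits needed to write number."""
    return number.bit_length()

for number in (0, 1, 2, 3, 4, 7, 8, 15, 16, 1023, 1024):
    print(number, log_bucket(number))
```

Numbers in the same interval, such as 8 and 15, get the same index, and the index goes up by one each time the number doubles.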

# Tasks

We generate sets of book profiles for all the books, directed by the ``TASKS`` variable below.
It specifies the interval method (*linear* or *logarithmic*) and, for the linear method, the interval sizes for buckets and for n-ranges.

All tasks are run separately for all possible lexeme subsets.

In [1]:
TASKS = (
    ('log',),
    ('linear', 10000, 10),
    ('linear', 5000, 5),
    ('linear', 2500, 1),
)

# Fire up

In [2]:
import sys
import collections
import matplotlib.pyplot as plt
from laf.fabric import LafFabric
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.4.6
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: http://shebanq-doc.readthedocs.org/en/latest/texts/welcome.html



In [3]:
fabric.load('etcbc4', '--', 'lexemes', {
    'xmlids': {'node': False, 'edge': False},
    'features': ('''otype g_word language sp lex book chapter''', ''),
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2014-07-23T09-31-37
  2.51s LOGFILE=/Users/dirk/Dropbox/laf-fabric-output/etcbc4/lexemes/__log__lexemes.txt
  2.51s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK lexemes AT 2014-10-30T15-08-48


# Compute
## Lexemes
Compute the set of lexemes contained in every book.

In [5]:
msg('Fetching lexemes')
sources = {'all', 'c', 'nva'}
cur_book = None
book_sizes = collections.Counter()
book_sizes_aram = collections.Counter()
lexeme_books = collections.defaultdict(lambda: collections.defaultdict(lambda: set()))
for n in NN():
    otype = F.otype.v(n)
    if otype == 'book':
        cur_book = F.book.v(n)
        continue
    if otype != 'word': continue
    lex = F.lex.v(n)
    lan = F.language.v(n)
    sp = F.sp.v(n)
    book_sizes[cur_book] += 1
    if lan == 'Aramaic':
        book_sizes_aram[cur_book] += 1
    if True: lexeme_books['all'][lex].add(cur_book)
    if sp != 'nmpr': lexeme_books['c'][lex].add(cur_book)
    if sp == 'verb' or sp == 'subs' or sp == 'adjv': lexeme_books['nva'][lex].add(cur_book)

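# lexeme_books is keyed by subset name, so this reports the number of subsets, not the number of distinct lexemes
msg('{} lexemes'.format(len(lexeme_books)))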
for (book, size) in sorted(book_sizes.items(), key=lambda x:(-x[1], x[0])):
    print('{:<15} has {:>5} words of which {:>} Aramaic'.format(book, size, book_sizes_aram[book]))

 2m 30s Fetching lexemes
 2m 34s 3 lexemes


Jeremia         has 29735 words of which 19 Aramaic
Genesis         has 28756 words of which 2 Aramaic
Ezechiel        has 26180 words of which 0 Aramaic
Psalmi          has 25371 words of which 2 Aramaic
Exodus          has 23748 words of which 0 Aramaic
Numeri          has 23186 words of which 0 Aramaic
Jesaia          has 22934 words of which 0 Aramaic
Deuteronomium   has 20127 words of which 0 Aramaic
Chronica_II     has 19760 words of which 0 Aramaic
Samuel_I        has 18926 words of which 0 Aramaic
Reges_I         has 18684 words of which 0 Aramaic
Reges_II        has 17305 words of which 0 Aramaic
Leviticus       has 17099 words of which 0 Aramaic
Samuel_II       has 15612 words of which 0 Aramaic
Chronica_I      has 15561 words of which 0 Aramaic
Josua           has 14523 words of which 0 Aramaic
Judices         has 14084 words of which 0 Aramaic
Iob             has 10912 words of which 0 Aramaic
Proverbia       has  8859 words of which 0 Aramaic
Daniel          has  8071 word

In [8]:
b = 0
for n in NN():
    if F.otype.v(n) == 'book':
        b +=1
        book = F.book.v(n)
        print('{}\t{}\t{}'.format(b, book, book_sizes[book]))

1	Genesis	28756
2	Exodus	23748
3	Leviticus	17099
4	Numeri	23186
5	Deuteronomium	20127
6	Josua	14523
7	Judices	14084
8	Samuel_I	18926
9	Samuel_II	15612
10	Reges_I	18684
11	Reges_II	17305
12	Jesaia	22934
13	Jeremia	29735
14	Ezechiel	26180
15	Hosea	3146
16	Joel	1318
17	Amos	2780
18	Obadia	392
19	Jona	985
20	Micha	1895
21	Nahum	746
22	Habakuk	897
23	Zephania	1037
24	Haggai	877
25	Sacharia	4469
26	Maleachi	1187
27	Psalmi	25371
28	Iob	10912
29	Proverbia	8859
30	Ruth	1802
31	Canticum	1682
32	Ecclesiastes	4233
33	Threni	1945
34	Esther	4621
35	Daniel	8071
36	Esra	5268
37	Nehemia	7842
38	Chronica_I	15561
39	Chronica_II	19760


## Language use
As an aside, we visualize the Aramaic language use versus the Hebrew language use.

In [5]:
rep = {'heb': '_', 'arm': '-'}
repi = dict((y,x) for (x,y) in rep.items())
repo = dict((y, 'heb' if x == 'arm' else 'arm') for (x,y) in rep.items())
condense = {'heb': 100, 'arm': 20}

def distil(words):
    n = {'heb': 0, 'arm': 0}
    
    def inc(this_lan, that_lan, passive=False):
        if n[that_lan]:
            n[that_lan] = 0
            yield rep[that_lan]
        if not passive:
            n[this_lan] += 1
            if n[this_lan] == condense[this_lan]:
                n[this_lan] = 0
                yield rep[this_lan]
    for w in words:
        for y in inc(repi[w], repo[w]): yield y
    for y in inc('arm', 'heb', passive=True): yield y
    for y in inc('heb', 'arm', passive=True): yield y
        
msg('''
Showing the Aramaic:
'{}' is {} Hebrew  word{} (or less but at least 1),
'{}' is {} Aramaic word{} (or less but at least 1)'''.format(
    rep['heb'], condense['heb'], '' if condense['heb'] == 1 else 's', 
    rep['arm'], condense['arm'], '' if condense['arm'] == 1 else 's',
))
skipping = True
words = []
for n in NN():
    otype = F.otype.v(n)
    if otype == 'book':
        if words:
            print(''.join(distil(words)))
            words = []
        cur_book = F.book.v(n)
        skipping = book_sizes_aram[cur_book] == 0
        if not skipping:
            print(cur_book)
        continue
    if skipping or otype != 'word': continue
    words.append(rep['arm'] if F.language.v(n) == 'Aramaic' else rep['heb'])

    12s 
Showing the Aramaic:
'_' is 100 Hebrew  words (or less but at least 1),
'-' is 20 Aramaic words (or less but at least 1)


Genesis
_____________________________________________________________________________________________________________________________________________________________________________-___________________________________________________________________________________________________________________
Jeremia
___________________________________________________-_______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
Psalmi
____________________________________________________________________________________________________________________________________________________________________________________________________________-___________________________________________________
Daniel
______---------------------------------------------------------------------------------------------------------------------

## Scopes and profiles
Compile the scopes of all lexemes and then the profiles of lexical scopes for each book.

Make buckets of scopes, on a logarithmic or a linear scale, and bucketize the book profiles.

In [12]:
hdash = '─'
vdash = '│'

def make_profiles(source):
    book_profile = collections.defaultdict(lambda: collections.Counter())
    lexeme_bookss = lexeme_books[source]

    lexeme_scope = {}
    for lex in lexeme_bookss:
        lexeme_scope[lex] = sum(book_sizes[book] for book in lexeme_bookss[lex])

    scope_file = outfile('scopes_{}.csv'.format(source))
    for ls in sorted(lexeme_scope.items(), key=lambda x: (-x[1], x[0])):
        scope_file.write('"{}";{}\n'.format(*ls))
    scope_file.close()

    for lex in lexeme_bookss:
        for book in lexeme_bookss[lex]:
            book_profile[book][lexeme_scope[lex]] += 1

    book_file = outfile('profiles_{}.csv'.format(source))
    for book in sorted(book_profile):
        book_file.write('"{}"\n'.format(book))
        for scope in sorted(book_profile[book]):
            book_file.write('{};{}\n'.format(scope, book_profile[book][scope]))
    book_file.close()

    return book_profile

def bucketize(data, method, params=()):
    # data maps each lexeme scope to the number of lexemes with that scope
    if method == 'log':
        # bucket = number of bits of the scope; n-range = number of bits of the lexeme count
        buckets_proto = collections.Counter()
        for (scope, n) in data.items(): buckets_proto[int.bit_length(scope)] += n
        buckets = dict((b, int.bit_length(n)) for (b, n) in buckets_proto.items())
        return (buckets, max(buckets.keys()), max(buckets.values()))
    if method == 'linear':
        # bucket = scope divided by the bucket size; n-range = lexeme count divided by the n-range size
        (B_LINEAR, N_LINEAR) = params
        buckets_proto = collections.Counter()
        for (scope, n) in data.items(): buckets_proto[int(scope // B_LINEAR)] += n
        buckets = dict((b, n // N_LINEAR) for (b, n) in buckets_proto.items())
        return (buckets, max(buckets.keys()), max(buckets.values()))

def show_buckets(buckets, book, maxb, maxn):
    bucket_rows = collections.defaultdict(lambda: set())
    for (b, n) in buckets.items(): bucket_rows[n].add(b)
    lines = []
    lines.append('{:<4}┏{}{}{}┓'.format(' ',hdash * 4, book, hdash * (maxb + 1 - len(book) - 4)))
    # show rows up to the highest n-range of this book (maxn, the maximum over all books, is not used here)
    max_row_show = max(bucket_rows.keys())
    for row in range(max_row_show + 1, 0, -1):
        reps = ('█' if b in bucket_rows[row] else ' ' for b in range(maxb + 1))
        lines.append('{:<4}┃{}┃'.format(row, ''.join(reps)))
    lines.append('{:<4}┗{}┛'.format(' ', hdash * (maxb + 1)))
    lines.append('{:<4} 0{}{}'.format(' ', ' ' * (maxb - 1), maxb))
    lines.append('\n')
    return '\n'.join(lines)


def show_books(source, method, params=()):
    book_buckets = {}
    maxb = -1
    maxn = -1
    paramstr = '-'.join(str(x) for x in params)

    book_profile = make_profiles(source)
    for book in sorted(book_profile):
        (buckets, this_maxb, this_maxn) = bucketize(book_profile[book], method, params)    
        if this_maxb > maxb: maxb = this_maxb
        if this_maxn > maxn: maxn = this_maxn
        book_buckets[book] = buckets

    book_file_bucket = outfile('data_{}_{}{}.csv'.format(source, method, paramstr))
    for book in sorted(book_buckets):
        buckets = book_buckets[book]
        book_file_bucket.write('"{}";\n'.format(book))
        #for bn in sorted(book_buckets[book].items()):
        for b in range(max(buckets.keys()) + 1):
            book_file_bucket.write('{};{}\n'.format(b, buckets.get(b, 0)))
    book_file_bucket.close()

    book_file_graph = outfile('graph_{}_{}{}.txt'.format(source, method, paramstr))
    message = '''
BUCKET METHOD = {}
HORIZONTAL    = the lexeme scope
VERTICAL      = the number of lexemes in that scope
'''.format(method)
    book_file_graph.write(message + '\n')
    for book in sorted(book_buckets):
        graph = show_buckets(book_buckets[book], book, maxb, maxn)
        book_file_graph.write(graph)
    book_file_graph.close()

In [13]:
for task in TASKS:
    for source in sources:
        method = task[0]
        params = task[1:]
        print('Profiles from {:<5} with method {:<6} {}'.format(source, method, params))
        show_books(source, method, params=params)

Profiles from c     with method log    ()
Profiles from nva   with method log    ()
Profiles from all   with method log    ()
Profiles from c     with method linear (10000, 10)
Profiles from nva   with method linear (10000, 10)
Profiles from all   with method linear (10000, 10)
Profiles from c     with method linear (5000, 5)
Profiles from nva   with method linear (5000, 5)
Profiles from all   with method linear (5000, 5)
Profiles from c     with method linear (2500, 1)
Profiles from nva   with method linear (2500, 1)
Profiles from all   with method linear (2500, 1)
