<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Tutorial

This notebook shows you how to use LAF-Fabric for exciting data processing on the Hebrew Bible.

# Preliminaries

You need hardware, an operating system, additional software, LAF-Fabric itself and ETCBC data.

Here is a summary of how to get it all.

## Get it all!

1. Download and install [Anaconda](); choose a Python 3 based version for your platform.

1. Get the data
    
    cd ~
    git clone .../ETCBC/laf-fabric-data
    
1. Get LAF-Fabric
    
    cd ~
    git clone .../ETCBC/laf-fabric
    cd laf-fabric-dist/laf-fabric-*
    python3 setup.py

Skip the next two steps if this is your first acquaintance with LAF-Fabric.

1. Install databases

    sudo apt-get install mysql-server
    sudo apt-get install sqlite3
    
1. Download [Emdros](). Build and install it.

## Hardware

You need a computer with enough RAM (at least 8 GB). It can be a laptop, or a virtual machine inside a laptop.
Such a virtual machine needs at least 4 GB of RAM, but preferably 6 GB. If your VM has a 64-bit architecture, allocate 12 GB.

Personally, I feel most comfortable with running LAF-Fabric natively, not in a VM, because then everything integrates much better with the rest of your work.

## Operating system

Preferably, the computer understands Unix-like commands and has a *terminal* utility, also known as the *command line* in the Windows world. That means that Mac OS X and Linux are fine, and Windows is doable with extra care.

Working with LAF-Fabric breathes a culture of terminal commands: it involves git and bash, and the installation instructions are given in Unix lingo.

If you are on Windows, use Unix-like tools that have been ported to Windows, such as Git-Bash, Cygwin, or similar tools.
It is possible to get LAF-Fabric up and running on Windows without these Unix tools, but you need the expertise to translate the Unix way of working into a Windows way of working.

## Additional software

The additional software you need consists of a programming language, databases, and a few other utilities.

### Python and friends

Python is not only a programming language but also an ecosystem of packages that are very helpful in data processing.
It comes in versions 2 and 3. Version 3 has proper Unicode handling. LAF-Fabric is based on version 3.

You can run a Python program by issuing a command on the terminal.
You can also run Python interactively on the terminal.
Even better, you can run Python in a rich interface in the browser, where you can document your code right next to the code itself.
This is the *notebook* concept. What you are reading now is part of a notebook.
The technology that makes this possible is called **Jupyter** (formerly **IPython**).
To install Python, Jupyter, and all they need in one go, download **Anaconda** and install it.
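
Once Python is installed, a quick check like the one below can confirm that the interpreter you are running is a version 3. This is only a minimal sketch; the helper name `python_major_ok` is made up for this illustration and is not part of LAF-Fabric or Anaconda.

```python
import sys

def python_major_ok(version_info=sys.version_info, required_major=3):
    """Return True if the interpreter has at least the required major version."""
    return version_info[0] >= required_major

# Report the version of the interpreter that runs this cell.
status = "OK" if python_major_ok() else "too old"
print("Python", sys.version.split()[0], status)
```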

### MySQL, SQLite3

Relational databases may not be the hype nowadays, but they have lost nothing of their usefulness. Make sure that you have either sqlite3 or mysql on your system, or both.
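
A quick way to see whether SQLite is usable from Python is to do a round trip through an in-memory database, as in the sketch below. Note the assumption: this exercises the SQLite library that ships with Python's standard `sqlite3` module, which is not necessarily the same installation as the `sqlite3` command line tool. The table and column names are invented for the example.

```python
import sqlite3

def sqlite_roundtrip():
    """Create an in-memory database, store one row, and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE word (id INTEGER PRIMARY KEY, gender TEXT)")
        conn.execute("INSERT INTO word (gender) VALUES (?)", ("f",))
        return conn.execute("SELECT id, gender FROM word").fetchall()
    finally:
        conn.close()

print("SQLite library version:", sqlite3.sqlite_version)
print(sqlite_roundtrip())
```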

### Emdros

Emdros implements the concept of a text database, where text is modelled as things that come in an order and that can be nested. The things can be given features that describe their properties, and things can be connected by relationships. On top of that, Emdros supports a query language, MQL, in which your queries are structural templates with fixed and variable properties. The results are all possible ways in which the variable parts can be filled from the data in the database.

Most data processing in LAF-Fabric does not need Emdros, because LAF-Fabric offers its own way of retrieving information from a graph database. Nevertheless, LAF-Fabric has an API for MQL queries, so that you can run them programmatically, and process their results. You can combine results of multiple queries and compare results with other ways of querying the data.

If you plan to do this, you also have to install Emdros.
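
To make the text-database idea a bit more concrete, here is a toy model in plain Python. It is not Emdros and it does not use MQL: the class `TextObject`, the function `find_nested`, and the tiny corpus are all invented for this illustration. Objects occupy positions in the text, carry features, and a "query" asks for every way an inner object with given features can sit inside an outer object.

```python
from dataclasses import dataclass, field

@dataclass
class TextObject:
    otype: str        # object type, e.g. 'word' or 'clause'
    slots: frozenset  # positions in the text that the object occupies
    features: dict = field(default_factory=dict)

def find_nested(objects, outer_type, inner_type, **wanted):
    """Return (outer, inner) pairs where inner lies inside outer
    and inner has all the wanted feature values."""
    outers = [o for o in objects if o.otype == outer_type]
    inners = [
        o for o in objects
        if o.otype == inner_type
        and all(o.features.get(k) == v for k, v in wanted.items())
    ]
    # Nesting is modelled as: the inner slots are a subset of the outer slots.
    pairs = [(o, i) for o in outers for i in inners if i.slots <= o.slots]
    return sorted(pairs, key=lambda p: (min(p[0].slots), min(p[1].slots)))

# A made-up corpus: two clauses of two words each.
corpus = [
    TextObject('clause', frozenset({1, 2})),
    TextObject('clause', frozenset({3, 4})),
    TextObject('word', frozenset({1}), {'gn': 'm'}),
    TextObject('word', frozenset({2}), {'gn': 'f'}),
    TextObject('word', frozenset({3}), {'gn': 'm'}),
    TextObject('word', frozenset({4}), {'gn': 'f'}),
]

# "Find clauses that contain a feminine word."
for clause, word in find_nested(corpus, 'clause', 'word', gn='f'):
    print(sorted(clause.slots), sorted(word.slots))
```

A real text database adds ordering constraints, gaps, and relationships between objects, but the template-with-variables flavour of the queries is the same.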

## LAF-Fabric

LAF-Fabric is a Python package with two sub-packages: **laf** and **etcbc**.
The **laf** package offers an API for working efficiently with LAF data, i.e. data in the Linguistic Annotation Framework format. It does so by compiling a LAF resource (which is a set of bulky XML files) into fast-loading data structures for Python.
Whereas the LAF resource takes 10 minutes to parse before you can do useful work with it, the compiled data loads in a matter of seconds. LAF-Fabric loads and unloads data according to your request.

To get LAF-Fabric, clone it from GitHub and then perform an installation step to incorporate the package into your Python environment.
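
The compile-once, load-fast idea described above can be sketched with the standard library alone. The sketch below is an assumption-laden illustration, not LAF-Fabric's actual compiler or file format: the miniature XML, the attribute names, and the functions `compile_resource` and `load_compiled` are all hypothetical. It parses XML once, keeps only compact integer arrays, and stores them in a binary file that loads without any XML parsing.

```python
import array
import os
import pickle
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical miniature resource; real LAF files are far larger and richer.
XML_SOURCE = """<graph>
  <node id="n1" start="0" end="5"/>
  <node id="n2" start="6" end="11"/>
  <node id="n3" start="0" end="11"/>
</graph>"""

def compile_resource(xml_text, target_path):
    """Parse the XML once and store compact integer arrays on disk."""
    root = ET.fromstring(xml_text)
    nodes = list(root.iter('node'))
    starts = array.array('I', (int(n.get('start')) for n in nodes))
    ends = array.array('I', (int(n.get('end')) for n in nodes))
    with open(target_path, 'wb') as fh:
        pickle.dump({'start': starts, 'end': ends}, fh)

def load_compiled(target_path):
    """Load the compiled arrays; no XML parsing is needed any more."""
    with open(target_path, 'rb') as fh:
        return pickle.load(fh)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'compiled.bin')
    compile_resource(XML_SOURCE, path)   # slow step, done once
    data = load_compiled(path)           # fast step, done every session

print(list(data['start']), list(data['end']))
```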

## ETCBC data

The Hebrew Bible as encoded by the ETCBC contains the text of the Biblia Hebraica Stuttgartensia plus linguistic features provided by the Eep Talstra Centre for Bible and Computer.

This data has been curated and archived at DANS/EASY, in several versions: 3, 4, and 4b.
Versions 4 and 4b contain the curated data in LAF, but also in other convenient formats.
The compiled data of the latest version is in a GitHub repository, and this is the recommended way to quickly get up and running with LAF-Fabric.

# Work

Now we can get to work.

We import some modules, among which **laf** and **etcbc** from the LAF-Fabric package.

In [1]:
import sys
import collections

from laf.fabric import LafFabric
import etcbc
from etcbc.preprocess import prepare

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.5
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



Note the hyperlinks above:

1. to the documentation of the LAF-Fabric API. This tells you what the software can do.
2. to the feature documentation. This tells you what information has been encoded in the data and how.

# Choose your source and features

LAF-Fabric is a generic tool for LAF resources. We have to specify a data source.
In our case, that is a sub-directory of the **laf-fabric-data** directory.

Our data is in **etcbc4b**. We call **etcbc** the source, and **4b** the version.

We also give our task a name, in this case **tutorial**.
The task name will become a sub-directory of **laf-fabric-output**.
There are handy commands to read and write files in this directory.
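
Judging from the log file location that LAF-Fabric reports when it loads the data (`.../laf-fabric-output/etcbc4b/tutorial/__log__tutorial.txt`), the output directory of a task is built from the source, the version, and the task name. The sketch below merely reproduces that layout; the function `task_dir` is made up for this illustration, and where **laf-fabric-output** itself lives depends on your configuration.

```python
from pathlib import PurePosixPath

def task_dir(output_root, source, version, task):
    """Build the directory in which a task keeps its files (illustrative only)."""
    return PurePosixPath(output_root) / (source + version) / task

print(task_dir('laf-fabric-output', 'etcbc', '4b', 'tutorial'))
```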

Below, in the call of **fabric.load**, you see the argument '--'.
It means that no extra annotation package is used, but you could have named a package with extra annotations here.
If you specify such a package, its annotations will be merged with the annotations in the main resource.

This is a convenient way to enrich the base data with extra information, or to override erroneous annotations with correct ones.
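
The effect of such a merge can be pictured with ordinary dictionaries. The sketch below is not how LAF-Fabric stores or merges annotations internally; the function `merge_annotations`, the feature `note`, and the node numbers are all invented. It only shows the principle: values from the extra package are added to the base data and win wherever both define a value for the same node.

```python
def merge_annotations(base, extra):
    """Return the base feature values, extended and overridden by extra.

    Both arguments map a feature name to a dict of {node: value}.
    """
    # Copy the inner dicts so that the base data is left untouched.
    merged = {feat: dict(values) for feat, values in base.items()}
    for feat, values in extra.items():
        merged.setdefault(feat, {}).update(values)
    return merged

base = {'gn': {1: 'm', 2: 'm', 3: 'f'}}
extra = {
    'gn': {2: 'f'},          # overrides a (made-up) erroneous value on node 2
    'note': {3: 'checked'},  # enriches the data with a new feature
}

merged = merge_annotations(base, extra)
print(merged)
```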

In [2]:
source = 'etcbc'
version = '4b'
task = 'tutorial'

It is time to specify the data *features* that we want to use.
We declare just a bare minimum; later on we add more as needed.

In [3]:
API = fabric.load(source+version, '--', task, {
    'xmlids': {'node': False, 'edge': False},
    'features': ('''
        otype sp gn
''',''),
    'prepare': prepare,
}, verbose='DETAIL')

exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: UP TO DATE
  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s DETAIL: COMPILING a: UP TO DATE
  0.01s DETAIL: load main: G.node_anchor_min
  0.12s DETAIL: load main: G.node_anchor_max
  0.22s DETAIL: load main: G.node_sort
  0.31s DETAIL: load main: G.node_sort_inv
  0.75s DETAIL: load main: G.edges_from
  0.87s DETAIL: load main: G.edges_to
  1.00s DETAIL: load main: F.etcbc4_db_otype [node] 
  1.70s DETAIL: load main: F.etcbc4_ft_gn [node] 
  1.85s DETAIL: load main: F.etcbc4_ft_sp [node] 
  2.04s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/tutorial/__log__tutorial.txt
  2.04s DETAIL: prep prep: G.node_sort
  2.14s DETAIL: prep prep: G.node_sort_inv
  2.61s DETAIL: prep prep: L.node_up
  5.56s DETAIL: prep prep: L.node_down
    11s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  0.00s LOADING API with EXTRAs: please wait ... 
  0.00s DETAIL: COMPILIN

# Let's go computing

We are now in an excellent position to explore the data.
We shall assume very little knowledge about what is in the data.
We are going to discover it.

## Counting

So, the data is a graph with nodes and edges. How many nodes? We are going to walk through them all.

In [4]:
msg('Counting nodes')

n_nodes = 0
for n in NN(): n_nodes += 1
    
msg('There are {} nodes'.format(n_nodes))

 1m 34s Counting nodes
 1m 34s There are 1436858 nodes


The function ``msg`` is part of LAF-Fabric. It gives a progress message with the elapsed time since the API was loaded. You see that counting 1.4 million nodes is a breeze.
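
If you want to try the counting pattern without LAF-Fabric at hand, the sketch below imitates it with the standard library. Everything in it is a stand-in: `make_msg` builds a function that behaves roughly like ``msg`` (the exact output format of the real one may differ), and `fake_nodes` plays the role of `NN()`. It also shows a compact idiom for counting the items of any iterator.

```python
import time

def make_msg(start=None):
    """Return a function that prints a message with the elapsed time."""
    start = time.time() if start is None else start

    def demo_msg(text):
        elapsed = time.time() - start
        line = '{:>7.2f}s {}'.format(elapsed, text)
        print(line)
        return line

    return demo_msg

def count_items(iterable):
    """Count the items of an iterable without storing them."""
    return sum(1 for _ in iterable)

demo_msg = make_msg()
fake_nodes = range(1000)  # stand-in for NN()
demo_msg('There are {} nodes'.format(count_items(fake_nodes)))
```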

## Finer counting

The nodes correspond to *types* of data.
The feature ``otype`` tells what type a node has.
Let's get those types and count how many nodes each type has.

In [5]:
msg('Counting node types')

node_types = collections.Counter()

for n in NN(): node_types[F.otype.v(n)] += 1

msg('There are {} node types:'.format(len(node_types)))
for nt in sorted(node_types):
    print('\t{:<15}: {:>5} nodes'.format(nt, node_types[nt]))

 4m 27s Counting node types
 4m 29s There are 12 node types:


	book           :    39 nodes
	chapter        :   929 nodes
	clause         : 88011 nodes
	clause_atom    : 90554 nodes
	half_verse     : 45180 nodes
	phrase         : 253161 nodes
	phrase_atom    : 267499 nodes
	sentence       : 63586 nodes
	sentence_atom  : 64354 nodes
	subphrase      : 113764 nodes
	verse          : 23213 nodes
	word           : 426568 nodes


So our corpus has 426568 words.
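
As a sanity check, the per-type counts in the table above should add up to the total number of nodes counted earlier. The snippet below copies the numbers from the output by hand, so it runs without LAF-Fabric.

```python
# Node counts per type, copied from the output above.
node_type_counts = {
    'book': 39,
    'chapter': 929,
    'clause': 88011,
    'clause_atom': 90554,
    'half_verse': 45180,
    'phrase': 253161,
    'phrase_atom': 267499,
    'sentence': 63586,
    'sentence_atom': 64354,
    'subphrase': 113764,
    'verse': 23213,
    'word': 426568,
}

total = sum(node_type_counts.values())
print('{} nodes in {} types'.format(total, len(node_type_counts)))
# → 1436858 nodes in 12 types
```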

# Gender in Hebrew

The Hebrew language has two genders: ``m`` (masculine) and ``f`` (feminine).
Any specific word either has one of these genders, or its gender is unknown or irrelevant.

Let us count the proportions.

In [6]:
msg('Counting genders')

genders = collections.Counter()

for w in F.otype.s('word'): genders[F.gn.v(w)] += 1

msg('These are the values that the feature gn can take: {}'.format(', '.join(genders)))
for gn in sorted(genders):
    print('\t{:<15}: {:>5} nodes'.format(gn, genders[gn]))

 5m 55s Counting genders
 5m 57s These are the values that the feature gn can take: NA, unknown, m, f


	NA             : 180149 nodes
	f              : 36712 nodes
	m              : 164183 nodes
	unknown        : 45524 nodes
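
To turn these counts into proportions, divide each count by the total number of words. The snippet below uses the numbers from the output above, typed in by hand, and also computes the overall ratio of masculine to feminine words for the whole corpus.

```python
# Gender counts over all word nodes, copied from the output above.
gender_counts = {'NA': 180149, 'f': 36712, 'm': 164183, 'unknown': 45524}

total_words = sum(gender_counts.values())
for gn in sorted(gender_counts):
    share = 100 * gender_counts[gn] / total_words
    print('\t{:<15}: {:>5.1f} %'.format(gn, share))

# Overall masculine/feminine ratio, ignoring NA and unknown.
overall_ratio = gender_counts['m'] / gender_counts['f']
print('m/f = {:.2f}'.format(overall_ratio))
```

The total equals the number of word nodes found earlier, which confirms that every word has some value for ``gn``.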


Let us now make an overview of the proportion between positively masculine words and positively feminine words per book in the Hebrew Bible. The books are sorted by m/f ratio, so the most feminine books come first.

In [7]:
msg('Ranking books on gender ratio')

# Per book, count the words that are positively masculine or feminine.
book_genders = collections.defaultdict(collections.Counter)
real_genders = {'m', 'f'}

current_book = None
for n in NN():
    ntype = F.otype.v(n)
    if ntype == 'book':
        # This relies on each book node coming before its words in NN().
        current_book = F.book.v(n)
    elif ntype == 'word':
        gn = F.gn.v(n)
        if gn in real_genders:
            book_genders[current_book][gn] += 1

# Ratio of masculine to feminine words per book.
book_ratios = {book: counts['m'] / counts['f'] for (book, counts) in book_genders.items()}
msg('Done')

# Lowest ratio first, i.e. the most feminine books on top.
for (book, ratio) in sorted(book_ratios.items(), key=lambda x: (x[1], x[0])):
    print('{:<15} m/f = {:.2}'.format(book, ratio))

 6m 49s Ranking books on gender ratio
 6m 50s Done


Ruth            m/f = 1.4
Canticum        m/f = 1.8
Ezechiel        m/f = 2.8
Esther          m/f = 3.1
Nahum           m/f = 3.2
Threni          m/f = 3.3
Proverbia       m/f = 3.4
Micha           m/f = 3.4
Leviticus       m/f = 3.4
Zephania        m/f = 3.7
Genesis         m/f = 4.0
Daniel          m/f = 4.0
Numeri          m/f = 4.1
Amos            m/f = 4.2
Jesaia          m/f = 4.2
Sacharia        m/f = 4.2
Jona            m/f = 4.3
Jeremia         m/f = 4.3
Iob             m/f = 4.4
Exodus          m/f = 4.7
Judices         m/f = 4.7
Hosea           m/f = 4.7
Psalmi          m/f = 4.9
Nehemia         m/f = 5.0
Joel            m/f = 5.1
Josua           m/f = 5.2
Reges_I         m/f = 5.3
Chronica_II     m/f = 5.3
Habakuk         m/f = 5.4
Esra            m/f = 5.4
Deuteronomium   m/f = 5.4
Ecclesiastes    m/f = 5.5
Maleachi        m/f = 6.0
Reges_II        m/f = 6.0
Samuel_I        m/f = 6.1
Samuel_II       m/f = 6.2
Chronica_I      m/f = 7.4
Obadia          m/f = 7.7
Haggai      