# Simple data export

This is a self contained notebook to make a basic data export of the missives data from its Text-Fabric representation.

For more ways to compute with the missives in Text-Fabric, see this 
[tutorial](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/missieven/start.ipynb).


# Goal

We want to produce a tsv file in which each row corresponds to a single word.

For each word, we store information about whether the word occurs in original letter text or in modern redactional text.
If there is a footnote at a word, we also store the text of the footnote in a column.

# Details

We produce a tsv file with over 5 million rows, with a header row.
The columns are separated by tabs, the rows by newlines.
Here is a description of the columns:

column name | data type | remarks
--- | --- | ---
word | string | the text of a word without punctuation or other interword material
kind | `o` or `e` or `h` or `f` or `0`| *original text* or *editorial text* or *heading text* or *folio-ref text* or an empty word.
note | text | empty or the text of a footnote to the word in question

*Heading text* is the text found in the headings of the letters.

*Folio-ref text* is text found in folio references.

*Empty words* are words without text and punctuation, inserted in otherwise empty lines/pages.

We do not use text quotation characters, not even in the column that contains the footnotes.
If footnote text contains newlines or tabs, they have been escaped by the strings 
`\n` and `\t` respectively. 

Footnote texts contain the original footnote mark.
If there are multiple footnotes to one word, they are separated by a double newline.

**Example:** there are two footnotes at word 83202, one with mark `3` and one with mark `4`.
That means that in row 83202 in column *note* you'll see the value

```
3. Onderwegen, bedoeld: voor de schulden afbetaald zijn, houdt het verlopen en afsterven\nvan de slaven de vrijburgers arm.\n\n4. Voor de tegenslagen van de perkeniers op Banda vgl. men\nPerkeniers, p. 565-566.
```

# Download

The resulting tsv file can be downloaded from a [release](https://github.com/Dans-labs/clariah-gm/releases).

This is a direct link to the version generated when this notebook was last updated:

[words.tsv.gz](https://github.com/Dans-labs/clariah-gm/releases/download/v0.6/words.tsv.gz)

# Exporting

Here comes the code that performs the export.

We generate the export file `words.tsv.gz` in a directory `_local` in this repo, which will not be committed to git (it is in the `.gitignore` file.
It is a gzipped version of `words.tsv`.

In [1]:
import os
import gzip
from itertools import chain
from shutil import copyfileobj

from tf.app import use

In [2]:
DEST = os.path.expanduser("~/github/Dans-labs/clariah-gm/_local/words.tsv")

In [5]:
A = use("CLARIAH/wp6-missieven:v0.6", checkout="v0.6", hoist=globals())

rate limit is 5000 requests per hour, with 4988 left for this hour
	connecting to online GitHub repo annotation/app-missieven ... connected
	code/__init__.py...downloaded
	code/app.py...downloaded
	code/config.yaml...downloaded
	code/static...directory
		code/static/display.css...downloaded
		code/static/logo.png...downloaded
	OK


rate limit is 5000 requests per hour, with 4973 left for this hour
	connecting to online GitHub repo Dans-labs/clariah-gm ... connected
	downloading https://github.com/Dans-labs/clariah-gm/releases/download/v0.6/tf-0.6.zip ... 
	unzipping ... 
	saving data


   |     1.63s T otype                from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |       14s T oslots               from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     6.95s T transo               from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     3.72s T transr               from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     8.56s T punc                 from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     5.70s T punco                from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.00s T title                from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.65s T n                    from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     3.14s T puncr                from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |       11s T trans                from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |      |     0.34s C __levels__           from otype, oslots, otext
   |      |       51s C __ord



   |      |     1.27s C __sections__         from otype, oslots, otext, __levUp__, __levels__, n, n, n
   |      |     0.24s C __structure__        from otype, oslots, otext, __rank__, __levUp__, n, title, n
   |     0.00s T author               from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.00s T authorFull           from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.02s T col                  from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.00s T day                  from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.01s T emph                 from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.02s T facs                 from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.07s T fnote                from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.02s T folio                from ~/text-fabric-data/Dans-labs/clariah-gm/tf/0.6
   |     0.00s T month                from ~/text-fabric-data/Dans-labs/

Hint: in order to see the data documentation of the TF data of the missives, click the link **Feature docs** in the output of the previous cell.

In order to quickly decide whether a word belongs to a head of a letter, we make an index of those words upfront:

In [6]:
headSet = set(chain.from_iterable(L.d(h, otype="word") for h in F.otype.s("head")))

In [7]:
len(headSet)

16043

In [8]:
A.indent(reset=True)
A.info("Make data file ...")

with open(DEST, "w") as f:
    f.write("word\tkind\tnote\n")

    for w in F.otype.s("word"):
        word = F.trans.v(w)
        kind = (
            "h"
            if w in headSet
            else "f"
            if F.folio.v(w)
            else "e"
            if F.remark.v(w)
            else "0"
            if word == "" and F.punc.v(w) == ""
            else "o"
        )
        note = F.fnote.v(w) or ""

        f.write(f"{word}\t{kind}\t{note}\n")

A.info("done")

  0.00s Make data file ...
  7.41s done


In [10]:
!head -n 50 {DEST}

word	kind	note
I	h	
PIETER	h	
BOTH	h	
AAN	h	
BOORD	h	
VAN	h	
HET	h	
WAPEN	h	
VAN	h	
AMSTERDAM	h	
VOOR	h	
ILE	h	
DE	h	
MAYO	h	1. De eerste drie brieven, door Both op reis naar Indië geschrovon, wijken niet af van
de andere brieven van vlootvoogden aan Heren XVII ; dat hij in de kwaliteit van Gouverneur-
Generaal aan boord vertoefde, verandert daaraan niets. Ile de Mayo is een der Kaap-Verdische
eilanden.
25	h	
februari	h	
1610	h	
961	f	
1folio	f	
Scheepshericht	e	
vnl	e	
handelend	e	
over	e	
uitgedeelde	e	
straffen	e	
	0	
	0	
II	h	
PIETER	h	
BOTH	h	
AAN	h	
BOORD	h	
VAN	h	
HET	h	
WAPEN	h	
VAN	h	
AMSTERDAM	h	
LIGGENDE	h	
IN	h	
DE	h	
TAFELBAAI	h	
6	h	
augustus	h	
1610	h	
961	f	
copie	f	


In [11]:
!tail -n 50 {DEST}

factuur	o	
van	o	
aanreekening	o	
weegens	o	
1	o	
aam	o	
gember	o	
geconfijte	o	
met	o	
eerstgemelde	o	
bodem	o	
niet	o	
aangereekent	o	
ƒ	o	
154	o	
6	o	
Dus	o	
’t	o	
weesentlijke	o	
gelaadene	o	
in	o	
deesen	o	
boodem	o	
bedraagt	o	
19e	o	
januarij	o	
per	o	
’t	o	
schip	o	
Pijlsweert	o	
ƒ	o	
144	o	
314	o	
8	o	
ƒ	o	
237	o	
609	o	
3	o	
ƒ	o	
381	o	
923	o	
3	o	
8	o	
Somma	o	
ƒ	o	
2	o	
434	o	
436	o	
2	o	
	0	


Finally we gzip the `words.tsv file`:

In [12]:
with open(DEST, 'rb') as fIn:
    with gzip.open(f"{DEST}.gz", "wb") as fOut:
        copyfileobj(fIn, fOut)

In [13]:
!ls -lh {DEST}
!ls -lh {DEST}.gz

-rw-r--r--  1 dirk  staff    45M Dec  7 12:12 /Users/dirk/github/Dans-labs/clariah-gm/_local/words.tsv
-rw-r--r--  1 dirk  staff    13M Dec  7 12:12 /Users/dirk/github/Dans-labs/clariah-gm/_local/words.tsv.gz
