<img align="right" src="images/tf.png"/>
<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/logo.png"/>

---

To get started: consult [start](start.ipynb)

---

# Computing "by hand"

We descend to a more concrete level, and interact with the data by means of a bit of hand-coding.

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

In [3]:
from tf.app import use

In [4]:
A = use('missieven', hoist=globals())
# A = use('missieven:latest', checkout="latest", hoist=globals())
# A = use('missieven:clone', checkout="clone", hoist=globals())

# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on for all words in this corpus, about five million.

Each piece of information about the words, including the text of the words, constitutes a column in that spreadsheet.

Rather than putting all that information in one big table, we store each column separately.
We call those columns **features**.
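
To make the spreadsheet picture concrete, here is a toy model in plain Python in which each feature is a separate mapping from node number to value. It only illustrates the idea and is not how Text-Fabric stores features internally; the values and the helper `row` are made up for this sketch.

```python
# Toy model: every feature is its own column, i.e. a mapping node -> value.
# The values below are invented for illustration.
features = {
    "otype": {1: "word", 2: "word", 3: "word", 4: "line"},
    "trans": {1: "de", 2: "gouverneur", 3: "generaal"},
}


def row(node):
    """Collect the values of all features for one node (one spreadsheet row)."""
    return {name: column.get(node) for (name, column) in features.items()}


print(row(2))  # the second word
print(row(4))  # a line node: it has a type but no text value in this toy model
```

A feature that has no value for a node simply yields `None` here, so columns can be added or left out without touching the others.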

You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

Edge features are marked by **_bold italic_** formatting.

# Counting

In [5]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.70s 5572965 nodes


# Node types

In [6]:
F.otype.slotType

'word'

In [7]:
F.otype.all

('volume',
 'letter',
 'page',
 'table',
 'para',
 'remark',
 'head',
 'line',
 'row',
 'folio',
 'cell',
 'subhead',
 'word')

In [8]:
C.levels.data

(('volume', 386957.23076923075, 5572953, 5572965),
 ('letter', 8540.6519524618, 5054182, 5054770),
 ('page', 495.5956251847473, 5499749, 5509897),
 ('table', 125.38509316770187, 5572631, 5572952),
 ('para', 95.00634499040873, 5509898, 5543782),
 ('remark', 75.2654654916674, 5543783, 5566704),
 ('head', 27.237691001697794, 5053593, 5054181),
 ('line', 11.304390329409545, 5054771, 5499748),
 ('row', 8.841217696014017, 5566705, 5571270),
 ('folio', 3.4285714285714284, 5051038, 5053592),
 ('cell', 1.960326324479192, 5030445, 5051037),
 ('subhead', 1.4875, 5571271, 5572630),
 ('word', 1, 1, 5030444))

The second column is the average size (in words) of the nodes of the type mentioned in the first column.

The third and fourth columns are the node numbers of the first and the last node of that type.

In [9]:
for (typ, av, start, end) in C.levels.data:
    print(
        f"{end - start + 1:>7} x {typ:<7} having an average size of {int(round(av)):>6} words"
    )

     13 x volume  having an average size of 386957 words
    589 x letter  having an average size of   8541 words
  10149 x page    having an average size of    496 words
    322 x table   having an average size of    125 words
  33885 x para    having an average size of     95 words
  22922 x remark  having an average size of     75 words
    589 x head    having an average size of     27 words
 444978 x line    having an average size of     11 words
   4566 x row     having an average size of      9 words
   2555 x folio   having an average size of      3 words
  20593 x cell    having an average size of      2 words
   1360 x subhead having an average size of      1 words
5030444 x word    having an average size of      1 words
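
As a rough cross-check, the node ranges and average sizes printed above can be combined: the number of nodes of a type times their average size gives the total number of word slots those nodes span. The sketch below copies three rows from the `C.levels.data` output; the variable names are chosen only for illustration.

```python
# Three rows copied from the C.levels.data output above:
# (node type, average size in words, first node, last node)
levels = (
    ("volume", 386957.23076923075, 5572953, 5572965),
    ("letter", 8540.6519524618, 5054182, 5054770),
    ("page", 495.5956251847473, 5499749, 5509897),
)

total_words = 5030444  # last node number of type "word"

spans = {}
for (typ, av, start, end) in levels:
    count = end - start + 1  # number of nodes, as in the cell above
    spans[typ] = round(count * av)  # total word slots spanned by this type
    print(f"{typ:<7} {count:>6} nodes span {spans[typ]:>8} of {total_words} words")
```

Assuming nodes of one type do not overlap, this suggests that volumes and letters each account for all words, while pages span slightly fewer.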


We can show the text in another text format.

# Feature statistics

There are no linguistic features (yet).

# Word matters

We can only work with the surface forms of words; there is no concept of lexeme in the corpus (yet).

## Top 20 frequent words

In [10]:
for (w, amount) in F.trans.freqList("word")[0:20]:
    print(f"{amount:>6} {w}")

215742 de
175262 van
121566 en
106517 te
 76108 in
 65950 het
 57001 den
 49056 dat
 48155 een
 47568 met
 45623 op
 41990 is
 36058 die
 35904 
 35436 voor
 33244 niet
 31983 tot
 30355 aan
 29644 ende
 27607 door


## Hapaxes

We look for words that occur only once.

We are only interested in words that are completely alphabetic, i.e. words that do not have digits
or other non-letters in them.

In [11]:
hapaxes1 = sorted(w for (w, amount) in F.trans.freqList('word') if amount == 1 and w.isalpha())
len(hapaxes1)

77627

In [12]:
for lx in hapaxes1[0:20]:
    print(lx)

AC
ADRIAEN
AF
AFRIKA
AGRA
AND
ANDREASVAN
ANTHONTO
ANTONIOCAENENJOAN
ANTONY
APRIL
ARDECRÖON
ARE
ARNOUD
ASTELIJN
AUen
AUorkulan
AVR
Aacken
Aade
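
The list starts with capitalised forms because `sorted` compares strings by Unicode code point, so all uppercase ASCII letters come before the lowercase ones. A minimal sketch of the same selection on a made-up token list (the tokens are hypothetical, and `collections.Counter` stands in for `F.trans.freqList`):

```python
import collections

# Hypothetical toy tokens; not taken from the corpus.
tokens = ["de", "van", "de", "Aacken", "APRIL", "1631", "schip", "van", "d'", "AVR"]

freq = collections.Counter(tokens)

# Keep words that occur exactly once and consist of letters only.
hapaxes = sorted(w for (w, amount) in freq.items() if amount == 1 and w.isalpha())
print(hapaxes)
```

Passing `key=str.casefold` to `sorted` would give a case-insensitive order instead, should that be preferred.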


### Small occurrence base

The occurrence base of a word is the set of missives (letters) in which the word occurs.

**N.B. (terminology)**
Here *letter* means a document that has been sent to a recipient. This corpus consists of *missives*,
which are such letters.

We look only in the content of the original missives.

In [13]:
occurrenceBase = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("letter"):
    title = F.title.v(s)
    for w in L.d(s, otype="word"):
        trans = F.transo.v(w)
        if not trans or not trans.isalpha():
            continue
        occurrenceBase[trans].add(title)
A.info("done")
A.info(f"{len(occurrenceBase)} entries")

  0.00s compiling occurrence base ...
  3.55s done
  3.55s 124787 entries


An overview of how many words have an occurrence base of each size, most common sizes first:

In [14]:
occurrenceSize = collections.Counter()

for (w, letters) in occurrenceBase.items():
    occurrenceSize[len(letters)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"letters {size:>4} : {amount:>6} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"letters {size:>4} : {amount:>6} words")

letters    1 :  68639 words
letters    2 :  16209 words
letters    3 :   8030 words
letters    4 :   5141 words
letters    5 :   3560 words
letters    6 :   2566 words
letters    7 :   2069 words
letters    8 :   1672 words
letters    9 :   1310 words
letters   10 :   1193 words
...
letters  461 :      1 words
letters  463 :      1 words
letters  467 :      1 words
letters  471 :      1 words
letters  475 :      1 words
letters  476 :      1 words
letters  480 :      1 words
letters  483 :      1 words
letters  490 :      1 words
letters  493 :      1 words


Let's give the predicate *private* to those words whose occurrence base is a single missive.

In [15]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

68639
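
The two steps above (building the occurrence base, then selecting the private words) can be tried out on a toy data set. This is a sketch with invented letter titles and words; a plain dictionary stands in for the corpus.

```python
import collections

# Hypothetical mini-corpus: letter title -> words in that letter.
letters = {
    "letter A": ["de", "schip", "peper"],
    "letter B": ["de", "schip", "nagelen"],
    "letter C": ["de", "1631"],
}

occurrence_base = collections.defaultdict(set)
for (title, words) in letters.items():
    for word in words:
        if not word.isalpha():
            continue  # skip tokens with digits or other non-letters
        occurrence_base[word].add(title)

# Private words: those whose occurrence base is a single letter.
privates = {w for (w, base) in occurrence_base.items() if len(base) == 1}
print(sorted(privates))
```

Note that the occurrence base is keyed by title, so two letters that share a title count as one; collecting letter nodes instead of titles would keep them apart.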

### Peculiarity of missives

As a final exercise with missives, let's make a list of all of them and show, for each one:

* the number of distinct alphabetic word forms
* the number of private words among them
* the percentage of private words: a measure of the peculiarity of the missive

In [16]:
letterList = []

empty = set()
ordinary = set()

for d in F.otype.s("letter"):
    letter = F.title.v(d)
    if len(letter) > 50:
        letter = f"{letter[0:22]} .. {letter[-22:]}"
    words = {
        trans
        for w in L.d(d, otype="word")
        if (trans := F.transo.v(w)) and trans.isalpha()
    }
    a = len(words)
    if not a:
        empty.add(letter)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(letter)
        continue
    p = 100 * o / a
    letterList.append((letter, a, o, p))

letterList = sorted(letterList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty letters")
print(f"Found {len(ordinary):>4} ordinary letters (i.e. without private words)")

Found    0 empty letters
Found   59 ordinary letters (i.e. without private words)
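
The peculiarity measure is the share of a missive's distinct alphabetic word forms that are private. A minimal sketch of that computation and of the sort order used above, on invented data (the titles, words and the set of private words are all hypothetical):

```python
# Hypothetical mini-corpus and private words, for illustration only.
letters = {
    "letter A": ["de", "schip", "peper", "peper"],
    "letter B": ["de", "schip"],
    "letter C": ["de", "nagelen"],
}
privates = {"peper", "nagelen"}

rows = []
for (title, tokens) in letters.items():
    forms = {t for t in tokens if t.isalpha()}  # distinct alphabetic forms
    own = forms & privates
    if not forms or not own:
        continue  # skip empty and ordinary letters
    rows.append((title, len(forms), len(own), 100 * len(own) / len(forms)))

# Highest percentage first, then larger letters, then title.
rows.sort(key=lambda e: (-e[3], -e[1], e[0]))
for r in rows:
    print("{:<10} {:>4} {:>4} {:>4.1f}%".format(*r))
```

Repeated words count once, which is why very short missives can reach high percentages with only a few private words.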


In [17]:
print(
    "{:<50}{:>5}{:>5}{:>5}\n{}".format(
        "missive",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in letterList[0:20]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in letterList[-20:]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))

missive                                            #all #own %own
-----------------------------------
Both; zonder plaats, zonder datum                     7    3 42.9%
Both; zonder plaats, zonder datum                     7    3 42.9%
Van Diemen; in het Sch .. an Afrika, 5 juni 1631     17    4 23.5%
Maetsuycker, Verburch, .. via, 25 september 1675     20    4 20.0%
Durven, Hasselaar, Blo .. tavia, 17 oktober 1730    120   22 18.3%
Maetsuycker, Verburch, .. avia, 20 februari 1672     17    3 17.6%
Reynst; Bantam, 26 oktober 1615                     748  131 17.5%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Reael; Kasteel Mauriti .. kéan, 20 augustus 1618   1175  181 15.4%
Reniers, Maetsuycker,  .. avia, 24 december 1652   5032  723 14.4%
Brouwer, Van Diemen, L .. atavia, 4 januari 1636   4084  585 14.3%
Coen, Jansz, Lefebvre, .. tavia, 3 november 1628     21    3 14.3%
Coen, Sonck; Schip Nie .. nda-Neira , 6 mei 1621     21    3 14.3%
Both; Fort Mauritius n .. d

---

# Next steps

By now you have an impression of how to compute with the Missieven data.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **compute** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours

CC-BY Dirk Roorda