<img align="right" src="images/tf.png"/>
<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/logo.png"/>

---

To get started: consult [start](start.ipynb)

---

# Computing "by hand"

We descend to a more concrete level, and interact with the data by means of a bit of hand-coding.

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

In [3]:
from tf.app import use

In [4]:
A = use("missieven", hoist=globals())
# A = use('missieven:latest', checkout="latest", hoist=globals())
# A = use('missieven:clone', checkout="clone", hoist=globals())

# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on, for all words, several millions, in this corpus.

Each piece of information about the words, including the text of the words, constitutes a column in that spreadsheet.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
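
As a mental model only, the column idea can be sketched in plain Python, with each feature as a separate mapping from node number to value. The toy values and the names `trans_column`, `flag_column` and `feature_value` are invented for illustration; this is not a description of how Text-Fabric stores features internally.

```python
# Toy model: each feature is its own column, keyed by node (row) number.
# All values here are invented for illustration.
trans_column = {1: "de", 2: "generale", 3: "missiven"}
flag_column = {3: True}  # sparse: a column need not have a value for every row


def feature_value(column, node):
    """Return the value of a feature for a node, or None if it has none."""
    return column.get(node)


for node in (1, 2, 3):
    print(node, feature_value(trans_column, node), feature_value(flag_column, node))
```

Keeping columns separate means that a feature can be added or left out without touching the others.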

You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

Edge features are marked by **_bold italic_** formatting.

# Counting

In [5]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.83s 5815918 nodes


# Node types

In [6]:
F.otype.slotType

'word'

In [7]:
F.otype.all

('volume',
 'letter',
 'page',
 'table',
 'para',
 'remark',
 'head',
 'note',
 'line',
 'row',
 'folio',
 'cell',
 'subhead',
 'word')

In [8]:
C.levels.data

(('volume', 403134.8461538461, 5815906, 5815918),
 ('letter', 8897.713073005094, 5264487, 5265075),
 ('page', 516.283279140802, 5742706, 5752854),
 ('table', 127.3167701863354, 5815584, 5815905),
 ('para', 100.99678285764882, 5752855, 5786735),
 ('remark', 75.43059069889189, 5786736, 5809657),
 ('head', 30.92190152801358, 5263898, 5264486),
 ('note', 16.990561432058584, 5730416, 5742705),
 ('line', 11.261698972794086, 5265076, 5730415),
 ('row', 8.974594831362243, 5809658, 5814223),
 ('folio', 3.674245393963152, 5261347, 5263897),
 ('cell', 1.989899480405963, 5240754, 5261346),
 ('subhead', 1.4970588235294118, 5814224, 5815583),
 ('word', 1, 1, 5240753))

The second column is the average size (in words) of the node type mentioned in the first column.

The third and fourth columns are the node numbers of the first and the last node of that kind.
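
Since each node type occupies one contiguous range of node numbers in the output above, the type of a node follows from its number alone. The sketch below illustrates this with the ranges copied from that output; the helper `type_of` is hypothetical, and in practice `F.otype.v(n)` is the way to ask for the type of node `n`.

```python
from bisect import bisect_right

# (node type, first node, last node), copied from the C.levels.data output
RANGES = sorted(
    [
        ("volume", 5815906, 5815918),
        ("letter", 5264487, 5265075),
        ("page", 5742706, 5752854),
        ("table", 5815584, 5815905),
        ("para", 5752855, 5786735),
        ("remark", 5786736, 5809657),
        ("head", 5263898, 5264486),
        ("note", 5730416, 5742705),
        ("line", 5265076, 5730415),
        ("row", 5809658, 5814223),
        ("folio", 5261347, 5263897),
        ("cell", 5240754, 5261346),
        ("subhead", 5814224, 5815583),
        ("word", 1, 5240753),
    ],
    key=lambda r: r[1],  # order by first node number
)
STARTS = [start for (_, start, _) in RANGES]


def type_of(node):
    """Return the node type whose range contains the given node number."""
    i = bisect_right(STARTS, node) - 1
    if i < 0:
        raise ValueError(f"node {node} is out of range")
    typ, start, end = RANGES[i]
    if not start <= node <= end:
        raise ValueError(f"node {node} is out of range")
    return typ


for n in (1, 5240754, 5815918):
    print(n, type_of(n))
```

The ranges tile the numbers from 1 up to 5815918 without gaps, which matches the node count found by walking all nodes.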

In [13]:
for (typ, av, start, end) in C.levels.data:
    print(
        f"{end - start + 1:>7} x {typ:<7}"
        f" having an average size of {int(round(av)):>6} words"
        f" and a total size of {int(round(av * (end - start + 1))):>7} words"
    )

     13 x volume  having an average size of 403135 words and a total size of 5240753 words
    589 x letter  having an average size of   8898 words and a total size of 5240753 words
  10149 x page    having an average size of    516 words and a total size of 5239759 words
    322 x table   having an average size of    127 words and a total size of   40996 words
  33881 x para    having an average size of    101 words and a total size of 3421872 words
  22922 x remark  having an average size of     75 words and a total size of 1729020 words
    589 x head    having an average size of     31 words and a total size of   18213 words
  12290 x note    having an average size of     17 words and a total size of  208814 words
 465340 x line    having an average size of     11 words and a total size of 5240519 words
   4566 x row     having an average size of      9 words and a total size of   40978 words
   2551 x folio   having an average size of      4 words and a total size of    9373 words

The node type `note` corresponds to footnotes. Here we see that there are over 12,000 footnotes
in this corpus, with on average 17 words in a footnote.

Note that the node type `folio` corresponds to a reference to a folio, not to the contents of a folio.
That explains its short average length in words.

By inspecting the total size (in words) of a node type, we quickly see
which node types cover the corpus and which node types are rare:

* the types `volume`, `letter` and `word` partition the corpus exactly
* the types `page` and `line` nearly partition the corpus: their totals fall 994 and 234 words short of the 5240753 words. Explanation for `page`: there are some words outside pages. See below.
* not all material is divided into `para`s (e.g. folios, headings, subheadings, tables)
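
A quick way to check such coverage claims is to compare the total size of a node type with the number of words. The sketch below does this for a few types, using only numbers copied from the `C.levels.data` output; the name `shortfall` is made up here. It assumes that nodes of one type do not overlap, otherwise the total size would overstate the coverage.

```python
TOTAL_WORDS = 5240753  # number of the last word node in the C.levels.data output

# (node type, average size in words, first node, last node), from that output
LEVELS = [
    ("volume", 403134.8461538461, 5815906, 5815918),
    ("letter", 8897.713073005094, 5264487, 5265075),
    ("page", 516.283279140802, 5742706, 5752854),
    ("line", 11.261698972794086, 5265076, 5730415),
    ("word", 1, 1, 5240753),
]

shortfall = {}
for typ, av, start, end in LEVELS:
    # same computation as in the notebook cell: average size times number of nodes
    total = int(round(av * (end - start + 1)))
    shortfall[typ] = TOTAL_WORDS - total
    print(f"{typ:<7} misses {shortfall[typ]:>5} words")
```

A shortfall of 0 means that the summed sizes of that type equal the number of words in the corpus.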

Let's collect a few of those mysterious words outside any page:

In [17]:
outsiders = []

for w in F.otype.s("word"):
    if not L.u(w, otype="page"):
        outsiders.append((w,))
        if len(outsiders) > 10:
            break

A.table(outsiders, withNodes=True)

n | p | word
--- | --- | ---
1 | 1 | 74 II.
2 | 1 | 75 PIETER
3 | 1 | 76 BOTH,
4 | 1 | 77 AAN
5 | 1 | 78 BOORD
6 | 1 | 79 VAN
7 | 1 | 80 HET
8 | 1 | 81 WAPEN
9 | 1 | 82 VAN
10 | 1 | 83 AMSTERDAM,


This is a case where a few virtually empty letters are combined on one page.
There is no page break element in those letters (except the first one on a page),
and so our code has failed to assign a page to those letters.

This is not nice and should be improved!

On the other hand, it is also not a real problem in processing the text.

# Feature statistics

There are no linguistic features (yet).

# Word matters

We can only work with the surface forms of words; there is no concept of lexeme in the corpus (yet).

## Top 20 frequent words

In [18]:
for (w, amount) in F.trans.freqList("word")[0:20]:
    print(f"{amount:>6} {w}")

223936 de
185116 van
125399 en
108979 te
 81934 in
 68927 het
 57858 den
 50945 een
 49721 dat
 48744 met
 48565 op
 43818 is
 37511 
 37207 die
 36654 voor
 34229 niet
 33056 tot
 31431 aan
 29661 ende
 28250 door


## Hapaxes

We look for words that occur only once.

We are only interested in words that are completely alphabetic, i.e. words that do not have numbers
or other non-letters in them.

In [19]:
hapaxes1 = sorted(
    w for (w, amount) in F.trans.freqList("word") if amount == 1 and w.isalpha()
)
len(hapaxes1)

83078

In [20]:
for lx in hapaxes1[0:20]:
    print(lx)

AC
ADRIAEN
AF
AFRIKA
AGRA
AJcbar
AND
ANDREASVAN
ANTHONTO
ANTONIOCAENENJOAN
ANTONY
APRIL
ARDECRÖON
ARE
ARNOUD
ASTELIJN
AUahabad
AUen
AUorkulan
AVR
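
The same filter can be tried out without the corpus. The following is a small sketch on invented tokens, with `collections.Counter` standing in for the frequency list; it shows which strings pass `str.isalpha()`.

```python
from collections import Counter

# Invented toy tokens, not taken from the corpus
tokens = ["de", "van", "de", "ARDECRÖON", "1631", "Batavia,", "van", "AGRA"]

freq = Counter(tokens)

# Keep the words that occur once and consist of letters only
hapaxes = sorted(w for (w, amount) in freq.items() if amount == 1 and w.isalpha())
print(hapaxes)
```

`str.isalpha()` accepts non-ASCII letters such as `Ö`, but rejects digits and attached punctuation. Plain `sorted` orders strings by code point, so all uppercase ASCII letters sort before lowercase ones, which is why the first hapaxes shown are mostly capitalized.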


### Small occurrence base

The occurrence base of a word is the set of missives (letters) in which the word occurs.

**N.B. (terminology)**
Here *letter* means a document that has been sent to a recipient. This corpus consists of *missives*
which are letters.

We look only in the content of the original missives.

In [21]:
occurrenceBase = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("letter"):
    title = F.title.v(s)
    for w in L.d(s, otype="word"):
        trans = F.transo.v(w)
        if not trans or not trans.isalpha():
            continue
        occurrenceBase[trans].add(title)
A.info("done")
A.info(f"{len(occurrenceBase)} entries")

  0.00s compiling occurrence base ...
  3.33s done
  3.33s 124787 entries


An overview of how many words have an occurrence base of a given size:

In [22]:
occurrenceSize = collections.Counter()

for (w, letters) in occurrenceBase.items():
    occurrenceSize[len(letters)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"letters {size:>4} : {amount:>6} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"letters {size:>4} : {amount:>6} words")

letters    1 :  68639 words
letters    2 :  16209 words
letters    3 :   8030 words
letters    4 :   5141 words
letters    5 :   3561 words
letters    6 :   2566 words
letters    7 :   2068 words
letters    8 :   1672 words
letters    9 :   1310 words
letters   10 :   1193 words
...
letters  461 :      1 words
letters  463 :      1 words
letters  467 :      1 words
letters  471 :      1 words
letters  475 :      1 words
letters  476 :      1 words
letters  480 :      1 words
letters  483 :      1 words
letters  490 :      1 words
letters  493 :      1 words


Let's give the predicate *private* to those words whose occurrence base is a single missive.

In [23]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

68639
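
To see the mechanics on a small scale, here is a sketch of the same idea on invented toy letters. The names `toy_letters`, `occurrence_base` and `private_words` are made up for this illustration.

```python
from collections import defaultdict

# Invented toy letters: title -> list of words
toy_letters = {
    "letter A": ["de", "schepen", "van", "Batavia"],
    "letter B": ["de", "retouren", "van", "Batavia"],
    "letter C": ["de", "peper"],
}

# Map each alphabetic word to the set of titles of the letters it occurs in
occurrence_base = defaultdict(set)
for title, words in toy_letters.items():
    for word in words:
        if word.isalpha():
            occurrence_base[word].add(title)

# Private words occur in exactly one letter
private_words = {w for (w, base) in occurrence_base.items() if len(base) == 1}
print(sorted(private_words))
```

Note that the base is keyed by letter title, as in the notebook cell above. If two letters happen to share a title, they count as one missive in this measure.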

### Peculiarity of missives

As a final exercise with missives, let's make a list of all of them, and show their

* total number of distinct alphabetic words
* number of private words
* the percentage of private words: a measure of the peculiarity of the missive

In [24]:
letterList = []

empty = set()
ordinary = set()

for d in F.otype.s("letter"):
    letter = F.title.v(d)
    if len(letter) > 50:
        letter = f"{letter[0:22]} .. {letter[-22:]}"
    words = {
        trans
        for w in L.d(d, otype="word")
        if (trans := F.transo.v(w)) and trans.isalpha()
    }
    a = len(words)
    if not a:
        empty.add(letter)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(letter)
        continue
    p = 100 * o / a
    letterList.append((letter, a, o, p))

letterList = sorted(letterList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty letters")
print(f"Found {len(ordinary):>4} ordinary letters (i.e. without private words)")

Found    0 empty letters
Found   59 ordinary letters (i.e. without private words)
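
The percentage and the sort order can be illustrated on toy data. This is a sketch with invented letters and an invented set of private words; it mirrors the sort key used above (highest percentage first, then larger letters, then title).

```python
# Invented toy data: distinct words per letter, and the set of private words
letter_words = {
    "letter A": {"de", "schepen", "van", "Batavia"},
    "letter B": {"de", "van"},
    "letter C": {"de", "peper"},
}
private_words = {"schepen", "Batavia", "peper"}

rows = []
for title, words in letter_words.items():
    n_all = len(words)
    n_own = len(words & private_words)
    if n_all == 0 or n_own == 0:
        continue  # skip empty letters and letters without private words
    rows.append((title, n_all, n_own, 100 * n_own / n_all))

# Highest percentage first; ties go to the letter with more words, then by title
rows.sort(key=lambda e: (-e[3], -e[1], e[0]))

for row in rows:
    print("{:<10} {:>4} {:>4} {:>4.1f}%".format(*row))
```

Because the words of a letter are collected in a set, the percentage is taken over distinct word forms, not over all word occurrences.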


In [25]:
print(
    "{:<50}{:>5}{:>5}{:>5}\n{}".format(
        "missive",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in letterList[0:20]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in letterList[-20:]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))

missive                                            #all #own %own
-----------------------------------
Both; zonder plaats, zonder datum                     7    3 42.9%
Both; zonder plaats, zonder datum                     7    3 42.9%
Van Diemen; in het Sch .. an Afrika, 5 juni 1631     17    4 23.5%
Maetsuycker, Verburch, .. via, 25 september 1675     20    4 20.0%
Durven, Hasselaar, Blo .. tavia, 17 oktober 1730    120   22 18.3%
Maetsuycker, Verburch, .. avia, 20 februari 1672     17    3 17.6%
Reynst; Bantam, 26 oktober 1615                     748  131 17.5%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Reael; Kasteel Mauriti .. kéan, 20 augustus 1618   1175  181 15.4%
Reniers, Maetsuycker,  .. avia, 24 december 1652   5032  723 14.4%
Brouwer, Van Diemen, L .. a

---

# Next steps

By now you have an impression of how to compute with the Missieven corpus.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **compute** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda