<img align="right" src="images/tf.png"/>
<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/logo.png"/>

---

To get started: consult [start](start.ipynb)

---

# Computing "by hand"

We descend to a more concrete level, and interact with the data by means of a bit of hand-coding.

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

In [3]:
from tf.app import use

In [4]:
A = use("clariah/wp6-missieven:clone", checkout="clone", hoist=globals())
# A = use("annotation/clariah-gm", hoist=globals())

This is Text-Fabric 9.4.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

44 features found and 0 ignored


# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on, for all of the nearly six million words in this corpus.

Each piece of information about the words, including the text of the words, constitutes a column in that spreadsheet.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
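
To make the spreadsheet picture concrete, here is a minimal sketch in plain Python. It is not how Text-Fabric stores features internally: each feature is simply modelled as its own mapping from node number to value. The sample values, the `punc` feature and the `value` helper are made up for illustration.

```python
# Toy corpus: node numbers 1..5 are word slots, as in the spreadsheet picture.
# Each feature is a separate column: a mapping from node number to value.
# The values below are illustrative, not taken from the corpus.
trans = {1: "de", 2: "schepen", 3: "van", 4: "de", 5: "Compagnie"}
punc = {2: ",", 5: "."}  # a sparse column: most words have no value here


def value(feature, node):
    """Return the value of a feature for a node, or None if there is none."""
    return feature.get(node)


for node in range(1, 6):
    print(node, value(trans, node), value(punc, node))
```

Looking up one value for one node is comparable in spirit to the `F.title.v(s)` and `F.transo.v(w)` calls used further on in this notebook.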

You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

Edge features are marked by **_bold italic_** formatting.

# Counting

In [5]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.42s 6638983 nodes


# Node types

In [6]:
F.otype.slotType

'word'

In [7]:
F.otype.all

('volume',
 'letter',
 'page',
 'table',
 'para',
 'remark',
 'head',
 'note',
 'line',
 'row',
 'folio',
 'cell',
 'subhead',
 'word')

In [8]:
C.levels.data

(('volume', 426954.28571428574, 6638970, 6638983),
 ('letter', 9847.380560131796, 6018166, 6018772),
 ('page', 532.979045920642, 6558167, 6569381),
 ('table', 137.91038696537677, 6638479, 6638969),
 ('para', 100.7875075489604, 6569382, 6604154),
 ('remark', 97.49029448361675, 6604155, 6628264),
 ('head', 31.115321252059307, 6017559, 6018165),
 ('note', 16.88329592818211, 6545691, 6558166),
 ('line', 11.344004190405338, 6018773, 6545690),
 ('row', 8.099520958083833, 6628265, 6636614),
 ('folio', 2.6304595518420055, 6009660, 6017558),
 ('cell', 2.0938419146103593, 5977361, 6009659),
 ('subhead', 1.4248927038626609, 6636615, 6638478),
 ('word', 1, 1, 5977360))

The second column is the average size (in words) of the node type mentioned in the first column.

The third and fourth columns are the node numbers of the first and the last node of that kind.

In [9]:
for (typ, av, start, end) in C.levels.data:
    print(
        f"{end - start + 1:>7} x {typ:<7}"
        f" having an average size of {int(round(av)):>6} words"
        f" and a total size of {int(round(av * (end - start + 1))):>7} words"
    )

     14 x volume  having an average size of 426954 words and a total size of 5977360 words
    607 x letter  having an average size of   9847 words and a total size of 5977360 words
  11215 x page    having an average size of    533 words and a total size of 5977360 words
    491 x table   having an average size of    138 words and a total size of   67714 words
  34773 x para    having an average size of    101 words and a total size of 3504684 words
  24110 x remark  having an average size of     97 words and a total size of 2350491 words
    607 x head    having an average size of     31 words and a total size of   18887 words
  12476 x note    having an average size of     17 words and a total size of  210636 words
 526918 x line    having an average size of     11 words and a total size of 5977360 words
   8350 x row     having an average size of      8 words and a total size of   67631 words
   7899 x folio   having an average size of      3 words and a total size of   20778 words

The node type `note` corresponds to footnotes. Here we see that there are over 12,000 footnotes
in this corpus, with on average 17 words in a footnote.

Note that the node type `folio` corresponds to a reference to a folio, not to the contents of a folio.
That explains its short average length in words.

By inspecting the total size (in words) of a node type, we quickly see
which node types cover the corpus and which node types are rare:

* the types `volume`, `letter`, `page`, `line` and `word` each cover the whole corpus: their total size equals the number of words
* in earlier versions of the data, the type `page` only nearly covered the corpus, because some words fell outside any page.
  That is no longer the case. See below.
* not all material is divided into `para`s (e.g. folios, headings, subheadings, tables)
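
The coverage check in the list above can be sketched as follows, using rows copied from the `C.levels.data` output shown earlier. The variable names are my own. Note that a total size equal to the number of words is only a necessary condition for an exact partition: overlapping nodes could in principle compensate for gaps, so this is a quick plausibility test, not a proof.

```python
# Rows copied from the C.levels.data output above:
# (node type, average size in words, first node, last node)
levels = (
    ("volume", 426954.28571428574, 6638970, 6638983),
    ("letter", 9847.380560131796, 6018166, 6018772),
    ("page", 532.979045920642, 6558167, 6569381),
    ("table", 137.91038696537677, 6638479, 6638969),
    ("para", 100.7875075489604, 6569382, 6604154),
    ("line", 11.344004190405338, 6018773, 6545690),
    ("word", 1, 1, 5977360),
)

# The slot type is the last row; its node count is the corpus size in words.
n_slots = levels[-1][3] - levels[-1][2] + 1

# Collect the node types whose total size equals the corpus size.
covering = []
for typ, av, start, end in levels:
    total = round(av * (end - start + 1))
    if total == n_slots:
        covering.append(typ)

print(covering)
```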

Let's collect the words outside any page, if any:

In [10]:
outsiders = []

for w in F.otype.s("word"):
    if not L.u(w, otype="page"):
        outsiders.append((w,))
        if len(outsiders) > 10:
            break

print(f"{len(outsiders)} outsiders")
A.table(outsiders, withNodes=True)

0 outsiders


# Feature statistics

There are no linguistic features (yet).

# Word matters

We can only work with the surface forms of words, because there is no concept of lexeme in the corpus (yet).

## Top 20 frequent words

In [11]:
for (w, amount) in F.trans.freqList("word")[0:20]:
    print(f"{amount:>6} {w}")

267097 de
215289 van
145076 en
123549 te
 92507 in
 84398 het
 59475 den
 58915 dat
 58055 een
 56649 is
 56626 op
 54628 met
 43217 die
 43071 
 42416 voor
 38526 niet
 36956 aan
 34957 tot
 33338 zijn
 31128 door


## Hapaxes

We look for words that occur only once.

We are only interested in words that are completely alphabetic, i.e. words that do not have numbers
or other non-letters in them.

In [12]:
hapaxes1 = sorted(
    w for (w, amount) in F.trans.freqList("word") if amount == 1 and w.isalpha()
)
len(hapaxes1)

85759

In [13]:
for lx in hapaxes1[0:20]:
    print(lx)

AA
AC
ADRIAEN
AF
AFRIKA
AGRA
AJcbar
AND
ANDREASVAN
ANTHONTO
ANTONIOCAENENJOAN
ANTONY
ARDECRÖON
ARE
ARNOUD
ASTELIJN
AUahabad
AUen
AUorkulan
AVR
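
The hapax selection can be tried out on a handful of tokens without loading the corpus. This is a minimal sketch with made-up tokens, using `collections.Counter` from the standard library in place of `F.trans.freqList`.

```python
from collections import Counter

# Made-up tokens, not taken from the corpus.
tokens = ["de", "schepen", "van", "de", "Compagnie", "1631", "van", "Batavia", "ao."]

freq = Counter(tokens)

# Keep words that occur exactly once and consist of letters only.
hapaxes = sorted(w for (w, amount) in freq.items() if amount == 1 and w.isalpha())
print(hapaxes)
```

Two details explain the shape of the list above: `sorted` puts uppercase letters before lowercase ones, so capitalised forms come first, and `str.isalpha` also accepts non-ASCII letters, which is why a form such as `ARDECRÖON` passes the filter.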


### Small occurrence base

The occurrence base of a word is the set of missives (letters) in which the word occurs.

**N.B. (terminology)**
Here *letter* means a document that has been sent to a recipient. This corpus consists of *missives*
which are letters.

We look only in the content of the original missives.

In [14]:
occurrenceBase = collections.defaultdict(set)

A.indent(reset=True)
A.info("compiling occurrence base ...")
for s in F.otype.s("letter"):
    title = F.title.v(s)
    for w in L.d(s, otype="word"):
        trans = F.transo.v(w)
        if not trans or not trans.isalpha():
            continue
        occurrenceBase[trans].add(title)
A.info("done")
A.info(f"{len(occurrenceBase)} entries")

  0.00s compiling occurrence base ...
  1.86s done
  1.86s 127166 entries


An overview of how many words have an occurrence base of each size:

In [15]:
occurrenceSize = collections.Counter()

for (w, letters) in occurrenceBase.items():
    occurrenceSize[len(letters)] += 1

occurrenceSize = sorted(
    occurrenceSize.items(),
    key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
    print(f"letters {size:>4} : {amount:>6} words")
print("...")
for (size, amount) in occurrenceSize[-10:]:
    print(f"letters {size:>4} : {amount:>6} words")

letters    1 :  69897 words
letters    2 :  16495 words
letters    3 :   8224 words
letters    4 :   5165 words
letters    5 :   3636 words
letters    6 :   2612 words
letters    7 :   2131 words
letters    8 :   1708 words
letters    9 :   1323 words
letters   10 :   1201 words
...
letters  479 :      1 words
letters  481 :      1 words
letters  485 :      1 words
letters  489 :      1 words
letters  493 :      1 words
letters  494 :      1 words
letters  498 :      1 words
letters  501 :      1 words
letters  508 :      1 words
letters  511 :      1 words


Let's call a word *private* if its occurrence base is a single missive.

In [16]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

69897
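
On toy data, the occurrence base and the private words can be sketched like this. The letter titles and words are hypothetical, and plain dictionaries stand in for the Text-Fabric calls used above.

```python
from collections import defaultdict

# Hypothetical toy data: letter title -> words in that letter.
letters = {
    "letter A": ["de", "schepen", "van", "Batavia"],
    "letter B": ["de", "retouren", "van", "Batavia"],
    "letter C": ["de", "schepen", "1631"],
}

# Occurrence base: word -> set of titles of the letters it occurs in.
occurrence_base = defaultdict(set)
for title, words in letters.items():
    for word in words:
        if not word.isalpha():
            continue  # skip numbers and other non-alphabetic tokens
        occurrence_base[word].add(title)

# Private words: those whose occurrence base is a single letter.
privates = {w for (w, base) in occurrence_base.items() if len(base) == 1}
print(sorted(privates))
```

One consequence of keying the occurrence base on titles, as the notebook does with `F.title.v(s)`: letters that share a title are counted as one missive. If that is not wanted, a possible variation is to add the letter node itself instead of its title.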

### Peculiarity of missives

As a final exercise with missives, let's make a list of all of them, and show their

* total number of words
* number of private words
* the percentage of private words: a measure of the peculiarity of the missive

In [17]:
letterList = []

empty = set()
ordinary = set()

for d in F.otype.s("letter"):
    letter = F.title.v(d)
    if len(letter) > 50:
        letter = f"{letter[0:22]} .. {letter[-22:]}"
    words = {
        trans
        for w in L.d(d, otype="word")
        if (trans := F.transo.v(w)) and trans.isalpha()
    }
    a = len(words)
    if not a:
        empty.add(letter)
        continue
    o = len({w for w in words if w in privates})
    if not o:
        ordinary.add(letter)
        continue
    p = 100 * o / a
    letterList.append((letter, a, o, p))

letterList = sorted(letterList, key=lambda e: (-e[3], -e[1], e[0]))

print(f"Found {len(empty):>4} empty letters")
print(f"Found {len(ordinary):>4} ordinary letters (i.e. without private words)")

Found    0 empty letters
Found   59 ordinary letters (i.e. without private words)
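
The peculiarity measure itself is simple arithmetic: the number of private words divided by the number of distinct words, times 100. A minimal sketch on hypothetical toy data, with the same sort order as `letterList` (percentage descending, then size descending, then title):

```python
from collections import Counter

# Hypothetical toy data: letter title -> set of alphabetic word forms.
words_by_letter = {
    "letter A": {"de", "schepen", "van", "Batavia"},
    "letter B": {"de", "retouren", "van", "Batavia", "peper"},
    "letter C": {"de", "schepen", "nagelen"},
}

# A word is private if it occurs in exactly one letter.
letter_count = Counter(w for words in words_by_letter.values() for w in words)
privates = {w for (w, n) in letter_count.items() if n == 1}

rows = []
for title, words in words_by_letter.items():
    own = len(words & privates)
    if own == 0:
        continue  # an "ordinary" letter: no private words
    rows.append((title, len(words), own, 100 * own / len(words)))

rows.sort(key=lambda r: (-r[3], -r[1], r[0]))

for row in rows:
    print("{:<10} {:>4} {:>4} {:>5.1f}%".format(*row))
```

Very short letters can reach a high percentage with only a few private words, so the percentage is best read together with the total number of words.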


In [18]:
print(
    "{:<50}{:>5}{:>5}{:>5}\n{}".format(
        "missive",
        "#all",
        "#own",
        "%own",
        "-" * 35,
    )
)

for x in letterList[0:20]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))
print("...")
for x in letterList[-20:]:
    print("{:<50} {:>4} {:>4} {:>4.1f}%".format(*x))

missive                                            #all #own %own
-----------------------------------
Both; zonder plaats, zonder datum                     7    3 42.9%
Both; zonder plaats, zonder datum                     7    3 42.9%
Van Diemen; in het Sch .. an Afrika, 5 juni 1631     17    4 23.5%
Maetsuycker, Verburch, .. via, 25 september 1675     20    4 20.0%
Durven, Hasselaar, Blo .. tavia, 17 oktober 1730    120   22 18.3%
Maetsuycker, Verburch, .. avia, 20 februari 1672     17    3 17.6%
Reynst; Bantam, 26 oktober 1615                     748  130 17.4%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Both; Fort Mauritius n .. d Makéan, 26 juli 1612     12    2 16.7%
Reael; Kasteel Mauriti .. kéan, 20 augustus 1618   1175  181 15.4%
Reniers, Maetsuycker,  .. avia, 24 december 1652   5032  720 14.3%
Brouwer, Van Diemen, L .. a

---

# Next steps

By now you have an impression of how to compute with the Missieven.
While this is only the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

* **[start](start.ipynb)** start computing with this corpus
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **compute** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[annotate](annotate.ipynb)** export text, annotate with BRAT, import annotations
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **[volumes](volumes.ipynb)** work with selected volumes only

CC-BY Dirk Roorda