<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Computing "by hand"

We descend to a more concrete level, and interact with the data by means of a bit of hand-coding.

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

In [3]:
from tf.app import use

In [4]:
A = use(
    "CLARIAH/descartes-tf:clone",
    checkout="clone",
    hoist=globals(),
)

This is Text-Fabric 11.0.7
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

28 features found and 0 ignored
  0.09s Dataset without structure sections in otext:no structure functions in the T-API
  0.34s All features loaded/computed - for details use TF.isLoaded()
  0.01s All additional features loaded - for details use TF.isLoaded()


Name | # of nodes | # slots/node | % coverage
--- | --- | --- | ---
volume | 8.0 | 85241.88 | 100.0
letter | 725.0 | 940.6 | 100.0
page | 2884.0 | 236.45 | 100.0
postscriptum | 56.0 | 46.79 | 0.0
opener | 545.0 | 1.97 | 0.0
closer | 541.0 | 13.1 | 1.0
address | 86.0 | 15.22 | 0.0
head | 725.0 | 23.37 | 2.0
p | 8438.0 | 80.82 | 100.0


# What have we got?

Let's inspect the data.

The text is represented as nodes with properties. The first word is node 1, the second word is node 2, and so on.
After the last word node we get nodes for the elements, such as `p` and `formula`. We also have nodes for letters and volumes.

All nodes can be dressed up with *features*.
A feature is a piece of data that specifies values for nodes.

For example, the feature `trans` gives the text of each word node, and the feature `punc` gives the text after a word but before the next word.

This gives a very crude insight into the data that Text-Fabric works with. Text-Fabric is a machine
that can weave the original text out of the threads given by the features.

Think of the nodes as the warp, through which the features are woven as wefts.
See also the [fabric metaphor](https://annotation.github.io/text-fabric/tf/about/datamodel.html#fabric-metaphor).

But it can also weave all kinds of other things out of the data.
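
As a toy illustration of the weaving idea, the sketch below models two features as plain Python dictionaries from node number to value and reassembles a text from them. The node numbers, the words, and the helper `weave` are made up for illustration; this is not how Text-Fabric stores or renders data internally.

```python
# Toy model: a feature is a mapping from node numbers to values.
# All names and values here are hypothetical, for illustration only.
trans = {1: "je", 2: "pense", 3: "donc", 4: "je", 5: "suis"}
punc = {1: " ", 2: ", ", 3: " ", 4: " ", 5: "."}


def weave(nodes, wordFeature, afterFeature):
    """Concatenate each word with the material that follows it."""
    return "".join(
        wordFeature.get(n, "") + afterFeature.get(n, "") for n in nodes
    )


text = weave(range(1, 6), trans, punc)
print(text)
```

The nodes act as the warp: the same sequence of node numbers is used to look up values in every feature.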

We can get a stock overview of the warehouse of nodes and features as follows:

* **features** if you click on the little triangle before **Descartes = Descartes, all letters** above,
  you'll see a list of features with their descriptions:
  * you can see which features have been loaded;
  * if you click on a feature name, you find its documentation;
  * if you hover over a name, you see where the feature is located on your system;
  * edge features are marked by **_bold italic_** formatting.
* **nodes** we show an inventory using
  [`C.levels.data`](https://annotation.github.io/text-fabric/tf/cheatsheet.html#c-computed-data-components)

# Counting
We count all nodes, of any type.

In [5]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.06s 722766 nodes


# Node types

What is the basic textual unit in this corpus?

In [6]:
F.otype.slotType

'word'

A quick way to list all node types:

In [7]:
F.otype.all

('volume',
 'letter',
 'page',
 'postscriptum',
 'opener',
 'closer',
 'address',
 'head',
 'p',
 'sentence',
 'hi',
 'formula',
 'figure',
 'word')

# Checks and balances

Let's collect the words that lie outside any page, if there are any:

In [8]:
outsiders = []

for w in F.otype.s("word"):
    if not L.u(w, otype="page"):
        outsiders.append((w,))
        if len(outsiders) > 10:
            break

print(f"{len(outsiders)} page outsiders")
A.table(outsiders, withNodes=True)

0 page outsiders


# Word matters

We can only work with the surface forms of words; there is no concept of lexeme in the corpus (yet).

## Top 30 frequent words

There is a simple function to get a frequency list of feature values.
Here we call it for the feature `trans`, which contains the text of every word:

In [9]:
for (word, amount) in F.trans.freqList()[0:30]:
    print(f"{amount:>6} {word}")

 24103 de
 20661 
 18240 que
 14184 et
 11833 la
 10493 en
 10368 à
  9924 qu
  9354 il
  8407 je
  8225 l
  8162 est
  7933 le
  7629 qui
  7214 ne
  7139 vous
  7048 les
  5726 d
  5511 ce
  4633 n
  4597 pour
  4173 a
  3838 plus
  3821 si
  3748 un
  3659 pas
  3545 des
  3438 j
  3396 par
  3393 me
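
To make clear what such a frequency list contains, here is a minimal sketch in plain Python that computes the same kind of result from a toy mapping of word nodes to text: pairs of value and count, most frequent first. The toy data and the function `toyFreqList` are assumptions for illustration; in particular, the way the real `F.trans.freqList()` orders values with equal counts may differ.

```python
import collections

# Hypothetical toy feature: word node -> text of the word
toyTrans = {1: "de", 2: "la", 3: "de", 4: "que", 5: "de", 6: "la"}


def toyFreqList(feature):
    """Return (value, count) pairs, highest counts first."""
    counts = collections.Counter(feature.values())
    # Sort by descending count, then alphabetically to break ties
    return tuple(sorted(counts.items(), key=lambda x: (-x[1], x[0])))


toyFreqs = toyFreqList(toyTrans)
for (word, amount) in toyFreqs[0:30]:
    print(f"{amount:>6} {word}")
```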


# Words that are unique to a letter

Are there words that are unique to a letter?
And if so, which letter has the most of them?
That letter is the most idiosyncratic letter.

Task: list the letters in a table sorted by degree of idiosyncrasy, and show the
idiosyncrasy of each letter.

## Method

For each word, the support base is the set of letters in which the word occurs.
We take only distinct words into account when we count words.
We use the word forms exactly as the feature `trans` provides them; the code applies no further case normalization.

Let's compute the support base of all words.

We also need to count how many distinct words each letter contains.

And we also want to find out how many hapaxes there are, so we also make an
index for the occurrences of each word form.

In [10]:
wordOccs = collections.defaultdict(list)
wordsByLetter = collections.defaultdict(set)
supportBase = collections.defaultdict(set)

for letter in F.otype.s("letter"):
    for w in L.d(letter, otype="word"):
        word = F.trans.v(w)
        if not word:
            continue
            
        wordOccs[word].append(w)
        wordsByLetter[letter].add(word)
        supportBase[word].add(letter)
        
print(f"There are {len(wordOccs)} distinct words")

There are 38034 distinct words


We can find the hapaxes as follows:

In [11]:
hapaxes = {word for (word, occs) in wordOccs.items() if len(occs) == 1}

print(f"There are {len(hapaxes)} hapaxes")

There are 19326 hapaxes


In the same way we can find the idiosyncratic words:

In [12]:
idiosyncraticWords = {word for (word, letters) in supportBase.items() if len(letters) == 1}

print(f"There are {len(idiosyncraticWords)} idiosyncratic words")

There are 20798 idiosyncratic words
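
The two counts differ because a hapax occurs exactly once in the whole corpus, while an idiosyncratic word may occur several times as long as all its occurrences are in one letter. Every hapax is therefore idiosyncratic, but not the other way round. A small sketch with made-up letters and words (none of them taken from the corpus) shows this:

```python
import collections

# Hypothetical toy corpus: letter id -> list of word forms
toyLetters = {
    "A": ["je", "pense", "pense", "donc"],
    "B": ["je", "suis", "donc"],
}

toyOccs = collections.Counter()
toySupport = collections.defaultdict(set)

for (letter, words) in toyLetters.items():
    for word in words:
        toyOccs[word] += 1
        toySupport[word].add(letter)

# Words with exactly one occurrence in the whole toy corpus
toyHapaxes = {word for (word, n) in toyOccs.items() if n == 1}
# Words whose occurrences all fall within a single letter
toyIdio = {word for (word, letters) in toySupport.items() if len(letters) == 1}

print(sorted(toyHapaxes))
print(sorted(toyIdio))
print(toyHapaxes <= toyIdio)
```

Here `pense` occurs twice, but only in letter `A`, so it is idiosyncratic without being a hapax.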


Now we can make a table of the letters where for each letter we list the total
number of distinct words, the number of idiosyncratic words,
and the percentage of idiosyncratic words with respect to the total number of distinct words.

In [13]:
table = []

for letter in F.otype.s("letter"):
    letterId = F.id.v(letter)
    words = wordsByLetter[letter]
    idio = {word for word in words if word in idiosyncraticWords}
    
    nWords = len(words)
    nIdio = len(idio)
    perc = int(round(100 * nIdio / nWords))
    
    table.append((letterId, nWords, nIdio, perc))
    
table[0:10]

[('1001', 275, 61, 22),
 ('1002', 504, 123, 24),
 ('1003', 77, 8, 10),
 ('1004', 240, 48, 20),
 ('1005', 250, 47, 19),
 ('1006', 363, 94, 26),
 ('1007', 112, 11, 10),
 ('1008', 122, 17, 14),
 ('1009', 128, 10, 8),
 ('1010', 182, 8, 4)]

We can make that prettier by rendering it in Markdown,
sorted on the percentage column,
with grand totals added at the bottom.

We do not show the letters that have less than 20% idiosyncratic words.

In [14]:
md = """
letter | #words | #idio | %perc
--- | --- | --- | ---
"""

totalNw = 0

for (letter, nw, ni, per) in sorted(table, key=lambda x: (-x[-1], -x[-2], x[1], x[0])):
    if per >= 20:
        md += f"""{letter} | {nw} | {ni} | {per}\n"""
    totalNw += nw
    
    
overall = int(round(100 * len(idiosyncraticWords) / len(wordOccs)))
overall2 = int(round(100 * len(idiosyncraticWords) / totalNw))
md += f"""**{len(table)}** letters | **{len(wordOccs)}** | **{len(idiosyncraticWords)}** | **{overall}**\n"""
md += f"""**{len(table)}** letters | **{totalNw}** | **{len(idiosyncraticWords)}** | **{overall2}**\n"""

A.dm(md)


letter | #words | #idio | %perc
--- | --- | --- | ---
6391 | 412 | 271 | 66
6431 | 84 | 40 | 48
8648 | 1249 | 408 | 33
1012 | 738 | 241 | 33
6394 | 120 | 40 | 33
1032 | 1272 | 404 | 32
6425 | 1203 | 389 | 32
4280 | 752 | 240 | 32
4233 | 510 | 157 | 31
5382 | 288 | 90 | 31
8661 | 1888 | 568 | 30
7588 | 1238 | 367 | 30
6420 | 192 | 58 | 30
3226 | 178 | 53 | 30
7543 | 2078 | 574 | 28
5307 | 1503 | 407 | 27
1006 | 363 | 94 | 26
1116 | 251 | 62 | 25
2139 | 1312 | 312 | 24
4251 | 979 | 239 | 24
8681 | 910 | 222 | 24
4243 | 823 | 200 | 24
5335 | 768 | 184 | 24
1002 | 504 | 123 | 24
2149 | 682 | 160 | 23
4303 | 374 | 87 | 23
5341 | 259 | 60 | 23
5383 | 79 | 18 | 23
2120 | 1621 | 363 | 22
1117 | 906 | 198 | 22
2150 | 838 | 188 | 22
2132 | 418 | 90 | 22
4234 | 323 | 72 | 22
1001 | 275 | 61 | 22
5327 | 2028 | 422 | 21
5318 | 311 | 65 | 21
4260 | 759 | 151 | 20
1066 | 604 | 122 | 20
1004 | 240 | 48 | 20
6453 | 123 | 25 | 20
**725** letters | **38034** | **20798** | **55**
**725** letters | **262183** | **20798** | **8**


It might seem strange that the overall idiosyncrasy in the first totals row (55%) is much bigger than the idiosyncrasy of
most individual letters.

This follows from the fact that if we take the number of distinct words per letter and sum those numbers,
we end up with a much bigger number (262183) than the total number of distinct words in the whole corpus (38034),
because words that occur in multiple letters are counted multiple times.

If we use the sum of the per-letter distinct words, as in the second totals row, the total idiosyncrasy (8%) is the
weighted average of the letter idiosyncrasies.
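
The weighted-average claim can be checked with a little arithmetic. The sketch below uses made-up per-letter counts (number of distinct words, number of idiosyncratic words), not figures from the corpus. It relies on the fact that every idiosyncratic word belongs to exactly one letter, so the per-letter idiosyncratic counts add up to the corpus total.

```python
import math

# Hypothetical per-letter counts: (distinct words, idiosyncratic words)
toyTable = [(200, 50), (100, 10), (700, 35)]

totalNw = sum(nw for (nw, ni) in toyTable)
totalNi = sum(ni for (nw, ni) in toyTable)

# Ratio against the sum of the per-letter distinct words
pooled = totalNi / totalNw

# Average of the per-letter ratios, weighted by distinct words per letter
weighted = sum((nw / totalNw) * (ni / nw) for (nw, ni) in toyTable)

print(f"{pooled:.3f} {weighted:.3f}")
print(math.isclose(pooled, weighted))
```

Long letters weigh more heavily in this average, which is why a few short letters with a high percentage barely move the total.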

---

# Contents

* **[start](start.ipynb)** intro and highlights
* **search** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results

Advanced

* **[similar sentences](similar.ipynb)** find similar sentences

CC-BY Dirk Roorda