# How to work with Text-Fabric on a corpus?

## Installation

1. Install Python, e.g. from the
   [official site](https://www.python.org).
2. Install Text-Fabric with:

   ``` sh
   pip install 'text-fabric[all]'
   ```

## Usage

This notebook shows you how to use it.

In [1]:
from tf.app import use

In [2]:
A = use("CLARIAH/wp6-ferdinandhuyck:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| text | 1 | 218055 | 100 |
| body | 1 | 218048 | 100 |
| div | 42 | 5191.62 | 100 |
| chapter | 44 | 4963.91 | 100 |
| fileDesc | 1 | 301 | 0 |
| editionStmt | 1 | 270 | 0 |
| p | 3725 | 58.05 | 99 |
| chunk | 3833 | 56.94 | 100 |
| lg | 41 | 23.34 | 0 |


Now `A` is a handle to the complete corpus.

Because of `hoist=globals()`, several other variables are defined as well, such as the `F`, `L` and `T` used below; see
the last line of the output above. Click on the link after **Text-Fabric API** to see what they mean.

We have loaded a TF dataset. It is a bit like a Pandas dataframe.

There are *nodes* (like rows in a dataframe) and *features* (like columns).
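
To make the analogy concrete, here is a toy model in plain Python. It is only a sketch and not how Text-Fabric stores its data: each feature is modelled as a dictionary from node number to value, and the helper names `value` and `nodes_of_type` are made up for this illustration. The node numbers and headings are taken from outputs shown later in this notebook.

```python
# Toy model only; this is not Text-Fabric's internal representation.
# Each feature maps node numbers to values, like a sparse dataframe column.
otype = {218426: "chapter", 218431: "chapter", 218588: "chunk"}
chapter = {218426: "TEI header", 218431: "Tweede hoofdstuk."}


def value(feature, node):
    """Return the feature value of a node, or None if the node has none."""
    return feature.get(node)


def nodes_of_type(wanted):
    """Return all node numbers of the given type, in ascending order."""
    return [node for node, kind in sorted(otype.items()) if kind == wanted]


print(nodes_of_type("chapter"))
print(value(chapter, 218431))
print(value(chapter, 218588))  # the toy chunk node has no chapter value
```

A node is just a number (a row), and a feature only needs entries for the nodes that actually have a value (a sparse column).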

## Nodes

The nodes are organized in types; click **▶︎ Node types** above.

You see a list of node types, how many nodes each type has, etc.

## Features

Click the **︎▶** below **Features**.

You see a list of features with a short description.
Each feature name is a link to the feature documentation.

## Chapters

One of the node types is `chapter`.

Let's collect the chapters:

In [3]:
chapterNodes = F.otype.s("chapter")
chapterNodes

range(218426, 218470)
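
The value shown is an ordinary Python `range`, so it can be counted, indexed and searched without materializing a list. The sketch below uses only the numbers printed above and does not need Text-Fabric; the variable name `chapter_nodes` is chosen for this illustration.

```python
# The range printed above, rebuilt by hand for illustration.
chapter_nodes = range(218426, 218470)

# Its length matches the 44 chapter nodes in the node type table.
print(len(chapter_nodes))

# Ranges support indexing, reverse lookup and membership tests.
print(chapter_nodes[5])
print(chapter_nodes.index(218431))
print(218431 in chapter_nodes)
```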

One of the features is also called `chapter`.
Let's ask for the value of the feature `chapter` for each of the nodes of type `chapter`:

In [4]:
for cn in chapterNodes:
    print(F.chapter.v(cn))

TEI header
2 interpGrp
Brief van den Heer P. aan den Uitgever, tot inleiding dienende.
[Woord van de uitgever]
Eerste hoofdstuk.
Tweede hoofdstuk.
Derde hoofdstuk.
Vierde hoofdstuk.
Vijfde hoofdstuk.
Zesde hoofdstuk.
Zevende hoofdstuk.
Achtste hoofdstuk.
Negende hoofdstuk.
Tiende hoofdstuk.
Elfde hoofdstuk.
Twaalfde hoofdstuk.
Dertiende hoofdstuk.
Veertiende hoofdstuk.
Vijftiende hoofdstuk.
Zestiende hoofdstuk.
Zeventiende hoofdstuk.
Achttiende hoofdstuk.
Negentiende hoofdstuk.
Twintigste hoofdstuk.
Een-en-twintigste hoofdstuk.
Twee-en-twintigste hoofdstuk.
Drie-en-twintigste hoofdstuk.
Vier-en-twintigste hoofdstuk.
Vijf-en-twintigste hoofdstuk.
Zes-en - twintigste hoofdstuk.
Zeven-en-twintigste hoofdstuk.
Acht-en-twintigste hoofdstuk.
Negen-en-twintigste hoofdstuk.
Dertigste hoofdstuk.
Een-en-dertigste hoofdstuk.
Twee-en-dertigste hoofdstuk.
Drie-en-dertigste hoofdstuk.
Vier-en-dertigste hoofdstuk.
Vijf-en-dertigste hoofdstuk.
Zes-en-dertigste hoofdstuk.
Zeven-en-dertigste hoofdstuk.


We can get the heading from a chapter node in a bit more streamlined way:

In [5]:
cha = chapterNodes[5]
cha

218431

In [6]:
A.sectionStrFromNode(cha)

'Tweede hoofdstuk.'

We can also go back:

In [7]:
A.nodeFromSectionStr('Tweede hoofdstuk.')

218431

## Chunks

Chapters are divided into chunks.

Let's get the chunks of the chapter above:

In [8]:
chunkNodes = L.d(cha, otype="chunk")
chunkNodes

(218569,
 218570,
 218571,
 218572,
 218573,
 218574,
 218575,
 218576,
 218577,
 218578,
 218579,
 218580,
 218581,
 218582,
 218583,
 218584,
 218585,
 218586,
 218587,
 218588,
 218589,
 218590,
 218591,
 218592,
 218593,
 218594,
 218595,
 218596,
 218597,
 218598,
 218599,
 218600,
 218601,
 218602,
 218603,
 218604,
 218605,
 218606,
 218607,
 218608,
 218609,
 218610,
 218611,
 218612,
 218613,
 218614,
 218615,
 218616,
 218617,
 218618,
 218619,
 218620,
 218621,
 218622,
 218623,
 218624,
 218625,
 218626,
 218627,
 218628,
 218629,
 218630,
 218631,
 218632,
 218633,
 218634,
 218635,
 218636,
 218637,
 218638,
 218639,
 218640,
 218641,
 218642,
 218643,
 218644,
 218645,
 218646,
 218647,
 218648,
 218649,
 218650,
 218651,
 218652,
 218653,
 218654,
 218655,
 218656,
 218657,
 218658,
 218659,
 218660,
 218661,
 218662,
 218663,
 218664,
 218665,
 218666,
 218667,
 218668,
 218669,
 218670,
 218671,
 218672)

Chunks have headings as well; they are stored in the feature `chunk`:

In [9]:
for cn in chunkNodes:
    print(F.chunk.v(cn))

-1
-2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
-3
47
48
49
50
-4
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100


The positive chunks are all `<p>` elements, the others are various other elements,
at the same level as those paragraph elements.

More precisely: the positive chunks lie wrapped around a `<p>` element,
the others lie wrapped around another kind of element.

If we ask Text-Fabric to descend from an element to the elements contained in it, it will
list those elements in the canonical order, which means that embedders come before embeddees.
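
The idea of "embedders before embeddees" can be sketched with a simple sort key. This is a simplified illustration, not Text-Fabric's own code: it assumes every node covers a contiguous run of slots, it does not model how ties between nodes with identical slots are broken, and the node names and slot numbers are invented.

```python
# Toy nodes as (name, first_slot, last_slot); all values are made up.
nodes = [
    ("word_a", 1, 1),
    ("p", 1, 3),
    ("word_b", 2, 2),
    ("chunk", 1, 4),
    ("word_c", 3, 3),
]


def canonical_key(node):
    """Sort by starting slot; among equal starts, put the longer node first."""
    _name, first, last = node
    return (first, -last)


ordered = [name for name, _first, _last in sorted(nodes, key=canonical_key)]
print(ordered)
```

In this toy data the chunk sorts before the `p` it contains, and the `p` before its words, so the first item after a container is the largest thing that starts at the same slot.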

So the first embedded element of each chunk is the TEI element that lies wrapped in it.

Let's check that:

In [10]:
for cn in chunkNodes:
    inside = L.d(cn)[0]
    print(f"{F.chunk.v(cn):>3} of type {F.otype.v(inside)}")

 -1 of type word
 -2 of type head
  1 of type p
  2 of type p
  3 of type p
  4 of type p
  5 of type p
  6 of type p
  7 of type p
  8 of type p
  9 of type p
 10 of type p
 11 of type p
 12 of type p
 13 of type p
 14 of type p
 15 of type p
 16 of type p
 17 of type p
 18 of type p
 19 of type p
 20 of type p
 21 of type p
 22 of type p
 23 of type p
 24 of type p
 25 of type p
 26 of type p
 27 of type p
 28 of type p
 29 of type p
 30 of type p
 31 of type p
 32 of type p
 33 of type p
 34 of type p
 35 of type p
 36 of type p
 37 of type p
 38 of type p
 39 of type p
 40 of type p
 41 of type p
 42 of type p
 43 of type p
 44 of type p
 45 of type p
 46 of type p
 -3 of type lg
 47 of type p
 48 of type p
 49 of type p
 50 of type p
 -4 of type lg
 51 of type p
 52 of type p
 53 of type p
 54 of type p
 55 of type p
 56 of type p
 57 of type p
 58 of type p
 59 of type p
 60 of type p
 61 of type p
 62 of type p
 63 of type p
 64 of type p
 65 of type p
 66 of type p
 67 of type 

We can get the full heading of a chunk in a streamlined way.

In [11]:
chu = chunkNodes[19]
chu

218588

In [12]:
A.sectionStrFromNode(chu)

'Tweede hoofdstuk.@18'

We can also go back:

In [13]:
A.nodeFromSectionStr('Tweede hoofdstuk.@18')

218588
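
The section string appears to consist of the chapter heading and the chunk number joined by `@`. The sketch below takes such a string apart and retraces the lookup using only numbers printed earlier in this notebook. It is an illustration, not the implementation of `A.nodeFromSectionStr`: the function `parse_section` is hypothetical, the `@` separator is inferred from the output above, and chunk numbers are assumed to be integers.

```python
def parse_section(section):
    """Split 'heading@number' into (heading, number); number is None if absent."""
    heading, sep, number = section.rpartition("@")
    if not sep:
        return section, None
    return heading, int(number)


# Chunk numbers of this chapter, in the order printed by In [9].
chunk_numbers = [-1, -2, *range(1, 47), -3, *range(47, 51), -4, *range(51, 101)]
# The chunk nodes printed by In [8] are consecutive and start at this number.
first_chunk_node = 218569

heading, number = parse_section("Tweede hoofdstuk.@18")
position = chunk_numbers.index(number)
print(heading, number, position, first_chunk_node + position)
```

Chunk 18 sits at position 19 because two non-paragraph chunks (`-1` and `-2`) precede the first paragraph, which is why `chunkNodes[19]` was used above.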

## Text of a chunk

We can get the raw text of a chunk as follows:

In [14]:
T.text(chu)

"- ‘Messen! - scharen! - khurkhetrekkers! - khammen!’ vervolgde de Jood, met een pause tusschen elk voorwerp, dat hij opnoemde: ‘of... wil je liever kurieuser whaar: je bent toch een ghesthudeerd jong mensch... hik 'ep hook mooie poekkies: 'ier is de Arlekhijn Haksinischt!... 't plijspel van Khinkampoeis!Arlequin Actionist: Quincampoix of de Windhandelaars: blijspelen van Langendijk. de leste woorden van Saco, toen ie op et schavot stond...’ -\n"

We can also get it a bit nicer:

In [15]:
A.plain(chu)

Even nicer:

In [16]:
A.plain(chu, fmt="layout-orig-full")

We now see that some of the words are special: they belong to a note.

Here is a more complete view of the chunk:

In [17]:
A.pretty(chu)

We now see more of what is going on in the markup of the text, but we can get an even more complete view:

In [18]:
A.pretty(chu, multiFeatures=True)

Indeed, we now see all features of all nodes, insofar as they have non-null, non-empty values.

Yet we can see even more: the node numbers themselves:

In [20]:
A.pretty(chu, multiFeatures=False, withNodes=True)