# How to work with Text-Fabric on a corpus?

## Installation

1. Install Python, e.g. from the
   [official site](https://www.python.org).
2. Install Text-Fabric with

   ``` sh
   pip install 'text-fabric[all]'
   ```

## Usage

This notebook shows you how to use it.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use

In [3]:
A = use("CLARIAH/wp6-mobydick:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

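| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| text | 1 | 213677.0 | 100 |
| body | 1 | 213136.0 | 100 |
| div | 138 | 1548.33 | 100 |
| chapter | 141 | 1516.29 | 100 |
| front | 1 | 541.0 | 0 |
| note | 19 | 124.37 | 1 |
| fileDesc | 1 | 99.0 | 0 |
| p | 2421 | 87.92 | 100 |
| chunk | 2606 | 81.99 | 100 |
| publicationStmt | 2 | 33.5 | 0 |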


Now `A` is a handle to the complete corpus.

Because of `hoist=globals()`, several other variables are defined as well; see
the last line above. Click on the link after **Text-Fabric API** to see what they mean.

We have loaded a TF dataset. It is a bit like a Pandas dataframe.

There are *nodes* (like rows in a dataframe) and *features* (like columns).
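
To make the analogy concrete, here is a minimal sketch in plain Python. It is a toy model only, not the Text-Fabric API: the node numbers, the types, and the helper names `value` and `nodes_of_type` are invented for illustration. They loosely mirror the `F.<feature>.v(node)` and `F.otype.s(type)` calls used later in this notebook.

```python
# Toy model of a TF-like dataset: nodes are integers, features are mappings.
# All node numbers and values here are invented, not taken from the corpus.
otype = {1: "word", 2: "word", 3: "word", 4: "chunk", 5: "chapter"}
chapter = {5: "LOOMINGS"}  # only chapter nodes carry this feature
chunk = {4: 1}             # only chunk nodes carry this feature


def value(feature, node):
    """Return the value of a feature for a node, or None if it has none."""
    return feature.get(node)


def nodes_of_type(node_type):
    """Return all nodes of the given type, in ascending order."""
    return [n for n in sorted(otype) if otype[n] == node_type]


print(nodes_of_type("word"))
print(value(chapter, 5))
print(value(chapter, 1))
```

Unlike a dataframe column, a feature in this model simply has no entry for nodes it does not apply to.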

## Nodes

The nodes are organized in types; click **▶︎ Node types** above.

You see a list of node types, how many nodes each type has, etc.

## Features

Click the **︎▶** below **Features**.

You see a list of features with a short description.
Each feature name is a link to the feature documentation.

## Chapters

One of the node types is `chapter`.

Let's collect the chapters:

In [4]:
chapterNodes = F.otype.s("chapter")
chapterNodes

range(213819, 213960)
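
The result printed above is an ordinary Python `range`, so the usual operations apply. The sketch below rebuilds the same range by hand, without loading the corpus, to show how to count, index, and test membership. The variable name `chapter_nodes` is only illustrative.

```python
# The node range printed above, reproduced as a plain Python range.
chapter_nodes = range(213819, 213960)

print(len(chapter_nodes))        # number of chapter nodes
print(chapter_nodes[0])          # first chapter node
print(chapter_nodes[-1])         # last chapter node (the end of a range is exclusive)
print(213824 in chapter_nodes)   # membership test
```

The length agrees with the number of `chapter` nodes reported in the node type overview.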

One of the features is also called `chapter`.
Let's ask for the value of the feature `chapter` for each of the nodes of type `chapter`:

In [5]:
for cn in chapterNodes:
    print(F.chapter.v(cn))

TEI header
2 div
Preliminary Matter.
4 titlePage
LOOMINGS
THE CARPET-BAG 
THE SPOUTER-INN
THE COUNTERPANE
BREAKFAST
THE STREET
THE CHAPEL
THE PULPIT
THE SERMON
A BOSOM FRIEND
NIGHTGOWN
BIOGRAPHICAL
WHEELBARROW
NANTUCKET
CHOWDER
THE SHIP
THE RAMADAN
HIS MARK
THE PROPHET
ALL ASTIR
GOING ABOARD
MERRY CHRISTMAS
THE LEE SHORE
THE ADVOCATE
POSTSCRIPT
KNIGHTS AND SQUIRES
KNIGHTS AND SQUIRES
AHAB
ENTER AHAB; TO HIM, STUBB
THE PIPE
QUEEN MAB
CETOLOGY
THE SPECKSYNDER
THE CABIN-TABLE
THE MAST-HEAD
THE QUARTER-DECK
SUNSET
DUSK
FIRST NIGHT-WATCH
MIDNIGHT, FORECASTLE
MOBY DICK
THE WHITENESS OF THE WHALE
HARK!
THE CHART
THE AFFIDAVIT
SURMISES
THE MAT-MAKER
THE FIRST LOWERING
THE HYENA
AHAB'S BOAT AND CREW. FEDALLAH
THE SPIRIT-SPOUT
THE ALBATROSS
THE GAM
THE TOWN-HO'S STORY
OF THE MONSTROUS PICTURES OF WHALES
OF THE LESS ERRONEOUS PICTURES OF WHALES, AND THE TRUE  PICTURES OF WHALING SCENES
OF WHALES IN PAINT; IN TEETH; IN WOOD; IN SHEET-IRON; IN  STONE; IN MOUNTAINS; IN STARS
BRIT
SQUID
THE LINE
STUB

We can get the heading from a chapter node in a bit more streamlined way:

In [6]:
cha = chapterNodes[5]
cha

213824

In [7]:
A.sectionStrFromNode(cha)

'THE CARPET-BAG '

We can also go back:

In [9]:
A.nodeFromSectionStr('THE CARPET-BAG ')

213824

## Chunks

Chapters are divided into chunks.

Let's get the chunks of the chapter above:

In [10]:
chunkNodes = L.d(cha, otype="chunk")
chunkNodes

(213991,
 213992,
 213993,
 213994,
 213995,
 213996,
 213997,
 213998,
 213999,
 214000,
 214001,
 214002,
 214003)

Chunks have headings as well; they are stored in the feature `chunk`:

In [11]:
for cn in chunkNodes:
    print(F.chunk.v(cn))

-1
1
2
3
4
5
6
7
8
9
10
11
12


The chunks with positive numbers correspond to `<p>` elements; the others correspond to
various other elements at the same level as those paragraph elements.

More precisely: each positive chunk is wrapped around a `<p>` element,
and each of the others is wrapped around another kind of element.

If we ask Text-Fabric to descend from an element to the elements contained in it, it
lists those elements in canonical order, which means that embedders come before embeddees.

So the first embedded element of each chunk is the TEI element that the chunk wraps.

Let's check that:

In [12]:
for cn in chunkNodes:
    inside = L.d(cn)[0]
    print(f"{F.chunk.v(cn):>3} of type {F.otype.v(inside)}")

 -1 of type head
  1 of type p
  2 of type p
  3 of type p
  4 of type p
  5 of type p
  6 of type p
  7 of type p
  8 of type p
  9 of type p
 10 of type p
 11 of type p
 12 of type p
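
The "embedders before embeddees" rule can be illustrated with a small sketch in plain Python. It is a simplified model under stated assumptions: each node is a contiguous span of slots, the nodes are invented for illustration, and ties between nodes that cover exactly the same slots are not modelled. The helper `canonical_key` is hypothetical and not part of Text-Fabric.

```python
# Simplified model: each node is (name, first_slot, last_slot).
# These nodes are invented for illustration; they are not from the corpus.
nodes = [
    ("word-3", 3, 3),
    ("sentence-b", 3, 4),
    ("paragraph", 1, 4),
    ("word-1", 1, 1),
    ("sentence-a", 1, 2),
]


def canonical_key(node):
    """Earlier start first; at the same start, the longer node first."""
    _name, first, last = node
    return (first, -last)


ordered = [name for name, _first, _last in sorted(nodes, key=canonical_key)]
print(ordered)
```

In this model the paragraph precedes the sentences it contains, and each sentence precedes its own words.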


We can get the full heading of a chunk in a streamlined way.

In [14]:
chu = chunkNodes[6]
chu

213997

In [15]:
A.sectionStrFromNode(chu)

'THE CARPET-BAG @6'

We can also go back:

In [16]:
A.nodeFromSectionStr('THE CARPET-BAG @6')

213997
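
Judging only from the two section strings shown in this notebook, a chunk reference looks like the chapter heading followed by `@` and the chunk number. The sketch below splits such a string in plain Python. The function `split_section` is hypothetical, and how Text-Fabric itself parses these strings may differ.

```python
def split_section(section):
    """Split 'HEADING @N' into (heading, chunk number).

    The chunk number is None when the string has no '@' part.
    The format is inferred from the outputs in this notebook only.
    """
    heading, sep, chunk = section.rpartition("@")
    if not sep:
        return section, None
    return heading, int(chunk)


print(split_section("THE CARPET-BAG @6"))
print(split_section("THE CARPET-BAG "))
```

Note that the heading keeps its trailing space, exactly as it appears in the chapter heading returned earlier.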

## Text of a chunk

We can get the raw text of a chunk as follows:

In [17]:
T.text(chu)

"It seemed the great Black Parliament sitting in Tophet. A hundred black faces turned round in their rows to peer; and beyond, a black Angel of Doom was beating a book in a pulpit. It was a negro church; and the preacher's text was about the blackness of darkness, and the weeping and wailing and teeth- gnashing there. Ha, Ishmael, muttered I, backing out, Wretched entertainment at the sign of ‘The Trap!’ \n"

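The returned value is an ordinary Python string, so standard text processing applies to it. As a minimal sketch, the code below counts word frequencies in the first two sentences of the output above, pasted in by hand so that it runs without the corpus. The tokenization by a simple regular expression is an assumption made for illustration, not the way Text-Fabric divides the text into slots.

```python
import re
from collections import Counter

# First two sentences of the chunk text shown above, copied by hand.
raw = (
    "It seemed the great Black Parliament sitting in Tophet. "
    "A hundred black faces turned round in their rows to peer; "
    "and beyond, a black Angel of Doom was beating a book in a pulpit.\n"
)

# Lowercase the text and keep runs of letters as rough "words".
words = re.findall(r"[a-z]+", raw.lower())
counts = Counter(words)

print(counts["black"])
print(counts.most_common(3))
```
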
We can also display it a bit more nicely:

In [18]:
A.plain(chu)

Even nicer:

In [19]:
A.plain(chu, fmt="layout-orig-full")

We now see that some of the words are special: they belong to a note.

Here is a more complete view of the chunk:

In [20]:
A.pretty(chu)

We see now more of what is going on in the markup of the text, but we can get an even more complete view:

In [21]:
A.pretty(chu, multiFeatures=True)

Indeed, we now see all features of all nodes, insofar as they have non-empty values.

Yet we can see even more: the node numbers themselves:

In [22]:
A.pretty(chu, multiFeatures=False, withNodes=True)