In [1]:
%load_ext autoreload
%autoreload 2

# Lowfat to TF

We use the machinery of Text-Fabric combined with some custom code to convert
the lowfat XML of the Greek New Testament into TF.

# Set up

We gather all prerequisites.

In [2]:
from tf.convert.xml import XML
from lowfat import convertTaskCustom
from tf.advanced.helpers import dm
from tf.app import use

The custom code is in `lowfat.py`, here in this directory.

It consists of two functions that replace default functions in
[xmlCustom](https://annotation.github.io/text-fabric/tf/convert/xmlCustom.html),
which is part of TF.

So you only have to focus on the bits that actually touch the lowfat XML.

We pass the function `convertTaskCustom()`, defined in `lowfat.py`, to the XML converter.

We also specify the way we want to see some attributes in the report files:

* keyword attributes: we want to see an inventory of all words that occur in such attributes
* trim attributes: we do not want to see the values of these attributes

In [3]:
keywordAtts = set(
    """
    case
    class
    number
    gender
    mood
    person
    role
    tense
    type
    voice
""".strip().split()
)

trimAtts = set(
    """
    domain
    frame
    gloss
    id
    lemma
    ln
    morph
    normalized
    ref
    referent
    rule
    strong
    subjref
    unicode
""".strip().split()
)
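
To make the difference between the two kinds of attributes concrete, here is a minimal sketch of how a report could treat them.
It is not the code that TF uses: the function `inventory` and the sample values are hypothetical,
only the attribute names are taken from the sets above.

```python
from collections import Counter

# Hypothetical sample elements; only the attribute names come from this notebook.
samples = [
    {"class": "noun", "case": "nominative", "lemma": "logos"},
    {"class": "verb", "mood": "indicative", "lemma": "lego"},
    {"class": "noun", "case": "genitive", "lemma": "theos"},
]


def inventory(elements, keywordAtts, trimAtts):
    """Count attribute values, treating keyword and trim attributes specially."""
    report = {}
    for atts in elements:
        for (name, value) in atts.items():
            counter = report.setdefault(name, Counter())
            if name in trimAtts:
                # trim attribute: suppress the value, count occurrences only
                counter["..."] += 1
            elif name in keywordAtts:
                # keyword attribute: count every word in the value separately
                counter.update(value.split())
            else:
                # any other attribute: count the value as a whole
                counter[value] += 1
    return report


report = inventory(samples, {"class", "case", "mood"}, {"lemma"})
for (name, counter) in sorted(report.items()):
    print(name, dict(counter))
```

In this sketch a trimmed attribute still shows up in the report, but its values do not.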

We do not want both a `Rule` and a `rule` feature in our dataset, because their files would clash on
case-insensitive file systems. So we rename `Rule` to `crule`.

We translate the `frame` attribute to an edge feature, but we retain its original contents in the
`framespec` feature. Likewise, `subjref` keeps its original contents in `subjrefspec`.

The name `class` is exceptionally cumbersome if you want to use it inside Python code,
so we rename it to `cls`.

In [5]:
renameAtts = {
    "Rule": "crule",
    "frame": "framespec",
    "subjref": "subjrefspec",
    "class": "cls",
}
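
The effect of such a mapping can be sketched in plain Python.
The helpers `renameAll` and `caseClashes` are hypothetical and only illustrate the idea,
they are not part of TF or of `lowfat.py`; the attribute values are placeholders.

```python
renameAtts = {
    "Rule": "crule",
    "frame": "framespec",
    "subjref": "subjrefspec",
    "class": "cls",
}


def renameAll(atts, mapping):
    """Return a copy of atts in which the mapped names are replaced."""
    return {mapping.get(name, name): value for (name, value) in atts.items()}


def caseClashes(names):
    """Group names that coincide when case is ignored."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), set()).add(name)
    return {low: group for (low, group) in groups.items() if len(group) > 1}


# Placeholder values, for illustration only
atts = {"Rule": "value of Rule", "rule": "value of rule", "class": "value of class"}
renamed = renameAll(atts, renameAtts)

print(caseClashes(atts))     # Rule and rule clash before renaming
print(caseClashes(renamed))  # nothing clashes after renaming
```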

In [6]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    xml=0,
    tf="0.4.1",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2022-11-01 (most recent)
TF data version is 0.4.1 (explicit new)
Processing instructions will be ignored


Now we can run tasks.

# Check

First we check the input:

In [7]:
X.task(check=True)

XML to TF checking: ~/github/ETCBC/nestle1904/xml/2022-11-01 => ~/github/ETCBC/nestle1904/report/2022-11-01
Processing instructions are ignored
Start folder gnt:
  27 27-revelation.xml                                 
End   folder gnt

151 info line(s) written to ~/github/ETCBC/nestle1904/report/2022-11-01/elements.txt
0 error(s) in 0 file(s) written to ~/github/ETCBC/nestle1904/report/2022-11-01/errors.txt
7 tags of which 0 with multiple namespaces written to ~/github/ETCBC/nestle1904/report/2022-11-01/namespaces.txt


True

# Convert

Here we generate the actual TF data.

In [8]:
X.task(convert=True)

XML to TF converting: ~/github/ETCBC/nestle1904/xml/2022-11-01 => ~/github/ETCBC/nestle1904/tf/0.4.1
  0.00s Not all of the warp features otype and oslots are present in
~/github/ETCBC/nestle1904/tf/0.4.1
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    book, chapter, verse
   |   SECTION   FEATURES: book, chapter, verse
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       after, text
   |     0.00s OK
   |     0.00s Following director... 
  27 27-revelation.xml                                 
There are no broken subjref references.
There are 9 broken frame references.
gnt/01-matthew.xml            : n40003016026, n40004024030
gnt/03-luke.xml               : n42001054008, n4200105

True

# Load

The best check to see that the TF is valid is to load it.

In [8]:
X.task(load=True)

   |     0.18s T otype                from ~/github/ETCBC/nestle1904/tf/0.4.0
   |     2.23s T oslots               from ~/github/ETCBC/nestle1904/tf/0.4.0
   |     0.34s T text                 from ~/github/ETCBC/nestle1904/tf/0.4.0
   |     0.27s T after                from ~/github/ETCBC/nestle1904/tf/0.4.0
   |     0.28s T book                 from ~/github/ETCBC/nestle1904/tf/0.4.0
   |     0.24s T chapter              from ~/github/ETCBC/nestle1904/tf/0.4.0
   |     0.25s T verse                from ~/github/ETCBC/nestle1904/tf/0.4.0
   |      |     0.05s C __levels__           from otype, oslots, otext
   |      |     1.39s C __order__            from otype, oslots, __levels__
   |      |     0.06s C __rank__             from otype, __order__
   |      |     4.30s C __levUp__            from otype, oslots, __rank__
   |      |     2.30s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.04s C __characters__       from otext
   |      |     0.71s C __boundar

True

# App creation

We create the config file that turns the dataset into a TF app.

In [9]:
X.task(app=True)

App updating ...
	~/github/ETCBC/nestle1904/app/static/logo.png (already exists, not overwritten)
	~/github/ETCBC/nestle1904/app/static/display.css (no custom info, older orginal exists)
	~/github/ETCBC/nestle1904/app/config.yaml (generated with custom info)
	~/github/ETCBC/nestle1904/app/app.py (deleted)
Done


True

# Test

We test a bit of the resulting dataset right here.

In [10]:
A = use("ETCBC/nestle1904:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
wg,114879,7.6,633
clause,30152,7.37,161
phrase,42636,3.21,99
w,137779,1.0,100


# Operational matters

You can now get a report of the memory footprint of a TF dataset:

In [11]:
A.footprint()

                                                


# 55 features

feature | members | size in bytes
--- | --- | ---
__levUp__ | 341,688 | 132,441,444
parent | 252,657 | 47,545,404
sibling | 129,767 | 42,761,080
__levDown__ | 203,909 | 29,705,672
oslots | 3 | 27,725,264
__boundary__ | 2 | 21,671,488
cls | 283,355 | 18,420,730
id | 137,779 | 17,505,291
ref | 137,779 | 17,322,488
strong | 137,779 | 12,836,336
__order__ | 341,688 | 12,300,808
unicode | 137,779 | 11,383,528
text | 137,779 | 10,860,278
normalized | 137,779 | 10,773,384
gloss | 137,779 | 10,312,724
rule | 161,050 | 9,801,834
type | 155,789 | 9,606,021
lemma | 137,779 | 9,581,090
num | 145,817 | 9,447,440
verse | 145,723 | 9,323,204
ln | 126,879 | 9,226,616
morph | 137,779 | 9,160,456
chapter | 138,039 | 9,108,052
book | 137,806 | 9,102,932
after | 137,779 | 9,101,961
frame | 25,491 | 9,001,021
domain | 126,879 | 8,893,453
role | 113,179 | 8,412,742
number | 98,937 | 8,013,308
case | 79,518 | 4,848,319
gender | 73,975 | 4,692,998
articular | 53,093 | 4,108,160
framespec | 25,493 | 3,588,614
subjref | 16,575 | 2,462,368
junction | 33,327 | 2,244,075
mood | 28,357 | 2,105,149
tense | 28,357 | 2,105,135
voice | 28,357 | 2,105,024
otype | 4 | 1,631,871
__rank__ | 341,688 | 1,452,264
subjrefspec | 16,575 | 1,407,769
referent | 14,471 | 1,389,083
person | 19,419 | 1,133,807
nodeId | 10,530 | 926,792
crule | 10,530 | 589,939
clauseType | 9,990 | 574,772
__sections__ | 2 | 571,032
discontinuous | 6,034 | 463,972
cltype | 4,854 | 283,626
appositioncontainer | 1,908 | 127,260
degree | 513 | 32,996
__characters__ | 1 | 29,876
lang | 27 | 1,975
__levels__ | 8 | 1,727
note | 1 | 324
TOTAL | 5,110,055 | 590,224,976
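
To put these numbers in perspective, here is a quick calculation on figures copied from the table above.
Nothing is measured anew; the choice to look at the three largest features is just for illustration.

```python
# Sizes in bytes, copied from the footprint table
total = 590_224_976
largest = {
    "__levUp__": 132_441_444,
    "parent": 47_545_404,
    "sibling": 42_761_080,
}

MIB = 1024 * 1024

print(f"total: {total / MIB:.1f} MiB")

for (feature, size) in largest.items():
    print(f"{feature:<12} {size / MIB:6.1f} MiB {100 * size / total:5.1f} %")

# Share of the three largest features in the total footprint
share = sum(largest.values()) / total
print(f"three largest together: {100 * share:.1f} %")
```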

## Existence of error nodes

In [12]:
F.otype.s("error")

()

No error nodes anymore.

# The books

In [13]:
for b in F.otype.s("book"):
    print(f"{b} {A.sectionStrFromNode(b)}")

137780 MAT
137781 MRK
137782 LUK
137783 JHN
137784 ACT
137785 ROM
137786 1CO
137787 2CO
137788 GAL
137789 EPH
137790 PHP
137791 COL
137792 1TH
137793 2TH
137794 1TI
137795 2TI
137796 TIT
137797 PHM
137798 HEB
137799 JAS
137800 1PE
137801 2PE
137802 1JN
137803 2JN
137804 3JN
137805 JUD
137806 REV


# Features on `w` and `wg`

Some features have values for both `w` and `wg` nodes. Yet we can get frequency lists for them separately.

In [14]:
F.cls.freqList(nodeTypes={"w"})

(('noun', 28455),
 ('verb', 28357),
 ('det', 19786),
 ('conj', 18227),
 ('pron', 16177),
 ('prep', 10914),
 ('adj', 8452),
 ('adv', 6147),
 ('ptcl', 773),
 ('num', 476),
 ('intj', 15))

In [15]:
F.cls.freqList(nodeTypes={"wg"})

(('np', 30911),
 ('cl', 30152),
 ('pp', 11169),
 ('vp', 207),
 ('adjp', 168),
 ('advp', 166),
 ('adv', 7),
 ('nump', 7),
 ('conj', 1))

In [16]:
F.cls.freqList(nodeTypes={"clause"})

(('cl', 30152),)

In [17]:
F.cls.freqList(nodeTypes={"phrase"})

(('np', 30911),
 ('pp', 11169),
 ('vp', 207),
 ('adjp', 168),
 ('advp', 166),
 ('adv', 7),
 ('nump', 7),
 ('conj', 1))
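
As a sanity check, we can add up the frequencies and compare them with the node counts reported when the corpus was loaded.
The numbers below are copied from the cell outputs above; the variable names are only for this sketch.
The sums suggest that every `w`, `clause` and `phrase` node has a `cls` value,
and that the `wg` list is the clause list and the phrase list taken together.

```python
# Frequency lists, copied from the outputs above
wFreqs = (
    ("noun", 28455), ("verb", 28357), ("det", 19786), ("conj", 18227),
    ("pron", 16177), ("prep", 10914), ("adj", 8452), ("adv", 6147),
    ("ptcl", 773), ("num", 476), ("intj", 15),
)
clauseFreqs = (("cl", 30152),)
phraseFreqs = (
    ("np", 30911), ("pp", 11169), ("vp", 207), ("adjp", 168),
    ("advp", 166), ("adv", 7), ("nump", 7), ("conj", 1),
)
wgFreqs = (
    ("np", 30911), ("cl", 30152), ("pp", 11169), ("vp", 207), ("adjp", 168),
    ("advp", 166), ("adv", 7), ("nump", 7), ("conj", 1),
)

# Node counts, copied from the table shown when loading the corpus
nodeCounts = {"w": 137779, "clause": 30152, "phrase": 42636}


def total(freqs):
    """Sum the counts of a frequency list."""
    return sum(n for (value, n) in freqs)


print(total(wFreqs) == nodeCounts["w"])
print(total(clauseFreqs) == nodeCounts["clause"])
print(total(phraseFreqs) == nodeCounts["phrase"])
print(total(wgFreqs) == total(clauseFreqs) + total(phraseFreqs))
```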

## A sentence with discontinuous word groups

In [24]:
A.displaySetup(withNodes=True, standardFeatures=True, hiddenTypes={"clause", "phrase"}, hideTypes=True)

In [25]:
jude = A.nodeFromSectionStr("JUD")
s = L.d(jude, otype="sentence")[0]
A.pretty(s)

## Clauses and phrases

We can switch from a low-level display with `wg` elements to a higher-level display with `clause` and `phrase` elements.

In [37]:
A.pretty(s, hiddenTypes={"wg"})

## Parents

For now, we stick to the low-level display with `wg` elements.

In [39]:
w = L.d(s, otype="w")[5]

parent = E.parent.f(w)[0]

A.pretty(s, highlights={parent: "cyan", w: "salmon"})

Look for `wg` parents that have `w` children.

In [40]:
results = A.search("""
book book=JUD
  wg
  <parent- w
""")

  0.11s 457 results


In [41]:
A.show(results, end=2, condenseType="wg", queryFeatures=False)

## Siblings

In [42]:
wg = L.d(s, otype="wg")[7]
parent = E.parent.f(wg)[0]
siblings = E.sibling.b(wg)

for (sib, dist) in siblings:
    dm(f"distance from {F.otype.v(sib)}-node {sib} to wg-node {wg} = `{dist}`\n\n")
    A.pretty(sib, queryFeatures=False)
    
highlights = {sib[0]: "cyan" for sib in siblings}
highlights[wg] = "salmon"

A.pretty(parent, highlights=highlights)

distance from w-node 127496 to wg-node 332680 = `1`



Look for `w` siblings with distance greater than 3 in Jude:

In [43]:
results = A.search("""
book book=JUD
  w
  -sibling>3> w
""")

  0.11s 2 results


In [44]:
A.show(results, condenseType="wg")

# Sibling distance

What is the biggest distance between siblings?

In [45]:
maxDist = -1
pair = None

for (fro, tos) in E.sibling.items():
    for (to, dist) in tos.items():
        if dist > maxDist:
            maxDist = dist
            pair = (fro, to)
        
(fro, to) = pair
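
The same search can be written more compactly with `max()`.
The sketch below runs on a small hypothetical dictionary with the same shape as the data that the loop above walks through
(node to a dictionary of sibling to distance); the node numbers and distances are invented.

```python
# Hypothetical stand-in for the sibling data: node -> {sibling: distance}
siblingData = {
    10: {11: 1, 12: 2},
    11: {12: 1},
    20: {25: 5, 21: 1},
}

# Tuples compare element by element, so the largest distance wins
(maxDist, fro, to) = max(
    (dist, fro, to)
    for (fro, tos) in siblingData.items()
    for (to, dist) in tos.items()
)

print(maxDist, fro, to)
```

Note one difference: when several pairs share the maximum distance, the loop keeps the first pair it meets,
whereas this version picks the pair with the highest node numbers.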

In [46]:
dm(f"""
The maximum distance between siblings is **{maxDist}**

* from {F.otype.v(fro)}-node {fro}
* to {F.otype.v(to)}-node {to}
""")

A.plain(fro)
A.plain(to)
parent = E.parent.f(fro)[0]

A.pretty(parent, highlights={fro, to})


The maximum distance between siblings is **20**

* from wg-node 308878
* to wg-node 308912


# Frames

We show a few verbal frames.

In [47]:
frames = A.search("""
book book=JUD
    w framespec
    -frame> w
""")

  0.09s 145 results


In [48]:
A.show(frames, start=3, end=6, colorMap={2: "cyan", 3: "salmon"})

With hand coding we can get a better display:

In [49]:
colors = dict(A1="salmon", A2="goldenrod", self="cyan")
highlights = {}

self = frames[2][1]
parent = E.parent.f(self)[0]

highlights[self] = colors["self"]

for (arg, label) in E.frame.f(self):
    highlights[arg] = colors[label]
    
A.pretty(parent, highlights=highlights)

# Browse

We are ready to browse the data.
If you run this notebook, then the next cell will open a browser window with the TF-browser
on the Greek New Testament.

In [50]:
X.task(browse=True)

This is Text-Fabric 11.4.12
Starting new kernel listening on 17116
Loading data for ETCBC/nestle1904. Please wait ...
Setting up TF kernel for ETCBC/nestle1904  
**Locating corpus resources ...**
Using app in ~/github/ETCBC/nestle1904/app:
	repo clone offline under ~/github (local github)
Using data in ~/github/ETCBC/nestle1904/tf/0.4.0:
	repo clone offline under ~/github (local github)
TF setup done.
Starting new webserver listening on 27116


 * Running on http://localhost:27116
Press CTRL+C to quit


Opening ETCBC/nestle1904 in browser
Press <Ctrl+C> to stop the TF browser


127.0.0.1 - - [11/May/2023 15:14:49] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/highlight.css HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/fonts.css HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/index.css HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/display.css HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/fontawesome.css HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/base.css HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/jquery.js HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/tf3.0.js HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/icon.png HTTP/1.1[0m" 304 -
127.0.0.1 - - [11/May/2023 15:14:49] "[36mGET /server/static/huc.png HTTP/1.1[0m" 304 -
127.0.0.1 - 

Kernel listening at port 17116

TF web server has stopped
TF kernel has stopped


keyboard interrupt!


True

# Terminate

You can stop the browser by interrupting the kernel of this notebook: in Jupyter, press `i` twice in command mode.

# Create zip

It is time to commit and push the repo to GitHub now:

```
git add --all .
git commit -m "new data version"
git push origin master
```

Then go over to GitHub and create a new release there.

After that, fetch the new tags from GitHub by

```
git pull --tags
```

Then we are ready to create a zip file for publishing the dataset in a release on GitHub,
so that users can get it easily.

In [51]:
A.zipAll()

Data to be zipped:
	OK       app                      (v0.4.0 07c60c)     : ~/github/ETCBC/nestle1904/app
	OK       main data                (v0.4.0 07c60c)     : ~/github/ETCBC/nestle1904/tf/0.4.0
Writing zip file ...
Result: ~/Downloads/github/ETCBC/nestle1904/complete.zip


# Fetch

We now test whether users can use this dataset in the normal way.

Run this after you have attached the `complete.zip` file that we created earlier to the latest release on GitHub.

In [52]:
A = use("ETCBC/nestle1904:latest")

**Locating corpus resources ...**

   |     0.17s T otype                from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     2.36s T oslots               from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.36s T text                 from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.28s T after                from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.28s T book                 from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.24s T chapter              from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.26s T verse                from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |      |     0.05s C __levels__           from otype, oslots, otext
   |      |     1.47s C __order__            from otype, oslots, __levels__
   |      |     0.06s C __rank__             from otype, __order__
   |      |     4.38s C __levUp__            from otype, oslots, __rank__
   |      |     2.38s C __levDown__          fr

Name,# of nodes,# slots/node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
wg,114879,7.6,633
clause,30152,7.37,161
phrase,42636,3.21,99
w,137779,1.0,100


Indeed, downloading and installing went without hassle.

# Demo

We demo the effect of the reshuffling of the words.

Our test corpus is the letter of Jude, first sentence, twice.

The first time we do not shuffle the words in the sentence, the second time we do.

We run the conversion with `demo = True` in `lowfat.py`.

In [5]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    xml=-1,
    tf="0.3.1t",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2000-01-01 (oldest)
TF data version is 0.3.1t (explicit new)


In [7]:
X.task(convert=True, load=True, app=True)

XML to TF converting: ~/github/ETCBC/nestle1904/xml/2000-01-01 => ~/github/ETCBC/nestle1904/tf/0.3.1t
  0.00s Not all of the warp features otype and oslots are present in
~/github/ETCBC/nestle1904/tf/0.3.1t
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    book, chapter, verse
   |   SECTION   FEATURES: book, chapter, verse
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       after, text
   |     0.00s OK
   |     0.00s Following director... 
   1 26-jude.xml                                       
source reading done
   |     0.00s "edge" actions: 0
   |     0.00s "feature" actions: 71
   |     0.00s "node" actions: 39
   |     0.00s "resume" actions: 0
   |     0.00s "slot" actions

True

In [9]:
A = use("ETCBC/nestle1904:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,1,34.0,100
chapter,1,34.0,100
verse,1,34.0,100
sentence,2,17.0,100
wg,34,6.0,600
w,34,1.0,100


In [10]:
(s1, s2) = F.otype.s("sentence")

In [21]:
color1 = "cyan"
color2 = "goldenrod"
start = 5
offset = 17
highlights = {
    start: color1,
    start + 1: color2,
    start + offset: color2,
    start + offset + 1: color1,
}
A.displaySetup(standardFeatures=True, highlights=highlights)
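
It helps to spell out which words these highlights select.
The sketch below repeats the computation without TF, assuming that the first sentence occupies slots 1-17 and the second slots 18-34,
as the table above suggests (34 slots in 2 sentences).
The colors are swapped between the two sentences, presumably so that a pair of words that trade places is easy to spot.

```python
color1 = "cyan"
color2 = "goldenrod"
start = 5
offset = 17  # assumed number of words per sentence: 34 slots / 2 sentences

highlights = {
    start: color1,
    start + 1: color2,
    start + offset: color2,
    start + offset + 1: color1,
}

for (slot, color) in sorted(highlights.items()):
    # Translate the slot number into a sentence and a position within it
    sentence = 1 if slot <= offset else 2
    position = slot if sentence == 1 else slot - offset
    print(f"slot {slot:>2}: sentence {sentence}, word {position}, {color}")
```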

In [22]:
A.pretty(s1)

In [23]:
A.pretty(s2)

# Restore

We restore the app so that it uses the normal TF version again.

In [32]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    tf="0.3.1",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2022-11-01 (most recent)
TF data version is 0.3.1 (explicit existing)


In [33]:
X.task(app=True)

App updating ...
	~/github/ETCBC/nestle1904/app/static/logo.png (already exists, not overwritten)
	~/github/ETCBC/nestle1904/app/static/display.css (no custom info, older orginal exists)
	~/github/ETCBC/nestle1904/app/config.yaml (generated with custom info)
	~/github/ETCBC/nestle1904/app/app.py (deleted)
Done


True

Now save this notebook, commit and push the repo again to publish this very notebook.

```
git add --all .
git commit -m "maker notebook updated"
git push origin master
```