In [1]:
%load_ext autoreload
%autoreload 2

# Lowfat to TF

We use the machinery of Text-Fabric combined with some custom code to convert
the lowfat XML of the Greek New Testament into TF.

# Set up

We gather all prerequisites.

In [2]:
from tf.convert.xml import XML
from lowfat import convertTaskCustom
from tf.app import use

The custom code is in `lowfat.py`, here in this directory.

It consists of two functions that replace default functions in
[xmlCustom](https://annotation.github.io/text-fabric/tf/convert/xmlCustom.html),
which is part of TF.

So you only have to focus on the bits that actually touch the lowfat XML.

We pass the function `convertTaskCustom()`, defined in `lowfat.py`, to the XML converter.

We also specify the way we want to see some attributes in the report files:

* keyword attributes: we want to see an inventory of all words that occur in such attributes
* trim attributes: we do not want to see the values of these attributes
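
To make the difference between the two report modes concrete, here is a minimal sketch of how such an inventory could be built. It is an illustration only, not the code TF runs: the function `inventory`, the placeholder string, and the sample records are all hypothetical.

```python
from collections import Counter


def inventory(records, keyword_atts, trim_atts):
    """Summarise (attribute, value) pairs in the spirit of the two report modes.

    Illustrative sketch only; this is not TF's implementation.
    """
    report = {}
    for att, value in records:
        counts = report.setdefault(att, Counter())
        if att in trim_atts:
            # trim attribute: note that it occurs, but do not list its values
            counts["(value not shown)"] += 1
        elif att in keyword_atts:
            # keyword attribute: count every whitespace-separated word
            counts.update(value.split())
        else:
            # any other attribute: count the value as a whole
            counts[value] += 1
    return report


# made-up sample records, not taken from the lowfat XML
sample = [
    ("class", "noun"),
    ("class", "verb"),
    ("role", "s"),
    ("lemma", "λόγος"),
    ("lemma", "θεός"),
]
result = inventory(sample, keyword_atts={"class", "role"}, trim_atts={"lemma"})
for att, counts in sorted(result.items()):
    print(att, dict(counts))
```

Attributes with a small, closed vocabulary are worth listing in full, while attributes such as `lemma` or `id` have so many distinct values that listing them would swamp the report.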

In [3]:
keywordAtts = set(
    """
    case
    class
    number
    gender
    mood
    person
    role
    tense
    type
    voice
""".strip().split()
)

trimAtts = set(
    """
    domain
    frame
    gloss
    id
    lemma
    ln
    morph
    normalized
    ref
    referent
    rule
    strong
    subjref
    unicode
""".strip().split()
)

renameAtts = dict(Rule="crule")

We do not want both `Rule` and `rule` as features in our dataset, because their file names clash on
file systems that are case insensitive. That is why we rename `Rule` to `crule`.
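
TF writes each feature to its own file, so two names that differ only in case end up competing for the same file on such systems. The sketch below shows one way to detect clashes of this kind and to confirm that a rename mapping resolves them. The helpers `case_clashes` and `apply_renames` are hypothetical and not part of TF.

```python
from collections import defaultdict


def case_clashes(names):
    """Group names that become identical when case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # keep only the groups with more than one member
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


def apply_renames(names, rename):
    """Replace every name that has an entry in the rename mapping."""
    return [rename.get(name, name) for name in names]


atts = ["Rule", "rule", "lemma"]
before = case_clashes(atts)
after = case_clashes(apply_renames(atts, dict(Rule="crule")))
print(before)  # {'rule': ['Rule', 'rule']}
print(after)   # {}
```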

In [4]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    xml=0,
    tf="0.3.2",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2022-11-01 (most recent)
TF data version is 0.3.2 (explicit new)


Now we can run tasks.

# Check

First we check the input:

In [5]:
X.task(check=True)

XML to TF checking: ~/github/ETCBC/nestle1904/xml/2022-11-01 => ~/github/ETCBC/nestle1904/report/2022-11-01
Start folder gnt:
  27 27-revelation.xml                                 
End   folder gnt

151 info line(s) written to ~/github/ETCBC/nestle1904/report/2022-11-01/elements.txt
0 error(s) in 0 file(s) written to ~/github/ETCBC/nestle1904/report/2022-11-01/errors.txt
7 tags of which 0 with multiple namespaces written to ~/github/ETCBC/nestle1904/report/2022-11-01/namespaces.txt


True

# Convert

Here we generate the actual TF data.

In [6]:
X.task(convert=True)

XML to TF converting: ~/github/ETCBC/nestle1904/xml/2022-11-01 => ~/github/ETCBC/nestle1904/tf/0.3.2
  0.00s Not all of the warp features otype and oslots are present in
~/github/ETCBC/nestle1904/tf/0.3.2
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    book, chapter, verse
   |   SECTION   FEATURES: book, chapter, verse
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       after, text
   |     0.00s OK
   |     0.00s Following director... 
  27 27-revelation.xml                                 
source reading done
   |     5.33s "edge" actions: 0
   |     5.33s "feature" actions: 260889
   |     5.33s "node" actions: 131121
   |     5.33s "resume" actions: 0
   |     5.33s "slot" a

True

# Load

The best check to see that the TF is valid is to load it.

In [7]:
X.task(load=True)

   |     0.15s T otype                from ~/github/ETCBC/nestle1904/tf/0.3.2
   |     1.89s T oslots               from ~/github/ETCBC/nestle1904/tf/0.3.2
   |     0.31s T chapter              from ~/github/ETCBC/nestle1904/tf/0.3.2
   |     0.45s T text                 from ~/github/ETCBC/nestle1904/tf/0.3.2
   |     0.37s T book                 from ~/github/ETCBC/nestle1904/tf/0.3.2
   |     0.36s T after                from ~/github/ETCBC/nestle1904/tf/0.3.2
   |     0.33s T verse                from ~/github/ETCBC/nestle1904/tf/0.3.2
   |      |     0.04s C __levels__           from otype, oslots, otext
   |      |     1.49s C __order__            from otype, oslots, __levels__
   |      |     0.06s C __rank__             from otype, __order__
   |      |     2.93s C __levUp__            from otype, oslots, __rank__
   |      |     1.66s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.05s C __characters__       from otext
   |      |     0.76s C __boundar

True

# App creation

We create the config file that turns the dataset into a TF app.

In [8]:
X.task(app=True)

App updating ...
	~/github/ETCBC/nestle1904/app/static/logo.png (already exists, not overwritten)
	~/github/ETCBC/nestle1904/app/static/display.css (no custom info, older orginal exists)
	~/github/ETCBC/nestle1904/app/config.yaml (generated with custom info)
	~/github/ETCBC/nestle1904/app/app.py (deleted)
Done


True

# Test

We test a bit of the resulting dataset right here.

In [9]:
A = use("ETCBC/nestle1904:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
wg,114879,7.6,633
w,137779,1.0,100


In [10]:
s2 = F.otype.s("sentence")[1]
A.pretty(s2, withNodes=True, standardFeatures=True)

Are there `error` nodes?

In [11]:
F.otype.s("error")

()

Not anymore.

# Browse

We are ready to browse the data.
If you run this notebook, then the next cell will open a browser window with the TF-browser
on the Greek New Testament.

In [36]:
X.task(browse=True)

This is Text-Fabric 11.4.11
Starting new kernel listening on 17116
Loading data for ETCBC/nestle1904. Please wait ...
Setting up TF kernel for ETCBC/nestle1904  
**Locating corpus resources ...**
Using app in ~/github/ETCBC/nestle1904/app:
	repo clone offline under ~/github (local github)
Using data in ~/github/ETCBC/nestle1904/tf/0.3.1:
	repo clone offline under ~/github (local github)
TF setup done.
Starting new webserver listening on 27116


 * Running on http://localhost:27116
Press CTRL+C to quit


Opening ETCBC/nestle1904 in browser
Press <Ctrl+C> to stop the TF browser


127.0.0.1 - - [10/May/2023 16:03:54] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/highlight.css HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/fonts.css HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/index.css HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/display.css HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/base.css HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/fontawesome.css HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/tf3.0.js HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/jquery.js HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/fonts/fa-regular-400.woff2 HTTP/1.1" 304 -
127.0.0.1 - - [10/May/2023 16:03:54] "GET /server/static/fonts/fa-solid-900.woff

Kernel listening at port 17116

TF web server has stopped
TF kernel has stopped


keyboard interrupt!


True

# Terminate

You can stop the browser by interrupting the kernel: press `i` twice while the notebook is in command mode.

# Create zip

It is time to commit and push the repo to GitHub now:

```
git add --all .
git commit -m "new data version"
git push origin master
```

Then go over to GitHub and create a new release there.

After that, fetch the new tags from GitHub by

```
git pull --tags
```

Then we are ready to create a zip file for publishing the dataset in a release on GitHub,
so that users can get it easily.

In [37]:
A.zipAll()

Data to be zipped:
	OK       app                      (v0.3.1 41dd47)     : ~/github/ETCBC/nestle1904/app
	OK       main data                (v0.3.1 41dd47)     : ~/github/ETCBC/nestle1904/tf/0.3.1
Writing zip file ...
Result: ~/Downloads/github/ETCBC/nestle1904/complete.zip


# Fetch

We now test whether users can use this dataset in the normal way.

Run this after you have attached the `complete.zip` file that we created earlier to the latest release on GitHub.

In [35]:
A = use("ETCBC/nestle1904:latest")

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
error,1,34.0,0
verse,7944,17.34,100
sentence,8011,17.2,100
wg,114878,7.6,633
w,137779,1.0,100


Indeed, downloading and installing went without hassle.

# Operational matters

You can now get a report of the memory footprint of a TF dataset:

In [36]:
A.footprint()

                                                


# 51 features

feature | members | size in bytes
--- | --- | ---
__levUp__ | 268,900 | 53,888,212
oslots | 3 | 18,990,804
id | 137,779 | 17,505,291
ref | 137,779 | 17,322,488
__levDown__ | 131,121 | 17,228,176
__boundary__ | 2 | 16,430,752
class | 210,567 | 16,382,666
strong | 137,779 | 12,836,336
unicode | 137,779 | 11,383,528
text | 137,779 | 10,860,278
normalized | 137,779 | 10,773,384
gloss | 137,779 | 10,312,724
__order__ | 268,900 | 9,680,440
lemma | 137,779 | 9,581,090
num | 145,817 | 9,447,440
verse | 145,723 | 9,323,204
ln | 126,879 | 9,226,616
morph | 137,779 | 9,160,456
chapter | 138,039 | 9,108,052
book | 137,806 | 9,102,932
after | 137,779 | 9,101,958
domain | 126,879 | 8,893,453
type | 113,168 | 8,412,633
number | 98,937 | 8,013,308
rule | 93,234 | 7,902,986
role | 84,218 | 4,980,402
case | 79,518 | 4,848,319
gender | 73,975 | 4,692,998
frame | 25,493 | 3,588,614
articular | 28,772 | 2,116,444
mood | 28,357 | 2,105,149
tense | 28,357 | 2,105,135
voice | 28,357 | 2,105,024
subjref | 16,575 | 1,407,769
referent | 14,471 | 1,389,083
__rank__ | 268,900 | 1,142,916
person | 19,419 | 1,133,807
junction | 17,921 | 1,091,819
otype | 4 | 1,049,511
nodeId | 5,558 | 628,472
__sections__ | 2 | 571,032
discontinuous | 6,034 | 463,972
crule | 5,558 | 450,723
clauseType | 5,296 | 295,892
cltype | 2,843 | 227,318
appositioncontainer | 1,908 | 127,260
degree | 513 | 32,996
__characters__ | 1 | 29,873
lang | 27 | 1,975
__levels__ | 7 | 1,511
note | 1 | 324
TOTAL | 4,125,850 | 367,457,545

The GNT takes 367 MB of your precious RAM.
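
The 367 MB figure is the `TOTAL` row read in decimal megabytes. As a small worked check, the sketch below parses an excerpt of the table above and converts the total to both MB and MiB. The parser `parse_sizes` is a hypothetical helper written for this note, not a TF function.

```python
def parse_sizes(table_text):
    """Read 'feature | members | size in bytes' rows into a {feature: bytes} dict."""
    sizes = {}
    for line in table_text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # skip the header row, the separator row and anything malformed
        if len(cells) != 3 or not cells[2].replace(",", "").isdigit():
            continue
        sizes[cells[0]] = int(cells[2].replace(",", ""))
    return sizes


# excerpt of the footprint table above
excerpt = """
feature | members | size in bytes
--- | --- | ---
__levUp__ | 268,900 | 53,888,212
oslots | 3 | 18,990,804
TOTAL | 4,125,850 | 367,457,545
"""

sizes = parse_sizes(excerpt)
total = sizes["TOTAL"]
print(f"{total / 10**6:.0f} MB, {total / 2**20:.0f} MiB")
print(f"__levUp__ share: {sizes['__levUp__'] / total:.1%}")
```

The precomputed feature `__levUp__` alone accounts for roughly a seventh of the total.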

It will run on an older iPad, through the app Carnets.

# Demo

We demo the effect of the reshuffling of the words.

Our test corpus is the letter of Jude, first sentence, twice.

The first time we do not shuffle the words in the sentence, the second time we do.

We run the conversion with `demo = True` in `lowfat.py`.

In [5]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    xml=-1,
    tf="0.3.1t",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2000-01-01 (oldest)
TF data version is 0.3.1t (explicit new)


In [7]:
X.task(convert=True, load=True, app=True)

XML to TF converting: ~/github/ETCBC/nestle1904/xml/2000-01-01 => ~/github/ETCBC/nestle1904/tf/0.3.1t
  0.00s Not all of the warp features otype and oslots are present in
~/github/ETCBC/nestle1904/tf/0.3.1t
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    book, chapter, verse
   |   SECTION   FEATURES: book, chapter, verse
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       after, text
   |     0.00s OK
   |     0.00s Following director... 
   1 26-jude.xml                                       
source reading done
   |     0.00s "edge" actions: 0
   |     0.00s "feature" actions: 71
   |     0.00s "node" actions: 39
   |     0.00s "resume" actions: 0
   |     0.00s "slot" actions

True

In [9]:
A = use("ETCBC/nestle1904:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,1,34.0,100
chapter,1,34.0,100
verse,1,34.0,100
sentence,2,17.0,100
wg,34,6.0,600
w,34,1.0,100


In [10]:
(s1, s2) = F.otype.s("sentence")

In [21]:
color1 = "cyan"
color2 = "goldenrod"
start = 5
offset = 17
highlights = {
    start: color1,
    start + 1: color2,
    start + offset: color2,
    start + offset + 1: color1,
}
A.displaySetup(standardFeatures=True, highlights=highlights)
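
The highlight map above is keyed by slot number. Slots are numbered from 1, and each of the two sentences has 17 words, so `start + offset` is the same position in the second sentence as `start` is in the first. The sketch below rebuilds the map in plain Python to make the pairing explicit. The function name is hypothetical, and the reading that the crossed colours are meant to follow two swapped adjacent words is an assumption based on the colour pattern.

```python
def paired_highlights(start, offset, color1, color2):
    """Colour two adjacent slots, and the same two positions `offset` slots
    further on with the colours crossed over."""
    return {
        start: color1,               # first word of the pair, sentence 1
        start + 1: color2,           # second word of the pair, sentence 1
        start + offset: color2,      # same positions in sentence 2,
        start + offset + 1: color1,  # with the colours exchanged
    }


highlights = paired_highlights(5, 17, "cyan", "goldenrod")
print(highlights)
# {5: 'cyan', 6: 'goldenrod', 22: 'goldenrod', 23: 'cyan'}
```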

In [22]:
A.pretty(s1)

In [23]:
A.pretty(s2)

# Restore

We restore the app so that it uses the normal TF data version again.

In [32]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    tf="0.3.1",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2022-11-01 (most recent)
TF data version is 0.3.1 (explicit existing)


In [33]:
X.task(app=True)

App updating ...
	~/github/ETCBC/nestle1904/app/static/logo.png (already exists, not overwritten)
	~/github/ETCBC/nestle1904/app/static/display.css (no custom info, older orginal exists)
	~/github/ETCBC/nestle1904/app/config.yaml (generated with custom info)
	~/github/ETCBC/nestle1904/app/app.py (deleted)
Done


True

Now save this notebook, commit and push the repo again to publish this very notebook.

```
git add --all .
git commit -m "maker notebook updated"
git push origin master
```