# Convert from TEI to TF to WATM

First we convert Mondriaan TEI to TF and then the TF to WATM.

This notebook is bare: no explanations, no illustrations, no checks.
For more documentation, try any of the following variants:

* `convertExpress`: as few commands, as little feedback and interaction as possible
* [convertSteps](convertSteps.ipynb): broken down into a few command-line commands, with more feedback
* [convertDetails](convertDetails.ipynb): run from Python with full control

# One shot

Here is the express, mindless way to convert the corpus.

If something goes wrong, work through the step-by-step section further down in this notebook.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from make import run

In [7]:
%%time

run("all 0.0.1 --silent")

Using TF version: 0.8.15
Checking TEI ...
	folder proeftuin:
	folder backmatter:
Converting TEI to TF ...
	folder proeftuin:
	folder backmatter:
Loading TF ...
App updated
Add tokens and sentences ...


  0.21s Using NLP pipeline Spacy (it) ...
NLP with language model it True
This language supports tagging
This language supports morphologizing
This language supports lemmatizing
  4.95s NLP done
  0.00s Feature overview: 65 for nodes; 5 for edges; 1 configs; 9 computed
App updated with NLP output 
Producing WATM


	16 x of type sentence
	4 x of type ent
	6 x of type token
OK - whether all tests passed
CPU times: user 12.1 s, sys: 1.16 s, total: 13.3 s
Wall time: 14.1 s


# Only WATM

In [6]:
%%time

run("watm 0.0.1 --silent")

Using TF version: 0.8.14
Producing WATM


	16 x of type sentence
	4 x of type ent
	6 x of type token
OK - whether all tests passed
CPU times: user 755 ms, sys: 32 ms, total: 787 ms
Wall time: 787 ms


# Step by step

Below you can inspect all the steps of the conversion:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.convert.watm import WATM
from tf.advanced.helpers import dm

In [44]:
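# With MINI, read the TEI in folder 2024-01-01 and write TF version 0.0.1pre;
# otherwise pass tei=0 and write TF version 0.1.0pre.
MINI = False
Tei = (
    TEI(verbose=-1, tei="2024-01-01", tf="0.0.1pre")
    if MINI
    else TEI(verbose=-1, tei=0, tf="0.1.0pre")
)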

# Step 1: Check

In [4]:
Tei.task(check=True, verbose=1, validate=True)

TEI to TF checking: ~/gitlab.huc.knaw.nl/van-gogh/letters/tei/2024-01-01 => ~/gitlab.huc.knaw.nl/van-gogh/letters/report/2024-01-01
Processing instructions are treated
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
INFO: Needs ns2.xsd (exists)
INFO: Needs ns1.xsd (exists)
INFO: Needs dcr.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/gitlab.huc.knaw.nl/van-gogh/letters/schema/MD.xsd
	round   1:  45 changes
172 identical override(s)
  4 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	transpose pure ==> mixed
INFO: Needs artwork1.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/gitlab.huc.knaw.nl/van-gogh/letters/schema/artwork.xsd
	round   1:  15 changes
	round   2:   2 changes
 24 identical override(s)
  3 changing o

1 validation error(s) in 1 file(s) written to ~/gitlab.huc.knaw.nl/van-gogh/letters/report/2024-01-01/errors.txt


167 info line(s) written to ~/gitlab.huc.knaw.nl/van-gogh/letters/report/2024-01-01/elements.txt
587 tag(s) type info written to ~/gitlab.huc.knaw.nl/van-gogh/letters/report/2024-01-01/types.txt
0 processing instructions encountered.
50 tags of which 0 with multiple namespaces written to ~/gitlab.huc.knaw.nl/van-gogh/letters/report/2024-01-01/namespaces.txt
Refs written to ~/gitlab.huc.knaw.nl/van-gogh/letters/report/2024-01-01/refs.txt
	resolvable:   37 in   74
	dangling:     56 in  122
	ALL:          93 in  196 
Ids written to ~/gitlab.huc.knaw.nl/van-gogh/letters/report/2024-01-01/ids.txt
	referenced:   37 by   74
	non-unique:   30
	unused:      249
	ALL:         256 in  286


False

# Step 2: Convert

In [5]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Processing instructions are treated
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/gitlab.huc.knaw.nl/van-gogh/letters/schema/MD.xsd
	round   1:  45 changes
172 identical override(s)
  4 changing override(s)
	address pure ==> mixed
	postmark complex mixed (added)
	rewrite complex mixed (added)
	transpose pure ==> mixed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/gitlab.huc.knaw.nl/van-gogh/letters/schema/artwork.xsd
	round   1:  15 changes
	round   2:   2 changes
 24 identical override(s)
  3 changing override(s)
	artwork complex pure (added)
	bibl mixed ==> pure
	catRef pure ==> mixed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/gitlab.huc.knaw.nl/van-gogh/letters/schema/bio.xsd
	round   1:  19 changes
 30 identical override(

True

# Step 3: Load the TF data

The final proof that the conversion has worked is to load the data.
On the first load, several checks and pre-computations are performed.
Subsequent loads are much quicker.
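
The speed-up on later loads comes from not repeating work that has already been done. The sketch below illustrates that general compute-once-then-cache pattern with the standard library only. It is not how Text-Fabric stores its pre-computed data: `load_with_cache`, the file names, and the word-count "pre-computation" are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_with_cache(source: Path, cache: Path) -> dict:
    """Return word counts for `source`, reusing `cache` when it is still fresh."""
    # Reuse the cached result only if it is at least as new as the source file.
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        return json.loads(cache.read_text(encoding="utf-8"))

    # First load (or stale cache): do the expensive pre-computation ...
    counts = {}
    for word in source.read_text(encoding="utf-8").split():
        counts[word] = counts.get(word, 0) + 1

    # ... and store the result so that the next load can skip it.
    cache.write_text(json.dumps(counts), encoding="utf-8")
    return counts


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "corpus.txt"          # hypothetical source file
    cache = Path(tmp) / "corpus.cache.json"    # hypothetical cache file
    source.write_text("to theo van gogh to theo", encoding="utf-8")

    first = load_with_cache(source, cache)     # computes and writes the cache
    cache_was_written = cache.exists()
    second = load_with_cache(source, cache)    # served from the cache
```

Comparing modification times is a simple freshness test; a real system may use stricter checks.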

In [6]:
Tei.task(load=True, verbose=1)

   |     0.01s T otype                from ~/gitlab.huc.knaw.nl/van-gogh/letters/tf/0.0.1pre
   |     0.11s T oslots               from ~/gitlab.huc.knaw.nl/van-gogh/letters/tf/0.0.1pre
   |     0.00s T chunk                from ~/gitlab.huc.knaw.nl/van-gogh/letters/tf/0.0.1pre
   |     0.06s T ch                   from ~/gitlab.huc.knaw.nl/van-gogh/letters/tf/0.0.1pre
   |     0.00s T file                 from ~/gitlab.huc.knaw.nl/van-gogh/letters/tf/0.0.1pre
   |     0.00s T folder               from ~/gitlab.huc.knaw.nl/van-gogh/letters/tf/0.0.1pre
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.28s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.27s C __levUp__            from otype, oslots, __rank__
   |      |     0.06s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.00s C __characters__       from otext
   |      |     0.06s

True

# Step 4: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.
Most of it will be generated now, but there are ways to keep custom additions intact.

In [7]:
Tei.task(app=True)

App updated


True
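
Each `Tei.task(...)` call so far ends with `True` or `False` in its output cell. When chaining steps outside a notebook, it can help to stop at the first step that reports failure. A minimal sketch of that control flow follows; `run_steps` and the stand-in lambdas are hypothetical and do not call Text-Fabric.

```python
from __future__ import annotations

from typing import Callable


def run_steps(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run steps in order and stop at the first one that returns False.

    Returns the names of the steps that completed successfully.
    """
    done = []
    for name, step in steps:
        if not step():
            print(f"Step {name!r} failed; stopping")
            break
        done.append(name)
    return done


# Stand-ins for calls such as Tei.task(check=True); they only return booleans.
demo_steps = [
    ("check", lambda: True),
    ("convert", lambda: False),  # simulate a failing conversion
    ("load", lambda: True),      # never reached in this demo
]
completed = run_steps(demo_steps)
```

In real use, each lambda would wrap one task call, so that a failed conversion never reaches the load step.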

# Step 5: Add tokens and sentences

In [8]:
Apre = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots / node,% coverage
folder,3,10211.0,100
letter,7,4359.57,100
file,8,3829.12,100
body,8,2136.5,56
text,8,2136.5,56
standOff,7,1529.29,35
div,14,1216.21,56
page,32,532.09,56
listAnnotation,21,509.71,35
fileDesc,8,284.12,7


In [9]:
NLP = NLPipeline(lang="en", verbose=0, write=True)
NLP.loadApp(Apre)

Input data has version 0.0.1pre
Compute element boundaries
   822 start positions
   938 end positions


In [10]:
NLP.task(plaintext=True, lingo=True, ingest=True)

Input data has version 0.0.1pre
Compute element boundaries
   822 start positions
   938 end positions
  0.00s Generating a plain text with positions ...
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
   |   Found 201 empty slots
   |   recorded flow main       with  61538 items
   |   recorded flow note       with  32405 items
  0.11s Done. Generated text and positions written to ~/gitlab.huc.knaw.nl/van-gogh/letters/_temp/txt/0.0.1pre/plain.txt
  0.11s Using NLP pipeline Spacy (en) ...
NLP with language model en False
  2.66s Tokens written to ~/gitlab.huc.knaw.nl/van-gogh/letters/_temp/txt/0.0.1pre/tokens.tsv
  2.66s Sentences written to ~/gitlab.huc.knaw.nl/van-gogh/letters/_temp/txt/0.0.1pre/sentences.tsv
  2.66s NLP done
  2.66s Ingesting NLP output into the dataset ...
   |     3.87s Mapping NLP data to nodes and features ...
   |      |     0.00s generating t-nodes with features str, after, empty
   |      |      |    -0.00s 7251 t nodes have values assigned

'0.0.1'

In [46]:
Tei.task(apptoken=True)

App updated with NLP output 


True
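
The log above mentions a plain text "with positions" and a later step that maps NLP data to nodes. Keeping character offsets with every token is presumably what makes that mapping possible. The sketch below shows the idea with a plain regular expression instead of Spacy; `tokenize_with_positions` and the sample sentence are assumptions for illustration and are not part of `NLPipeline`.

```python
from __future__ import annotations

import re

# Words, or single punctuation characters; whitespace is skipped.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_with_positions(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, token) triples, with offsets into `text`."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]


text = "To Theo van Gogh. The Hague."
tokens = tokenize_with_positions(text)

# Every token can be recovered from the source text by its offsets.
round_trip_ok = all(text[start:end] == tok for start, end, tok in tokens)
```

Because each token carries its offsets, any annotation on a token can be traced back to an exact span of the original text.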

# Step 6: Use the new dataset

In [47]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.3.6
49 features found and 0 ignored
  1.97s Dataset without structure sections in otext:no structure functions in the T-API
  6.14s All features loaded / computed - for details use TF.isLoaded()
  0.35s All additional features loaded - for details use TF.isLoaded()


Name,# of nodes,# slots / node,% coverage
folder,3,1087728.67,100
artworklist,1,55287.0,2
listObject,1,55261.0,2
bibliolist,1,27621.0,1
listBibl,1,27466.0,1
biolist,1,22954.0,1
listPerson,1,22943.0,1
file,931,3505.03,100
letter,928,3402.29,97
body,931,2549.17,73


In [13]:
T.sectionTypes

['folder', 'file', 'chunk']

# Step 7: Convert to WATM

N.B. For docs click the WATM link in the output cell.

In [48]:
WA = WATM(A, "tei", skipMeta=False, extra=dict())
WA.makeText()

textRepoLevel is section level 'file'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

In [49]:
WA.makeAnno()

              folder [   0:     0] - [ 902:  2552]
              folder [ 903:     0] - [ 927:  1229]
              folder [ 928:     0] - [ 930: 22952]


In [50]:
WA.writeAll()

Text file    0:      842 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-0.tsv
Text file    1:      905 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-1.tsv
Text file    2:      624 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-2.tsv
Text file    3:     1831 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-3.tsv
Text file    4:     1743 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-4.tsv
Text file    5:     2182 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-5.tsv
Text file    6:      643 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-6.tsv
Text file    7:      828 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-7.tsv
Text file    8:      757 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-8.tsv
Text file    9:     2026 segments to ~/gitlab.huc.knaw.nl/van-gogh/letters/watm/0.1.0/text-9.tsv
Text file   10:     3743 segme

# Step 8: Test the WATM against the TF
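
The checks in this step compare the WATM output with the TF dataset it was generated from, for example whether the token counts and the full text agree. The sketch below illustrates that kind of round-trip check on toy data. It does not call `WA.testAll()`; the `compare` helper, the sample lists, and the `XX` failure marker are all made up for illustration.

```python
def compare(label: str, expected, actual) -> bool:
    """Print both values and report whether they agree."""
    ok = expected == actual
    print(f"\tTF: {expected}")
    print(f"\tWA: {actual}")
    print(f"{'OK' if ok else 'XX'} - whether {label}")
    return ok


# Toy stand-ins: tokens on the TF side, text segments on the WATM side.
tf_tokens = ["To ", "Theo ", "van ", "Gogh", "."]
watm_segments = ["To ", "Theo ", "van ", "Gogh", "."]

results = [
    compare("the amounts of tokens agree", len(tf_tokens), len(watm_segments)),
    compare("the text is the same", "".join(tf_tokens), "".join(watm_segments)),
]
all_good = all(results)
```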

In [51]:
WA.error = False
WA.testAll()

Testing the text ...
	TF: 3263186
	WA: 3263186
OK - whether the amounts of tokens agree
	TF: To Theo van Gogh. Th ... wart  Dutch artist  
	WA: To Theo van Gogh. Th ... wart  Dutch artist  
OK - whether the text is the same
Testing the elements ...
	TF: 500842
	WA: 500842
OK - whether the amounts of elements and nodes agree
Testing the processing instructions ...
	TF:      0
	WA:      0
OK - whether the amounts of processing instructions agree
Testing the element/pi annotations ...
	3764028 element/pi annotations
	Element      : 500842 x
	Pi           :      0 x
	Other        : 4106174 x
	Good name    : 500842 x
	Wrong name   :      0 x
	Good target  : 500842 x
	Wrong target :      0 x
	Unmapped     :      0 x
OK - whether all element/pi annotations have good bodies
OK - whether all element/pi annotations have good targets
Testing the attributes ...
	938358 attribute values
	Good:     938358 x
	Wrong:        0 x
OK - whether annotations are consistent with features
	WA attributes: 9383