# Convert from TEI to TF to WATM

We first convert the TEI corpus to Text-Fabric (TF), and then convert the TF dataset to WATM.

This notebook is bare: no explanations, no illustrations, no checks.
For more documentation, try any of the following variants:

# Step by step

Below you can inspect all the steps of the conversion:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.convert.watm import WATM
from tf.convert.iiif import IIIF
from tf.advanced.helpers import dm
from processscans import Scans

## Step 0: Scan ingest

In [3]:
# SC = Scans(subset=True, silent=False, force=True)
SC = Scans(subset=False, silent=False, force=False)

In [4]:
SC.ingest(dry=False)

2649 files
2648 png files


## Step 2: Scan processing

In [5]:
# SC.process(force=True)
SC.process(force=False)

Already present: thumbnails (pages)
Already present: sizes file thumbnails (pages)


In [53]:
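# Full corpus: TEI source 2025-02-14, TF version 0.0.1
Tei = TEI(verbose=-1, tei="2025-02-14", tf="0.0.1")
# Subset alternative (TF version 0.0.1s, as seen in the loading output further down):
# Tei = TEI(verbose=-1, tei="2025-02-14-subset", tf="0.0.1s")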

# Step 1: Check

In [54]:
Tei.task(check=True, verbose=1, validate=True, carryon=True)

TEI to TF checking: ~/github/HuygensING/vangogh/tei/2025-02-14 => ~/github/HuygensING/vangogh/report/2025-02-14
Processing instructions are treated
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
INFO: Needs editem.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/vangogh/schema/editem-letter.xsd
	round   1:  73 changes
172 identical override(s)
  4 changing override(s)
	eventName complex mixed (added)
	postaldiv complex pure (added)
	postmark complex mixed (added)
	transpose pure ==> mixed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/vangogh/schema/artwork.xsd
	round   1:  15 changes
	round   2:   2 changes
 24 identical override(s)
  2 changing override(s)
	bibl mixed ==> pure
	catRef pure ==> mixed
Analysing ~/github/annotation/text-fabric/tf/tools/te

157 validation error(s) in 3 file(s) written to ~/github/HuygensING/vangogh/report/2025-02-14/errors.txt


0 pagebreaks without facs attribute.
8260 pagebreaks encountered.
4130 distinct scans referred to by pagebreaks.
2585 surfaces declared
4219 zones declared
6804 scans declared and mapped.


2674 unused scans of which 2585 surfaces and 89 zones
42 missing zone region specifiers
See ~/github/HuygensING/vangogh/report/2025-02-14/zoneErrors.yml


0 processing instructions encountered.
60 tags of which 0 with multiple namespaces written to ~/github/HuygensING/vangogh/report/2025-02-14/namespaces.txt
479 info line(s) written to ~/github/HuygensING/vangogh/report/2025-02-14/elements.txt
Refs written to ~/github/HuygensING/vangogh/report/2025-02-14/refs.txt
	resolvable: 13073 in 24541
	dangling:   8156 in 41230
	ALL:        21229 in 65771 
Ids written to ~/github/HuygensING/vangogh/report/2025-02-14/ids.txt
	referenced: 13073 by 24541
	non-unique:    0
	unused:     188508
	ALL:        201581 in 201581
lb-parent info written to ~/github/HuygensING/vangogh/report/2025-02-14/lb-parents.txt


False
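
The counts in this report can be cross-checked with simple arithmetic. The sketch below copies the numbers from the output above and verifies that they add up. The variable names are illustrative and not part of the TF API, and reading "unused" as "declared but not referred to by a pagebreak" is an assumption that happens to match the figures.

```python
# Numbers copied from the check report above; the variable names are illustrative.
pagebreaks = 8260
scans_referred = 4130
surfaces = 2585
zones = 4219
scans_declared = 6804
unused = {"total": 2674, "surfaces": 2585, "zones": 89}

scan_checks = {
    "declared = surfaces + zones": surfaces + zones == scans_declared,
    # Assumption: "unused" means declared but not referred to by any pagebreak.
    "unused = declared - referred": scans_declared - scans_referred == unused["total"],
    "unused = unused surfaces + unused zones": unused["surfaces"] + unused["zones"] == unused["total"],
}

# Each report line reads "<a> in <b>"; both columns should add up to the ALL line.
refs = {"resolvable": (13073, 24541), "dangling": (8156, 41230)}
refs_all = tuple(sum(column) for column in zip(*refs.values()))

# Ids: referenced + non-unique + unused should give the ALL figure.
ids_all = 13073 + 0 + 188508

# Average number of pagebreaks that point at the same scan.
pagebreaks_per_scan = pagebreaks / scans_referred

for description, ok in scan_checks.items():
    print(f"{'OK' if ok else 'MISMATCH'}: {description}")
```

All three scan checks hold, the reference columns sum to the reported `21229 in 65771`, and the id counts sum to `201581`. On average each referred scan is the target of two pagebreaks.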

# Step 2: Convert

In [33]:
# Reset the status flag: the check step above returned False
Tei.good = True
Tei.task(convert=True, verbose=0)

Page model II with page nodes for pages started by pb elements without keeping the pb elements
Section model I
Processing instructions are treated
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
	Start folder letters:
		   1 editem-letter vg           letter       let001.xml                                        
		   2 editem-letter vg           letter       let001a.xml                                       
		   3 editem-letter vg           letter       let002.xml                                        
		   4 editem-letter vg           letter       let003.xml                                        
		   5 editem-letter vg           letter       let004.xml                                        

End   folder letters

Resolving links into edges ...
	246 in 480 reference(s) resolved
	274 in 704 reference(s): could not be resolved
   |     0.27s "delete" actions: 0
   |     0.27s "

True

# Step 3: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.
Most of it will be generated now, but there are ways to keep custom additions intact.

In [34]:
Tei.task(app=True)

True

# Step 4: Use the new dataset

The final proof that the conversion has worked is to load the data.
On first-time loading several checks and pre-computations are performed.
Next time the loading will be much quicker.
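
The speed-up on later loads follows the familiar compute-once, cache-on-disk pattern. The sketch below illustrates that pattern in general terms only; it is not TF's actual implementation, and `load_with_cache` and `expensive_precomputation` are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_cache(cache_file, compute):
    """Return (data, from_cache): reuse a pickled result if present, else compute and store it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh), True
    data = compute()
    with cache_file.open("wb") as fh:
        pickle.dump(data, fh)
    return data, False


def expensive_precomputation():
    # Stand-in for a costly derived structure, such as a sorted node order.
    return sorted(range(1000), key=lambda n: (n % 7, n))


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "order.pkl"
    # First call computes and writes the cache; second call only reads it.
    first, first_cached = load_with_cache(cache, expensive_precomputation)
    second, second_cached = load_with_cache(cache, expensive_precomputation)
```

The first call pays the computation cost, while the second call returns the same data straight from disk.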

In [35]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.6.4
43 features found and 0 ignored
   |     0.01s T otype                from ~/github/HuygensING/vangogh/tf/0.0.1s
   |     0.15s T oslots               from ~/github/HuygensING/vangogh/tf/0.0.1s
  0.16s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.00s T file                 from ~/github/HuygensING/vangogh/tf/0.0.1s
   |     0.00s T folder               from ~/github/HuygensING/vangogh/tf/0.0.1s
   |     0.09s T after                from ~/github/HuygensING/vangogh/tf/0.0.1s
   |     0.10s T str                  from ~/github/HuygensING/vangogh/tf/0.0.1s
   |     0.00s T chunk                from ~/github/HuygensING/vangogh/tf/0.0.1s
   |      |     0.00s C __levels__           from otype, oslots, otext
   |      |     0.36s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.37s C __levUp__            from otype, oslots, __rank_

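| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| folder | 1 | 45224.0 | 100 |
| file | 24 | 1884.33 | 100 |
| letter | 24 | 1884.33 | 100 |
| body | 24 | 1238.33 | 66 |
| text | 24 | 1238.33 | 66 |
| div | 48 | 619.17 | 66 |
| listAnnotation | 72 | 185.58 | 30 |
| page | 169 | 175.86 | 66 |
| teiHeader | 24 | 79.58 | 4 |
| publicationStmt | 24 | 33.0 | 2 |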

# Step 5: Convert to WATM

N.B. For documentation, click the WATM link in the output cell.

In [36]:
WA = WATM(A, "tei", skipMeta=False, extra=dict())
WA.makeText()

conversion settings read from ~/github/HuygensING/vangogh/config/watm.yml
IIIF settings read from ~/github/HuygensING/vangogh/config/iiif.yml
textRepoLevel is section level 'file'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

In [37]:
WA.makeAnno()

              folder [   0:     0] - [  23:   981]


In [38]:
WA.writeAll()

	Writing WATM ...
Writing development data to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev
Text file    0:      841 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-0.tsv
Text file    1:      929 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-1.tsv
Text file    2:      628 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-2.tsv
Text file    3:     1913 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-3.tsv
Text file    4:     1822 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-4.tsv

Text files all:    45224 segments to 24 files
Anno file    1:    57662 annotations written to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/anno-1.tsv
Inherited annotations: 0
Anno files all:    57662 annotations to 1 file
Slot mapping written to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/pos2node.tsv
Node mapping written to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/anno2node.tsv
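
When such a log is kept, the per-file segment counts can be tallied and compared with the reported total. The sketch below parses the lines shown above with regular expressions. Only five of the 24 per-file lines appear in this output, so the remainder is derived rather than checked. The patterns are assumptions based on these log lines, not a documented format.

```python
import re

# Log lines copied from the output above (only 5 of the 24 files are shown there).
log = """\
Text file    0:      841 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-0.tsv
Text file    1:      929 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-1.tsv
Text file    2:      628 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-2.tsv
Text file    3:     1913 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-3.tsv
Text file    4:     1822 segments to ~/github/HuygensING/vangogh/watm/0.0.1s-000/dev/text-4.tsv
Text files all:    45224 segments to 24 files
"""

FILE_LINE = re.compile(r"^Text file\s+(\d+):\s+(\d+) segments to (\S+)$")
ALL_LINE = re.compile(r"^Text files all:\s+(\d+) segments to (\d+) files$")

per_file = {}  # file number -> number of segments
total_segments = total_files = None

for line in log.splitlines():
    file_match = FILE_LINE.match(line)
    all_match = ALL_LINE.match(line)
    if file_match:
        per_file[int(file_match.group(1))] = int(file_match.group(2))
    elif all_match:
        total_segments = int(all_match.group(1))
        total_files = int(all_match.group(2))

shown_segments = sum(per_file.values())
remaining_segments = total_segments - shown_segments
remaining_files = total_files - len(per_file)
print(f"{shown_segments} segments in {len(per_file)} listed files; "
      f"{remaining_segments} segments in {remaining_files} files not listed")
```

The total of 45224 segments equals the number of slots in the TF dataset loaded in Step 4.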


# Step 6: Test the WATM against the TF

In [39]:
WA.error = False
WA.testAll()

	Testing WATM ...
Testing the text ...
	TF:  45224
	WA:  45224
OK - whether the amounts of tokens agree
	TF: To Theo van Gogh. Th ... Paris in May 1873.  
	WA: To Theo van Gogh. Th ... Paris in May 1873.  
OK - whether the text is the same
Testing the elements ...
	TF:   5213
	WA:   5213
OK - whether the amounts of elements and nodes agree
Testing the processing instructions ...
	TF:      0
	WA:      0
OK - whether the amounts of processing instructions agree
Testing the element/pi annotations ...
	50437 element/pi annotations
	Element      :   5213 x
	Pi           :      0 x
	Other        :  51704 x
	Good name    :   5213 x
	Wrong name   :      0 x
	Good target  :   5213 x
	Wrong target :      0 x
	Unmapped     :      0 x
OK - whether all element/pi annotations have good bodies
OK - whether all element/pi annotations have good targets
Testing the attributes ...
	14547 attribute values
	Good:     14547 x
	Wrong:        0 x
OK - whether annotations are consistent with features
	WA attri
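
The first two tests compare the number of tokens and the concatenated text of both representations, and print the text abbreviated to its first and last 20 characters. The sketch below mimics that kind of check on made-up toy tokens. `abbreviate` and `compare` are hypothetical helpers, not the functions used by `WA.testAll()`.

```python
def abbreviate(text, width=20):
    """Show only the first and last `width` characters of a long text."""
    if len(text) <= 2 * width:
        return text
    return f"{text[:width]} ... {text[-width:]}"


def compare(tf_tokens, wa_tokens):
    """Return (same_amount, same_text) for two token sequences."""
    same_amount = len(tf_tokens) == len(wa_tokens)
    same_text = "".join(tf_tokens) == "".join(wa_tokens)
    return same_amount, same_text


# Toy data, invented for illustration; only the opening and closing words
# echo the real output above.
tf_tokens = [
    "To ", "Theo ", "van ", "Gogh. ", "This ", "letter ", "is ", "a ",
    "made-up ", "example ", "ending ", "in ", "Paris ", "in ", "May ", "1873.  ",
]
wa_tokens = list(tf_tokens)  # a faithful export reproduces every token

same_amount, same_text = compare(tf_tokens, wa_tokens)
print(f"\tTF: {abbreviate(''.join(tf_tokens))}")
print(f"\tWA: {abbreviate(''.join(wa_tokens))}")
print(f"{'OK' if same_amount else 'XX'} - whether the amounts of tokens agree")
print(f"{'OK' if same_text else 'XX'} - whether the text is the same")
```

Comparing both the counts and the full text catches different failures: two exports can hold the same number of tokens with different content, or the same text split into different tokens.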

## Step 7: Generate IIIF manifests

In [40]:
II = IIIF(Tei.teiVersion, A, Tei.reportPath, prod=False, silent=False)
II.manifests()

No cover directory: ~/github/HuygensING/vangogh/scans/covers
IIIF settings read from ~/github/HuygensING/vangogh/config/iiif.yml
TEI settings read from ~/github/HuygensING/vangogh/config/tei.yml
Maximum dimensions: W = 4584 H = 2423
Average dimensions: W =  994 H =  855
Average deviation:  W =  226 H =   92
Rotation file not found: ~/github/HuygensING/vangogh/thumb/rotation_pages.tsv
Using facs file info file ~/github/HuygensING/vangogh/report/2025-02-14-subset/facs.yml
Using facs mapping file ~/github/HuygensING/vangogh/report/2025-02-14-subset/facsMapping.yml
Collections:
letters with  172 pages
IIIF manifests generated in ~/github/HuygensING/vangogh/static/2025-02-14-subset/dev/manifests
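
For orientation, a IIIF manifest lists one canvas per page image, each with its dimensions. The sketch below builds a minimal manifest in the shape of the IIIF Presentation API 3.0. It is an assumption that this resembles what `II.manifests()` writes: the API version, the field selection, the `make_manifest` helper and the `example.org` URLs are all illustrative and not taken from the TF output.

```python
import json


def make_manifest(base_url, label, pages):
    """Build a minimal IIIF Presentation 3.0 manifest.

    `pages` is a list of (name, width, height) tuples, one per page image.
    """
    canvases = []
    for name, width, height in pages:
        canvas_id = f"{base_url}/canvas/{name}"
        canvases.append({
            "id": canvas_id,
            "type": "Canvas",
            "width": width,
            "height": height,
            "items": [{
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/annotation",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {
                        "id": f"{base_url}/images/{name}.png",  # hypothetical image URL
                        "type": "Image",
                        "format": "image/png",
                        "width": width,
                        "height": height,
                    },
                    "target": canvas_id,
                }],
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/manifest.json",
        "type": "Manifest",
        "label": {"en": [label]},
        "items": canvases,
    }


# One illustrative page, sized like the average dimensions reported above.
manifest = make_manifest(
    "https://example.org/iiif/letters", "letters", [("page-001", 994, 855)]
)
manifest_json = json.dumps(manifest, indent=2)
```

Each canvas carries a single painting annotation whose target is the canvas itself, which is how a viewer knows which image to draw on which page.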
