# Convert from TEI to TF to WATM

First we convert the corpus TEI to TF and then the TF to WATM.

This notebook is bare: it has no explanations, no illustrations, and no checks.
For more documentation, try any of the following variants:

# Step by step

Below you can inspect all the steps of the conversion:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tff.convert.tei import TEI
from tff.convert.addnlp import NLPipeline
from tff.convert.watm import WATM
from tff.convert.iiif import IIIF
from tff.convert.scans import Scans
from tf.advanced.helpers import dm

## Step 0: Scan ingest

In [5]:
SC = Scans(verbose=1, force=False)

Working in repository HuygensING/israels in back-end github
Source dir = ~/github/HuygensING/israels
imageprep settings read from ~/github/HuygensING/israels/config/scans.yml


## Step 0 (continued): Scan processing

In [6]:
SC.process(force=False)

Initialized scanInfo dir
Already present: thumbnails (pages)
Already present: sizes file thumbnails (pages)
Already present: sizes file originals (pages)
Copied sizes file to scanInfo
originals: 217
thumbnails: 217


In [4]:
SC.process(force=True)

	Convert 218 originals to thumbnails (pages)
		 46% done
		 92% done
		100% done
	Get sizes of 218 thumbnails (pages)
		 46% done
		 92% done
		100% done
	Get sizes of 218 originals (pages)
		 46% done
		 92% done
		100% done
originals: 217
thumbnails: 217


In [3]:
Tei = TEI(verbose=-1, tei="2025-04-24", tf="0.1.0")

# Step 1: Check

In [6]:
Tei.task(check=True, verbose=1, validate=True, carryon=True)

TEI to TF checking: ~/github/HuygensING/israels/tei/2025-04-24 => ~/github/HuygensING/israels/report/2025-04-24
Processing instructions are treated
XML validation will be performed
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-about.xsd
	round   1:  52 changes
118 identical override(s)
  0 changing override(s)
INFO: Needs editem.xsd (exists)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-letter.xsd
	round   1:  71 changes
172 identical override(s)
  2 changing override(s)
	artwork complex pure (added)
	eventName complex mixed (added)
INFO: Needs editem.xsd (exists)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/githu

2 undeclared scans
41 unused surfaces
2 unused zones


0 processing instructions encountered.
72 tags of which 0 with multiple namespaces written to ~/github/HuygensING/israels/report/2025-04-24/namespaces.txt
340 info line(s) written to ~/github/HuygensING/israels/report/2025-04-24/elements.txt
Refs written to ~/github/HuygensING/israels/report/2025-04-24/refs.txt
	resolvable:  670 in  670
	dangling:    312 in 2070
	ALL:         982 in 2740 
Ids written to ~/github/HuygensING/israels/report/2025-04-24/ids.txt
	referenced:  670 by  670
	non-unique:    0
	unused:     3457
	ALL:        4127 in 4127
lb-parent info written to ~/github/HuygensING/israels/report/2025-04-24/lb-parents.txt


True
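
The reference and id counts in the report above can be cross-checked with a little arithmetic. The sketch below copies the figures by hand from the output. The dictionary layout and the variable names are assumptions made for illustration, and the code does not read the report files or call the converter.

```python
# Figures copied by hand from the check report above.
# Each pair holds the two numbers of an "N in M" line.
# What the two numbers stand for is not stated in the output,
# so they are simply kept as (first, second).
refs = {
    "resolvable": (670, 670),
    "dangling": (312, 2070),
}
refs_all = (982, 2740)

first_total = sum(pair[0] for pair in refs.values())
second_total = sum(pair[1] for pair in refs.values())
refs_consistent = (first_total, second_total) == refs_all

# Ids: referenced + unused should equal ALL (non-unique is 0).
ids = {"referenced": 670, "unused": 3457}
ids_all = 4127
ids_consistent = sum(ids.values()) == ids_all

print(f"refs add up: {refs_consistent}")
print(f"ids add up:  {ids_consistent}")
```

Both checks hold for the numbers shown: the resolvable and dangling lines sum to the `ALL` line, and the referenced and unused ids sum to the total number of ids.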

# Step 2: Convert

In [7]:
Tei.good = True

In [8]:
Tei.task(convert=True, verbose=0)

Page model II with page nodes for pages started by pb elements without keeping the pb elements
Section model I
Processing instructions are treated
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-about.xsd
	round   1:  52 changes
118 identical override(s)
  0 changing override(s)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-letter.xsd
	round   1:  71 changes
172 identical override(s)
  2 changing override(s)
	artwork complex pure (added)
	eventName complex mixed (added)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-artworklist.xsd
	round   1:  21 changes
	round   2:   2 changes

True

# Step 3: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.
Most of it will be generated now, but there are ways to keep custom additions intact.

In [9]:
Tei.task(app=True)

True

# Step 4: Use the new dataset

The final proof that the conversion has worked is to load the data.
On the first load, several checks and pre-computations are performed.
Subsequent loads are much quicker.

In [4]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 13.0.6
64 features found and 0 ignored
  0.02s Dataset without structure sections in otext:no structure functions in the T-API
  0.09s All features loaded / computed - for details use TF.isLoaded()
  0.02s All additional features loaded - for details use TF.isLoaded()


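| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| folder | 3 | 39802.67 | 100 |
| artworklist | 1 | 6792.0 | 6 |
| listObject | 1 | 6754.0 | 6 |
| about | 4 | 5529.25 | 19 |
| bibliolist | 2 | 1144.5 | 2 |
| biolist | 1 | 1089.0 | 1 |
| file | 112 | 1066.14 | 100 |
| listPerson | 1 | 1024.0 | 1 |
| listBibl | 2 | 952.5 | 2 |
| letter | 104 | 837.7 | 73 |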
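
The coverage column can be reproduced from the other two columns. The following is a minimal sketch, assuming that coverage is the number of nodes times the average slots per node, divided by the total number of slots, and that the three folder nodes together span all slots. The figures are copied by hand from the overview above, and the function `coverage` is a hypothetical helper, not part of Text-Fabric.

```python
# (node type, number of nodes, average slots per node),
# copied by hand from the node type overview.
rows = [
    ("folder", 3, 39802.67),
    ("artworklist", 1, 6792.0),
    ("about", 4, 5529.25),
    ("file", 112, 1066.14),
    ("letter", 104, 837.7),
]

# Assumption: the folder nodes together cover every slot in the corpus.
total_slots = round(3 * 39802.67)


def coverage(n_nodes, slots_per_node, total):
    """Rounded percentage of all slots covered by the nodes of one type."""
    return round(100 * n_nodes * slots_per_node / total)


computed = {name: coverage(n, avg, total_slots) for name, n, avg in rows}

for name, pct in computed.items():
    print(f"{name:<12} {pct:>3}%")
```

Under these assumptions the computed percentages match the table for the rows included here. The letter row works out to about 104 × 837.7 ≈ 87121 slots, which agrees with the number of text segments reported by the WATM export and its test.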


# Step 5: Generate IIIF manifests

In [11]:
II = IIIF(Tei.teiVersion, A, Tei.reportPath, prod="dev", silent=False)

No cover directory: ~/github/HuygensING/israels/thumb/covers
Scan images taken from ~/github/HuygensING/israels/thumb
IIIF settings read from ~/github/HuygensING/israels/config/iiif.yml
Manifestlevel = file
Maximum dimensions: W = 2895 H = 2895
Average dimensions: W = 2791 H = 2207
Average deviation:  W =  186 H =  114
Rotation file not found: ~/github/HuygensING/israels/thumb/rotation_pages.tsv
Using facs file info file ~/github/HuygensING/israels/report/2025-04-24/facs.yml
Using facs mapping file ~/github/HuygensING/israels/report/2025-04-24/facsMapping.yml
Collections:
     about with    4 files and    0 pages
   letters with  104 files and  514 pages
 apparatus with    4 files and    0 pages


In [12]:
II.manifests()

Missing image files:
	pages:
		ii029:
			  4 x NOTFOUND
		ii045:
			  2 x NOTFOUND
		ii056:
			  2 x NOTFOUND
		ii80:
			  4 x dummy
			  2 x z2r3
			  2 x z2v4


	total occurrences of a missing file: 16
104 IIIF manifests with 216 items for 514 pages generated in ~/github/HuygensING/israels/static/2025-04-24/dev/manifests
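
The reported total of missing files can be checked against the per-entry counts. The sketch below transcribes the listing above by hand. Treating the first level (`ii029`, `ii80`, ...) as document identifiers and the second level as image names is an assumption, and the code does not call the `IIIF` class.

```python
# Missing image files, transcribed by hand from the output above.
# Outer keys: identifiers as printed (assumed to be documents).
# Inner keys: the image names that could not be found, with their counts.
missing = {
    "ii029": {"NOTFOUND": 4},
    "ii045": {"NOTFOUND": 2},
    "ii056": {"NOTFOUND": 2},
    "ii80": {"dummy": 4, "z2r3": 2, "z2v4": 2},
}

# Sum the occurrences per identifier, then overall.
per_document = {doc: sum(counts.values()) for doc, counts in missing.items()}
total_missing = sum(per_document.values())

print(per_document)
print(f"total occurrences of a missing file: {total_missing}")
```

The sum equals the total of 16 that the manifest generation reports.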


# Step 6: Convert to WATM

N.B. For documentation, click the WATM exporter link in the output cell.

In [13]:
WA = WATM(A, "tei", skipMeta=False, prod="preview", pageInfoDir=Tei.reportPath, extra=dict(), silent=False)
WA.makeText()
WA.makeAnno()
WA.writeAll()

conversion settings read from ~/github/HuygensING/israels/config/watm.yml
IIIF settings read from ~/github/HuygensING/israels/config/iiif.yml
Manifestlevel = file
textRepoLevel is section level 'file'
Top level exclusions:
	  1 node with folder=about
	  1 node with folder=apparatus
Excluded nodes: 37388 from 139841 nodes
Using facs file info file ~/github/HuygensING/israels/report/2025-04-24/facs.yml
Using facs mapping file ~/github/HuygensING/israels/report/2025-04-24/facsMapping.yml


[WATM exporter documentation](https://annotation.github.io/text-fabric-factory/tff/convert/watm.html)

No cover directory: ~/github/HuygensING/israels/scans/covers
Maximum dimensions: W = 8272 H = 8272
Average dimensions: W = 7973 H = 6301
Average deviation:  W =  535 H =  325
Skipping file node 124928 because folder=about
Skipping file node 124929 because folder=about
Skipping file node 124930 because folder=about
Skipping file node 124931 because folder=about
Skipping file node 125036 because folder=apparatus
Skipping file node 125037 because folder=apparatus
Skipping file node 125038 because folder=apparatus
Skipping file node 125039 because folder=apparatus


	   1x letter:canvasUrl: substituted filenotfound
		   1x ii80: dummy
	  14x page:canvasUrl: substituted filenotfound
		   4x ii029: NOTFOUND
		   2x ii045: NOTFOUND
		   2x ii056: NOTFOUND
		   4x ii80: dummy
		   2x ii80: z2r3
	  14x page:pageUrl: substituted filenotfound
		   4x ii029: NOTFOUND
		   2x ii045: NOTFOUND
		   2x ii056: NOTFOUND
		   4x ii80: dummy
		   2x ii80: z2r3


              folder [   4:     0] - [ 107:   686]
	Writing WATM ...
Writing preview data to ~/github/HuygensING/israels/watm/0.1.0/preview
Text file    4:      650 segments to ~/github/HuygensING/israels/watm/0.1.0/preview/text-4.tsv
Text file    5:      562 segments to ~/github/HuygensING/israels/watm/0.1.0/preview/text-5.tsv
Text file    6:      783 segments to ~/github/HuygensING/israels/watm/0.1.0/preview/text-6.tsv
Text file    7:      204 segments to ~/github/HuygensING/israels/watm/0.1.0/preview/text-7.tsv
Text file    8:      239 segments to ~/github/HuygensING/israels/watm/0.1.0/preview/text-8.tsv

Text files all:    87121 segments to 104 files
Anno file    1:   124624 annotations written to ~/github/HuygensING/israels/watm/0.1.0/preview/anno-1.tsv
Inherited annotations: 0
Anno files all:   124624 annotations to 1 file
Slot mapping written to ~/github/HuygensING/israels/watm/0.1.0/preview/pos2node.tsv
Node mapping written to ~/github/HuygensING/israels/watm/0.1.0/preview/anno

# Step 7: Test the WATM against the TF

In [24]:
WA.error = False
WA.testAll()

	Testing WATM ...
Excluded nodes: 37388 from 139841 nodes
Testing the text ...
	TF:  87121
	WA:  87121
OK - whether the amounts of tokens agree
	TF: Isaac Israëls aan Jo ... s from letter 082.  
	WA: Isaac Israëls aan Jo ... s from letter 082.  
OK - whether the text is the same
Testing the elements ...
	TF:  12463
	WA:  12463
OK - whether the amounts of elements and nodes agree
Testing the processing instructions ...
	TF:      0
	WA:      0
OK - whether the amounts of processing instructions agree
Testing the element/pi annotations ...
	12463 element/pi annotations
	Element      :  12463 x
	Pi           :      0 x
	Good name    :  12463 x
	Wrong name   :      0 x
	Good target  :  12463 x
	Wrong target :      0 x
	Unmapped     :      0 x
OK - whether all element/pi annotations have good bodies
OK - whether all element/pi annotations have good targets
Testing the attributes ...
	37262 attribute values
	Good:     37262 x
	Wrong:        0 x
OK - whether annotations are consistent with fea