# Convert from TEI to TF to WATM

First we convert the corpus TEI to TF and then the TF to WATM.

This notebook is bare: it has no explanations, no illustrations, and no checks.
For more documentation, try any of the following variants:

# Step by step

Below you can inspect all the steps of the conversion:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.core.files import fileCopy, dirContents, expanduser as ex
from tff.convert.tei import TEI
from tff.convert.addnlp import NLPipeline
from tff.convert.watm import WATM
from tff.convert.iiif import IIIF
from tff.convert.scans import Scans
from tf.advanced.helpers import dm

## Step 0: Scan ingest

To convert a directory of `.jpf` files to `.jpg` files, run:

```
cd directory
mogrify -format jpg *.jpf
```
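
After the conversion it can be worth confirming that every `.jpf` file has a `.jpg` counterpart. The following is a minimal sketch using only the standard library. The function name `unconverted_scans` is hypothetical, and the sketch assumes that the converted files sit next to the originals under the same base name.

```python
import tempfile
from pathlib import Path


def unconverted_scans(directory):
    """Return the base names of `.jpf` files that have no `.jpg` sibling."""
    directory = Path(directory)
    jpf = {p.stem for p in directory.glob("*.jpf")}
    jpg = {p.stem for p in directory.glob("*.jpg")}
    return sorted(jpf - jpg)


# Demo on a throwaway directory: "b.jpf" has not been converted yet.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpf", "a.jpg", "b.jpf"):
        (Path(tmp) / name).touch()
    demo_result = unconverted_scans(tmp)

print(demo_result)
```

An empty list means that no `.jpf` file was left behind.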

In [3]:
SC = Scans(verbose=1, force=False)

Working in repository HuygensING/israels in back-end github
Source dir = ~/github/HuygensING/israels
imageprep settings read from ~/github/HuygensING/israels/config/scans.yml


In [4]:
SC.process(force=False)

Initialized ~/github/HuygensING/israels/scanInfo
Initialized ~/github/HuygensING/israels/thumb
illustrations:
	Already present: thumbnails (illustrations)
	Already present: sizes file thumbnails (illustrations)
		Get sizes of 73 originals (illustrations)
			100% done
	Copied sizes_illustrations file to scanInfo
		Get colorspaces of 73 originals (illustrations)
			100% done
		Get colorspaces of 73 originals (illustrations)
			100% done
	originals: 73
	thumbnails: 0
logo:
pages:
	Already present: thumbnails (pages)
	Already present: sizes file thumbnails (pages)
		Get sizes of 264 originals (pages)
			 38% done
			 76% done
			100% done
	Copied sizes_pages file to scanInfo
		Get colorspaces of 264 originals (pages)
			 38% done
			 76% done
			100% done
		Get colorspaces of 264 originals (pages)
			 38% done
			 76% done
			100% done
	originals: 263
	thumbnails: 0


In [5]:
# SC.process(force=True)

Copy the scan info files to the `scans` directory of the `israels-settings` repo.

In [5]:
scanDir = SC.scanDir
destDir = ex("~/gitlab.huc.knaw.nl/eDITem/israels-settings/scans")

for file in dirContents(scanDir)[0]:
    fileCopy(f"{scanDir}/{file}", f"{destDir}/{file}")
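
If the `tf.core.files` helpers are not at hand, the same copy step can be approximated with the standard library. This is a sketch under assumptions: `copy_files` is a hypothetical name, it reads `dirContents(scanDir)[0]` as "the plain files in the directory", and it creates the destination directory if it is missing, which may or may not match what `fileCopy` does.

```python
import shutil
import tempfile
from pathlib import Path


def copy_files(src_dir, dest_dir):
    """Copy the regular files (not subdirectories) of src_dir into dest_dir."""
    src = Path(src_dir).expanduser()
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file():
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied


# Demo with throwaway directories; the file name is only an illustration.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "scanInfo"
    (src / "subdir").mkdir(parents=True)
    (src / "sizes_pages.tsv").write_text("file\twidth\theight\n")
    dest = Path(tmp) / "settings" / "scans"
    demo_copied = copy_files(src, dest)
    demo_exists = (dest / "sizes_pages.tsv").is_file()

print(demo_copied, demo_exists)
```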

You might also want to do the following.

### push the changes in israels-settings

```
cd ~/gitlab.huc.knaw.nl/eDITem/israels-settings
git add --all .
git commit -m "updated image info"
git push
```

### sync scans to surfdrive

```
k
kset iiif-prod
app scans ~/github/HuygensING israels up
```

Follow the instructions at the prompt to proceed.

### sync scans to iiif-prod server (production)

Assuming `kset iiif-prod` is still in force:

```
app scans ~/github/HuygensING israels iiif
```

After this, you need to empty the server cache:

```
ksh
cd /var/cache/cantaloupe
rm -rf info image
```

### sync scans to iiif-dev server (development for preview)

```
k
kset iiif-dev
app scans ~/github/HuygensING israels iiif
```

After this, you need to empty the server cache:

```
ksh
cd /var/cache/cantaloupe
rm -rf info image
```

### run the preview workflow again

SSH into the peen preview machine. The command that I (Dirk) have configured is:

```
ssh peenpreview
```

Go to the peen directory by running an `app` command:

```
app version
```

(Check whether it reports that you are working in the preview setting.)

Then start a shell in the editem container:

```
app sh
```

In that shell (you are already in the /app directory), run

```
./workflow.sh israels
```

After you see the final OK message:

```
exit
```

### run the workflow on the pub machine again

This works exactly the same, but in another VM, which will tell you that you are in the publication setting:

```
ssh peenpub
```

```
app version
app sh
./workflow.sh israels
exit
```



# Step 1: Check

In [26]:
Tei = TEI(verbose=-1, tei="2025-07-09", tf="0.3.0")

In [27]:
Tei.task(check=True, verbose=1, validate=True, carryon=True)

TEI to TF checking: ~/github/HuygensING/israels/tei/2025-07-09 => ~/github/HuygensING/israels/report/2025-07-09
Processing instructions are treated
XML validation will be performed
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-about.xsd
	round   1:  52 changes
118 identical override(s)
  0 changing override(s)
INFO: Needs editem.xsd (exists)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-letter.xsd
	round   1:  71 changes
172 identical override(s)
  2 changing override(s)
	artwork complex pure (added)
	eventName complex mixed (added)
INFO: Needs editem.xsd (exists)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/githu

3 undeclared scans
46 unused surfaces
1 unused zone


0 processing instructions encountered.
72 tags of which 0 with multiple namespaces written to ~/github/HuygensING/israels/report/2025-07-09/namespaces.txt
365 info line(s) written to ~/github/HuygensING/israels/report/2025-07-09/elements.txt
Refs written to ~/github/HuygensING/israels/report/2025-07-09/refs.txt
	resolvable:  664 in  664
	dangling:    299 in 2061
	ALL:         963 in 2725 
Ids written to ~/github/HuygensING/israels/report/2025-07-09/ids.txt
	referenced:  664 by  664
	non-unique:    0
	unused:     3440
	ALL:        4104 in 4104
lb-parent info written to ~/github/HuygensING/israels/report/2025-07-09/lb-parents.txt


True
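
The reference counts in the report above can be sanity-checked with a little arithmetic: in each column, the `resolvable` and `dangling` numbers should add up to `ALL`. The sketch below parses the three summary lines as they appear in the output. The helper name `parse_counts` is hypothetical, and what the two columns stand for (presumably distinct references and total occurrences) is an assumption.

```python
import re

# The three summary lines, copied from the report output above.
REPORT = """\
\tresolvable:  664 in  664
\tdangling:    299 in 2061
\tALL:         963 in 2725
"""

LINE = re.compile(r"^\s*([\w-]+):\s+(\d+)\s+in\s+(\d+)\s*$", re.MULTILINE)


def parse_counts(text):
    """Map each label to its pair of counts, e.g. 'dangling' -> (299, 2061)."""
    return {label: (int(a), int(b)) for label, a, b in LINE.findall(text)}


counts = parse_counts(REPORT)

# Add the two categories column by column and compare with the ALL line.
parts = [counts["resolvable"], counts["dangling"]]
totals = tuple(sum(column) for column in zip(*parts))

print(totals, totals == counts["ALL"])
```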

# Step 2: Convert

In [28]:
Tei.good = True

In [30]:
Tei.task(convert=True, verbose=0)

Page model II with page nodes for pages started by pb elements without keeping the pb elements
Section model I
Processing instructions are treated
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-about.xsd
	round   1:  52 changes
118 identical override(s)
  0 changing override(s)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-letter.xsd
	round   1:  71 changes
172 identical override(s)
  2 changing override(s)
	artwork complex pure (added)
	eventName complex mixed (added)
Analysing ~/github/annotation/text-fabric-factory/tff/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/israels/schema/editem-artworklist.xsd
	round   1:  21 changes
	round   2:   2 changes

True

# Step 3: Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.
Most of it will be generated now, but there are ways to keep custom additions intact.

In [31]:
Tei.task(app=True)

App updated


True

# Step 4: Use the new dataset

The final proof that the conversion has worked is to load the data.
On first load, several checks and pre-computations are performed.
Subsequent loads are much quicker.

In [32]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 13.0.10
68 features found and 0 ignored
   |     0.04s T otype                from ~/github/HuygensING/israels/tf/0.3.0
   |     0.43s T oslots               from ~/github/HuygensING/israels/tf/0.3.0
  0.47s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.00s T folder               from ~/github/HuygensING/israels/tf/0.3.0
   |     0.01s T chunk                from ~/github/HuygensING/israels/tf/0.3.0
   |     0.00s T file                 from ~/github/HuygensING/israels/tf/0.3.0
   |     0.24s T after                from ~/github/HuygensING/israels/tf/0.3.0
   |     0.27s T str                  from ~/github/HuygensING/israels/tf/0.3.0
   |      |     0.01s C __levels__           from otype, oslots, otext
   |      |     0.86s C __order__            from otype, oslots, __levels__
   |      |     0.03s C __rank__             from otype, __order__
   |      |     0.97s C __levUp__            from otype, oslots, __rank__
   |

Name,# of nodes,# slots / node,% coverage
folder,3,39404.0,100
about,4,5616.25,19
artworklist,1,5426.0,5
listObject,1,5388.0,5
bibliolist,2,1138.5,2
biolist,1,1101.0,1
file,111,1064.97,100
listPerson,1,1036.0,1
listBibl,2,946.5,2
letter,103,844.11,74


# Step 5: Generate IIIF manifests

In [17]:
II = IIIF(Tei.teiVersion, A, Tei.reportPath, prod="dev", silent=False)

No cover directory: ~/github/HuygensING/israels/None/covers
Scan images taken from ~/github/HuygensING/israels/None
IIIF settings read from ~/github/HuygensING/israels/config/iiif.yml
Manifestlevel = file
All generated urls are for a dev deployment on http:/localhost:8087


Size file not found: ~/github/HuygensING/israels/None/sizes_pages.tsv


Rotation file not found: ~/github/HuygensING/israels/None/rotation_pages.tsv
Using facs file info file ~/github/HuygensING/israels/report/2025-07-02/facs.yml
Using facs mapping file ~/github/HuygensING/israels/report/2025-07-02/facsMapping.yml
Collections:
     about with    4 files and    0 pages
     intro with    1 files and    0 pages (excluded in config)
   letters with  103 files and  510 pages
 apparatus with    4 files and    0 pages


In [18]:
II.manifests()

Directory with logos not found: ~/github/HuygensING/israels/None/logo
Missing image files:
	pages:
		ii001:
			  2 x b8564V2008_b
		ii002:
			  2 x b8579V2008v_vs_b
			  2 x b8579V2008r_b
		ii003:
			  2 x b8575V2008_b
		ii004:
			  2 x b8580V2008v_vs_b
			  2 x b8580V2008r_b
		ii005:
			  2 x b8581V2008v_vs_b
			  2 x b8581V2008r_b
		ii006:
			  2 x b8582V2008v_vs_b
			  2 x b8582V2008r_b
		ii007:
			  2 x b8574V2008_b
		ii008:
			  4 x b8565V2008_b
		ii009:
			  2 x b8629V2008r_b
			  2 x b8629V2008v_vs_b
		ii010:
			  4 x b8572V2008r_b
			  4 x b8572V2008v_vs_b
		ii011:
			  2 x b8570V2008r_b
			  2 x b8570V2008v_vs_b
		ii012:
			  4 x b8571V2008_b
		ii013:
			  2 x b8569V2008_b
		ii014:
			  2 x b8578V2008v_vs_b
			  2 x b8578V2008r_b
		ii015:
			  2 x b8568V2008r_b
			  2 x b8568V2008v_vs_b
		ii016:
			  2 x b8666V2008_b
		ii017:
			  4 x b8573V2008_b
		ii018:
			  4 x b8662V2008_b
		ii019:
			  2 x b8652V2008_b
		ii020:
			  2 x b8576V2008r_b
			  4 x b8576V2008v_vs_b
		ii021:
		

	total occurrences of a missing file: 510
103 IIIF manifests with 103 items for 510 pages generated in ~/github/HuygensING/israels/static/2025-07-02/dev/manifests
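
To see which files account for most of the missing images, the listing above can be tallied per file label. The following is a rough sketch that relies on the layout of the log as printed here (a label line ending in `:` followed by `N x name` lines). `tally_missing` is a hypothetical helper, not part of the IIIF converter, and the sample holds only the first three entries of the log.

```python
import re
from collections import Counter

# First few entries of the "Missing image files" listing above.
LOG = """\
\tpages:
\t\tii001:
\t\t\t  2 x b8564V2008_b
\t\tii002:
\t\t\t  2 x b8579V2008v_vs_b
\t\t\t  2 x b8579V2008r_b
\t\tii003:
\t\t\t  2 x b8575V2008_b
"""

ENTRY = re.compile(r"^\s*(\d+) x (\S+)\s*$")
LABEL = re.compile(r"^\s*(\S+):\s*$")


def tally_missing(log):
    """Sum the occurrence counts of missing images per preceding label."""
    per_label = Counter()
    current = None
    for line in log.splitlines():
        if m := LABEL.match(line):
            # Remember the most recent label; "pages:" is overwritten
            # by the first file label before any entry is counted.
            current = m.group(1)
        elif m := ENTRY.match(line):
            per_label[current] += int(m.group(1))
    return per_label


tally = tally_missing(LOG)
print(dict(tally), sum(tally.values()))
```

Run on the complete listing, the sum should match the `total occurrences of a missing file` line.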


# Step 6: Convert to WATM

N.B. For documentation, click the WATM link in the output cell.

In [19]:
WA = WATM(A, "tei", skipMeta=False, prod="preview", pageInfoDir=Tei.reportPath, extra=dict(), silent=False)
WA.makeText()
WA.makeAnno()
WA.writeAll()

conversion settings read from ~/github/HuygensING/israels/config/watm.yml
IIIF settings read from ~/github/HuygensING/israels/config/iiif.yml
Manifestlevel = file
textRepoLevel is section level 'file'
Top level exclusions:
	  1 node with folder=apparatus
Excluded nodes: 11443 from 162749 nodes
Using facs file info file ~/github/HuygensING/israels/report/2025-07-02/facs.yml
Using facs mapping file ~/github/HuygensING/israels/report/2025-07-02/facsMapping.yml


[WATM exporter documentation](https://annotation.github.io/text-fabric-factory/tff/convert/watm.html)

No cover directory: ~/github/HuygensING/israels/scans/covers
Maximum dimensions: W = 8272 H = 8272
Average dimensions: W = 7966 H = 6340
Average deviation:  W =  541 H =  370
Skipping file node 146416 because folder=apparatus
Skipping file node 146417 because folder=apparatus
Skipping file node 146418 because folder=apparatus
Skipping file node 146419 because folder=apparatus


	 103x letter:canvasUrl: substituted filenotfound
		   1x ii001: b8564V2008_b
		   1x ii002: b8579V2008v_vs_b
		   1x ii003: b8575V2008_b
		   1x ii004: b8580V2008v_vs_b
		   1x ii005: b8581V2008v_vs_b
		   1x ii006: b8582V2008v_vs_b
		   1x ii007: b8574V2008_b
		   1x ii008: b8565V2008_b
		   1x ii009: b8629V2008r_b
		   1x ii010: b8572V2008r_b
		   1x ii011: b8570V2008r_b
		   1x ii012: b8571V2008_b
		   1x ii013: b8569V2008_b
		   1x ii014: b8578V2008v_vs_b
		   1x ii015: b8568V2008r_b
		   1x ii016: b8666V2008_b
		   1x ii017: b8573V2008_b
		   1x ii018: b8662V2008_b
		   1x ii019: b8652V2008_b
		   1x ii020: b8576V2008r_b
		   1x ii021: b8577V2008_b
		   1x ii022: b8567V2008_b
		   1x ii023: b8588V2008_b
		   1x ii024: b8589V2008_b
		   1x ii025: b8657V2008_b
		   1x ii026: b8660V2008_b
		   1x ii027: b8656V2008_b
		   1x ii028: b8587-001V2008_b
		   1x ii029: b8566V2008_b
		   1x ii030: b8585V2008_b
		   1x ii031: b8584V2008_b
		   1x ii032: b8594-001V2008_b
		   1x ii033: b8586V

              folder [   0:     0] - [   3:   600]
              folder [   5:     0] - [ 107:  1227]
	Writing WATM ...
Writing preview data to ~/github/HuygensING/israels/watm/0.2.0/preview
Text file    0:    18691 segments to ~/github/HuygensING/israels/watm/0.2.0/preview/text-0.tsv
Text file    1:     2672 segments to ~/github/HuygensING/israels/watm/0.2.0/preview/text-1.tsv
Text file    2:      501 segments to ~/github/HuygensING/israels/watm/0.2.0/preview/text-2.tsv
Text file    3:      602 segments to ~/github/HuygensING/israels/watm/0.2.0/preview/text-3.tsv
Text file    4:    22203 segments to ~/github/HuygensING/israels/watm/0.2.0/preview/text-4.tsv
Text file  107:     1229 segments to ~/github/HuygensING/israels/watm/0.2.0/preview/text-107.tsv
Text files all:   131494 segments to 108 files
Anno file    1:   162765 annotations written to ~/github/HuygensING/israels/watm/0.2.0/preview/anno-1.tsv
Inherited annotations: 0
Anno files all:   162765 annotations to 1 file
Slot mapping

# Step 7: Test the WATM against the TF

In [24]:
WA.error = False
WA.testAll()

	Testing WATM ...
Excluded nodes: 37388 from 139841 nodes
Testing the text ...
	TF:  87121
	WA:  87121
OK - whether the amounts of tokens agree
	TF: Isaac Israëls aan Jo ... s from letter 082.  
	WA: Isaac Israëls aan Jo ... s from letter 082.  
OK - whether the text is the same
Testing the elements ...
	TF:  12463
	WA:  12463
OK - whether the amounts of elements and nodes agree
Testing the processing instructions ...
	TF:      0
	WA:      0
OK - whether the amounts of processing instructions agree
Testing the element/pi annotations ...
	12463 element/pi annotations
	Element      :  12463 x
	Pi           :      0 x
	Good name    :  12463 x
	Wrong name   :      0 x
	Good target  :  12463 x
	Wrong target :      0 x
	Unmapped     :      0 x
OK - whether all element/pi annotations have good bodies
OK - whether all element/pi annotations have good targets
Testing the attributes ...
	37262 attribute values
	Good:     37262 x
	Wrong:        0 x
OK - whether annotations are consistent with fea