# Convert from DOCX to TEI to TF to WATM

We convert Suriano DOCX to TEI to TF and then to WATM.

This notebook offers three ways to run the conversion. They all have the same outcome,
but they differ in how much detail of the conversion they expose.

These are the levels:

* **Debugging**: all the commands are run directly in Python; the intermediate data remains in memory and can be inspected;
* **Step by step**: one command for each main step of the conversion;
* **Express**: one single command on the command line for the complete conversion.

# Production or development

Most steps are unaffected by the production/development setting.

In the first steps of the pipeline (*ingest* and *scan processing*) we prepare both the dev and the prod data.

The intermediate steps (*from DOCX to TEI*, *from TEI to TF*, *mark named entities*) are identical for prod and dev.

Only for the last steps (*convert TF to WATM*, *generate IIIF manifests*, *deploy to k8s*) is there a distinction between prod and dev.

For these steps we have commented out the line that does the dev version.
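The division described above can be summarized in a small lookup table. This is only a sketch of the text in this section: the names `STEPS` and `depends_on_prod` are hypothetical and are not part of the pipeline code.

```python
# Hypothetical summary of how each pipeline step relates to the prod/dev setting.
#   "both":  the step prepares the dev and the prod data in one go
#   "same":  the step is identical for dev and prod
#   "split": the step is run for either dev or prod
STEPS = {
    "ingest": "both",
    "scan processing": "both",
    "from DOCX to TEI": "same",
    "from TEI to TF": "same",
    "mark named entities": "same",
    "convert TF to WATM": "split",
    "generate IIIF manifests": "split",
    "deploy to k8s": "split",
}


def depends_on_prod(step):
    """Tell whether a step needs an explicit choice between prod and dev."""
    return STEPS[step] == "split"


# List the steps for which the prod/dev choice matters.
print([step for step in STEPS if depends_on_prod(step)])
```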

# Requirements

* zsh as command line shell (as in macOS);
* access to suitable k8s clusters, streamlined by the 
  [k-suite](https://code.huc.knaw.nl/tt/smart-k8s/-/blob/main/docs/k-suite.md);
* [Pandoc](https://pandoc.org);
* [ImageMagick](https://imagemagick.org);
* [Python](https://www.python.org) (3.12 or higher) with additional pip-installable modules:
  * text-fabric
  * docx2python
  * openpyxl
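A quick way to see which requirements are in place is sketched below. It is not part of the pipeline. The command names are assumptions (ImageMagick installs `magick` or `convert` depending on its version), and so are the import names, except `tf`, which this notebook imports itself, and `docx2python`, which is the package name used in the `pip install` instruction of Step 3. Access to the k8s clusters is not checked.

```python
import importlib.util
import shutil
import sys

# Command names are assumptions; any one of the listed names counts as present.
COMMANDS = {
    "zsh": ["zsh"],
    "Pandoc": ["pandoc"],
    "ImageMagick": ["magick", "convert"],
}

# Import names of the pip-installable modules (assumed, see the text above).
MODULES = {
    "text-fabric": "tf",
    "docx2python": "docx2python",
    "openpyxl": "openpyxl",
}


def check_requirements():
    """Return a dict that maps each requirement to True (found) or False."""
    report = {"Python >= 3.12": sys.version_info >= (3, 12)}
    for label, names in COMMANDS.items():
        report[label] = any(shutil.which(name) is not None for name in names)
    for label, module in MODULES.items():
        report[label] = importlib.util.find_spec(module) is not None
    return report


for label, found in check_requirements().items():
    print(f"{'ok     ' if found else 'MISSING'} {label}")
```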

# Declare the version

Always set the version before running any cell in this notebook!

In [1]:
VERSION = "1.0.3"
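Because several steps depend on this value, it can help to check it before going on. The sketch below is hypothetical and not part of the pipeline. It assumes that versions always have the form `MAJOR.MINOR.PATCH`, and it derives the name of the TF version with entity nodes by appending `e`, as in the `1.0.3e` that the entity step reports.

```python
import re


def check_version(version):
    """Return the version if it looks like MAJOR.MINOR.PATCH, else raise."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"Unexpected version string: {version!r}")
    return version


def entity_version(version):
    """Name of the TF version that also holds the entity nodes."""
    return f"{check_version(version)}e"


print(entity_version("1.0.3"))
```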

# In debugging mode

Now we dig a bit deeper, and do all the steps while keeping the program in memory.
This makes it possible to inspect all intermediate results.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from tf.app import use
from tff.convert.tei import TEI
from tff.convert.watm import WATM
from tff.convert.iiif import IIIF
from tf.advanced.helpers import dm

from processscans import Scans
from processdocs import TeiFromDocx
from processhelpers import setStage, nerMeta, NER_NAME, SOURCEBASE, REPORT_PAGESDIR, REPORT_TEIDIR

## Step 1: Scan ingest

In [6]:
SC = Scans(silent=False, force=False)

In [8]:
SC.ingest(dry=False)

	Already ingested pages. Remove ~/github/HuygensING/suriano/scans/pages or pass --force to ingest again


## Step 2: Scan processing

In [9]:
SC.process()

Already present: sizes file originals (covers)
Already present: sizes file originals (pages)
	Convert 48 originals to thumbnails (covers)
		100% done
	Get sizes of 48 thumbnails (covers)
		100% done
	Convert 8904 originals to thumbnails (pages)
		 11% done
		 22% done
		 34% done
		 45% done
		 56% done
		 67% done
		 79% done
		 90% done
		100% done
	Get sizes of 8904 thumbnails (pages)
		 11% done
		 22% done
		 34% done
		 45% done
		 56% done
		 67% done
		 79% done
		 90% done
		100% done


## Step 3: From DOCX to TEI

You might need to install `docx2python` first:

```
pip install docx2python
```

In [7]:
TFD = TeiFromDocx(silent=False)

In [8]:
TFD.task("pandoc")

DOCX => simple TEI per filza ...
	02.docx ... uptodate
	03.docx ... uptodate
	04.docx ... uptodate
	05.docx ... uptodate
	06.docx ... uptodate
	07.docx ... uptodate
	08.docx ... uptodate
	09.docx ... uptodate
	10.docx ... uptodate
	11.docx ... uptodate
	12.docx ... uptodate


In [9]:
TFD.task("headers")

DOCX => headers ...
	02.docx
	03.docx
	04.docx
	05.docx
	06.docx
	07.docx
	08.docx
	09.docx
	10.docx
	11.docx
	12.docx
	OK: All headers are OK
Angelo              : 1181 pages in 11 filzas
Cristina            : 1034 pages in 10 filzas
Federica            : 684 pages in  7 filzas
Filippo             : 1084 pages in  9 filzas
Flavia              : 1282 pages in 11 filzas
Giorgia             : 934 pages in  9 filzas
Renzo               :  56 pages in  1 filza 
Ruben               : 1162 pages in 10 filzas
Vera                : 966 pages in 10 filzas
Vera, Federica      : 210 pages in  1 filza 


In [10]:
TFD.task("tei")

Collecting transcribers ...
Collecting page scans ...
  0 x error
8766 x good
Collecting excel metadata ...
	found metadata for 725 letters
simple TEI per filza => enriched TEI per letter ...
	02.xml
	03.xml
	04.xml
	05.xml
	06.xml
	07.xml
	08.xml
	09.xml
	10.xml
	11.xml
	12.xml
Translated italian editorial phrases (218 x 14157)
Metadata in summary file corresponds to transcribed letters
Pages with    transcription and    scan:       8766
Pages with    transcription and missing scan:     0
Pages with    transcription and no scan:          0
Pages with no transcription and    scan:          0
See ~/github/HuygensING/suriano/report/pages/pagescan.tsv


## Step 4: From TEI to TF

### Check the validity of the TEI

In [11]:
Tei = TEI(verbose=-1, sourceBase=SOURCEBASE, reportDir=REPORT_TEIDIR, tei="", tf=VERSION)

In [12]:
Tei.task(check=1, verbose=1, validate=True)

TEI to TF checking: ~/github/HuygensING/suriano/datasource/tei => ~/github/HuygensING/suriano/report/tei
Processing instructions are ignored
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
INFO: Needs dcr.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/suriano/datasource/schema/suriano.xsd
	round   1:  68 changes
137 identical override(s)
  1 changing override(s)
	metamark mixed ==> pure
Section model I
	Start folder 02
	Start folder 03
	Start folder 04
	Start folder 05
	Start folder 06
	Start folder 07
	Start folder 08
	Start folder 09
	Start folder 10
	Start folder 11
	Start folder 12
725 suriano file(s) ...
	Validating ...
	Making inventory ...

584 tag(s) type info written to ~/github/HuygensING/suriano/report/tei/types.txt
Validation OK
0 pagebreaks without facs attribute.
8764 pagebreaks encountered.
8764 distinct scans 

True

### Convert the data

In [13]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Line model II with ln nodes for lines between lb elements
Page model II with page nodes for pages started by pb elements  keeping the pb elements
Section model I
Processing instructions are ignored
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/suriano/datasource/schema/suriano.xsd
	round   1:  68 changes
137 identical override(s)
  1 changing override(s)
	metamark mixed ==> pure
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
	Start folder 02:
		   1 suriano                                001.xml                                           
		   2 suriano                                002.xml                                           
		   3 suriano                                003.xml                                  

True

### Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [14]:
Tei.task(app=True)

True

### Use the new dataset

The final proof that the conversion has worked is to load the data.
The first time the data is loaded, several checks and pre-computations are performed.
After that, loading will be much quicker.

In [15]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.6.4
37 features found and 0 ignored
   |     0.48s T otype                from ~/github/HuygensING/suriano/tf/1.0.3
   |     6.56s T oslots               from ~/github/HuygensING/suriano/tf/1.0.3
  7.04s Dataset without structure sections in otext:no structure functions in the T-API
   |     3.46s T after                from ~/github/HuygensING/suriano/tf/1.0.3
   |     4.12s T str                  from ~/github/HuygensING/suriano/tf/1.0.3
   |     0.08s T chunk                from ~/github/HuygensING/suriano/tf/1.0.3
   |     0.00s T folder               from ~/github/HuygensING/suriano/tf/1.0.3
   |     0.00s T file                 from ~/github/HuygensING/suriano/tf/1.0.3
   |      |     0.11s C __levels__           from otype, oslots, otext
   |      |       16s C __order__            from otype, oslots, __levels__
   |      |     0.39s C __rank__             from otype, __order__
   |      |       16s C __levUp__            from otype, oslots, __rank__
   | 

Name,# of nodes,# slots / node,% coverage
folder,11,158047.18,100
file,725,2397.96,100
body,725,2206.0,92
text,725,2206.0,92
div,4148,737.03,176
table,243,217.58,3
teiHeader,725,191.96,8
page,8764,157.79,80
correspDesc,725,125.82,5
sourceDesc,725,51.06,2


## Step 5: Mark named entities

First stage: we use the human-crafted triggers as is.

In the separate notebook [nerCorrect](nerCorrect.ipynb) you can then collect
additional spelling variants of the triggers.

Some of these variants are not fit to be used as triggers. You have to make a list of the variants that are not valid triggers
and store them in the file `persons-notmerged.txt`, one per line.

Then you can merge the new valid variants automatically with the human-crafted triggers, and
the result is a new spreadsheet, `persons-merged`.

Second stage: use the merged spreadsheet.
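The idea behind the merge can be sketched in plain Python. This is not how `mergeTriggers()` is implemented: the function names, the row numbers and the choice of which variant is rejected are made up for illustration. Only the rule comes from the text above: a variant becomes a trigger for the entity whose trigger it resembles, unless it is listed as not valid.

```python
def read_excluded(lines):
    """Collect the rejected variants: one per line, blank lines ignored."""
    return {line.strip() for line in lines if line.strip()}


def merge_triggers(triggers, variants, excluded):
    """Add the accepted variants to the existing triggers.

    triggers: dict that maps a trigger to the row of its entity in the sheet
    variants: iterable of (variant, candidate) pairs; candidate is a trigger
    excluded: set of variants that must not become triggers
    """
    merged = dict(triggers)
    for variant, candidate in variants:
        if variant in excluded or candidate not in triggers:
            continue
        # An existing trigger keeps the row it already has.
        merged.setdefault(variant, triggers[candidate])
    return merged


# Illustrative data: the row numbers and the exclusion are hypothetical.
triggers = {"Adrian Ploos": 1, "re Ferdinando": 2}
variants = [("Adriano Ploos", "Adrian Ploos"), ("a Ferdinando", "re Ferdinando")]
excluded = read_excluded(["a Ferdinando\n", "\n"])

merged = merge_triggers(triggers, variants, excluded)
print(merged)
```

In a real run the lines would come from `persons-notmerged.txt`, for example by passing an open file to `read_excluded`.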

### Stage 1

We need to copy the relevant spreadsheet from the `datasource/metadata` directory to the `ner` directory where TF expects it.

In [16]:
nerStage = 1
nerName, nerOutFile = setStage(nerStage)

Stage 1: working with sheet persons


In [17]:
NE = A.makeNer(caseSensitive=False, silent=False)

normalizeChars() loaded from ~/github/HuygensING/suriano/ner/code.py


In [18]:
NE.setTask(f".{nerName}", force=True)

Annotation set 🧾 persons has 10279 annotations


In [19]:
NE.triggerInterference()

Looking up 28 potential interferences in 2 passes over the corpus ..
2 potential conflicting trigger pairs with 2 conflicts
----------
different rows (0 pairs)
----------
Diagnostic trigger interferences written to ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons/interference.txt


In [20]:
NE.reportHits(showNoHits=True)

No slot is covered by more than one trigger


Triggers without hits: 106x:


Looking up 106 triggers in 2 passes over the corpus ..
Entities targeted:            818
Triggers searched for:       1666
Triggers without hits:        106
 - completely covered:        106
 - missing hits:                0
Triggers with hits:          1560
Total hits:                 10426

All hits in report file:      ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons/hits.tsv
Triggers by slot in file:     ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons/triggerBySlot.tsv


In [21]:
NE.makeSheetOfSingleTokens()
NE.setTask(f".{NER_NAME}-single", caseSensitive=False, force=True)
NE.reportHits(showNoHits=True)

Annotation set 🧾 persons-single has 502815 annotations
No slot is covered by more than one trigger


Triggers without hits: 14x:


Looking up 14 triggers in 1 pass over the corpus .
Entities targeted:           1523
Triggers searched for:       1523
Triggers without hits:         14
 - completely covered:         14
 - missing hits:                0
Triggers with hits:          1509
Total hits:                 502726

All hits in report file:      ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons-single/hits.tsv
Triggers by slot in file:     ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons-single/triggerBySlot.tsv


In [22]:
NE.setTask(f".{NER_NAME}", caseSensitive=False)

Annotation set 🧾 persons has 10426 annotations


### Prepare for stage 2: search variants

In [23]:
D = NE.variantDetection()

Overview of names by length:
  10 tokens:   2 names e.g.:
      cavalier inglese , et il fratello del vescovo di Londra
      signor d ’ Aspri fratello dell ’ ambasciator dei Stati
  9 tokens:   2 names e.g.:
      conte d ’ Henin , chiamato duca di Bornenvil
      Proveditor general dell ’ arm [ i ] Priuli
  8 tokens:  15 names e.g.:
      Alessandro , Giacomo et Pietr ’ Antonio Guadagni
      ambasciator del signor Granduca nella corte di Praga
      ambasciator di Francia residente appresso li serenissimi arciduchi
      doi altri suoi parenti cognato , et cugino
      doi altri un maschio , et una femina
  7 tokens:  21 names e.g.:
      [ duc ] a d ’ Angauleme
      ambasciator di Spagna residente appresso il Christianissimo
      Ambasciator straordinario che venne ultimamente di Spagna
      Ambasciatore , che dal re di Bohemia
      capitan scocese fratello di quel capitano Seiton
  6 tokens:  45 names e.g.:
      al luogotenente di monsignor di Sciatiglione
      ambasciator G

In [24]:
D.prepare()

Alphabet written to ~/github/HuygensING/suriano/_temp/ner/analyticcl/alphabet.tsv
Text written to ~/github/HuygensING/suriano/_temp/ner/analyticcl/text.txt - 8582308 characters
  0.00s Collecting the triggers for the lexicon
  0.00s 1498 triggers collected
  346 x conte di Mansfelt
  336 x Pasini
  236 x Bernvel
  236 x Spinola
  212 x principe d’Oranges
  211 x re di Spagna
  210 x Carleton
  208 x principe Mauritio
  193 x duca di Savoia
  180 x serenissimi arciduchi
  ...
    1 x visconte di Duncaster
    1 x visconte di Gantes fratello del principe di Pinoe
    1 x Vuandermil
    1 x Wandernoot
    1 x Wassenhovven
    1 x Wasson[ hoven]
    1 x Weesterbeeck
    1 x Weimar il maggiore
    1 x zelandese nominato Cornelio
    1 x Zeno
    1498 lexicon length
Lexicon written to ~/github/HuygensING/suriano/_temp/ner/analyticcl/lexicon.tsv
Set up analiticcl


Computing anagram values for all items in the lexicon...
 - Found 1498 instances
Adding all instances to the index...
 - Found 1493 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 3 anagrams of length 3
 - Found 13 anagrams of length 4
 - Found 23 anagrams of length 5
 - Found 38 anagrams of length 6
 - Found 49 anagrams of length 7
 - Found 41 anagrams of length 8
 - Found 41 anagrams of length 9
 - Found 29 anagrams of length 10
 - Found 47 anagrams of length 11
 - Found 61 anagrams of length 12
 - Found 70 anagrams of length 13
 - Found 90 anagrams of length 14
 - Found 106 anagrams of length 15
 - Found 103 anagrams of length 16
 - Found 102 anagrams of length 17
 - Found 100 anagrams of length 18
 - Found 83 anagrams of length 19
 - Found 72 anagrams of length 20
 - Found 66 anagrams of length 21
 - Found 53 anagrams of length 22
 - Found 34 anagrams of length 23
 - Found 36 anagrams of length 24
 - Found 25 anagrams of length 25
 - Found 22 anagram

In [25]:
%%time

!date

D.search(start=None, end=None, force=1)

Wed Mar 12 10:22:16 CET 2025
 8582308 text  length
       0 offset in complete text
  0.00s Read previously computed variants of the lexicon words ...
  8.34s  1375523 raw   matches
  8.34s Filter variants of the lexicon words ...
  9.35s      697 filtered matches
697 variants found
CPU times: user 9.06 s, sys: 381 ms, total: 9.44 s
Wall time: 9.55 s


In [26]:
D.listVariants(end=10)
D.listVariants(start=-10)

   i | variant                   | score | candidate
---- | ------------------------- | ----- | -------------------------
   0 | a Ferdinando              |  0.87 | re Ferdinando
   1 | a suo figliolo            |  0.87 | un suo figliolo
   2 | abbot Moronato            |  0.85 | abbate Moronato
   3 | Adrian Plois              |  0.88 | Adrian Ploos
   4 | Adriano Ploos             |  0.85 | Adrian Ploos
   5 | Adrien Ploos              |  0.86 | Adrian Ploos
   6 | agente del Kerkoven       |  0.89 | agente del Kerckoven
   7 | al figliolo               |  0.88 | il figliolo
   8 | al suo figlio             |  0.89 | il suo figlio
   9 | Alag]ambe                 |  0.84 | Alagambe
---- | ------------------------- | ----- | -------------------------
697 variants found and written to ~/github/HuygensING/suriano/_temp/ner/analyticcl/variants.tsv
   i | variant                   | score | candidate
---- | ------------------------- | ----- | -------------------------
 -10 | Vorberghen   

In [27]:
D.mergeTriggers()

218 excluded variants found in ~/github/HuygensING/suriano/ner/specs/persons-notmerged.txt
213 variants excluded as trigger
row  trigger                                  ~> variant                                  = occurences
8    Guglielmo di Nassau                      ~> Guglielmo de Nassau                      = 09x1
8    Guglielmo di Nassau                      ~> Guglielmo di Nansau                      = 08x1
8    conte Guglielmo suo figlio               ~> conte Guglielmo suo figliolo             = 08x1
13   vescovo di Paterborn                     ~> vescovo di Paterbon                      = 07x1
14   duca di Nevers                           ~> duca di Nivers                           = 11x1
15   Massimiliano                             ~> Massimilia                               = 11x1
18   duca di Guisa                            ~> duca di Guesa                            = 07x1
19   cardinal Lodovisio                       ~> card(ina)l Lodovisio                     = 10

### Stage 2

In [28]:
nerStage = 2
nerName, nerOutFile = setStage(nerStage)

Stage 2: working with sheet persons-merged


In [29]:
NE = A.makeNer(caseSensitive=False, silent=False)

normalizeChars() loaded from ~/github/HuygensING/suriano/ner/code.py


In [30]:
NE.setTask(f".{nerName}", force=True)

Annotation set 🧾 persons-merged has 11821 annotations


In [31]:
NE.triggerInterference()

Looking up 29 potential interferences in 2 passes over the corpus ..
2 potential conflicting trigger pairs with 2 conflicts
----------
different rows (0 pairs)
----------
Diagnostic trigger interferences written to ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons-merged/interference.txt


In [32]:
NE.reportHits(showNoHits=True)

No slot is covered by more than one trigger


Triggers without hits: 173x:


Looking up 173 triggers in 2 passes over the corpus ..

durante di prignì                        ()          : 03@011:9 x 1, 03@018:11 x 1
smit                                     (02)        : 06@028:132 x 2, 06@028:134 x 1, 06@028:175 x 1, 06@034:74 x 1
signor conte gioan                       (02-05)     : 06@034:160 x 1
ambasciator de francia                   (03)        : 12@030:25 x 1
ambasciatore di franza                   (03)        : 02@012:15 x 1, 08@059:18 x 1
ambasciatori francia                     (03)        : 04@036:9 x 1
ambasciator inglese straordinario        (04)        : 08@067:14 x 1
conte gio ernesto                        (06, 08, 09, 11): 04@026:7 x 1
figliolo maggiore                        (08)        : 04@016:7 x 1, 11@100:23 x 1
al suo figlio                            (09)        : 04@019:8 x 1
ambasciatore di danimarca                (10)        : 12@036:93 x 2
presidenti di gheldria                   (10)        : 04@031:13 x 1
deputati della rocella                   (11)        : 10@058:27 x 1, 10@098:13


Entities targeted:            819
Triggers searched for:       2142
Triggers without hits:        173
 - completely covered:        156
 - missing hits:               17
Triggers with hits:          1969
Total hits:                 11962

All hits in report file:      ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons-merged/hits.tsv
Triggers by slot in file:     ~/github/HuygensING/suriano/_temp/ner/1.0.3/.persons-merged/triggerBySlot.tsv


In [33]:
nerMeta(*NE.getMeta(), silent=False)

In [34]:
NE.bakeEntities()

Entity consolidation for 11962 entity occurrences into version 1.0.3e
 11962 entity occurrences
   786 distinct entities
  0.00s Creating a dataset with entity nodes ...
  0.00s preparing and checking ...
  0.00s Feature overview: 34 for nodes; 2 for edges; 1 configs; 9 computed
   |     3.90s done
   |   Delete types: t                   : keep:   shift  nodes       1-1738519 to         1-1738519
   |   Delete types: author              : keep:   shift  nodes 1738520-1739244 to   1738520-1739244
   |   Delete types: bibl                : keep:   shift  nodes 1739245-1739969 to   1739245-1739969
   |   Delete types: biblScope           : keep:   shift  nodes 1739970-1740694 to   1739970-1740694
   |   Delete types: body                : keep:   shift  nodes 1740695-1741419 to   1740695-1741419
   |   Delete types: cell                : keep:   shift  nodes 1741420-1755754 to   1741420-1755754
   |   Delete types: chunk               : keep:   shift  nodes 1755755-1802724 to   1755755-1

True

We load the new data:

In [35]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.6.4
41 features found and 0 ignored
   |     0.48s T otype                from ~/github/HuygensING/suriano/tf/1.0.3e
   |     6.13s T oslots               from ~/github/HuygensING/suriano/tf/1.0.3e
  6.62s Dataset without structure sections in otext:no structure functions in the T-API
   |     3.40s T after                from ~/github/HuygensING/suriano/tf/1.0.3e
   |     4.04s T str                  from ~/github/HuygensING/suriano/tf/1.0.3e
   |     0.08s T chunk                from ~/github/HuygensING/suriano/tf/1.0.3e
   |     0.00s T folder               from ~/github/HuygensING/suriano/tf/1.0.3e
   |     0.00s T file                 from ~/github/HuygensING/suriano/tf/1.0.3e
   |      |     0.12s C __levels__           from otype, oslots, otext
   |      |       15s C __order__            from otype, oslots, __levels__
   |      |     0.39s C __rank__             from otype, __order__
   |      |       17s C __levUp__            from otype, oslots, __rank_

Name,# of nodes,# slots / node,% coverage
folder,11,158047.18,100
file,725,2397.96,100
body,725,2206.0,92
text,725,2206.0,92
div,4148,737.03,176
table,243,217.58,3
teiHeader,725,191.96,8
page,8764,157.79,80
correspDesc,725,125.82,5
sourceDesc,725,51.06,2


## Step 6: Convert TF to WATM

N.B. For docs click the WATM link in the output cell.

In [36]:
WA = WATM(A, "tei", skipMeta=False, prod=True)
# WA = WATM(A, "tei", skipMeta=False, prod=False)
WA.makeText()
WA.makeAnno()
WA.writeAll()
WA.testAll()

conversion settings read from ~/github/HuygensING/suriano/config/watm.yml
IIIF settings read from ~/github/HuygensING/suriano/config/iiif.yml
textRepoLevel is section level 'folder'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

	Writing WATM ...
Writing production data to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod
Text file    0:    44343 segments to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/text-0.tsv
Text file    1:   117901 segments to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/text-1.tsv
Text file    2:   147685 segments to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/text-2.tsv
Text file    3:   109425 segments to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/text-3.tsv
Text file    4:   154932 segments to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/text-4.tsv

Text files all:  1738519 segments to 11 files
Anno file    1:   400000 annotations written to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/anno-1.tsv
Anno file    2:   400000 annotations written to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/anno-2.tsv
Anno file    3:   246988 annotations written to ~/github/HuygensING/suriano/watm/1.0.3e-031/prod/anno-3.tsv
Inherited annotations: 0
Anno files all:  1046988 

## Step 7: Generate IIIF manifests

In [37]:
# II = IIIF(Tei.teiVersion, A, REPORT_PAGESDIR, prod=True, silent=False)
II = IIIF(Tei.teiVersion, A, REPORT_PAGESDIR, prod=False, silent=False)
II.manifests()

Found covers in directory: ~/github/HuygensING/suriano/scans/covers
IIIF settings read from ~/github/HuygensING/suriano/config/iiif.yml
Maximum dimensions: W = 1871 H = 1315
Average dimensions: W =  910 H = 1071
Average deviation:  W =  228 H =  145
Maximum dimensions: W = 1232 H = 1273
Average dimensions: W =  716 H =  959
Average deviation:  W =  105 H =  133
Using page info file ~/github/HuygensING/suriano/report/pages/pageseq.json
Collections:
   02 with  262 pages
   03 with  660 pages
   04 with  806 pages
   05 with  688 pages
   06 with  684 pages
   07 with  628 pages
   08 with  988 pages
   09 with  944 pages
   10 with 1062 pages
   11 with 1116 pages
   12 with  928 pages
IIIF manifests generated in ~/github/HuygensING/suriano/static/dev/manifests


## Step 8: Deploy to k8s and TeamText VM

NB: you need access to the k8s cluster and to the TeamText VM.

That means:

* the LDAP of the relevant k8s clusters knows you;
* you have an ssh key-based login on the Team Text VPN;
* you work inside the firewall.

In [62]:
!./provision.sh watm

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

WATM export version: 1.0.1e-029
anno-1.tsv                                    100%   17MB   5.9MB/s   00:02    
anno-3.tsv                                    100%   10MB   6.4MB/s   00:01    
anno-2.tsv                                    100%   15MB   6.6MB/s   00:02    
anno2node.tsv                                 100% 5658KB   5.9MB/s   00:00    
text-6.tsv                                    100% 1228KB   4.3MB/s   00:00    
text-7.tsv                                    100% 1196KB   4.1MB/s   00:00    
text-5.tsv                                    100% 1018KB   6.0MB/s   00:00    
text-4.tsv                                    100%  914KB   5.8MB/s   00:00    
text-0.tsv                                    100%  267KB   4.3MB/s   00:00    
text-1.tsv                                    100%  705KB   6.2MB/s   00:00    
text-3.tsv                                   

In [63]:
!./provision.sh files

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

copying to pod: prod/covers.html
copying to pod: prod/logo
copying to pod: prod/manifests
copying to pod: both/metadata


In [None]:
!./provision.sh prod images

## Step 9: Test the images

* [covers](https://data.suriano.huygens.knaw.nl/files/covers.html)

* [02.json](https://data.suriano.huygens.knaw.nl/files/manifests/02.json)

* [page 02_071r](https://data.suriano.huygens.knaw.nl/iiif/3/pages%2F02_071r.jpg/full/max/0/default.jpg)
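To test other pages or manifests, the links can be built from their parts. The sketch below only reproduces the URL patterns of the links above; the function names are hypothetical. In the image URL, the segments `full/max/0/default.jpg` are the region, size, rotation and quality of the IIIF Image API, and the slash in the image identifier has to be percent-encoded.

```python
from urllib.parse import quote

BASE = "https://data.suriano.huygens.knaw.nl"


def page_image_url(page):
    """IIIF image URL of a page scan, e.g. page "02_071r"."""
    # safe="" makes quote() encode the slash in the identifier as %2F.
    identifier = quote(f"pages/{page}.jpg", safe="")
    return f"{BASE}/iiif/3/{identifier}/full/max/0/default.jpg"


def manifest_url(filza):
    """URL of the IIIF manifest of a filza, e.g. filza "02"."""
    return f"{BASE}/files/manifests/{filza}.json"


print(page_image_url("02_071r"))
print(manifest_url("02"))
```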

# On the command line: Step by step

Here we do the main steps of the conversion.

Every step is a separate run of a Python program.
After a step completes, all information needed to run the next step is saved to disk in the form
of result files and report files.

If the results of earlier steps are present, you can just do the following step.
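As an overview, the sketch below lists the `make.py` invocations of the steps in this section, in order. The function `step_commands` is hypothetical. It only builds the argument lists and does not run anything; the arguments are copied from the cells below, where steps 6 and 7 take the version with `e` appended.

```python
def step_commands(version):
    """List the make.py invocations of the step-by-step run, in order."""
    tf_version = f"{version}e"  # the TF version that includes the entities
    return [
        ["python", "make.py", "ingest", "-"],
        ["python", "make.py", "scans"],
        ["python", "make.py", "docx2tei", version],
        ["python", "make.py", "tei2tf", "-", version],
        ["python", "make.py", "ner", version],
        ["python", "make.py", "watm", tf_version],
        ["python", "make.py", "iiif", tf_version],
        ["python", "make.py", "deploy"],
    ]


for command in step_commands("1.0.3"):
    print(" ".join(command))
```

Each list could be handed to `subprocess.run(command, check=True)`, so that a failing step stops the sequence before the next one starts.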

## Step 0: Initialization

**N.B.** Check the VERSION variable here!

In [33]:
VERSION

'1.0.2'

## Step 1: Scan ingest

In [9]:
%%time

!python make.py ingest -

Ingest scans ...
CPU times: user 9.15 ms, sys: 7.67 ms, total: 16.8 ms
Wall time: 1.29 s


## Step 2: Scan processing

In [10]:
%%time

!python make.py scans

Process scans ...
CPU times: user 6.13 ms, sys: 7.51 ms, total: 13.6 ms
Wall time: 615 ms


## Step 3: From DOCX to TEI

In [11]:
%%time

!python make.py docx2tei $VERSION

DOCX ==> TEI files ...
DOCX => simple TEI per filza ...
DOCX => headers ...
Collecting transcribers ...
Collecting page scans ...
Collecting excel metadata ...
simple TEI per filza => enriched TEI per letter ...
CPU times: user 12.8 ms, sys: 9.18 ms, total: 22 ms
Wall time: 1.66 s


## Step 4: From TEI to TF

In [15]:
%%time

!python make.py tei2tf - $VERSION

TEI => TF ...
	Validating TEI ...
	Converting TEI ...
	Loading TF ...
CPU times: user 322 ms, sys: 119 ms, total: 441 ms
Wall time: 1min 9s


## Step 5: Mark named entities

In [16]:
%%time

!python make.py ner $VERSION

Annotate named entities ...
	Loading TF  ...
5 rows with a duplicate name:
  r305: William, Count of Nassau-Siegen
  r359: William of Orange
  r361: Maurice of Nassau
  r506: Henry II, Duke of Lorraine
  r645: Guillaume III de Melun
1 row without a name:
	e.g.: 365
149 rows without triggers:
	e.g.: 6, 8, 16, 21, 24, 26, 29, 30, 32, 34
Clash: Gugliemo di Nassau: r9 vs r50
Clash: Nicolò Perez: r75 vs r571
Clash: conte di Frusten: r148 vs r149
Clash: conte di Wanderlip: r150 vs r151
Clash: colonello Sciombergh: r253 vs r352
Clash: colonnello Sciombergh: r253 vs r352
Clash: signor di Rocalaura: r386 vs r459
Clash: colonel Rocalaura: r386 vs r459
Clash: monsignor di Rocalaura: r386 vs r459
Clash: colonello Rocalaura: r386 vs r459
	491 entities targeted with 6063 occurrences. See ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/0.4.3/.people.0.6/hits.tsv
	Loading TF with entities ...
CPU times: user 306 ms, sys: 107 ms, total: 413 ms
Wall time: 1min 3s


## Step 6: Convert TF to WATM

In [3]:
%%time

!python make.py watm "$VERSION"e
#!python make.py watm "$VERSION"e --no-prod

TF => WATM ...
	Loading TF ...
	Making WATM for version 0.4.3e
	Writing WATM ...
	Testing WATM ...
	OK - whether all tests passed
CPU times: user 45.9 ms, sys: 20.7 ms, total: 66.7 ms
Wall time: 9.3 s


## Step 7: Generate IIIF manifests

### Development

In [4]:
%%time

!python make.py iiif "$VERSION"e
#!python make.py iiif "$VERSION"e --no-prod

Generate IIIF manifests ...
CPU times: user 13.7 ms, sys: 13.1 ms, total: 26.8 ms
Wall time: 2.24 s


## Step 8: Deploy to k8s and TeamText VM

In [None]:
%%time

!python make.py deploy
#!python make.py deploy --no-prod

# Express: One shot

Here is the express, mindless way to convert the corpus.
If something goes wrong, you can follow the step-by-step section or the debugging section.

**N.B.** Check the VERSION variable here!

In [4]:
VERSION

'0.7.1'

In [None]:
%%time

!python make.py all $VERSION
# !python make.py all $VERSION --no-prod