# Convert from TEI to TF to WATM

We convert Suriano TEI to TF and then to WATM.

This notebook provides three levels of refinement in the execution. They all have the same outcome,
but they differ in the level of detail they provide on the conversion.

These are the levels:

* **Express**: one single command on the command line for the complete conversion;
* **Step by step**: one command for each main step of the conversion;
* **Debugging**: all the commands directly in Python, the intermediate data remains in memory and can be inspected.

# Production or development

Most steps are unaffected by the production/development setting.

In the first steps of the pipeline (*ingest* and *scan processing*) we prepare both the dev and the prod data.

The intermediate steps (*from DOCX to TEI*, *from TEI to TF*, *mark named entities*) are identical for prod and dev.

Only for the later steps (*convert TF to WATM*, *generate IIIF manifests*, *deploy to k8s*) is there a distinction between prod and dev.

In the cells for these steps the prod variant is active and the line that runs the dev variant is commented out.

# Requirements

* zsh as command line shell (as in macOS);
* access to suitable k8s clusters, streamlined by the 
  [k-suite](https://code.huc.knaw.nl/tt/smart-k8s/-/blob/main/docs/k-suite.md);
* [Pandoc](https://pandoc.org);
* [ImageMagick](https://imagemagick.org);
* [Python](https://www.python.org) (3.12 or higher) with additional pip-installable modules:
  * text-fabric
  * docx2python
  * openpyxl
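
A quick way to see whether these requirements are in place is to probe for them from Python. The sketch below is only an illustration: the import names `tf`, `docx2python` and `openpyxl` correspond to the pip packages listed above, while the executable names `pandoc` and `magick` are assumptions about how the tools are installed on your system. The function name `check_requirements` is made up for this example.

```python
import shutil
from importlib.util import find_spec

# Import names of the pip-installable modules (text-fabric is imported as `tf`).
MODULES = ("tf", "docx2python", "openpyxl")
# Executable names are assumptions; older ImageMagick versions install `convert`.
TOOLS = ("pandoc", "magick")


def check_requirements(modules=MODULES, tools=TOOLS):
    """Return a dict that maps each requirement to True (found) or False (missing)."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        status[f"module:{name}"] = find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on the PATH
        status[f"tool:{name}"] = shutil.which(name) is not None
    return status
```

Calling `check_requirements()` in a notebook cell gives one entry per requirement, so anything missing shows up before the pipeline starts.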

# Declare the version

Always set the version before running any cell in this notebook!

In [7]:
VERSION = "1.0.1"
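
Because the version ends up in directory names further down the pipeline (for example `tf/1.0.1e` in the output of later cells), it can help to check its shape right away. This is a minimal sketch; the three-part numeric pattern is an assumption based on the value used here, not a rule imposed by Text-Fabric, and `is_valid_version` is a hypothetical helper.

```python
import re

# Assumed shape of a version: three dot-separated numbers, e.g. "1.0.1".
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def is_valid_version(version):
    """Return True if `version` is a string of the form major.minor.patch."""
    return isinstance(version, str) and VERSION_PATTERN.fullmatch(version) is not None
```

A cell such as `assert is_valid_version(VERSION)` directly after the declaration would stop the notebook early when the version is missing or mistyped.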

# Debugging mode

Now we dig a bit deeper and run all the steps from within Python, keeping the program in memory.
This makes it possible to inspect all intermediate results.

In [8]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [9]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.addnlp import NLPipeline
from tf.convert.watm import WATM
from tf.convert.iiif import IIIF
from tf.advanced.helpers import dm

from processscans import Scans
from processdocs import TeiFromDocx
from processhelpers import nerMeta, NER_NAME, SOURCEBASE, PAGESEQ_JSON

## Step 1: Scan ingest

In [10]:
SC = Scans(silent=False, force=False)

In [5]:
SC.ingest(dry=False)

	Already ingested covers. Remove ~/gitlab.huc.knaw.nl/suriano/letters/scans/covers or pass --force to ingest again
	Already ingested pages. Remove ~/gitlab.huc.knaw.nl/suriano/letters/scans/pages or pass --force to ingest again


## Step 2: Scan processing

In [6]:
SC.process()

Already present: sizes file originals (covers)
Already present: sizes file originals (pages)
Already present: thumbnails (covers)
Already present: sizes file thumbnails (covers)
Already present: thumbnails (pages)
Already present: sizes file thumbnails (pages)


## Step 3: From DOCX to TEI

If `docx2python` is not installed yet, install it first:

```
pip install docx2python
```

In [11]:
TFD = TeiFromDocx(silent=False)

In [10]:
TFD.task("pandoc")

DOCX => simple TEI per filza ...
	02.docx ... uptodate
	03.docx ... uptodate
	04.docx ... uptodate
	05.docx ... uptodate
	06.docx ... uptodate
	07.docx ... uptodate
	08.docx ... uptodate
	09.docx ... uptodate
	10.docx ... uptodate
	11.docx ... uptodate
	12.docx ... uptodate


In [11]:
TFD.task("headers")

DOCX => headers ...
	02.docx
	03.docx
	04.docx
	05.docx
	06.docx
	07.docx
	08.docx
	09.docx
	10.docx
	11.docx
	12.docx
	OK: All headers are OK
Angelo              : 1181 pages in 11 filzas
Cristina            : 1034 pages in 10 filzas
Federica            : 684 pages in  7 filzas
Filippo             : 1084 pages in  9 filzas
Flavia              : 1282 pages in 11 filzas
Giorgia             : 934 pages in  9 filzas
Renzo               :  56 pages in  1 filza 
Ruben               : 1162 pages in 10 filzas
Vera                : 966 pages in 10 filzas
Vera, Federica      : 210 pages in  1 filza 


In [12]:
TFD.task("tei")

Collecting transcribers ...
Collecting page scans ...
  0 x error
8766 x good
Collecting excel metadata ...
	found metadata for 725 letters
simple TEI per filza => enriched TEI per letter ...
	02.xml
	03.xml
	04.xml
	05.xml
	06.xml
	07.xml
	08.xml
	09.xml
	10.xml
	11.xml
	12.xml
Translated italian editorial phrases (219 x 14158)
Metadata in summary file corresponds to transcribed letters
Pages with    transcription and    scan:       8766
Pages with    transcription and missing scan:     0
Pages with    transcription and no scan:          0
Pages with no transcription and    scan:          0
See ~/gitlab.huc.knaw.nl/suriano/letters/datasource/transcriptions/report/scantrans.tsv
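
The page counts at the end of this report amount to a comparison between two collections of page identifiers: the pages that have a transcription and the pages that have a scan. As a rough illustration (this is not the code in `processdocs`), the comparison can be sketched with set operations. The function name and the page identifiers in the usage note are made up for the example.

```python
def compare_pages(transcribed, scanned):
    """Classify page identifiers by whether they have a transcription, a scan, or both."""
    transcribed = set(transcribed)
    scanned = set(scanned)
    return {
        "transcription and scan": sorted(transcribed & scanned),
        "transcription, no scan": sorted(transcribed - scanned),
        "scan, no transcription": sorted(scanned - transcribed),
    }
```

For instance, `compare_pages(["02/001r", "02/001v"], ["02/001r", "02/002r"])` puts `02/001r` in the first group, `02/001v` in the second and `02/002r` in the third.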


## Step 4: From TEI to TF

### Check the validity of the TEI.

In [12]:
Tei = TEI(verbose=-1, sourceBase=SOURCEBASE, tei="", tf=VERSION)

In [14]:
Tei.task(check=True, verbose=1, validate=True)

TEI to TF checking: ~/gitlab.huc.knaw.nl/suriano/letters/datasource/tei => ~/gitlab.huc.knaw.nl/suriano/letters/datasource/report
Processing instructions are ignored
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
INFO: Needs dcr.xsd (exists)
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/gitlab.huc.knaw.nl/suriano/letters/datasource/schema/suriano.xsd
	round   1:  68 changes
137 identical override(s)
  1 changing override(s)
	metamark mixed ==> pure
Section model I
	Start folder 02:
		   1 suriano                                001.xml                                           
		   2 suriano                                002.xml                                           
		   3 suriano                                003.xml                                           
		   4 suriano                                004.xml                                   

True

### Convert the data

In [15]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Line model II with ln nodes for lines between lb elements
Page model II with page nodes for pages started by pb elements  keeping the pb elements
Section model I
Processing instructions are ignored
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/gitlab.huc.knaw.nl/suriano/letters/datasource/schema/suriano.xsd
	round   1:  68 changes
137 identical override(s)
  1 changing override(s)
	metamark mixed ==> pure
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
	Start folder 02:
		   1 suriano                                001.xml                                           
		   2 suriano                                002.xml                                           
		   3 suriano                                003.xml                         

True

### Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [13]:
Tei.task(app=True)

True

### Use the new dataset

The final proof that the conversion has worked is to load the data.
On first-time loading several checks and pre-computations are performed.
Next time the loading will be much quicker.

In [14]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.5.5
37 features found and 0 ignored
  0.28s Dataset without structure sections in otext:no structure functions in the T-API
  1.03s All features loaded / computed - for details use TF.isLoaded()
  0.05s All additional features loaded - for details use TF.isLoaded()


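| Name | # of nodes | # slots / node | % coverage |
| --- | ---: | ---: | ---: |
| folder | 11 | 157567.73 | 100 |
| file | 725 | 2390.68 | 100 |
| body | 725 | 2206.02 | 92 |
| text | 725 | 2206.02 | 92 |
| div | 4148 | 737.03 | 176 |
| table | 243 | 217.58 | 3 |
| teiHeader | 725 | 184.67 | 8 |
| page | 8764 | 157.79 | 80 |
| correspDesc | 725 | 118.53 | 5 |
| sourceDesc | 725 | 51.05 | 2 |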
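
The last column of this overview can be reproduced from the other two, assuming that coverage is the number of nodes times the average number of slots per node, expressed as a percentage of all slots. The total of 1733245 slots is taken from the node ranges printed later in this notebook, and the helper name `coverage_percent` is made up for this sketch.

```python
# Total number of slots in this corpus, as reported in later output of this notebook.
TOTAL_SLOTS = 1733245


def coverage_percent(n_nodes, slots_per_node, total_slots=TOTAL_SLOTS):
    """Percentage of slots covered by a node type, counting overlapping nodes separately."""
    return round(100 * n_nodes * slots_per_node / total_slots)


body = coverage_percent(725, 2206.02)   # row `body`
div = coverage_percent(4148, 737.03)    # row `div`
```

Under this reading a value above 100, as for `div`, means that slots are counted more than once, presumably because `div` elements nest.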


## Step 5: Mark named entities

First stage: we use the human-crafted triggers as is.

In the separate notebook [nerCorrect](nerCorrect.ipynb) you can then collect
additional spelling variants of the triggers.

Not all of these variants are fit to be used as triggers. Make a list of the variants that are not valid triggers
and store them in the file `persons-notmerged.txt`, one per line.

You can then merge the remaining, valid variants automatically with the human-crafted triggers;
the result is a new spreadsheet, `persons-merged`.

Second stage: use the merged spreadsheet.
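
The merge logic described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code behind `mergeTriggers()`: the function names and the data shapes (a dict from entity to trigger set, and pairs of variant and matching trigger) are invented for the example.

```python
def read_excluded(text):
    """Parse the content of an exclusion file: one variant per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def merge_triggers(triggers_by_entity, variants, excluded):
    """Add each accepted variant to the trigger set of the entity that owns its candidate.

    triggers_by_entity: dict mapping an entity to its set of human-crafted triggers
    variants: iterable of (variant, candidate) pairs; candidate is an existing trigger
    excluded: set of variants that must not become triggers
    """
    merged = {entity: set(triggers) for entity, triggers in triggers_by_entity.items()}
    for variant, candidate in variants:
        if variant in excluded:
            continue  # rejected by hand, so it is not merged
        for entity, triggers in triggers_by_entity.items():
            if candidate in triggers:
                merged[entity].add(variant)
    return merged
```

The original trigger sets are left untouched, so the stage 1 sheet remains usable next to the merged result.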

### Stage 1

We need to copy the relevant spreadsheet from the `datasource/metadata` directory to the `ner` directory where TF expects it.

In [15]:
nerStage = 1
nerName, nerOutFile = setStage(nerStage)

Stage 1: working with sheet persons
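
`setStage` is a helper of this project whose code is not shown in this notebook. Judging only from the messages it prints, it selects a sheet name per stage, and the text above says that the spreadsheet has to be copied to where TF expects it. A hypothetical sketch of such a helper could look as follows; the names `sheet_for_stage` and `copy_sheet`, the base name `persons` and the `.xlsx` extension are assumptions made for the example.

```python
import shutil
from pathlib import Path

BASE_NAME = "persons"  # assumed base name of the trigger spreadsheet


def sheet_for_stage(stage, base=BASE_NAME):
    """Return the sheet name for a NER stage: the plain sheet or the merged one."""
    if stage == 1:
        return base
    if stage == 2:
        return f"{base}-merged"
    raise ValueError(f"unknown NER stage: {stage!r}")


def copy_sheet(stage, source_dir, target_dir):
    """Copy the spreadsheet of a stage from source_dir to target_dir and return its new path."""
    name = sheet_for_stage(stage)
    source = Path(source_dir) / f"{name}.xlsx"
    target = Path(target_dir) / f"{name}.xlsx"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(source, target)
    return target
```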


In [16]:
NE = A.makeNer(caseSensitive=False, silent=False)

normalizeChars() loaded from ~/gitlab.huc.knaw.nl/suriano/letters/ner/code.py


In [17]:
NE.setTask(f".{nerName}", force=True)

Annotation set 🧾 persons has 10645 annotations
SHEET data: computing from scratch ...
--------------
Reading sheets
--------------
Sheet with 871 rows and 18 columns


9 rows without triggers:
	e.g.: 129, 216, 234, 235, 236, 424, 452, 464, 488



-------------------
Checking scopes ...
-------------------

--
()
--

--
02
--

--
03
--

--
04
--

--
05
--

--
06
--

--
07
--

-------------
08@001-08@089
-------------

---------
08@090-08
---------

---------
09-09@021
---------

---------
09@022-09
---------

---------
10-10@017
---------

-------------
10@018-10@020
-------------

---------
10@021-10
---------

------------
11-11@026:22
------------

---------
11@026:23
---------

------------
11@026:24-11
------------

------------
12-12@037:27
------------

---------
12@037:28
---------

----------------
12@037:29-12@041
----------------

-------------
12@042-12@044
-------------

---------
12@045-12
---------
  0.00s Looking up occurrences of many candidates ...
  2.28s done
done


In [18]:
NE.triggerInterference()

Looking up 26 potential interferences in 2 passes over the corpus ..
2 potential conflicting trigger pairs with 2 conflicts
----------
different rows (0 pairs)
----------
Diagnostic trigger interferences written to ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons/interference.txt


In [19]:
NE.reportHits(showNoHits=True)

No slot is covered by more than one trigger


Triggers without hits: 102x:


Looking up 102 triggers in 2 passes over the corpus ..
Entities targeted:            811
Triggers searched for:       1622
Triggers without hits:        102
 - completely covered:        102
 - missing hits:                0
Triggers with hits:          1520
Total hits:                 10645

All hits in report file:      ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons/hits.tsv
Triggers by slot in file:     ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons/triggerBySlot.tsv
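
The summary counts in this report are related: the triggers searched for split into those with and without hits, and the triggers without hits split into "completely covered" and "missing hits". A small sketch that parses such a summary and checks these relations is given below; it works on the printed text only, and the function names are made up for the example.

```python
import re

SAMPLE = """\
Entities targeted:            811
Triggers searched for:       1622
Triggers without hits:        102
 - completely covered:        102
 - missing hits:                0
Triggers with hits:          1520
Total hits:                 10645
"""


def parse_summary(text):
    """Parse lines of the form 'label: number' (optionally prefixed by '-') into a dict."""
    counts = {}
    for line in text.splitlines():
        match = re.fullmatch(r"\s*(?:-\s*)?(.+?):\s*(\d+)\s*", line)
        if match:
            counts[match.group(1).strip()] = int(match.group(2))
    return counts


def is_consistent(counts):
    """Check that the partial counts add up to their totals."""
    searched = counts["Triggers searched for"]
    without = counts["Triggers without hits"]
    return (
        searched == counts["Triggers with hits"] + without
        and without == counts["completely covered"] + counts["missing hits"]
    )


counts = parse_summary(SAMPLE)
```

Lines that do not end in a number, such as the file paths in the report, are skipped by the parser.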


In [20]:
NE.makeSheetOfSingleTokens()
NE.setTask(f".{NER_NAME}-single", caseSensitive=False, force=True)
NE.reportHits(showNoHits=True)
NE.setTask(f".{NER_NAME}", caseSensitive=False)

Annotation set 🧾 persons-single has 0 annotations
SHEET data: computing from scratch ...
--------------
Reading sheets
--------------
Sheet with 1504 rows and 6 columns

-------------------
Checking scopes ...
-------------------

--
()
--
  0.00s Looking up occurrences of many candidates ...
  2.30s done
done
No slot is covered by more than one trigger


Triggers without hits: 14x:


Looking up 14 triggers in 1 pass over the corpus .
Entities targeted:           1502
Triggers searched for:       1502
Triggers without hits:         14
 - completely covered:         14
 - missing hits:                0
Triggers with hits:          1488
Total hits:                 499902

All hits in report file:      ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons-single/hits.tsv
Triggers by slot in file:     ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons-single/triggerBySlot.tsv
Annotation set 🧾 persons has 10645 annotations
SHEET data: already in memory and uptodate

--------------
Reading sheets
--------------


9 rows without triggers:
	e.g.: 129, 216, 234, 235, 236, 424, 452, 464, 488



-------------------
Checking scopes ...
-------------------

--
()
--

--
02
--

--
03
--

--
04
--

--
05
--

--
06
--

--
07
--

-------------
08@001-08@089
-------------

---------
08@090-08
---------

---------
09-09@021
---------

---------
09@022-09
---------

---------
10-10@017
---------

-------------
10@018-10@020
-------------

---------
10@021-10
---------

------------
11-11@026:22
------------

---------
11@026:23
---------

------------
11@026:24-11
------------

------------
12-12@037:27
------------

---------
12@037:28
---------

----------------
12@037:29-12@041
----------------

-------------
12@042-12@044
-------------

---------
12@045-12
---------


### Prepare for stage 2: search variants

In [34]:
nerStage = 1
nerName, nerOutFile = setStage(nerStage)

Stage 1: working with sheet persons


In [35]:
NE = A.makeNer(caseSensitive=False, silent=False)

normalizeChars() loaded from ~/gitlab.huc.knaw.nl/suriano/letters/ner/code.py


In [36]:
NE.setTask(f".{nerName}", force=True)

Annotation set 🧾 persons has 10112 annotations
SHEET data: computing from scratch ...
--------------
Reading sheets
--------------
Sheet with 871 rows and 18 columns


9 rows without triggers:
	e.g.: 129, 216, 234, 235, 236, 424, 452, 464, 488



-------------------
Checking scopes ...
-------------------

--
()
--

--
02
--

--
03
--

--
04
--

--
05
--

--
06
--

--
07
--

-------------
08@001-08@089
-------------

---------
08@090-08
---------

---------
09-09@021
---------

---------
09@022-09
---------

---------
10-10@017
---------

-------------
10@018-10@020
-------------

---------
10@021-10
---------

------------
11-11@026:22
------------

---------
11@026:23
---------

------------
11@026:24-11
------------

------------
12-12@037:27
------------

---------
12@037:28
---------

----------------
12@037:29-12@041
----------------

-------------
12@042-12@044
-------------

---------
12@045-12
---------
  0.00s Looking up occurrences of many candidates ...
  2.28s done
done


In [45]:
D = NE.variantDetection()

SHEET data: computing from scratch ...
--------------
Reading sheets
--------------
Sheet with 871 rows and 18 columns


9 rows without triggers:
	e.g.: 129, 216, 234, 235, 236, 424, 452, 464, 488



-------------------
Checking scopes ...
-------------------

--
()
--

--
02
--

--
03
--

--
04
--

--
05
--

--
06
--

--
07
--

-------------
08@001-08@089
-------------

---------
08@090-08
---------

---------
09-09@021
---------

---------
09@022-09
---------

---------
10-10@017
---------

-------------
10@018-10@020
-------------

---------
10@021-10
---------

------------
11-11@026:22
------------

---------
11@026:23
---------

------------
11@026:24-11
------------

------------
12-12@037:27
------------

---------
12@037:28
---------

----------------
12@037:29-12@041
----------------

-------------
12@042-12@044
-------------

---------
12@045-12
---------
  0.00s Looking up occurrences of many candidates ...
  2.16s done
done
SHEET data: computing from scratch ...
--------------
Reading sheets
--------------
Sheet with 871 rows and 18 columns


9 rows without triggers:
	e.g.: 129, 216, 234, 235, 236, 424, 452, 464, 488



-------------------
Checking scopes ...
-------------------

--
()
--

--
02
--

--
03
--

--
04
--

--
05
--

--
06
--

--
07
--

-------------
08@001-08@089
-------------

---------
08@090-08
---------

---------
09-09@021
---------

---------
09@022-09
---------

---------
10-10@017
---------

-------------
10@018-10@020
-------------

---------
10@021-10
---------

------------
11-11@026:22
------------

---------
11@026:23
---------

------------
11@026:24-11
------------

------------
12-12@037:27
------------

---------
12@037:28
---------

----------------
12@037:29-12@041
----------------

-------------
12@042-12@044
-------------

---------
12@045-12
---------
  0.00s Looking up occurrences of many candidates ...
  2.16s done
done
Overview of names by length:
  10 tokens:   2 names e.g.:
      cavalier inglese , et il fratello del vescovo di Londra
      signor d ’ Aspri fratello dell ’ ambasciator dei Stati
  9 tokens:   2 names e.g.:
      conte d ’ Henin , chiamato duca d

In [46]:
D.prepare()

Alphabet written to ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/analyticcl/alphabet.tsv
Text written to ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/analyticcl/text.txt - 8554012 characters
  0.00s Collecting the triggers for the lexicon
  0.00s 1460 triggers collected
  345 x conte di Mansfelt
  329 x Pasini
  236 x Bernvel
  236 x Spinola
  211 x principe d’Oranges
  211 x re di Spagna
  209 x Carleton
  207 x principe Mauritio
  191 x duca di Savoia
  180 x serenissimi arciduchi
  ...
    1 x visconte di Duncaster
    1 x visconte di Gantes fratello del principe di Pinoe
    1 x Vuandermil
    1 x Wandernoot
    1 x Wassenhovven
    1 x Wasson[ hoven]
    1 x Weesterbeeck
    1 x Weimar il maggiore
    1 x zelandese nominato Cornelio
    1 x Zeno
    1460 lexicon length
Lexicon written to ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/analyticcl/lexicon.tsv
Set up analiticcl


Computing anagram values for all items in the lexicon...
 - Found 1460 instances
Adding all instances to the index...
 - Found 1455 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 3 anagrams of length 3
 - Found 13 anagrams of length 4
 - Found 23 anagrams of length 5
 - Found 39 anagrams of length 6
 - Found 46 anagrams of length 7
 - Found 41 anagrams of length 8
 - Found 41 anagrams of length 9
 - Found 27 anagrams of length 10
 - Found 45 anagrams of length 11
 - Found 61 anagrams of length 12
 - Found 67 anagrams of length 13
 - Found 84 anagrams of length 14
 - Found 106 anagrams of length 15
 - Found 99 anagrams of length 16
 - Found 99 anagrams of length 17
 - Found 98 anagrams of length 18
 - Found 82 anagrams of length 19
 - Found 71 anagrams of length 20
 - Found 65 anagrams of length 21
 - Found 51 anagrams of length 22
 - Found 34 anagrams of length 23
 - Found 35 anagrams of length 24
 - Found 23 anagrams of length 25
 - Found 22 anagrams o

In [47]:
%%time

!date

D.search(start=None, end=None, force=1)

Mon Oct 14 13:01:05 CEST 2024
 8554012 text  length
       0 offset in complete text
  0.00s Read previously computed variants of the lexicon words ...
    11s  1371254 raw   matches
    11s Filter variants of the lexicon words ...
    12s      694 filtered matches
CPU times: user 11.5 s, sys: 374 ms, total: 11.9 s
Wall time: 12 s


In [48]:
D.listResults(end=20)

   i | variant                   | score | candidate
---- | ------------------------- | ----- | -------------------------
   0 | a Ferdinando              |  0.87 | re Ferdinando
   1 | a suo figliolo            |  0.87 | un suo figliolo
   2 | abbot Moronato            |  0.85 | abbate Moronato
   3 | Adrian Plois              |  0.88 | Adrian Ploos
   4 | Adriano Ploos             |  0.85 | Adrian Ploos
   5 | Adrien Ploos              |  0.86 | Adrian Ploos
   6 | agente del Kerkoven       |  0.89 | agente del Kerckoven
   7 | al figliolo               |  0.88 | il figliolo
   8 | al suo figlio             |  0.89 | il suo figlio
   9 | Alag]ambe                 |  0.84 | Alagambe
  10 | Alessandro Vanderberge    |  0.90 | Alessandro Vanderbergh
  11 | alla moglie               |  0.86 | della moglie
  12 | alla moglie               |  0.85 | la moglie
  13 | ambasciator Aghe          |  0.91 | ambasciator Aghes
  14 | ambasciator Cont          |  0.84 | ambasciator Caron
  15 | amb
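
A listing like this is easier to review when the variants are grouped per candidate. The sketch below does that for the printed table; it parses the text only and makes no assumption about how the detection object stores its results. The function names and the `min_score` parameter are invented for this illustration.

```python
def parse_results(table_text):
    """Turn the printed table into (index, variant, score, candidate) tuples."""
    rows = []
    for line in table_text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 4 or not parts[0].isdigit():
            continue  # skip the header, the separator and incomplete lines
        index, variant, score, candidate = parts
        rows.append((int(index), variant, float(score), candidate))
    return rows


def variants_by_candidate(rows, min_score=0.0):
    """Group variants under their candidate, keeping only scores of at least min_score."""
    grouped = {}
    for _, variant, score, candidate in rows:
        if score >= min_score:
            grouped.setdefault(candidate, []).append(variant)
    return grouped
```

Applied to rows 3 to 5 above, this groups `Adrian Plois`, `Adriano Ploos` and `Adrien Ploos` under the candidate `Adrian Ploos`.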

In [52]:
D.mergeTriggers()

155 excluded variants found in ~/gitlab.huc.knaw.nl/suriano/letters/ner/specs/persons-notmerged.txt
152 variants excluded as trigger
254 triggersets expanded with 552 triggers
Wrote merged triggers to sheet ~/gitlab.huc.knaw.nl/suriano/letters/ner/specs/persons-merged.xlsx


In [53]:
D.showResults(start=162, end=164)

   1 Variant «Pauls» of 1 candidate
  Occurrences:
    07@005:110: 581, 1, 8 La nave Menonisterkerch capitano Peter «Pauls »Quakes 149 2550 1020 [1]431​ 
  Candidates with score:
	0.84 Pauli
-----
   2 Variant «Pelicard» of 1 candidate
  Occurrences:
    09@081:2: the Moraves have defeated Bocqouy, and opposite reports from «Pelicard »in Bxl. Fears in Dutch Republic for the 
  Candidates with score:
	0.87 Pelicart
-----


In [54]:
D.displayResults(start=162, end=164)

# 1: 1 x variant `Monteneau` on candidate `Montereau`



n,p,chunk
1,12@010:197,"Il Tenente cattolico che già scrissi esser il mezettino di quelli che vorriano tirar a sé il Generale seguita continuamente la corte perché hieri l’altro partì sua Eccellenza con li ambasciatori da Stichausen, et venne anote 65​ Oldenham una lega discosto da Embdem castello di un barone di Suart Zionbergh qui hoggi veniranno li deputati delli signori Stati, et quelli di questo Paese a portar al Generale la conclusione dei loro negotii, et per quel che ho inteso daranno la summa delli denari concertati d’ accordo contandogli di presente 46 mila raistaleri gl’altri in tre termini corti. Questa mattina ilnote 66​ Francese, et Savoia sono partiti. Io sono venuto con loro fino a Embdem per fargli anco con una cena honorata dimostratione di buon affetto. Questo per conclusione di tutto il negotio dirò a vostra Serenità che Bosnote 67​ è stato il concludente di ogni cosa egli ha havuto più stretti negotii, et discorsi et ove pensava di incontrar qualche difficoltà ne chiedeva a me le cause et le ragioni. Questa mattina poi più di un’hora è stato in secreto con sua Eccellenza la quale per il suo secretario Veis ci ha a tutti tre fatto intender, che desiderava tener presso di sé una copia dellanote 68​ risposta che faceva a sua Maestà christianissima, et alli folio 100r 📷​ confederati, la quale fosse sottoscritta da tutti tre; il Montereau ricusava assolu- tamente di farlo con dire non haver tal ordine dal suo Re. Io dicevo nonnote 69​ poter in modo alcunonote 70​ far; nonnote 71​ havendo havuto né da Venetia, né dalli ambasciatori Moresini, e Pesaro, nemeno dal residente Suriano alcuna instruttione, o commissione in questo affare, finalmente il Savoiardo, che governava la facenda ci consigliava a farlo con dir a me in particolare haver havuta questa commissione dalli signori ambasciatori Moresini, e Pesaro; che stipulando io dovessi concorrer in tutto quello fosse stipulato, et fare quanto lor altri. 
Ho pregato sua Eccellenza di volersi contentare di quanto fa- cevano gli altri, et lasciar star me, ch’ella sapea bene ciò non esser a me conveniente, et a lei di niun proffitto; non l’ho potuto ottenere. Il Monte- reau stava saldo, aspettandonote 72​ che risolutione io havessi pigliato; et alla fine disse, che se Montereau havesse sottoscritto, io vi ha- verei gionto il mio nome, come per testimonio di quel trattato. Tutti questi partiti non valendo, et parendo che la rottura, o conclusione del negotio stasse in questa mia risolutione dissi, che l’haverei fatto: ma però conditionatamente se fosse stato di contentamento, et a beneplacito di sua Maestà christianissima et delli signori confederati. Montereau diceva in questo modo si potrà ben fare, et così la subsi- gnerò col mio nome, quando voi altri ignori lo vogliate fare. Bos prontamente disse che l’haverebbe  folio 100v 📷​ fatto; sottoscrisse adunque primieramente Monte- reau; Bos disse non voler sottoscriversi se prima io non l’havessi fatto; et essendo così l’ordine di tutte le scritture, nelle quali si deve precede- re a Savoia vi missi il mio nome semplice senza dir altro; così fece poi il Savoiardo sendo prima della sottoscritione nel fine della scrittura stata messa la conditione del nostro sottoscritto; cioè con riserva se così piacerà a sua Maestà christianissima, et alli signori collegati. ​"
2,12@012:85,"Quello, che presenterà la lettera a vostra Eccellenza monsignor Nicolas mi dice non haver altre lettere, che a Montereau, et della Bossen, per sollicitarli a far presto il camino, acciò possi haver la risposta dentro un mese. ​"
3,12@015:70,"Dicono di più, che Sassonia non vogli dar ilnote 17​ passo al Mansfelt; il che vien detto per disgusto, et sono in tanto disordine, che non vogliono credere, che il signor dinote 18​ Monteneau, et di Bos fossero mandati espressamente da sua Maestà christianissima, et da sua Altezza di Savoianote 19​ folio 147v 📷​ sua Eccellenza di Mansfelt ma dicevano, che era una finta di Mansfelt et poi hanno visitato, et presentato di buon vino il colonello Grè perché credevano fosse ambasciator del re della Gran Bretagna."


# 2: 1 x variant `Monteran` on candidate `Monterau`



n,p,chunk
1,12@032:48,"Quelli che intendono bene le cose veggono anco, che francesi si sono condotti in questo affare più per le solicitationi della serenissima Republica, et anco del signor duca di Savoia, che per la volontà, che vengono di Franza si burlano della levata, ma più dei capi, Monteran l’hanno per interes- satissimo, et qualcheduno poco fidato, massime per questi signori havendo servito altre volte l’ inimico, al qual tempo fu in questi paesi con certo petardiero, che poi patì l’ultimo suplicio in Genevra, qualche anni sono fin qui sotto pretesto folio 328v 📷​ di curiosità vedendo le piazze di questi paesi, et fu scoperto, che si pensava da Spagnoli colla rellatione loro far qualche proffitto al disavantaggio dei signori Stati. Turnone poi è tenuto per buono che si accommoda a ricever hoggi uno vestito da uno, dimani proffittarsi col termi- ne di entrante qualche scudo, o catena d’oro, che vengono stimate qualità non proprie in capi di militie; l’Ambasciator anco mostra poca sodisfattione di detti doi collonelli, dicendomi che molto meglio si sarebbe fatto il servitio con persone di altra qualità, et desinteressate. Gratie etc. ​"
2,12@032:66,"Bos scrisse a Monterau prima di partire con seguito, et che egli in tanto, e Turnon dovessero valersi della libertà di pigliar dopo li 8 vasselli a Cales, o Bologna, cominciando ad inviar le genti etc."


 1m 21s    2 matches done


In [55]:
# D.displayResults(asFile="variants")

### Stage 2

In [62]:
nerStage = 2
nerName, nerOutFile = setStage(nerStage)

Stage 2: working with sheet persons-merged


In [63]:
NE = A.makeNer(caseSensitive=False, silent=False)

normalizeChars() loaded from ~/gitlab.huc.knaw.nl/suriano/letters/ner/code.py


In [64]:
NE.setTask(f".{nerName}", force=True)

Annotation set 🧾 persons-merged has 12632 annotations
SHEET data: computing from scratch ...
--------------
Reading sheets
--------------
Sheet with 871 rows and 18 columns


9 rows without triggers:
	e.g.: 129, 216, 234, 235, 236, 424, 452, 464, 488



-------------------
Checking scopes ...
-------------------

--
()
--

--
02
--

--
03
--

--
04
--

--
05
--

--
06
--

--
07
--

-------------
08@001-08@089
-------------

---------
08@090-08
---------

---------
09-09@021
---------

---------
09@022-09
---------

---------
10-10@017
---------

-------------
10@018-10@020
-------------

---------
10@021-10
---------

------------
11-11@026:22
------------

---------
11@026:23
---------

------------
11@026:24-11
------------

------------
12-12@037:27
------------

---------
12@037:28
---------

----------------
12@037:29-12@041
----------------

-------------
12@042-12@044
-------------

---------
12@045-12
---------
  0.00s Looking up occurrences of many candidates ...
  2.35s done
done


In [65]:
NE.triggerInterference()

Looking up 27 potential interferences in 2 passes over the corpus ..
2 potential conflicting trigger pairs with 2 conflicts
----------
different rows (0 pairs)
----------
Diagnostic trigger interferences written to ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons-merged/interference.txt


In [66]:
NE.reportHits(showNoHits=True)

No slot is covered by more than one trigger


Triggers without hits: 154x:


Looking up 154 triggers in 2 passes over the corpus ..

durante di prignì                        ()          : 03@011:9 x 1, 03@018:11 x 1
jan janson linden                        ()          : 07@005:69 x 1
jan janson vander linden                 ()          : 07@018:50 x 1
jan janssoon van linden                  ()          : 06@028:181 x 1
signor conte gioan                       (02-05)     : 06@034:160 x 1
ambasciator de francia                   (03)        : 12@030:25 x 1
ambasciatore di franza                   (03)        : 02@012:15 x 1, 08@059:18 x 1
ambasciatori francia                     (03)        : 04@036:9 x 1
ambasciator inglese straordinario        (04)        : 08@067:14 x 1
maximiliano                              (04)        : 10@081:93 x 1
conte gio ernesto                        (06, 08, 09, 11): 04@026:7 x 1
figliolo maggiore                        (08)        : 04@016:7 x 1, 11@100:23 x 1
eccellentissimo signor general           (08@001-08@089): 03@018:7 x 1, 04@060:7 x 1, 04@071:7 x 1, 04@075:7 x 1, 05@003:12 x


Entities targeted:            812
Triggers searched for:       2154
Triggers without hits:        154
 - completely covered:        131
 - missing hits:               23
Triggers with hits:          2000
Total hits:                 12636

All hits in report file:      ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons-merged/hits.tsv
Triggers by slot in file:     ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/1.0.1/.persons-merged/triggerBySlot.tsv


In [67]:
nerMeta(*NE.getMeta(), silent=False)

In [68]:
NE.bakeEntities()

Entity consolidation for 12636 entity occurrences into version 1.0.1e
 12636 entity occurrences
   780 distinct entities
  0.00s Creating a dataset with entity nodes ...
  0.00s preparing and checking ...
  0.00s Feature overview: 34 for nodes; 2 for edges; 1 configs; 9 computed
   |     0.95s done
   |   Delete types: t                   : keep:   shift  nodes       1-1733245 to         1-1733245
   |   Delete types: author              : keep:   shift  nodes 1733246-1733970 to   1733246-1733970
   |   Delete types: bibl                : keep:   shift  nodes 1733971-1734695 to   1733971-1734695
   |   Delete types: biblScope           : keep:   shift  nodes 1734696-1735420 to   1734696-1735420
   |   Delete types: body                : keep:   shift  nodes 1735421-1736145 to   1735421-1736145
   |   Delete types: cell                : keep:   shift  nodes 1736146-1750480 to   1736146-1750480
   |   Delete types: chunk               : keep:   shift  nodes 1750481-1797451 to   1750481-1

True

We load the new data:

In [69]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.5.5
40 features found and 0 ignored
   |     0.43s T otype                from ~/gitlab.huc.knaw.nl/suriano/letters/tf/1.0.1e
   |     5.86s T oslots               from ~/gitlab.huc.knaw.nl/suriano/letters/tf/1.0.1e
  6.29s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.08s T chunk                from ~/gitlab.huc.knaw.nl/suriano/letters/tf/1.0.1e
   |     3.90s T str                  from ~/gitlab.huc.knaw.nl/suriano/letters/tf/1.0.1e
   |     0.00s T folder               from ~/gitlab.huc.knaw.nl/suriano/letters/tf/1.0.1e
   |     0.00s T file                 from ~/gitlab.huc.knaw.nl/suriano/letters/tf/1.0.1e
   |     3.23s T after                from ~/gitlab.huc.knaw.nl/suriano/letters/tf/1.0.1e
   |      |     0.11s C __levels__           from otype, oslots, otext
   |      |       19s C __order__            from otype, oslots, __levels__
   |      |     0.40s C __rank__             from otype, __order__
   |     

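| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| folder | 11 | 157567.73 | 100 |
| file | 725 | 2390.68 | 100 |
| body | 725 | 2206.02 | 92 |
| text | 725 | 2206.02 | 92 |
| div | 4148 | 737.03 | 176 |
| table | 243 | 217.58 | 3 |
| teiHeader | 725 | 184.67 | 8 |
| page | 8764 | 157.79 | 80 |
| correspDesc | 725 | 118.53 | 5 |
| sourceDesc | 725 | 51.05 | 2 |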
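
The coverage column in the overview can be reproduced from the other two columns. The sketch below assumes that coverage is the number of nodes times the average number of slots per node, divided by the total number of slots. It also assumes that the corpus has 1733245 slots, which is the node range reported for type `t` in the log above. The function name `coverage` is ours, not part of Text-Fabric.

```python
# Rows copied from the overview above:
# (node type, number of nodes, average number of slots per node)
ROWS = [
    ("folder", 11, 157567.73),
    ("file", 725, 2390.68),
    ("body", 725, 2206.02),
    ("div", 4148, 737.03),
    ("page", 8764, 157.79),
]

# Assumption: the slot type occupies nodes 1-1733245, as the log suggests
TOTAL_SLOTS = 1733245


def coverage(n_nodes: int, slots_per_node: float, total_slots: int = TOTAL_SLOTS) -> int:
    """Percentage of slots covered, counting a slot once for every node that contains it."""
    return round(100 * n_nodes * slots_per_node / total_slots)


for name, n_nodes, slots_per_node in ROWS:
    print(f"{name:<8} {coverage(n_nodes, slots_per_node):>4}%")
```

Under this reading, a value above 100 (as for `div`) means that nodes of that type overlap or nest, so that some slots are counted more than once.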


## Step 6: Convert TF to WATM

**N.B.** For the documentation, click the WATM link in the output cell.

In [33]:
WA = WATM(A, "tei", skipMeta=False, prod=True)
# WA = WATM(A, "tei", skipMeta=False, prod=False)
WA.makeText()
WA.makeAnno()
WA.writeAll()
WA.testAll()

textRepoLevel is section level 'folder'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

	Writing WATM ...
Writing production data to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod
Text file    0:    44211 segments to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod/text-0.tsv
Text file    1:   117416 segments to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod/text-1.tsv
Text file    2:   146993 segments to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod/text-2.tsv
Text file    3:   109061 segments to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod/text-3.tsv
Text file    4:   154304 segments to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod/text-4.tsv
Text file   10:   132103 segments to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod/text-10.tsv
Text files all:  1733245 segments to 11 files
Anno file    1:   400000 annotations written to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-028/prod/anno-1.tsv
Anno file    2:   400000 annotations written to ~/gitlab.huc.knaw.nl/suriano/letters/watm/1.0.0e-
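
The per-file segment counts in such a log should add up to the reported total. The following is a minimal sketch of that check. It only parses log lines of the shape shown above; it is not part of the WATM exporter, and the names `check_segment_totals` and `SAMPLE_LOG` are ours. The sample log uses invented numbers and paths, purely for illustration.

```python
import re

# Patterns modelled on the log lines shown above
SEGMENT_LINE = re.compile(r"^Text file\s+(\d+):\s+(\d+) segments to ")
TOTAL_LINE = re.compile(r"^Text files all:\s+(\d+) segments to (\d+) files")

# Invented sample, NOT real output of the exporter
SAMPLE_LOG = """
Text file    0:       10 segments to /tmp/watm/text-0.tsv
Text file    1:       32 segments to /tmp/watm/text-1.tsv
Text files all:       42 segments to 2 files
"""


def check_segment_totals(log: str) -> dict:
    """Compare the per-file segment counts in a log with the reported totals."""
    per_file = {}
    total = None
    n_files = None
    for line in log.splitlines():
        line = line.strip()
        match = SEGMENT_LINE.match(line)
        if match:
            per_file[int(match.group(1))] = int(match.group(2))
            continue
        match = TOTAL_LINE.match(line)
        if match:
            total, n_files = int(match.group(1)), int(match.group(2))
    seen = sum(per_file.values())
    return {
        "files_seen": len(per_file),
        "segments_seen": seen,
        "files_reported": n_files,
        "segments_reported": total,
        "consistent": total is not None and n_files == len(per_file) and total == seen,
    }


print(check_segment_totals(SAMPLE_LOG))
```

If the log is abbreviated, so that not every text file is listed, the check reports `consistent: False`; that signals an incomplete log rather than a faulty export.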

## Step 7: Generate IIIF manifests

In [34]:
II = IIIF(Tei.teiVersion, A, PAGESEQ_JSON, prod=True, silent=False)
# II = IIIF(Tei.teiVersion, A, PAGESEQ_JSON, prod=False, silent=False)
II.manifests()

Maximum dimensions: W = 8504 H = 5976
Average dimensions: W = 4138 H = 4869
Average deviation:  W = 1038 H =  660
Maximum dimensions: W = 5600 H = 5786
Average dimensions: W = 3253 H = 4359
Average deviation:  W =  477 H =  607
Collections:
   02 with  262 pages
   03 with  660 pages
   04 with  806 pages
   05 with  688 pages
   06 with  684 pages
   07 with  628 pages
   08 with  988 pages
   09 with  944 pages
   10 with 1062 pages
   11 with 1116 pages
   12 with  928 pages
IIIF manifests generated in ~/gitlab.huc.knaw.nl/suriano/letters/static/prod/manifests
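
To compare a generated manifest with the page counts listed above, you can count its canvases. The sketch below does not know which version of the IIIF Presentation API the manifests follow, so it handles both layouts: version 3 lists canvases under `items`, version 2 under `sequences[].canvases`. The function name `canvas_count` and the tiny sample manifests are ours; they are not taken from the generated files.

```python
import json


def canvas_count(manifest: dict) -> int:
    """Count the canvases in a IIIF manifest (Presentation API 2 or 3)."""
    if "items" in manifest:
        # Presentation API 3: the canvases are the items of the manifest
        return sum(1 for item in manifest["items"] if item.get("type") == "Canvas")
    # Presentation API 2: the canvases live inside the sequences
    return sum(len(seq.get("canvases", [])) for seq in manifest.get("sequences", []))


# Minimal hand-made samples, for illustration only
SAMPLE_V3 = json.loads(
    '{"type": "Manifest", "items": [{"type": "Canvas"}, {"type": "Canvas"}]}'
)
SAMPLE_V2 = json.loads(
    '{"@type": "sc:Manifest", "sequences": [{"canvases": [{}, {}, {}]}]}'
)

print(canvas_count(SAMPLE_V3))
print(canvas_count(SAMPLE_V2))

# For a real manifest, load the file first, e.g.:
# with open("02.json") as fh:
#     print(canvas_count(json.load(fh)))
```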


## Step 8: Deploy to k8s and TeamText VM

**N.B.** You need access to the k8s cluster and to the Team Text VM.

That means:

* The LDAP of the relevant k8s clusters knows you
* You have an SSH key-based login on the Team Text VPN
* You work inside the firewall

In [35]:
!./provision.sh watm

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

WATM export version: 1.0.0e-028
anno-1.tsv                                    100%   17MB   5.9MB/s   00:02    
anno-3.tsv                                    100% 9773KB   6.2MB/s   00:01    
anno-2.tsv                                    100%   15MB   6.9MB/s   00:02    
anno2node.tsv                                 100% 5668KB   6.5MB/s   00:00    
text-6.tsv                                    100% 1224KB   6.0MB/s   00:00    
text-7.tsv                                    100% 1192KB   6.0MB/s   00:00    
text-5.tsv                                    100% 1014KB   5.8MB/s   00:00    
text-4.tsv                                    100%  910KB   6.1MB/s   00:00    
text-0.tsv                                    100%  266KB   5.5MB/s   00:00    
text-1.tsv                                    100%  702KB   5.7MB/s   00:00    
text-3.tsv                                   

In [36]:
!./provision.sh files

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

copying to pod: prod/covers.html
copying to pod: prod/logo
copying to pod: prod/manifests
copying to pod: both/metadata


In [None]:
!./provision.sh prod images

## Step 9: Test the images

* [covers](https://data.suriano.huygens.knaw.nl/files/covers.html)

* [02.json](https://data.suriano.huygens.knaw.nl/files/manifests/02.json)

* [page 02_071r](https://data.suriano.huygens.knaw.nl/iiif/3/pages%2F02_071r.jpg/full/max/0/default.jpg)
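
The page link follows the IIIF Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, where the slash inside the identifier is percent-encoded as `%2F`. The sketch below builds such a URL for any page name. It assumes that all page images live under `pages/` with a `.jpg` extension, as in the link above; the function name `page_image_url` is ours.

```python
from urllib.parse import quote

# Base URL taken from the page link above
BASE = "https://data.suriano.huygens.knaw.nl/iiif/3"


def page_image_url(
    page: str,
    region: str = "full",
    size: str = "max",
    rotation: str = "0",
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Build a IIIF Image API URL for a page scan such as '02_071r'."""
    # safe="" makes sure that the slash in the identifier becomes %2F
    identifier = quote(f"pages/{page}.jpg", safe="")
    return f"{BASE}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


print(page_image_url("02_071r"))
```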

# On the command line: Step by step

Here we run the main steps of the conversion one by one.

Every step is a separate run of a Python program.
When a step completes, everything the next step needs has been saved to disk in the form
of result files and report files.

So if the results of the earlier steps are already on disk, you can run any later step on its own.
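
Because every step reads its input from disk, the pipeline can be resumed at any step. The sketch below assembles the `make.py` commands used in the cells of this section, starting from a chosen step. The task names, arguments and the `--no-prod` flag are copied from those cells; the helper names (`step_args`, `build_commands`, `run`) are ours and are not part of `make.py`.

```python
import subprocess

# Order of the steps as they appear in this section
STEPS = ["ingest", "scans", "docx2tei", "tei2tf", "ner", "watm", "iiif", "deploy"]

# Only these steps distinguish between production and development
PROD_SENSITIVE = {"watm", "iiif", "deploy"}


def step_args(step: str, version: str) -> list:
    """Arguments for one step, mirroring the notebook cells."""
    if step == "ingest":
        return ["-"]
    if step == "docx2tei" or step == "ner":
        return [version]
    if step == "tei2tf":
        return ["-", version]
    if step == "watm" or step == "iiif":
        return [version + "e"]
    return []  # scans and deploy take no arguments


def build_commands(version: str, start: str = "ingest", prod: bool = True) -> list:
    """Commands for all steps from `start` onwards."""
    commands = []
    for step in STEPS[STEPS.index(start):]:
        cmd = ["python", "make.py", step] + step_args(step, version)
        if not prod and step in PROD_SENSITIVE:
            cmd.append("--no-prod")
        commands.append(cmd)
    return commands


def run(commands: list, dry_run: bool = True) -> None:
    """Print the commands; execute them only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


# Dry run: show what would be executed when resuming at the WATM step
run(build_commands("1.0.1", start="watm"))
```

With `check=True` a failing step raises an exception, so later steps never run on stale or missing results.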

## Step 0: Initialization

**N.B.** Check the VERSION variable here!

In [4]:
VERSION

'0.7.1'

## Step 1: Scan ingest

In [9]:
%%time

!python make.py ingest -

Ingest scans ...
CPU times: user 9.15 ms, sys: 7.67 ms, total: 16.8 ms
Wall time: 1.29 s


## Step 2: Scan processing

In [10]:
%%time

!python make.py scans

Process scans ...
CPU times: user 6.13 ms, sys: 7.51 ms, total: 13.6 ms
Wall time: 615 ms


## Step 3: From DOCX to TEI

In [11]:
%%time

!python make.py docx2tei $VERSION

DOCX ==> TEI files ...
DOCX => simple TEI per filza ...
DOCX => headers ...
Collecting transcribers ...
Collecting page scans ...
Collecting excel metadata ...
simple TEI per filza => enriched TEI per letter ...
CPU times: user 12.8 ms, sys: 9.18 ms, total: 22 ms
Wall time: 1.66 s


## Step 4: From TEI to TF

In [15]:
%%time

!python make.py tei2tf - $VERSION

TEI => TF ...
	Validating TEI ...
	Converting TEI ...
	Loading TF ...
CPU times: user 322 ms, sys: 119 ms, total: 441 ms
Wall time: 1min 9s


## Step 5: Mark named entities

In [16]:
%%time

!python make.py ner $VERSION

Annotate named entities ...
	Loading TF  ...
5 rows with a duplicate name:
  r305: William, Count of Nassau-Siegen
  r359: William of Orange
  r361: Maurice of Nassau
  r506: Henry II, Duke of Lorraine
  r645: Guillaume III de Melun
1 row without a name:
	e.g.: 365
149 rows without triggers:
	e.g.: 6, 8, 16, 21, 24, 26, 29, 30, 32, 34
Clash: Gugliemo di Nassau: r9 vs r50
Clash: Nicolò Perez: r75 vs r571
Clash: conte di Frusten: r148 vs r149
Clash: conte di Wanderlip: r150 vs r151
Clash: colonello Sciombergh: r253 vs r352
Clash: colonnello Sciombergh: r253 vs r352
Clash: signor di Rocalaura: r386 vs r459
Clash: colonel Rocalaura: r386 vs r459
Clash: monsignor di Rocalaura: r386 vs r459
Clash: colonello Rocalaura: r386 vs r459
	491 entities targeted with 6063 occurrences. See ~/gitlab.huc.knaw.nl/suriano/letters/_temp/ner/0.4.3/.people.0.6/hits.tsv
	Loading TF with entities ...
CPU times: user 306 ms, sys: 107 ms, total: 413 ms
Wall time: 1min 3s
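
The `Clash:` lines above each report a trigger string that is listed for two different rows of the entity sheet. The sketch below shows one way to detect such clashes. It is not the implementation used by `make.py ner`; the function name `find_clashes` and the toy data are ours. The toy data borrows a few names from the log but does not reproduce the real sheet.

```python
from collections import defaultdict


def find_clashes(triggers_by_row: dict) -> dict:
    """Map every trigger that occurs in more than one row to those rows."""
    rows_by_trigger = defaultdict(list)
    for row, triggers in triggers_by_row.items():
        for trigger in triggers:
            # Record each row at most once per trigger
            if row not in rows_by_trigger[trigger]:
                rows_by_trigger[trigger].append(row)
    return {
        trigger: rows
        for trigger, rows in rows_by_trigger.items()
        if len(rows) > 1
    }


# Toy data, for illustration only
TOY_SHEET = {
    "r9": ["Gugliemo di Nassau"],
    "r50": ["Gugliemo di Nassau"],
    "r75": ["Nicolò Perez"],
}

for trigger, rows in find_clashes(TOY_SHEET).items():
    print(f"Clash: {trigger}: {' vs '.join(rows)}")
```

A clash matters because an occurrence of the trigger in the text cannot be assigned to a single entity without further information.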


## Step 6: Convert TF to WATM

In [3]:
%%time

!python make.py watm "$VERSION"e
#!python make.py watm "$VERSION"e --no-prod

TF => WATM ...
	Loading TF ...
	Making WATM for version 0.4.3e
	Writing WATM ...
	Testing WATM ...
	OK - whether all tests passed
CPU times: user 45.9 ms, sys: 20.7 ms, total: 66.7 ms
Wall time: 9.3 s


## Step 7: Generate IIIF manifests

### Production

In [4]:
%%time

!python make.py iiif "$VERSION"e
#!python make.py iiif "$VERSION"e --no-prod

Generate IIIF manifests ...
CPU times: user 13.7 ms, sys: 13.1 ms, total: 26.8 ms
Wall time: 2.24 s


## Step 8: Deploy to k8s and TeamText VM

In [None]:
%%time

!python make.py deploy
#!python make.py deploy --no-prod

# Express: One shot

Here is the express way to convert the corpus: one single command runs the complete conversion.
If something goes wrong, you can follow the step-by-step section or the debugging section.

**N.B.** Check the VERSION variable here!

In [4]:
VERSION

'0.7.1'

In [None]:
%%time

!python make.py all $VERSION
# !python make.py all $VERSION --no-prod