# Convert from DOCX to TEI to TF to WATM

We convert DOCX to TEI to TF and then to WATM.

This notebook provides three levels of refinement in the execution. They all have the same outcome,
but they differ in the level of detail they provide on the conversion.

These are the levels:

* **Debugging**: all the commands are run directly in Python; the intermediate data remains in memory and can be inspected.
* **Step by step**: one command for each main step of the conversion.
* **Express**: one single command on the command line for the complete conversion.

# Production or development

Most steps are unaffected by the production/development setting.

In the first steps of the pipeline (*ingest* and *scan processing*) we prepare both the dev and the prod data.

The intermediate steps (*from DOCX to TEI*, *from TEI to TF*, *mark named entities*) are identical for prod and dev.

Only for the later steps (*convert TF to WATM*, *generate IIIF manifests*, *deploy to k8s*) is there a distinction between prod and dev.

For these steps we have commented out the line that does the dev version.
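
As an alternative to commenting lines in and out, the choice could be driven by a single variable. The following is a minimal sketch, not what this notebook does: the names `PROD` and `watm_options` are hypothetical, and the keyword arguments are copied from the `WATM(...)` call in Step 6.

```python
# Hypothetical switch: set to False to produce the development version
PROD = True


def watm_options(prod):
    """Keyword arguments for the WATM exporter call used in Step 6."""
    return dict(skipMeta=False, prod=prod)


print(watm_options(PROD))
```

With this in place, the call in Step 6 could read `WATM(A, "tei", **watm_options(PROD))`, so that switching between prod and dev means changing one variable.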

# Requirements

* zsh as command line shell (as in macOS);
* access to suitable k8s clusters, streamlined by the
  [k-suite](https://code.huc.knaw.nl/tt/smart-k8s/-/blob/main/docs/k-suite.md);
* [Pandoc](https://pandoc.org);
* [Python](https://www.python.org) (3.12 or higher) with additional pip-installable modules:
  * text-fabric
  * openpyxl
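
A quick way to check these requirements from Python is sketched below. It is a minimal sketch, not part of the pipeline: the function name `check_requirements` is hypothetical, and it assumes that text-fabric is importable as `tf` (as in the imports further on) and that Pandoc and zsh are on the PATH as `pandoc` and `zsh`. Access to the k8s clusters is not checked.

```python
import importlib.util
import shutil
import sys


def check_requirements(modules=("tf", "openpyxl"), tools=("pandoc", "zsh")):
    """Report which requirements are available (hypothetical helper).

    Returns a dict that maps each requirement name to True or False.
    """
    report = {"python>=3.12": sys.version_info >= (3, 12)}

    for module in modules:
        # find_spec returns None when a top-level module cannot be found
        report[f"module {module}"] = importlib.util.find_spec(module) is not None

    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH
        report[f"tool {tool}"] = shutil.which(tool) is not None

    return report


for name, ok in check_requirements().items():
    print(f"{'ok     ' if ok else 'MISSING'} {name}")
```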

# Declare the version

Always set the version before running any cell in this notebook!

In [1]:
VERSION = "0.0.1"
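
Because every later step depends on this value, it can help to fail early when it is malformed. The following is a minimal sketch, assuming that versions always consist of three dot-separated numbers as in `0.0.1`; that convention and the name `check_version` are assumptions, not something the pipeline enforces.

```python
import re

# Assumed convention: three dot-separated numbers, e.g. 0.0.1
VERSION_RE = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+")


def check_version(version):
    """Return the version unchanged if it looks like N.N.N, else raise ValueError."""
    if not isinstance(version, str) or not VERSION_RE.fullmatch(version):
        raise ValueError(f"unexpected version string: {version!r}")
    return version


print(check_version("0.0.1"))
```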

# Debugging mode

Now we dig a bit deeper, and do all the steps while keeping the program in memory.
This makes it possible to inspect all intermediate results.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.watm import WATM
from tf.advanced.helpers import dm

from processdocs import TeiFromDocx
from processhelpers import SOURCEBASE, REPORT_TEIDIR

In [4]:
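import re

# Matches a verse line that ends in a bare line number,
# e.g. "Ingrata, vini cantharo 5".
LINENUMBER_BARE_RE = re.compile(
    r"""
    ^
    (           # group 1: the verse text ...
        .*
        [a-z]   # ... which must contain at least one lowercase letter
        .*
    )
    \s+
    (           # group 2: the trailing line number
        [0-9]+
    )
    $
    """,
    re.M | re.X
)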

main = """
Iterum aperimus, nec tamen

Ingrata, vini cantharo 5

Dignata poetam patria est

Nec actione ephebulos

Cunctis probatos plurimum.

Vestris Thalia siccine

Facundus Atlantis nepos 10

Aversus est farinulis?

Nihil ille iam facit noui

Qui sordido lucellulo

Captus, Stygem colit nigram,

Vmbris studens degeneribus, 15

"""

In [5]:
# Move each trailing line number to the front of its line, wrapped in «»
main = LINENUMBER_BARE_RE.sub(r"«\2» \1\n", main)
print(main)


Iterum aperimus, nec tamen

«5» Ingrata, vini cantharo


Dignata poetam patria est

Nec actione ephebulos

Cunctis probatos plurimum.

Vestris Thalia siccine

«10» Facundus Atlantis nepos


Aversus est farinulis?

Nihil ille iam facit noui

Qui sordido lucellulo

Captus, Stygem colit nigram,

«15» Vmbris studens degeneribus,
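
To see what the substitution does to individual lines, the same idea can be tried on a small sample. This is a minimal sketch: the pattern is a compact form of `LINENUMBER_BARE_RE`, and the names `LINENUMBER_RE` and `move_line_number` are hypothetical. Unlike the cell above, it does not append an extra newline to the replacement, so no additional blank lines appear in the result.

```python
import re

# Compact form of the verbose pattern: text with a lowercase letter,
# whitespace, then a number at the end of the line.
LINENUMBER_RE = re.compile(r"^(.*[a-z].*)\s+([0-9]+)$", re.M)


def move_line_number(text):
    """Move a trailing line number to the front of its line, wrapped in «»."""
    return LINENUMBER_RE.sub(r"«\2» \1", text)


sample = "Iterum aperimus, nec tamen\nIngrata, vini cantharo 5\n"
print(move_line_number(sample))
```

Lines without a trailing number, and lines that consist of a number only, are left untouched, because the first group requires at least one lowercase letter.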





## Step 1: From DOCX to TEI

You might need to install `docx2python` first:

```
pip install docx2python
```

In [18]:
TFD = TeiFromDocx(silent=False)

GET docx files and their metadata from XLS ...

Collecting excel metadata ...
Metadata collected.
Continuing with 118 good files among 118 in total


In [29]:
TFD.task("markdown", forceClean=False)

DOCX => Markdown per work ...
	Laurimanus-Exodus : clean specific
	Laurimanus-Miles : clean specific
118 files done.


In [17]:
TFD.task("tei")

simple TEI => enriched TEI ...
	Alberti-Philodoxus             ... front-main-back
	Aler-Innocentia                ... front-main-back
	Balde-Iephtias                 ... front-main-back
	Barlandus-Dialogus             ... front-main-back
	Beza-Abrahamus                 ... front-main-back
	Brechtus-Euripus               ... front-main-back
	Caussin-Felicitas              ... front-main-back
	Caussin-Hermenigildus          ... front-main-back
	Caussin-Nabuchodonosor         ... front-main-back
	Caussin-Solyma                 ... front-main-back
	Caussin-Theodoricus            ... front-main-back
	Cellotius-Adrianus             ... front-main-back
	Cellotius-Chosroes             ... front-main-back
	Cellotius-Reviviscentes        ... front-main-back
	Cellotius-Sapor                ... front-main-back
	Celtis-Ludus                   ... front-main-back
	Claus-Vulpanser                ... front-main-back
	Crocus-Ioseph                  ... front-main-back
	Cunaeus-Dido                   .

## Step 4: From TEI to TF

### Check the validity of the TEI.

In [7]:
Tei = TEI(verbose=-1, sourceBase=SOURCEBASE, reportDir=REPORT_TEIDIR, tei="", tf=VERSION)

In [8]:
# validate = 1
validate = True
Tei.task(check=True, verbose=1, validate=validate)

TEI to TF checking: ~/github/HuygensING/translatin/datasource/tei => ~/github/HuygensING/translatin/report/tei
Processing instructions are ignored
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/translatin/datasource/schema/translatin.xsd
	round   1:  49 changes
 91 identical override(s)
  0 changing override(s)
Section model III
	   1 translatin                             Alberti-Philodoxus.xml                            
	   2 translatin                             Aler-Innocentia.xml                               
	   3 translatin                             Balde-Iephtias.xml                                
	   4 translatin                             Barlandus-Dialogus.xml                            
	   5 translatin                             Beza-Abrahamus.xml                        

True

### Convert the data

In [9]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Line model II with ln nodes for lines between lb elements
Page model II with page nodes for pages started by pb elements  keeping the pb elements
Section model III
Processing instructions are ignored
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/translatin/datasource/schema/translatin.xsd
	round   1:  49 changes
 91 identical override(s)
  0 changing override(s)
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
		   1 translatin                             Alberti-Philodoxus.xml                            
		   2 translatin                             Aler-Innocentia.xml                               
		   3 translatin                             Balde-Iephtias.xml                                
		   4 translatin        

True

### Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [32]:
Tei.task(app=True)

True

### Use the new dataset

The final proof that the conversion has worked is to load the data.
The first time the data is loaded, several checks and pre-computations are performed.
Subsequent loads are much quicker.

In [33]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.6.2
36 features found and 0 ignored
   |     0.49s T otype                from ~/github/HuygensING/translatin/tf/0.0.1
   |     5.30s T oslots               from ~/github/HuygensING/translatin/tf/0.0.1
  5.79s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.20s T chunk                from ~/github/HuygensING/translatin/tf/0.0.1
   |     4.01s T str                  from ~/github/HuygensING/translatin/tf/0.0.1
   |     0.00s T work                 from ~/github/HuygensING/translatin/tf/0.0.1
   |     3.22s T after                from ~/github/HuygensING/translatin/tf/0.0.1
   |     0.26s T part                 from ~/github/HuygensING/translatin/tf/0.0.1
   |      |     0.12s C __levels__           from otype, oslots, otext
   |      |     9.11s C __order__            from otype, oslots, __levels__
   |      |     0.36s C __rank__             from otype, __order__
   |      |       11s C __levUp__            from otype, o

Name,# of nodes,# slots / node,% coverage
work,118,13968.5,100
text,118,13826.04,99
body,118,12888.03,92
div,118,815.37,6
front,118,815.37,6
teiHeader,118,142.46,1
back,118,122.64,1
bibl,118,75.41,1
sourceDesc,118,75.41,1
titleStmt,118,55.74,0


## Step 6: Convert TF to WATM

N.B. For docs click the WATM link in the output cell.

In [38]:
WA = WATM(A, "tei", skipMeta=False, prod=True)
# WA = WATM(A, "tei", skipMeta=False, prod=False)
WA.makeText()
WA.makeAnno()
WA.writeAll()
WA.testAll()

textRepoLevel is section level 'work'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

	Writing WATM ...
Writing production data to ~/github/HuygensING/translatin/watm/0.0.1-002/prod
Text file    0:    11495 segments to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/text-0.tsv
Text file    1:    10895 segments to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/text-1.tsv
Text file    2:    44701 segments to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/text-2.tsv
Text file    3:     1121 segments to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/text-3.tsv
Text file    4:    14131 segments to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/text-4.tsv

Text files all:  1648283 segments to 118 files
Anno file    1:   400000 annotations written to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/anno-1.tsv
Anno file    2:    82181 annotations written to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/anno-2.tsv
Anno files all:   482181 annotations to 2 files
Slot mapping written to ~/github/HuygensING/translatin/watm/0.0.1-002/prod/pos2node.tsv
Node 

## Step 8: Deploy to k8s and TeamText VM

N.B. You need to have access to the k8s cluster and to the Team Text VM.

That means:

* The LDAP of the relevant k8s clusters knows you
* You have an ssh key-based login on the Team Text VPN
* You work inside the firewall

In [62]:
!./provision.sh watm

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

WATM export version: 1.0.1e-029
anno-1.tsv                                    100%   17MB   5.9MB/s   00:02    
anno-3.tsv                                    100%   10MB   6.4MB/s   00:01    
anno-2.tsv                                    100%   15MB   6.6MB/s   00:02    
anno2node.tsv                                 100% 5658KB   5.9MB/s   00:00    
text-6.tsv                                    100% 1228KB   4.3MB/s   00:00    
text-7.tsv                                    100% 1196KB   4.1MB/s   00:00    
text-5.tsv                                    100% 1018KB   6.0MB/s   00:00    
text-4.tsv                                    100%  914KB   5.8MB/s   00:00    
text-0.tsv                                    100%  267KB   4.3MB/s   00:00    
text-1.tsv                                    100%  705KB   6.2MB/s   00:00    
text-3.tsv                                   

In [63]:
!./provision.sh files

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

copying to pod: prod/covers.html
copying to pod: prod/logo
copying to pod: prod/manifests
copying to pod: both/metadata
