# Convert from DOCX to TEI to TF to WATM

We convert DOCX to TEI to TF and then to WATM.

This notebook offers three ways to run the conversion. They all produce the same outcome,
but they differ in how much detail of the conversion they expose.

These are the levels:

* **Debugging**: all the commands are run directly in Python; the intermediate data remains in memory and can be inspected;
* **Step by step**: one command for each main step of the conversion;
* **Express**: a single command on the command line for the complete conversion.

# Production or development

Most steps are unaffected by the production/development setting.

In the first steps of the pipeline (*ingest* and *scan processing*) we prepare both the dev and the prod data.

The intermediate steps (*from DOCX to TEI*, *from TEI to TF*, *mark named entities*) are identical for prod and dev.

Only in the last steps (*convert TF to WATM*, *generate IIIF manifests*, *deploy to k8s*) is there a distinction between prod and dev.

For these steps we have commented out the line that does the dev version.
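
Instead of swapping commented lines by hand, the choice could be kept in one place. The following is only a sketch: `PROD`, `mode_name`, `watm_kwargs` and `provision_command` are hypothetical names that do not occur in the notebook, and it is an assumption that `provision.sh` accepts `dev` where the notebook passes `prod`.

```python
# Hypothetical switch, not part of the notebook:
# True for production, False for development.
PROD = True


def mode_name(prod: bool) -> str:
    """Return the mode name as it appears in output paths and shell commands."""
    return "prod" if prod else "dev"


# Keyword arguments that could be passed on as WATM(A, "tei", **watm_kwargs).
watm_kwargs = {"skipMeta": False, "prod": PROD}

# Assumption: provision.sh accepts "dev" in the position where the
# notebook passes "prod".
provision_command = ["./provision.sh", "watm", mode_name(PROD)]

print(watm_kwargs)
print(" ".join(provision_command))
```

With such a switch, only the value of `PROD` would need to change between a production run and a development run.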

# Requirements

* zsh as command line shell (as in macOS);
* access to suitable k8s clusters, streamlined by the
  [k-suite](https://code.huc.knaw.nl/tt/smart-k8s/-/blob/main/docs/k-suite.md);
* [Pandoc](https://pandoc.org);
* [Python](https://www.python.org) (3.12 or higher) with additional pip-installable modules:
  * text-fabric
  * openpyxl
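
A quick way to see whether the local requirements are met is sketched below. It is not part of the pipeline: `check_requirements` is a hypothetical helper, and the import names (`tf` for text-fabric) and the inclusion of `docx2python` (which Step 1 may need) are assumptions based on how these packages are used in this notebook. Access to the k8s clusters is not checked.

```python
import importlib.util
import shutil
import sys

# pip package name -> import name (assumed mapping).
REQUIRED_MODULES = {
    "text-fabric": "tf",
    "openpyxl": "openpyxl",
    "docx2python": "docx2python",
}

# Command line tools that should be on the PATH.
REQUIRED_TOOLS = ["zsh", "pandoc"]


def check_requirements():
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []

    if sys.version_info < (3, 12):
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python 3.12 or higher is required, found {found}")

    for package, module in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing Python package: {package}")

    for tool in REQUIRED_TOOLS:
        if shutil.which(tool) is None:
            problems.append(f"missing command line tool: {tool}")

    return problems


for problem in check_requirements():
    print(problem)
```

If nothing is printed, all the checked requirements were found.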

# Declare the version

Always set the version before running any cell in this notebook!

In [1]:
VERSION = "0.0.3"

# Debugging mode

Here we dig a bit deeper and run all the steps in Python, keeping the program in memory.
This makes it possible to inspect all intermediate results.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.watm import WATM
from tf.advanced.helpers import dm

from processdocs import TeiFromDocx
from processhelpers import SOURCEBASE, REPORT_TEIDIR

## Step 1: From DOCX to TEI

You might need to install `docx2python` first:

```
pip install docx2python
```

In [4]:
TFD = TeiFromDocx(silent=False)

GET docx files and their metadata from XLS ...

Collecting excel metadata ...
Metadata collected.
Continuing with 118 good files among 118 in total
0 informational messages


In [5]:
# TFD.task("convert", forceClean=True, skipTeix=True, skipTei=True, workName="Zovitius-Ruth")

In [6]:
# TFD.task("convert", forceClean=True, skipTeix=True, skipTei=True)

In [8]:
TFD.task("convert", forceClean=True)

DOCX => Markdown per drama ...
	   drama                          |   lines | lnums | pages | folios| sect | conversion steps
	OK Alberti-Philodoxus             |     891 |     0 |    38 |     0 |   23 | clean special   teisimple tei 
	OK Aler-Innocentia                |    2755 |     0 |    40 |     0 |   36 | clean special   teisimple tei 
	OK Balde-Iephtias                 |    9861 |     0 |     0 |     0 |   72 | clean generic   teisimple tei 
	OK Barlandus-Dialogus             |      95 |     0 |     0 |     0 |    0 | clean generic   teisimple tei 
	OK Beza-Abrahamus                 |    3137 |     0 |    46 |     0 |    9 | clean special   teisimple tei 
	OK Brechtus-Euripus               |    6155 |   470 |   112 |    94 |   13 | clean special   teisimple tei 
	OK Caussin-Felicitas              |    4235 |     0 |     0 |     0 |   23 | clean generic   teisimple tei 
	OK Caussin-Hermenigildus          |    1145 |     0 |   101 |     0 |    7 | clean special   teisimple tei 
	O

## Step 4: From TEI to TF

### Check the validity of the TEI.

In [9]:
Tei = TEI(verbose=-1, sourceBase=SOURCEBASE, reportDir=REPORT_TEIDIR, tei="", tf=VERSION)

In [10]:
# validate = 1
validate = True
Tei.task(check=True, verbose=1, validate=validate)

TEI to TF checking: ~/github/HuygensING/translatin/tei => ~/github/HuygensING/translatin/report/tei
Processing instructions are ignored
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/translatin/schema/translatin.xsd
	round   1:  49 changes
 91 identical override(s)
  0 changing override(s)
Section model III
118 translatin file(s) ...
	Validating ...
	Making inventory ...

584 tag(s) type info written to ~/github/HuygensING/translatin/report/tei/types.txt
Validation OK
35 tags of which 0 with multiple namespaces written to ~/github/HuygensING/translatin/report/tei/namespaces.txt
972 info line(s) written to ~/github/HuygensING/translatin/report/tei/elements.txt
Refs written to ~/github/HuygensING/translatin/report/tei/refs.txt
	resolvable: 3109 in 3109
	dangling:      1 in    1
	ALL:        31

True

### Convert the data

In [11]:
Tei.good = True
Tei.task(convert=True, verbose=0)

Section model III
Processing instructions are ignored
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/translatin/schema/translatin.xsd
	round   1:  49 changes
 91 identical override(s)
  0 changing override(s)
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
		   1 translatin                             Alberti-Philodoxus.xml                            
		   2 translatin                             Aler-Innocentia.xml                               
		   3 translatin                             Balde-Iephtias.xml                                
		   4 translatin                             Barlandus-Dialogus.xml                            
		   5 translatin                             Beza-Abrahamus.xml                     

True

### Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [12]:
Tei.task(app=True)

True

### Use the new dataset

The final proof that the conversion has worked is to load the data.
The first time the data is loaded, several checks and pre-computations are performed.
Subsequent loads are much quicker.

In [13]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.6.4
43 features found and 0 ignored
   |     0.37s T otype                from ~/github/HuygensING/translatin/tf/0.0.3
   |     4.62s T oslots               from ~/github/HuygensING/translatin/tf/0.0.3
  4.99s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.03s T chunk                from ~/github/HuygensING/translatin/tf/0.0.3
   |     3.31s T after                from ~/github/HuygensING/translatin/tf/0.0.3
   |     4.13s T str                  from ~/github/HuygensING/translatin/tf/0.0.3
   |     0.00s T part                 from ~/github/HuygensING/translatin/tf/0.0.3
   |     0.00s T drama                from ~/github/HuygensING/translatin/tf/0.0.3
   |      |     0.08s C __levels__           from otype, oslots, otext
   |      |       10s C __order__            from otype, oslots, __levels__
   |      |     0.33s C __rank__             from otype, __order__
   |      |       11s C __levUp__            from otype, o

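| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| drama | 118 | 13997.89 | 100 |
| body | 118 | 13889.82 | 99 |
| text | 118 | 13889.82 | 99 |
| part | 510 | 3238.73 | 100 |
| chunk | 13522 | 122.15 | 100 |
| div | 2928 | 1066.98 | 189 |
| teiHeader | 118 | 108.07 | 1 |
| fileDesc | 118 | 103.06 | 1 |
| bibl | 118 | 75.37 | 1 |
| sourceDesc | 118 | 75.37 | 1 |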

## Step 6: Convert TF to WATM

N.B. For docs click the WATM link in the output cell.

In [17]:
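# Production export; for the dev version, swap the comment marker on the next two lines.
WA = WATM(A, "tei", skipMeta=False, prod=True)
# WA = WATM(A, "tei", skipMeta=False, prod=False)
WA.makeText()  # generate the text segments
WA.makeAnno()  # generate the annotations
WA.writeAll()  # write the text and annotation files to disk
WA.testAll()  # run the exporter's tests on the result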

conversion settings read from ~/github/HuygensING/translatin/config/watm.yml
IIIF settings read from ~/github/HuygensING/translatin/config/iiif.yml
textRepoLevel is section level 'drama'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

	Writing WATM ...
Writing production data to ~/github/HuygensING/translatin/watm/0.0.3-017/prod
Text file    0:    11469 segments to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/text-0.tsv
Text file    1:    10815 segments to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/text-1.tsv
Text file    2:    44732 segments to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/text-2.tsv
Text file    3:     1091 segments to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/text-3.tsv
Text file    4:    13976 segments to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/text-4.tsv

Text files all:  1651751 segments to 118 files
Anno file    1:   368596 annotations written to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/anno-1.tsv
Inherited annotations: 7650
Anno files all:   368596 annotations to 1 file
Slot mapping written to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/pos2node.tsv
Node mapping written to ~/github/HuygensING/translatin/watm/0.0.3-017/prod/anno2node.tsv
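
After the export, it can be useful to compare the files on disk with the log. The sketch below counts them; `watm_inventory` is a hypothetical helper, not part of Text-Fabric, and it relies only on the file names that appear in the log above (`text-N.tsv`, `anno-N.tsv`, `pos2node.tsv`, `anno2node.tsv`).

```python
from pathlib import Path


def watm_inventory(export_dir):
    """Count the text and annotation files in a WATM export directory."""
    export_dir = Path(export_dir).expanduser()

    text_files = sorted(export_dir.glob("text-*.tsv"))
    anno_files = sorted(export_dir.glob("anno-*.tsv"))

    # The two mapping files mentioned at the end of the log.
    mappings = [
        name
        for name in ("pos2node.tsv", "anno2node.tsv")
        if (export_dir / name).is_file()
    ]

    return {
        "text": len(text_files),
        "anno": len(anno_files),
        "mappings": mappings,
    }


# Example call, with the directory taken from the log above:
# watm_inventory("~/github/HuygensING/translatin/watm/0.0.3-017/prod")
```

Going by the log, this run should yield 118 text files, 1 annotation file and both mapping files.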

## Step 8: Deploy to k8s and TeamText VM

NB: you need access to the k8s cluster and to the Team Text VM.

That means:

* the LDAP of the relevant k8s clusters knows you;
* you have an ssh key-based login on the Team Text VPN;
* you work inside the firewall.

In [18]:
!./provision.sh watm prod

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-translatin : type khelp for an overview of commands.

WATM export version: 0.0.3-017
text-32.tsv                                   100%   77KB   2.4MB/s   00:00    
text-26.tsv                                   100%   89KB   3.0MB/s   00:00    
text-112.tsv                                  100%   91KB   3.1MB/s   00:00    
text-106.tsv                                  100%   70KB   3.2MB/s   00:00    
anno-1.tsv                                    100%   14MB   7.8MB/s   00:01    
text-107.tsv                                  100%  105KB   5.6MB/s   00:00    
text-113.tsv                                  100%   93KB   4.5MB/s   00:00    
text-27.tsv                                   100%   77KB   3.9MB/s   00:00    
text-33.tsv                                   100%  134KB   5.5MB/s   00:00    
text-19.tsv                                   100%  121KB   4.4MB/s   00:00    
text-25.tsv                                