# Convert from DOCX to TEI to TF to WATM

We convert DOCX to TEI to TF and then to WATM.

This notebook offers three ways to run the conversion. They all produce the same outcome,
but they differ in how much detail of the conversion they expose.

These are the levels:

* **Debugging**: all commands are run directly in Python; the intermediate data remains in memory and can be inspected;
* **Step by step**: one command for each main step of the conversion;
* **Express**: one single command on the command line for the complete conversion.

# Production or development

Most steps are unaffected by the production/development setting.

In the first steps of the pipeline (*ingest* and *scan processing*) we prepare both the dev and the prod data.

The intermediate steps (*from DOCX to TEI*, *from TEI to TF*, *mark named entities*) are identical for prod and dev.

Only for the later steps (*convert TF to WATM*, *generate IIIF manifests*, *deploy to k8s*) is there a distinction between prod and dev.

For these steps we have commented out the line that does the dev version.
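
As an alternative to commenting lines in and out, the prod/dev choice could be kept in a single flag. The sketch below is only an illustration: `PROD` and `watm_options` are hypothetical names that do not occur in this notebook, and the keyword arguments mirror the `WATM(...)` call in the TF to WATM step.

```python
# Hypothetical flag: True for the production export, False for the development export.
PROD = True


def watm_options(prod: bool) -> dict:
    """Build the keyword arguments passed to WATM(...) in the TF to WATM step."""
    return {"skipMeta": False, "prod": prod}


print(watm_options(PROD))

# Intended use in the WATM cell (A is the loaded TF app):
# WA = WATM(A, "tei", **watm_options(PROD))
```

With such a flag, switching between prod and dev means changing one value at the top of the notebook.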

# Requirements

* zsh as command line shell (as in macOS);
* access to suitable k8s clusters, streamlined by the 
  [k-suite](https://code.huc.knaw.nl/tt/smart-k8s/-/blob/main/docs/k-suite.md);
* [Pandoc](https://pandoc.org);
* [Python](https://www.python.org) (3.12 or higher) with additional pip-installable modules:
  * text-fabric
  * openpyxl
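
The local requirements can be checked up front with a few lines of standard-library Python. This is a minimal sketch, not part of the pipeline: `check_requirements` is a hypothetical helper, it assumes that text-fabric is importable as `tf` (as in the import cell of this notebook), and it does not check access to the k8s clusters.

```python
import importlib.util
import shutil
import sys


def check_requirements() -> dict[str, bool]:
    """Report which of the locally checkable requirements appear to be available."""
    return {
        "python>=3.12": sys.version_info >= (3, 12),
        "zsh": shutil.which("zsh") is not None,
        "pandoc": shutil.which("pandoc") is not None,
        # text-fabric is imported under the module name `tf`
        "text-fabric": importlib.util.find_spec("tf") is not None,
        "openpyxl": importlib.util.find_spec("openpyxl") is not None,
    }


for name, ok in check_requirements().items():
    print(f"{'ok     ' if ok else 'MISSING'} {name}")
```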

# Declare the version

Always set the version before running any cell in this notebook!

In [1]:
VERSION = "0.0.1"
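
Because every later cell depends on `VERSION`, a quick sanity check can catch a typo early. The sketch below assumes a three-part numeric version such as `0.0.1`; that format is an assumption made here for illustration, not a requirement stated by Text-Fabric, and `looks_like_version` is a hypothetical helper.

```python
import re

VERSION = "0.0.1"


def looks_like_version(value: str) -> bool:
    """Return True if value consists of three dot-separated non-negative integers."""
    return re.fullmatch(r"\d+\.\d+\.\d+", value) is not None


if not looks_like_version(VERSION):
    raise ValueError(f"Unexpected version string: {VERSION!r}")
```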

# Debugging mode

Now we dig a bit deeper and do all the steps while keeping the program in memory.
This makes it possible to inspect all intermediate results.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from tf.app import use
from tf.convert.tei import TEI
from tf.convert.watm import WATM
from tf.advanced.helpers import dm

from processdocs import TeiFromDocx
from processhelpers import SOURCEBASE, REPORT_TEIDIR

## Step 1: From DOCX to TEI

You might need to install `docx2python` first:

```
pip install docx2python
```

In [57]:
TFD = TeiFromDocx(silent=False)

GET docx files and their metadata from XLS ...

Collecting excel metadata ...
Metadata collected.


Alberti-Philodoxus            :    1 metadata but no data          
Aler-Innocentia               :    2 metadata but no data          
Balde-Iephtias                :    3 metadata but no data          
Barlandus-Dialogus            :    4 metadata but no data          
Beza-Abrahamus                :    5 metadata but no data          
Brechtus-Euripus              :    6 metadata but no data          
Caussin-Felicitas             :    7 metadata but no data          
Caussin-Hermenigildus         :    8 metadata but no data          
Caussin-Nabuchodonosor        :    9 metadata but no data          
Caussin-Solyma                :   10 metadata but no data          
Caussin-Theodoricus           :   11 metadata but no data          
Cellotius-Adrianus            :   12 metadata but no data          
Cellotius-Chosroes            :   13 metadata but no data          
Cellotius-Reviviscentes       :   14 metadata but no data          
Cellotius-Sapor               :   15 metadata bu

Continuing with 16 good files among 16 in total


In [58]:
TFD.task("pandoc")

DOCX => simple TEI per work ...
16 files done.


In [60]:
TFD.task("tei")

simple TEI => enriched TEI ...
	Houthem-Gedeon                 ... front-main-back
	Laurimanus-Miles               ... front-main-back
	Macropedius-Adamus             ... front-main-back
	Macropedius-Aluta              ... front-main-back
	Macropedius-Asotus             ... front-main-back
	Macropedius-Bassarus           ... front-main-back
	Macropedius-Hecastus           ... front-main-back
	Macropedius-Hypomone           ... front-main-back
	Macropedius-Iesus              ... front-main-back
	Macropedius-Iosephus           ... front-main-back
	Macropedius-Lazarus            ... front-main-back
	Macropedius-Petriscus          ... front-main-back
	Macropedius-Rebelles           ... front-main-back
	Stymmelius-Studentes           ... front-main-back
	Vernulaeus-Crispus             ... front-main-back
	Vernulaeus-Gorcomienses        ... front-main-back

    1 x      special replacement
See ~/github/HuygensING/translatin/report/trans/info.txt

1 informational message




## Step 4: From TEI to TF

### Check the validity of the TEI.

In [61]:
Tei = TEI(verbose=-1, sourceBase=SOURCEBASE, reportDir=REPORT_TEIDIR, tei="", tf=VERSION)

In [62]:
validate = 1
# validate = True
Tei.task(check=True, verbose=1, validate=validate)

TEI to TF checking: ~/github/HuygensING/translatin/datasource/tei => ~/github/HuygensING/translatin/report/tei
Processing instructions are ignored
XML validation will be performed
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/translatin/datasource/schema/translatin.xsd
	round   1:  49 changes
 91 identical override(s)
  0 changing override(s)
Section model III
	   1 translatin                             Houthem-Gedeon.xml                                
	   2 translatin                             Laurimanus-Miles.xml                              
	   3 translatin                             Macropedius-Adamus.xml                            
	   4 translatin                             Macropedius-Aluta.xml                             
	   5 translatin                             Macropedius-Asotus.xml                    

25 validation error(s) in 15 file(s) written to ~/github/HuygensING/translatin/report/tei/errors.txt


False
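
The check returns `False` because validation errors were found; the details are in the `errors.txt` file named in the output. A minimal sketch for peeking at that report from within the notebook follows. The path is copied from the output above and will differ on other machines, `head` is a hypothetical helper, and nothing is assumed about the format of the file.

```python
from pathlib import Path

# Path taken from the output above; adjust it to your own clone.
ERRORS_FILE = Path("~/github/HuygensING/translatin/report/tei/errors.txt").expanduser()


def head(path: Path, n: int = 20) -> list[str]:
    """Return the first n lines of a text file, or an empty list if it does not exist."""
    if not path.is_file():
        return []
    with path.open(encoding="utf8") as fh:
        # zip stops after n lines without reading the rest of the file
        return [line.rstrip("\n") for _, line in zip(range(n), fh)]


for line in head(ERRORS_FILE):
    print(line)
```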

### Convert the data

In [13]:
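# Mark the TEI object as good, so that the conversion runs
# even though the validation above reported errors
Tei.good = True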
Tei.task(convert=True, verbose=0)

Line model II with ln nodes for lines between lb elements
Page model II with page nodes for pages started by pb elements  keeping the pb elements
Section model I
Processing instructions are ignored
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/annotation/text-fabric/tf/tools/tei/tei_all.xsd
	round   1: 232 changes
Analysing ~/github/HuygensING/suriano/datasource/schema/suriano.xsd
	round   1:  68 changes
137 identical override(s)
  1 changing override(s)
	metamark mixed ==> pure
  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s OK
   |     0.00s Following director... 
	Start folder 02:
		   1 suriano                                001.xml                                           
		   2 suriano                                002.xml                                           
		   3 suriano                                003.xml                                  

True

### Configure a TF app

The TF app has configuration settings, a bit of custom code, and documentation.

Most of it will be generated now, but there are ways to keep custom additions intact.

In [14]:
Tei.task(app=True)

True

### Use the new dataset

The final proof that the conversion has worked is to load the data.
On first loading, several checks and pre-computations are performed.
Subsequent loads are much quicker.

In [15]:
A = use(f"{Tei.org}/{Tei.repo}:clone", backend=Tei.backend, checkout="clone", silent="verbose", hoist=globals())

**Locating corpus resources ...**

This is Text-Fabric 12.5.5
37 features found and 0 ignored
   |     0.49s T otype                from ~/github/HuygensING/suriano/tf/1.0.2
   |     6.71s T oslots               from ~/github/HuygensING/suriano/tf/1.0.2
  7.20s Dataset without structure sections in otext:no structure functions in the T-API
   |     4.05s T str                  from ~/github/HuygensING/suriano/tf/1.0.2
   |     0.08s T chunk                from ~/github/HuygensING/suriano/tf/1.0.2
   |     0.00s T file                 from ~/github/HuygensING/suriano/tf/1.0.2
   |     3.46s T after                from ~/github/HuygensING/suriano/tf/1.0.2
   |     0.00s T folder               from ~/github/HuygensING/suriano/tf/1.0.2
   |      |     0.11s C __levels__           from otype, oslots, otext
   |      |       15s C __order__            from otype, oslots, __levels__
   |      |     0.39s C __rank__             from otype, __order__
   |      |       15s C __levUp__            from otype, oslots, __rank__
   | 

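| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| folder | 11 | 158047.18 | 100 |
| file | 725 | 2397.96 | 100 |
| body | 725 | 2206.0 | 92 |
| text | 725 | 2206.0 | 92 |
| div | 4148 | 737.03 | 176 |
| table | 243 | 217.58 | 3 |
| teiHeader | 725 | 191.96 | 8 |
| page | 8764 | 157.79 | 80 |
| correspDesc | 725 | 125.82 | 5 |
| sourceDesc | 725 | 51.06 | 2 |
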

## Step 6: Convert TF to WATM

N.B. For documentation, click the WATM link in the output cell.

In [30]:
WA = WATM(A, "tei", skipMeta=False, prod=True)
# WA = WATM(A, "tei", skipMeta=False, prod=False)
WA.makeText()
WA.makeAnno()
WA.writeAll()
WA.testAll()

textRepoLevel is section level 'folder'


[WATM exporter documentation](https://annotation.github.io/text-fabric/tf/convert/watm.html)

	Writing WATM ...
Writing production data to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod
Text file    0:    44343 segments to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/text-0.tsv
Text file    1:   117901 segments to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/text-1.tsv
Text file    2:   147685 segments to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/text-2.tsv
Text file    3:   109425 segments to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/text-3.tsv
Text file    4:   154932 segments to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/text-4.tsv
Text file   10:   132279 segments to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/text-10.tsv
Text files all:  1738519 segments to 11 files
Anno file    1:   400000 annotations written to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/anno-1.tsv
Anno file    2:   400000 annotations written to ~/github/HuygensING/suriano/watm/1.0.2e-030/prod/anno-2.tsv
Anno file    3:   246988 annotations written to ~/github/Huyg

## Step 8: Deploy to k8s and TeamText VM

N.B. You need access to the k8s cluster and to the TeamText VM.

That means:

* the LDAP of the relevant k8s clusters knows you;
* you have an SSH key-based login on the Team Text VPN;
* you work inside the firewall.

In [62]:
!./provision.sh watm

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

WATM export version: 1.0.1e-029
anno-1.tsv                                    100%   17MB   5.9MB/s   00:02    
anno-3.tsv                                    100%   10MB   6.4MB/s   00:01    
anno-2.tsv                                    100%   15MB   6.6MB/s   00:02    
anno2node.tsv                                 100% 5658KB   5.9MB/s   00:00    
text-6.tsv                                    100% 1228KB   4.3MB/s   00:00    
text-7.tsv                                    100% 1196KB   4.1MB/s   00:00    
text-5.tsv                                    100% 1018KB   6.0MB/s   00:00    
text-4.tsv                                    100%  914KB   5.8MB/s   00:00    
text-0.tsv                                    100%  267KB   4.3MB/s   00:00    
text-1.tsv                                    100%  705KB   6.2MB/s   00:00    
text-3.tsv                                   

In [63]:
!./provision.sh files

k-suite enabled
Context "k8s-10-26-2-0" modified.

Quick access to iiif-suriano : type khelp for an overview of commands.

copying to pod: prod/covers.html
copying to pod: prod/logo
copying to pod: prod/manifests
copying to pod: both/metadata
