# Convert from TEI to TF

We show how to convert a TEI data source into a Text-Fabric (TF) dataset.

The conversion has two stages:

1. make a preliminary TF dataset with the character as slot type;
1. feed the plain text to a tokeniser and add tokens and sentences to the dataset,
   while removing its character and word nodes;
   the new slot type is token.

A dataset based on characters is precise but rather inefficient.
The second stage makes the dataset much more efficient, because there are far fewer tokens than characters.
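
To get a feel for the difference between the two slot types, the sketch below counts character slots and token slots for a single sentence. It is only an illustration: the regular expression is a stand-in tokeniser of our own, not the NLP pipeline used in the second stage, and `count_slots` is a hypothetical helper.

```python
import re

# Stand-in tokeniser (an assumption, for illustration only):
# a token is a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def count_slots(text):
    """Return (number of character slots, number of token slots)."""
    n_chars = len(text)
    n_tokens = len(TOKEN_RE.findall(text))
    return n_chars, n_tokens


n_chars, n_tokens = count_slots("Call me Ishmael.")
print(f"{n_chars} character slots versus {n_tokens} token slots")
```

Even in this tiny example the token-based view needs a quarter of the slots of the character-based view.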

## Preliminary conversion

In [18]:
!tf-fromtei all

	Validating ...

Namespaces OK
chapter  140 EPILOGUE
App updated


## Add tokens and sentences

Now we have a preliminary TF dataset to work with.
The next step no longer involves the source TEI.
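
Adding tokens to a character-based dataset presupposes that each token can be located as a span of character slots. The sketch below shows one plausible way to do that. It is an assumption on our part, not the actual `addnlp` implementation: the tokeniser is a regular-expression stand-in and `token_spans` is a hypothetical helper.

```python
import re

# Stand-in tokeniser (an assumption, for illustration only).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def token_spans(text):
    """Yield (first_slot, last_slot, token) for each token in text.

    Character slots are numbered from 1, as slots are in TF.
    """
    for match in TOKEN_RE.finditer(text):
        # match.start() is 0-based and match.end() is exclusive,
        # so the inclusive 1-based span is (start + 1, end).
        yield match.start() + 1, match.end(), match.group()


for span in token_spans("Call me Ishmael."):
    print(span)
```

Once every token has such a span, token nodes can take over as slots and the character nodes can be dropped.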

In [19]:
!addnlp all


  1.98s Using NLP pipeline Spacy (en) ...
    47s NLP done
  0.00s Feature overview: 24 for nodes; 1 for edges; 1 configs; 9 computed


In [20]:
!tf-fromtei apptoken

App updated with tokens and sentences 


# Zip the data

This step produces a zip file that you can attach to the latest release, so that TF can download the data smoothly.
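
For readers who want to see what zipping a data directory amounts to, here is a minimal sketch using only the Python standard library. It does not reproduce how `tf-zipall` works internally; `zip_directory` and the paths in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir, zip_path):
    """Zip all files under src_dir, with paths relative to src_dir.

    Returns the list of archive member names that were written.
    zip_path should lie outside src_dir, otherwise the archive
    would try to include itself.
    """
    src_dir = Path(src_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(src_dir).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


# Hypothetical usage:
# zip_directory("tf/0.2", "mydata.zip")
```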

In [21]:
!tf-zipall

loading tf app ...
Data to be zipped:
	OK       app                      (v0.2 baeb5b)       : ~/github/annotation/mobydick/app
	OK       main data                (v0.2 baeb5b)       : ~/github/annotation/mobydick/tf/0.2
Writing zip file ...


# Inspect

We view the result in the TF browser.

To stop the browser, interrupt the kernel (press `i` twice in command mode).

In [22]:
!tf

This is Text-Fabric 12.0.5
Loading TF corpus data. Please wait ...
Setting up TF browser for annotation/mobydick  
**Locating corpus resources ...**
Using app in ~/github/annotation/mobydick/app:
	repo clone offline under ~/github (local github)
Using data in ~/github/annotation/mobydick/tf/0.2:
	repo clone offline under ~/github (local github)
TF setup done.
 * Running on http://localhost:15333
Press CTRL+C to quit
Opening corpus in browser
Press <Ctrl+C> to stop the TF browser
127.0.0.1 - - [11/Jul/2023 13:24:49] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [11/Jul/2023 13:24:49] "GET /browser/static/base.css HTTP/1.1" 200 -
127.0.0.1 - - [11/Jul/2023 13:24:49] "GET /browser/static/display.css HTTP/1.1" 200 -
127.0.0.1 - - [11/Jul/2023 13:24:49] "GET /browser/static/highlight.css HTTP/1.1" 200 -
127.0.0.1 - - [11/Jul/2023 13:24:49] "GET /browser/static/fonts.css HTTP/1.1" 200 -
127.0.0.1 - - [11/Jul/2023 13:24:49] "GET /browser/static/index.css HTTP/1.1" 200 -
127.0.0.1 - - [11/Jul/

# Use the new dataset

In [23]:
from tf.app import use

In [24]:
ORG = "annotation"
REPO = "mobydick"

In [25]:
A = use(f"{ORG}/{REPO}:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

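| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| text | 1 | 259591.0 | 100 |
| body | 1 | 258921.0 | 100 |
| chapter | 141 | 1842.25 | 100 |
| div | 138 | 1880.97 | 100 |
| front | 1 | 670.0 | 0 |
| teiHeader | 1 | 166.0 | 0 |
| chunk | 2606 | 99.63 | 100 |
| fileDesc | 1 | 134.0 | 0 |
| p | 2421 | 106.75 | 99 |
| note | 19 | 145.32 | 1 |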


In [26]:
sentences = F.otype.s("sentence")
len(sentences)

9774

In [27]:
for i in range(50, 150):
    s = sentences[i]
    print(f"SENTENCE {i + 1}: {T.text(s)}")

SENTENCE 51: by Herman Melville (1819-1891) 
SENTENCE 52: LOOMINGS
SENTENCE 53: Call me Ishmael. 
SENTENCE 54: Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. 
SENTENCE 55: It is a way I have of driving off the spleen, and regulating the circulation. 
SENTENCE 56: Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off—then, I account it high time to get to sea as soon as I can. 
SENTENCE 57: This is my substitute for pistol and ball. 
SENTENCE 58: With a phi

In [28]:
# Pretty-print the 100th sentence
sent = F.otype.s("sentence")[99]
A.pretty(sent)

We check whether any two distinct sentences have a token in common.

In [31]:
query = """
s1:sentence
&& s2:sentence

s1 # s2
"""

results = A.search(query)

  0.07s 0 results

Likewise, we check whether any sentence starts strictly inside another sentence.


In [29]:
query = """
sentence
  =: t1:token
  := t2:token
  
sentence
  =: t3:token
  
t1 < t3
t3 < t2
"""

results = A.search(query)

    22s 0 results


In [30]:
len(F.otype.s("note"))

19

In [31]:
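# Show the first two notes, each followed by the chunk that contains its first token
for i, nn in enumerate(F.otype.s("note")[0:2]):
    A.dm(f"### Note {i + 1}\n\n")
    # chunk around the first token of the note
    s1 = L.u(L.d(nn, otype="token")[0], otype="chunk")[0]
    A.pretty(nn, withNodes=True, full=True)
    # same chunk, with the note highlighted
    A.pretty(s1, withNodes=True, full=True, highlights={nn})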

### Note 1



### Note 2



In [32]:
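nt = F.otype.s("note")[5]            # the sixth note
nt1 = L.d(nt, otype="token")[0]      # first token of that note
ntmin = nt1 - 5                      # the token five slots before it
s = L.u(ntmin, otype="sentence")[0]  # the sentence containing that token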

In [33]:
A.plain(nt, withNodes=True, full=True)
A.plain(nt1, withNodes=True, full=True)
A.plain(ntmin, withNodes=True, full=True)
A.pretty(s, withNodes=True, full=True)

In [34]:
slots = F.otype.s("token")
len(slots)

258431
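
The two counts reported in this notebook, 9774 sentences and 258431 tokens, allow a quick back-of-the-envelope figure for sentence length. The sketch below assumes that every token belongs to exactly one sentence, which this notebook does not verify; the variable names are ours.

```python
# Counts copied from the cell outputs in this notebook
n_sentences = 9774   # len(F.otype.s("sentence"))
n_tokens = 258431    # len(F.otype.s("token"))

# Assumption: each token belongs to exactly one sentence
avg_tokens_per_sentence = n_tokens / n_sentences
print(f"{avg_tokens_per_sentence:.2f} tokens per sentence on average")
```

Under that assumption the average comes to about 26.44 tokens per sentence.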