# Conversion of The Real Bible to Text-Fabric dataset

In this notebook, we will convert a very simple text to a Text-Fabric dataset. The notebook is based on [this](https://nbviewer.org/github/annotation/banks/blob/master/programs/convert.ipynb) notebook by Dirk Roorda. Our text has a slightly simpler structure, and we add a few more word features.

Note that once you have converted the dataset, you still need to configure the app.

### The text

We will convert the text of an old book, consisting of two pages, to a Text-Fabric dataset. The text is divided into the sections "book", "page" and "line", and the slots of the dataset are the words in the text. Book nodes have the feature "title", page nodes have the features "number" and "author", and line nodes have the features "number" and "language". Words have the features "text", "lexeme", and "language".

In [1]:
source = '''
# TheRealBible

## PAGE 1
$ AUTHOR MOSES
EN The cat sat on the mat
NL De kat zat op de mat
DE Die Katze saß auf der Matte

## PAGE 2
$ AUTHOR DAVID
EN He who laughs last laughs best
NL Wie het laatst lacht lacht het best
'''

The text contains several pieces of metadata: 

- The first line, "# TheRealBible", contains the title of the book. It is not part of the text.
- "## PAGE 1" and "## PAGE 2" mark the start of a page and give its number.
- Lines starting with $ contain the author name of each page.
- Lines with the text of the book are introduced with a two letter code indicating the language of the text.
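
The four kinds of lines listed above can be told apart by their prefix. The following is a minimal sketch of that idea, separate from the conversion itself; the function name `classify_line` and the tuples it returns are assumptions made for this illustration only.

```python
def classify_line(line):
    """Return a (kind, value) pair for one line of the source text."""
    if not line.strip():
        return ("blank", None)
    if line.startswith("## "):
        # e.g. "## PAGE 1" -> page number 1
        return ("page", int(line.split("PAGE")[1]))
    if line.startswith("# "):
        # e.g. "# TheRealBible" -> book title
        return ("book", line[2:].strip())
    if line.startswith("$ "):
        # e.g. "$ AUTHOR MOSES" -> author name
        return ("author", line.split("AUTHOR")[1].strip())
    # Anything else is text: a two-letter language code, then the words.
    language, _, text = line.partition(" ")
    return ("text", (language, text.split()))


for example in ["# TheRealBible", "## PAGE 1", "$ AUTHOR MOSES", "EN The cat sat on the mat"]:
    print(classify_line(example))
```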

### Prepare the conversion

In [2]:
import os

from tf.fabric import Fabric
from tf.convert.walker import CV

In [3]:
DATA_FOLDER = 'tf'
VERSION = '0.1'

TF_PATH = f'./{DATA_FOLDER}/{VERSION}'
TF = Fabric(locations=TF_PATH, silent=True)

Define the slot type. The word is often the obvious choice, but a smaller unit, such as a sign, can sometimes be a better one.

In [4]:
slotType = 'word'

Create a dict that contains metadata. Who did the conversion? What are your sources? You will find this information at the top of each TF feature file.

In [5]:
generic = {
    'name': 'The Real Bible',
    'compiler': 'Martijn Naaijer',
    'source': 'Various sacred texts',
    'version': '0.1',
    'purpose': 'exposition'
}

A text can have several representations, for example when it is written in a non-Latin script and you want to add a transcription in Latin script. You define these representations here as text formats: templates with placeholders for the features you want to display.
You also define the section types and their features here.

In [6]:
otext = {
    'fmt:text-orig-full': '{text} ',
    'sectionTypes': 'book,page,line',
    'sectionFeatures': 'title,number,number',
}
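
As a rough illustration of the placeholder idea, the template `'{text} '` can be filled in with Python's own `str.format`. This is only a sketch: Text-Fabric applies its text formats internally, and the helper name `render_words` is made up for this example.

```python
template = "{text} "


def render_words(words, template):
    """Fill the template once per word and concatenate the results."""
    return "".join(template.format(text=word) for word in words)


rendered = render_words(["The", "cat", "sat"], template)
print(repr(rendered))
```

Because the template ends in a space, every word is followed by one, including the last word.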

Most features of a text are categorical and have string values, but some features have integer values. List the integer-valued features here.

In [7]:
intFeatures = {
  'number'
}

Give a short description of all your features. That is useful for the users of the dataset, and for yourself!

In [8]:
featureMeta = {
    'number': {
        'description': 'number of page, or line on page',
    },
    'title': {
        'description': 'title of a book',    
    },
    'author': {
        'description': 'the author of a page',
    },
    'language': {
        'description': 'language of a line or of a word',
    },
    'text': {
        'description': 'the text of a word',
    },
    'lexeme': {
        'description': 'lexeme of a word',
    },
}

The dict languages_dict maps each word form found in the text (in lower case) to its lexeme. It contains a sub-dictionary for each language in the text.

In [9]:
languages_dict = {
         'EN': {'the': 'the', 
                'cat': 'cat', 
                'sat': 'sit', 
                'on': 'on', 
                'mat': 'mat', 
                'he': 'he', 
                'who': 'who', 
                'laughs': 'laugh', 
                'last': 'late', 
                'best': 'good'},

         'DE': {'die': 'die', 
               'katze': 'katze', 
               'saß': 'sitzen', 
               'auf': 'auf', 
               'der': 'der', 
               'matte': 'matte'},

        'NL': {'de': 'de', 
              'kat': 'kat', 
              'zat': 'zitten', 
              'op': 'op', 
              'mat': 'mat', 
              'wie': 'wie', 
              'het': 'het', 
              'laatst': 'laat', 
              'lacht': 'lachen', 
              'best': 'goed'}
}
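
The lookup that the conversion performs on such a dictionary can be sketched in isolation. The small table and the helper name `lookup_lexeme` below are stand-ins invented for this example. Unlike a plain `languages_dict[language]` lookup, this sketch also falls back to the word itself when the language is unknown, which is an assumption of the example and not something the conversion does.

```python
# A two-entry stand-in for the full dictionary, just for this illustration.
lexemes = {"EN": {"sat": "sit", "laughs": "laugh"}}


def lookup_lexeme(word, language, table):
    """Return the lexeme for a word, or the word itself if it is not listed."""
    return table.get(language, {}).get(word.lower(), word)


print(lookup_lexeme("Sat", "EN", lexemes))    # listed, matched case-insensitively
print(lookup_lexeme("Moses", "EN", lexemes))  # not listed, returned unchanged
```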

Now we define the function director, which parses the text and adds the features. For each type of line it takes a specific action, which corresponds to the metadata the line contains.

Nodes are created with cv.node() and slots with cv.slot(). You add features to a node with cv.feature(). When a node is done, it needs to be terminated with cv.terminate().

In [10]:
def director(cv):
    counter = dict(
      page=0,
      line=0,
    )
    cur = dict(
            book=None,
            page=None,
            line=None,
          )
    
    for line in source.strip().split('\n'):

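        if not line:
            # NOTE: a blank source line starts a line node that never receives
            # any slots; Text-Fabric removes such nodes afterwards as "unlinked".
            cv.terminate(cur['line'])
            for ntp in counter:
                counter[ntp] += 1
            cur['line'] = cv.node('line')
            cv.feature(
              cur['line'],
              number=counter['line'],
              )
            continue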

        if line.startswith('# '):
            for ntp in ('line', 'page', 'book'):
                cv.terminate(cur[ntp])
                cur[ntp] = None         
            title = line[2:].strip()
            cur['book'] = cv.node('book')
            for ntp in counter:
                counter[ntp] = 0
            cv.feature(
              cur['book'],
              title=title,
            )
            continue
        
        if line.startswith('## '):
            for ntp in ('line', 'page'):
                cv.terminate(cur[ntp])
                cur[ntp] = None         
            number = line.split('PAGE')[1].strip()
            cur['page'] = cv.node('page')
            for ntp in counter:
                counter[ntp] = 0
            cv.feature(
              cur['page'],
              number=number,
            )
            continue
            
        if line.startswith('$ '):
            author = line.split('AUTHOR')[1].strip()
            cv.feature(
              cur['page'],
              author=author,
            )
            continue
            
        cur['line'] = cv.node('line')
        counter['line'] += 1
        language=line[:2]
        cv.feature(
          cur['line'],
          number=counter['line'],
          language=language
        )
        
        for word in line[3:].split():
            w = cv.slot()
            lexeme=languages_dict[language].get(word.lower(), word)
            cv.feature(w, 
                       text=word,
                       lexeme=lexeme,
                       language=language
                       )

        cv.terminate(cur['line'])
        
        
    # just for informational purposes
    print('\nINFORMATION:', cv.activeTypes(), '\n')
  
    for ntp in ('line', 'page', 'book'):
        cv.terminate(cur[ntp])  
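
To see the order in which a director issues its calls without running Text-Fabric, one can pass it a stand-in object that only records them. The class `RecordingCV` and the function `tiny_director` below are hypothetical and written for this illustration; they are not part of Text-Fabric and do none of its bookkeeping.

```python
class RecordingCV:
    """Hypothetical stand-in that only logs calls; it is not Text-Fabric's CV."""

    def __init__(self):
        self.log = []
        self.next_id = 0

    def _new(self, kind):
        # Hand out a simple (kind, sequence number) pair as a fake node.
        self.next_id += 1
        return (kind, self.next_id)

    def node(self, node_type):
        n = self._new(node_type)
        self.log.append(("node", n))
        return n

    def slot(self):
        n = self._new("slot")
        self.log.append(("slot", n))
        return n

    def feature(self, node, **features):
        self.log.append(("feature", node, features))

    def terminate(self, node):
        self.log.append(("terminate", node))


def tiny_director(cv):
    """Open a page and a line, add two words, then close line and page."""
    page = cv.node("page")
    cv.feature(page, number=1)
    line = cv.node("line")
    cv.feature(line, number=1, language="EN")
    for word in "The cat".split():
        w = cv.slot()
        cv.feature(w, text=word)
    cv.terminate(line)
    cv.terminate(page)


cv_stub = RecordingCV()
tiny_director(cv_stub)
for entry in cv_stub.log:
    print(entry[0], entry[1])
```

The recorded log shows the nesting: the line is opened after the page and terminated before it, and the slots are created while both are still open.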
            
        

In [11]:
cv = CV(TF)

good = cv.walk(
    director,
    slotType,
    otext=otext,
    generic=generic,
    intFeatures=intFeatures,
    featureMeta=featureMeta,
)

good

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    book, page, line
   |   SECTION   FEATURES: title, number, number
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       text
   |     0.00s OK
   |     0.00s Following director... 

INFORMATION: {'page', 'book'} 

   |     0.00s "edge" actions: 0
   |     0.00s "feature" actions: 43
   |     0.00s "node" actions: 10
   |     0.00s "resume" actions: 0
   |     0.00s "slot" actions: 31
   |     0.00s "terminate" actions: 17
   |          1 x "book" node 
   |          7 x "line" node 
   |          2 x "page" node 
   |         31 x "word" node  = slot type
   |         41 nodes of all types
   |     0.00s OK
   |     0.00s Removing unlinked nodes ... 
   |      |     0.00s      2 unlinked "line" nodes: [1, 5]
   |      |     0.00s      2 unlinked nodes
 

True
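
The counts in the log can be cross-checked against the source with plain Python. The sketch below repeats the source text under the name `sample` (a name chosen for this example) and counts text lines and words. The word count should equal the 31 "slot" actions, and the number of text lines should equal the 7 line nodes minus the 2 unlinked ones.

```python
sample = """# TheRealBible

## PAGE 1
$ AUTHOR MOSES
EN The cat sat on the mat
NL De kat zat op de mat
DE Die Katze saß auf der Matte

## PAGE 2
$ AUTHOR DAVID
EN He who laughs last laughs best
NL Wie het laatst lacht lacht het best
"""

# Text lines are the non-empty lines that carry neither a heading nor metadata.
text_lines = [
    ln for ln in sample.splitlines()
    if ln and not ln.startswith(("#", "$"))
]

# Skip the two-letter language code and the space that follows it.
word_count = sum(len(ln[3:].split()) for ln in text_lines)

print(len(text_lines), word_count)  # prints: 5 31
```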

Now, load the dataset to check if the conversion has succeeded.

In [13]:
from tf.app import use

A = use(f"data:{TF_PATH}", hoist=globals())

This is Text-Fabric 9.4.1
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

9 features found and 0 ignored
