If you want to know more about the features of the ETCBC-database, check the feature documentation on the Shebanq website:
https://shebanq.ancient-data.org/shebanq/static/docs/featuredoc/features/comments/0_overview.html

See also the Text-Fabric API documentation: https://github.com/ETCBC/text-fabric/wiki/Api

First import some modules

In [1]:
import sys, collections, os
import pprint as pp

Text-fabric wakes up!

In [2]:
from tf.fabric import Fabric
TF = Fabric(modules='hebrew/etcbc4c')

This is Text-Fabric 2.3.2
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
108 features found and 0 ignored


Load the features that you need.

In [60]:
api = TF.load('''
    lex sp gn nu vt vs typ function prs book language
''')

  0.00s loading features ...
   |     0.26s B language             from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s Feature overview: 102 nodes; 5 edges; 1 configs; 7 computeds
  0.34s All features loaded/computed - for details use loadLog()


In [61]:
api.loadLog()
api.makeAvailableIn(globals())

   |     0.00s M otext                from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.26s B language             from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@am              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@ar              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@bn              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@da              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@de              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@el              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@en              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B book@es              from C:/Users/Martijn/github/text-fabric-data/hebrew/etcbc4c
   |     0

Objects
Which objects are there in the ETCBC database and how many of each? In the following cell they are counted. 
The total number of objects is called n and it is initialized as 0. The different types of objects are counted in the dictionary called 'object_types'. This is a defaultdict() from the collections module. Using a defaultdict instead of an ordinary dictionary has the advantage that new keys in the dictionary do not need to be initialized explicitly.


In [5]:
n = 0
object_types = collections.defaultdict(int)

for node in N():
    n += 1
    object_types[F.otype.v(node)] += 1

print(n, object_types)

1446130 defaultdict(<class 'int'>, {'book': 39, 'lex': 9236, 'verse': 23213, 'chapter': 929, 'half_verse': 45180, 'sentence_atom': 64339, 'word': 426581, 'subphrase': 113792, 'sentence': 63570, 'clause': 88000, 'phrase': 253174, 'phrase_atom': 267515, 'clause_atom': 90562})


With "for node in N():" you walk through all the nodes (or objects) in the database. This script adds 1 to the variable n every time it sees a new node. The next line of code is a bit more complex. For every node, it asks what kind of object it is, using the feature "otype". Note that this feature does not need to be listed in the TF.load() call above.

When object_types is initialized it is still an empty dictionary. Once it encounters the first node, it checks the object-type, adds that object-type to the dictionary and adds 1 to its initial value 0. In F.otype.v(), F is the class of object features, which is followed by the name of the feature. This is followed by v, which stands for the value of the feature. The values for the feature otype are word, clause, sentence, and so on.
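
To see why no explicit initialization is needed, here is a minimal sketch that counts a short, made-up list of object types. The list `sample_types` is a hypothetical stand-in for the values that F.otype.v(node) returns; it is not data from the ETCBC database.

```python
import collections

# Hypothetical stand-in for the values returned by F.otype.v(node).
sample_types = ['word', 'word', 'phrase', 'word', 'clause', 'phrase']

counts = collections.defaultdict(int)
for otype in sample_types:
    # The first time a key is seen, defaultdict calls int() and stores 0,
    # so the += 1 below never raises a KeyError.
    counts[otype] += 1

print(dict(counts))
```

Testing membership with `in` does not create a key; only looking a key up with square brackets does.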

If you want to check the object-type of the first node, use this script:

In [8]:
F.otype.v(1)

'word'

Compare print() with pprint() from the pprint module, which shows the dictionary sorted by key with one entry per line:

In [9]:
pp.pprint(object_types)

defaultdict(<class 'int'>,
            {'book': 39,
             'chapter': 929,
             'clause': 88000,
             'clause_atom': 90562,
             'half_verse': 45180,
             'lex': 9236,
             'phrase': 253174,
             'phrase_atom': 267515,
             'sentence': 63570,
             'sentence_atom': 64339,
             'subphrase': 113792,
             'verse': 23213,
             'word': 426581})


The result is clear: the database contains 39 books, 929 chapters, and so on.

If we had not used defaultdict(), the script would have looked like this:

In [16]:
n = 0
object_types = {}

for node in N():
    n += 1
    otype = F.otype.v(node)
    if otype in object_types:
        object_types[otype] += 1
    else:
        #the object-type has to be initialized explicitly
        object_types[otype] = 1

print(n, object_types)

1446130 {'sentence': 63570, 'clause': 88000, 'subphrase': 113792, 'word': 426581, 'clause_atom': 90562, 'half_verse': 45180, 'lex': 9236, 'phrase_atom': 267515, 'verse': 23213, 'phrase': 253174, 'chapter': 929, 'book': 39, 'sentence_atom': 64339}
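
Another option, offered here only as a sketch, is collections.Counter from the standard library, which does the tallying in a single call. The list `sample_types` below is again a hypothetical stand-in for the object types of the nodes.

```python
from collections import Counter

# Hypothetical stand-in for the object types of a handful of nodes.
sample_types = ['word', 'word', 'phrase', 'word', 'clause', 'phrase']

type_counts = Counter(sample_types)
total = sum(type_counts.values())

# most_common() lists the types from most to least frequent.
print(total, type_counts.most_common())
```

In the notebook the same idea would presumably read Counter(F.otype.v(node) for node in N()), using only the calls shown above.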


There is a more efficient way of walking through the nodes. In general you do not need information from all the objects, but only from one specific object-type, for instance words. If this is the case, you do the following:

In [18]:
word_count = 0

for word in F.otype.s('word'):
    word_count += 1
    
print(word_count)

426581


If you want to know which range of node numbers is used for one specific object type, use sInterval.

In [15]:
print(F.otype.sInterval('word'))
print(F.otype.sInterval('clause'))

(1, 426581)
(426582, 514581)
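
Both ends of such an interval are inclusive, so the number of objects follows directly from it. The sketch below copies the two intervals from the output above; the helper names `interval_size` and `interval_nodes` are made up for this illustration and are not part of Text-Fabric.

```python
# Intervals copied from the output above: (first node, last node), both inclusive.
word_interval = (1, 426581)
clause_interval = (426582, 514581)

def interval_size(interval):
    """Return the number of nodes in an inclusive (first, last) interval."""
    first, last = interval
    return last - first + 1

def interval_nodes(interval):
    """Return a range that walks over every node in the interval."""
    first, last = interval
    return range(first, last + 1)

print(interval_size(word_interval))
print(interval_size(clause_interval))
```

The two sizes agree with the counts for 'word' and 'clause' found earlier (426581 and 88000).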


Features

The objects that we have encountered are abstract entities. If you want to know the concrete properties of an object, you need the features that characterize it. As we have seen, the slots 1 up to 426581 represent words. If we want to know the lexemes of the first 10 word slots, we use the feature "lex", which was loaded with TF.load() above.

In [18]:
for word_slot in range(1, 11):
    print(F.lex.v(word_slot))

B
R>CJT/
BR>[
>LHJM/
>T
H
CMJM/
W
>T
H


You will recognize these as the lexemes of the first clause in the book of Genesis. But in which book can you find word slot 100000? To locate it, use T.sectionFromNode(), which returns a tuple with the book (by default in English), chapter and verse.

In [28]:
T.sectionFromNode(100000)

('Deuteronomy', 11, 19)

If you need the name of the book in a different language, use the argument "lang".

In [26]:
T.sectionFromNode(100000, lang = 'fr')

('Deutéronome', 11, 19)

We have seen the feature "lex", which is a word feature. Other important word features are "sp" (part of speech), "gn" (gender), "nu" (number), "vt" (verbal tense), "vs" (verbal stem). The latter two have a value only if the word is a verb. Suppose you want to know the values of all of these features for the first 10 words of Genesis. You can do this:

In [35]:
for word in F.otype.s('word'):
    if word < 11: #an alternative way of finding the first 10 words
        print(F.lex.v(word), F.sp.v(word), F.gn.v(word), F.nu.v(word), F.vt.v(word), F.vs.v(word))

B prep NA NA NA NA
R>CJT/ subs f sg NA NA
BR>[ verb m sg perf qal
>LHJM/ subs m pl NA NA
>T prep NA NA NA NA
H art NA NA NA NA
CMJM/ subs m pl NA NA
W conj NA NA NA NA
>T prep NA NA NA NA
H art NA NA NA NA


If a feature has no value in the case of a specific word, NA is returned.
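
When processing such rows further, it is often convenient to drop the NA values. The following sketch does that for three rows copied from the output above; the names `FEATURES` and `without_na` are hypothetical helpers for this illustration, not part of Text-Fabric.

```python
FEATURES = ('lex', 'sp', 'gn', 'nu', 'vt', 'vs')

# Three rows copied from the output above.
rows = [
    ('B', 'prep', 'NA', 'NA', 'NA', 'NA'),
    ('R>CJT/', 'subs', 'f', 'sg', 'NA', 'NA'),
    ('BR>[', 'verb', 'm', 'sg', 'perf', 'qal'),
]

def without_na(row):
    """Pair every value with its feature name and leave out the NA values."""
    return {name: value for name, value in zip(FEATURES, row) if value != 'NA'}

for row in rows:
    print(without_na(row))
```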

You will also need features that are characteristic of phrases or clauses. Suppose we want to know the types and functions of the first 10 phrases in the book of Genesis.

In [38]:
print(F.otype.sInterval('phrase'))

(605144, 858317)


In [40]:
for phrase_slot in range(605144, 605154):
    print(F.typ.v(phrase_slot), F.function.v(phrase_slot))

PP Time
VP Pred
NP Subj
PP Objc
CP Conj
NP Subj
VP Pred
NP PreC
CP Conj
NP Subj


Very often you are interested not only in the features of a specific object, but also in those of other objects in its environment. Suppose you are interested in eating habits in the Hebrew Bible. You decide to search for cases of the verb >KL[ (to eat), used as the predicate of a clause, and you are interested in those cases in which that clause has an explicit subject. You would like to know all the lexemes in the subject.

The strategy is as follows. First we search for all cases of >KL[. From the word >KL[ we move upward to the phrase in which it occurs, using L.u(). We check whether this phrase is a predicate. Then we move upward again, to the level of the clause, and check whether the clause has an explicit subject by moving downward to the phrases in the clause, using L.d(). It looks like this:

In [43]:
for word in F.otype.s('word'):
    if F.lex.v(word) == '>KL[':
        #L.u() returns a tuple. We want the node of the phrase, so we add the index [0], which is the first value of the tuple.
        phrase = L.u(word, 'phrase')[0]
        #now we check if the phrase is a predicate, we also check for cases in a nominal predicate (PreC), and predicates with an object suffix (PreO):
        if F.function.v(phrase) in {'Pred','PreC','PreO'}:
            #we move upwards to the clause
            clause = L.u(phrase, 'clause')[0]
            #and we go down again to all the phrases in that clause
            phrases = L.d(clause, 'phrase')
            #we loop over all the phrases to check if there is an explicit subject
            #(the names phr and w avoid reusing the variables phrase and word):
            for phr in phrases:
                if F.function.v(phr) == 'Subj':
                    #we move down to the words of the subject and store their lexemes in a list:
                    lex_list = [F.lex.v(w) for w in L.d(phr, 'word')]
                    print(lex_list)

['H', 'N<R/']
['XRB=/']
['BN/', 'JFR>L/']
['XJH/', 'R</']
['XJH/', 'R</']
['HW>']
['H', '<WP/']
['H', '<WP/']
['H', 'PRH/', 'R</', 'H', 'MR>H/', 'W', 'DQ/', 'H', 'BFR/']
['H', 'PRH/', 'H', 'RQ=/', 'W', 'H', 'R</']
['H', '>JC/']
['KL/', 'BN/', 'NKR/']
['TWCB/', 'W', 'FKJR/']
['KL/', '<RL/']
['XMY/']
['BN/', 'JFR>L/']
['>T', 'BFR/']
['GDJC/', '>W', 'H', 'QMH/', '>W', 'H', 'FDH/']
['>BJWN/', '<M/']
['XJH/', 'H', 'FDH/']
['MR>H/', 'KBWD/', 'JHWH/']
['>HRN/', 'W', 'BN/']
['ZR/']
['H', '>C/']
['>HRN/', 'W', 'BN/']
['KL/', 'ZKR=/', 'B', 'BN/', '>HRN/']
['H', 'KHN/']
['KL/', 'ZKR=/', 'B', 'H', 'KHN/']
['KL/', 'XV>T/']
['KL/', 'ZKR=/', 'B', 'H', 'KHN/']
['BFR/', 'ZBX/', 'TWDH/', 'CLM/']
['H', 'BFR/']
['KL/', 'VHR/']
['>HRN/', 'W', 'BN/']
['>TH', 'W', 'BN/', 'W', 'BT/', '>T==']
['KL/', 'NPC/', 'MN']
['H', 'GR/']
['KL/', 'ZR/']
['TWCB/', 'KHN/', 'W', 'FKJR/']
['HW>']
['HM']
['HJ>']
['BT/', 'KHN/']
['KL/', 'ZR/']
['>JB[']
['>RY/', '>JB[']
['MJ']
['MJ']
['XYJ/', 'BFR/']
['KL/', 'ZKR=/']
['KL/', 'VH
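
The up-and-down navigation used in the cell above can be pictured with a toy structure. The sketch below is not Text-Fabric: the node numbers, the dictionaries `children_of` and `parent_of`, and the helpers `up` and `down` are hypothetical stand-ins for L.u() and L.d(). Only the lexemes and function labels are borrowed from this notebook.

```python
# Toy containment data with made-up node numbers (not ETCBC nodes).
lex = {1: '>KL[', 2: 'H', 3: 'N<R/'}                  # word -> lexeme
function = {10: 'Pred', 11: 'Subj'}                    # phrase -> function
children_of = {100: [10, 11], 10: [1], 11: [2, 3]}     # clause -> phrases, phrase -> words
parent_of = {child: parent
             for parent, children in children_of.items()
             for child in children}

def up(node):
    """Stand-in for L.u(): the object that directly contains the node."""
    return parent_of[node]

def down(node):
    """Stand-in for L.d(): the objects directly contained in the node."""
    return children_of[node]

def subject_lexemes(verb_word):
    """Word -> phrase -> clause, then down to the lexemes of every subject phrase."""
    phrase = up(verb_word)
    if function[phrase] not in {'Pred', 'PreC', 'PreO'}:
        return []
    clause = up(phrase)
    return [[lex[w] for w in down(phr)]
            for phr in down(clause) if function[phr] == 'Subj']

print(subject_lexemes(1))
```

The real L.u() and L.d() take an object type as second argument and return tuples of nodes, which is why the notebook adds the index [0].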

It can be useful to save the data in a csv file, so that it can be processed further outside Text-Fabric. We want to do that with the >KL[ data from the previous cell. Every row in the csv file holds one clause containing >KL[, with the following information in columns:
slot of >KL[,
book,
chapter,
verse,
verbal tense,
verbal stem,
predicate type (Pred, PreC, PreO, PreS),
lexemes of the subject, concatenated in a string, separated by underscores (for PreS, the pronominal suffix instead).
The first 6 columns contain information about the verb, the predicate type contains information about the phrase in which >KL[ occurs, and the last column contains information about the subject of the clause. It may look like a lot of work, but each step is straightforward.

In [68]:
#in a dictionary (eat_dict) we collect all the rows that will be saved in the csv file.
#the keys of the dictionary will be the >KL[ slots, the values are lists with information about the words and phrases
#we want to save the data in the csv file in canonical order, so we store the >KL[ slots in a list (eat_list), because there is no order in the keys of the dictionary.

eat_dict = {}
eat_list = []

#this part is nearly identical to what you have already seen
for w in F.otype.s('word'):
    #select the words with the right lexeme and make sure the language is Hebrew.
    if F.lex.v(w) == '>KL[' and F.language.v(w) == 'hbo':
        phrase = L.u(w, 'phrase')[0] 
        #we include cases with a subject suffix
        if F.function.v(phrase) == 'PreS':
            suffix = F.prs.v(w)
            #now we collect the information needed
            where = T.sectionFromNode(w)
            info = [w, where[0], where[1], where[2], F.vt.v(w), F.vs.v(w), F.function.v(phrase), suffix]
            eat_dict[w] = info
            eat_list.append(w)
        #here the other predicate types are processed
        else:
            if F.function.v(phrase) in {'Pred','PreC','PreO'}:
                where = T.sectionFromNode(w)
                clause = L.u(phrase, 'clause')[0]
                phrases = L.d(clause, 'phrase')
                subject = False #we only include those cases that have an explicit subject
                for phr in phrases:
                    if F.function.v(phr) == 'Subj':
                        subject = True
                        #the lexemes of the subject are joined into one string, with '_' between them.
                        subj_lexemes = '_'.join(F.lex.v(word) for word in L.d(phr, 'word'))
                if subject:
                    info = [w, where[0], where[1], where[2], F.vt.v(w), F.vs.v(w), F.function.v(phrase), subj_lexemes]
                    eat_dict[w] = info
                    eat_list.append(w)

In [67]:
with open(r'C:\Users\Martijn\Documents\SynVar\CourseShebanq\eat_data.csv', "w") as csv_file:
    
    #it is often useful to make a header
    header = ['slot', 'book', 'chapter', 'verse', 'tense', 'stem', 'predicate', 'subj_lex']
    csv_file.write('{}\n'.format(','.join(header)))

    for case in eat_list:
        info_list = eat_dict[case]
        line = [str(element) for element in info_list]
        csv_file.write('{}\n'.format(','.join(line)))
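
As an alternative to joining the fields by hand, the csv module from the standard library can write the rows and takes care of quoting if a field ever contains a comma. The sketch below writes to an in-memory buffer so that it runs without touching the disk; the two data rows are made-up examples, not results from the database.

```python
import csv
import io

header = ['slot', 'book', 'chapter', 'verse', 'tense', 'stem', 'predicate', 'subj_lex']

# Made-up example rows in the same layout as the lists stored in eat_dict.
rows = [
    [101, 'Genesis', 1, 1, 'impf', 'qal', 'Pred', 'H_>JC/'],
    [202, 'Exodus', 2, 3, 'perf', 'qal', 'PreO', 'BN/_JFR>L/'],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the data back: DictReader uses the header row as the keys.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]['subj_lex'])
```

When writing to a real file, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.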

In the previous example a so-called structured dataset was made, which we saved in a csv file. The dataset is called structured because it has a fixed format. It consists of a number of columns, and in each column you find the same kind of information for each case in the database.

You should also be able to make unstructured or semi-structured datasets. An unstructured dataset contains data that are closer to the raw data as we find them in 'nature'. An example is a picture of a Dead Sea Scroll. In a semi-structured dataset the data are partly structured. In our case you could for instance make a text file with the consonantal text of the Hebrew Bible. In the following example, a text file is made in which the biblical text is represented per verse as a sequence of lexemes, separated by spaces.

In [69]:
with open("lexemes.txt", "w") as lex_file:    
    for verse in F.otype.s('verse'):
        where = T.sectionFromNode(verse)
        #do not forget to make strings of chapter and verse
        verse_string = where[0] + ' ' + str(where[1]) + ' ' + str(where[2]) + ' '
        words = L.d(verse, 'word')
        for word in words:
            if word != words[-1]:
                verse_string += F.lex.v(word)
                verse_string += ' '
            else:
                verse_string += F.lex.v(word)
                #a new verse gets a new line.
                verse_string += '\n'
        lex_file.write(verse_string)    
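
The line for a single verse can also be built in one step with str.join(), which places the separator only between the lexemes, so the last word needs no special treatment. This is a sketch under assumptions: the function name `verse_line` is hypothetical, and the lexemes are the ten printed earlier for the first word slots, used here only as sample input.

```python
def verse_line(book, chapter, verse, lexemes):
    """Return 'book chapter verse lex1 lex2 ...' followed by a newline."""
    return '{} {} {} {}\n'.format(book, chapter, verse, ' '.join(lexemes))

# The ten lexemes printed earlier for word slots 1 to 10, used as sample input.
first_lexemes = ['B', 'R>CJT/', 'BR>[', '>LHJM/', '>T', 'H', 'CMJM/', 'W', '>T', 'H']

line = verse_line('Genesis', 1, 1, first_lexemes)
print(line, end='')
```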

In the previous script a file was made that contains the text of the whole MT. If you want to make a separate file for each book, you can do the following.

In [37]:
def lexeme_processing(v):
    """ 
    This function returns a string of lexemes for a verse, which is the input.
    It is identical to part of the code you have seen in the previous cell.
    """
    where = T.sectionFromNode(v)
    #do not forget to make strings of chapter and verse
    verse_string = where[0] + ' ' + str(where[1]) + ' ' + str(where[2]) + ' '
    #use the argument v here, so that the function does not depend on a global variable
    words = L.d(v, 'word')
    for word in words:
        if word != words[-1]:
            verse_string += F.lex.v(word)
            verse_string += ' '
        else:
            verse_string += F.lex.v(word)
            #a new verse gets a new line.
            verse_string += '\n'
    return verse_string

In [39]:
#for every book a new file is created.
for book in F.otype.s('book'):
    book_file = F.book.v(book) + '.txt'
    with open(book_file, "w") as new_file:
        verses = L.d(book, 'verse')
        for verse in verses:
            #here the function lexeme_processing is called
            new_string = lexeme_processing(verse)
            new_file.write(new_string)
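
The one-file-per-book pattern can be tried out without the database. The sketch below writes a made-up miniature corpus to a temporary directory; the dictionary `books`, its placeholder verse lines and the function `write_books` are all hypothetical and stand in for the real loop over F.otype.s('book').

```python
import tempfile
from pathlib import Path

# Made-up miniature corpus: book name -> verse lines that are ready to be written.
books = {
    'Genesis': ['Genesis 1 1 lexA lexB\n'],
    'Exodus': ['Exodus 1 1 lexC lexD\n', 'Exodus 1 2 lexE\n'],
}

def write_books(books, out_dir):
    """Write one text file per book into out_dir and return the file paths."""
    out_dir = Path(out_dir)
    written = []
    for name, lines in books.items():
        path = out_dir / (name + '.txt')
        with open(path, 'w', encoding='utf-8') as book_file:
            book_file.writelines(lines)
        written.append(path)
    return written

with tempfile.TemporaryDirectory() as tmp:
    paths = write_books(books, tmp)
    names = sorted(p.name for p in paths)
    exodus_text = (Path(tmp) / 'Exodus.txt').read_text(encoding='utf-8')

print(names)
```

Passing the output directory as an argument keeps the files out of the working directory, which is where the notebook cell above puts them.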

In [53]:
T.sectionFromNode(977)

('Genesis', 2, 16)