# Simple Data Export

## Updates

### 10 October

After looking at initial data, we decided to refine our dataset to exclude clauses with certain phrase functions based on the counts below:

```
EBH [('IntS', 1.0),
 ('Ques', 2.0),
 ('Exst', 4.0),
 ('ModS', 4.0),
 ('NCoS', 4.0),
 ('NCop', 13.0),
 ('Supp', 13.0),
 ('PrAd', 16.0),
 ('Frnt', 26.0),
 ('Intj', 55.0),
 ('PreS', 74.0),
 ('Nega', 105.0),
 ('Modi', 146.0),
 ('PreO', 179.0),
 ('Time', 236.0),
 ('Loca', 250.0),
 ('Rela', 295.0),
 ('Adju', 355.0),
 ('PreC', 788.0),
 ('Objc', 1601.0),
 ('Cmpl', 1977.0),
 ('Subj', 2243.0),
 ('Conj', 4052.0),
 ('Pred', 4147.0)]

LBH [('Frnt', 1.0),
 ('ModS', 1.0),
 ('Supp', 1.0),
 ('PrAd', 3.0),
 ('Ques', 3.0),
 ('PreS', 4.0),
 ('NCop', 5.0),
 ('Modi', 9.0),
 ('Loca', 11.0),
 ('Nega', 11.0),
 ('PreO', 19.0),
 ('Time', 35.0),
 ('Rela', 53.0),
 ('Adju', 61.0),
 ('PreC', 89.0),
 ('Objc', 112.0),
 ('Subj', 192.0),
 ('Cmpl', 209.0),
 ('Conj', 288.0),
 ('Pred', 315.0)]

```

As the counts show, certain functions are quite rare in the dataset. It is also not entirely clear why some functions, such as `Ques`, appear in narrative clauses. To keep our predictions focused, we now exclude any clause that contains a phrase with one of the following functions (shown with their EBH counts):

```
('IntS', 1.0),
 ('Ques', 2.0),
 ('Exst', 4.0),
 ('ModS', 4.0),
 ('NCoS', 4.0),
 ('NCop', 13.0),
 ('Supp', 13.0),
 ('PrAd', 16.0),
 ('Frnt', 26.0),
 ('Intj', 55.0),
```

Finally, we now export clauses per individual book rather than grouping them by their supposed dating.
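
The exclusion rule used by `get_data` later in this notebook drops a clause as soon as any of its phrases carries one of the rare functions listed above. A minimal sketch of that rule on plain Python lists follows; the helper name `keep_clause` and the sample clauses are made up for illustration and are not part of the notebook.

```python
# The ten functions excluded as of 10 October (taken from the list above).
EXCLUDE = {'IntS', 'Ques', 'Exst', 'ModS', 'NCoS',
           'NCop', 'Supp', 'PrAd', 'Frnt', 'Intj'}

def keep_clause(phrase_functions, exclude=EXCLUDE):
    """Return True if none of the clause's phrase functions is excluded."""
    return not (set(phrase_functions) & exclude)

# Made-up clauses, each given as a list of phrase-function labels.
clauses = [
    ['Conj', 'Pred', 'Subj'],          # kept
    ['Ques', 'Pred', 'Subj'],          # dropped: contains 'Ques'
    ['Conj', 'Pred', 'Objc', 'Frnt'],  # dropped: contains 'Frnt'
]

kept = [' '.join(c) for c in clauses if keep_clause(c)]
print(kept)
```

Note that the whole clause is dropped, not just the offending phrase, so the exported strings never contain a partial clause.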


### 3 October
In this notebook we export a series of .txt files containing a variety of different data sets. For our purposes here, we export narrative texts only from two general classes of texts: texts traditionally labeled as "Early Biblical Hebrew" and texts considered "Late Biblical Hebrew".

We will export here the following type(s) of data:
1. [phrase constituent](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/pdp) functions (words, also known as "part of speech") per clause in late/early Biblical Hebrew sources.


The data is accessed using [Text-Fabric](https://github.com/ETCBC/text-fabric), a Python package made specially for accessing corpora like the ETCBC Hebrew database.

## Load Text-Fabric and ETCBC Syntactic Data

In [1]:
import collections
from tf.fabric import Fabric # for Text-Fabric

In [2]:
# instantiate Text-Fabric (TF) objects

TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c') # load ETCBC Hebrew database

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored


In [3]:
# load features for linguistic objects (i.e. clauses, phrases, words) from the database

# features loaded in a string, space separated
api = TF.load('''
              book chapter verse
              typ pdp function
              domain
              ''')

# TF classes are globalized for easier use
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.72s T otype                from /Users/cody/github/etcbc/bhsa/tf/c
   |     8.97s T oslots               from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.08s T book                 from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.04s T chapter              from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.04s T verse                from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.16s T g_cons               from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.28s T g_cons_utf8          from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.28s T g_lex                from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.37s T g_lex_utf8           from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.30s T g_word               from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.36s T g_word_utf8          from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.11s T lex0                 from /Users/cody/github/etcbc/bhsa/tf/c
   |     1.38s T lex_utf8          

## Gather, Arrange, and Export Data

ETCBC data is stored in a graph structure in which linguistic objects are nodes with corresponding features. TF uses a node's integer to look up the requested feature value with a function: `F.feature.v(node_number)`. Various other functions iterate through the nodes; you can explore them more thoroughly in the tutorial [here](https://github.com/codykingham/tfNotebooks/blob/master/timeSpans/Text_Fabric_Tutorial.ipynb) or [here](https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb). There are also edge relationships between some nodes (such as clause relations, which represent the discourse structure of the text).
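
To make the node-and-feature idea concrete, here is a toy model in plain Python. It is only an illustration of the lookup pattern, not Text-Fabric's implementation: the class `ToyFeature`, the node numbers, and the `contains` mapping are all hypothetical.

```python
# A toy stand-in for a feature store; this is NOT Text-Fabric's implementation.
class ToyFeature:
    def __init__(self, values):
        self._values = values  # maps node integer -> feature value

    def v(self, node):
        """Return this toy feature's value for a node, or None if unset."""
        return self._values.get(node)

# Hypothetical node numbers, chosen only for illustration.
otype = ToyFeature({1: 'clause', 2: 'phrase', 3: 'phrase', 4: 'phrase'})
function = ToyFeature({2: 'Conj', 3: 'Pred', 4: 'Subj'})

# Hypothetical containment: clause node 1 contains phrase nodes 2, 3 and 4.
contains = {1: [2, 3, 4]}

# Look up the function of every phrase inside the clause.
clause_functions = [function.v(p) for p in contains[1]]
print(' '.join(clause_functions))
```

In the notebook itself, the role of `contains` is played by `L.d(clause, otype='phrase')` and the role of `function.v` by `F.function.v`.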


### Functions for Data Export

In [46]:
early_hebrew = {'Genesis', 'Exodus', 'Leviticus', 
                'Deuteronomy', 'Joshua', 'Judges',
                '1_Samuel', '2_Samuel', '1_Kings',
                '2_Kings'}

late_hebrew = {'Esther', 'Ezra', 'Nehemiah',
               '1_Chronicles', '2_Chronicles'}

def get_data():
    
    '''
    Returns a dictionary with the book name as key and a list as value.
    Each list item is a space-separated string of the phrase functions
    in one narrative clause of that book.
    '''
    
    function_data = collections.defaultdict(list)

    # exclude these functions from the export (as of 10.10)
    exclude = {'IntS', 'Ques', 'Exst', 'ModS',
                 'NCoS', 'NCop', 'Supp', 'PrAd',
                 'Frnt', 'Intj'}
    
    
    for book in F.otype.s('book'):

        # skip books outside the two classes
        # (T.sectionFromNode gives the English book name used in the sets above)
        if T.sectionFromNode(book)[0] not in early_hebrew | late_hebrew:            
            continue
        
        # set the tag under which individual files are exported,
        # i.e. this determines the file names; F.book.v gives the
        # Latin book name (e.g. 'Samuel_I'), not the English one
        book_tag = F.book.v(book)

        book_clauses = [clause for clause in L.d(book, otype='clause')]

        
        # Restrictions on clauses:
        # - the clause must belong to the narrative domain ('N')
        # - the clause must not contain a phrase with an excluded function
        export_clauses = [cl for cl in book_clauses
                             if F.domain.v(cl) == 'N'
                             and not set(F.function.v(p) for p in L.d(cl, otype='phrase')) & exclude
                         ]
                
            
        # add phrase data per clause
        for clause in export_clauses:

            # format data for all phrases in the clause
            phrase_functions = [F.function.v(phrase) for phrase in L.d(clause, otype='phrase')]
            phrase_funct_str = ' '.join(phrase_functions)

            function_data[book_tag].append(phrase_funct_str) # save data
            
    return function_data
     
    
def export_dated_files(data_dict, file_name):
    
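    '''
    Exports one plain-text file per key in data_dict, one clause per line.
    file_name is a format string with a {} placeholder for the key.
    The target directory must already exist.
    '''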
    
    for section, linguistic_data in data_dict.items():

        filename = file_name.format(section)

        with open(filename, 'w') as outfile:

            for phrase in linguistic_data:
                outfile.write(phrase+'\n')
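
The exported files hold one clause per line, with phrase functions separated by spaces. As a rough sketch of how such a file could be read back and tallied (for instance to produce counts like those in the 10 October update), the example below writes a tiny made-up sample into a temporary directory and counts its labels. The helper `count_functions` and the sample file name are hypothetical, not part of the notebook.

```python
import collections
import os
import tempfile

def count_functions(path):
    """Tally phrase-function labels in one exported file (one clause per line)."""
    counts = collections.Counter()
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            counts.update(line.split())
    return counts

# Write a tiny made-up sample in the export format, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'phrase_functions')
    # open(..., 'w') does not create missing directories, so create it first
    os.makedirs(out_dir, exist_ok=True)
    sample_path = os.path.join(out_dir, 'phrase_functions_Sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as outfile:
        outfile.write('Conj Pred Subj\n')
        outfile.write('Conj Pred Objc\n')
    sample_counts = count_functions(sample_path)

print(sample_counts)
```

The `os.makedirs(..., exist_ok=True)` call is worth copying before running the real export, since writing to `phrase_functions/...` fails if that directory is missing.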

### Clause Constituents (phrases and their functions)

In [47]:
# apply function
phrase_function_data = get_data()
phrase_function_data['Genesis'][0] # sample of data

'Conj Pred Subj'

### Test before exporting

In [48]:
represented_functions = set()

for book, clause_dat in phrase_function_data.items():
    
    for clause in clause_dat:
        
        represented_functions |= set(clause.split())
    
represented_functions

{'Adju',
 'Cmpl',
 'Conj',
 'Loca',
 'Modi',
 'Nega',
 'Objc',
 'PreC',
 'PreO',
 'PreS',
 'Pred',
 'PtcO',
 'Rela',
 'Subj',
 'Time'}
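
The visual check above can also be turned into an automatic one. The sketch below assumes data in the same shape as `phrase_function_data` (book name mapped to a list of clause strings) and fails loudly if an excluded function slipped through; the helper `represented` and the sample dictionary are made up for illustration.

```python
# The ten functions excluded as of 10 October.
EXCLUDE = {'IntS', 'Ques', 'Exst', 'ModS', 'NCoS',
           'NCop', 'Supp', 'PrAd', 'Frnt', 'Intj'}

def represented(data):
    """Collect every function label in a {book: [clause strings]} dict."""
    found = set()
    for clause_strings in data.values():
        for clause in clause_strings:
            found.update(clause.split())
    return found

# Made-up sample in the same shape as phrase_function_data.
sample = {'Genesis': ['Conj Pred Subj', 'Conj Pred Objc'],
          'Esther': ['Conj Pred Cmpl']}

# Any label in both sets means the filter did not work as intended.
leaked = represented(sample) & EXCLUDE
assert not leaked, f'excluded functions still present: {sorted(leaked)}'
```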

In [49]:
phrase_function_data.keys()

dict_keys(['Genesis', 'Exodus', 'Leviticus', 'Deuteronomium', 'Josua', 'Judices', 'Samuel_I', 'Samuel_II', 'Reges_I', 'Reges_II', 'Esther', 'Esra', 'Nehemia', 'Chronica_I', 'Chronica_II'])

In [50]:
# export file
export_dated_files(phrase_function_data, 'phrase_functions/phrase_functions_{}.txt')