# Pipeline

This module describes in detail the data-production pipeline of the Time Collocations project. The root source of the data is the [BHSA](https://www.github.com/etcbc/bhsa) of the Eep Talstra Centre for Bible and Computer, Vrije Universiteit Amsterdam. The BHSA contains grammatical and syntactic annotations, including a feature called `Function`, which identifies adverbial time phrases.

This project makes four primary kinds of modifications to the root BHSA data:

1. **Some phrases marked for time adverbial function are corrected and remapped to the appropriate function.** These corrections are based on examining the data manually and deciding whether the default BHSA label should be kept or modified. Those cases will be described explicitly below.
2. **Meaningful groups above or below a phrase or other formal object are "chunked" into new objects, called `chunk`.** The `chunk` object contains word groups such as quantifier number chains (below the phrase level) or time adverbial chunks (units above the phrase level). The latter case covers certain situations where BHSA splits a time adverbial phrase into two separate phrases. Those two parts are recombined into a single `chunk` for further analysis.
3. **Construction objects are generated.** This is the final step in the pipeline, and it is based on the statistical and functional analysis of time adverbials. A series of new objects will be built from the `chunk` objects. They will contain labels for semantic roles and semantic function. These labels are to be constructed using inductive, data-driven analysis combined with insights from Construction Grammar.
4. **New features are added to phrases, chunks, or constructions.** These are various features stored on the objects noted above.

**The end result is a custom database, which consists of original BHSA data, modified BHSA phrase function data, and new objects with their associated features**.

The whole pipeline is represented below in diagram form. 

<table>
<img src="../../docs/images/pipeline_diagram.png" width="50%" height="50%" align="middle">
</table>

In [1]:
# import needed packages and modules
import os, sys, collections, glob
from tf.fabric import Fabric
from tf.app import use
show_examples = False

To visualize the data, I load an instance of Text-Fabric with BHSA.

In [2]:
if show_examples:
    bhsa = use('bhsa', silent=True)
    F, T, L = bhsa.api.F, bhsa.api.T, bhsa.api.L

# 0. Import Pipeline Classes

The classes below enact the pipeline.

In [3]:
from remapfunctions import RemapFunctions # remaps phrase functions
from chunking import Chunker # chunks meaningful groups

Input and output files are outlined below. Both directories contain (or will contain) .tf resources. The input files are the base BHSA dataset, while the output will be the customized project dataset. Another dataset, `heads`, which is crucial for this work, is identified below as well. See the documentation for `heads` [here](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb).

In [4]:
home = os.path.expanduser('~/')
bhsa_dir = os.path.join(home, 'text-fabric-data/etcbc/bhsa/tf/c') # input
heads_dir = os.path.join(home, 'github/etcbc/heads/tf/c') # heads data
output_dir = '../tf' # output; all new data goes here

### Purge Old .tf Files

Remove all .tf files in the output directory. This is necessary because all output data is written to the same place, and later classes depend on the data produced by earlier ones. Without the purge, repeated runs would add new objects on top of the previous ones!

In [5]:
print('Purging old data...')
old_tf = glob.glob(os.path.join(output_dir, '*.tf'))
for file in old_tf:
    print(f'\tpurging {file}')
    os.remove(file)
print('\tDONE')

Purging old data...
	purging ../tf/function.tf
	purging ../tf/note.tf
	purging ../tf/label.tf
	purging ../tf/otype.tf
	purging ../tf/role.tf
	purging ../tf/oslots.tf
	DONE
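The purge step can also be wrapped in a small function, which makes it easy to try out on a throwaway directory before pointing it at the real output. This is only a sketch: the name `purge_tf_files` is hypothetical and not part of the project code, and the demonstration uses a temporary directory rather than `output_dir`.

```python
import glob
import os
import tempfile


def purge_tf_files(directory):
    """Delete every *.tf file in `directory` and return the removed paths (sorted)."""
    removed = []
    for path in sorted(glob.glob(os.path.join(directory, '*.tf'))):
        os.remove(path)
        removed.append(path)
    return removed


# Demonstrate on a temporary directory instead of the real output directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('otype.tf', 'oslots.tf', 'README.md'):
        with open(os.path.join(tmp, name), 'w') as handle:
            handle.write('')
    removed_names = [os.path.basename(p) for p in purge_tf_files(tmp)]
    remaining_names = sorted(os.listdir(tmp))

print(removed_names)    # only the .tf files are removed
print(remaining_names)  # other files are left alone
```

Because the glob pattern is restricted to `*.tf`, anything else kept in the output directory survives the purge.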


The metadata below is assigned to all new features.

In [6]:
base_metadata = {'source': 'https://github.com/etcbc/bhsa',
                 'origin': 'Made by the ETCBC of the Vrije Universiteit Amsterdam; edited by Cody Kingham'}
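As an illustration of how shared metadata like this might be attached to each new feature, the sketch below copies the base dictionary once per feature and adds a `valueType` entry, which Text-Fabric feature files record as `str` or `int`. The helper `feature_metadata` is hypothetical, and the assumption that `function` and `note` are both string-valued is mine; the pipeline classes may handle this differently.

```python
base_metadata = {
    'source': 'https://github.com/etcbc/bhsa',
    'origin': 'Made by the ETCBC of the Vrije Universiteit Amsterdam; edited by Cody Kingham',
}


def feature_metadata(base, value_types):
    """Return one metadata dict per feature: the shared base plus that feature's valueType."""
    return {
        feature: dict(base, valueType=value_type)  # dict(...) copies; base is untouched
        for feature, value_type in value_types.items()
    }


# Assumed value types for the two features exported by the remapping step.
meta = feature_metadata(base_metadata, {'function': 'str', 'note': 'str'})

for feature, fields in meta.items():
    print(feature, sorted(fields))
```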

# 1. Phrase Function Edits

In [7]:
remap_functs = RemapFunctions(bhsa_dir=bhsa_dir, 
                              export_dir=output_dir, 
                              metadata=base_metadata)

The following functions will be changed...

In [8]:
if show_examples:
    for phrase, newfunct in remap_functs.newfunctions.items():
        print(f'{phrase} node with function {F.function.v(phrase)} will be remapped to {newfunct}')
        bhsa.pretty(phrase)

The change is executed below.

In [9]:
remap_functs.execute()

Updating phrase functions...
	adding notes feature to new functions...
	exporting tf...
   |     0.36s T function             to /Users/cody/github/csl/time_collocations/data/tf
   |     0.00s T note                 to /Users/cody/github/csl/time_collocations/data/tf
SUCCESS


# 2. Generate Chunk Objects

The ETCBC data is not granular enough for many kinds of searches beneath the phrase level. For example, when coordinated noun phrases together function as a single phrase, the individual noun phrases are not delineated. I need them separated out so that I can track coordinated nouns within a phrase. Another case is quantifiers: quantifier chains are not set apart in any way from the other items in the phrase. For these cases I make `chunk` objects, which are essentially phrase-like objects. A further problem is that some phrases are split in two, whereas elsewhere in the database the same phrase pattern is represented as a single phrase. This is also fixed by creating a new `chunk` object.

`chunk` objects have two important features, `label` and `role`. The `label` feature is essentially the name of the chunk's function, for instance `timephrase` or `quant` (quantifier). The `role` feature is an edge feature, which maps a chunk's component parts to the new object. For example, in a `quant_NP`, there is an edge drawn from the noun to the `quant_NP` chunk; the edge has a value of "quantified" since the noun is the quantified item.
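To make the `label` and `role` features concrete, here is a minimal sketch of how one `quant_NP` chunk could be represented as plain Python dictionaries: a node feature maps a node to a value, and a valued edge feature maps a source node to its target nodes and edge values. The node numbers and the helper `parts_of` are made up for illustration, and the dictionaries are not necessarily what `Chunker` builds internally.

```python
# Hypothetical node numbers, for illustration only.
NOUN = 11          # word node of the quantified noun
CHUNK = 2000001    # node of the new quant_NP chunk

# Node feature: chunk node -> label value.
label = {CHUNK: 'quant_NP'}

# Edge feature with values: source node -> {target node: edge value}.
# The edge runs from the component (the noun) to the chunk.
role = {NOUN: {CHUNK: 'quantified'}}


def parts_of(chunk_node, role_edges):
    """List (component node, role value) pairs whose edge points at `chunk_node`."""
    return sorted(
        (source, targets[chunk_node])
        for source, targets in role_edges.items()
        if chunk_node in targets
    )


print(label[CHUNK], parts_of(CHUNK, role))
```

Storing the edge on the component rather than on the chunk means that a lookup from the chunk to its parts requires inverting the edges, which is what `parts_of` does here.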

In [10]:
chunker = Chunker(bhsa_dir=bhsa_dir, 
                  heads_dir=heads_dir,
                  output_dir=output_dir, 
                  metadata=base_metadata)

In [11]:
chunker.execute()

Running chunkers...
	running time chunker...
	running quant chunker...
	running prep chunker...
82289 chunk objects formed...
	73963 chunk objects with label prep
	4072 chunk objects with label timephrase
	3196 chunk objects with label quant_NP
	1058 chunk objects with label quant
exporting tf...
  0.00s VALIDATING oslots feature
  0.12s maxSlot=     426584
  0.12s maxNode=    1529088
  0.35s OK: oslots is valid
   |     0.12s T label                to /Users/cody/github/csl/time_collocations/data/tf
   |     0.67s T otype                to /Users/cody/github/csl/time_collocations/data/tf
   |     3.02s T oslots               to /Users/cody/github/csl/time_collocations/data/tf
   |     0.06s T role                 to /Users/cody/github/csl/time_collocations/data/tf
SUCCESS
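As a quick sanity check on the log above, the per-label counts should add up to the reported total of chunk objects. The sketch below copies the four counts from the output by hand and tallies them with `collections.Counter`; it does not read the exported .tf files.

```python
from collections import Counter

# Counts copied from the chunker output above.
reported = Counter({
    'prep': 73963,
    'timephrase': 4072,
    'quant_NP': 3196,
    'quant': 1058,
})

total = sum(reported.values())
print(f'{total} chunk objects in total')
for chunk_label, count in reported.most_common():
    print(f'\t{count} chunk objects with label {chunk_label}')
```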
