# Pipeline

This module describes in detail the data-production pipeline of the Time Collocations project. The root source of the data is the [BHSA](https://www.github.com/etcbc/bhsa) of the Eep Talstra Centre for Bible and Computer, Vrije Universiteit Amsterdam. The BHSA contains grammatical and syntactic annotations, including a feature called `Function`, which identifies adverbial time phrases.

This project makes four primary kinds of modifications to the root BHSA data:

1. **Some phrases marked for time adverbial function are corrected and remapped to the appropriate function.** These corrections are based on examining the data manually and deciding whether the default BHSA label should be kept or modified. Those cases will be described explicitly below.
2. **Meaningful groups above or below a phrase or other formal object are "chunked" into new objects, called `chunk`.** A `chunk` object contains a word group such as a quantifier number chain (below the phrase level) or a time adverbial chunk (a unit above the phrase level). The latter case covers certain situations where BHSA splits a time adverbial phrase into two separate phrases. Those two parts are recombined into a single `chunk` for further analysis.
3. **Construction objects are generated.** This is the final step in the pipeline, which is based on the statistical and functional analysis of time adverbials. A series of new objects will be built using the `chunk` objects. They will contain labels for semantic roles and semantic function. These labels are to be constructed using inductive, data-driven analysis combined with insights from Construction Grammar.
4. **New features are added to phrases, chunks, or constructions.** These are various features stored on the objects noted above.

**The end result is a custom database, which consists of original BHSA data, modified BHSA phrase function data, and new objects with their associated features.**

The whole pipeline is represented below in diagram form. 

<table>
<img src="../../docs/images/pipeline_diagram.png" width="30%" height="30%" align="middle">
</table>

# 0. Modules, Paths, and Classes

Input and output files are outlined below. Both directories contain (or will contain) .tf resources. The input files are the base BHSA dataset, while the output will be the customized project dataset. Another dataset, `heads`, which is crucial for this work, is also identified below. See the [documentation for heads](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb).

In [1]:
# import generic packages and modules
import os, sys, collections, glob
from tf.fabric import Fabric
from tf.app import use

# import pipeline classes
from remapfunctions import RemapFunctions # remaps phrase functions
from chunking import Chunker # chunks meaningful groups
from enhance import Enhance # adds helper data to chunks
from function_association import FunctAssoc # calcs function associations

# function associations take a long time to run
# only run them if explicitly asked
run_associations = '-full' in sys.argv # set to True in NB to run it

# determine whether script is being run from ipython or as simple .py
# this notebook code is converted with nbconvert to a .py script and has
# additional visualizations that do not need to be run in the .py version.
try:
    is_nb = __IPYTHON__
except NameError: # __IPYTHON__ is undefined outside IPython
    is_nb = False
    
# load Text-Fabric instance if running inside NB
# For visualizing data that will be exported
if is_nb:
    bhsa = use('bhsa', silent=True)
    F, T, L = bhsa.api.F, bhsa.api.T, bhsa.api.L
    
# configure input TF data and output dir for new TF data
home = os.path.expanduser('~/')
bhsa_dir = os.path.join(home, 'text-fabric-data/etcbc/bhsa/tf/c') # input
heads_dir = os.path.join(home, 'github/etcbc/heads/tf/c') # heads data
output_dir = '../tf' # output; all new data goes here
locations = {'bhsa': bhsa_dir,
             'heads': heads_dir,
             'output': output_dir}

# The metadata below is assigned to all new features.
base_metadata = {'source': 'https://github.com/etcbc/bhsa',
                 'data_version': 'BHSA version c',
                 'origin': 'Made by the ETCBC of the Vrije Universiteit Amsterdam; edited by Cody Kingham'}

	connecting to online GitHub repo annotation/app-bhsa ... failed
The offline TF-app may not be the latest
Using TF-app in /Users/cody/text-fabric-data/annotation/app-bhsa/code:
	rv1.0=#d3cf8f0c2ab5d690a0fda14ea31c33da5c5c8483 (latest? release)


### Purge Old .tf Files

Remove all .tf files in the output directory. This is necessary since all output data goes to the same place, and subsequent classes depend on previous data. Without the purge, subsequent runs will add new objects on top of previous ones!

The `funct_assoc.tf` and `top_assoc.tf` features rely only on the phrase function feature and need not be purged unless new phrase function edits are added (cf. Section 1). Keeping them avoids the significant runtime needed to recalculate them.

In [2]:
keep = [os.path.join(output_dir, file) for file in ('funct_assoc.tf', 'top_assoc.tf')]
print('Purging old data...')
old_tf = glob.glob(os.path.join(output_dir, '*.tf'))
for file in old_tf:
    if file not in keep:
        print(f'\tpurging {file}')
        os.remove(file)
print('DONE')

Purging old data...
	purging ../tf/function.tf
	purging ../tf/note.tf
	purging ../tf/label.tf
	purging ../tf/otype.tf
	purging ../tf/role.tf
	purging ../tf/oslots.tf
DONE


# 1. Phrase Function Edits

In [3]:
remap_functs = RemapFunctions(locations, base_metadata)

The following functions will be changed...

In [4]:
if is_nb:
    for phrase, newfunct in remap_functs.newfunctions.items():
        print(f'{phrase} node with function {F.function.v(phrase)} will be remapped to {newfunct}')
        bhsa.pretty(phrase)

849296 node with function Time will be remapped to Loca


825329 node with function Time will be remapped to Loca


828081 node with function Time will be remapped to Cmpl


774349 node with function Time will be remapped to Adju


774352 node with function Time will be remapped to Adju


775948 node with function Time will be remapped to Adju


775985 node with function Time will be remapped to Adju


876172 node with function Time will be remapped to Adju


The change is executed below.

In [5]:
remap_functs.execute()

Updating phrase functions...
	adding notes feature to new functions...
	exporting tf...
   |     0.32s T function             to /Users/cody/github/csl/time_collocations/data/tf
   |     0.00s T note                 to /Users/cody/github/csl/time_collocations/data/tf
SUCCESS


# 2.1 Generate Chunk Objects

The ETCBC data is not granular enough for many types of searches beneath the phrase level. For example, when coordinated noun phrases together function as a single phrase, the individual phrases are not delineated. I need them separated out so that I can track coordinated nouns within a phrase. Another case is quantifiers: the quantifier chains are not set apart in any way from the other items in the phrase. For these cases, I make `chunk` objects, which are essentially phrase-like objects. A further problem is that some phrases are split into two, whereas elsewhere in the database the same phrase pattern is represented as a single phrase. This is also fixed by creating a new `chunk` object.

`chunk` objects have two important features, called `label` and `role`. The `label` feature is essentially the name of the function, for instance `timephrase` or `quant` (quantifier). The `role` feature is an edge feature, which maps the chunk's component parts to the new object. For example, in a `quant_NP`, there is an edge drawn from the noun to the `quant_NP` chunk; the edge has a value of "quantified" since the noun is the quantified item.
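As a rough illustration of how `label` and `role` fit together, the sketch below stores them as plain dictionaries: a node feature maps a node to a value, and a valued edge feature maps a source node to its targets and their values. The node numbers and the `chunk_parts` helper are hypothetical and are not taken from `chunking.py`; the role values are assumptions that follow the `quant_NP` example above.

```python
# Hypothetical node numbers, for illustration only.
NOUN = 1001          # word node: the quantified noun
NUMERAL = 1002       # word node: the quantifier
QUANT_NP = 2000001   # new chunk node built over both words

# Node feature: chunk node -> label value
label = {QUANT_NP: 'quant_NP'}

# Valued edge feature: source node -> {target node: edge value}
# (the role values here are assumed for the example)
role = {
    NOUN: {QUANT_NP: 'quantified'},
    NUMERAL: {QUANT_NP: 'quant'},
}

def chunk_parts(chunk, role):
    """Group the nodes that point at `chunk` by their role value."""
    parts = {}
    for source, targets in role.items():
        if chunk in targets:
            parts.setdefault(targets[chunk], []).append(source)
    return parts

parts = chunk_parts(QUANT_NP, role)
print(label[QUANT_NP], parts)
```

Because the edges run from the component words to the chunk, finding the parts of a chunk means collecting every edge that targets it, as `chunk_parts` does.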

In [6]:
chunker = Chunker(locations, base_metadata)

### [relevant, illustrative examples will be shown here]

In [7]:
chunker.execute()

Running chunkers...
	running time chunker...
	running quant chunker...
	running prep chunker...
82289 chunk objects formed...
	73963 chunk objects with label prep
	4072 chunk objects with label timephrase
	3196 chunk objects with label quant_NP
	1058 chunk objects with label quant
	5698 words with chunk role of quant
	3774 words with chunk role of time
	2984 words with chunk role of subs
exporting tf...
  0.00s VALIDATING oslots feature
  0.12s maxSlot=     426584
  0.12s maxNode=    1529088
  0.35s OK: oslots is valid
   |     0.11s T label                to /Users/cody/github/csl/time_collocations/data/tf
   |     0.76s T otype                to /Users/cody/github/csl/time_collocations/data/tf
   |     3.03s T oslots               to /Users/cody/github/csl/time_collocations/data/tf
   |     0.06s T role                 to /Users/cody/github/csl/time_collocations/data/tf
SUCCESS


## 2.2 Generate Helper Data for Chunks

In [8]:
enhance = Enhance(locations, base_metadata)

enhancing chunks...
	calculating fresh TF binaries...
   |      |     0.75s C __levels__           from otype, oslots, otext
   |      |       13s C __order__            from otype, oslots, __levels__
   |      |     0.77s C __rank__             from otype, __order__
   |      |       12s C __levUp__            from otype, oslots, __levels__, __rank__
   |      |     8.50s C __levDown__          from otype, __levUp__, __rank__
   |      |     2.59s C __boundary__         from otype, oslots, __rank__
   |      |     0.09s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse


### 2.2.1 Embedding Data

Quantifier chunks often consist of smaller, component chunks. When clustering time adverbials, it is often not necessary to know that a quantifier contains two component parts. Rather, only the top-level quantifier chunk is needed to indicate that there is quantification in the adverbial. The `embed` feature is a simple `true` or `false` tag which indicates whether a given chunk is contained within another chunk of the same kind. This allows quick and efficient selection of non-embedded chunks with `embed=false`.
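The idea behind `embed` can be sketched with plain sets of word slots. This is only an illustration under stated assumptions, not the code in `enhance.py`: it assumes that "same kind" means "same `label`" and that containment means the chunk's slots are a proper subset of the other chunk's slots. The `embed_values` function and the sample data are hypothetical.

```python
def embed_values(chunks):
    """Tag each chunk 'true' if a same-label chunk properly contains it.

    chunks: {chunk id: (label, set of word slots)}
    """
    embed = {}
    for cid, (lab, slots) in chunks.items():
        embedded = any(
            other != cid and other_lab == lab and slots < other_slots
            for other, (other_lab, other_slots) in chunks.items()
        )
        embed[cid] = 'true' if embedded else 'false'
    return embed

# Hypothetical chunks: one quantifier chain with two component chunks,
# all inside a timephrase chunk.
chunks = {
    1: ('quant', {10, 11, 12}),
    2: ('quant', {10, 11}),
    3: ('quant', {12}),
    4: ('timephrase', {9, 10, 11, 12}),
}

embed = embed_values(chunks)
top_level = [cid for cid, value in embed.items() if value == 'false']
print(embed, top_level)
```

The pairwise comparison is quadratic in the number of chunks, which is fine for a toy example; a real implementation would more likely rely on Text-Fabric's locality API to look up containing nodes.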

In [9]:
enhance.embeddings()

82289 new embed features added...
	77882 embed features with value of false...
	4407 embed features with value of true...


### 2.2.2 Quantifier Time Roles

Quantifiers contained in a timephrase do not yet have a role mapping from the quantified noun to the time chunk. We add that below, creating an edge feature with a role of 'time' from a quantified noun to its embedding `timephrase` chunk.
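A minimal sketch of this step follows, again using plain dictionaries rather than the actual `Enhance` internals. The function name `add_quant_time_roles`, the node numbers, and the `'quantified'` role value are assumptions made for the example.

```python
def add_quant_time_roles(role, label, slots, noun_role='quantified'):
    """Add a 'time' edge from each quantified noun to its embedding timephrase.

    role:  {word node: {chunk node: role value}}  (edge feature)
    label: {chunk node: label}
    slots: {chunk node: set of word slots}
    Returns the number of new edges.
    """
    added = 0
    timephrases = [c for c, lab in label.items() if lab == 'timephrase']
    for word, targets in role.items():
        # iterate over a snapshot, since new edges are added to `targets`
        for chunk, value in list(targets.items()):
            if label.get(chunk) != 'quant_NP' or value != noun_role:
                continue
            for tp in timephrases:
                # the timephrase must contain the whole quant_NP
                if slots[chunk] <= slots[tp] and tp not in targets:
                    targets[tp] = 'time'
                    added += 1
    return added

# Hypothetical data: a quant_NP (node 50) inside a timephrase (node 60).
label = {50: 'quant_NP', 60: 'timephrase'}
slots = {50: {2, 3}, 60: {1, 2, 3}}
role = {3: {50: 'quantified'}, 2: {50: 'quant'}}

added = add_quant_time_roles(role, label, slots)
print(added, role)
```

Only the quantified noun receives the new edge; the quantifier word keeps its single edge to the `quant_NP` chunk.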

In [10]:
enhance.quanttimes()

Adding new quanttime role data...
	539 new time roles added...


### Export Enhancements

In [11]:
enhance.export()

  0.00s Missing dependency for computed data feature "__levels__": "otext"


exporting tf...
   |     0.12s T embed                to /Users/cody/github/csl/time_collocations/data/tf
   |     0.05s T role                 to /Users/cody/github/csl/time_collocations/data/tf
SUCCESS


# 3. Function Associations

This step builds association scores between head lexemes and their phrase functions. The scores take a significant amount of time to recalculate, so the step only needs to be rerun when the phrase function data has changed.
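For orientation, here is a small standard-library sketch of one common way to turn a lexeme-by-function co-occurrence count into a signed score: a two-sided Fisher's exact test whose p-value is log-transformed, with the sign showing whether the pair occurs more or less often than expected. Whether `FunctAssoc` computes its scores exactly this way is an assumption; the function names and the example counts are hypothetical.

```python
import math

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = math.comb(n, col1)

    def prob(x):
        # hypergeometric probability of x in the top-left cell
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # sum every table that is no more probable than the observed one
    p = sum(prob(x) for x in range(low, high + 1)
            if prob(x) <= observed * (1 + 1e-9))
    return min(p, 1.0)

def association(a, b, c, d):
    """Signed -log10(p): positive for attraction, negative for repulsion."""
    p = fisher_two_sided(a, b, c, d)
    strength = math.inf if p == 0 else -math.log10(p)
    expected = (a + b) * (a + c) / (a + b + c + d)
    return strength if a >= expected else -strength

def cap_infinite(scores):
    """Replace +/-inf with the largest/smallest finite score."""
    finite = [s for s in scores.values() if math.isfinite(s)]
    high, low = max(finite), min(finite)
    return {k: high if s == math.inf else low if s == -math.inf else s
            for k, s in scores.items()}

# a: lexeme with function, b: lexeme with other functions,
# c: other lexemes with function, d: everything else (made-up counts)
score = association(8, 2, 1, 5)
print(round(score, 3))
```

The capping step mirrors the "Substituting infinite scores" message in the output below: a p-value that underflows to zero yields an infinite score, which is replaced by the most extreme finite value.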

In [5]:
if run_associations:
    funct_assoc = FunctAssoc(locations, base_metadata)
    funct_assoc.execute()

Beginning head lexeme // phrase function co-occurrence counts...
	8203 unique head lexemes counted...
	22 unique phrase functions counted...
Applying Fisher's tests to 180466 pairwise relations...


  strength = -np.log10(p_value)
  strength = np.log10(p_value)


	calculations DONE
Substituting infinite scores with max and min associations...
	min association: -323.0
	max association: 317.0


  0.00s Missing dependency for computed data feature "__levels__": "otext"


Exporting TF data...
	264517 association scores...
	424152 top associations...
   |     0.37s T funct_assoc          to /Users/cody/github/csl/time_collocations/data/tf
   |     0.58s T top_assoc            to /Users/cody/github/csl/time_collocations/data/tf
SUCCESS
