# Getting Heads 😶
## By Cody Kingham & Christiaan Erwich

## Problem Description
The ETCBC's BHSA core data does not contain a standard syntax tree format. As a result, syntactic and functional relationships between individual words are not mapped in a transparent or easily accessible way. In some cases, fine-grained relationships are ignored altogether. For example, for a given noun phrase (NP), there is no explicit way of obtaining its head noun (i.e. the noun itself without any modifying elements).

This causes numerous problems for research in the realm of semantics. For instance, it is currently very difficult to calculate the complete person, gender, and number (PGN) of a given subject phrase, because PGN is stored at the word level only. That is an inadequate representation: phrases in the ETCBC data often contain coordinate relationships within the phrase. So even if one selects the first "noun" in the phrase and checks its PGN value, one may overlook the presence of another noun which makes the phrase plural.

Ideally, the phrase itself would have a PGN feature. But before this kind of data can be created, it is necessary to separate the head words of a phrase from their modifying elements such as adjectives, determiners, or nouns in construct (genitive) relations.
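To make the PGN problem concrete, here is a minimal sketch with toy data. The word records, the `is_head` flag, and the `phrase_number` helper are all hypothetical; none of them are BHSA features or Text-Fabric calls. The sketch only shows why a word-level number value can misrepresent a coordinated phrase.

```python
# Toy illustration only: hypothetical word records, not real BHSA nodes.
words = [
    {'gloss': 'Moses', 'pdp': 'nmpr', 'number': 'sg', 'is_head': True},
    {'gloss': 'and', 'pdp': 'conj', 'number': None, 'is_head': False},
    {'gloss': 'Aaron', 'pdp': 'nmpr', 'number': 'sg', 'is_head': True},
]


def phrase_number(words):
    """Derive a phrase-level number from the head words of a phrase."""
    heads = [w for w in words if w['is_head']]
    # two or more coordinated heads make the whole phrase plural
    if len(heads) > 1:
        return 'pl'
    return heads[0]['number'] if heads else None


first_noun_number = words[0]['number']      # what a naive word-level lookup sees
whole_phrase_number = phrase_number(words)  # what the phrase actually needs
print(first_noun_number, whole_phrase_number)
```

The `is_head` flag is exactly the information that is missing from the data, which is why the heads have to be isolated first.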

A head word can be defined as the word after which a phrase type is named. A phrase type can be, for example, NP for noun phrase or VP for verb phrase. In this notebook, we experiment with and build the functions stored in `heads.py` in order to export a set of Text-Fabric edge features. The edge features represent a mapping from a phrase node to its head element. 

This goal requires us to think carefully about the way inter-word semantic relations are reflected in the ETCBC's data. The ETCBC data *does* contain some rudimentary semantic embeddings through the so-called [subphrase](https://etcbc.github.io/bhsa/features/hebrew/c/otype). These can be utilized to isolate head words from secondary elements. A subphrase should *not* be thought of as a smaller, embedded phrase, like the ETCBC's phrase-atom (though it sometimes must inadequately fill that role). Rather, the subphrase is a way to encode relationships between words below the level of a phrase(atom), hence "sub." A subphrase can be a single word, or it can be a collection of words. A word can be in multiple subphrases, but cannot be in more than 3 (due to the limitations of the data creation program, [parsephrases](http://www.etcbc.nl/datacreation/#ps3.p)).

## Method
The types of phrases represented in the ETCBC include `NP` (noun phrase), `VP` (verb phrase), `PrNP` (proper noun phrase), `PP` (prepositional phrase), `AdvP` (adverbial phrase), and [eight others](https://etcbc.github.io/bhsa/features/hebrew/c/typ). For some of these types, isolating the head word is a simple affair. By coordinating a word's phrase-dependent part of speech with its enclosing phrase's type, one can identify the head word. For a `VP`, that would mean simply finding the word within the phrase that has a `pdp` (phrase dependent part of speech) value of `verb`. Or for a prepositional phrase, find the word with a `pdp` of `prep`.
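The simple case can be sketched as follows. This is a toy version under stated assumptions: the dictionaries `phrase_typ`, `phrase_words`, and `word_pdp`, the node numbers, and the helper `simple_heads` are hypothetical stand-ins. In the notebook itself the corresponding lookups are `F.typ.v(phrase)`, `L.d(phrase, 'word')`, and `F.pdp.v(word)`.

```python
# Toy stand-ins for Text-Fabric features; node numbers and values are hypothetical.
type_to_pdp = {'VP': 'verb', 'PP': 'prep'}   # subset of the notebook's mapping

phrase_typ = {100: 'VP', 101: 'PP'}          # phrase node -> phrase type
phrase_words = {100: [1], 101: [2, 3, 4]}    # phrase node -> word nodes
word_pdp = {1: 'verb', 2: 'prep', 3: 'art', 4: 'subs'}


def simple_heads(phrase):
    """Return the words whose pdp matches the head pdp of the phrase type."""
    head_pdp = type_to_pdp[phrase_typ[phrase]]
    return [w for w in phrase_words[phrase] if word_pdp[w] == head_pdp]


print(simple_heads(100))  # the verb of the VP
print(simple_heads(101))  # the preposition of the PP
```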

The `NP` and `PrNP`, on the other hand, present special challenges. These phrases often contain multiple words with a modifying relation to the head noun. An example of this is the construct relation (e.g. "Son of Jacob"). The problem becomes particularly thorny when relations like the construct are chained together so that one is faced with the choice between multiple potential head nouns.

To navigate the problem, we must use the feature [rela (relationship)](https://etcbc.github.io/bhsa/features/hebrew/c/rela) stored on `subphrase`s in addition to the `pdp` and phrase `typ` features. In order to isolate the head word of an `NP`, we look for a word within the phrase that has a `pdp` value of `subs` (i.e. noun). We then obtain a list of all the `subphrase`s which contain that word using the [L.u Text-Fabric method](https://github.com/Dans-labs/text-fabric/wiki/Api#locality). From that list of subphrase node numbers we build a list of all the subphrase relations in which the word participates. If the list contains *any* dependent relations, then the word is automatically excluded from being a head word and we can move on to the next candidate. One final check is required for candidate words at the level of the `phrase`: the same procedure described above for `subphrase`s must be performed for `phrase_atom` relations. This means excluding words within a `phrase_atom` that has a dependent relation to another `phrase_atom` within the `phrase`. If the head of a *`phrase_atom`* is being calculated, this step is not necessary.
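The exclusion step can be sketched on toy data. Everything below except the `dependent_sp` set (taken from this notebook) is an assumption for illustration: the node numbers, the lookup dictionaries, and the helper `independent_nouns` are hypothetical. With Text-Fabric loaded, the lookups would be `L.u(word, 'subphrase')` and `F.rela.v(subphrase)`. The sketch covers the subphrase check only, not the additional `phrase_atom` check needed at the `phrase` level.

```python
# Dependent subphrase relations, as defined in this notebook.
dependent_sp = {'rec', 'adj', 'atr', 'mod', 'dem'}

# Hypothetical toy phrase, roughly "word (of) the king":
# word 1 = "word", word 2 = "the", word 3 = "king"
word_pdp = {1: 'subs', 2: 'art', 3: 'subs'}
word_subphrases = {1: [50], 2: [51], 3: [51]}  # word -> enclosing subphrases
subphrase_rela = {50: 'NA', 51: 'rec'}         # subphrase -> rela value


def independent_nouns(words):
    """Keep nouns that sit in no dependent subphrase relation."""
    kept = []
    for w in words:
        if word_pdp[w] != 'subs':
            continue
        # a word without any subphrase counts as having no relation ('NA')
        relas = {subphrase_rela[sp] for sp in word_subphrases.get(w, [])} or {'NA'}
        if relas & dependent_sp:
            continue  # any dependent relation disqualifies the word
        kept.append(w)
    return kept


print(independent_nouns([1, 2, 3]))
```

Word 3 is a noun, but it sits inside a `rec` subphrase, so only word 1 survives as a head candidate.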

There are only two possible `subphrase` or `phrase_atom` relations for a valid head word: `NA` or `par`/`Para`. `NA` means that no relation is reflected: the word is independent. The `par` (`subphrase`) and `Para` (`phrase_atom`) relations stand for parallel relations, i.e. coordinates. Words in a parallel relation require one further test: it must be verified that their mother (using the [edge feature](https://github.com/Dans-labs/text-fabric/wiki/Api#edge-features) "[mother](https://etcbc.github.io/bhsa/features/hebrew/c/mother.html)") is itself a head word. This step requires us to keep track of those words within the phrase which have already been validated. We can do so with a simple list.
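The parallel check with a running list of validated heads might look like the sketch below. It rests on several simplifying assumptions that are not from the source: each candidate word is paired with a single subphrase, the dictionaries and node numbers are hypothetical toy data, and "the mother is a head" is interpreted as "the mother subphrase contains an already validated head word". In the real data the mother would come from the `mother` edge feature.

```python
# Hypothetical toy data: two coordinated heads (words 1 and 3), followed by
# a dependent noun (word 5) with its own coordinate (word 7).
subphrase_words = {60: [1], 61: [3], 70: [5], 71: [7]}
subphrase_rela = {60: 'NA', 61: 'par', 70: 'rec', 71: 'par'}
subphrase_mother = {61: 60, 71: 70}  # parallel subphrase -> mother subphrase


def validate(candidates):
    """candidates: (word, subphrase) pairs in text order."""
    validated = []  # running list of words accepted as heads
    for word, sp in candidates:
        rela = subphrase_rela[sp]
        if rela == 'NA':
            validated.append(word)
        elif rela == 'par':
            mother = subphrase_mother.get(sp)
            # accept a coordinate only if its mother holds a validated head
            if mother is not None and any(
                w in validated for w in subphrase_words[mother]
            ):
                validated.append(word)
        # any other relation is treated as dependent and skipped
    return validated


heads = validate([(1, 60), (3, 61), (5, 70), (7, 71)])
print(heads)
```

Word 7 is parallel to word 5, but word 5 is itself dependent, so neither is accepted.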

## Code Development

Below we experiment with the code and develop the functions that will extract the head nouns. This involves a good deal of manual inspections of the results before exporting the Text-Fabric features.

The code is written immediately below. Associated questions that arise while writing or evaluating the code are contained in the following section.

In [43]:
import collections, sys, os
from tf.fabric import Fabric

TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c')
api = TF.load('''
              book chapter verse
              typ pdp rela mother 
              function lex
              ''')

api.makeAvailableIn(globals())

# for viewing TF search results:
LOC = ('~/github', 'etcbc/bhsa', 'getting_heads')
sys.path.append(os.path.expanduser(f'{LOC[0]}/{LOC[1]}/programs'))
from bhsa import Bhsa
B = Bhsa(*LOC)
B.api.makeAvailableIn(globals())

B.load('mother') # for some reason B makes it to where this has to be reloaded (!?)

This is Text-Fabric 3.2.5
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

115 features found and 0 ignored
  0.00s loading features ...
   |     0.01s B book                 from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.01s B chapter              from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.01s B verse                from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.24s B typ                  from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.14s B pdp                  from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.23s B rela                 from /Users/cody/github/etcbc/bhsa/tf/c
   |     4.19s B mother               from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.05s B function             from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.10s B lex                  from /Users/cody/github/etcbc/bh

**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="{provenance of this corpus}">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="{CORPUS} feature documentation">Feature docs</a> <a target="_blank" href="https://etcbc.github.io/bhsa/api.html" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://github.com/Dans-labs/text-fabric/wiki/api" title="text-fabric-api">Text-Fabric API</a>


This notebook online:
<a target="_blank" href="http://nbviewer.jupyter.org/github/etcbc/lingo/blob/master/heads/getting_heads.ipynb">NBViewer</a>
<a target="_blank" href="https://github.com/etcbc/lingo/blob/master/heads/getting_heads.ipynb">GitHub</a>


In [6]:
def give_heads(phrase):
    '''
    Extracts and returns the heads of a supplied
    phrase or phrase atom based on that phrase's type
    and the relations reflected within the phrase.
    
    --input--
    phrase(atom) node number
    
    --output--
    list of head word node(s) 
    '''
    
    # mapping from phrase type to its head part of speech
    type_to_pdp = {'VP': 'verb', # verb 
                   'NP': 'subs', # noun 
                   'PrNP': 'nmpr', # proper-noun 
                   'AdvP': 'advb', # adverbial 
                   'PP': 'prep', # prepositional 
                   'CP': 'conj', # conjunctive
                   'PPrP': 'prps', # personal pronoun
                   'DPrP': 'prde', # demonstrative pronoun
                   'IPrP': 'prin', # interrogative pronoun
                   'InjP': 'intj', # interjectional
                   'NegP': 'nega', # negative
                   'InrP': 'inrg', # interrogative
                   'AdjP': 'adjv'} # adjective

## Tests

### Make definitions available for tests:

In [7]:
# mapping from phrase type to its head part of speech
type_to_pdp = {'VP': 'verb', # verb 
               'NP': 'subs', # noun 
               'PrNP': 'nmpr', # proper-noun 
               'AdvP': 'advb', # adverbial 
               'PP': 'prep', # prepositional 
               'CP': 'conj', # conjunctive
               'PPrP': 'prps', # personal pronoun
               'DPrP': 'prde', # demonstrative pronoun
               'IPrP': 'prin', # interrogative pronoun
               'InjP': 'intj', # interjectional
               'NegP': 'nega', # negative
               'InrP': 'inrg', # interrogative
               'AdjP': 'adjv'} # adjective

# dependent relations for phrase atoms
dependency_pa = {'Appo', # apposition
                'Spec'} # specification

# dependent relations for subphrases
dependent_sp = {'rec', # nomen rectum
                'adj', # adjunct 
                'atr', # attributive
                'mod', # modifier
                'dem'} # demonstrative

### Test for non-NP phrases with valid pdp but invalid head

Are there cases in which a non-NP phrase(atom) contains a word with the correct pdp value that is probably not a head?

If so, then we must apply our relational tests described above for the NP to other phrase types. If not, then our code can be much simpler for these types.

In [16]:
def test_pdp_safe(phrase_object='phrase_atom'):
    
    '''
    find candidates for mismatched heads and pdp values
    i.e., they have the right pdp, but are in a relation
    that may be considered dependent and non-head-like
    first we make a survey of all the relations and potential
    heads that are found
    '''
    
    pdp_relas_survey = collections.defaultdict(collections.Counter)

    for phrase in F.otype.s(phrase_object):

        typ = F.typ.v(phrase) # phrase type

        # skip noun phrases
        if typ in {'NP', 'PrNP'}: 
            continue

        head_pdp = type_to_pdp[typ]

        maybe_heads = [w for w in L.d(phrase, 'word') 
                           if F.pdp.v(w) == head_pdp]

        # survey the candidate heads' relations
        for word in maybe_heads:

            head_name = typ + '|' + head_pdp
            subphrases = L.u(word, 'subphrase')
            sp_relas = set(F.rela.v(sp) for sp in subphrases)\
                        if subphrases else {'NA'} # <- handle cases without any subphrases (i.e. verbs)

            pdp_relas_survey[head_name].update(sp_relas)

    for name, rela_counts in pdp_relas_survey.items():

        print(name)

        for r, count in rela_counts.items():
            print('\t', r, '-', count)

In [17]:
# for phrase_atoms
test_pdp_safe()

PP|prep
	 NA - 64539
	 par - 3822
	 adj - 42
	 rec - 9
VP|verb
	 NA - 69011
	 rec - 14
	 par - 1
CP|conj
	 NA - 53848
AdvP|advb
	 NA - 5137
	 par - 102
	 mod - 49
	 adj - 1
AdjP|adjv
	 NA - 1854
	 par - 138
	 atr - 6
	 adj - 3
	 rec - 1
InjP|intj
	 NA - 1872
	 par - 11
DPrP|prde
	 NA - 790
NegP|nega
	 NA - 6743
PPrP|prps
	 NA - 4468
	 par - 9
IPrP|prin
	 NA - 797
	 par - 1
InrP|inrg
	 NA - 1291
	 par - 3


In [58]:
# and for phrases
# test_pdp_safe(phrase_object='phrase') # uncomment me

^ These surveys tell us that for several of these phrase types, e.g. `InjP`, we can automatically take the word with the pdp value that corresponds with its phrase type as the head.
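One way to read the phrase-atom survey programmatically is sketched below. The counts are copied from the output printed above; the names `survey`, `independent`, `safe`, and `needs_review` are introduced here for illustration only. A type is treated as "safe" when its candidate heads occur only with `NA` or `par` relations; the mother check for `par` is a separate step not covered here.

```python
# Counts copied from the phrase_atom survey output above.
survey = {
    'PP': {'NA': 64539, 'par': 3822, 'adj': 42, 'rec': 9},
    'VP': {'NA': 69011, 'rec': 14, 'par': 1},
    'CP': {'NA': 53848},
    'AdvP': {'NA': 5137, 'par': 102, 'mod': 49, 'adj': 1},
    'AdjP': {'NA': 1854, 'par': 138, 'atr': 6, 'adj': 3, 'rec': 1},
    'InjP': {'NA': 1872, 'par': 11},
    'DPrP': {'NA': 790},
    'NegP': {'NA': 6743},
    'PPrP': {'NA': 4468, 'par': 9},
    'IPrP': {'NA': 797, 'par': 1},
    'InrP': {'NA': 1291, 'par': 3},
}

independent = {'NA', 'par'}  # relations compatible with being a head

# types whose candidate heads never occur in a dependent relation
safe = sorted(t for t, counts in survey.items() if set(counts) <= independent)
# types with at least one dependent relation, to be inspected manually
needs_review = sorted(t for t in survey if t not in safe)

print('safe:', safe)
print('needs review:', needs_review)
```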

A few questions are raised by the 14 examples of VP with verbs that have a `rec` (regens/rectum) relation. Are these heads or not? We check now...

In [30]:
rec_verbs = '''

phrase_atom typ=VP
    subphrase rela=rec
        word pdp=verb
'''

rec_verbs = sorted(B.search(rec_verbs))

len(rec_verbs)

14

In [32]:
# run notebook locally to see HTML-formatted results

# B.show(rec_verbs) # uncomment me

In all 14 results, the verb serves as the true head word of the `VP`. 

The `PP` also has some strange relations. We find out what is going on with the same kind of inspection. First we look at the `rec` (regens/rectum) relations.

In [33]:
rec_preps = '''

phrase_atom typ=PP
    subphrase rela=rec
        word pdp=prep
'''

rec_preps = sorted(B.search(rec_preps))

len(rec_preps)

13

In [57]:
# B.show(rec_preps)

The PP is different. In cases where the preposition sits in a `subphrase` with the relation `rec`, the preposition is *not* the head. Thus, the algorithm will need to check for these cases.

Now for the `adj` subphrase relation in `PP`:

In [59]:
adj_preps = '''

phrase_atom typ=PP
    subphrase rela=adj
        word pdp=prep
'''

adj_preps = sorted(B.search(adj_preps))

len(adj_preps)

42

In [65]:
# B.show(adj_preps) # uncomment me

The results above show that a preposition in an `adj` subphrase relation is also not a head. These cases have to be excluded.
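The findings so far suggest that the disqualifying relations differ per phrase type: `rec` does not disqualify a verb in a `VP`, while `rec` and `adj` do disqualify a preposition in a `PP`. A minimal sketch of such a type-specific check follows. The names `non_head_relas` and `is_head_candidate` are hypothetical, and the table only records what has been inspected up to this point in the notebook.

```python
# Relations that rule out a head candidate, per phrase type.
# Only the PP has been shown to need exclusions so far; the VP 'rec'
# cases were all true heads, so the VP gets no entry.
non_head_relas = {'PP': {'rec', 'adj'}}


def is_head_candidate(typ, relas):
    """True if none of the word's subphrase relations disqualify it."""
    return not (set(relas) & non_head_relas.get(typ, set()))


print(is_head_candidate('VP', {'rec'}))  # verb in a rec subphrase
print(is_head_candidate('PP', {'rec'}))  # preposition in a rec subphrase
print(is_head_candidate('PP', {'NA'}))   # independent preposition
```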

Now we move on to test the **adverb** relations reflected in the survey...