# Getting Heads 😶
## By Cody Kingham & Christiaan Erwich

## Problem Description
The ETCBC's BHSA core data does not contain the standard syntax tree format. This also means that syntactic and functional relationships between individual words are not mapped in a transparent or easily accessible way. In some cases, fine-grained relationships are ignored altogether. For example, for a given noun phrase (NP), there is no explicit way of obtaining its head noun (i.e. the noun itself without any modifying elements). This causes numerous problems for research in the realm of semantics. For instance, it is currently very difficult to calculate the complete person, gender, and number (PGN) of a given subject phrase. That is because PGN is stored at the word level only. But this is a very inadequate representation. Phrases in the ETCBC often contain coordinate relationships within the phrase. So even if one selects the first "noun" in the phrase and checks for its PGN value, they may overlook the presence of another noun which makes the phrase plural. Ideally, the phrase itself would have a PGN feature. But before this kind of data is created, it is necessary to separate the head words of a phrase from their modifying elements such as adjectives, determiners, or nouns in construct (genitive) relations.

A head word can be defined as the word after which a phrase type is named. For example, the head of a noun phrase (NP) is a noun, and the head of a verb phrase (VP) is a verb. In this notebook, we experiment with and build the functions stored in `heads.py` in order to export a set of Text-Fabric edge features. The edge features represent a mapping from a phrase node to its head element. 

This goal requires us to think carefully about the way inter-word, semantic relations are reflected in the ETCBC's data. The ETCBC data *does* contain some rudimentary semantic embeddings through the so-called [subphrase](https://etcbc.github.io/bhsa/features/hebrew/c/otype). These can be utilized to isolate head words from secondary elements. A subphrase should *not* be thought of as a smaller, embedded phrase, like the ETCBC's phrase-atom (though it sometimes must inadequately fill that role). Rather, the subphrase is a way to encode relationships between words below the level of a phrase(atom), hence "sub." A subphrase can be a single word, or it can be a collection of words. A word can be in multiple subphrases, but it cannot be in more than 3 (due to the limitations of the data creation program, [parsephrases](http://www.etcbc.nl/datacreation/#ps3.p)).

## Method
The types of phrases represented in the ETCBC include `NP` (noun phrase), `VP` (verb phrase), `PrNP` (proper noun phrase), `PP` (prepositional phrase), `AdvP` (adverbial phrase), and [eight others](https://etcbc.github.io/bhsa/features/hebrew/c/typ). For some of these types, isolating the head word is a simple affair. By matching a word's phrase-dependent part of speech (`pdp`) against its enclosing phrase's type, one can identify the head word. For a `VP`, that means simply finding the word within the phrase that has a `pdp` value of `verb`. For a prepositional phrase, it means finding the word with a `pdp` of `prep`.
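The simple cases can be sketched as follows. This is only an illustration under stated assumptions, not the `give_heads` function developed below: the dictionaries are hypothetical stand-ins for the Text-Fabric lookups `F.typ.v`, `L.d`, and `F.pdp.v`, and all node numbers are invented.

```python
# Toy stand-ins for Text-Fabric lookups (hypothetical data, invented node numbers).
PHRASE_TYPE = {100: 'VP', 101: 'PP'}            # plays the role of F.typ.v(phrase)
PHRASE_WORDS = {100: [1], 101: [2, 3, 4]}       # plays the role of L.d(phrase, 'word')
WORD_PDP = {1: 'verb', 2: 'prep', 3: 'art', 4: 'subs'}  # plays the role of F.pdp.v(word)

# phrase types whose head can be found by a plain pdp match
SIMPLE_HEAD_PDP = {'VP': 'verb', 'PP': 'prep'}


def simple_heads(phrase):
    """Return the words whose pdp matches the phrase type's head pdp."""
    wanted = SIMPLE_HEAD_PDP[PHRASE_TYPE[phrase]]
    return tuple(w for w in PHRASE_WORDS[phrase] if WORD_PDP[w] == wanted)


print(simple_heads(100))
print(simple_heads(101))
```

In this toy data the verb phrase yields word 1 and the prepositional phrase yields word 2; the article and noun that follow the preposition are not selected.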

The `NP` and `PrNP`, on the other hand, present special challenges. These phrases often contain multiple words with a modifying relation to the head noun. An example of this is the construct relation (e.g. "Son of Jacob"). The problem becomes particularly thorny when relations like the construct are chained together so that one is faced with the choice between multiple potential head nouns.

To navigate the problem, we must use the feature [rela (relationship)](https://etcbc.github.io/bhsa/features/hebrew/c/rela) stored on `subphrase`s in addition to the `pdp` and phrase type (`typ`) features. In order to isolate the head word of an `NP`, we look for a word within the phrase that has a `pdp` value of `subs` (i.e. noun). We then obtain a list of all the `subphrase`s which contain that word using the [L.u Text-Fabric method](https://github.com/Dans-labs/text-fabric/wiki/Api#locality). We then use the list of subphrase node numbers to create a list of all subphrase relations containing the word. If the list contains *any* dependent relations, then the word is automatically excluded from being a head word and we can move on to the next candidate. One final check is required for candidate words at the level of the `phrase`: the same procedure described above for `subphrase`s must be performed for `phrase_atom` relations. This means excluding words within a `phrase_atom` with a dependent relation to another `phrase_atom` within the `phrase`. If the head of a *`phrase_atom`* is being calculated, this step is not necessary.
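The exclusion step can be sketched with a set difference. The snippet is a minimal sketch that assumes toy data: the dictionaries stand in for `F.pdp.v`, `L.u(word, 'subphrase')`, and `F.rela.v`, and the node numbers and the example phrase ("house of the king") are invented for illustration.

```python
# Toy data for an invented NP "house of the king" (hypothetical node numbers).
WORD_PDP = {1: 'subs', 2: 'art', 3: 'subs'}     # stands in for F.pdp.v
WORD_SUBPHRASES = {1: [50], 2: [51], 3: [51]}   # stands in for L.u(word, 'subphrase')
SUBPHRASE_RELA = {50: 'NA', 51: 'rec'}          # stands in for F.rela.v

# relations that do not disqualify a candidate head
INDEPENDENT = {'NA', 'par', 'Para'}


def is_independent(word):
    """True if none of the word's subphrases carries a dependent relation."""
    relas = {SUBPHRASE_RELA[sp] for sp in WORD_SUBPHRASES.get(word, [])} or {'NA'}
    # any relation left over after removing the independent ones is dependent
    return not (relas - INDEPENDENT)


candidates = [w for w in (1, 2, 3) if WORD_PDP[w] == 'subs']
heads = [w for w in candidates if is_independent(w)]
print(candidates, heads)
```

Both nouns are candidates, but only word 1 survives, because word 3 sits in a subphrase with the `rec` relation.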

There are only two kinds of `subphrase` or `phrase_atom` relations that a valid head word may have: `NA` or `par`/`Para`. `NA` means that no relation is reflected. The word is independent. `par` (`subphrase`) and `Para` (`phrase_atom`) stand for parallel relations, i.e. coordinates. These words require one further test: it must be verified that their mother (using the [edge feature](https://github.com/Dans-labs/text-fabric/wiki/Api#edge-features) "[mother](https://etcbc.github.io/bhsa/features/hebrew/c/mother.html)") is itself a head, i.e. that it contains a word already validated as a head. This step thus requires us to keep track of those words within the phrase which have been validated. We can do so with a simple list.
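The mother check for parallel relations can be sketched as below. This is a hedged illustration rather than the code in `heads.py`: the dictionaries are hypothetical stand-ins for `F.pdp.v`, `L.u`, `F.rela.v`, and the `mother` edge, the function name `find_heads` is invented, and the two toy phrases are constructed only to show one accepted and one rejected coordinate.

```python
# Toy data (hypothetical). Phrase A: words 1-3; phrase B: words 4-6.
WORD_PDP = {1: 'subs', 2: 'conj', 3: 'subs', 4: 'subs', 5: 'subs', 6: 'subs'}
WORD_SUBPHRASES = {1: [60], 3: [61], 4: [70], 5: [71], 6: [72]}
SUBPHRASE_RELA = {60: 'NA', 61: 'par', 70: 'NA', 71: 'rec', 72: 'par'}
MOTHER = {61: 60, 72: 71}  # daughter subphrase -> mother subphrase

INDEPENDENT = {'NA', 'par', 'Para'}
PARALLEL = {'par', 'Para'}


def find_heads(words, head_pdp='subs'):
    """Validate candidates in order, tracking accepted heads in a list."""
    validated = []
    for w in words:
        if WORD_PDP[w] != head_pdp:
            continue
        sps = WORD_SUBPHRASES.get(w, [])
        relas = {SUBPHRASE_RELA[sp] for sp in sps} or {'NA'}
        if relas - INDEPENDENT:
            continue  # a dependent relation excludes the word
        par_sps = [sp for sp in sps if SUBPHRASE_RELA[sp] in PARALLEL]
        # subphrases that contain an already validated head
        head_sps = {sp for h in validated for sp in WORD_SUBPHRASES.get(h, [])}
        if all(MOTHER.get(sp) in head_sps for sp in par_sps):
            validated.append(w)
    return validated


print(find_heads([1, 2, 3]))
print(find_heads([4, 5, 6]))
```

In phrase A the coordinate (word 3) is accepted because its mother subphrase contains the validated head (word 1). In phrase B the coordinate (word 6) is rejected because its mother is the dependent `rec` subphrase.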

## Code Development

Below we experiment with the code and develop the functions that will extract the head nouns. This involves a good deal of manual inspections of the results before exporting the Text-Fabric features.

The code is written immediately below. Associated questions that arise while writing or evaluating the code are contained in the subsequent section.

In [33]:
import collections, sys, os

# Load Text-Fabric and B (visualizer)
# separate import of tf.fabric not needed with bhsa module
LOC = ('~/github', 'etcbc/bhsa', 'getting_heads')
sys.path.append(os.path.expanduser(f'{LOC[0]}/{LOC[1]}/programs'))
from bhsa import Bhsa 
B = Bhsa(*LOC)
B.api.makeAvailableIn(globals())

B.load('''
      book chapter verse
      typ pdp rela mother 
      function lex sp
       ''')

**Documentation:** [BHSA](https://etcbc.github.io/bhsa) | [Feature docs](https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html) | [BHSA API](https://etcbc.github.io/bhsa/api.html) | [Text-Fabric API](https://github.com/Dans-labs/text-fabric/wiki/api)

This notebook online: [NBViewer](http://nbviewer.jupyter.org/github/etcbc/lingo/blob/master/heads/getting_heads.ipynb) | [GitHub](https://github.com/etcbc/lingo/blob/master/heads/getting_heads.ipynb)


In [6]:
def give_heads(phrase):
    '''
    Extracts and returns the heads of a supplied
    phrase or phrase atom based on that phrase's type
    and the relations reflected within the phrase.
    
    --input--
    phrase(atom) node number
    
    --output--
    tuple of head word node(s) 
    '''
    
    # mapping from phrase type to good part of speech values for heads
    head_pdps = {'VP': {'verb'},                   # verb 
                 'NP': {'subs'},                   # noun 
                 'PrNP': {'nmpr', 'subs'},         # proper-noun 
                 'AdvP': {'advb', 'nmpr', 'subs'}, # adverbial 
                 'PP': {'prep'},                   # prepositional 
                 'CP': {'conj', 'prep'},           # conjunctive
                 'PPrP': {'prps'},                 # personal pronoun
                 'DPrP': {'prde'},                 # demonstrative pronoun
                 'IPrP': {'prin'},                 # interrogative pronoun
                 'InjP': {'intj'},                 # interjectional
                 'NegP': {'nega'},                 # negative
                 'InrP': {'inrg'},                 # interrogative
                 'AdjP': {'adjv'}                  # adjective
                } 
    
    # get the phrase type and the candidate head words (nodes) with a matching pdp
    phrase_type = F.typ.v(phrase)
    head_candidates = [w for w in L.d(phrase, 'word')
                          if F.pdp.v(w) in head_pdps[phrase_type]]
        
    # VP with verbs require no further processing, return the head verb
    # (slicing keeps this safe if no verb is found; the result is a tuple of nodes)
    if phrase_type == 'VP':
        return tuple(head_candidates[:1])
        
    # prepare to process all other types 
    # these are acceptable subphrase/phraseatom relations for a candidate head word
    independent_relas = {'NA', 'par', 'Para'}
        
    # go head-hunting!
    heads = []
    for word in head_candidates:
        
        # check the word's subphrase (+ phrase_atom if otype is phrase) relations
        

In [65]:
{'rec', 'NA'} - {'NA', 'par', 'Para'}

{'rec'}

## Tests

In this section, important questions are asked whose answers are needed to ensure the code is written correctly. The BHSA data is queried to answer them. These are questions like, "Do we need to check for relational independency for only noun phrases?" (no); and "does every phrase type have a word with a corresponding pdp?" (no).

### Make definitions available for tests:

In [7]:
# mapping from phrase type to its head part of speech
type_to_pdp = {'VP': 'verb', # verb 
               'NP': 'subs', # noun 
               'PrNP': 'nmpr', # proper-noun 
               'AdvP': 'advb', # adverbial 
               'PP': 'prep', # prepositional 
               'CP': 'conj', # conjunctive
               'PPrP': 'prps', # personal pronoun
               'DPrP': 'prde', # demonstrative pronoun
               'IPrP': 'prin', # interrogative pronoun
               'InjP': 'intj', # interjectional
               'NegP': 'nega', # negative
               'InrP': 'inrg', # interrogative
               'AdjP': 'adjv'} # adjective

# dependent relations for phrase atoms
dependency_pa = {'Appo', # apposition
                'Spec'} # specification

# dependent relations for subphrases
dependent_sp = {'rec', # nomen rectum
                'adj', # adjunct 
                'atr', # attributive
                'mod', # modifier
                'dem'} # demonstrative

### Test for non-NP phrases with valid pdp but invalid head

These tests demonstrate that subphrase relation checks are also needed for phrase types besides noun phrases. The only valid subphrase/phrase_atom relations for any potential head word are `NA` and `par`/`Para`. While a few phrase types do not need additional relational checks, e.g. personal pronoun phrases, we can go ahead and consistently handle all phrases in the same way.

The only exception to the above rule is the `VP`: in 14 cases, the `VP`'s head word (the verb) is also in a subphrase with a `rec` (nomen rectum) relation.

The operational question of these tests was:
> Are there cases in which a non-NP phrase(atom) contains a word with the corresponding pdp value, but which is probably not a head?

To answer the question, we first survey all cases where the phrase type's head candidate is in a subphrase with a relation that is not normally "independent." Based on the survey, we manually check the most pertinent phrase types and results. The tests reveal that, indeed, relation checks are needed for many phrase types.

In [57]:
def test_pdp_safe(phrase_object='phrase_atom'):
    
    '''
    Make a survey of phrase types and their matching pdp words,
    count what kinds of subphrase relations these words 
    occur in. The survey can then be used to investigate 
    whether phrase types besides noun phrases require relationship
    checks for independency.
    '''
    
    pdp_relas_survey = collections.defaultdict(lambda: collections.Counter())
    headless = 0
    
    for phrase in F.otype.s(phrase_object):

        typ = F.typ.v(phrase) # phrase type

        # skip noun phrases
        if typ in {'NP', 'PrNP'}: 
            continue

        head_pdp = type_to_pdp[typ]

        maybe_heads = [w for w in L.d(phrase, 'word') 
                           if F.pdp.v(w) == head_pdp]
        
        # this check shows that many
        # phrases don't have a word 
        # with a corresponding pdp!
        if not maybe_heads:
            headless += 1

        # survey the candidate heads' relations
        for word in maybe_heads:

            head_name = typ + '|' + head_pdp
            subphrases = L.u(word, 'subphrase')
            sp_relas = set(F.rela.v(sp) for sp in subphrases)\
                        if subphrases else {'NA'} # <- handle cases without any subphrases (i.e. verbs)

            pdp_relas_survey[head_name].update(sp_relas)

    print(f'phrases without matching pdp: {headless}\n')
    print('subphrase relation survey: ')
    for name, rela_counts in pdp_relas_survey.items():

        print(name)

        for r, count in rela_counts.items():
            print('\t', r, '-', count)

In [58]:
# for phrase_atoms
test_pdp_safe()

phrases without matching pdp: 837

subphrase relation survey: 
PP|prep
	 NA - 64539
	 par - 3822
	 adj - 42
	 rec - 9
VP|verb
	 NA - 69011
	 rec - 14
	 par - 1
CP|conj
	 NA - 53848
AdvP|advb
	 NA - 5137
	 par - 102
	 mod - 49
	 adj - 1
AdjP|adjv
	 NA - 1854
	 par - 138
	 atr - 6
	 adj - 3
	 rec - 1
InjP|intj
	 NA - 1872
	 par - 11
DPrP|prde
	 NA - 790
NegP|nega
	 NA - 6743
PPrP|prps
	 NA - 4468
	 par - 9
IPrP|prin
	 NA - 797
	 par - 1
InrP|inrg
	 NA - 1291
	 par - 3


In [59]:
# and for phrases
test_pdp_safe(phrase_object='phrase')

phrases without matching pdp: 670

subphrase relation survey: 
PP|prep
	 NA - 62330
	 par - 3676
	 adj - 42
	 rec - 10
VP|verb
	 NA - 69011
	 rec - 14
	 par - 1
CP|conj
	 NA - 52537
AdvP|advb
	 NA - 5089
	 par - 101
	 mod - 46
	 adj - 1
AdjP|adjv
	 NA - 1803
	 par - 119
	 atr - 6
	 adj - 3
	 rec - 1
InjP|intj
	 NA - 1872
	 par - 11
DPrP|prde
	 NA - 791
NegP|nega
	 NA - 6743
PPrP|prps
	 NA - 4388
	 par - 9
IPrP|prin
	 NA - 797
	 par - 1
InrP|inrg
	 NA - 1291
	 par - 3


^ These surveys tell us that for several of these phrase types, e.g. `InjP`, we can automatically take the word with the pdp value that corresponds with its phrase type as the head.

There are also quite a few cases where the phrase type does not have a word with a matching pdp value: 837 for phrase atoms and 670 for phrases. In the subsequent section we will run tests to find out why this is the case.

Back to the question of this section: There are 14 examples of VP with verbs that are in a subphrase with a `rec` (nomen rectum) relation. Are these heads or not? We check now...

In [60]:
def find_and_show(search_pattern):
    results = sorted(B.search(search_pattern))
    print(len(results), 'results')
    B.show(results)

In [12]:
# run notebook locally to see HTML-formatted results for the below searches


rec_verbs = '''

phrase_atom typ=VP
    subphrase rela=rec
        word pdp=verb
'''

#find_and_show(rec_verbs) # uncomment me!

In all 14 results, the verb serves as the true head word of the `VP`.

*Note: The verb will prove to be an exception, as all other words in a `rec` relation are not head words*

The `PP` also has some strange relations. We see what's going on with the same kind of inspection. First we look at the `rec` (nomen rectum) relations.

In [13]:
rec_preps = '''

phrase_atom typ=PP
    subphrase rela=rec
        word pdp=prep
'''

#find_and_show(rec_preps) #uncomment me!

The PP is different. In cases where the preposition is in a subphrase with a `rec` relation, the preposition is *not* the head. Thus, the algorithm will need to check for these cases.

Now for the `adj` subphrase relation in `PP`:

In [14]:
adj_preps = '''

phrase_atom typ=PP
    subphrase rela=adj
        word pdp=prep
'''

#find_and_show(adj_preps) # uncomment me!

The results above show that a preposition in a subphrase with the `adj` relation is also not a head. These cases have to be excluded.

Now we move on to test the **adverb** relations reflected in the survey...

In [15]:
adv_adj = '''

phrase_atom typ=AdvP
    subphrase rela=adj
        word pdp=advb

'''

#find_and_show(adv_adj) # uncomment me!

An adverb in an `adj` relation within the adverbial phrase is also not a true head. Now for the `mod` (modifier) relation.

In [16]:
adv_mod = '''

phrase_atom typ=AdvP
    subphrase rela=mod
        word pdp=advb

'''

#find_and_show(adv_mod) # uncomment me!

In this case, it appears that `mod` is also an invalid relation for adverb phrases. An example is גם הלם ('also here'), where גם is the adverb in a `mod` relation, but the head is really הלם "here" (also an adverb). In several cases, the modifier modifies a verb. In these cases the "head," often a participle or infinitive, acts as the adverb, even though it is not explicitly marked as such.

Now we move on to the last examination, that of the `AdjP` (adjective phrase). There are three relations of interest:
> atr - 6 <br>
> adj - 3 <br>
> rec - 1 <br>

In [17]:
adj_atr = '''

phrase_atom typ=AdjP
    subphrase rela=atr
        word pdp=adjv

'''

#find_and_show(adj_atr) # uncomment me!

In [18]:
adj_adj = '''

phrase_atom typ=AdjP
    subphrase rela=adj
        word pdp=adjv

'''

#find_and_show(adj_adj) # uncomment me!

In [19]:
adj_rec = '''

phrase_atom typ=AdjP
    subphrase rela=rec
        word pdp=adjv

'''

#find_and_show(adj_rec) # uncomment me!

The results for the three searches above show that adjectives in `atr`, `adj`, and `rec` relations are indeed not head words.

### Tests for phrase types without a word that has a valid pdp value

The initial survey above revealed that 837 phrase atoms and 670 phrases lack a word with a corresponding pdp value. Here we investigate to see why that is the case. Is there a way to compensate for this problem? Are these truly phrases that lack heads?

We run another survey and count the phrase types against the non-matching pdp values found within them. At this point, we must also exclude words that have dependent relations, keeping only words whose subphrase relations are independent as defined above (`NA` or `par`).

In [22]:
count_no_pdp = collections.defaultdict(lambda: collections.Counter())
record_no_pdp = collections.defaultdict(lambda: collections.defaultdict(list))

for phrase in F.otype.s('phrase_atom'):
    
    typ = F.typ.v(phrase)
    
    # see if there is not corresponding pdp value
    corres_pdp = type_to_pdp[typ]
    corresponding_pdps = [w for w in L.d(phrase, 'word') 
                             if F.pdp.v(w) == corres_pdp]
    
    if not corresponding_pdps:
        
        # put potential heads here
        maybe_heads = []
        
        # calculate subphrase relations
        for word in L.d(phrase, 'word'):
            
            # get subphrase relations
            word_subphrs = L.u(word, 'subphrase')
            sp_relas = set(F.rela.v(sp) for sp in word_subphrs) or {'NA'}
            
            # check subphrase relations for independence
            if sp_relas == {'NA'}:
                maybe_heads.append(word)
                
            # test parallel relation for independence
            elif sp_relas == {'NA', 'par'} or sp_relas == {'par'}:
                
                # check for good, head mothers
                good_mothers = set(sp for w in maybe_heads for sp in L.u(w, 'subphrase'))
                this_daughter = [sp for sp in word_subphrs if F.rela.v(sp) == 'par'][0]
                this_mother = E.mother.f(this_daughter)
                
                if this_mother in good_mothers:
                    maybe_heads.append(word)
                    
        # sanity check
        # maybe_heads should have SOMETHING
        if not maybe_heads:
            raise Exception(f'phrase {phrase} looks HEADLESS!')
        
        # count pdp types
        head_pdps = [F.pdp.v(w) for w in maybe_heads]
        count_no_pdp[typ].update(head_pdps)
        
        # save for examination
        for word in maybe_heads:
            record_no_pdp[typ][F.pdp.v(word)].append((phrase, word))
        
for name, counts in count_no_pdp.items():

    print(name)

    for pdp, count in counts.items():
        print('\t', pdp, '-', count)

AdvP
	 nmpr - 253
	 subs - 499
	 art - 190
	 conj - 13
PrNP
	 subs - 9
	 art - 3
CP
	 prep - 85
	 subs - 79
	 advb - 6
NP
	 adjv - 1
	 intj - 1


These results are a bit puzzling. The numbers here are words within the phrase atoms that have NO subphrase relations. That means, for example, words such as הַ "the" do not appear to have any subphrase relation to their modified nouns. That again illustrates the shortcoming of the ETCBC data in this respect. There should be a relation from the article to the determined noun.

From this point forward, I will begin working through all four phrase types and the cases reflected in the survey.

We begin with the `AdvP` type and the article. Upon some initial inspection, I've found that in many of the `AdvP` with the article, there is also a substantive (`subs`) that was found by the search. Are there any cases where there is no `nmpr` or `subs` found alongside the article? We can use the dict `record_no_pdp`, which has recorded all cases reflected in the survey. Below I check whether all 190 cases of an article in these `AdvP` phrases also have a corresponding noun.

In [23]:
no_noun = []

for phrase in record_no_pdp['AdvP']['art']:
    
    pdps = set(F.pdp.v(w) for w in L.d(phrase[0], 'word'))
    
    if not {'nmpr', 'subs'} & pdps:
        no_noun.append((phrase,))
        
print(len(no_noun), 'without nouns found...')

0 without nouns found...


There it is. So all cases of these articles can be discarded. In these cases, the noun serves as the head of the adverbial phrase. An example of this is when the noun marks the location of the action (hence adverb). 

Next, we check the conjunctions found in the adverbial phrases. Are any of those heads?

In [113]:
#B.show(record_no_pdp['AdvP']['conj']) # uncomment me!

All conjunctions in these `AdvP` phrases function to mark coordinate elements (only ו in these results). They can also be discarded as not possible heads.

Now we investigate the `PrNP` results with `subs` and `art`...

In [27]:
#B.show(record_no_pdp['PrNP']['subs']) # uncomment me!

In [29]:
#B.show(record_no_pdp['PrNP']['art']) # uncomment me!

The `art` words reflected in the second search are not heads, but are all related to a substantive. All of the results in `subs` are heads. Thus, the only acceptable pdp for `PrNP` besides a proper noun is `subs`.

Now we dig into `CP` results. 85 of them have no `pdp` of conjunction, but have a preposition instead. Let's see what's going on...

In [31]:
#B.show(record_no_pdp['CP']['prep'][:20]) # uncomment me!

These are very interesting results. These conjunction phrases are made up of constructions like ב+עבור and ב+טרם. Together these words function as a conjunction, but alone they are prepositions and particles. Is it even possible in this case to say that there is a "head"?

It could be said that these combinations of words mean more than the sum of their parts; they are good examples of constructions, i.e. combinations of words whose meaning cannot be inferred simply from their individual words. Constructions illustrate the vague boundary between syntax and lexicon (cf. e.g. Goldberg, 1995, *Constructions*).

While these words are indeed marked as conjunction phrases, it is better in this case to analyze them as prepositional phrases (which they also are...this is another shortcoming of our data, or perhaps a mistake??). Thus, the head is the preposition, not the prepositional object.

We should expect that the remaining `subs` and `advb` groups are in fact the objects of those prepositions (and hence excluded). Let's test that assumption by looking for a preposition immediately before these words...

In [39]:
no_prep = []

for (phrase, word) in record_no_pdp['CP']['subs'] + record_no_pdp['CP']['advb']:
    
    possible_prep = word - 1
    
    if F.sp.v(possible_prep) != 'prep':
        no_prep.append((phrase, word))
        
print(f'subs|advb with no preceding prepositions: {len(no_prep)}')

subs|advb with no preceding prepositions: 0


The result confirms that none of the substantives or adverbs will be the head of a conjunction phrase: each one is directly preceded by a preposition. A preposition is the only other kind of head for the `CP` besides a conjunction itself.
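The findings so far can be restated as data. The sketch below is only a summary under assumptions: it copies the `type_to_pdp` mapping defined earlier in this section and adds the extra head values established by the tests above (`subs` for `PrNP`, `nmpr` and `subs` for `AdvP`, `prep` for `CP`). The name `extra_head_pdps` is hypothetical and introduced only for illustration.

```python
# one-to-one mapping from phrase type to head pdp, as defined for the tests
type_to_pdp = {'VP': 'verb', 'NP': 'subs', 'PrNP': 'nmpr', 'AdvP': 'advb',
               'PP': 'prep', 'CP': 'conj', 'PPrP': 'prps', 'DPrP': 'prde',
               'IPrP': 'prin', 'InjP': 'intj', 'NegP': 'nega', 'InrP': 'inrg',
               'AdjP': 'adjv'}

# additional head pdp values found by the tests (hypothetical helper name)
extra_head_pdps = {'PrNP': {'subs'},
                   'AdvP': {'nmpr', 'subs'},
                   'CP': {'prep'}}

# merge both into a mapping from phrase type to a set of acceptable head pdps
head_pdps = {typ: {pdp} | extra_head_pdps.get(typ, set())
             for typ, pdp in type_to_pdp.items()}

for typ in ('PrNP', 'AdvP', 'CP'):
    print(typ, sorted(head_pdps[typ]))
```

The merged sets for these three types agree with the `head_pdps` dictionary used in `give_heads`.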

Finally, we're left with the noun phrase (`NP`) cases for which no matching noun was found. The search found instead one `adjv` (adjective) and one `intj` (interjection). Let's look at each.

In [48]:
#B.show(record_no_pdp['NP']['adjv']) # uncomment me

This ^ one actually looks like a mistake in the BHSA data. The adjective ידיד "beloved" functions here as a substantive, and hence should have a phrase dependent part of speech of `subs` not `adjv`. This phrase will not receive a head relation, because the underlying data itself is flawed and should be fixed. Then a new run of the algorithm will assign the proper head.

In [52]:
#B.show(record_no_pdp['NP']['intj']) # uncomment me

In this case, the word אוי "woe" functions like a noun. This thus appears to be another mislabeled `pdp` value, since it should read `subs`. This, like the previous example, will not receive a head value due to the mistake.