# Getting Heads 2

This notebook aims to develop a new method of head detection using insights gained from the first version of this data. This new effort improves on the previous one in two main ways:

* head selection is performed using Text Fabric templates, which offer a clearer, more transparent way to select and filter data
* it aims to track and address all edge cases

Most of the rationale and rules generated in [getting_heads.ipynb](getting_heads.ipynb) are carried over to this present notebook.

In [1]:
from tf.extra.bhsa import Bhsa
import collections

In [2]:
A = Bhsa(hoist=globals(), silent=True)
print(f'running version {A.version} of BHSA...')

running version c of BHSA...


# Defining Heads

The basic definition of a phrase head from the previous version is carried over here, which is:
> the word with a part of speech after which a phrase type is named

As applied in the previous effort, this includes a secondary criterion:
> the word which semantically determines grammatical agreement

This latter case thus excludes quantifiers such as כל and cardinal numbers that are in construct or attribution to a given word.

From the point of view of the ETCBC database, heads can be extracted using the `subphrase` object and its relations. These relations are not always coded in a transparent or beneficial way. But they are at least useful enough to disambiguate independent words from dependent words. From the ETCBC database perspective, we add a third criterion:
> a word contained in an independent subphrase or a subphrase only dependent upon a quantifier


## Tracking Head Selection

Using the guiding principles listed above, we will follow a process of deduction for assigning heads to phrases. We select all phrases to track which heads are accounted for.

In [21]:
remaining_phrases = set(result[0] for result in A.search('phrase'))
covered_phrases = set()
remaining_types = list(feat[0] for feat in F.typ.freqList(nodeTypes='phrase'))

  0.09s 253207 results


In [22]:
phrase2heads = collections.defaultdict(set)

In [23]:
def record_head(phrase, head, mapping=phrase2heads, remaining=remaining_phrases, covered=covered_phrases):
    '''
    Simple function to track phrases
    with heads that are accounted for
    and to modify the phrase2heads
    dict, which is a mapping from a phrase
    node to its head nodes.
    '''
    # discard() rather than remove(): a phrase with plural heads
    # is passed in more than once, and after the first call it is
    # no longer in the remaining set
    remaining.discard(phrase)
    
    mapping[phrase].add(head) # record it
    covered.add(phrase)
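
As a small illustration of this bookkeeping, the sketch below re-creates the function on made-up integer node numbers (10, 11, 101, 102 and 103 are hypothetical, not real BHSA nodes). It uses `set.discard`, which silently ignores a phrase that was already removed. The toy names (`toy_remaining`, `toy_covered`, `toy_phrase2heads`, `toy_record_head`) are assumptions introduced only for this example.

```python
import collections

# Toy stand-ins for the notebook's globals; node numbers are invented.
toy_remaining = {10, 11}
toy_covered = set()
toy_phrase2heads = collections.defaultdict(set)

def toy_record_head(phrase, head):
    """Record `head` for `phrase` and mark the phrase as covered."""
    toy_remaining.discard(phrase)  # no error if the phrase is already gone
    toy_phrase2heads[phrase].add(head)
    toy_covered.add(phrase)

toy_record_head(10, 101)
toy_record_head(10, 102)  # second head of the same phrase
toy_record_head(11, 103)

print(dict(toy_phrase2heads))
print(toy_remaining, toy_covered)
```

A phrase with two heads leaves the remaining set once and ends up with both heads in the mapping.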

## Simple Heads

The selection of heads for certain phrase types is very straightforward. Those are defined in the templates below and are subsequently applied. These phrase types are selected based on the survey of their subphrase relations as found in the old notebook.

In [24]:
simp_heads = dict(

PPrP = '''

phrase typ=PPrP
    word pdp=prps

''',

DPrP = '''

phrase typ=DPrP
    word pdp=prde

''',

InjP = '''

phrase typ=InjP
    word pdp=intj

''',

NegP = '''

phrase typ=NegP
    word pdp=nega

''',

InrP = '''

phrase typ=InrP
    word pdp=inrg

''',

) # end of dictionary

### Make Queries, Record Heads, See What Remains

Here we run the queries and run `record_head` over each result. In all of the templates the head is the second item in the result tuple.

In [25]:
def query_heads(querydict):
    '''
    Runs queries on phrasetype/query dict.
    Reports results.
    Adds results.
    '''
    for phrasetype, query in querydict.items():
        print(f'running query on {phrasetype}')
        results = A.search(query, silent=True)
        print(f'\t{len(results)} results found')
        for phrase, head in results:
            record_head(phrase, head)
            
def heads_status():
    # simply prints accounted vs unaccounted heads
    print(f'{len(covered_phrases)} phrases matched with a head...')
    print(f'{len(remaining_phrases)} phrases remaining...')
            
query_heads(simp_heads)
        
print('\n', '<>'*20, '\n')
heads_status()

running query on PPrP
	4392 results found
running query on DPrP
	791 results found
running query on InjP
	1883 results found
running query on NegP
	6742 results found
running query on InrP
	1291 results found

 <><><><><><><><><><><><><><><><><><><><> 

15070 phrases matched with a head...
238137 phrases remaining...
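
The five result counts above add up to 15099, while only 15070 phrases were matched, so 29 results appear to be additional heads of phrases that were already recorded. The sketch below reproduces that arithmetic and shows one way such multi-head phrases could be listed. The mapping `toy_phrase2heads` and its node numbers are invented for illustration and stand in for `phrase2heads`.

```python
# Result counts copied from the query output above.
result_counts = {'PPrP': 4392, 'DPrP': 791, 'InjP': 1883, 'NegP': 6742, 'InrP': 1291}
matched_phrases = 15070

# Results that did not add a new phrase must be extra heads.
extra_heads = sum(result_counts.values()) - matched_phrases
print(f'{extra_heads} results are additional heads')

# Toy mapping from phrase node to head nodes (invented numbers).
toy_phrase2heads = {10: {101, 102}, 11: {103}, 12: {104, 105, 106}}

# Phrases with more than one recorded head.
multi_head = {p: h for p, h in toy_phrase2heads.items() if len(h) > 1}
# Heads beyond the first, summed over all phrases.
toy_extra = sum(len(h) - 1 for h in toy_phrase2heads.values())

print(sorted(multi_head), toy_extra)
```

In the notebook itself, the same comprehension could be run over `phrase2heads` to inspect these phrases.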


### Find Remaining Phrases

What phrases with the above types remain unaccounted for?

In [26]:
unaccounted_simp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) in simp_heads)
len(unaccounted_simp)

0

In [27]:
for typ in simp_heads:
    remaining_types.remove(typ)
print(remaining_types)

['VP', 'PP', 'CP', 'NP', 'PrNP', 'AdvP', 'AdjP', 'IPrP']


## Mostly Simple Heads

The next set of heads requires a bit more care, since these phrases can contain a greater variety of relationships.

### VP
Note that there is one VP that has more than one verb:

In [28]:
mult_verbs = A.search('''

phrase typ=VP
/with/
    word pdp=verb
    < word pdp=verb
/-/
''')
A.show(mult_verbs, condenseType='clause', withNodes=True)

  1.42s 1 result




**clause** *1*



The template below keeps a verb as head only if no other verb precedes it, so the phrase above receives a single head, while VPs that do not begin with a verb are still matched.

In [29]:
VP = '''

phrase typ=VP
    
    head:word pdp=verb
    
    /without/
    phrase
        word pdp=verb
        < head
    /-/

'''

VP_search = A.search(VP)

for phrase, head in VP_search:
    record_head(phrase, head)
    
heads_status()

  1.63s 69024 results
84094 phrases matched with a head...
169113 phrases remaining...
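
Read procedurally, the template appears intended to keep a verb as head only when no other verb precedes it in its phrase. A rough pure-Python sketch of that rule follows, where each phrase is a list of `(word_node, pdp)` pairs in text order. The node numbers, the toy phrases and the helper `first_verb` are hypothetical and not part of Text-Fabric.

```python
# Toy phrases: lists of (word_node, pdp) pairs in text order.
# All node numbers and tag sequences are invented for illustration.
toy_phrases = {
    900: [(1, 'verb')],                # ordinary VP
    901: [(2, 'nega'), (3, 'verb')],   # VP that does not begin with a verb
    902: [(4, 'verb'), (5, 'verb')],   # VP with two verbs
}

def first_verb(words):
    """Return the first word tagged as a verb, or None if there is none."""
    for node, pdp in words:
        if pdp == 'verb':
            return node
    return None

toy_heads = {phrase: first_verb(words) for phrase, words in toy_phrases.items()}
print(toy_heads)
```

Each toy phrase receives exactly one head, including the one with two verbs.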


### VP Sanity Check

We double-check that the multi-verb phrase found above has only one head.

In [30]:
phrase2heads[893310]

{403602}

See what's left...

In [31]:
unaccounted_vp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'VP')
len(unaccounted_vp)

0

In [32]:
remaining_types.remove('VP')
print(remaining_types)

['PP', 'CP', 'NP', 'PrNP', 'AdvP', 'AdjP', 'IPrP']


### CP

The conjunction phrase is relatively straightforward. But there are 1140 cases where the conjunction phrase is technically headed by a preposition in the ETCBC data. These are phrases such as בטרם and בעבור (see the more detailed analysis in the previous notebook). It is not at all clear why the ETCBC encodes these as conjunction phrases. This is almost certainly a confusion of the formal `typ` value with the functional `function` label (with a value of `Conj`). Nevertheless, we choose here to select the preposition as the true head.

In a BHSA2, these cases ought to be repaired.

In [33]:
cp_heads = dict(

conj = '''

phrase typ=CP
/without/
    word pdp=prep
/-/
    word pdp=conj

''',
    
prep_conj = '''

phrase typ=CP
    =: word pdp=prep

'''

)



In [34]:
query_heads(cp_heads)
        
print('\n', '<>'*20, '\n')
heads_status()

running query on conj
	51485 results found
running query on prep_conj
	1140 results found

 <><><><><><><><><><><><><><><><><><><><> 

136575 phrases matched with a head...
116632 phrases remaining...
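
The two templates amount to a simple decision rule: a CP that opens with a preposition takes that preposition as head, and a CP containing no preposition takes its conjunction word(s). A minimal sketch of that rule on invented data follows. The node numbers, the toy phrases and the helper `cp_heads_for` are hypothetical.

```python
def cp_heads_for(words):
    """Heads of a toy CP given (word_node, pdp) pairs in text order."""
    if words and words[0][1] == 'prep':
        # mirrors the `prep_conj` template: the phrase-initial preposition
        return {words[0][0]}
    if not any(pdp == 'prep' for _, pdp in words):
        # mirrors the `conj` template: every conjunction in a prep-free CP
        return {node for node, pdp in words if pdp == 'conj'}
    # a preposition that is not phrase-initial: neither rule applies
    return set()

# Toy CPs; node numbers and tag sequences are invented.
toy_cps = {
    800: [(1, 'conj')],
    801: [(2, 'prep'), (3, 'subs')],
    802: [(4, 'conj'), (5, 'conj')],
}
toy_cp_heads = {phrase: cp_heads_for(words) for phrase, words in toy_cps.items()}
print(toy_cp_heads)
```

The third toy phrase shows how a CP can end up with more than one head. This would be consistent with the output above, where the two queries return 52625 results but only 52481 new phrases are matched.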


### CP Sanity Check

In [35]:
unaccounted_cp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'CP')
len(unaccounted_cp)

0

In [36]:
remaining_types.remove('CP')
print(remaining_types)

['PP', 'NP', 'PrNP', 'AdvP', 'AdjP', 'IPrP']
