# Getting Heads 2

This notebook aims to develop a new method of head detection using insights gained from the first version of this data. This new effort improves on the previous one in two main ways:

* head selection is performed using Text Fabric templates, which offer a clearer, more transparent way to select and filter data
* the notebook aims to track and address all edge cases

Most of the rationale and rules generated in [getting_heads.ipynb](getting_heads.ipynb) are carried over to this present notebook.

In [1]:
from tf.extra.bhsa import Bhsa
import collections, textwrap

In [2]:
A = Bhsa(hoist=globals(), silent=True)
print(f'running version {A.version} of BHSA...')

running version c of BHSA...


# Defining Heads

The basic definition of a phrase head from the previous version is carried over here, which is:
> the word with a part of speech after which a phrase type is named

As applied in the previous effort, this includes a secondary criterion:
> the word which semantically determines grammatical agreement

This latter case thus excludes quantifiers such as כל and cardinal numbers that are in construct or attribution to a given word.

From the point of view of the ETCBC database, heads can be extracted using the `subphrase` object and its relations. These relations are not always coded in a transparent or beneficial way. But they are at least useful enough to disambiguate independent words from dependent words. From the ETCBC database perspective, we add a third criterion:
> a word contained in an independent subphrase or a subphrase only dependent upon a quantifier


## Tracking Head Selection

Using the guiding principles listed above, we will follow a process of deduction for assigning heads to phrases. We select all phrases to track which heads are accounted for.
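
The bookkeeping idea can be sketched on toy data before it is applied to the real corpus. The node numbers and the `record` helper below are made up for illustration. A phrase leaves the "remaining" set the first time a head is recorded for it, and it may collect more than one head.

```python
import collections

# Hypothetical node numbers standing in for real phrase and word nodes.
toy_phrases = {101, 102, 103}
remaining = set(toy_phrases)
covered = set()
phrase_to_heads = collections.defaultdict(set)

def record(phrase, head):
    """Move a phrase from remaining to covered and store its head."""
    remaining.discard(phrase)  # no error if the phrase was already removed
    covered.add(phrase)
    phrase_to_heads[phrase].add(head)

record(101, 1)
record(102, 4)
record(102, 6)  # a second, coordinate head for the same phrase

print(sorted(remaining))
print(dict(phrase_to_heads))
```

Whatever is left in `remaining` at the end is exactly the set of phrases that still need a rule.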

In [3]:
remaining_phrases = set(result[0] for result in A.search('phrase'))
covered_phrases = set()
remaining_types = list(feat[0] for feat in F.typ.freqList(nodeTypes='phrase'))

  0.08s 253207 results


In [4]:
phrase2heads = collections.defaultdict(set)

In [5]:
def record_head(phrase, head, mapping=phrase2heads, remaining=remaining_phrases, covered=covered_phrases):
    '''
    Simple function to track phrases
    with heads that are accounted for
    and to modify the phrase2heads
    dict, which is a mapping from a phrase
    node to its head nodes.
    '''
    # discard() is a no-op for phrases with plural heads,
    # one of which is already recorded
    remaining.discard(phrase)
    
    mapping[phrase].add(head) # record it
    covered.add(phrase)

## Simple Heads

The selection of heads for certain phrase types is very straightforward. Those are defined in the templates below and are subsequently applied. These phrase types are selected based on the survey of their subphrase relations as found in the old notebook.

In [6]:
simp_heads = dict(

PPrP = '''
% personal pronoun

phrase typ=PPrP
    word pdp=prps

''',

DPrP = '''
% demonstrative pronoun

phrase typ=DPrP
    word pdp=prde

''',

InjP = '''
% interjectional

phrase typ=InjP
    word pdp=intj

''',

NegP = '''
% negative

phrase typ=NegP
    word pdp=nega

''',

InrP = '''
% interrogative

phrase typ=InrP
    word pdp=inrg

''',
    
IPrP = '''
% interrogative pronoun

phrase typ=IPrP
    word pdp=prin

''',

) # end of dictionary

### Make Queries, Record Heads, See What Remains

Here we run the queries and run `record_head` over each result. In all of the templates the head is the second item in the result tuple.

In [7]:
def query_heads(querydict, headi=1):
    '''
    Runs queries on phrasetype/query dict.
    Reports results.
    Adds results.
    '''
    for phrasetype, query in querydict.items():
        print(f'running query on {phrasetype}')
        results = A.search(query, silent=True)
        print(f'\t{len(results)} results found')
        for res in results:
            phrase, head = res[0], res[headi]
            record_head(phrase, head)
            
def heads_status():
    # simply prints accounted vs unaccounted heads
    print(f'{len(covered_phrases)} phrases matched with a head...')
    print(f'{len(remaining_phrases)} phrases remaining...')
            
query_heads(simp_heads)
        
print('\n', '<>'*20, '\n')
heads_status()

running query on PPrP
	4392 results found
running query on DPrP
	791 results found
running query on InjP
	1883 results found
running query on NegP
	6742 results found
running query on InrP
	1291 results found
running query on IPrP
	798 results found

 <><><><><><><><><><><><><><><><><><><><> 

15866 phrases matched with a head...
237341 phrases remaining...
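
A note on the `headi` argument: as far as I understand Text Fabric search, each result is a tuple with one node per node line of the template, in template order, and lines inside quantifier blocks such as `/without/` do not add a slot. The sketch below uses made-up result tuples (not real search output) and a hypothetical helper, `pairs_from_results`, to show why a two-line template needs index 1 while a template with an intervening `phrase_atom` line needs index 2.

```python
def pairs_from_results(results, headi=1):
    """Return (phrase, head) pairs from search-style result tuples."""
    return [(res[0], res[headi]) for res in results]

# Made-up results for a template of the form: phrase > word
two_level = [(900001, 11), (900002, 25)]

# Made-up results for a template of the form: phrase > phrase_atom > word
three_level = [(900003, 800001, 37), (900004, 800002, 42)]

print(pairs_from_results(two_level))             # head taken from index 1
print(pairs_from_results(three_level, headi=2))  # head taken from index 2
```

With the default `headi=1`, the three-level results would pair each phrase with its phrase_atom rather than with a word.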


### Find Remaining Phrases

What phrases with the above types remain unaccounted for?

In [8]:
unaccounted_simp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) in simp_heads)
len(unaccounted_simp)

0
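
The same leftover check recurs for every phrase type in this notebook. A small helper could express it once. The sketch below uses a plain dict in place of `F.typ.v`; the node numbers and the name `unaccounted_by_type` are illustrative only.

```python
import collections

def unaccounted_by_type(remaining, type_of):
    """Count the remaining phrases per phrase type."""
    return collections.Counter(type_of(phrase) for phrase in remaining)

# Toy stand-ins for phrase nodes and their typ values.
toy_types = {1: 'VP', 2: 'NP', 3: 'NP', 4: 'PP'}
toy_remaining = {2, 3, 4}

counts = unaccounted_by_type(toy_remaining, toy_types.get)
print(counts['NP'], counts['PP'], counts['VP'])
```

In the notebook itself the call would presumably be `unaccounted_by_type(remaining_phrases, F.typ.v)`.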

In [9]:
for typ in simp_heads:
    remaining_types.remove(typ)
print(remaining_types)

['VP', 'PP', 'CP', 'NP', 'PrNP', 'AdvP', 'AdjP']


## Mostly Simple Heads

The next set of heads requires a bit more care, since these phrase types can contain a wider variety of relationships.

### VP
The VP has only one complication: a single VP contains more than one verb:

In [10]:
mult_verbs = A.search('''

phrase typ=VP
/with/
    word pdp=verb
    < word pdp=verb
/-/
''')
A.show(mult_verbs, condenseType='clause', withNodes=True)

  1.28s 1 result




**clause** *1*



The template below excludes this case without ignoring VPs that do not necessarily begin with a verb.
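
In effect, the rule is "take the first verb in the phrase, wherever it stands". A plain Python restatement on toy `(node, pdp)` pairs may make that clearer. The data and the helper name `first_verb` are invented for illustration and do not come from the BHSA.

```python
def first_verb(words):
    """Return the node of the first word whose pdp is 'verb', or None."""
    for node, pdp in words:
        if pdp == 'verb':
            return node
    return None

# Toy phrases as lists of (word node, pdp); not real BHSA data.
plain_vp = [(1, 'verb')]
late_verb_vp = [(2, 'art'), (3, 'verb')]             # verb is not the first word
double_vp = [(4, 'verb'), (5, 'conj'), (6, 'verb')]  # only the first verb counts

for vp in (plain_vp, late_verb_vp, double_vp):
    print(first_verb(vp))
```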

In [11]:
VP = '''

phrase typ=VP
    
    head:word pdp=verb
    
    /without/
    phrase
        word pdp=verb
        < head
    /-/

'''

VP_search = A.search(VP)

for phrase, head in VP_search:
    record_head(phrase, head)
    
heads_status()

  1.55s 69024 results
84890 phrases matched with a head...
168317 phrases remaining...


### VP Sanity Check

We double check that the indicated phrase above only has one head.

In [12]:
phrase2heads[893310]

{403602}

See what's left...

In [13]:
unaccounted_vp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'VP')
len(unaccounted_vp)

0

In [14]:
remaining_types.remove('VP')
print(remaining_types)

['PP', 'CP', 'NP', 'PrNP', 'AdvP', 'AdjP']


### CP

The conjunction phrase is relatively straightforward. But there are 1140 cases where the conjunction phrase is technically headed by a preposition in the ETCBC data. These are phrases such as בטרם and בעבור (see the more detailed analysis in the previous notebook). It is not clear why the ETCBC encodes these as conjunction phrases. This is almost certainly a confusion of the formal `typ` value with the functional `function` label (with a value of `Conj`). Nevertheless, we choose here to select the preposition as the true head.

These cases ought to be repaired in BHSA2.
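
The choice described above amounts to a simple rule, sketched here in plain Python on toy `(node, pdp)` pairs rather than as a search template. The helper name `select_cp_heads` and the sample data are hypothetical.

```python
def select_cp_heads(words):
    """
    Pick heads for a conjunction phrase given (node, pdp) pairs.
    If the phrase opens with a preposition, that word is the head;
    if it has no preposition at all, every conjunction is a head.
    """
    if words and words[0][1] == 'prep':
        return [words[0][0]]
    if not any(pdp == 'prep' for _, pdp in words):
        return [node for node, pdp in words if pdp == 'conj']
    return []  # preposition in non-initial position: no rule applies

simple_cp = [(1, 'conj')]
prep_led_cp = [(2, 'prep'), (3, 'subs')]  # the shape of a phrase like בטרם

print(select_cp_heads(simple_cp))
print(select_cp_heads(prep_led_cp))
```

The empty-list branch is there so that any phrase the two rules miss would show up as unaccounted instead of being silently assigned a head.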

In [15]:
cp_heads = dict(

conj = '''

phrase typ=CP
/without/
    word pdp=prep
/-/
    word pdp=conj

''',
    
prep_conj = '''

phrase typ=CP
    =: word pdp=prep

'''

)



In [16]:
query_heads(cp_heads)
        
print('\n', '<>'*20, '\n')
heads_status()

running query on conj
	51485 results found
running query on prep_conj
	1140 results found

 <><><><><><><><><><><><><><><><><><><><> 

137371 phrases matched with a head...
115836 phrases remaining...


### CP Sanity Check

In [17]:
unaccounted_cp = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'CP')
len(unaccounted_cp)

0

In [18]:
remaining_types.remove('CP')
print(remaining_types)

['PP', 'NP', 'PrNP', 'AdvP', 'AdjP']


### AdjP

The adjective phrase always occurs with a word that has a `pdp` of adjective: 

In [19]:
A.search('''

phrase typ=AdjP
/without/
    word pdp=adjv
/-/

''')

  0.64s 0 results


[]

By playing with the `head:word pdp=` value below, I ascertain that there are 8 uses of `subs` as a head in this phrase type, and 1 use of `advb` as a head. These variants are due to the phrase containing multiple heads, with the first having a `pdp` of `adjv`, formally making the phrase an `AdjP`. 

The selection criteria are as follows. We want all cases in an adjective phrase where the word has a `pdp` of `adjv`, `subs`, or `advb`. The head candidate must not be found in a modifying subphrase, defined as `rela=adj|atr|rec|mod|dem` (remember that a word can often occur in multiple subphrases). The only acceptable values for phrase_atom and subphrase relations are `NA` (no relation) or `Para`/`par` (coordinate relation). For the coordinate case, the first requirement is expected to prevent spurious parallel results (that is, words that are parallel not to a head but to a modifying element).

The requirements are set in the pattern below.
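
Before the template itself, the same conditions can be restated as a plain Python predicate. This is only a sketch of the logic, not how Text Fabric evaluates the query. The function name and its arguments are invented, and the caller is assumed to have already collected the `rela` values of every subphrase that contains the word.

```python
MODIFIER_RELAS = {'adj', 'atr', 'rec', 'mod', 'dem'}
INDEPENDENT_SUBPHRASE_RELAS = {'NA', 'par'}
INDEPENDENT_ATOM_RELAS = {'NA', 'Para'}

def is_head_candidate(pdp, atom_rela, subphrase_relas,
                      allowed_pdp=frozenset({'adjv', 'subs', 'advb'})):
    """
    pdp: phrase-dependent part of speech of the word
    atom_rela: rela value of the phrase_atom containing the word
    subphrase_relas: rela values of all subphrases containing the word
    """
    if pdp not in allowed_pdp:
        return False
    if atom_rela not in INDEPENDENT_ATOM_RELAS:
        return False
    # exclude uses as modifier
    if any(rela in MODIFIER_RELAS for rela in subphrase_relas):
        return False
    # either no subphrase embedding, or at least one NA/par subphrase
    return (not subphrase_relas
            or any(rela in INDEPENDENT_SUBPHRASE_RELAS for rela in subphrase_relas))

print(is_head_candidate('adjv', 'NA', []))              # bare adjective
print(is_head_candidate('subs', 'Para', ['par']))       # coordinate member
print(is_head_candidate('adjv', 'NA', ['par', 'atr']))  # parallel to a modifier
```

The third call shows the case the first requirement is meant to catch: a word that sits in a `par` subphrase but also in an `atr` subphrase is rejected.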

In [20]:
AdjP = '''

phrase typ=AdjP
    phrase_atom rela=NA|Para
        head:word pdp=adjv|subs|advb
        
% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /-/
        
% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
'''
AdjP = A.search(AdjP)

for res in AdjP:
    phrase, head = res[0], res[2]
    record_head(phrase, head)
    
heads_status()

  1.08s 1875 results
139120 phrases matched with a head...
114087 phrases remaining...


In [21]:
unaccounted_AdjP = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'AdjP')
len(unaccounted_AdjP)

0

In [22]:
remaining_types.remove('AdjP')
print(remaining_types)

['PP', 'NP', 'PrNP', 'AdvP']


### AdvP

The adverb phrase has similar internal relations to AdjP. Thus, we apply the same basic template search.

By modifying the `pdp=` parameter, I have found 2 examples of a preposition in the AdvP, caused by a prepositional phrase_atom coordinated with the `AdvP` phrase_atom. These mixed cases must be dealt with imperfectly by taking the preposition head literally. It is then up to the user of the heads feature to include/exclude cases such as these, or to depend on the phrase_atoms.

There are two cases where an `inrg` serves as a head element. These are incorrect encodings, as they belong under their own phrase type of `InrP`. These should be fixed in BHSA2. For now the `inrg` is excluded as a phrase head, since in both cases it is followed by an `advb`, which is probably what triggers these phrases' classification.

In [23]:
AdvP = '''

phrase typ=AdvP
    phrase_atom rela=NA|Para
        head:word pdp=advb|subs|nmpr|prep
        
% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /-/
        
% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
'''
AdvP = A.search(AdvP)

for res in AdvP:
    phrase, head = res[0], res[2]
    record_head(phrase, head)
    
heads_status()

  1.50s 5778 results
144779 phrases matched with a head...
108428 phrases remaining...


In [24]:
unaccounted_AdvP = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'AdvP')
len(unaccounted_AdvP)

0

In [25]:
remaining_types.remove('AdvP')
print(remaining_types)

['PP', 'NP', 'PrNP']


### PP

The same method used above applies to prepositional phrases.

In [26]:
PP = '''

phrase typ=PP
    phrase_atom rela=NA|Para
        head:word pdp=prep
        
% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /-/
        
% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
'''
PP = A.search(PP)

for res in PP:
    phrase, head = res[0], res[2]
    record_head(phrase, head)
    
heads_status()

  1.07s 62335 results
202261 phrases matched with a head...
50946 phrases remaining...


In [27]:
unaccounted_PP = set(phrase for phrase in remaining_phrases
                          if F.typ.v(phrase) == 'PP')
len(unaccounted_PP)

0

In [28]:
remaining_types.remove('PP')
print(remaining_types)

['NP', 'PrNP']


### NP and PrNP

The noun phrase is the most complicated for head selection due to the presence of quantifiers.
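
The quantifier test that the NP templates rely on can be sketched in plain Python. A word counts as a quantifier here if it is a cardinal number (`ls=card`) or one of a handful of quantifier lexemes, and what matters next is whether it precedes a quantifiable word in the same phrase_atom. The sketch is a simplification: it ignores subphrase relations and the `pdp` restriction on the quantifier itself, and the helper names and toy words are assumptions, not BHSA data.

```python
QUANT_LEXS = {'KL/', 'M<V/', 'JTR/', 'M<FR/', 'XYJ/'}
QUANTIFIABLE_PDP = {'subs', 'nmpr', 'verb', 'prde', 'prps'}

def is_quantifier(word):
    """A word is a (pdp, lex, ls) triple."""
    pdp, lex, ls = word
    return ls == 'card' or lex in QUANT_LEXS

def quantifier_role(atom, i):
    """Describe the word at position i of a phrase_atom (a list of words)."""
    if not is_quantifier(atom[i]):
        return 'not a quantifier'
    # look for a later, non-quantifier word that can be quantified
    for later in atom[i + 1:]:
        if later[0] in QUANTIFIABLE_PDP and not is_quantifier(later):
            return 'quantifies'
    return 'stands alone'

# Toy phrase_atoms; lexeme and ls values are illustrative.
all_the_days = [('subs', 'KL/', 'none'), ('art', 'H', 'none'), ('subs', 'JWM/', 'none')]
bare_number = [('subs', 'CLC/', 'card')]

print(quantifier_role(all_the_days, 0))
print(quantifier_role(all_the_days, 2))
print(quantifier_role(bare_number, 0))
```

Roughly, a word that is "not a quantifier" is a head candidate in its own right, a quantifier that "stands alone" is itself the head, and for one that "quantifies" the head is sought among the words next to it.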

In [29]:
custom_quants = {'KL/', 'M<V/', 'JTR/', # quantifier lexemes, others?
                 'M<FR/', 'XYJ/'}
quantlexs = '|'.join(custom_quants)

In [30]:
NP_heads = dict(
    
NP_noqant = f'''

phrase typ=NP|PrNP
    phrase_atom rela=NA|Para
        head:word pdp=subs|adjv|nmpr ls#card lex#{quantlexs}
        
% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            head
        /or/
        /without/
        subphrase
            head
        /-/
        /-/
        
% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            head
        /-/
''',

NP_quant_alone = f'''

phrase typ=NP|PrNP
    phrase_atom rela=NA|Para
        quant:word pdp=subs|adjv|nmpr
            
% require quantification
        /with/
        ls=card
        /or/
        lex={quantlexs}
        /-/
        
% quantifier does not precede a quantified element
        /without/
        phrase_atom
            quant
            < word pdp=subs|nmpr|verb|prde|prps lex#{quantlexs} ls#card
        /-/

% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            quant
        /or/
        /without/
        subphrase
            quant
        /-/
        /-/

% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            quant
        /-/

''',

NP_quantified = f'''

phrase typ=NP|PrNP
    phrase_atom rela=NA|Para
    
% phrase atom must have quantifier as formal head:
    /with/
        quant:word pdp=subs|adjv|nmpr
            
% require quantification
        /with/
        ls=card
        /or/
        lex={quantlexs}
        /-/
        
% quantifier precedes a quantified element
        /with/
        phrase_atom
            quant
            < word pdp=subs|nmpr|verb|prde|prps lex#{quantlexs} ls#card
        /-/

% require either NA subphrase relation
% or no subphrase embedding:
        /with/
        subphrase rela=NA|par
            quant
        /or/
        /without/
        subphrase
            quant
        /-/
        /-/

% exclude uses as modifier:
        /without/
        subphrase rela=adj|atr|rec|mod|dem
            quant
        /-/
    /-/
    
    head:word pdp#art|conj ls#card lex#{quantlexs}
    
% require head word to be adjacent to any quantifier in the phrase atom:
    /with/
    word        
    /with/
    ls=card
    /or/
    lex={quantlexs}
    /-/
    <1: head
    /-/
''')

# each result is (phrase, phrase_atom, head), so the head is at index 2
query_heads(NP_heads, headi=2)
print('\n', '<>'*20, '\n')
heads_status()

running query on NP_noqant
	51877 results found
running query on NP_quant_alone
	1718 results found
running query on NP_quantified
	4557 results found

 <><><><><><><><><><><><><><><><><><><><> 

253205 phrases matched with a head...
2 phrases remaining...


In [32]:
for phrase in remaining_phrases:
    A.pretty(phrase)