# Getting Heads
## By Cody Kingham & Christiaan Erwich

## Problem Description
The ETCBC's BHSA core data does not use a standard syntax tree format. As a result, syntactic and functional relationships between individual words are not mapped in a transparent or easily accessible way, and in some cases fine-grained relationships are ignored altogether. For example, for a given noun phrase (NP), there is no explicit way of obtaining its head noun (i.e. the noun itself without any modifying elements). This causes numerous problems for research in the realm of semantics. For instance, it is currently very difficult to calculate the complete person, gender, and number (PGN) of a given subject phrase, because PGN is stored at the word level only. This is an inadequate representation, since phrases in the ETCBC data often contain coordinate relationships within the phrase. Even if one selects the first "noun" in the phrase and checks its PGN value, one may overlook the presence of another noun which makes the phrase plural. Ideally, the phrase itself would have a PGN feature. But before this kind of data can be created, it is necessary to separate the head words of a phrase from their modifying elements such as adjectives, determiners, or nouns in construct (genitive) relations.
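The PGN problem described above can be illustrated with a small sketch. It does not use Text-Fabric or the real BHSA data: the phrase is a hand-made list of `(word, pdp, number)` tuples, the English glosses stand in for Hebrew words, and both function names are hypothetical. The second function also assumes that every noun in the phrase is a coordinated head, which is exactly the assumption that breaks down with construct relations and therefore motivates isolating heads first.

```python
# Toy stand-in for a phrase: (gloss, phrase-dependent part of speech, number).
# These tuples are illustrative only and are not drawn from the BHSA data.
phrase = [
    ("man", "subs", "sg"),
    ("and", "conj", "NA"),
    ("woman", "subs", "sg"),
]


def first_noun_number(words):
    """Naive approach: report the number of the first noun only."""
    for _, pdp, number in words:
        if pdp == "subs":
            return number
    return None


def coordinated_number(words):
    """Treat every noun as a coordinated head (a simplifying assumption).

    Two or more coordinated nouns make the phrase plural,
    regardless of the number of each individual noun.
    """
    numbers = [number for _, pdp, number in words if pdp == "subs"]
    if not numbers:
        return None
    return "pl" if len(numbers) > 1 else numbers[0]


print(first_noun_number(phrase))   # sg
print(coordinated_number(phrase))  # pl
```

The two results disagree for the same phrase, which is the gap a phrase-level PGN feature would close.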

A head word can be defined as the word after which a phrase type is named. A phrase type can be, for example, NP for noun phrase or VP for verb phrase. In this notebook, we experiment with and build the functions stored in `heads.py` in order to export a set of Text-Fabric edge features. The edge features represent a mapping from a phrase node to its head element.

This goal requires us to think carefully about the way inter-word semantic relations are reflected in the ETCBC's data. The ETCBC data *does* contain some rudimentary semantic embeddings through the so-called [subphrase](https://etcbc.github.io/bhsa/features/hebrew/c/otype). These can be utilized to isolate head words from secondary elements. A subphrase should *not* be thought of as a smaller, embedded phrase, like the ETCBC's phrase-atom (though it must sometimes fill that role, inadequately). Rather, the subphrase is a way to encode relationships between words below the level of a phrase(atom), hence "sub." A subphrase can be a single word, or it can be a collection of words. A word can be in multiple subphrases, but cannot be in more than 3 (due to the limitations of the data creation program, [parsephrases](http://www.etcbc.nl/datacreation/#ps3.p)).

## Method
The types of phrases represented in the ETCBC data include `NP` (noun phrase), `VP` (verb phrase), `PrNP` (proper noun phrase), `PP` (prepositional phrase), `AdvP` (adverbial phrase), and [eight others](https://etcbc.github.io/bhsa/features/hebrew/c/typ). For some of these types, isolating the head word is a simple affair: one matches each word's phrase-dependent part of speech (`pdp`) against the type of the phrase that contains it. For a `VP`, that means simply finding the word within the phrase that has a `pdp` value of `verb`. For a prepositional phrase, it means finding the word with a `pdp` of `prep`.
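The simple cases above can be sketched as a lookup from phrase type to the expected `pdp` of the head. This is a minimal illustration on hand-made data, not the code in `heads.py`: the names `SIMPLE_HEAD_PDP` and `simple_heads` are hypothetical, the words are English glosses, and only the two phrase types spelled out in the text are covered.

```python
# Hypothetical lookup: phrase type -> pdp value expected of its head word.
# Only the two simple cases named in the text are included.
SIMPLE_HEAD_PDP = {"VP": "verb", "PP": "prep"}


def simple_heads(phrase_type, words):
    """Return the words whose pdp matches the head pdp for this phrase type.

    words: list of (gloss, pdp) pairs in phrase order, a toy stand-in
    for the word nodes contained in a phrase.
    """
    target = SIMPLE_HEAD_PDP.get(phrase_type)
    if target is None:
        # NP, PrNP, and other types need more than a pdp match.
        raise ValueError(f"no simple head rule for phrase type {phrase_type!r}")
    return [gloss for gloss, pdp in words if pdp == target]


# A toy prepositional phrase: preposition followed by a noun.
print(simple_heads("PP", [("in", "prep"), ("beginning", "subs")]))  # ['in']

# A toy verb phrase consisting of a single verb.
print(simple_heads("VP", [("created", "verb")]))  # ['created']
```

Raising an error for unlisted types is a deliberate choice in this sketch: it keeps the simple rule from silently returning wrong heads for phrase types that need subphrase information.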

The `NP` and `PrNP`, on the other hand, present special challenges. 