# Chunking (and phrase merges)

The ETCBC data is not granular enough for many types of searches beneath the phrase level. For example, when coordinated noun phrases together function as a single phrase, the individual noun phrases are not delineated. I need them split out so that I can track coordinated nouns within a phrase. Another case is quantifiers: quantifier chains are not set apart in any way from the other items in the phrase. For these cases, I will make "chunk" objects, which are essentially phrase-like objects.

Another problem is that some phrases are split into two, whereas elsewhere in the database the same phrase pattern is portrayed as a single phrase. This is fixed by creating a new object, `phrase2`. Later on, another object will be generated, `cx` (construction), which will contain any mixture of single words, phrases, and sentences, essentially ignoring the old, strict divisions.

### Update Function Features

Ensure we're working with up-to-date functions. I am using modified ETCBC functions in this project. Ignore the error message about dependency.

In [1]:
!python remap_phrase_functions.py

(null): can't open file 'remap_phrase_functions.py': [Errno 2] No such file or directory


### Load TF / BHSA Data

In [4]:
import collections, random, csv
import pandas as pd
from tf.fabric import Fabric
from tf.app import use

locs = {'bhsa':'~/text-fabric-data/etcbc/bhsa/tf/c',
        'heads':'~/github/etcbc/heads/tf/c',
        'custom':'~/github/csl/time_collocations/data/tf'}

# load BHSA
TF = Fabric(locations=locs.values())
api = TF.load('''

vs vt pdp gloss lex typ number  prs
g_cons_utf8 nu mother st language ls rela
obj_prep sem_set head nhead
note function

''')

A = use('bhsa', api=api, hoist=globals(), silent=True)

This is Text-Fabric 7.6.8
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

119 features found and 1 ignored
  0.00s loading features ...
   |     0.18s B g_cons_utf8          from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.11s B lex                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.13s B vs                   from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.12s B vt                   from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.12s B pdp                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.14s B gloss                from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.24s B typ                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.23s B number               from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.10s B prs                  from /Users/cody/text-fabric-data/etcbc/bhsa/tf/c
   |     0.13s B nu                   from /Users/

### dicts for object manipulation

Writing new TF objects to BHSA requires us to link up new data with old data.

In [5]:
nodeFeatures=collections.defaultdict(lambda:collections.defaultdict())
edgeFeatures=collections.defaultdict(lambda:collections.defaultdict())

In [6]:
# recreate ETCBC otype, oslots, and function features
# these will be appended to rather than altered
nodeFeatures['otype'] = dict((n, F.otype.v(n)) for n in N())
edgeFeatures['oslots'] = dict((n, L.d(n, 'word')) for n in N() if F.otype.v(n) != 'word')
nodeFeatures['function'] = dict((n, F.function.v(n)) for n in F.otype.s('phrase'))
nodeFeatures['note'] = dict((n, F.note.v(n)) for n in N() if F.note.v(n))
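
The bookkeeping above can be illustrated without Text-Fabric. The following is a minimal sketch on toy data: `node_features`, `edge_features`, and the `add_node` helper are hypothetical stand-ins, not part of the notebook or of the Text-Fabric API. It only shows the idea that a new object takes the next free node number and is linked to existing slots through `oslots`.

```python
import collections

# toy stand-ins for the notebook's nodeFeatures / edgeFeatures (hypothetical)
node_features = collections.defaultdict(dict)
edge_features = collections.defaultdict(dict)

# pretend corpus: slots 1-3 are words, node 4 is an existing phrase
node_features['otype'].update({1: 'word', 2: 'word', 3: 'word', 4: 'phrase'})
edge_features['oslots'][4] = (1, 2, 3)

def add_node(otype, slots, **features):
    """Append a new node after the current maximum node and return its number."""
    node = max(node_features['otype']) + 1
    node_features['otype'][node] = otype
    edge_features['oslots'][node] = tuple(slots)
    for name, value in features.items():
        node_features[name][node] = value
    return node

new_node = add_node('phrase2', (1, 2), function='Time')
print(new_node, edge_features['oslots'][new_node])
```

Because new nodes are only appended, the original node numbers stay valid and old and new objects can coexist in one dataset.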

## Merge Phrases (`phrase2`)

Some phrases are unnecessarily split into two. This is fixed by creating a new object, `phrase2`, which merges the split parts and otherwise mirrors the original `phrase` objects.

In [7]:
merge_in_silence = True # if True, do not report which phrases are being modified

#### Merge Adjacent TP's

There are several cases in BHSA where time phrases are divided into 2, 3, or even 4 pieces, whereas elsewhere the same parts are kept together as a single phrase. This is an undesirable inconsistency. To solve this problem, new phrase boundaries are generated and mapped over the old boundaries stored in the oslots file. For each remapped case, a print-out confirms the new slots when `merge_in_silence` is set to `False`.

In [8]:
first_tp = set(res[1] for res in A.search('''

% find all cases of time phrases followed by, 
% but not preceded by, another time phrase

clause
    phrase function=Time
    /without/
    clause
        phrase function=Time
        <: ..
    /-/
    <: phrase function=Time
''', silent=False))

delete_tp = set(res[2] for res in A.search('''

% find all cases of time phrases preceded by
% another time phrase for deletion

clause
    phrase function=Time
    <: phrase function=Time

''', silent=False))

oldmaxotype = max(nodeFeatures['otype'].keys())
maxotype = oldmaxotype+1
new_phrase = set()

for phrase in F.otype.s('phrase'):
    
    if phrase not in first_tp|delete_tp:
        edgeFeatures['oslots'][maxotype] = L.d(phrase, 'word')
        nodeFeatures['otype'][maxotype] = 'phrase2'
        nodeFeatures['function'][maxotype] = F.function.v(phrase)
        maxotype += 1
    
    elif phrase in first_tp: 
        new_slots = list(L.d(phrase, 'word')) # compile new slots here
        this_phrase, this_clause = phrase, L.u(phrase, 'clause')[0] # this_phrase iterates +1 each loop, this_clause does not
        
        # gather all slots in subsequent time phrases
        while (F.function.v(this_phrase+1) == 'Time')\
            and this_phrase+1 in L.d(this_clause, 'phrase'): # subsequent TP must also be in same clause
            new_slots.extend(L.d(this_phrase+1, 'word'))
            this_phrase = this_phrase+1

        edgeFeatures['oslots'][maxotype] = new_slots
        nodeFeatures['otype'][maxotype] = 'phrase2'
        nodeFeatures['note'][maxotype] = 'new phrase by phrase merge'
        nodeFeatures['function'][maxotype] = 'Time'
        new_phrase.add(maxotype)
        maxotype+=1
        
    elif phrase in delete_tp and not merge_in_silence: # delete by skipping
        print(f"skipping over tp {phrase} {T.text(phrase)}")
        print(f"\tin {T.text(L.u(phrase,'clause')[0])}")
        print()
        continue
           
if not merge_in_silence:
    print('new phrases: ')
    for np in sorted(new_phrase):
        print(np, T.text(edgeFeatures['oslots'][np]))

  0.84s 62 results
  0.53s 72 results
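
The merge logic can be tried out on toy data. This is a simplified sketch, not the notebook's implementation: the `merge_adjacent` helper and the `(function, slots)` tuples are hypothetical, and adjacency is approximated by checking that the slots follow each other directly within a single clause.

```python
def merge_adjacent(phrases):
    """
    phrases: list of (function, slots) tuples in text order within one clause.
    Returns a new list in which runs of adjacent 'Time' phrases are merged.
    """
    merged = []
    for function, slots in phrases:
        prev_is_time = bool(merged) and merged[-1][0] == 'Time'
        # adjacency here means the new phrase starts right after the previous one ends
        if function == 'Time' and prev_is_time and merged[-1][1][-1] + 1 == slots[0]:
            merged[-1] = ('Time', merged[-1][1] + list(slots))
        else:
            merged.append((function, list(slots)))
    return merged

# made-up clause with a time phrase split into three pieces
clause = [('Pred', [1]), ('Time', [2, 3]), ('Time', [4]), ('Time', [5, 6]), ('Subj', [7])]
result = merge_adjacent(clause)
print(result)
```

The three `Time` pieces come out as one phrase covering slots 2 through 6, while the other phrases are copied over unchanged.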


Check for gaps. Every slot should be inside a `phrase2`.

In [7]:
coveredslots = sorted(slot for ph, slots in edgeFeatures['oslots'].items()
                          for slot in slots
                          if ph > oldmaxotype)

for i, slot in enumerate(coveredslots):
    if i+2 > len(coveredslots):
        print('no problems found!')
        break
    if slot+1 != coveredslots[i+1]:
        raise Exception(f'{slot}, {coveredslots[i+1]}')

no problems found!
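
For reference, the same contiguity check can be written as a small function. This is a sketch on toy slot lists: `find_gaps` is a hypothetical helper that collects the offending pairs instead of raising on the first one.

```python
def find_gaps(slots):
    """Return (slot, next_slot) pairs where consecutive covered slots are not adjacent."""
    ordered = sorted(slots)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b != a + 1]

print(find_gaps([1, 2, 3, 4]))     # fully covered range
print(find_gaps([1, 2, 4, 5, 8]))  # slots 3, 6 and 7 are missing
```

A slot that is covered twice also shows up, as a pair of identical numbers, so the check doubles as a test for overlapping `phrase2` objects.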


#### Complex Cases

Some cases will require further research. These are marked with a new feature, called simply `note`. The note in this case is "complex".

In [8]:
nodeFeatures['note'][846434] = 'complex'

# Chunking

Building sub-phrase-like chunks based on `heads`.

## Noun Chunks

**!TODO!**: This section requires more work than I can give to it now.

In [9]:
# # first handle NP's

# np_chunks = collections.defaultdict(list)

# for np in F.typ.s('NP'):
    
#     if F.otype.v(np) == 'phrase_atom':
#         continue
    
#     heads = E.head.t(np)
#     ph_words = L.d(np, 'word')
    
#     # build the chunks
#     this_chunk = []
#     for i, word in enumerate(ph_words):
        
#         if F.pdp.v(word) == 'conj': # don't include conjunctions in chunks
#             continue
        
#         if word in heads:
#             this_chunk.append(word)
#             np_chunks[np].append(this_chunk)
#             this_chunk = []
            
#         elif i == len(ph_words)-1:
#             this_chunk.append(word)
#             np_chunks[np].append(this_chunk)            
#         else:
#             this_chunk.append(word)
    
# len(np_chunks)

In [10]:
# Find NP's of Time with > 1 head and without a cardinal number

# A.show(A.search('''

# p:phrase typ=NP function=Time
# /without/
#     word ls=card
# /-/
# /with/
#     t:word pdp#conj
#     /without/
#     phrase
#         <head- t
#     /-/
#     word pdp=conj
# /-/
#     w1:word
#     < w2:word
    
# p <head- w1
# p <head- w2
# '''), end=50, condensed=True)

## Quantifier Chunks

In [11]:
# quantification atoms
quant_atoms = []

# quant_subs
quant_atoms.extend((res[1], res[2]) for res in A.search('''

phrase function=Time
    word ls=card language=Hebrew
    <: word pdp=subs ls#card sem_set#prep

'''))

# subs_quant
quant_atoms.extend((res[2], res[3]) for res in A.search('''

phrase function=Time
    phrase_atom
        word pdp=subs ls#card sem_set#prep st=a language=Hebrew
        <: word ls=card

'''))

# quant_h_subs
quant_atoms.extend((res[1], res[3]) for res in A.search('''

phrase function=Time
    word ls=card language=Hebrew st=c
    <: word pdp=art
    <: word pdp=subs ls#card

'''))

# quant_w_quant
quant_atoms.extend((res[1], res[3]) for res in A.search('''

phrase function=Time
    word ls=card language=Hebrew
    <: word lex=W
    <: word ls=card

'''))

# quant_quant
quant_atoms.extend((res[1], res[2]) for res in A.search('''

phrase function=Time
    word ls=card language=Hebrew
    <: word ls=card

'''))

quant_atoms.sort()

quant_chunks = []

i = 0

while i < len(quant_atoms):

    # start a chunk from the current atom's first and last words
    qchunk = [quant_atoms[i][0], quant_atoms[i][-1]]
    
    # extend the chunk while the next atom starts on the word that ends this one;
    # the bounds check avoids an IndexError when the final atom is not chained
    while i+1 < len(quant_atoms) and quant_atoms[i+1][0] == quant_atoms[i][-1]:
        qchunk.append(quant_atoms[i+1][-1])
        i += 1
        
    quant_chunks.append(qchunk)
    i += 1

def fillGaps(chunk):
    '''
    Fills in gapped slots such as waws and other
    items that are missing in the chunk.
    '''
    chunk.sort()
    minSlot, maxSlot = chunk[0], chunk[-1]
    return list(range(minSlot, maxSlot+1))

# add quantification construction objects and their features
maxNode = max(edgeFeatures['oslots'])+1
for chunk in quant_chunks:
    node = maxNode
    maxNode += 1
    nodeFeatures['otype'][node] = 'chunk'
    edgeFeatures['oslots'][node] = fillGaps(chunk)
    
    # map individual semantic roles within construction
    quantified_noun = False
    for w in chunk:
        if F.ls.v(w) != 'card' and F.lex.v(w) != 'H':
            edgeFeatures['role'][w] = {node:'quantified'}
            quantified_noun = True
        elif F.ls.v(w) == 'card':
            edgeFeatures['role'][w] = {node:'quantifier'}
    
    label = 'quant_NP' if quantified_noun else 'quant'
    nodeFeatures['label'][node] = label

  2.10s 471 results
  2.12s 56 results
  2.42s 11 results
  2.03s 82 results
  1.55s 121 results
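
The chaining step is easier to follow on toy numbers. The sketch below assumes that atoms are `(first_word, last_word)` pairs, as in the cell above. The helper names `chain_atoms` and `fill_gaps` are hypothetical and the slot numbers are invented.

```python
def chain_atoms(atoms):
    """Join sorted (first, last) pairs into chains whenever one pair ends where the next begins."""
    chunks = []
    for first, last in sorted(atoms):
        if chunks and chunks[-1][-1] == first:
            chunks[-1].append(last)       # continue the current chain
        else:
            chunks.append([first, last])  # start a new chain
    return chunks

def fill_gaps(chunk):
    """Return every slot from the smallest to the largest word in the chunk."""
    return list(range(min(chunk), max(chunk) + 1))

atoms = [(10, 12), (12, 13), (20, 21), (13, 15)]
chunks = chain_atoms(atoms)
print(chunks)
print([fill_gaps(c) for c in chunks])
```

Sorting first is what makes a single pass sufficient. Filling the gaps afterwards pulls in skipped words, such as a conjunction or article, that sit between the matched words.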


### Write cardinal numbers that stand on their own but are not quantified NP's

These cases still assume a chained form in Time Phrases.

In [12]:
quant_slots = set(slot for nde in edgeFeatures['oslots'] 
                     if nodeFeatures['otype'][nde]=='chunk'
                     for slot in edgeFeatures['oslots'][nde])

non_quant_slots = set(w for w in F.otype.s('word') if w not in quant_slots)

In [13]:
# quant_alone
quant_alone = A.search('''

phrase function=Time
    alonequant ls=card language=Hebrew

''', sets={'alonequant':non_quant_slots})

for res in quant_alone:
    node = maxNode
    maxNode += 1
    nodeFeatures['otype'][node] = 'chunk'
    nodeFeatures['label'][node] = 'quant'
    edgeFeatures['oslots'][node] = (res[1],)
    edgeFeatures['role'][res[1]] = {node:'quantifier'}

  1.15s 79 results


### Chunk Component Quantifier

How to handle the following construction?

> [ [ numberChain + nounA ] + [ numberChain + nounA ] ]

In this construction, where both quantified nouns are the same (nounA), the entire construction functions as a single number. This example also shows how a complex construction can be compiled from smaller component constructions. Each numberChain + noun combination functions on its own as a single constructional unit indicating a quantified noun. But when two of these are used back to back with the same noun, they are to be read together as a single quantified unit. As the outermost brackets indicate, this combination itself functions as a construction.

These can be found by first mapping the chunks to a phrase node number, and then comparing the identity of the noun.

In [14]:
phrase2chunks = collections.defaultdict(list)

for chunk in quant_chunks:
    phrase_atom = L.u(chunk[0], 'phrase_atom')[0]
    phrase2chunks[phrase_atom].append(chunk)
    
# add component chunks to the database
for phrase, chunks in phrase2chunks.items():
    
    if len(chunks) < 2: 
        continue
    
    chunknouns = [w for chunk in chunks for w in chunk 
                      if (F.ls.v(w) != 'card') and (F.lex.v(w) != 'H')]
    
    if len(chunknouns) < 2:
        continue
    
    # NB: the first two nouns are extracted here, but this cell does not yet compare them
    nounA, nounB = chunknouns[:2]
    
    # generate compositive quantitative object
    chunk = [w for chunk in chunks for w in chunk]
    node = maxNode
    maxNode += 1
    nodeFeatures['otype'][node] = 'chunk'
    nodeFeatures['label'][node] = 'quant_NP_chain'
    edgeFeatures['oslots'][node] = fillGaps(chunk)
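
The text describes comparing the identity of the two nouns. One way such a comparison could look is sketched below on invented data. This is an assumption about the intended check, not the notebook's code: the `lex`, `cardinals`, and `phrase_of` lookups are hypothetical stand-ins for the corresponding Text-Fabric features.

```python
import collections

# hypothetical toy data: word node -> lexeme
lex = {1: 'seven', 2: 'year', 3: 'and', 4: 'thirty', 5: 'year',
       11: 'two', 12: 'month', 13: 'and', 14: 'three', 15: 'day'}
cardinals = {1, 4, 11, 14}                  # words that are cardinal numbers
phrase_of = {1: 100, 4: 100, 11: 200, 14: 200}  # first word of a chunk -> phrase node
quant_chunks = [[1, 2], [4, 5], [11, 12], [14, 15]]

# group the chunks by the phrase that contains them
phrase2chunks = collections.defaultdict(list)
for chunk in quant_chunks:
    phrase2chunks[phrase_of[chunk[0]]].append(chunk)

chains = []
for phrase, chunks in phrase2chunks.items():
    nouns = [w for chunk in chunks for w in chunk if w not in cardinals]
    if len(nouns) < 2:
        continue
    noun_a, noun_b = nouns[:2]
    # assumed check: only chain the chunks when both nouns share a lexeme
    if lex[noun_a] != lex[noun_b]:
        continue
    words = [w for chunk in chunks for w in chunk]
    chains.append(list(range(min(words), max(words) + 1)))

print(chains)
```

Only the first phrase, where both chunks quantify the same noun, yields a chained object. The second phrase quantifies two different nouns and is skipped.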

## Build Preposition Chunks

Prepositions that are chained together function as a single directional unit, and some words function as prepositions within a certain frame where elsewhere they may function as nouns. Using the `sem_set` feature from the `heads` project and the `obj_prep` edge relation, we can easily export a construction that can cover these cases.

In [15]:
def climbPrepChain(prep, prep_list):
    '''
    Recursively climbs a prepositional chain.
    '''
    prep_list.append(prep)
    daughter = next((po for po in E.obj_prep.t(prep) if F.sem_set.v(po)=='prep'),[])
    if daughter:
        climbPrepChain(daughter, prep_list)

In [16]:
new_obj = []

for prep in F.sem_set.s('prep'):
    
    # skip governed preps
    if E.obj_prep.f(prep):
        continue
    
    # climb down prep chain
    prep_cx = []
    climbPrepChain(prep, prep_cx)
    
    # export object
    node = maxNode
    new_obj.append(node)
    maxNode += 1
    nodeFeatures['otype'][node] = 'chunk'
    nodeFeatures['label'][node] = 'prep'
    edgeFeatures['oslots'][node] = prep_cx
    
print(len(new_obj), 'new preposition chunks made...')

73963 new preposition chunks made...
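
The chain climbing can be reproduced on a toy graph. In this sketch the `obj_prep` dictionary (preposition to its objects) and the `is_prep` set are hypothetical replacements for the `obj_prep` edge and the `sem_set` feature, and the node numbers are invented.

```python
# hypothetical toy edges: preposition node -> list of its object nodes
obj_prep = {1: [2], 2: [3], 3: [9], 20: [21]}
is_prep = {1, 2, 3, 20}  # nodes that count as prepositions

def climb(prep, chain=None):
    """Recursively follow preposition -> prepositional object links and return the chain."""
    chain = [] if chain is None else chain
    chain.append(prep)
    daughter = next((o for o in obj_prep.get(prep, []) if o in is_prep), None)
    if daughter is not None:
        climb(daughter, chain)
    return chain

# prepositions that are themselves the object of another preposition
governed = {o for objs in obj_prep.values() for o in objs if o in is_prep}

# start a chain only from prepositions that are not governed by another one
chains = [climb(p) for p in sorted(is_prep) if p not in governed]
print(chains)
```

Skipping governed prepositions ensures that each chain is exported once, from its topmost preposition, rather than once per link.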


## Export New Object Data to TF

In [17]:
print('Features to be added...')
for data in (nodeFeatures, edgeFeatures):
    for feat in data:
        print('\t', feat)

Features to be added...
	 otype
	 function
	 note
	 label
	 oslots
	 role


In [19]:
featuredir = '../../data'

phrasefunctions = dict((ph, F.function.v(ph)) for ph in F.otype.s('phrase'))

meta = {'':{'source': 'https://github.com/etcbc/bhsa',
            'origin': 'Made by the ETCBC of the Vrije Universiteit Amsterdam; edited by Cody Kingham',
            'coreData':'BHSA',
            'coreVersion':'c',},
        'oslots': {'valueType':'int', 
                   'edgeValues':False},
        'otype':{'valueType':'str'},
        'note':{'valueType':'str',
                'description':'notes on objects for tracking issues throughout my research'},
        'role':{'edgeValues':True,
                'valueType':'str', 
                'description':'role of the word in the chunk'},
        'label':{'valueType':'str'},
        'steps':{'valueType':'int'},
        'function':{'valueType':'str'}
       }

TFsave = Fabric(locations=featuredir)

TFsave.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=meta)

This is Text-Fabric 7.4.11
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

2 features found and 0 ignored


  0.00s Warp feature "otype" not found in
/Users/cody/github/csl/time_collocations/analysis/../data/
  0.00s Warp feature "oslots" not found in
/Users/cody/github/csl/time_collocations/analysis/../data/


  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Exporting 4 node and 2 edge and 1 config features to /Users/cody/github/csl/time_collocations/analysis/../data:
  0.00s VALIDATING oslots feature
  0.15s maxSlot=     426584
  0.15s maxNode=    1774598
  0.42s OK: oslots is valid
   |     0.71s T function             to /Users/cody/github/csl/time_collocations/analysis/../data
   |     0.10s T label                to /Users/cody/github/csl/time_collocations/analysis/../data
   |     0.00s T note                 to /Users/cody/github/csl/time_collocations/analysis/../data
   |     0.78s T otype                to /Users/cody/github/csl/time_collocations/analysis/../data
   |     3.84s T oslots               to /Users/cody/github/csl/time_collocations/analysis/../data
   |     0.01s T role                 to /Users/cody/github/csl/time_collocations/analysis/../data
   |     0.00s M steps                to /Users/cody/github/csl/time_collocations/analysis/../data
  5

True

## Exploring New Objects

In [None]:
locations2 = ['/Users/cody/text-fabric-data/etcbc/bhsa/tf/c/',
             '../../data/',]

TF2 = Fabric(locations=locations2)
api2 = TF2.load('''

vs vt pdp gloss lex 
language rela typ number
function
role label
''')

B = use('bhsa', api=api2)

In [None]:
B.show(B.search('''

phrase function=Time
    chunk label=prep
        word
        < word

'''), condenseType='clause', condensed=True, end=5)

In [None]:
def tokenPhrase(phrasenode):
    '''Tokenizes a phrase with
    dot-separated words.
    input: phrase node number
    output: token string'''
    words = [(F.g_cons_utf8.v(w) if F.lex.v(w) != 'H' else 'ה') for w in L.d(phrasenode, 'word')]
    return '.'.join(words)

def tokenHeads(headslist):
    '''same as tokenPhrase but with list of head word nodes'''
    return '.'.join((F.g_cons_utf8.v(w) if F.lex.v(w) != 'H' else 'ה') for w in headslist)

## Phrase Tokens and Phrase Heads

This search counts all of the discrete time phrase tokens in Hebrew and gathers data about their heads. This data is exported to a spreadsheet for manual inspection. For every token, a key of its heads is saved into a dictionary, linked to a list of phrase nodes. Tokens that map to more than one set of heads are suspicious, since the surface form is the same. All other tokens will be exported with their standard heads for inspection.

In [None]:
tp_heads = collections.defaultdict(lambda: collections.defaultdict(list))
tp_nheads = collections.defaultdict(lambda: collections.defaultdict(list))
tp_count = collections.Counter()

tps = A.search('''

phrase function=Time
/with/
    word language=Hebrew
/-/

''', shallow=True)

for tp in tps:
    token = tokenPhrase(tp)
    heads_token = tokenHeads(E.head.t(tp))
    nheads_token = tokenHeads(E.nhead.t(tp))
    
    
    tp_heads[token][heads_token].append(tp)
        
    # only populate nheads with PP phrases, since nhead feature for NP is exactly the same
    if F.typ.v(tp) == 'PP':
        tp_nheads[token][nheads_token].append(tp)
        
    tp_count[token] += 1
    
suspect_heads = [tp for tp in tp_heads if len(tp_heads[tp]) > 1]
suspect_nheads = [tp for tp in tp_nheads if len(tp_nheads[tp]) > 1]

print(f'total phrase tokens to head mappings: {len(tp_heads)}')
print(f'total phrase tokens to nhead mappings: {len(tp_nheads)}')
print(f'total suspect heads: {len(suspect_heads)}')
print(f'total suspect nheads: {len(suspect_nheads)}')
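
The suspect-detection idea is shown below on made-up records. The `(node, token, heads)` tuples, the node numbers, and the transliterated tokens are invented for illustration. Only the nested-dictionary pattern mirrors the cell above.

```python
import collections

# hypothetical records: (phrase node, surface token, heads token)
records = [
    (101, 'B.H.JWM', 'JWM'),
    (102, 'B.H.JWM', 'JWM'),
    (103, 'B.H.BQR.B.H.BQR', 'BQR'),
    (104, 'B.H.BQR.B.H.BQR', 'BQR.BQR'),
]

# token -> heads token -> list of phrase nodes
token_heads = collections.defaultdict(lambda: collections.defaultdict(list))
for node, token, heads in records:
    token_heads[token][heads].append(node)

# a token is suspect when the same surface form maps to different heads
suspect = sorted(t for t, mapping in token_heads.items() if len(mapping) > 1)
print(suspect)
```

Keeping the phrase nodes in the innermost lists makes it possible to jump from a suspect token straight to the concrete phrases that disagree.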

**NB**<br>
The initial run of this search found problems in the phrase: ב.ה.בקר.ב.ה.בקר. Some cases marked the second part of the phrase as a parallel element, whereas others marked it as either a phrase atom specification relation (`Spec`) or a subphrase adjunct relation (`adj`). This tagging is inconsistent on the BHSA's part. These issues were addressed in the [heads notebook](https://nbviewer.jupyter.org/github/ETCBC/heads/blob/master/phrase_heads.ipynb) of the ETCBC heads repository. The phrase in question is now correctly annotated.

### Compile Manual Inspection Spreadsheet

In [23]:
# tp_heads_data = []
# tp_nheads_data = []
# data_header = ['token', '(n)heads_token', 'freq', 'mark', 'note', 'ex_ref', 'ex', 'ex_node', 'ex_verse']

# for htp, nhtp in zip(tp_heads.keys(), tp_nheads.keys()):
#     head = next(tp for tp in tp_heads[htp])
#     nhead = next(tp for tp in tp_nheads[nhtp])
#     head_ex = random.choice(tp_heads[htp][head])
#     nhead_ex = random.choice(tp_nheads[nhtp][nhead])
    
#     head_ref, nhead_ref = ['{} {}:{}'.format(*T.sectionFromNode(ex)) for ex in (head_ex, nhead_ex)]
#     head_txt, nhead_txt = [T.text(ex) for ex in (head_ex, nhead_ex)]
#     head_verse, nhead_verse = [T.text(L.u(ex, 'verse')[0]) for ex in (head_ex, nhead_ex)]
    
#     heads_data = [htp, head, tp_count[htp], '', '', head_ref, head_txt, head_ex, head_verse]
#     nheads_data = [nhtp, nhead, tp_count[nhtp], '', '', nhead_ref, nhead_txt, nhead_ex, nhead_verse]
#     tp_heads_data.append(heads_data)
#     tp_nheads_data.append(nheads_data)
    
# tp_heads_data, tp_nheads_data = sorted(tp_heads_data), sorted(tp_nheads_data)

In [24]:
# with open('manual_curation/tp_heads.csv', 'w') as outfile:
#     writer = csv.writer(outfile)
#     writer.writerow(data_header)
#     writer.writerows(tp_heads_data)
    
# with open('manual_curation/tp_nheads.csv', 'w') as outfile:
#     writer = csv.writer(outfile)
#     writer.writerow(data_header)
#     writer.writerows(tp_nheads_data)

# Manual Review of Heads and Features

The manual annotations are intended to serve three roles: (1) to evaluate the accuracy of the head assignments on time phrases in the BHSA data, (2) to evaluate the time phrase structure in the BHSA dataset, and (3) to gain hands-on exposure to the kinds of time phrases in the dataset. The process consisted of comparing the selected heads against the surface text of the time phrase, and of reading the time phrase in the context of its verse when questions or anomalies arose. Each time phrase was marked "g" for "good," "b" for "bad," or "?" for questionable cases. These classifications refer to both the head assignments and the internal structuring of the time phrases in the BHSA. The markings are often accompanied by notes: for bad or questionable entries the note explains what is wrong; for good entries the note might describe an interesting phenomenon or, in some cases, give a "light caution" about a given phrase.

The annotations suggest that a custom database is necessary to represent time phrases consistently. In many cases a BHSA time phrase is split into 2 adjacent parts, whereas in the majority of the data the same parts are kept together. This inconsistency should be resolved. In other cases, the notion of "phrase" is not broad enough to encompass the full range of expressions that can mark time. For instance, several time phrases are split off from the infinitives they direct, where the infinitive denotes an event. This is because the ETCBC's strict structuralist methodology defines the infinitive event as a clause, operating at a different hierarchical level than the phrase; phrases are defined as strictly non-predicative. In the framework of Construction Grammar preferred by this study (Goldberg, *Constructions*, 1995; Croft, *Radical Construction Grammar*, 2001), these divisions are not necessary, and in fact hinder an accurate and comprehensive description. In Construction Grammar, no division is assumed between syntax and semantics. The difference between a clause and a phrase is thus a difference in degree, based on the two constructions' forms and meanings, and not a fundamental difference in kind. Indeed, many phrases in this dataset refer to event-like nouns, which are non-predicative from a syntactic perspective; במות "in the death of" is a very frequent example, but other cases include ביום ישׁועה "in the day of salvation," which assumes a salvation event. If the difference between time phrases and time "clauses" is seen as incremental rather than categorical, then one can apply a unified strategy in analyzing these cases.

In [25]:
pd.set_option('display.max_colwidth', 0)  # configure DataFrame to show full notes with no truncation

In [26]:
head_anno = pd.read_csv('manual_curation/tp_heads_annotated.csv')
nhead_anno = pd.read_csv('manual_curation/tp_nheads_annotated.csv')
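
To get a quick overview of the markings, the "g", "b", and "?" values can be tallied. The sketch below uses only the standard library and an in-memory sample so that it runs without the annotated files. The rows are invented, and the column names `token`, `mark`, and `note` are assumed to match the spreadsheet header used for the export.

```python
import collections
import csv
import io

# invented sample standing in for an annotated spreadsheet
sample = io.StringIO(
    "token,mark,note\n"
    "tokA,g,\n"
    "tokB,b,wrong head\n"
    "tokC,?,check context\n"
    "tokD,g,\n"
)

rows = list(csv.DictReader(sample))

# count how often each mark occurs
mark_counts = collections.Counter(row['mark'] for row in rows)

# collect the tokens that need a fix
bad_tokens = [row['token'] for row in rows if row['mark'] == 'b']

print(mark_counts)
print(bad_tokens)
```

The same filter on the `mark` column is what separates the bad and questionable cases reviewed in the following sections.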

## Looking at Bad Cases

In [27]:
#head_anno[head_anno['mark'] == 'b'].to_csv('manual_curation/head_fixes.csv')

In [28]:
#nhead_anno[nhead_anno['mark']=='b'].to_csv('manual_curation/nhead_fixes.csv')

**All situations have been remedied. The chosen solution is stored under the new column, "fix", in `head_fixes.csv` and `nhead_fixes.csv`.**

## Looking at Questionable Cases

In [29]:
#head_anno[head_anno['mark'] == '?']

In [30]:
#nhead_anno[nhead_anno['mark']=='?']