### Coref mt5 Output Reformat

In this notebook, we get the results outputted by coref_mt5/main.py in the data/results folder, called coref_mt5_raw.csv, and process it to create another csv called coref_mt5.csv, with a column called corefs. This column follows the format expected by the cr_eval.py script in evaluations/automatic, and so makes evaluation much simpler.

The desired format is like so:\
\[ coreference_chain, coreference_chain, ... ] where coreference_chain = \[mention_span, mention_span, ...] and mention_span = \[start_word_index, end_word_index]\
Such that the coreference chain for the sentence "PILOT LANDED ON WHAT HE THOUGHT TO BE ONE FOOT HIGH GRASS. IT TURNED OUT TO BE THREE FEET HIGH. ACFT NOSED OVER.":\
Which is: \[\["PILOT", "HE"], \["ONE FOOT HIGH GRASS", "IT"]]\
Appears as: [[0,0],[4,4],[[8,11],[13,13]]]\
The word indices are based on the word tokenization used in data/FAA_data/faa.json, which is used to create faa.jsonl in coref_mt5/data The word indices continue to increase throughout the whole doc/entry and do not reset at sentence starts

In [119]:
import pandas as pd
import re
import ast

In [120]:
result_df = pd.read_csv('../../data/results/coref_mt5/coref_mt5_raw.csv')
result_df.head()

Unnamed: 0,c5_id,input,prediction_strings,results
0,19750315005389A,"{'doc_key': 'faa/0_19750315005389A', 'sentence...",{0: ''},None [+ E]
1,19750419011349A,"{'doc_key': 'faa/1_19750419011349A', 'sentence...","{0: '', 1: 'None [+ E]'}",None [+ E]
2,19750419011349A,"{'doc_key': 'faa/1_19750419011349A', 'sentence...","{0: '', 1: 'None [+ E]', 2: 'None [+ E]'}",None [+ E]
3,19750419011349A,"{'doc_key': 'faa/1_19750419011349A', 'sentence...","{0: '', 1: 'None [+ E]', 2: 'None [+ E]'}",None [+ E]
4,19751029037799A,"{'doc_key': 'faa/2_19751029037799A', 'sentence...","{0: '', 1: 'None [+ E]'}",None [+ E]


In [121]:
# example coreference
result_df[result_df['c5_id'] == '19840107019539I']

Unnamed: 0,c5_id,input,prediction_strings,results
1696,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]'}",None [+ E]
1697,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]', 2: 'it ## . ** _ -> c...",it ## . ** _ -> cabin door ## opened . | ;;
1698,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]', 2: 'it ## . ** _ -> c...",his ## hand blew into -> pilot ## tried to clo...
1699,19840107019539I,"{'doc_key': 'faa/743_19840107019539I', 'senten...","{0: '', 1: 'None [+ E]', 2: 'it ## . ** _ -> c...",door ## not secured ** -> [1 ;;


**Define functions**

In [122]:
def find_sublist(lst, sublst):
    """
    Find the start index of the first occurrence of the sublist in the list.

    Args:
    lst (list): The list to search within.
    sublst (list): The sublist to search for.

    Returns:
    int: The starting index of the first occurrence of the sublist in the list, or -1 if the sublist is not found.
    """
    n = len(sublst)
    for i in range(len(lst) - n + 1):
        if lst[i:i + n] == sublst:
            return i
    return -1

In [123]:
def get_coref_chains(c5_id, result_df):

    output = result_df[result_df['c5_id'] == c5_id]
    input = ast.literal_eval(list(output['input'].unique())[0]) # get dict form of input for c5_id
    sent_idx_to_doc_idx = {sent_no : {idx:idx+sum([len(input['sentences'][isent]) for isent in range(sent_no)]) for idx in range(len(input['sentences'][sent_no]))} for sent_no in input['sentences']}
    
    coref_chains = {}
    coref_chain_no = 1
    
    for isent in input['sentences'].keys():
        result = output['results'].iat[isent]
        if result != 'None [+ E]':
            pairs = result.split(';;')
            for pair in pairs:
                if len(pair) > 0:
                    a, b = pair.split('->')
            
                    # find span of a, which will be in the sentence
                    a_coref = a.split('##')[0].split()
                    a_context_right = a.split('##')[1].split('**')[0].split()
                    a_coref_start = find_sublist(input['sentences'][isent], a_coref+a_context_right)
                    a_coref_start = sent_idx_to_doc_idx[isent][a_coref_start] # translate to doc_idx
                    a_coref_end = a_coref_start + len(a_coref) - 1 # calc end idx from start
            
                    # find b
                    mo = re.match(r'\[([1-9])', b.strip()) # check if b is a reference to a previous group, i.e., '[1' means it is a ref to group 1
                    if mo:
                        ref_chain = ast.literal_eval(mo.groups()[0])
                        coref_chains[ref_chain].append([a_coref_start, a_coref_end]) # add coref a to previously existing chain
            
                    # else new coref chain
                    else:
                        
                        # parse b
                        b_coref = b.split('##')[0].split()
                        b_context_right = b.split('##')[1].split('|')[0].split()
                        for ichecksent in range(isent + 1): # check sentences before this one
                            b_coref_start = find_sublist(input['sentences'][ichecksent], b_coref+b_context_right)
                            if b_coref_start > -1:
                                b_coref_start = sent_idx_to_doc_idx[ichecksent][b_coref_start] # translate to doc_idx
                                continue
                        b_coref_end = b_coref_start + len(b_coref) - 1 # calc end idx from start
            
                        # Now save to coref_chains
                        coref_chains[coref_chain_no] = [[a_coref_start, a_coref_end], [b_coref_start, b_coref_end]]
                        coref_chain_no = coref_chain_no + 1
        
    return list(coref_chains.values())

**Correct small error found in coref_mt5 output**

When running the code under "Parse Coreferences" without any edit, there is an error at document 20040822021239A.

The result which triggers the bug in find_sublist: pilot ## , owner m. -> ( -23 ) pilot ## reported aircraft accident ;;\
The sentence it's looking in:  ['pilot', ',', 'owner', 'mr.', 'gary', 'jones', 'st'

We note that the issue is that for whatever reason, the string generated is simple 'm.' where it should be 'mr.' Since the mistake is just a part of the context and not the actual coreference, we'll correct if for them.]

In [144]:
# data cited above ^
print(result_df[result_df['c5_id']=='20040822021239A']['results'].iat[1])
print(ast.literal_eval(result_df[result_df['c5_id']=='20040822021239A']['input'].iat[1])['sentences'][1])

pilot ## , owner m. -> ( -23 ) pilot ## reported aircraft accident ;;
['pilot', ',', 'owner', 'mr.', 'gary', 'jones', 'st']


In [153]:
# correct err in results_df:
result_df.iloc[6674]['results'] = "pilot ## , owner mr. -> ( -23 ) pilot ## reported aircraft accident ;;"

**Parse Coreferences**

If any c5 id's get printed out, need to debug before saving

In [155]:
coref_chains = []
for c5_id in result_df['c5_id'].unique():
    try:
        coref_chains.append(get_coref_chains(c5_id, result_df))
    except:
        print(c5_id)
        coref_chains.append([])

**Save Processed CSV**

In [157]:
# get original data
original_df = pd.read_csv('../../data/FAA_data/Maintenance_Text_data_nona.csv')

# make processed df
processed_df = pd.DataFrame({'c5_id':list(original_df['c5']), 'c119_input':list(original_df['c119']), 'corefs':coref_chains})

In [158]:
processed_df

Unnamed: 0,c5_id,c119_input,corefs
0,19750315005389A,TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ...,[]
1,19750419011349A,TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT...,[]
2,19751029037799A,"2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON...",[]
3,19751209037899A,PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF....,[]
4,19750818025579A,TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO...,[]
...,...,...,...
2743,20080227010559I,(-23) A/C RELOCATED TO NEW HANGAR TO CHECK SIZ...,"[[[18, 18], [12, 12], [23, 23]]]"
2744,20080223013919I,(-23) ON 2/23/08 @ APPROXIMATELY 2130 DURING T...,[]
2745,20080226013819I,(-23) PILOT TOOK OFF FOR LEESBURG AIRPORT AND ...,"[[[23, 23], [0, 3]]]"
2746,20080702843529I,(-23) OWNER FORGOT TO FASTEN THE LOWER LEFT 4 ...,"[[[17, 17], [-1, 1]]]"


In [159]:
processed_df.to_csv('../../data/results/coref_mt5.csv', index=False)