### Parse ASP Output

This notebook parses the output of ASP's eval script, saved to a log file such as eval.out (ASP/eval_xl.out in the run shown here), and saves a CSV file with the predicted coreferences for each FAA report.

In [66]:
import ast
import pandas as pd

In [108]:
with open("ASP/eval_xl.out") as f:
#with open("eval.out") as f:
    lines = f.readlines()

In [109]:
lines[0] # example output

'07/31/2024 16:04:30 - INFO - util.func - Running experiment: flant5_xl\n'

**Collect relevant data from logger output lines**

In [110]:
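# Length of the prefix that precedes every meaningful logger message, so that we can slice it off.
# This sample path contains "asp2"; the run parsed below logs from ".../coref/asp/...", which is
# one character shorter, so the parsing cell slices at 114 rather than 115.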
len("01/26/2024 20:50:45 - INFO - /afs/crc.nd.edu/group/TAI/Users/kmealey2/git/2K-paper/coref/asp2/ASP/util/runner.py - ")

115
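The fixed offset depends on the exact path of runner.py, so it changes whenever the checkout directory changes. As a hedged alternative, assuming every line follows the `timestamp - LEVEL - name - message` layout seen in `lines[0]`, the message can be split off without counting characters. `split_log_line` is a hypothetical helper written for this sketch; it is not part of ASP.

```python
def split_log_line(line):
    """Split a logger line into (timestamp, level, name, message).

    Assumes the 'timestamp - LEVEL - name - message' layout and that the
    first three fields never contain ' - ' themselves.
    """
    parts = line.rstrip("\n").split(" - ", 3)
    if len(parts) != 4:
        return None  # not a logger line in the assumed layout
    return tuple(parts)


example = "07/31/2024 16:04:30 - INFO - util.func - Running experiment: flant5_xl\n"
timestamp, level, name, message = split_log_line(example)
print(name, "|", message)
```

Because `maxsplit` is 3, any ` - ` inside the message itself stays in the last field.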

In [111]:
num_to_meaning = {0: "doc_key", 1: "input", 2: "predicted", 3: "gold"}
output = {"doc_key": [], "input": [], "predicted": [], "gold": []}

runner_tag = "INFO - /afs/crc.nd.edu/group/TAI/Users/kmealey2/git/2K-paper/coref/asp/ASP/util/runner.py"
prefix_len = 114  # number of characters before the logged message

num = 0 # position in the doc_key -> input -> predicted -> gold cycle. These always occur sequentially
for line in lines:
    if runner_tag not in line:
        continue
    message = line[prefix_len:].strip()
    if "[" in line and "]" in line:
        # bracketed lines hold Python literals (token list or coreference chains)
        output[num_to_meaning[num]].append(ast.literal_eval(message))
    elif "faa/" in line:
        # the doc_key is logged as a plain string
        output[num_to_meaning[num]].append(message)
    else:
        continue  # not a data line; leave the counter unchanged
    num = (num + 1) % 4  # count up, wrapping back to doc_key after gold

**View data**

In [112]:
print(len(output["doc_key"]), len(output["input"]), len(output["predicted"]), len(output["gold"]))

2748 2748 2748 2748
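Equal lengths are a necessary check, because the parser relies on the four entries always appearing in the same order. A minimal sketch of a slightly stricter check follows. `check_alignment` is a hypothetical helper, and the rule that every doc_key starts with "faa/" is an assumption taken from the parsing cell above.

```python
def check_alignment(output, fields=("doc_key", "input", "predicted", "gold")):
    """Return a list of human-readable problems.

    An empty list means the four parsed columns look consistent.
    """
    problems = []
    lengths = {field: len(output[field]) for field in fields}
    if len(set(lengths.values())) != 1:
        problems.append(f"unequal lengths: {lengths}")
    for i, key in enumerate(output["doc_key"]):
        # Assumption: doc keys are plain strings that start with "faa/"
        if not (isinstance(key, str) and key.startswith("faa/")):
            problems.append(f"row {i}: unexpected doc_key {key!r}")
    return problems


# Toy data in the same shape as the parsed output (values are illustrative)
toy = {
    "doc_key": ["faa/354_19800217031649I_0_0"],
    "input": [["▁core", "ference"]],
    "predicted": [[((7, 10), (17, 17))]],
    "gold": [[]],
}
print(check_alignment(toy))
```

A list in the doc_key column would indicate that the counter slipped by one position somewhere in the log.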


In [113]:
output['predicted'][20:40]

[[],
 [],
 [((4, 7), (10, 10), (21, 21))],
 [],
 [((5, 5), (16, 16))],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 []]

**Change predicted coreferences from tuples to lists**\
This matches the format expected by the eval script.

In [114]:
corefs = output['predicted']

In [115]:
corefs[8] # example

[((4, 5), (30, 30))]

In [116]:
for i in range(len(corefs)):
    for i_coref_chain, coref_chain in enumerate(corefs[i]):
        corefs[i][i_coref_chain] = [[span[0], span[1]] for span in coref_chain]

In [117]:
corefs[8] # example

[[[4, 5], [30, 30]]]

**Shift predicted coreferences down by one word index**

In coref_gold, the correct output for 19800217031649I is [[[6,9],[16,16]]], but ASP reports [((7, 10), (17, 17))]. The same off-by-one shift appears in many other documents. We assume it is a quirk of the data processing step and subtract one from every span boundary in the cells below.

In [118]:
output["doc_key"][984]

'faa/354_19800217031649I_0_0'

In [119]:
output["predicted"][984]

[[[7, 10], [17, 17]]]
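The off-by-one assumption can be tested per document rather than by eye. The sketch below infers a constant offset between a predicted and a gold annotation. `flatten` and `infer_offset` are hypothetical helpers, and the comparison assumes both annotations list the same chains and spans in the same order.

```python
def flatten(chains):
    """Flatten nested chains and spans into one list of boundary indices."""
    return [index for chain in chains for span in chain for index in span]


def infer_offset(predicted, gold):
    """Return the single constant d such that gold == predicted + d, else None."""
    flat_pred, flat_gold = flatten(predicted), flatten(gold)
    if not flat_pred or len(flat_pred) != len(flat_gold):
        return None  # nothing to compare, or different numbers of spans
    diffs = {g - p for p, g in zip(flat_pred, flat_gold)}
    return diffs.pop() if len(diffs) == 1 else None


predicted = [[[7, 10], [17, 17]]]  # ASP output for 19800217031649I
gold = [[[6, 9], [16, 16]]]        # gold chains for the same document
print(infer_offset(predicted, gold))
```

Running this over every document where prediction and gold have the same shape would show how consistently the shift of -1 holds.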

In [120]:
for i in range(len(corefs)):
    for i_coref_chain, coref_chain in enumerate(corefs[i]):
        corefs[i][i_coref_chain] = [[span[0] - 1, span[1] - 1] for span in coref_chain]

In [121]:
corefs[8]

[[[3, 4], [29, 29]]]
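Both conversion cells modify `corefs` in place, and `corefs` is the same object as `output['predicted']`, so re-running the subtraction cell shifts every span a second time. A non-mutating variant avoids that. This is a sketch only, and `normalize_chains` is a hypothetical name.

```python
def normalize_chains(chains, offset=-1):
    """Convert tuple spans to lists and shift both boundaries by offset.

    Returns a new structure and leaves the input untouched.
    """
    return [
        [[start + offset, end + offset] for start, end in chain]
        for chain in chains
    ]


raw = [((4, 5), (30, 30))]    # corefs[8] as logged by ASP
print(normalize_chains(raw))  # [[[3, 4], [29, 29]]]
print(raw)                    # unchanged: [((4, 5), (30, 30))]
```

Applying it as `[normalize_chains(c) for c in output['predicted']]` on freshly parsed data would give the same result as the two in-place cells, regardless of how often the cell is run.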

**Save data**

In [122]:
# get original faa data:
original_data = pd.read_csv('../../data/FAA_data/Maintenance_Text_data_nona.csv')

In [123]:
# get predictions id-wise
predictions = {output["doc_key"][i].split('_')[1] : output["predicted"][i] for i in range(len(output["predicted"]))}
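A dict comprehension keeps only the last value when the same key occurs twice, so two doc_keys that share an ID would silently overwrite each other. The sketch below checks for that case. `doc_id` and `duplicated_ids` are hypothetical helpers, and all example keys except the first are made up for illustration.

```python
from collections import Counter


def doc_id(doc_key):
    """Extract the report ID from a key such as 'faa/354_19800217031649I_0_0'."""
    return doc_key.split("_")[1]


def duplicated_ids(doc_keys):
    """Return the IDs that occur under more than one doc_key."""
    counts = Counter(doc_id(key) for key in doc_keys)
    return sorted(doc for doc, n in counts.items() if n > 1)


keys = [
    "faa/354_19800217031649I_0_0",
    "faa/354_19800217031649I_0_1",  # hypothetical second entry for the same report
    "faa/355_19990213001379A_0_0",  # hypothetical key
]
print(duplicated_ids(keys))
```

An empty result for `duplicated_ids(output["doc_key"])` would confirm that the ID-wise dictionary loses no predictions.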

In [124]:
c5s = list(original_data['c5'])
results_df = pd.DataFrame({'c5_id':c5s, 'c119_input':list(original_data['c119']), 'input':output["input"], 'corefs':[predictions[c5] for c5 in c5s]})
results_df.loc[2318:2325]

| | c5_id | c119_input | input | corefs |
|---|---|---|---|---|
| 2318 | 19990213001379A | ACFT WAS TAXIING FOR TAKE OFF WHEN IT LOST CON... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[[[0, 0], [7, 7]]]` |
| 2319 | 19990217015679I | (-5) A FLIGHT ATTENDANT OPENED AN OVERHEAD BIN... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[[[7, 9], [15, 16]]]` |
| 2320 | 19990425003889A | PILOT REPORTED THAT THE ACFT ENGINE HAD QUIT A... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[[[4, 4], [16, 17]]]` |
| 2321 | 19990506015529I | ON 5/6/99 AT APPROXIMATELY 1608C, DELTA 789 DE... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[]` |
| 2322 | 19990816019439I | (-23)AFTER LANDING, TAXIED ON TAXIWAY PAPA WAS... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[]` |
| 2323 | 19990904035409I | (-23) PILOT DEPARTED A PRIVATE RANCH FOR A FOU... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[]` |
| 2324 | 19990909030819A | (-23) THIS AIRCRAFT EXPERIENCED AN ENGINE STOP... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[]` |
| 2325 | 19990909031699I | (-23) NO NARRATIVE ... | `[▁core, ference, ▁resolution, :, <speaker>, ▁s...` | `[]` |
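In the cell above, the corefs column is looked up by ID while the input column is taken positionally from the log. The two agree only if the log order matches the row order of the FAA CSV. The sketch below looks up both by ID instead, assuming the log order is not guaranteed. `rows_by_id` is a hypothetical helper, and the keys and tokens in the example are made up.

```python
def rows_by_id(ids, doc_keys, inputs, predicted):
    """Build one row per ID, looking up both input and corefs by ID."""
    by_id = {
        key.split("_")[1]: (tokens, chains)
        for key, tokens, chains in zip(doc_keys, inputs, predicted)
    }
    missing = [i for i in ids if i not in by_id]
    if missing:
        raise KeyError(f"no ASP output for IDs: {missing}")
    return [
        {"c5_id": i, "input": by_id[i][0], "corefs": by_id[i][1]}
        for i in ids
    ]


doc_keys = ["faa/2_B_0_0", "faa/1_A_0_0"]  # hypothetical keys, logged in reverse order
inputs = [["tok_b"], ["tok_a"]]
predicted = [[], [[[0, 0], [7, 7]]]]
rows = rows_by_id(["A", "B"], doc_keys, inputs, predicted)
print([row["input"] for row in rows])
```

The resulting list of dicts can be passed directly to `pd.DataFrame`.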


In [125]:
results_df.to_csv('../../data/results/asp/asp_xl.csv')
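When the CSV is read back, the corefs column comes back as strings rather than nested lists, and the saved index shows up as an extra `Unnamed: 0` column unless `index_col=0` is passed. A minimal round trip on two of the rows above, using an in-memory buffer in place of the real file:

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "c5_id": ["19990213001379A", "19990506015529I"],
    "corefs": [[[[0, 0], [7, 7]]], []],
})

buffer = io.StringIO()
df.to_csv(buffer)  # same call as in the notebook, but written to memory
buffer.seek(0)

# index_col=0 reads the saved index back instead of creating "Unnamed: 0"
loaded = pd.read_csv(buffer, index_col=0)
print(type(loaded["corefs"].iloc[0]))  # the lists were stored as text

# Parse the text back into nested Python lists
loaded["corefs"] = loaded["corefs"].apply(ast.literal_eval)
print(loaded["corefs"].tolist())
```

The same `ast.literal_eval` step would be needed for the input column if downstream code uses the token lists.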