### GENRE Output Reformat

This notebook reformats GENRE's output and saves it to a CSV file with one row per linked entity, which is easier for the evaluation scripts to consume.

In [1]:
import pandas as pd
import re
import pickle

In [21]:
result_df = pd.read_csv('../../data/results/genre/FAA_model_gold_standards_independant_20240611150355.csv')
result_df.head()

Unnamed: 0,c5_unique_id,c119_text,c119_output,c119_score
0,19990213001379A,ACFT WAS TAXIING FOR TAKE OFF WHEN IT LOST CON...,ACFT WAS TAXIING FOR TAKE OFF WHEN IT LOST { C...,-0.553834
1,19800217031649I,"AFTER TAKEOFF, ENGINE QUIT. WING FUEL TANK SUM...","AFTER TAKEOFF, ENGINE QUIT. WING FUEL TANK SUM...",-0.262258
2,19790720021329A,HELICOPTER TOOK OFF WITH SLING LOAD ATTACHED. ...,HELICOPTER TOOK OFF WITH SLING LOAD ATTACHED. ...,-0.145806
3,19841214074599I,WHILE TAXIING LOST NOSEWHEEL STEERING AND BRAK...,WHILE TAXIING LOST NOSEWHEEL STEERING AND BRAK...,-0.068722
4,19860128014289I,FORWARD CARGO DOOR OPENED AS AIRCRAFT TOOK OFF...,FORWARD CARGO DOOR OPENED AS AIRCRAFT TOOK OFF...,-0.15218


**Load title2wikidataID dict**

We follow the example at https://github.com/facebookresearch/GENRE/tree/main/examples_mgenre and download `lang_title2wikidataID-normalized_with_redirect.pkl`, which we use to translate the generated titles to Wikidata QIDs.

In [None]:
with open("lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
    lang_title2wikidataID = pickle.load(f)

In [None]:
list(lang_title2wikidataID.keys())[10:20] # see sample

In [5]:
('en','Boeing') in lang_title2wikidataID

True

In [6]:
max(lang_title2wikidataID[('en','Boeing')], key=lambda y: int(y[1:])) # func from mgenre examples notebook

'Q66'
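The `key=lambda y: int(y[1:])` argument matters when a title maps to more than one QID: without it, `max` compares the strings character by character rather than by number. A minimal sketch, assuming each value in the dict is a collection of QID strings; the title and QIDs below are made up for illustration:

```python
# Hypothetical entry: one title mapped to two QIDs (values are illustrative only)
toy_qids = {'Q66', 'Q100'}

# Plain string comparison: 'Q66' > 'Q100' because '6' sorts after '1'
lexicographic_pick = max(toy_qids)

# Numeric comparison on the digits after the leading 'Q'
numeric_pick = max(toy_qids, key=lambda y: int(y[1:]))

print(lexicographic_pick, numeric_pick)
```

With the key function the selection is by the numeric part of the QID, which is the behaviour used throughout this notebook.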

**Define title2id function**

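The lookup tolerates capitalization differences: if the exact title is not in the dict, it retries with the title-cased string and returns `None` if that also fails.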

In [8]:
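def title2id(title):
    """Return the Wikidata QID for an English title, or None if not found."""
    # Try the exact title first, then a title-cased variant
    for candidate in (title, title.title()):
        qids = lang_title2wikidataID.get(('en', candidate))
        if qids:
            # Pick the highest-numbered QID, as in the mgenre examples notebook
            return max(qids, key=lambda y: int(y[1:]))
    return None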
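The fallback can be exercised on a small stand-in dict. This is a sketch only: `title2id_sketch` and `toy_mapping` are hypothetical names, and the function takes the mapping as a parameter so the block runs without the pickle file. It also shows a limit of `str.title()`, which lowercases every letter after the first in each word, so a title such as `Douglas DC-10` is not recovered from a lowercase input.

```python
def title2id_sketch(title, mapping):
    """Look up a QID in a toy mapping: exact title first, then str.title()."""
    for candidate in (title, title.title()):
        qids = mapping.get(('en', candidate))
        if qids:
            return max(qids, key=lambda y: int(y[1:]))
    return None

# Hypothetical stand-in for lang_title2wikidataID
toy_mapping = {
    ('en', 'Boeing'): {'Q66'},
    ('en', 'Douglas DC-10'): {'Q208075'},
}

exact = title2id_sketch('Boeing', toy_mapping)            # exact key exists
recovered = title2id_sketch('BOEING', toy_mapping)        # 'BOEING'.title() == 'Boeing'
missed = title2id_sketch('douglas dc-10', toy_mapping)    # .title() gives 'Douglas Dc-10'

print(exact, recovered, missed)
```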

**Get entities, titles, and ids using regex**

In [9]:
out_dict = {'c5_unique_id':[],'c119_text':[],'c119_output':[],'c119_score':[], 'entities':[],'titles':[],'ids':[]}
# Raw string so the backslash escapes reach the regex engine unchanged
ent_title_p = re.compile(r'[^\{]*\{ ([^\}]+) \} \[ ([^\]]+) ] (.*)') # collects groups: ent, title, rest

for i in range(len(result_df)):
    text = result_df['c119_output'].iat[i]
    
    while text:
        mo = ent_title_p.match(text)
        if mo:
            ent, title, text = mo.groups()

            qid = title2id(title)  # avoid shadowing the built-in id()

            out_dict['c5_unique_id'].append(result_df['c5_unique_id'].iat[i])
            out_dict['c119_text'].append(result_df['c119_text'].iat[i])
            out_dict['c119_output'].append(result_df['c119_output'].iat[i])
            out_dict['c119_score'].append(result_df['c119_score'].iat[i])
            out_dict['entities'].append(ent)
            out_dict['titles'].append(title)
            out_dict['ids'].append(qid)
        
        else:
            text = None
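The loop above peels off one `{ entity } [ title ]` pair per iteration and continues on the remaining text. A minimal sketch of the same idea on a single string; `extract_pairs` is a hypothetical helper and the sample strings are made up to resemble the `c119_output` format:

```python
import re

# Same pattern as above: groups are entity, title, and the remaining text
ent_title_p = re.compile(r'[^\{]*\{ ([^\}]+) \} \[ ([^\]]+) ] (.*)')

def extract_pairs(text):
    """Collect (entity, title) pairs by repeatedly matching the leading pair."""
    pairs = []
    while text:
        mo = ent_title_p.match(text)
        if not mo:
            break
        ent, title, text = mo.groups()
        pairs.append((ent, title))
    return pairs

# Illustrative input, not taken from the results file
sample = "RAN OUT OF { FUEL } [ El Salvador ] ON FERRY FLIGHT WITH { DC-10 } [ Douglas DC-10 ] ."
pairs = extract_pairs(sample)

# The pattern expects a space after the closing bracket, so a pair that
# ends the string with no trailing space is not picked up
tail_pairs = extract_pairs("LOST { CONTROL } [ Control ]")

print(pairs, tail_pairs)
```

Whether GENRE ever emits a pair at the very end of a string without trailing text has not been checked here. If it does, that pair would be skipped.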

In [18]:
out_df = pd.DataFrame(out_dict)
out_df.head()

Unnamed: 0,c5_unique_id,c119_text,c119_output,c119_score,entities,titles,ids
0,19900425011659A,RAN OUT OF FUEL ON FERRY FLIGHT. LEFT PONTOON ...,RAN OUT OF { FUEL } [ El Salvador ] ON FERRY F...,-0.200809,FUEL,El Salvador,Q792
1,19900425011659A,RAN OUT OF FUEL ON FERRY FLIGHT. LEFT PONTOON ...,RAN OUT OF { FUEL } [ El Salvador ] ON FERRY F...,-0.200809,FUEL,El Salvador,Q792
2,19900425011659A,RAN OUT OF FUEL ON FERRY FLIGHT. LEFT PONTOON ...,RAN OUT OF { FUEL } [ El Salvador ] ON FERRY F...,-0.200809,FUEL,El Salvador,Q792
3,20000625032189I,(-23)NORTHWEST AIRLINES DC-10 BOEING PUSHED BA...,(-23)NORTHWEST AIRLINES { DC-10 } [ Douglas DC...,-0.109136,DC-10,Douglas DC-10,Q208075
4,20000625032189I,(-23)NORTHWEST AIRLINES DC-10 BOEING PUSHED BA...,(-23)NORTHWEST AIRLINES { DC-10 } [ Douglas DC...,-0.109136,B-52,Boeing B-52 Stratofortress,Q174534


**Save DataFrame**

In [20]:
out_df.to_csv('../../data/results/genre/genre.csv', index=False)