# GENIA Filtered Type Analysis
The GENIA dataset used in many NER papers, including the DyGIE++ paper, is reduced to only 5 of the original 47 entity mention types: DNA, RNA, Protein, Cell Line, and Cell Type. This explains much of the unexpectedly poor performance of the GENIA model: the model performs exceptionally well on the DNA, RNA, and Protein types, and our PICKLE dataset doesn't have many Cell annotations at all. Here, we look at the performance of the GENIA model when we filter the PICKLE test set down to just those types represented in the GENIA model before evaluation. We perform this filtering using the script `filter_pickle_to_GENIA.py` in the `annotation/abstract_scripts` directory.
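
As a rough illustration of what this kind of filtering involves, the sketch below keeps only entities whose label is one of the five GENIA types. It is not the contents of `filter_pickle_to_GENIA.py`. The function name `filter_doc_to_types`, the label spellings in `GENIA_TYPES`, the toy document, and the assumption that entities are stored as per-sentence lists of `[start, end, label]` (as in the DyGIE++ `ner` field) are all made up for the example.

```python
# Hypothetical sketch, not the actual filter_pickle_to_GENIA.py script.
# Assumes a DyGIE++-style document: doc["ner"] is a list (one entry per
# sentence) of [start_token, end_token, label] triples.
GENIA_TYPES = {"dna", "rna", "protein", "cell_line", "cell_type"}  # assumed spellings


def filter_doc_to_types(doc, keep_types=GENIA_TYPES):
    """Return a shallow copy of doc whose entities are restricted to keep_types."""
    filtered = dict(doc)
    filtered["ner"] = [
        [ent for ent in sent if ent[2].lower() in keep_types]
        for sent in doc["ner"]
    ]
    return filtered


# Invented two-sentence document for illustration
example_doc = {
    "doc_key": "toy_doc",
    "ner": [[[0, 0, "Protein"], [3, 4, "Plant_hormone"]], [[7, 7, "DNA"]]],
}
filtered_doc = filter_doc_to_types(example_doc)
print(filtered_doc["ner"])
```

Filtering the gold annotations this way removes entities that a GENIA-trained model could never predict, so the remaining errors reflect span and type mistakes rather than missing label coverage.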

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from ast import literal_eval

## Reading in data
### Performance data
First, let's read in the original performance data:

In [None]:
orig_perf_seedev = pd.read_csv('../data/straying_off_topic_data/model_output/dygiepp/15Jul2023_all_on_pickle_but_seedev/performance/17Jul2023_seedev_on_pickle_performance_no_labels.csv')
orig_perf_all = pd.read_csv('../data/straying_off_topic_data/model_output/dygiepp/15Jul2023_all_on_pickle_but_seedev/performance/15Jul2023_all_on_pickle_no_seedev_performance.csv')
orig_perf_all = pd.concat([orig_perf_seedev, orig_perf_all])
orig_perf_all['model'] = orig_perf_all['pred_file'].str.split('_').str[-2]
orig_perf_all = orig_perf_all.drop(index=1).reset_index(drop=True)
orig_perf_all

In [None]:
orig_perf = pd.read_csv('../data/straying_off_topic_data/model_output/dygiepp/15Jul2023_all_on_pickle_but_seedev/performance/15Jul2023_all_on_pickle_no_seedev_performance.csv')
orig_perf['model'] = orig_perf['pred_file'].str.split('_').str[-2]
orig_perf = orig_perf[orig_perf['model'].isin(['genia', 'genia-lightweight'])]
orig_perf = orig_perf.reset_index(drop=True)
orig_perf

Then let's read in the performance data for the filtered PICKLE test set:

In [None]:
prefix = '../data/straying_off_topic_data/model_output/mismatch_analysis/'
filepaths = {
    'filtered_with_types': 'PICKLE_down_to_GENIA_eval_with_bootstrap_with_types.csv',
    'filtered_without_types': 'PICKLE_down_to_GENIA_eval_with_bootstrap_without_types.csv'
}
evals = {k: pd.read_csv(f'{prefix}{v}') for k,v in filepaths.items()}
for k in evals.keys():
    evals[k]['model'] = evals[k]['pred_file'].str.split('_').str[-2]

In [None]:
evals['filtered_without_types']

In [None]:
# Add orig to new
evals['orig_perfs'] = orig_perf

## Performance plots
Let's take a look at the performance differences across conditions:

In [None]:
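def process_CIs(df, kind='F1'):
    """
    Parses the CI strings in a given df with literal_eval and returns a
    2 x N DataFrame of (lower, upper) error-bar offsets, with one column
    per model.
    """
    ent_CIs = df[f"ent_{kind}_CI"].apply(lambda x: literal_eval(str(x)))
    # Row 0: distance from the score down to the lower CI bound;
    # row 1: distance from the score up to the upper CI bound
    ent_CIs = pd.DataFrame([[df[f'ent_{kind}'][i] - val[0] for i, val in enumerate(ent_CIs)], [val[1] - df[f'ent_{kind}'][i] for i, val in enumerate(ent_CIs)]])
    # There is one column per row of df, so iterate over columns, not rows
    col_names = {i: df.loc[i, 'model'] for i in range(ent_CIs.shape[1])}
    ent_CIs = ent_CIs.rename(columns=col_names)

    return ent_CIs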
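
The conversion above exists because `plt.errorbar` takes `yerr` as non-negative offsets from each point (a shape `(2, N)` array holds the lower offsets in the first row and the upper offsets in the second), not as absolute interval bounds. Here is a minimal sketch of that conversion on a toy table. The column names mirror those used in this notebook, but every value is invented.

```python
from ast import literal_eval

import pandas as pd

# Toy performance table; the numbers are invented for illustration only.
toy_perf = pd.DataFrame({
    "model": ["genia", "genia-lightweight"],
    "ent_F1": [0.50, 0.40],
    "ent_F1_CI": ["(0.45, 0.56)", "(0.38, 0.43)"],  # CIs stored as strings
})

# Parse the "(lower, upper)" strings into tuples of floats
bounds = toy_perf["ent_F1_CI"].apply(literal_eval)

# Convert absolute bounds into distances from the point estimate
lower = [f1 - lo for f1, (lo, hi) in zip(toy_perf["ent_F1"], bounds)]
upper = [hi - f1 for f1, (lo, hi) in zip(toy_perf["ent_F1"], bounds)]

# Row 0 holds lower offsets, row 1 holds upper offsets: shape (2, N)
yerr = pd.DataFrame([lower, upper], columns=toy_perf["model"].tolist())
print(yerr.round(2))
```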

In [None]:
perf_plot_data = {'x': [], 'genia_f1': [], 'genialight_f1': [], 'genia_ci': [], 'genialight_ci': []}
for name, perf_df in evals.items():
    CIs = process_CIs(perf_df)
    F1s = perf_df[['model', 'ent_F1']]
    perf_plot_data['x'].append(name)
    perf_plot_data['genia_f1'].append(F1s.loc[F1s['model'] == 'genia', 'ent_F1'].values[0])
    perf_plot_data['genialight_f1'].append(F1s.loc[F1s['model'] == 'genia-lightweight', 'ent_F1'].values[0])
    perf_plot_data['genia_ci'].append(CIs['genia'])
    perf_plot_data['genialight_ci'].append(CIs['genia-lightweight'])
perf_plot_data['genia_ci'] = pd.concat(perf_plot_data['genia_ci'], axis=1)
perf_plot_data['genialight_ci'] = pd.concat(perf_plot_data['genialight_ci'], axis=1)

In [None]:
# Define a name mapping for semantic labels
name_map = {
    'filtered_with_types': 'Filtered PICKLE\nWith types',
    'filtered_without_types': 'Filtered PICKLE\nWithout types',
    'orig_perfs': 'All PICKLE\nWithout types'
}

In [None]:
perf_plot_data['semantic_x'] = pd.Series([name_map[x] for x in perf_plot_data['x']])

In [None]:
# Define numerical x values to allow offsets
x_dict = {mod:i for i,mod in enumerate(perf_plot_data["x"])}
x_dict

In [None]:
perf_plot_data['num_x'] = pd.Series([x_dict[x] for x in perf_plot_data['x']])

In [None]:
plt.errorbar(x=perf_plot_data['num_x'] - 0.1, y=perf_plot_data['genia_f1'], yerr=perf_plot_data['genia_ci'], fmt='o', color='purple', label='GENIA')
plt.errorbar(x=perf_plot_data['num_x'] + 0.1, y=perf_plot_data['genialight_f1'], yerr=perf_plot_data['genialight_ci'], fmt='^', color='orange', label='GENIA Lightweight')

plt.xticks(perf_plot_data['num_x'], perf_plot_data['semantic_x'], size=12, ha='center')
plt.xlabel('Evaluation Dataset', size=14, labelpad=10)
plt.ylabel('Entity F1', size=14, labelpad=10)

plt.legend()

Now let's look at the performance of all of the models on the original PICKLE test set, next to the performance of the GENIA models on the filtered PICKLE test set, both evaluated without types:

In [None]:
orig_perf_all = orig_perf_all.sort_values('ent_F1')
orig_perf_all = orig_perf_all.reset_index(drop=True)
filtered_perf = evals['filtered_without_types']

In [None]:
def process_ent_only_CIs(df, kind='F1'):
    """
    Parses the CI strings in a given df with literal_eval and returns the
    entity CIs as a 2 x N DataFrame of (lower, upper) error-bar offsets.
    """
    ent_CIs = df[f"ent_{kind}_CI"].apply(lambda x: literal_eval(str(x)))
    ent_CIs = pd.DataFrame([[df[f'ent_{kind}'][i] - val[0] for i, val in enumerate(ent_CIs)], [val[1] - df[f'ent_{kind}'][i] for i, val in enumerate(ent_CIs)]])

    return ent_CIs

In [None]:
all_no_filter_ent_CIs = process_ent_only_CIs(orig_perf_all)
filter_ent_CIs = process_ent_only_CIs(filtered_perf)

In [None]:
x_dict = {mod:i for i,mod in enumerate(orig_perf_all["model"].values.tolist())}
orig_perf_all["x"] = orig_perf_all["model"].map(x_dict)
filtered_perf["x"] = filtered_perf["model"].map(x_dict)

In [None]:
orig_label_key = {'chemprot': 'ChemProt',
         'scierc': 'SciERC',
         'bioinfer': 'BioInfer',
         'genia': 'GENIA',
         'pickle': 'PICKLE',
         'scierc-lightweight': 'SciERC lightweight',
         'genia-lightweight': 'GENIA lightweight',
         'ace05-relation': 'ACE05',
         'seedev': 'SeeDev'}
filtered_label_key = {'genia': 'GENIA on filtered PICKLE',
                     'genia-lightweight': 'GENIA lightweight on filtered PICKLE'}

In [None]:
name_x = [orig_label_key[mod] for mod in orig_perf_all["model"]]

plt.errorbar(x=orig_perf_all["x"] + 0.1, y=orig_perf_all["ent_F1"], yerr=all_no_filter_ent_CIs, fmt="o", color='#E66100', label='On original PICKLE test set')
plt.errorbar(x=filtered_perf["x"] - 0.1, y=filtered_perf["ent_F1"], yerr=filter_ent_CIs, fmt="^", color='#5D3A9B', label='On GENIA-filtered PICKLE test set')
plt.xticks(orig_perf_all["x"], name_x, size=12, rotation=60, ha='right')
plt.xlabel('Model', size=14)
plt.ylabel('F1', y=0.6,  size=14)
plt.legend()

## Exploring poor performance with types
The poor performance of the GENIA model on the filtered PICKLE dataset when evaluated with types is unexpected. Let's dig into that here to make sure nothing untoward is happening.

Given that the performance when no types are considered is quite high, it's likely that the issue is GENIA consistently misidentifying types on entities whose spans are otherwise correctly predicted. Let's quantify that supposition by checking what percentage of false negatives are false negatives because of type and not because of span boundaries.

To do this, we'll look at the mismatches from the evaluation where types were ignored, and compare the false negatives against the true positives where the type of the prediction and the gold don't match. This allows us to separate the false negatives by whether they were due to type or to boundary in a straightforward manner.
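
The separation described above can be sketched on a toy table. The `mismatch_type` codes used here (0 = false negative, 1 = true positive, 2 = false positive) are inferred from how the column is used in this notebook, and the rows are invented.

```python
import pandas as pd

# Invented rows; mismatch_type codes assumed from their use in this notebook:
# 0 = false negative, 1 = true positive (span match), 2 = false positive.
toy = pd.DataFrame({
    "mismatch_type": [1, 1, 1, 0, 2],
    "gold_ent_type": ["protein", "protein", "dna", "rna", None],
    "pred_ent_type": ["protein", "dna", "dna", None, "protein"],
})

span_match = toy["mismatch_type"] == 1
type_match = toy["gold_ent_type"] == toy["pred_ent_type"]

# Right span, wrong type: counted as a true positive only when types are ignored
n_type_errors = int((span_match & ~type_match).sum())
# Gold span missed entirely: a false negative with or without types
n_boundary_errors = int((toy["mismatch_type"] == 0).sum())

# Share of typed false negatives attributable to type alone
pct_type = 100 * n_type_errors / (n_type_errors + n_boundary_errors)
print(f"{pct_type:.1f}% of false negatives are due to type")
```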

In [None]:
without_types_mismatches = pd.read_csv('../data/straying_off_topic_data/model_output/mismatch_analysis/PICKLE_down_to_GENIA_eval_without_bootstrap_without_types_MISMATCHES.csv')
without_types_mismatches['model_shorthand'] = without_types_mismatches['model'].str.split('_').str[-2]
without_types_mismatches['gold_ent_type'] = without_types_mismatches['gold_ent_type'].str.lower()
without_types_mismatches['pred_ent_type'] = without_types_mismatches['pred_ent_type'].str.lower()
without_types_mismatches.head()

In [None]:
neg_stats_with_type = {}
for model in ['genia', 'genia-lightweight']:
    stats = {}
    false_neg_type = without_types_mismatches[(without_types_mismatches['model_shorthand'] == model) & (without_types_mismatches['mismatch_type'] == 1) & (without_types_mismatches['gold_ent_type'] != without_types_mismatches['pred_ent_type'])]
    stats['num_false_neg_type'] = false_neg_type.shape[0]
    false_neg_boundary = without_types_mismatches[(without_types_mismatches['model_shorthand'] == model) & (without_types_mismatches['mismatch_type'] == 0)]
    stats['num_false_neg_boundary'] = false_neg_boundary.shape[0]
    false_pos = without_types_mismatches[(without_types_mismatches['model_shorthand'] == model) & (without_types_mismatches['mismatch_type'] == 2)]
    stats['num_false_pos'] = false_pos.shape[0]
    true_pos_with_type = without_types_mismatches[(without_types_mismatches['model_shorthand'] == model) & (without_types_mismatches['mismatch_type'] == 1) & (without_types_mismatches['gold_ent_type'] == without_types_mismatches['pred_ent_type'])]
    stats['num_true_pos_with_type'] = true_pos_with_type.shape[0]
    all_preds = without_types_mismatches[without_types_mismatches['model_shorthand'] == model]
    stats['total_preds'] = all_preds.shape[0]
    neg_stats_with_type[model] = stats
    
neg_stats_with_type

In [None]:
neg_stats_without_type = {}
for model in ['genia', 'genia-lightweight']:
    stats = {}
    false_neg_boundary = without_types_mismatches[(without_types_mismatches['model_shorthand'] == model) & (without_types_mismatches['mismatch_type'] == 0)]
    stats['num_false_neg_boundary'] = false_neg_boundary.shape[0]
    false_pos = without_types_mismatches[(without_types_mismatches['model_shorthand'] == model) & (without_types_mismatches['mismatch_type'] == 2)]
    stats['num_false_pos'] = false_pos.shape[0]
    true_pos = without_types_mismatches[(without_types_mismatches['model_shorthand'] == model) & (without_types_mismatches['mismatch_type'] == 1)]
    stats['num_true_pos'] = true_pos.shape[0]
    all_preds = without_types_mismatches[without_types_mismatches['model_shorthand'] == model]
    stats['total_preds'] = all_preds.shape[0]
    neg_stats_without_type[model] = stats
    
neg_stats_without_type

Common-sense check that the misidentification of types causes the discrepancy in performance:

In [None]:
# Without type
tp = neg_stats_without_type['genia']['num_true_pos']
fp = neg_stats_without_type['genia']['num_false_pos']
fn = neg_stats_without_type['genia']['num_false_neg_boundary']
2*(((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn))))

When we consider types, anything that had a correct boundary but an incorrect type, rather than being a true positive, is now both a false negative (because it's an element of the gold standard that doesn't match any prediction) and a false positive (because it's a prediction that doesn't match anything in the gold standard):
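
To make the bookkeeping concrete, here is a small sketch with invented counts. The helper name `f1_from_counts` and all of the numbers are hypothetical. The point is only that each type error leaves the true positives and is added to both the false positives and the false negatives.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts for an evaluation that ignores types
tp, fp, fn = 80, 10, 20
type_errors = 30  # span matches whose predicted type differs from gold (assumed)

untyped_f1 = f1_from_counts(tp, fp, fn)
# With types, each type error leaves the TPs and joins both the FPs and the FNs
typed_f1 = f1_from_counts(tp - type_errors, fp + type_errors, fn + type_errors)
print(round(untyped_f1, 3), round(typed_f1, 3))
```

Because every type error is penalized twice, even a moderate number of them can pull the typed F1 well below the untyped one.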

In [None]:
# With type
tp = neg_stats_with_type['genia']['num_true_pos_with_type']
fp = neg_stats_with_type['genia']['num_false_pos'] + neg_stats_with_type['genia']['num_false_neg_type']
fn = neg_stats_with_type['genia']['num_false_neg_boundary'] + neg_stats_with_type['genia']['num_false_neg_type']
2*(((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn))))

Well, that is certainly suspicious! Let's read in the actual mismatches dataframe with types and recalculate the F1 from there to make sure we didn't do anything wrong here.

In [None]:
with_types_mismatches = pd.read_csv('../data/straying_off_topic_data/model_output/mismatch_analysis/PICKLE_down_to_GENIA_eval_without_bootstrap_with_types_MISMATCHES.csv')
with_types_mismatches['model_shorthand'] = with_types_mismatches['model'].str.split('_').str[-2]
with_types_mismatches['gold_ent_type'] = with_types_mismatches['gold_ent_type'].str.lower()
with_types_mismatches['pred_ent_type'] = with_types_mismatches['pred_ent_type'].str.lower()
with_types_mismatches.head()

In [None]:
# With type
tp = with_types_mismatches[(with_types_mismatches['model_shorthand'] == 'genia') & (with_types_mismatches['mismatch_type'] == 1)].shape[0]
fp = with_types_mismatches[(with_types_mismatches['model_shorthand'] == 'genia') & (with_types_mismatches['mismatch_type'] == 2)].shape[0]
fn = with_types_mismatches[(with_types_mismatches['model_shorthand'] == 'genia') & (with_types_mismatches['mismatch_type'] == 0)].shape[0]
2*(((tp/(tp+fp))*(tp/(tp+fn)))/((tp/(tp+fp))+(tp/(tp+fn))))

This number is what we expected to see. So what else differs between the true positive and false negative categories when types are considered versus when they are ignored?

In [None]:
all_mismatches = with_types_mismatches.merge(without_types_mismatches, how='outer', on=['doc_key', 'sent_num', 'gold_ent_list', 'gold_ent_type', 'pred_ent_list', 'pred_ent_type', 'model', 'model_shorthand'], suffixes=['_with_type', '_without_type'])
print(all_mismatches[all_mismatches['model_shorthand'] == 'genia'].shape)
all_mismatches.head()

In [None]:
diff_mismatch_type_genia = all_mismatches[(all_mismatches['model_shorthand'] == 'genia') & (all_mismatches['mismatch_type_with_type'] != all_mismatches['mismatch_type_without_type'])]
diff_mismatch_type_genia
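
One possible way to summarize how rows move between categories is a cross-tabulation of the two `mismatch_type` columns. The sketch below uses an invented, simplified stand-in for the merged dataframe. Only the suffixed column names come from the merge above, and the `-1` sentinel for rows missing from one evaluation is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the merged frame; column names follow the suffixes
# used in the merge above. NaN marks a row absent from one evaluation.
toy_merged = pd.DataFrame({
    "mismatch_type_with_type": [1, 0, 2, 0, np.nan],
    "mismatch_type_without_type": [1, 1, 1, 0, 2],
})

# Fill NaN with a sentinel (-1, an arbitrary choice) so those rows are counted
transitions = pd.crosstab(
    toy_merged["mismatch_type_with_type"].fillna(-1).astype(int),
    toy_merged["mismatch_type_without_type"].fillna(-1).astype(int),
)
print(transitions)
```

Off-diagonal cells count rows whose category changes between the two evaluations, and the sentinel row or column counts rows that appear in only one of them.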