## CoNLL-2003 Example for Text Extensions for Pandas
### Part 2

### Introduction

In Part 1 of this demo, we showed how to use Text Extensions for Pandas to examine the 
overall result quality of one entrant in the CoNLL-2003 Shared Task, as well as how to
identify and examine the documents from the test set where that entry had the 
most errors.

In Part 2, we'll perform a broader analysis that covers all of the contest's entries
and come to some deeper and more surprising conclusions.

In [1]:
# INITIALIZATION BOILERPLATE

# Libraries
import os
import sys
import numpy as np
import pandas as pd
from typing import *

# And of course we need the text_extensions_for_pandas library itself.
PROJECT_ROOT = "../.."
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("corpus"):
        raise e
    if PROJECT_ROOT not in sys.path:
        sys.path.insert(0, PROJECT_ROOT)
    import text_extensions_for_pandas as tp

# Code shared among notebooks is kept in util.py, in this directory.
import util

In [2]:
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
#  to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info

{'train': 'outputs/eng.train',
 'dev': 'outputs/eng.testa',
 'test': 'outputs/eng.testb'}

In [3]:
# Load up the same gold standard data we used in Part 1.
gold_standard = tp.io.conll.conll_2003_to_dataframes(
    data_set_info["test"], ["pos", "phrase", "ent"], [False, True, True])
gold_standard = [
    df.drop(columns=["pos", "phrase_iob", "phrase_type"])
    for df in gold_standard
]

# Dictionary from (collection, offset within collection) to dataframe
gold_standard_spans = {("test", i): 
                       tp.io.conll.iob_to_spans(gold_standard[i]) 
                       for i in range(len(gold_standard))}

In [4]:
# Load up the results from all 16 teams at once.
teams = ["bender", "carrerasa", "carrerasb", "chieu", "curran",
         "demeulder", "florian", "hammerton", "hendrickx",
         "klein", "mayfield", "mccallum", "munro", "whitelaw",
         "wu", "zhang"]

# Read all the output files into one dataframe per <document, team> pair.
outputs = { 
    t: tp.io.conll.conll_2003_output_to_dataframes(
        gold_standard, f"{PROJECT_ROOT}/resources/conll_03/ner/results/{t}/eng.testb")
    for t in teams
}  # Type: Dict[str, List[pd.DataFrame]]

# As an example of what we just loaded, show the token metadata for the 
# "mayfield" team's model's output on document 3.
outputs["mayfield"][3]

Unnamed: 0,span,ent_iob,ent_type,sentence
0,"[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'"
1,"[11, 20): 'FREESTYLE'",O,,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
2,"[21, 33): 'SKIING-WORLD'",O,,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
3,"[34, 37): 'CUP'",B,MISC,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
4,"[38, 43): 'MOGUL'",O,,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
...,...,...,...,...
161,"[803, 809): 'Allais'",I,PER,"[791, 824): '10. Katleen Allais (France) 21.58'"
162,"[810, 811): '('",O,,"[791, 824): '10. Katleen Allais (France) 21.58'"
163,"[811, 817): 'France'",B,LOC,"[791, 824): '10. Katleen Allais (France) 21.58'"
164,"[817, 818): ')'",O,,"[791, 824): '10. Katleen Allais (France) 21.58'"


In [5]:
# Convert results from IOB2 tags to spans across all teams and documents
# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging) for details on IOB2 format.
output_spans = {
    t: {("test", i): tp.io.conll.iob_to_spans(outputs[t][i]) 
        for i in range(len(outputs[t]))}
    for t in teams
}  # Type: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]

# As an example, show the first 10 spans that the "florian" team's model
# found on document 2.
output_spans["florian"][("test", 2)].head(10)

Unnamed: 0,span,ent_type
0,"[35, 40): 'JAPAN'",LOC
1,"[50, 55): 'SYRIA'",LOC
2,"[57, 63): 'AL-AIN'",LOC
3,"[65, 85): 'United Arab Emirates'",LOC
4,"[144, 149): 'Japan'",LOC
5,"[169, 178): 'Asian Cup'",MISC
6,"[192, 197): 'Syria'",LOC
7,"[209, 222): 'Takuya Takagi'",PER
8,"[297, 308): 'Salem Bitar'",PER
9,"[403, 409): 'Syrian'",MISC
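
To make the IOB2-to-span conversion concrete, here is a minimal pure-Python sketch. It is an illustration only, not the implementation behind `tp.io.conll.iob_to_spans`. The function name `iob2_to_spans` is hypothetical, and the tag list is modeled on the "Katleen Allais ( France )" tokens from the "mayfield" output shown earlier (the tag for "Katleen" is assumed to be `B`/`PER`, since that row is elided in the output).

```python
from typing import List, Optional, Tuple


def iob2_to_spans(
    tags: List[Tuple[str, Optional[str]]]
) -> List[Tuple[int, int, str]]:
    """Hypothetical helper: turn per-token (iob, ent_type) pairs into
    (begin_token, end_token, ent_type) spans, with end_token exclusive."""
    spans = []
    start = None      # Token index where the currently open span began
    cur_type = None   # Entity type of the currently open span
    for i, (iob, ent_type) in enumerate(tags):
        # An "I" tag extends the open span only if the entity type matches.
        continues = iob == "I" and start is not None and ent_type == cur_type
        if not continues:
            if start is not None:
                spans.append((start, i, cur_type))
            if iob in ("B", "I"):
                # Assumption: a stray "I" with no open span starts a new span.
                start, cur_type = i, ent_type
            else:
                start, cur_type = None, None
    if start is not None:
        spans.append((start, len(tags), cur_type))
    return spans


# Assumed tags for the tokens: Katleen / Allais / ( / France / )
tags = [("B", "PER"), ("I", "PER"), ("O", None), ("B", "LOC"), ("O", None)]
spans = iob2_to_spans(tags)
print(spans)
```

The library version additionally maps token ranges back to character offsets in the document text, which is why its spans display as `[begin, end): 'covered text'`.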


In [6]:
# Use Pandas merge to find what spans match up exactly for each team's
# results.
# Unlike Part 1, we perform the join across all entity types, looking for
# matches of both the extracted span *and* the entity type label.
#
# Text Extensions for Pandas includes a utility function 
# compute_accuracy_by_document() that makes this collection-level computation
# simpler.
stats = {
    t: tp.io.conll.compute_accuracy_by_document(
    gold_standard_spans, output_spans[t]) 
    for t in teams
}

# Show the result quality statistics by document for the "carrerasa" team
stats["carrerasa"]

Unnamed: 0,fold,doc_num,num_true_positives,num_extracted,num_entities,precision,recall,F1
0,test,0,42,48,45,0.875000,0.933333,0.903226
1,test,1,43,44,44,0.977273,0.977273,0.977273
2,test,2,51,52,54,0.980769,0.944444,0.962264
3,test,3,43,44,44,0.977273,0.977273,0.977273
4,test,4,14,19,19,0.736842,0.736842,0.736842
...,...,...,...,...,...,...,...,...
226,test,226,7,7,7,1.000000,1.000000,1.000000
227,test,227,19,22,21,0.863636,0.904762,0.883721
228,test,228,22,27,27,0.814815,0.814815,0.814815
229,test,229,26,27,27,0.962963,0.962963,0.962963
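
The `precision`, `recall`, and `F1` columns follow directly from the three count columns. The sketch below recomputes them for document 0 of the "carrerasa" results using the standard definitions. The function name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is an assumption, not necessarily what `compute_accuracy_by_document()` does.

```python
from typing import Tuple


def precision_recall_f1(
    num_true_positives: int, num_extracted: int, num_entities: int
) -> Tuple[float, float, float]:
    """Hypothetical helper: standard precision/recall/F1 from raw counts."""
    # Precision: fraction of extracted spans that match the gold standard.
    precision = num_true_positives / num_extracted if num_extracted else 0.0
    # Recall: fraction of gold standard spans that the model extracted.
    recall = num_true_positives / num_entities if num_entities else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts from row 0 of the table above: 42 true positives,
# 48 extracted spans, 45 gold standard entities.
p, r, f1 = precision_recall_f1(42, 48, 45)
print(round(p, 6), round(r, 6), round(f1, 6))  # 0.875 0.933333 0.903226
```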


In [7]:
# F1 for document 4 is looking a bit low. Is that just a fluke, or is it
# part of a larger trend? 
# In Part 1, we showed how to drill down to and examine "problem" documents.
# Since we have all this additional data, let's try a broader, more 
# quantitative approach. We'll start by building up some more fine-grained 
# data about congruence between the gold standard and the model outputs.
# Pandas' outer join will tell us what entities showed up just in the gold
# standard, just in the model output, or in both sets.
# For starters, let's do this just for the "carrerasa" team and document 4.
doc_id = ("test", 4)
(gold_standard_spans[doc_id]
    .merge(output_spans["carrerasa"][doc_id], how="outer", indicator=True)
    .sort_values("span"))

Unnamed: 0,span,ent_type,_merge
0,"[19, 28): 'ASIAN CUP'",MISC,both
1,"[46, 52): 'AL-AIN'",LOC,both
2,"[54, 74): 'United Arab Emirates'",LOC,both
3,"[97, 106): 'Asian Cup'",MISC,both
4,"[141, 146): 'Japan'",LOC,left_only
19,"[141, 146): 'Japan'",ORG,right_only
5,"[149, 154): 'Syria'",LOC,left_only
20,"[149, 154): 'Syria'",ORG,right_only
6,"[181, 186): 'Japan'",LOC,both
7,"[188, 200): 'Hassan Abbas'",PER,both
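
The outer join with `indicator=True` can be seen in isolation on a toy example. The sketch below uses plain strings in place of span objects, and the two small dataframes are made up for illustration (loosely modeled on the rows above). With no `on` argument, `merge` joins on all shared columns, so a span only counts as a match when both the text and the entity type agree.

```python
import pandas as pd

# Hypothetical gold standard and model output for one tiny document.
gold = pd.DataFrame({
    "span": ["Japan", "Syria", "Hassan Abbas"],
    "ent_type": ["LOC", "LOC", "PER"],
})
model = pd.DataFrame({
    "span": ["Japan", "Syria", "Hassan Abbas"],
    "ent_type": ["ORG", "LOC", "PER"],
})

# Outer join on the shared columns ("span" and "ent_type"). The "_merge"
# column records whether each row came from the left input (gold), the
# right input (model), or both.
merged = gold.merge(model, how="outer", indicator=True)

# Count how many rows fall into each category.
summary = merged["_merge"].astype(str).value_counts().to_dict()
print(summary)
```

Here "Japan" appears twice in the merged result: once as `left_only` (the gold `LOC` label the model missed) and once as `right_only` (the model's `ORG` label that is not in the gold standard). A single mislabeled span therefore costs both recall and precision.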


In [8]:
# Repeat the analysis from the previous cell across all teams and documents.
# That is, perform an outer join between the gold standard spans dataframe
# for each document and the corresponding dataframe from each team.
def merge_span_sets(team: str,
                    gold_results: Dict[Tuple[str, int], pd.DataFrame],
                    results_by_team: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]):
    result = {}  # Type: Dict[Tuple[str, int], pd.DataFrame]
    for k in gold_results.keys():
        merged = gold_results[k].merge(results_by_team[team][k],
                                       how="outer", indicator=True)
        merged["gold"] = merged["_merge"].isin(("both", "left_only"))
        merged[team] = merged["_merge"].isin(("both", "right_only"))
        result[k] = merged[["span", "ent_type", "gold", team]]
    return result

span_flags = {t: merge_span_sets(t, gold_standard_spans, output_spans)
              for t in teams}  # Type: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]

In [9]:
# Now we have indicator variables for every extracted span, telling whether
# it was in the gold standard data set and/or in a given team's results.
# For example, here are the first 5 spans for document 2 in the "carrerasa"
# team's results:
doc_id = ("test", 2)
span_flags["carrerasa"][doc_id].head(5)

Unnamed: 0,span,ent_type,gold,carrerasa
0,"[35, 40): 'JAPAN'",LOC,True,True
1,"[50, 55): 'SYRIA'",LOC,True,True
2,"[57, 63): 'AL-AIN'",LOC,True,True
3,"[65, 85): 'United Arab Emirates'",LOC,True,True
4,"[144, 149): 'Japan'",LOC,True,True


In [10]:
# For each document, do an n-way merge of all those indicator variables across teams.
# This operation produces a single summary dataframe per document.
indicators = {}  # Type: Dict[Tuple[str, int], pd.DataFrame]
for k in gold_standard_spans.keys():
    result = gold_standard_spans[k]
    for t in teams:
        result = result.merge(span_flags[t][k], how="outer")
    indicators[k] = result.fillna(False)
    
# Now we have a vector of indicator variables for every span extracted 
# from every document across all the model outputs and the gold standard.
# For example, let's show the results for document 10:
doc_10 = ("test", 10)
indicators[doc_10]

Unnamed: 0,span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang
0,"[11, 22): 'RUGBY UNION'",ORG,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1,"[24, 30): 'LITTLE'",PER,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False,False,True
2,"[39, 46): 'CAMPESE'",PER,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False
3,"[57, 70): 'Robert Kitson'",PER,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True,True
4,"[71, 77): 'LONDON'",LOC,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,"[588, 601): 'European tour'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
150,"[960, 967): 'Campese'",LOC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
151,"[39, 46): 'CAMPESE'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
152,"[1332, 1342): 'Twickenham'",PER,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False


In [11]:
# If you look at the above dataframe, you can see that some entities 
# ("RUGBY UNION", for example) are "easy", in that almost every entry
# found them correctly. Other entities, like "CAMPESE", are "harder",
# in that few of the entrants correctly identified them. Let's add
# a column that quantifies this "difficulty level" by counting how 
# many teams found each true or false positive.
for df in indicators.values():
    # Convert the teams' indicator columns into a single matrix of 
    # Boolean values, and sum the number of True values in each row.
    vectors = df[teams].values
    counts = np.count_nonzero(vectors, axis=1)
    df["num_teams"] = counts

# Show the dataframe for document 10 again, this time with the new
# "num_teams" column at the far right.
indicators[doc_10]

Unnamed: 0,span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang,num_teams
0,"[11, 22): 'RUGBY UNION'",ORG,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
1,"[24, 30): 'LITTLE'",PER,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False,False,True,6
2,"[39, 46): 'CAMPESE'",PER,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,2
3,"[57, 70): 'Robert Kitson'",PER,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True,True,15
4,"[71, 77): 'LONDON'",LOC,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,"[588, 601): 'European tour'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,1
150,"[960, 967): 'Campese'",LOC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,1
151,"[39, 46): 'CAMPESE'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,1
152,"[1332, 1342): 'Twickenham'",PER,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,1
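
As a small standalone check of the counting step, the sketch below builds a three-team toy dataframe (the team names and rows are hypothetical) and counts the `True` flags in each row. Selecting the team columns by name rather than by position is a design choice made here for illustration: it keeps the count unaffected if the code is rerun after `num_teams` has already been added.

```python
import numpy as np
import pandas as pd

toy_teams = ["team_a", "team_b", "team_c"]  # Hypothetical team names
toy = pd.DataFrame({
    "span": ["RUGBY UNION", "CAMPESE", "Twickenham"],
    "gold": [True, True, False],
    "team_a": [True, False, False],
    "team_b": [True, True, False],
    "team_c": [True, False, True],
})

# One row per span, one column per team; count the True values per row.
toy["num_teams"] = np.count_nonzero(toy[toy_teams].values, axis=1)
print(toy[["span", "gold", "num_teams"]])
```

For gold standard rows, a low `num_teams` marks an entity that was hard to find; for rows not in the gold standard, a high `num_teams` marks a false positive that many models agreed on.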


In [12]:
# Now we can rank the entities in document 10 by "difficulty", either as 
# true positives for the models to find...
# (just for document 10 for the moment)
ind = indicators[doc_10].copy()
ind[ind["gold"] == True].sort_values("num_teams").head(10)

Unnamed: 0,span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang,num_teams
2,"[39, 46): 'CAMPESE'",PER,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,2
21,"[1018, 1028): 'Barbarians'",ORG,True,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,2
38,"[1687, 1696): 'All Black'",ORG,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,3
10,"[333, 345): 'Queenslander'",MISC,True,False,False,False,True,False,True,True,False,False,False,False,False,False,False,False,True,4
34,"[1535, 1545): 'Barbarians'",ORG,True,False,True,False,False,False,False,False,False,True,True,False,True,True,False,False,False,5
7,"[163, 173): 'Barbarians'",ORG,True,True,False,False,False,False,True,False,False,True,False,False,True,True,False,False,False,5
28,"[1332, 1342): 'Twickenham'",LOC,True,True,False,False,False,True,False,False,False,True,True,True,False,False,False,False,False,5
1,"[24, 30): 'LITTLE'",PER,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False,False,True,6
41,"[1740, 1750): 'Barbarians'",ORG,True,False,True,False,False,False,False,True,False,False,False,True,True,True,True,False,False,6
19,"[759, 768): 'Wallabies'",ORG,True,False,True,False,True,True,False,True,False,True,False,False,True,True,False,True,False,8


In [13]:
# ...or as false positives to avoid:
ind[ind["gold"] == False].sort_values("num_teams", ascending=False).head(10)

Unnamed: 0,span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang,num_teams
90,"[1018, 1028): 'Barbarians'",MISC,False,True,True,True,True,True,False,True,False,True,False,True,False,False,False,True,True,10
91,"[1535, 1545): 'Barbarians'",MISC,False,True,False,True,True,True,False,True,False,False,False,True,False,False,False,True,True,8
94,"[163, 173): 'Barbarians'",MISC,False,False,True,False,True,True,False,True,False,False,True,True,False,False,False,True,True,8
104,"[24, 30): 'LITTLE'",LOC,False,False,False,False,False,True,False,False,True,True,False,True,False,True,False,False,False,5
98,"[2013, 2023): 'Pontypridd'",ORG,False,False,True,True,False,False,False,False,False,True,True,False,True,False,False,False,False,5
95,"[333, 360): 'Queenslander Daniel Herbert'",PER,False,False,True,False,False,False,False,False,False,False,False,True,False,True,False,True,False,4
96,"[1332, 1342): 'Twickenham'",MISC,False,False,True,False,False,False,True,True,False,False,False,False,False,False,False,False,True,4
101,"[1332, 1342): 'Twickenham'",ORG,False,False,False,False,True,False,False,False,False,False,False,False,True,True,True,False,False,4
102,"[1687, 1696): 'All Black'",MISC,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False,False,True,4
103,"[1740, 1750): 'Barbarians'",MISC,False,False,False,False,True,True,False,False,False,False,True,False,False,False,False,False,True,4


In [14]:
# To get a better picture of what entities are "difficult", we need to look 
# across the entire test set. Let's combine the dataframes in 
# `indicators` into a single dataframe that covers all the documents.

# First we preprocess each dataframe to make it easier to combine.
to_stack = [
    pd.DataFrame({
        "fold": k[0],  # Keys are (collection, offset) tuples
        "doc_offset": k[1],
        # TokenSpanArrays from different documents can't currently be stacked,
        # so convert to TokenSpan objects.
        "span" : indicators[k]["span"].astype(object),
        "ent_type": indicators[k]["ent_type"],
        "gold": indicators[k]["gold"],
        "num_teams": indicators[k]["num_teams"]
    })
    for k in indicators.keys()
]

# Then we concatenate all the preprocessed dataframes into a single dataframe.
counts = pd.concat(to_stack)
counts

Unnamed: 0,fold,doc_offset,span,ent_type,gold,num_teams
0,test,0,"[19, 24): 'JAPAN'",LOC,True,12
1,test,0,"[40, 45): 'CHINA'",PER,True,0
2,test,0,"[66, 77): 'Nadim Ladki'",PER,True,15
3,test,0,"[78, 84): 'AL-AIN'",LOC,True,12
4,test,0,"[86, 106): 'United Arab Emirates'",LOC,True,15
...,...,...,...,...,...,...
50,test,230,"[19, 29): 'ENGLISHMAN'",LOC,False,1
51,test,230,"[427, 435): 'Charlton'",LOC,False,3
52,test,230,"[1076, 1097): 'European championship'",MISC,False,1
53,test,230,"[1346, 1363): 'World Cup winning'",MISC,False,1
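
The stacking step can be sketched on toy data as follows. The dictionary `per_doc` and its contents are hypothetical, and plain strings stand in for spans. One difference from the cell above is worth noting: this sketch passes `ignore_index=True`, which gives the combined dataframe a fresh unique index, whereas the cell above keeps each document's original row labels (which is why the tail of its output shows index values such as 50 to 53 rather than a running count).

```python
import pandas as pd

# Hypothetical per-document dataframes keyed by (fold, offset within fold).
per_doc = {
    ("test", 0): pd.DataFrame({"span": ["JAPAN", "CHINA"],
                               "num_teams": [12, 0]}),
    ("test", 1): pd.DataFrame({"span": ["SYRIA"],
                               "num_teams": [15]}),
}

# Tag every row with the document it came from, then stack the pieces.
stacked = pd.concat(
    [df.assign(fold=k[0], doc_offset=k[1]) for k, df in per_doc.items()],
    ignore_index=True,
)
print(stacked)
```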


In [15]:
# Now we can pull out the most difficult entities across the entire test
# set.
# First, let's find the most difficult entities from the standpoint of recall:
# entities that are in the gold standard, but not in most results.
difficult_recall = counts[counts["gold"] == True].sort_values("num_teams").reset_index(drop=True)
difficult_recall.head(10)

Unnamed: 0,fold,doc_offset,span,ent_type,gold,num_teams
0,test,216,"[20, 36): 'SHEFFIELD SHIELD'",MISC,True,0
1,test,54,"[3231, 3241): 'Full Light'",MISC,True,0
2,test,177,"[11, 19): 'Honda RV'",MISC,True,0
3,test,31,"[529, 542): '1. FC Cologne'",ORG,True,0
4,test,54,"[1717, 1723): 'Okocim'",ORG,True,0
5,test,149,"[1504, 1520): 'Consumer Project'",PER,True,0
6,test,90,"[1129, 1140): 'Warsaw Pact'",MISC,True,0
7,test,92,"[534, 568): 'Movement for a Democratic Slovakia'",ORG,True,0
8,test,216,"[308, 316): 'Victoria'",ORG,True,0
9,test,216,"[179, 187): 'Victoria'",ORG,True,0


In [16]:
# Hmm, everything is zero. How many entities were found by zero teams?  One team?
(counts[counts["gold"] == True][["num_teams", "span"]]
 .groupby("num_teams").count()
 .rename(columns={"span": "count"}))

num_teams,count
0,140
1,73
2,88
3,73
4,99
5,80
6,85
7,89
8,125
9,124
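
The same histogram pattern can be tried on a toy dataframe. The rows below are made up for illustration, with plain strings in place of spans. The sketch also shows an equivalent formulation with `value_counts()`, which some readers may find easier to read than the `groupby`/`count`/`rename` chain.

```python
import pandas as pd

# Hypothetical stacked counts: four gold standard spans, one false positive.
toy_counts = pd.DataFrame({
    "span": ["SHEFFIELD SHIELD", "Full Light", "Okocim", "LONDON", "BRITISH"],
    "gold": [True, True, True, True, False],
    "num_teams": [0, 0, 2, 16, 16],
})

# Keep only gold standard rows, then count rows per num_teams value.
in_gold = toy_counts[toy_counts["gold"]]
hist = (in_gold[["num_teams", "span"]]
        .groupby("num_teams").count()
        .rename(columns={"span": "count"}))
print(hist)

# Equivalent counts via value_counts(), sorted by num_teams.
alt = in_gold["num_teams"].value_counts().sort_index()
```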


In [17]:
# Yikes! 140 entities in the test set were so hard to find, they
# were extracted by 0 teams.
# Let's go back and look at some of those 0-team entities in context:
difficult_recall["context"] = difficult_recall["span"].apply(lambda t: t.context())
pd.set_option("display.max_colwidth", 100)
difficult_recall.head(20)

Unnamed: 0,fold,doc_offset,span,ent_type,gold,num_teams,context
0,test,216,"[20, 36): 'SHEFFIELD SHIELD'",MISC,True,0,"[SHEFFIELD SHIELD] SCORE.\nHOBART, Australia 1996-12-07\nClo..."
1,test,54,"[3231, 3241): 'Full Light'",MISC,True,0,"...centrating on its leading brand, Zywiec [Full Light], which accounts for 85 percent of sales..."
2,test,177,"[11, 19): 'Honda RV'",MISC,True,0,[Honda RV] exceeds sales target.\nTOKYO 1996-12-06\n...
3,test,31,"[529, 542): '1. FC Cologne'",ORG,True,0,...5 30 20 28\nVfL Bochum 16 7 6 3 23 21 27\n[1. FC Cologne] 16 8 2 6 31 27 26\nSchalke 04 17 7 ...
4,test,54,"[1717, 1723): 'Okocim'",ORG,True,0,... while Carlsberg has the same amount in [Okocim].\nEarlier this year South African Brewer...
5,test,149,"[1504, 1520): 'Consumer Project'",PER,True,0,"...r lobbyist heading the Washington-based [Consumer Project] on Technology.\n"" None of the trea..."
6,test,90,"[1129, 1140): 'Warsaw Pact'",MISC,True,0,"...which used to be part of the Soviet-led [Warsaw Pact], saying such moves would threaten its s..."
7,test,92,"[534, 568): 'Movement for a Democratic Slovakia'",ORG,True,0,"...Prime Minister Vladimir Meciar's ruling [Movement for a Democratic Slovakia], was stripped of..."
8,test,216,"[308, 316): 'Victoria'",ORG,True,0,"... 119, David Boon 118, Shaun Young 113); [Victoria] 220 for three (Dean Jones 130 not out)."
9,test,216,"[179, 187): 'Victoria'",ORG,True,0,...ield cricket match between Tasmania and [Victoria] at Bellerive Oval on Saturday:\nTasmania...


**Some of these entities are "difficult" because the test set contains incorrect labels.**

For reference, there's a copy of the CoNLL labeling rules in this repository at
[resources/conll_03/ner/annotation.txt](../resources/conll_03/ner/annotation.txt).

There are 4 incorrect labels in this first set of 20:
* `[3289, 3299): 'Full Light'` should be "Zywiec Full Light"
* `[11, 19): 'Honda RV'` should be tagged `ORG`
* `[1525, 1541): 'Consumer Project'` should be "Consumer Project on Technology" and should be tagged `ORG`
* `[244, 255): 'McDonald 's'` should be tagged `MISC` (because it's an "adjective ... derived from a word which is ... organisation")

In [18]:
# Let's look at the entities that are difficult from the perspective of 
# precision: that is, in many models' results, but not in the gold standard.
difficult_precision = counts[counts["gold"] == False].sort_values("num_teams", ascending=False).reset_index(drop=True)

# Again, we can add some context to these spans:
difficult_precision["context"] = difficult_precision["span"].apply(lambda t: t.context())
difficult_precision.head(20)

Unnamed: 0,fold,doc_offset,span,ent_type,gold,num_teams,context
0,test,202,"[24, 31): 'BRITISH'",MISC,False,16,[BRITISH] RESULTS.\nLONDON 1996-12-07\nResults of B...
1,test,207,"[1304, 1314): 'Portsmouth'",ORG,False,16,...2 26\nManchester City 22 8 2 12 26 35 26\n[Portsmouth] 22 7 5 10 25 29 26\nReading 22 7 5 10 ...
2,test,199,"[108, 116): 'Scottish'",MISC,False,16,...W 1996-12-07\nLeading goalscorers in the\n[Scottish] premier division after Saturday's match...
3,test,216,"[166, 174): 'Tasmania'",LOC,False,16,... Sheffield Shield cricket match between [Tasmania] and Victoria at Bellerive Oval on Satur...
4,test,40,"[144, 161): 'Santiago Bernabeu'",LOC,False,16,...ll breathalyse fans at the gates of the [Santiago Bernabeu] stadium and ban drunk supporters ...
5,test,223,"[231, 243): 'Philadelphia'",ORG,False,16,...rgh 5 WASHINGTON 3\nMontreal 3 CHICAGO 1\n[Philadelphia] 6 DALLAS 3\nSt Louis 4 COLORADO 3\nE...
6,test,216,"[308, 316): 'Victoria'",LOC,False,16,"... 119, David Boon 118, Shaun Young 113); [Victoria] 220 for three (Dean Jones 130 not out)."
7,test,36,"[349, 358): 'Karlsruhe'",ORG,False,16,"...w 8th).\nHalftime 0-1.\nAttendance 33,000\n[Karlsruhe] 3 (Reich 29th, Carl 44th, Dundee 69th)..."
8,test,100,"[987, 995): 'Congress'",ORG,False,16,...n Congress would ratify the treaty with [Congress] quickly.\n' ' The reactions from busines...
9,test,36,"[398, 406): 'Freiburg'",ORG,False,16,"... 3 (Reich 29th, Carl 44th, Dundee 69th) [Freiburg] 0.\nHalftime 2-0.\nAttendance 33,000\nScha..."


**As with the entities in `difficult_recall`, some of these entities in `difficult_precision` are "difficult" because the test set has missing and incorrect labels.**

**13** of these first 20 "incorrect" results are due to missing and incorrect labels:
* `[25, 32): 'BRITISH'` in document 202 should be tagged `MISC`.
* `[1317, 1327): 'Portsmouth'` in document 207 should be tagged `ORG`, not `LOC`.
* `[110, 118): 'Scottish'` in document 199 should be tagged `MISC`
  (or `[28, 53): 'SCOTTISH PREMIER DIVISION'` and 
  `[110, 135): 'Scottish premier division'` should both be tagged `ORG`).
* `[146, 163): 'Santiago Bernabeu'` in document 40 should be tagged `MISC`
  (because the "s" in `[146, 171): 'Santiago Bernabeu stadium'` is not capitalized).
* `[239, 251): 'Philadelphia'` in document 223 should be tagged `ORG`, not `LOC`.
* `[367, 376): 'Karlsruhe'` in document 36 should be tagged `ORG`, not `LOC`.
* `[1003, 1011): 'Congress'` in document 100 should be tagged `ORG`
  (also, `[957, 964): 'Chilean' ==> MISC` should be replaced with 
  `[957, 973): 'Chilean Congress' ==> ORG`).
* `[420, 428): 'Freiburg'` in document 36 should be tagged `ORG`, not `LOC`.
* In document 70, `[186, 211): 'New York Commodities Desk'`, not `[186, 206): 'New York Commodities'`, should be tagged `ORG`.
* `[263, 271): 'St Louis'` in document 223 should be tagged `ORG`, not `LOC`.
* `[788, 795): 'Antwerp'` in document 155 should be tagged `LOC`, not `ORG`.
* In document 112, `[178, 191): 'John Mills Jr'`, not `[178, 188): 'John Mills'`, should be tagged `PER`.
* `[274, 282): 'COLORADO'` in document 223 should be tagged `ORG`.


In [19]:
# Here's the gold standard data for document 155, for example.
# Note line 12.
doc_id = ("test", 155)
gold_standard_spans[doc_id][0:60]

Unnamed: 0,span,ent_type
0,"[11, 18): 'Belgian'",MISC
1,"[64, 72): 'BRUSSELS'",LOC
2,"[170, 175): 'Spain'",LOC
3,"[230, 237): 'Belgian'",MISC
4,"[348, 355): 'Belgian'",MISC
5,"[423, 430): 'Antwerp'",ORG
6,"[459, 466): 'Belgian'",MISC
7,"[537, 546): 'Barcelona'",LOC
8,"[605, 612): 'Turkish'",MISC
9,"[712, 719): 'Belgium'",LOC


In [20]:
# The above gold standard spans in context. 
gold_standard_spans[doc_id]["span"].values

Unnamed: 0,begin,end,begin_token,end_token,covered_text
0,11,18,1,2,Belgian
1,64,72,11,12,BRUSSELS
2,170,175,27,28,Spain
3,230,237,40,41,Belgian
4,348,355,62,63,Belgian
5,423,430,76,77,Antwerp
6,459,466,83,84,Belgian
7,537,546,101,102,Barcelona
8,605,612,111,112,Turkish
9,712,719,133,134,Belgium


In [21]:
# Repeat the steps from the previous cells using the dev set.
# This takes a while.
dev_gold_standard = tp.io.conll.conll_2003_to_dataframes(
    data_set_info["dev"], ["pos", "phrase", "ent"], [False, True, True])
dev_gold_standard = [
    df.drop(columns=["pos", "phrase_iob", "phrase_type"])
    for df in dev_gold_standard
]

dev_gold_standard_spans = {
    ("dev", i): tp.io.conll.iob_to_spans(dev_gold_standard[i]) 
    for i in range(len(dev_gold_standard))}

dev_outputs = { 
    t: tp.io.conll.conll_2003_output_to_dataframes(
        dev_gold_standard, f"{PROJECT_ROOT}/resources/conll_03/ner/results/{t}/eng.testa")
    for t in teams
}  # Type: Dict[str, List[pd.DataFrame]]

dev_output_spans = {
    t: {("dev", i): tp.io.conll.iob_to_spans(dev_outputs[t][i]) 
        for i in range(len(dev_outputs[t]))}
    for t in teams
}  # Type: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]

dev_span_flags = {t: merge_span_sets(t, dev_gold_standard_spans,
                                      dev_output_spans) 
                   for t in teams}  # Type: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]

dev_indicators = {}  # Type: Dict[Tuple[str, int], pd.DataFrame]
for k in dev_gold_standard_spans.keys():
    result = dev_gold_standard_spans[k]
    for t in teams:
        result = result.merge(dev_span_flags[t][k], how="outer")
    dev_indicators[k] = result.fillna(False)
    
for df in dev_indicators.values():
    # Convert the teams' indicator columns into a single matrix of 
    # Boolean values, and sum the number of True values in each row.
    vectors = df[teams].values
    nonzero_counts = np.count_nonzero(vectors, axis=1)
    df["num_teams"] = nonzero_counts
    
dev_counts = pd.concat([
    pd.DataFrame({
        "fold": k[0],  # Keys are (collection, offset) tuples
        "doc_offset": k[1],
        # TokenSpanArrays from different documents can't currently be stacked,
        # so convert to TokenSpan objects.
        "span" : dev_indicators[k]["span"].astype(object),
        "ent_type": dev_indicators[k]["ent_type"],
        "gold": dev_indicators[k]["gold"],
        "num_teams": dev_indicators[k]["num_teams"]
    })
    for k in dev_indicators.keys()
])

In [22]:
# How many teams found entities from the dev set that are in the gold standard?
(dev_counts[dev_counts["gold"] == True][["num_teams", "span"]]
 .groupby("num_teams").count()
 .rename(columns={"span": "count"}))

num_teams,count
0,72
1,55
2,43
3,68
4,50
5,61
6,55
7,62
8,71
9,64


In [23]:
# How many teams found entities from the dev set that aren't in the gold standard?
(dev_counts[dev_counts["gold"] == False][["num_teams", "span"]]
 .groupby("num_teams").count()
 .rename(columns={"span": "count"}))

num_teams,count
1,2740
2,704
3,324
4,214
5,134
6,77
7,68
8,36
9,43
10,30


In [24]:
# Merge the results from the two folds
all_counts = pd.concat([counts, dev_counts])
all_counts.head()

Unnamed: 0,fold,doc_offset,span,ent_type,gold,num_teams
0,test,0,"[19, 24): 'JAPAN'",LOC,True,12
1,test,0,"[40, 45): 'CHINA'",PER,True,0
2,test,0,"[66, 77): 'Nadim Ladki'",PER,True,15
3,test,0,"[78, 84): 'AL-AIN'",LOC,True,12
4,test,0,"[86, 106): 'United Arab Emirates'",LOC,True,15


In [25]:
all_counts.tail()

Unnamed: 0,fold,doc_offset,span,ent_type,gold,num_teams
25,dev,215,"[673, 678): 'Atlas'",ORG,False,4
26,dev,215,"[679, 689): 'Bangladesh'",LOC,False,3
27,dev,215,"[983, 991): 'Newsroom'",ORG,False,1
28,dev,215,"[463, 467): 'Alam'",PER,False,1
29,dev,215,"[291, 304): 'Moslem Friday'",PER,False,1


In [26]:
# Prepare the results for writing to CSV files: one dataframe for spans that
# are in the gold standard and one for spans that are not.
in_gold_to_write, not_in_gold_to_write = util.csv_prep(all_counts, "num_teams")
in_gold_to_write

Unnamed: 0,num_teams,fold,doc_offset,corpus_span,corpus_ent_type,error_type,correct_span,correct_ent_type,notes,time_started,time_stopped,time_elapsed
0,0,dev,2,"[25, 30): 'ASHES'",MISC,,,,,,,
1,0,dev,15,"[15, 40): 'AMERICAN FOOTBALL-RANDALL'",MISC,,,,,,,
2,0,dev,20,"[90, 96): 'Berlin'",MISC,,,,,,,
7,0,dev,22,"[213, 244): 'Solidarity Meeting for Sarajevo'",MISC,,,,,,,
17,0,dev,22,"[826, 847): 'IAAF Grand Prix Final'",MISC,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
19,16,test,230,"[1031, 1040): 'World Cup'",MISC,,,,,,,
21,16,test,230,"[1108, 1115): 'Germany'",LOC,,,,,,,
22,16,test,230,"[1127, 1132): 'Irish'",MISC,,,,,,,
23,16,test,230,"[1153, 1160): 'England'",LOC,,,,,,,


In [27]:
not_in_gold_to_write

Unnamed: 0,num_teams,fold,doc_offset,model_span,model_ent_type,error_type,corpus_span,corpus_ent_type,correct_span,correct_ent_type,notes,time_started,time_stopped,time_elapsed
310,16,dev,20,"[90, 96): 'Berlin'",LOC,,,,,,,,,
22,16,dev,22,"[236, 244): 'Sarajevo'",LOC,,,,,,,,,
74,16,dev,157,"[132, 141): 'World Cup'",MISC,,,,,,,,,
32,16,dev,187,"[374, 379): 'China'",LOC,,,,,,,,,
71,16,dev,206,"[2399, 2406): 'Marines'",ORG,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49,1,test,230,"[521, 540): 'Republic of Ireland'",ORG,,,,,,,,,
50,1,test,230,"[19, 29): 'ENGLISHMAN'",LOC,,,,,,,,,
52,1,test,230,"[1076, 1097): 'European championship'",MISC,,,,,,,,,
53,1,test,230,"[1346, 1363): 'World Cup winning'",MISC,,,,,,,,,


In [28]:
# Write output files.
in_gold_to_write.to_csv("outputs/CoNLL_2_in_gold.csv", index=False)
not_in_gold_to_write.to_csv("outputs/CoNLL_2_not_in_gold.csv", index=False)