## CoNLL-2003 Example for Text Extensions for Pandas
### Part 2

To run this notebook, you will need to obtain a copy of the CoNLL-2003 data set's corpus.
Drop the corpus's files into the following locations:
* conll_03/eng.testa
* conll_03/eng.testb
* conll_03/eng.train

If you are unfamiliar with the basics of Text Extensions for Pandas, we recommend you 
start with Part 1 of this example.

### Introduction

In Part 1 of this demo, we showed how to use Text Extensions for Pandas to examine the 
overall result quality of one entrant in the CoNLL-2003 Shared Task, as well as how to
identify and examine the documents from the validation set where that entry had the 
most errors.

In Part 2, we'll perform a broader analysis that goes across all the contents entries
and come to some deeper and more surprising conclusions.

In [1]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if (sys.path[0] != ".."):
    sys.path[0] = ".."
    
# Libraries
import numpy as np
import pandas as pd
from typing import *

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

# Code shared among notebooks is kept in util.py, in this directory.
import util

In [2]:
# Load up the same gold standard data we used in Part 1.
gold_standard = tp.conll_2003_to_dataframes("../conll_03/eng.testa")

# Dictionary from (collection, offset within collection) to dataframe
gold_standard_spans = {("validation", i): 
                       tp.iob_to_spans(gold_standard[i]) 
                       for i in range(len(gold_standard))}

In [3]:
# Load up the results from all 16 teams at once.
teams = ["bender", "carrerasa", "carrerasb", "chieu", "curran",
         "demeulder", "florian", "hammerton", "hendrickx",
         "klein", "mayfield", "mccallum", "munro", "whitelaw",
         "wu", "zhang"]

# Read all the output files into one dataframe per <document, team> pair.
outputs = { 
    t: tp.conll_2003_output_to_dataframes(
        gold_standard, f"../resources/conll_03/ner/results/{t}/eng.testb")
    for t in teams
}  # Type: Dict[str, List[pd.DataFrame]]

# As an example of what we just loaded, show the token metadata for the 
# "mayfield" team's model's output on document 3.
outputs["mayfield"][3]

Unnamed: 0,char_span,token_span,ent_iob,ent_type,sentence
0,"[0, 10): '-DOCSTART-'","[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'"
1,"[11, 20): 'FREESTYLE'","[11, 20): 'FREESTYLE'",O,,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
2,"[21, 33): 'SKIING-WORLD'","[21, 33): 'SKIING-WORLD'",O,,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
3,"[34, 37): 'CUP'","[34, 37): 'CUP'",B,MISC,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
4,"[38, 43): 'MOGUL'","[38, 43): 'MOGUL'",O,,"[11, 52): 'FREESTYLE SKIING-WORLD CUP MOGUL RE..."
...,...,...,...,...,...
161,"[803, 809): 'Allais'","[803, 809): 'Allais'",I,PER,"[791, 824): '10. Katleen Allais( France) 21.58'"
162,"[809, 810): '('","[809, 810): '('",O,,"[791, 824): '10. Katleen Allais( France) 21.58'"
163,"[811, 817): 'France'","[811, 817): 'France'",B,LOC,"[791, 824): '10. Katleen Allais( France) 21.58'"
164,"[817, 818): ')'","[817, 818): ')'",O,,"[791, 824): '10. Katleen Allais( France) 21.58'"


In [4]:
# Convert results from IOB2 tags to spans across all teams and documents
# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging) for details on IOB2 format.
output_spans = {
    t: {("validation", i): tp.iob_to_spans(outputs[t][i]) 
        for i in range(len(outputs[t]))}
    for t in teams
    #t: [tp.iob_to_spans(df) for df in outputs[t]] for t in teams
}  # Type: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]

# As an example, show the first 10 spans that the "florian" team's model
# found on document 2.
output_spans["florian"][("validation", 2)].head(10)

Unnamed: 0,token_span,ent_type
0,"[35, 40): 'JAPAN'",LOC
1,"[50, 55): 'SYRIA'",LOC
2,"[57, 63): 'AL-AIN'",LOC
3,"[65, 85): 'United Arab Emirates'",LOC
4,"[144, 149): 'Japan'",LOC
5,"[169, 178): 'Asian Cup'",MISC
6,"[192, 197): 'Syria'",LOC
7,"[209, 222): 'Takuya Takagi'",PER
8,"[297, 308): 'Salem Bitar'",PER
9,"[403, 409): 'Syrian'",MISC


In [5]:
# Use Pandas merge to find what spans match up exactly for each team's
# results.
# Unlike Part 1, we perform the join across all entity types, looking for
# matches of both the extracted span *and* the entity type label.
#
# The make_stats_df() function is in util.py and is shared across multiple
# notebooks.
stats = {t: util.make_stats_df(gold_standard_spans, output_spans[t]) for t in teams}

# Show the result quality statistics by document for the "carrerasa" team
stats["carrerasa"]

Unnamed: 0,fold,doc_num,num_true_positives,num_extracted,num_entities,precision,recall,F1
0,validation,0,42,48,45,0.875000,0.933333,0.903226
1,validation,1,43,44,44,0.977273,0.977273,0.977273
2,validation,2,51,52,54,0.980769,0.944444,0.962264
3,validation,3,43,44,44,0.977273,0.977273,0.977273
4,validation,4,14,19,19,0.736842,0.736842,0.736842
...,...,...,...,...,...,...,...,...
226,validation,226,7,7,7,1.000000,1.000000,1.000000
227,validation,227,19,22,21,0.863636,0.904762,0.883721
228,validation,228,22,27,27,0.814815,0.814815,0.814815
229,validation,229,26,27,27,0.962963,0.962963,0.962963


In [6]:
# F1 for document 4 is looking a bit low. Is that just fluke, or is it
# part of a larger trend? 
# In Part 1, we showed how to drill down to and examine "problem" documents.
# Since we have all this additional data, let's try a broader, more 
# quantitative approach. We'll start by building up some more fine-grained 
# data about congruence between the gold standard and the model outputs.
# Pandas' outer join will tell us what entities showed up just in the gold
# standard, just in the model output, or in both sets.
# For starters, let's do this just for the "carrerasa" team and document 4.
doc_id = ("validation", 4)
(gold_standard_spans[doc_id]
    .merge(output_spans["carrerasa"][doc_id], how="outer", indicator=True)
    .sort_values("token_span"))

Unnamed: 0,token_span,ent_type,_merge
0,"[19, 28): 'ASIAN CUP'",MISC,both
1,"[46, 52): 'AL-AIN'",LOC,both
2,"[54, 74): 'United Arab Emirates'",LOC,both
3,"[97, 106): 'Asian Cup'",MISC,both
4,"[141, 146): 'Japan'",LOC,left_only
19,"[141, 146): 'Japan'",ORG,right_only
5,"[149, 154): 'Syria'",LOC,left_only
20,"[149, 154): 'Syria'",ORG,right_only
6,"[181, 186): 'Japan'",LOC,both
7,"[188, 200): 'Hassan Abbas'",PER,both


In [7]:
# Repeat the analysis from the previous cell across all teams and documents.
# That is, perform an outer join between the gold standard spans dataframe
# for each document and the corresponding dataframe from each team.
def merge_span_sets(team: str,
                    gold_results: Dict[Tuple[str, int], pd.DataFrame],
                    results_by_team: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]):
    result = {}  # Type: Dict[Tuple[str, int]: pd.DataFrame]
    for k in gold_results.keys():
        merged = gold_results[k].merge(results_by_team[team][k],
                                       how="outer", indicator=True)
        merged["gold"] = merged["_merge"].isin(("both", "left_only"))
        merged[team] = merged["_merge"].isin(("both", "right_only"))
        result[k] = merged[["token_span", "ent_type", "gold", team]]
    return result

span_flags = {t: merge_span_sets(t, gold_standard_spans, output_spans) for t in teams}  # Type: Dict[Tuple[str, int]: pd.DataFrame]

In [8]:
# Now we have indicator variables for every extracted span, telling whether 
# it was in the gold standard data set and/or in each of the team's results.
# For example, here are the first 5 spans for document 2 in the "carrerasa"
# team's results:
doc_id = ("validation", 2)
span_flags["carrerasa"][doc_id].head(5)

Unnamed: 0,token_span,ent_type,gold,carrerasa
0,"[35, 40): 'JAPAN'",LOC,True,True
1,"[50, 55): 'SYRIA'",LOC,True,True
2,"[57, 63): 'AL-AIN'",LOC,True,True
3,"[65, 85): 'United Arab Emirates'",LOC,True,True
4,"[144, 149): 'Japan'",LOC,True,True


In [9]:
# Do an n-way merge of all those indicator variables across documents.
# This operation produces a single summary dataframe per document.
indicators = {}  # Type: Dict[Tuple[str, int], pd.DataFrame]
for k in gold_standard_spans.keys():
    result = gold_standard_spans[k]
    for t in teams:
        result = result.merge(span_flags[t][k], how="outer")
    indicators[k] = result.fillna(False)
    
# Now we have a vector of indicator variables for every span extracted 
# from every document across all the model outputs and the gold standard.
# For example, let's show the results for document 10:
doc_10 = ("validation", 10)
indicators[doc_10]

Unnamed: 0,token_span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang
0,"[11, 22): 'RUGBY UNION'",ORG,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1,"[24, 30): 'LITTLE'",PER,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False,False,True
2,"[39, 46): 'CAMPESE'",PER,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False
3,"[57, 70): 'Robert Kitson'",PER,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True,True
4,"[71, 77): 'LONDON'",LOC,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,"[590, 603): 'European tour'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
150,"[962, 969): 'Campese'",LOC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
151,"[39, 46): 'CAMPESE'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
152,"[1333, 1343): 'Twickenham'",PER,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False


In [10]:
# If you look at the above dataframe, you can see that some entities 
# ("RUGBY UNION", for example) are "easy", in that almost every entry
# found them correctly. Other entities, like "CAMPESE", are "harder",
# in that few of the entrants correctly identified them. Let's add
# a column that quantifies this "difficulty level" by counting how 
# many teams found each true or false positive.
for df in indicators.values():
    # Convert the teams' indicator columns into a single matrix of 
    # Boolean values, and sum the number of True values in each row.
    vectors = df[df.columns[3:]].values
    counts = np.count_nonzero(vectors, axis=1)
    df["num_teams"] = counts

# Show the dataframe for document 10 again, this time with the new
# "num_teams" column at the far right.
indicators[doc_10]

Unnamed: 0,token_span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang,num_teams
0,"[11, 22): 'RUGBY UNION'",ORG,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
1,"[24, 30): 'LITTLE'",PER,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False,False,True,6
2,"[39, 46): 'CAMPESE'",PER,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,2
3,"[57, 70): 'Robert Kitson'",PER,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True,True,15
4,"[71, 77): 'LONDON'",LOC,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,"[590, 603): 'European tour'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,1
150,"[962, 969): 'Campese'",LOC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,1
151,"[39, 46): 'CAMPESE'",MISC,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,1
152,"[1333, 1343): 'Twickenham'",PER,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,1


In [11]:
# Now we can rank the entities in document 10 by "difficulty", either as 
# true positives for the models to find...
# (just for document 10 for the moment)
ind = indicators[doc_10].copy()
ind[ind["gold"] == True].sort_values("num_teams").head(10)

Unnamed: 0,token_span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang,num_teams
2,"[39, 46): 'CAMPESE'",PER,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,2
21,"[1020, 1030): 'Barbarians'",ORG,True,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,2
38,"[1687, 1696): 'All Black'",ORG,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False,3
10,"[334, 346): 'Queenslander'",MISC,True,False,False,False,True,False,True,True,False,False,False,False,False,False,False,False,True,4
34,"[1536, 1546): 'Barbarians'",ORG,True,False,True,False,False,False,False,False,False,True,True,False,True,True,False,False,False,5
7,"[164, 174): 'Barbarians'",ORG,True,True,False,False,False,False,True,False,False,True,False,False,True,True,False,False,False,5
28,"[1333, 1343): 'Twickenham'",LOC,True,True,False,False,False,True,False,False,False,True,True,True,False,False,False,False,False,5
1,"[24, 30): 'LITTLE'",PER,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False,False,True,6
41,"[1740, 1750): 'Barbarians'",ORG,True,False,True,False,False,False,False,True,False,False,False,True,True,True,True,False,False,6
19,"[761, 770): 'Wallabies'",ORG,True,False,True,False,True,True,False,True,False,True,False,False,True,True,False,True,False,8


In [12]:
# ...or as false positives to avoid:
ind[ind["gold"] == False].sort_values("num_teams", ascending=False).head(10)

Unnamed: 0,token_span,ent_type,gold,bender,carrerasa,carrerasb,chieu,curran,demeulder,florian,hammerton,hendrickx,klein,mayfield,mccallum,munro,whitelaw,wu,zhang,num_teams
90,"[1020, 1030): 'Barbarians'",MISC,False,True,True,True,True,True,False,True,False,True,False,True,False,False,False,True,True,10
91,"[1536, 1546): 'Barbarians'",MISC,False,True,False,True,True,True,False,True,False,False,False,True,False,False,False,True,True,8
94,"[164, 174): 'Barbarians'",MISC,False,False,True,False,True,True,False,True,False,False,True,True,False,False,False,True,True,8
104,"[24, 30): 'LITTLE'",LOC,False,False,False,False,False,True,False,False,True,True,False,True,False,True,False,False,False,5
98,"[2013, 2023): 'Pontypridd'",ORG,False,False,True,True,False,False,False,False,False,True,True,False,True,False,False,False,False,5
95,"[334, 361): 'Queenslander Daniel Herbert'",PER,False,False,True,False,False,False,False,False,False,False,False,True,False,True,False,True,False,4
96,"[1333, 1343): 'Twickenham'",MISC,False,False,True,False,False,False,True,True,False,False,False,False,False,False,False,False,True,4
101,"[1333, 1343): 'Twickenham'",ORG,False,False,False,False,True,False,False,False,False,False,False,False,True,True,True,False,False,4
102,"[1687, 1696): 'All Black'",MISC,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False,False,True,4
103,"[1740, 1750): 'Barbarians'",MISC,False,False,False,False,True,True,False,False,False,False,True,False,False,False,False,False,True,4


In [13]:
# To get a better picture of what entities are "difficult", we need to look 
# across the entire validation set. Let's combine the dataframes in 
# `indicators` into a single dataframe that covers all the documents.

# First we preprocess each dataframe to make it easier to combine.
to_stack = [
    pd.DataFrame({
        "fold": k[0],  # Keys are (collection, offset) tuples
        "doc_offset": k[1],
        # TokenSpanArrays from different documents can't currently be stacked,
        # so convert to TokenSpan objects.
        "token_span" : indicators[k]["token_span"].astype(object),
        "ent_type": indicators[k]["ent_type"],
        "gold": indicators[k]["gold"],
        "num_teams": indicators[k]["num_teams"]
    })
    for k in indicators.keys()
    #for i in range(len(indicators))
]

# Then we concatenate all the preprocessed dataframes into a single dataframe.
counts = pd.concat(to_stack)
counts

Unnamed: 0,fold,doc_offset,token_span,ent_type,gold,num_teams
0,validation,0,"[19, 24): 'JAPAN'",LOC,True,12
1,validation,0,"[40, 45): 'CHINA'",PER,True,0
2,validation,0,"[66, 77): 'Nadim Ladki'",PER,True,15
3,validation,0,"[78, 84): 'AL-AIN'",LOC,True,12
4,validation,0,"[86, 106): 'United Arab Emirates'",LOC,True,15
...,...,...,...,...,...,...
50,validation,230,"[19, 29): 'ENGLISHMAN'",LOC,False,1
51,validation,230,"[428, 436): 'Charlton'",LOC,False,3
52,validation,230,"[1076, 1097): 'European championship'",MISC,False,1
53,validation,230,"[1346, 1363): 'World Cup winning'",MISC,False,1


In [14]:
# Now we can pull out the most difficult entities across the entire validation
# set.
# First, let's find the most difficult entities from the standpoint of recall:
# entities that are in the gold standard, but not in most results.
difficult_recall = counts[counts["gold"] == True].sort_values("num_teams").reset_index(drop=True)
difficult_recall.head(10)

Unnamed: 0,fold,doc_offset,token_span,ent_type,gold,num_teams
0,validation,216,"[20, 36): 'SHEFFIELD SHIELD'",MISC,True,0
1,validation,54,"[3239, 3249): 'Full Light'",MISC,True,0
2,validation,177,"[11, 19): 'Honda RV'",MISC,True,0
3,validation,31,"[529, 542): '1. FC Cologne'",ORG,True,0
4,validation,54,"[1722, 1728): 'Okocim'",ORG,True,0
5,validation,149,"[1502, 1518): 'Consumer Project'",PER,True,0
6,validation,90,"[1129, 1140): 'Warsaw Pact'",MISC,True,0
7,validation,92,"[536, 570): 'Movement for a Democratic Slovakia'",ORG,True,0
8,validation,216,"[308, 316): 'Victoria'",ORG,True,0
9,validation,216,"[179, 187): 'Victoria'",ORG,True,0


In [15]:
# Hmm, everything is zero. How many entities were found by zero teams?  One team?
(counts[counts["gold"] == True][["num_teams", "token_span"]]
 .groupby("num_teams").count()
 .rename(columns={"token_span": "count"}))

Unnamed: 0_level_0,count
num_teams,Unnamed: 1_level_1
0,140
1,73
2,88
3,73
4,99
5,80
6,85
7,89
8,125
9,124


In [16]:
# Yikes! 140 entities in the validation set were so hard to find, they
# were extracted by 0 teams.
# Let's go back and look at some of those 0-team entities in context:
difficult_recall["context"] = difficult_recall["token_span"].apply(lambda t: t.context())
pd.set_option('max_colwidth', 100)
difficult_recall.head(20)

Unnamed: 0,fold,doc_offset,token_span,ent_type,gold,num_teams,context
0,validation,216,"[20, 36): 'SHEFFIELD SHIELD'",MISC,True,0,"[SHEFFIELD SHIELD] SCORE.\nHOBART, Australia 1996-12-07\nClo..."
1,validation,54,"[3239, 3249): 'Full Light'",MISC,True,0,"...centrating on its leading brand, Zywiec [Full Light], which accounts for 85 percent of sales..."
2,validation,177,"[11, 19): 'Honda RV'",MISC,True,0,[Honda RV] exceeds sales target.\nTOKYO 1996-12-06\n...
3,validation,31,"[529, 542): '1. FC Cologne'",ORG,True,0,...5 30 20 28\nVfL Bochum 16 7 6 3 23 21 27\n[1. FC Cologne] 16 8 2 6 31 27 26\nSchalke 04 17 7 ...
4,validation,54,"[1722, 1728): 'Okocim'",ORG,True,0,... while Carlsberg has the same amount in [Okocim].\nEarlier this year South African Brewer...
5,validation,149,"[1502, 1518): 'Consumer Project'",PER,True,0,"...r lobbyist heading the Washington-based [Consumer Project] on Technology.\n"" None of the trea..."
6,validation,90,"[1129, 1140): 'Warsaw Pact'",MISC,True,0,"...which used to be part of the Soviet-led [Warsaw Pact], saying such moves would threaten its s..."
7,validation,92,"[536, 570): 'Movement for a Democratic Slovakia'",ORG,True,0,"...rime Minister Vladimir Meciar 's ruling [Movement for a Democratic Slovakia], was stripped of..."
8,validation,216,"[308, 316): 'Victoria'",ORG,True,0,"... 119, David Boon 118, Shaun Young 113); [Victoria] 220 for three( Dean Jones 130 not out)."
9,validation,216,"[179, 187): 'Victoria'",ORG,True,0,...ield cricket match between Tasmania and [Victoria] at Bellerive Oval on Saturday:\nTasmania...


**Some of these entities are "difficult" because the validation set contains incorrect labels.**

For reference, there's a copy of the CoNLL labeling rules in this repository at
[resources/conll_03/ner/annotation.txt](../resources/conll_03/ner/annotation.txt)

There are 4 incorrect labels in this first set of 20:
* `[3289, 3299): 'Full Light'` should be "Zywiec Full Light"
* `[11, 19): 'Honda RV'` should be tagged `ORG`
* `[1525, 1541): 'Consumer Project'` should be "Consumer Project on Technology" and should be tagged `ORG`
* `[244, 255): 'McDonald 's'` should be tagged `MISC` (because it's an "adjective ... derived from a word which is ... organisation")

In [17]:
# Let's look at the entities that are difficult from the perspective of 
# precision: that is, in many models' results, but not in the gold standard.
difficult_precision = counts[counts["gold"] == False].sort_values("num_teams", ascending=False).reset_index(drop=True)

# Again, we can add some context to these spans:
difficult_precision["context"] = difficult_precision["token_span"].apply(lambda t: t.context())
difficult_precision.head(20)

Unnamed: 0,fold,doc_offset,token_span,ent_type,gold,num_teams,context
0,validation,202,"[24, 31): 'BRITISH'",MISC,False,16,[BRITISH] RESULTS.\nLONDON 1996-12-07\nResults of B...
1,validation,207,"[1305, 1315): 'Portsmouth'",ORG,False,16,...2 26\nManchester City 22 8 2 12 26 35 26\n[Portsmouth] 22 7 5 10 25 29 26\nReading 22 7 5 10 ...
2,validation,199,"[108, 116): 'Scottish'",MISC,False,16,...W 1996-12-07\nLeading goalscorers in the\n[Scottish] premier division after Saturday 's matc...
3,validation,216,"[166, 174): 'Tasmania'",LOC,False,16,... Sheffield Shield cricket match between [Tasmania] and Victoria at Bellerive Oval on Satur...
4,validation,40,"[144, 161): 'Santiago Bernabeu'",LOC,False,16,...ll breathalyse fans at the gates of the [Santiago Bernabeu] stadium and ban drunk supporters ...
5,validation,223,"[232, 244): 'Philadelphia'",ORG,False,16,...rgh 5 WASHINGTON 3\nMontreal 3 CHICAGO 1\n[Philadelphia] 6 DALLAS 3\nSt Louis 4 COLORADO 3\nE...
6,validation,216,"[308, 316): 'Victoria'",LOC,False,16,"... 119, David Boon 118, Shaun Young 113); [Victoria] 220 for three( Dean Jones 130 not out)."
7,validation,36,"[349, 358): 'Karlsruhe'",ORG,False,16,"...w 8th).\nHalftime 0-1.\nAttendance 33,000\n[Karlsruhe] 3( Reich 29th, Carl 44th, Dundee 69th)..."
8,validation,100,"[983, 991): 'Congress'",ORG,False,16,...n Congress would ratify the treaty with [Congress] quickly.\n'' The reactions from business...
9,validation,36,"[398, 406): 'Freiburg'",ORG,False,16,"... 3( Reich 29th, Carl 44th, Dundee 69th) [Freiburg] 0.\nHalftime 2-0.\nAttendance 33,000\nScha..."


**As with the entities in `difficult_recall`, some of these entities in `difficult_precision` are "difficult" because the validation set has missing and incorrect labels.**

**13** of these first 20 "incorrect" results are due to missing and incorrect labels:
* `[25, 32): 'BRITISH''` in document 202 should be tagged `MISC`.
* `[1317, 1327): 'Portsmouth'` in document 207 should be tagged `ORG`, not `LOC`.
* `[110, 118): 'Scottish'` in document 199 should be tagged `MISC`
  (or `[28, 53): 'SCOTTISH PREMIER DIVISION'` and 
  `[110, 135): 'Scottish premier division'` should both be tagged `ORG`).
* `[146, 163): 'Santiago Bernabeu'` in document 40 should be tagged `MISC`
  (because the "s" in `[146, 171): 'Santiago Bernabeu stadium'` is not capitalized).
* `[239, 251): 'Philadelphia'` in document 223 should be tagged `ORG`, not `LOC`.
* `[367, 376): 'Karlsruhe'` in document 36 should be tagged `ORG`, not `LOC`.
* `[1003, 1011): 'Congress'` in document 100 should be tagged `ORG`
  (also, `[957, 964): 'Chilean' ==> MISC` should be replaced with 
  `[957, 973): 'Chilean Congress' ==> ORG`).
* `[420, 428): 'Freiburg'` in document 36 should be tagged `ORG`, not `LOC`.
* In document 70, `[186, 211): 'New York Commodities Desk'`, not `[186, 206): 'New York Commodities'`, should be tagged `ORG`.
* `[263, 271): 'St Louis'` in document 223 should be tagged `ORG`, not `LOC`.
* `[788, 795): 'Antwerp'` in document 155 should be tagged `LOC`, not `ORG`.
* In document 112, `[178, 191): 'John Mills Jr'`, not `[178, 188): 'John Mills'`, should be tagged `PER`.
* `[274, 282): 'COLORADO'` in document 223 should be tagged `ORG`.


In [18]:
# Here's the gold standard data for document 155, for example.
# Note line 12.
doc_id = ("validation", 155)
gold_standard_spans[doc_id][0:60]

Unnamed: 0,token_span,ent_type
0,"[11, 18): 'Belgian'",MISC
1,"[64, 72): 'BRUSSELS'",LOC
2,"[170, 175): 'Spain'",LOC
3,"[230, 237): 'Belgian'",MISC
4,"[348, 355): 'Belgian'",MISC
5,"[424, 431): 'Antwerp'",ORG
6,"[460, 467): 'Belgian'",MISC
7,"[538, 547): 'Barcelona'",LOC
8,"[606, 613): 'Turkish'",MISC
9,"[713, 720): 'Belgium'",LOC


In [19]:
# The above gold standard spans in context. 
gold_standard_spans[doc_id]["token_span"].values

Unnamed: 0,begin,end,begin_token,end_token,covered_text
0,11,18,1,2,Belgian
1,64,72,11,12,BRUSSELS
2,170,175,27,28,Spain
3,230,237,40,41,Belgian
4,348,355,62,63,Belgian
5,424,431,76,77,Antwerp
6,460,467,83,84,Belgian
7,538,547,101,102,Barcelona
8,606,613,111,112,Turkish
9,713,720,133,134,Belgium


In [20]:
# Repeat the steps from the previous cells using the test set.
# This takes a while.
test_gold_standard = tp.conll_2003_to_dataframes("../conll_03/eng.testb")

test_gold_standard_spans = {("test", i): 
                       tp.iob_to_spans(test_gold_standard[i]) 
                       for i in range(len(test_gold_standard))}

test_outputs = { 
    t: tp.conll_2003_output_to_dataframes(
        test_gold_standard, f"../resources/conll_03/ner/results/{t}/eng.testa")
    for t in teams
}  # Type: Dict[str, List[pd.DataFrame]]

test_output_spans = {
    t: {("test", i): tp.iob_to_spans(test_outputs[t][i]) 
        for i in range(len(test_outputs[t]))}
    for t in teams
}  # Type: Dict[str, Dict[Tuple[str, int], pd.DataFrame]]

test_span_flags = {t: merge_span_sets(t, test_gold_standard_spans,
                                      test_output_spans) 
                   for t in teams}  # Type: Dict[Tuple[str, int]: pd.DataFrame]

test_indicators = {}  # Type: Dict[Tuple[str, int], pd.DataFrame]
for k in test_gold_standard_spans.keys():
    result = test_gold_standard_spans[k]
    for t in teams:
        result = result.merge(test_span_flags[t][k], how="outer")
    test_indicators[k] = result.fillna(False)
    
for df in test_indicators.values():
    # Convert the teams' indicator columns into a single matrix of 
    # Boolean values, and sum the number of True values in each row.
    vectors = df[df.columns[3:]].values
    nonzero_counts = np.count_nonzero(vectors, axis=1)
    df["num_teams"] = nonzero_counts
    
test_counts = pd.concat([
    pd.DataFrame({
        "fold": k[0],  # Keys are (collection, offset) tuples
        "doc_offset": k[1],
        # TokenSpanArrays from different documents can't currently be stacked,
        # so convert to TokenSpan objects.
        "token_span" : test_indicators[k]["token_span"].astype(object),
        "ent_type": test_indicators[k]["ent_type"],
        "gold": test_indicators[k]["gold"],
        "num_teams": test_indicators[k]["num_teams"]
    })
    for k in test_indicators.keys()
])

In [21]:
# How many teams found entities from the test set that are in the gold standard?
(test_counts[test_counts["gold"] == True][["num_teams", "token_span"]]
 .groupby("num_teams").count()
 .rename(columns={"token_span": "count"}))

Unnamed: 0_level_0,count
num_teams,Unnamed: 1_level_1
0,72
1,55
2,43
3,68
4,50
5,61
6,55
7,62
8,71
9,64


In [22]:
# How many teams found entities from the test set that aren't in the gold standard?
(test_counts[test_counts["gold"] == False][["num_teams", "token_span"]]
 .groupby("num_teams").count()
 .rename(columns={"token_span": "count"}))

Unnamed: 0_level_0,count
num_teams,Unnamed: 1_level_1
1,2740
2,704
3,324
4,214
5,134
6,77
7,68
8,36
9,43
10,30


In [23]:
# Merge the results from the two folds
all_counts = pd.concat([counts, test_counts])
all_counts.head()

Unnamed: 0,fold,doc_offset,token_span,ent_type,gold,num_teams
0,validation,0,"[19, 24): 'JAPAN'",LOC,True,12
1,validation,0,"[40, 45): 'CHINA'",PER,True,0
2,validation,0,"[66, 77): 'Nadim Ladki'",PER,True,15
3,validation,0,"[78, 84): 'AL-AIN'",LOC,True,12
4,validation,0,"[86, 106): 'United Arab Emirates'",LOC,True,15


In [24]:
all_counts.tail()

Unnamed: 0,fold,doc_offset,token_span,ent_type,gold,num_teams
25,test,215,"[670, 675): 'Atlas'",ORG,False,4
26,test,215,"[676, 686): 'Bangladesh'",LOC,False,3
27,test,215,"[980, 988): 'Newsroom'",ORG,False,1
28,test,215,"[462, 466): 'Alam'",PER,False,1
29,test,215,"[291, 304): 'Moslem Friday'",PER,False,1


In [25]:
# Write out the results that are in the gold standard but not in
# few teams' outputs.
in_gold_to_write, not_in_gold_to_write = util.csv_prep(all_counts, "num_teams")
in_gold_to_write

Unnamed: 0,num_teams,fold,doc_offset,corpus_span,corpus_ent_type,error_type,correct_span,correct_ent_type,notes,time_started,time_stopped,time_elapsed
0,0,test,2,"[25, 30): 'ASHES'",MISC,,,,,,,
1,0,test,15,"[15, 40): 'AMERICAN FOOTBALL-RANDALL'",MISC,,,,,,,
2,0,test,20,"[90, 96): 'Berlin'",MISC,,,,,,,
7,0,test,22,"[213, 244): 'Solidarity Meeting for Sarajevo'",MISC,,,,,,,
17,0,test,22,"[823, 844): 'IAAF Grand Prix Final'",MISC,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
19,16,validation,230,"[1031, 1040): 'World Cup'",MISC,,,,,,,
21,16,validation,230,"[1108, 1115): 'Germany'",LOC,,,,,,,
22,16,validation,230,"[1127, 1132): 'Irish'",MISC,,,,,,,
23,16,validation,230,"[1153, 1160): 'England'",LOC,,,,,,,


In [26]:
not_in_gold_to_write

Unnamed: 0,num_teams,fold,doc_offset,model_span,model_ent_type,error_type,corpus_span,corpus_ent_type,correct_span,correct_ent_type,notes,time_started,time_stopped,time_elapsed
310,16,test,20,"[90, 96): 'Berlin'",LOC,,,,,,,,,
22,16,test,22,"[236, 244): 'Sarajevo'",LOC,,,,,,,,,
74,16,test,157,"[132, 141): 'World Cup'",MISC,,,,,,,,,
32,16,test,187,"[376, 381): 'China'",LOC,,,,,,,,,
71,16,test,206,"[2393, 2400): 'Marines'",ORG,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49,1,validation,230,"[523, 542): 'Republic of Ireland'",ORG,,,,,,,,,
50,1,validation,230,"[19, 29): 'ENGLISHMAN'",LOC,,,,,,,,,
52,1,validation,230,"[1076, 1097): 'European championship'",MISC,,,,,,,,,
53,1,validation,230,"[1346, 1363): 'World Cup winning'",MISC,,,,,,,,,


In [27]:
# Write output files.
in_gold_to_write.to_csv("outputs/CoNLL_2_in_gold.csv", index=False)
not_in_gold_to_write.to_csv("outputs/CoNLL_2_not_in_gold.csv", index=False)

In [41]:
# Scratchpad for viewing results from the test set
doc_id = ("test", 15)
test_gold_standard_spans[doc_id][0:60]

Unnamed: 0,token_span,ent_type
0,"[11, 14): 'NFL'",ORG
1,"[15, 40): 'AMERICAN FOOTBALL-RANDALL'",MISC
2,"[41, 51): 'CUNNINGHAM'",PER
3,"[61, 73): 'PHILADELPHIA'",LOC
4,"[85, 103): 'Randall Cunningham'",PER
5,"[109, 133): 'National Football League'",ORG
6,"[262, 272): 'Cunningham'",PER
7,"[315, 334): 'Philadelphia Eagles'",ORG
8,"[349, 357): 'Pro Bowl'",MISC
9,"[369, 379): 'Cunningham'",PER


In [42]:
test_gold_standard_spans[doc_id]["token_span"].values

Unnamed: 0,begin,end,begin_token,end_token,covered_text
0,11,14,1,2,NFL
1,15,40,2,4,AMERICAN FOOTBALL-RANDALL
2,41,51,4,5,CUNNINGHAM
3,61,73,7,8,PHILADELPHIA
4,85,103,9,11,Randall Cunningham
5,109,133,13,16,National Football League
6,262,272,40,41,Cunningham
7,315,334,48,50,Philadelphia Eagles
8,349,357,53,55,Pro Bowl
9,369,379,57,58,Cunningham


In [38]:
# Scratchpad for viewing results from the validation set
doc_id = ("validation", 199)
gold_standard_spans[doc_id][0:60]

KeyError: ('test', 27)

In [37]:
gold_standard_spans[doc_id]["token_span"].values

Unnamed: 0,begin,end,begin_token,end_token,covered_text
0,11,14,1,2,NFL
1,67,75,9,11,NEW YORK
2,87,111,12,15,National Football League
3,210,218,36,37,AMERICAN
4,268,279,46,48,NEW ENGLAND
5,294,301,53,54,BUFFALO
6,316,328,59,60,INDIANAPOLIS
7,343,348,65,66,MIAMI
8,363,370,71,73,NY JETS
9,412,414,84,85,PA
