## Analyze_Model_Outputs.ipynb
### Analyze Model Outputs with Text Extensions for Pandas

This Jupyter notebook shows how to use the [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) library to analyze the outputs of an NLP model on a target corpus.

We use the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) corpus as our target corpus, and we use the output of the `bender` team in the original CoNLL 2003 competition as our example model output.



### Environment Setup

This notebook requires a Python 3.7 or later environment with `numpy` and `pandas`. 

The notebook also requires the `text_extensions_for_pandas` library. You can satisfy this dependency in two ways:

* Run `pip install text_extensions_for_pandas` before running this notebook. This command adds the library to your Python environment.
* Run this notebook out of your local copy of the Text Extensions for Pandas project's [source tree](https://github.com/CODAIT/text-extensions-for-pandas). In this case, the notebook will use the version of Text Extensions for Pandas in your local source tree **if the package is not installed in your Python environment**.


In [1]:
import os
import sys
import numpy as np
import pandas as pd

# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp

## Read the Data Set

[CoNLL](https://www.conll.org/), the SIGNLL Conference on Computational Natural Language Learning, is an annual academic conference for natural language processing researchers. Each year's conference features a competition involving a challenging NLP task. The task for the 2003 competition involved identifying mentions of [named entities](https://en.wikipedia.org/wiki/Named-entity_recognition) in English and German news articles from the late 1990s. The corpus for this 2003 competition is one of the most widely used benchmarks for the performance of named entity recognition models. Current [state-of-the-art results](https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003) on this corpus produce an F1 score (harmonic mean of precision and recall) of 0.93. The best F1 score in the original competition was 0.89.

For more information about this data set, we recommend reading the conference paper about the competition results, ["Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition,"](https://www.aclweb.org/anthology/W03-0419/).

**Note that the data set is licensed for research use only. Be sure to adhere to the terms of the license when using this data set!**

The developers of the CoNLL-2003 corpus defined a file format for the corpus, based on the file format used in the earlier [Message Understanding Conference](https://en.wikipedia.org/wiki/Message_Understanding_Conference) competition. This format is generally known as "CoNLL format" or "CoNLL-2003 format".

In the following cell, we use the facilities of Text Extensions for Pandas to download a copy of the CoNLL-2003 data set. Then we read the CoNLL-2003-format file containing the `test` fold of the corpus and translate the data into a collection of Pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) objects, one DataFrame per document. Finally, we display the first 10 rows of the DataFrame for a short example document from the `test` fold of the corpus.

In [2]:
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
#  to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info

# Read gold standard data for the "test" fold of the corpus.
corpus_test_fold = tp.io.conll.conll_2003_to_dataframes(
    data_set_info["test"], column_names=["pos", "phrase", "ent"],
    iob_columns=[False, True, True])

# Pick some documents to use as examples
SHORT_DOC_NUM = 6
LONG_DOC_NUM = 0
# We use document 6 here because it's short.
corpus_test_fold[SHORT_DOC_NUM].head(10)

Unnamed: 0,span,pos,phrase_iob,phrase_type,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'",-X-,O,,O,,"[0, 10): '-DOCSTART-'",1871
1,"[11, 17): 'SOCCER'",NN,B,NP,O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1873
2,"[17, 18): '-'",:,O,,O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1874
3,"[19, 26): 'ENGLISH'",NNP,B,NP,B,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1875
4,"[27, 31): 'F.A.'",NNP,I,NP,I,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1876
5,"[32, 35): 'CUP'",NNP,I,NP,I,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1877
6,"[36, 42): 'SECOND'",NNP,I,NP,O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1878
7,"[43, 48): 'ROUND'",NNP,I,NP,O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1879
8,"[49, 55): 'RESULT'",NNP,I,NP,O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1880
9,"[55, 56): '.'",.,O,,O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1881


The output of the previous cell corresponds to the first 10 lines of one document in the `test` fold of the corpus in its original format. Here's what the original data looks like:
```
-DOCSTART- -X- -X- O

SOCCER NN I-NP O
- : O O
ENGLISH NNP I-NP I-MISC
F.A. NNP I-NP I-MISC
CUP NNP I-NP I-MISC
SECOND NNP I-NP O
ROUND NNP I-NP O
RESULT NNP I-NP O
. . O O
```

Each line represents a single token of the file. The first token of each document is a special token `-DOCSTART-`. Each token is labeled with multiple attributes.
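
To make the file layout concrete, here is a minimal sketch of a reader for this format. It is not how `conll_2003_to_dataframes()` is implemented; the function name `parse_conll_2003` and the nested-list result are assumptions made for illustration. The sketch treats each blank line as a sentence boundary and each `-DOCSTART-` token as the start of a new document.

```python
# Sample lines copied from the excerpt above (truncated after "CUP").
RAW = """-DOCSTART- -X- -X- O

SOCCER NN I-NP O
- : O O
ENGLISH NNP I-NP I-MISC
F.A. NNP I-NP I-MISC
CUP NNP I-NP I-MISC
"""


def parse_conll_2003(raw):
    """Hypothetical helper: return documents -> sentences -> token tuples."""
    docs = []
    for line in raw.splitlines():
        fields = line.split()
        if not fields:
            # A blank line ends the current sentence (if it has any tokens).
            if docs and docs[-1][-1]:
                docs[-1].append([])
            continue
        if fields[0] == "-DOCSTART-" or not docs:
            # Start a new document with one empty sentence.
            docs.append([[]])
        # Each token is (text, part of speech, phrase tag, entity tag).
        docs[-1][-1].append(tuple(fields))
    # Drop any empty sentence left over at the end of a document.
    return [[sentence for sentence in doc if sentence] for doc in docs]


docs = parse_conll_2003(RAW)
for sentence in docs[0]:
    print([token[0] for token in sentence])
```

As in the DataFrame above, the `-DOCSTART-` token ends up in a sentence of its own.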

The function `tp.io.conll.conll_2003_to_dataframes()` returns a list of DataFrames, one DataFrame per document. The DataFrame above contains the following columns:

 * **`span`:** The span of the token within a reconstruction of the original document text, with begin and end offsets measured in characters. This column is stored using Text Extensions for Pandas' `SpanArray` extension type.
 * **`pos`:** Part of speech information for the token, drawn from the second field of each line in the original file. Note that CoNLL-2003 format does not specify the names of metadata fields; this column has the name `pos` because we specified that name in the `column_names` argument to `conll_2003_to_dataframes()`.
 * **`phrase_iob` and `phrase_type`:** Noun/verb phrase information for the token, drawn from the third field of the original file, in Inside-Outside-Beginning-2 (IOB2) format. The names for these columns come from the `column_names` argument we passed to `conll_2003_to_dataframes()`.
 * **`ent_iob` and `ent_type`:** Information about named entity mentions at this token offset, in IOB2 format. The names for these columns come from the `column_names` argument we passed to `conll_2003_to_dataframes()`.
 * **`sentence`:** The span of the sentence containing this token in the reconstructed document text.
 * **`line_num`:** The line of the original input file that contains this token.

Note that CoNLL-2003 format uses IOB1 tags for the metadata fields that we call "ent" and "phrase". The function `conll_2003_to_dataframes()` converts these IOB1 tags to the IOB2 format for ease
of consumption. In IOB2 format, every entity starts with a "begin" tag, so your code can determine whether a token is the first token in an entity without needing to inspect the previous token. See [the Wikipedia entry for IOB tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))
for more information. 
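
The tag conversion described above can be sketched in a few lines of plain Python. This is an illustrative version, not the library's code; the name `iob1_to_iob2` is an assumption. In IOB1, a `B-` prefix appears only when an entity directly follows another entity of the same type, so the conversion rewrites every `I-` tag that opens a new entity as `B-`.

```python
def iob1_to_iob2(tags):
    """Hypothetical helper: convert a list of IOB1 tags to IOB2 tags."""
    result = []
    prev_type = None  # Entity type of the previous token; None if it was "O"
    for tag in tags:
        if tag == "O":
            result.append("O")
            prev_type = None
            continue
        prefix, ent_type = tag.split("-", 1)
        if prefix == "I" and prev_type != ent_type:
            # First token of an entity: IOB2 always marks it with "B".
            prefix = "B"
        result.append(f"{prefix}-{ent_type}")
        prev_type = ent_type
    return result


iob1 = ["O", "O", "I-MISC", "I-MISC", "I-MISC", "O", "I-PER", "B-PER", "I-LOC"]
iob2 = iob1_to_iob2(iob1)
print(iob2)
```

The first six tags mirror the example document; the last three are made-up tags that exercise the adjacent-entity cases.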

We don't need the fields `pos`, `phrase_iob`, `phrase_type` for the remainder of this notebook, so let's drop them:

In [3]:
# Drop unneeded metadata columns
if "pos" in corpus_test_fold[0].columns:
    corpus_test_fold = [
        df.drop(columns=["pos", "phrase_iob", "phrase_type"])
        for df in corpus_test_fold
    ]
corpus_test_fold[SHORT_DOC_NUM].head(9)

Unnamed: 0,span,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'",1871
1,"[11, 17): 'SOCCER'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1873
2,"[17, 18): '-'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1874
3,"[19, 26): 'ENGLISH'",B,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1875
4,"[27, 31): 'F.A.'",I,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1876
5,"[32, 35): 'CUP'",I,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1877
6,"[36, 42): 'SECOND'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1878
7,"[43, 48): 'ROUND'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1879
8,"[49, 55): 'RESULT'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU...",1880


## Read the Model's Outputs

In this example, we will use the outputs of the "bender" team in the original competition as our model outputs.
A copy of these outputs is available in this repository under `resources/conll_03/ner/results/bender`.
These outputs are also in CoNLL-2003 format, but they do not contain any information about the tokens. For example, 
here's what the first 10 lines of the example document we've been using in the previous cells look like
in the model outputs:
```
O

O
O
I-MISC
I-MISC
I-MISC
O
O
O
O
```

Text Extensions for Pandas includes a function `conll_2003_output_to_dataframes()` that will read this format
of model output and merge the tags with the full token information in the original corpus, provided that you
have read the original corpus in with `conll_2003_to_dataframes()`. The cell that follows uses this function
to read the output of the "bender" team, using the `corpus_test_fold` list of DataFrames that we constructed
a few cells back.

In [4]:
# Read the outputs of the "bender" team in the original competition.
bender_output = tp.io.conll.conll_2003_output_to_dataframes(
    corpus_test_fold, "../resources/conll_03/ner/results/bender/eng.testb")
bender_output[SHORT_DOC_NUM].head(10)

Unnamed: 0,span,ent_iob,ent_type,sentence
0,"[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'"
1,"[11, 17): 'SOCCER'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
2,"[17, 18): '-'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
3,"[19, 26): 'ENGLISH'",B,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
4,"[27, 31): 'F.A.'",I,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
5,"[32, 35): 'CUP'",I,MISC,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
6,"[36, 42): 'SECOND'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
7,"[43, 48): 'ROUND'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
8,"[49, 55): 'RESULT'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."
9,"[55, 56): '.'",O,,"[11, 56): 'SOCCER- ENGLISH F.A. CUP SECOND ROU..."


## Convert IOB-Tagged Data to Lists of Entity Mentions

The data we've looked at so far has been in [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). 
Each row of our DataFrame represents a token, and each token is tagged with an entity type (`ent_type`) and an IOB tag (`ent_iob`). The first token of each named entity mention is tagged `B`, while subsequent tokens are tagged `I`. Tokens that aren't part of any named entity are tagged `O`.

IOB2 format is a convenient way to represent a corpus, but it is a less useful representation for analyzing the result quality of named entity recognition models. Because most tokens in a typical NER corpus are tagged `O`, any measure of error rate in terms of tokens is dominated by tokens that are not part of any entity. Token-level error rate also implicitly assigns higher weight to named entity mentions that consist of multiple tokens, further unbalancing error metrics. And most crucially, a naive comparison of IOB tags can result in marking an incorrect answer as correct. Consider a case where the correct sequence of labels is `B, B, I` but the model has output `B, I, I`; in this case, the last two tokens of the model output are both incorrect (the model has assigned them to the same entity as the first token), but a naive token-level comparison will consider the last token to be correct.
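
The `B, B, I` versus `B, I, I` case can be checked with a short sketch. The helper `entity_ids` is hypothetical and not part of Text Extensions for Pandas; it labels each token with the index of the token that begins its entity, so that two tokens compare equal only if they belong to the same entity.

```python
gold = ["B", "B", "I"]
model = ["B", "I", "I"]

# Naive comparison: compare the tags position by position.
naive_matches = [g == m for g, m in zip(gold, model)]


def entity_ids(tags):
    """Hypothetical helper: map each token to the start index of its entity."""
    ids, current = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            current = i      # A new entity starts here
        elif tag == "O":
            current = None   # Not part of any entity
        # An "I" tag keeps the current entity
        ids.append(current)
    return ids


# Entity-aware comparison: a token matches only if it sits in the same entity.
entity_matches = [g == m for g, m in zip(entity_ids(gold), entity_ids(model))]

print(naive_matches)   # [True, False, True]
print(entity_matches)  # [True, False, False]
```

The naive comparison accepts the last token, while the entity-aware comparison rejects both of the last two tokens.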

The CoNLL 2003 competition used the number of errors in extracting *entire* entity mentions to measure the result quality of the entries. We will use the same metric in this notebook. To compute entity-level errors, we convert the IOB-tagged tokens into pairs of `<entity span,  entity type>`. 
Text Extensions for Pandas includes a function `iob_to_spans()` that will handle this conversion for you.
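
For intuition, the conversion can be sketched in plain Python as follows. This is a simplified illustration under our own assumptions, not the implementation of `iob_to_spans()`; the name `iob2_to_spans` and the tuple layout are hypothetical. Each token is a `(begin, end, iob, ent_type)` tuple with character offsets taken from the example document.

```python
# (begin, end, iob, ent_type) for the first tokens of the example document.
TOKENS = [
    (0, 10, "O", None),      # -DOCSTART-
    (11, 17, "O", None),     # SOCCER
    (17, 18, "O", None),     # -
    (19, 26, "B", "MISC"),   # ENGLISH
    (27, 31, "I", "MISC"),   # F.A.
    (32, 35, "I", "MISC"),   # CUP
    (36, 42, "O", None),     # SECOND
]


def iob2_to_spans(tokens):
    """Hypothetical helper: merge IOB2-tagged tokens into (begin, end, type)."""
    spans = []
    current = None  # [begin, end, ent_type] of the entity being built
    for begin, end, iob, ent_type in tokens:
        if iob == "I" and current is not None and current[2] == ent_type:
            current[1] = end  # Extend the open entity to cover this token
            continue
        if current is not None:
            spans.append(tuple(current))  # Close the open entity
            current = None
        if iob in ("B", "I"):
            # "B" starts an entity; a stray "I" is treated the same way.
            current = [begin, end, ent_type]
    if current is not None:
        spans.append(tuple(current))
    return spans


entity_spans = iob2_to_spans(TOKENS)
print(entity_spans)  # [(19, 35, 'MISC')]
```

The single result corresponds to the mention `[19, 35): 'ENGLISH F.A. CUP'` with type `MISC`.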

In the next cell, we use `iob_to_spans()` to convert both the corpus and our example model's output to DataFrames of entity span and type information. Then we display the `<entity span,  entity type>` pairs for the "bender" team's output on our example document.

In [5]:
# Convert from IOB2-tagged tokens to <span, entity type> pairs.
# Again, one DataFrame per document.
corpus_spans = [tp.io.conll.iob_to_spans(df) for df in corpus_test_fold]
bender_spans = [tp.io.conll.iob_to_spans(df) for df in bender_output]
bender_spans[SHORT_DOC_NUM].head(10)

Unnamed: 0,span,ent_type
0,"[19, 35): 'ENGLISH F.A. CUP'",MISC
1,"[57, 63): 'LONDON'",LOC
2,"[88, 110): 'English F.A. Challenge'",MISC
3,"[111, 114): 'Cup'",MISC
4,"[145, 153): 'Plymouth'",ORG
5,"[156, 162): 'Exeter'",ORG


Each DataFrame in the list `bender_spans` contains two columns, `span` and `ent_type`.
The `span` column in the DataFrame has the data type (or ["dtype"](https://numpy.org/doc/stable/reference/arrays.dtypes.html), as Pandas
and Numpy call them) `TokenSpanDtype`. `TokenSpanDtype` is one of the 
extension types from Text Extensions for Pandas. 
The string representation of a Pandas Series shows the dtype of the series
("TokenSpanDtype" in this case) on its last line:

In [6]:
bender_spans[SHORT_DOC_NUM]["span"]

0           [19, 35): 'ENGLISH F.A. CUP'
1                     [57, 63): 'LONDON'
2    [88, 110): 'English F.A. Challenge'
3                      [111, 114): 'Cup'
4                 [145, 153): 'Plymouth'
5                   [156, 162): 'Exeter'
Name: span, dtype: TokenSpanDtype

Columns with a dtype of `TokenSpanDtype` are stored internally using the class
`TokenSpanArray`, which is also part of Text Extensions for Pandas.
`TokenSpanArray` is a subclass of [`ExtensionArray`](https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.html), the base class for custom 1-D array types in Pandas.

Pandas stores extension arrays inside the associated Pandas `Series` object
for the column. To obtain a reference to the extension array that backs a 
column, use the [`array` property of `pandas.Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.array.html#pandas.Series.array):

In [7]:
print(bender_spans[SHORT_DOC_NUM]["span"].array)

<TokenSpanArray>
[       [19, 35): 'ENGLISH F.A. CUP',                  [57, 63): 'LONDON',
 [88, 110): 'English F.A. Challenge',                   [111, 114): 'Cup',
              [145, 153): 'Plymouth',                [156, 162): 'Exeter']
Length: 6, dtype: TokenSpanDtype


Note how the previous cell passed the `TokenSpanArray` to `print()`. The
`TokenSpanArray` class can also render itself using [Jupyter Notebook callbacks](https://ipython.readthedocs.io/en/stable/config/integrating.html). To
see the HTML representation of the `TokenSpanArray`, pass the array object
to Jupyter's [`display()`](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.display)
function; or make that object be the last line of
the cell, as in the following example:

In [8]:
bender_spans[SHORT_DOC_NUM]["span"].array

Unnamed: 0,begin,end,begin token,end token,context
0,19,35,3,6,ENGLISH F.A. CUP
1,57,63,10,11,LONDON
2,88,110,15,18,English F.A. Challenge
3,111,114,18,19,Cup
4,145,153,25,26,Plymouth
5,156,162,27,28,Exeter


The text on the right side of the HTML shows the spans in the context of the reconstructed document text.
The table on the left shows detailed information about the spans. 
Internally, instances of `TokenSpanArray` consist of arrays of begin and end *token* offsets, plus a 
reference to the tokens. So only the `begin token` and `end token` values in the above table are actually 
stored inside the array. The other values in the table, `begin`, `end`, and `context` (the covered text), are computed
on demand.
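
The following sketch illustrates the idea of storing only token offsets and deriving everything else. The class `TokenSpanSketch` is a hypothetical stand-in written for this explanation; the real `TokenSpanArray` stores whole arrays of offsets and has a much richer interface.

```python
from dataclasses import dataclass
from typing import Tuple

# Reconstructed text and token character offsets for the start of the
# example document (values taken from the tables in this notebook).
TEXT = "-DOCSTART- SOCCER- ENGLISH F.A. CUP"
TOKENS = ((0, 10), (11, 17), (17, 18), (19, 26), (27, 31), (32, 35))


@dataclass
class TokenSpanSketch:
    """Hypothetical single span defined by token offsets only."""
    text: str
    tokens: Tuple[Tuple[int, int], ...]
    begin_token: int  # Index of the first token in the span
    end_token: int    # Index one past the last token in the span

    @property
    def begin(self) -> int:
        # Character offset where the first token begins
        return self.tokens[self.begin_token][0]

    @property
    def end(self) -> int:
        # Character offset where the last token ends
        return self.tokens[self.end_token - 1][1]

    @property
    def covered_text(self) -> str:
        return self.text[self.begin:self.end]


span = TokenSpanSketch(TEXT, TOKENS, begin_token=3, end_token=6)
print(span.begin, span.end, repr(span.covered_text))
```

With token offsets 3 and 6, the derived values match the first row of the table above: characters 19 to 35, covering `ENGLISH F.A. CUP`.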

You can obtain a reference to the underlying tokens for a `TokenSpanArray` via the properties
`document_tokens` and `tokens`. `document_tokens` returns the single underlying set of tokens
for arrays of spans that are all from the same document, while `tokens` returns the (potentially
different) backing tokens for each span of an array that may cover spans from multiple documents.

In this case, each of our DataFrames spans a single document, so we can use the `document_tokens`
property to fetch the tokens of the corresponding document. For example, here are the spans of the 
first 5 tokens:

In [9]:
bender_spans[SHORT_DOC_NUM]["span"].array.document_tokens[0:5]

Unnamed: 0,begin,end,context
0,0,10,-DOCSTART-
1,11,17,SOCCER
2,17,18,-
3,19,26,ENGLISH
4,27,31,F.A.


Note that the `SpanArray` object we displayed in the previous cell is the same object
that backs the "span" column in the token information DataFrame `bender_output[SHORT_DOC_NUM]`
from a few cells back:

In [10]:
bender_output[SHORT_DOC_NUM]["span"].array[0:5]

Unnamed: 0,begin,end,context
0,0,10,-DOCSTART-
1,11,17,SOCCER
2,17,18,-
3,19,26,ENGLISH
4,27,31,F.A.


## Compare Model Outputs with the Corpus Labels

Now that we have converted our corpus and model outputs to DataFrames of `<span, entity type>` pairs, we can use Pandas to compare the model's output with the labels. 

There are several ways to compare a set of `<span, entity type>` pairs. You may want to require an exact match of two spans, or you may want to consider partial matches. You might want to require exact matches of the entity types, or you may want to give partial credit to a model output that correctly identifies an entity's span but assigns the wrong entity type to that span.

Which of these types of comparisons is the "right" way to compare depends on your application. In the cells that follow, we'll show how to use Text Extensions for Pandas to do three different types of span comparison. To simplify the code that follows, we'll start by filtering down to just the Person (`PER`) entities in both output sets. That way we can compare just spans, ignoring the `ent_type` column for now.

In [11]:
# Let's look at just PER annotations
corpus_person = [df[df["ent_type"] == "PER"] for df in corpus_spans]
bender_person = [df[df["ent_type"] == "PER"] for df in bender_spans]

# Also, switch to a longer example document to make the comparisons that 
# follow more interesting.
corpus_person[LONG_DOC_NUM]

Unnamed: 0,span,ent_type
1,"[40, 45): 'CHINA'",PER
2,"[66, 77): 'Nadim Ladki'",PER
12,"[482, 495): 'Igor Shkvyrin'",PER
14,"[618, 632): 'Oleg Shatskiku'",PER
21,"[1079, 1092): 'Takuya Takagi'",PER
22,"[1148, 1168): 'Hiroshige Yanagimoto'",PER
24,"[1216, 1227): 'Salem Bitar'",PER
26,"[1360, 1372): 'Hassan Abbas'",PER
27,"[1489, 1494): 'Bitar'",PER
28,"[1503, 1517): 'Nader Jokhadar'",PER


### Exact-match comparison between sets of spans

Our extension types for spans consider two spans to be equal if they have the same target text and the same begin and end offsets.
So you can do exact match comparison between two DataFrames of span data using the standard Pandas 
[`merge()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).

In [12]:
corpus_person[LONG_DOC_NUM].merge(bender_person[LONG_DOC_NUM])

Unnamed: 0,span,ent_type
0,"[66, 77): 'Nadim Ladki'",PER
1,"[618, 632): 'Oleg Shatskiku'",PER
2,"[1079, 1092): 'Takuya Takagi'",PER
3,"[1148, 1168): 'Hiroshige Yanagimoto'",PER
4,"[1216, 1227): 'Salem Bitar'",PER
5,"[1360, 1372): 'Hassan Abbas'",PER
6,"[1503, 1517): 'Nader Jokhadar'",PER
7,"[1761, 1769): 'Shu Kamo'",PER
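
When called without an `on` argument, `merge()` performs an inner join on all columns the two DataFrames share, which here means both `span` and `ent_type`. Conceptually this is a set intersection. The sketch below uses plain Python sets of `(begin, end, ent_type)` tuples, a simplification we assume for illustration, on three mentions from the example document. The same representation also yields the false negatives and false positives.

```python
# Gold-standard PER mentions (a subset of the example document).
gold = {
    (66, 77, "PER"),    # Nadim Ladki
    (482, 495, "PER"),  # Igor Shkvyrin
    (618, 632, "PER"),  # Oleg Shatskiku
}
# Model output for the same region of the document.
model = {
    (66, 77, "PER"),    # Nadim Ladki
    (487, 495, "PER"),  # Shkvyrin (misses the first name)
    (618, 632, "PER"),  # Oleg Shatskiku
}

true_positives = gold & model    # Exact matches, like the inner join
false_negatives = gold - model   # In the corpus but not found by the model
false_positives = model - gold   # Found by the model but not in the corpus

print(sorted(true_positives))
print(sorted(false_negatives))
print(sorted(false_positives))
```

Under exact matching, the partial match on "Shkvyrin" counts as both a false negative and a false positive.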


### Relaxed notions of span equality

Text Extensions for Pandas also includes a function `contain_join()` for finding pairs of spans where one span in the pair either equals or contains the other span. We can use `contain_join()` to compare two DataFrame columns of spans and find all pairs that satisfy this looser notion of span equivalence:

In [13]:
# ...give credit for partial matches contained entirely within a true match:
tp.spanner.contain_join(corpus_person[LONG_DOC_NUM]["span"], 
                        bender_person[LONG_DOC_NUM]["span"], 
                        "corpus", "bender")

Unnamed: 0,corpus,bender
0,"[66, 77): 'Nadim Ladki'","[66, 77): 'Nadim Ladki'"
1,"[482, 495): 'Igor Shkvyrin'","[487, 495): 'Shkvyrin'"
2,"[618, 632): 'Oleg Shatskiku'","[618, 632): 'Oleg Shatskiku'"
3,"[1079, 1092): 'Takuya Takagi'","[1079, 1092): 'Takuya Takagi'"
4,"[1148, 1168): 'Hiroshige Yanagimoto'","[1148, 1168): 'Hiroshige Yanagimoto'"
5,"[1216, 1227): 'Salem Bitar'","[1216, 1227): 'Salem Bitar'"
6,"[1360, 1372): 'Hassan Abbas'","[1360, 1372): 'Hassan Abbas'"
7,"[1503, 1517): 'Nader Jokhadar'","[1503, 1517): 'Nader Jokhadar'"
8,"[1761, 1769): 'Shu Kamo'","[1761, 1769): 'Shu Kamo'"


Compared with the output of the previous cell, we now have 9 matches instead of 8. The "bender" team's model identified the span `[487, 495): 'Shkvyrin'` as a Person entity, while the corpus includes the longer span `[482, 495): 'Igor Shkvyrin'`.

Text Extensions for Pandas also has a second span comparison function, `overlap_join()`, that finds pairs of spans that overlap. You can use this function to look for equivalent spans between two sets of spans, using this even looser notion of span equivalence:

In [14]:
# ...give credit for matches that overlap at all with a true match:
tp.spanner.overlap_join(corpus_person[LONG_DOC_NUM]["span"], 
                        bender_person[LONG_DOC_NUM]["span"],
                        "gold", "extracted")

Unnamed: 0,gold,extracted
0,"[66, 77): 'Nadim Ladki'","[66, 77): 'Nadim Ladki'"
1,"[482, 495): 'Igor Shkvyrin'","[487, 495): 'Shkvyrin'"
2,"[618, 632): 'Oleg Shatskiku'","[618, 632): 'Oleg Shatskiku'"
3,"[1079, 1092): 'Takuya Takagi'","[1079, 1092): 'Takuya Takagi'"
4,"[1148, 1168): 'Hiroshige Yanagimoto'","[1148, 1168): 'Hiroshige Yanagimoto'"
5,"[1216, 1227): 'Salem Bitar'","[1216, 1227): 'Salem Bitar'"
6,"[1360, 1372): 'Hassan Abbas'","[1360, 1372): 'Hassan Abbas'"
7,"[1503, 1517): 'Nader Jokhadar'","[1503, 1517): 'Nader Jokhadar'"
8,"[1761, 1769): 'Shu Kamo'","[1761, 1769): 'Shu Kamo'"


In this example document, the "overlap" type of span equivalence produces the same result as "containment" span equivalence.
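
The two notions can differ in general. Here is a minimal sketch of both predicates over half-open `(begin, end)` character ranges. It uses a naive nested-loop join and is an illustration only, not the algorithm behind `contain_join()` or `overlap_join()`; we also assume that containment is tested in the direction "left span contains right span". The second model span is made up to show a case that overlaps without being contained.

```python
def contains(outer, inner):
    """True if `outer` equals or fully covers `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlaps(a, b):
    """True if the half-open ranges share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def span_join(left, right, predicate):
    """Naive nested-loop join: all (left, right) pairs that satisfy predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


gold = [(482, 495), (177, 203)]   # 'Igor Shkvyrin', 'Santa Fe Pacific Gold Corp'
model = [(487, 495), (170, 185)]  # 'Shkvyrin', plus a hypothetical span

contained_pairs = span_join(gold, model, contains)
overlapping_pairs = span_join(gold, model, overlaps)
print(contained_pairs)
print(overlapping_pairs)
```

Only the first pair satisfies containment, while both pairs satisfy the overlap test.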

## Compute Collection-Level Model Accuracy

Most benchmark results on the CoNLL-2003 dataset use the accuracy metric from the original competition: F1 score (the harmonic mean of precision and recall) over entity mentions, where an entity mention is considered "correct" if it corresponds exactly to an entity mention in the corpus labels.  We can compute this metric using Text Extensions for Pandas' extension types.

We start by using Pandas' `merge()` function to find the number of matches between pairs of DataFrames of `<span, label>` pairs. The number of `<span, label>` pairs that exactly match between each pair of DataFrames gives us the number of *true positives* in each document:

In [15]:
num_true_positives = [len(corpus_person[i].merge(bender_person[i]).index)
                      for i in range(len(corpus_person))]
num_true_positives[0:5]

[8, 31, 32, 20, 5]

The remaining inputs we need are the number of entity mentions the model extracted in each document and the number of mentions that the corpus contains in each document. We can obtain these figures directly from the lengths of our per-document DataFrames:

In [16]:
num_extracted = [len(df.index) for df in bender_person]
num_entities = [len(df.index) for df in corpus_person]
num_extracted[0:5], num_entities[0:5]

([9, 31, 33, 20, 5], [12, 31, 40, 20, 5])

Then we combine these three lists of counts into a single DataFrame:

In [17]:
stats_by_doc = pd.DataFrame({
    "doc_num": np.arange(len(corpus_person)),
    "num_true_positives": num_true_positives,
    "num_extracted": num_extracted,
    "num_entities": num_entities
})
stats_by_doc

Unnamed: 0,doc_num,num_true_positives,num_extracted,num_entities
0,0,8,9,12
1,1,31,31,31
2,2,32,33,40
3,3,20,20,20
4,4,5,5,5
...,...,...,...,...
226,226,2,2,2
227,227,4,4,6
228,228,4,5,4
229,229,0,0,0


The standard CoNLL-2003 accuracy metric is F1 score over the entire "test" fold. We can compute this statistic by aggregating the DataFrame:

In [18]:
total_true_positives = stats_by_doc["num_true_positives"].sum()
total_entities = stats_by_doc["num_entities"].sum()
total_extracted = stats_by_doc["num_extracted"].sum()

precision = total_true_positives / total_extracted
recall = total_true_positives / total_entities
F1 = 2.0 * (precision * recall) / (precision + recall)
print(
f"""Number of correct answers: {total_true_positives}
Number of entities identified: {total_extracted}
Actual number of entities: {total_entities}
Precision: {precision:1.4f}
Recall: {recall:1.4f}
F1: {F1:1.4f}""")

Number of correct answers: 1421
Number of entities identified: 1583
Actual number of entities: 1617
Precision: 0.8977
Recall: 0.8788
F1: 0.8881


The above numbers match up with the official results for `PER` entities on `eng.testb` (last line below):

In [19]:
!head -14 ../resources/conll_03/ner/results/bender/conlleval.out

eng.testa
processed 51578 tokens with 5942 phrases; found: 5846 phrases; correct: 5280.
accuracy:  98.07%; precision:  90.32%; recall:  88.86%; FB1:  89.58
              LOC: precision:  93.27%; recall:  93.58%; FB1:  93.42
             MISC: precision:  88.51%; recall:  81.02%; FB1:  84.60
              ORG: precision:  84.67%; recall:  83.59%; FB1:  84.13
              PER: precision:  92.26%; recall:  91.91%; FB1:  92.09
eng.testb
processed 46666 tokens with 5648 phrases; found: 5548 phrases; correct: 4698.
accuracy:  96.80%; precision:  84.68%; recall:  83.18%; FB1:  83.92
              LOC: precision:  86.44%; recall:  89.81%; FB1:  88.09
             MISC: precision:  78.35%; recall:  73.22%; FB1:  75.70
              ORG: precision:  80.27%; recall:  76.16%; FB1:  78.16
              PER: precision:  89.77%; recall:  87.88%; FB1:  88.81
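
As a sanity check, the corpus-level arithmetic can be reproduced from the three totals alone. The sketch below is plain Python, and the function name is our own. It also shows that F1 reduces algebraically to `2 * TP / (extracted + entities)`.

```python
def precision_recall_f1(num_true_positives, num_extracted, num_entities):
    """Hypothetical helper: compute precision, recall, and F1 from counts."""
    precision = num_true_positives / num_extracted
    recall = num_true_positives / num_entities
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Totals for PER entities reported earlier in this notebook.
precision, recall, f1 = precision_recall_f1(1421, 1583, 1617)
print(f"{precision:1.4f} {recall:1.4f} {f1:1.4f}")

# Equivalent closed form: the harmonic mean simplifies to 2*TP / (E + N).
f1_closed_form = 2.0 * 1421 / (1583 + 1617)
print(f"{f1_closed_form:1.4f}")
```

Both routes give the same F1 as the `PER` line of the `eng.testb` results.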


### Compute F1 Score for Each Document

In addition to the standard corpus-level accuracy statistics, we can also compute precision, recall, and F1 score for each document by adding some additional columns to our DataFrame `stats_by_doc`:

In [20]:
stats_by_doc["precision"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_extracted"]
stats_by_doc["recall"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_entities"]
stats_by_doc["F1"] = 2.0 * (stats_by_doc["precision"] * stats_by_doc["recall"]) / (stats_by_doc["precision"] + stats_by_doc["recall"])
stats_by_doc

Unnamed: 0,doc_num,num_true_positives,num_extracted,num_entities,precision,recall,F1
0,0,8,9,12,0.888889,0.666667,0.761905
1,1,31,31,31,1.000000,1.000000,1.000000
2,2,32,33,40,0.969697,0.800000,0.876712
3,3,20,20,20,1.000000,1.000000,1.000000
4,4,5,5,5,1.000000,1.000000,1.000000
...,...,...,...,...,...,...,...
226,226,2,2,2,1.000000,1.000000,1.000000
227,227,4,4,6,1.000000,0.666667,0.800000
228,228,4,5,4,0.800000,1.000000,0.888889
229,229,0,0,0,,,
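
Two details of the per-document table are worth spelling out. A document with no extracted and no actual entities, such as document 229, has an undefined score (0/0), which Pandas reports as `NaN`. Also, the corpus-level F1 is computed from summed counts, so it is generally not the mean of the per-document F1 scores. The sketch below shows both points in plain Python on four rows of the table; the helper names are assumptions.

```python
import math

# doc_num -> (num_true_positives, num_extracted, num_entities)
docs = {0: (8, 9, 12), 1: (31, 31, 31), 2: (32, 33, 40), 229: (0, 0, 0)}


def safe_div(numerator, denominator):
    """Return NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def doc_f1(num_true_positives, num_extracted, num_entities):
    precision = safe_div(num_true_positives, num_extracted)
    recall = safe_div(num_true_positives, num_entities)
    return safe_div(2.0 * precision * recall, precision + recall)


scores = {doc_num: doc_f1(*counts) for doc_num, counts in docs.items()}

# Mean of per-document scores, skipping documents with undefined F1.
valid = [s for s in scores.values() if not math.isnan(s)]
mean_of_doc_scores = sum(valid) / len(valid)

# Score over summed counts, as in the corpus-level computation.
totals = [sum(column) for column in zip(*docs.values())]
score_of_totals = doc_f1(*totals)

print(scores)
print(f"{mean_of_doc_scores:1.4f} {score_of_totals:1.4f}")
```

On these four rows the two aggregates differ, because summing counts gives more weight to documents with more entities.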


### Use Per-Document Accuracy to Identify Problems in Model Outputs

We can use these per-document statistics to find documents where this model performed poorly. 
Here, we use the Pandas `sort_values()` function to identify the top ten most problematic documents by F1 score:

In [21]:
stats_by_doc.sort_values("F1").head(10)

Unnamed: 0,doc_num,num_true_positives,num_extracted,num_entities,precision,recall,F1
75,75,2,21,2,0.095238,1.0,0.173913
7,7,1,2,4,0.5,0.25,0.333333
8,8,1,1,5,1.0,0.2,0.333333
138,138,2,2,10,1.0,0.2,0.333333
161,161,1,3,2,0.333333,0.5,0.4
104,104,2,3,6,0.666667,0.333333,0.444444
185,185,1,2,2,0.5,0.5,0.5
131,131,1,2,2,0.5,0.5,0.5
85,85,1,1,3,1.0,0.333333,0.5
43,43,7,8,16,0.875,0.4375,0.583333


What's going on with document 75?

In [22]:
from IPython import display
display.display(display.HTML("<h3>PER entities in corpus for document 75:</h3>"))
display.display(corpus_person[75])

display.display(display.HTML("<h3>PER entities in model outputs for document 75:</h3>"))
display.display(bender_person[75])

Unnamed: 0,span,ent_type
2,"[53, 70): 'Brendan Intindola'",PER
54,"[2242, 2252): 'Marc Cohen'",PER


Unnamed: 0,span,ent_type
2,"[53, 70): 'Brendan Intindola'",PER
6,"[177, 185): 'Santa Fe'",PER
8,"[207, 215): 'Santa Fe'",PER
10,"[264, 272): 'Santa Fe'",PER
14,"[455, 463): 'Santa Fe'",PER
16,"[694, 702): 'Santa Fe'",PER
18,"[828, 836): 'Santa Fe'",PER
25,"[1348, 1356): 'Santa Fe'",PER
28,"[1471, 1475): 'Dome'",PER
30,"[1578, 1586): 'Santa Fe'",PER


It looks like this model had trouble with "Santa Fe". Let's look at all instances of that string in this document. We can use Text Extensions for Pandas' regular expression support to create spans for every mention of "Santa Fe" in document 75:

In [23]:
import regex
doc_75_tokens = corpus_test_fold[75]["span"]

# Find all matches of "Santa Fe" that start and end on a token boundary
santa_fe_mentions = tp.spanner.extract_regex_tok(
    doc_75_tokens, regex.compile(r'[Ss]anta\s+[Ff]e'),
    min_len=2, max_len=2)
santa_fe_mentions

Unnamed: 0,match
0,"[36, 44): 'Santa Fe'"
1,"[177, 185): 'Santa Fe'"
2,"[207, 215): 'Santa Fe'"
3,"[264, 272): 'Santa Fe'"
4,"[455, 463): 'Santa Fe'"
5,"[694, 702): 'Santa Fe'"
6,"[828, 836): 'Santa Fe'"
7,"[980, 988): 'Santa Fe'"
8,"[1348, 1356): 'Santa Fe'"
9,"[1578, 1586): 'Santa Fe'"
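
To illustrate what "start and end on a token boundary" means, here is a rough sketch using only the standard library `re` module. It is not how `extract_regex_tok()` works internally: the helper names, the whitespace tokenizer, and the sample sentence are all assumptions made for this example. Because `finditer()` returns only non-overlapping matches, the sketch may also miss matches that a token-window search would find.

```python
import re


def tokenize(text):
    """Hypothetical tokenizer: one (begin, end) pair per whitespace-free run."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def regex_matches_on_token_boundaries(text, tokens, pattern, min_len, max_len):
    """Keep regex matches that begin and end exactly on token boundaries."""
    begin_to_index = {begin: i for i, (begin, _) in enumerate(tokens)}
    end_to_index = {end: i for i, (_, end) in enumerate(tokens)}
    results = []
    for m in pattern.finditer(text):
        first = begin_to_index.get(m.start())
        last = end_to_index.get(m.end())
        if first is None or last is None:
            continue  # The match starts or ends in the middle of a token
        num_tokens = last - first + 1
        if min_len <= num_tokens <= max_len:
            results.append((m.start(), m.end(), m.group()))
    return results


text = ("buyer for Santa Fe Pacific Gold Corp if Santa Fe rejects "
        "the Santa Fellows bid")
matches = regex_matches_on_token_boundaries(
    text, tokenize(text), re.compile(r"[Ss]anta\s+[Ff]e"),
    min_len=2, max_len=2)
print(matches)  # [(10, 18, 'Santa Fe'), (40, 48, 'Santa Fe')]
```

The regular expression also matches the first eight characters of "Santa Fellows", but that match ends inside a token and is discarded.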


Now let's line up those regex matches with the model outputs and corpus labels:

In [24]:
santa_fe_bender = pd.merge(santa_fe_mentions, bender_person[75], 
                           left_on="match", right_on="span")
santa_fe_bender

Unnamed: 0,match,span,ent_type
0,"[177, 185): 'Santa Fe'","[177, 185): 'Santa Fe'",PER
1,"[207, 215): 'Santa Fe'","[207, 215): 'Santa Fe'",PER
2,"[264, 272): 'Santa Fe'","[264, 272): 'Santa Fe'",PER
3,"[455, 463): 'Santa Fe'","[455, 463): 'Santa Fe'",PER
4,"[694, 702): 'Santa Fe'","[694, 702): 'Santa Fe'",PER
5,"[828, 836): 'Santa Fe'","[828, 836): 'Santa Fe'",PER
6,"[1348, 1356): 'Santa Fe'","[1348, 1356): 'Santa Fe'",PER
7,"[1578, 1586): 'Santa Fe'","[1578, 1586): 'Santa Fe'",PER
8,"[1944, 1952): 'Santa Fe'","[1944, 1952): 'Santa Fe'",PER
9,"[2080, 2088): 'Santa Fe'","[2080, 2088): 'Santa Fe'",PER


In [25]:
santa_fe_corpus = pd.merge(santa_fe_mentions, corpus_spans[75], 
                           left_on="match", right_on="span")
santa_fe_corpus

Unnamed: 0,match,span,ent_type
0,"[36, 44): 'Santa Fe'","[36, 44): 'Santa Fe'",LOC
1,"[207, 215): 'Santa Fe'","[207, 215): 'Santa Fe'",ORG
2,"[264, 272): 'Santa Fe'","[264, 272): 'Santa Fe'",ORG
3,"[455, 463): 'Santa Fe'","[455, 463): 'Santa Fe'",ORG
4,"[694, 702): 'Santa Fe'","[694, 702): 'Santa Fe'",ORG
5,"[828, 836): 'Santa Fe'","[828, 836): 'Santa Fe'",ORG
6,"[980, 988): 'Santa Fe'","[980, 988): 'Santa Fe'",ORG
7,"[1348, 1356): 'Santa Fe'","[1348, 1356): 'Santa Fe'",ORG
8,"[1578, 1586): 'Santa Fe'","[1578, 1586): 'Santa Fe'",ORG
9,"[1944, 1952): 'Santa Fe'","[1944, 1952): 'Santa Fe'",ORG


Next, we'll compare the above two sets of spans to each other to create a picture of how the corpus and the "bender" team's output treated each of our regular expression matches.

In [26]:
cols = ["span", "ent_type"]
matching_spans = pd.merge(santa_fe_bender[cols], santa_fe_corpus[cols],
                          on="span", how="left", suffixes=["_bender", "_corpus"])
matching_spans

Unnamed: 0,span,ent_type_bender,ent_type_corpus
0,"[177, 185): 'Santa Fe'",PER,
1,"[207, 215): 'Santa Fe'",PER,ORG
2,"[264, 272): 'Santa Fe'",PER,ORG
3,"[455, 463): 'Santa Fe'",PER,ORG
4,"[694, 702): 'Santa Fe'",PER,ORG
5,"[828, 836): 'Santa Fe'",PER,ORG
6,"[1348, 1356): 'Santa Fe'",PER,ORG
7,"[1578, 1586): 'Santa Fe'",PER,ORG
8,"[1944, 1952): 'Santa Fe'",PER,ORG
9,"[2080, 2088): 'Santa Fe'",PER,ORG


Out of 15 instances of "Santa Fe" that the "bender" model tagged as `PER`, 14 are tagged in the corpus as `ORG` or `LOC`. What's going on with the 15th? Let's look at the context of that span in the reconstructed document:

In [27]:
matching_spans.iloc[0]["span"].context()

'... the most likely white knight buyer for [Santa Fe] Pacific Gold Corp if Santa Fe rejects u...'

It looks like that span is part of a larger entity, "Santa Fe Pacific Gold Corp". We can verify this fact by 
using `tp.spanner.overlap_join()` to find the span in the corpus labels that overlaps
with the regular expression match:

In [28]:
tp.spanner.overlap_join(
    corpus_spans[75]["span"], matching_spans.iloc[[0]]["span"],
    "corpus", "regex_match")

Unnamed: 0,corpus,regex_match
0,"[177, 203): 'Santa Fe Pacific Gold Corp'","[177, 185): 'Santa Fe'"
