## CoNLL-2003 Example for Text Extensions for Pandas
### Part 1

To run this notebook, you will need to obtain a copy of the CoNLL-2003 corpus.
Place the corpus files in the following locations, relative to the parent directory of the notebooks directory:
* conll_03/eng.testa
* conll_03/eng.testb
* conll_03/eng.train


In [None]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    # Insert rather than overwrite, so the existing first entry is preserved.
    sys.path.insert(0, "..")
    
# Libraries
import numpy as np
import pandas as pd

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

In [None]:
# Read gold standard data for the validation set.

# Note that this data is NOT kept in source control; you need to obtain 
# an appropriate license and download the data yourself separately to
# run this notebook.

# Note also that the original corpus started with the special "-DOCSTART-"
# tag, while other versions start right in with the first document.
# If you have one of those other versions, you'll need to add the following
# two lines at the beginning to make the tokens line up with the output in
# ner.tgz:
# -DOCSTART- O
# 
# ^^^ note blank line after special token.
#
# If you need to add those lines, you should also remove the extra 
# "-DOCSTART-" token at the end of each file.

gold_standard = tp.conll_2003_to_dataframes("../conll_03/eng.testa")

# tp.conll_2003_to_dataframes() returns a list of dataframes
len(gold_standard)

In [None]:
gold_standard[0]

In [None]:
# Read the outputs of the "bender" team in the original competition.
# Yes, this file is called "testa" in one data set and "testb" in the other.
# Go figure.
# Also, we needed to remove the first two lines of the output file, because 
# the original version of the corpus apparently started with "-DOCSTART-",
# while the new one does not.
bender_output = tp.conll_2003_output_to_dataframes(
    gold_standard, "../resources/conll_03/ner/results/bender/eng.testb")
bender_output[0].head(20)

In [None]:
# Convert the gold standard and the bender output from IOB tags to spans.
# Again, one dataframe per document.
gold_standard_spans = [tp.iob_to_spans(df) for df in gold_standard]
bender_output_spans = [tp.iob_to_spans(df) for df in bender_output]
bender_output_spans[0]
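
To make the IOB-to-span conversion concrete, here is a minimal pure-Python sketch of the general technique. It is not the implementation behind `tp.iob_to_spans`; the helper name `iob_tags_to_spans`, the sample tokens, and the tuple layout `(begin_token, end_token, ent_type)` with an exclusive end are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def iob_tags_to_spans(
    iob: List[str], ent_types: List[Optional[str]]
) -> List[Tuple[int, int, Optional[str]]]:
    """Collapse per-token IOB tags into (begin_token, end_token, ent_type) spans.

    Hypothetical helper for illustration only; end_token is exclusive.
    """
    spans = []
    start = None      # Index of the first token of the open span, if any
    cur_type = None   # Entity type of the open span
    for i, (tag, etype) in enumerate(zip(iob, ent_types)):
        # Close the open span on "O", on "B", or on an "I" of a different type.
        if start is not None and (tag != "I" or etype != cur_type):
            spans.append((start, i, cur_type))
            start, cur_type = None, None
        # Open a new span on "B", or on an "I" that has no span to continue.
        if tag in ("B", "I") and start is None:
            start, cur_type = i, etype
    # Close a span that runs to the end of the document.
    if start is not None:
        spans.append((start, len(iob), cur_type))
    return spans


# Made-up sample sentence, tagged by hand.
tokens = ["West", "Indian", "all-rounder", "Phil", "Simmons", "took", "four"]
iob = ["B", "I", "O", "B", "I", "O", "O"]
types = ["MISC", "MISC", None, "PER", "PER", None, None]
spans = iob_tags_to_spans(iob, types)
print([(" ".join(tokens[b:e]), t) for b, e, t in spans])
```

Treating a stray "I" as the start of a span lets the same loop handle files in which "B" is only used to separate adjacent entities of the same type.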

In [None]:
# Let's look at just PER annotations
gold_person = [df[df["ent_type"] == "PER"] for df in gold_standard_spans]
bender_person = [df[df["ent_type"] == "PER"] for df in bender_output_spans]
gold_person[0]

In [None]:
# We can also ask these span columns to render themselves to HTML for a
# closer look at the target document.
bender_person[0]["token_span"].values

In [None]:
# Let's show how to evaluate these results against the gold standard.
# We could look at exact matches...
gold_person[0].merge(bender_person[0])
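
An exact-match merge only shows the true positives. As a hedged sketch of how the misses could be inspected too, the example below uses plain pandas on small made-up dataframes. The integer `begin` and `end` columns are an assumption standing in for the span column of the real dataframes, so this illustrates the idea rather than the notebook's actual data layout.

```python
import pandas as pd

# Made-up spans; "begin"/"end" stand in for the real span column (assumption).
gold = pd.DataFrame(
    {"begin": [3, 10, 20], "end": [5, 12, 22], "ent_type": ["PER"] * 3})
extracted = pd.DataFrame(
    {"begin": [3, 10, 30], "end": [5, 11, 32], "ent_type": ["PER"] * 3})

# With no "on" argument, merge joins on all shared columns. An outer join
# with indicator=True adds a "_merge" column saying where each row came from.
both = gold.merge(extracted, how="outer", indicator=True)

true_positives = both[both["_merge"] == "both"]
false_negatives = both[both["_merge"] == "left_only"]    # in gold only
false_positives = both[both["_merge"] == "right_only"]   # in extracted only
print(len(true_positives), len(false_negatives), len(false_positives))
```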

In [None]:
# ...or we could give credit for partial matches contained entirely 
# within a true match:
tp.contain_join(gold_person[0]["token_span"], bender_person[0]["token_span"], "gold", "extracted")

In [None]:
# ...or we could give credit for matches that overlap at all with
# a true match:
tp.overlap_join(gold_person[0]["token_span"], bender_person[0]["token_span"], "gold", "extracted")
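
The three matching criteria differ only in the predicate applied to a pair of spans. The sketch below spells out one plausible reading of each on plain `(begin, end)` tuples with an exclusive end. The function names are hypothetical, and the exact boundary semantics of `tp.contain_join` and `tp.overlap_join` are assumed here, not confirmed.

```python
from typing import Tuple

Span = Tuple[int, int]  # (begin, end) with an exclusive end (assumption)


def is_exact(gold: Span, extracted: Span) -> bool:
    """Both boundaries agree."""
    return gold == extracted


def is_contained(gold: Span, extracted: Span) -> bool:
    """The extracted span lies entirely within the gold span."""
    return gold[0] <= extracted[0] and extracted[1] <= gold[1]


def is_overlapping(gold: Span, extracted: Span) -> bool:
    """The two spans share at least one position."""
    return gold[0] < extracted[1] and extracted[0] < gold[1]


gold_span = (3, 5)
candidates = [(3, 5), (4, 5), (4, 7), (5, 7)]
results = [
    (c, is_exact(gold_span, c), is_contained(gold_span, c),
     is_overlapping(gold_span, c))
    for c in candidates
]
for row in results:
    print(row)
```

Each criterion is strictly more permissive than the one before it: every exact match is also contained, and every contained match also overlaps.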

In [None]:
# Let's stick with exact matches for now.
# Iterate over the pairs of dataframes for all the documents finding the
# inputs we need to compute precision and recall for each document, and
# wrap these values in a new dataframe.
num_true_positives = [len(gold_person[i].merge(bender_person[i]).index)
                      for i in range(len(gold_person))]
num_extracted = [len(df.index) for df in bender_person]
num_entities = [len(df.index) for df in gold_person]
doc_num = np.arange(len(gold_person))

stats_by_doc = pd.DataFrame({
    "doc_num": doc_num,
    "num_true_positives": num_true_positives,
    "num_extracted": num_extracted,
    "num_entities": num_entities
})
stats_by_doc

In [None]:
# Collection-wide precision and recall can be computed by aggregating
# our dataframe:
num_true_positives = stats_by_doc["num_true_positives"].sum()
num_entities = stats_by_doc["num_entities"].sum()
num_extracted = stats_by_doc["num_extracted"].sum()

precision = num_true_positives / num_extracted
recall = num_true_positives / num_entities
F1 = 2.0 * (precision * recall) / (precision + recall)
print(
"""Number of correct answers: {}
Number of entities identified: {}
Actual number of entities: {}
Precision: {:1.2f}
Recall: {:1.2f}
F1: {:1.2f}""".format(num_true_positives, num_extracted, num_entities, precision, recall, F1))
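
For reference, the same arithmetic can be packaged as a small function that also guards against empty denominators. This is a sketch with a hypothetical helper name and made-up counts. Returning 0.0 when a denominator is zero is one common convention, not something the notebook itself prescribes.

```python
from typing import Tuple


def precision_recall_f1(
    num_true_positives: int, num_extracted: int, num_entities: int
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts.

    Hypothetical helper; returns 0.0 wherever a denominator is zero.
    """
    precision = num_true_positives / num_extracted if num_extracted else 0.0
    recall = num_true_positives / num_entities if num_entities else 0.0
    denominator = precision + recall
    f1 = 2.0 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up counts: 8 correct answers out of 10 extracted, 16 entities in total.
p, r, f1 = precision_recall_f1(8, 10, 16)
print("Precision: {:1.2f}  Recall: {:1.2f}  F1: {:1.2f}".format(p, r, f1))
```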

In [None]:
# Let's also add some additional columns with per-document stats:
stats_by_doc["precision"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_extracted"]
stats_by_doc["recall"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_entities"]
stats_by_doc["F1"] = 2.0 * (stats_by_doc["precision"] * stats_by_doc["recall"]) / (stats_by_doc["precision"] + stats_by_doc["recall"])
stats_by_doc
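
One thing to watch for in the per-document columns: a document with zero true positives has precision and recall of 0, so its F1 is 0/0, which pandas evaluates to NaN, and `sort_values` places NaN rows last by default. The sketch below reproduces this on made-up counts and shows one possible workaround. The `F1_filled` column name is hypothetical.

```python
import pandas as pd

# Made-up counts: doc 0 is partly right, doc 1 has no correct answers,
# doc 2 has no entities and no extractions at all.
stats = pd.DataFrame({
    "doc_num": [0, 1, 2],
    "num_true_positives": [4, 0, 0],
    "num_extracted": [5, 3, 0],
    "num_entities": [8, 2, 0],
})
stats["precision"] = stats["num_true_positives"] / stats["num_extracted"]
stats["recall"] = stats["num_true_positives"] / stats["num_entities"]
stats["F1"] = (2.0 * stats["precision"] * stats["recall"]
               / (stats["precision"] + stats["recall"]))

# NaN sorts last, so doc 1 (the worst document) does not come out on top.
worst_first = stats.sort_values("F1")["doc_num"].tolist()

# Possible workaround: score undefined F1 as 0.0 when the document had
# something to find or something was extracted; leave truly empty docs as NaN.
has_work = (stats["num_extracted"] + stats["num_entities"]) > 0
stats["F1_filled"] = stats["F1"].where(~(stats["F1"].isna() & has_work), 0.0)
print(stats.sort_values("F1_filled")[["doc_num", "F1", "F1_filled"]])
```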

In [None]:
# Let's zero in on the ten most problematic documents by F1 score.
stats_by_doc.sort_values("F1").head(10)

In [None]:
# What's happening with document 75?
gold_person[75]

In [None]:
bender_person[75]