## CoNLL-2003 Example for Text Extensions for Pandas
### Part 3

To run this notebook, you will need to obtain a copy of the CoNLL-2003 data set's corpus.
Drop the corpus's files into the following locations:
* conll_03/eng.testa
* conll_03/eng.testb
* conll_03/eng.train

If you are unfamiliar with the basics of Text Extensions for Pandas, we recommend you 
start with Part 1 of this example.

### Introduction

At the end of Part 2 of the demo, we showed that the CoNLL-2003 validation set contains incorrect labels, and that you can pinpoint those labels by data-mining the results of the 16 models that the competitors submitted.

Our goal for the remainder of the demo is to pinpoint incorrect labels across the entire data set. The (rough) process to do so will be:

1. Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.
2. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a dataframe column (of type TensorType) alongside the tokens and labels.
3. Use the embeddings to quickly train multiple models at multiple levels of sophistication (something like: SVMs, random forests, and LSTMs with small and large numbers of hidden states). Split the corpus into 10 parts and perform a 10-fold cross-validation.
4. Repeat the process from part 2 on each fold of the 10-fold cross-validation, comparing the outputs of every model on the validation set for each fold.
5. Analyze the results of the models to pinpoint potential incorrect labels. Inspect those labels manually and build up a list of labels that are actually incorrect.



In [None]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if (sys.path[0] != ".."):
    sys.path.insert(0, "..")
    
# Libraries
import numpy as np
import pandas as pd
import time
import torch
import transformers
from typing import *

import sklearn.pipeline

import matplotlib.pyplot as plt
import ipywidgets
from IPython.display import display

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

# Code to display a progress bar while iterating over a list of dataframes.
def run_with_progress_bar(num_docs: int, fn):
    _UPDATE_SEC = 0.1
    result = [] # Type: List[pd.DataFrame]
    last_update = time.time()
    progress_bar = ipywidgets.IntProgress(0, 0, num_docs,
                                          description="Starting...",
                                          layout=ipywidgets.Layout(width="100%"),
                                          style={"description_width": "12%"})
    display(progress_bar)
    for i in range(num_docs):
        result.append(fn(i))
        now = time.time()
        if i == num_docs - 1 or now - last_update >= _UPDATE_SEC:
            progress_bar.value = i + 1
            progress_bar.description = f"{i + 1}/{num_docs} docs"
            last_update = now
    progress_bar.bar_style = "success"
    return result

# Step 1: Retokenize with a BERT tokenizer.

Retokenize the corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.
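
The idea behind this step can be sketched without any library support. The following is a minimal illustration, not the implementation used by Text Extensions for Pandas: it assumes entities and tokens are both given as `[begin, end)` character offsets, and the helper `project_labels` and the sample data are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # [begin, end) character offsets into the document text


def project_labels(entities: List[Tuple[Span, str]],
                   new_tokens: List[Span]) -> List[str]:
    """Assign an IOB2 tag to each new token based on character overlap."""
    tags = ["O"] * len(new_tokens)
    for (ent_begin, ent_end), ent_type in entities:
        first = True
        for i, (tok_begin, tok_end) in enumerate(new_tokens):
            # Skip zero-length special tokens and tokens outside the entity.
            if tok_begin == tok_end or tok_end <= ent_begin or tok_begin >= ent_end:
                continue
            tags[i] = ("B-" if first else "I-") + ent_type
            first = False
    return tags


# Hypothetical document in which the new tokenizer splits "Sutton" in two
# and adds zero-length special tokens at both ends.
text = "Chris Sutton scored"
entities = [((0, 12), "PER")]
subword_spans = [(0, 0), (0, 5), (6, 9), (9, 12), (13, 19), (19, 19)]

tags = project_labels(entities, subword_spans)
for (begin, end), tag in zip(subword_spans, tags):
    print(repr(text[begin:end]), tag)
```

Working in character offsets is what makes the mapping independent of either tokenization: the entity keeps its extent in the text even when the token boundaries move.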

In [None]:
# Read in the validation set in its original tokenization
valid_raw = tp.conll_2003_to_dataframes("../conll_03/eng.testa")

# Pick out the dataframe for a single example document.
example_df = valid_raw[5]
example_df

In [None]:
spans_df = tp.iob_to_spans(example_df)
spans_df

In [None]:
# Retokenize the document's text with the BERT tokenizer
#bert_model_name = "bert-base-uncased"
#bert_model_name = "bert-large-uncased"
bert_model_name = "dslim/bert-base-NER"

tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name, 
                                                           add_special_tokens=True)
bert_toks_df = tp.make_bert_tokens(example_df["char_span"].values[0].target_text, tokenizer)
bert_toks_df

In [None]:
# BERT tokenization includes special zero-length tokens.
bert_toks_df[bert_toks_df["special_tokens_mask"]]

In [None]:
# Align the original entity spans with the BERT tokenization: for each span
# over the original tokens, find the matching span over the BERT tokens.
def align_spans(spans_df, toks_df):
    original_spans = spans_df["token_span"]
    bert_token_spans = tp.TokenSpanArray.align_to_tokens(toks_df["char_span"],
                                                         original_spans)
    return pd.DataFrame({
        "original_span": original_spans,
        "bert_token_span": bert_token_spans,
        "ent_type": spans_df["ent_type"]
    })

aligned_spans_df = align_spans(spans_df, bert_toks_df)
aligned_spans_df

In [None]:
# Generate IOB2 tags and entity labels that align with the BERT tokens.
# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
def add_iob_tags(toks_df, aligned_spans_df) -> pd.DataFrame:
    iob_df = toks_df.copy()
    iob_df["ent_iob"] = tp.spans_to_iob(aligned_spans_df["bert_token_span"])
    tmp_df = (
        tp
        .contain_join(aligned_spans_df["bert_token_span"], iob_df["token_span"],
                      "bert_token_span", "token_span")
        .merge(aligned_spans_df[["bert_token_span", "ent_type"]])
        [["token_span", "ent_type"]]
    )
    iob_df = iob_df.merge(tmp_df, how="left")
    return iob_df

iob_tags_df = add_iob_tags(bert_toks_df, aligned_spans_df)
iob_tags_df

In [None]:
# The traditional way to transform NER to token classification is to 
# treat each combination of {I,O,B} X {entity type} as a different
# class. Generate class labels in that format.
# Create a Pandas categorical type for consistent encoding of categories
# across all documents.
_ENTITY_TYPES = ["LOC", "MISC", "ORG", "PER"]
_ALL_LABELS = ["O"] + [f"{x}-{y}" for x in ["B", "I"] for y in _ENTITY_TYPES]
token_class_dtype = pd.CategoricalDtype(categories=_ALL_LABELS)
_LABEL_TO_INT = {_ALL_LABELS[i]: i for i in range(len(_ALL_LABELS))}


def add_token_classes(df: pd.DataFrame) -> pd.DataFrame:
    elems = []  # Type: List[str]
    for index, row in df[["ent_iob", "ent_type"]].iterrows():
        if row["ent_iob"] == "O":
            elems.append("O")
        else:
            elems.append(f"{row['ent_iob']}-{row['ent_type']}")
    ret = df.copy()
    ret["token_class"] = pd.Categorical(elems, dtype=token_class_dtype)
    ret["token_class_id"] = [_LABEL_TO_INT[l] for l in elems]
    return ret

classes_df = add_token_classes(iob_tags_df)
classes_df

# Step 2: Add embeddings

Generate BERT embeddings for every token in the entire corpus in one pass, 
and store those embeddings in a dataframe column (of type TensorType) 
alongside the tokens and labels.
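
BERT models accept a bounded number of tokens per input, so embedding a long document in one pass means cutting the token sequence into overlapping windows and stitching the per-window outputs back together. The sketch below shows one way to do that in plain Python. It is an assumption about the general technique, not a description of how `tp.seq_to_windows` or `tp.windows_to_seq` behave; the helpers `to_windows` and `stitch` are hypothetical.

```python
from typing import List, Sequence, Tuple

# (win_begin, win_end, core_begin, core_end) indexes into the flat sequence
Window = Tuple[int, int, int, int]


def to_windows(seq_len: int, overlap: int, non_overlap: int) -> List[Window]:
    """Cover the sequence with windows whose "core" regions do not overlap.

    Each window also includes up to `overlap` tokens of context on each side.
    """
    windows = []
    for core_begin in range(0, seq_len, non_overlap):
        core_end = min(core_begin + non_overlap, seq_len)
        win_begin = max(core_begin - overlap, 0)
        win_end = min(core_end + overlap, seq_len)
        windows.append((win_begin, win_end, core_begin, core_end))
    return windows


def stitch(per_window_outputs: Sequence[Sequence], windows: List[Window]) -> list:
    """Keep only each window's core outputs and concatenate them in order."""
    result = []
    for outputs, (win_begin, _, core_begin, core_end) in zip(per_window_outputs, windows):
        result.extend(outputs[core_begin - win_begin:core_end - win_begin])
    return result


token_ids = list(range(200))  # stand-in for a document's input IDs
windows = to_windows(len(token_ids), overlap=32, non_overlap=64)

# Stand-in for the model: echo each window's inputs back as its "outputs".
fake_outputs = [token_ids[begin:end] for begin, end, _, _ in windows]
stitched = stitch(fake_outputs, windows)
print(len(windows), len(stitched))
```

The context tokens give the model something to attend to at the edges of each core region; only the core outputs are kept, so every token ends up with exactly one embedding.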

In [None]:
# Now we need to attach some embeddings. Fire up a canned BERT model.
bert = transformers.BertModel.from_pretrained(bert_model_name)

# Level of indirection so we can more easily swap implementations
def bert_fn(input_ids: np.ndarray, attention_mask: np.ndarray):
    return bert(input_ids=torch.tensor(input_ids),
                attention_mask=torch.tensor(attention_mask))

In [None]:
# Define a function that adds embeddings to a dataframe.
def add_embeddings(df: pd.DataFrame) -> pd.DataFrame:
    _OVERLAP = 32
    _NON_OVERLAP = 64
    flat_input_ids = df["input_id"].values
    windows = tp.seq_to_windows(flat_input_ids, _OVERLAP, _NON_OVERLAP)
    bert_result = bert_fn(input_ids=windows["input_ids"], 
                          attention_mask=windows["attention_masks"])
    hidden_states = tp.windows_to_seq(flat_input_ids, 
                                      bert_result[0].detach().numpy(),
                                      _OVERLAP, _NON_OVERLAP)
    embeddings = tp.TensorArray(hidden_states)
    ret = df.copy()
    ret["embedding"] = embeddings
    return ret

embeddings_df = add_embeddings(classes_df)
embeddings_df

In [None]:
# Combine all the steps we've done so far into a single function.
def process_doc(df: pd.DataFrame) -> pd.DataFrame:
    """
    :param df: One dataframe from the conll_2003_to_dataframes() function,
     representing the tokens of a single document in the original tokenization.
    
    :returns: A version of the same dataframe, but with BERT tokens, BERT
     embeddings for each token, and token class labels.
    """
    spans_df = tp.iob_to_spans(df)
    bert_toks_df = tp.make_bert_tokens(df["char_span"].values[0].target_text, 
                                       tokenizer)
    aligned_spans_df = align_spans(spans_df, bert_toks_df)
    iob_tags_df = add_iob_tags(bert_toks_df, aligned_spans_df)
    classes_df = add_token_classes(iob_tags_df)
    embeddings_df = add_embeddings(classes_df)
    return embeddings_df

# Rerun our example document to verify that our new function does the same 
# operations as the original code.
process_doc(example_df)

In [None]:
# Read the training and test sets.
train_raw = tp.conll_2003_to_dataframes("../conll_03/eng.train")
test_raw = tp.conll_2003_to_dataframes("../conll_03/eng.testb")

train_raw[0]

In [None]:
# Run the entire test set through our processing pipeline.
test = run_with_progress_bar(
    len(test_raw), lambda i: process_doc(test_raw[i]))
test[20]

In [None]:
# Run the entire validation set through our processing pipeline.
valid = run_with_progress_bar(
    len(valid_raw), lambda i: process_doc(valid_raw[i]))
valid[5]

In [None]:
# Run the entire training set through our processing pipeline.
train = run_with_progress_bar(
    len(train_raw), lambda i: process_doc(train_raw[i]))
train[42]

In [None]:
# Create a single dataframe with the entire corpus's embeddings.
def prep_for_stacking(collection_name: str, doc_num: int, df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "collection": collection_name,
        "doc_num": doc_num,
        "token_id": df["id"],
        "ent_iob": df["ent_iob"],
        "ent_type": df["ent_type"],
        "token_class": df["token_class"],
        "token_class_id": df["token_class_id"],
        "embedding": df["embedding"]
    })

to_stack = (
    [prep_for_stacking("train", i, train[i]) for i in range(len(train))]
    + [prep_for_stacking("test", i, test[i]) for i in range(len(test))]
    + [prep_for_stacking("valid", i, valid[i]) for i in range(len(valid))]
)
corpus_df = pd.concat(to_stack).reset_index(drop=True)
corpus_df

In [None]:
# Write the tokenized corpus with embeddings to a Feather file.
corpus_df.to_feather("outputs/corpus.feather")

In [None]:
# Stuff all the span information into a single dictionary of lists.
# Dictionary index corresponds to the "collection" column of `corpus_df`,
# and list index corresponds to the "doc_num" column of `corpus_df`.
all_span_dfs = {
    "train": train,
    "test": test,
    "valid": valid
}

# Step 3: Train models

Use the embeddings to quickly train multiple models at multiple levels of sophistication. Split the corpus into 10 parts and perform a 10-fold cross-validation.
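
As a rough illustration of the 10-fold split, the sketch below partitions documents (not individual tokens) into folds, so that all tokens of a document land on the same side of each train/evaluate split. Splitting at the document level is an assumption made here for illustration, and `make_folds` and the sample document keys are hypothetical.

```python
import random
from typing import List, Tuple

DocKey = Tuple[str, int]  # (collection, doc_num)


def make_folds(doc_keys: List[DocKey], num_folds: int = 10,
               seed: int = 42) -> List[List[DocKey]]:
    """Shuffle the documents and deal them out into `num_folds` groups."""
    shuffled = list(doc_keys)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_folds] for i in range(num_folds)]


# Small stand-in corpus: 25 "train" documents and 7 "valid" documents.
doc_keys = [("train", i) for i in range(25)] + [("valid", i) for i in range(7)]
folds = make_folds(doc_keys)

for k, held_out in enumerate(folds):
    held_out_set = set(held_out)
    train_keys = [d for d in doc_keys if d not in held_out_set]
    print(f"fold {k}: train on {len(train_keys)} docs, evaluate on {len(held_out)}")
```

Each document is held out exactly once, so after all ten rounds every token in the corpus has a prediction from a model that never saw it during training.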

In [None]:
# Additional initialization boilerplate
import sklearn.linear_model

In [None]:
#corpus_df = pd.read_feather("outputs/corpus.feather")
corpus_df

In [None]:
train_df = corpus_df[corpus_df["collection"] == "train"]
train_df

In [None]:
# Train a multinomial logistic regression model on the training set.
_MULTI_CLASS = "multinomial"
base_pipeline = sklearn.pipeline.Pipeline([
    # Standard scaler. This only makes a difference for certain classes
    # of embeddings.
    #("scaler", sklearn.preprocessing.StandardScaler()),
    ("mlogreg", sklearn.linear_model.LogisticRegression(
        multi_class=_MULTI_CLASS,
        verbose=10,
        max_iter=10000
    ))
])

X_train = train_df["embedding"].values
Y_train = train_df["token_class_id"]
base_model = base_pipeline.fit(X_train, Y_train)
base_model

In [None]:
# Wrap the model's predict method in some pre/post processing
def decode_tags(tags):
    iobs = ["O" if t == "O" else t[:1] for t in tags]
    types = [None if t == "O" else t.split("-")[1] for t in tags]
    return iobs, types

def predict_on_df(df: pd.DataFrame, predictor):
    id_to_class = df["token_class"].values.categories.values

    X = df["embedding"].values
    result_df = df.copy()
    result_df["predicted_id"] = predictor.predict(X)
    result_df["predicted_class"] = [id_to_class[i] for i in result_df["predicted_id"].values]
    iobs, types = decode_tags(result_df["predicted_class"].values)
    result_df["predicted_iob"] = iobs
    result_df["predicted_type"] = types
    return result_df

# Look at results on the training set
train_results_df = predict_on_df(train_df, base_model)
train_results_df

In [None]:
train_results_df.iloc[50:75]

In [None]:
# Look at results on the validation set
valid_results_df = predict_on_df(corpus_df[corpus_df["collection"] == "valid"], base_model)
valid_results_df

In [None]:
valid_results_df.iloc[40:70]

In [None]:
# Split model outputs for an entire fold back into documents and add
# token information.
def split_by_doc(df: pd.DataFrame) -> List[pd.DataFrame]:
    all_pairs = df[["collection", "doc_num"]].drop_duplicates().to_records(index=False)
    indexed_df = df.set_index(["collection", "doc_num", "token_id"], verify_integrity=True)
    results = []
    for collection, doc_num in all_pairs:
        doc_slice = indexed_df.loc[collection, doc_num].reset_index()
        doc_toks = all_span_dfs[collection][doc_num][
            ["id", "char_span", "token_span", "ent_iob", "ent_type"]
        ].rename(columns={"id": "token_id"})
        result_df = doc_toks.copy().merge(
            doc_slice[["token_id", "predicted_iob", "predicted_type"]])
        results.append(result_df)
    return results

valid_results_by_doc = split_by_doc(valid_results_df)
valid_results_by_doc[5]

In [None]:
# Convert IOB-format output (and gold standard tags) to spans.
def convert_to_spans(results_by_doc: List[pd.DataFrame]):
    actual_spans = [tp.iob_to_spans(v) for v in results_by_doc]
    model_spans = [
        tp.iob_to_spans(v, iob_col_name = "predicted_iob",
                        entity_type_col_name = "predicted_type")
          .rename(columns={"predicted_type": "ent_type"})
        for v in results_by_doc]
    return actual_spans, model_spans

valid_actual_spans, valid_model_spans = convert_to_spans(valid_results_by_doc)
valid_model_spans[0].head()

In [None]:
# Same per-document statistics calculation code as in CoNLL_2.ipynb
def make_stats_df(gold_dfs, output_dfs):
    num_true_positives = [len(gold_dfs[i].merge(output_dfs[i]).index)
                          for i in range(len(gold_dfs))]
    num_extracted = [len(df.index) for df in output_dfs]
    num_entities = [len(df.index) for df in gold_dfs]
    doc_num = np.arange(len(gold_dfs))

    stats_by_doc = pd.DataFrame({
        "doc_num": doc_num,
        "num_true_positives": num_true_positives,
        "num_extracted": num_extracted,
        "num_entities": num_entities
    })
    stats_by_doc["precision"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_extracted"]
    stats_by_doc["recall"] = stats_by_doc["num_true_positives"] / stats_by_doc["num_entities"]
    stats_by_doc["F1"] = 2.0 * (stats_by_doc["precision"] * stats_by_doc["recall"]) / (stats_by_doc["precision"] + stats_by_doc["recall"])
    return stats_by_doc

valid_stats_by_doc = make_stats_df(valid_actual_spans, valid_model_spans)
valid_stats_by_doc

In [None]:
# Collection-wide precision and recall can be computed by aggregating
# our dataframe.
def compute_global_scores(stats_by_doc: pd.DataFrame):
    num_true_positives = stats_by_doc["num_true_positives"].sum()
    num_entities = stats_by_doc["num_entities"].sum()
    num_extracted = stats_by_doc["num_extracted"].sum()

    precision = num_true_positives / num_extracted
    recall = num_true_positives / num_entities
    F1 = 2.0 * (precision * recall) / (precision + recall)
    return {
        "num_true_positives": num_true_positives,
        "num_entities": num_entities,
        "num_extracted": num_extracted,
        "precision": precision,
        "recall": recall,
        "F1": F1
    }

compute_global_scores(valid_stats_by_doc)

In [None]:
# Combine the above postprocessing steps into a single function.
def analyze_model(target_df: pd.DataFrame, predictor):
    """
    Score a model on a target set of documents.
    
    :param target_df: Dataframe of tokens across documents with precomputed
     embeddings in a column called "embedding"
    :param predictor: Trained model with a `predict()` function that accepts
     the contents of the "embedding" column of `target_df`
    """
    results_df = predict_on_df(target_df, predictor)
    results_by_doc = split_by_doc(results_df)
    actual_spans_by_doc, model_spans_by_doc = convert_to_spans(results_by_doc)
    stats_by_doc = make_stats_df(actual_spans_by_doc, model_spans_by_doc)
    return {
        "results_by_doc": results_by_doc,
        "actual_spans_by_doc": actual_spans_by_doc,
        "model_spans_by_doc": model_spans_by_doc,
        "stats_by_doc": stats_by_doc,
        "global_scores": compute_global_scores(stats_by_doc)
    }

base_validation_results = analyze_model(corpus_df[corpus_df["collection"] == "valid"],
                                        base_model)
base_validation_results["global_scores"]

In [None]:
# Results on the training set
base_train_results = analyze_model(corpus_df[corpus_df["collection"] == "train"],
                                   base_model)
base_train_results["global_scores"]

In [None]:
# Results on the test set
base_test_results = analyze_model(corpus_df[corpus_df["collection"] == "test"],
                                  base_model)
base_test_results["global_scores"]

## Train models with reduced result quality

Define a function that produces detuned versions of our multinomial logistic regression model.
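
To see why a random projection detunes the model, it helps to look at what the projection does to a single embedding. The sketch below is a plain-Python illustration under the assumption that the projection matrix has independent Gaussian entries with variance `1 / n_components`; `make_projection` and `project` are hypothetical helpers, and the 768-element vector is a stand-in for a real embedding.

```python
import math
import random
from typing import List


def make_projection(n_features: int, n_components: int, seed: int) -> List[List[float]]:
    """Build an n_components x n_features matrix of Gaussian random weights."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, scale) for _ in range(n_features)]
            for _ in range(n_components)]


def project(vector: List[float], matrix: List[List[float]]) -> List[float]:
    """Multiply the projection matrix by one embedding vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


embedding = [0.01 * i for i in range(768)]  # stand-in embedding
matrix = make_projection(n_features=768, n_components=16, seed=1)
reduced = project(embedding, matrix)
print(len(embedding), "->", len(reduced))
```

Fewer output dimensions leave the classifier less information to work with, and a different seed discards different information, so varying both parameters yields a family of models that make different mistakes.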

In [None]:
import sklearn.random_projection

def train_reduced_model(X: np.ndarray, Y: np.ndarray, n_components: int, seed: Optional[int] = 42):
    """
    Train a reduced-quality model by putting a Gaussian random projection in
    front of the multinomial logistic regression stage of the pipeline.
    
    :param X: input embeddings for training set
    :param Y: integer labels corresponding to embeddings
    :param n_components: Number of dimensions to reduce the embeddings to
    :param seed: Random seed to drive Gaussian random projection
    """
    reduce_pipeline = sklearn.pipeline.Pipeline([
        ("dimred", sklearn.random_projection.GaussianRandomProjection(
            n_components=n_components,
            random_state=seed
        )),
        ("mlogreg", sklearn.linear_model.LogisticRegression(
            multi_class=_MULTI_CLASS,
            verbose=10,
            max_iter=10000
        ))
    ])
    print(f"Training model with n_components={n_components} and seed={seed}.")
    return reduce_pipeline.fit(X, Y)

reduce_model = train_reduced_model(X_train, Y_train, 16, None)
reduce_model

In [None]:
reduce_validation_results = analyze_model(corpus_df[corpus_df["collection"] == "valid"],
                                          reduce_model)
reduce_validation_results["global_scores"]

# Step 4: Analyze model outputs

Repeat the process from `CoNLL_2.ipynb` on each fold of the 10-fold cross-validation, comparing the outputs of every model on the validation set for each fold.

## Dry run: Train on the original training set

Using the original CoNLL 2003 training set, train multiple models at
different quality levels. Repeat the evaluation process 
from [`CoNLL_2.ipynb`](./CoNLL_2.ipynb) and verify that the ensemble
of models can pinpoint incorrect labels in the validation data as in
`CoNLL_2.ipynb`.
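
The core of that evaluation is a vote count: for every span that appears in the gold standard or in any model's output, count how many models produced it. The sketch below shows the idea with plain Python sets. It is a simplified stand-in for the dataframe-based code in this notebook; `count_votes`, the span tuples, and the sample model outputs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (begin, end, ent_type)


def count_votes(gold: Set[Span],
                model_outputs: Dict[str, Set[Span]]) -> List[Tuple[Span, bool, int]]:
    """Return (span, in_gold, num_models) for every span seen anywhere."""
    all_spans = set(gold).union(*model_outputs.values())
    rows = []
    for span in sorted(all_spans):
        num_models = sum(span in spans for spans in model_outputs.values())
        rows.append((span, span in gold, num_models))
    return rows


gold = {(0, 12, "PER"), (20, 26, "LOC")}
model_outputs = {
    "8_1": {(0, 12, "PER"), (30, 36, "ORG")},
    "16_1": {(0, 12, "PER"), (30, 36, "ORG")},
    "32_1": {(0, 12, "PER"), (30, 36, "ORG")},
}

rows = count_votes(gold, model_outputs)
# Gold spans that no model found: candidates for incorrect gold labels.
suspect_gold = [s for s, in_gold, n in rows if in_gold and n == 0]
# Non-gold spans that every model found: candidates for missing gold labels.
suspect_missing = [s for s, in_gold, n in rows
                   if not in_gold and n == len(model_outputs)]
print(suspect_gold, suspect_missing)
```

Spans at either extreme are only candidates. As the lists later in this section show, many of them turn out to be genuine model errors once a person inspects them.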

In [None]:
# Define some constants for a grid over two parameters.
# Not really a grid search, since every element of the grid provides value.
_N_COMPONENTS = [8, 16, 32, 64, 128]  # Values for the n_components parameter
_SEEDS = [1, 2, 3, 4]  # Values for the random seed

params = [{"n_components": c, "seed": s} for c in _N_COMPONENTS for s in _SEEDS]

def params_to_name(p):
    return f"{p['n_components']}_{p['seed']}"

models = {
    params_to_name(p): train_reduced_model(X_train, Y_train,
                                           p["n_components"], p["seed"])
    for p in params
}

In [None]:
validation_df = corpus_df[corpus_df["collection"] == "valid"]
validation_results = {
    name: analyze_model(validation_df, model) for name, model in models.items()
}
list(validation_results.values())[0].keys()

In [None]:
global_scores = [r["global_scores"] for r in validation_results.values()]

summary_df = pd.DataFrame({
    "n_components": [p["n_components"] for p in params],
    "seed": [p["seed"] for p in params],
    "name": list(validation_results.keys()),
    "num_true_positives": [r["num_true_positives"] for r in global_scores],
    "num_entities": [r["num_entities"] for r in global_scores],
    "num_extracted": [r["num_extracted"] for r in global_scores],
    "precision": [r["precision"] for r in global_scores],
    "recall": [r["recall"] for r in global_scores],
    "F1": [r["F1"] for r in global_scores]
})
summary_df

In [None]:
# Tabulate all the results as in CoNLL_2.ipynb
results = validation_results

model_names = list(results.keys())
first_model_name = model_names[0]
first_results = results[first_model_name]
gold_standard_by_doc = first_results["actual_spans_by_doc"]
num_docs = len(gold_standard_by_doc)


def df_for_doc(i):
    df = None
    for model_name in model_names:
        actual_spans_df = results[model_name]["actual_spans_by_doc"][i]
        model_spans_df = results[model_name]["model_spans_by_doc"][i]
        joined_results = pd.merge(actual_spans_df, model_spans_df, how="outer", indicator=True)
        joined_results["gold"] = joined_results["_merge"].isin(["left_only", "both"])
        joined_results[model_name] = joined_results["_merge"].isin(["right_only", "both"])
        joined_results = joined_results.drop(columns="_merge")
        if df is None:
            df = joined_results
        else:
            df = df.merge(joined_results, how="outer", 
                          on=["token_span", "ent_type", "gold"])           
    # TokenSpanArrays from different documents can't currently be stacked,
    # so convert to TokenSpan objects.
    df["token_span"] = df["token_span"].astype(object)
    df = df.fillna(False)
    vectors = df[df.columns[3:]].values
    counts = np.count_nonzero(vectors, axis=1)
    df["num_models"] = counts
    df.insert(0, "doc_num", i)
    return df

to_stack = run_with_progress_bar(num_docs, df_for_doc)
to_stack[42].head()

In [None]:
# Aggregate all results as before
all_results = pd.concat(
    [df[["doc_num", "token_span", "ent_type", "gold", "num_models"]] for df in to_stack]
)
all_results

In [None]:
# How many entities were found by zero models?
(all_results[all_results["gold"] == True][["num_models", "token_span"]]
 .groupby("num_models").count()
 .rename(columns={"token_span": "count"}))

In [None]:
# How many non-results were found by many models?
(all_results[all_results["gold"] == False][["num_models", "token_span"]]
 .groupby("num_models").count()
 .rename(columns={"token_span": "count"}))

In [None]:
# Hardest results from the gold standard to get.
# Use document ID to break ties.
hard_to_get = all_results[all_results["gold"]].sort_values("num_models").head(20)
hard_to_get

## Incorrect results from the above 20:

* Document 200: `[481, 486): 'Chris'`: Should be first word of `Chris Sutton` (`PER` entity)
* Document 200: `[487, 493): 'Sutton'`: Should be second word of `Chris Sutton` (`PER` entity)
* Document 220: `[834, 841): 'PACIFIC'`: Should be first word of `PACIFIC DIVISION` (`MISC` or `ORG` entity, not sure which)
* Document 201: `[19, 35): 'NORTHERN IRELAND'`: Should be first two tokens of `NORTHERN IRELAND PREMIER DIVISION` (`MISC` or `ORG` entity, not sure which)
* Document 56: `[84, 107): 'Department of Transport'`: Should be part of the larger `ORG` entity "UK Department of Transport"
* Document 56: `[81, 83): 'UK'`: Should be part of the larger `ORG` entity "UK Department of Transport"
* Document 55: `[129, 134): 'Czech'`: Should be labeled `MISC` because it's an adjective

## Correct results:

* Document 170, `[498, 509): 'Budisuryana'`: Name of a ship, difficult to identify as `MISC`
* Document 60, `[930, 932): 'AA'`: Abbreviation for "American Airlines"
* Document 220, `[981, 991): 'SACRAMENTO'`: Refers to the Sacramento basketball team
* Document 220, `[952, 964): 'GOLDEN STATE'`: Basketball team
* Document 220, `[807, 816): 'VANCOUVER'`: Basketball team
* Document 60, `[547, 561): 'trans-Atlantic'`: Adjective form of "Atlantic", labeled as `MISC`
* Document 60, `[981, 995): 'trans-Atlantic'`: Adjective form of "Atlantic", labeled as `MISC`
* Document 60: `[346, 364): 'Trade and Industry'`: a British ministry
* Document 57: `[65, 81): 'PT Tambang Timah'`: a company in Indonesia. "PT" stands for "Perseroan Terbatas", Indonesian for "Limited Liability Company".
* Document 202: `[150, 156): 'Widnes'`: Refers to the Rugby team from Widnes
* Document 56: `[11, 16): 'UK-US'`: Adjective referring to talks between two countries
* Document 55: `[60, 66): 'PRAGUE'`: Location from dateline

In [None]:
# Scratchpad for looking at individual docs
# doc_results = gold_standard_by_doc[55]
# doc_results

In [None]:
# Part 2 of scratchpad
# doc_results["token_span"].values

In [None]:
# Results missing from the gold standard that were hardest for the models to avoid
hard_to_avoid = all_results[~all_results["gold"]].sort_values("num_models", ascending=False).head(20)
hard_to_avoid

## Missing results in gold standard from the above 20:

* Document 186: `[1280, 1296): 'Florence Masnada'`/`PER` incorrectly labeled as `[1280, 1288): 'Florence'`/`LOC` and `[1289, 1296): 'Masnada'`/`PER`
* Document 193: `[745, 760): 'United Province'`/`ORG` should be `LOC`, because it refers to [the Indian province now known as Uttar Pradesh](https://en.wikipedia.org/wiki/United_Provinces_(1937%E2%80%9350))
* Document 41: `[676, 690): 'Sporting Gijon'`/`ORG` incorrectly labeled as `[676, 684): 'Sporting'`/`ORG`
* Document 222: `[93, 115): 'National Hockey League'`/`MISC` incorrectly labeled as `[93, 108): 'National Hockey'`/`ORG` and `[109, 115): 'League'`/`ORG` due to incorrect sentence boundary in corpus.
* Document 31: `[561, 568): 'Schalke'`/`ORG` incorrectly labeled as `[561, 571): 'Schalke 04'`/`ORG`
* Document 161: `[76, 98): 'John Lewis Partnership'`/`ORG` (NOTE: Could also be "The John Lewis Partnership") incorrectly labeled as `[76, 86): 'John Lewis'`/`PER`.
* Document 213: `[697, 708): 'Dion Fourie'`/`PER` incorrectly labeled as `[697, 701): 'Dion'`/`PER` and `[702, 708): 'Fourie'`/`PER` due to incorrect sentence boundary in corpus.
* Document 205: `[627, 636): 'Wimbledon'`/`ORG` (Wimbledon soccer team) incorrectly labeled as `LOC`
* Document 199: `[487, 501): 'Robert Winters'`/`PER` incorrectly labeled as `[487, 493): 'Robert'` and `[494, 501): 'Winters'` due to incorrect sentence boundary in corpus.
* Document 11: `[495, 509): 'Desvonde Botes'`/`PER` incorrectly labeled as `[495, 503): 'Desvonde'` and `[504, 509): 'Botes'` due to incorrect sentence boundary in corpus.
* Document 223: `[289, 295): 'Ottawa'`/`ORG` incorrectly labeled as `LOC`.
* Document 166: `[42, 50): 'NZ First'`/`ORG` (short for the New Zealand First political party) incorrectly labeled as `[42, 44): 'NZ'`/`LOC`
* Document 168: `[443, 471): 'Australian Capital Territory'`/`LOC` (a [federal territory of Australia](https://en.wikipedia.org/wiki/Australian_Capital_Territory)) incorrectly labeled as `MISC`

## Incorrect results in model output from above 20:

* Document 180: `[286, 293): 'Malysia'`: From context, this is a misspelling of the country name "Malaysia".
* Document 47: `[357, 364): 'English'`: First two syllables of "Englishman"
* Document 31: `[532, 542): 'FC Cologne'`: Final two tokens of `[529, 542): '1. FC Cologne'`
* Document 202: `[151, 156): 'idnes'`: Last 5 letters of `[150, 156): 'Widnes'`
* Document 116: `[863, 869): 'Canola'`/`ORG`: Canola oil
* Document 116: `[1123, 1134): 'Canola Corn'`/`ORG`: Two table headers for two types of vegetable oil
* Document 180: `[876, 878): 'US'`/`MISC`: First token of "US$"



In [None]:
# Scratchpad for looking at individual docs
# doc_num = 168
# doc_results = gold_standard_by_doc[doc_num]
# doc_results

In [None]:
# Part 2 of scratchpad
# doc_results["token_span"].values

In [None]:
# Part 3 of scratchpad (for looking at original tokenization)
#valid_raw[doc_num].head(50)

## Experiment: 

# Step 5: Inspect and correct

Analyze the results of the models to pinpoint potential incorrect labels. Inspect those labels manually and build up a list of labels that are actually incorrect.

In [None]:
# TODO