## CoNLL-2003 Example for Text Extensions for Pandas
### Part 4

To run this notebook, you will need to obtain a copy of the CoNLL-2003 data set's corpus.
Drop the corpus's files into the following locations, relative to the parent of the directory that contains this notebook:
* conll_03/eng.testa
* conll_03/eng.testb
* conll_03/eng.train

If you are unfamiliar with the basics of Text Extensions for Pandas, we recommend you 
start with Part 1 of this example.

### Introduction

At the end of Part 3 of this demo, we showed that we can train multiple synthetic models with different levels of deliberate imprecision. We then used that ensemble of models to pinpoint incorrect labels in the validation set, using the same methods that we employed in [`CoNLL_2.ipynb`](./CoNLL_2.ipynb).

Now we need to pinpoint incorrect labels across the entire data set, including train, test, and validation sets.



# Libraries and constants

In [None]:
# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    # Insert rather than overwrite, so the existing first entry is preserved.
    sys.path.insert(0, "..")
    
# Libraries
import numpy as np
import pandas as pd
import time
import torch
import transformers
from typing import *
import sklearn.model_selection
import sklearn.pipeline
import matplotlib.pyplot as plt

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

# Common code shared across notebooks comes from util.py
import util

# BERT Configuration
# Keep this in sync with `CoNLL_3.ipynb`.
#bert_model_name = "bert-base-uncased"
#bert_model_name = "bert-large-uncased"
bert_model_name = "dslim/bert-base-NER"
tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name, 
                                                           add_special_tokens=True)
bert = transformers.BertModel.from_pretrained(bert_model_name)

# Create a Pandas categorical type for consistent encoding of categories
# across all documents.
_ENTITY_TYPES = ["LOC", "MISC", "ORG", "PER"]
token_class_dtype, int_to_label, label_to_int = tp.make_iob_tag_categories(_ENTITY_TYPES)

# Parameters for splitting the corpus into folds
_KFOLD_RANDOM_SEED = 42
_KFOLD_NUM_FOLDS = 10
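
As an illustration of what a shared categorical type for IOB tags provides, here is a minimal sketch that uses only Pandas. The ordering of the tags below ("O" first, then the "B-" tags, then the "I-" tags) and the names `tag_dtype`, `int_to_tag`, and `tag_to_int` are assumptions made for this example; they are not necessarily what `tp.make_iob_tag_categories` produces.

```python
import pandas as pd

entity_types = ["LOC", "MISC", "ORG", "PER"]

# Assumed layout for illustration: "O", then all B-* tags, then all I-* tags.
tags = ["O"] + [f"{prefix}-{t}" for prefix in ("B", "I") for t in entity_types]
tag_dtype = pd.CategoricalDtype(categories=tags)

# Lookup tables between integer codes and tag strings.
int_to_tag = list(tag_dtype.categories)
tag_to_int = {tag: i for i, tag in enumerate(int_to_tag)}

# Any series created with this dtype encodes a given tag as the same integer,
# no matter which tags happen to occur in a particular document.
labels = pd.Series(["O", "B-PER", "I-PER", "O", "B-LOC"], dtype=tag_dtype)
codes = labels.cat.codes.tolist()
print(codes)
```

Because the category list is fixed up front, a document that contains no `ORG` entities still encodes `B-PER` with the same integer as every other document.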

# Read inputs

Read in the corpus, retokenize it with the BERT tokenizer, add BERT embeddings, and convert
to a single dataframe.

In [None]:
# The raw dataset in its original tokenization
raw_data = {
    "valid": tp.conll_2003_to_dataframes("../conll_03/eng.testa"),
    "test": tp.conll_2003_to_dataframes("../conll_03/eng.testb"),
    "train": tp.conll_2003_to_dataframes("../conll_03/eng.train"),
}

In [None]:
# The three folds of the dataset with BERT tokens and embeddings
bert_data = {
    key: util.run_with_progress_bar(
        len(val), 
        lambda i: util.conll_to_bert(val[i], tokenizer, bert, token_class_dtype))
    for key, val in raw_data.items()
}

In [None]:
# Single dataframe of annotated tokens for the entire corpus and index
corpus_df = util.combine_folds(bert_data)
corpus_df

# Prepare folds for a 10-fold cross-validation

We randomly partition the documents of the corpus into 10 folds of roughly equal size.

In [None]:
# One row per document: the (fold, doc_num) pair that identifies it
doc_keys = corpus_df[["fold", "doc_num"]].drop_duplicates().reset_index(drop=True)
doc_keys
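
The key extraction above can be tried on a small stand-in dataframe. This is a sketch only: the toy rows below are made up, and the `token` column name is an assumption rather than the actual schema of `corpus_df`.

```python
import pandas as pd

# Toy token-level dataframe: several tokens per document.
toy_tokens = pd.DataFrame({
    "fold": ["train", "train", "train", "test", "test"],
    "doc_num": [0, 0, 1, 0, 0],
    "token": ["EU", "rejects", "Peter", "SOCCER", "JAPAN"],
})

# Keep one row per distinct (fold, doc_num) pair, renumbered from zero.
toy_doc_keys = (
    toy_tokens[["fold", "doc_num"]].drop_duplicates().reset_index(drop=True)
)
print(toy_doc_keys)
```

Note that `doc_num` alone is not unique here: document 0 appears in both the "train" and the "test" fold, which is why the key needs both columns.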

In [None]:
# We want to split the documents randomly into _KFOLD_NUM_FOLDS sets, then
# for each stage of cross-validation train a model on the union of
# (_KFOLD_NUM_FOLDS - 1) of them while testing on the remaining fold.
# sklearn.model_selection doesn't implement this approach directly,
# but we can piece it together with some help from Numpy.
rng = np.random.default_rng(seed=_KFOLD_RANDOM_SEED)
iloc_order = rng.permutation(len(doc_keys.index))
kf = sklearn.model_selection.KFold(n_splits=_KFOLD_NUM_FOLDS)

train_keys = []
test_keys = []
for train_ix, test_ix in kf.split(iloc_order):
    # sklearn.model_selection.KFold gives us a partitioning of the
    # numbers from 0 to len(iloc_order) - 1. Use that partitioning to
    # choose elements from iloc_order, then use those elements to
    # index into doc_keys.
    train_iloc = iloc_order[train_ix]
    test_iloc = iloc_order[test_ix]
    train_keys.append(doc_keys.iloc[train_iloc])
    test_keys.append(doc_keys.iloc[test_iloc])

train_keys[1].head(10)
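
To check that this construction yields a proper partition, the same steps can be run on a small stand-in for `doc_keys`. The sketch below uses made-up keys and 5 folds instead of 10; the names `toy_keys`, `toy_train`, and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection

# Toy stand-in for doc_keys: 10 documents across two original folds.
toy_keys = pd.DataFrame({
    "fold": ["train"] * 7 + ["test"] * 3,
    "doc_num": list(range(7)) + list(range(3)),
})

# Shuffle document positions with a seeded generator, then let KFold
# partition the shuffled positions into contiguous blocks.
rng = np.random.default_rng(seed=42)
order = rng.permutation(len(toy_keys.index))
kf = sklearn.model_selection.KFold(n_splits=5)

toy_train, toy_test = [], []
for train_ix, test_ix in kf.split(order):
    toy_train.append(toy_keys.iloc[order[train_ix]])
    toy_test.append(toy_keys.iloc[order[test_ix]])

# Every document should land in exactly one test fold, and no document
# should be in both the train and test side of the same stage.
all_test = pd.concat(toy_test)
print(len(all_test), all_test.duplicated().any())
for tr, te in zip(toy_train, toy_test):
    print(len(tr), len(te), set(tr.index) & set(te.index))
```

A possible alternative is to pass `shuffle=True` and `random_state` to `KFold` and skip the explicit permutation; the manual version keeps the shuffled order visible in `iloc_order`.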