## CoNLL-2003 Example for Text Extensions for Pandas
### Part 3

To run this notebook, you will need a copy of the CoNLL-2003 corpus.
Drop the corpus's files into the following locations:
* conll_03/eng.testa
* conll_03/eng.testb
* conll_03/eng.train

If you are unfamiliar with the basics of Text Extensions for Pandas, we recommend you 
start with Part 1 of this example.

### Introduction

At the end of Part 2 of the demo, we showed that the CoNLL-2003 validation set contains incorrect labels, and that you can pinpoint those labels by data-mining the results of the 16 models that the competitors submitted.

Our goal for Part 3 of the demo is to pinpoint incorrect labels across the entire data set. The (rough) process for doing so will be:

1. Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus onto the new tokenization.
2. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a DataFrame column (of type TensorType) alongside the tokens and labels.
3. Split the corpus into 10 parts for 10-fold cross-validation, then use the embeddings to quickly train multiple models at multiple levels of sophistication (something like: SVMs, random forests, and LSTMs with small and large numbers of hidden states).
4. Repeat the process from Part 2 on each fold of the 10-fold cross-validation, comparing the outputs of every model on the validation set for each fold.
5. ?
6. Profit!
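
The 10-fold split in steps 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the helper name `k_fold_splits`, the fixed seed, and the choice to split at the document level (so that tokens from one document never end up on both sides of a fold) are not taken from the notebook, which has not implemented this step yet.

```python
import random


def k_fold_splits(num_docs, k=10, seed=42):
    """Yield (train_ids, validation_ids) pairs for k-fold cross-validation.

    Hypothetical helper: document IDs are just the integers 0..num_docs-1.
    """
    doc_ids = list(range(num_docs))
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(doc_ids)
    # Deal the shuffled documents round-robin into k folds of near-equal size.
    folds = [doc_ids[i::k] for i in range(k)]
    for i in range(k):
        validation_ids = sorted(folds[i])
        train_ids = sorted(
            d for j, fold in enumerate(folds) if j != i for d in fold
        )
        yield train_ids, validation_ids


# Example: 25 documents split into 10 folds.
splits = list(k_fold_splits(num_docs=25, k=10))
```

Each document lands in the validation side of exactly one fold, so running the Part 2 analysis on every fold's validation set covers the whole corpus once.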


In [None]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    # Insert rather than overwrite, so the existing first entry is preserved.
    sys.path.insert(0, "..")
    
# Libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import BertTokenizerFast

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

In [None]:
# Retokenize corpus
# TODO
text = 'the quick brown fox bpbed over the bazy bog.'
# dfs = tp.conll_2003_to_dataframes('../resources/conll_03/ner/corpus/eng.train')
# df = dfs[0]
# df

In [None]:
# Tokenize the text with offset mappings
add_special_tokens = False
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# add_special_tokens is an argument of the encoding call, not of from_pretrained()
tokenized_result = tokenizer.encode_plus(text,
                                         add_special_tokens=add_special_tokens,
                                         return_special_tokens_mask=True,
                                         return_offsets_mapping=True)
tokenized_result

In [None]:
# Create DataFrame from tokenizer output
# Convert the tokenizer output to a plain dict so pandas builds one column per key
df = pd.DataFrame.from_dict(dict(tokenized_result))
df

In [None]:
# Get offset mapping from tokenizer
offsets = tokenized_result['offset_mapping']

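# Remove any special tokens ([CLS] and [SEP]), which do not map to any text
# TODO
if add_special_tokens:
    offsets = offsets[1:-1]
    # Drop the matching rows so the DataFrame stays aligned with the offsets
    df = df.iloc[1:-1].reset_index(drop=True)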

# Create a CharSpanArray from offsets
begins, ends = zip(*offsets)
df['char_span'] = tp.CharSpanArray(text, list(begins), list(ends))
df
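
The character offsets above are what make step 1 of the plan possible: labels from the original tokenization can be carried over to the new tokens by comparing spans. The following is a minimal sketch of that idea in plain Python, not the notebook's eventual implementation. The function `align_labels`, the sample sentence, and the subword split of "Acme" are all hypothetical, and the sketch assumes IOB2-style tags and `(begin, end)` spans with an exclusive end.

```python
def align_labels(orig_spans, orig_labels, new_spans, default="O"):
    """Give each new token the label of the original token that contains it.

    Spans are (begin, end) character offsets into the same text, end exclusive.
    """
    new_labels = []
    for new_begin, new_end in new_spans:
        label = default
        for (begin, end), orig_label in zip(orig_spans, orig_labels):
            if begin <= new_begin and new_end <= end:
                label = orig_label
                # A subword that does not start the original token continues
                # the entity, so "B-" becomes "I-".
                if label.startswith("B-") and new_begin > begin:
                    label = "I-" + label[2:]
                break
        new_labels.append(label)
    return new_labels


sample_text = "Acme Corp hired Bob"
orig_spans = [(0, 4), (5, 9), (10, 15), (16, 19)]
orig_labels = ["B-ORG", "I-ORG", "O", "B-PER"]
# Hypothetical subword split: "Acme" -> "Ac" + "me"
new_spans = [(0, 2), (2, 4), (5, 9), (10, 15), (16, 19)]
new_labels = align_labels(orig_spans, orig_labels, new_spans)
```

The nested loop is quadratic in the number of tokens, which is fine for a single sentence; a corpus-scale version would more likely walk both sorted span lists in one pass.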

In [None]:
# Generate BERT embeddings
# TODO: these random values are a placeholder for real BERT embeddings
emb = np.random.rand(len(df), 2)

df['bert_embeddings'] = tp.TensorArray(emb)
df