## CoNLL-2003 Example for Text Extensions for Pandas
### Part 3

To run this notebook, you will need to obtain a copy of the CoNLL-2003 data set's corpus.
Drop the corpus's files into the following locations:
* conll_03/eng.testa
* conll_03/eng.testb
* conll_03/eng.train

If you are unfamiliar with the basics of Text Extensions for Pandas, we recommend you 
start with Part 1 of this example.

### Introduction

At the end of part 2 of the demo, we showed that the CoNLL-2003 validation set contains incorrect labels, and that you can pinpoint those labels by data-mining the results of the 16 models the competitors submitted.

Our goal for part 3 of the demo is to pinpoint incorrect labels across the entire data set. The (rough) process to do so will be:

1. Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.
2. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a dataframe column (of type TensorType) alongside the tokens and labels.
3. Use the embeddings to quickly train multiple models at multiple levels of sophistication (something like: SVMs, random forests, and LSTMs with small and large numbers of hidden states). Split the corpus into 10 parts and perform a 10-fold cross-validation.
4. Repeat the process from part 2 on each fold of the 10-fold cross-validation, comparing the outputs of every model on the validation set for each fold.
5. ?
6. Profit!
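
As a rough illustration of the 10-fold split in step 3, here is a minimal sketch that assigns documents to folds with NumPy alone. Splitting at the document level, the helper names `make_folds` and `train_validation_splits`, the document count of 25, and the random seed are all assumptions made for this sketch; none of them come from the library or from the earlier parts of the demo.

```python
import numpy as np

def make_folds(num_docs: int, num_folds: int = 10, seed: int = 42):
    """Randomly assign each document index to one of `num_folds` folds.
    (Hypothetical helper, for illustration only.)"""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_docs)
    # array_split tolerates a document count that is not a multiple of num_folds
    return [np.sort(fold) for fold in np.array_split(shuffled, num_folds)]

def train_validation_splits(folds):
    """Yield (train_indices, validation_indices), holding out one fold at a time."""
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield np.sort(train), held_out

folds = make_folds(num_docs=25, num_folds=10)
for train_ix, val_ix in train_validation_splits(folds):
    # For every fold, each document is in exactly one of the two sets.
    assert len(train_ix) + len(val_ix) == 25
    assert np.intersect1d(train_ix, val_ix).size == 0
```

Each `(train_ix, val_ix)` pair could then index into a list of per-document dataframes, so that every document is scored exactly once as validation data across the ten folds.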


In [None]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    sys.path.insert(0, "..")
    
# Libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import BertTokenizerFast

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

In [None]:
# Read in the corpus in its original tokenization
gold_standard = tp.conll_2003_to_dataframes("../conll_03/eng.testa")
gold_standard_spans = [tp.iob_to_spans(df) for df in gold_standard]

In [None]:
gold_standard[0]

In [None]:
def make_bert_tokens(target_text: str, 
                     tokenizer: BertTokenizerFast # TODO: More general type?
                    ) -> pd.DataFrame:
    """
    Tokenize the indicated text for BERT embeddings and return a dataframe
    with one row per token.
    
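    :param target_text: text to tokenize
    :param tokenizer: fast BERT tokenizer, able to report character offsets
    
    :returns: A `pd.DataFrame` with the following columns:
     * "id": unique integer ID for each token
     * "char_span": span of the token with character offsets
     * "token_span": span of the token with token offsets
     * "input_id": integer ID suitable for input to a BERT embedding model
     * "token_type_id": segment ID that BERT uses to tell the two halves of a
       sequence pair apart; always 0 here because only one sequence is encoded
     * "attention_mask": 1 for tokens the model should attend to and 0 for
       padding; always 1 here because no padding is added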
     * "special_tokens_mask": `True` if the token is a zero-length special token
       such as "start of document"
    """
    tokenized_result = tokenizer.encode_plus(target_text, 
                                             return_special_tokens_mask=True, 
                                             return_offsets_mapping=True)
    df = pd.DataFrame.from_dict(tokenized_result)
    # Get offset mapping from tokenizer
    offsets = tokenized_result["offset_mapping"]

    # Turn special tokens into zero-length spans
    begins = np.zeros(len(df.index), dtype=np.int64)
    ends = np.zeros(len(df.index), dtype=np.int64)
    prev_begin = 0
    for i in range(len(df.index)):
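        if offsets[i] is None or tokenized_result["special_tokens_mask"][i]:
            # Special token --> zero length span. Depending on the tokenizer
            # version, the reported offset may be None or (0, 0).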
            begins[i] = prev_begin
            ends[i] = prev_begin
        else:
            begins[i], ends[i] = offsets[i]
            prev_begin = begins[i]
  
    char_spans = tp.CharSpanArray(target_text, begins, ends)
    df["char_span"] = char_spans
    df["token_span"] = tp.TokenSpanArray(char_spans, list(range(len(char_spans))), 
                                         list(range(1, len(char_spans) + 1)))
    
    # Reformat to look more like the output of make_tokens_and_features()
    token_features = pd.DataFrame({
        "id": df.index,
        # Use .values instead of the series because the indexes differ
        "char_span": df["char_span"].values,
        "token_span": df["token_span"].values,
        "input_id": df["input_ids"].values,
        "token_type_id": df["token_type_ids"].values,
        "attention_mask": df["attention_mask"].values,
        # np.bool was removed from recent NumPy releases; use the builtin
        "special_tokens_mask": df["special_tokens_mask"].values.astype(bool)
    })
    return token_features

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', add_special_tokens=True)
#token_features = make_bert_tokens("I have a puppy.", tokenizer)
#token_features

In [None]:
# Retokenize the corpus's text with the BERT tokenizer
bert_toks = [make_bert_tokens(df["char_span"].values[0].target_text, tokenizer) for 
             df in gold_standard]
bert_toks[0]

In [None]:
doc_toks = bert_toks[0]
doc_toks[doc_toks["special_tokens_mask"]]

In [None]:
# Align the BERT tokens with the original tokenization and regenerate the IOB tags
# Start by converting the elements of gold_standard_spans to character spans
new_gold_standard_spans = []
for i in range(len(gold_standard_spans)):
    #print(f"document {i}")
    g = gold_standard_spans[i]
    char_spans = tp.CharSpanArray._from_sequence(g["token_span"].values)
    bert_token_spans = tp.TokenSpanArray.align_to_tokens(bert_toks[i]["char_span"],
                                                         char_spans)
    
    #print(f"char_spans:\n{char_spans}")
    #print(f"bert_token_spans:\n{bert_token_spans}")
    
    new_gold_standard_spans.append(pd.DataFrame({
        "char_span": char_spans,
        "token_span": g["token_span"].values,
        "bert_token_span": bert_token_spans
    }))                                            


In [None]:
new_gold_standard_spans[1]

In [None]:
# TODO: 
# * Move the functions above into text_extensions_for_pandas package
# * Add regression tests of the new functions
# * Add regression test of align_to_tokens
# * Implement the inverse of iob_to_spans(), say, "spans_to_iob"
# * In this notebook, use spans_to_iob() to add columns "ent_iob" and "ent_type"
#   to the BERT tokens dataframe for each document
# * Also in this notebook, align the sentences on BERT tokens and add a 
#   "sentence" column to the BERT tokens dataframe for each document
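
To make the planned inverse of `iob_to_spans()` more concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the name `spans_to_iob_sketch`, the use of plain `(begin_token, end_token, entity_type)` tuples instead of span arrays, and the choice of tags where every entity starts with `B` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

def spans_to_iob_sketch(num_tokens: int,
                        entity_spans: List[Tuple[int, int, str]]
                        ) -> Tuple[List[str], List[Optional[str]]]:
    """
    Hypothetical stand-in for the planned spans_to_iob().

    :param num_tokens: number of tokens in the document
    :param entity_spans: (begin_token, end_token, entity_type) triples with
     an exclusive end offset; spans are assumed not to overlap
    :returns: two parallel lists, one of IOB tags and one of entity types
     (None for tokens outside any entity)
    """
    iob = ["O"] * num_tokens
    ent_type: List[Optional[str]] = [None] * num_tokens
    for begin, end, etype in entity_spans:
        for i in range(begin, end):
            # First token of an entity gets "B", the rest get "I"
            iob[i] = "B" if i == begin else "I"
            ent_type[i] = etype
    return iob, ent_type

# Six tokens: a two-token PER entity, then a one-token LOC entity
iob, ent_type = spans_to_iob_sketch(6, [(0, 2, "PER"), (4, 5, "LOC")])
```

The two returned lists line up with the "ent_iob" and "ent_type" columns mentioned in the TODO list above, so they could be attached to a per-document tokens dataframe once the spans are expressed in BERT token offsets.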