## CoNLL-2003 Example for Text Extensions for Pandas
### Part 3

To run this notebook, you will need a copy of the CoNLL-2003 corpus.
Place the corpus files in the following locations, relative to the parent of the notebook's working directory:
* conll_03/eng.testa
* conll_03/eng.testb
* conll_03/eng.train

If you are unfamiliar with the basics of Text Extensions for Pandas, we recommend you 
start with Part 1 of this example.

### Introduction

At the end of Part 2 of the demo, we showed that the CoNLL-2003 validation set contains incorrect labels, and that you can pinpoint those labels by data-mining the results of the 16 models that the competitors submitted.

Our goal for Part 3 of the demo is to pinpoint incorrect labels across the entire data set. The rough process is:

1. Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus onto the new tokenization.
2. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a dataframe column (of type TensorType) alongside the tokens and labels.
3. Use the embeddings to quickly train multiple models at multiple levels of sophistication (something like: SVMs, random forests, and LSTMs with small and large numbers of hidden states). Split the corpus into 10 parts and perform a 10-fold cross-validation.
4. Repeat the process from part 2 on each fold of the 10-fold cross-validation, comparing the outputs of every model on the validation set for each fold.
5. ?
6. Profit!
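
Steps 3 and 4 depend on a 10-fold split of the corpus. As a rough sketch of what such a split could look like, the snippet below shuffles document indices and deals them into folds using NumPy only. The helper name `split_into_folds`, the seed, and the toy document count are assumptions made for illustration; they are not part of Text Extensions for Pandas or of this notebook's pipeline.

```python
import numpy as np


def split_into_folds(num_docs, num_folds=10, seed=42):
    """Hypothetical helper: shuffle document indices and deal them into folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_docs)
    # array_split tolerates document counts that are not a multiple of num_folds
    return np.array_split(shuffled, num_folds)


# Toy example: 25 documents, 10 folds
folds = split_into_folds(num_docs=25, num_folds=10)
for k, holdout in enumerate(folds):
    # Every fold except fold k forms the training set for this round
    train = np.concatenate([f for j, f in enumerate(folds) if j != k])
    # Train the models on `train`, then compare their outputs on `holdout`
    print(f"fold {k}: {len(train)} training docs, {len(holdout)} held-out docs")
```

Splitting by document rather than by token keeps every token of a document on the same side of the split, which is one plausible design choice for this kind of corpus.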


In [None]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if ".." not in sys.path:
    sys.path.insert(0, "..")
    
# Libraries
import numpy as np
import pandas as pd
import torch
import transformers
from typing import *

import matplotlib.pyplot as plt
import ipywidgets
from IPython.display import display

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

In [None]:
# Read in the corpus in its original tokenization
gold_standard = tp.conll_2003_to_dataframes("../conll_03/eng.testa")
gold_standard_spans = [tp.iob_to_spans(df) for df in gold_standard]

In [None]:
gold_standard[0]

In [None]:
gold_standard_spans[0].head()

In [None]:
# Retokenize the corpus's text with the BERT tokenizer
bert_model_name = "bert-base-uncased"
tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name, 
                                                           add_special_tokens=True)
bert_toks = [tp.make_bert_tokens(df["char_span"].values[0].target_text, tokenizer) for 
             df in gold_standard]
bert_toks[0]

In [None]:
# BERT tokenization includes special tokens (such as [CLS] and [SEP]) that
# cover zero characters of the text. Select them with the special-tokens mask.
doc_toks = bert_toks[0]
doc_toks[doc_toks["special_tokens_mask"]]

In [None]:
# Align the BERT tokens with the original tokenization: for each entity span
# over the original tokens, find the matching span over the BERT tokens.
new_gold_standard_spans = []
for i in range(len(gold_standard_spans)):
    g = gold_standard_spans[i]
    original_spans = g["token_span"]
    bert_token_spans = tp.TokenSpanArray.align_to_tokens(bert_toks[i]["char_span"],
                                                         original_spans)
    new_gold_standard_spans.append(pd.DataFrame({
        "original_span": original_spans,
        "bert_token_span": bert_token_spans,
        "ent_type": g["ent_type"]
    }))                                            

new_gold_standard_spans[1].head()   

In [None]:
# Generate IOB2 tags that align with the BERT tokens.
# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
bert_gold_standard = [t.copy() for t in bert_toks]
for i in range(len(bert_gold_standard)):
    bert_gold_standard[i]["ent_iob"] = (
        tp.spans_to_iob(new_gold_standard_spans[i]["bert_token_span"]))

In [None]:
# Double-check that the new IOB2 tags are properly aligned
bert_gold_standard[1].head(20)
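
For readers unfamiliar with IOB2 tagging, the following is a minimal sketch of the idea in plain Python: the first token of each entity gets `B`, the remaining tokens of that entity get `I`, and every other token gets `O`. The helper `spans_to_iob2` and its `(begin, end)` token-range input are hypothetical and exist only for illustration; the notebook itself relies on `tp.spans_to_iob`.

```python
from typing import List, Tuple


def spans_to_iob2(num_tokens: int,
                  entity_spans: List[Tuple[int, int]]) -> List[str]:
    """Hypothetical helper: tag tokens given [begin, end) token ranges."""
    tags = ["O"] * num_tokens
    for begin, end in entity_spans:
        if begin >= end:
            continue  # Skip empty spans
        tags[begin] = "B"
        for i in range(begin + 1, end):
            tags[i] = "I"
    return tags


# Two adjacent entities: tokens 1-2 and token 3
tags = spans_to_iob2(6, [(1, 3), (3, 4)])
print(tags)
```

Because every entity starts with `B`, two entities that sit next to each other remain distinguishable.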

In [None]:
# Add the entity type tags to the new IOB2 data
for i in range(len(bert_gold_standard)):
    iob_df = bert_gold_standard[i]
    entity_df = new_gold_standard_spans[i]
    tmp_df = (
        tp
        .contain_join(entity_df["bert_token_span"], iob_df["token_span"],
                      "bert_token_span", "token_span")
        .merge(entity_df[["bert_token_span", "ent_type"]])
        [["token_span", "ent_type"]]
    )
    bert_gold_standard[i] = iob_df.merge(tmp_df, how="left")
    
# Display the same dataframe as in the previous cell to show that it now has
# an ent_type column
bert_gold_standard[1].head(20)
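
The `how="left"` argument in the final merge matters because most tokens are not part of any entity. The toy example below, which uses made-up column values and a simplified integer key in place of token spans, sketches how a left merge keeps every token row and leaves the entity type empty where there is no match.

```python
import pandas as pd

# Toy stand-ins: four tokens, two of which belong to a PER entity
tokens = pd.DataFrame({"token_id": [0, 1, 2, 3],
                       "ent_iob": ["O", "B", "I", "O"]})
entity_tokens = pd.DataFrame({"token_id": [1, 2],
                              "ent_type": ["PER", "PER"]})

# With no `on` argument, merge joins on the shared column (token_id).
# A left merge keeps all rows of `tokens`; unmatched rows get NaN.
merged = tokens.merge(entity_tokens, how="left")
print(merged)
```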

In [None]:
# Now we need to attach some embeddings. Fire up a canned BERT model.
bert = transformers.BertModel.from_pretrained(bert_model_name)

In [None]:
# Generate embeddings for document 1, to show how it's done
toks_df = bert_gold_standard[1]
input_ids = torch.from_numpy(toks_df["input_id"].values.reshape([1,-1]))
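bert_result = bert(input_ids=input_ids)
# The first element of the model's output holds the last layer's hidden
# states: one embedding vector per token.
hidden_states = bert_result[0]
hidden_states.detach().numpy()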

In [None]:
# Generate embeddings for document 1 again, but this time run the document
# through the BERT model in windows of 128 tokens, with the first and last
# 32 tokens of each window overlapping with the adjacent windows
_OVERLAP = 32
_NON_OVERLAP = 64

toks_df = bert_gold_standard[1]
flat_input_ids = toks_df["input_id"].values
windows = tp.seq_to_windows(flat_input_ids, _OVERLAP, _NON_OVERLAP)
bert_result = bert(input_ids=torch.tensor(windows["input_ids"]), 
                   attention_mask=torch.tensor(windows["attention_masks"]))
hidden_states = tp.windows_to_seq(flat_input_ids, 
                                  bert_result[0].detach().numpy(),
                                  _OVERLAP, _NON_OVERLAP)
hidden_states
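
To make the windowing idea concrete, here is a small NumPy sketch of one way to cut a sequence into fixed-length, overlapping windows with a matching attention mask. It is an illustration under stated assumptions, not the implementation of `tp.seq_to_windows`: the helper `make_windows`, its padding value, and the choice of stride (`overlap + non_overlap`) are all hypothetical.

```python
import numpy as np


def make_windows(seq, overlap, non_overlap, pad_value=0):
    """Hypothetical helper: cut `seq` into windows of length
    2 * overlap + non_overlap whose edges overlap with their neighbors."""
    seq = np.asarray(seq)
    window_len = 2 * overlap + non_overlap
    stride = overlap + non_overlap
    # Enough windows that every element falls inside at least one window
    num_windows = max(1, int(np.ceil((len(seq) - overlap) / stride)))
    ids = np.full((num_windows, window_len), pad_value, dtype=seq.dtype)
    mask = np.zeros((num_windows, window_len), dtype=np.int64)
    for w in range(num_windows):
        chunk = seq[w * stride : w * stride + window_len]
        ids[w, :len(chunk)] = chunk
        mask[w, :len(chunk)] = 1  # 1 marks real tokens, 0 marks padding
    return ids, mask


# 300 fake token IDs, using the same 32/64 layout as the cell above
ids, mask = make_windows(np.arange(300), overlap=32, non_overlap=64)
print(ids.shape, mask.sum(axis=1))
```

In this sketch, the last 32 positions of one window hold the same tokens as the first 32 positions of the next, so each token in the middle of a document is seen with context on both sides in at least one window.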

In [None]:
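# Define a function that adds embeddings to a dataframe.
def add_embeddings(df: pd.DataFrame) -> pd.DataFrame:
    """
    Return a copy of `df` with an "embeddings" column that holds the BERT
    embedding of each token. Expects an "input_id" column of BERT token IDs.
    """
    # Window layout: 32 + 64 + 32 = 128 token positions per window
    _OVERLAP = 32
    _NON_OVERLAP = 64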
    flat_input_ids = df["input_id"].values
    windows = tp.seq_to_windows(flat_input_ids, _OVERLAP, _NON_OVERLAP)
    bert_result = bert(input_ids=torch.tensor(windows["input_ids"]), 
                       attention_mask=torch.tensor(windows["attention_masks"]))
    hidden_states = tp.windows_to_seq(flat_input_ids, 
                                      bert_result[0].detach().numpy(),
                                      _OVERLAP, _NON_OVERLAP)
    embeddings = tp.TensorArray(hidden_states)
    ret = df.copy()
    ret["embeddings"] = embeddings
    return ret

# Test out the function on document 2.
add_embeddings(bert_gold_standard[2])

In [None]:
# Now we can add embeddings to the entire gold standard data set.
# This takes a while, so we display a progress bar.
num_docs = len(bert_gold_standard)
progress_bar = ipywidgets.IntProgress(0, 0, num_docs,
                                      description="Starting...",
                                      layout=ipywidgets.Layout(width="100%"),
                                      style={"description_width": "12%"})
display(progress_bar)
for i in range(num_docs):
    bert_gold_standard[i] = add_embeddings(bert_gold_standard[i])
    progress_bar.value = i + 1
    progress_bar.description = f"{i + 1}/{num_docs} docs"
progress_bar.bar_style = "success"

In [None]:
# What does the dataframe for document 5 look like now?
bert_gold_standard[5]