<font size=6>
    <b>Model_Training_with_TensorArray.ipynb:</b>
    <p>Use Text Extensions for Pandas TensorArray to integrate model training and prediction with Pandas.</p>
</font>

# Introduction

This notebook shows how to use the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) to integrate model training and prediction with [Pandas](https://pandas.pydata.org/) DataFrames. The library's Pandas extension array, `TensorArray`, adds support for columns that hold N-dimensional tensors.

## TensorArray Specification

A `TensorArray` represents an array of tensors where each element is an N-dimensional tensor of the same shape. If there are M tensor elements in the array, the entire `TensorArray` has shape (M, d1, ..., dN), where the outer dimension M is the number of elements and (d1, ..., dN) is the shape of each element. Backing the `TensorArray` is a single `numpy.ndarray` of that same shape. Standard arithmetic and comparison operations are supported and delegated to the backing ndarray. Taking a slice or selecting multiple items produces another `TensorArray`, while selecting a single element produces a `TensorElement` that also wraps a view of the `numpy.ndarray`, with similar operator support.
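
As a rough illustration of the layout described above, the following sketch uses a plain `numpy.ndarray` as a stand-in for the array that backs a `TensorArray`. It does not call the library itself, and the variable names are made up for the example.

```python
import numpy as np

# Stand-in for the backing array: M = 4 elements, each a 2 x 3 tensor.
backing = np.arange(24).reshape(4, 2, 3)

# The outer dimension counts elements; the rest is the per-element shape.
num_elements = backing.shape[0]
element_shape = backing.shape[1:]

# Arithmetic and comparisons apply element-wise across the whole array.
doubled = backing * 2
mask = backing > 10

# A slice keeps the outer dimension (another array of tensors),
# while a single index yields one tensor; both are views, not copies.
first_two = backing[:2]
single = backing[0]
```

In this picture, `first_two` plays the role of a smaller `TensorArray` and `single` plays the role of a `TensorElement`.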

A `TensorArray` can be constructed with zero copy from a single `numpy.ndarray`, or from a sequence of elements of the same shape. Conversion of a `TensorArray` to a `numpy.ndarray` can be done with zero copy by calling `TensorArray.to_numpy()` or by using the provided numpy array interface, e.g. `numpy.asarray(TensorArray(...))`. The `TensorArray` is a Pandas extension array with dtype `TensorType`; it can be wrapped in a `pandas.Series` or used as a column in a `pandas.DataFrame`, and it works with standard Pandas operations. A `NULL` or missing value in the `TensorArray` is represented as an N-dimensional `numpy.ndarray` where all items are `numpy.nan`.
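
The zero-copy conversion and the missing-value convention can be illustrated with NumPy alone. The sketch below assumes a float array standing in for the contents of a `TensorArray`; `is_missing` is a hypothetical name for the example, not a library attribute.

```python
import numpy as np

# Stand-in for the data behind a TensorArray of three 2-element tensors.
# The middle element is "missing": every item in it is NaN.
data = np.array([[1.0, 2.0], [np.nan, np.nan], [5.0, 6.0]])

# np.asarray on an existing ndarray reuses the same buffer (no copy).
view = np.asarray(data)
shares_buffer = np.shares_memory(view, data)

# Flag the elements whose items are all NaN.
is_missing = np.isnan(data).all(axis=1)
```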

## Model Training Example

To demonstrate the TensorArray functionality from Text Extensions for Pandas, we will use the CoNLL-2003 dataset (https://www.aclweb.org/anthology/W03-0419/) to:

1. Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.
1. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a DataFrame column (of type TensorType) alongside the tokens and labels.
1. Persist the DataFrame with computed BERT embeddings to disk as a checkpoint.
1. Use the embeddings to quickly train a model that will classify token class.
1. Make predictions of the token classes on the test set, adding predicted class probabilities as a TensorArray to the DataFrame.
1. Drill down into the DataFrame with some analysis on the predictions.


# Imports and constants

In [1]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if ".." not in sys.path:
    sys.path.insert(0, "..")

# Libraries
import numpy as np
import pandas as pd
import time
import torch
import transformers
from typing import *

import sklearn.pipeline

import matplotlib.pyplot as plt
import ipywidgets
from IPython.display import display

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

# How many iterations to run the L-BFGS optimizer when fitting logistic
# regression models. 100 ==> Fast; 10000 ==> Full convergence
LBGFS_ITERATIONS = 10000

# BERT Configuration
bert_model_name = "dslim/bert-base-NER"

tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name, 
                                                           add_special_tokens=True)
bert = transformers.BertModel.from_pretrained(bert_model_name)

# Create a Pandas categorical type for consistent encoding of categories
# across all documents.
_ENTITY_TYPES = ["LOC", "MISC", "ORG", "PER"]
token_class_dtype, int_to_label, label_to_int = tp.make_iob_tag_categories(_ENTITY_TYPES)

# TensorArray Basics

In [2]:
# Construct from a numpy.ndarray
arr = tp.TensorArray(np.arange(10).reshape(5, 2))
arr, arr.dtype

(array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]),
 <text_extensions_for_pandas.array.tensor.TensorType at 0x7f97bc035e90>)

In [3]:
# Wrap in a Pandas Series
s = pd.Series(arr)
s

0   [0 1]
1   [2 3]
2   [4 5]
3   [6 7]
4   [8 9]
dtype: TensorType

In [4]:
# Convert back to numpy using the provided array interface
np_arr = np.asarray(s)
np_arr, np_arr.dtype

(array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]),
 dtype('int64'))

In [5]:
# Apply operations on the Series, result is another Series of type TensorType
thresh = s > 4
thresh

0    [False False]
1    [False False]
2    [False  True]
3    [ True  True]
4    [ True  True]
dtype: TensorType

In [6]:
# Select items using a mix of numpy and pandas. Use `.array` to get the Series as a `TensorArray`,
# which can be used in numpy operations
s[np.all(thresh.array, axis=1)] * 2

3   [12 14]
4   [16 18]
dtype: TensorType

In [7]:
# TensorArray can also be added to a Pandas DataFrame
df = pd.DataFrame({"time": pd.date_range('2018-01-01', periods=5, freq='H'), "features": arr})
df

Unnamed: 0,time,features
0,2018-01-01 00:00:00,[0 1]
1,2018-01-01 01:00:00,[2 3]
2,2018-01-01 02:00:00,[4 5]
3,2018-01-01 03:00:00,[6 7]
4,2018-01-01 04:00:00,[8 9]


In [9]:
# TensorArray columns work with standard DataFrame operations
df.sort_values(by="time", ascending=False)

Unnamed: 0,time,features
4,2018-01-01 04:00:00,[8 9]
3,2018-01-01 03:00:00,[6 7]
2,2018-01-01 02:00:00,[4 5]
1,2018-01-01 01:00:00,[2 3]
0,2018-01-01 00:00:00,[0 1]


# Example: Using BERT Embeddings for Model Training on CoNLL-2003

Now let's put the TensorArray to use with an example. Here we use the CoNLL-2003 corpus and retokenize it with a BERT tokenizer. Next, the BERT tokens are used to compute BERT embeddings, which are stored in a TensorArray and added to the DataFrame. Finally, we use the BERT embeddings to train a model and predict the token class; the predicted class probabilities are also added to the DataFrame as a TensorArray.

In [9]:
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
#  to the terms of the license when using this data set!
data_set_info = tp.maybe_download_conll_data("outputs")
data_set_info

{'train': 'outputs/eng.train',
 'dev': 'outputs/eng.testa',
 'test': 'outputs/eng.testb'}

# Show how to retokenize with a BERT tokenizer.

Retokenize the corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.
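
The alignment step can be pictured as follows. This is a simplified sketch that works on plain `(begin, end)` character offsets rather than the library's span types, and `align_spans_to_tokens` is a hypothetical helper, not part of Text Extensions for Pandas.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]

def align_spans_to_tokens(tokens: List[Offsets],
                          spans: List[Offsets]) -> List[Tuple[int, int]]:
    """For each character span, return the [first, last + 1) range of
    token indices whose characters fall inside that span."""
    aligned = []
    for span_begin, span_end in spans:
        covered = [
            i for i, (tok_begin, tok_end) in enumerate(tokens)
            # Skip zero-length special tokens such as [CLS] and [SEP].
            if tok_begin < tok_end
            and tok_begin >= span_begin and tok_end <= span_end
        ]
        if not covered:
            raise ValueError(f"No tokens inside span [{span_begin}, {span_end})")
        aligned.append((covered[0], covered[-1] + 1))
    return aligned

# Made-up word-piece offsets for "NEW ZEALAND", split as NEW | Z | EA | LAND
# and surrounded by two zero-length special tokens.
tokens = [(0, 0), (0, 3), (4, 5), (5, 7), (7, 11), (0, 0)]
entity_spans = [(0, 11)]
token_ranges = align_spans_to_tokens(tokens, entity_spans)
```

Once each entity is expressed as a range of BERT tokens, its label can be carried over to every token in that range.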

In [10]:
# Read in the corpus in its original tokenization
corpus_raw = {}
for fold_name, file_name in data_set_info.items():
    df_list = tp.conll_2003_to_dataframes(file_name, 
                                          ["pos", "phrase", "ent"],
                                          [False, True, True])
    corpus_raw[fold_name] = [
        df.drop(columns=["pos", "phrase_iob", "phrase_type"])
        for df in df_list
    ]

test_raw = corpus_raw["test"]

# Pick out the dataframe for a single example document.
example_df = test_raw[5]
example_df

Unnamed: 0,span,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'",1469
1,"[11, 18): 'CRICKET'",O,,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1471
2,"[18, 19): '-'",O,,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1472
3,"[20, 28): 'PAKISTAN'",B,LOC,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1473
4,"[29, 30): 'V'",O,,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1474
...,...,...,...,...,...
350,"[1620, 1621): '8'",O,,"[1590, 1634): 'Third one-day match: December 8...",1865
351,"[1621, 1622): ','",O,,"[1590, 1634): 'Third one-day match: December 8...",1866
352,"[1623, 1625): 'in'",O,,"[1590, 1634): 'Third one-day match: December 8...",1867
353,"[1626, 1633): 'Karachi'",B,LOC,"[1590, 1634): 'Third one-day match: December 8...",1868


In [23]:
spans_df = tp.iob_to_spans(example_df)

# Add a TokenSpan in addition to the Span column, and re-order columns
spans_df["token_span"] = tp.TokenSpanArray.from_char_offsets(spans_df["span"].array)
spans_df = spans_df.reindex(columns=["span", "token_span", "ent_type"])
spans_df

Unnamed: 0,span,ent_type,token_span
0,"[20, 28): 'PAKISTAN'",LOC,"[20, 28): 'PAKISTAN'"
1,"[31, 42): 'NEW ZEALAND'",LOC,"[31, 42): 'NEW ZEALAND'"
2,"[80, 83): 'GMT'",MISC,"[80, 83): 'GMT'"
3,"[85, 92): 'SIALKOT'",LOC,"[85, 92): 'SIALKOT'"
4,"[94, 102): 'Pakistan'",LOC,"[94, 102): 'Pakistan'"
...,...,...,...
69,"[1488, 1501): 'Shahid Afridi'",PER,"[1488, 1501): 'Shahid Afridi'"
70,"[1512, 1523): 'Salim Malik'",PER,"[1512, 1523): 'Salim Malik'"
71,"[1535, 1545): 'Ijaz Ahmad'",PER,"[1535, 1545): 'Ijaz Ahmad'"
72,"[1565, 1573): 'Pakistan'",LOC,"[1565, 1573): 'Pakistan'"


In [24]:
# Retokenize the document's text with the BERT tokenizer
bert_toks_df = tp.make_bert_tokens(example_df["span"].values[0].target_text, tokenizer)

# Add a TokenSpan in addition to the Span column, and re-order columns
bert_toks_df["token_span"] = tp.TokenSpanArray.from_char_offsets(bert_toks_df["span"].array)
bert_toks_df = bert_toks_df.reindex(columns=["token_id", "span", "token_span", "input_id", "token_type_id", "attention_mask", "special_tokens_mask"])
bert_toks_df

Unnamed: 0,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask
0,0,"[0, 0): ''",101,0,1,True
1,1,"[0, 1): '-'",118,0,1,False
2,2,"[1, 2): 'D'",141,0,1,False
3,3,"[2, 4): 'OC'",9244,0,1,False
4,4,"[4, 6): 'ST'",9272,0,1,False
...,...,...,...,...,...,...
684,684,"[1621, 1622): ','",117,0,1,False
685,685,"[1623, 1625): 'in'",1107,0,1,False
686,686,"[1626, 1633): 'Karachi'",16237,0,1,False
687,687,"[1633, 1634): '.'",119,0,1,False


In [25]:
# BERT tokenization includes special zero-length tokens.
bert_toks_df[bert_toks_df["special_tokens_mask"]]

Unnamed: 0,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask
0,0,"[0, 0): ''",101,0,1,True
688,688,"[0, 0): ''",102,0,1,True


In [26]:
# Align the BERT tokens with the original tokenization
bert_token_spans = tp.TokenSpanArray.align_to_tokens(bert_toks_df["span"],
                                                     spans_df["token_span"])

pd.DataFrame({
    "original_span": spans_df["token_span"],
    "bert_token_span": bert_token_spans,
    "ent_type": spans_df["ent_type"]
})

Unnamed: 0,original_span,bert_token_span,ent_type
0,"[20, 28): 'PAKISTAN'","[20, 28): 'PAKISTAN'",LOC
1,"[31, 42): 'NEW ZEALAND'","[31, 42): 'NEW ZEALAND'",LOC
2,"[80, 83): 'GMT'","[80, 83): 'GMT'",MISC
3,"[85, 92): 'SIALKOT'","[85, 92): 'SIALKOT'",LOC
4,"[94, 102): 'Pakistan'","[94, 102): 'Pakistan'",LOC
...,...,...,...
69,"[1488, 1501): 'Shahid Afridi'","[1488, 1501): 'Shahid Afridi'",PER
70,"[1512, 1523): 'Salim Malik'","[1512, 1523): 'Salim Malik'",PER
71,"[1535, 1545): 'Ijaz Ahmad'","[1535, 1545): 'Ijaz Ahmad'",PER
72,"[1565, 1573): 'Pakistan'","[1565, 1573): 'Pakistan'",LOC


In [27]:
# Generate IOB2 tags and entity labels that align with the BERT tokens.
# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
bert_toks_df[["ent_iob", "ent_type"]] = tp.spans_to_iob(bert_token_spans, 
                                                        spans_df["ent_type"])
bert_toks_df

Unnamed: 0,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type
0,0,"[0, 0): ''",101,0,1,True,O,
1,1,"[0, 1): '-'",118,0,1,False,O,
2,2,"[1, 2): 'D'",141,0,1,False,O,
3,3,"[2, 4): 'OC'",9244,0,1,False,O,
4,4,"[4, 6): 'ST'",9272,0,1,False,O,
...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','",117,0,1,False,O,
685,685,"[1623, 1625): 'in'",1107,0,1,False,O,
686,686,"[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC
687,687,"[1633, 1634): '.'",119,0,1,False,O,
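
To make the IOB2 convention concrete, here is a minimal sketch that tags tokens from entity token ranges. It assumes entities are given as `(first_token, end_token, type)` tuples; `spans_to_iob2` is a hypothetical stand-in and not the implementation behind `tp.spans_to_iob`.

```python
from typing import List, Optional, Tuple

def spans_to_iob2(num_tokens: int,
                  entities: List[Tuple[int, int, str]]
                  ) -> Tuple[List[str], List[Optional[str]]]:
    """Turn (first_token, end_token, type) entities into per-token IOB2 tags."""
    iob = ["O"] * num_tokens
    types: List[Optional[str]] = [None] * num_tokens
    for first, end, ent_type in entities:
        for i in range(first, end):
            # IOB2: the first token of every entity is B, the rest are I.
            iob[i] = "B" if i == first else "I"
            types[i] = ent_type
    return iob, types

# Six tokens; tokens 1-2 form a LOC entity and token 4 a PER entity.
iob_tags, ent_types = spans_to_iob2(6, [(1, 3, "LOC"), (4, 5, "PER")])
```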


In [28]:
# The traditional way to transform NER to token classification is to 
# treat each combination of {I,O,B} X {entity type} as a different
# class. Generate class labels in that format.
classes_df = tp.add_token_classes(bert_toks_df, token_class_dtype)
classes_df

Unnamed: 0,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id
0,0,"[0, 0): ''",101,0,1,True,O,,O,0
1,1,"[0, 1): '-'",118,0,1,False,O,,O,0
2,2,"[1, 2): 'D'",141,0,1,False,O,,O,0
3,3,"[2, 4): 'OC'",9244,0,1,False,O,,O,0
4,4,"[4, 6): 'ST'",9272,0,1,False,O,,O,0
...,...,...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','",117,0,1,False,O,,O,0
685,685,"[1623, 1625): 'in'",1107,0,1,False,O,,O,0
686,686,"[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC,B-LOC,1
687,687,"[1633, 1634): '.'",119,0,1,False,O,,O,0
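
The class IDs in the output above (for example `B-LOC` is 1) are consistent with a label list that starts with `O`, followed by all `B-` labels and then all `I-` labels. The sketch below builds such a mapping by hand. The ordering is inferred from the IDs visible in this notebook's outputs, and the code is an illustration rather than the library's `make_iob_tag_categories`.

```python
entity_types = ["LOC", "MISC", "ORG", "PER"]

# "O" first, then every B-<type>, then every I-<type>.
labels = ["O"] + [f"{iob}-{t}" for iob in ("B", "I") for t in entity_types]
label_to_id = {label: i for i, label in enumerate(labels)}
id_to_label = dict(enumerate(labels))

def token_class(iob: str, ent_type) -> str:
    """Combine an IOB tag and an entity type into one class label."""
    return "O" if iob == "O" else f"{iob}-{ent_type}"

# Class ID for a token tagged B with entity type LOC.
b_loc_id = label_to_id[token_class("B", "LOC")]
```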


# Show how to compute embeddings

In [29]:
# Adds embeddings to our example dataframe.
embeddings_df = tp.add_embeddings(classes_df, bert)
embeddings_df

Unnamed: 0,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,0,"[0, 0): ''",101,0,1,True,O,,O,0,[-8.30711797e-02 -3.59590203e-01 1.01506817e+0...
1,1,"[0, 1): '-'",118,0,1,False,O,,O,0,[-2.28625670e-01 -4.93136197e-01 1.28423226e+0...
2,2,"[1, 2): 'D'",141,0,1,False,O,,O,0,[ 2.84805615e-02 -1.78742811e-01 1.54320943e+0...
3,3,"[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-0.46517587 -0.2983605 1.0737675 -0.0316486...
4,4,"[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-1.07308812e-01 -3.37210178e-01 1.22697961e+0...
...,...,...,...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','",117,0,1,False,O,,O,0,[-1.28065705e-01 -2.32442445e-03 6.78131640e-0...
685,685,"[1623, 1625): 'in'",1107,0,1,False,O,,O,0,[ 0.30534053 -0.526257 0.8281703 -0.2741491...
686,686,"[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC,B-LOC,1,[-0.04873929 -0.33797333 -0.05835138 0.7557763...
687,687,"[1633, 1634): '.'",119,0,1,False,O,,O,0,[-5.28932782e-03 -2.97430724e-01 7.16174304e-0...


# Generate BERT tokens and BERT embeddings for the entire corpus

Generate BERT embeddings for every token in the entire corpus in one pass, 
and store those embeddings in a dataframe column (of type TensorType) 
alongside the tokens and labels.

In [30]:
# Convenience function to combine the previous few cells' operations in a single call.
tp.conll_to_bert(example_df, tokenizer, bert, token_class_dtype)

Unnamed: 0,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,0,"[0, 0): ''",101,0,1,True,O,,O,0,[-8.30711797e-02 -3.59590203e-01 1.01506817e+0...
1,1,"[0, 1): '-'",118,0,1,False,O,,O,0,[-2.28625670e-01 -4.93136197e-01 1.28423226e+0...
2,2,"[1, 2): 'D'",141,0,1,False,O,,O,0,[ 2.84805615e-02 -1.78742811e-01 1.54320943e+0...
3,3,"[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-0.46517587 -0.2983605 1.0737675 -0.0316486...
4,4,"[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-1.07308812e-01 -3.37210178e-01 1.22697961e+0...
...,...,...,...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','",117,0,1,False,O,,O,0,[-1.28065705e-01 -2.32442445e-03 6.78131640e-0...
685,685,"[1623, 1625): 'in'",1107,0,1,False,O,,O,0,[ 0.30534053 -0.526257 0.8281703 -0.2741491...
686,686,"[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC,B-LOC,1,[-0.04873929 -0.33797333 -0.05835138 0.7557763...
687,687,"[1633, 1634): '.'",119,0,1,False,O,,O,0,[-5.28932782e-03 -2.97430724e-01 7.16174304e-0...


In [31]:
# Run the entire corpus through our processing pipeline.
bert_toks_by_fold = {}
for fold_name in corpus_raw.keys():
    print(f"Processing fold '{fold_name}'...")
    raw = corpus_raw[fold_name]
    bert_toks_by_fold[fold_name] = tp.run_with_progress_bar(
        len(raw), lambda i: tp.conll_to_bert(raw[i], tokenizer, 
                                             bert, token_class_dtype))
bert_toks_by_fold["dev"][20]

Processing fold 'train'...


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=946, style=ProgressStyle(desc…

Processing fold 'dev'...


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=216, style=ProgressStyle(desc…

Processing fold 'test'...


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=231, style=ProgressStyle(desc…

Unnamed: 0,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,0,"[0, 0): ''",101,0,1,True,O,,O,0,[-0.17669654 -0.39899608 0.9088872 0.2128861...
1,1,"[0, 1): '-'",118,0,1,False,O,,O,0,[-3.85537773e-01 -5.02327383e-01 1.17323220e+0...
2,2,"[1, 2): 'D'",141,0,1,False,O,,O,0,[-1.17190674e-01 -1.27011672e-01 1.38969100e+0...
3,3,"[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-3.90257567e-01 -2.50433564e-01 1.07450700e+0...
4,4,"[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-2.77328491e-01 -2.61601597e-01 1.07876039e+0...
...,...,...,...,...,...,...,...,...,...,...,...
2154,2154,"[5704, 5705): ')'",114,0,1,False,O,,O,0,[ 0.01539315 -0.04065072 1.0011852 0.0412664...
2155,2155,"[5706, 5708): '39'",3614,0,1,False,O,,O,0,[ 7.50392899e-02 1.44007439e-02 1.04323184e+0...
2156,2156,"[5708, 5709): '.'",119,0,1,False,O,,O,0,[-0.08579681 0.05905596 1.1146404 0.1048710...
2157,2157,"[5709, 5711): '93'",5429,0,1,False,O,,O,0,[ 1.13780675e-02 -2.63872057e-01 8.81803274e-0...


# Collate the data structures we've generated so far

In [32]:
# Create a single dataframe with the entire corpus's embeddings.
corpus_df = tp.combine_folds(bert_toks_by_fold)
corpus_df

Unnamed: 0,fold,doc_num,token_id,span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,train,0,0,"[0, 0): ''",101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...
1,train,0,1,"[0, 1): '-'",118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...
2,train,0,2,"[1, 2): 'D'",141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...
3,train,0,3,"[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...
4,train,0,4,"[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
416536,test,230,314,"[1386, 1393): 'brother'",1711,0,1,False,O,,O,0,[-2.81729121e-02 -8.06238949e-02 9.80488896e-0...
416537,test,230,315,"[1393, 1394): ','",117,0,1,False,O,,O,0,[ 1.18173949e-01 -7.00848699e-02 8.65484953e-0...
416538,test,230,316,"[1395, 1400): 'Bobby'",5545,0,1,False,B,PER,B-PER,4,[-3.56894344e-01 3.14004630e-01 1.57385385e+0...
416539,test,230,317,"[1400, 1401): '.'",119,0,1,False,O,,O,0,[-1.89571261e-01 -2.45812088e-01 6.62573814e-0...


# Checkpoint
With the TensorArray from Text Extensions for Pandas, the computed embeddings can be persisted as a tensor along with the rest of the DataFrame using standard Pandas input/output methods. Since computing the embeddings is costly and the result is deterministic, checkpointing the data to disk here can save a lot of time. This lets us continue with model training without recomputing the BERT embeddings.

## Save DataFrame with Embeddings Tensor

In [33]:
# Write the tokenized corpus with embeddings to a Feather file.
# We can't currently serialize span columns that cover multiple documents (see issue 73),
# so drop span columns from the contents we write to the Feather file.
cols_to_drop = [c for c in corpus_df.columns if "span" in c]
corpus_df.drop(columns=cols_to_drop).to_feather("outputs/corpus.feather")

## Load DataFrame with Previously Computed Embeddings

In [34]:
# Read the serialized embeddings back in so that you can rerun the model 
# training parts of this notebook (the cells from here onward) without 
# regenerating the embeddings.
corpus_df = pd.read_feather("outputs/corpus.feather")
corpus_df

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,train,0,0,101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...
1,train,0,1,118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...
2,train,0,2,141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...
3,train,0,3,9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...
4,train,0,4,9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...
...,...,...,...,...,...,...,...,...,...,...,...,...
416536,test,230,314,1711,0,1,False,O,,O,0,[-2.81729121e-02 -8.06238949e-02 9.80488896e-0...
416537,test,230,315,117,0,1,False,O,,O,0,[ 1.18173949e-01 -7.00848699e-02 8.65484953e-0...
416538,test,230,316,5545,0,1,False,B,PER,B-PER,4,[-3.56894344e-01 3.14004630e-01 1.57385385e+0...
416539,test,230,317,119,0,1,False,O,,O,0,[-1.89571261e-01 -2.45812088e-01 6.62573814e-0...
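
The save and load steps can be folded into a small "load if present, otherwise compute and save" helper. The sketch below is one way to do that under stated assumptions: it uses pickle instead of Feather so that it runs with plain pandas, and `load_or_compute` and `expensive_step` are hypothetical names introduced for the example.

```python
import os
import tempfile
from typing import Callable

import numpy as np
import pandas as pd

def load_or_compute(path: str,
                    compute: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached DataFrame at `path`, computing and saving it first
    if the file does not exist yet."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = compute()
    df.to_pickle(path)
    return df

calls = []

def expensive_step() -> pd.DataFrame:
    # Placeholder for the embedding computation.
    calls.append(1)
    return pd.DataFrame({"token_id": [0, 1],
                         "embedding": [np.zeros(3), np.ones(3)]})

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "corpus.pkl")
    first = load_or_compute(checkpoint, expensive_step)   # computes and saves
    second = load_or_compute(checkpoint, expensive_step)  # reads from disk
```

The second call never runs `expensive_step`, which is the behavior the checkpoint is meant to provide.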


# Training a model on the BERT embeddings

Now we will use the loaded BERT embeddings to train a multinomial model to predict the token class from the embeddings tensor.
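
For intuition, a multinomial logistic regression model scores each class with a linear function of the embedding and turns the scores into probabilities with a softmax. The NumPy sketch below shows only that prediction step; the weights `W` and `b` are random placeholders rather than trained values, and the sizes (768-dimensional embeddings, 9 token classes) are assumptions chosen to resemble this notebook's setup.

```python
import numpy as np

def predict_proba(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Softmax over one linear score per class, computed row by row."""
    scores = X @ W + b
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 768))          # 5 tokens, one embedding per row
W = rng.normal(size=(768, 9)) * 0.01   # one weight column per token class
b = np.zeros(9)

proba = predict_proba(X, W, b)
predicted_ids = proba.argmax(axis=1)
```

Training amounts to choosing `W` and `b` so that the probability assigned to each token's true class is as high as possible.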

In [35]:
# Additional initialization boilerplate
import sklearn.linear_model

In [36]:
train_df = corpus_df[corpus_df["fold"] == "train"]
train_df

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,train,0,0,101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...
1,train,0,1,118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...
2,train,0,2,141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...
3,train,0,3,9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...
4,train,0,4,9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...
...,...,...,...,...,...,...,...,...,...,...,...,...
281104,train,945,53,17057,0,1,False,B,ORG,B-ORG,3,[ 7.55636692e-01 -9.18912172e-01 -1.40303954e-0...
281105,train,945,54,122,0,1,False,O,,O,0,[-1.15284957e-01 -4.44919914e-01 4.71556604e-0...
281106,train,945,55,4617,0,1,False,B,ORG,B-ORG,3,[ 4.56021696e-01 -8.97085190e-01 6.78616092e-0...
281107,train,945,56,123,0,1,False,O,,O,0,[-1.97137520e-01 -5.42719424e-01 2.94020921e-0...


In [37]:
# Train a multinomial logistic regression model on the training set.
_MULTI_CLASS = "multinomial"
base_pipeline = sklearn.pipeline.Pipeline([
    # Standard scaler. This only makes a difference for certain classes
    # of embeddings.
    #("scaler", sklearn.preprocessing.StandardScaler()),
    ("mlogreg", sklearn.linear_model.LogisticRegression(
        multi_class=_MULTI_CLASS,
        verbose=10,
        max_iter=LBGFS_ITERATIONS
    ))
])

X_train = train_df["embedding"].values
Y_train = train_df["token_class_id"]
base_model = base_pipeline.fit(X_train, Y_train)
base_model

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 17.9min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 17.9min finished


Pipeline(steps=[('mlogreg',
                 LogisticRegression(max_iter=10000, multi_class='multinomial',
                                    verbose=10))])

In [64]:
# TODO - nicer display of tensor float values?
#import sys
#np.set_printoptions(precision=4, suppress=True, floatmode="fixed", linewidth=100, threshold=sys.maxsize)

In [41]:
# Define a function that will let us make predictions on a fold of the corpus
def predict_on_df(df: pd.DataFrame, id_to_class: Dict[int, str], predictor):
    """
    Run a trained model on a DataFrame of tokens with embeddings.

    :param df: DataFrame of tokens for a document, containing a TensorArray column
     called "embedding" that holds one embedding per token.
    :param id_to_class: Mapping from class ID to class name, as returned by
     :func:`text_extensions_for_pandas.make_iob_tag_categories`
    :param predictor: Python object with a `predict_proba` method that accepts
     a numpy array of embeddings.
    :returns: A copy of `df`, with the following additional columns:
     `predicted_id`, `predicted_class`, `predicted_iob`, `predicted_type`
     and `predicted_class_pr`.
    """
    result_df = df.copy()
    class_pr = tp.TensorArray(predictor.predict_proba(result_df["embedding"]))
    result_df["predicted_id"] = np.argmax(class_pr, axis=1)
    result_df["predicted_class"] = [id_to_class[i]
                                    for i in result_df["predicted_id"].values]
    iobs, types = tp.decode_class_labels(result_df["predicted_class"].values)
    result_df["predicted_iob"] = iobs
    result_df["predicted_type"] = types
    result_df["predicted_class_pr"] = class_pr
    return result_df
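
The post-processing inside `predict_on_df` can be traced on a toy example. The sketch below uses a three-class label map and hand-written probabilities; `decode_labels` is a hypothetical stand-in for `tp.decode_class_labels`, written only to show the idea of splitting a class label into its IOB tag and entity type.

```python
from typing import Dict, List, Optional, Tuple

import numpy as np

def decode_labels(labels: List[str]) -> Tuple[List[str], List[Optional[str]]]:
    """Split labels such as "B-LOC" into ("B", "LOC"); "O" has no type."""
    iobs, types = [], []
    for label in labels:
        if label == "O":
            iobs.append("O")
            types.append(None)
        else:
            iob, ent_type = label.split("-", 1)
            iobs.append(iob)
            types.append(ent_type)
    return iobs, types

# Toy label map and class probabilities for three tokens.
id_to_class: Dict[int, str] = {0: "O", 1: "B-LOC", 2: "I-LOC"}
class_pr = np.array([[0.9, 0.05, 0.05],
                     [0.1, 0.7, 0.2],
                     [0.2, 0.3, 0.5]])

# Most probable class per token, then ID -> label -> (IOB tag, entity type).
predicted_id = np.argmax(class_pr, axis=1)
predicted_class = [id_to_class[int(i)] for i in predicted_id]
predicted_iob, predicted_type = decode_labels(predicted_class)
```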

In [39]:
train_results_df = predict_on_df(train_df, int_to_label, base_model)
train_results_df

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding,predicted_id,predicted_class,predicted_iob,predicted_type,predicted_class_pr
0,train,0,0,101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...,0,O,O,,[9.99162793e-01 7.28614070e-06 3.68106932e-04 1...
1,train,0,1,118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...,0,O,O,,[9.98624017e-01 1.17727804e-06 2.30890326e-05 6...
2,train,0,2,141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...,0,O,O,,[9.99822159e-01 2.92875927e-06 4.63864656e-05 9...
3,train,0,3,9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...,0,O,O,,[9.99746151e-01 7.32364501e-09 1.02976297e-05 1...
4,train,0,4,9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...,0,O,O,,[9.99962561e-01 3.31572805e-10 3.60350074e-07 4...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
281104,train,945,53,17057,0,1,False,B,ORG,B-ORG,3,[ 7.55636692e-01 -9.18912172e-01 -1.40303954e-0...,3,B-ORG,B,ORG,[3.28689827e-07 1.31715175e-09 5.09700063e-09 9...
281105,train,945,54,122,0,1,False,O,,O,0,[-1.15284957e-01 -4.44919914e-01 4.71556604e-0...,0,O,O,,[9.99998076e-01 6.02579816e-11 1.43360960e-10 5...
281106,train,945,55,4617,0,1,False,B,ORG,B-ORG,3,[ 4.56021696e-01 -8.97085190e-01 6.78616092e-0...,3,B-ORG,B,ORG,[1.80262724e-07 4.99876003e-10 2.15961101e-08 9...
281107,train,945,56,123,0,1,False,O,,O,0,[-1.97137520e-01 -5.42719424e-01 2.94020921e-0...,0,O,O,,[9.99999401e-01 2.85864619e-11 4.76410155e-11 3...


In [42]:
# Look at results on the test set
test_results_df = predict_on_df(corpus_df[corpus_df["fold"] == "test"],
                                int_to_label, base_model)
# Take a slice to show a region with more entities
test_results_df.iloc[40:50]

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding,predicted_id,predicted_class,predicted_iob,predicted_type,predicted_class_pr
351041,test,0,40,3309,0,1,False,I,PER,I-PER,8,[-2.10291892e-01 -8.53567719e-01 2.75510421e-0...,6,I-MISC,I,MISC,[1.98323229e-07 1.53130728e-10 2.78405342e-07 2...
351042,test,0,41,1306,0,1,False,I,PER,I-PER,8,[-2.32054338e-01 -9.29077148e-01 3.88911217e-0...,6,I-MISC,I,MISC,[3.10633651e-05 5.22947228e-06 6.52603219e-11 5...
351043,test,0,42,2001,0,1,False,I,PER,I-PER,8,[ 0.36844087 -0.68091148 -0.10591034 -0.3448123...,8,I-PER,I,PER,[2.00370294e-03 1.00450102e-01 6.38725156e-06 1...
351044,test,0,43,1181,0,1,False,I,PER,I-PER,8,[-3.01310837e-01 -6.54601932e-01 -1.72691718e-0...,8,I-PER,I,PER,[1.12122699e-04 3.13711716e-06 1.53931420e-07 4...
351045,test,0,44,2293,0,1,False,I,PER,I-PER,8,[-1.61161020e-01 -6.98910296e-01 2.34246060e-0...,5,I-LOC,I,LOC,[7.02232647e-03 3.18677296e-03 7.76558124e-08 6...
351046,test,0,45,18589,0,1,False,B,LOC,B-LOC,1,[-5.85671440e-02 -7.95588315e-01 3.36059839e-0...,1,B-LOC,B,LOC,[6.36387711e-03 9.02478479e-01 2.26270005e-06 4...
351047,test,0,46,118,0,1,False,I,LOC,I-LOC,5,[ 2.03760266e-01 -7.37309694e-01 -8.88528451e-0...,5,I-LOC,I,LOC,[2.00492628e-01 5.12452116e-05 1.72780996e-07 8...
351048,test,0,47,19016,0,1,False,I,LOC,I-LOC,5,[-1.03412405e-01 -3.36818755e-01 1.73846602e-0...,1,B-LOC,B,LOC,[3.35755616e-02 5.33375311e-01 1.00517707e-07 9...
351049,test,0,48,2249,0,1,False,I,LOC,I-LOC,5,[-4.05427039e-01 -6.51651978e-01 2.46960565e-0...,5,I-LOC,I,LOC,[3.03682213e-04 1.33970366e-04 1.83506170e-08 1...
351050,test,0,49,117,0,1,False,O,,O,0,[-1.68292239e-01 -6.47586107e-01 8.14902484e-0...,0,O,O,,[9.99999663e-01 3.75752227e-08 1.14784193e-09 1...


In [43]:
# Find the predictions that had the lowest probability to see what the model did not predict strongly
# We can easily mix numpy functions on the TensorArray with standard Pandas on the entire DataFrame
# Here we find the max probability for each prediction, then sort the DataFrame by that value ascending
test_results_df["class_max_pr"] = np.max(test_results_df["predicted_class_pr"].array, axis=1)
pr_asc_df = test_results_df.sort_values(by="class_max_pr")

# Select relevant columns and show lowest 20 predictions
pr_asc_df[["doc_num", "token_id", "token_class", "predicted_class", "class_max_pr", "predicted_class_pr"]].head(20)

Unnamed: 0,doc_num,token_id,token_class,predicted_class,class_max_pr,predicted_class_pr
414699,222,371,O,B-MISC,0.238123,[0.08202944 0.00349575 0.23812346 0.14204107 0....
390307,133,8,B-LOC,B-MISC,0.291589,[2.62719988e-01 2.38731092e-01 2.91588782e-01 3...
367670,54,49,I-ORG,I-ORG,0.294557,[1.76412153e-01 3.86455931e-07 4.47598135e-07 7...
363788,42,28,O,O,0.309699,[3.09698791e-01 1.89220263e-05 4.49656243e-05 5...
411548,213,14,I-MISC,I-LOC,0.317874,[2.89094277e-01 3.86128605e-05 1.78886350e-04 3...
385606,114,10,I-LOC,B-MISC,0.327414,[2.28951121e-01 2.04707195e-06 3.27413638e-01 1...
358565,23,387,B-ORG,B-ORG,0.327566,[2.80056412e-01 1.09979216e-01 9.60765376e-02 3...
358426,23,248,I-ORG,I-MISC,0.347311,[1.80356203e-02 1.90402981e-03 3.16516185e-02 2...
364233,44,20,I-ORG,I-PER,0.348541,[1.22118924e-03 3.17763625e-05 1.27397341e-05 4...
386205,116,116,O,I-PER,0.350508,[3.24629125e-01 2.88704564e-05 1.68123889e-05 8...
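
The same "least confident first" ranking can be reproduced on a small example. This sketch keeps the probabilities in a plain NumPy array next to an ordinary DataFrame instead of a `TensorArray` column, and all values are made up.

```python
import numpy as np
import pandas as pd

# Toy per-token class probabilities, one row per token.
class_pr = np.array([[0.50, 0.30, 0.20],
                     [0.98, 0.01, 0.01],
                     [0.34, 0.33, 0.33]])
df = pd.DataFrame({"token_id": [0, 1, 2]})

# Confidence of a prediction = probability of its most likely class.
df["class_max_pr"] = class_pr.max(axis=1)
df["predicted_id"] = class_pr.argmax(axis=1)

# An ascending sort puts the least confident predictions first.
least_confident = df.sort_values(by="class_max_pr").head(2)
```

A low maximum probability means the model spread its belief across several classes, which is why these rows are good candidates for closer inspection.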


In [44]:
# Take a closer look at the 100 predictions with the lowest probability and check whether any single document accounts for a large share of them
lowest_pr_df = pr_asc_df[["doc_num", "token_id", "input_id", "predicted_class", "class_max_pr"]].head(100)
lowest_pr_df["doc_num"].value_counts().head(5)

23     11
27     10
222     6
220     5
133     3
Name: doc_num, dtype: int64

In [63]:
# Document 23 of the test set has the most low-probability predictions. Look at this document to see which spans caused them.
# TODO - want to show the span with predictions with this document
all_spans = {
    k: [tp.iob_to_spans(df) for df in v] for k, v in corpus_raw.items()
}
fold = "test"
doc_offset = 23
doc_df = all_spans[fold][doc_offset]
doc_df

Unnamed: 0,span,ent_type
0,"[11, 14): 'NHL'",ORG
1,"[61, 69): 'NEW YORK'",LOC
2,"[94, 109): 'National Hockey'",ORG
3,"[110, 116): 'League'",ORG
4,"[276, 284): 'HARTFORD'",ORG
5,"[301, 308): 'BUFFALO'",ORG
6,"[326, 332): 'BOSTON'",ORG
7,"[350, 358): 'MONTREAL'",ORG
8,"[377, 387): 'PITTSBURGH'",ORG
9,"[404, 410): 'OTTAWA'",ORG


In [54]:
corpus_test = corpus_df[corpus_df["fold"] == "test"]
doc_23 = corpus_test[corpus_test["doc_num"] == 23]

In [55]:
doc_23

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
358178,test,23,0,101,0,1,True,O,,O,0,[-3.93204927e-01 -7.20973670e-01 6.35356426e-0...
358179,test,23,1,118,0,1,False,O,,O,0,[-4.59361285e-01 -7.75034666e-01 1.00733566e+0...
358180,test,23,2,141,0,1,False,O,,O,0,[-0.3389236 -0.59117919 1.06589043 0.2252248...
358181,test,23,3,9244,0,1,False,O,,O,0,[-5.98726809e-01 -7.89630830e-01 7.54339695e-0...
358182,test,23,4,9272,0,1,False,O,,O,0,[-4.12817359e-01 -6.46852911e-01 8.24099600e-0...
...,...,...,...,...,...,...,...,...,...,...,...,...
358718,test,23,540,20002,0,1,False,I,LOC,I-LOC,5,[-3.68024737e-01 -3.88864905e-01 -1.79580003e-0...
358719,test,23,541,11414,0,1,False,I,LOC,I-LOC,5,[-1.30571973e+00 -7.02589810e-01 -1.70400810e+0...
358720,test,23,542,18082,0,1,False,I,LOC,I-LOC,5,[-3.79988015e-01 -5.99230766e-01 -1.29478753e+0...
358721,test,23,543,2249,0,1,False,I,LOC,I-LOC,5,[-2.15392202e-01 -8.67155373e-01 -1.35554612e+0...


In [62]:
doc_23 = corpus_raw["test"][23]
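# Note: `!= None` is evaluated element-wise and is True for every row, so no
# rows are dropped here; `doc_23["ent_type"].notna()` would keep only entity tokens.
doc_23[doc_23["ent_type"] != None].head(50)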

Unnamed: 0,span,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'",5246
1,"[11, 14): 'NHL'",B,ORG,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5248
2,"[15, 18): 'ICE'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5249
3,"[19, 25): 'HOCKEY'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5250
4,"[25, 26): '-'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5251
5,"[27, 36): 'STANDINGS'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5252
6,"[37, 42): 'AFTER'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5253
7,"[43, 51): 'THURSDAY'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5254
8,"[51, 53): ''S'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5255
9,"[54, 59): 'GAMES'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5256
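
The output above still includes tokens with no entity type. A minimal sketch on made-up data of why that can happen: comparing an object-dtype Series with `None` yields True for every element, while `notna()` flags only the non-missing values. The `toks` table below is hypothetical and uses an explicit `object` dtype.

```python
import pandas as pd

# Hypothetical token table; None marks tokens that are not part of an entity.
toks = pd.DataFrame({
    "token": ["NHL", "ICE", "HOCKEY", "NEW YORK"],
    "ent_type": pd.Series(["ORG", None, None, "LOC"], dtype=object),
})

# Element-wise comparison with None: True for every row, so nothing is filtered.
mask_cmp = toks["ent_type"] != None  # noqa: E711
# notna() is False for missing values, so only entity tokens remain.
mask_notna = toks["ent_type"].notna()

print(mask_cmp.tolist())
print(toks[mask_notna]["token"].tolist())
```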


In [45]:
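# Attach the gold-standard token class to the weakest predictions.
# `test_df` is not defined in this notebook; this assumes the test-fold rows
# of `corpus_df` (`corpus_test`, defined above) were intended.
# TODO: trying to get the char span for tokens
lowest_pr_df = lowest_pr_df.merge(
    corpus_test[["doc_num", "token_id", "token_class"]], on=["doc_num", "token_id"])
lowest_pr_df = lowest_pr_df[["doc_num", "token_id", "token_class", "predicted_class", "class_max_pr"]]
lowest_pr_df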

In [None]:
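# Column names follow the outputs shown above, where spans are stored in a
# column named "span"; `bert_toks_df["span"]` is assumed to use the same naming.
example_df = corpus_raw["test"][23]
spans_df = tp.iob_to_spans(example_df)
bert_toks_df = tp.make_bert_tokens(example_df["span"].values[0].target_text, tokenizer)
bert_token_spans = tp.TokenSpanArray.align_to_tokens(bert_toks_df["span"],
                                                     spans_df["span"])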
bert_toks_df[["ent_iob", "ent_type"]] = tp.spans_to_iob(bert_token_spans, 
                                                        spans_df["ent_type"])
bert_toks_df

In [None]:
# TODO - remove below if not computing metrics on predictions

In [None]:
# Split model outputs for an entire fold back into documents and add
# token information.
test_results_by_doc = util.align_model_outputs_to_tokens(test_results_df,
                                                         bert_toks_by_fold)
test_results_by_doc[("test", 0)].iloc[40:60]

In [None]:
# Convert IOB2-format output (and gold standard tags) to spans.
test_actual_spans = {k: tp.iob_to_spans(v) for k, v in test_results_by_doc.items()}
test_model_spans = {k:
        tp.iob_to_spans(v, iob_col_name = "predicted_iob",
                        entity_type_col_name = "predicted_type")
          .rename(columns={"predicted_type": "ent_type"})
        for k, v in test_results_by_doc.items()}

test_model_spans[("test", 0)].head()
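
To make the IOB2-to-span conversion concrete, here is a minimal plain-Python sketch. It is not the implementation of `tp.iob_to_spans`; the function name `iob2_to_spans` and the sample tags are hypothetical, and spans are expressed as half-open token index ranges rather than character offsets.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(iob: List[str],
                  ent_types: List[Optional[str]]) -> List[Tuple[int, int, str]]:
    """Return (begin_token, end_token, entity_type) with half-open token ranges."""
    spans = []
    start = None      # Index of the first token of the entity being built
    cur_type = None   # Entity type of the entity being built
    for i, (tag, typ) in enumerate(zip(iob, ent_types)):
        # An open entity ends at "O", at a new "B", or when the type changes.
        if start is not None and (tag != "I" or typ != cur_type):
            spans.append((start, i, cur_type))
            start, cur_type = None, None
        # "B" opens an entity; a stray "I" with no open entity is treated the same way.
        if tag in ("B", "I") and start is None:
            start, cur_type = i, typ
    # Close an entity that runs to the end of the document.
    if start is not None:
        spans.append((start, len(iob), cur_type))
    return spans


tags = ["O", "B", "O", "B", "I", "O"]
types = [None, "ORG", None, "LOC", "LOC", None]
spans = iob2_to_spans(tags, types)
print(spans)
```

Running the same conversion over the gold-standard tags and over the predicted tags gives two sets of spans that can be compared directly.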

In [None]:
# Every once in a while, the model will split a token in the original data
# set into two entities. For example, look at document 202 of the test
# set:
doc_key = ("test", 202)
test_model_spans[doc_key].head(10)

In [None]:
# Notice `[150, 151): 'W'` and `[151, 156): 'idnes'`. These outputs are part
# of the same original token.
# We can use spanner algebra to fix up these outputs.
spans_df = test_model_spans[doc_key]
toks_df = test_raw[202]

# First, find which tokens the spans overlap with:
overlaps_df = (
    tp
    .overlap_join(spans_df["token_span"], toks_df["char_span"],
                 "token_span", "corpus_token")
    .merge(spans_df)
)
overlaps_df.head(10)
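
The idea behind an overlap join can be sketched with plain pandas on integer offsets. This is an illustration under stated assumptions, not how `tp.overlap_join` is implemented: the column names and offsets are made up, and a cross join is only practical for small inputs.

```python
import pandas as pd

# Hypothetical entity spans and corpus tokens as half-open character ranges.
ent_spans = pd.DataFrame({"ent_begin": [150, 151], "ent_end": [151, 156]})
corpus_toks = pd.DataFrame({"tok_begin": [140, 150, 157], "tok_end": [149, 156, 160]})

# Pair every entity span with every token (requires pandas >= 1.2 for how="cross").
pairs = ent_spans.merge(corpus_toks, how="cross")

# Two half-open ranges [a, b) and [c, d) overlap when a < d and c < b.
overlapping = pairs[
    (pairs["ent_begin"] < pairs["tok_end"]) & (pairs["tok_begin"] < pairs["ent_end"])
].reset_index(drop=True)
print(overlapping)
```

Both entity fragments end up paired with the single corpus token at `[150, 156)`, which is the signal that they belong to one original token.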

In [None]:
# Next, compute the minimum span that covers all the corpus tokens
# that overlap with each entity span.
agg_df = (
    overlaps_df
    .groupby("token_span")
    .aggregate({"corpus_token": "sum", "ent_type": "first"})
    .reset_index()
)
agg_df.head(10)
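
The covering-span step can also be sketched with integer offsets: for each entity span, the minimum covering span starts at the smallest token begin and ends at the largest token end. The table and column names below are hypothetical and stand in for the span columns used in the notebook.

```python
import pandas as pd

# Hypothetical overlap table: each row pairs a predicted entity span
# (ent_begin, ent_end) with one corpus token (tok_begin, tok_end) it overlaps.
overlaps = pd.DataFrame({
    "ent_begin": [150, 151, 200],
    "ent_end":   [151, 156, 204],
    "tok_begin": [150, 150, 200],
    "tok_end":   [156, 156, 204],
    "ent_type":  ["ORG", "ORG", "PER"],
})

# For every entity span, take the smallest range that covers all of its tokens.
covering = (
    overlaps
    .groupby(["ent_begin", "ent_end"])
    .agg(cover_begin=("tok_begin", "min"),
         cover_end=("tok_end", "max"),
         ent_type=("ent_type", "first"))
    .reset_index()
)

# Fragments of one split token now share a covering span, so removing
# duplicate rows leaves a single entity for that token.
merged = covering[["cover_begin", "cover_end", "ent_type"]].drop_duplicates()
print(merged)
```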

In [None]:
# Finally, take unique values and convert character-based spans to spans
# in the corpus tokenization (since the new offsets might not match a
# BERT tokenizer token boundary)
cons_df = (
    tp.consolidate(agg_df, "corpus_token")
    [["corpus_token", "ent_type"]]
    .rename(columns={"corpus_token": "token_span"})
)
cons_df["token_span"] = tp.TokenSpanArray.align_to_tokens(toks_df["char_span"],
                                                          cons_df["token_span"])
cons_df.head(10)

In [None]:
# `tp.align_bert_tokens_to_corpus_tokens` repeats the actions of the
# previous 3 cells in a single call.
tp.align_bert_tokens_to_corpus_tokens(test_model_spans[doc_key], test_raw[202]).head(10)

In [None]:
# Run all of our dataframes through `align_bert_tokens_to_corpus_tokens`
keys = list(test_model_spans.keys())
new_values = tp.run_with_progress_bar(
    len(keys),
    lambda i: tp.align_bert_tokens_to_corpus_tokens(test_model_spans[keys[i]], test_raw[keys[i][1]]))
test_model_spans = dict(zip(keys, new_values))
test_model_spans[doc_key].head(10)

In [None]:
# Same per-document statistics calculation code as in CoNLL_2.ipynb
test_stats_by_doc = tp.compute_accuracy_by_document(test_actual_spans, test_model_spans)
test_stats_by_doc

In [None]:
# Collection-wide precision and recall can be computed by aggregating
# our dataframe.
tp.compute_global_accuracy(test_stats_by_doc)
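
As a rough illustration of what collection-wide aggregation involves, the sketch below sums per-document counts and derives micro-averaged precision, recall and F1. The column names (`num_true_positives`, `num_extracted`, `num_entities`) and the numbers are assumptions made for this example; they are not taken from the output of `tp.compute_accuracy_by_document`.

```python
import pandas as pd

# Hypothetical per-document counts.
stats = pd.DataFrame({
    "doc_num": [0, 1, 2],
    "num_true_positives": [8, 3, 5],   # Predicted spans that match a gold span
    "num_extracted": [10, 4, 5],       # All predicted spans
    "num_entities": [9, 6, 5],         # All gold-standard spans
})

# Sum the counts over all documents before dividing (micro-averaging).
totals = stats[["num_true_positives", "num_extracted", "num_entities"]].sum()
precision = totals["num_true_positives"] / totals["num_extracted"]
recall = totals["num_true_positives"] / totals["num_entities"]
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Summing counts first weights every entity equally, whereas averaging per-document scores would weight every document equally regardless of how many entities it contains.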

In [None]:
# The function analyze_model() in util.py combines the above postprocessing steps 
# into a single function.
base_test_results = util.analyze_model(
    corpus_df[corpus_df["fold"] == "test"], 
    int_to_label, base_model, bert_toks_by_fold, corpus_raw, expand_matches=True)
base_test_results["global_scores"]

In [None]:
# Results on the training set
base_train_results = util.analyze_model(
    corpus_df[corpus_df["fold"] == "train"],
    int_to_label, base_model, bert_toks_by_fold, corpus_raw, expand_matches=True)
base_train_results["global_scores"]

In [None]:
# Results on the development set
base_dev_results = util.analyze_model(
    corpus_df[corpus_df["fold"] == "dev"],
    int_to_label, base_model, bert_toks_by_fold, corpus_raw, expand_matches=True)
base_dev_results["global_scores"]