<font size=6>
    <b>Working_with_Models.ipynb:</b> Use Text Extensions for Pandas to integrate model training and prediction with Pandas.
</font>

# Introduction

This notebook shows how to use the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) to integrate model training and prediction with [Pandas](https://pandas.pydata.org/) DataFrames, using the library's support for DataFrame columns that hold N-dimensional tensors.

To demonstrate the TensorArray functionality from Text Extensions for Pandas, we will use the CoNLL-2003 dataset [TODO REF] to:

1. Retokenize the entire corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.
1. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a dataframe column (of type TensorType) alongside the tokens and labels.
1. Persist the BERT embeddings to disk together with the rest of the DataFrame, so that they do not have to be recomputed.
1. Use the embeddings to quickly train multiple models 


# Imports and constants

In [70]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if ".." not in sys.path:
    sys.path.insert(0, "..")

# Libraries
import numpy as np
import pandas as pd
import time
import torch
import transformers
from typing import *

import sklearn.pipeline

import matplotlib.pyplot as plt
import ipywidgets
from IPython.display import display

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

# Common code shared across notebooks is kept in util.py
#import util

# How many iterations to run the L-BFGS optimizer when fitting logistic
# regression models. 100 ==> Fast; 10000 ==> Full convergence
LBGFS_ITERATIONS = 10000

# BERT Configuration
#bert_model_name = "bert-base-uncased"
#bert_model_name = "bert-large-uncased"
bert_model_name = "dslim/bert-base-NER"

tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name, 
                                                           add_special_tokens=True)
bert = transformers.BertModel.from_pretrained(bert_model_name)

# Create a Pandas categorical type for consistent encoding of categories
# across all documents.
_ENTITY_TYPES = ["LOC", "MISC", "ORG", "PER"]
token_class_dtype, int_to_label, label_to_int = tp.make_iob_tag_categories(_ENTITY_TYPES)
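
The class IDs that appear later in this notebook (for example `B-LOC` as 1 and `I-PER` as 8) are consistent with an ordering of `O` first, then all `B-` tags, then all `I-` tags. The sketch below reproduces that ordering with plain Pandas. It is an illustration of the idea, not the implementation of `tp.make_iob_tag_categories`, and the helper name `sketch_iob_categories` is hypothetical.

```python
import pandas as pd

def sketch_iob_categories(entity_types):
    """Hypothetical helper: build a categorical dtype plus two lookup dicts."""
    # Assumed ordering: "O", then every B-<type>, then every I-<type>.
    labels = (["O"]
              + [f"B-{t}" for t in entity_types]
              + [f"I-{t}" for t in entity_types])
    dtype = pd.CategoricalDtype(categories=labels)
    int_to_label = dict(enumerate(labels))
    label_to_int = {label: i for i, label in int_to_label.items()}
    return dtype, int_to_label, label_to_int

sketch_dtype, sketch_int_to_label, sketch_label_to_int = sketch_iob_categories(
    ["LOC", "MISC", "ORG", "PER"])
```

Using one shared categorical type keeps the integer encoding of each class identical across all documents and folds.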

In [71]:
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
#  to the terms of the license when using this data set!
data_set_info = tp.maybe_download_conll_data("outputs")
data_set_info

{'train': 'outputs/eng.train',
 'dev': 'outputs/eng.testa',
 'test': 'outputs/eng.testb'}

# Show how to retokenize with a BERT tokenizer.

Retokenize the corpus using a "BERT-compatible" tokenizer, and map the token/entity labels from the original corpus on to the new tokenization.

In [72]:
# Read in the corpus in its original tokenization
corpus_raw = {}
for fold_name, file_name in data_set_info.items():
    df_list = tp.conll_2003_to_dataframes(file_name, 
                                          ["pos", "phrase", "ent"],
                                          [False, True, True])
    corpus_raw[fold_name] = [
        df.drop(columns=["pos", "phrase_iob", "phrase_type"])
        for df in df_list
    ]

test_raw = corpus_raw["test"]

# Pick out the dataframe for a single example document.
example_df = test_raw[5]
example_df

Unnamed: 0,char_span,token_span,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'","[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'",1469
1,"[11, 18): 'CRICKET'","[11, 18): 'CRICKET'",O,,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1471
2,"[18, 19): '-'","[18, 19): '-'",O,,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1472
3,"[20, 28): 'PAKISTAN'","[20, 28): 'PAKISTAN'",B,LOC,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1473
4,"[29, 30): 'V'","[29, 30): 'V'",O,,"[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...",1474
...,...,...,...,...,...,...
350,"[1620, 1621): '8'","[1620, 1621): '8'",O,,"[1590, 1634): 'Third one-day match: December 8...",1865
351,"[1621, 1622): ','","[1621, 1622): ','",O,,"[1590, 1634): 'Third one-day match: December 8...",1866
352,"[1623, 1625): 'in'","[1623, 1625): 'in'",O,,"[1590, 1634): 'Third one-day match: December 8...",1867
353,"[1626, 1633): 'Karachi'","[1626, 1633): 'Karachi'",B,LOC,"[1590, 1634): 'Third one-day match: December 8...",1868


In [5]:
spans_df = tp.iob_to_spans(example_df)
spans_df

Unnamed: 0,token_span,ent_type
0,"[20, 28): 'PAKISTAN'",LOC
1,"[31, 42): 'NEW ZEALAND'",LOC
2,"[80, 83): 'GMT'",MISC
3,"[85, 92): 'SIALKOT'",LOC
4,"[94, 102): 'Pakistan'",LOC
...,...,...
69,"[1488, 1501): 'Shahid Afridi'",PER
70,"[1512, 1523): 'Salim Malik'",PER
71,"[1535, 1545): 'Ijaz Ahmad'",PER
72,"[1565, 1573): 'Pakistan'",LOC


In [6]:
# Retokenize the document's text with the BERT tokenizer
bert_toks_df = tp.make_bert_tokens(example_df["char_span"].values[0].target_text, tokenizer)
bert_toks_df

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False
...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','","[1621, 1622): ','",117,0,1,False
685,685,"[1623, 1625): 'in'","[1623, 1625): 'in'",1107,0,1,False
686,686,"[1626, 1633): 'Karachi'","[1626, 1633): 'Karachi'",16237,0,1,False
687,687,"[1633, 1634): '.'","[1633, 1634): '.'",119,0,1,False


In [7]:
# BERT tokenization includes special zero-length tokens.
bert_toks_df[bert_toks_df["special_tokens_mask"]]

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True
688,688,"[0, 0): ''","[0, 0): ''",102,0,1,True


In [8]:
# Align the BERT tokens with the original tokenization
bert_token_spans = tp.TokenSpanArray.align_to_tokens(bert_toks_df["char_span"],
                                                     spans_df["token_span"])

pd.DataFrame({
    "original_span": spans_df["token_span"],
    "bert_token_span": bert_token_spans,
    "ent_type": spans_df["ent_type"]
})

Unnamed: 0,original_span,bert_token_span,ent_type
0,"[20, 28): 'PAKISTAN'","[20, 28): 'PAKISTAN'",LOC
1,"[31, 42): 'NEW ZEALAND'","[31, 42): 'NEW ZEALAND'",LOC
2,"[80, 83): 'GMT'","[80, 83): 'GMT'",MISC
3,"[85, 92): 'SIALKOT'","[85, 92): 'SIALKOT'",LOC
4,"[94, 102): 'Pakistan'","[94, 102): 'Pakistan'",LOC
...,...,...,...
69,"[1488, 1501): 'Shahid Afridi'","[1488, 1501): 'Shahid Afridi'",PER
70,"[1512, 1523): 'Salim Malik'","[1512, 1523): 'Salim Malik'",PER
71,"[1535, 1545): 'Ijaz Ahmad'","[1535, 1545): 'Ijaz Ahmad'",PER
72,"[1565, 1573): 'Pakistan'","[1565, 1573): 'Pakistan'",LOC
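
Conceptually, aligning an entity span to the BERT tokenization means finding the tokens whose character offsets fall inside the entity's character range. The following is a minimal pure-Python sketch of that idea, assuming half-open `(begin, end)` character offsets. The function name `align_span_to_tokens` and the toy offsets are hypothetical, and this is not how `TokenSpanArray.align_to_tokens` is implemented.

```python
def align_span_to_tokens(token_offsets, span):
    """Return (first_token, end_token_exclusive) for the tokens inside `span`,
    or None if no token falls inside it."""
    span_begin, span_end = span
    covered = [
        i for i, (tok_begin, tok_end) in enumerate(token_offsets)
        # Skip zero-length special tokens such as [CLS] and [SEP].
        if tok_end > tok_begin and tok_begin >= span_begin and tok_end <= span_end
    ]
    if not covered:
        return None
    return covered[0], covered[-1] + 1

# Toy tokenization of the text "in Karachi." preceded by a zero-length
# special token: "in" -> (0, 2), "Kar" -> (3, 6), "achi" -> (6, 10), "." -> (10, 11)
toy_offsets = [(0, 0), (0, 2), (3, 6), (6, 10), (10, 11)]

# The entity "Karachi" covers characters [3, 10).
toy_result = align_span_to_tokens(toy_offsets, (3, 10))
```

Here the entity maps to the token range `[2, 4)`, that is, the two subword tokens that together spell the entity text.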


In [9]:
# Generate IOB2 tags and entity labels that align with the BERT tokens.
# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
bert_toks_df[["ent_iob", "ent_type"]] = tp.spans_to_iob(bert_token_spans, 
                                                        spans_df["ent_type"])
bert_toks_df

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,
...,...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','","[1621, 1622): ','",117,0,1,False,O,
685,685,"[1623, 1625): 'in'","[1623, 1625): 'in'",1107,0,1,False,O,
686,686,"[1626, 1633): 'Karachi'","[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC
687,687,"[1633, 1634): '.'","[1633, 1634): '.'",119,0,1,False,O,
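
To make the IOB2 convention concrete: the first token of every entity is tagged `B`, the remaining tokens of that entity are tagged `I`, and everything else is tagged `O`. A small sketch, assuming entities are given as token ranges with an exclusive end (the helper `token_ranges_to_iob` is hypothetical and only illustrates the tagging scheme, not the internals of `tp.spans_to_iob`):

```python
def token_ranges_to_iob(num_tokens, entity_ranges):
    """entity_ranges: list of (first_token, end_token_exclusive, entity_type)."""
    iob = ["O"] * num_tokens
    ent_type = [None] * num_tokens
    for begin, end, etype in entity_ranges:
        for i in range(begin, end):
            # IOB2: every entity starts with "B", even right after another entity.
            iob[i] = "B" if i == begin else "I"
            ent_type[i] = etype
    return iob, ent_type

# Six tokens; a two-token LOC entity at tokens 1-2 and a one-token PER at token 4.
toy_iob, toy_types = token_ranges_to_iob(6, [(1, 3, "LOC"), (4, 5, "PER")])
```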


In [10]:
# The traditional way to transform NER to token classification is to 
# treat each combination of {I,O,B} X {entity type} as a different
# class. Generate class labels in that format.
classes_df = tp.add_token_classes(bert_toks_df, token_class_dtype)
classes_df

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,,O,0
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,,O,0
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,,O,0
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,,O,0
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,,O,0
...,...,...,...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','","[1621, 1622): ','",117,0,1,False,O,,O,0
685,685,"[1623, 1625): 'in'","[1623, 1625): 'in'",1107,0,1,False,O,,O,0
686,686,"[1626, 1633): 'Karachi'","[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC,B-LOC,1
687,687,"[1633, 1634): '.'","[1633, 1634): '.'",119,0,1,False,O,,O,0


# Show how to compute embeddings

In [11]:
# Add embeddings to our example dataframe using tp.add_embeddings().
embeddings_df = tp.add_embeddings(classes_df, bert)
embeddings_df

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,,O,0,[-8.30711797e-02 -3.59590203e-01 1.01506817e+0...
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,,O,0,[-2.28625670e-01 -4.93136197e-01 1.28423226e+0...
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,,O,0,[ 2.84805615e-02 -1.78742811e-01 1.54320943e+0...
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-0.46517587 -0.2983605 1.0737675 -0.0316486...
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-1.07308812e-01 -3.37210178e-01 1.22697961e+0...
...,...,...,...,...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','","[1621, 1622): ','",117,0,1,False,O,,O,0,[-1.28065705e-01 -2.32442445e-03 6.78131640e-0...
685,685,"[1623, 1625): 'in'","[1623, 1625): 'in'",1107,0,1,False,O,,O,0,[ 0.30534053 -0.526257 0.8281703 -0.2741491...
686,686,"[1626, 1633): 'Karachi'","[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC,B-LOC,1,[-0.04873929 -0.33797333 -0.05835138 0.7557763...
687,687,"[1633, 1634): '.'","[1633, 1634): '.'",119,0,1,False,O,,O,0,[-5.28932782e-03 -2.97430724e-01 7.16174304e-0...


# Generate BERT tokens and BERT embeddings for the entire corpus

Generate BERT embeddings for every token in the entire corpus in one pass, 
and store those embeddings in a dataframe column (of type TensorType) 
alongside the tokens and labels.

In [12]:
# tp.conll_to_bert() combines the previous few cells' operations into
# a single function.
tp.conll_to_bert(example_df, tokenizer, bert, token_class_dtype)

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,,O,0,[-8.30711797e-02 -3.59590203e-01 1.01506817e+0...
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,,O,0,[-2.28625670e-01 -4.93136197e-01 1.28423226e+0...
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,,O,0,[ 2.84805615e-02 -1.78742811e-01 1.54320943e+0...
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-0.46517587 -0.2983605 1.0737675 -0.0316486...
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-1.07308812e-01 -3.37210178e-01 1.22697961e+0...
...,...,...,...,...,...,...,...,...,...,...,...,...
684,684,"[1621, 1622): ','","[1621, 1622): ','",117,0,1,False,O,,O,0,[-1.28065705e-01 -2.32442445e-03 6.78131640e-0...
685,685,"[1623, 1625): 'in'","[1623, 1625): 'in'",1107,0,1,False,O,,O,0,[ 0.30534053 -0.526257 0.8281703 -0.2741491...
686,686,"[1626, 1633): 'Karachi'","[1626, 1633): 'Karachi'",16237,0,1,False,B,LOC,B-LOC,1,[-0.04873929 -0.33797333 -0.05835138 0.7557763...
687,687,"[1633, 1634): '.'","[1633, 1634): '.'",119,0,1,False,O,,O,0,[-5.28932782e-03 -2.97430724e-01 7.16174304e-0...


In [13]:
# Run the entire corpus through our processing pipeline.
bert_toks_by_fold = {}
for fold_name in corpus_raw.keys():
    print(f"Processing fold '{fold_name}'...")
    raw = corpus_raw[fold_name]
    bert_toks_by_fold[fold_name] = tp.run_with_progress_bar(
        len(raw), lambda i: tp.conll_to_bert(raw[i], tokenizer, 
                                             bert, token_class_dtype))
bert_toks_by_fold["dev"][20]

Processing fold 'train'...


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=946, style=ProgressStyle(desc…

Processing fold 'dev'...


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=216, style=ProgressStyle(desc…

Processing fold 'test'...


IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=231, style=ProgressStyle(desc…

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,,O,0,[-0.17669654 -0.39899608 0.9088872 0.2128861...
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,,O,0,[-3.85537773e-01 -5.02327383e-01 1.17323220e+0...
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,,O,0,[-1.17190674e-01 -1.27011672e-01 1.38969100e+0...
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-3.90257567e-01 -2.50433564e-01 1.07450700e+0...
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-2.77328491e-01 -2.61601597e-01 1.07876039e+0...
...,...,...,...,...,...,...,...,...,...,...,...,...
2154,2154,"[5704, 5705): ')'","[5704, 5705): ')'",114,0,1,False,O,,O,0,[ 0.01539315 -0.04065072 1.0011852 0.0412664...
2155,2155,"[5706, 5708): '39'","[5706, 5708): '39'",3614,0,1,False,O,,O,0,[ 7.50392899e-02 1.44007439e-02 1.04323184e+0...
2156,2156,"[5708, 5709): '.'","[5708, 5709): '.'",119,0,1,False,O,,O,0,[-0.08579681 0.05905596 1.1146404 0.1048710...
2157,2157,"[5709, 5711): '93'","[5709, 5711): '93'",5429,0,1,False,O,,O,0,[ 1.13780675e-02 -2.63872057e-01 8.81803274e-0...


# Collate the data structures we've generated so far

In [14]:
# Create a single dataframe with the entire corpus's embeddings.
corpus_df = tp.combine_folds(bert_toks_by_fold)
corpus_df

Unnamed: 0,fold,doc_num,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,train,0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...
1,train,0,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...
2,train,0,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...
3,train,0,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...
4,train,0,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416536,test,230,314,"[1386, 1393): 'brother'","[1386, 1393): 'brother'",1711,0,1,False,O,,O,0,[-2.81729121e-02 -8.06238949e-02 9.80488896e-0...
416537,test,230,315,"[1393, 1394): ','","[1393, 1394): ','",117,0,1,False,O,,O,0,[ 1.18173949e-01 -7.00848699e-02 8.65484953e-0...
416538,test,230,316,"[1395, 1400): 'Bobby'","[1395, 1400): 'Bobby'",5545,0,1,False,B,PER,B-PER,4,[-3.56894344e-01 3.14004630e-01 1.57385385e+0...
416539,test,230,317,"[1400, 1401): '.'","[1400, 1401): '.'",119,0,1,False,O,,O,0,[-1.89571261e-01 -2.45812088e-01 6.62573814e-0...
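
The combined DataFrame above has one global row index plus `fold` and `doc_num` columns that record where each token came from. A rough sketch of how such a collation could be done with plain Pandas follows. The helper name `combine_folds_sketch` is hypothetical and the toy frames are made up, so treat this as an illustration of the output layout, not as the code behind `tp.combine_folds`.

```python
import pandas as pd

def combine_folds_sketch(frames_by_fold):
    """Hypothetical helper: stack per-document DataFrames into one DataFrame."""
    pieces = []
    for fold_name, doc_frames in frames_by_fold.items():
        for doc_num, doc_df in enumerate(doc_frames):
            piece = doc_df.copy()
            # Record where each row came from before stacking.
            piece.insert(0, "fold", fold_name)
            piece.insert(1, "doc_num", doc_num)
            pieces.append(piece)
    # ignore_index=True gives the combined frame a single global row index.
    return pd.concat(pieces, ignore_index=True)

toy_folds = {
    "train": [pd.DataFrame({"token_id": [0, 1]}), pd.DataFrame({"token_id": [0]})],
    "test": [pd.DataFrame({"token_id": [0, 1, 2]})],
}
toy_corpus = combine_folds_sketch(toy_folds)
```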


# Checkpoint
With the TensorArray from Text Extensions for Pandas, the computed embeddings can be persisted as a tensor along with the rest of the DataFrame using standard Pandas input/output methods. This allows us to continue with model training without needing to recompute the BERT embeddings.

## Save DataFrame with Embeddings Tensor

In [15]:
# Write the tokenized corpus with embeddings to a Feather file.
# We can't currently serialize span columns that cover multiple documents (see issue 73),
# so drop span columns from the contents we write to the Feather file.
cols_to_drop = [c for c in corpus_df.columns if "span" in c]
corpus_df.drop(columns=cols_to_drop).to_feather("outputs/corpus.feather")

## Load DataFrame with Previously Computed Embeddings

In [5]:
# Read the serialized embeddings back in so that you can rerun the model 
# training parts of this notebook (the cells from here onward) without 
# regenerating the embeddings.
corpus_df = pd.read_feather("outputs/corpus.feather")
corpus_df

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,train,0,0,101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...
1,train,0,1,118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...
2,train,0,2,141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...
3,train,0,3,9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...
4,train,0,4,9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...
...,...,...,...,...,...,...,...,...,...,...,...,...
416536,test,230,314,1711,0,1,False,O,,O,0,[-2.81729121e-02 -8.06238949e-02 9.80488896e-0...
416537,test,230,315,117,0,1,False,O,,O,0,[ 1.18173949e-01 -7.00848699e-02 8.65484953e-0...
416538,test,230,316,5545,0,1,False,B,PER,B-PER,4,[-3.56894344e-01 3.14004630e-01 1.57385385e+0...
416539,test,230,317,119,0,1,False,O,,O,0,[-1.89571261e-01 -2.45812088e-01 6.62573814e-0...


# Train a model on the BERT embeddings

Use the loaded BERT embeddings to quickly train a model.

In [11]:
# Additional initialization boilerplate
import sklearn.linear_model

In [8]:
train_df = corpus_df[corpus_df["fold"] == "train"]
train_df

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding
0,train,0,0,101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...
1,train,0,1,118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...
2,train,0,2,141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...
3,train,0,3,9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...
4,train,0,4,9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...
...,...,...,...,...,...,...,...,...,...,...,...,...
281104,train,945,53,17057,0,1,False,B,ORG,B-ORG,3,[ 7.55636692e-01 -9.18912172e-01 -1.40303954e-0...
281105,train,945,54,122,0,1,False,O,,O,0,[-1.15284957e-01 -4.44919914e-01 4.71556604e-0...
281106,train,945,55,4617,0,1,False,B,ORG,B-ORG,3,[ 4.56021696e-01 -8.97085190e-01 6.78616092e-0...
281107,train,945,56,123,0,1,False,O,,O,0,[-1.97137520e-01 -5.42719424e-01 2.94020921e-0...


In [9]:
train_df["embedding"].values

array([[-0.09850537, -0.40501907,  0.74288791, ..., -0.74161553,
         0.13603999,  0.47785613],
       [-0.05702158, -0.48112127,  0.9898684 , ..., -0.56108212,
         0.12308495,  0.46220705],
       [-0.0482421 , -0.25330001,  1.16719234, ..., -0.54396665,
         0.10910373,  0.50470823],
       ...,
       [ 0.4560217 , -0.89708519,  0.06786161, ...,  1.18717468,
        -1.34834838, -0.24610774],
       [-0.19713752, -0.54271942,  0.29402092, ..., -0.46821135,
         0.12332433,  0.64310163],
       [-0.57650781, -0.42160684,  0.99470353, ..., -0.64154649,
        -0.05481348,  0.23630801]])

In [12]:
# Train a multinomial logistic regression model on the training set.
_MULTI_CLASS = "multinomial"
base_pipeline = sklearn.pipeline.Pipeline([
    # Standard scaler. This only makes a difference for certain classes
    # of embeddings.
    #("scaler", sklearn.preprocessing.StandardScaler()),
    ("mlogreg", sklearn.linear_model.LogisticRegression(
        multi_class=_MULTI_CLASS,
        verbose=10,
        max_iter=LBGFS_ITERATIONS
    ))
])

X_train = train_df["embedding"].values
Y_train = train_df["token_class_id"]
base_model = base_pipeline.fit(X_train, Y_train)
base_model

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 18.3min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 18.3min finished


Pipeline(steps=[('mlogreg',
                 LogisticRegression(max_iter=10000, multi_class='multinomial',
                                    verbose=10))])
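
For intuition about what the fitted model computes: a multinomial logistic regression scores each class with a linear function of the embedding and converts the scores into probabilities with a softmax. The NumPy sketch below uses made-up weights and toy dimensions (8-dimensional embeddings, 9 classes). These values are illustrative assumptions and are not taken from the trained model.

```python
import numpy as np

def softmax_probabilities(X, W, b):
    """Class probabilities for each row of X under a linear softmax model."""
    scores = X @ W + b
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    scores = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional toy embeddings
toy_W = rng.normal(size=(8, 9))   # one weight column per class
toy_b = np.zeros(9)               # one bias per class

toy_probs = softmax_probabilities(toy_X, toy_W, toy_b)
toy_predicted_ids = toy_probs.argmax(axis=1)
```

Each row of `toy_probs` sums to 1, and the predicted class is the column with the largest probability, which mirrors the relationship between `predict_proba` and `predict` used later in this notebook.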

In [34]:
# Print NumPy arrays in fixed-point notation with 4 decimal places,
# without truncating long arrays.

import sys
np.set_printoptions(precision=4, suppress=True, floatmode="fixed", linewidth=100, threshold=sys.maxsize)

In [13]:
# TODO - update util.py??

def predict_on_df(df: pd.DataFrame, id_to_class: Dict[int, str], predictor):
    """
    Run a trained model on a DataFrame of tokens with embeddings.

    :param df: DataFrame of tokens for a document, containing a column called
     "embedding" that holds one embedding tensor per token.
    :param id_to_class: Mapping from class ID to class name, as returned by
     :func:`text_extensions_for_pandas.make_iob_tag_categories`
    :param predictor: Python object with `predict` and `predict_proba` methods
     that accept a numpy array of embeddings.
    :returns: A copy of `df`, with the following additional columns:
     `predicted_id`, `predicted_class`, `predicted_iob`, `predicted_type`
     and `predicted_class_pr`.
    """
    x_values = df["embedding"].values
    result_df = df.copy()
    result_df["predicted_id"] = predictor.predict(x_values)
    result_df["predicted_class"] = [id_to_class[i]
                                    for i in result_df["predicted_id"].values]
    iobs, types = tp.decode_class_labels(result_df["predicted_class"].values)
    result_df["predicted_iob"] = iobs
    result_df["predicted_type"] = types
    prob_values = predictor.predict_proba(x_values)
    result_df["predicted_class_pr"] = tp.TensorArray(prob_values)
    return result_df
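
The function above relies on `tp.decode_class_labels` to split class names such as `B-LOC` back into an IOB tag and an entity type. As a rough illustration of that decoding step (the helper `split_class_label` is hypothetical and is not the library's implementation):

```python
def split_class_label(label):
    """Split a class name like "B-LOC" into ("B", "LOC"); "O" has no type."""
    if label == "O":
        return "O", None
    iob, ent_type = label.split("-", 1)
    return iob, ent_type

toy_labels = ["O", "B-LOC", "I-LOC", "B-PER"]
toy_iobs, toy_types = zip(*(split_class_label(label) for label in toy_labels))
```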

In [14]:
train_results_df = predict_on_df(train_df, int_to_label, base_model)
train_results_df

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding,predicted_id,predicted_class,predicted_iob,predicted_type,predicted_class_pr
0,train,0,0,101,0,1,True,O,,O,0,[-9.85053703e-02 -4.05019075e-01 7.42887914e-0...,0,O,O,,[9.99162793e-01 7.28614070e-06 3.68106932e-04 1...
1,train,0,1,118,0,1,False,O,,O,0,[-5.70215769e-02 -4.81121272e-01 9.89868402e-0...,0,O,O,,[9.98624017e-01 1.17727804e-06 2.30890326e-05 6...
2,train,0,2,141,0,1,False,O,,O,0,[-4.82420959e-02 -2.53300011e-01 1.16719234e+0...,0,O,O,,[9.99822159e-01 2.92875927e-06 4.63864656e-05 9...
3,train,0,3,9244,0,1,False,O,,O,0,[-2.66830117e-01 -3.10087562e-01 1.00747287e+0...,0,O,O,,[9.99746151e-01 7.32364501e-09 1.02976297e-05 1...
4,train,0,4,9272,0,1,False,O,,O,0,[-0.22296937 -0.2130852 0.93310183 -0.2613601...,0,O,O,,[9.99962561e-01 3.31572805e-10 3.60350074e-07 4...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
281104,train,945,53,17057,0,1,False,B,ORG,B-ORG,3,[ 7.55636692e-01 -9.18912172e-01 -1.40303954e-0...,3,B-ORG,B,ORG,[3.28689827e-07 1.31715175e-09 5.09700063e-09 9...
281105,train,945,54,122,0,1,False,O,,O,0,[-1.15284957e-01 -4.44919914e-01 4.71556604e-0...,0,O,O,,[9.99998076e-01 6.02579816e-11 1.43360960e-10 5...
281106,train,945,55,4617,0,1,False,B,ORG,B-ORG,3,[ 4.56021696e-01 -8.97085190e-01 6.78616092e-0...,3,B-ORG,B,ORG,[1.80262724e-07 4.99876003e-10 2.15961101e-08 9...
281107,train,945,56,123,0,1,False,O,,O,0,[-1.97137520e-01 -5.42719424e-01 2.94020921e-0...,0,O,O,,[9.99999401e-01 2.85864619e-11 4.76410155e-11 3...


In [106]:
# Look at results on the test set
test_results_df = predict_on_df(corpus_df[corpus_df["fold"] == "test"],
                                int_to_label, base_model)
# Showing a region with more entities
test_results_df.iloc[40:60]

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding,predicted_id,predicted_class,predicted_iob,predicted_type,predicted_class_pr
351041,test,0,40,3309,0,1,False,I,PER,I-PER,8,[-0.2103 -0.8536 0.0003 0.3010 0.2814 0.501...,6,I-MISC,I,MISC,[0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.96...
351042,test,0,41,1306,0,1,False,I,PER,I-PER,8,[-0.2321 -0.9291 0.3889 -0.5655 0.1010 -0.303...,6,I-MISC,I,MISC,[0.0000 0.0000 0.0000 0.0000 0.0000 0.0037 0.99...
351043,test,0,42,2001,0,1,False,I,PER,I-PER,8,[ 0.3684 -0.6809 -0.1059 -0.3448 0.5867 0.163...,8,I-PER,I,PER,[0.0020 0.1005 0.0000 0.0000 0.0009 0.2209 0.00...
351044,test,0,43,1181,0,1,False,I,PER,I-PER,8,[-0.3013 -0.6546 -0.1727 -0.1366 0.2293 0.075...,8,I-PER,I,PER,[0.0001 0.0000 0.0000 0.0000 0.0000 0.0404 0.08...
351045,test,0,44,2293,0,1,False,I,PER,I-PER,8,[-0.1612 -0.6989 0.2342 -0.4155 -0.1251 -0.282...,5,I-LOC,I,LOC,[0.0070 0.0032 0.0000 0.0000 0.0000 0.9084 0.07...
351046,test,0,45,18589,0,1,False,B,LOC,B-LOC,1,[-0.0586 -0.7956 0.3361 -0.1007 -0.0177 -0.036...,1,B-LOC,B,LOC,[0.0064 0.9025 0.0000 0.0000 0.0001 0.0899 0.00...
351047,test,0,46,118,0,1,False,I,LOC,I-LOC,5,[ 0.2038 -0.7373 -0.0889 -0.1174 0.0234 0.149...,5,I-LOC,I,LOC,[0.2005 0.0001 0.0000 0.0000 0.0000 0.7844 0.00...
351048,test,0,47,19016,0,1,False,I,LOC,I-LOC,5,[-0.1034 -0.3368 0.1738 0.4988 -0.3016 -0.128...,1,B-LOC,B,LOC,[0.0336 0.5334 0.0000 0.0000 0.0001 0.4324 0.00...
351049,test,0,48,2249,0,1,False,I,LOC,I-LOC,5,[-0.4054 -0.6517 0.2470 0.3198 -0.0794 0.101...,5,I-LOC,I,LOC,[0.0003 0.0001 0.0000 0.0000 0.0000 0.9995 0.00...
351050,test,0,49,117,0,1,False,O,,O,0,[-0.1683 -0.6476 0.8149 -0.0069 -0.4896 -0.410...,0,O,O,,[1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.00...


In [110]:
# Find the predictions for which the model was least able to separate the classes.
# NumPy functions on the TensorArray mix easily with standard Pandas operations on the DataFrame.
# Here we find the max probability for each prediction, then sort the DataFrame by that value, ascending.
test_results_df["class_max_pr"] = np.max(test_results_df["predicted_class_pr"].array, axis=1)
pr_asc_df = test_results_df.sort_values(by="class_max_pr")
pr_asc_df[["doc_num", "token_id", "token_class", "predicted_class", "class_max_pr", "predicted_class_pr"]].head(20)

Unnamed: 0,doc_num,token_id,token_class,predicted_class,class_max_pr,predicted_class_pr
414699,222,371,O,B-MISC,0.238123,[0.0820 0.0035 0.2381 0.1420 0.0009 0.2229 0.19...
390307,133,8,B-LOC,B-MISC,0.291589,[0.2627 0.2387 0.2916 0.0337 0.1729 0.0000 0.00...
367670,54,49,I-ORG,I-ORG,0.294557,[0.1764 0.0000 0.0000 0.0008 0.0000 0.0726 0.17...
363788,42,28,O,O,0.309699,[0.3097 0.0000 0.0000 0.0001 0.0000 0.1044 0.30...
411548,213,14,I-MISC,I-LOC,0.317874,[0.2891 0.0000 0.0002 0.0000 0.0000 0.3179 0.04...
385606,114,10,I-LOC,B-MISC,0.327414,[0.2290 0.0000 0.3274 0.1464 0.0000 0.0000 0.26...
358565,23,387,B-ORG,B-ORG,0.327566,[0.2801 0.1100 0.0961 0.3276 0.1589 0.0000 0.00...
358426,23,248,I-ORG,I-MISC,0.347311,[0.0180 0.0019 0.0317 0.2189 0.0000 0.1304 0.34...
364233,44,20,I-ORG,I-PER,0.348541,[0.0012 0.0000 0.0000 0.0000 0.0000 0.3160 0.01...
386205,116,116,O,I-PER,0.350508,[0.3246 0.0000 0.0000 0.0823 0.2418 0.0000 0.00...
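
The same pattern can be tried on a plain NumPy array without the trained model. The toy probabilities below are made up for illustration: the row-wise maximum serves as a simple confidence score, and sorting by it in ascending order puts the least certain predictions first.

```python
import numpy as np
import pandas as pd

# One row per token, one column per class (toy values).
toy_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
])

toy_df = pd.DataFrame({"token_id": [0, 1, 2]})
# Highest class probability per row, used as a confidence score.
toy_df["class_max_pr"] = toy_probs.max(axis=1)
# Index of the winning class per row.
toy_df["predicted_id"] = toy_probs.argmax(axis=1)

# Least confident predictions first.
toy_sorted = toy_df.sort_values(by="class_max_pr")
```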


In [124]:
# Take the 100 least confident predictions and count how many come from each
# document, to see which documents gave the model the most difficulty.
lowest_pr_df = pr_asc_df[["doc_num", "token_id", "input_id", "predicted_class", "class_max_pr"]].head(100)
lowest_pr_df["doc_num"].value_counts().head(5)

23     11
27     10
222     6
220     5
133     3
Name: doc_num, dtype: int64

In [66]:
# TODO: trying to get the char span for tokens
lowest_pr_df = lowest_pr_df.merge(test_df, on=["doc_num", "token_id"])
lowest_pr_df = lowest_pr_df[["doc_num", "token_id", "token_class", "predicted_class", "class_max_pr"]]
lowest_pr_df

Unnamed: 0,doc_num,token_id,token_class,predicted_class,class_max_pr
0,553,404,I-PER,I-ORG,0.306939
1,149,33,O,O,0.319178
2,671,20,I-MISC,O,0.328274
3,582,12,O,O,0.335614
4,760,20,I-MISC,B-ORG,0.338548
...,...,...,...,...,...
195,649,32,O,O,0.501101
196,862,379,O,O,0.501298
197,405,26,O,I-MISC,0.501387
198,408,23,I-PER,O,0.501434


In [117]:
example_df = corpus_raw["test"][23]
spans_df = tp.iob_to_spans(example_df)
bert_toks_df = tp.make_bert_tokens(example_df["char_span"].values[0].target_text, tokenizer)
bert_token_spans = tp.TokenSpanArray.align_to_tokens(bert_toks_df["char_span"],
                                                     spans_df["token_span"])
bert_toks_df[["ent_iob", "ent_type"]] = tp.spans_to_iob(bert_token_spans, 
                                                        spans_df["ent_type"])
bert_toks_df

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,
...,...,...,...,...,...,...,...,...,...
540,540,"[1228, 1230): 'DM'","[1228, 1230): 'DM'",20002,0,1,False,I,LOC
541,541,"[1230, 1232): 'ON'","[1230, 1232): 'ON'",11414,0,1,False,I,LOC
542,542,"[1232, 1234): 'TO'","[1232, 1234): 'TO'",18082,0,1,False,I,LOC
543,543,"[1234, 1235): 'N'","[1234, 1235): 'N'",2249,0,1,False,I,LOC


In [118]:
example_df

Unnamed: 0,char_span,token_span,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'","[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'",5246
1,"[11, 14): 'NHL'","[11, 14): 'NHL'",B,ORG,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5248
2,"[15, 18): 'ICE'","[15, 18): 'ICE'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5249
3,"[19, 25): 'HOCKEY'","[19, 25): 'HOCKEY'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5250
4,"[25, 26): '-'","[25, 26): '-'",O,,"[11, 60): 'NHL ICE HOCKEY- STANDINGS AFTER THU...",5251
...,...,...,...,...,...,...
290,"[1205, 1207): 'AT'","[1205, 1207): 'AT'",O,,"[1196, 1216): 'ST LOUIS AT COLORADO'",5584
291,"[1208, 1216): 'COLORADO'","[1208, 1216): 'COLORADO'",B,LOC,"[1196, 1216): 'ST LOUIS AT COLORADO'",5585
292,"[1217, 1223): 'OTTAWA'","[1217, 1223): 'OTTAWA'",B,ORG,"[1217, 1235): 'OTTAWA AT EDMONTON'",5587
293,"[1224, 1226): 'AT'","[1224, 1226): 'AT'",O,,"[1217, 1235): 'OTTAWA AT EDMONTON'",5588


In [127]:
# Select the low-confidence predictions that come from document 23.
mask = lowest_pr_df["doc_num"] == 23

In [128]:
temp = lowest_pr_df[mask]
temp

Unnamed: 0,doc_num,token_id,input_id,predicted_class,class_max_pr
358565,23,387,19497,B-ORG,0.327566
358426,23,248,19432,I-MISC,0.347311
358450,23,272,12880,I-LOC,0.351102
358569,23,391,17145,I-ORG,0.365688
358273,23,95,13882,I-MISC,0.376677
358362,23,184,2271,I-MISC,0.388287
358550,23,372,2271,I-ORG,0.413163
358272,23,94,23314,O,0.436398
358465,23,287,13882,I-MISC,0.440996
358540,23,362,1658,I-MISC,0.448011


In [129]:
result = temp.merge(bert_toks_df, on="token_id")
result

Unnamed: 0,doc_num,token_id,input_id_x,predicted_class,class_max_pr,char_span,token_span,input_id_y,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type
0,23,387,19497,B-ORG,0.327566,"[911, 913): 'VA'","[911, 913): 'VA'",19497,0,1,False,B,ORG
1,23,248,19432,I-MISC,0.347311,"[602, 604): 'IS'","[602, 604): 'IS'",19432,0,1,False,I,ORG
2,23,272,12880,I-LOC,0.351102,"[657, 659): 'TE'","[657, 659): 'TE'",12880,0,1,False,O,
3,23,391,17145,I-ORG,0.365688,"[917, 919): 'VE'","[917, 919): 'VE'",17145,0,1,False,I,ORG
4,23,95,13882,I-MISC,0.376677,"[255, 257): 'SI'","[255, 257): 'SI'",13882,0,1,False,O,
5,23,184,2271,I-MISC,0.388287,"[452, 453): 'F'","[452, 453): 'F'",2271,0,1,False,O,
6,23,372,2271,I-ORG,0.413163,"[877, 878): 'F'","[877, 878): 'F'",2271,0,1,False,O,
7,23,94,23314,O,0.436398,"[253, 255): 'VI'","[253, 255): 'VI'",23314,0,1,False,O,
8,23,287,13882,I-MISC,0.440996,"[685, 687): 'SI'","[685, 687): 'SI'",13882,0,1,False,I,MISC
9,23,362,1658,I-MISC,0.448011,"[859, 860): 'C'","[859, 860): 'C'",1658,0,1,False,I,LOC
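
The `input_id_x` and `input_id_y` columns in this output come from Pandas itself: when both sides of a merge share a column that is not a join key, each copy is kept with a suffix. The sketch below uses small toy frames (the third row of `right` is invented) to show that behavior and one possible way to avoid it, namely joining on both shared columns.

```python
import pandas as pd

# Toy stand-ins for the two dataframes being merged.
left = pd.DataFrame({
    "token_id": [387, 248],
    "input_id": [19497, 19432],
    "class_max_pr": [0.327566, 0.347311],
})
right = pd.DataFrame({
    "token_id": [248, 387, 500],
    "input_id": [19432, 19497, 2249],
    "ent_iob": ["I", "B", "I"],
})

# Joining on token_id alone keeps both copies of input_id (_x and _y).
dup = left.merge(right, on="token_id")

# Joining on both shared columns keeps a single input_id column.
clean = left.merge(right, on=["token_id", "input_id"])
```

Joining on both columns also acts as a sanity check: a row whose `input_id` disagreed between the two frames would drop out of the inner join instead of passing through silently.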


In [125]:
bert_toks_df

Unnamed: 0,token_id,char_span,token_span,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type
0,0,"[0, 0): ''","[0, 0): ''",101,0,1,True,O,
1,1,"[0, 1): '-'","[0, 1): '-'",118,0,1,False,O,
2,2,"[1, 2): 'D'","[1, 2): 'D'",141,0,1,False,O,
3,3,"[2, 4): 'OC'","[2, 4): 'OC'",9244,0,1,False,O,
4,4,"[4, 6): 'ST'","[4, 6): 'ST'",9272,0,1,False,O,
...,...,...,...,...,...,...,...,...,...
540,540,"[1228, 1230): 'DM'","[1228, 1230): 'DM'",20002,0,1,False,I,LOC
541,541,"[1230, 1232): 'ON'","[1230, 1232): 'ON'",11414,0,1,False,I,LOC
542,542,"[1232, 1234): 'TO'","[1232, 1234): 'TO'",18082,0,1,False,I,LOC
543,543,"[1234, 1235): 'N'","[1234, 1235): 'N'",2249,0,1,False,I,LOC


In [115]:
bert_toks_df["char_span"].values
#bert_token_spans

Unnamed: 0,begin,end,covered_text
0,0,0,
1,0,1,-
2,1,2,D
3,2,4,OC
4,4,6,ST
5,6,8,AR
6,8,9,T
7,9,10,-
8,11,14,NHL
9,15,16,I


In [6]:
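# Probability vectors for tokens whose predicted entity type is not missing.
# Use notna() for the null check; comparing a column to None with != does
# not filter out the missing values.
a = train_results_df["predicted_class_pr"]
a[train_results_df["predicted_type"].notna()].array[:10]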

NameError: name 'train_results_df' is not defined
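
Apart from the `NameError` (the cell ran before `train_results_df` was defined), filtering on a column that contains missing values is usually done with a null check. Here is a minimal sketch on a toy Series with invented values. To our knowledge, an elementwise `!=` comparison against `None` does not single out the missing entries, so `notna()` is the more dependable choice.

```python
import pandas as pd

# Toy stand-in for a column of predicted entity types (hypothetical values).
predicted_type = pd.Series(["LOC", None, "PER", None], dtype=object)

# notna() is True only where a value is actually present.
has_type = predicted_type.notna()
typed_only = predicted_type[has_type]
```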

In [21]:
train_results_df.iloc[50:75]

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding,predicted_id,predicted_class,predicted_iob,predicted_type
50,train,0,50,4959,0,1,False,O,,O,0,[-1.01219468e-01 -5.39751053e-01 8.26063514e-0...,0,O,O,
51,train,0,51,2480,0,1,False,O,,O,0,[-1.16538428e-01 -5.09168088e-01 8.96665156e-0...,0,O,O,
52,train,0,52,6340,0,1,False,O,,O,0,[-4.10497747e-03 -5.45212686e-01 9.23570395e-0...,0,O,O,
53,train,0,53,13991,0,1,False,O,,O,0,[-7.26077035e-02 -4.23760414e-01 8.54078233e-0...,0,O,O,
54,train,0,54,3653,0,1,False,O,,O,0,[-5.48630059e-02 -4.56629217e-01 7.94641674e-0...,0,O,O,
55,train,0,55,1169,0,1,False,O,,O,0,[ 0.0500814 -0.47165635 0.8914107 -0.1395002...,0,O,O,
56,train,0,56,1129,0,1,False,O,,O,0,[ 1.15580121e-02 -5.19423008e-01 8.66512358e-0...,0,O,O,
57,train,0,57,12086,0,1,False,O,,O,0,[ 2.52563171e-02 -5.06482542e-01 9.20146346e-0...,0,O,O,
58,train,0,58,1106,0,1,False,O,,O,0,[ 3.49620320e-02 -4.57040161e-01 9.15991008e-0...,0,O,O,
59,train,0,59,8892,0,1,False,O,,O,0,[-2.93209311e-02 -4.76547778e-01 8.20307672e-0...,0,O,O,


In [55]:
# Look at results on the test set
test_results_df = predict_on_df(corpus_df[corpus_df["fold"] == "test"],
                                int_to_label, base_model)
# Pick out a region with more entities
test_results_df.iloc[40:60]

Unnamed: 0,fold,doc_num,token_id,input_id,token_type_id,attention_mask,special_tokens_mask,ent_iob,ent_type,token_class,token_class_id,embedding,predicted_id,predicted_class,predicted_iob,predicted_type,predicted_class_pr
351041,test,0,40,3309,0,1,False,I,PER,I-PER,8,[-0.2103 -0.8536 0.0003 0.3010 0.2814 0.501...,6,I-MISC,I,MISC,[0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.96...
351042,test,0,41,1306,0,1,False,I,PER,I-PER,8,[-0.2321 -0.9291 0.3889 -0.5655 0.1010 -0.303...,6,I-MISC,I,MISC,[0.0000 0.0000 0.0000 0.0000 0.0000 0.0037 0.99...
351043,test,0,42,2001,0,1,False,I,PER,I-PER,8,[ 0.3684 -0.6809 -0.1059 -0.3448 0.5867 0.163...,8,I-PER,I,PER,[0.0020 0.1005 0.0000 0.0000 0.0009 0.2209 0.00...
351044,test,0,43,1181,0,1,False,I,PER,I-PER,8,[-0.3013 -0.6546 -0.1727 -0.1366 0.2293 0.075...,8,I-PER,I,PER,[0.0001 0.0000 0.0000 0.0000 0.0000 0.0404 0.08...
351045,test,0,44,2293,0,1,False,I,PER,I-PER,8,[-0.1612 -0.6989 0.2342 -0.4155 -0.1251 -0.282...,5,I-LOC,I,LOC,[0.0070 0.0032 0.0000 0.0000 0.0000 0.9084 0.07...
351046,test,0,45,18589,0,1,False,B,LOC,B-LOC,1,[-0.0586 -0.7956 0.3361 -0.1007 -0.0177 -0.036...,1,B-LOC,B,LOC,[0.0064 0.9025 0.0000 0.0000 0.0001 0.0899 0.00...
351047,test,0,46,118,0,1,False,I,LOC,I-LOC,5,[ 0.2038 -0.7373 -0.0889 -0.1174 0.0234 0.149...,5,I-LOC,I,LOC,[0.2005 0.0001 0.0000 0.0000 0.0000 0.7844 0.00...
351048,test,0,47,19016,0,1,False,I,LOC,I-LOC,5,[-0.1034 -0.3368 0.1738 0.4988 -0.3016 -0.128...,1,B-LOC,B,LOC,[0.0336 0.5334 0.0000 0.0000 0.0001 0.4324 0.00...
351049,test,0,48,2249,0,1,False,I,LOC,I-LOC,5,[-0.4054 -0.6517 0.2470 0.3198 -0.0794 0.101...,5,I-LOC,I,LOC,[0.0003 0.0001 0.0000 0.0000 0.0000 0.9995 0.00...
351050,test,0,49,117,0,1,False,O,,O,0,[-0.1683 -0.6476 0.8149 -0.0069 -0.4896 -0.410...,0,O,O,,[1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.00...
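
The `predicted_iob` and `predicted_type` columns look like the two halves of `predicted_class`. The body of `predict_on_df` is not shown here, so the following is only an assumption about how such a split could be done with plain Pandas string operations.

```python
import pandas as pd

# A few class labels in the same "IOB-TYPE" format as the table above.
predicted_class = pd.Series(["I-MISC", "I-PER", "B-LOC", "O"])

# Split on the first hyphen; "O" has no type, so its second part is missing.
parts = predicted_class.str.split("-", n=1, expand=True)
predicted_iob = parts[0]
predicted_type = parts[1]
```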


In [24]:
# Split model outputs for an entire fold back into documents and add
# token information.
test_results_by_doc = util.align_model_outputs_to_tokens(test_results_df,
                                                         bert_toks_by_fold)
test_results_by_doc[("test", 0)].iloc[40:60]

Unnamed: 0,token_id,char_span,token_span,ent_iob,ent_type,predicted_iob,predicted_type
40,40,"[68, 70): 'di'","[68, 70): 'di'",I,PER,I,MISC
41,41,"[70, 71): 'm'","[70, 71): 'm'",I,PER,I,MISC
42,42,"[72, 74): 'La'","[72, 74): 'La'",I,PER,I,PER
43,43,"[74, 75): 'd'","[74, 75): 'd'",I,PER,I,PER
44,44,"[75, 77): 'ki'","[75, 77): 'ki'",I,PER,I,LOC
45,45,"[78, 80): 'AL'","[78, 80): 'AL'",B,LOC,B,LOC
46,46,"[80, 81): '-'","[80, 81): '-'",I,LOC,I,LOC
47,47,"[81, 83): 'AI'","[81, 83): 'AI'",I,LOC,B,LOC
48,48,"[83, 84): 'N'","[83, 84): 'N'",I,LOC,I,LOC
49,49,"[84, 85): ','","[84, 85): ','",O,,O,


In [25]:
# Convert IOB2-format output (and gold standard tags) to spans.
test_actual_spans = {k: tp.iob_to_spans(v) for k, v in test_results_by_doc.items()}
test_model_spans = {
    k: tp.iob_to_spans(v, iob_col_name="predicted_iob",
                       entity_type_col_name="predicted_type")
         .rename(columns={"predicted_type": "ent_type"})
    for k, v in test_results_by_doc.items()
}

test_model_spans[("test", 0)].head()

Unnamed: 0,token_span,ent_type
0,"[19, 24): 'JAPAN'",PER
1,"[29, 34): 'LUCKY'",LOC
2,"[40, 45): 'CHINA'",ORG
3,"[78, 81): 'AL-'",LOC
4,"[81, 84): 'AIN'",LOC
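
To make the IOB2-to-span conversion concrete, here is a simplified sketch in plain Python. It is not the implementation of `tp.iob_to_spans`; the function name `iob_to_span_tuples` is hypothetical, and the sketch works on raw `(begin, end)` character offsets instead of span arrays. A `B` tag opens a new entity, an `I` tag extends the open entity, and anything else closes it.

```python
from typing import List, Optional, Tuple

def iob_to_span_tuples(
    token_offsets: List[Tuple[int, int]],
    iob_tags: List[str],
    ent_types: List[Optional[str]],
) -> List[Tuple[int, int, str]]:
    """Group IOB2-tagged tokens into (begin, end, type) character spans."""
    spans = []
    current = None  # [begin, end, type] of the entity being built
    for (begin, end), iob, ent_type in zip(token_offsets, iob_tags, ent_types):
        if iob == "B":
            # A new entity starts; close the previous one if it is open.
            if current is not None:
                spans.append(tuple(current))
            current = [begin, end, ent_type]
        elif iob == "I" and current is not None:
            # Extend the open entity (this sketch ignores type mismatches).
            current[1] = end
        else:
            # "O", or an "I" with no open entity (dropped in this sketch).
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans

# Predicted tags for tokens 45-49 of document ("test", 0) shown earlier.
offsets = [(78, 80), (80, 81), (81, 83), (83, 84), (84, 85)]
tags = ["B", "I", "B", "I", "O"]
types = ["LOC", "LOC", "LOC", "LOC", None]
toy_spans = iob_to_span_tuples(offsets, tags, types)
```

On these five tokens the sketch yields two `LOC` spans, `[78, 81)` and `[81, 84)`, which lines up with rows 3 and 4 of the output above.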


In [26]:
# Every once in a while, the model will split a token in the original data
# set into two entities. For example, look at document 202 of the test
# set:
doc_key = ("test", 202)
test_model_spans[doc_key].head(10)

Unnamed: 0,token_span,ent_type
0,"[11, 22): 'RUGBY UNION'",ORG
1,"[24, 31): 'BRITISH'",MISC
2,"[41, 47): 'LONDON'",LOC
3,"[70, 77): 'British'",MISC
4,"[111, 125): 'Pilkington Cup'",MISC
5,"[139, 146): 'Reading'",ORG
6,"[150, 151): 'W'",ORG
7,"[151, 156): 'idnes'",ORG
8,"[159, 166): 'English'",MISC
9,"[180, 184): 'Bath'",ORG


In [27]:
# Notice `[150, 151): 'W'` and `[151, 156): 'idnes'`. These outputs are part
# of the same original token.
# We can use spanner algebra to fix up these outputs.
spans_df = test_model_spans[doc_key]
toks_df = test_raw[202]

# First, find which tokens the spans overlap with:
overlaps_df = (
    tp
    .overlap_join(spans_df["token_span"], toks_df["char_span"],
                 "token_span", "corpus_token")
    .merge(spans_df)
)
overlaps_df.head(10)

Unnamed: 0,token_span,corpus_token,ent_type
0,"[11, 22): 'RUGBY UNION'","[11, 16): 'RUGBY'",ORG
1,"[11, 22): 'RUGBY UNION'","[17, 22): 'UNION'",ORG
2,"[24, 31): 'BRITISH'","[24, 31): 'BRITISH'",MISC
3,"[41, 47): 'LONDON'","[41, 47): 'LONDON'",LOC
4,"[70, 77): 'British'","[70, 77): 'British'",MISC
5,"[111, 125): 'Pilkington Cup'","[111, 121): 'Pilkington'",MISC
6,"[111, 125): 'Pilkington Cup'","[122, 125): 'Cup'",MISC
7,"[139, 146): 'Reading'","[139, 146): 'Reading'",ORG
8,"[150, 151): 'W'","[150, 156): 'Widnes'",ORG
9,"[151, 156): 'idnes'","[150, 156): 'Widnes'",ORG
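
The overlap join pairs every entity span with every corpus token that shares at least one character with it. A brute-force sketch on half-open `[begin, end)` offsets is shown below. The helper name `overlap_pairs` is hypothetical, and the quadratic loop is only meant to illustrate the join condition, not how `tp.overlap_join` is implemented.

```python
from typing import List, Tuple

Interval = Tuple[int, int]

def overlap_pairs(left: List[Interval],
                  right: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return all (left, right) pairs of half-open intervals that overlap."""
    return [
        (a, b)
        for a in left
        for b in right
        # Two half-open intervals overlap when each starts before the other ends.
        if a[0] < b[1] and b[0] < a[1]
    ]

# Offsets taken from rows 7-9 of the output above.
entity_spans = [(139, 146), (150, 151), (151, 156)]
corpus_tokens = [(139, 146), (150, 156), (159, 166)]
pairs = overlap_pairs(entity_spans, corpus_tokens)
```

Both `[150, 151)` and `[151, 156)` pair up with the single corpus token `[150, 156)`, which is the signal that the model split one original token into two entities.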


In [28]:
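# Next, compute the minimum span that covers all the corpus tokens
# that overlap with each entity span. Adding two spans produces the
# smallest span that contains both, so a "sum" aggregate does the job.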
agg_df = (
    overlaps_df
    .groupby("token_span")
    .aggregate({"corpus_token": "sum", "ent_type": "first"})
    .reset_index()
)
agg_df.head(10)

Unnamed: 0,token_span,corpus_token,ent_type
0,"[11, 22): 'RUGBY UNION'","[11, 22): 'RUGBY UNION'",ORG
1,"[24, 31): 'BRITISH'","[24, 31): 'BRITISH'",MISC
2,"[41, 47): 'LONDON'","[41, 47): 'LONDON'",LOC
3,"[70, 77): 'British'","[70, 77): 'British'",MISC
4,"[111, 125): 'Pilkington Cup'","[111, 125): 'Pilkington Cup'",MISC
5,"[139, 146): 'Reading'","[139, 146): 'Reading'",ORG
6,"[150, 151): 'W'","[150, 156): 'Widnes'",ORG
7,"[151, 156): 'idnes'","[150, 156): 'Widnes'",ORG
8,"[159, 166): 'English'","[159, 166): 'English'",MISC
9,"[180, 184): 'Bath'","[180, 184): 'Bath'",ORG


In [29]:
# Finally, take unique values and convert character-based spans to spans
# in the corpus tokenization (since the new offsets might not match a
# BERT tokenizer token boundary)
cons_df = (
    tp.consolidate(agg_df, "corpus_token")
    [["corpus_token", "ent_type"]]
    .rename(columns={"corpus_token": "token_span"})
)
cons_df["token_span"] = tp.TokenSpanArray.align_to_tokens(toks_df["char_span"],
                                                          cons_df["token_span"])
cons_df.head(10)

Unnamed: 0,token_span,ent_type
0,"[11, 22): 'RUGBY UNION'",ORG
1,"[24, 31): 'BRITISH'",MISC
2,"[41, 47): 'LONDON'",LOC
3,"[70, 77): 'British'",MISC
4,"[111, 125): 'Pilkington Cup'",MISC
5,"[139, 146): 'Reading'",ORG
6,"[150, 156): 'Widnes'",ORG
8,"[159, 166): 'English'",MISC
9,"[180, 184): 'Bath'",ORG
10,"[188, 198): 'Harlequins'",ORG
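
The aggregation and consolidation steps can be approximated on plain integer offsets. The sketch below is an illustration with ordinary Pandas operations, not the library's span arithmetic: the covering span of a group is taken as the smallest begin and the largest end of its tokens, and exact duplicates are then dropped. The column names are made up for this example.

```python
import pandas as pd

# One row per (entity span, overlapping corpus token) pair, as integer offsets.
overlaps = pd.DataFrame({
    "span_begin": [11, 11, 150, 151],
    "span_end": [22, 22, 151, 156],
    "tok_begin": [11, 17, 150, 150],
    "tok_end": [16, 22, 156, 156],
    "ent_type": ["ORG", "ORG", "ORG", "ORG"],
})

# Smallest span that covers all tokens overlapping each entity span.
covering = (
    overlaps
    .groupby(["span_begin", "span_end"], as_index=False)
    .agg(cover_begin=("tok_begin", "min"),
         cover_end=("tok_end", "max"),
         ent_type=("ent_type", "first"))
)

# Identical covering spans collapse into one row, merging 'W' and 'idnes'.
merged = covering.drop_duplicates(subset=["cover_begin", "cover_end"])
```

Dropping exact duplicates is enough for this toy case. Consolidating spans that merely overlap, without being identical, would need more logic than this sketch has.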


In [30]:
# Text Extensions for Pandas provides a single function that repeats the
# actions of the previous 3 cells.
tp.align_bert_tokens_to_corpus_tokens(test_model_spans[doc_key], test_raw[202]).head(10)

Unnamed: 0,token_span,ent_type
0,"[11, 22): 'RUGBY UNION'",ORG
1,"[24, 31): 'BRITISH'",MISC
2,"[41, 47): 'LONDON'",LOC
3,"[70, 77): 'British'",MISC
4,"[111, 125): 'Pilkington Cup'",MISC
5,"[139, 146): 'Reading'",ORG
6,"[150, 156): 'Widnes'",ORG
8,"[159, 166): 'English'",MISC
9,"[180, 184): 'Bath'",ORG
10,"[188, 198): 'Harlequins'",ORG


In [31]:
# Run all of our dataframes through `align_bert_tokens_to_corpus_tokens`
keys = list(test_model_spans.keys())
new_values = tp.run_with_progress_bar(
    len(keys), 
    lambda i: tp.align_bert_tokens_to_corpus_tokens(test_model_spans[keys[i]], test_raw[keys[i][1]]))
test_model_spans = {k: v for k, v in zip(keys, new_values)}
test_model_spans[doc_key].head(10)

Unnamed: 0,token_span,ent_type
0,"[11, 22): 'RUGBY UNION'",ORG
1,"[24, 31): 'BRITISH'",MISC
2,"[41, 47): 'LONDON'",LOC
3,"[70, 77): 'British'",MISC
4,"[111, 125): 'Pilkington Cup'",MISC
5,"[139, 146): 'Reading'",ORG
6,"[150, 156): 'Widnes'",ORG
8,"[159, 166): 'English'",MISC
9,"[180, 184): 'Bath'",ORG
10,"[188, 198): 'Harlequins'",ORG


In [32]:
# Same per-document statistics calculation code as in CoNLL_2.ipynb
test_stats_by_doc = tp.compute_accuracy_by_document(test_actual_spans, test_model_spans)
test_stats_by_doc

Unnamed: 0,fold,doc_num,num_true_positives,num_extracted,num_entities,precision,recall,F1
0,test,0,42,46,45,0.913043,0.933333,0.923077
1,test,1,41,44,44,0.931818,0.931818,0.931818
2,test,2,52,54,54,0.962963,0.962963,0.962963
3,test,3,42,44,44,0.954545,0.954545,0.954545
4,test,4,18,19,19,0.947368,0.947368,0.947368
...,...,...,...,...,...,...,...,...
226,test,226,7,7,7,1.000000,1.000000,1.000000
227,test,227,18,20,21,0.900000,0.857143,0.878049
228,test,228,23,28,27,0.821429,0.851852,0.836364
229,test,229,26,27,27,0.962963,0.962963,0.962963


In [33]:
# Collection-wide precision and recall can be computed by aggregating
# our dataframe.
tp.compute_global_accuracy(test_stats_by_doc)

{'num_true_positives': 4926,
 'num_entities': 5648,
 'num_extracted': 5600,
 'precision': 0.8796428571428572,
 'recall': 0.8721671388101983,
 'F1': 0.8758890469416786}
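
For reference, the global scores follow directly from the three counts. The sketch below is not the code of `tp.compute_global_accuracy`; the helper name `precision_recall_f1` is hypothetical. It sums the per-document counts before taking ratios (micro-averaging), which is the usual convention and reproduces the numbers above.

```python
import pandas as pd

def precision_recall_f1(num_true_positives, num_extracted, num_entities):
    """Compute precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = num_true_positives / num_extracted if num_extracted else 0.0
    recall = num_true_positives / num_entities if num_entities else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# First three rows of the per-document table above.
per_doc = pd.DataFrame({
    "num_true_positives": [42, 41, 52],
    "num_extracted": [46, 44, 54],
    "num_entities": [45, 44, 54],
})

# Sum the counts first, then compute the ratios.
totals = per_doc.sum()
subset_scores = precision_recall_f1(
    totals["num_true_positives"], totals["num_extracted"], totals["num_entities"])

# The corpus-wide counts reported above.
global_scores = precision_recall_f1(4926, 5600, 5648)
```

Summing counts before dividing weights every entity equally. Averaging the per-document precision and recall columns would instead weight every document equally and generally gives different numbers.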

In [34]:
# analyze_model() in util.py combines the above postprocessing steps
# into a single call.
base_test_results = util.analyze_model(
    corpus_df[corpus_df["fold"] == "test"], 
    int_to_label, base_model, bert_toks_by_fold, corpus_raw, expand_matches=True)
base_test_results["global_scores"]

{'num_true_positives': 4926,
 'num_entities': 5648,
 'num_extracted': 5600,
 'precision': 0.8796428571428572,
 'recall': 0.8721671388101983,
 'F1': 0.8758890469416786}

In [35]:
# Results on the training set
base_train_results = util.analyze_model(
    corpus_df[corpus_df["fold"] == "train"],
    int_to_label, base_model, bert_toks_by_fold, corpus_raw, expand_matches=True)
base_train_results["global_scores"]

{'num_true_positives': 23152,
 'num_entities': 23499,
 'num_extracted': 23402,
 'precision': 0.9893171523801384,
 'recall': 0.9852334141878378,
 'F1': 0.9872710603185432}

In [36]:
# Results on the development set
base_dev_results = util.analyze_model(
    corpus_df[corpus_df["fold"] == "dev"],
    int_to_label, base_model, bert_toks_by_fold, corpus_raw, expand_matches=True)
base_dev_results["global_scores"]

{'num_true_positives': 5506,
 'num_entities': 5942,
 'num_extracted': 5913,
 'precision': 0.9311686115339083,
 'recall': 0.9266240323123528,
 'F1': 0.9288907633909743}