# End-to-End NED Tutorial

In this tutorial, we walk through how to use Bootleg as an end-to-end pipeline to detect and label entities in a set of sentences. First, we show how to use Bootleg to detect mentions and disambiguate them to entities. We then compare the results to TAGME, an existing system.

This tutorial assumes you want to use Bootleg on full datasets. You can also use Bootleg in annotator mode:

```
# Install first, from a shell: pip install bootleg
from bootleg.end2end.bootleg_annotator import BootlegAnnotator
ann = BootlegAnnotator()
ann.label_mentions("Bob Dylan release Desire")["titles"]
```

To understand how Bootleg performs on more natural language than we find in Wikipedia, we hand-label the mentions and corresponding entities in 50 questions sampled from the [Natural Questions dataset (Google)](https://ai.google.com/research/NaturalQuestions). We will evaluate our *uncased* Bootleg model. However, we have manually cased the data in case you want to try our cased model instead.

### Requirements

You will need to download the following files for this notebook:
- Pretrained Bootleg uncased model and config [here](https://bootleg-data.s3.amazonaws.com/models/lateset/bootleg_uncased.tar.gz). Cased model and config [here](https://bootleg-data.s3.amazonaws.com/models/lateset/bootleg_cased.tar.gz)
- Sample of Natural Questions with hand-labelled entities [here](https://bootleg-data.s3.amazonaws.com/data/lateset/nq.tar.gz)
- Entity data [here](https://bootleg-data.s3.amazonaws.com/data/lateset/wiki_entity_data.tar.gz)
- Embedding data [here](https://bootleg-data.s3.amazonaws.com/data/lateset/emb_data.tar.gz)

For convenience, you can run the commands below (from the root directory of the repo) to download all the above files and unpack them to `models` and `data` directories. It will take several minutes to download all the files. 

```
    # use cased for cased model
    bash tutorials/download_model.sh uncased
    bash tutorials/download_data.sh
```

You can also run the download scripts directly in this notebook:

In [None]:
!sh download_model.sh uncased
!sh download_data.sh

In [2]:
from pathlib import Path
import pandas as pd

# set up logging
import sys
import logging
from importlib import reload
reload(logging)
# Set to logging.DEBUG for more logging output
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)


# root_dir = FILL IN FULL PATH TO DIRECTORY WHERE DATA IS DOWNLOADED (e.g., root_dir/data and root_dir/models)
root_dir = Path(".")
cand_map = root_dir / 'data/wiki_entity_data/entity_mappings/alias2qids_wiki_filt.json'

If you have a GPU with at least 12GB of memory available, set `device` below to 0 to run inference on the GPU. The default of -1 runs inference on the CPU.

In [3]:
device = -1
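
If you are unsure whether a suitable GPU is present, the choice can be automated. The sketch below is only an illustration and is not part of Bootleg: `pick_device` is a hypothetical helper, it assumes PyTorch is the backend, and it checks the GPU's total memory rather than the memory that is currently free.

```python
def pick_device(min_gb: float = 12.0) -> int:
    """Return 0 if GPU 0 has at least min_gb of total memory, else -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch is installed alongside Bootleg
    except ImportError:
        return -1
    if not torch.cuda.is_available():
        return -1
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return 0 if total_bytes >= min_gb * 1024 ** 3 else -1


device = pick_device()
```

Falling back to -1 keeps the notebook runnable on machines without a GPU, at the cost of slower inference.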

## 1. Detect Mentions
Bootleg uses a simple mention extraction algorithm that extracts mentions using a given candidate map. We will use a Wikipedia candidate map that we mined using Wikipedia anchor links and Wikidata aliases for a total of ~15 million mentions (provided in the Requirements section of this notebook).
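
To make the idea of candidate-map lookup concrete, here is a minimal sketch of one way such an extractor could work: a greedy longest-match search over word n-grams. This is not Bootleg's actual implementation. `find_mentions`, `toy_aliases`, and `max_len` are hypothetical names, and the real candidate map is a JSON file mapping aliases to candidate QIDs rather than a plain set.

```python
def find_mentions(sentence, aliases, max_len=6):
    """Greedy longest-match lookup of word n-grams in an alias set.

    Returns (alias, [start, end]) pairs with word-level, end-exclusive spans.
    """
    words = sentence.lower().split()
    found = []
    i = 0
    while i < len(words):
        match_end = None
        # Try the longest window first so "north carolina" beats "north".
        for n in range(min(max_len, len(words) - i), 0, -1):
            if " ".join(words[i:i + n]) in aliases:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            found.append((" ".join(words[i:match_end]), [i, match_end]))
            i = match_end
    return found


# Toy stand-in for the ~15 million alias candidate map.
toy_aliases = {"outer banks", "north carolina", "north"}
mentions = find_mentions(
    "What is considered the Outer Banks in North Carolina", toy_aliases
)
```

The spans use the same word-level, end-exclusive convention as the hand-labelled data in this tutorial.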

As input to the end-to-end pipeline, we assume a jsonlines file in which each line is a single dictionary with the key "sentence" whose value is the text of the sentence. For instance, you may have a file with the lines:

    {"sentence": "Who did the voice of the magician in Frosty the Snowman"}
    {"sentence": "What is considered the Outer Banks in North Carolina"}
    
Below, we have additional keys to keep track of the hand-labelled mentions, but this is purely for evaluating the quality of the end-to-end pipeline and is not needed in the common use cases of using Bootleg to detect and label mentions.
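
If your sentences are in a Python list, a file in this format can be produced with the standard library alone. The sketch below is an illustration: `write_sentences`, `read_sentences`, and the file name `my_sentences.jsonl` are hypothetical and not part of Bootleg.

```python
import json
import tempfile
from pathlib import Path


def write_sentences(sentences, out_path):
    """Write one {"sentence": ...} dictionary per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for text in sentences:
            f.write(json.dumps({"sentence": text}) + "\n")


def read_sentences(in_path):
    """Read the file back, skipping blank lines."""
    with open(in_path, encoding="utf-8") as f:
        return [json.loads(line)["sentence"] for line in f if line.strip()]


sentences = [
    "Who did the voice of the magician in Frosty the Snowman",
    "What is considered the Outer Banks in North Carolina",
]
# Hypothetical output location; a temporary directory avoids touching the repo.
out_path = Path(tempfile.mkdtemp()) / "my_sentences.jsonl"
write_sentences(sentences, out_path)
round_trip = read_sentences(out_path)
```

The resulting path could then be passed as `in_filepath` to `extract_mentions` in place of the NQ sample.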

In [4]:
nq_sample_orig = root_dir / 'data/nq/test_50.jsonl'
nq_sample_bootleg = root_dir / 'data/nq/test_50_bootleg.jsonl'

In [5]:
from bootleg.end2end.extract_mentions import extract_mentions
verbose = False
extract_mentions(in_filepath=nq_sample_orig, out_filepath=nq_sample_bootleg, cand_map_file=cand_map, verbose=verbose)

100%|██████████| 15202497/15202497 [00:28<00:00, 529759.40it/s]
100%|██████████| 1/1 [00:00<00:00, 21.53it/s]
100%|██████████| 7/7 [00:00<00:00, 40.21it/s]
100%|██████████| 7/7 [00:00<00:00, 36.73it/s]
100%|██████████| 7/7 [00:00<00:00, 33.45it/s]
100%|██████████| 7/7 [00:00<00:00, 34.35it/s]
100%|██████████| 7/7 [00:00<00:00, 31.11it/s]
100%|██████████| 7/7 [00:00<00:00, 31.36it/s]
100%|██████████| 7/7 [00:00<00:00, 25.75it/s]


By looking at a sample of the extracted mentions, we can compare the mention extraction phase to the hand-labelled mentions.

In [6]:
from utils import load_mentions

orig_mentions_df = load_mentions(nq_sample_orig)
bootleg_mentions_df = load_mentions(nq_sample_bootleg)

# join dataframes and sample
res = pd.merge(orig_mentions_df, bootleg_mentions_df, on=['sentence'], suffixes=['_hand', '_bootleg'])
display(res.sample(15))

| | sentence | aliases_hand | spans_hand | aliases_bootleg | spans_bootleg |
|---|---|---|---|---|---|
| 46 | Who controls the past controls the future Rage Against the Machine | [rage against the machine] | [[7, 11]] | [rage against the machine] | [[7, 11]] |
| 14 | Where does the last name Aponte come from | [aponte] | [[5, 6]] | [aponte] | [[5, 6]] |
| 20 | What is the worth of the Catholic Church | [catholic church] | [[6, 8]] | [worth, catholic church] | [[3, 4], [6, 8]] |
| 17 | Is it a bank holiday today in Spain | [bank holiday, spain] | [[3, 5], [7, 8]] | [bank holiday, spain] | [[3, 5], [7, 8]] |
| 32 | Who played in the last 3 NBA Finals | [nba finals] | [[6, 8]] | [nba finals] | [[6, 8]] |
| 1 | What is considered the Outer Banks in North Carolina | [outer banks, north carolina] | [[4, 6], [7, 9]] | [outer banks, north carolina] | [[4, 6], [7, 9]] |
| 47 | Who is Mariah Carey talking about in We Belong Together | [mariah carey, we belong together] | [[2, 4], [7, 10]] | [mariah carey] | [[2, 4]] |
| 24 | Who played Smiley in Tinker Tailor Soldier Spy | [tinker tailor soldier spy] | [[4, 8]] | [smiley, tinker tailor soldier spy] | [[2, 3], [4, 8]] |
| 44 | The representative of the British crown in NZ | [british crown, nz] | [[4, 6], [7, 8]] | [british crown, nz] | [[4, 6], [7, 8]] |
| 0 | Who did the voice of the magician in Frosty the Snowman | [frosty the snowman] | [[8, 11]] | [voice of, magician, frosty the snowman] | [[3, 5], [6, 7], [8, 11]] |
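
Rather than scanning rows by eye, the two alias lists can be compared programmatically. The sketch below is a simplification: `diff_mentions` is a hypothetical helper that compares alias strings only and ignores spans, so it would not distinguish two occurrences of the same alias in one sentence.

```python
def diff_mentions(hand, bootleg):
    """Return aliases found only by Bootleg ("extra") or only by hand ("missed")."""
    hand_set, bootleg_set = set(hand), set(bootleg)
    return {
        "extra": sorted(bootleg_set - hand_set),
        "missed": sorted(hand_set - bootleg_set),
    }


# Two (hand, bootleg) alias pairs copied from the sample.
catholic = diff_mentions(["catholic church"], ["worth", "catholic church"])
mariah = diff_mentions(["mariah carey", "we belong together"], ["mariah carey"])
```

Applied to each row of the merged dataframe, this separates sentences where Bootleg over-extracts from those where it misses a hand-labelled mention.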


In the sample above, Bootleg generally detects the same mentions as the hand labels, but it sometimes extracts extra ones (e.g., "worth" in "What is the worth of the Catholic Church"). This is expected: we would rather have the mention detection step filter out too few mentions than too many. The backbone model and postprocessing are responsible for removing these extra mentions, either by thresholding the prediction probability or by predicting a candidate that represents "No Candidate" (we refer to this as "NC").
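
The filtering step described here can be sketched in a few lines. This is an illustration under assumptions, not Bootleg's postprocessing code: `filter_predictions` is a hypothetical helper, and representing the "No Candidate" prediction with the string `"NC"` is an assumption made for the example.

```python
def filter_predictions(aliases, qids, probs, threshold=0.5, nc_qid="NC"):
    """Keep predictions that are not 'No Candidate' and meet the threshold."""
    return [
        (alias, qid, prob)
        for alias, qid, prob in zip(aliases, qids, probs)
        if qid != nc_qid and prob >= threshold
    ]


# Illustrative values: one low-confidence extra mention and one NC prediction.
kept = filter_predictions(
    aliases=["vampire", "nosferatu", "count orlok", "author"],
    qids=["Q7912955", "Q151895", "Q1442062", "NC"],
    probs=[0.7404002547, 0.9794661999, 1.0, 0.99],
    threshold=0.75,
)
```

Raising `threshold` trades recall for precision, since fewer but more confident predictions survive.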

## 2. Disambiguate Mentions to Entities

We run inference using a pretrained Bootleg model to disambiguate the extracted mentions to Wikidata QIDs. 

First, load the model config so we can set additional parameters and load the saved model during evaluation. We need to update the config parameters to point to the downloaded model checkpoint and data.

In [10]:
from bootleg.utils.parser.parser_utils import parse_boot_and_emm_args
from bootleg.utils.utils import load_yaml_file
from bootleg.run import run_model

config_in_path = root_dir / 'models/bootleg_uncased/bootleg_config.yaml'

config_args = load_yaml_file(config_in_path)

# decrease number of data threads as this is a small file
config_args["run_config"]["dataset_threads"] = 2
config_args["run_config"]["log_level"] = "info"
# set the model checkpoint path 
config_args["emmental"]["model_path"] = str(root_dir / 'models/bootleg_uncased/bootleg_wiki.pth')

# set the path for the entity db and candidate map
config_args["data_config"]["entity_dir"] = str(root_dir / 'data/wiki_entity_data')
config_args["data_config"]["alias_cand_map"] = "alias2qids_wiki_filt.json"

# set the data path for the NQ test file
config_args["data_config"]["data_dir"] = str(root_dir / 'data/nq')

# to speed things up for the tutorial, we have already prepped the data with the mentions detected by Bootleg
config_args["data_config"]["test_dataset"]["file"] = nq_sample_bootleg.name

# set the embedding paths 
config_args["data_config"]["emb_dir"] =  str(root_dir / 'data/emb_data')
config_args["data_config"]["word_embedding"]["cache_dir"] =  str(root_dir / 'data/emb_data/pretrained_bert_models')

# set the device (-1 for CPU, 0 for GPU)
config_args["emmental"]["device"] = device

# save the new args (helps if you want to run things via command line)
config_args = parse_boot_and_emm_args(config_args)
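
Loading the entity data and embeddings takes several minutes, so it can be worth confirming that the configured paths exist before starting inference. The sketch below is an optional check, not something Bootleg requires: `missing_paths` is a hypothetical helper, and the list of paths simply mirrors the values set in the config above.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of paths (as strings) that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


root_dir = Path(".")
expected = [
    root_dir / "models/bootleg_uncased/bootleg_config.yaml",
    root_dir / "models/bootleg_uncased/bootleg_wiki.pth",
    root_dir / "data/wiki_entity_data",
    root_dir / "data/emb_data",
    root_dir / "data/nq",
]
missing = missing_paths(expected)
if missing:
    print("Missing files or directories:", missing)
```

An empty `missing` list suggests the download scripts completed and `root_dir` points at the right location.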

In [11]:
bootleg_label_file, _ = run_model(mode="dump_preds", config=config_args)

2021-02-12 11:49:21,849 Logging was already initialized to use bootleg_logs/wiki_full_ft/2021_02_12/11_46_30/524a6d16.  To configure logging manually, call emmental.init_logging before initialiting Meta.
2021-02-12 11:49:21,910 Loading Emmental default config from /dfs/scratch0/lorr1/projects/emmental/src/emmental/emmental-default-config.yaml.
2021-02-12 11:49:21,912 Updating Emmental config from user provided config.
2021-02-12 11:49:22,042 COMMAND: /dfs/scratch0/lorr1/env_bootleg_38/lib/python3.8/site-packages/ipykernel_launcher.py -f /dfs/scratch0/lorr1/projects/bootleg/tutorials/:/afs/cs.stanford.edu/u/lorr1/.local/apt-cache/share/jupyter/runtime/kernel-94bc511c-5c97-4658-924c-58d7cc619f20.json
2021-02-12 11:49:22,043 Saving config to bootleg_logs/wiki_full_ft/2021_02_12/11_46_30/524a6d16/parsed_config.yaml
2021-02-12 11:49:22,358 Git Hash: Not able to retrieve git hash
2021-02-12 11:49:22,360 Loading entity symbols...
2021-02-12 11:51:20,374 Starting to build data for test from ..

  guid_dtype = np.dtype(
Reading in ../tutorial_data/data/nq/prep/prep_test_dataset_files/create_examples_input/out_0.jsonl: 100%|██████████| 25/25 [00:00<00:00, 426.92it/s]
Reading in ../tutorial_data/data/nq/prep/prep_test_dataset_files/create_examples_input/out_1.jsonl: 100%|██████████| 25/25 [00:00<00:00, 481.79it/s]
  descr = dtypedescr(dtype)
Processing ../tutorial_data/data/nq/prep/prep_test_dataset_files/create_examples_output/out_0.jsonl:   0%|          | 0/25 [00:00<?, ?it/s]

*** Example ***
guid:                            0 subsent 0
examples:                        [CLS] who did the voice of the magician in frost ##y the snow ##man [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
spans:                           [[4, 6], [7, 8], [9, 14]]
aliases_to_predict:              [0, 1, 2]
train_aliases_to_predict_arr:    [0, 1, 2]
alias_list_pos:                  [0, 1, 2]
aliases:                         ['voice of', 'magician', 'frosty the snowman']
qids:                            ['Q-1', 'Q-1', 'Q-1']


*** Example ***
guid:                            18 subsent 0
examples:                        [CLS] 1970 world cup semi final italy vs germany [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
spans:                           [[1, 4], [6, 7], [8, 9]]
aliases_to_predict:              [0, 1, 2]
train_aliases_to_predict_arr:    [0, 1, 2]
alias_list_pos:                  [0, 1, 2]
aliases:                         ['1970 world cup', 'italy', 'germany']
qids:                            ['Q-1', 'Q-1', 'Q

Processing ../tutorial_data/data/nq/prep/prep_test_dataset_files/create_examples_output/out_0.jsonl: 100%|██████████| 25/25 [00:00<00:00, 396.16it/s]
Processing ../tutorial_data/data/nq/prep/prep_test_dataset_files/create_examples_output/out_1.jsonl:   0%|          | 0/24 [00:00<?, ?it/s]

*** Example ***
guid:                            27 subsent 0
examples:                        [CLS] i see the river ti ##ber foam ##ing with much blood [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
spans:                           [[4, 7]]
aliases_to_predict:              [0]
train_aliases_to_predict_arr:    [0]
alias_list_pos:                  [0]
aliases:                         ['river tiber']
qids:                            ['Q-1']
*** Feature ***
start_idx_in_sent:               [ 4. -1. -1. -1. -1. -1. -1


*** Example ***
guid:                            46 subsent 0
examples:                        [CLS] who controls the past controls the future rage against the machine [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
spans:                           [[8, 12]]
aliases_to_predict:              [0]
train_aliases_to_predict_arr:    [0]
alias_list_pos:                  [0]
aliases:                         ['rage against the machine']
qids:                            ['Q-1']
*** Feature ***
start_idx_in_sent:            

Processing ../tutorial_data/data/nq/prep/prep_test_dataset_files/create_examples_output/out_1.jsonl: 100%|██████████| 24/24 [00:00<00:00, 459.03it/s]
Checking sentence uniqueness: 100%|██████████| 49/49 [00:00<00:00, 29131.24it/s]


2021-02-12 11:53:07,423 Loading data from ../tutorial_data/data/nq/prep/test_50_bootleg_bert-base-uncased_L100_A10_InC1_Aug1/ned_data.bin and ../tutorial_data/data/nq/prep/test_50_bootleg_bert-base-uncased_L100_A10_InC1_Aug1/ned_label.bin
2021-02-12 11:53:07,426 Building type labels from scatch.
2021-02-12 11:56:07,890 Creating type prediction labeled data using 2 threads


Reading values for marisa trie: 100%|██████████| 5832699/5832699 [00:05<00:00, 1153219.84it/s]
Processing types: 100%|██████████| 25/25 [00:02<00:00,  9.96it/s] 
Processing types: 100%|██████████| 24/24 [00:02<00:00,  9.44it/s]
Building type data: 100%|██████████| 2/2 [00:02<00:00,  1.29s/it]
Verifying type labels: 100%|██████████| 49/49 [00:00<00:00, 9301.27it/s]


2021-02-12 11:56:25,007 Final data initialization time for test is 304.63114166259766s
2021-02-12 11:56:25,111 Built dataloader for test set with 49 and 1 threads samples (Shuffle=False, Batch size=32).
2021-02-12 11:56:25,184 Building slice dataset for test from ../tutorial_data/data/nq/test_50_bootleg.jsonl.
2021-02-12 11:56:25,241 Building dataset from scratch. Saving to ../tutorial_data/data/nq/prep/test_50_bootleg_bert-base-uncased_L100_A10_InC1_Aug1
2021-02-12 11:56:25,242 Strating to extract examples with 2 threads


Reading in ../tutorial_data/data/nq/prep/prep_test_slice_files/create_examples_input/out_0.jsonl: 100%|██████████| 25/25 [00:00<00:00, 20056.92it/s]
Reading in ../tutorial_data/data/nq/prep/prep_test_slice_files/create_examples_input/out_1.jsonl: 100%|██████████| 25/25 [00:00<00:00, 21312.52it/s]


2021-02-12 11:56:26,074 Starting to build and save features with 2 threads


Processing ../tutorial_data/data/nq/prep/prep_test_slice_files/create_examples_output/out_1.jsonl: 100%|██████████| 25/25 [00:00<00:00, 2276.84it/s]
Processing ../tutorial_data/data/nq/prep/prep_test_slice_files/create_examples_output/out_0.jsonl: 100%|██████████| 25/25 [00:00<00:00, 532.32it/s]
Checking sentence uniqueness: 100%|██████████| 50/50 [00:00<00:00, 13252.98it/s]


2021-02-12 11:56:26,910 Loading data from ../tutorial_data/data/nq/prep/test_50_bootleg_bert-base-uncased_L100_A10_InC1_Aug1/ned_slices_1f126b5224.bin and ../tutorial_data/data/nq/prep/test_50_bootleg_bert-base-uncased_L100_A10_InC1_Aug1/ned_slices_config.json


Building sent idx to row idx mapping: 100%|██████████| 50/50 [00:00<00:00, 12310.12it/s]


2021-02-12 11:56:27,148 Final slice data initialization time from test is 1.9646449089050293s
2021-02-12 11:56:27,150 Updating Emmental config from user provided config.
2021-02-12 11:56:27,156 Starting Bootleg Model
2021-02-12 11:56:27,158 Created emmental model Bootleg that contains task set().
2021-02-12 11:56:31,852 Loading embeddings...
2021-02-12 12:01:11,119 Created task: NED
2021-02-12 12:01:11,123 Moving bert module to CPU.
2021-02-12 12:01:11,128 Moving embedding_payload module to CPU.
2021-02-12 12:01:11,129 Moving attn_network module to CPU.
2021-02-12 12:01:11,132 Moving pred_layer module to CPU.
2021-02-12 12:01:11,133 Moving learned module to CPU.
2021-02-12 12:01:11,134 Moving title_static module to CPU.
2021-02-12 12:01:11,135 Moving learned_type module to CPU.
2021-02-12 12:01:11,137 Moving learned_type_wiki module to CPU.
2021-02-12 12:01:11,137 Moving learned_type_relations module to CPU.
2021-02-12 12:01:11,138 Moving adj_index module to CPU.
2021-02-12 12:01:17,75

Evaluating Bootleg (test): 100%|██████████| 2/2 [00:09<00:00,  4.82s/it]
100%|██████████| 49/49 [00:00<00:00, 10006.86it/s]
Reading values for marisa trie: 100%|██████████| 50/50 [00:00<00:00, 297890.91it/s]


2021-02-12 12:02:53,443 Merging sentences together with 2 processes


Building sent_idx, alias_list_pos mapping: 100%|██████████| 89/89 [00:00<00:00, 84207.77it/s]
Reading values for marisa trie: 100%|██████████| 89/89 [00:00<00:00, 381690.24it/s]
Writing data: 100%|██████████| 25/25 [00:00<00:00, 4436.54it/s]
Writing data: 100%|██████████| 25/25 [00:02<00:00,  9.14it/s]


2021-02-12 12:04:59,707 Merging output files
2021-02-12 12:05:01,509 Wrote predictions to bootleg_logs/wiki_full_ft/2021_02_12/11_46_30/524a6d16/test_50_bootleg/bootleg_wiki/bootleg_labels.jsonl


We can now evaluate the overall quality of the end-to-end pipeline via precision / recall metrics, where the *recall* indicates what proportion of the hand-labelled mentions Bootleg correctly detects and disambiguates, and *precision* indicates what proportion of the mentions that Bootleg labels are correct. For instance, if Bootleg only labelled the few mentions it was very confident in, then it would have a low recall and high precision.

To decide whether a predicted mention matches a hand-labelled mention span, we report weak and exact match metrics. Weak matching only requires the predicted and gold span boundaries to overlap (e.g., the predicted mention 'the wizard of oz' counts as correct for the gold mention 'wizard of oz' if the correct entity is predicted). Exact matching requires the boundaries to be identical.
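
The two matching modes and the resulting metrics can be sketched as follows. This is a minimal illustration and not the tutorial's `compute_metrics` implementation: `spans_match` and `precision_recall_f1` are hypothetical helpers, spans are assumed to be word-level and end-exclusive, and the entity comparison is omitted. The example spans for 'the wizard of oz' are made up.

```python
def spans_match(pred, gold, mode="weak"):
    """Compare two [start, end) word spans under weak or exact matching."""
    if mode == "exact":
        return list(pred) == list(gold)
    # Weak: the two half-open intervals share at least one word.
    return pred[0] < gold[1] and gold[0] < pred[1]


def precision_recall_f1(num_correct, num_pred, num_gold):
    """Precision over predictions, recall over gold mentions, and their F1."""
    precision = num_correct / num_pred if num_pred else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 'the wizard of oz' (predicted) vs. 'wizard of oz' (gold), hypothetical spans.
weak_ok = spans_match([3, 7], [4, 7], mode="weak")
exact_ok = spans_match([3, 7], [4, 7], mode="exact")

# 61 correct predictions out of 80 predicted and 78 gold mentions.
p, r, f1 = precision_recall_f1(61, 80, 78)
```

Here `weak_ok` is true and `exact_ok` is false, which is exactly the case where the two metric sets can diverge.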

In [12]:
from utils import compute_metrics
bootleg_end2end_errors = compute_metrics(gold_file=nq_sample_orig,       
                                 pred_file=bootleg_label_file, 
                                 threshold=0.5)

WEAK MATCHING
precision = 61 / 80 = 0.7625
recall = 61 / 78 = 0.782051282051282
f1 = 0.7721518987341772

EXACT MATCHING
precision = 61 / 80 = 0.7625
recall = 61 / 78 = 0.782051282051282
f1 = 0.7721518987341772


We can examine errors in the end-to-end pipeline below. As you increase the threshold in the `compute_metrics` command, entities with a prediction probability less than the threshold will be filtered out. If too few entities are predicted, lowering the threshold may help.  

In [42]:
pd.DataFrame(bootleg_end2end_errors).sample(10)

| | sent_idx | text | gold_aliases | gold_qids | gold_spans | pred_aliases | pred_qids | pred_spans | pred_probs |
|---|---|---|---|---|---|---|---|---|---|
| 14 | 4 | I Love It ( feat . Charli XCX ) Icona Pop | [i love it, charli xcx, icona pop] | [Q3273659, Q5084390, Q808703] | [[0, 3], [6, 8], [9, 11]] | [charli xcx, icona pop] | [Q5084390, Q808703] | [[5, 9], [9, 11]] | [1.0, 0.9873091578] |
| 7 | 42 | Who proposed the coordinate system to describe the position of a point in a plane accurately | [coordinate system] | [Q62912] | [[3, 5]] | [coordinate system] | [Q11210] | [[3, 5]] | [1.0] |
| 5 | 37 | Landmark Supreme Court cases dealing with the First Amendment | [supreme court, first amendment] | [Q11201, Q12616] | [[1, 3], [7, 9]] | [supreme court cases, first amendment] | [Q6646863, Q12616] | [[1, 4], [7, 9]] | [1.0, 0.9999812841] |
| 18 | 8 | Once Upon a Time Season 6 episode list | [once upon a time season 6] | [Q23301616] | [[0, 6]] | [season 6, episode list] | [Q2404330] | [[4, 6]] | [0.5048642159] |
| 22 | 13 | Where was 10 Things I Hate About You filmed school | [10 things i hate about you] | [Q169082] | [[2, 8]] | [10 things i hate about you] | [Q169074] | [[2, 8]] | [0.5601187348] |
| 1 | 31 | Who does Oregon state play in the College World Series | [oregon state, college world series] | [Q7101349, Q787505] | [[2, 4], [7, 10]] | [oregon state, college world series] | [Q787505] | [[7, 10]] | [0.8851752281] |
| 16 | 6 | Why does the author say that the vampire in Nosferatu is named Count Orlok and not Count Dracula | [nosferatu, count orlok, count dracula] | [Q151895, Q1442062, Q3266236] | [[9, 10], [12, 14], [16, 18]] | [vampire, nosferatu, count orlok, count dracula] | [Q7912955, Q151895, Q1442062, Q3266236] | [[7, 8], [9, 10], [12, 14], [16, 18]] | [0.7404002547, 0.9794661999, 1.0, 0.9972344041] |
| 6 | 40 | Where does the last name Vigil come from | [vigil] | [Q16878937] | [[5, 6]] | [vigil] | [] | [] | [] |
| 3 | 35 | Reasons why South Africa should include renewable energy in its energy mix | [south africa, renewable energy] | [Q258, Q12705] | [[2, 4], [6, 8]] | [reasons why, south africa, renewable energy, energy mix] | [Q7028249, Q258, Q12705, Q1341346] | [[0, 2], [2, 4], [6, 8], [10, 12]] | [1.0, 0.999255836, 0.8151838183, 1.0] |
| 10 | 48 | What was the Japanese motivation for bombing Pearl Harbor | [japanese, pearl harbor] | [Q188712, Q127091] | [[3, 4], [7, 9]] | [motivation, pearl harbor] | [Q644302, Q127091] | [[4, 5], [7, 9]] | [0.9905021191, 0.757401526] |


Some of Bootleg's errors come from predicting too general a candidate (e.g., Oregon State Beavers instead of Oregon State Beavers baseball). Others are due to ambiguous sentences (e.g., "cast of characters in fiddler on the roof": should this be the movie or the musical?). A final bucket of errors suggests that we need to boost certain training signals. This is an area we're actively pursuing in Bootleg with an investigation of model guidability!

## 3. Compare to TAGME 

To get a sense of how Bootleg compares to other systems, we evaluate [TAGME](https://arxiv.org/pdf/1006.3498.pdf), an existing tool for extracting and disambiguating mentions. To run TAGME, you need a (free) authorization token. Instructions for obtaining a token are [here](https://sobigdata.d4science.org/web/tagme/tagme-help). You will need to verify your account and then follow the "access the VRE" link. We've also provided a file with TAGME labels for a given threshold for download, in case you want to skip obtaining an authorization token.

We note that unlike TAGME, Bootleg also outputs contextual entity embeddings which can be loaded for use in downstream tasks (e.g. relation extraction, question answering). Check out the Entity Embedding tutorial for more details! 

In [47]:
import tagme
# Set the authorization token for subsequent calls.
tagme.GCUBE_TOKEN = ""

In [13]:
# no leading slash: an absolute right-hand path would discard root_dir
tagme_label_file = root_dir / 'data/nq/test_50_tagme.jsonl'

If you do not have a token, skip the cell below and load the pre-generated TAGME labels. If you do have a token, you can experiment with the threshold below and see how it affects the results. Increasing the threshold increases the precision but decreases the recall, as TAGME will label fewer mentions.

In [63]:
from utils import tagme_annotate
# As the threshold increases, the precision increases, but the recall decreases
tagme_annotate(in_file=nq_sample_orig, out_file=tagme_label_file, threshold=0.2)

No wikidata id found for Frosty the Snowman (film)
No wikidata id found for The Bachelor (U.S. TV series)
No wikidata id found for House of Cards (U.S. TV series)


Note that we do not set the threshold here when computing metrics for TAGME as TAGME predictions are already thresholded in the `tagme_annotate` function. 

In [64]:
tagme_errors = compute_metrics(gold_file=nq_sample_orig, pred_file=tagme_label_file)

WEAK MATCHING
precision = 53 / 101 = 0.5247524752475248
recall = 53 / 78 = 0.6794871794871795
f1 = 0.5921787709497207

EXACT MATCHING
precision = 52 / 101 = 0.5148514851485149
recall = 52 / 78 = 0.6666666666666666
f1 = 0.5810055865921788


Three Wikidata IDs are not recovered because of out-of-date Wikipedia titles in the TAGME predictions. Even when counting these three samples as correct predictions, we see that Bootleg outperforms TAGME in both precision and recall.
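
The adjusted comparison can be checked with a few lines of arithmetic. The sketch below assumes the three unrecovered predictions are simply added to TAGME's weak-match correct count while the number of predicted and gold mentions stays the same. That is one reading of "including these three samples as correct", not something the metric code reports directly.

```python
def precision_recall(correct, num_pred, num_gold):
    """Return (precision, recall) from raw counts."""
    return correct / num_pred, correct / num_gold


# Counts taken from the weak-matching outputs above.
bootleg_p, bootleg_r = precision_recall(61, 80, 78)
# Assumption: credit TAGME with the three predictions lacking a Wikidata ID.
tagme_p, tagme_r = precision_recall(53 + 3, 101, 78)

print(f"Bootleg: precision={bootleg_p:.3f} recall={bootleg_r:.3f}")
print(f"TAGME (adjusted): precision={tagme_p:.3f} recall={tagme_r:.3f}")
```

Under this assumption TAGME's precision and recall both rise, yet remain below Bootleg's on this sample.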