# End-to-End NED Tutorial

In this tutorial, we walk through how to use Bootleg as an end-to-end pipeline to detect and label entities in a set of sentences. First, we show how to use Bootleg to detect and disambiguate mentions to entities. We then compare to an existing system named TAGME. Finally, we show how to use Bootleg to annotate individual sentences on the fly. 

To understand how Bootleg performs on more natural language than we find in Wikipedia, we hand-label the mentions and corresponding entities in 50 questions sampled from the [Natural Questions dataset (Google)](https://ai.google.com/research/NaturalQuestions).

### Requirements

You will need to download the following files for this notebook:
- Pretrained Bootleg model and config [here](https://bootleg-emb.s3.amazonaws.com/models/2020_10_22/bootleg_wiki.tar.gz)*
- Sample of Natural Questions with hand-labelled entities [here](https://bootleg-emb.s3.amazonaws.com/data/nq.tar.gz)
- Entity data [here](https://bootleg-emb.s3.amazonaws.com/data/wiki_entity_data.tar.gz)*
- Embedding data [here](https://bootleg-emb.s3.amazonaws.com/data/emb_data.tar.gz)*
- Pretrained BERT model [here](https://bootleg-emb.s3.amazonaws.com/pretrained_bert_models.tar.gz)*

*Same file as in the benchmark tutorial; it does not need to be re-downloaded.

For convenience, you can run the commands below (from the root directory of the repo) to download all the above files and unpack them to `models`, `data`, and `pretrained_bert_models` directories. It will take several minutes to download all the files. 

    bash download_model.sh 
    bash download_data.sh 
    bash download_bert.sh

In [1]:
import numpy as np 
import pandas as pd
import ujson
from utils import load_mentions, tagme_annotate

# set up logging
import sys
import logging
from importlib import reload
reload(logging)
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

root_dir = ''  # FILL IN FULL PATH TO ROOT REPO DIRECTORY HERE (required)
cand_map = f'{root_dir}/data/wiki_entity_data/entity_mappings/alias2qids_wiki.json'



If you have a GPU with at least 12GB of memory available, set the below to `False` to run inference on a GPU. 

In [2]:
use_cpu = True
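
If you are unsure whether your machine qualifies, the check can be sketched as below. This is only an illustration: `should_use_cpu` is a hypothetical helper (not part of Bootleg), it assumes PyTorch is the backend, and it inspects the *total* memory of GPU 0 rather than the memory currently free.

```python
def should_use_cpu(min_gpu_bytes=12 * 1024**3):
    """Return True when no GPU with at least `min_gpu_bytes` total memory is found.

    Hypothetical helper for this tutorial; falls back to CPU if PyTorch
    is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return True
    if not torch.cuda.is_available():
        return True
    # total_memory is the device capacity, not the currently free memory
    total = torch.cuda.get_device_properties(0).total_memory
    return total < min_gpu_bytes


suggested_use_cpu = should_use_cpu()
print(f"Suggested setting: use_cpu = {suggested_use_cpu}")
```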

## 1. Detect Mentions
Bootleg uses a simple mention extraction algorithm that extracts mentions using a given candidate map. We will use a Wikipedia candidate map that we mined using Wikipedia anchor links and Wikidata aliases for a total of ~8 million mentions (provided in the Requirements section of this notebook).
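
To give a feel for what candidate-map-based extraction means, here is a minimal sketch that greedily matches the longest n-gram found in a set of aliases. It is an assumption-laden illustration, not Bootleg's actual algorithm: `extract_spans`, `max_len`, and the tiny alias set are all made up for the example.

```python
def extract_spans(sentence, aliases, max_len=6):
    """Greedy longest-match lookup of word n-grams in an alias set.

    Returns a list of (alias, [start, end]) with end exclusive,
    mirroring the word-level spans shown later in this tutorial.
    """
    words = sentence.split()
    found = []
    i = 0
    while i < len(words):
        match = None
        # try the longest n-gram starting at word i first
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in aliases:
                match = (candidate, [i, i + n])
                break
        if match:
            found.append(match)
            i = match[1][1]  # continue after the matched span
        else:
            i += 1
    return found


# toy alias set standing in for the ~8 million alias candidate map
toy_aliases = {"active volcano", "volcano", "new zealand"}
toy_mentions = extract_spans("is there an active volcano in new zealand", toy_aliases)
print(toy_mentions)
```

The real candidate map also stores the candidate QIDs for each alias; the sketch only needs the alias strings.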

For the input dataset for the end-to-end pipeline, we assume a jsonlines file in which each line is a single dictionary with the key "sentence" whose value is the text of the sentence. For instance, you may have a file with the lines:

    {"sentence": "who did the voice of the magician in frosty the snowman"}
    {"sentence": "what is considered the outer banks in north carolina"}
    
Below, we have additional keys to keep track of the hand-labelled mentions, but this is purely for evaluating the quality of the end-to-end pipeline and is not needed in the common use cases of using Bootleg to detect and label mentions.
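
If you want to run the pipeline on your own sentences, a file in this format can be produced with a few lines of Python. The sketch below uses the standard-library `json` module and a temporary path; `write_sentences` and `demo_path` are illustrative names, not part of Bootleg.

```python
import json
import os
import tempfile


def write_sentences(sentences, out_path):
    """Write one {"sentence": ...} dictionary per line (jsonlines)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(json.dumps({"sentence": sentence}) + "\n")


# illustrative output location; point this at your own data directory
demo_path = os.path.join(tempfile.mkdtemp(), "my_sentences.jsonl")
write_sentences(
    [
        "who did the voice of the magician in frosty the snowman",
        "what is considered the outer banks in north carolina",
    ],
    demo_path,
)

# read the file back to confirm each line parses on its own
with open(demo_path, encoding="utf-8") as f:
    demo_rows = [json.loads(line) for line in f]
print(demo_rows)
```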

In [3]:
nq_sample_orig = f'{root_dir}/data/nq/test_natural_questions_50.jsonl'
nq_sample_bootleg = f'{root_dir}/data/nq/test_natural_questions_50_bootleg.jsonl'

In [4]:
from bootleg.extract_mentions import extract_mentions
extract_mentions(in_filepath=nq_sample_orig, out_filepath=nq_sample_bootleg, cand_map_file=cand_map, logger=logger)

2020-10-21 17:17:42,065 Loading candidate mapping...


100%|██████████| 8034754/8034754 [00:16<00:00, 482916.78it/s]

2020-10-21 17:17:58,709 Loaded candidate mapping with 8034754 aliases.





2020-10-21 17:18:12,381 Using 8 workers...
2020-10-21 17:18:12,382 Reading in /dfs/scratch0/lorr1/bootleg/bootleg-internal/new_tutorial_data/data/nq/test_natural_questions_50.jsonl
2020-10-21 17:18:12,658 Wrote out data chunks in 0.27s
2020-10-21 17:18:12,659 Calling subprocess...
2020-10-21 17:18:13,711 Merging files...
2020-10-21 17:18:13,756 Removing temporary files...
2020-10-21 17:18:13,978 Finished in 1.6176435947418213 seconds. Wrote out to /dfs/scratch0/lorr1/bootleg/bootleg-internal/new_tutorial_data/data/nq/test_natural_questions_50_bootleg.jsonl


By looking at a sample of the extracted mentions, we can compare the mention extraction phase to the hand-labelled mentions.

In [5]:
orig_mentions_df = load_mentions(nq_sample_orig)
bootleg_mentions_df = load_mentions(nq_sample_bootleg)

# join dataframes and sample
pd.merge(orig_mentions_df, bootleg_mentions_df, on=['sentence'], suffixes=['_hand', '_bootleg']).sample(15)

| | sentence | aliases_hand | spans_hand | aliases_bootleg | spans_bootleg |
|---|---|---|---|---|---|
| 21 | the pair of hand drums used in indian classical music is called | [indian classical music] | [[7, 10]] | [hand, drums, indian classical music] | [[3, 4], [4, 5], [7, 10]] |
| 17 | is it a bank holiday today in spain | [bank holiday, spain] | [[3, 5], [7, 8]] | [bank holiday, spain] | [[3, 5], [7, 8]] |
| 26 | who opened and closed the 1960 winter olympics | [1960 winter olympics] | [[5, 8]] | [1960 winter olympics] | [[5, 8]] |
| 7 | is there an active volcano in new zealand | [new zealand] | [[6, 8]] | [active volcano, new zealand] | [[3, 5], [6, 8]] |
| 5 | the u.s. supreme court hears appeals from circuit courts | [u.s. supreme court, circuit courts] | [[1, 4], [7, 9]] | [us supreme court, circuit courts] | [[1, 4], [7, 9]] |
| 19 | who played the bank robber in dirty harry | [dirty harry] | [[6, 8]] | [bank robber, dirty harry] | [[3, 5], [6, 8]] |
| 11 | hitchhiker 's guide to the galaxy slartibartfast quotes | [hitchhiker 's guide to the galaxy, slartibartfast] | [[0, 6], [6, 7]] | [hitchhiker s guide to the galaxy, slartibartfast] | [[0, 6], [6, 7]] |
| 12 | what was dennis hopper 's bike in easy rider | [dennis hopper, easy rider] | [[2, 4], [7, 9]] | [dennis hopper, bike, easy rider] | [[2, 4], [5, 6], [7, 9]] |
| 47 | who is mariah carey talking about in we belong together | [mariah carey, we belong together] | [[2, 4], [7, 10]] | [mariah carey, we belong together] | [[2, 4], [7, 10]] |
| 6 | why does the author say that the vampire in nosferatu is named count orlok and not count dracula | [nosferatu, count orlok, count dracula] | [[9, 10], [12, 14], [16, 18]] | [vampire, nosferatu, count orlok, count dracula] | [[7, 8], [9, 10], [12, 14], [16, 18]] |


In the sample above, Bootleg generally detects the same mentions as the hand labels. However, Bootleg sometimes extracts extra mentions (e.g., "foaming" in "i see the river tiber foaming with much blood"). This is expected, as we would rather the mention detection step filter out too few mentions than too many. It is the job of the backbone model and postprocessing to filter out these extra mentions, either by thresholding the prediction probability or by predicting a candidate that represents "No Candidate" (we refer to this as "NC").
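
The thresholding idea can be sketched in a few lines. This is a rough illustration rather than Bootleg's postprocessing code: `apply_threshold` is a hypothetical helper, and treating a probability exactly equal to the threshold as kept is an assumption. The QIDs and probabilities are taken from the "dennis hopper" row shown later in this tutorial; the threshold of 0.6 is chosen only to make the effect visible.

```python
def apply_threshold(qids, probs, threshold=0.3, nc_label="NC"):
    """Replace predictions whose probability falls below `threshold` with NC."""
    return [qid if prob >= threshold else nc_label for qid, prob in zip(qids, probs)]


aliases = ["dennis hopper", "bike", "easy rider"]
qids = ["Q102711", "Q11442", "Q5331186"]
probs = [1.0, 0.574, 0.917]

# with a stricter threshold, the low-confidence "bike" prediction becomes NC
filtered = apply_threshold(qids, probs, threshold=0.6)
kept = [(alias, qid) for alias, qid in zip(aliases, filtered) if qid != "NC"]
print(filtered)
print(kept)
```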

## 2. Disambiguate Mentions to Entities

We run inference using a pretrained Bootleg model to disambiguate the extracted mentions to Wikidata QIDs. 

First, load the model config so we can set additional parameters and load the saved model during evaluation. We need to update the config parameters to point to the downloaded model checkpoint and data.

In [25]:
from bootleg import run
from bootleg.utils.parser_utils import get_full_config

config_path = f'{root_dir}/models/bootleg_wiki/bootleg_config.json'
config_args = get_full_config(config_path)

# decrease number of data threads as this is a small file
config_args.run_config.dataset_threads = 2

# set the model checkpoint path 
config_args.run_config.init_checkpoint = f'{root_dir}/models/bootleg_wiki/bootleg_model.pt'

# set the path for the entity db and candidate map
config_args.data_config.entity_dir = f'{root_dir}/data/wiki_entity_data'
config_args.data_config.alias_cand_map = 'alias2qids_wiki.json'

# set the data path for the Natural Questions sample
config_args.data_config.data_dir = f'{root_dir}/data/nq'

# to speed things up for the tutorial, we have already prepped the data with the mentions detected by Bootleg
config_args.data_config.test_dataset.file = 'test_natural_questions_50_bootleg.jsonl'

# set the embedding paths 
config_args.data_config.emb_dir =  f'{root_dir}/data/emb_data'
config_args.data_config.word_embedding.cache_dir =  f'{root_dir}/pretrained_bert_models'

# set the save directory 
config_args.run_config.save_dir = f'{root_dir}/results'

# set whether to run inference on the CPU
config_args.run_config.cpu = use_cpu

Run evaluation in `dump_embs` mode to dump predictions and contextualized entity embeddings. Note that this command is about 10 times slower using a notebook than on the command line. To speed up the next command, run the following on the command line first. Then come back and run the next cell.

```
python3 -m bootleg.run --mode dump_embs \
    --config_script <root_dir>/models/bootleg_wiki/bootleg_config.json \
    --run_config.dataset_threads 2 \
    --run_config.init_checkpoint <root_dir>/models/bootleg_wiki/bootleg_model.pt \
    --data_config.entity_dir <root_dir>/data/wiki_entity_data \
    --data_config.alias_cand_map alias2qids_wiki.json \
    --data_config.data_dir <root_dir>/data/nq \
    --data_config.test_dataset.file test_natural_questions_50_bootleg.jsonl \
    --data_config.emb_dir <root_dir>/data/emb_data \
    --data_config.word_embedding.cache_dir <root_dir>/pretrained_bert_models
```

In [26]:
bootleg_label_file, bootleg_emb_file = run.model_eval(args=config_args, mode="dump_embs", logger=logger, is_writer=True)

2020-10-21 18:36:44,109 Loading entity_symbols...
2020-10-21 18:37:30,833 Loaded entity_symbols with 5310039 entities.
2020-10-21 18:37:30,861 Loading slices...
2020-10-21 18:48:04,167 Finished loading slices.
2020-10-21 18:49:28,392 Loading dataset...
2020-10-21 19:23:58,480 Finished loading dataset.
2020-10-21 19:24:05,298 Loading embeddings...
2020-10-21 19:24:29,538 Finished loading embeddings.
2020-10-21 19:24:29,729 Loading model from /dfs/scratch0/lorr1/bootleg/bootleg-internal/new_tutorial_data/models/bootleg_wiki/bootleg_model.pt...
2020-10-21 19:24:55,067 Successfully loaded model from /dfs/scratch0/lorr1/bootleg/bootleg-internal/new_tutorial_data/models/bootleg_wiki/bootleg_model.pt starting from checkpoint epoch 1 and step 0.
2020-10-21 19:24:55,141 ************************DUMPING PREDICTIONS FOR test_natural_questions_50_bootleg.jsonl************************
2020-10-21 19:24:55,230 64 samples, 4 batches, 50 len dataset
2020-10-21 19:25:07,864 Writing predictions to /dfs/sc

We can now evaluate the overall quality of the end-to-end pipeline via precision / recall metrics, where the *recall* indicates what proportion of the hand-labelled mentions Bootleg correctly detects and disambiguates, and *precision* indicates what proportion of the mentions that Bootleg labels are correct. For instance, if Bootleg only labelled the few mentions it was very confident in, then it would have a low recall and high precision.

To decide whether a predicted mention matches a hand-labelled mention span, we allow the left span boundary to differ by one word in either direction (e.g., 'the wizard of oz' and 'wizard of oz' are counted as the same mention).
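
A minimal sketch of such a lenient span comparison follows. It is not the code in `utils`; `spans_match` is a hypothetical helper, and requiring the right boundary to match exactly is an assumption based on the description above.

```python
def spans_match(gold_span, pred_span, left_slack=1):
    """Compare [start, end) word spans, tolerating a small left-boundary offset."""
    gold_start, gold_end = gold_span
    pred_start, pred_end = pred_span
    # assumption: the right boundary must agree exactly
    return gold_end == pred_end and abs(gold_start - pred_start) <= left_slack


# 'wizard of oz' (hand label) vs. 'the wizard of oz' (Bootleg)
print(spans_match([3, 6], [2, 6]))
# an offset of two words on the left is no longer a match
print(spans_match([3, 6], [1, 6]))
```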

In [27]:
from utils import compute_precision_and_recall

bootleg_errors = compute_precision_and_recall(orig_label_file=nq_sample_orig, 
                                              new_label_file=bootleg_label_file, 
                                              threshold=0.3)

Recall: 0.73 (57/78)
Precision: 0.58 (57/99)
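
The two numbers follow directly from the counts in parentheses: 57 correct predictions out of 78 hand-labelled mentions (recall) and out of 99 mentions labelled by Bootleg (precision). A small sketch of the arithmetic, where `precision_recall` is an illustrative helper rather than the function in `utils`:

```python
def precision_recall(num_correct, num_gold, num_pred):
    """Return (precision, recall) from raw counts, guarding against empty sets."""
    precision = num_correct / num_pred if num_pred else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return precision, recall


precision, recall = precision_recall(num_correct=57, num_gold=78, num_pred=99)
print(f"Recall: {recall:.2f} (57/78)")
print(f"Precision: {precision:.2f} (57/99)")
```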


We analyze three classes of errors in the end-to-end pipeline below: 
1. *Missing mentions*: Fail to extract the mention 
2. *Wrong entity*: Correctly extract the mention but disambiguate to the wrong candidate  
3. *Extra mentions*: Label a mention that is not hand-labelled as a mention
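
The three buckets can be illustrated with a simplified sketch. It is an approximation under stated assumptions, not the logic in `utils`: `classify_errors` is hypothetical, spans are matched exactly (no left-boundary slack), and a prediction of "NC" is treated the same as no prediction. The example values come from the "dennis hopper" row of this dataset.

```python
NC = "NC"


def classify_errors(gold, pred):
    """gold and pred map (start, end) span tuples to QIDs."""
    errors = {"missing_mention": [], "wrong_entity": [], "extra_mention": []}
    for span, gold_qid in gold.items():
        pred_qid = pred.get(span, NC)
        if pred_qid == NC:
            # not extracted, or filtered to NC by the threshold
            errors["missing_mention"].append(span)
        elif pred_qid != gold_qid:
            errors["wrong_entity"].append(span)
    for span, pred_qid in pred.items():
        if span not in gold and pred_qid != NC:
            errors["extra_mention"].append(span)
    return errors


gold = {(2, 4): "Q102711", (7, 9): "Q503638"}
pred = {(2, 4): "Q102711", (5, 6): "Q11442", (7, 9): "Q5331186"}
example_errors = classify_errors(gold, pred)
print(example_errors)
```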

In [17]:
pd.DataFrame(bootleg_errors['missing_mention'])

| | sent_idx | sentence | gold_aliases | gold_qids | gold_spans | pred_aliases | pred_spans | pred_qids | pred_probs | error |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | who opened and closed the 1960 winter olympics | [1960 winter olympics] | [Q9634] | [[5, 8]] | [1960 winter olympics] | [[5, 8]] | [NC] | [0.236] | 1960 winter olympics |
| 1 | 32 | who played in the last 3 nba finals | [nba finals] | [Q842375] | [[6, 8]] | [nba finals] | [[6, 8]] | [NC] | [0.143] | nba finals |


The mentions above get filtered because we set the probability threshold to 0.3 to help filter extra mentions. 

In [18]:
pd.DataFrame(bootleg_errors['wrong_entity']).sample(5)

| | sent_idx | sentence | gold_aliases | gold_qids | gold_spans | pred_aliases | pred_spans | pred_qids | pred_probs | error |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 18 | 1970 world cup semi final italy vs germany | [1970 world cup, italy, germany] | [Q132664, Q676899, Q43310] | [[0, 3], [5, 6], [7, 8]] | [1970 world cup, semi, italy, germany] | [[0, 3], [3, 4], [5, 6], [7, 8]] | [Q132664, Q992994, Q676899, Q37285] | [0.995, 0.811, 0.916, 0.508] | germany |
| 12 | 38 | when was the wizard of oz made in technicolor | [wizard of oz, technicolor] | [Q193695, Q674564] | [[3, 6], [8, 9]] | [the wizard of oz, in technicolor] | [[2, 6], [7, 9]] | [Q193695, Q17412490] | [0.438, 1.0] | technicolor |
| 11 | 36 | when was the first freeway built in los angeles | [los angeles] | [Q65] | [[7, 9]] | [freeway, los angeles] | [[4, 5], [7, 9]] | [Q46622, Q104994] | [0.718, 0.716] | los angeles |
| 5 | 23 | where is israel located on the world map | [israel, world map] | [Q801, Q653848] | [[2, 3], [6, 8]] | [israel, world map] | [[2, 3], [6, 8]] | [Q23792, Q653848] | [0.383, 0.473] | israel |
| 3 | 16 | where did britain create colonies for its empire | [britain, empire] | [Q161885, Q8680] | [[2, 3], [7, 8]] | [britain, colonies, its empire] | [[2, 3], [4, 5], [6, 8]] | [Q8680, NC, Q8680] | [0.402, 0.159, 0.468] | britain |


Some of the errors Bootleg makes come from predicting too general a candidate (e.g., Oregon State Beavers instead of Oregon State Beavers baseball). Other errors are due to ambiguous sentences (e.g., "cast of characters in fiddler on the roof": should this be the movie or the musical?). Finally, another bucket of errors suggests that we need to boost certain training signals. This is an area we're actively pursuing in Bootleg with an investigation of model guidability!

In [19]:
pd.DataFrame(bootleg_errors['extra_mention']).sample(5)

| | sent_idx | sentence | gold_aliases | gold_qids | gold_spans | pred_aliases | pred_spans | pred_qids | pred_probs | error |
|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 29 | who plays claire underwood 's mom on house of cards | [claire underwood, house of cards] | [Q14915624, Q3330940] | [[2, 4], [7, 10]] | [claire underwood, mom, house of cards] | [[2, 4], [5, 6], [7, 10]] | [Q14915624, Q7566, Q578361] | [1.0, 0.354, 0.893] | mom |
| 9 | 21 | the pair of hand drums used in indian classical music is called | [indian classical music] | [Q1323698] | [[7, 10]] | [hand, drums, indian classical music] | [[3, 4], [4, 5], [7, 10]] | [Q1552740, Q221769, Q1323698] | [0.575, 0.317, 0.926] | drums |
| 15 | 33 | uk national debt as percentage of gdp by year | [uk national debt, gdp] | [Q611713, Q12638] | [[0, 3], [6, 7]] | [uk, national debt, gdp] | [[0, 1], [1, 3], [6, 7]] | [Q145, Q611713, Q12638] | [0.581, 0.612, 0.961] | uk |
| 4 | 12 | what was dennis hopper 's bike in easy rider | [dennis hopper, easy rider] | [Q102711, Q503638] | [[2, 4], [7, 9]] | [dennis hopper, bike, easy rider] | [[2, 4], [5, 6], [7, 9]] | [Q102711, Q11442, Q5331186] | [1.0, 0.574, 0.917] | bike |
| 21 | 48 | what was the japanese motivation for bombing pearl harbor | [japanese, pearl harbor] | [Q188712, Q127091] | [[3, 4], [7, 9]] | [japanese, motivation, bombing, pearl harbor] | [[3, 4], [4, 5], [6, 7], [7, 9]] | [Q188712, Q644302, Q52418, Q52418] | [0.345, 0.963, 0.862, 0.516] | motivation |


We see that Bootleg may detect and label extraneous mentions that were not hand-labelled. Setting the threshold higher helps to reduce these predictions, as does using a 'NC' candidate for training, which Bootleg also supports. 

## 3. Compare to TAGME 

To get a sense of how Bootleg is doing compared to other systems, we evaluate [TAGME](https://arxiv.org/pdf/1006.3498.pdf), an existing tool to extract and disambiguate mentions. To run TAGME, you need to get a (free) authorization token. Instructions for obtaining a token are [here](https://sobigdata.d4science.org/web/tagme/tagme-help). You will need to verify your account and then follow the "access the VRE" link. We've also provided the file with TAGME labels for a given threshold for download if you want to skip the authorization token.

We note that unlike TAGME, Bootleg also outputs contextual entity embeddings which can be loaded for use in downstream tasks (e.g. relation extraction, question answering). Check out the Entity Embedding tutorial for more details! 

In [10]:
import tagme
# Set the authorization token for subsequent calls.
tagme.GCUBE_TOKEN = ""

In [11]:
tagme_label_file = f'{root_dir}/data/nq/test_natural_questions_50_tagme.jsonl'

If you do not have a token, skip the cell below and load the pre-generated TAGME labels. If you do have a token, you can play with changing the threshold below and see how it affects the results. Increasing the threshold increases the precision but decreases the recall, as TAGME will label fewer mentions.

In [12]:
# We use a mapping from Wikipedia pageids to Wikidata QIDs to get the QIDs predicted by TAGME 
with open(f'{root_dir}/data/wiki_entity_data/entity_mappings/wpid2qid.json') as f:
    wpid2qid = ujson.load(f)

# As the threshold increases, the precision increases, but the recall decreases
tagme_annotate(in_file=nq_sample_orig, out_file=tagme_label_file, threshold=0.3, wpid2qid=wpid2qid)

In [13]:
from utils import compute_precision_and_recall
tagme_errors = compute_precision_and_recall(orig_label_file=nq_sample_orig, 
                                            new_label_file=tagme_label_file)

Recall: 0.63 (49/78)
Precision: 0.58 (49/84)


We see that TAGME has lower recall than Bootleg (0.63 vs. 0.73) when the thresholds are set so that the precisions are comparable (0.58 for both). Changing either TAGME's or Bootleg's threshold will change the recall and precision values.

## 4. Annotate On-the-Fly

To annotate individual sentences with Bootleg, we also support an annotate-on-the-fly mode.

**Note that Annotator is not optimized and is only intended to be used for quick experimentation and for demos. We recommend using the above pipeline (`extract_mentions` and `model_eval` functions) for evaluating datasets. These functions leverage multiprocessing, caching of preprocessed data, and batching to speed up evaluation.**

To do this, we create an annotator object. This loads the model and entity databases. We use the `config_args` loaded from the previous step. Note it takes several minutes for the initial load of the model and the entity data. 

In [34]:
%load_ext autoreload
%autoreload 2
from bootleg.annotator import Annotator

ann = Annotator(config_args=config_args, cand_map=cand_map, device='cuda' if not use_cpu else 'cpu')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
2020-10-21 19:44:43,128 Loading embeddings...
2020-10-21 19:45:07,344 Finished loading embeddings.
2020-10-21 19:45:40,203 Loading candidate mapping...


100%|██████████| 8034754/8034754 [00:32<00:00, 251047.44it/s]

2020-10-21 19:46:12,213 Loaded candidate mapping with 8034754 aliases.





Similar to TAGME, we allow setting a threshold so that only mentions whose predicted label has a probability above the threshold are returned.

In [39]:
ann.set_threshold(0.3)

Fill in sentences to see what Bootleg predicts! Bootleg outputs the QIDs (or "NC" for "No Candidate"), the associated probabilities, and the title for each mention. The QIDs map to Wikidata -- to look them up you can use https://www.wikidata.org/wiki/Q1454 and replace the QID. "NC" means Bootleg did not find a good match among the candidates in the candidate list given the context. 

In [40]:
ann.label_mentions("where is the outer banks in north carolina")

(['Q1517373', 'Q1454'],
 [1.0, 0.9984428286552429],
 ['Outer Banks', 'North Carolina'])
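
The returned tuple is easier to read when the three parallel lists are zipped together. The sketch below works on the output shown above; `summarize` is a hypothetical convenience function, not part of the Annotator API, and the URL pattern is the Wikidata lookup mentioned earlier.

```python
def summarize(prediction, nc_label="NC"):
    """Turn (qids, probs, titles) into one dictionary per mention."""
    qids, probs, titles = prediction
    rows = []
    for qid, prob, title in zip(qids, probs, titles):
        # "NC" has no Wikidata page, so no URL is built for it
        url = None if qid == nc_label else f"https://www.wikidata.org/wiki/{qid}"
        rows.append({"qid": qid, "prob": round(prob, 3), "title": title, "url": url})
    return rows


# output of ann.label_mentions(...) copied from the cell above
example = (
    ["Q1517373", "Q1454"],
    [1.0, 0.9984428286552429],
    ["Outer Banks", "North Carolina"],
)
rows = summarize(example)
for row in rows:
    print(row)
```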

In [41]:
ann.label_mentions("cast of characters in fiddler on the roof")

(['Q487330'], [0.8815685510635376], ['Fiddler on the Roof'])

Sometimes the entity disambiguation problem can be quite tricky. In the above example, we predict the musical "Fiddler on the Roof" instead of the hand-labelled movie (https://www.wikidata.org/wiki/Q934036). Giving additional cues may help, though: for instance, if we add "the movie", the prediction changes to the movie!

In [42]:
ann.label_mentions("cast of characters in the movie fiddler on the roof")

(['Q934036'], [0.770247220993042], ['Fiddler on the Roof (film)'])