# End-to-End NED Tutorial

In this tutorial, we walk through how to use Bootleg as an end-to-end pipeline to detect and label entities in a set of sentences. First, we show how to use Bootleg to detect and disambiguate mentions to entities. We then compare to an existing system named TAGME. Finally, we show how to use Bootleg to annotate individual sentences on the fly. 

To understand how Bootleg performs on more natural language than we find in Wikipedia, we hand label the mentions and corresponding entities in 50 questions sampled from the [Natural Questions dataset (Google)](https://ai.google.com/research/NaturalQuestions). 

### Requirements

You will need to download the following files for this notebook:
- The sample of Natural Questions with hand-labelled entities [here](https://bootleg-emb.s3.amazonaws.com/data/nq.tar.gz)
- Entity profile information [here](https://bootleg-emb.s3.amazonaws.com/entity_db.tar.gz)*
- Pretrained Bootleg model and config [here](https://bootleg-emb.s3.amazonaws.com/models/2020_08_25/bootleg_wiki.tar.gz)*
- Embedding data [here](https://bootleg-emb.s3.amazonaws.com/emb_data.tar.gz)*

*Same file as in the benchmark tutorial; it does not need to be re-downloaded.

For convenience, you can run the command below (from the `tutorials` directory) to download all the above files and unpack them to the provided directory. It will take several minutes to download all the files. 

    bash download_all.sh <NAME_OF_DIRECTORY_TO_SAVE_DATA>
    
You will need to assign the variable `input_dir` in this notebook to the path where you download the data.  

In [5]:
import numpy as np 
import pandas as pd
import ujson
from utils import load_mentions, tagme_annotate

# set up logging
import sys
import logging
from importlib import reload
reload(logging)
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

input_dir = ''  # FILL IN FULL PATH TO DATA DIRECTORY WHERE FILES ARE DOWNLOADED HERE

cand_map = f'{input_dir}/entity_db/entity_mappings/alias2qids_wiki.json'

If you do not have a GPU with at least 12GB of memory available, set `use_cpu` below to `True` to run inference on the CPU.

In [1]:
use_cpu = False

## 1. Detect Mentions
Bootleg uses a simple mention extraction algorithm that extracts mentions using a given candidate map. We will use a Wikipedia candidate map that we mined using Wikipedia anchor links and Wikidata aliases for a total of ~8 million mentions (provided in the Requirements section of this notebook).
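To make the idea of candidate-map extraction concrete, here is a minimal sketch of one way such a lookup could work: a greedy longest-match scan over word n-grams. This is an illustration only and is not Bootleg's actual implementation (which lives in `bootleg.extract_mentions`). The function name `extract_mentions_sketch`, the `max_alias_len` parameter, and the tiny alias set are all hypothetical.

```python
def extract_mentions_sketch(sentence, aliases, max_alias_len=6):
    """Greedy longest-match lookup of word n-grams in a set of known aliases.

    Returns (alias, "start:end") pairs, where the span uses word indices with
    an exclusive end, matching the span format shown in this tutorial.
    """
    tokens = sentence.split()
    mentions = []
    start = 0
    while start < len(tokens):
        longest = min(max_alias_len, len(tokens) - start)
        # Try the longest n-gram first so "scottish cup" wins over "cup".
        for length in range(longest, 0, -1):
            candidate = " ".join(tokens[start:start + length])
            if candidate in aliases:
                mentions.append((candidate, f"{start}:{start + length}"))
                start += length
                break
        else:
            # No alias starts at this word; move on to the next one.
            start += 1
    return mentions


# Hypothetical three-entry alias set standing in for the ~8 million alias map.
toy_aliases = {"rangers", "scottish cup", "cup"}
print(extract_mentions_sketch("when did rangers last win the scottish cup", toy_aliases))
```

A real candidate map also stores the candidate QIDs for every alias; the sketch only needs the alias keys to find spans.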

The end-to-end pipeline expects a jsonlines input file in which each line is a single dictionary with the key "sentence" whose value is the text of the sentence. For instance, you may have a file with the lines:

    {"sentence": "who did the voice of the magician in frosty the snowman"}
    {"sentence": "what is considered the outer banks in north carolina"}
    
Below, we have additional keys to keep track of the hand-labelled mentions, but this is purely for evaluating the quality of the end-to-end pipeline and is not needed in the common use cases of using Bootleg to detect and label mentions.
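If your sentences start out as a plain Python list, a file in this format can be produced with the standard `json` module. The sketch below is one way to do it; the helper names `write_sentences` and `read_sentences` and the file name `my_sentences.jsonl` are hypothetical and not part of Bootleg.

```python
import json
import os
import tempfile


def write_sentences(sentences, path):
    """Write one {"sentence": ...} dictionary per line (jsonlines)."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(json.dumps({"sentence": sentence}) + "\n")


def read_sentences(path):
    """Read the sentences back, skipping any blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["sentence"] for line in f if line.strip()]


sentences = [
    "who did the voice of the magician in frosty the snowman",
    "what is considered the outer banks in north carolina",
]
# Write to a temporary directory so the example does not touch your data.
demo_path = os.path.join(tempfile.mkdtemp(), "my_sentences.jsonl")
write_sentences(sentences, demo_path)
print(read_sentences(demo_path))
```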

In [5]:
nq_sample_orig = f'{input_dir}/nq/test_natural_questions_50.jsonl'
nq_sample_bootleg = f'{input_dir}/nq/test_natural_questions_50_bootleg.jsonl'

In [3]:
from bootleg.extract_mentions import extract_mentions
extract_mentions(in_filepath=nq_sample_orig, out_filepath=nq_sample_bootleg, cand_map_file=cand_map, logger=logger)

2020-09-09 10:22:27,651 Loading candidate mapping...


100%|██████████| 7970529/7970529 [00:17<00:00, 455341.61it/s]

2020-09-09 10:22:45,161 Loaded candidate mapping with 7970529 aliases.





2020-09-09 10:22:58,305 Using 8 workers...
2020-09-09 10:22:58,307 Reading in /dfs/scratch1/mleszczy/bootleg-test-09092020/tutorials/data/nq/test_natural_questions_50.jsonl
2020-09-09 10:22:58,619 Wrote out data chunks in 0.31s
2020-09-09 10:22:58,620 Calling subprocess...
2020-09-09 10:22:59,346 Merging files...
2020-09-09 10:22:59,384 Removing temporary files...
2020-09-09 10:22:59,691 Finished in 1.3895530700683594 seconds. Wrote out to /dfs/scratch1/mleszczy/bootleg-test-09092020/tutorials/data/nq/test_natural_questions_50_bootleg.jsonl


By looking at a sample of the extracted mentions, we can compare the mention extraction phase to the hand-labelled mentions.

In [4]:
orig_mentions_df = load_mentions(nq_sample_orig)
bootleg_mentions_df = load_mentions(nq_sample_bootleg)

# join dataframes and sample
pd.merge(orig_mentions_df, bootleg_mentions_df, on=['sentence'], suffixes=['_hand', '_bootleg']).sample(15)

| | sentence | aliases_hand | spans_hand | aliases_bootleg | spans_bootleg |
|---|---|---|---|---|---|
| 11 | hitchhiker 's guide to the galaxy slartibartfast quotes | [hitchhiker 's guide to the galaxy, slartibartfast] | [0:6, 6:7] | [hitchhiker s guide to the galaxy, slartibartfast] | [0:6, 6:7] |
| 45 | when did rangers last win the scottish cup | [rangers, scottish cup] | [2:3, 6:8] | [rangers, scottish cup] | [2:3, 6:8] |
| 21 | the pair of hand drums used in indian classical music is called | [indian classical music] | [7:10] | [drums, indian classical music] | [4:5, 7:10] |
| 46 | who controls the past controls the future rage against the machine | [rage against the machine] | [7:11] | [rage against the machine] | [7:11] |
| 41 | who was president of the united states in 1938 | [president of the united states] | [2:7] | [president of the united states] | [2:7] |
| 4 | i love it ( feat . charli xcx ) icona pop | [i love it, charli xcx, icona pop] | [0:3, 6:8, 9:11] | [i love it, charli xcx, icona pop] | [0:3, 6:8, 9:11] |
| 25 | which of these was not an export of ancient greece | [ancient greece] | [8:10] | [ancient greece] | [8:10] |
| 20 | what is the worth of the catholic church | [catholic church] | [6:8] | [worth, catholic church] | [3:4, 6:8] |
| 18 | 1970 world cup semi final italy vs germany | [1970 world cup, italy, germany] | [0:3, 5:6, 7:8] | [1970 world cup, semi, italy, germany] | [0:3, 3:4, 5:6, 7:8] |
| 22 | game of thrones season 1 white hair girl | [game of thrones season 1] | [0:5] | [game of thrones season 1, white hair] | [0:5, 5:7] |


In the sample above, we see that Bootleg generally detects the same mentions as the hand-labelled ones, although it sometimes extracts extra mentions (e.g., "worth" in "what is the worth of the catholic church"). This is expected: we would rather the mention detection step filter out too few mentions than too many. It is the job of the backbone model and postprocessing to filter out these extra mentions, either by thresholding the prediction probability or by predicting a candidate that represents "No Candidate" (we refer to this as "NC").

## 2. Disambiguate Mentions to Entities

We run inference using a pretrained Bootleg model to disambiguate the extracted mentions to Wikidata QIDs. 

First, load the model config so we can set additional parameters and load the saved model during evaluation. We need to update the config parameters to point to the downloaded model checkpoint and data.

In [6]:
from bootleg import run
from bootleg.utils.parser_utils import get_full_config

# full path to directory where files are downloaded
config_path = f'{input_dir}/bootleg_wiki/bootleg_config.json'
config_args = get_full_config(config_path)

# set the model checkpoint path 
config_args.run_config.init_checkpoint = f'{input_dir}/bootleg_wiki/bootleg_model.pt'

# set the path for the entity db and candidate map
config_args.data_config.entity_dir = f'{input_dir}/entity_db'
config_args.data_config.alias_cand_map = 'alias2qids_wiki.json'

# set the data path for the Natural Questions sample
config_args.data_config.data_dir = f'{input_dir}/nq'

# to speed things up for the tutorial, we have already prepped the data with the mentions detected by Bootleg
config_args.data_config.test_dataset.file = 'test_natural_questions_50_bootleg.jsonl'

# set the embedding paths 
config_args.data_config.emb_dir =  f'{input_dir}/emb_data'
config_args.data_config.word_embedding.cache_dir =  f'{input_dir}/emb_data'

# set the save directory 
config_args.run_config.save_dir = f'{input_dir}/results'

# set whether to run inference on the CPU
config_args.run_config.cpu = use_cpu

Run evaluation in `dump_embs` mode to dump predictions and contextualized entity embeddings.

In [7]:
bootleg_label_file, bootleg_emb_file = run.model_eval(args=config_args, mode="dump_embs", logger=logger, is_writer=True)

2020-09-09 13:15:16,824 PyTorch version 1.5.0 available.
2020-09-09 13:15:17,933 Loading entity_symbols...
2020-09-09 13:16:03,416 Loaded entity_symbols with 5222808 entities.
2020-09-09 13:16:04,593 Loading slices...
2020-09-09 13:16:04,599 Finished loading slices.
2020-09-09 13:16:04,603 Loading dataset...
2020-09-09 13:16:04,606 Finished loading dataset.
2020-09-09 13:16:18,836 Sampled 50 indices from dataset (dev/test) for evaluation.
2020-09-09 13:16:19,055 Loading embeddings...
2020-09-09 13:16:44,103 Finished loading embeddings.
2020-09-09 13:16:44,224 Loading model from data/bootleg_wiki/bootleg_model.pt...
2020-09-09 13:16:46,763 Successfully loaded model from data/bootleg_wiki/bootleg_model.pt starting from checkpoint epoch 2 and step 0.
2020-09-09 13:16:46,805 ************************DUMPING PREDICTIONS FOR test_natural_questions_50_bootleg.jsonl************************
2020-09-09 13:16:47,021 64 samples, 2 batches
2020-09-09 13:16:51,935 Writing predictions...
2020-09-09 13

We can now evaluate the overall quality of the end-to-end pipeline via precision / recall metrics, where the *recall* indicates what proportion of the hand-labelled mentions Bootleg correctly detects and disambiguates, and *precision* indicates what proportion of the mentions that Bootleg labels are correct. For instance, if Bootleg only labelled the few mentions it was very confident in, then it would have a low recall and high precision.

To detect if mentions match the hand-labelled mention spans, we allow for +1/-1 word in the left span boundaries (e.g., 'the wizard of oz' and 'wizard of oz' are counted as the same mention). 
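The matching rule can be sketched as follows. This is an illustration, not the code in `utils.compute_precision_and_recall`: the helper names are hypothetical, and the sketch assumes that the right span boundary must match exactly while the left boundary may differ by at most one word.

```python
def parse_span(span):
    """Turn a "start:end" string of word indices into a pair of ints."""
    left, right = span.split(":")
    return int(left), int(right)


def spans_match(gold_span, pred_span, left_slack=1):
    """Assumed rule: right boundaries are equal, left boundaries differ by <= left_slack."""
    gold_left, gold_right = parse_span(gold_span)
    pred_left, pred_right = parse_span(pred_span)
    return gold_right == pred_right and abs(gold_left - pred_left) <= left_slack


# 'wizard of oz' at 3:6 versus 'the wizard of oz' at 2:6 in
# "when was the wizard of oz made in technicolor"
print(spans_match("3:6", "2:6"))
# A predicted span that starts two words early is not counted as the same mention.
print(spans_match("3:6", "1:6"))
```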

In [6]:
from utils import compute_precision_and_recall

bootleg_errors = compute_precision_and_recall(orig_label_file=nq_sample_orig, 
                                              new_label_file=bootleg_label_file, 
                                              threshold=0.3)

Recall: 0.71 (55/78)
Precision: 0.6 (55/92)
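As a sanity check on the counts printed above, the two ratios can be recomputed by hand: both share the number of correct predictions as the numerator and differ only in the denominator. The function name `precision_recall` is hypothetical and the sketch is independent of the tutorial's `utils` module.

```python
def precision_recall(num_correct, num_gold, num_predicted):
    """Recall = correct / hand-labelled mentions; precision = correct / predicted mentions."""
    recall = num_correct / num_gold if num_gold else 0.0
    precision = num_correct / num_predicted if num_predicted else 0.0
    return precision, recall


# Counts taken from the output above: 55 correct, 78 hand-labelled, 92 predicted.
precision, recall = precision_recall(num_correct=55, num_gold=78, num_predicted=92)
print(f"Recall: {round(recall, 2)} Precision: {round(precision, 2)}")
```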


We analyze three classes of errors in the end-to-end pipeline below: 
1. *Missing mentions*: Fail to extract the mention 
2. *Wrong entity*: Correctly extract the mention but disambiguate to the wrong candidate  
3. *Extra mentions*: Label a mention that is not hand-labelled as a mention

In [7]:
pd.DataFrame(bootleg_errors['missing_mention'])

| | sent_idx | sentence | gold_aliases | gold_qids | gold_spans | pred_aliases | pred_spans | pred_qids | pred_probs | error |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | is there an active volcano in new zealand | [new zealand] | [Q664] | [6:8] | [active volcano, new zealand] | [3:5, 6:8] | [Q8072, NC] | [0.735, 0.283] | new zealand |
| 1 | 23 | where is israel located on the world map | [israel, world map] | [Q801, Q653848] | [2:3, 6:8] | [israel, world map] | [2:3, 6:8] | [NC, Q653848] | [0.164, 0.908] | israel |
| 2 | 32 | who played in the last 3 nba finals | [nba finals] | [Q842375] | [6:8] | [nba finals] | [6:8] | [NC] | [0.115] | nba finals |
| 3 | 45 | when did rangers last win the scottish cup | [rangers, scottish cup] | [Q19597, Q308822] | [2:3, 6:8] | [rangers, scottish cup] | [2:3, 6:8] | [Q19597, NC] | [0.696, 0.116] | scottish cup |


The mentions above are filtered out because their prediction probabilities fall below the 0.3 threshold, which we set to help filter extra mentions.
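The filtering step can be pictured with a small sketch. It is not Bootleg's internal postprocessing: the function `apply_threshold` is hypothetical, and it assumes a mention is dropped either when its label is "NC" or when its probability is below the threshold. The values come from the first row of the table above.

```python
def apply_threshold(aliases, qids, probs, threshold=0.3, nc_label="NC"):
    """Keep only mentions with a real candidate and probability >= threshold (assumed rule)."""
    kept = []
    for alias, qid, prob in zip(aliases, qids, probs):
        if qid == nc_label or prob < threshold:
            continue
        kept.append((alias, qid, prob))
    return kept


# Row for "is there an active volcano in new zealand" (sent_idx 7).
kept = apply_threshold(
    aliases=["active volcano", "new zealand"],
    qids=["Q8072", "NC"],
    probs=[0.735, 0.283],
)
# "new zealand" is dropped, which is why it is reported as a missing mention.
print(kept)
```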

In [8]:
pd.DataFrame(bootleg_errors['wrong_entity']).sample(5)

| | sent_idx | sentence | gold_aliases | gold_qids | gold_spans | pred_aliases | pred_spans | pred_qids | pred_probs | error |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 24 | who played smiley in tinker tailor soldier spy | [tinker tailor soldier spy] | [Q681962] | [4:8] | [smiley, tinker tailor soldier spy] | [2:3, 4:8] | [Q11241, Q582811] | [0.885, 0.697] | tinker tailor soldier spy |
| 3 | 6 | why does the author say that the vampire in nosferatu is named count orlok and not count dracula | [nosferatu, count orlok, count dracula] | [Q151895, Q1442062, Q3266236] | [9:10, 12:14, 16:18] | [vampire, nosferatu, count orlok, count dracula] | [7:8, 9:10, 12:14, 16:18] | [Q1425557, Q151895, Q1442062, Q41542] | [0.766, 0.863, 1.0, 0.52] | count dracula |
| 6 | 21 | the pair of hand drums used in indian classical music is called | [indian classical music] | [Q1323698] | [7:10] | [drums, indian classical music] | [4:5, 7:10] | [Q128309, Q1770695] | [0.685, 0.942] | indian classical music |
| 13 | 38 | when was the wizard of oz made in technicolor | [wizard of oz, technicolor] | [Q193695, Q674564] | [3:6, 8:9] | [the wizard of oz, technicolor] | [2:6, 8:9] | [Q60447411, Q674564] | [0.676, 0.904] | wizard of oz |
| 9 | 31 | who does oregon state play in the college world series | [oregon state, college world series] | [Q7101349, Q787505] | [2:4, 7:10] | [oregon state, college world series] | [2:4, 7:10] | [Q2893390, Q787505] | [0.583, 0.7] | oregon state |


Some of Bootleg's errors come from predicting too general a candidate (e.g., Oregon State Beavers instead of Oregon State Beavers baseball). Other errors are due to ambiguous sentences (e.g., "cast of characters in fiddler on the roof": should this be the movie or the musical?). Finally, another bucket of errors suggests that we need to boost certain training signals. This is an area we're actively pursuing in Bootleg with an investigation of model guidability!

In [9]:
pd.DataFrame(bootleg_errors['extra_mention']).sample(5)

| | sent_idx | sentence | gold_aliases | gold_qids | gold_spans | pred_aliases | pred_spans | pred_qids | pred_probs | error |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 48 | what was the japanese motivation for bombing pearl harbor | [japanese, pearl harbor] | [Q188712, Q127091] | [3:4, 7:9] | [japanese, motivation, bombing, pearl harbor] | [3:4, 4:5, 6:7, 7:9] | [Q184425, Q644302, Q52418, Q52418] | [0.744, 0.986, 0.985, 0.706] | motivation |
| 14 | 36 | when was the first freeway built in los angeles | [los angeles] | [Q65] | [7:9] | [freeway, los angeles] | [4:5, 7:9] | [Q46622, Q65] | [0.867, 0.506] | freeway |
| 2 | 6 | why does the author say that the vampire in nosferatu is named count orlok and not count dracula | [nosferatu, count orlok, count dracula] | [Q151895, Q1442062, Q3266236] | [9:10, 12:14, 16:18] | [vampire, nosferatu, count orlok, count dracula] | [7:8, 9:10, 12:14, 16:18] | [Q1425557, Q151895, Q1442062, Q41542] | [0.766, 0.863, 1.0, 0.52] | vampire |
| 11 | 27 | i see the river tiber foaming with much blood | [river tiber] | [Q13712] | [3:5] | [river tiber, foaming, blood] | [3:5, 5:6, 8:9] | [Q13712, Q7243541, Q7873] | [1.0, 1.0, 0.9] | blood |
| 5 | 18 | 1970 world cup semi final italy vs germany | [1970 world cup, italy, germany] | [Q132664, Q676899, Q43310] | [0:3, 5:6, 7:8] | [1970 world cup, semi, italy, germany] | [0:3, 3:4, 5:6, 7:8] | [Q132664, Q40008974, Q676899, Q43310] | [0.866, 0.332, 0.364, 0.354] | semi |


We see that Bootleg may detect and label extraneous mentions that were not hand-labelled. Setting the threshold higher helps to reduce these predictions, as does using an "NC" candidate for training, which Bootleg also supports.
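To see why the threshold is a trade-off and not a free win, the sketch below sweeps a few threshold values over the predictions for "1970 world cup semi final italy vs germany" from the table above. The helper `mentions_at` is hypothetical, and comparing with `>=` is an assumption about how the cut-off is applied.

```python
# Predictions and gold labels for sent_idx 18, copied from the table above.
pred_aliases = ["1970 world cup", "semi", "italy", "germany"]
pred_probs = [0.866, 0.332, 0.364, 0.354]
gold_aliases = {"1970 world cup", "italy", "germany"}


def mentions_at(threshold):
    """Return the predicted aliases whose probability is at least `threshold`."""
    return [a for a, p in zip(pred_aliases, pred_probs) if p >= threshold]


for threshold in (0.3, 0.35, 0.4):
    kept = mentions_at(threshold)
    extra = [a for a in kept if a not in gold_aliases]
    lost = sorted(gold_aliases - set(kept))
    print(f"threshold={threshold}: kept={kept} extra={extra} lost={lost}")
```

In this one sentence, raising the threshold from 0.3 to 0.35 removes the extra mention "semi", but raising it further to 0.4 also removes the correct mentions "italy" and "germany".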

## 3. Compare to TAGME 

To get a sense of how Bootleg is doing compared to other systems, we evaluate [TAGME](https://arxiv.org/pdf/1006.3498.pdf), an existing tool to extract and disambiguate mentions. To run TAGME, you need to get a (free) authorization token. Instructions for obtaining a token are [here](https://sobigdata.d4science.org/web/tagme/tagme-help). You will need to verify your account and then follow the "access the VRE" link. We've also provided the file with TAGME labels for a given threshold for download if you want to skip obtaining an authorization token.

We note that unlike TAGME, Bootleg also outputs contextual entity embeddings which can be loaded for use in downstream tasks (e.g. relation extraction, question answering). Check out the Entity Embedding tutorial for more details! 

In [10]:
import tagme
# Set the authorization token for subsequent calls.
tagme.GCUBE_TOKEN = ''  # FILL IN WITH YOUR TOKEN

In [11]:
tagme_label_file = f'{input_dir}/nq/test_natural_questions_50_tagme.jsonl'

If you do not have a token, skip the cell below and load the pre-generated TAGME labels. If you do have a token, you can play with changing the threshold below and see how it affects the results. Increasing the threshold increases the precision but decreases the recall, as TAGME will label fewer mentions.

In [13]:
# We use a mapping from Wikipedia pageids to Wikidata QIDs to get the QIDs predicted by TAGME 
with open(f'{input_dir}/entity_db/entity_mappings/wpid2qid.json') as f:
    wpid2qid = ujson.load(f)

# As the threshold increases, the precision increases, but the recall decreases
tagme_annotate(in_file=nq_sample_orig, out_file=tagme_label_file, threshold=0.3, wpid2qid=wpid2qid)
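The page-ID-to-QID translation mentioned in the comment above can be sketched as a plain dictionary lookup. This is not the code inside `tagme_annotate`: the helper `pageids_to_qids`, the toy mapping, and the page IDs `111` and `222` are hypothetical. Because the mapping is loaded from JSON, its keys are strings, so integer page IDs are converted before the lookup.

```python
def pageids_to_qids(page_ids, wpid2qid):
    """Map Wikipedia page IDs to Wikidata QIDs; unknown pages map to None."""
    return [wpid2qid.get(str(page_id)) for page_id in page_ids]


# Hypothetical two-entry mapping; the real file is entity_mappings/wpid2qid.json.
toy_wpid2qid = {"111": "Q1517373", "222": "Q1454"}
print(pageids_to_qids([111, 222, 333], toy_wpid2qid))
```

Predictions whose page ID has no QID in the mapping cannot be compared against the hand labels, so returning `None` keeps them easy to spot.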

In [14]:
from utils import compute_precision_and_recall
tagme_errors = compute_precision_and_recall(orig_label_file=nq_sample_orig, 
                                            new_label_file=tagme_label_file)

Recall: 0.63 (49/78)
Precision: 0.58 (49/84)


We see that TAGME has slightly worse recall than Bootleg, when the precisions are set to be comparable (changing either TAGME or Bootleg's threshold will change the recall/precision values). 

## 4. Annotate On-the-Fly

To annotate individual sentences with Bootleg, we also support an annotate-on-the-fly mode.

To do this, we create an annotator object. This loads the model and entity databases. We use the `config_args` loaded from the previous step. Note it takes several minutes for the initial load of the model and the entity data. 

In [8]:
from bootleg.annotator import Annotator

ann = Annotator(config_args=config_args, cand_map=cand_map, device='cuda' if not use_cpu else 'cpu')

2020-09-09 13:18:12,429 Loading embeddings...
2020-09-09 13:18:36,539 Finished loading embeddings.
2020-09-09 13:19:03,200 Loading candidate mapping...


100%|██████████| 7970529/7970529 [00:14<00:00, 557925.28it/s]

2020-09-09 13:19:17,498 Loaded candidate mapping with 7970529 aliases.





Similar to TAGME, we allow setting a threshold so that only mentions whose predicted probability exceeds it are returned.

In [9]:
ann.set_threshold(0.3)

Fill in sentences to see what Bootleg predicts! Bootleg outputs the QIDs (or "NC" for "No Candidate"), the associated probabilities, and the title for each mention. The QIDs map to Wikidata -- to look them up you can use https://www.wikidata.org/wiki/Q1454 and replace the QID. "NC" means Bootleg did not find a good match among the candidates in the candidate list given the context. 

In [10]:
ann.label_mentions("where is the outer banks in north carolina")

(['Q1517373', 'Q1454'],
 [1.0, 0.9959885478019714],
 ['Outer Banks', 'North Carolina'])
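The output above is a tuple of three parallel lists. As a convenience, the sketch below zips them into one record per mention and builds the Wikidata lookup URL for each QID. The helper `format_predictions` is hypothetical and not part of the `Annotator` API; it assumes the return value always has the three-list shape printed above.

```python
def format_predictions(result):
    """Turn (qids, probs, titles) into one dictionary per mention."""
    qids, probs, titles = result
    rows = []
    for qid, prob, title in zip(qids, probs, titles):
        # "NC" has no Wikidata page, so it gets no URL.
        url = None if qid == "NC" else f"https://www.wikidata.org/wiki/{qid}"
        rows.append({"title": title, "qid": qid, "prob": round(prob, 3), "url": url})
    return rows


# The tuple printed above for "where is the outer banks in north carolina".
result = (
    ["Q1517373", "Q1454"],
    [1.0, 0.9959885478019714],
    ["Outer Banks", "North Carolina"],
)
for row in format_predictions(result):
    print(row)
```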

In [11]:
ann.label_mentions("cast of characters in fiddler on the roof")

(['Q487330'], [0.8602001070976257], ['Fiddler on the Roof'])

Sometimes the entity disambiguation problem can be quite tricky. In the above example, we predict the musical "Fiddler on the Roof" instead of the hand-labelled movie (https://www.wikidata.org/wiki/Q934036). Giving additional cues may help, though: for instance, if we add "the movie", the prediction changes to the movie!

In [12]:
ann.label_mentions("cast of characters in the movie fiddler on the roof")

(['Q934036'], [0.7369491457939148], ['Fiddler on the Roof (film)'])