# End-to-End NED Tutorial

In this tutorial, we walk through how to use Bootleg as an end-to-end pipeline to detect and label entities in a set of sentences. First, we show how to use Bootleg to detect and disambiguate mentions to entities. We then compare to an existing system named TAGME. 

This tutorial assumes you want to use Bootleg on full datasets. You can also use Bootleg in annotator mode:

```
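# Install first from a shell: pip install bootleg
from bootleg.end2end.bootleg_annotator import BootlegAnnotator

ann = BootlegAnnotator()
# "titles" selects the predicted entity titles from the returned dict
ann.label_mentions("Bob Dylan release Desire")["titles"]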
```

To understand how Bootleg performs on more natural language than we find in Wikipedia, we hand label the mentions and corresponding entities in 50 questions sampled from the [Natural Questions dataset (Google)](https://ai.google.com/research/NaturalQuestions). We will evaluate our *uncased* Bootleg model. However, we have manually cased the data in case you want to try our cased model instead.

### Requirements

You will need to download the following files for this notebook:
- Pretrained Bootleg uncased model and config [here](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/models/latest/bootleg_uncased.tar.gz)
- Sample of Natural Questions with hand-labelled entities [here](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/data/latest/nq.tar.gz)
- Entity data [here](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/data/latest/entity_db.tar.gz)

For convenience, you can run the commands below (from the root directory of the repo) to download all the above files and unpack them to `models` and `data` directories. It will take several minutes to download all the files. 

```
    bash tutorials/download_model.sh uncased
    bash tutorials/download_data.sh
```

You can also run the download scripts directly in this notebook:

In [1]:
!sh download_model.sh uncased
!sh download_data.sh

--2021-10-15 22:22:30--  https://bootleg-ned-data.s3-us-west-1.amazonaws.com/models/latest/bootleg_uncased.tar.gz
Resolving bootleg-ned-data.s3-us-west-1.amazonaws.com (bootleg-ned-data.s3-us-west-1.amazonaws.com)... 52.92.128.82
Connecting to bootleg-ned-data.s3-us-west-1.amazonaws.com (bootleg-ned-data.s3-us-west-1.amazonaws.com)|52.92.128.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 496143916 (473M) [application/x-tar]
Saving to: ‘models/bootleg_uncased.tar.gz.1’


2021-10-15 22:23:06 (15.6 MB/s) - ‘models/bootleg_uncased.tar.gz.1’ saved [496143916/496143916]

bootleg_uncased/
bootleg_uncased/bootleg_wiki.pth
bootleg_uncased/bootleg_config.yaml
--2021-10-15 22:23:18--  https://bootleg-ned-data.s3-us-west-1.amazonaws.com/data/latest/nq.tar.gz
Resolving bootleg-ned-data.s3-us-west-1.amazonaws.com (bootleg-ned-data.s3-us-west-1.amazonaws.com)... 52.218.252.81
Connecting to bootleg-ned-data.s3-us-west-1.amazonaws.com (bootleg-ned-data.s3-us-west-1.amazona

In [2]:
from pathlib import Path
import pandas as pd

# set up logging
import sys
import logging
from importlib import reload

reload(logging)
# Set to logging.DEBUG for more logging output
logging.basicConfig(
    stream=sys.stdout, format="%(asctime)s %(message)s", level=logging.INFO
)
logger = logging.getLogger(__name__)

# root_dir = FILL IN FULL PATH TO DIRECTORY WHERE DATA IS DOWNLOADED (i.e., root_dir/data and root_dir/models)
root_dir = Path(".")
# entity_dir = FILL IN PATH TO ENTITY_DB DATA (i.e., tutorial_data/data)
data_dir = root_dir / "data"
entity_dir = data_dir / "entity_db"
cand_map = entity_dir / "entity_mappings/alias2qids.json"
# model_dir = FILL IN PATH TO MODELS
model_dir = root_dir / "models"

If you have a GPU with at least 12GB of memory available, set `device` below to 0 to run inference on the GPU. The default of -1 runs on the CPU.

In [3]:
device = -1

## 1. Detect Mentions
Bootleg uses a simple mention extraction algorithm that extracts mentions using a given candidate map. We will use a Wikipedia candidate map that we mined using Wikipedia anchor links and Wikidata aliases for a total of ~15 million mentions (provided in the Requirements section of this notebook).
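
To make the idea of candidate-map lookup concrete, here is a minimal sketch of one way such an extractor could work: a greedy longest-match search of word n-grams against a set of known aliases. This is an illustration under our own assumptions, not Bootleg's actual implementation; the function name `extract_mentions_sketch`, the `max_len` parameter, and the toy alias set are all hypothetical.

```python
from typing import List, Set, Tuple


def extract_mentions_sketch(
    sentence: str, aliases: Set[str], max_len: int = 6
) -> List[Tuple[str, List[int]]]:
    """Greedy longest-match lookup of word n-grams in an alias set.

    Returns (alias, [start, end]) pairs with end-exclusive word indices.
    """
    words = sentence.lower().split()
    mentions = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest n-gram starting at word i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i : i + n])
            if candidate in aliases:
                mentions.append((candidate, [i, i + n]))
                i += n  # skip past the matched words
                matched = True
                break
        if not matched:
            i += 1
    return mentions


# Toy stand-in for the real candidate map (hypothetical contents).
toy_aliases = {"frosty the snowman", "magician", "snowman"}
mentions = extract_mentions_sketch(
    "Who did the voice of the magician in Frosty the Snowman", toy_aliases
)
print(mentions)
```

Because longer n-grams are tried first, "frosty the snowman" is preferred over the shorter alias "snowman" in this sketch.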

The end-to-end pipeline expects a jsonlines file as input: each line is a single dictionary with the key "sentence" whose value is the text of the sentence. For instance, you may have a file with the lines:

    {"sentence": "Who did the voice of the magician in Frosty the Snowman"}
    {"sentence": "What is considered the Outer Banks in North Carolina"}
    
Below, we have additional keys to keep track of the hand-labelled mentions, but this is purely for evaluating the quality of the end-to-end pipeline and is not needed in the common use cases of using Bootleg to detect and label mentions.
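
If your sentences are in a Python list, a file in this format can be produced with the standard library alone. The sketch below writes one JSON object per line and reads the file back as a sanity check; the file name `my_sentences.jsonl` and the temporary directory are placeholders, not paths used elsewhere in this tutorial.

```python
import json
import tempfile
from pathlib import Path

sentences = [
    "Who did the voice of the magician in Frosty the Snowman",
    "What is considered the Outer Banks in North Carolina",
]

# Hypothetical output location; point this at your own data directory.
out_path = Path(tempfile.mkdtemp()) / "my_sentences.jsonl"

with out_path.open("w", encoding="utf-8") as f:
    for text in sentences:
        # One dictionary per line, with the single required key "sentence".
        f.write(json.dumps({"sentence": text}) + "\n")

# Read the file back to confirm that each line parses as one JSON object.
with out_path.open(encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```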

In [4]:
nq_sample_orig = data_dir / "nq/test_50.jsonl"
nq_sample_bootleg = data_dir / "nq/test_50_bootleg.jsonl"

In [5]:
from bootleg.end2end.extract_mentions import extract_mentions

verbose = False
extract_mentions(
    in_filepath=nq_sample_orig,
    out_filepath=nq_sample_bootleg,
    entity_db_dir=entity_dir,
    verbose=verbose,
)

By looking at a sample of the extracted mentions, we can compare the mention extraction phase to the hand-labelled mentions.

In [6]:
from utils import load_mentions

orig_mentions_df = load_mentions(nq_sample_orig)
bootleg_mentions_df = load_mentions(nq_sample_bootleg)

# join dataframes and sample
res = pd.merge(
    orig_mentions_df,
    bootleg_mentions_df,
    on=["sentence"],
    suffixes=["_hand", "_bootleg"],
)
display(res.head(15))

Unnamed: 0,sentence,aliases_hand,spans_hand,aliases_bootleg,spans_bootleg
0,Who did the voice of the magician in Frosty the Snowman,[frosty the snowman],"[[8, 11]]","[voice of, magician, frosty the snowman]","[[3, 5], [6, 7], [8, 11]]"
1,What is considered the Outer Banks in North Carolina,"[outer banks, north carolina]","[[4, 6], [7, 9]]","[outer banks, north carolina]","[[4, 6], [7, 9]]"
2,The Nashville sound brought a polished and cosmopolitan sound to country music by,"[nashville sound, country music]","[[1, 3], [10, 12]]","[nashville sound, music by]","[[1, 3], [11, 13]]"
3,What channel is the Premier League on in France,"[premier league, france]","[[4, 6], [8, 9]]","[premier league, france]","[[4, 6], [8, 9]]"
4,I Love It ( feat . Charli XCX ) Icona Pop,"[i love it, charli xcx, icona pop]","[[0, 3], [6, 8], [9, 11]]","[charli xcx, icona pop]","[[6, 8], [9, 11]]"
5,The U.S. Supreme Court hears appeals from circuit courts,"[u.s. supreme court, circuit courts]","[[1, 4], [7, 9]]","[us supreme court, circuit courts]","[[1, 4], [7, 9]]"
6,Why does the author say that the vampire in Nosferatu is named Count Orlok and not Count Dracula,"[nosferatu, count orlok, count dracula]","[[9, 10], [12, 14], [16, 18]]","[vampire, nosferatu, count orlok, count dracula]","[[7, 8], [9, 10], [12, 14], [16, 18]]"
7,Is there an active volcano in New Zealand,[new zealand],"[[6, 8]]","[volcano, new zealand]","[[4, 5], [6, 8]]"
8,Once Upon a Time Season 6 episode list,[once upon a time season 6],"[[0, 6]]","[upon a time, season 6, episode list]","[[1, 4], [4, 6], [6, 8]]"
9,Who is the former co-chairman Goldman Sachs who became a U.S. Secretary of the Treasury,"[goldman sachs, us secretary of the treasury]","[[5, 7], [10, 15]]","[goldman sachs, us secretary of the treasury]","[[5, 7], [10, 15]]"


In the sample above, Bootleg generally detects the same mentions as the hand-labelled ones, but it sometimes extracts extra mentions (e.g., "volcano" in "Is there an active volcano in New Zealand"). This is expected: we would rather the mention detection step filter out too few mentions than too many. It is the job of the backbone model and postprocessing to filter out these extra mentions, either by thresholding the prediction probability or by predicting a candidate that represents "No Candidate" (which we refer to as "NC").
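
To quantify the difference for a single sentence, you can compare the two alias lists directly. The helper below is a minimal sketch (`compare_mentions` is our own name, not part of `utils`) and uses two rows copied from the sample table.

```python
def compare_mentions(hand, bootleg):
    """Return (extra, missed) aliases given two lists of alias strings."""
    extra = [a for a in bootleg if a not in hand]    # detected but not hand-labelled
    missed = [a for a in hand if a not in bootleg]   # hand-labelled but not detected
    return extra, missed


# "Is there an active volcano in New Zealand"
extra, missed = compare_mentions(["new zealand"], ["volcano", "new zealand"])

# "I Love It ( feat . Charli XCX ) Icona Pop"
extra2, missed2 = compare_mentions(
    ["i love it", "charli xcx", "icona pop"], ["charli xcx", "icona pop"]
)
print(extra, missed, extra2, missed2)
```

Comparing alias strings is deliberately naive: normalization differences such as "u.s. supreme court" versus "us supreme court" would be reported as a mismatch, so comparing spans is more robust in practice.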

## 2. Disambiguate Mentions to Entities

We run inference using a pretrained Bootleg model to disambiguate the extracted mentions to Wikidata QIDs. 

First, load the model config so we can set additional parameters and load the saved model during evaluation. We need to update the config parameters to point to the downloaded model checkpoint and data.

In [7]:
from bootleg.utils.parser.parser_utils import parse_boot_and_emm_args
from bootleg.utils.utils import load_yaml_file
from bootleg.run import run_model

config_in_path = model_dir / "bootleg_uncased/bootleg_config.yaml"

config_args = load_yaml_file(config_in_path)

# decrease number of data threads as this is a small file
config_args["run_config"]["dataset_threads"] = 2
config_args["run_config"]["log_level"] = "info"
# set the model checkpoint path
config_args["emmental"]["model_path"] = str(
    model_dir / "bootleg_uncased/bootleg_wiki.pth"
)

# set the path for the entity db and candidate map
config_args["data_config"]["entity_dir"] = str(entity_dir)
config_args["data_config"]["alias_cand_map"] = "alias2qids.json"

# set the data path for the NQ sample
config_args["data_config"]["data_dir"] = str(data_dir / "nq")

# to speed things up for the tutorial, we have already prepped the data with the mentions detected by Bootleg
config_args["data_config"]["test_dataset"]["file"] = nq_sample_bootleg.name

# set the device (-1 for CPU)
config_args["emmental"]["device"] = device

# save the new args (helps if you want to run things via command line)
config_args = parse_boot_and_emm_args(config_args)



In [8]:
bootleg_label_file = run_model(mode="dump_preds", config=config_args)

2021-10-15 22:29:58,410 Setting logging directory to: bootleg-logs/bootleg_wiki
2021-10-15 22:29:58,453 Loading Emmental default config from /lfs/raiders3/0/senwu/.pyenv/versions/3.8.6/envs/venv38/lib/python3.8/site-packages/emmental/emmental-default-config.yaml.
2021-10-15 22:29:58,454 Updating Emmental config from user provided config.
2021-10-15 22:29:58,455 Set random seed to 1234.




Building sent idx to row idx mapping: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 10668.73it/s]


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Evaluating Bootleg (test): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:56<00:00, 18.96s/it]


**Note that Bootleg automatically handles prepping of new data files for running. These are all saved in `data_config.data_dir`/`data_config.data_prep_dir`. If you change the contents of the underlying `jsonl` file _without_ removing the saved prep file or setting `data_config.overwrite_preprocessed_data` to be `True`, Bootleg will reuse the old prepped file.**
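
As a sketch of how the flag named in the note could be set, assuming it lives under `data_config` as written there and that it is set on the plain config dictionary before `parse_boot_and_emm_args` is called:

```python
# Stand-in for the dict returned by load_yaml_file; only the relevant keys are shown.
config_args = {"data_config": {"data_dir": "data/nq"}}

# Ask Bootleg to re-prep the data instead of reusing the saved prep files.
config_args["data_config"]["overwrite_preprocessed_data"] = True
```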


We can now evaluate the overall quality of the end-to-end pipeline via precision / recall metrics, where the *recall* indicates what proportion of the hand-labelled mentions Bootleg correctly detects and disambiguates, and *precision* indicates what proportion of the mentions that Bootleg labels are correct. For instance, if Bootleg only labelled the few mentions it was very confident in, then it would have a low recall and high precision.

To decide whether a predicted mention matches a hand-labelled mention span, we report weak and exact match metrics. Weak matching only requires the predicted and gold span boundaries to overlap (e.g., the predicted mention 'the wizard of oz' is counted as correct for the gold mention 'wizard of oz' if the correct entity is predicted). Exact matching requires the span boundaries to be identical.
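
The two span criteria can be sketched as follows. This is an illustration rather than the code inside `utils.compute_metrics`; it assumes spans are `[start, end]` word indices with an exclusive end, which is what the spans in the sample table suggest, and it checks only the span condition (a prediction also needs the correct entity to count).

```python
def exact_match(pred_span, gold_span):
    """Spans match exactly when both boundaries are identical."""
    return list(pred_span) == list(gold_span)


def weak_match(pred_span, gold_span):
    """Spans match weakly when the half-open intervals [start, end) overlap."""
    return pred_span[0] < gold_span[1] and gold_span[0] < pred_span[1]


# 'the wizard of oz' vs. 'wizard of oz', at hypothetical word positions.
pred, gold = [3, 7], [4, 7]
print(weak_match(pred, gold), exact_match(pred, gold))
```

Note that an overlapping span is not sufficient on its own: in the sample table, the predicted 'music by' at `[11, 13]` overlaps the gold 'country music' at `[10, 12]`, but it only counts as correct if the predicted entity is also the gold entity.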

In [9]:
from utils import compute_metrics

bootleg_end2end_errors = compute_metrics(
    gold_file=nq_sample_orig, pred_file=bootleg_label_file, threshold=0.5
)

WEAK MATCHING
precision = 62 / 79 = 0.7848101265822784
recall = 62 / 78 = 0.7948717948717948
f1 = 0.7898089171974522

EXACT MATCHING
precision = 62 / 79 = 0.7848101265822784
recall = 62 / 78 = 0.7948717948717948
f1 = 0.7898089171974522
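
The reported scores follow from the counts by the standard definitions of precision, recall, and F1. A minimal sketch that reproduces the arithmetic (the helper name is ours, not part of `utils`):

```python
def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Compute precision, recall, and F1 from raw counts."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 62 correct predictions, 79 predicted mentions, 78 hand-labelled mentions.
p, r, f1 = precision_recall_f1(62, 79, 78)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# prints: precision=0.7848 recall=0.7949 f1=0.7898
```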


We can examine errors in the end-to-end pipeline below. As you increase the threshold in the `compute_metrics` command, entities with a prediction probability less than the threshold will be filtered out. If too few entities are predicted, lowering the threshold may help.  
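
The effect of the threshold can be illustrated with a small sketch that keeps only predictions whose probability is at least the threshold. The helper `filter_by_threshold` is hypothetical and is not the logic inside `compute_metrics`; the values are taken from the "Dirty Harry" row of the error table in this section.

```python
def filter_by_threshold(aliases, qids, probs, threshold):
    """Keep (alias, qid, prob) triples whose probability is >= threshold."""
    return [(a, q, p) for a, q, p in zip(aliases, qids, probs) if p >= threshold]


aliases = ["bank robber", "dirty harry"]
qids = ["Q806824", "Q110206"]
probs = [0.7667599320411682, 0.6623506546020508]

kept_low = filter_by_threshold(aliases, qids, probs, threshold=0.5)   # keeps both
kept_mid = filter_by_threshold(aliases, qids, probs, threshold=0.7)   # keeps one
kept_high = filter_by_threshold(aliases, qids, probs, threshold=0.8)  # keeps none
print(len(kept_low), len(kept_mid), len(kept_high))
```

In this row, raising the threshold to 0.7 removes the gold mention 'dirty harry' while keeping the extra mention 'bank robber', which shows that a higher threshold does not always trade recall for precision cleanly.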

In [10]:
pd.DataFrame(bootleg_end2end_errors).sample(10)

Unnamed: 0,sent_idx,text,gold_aliases,gold_qids,gold_spans,pred_aliases,pred_qids,pred_spans,pred_probs
7,16,Where did Britain create colonies for its empire,"[britain, empire]","[Q161885, Q8680]","[[2, 3], [7, 8]]",[britain],[],[],[]
13,34,Who 's doing the halftime show in 2018,[halftime show],[Q902899],"[[4, 6]]",[halftime show],[Q11161626],"[[4, 6]]",[0.8461166620254517]
2,4,I Love It ( feat . Charli XCX ) Icona Pop,"[i love it, charli xcx, icona pop]","[Q3273659, Q5084390, Q808703]","[[0, 3], [6, 8], [9, 11]]","[charli xcx, icona pop]","[Q5084390, Q808703]","[[6, 8], [9, 11]]","[1.0, 0.9958118200302124]"
3,7,Is there an active volcano in New Zealand,[new zealand],[Q664],"[[6, 8]]","[volcano, new zealand]","[Q8072, Q664]","[[4, 5], [6, 8]]","[0.9965059757232666, 0.9999643564224243]"
1,2,The Nashville sound brought a polished and cosmopolitan sound to country music by,"[nashville sound, country music]","[Q1751782, Q83440]","[[1, 3], [10, 12]]","[nashville sound, music by]","[Q1751782, Q17059875]","[[1, 3], [11, 13]]","[0.9996482133865356, 1.0]"
16,40,Where does the last name Vigil come from,[vigil],[Q16878937],"[[5, 6]]",[vigil],[],[],[]
5,11,Hitchhiker 's Guide to the Galaxy Slartibartfast quotes,"[hitchhiker 's guide to the galaxy, slartibartfast]","[Q25169, Q779920]","[[0, 6], [6, 7]]",[hitchhiker s guide to the galaxy],[Q25169],"[[0, 6]]",[1.0]
4,8,Once Upon a Time Season 6 episode list,[once upon a time season 6],[Q23301616],"[[0, 6]]","[upon a time, season 6, episode list]",[Q55636748],"[[1, 4]]",[1.0]
0,0,Who did the voice of the magician in Frosty the Snowman,[frosty the snowman],[Q5506238],"[[8, 11]]","[voice of, magician, frosty the snowman]","[Q11540682, Q5506238]","[[3, 5], [8, 11]]","[1.0, 0.9153743982315063]"
8,19,Who played the bank robber in Dirty Harry,[dirty harry],[Q110206],"[[6, 8]]","[bank robber, dirty harry]","[Q806824, Q110206]","[[3, 5], [6, 8]]","[0.7667599320411682, 0.6623506546020508]"


Some of Bootleg's errors come from predicting a candidate that is too general (e.g., Oregon State Beavers instead of Oregon State Beavers baseball). Others are due to ambiguous sentences (e.g., for "cast of characters in fiddler on the roof", should the entity be the movie or the musical?). A final bucket of errors suggests that we need to boost certain training signals; this is an area we're actively pursuing in Bootleg through an investigation of model guidability!