# Bootleg Annotator Tutorial

In this tutorial, we walk through how to use Bootleg as an end-to-end pipeline to detect and label entities in a set of sentences on the fly.

### Requirements

You will need to download the following files for this notebook:
- Pretrained Bootleg uncased model and config [here](https://bootleg-data.s3-us-west-2.amazonaws.com/models/lateset/bootleg_uncased.tar.gz)
- Entity data [here](https://bootleg-data.s3-us-west-2.amazonaws.com/data/lateset/entity_db.tar.gz)

For convenience, you can run the commands below (from the root directory of the repo) to download all of the files above and unpack them into the `models` and `data` directories. The download takes several minutes.

```
    bash tutorials/download_model.sh uncased
    bash tutorials/download_data.sh
```

You can also run the download scripts directly in this notebook:

In [None]:
!sh download_model.sh uncased
!sh download_data.sh

In [4]:
from pathlib import Path
import pandas as pd

# set up logging
import sys
import logging
from importlib import reload
reload(logging)
# Set to logging.DEBUG for more logging output
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

# root_dir = FILL IN FULL PATH TO DIRECTORY WHERE DATA IS DOWNLOADED (i.e., root_dir/data and root_dir/models)
root_dir = Path(".")
# entity_dir = FILL IN PATH TO ENTITY_DB DATA (i.e., tutorial_data/data)
data_dir = root_dir / "data"
entity_dir = data_dir / "entity_db"
# model_dir = FILL IN PATH TO MODELS
model_dir = root_dir / "models"
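
Before moving on, it can help to confirm that the download scripts unpacked everything where the rest of the notebook expects it. The following is a minimal sketch, not part of Bootleg: `missing_paths` is a hypothetical helper, and the three paths it checks are simply the ones this tutorial references (`data/entity_db`, `models/bootleg_uncased/bootleg_config.yaml`, and `models/bootleg_uncased/bootleg_wiki.pth`).

```python
from pathlib import Path


def missing_paths(root_dir):
    """Return the tutorial paths that do not exist under root_dir.

    Hypothetical helper: the expected layout is taken from the paths
    used later in this tutorial, not from any Bootleg API.
    """
    root_dir = Path(root_dir)
    expected = [
        root_dir / "data" / "entity_db",
        root_dir / "models" / "bootleg_uncased" / "bootleg_config.yaml",
        root_dir / "models" / "bootleg_uncased" / "bootleg_wiki.pth",
    ]
    return [path for path in expected if not path.exists()]


# Report anything that still needs to be downloaded or moved.
for path in missing_paths("."):
    print(f"Missing: {path}")
```

If nothing is printed, the layout matches what the following cells assume.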

If you have a GPU with at least 12GB of memory available, set `device` to 0 in the cell below to run inference on the GPU. The default of -1 runs inference on the CPU.

In [2]:
device = -1
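
If you would rather not set the value by hand, the choice can be made programmatically. This is a rough sketch under a few assumptions: `choose_device` is a hypothetical helper, it only looks at CUDA device 0, and it compares the card's total memory (in decimal GB) against the threshold rather than the memory that is currently free, so it is a heuristic and not a guarantee that the model will fit.

```python
def choose_device(min_gb=12):
    """Return 0 if CUDA device 0 reports at least min_gb of total memory, else -1.

    Hypothetical helper: falls back to -1 (CPU) when PyTorch is not
    installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return -1
    if not torch.cuda.is_available():
        return -1
    # total_memory is in bytes; this is total capacity, not free memory.
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    return 0 if total_gb >= min_gb else -1


device = choose_device()
print(f"device = {device}")
```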

First, load the model config so we can set additional parameters and load the saved model during evaluation. We need to update the config parameters to point to the downloaded model checkpoint and data.

In [5]:
from bootleg.utils.parser.parser_utils import parse_boot_and_emm_args
from bootleg.utils.utils import load_yaml_file
from bootleg.run import run_model

config_in_path = model_dir / 'bootleg_uncased/bootleg_config.yaml'

config_args = load_yaml_file(config_in_path)

# set the model checkpoint path
config_args["emmental"]["model_path"] = str(model_dir / 'bootleg_uncased/bootleg_wiki.pth')

# set the path for the entity db and candidate map
config_args["data_config"]["entity_dir"] = str(entity_dir)

Now let's pass the config to `BootlegAnnotator` to load the annotator.

In [6]:
# Load a new annotator with our config - note that it has to re-prepare some data when it loads
from bootleg.end2end.bootleg_annotator import BootlegAnnotator

# You can also pass `return_embs=True` to get the embeddings
ann = BootlegAnnotator(config=config_args, device=device, return_embs=False, verbose=False)

2021-09-30 18:24:28,300 Setting logging directory to: bootleg-logs/bootleg_wiki
2021-09-30 18:24:28,333 Loading Emmental default config from /dfs/scratch0/lorr1/projects/emmental/src/emmental/emmental-default-config.yaml.
2021-09-30 18:24:28,334 Updating Emmental config from user provided config.
2021-09-30 18:24:28,335 Set random seed to 1234.
2021-09-30 18:28:56,379 Lock 140218801889088 acquired on bootleg-data/pretrained_bert_models/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock


2021-09-30 18:28:56,842 Lock 140218801889088 released on bootleg-data/pretrained_bert_models/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
2021-09-30 18:28:57,161 Lock 140218801889040 acquired on bootleg-data/pretrained_bert_models/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock


2021-09-30 18:28:57,611 Lock 140218801889040 released on bootleg-data/pretrained_bert_models/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
2021-09-30 18:28:57,933 Lock 140218801887264 acquired on bootleg-data/pretrained_bert_models/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock


2021-09-30 18:28:58,640 Lock 140218801887264 released on bootleg-data/pretrained_bert_models/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
2021-09-30 18:28:58,972 Lock 140218801812240 acquired on bootleg-data/pretrained_bert_models/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock


2021-09-30 18:28:59,747 Lock 140218801812240 released on bootleg-data/pretrained_bert_models/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
2021-09-30 18:29:00,707 Created emmental model Bootleg that contains task set().


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

2021-09-30 18:29:05,606 Created task: NED
2021-09-30 18:29:05,607 Moving context_encoder module to CPU.
2021-09-30 18:29:05,611 Moving entity_encoder module to CPU.
2021-09-30 18:29:05,887 [Bootleg] Model loaded from ../tutorial_data/models/bootleg_uncased/bootleg_wiki.pth
2021-09-30 18:29:05,888 Moving context_encoder module to CPU.
2021-09-30 18:29:05,892 Moving entity_encoder module to CPU.


In [10]:
print(ann.label_mentions(["I am Lincoln"])["titles"])
print(ann.label_mentions(["How much is a Lincoln"])["titles"])

[['Abraham Lincoln']]
[['Lincoln Motor Company']]
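
The outputs above suggest that `"titles"` holds one list of entity titles per input sentence. Assuming that also holds when several sentences are passed in a single call, the results can be lined up with their inputs as sketched below. `pair_titles` is a hypothetical helper, and the values are stand-ins copied from the outputs above so that the snippet runs without a loaded model.

```python
def pair_titles(sentences, titles):
    """Pair each input sentence with the list of titles predicted for it.

    Hypothetical helper: assumes `titles` contains exactly one list per
    sentence, in the same order as `sentences`.
    """
    if len(sentences) != len(titles):
        raise ValueError("expected one list of titles per sentence")
    return list(zip(sentences, titles))


# Stand-in values copied from the outputs above. With a loaded annotator,
# you would use ann.label_mentions(sentences)["titles"] instead.
sentences = ["I am Lincoln", "How much is a Lincoln"]
titles = [["Abraham Lincoln"], ["Lincoln Motor Company"]]

for sentence, found in pair_titles(sentences, titles):
    print(f"{sentence!r} -> {found}")
```

The same word, "Lincoln", resolves to a different entity in each sentence, which is the context-dependent disambiguation the annotator is meant to provide.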
