# Bootleg Annotator Tutorial

In this tutorial, we walk through how to use Bootleg as an end-to-end pipeline to detect and label entities in a set of sentences on the fly.

### Requirements
When evaluating Bootleg with the annotator, Bootleg processes possible mentions in text according to three environment flags: ``BOOTLEG_STRIP``, ``BOOTLEG_LOWER``, and ``BOOTLEG_LANG_CODE``. ``BOOTLEG_LANG_CODE`` sets the language to use for spaCy. ``BOOTLEG_STRIP`` controls whether punctuation is stripped from mentions (False by default). ``BOOTLEG_LOWER`` controls whether ``.lower()`` is called on mentions (True by default). Set these with `os.environ`.
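
As a minimal sketch, the flags could be set as shown below. The string values (`"true"`, `"false"`, `"en"`) are assumptions about the accepted format, since environment variables are always strings and the tutorial only states the defaults. Setting them before importing Bootleg is likewise an assumption, made to be safe.

```python
import os

# Assumed values: the tutorial states only the defaults (strip off, lower on),
# not the exact strings Bootleg parses. Adjust if your version expects others.
bootleg_flags = {
    "BOOTLEG_STRIP": "false",   # do not strip punctuation from mentions
    "BOOTLEG_LOWER": "true",    # lowercase mentions
    "BOOTLEG_LANG_CODE": "en",  # language used for spaCy (assumed code)
}

# Set the flags before importing bootleg so they are visible when it loads.
os.environ.update(bootleg_flags)
```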

You will need to download the following files for this notebook:
- Pretrained Bootleg uncased model and config [here](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/models/latest/bootleg_uncased.tar.gz)
- Entity data [here](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/data/latest/entity_db.tar.gz)

For convenience, you can run the commands below (from the root directory of the repo) to download all the above files and unpack them to `models` and `data` directories. It will take several minutes to download all the files. 

```
    bash tutorials/download_model.sh uncased
    bash tutorials/download_data.sh
```

You can also run the download scripts directly in this notebook by uncommenting the lines below.

In [1]:
# !sh download_model.sh uncased
# !sh download_data.sh

In [2]:
from pathlib import Path

# root_dir = FILL IN FULL PATH TO DIRECTORY WHERE DATA IS DOWNLOADED (i.e., root_dir/data and root_dir/models)
root_dir = Path("../")
# entity_dir = FILL IN PATH TO ENTITY_DB DATA (i.e., tutorial_data/data)
data_dir = root_dir / "data"
entity_dir = data_dir / "entity_db"
# model_dir = FILL IN PATH TO MODELS
model_dir = root_dir / "models"

If you have a GPU with at least 12GB of memory available, set `device` below to 0 to run inference on the GPU. The default of -1 runs inference on the CPU.

In [3]:
device = -1
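
If you would rather choose the device programmatically, one possible sketch follows. It assumes PyTorch is installed as the backend. The helper name `choose_device` is hypothetical and not part of Bootleg, and the check uses the GPU's total memory rather than the memory currently free, which is a simplification.

```python
def choose_device(min_gb=12):
    """Return 0 if GPU 0 has at least min_gb of total memory, else -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot query the GPU, so fall back to CPU.
        return -1
    if not torch.cuda.is_available():
        return -1
    # total_memory is reported in bytes; convert to GiB before comparing.
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return 0 if total_gb >= min_gb else -1


device = choose_device()
```

Note that other processes may already be using part of the GPU memory, so a device that passes this check can still run out of memory.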

First, load the model config so we can set additional parameters and load the saved model during evaluation. We need to update the config parameters to point to the downloaded model checkpoint and data.

In [4]:
from bootleg.utils.utils import load_yaml_file

config_in_path = model_dir / "bootleg_uncased/bootleg_config.yaml"

config_args = load_yaml_file(config_in_path)

# set the model checkpoint path
config_args["emmental"]["model_path"] = str(
    model_dir / "bootleg_uncased/bootleg_wiki.pth"
)

# set the path for the entity db and candidate map
config_args["data_config"]["entity_dir"] = str(entity_dir)
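
Before loading the annotator, it can help to confirm that the downloaded files are where the config expects them. The sketch below is an optional check, not part of Bootleg; `missing_paths` is a hypothetical helper, and the paths simply mirror the ones used in this tutorial.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Paths used in this tutorial; adjust root_dir if you downloaded elsewhere.
root_dir = Path("../")
expected = [
    root_dir / "models" / "bootleg_uncased" / "bootleg_config.yaml",
    root_dir / "models" / "bootleg_uncased" / "bootleg_wiki.pth",
    root_dir / "data" / "entity_db",
]

for path in missing_paths(expected):
    print(f"Missing: {path}")
```

If anything is reported missing, rerun the download scripts or correct `root_dir` before continuing.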



Now we pass the config to `BootlegAnnotator` to load the annotator.

In [5]:
# Load a new annotator with our config. Note that it has to re-prepare some data, so this can take several minutes
from bootleg.end2end.bootleg_annotator import BootlegAnnotator

# You can also pass `return_embs=True` to get the embeddings
ann = BootlegAnnotator(
    config=config_args, device=device, return_embs=False, verbose=False
)

2021-10-15 20:23:06,436 Setting logging directory to: bootleg-logs/bootleg_wiki
2021-10-15 20:23:06,480 Loading Emmental default config from /lfs/raiders3/0/senwu/.pyenv/versions/3.8.6/envs/venv38/lib/python3.8/site-packages/emmental/emmental-default-config.yaml.
2021-10-15 20:23:06,481 Updating Emmental config from user provided config.
2021-10-15 20:23:06,482 Set random seed to 1234.
2021-10-15 20:29:36,662 Created emmental model Bootleg that contains task set().


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

2021-10-15 20:29:46,048 Created task: NED
2021-10-15 20:29:46,053 Moving context_encoder module to CPU.
2021-10-15 20:29:46,057 Moving entity_encoder module to CPU.
2021-10-15 20:29:46,746 [Bootleg] Model loaded from ../models/bootleg_uncased/bootleg_wiki.pth
2021-10-15 20:29:46,747 Moving context_encoder module to CPU.
2021-10-15 20:29:46,751 Moving entity_encoder module to CPU.


In [6]:
print(ann.label_mentions(["I am Lincoln"])["titles"])
print(ann.label_mentions(["How much is a Lincoln"])["titles"])

[['Abraham Lincoln']]
[['Lincoln Motor Company']]
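
The outputs above suggest that `"titles"` holds one list of entity titles per input sentence. The helper below is a hypothetical convenience, not part of Bootleg, that pairs each sentence with its titles. The `result` dictionary is a stand-in built from the outputs shown above, under the assumption that a single call with several sentences returns the per-sentence lists in input order.

```python
def titles_by_sentence(sentences, result):
    """Pair each input sentence with the entity titles predicted for it."""
    titles = result["titles"]
    if len(titles) != len(sentences):
        raise ValueError("expected one list of titles per sentence")
    return list(zip(sentences, titles))


sentences = ["I am Lincoln", "How much is a Lincoln"]
# Stand-in for ann.label_mentions(sentences); values copied from the outputs above.
result = {"titles": [["Abraham Lincoln"], ["Lincoln Motor Company"]]}

for sentence, titles in titles_by_sentence(sentences, result):
    print(f"{sentence!r} -> {titles}")
```

With a loaded annotator, you would replace the stand-in with `result = ann.label_mentions(sentences)`.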


### Faster inference
For more efficient inference, you can pass the annotator a static entity embedding matrix so that the model does not have to run a forward pass of the entity encoder.

See `entity_embedding_tutorial.ipynb` for how to call `extract_all_entities`. Its output can be passed to the annotator via the `entity_emb_file` argument:

In [None]:
entity_emb_file = "<path to file>"
ann = BootlegAnnotator(config=config_args, device=device, return_embs=False, entity_emb_file=entity_emb_file)