# Entity Embedding Tutorial

In this tutorial, we walk through how to generate Bootleg contextual entity embeddings for use in downstream tasks using a pretrained Bootleg model. We also demonstrate how to extract Bootleg's static learned embeddings for downstream tasks when contextualized embeddings are not needed.

### Requirements

You will need to download the following files for this notebook:
- Pretrained Bootleg model and config [here](https://bootleg-emb.s3.amazonaws.com/models/2020_12_09/bootleg_wiki.tar.gz)
- Sample of Natural Questions with hand-labelled entities [here](https://bootleg-emb.s3.amazonaws.com/data/nq.tar.gz)
- Entity data [here](https://bootleg-emb.s3.amazonaws.com/data/wiki_entity_data.tar.gz)
- Embedding data [here](https://bootleg-emb.s3.amazonaws.com/data/emb_data.tar.gz)
- Pretrained BERT model [here](https://bootleg-emb.s3.amazonaws.com/pretrained_bert_models.tar.gz)

These are the same files used in the End-to-End tutorial, so you do not need to download them again if you completed that tutorial.

For convenience, you can run the commands below (from the root directory of the repo) to download all the above files and unpack them to `models`, `data`, and `pretrained_bert_models` directories. It will take several minutes to download all the files. 

    bash download_model.sh 
    bash download_data.sh 
    bash download_bert.sh
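
Once the downloads finish, it can be worth confirming that the three directories are in place before continuing. The snippet below is a minimal sketch of such a check. `missing_dirs` is a hypothetical helper (it is not part of Bootleg), and the snippet assumes it is run from the root directory of the repo.

```python
import os

def missing_dirs(root_dir, names=("models", "data", "pretrained_bert_models")):
    """Return the expected sub-directories that are absent under root_dir."""
    return [name for name in names
            if not os.path.isdir(os.path.join(root_dir, name))]

# Assumption: the current working directory is the repo root;
# otherwise pass the full path to the repo instead of os.getcwd().
missing = missing_dirs(os.getcwd())
if missing:
    print("Missing directories:", missing)
else:
    print("All download directories found.")
```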

## 1.  Prepare Model Config

As with the other tutorials, we set up the config to point to the correct data directories and model checkpoint. We use the sample of [Natural Questions](https://ai.google.com/research/NaturalQuestions) introduced in the End-to-End tutorial, with mentions already extracted by Bootleg.

In [2]:
import numpy as np 
import pandas as pd
import ujson
from utils import load_mentions, tagme_annotate

from bootleg import run
from bootleg.utils.parser_utils import get_full_config

# set up logging
import sys
import logging
from importlib import reload
reload(logging)
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)



If you have a GPU with at least 12GB of memory available, set `use_cpu` below to `False` to run inference on the GPU.

In [3]:
use_cpu = True

Below, specify the root directory where the files were downloaded.

In [4]:
root_dir = ''  # FILL IN FULL PATH TO ROOT REPO DIRECTORY HERE
config_path = f'{root_dir}/models/bootleg_wiki/bootleg_config.json'
config_args = get_full_config(config_path)

# decrease number of data threads as this is a small file
config_args.data_config.dataset_threads = 2

# set the model checkpoint path 
config_args.run_config.init_checkpoint = f'{root_dir}/models/bootleg_wiki/bootleg_model.pt'

# set the path for the entity db and candidate map
config_args.data_config.entity_dir = f'{root_dir}/data/wiki_entity_data'
config_args.data_config.alias_cand_map = 'alias2qids_wiki.json'

# set the data path for the Natural Questions sample
config_args.data_config.data_dir = f'{root_dir}/data/nq'

# to speed things up for the tutorial, we have already prepped the data with the mentions detected by Bootleg
config_args.data_config.test_dataset.file = 'test_natural_questions_50_bootleg.jsonl'

# set the embedding paths 
config_args.data_config.emb_dir =  f'{root_dir}/data/emb_data'
config_args.data_config.word_embedding.cache_dir =  f'{root_dir}/pretrained_bert_models'

# set the save directory 
config_args.run_config.save_dir = f'{root_dir}/results'

# set whether to run inference on the CPU
config_args.run_config.cpu = use_cpu

## 2. Load Contextual Entity Embeddings

We now show how Bootleg contextualized embeddings can be loaded and used in downstream tasks. First, we use the `dump_embs` mode to generate contextual entity embeddings.

In [5]:
bootleg_label_file, bootleg_emb_file = run.model_eval(args=config_args, mode="dump_embs", logger=logger, is_writer=True)

2020-10-21 18:11:21,769 PyTorch version 1.5.0 available.
2020-10-21 18:11:24,719 TensorFlow version 2.2.0 available.
2020-10-21 18:11:25,410 Loading entity_symbols...
2020-10-21 18:12:11,442 Loaded entity_symbols with 5310039 entities.
2020-10-21 18:12:12,237 Loading slices...
2020-10-21 18:12:12,259 Finished loading slices.
2020-10-21 18:12:32,381 Loading dataset...
2020-10-21 18:12:32,412 Finished loading dataset.
2020-10-21 18:12:37,468 Loading embeddings...
2020-10-21 18:13:01,797 Finished loading embeddings.
2020-10-21 18:13:01,886 Loading model from /dfs/scratch0/lorr1/bootleg/bootleg-internal/new_tutorial_data/models/bootleg_wiki/bootleg_model.pt...
2020-10-21 18:13:10,063 Successfully loaded model from /dfs/scratch0/lorr1/bootleg/bootleg-internal/new_tutorial_data/models/bootleg_wiki/bootleg_model.pt starting from checkpoint epoch 1 and step 0.
2020-10-21 18:13:10,128 ************************DUMPING PREDICTIONS FOR test_natural_questions_50_bootleg.jsonl************************

In `dump_embs` mode, Bootleg saves the contextual entity embedding for each mention in each sentence to a file. The path to this file is returned in the variable `bootleg_emb_file`. The full file path also appears in the log (it ends in `.npy`).

In [6]:
import numpy as np
contextual_entity_embs = np.load(bootleg_emb_file)
contextual_entity_embs.shape

(104, 512)

Each row in the contextual entity embedding above corresponds to an extracted mention in a sentence. As the shape shows, there are 104 extracted mentions in total, and each contextual entity embedding has 512 dimensions.

The mapping from mentions to rows in the contextual entity embedding is stored in `ctx_emb_ids` in the label file. We now check out the label file, which was also generated and returned from running `dump_embs` mode.
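
One quick sanity check on this mapping is to confirm that the `ctx_emb_ids` across the label file account for every row of the embedding matrix exactly once. The sketch below uses toy in-memory records in place of the real label file. `ids_cover_all_rows` is a hypothetical helper, and the one-id-per-row behaviour is an assumption inferred from the outputs shown in this tutorial rather than something Bootleg is known to guarantee.

```python
import numpy as np

def ids_cover_all_rows(records, num_rows, id_key="ctx_emb_ids"):
    """Check that the ids in all records are exactly 0..num_rows-1, each used once."""
    all_ids = [idx for record in records for idx in record[id_key]]
    return sorted(all_ids) == list(range(num_rows))

# Toy stand-ins for the label file and the loaded .npy array.
toy_records = [
    {"aliases": ["outer banks", "north carolina"], "ctx_emb_ids": [0, 1]},
    {"aliases": ["premier league", "france"], "ctx_emb_ids": [2, 3]},
]
toy_embs = np.zeros((4, 8), dtype=np.float32)

print(ids_cover_all_rows(toy_records, toy_embs.shape[0]))
```

With the real outputs, `records` would be the lines read from `bootleg_label_file` and `num_rows` would be `contextual_entity_embs.shape[0]`.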

In [7]:
import jsonlines
with jsonlines.open(bootleg_label_file) as f: 
    for i, line in enumerate(f): 
        print('sentence:', line['sentence'])
        print('mentions:', line['aliases'])
        print('contextual emb ids:', line['ctx_emb_ids'])
        print()
        if i == 5: 
            break

sentence: who did the voice of the magician in frosty the snowman
mentions: ['the voice', 'the magician', 'frosty the snowman']
contextual emb ids: [0, 1, 2]

sentence: what is considered the outer banks in north carolina
mentions: ['outer banks', 'north carolina']
contextual emb ids: [3, 4]

sentence: the nashville sound brought a polished and cosmopolitan sound to country music by
mentions: ['the nashville sound', 'cosmopolitan', 'country music']
contextual emb ids: [5, 6, 7]

sentence: what channel is the premier league on in france
mentions: ['premier league', 'france']
contextual emb ids: [8, 9]

sentence: i love it ( feat . charli xcx ) icona pop
mentions: ['i love it', 'charli xcx', 'icona pop']
contextual emb ids: [10, 11, 12]

sentence: the u.s. supreme court hears appeals from circuit courts
mentions: ['us supreme court', 'circuit courts']
contextual emb ids: [13, 14]



In the first sentence, we can find the corresponding contextual entity embedding for "the voice", "the magician", and "frosty the snowman" in rows 0, 1, and 2 of `contextual_entity_embs`, respectively. Similarly, we have unique row ids for the mentions in each of the other sentences. A downstream task can use this process to load the correct contextual entity embeddings for each mention in a simple dataloader.
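
The lookup described above can be sketched as a small function that a dataloader might call once per sentence. This is an illustrative sketch, not Bootleg code: `gather_mention_embeddings` is a hypothetical helper, and the toy array and record stand in for `contextual_entity_embs` and one line of the label file.

```python
import numpy as np

def gather_mention_embeddings(record, emb_matrix, id_key="ctx_emb_ids"):
    """Return a (num_mentions, dim) array holding one embedding row per mention."""
    row_ids = np.asarray(record[id_key], dtype=np.int64)
    return emb_matrix[row_ids]

# Toy stand-in for the loaded embedding file: 15 rows, 4 dimensions.
toy_embs = np.arange(15 * 4, dtype=np.float32).reshape(15, 4)

# Toy stand-in for one line of the label file.
toy_record = {
    "sentence": "what is considered the outer banks in north carolina",
    "aliases": ["outer banks", "north carolina"],
    "ctx_emb_ids": [3, 4],
}

mention_embs = gather_mention_embeddings(toy_record, toy_embs)
for alias, emb in zip(toy_record["aliases"], mention_embs):
    print(alias, emb)
```

Indexing with the whole list of ids at once keeps the rows in the same order as `aliases`, so mention `i` lines up with row `i` of the result.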

## 3. Load Static Entity Embeddings

In addition to contextual entity embeddings, Bootleg learns static entity embeddings. These are useful when a downstream task has no contextual information available, or when we want the same embedding for an entity regardless of the context or position of the mention.

We walk through how to extract the static, learned entity embeddings from a pretrained Bootleg model. First, we define a utility function to load a model.

In [12]:
import os 
import torch 
from collections import OrderedDict

from bootleg.model import Model
from bootleg.symbols.entity_symbols import EntitySymbols
from bootleg.utils import data_utils

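def load_model(config_args, device='cuda', logger=None):
    # Fall back to a module-level logger so the default logger=None does not raise
    # (logging is imported in the setup cell at the top of the notebook)
    if logger is None:
        logger = logging.getLogger(__name__)
    logger.info(f'Using device {device}')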
    entity_db = EntitySymbols(
        os.path.join(config_args.data_config.entity_dir,
                     config_args.data_config.entity_map_dir),
        alias_cand_map_file=config_args.data_config.alias_cand_map)
    word_db = data_utils.load_wordsymbols(config_args.data_config, is_writer=True, distributed=False)

    model = Model(args=config_args, model_device=device,
            entity_symbols=entity_db, word_symbols=word_db).to(device)
    
    logger.info(f'Loading model from {config_args.run_config.init_checkpoint}.')
    model_state_dict = torch.load(config_args.run_config.init_checkpoint,
            map_location=lambda storage, loc: storage)['model']
    logger.info('Loaded model.')
    if config_args.run_config.distributed:
        # Remove distributed naming if model trained in distributed mode
        new_state_dict = OrderedDict()
        for k, v in model_state_dict.items():
            if k.startswith('module.'):
                name = k[len('module.'):]
                new_state_dict[name] = v
            else:
                new_state_dict[k] = v
        model_state_dict = new_state_dict
    model.load_state_dict(model_state_dict, strict=True)
    model.eval()
    return model
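
The prefix handling inside `load_model` is easier to follow in isolation. PyTorch's `DataParallel` and `DistributedDataParallel` wrappers hold the wrapped model under a `module` attribute, so parameter names saved from a wrapped model start with `module.`. The sketch below shows the same renaming on a toy dictionary. `strip_prefix` is a hypothetical helper and the parameter names are made up for illustration.

```python
from collections import OrderedDict

def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned

# Hypothetical parameter names; a real checkpoint maps names to tensors, not ints.
toy_state = OrderedDict([("module.emb_layer.weight", 1), ("classifier.bias", 2)])
print(list(strip_prefix(toy_state)))
```

Keys without the prefix pass through unchanged, so the same function is safe to apply to checkpoints saved with or without a wrapper.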

Load the pretrained Bootleg model. This will take several minutes. 

In [13]:
model = load_model(config_args, logger=logger, device='cuda' if not use_cpu else 'cpu')

2020-10-21 18:20:56,809 Using device cpu
2020-10-21 18:22:27,588 Loading embeddings...
2020-10-21 18:22:52,005 Finished loading embeddings.
2020-10-21 18:22:52,098 Loading model from /dfs/scratch0/lorr1/bootleg/bootleg-internal/new_tutorial_data/models/bootleg_wiki/bootleg_model.pt.
2020-10-21 18:22:59,825 Loaded model.


Get the static, learned entity embedding as a torch tensor. 

In [14]:
ent_embedding = model.emb_layer.entity_embs.learned.learned_entity_embedding.weight.data 
ent_embedding.shape

torch.Size([5310041, 256])

This Bootleg model was trained on data with roughly 5.3 million entities, and each static entity embedding is 256-dimensional, as indicated by the shape of the static, learned entity embedding above.
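
The printed shape also gives a rough idea of how much memory the static embedding table occupies, which helps explain why loading the model takes a while. The arithmetic below is a back-of-the-envelope sketch that assumes the weights are stored as 32-bit floats; that dtype is an assumption, not something the output above confirms.

```python
# Shape printed above for the static entity embedding.
num_rows, dim = 5310041, 256

# Assumption: each value is a float32, i.e. 4 bytes.
bytes_per_value = 4

total_bytes = num_rows * dim * bytes_per_value
print(f"{total_bytes / 2**30:.2f} GiB")
```

Under that assumption the table alone needs about 5 GiB, before counting the rest of the model.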

The mapping from mentions to rows in the static, learned entity embedding (corresponding to the predicted entity) is also saved in the label file produced by `dump_embs` mode. We check out the label file below and use the `entity_ids` key to find the corresponding embedding row.  

In [15]:
import jsonlines
with jsonlines.open(bootleg_label_file) as f: 
    for i, line in enumerate(f): 
        print('sentence:', line['sentence'])
        print('mentions:', line['aliases'])
        print('entity ids:', line['entity_ids'])
        print()
        if i == 5: 
            break

sentence: who did the voice of the magician in frosty the snowman
mentions: ['the voice', 'the magician', 'frosty the snowman']
entity ids: [3137025, 47084, 317160]

sentence: what is considered the outer banks in north carolina
mentions: ['outer banks', 'north carolina']
entity ids: [669293, 10038]

sentence: the nashville sound brought a polished and cosmopolitan sound to country music by
mentions: ['the nashville sound', 'cosmopolitan', 'country music']
entity ids: [4820686, 23951, 2213]

sentence: what channel is the premier league on in france
mentions: ['premier league', 'france']
entity ids: [5048, 1039794]

sentence: i love it ( feat . charli xcx ) icona pop
mentions: ['i love it', 'charli xcx', 'icona pop']
entity ids: [3556241, 3432476, 3539431]

sentence: the u.s. supreme court hears appeals from circuit courts
mentions: ['us supreme court', 'circuit courts']
entity ids: [14865, 31738]



Unlike the contextual entity embeddings, the static embeddings are not unique across mentions. For instance, if the same entity is predicted across two different mentions, the static entity embedding (and ids in the label file) will be the same for those mentions, whereas the contextual entity embeddings and ids will be different. 
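
The difference can be illustrated with toy data. In the sketch below, the arrays and label records are hypothetical stand-ins (random values, made-up ids) for `ent_embedding`, `contextual_entity_embs`, and the label file. Two mentions resolve to the same entity id, so they share a static row but have separate contextual rows.

```python
import numpy as np

# Toy stand-ins: 6 static rows (one per entity), 3 contextual rows (one per mention).
rng = np.random.default_rng(0)
static_embs = rng.normal(size=(6, 4))
contextual_embs = rng.normal(size=(3, 4))

# Hypothetical label records: both sentences contain a mention of entity id 5.
records = [
    {"aliases": ["france"], "entity_ids": [5], "ctx_emb_ids": [0]},
    {"aliases": ["paris", "france"], "entity_ids": [2, 5], "ctx_emb_ids": [1, 2]},
]

# Look up the "france" mention in each record.
static_a = static_embs[records[0]["entity_ids"][0]]
static_b = static_embs[records[1]["entity_ids"][1]]
ctx_a = contextual_embs[records[0]["ctx_emb_ids"][0]]
ctx_b = contextual_embs[records[1]["ctx_emb_ids"][1]]

print("static rows identical:", np.array_equal(static_a, static_b))
print("contextual rows identical:", np.array_equal(ctx_a, ctx_b))
```

The same integer indexing applies to the real `ent_embedding` tensor, using the values under `entity_ids` as row indices.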