# Creating "Contextualized" Entity Embeddings

We use the same requirements as the `entity_embedding_tutorial.ipynb` notebook.

The unique entity embeddings used in Bootleg's neural model, before contextualization, are not very useful in isolation for downstream tasks. They were not trained to have a meaningful dot product, for example. Instead, we create "contextualized" entity embeddings by feeding each entity through the Bootleg model.

Bootleg requires two things to generate a "contextualized" entity embedding:
1. A sentence
2. A mention and list of entity candidates

We want to generate a (1) and (2) that represent, as closely as possible, a single entity. Our solution is:
1. The sentence is the entity title and nothing else.
2. The mention is the entity title, and there is a single candidate: the entity we want to generate an embedding for.

We do this below to create embeddings.
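
As a rough sketch of the input format described above, the snippet below builds one such example for a made-up entity without loading Bootleg. The `normalize` helper is a hypothetical stand-in for Bootleg's `get_lnrm` (it only lowercases and strips punctuation), and the QID `Q0` and the title are illustrative values, not entries from the entity database.

```python
import string


def normalize(title: str) -> str:
    """Hypothetical stand-in for Bootleg's get_lnrm: lowercase and drop punctuation."""
    no_punct = title.translate(str.maketrans("", "", string.punctuation))
    # Collapse any repeated whitespace left behind by removed punctuation
    return " ".join(no_punct.lower().split())


def build_example(qid: str, title: str) -> dict:
    """Build one single-mention, single-candidate example for an entity."""
    return {
        "sentence": title,                   # (1) the sentence is just the title
        "aliases": [normalize(title)],       # (2) the mention is the normalized title
        "spans": [[0, len(title.split())]],  # word-level span covering the whole sentence
        "cands": [[qid]],                    # the only candidate is the entity itself
    }


# Illustrative values only (assumed, not from the real entity database)
example = build_example("Q0", "Example Corp., Inc.")
print(example)
```

Because the mention spans the entire sentence and has exactly one candidate, the model has nothing to disambiguate; it simply encodes that entity in the context of its own title.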

In [1]:
%load_ext autoreload
%autoreload 2

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

In [2]:
import sys, torch
from rich import print
from rich.progress import track
import numpy as np
from pathlib import Path

## Loading

In [3]:
from bootleg.end2end.bootleg_annotator import BootlegAnnotator
bootleg_cache = "/dfs/scratch0/lorr1/projects/bootleg/tutorial_data" # WHERE DATA IS DOWNLOADED
device = 0 # SET TO 0 FOR GPU
ann = BootlegAnnotator(cache_dir=bootleg_cache, device=device, return_embs=True, verbose=True)

[2021-09-06 22:33:39,371][INFO] emmental.meta:122 - Setting logging directory to: /dfs/scratch0/lorr1/projects/bootleg/tutorial_data/data/log_dir
[2021-09-06 22:33:39,428][INFO] emmental.meta:64 - Loading Emmental default config from /dfs/scratch0/lorr1/projects/emmental/src/emmental/emmental-default-config.yaml.
[2021-09-06 22:33:39,429][INFO] emmental.meta:174 - Updating Emmental config from user provided config.
[2021-09-06 22:33:39,431][INFO] emmental.utils.seed:27 - Set random seed to 1234.
[2021-09-06 22:33:39,436][DEBUG] bootleg.end2end.bootleg_annotator:225 - Reading entity database
[2021-09-06 22:35:38,197][DEBUG] bootleg.end2end.bootleg_annotator:238 - Reading entity database
[2021-09-06 22:38:10,187][DEBUG] bootleg.end2end.bootleg_annotator:248 - Reading word tokenizers
[2021-09-06 22:38:10,196][DEBUG] urllib3.connectionpool:971 - Starting new HTTPS connection (1): huggingface.co:443
[2021-09-06 22:38:10,491][DEBUG] urllib3.connectionpool:452 - https://huggingface.co:443 "HE

Using Standard Cands CrossEntropy Loss


[2021-09-06 22:38:12,647][DEBUG] urllib3.connectionpool:452 - https://huggingface.co:443 "HEAD /bert-base-uncased/resolve/main/config.json HTTP/1.1" 200 0
[2021-09-06 22:38:12,657][DEBUG] urllib3.connectionpool:971 - Starting new HTTPS connection (1): huggingface.co:443
[2021-09-06 22:38:12,954][DEBUG] urllib3.connectionpool:452 - https://huggingface.co:443 "HEAD /bert-base-uncased/resolve/main/pytorch_model.bin HTTP/1.1" 302 0
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceCla

In [4]:
from bootleg.utils.utils import get_lnrm

def get_mention(ep, qid):
    """Returns the lowercased title with punctuation stripped"""
    return get_lnrm(ep.get_title(qid), strip=True, lower=True)

In [5]:
from bootleg.symbols.entity_profile import EntityProfile

ep = EntityProfile.load_from_cache(Path(bootleg_cache) / "data/entity_db")

## Generate embeddings

To show how the process works, we'll take a small subset of entities, create embeddings for them, and compare their cosine similarities. The subset is every entity whose title contains "vmware" or "apple" (case-insensitive).

In [6]:
# Collect entities
entities_to_emb = [q for q in track(ep.get_all_qids()) if ("vmware" in ep.get_title(q).lower()) or ("apple" in ep.get_title(q).lower())]

print([ep.get_title(q) for q in entities_to_emb][:20])


In [7]:
# Create the input of (1) and (2) to feed into our annotator
extracted_exs = [
    {
        "sentence": ep.get_title(q),
        "aliases": [get_mention(ep, q)],
        "spans": [[0, len(ep.get_title(q).split())]],
        "cands": [[q]],
    }
    for q in entities_to_emb
]
# We pass the special `extracted_examples` input to label_mentions so that the annotator uses the sentence and candidates we provide
out_dict = ann.label_mentions(extracted_examples=extracted_exs)

Prepping data: 100%|██████████| 1740/1740 [00:13<00:00, 125.85it/s]
Evaluating model: 100%|██████████| 109/109 [04:40<00:00,  2.58s/it]


In [8]:
# Get the ids of the VMWare entities only
vmware_ids = [i for i, t in enumerate([ep.get_title(q) for q in entities_to_emb]) if "vmware" in t.lower()]
print(len(vmware_ids), len(entities_to_emb))

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

mat = np.vstack(out_dict["embs"])
print(mat.shape)
res = cosine_similarity(mat)
sorted_res = np.argsort(res, axis=-1)
print(sorted_res.shape)
# For each VMware entity, get the 5 most similar entities from the set above
# (the top match is normally the entity itself, since its self-similarity is 1)
for i in vmware_ids:
    for j in sorted_res[i][::-1][:5]:
        print(ep.get_title(entities_to_emb[i]), "~", ep.get_title(entities_to_emb[j]))

We see that, overall, the nearest neighbors are similar entities. No people with the name Apple, for example, are returned as being similar.
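
In a cosine-similarity ranking, each entity's nearest neighbor is itself, because its similarity with itself is 1. Below is a minimal sketch of a top-k lookup that skips the query row. It uses NumPy only, with a toy matrix standing in for `out_dict["embs"]`. The function name `top_k_neighbors` and the toy vectors are assumptions for illustration and are not part of Bootleg.

```python
import numpy as np


def top_k_neighbors(mat: np.ndarray, k: int) -> np.ndarray:
    """Return, for each row, the indices of the k most cosine-similar other rows."""
    # Normalize rows to unit length (guarding against zero vectors)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    unit = mat / np.clip(norms, 1e-12, None)
    # Cosine similarity is the dot product of unit vectors
    sims = unit @ unit.T
    # Exclude each row from its own neighbor list
    np.fill_diagonal(sims, -np.inf)
    # Sort by descending similarity and keep the first k columns
    return np.argsort(-sims, axis=1)[:, :k]


# Toy embeddings: rows 0 and 1 point roughly one way, rows 2 and 3 another
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
neighbors = top_k_neighbors(toy, k=2)
print(neighbors)
```

With the real embeddings, one would pass `np.vstack(out_dict["embs"])` in place of `toy` and map the returned indices back to titles through `entities_to_emb`.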