In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import ujson, os
from tqdm.auto import tqdm
from bootleg.symbols.entity_profile import EntityProfile

# Entity Profile Tutorial

In this tutorial, we will show you how to modify and interact with our entity metadata.

### Requirements

You will need to download the following files for this notebook:
- Pretrained Bootleg uncased model and config [here](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/models/lateset/bootleg_uncased.tar.gz).
- Entity data [here](https://bootleg-ned-data.s3-us-west-1.amazonaws.com/data/lateset/entity_db.tar.gz)

For convenience, you can run the commands below (from the root directory of the repo) to download all the above files and unpack them to `models` and `data` directories. It will take several minutes to download all the files.

```
    bash tutorials/download_model.sh uncased
    bash tutorials/download_data.sh
```


### Load up the entity profile
Inside the cache directory are three subfolders:
* `entity_mappings`: where aliases and entity information are stored. We also keep the original unfiltered alias-to-candidate mapping we used for training on Wikipedia. For all other uses, we use the alias-to-candidate map called `alias2qids.json`, which has higher quality aliases.
* `type_mappings`: where type information is stored. There is one subfolder per type system. In the `wiki` subfolder, we have a mapping from Wikidata title to Wikidata QID for the types. The `relations` subfolder is where we keep our relationship types, which we treat as types in our model.
* `kg_mappings`: where knowledge graph (KG) information is stored.

When we load an entity profile, we can put it in `edit_mode` to allow us to make changes. Don't forget to set that flag below if you want to edit.

See our Read the Docs page [here](https://bootleg.readthedocs.io/en/latest/gettingstarted/entity_profile.html) for more information on our entity profiles.

In [3]:
# MODIFY THE PATH TO THE DOWNLOADED ENTITY_DB DATA.
entity_profile_cache = Path("../data/entity_db")
# Print out directory structure
for fold in entity_profile_cache.iterdir():
    # Skip showing our prep directory as that's used when loading a model
    if fold.name in ["prep"]:
        continue
    print(fold.name)
    for sub_file in fold.iterdir():
        print("   ", sub_file.name)
        if sub_file.is_dir():
            for subsub_file in sub_file.iterdir():
                print("       ", subsub_file.name)

kg_mappings
    qid2relations.json
    kg_adj.txt
    relation_vocab.json
    config.json
type_mappings
    hyena_coarse
        qid2typenames.json
        type_vocab.json
        qid2typeids.json
        config.json
    hyena
        qid2typenames.json
        qid2typeids.json
        config.json
        type_vocab.json
    wiki
        qid2typeids.json
        type_vocab_to_wikidataqid.json
        type_vocab.json
        qid2typenames.json
        config.json
    relations
        config.json
        qid2typeids.json
        qid2typenames.json
        type_vocab.json
entity_mappings
    qid2title.json
    config.json
    alias2id.json
    alias2qids.json
    alias2id_unfiltered.json
    qid2cnt.json
    qid2eid.json
    alias2qids_unfiltered.json
    qid2desc.json


We call `load_from_cache` to load a profile. If you only want to examine or edit type information, or only KG information, we provide flags to skip loading the rest. In particular, the `no_kg` flag turns off KG information, the `no_type` flag turns off type information, and `type_systems_to_load` specifies which type system subfolders to load (`None` means load all).

**Note** that if you do not load a subset of the metadata, you cannot add, remove, or otherwise examine that data. If you set `no_kg = True`, for example, you can't add a new KG connection. This also means that if you call `save`, that metadata will not be saved.

In [4]:
import time

st = time.time()
# Load up ALL profile data - don't forget to set edit_mode = True
# As edit_mode triggers the profile to build some index structures for fast editing,
# the loading takes a few minutes for all of wiki
ep = EntityProfile.load_from_cache(entity_profile_cache, edit_mode=True, verbose=True)
print(f"Loaded full ep in {time.time() - st}")
st = time.time()

# # Load up NO KG information
# ep = EntityProfile.load_from_cache(
#     entity_profile_cache, edit_mode=True, verbose=True, no_kg=True
# )
# print(f"Loaded full ep without KG in {time.time() - st}")
# st = time.time()

# # Load up NO TYPE information
# ep = EntityProfile.load_from_cache(
#     entity_profile_cache, edit_mode=True, verbose=True, no_type=True
# )
# print(f"Loaded full ep without type in {time.time() - st}")
# st = time.time()

# # Load up only wiki type information
# ep = EntityProfile.load_from_cache(
#     entity_profile_cache,
#     edit_mode=True,
#     verbose=True,
#     no_kg=True,
#     type_systems_to_load=["wiki"],
# )
# print(f"Loaded full ep without KG and only wikidata type in {time.time() - st}")

Loading Entity Symbols


Building edit mode objs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15202497/15202497 [01:03<00:00, 237598.02it/s]


Loading Type Symbols from ../data/entity_db/type_mappings/hyena_coarse


Building edit mode objs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5832699/5832699 [00:03<00:00, 1514773.95it/s]


Loading Type Symbols from ../data/entity_db/type_mappings/hyena


Building edit mode objs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5832699/5832699 [00:20<00:00, 283105.07it/s]


Loading Type Symbols from ../data/entity_db/type_mappings/wiki


Building edit mode objs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5832699/5832699 [00:09<00:00, 590063.88it/s]


Loading Type Symbols from ../data/entity_db/type_mappings/relations


Building edit mode objs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5832699/5832699 [00:21<00:00, 266170.52it/s]


Loading KG Symbols


Checking relations and building edit mode objs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5832699/5832699 [02:19<00:00, 41914.72it/s]

Loaded full ep in 682.7707569599152





### Let's see what operations you can call

In [5]:
object_methods = [
    method_name for method_name in dir(ep) if callable(getattr(ep, method_name))
]

print(object_methods)

['__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_read_profile_file', 'add_entity', 'add_mention', 'add_relation', 'add_type', 'get_all_connections', 'get_all_mentions', 'get_all_qids', 'get_all_types', 'get_all_typesystems', 'get_connections_by_relation', 'get_desc', 'get_eid', 'get_entities_of_type', 'get_mentions', 'get_mentions_with_scores', 'get_qid_cands', 'get_qid_count_cands', 'get_title', 'get_types', 'is_connected', 'load_from_cache', 'load_from_jsonl', 'mention_exists', 'prune_to_entities', 'qid_exists', 'reidentify_entity', 'remove_mention', 'remove_relation', 'remove_type', 'save', 'save_to_jsonl', 'update_entity']


In [6]:
# Get the title of an entity
print("Title:", ep.get_title("Q62446736"))

# Get the description of an entity
print("Description:", ep.get_desc("Q62446736"))

# Get mentions for an entity
print("Mentions:", ep.get_mentions("Q62446736"))

# Get type systems
print("Type Systems:", ep.get_all_typesystems())

# Get some types
print("Sample Wikidata Types:", ep.get_all_types("wiki")[:5])

Title: Apple TV+
Description: Apple TV+ is an ad-free subscription video on demand web television service of Apple Inc that debuted on November 1 , 2019 .
Mentions: {'apple', 'appletv', 'apple tv', 'apple worldwide video', 'apple tv plus'}
Type Systems: ['hyena_coarse', 'hyena', 'wiki', 'relations']
Sample Wikidata Types: ['town in China', 'tehsil of India', 'subdistrict of China', 'faculty', 'pier']


### Modify the types

Suppose you think the QID Q62446736 should really be a computer type instead of a tv type. First we need to see what types the QID is and find a possible replacement type. Then we need to actually remove and add the types.

In [7]:
# First get existing types
qid = "Q62446736"
type_system = "wiki"
print("Existing Types:", ep.get_types(qid, type_system))

# Get all possible types with the word computer in it
all_types = ep.get_all_types(type_system)

comp_types = [t for t in all_types if "computer" in t.lower()]
print(len(comp_types))
print(comp_types)

Existing Types: ['video streaming service']
73
['computer program', 'minicomputer', 'computer network', 'computer model', 'tablet computer', 'computer network protocol', 'computer model series', 'supercomputer', 'computer scientist', '3D computer graphics software', 'computer', 'vector supercomputer', 'computer system', 'computer form factor', 'personal computer', 'computer-aided engineering', 'computer memory', 'home computer', 'computer language', 'computer science term', 'computer monitor', 'microcomputer', 'first generation computer', 'decimal computer', 'computer key', 'computer programming', 'computer surveillance', 'portable computer', 'computer science', 'computer file', 'one-of-a-kind computer', 'computer architecture', 'computer file management', 'computer-aided design software', 'computer security software', 'computer hardware', 'single-board computer', 'computer-animated film', 'computer data storage', 'desktop computer', 'computer worm', 'computer magazine', 'computer alge

In [8]:
# Remove type
ep.remove_type(qid, "video streaming service", type_system)
# Add type
ep.add_type(qid, "computer", type_system)

print("Modified Types:", ep.get_types(qid, type_system))

Modified Types: ['computer']


### Modify the relations

Suppose you think Q62446736 should no longer have the relation P31 with Q59152282. Don't worry if you misspecify the relation pair: if the pair doesn't exist, we do nothing.
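
The "do nothing if the pair doesn't exist" behavior can be pictured with a plain dictionary that maps each relation to a list of object QIDs, the same shape as the connections printed below. This is only a minimal sketch and not Bootleg's implementation: `remove_pair` is a hypothetical helper, and the dictionary is a hand-written toy subset.

```python
def remove_pair(connections, relation, obj_qid):
    """Hypothetical helper: drop (relation, obj_qid) if present, else do nothing."""
    objects = connections.get(relation, [])
    if obj_qid in objects:
        objects.remove(obj_qid)
        # Drop the relation key once it has no objects left
        if not objects:
            del connections[relation]
    return connections


# Toy relation -> object QIDs mapping (assumed shape, not loaded from Bootleg)
connections = {"P31": ["Q59152282"], "P137": ["Q312"]}
remove_pair(connections, "P31", "Q59152282")  # pair exists, so it is removed
remove_pair(connections, "P999", "Q1")        # pair is missing, so nothing happens
print(connections)
```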

In [9]:
qid = "Q62446736"
# `get_all_connections` returns a relation -> list of connected QIDs mapping
print("Existing Connections:", ep.get_all_connections(qid))

# Remove relation
ep.remove_relation(qid, "P31", "Q59152282")

print("Modified Connections:", ep.get_all_connections(qid))

Existing Connections: {'P31': ['Q59152282'], 'P137': ['Q312'], 'P127': ['Q312'], 'P17': ['Q30'], 'P407': ['Q1860'], 'P452': ['Q723685'], 'P749': ['Q312'], 'P1454': ['Q891723'], 'P910': ['Q49225405'], 'P1889': ['Q270285']}
Modified Connections: {'P137': ['Q312'], 'P127': ['Q312'], 'P17': ['Q30'], 'P407': ['Q1860'], 'P452': ['Q723685'], 'P749': ['Q312'], 'P1454': ['Q891723'], 'P910': ['Q49225405'], 'P1889': ['Q270285']}


### Add a new entity

To add a new entity, we need to provide the following json object to our entity profile
```
{
    "entity_id": "C000",
    "mentions": [["dog", 10.0], ["dogg", 7.0], ["animal", 4.0]],
    "title": "Dog",
    "description": "An animal that barks",
    "types": {"hyena": ["animal"], "wiki": ["dog"]},
    "relations": [
        {"relation": "sibling", "object": "Q345"},
        {"relation": "sibling", "object": "Q567"},
    ],
}
```

The numeric values for the mentions represent the score of that mention. These can all be the same value. They are just used for sorting the mentions for an entity.

If you do not have mentions, a good default is the title of the entity with a score of 1.
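
As a rough illustration of that default, the sketch below assembles a record in the shape shown above and falls back to the title as the only mention. `make_entity_record` is a hypothetical helper written for this tutorial, not part of Bootleg, and it only adds the optional `types` and `relations` keys when they are supplied.

```python
def make_entity_record(entity_id, title, description, mentions=None, types=None, relations=None):
    """Hypothetical helper that assembles the JSON object described above."""
    if not mentions:
        # Default suggested in the text: the title with a score of 1
        mentions = [[title, 1.0]]
    record = {
        "entity_id": entity_id,
        "mentions": mentions,
        "title": title,
        "description": description,
    }
    # Only include optional fields that were actually provided
    if types:
        record["types"] = types
    if relations:
        record["relations"] = relations
    return record


record = make_entity_record("C000", "Dog", "An animal that barks")
print(record["mentions"])

# Scores only matter for ordering, so sorting by score shows the ranking they imply
scored = [["dog", 10.0], ["animal", 4.0], ["dogg", 7.0]]
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
print([alias for alias, _ in ranked])
```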

**NOTE** When adding mentions to the entity profile, we lowercase them and strip certain punctuation, just as we do for mention extraction. See ``bootleg.utils.utils.get_lnrm`` for more info (we set strip and lower to be True).
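
To get a feel for what this normalization does to a mention, here is a rough approximation using only the standard library. It is an assumption-laden stand-in, not the real function: `approx_normalize` is a hypothetical name, and the exact set of characters that `get_lnrm` strips may differ from Python's `string.punctuation`.

```python
import string


def approx_normalize(mention):
    """Rough stand-in for mention normalization: lowercase and strip ASCII punctuation."""
    lowered = mention.lower()
    # Delete every ASCII punctuation character (assumed set; the real rules may differ)
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace left behind by the removal
    return " ".join(no_punct.split())


print(approx_normalize("Apple TV+"))
print(approx_normalize("  Sparkle Device! "))
```

When in doubt about how a specific mention will be stored, check the result with `ep.get_mentions` after adding the entity rather than relying on this approximation.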

In [10]:
title = "Some New Entity"
# The numeric value is the score associated with the mention
mentions = [["computer", 10.0], ["sparkle device", 12.0]]
wiki_types = ["computer"]
d = {
    "entity_id": "NQ1",
    "mentions": mentions,
    "title": title,
    "description": "A computer that performs",
    "types": {"wiki": wiki_types},
}
if not ep.qid_exists("NQ1"):
    ep.add_entity(d)

### Remove unused entities

Lastly, for space reasons, it'd be nice to remove the QIDs that are no longer needed in this dump. For that, we can call `prune_to_entities`. This operation removes all entities that are not in the given set of entities to keep. It will throw an error, however, if you ask it to keep an entity that doesn't exist in the profile, which is why we check the set first.
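
The pruning semantics can be sketched on a toy dictionary. This is a minimal model under the assumption that the error is raised for ids in the keep-set that are missing from the profile (the situation the existence check in the next cell guards against); `prune_to` and `toy_profile` are hypothetical and do not touch Bootleg.

```python
def prune_to(profile, keep):
    """Hypothetical stand-in: keep only the ids in `keep`, failing on unknown ids."""
    unknown = set(keep) - set(profile)
    if unknown:
        raise ValueError(f"Cannot keep entities missing from the profile: {sorted(unknown)}")
    return {qid: data for qid, data in profile.items() if qid in keep}


toy_profile = {"Q1": "Universe", "Q2": "Earth", "Q3": "Life"}
pruned = prune_to(toy_profile, {"Q1", "Q3"})
print(sorted(pruned))

try:
    prune_to(toy_profile, {"Q1", "Q999"})
except ValueError as err:
    print("Error:", err)
```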

In [11]:
# Get entities to keep based on those that have the types in `types_to_add`
type_system = "wiki"
types_to_add = [
    "computer",
    "fruit",
    "meat",
    "country",
    "national association football team",
]
entities_of_type = set()
for ty in types_to_add:
    entities_of_type.update(set(ep.get_entities_of_type(ty, type_system)))

# Make sure they are all in the dump
for qid in tqdm(entities_of_type):
    if not ep.qid_exists(qid):
        print(f"{qid} does not exist")
        break

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1523/1523 [00:00<00:00, 318044.56it/s]


In [12]:
print(f"Starting number of entities: {len(ep.get_all_qids())}")
ep.prune_to_entities(entities_of_type)
print(f"Ending number of entities: {len(ep.get_all_qids())}")

Starting number of entities: 5832700
Pruning entity data
Pruning hyena_coarse data
Pruning hyena data
Pruning wiki data
Pruning relations data
Pruning kg data
Ending number of entities: 1523


In [13]:
# Save the new profile
ep.save(entity_profile_cache.parent / "new_profile_wiki")

# Adjust Model and Run

A benefit of Bootleg is that you can easily add and remove entities and do not need to change the model in any way to accommodate them. If you are using a modified entity profile, all you need to do is adjust some paths in the model config to point to this new location. That's it!

In [14]:
# Load up model path. This path should be the `models` subfolder.
model_dir = Path("../models")

# Base model config to modify
old_config_path = str(model_dir / "bootleg_uncased/bootleg_config.yaml")
# Provide save path for the new bootleg config yaml file. This can be anywhere.
new_config_save_path = "np_bootleg_config.yaml"
# Base model pth path to modify
model_path = str(model_dir / "bootleg_uncased/bootleg_wiki.pth")
# Path where you saved the adjusted entity profile above
new_entity_path = str(entity_profile_cache.parent / "new_profile_wiki")
# Path where you want logs to be saved
new_log_path = "bootleg-logs/new-bootleg"

In [15]:
import yaml


def modify_config(
    old_config_path, new_config_path, model_save_path, new_entity_path, new_log_path
):
    """Modifies the old config with the new profile and model for running.

    Args:
        old_config_path: old config path
        new_config_path: new config path
        model_save_path: model save path
        new_entity_path: new entity path
        new_log_path: new log path

    Returns:
        None. The modified config is written to new_config_path.
    """
    with open(old_config_path) as file:
        old_config = yaml.load(file, Loader=yaml.FullLoader)

    if "emmental" not in old_config:
        old_config["emmental"] = {}
    old_config["emmental"]["model_path"] = model_save_path
    old_config["emmental"]["log_path"] = new_log_path
    old_config["data_config"]["entity_dir"] = new_entity_path

    with open(new_config_path, "w") as file:
        yaml.dump(old_config, file)
    print(f"Dumped config to {new_config_path}")


modify_config(
    old_config_path, new_config_save_path, model_path, new_entity_path, new_log_path
)

Dumped config to np_bootleg_config.yaml


### Run model
Before running the annotator, we need to load and sanity check the config. We pass this into the annotator.

In [16]:
# Load and sanity check config

# !!! Set this to what config you want to use
config_to_load = new_config_save_path

# Load config
with open(config_to_load) as file:
    config = yaml.load(file, Loader=yaml.FullLoader)

print(ujson.dumps(config, indent=4))

{
    "data_config": {
        "context_mask_perc": 0.0,
        "data_dir": "\/home\/data\/bootleg-data\/wiki_title_0122",
        "data_prep_dir": "prep",
        "dev_dataset": {
            "file": "merged_sample.jsonl",
            "use_weak_label": true
        },
        "entity_dir": "..\/data\/new_profile_wiki",
        "entity_kg_data": {
            "kg_labels": "kg_mappings\/qid2relations.json",
            "kg_vocab": "kg_mappings\/relation_vocab.json",
            "use_entity_kg": true
        },
        "entity_type_data": {
            "type_labels": "type_mappings\/wiki\/qid2typeids.json",
            "type_vocab": "type_mappings\/wiki\/type_vocab.json",
            "use_entity_types": true
        },
        "eval_slices": [
            "unif_all",
            "unif_NS_all",
            "unif_HD",
            "unif_TO",
            "unif_TL",
            "unif_TS"
        ],
        "max_ent_len": 128,
        "max_seq_len": 128,
        "max_seq_window_len": 64,
    

In [17]:
# Load a new annotator with our config - notice that it has to re-prep some data
from bootleg.end2end.bootleg_annotator import BootlegAnnotator

# You can also pass `return_embs=True` to get the embeddings
ann = BootlegAnnotator(config=config, device=-1, return_embs=False)

[2021-10-15 20:00:17,218][INFO] emmental.meta:122 - Setting logging directory to: bootleg-logs/new-bootleg
[2021-10-15 20:00:17,261][INFO] emmental.meta:64 - Loading Emmental default config from /lfs/raiders3/0/senwu/.pyenv/versions/3.8.6/envs/venv38/lib/python3.8/site-packages/emmental/emmental-default-config.yaml.
[2021-10-15 20:00:17,262][INFO] emmental.meta:174 - Updating Emmental config from user provided config.
[2021-10-15 20:00:17,263][INFO] emmental.utils.seed:27 - Set random seed to 1234.
[2021-10-15 20:00:20,062][INFO] emmental.model:72 - Created emmental model Bootleg that contains task set().
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bi

In [18]:
# These are some of the aliases our model can now extract from sentences...
# they all come from entities of the types we kept when pruning
print(list(ann.all_aliases_trie.keys())[:10])

['sa', 'saint vincent', 'saint vincent and the grenadines', 'saint vincent and the grenadines national u20 football team', 'saint vincent and the grenadines national under20 football team', 'saint vincent and the grenadines national football team', 'saint vincent and the grenadines national team', 'saint vincent and the grenadines u20', 'saint vincent and grenadines', 'saint vincent amp the grenadines']


In [19]:
# Extract some mentions...
# notice that there is less ambiguity as well because we removed a lot of QIDs from our dump
ann.label_mentions("How did the sparkle device perform")

{'qids': [['NQ1']],
 'probs': [[1.0]],
 'titles': [['Some New Entity']],
 'cands': [[['NQ1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1',
    '-1']]],
 'cand_probs': [[array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)]],
 'spans': [[[3, 5]]],
 'char_spans': [[[12, 26]]],
 'aliases': [['sparkle device']]}
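
The result is a dictionary of parallel nested lists. Judging from the output above, the outer list appears to hold one entry per input sentence and the inner list one entry per extracted mention; treat that reading as an assumption. The sketch below flattens such a result into one row per mention, using a hand-copied version of the output with the padded candidate fields left out.

```python
# Result copied from the output above, without the padded candidate fields
result = {
    "qids": [["NQ1"]],
    "probs": [[1.0]],
    "titles": [["Some New Entity"]],
    "char_spans": [[[12, 26]]],
    "aliases": [["sparkle device"]],
}
text = "How did the sparkle device perform"

rows = []
# Outer lists are assumed to be per sentence, inner lists per mention
for qids, probs, titles, spans, aliases in zip(
    result["qids"],
    result["probs"],
    result["titles"],
    result["char_spans"],
    result["aliases"],
):
    for qid, prob, title, (start, end), alias in zip(qids, probs, titles, spans, aliases):
        rows.append(
            {
                "alias": alias,
                "qid": qid,
                "title": title,
                "prob": prob,
                # Character spans index into the original sentence
                "text": text[start:end],
            }
        )

for row in rows:
    print(row)
```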

### Faster inference
If you want more efficient inference from the annotator, you can pass in a static entity
embedding matrix so the model does not have to run a forward pass of the entity encoder.

See our `entity_embedding_tutorial.ipynb` for how to call `extract_all_entities`. The output of this
can be passed into our annotator as follows:

In [None]:
entity_emb_file = "<path to file>"
ann = BootlegAnnotator(config=config, device=-1, return_embs=False, entity_emb_file=entity_emb_file)


