In [5]:
import faiss
import torch

from omegaconf import OmegaConf
from pytorch_lightning import Trainer
from tqdm import tqdm

from nemo.collections import nlp as nemo_nlp
from nemo.collections.nlp.models import EntityLinkingModel
from nemo.utils.exp_manager import exp_manager

ImportError: cannot import name 'EntityLinkingModel' from 'nemo.collections.nlp.models' (/home/vadams/Projects/entity-linking-research/NeMo/nemo/collections/nlp/models/__init__.py)

## Entity Linking

#### Task Description
[Entity linking](https://en.wikipedia.org/wiki/Entity_linking) is the process of matching concepts mentioned in natural language to their unique IDs and canonical forms stored in a knowledge base. Entity linking applications range from helping automate ingestion of large amounts of data to assisting in real-time concept normalization during a conversation. 

Though there are myriad approaches to the entity linking task, within NeMo and this tutorial we use the methodology described in the [Self-alignment Pre-training for Biomedical Entity Representations](https://arxiv.org/abs/2010.11784) paper. The main intuition behind the approach is to reshape an initial BERT embedding space so that different descriptions of the same concept are closer together in that space and unrelated concepts are further apart. We can then use the concept embeddings from this reshaped space to build an index of embeddings from a knowledge base. Finally, we can link query concepts to their canonical forms in the knowledge base by performing a nearest neighbor search, matching concept query embeddings to the most similar concept embeddings in the knowledge base index. In this tutorial we will be using the [faiss](https://github.com/facebookresearch/faiss) library to build our concept index. 

#### Self Alignment Pretraining
Self-alignment pretraining is a second stage of pretraining for an existing encoder (called second stage because the encoder model can also be further finetuned after this more general pretraining step). The dataset used during training consists of pairs of concept synonyms that map to the same ID in a knowledge base. At each training iteration, we select only the *hard* examples present in the mini batch to calculate the loss and update the model weights. In this context, a hard example is one where a concept is closer to an unrelated concept in the mini batch than it is to the synonym concept it is paired with, by some margin. I encourage you to take a look at [section 2 of the paper](https://arxiv.org/pdf/2010.11784.pdf) for a more formal and in-depth description of how hard examples are selected. 
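To make the selection rule more concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the similarity measure and an illustrative margin of 0.2; the function name `select_hard_triplets` and the toy vectors are hypothetical. It is one reading of the description above, not NeMo's or the paper's exact implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_hard_triplets(embeddings, concept_ids, margin=0.2):
    """Return (anchor, positive, negative) index triplets that count as 'hard'.

    A triplet is kept when the unrelated concept (negative) is not at least
    `margin` less similar to the anchor than the paired synonym (positive) is.
    The margin value is an assumption chosen for illustration.
    """
    hard = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or concept_ids[p] != concept_ids[a]:
                continue
            sim_ap = cosine(embeddings[a], embeddings[p])
            for neg in range(n):
                if concept_ids[neg] == concept_ids[a]:
                    continue
                if cosine(embeddings[a], embeddings[neg]) + margin > sim_ap:
                    hard.append((a, p, neg))
    return hard

# Toy mini batch: two synonyms of concept "C1" and two of concept "C2".
batch_embeddings = [
    [1.0, 0.0],   # C1, first synonym
    [0.0, 1.0],   # C1, second synonym (far from the first)
    [0.9, 0.1],   # C2, first synonym (too close to C1's first synonym)
    [-1.0, 0.0],  # C2, second synonym
]
batch_ids = ["C1", "C1", "C2", "C2"]
hard_triplets = select_hard_triplets(batch_embeddings, batch_ids)

# A batch whose synonyms already sit together yields no hard triplets.
easy_embeddings = [[1.0, 0.0], [1.0, 0.01], [-1.0, 0.0], [-1.0, 0.01]]
easy_triplets = select_hard_triplets(easy_embeddings, batch_ids)
```

Only the triplets returned here would contribute to the loss; a batch that is already well separated contributes nothing.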

We then use a [metric learning loss](https://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Multi-Similarity_Loss_With_General_Pair_Weighting_for_Deep_Metric_Learning_CVPR_2019_paper.pdf) calculated from the selected hard examples. This loss takes concept representations that were incorrectly positioned in our initial embedding space and pulls embedding pairs that should be more similar together, while pushing pairs that represent distinct ideas apart. Through this training process we reshape the concept embedding space to be better suited to our entity linking task than it was originally. 
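As a rough illustration of how such a loss behaves, the sketch below implements the multi-similarity loss from the linked paper in plain Python over pre-selected hard pairs. The function name, the input layout (dictionaries keyed by anchor index), and the hyperparameter values `alpha`, `beta`, and `base` are assumptions made for this example and are not taken from NeMo's configuration.

```python
import math

def multi_similarity_loss(sim, pos_pairs, neg_pairs, alpha=2.0, beta=50.0, base=0.5):
    """Multi-similarity loss over pre-selected hard pairs.

    sim:        dict mapping (i, j) -> similarity between examples i and j
    pos_pairs:  dict mapping anchor i -> list of hard positive indices j
    neg_pairs:  dict mapping anchor i -> list of hard negative indices j
    alpha, beta, base: illustrative hyperparameters (assumed values)
    """
    anchors = set(pos_pairs) | set(neg_pairs)
    if not anchors:
        return 0.0
    total = 0.0
    for i in anchors:
        # Positives are penalised when their similarity falls below `base`.
        pos_term = sum(math.exp(-alpha * (sim[(i, j)] - base)) for j in pos_pairs.get(i, []))
        # Negatives are penalised when their similarity rises above `base`.
        neg_term = sum(math.exp(beta * (sim[(i, j)] - base)) for j in neg_pairs.get(i, []))
        total += math.log1p(pos_term) / alpha + math.log1p(neg_term) / beta
    return total / len(anchors)

# Anchor 0 has one synonym (index 1) and one unrelated concept (index 2).
pos_pairs = {0: [1]}
neg_pairs = {0: [2]}

# Badly shaped space: the synonym is far away, the unrelated concept is close.
loss_before = multi_similarity_loss({(0, 1): 0.1, (0, 2): 0.9}, pos_pairs, neg_pairs)

# Better shaped space: the synonym is close, the unrelated concept is far away.
loss_after = multi_similarity_loss({(0, 1): 0.9, (0, 2): 0.1}, pos_pairs, neg_pairs)
```

The loss is lower for the second configuration, which is the direction in which gradient updates move the embeddings.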

Now that we have an idea of what's going on, let's get started!

## Dataset Preprocessing

In this tutorial we will be using a small toy dataset to demonstrate how to use NeMo's entity linking model functionality. The following code downloads this tutorial's dataset. In this dataset the concepts are already paired off and formatted as `ID concept_synonym1 concept_synonym2`.

In [2]:
# Download training data, validation data, and example config
#!wget 

For full medical domain entity linking model training, we recommend using the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/index.html) dataset. The data is a table of medical concepts and their corresponding concept IDs (CUI). After [requesting a free license and making a UMLS Terminology Services (UTS) account](https://www.nlm.nih.gov/research/umls/index.html), the [entire UMLS dataset](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html) can be downloaded from the NIH's website. If you've cloned the NeMo repo, you can run the data processing script located in `examples/nlp/entity_linking/data/umls_dataset_processing.py` on the full dataset. This script takes in the initial table of UMLS concepts and produces a .tsv file where each row is formatted as `CUI\tconcept_synonym1\tconcept_synonym2`. Once the UMLS dataset .RRF file is downloaded, the script can be run from the `examples/nlp/entity_linking` directory; the exact command is listed in the Command Recap at the end of this tutorial. 
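To illustrate the target row format, the following sketch turns a small table of (ID, concept name) rows into tab-separated synonym pairs. It is not the actual logic of `umls_dataset_processing.py`, which may filter or sample pairs differently; the function name `make_synonym_pairs` and the IDs `CUI_A` and `CUI_B` are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

def make_synonym_pairs(rows):
    """Group (concept_id, name) rows by ID and emit one TSV line per synonym pair."""
    names_by_id = defaultdict(list)
    for concept_id, name in rows:
        if name not in names_by_id[concept_id]:  # skip exact duplicate names
            names_by_id[concept_id].append(name)

    lines = []
    for concept_id, names in names_by_id.items():
        # Every unordered pair of synonyms for the same ID becomes one row.
        for first, second in combinations(names, 2):
            lines.append(f"{concept_id}\t{first}\t{second}")
    return lines

# Hypothetical concept table; the IDs are placeholders, not real CUIs.
rows = [
    ("CUI_A", "heart attack"),
    ("CUI_A", "myocardial infarction"),
    ("CUI_A", "MI"),
    ("CUI_B", "headache"),
]
pairs = make_synonym_pairs(rows)
```

A concept with only one name, such as `CUI_B` here, produces no training pairs because it has no synonym to be paired with.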

## Model Training

Run second stage pretraining of a BERT Base encoder on the self-alignment pretraining (SAP) task for better entity linking.

In [3]:
# Load in the config file
cfg = OmegaConf.load("tiny_example_entity_linking_config.yaml")

In [4]:
# Initialize the trainer and model
trainer = Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))
model = nemo_nlp.models.EntityLinkingModel(cfg=cfg.model, trainer=trainer)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Using native 16bit precision.


[NeMo I 2021-04-05 06:41:50 exp_manager:210] Experiments will be logged at SelfAlignmentPretrainingTinyExample/2021-04-05_06-41-50
[NeMo I 2021-04-05 06:41:50 exp_manager:550] TensorboardLogger has been set up


AttributeError: module 'nemo.collections.nlp.models' has no attribute 'EntityLinkingModel'

In [None]:
# Train and save the model
trainer.fit(model)
model.save_to(cfg.model.nemo_path)

You can run the script at `examples/nlp/entity_linking/self_alignment_pretraining.py` to train a model on a larger dataset. Run `python self_alignment_pretraining.py` from the `examples/nlp/entity_linking` directory.

## Model Evaluation

Let's evaluate our model using top 1 and top 5 accuracy on a held-out test set. For this evaluation we compare every test query with every concept vector in our test set knowledge base and rank each item in the knowledge base by its cosine similarity with the test query. We then compare the rankings with our ground truth results to calculate top 1 and top 5 accuracy. 

When evaluating a model trained on a larger dataset, you can use a nearest neighbors index to speed up the evaluation time.
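The brute-force evaluation described above can be sketched in plain Python as follows. The function name `top_k_accuracy`, the two-dimensional toy vectors, and the IDs are hypothetical; a real evaluation would use the model's embeddings and batched tensor operations instead of Python loops.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_accuracy(query_vecs, query_ids, kb_vecs, kb_ids, k):
    """Fraction of queries whose gold ID is among the k most similar knowledge base entries."""
    hits = 0
    for q_vec, gold_id in zip(query_vecs, query_ids):
        # Rank every knowledge base entry by cosine similarity to the query.
        ranked = sorted(range(len(kb_vecs)),
                        key=lambda j: cosine(q_vec, kb_vecs[j]),
                        reverse=True)
        if gold_id in [kb_ids[j] for j in ranked[:k]]:
            hits += 1
    return hits / len(query_vecs)

# Toy knowledge base with three concepts and two test queries.
kb_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
kb_ids = ["A", "B", "C"]
query_vecs = [[0.9, 0.1], [0.6, 0.8]]
query_ids = ["A", "B"]

top1 = top_k_accuracy(query_vecs, query_ids, kb_vecs, kb_ids, k=1)
top2 = top_k_accuracy(query_vecs, query_ids, kb_vecs, kb_ids, k=2)
```

In this toy setup the second query is ranked closest to concept `C`, so it counts as a miss at top 1 but as a hit once the second-ranked entry is also considered.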

## Building an Index

To qualitatively observe the improvement we gain from the second stage pretraining, let's build two indices. One will be built with BERT base embeddings before self-alignment pretraining and one will be built with the model we just trained. Our knowledge base in this tutorial is in the same domain as the training set and shares some overlapping concepts with it. This data file is formatted as `ID\tconcept`.

In [None]:
# Download index dataset
!wget 

In [None]:
# Restore second stage pretrained model
sap_model_cfg = cfg
sap_model = nemo_nlp.models.EntityLinkingModel.restore_from(sap_model_cfg.model.nemo_path)

# Load original model before pretraining
base_model_cfg = OmegaConf.load("conf/entity_linking_example_config.yaml")
base_model_cfg.model.train_ds = None
base_model_cfg.model.validation_ds = None
base_model_cfg.index.index_save_name = "base_model_index"
base_model = nemo_nlp.models.EntityLinkingModel(base_model_cfg.model)

The `EntityLinkingDataset` class can load the data used for training the entity linking encoder as well as for building the index if the `is_index_data` flag is set to true. 

In [None]:
import numpy as np

def build_index(cfg, model):
    # Setup index dataset loader
    index_dataloader = model.setup_dataloader(cfg.index.dataset, is_index_data=True)

    # Get index dataset embeddings
    embeddings = []

    with torch.no_grad():
        for batch in tqdm(index_dataloader):
            input_ids, token_type_ids, attention_mask, _ = batch
            batch_embeddings = model.forward(input_ids, token_type_ids, attention_mask)

            # Accumulate index embeddings (moved to the CPU as NumPy arrays)
            embeddings.extend(batch_embeddings.detach().cpu().numpy())

    # faiss expects a single 2-D float32 array rather than a list of vectors
    embeddings = np.array(embeddings, dtype=np.float32)

    # Train IVFFlat index using faiss
    quantizer = faiss.IndexFlatL2(cfg.index.dims)
    index = faiss.IndexIVFFlat(quantizer, cfg.index.dims, cfg.index.nlist)
    index = faiss.index_cpu_to_all_gpus(index)
    index.train(embeddings)

    # Add concept embeddings to index in batches
    for i in tqdm(range(0, embeddings.shape[0], cfg.index.index_batch_size)):
        index.add(embeddings[i:i+cfg.index.index_batch_size])

    # Save index (a GPU index must be copied back to the CPU before writing)
    faiss.write_index(faiss.index_gpu_to_cpu(index), cfg.index.index_save_name)

In [None]:
build_index(sap_model_cfg, sap_model)
build_index(base_model_cfg, base_model)

## Entity Linking via Nearest Neighbor Search

Now it's time to query our indices!

In [None]:
def query_index(cfg, model, index, queries, id2string):
    query_embs = get_query_embedding(queries, model).numpy()

    # Use query embeddings to find the closest concept embeddings in the knowledge base
    distances, neighbors = index.search(query_embs, cfg.index.top_n)
    neighbor_concepts = [[id2string[concept_id] for concept_id in query_neighbor]
                         for query_neighbor in neighbors]

    for query_idx in range(len(queries)):
        print(f"\nThe most similar concepts to {queries[query_idx]} are:")
        # Print one (ID, concept string, distance) tuple per neighbor
        for result in zip(neighbors[query_idx], neighbor_concepts[query_idx], distances[query_idx]):
            print(result)


def get_query_embedding(queries, model):
    model_input = model.tokenizer(queries,
                                  add_special_tokens=True,
                                  padding=True,
                                  truncation=True,
                                  max_length=512,
                                  return_token_type_ids=True,
                                  return_attention_mask=True)

    # No gradients are needed at query time
    with torch.no_grad():
        query_emb = model.forward(torch.LongTensor(model_input["input_ids"]),
                                  torch.LongTensor(model_input["token_type_ids"]),
                                  torch.LongTensor(model_input["attention_mask"]))

    return query_emb.detach().cpu()

In [None]:
# Load indices
sap_index = faiss.read_index(sap_model_cfg.index.index_save_name)
base_index = faiss.read_index(base_model_cfg.index.index_save_name)

In [None]:
# Map concept IDs to one canonical string
id2string = {}
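The mapping above is left empty in this notebook. One possible way to fill it from lines formatted as `ID\tconcept` is sketched below. Treating the first string listed for an ID as its canonical form is an assumption made for this example, and the function name `build_id2string` and the sample lines are hypothetical. Note also that the key type has to match whatever IDs the index search returns.

```python
def build_id2string(lines):
    """Map each concept ID to the first concept string listed for it."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        concept_id, concept = line.split("\t", 1)
        # setdefault keeps the first string seen for an ID (assumed canonical)
        mapping.setdefault(concept_id, concept)
    return mapping

# Made-up index data in the `ID\tconcept` format.
example_lines = [
    "0\theart attack\n",
    "0\tmyocardial infarction\n",
    "1\theadache\n",
]
example_id2string = build_id2string(example_lines)
```

With a real data file, the same function could be applied to the open file object, since iterating over a file yields its lines.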

In [None]:
# Query both indices
queries = []
query_index(base_model_cfg, base_model, base_index, queries, id2string)
query_index(sap_model_cfg, sap_model, sap_index, queries, id2string)

Try some of your own queries.

In [None]:
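# Link each typed query with the SAP model; interrupt the kernel to stop
while True:
    query = input()
    query_index(sap_model_cfg, sap_model, sap_index, [query], id2string)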

For larger knowledge bases, keeping the default embedding size might use too much memory and cause out-of-memory issues. You can apply PCA or some other dimensionality reduction method to your data to reduce its memory footprint. Code for creating a text file of all the UMLS entities in the format needed to build an index, and for creating a dictionary mapping concept IDs to canonical concept strings, can be found in `examples/nlp/entity_linking/data/umls_dataset_processing.py`. 

The code for extracting knowledge base concept embeddings, training and applying a PCA transformation to the embeddings, building a faiss index, and querying the index from the command line is located at `examples/nlp/entity_linking/build_and_query_index.py`. 

If you've cloned the NeMo repo, both of these steps can be run from the command line in the `examples/nlp/entity_linking/` directory; the commands are listed in the Command Recap below.

Intermediate steps of the index building process are saved, so if an error occurs, previously completed steps do not need to be rerun. 

## Command Recap

Here is a recap of the commands and steps to repeat this process on the full UMLS dataset. 

1) Download the UMLS dataset file `MRCONSO.RRF` from the NIH website and place it in the `examples/nlp/entity_linking/data` directory.

2) Run the following commands from the `examples/nlp/entity_linking` directory
```
python data/umls_dataset_processing.py --cfg conf/medical_entity_linking_config.yaml
python self_alignment_pretraining.py
python data/umls_dataset_processing.py --index --cfg conf/medical_entity_linking_config.yaml
python build_and_query_index.py --restore --cfg conf/medical_entity_linking_config.yaml --top_n 5
```
The model will take roughly 24 hours to train on two GPUs and roughly 48 hours to train on one GPU.