# NED Benchmark Tutorial

In this tutorial, we demonstrate how to use a pretrained Bootleg NED model to run inference on RSS500, a standard sentence-level NED benchmark. 


### Requirements 

To run this tutorial, you'll need to download the following: 

- Data for RSS500 [here](https://bootleg-emb.s3.amazonaws.com/data/rss500.tar.gz)
- Entity profile information [here](https://bootleg-emb.s3.amazonaws.com/entity_db.tar.gz)
- Pretrained Bootleg model and config [here](https://bootleg-emb.s3.amazonaws.com/models/2020_08_25/bootleg_wiki.tar.gz)
- Embedding data [here](https://bootleg-emb.s3.amazonaws.com/emb_data.tar.gz)

For convenience, you can run the command below (from the `tutorials` directory) to download all the above files and unpack them to the provided directory. It will take several minutes to download all the files. 

    bash download_all.sh <NAME_OF_DIRECTORY_TO_SAVE_DATA>
    
You will need to assign the variable `input_dir` in this notebook to the path where you downloaded the data. 
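
Before moving on, it can help to confirm that the download produced the directories this notebook reads from. The sketch below is not part of Bootleg: `missing_data_dirs` is a hypothetical helper, and the subdirectory names are taken from the paths used in the config cells later in this tutorial.

```python
from pathlib import Path

# Subdirectories this tutorial reads from (taken from the config cells).
EXPECTED_SUBDIRS = ("bootleg_wiki", "entity_db", "rss500", "emb_data")


def missing_data_dirs(input_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent from input_dir."""
    root = Path(input_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_data_dirs(input_dir)` returns an empty list when every expected directory is present.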

## 1. Prepare the Config File

First, run the necessary import statements. You will need to have installed Bootleg as a package to run these (see the installation instructions in the README).

In [1]:
import sys
import logging
from importlib import reload
reload(logging)
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

from bootleg import run
from bootleg.utils.parser_utils import get_full_config

If you do not have a GPU with at least 12GB of memory available, set `use_cpu` below to `True` to run inference on the CPU. 

In [2]:
use_cpu = False

Load the model config so we can set additional parameters and load the saved model during evaluation.  

In [3]:
input_dir = "<FULL_PATH_TO_DATA_DIRECTORY>"  # fill in the full path to the directory where the files were downloaded

config_path = f'{input_dir}/bootleg_wiki/bootleg_config.json'
config_args = get_full_config(config_path)

Update the config parameters to point to the downloaded model checkpoint and data. 

In [4]:
# set the model checkpoint path 
config_args.run_config.init_checkpoint = f'{input_dir}/bootleg_wiki/bootleg_model.pt'

# set the path for the entity db and candidate map
config_args.data_config.entity_dir = f'{input_dir}/entity_db'
config_args.data_config.alias_cand_map = 'alias2qids_rss500.json'

# set the data path and RSS500 test file 
config_args.data_config.data_dir = f'{input_dir}/rss500'
config_args.data_config.test_dataset.file = 'test_rss500.jsonl'

# set the embedding paths 
config_args.data_config.emb_dir = f'{input_dir}/emb_data'
config_args.data_config.word_embedding.cache_dir = f'{input_dir}/emb_data'

# set the save directory 
config_args.run_config.save_dir = f'{input_dir}/results'

# set whether to run inference on the CPU
config_args.run_config.cpu = use_cpu

## 2. Run Inference for RSS500

Once the config is set up, run model evaluation! You should see that 421 of 472 mentions (`men`) are correct (`crct`). 

In [6]:
run.model_eval(args=config_args, mode="eval", logger=logger, is_writer=True)

2020-09-09 13:05:23,319 Loading entity_symbols...
2020-09-09 13:05:29,086 Loaded entity_symbols with 5222808 entities.
2020-09-09 13:05:29,089 Loading slices...
2020-09-09 13:05:29,092 Finished loading slices.
2020-09-09 13:05:29,093 Loading dataset...
2020-09-09 13:05:29,095 Finished loading dataset.
2020-09-09 13:05:37,441 Sampled 319 indices from dataset (dev/test) for evaluation.
2020-09-09 13:05:37,635 Loading embeddings...
2020-09-09 13:06:00,551 Finished loading embeddings.
2020-09-09 13:06:00,644 Loading model from data/bootleg_wiki/bootleg_model.pt...
2020-09-09 13:06:02,981 Successfully loaded model from data/bootleg_wiki/bootleg_model.pt starting from checkpoint epoch 2 and step 0.
2020-09-09 13:06:03,042 ************************RUNNING EVAL test_rss500.jsonl************************
2020-09-09 13:06:03,043 Evaluating 10 batches
2020-09-09 13:06:24,245 
+------------+------------+-------+--------+-------------+------------+-------+-----------+----------+
| head       | slice 

The `final_loss` head corresponds to the final prediction head. We only have a single data subset, or slice, for the benchmark, which is the overall slice that includes all mentions (we call this slice `final_loss`). 

The `f1_pop` metric is the score of a simple baseline that predicts the most popular candidate for each mention without using any contextual information. We see that Bootleg improves on this baseline by more than 11 F1 points. 
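
To make the baseline concrete, the sketch below picks the highest-scoring candidate for a mention and ignores the sentence entirely. It assumes each alias maps to a list of `[qid, popularity_score]` pairs. That layout, the helper name `most_popular_candidate`, and the scores shown are illustrative assumptions, not values read from `alias2qids_rss500.json`.

```python
def most_popular_candidate(candidates):
    """Return the QID with the highest popularity score, or None if the list is empty."""
    if not candidates:
        return None
    best_qid, _ = max(candidates, key=lambda pair: pair[1])
    return best_qid


# Hypothetical candidate list for the alias "new york"; the scores are made up.
toy_candidates = [["Q1384", 120.0], ["Q60", 300.0]]
prediction = most_popular_candidate(toy_candidates)
```

With these made-up scores the baseline always chooses `Q60`, whatever the surrounding sentence says, which is the kind of mistake contextual information is meant to correct.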

The F1 score reported here is the micro-averaged F1 score over the entities and assumes 100% candidate recall (every mention has a candidate list). In the benchmarks, however, some mentions have no corresponding candidate list. For benchmarks, we therefore need to re-compute the F1 score to account for candidate recall; the number above is equivalent to the benchmark precision. 
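
The re-computation can be sketched as follows. Precision uses the counts reported above (421 correct out of 472 mentions with candidates), while recall divides by every gold mention in the benchmark, including those without a candidate list. The total of 500 gold mentions is a placeholder chosen for illustration, not the actual RSS500 count, and `benchmark_prf` is a hypothetical helper rather than a Bootleg function.

```python
def benchmark_prf(num_correct, num_predicted, num_gold):
    """Return (precision, recall, F1) when some gold mentions have no candidates.

    num_predicted counts mentions that had a candidate list (and so got a prediction);
    num_gold counts all gold mentions in the benchmark.
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 421 and 472 come from the evaluation above; 500 is a placeholder total.
precision, recall, f1 = benchmark_prf(num_correct=421, num_predicted=472, num_gold=500)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because recall is computed over a larger denominator than precision, the benchmark F1 is never higher than the number reported by the evaluation.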

## 3. Analyze the Errors

To understand what examples Bootleg gets wrong, we also support a `dump_preds` mode. Rather than computing aggregate quality metrics, this mode writes a jsonlines file with the predicted candidates for each mention and their associated probabilities. 

Running this is very similar to the evaluation above, except that we switch the mode to `dump_preds`. 

In [7]:
pred_file, _ = run.model_eval(args=config_args, mode="dump_preds", logger=logger, is_writer=True)

2020-09-09 13:06:24,690 Loading entity_symbols...
2020-09-09 13:06:29,476 Loaded entity_symbols with 5222808 entities.
2020-09-09 13:06:29,478 Loading slices...
2020-09-09 13:06:29,482 Finished loading slices.
2020-09-09 13:06:29,483 Loading dataset...
2020-09-09 13:06:29,485 Finished loading dataset.
2020-09-09 13:06:37,830 Sampled 319 indices from dataset (dev/test) for evaluation.
2020-09-09 13:06:38,049 Loading embeddings...
2020-09-09 13:07:00,983 Finished loading embeddings.
2020-09-09 13:07:01,032 Loading model from data/bootleg_wiki/bootleg_model.pt...
2020-09-09 13:07:03,272 Successfully loaded model from data/bootleg_wiki/bootleg_model.pt starting from checkpoint epoch 2 and step 0.
2020-09-09 13:07:03,313 ************************DUMPING PREDICTIONS FOR test_rss500.jsonl************************
2020-09-09 13:07:03,469 320 samples, 10 batches
2020-09-09 13:07:25,123 Writing predictions...
2020-09-09 13:07:25,131 Total number of mentions across all sentences: 472
2020-09-09 13:

We provide utility functions to load in the predicted labels as well as the original file and generate a merged [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) with both predicted and gold labels. 

In [8]:
from utils import score_predictions
import pandas as pd
pd.options.display.max_colwidth = 500

pred_df = score_predictions(orig_file=f'{input_dir}/rss500/test_rss500.jsonl', 
                 pred_file=pred_file,
                 entity_mapping_dir=f'{input_dir}/entity_db/entity_mappings')

Let's take a look at the DataFrame! We include the sentence, the aliases we want to disambiguate, the correct mapping to Wikidata for the particular alias (gold_qid), and the predicted and true titles. 

In [9]:
pred_df.sample(10)

Unnamed: 0,sentence,aliases,alias,alias_idx,gold_qid,pred_qid,gold_title,pred_title
132,"Catherine Lucey joined the Daily News in 2002 and has written about murderous drug gangs , political protesters and Harry Potter .",[daily news_104],daily news_104,0,Q3378849,Q2605160,Philadelphia Daily News,The Daily News (UK)
22,"That included 330,000 Appalachian Power customers in West Virginia .","[appalachian power_20, west virginia_20]",appalachian power_20,0,Q464092,Q464092,American Electric Power,American Electric Power
15,"Blue Origin , a Kent , Wash.-based start-up , is backed by Amazon founder Jeff Bezos .","[amazon_15, jeff bezos_15]",jeff bezos_15,1,Q312556,Q312556,Jeff Bezos,Jeff Bezos
128,"Carolina receiver Steve Smith signs autographs Wednesday following the team 's final training camp practice at Wofford College in Spartanburg , S.C. .",[carolina_101],carolina_101,0,Q330120,Q330120,Carolina Panthers,Carolina Panthers
426,Word had just come that Liddell died in February .,"[liddell_323, february_323]",liddell_323,0,Q317422,Q863177,Eric Liddell,Billy Liddell
3,"Joshua 's jubilation at claiming the 13th gold in the boxing ring at the London Games was a memorable moment , but it did not match the performances of the three women champions , which were still drawing praise from AIBA president Wu Ching-kuo .",[wu chingkuo_3],wu chingkuo_3,0,Q8038886,Q8038886,Wu Ching-kuo,Wu Ching-kuo
423,"Leslie Drake , director of the U.S. Commercial Service in Charleston , said exports from West Virginia have increased by 31 percent this year .",[us commercial service_321],us commercial service_321,0,Q7889637,Q7889637,United States Commercial Service,United States Commercial Service
110,CBS NFL analyst Boomer Esiason is n't a fan of Tim Tebow 's chances to be a star QB .,[boomer esiason_89],boomer esiason_89,0,Q725373,Q725373,Boomer Esiason,Boomer Esiason
131,Carroll College head coach Mike Van Diest during an NAIA news conference Tuesday .,"[carroll college_103, mike van diest_103]",mike van diest_103,1,Q6849134,Q6849134,Mike Van Diest,Mike Van Diest
199,The report aims to curb the flow of future boat arrivals and was drawn up by an expert panel headed by former Australian Defense Force chief Angus Houston .,"[australian defense force_157, angus houston_157]",australian defense force_157,0,Q625657,Q625657,Australian Defence Force,Australian Defence Force


We can write functions over the DataFrame to help with error analysis. For instance, to get all incorrect examples, use the command below. 

In [10]:
pred_df[pred_df['gold_qid'] != pred_df['pred_qid']].sample(10)

Unnamed: 0,sentence,aliases,alias,alias_idx,gold_qid,pred_qid,gold_title,pred_title
318,"In 2007 , George Washington defeated Hurricane 1-0 in the title game .",[george washington_239],george washington_239,0,Q534375,Q2984247,George Washington Colonials,George Washington Colonials men's basketball
408,"But when Kusama arrived in New York in 1958 , the fad was `` action painting , `` characterized by dribbles , swooshes and smears , not dots .","[kusama_311, new york_311]",new york_311,1,Q1384,Q60,New York (state),New York City
58,He was passed by BMC teammate Van Garderen despite a three-minute head start and fell one spot to seventh in the overall standings .,"[bmc_46, van garderen_46]",bmc_46,0,Q787401,Q65091419,CCC Pro Team,BMC Racing Team
327,Gonzalez returned to the Bay Area for the first time since the Oakland Athletics traded him to the Nationals during the offseason .,"[gonzalez_246, bay area_246]",gonzalez_246,0,Q1525217,Q7103093,Gio González,Orlando González
159,"While the Security Council united behind Ban 's proposals , Russia 's Churkin criticized the U.S. and its European allies who opposed an extension of the observer mission .","[churkin_122, us_122]",us_122,1,Q30,Q48525,United States,Federal government of the United States
76,"The intrusion-detection system , manufactured by defense contractor Raytheon Co. , should have set off a series of warnings , said Bobby Egbert , spokesman for the Port Authority police officers union .",[port authority_65],port authority_65,0,Q652812,Q908666,Port authority,Port Authority of New York and New Jersey
333,"Guerrero 's wife , Casey , who beat leukemia thanks to a bone marrow transplant from German woman Katharina Zech , who was sitting ringside , said she was just happy her husband was back in the ring for the first time in 15 months .",[guerrero_250],guerrero_250,0,Q653806,Q44374,Robert Guerrero,Eddie Guerrero
213,"`` It 's a little different , that 's for sure , `` Dodgers manager Don Mattingly said .","[dodgers_167, don mattingly_167]",dodgers_167,0,Q334634,Q21189452,Los Angeles Dodgers,2016 Los Angeles Dodgers season
427,Word had just come that Liddell died in February .,"[liddell_323, february_323]",february_323,1,Q109,Q7488743,February,Shankha Ghosh
132,"Catherine Lucey joined the Daily News in 2002 and has written about murderous drug gangs , political protesters and Harry Potter .",[daily news_104],daily news_104,0,Q3378849,Q2605160,Philadelphia Daily News,The Daily News (UK)
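
As a further illustration, the sketch below summarizes errors rather than sampling them. It builds a small stand-in DataFrame from three rows shown above so that it runs on its own; with the real data you would pass `pred_df` instead. `error_summary` is a hypothetical helper, not part of the tutorial's `utils` module.

```python
import pandas as pd


def error_summary(df):
    """Return overall accuracy and counts of (gold, predicted) title pairs among errors."""
    is_wrong = df["gold_qid"] != df["pred_qid"]
    accuracy = 1.0 - is_wrong.mean()
    confusions = (
        df[is_wrong]
        .groupby(["gold_title", "pred_title"])
        .size()
        .sort_values(ascending=False)
    )
    return accuracy, confusions


# Stand-in for pred_df, using three rows from the output above.
toy_df = pd.DataFrame(
    {
        "alias": ["jeff bezos_15", "new york_311", "liddell_323"],
        "gold_qid": ["Q312556", "Q1384", "Q317422"],
        "pred_qid": ["Q312556", "Q60", "Q863177"],
        "gold_title": ["Jeff Bezos", "New York (state)", "Eric Liddell"],
        "pred_title": ["Jeff Bezos", "New York City", "Billy Liddell"],
    }
)
accuracy, confusions = error_summary(toy_df)
```

Grouping by title pair makes recurring confusions, such as a state being mistaken for a city of the same name, easier to spot than scanning sampled rows.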
