# NED Benchmark Tutorial

In this tutorial, we demonstrate how to use a pretrained Bootleg NED model to run inference on RSS500, a standard sentence-level NED benchmark. 


## Requirements

To run this tutorial, you'll need to download the following: 

- Pretrained Bootleg model and config [here](https://bootleg-emb.s3.amazonaws.com/models/2020_08_25/bootleg_wiki.tar.gz)
- RSS500 data [here](https://bootleg-emb.s3.amazonaws.com/data/rss500.tar.gz)
- Entity data [here](https://bootleg-emb.s3.amazonaws.com/data/wiki_entity_data.tar.gz)
- Embedding data [here](https://bootleg-emb.s3.amazonaws.com/data/emb_data.tar.gz)
- Pretrained BERT model [here](https://bootleg-emb.s3.amazonaws.com/pretrained_bert_models.tar.gz)

For convenience, you can run the commands below (from the root directory of the repo) to download all of the above files and unpack them into the `models`, `data`, and `pretrained_bert_models` directories. Downloading all the files takes several minutes.

    bash download_model.sh 
    bash download_data.sh 
    bash download_bert.sh
    

## 1. Prepare the Config File

First, run the necessary import statements. You will need to have installed Bootleg as a package to run these (see the installation instructions in the README).

In [1]:
import sys
import logging
from importlib import reload
reload(logging)
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

from bootleg import run
from bootleg.utils.parser_utils import get_full_config

If you do not have a GPU with at least 12GB of memory available, set `use_cpu` below to `True` to run inference on the CPU.

In [2]:
use_cpu = False

Load the model config so we can set additional parameters and load the saved model during evaluation.  

In [4]:
root_dir = # FILL IN FULL PATH TO ROOT REPO DIRECTORY HERE 

config_path = f'{root_dir}/models/bootleg_wiki/bootleg_config.json'
config_args = get_full_config(config_path)

Update the config parameters to point to the downloaded model checkpoint and data. 

In [5]:
# set the model checkpoint path 
config_args.run_config.init_checkpoint = f'{root_dir}/models/bootleg_wiki/bootleg_model.pt'

# set the path for the entity db and candidate map
config_args.data_config.entity_dir = f'{root_dir}/data/wiki_entity_data'
config_args.data_config.alias_cand_map = 'alias2qids_rss500.json'

# set the data path and RSS500 test file 
config_args.data_config.data_dir = f'{root_dir}/data/rss500'
config_args.data_config.test_dataset.file = 'test_rss500.jsonl'

# set the embedding paths 
config_args.data_config.emb_dir =  f'{root_dir}/data/emb_data'
config_args.data_config.word_embedding.cache_dir =  f'{root_dir}/pretrained_bert_models'

# set the save directory 
config_args.run_config.save_dir = f'{root_dir}/results'

# set whether to run inference on the CPU
config_args.run_config.cpu = use_cpu
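
Before running evaluation, it can help to confirm that the downloads unpacked where the config expects them. The snippet below is a minimal sketch, not part of Bootleg: the helper name `find_missing_downloads` and the `EXPECTED_PATHS` list are hypothetical, and the relative paths are simply the ones used in the config cell above.

```python
from pathlib import Path

# Relative paths taken from the config cell above (hypothetical checklist,
# not something Bootleg itself defines).
EXPECTED_PATHS = [
    "models/bootleg_wiki/bootleg_config.json",
    "models/bootleg_wiki/bootleg_model.pt",
    "data/wiki_entity_data",
    "data/rss500/test_rss500.jsonl",
    "data/emb_data",
    "pretrained_bert_models",
]


def find_missing_downloads(root_dir):
    """Return the expected relative paths that do not exist under root_dir."""
    root = Path(root_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


# Example: check the current directory; replace "." with your root_dir.
for rel in find_missing_downloads("."):
    print(f"Missing: {rel}")
```

An empty result suggests the download scripts finished and the paths in the config should resolve.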

## 2. Run Inference for RSS500

Once the config is set up, run model evaluation! You should see that 421 of 472 mentions (men) are correct (crct).

In [6]:
run.model_eval(args=config_args, mode="eval", logger=logger, is_writer=True)

2020-09-15 14:26:43,226 PyTorch version 1.5.0 available.
2020-09-15 14:26:44,307 Loading entity_symbols...
2020-09-15 14:26:50,903 Loaded entity_symbols with 5222808 entities.
2020-09-15 14:26:51,050 Loading slices...
2020-09-15 14:26:51,073 Finished loading slices.
2020-09-15 14:26:51,076 Loading dataset...
2020-09-15 14:26:51,109 Finished loading dataset.
2020-09-15 14:27:01,937 Sampled 319 indices from dataset (dev/test) for evaluation.
2020-09-15 14:27:03,033 Loading embeddings...
2020-09-15 14:27:50,782 Finished loading embeddings.
2020-09-15 14:27:51,704 Loading model from /dfs/scratch1/mleszczy/bootleg-internal/models/bootleg_wiki/bootleg_model.pt...
2020-09-15 14:28:03,570 Successfully loaded model from /dfs/scratch1/mleszczy/bootleg-internal/models/bootleg_wiki/bootleg_model.pt starting from checkpoint epoch 2 and step 0.
2020-09-15 14:28:03,645 ************************RUNNING EVAL test_rss500.jsonl************************
2020-09-15 14:28:03,647 Evaluating 10 batches
2020-09-

The `final_loss` head corresponds to the final prediction head. We only have a single data subset, or slice, for the benchmark, which is the overall slice that includes all mentions (we call this slice `final_loss`). 

The `f1_pop` metric is the score of a simple baseline that predicts the most popular candidate for each mention without using any contextual information. We see that Bootleg improves by more than 11 F1 points over this baseline.

The F1 score reported here is the micro-averaged F1 score over the entities and assumes 100% candidate recall (every mention has a candidate list). However, some mentions in the benchmarks do not have a corresponding candidate list. Thus, for benchmarks we need to recompute the F1 to take candidate recall into account; the number above is equivalent to the benchmark precision.
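
As a rough sketch of that recomputation, the function below treats the reported score as precision (correct mentions over mentions that have a candidate list) and assumes the usual convention that recall is computed over all gold mentions in the benchmark. The function name `benchmark_f1` is hypothetical, and the total of 500 gold mentions in the example is a placeholder for illustration only, not a figure from this tutorial.

```python
def benchmark_f1(num_correct, num_with_candidates, num_gold_mentions):
    """Return (precision, recall, F1) adjusted for candidate recall.

    precision: correct / mentions that had a candidate list
    recall:    correct / all gold mentions in the benchmark
    """
    precision = num_correct / num_with_candidates if num_with_candidates else 0.0
    recall = num_correct / num_gold_mentions if num_gold_mentions else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 421 and 472 come from the evaluation above; 500 is a HYPOTHETICAL total
# number of gold mentions used only to show the calculation.
precision, recall, f1 = benchmark_f1(421, 472, 500)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

When every gold mention has a candidate list, the two denominators are equal and the adjusted F1 matches the precision.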

## 3. Analyze the Errors

To understand what examples Bootleg gets wrong, we also support a `dump_preds` mode. Rather than computing aggregate quality metrics, this mode writes a jsonlines file with the predicted candidates for each mention and their associated probabilities. 

Running this is very similar to before, except that we switch the mode to `dump_preds`.

In [7]:
pred_file, _ = run.model_eval(args=config_args, mode="dump_preds", logger=logger, is_writer=True)

2020-09-15 14:28:07,161 Loading entity_symbols...
2020-09-15 14:28:11,850 Loaded entity_symbols with 5222808 entities.
2020-09-15 14:28:11,852 Loading slices...
2020-09-15 14:28:11,856 Finished loading slices.
2020-09-15 14:28:11,858 Loading dataset...
2020-09-15 14:28:11,862 Finished loading dataset.
2020-09-15 14:28:20,565 Sampled 319 indices from dataset (dev/test) for evaluation.
2020-09-15 14:28:20,766 Loading embeddings...
2020-09-15 14:28:44,649 Finished loading embeddings.
2020-09-15 14:28:45,262 Loading model from /dfs/scratch1/mleszczy/bootleg-internal/models/bootleg_wiki/bootleg_model.pt...
2020-09-15 14:28:47,610 Successfully loaded model from /dfs/scratch1/mleszczy/bootleg-internal/models/bootleg_wiki/bootleg_model.pt starting from checkpoint epoch 2 and step 0.
2020-09-15 14:28:47,639 ************************DUMPING PREDICTIONS FOR test_rss500.jsonl************************
2020-09-15 14:28:47,759 320 samples, 10 batches
2020-09-15 14:28:51,347 Writing predictions...
2020-

We provide utility functions to load in the predicted labels as well as the original file and generate a merged [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) with both predicted and gold labels. 

In [8]:
from utils import score_predictions
import pandas as pd
pd.options.display.max_colwidth = 500

pred_df = score_predictions(orig_file=f'{root_dir}/data/rss500/test_rss500.jsonl', 
                 pred_file=pred_file,
                 entity_mapping_dir=f'{root_dir}/data/wiki_entity_data/entity_mappings')

Let's take a look at the DataFrame! We include the sentence, the aliases we want to disambiguate, the correct Wikidata QID for the particular alias (`gold_qid`), the predicted QID (`pred_qid`), and the gold and predicted titles.

In [9]:
pred_df.sample(10)

Unnamed: 0,sentence,aliases,alias,alias_idx,gold_qid,pred_qid,gold_title,pred_title
398,"The keynote address was given by Republican Party of Kentucky chairman Darrell Brock , and other dignitaries in attendance included Larry Cox , state director for Senator Mitch McConnell ; Councilman Glen Stuckel ; Linda Greenwell , candidate for state auditor ; and several other candidates .","[republican party of kentucky_303, darrell brock_303]",darrell brock_303,1,Q5224589,Q5224589,Darrell Brock Jr.,Darrell Brock Jr.
367,"`` The authorities claim that Israel is unified , but at the same time they continue to ignore the legal commitments to the children of East Jerusalem , `` said Ir Amim director Yehudith Oppenheimer .",[ir amim_279],ir amim_279,0,Q5070422,Q5070422,Ir Amim,Ir Amim
246,"Recently appointed French coach Sabri Lamouchi included brothers Kolo and Yaya Toure from English Premier League champions Manchester City , plus Shanghai-based veteran Drogba and Gervinho of Arsenal .","[english premier league_190, manchester city_190]",english premier league_190,0,Q9448,Q9448,Premier League,Premier League
100,"Bradford Wilcox , director of the National Marriage Project at the University of Virginia , said marriage brings many benefits in addition to higher median salaries and greater shared resources .",[bradford wilcox_79],bradford wilcox_79,0,Q7945359,Q896966,W. Bradford Wilcox,Bradford Parkinson
448,"The Saints ' big-play defense lived up to its reputation in the nick of time , thwarting MVP quarterback Peyton Manning and providing the game-clinching play with a little more than three minutes left with cornerback Tracy Porter 's 74-yard touchdown on an interception return .","[mvp_338, peyton manning_338]",mvp_338,0,Q652965,Q1079382,Most valuable player,Super Bowl Most Valuable Player Award
193,"Industry consultant Daniel Yergin , chairman of IHS Cambridge Energy Research Associates , said that peak oil advocates have underestimated technology advances .",[daniel yergin_150],daniel yergin_150,0,Q714148,Q714148,Daniel Yergin,Daniel Yergin
277,"-LRB- AP Photo\/Ted S. Warren -RRB- Felix Hernandez pitched the Seattle Mariners ' first perfect game and the 23rd in baseball history , overpowering the Tampa Bay Rays in a brilliant 1-0 victory Wednesday .","[felix hernandez_210, seattle mariners_210]",felix hernandez_210,0,Q1196594,Q1196594,Félix Hernández,Félix Hernández
326,"Compared with leading manufacturers , HTC needs to address several challenges before it makes a turnaround , Goldman Sachs analyst Robert Yen -LRB- 嚴柏宇 -RRB- said .",[goldman sachs_245],goldman sachs_245,0,Q193326,Q193326,Goldman Sachs,Goldman Sachs
303,"`` We 've got sort of this life-size timeline , Corvette timeline , we 're setting up that will have real Corvettes and real Corvette engines , `` said GM spokesman Tom Wilkinson .",[gm_228],gm_228,0,Q81965,Q1141551,General Motors,Gundam (fictional robot)
355,"`` The first social responsibility and professional ethic of media staff should be understanding their role clearly and being a good mouthpiece , `` Hu Zhanfan , the president of CCTV , said in a speech .",[cctv_271],cctv_271,0,Q207936,Q242256,China Central Television,Closed-circuit television


We can write functions over the DataFrame to help with error analysis. For instance, to sample 10 incorrect predictions, use the command below.

In [10]:
pred_df[pred_df['gold_qid'] != pred_df['pred_qid']].sample(10)

Unnamed: 0,sentence,aliases,alias,alias_idx,gold_qid,pred_qid,gold_title,pred_title
327,Gonzalez returned to the Bay Area for the first time since the Oakland Athletics traded him to the Nationals during the offseason .,"[gonzalez_246, bay area_246]",gonzalez_246,0,Q1525217,Q7103093,Gio González,Orlando González
399,"`` Everyone from -LRB- Kentucky coach John -RRB- Calipari to -LRB- Duke coach -RRB- Mike Kryzewski to -LRB- North Carolina coach -RRB- Roy Williams , `` Hollis said .","[everyone from lrb kentucky_304, john rrb calipari_304]",everyone from lrb kentucky_304,0,Q3324473,Q6392428,Kentucky Wildcats,Kentucky Wildcats men's basketball
448,"The Saints ' big-play defense lived up to its reputation in the nick of time , thwarting MVP quarterback Peyton Manning and providing the game-clinching play with a little more than three minutes left with cornerback Tracy Porter 's 74-yard touchdown on an interception return .","[mvp_338, peyton manning_338]",mvp_338,0,Q652965,Q1079382,Most valuable player,Super Bowl Most Valuable Player Award
264,"The tight end has 696 career receptions , second in team history behind only Hall of Fame receiver Michael Irvin at 750 .","[hall of fame_203, michael irvin_203]",hall of fame_203,0,Q1046088,Q778412,List of halls and walks of fame,Pro Football Hall of Fame
281,"In Seattle , Washington , Felix Hernandez pitched Seattle 's first perfect game and the 23rd in majors history as the Mariners edged Tampa Bay .","[seattle_212, felix hernandez_212]",seattle_212,0,Q466586,Q5083,Seattle Mariners,Seattle
318,"In 2007 , George Washington defeated Hurricane 1-0 in the title game .",[george washington_239],george washington_239,0,Q534375,Q2984247,George Washington Colonials,George Washington Colonials men's basketball
26,The co-founder of Apple Inc. died in October at the age of 56 .,[october_22],october_22,0,Q124,Q2336911,October,Thursday October Christian I
190,"Officer Korey Lankow was placed on administrative leave after leaving Jeg , a drug-sniffing dog , in his squad car outside DPS headquarters in Tucson for more than an hour on July 11 .",[tucson_148],tucson_148,0,Q18575,Q1433111,"Tucson, Arizona",Tucson International Airport
154,"Still , Tomblin campaign spokesman Chris Stadelman criticized Maloney 's goal of simply making it smaller than last year 's .",[maloney_118],maloney_118,0,Q4910023,Q455833,Bill Maloney,Carolyn Maloney
421,Dutch rider Lars Boom won the Eneco Tour after Sunday 's final stage as former Tour de France winner Alberto Contador marked his return from a doping ban by finishing fourth .,"[lars boom_319, eneco tour_319]",eneco tour_319,1,Q670258,Q19598847,BinckBank Tour,2015 Eneco Tour
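
Beyond sampling individual errors, it can be useful to summarize them. The snippet below is a minimal sketch that assumes only the column names shown in the tables above (`gold_qid`, `pred_qid`, `gold_title`, `pred_title`); the helper name `error_summary` is hypothetical, and the small DataFrame is built from four rows copied from the sample output so the example runs on its own.

```python
import pandas as pd


def error_summary(df):
    """Return overall accuracy and counts of (gold_title, pred_title) confusions."""
    correct = df["gold_qid"] == df["pred_qid"]
    accuracy = correct.mean()
    errors = df[~correct]
    # Count how often each gold title was confused with each predicted title.
    confusions = (
        errors.groupby(["gold_title", "pred_title"])
        .size()
        .sort_values(ascending=False)
    )
    return accuracy, confusions


# Toy stand-in for pred_df, using rows from the sample output above.
toy_df = pd.DataFrame(
    {
        "alias": ["cctv_271", "gm_228", "goldman sachs_245", "daniel yergin_150"],
        "gold_qid": ["Q207936", "Q81965", "Q193326", "Q714148"],
        "pred_qid": ["Q242256", "Q1141551", "Q193326", "Q714148"],
        "gold_title": ["China Central Television", "General Motors",
                       "Goldman Sachs", "Daniel Yergin"],
        "pred_title": ["Closed-circuit television", "Gundam (fictional robot)",
                       "Goldman Sachs", "Daniel Yergin"],
    }
)

accuracy, confusions = error_summary(toy_df)
print(f"accuracy: {accuracy:.2f}")
print(confusions)
```

Passing `pred_df` instead of `toy_df` would give the same summary over the full set of predictions.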
