This is the official repository for Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, accepted for CIKM 2023.
The study adapts BERT-based Entity Linking (BLINK) to identify mentions that have no corresponding KB entity by matching them to a special NIL entity. This relies on NIL entity representation and classification, together with synonym enhancement.
The study also applies KB Pruning and Versioning strategies to automatically construct out-of-KB datasets from common in-KB Entity Linking datasets. Please see the model training and data construction scripts below.
See `step_all_BLINK.sh` for running BLINK models with Threshold-based and NIL-rep-based methods.
See `step_all_BLINKout.sh` for running BLINKout models and the dynamic feature baseline.
See `step_all_BM25+cross-enc.sh` for all BM25+BERT models.
For all scripts above:
- setting `dataset` (and `mm_onto_ver_model_mark` for MedMentions)
- setting `bi_enc_bertmodel` and `cross_enc_bertmodel` (and changing `further_model_mark` accordingly)
- setting `train_bi` (except BM25), `rep_ents`, `train_cross`, and `inference` to `true` to perform each step
- setting `use_best_top_k` to `true` to use the tuned top-k; otherwise the default top-k is used
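As a rough illustration, the shared settings could look like the sketch below inside any of the three scripts. The variable names are the ones listed above. Every value is a hypothetical placeholder, since the accepted dataset and model names are not given here, and using `false` to switch a step off is an assumption. Check the script itself for the valid options.

```shell
# Shared settings (sketch). Variable names come from the list above;
# all values are hypothetical placeholders, not options confirmed by the repository.
dataset="my_dataset"                 # hypothetical dataset name
mm_onto_ver_model_mark=""            # only relevant for MedMentions; placeholder
bi_enc_bertmodel="my-bert-model"     # hypothetical BERT model for the bi-encoder
cross_enc_bertmodel="my-bert-model"  # hypothetical BERT model for the cross-encoder
further_model_mark=""                # adjust to match the chosen BERT models

# Pipeline steps: set each to true to perform it (false is assumed to skip it).
train_bi=true      # not applicable to the BM25 script
rep_ents=true
train_cross=true
inference=true

# true = tuned top-k, anything else = default top-k
use_best_top_k=false

echo "dataset=$dataset steps: train_bi=$train_bi rep_ents=$rep_ents train_cross=$train_cross inference=$inference"
```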
For `step_all_BLINK.sh`, further
- setting `use_NIL_threshold` to `true` when using the Threshold-based approach (and the corresponding `th2` as the threshold value for each dataset)
- setting `use_NIL_ranking` to `true` when using the NIL-rep-based approach (and setting the NIL representation binary parameters `use_NIL_tag`, `use_NIL_desc`, and `use_NIL_desc_tag`)
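The two approaches in `step_all_BLINK.sh` can be sketched as two alternative groups of settings. This is only an illustration: the `th2` value is a hypothetical placeholder (the per-dataset values are in the script), and treating the two approaches as mutually exclusive is an assumption.

```shell
# Option A (sketch): Threshold-based approach.
use_NIL_threshold=true
th2=0.5                 # hypothetical value; use the threshold given for your dataset
use_NIL_ranking=false   # assumption: the NIL-rep-based approach is switched off here

# Option B (sketch): NIL-rep-based approach. Uncomment to use instead of Option A.
# use_NIL_threshold=false
# use_NIL_ranking=true
# use_NIL_tag=true        # binary NIL representation parameters;
# use_NIL_desc=true       # the combination shown is a placeholder
# use_NIL_desc_tag=false

echo "use_NIL_threshold=$use_NIL_threshold th2=$th2 use_NIL_ranking=$use_NIL_ranking"
```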
For `step_all_BLINKout.sh`, further
- setting the NIL representation binary parameters `use_NIL_tag`, `use_NIL_desc`, and `use_NIL_desc_tag`
- setting `dynamic_emb_extra_ft_baseline` to `true` and selecting the corresponding line (around 273-274) to use either the NIL regulariser (`gu2021`) or the dynamic feature baseline (`full-features-NIL-infer`), also setting the value of `lambda_NIL`
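A minimal sketch of the BLINKout-specific settings follows. The variable names are taken from the list above; the values, including `lambda_NIL`, are hypothetical placeholders. The choice between `gu2021` and `full-features-NIL-infer` is made by selecting a line in the script itself, so it is only noted in a comment here.

```shell
# BLINKout settings (sketch); all values are hypothetical placeholders.
use_NIL_tag=true
use_NIL_desc=true
use_NIL_desc_tag=false

# Enables the NIL regulariser / dynamic feature baseline branch.
# Which of the two is used (gu2021 or full-features-NIL-infer) is chosen
# by selecting the corresponding line in step_all_BLINKout.sh.
dynamic_emb_extra_ft_baseline=true
lambda_NIL=0.25   # hypothetical weight; tune per dataset

echo "NIL rep: tag=$use_NIL_tag desc=$use_NIL_desc desc_tag=$use_NIL_desc_tag lambda_NIL=$lambda_NIL"
```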
For `step_all_BM25+cross-enc.sh`
- this script requires the tokenizer of the saved biencoder model, so run `step_all_BLINK.sh` with the same biencoder model first.
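The required ordering can be sketched as a small wrapper, assuming both scripts are run from the repository root with the same `bi_enc_bertmodel` set in each. The `run_step` helper is hypothetical and not part of the repository.

```shell
# Hypothetical helper: run a pipeline script if it exists, fail otherwise.
run_step() {
  if [ -f "$1" ]; then
    bash "$1"
  else
    echo "missing: $1 (run from the repository root)" >&2
    return 1
  fi
}

# The BM25 + cross-encoder script needs the tokenizer saved by the BLINK run,
# so it is only started if the BLINK script finished successfully.
run_step "step_all_BLINK.sh" && run_step "step_all_BM25+cross-enc.sh" || echo "pipeline stopped"
```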
Link to out-of-KB mention discovery datasets: https://zenodo.org/record/8228371.
We acknowledge the sources below for data construction:
- ShARe/CLEF 2013 dataset is from https://physionet.org/content/shareclefehealth2013/1.0/
- MedMentions dataset is from https://github.com/chanzuckerberg/MedMentions
- UMLS (versions 2012AB, 2014AB, 2017AA) is from https://www.nlm.nih.gov/research/umls/index.html
- SNOMED CT (corresponding versions) is from https://www.nlm.nih.gov/healthit/snomedct/index.html
- NILK dataset is from https://zenodo.org/record/6607514
- WikiData 2017 dump is from https://archive.org/download/enwiki-20170220/enwiki-20170220-pages-articles.xml.bz2
See the files under the `preprocessing` folder; the commands for creating the datasets are in `run_preprocess_ents_and_data.sh`.
The repository is based on BLINK under the MIT license. Also, we acknowledge the data sources above.