First,
git clone git@github.com:THUDM/HOSMEL.git
Our toolkit supports three different levels of usage.
If you do not wish to change our default setup, follow the steps of Mention Filtering, Mention Detection, Disambiguation By Subtitle, and Disambiguation By Relation. A live demonstration using the same structure is also available.
We know some users might prefer to design their own entity disambiguation framework or have other needs for which high-quality candidate entity retrieval is still required. As a result, we support partial usage. To do this, make sure you complete the setups (which only involve downloading and extracting zip files), then simply import the corresponding part of the toolkit and use it in your preferred manner. For better illustration, a sample usage of the complete pipeline can be found in
https://drive.google.com/drive/folders/1eh-dJnKWJulPuZGsORii4fPW-zCmWS5k?usp=sharing
See Setups and Usage of Current Modules for more information.
To train your own module, we recommend copying the NewModule
module, a template module we created. Ideally, for training you only need to make sure your training data matches the form of
{
    "sentence": "The input text",
    "Label": k,  # integer label id: target{k} is the correct candidate
    "mention": "the mention of the entity",  # Note: for mention detection, leave the mention empty and make the targets your candidate mentions
    "target0": "A",  # the four candidate values
    "target1": "B",
    "target2": "C",
    "target3": "D"
}
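As a sanity check, a record in this layout can be validated with a short script (a sketch; only the field names above come from the toolkit):

```python
def validate_example(ex: dict) -> None:
    """Check one training record against the expected schema."""
    assert isinstance(ex["sentence"], str)
    assert isinstance(ex["Label"], int) and 0 <= ex["Label"] <= 3
    assert isinstance(ex["mention"], str)  # empty string for mention detection
    for k in range(4):
        assert isinstance(ex[f"target{k}"], str)

example = {
    "sentence": "The input text",
    "Label": 2,
    "mention": "the mention of the entity",
    "target0": "A",
    "target1": "B",
    "target2": "C",
    "target3": "D",
}
validate_example(example)
print(example[f"target{example['Label']}"])  # → C
```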
Then reimplement the generatePairs
method in the apply{feature}.py
file for inference.
https://drive.google.com/file/d/12w12GH5XEVGKYoaWm_sXVFHGFOSFJHnu/view?usp=sharing
https://drive.google.com/file/d/1BZphOj8rS7qHZA3wWz0vcY3H_qbCjTGK/view?usp=sharing
https://drive.google.com/file/d/1pMqN63yy9S9NZJWRV41bc-dASRndLwtr/view?usp=sharing
https://drive.google.com/file/d/1xKvPx0LY6XgVXY7wtSmUwk2iMfBm-9qw/view?usp=sharing
Our method requires a few Python-based dependencies:
pip install flask torch tqdm pyahocorasick datasets transformers
Make sure you have all the dependencies installed to access all of our methods.
First, download TriMention.zip to the TriMention
directory, then simply extract the zip package. Your directory should look like
TriMention/
├── bdi2relation.pkl
├── mention.py
├── nameTri
├── subList.json
└── web.py
The TriMention
folder not only includes the basic trie tree, it also comes with the subtitle and relation data that will be used in later sections. We separate data loading from processing for a better development experience, since loading such data takes a large amount of time. To load the data, simply run
python mention.py
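Conceptually, the trie lookup amounts to scanning the input for known entity names. A toy pure-Python sketch of the idea (the real mention.py builds a pyahocorasick automaton over the full name dictionary; the names and sentence below are made up):

```python
# Hypothetical candidate entity names standing in for the real dictionary.
names = {"清华大学", "清华", "大学"}

def find_mentions(text: str) -> list:
    """Return every substring of `text` that is a known entity name.

    This brute-force scan is O(n^2); the toolkit's Aho-Corasick
    automaton does the same job in a single linear pass.
    """
    found = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in names:
                found.append(text[i:j])
    return found

print(find_mentions("我在清华大学读书"))  # → ['清华', '清华大学', '大学']
```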
Download the MD_checkpoint.zip file and extract it to the MCMention/model
folder. It should look like
MCMention/
├── applyMention.py
├── model
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ └── vocab.txt
├── preprocessData.py
└── train.py
Download SD_checkpoint.zip to the model
directory under MCSubtitle
, then unzip it. The final directory should look like
MCSubtitle/
├── applySubtitle.py
├── model
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ └── vocab.txt
├── preprocessData.py
└── train.py
Download RD_checkpoint.zip to MCRelation/model/
and unzip it to get
MCRelation/
├── applyRelation.py
├── model
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ └── vocab.txt
├── preprocessData.py
└── train.py
To launch the complete HOSMEL pipeline, we provide a Flask-based backend; simply run it as
python backend.py
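backend.py serves the modules behind an HTTP endpoint; a minimal sketch of such a Flask backend (the route name, payload shape, and link function here are illustrative assumptions, not the toolkit's actual API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def link(text: str) -> list:
    """Hypothetical stand-in for the real HOSMEL pipeline call chain."""
    return [{"text": text, "entity": "placeholder"}]

@app.route("/el", methods=["POST"])
def entity_linking():
    # Read the query text from a JSON body and return linked entities.
    text = request.get_json().get("text", "")
    return jsonify(link(text))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```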
We release our training data here.
Our test data is also available here.
It is simple to use our provided modules. After setting up, most modules have their method implemented in an apply{$Module}.py
file, with a topk{$Module}
method in it. This method takes three parameters:
Parameter | Usage |
---|---|
q | The input text of the entity linking framework. |
mentions/entities | The result output from the previous step. |
K | The top-K results of the current module will be passed on to the next. By default, we set it to 3. |
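This shared contract can be illustrated with a dummy scorer (topk_example and dummy_score are hypothetical; the real topk{$Module} methods score candidates with the fine-tuned models):

```python
def dummy_score(q: str, entity: str) -> float:
    # Character-overlap score, a toy stand-in for the real model score.
    return len(set(q) & set(entity))

def topk_example(q: str, entities: list, K: int = 3) -> list:
    """Score each candidate against the query and keep the K best."""
    ranked = sorted(entities, key=lambda e: dummy_score(q, e), reverse=True)
    return ranked[:K]

print(topk_example("清华大学在北京", ["上海", "清华大学", "北京大学"], K=2))
# → ['清华大学', '北京大学']
```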
The mention filtering stage is different: since it is the first step of entity linking, the candidate entity set before this step can be viewed as the entire domain, so we deployed it separately in the TriMention/mention.py
file. To use it, import parse_mentions
from TriMention/web.py
and call
from TriMention.web import parse_mentions
mentions = parse_mentions(text)
where text
is the input text to the toolkit.
The other modules follow as
entities = topkMention(text,mentions,K=3)
entities = topkSubtitle(text,entities,K=3)
entities = topkRelation(text,entities,K=1)
print(entities)
To train a new module, simply move the training data to the corresponding folder and run
python preprocessData.py
Make sure you have the name right; for example, the name of the training data in the MCSubtitle folder is subtitleData.json
. This should produce a processedData.json
file in the same directory. Then run
python train.py
The model's checkpoint will be saved in the model
folder.
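The preprocessing step can be pictured as turning each raw record into one multiple-choice sample (a rough sketch; the actual layout produced by preprocessData.py may differ):

```python
def preprocess(record: dict) -> dict:
    """Pair the sentence and mention with each of the four candidates."""
    choices = [
        "|".join([record["sentence"], record["mention"], record[f"target{k}"]])
        for k in range(4)
    ]
    return {"choices": choices, "label": record["Label"]}

raw = {
    "sentence": "例句",
    "Label": 1,
    "mention": "实体",
    "target0": "A", "target1": "B", "target2": "C", "target3": "D",
}
processed = preprocess(raw)
print(processed["choices"][processed["label"]])  # → 例句|实体|B
```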
Ideally, if you have selected your checkpoint and replaced the model
folder with it, you don't need to change anything other than editing the generatePairs
method. However, just in case you want to change the model directory: in the applyNew.py
file, change
model_location = os.path.join(os.path.dirname(__file__),"model")
into
model_location = "New checkpoint location"
To use the new module for inference, you must reimplement the generatePairs
method. The generatePairs method takes the input entities
, i.e., the output of the previous module, and retrieves a list of "mention|attribute value" pairs. A bdi_list
variable, containing the same number of items as the pairs list, with the i-th
item being the id
of the i-th
pair's entity, is required to add the scores back to the corresponding entities.
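A reimplemented generatePairs might look like the following sketch (the entity dict layout with "id", "mention", and "attrs" keys is a hypothetical example, not the toolkit's actual data structure; adapt it to your data):

```python
def generatePairs(entities: list):
    """Build "mention|attribute value" pairs plus a parallel bdi_list
    mapping each pair back to its entity id."""
    pairs, bdi_list = [], []
    for ent in entities:
        for value in ent["attrs"]:
            pairs.append(f'{ent["mention"]}|{value}')
            bdi_list.append(ent["id"])  # i-th id belongs to the i-th pair
    return pairs, bdi_list

entities = [
    {"id": "Q1", "mention": "清华", "attrs": ["大学", "北京"]},
    {"id": "Q2", "mention": "清华", "attrs": ["地名"]},
]
pairs, bdi_list = generatePairs(entities)
print(pairs)     # → ['清华|大学', '清华|北京', '清华|地名']
print(bdi_list)  # → ['Q1', 'Q1', 'Q2']
```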
Now, to test your newly implemented module, import the topkNew
method and use
from TriMention.web import parse_mentions as mentionFiltering
from ... import ... as DisambiguationBy...
......
from NewModule.applyNew import topkNew as DisambiguationByNew
text = "A test text"
entities = mentionFiltering(text)
entities = DisambiguationBy...(text,entities,K=3)
......
entities = DisambiguationByNew(text,entities,K=3)
print(entities[0])
We provide a live demonstration at https://www.aminer.cn/el
If you find our project helpful, please cite our paper:
@inproceedings{zhangli2022hosmel,
title={HOSMEL: A Hot-Swappable Modularized Entity Linking Toolkit for Chinese},
author={Zhang-Li, Daniel and Zhang, Jing and Yu, Jifan and Zhang, Xiaokang and Zhang, Peng and Tang, Jie and Li, Juanzi},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
year={2022}
}