This is the implementation of Weighted self Distillation for Chinese word segmentation, published in Findings of the Association for Computational Linguistics: ACL 2022.
Our code works with the following environment:
- python=3.6.13
- pytorch=1.7
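To reproduce this environment, a minimal setup might look like the sketch below. The environment name `weidc` and the conda-based installation are assumptions, not part of the original instructions; only the Python and PyTorch versions come from the list above.

```bash
# create and activate an environment with the tested versions (environment name is arbitrary)
conda create -n weidc python=3.6.13
conda activate weidc
# install PyTorch 1.7; the CPU-only build is shown here, pick the CUDA variant matching your machine
conda install pytorch=1.7 cpuonly -c pytorch
```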
In our paper, we use BERT (paper). Please download the pre-trained BERT-Base Chinese model from Google or from HuggingFace. You can also choose RoBERTa from roberta_zh.
We use SIGHAN2005 in our paper.
To obtain pre-processed data, please follow Improving Chinese Word Segmentation with Wordhood Memory Networks.
Furthermore, we conduct experiments on some NER tasks (including WEIBO, RESUME, and MSRA).
For AttenCD, you can download our code to train models.
You can find the command lines to train and test models on a specific dataset in run.sh.
Here are some important parameters for training (an example command follows this list):
- --do_train: training mode.
- --num_wei (num_atten): the number of weight classes.
- --ratio: controls the amount of training data used.
- --data_name: the name of the dataset to be trained on.
- --dict_path: the dictionary path.
- --bert_model: the directory of the pre-trained BERT or RoBERTa model.
- --model_name: the name under which the trained model is saved.
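For example, a training run might look like the following. This is only a sketch: the entry script name (main.py here) and all argument values are assumptions for illustration; please refer to run.sh for the exact commands used in our experiments.

```bash
# hypothetical training command; see run.sh for the real script name and argument values
python main.py \
    --do_train \
    --num_wei 3 \
    --ratio 1.0 \
    --data_name pku \
    --dict_path ./data/dict.txt \
    --bert_model ./bert-base-chinese \
    --model_name attencd_pku
```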
Here are some important parameters for testing (an example command follows this list):
- --do_test: testing mode.
- --test_model: the path to the trained AttenCD model.
- --data_name: the name of the dataset to be tested.
- --dict_path: the dictionary path.
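A testing run might then look like the following sketch. Again, the script name and argument values are assumptions; run.sh contains the actual commands.

```bash
# hypothetical testing command; see run.sh for the real script name and argument values
python main.py \
    --do_test \
    --test_model ./models/attencd_pku \
    --data_name pku \
    --dict_path ./data/dict.txt
```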
Note: only the prediction mode of the CWS task is implemented.
Here are some important parameters for prediction (an example command follows this list):
- --do_predict: predicting mode.
- --input_file: the file to be predicted.
- --output_file: the path of the output file.
- --test_model: the trained AttenCD model.
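A prediction run might look like the following sketch; the script name and file paths are illustrative assumptions.

```bash
# hypothetical prediction command; see run.sh for the real script name and argument values
python main.py \
    --do_predict \
    --input_file ./data/raw_input.txt \
    --output_file ./output/segmented.txt \
    --test_model ./models/attencd_pku
```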
If you use or extend our work, please cite our ACL 2022 Findings paper:

```bibtex
@inproceedings{DBLP:conf/acl/HeC0Z22,
author = {Rian He and
Shubin Cai and
Zhong Ming and
Jialei Zhang},
editor = {Smaranda Muresan and
Preslav Nakov and
Aline Villavicencio},
title = {Weighted self Distillation for Chinese word segmentation},
booktitle = {Findings of the Association for Computational Linguistics: {ACL} 2022,
Dublin, Ireland, May 22-27, 2022},
pages = {1757--1770},
publisher = {Association for Computational Linguistics},
year = {2022},
url = {https://aclanthology.org/2022.findings-acl.139},
timestamp = {Thu, 19 May 2022 16:52:59 +0200},
biburl = {https://dblp.org/rec/conf/acl/HeC0Z22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
If you have any questions about our methods, you can leave comments in the Issues section.