Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models

This is the official code for the paper "Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models".

We propose a method that continually identifies the weak spots of a model to generate more valuable training instances, and we apply a task-specific pre-training strategy to enhance the model. Experimental results show that this adversarial training method, combined with the pre-training strategy, improves both the generalization and robustness of multiple Chinese spelling correction (CSC) models across three different datasets, achieving state-of-the-art performance on the CSC task.
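To make the "weak spot" idea concrete, here is a toy sketch. It is not the procedure from the paper or from this repository: the confusion set, the per-character confidence scores, and the function names are all hypothetical, and a real setup would take the confidences from a trained CSC model. The sketch only shows the general pattern of perturbing the position where the model is least certain.

```python
import random

# Hypothetical toy confusion set: characters that are easily mistaken for each other.
CONFUSION_SET = {
    "这": ["区", "者"],
    "在": ["再"],
}

def weakest_position(confidences, candidates):
    """Return the index among `candidates` with the lowest confidence score."""
    return min(candidates, key=lambda i: confidences[i])

def make_adversarial_example(sentence, confidences, confusion_set, rng=random):
    """Replace the character at the least-confident replaceable position
    with a confusable character.

    Returns (noisy_sentence, position), or (sentence, None) when no
    character in the sentence has an entry in the confusion set.
    """
    candidates = [i for i, ch in enumerate(sentence) if ch in confusion_set]
    if not candidates:
        return sentence, None
    pos = weakest_position(confidences, candidates)
    replacement = rng.choice(confusion_set[sentence[pos]])
    return sentence[:pos] + replacement + sentence[pos + 1:], pos
```

The resulting (noisy sentence, original sentence) pair could then be added to the training data as one more instance.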

Requirements

For BERT and Soft-Masked BERT:

python==3.7
pytorch==1.4.0
transformers==3.4.0

For SpellGCN, we borrow some code from the original SpellGCN implementation, so our requirements are the same as theirs:

tensorflow==1.13.1
python==2.7
"BERT-Base, Chinese" from google-research

How to run?

1. Prepare the datasets:

For pre-training:
1. Wiki: Download the latest zhwiki and process the dump with gensim.
2. Weibo: Download the Weibo datasets.
3. Pre-process the corpus to remove noise, for example by splitting paragraphs into sentences and filtering out inappropriate sentences (too long or too short).
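
The pre-processing step above could look roughly like the following minimal sketch. The length thresholds and function names are assumptions for illustration only; the README does not state the actual limits or the exact cleaning rules used.

```python
import re

# Assumed thresholds (in characters); the actual limits are not given in this README.
MIN_LEN = 5
MAX_LEN = 128

# A sentence is a run of non-terminator characters plus an optional
# Chinese sentence-ending punctuation mark.
_SENTENCE_RE = re.compile(r"[^。！？]+[。！？]?")

def split_sentences(paragraph):
    """Split a paragraph into sentences at 。！？ and drop empty pieces."""
    return [s.strip() for s in _SENTENCE_RE.findall(paragraph) if s.strip()]

def preprocess(paragraphs, min_len=MIN_LEN, max_len=MAX_LEN):
    """Split paragraphs into sentences and keep those within the length bounds."""
    kept = []
    for paragraph in paragraphs:
        for sentence in split_sentences(paragraph):
            if min_len <= len(sentence) <= max_len:
                kept.append(sentence)
    return kept
```

For example, `preprocess(["今天天气很好。你好！"])` keeps the first sentence and drops the second, which is shorter than the assumed minimum length.
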
For training:
1. Download the additional 270K data samples from here.
2. Extract the training samples from the file "train.sgml".
Note: The data samples mentioned above are not included in this repository because we do not have permission to redistribute them.
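
As a rough sketch of extracting training samples from "train.sgml", the snippet below assumes the file uses SIGHAN-style tags (`<SENTENCE>`, `<TEXT>`, `<MISTAKE>`, `<LOCATION>`, `<WRONG>`, `<CORRECTION>`) with 1-based character locations. That layout and the function name are assumptions; check them against your copy of the file before relying on this.

```python
import re

# Assumed SIGHAN-style layout; verify against the real train.sgml.
SENTENCE_RE = re.compile(r"<SENTENCE>(.*?)</SENTENCE>", re.S)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.S)
MISTAKE_RE = re.compile(
    r"<LOCATION>\s*(\d+)\s*</LOCATION>\s*"
    r"<WRONG>(.*?)</WRONG>\s*"
    r"<CORRECTION>(.*?)</CORRECTION>",
    re.S,
)

def extract_pairs(sgml):
    """Return a list of (source, target) sentence pairs from SGML text."""
    pairs = []
    for block in SENTENCE_RE.findall(sgml):
        text_match = TEXT_RE.search(block)
        if text_match is None:
            continue
        source = text_match.group(1).strip()
        chars = list(source)
        for loc, wrong, correction in MISTAKE_RE.findall(block):
            idx = int(loc) - 1  # LOCATION is assumed to be 1-based
            # Only apply the correction if the annotated character matches.
            if 0 <= idx < len(chars) and chars[idx] == wrong.strip():
                chars[idx] = correction.strip()
        pairs.append((source, "".join(chars)))
    return pairs
```

Reading the file with `open("train.sgml", encoding="utf-8").read()` and passing the result to `extract_pairs` would give sentence pairs that can be written out in whatever format the training scripts expect.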

2. Run the models:

For BERT and Soft-Masked BERT:

Set up a virtual environment for BERT and Soft-Masked BERT (python==3.7, torch==1.4.0, transformers==3.4.0) using Anaconda:
```
conda create -n bert python=3.7.9
conda activate bert
pip install torch==1.4.0
pip install transformers==3.4.0
```
Go to the "scripts" directory and set your own parameters (such as the paths to the initial model and data):
```
cd scripts
vim run.sh
```
Run the script:
```
bash run.sh
```

For SpellGCN:

Set up a virtual environment for SpellGCN (python==2.7, tensorflow==1.13.1) using Anaconda:

conda create -n spellgcn python=2.7.1
source activate spellgcn
pip install tensorflow==1.13.1

Go to the "scripts" directory and set your own parameters (such as the paths to BERT and the initial model):
```
cd scripts
vim run.sh
```
Run the script:
```
bash run.sh
```

3. Alternatively, download the models you need and initialize your models from them.

Baidu Wangpan:
- Link: https://pan.baidu.com/s/1O9mLjWSiXzxcPBy0fU-_BQ Extraction code: y25e

Contact

chongli17@fudan.edu.cn and cenyuanzhang17@fudan.edu.cn

How to cite our paper?

@inproceedings{li-etal-2021-2Ways,
  author    = {Chong Li and
               Cenyuan Zhang and
               Xiaoqing Zheng and
               Xuanjing Huang},
  title     = {Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models},
  booktitle = {Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}
