
FudanNLP/adversarial-multi-criteria-learning-for-CWS

adversarial-multi-criteria-learning-for-CWS

Implementation of the paper "Adversarial Multi-Criteria Learning for Chinese Word Segmentation" (https://arxiv.org/abs/1704.07556), ACL 2017.

Dependencies

TensorFlow: == 1.0.0

Pandas: >= 0.18.1

NumPy: >= 1.12.1

File Tree

|-- AdvMulti_model.py
|-- AdvMulti_train.py
|-- Baseline_model.py
|-- Baseline_train.py
|-- config.py
|-- data_as
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_cityu
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_ckip
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_ctb
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_helpers.py
|-- data_msr
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_ncc
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_pku
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_sxu
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- data_weibo
|   |-- dev
|   |-- test
|   |-- test_gold
|   |-- train
|   |-- words
|   `-- words_for_training
|-- models
|   |-- cws_msr
|   |   `-- checkpoints
|   |-- cws_ncc
|   |   `-- checkpoints
|   |-- cws_sxu
|   |   `-- checkpoints
|   |-- multi_model9
|   |   `-- checkpoints
|   |-- train_words
|   `-- vec100.txt
|-- prepare_data_index.py
|-- prepare_train_words.py
`-- voc.py

Data Format

Each entry in the dev, train, and test files of every data_* directory has the format:

1995#<NUM>#B_NT

The first field is the original character (1995), the second is the processed character (<NUM>), and the last is the segmentation tag together with the POS tag (B_NT). The POS information is not needed for the paper; it is included only for convenience in extended use.
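As an illustration of the entry format described above, here is a minimal parsing sketch. It is not taken from the repository: the names `Entry` and `parse_entry` are hypothetical, and it assumes that neither character field itself contains `#` and that the segmentation tag and POS tag are joined by a single underscore.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    original: str       # original character, e.g. "1995"
    processed: str      # processed character, e.g. "<NUM>"
    seg_tag: str        # segmentation tag, e.g. "B"
    pos: Optional[str]  # POS tag, e.g. "NT"; None when absent


def parse_entry(entry: str) -> Entry:
    """Split one `original#processed#TAG_POS` entry into its fields."""
    # Assumption: the character fields never contain '#' themselves.
    fields = entry.strip().split("#")
    if len(fields) != 3:
        raise ValueError("expected 3 '#'-separated fields, got %d" % len(fields))
    original, processed, tag = fields
    # Assumption: segmentation tag and POS are joined by one underscore.
    seg_tag, _, pos = tag.partition("_")
    return Entry(original, processed, seg_tag, pos or None)


print(parse_entry("1995#<NUM>#B_NT"))
```

Since the POS tag is not needed for segmentation, a caller can simply ignore the `pos` field.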

The words file in each data_* directory is a word dictionary:

平定 费尔南多·安特萨纳 北京索有文化传播有限公司

Each entry in the words_for_training file of every data_* directory has the format:

LC 过后 28

LC is the POS tag, ‘过后’ is the extracted bigram, and 28 is its frequency in the specific dataset. The POS information is not needed for the paper; it is included only for convenience in extended use.
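The three fields of a words_for_training entry could be read as in the sketch below. The helper name `parse_training_word` is hypothetical, and the sketch assumes the fields are separated by whitespace (spaces or tabs), which the example entry suggests but the README does not state.

```python
def parse_training_word(line):
    """Split a `POS bigram frequency` entry into (pos, bigram, frequency)."""
    # Assumption: exactly three whitespace-separated fields per entry.
    pos, bigram, freq = line.split()
    return pos, bigram, int(freq)


print(parse_training_word("LC 过后 28"))  # ('LC', '过后', 28)
```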

vec100.txt is the embedding file generated by the word2vec toolkit.

It can be downloaded here: https://pan.baidu.com/s/1jHHdzmA
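For readers who want to inspect the embedding file, the following sketch reads embeddings in the plain-text layout that the word2vec toolkit commonly writes: an optional header line with the vocabulary size and dimension, followed by one `token v1 v2 ...` line per entry. Whether vec100.txt follows exactly this layout is an assumption, as is the dimension of 100 inferred from the file name. The function name `load_text_embeddings` is hypothetical.

```python
import io


def load_text_embeddings(handle, dim=100):
    """Read `token v1 ... v_dim` lines into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.split()
        # Skip the optional "vocab_size dim" header and any malformed line.
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory sample with made-up values, using dim=3 for brevity.
sample = io.StringIO("2 3\nthe 0.5 0.25 -1.0\nof 1.0 0.0 0.125\n")
embeddings = load_text_embeddings(sample, dim=3)
print(sorted(embeddings))  # ['of', 'the']
```

For the real file, the handle would come from something like `open("models/vec100.txt", encoding="utf-8")`. The encoding here is also an assumption.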

Code Usage

prepare_data_index.py produces the .csv files that are used as direct input.

prepare_train_words.py generates the words that need to be trained in multi-task learning, i.e. those whose frequency exceeds a specified threshold.
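The idea of keeping only words above a frequency threshold can be sketched as follows. This is not the logic of prepare_train_words.py itself: the function name `select_frequent_words`, the threshold value, and the strict `>` comparison are all assumptions made for illustration, and every sample entry except the first is made up.

```python
def select_frequent_words(entries, min_freq):
    """Return words whose frequency is strictly greater than min_freq."""
    # Assumption: "beyond a specific frequency" means a strict comparison.
    return [word for _pos, word, freq in entries if freq > min_freq]


entries = [
    ("LC", "过后", 28),     # example entry from the data format description
    ("NN", "word_a", 3),    # made-up entry for illustration
    ("VV", "word_b", 120),  # made-up entry for illustration
]
print(select_frequent_words(entries, min_freq=10))  # ['过后', 'word_b']
```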

AdvMulti_model.py and AdvMulti_train.py are the paired model and training files for the adversarial multi-task model.

Baseline_model.py and Baseline_train.py are the paired model and training files for the baseline.

Run

The hyperparameters are defined in config.py and tf.FLAGS.

When you have all necessary files:

To train the baseline:

CUDA_VISIBLE_DEVICES=0 python Baseline_train.py

To train the adversarial multi-task model:

CUDA_VISIBLE_DEVICES=0 python AdvMulti_train.py
