Skip to content
Code for the ACL-2015 paper "Accurate Linear-Time Chinese Word Segmentation via Embedding Matching"
Python Perl
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

#Embedding Matching for Chinese Word Segmentation

This repository hosts the code for the following paper:

Ma and Hinrichs. 2015. Accurate Linear-Time Chinese Word Segmentation via Embedding Matching. In Proceedings of ACL.

The python implementation is in the /code folder. Change the directory there to get started.


  1. To reproduce the experiments, including the training and evaluations after each iteration, run:

python config.txt

the parses the configurations in config.txt and calls the main code to train the model.

  1. To segment a raw corpus with an existing model

python predict <input_corpus> <output_path>

Note: Python2 is supported and all corpora are assumed to be in UTF-8 encoding.

  • The training takes about 1 hour on a laptop with intel Core5 CPU (single thread). In the process, error rate and quick evaulations for small batches of sentences will be conducted.

At the end of each epoch, evaluation on the testing data is given. And in the end of the whole training (10 epochs), F-scores, OOV-Recall and IV recall for each epoch will be reported again. Evaluations are based on the SIGHAN official script (in "working_data" folder)

##Requriement: ###Required Software:

  • Linux-like environment (tested on Ubuntu 14.4 and Mac OSX )
  • python 2.7 or above (but not Python 3.X)
  • Python numpy package 1.9 or above
  • Python gensim package 0.10.3 (we use their save/load/math routines and also the threading, lookup-table etc from their word2vec implementation, which is modified a bit and included here)

Required Data

PKU corpus from 2nd SIGHAN word segmentation bakeoff ( is in the folder "working_data", which include:

  • score (official scoring script:
  • pku_test (official testing split)
  • pku_train (official training split)
  • pku_test.raw (testing data converted into character sequences)
  • pku.100.txt (top 100 sent in pku_train, for fast evaluation when monitoring training)
  • pku.test.10 (top10 sent in pku_test, similar purpose)
  • pku.dict (the list of words that have occurred in pku_train, for evaluation)

MSR and other dataset can be prepared in similar manner (.raw, .100.txt test.10, and .dict files are needed) and update config.txt accordingly.

This software is open source software released under the GNU LGPL license. Copyright (c) 2015-now Jianqiang Ma

You can’t perform that action at this time.