A micro blog oriented Chinese word segmentation system. Code for 'Micro blogs Oriented Word Segmentation System'

LIBWEICWS

TOC

  1. Introduction
  2. Prerequisites
  3. Building
  4. Example
  5. License
  6. ChangeLog
  7. Model

1. INTRODUCTION

libweicws is a micro-blog-oriented Chinese word segmentation system. It is the system we submitted to Task 1 (micro-blog word segmentation) of the 2012 CLP bake-off, where it achieved an F-score of 94.04% on the bake-off test data. The algorithm is presented in detail in this paper.
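
For readers unfamiliar with the metric, the following is a minimal sketch of how a word-level F-score for segmentation is commonly computed: a predicted word counts as correct only if both of its boundaries match the gold segmentation. This is not the official bake-off scorer and is not part of libweicws; the names `Score`, `to_spans`, and `evaluate` are hypothetical and introduced here for illustration only.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: not the bake-off scorer and not a libweicws API.
struct Score {
  double precision;
  double recall;
  double f1;
};

typedef std::pair<std::size_t, std::size_t> Span;

// Map a word sequence to the set of [begin, end) byte spans it covers.
// Byte offsets are sufficient because both segmentations are assumed to
// concatenate to the same underlying sentence.
static std::set<Span> to_spans(const std::vector<std::string>& words) {
  std::set<Span> spans;
  std::size_t offset = 0;
  for (const std::string& w : words) {
    if (!w.empty()) {
      spans.insert(Span(offset, offset + w.size()));
    }
    offset += w.size();
  }
  return spans;
}

// A predicted word is correct when the same span exists in the gold set.
Score evaluate(const std::vector<std::string>& gold,
               const std::vector<std::string>& predicted) {
  const std::set<Span> gold_spans = to_spans(gold);
  const std::set<Span> pred_spans = to_spans(predicted);

  std::size_t correct = 0;
  for (const Span& s : pred_spans) {
    if (gold_spans.count(s) > 0) {
      ++correct;
    }
  }

  Score score = {0.0, 0.0, 0.0};
  if (!pred_spans.empty()) {
    score.precision = static_cast<double>(correct) / pred_spans.size();
  }
  if (!gold_spans.empty()) {
    score.recall = static_cast<double>(correct) / gold_spans.size();
  }
  if (score.precision + score.recall > 0.0) {
    score.f1 = 2.0 * score.precision * score.recall /
               (score.precision + score.recall);
  }
  return score;
}
```

For example, with the gold segmentation 我 / 喜欢 / 微博 and the prediction 我 / 喜 / 欢 / 微博, two of the four predicted words match two of the three gold words, so precision is 0.5, recall is 2/3, and the F-score is 4/7 (about 0.571).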

We want to develop a library that achieves good accuracy on micro-blog corpora and is compatible across different platforms. We also plan to provide interfaces for other programming languages, including Java and Python.

The model and data are still being prepared for release.

2. PREREQUISITES

This project requires:

  • Cross-platform Make (CMake) v2.8.0+
  • GNU Make or equivalent.
  • GCC or an alternative, reasonably conformant C++ compiler. MSVC also works.
  • PCRE-8.32
  • crfsuite-0.12

NOTE: You don't need to link libpcre and libcrfsuite to this project yourself. We have integrated both libraries into the source tree and slightly modified their makefiles.

3. BUILDING

This project uses the Cross-platform Make (CMake) build system. However, we have conveniently provided a wrapper configure script and Makefile so that the typical build invocation of "./configure" followed by "make" will work. For a list of all possible build targets, use the command "make help".

In Unix (GCC):

./configure
make

In Windows (MSVC):

mkdir build
cd build
cmake ..

Open the generated Visual Studio solution libweicws.sln and build it like any other project.

NOTE: Users of CMake may believe that the top-level Makefile has been generated by CMake; it hasn't, so please do not delete that file.

4. EXAMPLE

See demo.cpp for an example of loading a model and segmenting a Chinese sentence.

5. LICENSE

Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing with a ready-to-sign agreement is available.

6. CHANGELOG

2013-01-07

  • Published the first version of the library.

7. MODEL

A model suitable for the current version can be downloaded from here.

Histories:

TODO

  • Improve the performance
  • Multi-thread support
  • (?) Iterator model over the result