A micro blog oriented Chinese word segmentation system. Code for 'Micro blogs Oriented Word Segmentation System'

Oneplus/libweicws

LIBWEICWS

TOC

  1. Introduction
  2. Prerequisites
  3. Building
  4. Example
  5. License
  6. ChangeLog
  7. Model

1. INTRODUCTION

libweicws is a micro-blog-oriented Chinese word segmentation system. It is the system we submitted to Task 1 (micro-blog word segmentation) of the CLP 2012 bake-off. It achieves an F-score of 94.04% on the bake-off test data. The algorithm is presented in detail in this paper.
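For readers unfamiliar with how a segmentation F-score is usually computed, the following is a minimal sketch of the common word-level metric: a predicted word counts as correct only if both of its boundaries match a gold word. It is not the official bake-off scorer, and the names `utf8_length`, `to_spans`, `Score`, and `evaluate` are hypothetical helpers that are not part of libweicws. It assumes UTF-8 input and a C++11 compiler.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Count Unicode code points in a UTF-8 string by skipping continuation bytes.
static std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

typedef std::set<std::pair<std::size_t, std::size_t> > SpanSet;

// Convert a word sequence into a set of [begin, end) character spans.
static SpanSet to_spans(const std::vector<std::string>& words) {
    SpanSet spans;
    std::size_t pos = 0;
    for (const std::string& w : words) {
        std::size_t len = utf8_length(w);
        spans.insert(std::make_pair(pos, pos + len));
        pos += len;
    }
    return spans;
}

// Hypothetical result type, for illustration only.
struct Score {
    double precision;
    double recall;
    double f1;
};

// Word-level precision, recall, and F1 of a predicted segmentation
// against a gold segmentation of the same sentence.
Score evaluate(const std::vector<std::string>& gold,
               const std::vector<std::string>& predicted) {
    SpanSet g = to_spans(gold);
    SpanSet p = to_spans(predicted);

    std::size_t correct = 0;
    for (const auto& span : p) {
        if (g.count(span)) ++correct;
    }

    Score s = {0.0, 0.0, 0.0};
    if (!p.empty()) s.precision = static_cast<double>(correct) / p.size();
    if (!g.empty()) s.recall = static_cast<double>(correct) / g.size();
    if (s.precision + s.recall > 0.0) {
        s.f1 = 2.0 * s.precision * s.recall / (s.precision + s.recall);
    }
    return s;
}
```

For example, with the gold segmentation 我 / 喜欢 / 微博 and the prediction 我 / 喜 / 欢 / 微博, two of the four predicted words match gold words, so precision is 0.5, recall is 2/3, and F1 is 4/7 (about 0.571).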

We want to develop a library that achieves reasonable accuracy on micro-blog corpora and is compatible across different platforms. We also plan to provide interfaces for other programming languages, including Java and Python.

The model and data are still being prepared for publication.

2. PREREQUISITES

This project requires:

  • Cross-platform Make (CMake) v2.8.0+
  • GNU Make or equivalent.
  • GCC or an alternative, reasonably conformant C++ compiler. MSVC also works.
  • PCRE-8.32
  • crfsuite-0.12

NOTE: You don't need to link libpcre and libcrfsuite to this project separately. Both libraries are bundled with the source, with minor modifications to their makefiles.

3. BUILDING

This project uses the Cross-platform Make (CMake) build system. However, we have conveniently provided a wrapper configure script and Makefile so that the typical build invocation of "./configure" followed by "make" will work. For a list of all possible build targets, use the command "make help".

In Unix (GCC):

./configure
make

In Windows (MSVC):

mkdir build
cd build
cmake ..

Then open the generated Visual Studio solution libweicws.sln and build it like any other project.

NOTE: Users of CMake may believe that the top-level Makefile has been generated by CMake; it hasn't, so please do not delete that file.

4. EXAMPLE

Refer to demo.cpp for an example of loading a model and segmenting a Chinese sentence.

5. LICENSE

Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing with a ready-to-sign agreement is available.

6. CHANGELOG

2013-01-07

  • Published the first version of the library.

7. MODEL

A model suitable for the current version can be downloaded from here.

Histories:

TODO

  • Improve performance
  • Multi-thread support
  • (?) Iterator model over the results
