libweicws is a Chinese word segmentation system oriented toward micro-blog text. It is the system we submitted to Task 1 (Micro-blog Word Segmentation) of the CLP 2012 bake-off. The system achieves an F-score of 94.04% on the bake-off test data. The algorithm is presented in detail in this paper.
We want to develop a library that achieves reasonable accuracy on micro-blog corpora. We also want it to be compatible across different platforms. In addition, we plan to provide interfaces for other programming languages, including Java and Python.
The model and data are still being prepared for publication.
This project requires:
- Cross-platform Make (CMake) v2.8.0+
- GNU Make or equivalent.
- GCC or an alternative, reasonably conformant C++ compiler. MSVC also works.
NOTE: You do not need to link libpcre or libcrfsuite separately. Both libraries are integrated into this project, with minor modifications to their makefiles.
This project uses the Cross-platform Make (CMake) build system. However, we have conveniently provided a wrapper configure script and Makefile so that the typical build invocation of "./configure" followed by "make" will work. For a list of all possible build targets, use the command "make help".
In Unix (GCC):

    ./configure
    make

In Windows (MSVC):

    mkdir build
    cd build
    cmake ..

Then open the Visual Studio solution libweicws.sln and build it like any other project.
NOTE: Users of CMake may believe that the top-level Makefile has been generated by CMake; it hasn't, so please do not delete that file.
See demo.cpp for an example of loading a model and segmenting a Chinese sentence.
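For readers who want a feel for what a segmenter's output stage can look like, here is a minimal, self-contained sketch. It is not the libweicws API and is not taken from demo.cpp. Because the project bundles libcrfsuite, we assume here (without confirmation from the source) a character-tagging scheme in which each character is labeled B, M, E, or S (begin, middle, end, single). The function names `split_utf8` and `tags_to_words` are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into characters, each kept as
// its own byte string. Assumes well-formed UTF-8, where continuation bytes
// have the bit pattern 10xxxxxx.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> chars;
    for (unsigned char byte : text) {
        const bool is_continuation = (byte & 0xC0) == 0x80;
        if (is_continuation && !chars.empty()) {
            chars.back().push_back(static_cast<char>(byte));
        } else {
            chars.push_back(std::string(1, static_cast<char>(byte)));
        }
    }
    return chars;
}

// Hypothetical helper: join characters into words using one tag per
// character. The B/M/E/S tag set is an assumption made for illustration.
std::vector<std::string> tags_to_words(const std::vector<std::string>& chars,
                                       const std::string& tags) {
    if (chars.size() != tags.size()) {
        throw std::invalid_argument("one tag per character is required");
    }
    std::vector<std::string> words;
    std::string current;
    for (std::size_t i = 0; i < chars.size(); ++i) {
        current += chars[i];
        // 'E' closes a multi-character word; 'S' is a one-character word.
        if (tags[i] == 'E' || tags[i] == 'S') {
            words.push_back(current);
            current.clear();
        }
    }
    // Tolerate a tag sequence that ends without closing the last word.
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

Under these assumptions, the four characters of 我爱北京 tagged S, S, B, E would yield three words: 我, 爱, and 北京. In a real pipeline the tags would come from the trained model rather than being supplied by hand.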
Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing with a ready-to-sign agreement is available.
- Publish the first version of the library.
A model suitable for the current version can be downloaded from here.
- Improve the performance
- Multi-thread support
- (?) Iterator model over the results