
CThaiNLP


C implementation of Thai Natural Language Processing tools, ported from PyThaiNLP.

Installation

Python Package (Recommended)

pip install cthainlp

Or install from source:

git clone https://github.com/wannaphong/CThaiNLP.git
cd CThaiNLP
pip install -e .

C Library

See Building section below.

Features

  • newmm: Dictionary-based maximal matching word segmentation constrained by Thai Character Cluster (TCC) boundaries
  • Similar API to PyThaiNLP for easy migration from Python to C
  • UTF-8 support
  • Efficient Trie data structure for dictionary lookup
  • Handles mixed Thai/English/numeric content
  • Python bindings with PyThaiNLP-compatible API

Quick Start

Python

from cthainlp import word_tokenize

# Tokenize Thai text
text = "ฉันไปโรงเรียน"
tokens = word_tokenize(text)
print(tokens)  # ['ฉัน', 'ไป', 'โรงเรียน']

C

#include "newmm.h"

int main() {
    const char* text = "ฉันไปโรงเรียน";
    int token_count;
    char** tokens = newmm_segment(text, "data/thai_words.txt", &token_count);
    
    // Use tokens...
    
    newmm_free_result(tokens, token_count);
    return 0;
}

Building

Prerequisites

  • GCC or compatible C compiler
  • Make
  • Python 3.7+ (for Python bindings)

C Library Compilation

make

This will create:

  • Static library: lib/libcthainlp.a
  • Example program: build/example_basic

Python Package Installation

Install the Python bindings:

pip install -e .

Or build from source:

python setup.py build
python setup.py install

Usage

Python

The Python API is designed to be compatible with PyThaiNLP:

from cthainlp import word_tokenize

# Basic tokenization
text = "ฉันไปโรงเรียน"
tokens = word_tokenize(text)
print(tokens)  # ['ฉัน', 'ไป', 'โรงเรียน']

# With custom dictionary
tokens = word_tokenize(text, custom_dict="data/thai_words.txt")

# Specify engine explicitly
tokens = word_tokenize(text, engine="newmm")

C Library

Basic Example

#include <stdio.h>
#include "newmm.h"

int main(void) {
    const char* text = "ฉันไปโรงเรียน";
    int token_count;

    // Segment text (pass NULL as dict_path to use the default dictionary)
    char** tokens = newmm_segment(text, NULL, &token_count);
    if (tokens == NULL) {
        return 1;  // segmentation failed
    }

    // Print tokens
    for (int i = 0; i < token_count; i++) {
        printf("%s\n", tokens[i]);
    }

    // Free memory allocated by newmm_segment()
    newmm_free_result(tokens, token_count);

    return 0;
}

Compile Your Program

gcc your_program.c -I./include -L./lib -lcthainlp -o your_program

Running Examples

Python Example

python examples/python/example_basic.py

C Example

Basic example with default dictionary:

./build/example_basic "ฉันไปโรงเรียน"

With custom dictionary:

./build/example_basic "ฉันไปโรงเรียน" data/thai_words.txt

Running Tests

Python Tests

python tests/python/test_tokenize.py

C Tests

Run the C test suite:

make test

This will compile and run all unit tests to verify the tokenizer is working correctly.

API Reference

Functions

char** newmm_segment(const char* text, const char* dict_path, int* token_count)

Segment Thai text into words using the newmm algorithm.

Parameters:

  • text: Input text to segment (UTF-8 encoded)
  • dict_path: Path to dictionary file (one word per line, UTF-8). Use NULL for default dictionary
  • token_count: Output parameter - receives the number of tokens found

Returns:

  • Array of strings (tokens), or NULL on error
  • Caller must free the result using newmm_free_result()

Example:

int count;
char** tokens = newmm_segment("ฉันไปโรงเรียน", "dict.txt", &count);

void newmm_free_result(char** tokens, int token_count)

Free memory allocated by newmm_segment().

Parameters:

  • tokens: Array of tokens returned by newmm_segment()
  • token_count: Number of tokens in the array

Example:

newmm_free_result(tokens, count);

Dictionary Format

Dictionary files should contain one word per line in UTF-8 encoding:

ฉัน
ไป
โรงเรียน
วันนี้
อากาศ
ดี
มาก

A sample dictionary is provided in data/thai_words.txt.
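To sanity-check a custom dictionary before passing it to the tokenizer, you can load it with plain Python. A minimal sketch (not part of the cthainlp API) that writes a word list in the format above and reads it back:

```python
import tempfile

# Write a small dictionary in the format described above:
# one word per line, UTF-8 encoded.
words = ["ฉัน", "ไป", "โรงเรียน"]
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as f:
    f.write("\n".join(words) + "\n")
    dict_path = f.name

# Load it back: strip whitespace and skip blank lines.
with open(dict_path, encoding="utf-8") as f:
    dictionary = {line.strip() for line in f if line.strip()}

print(len(dictionary))  # 3
```

The resulting file path can then be passed as `custom_dict` to `word_tokenize()` or as `dict_path` to `newmm_segment()`.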

Comparison with PyThaiNLP

CThaiNLP provides both C and Python APIs. The Python API is designed to be compatible with PyThaiNLP's word_tokenize() function:

PyThaiNLP:

from pythainlp.tokenize import word_tokenize

text = "ฉันไปโรงเรียน"
tokens = word_tokenize(text, engine="newmm")
print(tokens)  # ['ฉัน', 'ไป', 'โรงเรียน']

CThaiNLP Python:

from cthainlp import word_tokenize

text = "ฉันไปโรงเรียน"
tokens = word_tokenize(text, engine="newmm")
print(tokens)  # ['ฉัน', 'ไป', 'โรงเรียน']

CThaiNLP (C):

const char* text = "ฉันไปโรงเรียน";
int token_count;
char** tokens = newmm_segment(text, NULL, &token_count);
// tokens = ['ฉัน', 'ไป', 'โรงเรียน']
newmm_free_result(tokens, token_count);

Algorithm

The newmm (new maximal matching) algorithm:

  1. Trie-based Dictionary Lookup: Uses a trie data structure for efficient prefix matching
  2. Thai Character Cluster (TCC) Boundaries: Respects Thai character cluster rules for valid word boundaries
  3. Maximal Matching: Finds the longest dictionary word that matches at each position
  4. Fallback Handling: Handles non-dictionary words and non-Thai characters (Latin, digits, etc.)
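The steps above can be sketched in pure Python. This is a simplified illustration only: greedy longest matching over a nested-dict trie, with a crude fallback that groups unknown spans into single tokens; the real C implementation additionally enforces the TCC boundary rules, which this sketch omits.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def segment(text, trie):
    """Greedy maximal matching with a fallback for out-of-dictionary spans."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        node, longest = trie, 0
        for j in range(i, n):            # walk the trie from position i
            node = node.get(text[j])
            if node is None:
                break
            if "$" in node:              # a dictionary word ends here
                longest = j - i + 1
        if longest:                      # take the longest match found
            tokens.append(text[i:i + longest])
            i += longest
        else:                            # fallback: grow an unknown span
            j = i + 1
            while j < n and text[j] not in trie:
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens

trie = build_trie(["ฉัน", "ไป", "โรงเรียน", "โรง", "เรียน"])
print(segment("ฉันไปโรงเรียน", trie))  # ['ฉัน', 'ไป', 'โรงเรียน']
```

Note that "โรงเรียน" is matched in full even though the shorter "โรง" is also in the dictionary, which shows the maximal-matching property of step 3 directly.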

Project Structure

CThaiNLP/
├── include/
│   └── newmm.h             # Public C API header
├── src/
│   ├── newmm.c             # Main newmm implementation
│   ├── trie.c              # Trie data structure
│   ├── trie.h              # Trie header
│   ├── tcc.c               # Thai Character Cluster
│   └── tcc.h               # TCC header
├── python/
│   └── cthainlp_wrapper.c  # Python C extension wrapper
├── cthainlp/
│   ├── __init__.py         # Python package
│   └── tokenize.py         # Python tokenization API
├── examples/
│   ├── example_basic.c     # C usage example
│   └── python/
│       └── example_basic.py # Python usage example
├── tests/
│   ├── test_newmm.c        # C test suite
│   └── python/
│       └── test_tokenize.py # Python test suite
├── data/
│   └── thai_words.txt      # Sample dictionary
├── setup.py                # Python package setup
├── pyproject.toml          # Python build configuration
├── Makefile                # Build configuration
└── README.md               # This file

Credits

  • Original PyThaiNLP implementation: PyThaiNLP Project
  • newmm algorithm: Based on work by Korakot Chaovavanich
  • TCC rules: Theeramunkong et al. 2000

License

Apache License 2.0 (following PyThaiNLP's license)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Future Enhancements

  • Add more tokenization engines (attacut, deepcut, etc.)
  • Improve performance with optimized data structures
  • Add part-of-speech tagging
  • Add named entity recognition
  • Publish to PyPI
