AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai


What does AttaCut look like?


TL;DR: a 3-layer dilated CNN on syllable and character features. It is 6x faster than DeepCut (the current state of the art), while its word-level F1 (WL-f1) on the BEST dataset is 91%, only 2% lower.
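As a rough illustration of the dilated-convolution idea (this is not AttaCut's actual architecture; the layer sizes, names, and per-character boundary head are assumptions for the sketch), a 3-layer dilated `Conv1d` stack in PyTorch might look like:

```python
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    """Sketch: three 1-D convolutions with increasing dilation over
    per-character embeddings, ending in a per-character boundary logit."""
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.convs = nn.Sequential(
            # padding = dilation keeps the sequence length unchanged (kernel 3)
            nn.Conv1d(emb_dim, hidden, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=4, padding=4),
            nn.ReLU(),
        )
        self.out = nn.Conv1d(hidden, 1, kernel_size=1)  # boundary logit per char

    def forward(self, x):                # x: (batch, emb_dim, seq_len)
        return self.out(self.convs(x))   # (batch, 1, seq_len)

x = torch.randn(1, 32, 40)               # one sequence of 40 characters
logits = DilatedConvStack()(x)
print(logits.shape)                      # torch.Size([1, 1, 40])
```

Stacking dilations 1, 2, 4 widens the receptive field exponentially with depth, which is what lets a shallow CNN see enough context to place word boundaries quickly.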

Installation

$ pip install attacut

Remark: Windows users need to install PyTorch before running the command above. Please consult PyTorch.org for more details.

Usage

Command-Line Interface

$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Tokenizer for Thai

Usage:
  attacut-cli <src> [--dest=<dest>] [--model=<model>]
  attacut-cli (-h | --help)

Options:
  -h --help         Show this screen.
  --model=<model>   Model to be used [default: attacut-sc].
  --dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
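The default destination rule above can be sketched in plain Python (the filenames are hypothetical, and the exact formatting the real CLI uses may differ):

```python
# Derive the default output filename from <src> and --model,
# following the rule stated in the help text.
src, model = "input.txt", "attacut-sc"
dest = f"{src}-tokenized-by-{model}.txt"
print(dest)  # input.txt-tokenized-by-attacut-sc.txt
```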

High-Level API

from attacut import tokenize, Tokenizer

# tokenize `txt` using our best model `attacut-sc`
words = tokenize(txt)

# alternatively, an AttaCut tokenizer can be instantiated directly, allowing
# one to specify whether to use `attacut-sc` or `attacut-c`.
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)
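Assuming `tokenize` returns a list of word strings (as the snippet above suggests), a common follow-up is joining them with a visible delimiter for inspection. The sample word list below is illustrative, not actual AttaCut output:

```python
# Hypothetical tokenizer output for a short Thai greeting.
words = ["สวัสดี", "ครับ"]

# Join with "|" to make word boundaries easy to eyeball.
print("|".join(words))  # สวัสดี|ครับ
```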

Benchmark Results

Below are brief summaries; more details can be found on our benchmarking page.

Tokenization Quality

Speed

Retraining on Custom Dataset

Please refer to our retraining page.

Related Resources

Acknowledgements

This repository was initially developed by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand. Many people have been involved in this project; the complete list of names can be found on the Acknowledgement page.
