Merged
Commits
93 commits
3e19fd3
Merge pull request #221 from PyThaiNLP/dev
bact May 10, 2019
0a57ae9
Merge pull request #232 from PyThaiNLP/issue71_add_documentation
wannaphong Jul 27, 2019
7d194e5
move text processing functions to ulmfit/rules.py
Aug 19, 2019
9784fe0
add test cases for text processing rules and BaseTokenizer
Aug 19, 2019
a180a88
add a blank line
Aug 19, 2019
e53c018
format docstring
Aug 19, 2019
064f791
remove `fastai` from the library dependencies
Aug 19, 2019
eb29b90
refactor code due to conitive complexity
Aug 19, 2019
9135559
fix bug, ungroup emoji
Aug 19, 2019
d44b03f
NER : Add output like html tag
wannaphong Aug 30, 2019
1302a85
add attacut to pythainlp/tokrnize
bkktimber Aug 30, 2019
9d4fab3
fix format
bkktimber Aug 30, 2019
ea703f2
add test cases for new option `tag`
Aug 31, 2019
def7388
fix PEP8 issues
Aug 31, 2019
723a052
fix typo
Aug 31, 2019
937aecb
change input texts in two cases
Aug 31, 2019
fd20e0b
Merge changes related to `ulmfit` module from `dev` branch
Aug 31, 2019
888aa78
refactor the unittest
Aug 31, 2019
074a391
Add change in appveyor.yml (to fix build error)
Aug 31, 2019
53c812a
add test cases for new function in `ulmfit`
Aug 31, 2019
d6ccb2c
add a test case for `pythainlp.ulmfit.process_text`
Aug 31, 2019
c20ec56
edit test case description
Aug 31, 2019
3e90cc5
update test cases description
Aug 31, 2019
3292ba3
use high level api in attacut
bkktimber Sep 1, 2019
3e0d4d6
merge
Sep 1, 2019
73ba1ed
Merge branch 'dev' into add-attacut
bkktimber Sep 1, 2019
7adc2ea
fixed merge conflict in and
bkktimber Sep 1, 2019
70b0a3b
fixed merge conflict
bkktimber Sep 1, 2019
9822e10
fixed file format
bkktimber Sep 1, 2019
44a9f6b
Update requirements.txt
bact Sep 1, 2019
134e79b
Update requirements.txt
bact Sep 1, 2019
7adb41a
Merge branch 'dev' into 2.0
bact Sep 1, 2019
39ca9cd
Merge pull request #265 from PyThaiNLP/2.0
bact Sep 1, 2019
a7523b1
Merge pull request #264 from PyThaiNLP/2.1
bact Sep 1, 2019
ee1bc1d
merge from dev
Sep 2, 2019
802814d
Merge branch 'remove-fastai-dep' of https://github.com/PyThaiNLP/pyth…
Sep 2, 2019
d6c53d5
remove try-except from tokenize/attacut.py
bkktimber Sep 2, 2019
229d9eb
add test for attacut
bkktimber Sep 2, 2019
c496e3c
add test for attacut
bkktimber Sep 2, 2019
d9888ca
add documentation
Sep 2, 2019
2c5cc02
Merge branch 'remove-fastai-dep' of https://github.com/PyThaiNLP/pyth…
Sep 2, 2019
de69a54
add test for attacut
bkktimber Sep 2, 2019
1948516
Add Discussion
Sep 2, 2019
a479b0e
Add pythainlp.benchmarks
Sep 2, 2019
bb340e8
resolve conflict CONTRIBUTING.md
Sep 2, 2019
4af3d73
resolve conflict CONTRIBUTING.md
Sep 2, 2019
c4d1345
Merge branch 'dev' of https://github.com/PyThaiNLP/pythainlp into rem…
Sep 2, 2019
7697ced
correct test case for attacut
bkktimber Sep 2, 2019
e631365
Removed a modified file from pull request
Sep 2, 2019
65e3d6e
resolve conflict setup.py
Sep 2, 2019
36f8fe4
resolve conflict CONTRIBUTING.md
Sep 2, 2019
ec44934
remove rules.py
Sep 2, 2019
72c394b
fix PEP8 issues
Sep 2, 2019
3494588
Merge pull request #261 from bkktimber/add-attacut
wannaphong Sep 2, 2019
1986bbd
Remove fast AI from setup[full]
lalital Sep 2, 2019
a909b86
Merge pull request #252 from PyThaiNLP/remove-fastai-dep
wannaphong Sep 2, 2019
a02a9d6
Update README-pypi.md
wannaphong Sep 2, 2019
e852567
PyThaiNLP 2.1.dev3
wannaphong Sep 3, 2019
787ca14
Merge branch 'dev' of https://github.com/PyThaiNLP/pythainlp into dev
lalital Sep 4, 2019
7745a6f
add image for #248
p16i Sep 4, 2019
e0fa4e7
Automatically build/deploy documentation (#267) (build and deloy docs)
lalital Sep 5, 2019
bb2fc33
Update appveyor.docs.yml (build and deploy docs)
lalital Sep 5, 2019
74357b6
refactor correctly tokenized word counting code
p16i Sep 6, 2019
c91dde6
better description for a cli param
p16i Sep 6, 2019
86b384e
update tokenization benchmark figure
p16i Sep 6, 2019
c6a70be
add file type
p16i Sep 7, 2019
d7115f8
Update test_benchmarks.py
bact Sep 7, 2019
7b7f92c
remove unused imports
bact Sep 7, 2019
fcc9d28
change open mode for sentences.yml to "rb"
bact Sep 8, 2019
21ea11e
Merge branch 'dev' into fix-tokenization-benchmark-issue
bact Sep 8, 2019
ceb167f
Update test_benchmarks.py
bact Sep 8, 2019
fd5817d
sort imports
bact Sep 8, 2019
08c365d
write with utf-8 encoding
bact Sep 8, 2019
3e630c4
fix naming consistency (tokenization instead of tokenisation)
p16i Sep 8, 2019
a853b75
Update word-tokenization-benchmark
bact Sep 8, 2019
b4ee5d6
Update benchmarks.rst
wannaphong Sep 8, 2019
cfb529a
Delete build.sh
bact Sep 8, 2019
729b886
Delete bld.bat
bact Sep 8, 2019
c49f4f6
Delete buildall.sh
bact Sep 8, 2019
21871bc
Merge pull request #269 from PyThaiNLP/fix-tokenization-benchmark-issue
wannaphong Sep 8, 2019
23ba97e
Update attacut.py
bact Sep 9, 2019
206a115
Update command_line.rst
wannaphong Sep 11, 2019
49fe5b0
Merge pull request #271 from PyThaiNLP/Command-Line-Docs (build and d…
wannaphong Sep 13, 2019
323bca1
add replace_url
Sep 17, 2019
b9b7792
fixed replace_url
Sep 17, 2019
93348ae
add process_thai to wongnai
Sep 17, 2019
2dfee52
add word2vec
Sep 17, 2019
4353e65
add visualize.py to notebooks
Sep 17, 2019
d0bc6c2
Update .travis.yml
wannaphong Sep 21, 2019
1d26919
PyThaiNLP 2.1.dev4
wannaphong Sep 21, 2019
eb09bc9
Merge branch 'dev' into ner-tag
bact Sep 21, 2019
90bb505
replace_url should be in pre_rules
cstorm125 Sep 21, 2019
b9025aa
Merge pull request #273 from PyThaiNLP/ner-tag
bact Sep 21, 2019
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -82,4 +82,4 @@ docs/_build/doctrees/api/

docs/_build/html/

docs/_build/doctrees/
docs/_build/doctrees/
7 changes: 3 additions & 4 deletions .travis.yml
@@ -15,10 +15,9 @@ script: coverage run --source=pythainlp setup.py test
after_success: coveralls
deploy:
provider: pypi
distributions: sdist bdist_wheel
distributions: bdist_wheel
user: wannaphong
password:
secure: zX35+8niw5W9H8XbFwacrDAhqyIibdUdC/cARnHlmxLN/2H9IynK0NW04UZwkBlrwrIZrU/g+cqYXFQXu6jE1ozlBKBxUd3xG8d1kixuntI0j9e+erPTs8Ju/KazUZtlknJPvnDMP+/1Dq+RMnMCP3RRlBrH6lvG70OgZ1aBpgx8FxRfs0xHfBIZvo5CVtR/QlDzhDJM1cgEyWkSgnlAhPxpv8qIQbh4/Rw89jXIZqv0bGCVJorrrcTA1oCzkr/4E4u/WZaARnvPjUr2a9U1w7C2IysDHiBfqQWlovdMmpoSLFE56YlG3smbmXfldWjmiMRQoWL+Ifu+smisvOLmR0ja78UMrrhHWP4mdzIeBVVRnT6eHUv0ChmLT2uCkOLE0newhtEJIYToot2TSoLFavXXIQB1fIHt6e74KRTV6WGnm0nFfHuGP+b5SgSPQFgqx8tBpn0rBOeqZ1y3pRISc/drF0F4reWMnlqoQfZZFmLmU1UmDZbvWNvXPu6MWyyuZ1F6fE9jyb3mG+kDuJf1PZ4ejC/sdIvpLlwUGLFGzRMa2TtxXqGq5CWsywPxo8Sx+bpMPCOImuW60PB9K/xKgfLhAtb7gZwndzUGqDbtSJCd5PmTkfEH8fawv/XnydvsssYUpipBCmFDZlNREyAkgOcLlL099Y5fAO8l2gOLyKs=
secure: Tj3VA05qopp0mkzWu6EFTlvijAoisd0BN/gD2c/vaaDCUy6fTXBkYk+dTkjbmYkEBl/WrsrW1T/QxCt2uc6bv7QTz+qL243Edv4FFQbBKvMSNlUO+hh1jI9zv3/QzwOaNHXOsI4JGeUaN5cULfxBjsBEFN+v6E0mkgBwJ0Qdb0/yuMybLWZ9dJI8iUKiaWNIr+NQoa9a+Sxw6Ltl/mdCKPppgOYPpVMCsDDdLqZdjkgXmzsjH9+Nfe6R+mYbdmeigy3ePNsIKbPkzZrY+E/I0lPZOVUgrs6gvZwlD3gESJgTROrUH6E2lBP9yYvFUE3KB0O+rdT5GyFq3MH1uD2ocrPCTQku6577wK31FzGoex6vtT4y2b39fLbhRmZDOJW8IFO7MLybazuRsNhaXn9hQU4HBRM2GQZc41bLkiEhsUX9/b2ujcn4PJKDZy91LnBw/93bgZJ7KweDzKywmcZSNeuBsGWgXdPqYiizzcf8DdvJAYytydhf8RxqdemTiS7GE7XBoXhj1/9Vfrt3lZXZbfYpTjNZeyxu7FrUJpm/I23wCw46qaRWzKXv2sRRUleNqQ1jIKEVupIa9sruHvG7DZecErhO9rMkGdsf4CIjolZ0A2BE+eAPEEY6/H1WFUWHxzxuELbUJwxnl1By677hBkLJaVs1YMGc2enGWzOnUYI=
on:
tags: true
repo: pythainlp/pythainlp
tags: true
29 changes: 16 additions & 13 deletions README-pypi.md
@@ -8,20 +8,15 @@ PyThaiNLP includes Thai word tokenizers, transliterators, soundex converters, pa

📫 follow us on Facebook [PyThaiNLP](https://www.facebook.com/pythainlp/)

## What's new in 2.0 ?
## What's new in 2.1 ?

- Terminate Python 2 support. Remove all Python 2 compatibility code.
- Improved `word_tokenize` ("newmm" and "mm" engine), a `custom_dict` dictionary can be provided
- Improved `pos_tag` Part-Of-Speech tagging
- New `NorvigSpellChecker` spell checker class, which can be initialized with custom dictionary.
- New `thai2fit` (replacing `thai2vec`, upgrade ULMFiT-related code to fastai 1.0)
- Updated ThaiNER to 1.0
- You may need to [update your existing ThaiNER models from PyThaiNLP 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
- Remove old, obsolated, deprecated, duplicated, and experimental code.
- Sentiment analysis is no longer part of the library, but rather [a text classification example](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/sentiment_analysis.ipynb).
- Add AttaCut as an option for the `word_tokenize` engine.
- New Thai2rom (PyTorch)
- New Command Line
- Add word tokenization benchmark to PyThaiNLP
- See more examples in [Get Started notebook](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb)
- [Full change log](https://github.com/PyThaiNLP/pythainlp/issues/118)
- [Upgrading from 1.7](https://thainlp.org/pythainlp/docs/2.0/notes/pythainlp-1_7-2_0.html)
- [Full change log](https://github.com/PyThaiNLP/pythainlp/issues/181)

## Install

@@ -40,6 +35,7 @@ pip install pythainlp[extra1,extra2,...]
where extras can be

- `artagger` (to support artagger part-of-speech tagger)*
- `attacut` - Wrapper for AttaCut (https://github.com/PyThaiNLP/attacut)
- `deepcut` (to support deepcut machine-learnt tokenizer)
- `icu` (for ICU support in transliteration and tokenization)
- `ipa` (for International Phonetic Alphabet support in transliteration)
@@ -54,8 +50,15 @@ Install it with pip, for example: `pip install marisa_trie‑0.7.5‑cp36‑cp36

## Links

- User guide: [English](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb), [ภาษาไทย](https://colab.research.google.com/drive/1rEkB2Dcr1UAKPqz4bCghZV7pXx2qxf89)
- Docs: https://thainlp.org/pythainlp/docs/2.0/
- User guide: [English](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb)
- Docs: https://thainlp.org/pythainlp/docs/2.1/
- GitHub: https://github.com/PyThaiNLP/pythainlp
- Issues: https://github.com/PyThaiNLP/pythainlp/issues
- Facebook: [PyThaiNLP](https://www.facebook.com/pythainlp/)


Made with ❤️

We build Thai NLP.

PyThaiNLP Team.
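The README changes above add `attacut` as an installable extra and as an engine option for `word_tokenize`. The snippet below is a minimal, self-contained sketch of the engine-dispatch pattern such a front end uses; the engine table and the whitespace fallback here are illustrative stand-ins, not PyThaiNLP's actual implementation, which wires in real segmenters such as newmm, deepcut, and attacut.

```python
# Sketch of an engine-dispatch tokenizer front end. The registered engines
# below are placeholders; a real dispatcher maps names like "newmm" or
# "attacut" to actual segmentation backends.

from typing import Callable, Dict, List


def _whitespace_segment(text: str) -> List[str]:
    # Stand-in segmenter: splits on whitespace only.
    return text.split()


_ENGINES: Dict[str, Callable[[str], List[str]]] = {
    "whitespace": _whitespace_segment,
}


def word_tokenize(text: str, engine: str = "whitespace") -> List[str]:
    """Tokenize text with the named engine, raising on unknown engines."""
    if not text:
        return []
    try:
        segment = _ENGINES[engine]
    except KeyError:
        raise ValueError(f"Unknown tokenizer engine: {engine}")
    return segment(text)
```

With the library itself, the call shape after installing the `attacut` extra is `word_tokenize(text, engine="attacut")`, matching the engine-selection interface sketched here.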
2 changes: 1 addition & 1 deletion README.md
@@ -113,7 +113,7 @@ Made with ❤️

We build Thai NLP.

PyThaiNLP team.
PyThaiNLP Team.

# ภาษาไทย

62 changes: 62 additions & 0 deletions appveyor.docs.yml
@@ -0,0 +1,62 @@
image: ubuntu1604

branches:
only:
- /2.*/
- dev

skip_commits:
message: /(skip ci docs)/ # Skip a new build if message contains '(skip ci docs)'

install:
- sudo add-apt-repository ppa:jonathonf/python-3.6 -y
- sudo apt-get update
- sudo apt install -y python3.6
- sudo apt install -y python3.6-dev
- sudo apt install -y python3.6-venv
- wget https://bootstrap.pypa.io/get-pip.py
- sudo python3.6 get-pip.py
- sudo ln -s /usr/bin/python3.6 /usr/local/bin/python
- sudo apt-get install -y pandoc libicu-dev
- python -V
- python3 -V
- pip -V
- sudo pip install -r requirements.txt
- export LD_LIBRARY_PATH=/usr/local/lib
- sudo pip install torch==1.2.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
- sudo pip install sphinx sphinx-rtd-theme typing artagger deepcut epitran keras numpy pyicu sklearn-crfsuite tensorflow ssg emoji pandas
- sudo pip install --upgrade gensim smart_open boto

# configuration for deploy mode, commit message with /(build and deploy docs)/
# 1. build documents and upload HTML files to AppVeyor's storage
# 2. upload to thainlp.org/pythainlp/docs/<branch_name>

only_commits:
message: /(build and deploy docs)/

build_script:
- cd ./docs
- export CURRENT_BRANCH=$APPVEYOR_REPO_BRANCH
- export RELEASE=$(git describe --tags --always)
- export RELEASE=$(echo $RELEASE | cut -d'-' -f1)
- export TODAY=$(date +'%Y-%m-%d')
- make html
- echo "Done building HTML files for the branch -- $APPVEYOR_REPO_BRANCH"
- echo "Start cleaning the directory /docs/$APPVEYOR_REPO_BRANCH"
- sudo bash ./clean_directory.sh $FTP_USER $FTP_PASSWORD $FTP_HOST $APPVEYOR_REPO_BRANCH
- echo "Start Uploading files to thainlp.org/pythainlp/docs/$APPVEYOR_REPO_BRANCH"
- cd ./_build/html
- echo "cd to ./build/html"
- find . -type f -name "*" -print -exec curl --ftp-create-dir --ipv4 -T {} ftp://${FTP_USER}:${FTP_PASSWORD}@${FTP_HOST}/public_html/pythainlp/docs/$APPVEYOR_REPO_BRANCH/{} \;
- echo "Done uploading"
- echo "Done uploading files to -- thainlp.org/pythainlp/docs/$APPVEYOR_REPO_BRANCH"

artifacts:
- path: ./docs/_build/html
name: document

after_build:
- echo "Done build and deploy"
- appveyor exit

test: off
106 changes: 50 additions & 56 deletions bin/word-tokenization-benchmark
@@ -1,121 +1,115 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
import json
import os
import argparse
import yaml

from pythainlp.benchmarks import word_tokenisation
import yaml
from pythainlp.benchmarks import word_tokenization

parser = argparse.ArgumentParser(
description="Script for benchmarking tokenizaiton results"
)

parser.add_argument(
"--input",
"--input-file",
action="store",
help="path to file that you want to compare against the test file"
help="Path to input file to compare against the test file",
)

parser.add_argument(
"--test-file",
action="store",
help="path to test file"
help="Path to test file i.e. ground truth",
)

parser.add_argument(
"--save-details",
default=False,
action='store_true',
help="specify whether to save the details of comparisons"
action="store_true",
help="Save comparison details to files (eval-XXX.json and eval-details-XXX.json)",
)

args = parser.parse_args()


def _read_file(path):
with open(path, "r", encoding="utf-8") as f:
lines = map(lambda r: r.strip(), f.readlines())
return list(lines)


print(args.input)
actual = _read_file(args.input)
print(args.input_file)
actual = _read_file(args.input_file)
expected = _read_file(args.test_file)

assert len(actual) == len(expected), \
'Input and test files do not have the same number of samples'
print('Benchmarking %s against %s with %d samples in total' % (
args.input, args.test_file, len(actual)
))

df_raw = word_tokenisation.benchmark(expected, actual)

df_res = df_raw\
.describe()
df_res = df_res[[
'char_level:tp',
'char_level:tn',
'char_level:fp',
'char_level:fn',
'char_level:precision',
'char_level:recall',
'char_level:f1',
'word_level:precision',
'word_level:recall',
'word_level:f1',
]]
assert len(actual) == len(
expected
), "Input and test files do not have the same number of samples"
print(
"Benchmarking %s against %s with %d samples in total"
% (args.input_file, args.test_file, len(actual))
)

df_raw = word_tokenization.benchmark(expected, actual)

df_res = df_raw.describe()
df_res = df_res[
[
"char_level:tp",
"char_level:tn",
"char_level:fp",
"char_level:fn",
"char_level:precision",
"char_level:recall",
"char_level:f1",
"word_level:precision",
"word_level:recall",
"word_level:f1",
]
]

df_res = df_res.T.reset_index(0)

df_res['mean±std'] = df_res.apply(
lambda r: '%2.2f±%2.2f' % (r['mean'], r['std']),
axis=1
df_res["mean±std"] = df_res.apply(
lambda r: "%2.2f±%2.2f" % (r["mean"], r["std"]), axis=1
)

df_res['metric'] = df_res['index']
df_res["metric"] = df_res["index"]

print("============== Benchmark Result ==============")
print(df_res[['metric', 'mean±std', 'min', 'max']].to_string(index=False))

print(df_res[["metric", "mean±std", "min", "max"]].to_string(index=False))


if args.save_details:
data = {}
for r in df_res.to_dict('records'):
metric = r['index']
del r['index']
for r in df_res.to_dict("records"):
metric = r["index"]
del r["index"]
data[metric] = r

dir_name = os.path.dirname(args.input)
file_name = args.input.split("/")[-1].split(".")[0]
dir_name = os.path.dirname(args.input_file)
file_name = args.input_file.split("/")[-1].split(".")[0]

res_path = "%s/eval-%s.yml" % (dir_name, file_name)
print("Evaluation result is saved to %s" % res_path)

with open(res_path, 'w') as outfile:
with open(res_path, "w", encoding="utf-8") as outfile:
yaml.dump(data, outfile, default_flow_style=False)

res_path = "%s/eval-details-%s.json" % (dir_name, file_name)
print("Details of comparisons is saved to %s" % res_path)

with open(res_path, "w") as f:
with open(res_path, "w", encoding="utf-8") as f:
samples = []
for i, r in enumerate(df_raw.to_dict("records")):
expected, actual = r["expected"], r["actual"]
del r["expected"]
del r["actual"]

samples.append(dict(
metrics=r,
expected=expected,
actual=actual,
id=i
))

details = dict(
metrics=data,
samples=samples
)
samples.append(dict(metrics=r, expected=expected, actual=actual, id=i))

details = dict(metrics=data, samples=samples)

json.dump(details, f, ensure_ascii=False)
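The benchmark script above reports character-level scores (`char_level:precision`, `char_level:recall`, `char_level:f1`) for a tokenization against a ground-truth file. The following is a sketch of the standard boundary-vector way to compute those scores, shown here only to clarify what the reported metrics mean; it is not PyThaiNLP's exact `pythainlp.benchmarks.word_tokenization` implementation.

```python
# Character-level tokenization scoring via boundary vectors: mark each
# character position that begins a token with 1, then compare the predicted
# boundary positions against the reference ones.

from typing import List, Tuple


def boundary_vector(tokens: List[str]) -> List[int]:
    # 1 at each character position that starts a token, 0 elsewhere.
    vec: List[int] = []
    for tok in tokens:
        vec.append(1)
        vec.extend([0] * (len(tok) - 1))
    return vec


def char_level_scores(
    expected: List[str], actual: List[str]
) -> Tuple[float, float, float]:
    ref, hyp = boundary_vector(expected), boundary_vector(actual)
    # Both tokenizations must cover the same underlying characters.
    assert len(ref) == len(hyp), "texts must contain the same characters"
    tp = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

For example, scoring the prediction `["ab", "cde"]` against the reference `["ab", "c", "de"]` finds the boundary before "ab" and "c"/"cde" but misses the one before "de", giving precision 1.0, recall 2/3, and F1 0.8.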
2 changes: 0 additions & 2 deletions bld.bat

This file was deleted.

3 changes: 0 additions & 3 deletions build.sh

This file was deleted.

42 changes: 0 additions & 42 deletions buildall.sh

This file was deleted.

6 changes: 3 additions & 3 deletions docs/api/benchmarks.rst
@@ -19,6 +19,6 @@ Quality

Qualitative evaluation of word tokenization.

.. autofunction:: pythainlp.benchmarks.word_tokenisation.compute_stats
.. autofunction:: pythainlp.benchmarks.word_tokenisation.benchmark
.. autofunction:: pythainlp.benchmarks.word_tokenisation.preprocessing
.. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats
.. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark
.. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing