Merged
Commits
93 commits
3e19fd3
Merge pull request #221 from PyThaiNLP/dev
bact May 10, 2019
0a57ae9
Merge pull request #232 from PyThaiNLP/issue71_add_documentation
wannaphong Jul 27, 2019
7d194e5
move text processing functions to ulmfit/rules.py
Aug 19, 2019
9784fe0
add test cases for text processing rules and BaseTokenizer
Aug 19, 2019
a180a88
add a blank line
Aug 19, 2019
e53c018
format docstring
Aug 19, 2019
064f791
remove `fastai` from the library dependencies
Aug 19, 2019
eb29b90
refactor code due to conitive complexity
Aug 19, 2019
9135559
fix bug, ungroup emoji
Aug 19, 2019
d44b03f
NER : Add output like html tag
wannaphong Aug 30, 2019
1302a85
add attacut to pythainlp/tokrnize
bkktimber Aug 30, 2019
9d4fab3
fix format
bkktimber Aug 30, 2019
ea703f2
add test cases for new option `tag`
Aug 31, 2019
def7388
fix PEP8 issues
Aug 31, 2019
723a052
fix typo
Aug 31, 2019
937aecb
change input texts in two cases
Aug 31, 2019
fd20e0b
Merge changes related to `ulmfit` module from `dev` branch
Aug 31, 2019
888aa78
refactor the unittest
Aug 31, 2019
074a391
Add change in appveyor.yml (to fix build error)
Aug 31, 2019
53c812a
add test cases for new function in `ulmfit`
Aug 31, 2019
d6ccb2c
add a test case for `pythainlp.ulmfit.process_text`
Aug 31, 2019
c20ec56
edit test case description
Aug 31, 2019
3e90cc5
update test cases description
Aug 31, 2019
3292ba3
use high level api in attacut
bkktimber Sep 1, 2019
3e0d4d6
merge
Sep 1, 2019
73ba1ed
Merge branch 'dev' into add-attacut
bkktimber Sep 1, 2019
7adc2ea
fixed merge conflict in and
bkktimber Sep 1, 2019
70b0a3b
fixed merge conflict
bkktimber Sep 1, 2019
9822e10
fixed file format
bkktimber Sep 1, 2019
44a9f6b
Update requirements.txt
bact Sep 1, 2019
134e79b
Update requirements.txt
bact Sep 1, 2019
7adb41a
Merge branch 'dev' into 2.0
bact Sep 1, 2019
39ca9cd
Merge pull request #265 from PyThaiNLP/2.0
bact Sep 1, 2019
a7523b1
Merge pull request #264 from PyThaiNLP/2.1
bact Sep 1, 2019
ee1bc1d
merge from dev
Sep 2, 2019
802814d
Merge branch 'remove-fastai-dep' of https://github.com/PyThaiNLP/pyth…
Sep 2, 2019
d6c53d5
remove try-except from tokenize/attacut.py
bkktimber Sep 2, 2019
229d9eb
add test for attacut
bkktimber Sep 2, 2019
c496e3c
add test for attacut
bkktimber Sep 2, 2019
d9888ca
add documentation
Sep 2, 2019
2c5cc02
Merge branch 'remove-fastai-dep' of https://github.com/PyThaiNLP/pyth…
Sep 2, 2019
de69a54
add test for attacut
bkktimber Sep 2, 2019
1948516
Add Discussion
Sep 2, 2019
a479b0e
Add pythainlp.benchmarks
Sep 2, 2019
bb340e8
resolve conflict CONTRIBUTING.md
Sep 2, 2019
4af3d73
resolve conflict CONTRIBUTING.md
Sep 2, 2019
c4d1345
Merge branch 'dev' of https://github.com/PyThaiNLP/pythainlp into rem…
Sep 2, 2019
7697ced
correct test case for attacut
bkktimber Sep 2, 2019
e631365
Removed a modified file from pull request
Sep 2, 2019
65e3d6e
resolve conflict setup.py
Sep 2, 2019
36f8fe4
resolve conflict CONTRIBUTING.md
Sep 2, 2019
ec44934
remove rules.py
Sep 2, 2019
72c394b
fix PEP8 issues
Sep 2, 2019
3494588
Merge pull request #261 from bkktimber/add-attacut
wannaphong Sep 2, 2019
1986bbd
Remove fast AI from setup[full]
lalital Sep 2, 2019
a909b86
Merge pull request #252 from PyThaiNLP/remove-fastai-dep
wannaphong Sep 2, 2019
a02a9d6
Update README-pypi.md
wannaphong Sep 2, 2019
e852567
PyThaiNLP 2.1.dev3
wannaphong Sep 3, 2019
787ca14
Merge branch 'dev' of https://github.com/PyThaiNLP/pythainlp into dev
lalital Sep 4, 2019
7745a6f
add image for #248
p16i Sep 4, 2019
e0fa4e7
Automatically build/deploy documentation (#267) (build and deloy docs)
lalital Sep 5, 2019
bb2fc33
Update appveyor.docs.yml (build and deploy docs)
lalital Sep 5, 2019
74357b6
refactor correctly tokenized word counting code
p16i Sep 6, 2019
c91dde6
better description for a cli param
p16i Sep 6, 2019
86b384e
update tokenization benchmark figure
p16i Sep 6, 2019
c6a70be
add file type
p16i Sep 7, 2019
d7115f8
Update test_benchmarks.py
bact Sep 7, 2019
7b7f92c
remove unused imports
bact Sep 7, 2019
fcc9d28
change open mode for sentences.yml to "rb"
bact Sep 8, 2019
21ea11e
Merge branch 'dev' into fix-tokenization-benchmark-issue
bact Sep 8, 2019
ceb167f
Update test_benchmarks.py
bact Sep 8, 2019
fd5817d
sort imports
bact Sep 8, 2019
08c365d
write with utf-8 encoding
bact Sep 8, 2019
3e630c4
fix naming consistency (tokenization instead of tokenisation)
p16i Sep 8, 2019
a853b75
Update word-tokenization-benchmark
bact Sep 8, 2019
b4ee5d6
Update benchmarks.rst
wannaphong Sep 8, 2019
cfb529a
Delete build.sh
bact Sep 8, 2019
729b886
Delete bld.bat
bact Sep 8, 2019
c49f4f6
Delete buildall.sh
bact Sep 8, 2019
21871bc
Merge pull request #269 from PyThaiNLP/fix-tokenization-benchmark-issue
wannaphong Sep 8, 2019
23ba97e
Update attacut.py
bact Sep 9, 2019
206a115
Update command_line.rst
wannaphong Sep 11, 2019
49fe5b0
Merge pull request #271 from PyThaiNLP/Command-Line-Docs (build and d…
wannaphong Sep 13, 2019
323bca1
add replace_url
Sep 17, 2019
b9b7792
fixed replace_url
Sep 17, 2019
93348ae
add process_thai to wongnai
Sep 17, 2019
2dfee52
add word2vec
Sep 17, 2019
4353e65
add visualize.py to notebooks
Sep 17, 2019
d0bc6c2
Update .travis.yml
wannaphong Sep 21, 2019
1d26919
PyThaiNLP 2.1.dev4
wannaphong Sep 21, 2019
eb09bc9
Merge branch 'dev' into ner-tag
bact Sep 21, 2019
90bb505
replace_url should be in pre_rules
cstorm125 Sep 21, 2019
b9025aa
Merge pull request #273 from PyThaiNLP/ner-tag
bact Sep 21, 2019
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -82,4 +82,4 @@ docs/_build/doctrees/api/

docs/_build/html/

docs/_build/doctrees/
docs/_build/doctrees/
7 changes: 3 additions & 4 deletions .travis.yml
@@ -15,10 +15,9 @@ script: coverage run --source=pythainlp setup.py test
after_success: coveralls
deploy:
provider: pypi
distributions: sdist bdist_wheel
distributions: bdist_wheel
user: wannaphong
password:
secure: zX35+8niw5W9H8XbFwacrDAhqyIibdUdC/cARnHlmxLN/2H9IynK0NW04UZwkBlrwrIZrU/g+cqYXFQXu6jE1ozlBKBxUd3xG8d1kixuntI0j9e+erPTs8Ju/KazUZtlknJPvnDMP+/1Dq+RMnMCP3RRlBrH6lvG70OgZ1aBpgx8FxRfs0xHfBIZvo5CVtR/QlDzhDJM1cgEyWkSgnlAhPxpv8qIQbh4/Rw89jXIZqv0bGCVJorrrcTA1oCzkr/4E4u/WZaARnvPjUr2a9U1w7C2IysDHiBfqQWlovdMmpoSLFE56YlG3smbmXfldWjmiMRQoWL+Ifu+smisvOLmR0ja78UMrrhHWP4mdzIeBVVRnT6eHUv0ChmLT2uCkOLE0newhtEJIYToot2TSoLFavXXIQB1fIHt6e74KRTV6WGnm0nFfHuGP+b5SgSPQFgqx8tBpn0rBOeqZ1y3pRISc/drF0F4reWMnlqoQfZZFmLmU1UmDZbvWNvXPu6MWyyuZ1F6fE9jyb3mG+kDuJf1PZ4ejC/sdIvpLlwUGLFGzRMa2TtxXqGq5CWsywPxo8Sx+bpMPCOImuW60PB9K/xKgfLhAtb7gZwndzUGqDbtSJCd5PmTkfEH8fawv/XnydvsssYUpipBCmFDZlNREyAkgOcLlL099Y5fAO8l2gOLyKs=
secure: Tj3VA05qopp0mkzWu6EFTlvijAoisd0BN/gD2c/vaaDCUy6fTXBkYk+dTkjbmYkEBl/WrsrW1T/QxCt2uc6bv7QTz+qL243Edv4FFQbBKvMSNlUO+hh1jI9zv3/QzwOaNHXOsI4JGeUaN5cULfxBjsBEFN+v6E0mkgBwJ0Qdb0/yuMybLWZ9dJI8iUKiaWNIr+NQoa9a+Sxw6Ltl/mdCKPppgOYPpVMCsDDdLqZdjkgXmzsjH9+Nfe6R+mYbdmeigy3ePNsIKbPkzZrY+E/I0lPZOVUgrs6gvZwlD3gESJgTROrUH6E2lBP9yYvFUE3KB0O+rdT5GyFq3MH1uD2ocrPCTQku6577wK31FzGoex6vtT4y2b39fLbhRmZDOJW8IFO7MLybazuRsNhaXn9hQU4HBRM2GQZc41bLkiEhsUX9/b2ujcn4PJKDZy91LnBw/93bgZJ7KweDzKywmcZSNeuBsGWgXdPqYiizzcf8DdvJAYytydhf8RxqdemTiS7GE7XBoXhj1/9Vfrt3lZXZbfYpTjNZeyxu7FrUJpm/I23wCw46qaRWzKXv2sRRUleNqQ1jIKEVupIa9sruHvG7DZecErhO9rMkGdsf4CIjolZ0A2BE+eAPEEY6/H1WFUWHxzxuELbUJwxnl1By677hBkLJaVs1YMGc2enGWzOnUYI=
on:
tags: true
repo: pythainlp/pythainlp
tags: true
29 changes: 16 additions & 13 deletions README-pypi.md
@@ -8,20 +8,15 @@ PyThaiNLP includes Thai word tokenizers, transliterators, soundex converters, pa

📫 follow us on Facebook [PyThaiNLP](https://www.facebook.com/pythainlp/)

## What's new in 2.0 ?
## What's new in 2.1 ?

- Terminate Python 2 support. Remove all Python 2 compatibility code.
- Improved `word_tokenize` ("newmm" and "mm" engine), a `custom_dict` dictionary can be provided
- Improved `pos_tag` Part-Of-Speech tagging
- New `NorvigSpellChecker` spell checker class, which can be initialized with custom dictionary.
- New `thai2fit` (replacing `thai2vec`, upgrade ULMFiT-related code to fastai 1.0)
- Updated ThaiNER to 1.0
- You may need to [update your existing ThaiNER models from PyThaiNLP 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
- Remove old, obsolated, deprecated, duplicated, and experimental code.
- Sentiment analysis is no longer part of the library, but rather [a text classification example](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/sentiment_analysis.ipynb).
- Add AttaCut as an option for the `word_tokenize` engine.
- New Thai2rom (PyTorch)
- New Command Line
- Add word tokenization benchmark to PyThaiNLP
- See more examples in [Get Started notebook](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb)
- [Full change log](https://github.com/PyThaiNLP/pythainlp/issues/118)
- [Upgrading from 1.7](https://thainlp.org/pythainlp/docs/2.0/notes/pythainlp-1_7-2_0.html)
- [Full change log](https://github.com/PyThaiNLP/pythainlp/issues/181)

## Install

@@ -40,6 +35,7 @@ pip install pythainlp[extra1,extra2,...]
where extras can be

- `artagger` (to support artagger part-of-speech tagger)*
- `attacut` - Wrapper for AttaCut (https://github.com/PyThaiNLP/attacut)
- `deepcut` (to support deepcut machine-learnt tokenizer)
- `icu` (for ICU support in transliteration and tokenization)
- `ipa` (for International Phonetic Alphabet support in transliteration)
@@ -54,8 +50,15 @@ Install it with pip, for example: `pip install marisa_trie‑0.7.5‑cp36‑cp36

## Links

- User guide: [English](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb), [ภาษาไทย](https://colab.research.google.com/drive/1rEkB2Dcr1UAKPqz4bCghZV7pXx2qxf89)
- Docs: https://thainlp.org/pythainlp/docs/2.0/
- User guide: [English](https://github.com/PyThaiNLP/pythainlp/blob/dev/notebooks/pythainlp-get-started.ipynb)
- Docs: https://thainlp.org/pythainlp/docs/2.1/
- GitHub: https://github.com/PyThaiNLP/pythainlp
- Issues: https://github.com/PyThaiNLP/pythainlp/issues
- Facebook: [PyThaiNLP](https://www.facebook.com/pythainlp/)


Made with ❤️

We build Thai NLP.

PyThaiNLP Team.
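The README changes above add `attacut` as an installable extra and as an engine option for `word_tokenize`. The snippet below is a minimal, self-contained sketch of the engine-dispatch pattern such a front end uses; the engine table and the whitespace fallback here are illustrative stand-ins, not PyThaiNLP's actual implementation, which wires in real segmenters such as newmm, deepcut, and attacut.

```python
# Sketch of an engine-dispatch tokenizer front end. The registered engines
# below are placeholders; a real dispatcher maps names like "newmm" or
# "attacut" to actual segmentation backends.

from typing import Callable, Dict, List


def _whitespace_segment(text: str) -> List[str]:
    # Stand-in segmenter: splits on whitespace only.
    return text.split()


_ENGINES: Dict[str, Callable[[str], List[str]]] = {
    "whitespace": _whitespace_segment,
}


def word_tokenize(text: str, engine: str = "whitespace") -> List[str]:
    """Tokenize text with the named engine, raising on unknown engines."""
    if not text:
        return []
    try:
        segment = _ENGINES[engine]
    except KeyError:
        raise ValueError(f"Unknown tokenizer engine: {engine}")
    return segment(text)
```

With the library itself, the call shape after installing the `attacut` extra is `word_tokenize(text, engine="attacut")`, matching the engine-selection interface sketched here.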
2 changes: 1 addition & 1 deletion README.md
@@ -113,7 +113,7 @@ Made with ❤️

We build Thai NLP.

PyThaiNLP team.
PyThaiNLP Team.

# ภาษาไทย

62 changes: 62 additions & 0 deletions appveyor.docs.yml
@@ -0,0 +1,62 @@
image: ubuntu1604

branches:
only:
- /2.*/
- dev

skip_commits:
message: /(skip ci docs)/ # Skip a new build if message contains '(skip ci docs)'

install:
- sudo add-apt-repository ppa:jonathonf/python-3.6 -y
- sudo apt-get update
- sudo apt install -y python3.6
- sudo apt install -y python3.6-dev
- sudo apt install -y python3.6-venv
- wget https://bootstrap.pypa.io/get-pip.py
- sudo python3.6 get-pip.py
- sudo ln -s /usr/bin/python3.6 /usr/local/bin/python
- sudo apt-get install -y pandoc libicu-dev
- python -V
- python3 -V
- pip -V
- sudo pip install -r requirements.txt
- export LD_LIBRARY_PATH=/usr/local/lib
- sudo pip install torch==1.2.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
- sudo pip install sphinx sphinx-rtd-theme typing artagger deepcut epitran keras numpy pyicu sklearn-crfsuite tensorflow ssg emoji pandas
- sudo pip install --upgrade gensim smart_open boto

# configuration for deploy mode, commit message with /(build and deploy docs)/
# 1. build documents and upload HTML files to AppVeyor's storage
# 2. upload to thainlp.org/pythainlp/docs/<branch_name>

only_commits:
message: /(build and deploy docs)/

build_script:
- cd ./docs
- export CURRENT_BRANCH=$APPVEYOR_REPO_BRANCH
- export RELEASE=$(git describe --tags --always)
- export RELEASE=$(echo $RELEASE | cut -d'-' -f1)
- export TODAY=$(date +'%Y-%m-%d')
- make html
- echo "Done building HTML files for the branch -- $APPVEYOR_REPO_BRANCH"
- echo "Start cleaning the directory /docs/$APPVEYOR_REPO_BRANCH"
- sudo bash ./clean_directory.sh $FTP_USER $FTP_PASSWORD $FTP_HOST $APPVEYOR_REPO_BRANCH
- echo "Start Uploading files to thainlp.org/pythainlp/docs/$APPVEYOR_REPO_BRANCH"
- cd ./_build/html
- echo "cd to ./build/html"
- find . -type f -name "*" -print -exec curl --ftp-create-dir --ipv4 -T {} ftp://${FTP_USER}:${FTP_PASSWORD}@${FTP_HOST}/public_html/pythainlp/docs/$APPVEYOR_REPO_BRANCH/{} \;
- echo "Done uploading"
- echo "Done uploading files to -- thainlp.org/pythainlp/docs/$APPVEYOR_REPO_BRANCH"

artifacts:
- path: ./docs/_build/html
name: document

after_build:
- echo "Done build and deploy"
- appveyor exit

test: off
106 changes: 50 additions & 56 deletions bin/word-tokenization-benchmark
@@ -1,121 +1,115 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
import json
import os
import argparse
import yaml

from pythainlp.benchmarks import word_tokenisation
import yaml
from pythainlp.benchmarks import word_tokenization

parser = argparse.ArgumentParser(
description="Script for benchmarking tokenizaiton results"
)

parser.add_argument(
"--input",
"--input-file",
action="store",
help="path to file that you want to compare against the test file"
help="Path to input file to compare against the test file",
)

parser.add_argument(
"--test-file",
action="store",
help="path to test file"
help="Path to test file i.e. ground truth",
)

parser.add_argument(
"--save-details",
default=False,
action='store_true',
help="specify whether to save the details of comparisons"
action="store_true",
help="Save comparison details to files (eval-XXX.json and eval-details-XXX.json)",
)

args = parser.parse_args()


def _read_file(path):
with open(path, "r", encoding="utf-8") as f:
lines = map(lambda r: r.strip(), f.readlines())
return list(lines)


print(args.input)
actual = _read_file(args.input)
print(args.input_file)
actual = _read_file(args.input_file)
expected = _read_file(args.test_file)

assert len(actual) == len(expected), \
'Input and test files do not have the same number of samples'
print('Benchmarking %s against %s with %d samples in total' % (
args.input, args.test_file, len(actual)
))

df_raw = word_tokenisation.benchmark(expected, actual)

df_res = df_raw\
.describe()
df_res = df_res[[
'char_level:tp',
'char_level:tn',
'char_level:fp',
'char_level:fn',
'char_level:precision',
'char_level:recall',
'char_level:f1',
'word_level:precision',
'word_level:recall',
'word_level:f1',
]]
assert len(actual) == len(
expected
), "Input and test files do not have the same number of samples"
print(
"Benchmarking %s against %s with %d samples in total"
% (args.input_file, args.test_file, len(actual))
)

df_raw = word_tokenization.benchmark(expected, actual)

df_res = df_raw.describe()
df_res = df_res[
[
"char_level:tp",
"char_level:tn",
"char_level:fp",
"char_level:fn",
"char_level:precision",
"char_level:recall",
"char_level:f1",
"word_level:precision",
"word_level:recall",
"word_level:f1",
]
]

df_res = df_res.T.reset_index(0)

df_res['mean±std'] = df_res.apply(
lambda r: '%2.2f±%2.2f' % (r['mean'], r['std']),
axis=1
df_res["mean±std"] = df_res.apply(
lambda r: "%2.2f±%2.2f" % (r["mean"], r["std"]), axis=1
)

df_res['metric'] = df_res['index']
df_res["metric"] = df_res["index"]

print("============== Benchmark Result ==============")
print(df_res[['metric', 'mean±std', 'min', 'max']].to_string(index=False))

print(df_res[["metric", "mean±std", "min", "max"]].to_string(index=False))


if args.save_details:
data = {}
for r in df_res.to_dict('records'):
metric = r['index']
del r['index']
for r in df_res.to_dict("records"):
metric = r["index"]
del r["index"]
data[metric] = r

dir_name = os.path.dirname(args.input)
file_name = args.input.split("/")[-1].split(".")[0]
dir_name = os.path.dirname(args.input_file)
file_name = args.input_file.split("/")[-1].split(".")[0]

res_path = "%s/eval-%s.yml" % (dir_name, file_name)
print("Evaluation result is saved to %s" % res_path)

with open(res_path, 'w') as outfile:
with open(res_path, "w", encoding="utf-8") as outfile:
yaml.dump(data, outfile, default_flow_style=False)

res_path = "%s/eval-details-%s.json" % (dir_name, file_name)
print("Details of comparisons is saved to %s" % res_path)

with open(res_path, "w") as f:
with open(res_path, "w", encoding="utf-8") as f:
samples = []
for i, r in enumerate(df_raw.to_dict("records")):
expected, actual = r["expected"], r["actual"]
del r["expected"]
del r["actual"]

samples.append(dict(
metrics=r,
expected=expected,
actual=actual,
id=i
))

details = dict(
metrics=data,
samples=samples
)
samples.append(dict(metrics=r, expected=expected, actual=actual, id=i))

details = dict(metrics=data, samples=samples)

json.dump(details, f, ensure_ascii=False)
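The benchmark script above reports character-level scores (`char_level:precision`, `char_level:recall`, `char_level:f1`) for a tokenization against a ground-truth file. The following is a sketch of the standard boundary-vector way to compute those scores, shown here only to clarify what the reported metrics mean; it is not PyThaiNLP's exact `pythainlp.benchmarks.word_tokenization` implementation.

```python
# Character-level tokenization scoring via boundary vectors: mark each
# character position that begins a token with 1, then compare the predicted
# boundary positions against the reference ones.

from typing import List, Tuple


def boundary_vector(tokens: List[str]) -> List[int]:
    # 1 at each character position that starts a token, 0 elsewhere.
    vec: List[int] = []
    for tok in tokens:
        vec.append(1)
        vec.extend([0] * (len(tok) - 1))
    return vec


def char_level_scores(
    expected: List[str], actual: List[str]
) -> Tuple[float, float, float]:
    ref, hyp = boundary_vector(expected), boundary_vector(actual)
    # Both tokenizations must cover the same underlying characters.
    assert len(ref) == len(hyp), "texts must contain the same characters"
    tp = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

For example, scoring the prediction `["ab", "cde"]` against the reference `["ab", "c", "de"]` finds the boundary before "ab" and "c"/"cde" but misses the one before "de", giving precision 1.0, recall 2/3, and F1 0.8.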
2 changes: 0 additions & 2 deletions bld.bat

This file was deleted.

3 changes: 0 additions & 3 deletions build.sh

This file was deleted.

42 changes: 0 additions & 42 deletions buildall.sh

This file was deleted.

6 changes: 3 additions & 3 deletions docs/api/benchmarks.rst
@@ -19,6 +19,6 @@ Quality

Qualitative evaluation of word tokenization.

.. autofunction:: pythainlp.benchmarks.word_tokenisation.compute_stats
.. autofunction:: pythainlp.benchmarks.word_tokenisation.benchmark
.. autofunction:: pythainlp.benchmarks.word_tokenisation.preprocessing
.. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats
.. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark
.. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing