Merged

Changes from all commits (45 commits)
1145ee8
Initial commit, the long text fix will follow
bact Oct 12, 2019
4e7b639
fix newmm issue with long text
bact Oct 12, 2019
7b0c3a4
fix PEP 8 issues
bact Oct 13, 2019
1fb4c1d
Update test_corpus.py
bact Oct 13, 2019
8f894ee
remove person names
bact Oct 13, 2019
2af315b
remove few hyphened words
bact Oct 13, 2019
f3b4daa
Try smaller window
bact Oct 17, 2019
7f60b2a
Fix appveyor.yml
bact Oct 17, 2019
dcded4a
Merge branch 'dev' into fix-newmm-longtext
wannaphong Oct 17, 2019
307cc9f
try break by the right-most space first, to speed up
bact Oct 19, 2019
78270ad
fix cut_pos
bact Oct 19, 2019
a1df24b
add more test cases
bact Oct 19, 2019
bb75b8e
remove obvious compound words from dictionary
bact Oct 19, 2019
73a1827
Comments in English
bact Oct 20, 2019
a17a604
Merge pull request #308 from PyThaiNLP/dev
bact Oct 20, 2019
fad2e83
Merge pull request #310 from PyThaiNLP/dev
bact Oct 20, 2019
142e9b1
Merge pull request #312 from PyThaiNLP/dev
bact Oct 21, 2019
9309ee6
Merge pull request #315 from PyThaiNLP/dev
bact Nov 3, 2019
80e7a5a
Update .travis.yml
wannaphong Nov 6, 2019
6070951
Delete pythainlp-1_7-2_0.rst (build and deploy docs)
wannaphong Nov 6, 2019
af16bea
add "newmm-safe" option
bact Nov 7, 2019
bbfabb4
Update CORPUS_DB_URL
wannaphong Nov 8, 2019
fa288fa
Update ThaiNER 1.2 to ThaiNER 1.3
wannaphong Nov 8, 2019
be27627
Update etcc.py
wannaphong Nov 11, 2019
fd5c44d
Update etcc.py (build and deploy docs)
wannaphong Nov 11, 2019
0332cf2
Add test for newmm-safe mode
bact Nov 12, 2019
c627f82
Update docstring for newmm-safe
bact Nov 12, 2019
1f3faf0
Fixed Travis CI : Update ThaiNER
wannaphong Nov 12, 2019
4a04e98
add long text test case for newmm-safe
bact Nov 13, 2019
c464c6d
Merge branch 'fix-newmm-longtext' of https://github.com/PyThaiNLP/pyt…
bact Nov 13, 2019
817c2c8
add more type hints
bact Nov 13, 2019
9159940
fix Generator type hinting
bact Nov 13, 2019
152e238
Update Tennsorflow version to 2 for deepcut test
bact Nov 13, 2019
b2d72d1
Merge branch 'dev' into fix-newmm-longtext
bact Nov 13, 2019
23f3856
Merge pull request #302 from PyThaiNLP/fix-newmm-longtext
bact Nov 14, 2019
112418e
Update __init__.py
bact Nov 14, 2019
8d27dbc
if -> elif engine == "newmm-safe"
bact Nov 14, 2019
29ceec2
change thai_time() precision params from "minute" and "second" to "m"…
bact Nov 14, 2019
5cf8e44
Update README.md
bact Nov 14, 2019
50821ea
Update newmm.py
bact Nov 14, 2019
fe736dd
Update newmm.py
bact Nov 14, 2019
f6f1845
makes PEP8 happy
bact Nov 14, 2019
86b9c56
more test cases for thai_time()
bact Nov 14, 2019
8fc83ae
use self.index2word, self.word_vec() instead of deprecated self.wv...
bact Nov 14, 2019
cb27a35
close file
bact Nov 14, 2019
2 changes: 1 addition & 1 deletion .travis.yml
@@ -18,7 +18,7 @@ before_install:
- sudo rm -f /etc/boto.cfg

install:
- pip install "tensorflow>=1.14,<2" deepcut
- pip install "tensorflow>=2,<3" deepcut
- pip install -r requirements.txt
- pip install .[full]
- pip install coveralls
8 changes: 4 additions & 4 deletions README.md
@@ -10,7 +10,7 @@
[![Build Status](https://travis-ci.org/PyThaiNLP/pythainlp.svg?branch=develop)](https://travis-ci.org/PyThaiNLP/pythainlp)
[![Build status](https://ci.appveyor.com/api/projects/status/9g3mfcwchi8em40x?svg=true)](https://ci.appveyor.com/project/wannaphongcom/pythainlp-9y1ch)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/cb946260c87a4cc5905ca608704406f7)](https://www.codacy.com/app/pythainlp/pythainlp_2?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=PyThaiNLP/pythainlp&amp;utm_campaign=Badge_Grade)
[![Coverage Status](https://coveralls.io/repos/github/PyThaiNLP/pythainlp/badge.svg?branch=dev)](https://coveralls.io/github/PyThaiNLP/pythainlp?branch=dev) [![Google Colab Badge](https://badgen.net/badge/Launch%20Quick%20Start%20Guide/on%20Google%20Colab/blue?icon=terminal)](https://colab.research.google.com/github/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp-get-started.ipynb)
[![Coverage Status](https://coveralls.io/repos/github/PyThaiNLP/pythainlp/badge.svg?branch=dev)](https://coveralls.io/github/PyThaiNLP/pythainlp?branch=dev) [![Google Colab Badge](https://badgen.net/badge/Launch%20Quick%20Start%20Guide/on%20Google%20Colab/blue?icon=terminal)](https://colab.research.google.com/github/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp_get_started.ipynb)
[![DOI](https://zenodo.org/badge/61813823.svg)](https://zenodo.org/badge/latestdoi/61813823)

Thai Natural Language Processing in Python.
@@ -24,7 +24,7 @@ PyThaiNLP is a Python package for text processing and linguistic analysis, simil
**This is a document for development branch (post 2.0). Things will break.**

- The latest stable release is [2.0.7](https://github.com/PyThaiNLP/pythainlp/releases)
- The latest development release is [2.1.dev7](https://github.com/PyThaiNLP/pythainlp/releases). See [2.1 change log](https://github.com/PyThaiNLP/pythainlp/issues/181).
- The latest development release is [2.1.dev7](https://github.com/PyThaiNLP/pythainlp/releases). See the ongoing [2.1 change log](https://github.com/PyThaiNLP/pythainlp/issues/181).
- 📫 follow our [PyThaiNLP](https://www.facebook.com/pythainlp/) Facebook page


@@ -89,7 +89,7 @@ The data location can be changed, using `PYTHAINLP_DATA_DIR` environment variabl

## Documentation

- [PyThaiNLP Get Started notebook](https://github.com/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp-get-started.ipynb)
- [PyThaiNLP Get Started](https://www.thainlp.org/pythainlp/tutorials/notebooks/pythainlp_get_started.html)
- More tutorials at [https://www.thainlp.org/pythainlp/tutorials/](https://www.thainlp.org/pythainlp/tutorials/)
- See full documentation at [https://thainlp.org/pythainlp/docs/2.0/](https://thainlp.org/pythainlp/docs/2.0/)

@@ -198,7 +198,7 @@ pip install pythainlp[extra1,extra2,...]

## Documentation (translated from Thai)

- [PyThaiNLP Get Started notebook](https://github.com/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp-get-started.ipynb)
- [PyThaiNLP Get Started](https://www.thainlp.org/pythainlp/tutorials/notebooks/pythainlp_get_started.html)
- More tutorials, in notebook form: [https://www.thainlp.org/pythainlp/tutorials/](https://www.thainlp.org/pythainlp/tutorials/)
- Full documentation: [https://thainlp.org/pythainlp/docs/2.0/](https://thainlp.org/pythainlp/docs/2.0/)

4 changes: 2 additions & 2 deletions appveyor.yml
@@ -98,8 +98,8 @@ install:
- pip --version
- pip install coveralls[yaml]
- pip install coverage
- pip install "tensorflow>=1.14,<2" deepcut
- pip install torch==1.2.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
- pip install "tensorflow>=2,<3" deepcut
- pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
- pip install %PYICU_PKG%
- pip install %ARTAGGER_PKG%
- pip install -e .[full]
96 changes: 0 additions & 96 deletions docs/notes/pythainlp-1_7-2_0.rst

This file was deleted.

12 changes: 6 additions & 6 deletions pythainlp/corpus/__init__.py
@@ -18,7 +18,7 @@
_CORPUS_DB_URL = (
"https://raw.githubusercontent.com/"
+ "PyThaiNLP/pythainlp-corpus/"
+ "master/db.json"
+ "2.1/db.json"
)

_CORPUS_DB_FILENAME = "db.json"
@@ -165,12 +165,12 @@ def _check_hash(dst: str, md5: str) -> NoReturn:
@param: md5 place to hash the file (MD5)
"""
if md5 and md5 != "-":
f = open(get_full_data_path(dst), "rb")
content = f.read()
file_md5 = hashlib.md5(content).hexdigest()
with open(get_full_data_path(dst), "rb") as f:
content = f.read()
file_md5 = hashlib.md5(content).hexdigest()

if md5 != file_md5:
raise Exception("Hash does not match expected.")
if md5 != file_md5:
raise Exception("Hash does not match expected.")


def download(name: str, force: bool = False) -> NoReturn:
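The `_check_hash` hunk above moves the file read into a context manager so the handle is closed even if hashing fails. A standalone, stdlib-only sketch of the same pattern (`check_hash` here is an illustrative stand-in, not PyThaiNLP's actual function):

```python
import hashlib
import tempfile

def check_hash(path: str, md5: str) -> None:
    # Mirrors the refactored pattern: read inside a "with" block
    # so the file handle is always closed.
    if md5 and md5 != "-":
        with open(path, "rb") as f:
            file_md5 = hashlib.md5(f.read()).hexdigest()
        if md5 != file_md5:
            raise Exception("Hash does not match expected.")

# Usage: write a temp file and verify its checksum.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
check_hash(tmp.name, hashlib.md5(b"hello").hexdigest())  # passes silently
```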
6 changes: 3 additions & 3 deletions pythainlp/tag/named_entity.py
@@ -76,10 +76,10 @@ def __init__(self):
"""
Thai named-entity recognizer
"""
self.__data_path = get_corpus_path("thainer-1-2")
self.__data_path = get_corpus_path("thainer-1-3")
if not self.__data_path:
download("thainer-1-2")
self.__data_path = get_corpus_path("thainer-1-2")
download("thainer-1-3")
self.__data_path = get_corpus_path("thainer-1-3")
self.crf = sklearn_crfsuite.CRF(
algorithm="lbfgs",
c1=0.1,
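The hunk above bumps ThaiNER 1.2 to 1.3 while keeping the same "get path, else download, then get again" control flow. A runnable stand-in of that flow (the dict-backed registry and the `/data/` path are hypothetical; the real `get_corpus_path`/`download` live in `pythainlp.corpus`):

```python
_local_corpora = {}  # hypothetical stand-in for the local corpus registry

def download(name: str) -> None:
    # Pretend to fetch the corpus and record where it landed.
    _local_corpora[name] = "/data/" + name

def get_corpus_path(name: str):
    return _local_corpora.get(name)

# Same control flow as ThaiNameTagger.__init__ after this PR:
data_path = get_corpus_path("thainer-1-3")
if not data_path:
    download("thainer-1-3")
    data_path = get_corpus_path("thainer-1-3")
```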
28 changes: 22 additions & 6 deletions pythainlp/tokenize/__init__.py
@@ -33,6 +33,8 @@ def word_tokenize(
**Options for engine**
* *newmm* (default) - dictionary-based, Maximum Matching +
Thai Character Cluster
* *newmm-safe* - newmm, with a mechanism to avoid long
processing time for some long continuous text without spaces
* *longest* - dictionary-based, Longest Matching
* *icu* - wrapper for ICU (International Components for Unicode,
using PyICU), dictionary-based
@@ -101,10 +103,15 @@ def word_tokenize(
return []

segments = []

if engine == "newmm" or engine == "onecut":
from .newmm import segment

segments = segment(text, custom_dict)
elif engine == "newmm-safe":
from .newmm import segment

segments = segment(text, custom_dict, safe_mode=True)
elif engine == "attacut":
from .attacut import segment

@@ -157,6 +164,7 @@ def dict_word_tokenize(
:param bool keep_whitespace: True to keep whitespaces, a common mark
for end of phrase in Thai
:return: list of words
:rtype: list[str]
"""
warnings.warn(
"dict_word_tokenize is deprecated. Use word_tokenize with a custom_dict argument instead.",
@@ -336,6 +344,7 @@ def syllable_tokenize(text: str, engine: str = "default") -> List[str]:
tokens.extend(word_tokenize(text=word, custom_dict=trie))
else:
from .ssg import segment

tokens = segment(text)

return tokens
@@ -345,9 +354,10 @@ def dict_trie(dict_source: Union[str, Iterable[str], Trie]) -> Trie:
"""
Create a dictionary trie which will be used for word_tokenize() function.

:param string/list dict_source: a list of vocaburaries or a path
to source file
:return: a trie created from a dictionary input
:param str|Iterable[str]|pythainlp.tokenize.Trie dict_source: a path to
dictionary file or a list of words or a pythainlp.tokenize.Trie object
:return: a trie object created from a dictionary input
:rtype: pythainlp.tokenize.Trie
"""
trie = None

@@ -359,7 +369,9 @@ def dict_trie(dict_source: Union[str, Iterable[str], Trie]) -> Trie:
_vocabs = f.read().splitlines()
trie = Trie(_vocabs)
elif isinstance(dict_source, Iterable):
# Note: Trie and str are both Iterable, Iterable check should be here
# Note: Trie and str are both Iterable,
# so the Iterable check should be here, at the very end,
# because it is the least specific
# Received a sequence type object of vocabs
trie = Trie(dict_source)
else:
@@ -435,7 +447,9 @@ class Tokenizer:
"""

def __init__(
self, custom_dict: Union[Trie, Iterable[str], str] = None, engine: str = "newmm"
self,
custom_dict: Union[Trie, Iterable[str], str] = None,
engine: str = "newmm",
):
"""
Initialize tokenizer object
@@ -458,7 +472,9 @@ def word_tokenize(self, text: str) -> List[str]:
:return: list of words, tokenized from the text
:rtype: list[str]
"""
return word_tokenize(text, custom_dict=self.__trie_dict, engine=self.__engine)
return word_tokenize(
text, custom_dict=self.__trie_dict, engine=self.__engine
)

def set_tokenize_engine(self, engine: str) -> None:
"""
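The "newmm-safe" engine this PR introduces guards against long processing times on long runs of text without spaces (see the commits "Try smaller window" and "try break by the right-most space first, to speed up"). A stdlib-only sketch of that idea, with an assumed window size; all names here are illustrative and not PyThaiNLP's actual internals:

```python
MAX_WINDOW = 50  # assumed cap on the text handed to the tokenizer at once

def safe_chunks(text: str, limit: int = MAX_WINDOW):
    """Split text into chunks of at most `limit` characters,
    preferring to cut at the right-most space inside the window."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit)  # try the right-most space first
        if cut <= 0:
            cut = limit                  # no space found: hard cut at the edge
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks
```

Each chunk would then be fed to the regular maximum-matching tokenizer, bounding the worst-case cost per call.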
3 changes: 1 addition & 2 deletions pythainlp/tokenize/etcc.py
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
"""
Enhanced Thai Character Cluster (ETCC)
Enhanced Thai Character Cluster (ETCC) (In progress)
Python implementation by Wannaphong Phatthiyaphaibun (19 June 2017)

:See Also:
@@ -75,5 +75,4 @@ def segment(text: str) -> str:
text = re.sub(i, ii + "/", text)

text = re.sub("//", "/", text)

return text.split("/")