Merged
56 commits
7f3acc3
Add word_detokenize
wannaphong Sep 4, 2022
9c3512e
word_detokenize: Add whitespace between ๆ
wannaphong Sep 7, 2022
5dd6891
Create word_detokenize.ipynb
wannaphong Sep 7, 2022
79cdf58
Add test_word_detokenize
wannaphong Sep 8, 2022
1f771cb
Fixed some pep8
wannaphong Sep 8, 2022
c8b71f4
Add prayut_and_somchaip
wannaphong Sep 8, 2022
16646eb
Update file
wannaphong Sep 8, 2022
38b0ce7
Update test_soundex.py
wannaphong Sep 8, 2022
58086da
Update prayut_and_somchaip.py
wannaphong Sep 8, 2022
cd00ff3
Update test_soundex.py
wannaphong Sep 8, 2022
ce9b618
Merge pull request #699 from PyThaiNLP/add-Thai-English-soundex
wannaphong Sep 9, 2022
451abd6
Update test_tokenize.py
wannaphong Sep 9, 2022
b546f89
Update word_detokenize rule
wannaphong Sep 9, 2022
a108ca9
Update core.py
wannaphong Sep 9, 2022
9d04bd2
Update core.py
wannaphong Sep 9, 2022
695df0a
Update core.py
wannaphong Sep 9, 2022
918d80f
Update prayut_and_somchaip.py
wannaphong Sep 9, 2022
de7e1d0
Fixed #700
wannaphong Sep 10, 2022
18b604c
Merge pull request #701 from PyThaiNLP/fix-corpus-utf8
wannaphong Sep 10, 2022
ae452a2
Update core.py
wannaphong Sep 10, 2022
fdfcd88
Create windows-test.yml
wannaphong Sep 10, 2022
4f3c742
Update windows-test.yml
wannaphong Sep 10, 2022
13ed76b
Update windows-test.yml
wannaphong Sep 10, 2022
6ba2880
Update windows-test.yml
wannaphong Sep 10, 2022
8a9c4ec
Update windows-test.yml
wannaphong Sep 10, 2022
d0a4eaf
Update windows-test.yml
wannaphong Sep 10, 2022
f48e24c
Update windows-test.yml
wannaphong Sep 10, 2022
15471e2
Update core.py
wannaphong Sep 13, 2022
7479baa
Add list support in crfcut.py
wannaphong Sep 13, 2022
218fd27
Merge pull request #703 from PyThaiNLP/dev
wannaphong Sep 13, 2022
8e31d6e
Update __version__
wannaphong Sep 13, 2022
ff77556
Change open encoding
wannaphong Sep 14, 2022
bb2c7b4
Merge pull request #697 from PyThaiNLP/add-word_detokenize
wannaphong Sep 15, 2022
d823121
Update README.md
wannaphong Sep 15, 2022
6add701
Move LST20 Perceptron model
wannaphong Sep 16, 2022
37e4d2e
Add about lst20
wannaphong Sep 16, 2022
638c28d
Merge pull request #705 from PyThaiNLP/move-model
wannaphong Sep 16, 2022
e727c87
Add pythainlp.parse.dependency_parsing
wannaphong Sep 16, 2022
5a5e975
Update docker_requirements.txt
wannaphong Sep 16, 2022
e1d1b34
Update installation.rst
wannaphong Sep 16, 2022
c883542
Add transformers_ud
wannaphong Sep 17, 2022
1218777
Update code
wannaphong Sep 17, 2022
e8e68a7
Update transformers_ud.py
wannaphong Sep 17, 2022
53b9aff
Update core.py
wannaphong Sep 17, 2022
e9b5ffb
Update requirements
wannaphong Sep 17, 2022
e2a3404
Merge pull request #706 from PyThaiNLP/add-dependency-parser
wannaphong Sep 17, 2022
ff8db54
Add tag for dependency_parsing
wannaphong Sep 17, 2022
b1e34c7
PyThaiNLP v3.1.0-dev2
wannaphong Sep 18, 2022
296df1f
PyThaiNLP v3.1.0-dev3
wannaphong Sep 18, 2022
18c8c50
Update warnings
wannaphong Sep 20, 2022
c49e1cd
PyThaiNLP v3.1.0-beta0
wannaphong Sep 20, 2022
ed43b54
PyThaiNLP v3.1.0
wannaphong Sep 24, 2022
f0b2e78
Update README
wannaphong Sep 24, 2022
859c9f1
Merge pull request #713 from PyThaiNLP/v3.1.0
wannaphong Sep 24, 2022
52d1c88
Update README.md
wannaphong Sep 24, 2022
fc60cd3
Update README_TH.md
wannaphong Sep 24, 2022
59 changes: 59 additions & 0 deletions .github/workflows/windows-test.yml
@@ -0,0 +1,59 @@
name: Windows Unit test and code coverage

on:
  push:
    paths-ignore:
      - '**.md'
      - 'docs/**'
  pull_request:
    branches:
      - dev
    paths-ignore:
      - '**.md'
      - 'docs/**'

jobs:
  build:

    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [windows-latest]
        python-version: [3.8]

    steps:
    - uses: actions/checkout@v2
    - uses: conda-incubator/setup-miniconda@v2
      with:
        python-version: ${{ matrix.python-version }}
        auto-activate-base: true
        auto-update-conda: true
    - shell: powershell
      run: |
        conda info
        conda list
    - name: Install PyTorch
      shell: powershell
      run: |
        pip install torch==1.8.1
    - name: Install dependencies
      shell: powershell
      run: |
        python -m pip install --disable-pip-version-check --user --upgrade pip setuptools
        python -m pip --version
        python -m pip install pytest coverage coveralls
        conda install -y -c conda-forge fairseq
        python -m pip install https://www.dropbox.com/s/o6p2sj5z50iim1e/PyICU-2.3.1-cp38-cp38-win_amd64.whl?dl=1
        python -m pip install -r docker_requirements.txt
        python -m pip install .[full]
        python -m nltk.downloader omw-1.4
        python -m pip install spacy deepcut
    - name: Test
      shell: powershell
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        COVERALLS_SERVICE_NAME: github
      run: |
        coverage run -m unittest discover
        coveralls
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
-FROM python:3.7-slim-buster
+FROM python:3.8-slim-buster

 COPY . .

6 changes: 3 additions & 3 deletions README.md
@@ -20,12 +20,12 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร

 **News**

-> Now, You can contact or ask any questions you encounter with the PyThaiNLP team. <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
+> Now, You can contact or ask any questions with the PyThaiNLP team. <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>

 | Version | Description | Status |
 |:------:|:--:|:------:|
-| [3.0](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/545) |
-| [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 3.1 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/643) |
+| [3.1](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/643) |
+| [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 3.2 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/714) |


 ## Getting Started
6 changes: 3 additions & 3 deletions README_TH.md
@@ -16,12 +16,12 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร

 **ข่าวสาร**

->นับตั้งแต่ PyThaiNLP 3.0 พวกเราจะยุติการสนับสนุน Python 3.6 แล้ว หากคุณจำเป็นต้องใช้ PyThaiNLP บน Python 3.6 คุณสามารถใช้ PyThaiNLP 2.3.1 ได้
+> คุณสามารถพูดคุยหรือแชทกับทีม PyThaiNLP หรือผู้สนับสนุนคนอื่น ๆ ได้ที่ <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>

 | รุ่น | คำอธิบาย | สถานะ |
 |:------:|:--:|:------:|
-| [3.0](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/545 |
-| [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 3.1 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/643) |
+| [3.1](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/643) |
+| [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 3.2 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/714) |

 ติดตามพวกเราบน [PyThaiNLP Facebook page](https://www.facebook.com/pythainlp/) เพื่อรับข่าวสารเพิ่มเติม

1 change: 1 addition & 0 deletions SECURITY.md
@@ -4,6 +4,7 @@

 | Version | Supported |
 | ------- | ------------------ |
+| 3.1.x | :white_check_mark: |
 | 3.0.x | :white_check_mark: |
 | 2.3.x | :white_check_mark: |
 | 2.2.x | :x: |
5 changes: 4 additions & 1 deletion docker_requirements.txt
@@ -12,7 +12,7 @@ sentencepiece==0.1.91
 ssg==0.0.8
 torch==1.8.1
 fastai==1.0.61
-transformers==4.8.2
+transformers==4.22.1
 phunspell==0.1.6
 spylls==0.1.5
 symspellpy==6.7.6
@@ -31,3 +31,6 @@ thai-nner==0.3
 spacy==2.3.*
 wunsen==0.0.3
 khanaa==0.0.6
+spacy_thai==0.7.1
+esupar==1.3.8
+ufal.chu-liu-edmonds==1.0.2
10 changes: 10 additions & 0 deletions docs/api/parse.rst
@@ -0,0 +1,10 @@
.. currentmodule:: pythainlp.parse

pythainlp.parse
===============
The :class:`pythainlp.parse` is dependency parsing for Thai.

Modules
-------

.. autofunction:: dependency_parsing
3 changes: 3 additions & 0 deletions docs/api/soundex.rst
@@ -11,6 +11,7 @@ Modules
 .. autofunction:: lk82
 .. autofunction:: udom83
 .. autofunction:: metasound
+.. autofunction:: prayut_and_somchaip

 References
 ----------
@@ -23,3 +24,5 @@ References
    Master Thesis. Chulalongkorn University, Thailand.

 .. [#lk82] วิชิต หล่อจีระชุณห์กุล และ เจริญ คุวินทร์พันธุ์. `โปรแกรมการสืบค้นคำไทยตามเสียงอ่าน (Thai Soundex) <http://guru.sanook.com/1520/>`_.
+
+.. [#prayut_and_somchaip] Prayut Suwanvisat, Somchai Prasitjutrakul. Thai-English Cross-Language Transliterated Word Retrieval using Soundex Technique. In 1998 [cited 2022 Sep 8]. Available from: https://www.cp.eng.chula.ac.th/~somchai/spj/papers/ThaiText/ncsec98-clir.pdf
1 change: 1 addition & 0 deletions docs/api/tokenize.rst
@@ -12,6 +12,7 @@ Modules
 .. autofunction:: sent_tokenize
 .. autofunction:: subword_tokenize
 .. autofunction:: word_tokenize
+.. autofunction:: word_detokenize
 .. autoclass:: Tokenizer
     :members:

6 changes: 5 additions & 1 deletion docs/notes/installation.rst
@@ -31,7 +31,11 @@ where ``extras`` can be
 - ``tltk`` (to support tltk)
 - ``textaugment`` (to support text augmentation)
 - ``oskut`` (to support OSKUT)
-- ``nlpo3`` (to support nlpo3 enging)
+- ``nlpo3`` (to support nlpo3 engine)
+- ``spacy_thai`` (to support spacy_thai engine)
+- ``esupar`` (to support esupar engine)
+- ``transformers_ud`` (to support transformers_ud engine)
+- ``dependency_parsing`` (to support dependency parsing with all engine)
 - ``full`` (install everything)

 For dependency details, look at `extras` variable in `setup.py <https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py>`_.
147 changes: 147 additions & 0 deletions notebooks/word_detokenize.ipynb
@@ -0,0 +1,147 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pythainlp.tokenize import word_detokenize"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ผมเลี้ยง 5 ตัว'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_detokenize([\"ผม\",\"เลี้ยง\",\"5\",\"ตัว\"])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['ผม', 'เลี้ยง', ' ', '5', ' ', 'ตัว']]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_detokenize([\"ผม\",\"เลี้ยง\",\" \",\"5\",\"ตัว\"],\"list\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ผมเลี้ยง 5 5 ตัว'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_detokenize([\"ผม\",\"เลี้ยง\",\"5\",\"5\",\"ตัว\"])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ภาษาไทยหรือภาษาไทยกลางเป็นภาษาในกลุ่มภาษาไทซึ่งเป็นกลุ่มย่อยของตระกูลภาษาขร้า - ไท และเป็นภาษาราชการและภาษาประจำชาติของประเทศไทย'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_detokenize(['ภาษาไทย', 'หรือ', 'ภาษาไทย', 'กลาง', 'เป็น', 'ภาษา', 'ใน', 'กลุ่ม', 'ภาษา', 'ไท', 'ซึ่ง', 'เป็น', 'กลุ่มย่อย', 'ของ', 'ตระกูล', 'ภาษา', 'ข', 'ร้า', '-', 'ไท', 'และ', 'เป็น', 'ภาษาราชการ', 'และ', 'ภาษาประจำชาติ', 'ของ', 'ประเทศ', 'ไทย'])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'ผมเลี้ยง 5 5 ตัว ๆ คนดี'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_detokenize([\"ผม\",\"เลี้ยง\",\"5\",\"5\",\"ตัว\",\"ๆ\",\"คน\",\"ดี\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.12 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "48b90c76b600d2ec6cf3e350b23a5df9176e3eef7b22ad90377f14c1de9c1bf6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
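Reviewer note: the notebook outputs above suggest that `word_detokenize` inserts a space wherever a Thai token meets a non-Thai token (such as a number) and around the repetition mark `ๆ`. The following is a simplified sketch of that behaviour, not the library's implementation. The function name `detokenize_sketch`, the helper `starts_with_thai`, and the use of the Unicode Thai block as the "is Thai" test are all assumptions made for illustration.

```python
from typing import List

# Thai repetition mark; the notebook shows it surrounded by spaces.
THAI_REPEAT_MARK = "ๆ"


def starts_with_thai(token: str) -> bool:
    """Assumption: a token counts as Thai if its first character is in U+0E00..U+0E7F."""
    return "\u0e00" <= token[0] <= "\u0e7f"


def detokenize_sketch(tokens: List[str]) -> str:
    """Join tokens, adding a space at Thai/non-Thai boundaries and around the repeat mark."""
    tokens = [t for t in tokens if t]  # drop empty strings so token[0] is safe
    pieces = []
    for i, tok in enumerate(tokens):
        if i > 0:
            prev = tokens[i - 1]
            # Never add a space next to a token that is already whitespace.
            if not tok.isspace() and not prev.isspace():
                at_boundary = not starts_with_thai(tok) or not starts_with_thai(prev)
                at_mark = tok == THAI_REPEAT_MARK or prev == THAI_REPEAT_MARK
                if at_boundary or at_mark:
                    pieces.append(" ")
        pieces.append(tok)
    return "".join(pieces)


print(detokenize_sketch(["ผม", "เลี้ยง", "5", "ตัว"]))  # ผมเลี้ยง 5 ตัว
```

This sketch reproduces the numeric and `ๆ` examples from the notebook. It does not attempt the list output mode or any rules added in later commits of this PR.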
2 changes: 1 addition & 1 deletion pythainlp/__init__.py
@@ -4,7 +4,7 @@
 # Copyright (C) 2016-2022 PyThaiNLP Project
 # URL: <https://pythainlp.github.io/>
 # For license information, see LICENSE
-__version__ = "3.1.0-dev1"
+__version__ = "3.1.0"

 thai_consonants = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ" # 44 chars

14 changes: 8 additions & 6 deletions pythainlp/corpus/core.py
@@ -47,7 +47,7 @@ def get_corpus_db_detail(name: str, version: str = None) -> dict:
     :return: details about a corpus
     :rtype: dict
     """
-    with open(corpus_db_path(), "r", encoding="utf-8") as f:
+    with open(corpus_db_path(), "r", encoding="utf-8-sig") as f:
         local_db = json.load(f)

     if version is None:
@@ -148,11 +148,13 @@ def get_corpus_default_db(name: str, version: str = None) -> Union[str, None]:
     )


-def get_corpus_path(name: str, version: str = None) -> Union[str, None]:
+def get_corpus_path(name: str, version: str = None, force: bool = False) -> Union[str, None]:
     """
     Get corpus path.

     :param str name: corpus name
+    :param str version: version
+    :param bool force: force download
     :return: path to the corpus or **None** of the corpus doesn't \
              exist in the device
     :rtype: str
@@ -202,7 +204,7 @@ def get_corpus_path(name: str, version: str = None) -> Union[str, None]:
     corpus_db_detail = get_corpus_db_detail(name, version=version)

     if not corpus_db_detail or not corpus_db_detail.get("filename"):
-        download(name, version=version)
+        download(name, version=version, force=force)
         corpus_db_detail = get_corpus_db_detail(name, version=version)

     if corpus_db_detail and corpus_db_detail.get("filename"):
@@ -213,7 +215,7 @@ def get_corpus_path(name: str, version: str = None) -> Union[str, None]:
             path = get_full_data_path(corpus_db_detail.get("filename"))
         # check if the corpus file actually exists, download if not
         if not os.path.exists(path):
-            download(name)
+            download(name, version=version, force=force)
         if os.path.exists(path):
             return path

@@ -378,7 +380,7 @@ def download(

     # check if corpus is available
     if name in corpus_db:
-        with open(corpus_db_path(), "r", encoding="utf-8") as f:
+        with open(corpus_db_path(), "r", encoding="utf-8-sig") as f:
             local_db = json.load(f)

         corpus = corpus_db[name]
@@ -509,7 +511,7 @@ def remove(name: str) -> bool:
     if _CHECK_MODE == "1":
         print("PyThaiNLP is read-only mode. It can't remove corpus.")
         return False
-    with open(corpus_db_path(), "r", encoding="utf-8") as f:
+    with open(corpus_db_path(), "r", encoding="utf-8-sig") as f:
         db = json.load(f)
     data = [
         corpus for corpus in db["_default"].values() if corpus["name"] == name
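
Reviewer note: the hunks above switch every read of the corpus database from `encoding="utf-8"` to `encoding="utf-8-sig"`. The diff does not state the motivation, so the following is only a sketch of one plausible reason: a JSON file saved with a UTF-8 byte order mark (common with some Windows editors) cannot be parsed when opened as plain `utf-8`, while `utf-8-sig` strips the mark and still reads BOM-free files. The helper names `load_json` and `demo` are hypothetical and use only the standard library.

```python
import json
import os
import tempfile


def load_json(path: str, encoding: str) -> dict:
    """Open a JSON file with the given text encoding and parse it."""
    with open(path, "r", encoding=encoding) as f:
        return json.load(f)


def demo() -> tuple:
    """Return (plain_utf8_succeeded, data_read_with_utf8_sig) for a BOM-prefixed file."""
    # A small JSON document preceded by the UTF-8 byte order mark EF BB BF.
    payload = b"\xef\xbb\xbf" + json.dumps({"_default": {}}).encode("utf-8")
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        try:
            load_json(path, "utf-8")
            plain_ok = True
        except json.JSONDecodeError:
            # The decoded text starts with U+FEFF, which the json module rejects.
            plain_ok = False
        sig_data = load_json(path, "utf-8-sig")
    finally:
        os.remove(path)
    return plain_ok, sig_data


print(demo())  # (False, {'_default': {}})
```

Because `utf-8-sig` treats the mark as optional on read, the change should be safe for existing database files that have no BOM.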
2 changes: 1 addition & 1 deletion pythainlp/corpus/oscar.py
@@ -25,7 +25,7 @@ def word_freqs() -> List[Tuple[str, int]]:
     """
     word_freqs = []
     _path = get_corpus_path(_FILENAME)
-    with open(_path, "r", encoding="utf-8") as f:
+    with open(_path, "r", encoding="utf-8-sig") as f:
         _data = [i for i in f.readlines()]
     del _data[0]
     for line in _data:
1 change: 0 additions & 1 deletion pythainlp/corpus/pos_lst20_perceptron-v0.2.3.json

This file was deleted.
