Commit

Merge branch 'dev' into add-rhyme
wannaphong committed Oct 19, 2023
2 parents 1cd22ba + 9aabf57 commit 4f95b08
Showing 175 changed files with 2,288 additions and 1,366 deletions.
50 changes: 25 additions & 25 deletions CONTRIBUTING.md
@@ -7,36 +7,36 @@ Please refer to our [Contributor Covenant Code of Conduct](https://github.com/Py
## Issue Report and Discussion

- Discussion: https://github.com/PyThaiNLP/pythainlp/discussions
- GitHub issues (problems and suggestions): https://github.com/PyThaiNLP/pythainlp/issues
- Facebook group (not specific to PyThaiNLP, can be Thai NLP discussion in general): https://www.facebook.com/groups/thainlp
- GitHub issues (for problems and suggestions): https://github.com/PyThaiNLP/pythainlp/issues
- Facebook group (not specific to PyThaiNLP, for Thai NLP discussion in general): https://www.facebook.com/groups/thainlp


## Code

## Code Guidelines

- Follows [PEP8](http://www.python.org/dev/peps/pep-0008/), use [black](https://github.com/ambv/black) with `--line-length` = 79;
- Follow [PEP8](http://www.python.org/dev/peps/pep-0008/), use [black](https://github.com/ambv/black) with `--line-length` = 79;
- Name identifiers (variables, classes, functions, module names) with meaningful
and pronounceable names (`x` is always wrong);
- Please follow this [naming convention](https://namingconvention.org/python/). For example, global constant variables must be in `ALL_CAPS`;
<img src="https://i.stack.imgur.com/uBr10.png" />
- Write tests for your new features. Test suites are in `tests/` directory. (see "Testing" section below);
- Write tests for your new features. The test suite is in `tests/` directory. (see "Testing" section below);
- Run all tests before pushing (just execute `tox`) so you will know if your
changes broke something;
- Commented code is [dead
code](http://www.codinghorror.com/blog/2008/07/coding-without-comments.html);
- Commented out codes are [dead
codes](http://www.codinghorror.com/blog/2008/07/coding-without-comments.html);
- All `#TODO` comments should be turned into [issues](https://github.com/pythainlp/pythainlp/issues) in GitHub;
- When appropriate, use [f-String](https://www.python.org/dev/peps/pep-0498/)
- When appropriate, use [f-string](https://www.python.org/dev/peps/pep-0498/)
(use `f"{a} = {b}"`, instead of `"{} = {}".format(a, b)` and `"%s = %s' % (a, b)"`);
- All text files, including source code, must be ended with one empty line. This is [to please git](https://stackoverflow.com/questions/5813311/no-newline-at-end-of-file#5813359) and [to keep up with POSIX standard](https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline).
- All text files, including source codes, must end with one empty line. This is [to please git](https://stackoverflow.com/questions/5813311/no-newline-at-end-of-file#5813359) and [to keep up with POSIX standard](https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline).
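As a quick illustration of the naming and f-string guidelines above (the identifiers below are invented for the example, not taken from PyThaiNLP):

```python
# Illustrative only; these identifiers do not exist in PyThaiNLP.
MAX_RETRIES = 3  # global constant in ALL_CAPS


def format_pair(key: str, value: int) -> str:
    """Return a "key = value" string."""
    # f-string, preferred over "{} = {}".format(key, value) or "%s = %s" % (key, value)
    return f"{key} = {value}"
```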

### Version Control System

- We use [Git](http://git-scm.com/) as our [version control system](http://en.wikipedia.org/wiki/Revision_control),
so it may be a good idea to familiarize yourself with it.
- You can start with the [Pro Git book](http://git-scm.com/book/) (free!).

### Commit Comment
### Commit Message

- [How to Write a Git Commit Message](https://chris.beams.io/posts/git-commit/)
- [Commit Verbs 101: why I like to use this and why you should also like it.](https://chris.beams.io/posts/git-commit/)
@@ -45,24 +45,24 @@ so it may be a good idea to familiarize yourself with it.

- We use the famous [gitflow](http://nvie.com/posts/a-successful-git-branching-model/)
to manage our branches.
- When you do pull request on GitHub, Travis CI and AppVeyor will run tests
- When you create pull requests on GitHub, Github Actions and AppVeyor will run tests
and several checks automatically. Click the "Details" link at the end of
each check to see what needs to be fixed.


## Documentation

- We use [Sphinx](https://www.sphinx-doc.org/en/master/) to generate API document
automatically from "docstring" comments in source code. This means the comment
section in the source code is important for the quality of documentation.
- A docstring should start with one summary line, ended the line with a full stop (period),
then followed by a blank line before the start new paragraph.
- A commit to release branches (e.g. `2.2`, `2.1`) with a title **"(build and deploy docs)"** (without quotes) will trigger the system to rebuild the documentation files and upload them to the website https://pythainlp.github.io/docs
automatically from "docstring" comments in source codes. This means the comment
section in the source codes is important for the quality of documentation.
- A docstring should start with one summary line, end with one line with a full stop (period),
then be followed by a blank line before starting a new paragraph.
- A commit to release branches (e.g. `2.2`, `2.1`) with a title **"(build and deploy docs)"** (without quotes) will trigger the system to rebuild the documentation files and upload them to the website https://pythainlp.github.io/docs.
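For instance, a docstring following this convention could look like the sketch below (a hypothetical helper, not an actual PyThaiNLP function):

```python
def remove_tone_marks(text: str) -> str:
    """Remove Thai tone marks from the input text.

    The summary line above ends with a full stop and is separated from
    this longer description by a single blank line.
    """
    # U+0E48..U+0E4B are the four Thai tone marks.
    return "".join(ch for ch in text if ch not in "\u0e48\u0e49\u0e4a\u0e4b")
```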


## Testing

We use standard Python `unittest`. Test suites are in `tests/` directory.
We use standard Python `unittest`. The test suite is in `tests/` directory.

To run unit tests locally together with code coverage test:

@@ -81,12 +81,12 @@ Generate code coverage test in HTML (files will be available in `htmlcov/` direc

```
coverage html
```
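As a sketch, a module in `tests/` follows the standard `unittest` pattern; the file name and test case below are hypothetical:

```python
# tests/test_example.py (hypothetical file name)
import unittest

from pythainlp import word_tokenize


class TestTokenize(unittest.TestCase):
    def test_word_tokenize_returns_list(self):
        tokens = word_tokenize("ภาษาไทย")
        self.assertIsInstance(tokens, list)
        self.assertTrue(len(tokens) > 0)


if __name__ == "__main__":
    unittest.main()
```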

Make sure the same tests pass on Travis CI and AppVeyor.
Make sure the tests pass on both Github Actions and AppVeyor.


## Releasing
- We use [semantic versioning](https://semver.org/): MAJOR.MINOR.PATCH, with development build suffix: MAJOR.MINOR.PATCH-devBUILD
- Use [`bumpversion`](https://github.com/c4urself/bump2version/#installation) to manage versioning.
- We use [`bumpversion`](https://github.com/c4urself/bump2version/#installation) to manage versioning.
- `bumpversion [major|minor|patch|release|build]`
- Example:
```
```

@@ -129,18 +129,18 @@ Make sure the same tests pass on Travis CI and AppVeyor.
<img src="https://contributors-img.firebaseapp.com/image?repo=PyThaiNLP/pythainlp" />
</a>

Thanks all the [contributors](https://github.com/PyThaiNLP/pythainlp/graphs/contributors). (Image made with [contributors-img](https://contributors-img.firebaseapp.com))
Thanks to all [contributors](https://github.com/PyThaiNLP/pythainlp/graphs/contributors). (Image made with [contributors-img](https://contributors-img.firebaseapp.com))

### Development Lead
- Wannaphong Phatthiyaphaibun <wannaphong@yahoo.com> - founder, distribution and maintainance
- Korakot Chaovavanich - initial tokenization and soundex code
### Development Leads
- Wannaphong Phatthiyaphaibun <wannaphong@yahoo.com> - foundation, distribution and maintenance
- Korakot Chaovavanich - initial tokenization and soundex codes
- Charin Polpanumas - classification and benchmarking
- Peeradej Tanruangporn - documentation
- Arthit Suriyawongkul - refactoring, packaging, distribution, and maintainance
- Arthit Suriyawongkul - refactoring, packaging, distribution, and maintenance
- Chakri Lowphansirikul - documentation
- Pattarawat Chormai - benchmarking
- Thanathip Suntorntip - nlpO3 maintainance, Rust Developer
- Can Udomcharoenchaikit - documentation and code
- Thanathip Suntorntip - nlpO3 maintenance, Rust Developer
- Can Udomcharoenchaikit - documentation and codes

### Maintainers
- Arthit Suriyawongkul
4 changes: 2 additions & 2 deletions INTHEWILD.md
@@ -1,8 +1,8 @@
# Who uses PyThaiNLP?

We'd like to keep track of who is using the package. Please send a PR with your company name or @githubhandle or company name with @githubhandle.
We'd like to keep track of who are using the package. Please send a PR with your company name or @githubhandle or both company name and @githubhandle.

Currently, officially using PyThaiNLP:
Currently, those who are officially using PyThaiNLP are as follows:

1. [Hope Data Annotations Co., Ltd.](https://hopedata.org) ([@hopedataannotations](https://github.com/hopedataannotaions))
2. [Codustry (Thailand) Co., Ltd.](https://codustry.com) ([@codustry](https://github.com/codustry))
42 changes: 21 additions & 21 deletions README.md
@@ -13,13 +13,13 @@
<a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
</div>

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to [NLTK](https://www.nltk.org/) with focus on Thai language.
PyThaiNLP is a Python package for text processing and linguistic analysis, similar to [NLTK](https://www.nltk.org/) with a focus on the Thai language.

PyThaiNLP is a Python library for natural language processing, similar to NLTK, with a focus on the Thai language. [Thai-language details are available in README_TH.MD](https://github.com/PyThaiNLP/pythainlp/blob/dev/README_TH.md)

**News**

> Now, You can contact or ask any questions with the PyThaiNLP team. <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
> Now, You can contact with or ask any questions of the PyThaiNLP team. <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
| Version | Description | Status |
|:------:|:--:|:------:|
@@ -37,7 +37,7 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร

## Capabilities

PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via command-line interface.
PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via the command-line interface.

<details>
<summary>List of Features</summary>
@@ -48,11 +48,11 @@ PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech t
- Thai spelling suggestion and correction (`spell` and `correct`)
- Thai transliteration (`transliterate`)
- Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
- Thai collation (sort by dictionary order) (`collate`)
- Thai collation (sorted by dictionary order) (`collate`)
- Read out number to Thai words (`bahttext`, `num_to_thaiword`)
- Thai datetime formatting (`thai_strftime`)
- Thai-English keyboard misswitched fix (`eng_to_thai`, `thai_to_eng`)
- Command-line interface for basic functions, like tokenization and pos tagging (run `thainlp` in your shell)
- Command-line interface for basic functions, like tokenization and POS tagging (run `thainlp` in your shell)
</details>
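A minimal sketch of calling a few of the functions listed above (exact output depends on the engine and version in use):

```python
from pythainlp import word_tokenize
from pythainlp.tag import pos_tag
from pythainlp.util import bahttext

words = word_tokenize("ผมรักภาษาไทย")  # word segmentation
print(words)
print(pos_tag(words))     # part-of-speech tagging
print(bahttext(1234.75))  # read out an amount as Thai baht text
```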


@@ -67,7 +67,7 @@ This will install the latest stable release of PyThaiNLP.
Install different releases:

- Stable release: `pip install --upgrade pythainlp`
- Pre-release (near ready): `pip install --upgrade --pre pythainlp`
- Pre-release (nearly ready): `pip install --upgrade --pre pythainlp`
- Development (likely to break things): `pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip`

### Installation Options
@@ -92,27 +92,27 @@ pip install pythainlp[extra1,extra2,...]
- `wordnet` (for Thai WordNet API)
</details>

For dependency details, look at `extras` variable in [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).
For dependency details, look at the `extras` variable in [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).


## Data directory
## Data Directory

- Some additional data, like word lists and language models, may get automatically download during runtime.
- Some additional data, like word lists and language models, may be automatically downloaded during runtime.
- PyThaiNLP caches these data under the directory `~/pythainlp-data` by default.
- Data directory can be changed by specifying the environment variable `PYTHAINLP_DATA_DIR`.
- The data directory can be changed by specifying the environment variable `PYTHAINLP_DATA_DIR`.
- See the data catalog (`db.json`) at https://github.com/PyThaiNLP/pythainlp-corpus
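A small sketch of overriding the cache location from Python; setting the environment variable before importing PyThaiNLP is an assumption made here to be safe about when it is read:

```python
import os

# Point the cache away from the default ~/pythainlp-data (illustrative path).
os.environ["PYTHAINLP_DATA_DIR"] = "/tmp/pythainlp-cache"

import pythainlp  # noqa: E402  # imported after setting the variable

print(pythainlp.__version__)
```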


## Command-Line Interface

Some of PyThaiNLP functionalities can be used at command line, using `thainlp` command.
Some of PyThaiNLP functionalities can be used via command line with the `thainlp` command.

For example, displaying a catalog of datasets:
For example, to display a catalog of datasets:
```sh
thainlp data catalog
```

Showing how to use:
To show how to use:
```sh
thainlp help
```
@@ -122,16 +122,16 @@ thainlp help

| | License |
|:---|:----|
| PyThaiNLP Source Code and Notebooks | [Apache Software License 2.0](https://github.com/PyThaiNLP/pythainlp/blob/dev/LICENSE) |
| PyThaiNLP source codes and notebooks | [Apache Software License 2.0](https://github.com/PyThaiNLP/pythainlp/blob/dev/LICENSE) |
| Corpora, datasets, and documentations created by PyThaiNLP | [Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0)](https://creativecommons.org/publicdomain/zero/1.0/)|
| Language models created by PyThaiNLP | [Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/) |
| Other corpora and models that may included with PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |
| Other corpora and models that may be included in PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |


## Contribute to PyThaiNLP

- Please do fork and create a pull request :)
- For style guide and other information, including references to algorithms we use, please refer to our [contributing](https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md) page.
- Please fork and create a pull request :)
- For style guides and other information, including references to algorithms we use, please refer to our [contributing](https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md) page.

## Who uses PyThaiNLP?

@@ -140,13 +140,13 @@ You can read [INTHEWILD.md](https://github.com/PyThaiNLP/pythainlp/blob/dev/INTH

## Citations

If you use `PyThaiNLP` in your project or publication, please cite the library as follows
If you use `PyThaiNLP` in your project or publication, please cite the library as follows:

```
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354
```

or BibTeX entry:
or by BibTeX entry:

``` bib
@misc{pythainlp,
```

@@ -166,7 +166,7 @@ or BibTeX entry:
| Logo | Description |
| --- | ----------- |
| [![VISTEC-depa Thailand Artificial Intelligence Research Institute](https://airesearch.in.th/assets/img/logo/airesearch-logo.svg)](https://airesearch.in.th/) | Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by [VISTEC-depa Thailand Artificial Intelligence Research Institute](https://airesearch.in.th/). |
| [![MacStadium](https://i.imgur.com/rKy1dJX.png)](https://www.macstadium.com) | We get support free Mac Mini M1 from [MacStadium](https://www.macstadium.com) for doing Build CI. |
| [![MacStadium](https://i.imgur.com/rKy1dJX.png)](https://www.macstadium.com) | We get support of free Mac Mini M1 from [MacStadium](https://www.macstadium.com) for running CI builds. |

------

Expand All @@ -181,5 +181,5 @@ or BibTeX entry:
</div>

<div align="center">
<strong>Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.</strong>
<strong>Beware of malware if you use codes from mirrors other than the official two on GitHub and GitLab.</strong>
</div>
4 changes: 2 additions & 2 deletions docker_requirements.txt
@@ -1,7 +1,7 @@
PyYAML==5.4
numpy==1.22.*
python-crfsuite==0.9.7
requests==2.25.*
requests==2.31.*
gensim==4.0.*
nltk==3.6.6
emoji==0.5.2
@@ -37,4 +37,4 @@ ufal.chu-liu-edmonds==1.0.2
wtpsplit==1.0.1
fastcoref==2.1.6
panphon==0.20.0
sentence-transformers==2.2.2
68 changes: 56 additions & 12 deletions docs/api/augment.rst
@@ -3,23 +3,67 @@
pythainlp.augment
=================

The :class:`textaugment` is Thai text augment. This function for text augment task.
Introduction
------------

Modules
-------
The `pythainlp.augment` module is a powerful toolset for text augmentation in the Thai language. Text augmentation is a process that enriches and diversifies textual data by generating alternative versions of the original text. This module is a valuable resource for improving the quality and variety of Thai language data for NLP tasks.

TextAugment Class
-----------------

The central component of the `pythainlp.augment` module is the `TextAugment` class. This class provides various text augmentation techniques and functions to enhance the diversity of your text data. It offers the following methods:

.. autoclass:: pythainlp.augment.TextAugment
:members:

WordNetAug Class
----------------

The `WordNetAug` class is designed to perform text augmentation using WordNet, a lexical database for English. This class enables you to augment Thai text using English synonyms, offering a unique approach to text diversification. The following methods are available within this class:

.. autoclass:: pythainlp.augment.WordNetAug
:members:
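A hedged usage sketch, assuming `WordNetAug` exposes an `augment()` method as the description above suggests; check the generated API reference for the exact signature and return type:

```python
from pythainlp.augment import WordNetAug

aug = WordNetAug()
# The augment() call below is an assumption based on the class description.
print(aug.augment("เรียน"))
```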

Word2VecAug, Thai2fitAug, LTW2VAug Classes
------------------------------------------

The `pythainlp.augment.word2vec` package contains multiple classes for text augmentation using Word2Vec models. These classes include `Word2VecAug`, `Thai2fitAug`, and `LTW2VAug`. Each of these classes allows you to use Word2Vec embeddings to generate text variations. Explore the methods provided by these classes to understand their capabilities.

.. autoclass:: WordNetAug
:members:
.. autofunction:: postype2wordnet
.. autoclass:: pythainlp.augment.word2vec.Word2VecAug
:members:

.. autoclass:: pythainlp.augment.word2vec.Thai2fitAug
:members:

.. autoclass:: pythainlp.augment.word2vec.LTW2VAug
:members:

FastTextAug and Thai2transformersAug Classes
--------------------------------------------

The `pythainlp.augment.lm` package offers classes for text augmentation using language models. These classes include `FastTextAug` and `Thai2transformersAug`. These classes allow you to use language model-based techniques to diversify text data. Explore their methods to understand their capabilities.

.. autoclass:: pythainlp.augment.lm.FastTextAug
:members:

.. autoclass:: pythainlp.augment.lm.Thai2transformersAug
:members:

BPEmbAug Class
--------------

The `pythainlp.augment.word2vec.bpemb_wv` package contains the `BPEmbAug` class, which is designed for text augmentation using subword embeddings. This class is particularly useful when working with subword representations for Thai text augmentation.

.. autoclass:: pythainlp.augment.word2vec.bpemb_wv.BPEmbAug
:members:

Additional Functions
--------------------

To further enhance your text augmentation tasks, the `pythainlp.augment` module offers the following functions:

- `postype2wordnet`: This function maps part-of-speech tags to WordNet-compatible POS tags, facilitating the integration of WordNet augmentation with Thai text.

These functions and classes provide diverse techniques for text augmentation in the Thai language, making this module a valuable asset for NLP researchers, developers, and practitioners.

For detailed usage examples and guidelines, please refer to the official PyThaiNLP documentation. The `pythainlp.augment` module opens up new possibilities for enriching and diversifying Thai text data, leading to improved NLP models and applications.