Add PhayaThaiBERT engine with new features [WIP] #873

Merged on Dec 11, 2023 (25 commits)

pavaris-pm (Contributor) commented Dec 1, 2023

What does this change?

Following #868, I have opened a new PR that adds a new folder named phayathaibert. Since PhayaThaiBERT is a new Thai language model, I decided to treat it the same way we treat wangchanberta, which also has its own folder. Besides introducing the new model, I have been experimenting with it and have added new features to pythainlp, together with their test cases (see the list in this PR description). Note that this PR also fixes the out-of-sync fork described in #871.

Will resolve #871 and fix #868.

List of newly added features from PhayaThaiBERT [WIP] 🚧 👷🏻‍♂️

⚠️ [NOTE] I will ask for a review once all tasks in the list below are complete.

After reading the paper, I identified the following tasks for which PhayaThaiBERT can be integrated into PyThaiNLP. The list below shows the current progress (a check mark means the feature is already in the source code; I will ask for your review once all of them are complete, krub):

  • Part-of-speech tagging
  • Named-entity-recognition
  • Tokenization
  • Data Augmentation (Text)

Upcoming features that can be added soon (future PR)

  • Word Correction (under heavy development)

etc. (I will keep adding to the list based on what I find during my experiments)

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code style and structure checks
  • Passed code linting checks and unit tests

pep8speaks commented Dec 1, 2023

Hello @pavaris-pm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-12-11 14:00:19 UTC

@pavaris-pm changed the title from "Added PhayaThaiBERT engine with new features [WIP] ⚠️" to "Added PhayaThaiBERT engine with new features [WIP]" on Dec 1, 2023
coveralls commented Dec 1, 2023

Coverage Status

coverage: 85.515% (-0.9%) from 86.41%
when pulling e7ef6ce on pavaris-pm:dev
into 6514bb8 on PyThaiNLP:dev.

@bact added the enhancement (enhance functionalities) label on Dec 4, 2023
@bact added this to the Future milestone on Dec 4, 2023
@bact added this to In progress in PyThaiNLP on Dec 4, 2023
@bact changed the title from "Added PhayaThaiBERT engine with new features [WIP]" to "Add PhayaThaiBERT engine with new features [WIP]" on Dec 5, 2023
pavaris-pm (Contributor, Author) commented Dec 10, 2023

@bact @wannaphong I have added all the features I found for PhayaThaiBERT to the source code and fixed the PEP 8 formatting.

As of today, these are the new features added:

  1. Part-of-speech tagging model trained by @MpolaarbearM (co-authored commit already made 👍🏻)
  2. Named-entity recognition model trained by @pavaris-pm
  3. Tokenization (although it uses the same tokenizer as WangchanBERTa, note that vocabulary expansion has been integrated into it), added by @pavaris-pm
  4. Data Augmentation (Text) by @pavaris-pm

Upcoming features that can be added soon (future PR)

  1. Word Correction (I found a promising paper, "Two-stage Thai Misspelling Correction based on Pre-trained Language Models", and need to research it further)
    """
    Two-stage Thai Misspelling Correction based on Pre-trained Language Models
    :See Also:
    * Paper: \
    https://ieeexplore.ieee.org/abstract/document/10202006
    * GitHub: \
    https://github.com/bookpanda/Two-stage-Thai-Misspelling-Correction-Based-on-Pre-trained-Language-Models

Given this, I think it is best to merge these 4 completed features first and leave the upcoming features (e.g. word correction) for the next PR, because it is better to bring the state-of-the-art Thai encoder-based model into production as soon as possible. Please review this PR and suggest any further improvements. If you are OK with this, you can approve and merge it, krub.

pavaris-pm (Contributor, Author) commented

@MpolaarbearM just letting you know that the co-authored commit has been made, krub. You can check it 😄

@pavaris-pm requested a review from bact on December 10, 2023 14:03
bact (Member) commented Dec 11, 2023

There were a few errors in the test suite (not related to your PR).
I have fixed them. Would you mind syncing your code with the latest from dev so the tests can be run again?
Thanks.

pavaris-pm (Contributor, Author) commented

> There were a few errors in the test suite (not related to your PR).
>
> I have fixed them. Would you mind syncing your code with the latest from dev so the tests can be run again?
>
> Thanks.

Roger that. I'll do it krub.

pavaris-pm (Contributor, Author) commented

@bact I have finished syncing.

Use UPPERCASE for constant
sonarcloud bot commented Dec 11, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information

@bact merged commit ff74b39 into PyThaiNLP:dev on Dec 11, 2023
9 of 14 checks passed
PyThaiNLP automation moved this from In progress to Done Dec 11, 2023
Labels: enhancement (enhance functionalities)
Projects: PyThaiNLP (Done)
Linked issue that merging this pull request may close: Add PhayaThaiBERT model into PyThaiNLP [WIP]
4 participants