Update pos_tag_transformers function #865

Merged (2 commits) on Nov 25, 2023

@pavaris-pm (Contributor) commented Nov 14, 2023

What does this change

Following #866, I've updated the pos_tag_transformers function: cleaned up the code, added a docstring, fixed a deprecation, and changed the output format so that it matches the other taggers in PyThaiNLP.

What was wrong

In #857, pos_tag_transformers was added with three models. However, to select an engine, its full model name had to be specified, and the output was not in the same format as the other taggers. For example:

pos_tag_transformers(words="แมวทำอะไรตอนห้าโมงเช้า", engine="bert-base-th-cased-blackboard")
# outputs
# [{'entity_group': 'NN', 'score': 0.910759, 'word': 'แมวมา', 'start': 0, 'end': 5},
#  {'entity_group': 'VV', 'score': 0.9462489, 'word': '##ทำ', 'start': 5, 'end': 7},
#  {'entity_group': 'NN', 'score': 0.8325567, 'word': '##อะไรตอนห้าโมงเช้า', 'start': 7, 'end': 24}]

The full model name is very hard for an ordinary user to remember, and the internal code will get messier if more transformers models trained on new corpora are added: we would end up with a long chain of if-else conditions just to select a model.

How this fixes it

I've cleaned up the code so that a user selects a model with the engine and corpus parameters, the same as in the existing pos_tag and pos_tag_sents functions, and I've fixed the output format. Users no longer need to remember the entire model name. Here is the new version:

from pythainlp.tag import pos_tag_transformers
txt = pos_tag_transformers(sentence="แมวทำอะไรตอนห้าโมงเช้า", engine="mdeberta", corpus='pud')
# outputs
# [[('แมว', 'NOUN'), ('ทําอะไร', 'VERB'), ('ตอนห้าโมงเช้า', 'NOUN')]]
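
The change in output format described above can be sketched roughly as follows. This is only an illustration, not the code merged in this PR: the helper name to_word_tag_pairs and the sample values are hypothetical, and the input is assumed to have the same shape as the list of dictionaries shown under "What was wrong".

```python
from typing import Dict, List, Tuple


def to_word_tag_pairs(pipeline_output: List[Dict]) -> List[List[Tuple[str, str]]]:
    """Hypothetical helper: turn token-classification dictionaries into
    a list containing one sentence of (word, tag) tuples."""
    pairs = []
    for item in pipeline_output:
        word = item["word"]
        # Drop the "##" sub-word continuation marker if it is present.
        if word.startswith("##"):
            word = word[2:]
        pairs.append((word, item["entity_group"]))
    # Wrap in an outer list so the result has one inner list per sentence.
    return [pairs]


# Illustrative values only; the keys mirror the old output shown earlier.
raw = [
    {"entity_group": "NN", "score": 0.91, "word": "แมว", "start": 0, "end": 3},
    {"entity_group": "VV", "score": 0.95, "word": "##ทำ", "start": 3, "end": 5},
]
print(to_word_tag_pairs(raw))
```

The scores and character offsets are dropped, which leaves only the information that the other taggers also return.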

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

@pep8speaks

Hello @pavaris-pm! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 183:19: W291 trailing whitespace
Line 186:2: E225 missing whitespace around operator
Line 193:101: E501 line too long (135 > 100 characters)
Line 194:101: E501 line too long (122 > 100 characters)
Line 196:101: E501 line too long (140 > 100 characters)
Line 226:15: E203 whitespace before ':'
Line 230:24: E203 whitespace before ':'
Line 231:19: E203 whitespace before ':'
Line 249:101: E501 line too long (107 > 100 characters)
Line 253:21: W292 no newline at end of file

Line 378:80: W291 trailing whitespace
Line 379:63: W292 no newline at end of file

sonarcloud bot commented Nov 14, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@pavaris-pm changed the title from "Update pos _tag_transformers function" to "Update pos_tag_transformers function" on Nov 14, 2023
Comment on lines +225 to +232
_blackboard_support_engine = {
"bert" : "lunarlist/pos_thai",
}

_pud_support_engine = {
"wangchanberta" : "Pavarissy/wangchanberta-ud-thai-pud-upos",
"mdeberta" : "Pavarissy/mdeberta-v3-ud-thai-pud-upos",
}
Contributor Author

Note that I've made a dictionary that maps the input engine to its model name, so that the internal code will be easier to maintain in the future.
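
A minimal sketch of how such a lookup could replace an if-else chain, assuming the two dictionaries are combined into one nested mapping keyed by corpus. The model names come from the dictionaries quoted in this review comment; the nested layout, the name _SUPPORTED_MODELS, and the function resolve_model_name are hypothetical and are not the code in this PR.

```python
from typing import Dict

# Model names are copied from the dictionaries in this PR; the nested
# corpus -> engine -> model layout is an assumption for illustration.
_SUPPORTED_MODELS: Dict[str, Dict[str, str]] = {
    "blackboard": {
        "bert": "lunarlist/pos_thai",
    },
    "pud": {
        "wangchanberta": "Pavarissy/wangchanberta-ud-thai-pud-upos",
        "mdeberta": "Pavarissy/mdeberta-v3-ud-thai-pud-upos",
    },
}


def resolve_model_name(engine: str, corpus: str) -> str:
    """Hypothetical helper: map short (engine, corpus) names to a full model name."""
    engines = _SUPPORTED_MODELS.get(corpus)
    if engines is None:
        raise ValueError(
            f"unsupported corpus {corpus!r}; choose from {sorted(_SUPPORTED_MODELS)}"
        )
    model_name = engines.get(engine)
    if model_name is None:
        raise ValueError(
            f"unsupported engine {engine!r} for corpus {corpus!r}; "
            f"choose from {sorted(engines)}"
        )
    return model_name


print(resolve_model_name("mdeberta", "pud"))
# prints: Pavarissy/mdeberta-v3-ud-thai-pud-upos
```

With this layout, supporting a model trained on a new corpus only means adding a dictionary entry, and an unsupported combination fails with a message that lists the valid choices.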

Member

Nice

I think in general we need to update/expand the naming convention to cover our function names and model names as well. Also to better enforce it.

Current naming convention for files
#141

Contributor Author

@bact Agreed! Setting a convention will greatly reduce complexity in future development, on both the user side and the contributor side. Would you like me to post this in the PyThaiNLP discussion section, so that we can inform other contributors and newcomers too? If that's OK, I'll post it right now (in Thai, of course, krub).

Member

Yes please. Very appreciated.

@bact added the labels "documentation" (improve documentation and test cases) and "refactoring" (a technical improvement which does not add any new features or change existing features) on Nov 14, 2023
Member

@wannaphong wannaphong left a comment

Thank you! 💯

@wannaphong wannaphong merged commit abfbf02 into PyThaiNLP:dev Nov 25, 2023
9 of 14 checks passed
@wannaphong wannaphong added this to the 5.0 milestone Nov 25, 2023
@wannaphong wannaphong added this to Done in PyThaiNLP Nov 25, 2023
4 participants