update entities #2809

Merged 3 commits into LAION-AI:main from fix/add-entities on Apr 21, 2023

Conversation

CloseChoice (Collaborator):
Add entities, which are referenced in model/model_training/custom_datasets/formatting.py:

AF = "af" # Afrikaans
AK = "ak" # Akan
SQ = "sq" # Albanian
AM = "am" # Amharic

Collaborator:
Are these all BCP 47 language codes? We should document which standard is used; the human-demonstration collection backend uses BCP 47. We might also think about using a library such as https://pypi.org/project/langcodes/.

CloseChoice (Collaborator, Author):
This is the ISO 639-1 standard. langcodes could be an option if we want even more control over the languages.
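
For illustration, validation with langcodes could look roughly like this (a sketch assuming the langcodes PyPI package; `normalize_lang` is a hypothetical helper, not code from this PR):

import langcodes

def normalize_lang(tag: str) -> str:
    # Reject tags that are not valid BCP 47, then canonicalize
    # casing and separators, e.g. "pt_br" -> "pt-BR".
    if not langcodes.tag_is_valid(tag):
        raise ValueError(f"unknown language tag: {tag}")
    return langcodes.standardize_tag(tag)

print(normalize_lang("pt_br"))   # pt-BR
print(normalize_lang("nb-NO"))   # nb-NO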

Collaborator:
OK, then we need to check and change this, because these should match the language codes of the backend, right? There we use BCP 47: https://en.wikipedia.org/wiki/IETF_language_tag

Collaborator:
Here is a result directly from our DB:

postgres=# select distinct lang from message;
 lang  
-------
 ar
 bg
 bn
 ca
 cs
 da
 de
 el
 en
 eo
 es
 eu
 fa
 fi
 fr
 gl
 he
 hu
 id
 it
 ja
 ko
 nb-NO
 nl
 pl
 pt-BR
 ro
 ru
 sk
 sv
 th
 tr
 uk-UA
 vi
 zh
(35 rows)

Collaborator:
To give some background on why this is necessary: Portuguese is VERY different in Brazil and Portugal, much more so than, for example, British and American English. Native speakers clearly explained to us that these two variants need to be separated.
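
For example, langcodes resolves the regional variants directly (assuming the langcodes package; outputs shown as comments):

import langcodes

# BCP 47 region subtags keep the two variants distinct.
print(langcodes.Language.get("pt-BR").display_name())  # Portuguese (Brazil)
print(langcodes.Language.get("pt-PT").display_name())  # Portuguese (Portugal)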

CloseChoice (Collaborator, Author):
Changed it to langcodes and removed the Language enum.

andreaskoepf pushed a commit that referenced this pull request on Apr 21, 2023:
- add a script which can be used to check whether any dataset contains
specific regular expressions or words
- update vicuna so that it can be used with the new `DatasetEntry` class
- remove single references from vicuna (so `[1]` is removed; with the
script mentioned above I found a couple of occurrences where our
`re_reference_remove` regex hits but the match is actually list indexing,
e.g. in languages like Python, therefore only single references are
removed)
- if the human response is None, sample from multiple answers like
`['please continue', '...']` (the reference removal and sampling are
sketched below)

This PR depends on #2809
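
A toy sketch of the reference-removal and continuation-sampling ideas from the commit message above (the regex pattern and the helper names are assumptions for illustration; only the name `re_reference_remove` and the example answers come from the commit message):

from typing import Optional
import random
import re

# Assumed pattern: matches standalone citation markers such as "[1]".
# The real re_reference_remove in the repo may differ.
re_reference_remove = re.compile(r"\[\d+\]")

def remove_single_references(text: str) -> str:
    # Strip citation markers; list indexing like a[1] in code samples
    # is exactly the false positive the commit message warns about.
    return re_reference_remove.sub("", text)

# If the human response is None, sample a generic continuation instead.
CONTINUATIONS = ["please continue", "..."]

def human_turn(response: Optional[str]) -> str:
    return response if response is not None else random.choice(CONTINUATIONS)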

andreaskoepf (Collaborator) left a comment:
lgtm! :-)

@andreaskoepf merged commit 8e404f0 into LAION-AI:main on Apr 21, 2023
1 check passed
@CloseChoice deleted the fix/add-entities branch on Apr 21, 2023 at 22:14