update entities #2809
Conversation
```python
AF = "af"  # Afrikaans
AK = "ak"  # Akan
SQ = "sq"  # Albanian
AM = "am"  # Amharic
```
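For context, the lines in the diff above appear to be members of a string enum that maps names to ISO 639-1 codes. A minimal reconstruction of such an enum (an excerpt only; the actual class in the PR has many more members, and per a later comment it was ultimately removed in favor of `langcodes`):

```python
from enum import Enum

class Language(str, Enum):
    """ISO 639-1 two-letter language codes (excerpt)."""
    AF = "af"  # Afrikaans
    AK = "ak"  # Akan
    SQ = "sq"  # Albanian
    AM = "am"  # Amharic

# Members compare equal to their string values thanks to the str mixin.
print(Language.SQ.value)  # → sq
```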
Are these all BCP 47 language codes? We should document which standard is used; the human-demonstration collection backend uses BCP 47. We might also consider using a library such as https://pypi.org/project/langcodes/ ?
This is the ISO 639-1 standard. `langcodes` could be an option if we want even more control over the languages.
OK, then we need to check and change this, because they should match the language codes of the backend, right? There we use BCP 47: https://en.wikipedia.org/wiki/IETF_language_tag
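To illustrate the mismatch being discussed: ISO 639-1 codes are bare two-letter codes, while BCP 47 tags may add a region subtag (e.g. `pt-BR`). A stdlib-only sketch of extracting the primary language subtag from a BCP 47 tag (the helper name and tag list are illustrative, not project code):

```python
def primary_subtag(tag: str) -> str:
    """Return the primary language subtag of a BCP 47 tag.

    BCP 47 tags are hyphen-separated; the first subtag is the
    language code (usually ISO 639-1), e.g. 'pt-BR' -> 'pt'.
    """
    return tag.split("-")[0].lower()

# Tags as they appear in the backend (BCP 47):
backend_tags = ["pt-BR", "nb-NO", "uk-UA", "de", "en"]
print([primary_subtag(t) for t in backend_tags])
# → ['pt', 'nb', 'uk', 'de', 'en']
```

Note that collapsing `pt-BR` to `pt` would lose the Brazil/Portugal distinction, which is exactly why the backend keeps the region subtag (see the comment below about Brazilian vs. European Portuguese).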
Here is a result directly from our DB:
```
postgres=# select distinct lang from message;
 lang
-------
 ar
 bg
 bn
 ca
 cs
 da
 de
 el
 en
 eo
 es
 eu
 fa
 fi
 fr
 gl
 he
 hu
 id
 it
 ja
 ko
 nb-NO
 nl
 pl
 pt-BR
 ro
 ru
 sk
 sv
 th
 tr
 uk-UA
 vi
 zh
(35 rows)
```
To give some background on why this is necessary: Portuguese is VERY different in Brazil and Portugal, much more so than, for example, British and American English. Native speakers clearly explained to us that these two need to be kept separate.
Changed it to `langcodes` and removed the `Language` enum.
- add a script which can be used to check whether any dataset contains specific regular expressions or words
- update vicuna so that it can be used with the new `DatasetEntry` class
- remove single references from vicuna (so `[1]` is removed; using the script mentioned above I found a couple of occurrences where our `re_reference_remove` regex hits but the match is actually a list, e.g. in a language like Python, therefore only single references are removed)
- if the human response is none, then sample from multiple answers like `['please continue', '...']`

This PR depends on #2809
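The first bullet's dataset check might look roughly like the sketch below. The regex shown is an assumption for illustration; the project's actual `re_reference_remove` pattern lives in the repo and may differ:

```python
import re

# Assumed pattern: matches single bracketed references like '[1]'.
# The real re_reference_remove regex in the repo may be stricter.
re_reference_remove = re.compile(r"\[\d+\]")

def find_matches(texts):
    """Report (index, match) pairs for texts containing the pattern.

    Useful for spotting false positives, e.g. Python list literals
    such as '[1]' that look like citation markers.
    """
    hits = []
    for i, text in enumerate(texts):
        for m in re_reference_remove.finditer(text):
            hits.append((i, m.group(0)))
    return hits

samples = [
    "As shown in [1], the model converges.",     # a real citation
    "x = [1]  # a Python list, not a citation",  # false positive
]
print(find_matches(samples))  # → [(0, '[1]'), (1, '[1]')]
```

The second sample is exactly the kind of hit described above: the regex fires, but the match is code rather than a reference, so blind removal would corrupt the example.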
lgtm! :-)
Add entities, which are referenced in `model/model_training/custom_datasets/formatting.py`