update entities #2809

Merged 3 commits into LAION-AI:main from fix/add-entities on Apr 21, 2023

Conversation

CloseChoice (Collaborator):
Add entities, which are referenced in model/model_training/custom_datasets/formatting.py:

AF = "af" # Afrikaans
AK = "ak" # Akan
SQ = "sq" # Albanian
AM = "am" # Amharic

Collaborator:
Are these all BCP 47 language codes? We should document which standard is used; the human-demonstration collection backend uses BCP 47. We might also think about using a library such as https://pypi.org/project/langcodes/.

CloseChoice (Collaborator, Author):
This is the ISO 639-1 standard. langcodes could be an option if we want even more control over the languages.
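
For illustration, validation with langcodes could look roughly like this (a sketch assuming the langcodes PyPI package; `normalize_lang` is a hypothetical helper, not code from this PR):

import langcodes

def normalize_lang(tag: str) -> str:
    # Reject tags that are not valid BCP 47, then canonicalize
    # casing and separators, e.g. "pt_br" -> "pt-BR".
    if not langcodes.tag_is_valid(tag):
        raise ValueError(f"unknown language tag: {tag}")
    return langcodes.standardize_tag(tag)

print(normalize_lang("pt_br"))   # pt-BR
print(normalize_lang("nb-NO"))   # nb-NO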

Collaborator:
OK, then we need to check and change this, because these should match the language codes of the backend, right? There we use BCP 47: https://en.wikipedia.org/wiki/IETF_language_tag

Collaborator:
Here is a result directly from our DB:

postgres=# select distinct lang from message;
 lang  
-------
 ar
 bg
 bn
 ca
 cs
 da
 de
 el
 en
 eo
 es
 eu
 fa
 fi
 fr
 gl
 he
 hu
 id
 it
 ja
 ko
 nb-NO
 nl
 pl
 pt-BR
 ro
 ru
 sk
 sv
 th
 tr
 uk-UA
 vi
 zh
(35 rows)

Collaborator:
To give some background on why this is necessary: Portuguese is VERY different in Brazil and Portugal, much more so than, for example, British and American English. Native speakers clearly explained to us that these two variants need to be separated.
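
For example, langcodes resolves the regional variants directly (assuming the langcodes package; outputs shown as comments):

import langcodes

# BCP 47 region subtags keep the two variants distinct.
print(langcodes.Language.get("pt-BR").display_name())  # Portuguese (Brazil)
print(langcodes.Language.get("pt-PT").display_name())  # Portuguese (Portugal)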

CloseChoice (Collaborator, Author):
Changed it to langcodes and removed the Language enum.

andreaskoepf pushed a commit that referenced this pull request on Apr 21, 2023:
- add a script which can be used to check whether any dataset contains
specific regular expressions or words
- update vicuna so that it can be used with the new `DatasetEntry` class
- remove single references from vicuna (so `[1]` is removed; with the
script mentioned above I found a couple of occurrences where our
`re_reference_remove` regex hits but the match is actually list indexing,
e.g. in languages like Python, therefore only single references are
removed)
- if the human response is None, sample from multiple answers like
`['please continue', '...']` (the reference removal and sampling are
sketched below)

This PR depends on #2809
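
A toy sketch of the reference-removal and continuation-sampling ideas from the commit message above (the regex pattern and the helper names are assumptions for illustration; only the name `re_reference_remove` and the example answers come from the commit message):

from typing import Optional
import random
import re

# Assumed pattern: matches standalone citation markers such as "[1]".
# The real re_reference_remove in the repo may differ.
re_reference_remove = re.compile(r"\[\d+\]")

def remove_single_references(text: str) -> str:
    # Strip citation markers; list indexing like a[1] in code samples
    # is exactly the false positive the commit message warns about.
    return re_reference_remove.sub("", text)

# If the human response is None, sample a generic continuation instead.
CONTINUATIONS = ["please continue", "..."]

def human_turn(response: Optional[str]) -> str:
    return response if response is not None else random.choice(CONTINUATIONS)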

andreaskoepf (Collaborator) left a comment:
lgtm! :-)

@andreaskoepf merged commit 8e404f0 into LAION-AI:main on Apr 21, 2023
1 check passed
@CloseChoice deleted the fix/add-entities branch on Apr 21, 2023 at 22:14