Supporting new language #2974

Closed
amirjalaly opened this issue Apr 29, 2023 · 5 comments
Comments

@amirjalaly

How can support for a new language be added? The chat performs very well in English, but it lacks many languages, including my native one, Farsi (Persian). Is it possible for us to add a language to the system ourselves?
Suppose, as a small scenario, that we can collect Persian data and a sentence-ranking dataset ourselves.

@someone13574
Contributor

To add a language, you simply need to translate the site. Here are a few pull requests that show how to do it.

https://github.com/LAION-AI/Open-Assistant/pull/1390/files
https://github.com/LAION-AI/Open-Assistant/pull/2271/files
https://github.com/LAION-AI/Open-Assistant/pull/2386/files

@amirjalaly
Author

I mean adding support for a new language to the LLM, not the site.

@olliestanley
Collaborator

> I mean adding support for a new language to the LLM, not the site.

The two are equivalent. If you translate the site, OA will start collecting data in the new language, and the LLM could then be tuned with that data in the future.

@pourmand1376
Contributor

I think that amount of data is not enough. For an LLM to understand Farsi, it needs to see at least 10 GB of Persian text, which is fully available on Wikipedia. Are there any plans to officially support Farsi?

@stefangrotz
Contributor

stefangrotz commented Aug 1, 2023

If you have data in Farsi, you can add an import script in the data folder: https://github.com/LAION-AI/Open-Assistant/tree/main/data/datasets

Unfortunately, Wikipedia is only good for training base models, not for fine-tuning dialogue models like OA. For OA you need dialogue data. But you could extend the Tatoeba import script to cover Farsi relatively easily.
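
To illustrate what "dialogue data" could look like when derived from translation pairs, here is a minimal sketch in Python. It is not the actual Tatoeba import script from the repository. The function name `pairs_to_dialogue_rows`, the prompt wording, and the sample pairs are hypothetical, and the `INSTRUCTION`/`RESPONSE`/`SOURCE`/`METADATA` column names are an assumption about the dataset format, so check the data docs before relying on them.

```python
import json


def pairs_to_dialogue_rows(pairs, source="tatoeba"):
    """Turn (English, Farsi) sentence pairs into instruction/response rows.

    Hypothetical helper: the column names are an assumed target format,
    not something confirmed by this thread.
    """
    rows = []
    for english, farsi in pairs:
        english, farsi = english.strip(), farsi.strip()
        # Skip pairs where either side is empty after trimming.
        if not english or not farsi:
            continue
        rows.append(
            {
                "INSTRUCTION": f'Translate the following sentence into Farsi: "{english}"',
                "RESPONSE": farsi,
                "SOURCE": source,
                # Metadata is stored as a JSON string; keep Farsi readable.
                "METADATA": json.dumps({"language": "fa"}, ensure_ascii=False),
            }
        )
    return rows


# Made-up sample pairs, for illustration only; the last one is dropped.
sample_pairs = [("Hello.", "سلام."), ("Thank you.", "متشکرم."), ("", "خالی")]
rows = pairs_to_dialogue_rows(sample_pairs)
```

Each row pairs a prompt with a reply, which is the shape a dialogue model can be tuned on, whereas raw Wikipedia paragraphs have no prompt/reply structure.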

olliestanley pushed a commit that referenced this issue Aug 3, 2023
Currently, the Open-Assistant model doesn't support Farsi. This is a
text-only dataset for learning Farsi (Persian).

One of my friends fine-tuned LLaMA on this dataset, and it could
understand Farsi grammar and word usage very well. If the Open-Assistant
team wants to add support for Farsi, this should be the first step.

I have transformed the dataset into the standard that has been mentioned
[here](https://projects.laion.ai/Open-Assistant/docs/data/datasets) and
uploaded it to [my huggingface
account](https://huggingface.co/datasets/pourmand1376/fa-wikipedia).


- #2974
5 participants