Supporting new language #2974
Comments
To add a language, you simply need to translate the site. Here are a few pull requests that show how to do it: https://github.com/LAION-AI/Open-Assistant/pull/1390/files
I mean adding support for a new language to the LLM, not to the site.
The two are equivalent. If you translate the site, OA will start collecting data in the new language, and the LLM could then be tuned with that data in the future.
I think that amount of data is not enough. For an LLM to understand Farsi, it needs to see at least 10 GB of Persian text, which is fully available on Wikipedia. Are there any plans to officially support Farsi?
If you have data in Farsi, you can add an import script in the data folder: https://github.com/LAION-AI/Open-Assistant/tree/main/data/datasets Unfortunately, Wikipedia is only good for training base models, not for fine-tuning dialogue models like OA. For OA you need dialogue data. But you could extend the Tatoeba import script to cover Farsi relatively easily.
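As a rough illustration of what such an import script might do, here is a minimal sketch that turns English/Farsi sentence pairs (the kind of data Tatoeba provides) into instruction/response rows. The helper name `pairs_to_rows`, the prompt wording, and the metadata fields are hypothetical, and the column names are an assumption about the layout the datasets guide asks for, so check that guide before relying on them.

```python
import json


def pairs_to_rows(pairs, source="tatoeba"):
    """Convert (english, farsi) sentence pairs into dialogue-style rows.

    Hypothetical helper: the column names below are assumed, not taken
    from the Open-Assistant repository.
    """
    rows = []
    for english, farsi in pairs:
        english, farsi = english.strip(), farsi.strip()
        # Skip pairs where either side is empty after trimming.
        if not english or not farsi:
            continue
        rows.append(
            {
                "INSTRUCTION": f"Translate the following sentence into Farsi: {english}",
                "RESPONSE": farsi,
                "SOURCE": source,
                # Metadata is stored as a JSON string; keep Persian text readable.
                "METADATA": json.dumps(
                    {"language": "fa", "task": "translation"}, ensure_ascii=False
                ),
            }
        )
    return rows


# Example input: one valid pair and one pair with an empty English side.
rows = pairs_to_rows([("Hello.", "سلام."), ("  ", "خالی")])
```

Translation pairs only give single-turn exchanges, so this would complement, not replace, real dialogue data collected through the translated site.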
Currently, the Open-Assistant model doesn't support Farsi. This is a text-only dataset for learning Farsi (Persian). One of my friends fine-tuned LLaMA on this dataset, and it could understand Farsi grammar and word usage very well. If the Open-Assistant team wants to add support for Farsi, this should be the first step. I have transformed the dataset into the format described [here](https://projects.laion.ai/Open-Assistant/docs/data/datasets) and uploaded it to [my huggingface account](https://huggingface.co/datasets/pourmand1376/fa-wikipedia). - #2974
How can support for a new language be added? The chat performs very well in English, but it does not support many languages, including my native one, Farsi (Persian). How can we add a language to the system ourselves?
Suppose, as a small-scale scenario, that we can collect Persian data and a sentence-ranking dataset ourselves.