Kurdish Llama

This project is an attempt to fine-tune Meta's Llama model for Central Kurdish. The original Llama model was later fine-tuned on a set of instructions provided by Stanford's Alpaca project.

Another project, GPT-4-LLM, took the same set of instructions from the Alpaca project and generated the outputs with GPT-4 rather than the original text-davinci-003.

Most of the hard work has already been done by these projects. The goal of this project is to translate the dataset into Central Kurdish using an NLLB model and then fine-tune Llama on the translated data. The resulting fine-tuned model, KurdishLlama, can be used for various natural language processing tasks in Central Kurdish.

Stay tuned for updates on the progress of this project!

Translating the Dataset

To translate the dataset, run the following command:

python translate_data.py ./data/alpaca_gpt4_data.json ./data/alpaca_gpt4_ckb.json

This command will use an NLLB model to translate the Alpaca project's GPT-4 data to Central Kurdish, and save the translated data to a new file called alpaca_gpt4_ckb.json.
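The translation step can be pictured roughly as follows. This is a minimal sketch, not the contents of translate_data.py. It assumes the Alpaca record layout (instruction, input, output), the Hugging Face transformers library, and the facebook/nllb-200-distilled-600M checkpoint; the README does not say which NLLB checkpoint is used. The helper names translate_records and make_nllb_translator are hypothetical.

```python
FIELDS = ("instruction", "input", "output")  # keys used by Alpaca-style records


def translate_records(records, translate):
    """Return copies of the records with every non-empty text field translated.

    `translate` is any callable mapping an English string to a translated string.
    """
    translated = []
    for record in records:
        new_record = dict(record)  # leave the original record untouched
        for field in FIELDS:
            text = record.get(field, "")
            if text.strip():  # empty "input" fields are common; skip them
                new_record[field] = translate(text)
        translated.append(new_record)
    return translated


def make_nllb_translator(model_name="facebook/nllb-200-distilled-600M"):
    """Build an English -> Central Kurdish translator (checkpoint is an assumption)."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # "ckb_Arab" is the FLORES-200 code for Central Kurdish in Arabic script.
    target_id = tokenizer.convert_tokens_to_ids("ckb_Arab")

    def translate(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        tokens = model.generate(
            **inputs, forced_bos_token_id=target_id, max_new_tokens=512
        )
        return tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]

    return translate
```

With these helpers, translating a loaded JSON list would look like `translate_records(records, make_nllb_translator())`. Passing the translator in as a callable keeps the record handling easy to check without downloading a model.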

The NLLB model has a tendency to produce erroneous translations where it repeats a single word throughout. To address this issue, the data_cleaning.py script will remove any instances where a single word is repeated consecutively at least three times.
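The repetition filter described above could be sketched as below. This is an illustration under stated assumptions rather than the code in data_cleaning.py: it treats whitespace-separated tokens as words and drops any record in which one word appears three or more times in a row. Whether the real script drops the whole record or only the repeated run is not stated in this README. The names has_repeated_run and drop_degenerate are hypothetical.

```python
def has_repeated_run(text, min_repeats=3):
    """Return True if the same word occurs at least `min_repeats` times in a row."""
    words = text.split()
    run = 1  # length of the current run of identical words
    for previous, current in zip(words, words[1:]):
        run = run + 1 if current == previous else 1
        if run >= min_repeats:
            return True
    return False


def drop_degenerate(records, fields=("instruction", "input", "output")):
    """Keep only records whose text fields contain no repeated-word run.

    Assumption: a record with a degenerate translation is dropped entirely.
    """
    return [
        record
        for record in records
        if not any(has_repeated_run(record.get(field, "")) for field in fields)
    ]
```

Splitting on whitespace rather than using a regular expression sidesteps word-boundary questions for Arabic-script text, at the cost of treating a word with attached punctuation as a different token.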

Generation

To test the model, you can either run the generate.py script or use the Inference notebook (Inference.ipynb).
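For orientation, models fine-tuned on Alpaca-style data are usually prompted with the template from the Alpaca project. The sketch below builds that English template. It is an assumption that generate.py uses this or a similar template (possibly translated); the function name build_prompt is hypothetical.

```python
def build_prompt(instruction, input_text=""):
    """Format an instruction (and optional input) with the Stanford Alpaca template."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )
```

The model's answer is whatever it generates after the final "### Response:" marker.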
