Finnish version of databricks-dolly-15k instruction dataset

TurkuNLP/dolly-fi

Finnish Dolly dataset

Finnish version of the databricks-dolly-15k instruction dataset, machine translated from the original English using DeepL.

Data

The data is in the file dolly-15k-fi.jsonl. Its format and uses match those of the original English dataset. For more information, see https://github.com/databrickslabs/dolly/tree/master/data.
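As an illustration of working with the file, here is a minimal sketch that reads a JSON Lines file into a list of dictionaries using only the Python standard library. The helper name `load_jsonl` is hypothetical and not part of this repository, and the sketch makes no assumption about field names: it only prints whatever keys the first record contains.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Assumes the dataset file sits in the current working directory.
    path = Path("dolly-15k-fi.jsonl")
    if path.exists():
        records = load_jsonl(path)
        print(len(records), "records")
        if records:
            # Inspect the available fields rather than assuming their names.
            print(sorted(records[0].keys()))
```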

Processing

The data was processed from the original as follows:

  1. Convert the original data from JSONL to DOCX files:

python3 jsonl2doc.py original-data/databricks-dolly-15k.jsonl

  2. Translate the DOCX files from dolly-doc-in/ using DeepL and save the outputs in dolly-doc-out/.

  3. Convert the translated DOCX files back to JSONL:

python3 doc2jsonl.py \
    --add-id \
    --include-original \
    original-data/databricks-dolly-15k.jsonl \
    dolly-doc-out/dolly-000*.docx \
    > dolly-15k-fi.jsonl
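After the conversion back to JSONL, a simple sanity check is to confirm that the translated file has as many records as the original. The sketch below is not part of this repository; the function names are hypothetical, and it assumes that each non-blank line is one record and that the pipeline produces exactly one translated record per original record.

```python
from pathlib import Path


def count_records(path):
    """Count non-blank lines (one JSON record per line) in a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def same_record_count(original_path, translated_path):
    """Return True if both JSONL files hold the same number of records."""
    return count_records(original_path) == count_records(translated_path)


if __name__ == "__main__":
    # Paths as used in the processing commands above.
    original = Path("original-data/databricks-dolly-15k.jsonl")
    translated = Path("dolly-15k-fi.jsonl")
    if original.exists() and translated.exists():
        print(same_record_count(original, translated))
```

A matching count does not show that the translations are aligned with the right originals, only that no records were dropped or duplicated along the way.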

License

This dataset is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).

Note that under the DeepL terms and conditions, this data may not be used to develop, market or train a machine translation algorithm.