Add Bulgarian Reasoning Dataset (#3137)
Part of this: #3122
https://huggingface.co/datasets/0x22almostEvil/reasoning_bg_oa
---
# Dataset Card for Bulgarian QnA Reasoning (~2.7K entries)

### Dataset Summary

Contains a Parquet file with a list of instructions and answers.

Each row consists of:

* INSTRUCTION
* RESPONSE
* SOURCE (reasoning_bg)
* METADATA (JSON with language, url, and id).

### The original dataset is available here:
* https://huggingface.co/datasets/reasoning_bg
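
A minimal sketch of loading the converted dataset with the `datasets` library and decoding the METADATA JSON string (the `train` split name is an assumption):

```python
import json

from datasets import load_dataset

# Load the converted dataset from the Hugging Face Hub (split name assumed to be "train").
ds = load_dataset("0x22almostEvil/reasoning_bg_oa", split="train")

row = ds[0]
metadata = json.loads(row["METADATA"])  # e.g. {"language": "bg", "url": "...", "id": "..."}
print(row["INSTRUCTION"])
print(row["RESPONSE"], metadata["language"])
```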

---------

Co-authored-by: 0x22almostEvil <0x22almostEvil>
0x22almostEvil committed May 13, 2023
1 parent f80f556 commit b1a6a15
Showing 3 changed files with 70 additions and 0 deletions.
1 change: 1 addition & 0 deletions data/datasets/__init__.py
@@ -29,6 +29,7 @@
"ru_riddles_337": "0x22almostEvil/ru-riddles-377",
"instructional_codesearchnet_python": "Nan-Do/instructional_codesearchnet_python",
"tatoeba_mt_qna_oa": "0x22almostEvil/tatoeba-mt-qna-oa",
"reasoning_bg_oa": "0x22almostEvil/reasoning_bg_oa",
}

SAFETY_DATASETS = {
21 changes: 21 additions & 0 deletions data/datasets/reasoning_bg_oa/README.MD
@@ -0,0 +1,21 @@
# Dataset Card for Bulgarian QnA Reasoning (~2.7K entries)

### Dataset Summary

- License: apache-2.0
- Contains a Parquet file with a list of instructions and answers in Bulgarian.

Each row consists of:

- INSTRUCTION
- RESPONSE
- SOURCE (reasoning_bg)
- METADATA (JSON with language, url, and id).

### Link:

- https://huggingface.co/datasets/0x22almostEvil/reasoning_bg_oa

### The original dataset is available here:

- https://huggingface.co/datasets/reasoning_bg
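
For reference, a minimal pandas sketch of inspecting the Parquet file that `data_process.py` (below) writes, assuming the output filename `reasoning-bg-oa.parquet`:

```python
import json

import pandas as pd

# Read the Parquet file produced by data_process.py (filename assumed).
df = pd.read_parquet("reasoning-bg-oa.parquet")
print(df.columns.tolist())  # ['INSTRUCTION', 'RESPONSE', 'SOURCE', 'METADATA']

# METADATA is stored as a JSON string; decode it per row.
metadata = df["METADATA"].apply(json.loads)
print(metadata.iloc[0]["language"])  # expected: "bg"
```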
48 changes: 48 additions & 0 deletions data/datasets/reasoning_bg_oa/data_process.py
@@ -0,0 +1,48 @@
import json
from dataclasses import dataclass

import pandas as pd
from datasets import concatenate_datasets, load_dataset

configs = ["biology-12th", "philosophy-12th", "geography-12th", "history-12th", "history-quiz"]
datasets = []


@dataclass
class QnA:
    INSTRUCTION: str
    RESPONSE: str
    SOURCE: str
    METADATA: str


# Format a source row as a QnA record.
def create_qna(row):
    instruction = f'{row["question"]} {", ".join(row["answers"])}?'
    # Strip "(", ")" and ";" from the correct-answer label.
    response = row["correct"].translate(str.maketrans("", "", "();"))
    source = "reasoning_bg"
    metadata = {
        "language": "bg",
        "url": f'{row["url"]}',
        "id": f'{row["id"]}',
    }
    metadata_str = json.dumps(metadata)
    return QnA(instruction, response, source, metadata_str)


# merge dataset configs into one
for config in configs:
    dataset = load_dataset("reasoning_bg", config, split="train")
    datasets.append(dataset)

merged_dataset = concatenate_datasets(datasets)

print(merged_dataset)

# convert the dataset to a pandas dataframe
df = pd.DataFrame(merged_dataset)

qna_list = df.apply(create_qna, axis=1).tolist()

qna_df = pd.DataFrame(qna_list, columns=["INSTRUCTION", "RESPONSE", "SOURCE", "METADATA"])
qna_df.to_parquet("reasoning-bg-oa.parquet", row_group_size=100, engine="pyarrow", index=False)
