Add gsm Dataset (#3149)
Resolves two of #3122

---

# Dataset Card for GSM QnA reasoning with ~8.8K entries.

### Dataset Summary

License: MIT. Contains Parquet of a list of instructions and answers (English
only). Reasoning, logic and programming.

Each row consists of

- INSTRUCTION
- RESPONSE
- SOURCE
- METADATA (json with language).

### Link:

https://huggingface.co/datasets/0x22almostEvil/reasoning-gsm-qna-oa

### Original Datasets are available here:

- https://huggingface.co/datasets/gsm8k
- https://huggingface.co/datasets/reasoning-machines/gsm-hard

Co-authored-by: 0x22almostEvil <0x22almostEvil>
echo0x22 committed May 13, 2023
1 parent 74994da commit d15cba1
Showing 3 changed files with 97 additions and 0 deletions.
1 change: 1 addition & 0 deletions data/datasets/__init__.py
@@ -30,6 +30,7 @@
    "instructional_codesearchnet_python": "Nan-Do/instructional_codesearchnet_python",
    "tatoeba_mt_qna_oa": "0x22almostEvil/tatoeba-mt-qna-oa",
    "reasoning_bg_oa": "0x22almostEvil/reasoning_bg_oa",
    "reasoning_gsm_qna_oa": "0x22almostEvil/reasoning-gsm-qna-oa",
}

SAFETY_DATASETS = {
22 changes: 22 additions & 0 deletions data/datasets/reasoning_gsm_qna_oa/README.MD
@@ -0,0 +1,22 @@
# Dataset Card for GSM QnA reasoning with ~8.8K entries.

### Dataset Summary

License: MIT. Contains Parquet of a list of instructions and answers (English
only). Reasoning, logic and programming.

Each row consists of

- INSTRUCTION
- RESPONSE
- SOURCE
- METADATA (json with language).

### Link:

https://huggingface.co/datasets/0x22almostEvil/reasoning-gsm-qna-oa

### Original Datasets are available here:

- https://huggingface.co/datasets/gsm8k
- https://huggingface.co/datasets/reasoning-machines/gsm-hard
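
The row layout described in the card can be read roughly as follows. This is a minimal sketch, not part of the commit: the row values are made up for illustration, and only the four column names and the `language` key come from the card.

```python
import json

# Hypothetical row in the layout the card describes; the values are invented.
row = {
    "INSTRUCTION": "What is 2 + 3? Solve with Python.",
    "RESPONSE": "def solution():\n    return 2 + 3",
    "SOURCE": "gsm-hard",
    "METADATA": json.dumps({"language": "en"}),
}

# METADATA is stored as a JSON string, so it has to be decoded before use.
metadata = json.loads(row["METADATA"])
language = metadata["language"]
print(language)
```

Storing METADATA as a string keeps every column a plain text type, at the cost of one `json.loads` call per row on the reading side.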
74 changes: 74 additions & 0 deletions data/datasets/reasoning_gsm_qna_oa/data_process.py
@@ -0,0 +1,74 @@
import json
import random
import re
from dataclasses import dataclass

import pandas as pd
from datasets import load_dataset

random.seed(42)

random_list_python = [
    "Make a python code.",
    "Make a python script. Only function.",
    "Write a solution in python.",
    "Solve with Python.",
    "Please, use python!",
    "Also, could you use python?",
    "Think and write in python.",
    "Write a function in python.",
    "Make a Python function.",
]

random_list_answer = [
    "\nAnswer is",
    "\nThe final answer:",
    "\nThe answer will be",
]


def qna_wrapper(source, random_list_python, random_list_answer):
    def create_qna(row):
        instruction = row["question"] if source == "gsm8k" else row["input"] + " " + random.choice(random_list_python)
        response = (
            re.sub(r"(<<[\d\.\-\+\*=/\\]+>>)", "", row["answer"].replace("####", random.choice(random_list_answer)))
            + "."
            if source == "gsm8k"
            else row["code"]
        )
        metadata = {
            "language": "en",
        }
        metadata_str = json.dumps(metadata)
        return QnA(instruction, response, source, metadata_str)

    return create_qna


@dataclass
class QnA:
    INSTRUCTION: str
    RESPONSE: str
    SOURCE: str
    METADATA: str


# load gsm8k & gsm-hard
dataset1 = load_dataset("gsm8k", "main", split="train")
print(dataset1)

dataset2 = load_dataset("reasoning-machines/gsm-hard", split="train")
print(dataset2)

# process gsm8k & gsm-hard
qna_list_1 = pd.DataFrame(dataset1).apply(qna_wrapper("gsm8k", random_list_python, random_list_answer), axis=1).tolist()
qna_list_2 = (
    pd.DataFrame(dataset2).apply(qna_wrapper("gsm-hard", random_list_python, random_list_answer), axis=1).tolist()
)

# merge gsm8k & gsm-hard
qna_list = qna_list_1 + qna_list_2

# convert to parquet
qna_df = pd.DataFrame(qna_list, columns=["INSTRUCTION", "RESPONSE", "SOURCE", "METADATA"])
qna_df.to_parquet("reasoning-gsm-qna.parquet", row_group_size=100, engine="pyarrow", index=False)
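
The gsm8k branch of `create_qna` can be tried in isolation with a sketch like the one below. It reuses the regular expression from the script, but the helper name `clean_gsm8k_answer`, the constant `CALC_ANNOTATION`, and the sample string are assumptions made for illustration and are not part of the commit. The answer phrase is fixed here instead of being drawn with `random.choice`, so the result is deterministic.

```python
import re

# Same pattern as in data_process.py: matches calculator annotations such as <<48/2=24>>.
CALC_ANNOTATION = re.compile(r"(<<[\d\.\-\+\*=/\\]+>>)")


def clean_gsm8k_answer(answer: str, answer_prefix: str) -> str:
    """Hypothetical helper mirroring the gsm8k branch of create_qna."""
    # 1. Replace the "####" marker with the chosen answer phrase.
    # 2. Strip the <<...>> annotations.
    # 3. Append a closing period.
    return CALC_ANNOTATION.sub("", answer.replace("####", answer_prefix)) + "."


# Sample written in the gsm8k answer style for illustration (not copied from the dataset).
sample = "She sold 48/2 = <<48/2=24>>24 clips in May.\n#### 24"
cleaned = clean_gsm8k_answer(sample, "\nAnswer is")
print(cleaned)

# An annotation containing characters outside the class (here, parentheses) is not matched.
untouched = CALC_ANNOTATION.sub("", "<<(2+3)*4=20>>20")
print(untouched)
```

The character class only covers digits, `.`, `-`, `+`, `*`, `=`, `/` and `\`, so any annotation with other characters would be left in the response unchanged. Whether the source data contains such annotations is not stated in the commit.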
