GSM8K-AI-SubQ

This repository contains the GSM8K-AI-SubQ dataset, the scripts used to collect it, and the scripts for the baselines.

The dataset was created to support research on distilling the reasoning abilities of LLMs, particularly their ability to split a problem into simpler sub-problems. We employed ChatGPT to generate the dataset. It is based on the GSM8K dataset and includes examples of ChatGPT's problem decompositions together with ChatGPT's own feedback on the generated sub-questions. The data also includes ChatGPT's answers to the sub-questions, although we did not conduct experiments for this part of the reasoning. We hope that our dataset will support further advancement of offline RL algorithms in the area of reasoning.
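For illustration, each record pairs a GSM8K problem with ChatGPT's decomposition, ChatGPT's feedback on the generated sub-questions, and its answers to them. The sketch below uses hypothetical field names; the actual schema is documented in the dataset directory. The question shown is the well-known first example from GSM8K.

```python
# Hypothetical record layout -- the field names here are illustrative only;
# see the `dataset` directory for the actual schema.
record = {
    "question": "Natalia sold clips to 48 of her friends in April, and then "
                "she sold half as many clips in May. How many clips did "
                "Natalia sell altogether in April and May?",
    # ChatGPT's decomposition into simpler sub-problems
    "sub_questions": [
        "How many clips did Natalia sell in May?",
        "How many clips did Natalia sell altogether in April and May?",
    ],
    # ChatGPT's own feedback on the generated sub-questions
    "feedback": "positive",
    # ChatGPT's answers to the sub-questions (collected, but not used in our experiments)
    "sub_answers": ["24", "72"],
}
```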

For more details, see our paper "Distilling LLMs' Decomposition Abilities into Compact Language Models".

Repository structure

Each directory contains a README.md with relevant instructions and comments. All requirements can be installed with:

python3 -m pip install -r requirements.txt

  • baselines contains the scripts for the baseline algorithms: Behavioral Cloning (BC), Filtered BC, and ILQL (a rough sketch of the Filtered BC idea follows this list).
  • data_generation_and_evaluation contains the scripts and data required to generate the dataset, as well as scripts for evaluating the results.
  • dataset contains the GSM8K-AI-SubQ dataset itself.
  • eval_responses contains the test-set sub-questions generated by the different baselines, along with the answers of different language models to these sub-questions.
  • results_processing contains scripts for processing the results.
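As a rough sketch of how Filtered BC differs from plain BC (an illustration, not the actual training code in baselines): plain BC fine-tunes the decomposer on every decomposition in the dataset, while Filtered BC first drops decompositions whose quality signal indicates they are poor.

```python
# Illustrative sketch only -- the real training code lives in `baselines`.
# `records` is assumed to be a list of dicts with the hypothetical fields
# shown earlier; `is_positive` stands in for whatever quality signal
# (e.g. ChatGPT's feedback) the filter uses.
def bc_corpus(records):
    """Plain BC: train on every (question, sub-questions) pair."""
    return [(r["question"], r["sub_questions"]) for r in records]

def filtered_bc_corpus(records, is_positive):
    """Filtered BC: train only on decompositions the feedback marks as good."""
    return [(r["question"], r["sub_questions"]) for r in records if is_positive(r)]
```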

Evaluation results

In each table below, the columns correspond to the compact decomposer models (DistillGPT, GPT-2 small, GPT-2 medium) fine-tuned with the given algorithm to generate sub-questions; the model named in the table's title then answers those sub-questions. Each cell reports the resulting problem-solving accuracy on the test set (higher is better).

ChatGPT as sub-question answerer

| Algorithm   | DistillGPT | GPT-2 small | GPT-2 medium | Average |
|-------------|------------|-------------|--------------|---------|
| BC          | 0.476      | 0.508       | 0.538        | 0.507   |
| Filtered BC | 0.493      | 0.527       | 0.576        | 0.532   |
| ILQL-sparse | 0.474      | 0.513       | 0.531        | 0.506   |
| ILQL-full   | 0.482      | 0.505       | 0.533        | 0.507   |
| ChatGPT     | -          | -           | -            | 0.682   |

LLaMA 7B as sub-question answerer

| Algorithm   | DistillGPT | GPT-2 small | GPT-2 medium | Average |
|-------------|------------|-------------|--------------|---------|
| BC          | 0.118      | 0.154       | 0.164        | 0.145   |
| Filtered BC | 0.125      | 0.159       | 0.162        | 0.149   |
| ILQL-sparse | 0.122      | 0.141       | 0.164        | 0.142   |
| ILQL-full   | 0.123      | 0.147       | 0.163        | 0.144   |
| ChatGPT     | -          | -           | -            | 0.234   |

LLaMA 13B as sub-question answerer

| Algorithm   | DistillGPT | GPT-2 small | GPT-2 medium | Average |
|-------------|------------|-------------|--------------|---------|
| BC          | 0.184      | 0.212       | 0.247        | 0.214   |
| Filtered BC | 0.194      | 0.230       | 0.245        | 0.223   |
| ILQL-sparse | 0.178      | 0.204       | 0.247        | 0.210   |
| ILQL-full   | 0.183      | 0.205       | 0.247        | 0.212   |
| ChatGPT     | -          | -           | -            | 0.353   |

Mistral as sub-question answerer

| Algorithm   | DistillGPT | GPT-2 small | GPT-2 medium | Average |
|-------------|------------|-------------|--------------|---------|
| BC          | 0.240      | 0.264       | 0.290        | 0.265   |
| Filtered BC | 0.228      | 0.256       | 0.293        | 0.259   |
| ILQL-sparse | 0.223      | 0.253       | 0.288        | 0.255   |
| ILQL-full   | 0.235      | 0.252       | 0.282        | 0.256   |
| ChatGPT     | -          | -           | -            | 0.446   |

Average among sub-question answerers

| Algorithm   | DistillGPT | GPT-2 small | GPT-2 medium | Average |
|-------------|------------|-------------|--------------|---------|
| BC          | 0.255      | 0.284       | 0.310        | 0.283   |
| Filtered BC | 0.260      | 0.293       | 0.319        | 0.291   |
| ILQL-sparse | 0.249      | 0.278       | 0.308        | 0.278   |
| ILQL-full   | 0.256      | 0.277       | 0.306        | 0.280   |
| ChatGPT     | -          | -           | -            | 0.429   |
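The table above appears to be the element-wise mean of the four per-answerer tables (values agree up to rounding). A quick standalone check for the BC row:

```python
# Element-wise mean of the BC rows from the four per-answerer tables above;
# it should match the BC row of the averaged table up to rounding.
bc_rows = [
    [0.476, 0.508, 0.538, 0.507],  # ChatGPT as answerer
    [0.118, 0.154, 0.164, 0.145],  # LLaMA 7B as answerer
    [0.184, 0.212, 0.247, 0.214],  # LLaMA 13B as answerer
    [0.240, 0.264, 0.290, 0.265],  # Mistral as answerer
]
reported = [0.255, 0.284, 0.310, 0.283]  # BC row of the averaged table
computed = [sum(col) / len(col) for col in zip(*bc_rows)]
assert all(abs(c - r) < 1e-3 for c, r in zip(computed, reported))
```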

Citing

If you use our work in your research, please cite it with the following BibTeX entry:

@article{tarasov2024distilling,
  title={Distilling LLMs' Decomposition Abilities into Compact Language Models},
  author={Tarasov, Denis and Shridhar, Kumar},
  journal={arXiv preprint arXiv:2402.01812},
  year={2024}
}
