Because efficient code generation is a new branch of code generation, we curate a new dataset of programming problems for it, called ECG, for fine-tuning and evaluation. Our model is fine-tuned on this dataset.
ECG draws on the APPS dataset (Hendrycks et al., 2021) and the CodeContests dataset (Li et al., 2022). We describe the dataset creation process and the ideas behind it in detail.
The ECG dataset is divided into train, dev, and test sets in a ratio of 8:1:1. The train set has 3021 folders, and the dev and test sets each have 377 folders. Each folder corresponds to one problem, and the folders are sorted by problem difficulty from easy to challenging. Appendix A describes in detail what each problem folder contains. The derived datasets ECG-CG, ECG-mini, and ECG-clone are split in the same way as ECG.
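As a quick arithmetic check, the reported folder counts can be compared against the stated 8:1:1 ratio. The sketch below uses only the counts given above; the variable names are illustrative and not part of the dataset.

```python
# Folder counts reported for the ECG splits.
counts = {"train": 3021, "dev": 377, "test": 377}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print each split's share of the whole dataset.
for name, share in shares.items():
    print(f"{name}: {counts[name]} problems, {share:.1%} of {total}")
```

The shares come out at roughly 80%, 10%, and 10%, which matches the stated ratio up to rounding.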
Each problem folder in the ECG dataset contains the following:
- acc_soltuions folder. It contains the better alternative solutions, sampled at intervals across the range of running times so as to cover as many solution ideas as possible. The number of better solutions is proportional to the number of timeout solutions, capped at 5, and they are sorted by running time in ascending order.
- acc_tle_soltuions folder. It contains the inefficient timeout solutions, chosen as the slowest solutions that still complete the problem within the specified time. The number of timeout solutions is proportional to the number of better solutions, capped at 10, and they are sorted by running time in ascending order.
- accepted.txt. We first select, from all solutions to the current problem, the code with the shortest running time, and then, among the filtered results, the code with the smallest memory usage. This is the solution we regard as the best for the current problem. This filtering method optimizes the code effectively without falling into the trap of trading space for time.
- accepted run time.txt. It contains the running time and memory usage of the accepted code.
- tags.txt. It contains the problem tags, i.e., the classification tags for the problem. The tags indicate which methods may be needed to solve the problem (e.g., "greedy" or "recursive"). The last tag is a numerical rating of the problem's difficulty, from 800 (easiest) to 3500 (most challenging).
- URL.txt. It contains the URL of the problem description and solution code. Code generated by the model can be submitted to the corresponding website and evaluated under its judging mechanism.
- question.txt. This is the complete natural language description. It is divided into five parts: title, problem description body, input description, output description, and I/O sample tests with notes. The parts are separated by two line breaks.
- Title.txt. This is the title part of the split natural language description.
- Problem Description Body.txt. This is the central part of the split natural language description, i.e., the body of the problem description.
- Input description.txt. This is the input description part of the split natural language description.
- Output describing.txt. This is the output description part of the split natural language description.
- I/O sample testing and note description.txt. This is the I/O sample test and note part of the split natural language description.
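The split of question.txt into its five parts can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes "two line breaks" means a blank line (`"\n\n"`), that the first four parts contain no blank lines themselves, and that any further blank lines belong to the final sample-and-note part. The names `PART_NAMES` and `split_question` and the sample text are hypothetical.

```python
# Hypothetical labels for the five parts, in the order listed above.
PART_NAMES = ("title", "body", "input", "output", "samples_and_notes")


def split_question(text: str) -> dict[str, str]:
    """Split a question.txt string into its five parts.

    Assumption: parts are separated by a blank line, and only the last
    part (samples and notes) may itself contain blank lines.
    """
    parts = text.strip().split("\n\n", len(PART_NAMES) - 1)
    if len(parts) != len(PART_NAMES):
        raise ValueError(f"expected {len(PART_NAMES)} parts, found {len(parts)}")
    return {name: part.strip() for name, part in zip(PART_NAMES, parts)}


# Made-up problem text, used only to exercise the function.
sample = (
    "A. Example Title\n\n"
    "Given two integers, print their sum.\n\n"
    "One line with two integers a and b.\n\n"
    "One line with a + b.\n\n"
    "Input\n1 2\nOutput\n3\n\nNote\nThe sum of 1 and 2 is 3."
)
parts = split_question(sample)
print(parts["title"])
```

If a problem body does contain blank lines, this simple split would misassign the parts, which is one reason the pre-split per-part files are convenient.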
All of the above files are in txt format. The better alternative solutions in the acc_soltuions folder and the inefficient solutions in the acc_tle_soltuions folder follow the same naming rule, number,runtime,runspace.txt, for example 0,77 ms,284 KB.txt.
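The naming rule makes it possible to recover running time and memory usage from a file name alone. The sketch below parses such names and then applies the selection rule described for accepted.txt (shortest running time first, smallest memory usage as the tie-breaker). It assumes the units are always `ms` and `KB` with integer values, as in the example name; the helper names and the extra file names are hypothetical, and accepted.txt itself is chosen from all solutions to a problem rather than only from these folders.

```python
import re
from typing import NamedTuple


class SolutionFile(NamedTuple):
    index: int
    runtime_ms: int
    space_kb: int
    name: str


# Assumed pattern: "<number>,<runtime> ms,<space> KB.txt"
_NAME_PATTERN = re.compile(r"^(\d+),(\d+) ms,(\d+) KB\.txt$")


def parse_solution_name(name: str) -> SolutionFile:
    """Parse a solution file name into its index, runtime, and memory usage."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    index, runtime_ms, space_kb = (int(g) for g in match.groups())
    return SolutionFile(index, runtime_ms, space_kb, name)


def pick_fastest_then_smallest(names: list[str]) -> SolutionFile:
    """Pick the shortest runtime, breaking ties by the smallest memory usage."""
    parsed = [parse_solution_name(n) for n in names]
    return min(parsed, key=lambda s: (s.runtime_ms, s.space_kb))


# The first name is the example from the text; the other two are made up.
names = ["0,77 ms,284 KB.txt", "1,77 ms,120 KB.txt", "2,93 ms,8 KB.txt"]
best = pick_fastest_then_smallest(names)
print(best.name)
```

Sorting on the tuple `(runtime_ms, space_kb)` keeps running time as the primary criterion, so a solution that uses less memory but runs slower is never preferred.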