Skip to content

FlagOpen/TACO

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

46 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

TACO(Topics in Algorithmic COde generation dataset)

๐Ÿค— Hugging Faceย ย  | ย ย  BAAI DataHubย ย  | ย ย  Paper


TACO (Topics in Algorithmic COde generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more challenging training dataset and evaluation benchmark for the code generation model field. The dataset consists of programming competition problems that are more difficult and closer to real programming scenarios. It emphasizes improving or evaluating the model's understanding and reasoning abilities in practical application scenarios, rather than just implementing predefined function functionalities.

  • Larger scale: TACO includes a training set (25,443 problems) and a test set (1,000 problems), making it the largest code generation dataset currently available.
  • Higher quality: Each problem in the TACO dataset is designed to match a diverse set of solution answers, with answer sizes of up to 1.55M. This ensures that the model is not prone to overfitting during training and validates the effectiveness of evaluation results.
  • Fine-grained labels: Each problem in the TACO dataset includes fine-grained labels such as task topics, algorithms, skills, and difficulty levels. These labels provide more accurate references for the training and evaluation of code generation models.

News and Updates

  • 2024.04.11 Hello world!๐ŸŒโœจ We're excited to announce that all models from our TACO research are now available on Hugging Face!๐Ÿš€ You can download our SFT models from the following link: https://huggingface.co/FlagOpen ๐Ÿ“ฅ๐Ÿค—. Please note that our models have been trained solely on the training split of the TACO dataset. Happy coding! ๐Ÿ’ป๐ŸŽ‰

Download and Use

๐Ÿค— Hugging Face

First, install the datasets package.

pip install -U datasets

Then, load the dataset with the following program.

from datasets import load_dataset
taco = load_dataset('BAAI/TACO', token=YOUR_HF_TOKEN)
  • You can also specify the split ("train" or "test") through
    from datasets import load_dataset
    taco = load_dataset('BAAI/TACO', split='train', token=YOUR_HF_TOKEN)
  • You can also specify the difficulties (a list choosing from ["EASY", "MEDIUM", "MEDIUM_HARD", "HARD", "VERY_HARD"] or ["ALL"] as default) or skills (a list choosing from ["Data structures", "Sorting", "Range queries", "Complete search", "Amortized analysis", "Dynamic programming", "Bit manipulation", "Greedy algorithms"] or ["ALL"] as default) by passing the list of difficulties or skills as a list.
    from datasets import load_dataset
    taco_difficulties = load_dataset('BAAI/TACO', difficulties=['EASY'], token=YOUR_HF_TOKEN)
    from datasets import load_dataset
    taco_skills = load_dataset('BAAI/TACO', skills=['Sorting', 'Range queries'], token=YOUR_HF_TOKEN)

BAAI DataHub First, download the dataset and unzip it into a folder named "BAAI-TACO." Then, load the dataset with the following program.

from datasets import load_from_disk
taco = load_from_disk(PATH_TO_BAAI-TACO)
  • You can also specify the split ("train" or "test") through
    from datasets import load_from_disk
    taco = load_from_disk(PATH_TO_BAAI-TACO)['train']
  • You can also specify the difficulties (a list choosing from ["EASY", "MEDIUM", "MEDIUM_HARD", "HARD", "VERY_HARD"] or ["ALL"] as default) or skills (a list choosing from ["Data structures", "Sorting", "Range queries", "Complete search", "Amortized analysis", "Dynamic programming", "Bit manipulation", "Greedy algorithms"] or ["ALL"] as default) by passing the list of difficulties or skills as a list.
    from datasets import load_from_disk
    difficulties=['EASY']
    taco = load_from_disk(PATH_TO_BAAI-TACO)
    taco_difficulties = taco.filter(lambda entry: entry['difficulty'] in difficulties)
    from datasets import load_from_disk
    skills=set(['Sorting', 'Range queries'])
    taco = load_from_disk(PATH_TO_BAAI-TACO)
    taco_skills = taco.filter(lambda entry: set(eval(entry['skill_types'])) & skills)

Statistics of TACO

Comparison Dimension TACO CodeContest APPS HumanEval(/-X) MBP(/X)P
Problem Scale (train/dev/test) 25443/-/1000 13328/117/165 5000/-/5000 -/-/164 374/-/500
No Answers in Test Set 0 43/165 1235/5000 0 0
Duplicate Questions No Duplication No Duplication No Duplication Duplicates Removed Duplicates Removed
Duplicate Answers Duplicates Removed No Duplication No Duplication Duplicates Removed Duplicates Removed
Test Cases/Problems 202.3 203.7 20.99 7.77 3
Task Topics Yes Yes No No No
Algorithm Tags Yes No No No No
Programming Skills Yes No No No No
Difficulty Tags Yes Yes Yes No No

The Distribution of Algorithm Tags in TACO is

The Distribution of Programming Skills in TACO is

Evaluation with TACO

First, you should initialize model, tokenizer as well as the difficulties or skills to use TACO.

# Initialize model and tokenizer
model_name = 'codellama/CodeLlama-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda:0"
model = model.to(device)


# Initialize evaluation dataset 
difficulties = ['ALL']
# difficulties = ["EASY", "MEDIUM", "MEDIUM_HARD", "HARD", "VERY_HARD"] 
# skills = ['ALL']
# skills = ["Data structures", "Sorting", "Range queries", "Complete search", "Amortized analysis", "Dynamic programming", "Bit manipulation", "Greedy algorithms"]

from datasets import load_dataset
taco = load_dataset('BAAI/TACO', split='test', difficulties=difficulties)
# taco = load_dataset('BAAI/TACO', split='test', skills=skills)

Then, run generations with code models.

# setting up times of run
n_samples = 200
temperature = 0.2
top_p = 0.95 
output = []
for idx, sample in enumerate(taco):
    prompt = sample['question']
    results = {"task_id": idx, "prompt": prompt}
    generations = []
    for i in range(n_samples):
        seed = i
        generation = predict(device, model, tokenizer, prompt, seed, top_p, temperature, max_length=2048)
        clean_code = truncate_after_eof_strings(generation)
        generations.append(clean_code)
    results["output"] = generations
    output.append(results)

generation.py gives a complete example of generate TACO result samples with CodeLlama, which outputs a JSON format file generation.json.

[
    {
        "task_id": 0,
        "prompt": "The city park of IT City contains n east to ...",
        "output": [
            "\ndef solve(n):\n    return n**5 - 10*n**4 + 40*n**3 ...",
            "\ndef solve(n):\n    return n**5 - 10*n**4 + 40*n**3 ...",
            ...
        ]
    },
    {
        "task_id": "1",
        "prompt": "Zookeeper is buying a carton of fruit to feed ...",
        "output": [
            "\ndef solve(n, s):\n    pre, suf, ans = [0]*n, [0]*n, ...",
            "\ndef solve(n, s):\n    pre, suf, ans = [0]*n, [0]*n, ...",
            ...
        ]
    },
    ...
]

Finally, execute the generated codes and compute metrics. compute_metric.py gives a complete example of code execution and pass@k computation with generation.json from last step.

The result file taco_metrics.json is like

{
    "pass@1": 0.0932,
    "pass@10": 0.1515,
    "pass@100": 0.1999,
    "detail" : {
        "pass@1": {
            "0": ...,
            "1": ...,
            ...
        },
        "pass@10": {
            "0": ...,
            "1": ...,
            ...
        },
        "pass@100": {
            "0": ...,
            "1": ...,
            ...
        },
    }
}

Finetuning with TACO

First, you should tokenize the training set of TACO. We provide a python script pretokenizing.py and an example shell script pretokenize.sh to help you. This step would output a pretokenized training data in cache_dir with the name of dataset_name. Below is an example to tokenize with CodeLlama-7b.

python pretokenizing.py \
    --tokenizer_dir codellama/CodeLlama-7b-hf \
    --cache_dir . \
    --dataset_name codellama_tokenized 

Then, finetune with the pretokenized training data. We provide a python script train.py and an example shell script finetune.sh to help you. This step would output the checkpoints in output_dir. Below is an example to finetuning CodeLlama-7b.

torchrun --nproc_per_node=8 --nnodes=1 train.py \
    --model_name_or_path codellama/CodeLlama-7b-hf \
    --data_path codellama_tokenized \
    --bf16 True \
    --output_dir codellama_ft \
    --num_train_epochs 2 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --warmup_ratio 0.1 \
    --logging_steps 1 \
    --resume_from_checkpoint True \
    --gradient_checkpointing True \
    --deepspeed ds_configs/deepspeed_z2_config_bf16.json

Evaluation Results

We conducted experiments using the TACO test set and training set on GPT-4 and a code generation model trained on a large amount of code data. The results show:

  • The TACO test set is highly challenging, with GPT-4 achieving a pass@1 score of only 31.5 at the easy level. Except for GPT-4, the pass@1 scores of various code models across five difficulty levels are generally below 10. Even the pass@100 scores are not as high as GPT-4's pass@1

  • Utilizing the TACO training set with fine-grained labels can selectively enhance the performance of code generation models. For instance, after fine-tuning starcoder-1b on specific skills using the TACO training set, there is a noticeable improvement in performance.

Citation

If you use the models, data, or code from this project, please cite the original paper:

@article{li2023taco,
  title={TACO: Topics in Algorithmic COde generation dataset},
  author={Rongao Li and Jie Fu and Bo-Wen Zhang and Tao Huang and Zhihong Sun and Chen Lyu and Guang Liu and Zhi Jin and Ge Li},
  journal={arXiv preprint arXiv:2312.14852},
  year={2023}
}

License

The TACO dataset that is authored by BAAI, Shandong Normal University and Peking University is released under an Apache 2.0 License. However, the data also includes content licensed under other permissive licenses such as MIT License, or web-crawled data which is used under the terms of the CC BY 4.0 license(Creative Commons Attribution 4.0 International license).

We gratefully acknowledge the contributions of the following: