This repository contains code for cleaning your training data of benchmark data to help combat data snooping.

decontamination

This repository is heavily inspired by the BigCode repository and is mostly a refactoring of their code, which was written primarily by Chenghao Mou (awesome work!).

Install

pip install decontamination

How to use

First, specify which benchmarks you want to clean your data of. You do this by passing a list of benchmark names as they appear in Hugging Face's datasets repository. For example, to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:

# Set your Hugging Face access token in the environment first, e.g.:
#   export HF_ACCESS_TOKEN=<TOKEN>
from datasets import load_dataset
from decontamination.core import BenchmarkCleaner

# load your dataset
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

benchmarks = ["openai_humaneval", "lambada"]
cleaner = BenchmarkCleaner(benchmarks, "/tmp/benchmarks", threshold=0.1, num_perm=128)

# clean the dataset
cleaned_dataset = cleaner.clean(dataset, column="content", check_for_fp=True)
[01/24/23 00:27:37] INFO     Benchmark datasets already exist. Skipping hashing.                        core.py:181
Checking for false positives...: 100%|██████████| 8636/8636 [00:33<00:00, 261.25it/s]
Checking for false positives...: 100%|██████████| 8805/8805 [06:58<00:00, 21.06it/s]
Checking for false positives...: 100%|██████████| 8722/8722 [06:39<00:00, 21.82it/s]
Filtering duplicates... #0: 100%|██████████| 1/1 [00:00<00:00, 140.36ba/s]
...
Filtering duplicates... #31: 100%|██████████| 1/1 [00:00<00:00, 69.66ba/s]
[01/24/23 00:41:50] INFO     Data Number                   : 10000                                      core.py:277
                    INFO     Duplicate Number              : 3932                                       core.py:278
                    INFO     Duplicate Rate                : 39.32%                                     core.py:279
                    INFO     Total Time                    : 853.66 seconds                             core.py:280
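The `num_perm` and `threshold` parameters suggest MinHash-based similarity search: each document is reduced to a fixed-length signature, and a training document is flagged when its estimated Jaccard similarity to a benchmark document exceeds the threshold. Below is a minimal, self-contained sketch of that idea using only the standard library. The helper names (`shingles`, `minhash`, `jaccard_estimate`) are illustrative, not part of this library's API.

```python
# Illustrative sketch of MinHash-based overlap detection (NOT the library's
# actual implementation; names and details here are assumptions).
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Break text into overlapping word n-grams."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def minhash(text: str, num_perm: int = 128) -> list:
    """Signature = one minimum over salted hashes per 'permutation' seed."""
    grams = shingles(text)
    return [
        min(int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(), "big")
            for g in grams)
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing slots estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

train_doc = "def add ( a , b ) : return a + b"
bench_doc = "def add ( a , b ) : return a + b  # benchmark copy"
other_doc = "an entirely unrelated sentence about the weather today"

print(jaccard_estimate(minhash(train_doc), minhash(bench_doc)))  # high: flagged at threshold=0.1
print(jaccard_estimate(minhash(train_doc), minhash(other_doc)))  # near zero: kept
```

Under this view, a lower `threshold` removes more data (more aggressive decontamination), while a higher `num_perm` makes the similarity estimate more precise at the cost of hashing time.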
