This repository contains code for cleaning your training data of benchmark data to help combat data snooping.

decontamination

This repository is heavily inspired by the BigCode repository and is mostly a refactoring of their code, which was written primarily by Chenghao Mou (awesome work!).

Install

pip install decontamination

How to use

First, specify which benchmarks you want to clean your data of. You do this by passing a list of benchmark names as they appear in Hugging Face's datasets repository. For example, to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:

# Set your Hugging Face access token in the environment first, e.g.:
#   export HF_ACCESS_TOKEN=<TOKEN>
from datasets import load_dataset
from decontamination.core import BenchmarkCleaner

# load your dataset
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

benchmarks = ["openai_humaneval", "lambada"]
cleaner = BenchmarkCleaner(benchmarks, "/tmp/benchmarks", threshold=0.1, num_perm=128)

# clean the dataset
cleaned_dataset = cleaner.clean(dataset, column="content", check_for_fp=True)
[01/24/23 00:27:37] INFO     Benchmark datasets already exist. Skipping hashing.                        core.py:181
Checking for false positives...: 100%|██████████| 8636/8636 [00:33<00:00, 261.25it/s]
Checking for false positives...: 100%|██████████| 8805/8805 [06:58<00:00, 21.06it/s]
Checking for false positives...: 100%|██████████| 8722/8722 [06:39<00:00, 21.82it/s]
Filtering duplicates... #0: 100%|██████████| 1/1 [00:00<00:00, 140.36ba/s]
...
Filtering duplicates... #31: 100%|██████████| 1/1 [00:00<00:00, 69.66ba/s]
[01/24/23 00:41:50] INFO     Data Number                   : 10000                                      core.py:277
                    INFO     Duplicate Number              : 3932                                       core.py:278
                    INFO     Duplicate Rate                : 39.32%                                     core.py:279
                    INFO     Total Time                    : 853.66 seconds                             core.py:280
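The `num_perm` and `threshold` parameters suggest MinHash-based similarity search: each document is reduced to a fixed-length signature, and a training document is flagged when its estimated Jaccard similarity to a benchmark document exceeds the threshold. Below is a minimal, self-contained sketch of that idea using only the standard library. The helper names (`shingles`, `minhash`, `jaccard_estimate`) are illustrative, not part of this library's API.

```python
# Illustrative sketch of MinHash-based overlap detection (NOT the library's
# actual implementation; names and details here are assumptions).
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Break text into overlapping word n-grams."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def minhash(text: str, num_perm: int = 128) -> list:
    """Signature = one minimum over salted hashes per 'permutation' seed."""
    grams = shingles(text)
    return [
        min(int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(), "big")
            for g in grams)
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing slots estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

train_doc = "def add ( a , b ) : return a + b"
bench_doc = "def add ( a , b ) : return a + b  # benchmark copy"
other_doc = "an entirely unrelated sentence about the weather today"

print(jaccard_estimate(minhash(train_doc), minhash(bench_doc)))  # high: flagged at threshold=0.1
print(jaccard_estimate(minhash(train_doc), minhash(other_doc)))  # near zero: kept
```

Under this view, a lower `threshold` removes more data (more aggressive decontamination), while a higher `num_perm` makes the similarity estimate more precise at the cost of hashing time.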
