Data Contamination Can Cross Language Barriers
Overview • Quick Start • Data Release • 🤗 Models • Paper
**Deep Contam** refers to cross-lingual contamination that inflates LLMs' benchmark performance while evading existing detection methods. This repository also provides an effective method to detect it.
To detect potential hidden contamination in a specific model, follow the steps below.
- Install dependencies.

  ```shell
  pip install -r requirements.txt
  ```
- Specify `model_path` and run the following command.

  ```shell
  python detect.py --model_path MODEL_PATH --dataset_name DATA_NAME
  ```
For example,

```shell
python detect.py --model_path 'microsoft/phi-2' --dataset_name MMLU,ARC-C,MathQA
```
The output would be:

```
MMLU
original: 23.83    generalized: 25.02    difference: +1.20
----------------------
ARC-C
original: 42.92    generalized: 47.27    difference: +4.35
----------------------
MathQA
original: 31.32    generalized: 38.70    difference: +7.38
```
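For each benchmark, the reported `difference` is the generalized-benchmark accuracy minus the original-benchmark accuracy. A minimal sketch of that arithmetic is below; `contamination_gap` is an illustrative helper, not part of the repository's `detect.py` API, and the paper describes how the gap should be interpreted.

```python
def contamination_gap(original_acc: float, generalized_acc: float) -> float:
    """Return the gap reported as `difference` in the detect.py output.

    It is simply the accuracy on the generalized benchmark minus the
    accuracy on the original benchmark, rounded to two decimal places.
    """
    return round(generalized_acc - original_acc, 2)

# Example values taken from the MathQA row of the sample output above.
print(contamination_gap(31.32, 38.70))  # 7.38
```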
The generalized versions of the benchmarks that we constructed to detect potential contamination are released as follows.
Checkpoints of the models into which we deliberately injected cross-lingual contamination are provided as follows.