shangdatalab/Deep-Contam

Data Contamination Can Cross Language Barriers

Overview | Quick Start | Data Release | 🤗 Models | Paper

Overview

Deep Contam refers to cross-lingual contamination, a form of data contamination that inflates LLMs' benchmark performance while evading existing detection methods. This repository also provides an effective method for detecting it.

Quick Start

To detect potential hidden contamination in a specific model, follow the steps below.

  • Install dependencies.

    pip install -r requirements.txt
  • Set MODEL_PATH to the model to test and DATA_NAME to one or more benchmark names (comma-separated), then run the following command.

    python detect.py --model_path MODEL_PATH --dataset_name DATA_NAME

    For example,

    python detect.py --model_path 'microsoft/phi-2' --dataset_name MMLU,ARC-C,MathQA

    The output looks like this:

    MMLU
        original: 23.83
        generalized: 25.02
        difference: +1.20
    ----------------------
    ARC-C
        original: 42.92
        generalized: 47.27
        difference: +4.35
    ----------------------
    MathQA
        original: 31.32
        generalized: 38.70
        difference: +7.38
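
When several models or benchmarks are checked, it can help to load the printed report into a data structure. The snippet below is a minimal sketch that assumes the report has exactly the text layout shown above. The helper names `parse_report` and `rank_by_gap` are hypothetical and are not part of this repository. It only reads the numbers as printed; it does not decide how large a difference should count as evidence of contamination.

```python
import re

# Sample report copied from the example output above.
SAMPLE_REPORT = """\
MMLU
    original: 23.83
    generalized: 25.02
    difference: +1.20
----------------------
ARC-C
    original: 42.92
    generalized: 47.27
    difference: +4.35
----------------------
MathQA
    original: 31.32
    generalized: 38.70
    difference: +7.38
"""

# Matches lines such as "original: 23.83" or "difference: +1.20".
FIELD = re.compile(r"(original|generalized|difference):\s*([+-]?\d+(?:\.\d+)?)")


def parse_report(text):
    """Return {benchmark: {"original": x, "generalized": y, "difference": z}}.

    Hypothetical helper: assumes each benchmark name sits on its own line,
    followed by its score lines, with dashed lines as separators.
    """
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue  # skip blank lines and dashed separators
        match = FIELD.fullmatch(line)
        if match:
            if current is None:
                raise ValueError("score line found before any benchmark name")
            results[current][match.group(1)] = float(match.group(2))
        else:
            current = line  # any other line is treated as a benchmark name
            results[current] = {}
    return results


def rank_by_gap(results):
    """List benchmark names from the largest reported difference to the smallest."""
    return sorted(results, key=lambda name: results[name]["difference"], reverse=True)


if __name__ == "__main__":
    report = parse_report(SAMPLE_REPORT)
    for name in rank_by_gap(report):
        print(name, report[name]["difference"])
```

In practice the report text would come from the captured standard output of `detect.py` rather than from a hard-coded string.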

Data Release

The generalized versions of the benchmarks that we constructed to detect potential contamination are released as follows.

Contaminated Models

Checkpoints of the models we deliberately injected with cross-lingual contamination are provided as follows.
