First, congratulations and thank you for the amazing project and the fantastic framework, which is used in almost all LLM evaluations. I appreciate you sharing it with us.
Regarding decontamination:
I wonder whether data decontamination is still effective, and whether there are plans or ongoing efforts to perform strict decontamination against the current evaluation datasets.
Ultimately, if we aggressively remove contamination using an 8-gram overlap check (stricter than the usual 13-gram) with the current decontamination methods, can we be confident that the results on the current evaluation benchmarks are free of contamination?
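To make the question concrete, here is a minimal sketch of the kind of n-gram overlap filtering I have in mind. The function names and the simple whitespace tokenization are my own illustrative assumptions, not the harness's actual implementation:

```python
# Minimal sketch of n-gram overlap decontamination (illustrative only,
# not the harness implementation). An eval example sharing at least one
# 8-gram with the training corpus is flagged as contaminated.
from typing import Iterable, Set, Tuple

N = 8  # stricter than the commonly used 13-gram threshold

def ngrams(text: str, n: int = N) -> Set[Tuple[str, ...]]:
    """Return the set of whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(train_docs: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Collect every n-gram seen anywhere in the training corpus."""
    index: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        index |= ngrams(doc)
    return index

def is_contaminated(eval_example: str, train_index: Set[Tuple[str, ...]]) -> bool:
    """Flag an eval example if any of its n-grams appears in the training index."""
    return not ngrams(eval_example).isdisjoint(train_index)

# Toy usage
train_index = build_train_index(
    ["the quick brown fox jumps over the lazy dog near the river"]
)
print(is_contaminated("the quick brown fox jumps over the lazy dog today", train_index))  # True
print(is_contaminated("a completely unrelated question about physics", train_index))      # False
```

Even with a filter like this, my question stands: overlap removal only catches near-verbatim matches, so does passing such a check actually guarantee the benchmark results are contamination-free?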
Many papers claim, each in their own way, that they did not cheat or contaminate the data, while others omit this step entirely. I believe this ultimately raises questions about the reliability of the evaluation datasets. Benchmark datasets can provide standardized quantitative metrics, but the moment contamination occurs, the benchmark's credibility is significantly undermined. This may be fundamentally different from some forms of cheating in MMLU implementations. Additionally, methods that measure contamination post hoc from a model's inference results can also be misleading. Ultimately, what is needed is a benchmark that tests what the model was never trained on but can still infer.
Is there any information that might be helpful for this (admittedly brief) perspective? If I use the current harness implementation for decontamination, can I claim that it sufficiently removes data contamination?