Security features from the Hugging Face datasets library #1135
Comments
Thanks, @lhoestq! Will see about the smoothest way to make this non-invasive for users while still having them acknowledge that they may be running untrusted code.
I'm happy to help on this issue if it's something you'd still like, so that lm-evaluation-harness can bump its `datasets` dependency. I see we include it as a parameter when initializing the model constructor, and I have some steps in mind.

We mentioned this may be invasive for users and introduces an extra level of friction: they don't usually need to respond to CLI prompts to continue when using lm-evaluation-harness. Perhaps a log line entry is enough here? Let me know what you think @haileyschoelkopf
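As a hedged illustration of the "log line instead of an interactive prompt" idea: rather than blocking on a CLI confirmation, the harness could emit a warning when a task's dataset is defined by a script. The helper name and wiring below are hypothetical, not existing harness code.

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_remote_code(task_name: str, requires_remote_code: bool) -> None:
    # Surface the security implication in the logs instead of prompting the user.
    if requires_remote_code:
        logger.warning(
            "Task %s loads a Hugging Face dataset defined by a Python script; "
            "it will run with trust_remote_code=True. Review the dataset repo if this is a concern.",
            task_name,
        )
```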
Hi @veekaybee, this'd be awesome! I agree that, since the current behavior is to run datasets from HF datasets, remote code or no, manually setting it to `True` wouldn't change what we already do today. Maybe @lhoestq has suggestions for how to non-invasively incorporate this while not making it too much friction for users?
I'm taking a look at how some other libraries handle this. Perhaps we could do some variation of the first approach, where we indicate in the docs that users should set it, and then check the environment variable when a simple evaluation for HFLM is instantiated?
Hmmm... I think for models, allowing it to be passed through the model constructor's kwargs could work.
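As a rough sketch of that idea (not the harness's actual `HFLM` implementation): a model wrapper could accept a `trust_remote_code` kwarg, fall back to an opt-in environment variable when it is not given, and forward the decision to `datasets` when the task's data is loaded. The class name, the `LM_EVAL_TRUST_REMOTE_CODE` variable, and the assumption that `load_dataset` accepts `trust_remote_code` (datasets >= 2.16) are all illustrative.

```python
import os
from typing import Optional

from datasets import load_dataset


class EvalModelSketch:
    """Illustrative stand-in for an HF-backed model wrapper, not the real HFLM class."""

    def __init__(self, pretrained: str, trust_remote_code: Optional[bool] = None, **kwargs):
        if trust_remote_code is None:
            # Fall back to an opt-in environment variable when the kwarg isn't passed explicitly.
            trust_remote_code = os.getenv("LM_EVAL_TRUST_REMOTE_CODE", "0") == "1"
        self.pretrained = pretrained
        self.trust_remote_code = trust_remote_code

    def load_task_dataset(self, dataset_path: str, dataset_name: Optional[str] = None):
        # Forward the user's decision to datasets at the moment the task's data is loaded.
        return load_dataset(dataset_path, dataset_name, trust_remote_code=self.trust_remote_code)
```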
It would be great to make the list of tasks that require `trust_remote_code`, e.g. with something like:

```python
from datasets import get_dataset_config_names

tasks_that_require_trust_remote_code = []
for task in tasks:
    try:
        get_dataset_config_names(task.DATASET_PATH, trust_remote_code=False)
    except ValueError as e:
        if "trust_remote_code=True" in str(e):
            tasks_that_require_trust_remote_code.append(task)
```

Maybe for all the tasks that we can already trust we can add `trust_remote_code=True` explicitly. Note that there is also an environment variable for this in `datasets`.
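The environment variable referred to above is presumably `HF_DATASETS_TRUST_REMOTE_CODE` from the `datasets` library (an assumption here; check the datasets docs for the exact name and accepted values). A minimal usage sketch with a placeholder dataset path:

```python
import os

# Set before importing datasets so the value is picked up; "1" opts in, "0" opts out.
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "1"

from datasets import load_dataset

# With the variable set, script-based datasets load without passing trust_remote_code per call.
ds = load_dataset("some-org/some-script-based-dataset")  # placeholder repo, for illustration only
```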
I want to summarize so far for my own understanding, let me know if I'm missing or mischaracterizing anything:

Datasets: Datasets is HuggingFace's library for loading datasets for model training/tuning. Some of these datasets require running extra scripts to fully populate the data. There is a change in the new version that would allow users to pass this decision explicitly when calling the loading function.

LM-Harness: Within the datasets used for evaluation, some are defined by such scripts. This means when we run evaluation, regardless of what model type we are evaluating, the task is run and the script for the task is run. For example, evaluating `squadv2` will trigger this issue or warning (if squadv2 is a dataset with scripts that need to be executed).

In order to understand when this would be a user issue, we'd like to understand which datasets actually require `trust_remote_code`. One option would be to set it per task, which would mean it gets picked up from the configuration file.
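To make the summary concrete, a run along these lines is the scenario being described: the task's dataset is downloaded when evaluation starts, and if it is defined by a loading script, that script executes no matter which model is being evaluated. This sketch assumes the `lm_eval.simple_evaluate` Python entry point and the `squadv2` task name; treat both as illustrative and adjust to the harness version in use.

```python
import lm_eval

# Evaluating any model on a script-backed task causes that task's dataset script to run.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m",  # arbitrary small model for illustration
    tasks=["squadv2"],
    limit=8,  # small limit, just to exercise the data-loading path
)
```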
@veekaybee This is correct! @lhoestq thank you, that environment variable is quite helpful, exactly what I was hoping might exist!
In this case, then, would it make sense to do the following: declare it in the config, and handle it within the Model constructor?

A couple things I'm not clear on: should we set the environment variable from the TaskConfig directly, or once we instantiate the class? And also, how does the TaskConfig dict get passed to the HFLM class? Can we call it directly?
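As a stand-in sketch of the config-plus-constructor pairing being proposed (the field names, the env-var handoff, and the dict representation of the task YAML are all assumptions for illustration):

```python
import os

# Task-config side: the task declares that its dataset needs remote code.
# (In the harness this would live in the task's YAML; shown here as a dict.)
task_config = {
    "task": "squadv2",
    "dataset_path": "squad_v2",
    "dataset_kwargs": {"trust_remote_code": True},
}

# Model/loader side: honor the task's setting by exporting the datasets environment
# variable before any dataset is loaded (variable name assumed, as noted earlier).
if task_config.get("dataset_kwargs", {}).get("trust_remote_code"):
    os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "1"
```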
So to clarify, both would be involved: the setting in the config, and the handling in the Model. This should behave as anticipated, I think, even in the absence of the flag.
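And a sketch of the "absence of the flag" case: if a task's config says nothing about remote code, the loader can simply pass nothing through and let `datasets` apply its own default (True today, False after the planned major release). Function and argument names are illustrative.

```python
from typing import Optional

from datasets import load_dataset


def load_for_task(dataset_path: str, dataset_name: Optional[str] = None, dataset_kwargs=None):
    # Only forward trust_remote_code if the task config set it; otherwise defer to
    # the datasets library default.
    kwargs = dict(dataset_kwargs or {})
    return load_dataset(dataset_path, dataset_name, **kwargs)
```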
Hi ! FYI the
In the next version of `datasets` we are adding a new argument `trust_remote_code` to let users decide whether or not to trust the dataset scripts in Python on Hugging Face. The default in the next release will still be `trust_remote_code=True`, but in the coming months we plan to do a major release and set it to `False` by default.

Most datasets in `lm-evaluation-harness` are defined on HF using dataset scripts, and may require passing `trust_remote_code=True`. Those datasets could also be converted to e.g. JSONL or Parquet datasets on HF (e.g. like Muennighoff/babi).

Let me know if you have questions on this change and if I can help making a smooth transition. Ideally this should make `lm-evaluation-harness` safer for people using datasets defined with Python code.

cc @haileyschoelkopf
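For the conversion route mentioned above, one common pattern is to load a script-based dataset once with the script explicitly trusted and re-upload it with `push_to_hub`, which stores plain Parquet files on the Hub so later loads need no remote code (similar to how Muennighoff/babi is hosted). The repository names below are placeholders.

```python
from datasets import load_dataset

# Load the script-based dataset once, explicitly trusting its loading script.
ds = load_dataset("some-org/script-based-dataset", trust_remote_code=True)  # placeholder repo

# Re-upload as plain data files; the Hub stores these as Parquet, so future loads
# of the new repo require no dataset script at all.
ds.push_to_hub("your-username/script-free-copy")  # placeholder target repo
```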