Setting trust_remote_code to True for HuggingFace datasets compatibility #1467

Merged 1 commit into EleutherAI:main on Feb 26, 2024

Conversation

veekaybee (Contributor)

See this issue for context: #1135 (comment)

We'd like to use the latest datasets version, 2.16, in lm-evaluation-harness; we currently pin 2.15.

In order to accommodate this, we'd like to:

  1. Add a trust_remote_code dataset kwarg to each dataset that requires it and pass it through, including, for now, a sample based on a dataset that I know requires remote code execution. Let me know if this looks OK, and I'll scan through the datasets to see which ones need it and add it to those.

  2. Include trust_remote_code as True by default in the model constructor, to be overridden by --model_args when evaluating from the command line. (A rough sketch of the pass-through follows.)
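A minimal sketch of the pass-through in (1), assuming the harness forwards a per-task dataset_kwargs dict into datasets.load_dataset; the helper name and the dataset_kwargs plumbing are illustrative, not the actual harness code:

import datasets

# Hypothetical helper: only datasets.load_dataset and its trust_remote_code
# kwarg (datasets >= 2.16) are real; the dataset_kwargs plumbing is assumed.
def load_task_dataset(path, name=None, dataset_kwargs=None):
    dataset_kwargs = dataset_kwargs or {}
    # A task needing remote code sets {"trust_remote_code": True} in its
    # config, and that lands here as part of dataset_kwargs.
    return datasets.load_dataset(path=path, name=name, **dataset_kwargs)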

lintangsutawika merged commit c1145df into EleutherAI:main on Feb 26, 2024. 8 checks passed.
haileyschoelkopf (Contributor)

Oops, sorry about that, @veekaybee!

This looks great to me as the way it should be implemented! We probably also want to make sure that if a user sets datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True, that setting is respected.

I guess the question is whether there are any datasets for which we'd want to force people to opt in to trusting their remote code. @lhoestq mentioned that perhaps any dataset under an org other than the HF default one or EleutherAI's HF org should require a --trust_remote_code flag?
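A sketch of that precedence, assuming only that datasets >= 2.16 exposes datasets.config.HF_DATASETS_TRUST_REMOTE_CODE; the function and its fallback order are illustrative, not the merged code:

import datasets

def resolve_trust_remote_code(cli_value=None):
    # Hypothetical precedence: an explicit --trust_remote_code flag wins,
    # then the user's HF_DATASETS_TRUST_REMOTE_CODE setting, else False.
    if cli_value is not None:
        return cli_value
    env_value = getattr(datasets.config, "HF_DATASETS_TRUST_REMOTE_CODE", None)
    if env_value is not None:
        return env_value
    return False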

LSinev (Contributor) commented Feb 26, 2024

Checks for other useful datasets environment variables, like HF_DATASETS_CACHE and HF_DATASETS_IN_MEMORY_MAX_SIZE, may also be worth respecting (mentioning it here in case specific code gets written for this).
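datasets already reads these variables itself at import time, so a harness-side check mainly needs to avoid overriding them. A rough sketch; how the harness would actually consume these values is an assumption:

import os

# Only fall back to a harness default when the user hasn't set the
# standard datasets environment variables themselves.
cache_dir = os.environ.get("HF_DATASETS_CACHE")  # None -> let datasets choose
in_memory_max = os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE")  # size in bytes

# e.g., pass the user's cache through rather than a hard-coded one:
# datasets.load_dataset(path, cache_dir=cache_dir)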

veekaybee (Contributor, Author)

No worries, I should have marked this as a draft!

I can add a check for the environment variables before the actual dataset download happens; would here be a good place for it?

def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None:
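Something like the following could live in that method. This is a sketch that assumes download wraps datasets.load_dataset; the self.config / DATASET_PATH attribute names are illustrative rather than the harness's exact ones:

import datasets

def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None:
    # Respect a user-level datasets setting first, then the per-task kwarg.
    trust_remote_code = getattr(datasets.config, "HF_DATASETS_TRUST_REMOTE_CODE", None)
    if trust_remote_code is None:
        trust_remote_code = (self.config.dataset_kwargs or {}).get("trust_remote_code", False)
    self.dataset = datasets.load_dataset(
        path=self.DATASET_PATH,
        name=self.DATASET_NAME,
        data_dir=data_dir,
        cache_dir=cache_dir,
        download_mode=download_mode,
        trust_remote_code=trust_remote_code,
    )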

I also wanted to check which tasks we should mark this as true for. I wrote a quick script that checks which tasks actually have dataset paths and came up with these: https://gist.github.com/veekaybee/269c8f7c51e6b1a92af4d4ff99bd0931

It looks like this is the list we'll need to check against some variation of this code, right? #1135 (comment)
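The gist isn't reproduced here, but a scan along these lines would enumerate the datasets to check; that the YAML task configs keep their Hub dataset IDs under a dataset_path key is an assumption about the task format:

import glob
import yaml  # pyyaml

# Collect every Hub dataset referenced by a task config.
paths = set()
for config_file in glob.glob("lm_eval/tasks/**/*.yaml", recursive=True):
    with open(config_file) as f:
        try:
            config = yaml.safe_load(f)
        except yaml.YAMLError:
            continue  # some configs use custom tags; skip them in a rough scan
    if isinstance(config, dict) and "dataset_path" in config:
        paths.add(config["dataset_path"])

for p in sorted(paths):
    print(p)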
