message in case of stale datasets #59

Merged
geoalgo merged 2 commits into main from remove_stale_datasets on Apr 17, 2026
@geoalgo geoalgo commented Apr 14, 2026

After changing the version of the datasets library, some datasets already present in the cache fail to load with an error like this:

Traceback (most recent call last):
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/bin/oellm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/oellm/main.py", line 767, in main
    auto_cli(
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/jsonargparse/_cli.py", line 132, in auto_cli
    return _run_component(component, init.get(subcommand))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/jsonargparse/_cli.py", line 234, in _run_component
    return component(**cfg)
           ^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/oellm/utils.py", line 446, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/oellm/main.py", line 260, in schedule_evals
    _pre_download_datasets_from_specs(
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/oellm/utils.py", line 322, in _pre_download_datasets_from_specs
    load_dataset(
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/load.py", line 2062, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/load.py", line 1819, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/builder.py", line 395, in __init__
    self.info = DatasetInfo.from_directory(self._cache_dir)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/info.py", line 279, in from_directory
    return cls.from_dict(dataset_info_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/info.py", line 284, in from_dict
    return cls(**{k: v for k, v in dataset_info_dict.items() if k in field_names})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 20, in __init__
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/info.py", line 170, in __post_init__
    self.features = Features.from_dict(self.features)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/features/features.py", line 1888, in from_dict
    obj = generate_from_dict(dic)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/features/features.py", line 1468, in generate_from_dict
    return {key: generate_from_dict(value) for key, value in obj.items()}
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/salinasd/Documents/code/multisynt-evals/oellm-cli/.venv/lib/python3.12/site-packages/datasets/features/features.py", line 1474, in generate_from_dict
    raise ValueError(f"Feature type '{_type}' not found. Available feature types: {list(_FEATURE_TYPES.keys())}")
ValueError: Feature type 'List' not found. Available feature types: ['Value', 'ClassLabel', 'Translation', 'TranslationVariableLanguages', 'LargeList', 'Sequence', 'Array2D', 'Array3D', 'Array4D', 'Array5D', 'Audio', 'Image', 'Video', 'Pdf']

The solution is to delete the stale cache. This PR adds a message that suggests this fix to the user; it does not delete anything itself, to avoid removing data by mistake.
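As a rough illustration of the idea (not the code merged in this PR), the sketch below wraps a loading call, recognizes the "Feature type ... not found" error from the traceback, and re-raises it with a hint to delete the cache. The function name `load_with_stale_cache_hint`, the `loader` callable, the `cache_dir` argument, and the wording of the hint are all hypothetical.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_stale_cache_hint(loader: Callable[[], T], cache_dir: Path) -> T:
    """Run `loader`; if it fails like a stale cache, re-raise with a hint.

    Hypothetical helper: the name, signature, and message are assumptions
    made for illustration only.
    """
    try:
        return loader()
    except ValueError as err:
        text = str(err)
        # Match the error shown in the traceback:
        # "Feature type 'List' not found. Available feature types: [...]"
        if "Feature type" in text and "not found" in text:
            # Only suggest the fix; nothing is deleted automatically.
            raise RuntimeError(
                "Loading a cached dataset failed, possibly because the cache "
                "was written by a different version of `datasets`. Consider "
                f"deleting the cache directory and retrying: {cache_dir}"
            ) from err
        # Any other ValueError is unrelated and propagates unchanged.
        raise
```

With Hugging Face `datasets`, `loader` could be something like `lambda: load_dataset(name)`, and `cache_dir` would be whatever cache location the user has configured. Chaining with `from err` keeps the original traceback visible below the hint.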

@swag2198 swag2198 left a comment

Looks good to me!
Does this error show up with datasets versions > 4?

geoalgo commented Apr 17, 2026

@swag2198 I did not track exactly which version this started with, but I ran into it myself and it was quite annoying. Could you approve the PR so that we can merge it?

@swag2198

Sorry, I am not able to approve the PR; it says I need write access to approve it!

geoalgo commented Apr 17, 2026

I see, thanks. I just gave it to you!

@swag2198 swag2198 self-requested a review April 17, 2026 13:37
@geoalgo geoalgo merged commit c21a3ba into main Apr 17, 2026
3 checks passed
@geoalgo geoalgo deleted the remove_stale_datasets branch April 17, 2026 15:14