Skip to content

Normalize remote dataset file types from URLs#1486

Open
biefan wants to merge 1 commit intoAzure:mainfrom
biefan:fix-remote-dataset-url-file-types
Open

Normalize remote dataset file types from URLs#1486
biefan wants to merge 1 commit intoAzure:mainfrom
biefan:fix-remote-dataset-url-file-types

Conversation

@biefan
Copy link
Contributor

@biefan biefan commented Mar 17, 2026

Summary

  • normalize remote dataset file types before handler lookup
  • ignore URL query strings and fragments when inferring the file extension
  • add regression coverage for signed-style URLs and uppercase extensions

Problem

_RemoteDatasetLoader._fetch_from_url() currently infers the file type with source.split(".")[-1]. That breaks two real input shapes:

  • public URLs with query strings, e.g. https://example.com/data.json?download=1
  • URLs or paths with uppercase extensions, e.g. https://example.com/data.JSON

Both are currently rejected with Invalid file_type before the loader even reaches the HTTP fetch path, even though the underlying data format is valid and supported.

Testing

  • .venv/bin/pytest tests/unit/datasets/test_remote_dataset_loader.py -q

@romanlutz
Copy link
Contributor

This needs fixes to pre-commit steps

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants