Add Hub for browsing and importing datasets from external sources #652
Merged
…review config discovery
Add GET /info endpoint per source that fetches description+tags for a single dataset on demand. OpenML search now returns empty description/tags; DatasetDetail fetches details lazily via getDatasetInfo when a dataset is selected, reducing HTTP calls from 20/page to 1/click. Tags section also updated to prefer lazily-fetched extraInfo over search-time data.
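A minimal sketch of the lazy-detail idea in this commit, written in Python for illustration only. `LazyInfoCache`, `DatasetInfo`, and `fake_fetch` are hypothetical names, and the in-memory cache is an assumption; the commit itself only states that details are fetched when a dataset is selected.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DatasetInfo:
    # Hypothetical container for what GET /info returns.
    description: str = ""
    tags: List[str] = field(default_factory=list)


class LazyInfoCache:
    """Fetch per-dataset details only when a dataset is selected."""

    def __init__(self, fetch_info: Callable[[str], DatasetInfo]) -> None:
        self._fetch_info = fetch_info  # assumed wrapper around GET /info
        self._cache: Dict[str, DatasetInfo] = {}
        self.calls = 0  # number of detail requests actually issued

    def get(self, dataset_id: str) -> DatasetInfo:
        if dataset_id not in self._cache:
            self.calls += 1
            self._cache[dataset_id] = self._fetch_info(dataset_id)
        return self._cache[dataset_id]


def fake_fetch(dataset_id: str) -> DatasetInfo:
    # Stand-in for the real HTTP request.
    return DatasetInfo(description=f"details for {dataset_id}", tags=["demo"])


cache = LazyInfoCache(fake_fetch)
search_page = [f"ds-{i}" for i in range(20)]  # 20 results, no detail requests yet
selected = cache.get(search_page[3])  # one request, on click
cache.get(search_page[3])  # selecting the same dataset again hits the cache
```

Listing a page costs no detail requests in this sketch; only the selected dataset triggers one.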
…t dialog
Each DatasetSource now declares COMPATIBLE_COMPONENTS (list of DataLoader class names). The import dialog becomes a 3-step stepper: format selection (radio picker for multi-format sources, auto-selected for single), dataloader parameter form (driven by the component schema), and preview+confirm. The selected dataloader and its user-configured params are forwarded to the import job, replacing the hardcoded separator workaround.
Co-authored-by: n/a
- HubDownloadStatus enum (downloading/ready/error)
- HubDownload DB model with unique constraint on (source_name, dataset_id)
- Alembic migration for hub_download table
- HubDownloadJob: fetches file via fetch_full, stores under hub_downloads/{id}/
- /v1/hub-download CRUD + /files listing endpoint
- Idempotent POST: returns existing record if ready/downloading, retries on error
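The idempotent POST described in this list could look roughly like the following. This is a sketch using the standard-library `sqlite3` module rather than the project's ORM; the table columns beyond `source_name`, `dataset_id`, and the status values are assumptions, and `create_download` is a hypothetical helper, not the endpoint's real code.

```python
import sqlite3
from typing import Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """
    CREATE TABLE hub_download (
        id INTEGER PRIMARY KEY,
        source_name TEXT NOT NULL,
        dataset_id TEXT NOT NULL,
        status TEXT NOT NULL,  -- downloading / ready / error
        UNIQUE (source_name, dataset_id)
    )
    """
)


def _find(source_name: str, dataset_id: str) -> Optional[sqlite3.Row]:
    return conn.execute(
        "SELECT * FROM hub_download WHERE source_name = ? AND dataset_id = ?",
        (source_name, dataset_id),
    ).fetchone()


def create_download(source_name: str, dataset_id: str) -> Tuple[dict, bool]:
    """Return (record, started); started says whether a job should be enqueued."""
    row = _find(source_name, dataset_id)
    if row is None:
        conn.execute(
            "INSERT INTO hub_download (source_name, dataset_id, status) "
            "VALUES (?, ?, 'downloading')",
            (source_name, dataset_id),
        )
        return dict(_find(source_name, dataset_id)), True
    if row["status"] == "error":
        # Retry: reuse the same record and start the download again.
        conn.execute(
            "UPDATE hub_download SET status = 'downloading' WHERE id = ?",
            (row["id"],),
        )
        return dict(_find(source_name, dataset_id)), True
    # ready or downloading: hand back the existing record untouched.
    return dict(row), False


first, started_first = create_download("openml", "61")
again, started_again = create_download("openml", "61")
conn.execute("UPDATE hub_download SET status = 'error' WHERE id = ?", (first["id"],))
retried, started_retry = create_download("openml", "61")
```

The unique constraint backs up the lookup: even if two requests raced past the `SELECT`, the second `INSERT` would fail instead of creating a duplicate row.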
…ided
Skips re-downloading when hub_download_id + selected_file are passed in params; temp_dir remains None so the cached directory is not cleaned up.
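One way to picture the branch described in this commit, as a hedged sketch: `resolve_file`, `cleanup`, and `fake_download` are hypothetical helpers, and the `hub_downloads/{id}/` layout is taken from the earlier commit message rather than from the job's actual code.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def resolve_file(
    params: dict, hub_downloads: Path, download: Callable[[Path], Path]
) -> Tuple[Path, Optional[Path]]:
    """Return (file_path, temp_dir); temp_dir stays None for cached files."""
    hub_download_id = params.get("hub_download_id")
    selected_file = params.get("selected_file")
    if hub_download_id is not None and selected_file:
        # Cached download: reuse the stored file, nothing to clean up later.
        return hub_downloads / str(hub_download_id) / selected_file, None
    temp_dir = Path(tempfile.mkdtemp())
    return download(temp_dir), temp_dir


def cleanup(temp_dir: Optional[Path]) -> None:
    # Only directories created for this job are removed.
    if temp_dir is not None:
        shutil.rmtree(temp_dir, ignore_errors=True)


def fake_download(dest: Path) -> Path:
    # Stand-in for the real source download.
    target = dest / "fresh.csv"
    target.write_text("a,b\n1,2\n")
    return target


hub_downloads = Path(tempfile.mkdtemp())
cached = hub_downloads / "7" / "data.csv"
cached.parent.mkdir(parents=True)
cached.write_text("a,b\n1,2\n")

path, temp_dir = resolve_file(
    {"hub_download_id": 7, "selected_file": "data.csv"}, hub_downloads, fake_download
)
cleanup(temp_dir)  # no-op: the cached directory survives
```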
- hub.ts: HubDownload types + listHubDownloads/getHubDownload/createHubDownload/deleteHubDownload/listHubDownloadFiles
- DatasetDetail: button state machine (download → downloading → add to DashAI / error+retry)
- HubLeftBar: downloaded datasets section with status, delete, and add-to-dashai actions
- HubContent: manages downloads map, polls in-progress every 3s, wires all handlers
- HubImportPanel: optional file-selector step 0 when hubDownload prop is provided
- i18n: new hub keys for download flow and file selector (en + es)
Switch from /api/v1/json/data/list to the ES endpoint (https://www.openml.org/es/data/data/_search), which returns the description in each hit, so no separate per-dataset requests are needed. Pagination uses the ES from/size fields. Tags are extracted from the hit source.
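As a rough illustration of from/size pagination, the sketch below builds an Elasticsearch-style request body and parses a hand-written response. The `from`, `size`, `query`, `hits.hits`, and `_source` keys are standard Elasticsearch; the `_source` field names (`data_id`, `name`, `description`, `tags`) and the use of `query_string` are assumptions, not confirmed details of the OpenML index or of this PR.

```python
from typing import Any, Dict, List


def build_search_body(query: str, page: int, page_size: int) -> Dict[str, Any]:
    """Translate a 1-based page number into ES from/size fields."""
    body: Dict[str, Any] = {"from": (page - 1) * page_size, "size": page_size}
    if query:
        body["query"] = {"query_string": {"query": query}}
    else:
        body["query"] = {"match_all": {}}
    return body


def parse_hits(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pull description and tags straight from each hit's _source."""
    entries = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        entries.append(
            {
                "dataset_id": str(source.get("data_id", "")),  # assumed field
                "name": source.get("name", ""),  # assumed field
                "description": source.get("description", ""),  # assumed field
                "tags": list(source.get("tags", [])),  # assumed: list of strings
            }
        )
    return entries


# Hand-written response in the generic ES shape; not real OpenML data.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"data_id": 1, "name": "demo", "description": "d", "tags": ["t"]}}
        ]
    }
}
body = build_search_body("iris", page=3, page_size=20)
entries = parse_hits(sample_response)
```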
…_dataset, drop fetch_preview
- Remove COMPATIBLE_COMPONENTS from all dataset sources; frontend now
  loads all registered DataLoaders via getComponents({ selectTypes: ["DataLoader"] })
- Rename fetch_full -> download_dataset across base class, HuggingFace,
  OpenML, DatasetJob, and HubDownloadJob
- Remove fetch_preview abstract method; POST preview endpoint now uses
  hub_download_id to load the already-downloaded local file, falling
  back to download_dataset for non-cached previews
- Update tests accordingly
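After this refactor the source interface appears to reduce to the three methods the PR summary names (`search`, `get_info`, `download_dataset`). The sketch below shows what such a base class might look like; the dataclass fields, method signatures, and the `InMemorySource` class are all hypothetical and only illustrate the shape.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class DatasetEntry:
    # Field names are assumptions for illustration.
    dataset_id: str
    name: str
    description: str = ""
    tags: List[str] = field(default_factory=list)


@dataclass
class SearchPage:
    entries: List[DatasetEntry]
    page: int
    total: int


class BaseDatasetSource(ABC):
    @abstractmethod
    def search(self, query: str, page: int = 1, page_size: int = 20) -> SearchPage:
        """Return one page of matching datasets."""

    @abstractmethod
    def get_info(self, dataset_id: str) -> DatasetEntry:
        """Return metadata for a single dataset."""

    @abstractmethod
    def download_dataset(self, dataset_id: str, dest_dir: Path) -> Path:
        """Download the dataset into dest_dir and return the file path."""


class InMemorySource(BaseDatasetSource):
    """Toy source backed by a list, standing in for a real remote API."""

    def __init__(self, entries: List[DatasetEntry]) -> None:
        self._entries = {e.dataset_id: e for e in entries}

    def search(self, query: str, page: int = 1, page_size: int = 20) -> SearchPage:
        matches = [e for e in self._entries.values() if query.lower() in e.name.lower()]
        start = (page - 1) * page_size
        return SearchPage(matches[start : start + page_size], page, len(matches))

    def get_info(self, dataset_id: str) -> DatasetEntry:
        return self._entries[dataset_id]

    def download_dataset(self, dataset_id: str, dest_dir: Path) -> Path:
        target = dest_dir / f"{dataset_id}.csv"
        target.write_text("col\n1\n")  # placeholder content
        return target


source = InMemorySource([DatasetEntry("1", "Iris"), DatasetEntry("2", "Wine")])
result = source.search("iris")
downloaded = source.download_dataset("1", Path(tempfile.mkdtemp()))
```

With no separate preview method, a preview can be produced by reading the first rows of whatever `download_dataset` returned, which matches the fallback described in the list above.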
…esnt exclude all files
Adjust padding and gap values in DatasetGrid, HubBreadcrumbs, HubImportPanel, and HubContent to improve visual spacing and layout consistency within the hub interface.
Change the status badge text color from `text.disabled` to `text.primary` to improve readability.
cristian-tamblay approved these changes on May 28, 2026
Summary
Add Hub feature for browsing, previewing, and importing datasets from external sources (HuggingFace, OpenML, Zenodo). Introduces a `Datafile` concept that tracks asynchronous downloads from those sources, then feeds into the existing dataset import flow.
Type of Change
Changes (by file)
Backend - dataset sources
- `DashAI/back/dataset_sources/base_dataset_source.py`: new `BaseDatasetSource` ABC + `DatasetEntry`/`SearchPage` dataclasses.
- `DashAI/back/dataset_sources/huggingface_dataset_source.py`, `openml_dataset_source.py`, `zenodo_dataset_source.py`: source implementations (`search`, `get_info`, `download_dataset`).
- `DashAI/back/initial_components.py`, `dependencies/config_builder.py`: register the three sources.
Backend - datafile model & migrations
- `DashAI/back/dependencies/database/models.py`: new `Datafile` model with unique `(source_name, dataset_id)` constraint and JSON-as-text `tags`.
- `DashAI/back/core/enums/status.py`: new `DatafileStatus` enum (`downloading`/`ready`/`error`).
- `DashAI/alembic/versions/a1c3e5f7b9d2_add_hub_download_table.py`: create `datafile` table.
- `DashAI/alembic/versions/c3d7a1f05e8b_add_metadata_to_datafile.py`: add `description`, `tags`, `size_bytes`, `source_url`.
- `DashAI/alembic/versions/3db684f4090a_merge_datafile_and_dataset_heads.py`: empty merge migration to reconcile parallel heads.
Backend - endpoints & jobs
- `DashAI/back/api/api_v1/endpoints/datafile.py`: list / create / get / delete datafiles, list files inside one.
- `DashAI/back/api/api_v1/endpoints/dataset_source.py`: search, metadata, dataloader-aware preview, import-into-dataset endpoints.
- `DashAI/back/api/api_v1/api.py`: register both routers.
- `DashAI/back/job/datafile_job.py`: async download job, updates `DatafileStatus`.
- `DashAI/back/job/dataset_job.py`: extra branch importing from a ready `Datafile` instead of a user-supplied file/URL.
Backend - config
- `DashAI/back/config.py`, `dependencies/config_builder.py`, `app.py`: add `DATAFILE_PATH` setting and create the directory at startup.
- `requirements.txt`: add `openml`, `oslo.concurrency`.
Frontend - Hub UI
- `DashAI/front/src/pages/hub/HubContent.jsx`, `HubImportPage.jsx`: top-level Hub pages.
- `DashAI/front/src/components/hub/`: `DatasetCard.jsx`, `DatasetGrid.jsx`, `DatasetDetail.jsx`, `HubBreadcrumbs.jsx`, `HubImportPanel.jsx`, `DatafileInfoPanel.jsx`.
- `DashAI/front/src/api/hub.ts`, `api/job.ts`: HTTP clients for the new endpoints.
- `DashAI/front/src/hooks/datasets/useDownloads.js`: poll active datafile downloads.
- `DashAI/front/src/App.jsx`: route wiring for Hub pages.
Frontend - touch-ups
- `components/notebooks/DatasetNotebookLeftBar.jsx`, `dataset/DatasetsCenterContent.jsx`, `datasetCreation/PreviewDataset.jsx`, `threeSectionLayout/*`, `custom/ComponentSelector.jsx`, `contexts/DatasetsAndNotebooksContext.jsx`, `shared/LoadingDots.jsx`: integration points and shared widgets.
- `utils/i18n/locales/{en,es}/hub.json`, plus `common.json`/`datasets.json` keys.
Tests
- `tests/back/dataset_sources/test_base_dataset_source.py`: source-layer tests.
- `tests/back/api/test_dataset_source_api.py`: endpoint tests.
Testing
- `pytest tests/back/dataset_sources tests/back/api/test_dataset_source_api.py`.
Notes
- `3db684f4090a` is intentionally empty - only reconciles the new `datafile` chain with the parallel `dataset` head.
- `DatasetJob` via a `source_name`/`datafile_id` branch - no duplication of the dataloader path.