Add Hub for browsing and importing datasets from external sources #652

Merged: cristian-tamblay merged 145 commits into develop from refactor/move-hub on May 28, 2026
@Irozuku (Collaborator) commented on May 28, 2026

Summary

Add a Hub feature for browsing, previewing, and importing datasets from external sources (HuggingFace, OpenML, Zenodo). Introduce a Datafile concept that tracks asynchronous downloads from those sources and then feeds into the existing dataset import flow.


Type of Change

  • Backend change
  • Frontend change
  • CI / Workflow change
  • Build / Packaging change
  • Bug fix
  • Documentation

Changes (by file)

Backend - dataset sources

  • DashAI/back/dataset_sources/base_dataset_source.py: new BaseDatasetSource ABC + DatasetEntry / SearchPage dataclasses.
  • DashAI/back/dataset_sources/huggingface_dataset_source.py, openml_dataset_source.py, zenodo_dataset_source.py: source implementations (search, get_info, download_dataset).
  • DashAI/back/initial_components.py, dependencies/config_builder.py: register the three sources.
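
For reviewers unfamiliar with the source layer, a minimal sketch of the contract follows. Only the names BaseDatasetSource, DatasetEntry, SearchPage, search, get_info, and download_dataset come from this PR; every field name, parameter, and the InMemoryDatasetSource class are illustrative assumptions, not the merged code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class DatasetEntry:
    """One search result. Field names are assumptions for illustration."""
    dataset_id: str
    name: str
    description: str = ""
    tags: List[str] = field(default_factory=list)


@dataclass
class SearchPage:
    """One page of results plus the total count, so a UI can paginate."""
    entries: List[DatasetEntry]
    total: int
    page: int
    page_size: int


class BaseDatasetSource(ABC):
    """Contract that each external source would implement."""

    @abstractmethod
    def search(self, query: str, page: int = 0, page_size: int = 20) -> SearchPage: ...

    @abstractmethod
    def get_info(self, dataset_id: str) -> DatasetEntry: ...

    @abstractmethod
    def download_dataset(self, dataset_id: str, dest_dir: Path) -> Path: ...


class InMemoryDatasetSource(BaseDatasetSource):
    """Hypothetical toy source, used only to exercise the interface."""

    def __init__(self, entries: List[DatasetEntry]):
        self._entries = {e.dataset_id: e for e in entries}

    def search(self, query, page=0, page_size=20):
        # Case-insensitive substring match on the dataset name.
        hits = [e for e in self._entries.values() if query.lower() in e.name.lower()]
        start = page * page_size
        return SearchPage(hits[start:start + page_size], len(hits), page, page_size)

    def get_info(self, dataset_id):
        return self._entries[dataset_id]

    def download_dataset(self, dataset_id, dest_dir):
        # Writes a tiny placeholder CSV instead of fetching anything remote.
        dest_dir.mkdir(parents=True, exist_ok=True)
        path = dest_dir / f"{dataset_id}.csv"
        path.write_text("a,b\n1,2\n")
        return path
```

Keeping pagination data on SearchPage rather than on the source lets the three real implementations map their own paging schemes onto one shape.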

Backend - datafile model & migrations

  • DashAI/back/dependencies/database/models.py: new Datafile model with unique (source_name, dataset_id) constraint and JSON-as-text tags.
  • DashAI/back/core/enums/status.py: new DatafileStatus enum (downloading / ready / error).
  • DashAI/alembic/versions/a1c3e5f7b9d2_add_hub_download_table.py: create datafile table.
  • DashAI/alembic/versions/c3d7a1f05e8b_add_metadata_to_datafile.py: add description, tags, size_bytes, source_url.
  • DashAI/alembic/versions/3db684f4090a_merge_datafile_and_dataset_heads.py: empty merge migration to reconcile parallel heads.
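
The two model details called out above (the unique constraint and tags stored as JSON text) can be illustrated with a small sketch. It uses the standard-library sqlite3 module rather than the project's ORM, and every column other than source_name, dataset_id, and tags is an assumption.

```python
import enum
import json
import sqlite3


class DatafileStatus(str, enum.Enum):
    DOWNLOADING = "downloading"
    READY = "ready"
    ERROR = "error"


# Assumed, simplified schema: the real table has more columns.
SCHEMA = """
CREATE TABLE datafile (
    id INTEGER PRIMARY KEY,
    source_name TEXT NOT NULL,
    dataset_id TEXT NOT NULL,
    status TEXT NOT NULL,
    tags TEXT NOT NULL DEFAULT '[]',
    UNIQUE (source_name, dataset_id)
)
"""


def insert_datafile(conn, source_name, dataset_id, tags):
    """Insert a row, serializing the tag list to a JSON string."""
    cur = conn.execute(
        "INSERT INTO datafile (source_name, dataset_id, status, tags) "
        "VALUES (?, ?, ?, ?)",
        (source_name, dataset_id, DatafileStatus.DOWNLOADING.value, json.dumps(tags)),
    )
    return cur.lastrowid


def load_tags(conn, datafile_id):
    """Read the JSON text column back into a Python list."""
    (raw,) = conn.execute(
        "SELECT tags FROM datafile WHERE id = ?", (datafile_id,)
    ).fetchone()
    return json.loads(raw)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first_id = insert_datafile(conn, "openml", "61", ["tabular", "classification"])

# A second row for the same (source_name, dataset_id) pair violates the constraint.
try:
    insert_datafile(conn, "openml", "61", [])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The same dataset_id under a different source_name is still accepted, which is why the constraint covers the pair rather than dataset_id alone.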

Backend - endpoints & jobs

  • DashAI/back/api/api_v1/endpoints/datafile.py: list / create / get / delete datafiles, list files inside one.
  • DashAI/back/api/api_v1/endpoints/dataset_source.py: search, metadata, dataloader-aware preview, import-into-dataset endpoints.
  • DashAI/back/api/api_v1/api.py: register both routers.
  • DashAI/back/job/datafile_job.py: async download job, updates DatafileStatus.
  • DashAI/back/job/dataset_job.py: extra branch importing from a ready Datafile instead of a user-supplied file/URL.
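
As a rough illustration of the status transitions the download job performs, here is a synchronous sketch. DatafileRecord, run_datafile_job, and the per-id directory layout are hypothetical names chosen for the example; the actual job runs asynchronously through the job queue and persists to the database.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Callable, Optional


class DatafileStatus(str, Enum):
    DOWNLOADING = "downloading"
    READY = "ready"
    ERROR = "error"


@dataclass
class DatafileRecord:
    """Hypothetical stand-in for the Datafile database row."""
    id: int
    source_name: str
    dataset_id: str
    status: DatafileStatus = DatafileStatus.DOWNLOADING
    error: Optional[str] = None
    path: Optional[Path] = None


def run_datafile_job(
    record: DatafileRecord,
    download: Callable[[str, Path], Path],
    base_dir: Path,
) -> DatafileRecord:
    """Run one download and record the outcome on the row."""
    target = base_dir / str(record.id)  # assumed layout: one folder per datafile
    record.status = DatafileStatus.DOWNLOADING
    record.error = None
    try:
        record.path = download(record.dataset_id, target)
        record.status = DatafileStatus.READY
    except Exception as exc:
        # Any failure is surfaced as an error state instead of crashing the worker.
        record.error = str(exc)
        record.status = DatafileStatus.ERROR
    return record
```

Recording the failure on the row, rather than raising, is what lets a client polling the datafile see the error state and offer a retry.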

Backend - config

  • DashAI/back/config.py, dependencies/config_builder.py, app.py: add DATAFILE_PATH setting and create the directory at startup.
  • requirements.txt: add openml, oslo.concurrency.
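
The startup step for DATAFILE_PATH amounts to an idempotent directory creation. A minimal sketch, assuming the setting is available as a plain dictionary entry (the helper name ensure_datafile_path is hypothetical):

```python
from pathlib import Path


def ensure_datafile_path(config: dict) -> Path:
    """Create the datafile directory if it is missing and return it."""
    path = Path(config["DATAFILE_PATH"]).expanduser()
    # parents=True builds intermediate folders; exist_ok=True makes restarts safe.
    path.mkdir(parents=True, exist_ok=True)
    return path
```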

Frontend - Hub UI

  • DashAI/front/src/pages/hub/HubContent.jsx, HubImportPage.jsx: top-level Hub pages.
  • DashAI/front/src/components/hub/: DatasetCard.jsx, DatasetGrid.jsx, DatasetDetail.jsx, HubBreadcrumbs.jsx, HubImportPanel.jsx, DatafileInfoPanel.jsx.
  • DashAI/front/src/api/hub.ts, api/job.ts: HTTP clients for the new endpoints.
  • DashAI/front/src/hooks/datasets/useDownloads.js: poll active datafile downloads.
  • DashAI/front/src/App.jsx: route wiring for Hub pages.

Frontend - touch-ups

  • components/notebooks/DatasetNotebookLeftBar.jsx, dataset/DatasetsCenterContent.jsx, datasetCreation/PreviewDataset.jsx, threeSectionLayout/*, custom/ComponentSelector.jsx, contexts/DatasetsAndNotebooksContext.jsx, shared/LoadingDots.jsx: integration points and shared widgets.
  • utils/i18n/locales/{en,es}/hub.json, plus common.json / datasets.json keys.

Tests

  • tests/back/dataset_sources/test_base_dataset_source.py: source-layer tests.
  • tests/back/api/test_dataset_source_api.py: endpoint tests.

Testing

  • Run pytest tests/back/dataset_sources tests/back/api/test_dataset_source_api.py.
  • Manual testing: open Hub, search on each source, import a small dataset end-to-end, confirm it shows up in Datasets.

Notes

  • Migration 3db684f4090a is intentionally empty; it only reconciles the new datafile chain with the parallel dataset head.
  • Hub imports reuse DatasetJob via a source_name / datafile_id branch, so the dataloader path is not duplicated.

Irozuku added 30 commits May 5, 2026 12:38
Add GET /info endpoint per source that fetches description+tags for a
single dataset on demand. OpenML search now returns empty description/tags;
DatasetDetail fetches details lazily via getDatasetInfo when a dataset is
selected, reducing HTTP calls from 20/page to 1/click. Tags section also
updated to prefer lazily-fetched extraInfo over search-time data.
…t dialog

Each DatasetSource now declares COMPATIBLE_COMPONENTS (list of DataLoader
class names). The import dialog becomes a 3-step stepper: format selection
(radio picker for multi-format sources, auto-selected for single), dataloader
parameter form (driven by the component schema), and preview+confirm.
The selected dataloader and its user-configured params are forwarded to the
import job, replacing the hardcoded separator workaround.
- HubDownloadStatus enum (downloading/ready/error)
- HubDownload DB model with unique constraint on (source_name, dataset_id)
- Alembic migration for hub_download table
- HubDownloadJob: fetches file via fetch_full, stores under hub_downloads/{id}/
- /v1/hub-download CRUD + /files listing endpoint
- Idempotent POST: returns existing record if ready/downloading, retries on error
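
The idempotent POST described in this commit can be sketched as follows. The in-memory store, the record shape, and the create_or_reuse_download name are assumptions made for the example; only the three outcomes (reuse when ready or downloading, retry on error, create otherwise) come from the commit message.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, object]
Store = Dict[Tuple[str, str], Record]


def create_or_reuse_download(
    store: Store,
    source_name: str,
    dataset_id: str,
    start_job: Callable[[Record], None],
) -> Record:
    """Return the record for (source_name, dataset_id), starting a job only when needed."""
    key = (source_name, dataset_id)
    record = store.get(key)
    if record is None:
        # First request for this dataset: create the row and start downloading.
        record = {
            "id": len(store) + 1,
            "source_name": source_name,
            "dataset_id": dataset_id,
            "status": "downloading",
        }
        store[key] = record
        start_job(record)
    elif record["status"] == "error":
        # A previous attempt failed: reuse the same row and retry.
        record["status"] = "downloading"
        start_job(record)
    # "ready" and "downloading" records are returned unchanged.
    return record
```

Repeated clicks on the download button therefore never enqueue a second job for a dataset that is already in progress or done.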
…ided

Skips re-downloading when hub_download_id + selected_file are passed in
params; temp_dir remains None so the cached directory is not cleaned up.
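
The temp_dir behavior above is easy to get wrong, so here is a small sketch of the idea. The function name import_dataset, the cache_root layout, and the download callable are hypothetical; only the hub_download_id / selected_file parameters and the "temp_dir stays None" rule come from the commit message.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def import_dataset(
    params: dict,
    cache_root: Path,
    download: Callable[[Path], Path],
) -> str:
    """Read the dataset file, downloading to a temp dir only when nothing is cached."""
    temp_dir: Optional[Path] = None
    try:
        hub_download_id = params.get("hub_download_id")
        selected_file = params.get("selected_file")
        if hub_download_id is not None and selected_file:
            # Cached path: temp_dir stays None, so cleanup below skips this folder.
            file_path = cache_root / str(hub_download_id) / selected_file
        else:
            temp_dir = Path(tempfile.mkdtemp())
            file_path = download(temp_dir)
        return file_path.read_text()
    finally:
        # Only directories created by this call are ever removed.
        if temp_dir is not None:
            shutil.rmtree(temp_dir, ignore_errors=True)
```

Tying cleanup to temp_dir rather than to the file's parent directory is what keeps a cached download available for later imports.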
- hub.ts: HubDownload types + listHubDownloads/getHubDownload/createHubDownload/deleteHubDownload/listHubDownloadFiles
- DatasetDetail: button state machine (download → downloading → add to DashAI / error+retry)
- HubLeftBar: downloaded datasets section with status, delete, and add-to-dashai actions
- HubContent: manages downloads map, polls in-progress every 3s, wires all handlers
- HubImportPanel: optional file-selector step 0 when hubDownload prop is provided
- i18n: new hub keys for download flow and file selector (en + es)
Switch from /api/v1/json/data/list to the ES endpoint
(https://www.openml.org/es/data/data/_search) which returns
description in each hit — no separate per-dataset requests needed.
Pagination uses ES from/size fields. Tags extracted from hit source.
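
For context, from/size pagination against an Elasticsearch endpoint looks roughly like the sketch below. The URL is the one named in this commit; the zero-based page convention, the query clauses, and the hit field names (data_id, name, description, tags) are assumptions and may not match what the merged code sends or what OpenML returns. No request is made here; the functions only build the body and parse a response dictionary.

```python
from typing import Any, Dict, List

OPENML_ES_URL = "https://www.openml.org/es/data/data/_search"


def build_search_body(query: str, page: int, page_size: int) -> Dict[str, Any]:
    """Translate a zero-based page number into ES from/size fields."""
    body: Dict[str, Any] = {"from": page * page_size, "size": page_size}
    if query:
        # Assumed searchable fields; adjust to the real index mapping.
        body["query"] = {
            "multi_match": {"query": query, "fields": ["name", "description"]}
        }
    else:
        body["query"] = {"match_all": {}}
    return body


def parse_hits(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract id, name, description, and tags from each hit's _source."""
    entries = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        tags = []
        for tag in source.get("tags", []):
            # Tags may be plain strings or objects carrying a "tag" key (assumed).
            tags.append(tag.get("tag", "") if isinstance(tag, dict) else str(tag))
        entries.append(
            {
                "dataset_id": str(source.get("data_id", hit.get("_id"))),
                "name": source.get("name", ""),
                "description": source.get("description", ""),
                "tags": tags,
            }
        )
    return entries
```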
…_dataset, drop fetch_preview

- Remove COMPATIBLE_COMPONENTS from all dataset sources; frontend now
  loads all registered DataLoaders via getComponents({ selectTypes: ["DataLoader"] })
- Rename fetch_full -> download_dataset across base class, HuggingFace,
  OpenML, DatasetJob, and HubDownloadJob
- Remove fetch_preview abstract method; POST preview endpoint now uses
  hub_download_id to load the already-downloaded local file, falling
  back to download_dataset for non-cached previews
- Update tests accordingly
Irozuku added 20 commits May 27, 2026 17:21
Adjust padding and gap values in DatasetGrid, HubBreadcrumbs, HubImportPanel, and HubContent to improve visual spacing and layout consistency within the hub interface.
Change the status badge text color from `text.disabled` to `text.primary` to improve readability.
@Irozuku added the enhancement, front, and back labels on May 28, 2026
@cristian-tamblay merged commit f2e92a9 into develop on May 28, 2026 (19 checks passed)
@cristian-tamblay deleted the refactor/move-hub branch on May 28, 2026 21:20