Refactor dataset creation flow #295
…tation in dataset creation component
…n and DatasetsPage components to use it
… dataset handling Update fixtures to use the job directly instead of calling the API, so we can be sure that the dataset is ready before testing it. Update all the tests to reflect the new dataset creation system. Add create_dataset tests.
Pull Request Overview
This PR refactors the dataset creation flow to establish a two-step process where datasets are first created in the database with a status, then processed by jobs. This replaces the previous approach where datasets only appeared in the database after job completion, eliminating frontend synchronization issues with temporary objects.
Key changes:
- Backend now creates datasets with status tracking before job processing
- Frontend uses real dataset objects from the database instead of temporary placeholders
- Test fixtures updated to use the new two-step dataset creation flow
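
As a rough illustration of the two-step flow listed above, here is a minimal in-memory sketch. `DatasetRecord`, `InMemoryDatasetStore`, and `run_dataset_job` are hypothetical names used only for this example, not DashAI's actual classes, and the numeric values of the enum are assumed.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count
from typing import Callable, Dict


class DatasetStatus(Enum):
    # Member names follow the PR description; the numeric values are assumptions.
    NOT_STARTED = 0
    STARTED = 1
    FINISHED = 2
    ERROR = 3


@dataclass
class DatasetRecord:
    id: int
    name: str
    status: DatasetStatus = DatasetStatus.NOT_STARTED


class InMemoryDatasetStore:
    """Hypothetical stand-in for the datasets table."""

    def __init__(self) -> None:
        self._ids = count(1)
        self._rows: Dict[int, DatasetRecord] = {}

    def create(self, name: str) -> DatasetRecord:
        # Step 1: the record exists, with a real id, before any processing.
        record = DatasetRecord(id=next(self._ids), name=name)
        self._rows[record.id] = record
        return record

    def get(self, dataset_id: int) -> DatasetRecord:
        return self._rows[dataset_id]


def run_dataset_job(
    store: InMemoryDatasetStore, dataset_id: int, process: Callable[[], None]
) -> None:
    # Step 2: the job works on an existing record and owns its status.
    record = store.get(dataset_id)
    record.status = DatasetStatus.STARTED
    try:
        process()
    except Exception:
        record.status = DatasetStatus.ERROR
    else:
        record.status = DatasetStatus.FINISHED
```

Because the record is created before the job runs, a client always has a real `id` to query, even if processing later fails.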
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| DashAI/back/core/enums/status.py | Adds DatasetStatus enum for tracking dataset processing states |
| DashAI/back/dependencies/database/models.py | Adds status column and status management methods to Dataset model |
| DashAI/back/job/dataset_job.py | Updates job to work with existing dataset records and manage status transitions |
| DashAI/back/api/api_v1/endpoints/datasets.py | Adds dataset creation endpoint and status validation for operations |
| DashAI/back/api/api_v1/schemas/datasets_params.py | Adds status field and creation parameters schema |
| DashAI/front/src/utils/datasetStatus.js | Utility function for mapping status numbers to readable strings |
| DashAI/front/src/types/dataset.ts | Adds status field to dataset interface |
| DashAI/front/src/pages/datasets/Datasets.jsx | Updates polling logic to check real dataset status |
| DashAI/front/src/components/notebooks/datasetCreation/ConfigureAndUploadDataset.jsx | Implements two-step creation flow |
| DashAI/front/src/api/job.ts | Updates job API to use dataset_id instead of name |
| DashAI/front/src/api/datasets.ts | Adds createDataset function for the new endpoint |
| tests/back/api/*.py | Updates test fixtures to use new creation flow |
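
The status-to-string mapping and the status validation mentioned in the table could look roughly like the sketch below. It is written in Python for illustration only (the actual utility lives in `datasetStatus.js`); the numeric values, the label format, and the helper names `status_label` and `ensure_ready` are assumptions.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    # Member names follow the PR description; the numeric values are assumptions.
    NOT_STARTED = 0
    STARTED = 1
    FINISHED = 2
    ERROR = 3


def status_label(value: int) -> str:
    """Map a status number to a readable string (hypothetical helper)."""
    try:
        return DatasetStatus(value).name.replace("_", " ").title()
    except ValueError:
        # Unrecognized numbers get a neutral label instead of raising.
        return "Unknown"


def ensure_ready(value: int) -> None:
    """Reject operations on datasets that have not finished processing."""
    if value != DatasetStatus.FINISHED:
        raise RuntimeError(f"Dataset is not ready (status: {status_label(value)})")
```

With this in place, an endpoint that operates on a dataset can call `ensure_ready(dataset.status)` first and return an error response if it raises.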
Pull Request Overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
…erving current state
Pull Request Overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.
…improve dataset creation handling with timer cleanup
Pull Request Overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.
Summary
This PR refactors the dataset creation and processing flow to ensure consistency between the frontend and backend.
Previously, the frontend simulated dataset creation by adding a temporary object as soon as a file was uploaded. This fake dataset had no real `id`, and its status was managed through timers, which often caused tracking issues and desynchronization with the backend.
Now, the flow is aligned with other components of the system (e.g., explorers, converters). Datasets are created in the database first, assigned a valid `id`, and updated through jobs. The frontend only polls the backend for real status updates, eliminating inconsistencies.
Type of change
Changes
Backend
Old flow:
- `POST /v1/job/` (with file + parameters)
New flow:
- `POST /v1/datasets/` (only dataset name)
  - `status = NOT_STARTED`.
- `POST /v1/job/` (with `dataset_id` + file)
  - `NOT_STARTED → STARTED → FINISHED / ERROR`
Summary of backend changes:
- `status` column in `datasets` table.
Frontend
- … `dataset.id` from the backend.
- … `status`.
Tests
All API tests that previously created datasets through jobs were updated to match the new flow.
Modified files include:
- `conftest.py`
- `test_dataset_api.py`
- `test_experiments_api.py`
- `test_explainer_jobs.py`
- `test_jobs.py`
- `test_predict_api.py`
- `test_runs_api.py`
How to Test
- … `status = NOT_STARTED`.
- … (`STARTED → FINISHED/ERROR`).
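
When checking the status transitions by hand or in a test, a small polling helper can make the wait explicit. The sketch below is an assumption-laden example, not code from this PR: `wait_for_dataset` and its parameters are hypothetical, and `fetch_status` stands in for whatever call returns the dataset's current status as a string.

```python
import time
from typing import Callable

# Statuses after which the job will not change the dataset again.
TERMINAL_STATUSES = {"FINISHED", "ERROR"}


def wait_for_dataset(
    fetch_status: Callable[[], str],
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the dataset reaches a terminal status, then return it."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Skip the pause after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"Dataset still processing after {max_attempts} checks")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.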