
Refactor dataset creation flow #295

Merged: cristian-tamblay merged 21 commits into develop from improvement/dataset-status on Sep 22, 2025


@Creylay Creylay commented Sep 10, 2025

Summary

This PR refactors the dataset creation and processing flow to ensure consistency between the frontend and backend.

Previously, the frontend simulated dataset creation by adding a temporary object as soon as a file was uploaded. This fake dataset had no real id, and its status was managed through timers, which often caused tracking issues and desynchronization with the backend.

Now, the flow is aligned with other components of the system (e.g., explorers, converters). Datasets are created in the database first, assigned a valid id, and updated through jobs. The frontend only polls the backend for real status updates, eliminating inconsistencies.


Type of change

  • Front end new feature.
  • Back end new feature.
  • Refactoring.
  • Bug fix (tracking/desync issues).

Changes

Backend

Old flow:

  1. Frontend → POST /v1/job/ (with file + parameters)
  2. Job creates dataset + processes file in a single step
  3. Dataset appears in DB only at the end (success or failure)

New flow:

  1. Frontend → POST /v1/datasets/ (only dataset name)
    • Dataset is created immediately in the DB with status = NOT_STARTED.
  2. Frontend → POST /v1/job/ (with dataset_id + file)
    • Job processes file and updates status.
  3. Job updates:
    NOT_STARTED → STARTED → FINISHED / ERROR
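
The status transitions above can be sketched as a small state machine. This is a minimal illustration, not the code from the PR: the numeric values, the `ALLOWED_TRANSITIONS` table, and the `can_transition` helper are assumptions made for the example. Only the four status names and their order come from the description.

```python
from enum import Enum


class DatasetStatus(Enum):
    # Numeric values are illustrative assumptions, not taken from the PR.
    NOT_STARTED = 0
    STARTED = 1
    FINISHED = 2
    ERROR = 3


# Transitions described in the PR: NOT_STARTED -> STARTED -> FINISHED / ERROR.
# FINISHED and ERROR are treated as terminal states in this sketch.
ALLOWED_TRANSITIONS = {
    DatasetStatus.NOT_STARTED: {DatasetStatus.STARTED},
    DatasetStatus.STARTED: {DatasetStatus.FINISHED, DatasetStatus.ERROR},
    DatasetStatus.FINISHED: set(),
    DatasetStatus.ERROR: set(),
}


def can_transition(current: DatasetStatus, new: DatasetStatus) -> bool:
    """Return True if moving from `current` to `new` follows the flow above."""
    return new in ALLOWED_TRANSITIONS[current]
```

A check like this would reject, for example, a jump straight from NOT_STARTED to FINISHED.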

Summary of backend changes:

  • Added status column in datasets table.
  • Introduced two-step flow: dataset creation → job execution.
  • Job is now responsible for updating the dataset status.
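
To make the two-step flow concrete, here is a hedged, in-memory sketch of the idea: the record exists with an id before any processing happens, and the job is the only thing that changes its status. `DatasetRecord`, `InMemoryDatasetStore`, and `run_dataset_job` are hypothetical names for this illustration and do not reflect the actual DashAI models or job classes.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, Dict


@dataclass
class DatasetRecord:
    id: int
    name: str
    # Plain strings keep the sketch self-contained; the PR uses an enum.
    status: str = "NOT_STARTED"


class InMemoryDatasetStore:
    """Stand-in for the datasets table (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.records: Dict[int, DatasetRecord] = {}

    def create(self, name: str) -> DatasetRecord:
        # Step 1: the dataset gets a real id immediately, before processing.
        record = DatasetRecord(id=next(self._ids), name=name)
        self.records[record.id] = record
        return record


def run_dataset_job(
    store: InMemoryDatasetStore,
    dataset_id: int,
    process_file: Callable[[], None],
) -> None:
    # Step 2: the job looks up the existing record and owns its status.
    record = store.records[dataset_id]
    record.status = "STARTED"
    try:
        process_file()
    except Exception:
        record.status = "ERROR"
    else:
        record.status = "FINISHED"
```

Because the record is created first, a failed job leaves behind a dataset with status ERROR that can still be queried by id, rather than no dataset at all.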

Frontend

  • Removed temporary dataset objects.
  • Frontend now always works with a valid dataset.id from the backend.
  • Polling is performed against the real backend status.
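
The polling behaviour can be sketched as a loop that stops on a terminal status. The real frontend is written in JavaScript; the Python below only illustrates the logic, and `poll_until_done`, its parameters, and the attempt limit are assumptions made for this example.

```python
import time
from typing import Callable

# Terminal statuses taken from the flow described in this PR.
TERMINAL_STATUSES = {"FINISHED", "ERROR"}


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch_status` until it reports a terminal status.

    `fetch_status` stands in for a request to the backend that returns
    the dataset's current status. `sleep` is injectable so the loop can
    be exercised without real delays.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"no terminal status after {max_attempts} attempts")
```

Since the status comes from the backend on every iteration, there is no client-side timer guessing when processing has finished.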

Tests

All API tests that previously created datasets through jobs were updated to match the new flow.

Modified files include:

  • conftest.py
  • test_dataset_api.py
  • test_experiments_api.py
  • test_explainer_jobs.py
  • test_jobs.py
  • test_predict_api.py
  • test_runs_api.py

How to Test

  1. Start the backend and frontend.
  2. From the frontend, upload a dataset file.
  3. Confirm that:
    • A dataset is immediately created in the DB with status = NOT_STARTED.
    • The job updates the dataset’s status (STARTED → FINISHED/ERROR).
    • The frontend correctly displays the real dataset status (no fake objects).
  4. Run the test suite:
    pytest tests/

… dataset handling

Update fixtures to run the job directly instead of calling the API, so we can be sure the dataset is ready before testing it.
Update all the tests to reflect the new dataset creation system.
Add create_dataset tests.
@Creylay Creylay changed the title from "Improvement/dataset status" to "Refactor dataset creation flow" Sep 10, 2025

Copilot AI left a comment

Pull Request Overview

This PR refactors the dataset creation flow to establish a two-step process where datasets are first created in the database with a status, then processed by jobs. This replaces the previous approach where datasets only appeared in the database after job completion, eliminating frontend synchronization issues with temporary objects.

Key changes:

  • Backend now creates datasets with status tracking before job processing
  • Frontend uses real dataset objects from the database instead of temporary placeholders
  • Test fixtures updated to use the new two-step dataset creation flow

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| DashAI/back/core/enums/status.py | Adds DatasetStatus enum for tracking dataset processing states |
| DashAI/back/dependencies/database/models.py | Adds status column and status management methods to Dataset model |
| DashAI/back/job/dataset_job.py | Updates job to work with existing dataset records and manage status transitions |
| DashAI/back/api/api_v1/endpoints/datasets.py | Adds dataset creation endpoint and status validation for operations |
| DashAI/back/api/api_v1/schemas/datasets_params.py | Adds status field and creation parameters schema |
| DashAI/front/src/utils/datasetStatus.js | Utility function for mapping status numbers to readable strings |
| DashAI/front/src/types/dataset.ts | Adds status field to dataset interface |
| DashAI/front/src/pages/datasets/Datasets.jsx | Updates polling logic to check real dataset status |
| DashAI/front/src/components/notebooks/datasetCreation/ConfigureAndUploadDataset.jsx | Implements two-step creation flow |
| DashAI/front/src/api/job.ts | Updates job API to use dataset_id instead of name |
| DashAI/front/src/api/datasets.ts | Adds createDataset function for the new endpoint |
| tests/back/api/*.py | Updates test fixtures to use new creation flow |


Comment thread DashAI/front/src/utils/datasetStatus.js
Comment thread DashAI/back/dependencies/database/models.py Outdated
Comment thread DashAI/back/dependencies/database/models.py Outdated
Comment thread DashAI/back/dependencies/database/models.py Outdated
Comment thread DashAI/front/src/pages/datasets/Datasets.jsx
@Creylay Creylay requested a review from Copilot September 17, 2025 13:06

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.



Comment thread DashAI/front/src/pages/datasets/Datasets.jsx
Comment thread DashAI/front/src/pages/datasets/Datasets.jsx Outdated
@Creylay Creylay requested a review from Copilot September 17, 2025 17:13

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.



Comment thread DashAI/front/src/pages/datasets/Datasets.jsx Outdated
Comment thread DashAI/front/src/pages/datasets/Datasets.jsx
Comment thread DashAI/back/job/dataset_job.py Outdated
Comment thread tests/back/api/test_predict_api.py
…improve dataset creation handling with timer cleanup
@Creylay Creylay requested a review from Copilot September 22, 2025 16:24

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.



Comment thread DashAI/back/job/dataset_job.py
Comment thread tests/back/api/test_predict_api.py
Comment thread tests/back/api/test_predict_api.py
@cristian-tamblay cristian-tamblay merged commit 3b1394e into develop Sep 22, 2025
5 checks passed
@cristian-tamblay cristian-tamblay deleted the improvement/dataset-status branch September 22, 2025 17:53

3 participants