Fix stale dataset cache and converter updates for Arrow-backed datasets #625

Merged: cristian-tamblay merged 4 commits into perf/rows-columns-database from fix/apply-converter on May 18, 2026

@Irozuku (Collaborator) commented May 15, 2026

Summary

This pull request changes how the dataset API endpoints read Arrow files and cache their contents. It replaces pa.memory_map with pa.OSFile, fixes converter updates that were not reflected in the frontend, and invalidates cached entries based on the Arrow file's modification time so that stale data is no longer served.


Type of Change

Check all that apply like this [x]:

  • Backend change
  • Frontend change
  • CI / Workflow change
  • Build / Packaging change
  • Bug fix
  • Documentation

Changes (by file)

  • DashAI/back/api/datasets.py:
    • Replaced pa.memory_map with pa.OSFile in:
      • _load_and_filter_table
      • get_dataset_file
      • export_dataset_as_csv
      • export_dataset_csv_by_id
    • Added cache invalidation logic that checks the modification time of data.arrow files and refreshes cached entries if the file has changed.
    • Fixed dataset reload behavior so converters are correctly applied and reflected in frontend responses.
    • Imported the os module to support file modification time checks.

Testing (optional)

  • Verify dataset export endpoints still correctly return CSV and Arrow-backed data.
  • Verify dataset cache entries are invalidated after modifying a data.arrow file.
  • Verify converters are correctly applied after dataset updates and visible in the frontend.
  • Verify dataset loading and filtering still behave correctly for large Arrow files.

Irozuku added 4 commits May 15, 2026 16:10
…converter

- Switch pa.memory_map to pa.OSFile in all four read sites to release
  Windows file lock (WinError 1224) so converter job can write data.arrow
- Add mtime check in _FilteredTableCache.get so cache auto-invalidates
  when data.arrow is written by a converter job, preventing stale previews
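
A modification-time check of the kind described in this commit could look roughly like the sketch below. `MtimeCache` and `load_text` are hypothetical names for illustration; this is not the actual `_FilteredTableCache` code, and a plain text file stands in for data.arrow so that only the standard library is needed.

```python
import os
import tempfile
from typing import Any, Callable, Dict, List, Tuple


class MtimeCache:
    """Hypothetical cache: one value per path, reloaded when the file gets newer."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        # path -> (file mtime in ns when the value was loaded, loaded value)
        self._entries: Dict[str, Tuple[int, Any]] = {}

    def get(self, path: str) -> Any:
        current_mtime = os.stat(path).st_mtime_ns
        entry = self._entries.get(path)
        if entry is not None:
            cached_mtime, value = entry
            # Serve the cached value unless the file is newer than the entry.
            if current_mtime <= cached_mtime:
                return value
        value = self._loader(path)
        self._entries[path] = (current_mtime, value)
        return value


load_calls: List[str] = []


def load_text(path: str) -> str:
    """Stand-in loader that records each real read of the file."""
    load_calls.append(path)
    with open(path, encoding="utf-8") as handle:
        return handle.read()


tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "data.arrow")  # text stand-in for the Arrow file

with open(data_path, "w", encoding="utf-8") as handle:
    handle.write("v1")
# Pin the timestamps explicitly so the demo does not depend on clock resolution.
os.utime(data_path, ns=(10**18, 10**18))

cache = MtimeCache(load_text)
first = cache.get(data_path)   # miss: loads "v1"
second = cache.get(data_path)  # hit: mtime unchanged

with open(data_path, "w", encoding="utf-8") as handle:
    handle.write("v2")
os.utime(data_path, ns=(2 * 10**18, 2 * 10**18))
third = cache.get(data_path)   # file is newer, so the entry is refreshed
```

Because this sketch only reloads when the file is newer than the cached entry, a file replaced by a copy that carries an older timestamp would not trigger a reload. Comparing with `!=` instead would be a stricter alternative.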
…e cache

shutil.copytree uses copy2 which preserves the source file's mtime. When
deleting the only converter (no previous ones to re-run), the restored
data.arrow has an older mtime than the cache entry, so the mtime-based
cache invalidation never fires and the table still shows the old transformed
data. Touching data.arrow after the copy ensures a fresh mtime.
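
The timestamp behavior described above can be reproduced with the standard library alone. In this sketch the directory names `backup` and `live` are assumptions made for illustration; only `shutil.copytree`, `os.utime`, and `os.stat` are real calls.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
backup_dir = os.path.join(tmp_dir, "backup")  # hypothetical pre-converter copy
live_dir = os.path.join(tmp_dir, "live")      # hypothetical dataset directory
os.makedirs(backup_dir)

backup_file = os.path.join(backup_dir, "data.arrow")
with open(backup_file, "wb") as handle:
    handle.write(b"original rows")

# Give the backup an old modification time (about September 2001).
old_ns = 10**18
os.utime(backup_file, ns=(old_ns, old_ns))

# copytree copies files with shutil.copy2 by default, which also copies
# file metadata, so the restored file keeps the backup's old mtime.
shutil.copytree(backup_dir, live_dir)
restored_file = os.path.join(live_dir, "data.arrow")
mtime_after_copy = os.stat(restored_file).st_mtime_ns

# "Touch" the restored file: passing None sets atime and mtime to now,
# so a check that looks for a newer mtime will see the file as changed.
os.utime(restored_file, None)
mtime_after_touch = os.stat(restored_file).st_mtime_ns
```

An alternative would be to pass `copy_function=shutil.copy` to `copytree`, which copies file contents and permission bits without preserving timestamps. The touch keeps the rest of the copy behavior unchanged.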
- Return actual job ID from delete_converter so the frontend polls
  for re-run completion before refreshing (previously job_ids was
  always empty due to hasattr check before put())
- Refresh column types via handleStatusChange (ConverterBox path) and
  FormConverterSection onSuccess (job-polling path) instead of a
  redundant useEffect in DatasetPreviewNotebook
- DatasetPreviewNotebook syncs localColumnTypes from context only,
  eliminating the extra type fetch on initial notebook load (3→1)
@Irozuku changed the title from "Fix/apply converter" to "Fix stale dataset cache and converter updates for Arrow-backed datasets" May 15, 2026
@Irozuku added the bug, help wanted, front, and back labels May 15, 2026
Base automatically changed from perf/reduce-notebook-fetches to perf/rows-columns-database May 18, 2026 13:33
@cristian-tamblay cristian-tamblay merged commit 52eba7f into perf/rows-columns-database May 18, 2026
33 checks passed
@cristian-tamblay cristian-tamblay deleted the fix/apply-converter branch May 18, 2026 13:34