Replace CroissantML with custom cache management #5

Merged
MarcT0K merged 14 commits into master from replace-croissant on May 5, 2026

@MarcT0K MarcT0K commented May 5, 2026

Summary

Replace CroissantML-based data handling with a custom caching and dataset management system.

Changes

  • Removed dependency on CroissantML
  • Introduced a custom cache layer for dataset download and storage
  • Added automatic update detection using dataset metadata (dateModified)
  • Implemented cache_only mode for offline/reproducible usage
  • Added streaming downloads with progress (requests, tqdm)
  • Ensured atomic writes to avoid corrupted downloads
  • Normalized dataset structure (full/ and reduced/ directories)
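The streaming and atomic-write items above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the names `atomic_write_chunks` and `download`, the `.part` suffix, and the chunk size are all assumptions. The idea is to write to a temporary file in the destination directory and rename it into place only once the download has completed, so an interrupted download never leaves a truncated file under the final name.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def atomic_write_chunks(chunks: Iterable[bytes], dest: Path) -> Path:
    """Write chunks to a temp file next to dest, then rename it into place.

    Hypothetical helper: the name and behavior are illustrative assumptions.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so the final rename
    # stays on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=dest.parent, prefix=dest.name + ".", suffix=".part"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace overwrites dest; on POSIX the rename is atomic.
        os.replace(tmp_name, dest)
    except BaseException:
        # Drop the partial file so a failed download leaves no debris.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return dest


def download(url: str, dest: Path, chunk_size: int = 1 << 16) -> Path:
    """Stream url to dest with a progress bar (hypothetical wrapper)."""
    # Third-party imports are local so the helper above works without them.
    import requests
    from tqdm import tqdm

    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        # Content-Length may be absent; tqdm accepts total=None in that case.
        total = int(resp.headers.get("Content-Length", 0)) or None
        with tqdm(total=total, unit="B", unit_scale=True, desc=dest.name) as bar:

            def chunks():
                for chunk in resp.iter_content(chunk_size=chunk_size):
                    bar.update(len(chunk))
                    yield chunk

            return atomic_write_chunks(chunks(), dest)
```

A call such as `download(url, cache_dir / "full" / "graph.bin")` would then place a file under the `full/` directory; the file name here is made up for the example.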

Impact

  • Removes external dependency (CroissantML)
  • Improves control over caching and updates
  • Enables offline usage via cache_only
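One way the update check and `cache_only` could interact is sketched below. Only `dateModified` and `cache_only` come from this PR; the `metadata.json` file name, the `resolve` and `needs_update` functions, and the `CacheMissError` exception are hypothetical names chosen for illustration. The sketch assumes the cached metadata is stored as JSON next to the data and that `dateModified` is an ISO 8601 timestamp.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


class CacheMissError(RuntimeError):
    """Raised when cache_only is set but nothing is cached (hypothetical)."""


def parse_date(value: str) -> datetime:
    # A trailing "Z" is rejected by datetime.fromisoformat before Python 3.11.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    # Treat naive timestamps as UTC so naive and aware values are never mixed.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def needs_update(cached: Optional[dict], remote: dict) -> bool:
    """True when there is no cached copy or the remote copy is newer."""
    if cached is None:
        return True
    return parse_date(remote["dateModified"]) > parse_date(cached["dateModified"])


def resolve(
    cache_dir: Path,
    fetch_remote: Callable[[], dict],
    cache_only: bool = False,
) -> str:
    """Return "cached" or "download" for the dataset stored in cache_dir."""
    meta_path = cache_dir / "metadata.json"  # assumed location of cached metadata
    cached = json.loads(meta_path.read_text()) if meta_path.exists() else None
    if cache_only:
        if cached is None:
            raise CacheMissError(f"no cached dataset in {cache_dir}")
        return "cached"  # offline mode: the remote is never contacted
    remote = fetch_remote()
    return "download" if needs_update(cached, remote) else "cached"
```

In this sketch `cache_only=True` never calls `fetch_remote`, which is what makes a run reproducible offline: the result depends only on what is already on disk.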

Notes

No breaking changes to the public graph loading API.

@MarcT0K MarcT0K merged commit 78792b0 into master May 5, 2026
3 checks passed