
Releases: SomeB1oody/dataset-core

Release v0.3.0 - Add More Methods and More Datasets Support

05 Jun 20:31


Release v0.3.0

This is a workspace release covering both crates. It graduates dataset-core to a
loader-on-construction design, slims the public utils surface, and grows dataset-ml
from six to ten built-in datasets.

| Crate | Previous | This release | crates.io |
|---|---|---|---|
| dataset-core | 0.2.0 | 0.3.0 | dataset-core = "0.3" |
| dataset-ml | 0.1.0 | 0.2.0 | dataset-ml = "0.2" |

Both crates are published independently. dataset-ml 0.2.0 depends on dataset-core 0.3.0.


Highlights

  • Dataset<T> → Dataset<T, E>: the loader is now supplied once at construction and
    stored on the container, so load() takes no arguments and the loader's error type is part of
    the type. (breaking)
  • New cache-control & ownership methods on Dataset: set_loader, invalidate, get,
    get_mut, into_inner, take.
  • Leaner utils API: create_temp_dir / file_sha256_matches (and the internal
    evaluate_storage) are no longer public; acquire_dataset is the single acquisition entry
    point. (breaking)
  • Four new datasets in dataset-ml: Breast Cancer Wisconsin, Wine Recognition, Palmer
    Penguins, and California Housing.
  • Serde-based CSV parsing for every dataset-ml loader, plus owned/borrowed cached-data
    accessors (into_data / take_data / get_data / get_data_mut).

dataset-core 0.3.0

⚠️ Breaking changes

1. The loader moves to construction time; Dataset<T> becomes Dataset<T, E>.

The loader closure is now stored on the struct at new(dir, loader), load() runs it exactly
once and takes no arguments, and the loader's error type E is a second type parameter. The
stored loader is Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>, so it must be
Send + Sync + 'static (capture by value/clone, not by borrow). Dataset<T, E> stays
Send + Sync whenever T is.

// Before (0.2.x): one type parameter, loader passed at each load() call
use dataset_core::Dataset;

let ds: Dataset<Vec<String>> = Dataset::new("./data");
let data = ds.load(|dir| read_my_files(dir))?;

// After (0.3.0): loader stored at construction, E is part of the type, load() takes no args
use dataset_core::Dataset;

let ds: Dataset<Vec<String>, std::io::Error> =
    Dataset::new("./data", |dir| read_my_files(dir));
let data = ds.load()?;
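
The shape of the loader-on-construction design can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: MiniDataset and its fields are hypothetical names, and the cache is assumed to be a OnceLock as described in the v0.1.0 notes. Unlike the real container, this sketch may run the loader more than once if several threads race on the first load.

```rust
use std::sync::OnceLock;

/// Boxed loader shape quoted in the release notes.
type Loader<T, E> = Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>;

/// Hypothetical stand-in for Dataset<T, E>; not the crate's actual type.
pub struct MiniDataset<T, E> {
    dir: String,
    loader: Loader<T, E>,
    cache: OnceLock<T>,
}

impl<T, E> MiniDataset<T, E> {
    /// The loader is captured once here, so it must own (or clone) what it uses.
    pub fn new(
        dir: impl Into<String>,
        loader: impl Fn(&str) -> Result<T, E> + Send + Sync + 'static,
    ) -> Self {
        Self {
            dir: dir.into(),
            loader: Box::new(loader),
            cache: OnceLock::new(),
        }
    }

    /// No arguments: the stored loader runs on the first successful call only.
    pub fn load(&self) -> Result<&T, E> {
        if let Some(value) = self.cache.get() {
            return Ok(value);
        }
        // Simplification: two racing threads could both reach this line.
        let value = (self.loader)(&self.dir)?;
        Ok(self.cache.get_or_init(|| value))
    }
}
```

Because E appears only in the loader's return type, the error type has to be named (or inferred) where the container is declared, which is why the annotation in the 0.3.0 example above carries two type parameters.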

2. create_temp_dir and file_sha256_matches are no longer public.

In 0.2.x these were re-exported at the crate root (dataset_core::create_temp_dir,
dataset_core::file_sha256_matches) and reachable through dataset_core::utils::. They are now
private implementation details, and the internal evaluate_storage helper was folded away. Use
acquire_dataset, which performs temp-dir creation, SHA-256 verification, and the atomic rename
for you:

// After (0.3.0): one cache-aware entry point instead of hand-composed helpers
use dataset_core::{acquire_dataset, download_to};

let file = acquire_dataset(dir, "data.csv", "MyDataset", Some(EXPECTED_SHA256), |tmp| {
    download_to(URL, tmp, None)?;
    Ok(tmp.join("data.csv"))
})?;
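
For readers who previously composed the helpers by hand, the workflow that acquire_dataset is described as performing (temporary directory, preparation, verification, rename) can be sketched as follows. Everything here is an assumption made for illustration: acquire_sketch is a hypothetical function, the staging-directory naming is invented, and the verify closure stands in for the SHA-256 comparison, which the standard library does not provide.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of a cache-aware acquisition flow; not the crate's code.
pub fn acquire_sketch(
    dir: &Path,
    file_name: &str,
    verify: impl Fn(&Path) -> io::Result<bool>,
    prepare: impl FnOnce(&Path) -> io::Result<PathBuf>,
) -> io::Result<PathBuf> {
    let target = dir.join(file_name);

    // Cache hit: a file that already exists and verifies is returned as-is.
    if target.exists() && verify(&target)? {
        return Ok(target);
    }

    // Stage inside `dir` so the final rename stays on one filesystem.
    let staging = dir.join(format!(".staging-{}", std::process::id()));
    fs::create_dir_all(&staging)?;

    let result = (|| -> io::Result<PathBuf> {
        let produced = prepare(&staging)?;
        if !verify(&produced)? {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "verification failed",
            ));
        }
        fs::rename(&produced, &target)?;
        Ok(target.clone())
    })();

    // Clean up the staging directory whether or not preparation succeeded.
    let _ = fs::remove_dir_all(&staging);
    result
}
```

Verifying before the rename means a half-written or corrupted download never appears under the final filename.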

Added

  • Dataset::set_loader(&mut self, loader) — replace the stored loader and invalidate the cache
    (the next load lazily re-parses; no immediate I/O).
  • Dataset::invalidate(&mut self) — drop the cached value but keep the loader (the next load
    re-runs it, e.g. after the underlying files change on disk).
  • Dataset::into_inner(self) -> Option<T> and Dataset::take(&mut self) -> Option<T> — move the
    cached value out without cloning. into_inner consumes the container; take leaves it reusable
    (reset to unloaded). Both return None if never loaded; neither triggers loading.
  • Dataset::get(&self) -> Option<&T> and Dataset::get_mut(&mut self) -> Option<&mut T>:
    access the cached value without triggering loading. get_mut allows in-place editing that
    persists in the cache. Both return None if never loaded.
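
The semantics of these methods line up closely with what std::sync::OnceLock already offers. The sketch below assumes the cache slot is a OnceLock (as the v0.1.0 notes describe); CacheSlot and fill are hypothetical names, and the real methods may be implemented differently.

```rust
use std::sync::OnceLock;

/// Hypothetical cache slot mirroring the accessors listed above.
pub struct CacheSlot<T> {
    slot: OnceLock<T>,
}

impl<T> CacheSlot<T> {
    pub fn new() -> Self {
        Self { slot: OnceLock::new() }
    }

    /// Stand-in for a successful load(): stores the value if the slot is empty.
    pub fn fill(&self, value: T) -> &T {
        self.slot.get_or_init(|| value)
    }

    /// Borrow without loading; None if nothing is cached.
    pub fn get(&self) -> Option<&T> {
        self.slot.get()
    }

    /// Edit in place; the change stays in the cache.
    pub fn get_mut(&mut self) -> Option<&mut T> {
        self.slot.get_mut()
    }

    /// Move the value out and leave the slot empty and reusable.
    pub fn take(&mut self) -> Option<T> {
        self.slot.take()
    }

    /// Drop the cached value so the next load starts fresh.
    pub fn invalidate(&mut self) {
        self.slot = OnceLock::new();
    }

    /// Move the value out and consume the slot.
    pub fn into_inner(self) -> Option<T> {
        self.slot.into_inner()
    }
}
```

The methods that reset or empty the slot take &mut self or self, so the borrow checker rules out invalidating the cache while another reference to the cached value is still alive.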

Changed

  • download_to now validates the URL and strips any query string and fragment before deriving
    the output filename from the URL. An explicit filename argument is still used verbatim, and the
    public signature is unchanged.
  • Raised the minimum ureq to 3.3.0 and thiserror to 2.0.18 (both within their existing major
    versions; utils feature only).
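
To illustrate the download_to change, here is one way a filename could be derived from a URL once the fragment and query string are stripped. This is a sketch under stated assumptions, not the crate's code: filename_from_url is a hypothetical helper, and its "validation" only checks for an http or https scheme.

```rust
/// Hypothetical helper: derive an output filename from a URL.
/// Returns None for unsupported schemes or URLs without a final path segment.
pub fn filename_from_url(url: &str) -> Option<&str> {
    // Minimal validation: accept only http(s) URLs.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;

    // Drop the fragment first, then the query string.
    let rest = rest.split('#').next().unwrap_or(rest);
    let rest = rest.split('?').next().unwrap_or(rest);

    // Separate host from path, then keep the last path segment.
    let (_host, path) = rest.split_once('/')?;
    let name = path.rsplit('/').next()?;
    if name.is_empty() { None } else { Some(name) }
}
```

Without the stripping step, a URL such as https://example.com/files/data.csv?token=abc would yield a filename containing the query string.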

See crates/dataset-core/CHANGELOG.md for the full list.


dataset-ml 0.2.0

Added — four new datasets

| Struct | Module path | Samples | Features | Task | Source |
|---|---|---|---|---|---|
| BreastCancer | dataset_ml::breast_cancer | 569 | 30 | Classification | UCI ML Repository |
| WineRecognition | dataset_ml::wine_recognition | 178 | 13 | Classification | UCI ML Repository |
| PalmerPenguins | dataset_ml::palmer_penguins | 344 | 7 | Classification | palmerpenguins R package |
| CaliforniaHousing | dataset_ml::california_housing | 20,640 | 8 | Regression | StatLib (1990 census) |

  • BreastCancer — Breast Cancer Wisconsin (Diagnostic). 30 numeric features (mean/se/worst
    for 10 cell-nucleus measurements), &'static str diagnosis label ("malignant" / "benign").
  • WineRecognition — scikit-learn's load_wine. 13 chemical-constituent features, &'static str
    cultivar label ("class_1" / "class_2" / "class_3"). Distinct from the wine_quality
    regression datasets.
  • PalmerPenguins — mixed-type like Titanic: features() returns
    (&Array2<String>, &Array2<f64>) and data() is a triple. Missing values (literal token NA
    in the source) become NaN (numeric) or "" (string).
  • CaliforniaHousing — the one loader that does feature engineering: it reproduces
    scikit-learn's fetch_california_housing features (AveRooms = total_rooms / households, etc.)
    from Géron's housing.csv and scales the target by 1/100000. The source's 207 missing
    total_bedrooms values surface as NaN in AveBedrms. A modern replacement for Boston Housing.
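
The CaliforniaHousing feature engineering can be sketched as below. Only AveRooms = total_rooms / households and the 1/100000 target scaling come from these notes; the AveBedrms and AveOccup formulas follow scikit-learn's documented definitions and are assumptions about what the loader does. RawRow, Derived, and derive are hypothetical names, and the values used to exercise the function are illustrative.

```rust
/// Raw columns, named as in housing.csv (column names are an assumption).
#[derive(Debug, Clone, Copy)]
pub struct RawRow {
    pub total_rooms: f64,
    pub total_bedrooms: f64, // NaN when the source value is missing
    pub population: f64,
    pub households: f64,
    pub median_house_value: f64,
}

/// Derived per-household features plus the rescaled target.
#[derive(Debug, Clone, Copy)]
pub struct Derived {
    pub ave_rooms: f64,
    pub ave_bedrms: f64,
    pub ave_occup: f64,
    pub target: f64,
}

pub fn derive(row: &RawRow) -> Derived {
    Derived {
        ave_rooms: row.total_rooms / row.households,
        // NaN propagates through division, so a missing total_bedrooms
        // surfaces as NaN in AveBedrms without special handling.
        ave_bedrms: row.total_bedrooms / row.households,
        ave_occup: row.population / row.households,
        // Target expressed in units of 100,000.
        target: row.median_house_value / 100_000.0,
    }
}
```

For a row with 880 rooms, 129 bedrooms, 322 residents, and 126 households, AveRooms is 880 / 126, roughly 6.98, and a median house value of 452,600 becomes a target of about 4.526.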

All four are sourced with pinned SHA-256 verification and re-exported at the crate root
(dataset_ml::BreastCancer, etc.).

Added — cached-data accessors

On every loader (Iris, BostonHousing, Diabetes, Titanic, RedWineQuality,
WhiteWineQuality, and the new loaders):

  • into_data(self) / take_data(&mut self) — return owned arrays without a to_owned() clone.
    into_data consumes the loader; take_data leaves it reusable (a later accessor reloads).
  • get_data(&self) -> Option<&XData> / get_data_mut(&mut self) -> Option<&mut XData> — borrow or
    edit the cached tuple without triggering loading (None if not yet loaded).

These build on the new Dataset::into_inner / take / get / get_mut in dataset-core.

Changed

  • Adapted to the loader-on-construction API: each loader's field is now
    Dataset<XData, DatasetError>, new passes Self::load_data to Dataset::new, and accessors
    call self.dataset.load(). The public API of each loader (Iris::new(dir), features(),
    labels(), data(), …) is unchanged.
  • Serde-based CSV parsing: every loader defines a #[derive(Deserialize)] record struct and
    parses with csv::Reader::deserialize(), replacing manual per-field parsing and column-count
    checks. Records deserialize positionally, so parsing no longer depends on header spelling or a
    byte-order mark. Behavior (including Titanic's NaN for missing numerics) is unchanged.
  • data() now returns a reference to the cached tuple (&IrisData, &TitanicData, …) instead of a
    tuple of references. Call-site destructuring (let (features, labels) = ds.data()?) is unchanged
    thanks to match ergonomics.
  • Each loader's content type now has a named alias (IrisData, BostonHousingData, …, shared
    WineData).
  • Added serde (with derive) as a direct dependency.
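
The point about data() and match ergonomics can be checked with a small standalone example. ToyLoader and PairData are hypothetical stand-ins for a loader and its data alias; the real loaders return ndarray arrays rather than Vecs.

```rust
/// Hypothetical data alias, analogous to IrisData and friends.
pub type PairData = (Vec<[f64; 2]>, Vec<&'static str>);

/// Hypothetical loader that already holds its cached tuple.
pub struct ToyLoader {
    cached: PairData,
}

impl ToyLoader {
    pub fn new() -> Self {
        Self {
            cached: (vec![[5.1, 3.5], [4.9, 3.0]], vec!["setosa", "setosa"]),
        }
    }

    /// Returns a reference to the whole tuple, not a tuple of references.
    pub fn data(&self) -> Result<&PairData, String> {
        Ok(&self.cached)
    }
}

pub fn summarize(loader: &ToyLoader) -> Result<(usize, &'static str), String> {
    // Destructuring through the reference binds `features` and `labels`
    // as references to the tuple's fields; nothing is moved or cloned.
    let (features, labels) = loader.data()?;
    Ok((features.len(), labels[0]))
}
```

The tuple pattern on the left matches against a &PairData on the right, which is why existing call sites that destructure the result keep compiling unchanged.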

See crates/dataset-ml/CHANGELOG.md for the full list.

Full dataset lineup (10)

| Struct | Samples | Features | Task |
|---|---|---|---|
| Iris | 150 | 4 | Classification |
| BreastCancer 🆕 | 569 | 30 | Classification |
| BostonHousing | 506 | 13 | Regression |
| CaliforniaHousing 🆕 | 20,640 | 8 | Regression |
| Diabetes | 768 | 8 | Classification |
| Titanic | 891 | 11 | Classification |
| PalmerPenguins 🆕 | 344 | 7 | Classification |
| WineRecognition 🆕 | 178 | 13 | Classification |
| RedWineQuality | 1,599 | 11 | Regression |
| WhiteWineQuality | 4,898 | 11 | Regression |

Upgrading

# dataset-core only
[dependencies]
dataset-core = "0.3"

# dataset-core with the download / unzip / SHA-256 helpers
[dependencies]
dataset-core = { version = "0.3", features = ["utils"] }

# Built-in ML dataset loaders (pulls in dataset-core automatically)
[dependencies]
dataset-ml = "0.2"

If you use dataset-ml loaders only: bump the version — ...


v0.2.0 - Separating Architecture and Implementation

28 May 05:36


Release Notes — v0.2.0 (2026-05-27)

This release is a major restructuring of the project since v0.1.0: the repository has been split into a Cargo workspace. dataset-core now contains only the architecture layer, while a new companion crate dataset-ml houses all built-in dataset loaders. The two crates are published to crates.io independently.

| Crate | Version |
|---|---|
| dataset-core | 0.1.0 → 0.2.0 |
| dataset-ml | 0.1.0 (initial release) |

⚠️ Breaking Changes

  • Workspace split: dataset-core now only ships Dataset<T>, the utils module, and the error module. All built-in dataset loaders have moved to the new dataset-ml crate.

  • datasets feature removed: the former datasets feature on dataset-core is gone. Use dataset-ml instead.

  • Import path changes (loaders moved to dataset-ml):

    | Old path (dataset-core 0.1.x) | New path (dataset-ml 0.1.0) |
    |---|---|
    | dataset_core::datasets::iris::Iris | dataset_ml::iris::Iris |
    | dataset_core::datasets::boston_housing::BostonHousing | dataset_ml::boston_housing::BostonHousing |
    | dataset_core::datasets::diabetes::Diabetes | dataset_ml::diabetes::Diabetes |
    | dataset_core::datasets::titanic::Titanic | dataset_ml::titanic::Titanic |
    | dataset_core::datasets::wine_quality::red_wine_quality::RedWineQuality | dataset_ml::wine_quality::red_wine_quality::RedWineQuality |
    | dataset_core::datasets::wine_quality::white_wine_quality::WhiteWineQuality | dataset_ml::wine_quality::white_wine_quality::WhiteWineQuality |

    There is no longer a datasets:: namespace — modules sit directly at the dataset_ml crate root, and every dataset struct is also re-exported at the crate root for convenience.

  • utils function renames:

    • prepare_download_dir → evaluate_storage
    • download_dataset_with → acquire_dataset

  • Download backend swap: replaced downloader with ureq. The download_to API was refactored and now supports an optional custom filename.

  • Slimmer error payloads: DataFormatError no longer formats the offending record into the error message. Error output is more compact and avoids echoing raw data.

✨ Added

  • Structured error handling: thiserror is now used to derive DatasetError / DataFormatErrorKind. Detailed variants, a consistent [dataset_name] ... prefix, and From impls for UreqError, ZipError, and std::io::Error mean ? just works inside loader closures.
  • dataset-ml initial release: ships loaders for Iris, Boston Housing, Diabetes, Titanic, and Red / White Wine Quality. Wine Quality is split into red and white submodules that share parse_wine_data_to_array.
  • Semantic tests across the board: dataset integration tests now assert value constraints, consistency checks, and finiteness — not just shapes.
  • Documentation upgrades: each dataset module gained detailed module-level docs covering features, target variable, sample count, applications, and source.
  • Chinese localization: README.zh-CN.md added for dataset-core, dataset-ml, and the workspace root.
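
The claim under structured error handling that ? "just works" inside loader closures rests on From conversions into the loader's error type. The sketch below writes those conversions by hand with the standard library; SketchError and its variants are hypothetical and are not DatasetError, which derives its implementations with thiserror.

```rust
use std::fmt;
use std::io;
use std::num::ParseFloatError;

/// Hypothetical error type standing in for a dataset error enum.
#[derive(Debug)]
pub enum SketchError {
    Io(io::Error),
    Parse(ParseFloatError),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Io(e) => write!(f, "[sketch] I/O error: {e}"),
            SketchError::Parse(e) => write!(f, "[sketch] parse error: {e}"),
        }
    }
}

impl std::error::Error for SketchError {}

// These two impls are what lets `?` convert the underlying errors.
impl From<io::Error> for SketchError {
    fn from(e: io::Error) -> Self {
        SketchError::Io(e)
    }
}

impl From<ParseFloatError> for SketchError {
    fn from(e: ParseFloatError) -> Self {
        SketchError::Parse(e)
    }
}

/// Parse one number per line; `?` converts ParseFloatError automatically.
pub fn parse_column(text: &str) -> Result<Vec<f64>, SketchError> {
    let mut values = Vec::new();
    for line in text.lines() {
        values.push(line.trim().parse::<f64>()?);
    }
    Ok(values)
}

/// Read and parse a file; `?` converts io::Error automatically.
pub fn load_column(path: &str) -> Result<Vec<f64>, SketchError> {
    let text = std::fs::read_to_string(path)?;
    parse_column(&text)
}
```

A loader body written this way needs no explicit map_err calls, because each fallible step converts into the single error type the loader returns.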

🔧 Changed

  • Dependency bumps: ureq → 3.3.0, thiserror → 2.0.18, zip → 8.5.1.
  • Shared metadata (edition, rust-version, authors, license, repository) lifted into [workspace.package]; shared dependency versions live in [workspace.dependencies].
  • Doctests that create files on disk are now marked no_run, so cargo test --doc no longer leaves stray artifacts behind.
  • Removed redundant module-level docs from error.rs and stale markdown links in utils docs.

📦 Installation

[dependencies]
dataset-core = "0.2.0"           # architecture layer: Dataset<T> + utils + error
dataset-ml   = "0.1.0"           # add this only if you want the built-in loaders

If you only need the Dataset<T> container (zero external dependencies), no features are required. Enable features = ["utils"] to pull in acquire_dataset / download_to / unzip / SHA-256 helpers. dataset-ml transitively enables dataset-core/utils, so you don't need to configure it manually.

v0.1.0 - Initial Release

11 Apr 22:44


dataset-core v0.1.0

A generic, thread-safe dataset container with lazy loading and caching for Rust.

Note: This is an initial release. The API is not yet stable and may change in future versions.

Highlights

  • Zero-dependency core: Dataset<T> pairs a storage directory with lazily-initialized data of any type. The first call to load() runs your closure and caches the result via OnceLock; every subsequent call returns &T with zero overhead, even across threads.

  • Feature-gated modules — opt in to only what you need:

    | Feature | What it adds | Extra deps |
    |---|---|---|
    | (none) | Dataset<T> | none |
    | utils | download_to, unzip, create_temp_dir, file_sha256_matches, acquire_dataset, and the error module | ureq, zip, tempfile, sha2 |
    | datasets | 6 built-in ML dataset loaders (implies utils) | ndarray, csv |

Built-in Datasets

Six classic machine learning datasets, ready to use with a consistent API (new → features() / labels() / targets() / data()):

| Dataset | Samples | Features | Task |
|---|---|---|---|
| Iris | 150 | 4 | Classification |
| Boston Housing | 506 | 13 | Regression |
| Diabetes (Pima) | 768 | 8 | Classification |
| Titanic | 891 | 11 (mixed) | Classification |
| Wine Quality (Red) | 1,599 | 11 | Regression |
| Wine Quality (White) | 4,898 | 11 | Regression |

All datasets are automatically downloaded, cached locally, and validated with SHA-256 checksums.

Utility Functions (utils feature)

  • download_to — download a remote file into a directory
  • unzip — extract a ZIP archive
  • create_temp_dir — create a self-cleaning temporary directory
  • file_sha256_matches — verify a file's SHA-256 hash
  • acquire_dataset — cache-aware dataset acquisition workflow (temp dir → prepare → optional hash check → move to final location)

Requirements

  • Rust edition 2024, MSRV 1.88.0
  • License: MIT

Quick Start

use dataset_core::Dataset;

let ds = Dataset::<String>::new("./cache");
let data = ds.load(|dir| Ok(std::fs::read_to_string(format!("{dir}/my_file.txt"))?))?;
println!("{data}");

With built-in datasets:

use dataset_core::datasets::Iris;

let iris = Iris::new("./data");
let (features, labels) = iris.data()?;
println!("shape: {:?}, first label: {}", features.shape(), labels[0]);