Release v0.3.0 - Add More Methods and More Datasets Support

@SomeB1oody SomeB1oody released this 05 Jun 20:31

Release v0.3.0

This is a workspace release covering both crates. It graduates dataset-core to a
loader-on-construction design, slims the public utils surface, and grows dataset-ml
from six to ten built-in datasets.

Crate         Previous  This release  crates.io
dataset-core  0.2.0     0.3.0         dataset-core = "0.3"
dataset-ml    0.1.0     0.2.0         dataset-ml = "0.2"

Both crates are published independently. dataset-ml 0.2.0 depends on dataset-core 0.3.0.


Highlights

  • Dataset<T> → Dataset<T, E>: the loader is now supplied once at construction and
    stored on the container, so load() takes no arguments and the loader's error type is part of
    the type. (breaking)
  • New cache-control & ownership methods on Dataset: set_loader, invalidate, get,
    get_mut, into_inner, take.
  • Leaner utils API: create_temp_dir / file_sha256_matches (and the internal
    evaluate_storage) are no longer public; acquire_dataset is the single acquisition entry
    point. (breaking)
  • Four new datasets in dataset-ml: Breast Cancer Wisconsin, Wine Recognition, Palmer
    Penguins, and California Housing.
  • Serde-based CSV parsing for every dataset-ml loader, plus owned/borrowed cached-data
    accessors (into_data / take_data / get_data / get_data_mut).

dataset-core 0.3.0

⚠️ Breaking changes

1. The loader moves to construction time; Dataset<T> becomes Dataset<T, E>.

The loader closure is now stored on the struct at new(dir, loader), load() runs it exactly
once and takes no arguments, and the loader's error type E is a second type parameter. The
stored loader is Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>, so it must be
Send + Sync + 'static (capture by value/clone, not by borrow). Dataset<T, E> stays
Send + Sync whenever T is.

// Before (0.2.x): one type parameter, loader passed at each load() call
use dataset_core::Dataset;

let ds: Dataset<Vec<String>> = Dataset::new("./data");
let data = ds.load(|dir| read_my_files(dir))?;

// After (0.3.0): loader stored at construction, E is part of the type, load() takes no args
use dataset_core::Dataset;

let ds: Dataset<Vec<String>, std::io::Error> =
    Dataset::new("./data", |dir| read_my_files(dir));
let data = ds.load()?;
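
To make the Send + Sync + 'static requirement concrete, here is a minimal std-only sketch of the shape described above. MiniDataset, counting_dataset, and assert_send_sync are hypothetical stand-ins written for illustration; this is not the crate's implementation, and the real Dataset may cache differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Hypothetical model of a loader-on-construction container (not dataset-core's code).
pub struct MiniDataset<T, E> {
    dir: String,
    // Same boxed-closure shape the release notes describe.
    loader: Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>,
    cache: OnceLock<T>,
}

impl<T, E> MiniDataset<T, E> {
    pub fn new<F>(dir: &str, loader: F) -> Self
    where
        F: Fn(&str) -> Result<T, E> + Send + Sync + 'static,
    {
        Self {
            dir: dir.to_string(),
            loader: Box::new(loader),
            cache: OnceLock::new(),
        }
    }

    /// Returns the cached value, running the stored loader only on a cache miss.
    /// In this sketch two racing threads could both run the loader; only one
    /// result is kept. A failed load leaves the cache empty.
    pub fn load(&self) -> Result<&T, E> {
        if let Some(value) = self.cache.get() {
            return Ok(value);
        }
        let value = (self.loader)(self.dir.as_str())?;
        Ok(self.cache.get_or_init(|| value))
    }
}

/// The closure owns its captures (`move` plus an Arc clone), so it is 'static.
/// Capturing `&calls` by borrow instead would not satisfy the bound.
pub fn counting_dataset(calls: Arc<AtomicUsize>) -> MiniDataset<Vec<String>, std::io::Error> {
    MiniDataset::new("./data", move |dir| {
        calls.fetch_add(1, Ordering::SeqCst);
        Ok(vec![format!("{dir}/a.csv")])
    })
}

/// Compile-time check helper: only instantiates for Send + Sync types.
pub fn assert_send_sync<X: Send + Sync>() {}
```

Calling load() twice on the value returned by counting_dataset increments the counter once, and assert_send_sync::<MiniDataset<Vec<String>, std::io::Error>>() compiles because every field of the model is Send + Sync when T is.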

2. create_temp_dir and file_sha256_matches are no longer public.

In 0.2.x these were re-exported at the crate root (dataset_core::create_temp_dir,
dataset_core::file_sha256_matches) and reachable through dataset_core::utils::. They are now
private implementation details, and the internal evaluate_storage helper was folded away. Use
acquire_dataset, which performs temp-dir creation, SHA-256 verification, and the atomic rename
for you:

// After (0.3.0): one cache-aware entry point instead of hand-composed helpers
use dataset_core::{acquire_dataset, download_to};

let file = acquire_dataset(dir, "data.csv", "MyDataset", Some(EXPECTED_SHA256), |tmp| {
    download_to(URL, tmp, None)?;
    Ok(tmp.join("data.csv"))
})?;
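
For readers who previously composed the removed helpers by hand, the following std-only sketch illustrates the general stage, verify, rename pattern that the text attributes to acquire_dataset. It is an assumption-laden model, not the crate's code: acquire_file is a hypothetical name, the temp-directory naming is invented, and the verify closure stands in for SHA-256 checking because the standard library has no SHA-256.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of a cache-aware acquisition helper.
/// `verify` stands in for checksum verification; `prepare` produces the file
/// inside a staging directory and returns its path.
pub fn acquire_file<V, P>(dir: &Path, file_name: &str, verify: V, prepare: P) -> io::Result<PathBuf>
where
    V: Fn(&Path) -> io::Result<bool>,
    P: FnOnce(&Path) -> io::Result<PathBuf>,
{
    let target = dir.join(file_name);

    // Cache hit: a verified file is already in place, so skip all work.
    if target.is_file() && verify(target.as_path())? {
        return Ok(target);
    }

    // Stage inside `dir` so the final rename stays on one filesystem.
    fs::create_dir_all(dir)?;
    let tmp = dir.join(format!(".tmp-{}-{}", std::process::id(), file_name));
    if tmp.exists() {
        fs::remove_dir_all(&tmp)?;
    }
    fs::create_dir_all(&tmp)?;

    let outcome = (|| -> io::Result<PathBuf> {
        let staged = prepare(tmp.as_path())?;
        if !verify(staged.as_path())? {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "verification failed"));
        }
        // Publish only after verification, so readers never see a partial file.
        fs::rename(&staged, &target)?;
        Ok(target.clone())
    })();

    // Best-effort cleanup of the staging directory on success and on failure.
    let _ = fs::remove_dir_all(&tmp);
    outcome
}
```

The design point is that a failed download or a verification mismatch leaves the target path untouched, while a second call with a valid cached file returns immediately without invoking prepare.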

Added

  • Dataset::set_loader(&mut self, loader) — replace the stored loader and invalidate the cache
    (the next load lazily re-parses; no immediate I/O).
  • Dataset::invalidate(&mut self) — drop the cached value but keep the loader (the next load
    re-runs it, e.g. after the underlying files change on disk).
  • Dataset::into_inner(self) -> Option<T> and Dataset::take(&mut self) -> Option<T> — move the
    cached value out without cloning. into_inner consumes the container; take leaves it reusable
    (reset to unloaded). Both return None if never loaded; neither triggers loading.
  • Dataset::get(&self) -> Option<&T> and Dataset::get_mut(&mut self) -> Option<&mut T>
    access the cached value without triggering loading. get_mut allows in-place editing that
    persists in the cache. Both return None if never loaded.
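
The semantics listed above can be modelled with a small std-only container. This is a hedged sketch: MiniDataset and the Loader alias are hypothetical, the method names mirror the list, and the bodies are one plausible way to get the described behaviour rather than the crate's actual code.

```rust
use std::sync::OnceLock;

type Loader<T, E> = Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>;

/// Hypothetical model of the cache-control surface (not dataset-core's code).
pub struct MiniDataset<T, E> {
    dir: String,
    loader: Loader<T, E>,
    cache: OnceLock<T>,
}

impl<T, E> MiniDataset<T, E> {
    pub fn new(dir: &str, loader: impl Fn(&str) -> Result<T, E> + Send + Sync + 'static) -> Self {
        Self { dir: dir.to_string(), loader: Box::new(loader), cache: OnceLock::new() }
    }

    /// Loads on first use, then serves the cached value.
    pub fn load(&self) -> Result<&T, E> {
        if let Some(value) = self.cache.get() {
            return Ok(value);
        }
        let value = (self.loader)(self.dir.as_str())?;
        Ok(self.cache.get_or_init(|| value))
    }

    /// Swaps the loader and drops the cache; nothing is loaded until the next load().
    pub fn set_loader(&mut self, loader: impl Fn(&str) -> Result<T, E> + Send + Sync + 'static) {
        self.loader = Box::new(loader);
        self.cache = OnceLock::new();
    }

    /// Drops the cached value but keeps the loader.
    pub fn invalidate(&mut self) {
        self.cache = OnceLock::new();
    }

    /// Borrows the cached value without triggering a load.
    pub fn get(&self) -> Option<&T> {
        self.cache.get()
    }

    /// Mutably borrows the cached value; edits persist in the cache.
    pub fn get_mut(&mut self) -> Option<&mut T> {
        self.cache.get_mut()
    }

    /// Moves the value out and resets the container to unloaded.
    pub fn take(&mut self) -> Option<T> {
        self.cache.take()
    }

    /// Consumes the container and moves the value out.
    pub fn into_inner(self) -> Option<T> {
        self.cache.into_inner()
    }
}
```

In this model, get, get_mut, take, and into_inner all return None before the first load(), and take followed by load() re-runs the loader, which matches the "reusable (reset to unloaded)" wording above.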

Changed

  • download_to now validates the URL and strips any query string and fragment before deriving
    the output filename from the URL. An explicit filename argument is still used verbatim, and the
    public signature is unchanged.
  • Raised the minimum ureq to 3.3.0 and thiserror to 2.0.18 (both within their existing major
    versions; utils feature only).
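
As an illustration of the download_to filename change, the sketch below derives a filename from a URL after dropping the fragment and query string. filename_from_url is a hypothetical helper written for this note; the crate's actual validation rules are not shown in these release notes and may be stricter.

```rust
/// Hypothetical helper: returns the last path segment of `url`, ignoring any
/// `?query` and `#fragment`. Returns None when there is no scheme, no path,
/// or the path ends in `/`.
pub fn filename_from_url(url: &str) -> Option<&str> {
    // The fragment starts at the first '#', the query at the first '?'.
    let without_fragment = url.split('#').next().unwrap_or(url);
    let without_query = without_fragment.split('?').next().unwrap_or(without_fragment);

    // Separate "scheme://authority" from the path so a bare host is never
    // mistaken for a filename.
    let (_scheme, rest) = without_query.split_once("://")?;
    let (_authority, path) = rest.split_once('/')?;

    let name = path.rsplit('/').next()?;
    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}
```

With this sketch, a URL such as https://example.com/data/iris.csv?token=abc#frag yields iris.csv, whereas deriving the name from the raw URL would have produced a filename containing the query string.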

See crates/dataset-core/CHANGELOG.md for the full list.


dataset-ml 0.2.0

Added — four new datasets

Struct             Module path                     Samples  Features  Task            Source
BreastCancer       dataset_ml::breast_cancer       569      30        Classification  UCI ML Repository
WineRecognition    dataset_ml::wine_recognition    178      13        Classification  UCI ML Repository
PalmerPenguins     dataset_ml::palmer_penguins     344      7         Classification  palmerpenguins R package
CaliforniaHousing  dataset_ml::california_housing  20,640   8         Regression      StatLib (1990 census)

  • BreastCancer — Breast Cancer Wisconsin (Diagnostic). 30 numeric features (mean/se/worst
    for 10 cell-nucleus measurements), &'static str diagnosis label ("malignant" / "benign").
  • WineRecognition — scikit-learn's load_wine. 13 chemical-constituent features, &'static str
    cultivar label ("class_1" / "class_2" / "class_3"). Distinct from the wine_quality
    regression datasets.
  • PalmerPenguins — mixed-type like Titanic: features() returns
    (&Array2<String>, &Array2<f64>) and data() is a triple. Missing values (literal token NA
    in the source) become NaN (numeric) or "" (string).
  • CaliforniaHousing — the one loader that does feature engineering: it reproduces
    scikit-learn's fetch_california_housing features (AveRooms = total_rooms / households, etc.)
    from Géron's housing.csv and scales the target by 1/100000. The source's 207 missing
    total_bedrooms values surface as NaN in AveBedrms. A modern replacement for Boston Housing.

All four are sourced with pinned SHA-256 verification and re-exported at the crate root
(dataset_ml::BreastCancer, etc.).
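
The CaliforniaHousing feature engineering mentioned above can be sketched in a few lines. This is an illustrative model only: RawBlock, Derived, and derive are hypothetical names, the struct covers just the columns needed here, the median_house_value field name is an assumption about the source CSV, and AveBedrms follows scikit-learn's definition (total_bedrooms / households).

```rust
/// Hypothetical subset of one housing.csv row. A missing total_bedrooms is
/// represented as NaN, as the release notes describe.
pub struct RawBlock {
    pub total_rooms: f64,
    pub total_bedrooms: f64,
    pub households: f64,
    pub median_house_value: f64,
}

/// Hypothetical subset of the derived features plus the scaled target.
pub struct Derived {
    pub ave_rooms: f64,
    pub ave_bedrms: f64,
    pub target: f64,
}

pub fn derive(raw: &RawBlock) -> Derived {
    Derived {
        // Per-household averages.
        ave_rooms: raw.total_rooms / raw.households,
        // NaN in total_bedrooms propagates through the division.
        ave_bedrms: raw.total_bedrooms / raw.households,
        // Target expressed in units of 100,000.
        target: raw.median_house_value / 100_000.0,
    }
}
```

Because IEEE 754 division propagates NaN, no special-casing is needed for the missing total_bedrooms values: they surface as NaN in the derived column while the other features of the same row stay finite.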

Added — cached-data accessors

On every loader (Iris, BostonHousing, Diabetes, Titanic, RedWineQuality,
WhiteWineQuality, and the new loaders):

  • into_data(self) / take_data(&mut self) — return owned arrays without a to_owned() clone.
    into_data consumes the loader; take_data leaves it reusable (a later accessor reloads).
  • get_data(&self) -> Option<&XData> / get_data_mut(&mut self) -> Option<&mut XData> — borrow or
    edit the cached tuple without triggering loading (None if not yet loaded).

These build on the new Dataset::into_inner / take / get / get_mut in dataset-core.

Changed

  • Adapted to the loader-on-construction API: each loader's field is now
    Dataset<XData, DatasetError>, new passes Self::load_data to Dataset::new, and accessors
    call self.dataset.load(). The public API of each loader (Iris::new(dir), features(),
    labels(), data(), …) is unchanged.
  • Serde-based CSV parsing: every loader defines a #[derive(Deserialize)] record struct and
    parses with csv::Reader::deserialize(), replacing manual per-field parsing and column-count
    checks. Records deserialize positionally, so parsing no longer depends on header spelling or a
    byte-order mark. Behavior (including Titanic's NaN for missing numerics) is unchanged.
  • data() now returns a reference to the cached tuple (&IrisData, &TitanicData, …) instead of a
    tuple of references. Call-site destructuring (let (features, labels) = ds.data()?) is unchanged
    thanks to match ergonomics.
  • Each loader's content type now has a named alias (IrisData, BostonHousingData, …, shared
    WineData).
  • Added serde (with derive) as a direct dependency.
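
The claim that call-site destructuring survives the data() return-type change can be checked with a small stand-alone model. MiniLoader, PairData, data_old, and first_label are hypothetical; the model uses Vec instead of ndarray arrays and omits the Result wrapper that the real accessors return.

```rust
/// Hypothetical named alias for a loader's cached tuple.
pub type PairData = (Vec<f64>, Vec<&'static str>);

/// Hypothetical loader that already holds its cached data.
pub struct MiniLoader {
    cached: PairData,
}

impl MiniLoader {
    pub fn new() -> Self {
        Self { cached: (vec![5.1, 3.5], vec!["setosa"]) }
    }

    /// Old shape: a tuple of references.
    pub fn data_old(&self) -> (&Vec<f64>, &Vec<&'static str>) {
        (&self.cached.0, &self.cached.1)
    }

    /// New shape: a reference to the cached tuple.
    pub fn data(&self) -> &PairData {
        &self.cached
    }
}

/// Destructuring a `&(A, B)` with a tuple pattern binds each field by
/// reference (match ergonomics), so `labels` is `&Vec<&'static str>` here.
pub fn first_label(loader: &MiniLoader) -> &'static str {
    let (_features, labels) = loader.data();
    labels[0]
}
```

Code that names the return type explicitly, for example by storing the result of data() in a variable annotated as a tuple of references, would still need adjusting; only pattern-based call sites are unaffected in this model.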

See crates/dataset-ml/CHANGELOG.md for the full list.

Full dataset lineup (10)

Struct                Samples  Features  Task
Iris                  150      4         Classification
BreastCancer 🆕       569      30        Classification
BostonHousing         506      13        Regression
CaliforniaHousing 🆕  20,640   8         Regression
Diabetes              768      8         Classification
Titanic               891      11        Classification
PalmerPenguins 🆕     344      7         Classification
WineRecognition 🆕    178      13        Classification
RedWineQuality        1,599    11        Regression
WhiteWineQuality      4,898    11        Regression

Upgrading

# dataset-core only
[dependencies]
dataset-core = "0.3"

# dataset-core with the download / unzip / SHA-256 helpers
[dependencies]
dataset-core = { version = "0.3", features = ["utils"] }

# Built-in ML dataset loaders (pulls in dataset-core automatically)
[dependencies]
dataset-ml = "0.2"

If you use dataset-ml loaders only: bump the version. The loaders' public APIs are unchanged,
and the new into_data / take_data / get_data / get_data_mut accessors are additive.

If you wrap Dataset<T> yourself: move your loader into Dataset::new(dir, loader), add the
error type parameter (Dataset<T, E>), and drop the loader argument from load(). Replace any use
of the removed create_temp_dir / file_sha256_matches helpers with acquire_dataset.

Links