Release v0.3.0
This is a workspace release covering both crates. It graduates dataset-core to a
loader-on-construction design, slims the public utils surface, and grows dataset-ml
from six to ten built-in datasets.
| Crate | Previous | This release | crates.io |
|---|---|---|---|
| dataset-core | 0.2.0 | 0.3.0 | dataset-core = "0.3" |
| dataset-ml | 0.1.0 | 0.2.0 | dataset-ml = "0.2" |
Both crates are published independently.
dataset-ml 0.2.0 depends on dataset-core 0.3.0.
Highlights
- Dataset<T> → Dataset<T, E>: the loader is now supplied once at construction and stored on the
  container, so load() takes no arguments and the loader's error type is part of the type.
  (breaking)
- New cache-control & ownership methods on Dataset: set_loader, invalidate, get, get_mut,
  into_inner, take.
- Leaner utils API: create_temp_dir / file_sha256_matches (and the internal evaluate_storage) are
  no longer public; acquire_dataset is the single acquisition entry point. (breaking)
- Four new datasets in dataset-ml: Breast Cancer Wisconsin, Wine Recognition, Palmer Penguins,
  and California Housing.
- Serde-based CSV parsing for every dataset-ml loader, plus owned/borrowed cached-data accessors
  (into_data / take_data / get_data / get_data_mut).
dataset-core 0.3.0
⚠️ Breaking changes
1. The loader moves to construction time; Dataset<T> becomes Dataset<T, E>.
The loader closure is now stored on the struct at new(dir, loader), load() runs it exactly
once and takes no arguments, and the loader's error type E is a second type parameter. The
stored loader is Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>, so it must be
Send + Sync + 'static (capture by value/clone, not by borrow). Dataset<T, E> stays
Send + Sync whenever T is.
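
The Send + Sync + 'static requirement is easiest to see in isolation. The following is a minimal std-only sketch, not dataset-core code: the `Loader` alias and `make_loader` function are hypothetical names, and the alias fixes T and E to `String` and `std::io::Error` for brevity. It shows a closure that captures an owned clone with `move`, which is what lets it be boxed and handed to another thread.

```rust
use std::io;

// Hypothetical alias with the same shape as the boxed loader type quoted above,
// specialised to T = String and E = io::Error for this sketch.
type Loader = Box<dyn Fn(&str) -> Result<String, io::Error> + Send + Sync>;

// Builds a loader that joins the dataset directory with a file name.
fn make_loader(file_name: &str) -> Loader {
    // Clone into an owned String and `move` it into the closure. Capturing
    // `file_name` by reference would tie the closure to the caller's borrow,
    // so it could not satisfy the 'static bound implied by the Box.
    let owned = file_name.to_string();
    Box::new(move |dir: &str| -> Result<String, io::Error> {
        Ok(format!("{dir}/{owned}"))
    })
}
```

Because the boxed closure owns everything it uses, it can be called locally or moved into `std::thread::spawn` without lifetime errors.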

    // Before (0.2.x): one type parameter, loader passed at each load() call
    use dataset_core::Dataset;
    let ds: Dataset<Vec<String>> = Dataset::new("./data");
    let data = ds.load(|dir| read_my_files(dir))?;

    // After (0.3.0): loader stored at construction, E is part of the type, load() takes no args
    use dataset_core::Dataset;
    let ds: Dataset<Vec<String>, std::io::Error> =
        Dataset::new("./data", |dir| read_my_files(dir));
    let data = ds.load()?;

2. create_temp_dir and file_sha256_matches are no longer public.
In 0.2.x these were re-exported at the crate root (dataset_core::create_temp_dir,
dataset_core::file_sha256_matches) and reachable through dataset_core::utils::. They are now
private implementation details, and the internal evaluate_storage helper was folded away. Use
acquire_dataset, which performs temp-dir creation, SHA-256 verification, and the atomic rename
for you:

    // After (0.3.0): one cache-aware entry point instead of hand-composed helpers
    use dataset_core::{acquire_dataset, download_to};
    let file = acquire_dataset(dir, "data.csv", "MyDataset", Some(EXPECTED_SHA256), |tmp| {
        download_to(URL, tmp, None)?;
        Ok(tmp.join("data.csv"))
    })?;

Added
- Dataset::set_loader(&mut self, loader): replace the stored loader and invalidate the cache
  (the next load lazily re-parses; no immediate I/O).
- Dataset::invalidate(&mut self): drop the cached value but keep the loader (the next load
  re-runs it, e.g. after the underlying files change on disk).
- Dataset::into_inner(self) -> Option<T> and Dataset::take(&mut self) -> Option<T>: move the
  cached value out without cloning. into_inner consumes the container; take leaves it reusable
  (reset to unloaded). Both return None if never loaded; neither triggers loading.
- Dataset::get(&self) -> Option<&T> and Dataset::get_mut(&mut self) -> Option<&mut T>: access
  the cached value without triggering loading. get_mut allows in-place editing that persists in
  the cache. Both return None if never loaded.
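
To make the semantics of these methods concrete, here is a small std-only stand-in. It is a sketch under assumptions, not the dataset-core implementation: `CachedDataset` is a hypothetical name, and using `std::sync::OnceLock` for the cache is a guess at one reasonable design. Only the method names and signatures follow the list above.

```rust
use std::sync::OnceLock;

// Stand-in for illustration only; the real Dataset<T, E> may be built differently.
struct CachedDataset<T, E> {
    dir: String,
    loader: Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>,
    cache: OnceLock<T>,
}

impl<T, E> CachedDataset<T, E> {
    fn new(dir: &str, loader: impl Fn(&str) -> Result<T, E> + Send + Sync + 'static) -> Self {
        Self { dir: dir.to_string(), loader: Box::new(loader), cache: OnceLock::new() }
    }

    // Returns the cached value, running the stored loader only if nothing is cached.
    fn load(&self) -> Result<&T, E> {
        if let Some(value) = self.cache.get() {
            return Ok(value);
        }
        let value = (self.loader)(&self.dir)?;
        Ok(self.cache.get_or_init(|| value))
    }

    // Swaps the loader and clears the cache; no I/O happens until the next load().
    fn set_loader(&mut self, loader: impl Fn(&str) -> Result<T, E> + Send + Sync + 'static) {
        self.loader = Box::new(loader);
        self.cache = OnceLock::new();
    }

    // Drops the cached value but keeps the loader.
    fn invalidate(&mut self) {
        self.cache = OnceLock::new();
    }

    // Borrow or edit the cached value without triggering a load.
    fn get(&self) -> Option<&T> {
        self.cache.get()
    }

    fn get_mut(&mut self) -> Option<&mut T> {
        self.cache.get_mut()
    }

    // Moves the cached value out and leaves the container unloaded but reusable.
    fn take(&mut self) -> Option<T> {
        self.cache.take()
    }

    // Moves the cached value out and consumes the container.
    fn into_inner(self) -> Option<T> {
        self.cache.into_inner()
    }
}
```

In this sketch, get, get_mut, take and into_inner never call the loader, so each returns None on a container that has not been loaded, which matches the behavior described above.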
Changed
- download_to now validates the URL and strips any query string and fragment before deriving
  the output filename from the URL. An explicit filename argument is still used verbatim, and the
  public signature is unchanged.
- Raised the minimum ureq to 3.3.0 and thiserror to 2.0.18 (both within their existing major
  versions; utils feature only).
See crates/dataset-core/CHANGELOG.md for the full list.
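
The download_to filename change can be pictured with a small std-only helper. This is an illustrative sketch, not the crate's code: `filename_from_url` is a hypothetical name, and the validation here (requiring an http or https scheme) is an assumption, since the release notes do not say what the real check covers.

```rust
// Illustrative helper: derive a file name from a URL by dropping the fragment
// and query string, then taking the last path segment.
fn filename_from_url(url: &str) -> Option<&str> {
    // Assumed validation for this sketch: accept only http(s) URLs.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The fragment starts at the first '#', the query at the first '?'.
    let rest = rest.split('#').next().unwrap_or(rest);
    let rest = rest.split('?').next().unwrap_or(rest);
    // Skip the host, then keep the last path segment if it is non-empty.
    let (_host, path) = rest.split_once('/')?;
    path.rsplit('/').next().filter(|name| !name.is_empty())
}
```

With this helper, a signed download link such as `https://example.com/files/data.csv?token=abc#top` yields `data.csv` rather than a name polluted by the query string, while a URL with no path segment yields None.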
dataset-ml 0.2.0
Added — four new datasets
| Struct | Module path | Samples | Features | Task | Source |
|---|---|---|---|---|---|
| BreastCancer | dataset_ml::breast_cancer | 569 | 30 | Classification | UCI ML Repository |
| WineRecognition | dataset_ml::wine_recognition | 178 | 13 | Classification | UCI ML Repository |
| PalmerPenguins | dataset_ml::palmer_penguins | 344 | 7 | Classification | palmerpenguins R package |
| CaliforniaHousing | dataset_ml::california_housing | 20,640 | 8 | Regression | StatLib (1990 census) |

- BreastCancer: Breast Cancer Wisconsin (Diagnostic). 30 numeric features (mean/se/worst for 10
  cell-nucleus measurements), &'static str diagnosis label ("malignant" / "benign").
- WineRecognition: scikit-learn's load_wine. 13 chemical-constituent features, &'static str
  cultivar label ("class_1" / "class_2" / "class_3"). Distinct from the wine_quality regression
  datasets.
- PalmerPenguins: mixed-type, like Titanic. features() returns
  (&Array2<String>, &Array2<f64>) and data() is a triple. Missing values (literal token NA in the
  source) become NaN (numeric) or "" (string).
- CaliforniaHousing: the one loader that does feature engineering. It reproduces scikit-learn's
  fetch_california_housing features (AveRooms = total_rooms / households, etc.) from Géron's
  housing.csv and scales the target by 1/100000. The source's 207 missing total_bedrooms values
  surface as NaN in AveBedrms. A modern replacement for Boston Housing.

All four are sourced with pinned SHA-256 verification and re-exported at the crate root
(dataset_ml::BreastCancer, etc.).
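
The CaliforniaHousing feature engineering can be sketched as plain arithmetic on one row. This is a hedged illustration, not the loader's code: `RawHousing` and `engineer` are hypothetical names, the field names follow the column headers of housing.csv, and the output order follows scikit-learn's documented feature list (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude). Only the AveRooms formula, the NaN in AveBedrms, and the 1/100000 target scale are stated in these notes; the rest is assumed to match.

```rust
// Hypothetical raw record holding the housing.csv columns this sketch needs.
struct RawHousing {
    median_income: f64,
    housing_median_age: f64,
    total_rooms: f64,
    total_bedrooms: Option<f64>, // None where the source cell is empty
    population: f64,
    households: f64,
    latitude: f64,
    longitude: f64,
    median_house_value: f64,
}

// Returns the eight engineered features and the scaled target for one row.
fn engineer(row: &RawHousing) -> ([f64; 8], f64) {
    // A missing bedroom count becomes NaN and propagates into AveBedrms.
    let bedrooms = row.total_bedrooms.unwrap_or(f64::NAN);
    let features = [
        row.median_income,                // MedInc
        row.housing_median_age,           // HouseAge
        row.total_rooms / row.households, // AveRooms
        bedrooms / row.households,        // AveBedrms
        row.population,                   // Population
        row.population / row.households,  // AveOccup
        row.latitude,                     // Latitude
        row.longitude,                    // Longitude
    ];
    // Target is expressed in units of 100,000.
    (features, row.median_house_value / 100_000.0)
}
```

Keeping the missing value as NaN, rather than dropping the row, preserves the 20,640-sample count listed in the table and leaves imputation to the caller.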
Added — cached-data accessors
On every loader (Iris, BostonHousing, Diabetes, Titanic, RedWineQuality,
WhiteWineQuality, and the new loaders):
- into_data(self) / take_data(&mut self): return owned arrays without a to_owned() clone.
  into_data consumes the loader; take_data leaves it reusable (a later accessor reloads).
- get_data(&self) -> Option<&XData> / get_data_mut(&mut self) -> Option<&mut XData>: borrow or
  edit the cached tuple without triggering loading (None if not yet loaded).

These build on the new Dataset::into_inner / take / get / get_mut in dataset-core.
Changed
- Adapted to the loader-on-construction API: each loader's field is now
  Dataset<XData, DatasetError>, new passes Self::load_data to Dataset::new, and accessors call
  self.dataset.load(). The public API of each loader (Iris::new(dir), features(), labels(),
  data(), …) is unchanged.
- Serde-based CSV parsing: every loader defines a #[derive(Deserialize)] record struct and
  parses with csv::Reader::deserialize(), replacing manual per-field parsing and column-count
  checks. Records deserialize positionally, so parsing no longer depends on header spelling or a
  byte-order mark. Behavior (including Titanic's NaN for missing numerics) is unchanged.
- data() now returns a reference to the cached tuple (&IrisData, &TitanicData, …) instead of a
  tuple of references. Call-site destructuring (let (features, labels) = ds.data()?) is unchanged
  thanks to match ergonomics.
- Each loader's content type now has a named alias (IrisData, BostonHousingData, …, shared
  WineData).
- Added serde (with derive) as a direct dependency.

See crates/dataset-ml/CHANGELOG.md for the full list.
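
The claim that call-site destructuring survives the data() return-type change rests on Rust's default binding modes (match ergonomics). The std-only sketch below demonstrates it with hypothetical stand-ins: `Loader`, `IrisLikeData`, `data_old` and `data_new` are illustrative names, and `Vec` replaces the array types the real loaders use.

```rust
// Hypothetical stand-ins for a loader's cached content.
type Features = Vec<[f64; 4]>;
type Labels = Vec<&'static str>;
type IrisLikeData = (Features, Labels);

struct Loader {
    cached: IrisLikeData,
}

impl Loader {
    // Old shape: a tuple of references.
    fn data_old(&self) -> Result<(&Features, &Labels), String> {
        Ok((&self.cached.0, &self.cached.1))
    }

    // New shape: one reference to the cached tuple.
    fn data_new(&self) -> Result<&IrisLikeData, String> {
        Ok(&self.cached)
    }
}

// The same `let (features, labels) = ...` pattern compiles against both shapes.
fn row_counts(loader: &Loader) -> Result<(usize, usize), String> {
    let (features, labels) = loader.data_old()?;
    let (f_old, l_old): (&Features, &Labels) = (features, labels);

    // Matching a tuple pattern against &(A, B) binds each field by reference,
    // so the bindings have the same types as before.
    let (features, labels) = loader.data_new()?;
    let (f_new, l_new): (&Features, &Labels) = (features, labels);

    Ok((f_old.len() + l_old.len(), f_new.len() + l_new.len()))
}
```

Code that names the return type explicitly, for example by annotating a variable as a tuple of references, would still need updating; only pattern-based destructuring carries over unchanged.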
Full dataset lineup (10)
| Struct | Samples | Features | Task |
|---|---|---|---|
| Iris | 150 | 4 | Classification |
| BreastCancer 🆕 | 569 | 30 | Classification |
| BostonHousing | 506 | 13 | Regression |
| CaliforniaHousing 🆕 | 20,640 | 8 | Regression |
| Diabetes | 768 | 8 | Classification |
| Titanic | 891 | 11 | Classification |
| PalmerPenguins 🆕 | 344 | 7 | Classification |
| WineRecognition 🆕 | 178 | 13 | Classification |
| RedWineQuality | 1,599 | 11 | Regression |
| WhiteWineQuality | 4,898 | 11 | Regression |

Upgrading

    # dataset-core only
    [dependencies]
    dataset-core = "0.3"

    # dataset-core with the download / unzip / SHA-256 helpers
    [dependencies]
    dataset-core = { version = "0.3", features = ["utils"] }

    # Built-in ML dataset loaders (pulls in dataset-core automatically)
    [dependencies]
    dataset-ml = "0.2"

If you use dataset-ml loaders only: bump the version; the loader public APIs are unchanged.
The new into_data / take_data / get_data / get_data_mut accessors are additive.
If you wrap Dataset<T> yourself: move your loader into Dataset::new(dir, loader), add the
error type parameter (Dataset<T, E>), and drop the loader argument from load(). Replace any use
of the removed create_temp_dir / file_sha256_matches helpers with acquire_dataset.
Links
- dataset-core changelog: crates/dataset-core/CHANGELOG.md
- dataset-ml changelog: crates/dataset-ml/CHANGELOG.md
- Repository: SomeB1oody/dataset-core