A toolbox for creating, processing and inspecting audio/image datasets through a simple CLI.
pip install datasets-toolbox
The goal of datasets-toolbox is to build audio/image datasets from the command line.
All commands accept the --config [config-name] and --split [split-name] options to specify the target, where config-name is the configuration name (e.g. a language) and split-name is a split such as train, validation, or test.
datasets import --config [data] --split [train] <sources>
Import data into the datasets structure.
If the configuration or split is not specified, the command defaults to the default configuration and the train split.
datasets modify <action> --config [data] --split [train] --other-params
If the configuration or split is not specified, the command runs recursively on all configurations and all splits.
datasets modify slice --config [data] --split [train] --min-length [ms] --hop-size [n]
datasets modify resample --config [data] --split [train] --sr [16000] --mono
datasets modify transcribe --model [openai/whisper-large-v3-turbo]
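The fallback rule for modify (an omitted --config or --split means "all of them") can be pictured with a small Python sketch. This is not the toolbox's code: the function resolve_targets and the layout mapping are hypothetical names introduced here, and the real on-disk structure may differ.

```python
from typing import Dict, List, Optional, Tuple


def resolve_targets(
    layout: Dict[str, List[str]],
    config: Optional[str] = None,
    split: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Return the (config, split) pairs a command would visit.

    `layout` is a hypothetical mapping from configuration name to its
    split names. Leaving `config` or `split` as None selects every
    available value, mirroring the "all configurations and all splits"
    fallback described above.
    """
    configs = [config] if config is not None else sorted(layout)
    targets: List[Tuple[str, str]] = []
    for name in configs:
        available = layout.get(name, [])
        if split is None:
            chosen = sorted(available)
        else:
            # Skip configurations that do not contain the requested split.
            chosen = [s for s in available if s == split]
        targets.extend((name, s) for s in chosen)
    return targets


# Assumed example layout: two language configurations.
layout = {"en": ["train", "test"], "fr": ["train"]}
print(resolve_targets(layout))                 # every config and split
print(resolve_targets(layout, config="en"))    # every split of "en"
print(resolve_targets(layout, split="train"))  # "train" in every config
```

Under this reading, passing both options narrows the run to a single pair, and passing neither visits everything.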
datasets inspect --config [data] --split [train] --other-params
If the configuration or split is not specified, the command runs recursively on all configurations and all splits.
datasets inspect hours --config [data] --split [train]
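As a rough illustration of what an hours report measures, the sketch below sums the duration of WAV files under a directory using only the Python standard library. It is an assumption-laden stand-in, not how datasets inspect hours is implemented: the function names wav_seconds and total_hours are hypothetical, and only uncompressed .wav files are handled.

```python
import wave
from pathlib import Path


def wav_seconds(path: Path) -> float:
    """Duration of one WAV file in seconds (frame count / sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(root: Path) -> float:
    """Total duration, in hours, of all .wav files found under `root`."""
    seconds = sum(wav_seconds(p) for p in sorted(root.rglob("*.wav")))
    return seconds / 3600.0
```

Calling total_hours on a split directory would give a figure comparable to what the command reports, assuming the split stores its audio as WAV files.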