# Simple Dataset Demo (v3.x)

This notebook loads the `simple` dataset, whose audio is embedded directly in parquet files, using HuggingFace `datasets` v3.x with `trust_remote_code=True`.

**Note**: The simple dataset works the same in both v3.x and v4.x since it uses parquet with embedded audio.

*Generated by Claude*

In [None]:
# Check datasets version
from oyez_sa_asr.hf_compat import datasets_version, supports_loading_scripts

print(f"datasets version: {datasets_version()}")
print(f"Supports loading scripts: {supports_loading_scripts()}")

In [None]:
# v3.x loading; trust_remote_code is optional for the simple dataset
from datasets import load_dataset

ds = load_dataset("../datasets/simple", "lt1m", trust_remote_code=True)
sample = ds["train"][0]

print(f"Sentence: {sample['sentence'][:100]}...")
print(f"Speaker: {sample.get('speaker', 'N/A')}")
print(f"Duration: {sample.get('duration', 'N/A')}s")

In [None]:
from IPython.display import Audio

# Audio is embedded in parquet
audio = sample["audio"]
Audio(data=audio["array"], rate=audio["sampling_rate"])
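
Because the decoded samples and the sampling rate are both available, the clip length can be cross-checked against the `duration` field. The following is a minimal sketch, assuming the audio dict has the `array` and `sampling_rate` keys used above and that `array` is a one-dimensional (mono) sequence; `decoded_duration_seconds` and `fake_audio` are hypothetical names introduced only for illustration.

```python
def decoded_duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds implied by the decoded samples."""
    sampling_rate = audio["sampling_rate"]
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    # Number of samples divided by samples per second gives seconds.
    # Assumption: audio["array"] is 1-D (mono), so len() counts samples.
    return len(audio["array"]) / sampling_rate


# Hypothetical stand-in for sample["audio"]: silence at 16 kHz.
fake_audio = {"array": [0.0] * 32000, "sampling_rate": 16000}
print(decoded_duration_seconds(fake_audio))
```

With a real sample, `decoded_duration_seconds(sample["audio"])` could be compared to `sample.get("duration")`, allowing for small rounding differences.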

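## Available Configurations

Each name below is passed as the second argument to `load_dataset` (the configuration name). The examples in this notebook then read the `train` split of the loaded configuration.

- `lt1m`: Utterances shorter than 1 minute (most common)
- `lt5m`: Utterances of 1-5 minutes
- `lt30m`: Utterances of 5-30 minutes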
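
The duration buckets above can be expressed as a small lookup. This is only a sketch: `config_for_duration` and `CONFIG_BOUNDS` are hypothetical helpers, not part of the dataset, and the handling of exact boundaries (for example, whether a 60-second utterance belongs to `lt1m` or `lt5m`) is an assumption, since the list does not specify it.

```python
from typing import Optional

# Exclusive upper bounds in seconds, taken from the list above.
# Assumption: an utterance exactly on a boundary goes to the next bucket.
CONFIG_BOUNDS = [("lt1m", 60.0), ("lt5m", 300.0), ("lt30m", 1800.0)]


def config_for_duration(seconds: float) -> Optional[str]:
    """Return the name of the bucket an utterance of this length falls in."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    for name, upper_bound in CONFIG_BOUNDS:
        if seconds < upper_bound:
            return name
    # 30 minutes or longer: none of the listed buckets applies.
    return None


print(config_for_duration(12.5))
print(config_for_duration(240.0))
```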

In [None]:
# Load a different configuration (utterances of 1-5 minutes)
ds_5m = load_dataset("../datasets/simple", "lt5m", trust_remote_code=True)
sample_5m = ds_5m["train"][0]
print(f"lt5m sample duration: {sample_5m.get('duration', 'N/A')}s")

## Note on v3.x vs v4.x

For the `simple` dataset, both versions work identically since the audio is embedded in parquet files. The `trust_remote_code=True` parameter is optional.
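
Since the flag is optional here, one way to keep a single loading call across versions is to pass it only on v3.x. The sketch below assumes that the flag matters only for major versions below 4 and that a plain `major.minor.patch` version string is available; `load_kwargs_for` is a hypothetical helper, not part of `datasets` or `oyez_sa_asr`.

```python
def load_kwargs_for(version: str) -> dict:
    """Return extra load_dataset keyword arguments for a datasets version."""
    # Only the major component is needed, e.g. "3.6.0" -> 3.
    major = int(version.split(".")[0])
    # Assumption: trust_remote_code is only relevant before v4.
    return {"trust_remote_code": True} if major < 4 else {}


print(load_kwargs_for("3.6.0"))
print(load_kwargs_for("4.0.0"))
```

In the notebook this could be used as `load_dataset("../datasets/simple", "lt1m", **load_kwargs_for(datasets.__version__))` after `import datasets`.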