# Tutorial: Using the OKLAD (Oklahoma Labeled AI Dataset) with SeisBench

**Author:** Hongyu Xiao @ OU

**Last Updated:** 2025-11-17


## Loading a Dataset

SeisBench provides access to several pre-compiled datasets. These are curated collections of seismic waveforms and associated metadata, ready for use in machine learning applications. You can find a list of available datasets in the [SeisBench documentation](https://seisbench.readthedocs.io/en/stable/pages/benchmark_datasets.html).

Here, we first load `DummyDataset`, a small sample dataset of seismic waveforms. We specify a `sampling_rate` of 100 Hz, so waveforms that are not already sampled at 100 Hz are resampled to that rate.

In [10]:
import seisbench
import seisbench.data as sbd

data = sbd.DummyDataset(sampling_rate=100)
train, dev, test = data.train_dev_test()

Downloading metadata: 100%|██████████| 16.8k/16.8k [00:00<00:00, 2.63MB/s]
Downloading waveforms: 100%|██████████| 2.76M/2.76M [00:00<00:00, 3.32MB/s]


The first time this command runs, the dataset is downloaded. All downloaded data is stored in the SeisBench cache.

The location of the cache defaults to `~/.seisbench`, but can be set using the environment variable `SEISBENCH_CACHE_ROOT`. 
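
The lookup rule above can be sketched in plain Python. Note that `resolve_cache_root` is a hypothetical helper written for this tutorial, not a SeisBench function; it only mirrors the rule as stated (environment variable first, then `~/.seisbench`). We also assume that the variable has to be set before `import seisbench` runs for it to take effect.

```python
import os
from pathlib import Path


def resolve_cache_root(environ=None):
    """Return the cache directory implied by the rule described above.

    Hypothetical helper for illustration; not part of the SeisBench API.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("SEISBENCH_CACHE_ROOT")
    if override:
        # An explicit override wins over the default location.
        return Path(override)
    return Path.home() / ".seisbench"


print("Cache root:", resolve_cache_root())
```

Passing a dictionary instead of relying on `os.environ` makes it easy to check both branches without changing the environment of the running notebook.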

Let's inspect the cache. Depending on which commands were run before, it contains at least the directory `datasets`.

Inside this directory, each locally available dataset has its own folder. If we look into the folder `dummydataset`, we find two relevant files, `metadata.csv` and `waveforms.hdf5`, containing the metadata and the waveforms, respectively.
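
As a quick check of this layout, the following sketch lists the files in one dataset folder together with their sizes. Both `hr_size` and `list_dataset_files` are hypothetical helpers written for illustration: the 1024-based units and the `62.0B` style formatting are assumptions, and the path assumes the default cache location.

```python
from pathlib import Path


def hr_size(num_bytes):
    """Format a byte count as a short human-readable string (1024-based units)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}TB"


def list_dataset_files(dataset_dir):
    """Return (file name, readable size) pairs for the files in a dataset folder."""
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        # Dataset not downloaded yet, or a different cache root is in use.
        return []
    return [
        (p.name, hr_size(p.stat().st_size))
        for p in sorted(dataset_dir.iterdir())
        if p.is_file()
    ]


# Assumes the default cache location; adjust if SEISBENCH_CACHE_ROOT is set.
dummy_dir = Path.home() / ".seisbench" / "datasets" / "dummydataset"
for name, size in list_dataset_files(dummy_dir):
    print(f"{name:20} {size}")
```

If the folder does not exist, the loop simply prints nothing, which usually means the dataset has not been downloaded yet.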


In [31]:
# Enhanced, more explanatory cache / datasets summary
# (Uses variables and helpers defined elsewhere in this notebook:
#  - cache_root (Path)
#  - datasets_dir (Path)
#  - hr_size(n) -> human readable size
#  - datetime)
print("SeisBench cache summary")
print("=" * 40)
print(f"Cache root: {cache_root}")
print()

# Top-level entries with type, size and modification time
try:
    top_entries = sorted(cache_root.iterdir())
    print(f"Top-level entries ({len(top_entries)}):")
    for p in top_entries:
        try:
            st = p.stat()
            size = hr_size(st.st_size)
            mtime = datetime.fromtimestamp(st.st_mtime).isoformat(sep=' ', timespec='seconds')
            typ = "dir" if p.is_dir() else "file"
            print(f"  - {p.name:30} {size:8}  {typ:4}  modified: {mtime}")
        except Exception as e:
            print(f"  - {p.name:30} (unable to stat: {e})")
except Exception as e:
    print("Could not list cache root:", e)
print()

# Datasets folder overview
if datasets_dir.exists() and datasets_dir.is_dir():
    ds_list = sorted([p for p in datasets_dir.iterdir() if p.is_dir()])
    print(f"Datasets folder: {datasets_dir}  ({len(ds_list)} datasets)")
    for d in ds_list:
        print(f"\nDataset: {d.name}")
        print(f"  Path: {d}")
        try:
            items = sorted(d.iterdir())
            print(f"  Contains {len(items)} item(s):")
            for it in items:
                try:
                    st = it.stat()
                    size = hr_size(st.st_size)
                    mtime = datetime.fromtimestamp(st.st_mtime).isoformat(sep=' ', timespec='seconds')
                    typ = "dir" if it.is_dir() else "file"
                    # Add a short hint for commonly expected dataset files
                    hint = ""
                    nm = it.name.lower()
                    if nm.endswith(".csv"):
                        hint = " (metadata CSV)"
                    elif nm.endswith(".hdf5") or nm.endswith(".h5"):
                        hint = " (waveforms HDF5)"
                    elif nm.endswith(".py"):
                        hint = " (script)"
                    print(f"    - {it.name:30} {size:8}  {typ:4}  modified: {mtime}{hint}")
                except Exception as e:
                    print(f"    - {it.name:30} (unable to stat: {e})")
        except Exception as e:
            print(f"  Could not inspect dataset contents: {e}")
else:
    print("No 'datasets' folder found in cache.")

SeisBench cache summary
Cache root: /Users/hongyuxiao/.seisbench

Top-level entries (4):
  - .DS_Store                      6.0KB     file  modified: 2025-02-14 18:25:58
  - config.json                    62.0B     file  modified: 2024-05-14 21:15:12
  - datasets                       160.0B    dir   modified: 2025-11-17 16:37:07
  - models                         96.0B     dir   modified: 2024-05-18 14:52:29

Datasets folder: /Users/hongyuxiao/.seisbench/datasets  (2 datasets)

Dataset: dummydataset
  Path: /Users/hongyuxiao/.seisbench/datasets/dummydataset
  Contains 2 item(s):
    - metadata.csv                   16.8KB    file  modified: 2025-11-17 16:37:10 (metadata CSV)
    - waveforms.hdf5                 2.8MB     file  modified: 2025-11-17 16:37:12 (waveforms HDF5)

Dataset: okla_1mil_120s_ver_3
  Path: /Users/hongyuxiao/.seisbench/datasets/okla_1mil_120s_ver_3
  Contains 5 item(s):
    - 202503_fix_proper_empty_field.py 280.0B    file  modified: 2025-04-08 18:20:59 (script)
 

## Explanation of SeisBench cache summary

- Why this is useful:
    - Quick inspection of locally available datasets and their on-disk footprint.
    - Helps identify missing or unexpected files and confirms where large files (e.g. HDF5 waveforms) reside.
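
One caveat when reading the summary: the size printed for a directory (for example `160.0B` for `datasets`) is the `st_size` of the directory entry itself, not the total size of its contents. To get the actual on-disk footprint of each dataset, the file sizes have to be summed recursively. A minimal sketch, where `dir_size_bytes` is a hypothetical helper and the path assumes the default cache location:

```python
import os
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of all regular files below `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = Path(root) / name
            if not fp.is_symlink():
                total += fp.stat().st_size
    return total


# Assumes the default cache location; adjust if SEISBENCH_CACHE_ROOT is set.
datasets_dir = Path.home() / ".seisbench" / "datasets"
if datasets_dir.is_dir():
    for d in sorted(p for p in datasets_dir.iterdir() if p.is_dir()):
        print(f"{d.name:30} {dir_size_bytes(d) / 1e6:10.1f} MB")
else:
    print("No 'datasets' folder found at", datasets_dir)
```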

## What does a dataset contain?