# Tutorial: Using the OKLAD (Oklahoma Labeled AI Dataset) Dataset with SeisBench

**Author:** Hongyu Xiao @ OU

**Last Updated:** 2025-11-17


# Before Using SeisBench or OKLAD

---

### **Purpose**

This notebook is the **very first** step in the SeisBench workflow tutorial, focusing on **loading and reproducing my work**.

---

### **Machine Learning Key Package Versions and Setup**

| Component | Detail |
| :--- | :--- |
| **python version** | 3.12 |
| **seisbench version** | 0.7.0 |
| **cuda version** | 12.1 |
| **jupyter version** | 1.0.0 |
| **obspy version** | 1.4.1 |
| **torch version** | 2.3.1 |

### **(*) Optional but helpful packages for visualization**

| Component | Detail |
| :--- | :--- |
| **seaborn version** | 0.13.2 |
| **matplotlib version** | 3.9.0 |
| **cartopy version** | 0.24.1 |

### **(*) Annotation packages**

| Component | Detail |
| :--- | :--- |
| **pyocto version** | 0.1.9 |
| **pyarrow version** | 19.0.0 |
| **pyrocko version** | 2025.1.21 |

If you do not know which versions you currently have installed, copy the following code and run it in your notebook:


In [None]:
import sys
import seisbench
import obspy
import torch
import jupyter_core

print(f"python: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}") # Print the major, minor, and micro version of the running Python interpreter.
print(f"seisbench: {seisbench.__version__}") # Print the version of the SeisBench library.
print(f"cuda: {torch.version.cuda if torch.cuda.is_available() else 'N/A'}") # Print the version of the CUDA toolkit being used by PyTorch, or 'N/A' if CUDA is not available.
print(f"jupyter: {jupyter_core.__version__}") # Print the version of the Jupyter Core library.
print(f"obspy: {obspy.__version__}") # Print the version of the ObsPy library.
print(f"torch: {torch.__version__}") # Print the version of the PyTorch library.

python: 3.12.3
seisbench: 0.7.0
cuda: N/A
jupyter: 5.7.2
obspy: 1.4.1
torch: 2.2.2


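You can copy and paste the following command to install the most essential packages first. Python itself and the CUDA toolkit are not pip packages, so set them up separately (for example, when creating your conda environment and choosing a PyTorch build).

```python

%pip install seisbench==0.7.0 obspy==1.4.1 torch==2.3.1

```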

Other versions of these packages should also work, provided they are backward compatible.
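To compare an environment against the versions listed in the tables above, a small helper along these lines may be convenient. It is only a sketch: the `EXPECTED` dictionary and the `compare_versions` function are illustrative names, not part of SeisBench, and the check is a plain string comparison.

```python
from importlib import metadata

# Versions taken from the tables above; extend this dict as needed.
EXPECTED = {
    "seisbench": "0.7.0",
    "obspy": "1.4.1",
    "torch": "2.3.1",
}


def compare_versions(expected, lookup=metadata.version):
    """Return {package: (expected, installed, status)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed is None:
            status = "missing"
        elif installed == wanted:
            status = "match"
        else:
            status = "differs"
        report[name] = (wanted, installed, status)
    return report


for name, (wanted, installed, status) in compare_versions(EXPECTED).items():
    print(f"{name:10s} expected {wanted:8s} installed {installed!s:8s} {status}")
```

A status of `differs` is not necessarily a problem. Some builds report a version with a local suffix, and a newer backward-compatible release will also show up as different.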

## Loading a Dataset

SeisBench provides access to several pre-compiled datasets. These are curated collections of seismic waveforms and associated metadata, ready for use in machine learning applications. You can find a list of available datasets in the [SeisBench documentation](https://seisbench.readthedocs.io/en/stable/pages/benchmark_datasets.html).

Here, we first load `DummyDataset`, a small sample seismology dataset. We specify a `sampling_rate` of 100 Hz, which means the waveforms will be resampled to this rate if they were recorded at a different one.

In [1]:
import seisbench
import seisbench.data as sbd

data = sbd.DummyDataset(sampling_rate=100)
train, dev, test = data.train_dev_test()



When running this command for the first time, the dataset is downloaded. All downloaded data is stored in the SeisBench cache. 

The location of the cache defaults to `~/.seisbench`, but can be set using the environment variable `SEISBENCH_CACHE_ROOT`. 

Let's inspect the cache. Depending on which commands were used before, it contains at least the directory `datasets`. 

Inside this directory, each locally available dataset has its own folder. If we look into the folder `dummydataset`, we find two relevant files `metadata.csv` and `waveforms.hdf5`, containing the metadata and the waveforms.


In [7]:
%cd ~/.seisbench

%ls ~/.seisbench/datasets/ 

/Users/hongyuxiao/.seisbench
dummydataset/         okla_1mil_120s_ver_3/


Inside the `.seisbench` directory, the layout looks like this:
```
├── config.json
├── datasets
│   ├── dummydataset
│   │   ├── metadata.csv
│   │   └── waveforms.hdf5
│   └── okla_1mil_120s_ver_3
│       ├── metadata.csv
│       └── waveforms.hdf5
└── models
    └── v3
        ├── eqtransformer
        │   ├── instance.json.v2
        │   ├── instance.pt.v2
        │   ├── original.json.v3
        │   └── original.pt.v3
        ├── gpd
        │   ├── ethz.json.v1
        │   ├── ethz.pt.v1
        │   ├── original.json.v1
        │   ├── original.pt.v1
        └── phasenet
            ├── instance.json.v2
            ├── instance.pt.v2
            ├── original.json.v1
            ├── original.pt.v1
            ├── scedc.json.v2
            ├── scedc.pt.v2
            ├── stead.json.v2
            └── stead.pt.v2

```
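The same layout can be checked from Python instead of the shell. The sketch below uses only the standard library. The helper names `cache_root` and `list_cached_datasets` are made up for this example, and it assumes each dataset folder holds a `metadata.csv` and a `waveforms.hdf5`, as in the tree above.

```python
import os
from pathlib import Path


def cache_root(env=os.environ):
    """Resolve the cache directory: SEISBENCH_CACHE_ROOT if set, else ~/.seisbench."""
    return Path(env.get("SEISBENCH_CACHE_ROOT", Path.home() / ".seisbench"))


def list_cached_datasets(root):
    """Map each dataset folder name to whether both expected files are present."""
    datasets_dir = Path(root) / "datasets"
    if not datasets_dir.is_dir():
        return {}
    status = {}
    for folder in sorted(p for p in datasets_dir.iterdir() if p.is_dir()):
        has_files = (folder / "metadata.csv").is_file() and (folder / "waveforms.hdf5").is_file()
        status[folder.name] = has_files
    return status


print(list_cached_datasets(cache_root()))
```

A value of `False` for a folder usually means the download or conversion has not finished, or was interrupted.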

## What does a dataset contain?

Each dataset consists of waveforms and the associated metadata.

Let's first inspect the metadata. It is represented by a **pandas** DataFrame and lists, for each trace, attributes describing the source, the trace, the station and possibly the path. When loading a dataset, only the metadata is loaded into memory. 

The waveforms are loaded on demand.

In [15]:
import seisbench
import seisbench.data as sbd

data = sbd.DummyDataset()



## This dataset contains 100 traces

In [17]:
print(data)

DummyDataset - 100 traces


In [19]:
data.metadata[0:3]

| | index | trace_start_time | source_latitude_deg | source_longitude_deg | source_depth_km | source_event_category | source_magnitude | source_magnitude_uncertainty | source_magnitude2 | source_magnitude_uncertainty2 | ... | station_latitude_deg | station_longitude_deg | station_elevation_m | source_magnitude_type | source_magnitude_type2 | split | trace_name_original | trace_chunk | trace_sampling_rate_hz | trace_component_order |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 2007/01/01 01:42:45.08 | -20.43802 | -69.27681 | 83.18 | ID | 1.353 | 0.014 | 1.426 | 0.011 | ... | -21.04323 | -69.4874 | 900.0 | MA | ML | train | 2007_01_01 01_42_45_08 | | 20 | ZNE |
| 1 | 1 | 2007/01/01 02:41:13.75 | -21.64059 | -68.41443 | 118.38 | ID | 1.981 | 0.02 | 2.027 | 0.023 | ... | -21.04323 | -69.4874 | 900.0 | MA | ML | train | 2007_01_01 02_41_13_75 | | 20 | ZNE |
| 2 | 2 | 2007/01/01 03:50:29.27 | -21.84637 | -68.53904 | 111.82 | ID | 2.719 | 0.024 | 2.811 | 0.026 | ... | -21.04323 | -69.4874 | 900.0 | MA | ML | train | 2007_01_01 03_50_29_27 | | 20 | ZNE |
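To see the on-demand loading described earlier, the sketch below prints the metadata that is already in memory and then reads a single waveform. `describe_dataset` is an illustrative helper, not a SeisBench function. It relies only on `len(data)`, `data.metadata` and `data.get_waveforms`.

```python
def describe_dataset(data, n_rows=3):
    """Print a short summary of a SeisBench dataset and return the first trace's shape."""
    print(f"{len(data)} traces")
    print(data.metadata.head(n_rows))   # metadata is already in memory
    waveforms = data.get_waveforms(0)   # the waveform of trace 0 is read only now
    print("first trace shape:", waveforms.shape)
    return waveforms.shape


# Usage inside the notebook (requires SeisBench and the downloaded dataset):
# import seisbench.data as sbd
# describe_dataset(sbd.DummyDataset())
```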


The first time you load a pre-compiled dataset or your own dataset, it might take a while. 

SeisBench converts your csv and hdf5 files into its own data format.

# Load your own dataset

Loading your own data takes a bit more tweaking, but it is straightforward. 

Here is an example.

First, check where SeisBench is installed.

In [20]:
%pip show seisbench

Name: seisbench
Version: 0.7.0
Summary: The seismological machine learning benchmark collection
Home-page: 
Author: 
Author-email: Jack Woolam <jack.woollam@kit.edu>, Jannes Münchmeyer <munchmej@gfz-potsdam.de>
License: GPLv3
Location: /opt/anaconda3/envs/ML/lib/python3.12/site-packages
Requires: bottleneck, h5py, nest-asyncio, numpy, obspy, pandas, scipy, torch, tqdm
Required-by: 
Note: you may need to restart the kernel to use updated packages.



In this example, SeisBench is installed at 

<mark> /opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/ </mark>

List the contents of that directory:

In [21]:
%ls /opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/

__init__.py  __pycache__/ data/        generate/    models/      util/
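The install location can also be looked up from Python rather than read off the `%pip show` output. The following is a minimal sketch using the standard library; `package_dir` is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory of an installed package, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


pkg = package_dir("seisbench")
print(pkg)  # None if SeisBench is not installed in this kernel
if pkg is not None:
    # The file that lists the available dataset classes
    print(pkg / "data" / "__init__.py")
```

This reports the package visible to the running kernel, which can help when several environments are installed side by side.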


This is what the directory looks like (directories only):

```
/opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/
├── __pycache__
├── data
│   └── __pycache__
├── generate
│   └── __pycache__
├── models
│   └── __pycache__
└── util
    └── __pycache__

```

## The root directory <mark>/opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/</mark> contains an ```__init__.py```. This is where you can change the ```cache_root``` variable to decide where SeisBench stores its data and models

## Now, let's move into the `data` folder <mark>/opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/data</mark>, which contains **another** ```__init__.py```

Under the <mark>data</mark> folder, you will find a **py** file called ```__init__.py```:

```
/opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/data/__init__.py
```

This is what it looks like:

``` 
from .base import (
    BenchmarkDataset,
    Bucketer,
    GeometricBucketer,
    MultiWaveformDataset,
    WaveformDataset,
    WaveformDataWriter,
)
from .dummy import ChunkedDummyDataset, DummyDataset
from .ethz import ETHZ
from .geofon import GEOFON
from .instance import InstanceCounts, InstanceCountsCombined, InstanceGM, InstanceNoise
from .iquique import Iquique
from .isc_ehb import ISC_EHB_DepthPhases
from .lendb import LenDB
from .lfe_stacks import (
    LFEStacksCascadiaBostock2015,
    LFEStacksMexicoFrank2014,
    LFEStacksSanAndreasShelly2017,
)
from .neic import MLAAPDE, NEIC
from .obs import OBS
from .obst2024 import OBST2024
from .pnw import PNW, PNWAccelerometers, PNWExotic, PNWNoise
from .scedc import SCEDC, Meier2019JGR, Ross2018GPD, Ross2018JGRFM, Ross2018JGRPick
from .stead import STEAD
from .txed import TXED
```

Next, add your own customized dataset.

For example, here is what I added after the last line:

```
# By Hongyu Xiao, This is intended for New Million Dataset with Longer Trace length of 120s
from .okla_1Mil_120_ver3 import OKLA_1Mil_120s_Ver_3
```



With that addition, this is what my ```/opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/data/__init__.py``` looks like:

```
from .base import (
    BenchmarkDataset,
    Bucketer,
    GeometricBucketer,
    MultiWaveformDataset,
    WaveformDataset,
    WaveformDataWriter,
)
from .dummy import ChunkedDummyDataset, DummyDataset
from .ethz import ETHZ
from .geofon import GEOFON
from .instance import InstanceCounts, InstanceCountsCombined, InstanceGM, InstanceNoise
from .iquique import Iquique
from .isc_ehb import ISC_EHB_DepthPhases
from .lendb import LenDB
from .lfe_stacks import (
    LFEStacksCascadiaBostock2015,
    LFEStacksMexicoFrank2014,
    LFEStacksSanAndreasShelly2017,
)
from .neic import MLAAPDE, NEIC
from .obs import OBS
from .obst2024 import OBST2024
from .pnw import PNW, PNWAccelerometers, PNWExotic, PNWNoise
from .scedc import SCEDC, Meier2019JGR, Ross2018GPD, Ross2018JGRFM, Ross2018JGRPick
from .stead import STEAD
from .txed import TXED

# By Hongyu Xiao, This is intended for New Million Dataset with Longer Trace length of 120s
from .okla_1Mil_120_ver3 import OKLA_1Mil_120s_Ver_3
```

In my case, under the directory ```/opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/data/```, I created an **okla_1Mil_120_ver3.py** file that can be imported when using this package.

The key is that in the line

<mark>from **.okla_1Mil_120_ver3** import OKLA_1Mil_120s_Ver_3</mark>

the module name must match the name of the added py file:

<mark>**okla_1Mil_120_ver3.py**</mark>

This is what my okla_1Mil_120_ver3.py looks like:


```
import seisbench
import seisbench.util
from .base import BenchmarkDataset, WaveformDataWriter

from pathlib import Path
import h5py
import pandas as pd
import numpy as np


class OKLA_1Mil_120s_Ver_3(BenchmarkDataset):
    """
    Oklahoma dataset (OKLAD), 120 s traces. Hongyu Xiao.

    The conversion code is adapted from the SeisBench STEAD dataset class.
    The train/dev/test split is assigned randomly (70/15/15) during conversion.
    """

    def __init__(self, **kwargs):
        citation = (
            "Hongyu Xiao Modified from STanford EArthquake Dataset (STEAD): "
            "doi:"
        )
        license = "CC BY 4.0"
        super().__init__(citation=citation, license=license, **kwargs)

    def _download_dataset(self, writer: WaveformDataWriter, basepath=None, **kwargs):
        download_instructions = (
            "Provide the location of the merged OKLAD files (merge.csv and merge.hdf5) in the "
            "download_kwargs argument 'basepath'. "
            "This step is only necessary the first time the dataset is loaded."
        )

        metadata_dict = {
            "trace_start_time": "trace_start_time",
            "trace_category": "trace_category",
            "trace_name": "trace_name",
            "p_arrival_sample": "trace_p_arrival_sample",
            "p_status": "trace_p_status",
            "p_weight": "trace_p_weight",
            "p_travel_sec": "path_p_travel_sec",
            "s_arrival_sample": "trace_s_arrival_sample",
            "s_status": "trace_s_status",
            "s_weight": "trace_s_weight",
            "s_travel_sec": "path_s_travel_sec",
            "back_azimuth_deg": "path_back_azimuth_deg",
            "snr_db": "trace_snr_db",
            "coda_end_sample": "trace_coda_end_sample",
            "network_code": "station_network_code",
            "receiver_code": "station_code",
            "receiver_type": "trace_channel",
            "receiver_latitude": "station_latitude_deg",
            "receiver_longitude": "station_longitude_deg",
            "receiver_elevation_m": "station_elevation_m",
            "source_id": "source_id",
            "source_origin_time": "source_origin_time",
            "source_origin_uncertainty_sec": "source_origin_uncertainty_sec",
            "source_latitude": "source_latitude_deg",
            "source_longitude": "source_longitude_deg",
            "source_error_sec": "source_error_sec",
            "source_gap_deg": "source_gap_deg",
            "source_horizontal_uncertainty_km": "source_horizontal_uncertainty_km",
            "source_depth_km": "source_depth_km",
            "source_depth_uncertainty_km": "source_depth_uncertainty_km",
            "source_magnitude": "source_magnitude",
            "source_magnitude_type": "source_magnitude_type",
            "source_magnitude_author": "source_magnitude_author",
        }

        path = self.path
        # Use the basepath passed through download_kwargs if there is one; otherwise fall back
        # to the default below. Change the default to the directory holding merge.csv and merge.hdf5.
        #basepath = '/Users/hongyuxiao/Hongyu_File/Data'
        #basepath = '/ourdisk/hpc/ogs/hongyux/dont_archive/Merge_Data_OklaCLEAN_1Mil_Raw_Ver2'
        if basepath is None:
            basepath = '/ourdisk/hpc/ogs/hongyux/dont_archive/Merge_Data_OklaCLEAN_1Mil_Raw_Ver_3_with_120s'

        basepath = Path(basepath)

        if not (basepath / "merge.csv").is_file():
            raise ValueError(
                "Basepath does not contain file merge.csv. " + download_instructions
            )
        if not (basepath / "merge.hdf5").is_file():
            raise ValueError(
                "Basepath does not contain file merge.hdf5. " + download_instructions
            )

        self.path.mkdir(parents=True, exist_ok=True)
        seisbench.logger.warning(
            "Converting Okla_1Mil_120 files to SeisBench format. This might take a while."
        )

        #split_url = "https://github.com/smousavi05/EQTransformer/raw/master/ModelsAndSampleData/test.npy"
        #seisbench.util.download_http(
        #    split_url, path / "test.npy", desc=f"Downloading test splits"
        #)

        # Copy metadata and rename columns to SeisBench format
        metadata = pd.read_csv(basepath / "merge.csv")
        metadata.rename(columns=metadata_dict, inplace=True)
        metadata['split'] = np.random.choice(['train', 'dev', 'test'], size=metadata.shape[0], p=[0.7, 0.15, 0.15])

        # Alternative (disabled): chronological split by source origin time
        #splits = []
        #for o in metadata['source_origin_time']: 
        #    if str(o) < "2021-01-01": 
        #        split = "train" 
        #    elif str(o) < "2021-06-01": 
        #        split = "dev" 
        #    else: 
        #        split = "test" 
        #    splits.append(split) 
        #metadata["split"] = splits
        # Set split
        #test_split = set(np.load(path / "test.npy"))
        #test_mask = metadata["trace_name"].isin(test_split)
        #train_dev = metadata["trace_name"][~test_mask].values
        #dev_split = train_dev[
        #    ::18
        #]  # Use 5% of total traces as suggested in EQTransformer Github repository
        #dev_mask = metadata["trace_name"].isin(dev_split)
        #metadata["split"] = "train"
        #metadata.loc[dev_mask, "split"] = "dev"
        #metadata.loc[test_mask, "split"] = "test"

        # Writer data format
        writer.data_format = {
            "dimension_order": "CW",
            "component_order": "ZNE",
            "sampling_rate": 100,
            "measurement": "velocity",
            "unit": "counts",
            "instrument_response": "not restituted",
        }

        writer.set_total(len(metadata))

        with h5py.File(basepath / "merge.hdf5") as f:
            gdata = f["data"]
            for _, row in metadata.iterrows():
                row = row.to_dict()
                waveforms = gdata[row["trace_name"]][()]
                # Only traces stored as (samples, 3) are converted; any other shape is skipped
                if waveforms.shape[1] == 3:
                    waveforms = waveforms.T  # From WC to CW
                    waveforms = waveforms[[2, 1, 0]]  # From ENZ to ZNE

                    writer.add_trace(row, waveforms)

```
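Because the module name in the import line has to match the file name, it can be worth checking this before restarting the kernel. The sketch below parses the text of an `__init__.py` and reports relative imports that have no matching `.py` file next to it. `missing_modules` is an illustrative helper, and it assumes every `from .module import ...` line refers to a single `.py` file, as in the listing of `seisbench/data/__init__.py` shown earlier.

```python
import ast
from pathlib import Path


def missing_modules(init_source, package_dir):
    """Return relative-import module names in init_source that have no matching .py file."""
    missing = []
    for node in ast.walk(ast.parse(init_source)):
        # level == 1 marks a "from .module import ..." statement
        if isinstance(node, ast.ImportFrom) and node.level == 1 and node.module:
            target = Path(package_dir) / (node.module.replace(".", "/") + ".py")
            if not target.is_file():
                missing.append(node.module)
    return missing


# Usage against the installation path used in this tutorial:
# init_file = Path("/opt/anaconda3/envs/ML/lib/python3.12/site-packages/seisbench/data/__init__.py")
# print(missing_modules(init_file.read_text(), init_file.parent))
```

An empty list means every imported module has a file with the same name. A misspelled file name would show up here as a missing module.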

## Documentation of okla_1Mil_120_ver3.py file.

<mark>class OKLA_1Mil_120s_Ver_3(BenchmarkDataset):</mark>

- Make sure the class name `OKLA_1Mil_120s_Ver_3` is exactly the name you import in ```__init__.py```.

Adjust `metadata_dict` so that its keys match the column names in your own csv file.
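When adapting `metadata_dict`, it can help to see which columns of your csv file the mapping leaves untouched. The sketch below mirrors how `rename(columns=metadata_dict)` treats column names (unmapped names stay as they are) and flags names that lack one of the prefixes used by the target names in `metadata_dict` (`trace_`, `source_`, `station_`, `path_`). The helper names and the example header are hypothetical.

```python
SEISBENCH_PREFIXES = ("trace_", "source_", "station_", "path_")


def renamed_columns(columns, metadata_dict):
    """Apply metadata_dict to a list of column names, leaving unmapped names unchanged."""
    return [metadata_dict.get(col, col) for col in columns]


def columns_without_prefix(columns, metadata_dict):
    """Return renamed columns that do not start with one of the category prefixes."""
    return [c for c in renamed_columns(columns, metadata_dict)
            if not c.startswith(SEISBENCH_PREFIXES)]


# Hypothetical header of a merge.csv file, for illustration only.
header = ["trace_name", "p_arrival_sample", "receiver_code", "my_extra_column"]
mapping = {
    "p_arrival_sample": "trace_p_arrival_sample",
    "receiver_code": "station_code",
}
print(renamed_columns(header, mapping))
print(columns_without_prefix(header, mapping))  # ['my_extra_column']
```

Any name reported by `columns_without_prefix` is a candidate for a new entry in `metadata_dict`.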

Make sure `basepath` points to the directory that contains your compiled csv and hdf5 files:

 - ```basepath = '/ourdisk/hpc/ogs/hongyux/dont_archive/Merge_Data_OklaCLEAN_1Mil_Raw_Ver_3_with_120s'```
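Before triggering the conversion, it may be worth confirming that the directory really contains the two files the class looks for (`merge.csv` and `merge.hdf5`). This is a minimal standard-library sketch; `missing_input_files` is a hypothetical helper and the path in the example call is a placeholder.

```python
from pathlib import Path


def missing_input_files(basepath, required=("merge.csv", "merge.hdf5")):
    """Return the required file names that are absent from basepath."""
    basepath = Path(basepath)
    return [name for name in required if not (basepath / name).is_file()]


# Replace with your own directory; a non-existent path reports both files as missing.
print(missing_input_files("/path/to/your/merged/files"))
```

An empty list means both files were found, so the two `ValueError` checks in `_download_dataset` should not be triggered.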

```metadata['split'] = np.random.choice(['train', 'dev', 'test'], size=metadata.shape[0], p=[0.7, 0.15, 0.15])```

This will randomly split the dataset into <mark>70% training, 15% dev and 15% test </mark>
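Note that `np.random.choice` draws each label independently, so the 70/15/15 proportions are approximate, and without a seed the assignment changes every time the conversion runs. If a reproducible split is wanted, a seeded generator is one option. The sketch below is an alternative to the original line, not the code the dataset was built with; `random_split` and `seed=42` are assumptions made for illustration.

```python
import numpy as np


def random_split(n_traces, seed=42, fractions=(0.7, 0.15, 0.15)):
    """Draw a reproducible train/dev/test label for each trace."""
    rng = np.random.default_rng(seed)
    return rng.choice(["train", "dev", "test"], size=n_traces, p=list(fractions))


labels = random_split(10_000)
names, counts = np.unique(labels, return_counts=True)
# Observed fraction per split; close to, but not exactly, 0.7 / 0.15 / 0.15
print(dict(zip(names.tolist(), (counts / len(labels)).round(3).tolist())))
```

Inside the dataset class this would correspond to `metadata['split'] = random_split(metadata.shape[0])`.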

```
# Writer data format
writer.data_format = {
    "dimension_order": "CW",
    "component_order": "ZNE",
    "sampling_rate": 100,
    "measurement": "velocity",
    "unit": "counts",
    "instrument_response": "not restituted",
}
```

Please adjust these fields to match your own data as well.
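The `dimension_order` and `component_order` entries have to agree with what the conversion loop actually writes. The sketch below reproduces the two array operations from the class (`.T` and `[[2, 1, 0]]`) on a synthetic trace so the effect is visible. `to_seisbench_layout` is an illustrative helper. Unlike the original loop, which silently skips traces of another shape, it raises an error. The 12,000 samples assume a 120 s trace at 100 Hz.

```python
import numpy as np


def to_seisbench_layout(waveforms):
    """Convert a (samples, 3) ENZ array to a (3, samples) ZNE array."""
    if waveforms.ndim != 2 or waveforms.shape[1] != 3:
        raise ValueError(f"expected shape (samples, 3), got {waveforms.shape}")
    channels_first = waveforms.T          # WC -> CW
    return channels_first[[2, 1, 0]]      # ENZ -> ZNE


# Synthetic trace: 120 s at 100 Hz, with columns filled with 0 (E), 1 (N), 2 (Z).
raw = np.tile(np.array([0.0, 1.0, 2.0]), (12_000, 1))
converted = to_seisbench_layout(raw)
print(converted.shape)        # (3, 12000)
print(converted[:, 0])        # [2. 1. 0.]  -> Z, N, E
```

If your hdf5 file already stores channels first, or in a different component order, change the conversion and the `data_format` entries together.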