# Verify `/das` Dataset Shape in the DAS-HDF5 File Collection

Each DAS-HDF5 file is expected to contain DAS data with 30,000 time samples, so its `/das` HDF5 dataset should have the shape (30000, 8721). The code below checks that shape in every DAS-HDF5 file and prints any file where it differs.

**Note**: The data access approach here is based on [this](https://gist.github.com/ajelenak/db0d9bf14b7ea4c48acf20249e189c80) gist.

In [1]:
import s3fs
import h5py

In [2]:
%%time
s3 = s3fs.S3FileSystem(anon=False, default_fill_cache=False)
walker = s3.walk('cordial1/repack/DAS-HDF5/PoroTomo/')
for dirpath, _, filenames in walker:
    # Each item is a (dirpath, dirnames, filenames) tuple; an empty
    # filenames list simply skips the inner loop.
    for name in filenames:
        # Open the S3 object in a context manager so it is closed after the check
        with s3.open(f'{dirpath}/{name}', mode='rb') as s3obj:
            with h5py.File(s3obj, mode='r', driver='fileobj') as f:
                if f['das'].shape != (30000, 8721):
                    print(f"{f.filename}:/das shape = {f['das'].shape}")

<File-like object S3FileSystem, cordial1/repack/DAS-HDF5/PoroTomo/20160321/PoroTomo_iDAS16043_160321141721.h5>:/das shape = (7696, 8721)
<File-like object S3FileSystem, cordial1/repack/DAS-HDF5/PoroTomo/20160321/PoroTomo_iDAS16043_160321154434.h5>:/das shape = (24824, 8721)
CPU times: user 3min 55s, sys: 19.8 s, total: 4min 15s
Wall time: 49min 16s


The total run time (wall time) is reported as 49 minutes and 16 seconds. There are 8,467 DAS-HDF5 files. The time per file, in seconds, is:

In [3]:
(49 * 60 + 16)/8467

0.34912011338136295

That is approximately 0.35 seconds per file. This is **much quicker** than downloading each file from S3 to check its `/das` dataset.
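
The per-file figure can also be derived directly from the `Wall time` string that `%%time` prints. The following is a minimal sketch, not part of the original notebook: `wall_time_to_seconds` is a hypothetical helper, and it assumes the report uses only the `Nmin Ns` style seen above (no hours or milliseconds).

```python
import re


def wall_time_to_seconds(text):
    """Convert a string such as '49min 16s' to seconds.

    Hypothetical helper: only handles optional 'Nmin' and 'Ns' parts.
    """
    match = re.fullmatch(r"\s*(?:(\d+)min)?\s*(?:(\d+(?:\.\d+)?)s)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized time string: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2) or 0)
    return minutes * 60 + seconds


# Values taken from the run reported above
total_seconds = wall_time_to_seconds("49min 16s")
n_files = 8467

per_file = total_seconds / n_files
print(f"{per_file:.2f} seconds per file")
```

Keeping the conversion in a small function makes it easy to repeat the calculation for later runs by changing only the time string and the file count.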