The initial preprocessor stores the arrays as .npz files. Parquet may be more efficient, so here I experiment with converting the arrays to Parquet.

In [5]:
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

# Load one preprocessed file; X is expected to have shape (290, 400, 400),
# i.e. 290 grayscale images of size 400×400
data_f = np.load(r"C:\Users\NoahB\OneDrive\Desktop\cetacean_detection\nefsc_sbnms_200903_nopp6_ch10\processed\intial_run\images\NOPP6_EST_20090328_000000_CH10.npz")
images = data_f['X']
labels = data_f['y']
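# Build one row per image; tobytes() keeps only the raw bytes, so the shape
# and dtype must be supplied again when the file is read back
data = {
    'labels': [l.tobytes() for l in labels],  # one serialized label per image
    'image_data': [img.tobytes() for img in images]  # raw pixel bytes
}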

# Create a PyArrow table
table = pa.Table.from_pydict(data)

# Save as Parquet
pq.write_table(table, 'images_collection.parquet')

In [6]:
import os
print(os.path.getsize(r"nefsc_sbnms_200903_nopp6_ch10/processed/intial_run/images/NOPP6_EST_20090328_000000_CH10.npz"))
print(os.path.getsize(r'C:\Users\NoahB\OneDrive\Desktop\cetacean_detection\nefsc_sbnms_200903_nopp6_ch10\processed\intial_run\parquet\image_arrays.parquet'))

46401650
3770871


Storage drops from about 46.4 MB to about 3.8 MB, roughly 12x smaller, so on size alone Parquet is the obvious choice.
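
One likely reason for the gap, offered as an assumption rather than something measured above: `np.savez` writes arrays uncompressed, while `pq.write_table` applies snappy compression by default. The sketch below uses a synthetic, mostly-zero image stack (a stand-in, not the real data) to show how much of the difference compression alone can account for within the .npz format.

```python
import io

import numpy as np

# Synthetic stand-in: mostly-zero images with a small noisy patch
rng = np.random.default_rng(0)
images = np.zeros((16, 100, 100), dtype=np.uint8)
images[:, 40:60, 40:60] = rng.integers(0, 256, size=(16, 20, 20), dtype=np.uint8)


def npz_size(save_fn):
    """Serialize the stack into memory with the given saver and return its size."""
    buf = io.BytesIO()
    save_fn(buf, X=images)
    return buf.getbuffer().nbytes


raw_size = npz_size(np.savez)                    # stored without compression
compressed_size = npz_size(np.savez_compressed)  # deflate-compressed
print(raw_size, compressed_size)
```

If the compressed .npz lands near the Parquet size on the real file, the saving comes from compression rather than from the container format.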

In [8]:
# experiment with hdf5
import h5py
data_f = r"C:\Users\NoahB\OneDrive\Desktop\cetacean_detection\nefsc_sbnms_200903_nopp6_ch10\processed\intial_run\hdf5\sample.h5"
# Save the images and labels as two datasets
with h5py.File(data_f, 'w') as f:
    f.create_dataset('images', data=images)
    f.create_dataset('labels', data=labels)

print(os.path.getsize(data_f))


46403208


HDF5 does not yield the same storage benefits (at about 46.4 MB the file is no smaller than the .npz), but it is much better for random access: we can read specific indices from the array without loading the entire dataset. So while we are training locally, HDF5 is the best option, since it can be shuffled and read easily (rather than having to preshuffle the data and store it as a Parquet file).
* If we end up needing very quick upload / download of the entire dataset, we may want to switch to Parquet. For now the larger file size shouldn't be a huge issue.
* Final decision: use HDF5 for now for its read capabilities, but switch to Parquet if we need very efficient file upload / download.
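
A minimal sketch of the random-access pattern described above, assuming a small synthetic dataset and a hypothetical `read_batch` helper. Note that h5py requires index lists to be in increasing order, so the random batch indices are sorted before the read.

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic stand-in for the real arrays (shape and dtype are assumptions)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(32, 40, 40), dtype=np.uint8)
labels = np.arange(32)

path = os.path.join(tempfile.mkdtemp(), "sample.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("images", data=images)
    f.create_dataset("labels", data=labels)


def read_batch(path, indices):
    """Read only the requested rows; the rest of the file stays on disk."""
    idx = np.sort(np.asarray(indices))  # h5py needs increasing indices
    with h5py.File(path, "r") as f:
        return f["images"][idx], f["labels"][idx]


# Draw a random batch without loading the whole dataset
batch_idx = rng.choice(len(labels), size=8, replace=False)
batch_images, batch_labels = read_batch(path, batch_idx)
```

Sorting changes the order within a batch, which can be restored with an in-memory permutation if the order matters.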

The entire dataset in a single HDF5 file is 29 GB, the same as the full npy dataset, so the two formats have similar storage sizes.
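
If the 29 GB footprint ever becomes a problem, HDF5 can also compress datasets chunk by chunk while keeping per-image reads. This has not been tried on the real data. The sketch below uses a synthetic, mostly-zero stack, and the chunk shape of one image per chunk is an assumption chosen to match single-image access.

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic stand-in: mostly-zero images with a small noisy patch
rng = np.random.default_rng(0)
images = np.zeros((16, 100, 100), dtype=np.uint8)
images[:, 40:60, 40:60] = rng.integers(0, 256, size=(16, 20, 20), dtype=np.uint8)

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")
gzip_path = os.path.join(tmp_dir, "gzip.h5")

# Baseline: contiguous, uncompressed storage
with h5py.File(plain_path, "w") as f:
    f.create_dataset("images", data=images)

# One image per chunk, each chunk gzip-compressed independently
with h5py.File(gzip_path, "w") as f:
    f.create_dataset(
        "images", data=images, chunks=(1, 100, 100), compression="gzip"
    )

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
print(plain_size, gzip_size)

# Reading a single image decompresses only the chunk that holds it
with h5py.File(gzip_path, "r") as f:
    one_image = f["images"][5]
```

Compression trades some read speed for space, so it would be worth timing batch reads on the real files before adopting it.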