## Converting HDF5 to LMDB

References:

* https://symas.com/lightning-memory-mapped-database/
* http://deepdish.io/2015/04/28/creating-lmdb-in-python/
* https://gist.github.com/bearpaw/3a07f0e8904ed42f376e
* http://stackoverflow.com/questions/37337523/how-do-you-load-an-lmdb-file-into-tensorflow
* http://research.beenfrog.com/code/2015/12/30/write-read-lmdb-example.html
* https://lmdb.readthedocs.io/en/release/
* http://stackoverflow.com/questions/8855574/convert-ndarray-from-float64-to-integer
* https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.frombuffer.html
* https://docs.scipy.org/doc/numpy/user/basics.types.html

Reasons:
* LMDB uses memory-mapped files, which gives much better I/O performance.
* It works with large datasets. A loader that reads an HDF5 file entirely into memory cannot handle a file larger than the available RAM, whereas LMDB reads records on demand.
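
To illustrate the memory-mapping point above, here is a minimal sketch using only Python's standard `mmap` module. It is not how LMDB is implemented internally; the file name, the 8-byte record layout, and the `read_record` helper are assumptions made up for this example. It only shows that a single record can be read from a mapped file without loading the whole file first.

```python
import mmap
import os
import tempfile

# Hypothetical layout: 1000 fixed-size records of 8 ASCII bytes each.
RECORD_SIZE = 8
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write("{:08}".format(i).encode("ascii"))


def read_record(path, index):
    """Return one record by slicing a memory map of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * RECORD_SIZE
            return bytes(mm[start:start + RECORD_SIZE])


print(read_record(path, 42))
```

The operating system decides which pages of the file are actually brought into memory, which is why this access pattern scales to files larger than RAM.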

Install: `pip install lmdb`


In [2]:
import lmdb
import h5py
import numpy as np
from driving_data import HandleData

In [3]:
# Load the HDF5 file and fetch the whole training set (batch=-1)
data = HandleData(path='TestData.h5', shuffle=False)
xs, ys = data.LoadTrainBatch(-1, crop_up=0)

Loading training data
Spliting training and validation
Number training images: 752
Number validation images: 188


In [4]:
# Open (or create) the LMDB environment.
# map_size is the maximum size, in bytes, that the database may grow to.
env = lmdb.open('mylmdb', map_size=2000000000)
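
The `map_size` value above is an upper bound on the database size, so it helps to check that the data will fit. The sketch below is one rough way to estimate it. The `estimate_map_size` helper, the safety factor of 10, and the 66x200x3 image shape are all assumptions for illustration; the real image shape in `TestData.h5` is not stated here.

```python
def estimate_map_size(num_items, bytes_per_item, safety_factor=10):
    """Rough upper bound, in bytes, for an LMDB map (hypothetical helper)."""
    return num_items * bytes_per_item * safety_factor


# Assumed sizes: one uint8 image of 66x200x3 plus one float32 label per item.
image_bytes = 66 * 200 * 3  # assumed shape, not taken from TestData.h5
label_bytes = 4             # a single np.float32 value
map_size = estimate_map_size(752, image_bytes + label_bytes)

print(map_size)
```

Under these assumptions the estimate stays well below the 2000000000 bytes used above. If a write would exceed `map_size`, py-lmdb raises `lmdb.MapFullError`, so erring on the large side is the safer choice.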

In [5]:
# Get a write transaction; LMDB stores everything as key/value pairs of bytes
with env.begin(write=True) as txn:
    # Iterate over the (image, steering) pairs
    for idx, (img, steer) in enumerate(zip(xs, ys)):
        str_id = '{:08}'.format(idx)      # key for the steering label
        img_id = 'img_{:08}'.format(idx)  # key for the image
        # steer[0] is an np.float32; tobytes() stores the raw bytes only,
        # so dtype and shape must be supplied again when reading
        txn.put(str_id.encode('ascii'), steer[0].tobytes())
        txn.put(img_id.encode('ascii'), img.tobytes())
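
Two details of the write loop above are easy to miss: keys are ordered as raw bytes, and `tobytes()` keeps neither the dtype nor the shape of an array. The sketch below shows both using a plain `dict` as a stand-in for the LMDB transaction, so it runs without a database. The tiny 2x3x3 `uint8` image and the steering value are made-up placeholders, not values from the dataset.

```python
import numpy as np

# Placeholder data; the real shape and dtype come from the dataset.
img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
steer = np.float32(-0.25)

# A plain dict stands in for the LMDB transaction in this sketch.
store = {}
for idx in range(3):
    store['{:08}'.format(idx).encode('ascii')] = steer.tobytes()
    store['img_{:08}'.format(idx).encode('ascii')] = img.tobytes()

# LMDB iterates keys in byte order: the digit keys sort before 'img_' keys.
ordered_keys = sorted(store)

# The reader has to supply dtype and shape, since only raw bytes were stored.
label = np.frombuffer(store[b'00000001'], dtype=np.float32)[0]
restored = np.frombuffer(store[b'img_00000001'], dtype=np.uint8).reshape(2, 3, 3)

print(ordered_keys)
print(label, restored.shape)
```

Zero-padding the index to eight digits is what keeps byte order and numeric order in agreement; without padding, `b'10'` would sort before `b'2'`.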

### Reading from LMDB

In [6]:
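env = lmdb.open('mylmdb', readonly=True)
with env.begin() as txn:
    cursor = txn.cursor()
    # Keys come back in byte order, so the steering labels ('00000000', ...)
    # are listed before the 'img_...' entries.
    # Note: 'img_...' values are decoded as float32 here as well; pass the
    # images' own dtype and reshape to recover the original arrays.
    for key, value in cursor:
        print(key, np.frombuffer(value, dtype=np.float32))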

b'00000000' [ 0.]
b'00000001' [-0.28670669]
b'00000002' [-0.35403481]
b'00000003' [-0.3218213]
b'00000004' [-0.2492938]
b'00000005' [-0.00639278]
b'00000006' [-0.06573342]
b'00000007' [-0.1582671]
b'00000008' [-0.3240065]
b'00000009' [-0.30283219]
b'00000010' [-0.09086371]
b'00000011' [-0.1713786]
b'00000012' [-0.1670458]
b'00000013' [ 0.]
b'00000014' [ 0.]
b'00000015' [ 0.08626717]
b'00000016' [ 0.1483205]
b'00000017' [ 0.1603017]
b'00000018' [ 0.07436135]
b'00000019' [ 0.2023488]
b'00000020' [ 0.184377]
b'00000021' [ 0.1477177]
b'00000022' [ 0.1392781]
b'00000023' [ 0.1392781]
b'00000024' [ 0.1392781]
b'00000025' [ 0.1392781]
b'00000026' [ 0.]
b'00000027' [ 0.]
b'00000028' [ 0.]
b'00000029' [ 0.]
b'00000030' [ 0.]
b'00000031' [ 0.]
b'00000032' [ 0.]
b'00000033' [ 0.01739435]
b'00000034' [ 0.08509921]
b'00000035' [ 0.1077428]
b'00000036' [ 0.1077428]
b'00000037' [ 0.08151991]
b'00000038' [ 0.03491397]
b'00000039' [ 0.03491397]
b'00000040' [ 0.0256832]
b'00000041' [ 0.02319655]
b'00000