### 0. Basic libraries

In [1]:
import numpy as np 
import pandas as pd 
import tables
import h5py

### 1. Creating HDF5 files

#### 1.1. Create & store array variables

Firstly, mock up some simple dummy data to save to our file.

In [2]:
d1 = np.random.random(size = (1000,20))
d2 = np.random.random(size = (1000,200))

The **first step in creating an HDF5 file is to initialise it**.

The syntax is very similar to opening a text file with Python's built-in `open`. The first argument provides the filename and location, the second the mode. We’re writing the file, so we pass `'w'` for write access.

In [3]:
hf = h5py.File('data.h5', 'w')

This creates a file object, `hf`, which has a bunch of associated methods. One is `create_dataset`, which does what it says on the tin. Just provide a name for the dataset, and the numpy array.

In [4]:
hf.create_dataset('vars1', data=d1)
hf.create_dataset('vars2', data=d2)

<HDF5 dataset "vars2": shape (1000, 200), type "<f8">

All we need to do now is close the file, which will write all of our work to disk.

In [5]:
hf.close()
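
Forgetting `close()` can leave a file only partly written. A minimal sketch of the same write using a `with` block is shown here, assuming a throwaway path in a temporary directory (the file name `data_with.h5` is made up for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical path in a temporary directory, so the sketch leaves the working folder untouched.
path = os.path.join(tempfile.mkdtemp(), "data_with.h5")
arr = np.random.random(size=(1000, 20))

# The file is flushed and closed automatically when the block exits,
# even if an exception is raised inside it.
with h5py.File(path, "w") as hf:
    hf.create_dataset("vars1", data=arr)

# Re-open in read mode to confirm the dataset reached the disk.
with h5py.File(path, "r") as hf:
    stored_shape = hf["vars1"].shape

print(stored_shape)
```

The rest of this notebook keeps the explicit `hf.close()` calls; the two styles write the same file.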

#### 1.2. Create & store a dictionary

In [6]:
d = {'x_vrs': [[1,2], [1, 0]], 'y_vrs': 2.5}

In [7]:
import deepdish as dd
import numpy as np

X = np.zeros((2, 2, 2, 2))
y = np.arange(1, 3, 1)

dd.io.save('test.h5', {'data': X, 'label': y}, compression=None)

#### 1.3. Create & store string variables

In [8]:
h5 = h5py.File("model_text1","w")
h5["test"] = "Hello World"
h5["train"] = 'go'

A weakness of this method is that we cannot store a list of strings directly.

In [9]:
try : 
    h5.create_dataset('val', data = ['a', 'b'])
except TypeError as err:
    print("TypeErrors: ", err)

TypeErrors:  No conversion path for dtype: dtype('<U1')


In [10]:
h5.close()

So we will use `deepdish` (`dd`) again, this time to save the data.

In [11]:
text1 = ['hello', 'fuck', 'press a button']
text2 = ['never', 'say', 'neverous']
dd.io.save('model_text2.h5', [text1, text2])
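
For comparison, `h5py` can also store a list of strings if it is given an explicit variable-length string dtype. The sketch below assumes an h5py version that provides `h5py.string_dtype` (2.10 or later); the file name `strings.h5` is made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "strings.h5")  # hypothetical file name
words = ["a", "b", "press a button"]

with h5py.File(path, "w") as hf:
    # An object array plus an explicit variable-length string dtype avoids
    # the '<U1' conversion error seen in the TypeError above.
    hf.create_dataset(
        "val",
        data=np.array(words, dtype=object),
        dtype=h5py.string_dtype(encoding="utf-8"),
    )

with h5py.File(path, "r") as hf:
    raw = hf["val"][:]

# Depending on the h5py version, elements come back as bytes or as str.
restored = [s.decode("utf-8") if isinstance(s, bytes) else s for s in raw]
print(restored)
```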



### 2. Reading HDF5 files.

#### 2.1. Reading data saved by `h5py` (here, `h5py.File('data.h5', 'w')`)

To open and read the data, we use the same `File` constructor in read mode, `'r'`.

In [12]:
hf = h5py.File('data.h5', 'r')
hf

<HDF5 file "data.h5" (mode r)>

To see what data is in this file, we can call the keys() method on the file object.

In [13]:
hf.keys()

<KeysViewHDF5 ['vars1', 'vars2']>

**Grab** each dataset we created above using the `get` method, specifying its name.

In [14]:
name1 = hf.get('vars1')
name1

<HDF5 dataset "vars1": shape (1000, 20), type "<f8">

This returns an HDF5 dataset object. To convert it to an array, just call numpy’s `array` function.

In [15]:
name1 = np.array(name1)
name1.shape

(1000, 20)

In [16]:
hf.close()
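
Converting with `np.array` reads the whole dataset into memory. A dataset object also accepts numpy-style slicing, which reads only the requested part from disk. The following is a small sketch on a throwaway file (the name `slice_demo.h5` is an assumption, not part of the files used above):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "slice_demo.h5")  # hypothetical file name
with h5py.File(path, "w") as hf:
    hf.create_dataset("vars1", data=np.arange(20000, dtype="f8").reshape(1000, 20))

with h5py.File(path, "r") as hf:
    ds = hf["vars1"]        # dataset handle; no data has been read yet
    first_rows = ds[:5]     # reads only the first five rows
    one_column = ds[:, 3]   # reads a single column
    everything = ds[()]     # reads the whole dataset, like np.array(ds)

print(first_rows.shape, one_column.shape, everything.shape)
```

The sliced results are ordinary numpy arrays, so they stay usable after the file is closed.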

#### 2.2. A file saved with `dd` (here, `dd.io.save('test.h5', ...)`) can be read with `h5py` in the same way

In [17]:
hf = h5py.File('test.h5', 'r')

In [18]:
hf.keys()

<KeysViewHDF5 ['data', 'label']>

In [19]:
name1 = hf.get('data')
name2 = hf.get('label')

name1, name2

(<HDF5 dataset "data": shape (2, 2, 2, 2), type "<f8">,
 <HDF5 dataset "label": shape (2,), type "<i8">)

In [20]:
name1.shape, name2.shape

((2, 2, 2, 2), (2,))

In [21]:
name2[:]

array([1, 2])

In [22]:
name1[0]

array([[[0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.]]])

In [23]:
hf.close()

Besides that, we can load the stored variables directly with `deepdish`.

In [24]:
a = dd.io.load('test.h5')
a

{'data': array([[[[0., 0.],
          [0., 0.]],
 
         [[0., 0.],
          [0., 0.]]],
 
 
        [[[0., 0.],
          [0., 0.]],
 
         [[0., 0.],
          [0., 0.]]]]),
 'label': array([1, 2])}

#### 2.3. Likewise, we now use `h5py` to load `model_text2.h5`

In [25]:
hf = h5py.File('model_text2.h5', 'r')
hf.keys()

<KeysViewHDF5 ['data']>

In [26]:
name = hf.get('data')
name = str(name)

In [27]:
name

'<HDF5 group "/data" (2 members)>'

In [28]:
hf.close()

But this approach does not return the stored values themselves, so we use `deepdish` instead.

In [29]:
Z = dd.io.load('model_text2.h5')
print(Z[0], '\n', Z[1])

['hello', 'fuck', 'press a button'] 
 ['never', 'say', 'neverous']


### 3. Groups.

Groups are the basic container mechanism in an HDF5 file, allowing hierarchical organisation of the data. Groups are created similarly to datasets, and datasets are then added using the group object.

In [30]:
d0 = np.random.random(size = (100))
d1 = np.random.random(size = (100,33))
d2 = np.random.random(size = (100,333))
d3 = np.random.random(size = (100,3333))

In [31]:
hf = h5py.File('data2.h5', 'w')

In [32]:
g1 = hf.create_group('group1')

In [33]:
g1.create_dataset('data1',data=d1)
g1.create_dataset('data2',data=d1)

<HDF5 dataset "data2": shape (100, 33), type "<f8">

We can also create subfolders. Just specify the group name as a directory-style path.

In [34]:
g2 = hf.create_group('group2/subfolder2')

In [35]:
g2.create_dataset('data3',data=d3)

<HDF5 dataset "data3": shape (100, 3333), type "<f8">

As before, to read data in directories and subdirectories, use the `get` method with the full subdirectory path.

In [36]:
group2 = hf.get('group2/subfolder2')
group2.items()

ItemsViewHDF5(<HDF5 group "/group2/subfolder2" (1 members)>)

In [37]:
group1 = hf.get('group1')
group1.items()

ItemsViewHDF5(<HDF5 group "/group1" (2 members)>)

In [38]:
n1 = group1.get('data1')
np.array(n1).shape

(100, 33)

In [39]:
hf.close()
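
When the layout of a file is not known in advance, calling `get` path by path becomes tedious. The sketch below walks every group and dataset with h5py's `visititems`. The file name `tree_demo.h5`, the callback name `record` and the small array shapes are assumptions made for the illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "tree_demo.h5")  # hypothetical file name
with h5py.File(path, "w") as hf:
    g1 = hf.create_group("group1")
    g1.create_dataset("data1", data=np.zeros((100, 33)))
    g2 = hf.create_group("group2/subfolder2")
    g2.create_dataset("data3", data=np.zeros((100, 5)))

found = []

def record(name, obj):
    """Callback for visititems: collect (path, kind, shape) for every object."""
    if isinstance(obj, h5py.Dataset):
        found.append((name, "dataset", obj.shape))
    else:
        found.append((name, "group", None))

with h5py.File(path, "r") as hf:
    hf.visititems(record)  # recursive walk over all groups and datasets

for entry in found:
    print(entry)
```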

### 4. Compression
To save disk space at the cost of read speed, you can compress the data. Just add the `compression` argument, which can be `gzip`, `lzf` or `szip`. `gzip` is the most portable, as it’s available with every HDF5 install; `lzf` is the fastest but doesn’t compress as effectively as `gzip`; and `szip` is a NASA format that is patent-encumbered. If you don’t know about it, chances are your organisation isn’t licensed to use it, so avoid it.

For `gzip` you can also specify the additional `compression_opts` argument, which sets the compression level. The default is 4, but it can be any integer between 0 and 9.

In [40]:
hf = h5py.File('data.h5', 'w')

hf.create_dataset('dataset_1', data=d1, compression="gzip", compression_opts=9)
hf.create_dataset('dataset_2', data=d2, compression="gzip", compression_opts=9)

hf.close()
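
How much space compression saves depends heavily on the data. As a rough sketch, the snippet below writes the same array with and without `gzip` and compares the file sizes. It deliberately uses an array of zeros, because random floats like the ones above barely compress. The file names are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

tmp = tempfile.mkdtemp()
raw_path = os.path.join(tmp, "raw.h5")    # hypothetical file names
gzip_path = os.path.join(tmp, "gzip.h5")

# Highly repetitive data compresses well; random floats barely compress at all.
data = np.zeros((1000, 200))

with h5py.File(raw_path, "w") as hf:
    hf.create_dataset("dataset_1", data=data)

with h5py.File(gzip_path, "w") as hf:
    hf.create_dataset("dataset_1", data=data, compression="gzip", compression_opts=9)

raw_size = os.path.getsize(raw_path)
gzip_size = os.path.getsize(gzip_path)
print(raw_size, gzip_size)
```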

### 5. Examples:
Create a database with three fields: the first two are dictionaries holding 3D image arrays (for instance of size 256x256x3), and the last is a list of file-name strings.

#### 5.1. Step1. Create datasets

In [41]:
vars1 = dict({'img_data': np.random.randint(0, 255, size = (10, 256, 256, 3)), 'idx': 10})
vars1['img_data'].shape, vars1['idx']

((10, 256, 256, 3), 10)

Note that we cannot use attribute access such as `vars1.img_data` to reach the stored values, because `vars1` is a plain `dict`:

In [42]:
try: vars1.img_data
except AttributeError as err:
    print('AttributeError:', err)

AttributeError: 'dict' object has no attribute 'img_data'


We continue with `vars2` and `vars3`.

In [43]:
vars2 = dict({'img_data': np.random.randint(0, 5, size = (10, 256, 256, 3)), 'idx': 10})
vars3 = ['abc0001.tiff']

#### 5.2. Step2. Storing the variables in an `hdf5` file

In [44]:
data = {'data_image': vars1, 'mask_image': vars2, 'image_id': vars3}

In [45]:
dd.io.save('image_database.h5', data = data)

#### 5.2.1. Loading the data back from `hdf5`

In [46]:
A = dd.io.load('image_database.h5')
type(A), len(A)

(dict, 3)

In [47]:
type(A['data_image']), len(A['data_image']), type(A['mask_image']), len(A['mask_image'])

(dict, 2, dict, 2)

In [48]:
type(A['data_image']['img_data']), len(A['data_image']['img_data'])

(numpy.ndarray, 10)

In [49]:
type(A['mask_image']['img_data']), len(A['mask_image']['img_data'])

(numpy.ndarray, 10)
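
A nested dictionary of arrays like this one can also be written with plain `h5py` by mapping each inner dictionary to a group. The following is only a sketch under stated assumptions: `save_dict` and `load_dict` are hypothetical helpers, not part of `h5py` or `deepdish`; the images are shrunk to 8x8 to keep it fast; and the list of strings is left out because it needs the string handling discussed in section 1.3.

```python
import os
import tempfile

import h5py
import numpy as np

def save_dict(group, mapping):
    """Hypothetical helper: write a nested dict into an h5py group recursively."""
    for key, value in mapping.items():
        if isinstance(value, dict):
            save_dict(group.create_group(key), value)
        else:
            group.create_dataset(key, data=value)

def load_dict(group):
    """Hypothetical helper: read a group back into a nested dict."""
    return {
        key: load_dict(item) if isinstance(item, h5py.Group) else item[()]
        for key, item in group.items()
    }

data = {
    "data_image": {"img_data": np.random.randint(0, 255, size=(2, 8, 8, 3)), "idx": 10},
    "mask_image": {"img_data": np.random.randint(0, 5, size=(2, 8, 8, 3)), "idx": 10},
}

path = os.path.join(tempfile.mkdtemp(), "nested.h5")  # hypothetical file name
with h5py.File(path, "w") as hf:
    save_dict(hf, data)
with h5py.File(path, "r") as hf:
    restored = load_dict(hf)

print(restored["data_image"]["img_data"].shape, restored["data_image"]["idx"])
```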

### 6. Building a full file from real images

In [50]:
import os
import pandas as pd
import openslide
import skimage
import skimage.io
import random
import seaborn as sns
import cv2

import matplotlib
import matplotlib.pyplot as plt
import PIL
from IPython.display import Image, display
import plotly.graph_objs as go

In [51]:
BASE_PATH = '../input/prostate-cancer-grade-assessment'

# image and mask directories
data_dir = f'{BASE_PATH}/train_images'
mask_dir = f'{BASE_PATH}/train_label_masks'


# Location of training labels
train = pd.read_csv(f'{BASE_PATH}/train.csv').set_index('image_id')
test = pd.read_csv(f'{BASE_PATH}/test.csv')

train_labels = pd.read_csv('/kaggle/input/prostate-cancer-grade-assessment/train.csv').set_index('image_id')

submission = pd.read_csv(f'{BASE_PATH}/sample_submission.csv')

In [52]:
images = [ '07a7ef0ba3bb0d6564a73f4f3e1c2293',
            '037504061b9fba71ef6e24c48c6df44d',
            '035b1edd3d1aeeffc77ce5d248a01a53',
            '059cbf902c5e42972587c8d17d49efed',
            '06a0cbd8fd6320ef1aa6f19342af2e68',
            '06eda4a6faca84e84a781fee2d5f47e1',
            '0a4b7a7499ed55c71033cefb0765e93d',
            '0838c82917cd9af681df249264d2769c',
            '046b35ae95374bfb48cdca8d7c83233f',
            '074c3e01525681a275a42282cd21cbde',
            '05abe25c883d508ecc15b6e857e59f32',
            '05f4e9415af9fdabc19109c980daf5ad',
            '060121a06476ef401d8a21d6567dee6d',
            '068b0e3be4c35ea983f77accf8351cc8',
            '08f055372c7b8a7e1df97c6586542ac8']


In [53]:
def get_tiles(img_id, level, mode=0, n_tiles = 81, ops = 256):
        """
            Input: 
                    - img_id (str): image_id from the train dataset
                    - level (int): an integer in {0, 1, 2} corresponding to the level_downsamples {1, 4, 16}
                    - mode (int) : define the quantities of pad_height & pad_width
                    - n_tiles (int): number of tiles (must be a squared_number)
                    - ops (int) : output_size of each image
            return: 
                    - list of img_data_tiles
                    - img_mask
                    - bool
        """
        tile_size = int(256 / 2**(2*level))
        data_img = skimage.io.MultiImage(os.path.join(data_dir, f'{img_id}.tiff'))[level]
        mask_img = skimage.io.MultiImage(os.path.join(mask_dir, f'{img_id}_mask.tiff'))[level]
        
        image_data_ls = []; image_mask_ls = []
        
        h, w = data_img.shape[:2]
        pad_h = (tile_size - h % tile_size) % tile_size + ((tile_size * mode) // 2)
        pad_w = (tile_size - w % tile_size) % tile_size + ((tile_size * mode) // 2)

        img2_dt_ = np.pad(data_img,[[pad_h // 2, pad_h - pad_h // 2], [pad_w // 2,pad_w - pad_w//2], [0,0]], constant_values = 255)
        img2_ms_ = np.pad(mask_img,[[pad_h // 2, pad_h - pad_h // 2], [pad_w // 2,pad_w - pad_w//2], [0,0]], constant_values = mask_img.max())
        
        img3_dt_ = img2_dt_.reshape(img2_dt_.shape[0] // tile_size, tile_size,
                                    img2_dt_.shape[1] // tile_size, tile_size,
                                    3 )
        img3_ms_ = img2_ms_.reshape(img2_ms_.shape[0] // tile_size, tile_size,
                                    img2_ms_.shape[1] // tile_size, tile_size,
                                    3 )
        
        img3_dt_ = img3_dt_.transpose(0,2,1,3,4).reshape(-1, tile_size, tile_size,3)
        img3_ms_ = img3_ms_.transpose(0,2,1,3,4).reshape(-1, tile_size, tile_size,3)
        
        n_tiles_with_info = (img3_dt_.reshape(img3_dt_.shape[0],-1).sum(1) < tile_size ** 2 * 3 * 255).sum()
        
        # Pad with blank tiles when the image yields fewer than n_tiles tiles
        if len(img3_dt_) < n_tiles:
            img3_dt_ = np.pad(img3_dt_,[[0,n_tiles - len(img3_dt_)],[0,0],[0,0],[0,0]], constant_values=255)
            img3_ms_ = np.pad(img3_ms_,[[0,n_tiles - len(img3_ms_)],[0,0],[0,0],[0,0]], constant_values = mask_img.max())
            
        idxs_dt_ = np.argsort(img3_dt_.reshape(img3_dt_.shape[0],-1).sum(-1))[:n_tiles]    
        
        img3_dt_ = img3_dt_[idxs_dt_]
        img3_ms_ = img3_ms_[idxs_dt_]
        
        for i in range(len(img3_dt_)):
            img4_dt_ = cv2.resize(img3_dt_[i], (ops, ops))
            image_data_ls.append({'img':img4_dt_, 'idx':i})
            img4_ms_ = cv2.resize(img3_ms_[i], (ops, ops))
            image_mask_ls.append({'img':img4_ms_, 'idx':i})
        
        return image_data_ls, image_mask_ls, n_tiles_with_info >= n_tiles
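
The pad, reshape and transpose steps in `get_tiles` are easier to follow on a toy array. The sketch below reproduces only that part with numpy. The helper name `tile_image` and the 5x7 input are assumptions for the illustration, and it skips the file loading, the `mode` padding and the resizing.

```python
import numpy as np

def tile_image(img, tile_size, pad_value=255):
    """Pad an (H, W, C) image to multiples of tile_size and split it into tiles."""
    h, w, c = img.shape
    pad_h = (tile_size - h % tile_size) % tile_size
    pad_w = (tile_size - w % tile_size) % tile_size
    padded = np.pad(
        img,
        [[pad_h // 2, pad_h - pad_h // 2], [pad_w // 2, pad_w - pad_w // 2], [0, 0]],
        constant_values=pad_value,
    )
    rows = padded.shape[0] // tile_size
    cols = padded.shape[1] // tile_size
    # (rows, tile, cols, tile, C) -> (rows, cols, tile, tile, C) -> (N, tile, tile, C)
    tiles = padded.reshape(rows, tile_size, cols, tile_size, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, tile_size, tile_size, c)

# Toy input: a 5x7 white image with a single dark pixel in the top-left corner.
img = np.full((5, 7, 3), 255, dtype=np.uint8)
img[0, 0] = 0
tiles = tile_image(img, tile_size=4)

# Darkest tiles first: a lower pixel sum means less white background in the tile.
order = np.argsort(tiles.reshape(tiles.shape[0], -1).sum(-1))
print(tiles.shape, order[0])
```

Sorting by pixel sum is what lets `get_tiles` keep the `n_tiles` tiles with the most content and drop the mostly white ones.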

In [54]:
def dataloader(list_images, level, mode=0, n_tiles = 81, ops = 256):
    
    vars1 = []
    vars2 = []
    vrs_bool = []
    vars3 = list_images
    for img_id in list_images:
        vrs1, vrs2, bools = get_tiles(img_id, level, mode=mode, n_tiles=n_tiles, ops=ops)
        vars1.append(vrs1)
        vars2.append(vrs2)
        vrs_bool.append(bools)
    data = {'data_image': vars1, 'mask_image': vars2, 'image_id': vars3, 'bools_val' : vrs_bool}
    return data

In [56]:
data = dataloader(images, 1)
dd.io.save('image_database.h5', data = data)