## Using PyTorch Dataset Loading Utilities for Custom Datasets (CSV files converted to HDF5)

This notebook shows how to load a dataset from an HDF5 file that was created from a CSV file, using PyTorch's data loading utilities. For a more in-depth discussion, please see the official

- [Data Loading and Processing Tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html)
- [torch.utils.data](http://pytorch.org/docs/master/data.html) API documentation

A Hierarchical Data Format (HDF) file is a convenient way to get quick access to individual data instances during minibatch learning when a dataset is too large to fit into memory. The approach outlined in this notebook uses the common [HDF5](https://support.hdfgroup.org/HDF5/) format, so the resulting file should be accessible from any programming language or tool with an HDF5 API.
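
To make the "quick access without loading everything" idea concrete, here is a minimal sketch using `h5py`. The file name `demo.h5`, the dataset name `features`, and the array contents are made up for this illustration; the point is only that indexing the dataset handle reads just the requested rows from disk.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write a small array to disk.
data = np.arange(40, dtype="float32").reshape(10, 4)
with h5py.File(demo_path, "w") as h5f:
    h5f.create_dataset("features", data=data)

# Reopen the file and read only rows 2, 3 and 4; the other rows stay on disk.
with h5py.File(demo_path, "r") as h5f:
    dset = h5f["features"]  # a handle to the on-disk dataset, not an in-memory array
    batch = dset[2:5]       # slicing triggers the actual read and returns a NumPy array

print(batch.shape)
```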


In [1]:
import pandas as pd
import numpy as np
import h5py
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

### Creating the HDF5 Data

In [2]:
# csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
csv_path = "../data/iris.data"

num_lines = 150
num_features = 4

class_dict = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}

# use 10,000 or 100,000 or so for large files
chunksize = 10

with h5py.File('iris.h5', 'w') as h5f:
    
    # use num_lines-1 if the csv file has a column header
    dset1 = h5f.create_dataset('features',
                               shape=(num_lines, num_features),
                               compression=None,
                               dtype='float32')
    dset2 = h5f.create_dataset('labels',
                               shape=(num_lines,),
                               compression=None,
                               dtype='int32')

    # change range argument from 0 -> 1 if your csv file contains a column header
    for i in range(0, num_lines, chunksize):  

        df = pd.read_csv(csv_path,  
                header=None,  # no header, define column header manually later
                nrows=chunksize, # number of rows to read at each iteration
                skiprows=i)   # skip rows that were already read
        
        df[4] = df[4].map(class_dict)

        features = df.values[:, :4]
        labels = df.values[:, -1]
        
        # use i-1 and i-1+chunksize if csv file has a column header
        dset1[i:i+chunksize, :] = features
        # write the label of every row in the chunk, not just the first one
        dset2[i:i+chunksize] = labels
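
As an alternative to calling `pd.read_csv` with a growing `skiprows` value on every iteration, pandas can iterate over a file in chunks when `chunksize` is passed to `read_csv`. The sketch below assumes a CSV in the `iris.data` layout (four feature columns followed by the class name) and uses a five-row in-memory stand-in plus a hypothetical output file `iris_demo.h5`, so it runs without the real data file.

```python
import io
import os
import tempfile

import h5py
import pandas as pd

# A few rows in the iris.data layout, standing in for the real CSV file.
csv_text = "\n".join([
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "4.9,3.0,1.4,0.2,Iris-setosa",
    "7.0,3.2,4.7,1.4,Iris-versicolor",
    "6.4,3.2,4.5,1.5,Iris-versicolor",
    "6.3,3.3,6.0,2.5,Iris-virginica",
])

class_dict = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}

num_lines, num_features, chunksize = 5, 4, 2

# Hypothetical output path used only for this sketch.
h5_path = os.path.join(tempfile.mkdtemp(), "iris_demo.h5")

with h5py.File(h5_path, "w") as h5f:
    dset1 = h5f.create_dataset("features",
                               shape=(num_lines, num_features),
                               dtype="float32")
    dset2 = h5f.create_dataset("labels",
                               shape=(num_lines,),
                               dtype="int32")

    start = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=chunksize)
    for df in reader:
        stop = start + len(df)  # the last chunk may be shorter than chunksize
        dset1[start:stop, :] = df.iloc[:, :num_features].to_numpy(dtype="float32")
        dset2[start:stop] = df[4].map(class_dict).to_numpy(dtype="int32")
        start = stop

with h5py.File(h5_path, "r") as h5f:
    stored_features = h5f["features"][:]
    stored_labels = h5f["labels"][:]

print(stored_features.shape, stored_labels)
```

Tracking `start` and `stop` from the actual chunk length avoids hard-coding the chunk size in the slice and handles a final, shorter chunk.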

In [3]:
with h5py.File('iris.h5', 'r') as h5f:
    print(h5f['features'].shape)
    print(h5f['labels'].shape)

(150, 4)
(150,)


In [4]:
with h5py.File('iris.h5', 'r') as h5f:
    print('Features of entry no. 99:', h5f['features'][99])
    print('Class label of entry no. 99:', h5f['labels'][99])

Features of entry no. 99: [5.7 2.8 4.1 1.3]
Class label of entry no. 99: 1


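### Custom Dataset

To implement your own dataset class, you need to define two methods:
- `__getitem__(self, index)`:
 - reads a single sample and its label at the given `index`
 - returns the sample together with its label
 
- `__len__(self)`:
 - returns the number of entries in the dataset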
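
Before moving to HDF5, the two required methods can be sketched on a purely in-memory example. The class name `ToyDataset` and the tensors below are hypothetical and serve only to show the protocol.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset illustrating the two required methods."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __getitem__(self, index):
        # Return one sample and its label for the given index.
        return self.features[index], self.labels[index]

    def __len__(self):
        # Number of entries in the dataset.
        return self.labels.shape[0]


features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])

toy = ToyDataset(features, labels)
x0, y0 = toy[0]
print(len(toy), x0, y0)
```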


In [5]:
class Hdf5Dataset(Dataset):
    """Custom Dataset for loading entries from HDF5 databases"""

    def __init__(self, h5_path, transform=None):
    
        self.h5f = h5py.File(h5_path, 'r')
        self.num_entries = self.h5f['labels'].shape[0]
        self.transform = transform

    def __getitem__(self, index):
        
        features = self.h5f['features'][index]
        label = self.h5f['labels'][index]
        if self.transform is not None:
            features = self.transform(features)
        return features, label

    def __len__(self):
        return self.num_entries
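
The class above opens the HDF5 file in `__init__` and keeps the handle for the lifetime of the dataset. That is fine with `num_workers=0`, but an open handle can be problematic once the `DataLoader` uses worker processes. One commonly used workaround, sketched below under that assumption, is to open the file lazily on first access. The class name `LazyHdf5Dataset`, its `close` helper, and the stand-in file `lazy_demo.h5` are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np
from torch.utils.data import Dataset


class LazyHdf5Dataset(Dataset):
    """Hypothetical variant of Hdf5Dataset that opens the file on first access."""

    def __init__(self, h5_path, transform=None):
        self.h5_path = h5_path
        self.transform = transform
        self.h5f = None  # opened lazily in __getitem__
        # Open briefly just to read the number of entries, then close again.
        with h5py.File(h5_path, 'r') as h5f:
            self.num_entries = h5f['labels'].shape[0]

    def __getitem__(self, index):
        if self.h5f is None:
            self.h5f = h5py.File(self.h5_path, 'r')
        features = self.h5f['features'][index]
        label = self.h5f['labels'][index]
        if self.transform is not None:
            features = self.transform(features)
        return features, label

    def __len__(self):
        return self.num_entries

    def close(self):
        # Close the handle if it was ever opened.
        if self.h5f is not None:
            self.h5f.close()
            self.h5f = None


# Tiny stand-in file with made-up values, only to exercise the class.
demo_path = os.path.join(tempfile.mkdtemp(), 'lazy_demo.h5')
with h5py.File(demo_path, 'w') as h5f:
    h5f.create_dataset('features', data=np.ones((4, 4), dtype='float32'))
    h5f.create_dataset('labels', data=np.array([0, 1, 2, 1], dtype='int32'))

lazy_dataset = LazyHdf5Dataset(demo_path)
features, label = lazy_dataset[2]
print(len(lazy_dataset), features.shape, label)
lazy_dataset.close()
```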

In [6]:
train_dataset = Hdf5Dataset(h5_path='iris.h5',
                            transform=None)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=50,
                          shuffle=True,
                          num_workers=0)

### Training

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

epochs = 10
for epoch in range(epochs):
    for batch_idx, (x, y) in enumerate(train_loader):
        print('Epoch:', epoch+1, end='')
        print(' | Batch index:', batch_idx, end='')
        print(' | Batch size:', y.size()[0])
        
        x = x.to(device)
        y = y.to(device)
        
        # train step here

cuda
Epoch: 1 | Batch index: 0 | Batch size: 50
Epoch: 1 | Batch index: 1 | Batch size: 50
Epoch: 1 | Batch index: 2 | Batch size: 50
Epoch: 2 | Batch index: 0 | Batch size: 50
Epoch: 2 | Batch index: 1 | Batch size: 50
Epoch: 2 | Batch index: 2 | Batch size: 50
Epoch: 3 | Batch index: 0 | Batch size: 50
Epoch: 3 | Batch index: 1 | Batch size: 50
Epoch: 3 | Batch index: 2 | Batch size: 50
Epoch: 4 | Batch index: 0 | Batch size: 50
Epoch: 4 | Batch index: 1 | Batch size: 50
Epoch: 4 | Batch index: 2 | Batch size: 50
Epoch: 5 | Batch index: 0 | Batch size: 50
Epoch: 5 | Batch index: 1 | Batch size: 50
Epoch: 5 | Batch index: 2 | Batch size: 50
Epoch: 6 | Batch index: 0 | Batch size: 50
Epoch: 6 | Batch index: 1 | Batch size: 50
Epoch: 6 | Batch index: 2 | Batch size: 50
Epoch: 7 | Batch index: 0 | Batch size: 50
Epoch: 7 | Batch index: 1 | Batch size: 50
Epoch: 7 | Batch index: 2 | Batch size: 50
Epoch: 8 | Batch index: 0 | Batch size: 50
Epoch: 8 | Batch index: 1 | Batch size: 50
Epoch:

In [8]:
train_dataset.h5f.close()
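
The `# train step here` placeholder in the training loop could be filled in roughly as follows. This is a minimal sketch, not the notebook's own training code: it assumes a single linear layer as the model, plain SGD, and random synthetic tensors standing in for the iris data so that it runs without `iris.h5`. The labels are created as `int32` to mirror the dtype stored in the HDF5 file, and they are cast to `int64` because `cross_entropy` expects class indices of that type.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the iris data: 150 rows, 4 features, 3 classes.
x_all = torch.randn(150, 4)
y_all = torch.randint(0, 3, (150,), dtype=torch.int32)
loader = DataLoader(TensorDataset(x_all, y_all), batch_size=50, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed model and optimizer, chosen only for illustration.
model = torch.nn.Linear(4, 3).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(2):
    for x, y in loader:
        x = x.to(device)
        y = y.to(device).long()  # cross_entropy expects int64 class indices

        logits = model(x)
        loss = F.cross_entropy(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

print(len(losses))
```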