In [16]:
import numpy as np
import pandas as pd

import torch
from torch.utils.data import (
    Dataset
    , IterableDataset
    , DataLoader
    , SubsetRandomSampler
    , random_split
)

# Building a Data Pipeline

A data pipeline in `PyTorch` is built from three major components:

* `Dataset`
* `DataLoader`
* `Sampler`

In general, the workflow is: (1) load the raw dataset from disk or the web to create a torch `Dataset` object, (2) choose a sampling scheme and instantiate the `Sampler` object(s), and (3) combine the `Dataset` and `Sampler`, together with other training parameters, into a `DataLoader`. During model training, we load batches of data by iterating through the `DataLoader` we have created. Examples are given below.
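
As a rough sketch of these three steps, the snippet below wires a `Dataset`, a `Sampler`, and a `DataLoader` together. It assumes a small in-memory dataset built with `TensorDataset` instead of a file on disk; the values and the variable names (`features`, `labels`, `loader`) are made up for illustration only.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# (1) Wrap the raw data in a Dataset (hypothetical in-memory values, not a real file)
features = torch.arange(8, dtype=torch.float).unsqueeze(1)  # shape (8, 1)
labels = torch.tensor([0., 1., 1., 0., 0., 0., 1., 1.])
dataset = TensorDataset(features, labels)

# (2) Choose a sampling scheme; SequentialSampler yields indices 0, 1, ..., 7 in order
sampler = SequentialSampler(dataset)

# (3) Combine Dataset and Sampler, plus other parameters such as the batch size
loader = DataLoader(dataset, sampler=sampler, batch_size=4)

# During training we would iterate over the loader batch by batch
batch_shapes = [(x.shape, y.shape) for x, y in loader]
print(batch_shapes)
```

With 8 samples and `batch_size=4`, the loop sees two batches, each holding a `(4, 1)` feature tensor and a `(4,)` label tensor.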

## The Dataset Object

The official documentation can be found [here](https://pytorch.org/docs/stable/data.html). There are two types of datasets: (1) **map-style datasets** and (2) **iterable-style datasets**. In short, a map-style dataset acts like a **list** or **table** that we can index into; for instance, we can select the 7th sample. An iterable-style dataset, on the other hand, works like an iterator: we only define how to **retrieve the next sample/batch**. Both types can be very useful. I will introduce the map-style dataset first, as it is more intuitive, and save the iterable-style dataset for another tutorial because this notebook would otherwise become too long.

### Map-style Dataset

To construct a custom map-style dataset, we must implement at least three methods:

* `__init__(self, ...)` (obviously)
* `__len__(self)`
* `__getitem__(self, idx)`

As a reminder, the third method is what the square-bracket operator `[]` calls. For instance, `my_dataset[0]` gives the first item (how indices are handled, e.g. whether ranged slicing is supported, depends on how you implement the method).

A simple template is shown below.

In [17]:
class MyDataset(torch.utils.data.Dataset):

    def __init__(self, path):    # <-- My habit, not necessarily <path> only
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):  # <-- ONE ARGUMENT ONLY!!!
        pass

# Sample creation of a dataset
try:
    my_dataset = MyDataset("~/path/to/your/raw/data.csv")
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    pass
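
To make the link between `[]` and `__getitem__` concrete, here is a minimal sketch. `SquaresDataset` is a hypothetical class invented for this illustration; it stores a plain Python list, so integer indexing and slicing both work because the list itself supports them.

```python
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical dataset holding the squares 0, 1, 4, ...; for illustration only."""

    def __init__(self, n):
        self.values = [i * i for i in range(n)]

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        # idx is whatever was written between the brackets: an int or a slice object
        return self.values[idx]


squares = SquaresDataset(5)
first = squares[0]       # calls squares.__getitem__(0)
middle = squares[1:3]    # calls squares.__getitem__(slice(1, 3))
print(len(squares), first, middle)
```

Whether slicing works is entirely up to the body of `__getitem__`; a dataset that, say, opens one file per index would need to handle `slice` objects explicitly.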

#### A Toy Example

The toy example consists of only 8 samples with two attributes, factor and target, both of which are numerical.

In [18]:
df_toy = pd.read_csv("../data/classification/toy_example/data.csv")
df_toy.head()

   factor  target
0    10.0       0
1     5.0       1
2     2.5       1
3    11.0       0
4    15.0       0


In [19]:
class ToyMapDataset(Dataset):

    """
    Desc:
        Simply fetches the raw csv file and stores it as a pd.DataFrame
    """

    def __init__(self, path):
        
        self.df = pd.read_csv(path)
        return

    def __len__(self):
        
        return len(self.df)

    def __getitem__(self, idx):  

        return self.df.iloc[idx, :]


In [20]:
# Instantiate one toy dataset and inspect its contents
toy_map_dataset = ToyMapDataset("../data/classification/toy_example/data.csv")
print(
    f"Number of record(s): {len(toy_map_dataset)}; "
    f"The samples at indices 6 and 7 in the dataset are:\n{str(toy_map_dataset[6:8])}"
)

# So, of course we can iterate through the dataset
# for row in toy_map_dataset:
#     print(row)

Number of record(s): 8; The samples at indices 6 and 7 in the dataset are:
   factor   target
6     0.0        1
7     3.0        1


Although this is already a working example, we cannot use it directly for model training because the return values are not `torch.Tensor` objects. We must therefore pre-process the data before feeding it into our model/computation graph, and this set of pre-processing operations is best enclosed in the `Dataset` object. So, let's slightly modify the previous example. **Note** that there are many possible implementations, as long as **the return values are tensors** (or lists of tensors).

In [21]:
class ToyMapDataset(Dataset):

    """
    Desc:
        Besides fetching the raw data.csv file, we convert 
          the dataset into collection of tensors. Below is 
          just one implementation.
    """

    def __init__(self, path):
        
        self.df = pd.read_csv(path)
        
        # Store X and Y as tensors
        self.factor = torch.tensor(
            self.df.iloc[:, 0].values  # <-- Hard-coded column positions;
            , dtype=torch.float        #   not usually a problem since we
        )                              #   pair loaders with src data file
        self.target = torch.tensor(
            self.df.iloc[:, 1].values
            , dtype=torch.float
        )
        return

    def __len__(self):
        
        return len(self.df)

    def __getitem__(self, idx):  

        # Return a (factor, target) pair
        return self.factor[idx], self.target[idx]

In [22]:
# Instantiate one toy dataset and see its feature
toy_map_dataset = ToyMapDataset("../data/classification/toy_example/data.csv")

# Fetch two samples
x, y = toy_map_dataset[1:3]
print(x, type(x))
print(y, type(y))

tensor([5.0000, 2.5000]) <class 'torch.Tensor'>
tensor([1., 1.]) <class 'torch.Tensor'>


To be honest, up to this point we don't really need a sampler or dataloader, since we can simply iterate through the dataset with a for loop. It is much nicer, however, to hide the details and expose a clean interface that makes the code easier to understand, and the `DataLoader` object is designed for exactly that purpose. It may seem odd that I don't introduce the `Sampler` first, given that I said a `DataLoader` is composed of a `Dataset` and a `Sampler`. The reason is that, by default, `DataLoader` automatically assigns us a `Sampler` (actually two, explained later).

## The DataLoader Object

The official documentation can be found [here](https://pytorch.org/docs/stable/data.html). 

In [24]:
toy_map_dataloader = DataLoader(toy_map_dataset)
for x, y in toy_map_dataloader:
    print(x, y)

tensor([10.]) tensor([0.])
tensor([5.]) tensor([1.])
tensor([2.5000]) tensor([1.])
tensor([11.]) tensor([0.])
tensor([15.]) tensor([0.])
tensor([20.]) tensor([0.])
tensor([0.]) tensor([1.])
tensor([3.]) tensor([1.])
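
The one-element tensors above come from the default `batch_size=1`. As a small sketch of how the batching parameters change this, the snippet below uses an in-memory `TensorDataset` as a stand-in for the toy dataset (an assumption made so the example runs without the csv file); the values are copied from the output printed above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the toy dataset: same values as the printed output, held in memory
factor = torch.tensor([10., 5., 2.5, 11., 15., 20., 0., 3.])
target = torch.tensor([0., 1., 1., 0., 0., 0., 1., 1.])
toy_dataset = TensorDataset(factor, target)

# batch_size=3 groups samples; the last batch holds the remaining 2 samples
loader = DataLoader(toy_dataset, batch_size=3, shuffle=False)
batches = [(x, y) for x, y in loader]
for x, y in batches:
    print(x, y)

# drop_last=True discards the incomplete final batch
n_full = len(DataLoader(toy_dataset, batch_size=3, drop_last=True))
print(n_full)
```

Setting `shuffle=True` instead would visit the same samples in a random order on each pass over the loader.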


## The Sampler Object

The official documentation can be found [here](https://pytorch.org/docs/stable/data.html). In short, samplers are **index generators**.
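
A minimal sketch of what "index generator" means, using three sampler classes from `torch.utils.data`. The `data_source` here is just `range(8)`, an assumption made for brevity: these samplers only need the length of the data (or an explicit list of indices), never the samples themselves.

```python
from torch.utils.data import BatchSampler, SequentialSampler, SubsetRandomSampler

# A stand-in for a dataset with 8 samples; only its length matters here
data_source = range(8)

# SequentialSampler yields every index in order
sequential = SequentialSampler(data_source)
sequential_indices = list(sequential)
print(sequential_indices)

# SubsetRandomSampler yields the given indices in a random order
subset = SubsetRandomSampler([0, 2, 4, 6])
subset_indices = list(subset)
print(sorted(subset_indices))

# BatchSampler wraps another sampler and yields lists of indices
batched = BatchSampler(sequential, batch_size=3, drop_last=False)
batch_indices = list(batched)
print(batch_indices)
```

Iterating a sampler never touches the data; it is the `DataLoader` that takes these indices and looks up the corresponding samples through the dataset's `__getitem__`.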

### Sampler

### Batch Sampler

## Custom Data Loader for Training Using the Iris Dataset

In [None]:
class IrisDataLoader(object):

    def __init__(self):

        return

## Large Dataset: Read by Chunks

In [None]:
class ToyLargeDataset(IterableDataset):

    def __init__(self, path):

        self.path = path

    def __iter__(self):

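        # With chunksize set, read_csv returns an iterator of DataFrames,
        #   so only 4 rows are held in memory at a time
        reader = pd.read_csv(self.path, chunksize=4)
        for dfr in reader:

            fct = torch.tensor(dfr.iloc[:, 0].values, dtype=torch.float)
            tgt = torch.tensor(dfr.iloc[:, 1].values, dtype=torch.float)
            yield fct, tgt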
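
The following is a self-contained sketch of how such a chunked dataset might be consumed. `ChunkedCsvDataset`, its `chunksize` parameter, and the temporary csv file are all assumptions introduced for this illustration; the class mirrors the idea above but selects columns by name. Passing `batch_size=None` to `DataLoader` disables automatic batching, which fits here because each chunk is already a batch.

```python
import os
import tempfile

import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset


class ChunkedCsvDataset(IterableDataset):
    """Hypothetical chunked reader; yields one (factor, target) chunk at a time."""

    def __init__(self, path, chunksize=4):
        self.path = path
        self.chunksize = chunksize

    def __iter__(self):
        # read_csv with chunksize returns an iterator of DataFrames
        for chunk in pd.read_csv(self.path, chunksize=self.chunksize):
            x = torch.tensor(chunk["factor"].values, dtype=torch.float)
            y = torch.tensor(chunk["target"].values, dtype=torch.float)
            yield x, y


# Write a small temporary csv so the sketch runs on its own
# (values borrowed from the first rows of the toy example)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "toy.csv")
pd.DataFrame({
    "factor": [10.0, 5.0, 2.5, 11.0, 15.0, 20.0],
    "target": [0, 1, 1, 0, 0, 0],
}).to_csv(csv_path, index=False)

# batch_size=None: the loader passes each yielded chunk through unchanged
loader = DataLoader(ChunkedCsvDataset(csv_path, chunksize=4), batch_size=None)
chunk_sizes = [len(x) for x, _ in loader]
print(chunk_sizes)
```

With 6 rows and `chunksize=4`, the loader yields one chunk of 4 rows followed by one of 2. Note that with `num_workers > 0` every worker would run `__iter__` on its own copy of the dataset and the chunks would be duplicated unless the work is split between workers.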