# Dataset & Dataloader test

When training neural networks, data management code is unavoidable. Here we demonstrate how to test dataset and dataloader code with an example.

The scenario is writing a custom PyTorch map-style Dataset that loads images. This should be the most common scenario in our lab.

Most of the knowledge you need can be found in the official PyTorch documentation: the [Dataset and Dataloader tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) and the [Dataset and Dataloader API](https://pytorch.org/docs/stable/data.html).

In [None]:
import torch
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

from torch.utils.data import Dataset, DataLoader

## Write the code

Plenty of online material explains how to write a PyTorch dataset/dataloader, including the official documents mentioned above. Since this notebook is all about testing, I'll just write a simple dummy Dataset, which gives a noisy RGB image as input and a clean RGB image as ground truth.

In [None]:
def four_digit_square(digit:int, ps:int, color:tuple):
    """
    Return a square uint8 RGB image with a gray background and 4 digits as foreground.
    Args:
        digit: integer to show. int, should be 0-9999. Undefined behavior outside this range
        ps: patch size. int, at least 20
        color: color tuple. Length-3, 0-255 int
    Returns:
        img: (ps, ps, 3) uint8 numpy array
    """
    assert ps >= 20, 'Patch size too small to hold the text'
    img = np.full((ps, ps, 3), 127, dtype=np.uint8)
    # zero-pad to 4 digits and keep the last 4 characters
    text = '{:04d}'.format(digit)[-4:]
    text1 = text[:2]
    text2 = text[2:]
    
    # font scale and line thickness grow with the patch size
    # plain Python ints are passed to OpenCV; thickness is kept at 1 or more
    font_scale = ps*0.018
    thickness = max(1, int(round(ps*0.015)))
    
    # first two digits on the upper line, last two digits on the lower line
    cv.putText(img, text1, 
               (int(round(ps*0.01)), int(round(ps*0.45))), 
               cv.FONT_HERSHEY_SIMPLEX, font_scale, 
               color, thickness, cv.LINE_AA, False)
    cv.putText(img, text2, 
               (int(round(ps*0.25)), int(round(ps*0.92))), 
               cv.FONT_HERSHEY_SIMPLEX, font_scale, 
               color, thickness, cv.LINE_AA, False)
    
    return img

class FourDigitDataset(Dataset):
    """
    A dataset that gives noisy-clean pairs of 256x256 images with digits
    """
    
    def __init__(self, folder:str, noise_level:float, rng_seed=None):
        """
        Args:
            folder: str, stands in for a data folder; here it chooses the digit color ('red', 'green' or 'blue')
            noise_level: float, the Gaussian additive noise sigma
            rng_seed: random number generator seed. None for default
        """
        # validate the folder
        if folder=='red':
            self.color = (255, 0, 0)
        elif folder=='green':
            self.color = (0, 255, 0)
        elif folder=='blue':
            self.color = (0, 0, 255)
        else:
            raise RuntimeError('Undefined color')
        self.folder = folder
        
        # count the samples in the folder (fixed in this dummy class)
        self.length = 10000
        
        # set the random noise parameters
        self.noise_level = noise_level
        self.rng = np.random.default_rng(rng_seed)
        
    def __len__(self):
        return self.length
    
    def __getitem__(self, idx):
        # validate index; raising IndexError also lets a plain `for` loop over the dataset stop at the end
        if not 0 <= idx < len(self):
            raise IndexError('Index out-of-range')
        
        # read the image
        # assume that the images read are 256x256
        clean_img = four_digit_square(idx, 256, self.color)
        clean_img = clean_img.astype(np.float32)/255
        
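        # runtime augmentation
        # rng.normal returns float64, so cast the noise to keep the input float32 like the ground truth
        noise = self.rng.normal(0, self.noise_level, clean_img.shape).astype(np.float32)
        noisy_img = np.clip(clean_img + noise, 0, 1)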
        
        # turn to tensors
        gt = torch.from_numpy(clean_img.transpose(2, 0, 1))
        x  = torch.from_numpy(noisy_img.transpose(2, 0, 1))
        
        # return 
        return x, gt

## Read the code and guarantee the basic functionalities

Read your code, and answer these questions:

 1. `__init__` method   
 Can you assign a certain path/file as the source of the images?
 2. `__len__` method   
 Can the dataset object tell the length of itself?
 3. `__getitem__` method   
 Does it take an integer as index and return a training sample?   
 What is the training sample returned? Usually it's a tuple as (inputs(tensor or tuple), gt(tensor)).   
 Does the index start from 0 and end at the correct number?   
 (Optional) Can it deal with out-of-range inputs, whether by throwing an exception or by fixing them with a warning?   
 4. (Optional) Augmentation   
 Is there any run-time augmentation mechanism?   
 If so, how do you control the parameters and randomness for it?

You can also practice this on our dummy dataset.
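For the augmentation question, one way to control the randomness is to pass a seeded NumPy generator into the augmentation. The sketch below is a minimal illustration of that idea, not the dataset's actual code: `add_gaussian_noise` and the 4x4 stand-in image are hypothetical names introduced only for this example.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add Gaussian noise with standard deviation `sigma`, then clip to [0, 1]."""
    # rng.normal returns float64, so cast back to the image dtype
    noise = rng.normal(0.0, sigma, img.shape).astype(img.dtype)
    return np.clip(img + noise, 0.0, 1.0)

# Hypothetical stand-in for a clean image scaled to [0, 1]
clean = np.full((4, 4, 3), 0.5, dtype=np.float32)

# Same seed: the two noisy images are identical, so a test run can be repeated
noisy_a = add_gaussian_noise(clean, 0.05, np.random.default_rng(0))
noisy_b = add_gaussian_noise(clean, 0.05, np.random.default_rng(0))
# Different seed: the noise differs
noisy_c = add_gaussian_noise(clean, 0.05, np.random.default_rng(1))

print(np.array_equal(noisy_a, noisy_b))
print(np.array_equal(noisy_a, noisy_c))
```

Fixing the seed like this is handy for tests; for real training you would typically leave it unset or vary it per run.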

## Create an instance and check the `__len__` method

In [None]:
dataset = FourDigitDataset('red', 0.05, None)
print(len(dataset))

## Check the training sample returned by the `__getitem__` method

Check several samples with different indices.

For each image tensor in every sample, check its:
 1. shape   
 Should be (channel, height, width) and should match the design of your network. The batch dimension will be added by the PyTorch DataLoader.
 2. shape consistency   
 Each tensor should have the same shape in every sample.
 3. data type    
 Most of the time, torch.float or torch.float32. The two names refer to the same type.
 4. data device   
 Usually CPU; the tensors are moved to the GPU in the training step. Lightning does that data movement automatically.
 5. data range/distribution   
 Usually [0, 1] for images; [-1, 1] if you center it; or a unit normal distribution if you normalize it.
 6. visualization   
 Send the tensors to the CPU, convert them to numpy arrays, move the channel axis last, and show them with matplotlib. Check that the displayed image matches its source and that the source index is the one you requested.
 7. boundary cases   
 Does your dataset give the correct result for the first and the last sample? This is especially important for the DataLoader.
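Items 1 to 5 and 7 can also be turned into assertions, so that the check fails loudly instead of relying on reading printed values. The sketch below is one possible way to do it; `ToyPairDataset` and `check_sample` are hypothetical names made up for this example, and the expected shape, dtype and range are assumptions you would replace with your own design.

```python
import torch
from torch.utils.data import Dataset

class ToyPairDataset(Dataset):
    """Hypothetical stand-in: returns (noisy, clean) float32 tensors of shape (3, 8, 8)."""
    def __init__(self, length=5):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError('Index out of range')
        gt = torch.full((3, 8, 8), idx / self.length, dtype=torch.float32)
        x = (gt + 0.05 * torch.randn_like(gt)).clamp(0, 1)
        return x, gt

def check_sample(sample, shape, dtype=torch.float32, value_range=(0.0, 1.0)):
    """Raise AssertionError if any tensor in the sample violates the expectations."""
    for ten in sample:
        assert tuple(ten.shape) == shape, 'Unexpected shape {}'.format(tuple(ten.shape))
        assert ten.dtype == dtype, 'Unexpected dtype {}'.format(ten.dtype)
        assert ten.device.type == 'cpu', 'Unexpected device {}'.format(ten.device)
        assert ten.min() >= value_range[0] and ten.max() <= value_range[1], 'Out of range'

toy = ToyPairDataset()
# Include the first and the last index to cover the boundary cases
for idx in (0, len(toy) // 2, len(toy) - 1):
    check_sample(toy[idx], shape=(3, 8, 8))
print('All checks passed')
```

Visualization (item 6) still needs a human eye, so it stays a manual step.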

In [None]:
test_idx_list = (0, 20, 500, 9999)
for idx in test_idx_list:
    print('Sample {:d}'.format(idx))
    x, gt = dataset[idx]
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    for ten, ax, label in zip((x, gt), axes, ('input', 'gt')):
        print('\t{} tensor'.format(label).expandtabs(4))
        print('\t\tShape: {}'.format(ten.shape).expandtabs(4))
        print('\t\tData type: {}'.format(ten.dtype).expandtabs(4))
        print('\t\tData device: {}'.format(ten.device).expandtabs(4))
        print('\t\tData range: {:.3f}-{:.3f}'.format(ten.min(), ten.max()).expandtabs(4))
        ax.imshow(ten.cpu().numpy().transpose(1,2,0))
        ax.axis('off')
    plt.show()

## Check the speed

Time a loop with 100 samples. Consider whether data loading or GPU computing will be the bottleneck of training. We would like the GPU to run at full speed.

We're using the `%%time` cell magic here for a rough test; Python's built-in `time` module can give you finer control.

Note that data may be read more slowly the first time, possibly because it is not yet in the disk cache.
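As a rough sketch of the finer control mentioned above, `time.perf_counter` can time every sample separately, which makes a slow first read easy to spot. `load_sample` is a hypothetical stand-in for `dataset[idx]`; it only sleeps to simulate work.

```python
import time

def load_sample(idx):
    """Hypothetical stand-in for `dataset[idx]`; sleeps briefly to simulate work."""
    time.sleep(0.001)
    return idx

def time_per_sample(load_fn, indices):
    """Return the wall-clock time in seconds spent on each call to `load_fn`."""
    durations = []
    for idx in indices:
        start = time.perf_counter()
        load_fn(idx)
        durations.append(time.perf_counter() - start)
    return durations

durations = time_per_sample(load_sample, range(20))
print('first: {:.4f}s'.format(durations[0]))
print('mean:  {:.4f}s'.format(sum(durations) / len(durations)))
print('max:   {:.4f}s'.format(max(durations)))
```

If the timed code launches GPU work, calling `torch.cuda.synchronize()` before reading the clock should give more reliable numbers, because CUDA operations can run asynchronously.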

In [None]:
%%time
test_amount = 100
test_idx_list = np.arange(len(dataset))[:test_amount]
# fall back to the CPU when no GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

for idx in test_idx_list:
    x, gt = dataset[idx]
    x = x.to(device)
    gt = gt.to(device)
print('{:d} samples generated'.format(test_amount))

## Memory leak

During iterations, does your CPU/GPU memory usage keep rising because of a memory leak? You can use `htop` to observe CPU memory and `nvidia-smi` to observe GPU memory. Both are command-line tools.
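If you prefer to check from inside Python, the standard-library `tracemalloc` module can report how much traced memory grows over a loop. This is only a sketch under assumptions: `memory_growth`, `leaky_step` and `clean_step` are hypothetical names, and the two step functions stand in for one iteration of your own loading loop.

```python
import tracemalloc

def memory_growth(step_fn, n_iters):
    """Return the change in traced Python memory (bytes) over `n_iters` calls to `step_fn`."""
    tracemalloc.start()
    step_fn(0)  # warm-up, so one-time allocations are not counted
    before, _ = tracemalloc.get_traced_memory()
    for i in range(1, n_iters + 1):
        step_fn(i)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

history = []  # deliberately kept alive across iterations to mimic a leak

def leaky_step(i):
    history.append(bytearray(10_000))  # every call keeps another 10 kB referenced

def clean_step(i):
    _ = bytearray(10_000)  # released as soon as the function returns

print('leaky growth: {} bytes'.format(memory_growth(leaky_step, 100)))
print('clean growth: {} bytes'.format(memory_growth(clean_step, 100)))
```

Growth that scales with the number of iterations is the warning sign. Note that `tracemalloc` only sees memory allocated through Python's allocators, so GPU memory still has to be watched with `nvidia-smi` (or, from PyTorch, `torch.cuda.memory_allocated()`).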

## Wrap it with a DataLoader object

Then do the same tests: the `__getitem__` method, the speed, and the memory leak. 

Only the speed test is shown here. There are several possible bottlenecks for the dataloader:
 1. Disk IO speed
 2. CPU computing speed for data processing, mostly augmentation
 3. The speed of transferring data from CPU memory to GPU memory
 
Increasing `num_workers` can speed up the data loading process only if the bottleneck is CPU computing.
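To see which regime you are in, you can time the same dataset with several `num_workers` values. The sketch below assumes a small in-memory `TensorDataset` as a stand-in for the real dataset; `time_loader` and `compare_workers` are hypothetical helper names, and the measured time includes the cost of starting the worker processes.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_loader(dataset, num_workers, n_batches=20, batch_size=10):
    """Return the seconds needed to draw `n_batches` batches from a DataLoader."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers, drop_last=True)
    start = time.perf_counter()
    for counter, batch in enumerate(loader, start=1):
        if counter >= n_batches:
            break
    return time.perf_counter() - start

def compare_workers(dataset, worker_counts=(0, 2, 4)):
    """Time the same dataset with several `num_workers` settings."""
    return {n: time_loader(dataset, n) for n in worker_counts}

# Hypothetical stand-in dataset: 1000 random "images" with matching targets
toy = TensorDataset(torch.rand(1000, 3, 16, 16), torch.rand(1000, 3, 16, 16))
# Only the single-process case is run here; pass e.g. (0, 2, 4) to compare
print(compare_workers(toy, worker_counts=(0,)))
```

For a cheap in-memory dataset like this one, extra workers will likely add overhead rather than speed, which is consistent with the remark above: more workers pay off when per-sample CPU work dominates.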

In [None]:
bs = 10
dataloader = DataLoader(dataset, batch_size=bs, shuffle=True, num_workers=5, pin_memory=True, drop_last=True)

In [None]:
%%time
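test_amount = 100
# fall back to the CPU when no GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
counter = 0

for x, gt in dataloader:
    x = x.to(device)
    gt = gt.to(device)
    counter += 1
    # stop after exactly test_amount batches
    if counter >= test_amount:
        break
print('{:d} samples generated'.format(bs*test_amount))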

In [None]:
print(x.shape, gt.shape)