# "Deep dive into fastai DataLoader methods"
> Usage example of fastai DataLoader methods in chronological order (starting from index of item to batch)
- toc: true
- comments: true
- author: Kushajveer Singh
- categories: [notes]
- badges: true

In [1]:
from fastai.vision.all import *

First, we define a dummy dataset.

In [2]:
dataset = list(string.ascii_lowercase)[:10]
dataset

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Next, we create a `DataLoader`.

In [3]:
dl = DataLoader(dataset, bs=3, shuffle=True)
dl

<fastai.data.load.DataLoader at 0x7f3734d233a0>

## DataLoader.get_idxs
The first thing we need is a list of indices that we can use to index into the dataset and fetch items.

In [4]:
dl.get_idxs()

[7, 0, 6, 5, 1, 4, 9, 8, 3, 2]
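
Conceptually, `get_idxs` counts from 0 up to the dataset length and shuffles the result when `shuffle=True`. The following is a minimal pure-Python sketch of that idea. `get_idxs_sketch` and its `seed` argument are hypothetical names used only for illustration; this is not fastai's implementation.

```python
import random

def get_idxs_sketch(n, shuffle=False, seed=None):
    """Return the indices 0..n-1, optionally in random order."""
    idxs = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(idxs)  # in-place shuffle with a local RNG
    return idxs

idxs = get_idxs_sketch(10, shuffle=True, seed=0)
print(idxs)          # some permutation of 0..9
print(sorted(idxs))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Whatever the order, every index appears exactly once, so one pass over the indices visits every item in the dataset.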

## DataLoader.sample
In practice, we want the indices as a generator so that they can be consumed lazily, one at a time, rather than handled as a Python list.

In [5]:
dl.sample()

<generator object DataLoader.sample.<locals>.<genexpr> at 0x7f3734cb2200>

In [6]:
list(dl.sample())

[3, 1, 4, 0, 9, 2, 6, 8, 5, 7]

As the two examples above show, `get_idxs` and `sample` produce the same thing: a permutation of the dataset indices (the order differs between calls because `shuffle=True`). The only difference is that `get_idxs` returns a list, while `sample` returns a generator of indices.
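
One thing a generator makes easy is filtering indices lazily, for example to split batches across worker processes. The sketch below illustrates that idea under the assumption that batches are handed to workers round-robin. `sample_sketch` and its arguments are hypothetical names, not fastai's API.

```python
def sample_sketch(idxs, bs=1, num_workers=1, worker_id=0):
    """Lazily yield the indices whose batch belongs to `worker_id`."""
    return (idx for pos, idx in enumerate(idxs)
            if (pos // bs) % num_workers == worker_id)

idxs = list(range(10))
gen = sample_sketch(idxs, bs=3)
print(next(gen))  # 0, produced on demand

# With two workers and bs=3, whole batches alternate between workers.
print(list(sample_sketch(idxs, bs=3, num_workers=2, worker_id=0)))  # [0, 1, 2, 6, 7, 8]
print(list(sample_sketch(idxs, bs=3, num_workers=2, worker_id=1)))  # [3, 4, 5, 9]
```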

For an *iterable-style dataset*, indices do not make sense because the items are generated on the fly, so `get_idxs` returns `None` placeholders instead.

In [7]:
dl.new(indexed=False).get_idxs()

[None, None, None, None, None, None, None, None, None, None]

## DataLoader.create_item
We now have the indices. The next step is to fetch items from the dataset, which for an indexed dataset means `self.dataset[idx]`. `create_item` does exactly this.

In [8]:
dl.create_item(1), dl.create_item(2)

('b', 'c')
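
To make the indexed and iterable-style cases concrete, here is a small sketch of how such a lookup could behave: indexing when an index is available, and pulling the next element from an iterator when the index is `None`. `TinyLoader` is a hypothetical stand-in written for this post, not fastai's `DataLoader`.

```python
class TinyLoader:
    """Hypothetical stand-in for a loader, not fastai's DataLoader."""
    def __init__(self, dataset, indexed=True):
        self.dataset, self.indexed = dataset, indexed
        self.it = iter(dataset)  # used only for iterable-style access

    def create_item(self, idx):
        if self.indexed:
            return self.dataset[idx]  # map-style: look the item up
        if idx is None:
            return next(self.it)      # iterable-style: take the next item
        raise IndexError("Cannot index an iterable dataset numerically")

indexed = TinyLoader(list("abcdefghij"))
print(indexed.create_item(1), indexed.create_item(2))  # b c

streamed = TinyLoader(list("abcdefghij"), indexed=False)
print(streamed.create_item(None), streamed.create_item(None))  # a b
```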

## DataLoader.after_item
After `create_item` returns an item, we might want to transform it. We can define this transformation as a function that takes the output of `create_item` as input, and pass it to the `DataLoader` as `after_item`.

In [9]:
def func(item): 
    print(f'Original item is "{item}"')
    new_item = item + '_something'
    print(f'Changed the item to "{new_item}"')
    return new_item

dll = dl.new(after_item=func)

dll.after_item(dll.create_item(1))

Original item is "b"
Changed the item to "b_something"


'b_something'

## DataLoader.do_item
This method is a shorthand for `dll.after_item(dll.create_item(index))`.

In [10]:
dll.do_item(1)

Original item is "b"
Changed the item to "b_something"


'b_something'
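
The composition can be written out as a plain function. This is a sketch of the idea only: `make_do_item` is a hypothetical helper, and the default no-op `after_item` is an assumption made for the example.

```python
def make_do_item(create_item, after_item=lambda item: item):
    """Build a function that fetches an item and then post-processes it."""
    def do_item(idx):
        return after_item(create_item(idx))
    return do_item

dataset = list("abcdefghij")
do_item = make_do_item(dataset.__getitem__, after_item=lambda s: s + "_something")
print(do_item(1))                            # b_something
print(make_do_item(dataset.__getitem__)(1))  # b (the default after_item is a no-op)
```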

## DataLoader.before_batch
Now we have items that we can use for training. The next step is to create a batch, and fastai lets us control every step of this. Before the batch is created we have a list of items, and `before_batch` lets us modify that list. Mixup is a good example of this.

In [11]:
def before_batch(items):
    # a very bare-bones implementation of mixup
    shuffle = torch.randperm(len(items))
    lam = 0.3
    return [lam*i+(1-lam)*j for i,j in zip(items, items[shuffle])]

dl = DataLoader(bs=2, before_batch=before_batch)
items = L(torch.tensor(1.), torch.tensor(2.), torch.tensor(3.), torch.tensor(4.))
dl.before_batch(items)

[tensor(1.7000), tensor(2.7000), tensor(1.6000), tensor(4.)]
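
The numbers above can be checked by hand. Each output is `lam * item + (1 - lam) * partner`, where the partner comes from the random permutation. The sketch below redoes the arithmetic with plain floats; the permutation `[1, 2, 0, 3]` is an assumption, inferred from the output rather than read from the random number generator.

```python
lam = 0.3
items = [1.0, 2.0, 3.0, 4.0]
perm = [1, 2, 0, 3]  # assumed permutation, inferred from the output above

# 0.3*1 + 0.7*2, 0.3*2 + 0.7*3, 0.3*3 + 0.7*1, 0.3*4 + 0.7*4
mixed = [lam * a + (1 - lam) * items[p] for a, p in zip(items, perm)]
print([round(m, 4) for m in mixed])  # [1.7, 2.7, 1.6, 4.0]
```

The last item is paired with itself, which is why it comes out unchanged.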

## DataLoader.create_batch
The next step is to collate the list of items into a batch. `create_batch` plays the same role as the `collate_fn` argument of a PyTorch `DataLoader`. The usage is `create_batch(before_batch(items))`.
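
To show what collating means, here is a minimal sketch that turns a list of `(x, y)` samples into one tuple of per-field lists. Real collate functions stack the fields into tensors; plain lists stand in for tensors here, and `collate_sketch` is a hypothetical name.

```python
def collate_sketch(samples):
    """Turn a list of (x, y) samples into a tuple of per-field lists."""
    first = samples[0]
    if isinstance(first, tuple):
        # Transpose: [(x0, y0), (x1, y1)] -> ([x0, x1], [y0, y1])
        return tuple(collate_sketch(list(field)) for field in zip(*samples))
    return list(samples)

samples = [(1.0, "a"), (2.0, "b"), (3.0, "c")]
print(collate_sketch(samples))  # ([1.0, 2.0, 3.0], ['a', 'b', 'c'])
```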

## DataLoader.do_batch
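A shorthand for `create_batch(before_batch(items))`. Because the `before_batch` defined above draws a new random permutation on every call, the values below differ from the earlier `before_batch` output.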

In [13]:
dl.do_batch(items) # Returns a batch of items

tensor([2.4000, 3.4000, 1.6000, 2.6000])
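
Putting the item and batch stages together, one pass over the data can be sketched as: fetch items lazily, group them into chunks of `bs`, then run the batch stage on each chunk. The helper names below are hypothetical, and `tuple` stands in for a real collate function.

```python
from itertools import islice

def chunked(iterable, bs):
    """Yield lists of up to `bs` items; the last one may be shorter."""
    it = iter(iterable)
    while chunk := list(islice(it, bs)):
        yield chunk

def batches_sketch(dataset, idxs, bs, before_batch=lambda b: b, create_batch=tuple):
    items = (dataset[i] for i in idxs)           # item stage, evaluated lazily
    for chunk in chunked(items, bs):             # group items into batches
        yield create_batch(before_batch(chunk))  # batch stage

dataset = list("abcdefghij")
for batch in batches_sketch(dataset, range(10), bs=3):
    print(batch)
# ('a', 'b', 'c')
# ('d', 'e', 'f')
# ('g', 'h', 'i')
# ('j',)
```

With 10 items and `bs=3`, the last batch holds a single item because nothing is dropped in this sketch.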

## DataLoader.after_batch
After a batch is created, we can apply any operation to it with `after_batch`. This is the place to define batch-level data augmentations, which can then run on the GPU.
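
The practical difference between `after_item` and `after_batch` is how often each runs: once per item versus once per batch. The following sketch counts the calls using plain lists; the two functions are illustrative stand-ins, not fastai transforms.

```python
calls = {"after_item": 0, "after_batch": 0}

def after_item(item):
    calls["after_item"] += 1
    return item * 2

def after_batch(batch):
    calls["after_batch"] += 1
    return [x + 1 for x in batch]

dataset, bs = list(range(10)), 5
items = [after_item(x) for x in dataset]
batches = [after_batch(items[i:i + bs]) for i in range(0, len(items), bs)]
print(batches)  # [[1, 3, 5, 7, 9], [11, 13, 15, 17, 19]]
print(calls)    # {'after_item': 10, 'after_batch': 2}
```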

## Other useful methods
* `before_iter()` - Called before iteration over the `DataLoader` starts.
* `after_iter()` - Called after iteration over the `DataLoader` finishes.
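
The order in which these hooks fire around a pass over the data can be sketched with a generator. This is an illustration under the assumption that `after_batch` runs on each batch between the two iteration hooks; `iterate_sketch` is a hypothetical function, not fastai's `__iter__`.

```python
def iterate_sketch(batches, before_iter, after_batch, after_iter):
    """Generator that mimics the hook order around one pass over the data."""
    before_iter()
    for batch in batches:
        yield after_batch(batch)
    after_iter()

events = []
loader = iterate_sketch(
    batches=[[1, 2], [3, 4]],
    before_iter=lambda: events.append("before_iter"),
    after_batch=lambda b: (events.append("after_batch"), b)[1],
    after_iter=lambda: events.append("after_iter"),
)
print(list(loader))  # [[1, 2], [3, 4]]
print(events)        # ['before_iter', 'after_batch', 'after_batch', 'after_iter']
```

In this sketch `after_iter` runs only when the generator is exhausted, so breaking out of the loop early would skip it.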