# "fastai Data API from Foundations"
> TODO
- toc: true
- comments: true
- author: Kushajveer Singh
- categories: [notes]
- badges: true

In [1]:
# Handles all the necessary imports
from fastai.vision.all import *
from fastai.callback.fp16 import to_fp16

## Get dataset
For this post I use the [Imagewoof](https://github.com/fastai/imagenette) dataset. There is nothing special about it: it is an ImageNet-style dataset that gives us something concrete to work with.

> Tip: In fastai, if you are unsure how to build a `DataLoader` for your data, a `csv` file is usually the easiest option. Define a `label` column and a `valid` column and it will just work.
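
As a rough illustration of the tip above, a CSV of that shape could be generated from an ImageNet-style folder tree using only the standard library. The function name `write_label_csv`, the column name `name`, and the file name `labels.csv` are assumptions made for this sketch; none of them come from fastai.

```python
import csv
from pathlib import Path

def write_label_csv(root, csv_path, valid_folder='val'):
    """Walk root/<split>/<class>/<image> and write one CSV row per image."""
    root = Path(root)
    rows = []
    for img in sorted(root.glob('*/*/*')):
        if not img.is_file():
            continue
        # The first two path components are the split and the class folder.
        split, label = img.relative_to(root).parts[:2]
        rows.append({
            'name': img.relative_to(root).as_posix(),  # path relative to root
            'label': label,                            # class folder name
            'valid': split == valid_folder,            # True for validation rows
        })
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'label', 'valid'])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A file like this should then be loadable with `ImageDataLoaders.from_csv`, pointing its `fn_col`, `label_col` and `valid_col` arguments at the three columns.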

In [2]:
path = Path('/home/kushaj/Desktop/Data/imagewoof2/')
Path.BASE_PATH = path
path.ls()

(#2) [Path('train'),Path('val')]

In [3]:
!tree {path} -L 2

/home/kushaj/Desktop/Data/imagewoof2
├── train
│   ├── n02086240
│   ├── n02087394
│   ├── n02088364
│   ├── n02089973
│   ├── n02093754
│   ├── n02096294
│   ├── n02099601
│   ├── n02105641
│   ├── n02111889
│   └── n02115641
└── val
    ├── n02086240
    ├── n02087394
    ├── n02088364
    ├── n02089973
    ├── n02093754
    ├── n02096294
    ├── n02099601
    ├── n02105641
    ├── n02111889
    └── n02115641

22 directories, 0 files


## PyTorch `Dataset` class
This is the base class used to get items from the dataset. Our class is a subclass of `torch.utils.data.Dataset`; fastai does not do anything special here, so the same dataset class you would use in PyTorch can be used as-is.

Let's start by defining the `Dataset` class for the training and validation datasets. What do we need this class to do?

It only needs to return the image file names, and that is it. The reason is that the `DataLoader` class in fastai is very powerful (as we will soon see), so the rest of the work can be done there.

> Note: To avoid clashing with names imported from fastai, I prefix class names with an underscore (\_).

In [4]:
class _Dataset(torch.utils.data.Dataset):
    def __init__(self, path=None): self.items = get_image_files(path)
    def __len__(self)            : return len(self.items)
    def __getitem__(self, i)     : return self.items[i]
    
dataset = {
    'train': _Dataset(path/'train'),
    'valid': _Dataset(path/'val'),
}

In [5]:
len(dataset['train']), len(dataset['valid'])

(9025, 3929)

In [6]:
dataset['train'][1]

Path('train/n02111889/n02111889_11223.JPEG')

## Create `DataLoader`
The base `DataLoader` class, defined in `fastai.data.load`, forms the basis of the fastai DataBlock API.

Please refer to my previous post [Deep dive into fastai DataLoader methods](https://kushajveersingh.github.io/blog/notes/2020/09/05/post-0013.html), which provides a two-minute summary of all the important methods of the `DataLoader` class.

From this point on, I assume you are comfortable with the methods available in the `DataLoader` class and the order in which they operate.
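
As a quick refresher on that order, here is a deliberately simplified, pure-Python model of one pass over a dataset. It is not fastai's implementation: the function `iterate_batches` and its `collate` argument are names made up for this sketch, and it leaves out worker processes, `before_batch`, and other hooks. It only shows that per-item work happens before collation and whole-batch work happens after it.

```python
import random

def iterate_batches(dataset, bs, shuffle, after_item, after_batch, collate):
    """Toy model of one epoch: pick indices, build items, collate, post-process."""
    idxs = list(range(len(dataset)))
    if shuffle:
        random.shuffle(idxs)                              # 1. decide the index order
    for start in range(0, len(idxs), bs):
        chunk = idxs[start:start + bs]                    # 2. indices for one batch
        items = [after_item(dataset[i]) for i in chunk]   # 3. per-item work
        batch = collate(items)                            # 4. list of items -> one batch
        yield after_batch(batch)                          # 5. whole-batch work


# Tiny stand-ins so the order of the calls is easy to see.
data = ['a', 'b', 'c', 'd', 'e']
batches = list(iterate_batches(
    data, bs=2, shuffle=False,
    after_item=lambda x: (x, ord(x)),            # item -> (input, target)
    after_batch=lambda b: {'xs': b[0], 'ys': b[1]},
    collate=lambda items: tuple(zip(*items)),    # group the inputs and the targets
))
print(batches[0])
```

In the rest of this post, `after_item` plays the role of step 3 (file path to image and label) and `after_batch` plays the role of step 5 (normalization on the GPU).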

Now let's start creating our `DataLoader`.

### Get (image, label) tuple from filename
This can be done using `after_item`. We need to read the image from disk, resize it to a fixed size (224x224 for this example), and extract the image's label. For the labels, I manually create a dictionary that maps each folder name to an integer.

At this point we are still limiting ourselves by not using `Transform`s. We will use them in the next section.

In [7]:
vocab = {
    'n02086240':0,
    'n02087394':1,
    'n02088364':2,
    'n02089973':3,
    'n02093754':4,
    'n02096294':5,
    'n02099601':6,
    'n02105641':7,
    'n02111889':8,
    'n02115641':9,
}
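
The same mapping could also be derived instead of typed by hand. The sketch below assumes the class folders are sorted alphabetically before being numbered; `build_vocab` and `derived_vocab` are names invented for this example.

```python
def build_vocab(class_names):
    """Map each class name to an integer index, in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(class_names)))}

# In practice the names could come from the folder listing, for example
# [p.name for p in (path/'train').iterdir() if p.is_dir()].
folders = [
    'n02115641', 'n02086240', 'n02087394', 'n02088364', 'n02089973',
    'n02093754', 'n02096294', 'n02099601', 'n02105641', 'n02111889',
]
derived_vocab = build_vocab(folders)
print(derived_vocab['n02111889'])
```

Sorting first makes the numbering independent of the order in which the file system happens to list the folders.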

In [8]:
def after_item(item):
    # `item` here is dataset[idx] i.e. image file path
    image = image2tensor(load_image(item, mode='RGB').resize((224,224)))
    label = vocab[item.parent.name]
    return image, label

In [9]:
after_item(dataset['train'][1])

(tensor([[[103, 105, 107,  ...,  85,  86,  89],
          [105, 105, 105,  ...,  81,  78,  81],
          [109, 109, 109,  ...,  85,  83,  83],
          ...,
          [ 50,  44,  39,  ...,  70,  66,  66],
          [ 47,  51,  57,  ...,  63,  60,  70],
          [ 43,  49,  45,  ...,  58,  60,  56]],
 
         [[ 77,  79,  82,  ...,  53,  53,  53],
          [ 78,  78,  80,  ...,  48,  48,  49],
          [ 80,  80,  82,  ...,  49,  48,  48],
          ...,
          [122, 113, 102,  ..., 142, 137, 137],
          [106, 116, 132,  ..., 136, 137, 139],
          [ 85, 104,  98,  ..., 134, 136, 137]],
 
         [[ 55,  57,  59,  ...,  38,  34,  32],
          [ 55,  56,  56,  ...,  34,  31,  33],
          [ 61,  61,  63,  ...,  32,  32,  34],
          ...,
          [118, 107,  99,  ..., 148, 145, 144],
          [104, 113, 130,  ..., 142, 143, 146],
          [ 69,  90,  91,  ..., 140, 143, 144]]], dtype=torch.uint8),
 8)

### Apply some transforms
This is where the power of fastai comes into play. We can define transforms that run per item on the CPU or on a complete batch on the GPU. `after_item` can also be considered a form of transform.

To apply transforms to a complete batch on the GPU we use `after_batch`. For our example, we need to convert the image tensor to float and then normalize it using the ImageNet mean and standard deviation.

In [10]:
def after_batch(b):
    # `b` is a tuple of (image, label) 
    # `image` of shape [batch_size, num_channels, height, width]
    # `label` of shape [batch_size]
    device = torch.device('cuda')
    b = to_device(b, device)
    imgs, lbls = b
    
    # convert `imgs` to float
    imgs = imgs.div(255.)
    
    # normalize data
    mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1,3,1,1)
    std  = torch.tensor([0.229, 0.224, 0.225], device=device).view(1,3,1,1)
    imgs = (imgs - mean) / std
    
    return imgs,lbls
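
To make the arithmetic concrete, the same two steps can be applied by hand to a single pixel. This is a plain-Python sketch (no tensors, no GPU) and `normalize_pixel` is a name made up for it. The input is the top-left pixel of the sample image printed earlier: R=103, G=77, B=55.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale one uint8 RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

print(normalize_pixel((103, 77, 55)))
```

All three results are negative because this pixel is darker than the ImageNet channel means. `after_batch` does the same computation for every pixel of every image at once, by broadcasting the `(1,3,1,1)` mean and std tensors over the batch.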

### Create `DataLoader`
Now we are ready to create a `DataLoader` for the training and validation datasets.

In [11]:
def get_dataloader(dataset, shuffle):
    return DataLoader(dataset,
                      bs=64,
                      num_workers=12,
                      shuffle=shuffle,  # use the argument instead of always shuffling
                      after_item=after_item,
                      after_batch=after_batch)

dl = {
    'train': get_dataloader(dataset['train'], shuffle=True),
    'valid': get_dataloader(dataset['valid'], shuffle=False),
}

dls = DataLoaders(dl['train'], dl['valid'], device=torch.device('cuda'))

`DataLoaders` is just a wrapper around a list of `DataLoader`s.

And that is it. Now we can create a learner.

In [12]:
learn = Learner(dls, 
                xresnet18(pretrained=False), 
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy]).to_fp16()

In [13]:
learn.fit_one_cycle(3)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 2.597425 | 2.156052 | 0.234665 | 00:13 |
| 1 | 1.891093 | 1.834174 | 0.359888 | 00:11 |
| 2 | 1.618509 | 1.609872 | 0.434716 | 00:11 |
