---
title: "Notes on dataloaders"
description: "..."
author: "Temi"
date: 'Thurs Sep 7 2023'
categories: [pytorch, machine learning]
---

:::{.callout-note}
This post is still under construction; I am adding stuff as I get the time to.
:::

In [3]:
import torch
import numpy as np, os, sys, pandas as pd
import matplotlib.pyplot as plt
import random

print(f'Kernel used is: {os.path.basename(sys.executable.replace("/bin/python",""))}')



Kernel used is: dl-tools


# Introduction

When training deep learning models (or any machine learning model, for that matter), we try to make the most of the available data. One way to do this is to supply the model with a batch of the data at each training iteration. So, if you have 5000 observations to train on, you can supply, say, 20 at a time.

In addition, loading all 5000 observations at once may consume a lot of memory, especially if you have limited resources.

`pytorch` gives us a convenient way to load data in this manner by letting us create our own `Dataset` objects, which are then consumed by pytorch's `DataLoader`.
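
To make the batching arithmetic above concrete, here is a minimal sketch in plain Python (no pytorch involved). The helper name `batch_index_ranges` is made up for illustration; it only shows how 5000 observations split into batches of 20.

```python
import math

def batch_index_ranges(n_observations, batch_size):
    """Hypothetical helper: index ranges covering the data one batch at a time."""
    return [
        range(start, min(start + batch_size, n_observations))
        for start in range(0, n_observations, batch_size)
    ]

n_observations = 5000  # numbers taken from the example in the text
batch_size = 20

n_batches = math.ceil(n_observations / batch_size)
ranges = batch_index_ranges(n_observations, batch_size)

print(n_batches)        # number of batches needed to see every observation once
print(ranges[0])        # indices that make up the first batch
print(len(ranges[-1]))  # size of the last batch
```

With these numbers the data splits evenly into 250 batches of 20 observations each.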

# Class Dataset

In [4]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

The trick to creating your own Dataset object is that when you index it (that is, when you ask it for an item), it should return one training observation.

I will create a fictitious dataset, observations `X` and ground truth `Y`. They will be numpy arrays; this way I can easily manipulate them.

In [21]:
X = np.random.rand(100, 48) 
Y = np.random.choice([0,1], size=100)
X.shape, Y.shape

((100, 48), (100,))

I create a dataset class, `MyDataset`, that stores `X` and `Y` and returns just one observation and its ground truth at a time.

In [22]:
class MyDataset(Dataset): # inherits from pytorch's Dataset class
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
    
    def __len__(self): # the dataloader needs to know the number of observations you have
        return self.X.shape[0]

    def __getitem__(self, idx): # returns just one observation (one unit of training) and its ground truth
        return self.X[idx], self.Y[idx]

Now I can instantiate the dataset object.

In [23]:
mydataset = MyDataset(X, Y)
mydataset

<__main__.MyDataset at 0x11d3808d0>

You can confirm that the dataset object works by doing this. I give it an index, `8`, and it pulls the observation and ground truth corresponding to that index.

In [24]:
mydataset.__getitem__(8)

(array([0.33197901, 0.10920114, 0.32552711, 0.55107147, 0.95523331,
        0.82203799, 0.58211899, 0.9134724 , 0.15777504, 0.74666818,
        0.84837099, 0.53785637, 0.49206978, 0.90127475, 0.7626803 ,
        0.89917058, 0.01275132, 0.94277784, 0.73115781, 0.76832774,
        0.41417915, 0.83662125, 0.69430352, 0.97880989, 0.25958756,
        0.04993424, 0.2055082 , 0.48704122, 0.55182948, 0.72521316,
        0.58642776, 0.95965883, 0.35750039, 0.02896049, 0.34491265,
        0.81426974, 0.47463192, 0.08679966, 0.64945759, 0.28330604,
        0.0216591 , 0.30981423, 0.97186651, 0.95268351, 0.42557078,
        0.15942108, 0.79952813, 0.98738138]),
 1)
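
In everyday use you would rarely call `__getitem__` directly; square-bracket indexing and `len()` go through the same methods. Below is a small self-contained sketch of that. The class `ToyDataset` and the arrays `X_toy` and `Y_toy` are stand-ins I define here so the block runs on its own; they mirror `MyDataset`, `X` and `Y` above.

```python
import numpy as np
from torch.utils.data import Dataset

class ToyDataset(Dataset):  # stand-in that mirrors MyDataset
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

# Fictitious data with the same shapes as X and Y
X_toy = np.random.rand(100, 48)
Y_toy = np.random.choice([0, 1], size=100)
toy = ToyDataset(X_toy, Y_toy)

x8, y8 = toy[8]      # square brackets call __getitem__(8)
print(len(toy))      # len() calls __len__
print(x8.shape, y8)  # one observation with 48 features, and its label
```

Here `toy[8]` and `toy.__getitem__(8)` return the same pair, so either form works for a quick check.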

# Dataloader

All well and good. But I don't want to give my model one observation at a time. Although people do this, a single observation is a very small unit for a training step. Instead, I want to give the model a batch of observations at a time. A `DataLoader` helps with this.

I create a `DataLoader` object and supply it the argument `batch_size`. Whenever I ask the object for training examples, it gives me `batch_size` observations at a time. Here I will set `batch_size` to 20.

In [25]:
mydataloader = DataLoader(mydataset, batch_size=20)

In [26]:
for i, batch in enumerate(mydataloader):
    print(f'batch {i}: number of observations and ground truth are {batch[0].shape[0]} and {batch[1].shape[0]} respectively')

batch 0: number of observations and ground truth are 20 and 20 respectively
batch 1: number of observations and ground truth are 20 and 20 respectively
batch 2: number of observations and ground truth are 20 and 20 respectively
batch 3: number of observations and ground truth are 20 and 20 respectively
batch 4: number of observations and ground truth are 20 and 20 respectively
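
The batches above come out even because 100 divides cleanly by 20. The sketch below shows what I would expect when it does not, along with two other `DataLoader` arguments, `drop_last` and `shuffle`. To keep the block self-contained I use pytorch's `TensorDataset` as a stand-in for `mydataset`; the names `X_demo`, `Y_demo` and `demo_dataset` are made up for this example.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for `mydataset`: fictitious data with the same shapes as X and Y
X_demo = torch.from_numpy(np.random.rand(100, 48))
Y_demo = torch.from_numpy(np.random.choice([0, 1], size=100))
demo_dataset = TensorDataset(X_demo, Y_demo)

# 100 observations do not divide evenly by 30, so the last batch is smaller
loader = DataLoader(demo_dataset, batch_size=30)
batch_sizes = [xb.shape[0] for xb, yb in loader]
print(len(loader), batch_sizes)

# drop_last=True discards the incomplete final batch
loader_drop = DataLoader(demo_dataset, batch_size=30, drop_last=True)
batch_sizes_drop = [xb.shape[0] for xb, yb in loader_drop]
print(len(loader_drop), batch_sizes_drop)

# shuffle=True draws the observations in a new random order every epoch
loader_shuffled = DataLoader(demo_dataset, batch_size=20, shuffle=True)
xb, yb = next(iter(loader_shuffled))
print(xb.shape, yb.shape)
```

With a batch size of 30 the loader yields four batches of sizes 30, 30, 30 and 10; with `drop_last=True` only the three full batches remain. Shuffling is usually what you want for training data, while a validation loader can keep the default `shuffle=False`.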
