# Why use the Dataset and DataLoader classes?

Earlier we were using batch gradient descent: the entire dataset is used to make predictions, and the loss is calculated over all of it before each parameter update.

Issues:

1. Memory inefficient (the entire dataset has to be in memory at the same time)
2. Slow convergence (only one parameter update per pass over the data)

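So we can use mini-batch gradient descent instead: split the data into small batches and update the parameters after each batch. (This is often loosely called stochastic gradient descent, although strictly speaking SGD uses a single sample per update.)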
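To make the idea of batching concrete, here is a minimal sketch that splits a tensor into shuffled batches by hand, without any helper classes. The tensors `X` and `y` and the batch size of 100 are placeholder values chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Placeholder data: 1000 rows with 20 features each, and binary labels.
X = torch.randn(1000, 20)
y = torch.randint(0, 2, (1000,)).float()

batch_size = 100

# Shuffle the row indices once, then slice them into consecutive chunks.
perm = torch.randperm(X.shape[0])

batches = []
for start in range(0, X.shape[0], batch_size):
    idx = perm[start:start + batch_size]
    batches.append((X[idx], y[idx]))

print(len(batches), batches[0][0].shape)
```

Writing this bookkeeping for every project gets repetitive, which is the motivation for the classes described next.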


## Dataset and DataLoader classes

1. These are abstractions that decouple how you define your data from how efficiently you can iterate over it in training loops.

2. Dataset class: defines how the data is loaded (for example from secondary storage into main memory) and how a single row is fetched.

3. DataLoader class: splits the data into batches and decides the number of rows per batch. The DataLoader uses the Dataset to fetch rows and then assembles them into batches for training.


## How to use them?

1. Create a `CustomDataset` class that inherits from the `Dataset` class (an abstract class).
2. Implement the 3 essential methods:

    `__init__()` : how the data will be loaded

    `__len__()` : total number of rows in the dataset

    `__getitem__(index)` : how to get the row at a given index
3. Pass the dataset to a `DataLoader`, which then works as follows:

    The DataLoader shuffles the row indices using a sampler (when `shuffle=True`).

    It then splits the indices into chunks based on the batch size.

    It calls `__getitem__(index)` for every index in a chunk and combines the results into a batch (using `collate_fn`).

    The batch is finally returned.
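The steps above can be approximated with a short hand-written generator. This is only a rough sketch of the idea, not PyTorch's actual implementation; `TinyDataset` and `manual_batches` are hypothetical names introduced for illustration.

```python
import torch
from torch.utils.data import Dataset


class TinyDataset(Dataset):
    """Hypothetical dataset with n rows of 2 features and one label per row."""

    def __init__(self, n):
        self.features = torch.arange(n * 2, dtype=torch.float32).reshape(n, 2)
        self.labels = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]


def manual_batches(dataset, batch_size, shuffle=True):
    """Rough imitation of the sampler -> chunk -> __getitem__ -> collate pipeline."""
    n = len(dataset)
    # Step 1: the "sampler" decides the order of the indices.
    indices = torch.randperm(n).tolist() if shuffle else list(range(n))
    # Step 2: split the indices into chunks of batch_size.
    for start in range(0, n, batch_size):
        chunk = indices[start:start + batch_size]
        # Step 3: fetch each row through __getitem__.
        samples = [dataset[i] for i in chunk]
        # Step 4: "collate" the rows by stacking them into batch tensors.
        features, labels = zip(*samples)
        yield torch.stack(features), torch.stack(labels)


batch_shapes = [
    (f.shape, l.shape) for f, l in manual_batches(TinyDataset(10), batch_size=4)
]
print(batch_shapes)
```

With 10 rows and a batch size of 4, the last batch holds only the 2 remaining rows.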




In [1]:
from sklearn.datasets import make_classification
import torch

In [2]:
# synthetic dataset creation
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2)

In [5]:
X = torch.from_numpy(X).to(dtype = torch.float32)
y = torch.from_numpy(y).to(dtype = torch.float32)

In [6]:
type(X)

torch.Tensor

In [7]:
from torch.utils.data import Dataset, DataLoader

In [11]:
class CustomDataset(Dataset):
  def __init__(self, features, labels):
    # keep references to the already-loaded tensors
    self.features = features
    self.labels = labels

  def __len__(self):
    # number of rows in the dataset
    return len(self.features)

  def __getitem__(self, index):
    # return one (features, label) pair
    return self.features[index], self.labels[index]

In [12]:
dataset = CustomDataset(X,y)

In [16]:
len(dataset)

1000

In [25]:
dataloader = DataLoader(dataset, batch_size = 100 , shuffle = True)

In [28]:
for batch_features, batch_labels in dataloader :
  print(batch_features)
  print(batch_labels)
  print("------------------------------")


tensor([[-0.4319, -1.0708, -0.6121,  0.8377, -1.1504, -1.0668, -0.3438, -0.5364,
         -0.3965,  0.1557, -1.1927,  0.2240, -0.9381,  0.1656, -1.0490, -0.8079,
          1.0349,  1.0933,  0.5139, -1.2581],
        [-0.7644, -0.2246, -0.7554, -0.6092, -0.9295, -0.9404,  2.1669, -0.0971,
         -0.0918, -0.4187, -1.2787, -1.0343, -1.5243,  1.4249, -0.8604, -1.3429,
         -0.6335,  1.1641, -0.1212,  0.2009]])
tensor([0., 0.])
------------------------------
tensor([[-0.7959,  0.4805,  0.9799, -0.1124, -2.2056,  1.2954,  0.0079,  1.9445,
         -0.1558, -0.2964,  0.0790, -1.4580,  0.1112,  1.7008,  0.7657,  0.4833,
         -0.2634, -0.6383,  0.6464,  0.5460],
        [-0.3034, -1.2941,  2.4981,  0.9222, -0.5533,  0.9022,  0.5283,  1.1085,
         -1.0314, -0.8774, -1.1954,  0.0582,  1.8168,  0.3465,  0.7028, -0.0290,
         -0.6124, -0.5756, -0.8818,  0.5607]])
tensor([0., 1.])
------------------------------
tensor([[-0.7614, -0.3726,  1.7594, -1.0858, -0.4042, -1.3394,  0.3990

In [29]:
# We can also apply transformations inside the __getitem__ function, e.g. data augmentation, lowercasing text, converting images to black and white.
# Parallelization: building batches in parallel is also implemented in DataLoader; we just need to set num_workers.
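As an illustration of a transformation inside `__getitem__`, the sketch below passes an optional callable to the dataset and applies it to each row as it is fetched. `TransformedDataset` and `clip_to_unit` are hypothetical names, and the clipping transform is just a stand-in for whatever preprocessing is actually needed.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TransformedDataset(Dataset):
    """Hypothetical dataset that applies an optional transform per row."""

    def __init__(self, features, labels, transform=None):
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        x = self.features[index]
        if self.transform is not None:
            # The transform runs lazily, only for the rows that are requested.
            x = self.transform(x)
        return x, self.labels[index]


def clip_to_unit(x):
    # Stand-in transform: clamp every feature into [-1, 1].
    return x.clamp(-1.0, 1.0)


features = torch.randn(8, 3) * 5
labels = torch.zeros(8)
ds = TransformedDataset(features, labels, transform=clip_to_unit)

# num_workers=0 loads data in the main process; a larger value would
# start that many worker processes to prepare batches in parallel.
loader = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0)
first_features, first_labels = next(iter(loader))
print(first_features.shape)
```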

# sampler: with shuffle=False a sequential sampler is used (e.g. for time-series data); with shuffle=True a random sampler is used.
# We can also create a custom sampler.
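The sampler choice can be inspected through the loader's `sampler` attribute, and a custom sampler only needs to yield indices. In the sketch below, `ReverseSampler` is a hypothetical example that walks the dataset backwards; the 6-element dataset is a placeholder.

```python
import torch
from torch.utils.data import (
    DataLoader,
    RandomSampler,
    Sampler,
    SequentialSampler,
    TensorDataset,
)

data = TensorDataset(torch.arange(6))

# shuffle=False -> SequentialSampler, shuffle=True -> RandomSampler
seq_loader = DataLoader(data, batch_size=2, shuffle=False)
rand_loader = DataLoader(data, batch_size=2, shuffle=True)
print(type(seq_loader.sampler).__name__, type(rand_loader.sampler).__name__)


class ReverseSampler(Sampler):
    """Hypothetical custom sampler: yields indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


# A custom sampler is passed via `sampler`; shuffle is left at its default.
rev_loader = DataLoader(data, batch_size=2, sampler=ReverseSampler(data))
rev_batches = [batch[0].tolist() for batch in rev_loader]
print(rev_batches)
```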



# collate_fn: combines the outputs of the __getitem__ calls into a batch.
# We can customize it if we need something specific. For example, if we convert text data (one sentence at a time) into word embeddings, the sentences may have different lengths,
# so some manual padding logic is required.
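A possible padding `collate_fn` for the variable-length case is sketched below, assuming each sentence is already a tensor of shape `(num_words, embedding_dim)`. `SentenceDataset` and `pad_collate` are hypothetical names, and the random tensors stand in for real embeddings.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader


class SentenceDataset(Dataset):
    """Hypothetical dataset of variable-length sentence embeddings."""

    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        return self.sentences[index], self.labels[index]


def pad_collate(batch):
    # batch is a list of (sentence, label) pairs returned by __getitem__.
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    # Pad every sentence with zeros up to the longest one in this batch.
    padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)
    return padded, lengths, torch.stack(labels)


# Three sentences with 3, 5 and 2 words, embedding dimension 4.
sentences = [torch.randn(3, 4), torch.randn(5, 4), torch.randn(2, 4)]
labels = torch.tensor([0.0, 1.0, 0.0])

loader = DataLoader(
    SentenceDataset(sentences, labels), batch_size=3, collate_fn=pad_collate
)
padded, lengths, batch_labels = next(iter(loader))
print(padded.shape, lengths.tolist())
```

Returning the original lengths alongside the padded tensor lets later code ignore the padded positions.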

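Important parameters of the DataLoader class:
1. dataset : required, with all 3 methods already defined
2. batch_size : number of rows per batch; the default is 1
3. num_workers : number of worker processes used for parallel loading (the default 0 loads in the main process)
4. shuffle : whether to reshuffle the rows at every epoch (the default is False)
5. pin_memory : puts batches in pinned (page-locked) host memory, which speeds up transfer to the GPU
6. drop_last : drops the last incomplete batch, e.g. with 32 rows and a batch size of 10, the final batch of 2 rows is dropped
7. collate_fn : specifies how individual rows are combined into a batch
8. sampler : specifies how indices are drawn from the dataset (cannot be combined with shuffle=True)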
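The effect of `drop_last` with 32 rows and a batch size of 10 can be checked directly. The data below is a random placeholder, and `pin_memory` is only switched on when a CUDA device is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 32 rows, 20 features, dummy labels.
data = TensorDataset(torch.randn(32, 20), torch.zeros(32))

# Pinned memory only helps when batches are later copied to a GPU.
use_pinned = torch.cuda.is_available()

keep_loader = DataLoader(data, batch_size=10, drop_last=False, pin_memory=use_pinned)
drop_loader = DataLoader(data, batch_size=10, drop_last=True, pin_memory=use_pinned)

keep_sizes = [len(features) for features, _ in keep_loader]
drop_sizes = [len(features) for features, _ in drop_loader]

print(keep_sizes)  # the final batch holds the 2 leftover rows
print(drop_sizes)  # the leftover rows are discarded
```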