Until now we have used batch gradient descent, which means we pass the entire dataset through the training pipeline at once. This can cause problems such as:
* Memory inefficiency, because the whole dataset has to be processed in one go.
* Slow convergence, because the parameters (weights) are updated only once per epoch.

In practice we use mini-batch gradient descent instead.
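To make the difference concrete, here is a minimal sketch of one epoch of mini-batch gradient descent written by hand, without a DataLoader. The toy data, the batch size of 2, and the learning rate of 0.1 are assumptions chosen only for illustration. With 10 samples, full-batch training would update the weights once per epoch, while this loop updates them five times.

```python
import torch

torch.manual_seed(0)

# Toy data (assumed for illustration): 10 samples, 2 features, binary labels.
X = torch.randn(10, 2)
y = (X[:, 0] > 0).float()

# Parameters of a single-layer logistic model.
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

batch_size = 2
lr = 0.1
updates_per_epoch = 0

# One epoch of mini-batch gradient descent using manual slicing.
for start in range(0, len(X), batch_size):
    xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]

    pred = torch.sigmoid(xb @ w + b)
    loss = torch.nn.functional.binary_cross_entropy(pred, yb)
    loss.backward()

    # Plain SGD step, then reset the gradients for the next mini-batch.
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()
    b.grad.zero_()
    updates_per_epoch += 1

print(updates_per_epoch)  # 5 updates, versus 1 for full-batch training
```

Manual slicing like this does not shuffle, parallelize, or handle custom batching, which is the gap Dataset and DataLoader fill.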

Why do we need Dataset and DataLoader? Without them there is:
* No standard interface for data.
* No easy way to apply transformations.
* No built-in shuffling and sampling.
* No built-in batch management and parallelization.

Dataset and DataLoader are core abstractions in PyTorch that decouple how you
define your data from how you efficiently iterate over it in training loops.

**Dataset Class**

The Dataset class is essentially a blueprint. When you create a
custom Dataset, you decide how data is loaded and returned.
It defines:
* __init__() which tells how data should be loaded.
* __len__() which returns the total number of samples.
* __getitem__(index) which returns the data (and label) at the
given index.

**DataLoader Class**

The DataLoader wraps a Dataset and handles batching, shuffling,
and parallel loading for you.

**DataLoader Control Flow:**

* At the start of each epoch, the DataLoader (if shuffle=True)
shuffles the indices (using a sampler).
* It divides the indices into chunks of batch_size.
* For each index in the chunk, the data sample is fetched from
the Dataset object.
* The samples are then collected and combined into a batch
(using collate_fn).
* The batch is returned to the main training loop.
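The steps above can be sketched in plain Python. This is a simplified illustration of the control flow, not the actual DataLoader implementation; the function name iterate_batches and the toy list of samples are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Toy "dataset" (assumed for illustration): a list of (features, label) pairs.
samples = [
    (torch.tensor([float(i), float(i) * 2]), torch.tensor(float(i % 2)))
    for i in range(10)
]

def iterate_batches(dataset, batch_size, shuffle):
    """Yield (features, labels) batches, mimicking the steps listed above."""
    n = len(dataset)
    # 1. Sampler: decide the order in which indices are visited.
    indices = torch.randperm(n).tolist() if shuffle else list(range(n))
    # 2. Split the indices into chunks of batch_size.
    for start in range(0, n, batch_size):
        chunk = indices[start:start + batch_size]
        # 3. Fetch each sample (dataset[idx] calls __getitem__).
        fetched = [dataset[idx] for idx in chunk]
        # 4. Collate: stack the individual samples into batch tensors.
        features = torch.stack([feature for feature, _ in fetched])
        labels = torch.stack([label for _, label in fetched])
        # 5. Hand the batch to the training loop.
        yield features, labels

batches = list(iterate_batches(samples, batch_size=4, shuffle=True))
print([tuple(features.shape) for features, _ in batches])  # [(4, 2), (4, 2), (2, 2)]
```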

In [None]:
from sklearn.datasets import make_classification
import torch

In [None]:
# Create a synthetic dataset using sklearn
X, y = make_classification(n_samples=10, # Number of samples
                           n_features=2, # Number of features
                           n_informative=2, # Number of informative features
                           n_redundant=0, # Number of redundant features
                           n_classes=2,   # Number of classes
                           random_state=42) # For reproducibility

In [None]:
X

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [None]:
X.shape

(10, 2)

In [None]:
y.shape

(10,)

In [None]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [None]:
# Convert the data into tensors.
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

In [None]:
X

tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388],
        [-2.8954,  1.9769],
        [-0.7206, -0.9606],
        [-1.9629, -0.9923],
        [-0.9382, -0.5430],
        [ 1.7273, -1.1858],
        [ 1.7774,  1.5116],
        [ 1.8997,  0.8344],
        [-0.5872, -1.9717]])

In [None]:
y

tensor([1., 0., 0., 0., 0., 1., 1., 1., 1., 0.])

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
class CustomDataset(Dataset):

  def __init__(self, features, labels):
    self.features = features
    self.labels = labels

  def __len__(self):
    return self.features.shape[0]

  def __getitem__(self, idx):
    return self.features[idx], self.labels[idx]

In [None]:
dataset = CustomDataset(X, y)

In [None]:
len(dataset)

10

In [None]:
dataset[2]

(tensor([-2.8954,  1.9769]), tensor(0.))

In [None]:
dataloader = DataLoader(dataset, batch_size = 2, shuffle = True)

In [None]:
for batch_features, batch_labels in dataloader:

  print(batch_features)
  print(batch_labels)
  print('-'*50)

tensor([[-2.8954,  1.9769],
        [-1.9629, -0.9923]])
tensor([0., 0.])
--------------------------------------------------
tensor([[-0.9382, -0.5430],
        [-0.7206, -0.9606]])
tensor([1., 0.])
--------------------------------------------------
tensor([[-0.5872, -1.9717],
        [-1.1402, -0.8388]])
tensor([0., 0.])
--------------------------------------------------
tensor([[1.7774, 1.5116],
        [1.8997, 0.8344]])
tensor([1., 1.])
--------------------------------------------------
tensor([[ 1.0683, -0.9701],
        [ 1.7273, -1.1858]])
tensor([1., 1.])
--------------------------------------------------


Where can you apply data transformations here?

* In the CustomDataset class, you can apply transformations inside the __getitem__() method, so each sample is transformed as it is fetched.
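As a minimal sketch of that idea, the dataset below accepts an optional transform and applies it inside __getitem__. The class name TransformedDataset and the scale_by_ten function are hypothetical, chosen only to illustrate the pattern.

```python
import torch
from torch.utils.data import Dataset

class TransformedDataset(Dataset):
    """Like CustomDataset, plus an optional per-sample transform (hypothetical name)."""

    def __init__(self, features, labels, transform=None):
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        x = self.features[idx]
        if self.transform is not None:
            x = self.transform(x)  # applied lazily, one sample at a time
        return x, self.labels[idx]

def scale_by_ten(x):
    # Stand-in transform, assumed for illustration only.
    return x * 10

features = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
labels = torch.tensor([0.0, 1.0])
dataset = TransformedDataset(features, labels, transform=scale_by_ten)

print(dataset[1])  # (tensor([30., 40.]), tensor(1.))
```

Because the transform runs per sample at fetch time, the stored features stay unchanged and the same data can be wrapped with different transforms for training and testing.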

# A note about Parallelization

Imagine the entire data loading and training process for one epoch with num_workers=4:

**Assumptions:**

* Total samples: 10,000
* Batch size: 32
* Workers (num_workers): 4
* 312 full batches plus one final batch of 16 samples per epoch (10000 / 32 = 312.5), so 313 batches in total unless drop_last=True.
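The batch count under these assumptions can be checked directly. This is a quick sketch using a placeholder TensorDataset filled with zeros (an assumption; only its length matters here). 10,000 samples at batch size 32 give 312 full batches and a remainder of 16 samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

num_samples, batch_size = 10_000, 32

# Placeholder data (assumed for illustration): only the length matters.
dataset = TensorDataset(torch.zeros(num_samples, 1))

full_batches, remainder = divmod(num_samples, batch_size)
print(full_batches, remainder)  # 312 16

keep_last = DataLoader(dataset, batch_size=batch_size, drop_last=False)
drop_last = DataLoader(dataset, batch_size=batch_size, drop_last=True)
print(len(keep_last), len(drop_last))  # 313 312
```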

**Workflow:**
1. Sampler and Batch Creation (Main Process):

Before training starts for the epoch, the DataLoader’s sampler generates a shuffled list of all 10,000 indices. These are then grouped into 312 batches of 32 indices each, plus a final batch holding the remaining 16 indices. These batches of indices are handed out to the workers as the epoch progresses.

2. Parallel Data Loading (Workers):
* At the start of the training epoch, you run a training loop like:

      for batch_data, batch_labels in dataloader:
          # training logic

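Under the hood, as soon as you start iterating over dataloader, it dispatches batches of indices to the four workers, starting with one batch each (with the default prefetch_factor=2, a second batch per worker is queued right behind it):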
* Worker #1 loads batch 1 (indices [batch_1_indices])
* Worker #2 loads batch 2 (indices [batch_2_indices])
* Worker #3 loads batch 3 (indices [batch_3_indices])
* Worker #4 loads batch 4 (indices [batch_4_indices])

Each worker:

* Fetches the corresponding samples by calling __getitem__ on the dataset for each index in that batch.
* Applies any defined transforms and passes the samples through collate_fn to form a single batch tensor.


3. First Batch Returned to Main Process:
* Each worker sends its fully prepared batch back to the main process as soon as it finishes. By default the DataLoader yields batches in their original order, so the main process waits for batch 1 first.
* As soon as the main process has this first prepared batch, it yields it to your training loop, so the line for batch_data, batch_labels in dataloader: receives (batch_data, batch_labels) for the first batch.

4. Model Training on the Main Process:
* While you are now performing the forward pass, computing loss, and doing backpropagation on the first batch, the other three workers are still preparing their batches in parallel.
* By the time you finish updating your model parameters for the first batch, the DataLoader likely has the second, third, or even more batches ready to go (depending on processing speed and hardware).

5. Continuous Processing:
* As soon as a worker finishes its batch, it grabs the next batch of indices from the queue.
* For example, after Worker #1 finishes with batch 1, it immediately starts on batch 5. After Worker #2
finishes batch 2, it takes batch 6, and so forth.
* This creates a pipeline effect: at any given moment, up to 4 batches are being prepared concurrently.

6. Loop Progression:
* Your training loop simply sees:

      for batch_data, batch_labels in dataloader:
          # forward pass
          # loss computation
          # backward pass
          # optimizer step

* Each iteration, it gets a new, ready-to-use batch without long I/O waits, because the workers have been preloading and processing data in parallel.

7. End of the Epoch:
* After 313 iterations (312 full batches plus the final partial one), all batches have been processed. All indices have been consumed, so the DataLoader has no more batches to yield.
* The epoch ends. If shuffle=True, on the next epoch, the sampler reshuffles indices, and the whole process repeats with workers again loading data in parallel.
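One way to observe which process builds each batch is to tag every sample with the id of the worker that fetched it, using torch.utils.data.get_worker_info(). The sketch below is only an illustration: the names WorkerTaggedDataset and collect_epoch are hypothetical, and using -1 to mark the main process is a convention of this example. It runs with num_workers=0 so that it works anywhere; passing a larger value such as 4 shows worker ids 0 to 3 instead.

```python
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info

class WorkerTaggedDataset(Dataset):
    """Returns each index together with the id of the worker that fetched it."""

    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        info = get_worker_info()  # None when called in the main process
        worker_id = info.id if info is not None else -1
        return torch.tensor(idx), torch.tensor(worker_id)

def collect_epoch(num_workers, batch_size=4, num_samples=16):
    """Run one epoch and report, per batch, its indices and the worker ids seen."""
    loader = DataLoader(
        WorkerTaggedDataset(num_samples),
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
    )
    return [
        (indices.tolist(), sorted(set(worker_ids.tolist())))
        for indices, worker_ids in loader
    ]

if __name__ == "__main__":
    # With num_workers=0 every batch is built in the main process (id -1 here).
    for indices, workers in collect_epoch(num_workers=0):
        print(indices, workers)
```

With workers enabled, each batch should carry a single worker id, since one worker builds a whole batch, including the collate step.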


# A note about samplers

In PyTorch, the sampler in the DataLoader determines the strategy for selecting samples from the dataset during data loading. It controls how indices of the dataset are drawn for each batch.

**Types of Samplers**

PyTorch provides several predefined samplers, and you can create custom ones:
1. SequentialSampler:
- Samples elements sequentially, in the order they appear in the dataset.
- Default when shuffle=False.
2. RandomSampler:
- Samples elements randomly without replacement.
- Default when shuffle=True.
3. Custom sampler:
- Suppose the dataset is imbalanced (99% of the samples belong to class 1 and 1% to class 2) and you want every batch to preserve a specific class ratio. Sequential and plain random sampling cannot guarantee that, so you have to write your own sampler.
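Before writing a sampler from scratch, it may be worth trying the built-in torch.utils.data.WeightedRandomSampler, which these notes do not cover. The sketch below is one possible approach under assumed toy data (990 samples of class 0 and 10 of class 1): it weights each sample by the inverse frequency of its class, so both classes are drawn roughly equally often. It balances batches in expectation and does not guarantee an exact ratio per batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy imbalanced data (assumed for illustration): 990 of class 0, 10 of class 1.
labels = torch.cat([torch.zeros(990), torch.ones(10)])
features = torch.randn(1000, 2)
dataset = TensorDataset(features, labels)

# Weight every sample by the inverse frequency of its class.
class_counts = torch.bincount(labels.long())              # tensor([990, 10])
sample_weights = 1.0 / class_counts[labels.long()].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),
    replacement=True,  # minority samples must be reused to balance the classes
)

# A sampler replaces shuffle, so shuffle is left at its default here.
loader = DataLoader(dataset, batch_size=50, sampler=sampler)

minority_seen = sum(int(batch_labels.sum()) for _, batch_labels in loader)
minority_fraction = minority_seen / len(dataset)
print(f"minority fraction over one epoch: {minority_fraction:.2f}")
```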

# A note about collate_fn

The collate_fn in PyTorch's DataLoader is a function that specifies how to combine a list of samples from a dataset into a single batch. By default, the DataLoader uses a simple batch collation mechanism, but collate_fn allows you to customize how the data should be processed and batched.
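A common reason to customize collation is variable-length data, which the default stacking cannot handle. The sketch below pads the sequences of each batch to the same length with torch.nn.utils.rnn.pad_sequence. The dataset VariableLengthDataset, the function pad_collate, and the choice to also return the original lengths are assumptions made for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class VariableLengthDataset(Dataset):
    """Toy dataset (hypothetical) whose samples have different lengths."""

    def __init__(self):
        self.sequences = [
            torch.tensor([1, 2, 3]),
            torch.tensor([4, 5]),
            torch.tensor([6]),
            torch.tensor([7, 8, 9, 10]),
        ]
        self.labels = [0, 1, 0, 1]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

def pad_collate(samples):
    """Combine a list of (sequence, label) samples into padded batch tensors."""
    sequences, labels = zip(*samples)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Pad every sequence with zeros up to the longest one in this batch.
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels), lengths

loader = DataLoader(
    VariableLengthDataset(), batch_size=2, shuffle=False, collate_fn=pad_collate
)
batches = list(loader)
print(batches[0][0])  # tensor([[1, 2, 3], [4, 5, 0]])
```

Padding per batch rather than to a global maximum keeps batches as small as the data allows.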

# DataLoader Important Parameters

The DataLoader class in PyTorch comes with several parameters that allow you to customize how data is loaded, batched, and preprocessed. Some of the most commonly used and important parameters include:
1. dataset (mandatory):
- The Dataset from which the DataLoader will pull data.
- Typically a subclass of torch.utils.data.Dataset that implements __getitem__ and __len__ (a map-style dataset, like the CustomDataset used here).

2. batch_size:
- How many samples per batch to load.
- Default is 1.
- Larger batch sizes can speed up training on GPUs but require more memory.

3. shuffle:
- If True, the DataLoader will shuffle the dataset indices each epoch.
- Helpful to avoid the model becoming too dependent on the order of samples.

4. num_workers:
- The number of worker processes used to load data in parallel.
- Setting num_workers > 0 can speed up data loading by leveraging multiple CPU
cores, especially if I/O or preprocessing is a bottleneck.

5. pin_memory:
- If True, the DataLoader will copy tensors into pinned (page-locked) memory before returning them.
- This can improve GPU transfer speed and thus overall training throughput,
particularly on CUDA systems.

6. drop_last:
- If True, the DataLoader will drop the last incomplete batch if the total number of samples is not divisible by the batch size.
- Useful when exact batch sizes are required (for example, in some batch
normalization scenarios).

7. collate_fn:
- A callable that processes a list of samples into a batch (the default simply stacks tensors).
- Custom collate_fn can handle variable-length sequences, perform custom batching logic, or handle complex data structures.

8. sampler and batch_sampler:
- sampler defines the strategy for drawing individual indices (e.g., for handling imbalanced classes, or custom sampling strategies). It cannot be combined with shuffle=True.
- batch_sampler works at the batch level, controlling how batches are formed. It cannot be combined with batch_size, shuffle, sampler, or drop_last.
- Typically, you don’t need to specify these if you are using batch_size and shuffle. However, they provide lower-level control if you have advanced requirements.
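To illustrate the last item, the sketch below builds batches through an explicit batch_sampler and then shows that passing a sampler together with shuffle=True is rejected. The toy TensorDataset and the batch size of 4 are assumptions for illustration.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

# Toy data (assumed for illustration): 10 samples with alternating labels.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10) % 2)

# batch_sampler yields whole lists of indices, so batch_size is not passed to DataLoader.
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
loader = DataLoader(dataset, batch_sampler=batch_sampler)

batch_sizes = [len(features) for features, _ in loader]
print(batch_sizes)  # [4, 4, 2]

# sampler and shuffle=True are mutually exclusive; DataLoader raises ValueError.
try:
    DataLoader(dataset, sampler=SequentialSampler(dataset), shuffle=True)
    conflict_raised = False
except ValueError:
    conflict_raised = True
print(conflict_raised)  # True
```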

In [1]:
import numpy as np
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [3]:
df.shape

(569, 33)

In [4]:
df.drop(columns=['id', 'Unnamed: 32'], inplace=True)

In [5]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [13]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size = 0.2, shuffle=True)

In [14]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
y_train

Unnamed: 0,diagnosis
216,B
104,B
172,M
346,B
15,M
...,...
555,B
470,B
382,B
378,B


In [15]:
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

In [10]:
y_train

array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0,

In [16]:
# Numpy to tensor
X_train_tensor = torch.from_numpy(X_train.astype(np.float32))
X_test_tensor = torch.from_numpy(X_test.astype(np.float32))
y_train_tensor = torch.from_numpy(y_train.astype(np.float32))
y_test_tensor = torch.from_numpy(y_test.astype(np.float32))

In [17]:
X_train_tensor.shape

torch.Size([455, 30])

In [18]:
y_train_tensor.shape

torch.Size([455])

In [19]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):

  def __init__(self, features, labels):
    self.features = features
    self.labels = labels

  def __len__(self):
    return self.features.shape[0]

  def __getitem__(self, idx):
    return self.features[idx], self.labels[idx]

In [20]:
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)

In [21]:
train_dataset[10]

(tensor([ 1.5307, -0.2371,  1.4802,  1.4429,  0.5245,  0.7472,  0.9017,  1.1742,
          0.3061, -0.5738,  0.2787, -0.3943,  0.0527,  0.3376, -0.4423, -0.1853,
         -0.0445,  0.5494, -0.5021, -0.6865,  1.5682,  0.4728,  1.3562,  1.3306,
          0.8481,  0.7401,  0.7244,  1.6444,  1.1014, -0.3486]),
 tensor(1.))

In [22]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=True)

In [26]:
# Define model
import torch.nn as nn

class SimpleModel(nn.Module):

  def __init__(self, num_features):
    super().__init__()
    self.linear = nn.Linear(num_features, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):
    out = self.linear(features)
    out = self.sigmoid(out)
    return out

In [27]:
learning_rate = 0.1
epochs = 25

In [38]:
# Create model
model = SimpleModel(X_train_tensor.shape[1])

# define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate)

# define loss function
loss_fn = nn.BCELoss()

In [39]:
# Training Pipeline
for epoch in range(epochs):

  for batch_features, batch_labels in train_loader:
    # forward pass
    y_pred = model(batch_features)

    # loss calculate
    loss = loss_fn(y_pred, batch_labels.view(-1, 1))

    # clear gradients
    optimizer.zero_grad()

    # backward pass
    loss.backward()

    # parameter update
    optimizer.step()

  if (epoch+1) % 5 == 0:
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Epoch: 5, Loss: 0.06027161702513695
Epoch: 10, Loss: 0.013299224898219109
Epoch: 15, Loss: 0.023816151544451714
Epoch: 20, Loss: 0.08706999570131302
Epoch: 25, Loss: 0.13152243196964264
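Note that the loop above prints the loss of the last mini-batch in the epoch only, which is a noisy signal and may explain why the reported values do not fall steadily. A possible alternative is to report the mean loss over all samples in the epoch. The sketch below shows this on assumed toy data (64 linearly separable points) with a small stand-in model, not the breast cancer data used above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy, linearly separable data (assumed for illustration).
X = torch.randn(64, 2)
y = (X[:, 0] + X[:, 1] > 0).float()
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

epoch_losses = []
for epoch in range(10):
    running_loss, seen = 0.0, 0
    for batch_features, batch_labels in loader:
        y_pred = model(batch_features)
        loss = loss_fn(y_pred, batch_labels.view(-1, 1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Weight each batch loss by its size so a smaller last batch counts correctly.
        running_loss += loss.item() * batch_features.size(0)
        seen += batch_features.size(0)

    epoch_losses.append(running_loss / seen)
    print(f"Epoch: {epoch + 1}, mean loss: {epoch_losses[-1]:.4f}")
```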


In [41]:
# Evaluation
model.eval()     # set the model to evaluation mode
accuracy_list = []

with torch.no_grad():
  for batch_features, batch_labels in test_loader:
    y_pred = model(batch_features)
    y_pred = (y_pred > 0.6).float()

    accuracy = (y_pred == batch_labels.view(-1, 1)).float().mean()
    accuracy_list.append(accuracy.item())

overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Overall Accuracy: {overall_accuracy}')

Overall Accuracy: 0.9704861044883728
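One caveat: averaging per-batch accuracies gives the last, smaller batch the same weight as the full ones, so the figure above can differ slightly from the true fraction of correct predictions. A possible alternative is to count correct predictions over all samples. The sketch below uses a hypothetical evaluate helper and a toy model with hand-set weights (assumed values) so the result can be checked by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, threshold=0.5):
    """Return accuracy over all samples, not the mean of per-batch accuracies."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_features, batch_labels in loader:
            predictions = (model(batch_features) > threshold).float()
            correct += (predictions == batch_labels.view(-1, 1)).sum().item()
            total += batch_labels.size(0)
    return correct / total

# Toy check (assumed values): 5 samples, so the last batch holds a single sample.
features = torch.tensor([[-2.0], [-1.0], [1.0], [2.0], [3.0]])
labels = torch.tensor([0.0, 0.0, 1.0, 1.0, 0.0])
loader = DataLoader(TensorDataset(features, labels), batch_size=2, shuffle=False)

# Hand-set weights so the model computes sigmoid(x).
model = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())
with torch.no_grad():
    model[0].weight.fill_(1.0)
    model[0].bias.fill_(0.0)

print(evaluate(model, loader, threshold=0.6))  # 0.8
```

In this toy case four of five predictions are correct, giving 0.8, whereas the mean of the per-batch accuracies (1.0, 1.0, 0.0) would be about 0.67.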
