# Dataset and DataLoader in PyTorch

## Using batches of data to train the model.

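- Last time we trained the neural network on all the data at once, but this is not recommended, for two reasons:
1. Memory inefficiency: the whole dataset has to fit in memory at the same time.
2. Slower convergence: the parameters are updated only once per epoch, whereas mini-batches give many updates per epoch and usually converge better.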

- As the dataset grows, loading all of it into memory at once becomes highly inefficient, so we split the data into batches and train the model on one batch at a time within each epoch.

The Dataset and DataLoader Classes

- Dataset and DataLoader are core abstractions in PyTorch that decouple how you
define your data from how you efficiently iterate over it in training loops.

Dataset Class

 The Dataset class is essentially a blueprint. When you create a custom Dataset, you decide how data is loaded and returned.

 It defines:

-  __init__(), which defines how the data should be loaded.
-  __len__(), which returns the total number of samples.
-  __getitem__(index), which returns the data (and label) at the given index.

 DataLoader Class

 The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading for you.

  DataLoader Control Flow:

 - At the start of each epoch, the DataLoader shuffles the indices if shuffle=True (using a sampler).

 - It divides the indices into chunks of batch_size.
 - For each index in the chunk, the data sample is fetched from the Dataset object.

 - The samples are then collected and combined into a batch (using collate_fn).

 - The batch is returned to the main training loop.
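The control flow above can be sketched by hand. This is only an illustration of the steps, not how DataLoader is actually implemented: ToyDataset and iterate_batches are hypothetical names introduced here, and default_collate is used as a stand-in for the default collate_fn.

```python
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples with 3 features each."""

    def __init__(self):
        self.features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
        self.labels = torch.arange(10, dtype=torch.float32)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


def iterate_batches(dataset, batch_size, shuffle=True):
    """Yield batches for one epoch, following the steps listed above."""
    # 1. Build the list of indices and shuffle it if requested.
    if shuffle:
        indices = torch.randperm(len(dataset)).tolist()
    else:
        indices = list(range(len(dataset)))
    # 2. Split the indices into chunks of batch_size.
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        # 3. Fetch each sample of the chunk from the dataset.
        samples = [dataset[i] for i in chunk]
        # 4. Combine the samples into one batch and hand it to the caller.
        yield default_collate(samples)


batch_shapes = [
    (tuple(x.shape), tuple(y.shape))
    for x, y in iterate_batches(ToyDataset(), batch_size=4)
]
print(batch_shapes)  # two batches of 4 samples, then one batch of 2
```

With 10 samples and batch_size=4, the last batch holds only the 2 remaining samples, which is also what DataLoader does when drop_last is left at its default of False.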

In [None]:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

- Importing a dataset.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [None]:
df.shape

(569, 33)

- Dropping the columns we do not need in this dataset: id and Unnamed: 32.

In [None]:
df.drop(columns=['id', 'Unnamed: 32'], inplace=True)

In [None]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


- Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.2)

- Scaling

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
X_train

array([[-0.47097669, -0.14352024, -0.44726309, ..., -0.19035626,
         0.20495639,  0.21512728],
       [ 1.56876912,  0.58775255,  1.5515305 , ...,  1.71522575,
         1.91504063, -0.25370001],
       [-1.28121691, -0.56139041, -1.24375019, ..., -0.50231848,
         0.15187909,  0.8208837 ],
       ...,
       [-2.02384282, -1.37101386, -1.9747165 , ..., -1.75016734,
         0.06894581,  0.5712557 ],
       [ 2.22793802,  0.63998632,  2.25997403, ...,  1.98723221,
        -0.24785932,  0.11595227],
       [-0.80197566, -0.240865  , -0.74046862, ...,  0.12160596,
         0.69260408,  2.5930734 ]])

In [None]:
y_train

Unnamed: 0,diagnosis
204,B
343,M
341,B
254,M
146,M
...,...
500,B
329,M
101,B
369,M


- Label Encoding

In [None]:
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

In [None]:
y_train

array([0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,

- NumPy arrays to PyTorch tensors.

In [16]:
X_train_tensor = torch.from_numpy(X_train.astype(np.float32))
X_test_tensor = torch.from_numpy(X_test.astype(np.float32))
y_train_tensor = torch.from_numpy(y_train.astype(np.float32))
y_test_tensor = torch.from_numpy(y_test.astype(np.float32))

In [17]:
X_train_tensor

tensor([[-0.4710, -0.1435, -0.4473,  ..., -0.1904,  0.2050,  0.2151],
        [ 1.5688,  0.5878,  1.5515,  ...,  1.7152,  1.9150, -0.2537],
        [-1.2812, -0.5614, -1.2438,  ..., -0.5023,  0.1519,  0.8209],
        ...,
        [-2.0238, -1.3710, -1.9747,  ..., -1.7502,  0.0689,  0.5713],
        [ 2.2279,  0.6400,  2.2600,  ...,  1.9872, -0.2479,  0.1160],
        [-0.8020, -0.2409, -0.7405,  ...,  0.1216,  0.6926,  2.5931]])

In [18]:
X_train_tensor.shape

torch.Size([455, 30])

In [19]:
y_train_tensor.shape

torch.Size([455])

- Dataset and DataLoader

In [21]:
from torch.utils.data import Dataset, DataLoader

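class CustomDataset(Dataset):
  """Wraps feature and label tensors so they can be indexed sample by sample."""

  def __init__(self, features, labels):
    # keep references to the tensors; the data is already in memory
    self.features = features
    self.labels = labels

  def __len__(self):
    # total number of samples
    return len(self.features)

  def __getitem__(self, idx):
    # one (features, label) pair
    return self.features[idx], self.labels[idx]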

In [22]:
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)

In [23]:
train_dataset[10]

(tensor([ 0.3070,  2.7103,  0.4663,  0.1718,  0.5995,  2.0300,  2.0811,  1.1781,
          1.1568,  1.2468, -0.5180, -0.0225, -0.2545, -0.3811, -0.7946,  1.2940,
          1.3030,  0.6642,  0.0822,  0.8230,  0.2606,  2.9018,  0.6340,  0.0612,
          0.4251,  3.5181,  4.3128,  1.8704,  1.9880,  3.2242]),
 tensor(1.))

In [24]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)  # shuffling is not needed for evaluation
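As a quick check of what batch_size=32 means for 455 training samples, the sketch below counts the batches in one epoch. It uses random stand-in tensors and TensorDataset instead of the real data and CustomDataset, so only the shapes are meaningful.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data with the same shape as the training tensors (455 x 30).
features = torch.randn(455, 30)
labels = torch.randint(0, 2, (455,)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

num_batches = len(loader)
batch_sizes = [len(batch_labels) for _, batch_labels in loader]

print(num_batches)      # 15, i.e. ceil(455 / 32)
print(batch_sizes[-1])  # 7, i.e. 455 - 14 * 32
```

Each epoch therefore performs 15 parameter updates instead of one, and the final batch is smaller than the rest.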

## Defining the model

In [25]:
class MySimpleNN(nn.Module):

  def __init__(self, num_features):

    super().__init__()
    self.linear = nn.Linear(num_features, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):
    out = self.linear(features)
    out = self.sigmoid(out)
    return out


- Hyperparameters

In [31]:
learning_rate = 0.1
epochs = 25

In [32]:
# create model
model = MySimpleNN(X_train_tensor.shape[1])

# define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# loss function
loss_function = nn.BCELoss() # BinaryCrossEntropy loss function


- Training pipeline

The torch.optim module

- torch.optim is a module in PyTorch that provides a variety of optimization
algorithms used to update the parameters of your model during training.

- It includes common optimizers like Stochastic Gradient Descent (SGD), Adam,
RMSprop, and more.

- It handles weight updates efficiently and supports additional features such as learning rate scheduling (through torch.optim.lr_scheduler) and weight decay (L2 regularization).

- The model.parameters() method in PyTorch returns an iterator over all the
trainable parameters (weights and biases) in a model. These parameters are
instances of torch.nn.Parameter and include:

 - Weights: The weight matrices of layers like nn.Linear, nn.Conv2d, etc.

 - Biases: The bias terms of layers (if they exist).

- The optimizer does not compute gradients itself: loss.backward() stores a gradient in each parameter's .grad attribute, and optimizer.step() uses those gradients to update the parameters.
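To make the role of model.parameters() and the optimizer concrete, here is a minimal sketch on a stand-alone nn.Linear layer with random inputs (both are assumptions for illustration, not the notebook's model or data). It lists the parameters and checks that one step of plain SGD, without momentum or weight decay, matches the hand-computed update p - lr * p.grad.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(30, 1)  # same shape as the single layer used in this notebook
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(param_shapes)  # {'weight': (1, 30), 'bias': (1,)}

lr = 0.1
optimizer = torch.optim.SGD(layer.parameters(), lr=lr)

# Random stand-in batch: 8 samples, 30 features, binary labels.
x = torch.randn(8, 30)
y = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCELoss()(torch.sigmoid(layer(x)), y)

optimizer.zero_grad()
loss.backward()  # autograd fills p.grad for every parameter

# Update computed by hand, before the optimizer changes the parameters.
expected = [p.detach() - lr * p.grad for p in layer.parameters()]

optimizer.step()  # the optimizer applies the update in place

matches = all(
    torch.allclose(p.detach(), e) for p, e in zip(layer.parameters(), expected)
)
print(matches)
```

Other optimizers such as Adam use the same zero_grad / backward / step pattern but apply a different update rule.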

In [39]:
# training loop: one pass over all batches per epoch
for epoch in range(epochs):

  for batch_features, batch_labels in train_loader:

    # forward pass
    y_pred = model(batch_features)

    # calculate the loss (labels reshaped to (batch_size, 1) to match y_pred)
    loss = loss_function(y_pred, batch_labels.view(-1,1))

    # clear gradients left over from the previous batch
    optimizer.zero_grad()

    # backward pass
    loss.backward()

    # update parameters
    optimizer.step()

  # print the loss of the last batch in each epoch
  print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')

Epoch: 1, Loss: 0.0490751676261425
Epoch: 2, Loss: 0.000980081269517541
Epoch: 3, Loss: 0.3198051452636719
Epoch: 4, Loss: 0.002747294260188937
Epoch: 5, Loss: 0.017402049154043198
Epoch: 6, Loss: 0.041321735829114914
Epoch: 7, Loss: 0.08950668573379517
Epoch: 8, Loss: 0.00168566161300987
Epoch: 9, Loss: 0.06922593712806702
Epoch: 10, Loss: 0.033561889082193375
Epoch: 11, Loss: 0.0027229669503867626
Epoch: 12, Loss: 0.03166811913251877
Epoch: 13, Loss: 0.011800956912338734
Epoch: 14, Loss: 0.011633331887423992
Epoch: 15, Loss: 0.029750389978289604
Epoch: 16, Loss: 0.019169727340340614
Epoch: 17, Loss: 0.1304090917110443
Epoch: 18, Loss: 0.017415611073374748
Epoch: 19, Loss: 0.05330490320920944
Epoch: 20, Loss: 0.03300324082374573
Epoch: 21, Loss: 0.000652570161037147
Epoch: 22, Loss: 0.046250518411397934
Epoch: 23, Loss: 0.007692183367908001
Epoch: 24, Loss: 0.019207539036870003
Epoch: 25, Loss: 0.008600703440606594
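The values printed above are the loss of the last mini-batch in each epoch, which is why they jump around from epoch to epoch. A smoother signal is the average loss over all samples in the epoch. The sketch below shows one way to track it; it uses synthetic data and a small nn.Sequential model as stand-ins, so the numbers it prints are not comparable to the ones above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: the label depends only on the sign of feature 0.
features = torch.randn(455, 30)
labels = (features[:, 0] > 0).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(30, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_function = nn.BCELoss()

epoch_losses = []
for epoch in range(5):
    total_loss = 0.0
    for batch_features, batch_labels in loader:
        y_pred = model(batch_features)
        loss = loss_function(y_pred, batch_labels.view(-1, 1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # BCELoss returns the batch mean, so weight it by the batch size
        # to keep the smaller final batch from being over-counted.
        total_loss += loss.item() * len(batch_labels)

    epoch_losses.append(total_loss / len(loader.dataset))
    print(f'Epoch: {epoch + 1}, Average loss: {epoch_losses[-1]:.4f}')
```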


In [40]:
model.linear.weight

Parameter containing:
tensor([[ 0.7603,  0.8802,  0.5371,  0.7461,  0.0577, -0.0911,  0.6539,  0.7236,
          0.2679, -0.6108,  0.9420, -0.3036,  0.6235,  0.8540,  0.1732, -0.6365,
         -0.2303,  0.3228, -0.1920, -0.3791,  0.9201,  1.0049,  0.7875,  0.8248,
          0.7822,  0.4220,  0.6870,  0.8477,  0.7319,  0.3048]],
       requires_grad=True)

In [41]:
model.linear.bias

Parameter containing:
tensor([-0.5143], requires_grad=True)

### Evaluation

In [42]:
# Model evaluation using test_loader
model.eval()  # Set the model to evaluation mode
accuracy_list = []

with torch.no_grad():
    for batch_features, batch_labels in test_loader:
        # Forward pass
        y_pred = model(batch_features)
        y_pred = (y_pred > 0.8).float()  # Convert probabilities to binary predictions

        # Calculate accuracy for the current batch
        batch_accuracy = (y_pred.view(-1) == batch_labels).float().mean().item()
        accuracy_list.append(batch_accuracy)

# Average the per-batch accuracies (an approximation: the smaller last batch counts as much as a full one)
overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Accuracy: {overall_accuracy:.4f}')


Accuracy: 0.9427
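Averaging per-batch accuracies gives the smaller final batch the same weight as a full batch, so the result can differ slightly from the true fraction of correct predictions. A sketch of the sample-weighted alternative follows. It uses random stand-in probabilities instead of real model outputs (so the accuracy value itself is meaningless), 114 samples to mirror the size of the test split, and the same 0.8 threshold as the cell above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins: random "probabilities" and labels for 114 samples.
probs = torch.rand(114, 1)
labels = torch.randint(0, 2, (114,)).float()
loader = DataLoader(TensorDataset(probs, labels), batch_size=32)

correct = 0
total = 0
batch_accuracies = []
for batch_probs, batch_labels in loader:
    preds = (batch_probs > 0.8).float().view(-1)
    hits = preds == batch_labels
    correct += hits.sum().item()      # number of correct predictions
    total += len(batch_labels)        # number of samples seen so far
    batch_accuracies.append(hits.float().mean().item())

overall_accuracy = correct / total
mean_of_batches = sum(batch_accuracies) / len(batch_accuracies)
print(f'Sample-weighted accuracy: {overall_accuracy:.4f}')
print(f'Mean of batch accuracies: {mean_of_batches:.4f}')
```

The two numbers coincide only when every batch has the same size.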
