üì¶ PyTorch Dataset Class

The Dataset class is essentially a blueprint.
When you create a custom Dataset, you define how data is loaded and returned.

A custom Dataset defines:
1. `__init__`(self)

Used to load file paths, data, labels, transforms, etc.

Runs once when the dataset object is created.

2. `__len__`(self)

Returns the total number of samples in the dataset.

Used by DataLoader to know dataset size.

3. `__getitem__`(self, index)

Returns one data sample (and its label) at a given index.

Called every time a sample is needed.

üìå Key idea:
Dataset handles single sample logic, not batches.

üöö PyTorch DataLoader Class

The DataLoader wraps a Dataset and automatically handles:

Batching

Shuffling

Parallel data loading

Combining samples into batches

You don‚Äôt manually write batch logic.

üîÅ DataLoader Control Flow (What happens internally)

At the start of each epoch:

Shuffle indices (if shuffle=True)

Done using a sampler

Divide indices into chunks of size batch_size

For each chunk:

Call Dataset.__getitem__() for every index

Fetch individual samples

Combine samples into a batch

Done using collate_fn (default or custom)

Return the batch to the training loop

Dataset Class

In [2]:
import torch
from sklearn.datasets import make_classification
from torch.utils.data import Dataset, DataLoader

In [9]:
x, y= make_classification(n_samples=10, n_features=2, n_informative=2, n_redundant=0,n_classes=2, random_state=42)
print(x)
print(y)

[[ 1.06833894 -0.97007347]
 [-1.14021544 -0.83879234]
 [-2.8953973   1.97686236]
 [-0.72063436 -0.96059253]
 [-1.96287438 -0.99225135]
 [-0.9382051  -0.54304815]
 [ 1.72725924 -1.18582677]
 [ 1.77736657  1.51157598]
 [ 1.89969252  0.83444483]
 [-0.58723065 -1.97171753]]
[1 0 0 0 0 1 1 1 1 0]


In [10]:
x.shape

(10, 2)

In [11]:
y.shape

(10,)

In [49]:
x=torch.tensor(x, dtype=torch.float32)
y=torch.tensor(y, dtype=torch.float32)

  x=torch.tensor(x, dtype=torch.float32)
  y=torch.tensor(y, dtype=torch.float32)


In [51]:
class CustomDataset(Dataset):
  def __init__(self,features,labels):
    self.features=features
    self.labels=labels
  def __len__(self):
    return len(self.features)
  def __getitem__(self,idx):
    return self.features[idx],self.labels[idx]

In [52]:
dataset=CustomDataset(x,y)

In [53]:
len(dataset)

10

In [54]:
dataset[0]

(tensor([ 1.0683, -0.9701]), tensor(1.))

DataLoader Class

In [55]:
dataLoader=DataLoader(dataset,batch_size=2,shuffle=True)

In [56]:
for batch_features, batch_labels in dataLoader:
  print(batch_features)
  print(batch_labels)
  print("_"*50)



tensor([[ 1.7774,  1.5116],
        [-0.9382, -0.5430]])
tensor([1., 1.])
__________________________________________________
tensor([[ 1.0683, -0.9701],
        [-0.7206, -0.9606]])
tensor([1., 0.])
__________________________________________________
tensor([[-1.9629, -0.9923],
        [-0.5872, -1.9717]])
tensor([0., 0.])
__________________________________________________
tensor([[ 1.8997,  0.8344],
        [-2.8954,  1.9769]])
tensor([1., 0.])
__________________________________________________
tensor([[-1.1402, -0.8388],
        [ 1.7273, -1.1858]])
tensor([0., 1.])
__________________________________________________


How transformations are used in a Dataset class.

Rule:
üëâ Apply transformations inside __getitem__(), not in __init__().

Why?
Because transformations are applied per sample, when the data is fetched.

In [57]:
def __getitem__(self, index):
  x = self.data[index]
  y = self.labels[index]

  if self.transform:
    x = self.transform(x)
    return x, y



Parallelization : Refer Notes on Dataset and DataLoader

# *Breast Cancer dataset model updated*

In [58]:
  import numpy as np
  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.preprocessing import LabelEncoder



In [59]:
df=pd.read_csv("breast-cancer.csv")

In [60]:
df.head(5)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [61]:
df.drop(columns=['id'],inplace=True)
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [62]:
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:,1:],df.iloc[:,0],test_size=0.2)

In [63]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [64]:
encoder=LabelEncoder()
y_train=encoder.fit_transform(y_train)
y_test=encoder.transform(y_test)

In [79]:
X_train_tensor = torch.from_numpy(x_train).float()
X_test_tensor = torch.from_numpy(x_test).float()

y_train_tensor = torch.from_numpy(y_train).float()
y_test_tensor = torch.from_numpy(y_test).float()


In [80]:
X_train_tensor

tensor([[-0.3686,  0.6946, -0.2771,  ...,  1.2390,  0.7679,  2.1665],
        [-0.3827, -0.6828, -0.4367,  ..., -0.8435, -0.8278, -1.0902],
        [ 1.4219,  1.7017,  1.4033,  ...,  1.0816,  0.5357,  0.7004],
        ...,
        [-0.5581, -0.3195, -0.5632,  ..., -0.8723, -0.7920, -0.3767],
        [-0.6175, -1.0436, -0.6067,  ..., -1.0108,  0.3269, -0.4519],
        [ 0.0812,  0.0908,  0.1014,  ...,  0.6212, -0.3058,  0.5019]])

In [81]:
X_test_tensor

tensor([[-0.8947, -0.5153, -0.8324,  ...,  0.6062, -0.5832,  0.5987],
        [ 0.3838,  0.1380,  0.4264,  ...,  1.4549,  0.4250,  0.9547],
        [ 0.1038, -2.0035,  0.0932,  ...,  0.1054, -0.0393, -0.2268],
        ...,
        [ 0.2594, -0.0743,  0.2155,  ..., -0.0461, -0.9852, -0.7959],
        [ 0.5988,  0.0295,  0.7302,  ...,  0.9361,  0.6292,  0.3845],
        [-1.1549, -0.4375, -1.1329,  ..., -0.7510, -0.0316, -0.4118]])

In [82]:
X_train_tensor.shape

torch.Size([455, 30])

In [83]:
y_train_tensor.shape

torch.Size([455])

In [84]:
class CustomDataset(Dataset):
  def __init__(self,features,labels):
    self.features=features
    self.labels=labels
  def __len__(self):
    return len(self.features)
  def __getitem__(self,idx):
    return self.features[idx],self.labels[idx]

In [85]:
train_dataset=CustomDataset(X_train_tensor,y_train_tensor)
test_dataset=CustomDataset(X_test_tensor,y_test_tensor)

In [86]:
train_dataset[0]

(tensor([-0.3686,  0.6946, -0.2771, -0.4288,  0.8896,  1.4097,  0.9918,  0.4946,
          1.0692,  1.3644, -0.3574, -0.2529, -0.2972, -0.3294, -0.0208,  0.7024,
          0.4539,  0.4827, -0.4510,  0.6413, -0.2243,  0.7236, -0.0611, -0.3077,
          2.0279,  1.7486,  1.6987,  1.2390,  0.7679,  2.1665]),
 tensor(1.))

In [87]:
train_loader=DataLoader(train_dataset,batch_size=32,shuffle=True)
test_loader=DataLoader(test_dataset,batch_size=32,shuffle=False)

Define Model

In [88]:
import torch.nn as nn
class MySimpleNN(nn.Module):
  def __init__(self,num_features):
    super().__init__()
    self.linear = nn.Linear(num_features,1)
    self.sigmoid = nn.Sigmoid()

  def forward(self,features):
    out =self.linear(features)
    out=self.sigmoid(out)
    return out

In [89]:
learning_rate=0.1
epoches=15

In [90]:
model=MySimpleNN(x_train.shape[1])
optimizer=torch.optim.SGD(model.parameters(),lr=learning_rate)
loss_function=torch.nn.BCELoss()

In [91]:
# create model
model = MySimpleNN(X_train_tensor.shape[1])

# define loop
for epoch in range(epoches):
  for batch_features, batch_labels in train_loader:

  # forward pass
    y_pred = model(batch_features)

  # # loss calculate
    loss = loss_function(y_pred, batch_labels.view(-1, 1))
  # zero grad
    optimizer.zero_grad()
  # # backward pass

    loss.backward()

    optimizer.step()
    print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')

Epoch: 1, Loss: 0.9007735848426819
Epoch: 1, Loss: 0.8675692677497864
Epoch: 1, Loss: 0.8727073669433594
Epoch: 1, Loss: 0.9174148440361023
Epoch: 1, Loss: 0.9840832948684692
Epoch: 1, Loss: 0.9062014818191528
Epoch: 1, Loss: 0.8344483971595764
Epoch: 1, Loss: 0.8827698230743408
Epoch: 1, Loss: 0.9063546657562256
Epoch: 1, Loss: 0.8747988343238831
Epoch: 1, Loss: 0.9038629531860352
Epoch: 1, Loss: 0.9512932300567627
Epoch: 1, Loss: 0.8137195706367493
Epoch: 1, Loss: 0.8450412154197693
Epoch: 1, Loss: 1.0241143703460693
Epoch: 2, Loss: 0.9007186889648438
Epoch: 2, Loss: 0.9057799577713013
Epoch: 2, Loss: 0.8551068305969238
Epoch: 2, Loss: 0.8679406642913818
Epoch: 2, Loss: 0.9274201393127441
Epoch: 2, Loss: 0.9015293717384338
Epoch: 2, Loss: 0.9496473073959351
Epoch: 2, Loss: 0.8664764165878296
Epoch: 2, Loss: 0.9468897581100464
Epoch: 2, Loss: 0.9087004661560059
Epoch: 2, Loss: 0.9273947477340698
Epoch: 2, Loss: 0.8839218020439148
Epoch: 2, Loss: 0.8570414185523987
Epoch: 2, Loss: 0.78

In [92]:
# Model evaluation using test_loader
model.eval()  # Set the model to evaluation mode
accuracy_list = []

with torch.no_grad():
    for batch_features, batch_labels in test_loader:
        # Forward pass
        y_pred = model(batch_features)
        y_pred = (y_pred > 0.8).float()  # Convert probabilities to binary predictions

        # Calculate accuracy for the current batch
        batch_accuracy = (y_pred.view(-1) == batch_labels).float().mean().item()
        accuracy_list.append(batch_accuracy)

# Calculate overall accuracy
overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Accuracy: {overall_accuracy:.4f}')


Accuracy: 0.6554
