# Components of Custom Data Loader in PyTorch

From previous tutorial (linear-regression.ipynb)

The dataset has been loaded from kagglehub 

What is Data Loader exactly ? 

Previously from kaggle the insurance.csv has been loaded on cache, and it has been used in the custom NN...
Pre processing and reformulating the data form (cathegorical data to numerical form)

Dataset has been scaled (normalized) based on data properties (mean, std dev)



When Training the model, with model.train() we are passing the whole tensor in a single iteration thousands of samples all at once !

1 epoch: all the row data ! If e.g.

x_train_tensor = 1000000 rows (dataset) 

In a single iteratiion all this is sent on memory ! e.g. 10GB of GPU VRAM can be allocated, causing Out Of Memory OOM 


With slow CPU usage, but enogh RAM to store the data: 

gradient descent and optim occurs only after all rows are computed, single epoch of gradient update after 1M iterations...


If we are teaching human: A book of 1k pages, student learn 1k and after that: I don't understand...Maybe issue on the go! Only when all completed you say if you undretsand. We want a feedback during learning, e.g. every 10 pages -> feedback allow better learning and faster update 


1k pages with 10 pages per step, 100 iterations per epoch.

100 feedback (iterations) on 1000 data is 1 epoch.

If overall 100 epochs, 100*100 = 10000 iterations . Update weights more frequently and not running out of memory, passing chanks of data using data loader


In [3]:
# You load a chunk of data with a custom data loader !
# It is possible by creating it and using for training in batches to optimize data utilization and updates 

# Necessary components: 
# 1. Dataset: torch.utils.data.DataSet
# rsponsible for loading and processing data, for example from a cvs there are all the method of preprocessing 
# In Dataset you define how data is stored, acessed, transformed, how to retrieve samples
# 
# 2. DataLoader: torch.utils.data.DataLoader
# DataLoader is built in in pytorch, after defining the custom dataset, 
# DataLoader help to wraps the dataset in iterable object that load data in batches (specify amount of rows)
# Handles batching and parallel processing with multiple workers

# Create custom Dataset in pytorch: require the init() and so on...
# init() initialize the dataset, loads data, apply transformation if needed (preprocessing)

# len() : to return the length of the dataset as total number of rows (samples) in the dataset 
# Required because pytorch read it internally when loading in pytorch from there  
# (Initial index to final index knowing the len..) DataLoader will wrap the Dataset
# DataLoader will know the len using this method 

# getitem(): defines how to retrieve a single data sample when an index is provided
# used by DataLoader to get the actual data from that class (load in batches if required)

import kagglehub

# Download latest version of the dataset with kaggle
path = kagglehub.dataset_download("mirichoi0218/insurance")

print(f"Path of dataset files : {path}") # Stored in the cache


import pandas as pd
import os

df = pd.read_csv(os.path.join(path, 'insurance.csv'))


# I want to implement the code for loading in batches 
# The loader should be used for the training set, not on the validation level 

Path of dataset files : /home/ale/.cache/kagglehub/datasets/mirichoi0218/insurance/versions/1


In [4]:
import torch 
from torch.utils.data import Dataset, DataLoader

# Define components for data preprocessing for model training 
import torch
import torch.nn as nn
import torch.optim as optim

# Using sklearn for data pre-processing
from sklearn.preprocessing import LabelEncoder, StandardScaler 
# we need LabelEncoder to transform the dataset
from sklearn.model_selection import train_test_split # help to divide the dataset

# Split dataset 
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42) # 80% as train, 20% as test

label_encoder = {}

for col in ['sex', 'smoker', 'region']:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col])
    test_df[col] = le.transform(test_df[col]) # Transform test set 
    label_encoder[col] = le

print(label_encoder)

# create features and target columns 
# X and Y

X_train = train_df.drop(columns=['charges']) # all except charges
y_train = train_df['charges'] # take just charges 

X_test = test_df.drop(columns=['charges']) # all except charges
y_test = test_df['charges'] # take just charges 

scaler = StandardScaler()

# Normalize on all features (NOT TARGET)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1) # Flatten the output
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)


# NN model 

class SimpleNNRegressionModel(nn.Module):
    def __init__(self, input_dim):

        super(SimpleNNRegressionModel, self).__init__()

        # Very simple model, Sequential is enough
        self.network = nn.Sequential(
            nn.Linear(input_dim, 64), # 64 neurons of first hidden layer
            nn.ReLU(),
            nn.Linear(64, 128), # map to 128 neurons on next hidden layer 
            nn.ReLU(),
            nn.Linear(128, 1) # here the final layer of 128 is mapped to regrssion output of dimension 1
        )

    def forward(self, x):
        return self.network(x) # produce the output forward passing x on the network
    
# Initialize model from the custom NN 
input_dim = X_train_tensor.shape[1] # we are interested on feature lenght of each data
model = SimpleNNRegressionModel(input_dim=input_dim)


# Initialize other nn components, Loss and Optimizer

criterion = nn.MSELoss() # In regression MSE loss is enough
optimizer = optim.Adam(model.parameters(), lr=0.01) # Adam optimizer




{'sex': LabelEncoder(), 'smoker': LabelEncoder(), 'region': LabelEncoder()}


In [5]:
# PyTorch custom dataset handling init, len, getitem
# in principle the csv to numpy conversion and preprocessing happen on the init method 

class InsuranceDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
        # It is possible to start from reading the data giving the path of the csv
        # Here pre processing is already been done

    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        # How to retrieve a SINGLE DATA SAMPLE
        # After label encoding, we start from the np array
        features =  torch.tensor(self.X[idx], dtype=torch.float32) # return a row of the data
        target =  torch.tensor(self.y.values[idx], dtype=torch.float32) # return a row of the data

        return features, target



In [6]:
# DataLoader will wrap Insurance 
dataset = InsuranceDataset(X_train, y_train)

In [None]:
# once created dataset, DataLoader is used to wrap it!
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
# shuffle tell you that data will be evrytime in random order, with 32 data per retrieval

# now dataloader is an iterable over the dataset  

In [None]:
for batch_idx, (features, targets) in enumerate(dataloader):
    print(f"Batch {batch_idx + 1}")
    print(f"Features :  {features.shape}")
    print(f"Targets:  {targets.shape}")

# batch size refer to the way of exploring the dataset in each epoch, 
# defined the batch size the number of iterations are affected by data_len/batch_size
# The last iterations will do the final data



Batch 1
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 2
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 3
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 4
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 5
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 6
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 7
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 8
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 9
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 10
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 11
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 12
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 13
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 14
Features :  torch.Size([32, 6])
Targets:  torch.Size([32])
Batch 15
Features :  torch.Size([32, 6])
Targets:  torch.

In [10]:
# Model training in batches 
# Create training loop 
epochs = 1000
for epoch in range(epochs):
    model.train()           # train method tell pytorch to store history of model training
    # while to test the model, model.eval() is to use in feed forward

    # I want to perform the below code for all batches 
    for batch_idx, (batch_X, batch_y) in enumerate(dataloader):

        print(f"Current Batch: {batch_idx + 1}")

        optimizer.zero_grad()   # clear gradients

        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)
        loss.backward() #  compute gradients

        optimizer.step() # update weights using gradients computed 
        print(f"Epoch: [{epoch+1}/{epochs}], Batch {batch_idx + 1}, Loss : {loss.item():.4f}")

    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss : {loss.item():.4f}")

Current Batch: 1
Epoch: [1/1000], Batch 1, Loss : 67250392.0000
Current Batch: 2
Epoch: [1/1000], Batch 2, Loss : 167153376.0000
Current Batch: 3
Epoch: [1/1000], Batch 3, Loss : 107152384.0000
Current Batch: 4
Epoch: [1/1000], Batch 4, Loss : 128527216.0000
Current Batch: 5
Epoch: [1/1000], Batch 5, Loss : 103536176.0000
Current Batch: 6
Epoch: [1/1000], Batch 6, Loss : 202315024.0000
Current Batch: 7
Epoch: [1/1000], Batch 7, Loss : 234889232.0000
Current Batch: 8


  return F.mse_loss(input, target, reduction=self.reduction)


Epoch: [1/1000], Batch 8, Loss : 149055008.0000
Current Batch: 9
Epoch: [1/1000], Batch 9, Loss : 134792432.0000
Current Batch: 10
Epoch: [1/1000], Batch 10, Loss : 93969760.0000
Current Batch: 11
Epoch: [1/1000], Batch 11, Loss : 124896464.0000
Current Batch: 12
Epoch: [1/1000], Batch 12, Loss : 66941864.0000
Current Batch: 13
Epoch: [1/1000], Batch 13, Loss : 76407256.0000
Current Batch: 14
Epoch: [1/1000], Batch 14, Loss : 141604928.0000
Current Batch: 15
Epoch: [1/1000], Batch 15, Loss : 66960080.0000
Current Batch: 16
Epoch: [1/1000], Batch 16, Loss : 245238880.0000
Current Batch: 17
Epoch: [1/1000], Batch 17, Loss : 86791520.0000
Current Batch: 18
Epoch: [1/1000], Batch 18, Loss : 151494192.0000
Current Batch: 19
Epoch: [1/1000], Batch 19, Loss : 215265248.0000
Current Batch: 20
Epoch: [1/1000], Batch 20, Loss : 156591232.0000
Current Batch: 21
Epoch: [1/1000], Batch 21, Loss : 203237184.0000
Current Batch: 22
Epoch: [1/1000], Batch 22, Loss : 192863328.0000
Current Batch: 23
Epo

  return F.mse_loss(input, target, reduction=self.reduction)


Current Batch: 1
Epoch: [2/1000], Batch 1, Loss : 115349320.0000
Current Batch: 2
Epoch: [2/1000], Batch 2, Loss : 73723200.0000
Current Batch: 3
Epoch: [2/1000], Batch 3, Loss : 195563344.0000
Current Batch: 4
Epoch: [2/1000], Batch 4, Loss : 207397136.0000
Current Batch: 5
Epoch: [2/1000], Batch 5, Loss : 195648352.0000
Current Batch: 6
Epoch: [2/1000], Batch 6, Loss : 164962272.0000
Current Batch: 7
Epoch: [2/1000], Batch 7, Loss : 116069072.0000
Current Batch: 8
Epoch: [2/1000], Batch 8, Loss : 118908664.0000
Current Batch: 9
Epoch: [2/1000], Batch 9, Loss : 89995408.0000
Current Batch: 10
Epoch: [2/1000], Batch 10, Loss : 151873360.0000
Current Batch: 11
Epoch: [2/1000], Batch 11, Loss : 133877880.0000
Current Batch: 12
Epoch: [2/1000], Batch 12, Loss : 91721664.0000
Current Batch: 13
Epoch: [2/1000], Batch 13, Loss : 76530848.0000
Current Batch: 14
Epoch: [2/1000], Batch 14, Loss : 134786752.0000
Current Batch: 15
Epoch: [2/1000], Batch 15, Loss : 293419008.0000
Current Batch: 16

KeyboardInterrupt: 

In [11]:
# What about the num_workers ? 
# it is possible to change it values e.e 4, allowing parallel processing 
# 4 threads work simultaneusly to load data
# 4 separate processes to load data in parallel. Already in load state, already available data !
# Reduced time to load time 


# With a smart DataLoading technique: 
# - Stabilize training, more frequent gradient update 
# - Reduce memory usage 
# - Generalize better by shuffling, which prevent overfitting 

# - MIN BATCH TRAINING TECHNIQUE, introduce randomness by shuffling 