<div style="text-align:left;">
  <a href="https://code213.tech/" target="_blank">
    <img src="../code213.PNG" alt="code213">
  </a>
  <p><em>prepared by Latreche Sara</em></p>
</div>

# Data Handling in PyTorch: Dataset & DataLoader

Efficiently handling data is a **critical step** in training deep learning models.  

PyTorch provides two key components:  

1. **Dataset**  
   - Encapsulates the data and labels.  
   - Must implement:
     - `__len__()` → returns the number of samples  
     - `__getitem__()` → returns a single sample  

2. **DataLoader**  
   - Wraps a `Dataset` to provide:
     - Mini-batches  
     - Shuffling  
     - Parallel loading using multiple workers  



In this notebook, we will cover:  
- Creating custom datasets with `torch.utils.data.Dataset`  
- Using `DataLoader` to iterate over batches  
- Handling shuffling, batch size, and parallelism  


## Table of Contents

- [1 - PyTorch Dataset](#1)
- [2 - Custom Dataset](#2)
- [3 - DataLoader](#3)
- [4 - Iterating through DataLoader](#4)
- [5 - Transformations & Normalization](#5)
- [6 - Practice Exercises](#6)


<a name='1'></a>
## 1 - PyTorch Dataset

A `Dataset` in PyTorch is an **abstract class** representing a collection of data samples and their corresponding labels.  

Key points:  
- Every map-style dataset (the kind used throughout this notebook) implements two methods:
  1. `__len__()` → returns the number of samples in the dataset  
  2. `__getitem__(index)` → returns a sample and its label at the given index  
- Allows **flexible and efficient data access**  
- Can be used with `DataLoader` to create batches and shuffle data


In [1]:
import torch
from torch.utils.data import Dataset

# Example: simple dataset
class MyDataset(Dataset):
    def __init__(self):
        # Sample data: 5 samples, 2 features
        self.x = torch.randn(5, 2)
        self.y = torch.tensor([0, 1, 0, 1, 0])
    
    def __len__(self):
        return len(self.x)
    
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# Create dataset instance
dataset = MyDataset()

# Accessing a sample
sample_x, sample_y = dataset[0]
print("First sample:", sample_x, sample_y)
print("Dataset length:", len(dataset))


First sample: tensor([-0.0363,  0.9612]) tensor(0)
Dataset length: 5
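
To make the link between `__getitem__` and batching more concrete, here is a minimal sketch of what happens conceptually when a batch is built: individual samples are fetched by index and then stacked. The class `TinyDataset` and the chosen indices are illustrative assumptions, and `default_collate` is assumed to be importable from `torch.utils.data` (available in recent PyTorch releases).

```python
import torch
from torch.utils.data import Dataset, default_collate


class TinyDataset(Dataset):
    """Illustrative dataset: 5 samples with 2 features each."""

    def __init__(self):
        self.x = torch.arange(10, dtype=torch.float32).reshape(5, 2)
        self.y = torch.tensor([0, 1, 0, 1, 0])

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


tiny = TinyDataset()

# Fetch three samples one by one, exactly as indexing the dataset would
indices = [0, 2, 3]
samples = [tiny[i] for i in indices]

# Stack the (x, y) pairs into one batch of inputs and one batch of labels
manual_x, manual_y = default_collate(samples)
print(manual_x.shape, manual_y)
```

A `DataLoader` automates these two steps (choosing indices and collating samples), which is why a dataset only needs to know how to return one sample at a time.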


<a name='2'></a>
## 2 - Custom Dataset

Creating a **custom dataset** allows you to handle:  
- Your own data files (CSV, images, etc.)  
- Preprocessing and transformations  
- Lazy loading (loading samples on the fly)  

Key steps to create a custom dataset:  
1. Inherit from `torch.utils.data.Dataset`  
2. Implement `__init__()`, `__len__()`, and `__getitem__()`  
3. Optionally, add **transformations** for preprocessing  

This approach is flexible and works with `DataLoader` to create batches and shuffle data.


In [2]:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data, labels, transform=None):
        self.data = data
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.labels[idx]
        
        # Apply optional transformation
        if self.transform:
            x = self.transform(x)
        
        return x, y

# Example data
data = torch.randn(6, 3)
labels = torch.tensor([0, 1, 0, 1, 0, 1])

# Create dataset
dataset = CustomDataset(data, labels)

# Access a sample
x_sample, y_sample = dataset[1]
print("Sample 1:", x_sample, y_sample)


Sample 1: tensor([ 0.9238, -1.9255,  1.5385]) tensor(1)
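
The lazy-loading idea listed above can be sketched as follows. This is only an illustration under stated assumptions: the class name `LazyFileDataset`, the `.pt` file layout, and the temporary folder are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class LazyFileDataset(Dataset):
    """Hypothetical example: keeps only file paths in memory and reads each sample on demand."""

    def __init__(self, paths, labels):
        self.paths = list(paths)
        self.labels = list(labels)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The file is read only when this index is requested
        x = torch.load(self.paths[idx])
        y = torch.tensor(self.labels[idx])
        return x, y


# Demo setup (illustrative only): write four small tensors to a temporary folder
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(4):
    path = os.path.join(tmp_dir, f"sample_{i}.pt")
    torch.save(torch.full((3,), float(i)), path)
    paths.append(path)

lazy_dataset = LazyFileDataset(paths, labels=[0, 1, 0, 1])
x_lazy, y_lazy = lazy_dataset[2]
print(len(lazy_dataset), x_lazy, y_lazy)
```

Because nothing is loaded in `__init__`, memory use stays small even when the files on disk do not fit in RAM; the cost is a file read on every access.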


<a name='3'></a>
## 3 - DataLoader

`DataLoader` in PyTorch provides an **iterable over a Dataset** with many convenient features:

### Key Features
- **Mini-batches**: Split dataset into smaller batches for efficient training  
- **Shuffling**: Randomly shuffle data at the start of each epoch  
- **Parallel loading**: Use multiple workers to speed up data loading  
- **Custom collate function**: Control how samples are combined into batches  

### Common Parameters
- `dataset`: Dataset object to load from  
- `batch_size`: Number of samples per batch  
- `shuffle`: Whether to reshuffle the data at the start of every epoch (default: `False`)  
- `num_workers`: Number of subprocesses to use for data loading; `0` (the default) loads data in the main process  

DataLoader makes it easy to **iterate over mini-batches** in a training loop.
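
The custom collate function mentioned under Key Features is easiest to understand with samples that cannot simply be stacked. The sketch below pads variable-length sequences to a common length. The function name `pad_collate` and the toy data are assumptions made for illustration; `pad_sequence` comes from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Illustrative data: three sequences of different lengths with one label each
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
seq_labels = [0, 1, 0]
pairs = list(zip(sequences, seq_labels))  # a plain list works as a map-style dataset


def pad_collate(batch):
    """Hypothetical collate function: pad sequences to the longest one in the batch."""
    seqs, labels = zip(*batch)
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    lengths = torch.tensor([len(s) for s in seqs])
    return padded, lengths, torch.tensor(labels)


seq_loader = DataLoader(pairs, batch_size=3, shuffle=False, collate_fn=pad_collate)
padded, lengths, batch_labels = next(iter(seq_loader))
print(padded)
print(lengths, batch_labels)
```

Returning the original lengths alongside the padded tensor is a common design choice, since downstream code often needs to ignore the padding positions.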


In [3]:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Sample dataset
x = torch.randn(10, 2)
y = torch.randint(0, 2, (10,))
dataset = TensorDataset(x, y)

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=3, shuffle=True, num_workers=0)

# Iterate through batches
for batch_idx, (x_batch, y_batch) in enumerate(dataloader):
    print(f"Batch {batch_idx+1}")
    print("x:", x_batch)
    print("y:", y_batch)
    print("---")


Batch 1
x: tensor([[ 0.6115, -1.0577],
        [ 0.9681, -0.5553],
        [-1.1261,  2.0470]])
y: tensor([0, 0, 0])
---
Batch 2
x: tensor([[ 1.4703,  0.9640],
        [ 0.3942,  1.7599],
        [ 0.7142, -1.2718]])
y: tensor([1, 1, 1])
---
Batch 3
x: tensor([[ 0.7942, -0.3868],
        [ 0.5389,  1.4670],
        [ 0.3613,  0.4697]])
y: tensor([0, 1, 1])
---
Batch 4
x: tensor([[-0.3986, -0.3414]])
y: tensor([1])
---
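
Notice that Batch 4 in the output above holds a single sample, because 10 is not divisible by 3. The following sketch shows how the number of batches is determined and how `drop_last=True` discards the incomplete final batch. The variable names are illustrative.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(10, 2)
targets = torch.randint(0, 2, (10,))
toy_dataset = TensorDataset(features, targets)

keep_loader = DataLoader(toy_dataset, batch_size=3, shuffle=True)
drop_loader = DataLoader(toy_dataset, batch_size=3, shuffle=True, drop_last=True)

# len() of a DataLoader is the number of batches, not the number of samples
print(len(keep_loader))  # ceil(10 / 3) = 4
print(len(drop_loader))  # floor(10 / 3) = 3

# With drop_last=True every batch has exactly batch_size samples
print([len(xb) for xb, _ in drop_loader])
```

Dropping the last batch can be useful when a very small batch would produce a noisy gradient or break layers that need more than one sample, at the price of skipping a few samples each epoch.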


<a name='4'></a>
## 4 - Iterating through DataLoader

Once a `DataLoader` is created, you can **iterate over it in a training loop** to process mini-batches.  

### Key Points
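- Each iteration returns one batch of inputs and the matching labels.  
- Each batch goes through a **forward pass, loss computation, and backward pass** during training.  
- With `shuffle=True`, the samples are re-ordered every epoch, so batch composition changes between epochs. This reduces the risk that the model picks up patterns from the sample order and generally helps generalization.  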

Typical pattern:



In [4]:
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim

# Sample dataset
x = torch.randn(8, 2)
y = torch.randint(0, 2, (8,))
dataset = TensorDataset(x, y)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Simple model
model = nn.Linear(2, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Iterate through DataLoader
for epoch in range(2):
    print(f"Epoch {epoch+1}")
    for batch_idx, (x_batch, y_batch) in enumerate(dataloader):
        outputs = model(x_batch)
        loss = loss_fn(outputs, y_batch)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        print(f"Batch {batch_idx+1}, Loss: {loss.item():.4f}")


Epoch 1
Batch 1, Loss: 0.3627
Batch 2, Loss: 0.6184
Batch 3, Loss: 0.8799
Batch 4, Loss: 1.1342
Epoch 2
Batch 1, Loss: 0.8160
Batch 2, Loss: 0.4689
Batch 3, Loss: 0.5183
Batch 4, Loss: 1.1568
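
Because shuffling is random, the losses above will differ from run to run. One way to make the shuffle order repeatable is to pass a seeded `torch.Generator` to the `DataLoader`. The sketch below is an illustration: the helper `epoch_order` and the seed value are assumptions, not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ids = torch.arange(8)
id_dataset = TensorDataset(ids)


def epoch_order(loader):
    """Return the sample ids in the order the loader yields them for one epoch."""
    return torch.cat([batch[0] for batch in loader]).tolist()


# Two loaders whose generators start from the same seed
g1 = torch.Generator().manual_seed(0)
loader_a = DataLoader(id_dataset, batch_size=2, shuffle=True, generator=g1)
g2 = torch.Generator().manual_seed(0)
loader_b = DataLoader(id_dataset, batch_size=2, shuffle=True, generator=g2)

first_a = epoch_order(loader_a)
second_a = epoch_order(loader_a)  # a fresh permutation is drawn for each epoch
first_b = epoch_order(loader_b)

print(first_a)
print(second_a)
print(first_a == first_b)  # same seed, same first-epoch order
```

Every epoch still visits each sample exactly once; only the order changes.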


<a name='5'></a>
## 5 - Transformations & Normalization

Transformations allow you to **preprocess data** before feeding it to a model.  
Common transformations include:

1. **Normalization**  
   - Scale features to a standard range (e.g., 0–1 or mean=0, std=1)  
   - Helps training converge faster

2. **Type conversion**  
   - Convert images to tensors (`torchvision.transforms.ToTensor`)  
   - Convert data to float32 if needed  

3. **Data augmentation** (for images)  
   - Random rotations, flips, crops  
   - Improves model generalization  

PyTorch provides **`torchvision.transforms`** for image datasets.  
For custom datasets, you can define a `transform` function and apply it in `__getitem__()`.


In [6]:
import torch
from torch.utils.data import Dataset, DataLoader

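# Sample transformation: standardize each sample across its own features
# (per-sample, not per-feature; the result is NaN if all features of a sample are equal)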
def normalize(x):
    return (x - x.mean()) / x.std()

class TransformDataset(Dataset):
    def __init__(self, data, labels, transform=None):
        self.data = data
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.labels[idx]
        if self.transform:
            x = self.transform(x)
        return x, y

# Sample data
data = torch.randn(5, 3)
labels = torch.randint(0, 2, (5,))

dataset = TransformDataset(data, labels, transform=normalize)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate through transformed data
for x_batch, y_batch in dataloader:
    print("x:", x_batch)
    print("y:", y_batch)
    print("---")


x: tensor([[ 0.3140, -1.1193,  0.8053],
        [ 0.0439,  0.9773, -1.0212]])
y: tensor([0, 1])
---
x: tensor([[-1.1143,  0.2951,  0.8192],
        [ 0.0331,  0.9830, -1.0161]])
y: tensor([1, 0])
---
x: tensor([[ 0.7484, -1.1357,  0.3874]])
y: tensor([1])
---
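
The `normalize` function above standardizes each sample using that sample's own mean and standard deviation. A common alternative is per-feature standardization with statistics computed once over the training data. The sketch below shows that variant; the class name `Standardize`, the `eps` parameter, and the toy data are assumptions for illustration.

```python
import torch


class Standardize:
    """Hypothetical transform: per-feature standardization with fixed statistics."""

    def __init__(self, mean, std, eps=1e-8):
        self.mean = mean
        self.std = std
        self.eps = eps  # guards against division by zero for constant features

    def __call__(self, x):
        return (x - self.mean) / (self.std + self.eps)


# Toy data: 4 samples, 2 features on very different scales
raw = torch.tensor([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Statistics are computed once, column by column, over the whole data set
stats_transform = Standardize(mean=raw.mean(dim=0), std=raw.std(dim=0))

# Apply the transform sample by sample, as a Dataset would in __getitem__
standardized = torch.stack([stats_transform(row) for row in raw])
print(standardized.mean(dim=0))
print(standardized.std(dim=0))
```

An instance such as `stats_transform` is callable, so it could be passed as the `transform` argument of `TransformDataset` in place of `normalize`.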


<a name='6'></a>
## 6 - Practice Exercises

Try the following exercises to reinforce your understanding of **Dataset, DataLoader, and Transformations**:



### **Exercise 1: Simple Dataset**
- Create a custom Dataset with 10 samples, each with 3 features.  
- Return the sample and a label (0 or 1).  
- Print the first sample and its label.



### **Exercise 2: DataLoader**
- Wrap the dataset with a DataLoader.  
- Set `batch_size=4` and `shuffle=True`.  
- Iterate through all batches and print them.


### **Exercise 3: Transformation**
- Add a transformation to normalize each sample:  

$$
x_{\text{norm}} = \frac{x - \text{mean}(x)}{\text{std}(x)}
$$  

- Print the normalized batches from the DataLoader.



### **Exercise 4 (Optional): Mini-Batch Training**
- Use a simple linear model with input size 3 → output size 1.  
- Use MSE loss and SGD optimizer.  
- Iterate through the DataLoader for 2 epochs and print batch loss.


In [7]:
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim

# ----------------------------
# Exercise 1: Simple Dataset
# ----------------------------
class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 3)
        self.labels = torch.randint(0, 2, (10,))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataset = MyDataset()
print("First sample:", dataset[0])

# ----------------------------
# Exercise 2: DataLoader
# ----------------------------
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
for batch_idx, (x_batch, y_batch) in enumerate(dataloader):
    print(f"Batch {batch_idx+1}")
    print("x:", x_batch)
    print("y:", y_batch)
    print("---")

# ----------------------------
# Exercise 3: Transformation
# ----------------------------
def normalize(x):
    return (x - x.mean()) / x.std()

class TransformDataset(Dataset):
    def __init__(self, data, labels, transform=None):
        self.data = data
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.labels[idx]
        if self.transform:
            x = self.transform(x)
        return x, y

trans_dataset = TransformDataset(dataset.data, dataset.labels, transform=normalize)
trans_loader = DataLoader(trans_dataset, batch_size=4, shuffle=True)

for x_batch, y_batch in trans_loader:
    print("Normalized batch x:", x_batch)
    print("y:", y_batch)
    print("---")

# ----------------------------
# Exercise 4: Mini-Batch Training
# ----------------------------
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):
    for x_batch, y_batch in trans_loader:
        y_batch = y_batch.float().unsqueeze(1)
        outputs = model(x_batch)
        loss = loss_fn(outputs, y_batch)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")


First sample: (tensor([-0.2774, -0.5221,  0.9460]), tensor(1))
Batch 1
x: tensor([[-1.4489,  0.2493,  0.4823],
        [ 0.8521, -0.1315,  0.5867],
        [ 1.1370,  0.2952, -0.1801],
        [-1.5567,  2.1936,  0.7252]])
y: tensor([0, 0, 0, 0])
---
Batch 2
x: tensor([[-1.9914,  0.7881,  0.2722],
        [ 0.7671,  0.3641,  1.2196],
        [-0.6737, -0.7288,  0.2280],
        [ 0.3095,  0.5873, -0.1617]])
y: tensor([0, 0, 0, 0])
---
Batch 3
x: tensor([[-0.2774, -0.5221,  0.9460],
        [-1.6754, -1.6615,  0.0034]])
y: tensor([1, 1])
---
Normalized batch x: tensor([[-0.4147, -0.7259,  1.1406],
        [-0.0386, -0.9801,  1.0188],
        [-1.1476,  0.4633,  0.6843],
        [-1.0640,  0.9205,  0.1435]])
y: tensor([1, 0, 0, 0])
---
Normalized batch x: tensor([[ 0.1702,  0.9040, -1.0742],
        [-0.5253, -0.6279,  1.1532],
        [ 1.0789, -0.1832, -0.8958],
        [-1.1370,  0.7430,  0.3940]])
y: tensor([0, 0, 0, 0])
---
Normalized batch x: tensor([[ 0.8182, -1.1147,  0.2965],
  