# Datasets and Dataloaders

## Dataset
- Acts as a **container** for raw data.
- Provides access to **individual samples** via indexing (`dataset[idx]`).
- Implements:
  - `__len__`: Returns the size of the dataset.
  - `__getitem__`: Fetches a single sample by index.
- Does **not handle batching**, shuffling, or parallel loading.
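
The two methods listed above can be sketched in plain Python, without PyTorch, to show just the indexing protocol. `SquaresDataset` is a hypothetical name used only for illustration; in PyTorch code one would normally subclass `torch.utils.data.Dataset`.

```python
class SquaresDataset:
    """Hypothetical map-style dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples in the dataset
        return self.size

    def __getitem__(self, idx):
        # Fetch one sample; raise IndexError for out-of-range indices
        if not 0 <= idx < self.size:
            raise IndexError(f"index {idx} out of range")
        return idx, idx * idx


squares = SquaresDataset(4)
print(len(squares))  # 4
print(squares[3])    # (3, 9)
```

Note that the class returns one sample per call and knows nothing about batches or ordering.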

## DataLoader
- Wraps a `Dataset` and acts as an **iterable** that yields data when looped over.
- Provides **batches** of data for training or inference.
- Handles:
  - **Batching**: Combines multiple samples into batches.
  - **Shuffling**: Randomizes the order of samples.
  - **Parallel loading**: Uses multiple worker processes (set with `num_workers`) to load data.
- Implements:
  - `__iter__`: Returns an iterator that fetches batches.
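
As a rough mental model of the batching and shuffling described above, here is a conceptual sketch in plain Python. It is not PyTorch's implementation: `iter_batches` and its `seed` parameter are hypothetical, and the real `DataLoader` additionally collates samples into tensors and can load them in parallel.

```python
import random


def iter_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of samples, mimicking batching and shuffling only."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Randomize the order in which samples are visited
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        # Fetch each sample through the dataset's own indexing
        yield [dataset[i] for i in batch_indices]


samples = ["a", "b", "c", "d", "e"]
print(list(iter_batches(samples, batch_size=2)))
# [['a', 'b'], ['c', 'd'], ['e']]
```

The sketch only relies on `len(dataset)` and `dataset[i]`, which is why the dataset itself does not need to know about batches.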
 
## Summary
| Feature              | `Dataset`                | `DataLoader`            |
|-----------------------|--------------------------|--------------------------|
| **Role**             | Data container           | Iterable over batches    |
| **Access**           | Single sample            | Batches of samples       |
| **Indexing**         | Yes (`dataset[idx]`)     | No                       |
| **Batching**         | No                       | Yes                      |
| **Shuffling**        | No                       | Yes                      |
| **Parallel Loading** | No                       | Yes                      |

## Example of a very simple Dataset and DataLoader

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader

In [27]:
# Step 1: Define a simple dataset
class SimpleDataset(Dataset):
    
    def __init__(self):
        # Create some dummy data: 5 samples of 2D points
        self.inputs = torch.arange(10).view(-1, 2)  # shape (5, 2): 5 samples of [x, y] pairs
        self.targets = torch.arange(5)  # one integer target per sample
        
    def __len__(self):
        # Number of samples in the dataset
        return len(self.inputs)
    
    def __getitem__(self, idx):
        # Return a single sample at the given index
        return self.inputs[idx], self.targets[idx]

dataset = SimpleDataset()

In [28]:
batch_size = 2
shuffle = True
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
for batch_idx, (inputs, target) in enumerate(dataloader):
    print(f"Batch {batch_idx + 1}")
    print(f"Inputs: {inputs.numpy()}")
    print(f"Target: {target.numpy()}")
    print()

Batch 1
Inputs: [[0 1]
 [6 7]]
Target: [0 3]

Batch 2
Inputs: [[2 3]
 [8 9]]
Target: [1 4]

Batch 3
Inputs: [[4 5]]
Target: [2]



In [29]:
batch_size = 3
shuffle = False
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
for batch_idx, (inputs, target) in enumerate(dataloader):
    print(f"Batch {batch_idx + 1}")
    print(f"Inputs: {inputs.numpy()}")
    print(f"Target: {target.numpy()}")
    print()

Batch 1
Inputs: [[0 1]
 [2 3]
 [4 5]]
Target: [0 1 2]

Batch 2
Inputs: [[6 7]
 [8 9]]
Target: [3 4]
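
In both runs above the last batch is smaller than `batch_size`, because 5 samples do not divide evenly into batches of 2 or 3. The arithmetic can be sketched as follows; `num_batches` is a hypothetical helper written for illustration, while `drop_last` mirrors the `DataLoader` argument of the same name, which discards the final incomplete batch.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches produced for a dataset of num_samples samples."""
    if drop_last:
        # Only full batches are kept
        return num_samples // batch_size
    # The final, possibly smaller, batch is kept
    return math.ceil(num_samples / batch_size)


print(num_batches(5, 2))                  # 3
print(num_batches(5, 3))                  # 2
print(num_batches(5, 2, drop_last=True))  # 2
```

For a dataset like the one above, `len(dataloader)` reports this batch count, whereas `len(dataset)` reports the number of samples.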

