# Batch Gradient Descent
Batch Gradient Descent is an optimization algorithm where you use the entire training dataset to compute the gradient of the loss function in each iteration.

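It is usually avoided on large datasets because the whole dataset must fit in memory at once, and the parameters are updated only once per pass over the data, which tends to slow convergence. Our earlier training code used Batch Gradient Descent, so we need to change it.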

Instead of loading the whole dataset at once, we can divide it into smaller batches. This is called **Mini-Batch Gradient Descent**.

With PyTorch's `DataLoader`, we can replace Batch GD (the entire dataset at once) with Mini-Batch GD by:

- Creating batches of a chosen size
- Processing one batch per iteration
- Updating parameters after each batch
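
The three steps above can be sketched as a training loop. This is a minimal illustration rather than the training code this section refers to: the toy regression data, the `torch.nn.Linear` model, the learning rate, and the batch size of 10 are all assumptions chosen for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: y = 3x + 1 plus a little noise
X = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3 * X + 1 + 0.05 * torch.randn_like(X)

# Step 1: create batches of a chosen size
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(X), y).item()

for epoch in range(20):
    # Step 2: process one batch per iteration
    for batch_X, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_X), batch_y)
        loss.backward()
        # Step 3: update the parameters after each batch
        optimizer.step()

final_loss = loss_fn(model(X), y).item()
print(f"Loss before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

With 100 samples and a batch size of 10, each epoch performs 10 parameter updates instead of the single update Batch GD would make.
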
# Dataset and DataLoader
`Dataset` and `DataLoader` are core abstractions in PyTorch that decouple how you define your data from how you iterate over it efficiently in training loops.

The `Dataset` class is responsible for loading the data, and the `DataLoader` class is responsible for creating batches from the loaded data.

Example: for a CSV dataset with 10 rows and a batch size of 2, there will be 10 / 2 = 5 batches.
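
When the dataset size is not a multiple of the batch size, the count has to be rounded. The sketch below shows the arithmetic in plain Python; `num_batches` is a hypothetical helper written for this example, not a PyTorch function, and its `drop_last` flag mirrors the `DataLoader` argument of the same name.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches per epoch (hypothetical helper for illustration)."""
    if drop_last:
        # The incomplete final batch is discarded
        return n_samples // batch_size
    # The incomplete final batch is kept as a smaller batch
    return math.ceil(n_samples / batch_size)

print(num_batches(10, 2))                  # 5
print(num_batches(10, 3))                  # 4  (batches of 3, 3, 3, 1)
print(num_batches(10, 3, drop_last=True))  # 3
```
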

A `Dataset` class is essentially a blueprint that defines how to access and preprocess your raw data. It defines:

- `__init__(self, ...)`: Initializes the dataset - sets up data paths, parameters, and preprocessing rules.

- `__getitem__(self, index)`: Retrieves and returns the single data sample at the given index, including any transformations.

- `__len__(self)`: Returns the total number of samples in the dataset.

```python
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, data):  # Setup data
        self.data = data
    
    def __getitem__(self, index):  # Get one sample
        return self.data[index]
    
    def __len__(self):  # Total samples
        return len(self.data)

# Example data: 10 numbers
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
dataset = SimpleDataset(data)

# Test the dataset
print(f"Total samples: {len(dataset)}")  # Output: Total samples: 10
print(f"Sample at index 0: {dataset[0]}")  # Output: Sample at index 0: 10
print(f"Sample at index 5: {dataset[5]}")  # Output: Sample at index 5: 60
```
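
`__getitem__` is also where per-sample transformations usually go. As a minimal sketch, assuming a hypothetical `TransformedDataset` class and an optional `transform` callable (neither is part of the example above), a transformation can be applied each time a sample is fetched:

```python
from torch.utils.data import Dataset

class TransformedDataset(Dataset):
    """Hypothetical variant of SimpleDataset that applies a transform per sample."""

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform  # any callable taking one sample

    def __getitem__(self, index):
        sample = self.data[index]
        if self.transform is not None:
            sample = self.transform(sample)  # applied lazily, on access
        return sample

    def __len__(self):
        return len(self.data)

# Scale every sample by 1/10 when it is read
scaled = TransformedDataset([10, 20, 30], transform=lambda x: x / 10)
print(scaled[1])    # 2.0
print(len(scaled))  # 3
```

The stored data stays unchanged; only the value returned by indexing is transformed.
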

The DataLoader is an iterator that wraps your Dataset and hands the training loop one batch at a time, so the model never needs the whole dataset in a single step.

## Features:
- Batching: Groups individual samples into mini-batches for efficient processing
- Shuffling: With `shuffle=True`, randomizes the data order each epoch so the model does not learn patterns from the sample order
- Parallel Loading: Uses multiple workers to load the next batch while the current one is being processed
- Automatic Batching: Handles dataset lengths that are not a multiple of the batch size; the last batch is simply smaller, or is dropped with `drop_last=True`

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Same Simple Dataset from before
class SimpleDataset(Dataset):
    def __init__(self, data):
        self.data = data
    
    def __getitem__(self, index):
        return self.data[index]
    
    def __len__(self):
        return len(self.data)

# Create dataset with 10 samples
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
dataset = SimpleDataset(data)

# Create DataLoader with batch_size=2
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate through batches
print("Batches with batch_size=2:")
for batch_idx, batch in enumerate(dataloader):
    print(f"Batch {batch_idx}: {batch}")

# Output will look like (order changes due to shuffle=True):
# Batch 0: tensor([20, 90])  
# Batch 1: tensor([40, 30])
# Batch 2: tensor([10, 70])
# Batch 3: tensor([60, 80])
# Batch 4: tensor([100, 50])
```
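
Because `shuffle=True` draws a new random order each epoch, the batches above differ from run to run. If a repeatable order is needed, for example while debugging, one option is to pass a seeded `torch.Generator` through the DataLoader's `generator` argument. The sketch below assumes a hypothetical helper `epoch_order` and uses a `TensorDataset` in place of `SimpleDataset` to stay self-contained.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

def epoch_order(seed):
    """Return the batches of one epoch for a given seed (hypothetical helper)."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=generator)
    return [batch[0].tolist() for batch in loader]

order_a = epoch_order(0)
order_b = epoch_order(0)

print(order_a == order_b)  # True: same seed, same shuffled order
```
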

```python
from sklearn.datasets import make_classification
import torch

# Step 1: Create a synthetic classification dataset using sklearn
X, y = make_classification(
    n_samples=10,       # Number of samples
    n_features=2,       # Number of features
    n_informative=2,    # Number of informative features
    n_redundant=0,      # Number of redundant features
    n_classes=2,        # Number of classes
    random_state=42     # For reproducibility
)
```

```python
X
```

Output:

```
array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])
```

```python
from torch.utils.data import Dataset, DataLoader

# Convert the data to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)
```

```python
class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples (rows) in the dataset
        return self.features.shape[0]

    def __getitem__(self, index):
        # Return the (features, label) pair for the row at the given index
        return self.features[index], self.labels[index]
```

```python
dataset = CustomDataset(X, y)

print(len(dataset))  # Output: 10
print(dataset[2])    # Output: (tensor([-2.8954,  1.9769]), tensor(0))
```

```python
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch_features, batch_labels in dataloader:
    print(batch_features)
    print(batch_labels)
    print("-" * 50)
```

Output:

```
tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388]])
tensor([1, 0])
--------------------------------------------------
tensor([[-2.8954,  1.9769],
        [-0.7206, -0.9606]])
tensor([0, 0])
--------------------------------------------------
tensor([[-1.9629, -0.9923],
        [-0.9382, -0.5430]])
tensor([0, 1])
--------------------------------------------------
tensor([[ 1.7273, -1.1858],
        [ 1.7774,  1.5116]])
tensor([1, 1])
--------------------------------------------------
tensor([[ 1.8997,  0.8344],
        [-0.5872, -1.9717]])
tensor([1, 0])
--------------------------------------------------
```


Workers are subprocesses that load data in parallel while the main process trains the model. This reduces the time the GPU spends waiting for data.

- `num_workers=0`: The main process loads the data itself, so training waits while each batch is loaded
- `num_workers=2`: Two subprocesses load batches in the background, so the next batch is usually ready when the model needs it

```python
import torch
import time
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        time.sleep(0.1)  # Simulate slow data loading
        return self.data[index]

    def __len__(self):
        return len(self.data)

# Create dataset
data = list(range(10))
dataset = SlowDataset(data)

# WITHOUT workers (slow)
print("Without workers (num_workers=0):")
dataloader_slow = DataLoader(dataset, batch_size=2, num_workers=0)
start = time.time()
for batch in dataloader_slow:
    print(f"Batch: {batch}")
print(f"Time: {time.time() - start:.2f}s\n")

# WITH workers (fast)
print("With workers (num_workers=2):")
dataloader_fast = DataLoader(dataset, batch_size=2, num_workers=2)
start = time.time()
for batch in dataloader_fast:
    print(f"Batch: {batch}")
print(f"Time: {time.time() - start:.2f}s")
```

Output:

```
Without workers (num_workers=0):
Batch: tensor([0, 1])
Batch: tensor([2, 3])
Batch: tensor([4, 5])
Batch: tensor([6, 7])
Batch: tensor([8, 9])
Time: 1.01s

With workers (num_workers=2):
Batch: tensor([0, 1])
Batch: tensor([2, 3])
Batch: tensor([4, 5])
Batch: tensor([6, 7])
Batch: tensor([8, 9])
Time: 0.85s
```

The speedup is modest here. With only 10 samples, two workers could at best halve the roughly 1 s of simulated loading time, and starting the worker processes likely adds overhead of its own.


# Sampling (Shuffling)
Sampling controls the order in which data samples are selected from the dataset.

## Types of Samplers:
- SequentialSampler: Samples in sequential order (0, 1, 2, 3...)

- RandomSampler: Samples in random order (shuffling)

- WeightedRandomSampler: Samples with probability weights
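
`WeightedRandomSampler` is commonly used to rebalance imbalanced classes. The sketch below is one way to do that under stated assumptions: the 9-to-1 label split, the inverse-class-frequency weights, and `num_samples=1000` are choices made for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced dataset: nine samples of class 0, one of class 1
labels = torch.tensor([0] * 9 + [1])
features = torch.arange(10)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class
class_counts = torch.bincount(labels)                  # tensor([9, 1])
sample_weights = 1.0 / class_counts[labels].float()

# Draw with replacement so the rare class can be picked many times
sampler = WeightedRandomSampler(sample_weights, num_samples=1000, replacement=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

drawn_labels = torch.cat([batch_labels for _, batch_labels in loader])
minority_count = int((drawn_labels == 1).sum())
print(f"Class 1 drawn {minority_count} times out of {len(drawn_labels)}")
```

With these weights each class has a total weight of 1, so the two classes are drawn about equally often even though class 1 has a single sample. Note that `sampler` and `shuffle=True` cannot be combined in one DataLoader; the sampler alone decides the order.
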

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler

data = [10, 20, 30, 40, 50]
dataset = torch.utils.data.TensorDataset(torch.tensor(data))

# Sequential Sampling (for validation/test)
sequential_loader = DataLoader(dataset, batch_size=2, sampler=SequentialSampler(dataset))
print("Sequential Sampling:")
for batch in sequential_loader:
    print(batch)  # Always: [10, 20], [30, 40], [50]

# Random Sampling (for training)
random_loader = DataLoader(dataset, batch_size=2, sampler=RandomSampler(dataset))
print("\nRandom Sampling:")
for batch in random_loader:
    print(batch)  # Random order: [30, 10], [40, 50], [20] etc.
```

Output:

```
Sequential Sampling:
[tensor([10, 20])]
[tensor([30, 40])]
[tensor([50])]

Random Sampling:
[tensor([20, 40])]
[tensor([50, 10])]
[tensor([30])]
```


# Collate Function
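A collate function defines how individual samples are combined into a batch. It is passed to the DataLoader through the `collate_fn` argument.

Default Collate Behavior:
- Stacks tensors of the same shape along a new batch dimension
- Converts Python numbers and NumPy arrays to tensors
- Collates tuples, lists, and dicts element by element, keeping their structure
- Fails on tensors of different shapes, because they cannot be stacked; variable-length samples need a custom `collate_fn`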

```python
from torch.utils.data import Dataset, DataLoader
import torch

class CustomDataset(Dataset):
    def __len__(self):
        return 3

    def __getitem__(self, idx):
        return torch.tensor([idx, idx * 2])  # Samples: [0, 0], [1, 2], [2, 4]

dataset = CustomDataset()

# Default collate stacks the per-sample tensors into one batch tensor
dataloader = DataLoader(dataset, batch_size=2)
for batch in dataloader:
    print(f"Batch shape: {batch.shape}, Batch: {batch}")
```

Output:

```
Batch shape: torch.Size([2, 2]), Batch: tensor([[0, 0],
        [1, 2]])
Batch shape: torch.Size([1, 2]), Batch: tensor([[2, 4]])
```
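
Samples of different lengths cannot be stacked directly, which is where a custom `collate_fn` comes in. The sketch below pads variable-length sequences to a common length with `torch.nn.utils.rnn.pad_sequence`. The `pad_collate` function, the sample sequences, and the padding value of 0 are assumptions made for this example; a plain Python list stands in for a Dataset.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical variable-length samples; a list supports indexing and len(),
# so the DataLoader can treat it as a map-style dataset
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

def pad_collate(batch):
    """Pad every sequence in the batch to the longest one (hypothetical helper)."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths

loader = DataLoader(sequences, batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))

print(padded)
# tensor([[1, 2, 3],
#         [4, 5, 0],
#         [6, 0, 0]])
print(lengths)  # tensor([3, 2, 1])
```

Returning the original lengths alongside the padded tensor lets later code ignore the padding positions.
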
