# Gradient Accumulation

## Vector vs. Scalar

A gradient is generally a vector (or a higher-dimensional tensor) because it contains the partial derivatives of a scalar loss function with respect to each model parameter.

If you have multiple parameters $(w_1, w_2, ..., w_n)$ in a model, the gradient of the loss $L$ with respect to these parameters is:
$$
\nabla_w L = \left(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, ..., \frac{\partial L}{\partial w_n}\right).
$$

Each element of this gradient vector is a scalar partial derivative. When we talk about accumulating gradients, we are adding these gradient vectors element-wise.

## How is a Vector Accumulated?

Adding (accumulating) gradients means performing vector addition. Suppose after processing one mini-batch you get a gradient vector:
$$
g^{(1)} = (g^{(1)}_1, g^{(1)}_2, ..., g^{(1)}_n).
$$

After another mini-batch, you get another gradient vector:
$$
g^{(2)} = (g^{(2)}_1, g^{(2)}_2, ..., g^{(2)}_n).
$$

Accumulating gradients is done by element-wise addition:
$$
g^{(accumulated)} = g^{(1)} + g^{(2)} = (g^{(1)}_1 + g^{(2)}_1,\; g^{(1)}_2 + g^{(2)}_2,\; ...,\; g^{(1)}_n + g^{(2)}_n).
$$
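As a minimal sketch of the element-wise addition above, the following plain-Python snippet accumulates two gradient vectors. The helper name `accumulate` and the numeric values are made up for illustration.

```python
def accumulate(accumulated, gradient):
    """Return the element-wise sum of two equal-length gradient vectors."""
    if len(accumulated) != len(gradient):
        raise ValueError("gradient vectors must have the same length")
    return [a + g for a, g in zip(accumulated, gradient)]


g1 = [0.5, -1.0, 2.0]   # gradient from mini-batch 1 (made-up values)
g2 = [1.5, 0.25, -2.0]  # gradient from mini-batch 2 (made-up values)

g_accumulated = accumulate(g1, g2)
print(g_accumulated)  # [2.0, -0.75, 0.0]
```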

## Simple Example with Linear Regression

Consider a simple linear regression model with one parameter $w$ and a bias $b$:
$$
\hat{y} = w x + b.
$$

The squared-error loss for a single data point $(x, y)$ (the per-sample term of Mean Squared Error, with a factor of $\tfrac{1}{2}$ to simplify the derivative) is:
$$
L = \tfrac{1}{2}(y - \hat{y})^2 = \tfrac{1}{2}(y - (w x + b))^2.
$$

The gradients are:
$$
\frac{\partial L}{\partial w} = -(y - (w x + b)) x,
$$
$$
\frac{\partial L}{\partial b} = -(y - (w x + b)).
$$

So the gradient vector for one data point is:
$$
\nabla L = \left(\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}\right).
$$
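The formulas above can be checked numerically. The sketch below implements the loss and its analytic gradient for one data point and compares the result against a central finite difference. The function names and the values of `w`, `b`, `x`, `y` are chosen only for illustration.

```python
def loss(w, b, x, y):
    """Squared-error loss 0.5 * (y - (w*x + b))^2 for one data point."""
    return 0.5 * (y - (w * x + b)) ** 2


def gradient(w, b, x, y):
    """Analytic gradient (dL/dw, dL/db) for one data point."""
    residual = y - (w * x + b)
    return (-residual * x, -residual)


w, b, x, y = 2.0, 1.0, 3.0, 10.0  # illustrative values
grad_w, grad_b = gradient(w, b, x, y)
print(grad_w, grad_b)  # -9.0 -3.0

# Central finite differences as an independent check of the formulas
eps = 1e-6
numeric_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
numeric_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
```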

## Gradient Accumulation Over Multiple Batches

Imagine you want an effective batch size of 4, but can only process 2 data points at a time.

Data points: $(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)$  
Mini-batch 1: $(x_1, y_1), (x_2, y_2)$  
Mini-batch 2: $(x_3, y_3), (x_4, y_4)$

Initialize:
$$
g^{(accum)}_w = 0,\quad g^{(accum)}_b = 0.
$$

Process Mini-batch 1:
Compute gradients for $(x_1, y_1)$ and $(x_2, y_2)$, then average them:
$$
g^{(batch1)}_w = \frac{g^{(1)}_w + g^{(2)}_w}{2},\quad g^{(batch1)}_b = \frac{g^{(1)}_b + g^{(2)}_b}{2}.
$$

Accumulate:
$$
g^{(accum)}_w = g^{(accum)}_w + g^{(batch1)}_w,\quad g^{(accum)}_b = g^{(accum)}_b + g^{(batch1)}_b.
$$

Process Mini-batch 2:
Similarly:
$$
g^{(batch2)}_w = \frac{g^{(3)}_w + g^{(4)}_w}{2},\quad g^{(batch2)}_b = \frac{g^{(3)}_b + g^{(4)}_b}{2}.
$$

Accumulate again:
$$
g^{(accum)}_w = g^{(accum)}_w + g^{(batch2)}_w,\quad g^{(accum)}_b = g^{(accum)}_b + g^{(batch2)}_b.
$$

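Now $g^{(accum)}_w$ and $g^{(accum)}_b$ hold the sum of the two mini-batch averages. Dividing by the number of mini-batches (here 2) gives the mean gradient over all 4 data points, $g^{(final)} = g^{(accum)} / 2$, which is exactly what a single batch of 4 would have produced. If you skip this division, the plain gradient-descent update below is twice as large, which is equivalent to doubling the learning rate. Then perform the update: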
$$
w \leftarrow w - \eta \cdot g^{(final)}_w,\quad b \leftarrow b - \eta \cdot g^{(final)}_b.
$$

This shows how gradients from multiple subsets of data (mini-batches) are summed or averaged to effectively replicate having processed the entire set at once.
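To make the equivalence concrete, here is a small plain-Python sketch of the walkthrough above for the linear regression model. The four data points and the starting values of `w` and `b` are made up. It assumes equal-size mini-batches, which is what makes the average of the mini-batch averages equal to the full-batch average.

```python
def point_gradient(w, b, x, y):
    """Gradient (dL/dw, dL/db) of 0.5 * (y - (w*x + b))^2 for one point."""
    residual = y - (w * x + b)
    return (-residual * x, -residual)


def mean_gradient(w, b, batch):
    """Average the per-point gradients over a batch."""
    grads = [point_gradient(w, b, x, y) for x, y in batch]
    n = len(grads)
    return (sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n)


w, b = 0.5, 0.0  # illustrative parameter values
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]  # made-up points
mini_batches = [data[:2], data[2:]]

# Accumulate the mini-batch averages, as in the walkthrough
accum_w, accum_b = 0.0, 0.0
for batch in mini_batches:
    g_w, g_b = mean_gradient(w, b, batch)
    accum_w += g_w
    accum_b += g_b

# Normalize by the number of mini-batches
final = (accum_w / len(mini_batches), accum_b / len(mini_batches))
full = mean_gradient(w, b, data)  # gradient from all 4 points at once
print(final, full)  # (-6.0, -2.25) (-6.0, -2.25)
```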


```python
import torch

# Example setup (MyModel and get_training_dataloader are placeholders)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)  # Your model here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

train_loader = get_training_dataloader()  # Your data loader here
accumulation_steps = 4  # Number of mini-batches to accumulate gradients over
num_epochs = 10         # Example value; set as needed

model.train()

for epoch in range(num_epochs):
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)
        # Scale the loss so the accumulated gradient is the average over
        # accumulation_steps mini-batches rather than their sum
        loss = criterion(outputs, targets) / accumulation_steps

        # Backward pass: backward() adds into each parameter's .grad
        loss.backward()

        # Only step the optimizer after accumulating enough gradients
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()        # Update the parameters
            optimizer.zero_grad()   # Reset gradients for the next accumulation window

    # If the number of batches isn't divisible by accumulation_steps, some
    # gradients are left over at the end of the epoch. Apply them here.
    # (They were scaled by 1/accumulation_steps, so this last step is smaller.)
    if (i + 1) % accumulation_steps != 0:
        optimizer.step()
        optimizer.zero_grad()
```

## Batch Size, Learning Rate and Gradient Accumulation

Recall the earlier scenario: we want an effective batch size of 4 but can only process 2 data points at a time, so the four data points $(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)$ are split into mini-batch 1, $(x_1, y_1), (x_2, y_2)$, and mini-batch 2, $(x_3, y_3), (x_4, y_4)$.


Intuitively, when the batch size increases, the gradient of each batch becomes more accurate, so we can take bigger steps, that is, increase the learning rate, to reach the destination sooner and shorten training time; this much most people can work out (see https://spaces.ac.cn/archives/10542).
Conversely, if the batch size is small and the gradient estimate is less accurate, we need to lower the learning rate, since we do not want to move aggressively in an inaccurate direction.


Most of the time, choosing the batch size for a model is a matter of trial and error: start with a fairly large batch size and check whether CUDA runs out of memory. If it does, lower the batch size until a full model update step fits into CUDA memory. That is your batch size.
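The trial-and-error search can be sketched as a simple halving loop. This is only an illustration: `find_largest_batch_size` and the `fits_in_memory` callback are hypothetical names, and in real use the callback would run one forward/backward/optimizer step at the given batch size and report whether it ran out of memory. Here a stand-in function fakes that check.

```python
def find_largest_batch_size(fits_in_memory, start=1024, minimum=1):
    """Halve the batch size until fits_in_memory(batch_size) returns True."""
    batch_size = start
    while batch_size >= minimum:
        if fits_in_memory(batch_size):
            return batch_size
        batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit in memory")


# Stand-in for a real trial run: pretend anything above 96 samples
# runs out of memory (made-up threshold).
def fake_trial(batch_size):
    return batch_size <= 96


print(find_largest_batch_size(fake_trial))  # 64
```

Halving only tries powers-of-two fractions of the starting value, so the result (64 here) may be below the true limit (96); a finer search between the last failing and first passing size would close that gap.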

Different models have different memory requirements, so the batch size that fits will not be the same across models (the CUDA memory size is fixed). However, we often need to iterate quickly over different model architectures to find a reasonably good baseline model to keep improving on. In that case, we want a constant batch size across the different models.

This is where gradient accumulation comes in. Even if batch_size = 128 causes an out-of-memory error for some models, with gradient accumulation we can keep the effective batch size at 128 by accumulating gradients over several smaller mini-batches, so the batch size does not need to be re-tuned for each model.
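One way to pick the numbers, sketched under assumptions: given the target effective batch size and the largest mini-batch that fits in memory, choose the largest mini-batch size that divides the target evenly, and accumulate over the resulting number of steps. The helper `accumulation_plan` is a hypothetical name, and the memory limit of 48 is a made-up value.

```python
def accumulation_plan(target_batch_size, max_micro_batch_size):
    """Return (micro_batch_size, accumulation_steps) whose product equals
    target_batch_size, using the largest micro-batch that fits."""
    if target_batch_size < 1 or max_micro_batch_size < 1:
        raise ValueError("batch sizes must be positive integers")
    upper = min(target_batch_size, max_micro_batch_size)
    for micro in range(upper, 0, -1):
        if target_batch_size % micro == 0:
            return micro, target_batch_size // micro


# Target of 128, but suppose only 48 samples fit in memory at once
micro_batch_size, accumulation_steps = accumulation_plan(128, 48)
print(micro_batch_size, accumulation_steps)  # 32 4
```

With these values, every model in the comparison sees an effective batch size of 128; only the split between mini-batch size and accumulation steps changes with the model's memory footprint.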