# Minibatch Stochastic Gradient Descent
At the heart of the decision to use minibatches is computational efficiency. This is most easily understood when considering parallelization across multiple GPUs and multiple servers.

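Processors can perform computations far faster than main memory can deliver data. The way to alleviate this constraint is to use a hierarchy of CPU caches that are fast enough to supply the processor with data. This is the driving force behind batching. Consider matrix-matrix multiplication, say $\mathbf{A}=\mathbf{BC}$. We have a number of options for computing $\mathbf{A}$: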
1. We could compute $\mathbf{A}_{ij}=\mathbf{B}_{i,:}\mathbf{C}_{:,j}$, i.e., we could compute it elementwise by means of dot products.
2. We could compute $\mathbf{A}_{:,j}=\mathbf{B}\mathbf{C}_{:,j}$, i.e., we could compute it one column at a time. Likewise we could compute $\mathbf{A}$ one row $\mathbf{A}_{i,:}$ at a time.
3. We could simply compute $\mathbf{A}=\mathbf{BC}$.
4. We could break $\mathbf{B}$ and $\mathbf{C}$ into smaller block matrices and compute $\mathbf{A}$ one block at a time.

Option 1 needs to copy one row vector and one column vector into the CPU each time we want to compute an element $\mathbf{A}_{ij}$.

Option 2 is more favorable: we can keep the column vector $\mathbf{C}_{:,j}$ in the CPU cache while we traverse $\mathbf{B}$. This halves the memory bandwidth requirement.
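To see where the factor of two comes from, the following sketch counts how many values each option has to fetch from memory. It rests on an idealized assumption that is not in the notes above: the cache holds exactly one column of $\mathbf{C}$ and nothing else, and every other value is fetched each time it is used. The function names are hypothetical.

```python
def loads_elementwise(m: int, n: int, p: int) -> int:
    """Option 1: every element A[i, j] fetches one row of B and one column of C."""
    return m * p * (n + n)


def loads_columnwise(m: int, n: int, p: int) -> int:
    """Option 2: column C[:, j] is fetched once, then all m rows of B stream past it."""
    return p * (n + m * n)


m = n = p = 256
elementwise = loads_elementwise(m, n, p)
columnwise = loads_columnwise(m, n, p)
ratio = elementwise / columnwise

# Under this simplified model the ratio comes out slightly below 2.
print(elementwise, columnwise, ratio)
```

Real caches are larger and hold parts of $\mathbf{B}$ as well, so the count is only a rough model of the argument, not a prediction of measured bandwidth.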

Option 3 is the most desirable, but most matrices might not fit entirely into cache.

Option 4 offers a practically useful alternative: we can move blocks of the matrix into cache and multiply them locally.
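As a rough illustration of option 4, here is a minimal sketch of block-wise multiplication. It uses plain Python lists instead of tensors so the access pattern is visible without any framework; `matmul_blocked` and the `block` parameter are hypothetical names, and optimized libraries implement this idea far more carefully.

```python
def matmul_blocked(B, C, block=2):
    """Multiply B (m x n) by C (n x p), accumulating A one block at a time."""
    m, n, p = len(B), len(C), len(C[0])
    A = [[0.0] * p for _ in range(m)]
    for i0 in range(0, m, block):          # block row of A
        for j0 in range(0, p, block):      # block column of A
            for k0 in range(0, n, block):  # matching blocks of B and C
                # Only a block of B and a block of C are touched in here,
                # which is what lets them stay in cache on real hardware.
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, p)):
                        partial = 0.0
                        for k in range(k0, min(k0 + block, n)):
                            partial += B[i][k] * C[k][j]
                        A[i][j] += partial
    return A


B_small = [[1, 2, 3], [4, 5, 6]]
C_small = [[7, 8], [9, 10], [11, 12]]
A_small = matmul_blocked(B_small, C_small, block=2)
print(A_small)  # [[58.0, 64.0], [139.0, 154.0]]
```

The block size is a tuning knob: it should be small enough that the blocks in use fit into cache, which depends on the hardware.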

Beyond computational efficiency, the overhead introduced by Python and by the deep learning framework itself is considerable.

__Use vectorization (and matrices) whenever possible__

In [1]:
%matplotlib inline
import time
import torch
import numpy as np
from torch import nn

A = torch.zeros(256, 256)
B = torch.randn(256, 256)
C = torch.randn(256, 256)


Element-wise assignment simply iterates over all rows of $\mathbf{B}$ and all columns of $\mathbf{C}$ to assign each value to $\mathbf{A}$.

In [2]:
# Compute A = BC one element at a time
start = time.time()
for i in range(256):
    for j in range(256):
        A[i, j] = torch.dot(B[i, :], C[:, j])
print("Time for element-wise operation", time.time() - start)

Time for element-wise operation 1.1361885070800781


A faster strategy is to perform column-wise assignment.

In [3]:
# Compute A = BC one column at a time
start = time.time()

for j in range(256):
    A[:, j] = torch.mv(B, C[:, j])
print("Time for column-wise operation", time.time() - start)

Time for column-wise operation 0.026204586029052734


Lastly, the most effective manner is to perform the entire operation in one block. Note that multiplying any two matrices $\mathbf{B} \in \mathbb{R}^{m\times n}$ and $\mathbf{C}\in \mathbb{R}^{n\times p}$ takes approximately $2mnp$ floating point operations, when scalar multiplication and addition are counted as separate operations (fused in practice). Thus, multiplying two $256\times 256$ matrices takes approximately $0.03$ billion floating point operations.
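The operation count can be checked with a few lines of arithmetic. The helper name `matmul_flops` is hypothetical; it simply evaluates the $2mnp$ estimate from the paragraph above.

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """Approximate operation count: one multiply and one add per term of each dot product."""
    return 2 * m * n * p


flops = matmul_flops(256, 256, 256)
billions = flops / 1e9

# 2 * 256**3 = 33,554,432 operations, i.e. roughly 0.03 billion.
print(flops, round(billions, 2))
```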

In [4]:
# Compute A = BC in one go
start = time.time()
A = torch.mm(B, C)
print("Time for entire operation in one block", time.time() - start)

Time for entire operation in one block 0.018102645874023438


### Minibatches
In practice we read _minibatches_ of data rather than single observations to update parameters. Processing single observations requires many separate matrix-vector multiplications, which is quite expensive and incurs significant overhead in the underlying deep learning framework. This applies both to evaluating a network on data and to computing gradients to update parameters.

Whenever we perform the update $\mathbf{w}\leftarrow\mathbf{w}-\eta_t\mathbf{g}_t$, where
$$\mathbf{g}_t=\partial_{\mathbf{w}}f(\mathbf{x}_t,\mathbf{w}),$$
we can increase the computational efficiency of this operation by applying it to a minibatch of observations at a time. That is, we replace the gradient $\mathbf{g}_t$ over a single observation by one over a small batch $\mathcal{B}_t$:
$$\mathbf{g}_t=\partial_{\mathbf{w}}\frac{1}{|\mathcal{B}_t|}\sum_{i\in \mathcal{B}_t}f(\mathbf{x}_i,\mathbf{w})$$
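The minibatch update can be sketched on a toy problem. Everything specific here is an assumption made for illustration, not something taken from the notes: a scalar weight `w`, the squared loss $f(x_i, w)=\frac{1}{2}(wx_i-y_i)^2$ with labels `y`, a fixed learning rate `lr`, and minibatches of two observations.

```python
def grad_single(x: float, y: float, w: float) -> float:
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w for one observation."""
    return (w * x - y) * x


def grad_minibatch(batch, w: float) -> float:
    """Average of the per-observation gradients over the minibatch."""
    return sum(grad_single(x, y, w) for x, y in batch) / len(batch)


def sgd_step(w: float, batch, lr: float) -> float:
    """One update w <- w - lr * g, with g the minibatch gradient."""
    return w - lr * grad_minibatch(batch, w)


# Toy data generated from y = 2 * x, so the best weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
batch_size = 2

w = 0.0
for epoch in range(50):
    for start in range(0, len(data), batch_size):
        w = sgd_step(w, data[start:start + batch_size], lr=0.05)

print(w)  # approaches 2.0 on this toy data
```

In a real training loop the minibatches would typically be reshuffled every epoch and the gradient would come from automatic differentiation rather than a hand-written formula.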

Let's see an example of matrix-matrix multiplication, but broken up into "minibatches" of 64 columns at a time.

In [5]:
# Compute A = BC in blocks of 64 columns at a time
start = time.time()

for j in range(0, 256, 64):
    A[:, j:j+64] = torch.mm(B, C[:, j:j+64])

elapsed_time = time.time() - start
print(f"Time for block-wise operation: {elapsed_time:.6f} seconds")

# Roughly 0.03 billion floating point operations divided by seconds gives Gigaflops
gigaflops = 0.03 / elapsed_time
print(f'Performance in Gigaflops: {gigaflops:.3f}')

Time for block-wise operation: 0.001006 seconds
Performance in Gigaflops: 29.831
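A single `time.time()` measurement can be noisy, for example when the first call to an operation pays one-off setup costs. The sketch below shows one possible way to make such numbers more stable: repeat the workload and keep the shortest run. The helpers `best_time` and `to_gigaflops` are hypothetical, and the plain-Python `workload` merely stands in for a call such as `torch.mm(B, C)` so the snippet runs without any framework.

```python
import time


def best_time(fn, repeats: int = 5) -> float:
    """Run fn several times and return the shortest wall-clock duration in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def to_gigaflops(flop_count: float, seconds: float) -> float:
    """Convert an operation count and a duration into Gigaflops."""
    return flop_count / seconds / 1e9


size = 32
X = [[float(i + j) for j in range(size)] for i in range(size)]
Y = [[float(i - j) for j in range(size)] for i in range(size)]


def workload():
    # Plain-Python matrix product used only as a stand-in workload.
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


seconds = best_time(workload)
print(f"Best of 5 runs: {seconds:.6f} seconds")
print(f"Performance in Gigaflops: {to_gigaflops(2 * size ** 3, seconds):.4f}")
```

Taking the minimum is one common convention; reporting the median or the full spread of runs is equally reasonable, depending on what the measurement is for.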
