# Micro-Benchmarking for Transformers

This notebook benchmarks the most time-consuming components of BERT, GPT-2 and T5 to help you understand their performance. Let's first check our libraries and hardware. If your GPU is a recent model, please make sure your CUDA version is also recent, since it may greatly affect the performance.

In [1]:
import torch

print('Pytorch version\t:', torch.__version__)
print('CUDA version\t:', torch.version.cuda)
print('GPU\t\t:',torch.cuda.get_device_name())

Pytorch version	: 1.13.0a0+08820cb
CUDA version	: 11.7
GPU		: Tesla T4


Let's first define a `walltime` function that benchmarks a PyTorch statement by running it repeatedly for at least 3 seconds and returning the median time per run.

In [2]:
import inspect
from collections import defaultdict
import pandas as pd
from torch.utils import benchmark 

pd.options.display.precision = 3

def var_dict(*args):
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return dict([(name, val) for name, val in callers_local_vars if val is arg][0] 
                for arg in args)

def walltime(stmt, arg_dict, duration=3):
    return benchmark.Timer(stmt=stmt, globals=arg_dict).blocked_autorange(
        min_run_time=duration).median
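
The `var_dict` helper is easy to misread, so here is a minimal sketch of what it does, using plain Python objects instead of tensors. The `demo` function is a hypothetical name used only for illustration. The helper looks into the caller's local variables and, for each argument, finds the name bound to that exact object.

```python
import inspect

def var_dict(*args):
    # Same helper as above: look at the caller's locals and, for each argument,
    # pick the first local name that refers to that exact object (identity check).
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return dict([(name, val) for name, val in callers_local_vars if val is arg][0]
                for arg in args)

def demo():
    # Hypothetical caller used only for this illustration.
    a = [1, 2, 3]
    b = {"k": "v"}
    return var_dict(a, b)

result = demo()
print(result)
```

Because the lookup is by identity, passing an expression that is not bound to a local name (for example `var_dict(a + [4])`) raises an `IndexError`, and two names bound to the same object resolve to whichever name appears first.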

Lastly, install Hugging Face Transformers from source.

In [4]:
from IPython.display import clear_output

!git clone https://github.com/huggingface/transformers
!cd transformers; pip install .

clear_output()

## Matrix Multiplication

Matrix multiplication is the most used operator in Transformers. Its performance is crucial. Let's test the [TFLOPS](https://en.wikipedia.org/wiki/FLOPS) we can achieve on square matrices. 

In [4]:
matmul_tflops = defaultdict(lambda: {})
for n in [128, 512, 2048, 8192]:
    for dtype in (torch.float32, torch.float16):
        a = torch.randn(n, n, dtype=dtype).cuda()
        b = torch.randn(n, n, dtype=dtype).cuda()   
        t = walltime('a @ b', var_dict(a, b))
        matmul_tflops[f'n={n}'][dtype] = 2*n**3 / t / 1e12
        del a, b
        
pd.DataFrame(matmul_tflops)

|               | n=128 | n=512  | n=2048 | n=8192 |
| ------------- | ----- | ------ | ------ | ------ |
| torch.float32 | 0.124 | 4.399  | 4.316  | 4.319  |
| torch.float16 | 0.278 | 16.526 | 22.809 | 24.913 |


You can see that the performance increases with the matrix size. If your GPU has [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/), you will see a big performance jump when switching from 32-bit to 16-bit floating point.
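
The `2*n**3` in the benchmark counts one multiply and roughly one add for each of the `n**3` terms in an `n x n` by `n x n` product. As a small sketch of the arithmetic, the helper below (`matmul_flops` is a name introduced here, not part of the notebook) inverts the formula to recover the time per multiplication implied by the numbers in the table.

```python
def matmul_flops(n):
    """FLOPs for an (n x n) @ (n x n) product: n*n outputs, each about n multiplies and n adds."""
    return 2 * n**3

# Time implied by the 24.913 TFLOPS measured for float16 at n=8192.
seconds = matmul_flops(8192) / (24.913 * 1e12)
print(f"{seconds * 1e3:.1f} ms per matmul")

# Speedup of float16 over float32 at n=8192, from the two table entries.
speedup = 24.913 / 4.319
print(f"float16 is about {speedup:.1f}x faster than float32")
```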

Next you can find the theoretical TFLOPS of your GPU from Wikipedia, for example, [Nvidia Tesla](https://en.wikipedia.org/wiki/Ampere_(microarchitecture)), [Nvidia Quadro](https://en.wikipedia.org/wiki/Quadro), [RTX 30xx](https://en.wikipedia.org/wiki/GeForce_30_series), and [RTX 20xx](https://en.wikipedia.org/wiki/GeForce_20_series). Here we list several cards, with their memory information.

| Model       | Memory (GB) | Memory Bandwidth (GB/sec) | FP32 TFLOPS | FP16 TFLOPS |
| ----------- | ----------- | ------------------------- | ----------- | ----------- |
| A100        | 80          | 2039                      | 19.5        | 312         |
| V100        | 16          | 900                       | 15.7        | 125         |
| A6000       | 48          | 768                       | 38          | 150         |
| RTX 3090 TI | 24          | 1008                      | 40          | 160         |

If the best TFLOPS number you get is still far from the theoretical TFLOPS of your GPU, the performance is likely bottlenecked by memory bandwidth. To illustrate this, let's benchmark a simple element-wise multiplication and report both its TFLOPS and its memory bandwidth.

In [5]:
vector = defaultdict(lambda: {})
for n in [1024*64, 1024*256, 1024*1024, 1024*1024*4]:
    a = torch.randn(n).cuda()
    t = walltime('a * 1.2', var_dict(a))
    vector[n]['TFLOPS'] = n / t / 1e12
    vector[n]['GB/s'] = 8 * n / t / 1e9
    
pd.DataFrame(vector)

|        | 65536 | 262144  | 1048576 | 4194304 |
| ------ | ----- | ------- | ------- | ------- |
| TFLOPS | 0.007 | 0.026   | 0.028   | 0.029   |
| GB/s   | 53.28 | 211.695 | 220.689 | 232.619 |


You can see that even for large vectors, the TFLOPS is far from the GPU's peak performance, while the bandwidth may be quite close to its theoretical number.
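
One way to see why is to compare how many FLOPs each operator performs per byte of memory traffic. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes each operand is read once and each result is written once, and the function names are introduced here for illustration.

```python
def elementwise_intensity(bytes_per_elem=4):
    # a * 1.2 on float32: one multiply per element, one element read, one written.
    return 1 / (2 * bytes_per_elem)  # FLOPs per byte

def matmul_intensity(n, bytes_per_elem):
    # 2*n**3 FLOPs; idealized traffic reads A and B once and writes C once.
    return 2 * n**3 / (3 * n**2 * bytes_per_elem)  # FLOPs per byte

# TFLOPS ceiling implied by the 232.619 GB/s measured in the table above.
ceiling_tflops = 232.619e9 * elementwise_intensity() / 1e12
print(f"element-wise ceiling: {ceiling_tflops:.3f} TFLOPS")
print(f"float16 matmul, n=8192: {matmul_intensity(8192, 2):.0f} FLOPs per byte")
```

Under these assumptions the element-wise multiply does only 0.125 FLOPs per byte, so the measured bandwidth caps it at about the 0.029 TFLOPS seen in the table, while a large matrix multiplication reuses each loaded value thousands of times.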

Matrix multiplication performance is a major topic in HPC, with a large body of research. Unfortunately the backend library, cuBLAS, is not open source. For some implementation details, you may check [cutlass](https://github.com/NVIDIA/cutlass), which claims performance similar to cuBLAS.


## BERT Layer

The main body of a Transformer model is a stack of Transformer blocks. Let's benchmark the performance of a single block. In BERT, it is often called a BERT layer. Let's construct one such layer from the [BERT large model](https://huggingface.co/bert-large-uncased). We use 16-bit floating point for better performance.

In [3]:
from transformers import AutoConfig, BertLayer

config = AutoConfig.from_pretrained("bert-large-uncased")
layer = BertLayer(config).half().cuda()

Then define a function to benchmark both the forward pass and the forward plus backward pass, using different sequence lengths and batch sizes.

In [3]:
def layer_benchmark(layer, hidden_size, seq_lens, batch_sizes, cross_attention=False):
    h = hidden_size
    results = defaultdict(lambda: {})    
    encoder_state = 'encoder_hidden_states=X' if cross_attention else ''
    for s in seq_lens:
        for b in batch_sizes:            
            ffn = 16*b*s*h*h / 1e12  # TFLOPs (operation count) for the Feed-Forward Network
            atten = (4*b*h*s*s + 8*b*s*h*h) / 1e12  # TFLOPs (operation count) for attention
            forward = ffn + (2 if cross_attention else 1) * atten
            
            X = torch.randn(b, s, h).half().cuda()
            results[f'batch={b}'][f'fwd seq_len={s}'] = forward / walltime(
                f'layer(X, {encoder_state})', var_dict(layer, X))
            results[f'batch={b}'][f'fwd+bwd seq_len={s}'] = 3 * forward / walltime(
                f'layer(X, {encoder_state})[0].sum().backward()', var_dict(layer, X))            
    return pd.DataFrame(results)
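
The two FLOP formulas in `layer_benchmark` can be derived by adding up the matrix multiplications in each sub-block. The sketch below is one way to do that accounting. It assumes the FFN intermediate size is `4*h` (as in `bert-large-uncased`) and ignores softmax, layer normalization, biases and activations. The function names are introduced here for illustration.

```python
def ffn_flops(b, s, h):
    tokens = b * s
    up = 2 * tokens * h * (4 * h)    # dense layer h -> 4h
    down = 2 * tokens * (4 * h) * h  # dense layer 4h -> h
    return up + down                 # equals 16*b*s*h*h

def attention_flops(b, s, h):
    tokens = b * s
    projections = 4 * (2 * tokens * h * h)  # query, key, value and output projections
    scores = 2 * b * s * s * h              # Q @ K^T, summed over all heads
    context = 2 * b * s * s * h             # attention weights @ V, summed over all heads
    return projections + scores + context   # equals 4*b*h*s*s + 8*b*s*h*h

b, s, h = 64, 128, 1024
print(ffn_flops(b, s, h) / 1e12, attention_flops(b, s, h) / 1e12)
```

The number of heads does not appear because splitting `h` across heads leaves the total multiply count unchanged.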

In BERT pre-training, we often train with a sequence length of 128 (stage 1) or 512 (stage 2). Let's test the layer's performance at both.

In [8]:
layer_benchmark(layer, config.hidden_size, [128, 512], [2, 4, 8, 16, 32, 64, 128])

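|                     | batch=2 | batch=4 | batch=8 | batch=16 | batch=32 | batch=64 | batch=128 |
| ------------------- | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
| fwd seq_len=128     | 7.393   | 14.828  | 16.485  | 16.746   | 16.721   | 17.312   | 18.123    |
| fwd+bwd seq_len=128 | 9.097   | 17.562  | 19.492  | 20.384   | 21.341   | 21.976   | 22.482    |
| fwd seq_len=512     | 13.332  | 13.518  | 13.813  | 14.208   | 14.572   | 14.407   | 13.233    |
| fwd+bwd seq_len=512 | 15.738  | 16.559  | 16.745  | 17.579   | 17.932   | 17.657   | 17.072    |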


No surprise that a large batch size helps. But the best number is below the matrix multiplication TFLOPS. Let's find out why.

We first benchmark the first dense layer in the Feed-Forward Network (FFN) in the layer. 

In [9]:
h, b, s = config.hidden_size, 64, 128
X = torch.randn(b, s, h).half().cuda()

'Dense layer TFLOPS: %.3f' % (8*b*s*h*h / 1e12 / walltime(    
    'layer.intermediate.dense(X)', var_dict(layer, X)))

'Dense layer TFLOPS: 25.382'

The number is pretty good. Then run this dense layer together with the GELU activation.

In [10]:
'Dense+Activation TFLOPS: %.3f' % (8*b*s*h*h / 1e12 / walltime(
    'layer.intermediate(X)', var_dict(layer, X)))

'Dense+Activation TFLOPS: 21.597'

Even though the activation function has negligible computational complexity, it brings down the TFLOPS. As pointed out before, the element-wise operation of the activation function is bound by memory bandwidth.
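
A rough consistency check on the two numbers above supports this. The sketch assumes the activation runs as a separate kernel that reads and writes the float16 intermediate tensor exactly once, and that the whole time difference between the two measurements comes from it. Neither assumption is verified here.

```python
b, s, h = 64, 128, 1024
dense_flops = 8 * b * s * h * h        # FLOPs of the h -> 4h dense layer

t_dense = dense_flops / 25.382e12      # seconds, from "Dense layer TFLOPS"
t_dense_act = dense_flops / 21.597e12  # seconds, from "Dense+Activation TFLOPS"
t_activation = t_dense_act - t_dense   # time attributed to the activation (assumption)

elements = b * s * 4 * h               # size of the intermediate tensor
bytes_moved = 2 * elements * 2         # one read and one write, 2 bytes per float16 element
implied_gb_s = bytes_moved / t_activation / 1e9
print(f"activation: {t_activation * 1e3:.2f} ms, about {implied_gb_s:.0f} GB/s")
```

The implied bandwidth comes out at roughly 280 GB/s, the same order as the 232 GB/s measured for the element-wise multiplication earlier, which is what you would expect if the activation is limited by memory traffic rather than arithmetic.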

Now test the whole FFN.

In [11]:
ffn = 16*b*s*h*h / 1e12
'FFN TFLOPS: %.3f'%(ffn / walltime(
    'layer.output(layer.intermediate(X),X)', var_dict(layer, X)))

'FFN TFLOPS: 21.348'

The other part in the BERT layer is the multi-head self-attention.

In [12]:
att = (4*b*h*s*s + 8*b*s*h*h) / 1e12
'Attention TFLOPS: %.3f'%(
    att / walltime('layer.attention(X)', var_dict(layer, X)))

'Attention TFLOPS: 12.981'

Even though the main computation in the attention block is still matrix multiplication, it has more memory-bound operators than the FFN, so you see a lower TFLOPS.

In [13]:
att / ffn

0.53125

The ratio of complexity between attention and FFN depends on the BERT configuration. The overall throughput falls between the TFLOPS of these two components, weighted by how many FLOPs each one contributes.
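
To make the weighting concrete, the sketch below combines the two measured numbers by adding up the time each component takes. It assumes the layer's run time is simply the FFN time plus the attention time, which ignores any other overhead.

```python
ffn_tflops, att_tflops = 21.348, 12.981  # measured above for batch=64, seq_len=128
att_over_ffn = 0.53125                   # FLOP ratio computed above

# Work in units where the FFN costs 1; time = work / throughput.
ffn_work, att_work = 1.0, att_over_ffn
total_time = ffn_work / ffn_tflops + att_work / att_tflops
combined_tflops = (ffn_work + att_work) / total_time
print(f"estimated layer throughput: {combined_tflops:.1f} TFLOPS")
```

The estimate lands close to the 17.312 TFLOPS measured for the full layer's forward pass at the same batch size and sequence length.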

## GPT-2 Block

Next let's evaluate `gpt2-medium`, which has a similar architecture to `bert-large`, i.e. 24 layers with a hidden size of 1024. GPT-2 is trained with a sequence length of 1024.

In [14]:
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

config = AutoConfig.from_pretrained("gpt2-medium")
layer = GPT2Block(config, layer_idx=0).half().cuda()
layer_benchmark(layer, config.n_embd, [512, 1024], [2, 4, 8, 16, 32, 64])


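|                      | batch=2 | batch=4 | batch=8 | batch=16 | batch=32 | batch=64 |
| -------------------- | ------- | ------- | ------- | -------- | -------- | -------- |
| fwd seq_len=512      | 10.08   | 10.308  | 10.51   | 10.511   | 10.967   | 10.599   |
| fwd+bwd seq_len=512  | 10.843  | 11.266  | 11.649  | 11.81    | 11.996   | 11.863   |
| fwd seq_len=1024     | 8.26    | 8.44    | 8.365   | 8.481    | 8.427    | 7.892    |
| fwd+bwd seq_len=1024 | 9.227   | 9.517   | 9.633   | 9.833    | 9.548    | 9.092    |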


You can see that, although GPT-2 and BERT have the same complexity, GPT-2 has lower TFLOPS at the same batch size and sequence length (for example, 10.5 versus 14.2 for the forward pass at batch size 16 and sequence length 512). Using a larger sequence length of 1024 further harms the performance.
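
One plausible contributor to the drop at longer sequences, not verified in this notebook, is that attention makes up a growing share of the work as the sequence length increases, and attention ran at lower TFLOPS than the FFN in the BERT measurements. The sketch below computes that share from the same FLOP formulas used in `layer_benchmark`. The batch size cancels out, so it is omitted.

```python
def attention_share(s, h):
    """Fraction of a layer's forward FLOPs spent in attention (per sequence)."""
    ffn = 16 * s * h * h
    att = 4 * h * s * s + 8 * s * h * h
    return att / (ffn + att)

for s in (128, 512, 1024):
    print(f"seq_len={s}: attention is {attention_share(s, 1024):.1%} of the FLOPs")
```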

## T5 Layer

T5 has both an encoder and a decoder. Let's first benchmark the encoder, whose structure is similar to a BERT layer.

In [8]:
from transformers.models.t5.modeling_t5 import T5Block

config = AutoConfig.from_pretrained("t5-large")
config.use_cache = False
config.is_decoder = False
config.is_encoder_decoder = False

encoder = T5Block(config).half().cuda()
layer_benchmark(encoder, config.d_model, [512], [2, 4, 8, 16, 32, 64, 128])

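|                     | batch=2 | batch=4 | batch=8 | batch=16 | batch=32 | batch=64 | batch=128 |
| ------------------- | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
| fwd seq_len=512     | 9.215   | 9.702   | 10.291  | 10.525   | 10.688   | 10.72    | 10.566    |
| fwd+bwd seq_len=512 | 11.348  | 11.903  | 12.363  | 12.678   | 12.899   | 12.831   | 12.488    |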


The decoder has an additional cross-attention block, which increases the time complexity and also hurts TFLOPS.

In [4]:
config.is_decoder = True
decoder = T5Block(config).half().cuda()

# Not enough GPU memory to test batch=128
layer_benchmark(decoder, config.d_model, [512], [2, 4, 8, 16, 32, 64], cross_attention=True)

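|                     | batch=2 | batch=4 | batch=8 | batch=16 | batch=32 | batch=64 |
| ------------------- | ------- | ------- | ------- | -------- | -------- | -------- |
| fwd seq_len=512     | 7.477   | 7.997   | 8.408   | 8.588    | 8.735    | 8.735    |
| fwd+bwd seq_len=512 | 9.627   | 10.071  | 10.487  | 10.747   | 10.906   | 10.82    |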
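
Since time is FLOPs divided by throughput, the two effects multiply. The sketch below estimates how much longer a decoder layer's forward pass takes than an encoder layer's at batch=64 and seq_len=512. Like `layer_benchmark`, it assumes the cross-attention costs the same FLOPs as the self-attention.

```python
h, s = 1024, 512
ffn = 16 * s * h * h
att = 4 * h * s * s + 8 * s * h * h

# Extra work from the second attention block.
flop_ratio = (ffn + 2 * att) / (ffn + att)

# Forward TFLOPS at batch=64 from the encoder and decoder tables.
enc_tflops, dec_tflops = 10.72, 8.735
time_ratio = flop_ratio * enc_tflops / dec_tflops
print(f"{flop_ratio:.2f}x the FLOPs, about {time_ratio:.1f}x the time")
```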


## Conclusion

To conclude, to achieve the best performance for a Transformer layer, you need to use a fast data type and a large batch size. Further improvement may require rewriting the code, for example by [fusing](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#fuse-pointwise-operations) multiple kernels into a single one.