# Estimation of CO2 equivalent emissions of transformer-based large language models

We estimate the equivalent carbon emissions for transformer-based LLMs. The following models are considered:
- GPT
- BERT
- T5

We focus on estimating the total training CO2e, since it is more deterministic than inference.

The equivalent training carbon footprint depends on:
- Total Training Time
- Number of GPUs
- Thermal Design Power(TDP) of GPUs
- Regional carbon equivalent emissions
- Power Usage Effectiveness(PUE)

We estimate throughput in order to find the total training time and the resulting carbon emission. A 2nd order polynomial is fit by least-squares regression to the throughput scaling data presented in the paper [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473). The fitted curve returns the throughput in TFLOPs per GPU for a model with $P$ parameters. The paper uses autoregressive transformer models like GPT-3 for its study.
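
As a minimal sketch of how such a fit can be produced, the snippet below runs `np.polyfit` with degree 2. The measurements from the Megatron-LM paper are not reproduced here, so the throughput values are stand-ins generated from the pipeline coefficients quoted in the Implementation section; the variable names are illustrative only.

```python
import numpy as np

# Pipeline-throughput coefficients (a, b, c) quoted in the Implementation section.
coeff_pipe = np.array([-5.60233749e-23, 8.45435587e-11, 1.34546129e+02])

# Model sizes (number of parameters) used for the batch run in the appendix.
sizes = np.array([1.7e9, 3.6e9, 7.5e9, 18.4e9, 39.1e9,
                  76.1e9, 145.6e9, 310.1e9, 529.6e9, 1008e9])

# Stand-in throughput values (TFLOPs per GPU) generated from the quoted curve.
# A real fit would use the measured values from the Megatron-LM paper instead.
throughput = np.polyval(coeff_pipe, sizes)

# Fit y = a*x^2 + b*x + c with x in billions of parameters,
# which keeps the least-squares problem well conditioned.
refit = np.polyfit(sizes / 1e9, throughput, deg=2)

# Predicted throughput for a 175B-parameter model.
pred_175b = np.polyval(refit, 175.0)
print(f"{pred_175b:.2f} TFLOPs per GPU")
```

Fitting in units of billions of parameters changes the numerical values of the coefficients but not the predicted throughput.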

The empirical throughput scaling data in the Megatron-LM paper relies on specific compute and communication optimizations for workload distribution. These affect the parallelization degrees, i.e. the pipeline, tensor, and data parallel sizes. The data used in that paper is therefore constrained by the specific hardware of the experimental setup (**DGX A100 NVIDIA nodes**). To extend this throughput estimation to NVIDIA 32GB V100 GPUs, we use the ratio of their relative performance ([source](https://lambdalabs.com/gpu-benchmarks)). Using this ratio alone is only a rough extension, since it does not account for how the p, t, d scaling in the Megatron framework shaped the throughput values reported for A100 GPUs.

Naturally, the total workload is split among all workers, so we have $p \cdot t \cdot d = n$, where $n$ is the total number of GPUs. We assume that the throughput reported in the Megatron-LM paper is an upper bound for GPT-like models of a given parameter size.



    
### Training time estimation

For an approximate training time calculation, we need to estimate the following:
- Total train FLOPs required by the model
- Benchmark of single GPU FLOPs
- Percent of peak device throughput as estimated using the regression equation

This gives the train time as: $t_{train} = \frac{\text{Total Train FLOPs}}{\text{(Benchmark FLOPs per GPU)} \times \text{Percent Utilization} \times \text{Number of GPUs}}$
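
A worked instance of this formula, as a sketch: the inputs below are the A100 values that appear in the evaluation later in this notebook (3.15e23 total FLOPs, 312 TFLOPs peak, 47.32% of peak, 1520 GPUs). The function name is illustrative.

```python
def train_time_seconds(total_flops, peak_flops_per_gpu, utilization, num_gpus):
    """Training time in seconds: total FLOPs divided by the achieved cluster FLOPs per second."""
    return total_flops / (peak_flops_per_gpu * utilization * num_gpus)

# 175B GPT-3 on 1520 A100 GPUs running at 47.32% of the 312 TFLOPs peak.
seconds = train_time_seconds(3.15e23, 312e12, 0.4732, 1520)
days = seconds / 86400
print(f"{days:.1f} days")
```

This gives roughly 16.2 days, which the evaluation rounds up to 17 days.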

To calculate the total compute FLOPs during training for different large transformer models we refer to the paper [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165). The following table gives FLOPs per parameter per token for different model types. This factor is multiplied by the total number of tokens and parameters. We do not consider attention operation FLOPs since these are $<10\%$ of the total.

![flop_count.png](flop_count.png)

The total train compute can be defined as: $train_{compute} = T \cdot P \cdot f_p$ where <br>
- $T$: Total Training Tokens
- $P$: Total Parameters
- $f_p$: FLOPs required per token per parameter
    - BERT/GPT: 6 (100% of parameters active for each token)
    - T5: 3 (50% of parameters active for each token)    
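
The definition above can be sketched as a small helper. The dictionary and function names are illustrative; the token and parameter counts are the GPT-3 values used later in this notebook.

```python
# FLOPs per parameter per token, as listed above.
FLOPS_PER_PARAM_PER_TOKEN = {"GPT": 6, "BERT": 6, "T5": 3}

def total_train_compute(model_type, tokens, params):
    """Total training FLOPs: T * P * f_p (attention FLOPs are ignored)."""
    return tokens * params * FLOPS_PER_PARAM_PER_TOKEN[model_type]

# GPT-3: 300B training tokens, 175B parameters.
gpt3_flops = total_train_compute("GPT", 300e9, 175e9)
print(f"{gpt3_flops:.2e} FLOPs")
```

For the same token and parameter counts, a T5-style model needs half of this compute because only half of its parameters are active per token.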
    
Megatron scaling has been used for GPT, BERT, and T5 models, so it is reasonable to assume that throughput follows similar trends across these model types for a given hardware configuration.

We scale throughput to find the percent of peak utilization. Peak FLOPs for each GPU type are taken from the NVIDIA spec sheets:
- A100 40/80GB: 312 TFLOPs
- V100 32GB: 125 TFLOPs

### CO2e calculation

CO2e, i.e. equivalent carbon emission, is the product of the following: <br>
- Total train time
- Number of GPUs
- Thermal Design Power (TDP) of GPU
- Regional carbon equivalent emissions
    - The value used for GPT-3 training is 0.429 kg CO2e per KWh
- Power Usage Effectiveness (PUE)
    - US national datacenter average for 2018 is 1.58
    - OpenAI reported PUE for GPT-3 training is 1.1


Gross CO2e emission can hence be estimated as <br>
*KWh = Hours to train × Number of Processors × Average Power per Processor × PUE ÷ 1000* <br>
*tCO2e = KWh × kg CO2e per KWh ÷ 1000* <br>
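
As a sketch of the two formulas above, with power given in watts per processor: the inputs are the A100 values from the evaluation later in this notebook (390 hours, 1520 GPUs, 400 W TDP, PUE 1.1, 0.429 kg CO2e per KWh). The function name is illustrative.

```python
def gross_co2e_tonnes(hours, num_processors, watts_per_processor, pue, kg_co2e_per_kwh):
    """Gross emissions in tCO2e from training hours, hardware power, PUE and grid intensity."""
    kwh = hours * num_processors * watts_per_processor * pue / 1000
    return kwh * kg_co2e_per_kwh / 1000

tco2e = gross_co2e_tonnes(390, 1520, 400, 1.1, 0.429)
print(f"{tco2e:.2f} tCO2e")
```

This reproduces the gross CO2e of about 111.90 tCO2e reported for the A100 configuration.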

Net CO2e emission additionally accounts for carbon offsets, such as purchased carbon credits or the use of a renewable grid. Net carbon equivalent is not calculated here.

---

<mark>*Assumption*</mark> Memory capacity of an 80GB GPU: a 2.4B parameter model. The parameter capacity of other GPUs is scaled from this unit (0.03B parameters per GB of GPU memory)<br>

Let
- Total number of parameters: $P$ <br>
- GPU type: V100/A100 <br>
- GPU memory: $gpu_{mem}$ <br>
- Number of GPUs in a single node: $node_{size}$ (restricted to 1,2,4,8)<br>
- Parameter capacity of a single GPU: $gpu_{cap}$ <br>
- Parameter capacity of a single node: $node_{cap}$ <br>
- Estimated total number of GPUs: $n$ <br>
- Estimated Batch Size: $B$ <br>
- Estimated tensor size: $tensor$ <br>
- Estimated pipeline size: $pipe$ <br>
- Estimated data size: $data$ <br>
- Estimated throughput: $X$ <br>
- A100:V100 ratio: $r$ <br>
- FLOPs per parameter per token: $flop_{token}$ <br>
- FLOP benchmark GPU: $flop_{bench}$ <br>
- Total training tokens: $T$ <br>
- Total train compute: $C_{train}$
- End-to-end training time: $t_{train}$ <br>
- Gross CO2e emission estimate: $co2e_{gross}$


**__Algorithm__** <br>
1. Calculate total parameters $P = 12lh^2\left(1+\frac{13}{12h} + \frac{V+s}{12lh}\right)$ <br>
2. Use regression coefficients for estimating number of GPUs $n$ <br>
3. Use regression coefficients for estimating batch size $B$  <br>
4. Calculate parameter capacity of a single node $node_{cap} = node_{size}*gpu_{cap}$ <br>
5. If total parameters ($P$) < parameter capacity of a single node ($node_{cap}$) <br>
    5.1 set pipeline size $pipe=1$ and tensor size $tensor = \lceil \frac{P}{gpu_{cap}} \rceil$ <br>
    5.2 else set pipeline size $pipe=\lceil \frac{P}{node_{cap}} \rceil$ and tensor size $tensor = node_{size}$ <br>
6. Use regression coefficients and p,t,d for estimating throughput $X$ and peak utilization given A100 nodes<br>
    6.1 Use relative performance ratio to scale to V100 GPU type
7. Calculate the total training compute $C_{train} = flop_{token} * T * P$ <br>
8. Calculate total training time $t_{train} = \frac{C_{train}}{n*\text{(percent of peak)}*flop_{bench}}$ <br>
9. Calculate gross CO2e estimate as <br>
KWh = Hours to train × Number of Processors × Average Power per Processor × PUE ÷ 1000 <br>
tCO2e = KWh × kg CO2e per KWh ÷ 1000 <br>
$\implies \mathbf{co2e_{gross} = n*t_{train}*\text{GPU TDP}*\text{PUE}*\text{Datacenter gross CO2 e /KWh}}$
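
Step 1 of the algorithm can be sketched as a function. We assume here, following the Megatron-LM paper, that $l$ is the number of transformer layers, $h$ the hidden size, $V$ the vocabulary size, and $s$ the sequence length; the function name is illustrative.

```python
def transformer_params(l, h, V, s):
    """Parameter count P = 12*l*h^2 * (1 + 13/(12h) + (V+s)/(12*l*h))."""
    return 12 * l * h**2 * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))

# Architecture values used in the implementation cells below.
P = transformer_params(l=24, h=2304, V=51200, s=2048)
print(f"{P / 1e9:.2f}B parameters")
```

With these values the formula gives about 1.65B parameters, which is below the 2.4B capacity assumed for a single 80GB GPU, so step 5 would select $pipe=1$ and $tensor=1$.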

---

## Implementation

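Estimated regression coefficients used for the polynomial fit $\mathbf{y = ax^2 + bx + c}$, where $x$ is the number of parameters $P$ for the throughput and GPU-count fits and the pipeline size for the batch-size fit: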
- Tensor model throughput: $$a= -8.82079068\times 10^{-20},  b= 1.68591116\times 10^{-09},  c= 1.33954735\times 10^{+02}$$
- Pipeline model throughput: $$a= -5.60233749\times 10^{-23},  b= 8.45435587\times 10^{-11},  c= 1.34546129\times 10^{+02}$$
- Total number of GPUs: $$a= -2.12910565\times 10^{-21},  b= 4.39684339\times 10^{-09},  c=7.99173057\times 10^{+02}$$
- Batch Size: $$a = -4.29439186\times 10^{-01},  b= 5.21376002\times 10^{+01},  c= 1.43737095\times 10^{+03}$$
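
A short sketch of how these coefficients are evaluated, using `np.polyval`. The choice of inputs is ours: 7.5B parameters for the tensor fit (a model that fits in a single node), 175B parameters for the pipeline and GPU-count fits, and a pipeline size of 10 for the batch-size fit.

```python
import numpy as np

coeffs = {
    "tensor": [-8.82079068e-20, 1.68591116e-09, 1.33954735e+02],
    "pipe":   [-5.60233749e-23, 8.45435587e-11, 1.34546129e+02],
    "gpu":    [-2.12910565e-21, 4.39684339e-09, 7.99173057e+02],
    "batch":  [-4.29439186e-01, 5.21376002e+01, 1.43737095e+03],
}

# Throughput fits return TFLOPs per GPU as a function of parameter count.
x_tensor = np.polyval(coeffs["tensor"], 7.5e9)
x_pipe = np.polyval(coeffs["pipe"], 175e9)

# GPU-count fit takes the parameter count; batch-size fit takes the pipeline size.
n_gpus = np.polyval(coeffs["gpu"], 175e9)
batch = np.polyval(coeffs["batch"], 10)

print(f"tensor fit @ 7.5B: {x_tensor:.2f} TFLOPs")
print(f"pipe fit @ 175B:   {x_pipe:.2f} TFLOPs")
print(f"GPUs @ 175B:       {n_gpus:.1f}")
print(f"batch @ pipe=10:   {batch:.1f}")
```

The raw GPU count and batch size are rounded afterwards (to a multiple of the model-parallel size and to a multiple of 8, respectively). Note that the tensor fit has a large negative quadratic term, so it is only meaningful for models small enough to fit in a single node.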

### Scaled throughput calculation

The throughput data used for the regression was reported on 8x 80GB A100 NVLink nodes. From observed data, the ratio of relative performance between 8X 32GB V100 vs 8X A100 is $r = \frac{7.76}{33.46} = 0.23$. This reported factor considers large transformer models and the FP16 mixed precision training used in LLMs.

We use this factor of 0.23 to scale the A100 throughput reported in the original paper to the 32GB V100 GPU used by Patterson et al. to report their estimated CO2e. The A100 throughput is first reduced by the factor 0.23 and expressed as a percent of the A100 peak; the same percent of peak is then applied to the V100 peak. The resulting throughput is used in all calculations for V100.

Example:
- Let model be GPT with parameter size 100B
- Let estimated throughput($X$) for the 100B model be 140 TFLOPs
- New throughput $X_{new} = X(1 - r)$
- $X_{new}$ percent peak: $p_{new} = \frac{X_{new}}{312}$
- Hence estimated throughput for V100: $p_{new}*125$
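
The example steps above can be sketched as follows. The function and constant names are illustrative; the peak values and the ratio $r$ are the ones stated in this notebook.

```python
A100_PEAK = 312.0  # TFLOPs
V100_PEAK = 125.0  # TFLOPs
r = 7.76 / 33.46   # relative performance ratio, 8X V100 vs 8X A100

def scale_to_v100(x_a100, ratio=r):
    """Return (V100 throughput in TFLOPs, percent-of-peak fraction) from an A100 throughput."""
    x_new = x_a100 * (1 - ratio)   # X_new = X(1 - r)
    p_new = x_new / A100_PEAK      # fraction of the A100 peak
    return p_new * V100_PEAK, p_new

v100_tflops, p_new = scale_to_v100(140.0)
print(f"{v100_tflops:.2f} TFLOPs ({p_new * 100:.2f}% of peak)")
```

For an estimated A100 throughput of 140 TFLOPs this gives about 43.08 TFLOPs on V100, or roughly 34.5% of the V100 peak.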

In [1]:
import pandas as pd
import numpy as np
import os

**Total Parameters and GPU details**

In [2]:
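V = 51200 # vocabulary size
s = 2048 # sequence length
h = 2304 # hidden size
a = 24 # number of attention heads (not used in the parameter count)
l = 24 # number of transformer layers
P = 12*l*(h**2)*(1 + (13/(12*h)) + ((V+s)/(12*l*h))) # parameters derived from the architecture
P_user = 175e9 # explicitly defined number of parameters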
# per-GPU specs: 'tdp' in kW, 'peak' in TFLOPs
gpu_map = {
    'A100': {'tdp': 0.4, 'peak': 312},
    'V100': {'tdp': 0.3, 'peak': 125}
}

**Model Training and regression functions**

In [3]:
model_type = 'GPT' # GPT, BERT, T5
tokens = 300e9 # training tokens

region_co2 = 0.429 # kg CO2e per KWh, value used for OpenAI GPT-3
pue = 1.1 #reported by NVIDIA for Azure datacenter

# regression coefficients basis observed Megatron scaling for throughput
coeff_tensor = np.array([-8.82079068e-20,  1.68591116e-09,  1.33954735e+02])
coeff_pipe = np.array([-5.60233749e-23,  8.45435587e-11,  1.34546129e+02])
coeff_gpu = np.array([-2.12910565e-21,  4.39684339e-09,  7.99173057e+02])
coeff_batch = np.array([-4.29439186e-01,  5.21376002e+01,  1.43737095e+03])

func_tensor = np.poly1d(coeff_tensor)
func_pipe = np.poly1d(coeff_pipe)
func_gpu = np.poly1d(coeff_gpu)
func_batch = np.poly1d(coeff_batch)

**Function definitions for parallel strategy, estimated throughput, end-to-end train time, and gross CO2e**

In [4]:
def get_ptd(P, node_size, gpu_cap, gpu_type, gpu_mem):
    p_b = P/1e9

    # model parallel size
    if p_b < node_size*gpu_cap:
        p_size = 1
        t_size = int(np.ceil(p_b/gpu_cap))
    else:
        t_size = node_size
        p_size = int(np.ceil(p_b/(node_size*gpu_cap)))

    model_size = t_size * p_size

    # number of gpus estimate
    num_gpu = np.round(func_gpu(P)/model_size)*model_size
    if 'V100' in gpu_type:
        num_gpu = np.round(num_gpu * 2.5)

    if gpu_mem == 40:
        num_gpu *= 2

    d_size = num_gpu/model_size
    #estimated batch size
    if p_size == 1:
        batch_size = 512
    else:
        batch_size = np.round(func_batch(p_size)/8)*8
        if batch_size < num_gpu:
            batch_size = num_gpu

    return p_size, t_size, d_size, num_gpu, batch_size

def get_throughput(t_size, p_size, node_size, P, gpu_type, rel_thru):
    #intra model condition
    if (t_size <= node_size and p_size == 1):
        X = func_tensor(P)
    # inter model
    else:
        X = func_pipe(P)

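    if 'V100' in gpu_type:
        # reduce the A100 throughput by the ratio, express it as a fraction of the
        # A100 peak (312 TFLOPs), then apply that fraction to the V100 peak (125 TFLOPs)
        X_new = X -  X*rel_thru
        peak_new = X_new /312
        X_scaled = peak_new*125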
    else:
        X_scaled = X

    return X_scaled

def get_train_time(model_type, tokens, P, n, X):
    flop_per_parameter = 6
    if 'T5' in model_type:
        flop_per_parameter = 3

    total_compute = P*tokens*flop_per_parameter
    total_compute_per_sec = n*X*1e12

    train_sec = total_compute / total_compute_per_sec

    return train_sec, total_compute

def get_co2e(gpu_tdp, train_time, region_co2, pue, n):
    # gpu_tdp in kW, train_time in hours, region_co2 in kg CO2e per KWh; returns kg CO2e
    co2_gpu = gpu_tdp * train_time * region_co2 * pue
    co2_gross = co2_gpu*n
    return co2_gross

### Evaluation on the 175B parameter GPT-3 model

We calculate the total CO2e for GPT-3 (175B) and compare two setups:
- 80 GB A100 with NVLink
- 32 GB V100 with NVLink

No throughput scaling study exists for V100 GPUs on large transformer models, so we use the throughput performance ratio of A100 vs V100 to estimate the percent of peak for V100.

**Results for 8x 80GB A100 nodes**

In [5]:
node_size = 8
gpu_type = 'A100' # A100 or V100
gpu_mem = 80 # 40,32(for V100)
gpu_cap = 0.03*gpu_mem
node_cap = node_size*gpu_cap
gpu_tdp = gpu_map[gpu_type]['tdp']
gpu_peak = gpu_map[gpu_type]['peak']

#relative throughput speedup ratio for 8X V100 vs A100 throughput
rel_thru = 7.76/33.46

p_size, t_size, d_size, num_gpu, batch_size = get_ptd(P_user, node_size, gpu_cap, gpu_type, gpu_mem)

X = get_throughput(t_size, p_size, node_size, P_user, gpu_type, rel_thru)

train_sec, total_compute  = get_train_time(model_type, tokens, P_user, num_gpu, X)
train_hour = np.round(train_sec/3600)
train_day = np.ceil(train_sec/86400)


co2e_gross = get_co2e(gpu_tdp, train_hour, region_co2, pue, num_gpu)

print('Estimated Number of GPU: {} \n\
P,T,D : {}, {}, {} \n\
Estimated Batch Size: {} \n\
Total Compute Required: {:e} FLOPs \n\
Estimated throughput: {:.2f} TFLOPs \n\
Percent peak: {:.2f} % \n\
Total Train Time: {} days \n\
Gross CO2e: {:.2f} tCO2e'.format(num_gpu, p_size, t_size, d_size, batch_size, total_compute, X, \
                                 (X/gpu_peak)*100, train_day, co2e_gross/1000))

Estimated Number of GPU: 1520.0 
P,T,D : 10, 8, 19.0 
Estimated Batch Size: 1912.0 
Total Compute Required: 3.150000e+23 FLOPs 
Estimated throughput: 147.63 TFLOPs 
Percent peak: 47.32 % 
Total Train Time: 17.0 days 
Gross CO2e: 111.90 tCO2e


**Results for 8x 32GB V100 nodes**

In [6]:
node_size = 8
gpu_type = 'V100' # A100 or V100
gpu_mem = 32 # 40,32(for V100)
gpu_cap = 0.03*gpu_mem
node_cap = node_size*gpu_cap
gpu_tdp = gpu_map[gpu_type]['tdp']
gpu_peak = gpu_map[gpu_type]['peak']

#relative throughput speedup ratio for 8X V100 vs A100 throughput
rel_thru = 7.76/33.46

p_size, t_size, d_size, num_gpu, batch_size = get_ptd(P_user, node_size, gpu_cap, gpu_type, gpu_mem)

X = get_throughput(t_size, p_size, node_size, P_user, gpu_type, rel_thru)

train_sec, total_compute  = get_train_time(model_type, tokens, P_user, num_gpu, X)
train_hour = np.round(train_sec/3600)
train_day = np.ceil(train_sec/86400)


co2e_gross = get_co2e(gpu_tdp, train_hour, region_co2, pue, num_gpu)

print('Estimated Number of GPU: {} \n\
P,T,D : {}, {}, {} \n\
Estimated Batch Size: {} \n\
Total Compute Required: {:e} FLOPs \n\
Estimated throughput: {:.2f} TFLOPs \n\
Percent peak: {:.2f} % \n\
Total Train Time: {} days \n\
Gross CO2e: {:.2f} tCO2e'.format(num_gpu, p_size, t_size, d_size, batch_size, total_compute, X, \
                                 (X/gpu_peak)*100, train_day, co2e_gross/1000))

Estimated Number of GPU: 3680.0 
P,T,D : 23, 8, 20.0 
Estimated Batch Size: 3680.0 
Total Compute Required: 3.150000e+23 FLOPs 
Estimated throughput: 45.43 TFLOPs 
Percent peak: 36.34 % 
Total Train Time: 22.0 days 
Gross CO2e: 272.47 tCO2e


### Discussion on variation observed with total CO2e estimated for GPT(Large) model discussed in Patterson et al

The paper by David Patterson et al. on [Carbon Emissions and Large Neural Network Training](https://arxiv.org/abs/2104.10350) calculates equivalent carbon for different model types. The results observed for different models across different hardware are given below:

![co2e_results](co2e_results.png)

From the above data, the throughput used for this setup of 10,000 GPUs is <mark>24.6 TFLOPs (19.7% of peak)</mark>. This reported throughput does not use Megatron to find optimized parallel dimensions of a given model size for workload distribution. Hence it is reasonable to expect higher throughput on V100 GPUs when the P,T,D parallel sizes are chosen based on the observations made in Megatron-LM.

We estimate throughput per GPU for 3680 V100 GPUs as $45.43 \text{ TFLOPs}$ ($36.34\%$ of peak). This increase of roughly 17 percentage points over the throughput reported by Google for estimating CO2e leads to a <mark>-51%</mark> variation, i.e. a lower CO2e estimate from the regression equations than the values reported in Patterson et al. If the throughput used by Patterson et al. is applied to the estimated model configuration of 3680 GPUs (for which we estimate a training time of 22 days at our throughput), we only see a $-9\%$ variation from the CO2e reported in Patterson et al. Hence we find that this throughput difference is the main driver behind the large difference of $-51\%$.
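
The variation figures can be checked with a short sketch. The reported value of 552.1 tCO2e for GPT-3 is not stated in the text above; it is assumed here from the results table of Patterson et al. All other inputs are the V100 values used in this notebook, and the function name is illustrative.

```python
def gross_co2e_tonnes(total_flops, n_gpus, tflops_per_gpu, tdp_kw, pue, kg_per_kwh):
    """Gross tCO2e for a run at a given per-GPU throughput (no rounding of hours)."""
    hours = total_flops / (n_gpus * tflops_per_gpu * 1e12) / 3600
    return hours * n_gpus * tdp_kw * pue * kg_per_kwh / 1000

REPORTED_TCO2E = 552.1  # assumption: GPT-3 value from the Patterson et al. table

# Our estimated throughput vs the throughput used by Patterson et al.
ours = gross_co2e_tonnes(3.15e23, 3680, 45.43, 0.3, 1.1, 0.429)
same_thru = gross_co2e_tonnes(3.15e23, 3680, 24.6, 0.3, 1.1, 0.429)

var_ours = (ours - REPORTED_TCO2E) / REPORTED_TCO2E * 100
var_same = (same_thru - REPORTED_TCO2E) / REPORTED_TCO2E * 100
print(f"estimated throughput: {ours:.1f} tCO2e ({var_ours:.0f}%)")
print(f"Patterson throughput: {same_thru:.1f} tCO2e ({var_same:.0f}%)")
```

Under these assumptions the two variations come out at about -51% and -9%, which matches the figures discussed above. The small gap to the 272.47 tCO2e printed earlier comes from rounding the training time to whole hours there.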

### Result for parameters from 1B to 1000B

The graphs below show the estimated throughput from the fitted regression curve for GPT/BERT-like models with parameters ranging from 1B to 1000B.

![a100_thru](a100_thru.png)
![v100_thru](v100_thru.png)

As expected, the estimated throughput follows the same trend as observed in the Megatron-LM paper.

#### Appendix
**Function for batch run**

In [7]:
def run(P_sizes, gpu_type, node_size, gpu_mem, rel_thru):
    gpu_cap = 0.03*gpu_mem
    node_cap = node_size*gpu_cap
    gpu_tdp = gpu_map[gpu_type]['tdp']
    gpu_peak = gpu_map[gpu_type]['peak']

    #relative throughput speedup ratio for 8X V100 vs A100 throughput
    # rel_thru = 7.76/33.46
    result = list()
    for P_size in P_sizes:
        p_size, t_size, d_size, num_gpu, batch_size = get_ptd(P_size, node_size, gpu_cap, gpu_type, gpu_mem)

        X = get_throughput(t_size, p_size, node_size, P_size, gpu_type, rel_thru)

        train_sec, total_compute  = get_train_time(model_type, tokens, P_size, num_gpu, X)
        train_hour = np.round(train_sec/3600)
        train_day = np.ceil(train_sec/86400)
        co2e_gross = get_co2e(gpu_tdp, train_hour, region_co2, pue, num_gpu)

        r = (P_size, num_gpu, p_size, t_size, d_size, batch_size, total_compute, X, \
                                     (X/gpu_peak)*100, train_day, co2e_gross/1000)

        result.append(r)
    return result

In [8]:
pr_list = [1.7e9, 3.6e9, 7.5e9, 18.4e9, 39.1e9, 76.1e9, 145.6e9, 310.1e9, 529.6e9, 1008e9]
node_size = 8
gpu_type = 'A100' # A100 or V100
gpu_mem = 80 # 40,32(for V100)
rel_thru = 7.76/33.46
a100_results = run(pr_list, gpu_type, node_size, gpu_mem, rel_thru)
a100_X = []
for el in a100_results:
    a100_X.append(el[7])

In [9]:
pr_list = [1.7e9, 3.6e9, 7.5e9, 18.4e9, 39.1e9, 76.1e9, 145.6e9, 310.1e9, 529.6e9, 1008e9]
node_size = 8
gpu_type = 'V100' # A100 or V100
gpu_mem = 32 # 40,32(for V100)
rel_thru = 7.76/33.46
v100_results = run(pr_list, gpu_type, node_size, gpu_mem, rel_thru)
v100_X = []
for el in v100_results:
    v100_X.append(el[7])