
FP8 vs FP16 performance (seq2seq transformer with te.Linear replacing nn.Linear layers) #230

Open
vince62s opened this issue May 17, 2023 · 3 comments


@vince62s

Here is what I am getting (see below)

FP8 is slower than FP16.

For FP16, padding to multiples of 16 makes things slower than multiples of 8.

Am I missing something?

Batch_size_multiple 16 // Seqlen multiple 16

FP8 (adam)
[2023-05-17 22:20:28,534 INFO] Step 100/300000; acc: 16.1; ppl: 6038.0; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 14043/16656 tok/s; 61 sec;
[2023-05-17 22:21:06,060 INFO] Step 200/300000; acc: 20.6; ppl: 1059.6; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 23063/27297 tok/s; 99 sec;
[2023-05-17 22:21:43,862 INFO] Step 300/300000; acc: 25.3; ppl: 466.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 23082/27262 tok/s; 136 sec;
[2023-05-17 22:22:21,180 INFO] Step 400/300000; acc: 27.6; ppl: 315.5; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 22912/27074 tok/s; 174 sec;
[2023-05-17 22:22:58,740 INFO] Step 500/300000; acc: 30.4; ppl: 236.7; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 22880/27001 tok/s; 211 sec;

FP16 (adam)
[2023-05-17 22:24:39,883 INFO] Step 100/300000; acc: 16.2; ppl: 6127.8; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 18771/22265 tok/s; 46 sec;
[2023-05-17 22:25:04,966 INFO] Step 200/300000; acc: 20.6; ppl: 1061.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 34504/40838 tok/s; 71 sec;
[2023-05-17 22:25:30,067 INFO] Step 300/300000; acc: 25.3; ppl: 467.8; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 34760/41057 tok/s; 96 sec;
[2023-05-17 22:25:55,069 INFO] Step 400/300000; acc: 27.4; ppl: 320.1; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 34199/40411 tok/s; 121 sec;
[2023-05-17 22:26:19,589 INFO] Step 500/300000; acc: 30.1; ppl: 241.5; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 35048/41359 tok/s; 145 sec;

FP16 (fusedadam)
[2023-05-17 22:28:29,266 INFO] Step 100/300000; acc: 16.1; ppl: 6160.6; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 20312/24092 tok/s; 42 sec;
[2023-05-17 22:28:49,956 INFO] Step 200/300000; acc: 20.6; ppl: 1063.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 41830/49509 tok/s; 63 sec;
[2023-05-17 22:29:11,128 INFO] Step 300/300000; acc: 25.3; ppl: 468.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 41213/48678 tok/s; 84 sec;
[2023-05-17 22:29:32,063 INFO] Step 400/300000; acc: 27.4; ppl: 320.2; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 40842/48260 tok/s; 105 sec;
[2023-05-17 22:29:52,720 INFO] Step 500/300000; acc: 30.2; ppl: 241.3; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 41603/49095 tok/s; 126 sec;

Batch_size_multiple 8 // Seqlen multiple 8
FP16 (fusedadam)
[2023-05-17 22:32:08,412 INFO] Step 100/300000; acc: 16.0; ppl: 6256.0; xent: 8.7; lr: 0.00002; sents: 34120; bsz: 2337/2766/85; 22346/26446 tok/s; 42 sec;
[2023-05-17 22:32:29,029 INFO] Step 200/300000; acc: 20.9; ppl: 1047.4; xent: 7.0; lr: 0.00005; sents: 31128; bsz: 2349/2772/78; 45571/53777 tok/s; 62 sec;
[2023-05-17 22:32:49,643 INFO] Step 300/300000; acc: 24.6; ppl: 482.1; xent: 6.2; lr: 0.00007; sents: 26808; bsz: 2346/2776/67; 45523/53867 tok/s; 83 sec;
[2023-05-17 22:33:10,198 INFO] Step 400/300000; acc: 27.0; ppl: 326.7; xent: 5.8; lr: 0.00010; sents: 28448; bsz: 2341/2771/71; 45563/53917 tok/s; 104 sec;
[2023-05-17 22:33:30,629 INFO] Step 500/300000; acc: 30.0; ppl: 242.5; xent: 5.5; lr: 0.00012; sents: 27072; bsz: 2338/2764/68; 45773/54123 tok/s; 124 sec;

@overvalidated

Same problem here. The only performance gain I got came from being able to use a bigger batch size, but implementation problems in Accelerate (model conversion takes much more memory) prevent me from using it.

@AaronZLT

AaronZLT commented Jun 5, 2023

Hi vince62s, could you share your benchmark script so I can replicate the issue? :)

@vince62s
Author

vince62s commented Jun 5, 2023

Well, I don't know if you really want to dig into the code, but here is my branch with the FP8 changes:
https://github.com/vince62s/OpenNMT-py/tree/fp8
The main change happens here: https://github.com/vince62s/OpenNMT-py/blob/fp8/onmt/model_builder.py#L426-L427
If you want an example of a training script:
https://github.com/vince62s/OpenNMT-py/blob/fp8/docs/source/examples/wmt17/Translation.md
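
For anyone trying to reproduce this without reading the branch, here is a minimal sketch of the kind of swap described in the title: walk the model, replace each torch.nn.Linear with transformer_engine.pytorch.Linear, and run the forward pass under te.fp8_autocast. The helper name swap_linears and the toy model below are illustrative only, not taken from the branch; the actual replacement lives in the model_builder.py lines linked above.

```python
# Minimal sketch (not the branch's exact code): replace nn.Linear layers with
# Transformer Engine's te.Linear and run the forward pass in FP8.
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

def swap_linears(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with an equivalent te.Linear,
    copying over the existing weights and biases."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new = te.Linear(child.in_features, child.out_features,
                            bias=child.bias is not None)
            with torch.no_grad():
                new.weight.copy_(child.weight)
                if child.bias is not None:
                    new.bias.copy_(child.bias)
            setattr(module, name, new)
        else:
            swap_linears(child)
    return module

# Toy stand-in for the seq2seq model; dimensions are multiples of 16,
# since the FP8 GEMMs have divisibility requirements.
model = swap_linears(nn.Sequential(nn.Linear(1024, 4096),
                                   nn.GELU(),
                                   nn.Linear(4096, 1024))).cuda()
batch = torch.randn(16, 128, 1024, device="cuda")

# Hybrid E4M3/E5M2 delayed-scaling recipe, as in the Transformer Engine examples.
fp8_recipe = recipe.DelayedScaling(margin=0, interval=1,
                                   fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(batch)      # linear layers run their GEMMs in FP8 here

out.float().mean().backward()  # backward can run outside the autocast block
```

Whether this path is actually faster than FP16 depends on the GEMM sizes; for relatively small hidden dimensions the extra casting and amax bookkeeping can eat the gain, which may be related to what the logs above show.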
