
FP8 vs FP16 performance (seq2seq transformer with te.Linear replacing nn.Linear layers) #230

Open
vince62s opened this issue May 17, 2023 · 3 comments


@vince62s

Here is what I am getting (see below)

FP8 is slower than FP16.

For FP16, padding to multiples of 16 makes things slower than multiples of 8.

Am I missing something?

Batch_size_multiple 16 // Seqlen multiple 16

FP8 (adam)
[2023-05-17 22:20:28,534 INFO] Step 100/300000; acc: 16.1; ppl: 6038.0; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 14043/16656 tok/s; 61 sec;
[2023-05-17 22:21:06,060 INFO] Step 200/300000; acc: 20.6; ppl: 1059.6; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 23063/27297 tok/s; 99 sec;
[2023-05-17 22:21:43,862 INFO] Step 300/300000; acc: 25.3; ppl: 466.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 23082/27262 tok/s; 136 sec;
[2023-05-17 22:22:21,180 INFO] Step 400/300000; acc: 27.6; ppl: 315.5; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 22912/27074 tok/s; 174 sec;
[2023-05-17 22:22:58,740 INFO] Step 500/300000; acc: 30.4; ppl: 236.7; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 22880/27001 tok/s; 211 sec;

FP16 (adam)
[2023-05-17 22:24:39,883 INFO] Step 100/300000; acc: 16.2; ppl: 6127.8; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 18771/22265 tok/s; 46 sec;
[2023-05-17 22:25:04,966 INFO] Step 200/300000; acc: 20.6; ppl: 1061.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 34504/40838 tok/s; 71 sec;
[2023-05-17 22:25:30,067 INFO] Step 300/300000; acc: 25.3; ppl: 467.8; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 34760/41057 tok/s; 96 sec;
[2023-05-17 22:25:55,069 INFO] Step 400/300000; acc: 27.4; ppl: 320.1; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 34199/40411 tok/s; 121 sec;
[2023-05-17 22:26:19,589 INFO] Step 500/300000; acc: 30.1; ppl: 241.5; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 35048/41359 tok/s; 145 sec;

FP16 (fusedadam)
[2023-05-17 22:28:29,266 INFO] Step 100/300000; acc: 16.1; ppl: 6160.6; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 20312/24092 tok/s; 42 sec;
[2023-05-17 22:28:49,956 INFO] Step 200/300000; acc: 20.6; ppl: 1063.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 41830/49509 tok/s; 63 sec;
[2023-05-17 22:29:11,128 INFO] Step 300/300000; acc: 25.3; ppl: 468.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 41213/48678 tok/s; 84 sec;
[2023-05-17 22:29:32,063 INFO] Step 400/300000; acc: 27.4; ppl: 320.2; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 40842/48260 tok/s; 105 sec;
[2023-05-17 22:29:52,720 INFO] Step 500/300000; acc: 30.2; ppl: 241.3; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 41603/49095 tok/s; 126 sec;

Batch_size_multiple 8 // Seqlen multiple 8
FP16 (fusedadam)
[2023-05-17 22:32:08,412 INFO] Step 100/300000; acc: 16.0; ppl: 6256.0; xent: 8.7; lr: 0.00002; sents: 34120; bsz: 2337/2766/85; 22346/26446 tok/s; 42 sec;
[2023-05-17 22:32:29,029 INFO] Step 200/300000; acc: 20.9; ppl: 1047.4; xent: 7.0; lr: 0.00005; sents: 31128; bsz: 2349/2772/78; 45571/53777 tok/s; 62 sec;
[2023-05-17 22:32:49,643 INFO] Step 300/300000; acc: 24.6; ppl: 482.1; xent: 6.2; lr: 0.00007; sents: 26808; bsz: 2346/2776/67; 45523/53867 tok/s; 83 sec;
[2023-05-17 22:33:10,198 INFO] Step 400/300000; acc: 27.0; ppl: 326.7; xent: 5.8; lr: 0.00010; sents: 28448; bsz: 2341/2771/71; 45563/53917 tok/s; 104 sec;
[2023-05-17 22:33:30,629 INFO] Step 500/300000; acc: 30.0; ppl: 242.5; xent: 5.5; lr: 0.00012; sents: 27072; bsz: 2338/2764/68; 45773/54123 tok/s; 124 sec;

@overvalidated

Same problem here. The only performance gain I got came from being able to use a bigger batch size, but implementation problems in Accelerate (model conversion takes much more memory) prevent me from using it.

@AaronZLT

AaronZLT commented Jun 5, 2023

Hi vince62s, could you share your benchmark script so I can replicate the issue? :)

@vince62s
Author

vince62s commented Jun 5, 2023

Well, I don't know if you really want to dig into the code, but here is my branch with the FP8 changes:
https://github.com/vince62s/OpenNMT-py/tree/fp8
The main change happens here: https://github.com/vince62s/OpenNMT-py/blob/fp8/onmt/model_builder.py#L426-L427
If you want an example of a training script:
https://github.com/vince62s/OpenNMT-py/blob/fp8/docs/source/examples/wmt17/Translation.md
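
For anyone trying to reproduce this without reading the branch, here is a minimal sketch of the kind of swap described in the title: walk the model, replace each torch.nn.Linear with transformer_engine.pytorch.Linear, and run the forward pass under te.fp8_autocast. The helper name swap_linears and the toy model below are illustrative only, not taken from the branch; the actual replacement lives in the model_builder.py lines linked above.

```python
# Minimal sketch (not the branch's exact code): replace nn.Linear layers with
# Transformer Engine's te.Linear and run the forward pass in FP8.
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

def swap_linears(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with an equivalent te.Linear,
    copying over the existing weights and biases."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new = te.Linear(child.in_features, child.out_features,
                            bias=child.bias is not None)
            with torch.no_grad():
                new.weight.copy_(child.weight)
                if child.bias is not None:
                    new.bias.copy_(child.bias)
            setattr(module, name, new)
        else:
            swap_linears(child)
    return module

# Toy stand-in for the seq2seq model; dimensions are multiples of 16,
# since the FP8 GEMMs have divisibility requirements.
model = swap_linears(nn.Sequential(nn.Linear(1024, 4096),
                                   nn.GELU(),
                                   nn.Linear(4096, 1024))).cuda()
batch = torch.randn(16, 128, 1024, device="cuda")

# Hybrid E4M3/E5M2 delayed-scaling recipe, as in the Transformer Engine examples.
fp8_recipe = recipe.DelayedScaling(margin=0, interval=1,
                                   fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(batch)      # linear layers run their GEMMs in FP8 here

out.float().mean().backward()  # backward can run outside the autocast block
```

Whether this path is actually faster than FP16 depends on the GEMM sizes; for relatively small hidden dimensions the extra casting and amax bookkeeping can eat the gain, which may be related to what the logs above show.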
