
Enable mixed precision training for Transformer models #211

Merged
guillaumekln merged 1 commit into OpenNMT:master from guillaumekln:mixed-precision on Oct 3, 2018

Conversation

@guillaumekln
Member

commented Oct 3, 2018

Closes #57.

guillaumekln merged commit 87f6f3c into OpenNMT:master on Oct 3, 2018

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed

guillaumekln deleted the guillaumekln:mixed-precision branch on Oct 3, 2018

wanghm92 added a commit to wanghm92/OpenNMT-tf that referenced this pull request on Jan 5, 2019

@mehmedes

commented Jan 18, 2019

Hi @guillaumekln,
Would you mind giving us some input on the speed gains of your mixed precision implementation vs. FP32? For reference:
tensorflow/tensor2tensor#1221

@guillaumekln

Member Author

commented Jan 18, 2019

Hi,

I gathered some fresh values on a P3 instance (1 x V100) using the tensorflow/tensorflow:nightly-gpu-py3 Docker image. The same configuration was used in all runs to highlight the raw gain:

  • Model type: TransformerBase (without shared weights)
  • Batch size: 8192
| precision | vocab size | step/s | source tokens/s | target tokens/s |
| --- | --- | --- | --- | --- |
| FP32 | 32,001 | 2.64 | 18.1k | 20.4k |
| FP16 | 32,001 | 3.56 | 24.5k | 27.6k |
| FP16 | 32,000 | 4.03 | 27.8k | 31.4k |
| FP16 (with #309) | 32,000 | 4.68 | 32.8k | 37.1k |
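
For context, these numbers rely on the usual mixed precision recipe: compute in FP16, keep the variables in FP32, and scale the loss to avoid gradient underflow in the FP16 backward pass. Below is a minimal, illustrative sketch of that recipe using the current tf.keras mixed precision API; it is not the code added in this PR, just the general technique:

```python
import tensorflow as tf

# Illustrative sketch of the mixed precision recipe (FP16 compute, FP32
# variables, dynamic loss scaling); not the implementation added in this PR.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

vocab_size = 32000  # kept a multiple of 8 for Tensor Core efficiency

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 512),
    tf.keras.layers.Dense(2048, activation="relu"),
    # Keep the final projection in float32 for numerical stability.
    tf.keras.layers.Dense(vocab_size, dtype="float32"),
])

# The loss scale optimizer multiplies the loss before the FP16 backward pass
# and unscales the gradients before the FP32 weight update.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```
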
@mehmedes


commented Jan 18, 2019

Thank you for the feedback. Looks like we share the same fate :_(

@guillaumekln

Member Author

commented Jan 29, 2019

@mehmedes Please note that it's important to make the vocabulary size a multiple of 8. In my initial experiment, it was actually 32,000 + 1 (the <unk> token). Changing it to 31,999 + 1 makes a difference; see the table above.
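
As an illustration of that constraint (a hypothetical helper, not code from this repository), the vocabulary size can also be rounded up to the next multiple of 8, e.g. by adding dummy tokens:

```python
def pad_vocab_size(vocab_size, multiple=8):
    """Round a vocabulary size up to the next multiple of 8 so the embedding
    and softmax dimensions stay Tensor Core friendly (hypothetical helper)."""
    return -(-vocab_size // multiple) * multiple

# 31,999 tokens + <unk> already gives 32,000, which is a multiple of 8;
# 32,000 tokens + <unk> would instead be padded from 32,001 up to 32,008.
assert pad_vocab_size(31999 + 1) == 32000
assert pad_vocab_size(32000 + 1) == 32008
```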

@guillaumekln

Member Author

commented Jan 30, 2019

Similarly, the batch size should ideally be a multiple of 8. With #309, additional gains are observed (see the updated table above).
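
The same alignment idea can be applied to the tensors inside a batch. As a hypothetical sketch (not necessarily what #309 implements), the time dimension of a padded batch can be rounded up to a multiple of 8:

```python
import tensorflow as tf

def pad_time_to_multiple(ids, multiple=8, pad_id=0):
    """Pad a [batch, time] tensor of token ids so the time dimension is a
    multiple of 8 (hypothetical helper, not the #309 implementation)."""
    time = tf.shape(ids)[1]
    padded_time = -(-time // multiple) * multiple
    return tf.pad(ids, [[0, 0], [0, padded_time - time]], constant_values=pad_id)

# Example: a batch of 13-step sequences is padded to 16 steps.
batch = tf.zeros([4, 13], dtype=tf.int32)
print(pad_time_to_multiple(batch).shape)  # (4, 16)
```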

@guillaumekln

Member Author

commented Jan 31, 2019

@mehmedes Here are additional data for a big Transformer model with a batch size of 4096 and the latest updates:

| precision | step/s | source tokens/s | target tokens/s |
| --- | --- | --- | --- |
| FP32 | 1.92 | 6.6k | 7.4k |
| FP16 | 4.27 | 15.3k | 17.3k |

So to summarize, here are the current gains (with equal batch size):

  • base Transformer: x1.77 (4.68 vs. 2.64 step/s)
  • big Transformer: x2.22 (4.27 vs. 1.92 step/s)

These speedups are in line with the expected FP16 gains, although generally lower than what one can achieve in PyTorch, for example.
