How to calculate TFLOPS in LSTM.cu #7
The vast majority of FLOPs in an LSTM are in the matrix multiplications. A single matrix multiplication requires 2MN(K+1) FLOPs. There are 8 matrix multiplications per layer per timestep, and in this case M = K = hiddenSize, N = miniBatch. Therefore the total FLOPs are approximately:

2 × hiddenSize × (hiddenSize + 1) × miniBatch × 8 × numLayers × seqLength
In reality there are a few more FLOPs due to biases and activation functions. These are only significant if hiddenSize is very small, as they scale linearly with hiddenSize rather than with its square. In any case, this gives you the approximate total number of FLOPs. Divide by the runtime in seconds and multiply by 10^-12, and you have TFLOPS.
@JAppleyard Hi, why does a single matrix multiplication require 2MN(K+1) FLOPs? I mean, it should be MN(2K-1), right?
The output of this code is the runtime, but what I want to compare is throughput. How do I convert the runtime into TFLOPS?
I mean, how is the computation related to the other parameters?