Background

Profile script: dzhwinter/benchmark#84

From issue #8818 we can see that in the parameter optimization stage there are many `elementwise_mul` ops, and they take a lot of time.

These `elementwise_mul` ops are used to compute the learning rate for each parameter, because every parameter may have a different learning rate. The computation is:

param_lr = global_lr * lr_for_param

`global_lr` is a global Variable; `lr_for_param` is a float value attached to each parameter, with a default value of 1.0. The code above adds the `elementwise_mul` ops to the main program.
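To make the cost concrete, here is a minimal Python sketch of how the pre-optimization program ends up with one `elementwise_mul` per parameter. The names and data structures are purely illustrative, not the Paddle API:

```python
# Illustrative sketch (not the Paddle API): each parameter gets its own
# elementwise_mul op computing param_lr = global_lr * lr_for_param.

def build_lr_ops(global_lr, lr_for_params):
    """Return the ops added to the main program plus the resulting
    per-parameter learning rates."""
    ops, param_lrs = [], {}
    for name, lr_for_param in lr_for_params.items():
        ops.append(("elementwise_mul", name))
        param_lrs[name] = global_lr * lr_for_param
    return ops, param_lrs

ops, param_lrs = build_lr_ops(0.001, {"fc_w": 1.0, "fc_b": 1.0, "emb": 2.0})
# one elementwise_mul per parameter, even when lr_for_param is 1.0
```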
The improvement
Most of the time the value of `lr_for_param` is 1.0; in that case there is no need to add these `elementwise_mul` ops, and `global_lr` can be used directly.
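A minimal sketch of that improvement (a hypothetical helper, not Paddle code): only emit the multiply when `lr_for_param` differs from 1.0, otherwise reuse `global_lr` as-is.

```python
def build_lr_ops_optimized(global_lr, lr_for_params):
    """Skip the elementwise_mul whenever lr_for_param is 1.0 and reuse
    global_lr directly for that parameter."""
    ops, param_lrs = [], {}
    for name, lr_for_param in lr_for_params.items():
        if lr_for_param == 1.0:
            param_lrs[name] = global_lr  # no op added to the program
        else:
            ops.append(("elementwise_mul", name))
            param_lrs[name] = global_lr * lr_for_param
    return ops, param_lrs

ops, param_lrs = build_lr_ops_optimized(0.001, {"fc_w": 1.0, "fc_b": 1.0, "emb": 2.0})
# only one elementwise_mul remains (for "emb"); the other two reuse global_lr
```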
A complete solution would be constant folding: we should add a constant-folding transpiler that recognizes all constant values and computes them at compile time. This would remove many ops from the program at execution time.
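As a rough illustration of what such a transpiler would do, here is a toy constant-folding pass over a flat op list. The data structures are invented for this example; the real transpiler would operate on a Paddle program description:

```python
import operator

def constant_fold(ops, constants):
    """Evaluate every op whose inputs are all compile-time constants,
    drop it from the program, and record its output as a new constant."""
    remaining = []
    for op_type, fn, inputs, output in ops:
        if all(name in constants for name in inputs):
            constants[output] = fn(*(constants[name] for name in inputs))
        else:
            remaining.append((op_type, fn, inputs, output))
    return remaining, constants

ops = [
    ("elementwise_mul", operator.mul, ("global_lr", "lr_for_param"), "param_lr"),
    ("elementwise_mul", operator.mul, ("param_lr", "grad"), "scaled_grad"),
]
# global_lr and lr_for_param are known at compile time; grad is not
remaining, constants = constant_fold(ops, {"global_lr": 0.001, "lr_for_param": 2.0})
# the first mul is folded away; only the op depending on "grad" remains
```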
Optimization result
Timeline after optimization
| calc_step_num | ave_step_time (before) | ave_step_time (after) | after/before |
|---------------|------------------------|------------------------|--------------|
| 3             | 1.12088267008          | 1.03341897329          | 0.9219689097488165 |
| 38            | 1.05036788238          | 0.987895676964         | 0.9405234999432334 |
| 78            | 1.06520705345          | 0.953312274737         | 0.894954902569792  |
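As a sanity check on the table, the after/before column is simply the ratio of the two average step times:

```python
before = [1.12088267008, 1.05036788238, 1.06520705345]
after = [1.03341897329, 0.987895676964, 0.953312274737]
ratios = [a / b for a, b in zip(after, before)]
# each entry reproduces the after/before column: the optimization cuts
# roughly 6% to 10% off the average step time
```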
Since both `global_lr` and `lr_for_param` are constants, could we remove those `elementwise_mul`s by computing `global_lr * lr_for_param` at compile time? If so, do we need to add a compilation optimization stage? Where should it be? In a transpiler? @jacquesqiao
@wangkuiyi Yes, a better solution would be constant folding: we should add a constant-folding transpiler that recognizes all constant values and computes them at compile time. This would remove many ops from the program at execution time.