SE-ResNeXt Optimization #8990

Closed · jacquesqiao opened this issue Mar 12, 2018 · 3 comments

jacquesqiao (Member) commented Mar 12, 2018

Background

  • Project: https://github.com/PaddlePaddle/Paddle/projects/55
  • Profiling script:

Optimization methods and results

  1. Delete unused GPU memory during training.
  2. Remove program.clone in Executor. (25% speedup) [Speed]speed up python executor in fluid #8729
  3. Initialize NCCL only once instead of at every step. (5%~6% speedup) [Speed]Avoid init_nccl for every steps. #8758
  4. Use constant folding at compile time to reduce the number of elementwise_mul ops the optimizer runs during training (5%~10% speedup; see the sketch after this list). optimize optimizer learning rate #8873
  5. Optimize elementwise-related ops: use our own implementations instead of depending on Eigen (~10x speedup for a single op). [Speed] Optimize elementwise_mul_op gradient functor #8811
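
For item 4, the idea is that learning-rate expressions built purely from compile-time constants can be evaluated once when the program is constructed, instead of launching an elementwise_mul kernel on every iteration. A minimal sketch of such a pass, assuming a toy op list rather than Fluid's real ProgramDesc (all op and variable names here are illustrative):

```python
# Hypothetical constant-folding pass. Ops whose inputs are all known at
# compile time are evaluated once here, so they never become runtime kernels.
# The op/variable names are illustrative, not Fluid's actual IR.

constants = {"base_lr": 0.1, "decay": 0.5}   # compile-time constants

program = [
    # (op_type, inputs, output)
    ("elementwise_mul", ["base_lr", "decay"], "lr"),       # foldable
    ("elementwise_mul", ["lr", "grad_w"], "scaled_grad"),  # needs runtime data
    ("elementwise_sub", ["w", "scaled_grad"], "w"),
]

def fold_constants(program, constants):
    """Return the program with all constant-only ops evaluated away."""
    remaining = []
    for op_type, inputs, output in program:
        if all(name in constants for name in inputs):
            a, b = (constants[n] for n in inputs)
            constants[output] = a * b if op_type == "elementwise_mul" else a - b
        else:
            remaining.append((op_type, inputs, output))
    return remaining

optimized = fold_constants(program, constants)
print(len(program), "->", len(optimized), "ops")  # 3 -> 2
print(constants["lr"])                            # 0.05, computed at compile time
```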

Status

  1. Multi-card training has not been fully tested.
  2. The acceleration ratio for multi-card training still needs to be profiled.

Plan

Run a full profile after all the optimizations are merged (@chengduoZH).

chengduoZH (Contributor) commented

Many small op kernels, such as sgd_op, may also need to be optimized. For example, if the model has 1000 parameters, the optimizer launches sgd_op 1000 times per iteration, which is very time-consuming.

There are two strategies: one is to analyze the dependencies between operators and interleave the sgd_ops with the backward pass; the other is to replace the individual sgd_ops with a single sgd_group_op (see the sketch below).

Issue #8941 shows the result of the second strategy (using sgd_group_op).
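
To make the contrast concrete, here is a rough NumPy sketch of the two launch patterns. sgd_group_op below is only a stand-in for the fused operator discussed in #8941, not its real implementation; the point is that one pass over a flattened buffer replaces ~1000 tiny per-parameter updates:

```python
import numpy as np

# 1000 parameters of the same toy shape, plus their gradients.
params = [np.random.rand(64, 64) for _ in range(1000)]
grads = [np.random.rand(64, 64) for _ in range(1000)]
lr = 0.01

def sgd_op(param, grad, lr):
    """Per-parameter update: the real graph launches this kernel 1000 times."""
    param -= lr * grad

def sgd_group_op(params, grads, lr):
    """Grouped update: flatten everything into one buffer and update it in a
    single pass, paying the launch/dispatch overhead once instead of 1000x."""
    flat_p = np.concatenate([p.ravel() for p in params])
    flat_g = np.concatenate([g.ravel() for g in grads])
    flat_p -= lr * flat_g
    offset = 0
    for p in params:                      # scatter results back
        p.flat[:] = flat_p[offset:offset + p.size]
        offset += p.size

# Strategy today: one sgd_op per parameter ...
for p, g in zip(params, grads):
    sgd_op(p, g, lr)
# ... versus the grouped strategy: a single fused update.
sgd_group_op(params, grads, lr)
```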

chengduoZH (Contributor) commented Mar 13, 2018

Config and Env:

  • Input: 3 x 224 x 224
  • batch_size: 25
  • CentOS 6.3, Tesla P40, single card.

The comparison results before optimization:

                Speed           Memory
Fluid (before)  1.95 sec/iter   18341 MB
PyTorch         1.154 sec/iter  13359 MB
Fluid/PyTorch   1.68            1.3729

After optimizing the speed:

                    Speed           Memory
Fluid (opti_speed)  1.45 sec/iter   17222 MB
PyTorch             1.154 sec/iter  13359 MB
Fluid/PyTorch       1.256499133     1.289168351

After optimizing the memory usage:

                  Speed           Memory
Fluid (opti_mem)  1.93 sec/iter   14388 MB
PyTorch           1.154 sec/iter  13359 MB
Fluid/PyTorch     1.672443674     1.077026724

QiJune (Member) commented Mar 13, 2018

Now, if we choose the release-memory policy, the memory footprint is almost the same as PyTorch's.

However, the delete_var operator synchronizes the CUDA stream before releasing unused memory, which hurts computation performance.

We have to implement an AsyncExecutor that runs operators in parallel; that will ultimately solve this problem.
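
The scheduling idea behind such an executor could look roughly like the sketch below: each operator runs as soon as all of its input variables have been produced, so a delete_var that synchronizes its stream no longer serializes unrelated ops. This is only an illustration with a thread pool and made-up op names, not the actual AsyncExecutor design:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Each op lists the variables it reads and writes; it becomes runnable once
# every input exists. All op/variable names here are made up for illustration.
ops = [
    {"name": "conv_fwd",   "inputs": [],         "outputs": ["feat"]},
    {"name": "fc_fwd",     "inputs": ["feat"],   "outputs": ["logits"]},
    {"name": "delete_var", "inputs": ["feat"],   "outputs": []},  # would free feat
    {"name": "softmax",    "inputs": ["logits"], "outputs": ["prob"]},
]

ready_vars = set()
lock = Lock()

def run_op(op):
    print("running", op["name"])   # stand-in for launching the real kernel
    with lock:
        ready_vars.update(op["outputs"])

def async_run(ops):
    pending = list(ops)
    with ThreadPoolExecutor(max_workers=4) as pool:
        while pending:
            runnable = [op for op in pending
                        if all(v in ready_vars for v in op["inputs"])]
            if not runnable:
                raise RuntimeError("cycle or missing producer in the op graph")
            for op in runnable:
                pending.remove(op)
            # fc_fwd and delete_var both only need 'feat', so they run in the
            # same wave; a stream sync inside delete_var does not stall fc_fwd.
            list(pool.map(run_op, runnable))

async_run(ops)
```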
