An experiment of fusing all optimize ops into one #8941

Closed
jacquesqiao opened this issue Mar 9, 2018 · 1 comment


@jacquesqiao
Member

Background

In Fluid's current design, the last part of a program is a set of optimization-related operators. For example, in the se_resnext example, the last operators are a long list of SGD ops, as the timeline below shows:
(timeline screenshot: a sequence of small SGD ops at the end of the program)

After discussing with @chengduoZH and @panyx0718, we think that these little ops currently waste a lot of time launching kernels, so we may be able to fuse them into one big op.

There are two parts to this work:

  1. Add a fused SGD op that takes a list of parameters, gradients, and learning rates as its input. This is done by @chengduoZH in PR [Don't merge] Add sgd group #8869. (A conceptual sketch of such a fused update follows this list.)
  2. Add a fuse transpiler that fuses all SGD ops into the big one. This is done in PR fuse optimize op transpiler #8940.
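
For illustration only, here is a minimal sketch of what such a fused SGD op could do on the host side, assuming it receives lists of parameter/gradient buffers, learning rates, and sizes. This is not the code from #8869; all names (`SGDKernel`, `SgdGroup`) are hypothetical. One operator invocation walks the whole group, but still issues one kernel launch per parameter:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Plain per-element SGD update: param -= lr * grad.
__global__ void SGDKernel(float* param, const float* grad, float lr, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) param[i] -= lr * grad[i];
}

// Hypothetical host-side body of a fused "sgd_group" op: one op call,
// but still one kernel launch per parameter in the group.
void SgdGroup(const std::vector<float*>& params,
              const std::vector<const float*>& grads,
              const std::vector<float>& lrs,
              const std::vector<int>& sizes,
              cudaStream_t stream) {
  for (size_t k = 0; k < params.size(); ++k) {
    if (sizes[k] == 0) continue;
    int threads = 256;
    int blocks = (sizes[k] + threads - 1) / threads;
    SGDKernel<<<blocks, threads, 0, stream>>>(params[k], grads[k],
                                              lrs[k], sizes[k]);
  }
}
```

This removes the framework-level overhead of scheduling one op per parameter, but the per-parameter kernel launches remain; merging them into a single kernel is the remaining optimization mentioned below.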

Experiment result

  • Timeline after fusing.
    (timeline screenshot after fusing)

  • Running time for all SGD ops. The running time has been reduced by half. There is still room for improvement if we can merge all the CUDA kernels into one in the sgd_group op.

(running-time comparison screenshot)

Notice

This is just an experiment with one kind of solution; there are many other possible solutions, for example multi-threaded async execution, that need to be discussed.

@chengduoZH
Contributor

That is awesome!
I will try to use only one CUDA kernel to do all the SGD operations.
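
A minimal sketch of that idea, assuming the parameter and gradient pointers, learning rates, and segment offsets have already been packed into device-side arrays (all names hypothetical, not an actual PaddlePaddle kernel): a single grid-stride kernel covers every element of every parameter and locates its segment with a binary search over the offsets.

```cuda
#include <cuda_runtime.h>

// One kernel launch updates every parameter in the group.
// offsets has num_params + 1 entries; offsets[k] is the start of segment k
// in the flattened element index space, offsets[num_params] == total_elems.
__global__ void FusedSGDKernel(float* const* params, const float* const* grads,
                               const float* lrs, const int* offsets,
                               int num_params, int total_elems) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < total_elems;
       i += blockDim.x * gridDim.x) {
    // Binary search for the segment (parameter) that element i belongs to.
    int lo = 0, hi = num_params;
    while (lo + 1 < hi) {
      int mid = (lo + hi) / 2;
      if (offsets[mid] <= i) lo = mid; else hi = mid;
    }
    int j = i - offsets[lo];                   // index within parameter lo
    params[lo][j] -= lrs[lo] * grads[lo][j];   // vanilla SGD update
  }
}
```

A single launch such as `FusedSGDKernel<<<num_blocks, 256, 0, stream>>>(d_params, d_grads, d_lrs, d_offsets, num_params, total_elems);` would then replace all the per-parameter launches, at the cost of one extra pointer indirection and an offset lookup per element.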
