Feature/insert reduce_op to parallel exe #10096

Conversation

Contributor

@chengduoZH chengduoZH commented Apr 20, 2018

This PR's work includes:

  • Insert broadcast_op and reduce_op into the SSA graph.
  • Change the parameter-update strategy. If the model has 100 parameters and runs on 4 cards, under the new strategy each card updates only 25 of the parameters rather than all of them, and then broadcasts its updated parameters to the other cards. These operations are carried out by reduce_op and broadcast_op (a minimal sketch of the assignment follows this list).
  • If the program runs on GPU, broadcast_op calls ncclBcast; otherwise it falls back to tensor copying.
  • If reduce_op's inputs are sparse, reduce_op concatenates them; otherwise it uses NCCL or tensor copying according to the input's place.
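A minimal sketch of the parameter-to-device assignment described above, using hypothetical names rather than the PR's actual SSA-graph builder code:

#include <string>
#include <vector>

// Hypothetical helper (not the PR's code): round-robin ownership of
// parameters across devices. Device d reduces and updates owned[d], then
// broadcasts those parameters to the other devices.
std::vector<std::vector<std::string>> AssignParamsToDevices(
    const std::vector<std::string> &param_names, size_t num_devices) {
  std::vector<std::vector<std::string>> owned(num_devices);
  for (size_t i = 0; i < param_names.size(); ++i) {
    owned[i % num_devices].push_back(param_names[i]);
  }
  return owned;  // e.g. 100 parameters on 4 cards -> 25 per card
}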

I compared the performance before and after this optimization; the results are in seconds per batch:

        4-card, 4 threads   4-card, 8 threads   4-card, 16 threads
before  1.004974976         0.967834597         1.021792168
after   0.936332986         0.920828621         1.077100386

That is to say, parallel_exe now supports updating sparse parameters, and at the same time its performance is improved.

@reyoung reyoung force-pushed the feature/insert_reduce_to_parallel_exe branch from cf3a209 to a22d385 Compare April 20, 2018 09:40
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 2 times, most recently from f9eafab to 898b196 Compare April 20, 2018 10:33
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 7 times, most recently from f6113f6 to 9aa2b65 Compare April 20, 2018 17:05
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 2 times, most recently from aa04680 to f5749fe Compare April 21, 2018 02:51
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from f5749fe to cec94e1 Compare April 21, 2018 03:56
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from b932a66 to 1389575 Compare April 22, 2018 14:45
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 1389575 to 20ba594 Compare April 22, 2018 16:06
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 2 times, most recently from b4f0207 to 8b0adbe Compare April 24, 2018 03:24
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 8b0adbe to f965e9a Compare April 24, 2018 05:12
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 9dd6517 to 7b58d47 Compare April 24, 2018 06:39
Contributor

@panyx0718 panyx0718 left a comment

Could you further verify the performance improvement in different environments? The improvement seems marginal in the 4-card, 8-thread case, while this change introduces significant complexity.

call();
}
}
if (*out_handle != *in_var_handle) {
Contributor

What is out_handle? What is the purpose of this code block? Can you add some comments to explain it?

Contributor Author

Broadcast sends the root device's data to all the other devices, so if the input and the output on the root device are different, we need to add a copy operation.

Contributor

What is the root device, and what are the input and output of the root device?

Contributor Author

Sorry, the description above was a bit obscure. I regard the source device, which sends the data to the other devices, as the root node.
The function of broadcast_op is to send the input data to all the devices, so one of the outputs must be on the same device as the input; if their values (tensor or selected_rows) are not the same variable, we have to do the memory copy manually.
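A minimal sketch of that idea with simplified types (not the actual BroadcastOpHandle): broadcast fills every non-root output, and the root's own output gets an explicit copy only when it is a different variable from the input.

#include <cstddef>
#include <functional>
#include <vector>

struct Var { int id; };  // stand-in for the variable behind a VarHandle

void RunBroadcast(Var *in, const std::vector<Var *> &outs, size_t root_idx,
                  const std::function<void(Var *, Var *)> &copy) {
  for (size_t i = 0; i < outs.size(); ++i) {
    if (i == root_idx) {
      // Broadcast does not write to the root itself; copy manually only if
      // the root's output is not the same variable as the input.
      if (outs[i] != in) copy(in, outs[i]);
    } else {
      copy(in, outs[i]);  // device-to-device broadcast in the real op
    }
  }
}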

BroadcastOpHandle(const std::vector<Scope *> &local_scopes,
const std::vector<platform::Place> &places);
const std::vector<platform::Place> &places, bool use_nccl)
Contributor

use_nccl_ should be false, so we don't need the use_nccl argument here?

Contributor Author

BroadcastOp can also broadcast CPU data, which cannot use NCCL.

Contributor

Right, that's why we don't need the use_nccl argument here for the CPU case?

Contributor Author

Thanks for pointing that out. I added this parameter in order to compare the performance of memory copying and ncclBcast; the result is that ncclBcast is faster, so use_nccl can be removed.

#endif

if (use_gpu_) {
#ifndef PADDLE_WITH_CUDA
Contributor

There are so many #ifdefs here; can we improve this block of code?
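One possible cleanup, sketched here with a hypothetical helper rather than the PR's code: confine the single PADDLE_WITH_CUDA check to one function so the call site has an ordinary branch instead of nested #ifdefs.

#include <stdexcept>

// Hypothetical helper: keep the preprocessor check in one place.
inline void BroadcastData(bool use_gpu) {
  if (use_gpu) {
#ifdef PADDLE_WITH_CUDA
    // ... NCCL broadcast path ...
#else
    throw std::runtime_error("built without CUDA but use_gpu is true");
#endif
  } else {
    // ... CPU tensor-copy path ...
  }
}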

// Wait input done, this Wait is asynchronous operation
WaitInputVarGenerated(in_var_handles);

std::vector<int64_t> out_rows;
std::vector<Tensor> in_tensors;
std::vector<platform::Place> in_places;
// std::vector<platform::Place> in_places;
Contributor

this shouldn't exist

Contributor Author

Done

in_places.push_back(in_p);
PADDLE_ENFORCE_EQ(in_p.which(), pre_place.which(),
"Places must be all on CPU or all on CUDA.");
// in_places.push_back(in_handle->place_);
Contributor

remove?

Contributor Author

Done

std::vector<std::unordered_set<std::string>> var_name_on_devices;
std::vector<std::unordered_set<std::string>> bcast_var_name_set;

var_name_on_devices.resize(places_.size());
Contributor

should this be reserve()? Is it different from resize?

Contributor Author

Refer to: https://stackoverflow.com/questions/7397768/choice-between-vectorresize-and-vectorreserve

The resize() method (and passing an argument to the constructor is equivalent to that) will insert or delete the appropriate number of elements to make the vector the given size (it has an optional second argument to specify their value). It affects size(), iteration will go over all those elements, push_back will insert after them, and you can access them directly using operator[].

The reserve() method only allocates memory and leaves it uninitialized. It only affects capacity(); size() is unchanged. There is no value for the objects, because nothing is added to the vector. If you then insert elements, no reallocation will happen, because it was done in advance, but that is the only effect.
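A small self-contained demo of the difference (standard C++, nothing Paddle-specific): resize() changes size() and value-initializes the elements, so operator[] is immediately valid, while reserve() only raises capacity().

#include <cassert>
#include <vector>

int main() {
  std::vector<int> a, b;
  a.resize(4);   // size() == 4, elements value-initialized to 0
  b.reserve(4);  // size() == 0, capacity() >= 4, no elements yet
  assert(a.size() == 4 && a[2] == 0);
  assert(b.size() == 0 && b.capacity() >= 4);
  b.push_back(1);  // no reallocation: capacity was reserved in advance
  return 0;
}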

@@ -55,7 +55,7 @@ struct ReduceOpHandle : public OpHandleBase {

std::string Name() const override;

bool IsMultiDeviceTransfer() override { return false; };
bool IsMultiDeviceTransfer() override { return true; };
Contributor

I'm not so sure about this now that many things have changed.

@@ -28,7 +28,9 @@ def __init__(self,
loss_name=None,
main_program=None,
num_threads=None,
threads_per_dev=None,
Contributor

It seems confusing to have both num_threads and threads_per_dev

Contributor Author

Done

auto dev_ctx = dev_ctxes_.at(out_p);
RunAndRecordEvent(out_p, [in_tensor, out_var, dev_ctx, out_p] {
paddle::framework::TensorCopy(
in_tensor, out_p, *(dev_ctx),
Contributor

No need for the parentheses around dev_ctx?

Contributor Author

Have removed.

auto *out = out_var_handles[j];
auto *out_var = var_scopes.at(out->scope_idx_)->FindVar(out->name_);

if (*out != *in_var_handle) {
Contributor

Are you comparing VarHandle instead of VarHandle*? Is it correct?
Hopefully, we don't use "auto" everywhere.

Contributor Author

Are you comparing VarHandle instead of VarHandle*?

Line 90 compares the instances, not the pointers.

Is it correct?

The pointers of the input (in_var_handle) and the output (out) must be different, because the output VarHandle is always newly created (see the linked code).

Hopefully, we don't use "auto" everywhere.

Thanks, I will correct that later.

}

int type = platform::ToNCCLDataType(in_tensor.type());
all_reduce_calls.emplace_back([=] {
Contributor

it is "broadcast" not "all_reduce"?

Contributor Author

Yes, thanks!

@@ -30,7 +30,8 @@ def __init__(self,
num_threads=None,
allow_op_delay=False,
share_vars_from=None,
customize_loss_grad=False):
customize_loss_grad=False,
use_nccl_allreduce=True):
Contributor

any comments?

Contributor Author

Done

@@ -565,24 +568,31 @@ def test_parallel_testing(self):
depth = 8
mix_hidden_lr = 1e-3
embedding_name = 'emb'
is_sparse = True
use_nccl_allreduce = False
Contributor

Does it cover is_sparse=False and use_nccl_allreduce=True?

Contributor Author

Thanks, I will add a unit test.


bool operator!=(const VarHandle& o) const {
return o.generated_op_ != generated_op_ || o.name_ != name_ ||
o.scope_idx_ != scope_idx_;
Contributor

@panyx0718 panyx0718 Apr 24, 2018

what about version_ and place_? this seems dangerous

Contributor Author

Thanks. This operator is used here:

if (*out != *in_var_handle)

Their version_ can be different, but the place_... Let me think about it again.

Collaborator

There is no need to implement != field by field; you can just return !this->operator==(o);

Contributor Author

Done.
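A minimal sketch of the suggested simplification, assuming VarHandle defines (or gains) a matching operator== over the same fields; the member != is then just its negation:

// Assumes VarHandle::operator== already exists; note the !, which is easy
// to drop by accident and would otherwise invert the meaning.
bool VarHandle::operator!=(const VarHandle &o) const {
  return !(*this == o);
}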

@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 3 times, most recently from 55d542c to be5a84b Compare April 24, 2018 16:11
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from be5a84b to ea78be2 Compare April 25, 2018 11:19
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 93daa01 to c0a3746 Compare May 2, 2018 10:39
@chengduoZH chengduoZH closed this May 25, 2018