Auto Parallel (#8891)
* add auto_parallel code

add auto_parallel pass

* Feat ap remove hierarchy cast (#7919)

* feat(AutoParallel): support remove parallel_cast ops

* feat(AutoParallel): export enable_auto_parallel_prune_parallel_cast_ops

* format code

* Fix add conv grad cost (#7972)

* feat(Conv): add grad computation cost

* fix ConvDataGrad computation cost

* update conv grad cost

* refine

* Auto parallel/fast collector (#7958)

* Try to speed up sbp collector.
However, throughput dropped

* Shrink the parallel candidates for the proxy node

* Print out some information and then refine

* Store the sbp set for each consumer

* Update binary set intersection

* Remove impossible parallel candidates from sbp proxy

* Refine binary set

* Add a Clear() in binary set

* Filter out those proxy candidates containing two
sbps from the same unique group

* refine

* Check spelling

* Clip useless edges

* AutoParallel mainstem algorithm add mutable_op_ctrl_edge (#8033)

* feat(AutoParallel): mainstem algorithm add mutable_op_ctrl_edge

* use if instead of std::max

* fix(AutoParallel): fix pooling computation cost function bug (#8147)

* [WIP] Fix auto parallel dump uniform sbp bug (#8330)

* fix(AutoParallel): fix auto parallel dump uniform sbp bug

* refine source op judgement

* update auto_parallel config (#8356)

* Refactor dump nd sbp for auto parallel (#8353)

* fix(AutoParallel): fix auto parallel dump uniform sbp bug

* feat(AutoParallel): add interface for op to dump nd_sbp to op_conf

* refactor(AutoParallel): refactor DumpNdSbpSignatureForOpConfFn

* rename Global to Singleton

* Refactor SbpEdge (#8684)

* refactor(AP): refactor SbpEdge

* Rename variables

* Add const for some functions

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

* Refactor auto parallel sbp node (#8712)

* Rename

* Code clean up

* Code clean up

* Code clean up and package up

* Rename

* Add const for some functions

* Refactor auto parallel sbp graph (#8722)

* Code clean up

* Package up

* Code clean up and package up in SbpNode and SbpEdge

* Rename

* Rename

* Rename mainstem to trunk

* Typo, small bugs and rename

* Rename and of format

* Refactor auto parallel rest (#8731)

* Package up SbpCollector

* Add const for SbpGraph

* Add const for SbpNode

* Add const for SbpEdge

* Add const for SbpCollector

* Add const, rename, and package up for BinarySet

* Rename for BinarySet

* Rename for SbpCollector

* Rename for SbpCollector

* Rename for algorithm utils

* Fix a bug for an unused function AddEntries()

* Rename for BinarySet

* Rename for SbpConstructor

* Rename for BoxingCollector

* Add const for sbp utils

* fix merge conflict

* Remove template for sbp signature (#8787)

* Remove template for sbp signature

* Remove _H_ from cpp files

* Remove namespace specifier oneflow::

* Remove namespace specifier oneflow::

* Of format

* Move the inline functions to cpp files

* Can not add inline specifier?

* Update oneflow/core/auto_parallel/sbp_graph.h

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Of format

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Refactor auto parallel class object stuff (#8835)

* Delete copy/move constructor/operator

* Move the deconstructor of SbpEdge to the cpp file

* Equal by address for Sbp data structure

* Replace sbp_sig_list_ with sbp_sig_obj_list_

* Fix auto parallel copy cost infer2 (#8788)

* Check the output shape for operator in auto parallel

* Return infinity for different sbps while is_mutable

* Update oneflow/core/auto_parallel/sbp_constructor.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Update oneflow/core/operator/operator.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* with output -> check output

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Refactor prune identity as much as possible (#8849)

* Prune a line of parallel cast ops

* Avoid repeated pruning

* Code clean up

* Remove identity op

* Update oneflow/core/job_rewriter/auto_parallel.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Fix auto parallel low throughput (#8876)

* Speed up after pruning identity

* Slight changes

* Refactor auto parallel final check (#8887)

* Of format

* Use const auto &

* Of format and rename

* Re-compute cost if steals sbp signatures

* Docs auto parallel doc (#8896)

* doc(AutoParallel): add auto parallel document framework

* docs(AutoParallel): add document

* fix typo

* refine document

* refine documentation

* Test alexnet for auto_parallel (#8917)

* test(AutoParallel): test alexnet for auto_parallel

* test(AutoParallel): test model add auto_parallel config

* Fix get sbp bug (#8939)

* Fix the bug of missing sbp for uniform op

* Speed up

* Add the missing sbp for optional input UserSourceOpTickInput

* Remove the repeated all-B sbp signature

* Add sbp for undefined UserSourceOpTickInput

* Resolve conflicts while merging master

* Recompute cost with time shape (#9009)

* Address comments

* fix merge conflict

* Address comments

* Disabled ZeRO when enabled AutoParallel (#9087)

fix(AutoParallel): disabled ZeRO when enabled AutoParallel

* Update oneflow/core/job_rewriter/optimizer_placement_optimization_pass.cpp

* Address comments

* Address comment.
GetComputationCostFn -> GetComputationCost

* Update oneflow/core/job_rewriter/auto_parallel.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* New interface for pr#9018

* Static analysis

* Fix ones like sbp bug and fix test import error in CI (#9123)

fix(AutoParallel): skip 1n1d sbp agreement check

* auto format by CI

* test(AutoParallel): skip acc check

* Address comments

* rename source op set nd_sbp function and add check

* fix typo

* Feat full auto parallel (#9140)

* Use B for inplace op and remove the check for sbp
while turning the auto parallelism on

* Slight change

* Not using B as the constraint

* Address comments

* add debug log for non-deleted cast ops

* update prune parallel cast op log

* rename auto_parallel_prune_parallel_cast_ops to enable_auto_parallel_ignore_user_sbp_config

Co-authored-by: wyg1997 <wangyinggang@foxmail.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
5 people authored Sep 27, 2022
1 parent 794fe3f commit f67ff82
Showing 68 changed files with 5,338 additions and 75 deletions.
70 changes: 70 additions & 0 deletions docs/source/auto_parallel.rst
@@ -0,0 +1,70 @@
Auto Parallelism
====================================================

As the scale of deep-learning models grows larger and larger, distributed training,
or parallelism, is needed. Data parallelism and model parallelism have been designed
to speed up training and to solve memory issues.

In OneFlow, the SBP signature enables users to configure a parallelism policy easily.
However, users still need to specify the SBP property for each operator, or at least most of them.
Users might spend a couple of days digging into the details of parallelism and still end up with
low throughput just because of a slight mistake in the SBP signature configuration.

.. note::

    It only works in :doc:`graph` mode.


Our strength
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To get rid of all those configurations for SBP signatures, we developed auto parallelism.
Configurations of placement are still necessary, since auto placement is not supported yet.
If you are reading this paragraph before you rush into any SBP details, then congratulations:
you do not need to learn SBP at all. You can start writing your code just as you would in CPU mode.
Auto parallelism will generate a fast strategy customized for your specific model,
its parameter sizes, and the number of available GPUs.


How to use auto parallelism?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You just need to enable the corresponding setting in the configuration of your
:doc:`graph` model.

Example::

    import oneflow as flow

    class SubclassGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()  # MUST be called
            # auto parallelism configuration
            self.config.enable_auto_parallel(True)
            # other configurations about auto parallelism
            # ......

        def build(self):
            pass

.. warning::

    If you enable auto parallelism, OneFlow will take care of the SBP configurations
    of operators, except for explicit ``to_global`` calls.


Configuration API for auto parallelism
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. currentmodule:: oneflow.nn.graph.graph_config.GraphConfig

.. autosummary::
    :toctree: generated
    :nosignatures:

    enable_auto_parallel
    enable_auto_parallel_ignore_user_sbp_config
    set_auto_parallel_computation_cost_ratio
    set_auto_parallel_wait_time
    enable_auto_parallel_mainstem_algo
    enable_auto_parallel_sbp_collector

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -32,6 +32,7 @@ OneFlow upholds the core concept and architecture of static compilation and stre
nn.init
optim
graph
auto_parallel
image
utils.data
utils.global_view
33 changes: 33 additions & 0 deletions oneflow/core/auto_parallel/algorithm_util.cpp
@@ -0,0 +1,33 @@
/*
Copyright 2020 The OneFlow Authors. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

#include "oneflow/core/auto_parallel/algorithm_util.h"

namespace oneflow {
namespace auto_parallel {

// Inverse function of order.
// The reason why we need inverse_order, a.k.a. id2order, instead of id2value is to eliminate
// equality. For example, suppose we have v[0] < v[1] = v[2] < v[3]. We can not tell whether v[1]
// comes before or after v[2] with comp(v[1], v[2]). But if we transform it into an order,
// order[0] < order[1] < order[2] < order[3], we know the strict order.
void InverseOrder(const std::vector<int32_t>& order, std::vector<int32_t>& inverse_order) {
  inverse_order.resize(order.size());
  for (int32_t i = 0; i < order.size(); i++) { inverse_order[order[i]] = i; }
}

} // namespace auto_parallel
} // namespace oneflow
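
For reference, a minimal sketch of what InverseOrder computes, assuming it is compiled together with the file above; the standalone main() driver and the sample values are illustrative only, not part of the commit.

#include <cstdint>
#include <iostream>
#include <vector>
#include "oneflow/core/auto_parallel/algorithm_util.h"

int main() {
  using oneflow::auto_parallel::InverseOrder;
  // Suppose values v = {7, 3, 5, 5} were sorted by index, giving order = {1, 2, 3, 0},
  // i.e. v[1] <= v[2] <= v[3] <= v[0].
  std::vector<int32_t> order = {1, 2, 3, 0};
  std::vector<int32_t> inverse_order;
  InverseOrder(order, inverse_order);
  // inverse_order[i] is the rank of v[i]: {3, 0, 1, 2}.
  // Although v[2] == v[3], their ranks 1 and 2 are strictly ordered.
  for (int32_t rank : inverse_order) { std::cout << rank << " "; }
  std::cout << std::endl;
  return 0;
}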
82 changes: 82 additions & 0 deletions oneflow/core/auto_parallel/algorithm_util.h
@@ -0,0 +1,82 @@
/*
Copyright 2020 The OneFlow Authors. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
#ifndef ONEFLOW_CORE_AUTO_PARALLEL_ALGORITHM_UTIL_H_
#define ONEFLOW_CORE_AUTO_PARALLEL_ALGORITHM_UTIL_H_

#include <vector>
#include <cstdlib>
#include <algorithm>
#include <unordered_map>

namespace oneflow {
namespace auto_parallel {

// This function removes the i-th element from a vector in constant time
// by swapping the last element into slot i. The vector must not rely on element ordering.
// Be careful with this function: when removing elements while traversing,
// make sure the traversal goes from back to front.
template<class T>
void RemoveFrom(std::vector<T>& v, int32_t i) {
  v[i] = v.back();
  v.pop_back();
}

template<class T>
void CheckAndRemoveFrom(std::vector<T>& v, T& t) {
  for (int32_t i = v.size() - 1; i >= 0; i--) {
    if (v[i] == t) {
      RemoveFrom<T>(v, i);
      break;
    }
  }
}

// Inverse function, which transfers a vector to an unordered_map.
template<class T>
void InverseFunction(const std::vector<T>& v, std::unordered_map<T, int32_t>& inverse_map) {
  inverse_map.clear();
  for (int32_t i = 0; i < v.size(); i++) { inverse_map[v[i]] = i; }
}

// When you want to sort something but can not move any elements, use an order.
// DecideOrder computes the sorting order of a list v, such that
//   v[order[i]] < v[order[j]] for all i < j.
// With a user-defined comparison, we have
//   comp(v[order[i]], v[order[j]]) == true for all i < j.
template<class T, class Compare>
void DecideOrder(const T& v, std::vector<int32_t>& order, const Compare& comp) {
  // Initialize order
  order.resize(v.size());
  for (int32_t i = 0; i < v.size(); i++) { order[i] = i; }
  // sort
  std::sort(order.begin(), order.end(), [&](int32_t i, int32_t j) { return comp(v[i], v[j]); });
}

// Inverse function of order.
// The reason why we need inverse_order, a.k.a. id2order, instead of id2value is to eliminate
// equality. For example, suppose we have v[0] < v[1] = v[2] < v[3]. We can not tell whether v[1]
// comes before or after v[2] with comp(v[1], v[2]). But if we transform it into an order,
// order[0] < order[1] < order[2] < order[3], we know the strict order.
void InverseOrder(const std::vector<int32_t>& order, std::vector<int32_t>& inverse_order);

} // namespace auto_parallel

static const double kFloatDeviationMinus = 0.9999999;
static const double kFloatDeviationPlus = 1.0000001;

} // namespace oneflow

#endif // ONEFLOW_CORE_AUTO_PARALLEL_ALGORITHM_UTIL_H_
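
For reference, a small sketch exercising DecideOrder, RemoveFrom, and CheckAndRemoveFrom from the header above; the sample data and the standalone main() driver are illustrative only, not part of the commit.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
#include "oneflow/core/auto_parallel/algorithm_util.h"

int main() {
  using namespace oneflow::auto_parallel;

  // DecideOrder sorts indices instead of moving the elements themselves.
  std::vector<std::string> names = {"conv", "add", "matmul"};
  std::vector<int32_t> order;
  DecideOrder(names, order, [](const std::string& a, const std::string& b) { return a < b; });
  // order == {1, 0, 2}: names[1] < names[0] < names[2].
  for (int32_t idx : order) { std::cout << names[idx] << " "; }  // prints: add conv matmul
  std::cout << std::endl;

  // RemoveFrom erases in O(1) by swapping the last element into slot i.
  std::vector<int32_t> v = {10, 20, 30, 40};
  RemoveFrom(v, 1);               // v becomes {10, 40, 30}
  int32_t target = 30;
  CheckAndRemoveFrom(v, target);  // v becomes {10, 40}
  for (int32_t x : v) { std::cout << x << " "; }
  std::cout << std::endl;
  return 0;
}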
147 changes: 147 additions & 0 deletions oneflow/core/auto_parallel/binary_set.cpp
@@ -0,0 +1,147 @@
/*
Copyright 2020 The OneFlow Authors. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
#include "oneflow/core/auto_parallel/binary_set.h"

namespace oneflow {
namespace auto_parallel {

namespace {
// A static function for initialization of the log_2 mapping
std::unordered_map<BinarySetEntryType, int32_t> InitLog2() {
  std::unordered_map<BinarySetEntryType, int32_t> log_2;
  for (int32_t i = 0; i < 8 * sizeof(BinarySetEntryType); i++) {
    log_2[static_cast<BinarySetEntryType>(1) << i] = i;
  }
  return log_2;
}

// Initialization of log_2 mapping
// Take log2 of an integer value: 2^n -> n.
const std::unordered_map<BinarySetEntryType, int32_t> log_2 = InitLog2();

} // namespace

// Constructor
BinarySet::BinarySet(int32_t size_of_set) : size_of_set_(size_of_set) {
  int32_t k = (size_of_set - 1) / bit_entry_type_ + 1;
  binary_set_values_.resize(k, 0);
}

// Initialization if needed
void BinarySet::Initialize(int32_t size_of_set) {
  size_of_set_ = size_of_set;
  int32_t k = (size_of_set - 1) / bit_entry_type_ + 1;
  binary_set_values_.resize(k, 0);
}

// Clear all the elements in the set
void BinarySet::Clear() { binary_set_values_.assign(binary_set_values_.size(), 0); }

// Check if the i-th element is in this subset
bool BinarySet::CheckExistence(int32_t i) const {
  int32_t k = i / bit_entry_type_;
  int32_t j = i % bit_entry_type_;
  return bool((binary_set_values_[k] >> j) & 1);
}

// Add the i-th element into this subset
void BinarySet::AddEntry(int32_t i) {
  int32_t k = i / bit_entry_type_;
  int32_t j = i % bit_entry_type_;
  binary_set_values_[k] |= (static_cast<BinarySetEntryType>(1) << j);
}
// Take the i-th element out of this subset
void BinarySet::DeleteEntry(int32_t i) {
  int32_t k = i / bit_entry_type_;
  int32_t j = i % bit_entry_type_;
  binary_set_values_[k] &= ~(static_cast<BinarySetEntryType>(1) << j);
}
// Get the union with another subset and store it into u
void BinarySet::UnionTo(const BinarySet& bs, BinarySet& u) {
  for (int32_t k = 0; k < binary_set_values_.size(); k++) {
    u.binary_set_values_[k] = binary_set_values_[k] | bs.binary_set_values_[k];
  }
}
// If this binary set intersects another one
bool BinarySet::IfIntersect(const BinarySet& bs) const {
  int32_t min_bs_size = std::min(binary_set_values_.size(), bs.binary_set_values_.size());
  for (int32_t k = 0; k < min_bs_size; k++) {
    if (binary_set_values_[k] & bs.binary_set_values_[k]) { return true; }
  }
  return false;
}
// Get the intersection with another subset and store it into i
void BinarySet::IntersectionTo(const BinarySet& bs, BinarySet& i) const {
  int32_t min_bs_size = std::min(binary_set_values_.size(), bs.binary_set_values_.size());
  if (min_bs_size > i.binary_set_values_.size()) { i.binary_set_values_.resize(min_bs_size, 0); }
  for (int32_t k = 0; k < binary_set_values_.size(); k++) {
    i.binary_set_values_[k] = binary_set_values_[k] & bs.binary_set_values_[k];
  }
}
// Count number of elements in this subset
int32_t BinarySet::Total() const {
  int32_t t = 0;
  for (int32_t k = 0; k < binary_set_values_.size(); k++) {
    BinarySetEntryType bsv = binary_set_values_[k];
    bsv = (bsv & 0x5555555555555555) + ((bsv >> 1) & 0x5555555555555555);
    bsv = (bsv & 0x3333333333333333) + ((bsv >> 2) & 0x3333333333333333);
    bsv = (bsv & 0x0F0F0F0F0F0F0F0F) + ((bsv >> 4) & 0x0F0F0F0F0F0F0F0F);
    bsv = (bsv & 0x00FF00FF00FF00FF) + ((bsv >> 8) & 0x00FF00FF00FF00FF);
    bsv = (bsv & 0x0000FFFF0000FFFF) + ((bsv >> 16) & 0x0000FFFF0000FFFF);
    // bsv = (bsv & 0x00000000FFFFFFFF) + ((bsv >> 32) & 0x00000000FFFFFFFF);
    t += int32_t(bsv);
  }
  return t;
}

// Output all the elements in the subset
void BinarySet::Output(std::vector<int32_t>& out) const {
  out.clear();
  for (int32_t i = 0; i < size_of_set_; i++) {
    if (CheckExistence(i)) { out.emplace_back(i); }
  }
}

// Output all the elements in the subset
void BinarySet::QuickOutput(std::vector<int32_t>& out) const {
  out.clear();
  for (int32_t i = 0; i < binary_set_values_.size(); i++) {
    BinarySetEntryType x = binary_set_values_[i];
    BinarySetEntryType y = 0;
    while (x) {
      y = x;
      x &= x - 1;
      out.emplace_back(i * BinarySet::bit_entry_type_ + log_2.find(y - x)->second);
    }
  }
}

// Add elements of input into this subset
void BinarySet::AddEntries(std::vector<int32_t>& in) {
  for (int32_t i : in) { AddEntry(i); }
}

// If two binary sets are equal to each other
bool BinarySet::operator==(const BinarySet& rhs) const {
  if (size_of_set_ != rhs.size_of_set_) { return false; }
  for (int32_t i = 0; i < binary_set_values_.size(); i++) {
    if (binary_set_values_[i] != rhs.binary_set_values_[i]) { return false; }
  }
  return true;
}

} // namespace auto_parallel
} // namespace oneflow
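
For reference, a small sketch of how the BinarySet above can be used, assuming the class is declared in oneflow/core/auto_parallel/binary_set.h (not shown in this excerpt); the sample values and the standalone main() driver are illustrative only, not part of the commit.

#include <cstdint>
#include <iostream>
#include <vector>
#include "oneflow/core/auto_parallel/binary_set.h"

int main() {
  using oneflow::auto_parallel::BinarySet;

  BinarySet a(100);  // a subset of {0, 1, ..., 99}, initially empty
  a.AddEntry(3);
  a.AddEntry(64);
  std::cout << a.CheckExistence(3) << " " << a.CheckExistence(5) << std::endl;  // 1 0
  std::cout << a.Total() << std::endl;                                          // 2

  BinarySet b(100);
  b.AddEntry(64);
  std::cout << a.IfIntersect(b) << std::endl;  // 1, since both contain 64

  std::vector<int32_t> elements;
  a.QuickOutput(elements);  // elements == {3, 64}
  for (int32_t e : elements) { std::cout << e << " "; }
  std::cout << std::endl;
  return 0;
}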
