Skip to content

GPU'less compile #307

Merged
daniellowell merged 78 commits intodevelopfrom
alex_parallel_compile
Oct 7, 2020
Merged

GPU'less compile #307
daniellowell merged 78 commits intodevelopfrom
alex_parallel_compile

Conversation

@alexandraBara
Copy link
Copy Markdown
Contributor

@alexandraBara alexandraBara commented Jun 23, 2020

I am posting this PR preemptively so Tuna people can see my changes.

The idea is:

  1. Set env var to compile only (for desired arch/num_cu) - through Precompile kernels and dump blobs into kern_db
  2. Move kern cache to machine where the kernels will execute
  3. Run through regular invokers code which will compile (if kernel not already present in kern_db) and execute. Note: the kernel will already be in kern_db from the compile step.

The following steps need to happen:
build with MIOPEN_SQLITE_KERN_CACHE=ON

To compile we need to set the following env vars:
export MIOPEN_DEBUG_CONV_GEMM=0 (GEMM code path has not been implemented for this)
export MIOPEN_CUSTOM_CACHE_DIR=test (this will be appended to ~/.cache)
export MIOPEN_FIND_ENFORCE=4
export MIOPEN_COMPILE_AND_RUN=0 (compile only, escape generic search early)
export MIOPEN_DEVICE_ARCH=gfx906 (or whatever other supported arch)
export MIOPEN_DEVICE_CU=60 (or any other num_cu)

To run:
export MIOPEN_DEBUG_CONV_GEMM=0 (this can also be left out and invokers will actually compile and run as well)
export MIOPEN_CUSTOM_CACHE_DIR=test (must be same location as compile so it can find compiled kernels)
export MIOPEN_FIND_ENFORCE=4
export MIOPEN_COMPILE_AND_RUN=2 (anything but 0, can also be left unset)

I have tested as such:

  • compile only(for arch/num_cu different than my own machine) and check kern_db to make sure its populated with compiled kernels.
  • moved ukdb to machine with arch/num_cu for which it was compiled for.
  • set appropriate env vars to pick up custom cache location and run.
  • executed above steps for both hip and opencl
  • tested with rocm 3.5

Feel free to propose some actual utest I can add.

Blocked by: #203

Comment thread src/hipoc/hipoc_program.cpp Outdated
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* IMPLIED, INCnLnUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhhh, what happened here?

@daniellowell
Copy link
Copy Markdown

@DrizztDoUrden Please review this PR.

@alexandraBara
Copy link
Copy Markdown
Contributor Author

@alexandraBara This change affects +989 −1,791 lines. Are all of these changes really required?

remember this is branched of ddu/invoker-tune. most of the changes are his until he merges his branch the diff will be quite big.

In this case could you please change base branch of this PR to ddu/invoker-tune? Please see https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/changing-the-base-branch-of-a-pull-request. This would greatly simplify reviewing.

I also recommend marking PR as a Draft to prevent it from merging into ddu/invoker-tune. Thanks.

Vasili's branch went in, so now its a PR from develop. Is this ok?

Comment thread src/binary_cache.cpp
Comment thread src/hipoc/hipoc_kernel.cpp
Comment thread src/ocl_kernel.cpp
JehandadKhan
JehandadKhan previously approved these changes Oct 5, 2020
@alexandraBara alexandraBara changed the title Preemptive PR on GPU'less compile GPU'less compile Oct 6, 2020
@daniellowell daniellowell merged commit 6fd907b into develop Oct 7, 2020
daniellowell pushed a commit that referenced this pull request Oct 18, 2020
* Implemented igemm data invokers
* Missing invoker
* Removed accidental redundant unused FFT calls
* Fixed kernel name split
* ConvHipImplicitGemmBwdDataV4R1 invoker
* Removed precompilation duplication
* Removed RunAndMeasureSolution's from header
* Finished merginr RaMS into invokers
* Self-copy/self-move tidy fix
* Fixed is_detected_v tidy warning
* fixed conv_common.hpp build
* Updated static_assert to use {}
* env vars for custom cache location and env var to spoof arch
* added PrecompileKernels fct call
* calling PrecompileKernels
* hip code path for compiling only, tobeadded: ocl
* added env var for num_cu and removed custom cache dir
* added custom bin cache location to avoid MIOPEN_VERSION_TWEAK
* implemented GPU'less compiling for ocl backend
* [Squashed] Generic search with invokers
* Missing PrecompileSoulutions
* taking bin cache location as full path, and updated gen_search to not throw escape err
Co-authored-by: Vasilii Filippov <vasja.filippov94@outlook.com>
Co-authored-by: Vasilii Filippov <vasilii.filippov@amd.com>
Co-authored-by: daniellowell <daniel.lowell@amd.com>
Comment thread src/binary_cache.cpp
Comment thread src/binary_cache.cpp
Comment thread src/binary_cache.cpp
MIOPEN_STRINGIZE(MIOPEN_VERSION_TWEAK);
auto p = boost::filesystem::path{miopen::ExpandUser(cache_dir)} / version;

const char* const custom = miopen::GetStringEnv(MIOPEN_CUSTOM_CACHE_DIR{});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Quality] This is developer's feature, therefore at least DEBUG must be used in the name. And this is binary cache, let's indicate that as well.

[Recommendation] MIOPEN_DEBUG_BINARY_CACHE_DIR

Comment thread src/hip/handlehip.cpp
Comment on lines +476 to +480
const char* const num_cu = miopen::GetStringEnv(MIOPEN_DEVICE_CU{});
if(num_cu != nullptr && strlen(num_cu) > 0)
{
return boost::lexical_cast<std::size_t>(num_cu);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Quality]

  • Developers' envvar should contain DEBUG
    • [Recommendation] MIOPEN_DEBUG_ENFORCE_DEVICE_CU.
  • boost::lexical_cast should not be used.
  • All the checks and conversions should be moved to a function that performs the job once, to minimize delays.

[Hint] We have Value() in env.hpp for this purpose.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially resolved in #981

Comment thread src/hip/handlehip.cpp

std::string Handle::GetDeviceName() const
{
const char* const arch = miopen::GetStringEnv(MIOPEN_DEVICE_ARCH{});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Quality] This change is not necessary

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly are you referring to? Returning the env var specified over the actual arch we are running on is necessary to compile for a different architecture.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have MIOPEN_DEBUG_ENFORCE_DEVICE. See #307 (comment)

Comment on lines +349 to +355
const char* const c_and_r = miopen::GetStringEnv(MIOPEN_COMPILE_AND_RUN{});
std::string compile_and_run;
if(c_and_r != nullptr && strlen(c_and_r) > 0)
{
compile_and_run = c_and_r;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary.

{
current_solution = s.GetSolution(context, current_config, true);

if(compile_and_run == "0")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if(IsEnabled(MIOPEN_DEBUG_COMPILE_ONLY{}))

kernels.push_back(kernel);
}

std::vector<Program> programs = PrecompileKernels(profile_h, kernels);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Quality] std::ignore = ...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Note] This is just a waste of time for developers' builds (BUILD_DEV=On). Not that important, just FYI.

Comment on lines +158 to +162
OCLKernel(SharedProgramPtr p, const std::string& kernel_name)
: program(p), kernel(CreateKernel(p.get(), kernel_name))
{
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Quality] Do not add a new ctor. Just update existing one: read MIOPEN_DEBUG_COMPILE_ONLY and skip loading and running the kernels, if enabled.

}

std::vector<Program> programs = PrecompileKernels(profile_h, kernels);
continue;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Quality] You may wish adding a new monitoring method to Heartbeat and using it here, to show signs of life when log level < 6.

alexandraBara pushed a commit that referenced this pull request Nov 16, 2020
JehandadKhan pushed a commit that referenced this pull request Nov 20, 2020
* Implemented igemm data invokers
* Missing invoker
* Removed accidental redundant unused FFT calls
* Fixed kernel name split
* ConvHipImplicitGemmBwdDataV4R1 invoker
* Removed precompilation duplication
* Removed RunAndMeasureSolution's from header
* Finished merginr RaMS into invokers
* Self-copy/self-move tidy fix
* Fixed is_detected_v tidy warning
* fixed conv_common.hpp build
* Updated static_assert to use {}
* env vars for custom cache location and env var to spoof arch
* added PrecompileKernels fct call
* calling PrecompileKernels
* hip code path for compiling only, tobeadded: ocl
* added env var for num_cu and removed custom cache dir
* added custom bin cache location to avoid MIOPEN_VERSION_TWEAK
* implemented GPU'less compiling for ocl backend
* [Squashed] Generic search with invokers
* Missing PrecompileSoulutions
* taking bin cache location as full path, and updated gen_search to not throw escape err
Co-authored-by: Vasilii Filippov <vasja.filippov94@outlook.com>
Co-authored-by: Vasilii Filippov <vasilii.filippov@amd.com>
Co-authored-by: daniellowell <daniel.lowell@amd.com>
@atamazov atamazov deleted the alex_parallel_compile branch March 7, 2021 18:52
atamazov added a commit that referenced this pull request Jun 11, 2021
atamazov added a commit that referenced this pull request Jun 11, 2021
atamazov added a commit that referenced this pull request Jun 11, 2021
atamazov added a commit that referenced this pull request Jun 23, 2021
atamazov added a commit that referenced this pull request Jul 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants