Skip to content

Fix for distinct compile/evaluate #533

Merged
daniellowell merged 8 commits intodevelopfrom
alex_pcomp
Nov 1, 2020
Merged

Fix for distinct compile/evaluate #533
daniellowell merged 8 commits intodevelopfrom
alex_pcomp

Conversation

@alexandraBara
Copy link
Copy Markdown
Contributor

@alexandraBara alexandraBara commented Oct 21, 2020

Compiling all kernels at once with Precompile kernels. This enables us to compile in parallel for Tuna.

Ran local tests to get speed for compiling/evaluating separately on MI100.
./build/bin/MIOpenDriver conv -F 2 -n 128 -g 1 -k 1024 -c 1024 -H 35 -W 35 -y 3 -x 3 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1 -V 0 -w 1 -t 1 -i 1
common env vars:
export MIOPEN_FIND_ENFORCE=4
export MIOPEN_LOG_LEVEL=6
export MIOPEN_DEBUG_CONV_GEMM=0
export MIOPEN_DEBUG_CONV_FFT=0
export MIOPEN_DEBUG_CONV_DIRECT=0
export MIOPEN_DEBUG_CONV_WINOGRAD=0
export MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=1

Regular compile+evaluate at the same time for igemm:
-- 5 iteration: avg 282 sec
-- 10 iteration: avg 145 sec

Separate compile, evaluate using extra env vars for compile:
export MIOPEN_COMPILE_AND_RUN=0
export MIOPEN_CUSTOM_CACHE_DIR=/root/test

First run with MIOPEN_COMPILE_PARALLEL_LEVEL=20
-- compile avg: 12 sec
-- run avg: 8 sec
-- total avg: 20 sec

Second run with MIOPEN_COMPILE_PARALLEL_LEVEL=10
-- compile avg: 19 sec
-- run avg: 8 sec
-- total avg: 27 sec

**UPDATE: parallel compile for kernels has been added as default behavior to generic_search. We can now compile in parallel and execute in the same run. No env vars needed other than set the compile level with: MIOPEN_COMPILE_PARALLEL_LEVEL=20.

We can still compile/execute separately using:
export MIOPEN_COMPILE_AND_RUN=0**

JehandadKhan
JehandadKhan previously approved these changes Oct 21, 2020
@daniellowell
Copy link
Copy Markdown

@alexandraBara @pfultz2 Do we understand the relationship between the multi-threading from "Find" and "Run-and-Measure"?
Can we disable parallel "Find" if we are setting tuning MIOPEN_FIND_ENFORCE > 2

@zjing14
Copy link
Copy Markdown
Contributor

zjing14 commented Oct 27, 2020

Tuning with MIOPEN_FIND_ENFORCE=4 MIOPEN_COMPILE_AND_RUN=0 MIOPEN_CUSTOM_CACHE_DIR=/root/test MIOPEN_COMPILE_PARALLEL_LEVEL=32
Got Error Msg:
MIOpen(HIP): Error [FindSolutionImpl] Search failed for: ConvHipImplicitGemmForwardV4R5Xdlops: /root/MIOpen_dev/src/include/miopen/generic_search.hpp:491: Search failed
Did I use it properly?

@alexandraBara
Copy link
Copy Markdown
Contributor Author

alexandraBara commented Oct 27, 2020

Tuning with MIOPEN_FIND_ENFORCE=4 MIOPEN_COMPILE_AND_RUN=0 MIOPEN_CUSTOM_CACHE_DIR=/root/test MIOPEN_COMPILE_PARALLEL_LEVEL=32
Got Error Msg:
MIOpen(HIP): Error [FindSolutionImpl] Search failed for: ConvHipImplicitGemmForwardV4R5Xdlops: /root/MIOpen_dev/src/include/miopen/generic_search.hpp:491: Search failed
Did I use it properly?

I thought the upper bound for MIOPEN_COMPILE_PARALLEL_LEVEL is 20, so i am not sure if setting it above 20 will have the effect you are thinking it would. But this wouldn't influence the outcome of the generic search.

This behavior is correct the first time around when it only compiles. We let generic_search fail so the rest of the code doesnt try to load the binaries in memory. However you should not see it fail the second time around (when you execute), after you unset MIOPEN_COMPILE_AND_RUN=0 . At this point the kernels are compiled and written to cache and the execution should continue, generic search will try and benchmark these kernels.
Maybe we can add a different error message when MIOPEN_COMPILE_AND_RUN=0 is used, something like: "Search skipped" instead of "Search failed". We can have a chat when you get online and i can explain more.

@zjing14 with the latest update to this PR you should be able to compile/run in the same execution. The only env var you need is to specify compile level threading: MIOPEN_COMPILE_PARALLEL_LEVEL=20.

@codecov
Copy link
Copy Markdown

codecov bot commented Oct 27, 2020

Codecov Report

Merging #533 into develop will increase coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop     #533   +/-   ##
========================================
  Coverage    52.22%   52.22%           
========================================
  Files          297      297           
  Lines        46020    46018    -2     
========================================
+ Hits         24032    24033    +1     
+ Misses       21988    21985    -3     
Impacted Files Coverage Δ
src/include/miopen/generic_search.hpp 0.00% <0.00%> (ø)
src/ocl/clhelper.cpp 86.66% <0.00%> (-1.12%) ⬇️
src/db_record.cpp 89.28% <0.00%> (+1.78%) ⬆️
src/include/miopen/find_controls.hpp 84.00% <0.00%> (+4.00%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6189be...c58e17c. Read the comment docs.

@daniellowell
Copy link
Copy Markdown

All please rereview this PR. The update is that this has been fixed so that end users no longer need to explicitly set a flag to compiler and evaluate all at once.

JehandadKhan
JehandadKhan previously approved these changes Oct 28, 2020
@atamazov
Copy link
Copy Markdown
Contributor

@alexandraBara

alexandraBara added the bug label 7 days ago

What bug does this PR fix?

@daniellowell
Copy link
Copy Markdown

@atamazov The original PR #307 made the claim that parallel compile would take place before evaluation. In fact that did not happen due to a coding bug.

Copy link
Copy Markdown
Contributor

@atamazov atamazov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before continuation with this PR, I highly recommend resolving review comments in the baseline: #307 (review)

continue;
kernels.push_back(kernel);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely need signs of life in this loop. Please add a new monitoring method to Heartbeat and use it here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added this: #if MIOPEN_ENABLE_SQLITE_KERN_CACHE around the first loop, i can add a comment

#533 (comment)

Comment thread src/include/miopen/generic_search.hpp Outdated
kernels.push_back(kernel);
}
}
std::vector<Program> programs = PrecompileKernels(profile_h, kernels);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If BUILD_DEV=On, then this is waste of time, because binary cache is switched OFF.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a problem? we are aware that for this to work we need bin_cache turned on.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because it affects all developers that work with auto-tuning. For example, recently I tried to test auto-tuning locally and found that it takes too long, so I just stopped trying.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I thought it was a compiler issue but actual reason is #307)

Copy link
Copy Markdown
Contributor

@DrizztDoUrden DrizztDoUrden Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to add the runtime program handles to the KernelCache (at least for development builds). That would make the loop not redundant. Alternatively, we may disable the loop when the binary cache is disabled and compile_and_run != "0".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the simplest variant + comment about alternative solution.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added this: #if MIOPEN_ENABLE_SQLITE_KERN_CACHE around the first loop, i can add a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexandraBara Please do this

  • Find IsCacheDisabled() in binary_cache.cpp
  • Rename it to IsBinaryCacheDisabled()
  • Make it global (remove static and add its declaration to binary_cache.hpp)
  • Use it here instead of MIOPEN_ENABLE_SQLITE_KERN_CACHE

Comment thread src/include/miopen/generic_search.hpp Outdated
@atamazov
Copy link
Copy Markdown
Contributor

@daniellowell

@atamazov The original PR #307 made the claim that parallel compile would take place before evaluation. In fact that did not happen due to a coding bug.

Thanks, now I see. I thought it was intentional.

@atamazov
Copy link
Copy Markdown
Contributor

BTW I have nothing against resolving #307 (review) right in this PR.

Copy link
Copy Markdown
Contributor

@atamazov atamazov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am approving this as I feel this is urgent PR. Please fix the following ASAP (possibly in some other PR):

@daniellowell daniellowell merged commit 2f3aa3d into develop Nov 1, 2020
#endif

for(const auto& current_config : all_configs)
if(IsEnabled(MIOPEN_DEBUG_COMPILE_ONLY{}))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Auto-tuning is not functional after this change. I am curious how this PR was tested.

alexandraBara pushed a commit that referenced this pull request Nov 16, 2020
JehandadKhan pushed a commit that referenced this pull request Nov 20, 2020
* kernel parallel compile fix
* added PrecompileKernels function call as default behavior in generic search
zjing14 pushed a commit that referenced this pull request Dec 1, 2020
* kernel parallel compile fix
* added PrecompileKernels function call as default behavior in generic search
@atamazov atamazov deleted the alex_pcomp branch March 7, 2021 18:52
atamazov added a commit that referenced this pull request Dec 26, 2021
atamazov added a commit that referenced this pull request Dec 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants