Inconsistent GEMM Tuner Execution Times #428

Open
huttered40 opened this issue Nov 27, 2021 · 11 comments

huttered40 commented Nov 27, 2021

Hi, I am trying to find the best set of 16 tuning parameters for a particular GEMM task: m=32,n=256,k=32, on an Intel KNL machine.

I have tuned GEMM for this particular task using a single process via the built-in tuners, and the first stage outputs the configuration that achieved the best performance. I manually disabled the three subsequent stages and modified src/tuning/kernels/xgemm.hpp to allow tuning problems with matrix dimensions less than 64.
Here is that output:

 ./clblast_tuner_xgemm -precision 64 -n 256 -m 32 -k 32
* (1/4) Tuning main GEMM kernel (GEMMK == 0) for fixed set of parameters

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 64 (double) 
    -m 32 
    -n 256 
    -k 32 
    -alpha 2.00 [=default]
    -beta 2.00 [=default]
    -fraction 1.00 [=default]
    -runs 2 [=default]
    -max_l2_norm 0.00 [=default]

* Found 54 configuration(s)
* Parameters explored: GEMMK MWG NWG KWG MDIMC NDIMC MDIMA NDIMB KWI VWM VWN STRM STRN SA SB KREG 

|   ID | total |                                                                           param |      local      |      global     |       compiles |         time | GFLOPS |            status |
x------x-------x---------------------------------------------------------------------------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|  ref |     - |                                                                               - |       8       8 |      32     256 |             OK |      0.68 ms |      - |      reference OK |
x------x-------x---------------------------------------------------------------------------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|    1 |    54 |    0   16   16   32    8    8    8    8    2    1    1    0    0    0    0    1 |       8       8 |      16     128 |   OK    251 ms |      5.30 ms |    0.1 |     results match |
|    2 |    54 |    0   16   16   32    8    8    8    8    2    1    1    0    0    1    1    1 |       8       8 |      16     128 |   OK    235 ms |      5.62 ms |    0.1 |     results match |
|    3 |    54 |    0   16   16   32    8    8    8    8    2    1    2    0    0    0    0    1 |       8       8 |      16     128 |   OK    239 ms |      5.65 ms |    0.1 |     results match |
|    4 |    54 |    0   16   16   32    8    8    8    8    2    1    2    0    0    1    1    1 |       8       8 |      16     128 |   OK    234 ms |      5.80 ms |    0.1 |     results match |
|    5 |    54 |    0   16   16   32    8    8    8    8    2    2    1    0    0    0    0    1 |       8       8 |      16     128 |   OK    229 ms |      5.41 ms |    0.1 |     results match |
|    6 |    54 |    0   16   16   32    8    8    8    8    2    2    1    0    0    1    1    1 |       8       8 |      16     128 |   OK    247 ms |      5.82 ms |    0.1 |     results match |
|    7 |    54 |    0   16   16   32    8    8    8    8    2    2    2    0    0    0    0    1 |       8       8 |      16     128 |   OK    234 ms |      5.64 ms |    0.1 |     results match |
|    8 |    54 |    0   16   16   32    8    8    8    8    2    2    2    0    0    1    1    1 |       8       8 |      16     128 |   OK    225 ms |      5.79 ms |    0.1 |     results match |
|    9 |    54 |    0   16   16   32    8   16    8   16    2    1    1    0    0    0    0    1 |       8      16 |      16     256 |   OK    231 ms |      5.56 ms |    0.1 |     results match |
|   10 |    54 |    0   16   16   32    8   16    8   16    2    1    1    0    0    1    1    1 |       8      16 |      16     256 |   OK    241 ms |      0.69 ms |    0.7 |     results match |
|   11 |    54 |    0   16   16   32    8   16    8   16    2    2    1    0    0    0    0    1 |       8      16 |      16     256 |   OK    223 ms |      5.97 ms |    0.1 |     results match |
|   12 |    54 |    0   16   16   32    8   16    8   16    2    2    1    0    0    1    1    1 |       8      16 |      16     256 |   OK    229 ms |      0.59 ms |    0.9 |     results match |
|   13 |    54 |    0   16   16   32   16    8   16    8    2    1    1    0    0    0    0    1 |      16       8 |      32     128 |   OK    264 ms |      6.17 ms |    0.1 |     results match |
|   14 |    54 |    0   16   16   32   16    8   16    8    2    1    1    0    0    1    1    1 |      16       8 |      32     128 |   OK    230 ms |      5.65 ms |    0.1 |     results match |
|   15 |    54 |    0   16   16   32   16    8   16    8    2    1    2    0    0    0    0    1 |      16       8 |      32     128 |   OK    227 ms |      5.81 ms |    0.1 |     results match |
|   16 |    54 |    0   16   16   32   16    8   16    8    2    1    2    0    0    1    1    1 |      16       8 |      32     128 |   OK    245 ms |      6.21 ms |    0.1 |     results match |
|   17 |    54 |    0   16   16   32   16   16   16   16    2    1    1    0    0    0    0    1 |      16      16 |      32     256 |   OK    271 ms |      0.83 ms |    0.6 |     results match |
|   18 |    54 |    0   16   16   32   16   16   16   16    2    1    1    0    0    1    1    1 |      16      16 |      32     256 |   OK    242 ms |      0.67 ms |    0.8 |     results match |
|   19 |    54 |    0   16   32   32    8    8    8    8    2    1    1    0    0    0    0    1 |       8       8 |      16      64 |   OK    249 ms |      0.53 ms |    1.0 |     results match |
|   20 |    54 |    0   16   32   32    8    8    8    8    2    1    1    0    0    1    1    1 |       8       8 |      16      64 |   OK    225 ms |      0.59 ms |    0.9 |     results match |
|   21 |    54 |    0   16   32   32    8    8    8    8    2    1    2    0    0    0    0    1 |       8       8 |      16      64 |   OK    262 ms |      5.95 ms |    0.1 |     results match |
|   22 |    54 |    0   16   32   32    8    8    8    8    2    1    2    0    0    1    1    1 |       8       8 |      16      64 |   OK    238 ms |      6.32 ms |    0.1 |     results match |
|   23 |    54 |    0   16   32   32    8    8    8    8    2    1    4    0    0    0    0    1 |       8       8 |      16      64 |   OK    246 ms |      0.52 ms |    1.0 |     results match |
|   24 |    54 |    0   16   32   32    8    8    8    8    2    1    4    0    0    1    1    1 |       8       8 |      16      64 |   OK    248 ms |      6.51 ms |    0.1 |     results match |
|   25 |    54 |    0   16   32   32    8    8    8    8    2    2    1    0    0    0    0    1 |       8       8 |      16      64 |   OK    297 ms |      6.12 ms |    0.1 |     results match |
|   26 |    54 |    0   16   32   32    8    8    8    8    2    2    1    0    0    1    1    1 |       8       8 |      16      64 |   OK    266 ms |      6.52 ms |    0.1 |     results match |
|   27 |    54 |    0   16   32   32    8    8    8    8    2    2    2    0    0    0    0    1 |       8       8 |      16      64 |   OK    223 ms |      6.04 ms |    0.1 |     results match |
|   28 |    54 |    0   16   32   32    8    8    8    8    2    2    2    0    0    1    1    1 |       8       8 |      16      64 |   OK    231 ms |      6.10 ms |    0.1 |     results match |
|   29 |    54 |    0   16   32   32    8    8    8    8    2    2    4    0    0    0    0    1 |       8       8 |      16      64 |   OK    256 ms |      5.72 ms |    0.1 |     results match |
|   30 |    54 |    0   16   32   32    8    8    8    8    2    2    4    0    0    1    1    1 |       8       8 |      16      64 |   OK    248 ms |      5.79 ms |    0.1 |     results match |
|   31 |    54 |    0   16   32   32    8   16    8   16    2    1    1    0    0    0    0    1 |       8      16 |      16     128 |   OK    239 ms |      6.00 ms |    0.1 |     results match |
|   32 |    54 |    0   16   32   32    8   16    8   16    2    1    1    0    0    1    1    1 |       8      16 |      16     128 |   OK    274 ms |      0.70 ms |    0.7 |     results match |
|   33 |    54 |    0   16   32   32    8   16    8   16    2    1    2    0    0    0    0    1 |       8      16 |      16     128 |   OK    235 ms |      5.97 ms |    0.1 |     results match |
|   34 |    54 |    0   16   32   32    8   16    8   16    2    1    2    0    0    1    1    1 |       8      16 |      16     128 |   OK    250 ms |      6.00 ms |    0.1 |     results match |
|   35 |    54 |    0   16   32   32    8   16    8   16    2    2    1    0    0    0    0    1 |       8      16 |      16     128 |   OK    248 ms |      5.85 ms |    0.1 |     results match |
|   36 |    54 |    0   16   32   32    8   16    8   16    2    2    1    0    0    1    1    1 |       8      16 |      16     128 |   OK    235 ms |      0.69 ms |    0.8 |     results match |
|   37 |    54 |    0   16   32   32    8   16    8   16    2    2    2    0    0    0    0    1 |       8      16 |      16     128 |   OK    250 ms |      6.05 ms |    0.1 |     results match |
|   38 |    54 |    0   16   32   32    8   16    8   16    2    2    2    0    0    1    1    1 |       8      16 |      16     128 |   OK    272 ms |      5.95 ms |    0.1 |     results match |
|   39 |    54 |    0   16   32   32    8   32    8   32    2    1    1    0    0    0    0    1 |       8      32 |      16     256 |   OK    251 ms |      5.61 ms |    0.1 |     results match |
|   40 |    54 |    0   16   32   32    8   32    8   32    2    1    1    0    0    1    1    1 |       8      32 |      16     256 |   OK    241 ms |      6.13 ms |    0.1 |     results match |
|   41 |    54 |    0   16   32   32    8   32    8   32    2    2    1    0    0    0    0    1 |       8      32 |      16     256 |   OK    228 ms |      6.02 ms |    0.1 |     results match |
|   42 |    54 |    0   16   32   32    8   32    8   32    2    2    1    0    0    1    1    1 |       8      32 |      16     256 |   OK    296 ms |      6.19 ms |    0.1 |     results match |
|   43 |    54 |    0   16   32   32   16    8   16    8    2    1    1    0    0    0    0    1 |      16       8 |      32      64 |   OK    271 ms |      0.70 ms |    0.7 |     results match |
|   44 |    54 |    0   16   32   32   16    8   16    8    2    1    1    0    0    1    1    1 |      16       8 |      32      64 |   OK    254 ms |      5.91 ms |    0.1 |     results match |
|   45 |    54 |    0   16   32   32   16    8   16    8    2    1    2    0    0    0    0    1 |      16       8 |      32      64 |   OK    232 ms |      6.17 ms |    0.1 |     results match |
|   46 |    54 |    0   16   32   32   16    8   16    8    2    1    2    0    0    1    1    1 |      16       8 |      32      64 |   OK    242 ms |      6.09 ms |    0.1 |     results match |
|   47 |    54 |    0   16   32   32   16    8   16    8    2    1    4    0    0    0    0    1 |      16       8 |      32      64 |   OK    247 ms |      6.00 ms |    0.1 |     results match |
|   48 |    54 |    0   16   32   32   16    8   16    8    2    1    4    0    0    1    1    1 |      16       8 |      32      64 |   OK    267 ms |      6.43 ms |    0.1 |     results match |
|   49 |    54 |    0   16   32   32   16   16   16   16    2    1    1    0    0    0    0    1 |      16      16 |      32     128 |   OK    256 ms |      5.82 ms |    0.1 |     results match |
|   50 |    54 |    0   16   32   32   16   16   16   16    2    1    1    0    0    1    1    1 |      16      16 |      32     128 |   OK    248 ms |      6.08 ms |    0.1 |     results match |
|   51 |    54 |    0   16   32   32   16   16   16   16    2    1    2    0    0    0    0    1 |      16      16 |      32     128 |   OK    232 ms |      5.89 ms |    0.1 |     results match |
|   52 |    54 |    0   16   32   32   16   16   16   16    2    1    2    0    0    1    1    1 |      16      16 |      32     128 |   OK    240 ms |      6.43 ms |    0.1 |     results match |
|   53 |    54 |    0   16   32   32   16   32   16   32    2    1    1    0    0    0    0    1 |      16      32 |      32     256 |   OK    241 ms |      6.24 ms |    0.1 |     results match |
|   54 |    54 |    0   16   32   32   16   32   16   32    2    1    1    0    0    1    1    1 |      16      32 |      32     256 |   OK    238 ms |      6.62 ms |    0.1 |     results match |
x------x-------x---------------------------------------------------------------------------------x-----------------x-----------------x----------------x--------------x--------x-------------------x


* Got average result of 4.99 ms: 0.1 GFLOPS
* Found best result 0.52 ms: 1.0 GFLOPS
* Best parameters: GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=32 PRECISION=64 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=4

* Writing a total of 54 results to 'clblast_xgemm_1_32_256_32.json'
* Completed tuning process

However, when I run my script (also on a single process), which manually overrides the tuning parameters via the OverrideParameters(...) API call, the tuner's best configuration yields an execution time that is more than 10x slower than what was reported during the offline tuning. Specifically, these "best parameters" give an average execution time of 16 ms across 5 iterations, and I do not time the initial iteration, to allow for the compilation overhead as mentioned in the docs. I believe I call GEMM in the same way that the tuner calls it (RowMajor, NoTranspose for either input matrix, non-zero scalars alpha and beta).
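
For illustration, the manual translation from the json's best-parameters line into the map passed to OverrideParameters looks roughly like this (a sketch based on the best parameters listed above; PRECISION=64 is omitted because the precision is passed as a separate argument):

// Sketch: parameter map built from the tuner's best-parameters line above
const std::unordered_map<std::string, size_t> parameters = {
    {"GEMMK", 0}, {"KREG", 1}, {"KWG", 32}, {"KWI", 2},
    {"MDIMA", 8}, {"MDIMC", 8}, {"MWG", 16},
    {"NDIMB", 8}, {"NDIMC", 8}, {"NWG", 32},
    {"SA", 0}, {"SB", 0}, {"STRM", 0}, {"STRN", 0},
    {"VWM", 1}, {"VWN", 4}};
auto status = clblast::OverrideParameters(device(), "Xgemm", clblast::Precision::kDouble, parameters);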

I investigated this further by exhaustively searching over all 1000+ legal configurations within my script and found that the best configuration yielded an execution time similar to the one found in the offline tuning stage (0.55 ms), but it was a completely different configuration.

I have tried repeating this debugging process for other matrix dimensions beyond m=32,n=256,k=32 and have found similar discrepancies. I do not believe it is an error in the manual translation from json file to my script, as I have checked over it numerous times.

Can you advise what step, if any, I am missing? Perhaps I am making false assumptions about the offline built-in tuner that CLBlast provides? I'd like to leverage the offline tuner for a number of tasks and then manually take the best-found configuration for each into my script via the OverrideParameters(...) routine.

CNugteren (Owner) commented:

The first thing I notice from your tuning results is that there seem to be two groups: those around 5ms, and those around 0.6ms. My guess is those resulting in 0.6ms are producing incorrect results.

I (...) modified src/tuning/kernels/xgemm.hpp to allow tuning problems with matrix dimensions less than 64.

How exactly did you do this? Perhaps you can share a diff? My guess is that this modification resulted in some invalid configurations in the tuner, causing these 0.5ms results. You can also try to check what is common with the cases that resulted in those extremely fast results and figure out what went wrong there.


huttered40 commented Nov 29, 2021

Upon further investigation, the individual execution times I am seeing for the tuner's reported best configuration (on an Intel KNL, measured in my script) seem strange. I had previously been reporting the average of three executions, but decided to print out each individual execution. I performed 5 trials, each consisting of an initial un-timed execution (to absorb the compilation time following the call to OverrideParameters(...)) followed by three measured iterations.

The task dimensions were m=32, n=256, k=32.
The configuration I am using (which the tuner states achieved an execution time of 0.45 ms) is:

"best_parameters": "GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=16 NDIMB=8 NDIMC=8 NWG=16 PRECISION=64 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1"

Trial 1:
0.0130866
0.000790784
0.0142936
Average: 0.00939032

Trial 2:
0.00107513
0.000869536
0.0258419
Average: 0.00926218

Trial 3:
0.0250369
0.016514
0.0251048
Average: 0.0222186

Trial 4:
0.0008712
0.0314774
0.00111143
Average: 0.0111533

Trial 5:
0.000911874
0.000724752
0.022507
Average: 0.00804789

In every trial except trial 3, at least one iteration reports an execution time below 1 ms (although still noticeably larger than the 0.45 ms the tuner reported), yet the variance of the execution times within each trial is huge, which skews the averages I had been collecting. The fact that some iterations take 20+ ms doesn't make sense to me, especially as it can't be localized to a specific iteration (in trial 5 the 3rd iteration takes 22 ms, whereas in trial 3 the 1st iteration takes 25 ms). I wouldn't expect this to be attributable to noise. Variation in execution times of this magnitude would render autotuning largely worthless.

This excessive runtime variation across iterations of the same task and configuration is not unique to this task (m=32,n=256,k=32). I used the tuner to tune ~500 tasks and evaluated 729 tasks for m,n,k={8,16,32,64,128,256,512,1024,2048}, and observed the same phenomenon for all of the smaller tasks. The larger tasks incurred little variation relative to their execution times, which makes sense.

As an example, the smallest task (m=8,n=8,k=8) executed 7 times in succession yields the following 7 execution times. The last is ~20x larger than the 6th:

0.000472691
0.000323357
0.00046400
0.000349837
0.000331636
0.000478194
0.00989678

I cannot fathom what might be causing this high variation. As far as I know, I am running a single-threaded process, so one core of the KNL. I am not setting any thread-related environment variables.

For clarity, single-threaded MKL achieves an average execution time (across 3 iterations for the same task dimensions (m=32,n=256,k=32)) of 0.12 ms, so about 3-4x faster than the best time reported by CLBlast's tuner from its 54-configuration search (see initial post). You mentioned above "My guess is those resulting in 0.6ms are producing incorrect results.", yet CLBlast's reported 0.45 ms execution time seems reasonable when compared against MKL. Further, none of the tested configurations hit an assert in my script or failed to match the reference in the tuner. Thus I don't believe it is an issue of incorrectness caused by a line I modified.

I have provided below the subset of my script that yielded the above results; admittedly it was largely taken from the sample code sgemm.cpp:

// Set the Xgemm parameters to the configuration specified above
auto status = clblast::OverrideParameters(device(), "Xgemm", clblast::Precision::kDouble, parameters);
if (status == clblast::StatusCode::kMissingOverrideParameter) { assert(0); }

// Initial execution: not timed, to absorb kernel compilation after the override
cl_event event = nullptr;
auto queue_plain = queue();
status = clblast::Gemm(clblast::Layout::kRowMajor, clblast::Transpose::kNo, clblast::Transpose::kNo,
                       m, n, k, alpha,
                       device_a(), 0, a_ld, device_b(), 0, b_ld,
                       beta, device_c(), 0, c_ld,
                       &queue_plain, &event);
if (status == clblast::StatusCode::kSuccess) {
  clWaitForEvents(1, &event);
  clReleaseEvent(event);
} else { assert(0); }

// Execute and time
double avg_time = 0.0;
for (size_t zz = 0; zz < num_iter; zz++) {
  // Start the timer
  auto start_time = std::chrono::steady_clock::now();

  // Call the DGEMM routine. Note that the type of alpha and beta (double) determines the precision.
  queue_plain = queue();
  status = clblast::Gemm(clblast::Layout::kRowMajor,
                         clblast::Transpose::kNo, clblast::Transpose::kNo,
                         m, n, k,
                         alpha,
                         device_a(), 0, a_ld,
                         device_b(), 0, b_ld,
                         beta,
                         device_c(), 0, c_ld,
                         &queue_plain, &event);

  // Wait for completion, then record the execution time
  if (status == clblast::StatusCode::kSuccess) {
    clWaitForEvents(1, &event);
    clReleaseEvent(event);
  } else { assert(0); }
  auto elapsed_time = std::chrono::steady_clock::now() - start_time;
  auto time_sec = std::chrono::duration<double>(elapsed_time).count();
  avg_time += time_sec;
}
avg_time /= num_iter;
// Report avg_time
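
For comparison, one could also time the kernel via OpenCL event profiling instead of host-side chrono, to separate kernel execution time from queueing/launch overhead. A minimal sketch, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE (not shown in the snippet above):

// Sketch: event-based kernel timing (requires CL_QUEUE_PROFILING_ENABLE on the queue)
clWaitForEvents(1, &event);
cl_ulong t_start = 0, t_end = 0;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, nullptr);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, nullptr);
const double kernel_ms = static_cast<double>(t_end - t_start) * 1e-6;  // profiling counters are in nanoseconds
clReleaseEvent(event);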

Other investigations:
For these same task dimensions (m=32,n=256,k=32) I exhaustively searched over 11664 distinct configurations and found 103 that reported an average execution time (measured in exactly the same way as in the code snippet above) of less than 1 ms. None of these configurations was the "best configuration" reported by the tuner.

CNugteren (Owner) commented:

Thanks for sharing so many details, but did you see my message? My hunch is still that you incorrectly modified the tuner and are actually running with configurations that are invalid for your matrix sizes. Invalid configurations can cause undefined behaviour, which could explain everything (including the variation over runs).

If that doesn't help, I would suggest reverting your changes and trying the tuner with supported matrix sizes. Then repeat your measurements and make sure everything is fine (expected speed, variance is gone, etc.).

BTW, there is a reason these constraints are in place (even though you can work around them if you do it carefully): for smaller sizes the direct GEMM kernel might be faster to use anyway.


huttered40 commented Nov 30, 2021

OK, thanks. I switched to using XgemmDirect as the kernel. I rebuilt CLBlast from scratch with no modifications to the source code on my part, but I am still seeing the following behavior:

  1. The execution times can vary significantly across subsequent iterations, especially for smaller tasks.
  2. The execution times measured by my script (which manually sets the best parameters for the corresponding task via OverrideParameters and then executes the kernel, see code snippet below) are larger than those reported by the tuner for its best-found configuration.

This is my workflow following the cmake build:

  1. ./clblast_tuner_xgemm_direct -precision 64 -m 16 -n 64 -k 16. I tried m,n,k={16,64,256,1024}, so I tuned across 64 distinct tasks.
  2. I did not upload the results to the database; instead, my script saves the 64 configurations and uses the OverrideParameters API call to set them before calling the corresponding GEMM task.

This is the relevant part of my script:

// Set the parameter map to the best configuration found for task (m,n,k)
auto status = clblast::OverrideParameters(device(), "XgemmDirect", clblast::Precision::kDouble, parameters);
if (status == clblast::StatusCode::kMissingOverrideParameter) {
  assert(0);
}

// Initial execution: not timed, to absorb kernel compilation after the override
cl_event event = nullptr;
auto queue_plain = queue();
status = clblast::Gemm(clblast::Layout::kRowMajor,
                       clblast::Transpose::kNo, clblast::Transpose::kNo,
                       m, n, k,
                       alpha,
                       device_a(), 0, a_ld,
                       device_b(), 0, b_ld,
                       beta,
                       device_c(), 0, c_ld,
                       &queue_plain, &event);
if (status == clblast::StatusCode::kSuccess) {
  clWaitForEvents(1, &event);
  clReleaseEvent(event);
} else { assert(0); }

// Execute and time
double avg_time = 0.0;
for (size_t zz = 0; zz < num_iter; zz++) {
  // Start the timer
  auto start_time = std::chrono::steady_clock::now();

  // Call the DGEMM routine. Note that the type of alpha and beta (double) determines the precision.
  queue_plain = queue();
  status = clblast::Gemm(clblast::Layout::kRowMajor,
                         clblast::Transpose::kNo, clblast::Transpose::kNo,
                         m, n, k,
                         alpha,
                         device_a(), 0, a_ld,
                         device_b(), 0, b_ld,
                         beta,
                         device_c(), 0, c_ld,
                         &queue_plain, &event);

  // Wait for completion, then record the execution time
  if (status == clblast::StatusCode::kSuccess) {
    clWaitForEvents(1, &event);
    clReleaseEvent(event);
  } else { assert(0); }
  auto elapsed_time = std::chrono::steady_clock::now() - start_time;
  auto time_sec = std::chrono::duration<double>(elapsed_time).count();
  avg_time += time_sec;
}
avg_time /= num_iter;

As an example of issue 2) above, the following is the tuner's output for tuning task m=1024,n=1024,k=1024:

  "kernel_family": "xgemm_direct_1",
  "precision": "64",
  "best_kernel": "XgemmDirectTN",
  "best_time": "18.31",
  "best_parameters": "KWID=2 MDIMAD=8 MDIMCD=8 NDIMBD=8 NDIMCD=8 PADA=1 PADB=1 PRECISION=64 VWMD=2 VWND=1 WGD=32",
  "arg_m": "1024",
  "arg_n": "1024",
  "arg_k": "1024",
  "arg_alpha": "2.00",
  "arg_beta": "2.00",
  "clblast_device_type": "CPU",
  "clblast_device_vendor": "Intel",
  "clblast_device_architecture": "",
  "clblast_device_name": "Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "device": "pthread-Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "platform_vendor": "The pocl project",
  "platform_version": "OpenCL 2.0 pocl 1.7, Release, LLVM 11.1.0, RELOC, SLEEF, CUDA, DISTRO, POCL_DEBUG",
  "device_vendor": "GenuineIntel",
  "device_type": "CPU",
  "device_core_clock": "1600",
  "device_compute_units": "272",
  "device_extra_info": "",

Yet, 4 successive calls (following a warmup as shown in code snippet above) for the same task and the same best parameters give the following execution times:

0.10035
0.101461
0.118952
0.0976752

There is little variation among the execution times, which makes sense considering it is a large task, but the average execution time is over 100 ms, which is >5x larger than what the tuner reported for the same configuration.
Is my code snippet calling GEMM in the same way as the tuner? Are there any hidden memory allocations occurring in the background? It is difficult for me to parse the source code well enough to verify this myself.

As an example of issue 1) above, I performed the same test for task m=64,n=64,k=64.
The tuner produces the following information:

  "kernel_family": "xgemm_direct_1",
  "precision": "64",
  "best_kernel": "XgemmDirectTN",
  "best_time": "0.41",
  "best_parameters": "KWID=2 MDIMAD=8 MDIMCD=8 NDIMBD=32 NDIMCD=32 PADA=1 PADB=1 PRECISION=64 VWMD=4 VWND=1 WGD=32",
  "arg_m": "64",
  "arg_n": "64",
  "arg_k": "64",
  "arg_alpha": "2.00",
  "arg_beta": "2.00",
  "clblast_device_type": "CPU",
  "clblast_device_vendor": "Intel",
  "clblast_device_architecture": "",
  "clblast_device_name": "Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "device": "pthread-Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "platform_vendor": "The pocl project",
  "platform_version": "OpenCL 2.0 pocl 1.7, Release, LLVM 11.1.0, RELOC, SLEEF, CUDA, DISTRO, POCL_DEBUG",
  "device_vendor": "GenuineIntel",
  "device_type": "CPU",
  "device_core_clock": "1600",
  "device_compute_units": "272",
  "device_extra_info": "",

Yet, my script achieves the following four execution times:

0.00113308
0.0459221
0.00103271
0.0125391

There is significant variation among these four, and furthermore the minimum is ~1 ms, which is >2x larger than what the tuner reported for the same configuration.

Two further questions:

  1. I was calling GEMM with matrices A and B not transposed, but the json files all report that the following kernel was the best kernel: XgemmDirectTN. Does that mean that I need to transpose A in my script for the configuration found by the tuner to make sense? I tried both ways, but it didn't make much difference.
  2. Is the tuner's reported best-time the minimum or an average across 4 runs? I see std::min_element(...) used in various places in src/tuning/, which makes me believe the minimum execution time is being reported.

CNugteren (Owner) commented:

Thanks again for sharing so much information. I would suggest tackling the issues one at a time. Let's start with the large case of m=n=k=1024, since that is a common situation and also large enough to hide any small overheads. Indeed, the time measured by the tuner should be close to the time that you measure with your script. There are a few things I would suggest trying first:

  1. Make sure that the direct GEMM kernel is used. There is the clblast_tuner_routine_xgemm tuner, which tunes a single parameter determining when to use which kernel. You might be tuning one kernel but measuring the time of another. In your case I would just set that parameter such that the direct GEMM kernel is always used (for testing). See also the docs for more details.
  2. Make sure that the times reported by the tuner and by your code make sense in terms of GFLOPS. The tuner already reports GFLOPS; compare that to the expected peak of your device. It should be lower than the peak, but not more than 10 times lower. This check makes sure the time measurements make sense and that no mistake was made in the units.
  3. Also measure with the provided clblast_client_xgemm binary. It also measures time and GFLOPS. You could even modify the code there slightly to override the parameters, or just put them in the database. At least that measurement tool should give you proper numbers.

Then, once that is solved, let us continue with the other, smaller kernels and the variation there. BTW, I see in your very first post above that the tuner reports 0.1 GFLOPS, which is probably nothing compared to your peak performance. So that already shows the numbers are not to be trusted too much: any small overhead will contribute a lot in these situations.
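
To make the GFLOPS check of point 2 concrete (rough numbers, just to illustrate the check): GEMM performs roughly 2*m*n*k floating-point operations, so for m=n=k=1024:

2 * 1024^3 flops               ≈ 2.15 GFLOP
tuner:  2.15 GFLOP / 18.31 ms  ≈ 117 GFLOPS
script: 2.15 GFLOP / ~100 ms   ≈  21 GFLOPS

Both numbers should then be compared against the expected double-precision peak of the device.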


huttered40 commented Dec 1, 2021

OK. I had been tuning with clblast_tuner_xgemm_direct, but I also tuned with clblast_tuner_routine_xgemm. I found that for all tested matrix sizes the direct GEMM kernel is best. I uploaded the corresponding json file to the database. Upon rebuilding CLBlast and my script, I did not notice any changes in performance. I should note that my script is simply a modified version of the sample code samples/sgemm_batched.cpp. I merely replaced parts of the code with what I needed so that the executable would be built in the same way as the client. Let me know if that is ill-advised.

I have posted both the client's output and the tuner's output for the m=1024,n=1024,k=1024 task below. First, the tuner reports an execution time of 18.31 ms, which is less than half of the mean execution time achieved by the client across 20 iterations. My script is still achieving a mean execution time of ~110 ms. So clearly there is a discrepancy between the tuner's output and the client's output. Further, since the min reported by the client is ~30 ms and the max is ~66 ms with a mean of ~40 ms, that still seems like significant variation, especially compared against CBLAS.

  "kernel_family": "xgemm_direct_1",
  "precision": "64",
  "best_kernel": "XgemmDirectTN",
  "best_time": "18.31",
  "best_parameters": "KWID=2 MDIMAD=8 MDIMCD=8 NDIMBD=8 NDIMCD=8 PADA=1 PADB=1 PRECISION=64 VWMD=2 VWND=1 WGD=32",
  "arg_m": "1024",
  "arg_n": "1024",
  "arg_k": "1024",
  "arg_alpha": "2.00",
  "arg_beta": "2.00",
  "clblast_device_type": "CPU",
  "clblast_device_vendor": "Intel",
  "clblast_device_architecture": "",
  "clblast_device_name": "Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "device": "pthread-Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "platform_vendor": "The pocl project",
  "platform_version": "OpenCL 2.0 pocl 1.7, Release, LLVM 11.1.0, RELOC, SLEEF, CUDA, DISTRO, POCL_DEBUG",
  "device_vendor": "GenuineIntel",
  "device_type": "CPU",
  "device_core_clock": "1600",
  "device_compute_units": "272",
  "device_extra_info": "",
  "results": [
c455-011[knl](1011)$ ibrun -n 1 ./clblast_client_xgemm -precision 64 -m 1024 -n 1024 -k 1024 -full_statistics true -warm_up true -tuner_files json_files_v1/clblast_xgemm_direct_1_64_m1024_n1024_k1024.json -runs 20
TACC:  Starting up job 8844098 
TACC:  Starting parallel tasks... 

* Options given/available:
    -m 1024 
    -n 1024 
    -k 1024 
    -layout 101 (row-major) [=default]
    -transA 111 (regular) [=default]
    -transB 111 (regular) [=default]
    -lda 1024 [=default]
    -ldb 1024 [=default]
    -ldc 1024 [=default]
    -offa 0 [=default]
    -offb 0 [=default]
    -offc 0 [=default]
    -alpha 2.00 [=default]
    -beta 2.00 [=default]
    -platform 0 [=default]
    -device 0 [=default]
    -precision 64 (double) 
    -cblas 1 [=default]
    -step 1 [=default]
    -num_steps 0 [=default]
    -runs 20 
    -h [false]
    -q [false]
    -no_abbrv [false]
    -full_statistics [true]
    -warm_up [true]
    -tuner_files json_files_v1/clblast_xgemm_direct_1_64_m1024_n1024_k1024.json 

* Reading override-parameters from 'json_files_v1/clblast_xgemm_direct_1_64_m1024_n1024_k1024.json'
* Found parameters for kernel 'XgemmDirect': { KWID=2 MDIMAD=8 MDIMCD=8 NDIMBD=8 NDIMCD=8 PADA=1 PADB=1 VWMD=2 VWND=1 WGD=32 }
* Applying parameter override successfully for 'XgemmDirect'

                                                                                                                                             | <--            CLBlast            --> | <--           CPU BLAS            --> |
        m;        n;        k;   layout;   transA;   transB;      lda;      ldb;      ldc;     offa;     offb;     offc;    alpha;     beta; min_ms_1; max_ms_1;   mean_1; stddev_1; min_ms_3; max_ms_3;   mean_3; stddev_3
       1K;       1K;       1K;      101;      111;      111;       1K;       1K;       1K;        0;        0;        0;     2.00;     2.00;   30.450;   66.496;   40.905;    8.171;   92.136;   96.340;   94.799;    1.078


huttered40 commented Dec 1, 2021

I know you advised focusing on one issue at a time, but I tried a similar experiment with GEMV and the results point to what seems to be the same issue: the client's and the tuner's outputs do not match, and high variation is observed. The client's output below uses the json file that is also shown below.

  "kernel_family": "xgemv",
  "precision": "64",
  "best_kernel": "Xgemv",
  "best_time": "2.15",
  "best_parameters": "PRECISION=64 WGS1=64 WPT1=2",
  "arg_m": "1024",
  "arg_n": "1024",
  "arg_alpha": "2.00",
  "arg_beta": "2.00",
  "clblast_device_type": "CPU",
  "clblast_device_vendor": "Intel",
  "clblast_device_architecture": "",
  "clblast_device_name": "Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "device": "pthread-Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz",
  "platform_vendor": "The pocl project",
  "platform_version": "OpenCL 2.0 pocl 1.7, Release, LLVM 11.1.0, RELOC, SLEEF, CUDA, DISTRO, POCL_DEBUG",
  "device_vendor": "GenuineIntel",
  "device_type": "CPU",
  "device_core_clock": "1600",
  "device_compute_units": "272",
  "device_extra_info": "",
  "results": [
c455-063[knl](1006)$ ibrun -n 1 ./clblast_client_xgemv -precision 64 -m 1024 -n 1024 -full_statistics true -warm_up true -tuner_files clblast_xgemv_64.json -runs 20
TACC:  Starting up job 8844718 
TACC:  Starting parallel tasks... 

* Options given/available:
    -m 1024 
    -n 1024 
    -layout 101 (row-major) [=default]
    -transA 111 (regular) [=default]
    -lda 1024 [=default]
    -incx 1 [=default]
    -incy 1 [=default]
    -offa 0 [=default]
    -offx 0 [=default]
    -offy 0 [=default]
    -alpha 2.00 [=default]
    -beta 2.00 [=default]
    -platform 0 [=default]
    -device 0 [=default]
    -precision 64 (double) 
    -cblas 1 [=default]
    -step 1 [=default]
    -num_steps 0 [=default]
    -runs 20 
    -h [false]
    -q [false]
    -no_abbrv [false]
    -full_statistics [true]
    -warm_up [true]
    -tuner_files clblast_xgemv_64.json 

* Reading override-parameters from 'clblast_xgemv_64.json'
* Found parameters for kernel 'Xgemv': { WGS1=64 WPT1=2 }
* Applying parameter override successfully for 'Xgemv'

                                                                                                                         | <--            CLBlast            --> | <--           CPU BLAS            --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta; min_ms_1; max_ms_1;   mean_1; stddev_1; min_ms_3; max_ms_3;   mean_3; stddev_3
       1K;       1K;      101;      111;       1K;        1;        1;        0;        0;        0;     2.00;     2.00;    1.538;   18.248;    8.161;    2.787;    0.707;    0.838;    0.763;    0.035

So the problem doesn't seem to be unique to GEMM, but rather related to how CLBlast operates with the installed OpenCL library?


CNugteren commented Dec 1, 2021

Hmm, OK. If the same behaviour also happens with GEMV, then that is even easier to test: no direct vs. non-direct kernel, and shorter tuning. However, here we need to be careful as well: I believe that in the case of 1024x1024 it will use the 'fast' GEMV kernel, whereas the tuning results that you share here are for the regular, more generic GEMV kernel. See here in the code for the exact details.

Perhaps the issue is related to POCL, so I also installed that on my system and used my CPU with POCL.

Here are the results I get when running the client:

                                                                                                                         | <--            CLBlast            --> | <--           CPU BLAS            --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta; min_ms_1; max_ms_1;   mean_1; stddev_1; min_ms_3; max_ms_3;   mean_3; stddev_3
       1K;       1K;      101;      111;       1K;        1;        1;        0;        0;        0;     2.00;     2.00;    0.261;    0.542;    0.302;    0.037;    0.041;    0.103;    0.047;    0.008

This 0.2 matches roughly what I get with the tuner as well. However there is indeed quite some variation. Keep in mind that this is on a laptop CPU which is already running a desktop environment. If I run the same with my GPU through the Intel OpenCL driver, I get a bit less variation but still some:

                                                                                                                         | <--            CLBlast            --> | <--           CPU BLAS            --> |
        m;        n;   layout;   transA;      lda;     incx;     incy;     offa;     offx;     offy;    alpha;     beta; min_ms_1; max_ms_1;   mean_1; stddev_1; min_ms_3; max_ms_3;   mean_3; stddev_3
       1K;       1K;      101;      111;       1K;        1;        1;        0;        0;        0;     2.00;     2.00;    3.888;    4.704;    4.324;    0.265;    0.030;    0.055;    0.033;    0.007

However, none of these seem as bad as your variations.

I'll take some more time this weekend to have a more detailed look at the issue.


CNugteren commented Dec 6, 2021

Sorry for the late reply, I'm a bit busy these days. And it has also been a while since I dug into CLBlast.

I think what you should do is set the -DVERBOSE=ON option when running CMake. That way CLBlast will tell you exactly which kernels run and how much time they consume. This will make the final values reported by the client no longer accurate, but used in combination I think it will provide you enough information to debug the issue. Given the names of the kernels that are running, you can make sure you are running those that you tuned for, and not another variant or extra kernels.
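
For example, reconfiguring and rebuilding with something like this (assuming an out-of-source build directory):

cmake -DVERBOSE=ON .. && make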

It will also tell you the exact tuning parameters it used. For example, for the indirect version of GEMM, depending on your transpose/layout options it needs to run pre/post-processing kernels. The output below shows the full list of parameters in use, and all the kernels that run along with their execution times:

./clblast_client_xgemm -precision 32 -m 512 -n 512 -k 512 -full_statistics true -warm_up true -runs 1 -platform 0
(...)
[DEBUG] GEMM_Copy_8_8_2_1_Pad_32_8_4_2_Transpose_16_0_0_4_Padtranspose_1_16_2_Xgemm_0_1_32_2_16_16_64_8_8_64_1_1_0_0_4_4_XgemmDirect_2_8_8_8_8_1_1_1_1_8_GemmRoutine_128
[DEBUG] Running kernel 'TransposeMatrixFast'
[DEBUG] Completed kernel in 0.31 ms
[DEBUG] Running kernel 'TransposePadMatrix'
[DEBUG] Completed kernel in 0.44 ms
[DEBUG] Running kernel 'Xgemm'
[DEBUG] Completed kernel in 3.15 ms
[DEBUG] Running kernel 'TransposeMatrix'
[DEBUG] Completed kernel in 0.46 ms

or another (GEMV in this case) example running the XgemvFastRot kernel:

./clblast_client_xgemv -precision 32 -m 1024 -n 1024 -full_statistics true -warm_up true -runs 2 -platform 0
(...)
[DEBUG] GEMV_Xgemv_128_1_XgemvFast_2_128_2_XgemvFastRot_4_64_16_TrsvRoutine_32
[DEBUG] Running kernel 'XgemvFastRot'
[DEBUG] Completed kernel in 5.59 ms

I believe that if you use this feature you should be able to get to the bottom of this: are you measuring the right thing?


baryluk commented Feb 10, 2024

I also noticed that tuner results are not very consistent.

My guess is that the tests are way too short, so things like other load on the GPU (e.g. video playback) or boost-frequency handling can mess things up. Longer tests and multiple repetitions should be used instead. Some of the tests are in the microseconds range, which is hard to time accurately and is easily susceptible to noise.

Example:

$ for i in `seq 30`; do LD_LIBRARY_PATH=./lib ./bin/clblast_tuner_transpose_fast --platform 1 | grep -E 'Best param|Found best'; done
* Found best result 0.05 ms: 180.1 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.05 ms: 185.3 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=8
* Found best result 0.05 ms: 178.3 GB/s
* Best parameters: PRECISION=32 TRA_DIM=4 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.04 ms: 212.5 GB/s
* Best parameters: PRECISION=32 TRA_DIM=16 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=2
* Found best result 0.04 ms: 211.9 GB/s
* Best parameters: PRECISION=32 TRA_DIM=16 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=2
* Found best result 0.04 ms: 204.8 GB/s
* Best parameters: PRECISION=32 TRA_DIM=16 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=2
* Found best result 0.04 ms: 200.8 GB/s
* Best parameters: PRECISION=32 TRA_DIM=16 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=2
* Found best result 0.04 ms: 187.5 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=1
* Found best result 0.04 ms: 205.7 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.04 ms: 212.7 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.05 ms: 177.8 GB/s
* Best parameters: PRECISION=32 TRA_DIM=4 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.04 ms: 205.7 GB/s
* Best parameters: PRECISION=32 TRA_DIM=16 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=2
* Found best result 0.05 ms: 185.5 GB/s
* Best parameters: PRECISION=32 TRA_DIM=16 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.04 ms: 198.3 GB/s
* Best parameters: PRECISION=32 TRA_DIM=4 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.04 ms: 190.9 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=8
* Found best result 0.04 ms: 206.3 GB/s
* Best parameters: PRECISION=32 TRA_DIM=4 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.05 ms: 170.7 GB/s
* Best parameters: PRECISION=32 TRA_DIM=16 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=2
* Found best result 0.04 ms: 205.1 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=0 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.05 ms: 183.0 GB/s
* Best parameters: PRECISION=32 TRA_DIM=4 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=4
* Found best result 0.04 ms: 204.0 GB/s
* Best parameters: PRECISION=32 TRA_DIM=8 TRA_PAD=1 TRA_SHUFFLE=1 TRA_WPT=4
^C

CNugteren (Owner) commented:

My guess is that the tests are way too short, so things like other load on the GPU (e.g. video playback) or boost-frequency handling can mess things up. Longer tests and multiple repetitions should be used instead. Some of the tests are in the microseconds range, which is hard to time accurately and is easily susceptible to noise.

Can you set the -runs argument from the default of 10 to something higher and see if that improves things? Perhaps for smaller tests that should be the default.

And of course it is better to test on a machine that doesn't run anything else, e.g. a headless machine or in Linux without a graphical environment.
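
For example, something along these lines (the value 100 is just an arbitrary higher run count):

LD_LIBRARY_PATH=./lib ./bin/clblast_tuner_transpose_fast --platform 1 -runs 100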
