Slow-ish run time on MSVC #9

Closed
patrikhuber opened this issue Feb 2, 2018 · 32 comments

patrikhuber (Contributor) commented Feb 2, 2018

Hi!

First of all thank you for this great library! :-)
I've got a fairly small model (18 layers) for real-time applications, consisting mainly of 5 blocks of Conv2D/ReLU/MaxPool2D, with an input size of 64x64x3. I'm unfortunately seeing some speed problems with fdeep.
A forward pass takes around 11ms in Keras, but 60ms in fdeep. (I've measured by calling predict 100x in a for-loop and then averaging - a bit crude, but it should do the trick for this purpose.) I've compiled with the latest VS2017 15.5.5, Release mode, and default compiler flags (/O2). If I enable AVX2 and intrinsics, it goes down to 50ms, but that's still way too slow. (I've tried without im2col, but that's even slower, by more than 10x.)

I've run the VS profiler, but I'm not 100% sure I'm interpreting the results correctly. I think around 30%+5% of the total time is spent in Eigen's gebp and gemm functions, where we probably can't do much. Except maybe: I think I've seen you're using RowMajor storage for the Eigen matrices. Eigen is supposedly more optimised for its default, ColMajor storage. Would it be hard to change that in fdeep?
Another 30% seems to be spent in convolve_im2col, but I'm not 100% sure where. I first thought it was the memcpy in eigen_mat_to_values, but eigen_mat_to_values itself contains only very few profiler samples.
There's also a lot of internal::transform and std::transform showing up in the profiler (internal::transform<ContainerOut>(reuse_t{}, f, std::forward<ContainerIn>(xs));), but I couldn't really figure out what actual code this executes.
I also saw that you seem to pre-instantiate some convolution functions for common kernels. Most of my convolution kernels are 3x3, and it looks like you only instantiate n x m kernels for n and m equal to 1 and 2. Could it help to add 3x3 there?
So yeah, I'm really not sure about all of it. If indeed the majority of time is spent in Eigen's functions, then the RowMajor thing could indeed be a major problem.

I'm happy to send you the model and an example input via email if you wanted to have a look.

Here are some screenshots of the profiler:
[three profiler screenshots]

Thank you very much!

Dobiasd (Owner) commented Feb 2, 2018

Hi Patrik,

very interesting! I would not expect such performance issues with a model architecture like the one you describe.

Disabling im2col only makes sense if memory consumption is an issue, not speed. And only without im2col are the compile-time specializations for specific strides and filter sizes used. However, they are there for 3x3 filters, since these are very common. For example, FDEEP_CONV_OPT_SY_SX_HW_IF(1,1,3,3) means stride_y = 1; stride_x = 1; filter_size_y = 3; filter_size_x = 3;.

OK, but back to the relevant topics. :)
Performance of GEMM in Eigen seems to be roughly the same with row major and column major matrices:

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <string>
#include <eigen3/Eigen/Dense>
using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
template <typename Mat>
void run_test(const std::string& name)
{
    using namespace std::chrono;
    for (int i = 0; i < 10; ++i)
    {
        float checksum = 0.0f; // to prevent compiler from optimizing everything away
        Mat a_rm(2048, 2048);
        Mat b_rm(2048, 2048);
        const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(123, 234);
        const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
        const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
        if (i > 1) // skip stats of first iteration (CPU cache stuff etc.)
        {
            std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
        }
    }
}
int main()
{
    run_test<ColMajorMatrixXf>("col major");
    run_test<RowMajorMatrixXf>("row major");
}
$ g++ -std=c++11 -O3 -march=native gemm_test.cpp -o gemm_test
$ ./gemm_test
col major (checksum: 0) elapsed_ms: 172
col major (checksum: 0) elapsed_ms: 169
col major (checksum: 0) elapsed_ms: 168
col major (checksum: 0) elapsed_ms: 168
col major (checksum: 0) elapsed_ms: 168
col major (checksum: 0) elapsed_ms: 167
col major (checksum: 0) elapsed_ms: 167
col major (checksum: 0) elapsed_ms: 171
row major (checksum: 0) elapsed_ms: 166
row major (checksum: 0) elapsed_ms: 168
row major (checksum: 0) elapsed_ms: 169
row major (checksum: 0) elapsed_ms: 170
row major (checksum: 0) elapsed_ms: 168
row major (checksum: 0) elapsed_ms: 171
row major (checksum: 0) elapsed_ms: 169
row major (checksum: 0) elapsed_ms: 168

All the fplus::transform stuff you see in the profiler is just a wrapper around std::transform from FunctionalPlus. This is used in many places in fdeep, for example for transforming the activation functions over the tensors. But transform is not the thing that actually uses the CPU time; it is the functions that are mapped over the tensors. I'm not experienced with the VC profiler, but on Windows I like to use Very Sleepy for profiling, and there you can quickly see that the time spent in transform is only inclusive (calls of subfunctions) and not exclusive. I think I can make out these two columns in your screenshots too. ;) Seeing the Eigen functions (matrix multiplication) and the activation functions (selu or whatever) with a high inclusive percentage (actual bottlenecks) is normal.
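For illustration, a standalone sketch (plain std::transform, no FunctionalPlus): the transform call itself is just the loop, and the CPU time goes into the mapped function, here a ReLU:

// standalone sketch (not fdeep code): the mapping itself is trivial, the CPU time
// goes into the function being mapped over the container (here a ReLU as an example)
#include <algorithm>
#include <vector>

int main()
{
    std::vector<float> xs(1000000, -0.5f);
    std::vector<float> ys(xs.size());
    std::transform(xs.begin(), xs.end(), ys.begin(),
        [](float x) { return std::max(x, 0.0f); });
}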

This all does not yet explain the difference in performance you have between Keras and fdeep.

  1. Are you disabling your GPU for the Keras measurements?
  2. Are you only using one CPU core?
  3. Are you only passing one input tensor for the forward pass in Keras to disable the advantages of batch processing?
    In case you are using the code of keras_export/convert_model.py to measure, the answer to question 3 automatically is "yes".

You can achieve a yes to questions 1 and 2 on Linux like this:

CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 ./keras_export/convert_model.py my_model.h5 my_model.json

I'm not sure how to do it on Windows, though. Perhaps you have to use Manual device placement for this.

But I would be happy to receive your model (.h5 file or the generating .py file) via email (address on my github profile) and also do some measurements with it.

patrikhuber (Contributor Author) commented Feb 2, 2018

Hey, thank you for the quick reply! :-)
Ah, thanks for the explanation regarding instantiated filters.

I still wouldn't rule out Row vs ColMajor as a potential issue, even with your benchmark. It might be quite dependent on the matrix sizes. But I agree with you, it could well be something else.

I'm going to check out Very Sleepy. But the VS profiler can display Inclusive/Exclusive too. I think it's just my inexperience in making sense of this not-so-trivial profiler output.

  1. I've actually got the "tensorflow" package from pip installed and not "tensorflow-gpu", and Keras is using tensorflow as backend. So I don't think there's a way it could even use the GPU. (and I think if you start the Keras/TF session and a GPU is there/enabled, it would show in the log).
    In any case I also ran the prediction with matconvnet (the model is originally trained in matconvnet), and matconvnet manages to predict in 11ms too, exactly the same as Keras. And I've also installed matconvnet without GPU support, that's definitely running CPU-only.

  2. Good idea. Just tested this. Matconvnet is actually using more than 1 core (I think 3-4).
    Keras is fully utilising all 4 cores (100% total).

  3. Yes, only 1 input tensor. As mentioned I'd like to use it in a real-time scenario where each incoming frame will be passed to the model. So there's no batch-processing possible, there's only ever 1 thing to predict in each frame.

It seems like at least part of the answer is using multiple cores. All of these operations should be quite easily parallelisable, as most of them are independent. Do you think that would not be too hard to do in fdeep?
I'm wondering, though, whether this will really make all the difference (i.e. from 60ms to 11ms).

Dobiasd (Owner) commented Feb 2, 2018

I experimented with enabling OpenMP so Eigen can parallelise its stuff, but the success was very limited with the models I tested. The CPU usage increased a lot, but the run-time did not go down much.

If you send me the model, I will measure it here. Otherwise you can try to disable all but one CPU core, perhaps simply in the Windows task manager for this one process, to see the "clean" Keras run-time.

patrikhuber (Contributor Author) commented:

you can try to disable all but one CPU core

Great idea. I've pinned the processes to one CPU core. I'm getting now:

  • matconvnet: 18ms
  • Keras: 43ms

It seems like matconvnet/Matlab is bloody good :-)
It is noteworthy that the tensorflow pip package does not use AVX/AVX2, so if we compare fdeep to Keras, then it's 60ms vs 43ms. That's around 35% difference - quite a bit, but not as huge anymore.
Looking at matconvnet's numbers though, it seems like there should definitely be some leeway.

Hmm!

Dobiasd (Owner) commented Feb 3, 2018

I don't know matconvnet, but the numbers look suspicious to me. Perhaps it fans out to multiple cores not using threads but multiprocessing, and thus is not as easy to "catch" on one CPU. Disabling all but one CPU system-wide could help to make sure matconvnet cannot sneak away from the restriction. ;)

Dobiasd (Owner) commented Feb 3, 2018

In an attempt to get the question about the performance of row-major vs. column-major out of the way, I adjusted the GEMM benchmark code to use randomized matrix sizes. It still looks as though there is no significant difference:

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <string>
#include <eigen3/Eigen/Dense>
using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}
int main()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 2048);
    for (std::size_t i = 0; i < 100; ++i)
    {
        int s1 = dis(gen);
        int s2 = dis(gen);
        int s3 = dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
}
$ g++ -std=c++11 -O3 -march=native gemm_test.cpp -o gemm_test
$ ./gemm_test 
854 175 1217
col major (checksum: 0) elapsed_ms: 64
row major (checksum: 0) elapsed_ms: 37
--------
1272 1680 52
col major (checksum: 0) elapsed_ms: 28
row major (checksum: 0) elapsed_ms: 28
--------
572 1554 430
col major (checksum: 0) elapsed_ms: 78
row major (checksum: 0) elapsed_ms: 80
--------
1513 1720 1735
col major (checksum: 0) elapsed_ms: 948
row major (checksum: 0) elapsed_ms: 934
--------
1195 772 123
col major (checksum: 0) elapsed_ms: 26
row major (checksum: 0) elapsed_ms: 25
--------
1961 570 1729
col major (checksum: 0) elapsed_ms: 409
row major (checksum: 0) elapsed_ms: 390
--------
982 1233 471
col major (checksum: 0) elapsed_ms: 114
row major (checksum: 0) elapsed_ms: 119
--------
1842 1447 24
col major (checksum: 0) elapsed_ms: 23
row major (checksum: 0) elapsed_ms: 23
--------
476 663 1974
col major (checksum: 0) elapsed_ms: 131
row major (checksum: 0) elapsed_ms: 143
--------
1160 1346 1617
col major (checksum: 0) elapsed_ms: 495
row major (checksum: 0) elapsed_ms: 507
--------
692 1634 288
col major (checksum: 0) elapsed_ms: 66
row major (checksum: 0) elapsed_ms: 69
--------
1720 1895 32
col major (checksum: 0) elapsed_ms: 34
row major (checksum: 0) elapsed_ms: 34
--------
2007 396 469
col major (checksum: 0) elapsed_ms: 79
row major (checksum: 0) elapsed_ms: 80
--------
1747 556 1681
col major (checksum: 0) elapsed_ms: 330
row major (checksum: 0) elapsed_ms: 338
--------
1711 1658 499
col major (checksum: 0) elapsed_ms: 292
row major (checksum: 0) elapsed_ms: 295
--------

patrikhuber (Contributor Author) commented Feb 3, 2018

Your Eigen benchmark is interesting. I am actually quite surprised that we don't see a difference. I would really expect to see one, since in one case it should read a cache line from memory and can then use the whole line, while in the other case it should have to fetch much more often. But that's about as far as my knowledge goes, so it's quite possible I'm missing something.

and thus is not as easy to "catch" on one CPU

I also checked the task manager while it was running, and it was strictly capped at 12.5% (which is 1/8th, exactly one of the "eight" cores (4+4 from HT)). But great idea with the msconfig option. I disabled all cores except for the first and measured again.
I'll put all the measurements that we collected so far into a table below.
PS: Running a system with only 1 CPU was a horrible experience (at least on Windows), do not do it ;-)

For Visual Studio, I ran the benchmark and tested /O2, /O3, /Oi, /Os, /Ot, /GL, /arch:AVX2, /fp:precise, /fp:fast and various combinations thereof. The best result was with AVX2 and O2 (or O3) and the rest on default. Enabling "more aggressive" stuff like /fp:fast, /Ot etc. either made no difference or even made it 5-10ms slower.

I also ran my benchmark on WSL with gcc and clang. The results are quite interesting.

(Compilers: VS2017 15.5.5 for the MSVC rows, gcc-5.4/gcc-7/clang-6 on WSL for the rows with gcc-style flags, and gcc on @Dobiasd's machine for the last block. All timings in ms per forward pass.)

1 core, pinned thread (i7-7700HQ):
  • matconvnet: 18
  • Keras (no AVX/AVX2): 43

1 core, system-wide (i7-7700HQ):
  • matconvnet: 16
  • Keras (no AVX/AVX2): 29
  • fdeep, MSVC (AVX2, /O2 - best flags): 48
  • fdeep, MSVC (no AVX2, /O2): 55
  • fdeep, gcc/clang on WSL (-march=core2, -O3): 42
  • fdeep, gcc/clang on WSL (-march=native, -O3): 20

All cores (i7-7700HQ):
  • matconvnet: 10
  • Keras (no AVX/AVX2): 11
  • fdeep, MSVC (AVX2): 50
  • fdeep, MSVC (no AVX2, /O2): 60

1 core, pinned thread (@Dobiasd i5-6600):
  • Keras (no AVX/AVX2): 26
  • fdeep, gcc (-O3): 31
  • fdeep, gcc (-O3 -march=native): 17

The benchmarks were made with Eigen 3.3.4 in all cases.

I think we can conclude some things from the benchmarks so far.
(Small remark first: I didn't use a Benchmark tool like google-benchmark, so I would take differences of 5ms or so with a grain of salt, but overall, the results are quite consistent.)

First I would say: Matlab/matconvnet is bloody good, also on a single-core. Perhaps not too surprising, linear algebra is their main business. But it is a bit surprising that Keras/Tensorflow (or Eigen, for that matter) can't get close to it.

Second, it seems like the MSVC optimiser really sucks in this scenario. 48ms was the best I could achieve, while gcc and clang easily achieve <=20ms. Even without -march=native, gcc & clang are faster than MSVC with AVX2.
I think I will file a ticket for this with Microsoft. Also maybe it's worth posting it on the Eigen tracker/list too, I'm not sure - maybe they can optimise their code on MSVC to "help" the optimiser. But we still don't really know which parts are slower, i.e. whether it is really the multiplications with Eigen, or something else, right?

Third, with regards to fdeep, we can probably conclude that it does a very good job, and there might not be much that we can do? Or what do you think?

What do you think about (optional) parallelising in fdeep on a per-filter basis? Probably there's a transform(...) somewhere that we could parallelise, either with something like a lightweight thread pool or std::async? (parallel std::transform might fit too but it's a bit too early for that C++17 feature).
Or do you know how TensorFlow and the other frameworks parallelise the prediction?
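To illustrate the idea, here is a rough standalone sketch of per-filter parallelisation with std::async (apply_filter is a hypothetical stand-in for convolving the input with one filter; this is not fdeep code):

// rough standalone sketch of per-filter parallelisation with std::async;
// apply_filter is a hypothetical stand-in for convolving the input with one filter
#include <cstddef>
#include <future>
#include <vector>

std::vector<float> apply_filter(std::size_t filter_idx)
{
    // placeholder workload; in reality this would be the convolution for one filter
    return std::vector<float>(64 * 64, static_cast<float>(filter_idx));
}

int main()
{
    const std::size_t num_filters = 64;
    std::vector<std::future<std::vector<float>>> futures;
    for (std::size_t f = 0; f < num_filters; ++f)
        futures.push_back(std::async(std::launch::async, apply_filter, f));
    std::vector<std::vector<float>> feature_maps;
    for (auto& fut : futures)
        feature_maps.push_back(fut.get());
}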

patrikhuber changed the title from "Run time - fdeep 5x slower than Keras" to "Slow-ish run time on MSVC" on Feb 3, 2018
Dobiasd (Owner) commented Feb 3, 2018

First of all: Great analysis. Thanks for this interesting stuff. :)

Parallelization is always easier to do on the outside than on the inside. So if you need to do multiple predictions, perhaps you can just run them simultaneously to achieve the needed throughput. Of course this does not help to get the result of a single prediction earlier.

Enabling OpenMP in the compiler could be one option, since Eigen automatically uses it where appropriate. The individual transforms in fdeep each run for quite a short time, so I'm not sure the parallelization overhead would be worth it in that case. As a test you can simply replace all instances of fplus::transform with fplus::transform_parallelly in the code of fdeep and see what happens (see the sketch below). But I don't expect much (or anything at all) with your VGG-like model architecture, which is very serially bound. This could help more if your computational graph were wider (Xception, Inception, ResNet, DenseNet, etc.).
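For illustration, this is the kind of one-line change I mean, shown standalone with FunctionalPlus only (assuming transform_parallelly takes the same arguments as transform):

// minimal standalone example of the replacement (FunctionalPlus only, not fdeep code)
#include <fplus/fplus.hpp>
#include <iostream>
#include <vector>

int main()
{
    const std::vector<int> xs = {1, 2, 3, 4};
    const auto square = [](int x) { return x * x; };
    const auto sequential = fplus::transform(square, xs);             // the current call
    const auto parallel = fplus::transform_parallelly(square, xs);    // the drop-in replacement
    std::cout << sequential.size() + parallel.size() << std::endl;
}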

I just profiled fdeep with your model on my system and almost all of the CPU time is used in Eigen's GEMM (about 50%) and in the preparation (filling) of matrices for it (also about 50%). The relu activation functions do not show up at all, and the pooling can also be ignored:
[sysprof output screenshot]

Btw, thanks for adjusting the title of this issue. :D

edit: In your table I updated the measurements I made, just in case you are wondering.

patrikhuber (Contributor Author) commented Feb 4, 2018

Thank you for your measurements, I think they confirm my interpretation of my VS profiler results.

I tried with /openmp. It does use all the cores (task manager shows 100%), but the run time gets worse, from 60ms per prediction to 70ms. (with and without AVX2 - same behaviour).

I think this is probably because Eigen parallelises the GEMM, which is what makes sense to do for Eigen. But the convolutions probably consist of many small GEMM operations, so it doesn't make sense to parallelise on a GEMM level. I also don't think that's what Keras/TensorFlow do. I think they parallelise on a filter level, or something along these lines. Which is what makes the most sense for CNN predictions.
It looks like Matconvnet and Keras do that and both parallelise on a batch-level as well as filter level, so the case when you only input 1 item for prediction is also parallelised in a smart way.

I might play around with fplus::transform_parallelly a bit, thank you for the pointer.

VGG-like model architecture, which is very serially bound

I think the execution graph might be serial, but it should still be parallelisable well on a filter-level, don't you think? I would guess this should be the optimal granularity for parallelisation.

the preparation (filling) of matrices for it (also about 50%)

I suppose there isn't anything that could be optimised there? Like using Eigen::Map (so it would allocate/copy/delete less data), or e.g. the Eigen/unsupported/Tensor stuff to avoid copies (I think TensorFlow is built upon that?), or maybe using some preallocated storage buffer for Eigen and reusing that? (I don't know the details of the last suggestion, but I've heard it somewhere.)

Dobiasd (Owner) commented Feb 4, 2018

Currently one (im2col) convolution only consists of one single large matrix multiplication.

I don't know what TensorFlow does. And when I tried to, I found it quite hard to read from their source code how it is working.

I think the execution graph might be serial, but it should still be parallelisable well on a filter-level, don't you think?

Probably, yes. At least without im2col I can well imagine how to parallelize the outer loop. But to become faster than im2col, one would need a lot of filters and CPU cores, since it is intrinsically a lot slower than im2col.

I suppose there isn't anything that could be optimised there?

Perhaps the filling of the matrices for the GEMM can be parallelized, but I'm not sure how. I struggled a lot to get it working at all in the first place, and I already can't remember what it actually does. I'm glad I have the automatic tests making sure there are no big mistakes. ;-)

However, I guess we cannot avoid copying the values completely, because the matrix used in the GEMM to kind of emulate a usual convolution looks a lot different from the actual tensors passed between the layers. Sorry, I currently cannot help more with this problem; I'm kind of stumped. Perhaps you can make more sense of all this? :-) Here is a 7-minute high-level overview of how im2col works.

patrikhuber (Contributor Author) commented Feb 4, 2018

Currently one (im2col) convolution only consists of one single large matrix multiplication.

Oh yes! You are of course right, this processes all filters at the same time. Wow. I really gotta watch that im2col video now, thank you :-)
Given that, it is actually quite surprising that /openmp doesn't help, as it is one big matrix multiplication (the big computation, actually). But I guess for some reason the parallelisation overhead is greater than the gain...

You have some very good insight and raise good points. Thank you for the explanations.
It also looks to me like for single-core, fdeep is already extremely good, maybe as good as it can get, without some assembly or CPU-specific hand-written kernels. As for the parallel case, I really wonder how Keras/TF/Matconvnet do it, but indeed all their code is quite hard to read and probably not doable without significant effort.

I think we can probably close this then. I guess the conclusion is that there isn't any issue (apart from the MSVC optimiser), and that there are parallelisation options, but none of them would be trivial. I will have a good think about it, and we can always re-open or add something if either of us or anyone else has additional knowledge or ideas.

Thank you very much for all the help!

Dobiasd (Owner) commented Feb 4, 2018

As a more or less desperate attempt to speed up im2col convolution for VGG-like networks, I made some of the parameters known at compile time for the special case.
The same method helped in the non-im2col convolution, since it allows the compiler to add some more optimizations, so why not here.

For testing I hacked it into the code in this ugly way:

template <std::size_t strides_y, std::size_t strides_x, std::size_t offset_y, std::size_t offset_x, std::size_t fy, std::size_t fx>
inline tensor3 convolve_im2col_opt(
    std::size_t out_height,
    std::size_t out_width,
    const std::vector<filter>& filters,
    const tensor3& in_padded)
{
    //std::cout << "convolve_im2col_opt\n";
    const std::size_t fz = filters.front().shape().depth_;

    RowMajorMatrixXf a(fz * fy * fx + 1, out_height * out_width);
    Eigen::Index a_y = 0;
    for (std::size_t zf = 0; zf < fz; ++zf)
    {
        for (std::size_t yf = 0; yf < fy; ++yf)
        {
            for (std::size_t xf = 0; xf < fx; ++xf)
            {
                Eigen::Index a_x = 0;
                for (std::size_t y = 0; y < out_height; ++y)
                {
                    for (std::size_t x = 0; x < out_width; ++x)
                    {
                        a(a_y, a_x++) = in_padded.get(zf,
                                offset_y + strides_y * y + yf,
                                offset_x + strides_x * x + xf);
                    }
                }
                ++a_y;
            }
        }
    }

    Eigen::Index a_x = 0;
    for (std::size_t y = 0; y < out_height; ++y)
    {
        for (std::size_t x = 0; x < out_width; ++x)
        {
            a(a_y, a_x++) = static_cast<float_type>(1);
        }
    }
    ++a_y;

    RowMajorMatrixXf b(filters.size(), fz * fy * fx + 1);
    Eigen::Index b_y = 0;
    Eigen::Index b_x = 0;
    for (std::size_t f = 0; f < filters.size(); ++f)
    {
        b_x = 0;
        const filter& filter = filters[f];
        for (std::size_t zf = 0; zf < fz; ++zf)
        {
            for (std::size_t yf = 0; yf < fy; ++yf)
            {
                for (std::size_t xf = 0; xf < fx; ++xf)
                {
                    b(b_y, b_x++) = filter.get(zf, yf, xf);
                }
            }
        }
        b(b_y, b_x++) = filter.get_bias();
        ++b_y;
    }

    const auto result = b * a;

    return tensor3(shape3(filters.size(), out_height, out_width),
        eigen_mat_to_values(result));
}

And added the following to the beginning of convolve_im2col

    if (strides_y == 1 && strides_x == 1 && offset_y == 0 && offset_x == 0 && fy == 3 && fx == 3)
    {
        return convolve_im2col_opt<1,1,0,0,3,3>(out_height, out_width, filters, in_padded);
    }

However this does not speed up the thing at all. I would have expected at least a tiny speed-up.

patrikhuber (Contributor Author) commented:

Hey, thanks for this!

I've also tried some more things, let me go step by step:

  1. First I quickly measured with VS2017 15.5.5 again, just to double check. 49ms => check.
    I then tried VS2017 15.6.0 Preview 3.0 (the very latest version). => 45ms. A slight improvement.

  2. I tried switching a and/or b to ColMajor in convolve_im2col but I observed the same as you, I could not get it faster.

  3. I tried your changes with the templated convolve_im2col. I think I am seeing a ~1ms improvement in VC 15.6.0 Preview 3.0, but my benchmark is too crude to discern this - we'd have to use a proper benchmark tool like Google Benchmark.
    I tried templating over fz too and making a and b with one dimension fixed size. I think the change was too small to be measured with my crude benchmark.

I did all these measurements with /O2 /arch:AVX2, I didn't bother trying any other flags anymore.

I wonder whether it would help making everything fixed size, including the fz and out_width/height. It is certainly great to have everything being loaded from a Keras model file dynamically, but usually, once you've trained a good model and use it in production, the model architecture is not going to change for a long while, and every bit of additional performance is more important (particularly in C++ ;-) ).
This would probably require some changes to the code, as we would have to specify the model definition in C++ and then load the weights from the Keras file (preferably checking whether the manually specified model is identical to the one loaded from the Keras file).

One thing that would actually be good to measure is how much time is spent in each (conv) layer and measuring that over multiple runs. Do you have any better idea how to do this than for example writing to a separate csv file in each layer and then sum up and print the times in Python?
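For illustration only, here is a sketch of the kind of low-tech per-layer timing I have in mind: a small RAII timer accumulating totals per layer name (scoped_layer_timer is hypothetical, not fdeep code):

// sketch of one possible approach (not fdeep code): a small RAII timer that each
// layer's apply function could be wrapped in, accumulating totals per layer name
#include <chrono>
#include <iostream>
#include <map>
#include <string>

std::map<std::string, double>& layer_times()
{
    static std::map<std::string, double> times_ms;
    return times_ms;
}

struct scoped_layer_timer
{
    explicit scoped_layer_timer(const std::string& name)
        : name_(name), start_(std::chrono::high_resolution_clock::now()) {}
    ~scoped_layer_timer()
    {
        const auto end = std::chrono::high_resolution_clock::now();
        layer_times()[name_] +=
            std::chrono::duration<double, std::milli>(end - start_).count();
    }
    std::string name_;
    std::chrono::high_resolution_clock::time_point start_;
};

int main()
{
    {
        scoped_layer_timer timer("conv2d_1"); // would wrap the layer's forward pass
    }
    for (const auto& entry : layer_times())
        std::cout << entry.first << ": " << entry.second << " ms" << std::endl;
}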

In the link about im2col that you posted, he afterwards introduces a faster convolution (mainly for 3x3 kernels) from this paper https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Lavin_Fast_Algorithms_for_CVPR_2016_paper.pdf, based on Winograd. They get like a 2x speed-up on VGG for a batch size of 1, on the GPU. It's overall quite cool, and I've seen this used in some deep learning repositories across GitHub. I think this could speed-up things on the CPU as well. But I think to implement this, and to do it efficiently, is quite beyond my skill level (or would take an insane amount of time).
One very interesting resource I could find is NNPACK: https://github.com/Maratyszcza/NNPACK. It seems to be used, or at least optionally supported, by all the big deep learning frameworks. Most importantly for this issue, it has optimised kernels for inference and mobile. It looks quite awesome actually! Apart from the fact that it's C code, I'm not sure it's header-only - there is a src directory, but in the src/x86_64-fma directory there are only Python files as far as I can see, so I'm a bit confused and would have to take a closer look. What would probably be awesome is to take their C code for all their inference kernels and integrate them neatly into fdeep :-) (this is interesting too: Maratyszcza/NNPACK#25, which links to https://github.com/zeno40/nnpack-windows, but I'd rather base off the original repo). It would probably also be worth looking at how tiny-dnn integrates NNPACK.

This is a good resource as well (Maratyszcza/NNPACK#30) and maybe this (https://eigen.tuxfamily.org/dox-devel/unsupported/TensorConvolution_8h_source.html) would be a good first try as well, it at least seems to be a multi-core implementation. (TensorFlow seems to use Eigen Convolution actually, while Caffe2 and others seem to use NNPACK tensorflow/tensorflow#9319).
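For reference, the interface of Eigen's unsupported tensor convolution looks roughly like this (a minimal sketch; the dimensions are arbitrary and not tied to fdeep's tensor3 layout, and this convolves each depth slice with the same 2D kernel rather than doing a full CNN-style convolution):

// minimal sketch of the interface of Eigen's (unsupported) Tensor convolution
#include <cstddef>
#include <unsupported/Eigen/CXX11/Tensor>

int main()
{
    Eigen::Tensor<float, 3> input(3, 64, 64);  // depth, height, width (arbitrary layout)
    Eigen::Tensor<float, 2> kernel(3, 3);
    input.setRandom();
    kernel.setRandom();
    Eigen::array<std::ptrdiff_t, 2> dims({1, 2});  // convolve over height and width
    Eigen::Tensor<float, 3> output(3, 62, 62);     // "valid" convolution shrinks by kernel size - 1
    output = input.convolve(kernel, dims);
}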

Some other resources, but probably not as interesting as NNPACK: http://www.tvmlang.org/2018/01/16/opt-mali-gpu.html, tiny-dnn/tiny-dnn#109 (comment).

Okay, this ended up being quite a long post... :-)
I am quite convinced now that the way to go forward would be to try the Eigen tensor convolution first, and if that is a dead end, then to integrate the inference kernels from NNPACK (in a way that lets fdeep stay header-only with its nice interface). I probably can't spend any time on this over the next weeks as I have some other deadlines, but I would be very interested to get your opinion on it (and whether it's something you could imagine working on too), and then we could see from there.

Dobiasd (Owner) commented Feb 6, 2018

Awesome analysis and research!

Your method of measuring is fine, I guess. A change that only decreases the runtime by 1 ms (out of 45) is probably quite boring anyway. ;)

we would have to specify the model definition in C++

I suggest we first prove that knowing more hyperparameters at compile time would really help, since this seems like a huge change.

One thing that would actually be good to measure is how much time is spent in each (conv) layer and measuring that over multiple runs. Do you have any better idea how to do this than for example writing to a separate csv file in each layer and then sum up and print the times in Python?

Perhaps we could split the models into small models with one layer each for easier measurement. But in the end it probably won't make much of a difference.

I also heard about Winograd convolutions, but did not follow this path, because it seemed very long to me. O:)

Using TensorConvolution from Eigen sounds like a good idea to me. And if TensorFlow uses it too, we might be able to keep the automatic testing that makes sure the output is exactly the same as with Keras. Without these tests it is extremely hard to change anything in fdeep safely.
I don't know if this would be possible with NNPACK, because interestingly the implementations have some wiggle room in the details of a convolution. This is the reason I gather these silly flags when converting, store them with the model, and use them in the C++ code.


Also, there are two smaller things I would like to try out. The performance win would probably not be that big, but perhaps it's worthwhile anyway:

  1. For every im2col convolution, fdeep currently fills matrix b from scratch with the filter values. This could be done in the constructor of the conv layers while loading the model, so it does not have to be done over and over at prediction time.

  2. The result of the big matrix multiplication is copied over into an fdeep tensor (using a std::vector as storage). Perhaps we can avoid this copy, e.g. by letting Eigen write its multiplication result directly into the memory block of the vector (see the sketch below).
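A minimal standalone sketch of idea 2 (not the actual fdeep code): map the vector's memory as a matrix and let Eigen write the product into it directly:

// minimal standalone sketch: write the GEMM result directly into the memory of an
// existing std::vector via Eigen::Map instead of copying it over afterwards
#include <vector>
#include <eigen3/Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main()
{
    const RowMajorMatrixXf a = RowMajorMatrixXf::Random(8, 16);
    const RowMajorMatrixXf b = RowMajorMatrixXf::Random(16, 4);
    std::vector<float> storage(static_cast<std::size_t>(a.rows() * b.cols()));
    Eigen::Map<RowMajorMatrixXf, Eigen::Unaligned> c(storage.data(), a.rows(), b.cols());
    c.noalias() = a * b; // result lands in storage, no extra copy
}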

Dobiasd (Owner) commented Feb 6, 2018

With this commit the filter matrices are now pregenerated when loading a model. This saves some time during prediction, even if it is not very much (-O3 only, so no -march=native):

Model        before   after
InceptionV3  0.61 s   0.54 s
ResNet50     0.41 s   0.35 s
VGG16        1.44 s   1.37 s
VGG19        1.76 s   1.67 s
Xception     1.00 s   0.86 s
DenseNet201  0.57 s   0.49 s
NASNetLarge  3.64 s   3.12 s

Dobiasd (Owner) commented Feb 6, 2018

OK, here is the next commit.
It avoids the mentioned copying of the results of the matrix multiplication into the memory (vector) of the tensor3. However, this gains almost nothing in performance. But OK, the simple memcpy was probably quite fast already. ;)

Dobiasd (Owner) commented Feb 6, 2018

I was hoping that perhaps the reason the last commit (avoid GEMM result copy) did not bring much in terms of performance is that Eigen does not like to write its matrix multiplication results into unaligned memory (Eigen::Map<RowMajorMatrixXf, Eigen::Unaligned>). We could work around this by using a custom allocator for our std::vector (i.e. fdeep::internal::float_vec), one that automatically aligns to 16 bytes, which is the alignment Eigen uses when allocating its matrix memory internally.
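Such an allocator could look roughly like this (a sketch assuming posix_memalign, i.e. Linux/glibc; MSVC would need _aligned_malloc/_aligned_free instead):

// sketch: allocator returning 16-byte-aligned memory for std::vector
#include <cstdlib>
#include <new>
#include <vector>

template <typename T>
struct aligned_16_allocator
{
    using value_type = T;
    aligned_16_allocator() = default;
    template <typename U>
    aligned_16_allocator(const aligned_16_allocator<U>&) {}
    T* allocate(std::size_t n)
    {
        void* ptr = nullptr;
        if (posix_memalign(&ptr, 16, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T*>(ptr);
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <typename T, typename U>
bool operator==(const aligned_16_allocator<T>&, const aligned_16_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const aligned_16_allocator<T>&, const aligned_16_allocator<U>&) { return false; }

// hypothetical drop-in for fdeep::internal::float_vec
using float_vec_aligned = std::vector<float, aligned_16_allocator<float>>;

int main()
{
    float_vec_aligned v(1024, 0.0f); // v.data() is now 16-byte aligned
    return v.empty() ? 1 : 0;
}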

To test this hypothesis, I benchmarked Eigen for both cases, i.e. multiplying two matrices into unaligned external memory and multiplying two matrices into 16-byte-aligned external memory.

However the results are sobering:

#include <array>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <eigen3/Eigen/Dense>
using Mat = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
void run_test_unaligned(int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a(s1, s2);
        Mat b(s2, s3);
        const std::size_t num_bytes = a.rows() * b.cols() * sizeof(float);
        float* ptr = (float*)std::malloc(num_bytes);
        Eigen::Map<Mat, Eigen::Unaligned> c(
            ptr,
            static_cast<Eigen::Index>(a.rows()),
            static_cast<Eigen::Index>(b.cols()));
        c.noalias() = a * b;
        checksum += ptr[0];
        std::free(ptr);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << "unaligned (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}
void run_test_aligned(int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a(s1, s2);
        Mat b(s2, s3);
        const std::size_t num_bytes = a.rows() * b.cols() * sizeof(float);
        float* ptr = (float*)aligned_alloc(16, num_bytes);
        Eigen::Map<Mat> c(
            ptr,
            static_cast<Eigen::Index>(a.rows()),
            static_cast<Eigen::Index>(b.cols()));
        c.noalias() = a * b;
        checksum += ptr[0];
        std::free(ptr);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << "aligned   (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}
int main()
{
    for (std::size_t i = 0; i < 100; ++i)
    {
        run_test_unaligned(1024, 2048, 768);
        run_test_aligned(1024, 2048, 768);
        std::cout << "--------" << std::endl;
    }
}
$ g++ -std=c++11 -O3 -march=native gemm_test_alignment.cpp -o gemm_test_alignment
./gemm_test_alignment
unaligned (checksum: 0) elapsed_ms: 327
aligned   (checksum: 0) elapsed_ms: 326
--------
unaligned (checksum: 0) elapsed_ms: 328
aligned   (checksum: 0) elapsed_ms: 328
--------
unaligned (checksum: 0) elapsed_ms: 327
aligned   (checksum: 0) elapsed_ms: 330
--------
unaligned (checksum: 0) elapsed_ms: 327
aligned   (checksum: 0) elapsed_ms: 328
--------
unaligned (checksum: 0) elapsed_ms: 328
aligned   (checksum: 0) elapsed_ms: 331
--------
unaligned (checksum: 0) elapsed_ms: 326
aligned   (checksum: 0) elapsed_ms: 325
--------
unaligned (checksum: 0) elapsed_ms: 329
aligned   (checksum: 0) elapsed_ms: 328

So there probably is nothing to gain here.

Dobiasd (Owner) commented Feb 6, 2018

Oh, but good news for you (hopefully I didn't make a mistake while measuring). The time for a forward pass with your model went down from 17 ms to 10 ms on my machine (-march=native). Could you please check if you see a similar drastic effect with your setup?

Edit: If I take a look at your model architecture, it does seem plausible, since you are using quite small data tensors (height and width) but a huge number of filters, especially in the last layers. So your forward pass probably likes the im2col filter matrices being precalculated when the model is loaded.

Edit2: Just double-checked. The performance gain for your model is real on my system. When I check out this version 100 forward passes take 1715 ms. With that version it is 988 ms.

patrikhuber (Contributor Author) commented Feb 6, 2018

Wow, this is awesome!

Timings for my model with the very latest fdeep commit:

  • No AVX: 38ms on both VS2017 15.5.5 and 15.6.0 Preview 3 (that's down from 60ms!)
  • With AVX2: 29ms on 15.5.5, 27ms on 15.6.0 Preview 3 (down from 50ms!)

That is really an amazing improvement. It's obviously still not as fast as I'd like (29ms means "barely real-time" (30fps)), but this is very hopeful, and I think I can work with that for now. :-) Also because the gcc timings are quite amazing:

  • gcc-7: From 21ms down to 13ms. With -fopenmp it goes to 11ms (but sometimes it spikes up to around 20-30ms - maybe because the OMP run-time is not always "hot"). Same timings with clang-6.0.

I played around with the MSVC flags again, but none of them makes it any faster (and /openmp makes it slower). And just for reference, it's the same speed with AVX (1) and AVX2, so the speed gain comes from SSE->AVX, and there is none from AVX->AVX2.

I also quickly tested cd90b82 (which doesn't have the tensor-no-copy improvement yet) and I think it's like 1ms or so slower, so it seems indeed like a small improvement. I think this is too small to be measured correctly by my crude benchmark but let's say it is indeed 1ms, then it's actually still a 3% improvement, which is I'd say significant. :-)

Btw, the other speed gains in your tables are also really good. You say it "saves some time during prediction, even if it is not very much", but it's actually around 10-20%!

And regarding this: "interestingly the implementations have some wiggle room in the details of a convolution": I was actually quite surprised (and very happy!) that the prediction values for a test example were pretty much exactly the same for Matconvnet, Keras and fdeep. I think I checked up to 4 or 6 floating point digits. Even though Matlab probably uses quite different padding, convolutions, matrix multiply, etc... :-)

I think this is definitely awesome, thank you very much for these improvements! 🎉 🎉 🎉

Dobiasd (Owner) commented Feb 7, 2018

Could you try to find out whether MSVC is so slow with the matrix multiplication in Eigen or with the actual fdeep code? Perhaps you can run the GEMM benchmark from above on MSVC and GCC and compare the results.

patrikhuber (Contributor Author) commented Feb 7, 2018

Note: I had to disable the aligned test on MSVC, as it doesn't support C11 aligned_alloc and apparently never will (I think they recommend using C++11 alignas/alignof instead).
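(For completeness, a sketch of what an MSVC-only variant could look like, using _aligned_malloc/_aligned_free from <malloc.h>; I did not benchmark this variant:)

// untested sketch of an MSVC-only replacement for the aligned test
#include <malloc.h>
#include <eigen3/Eigen/Dense>
using Mat = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main()
{
    const Mat a = Mat::Random(256, 512);
    const Mat b = Mat::Random(512, 128);
    const std::size_t num_bytes = a.rows() * b.cols() * sizeof(float);
    float* ptr = static_cast<float*>(_aligned_malloc(num_bytes, 16)); // size first, then alignment
    Eigen::Map<Mat> c(ptr, a.rows(), b.cols());
    c.noalias() = a * b;
    _aligned_free(ptr); // memory from _aligned_malloc must not go to std::free
}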

VS2017 15.6.0 Preview 3:

>cl.exe gemm_alignment_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo

>gemm_alignment_test.exe
unaligned (checksum: 0) elapsed_ms: 770
--------
unaligned (checksum: 0) elapsed_ms: 770
--------
unaligned (checksum: 0) elapsed_ms: 771
~$ g++-7 gemm_alignment_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_alignment_test

$ ./gcc7_gemm_alignment_test
unaligned (checksum: 0) elapsed_ms: 612
aligned   (checksum: 0) elapsed_ms: 619
--------
unaligned (checksum: 0) elapsed_ms: 620
aligned   (checksum: 0) elapsed_ms: 637
--------
unaligned (checksum: 0) elapsed_ms: 618
aligned   (checksum: 0) elapsed_ms: 619

clang-6 is around 730ms, so between the two.

So interestingly, there is "only" a 20-25% difference. Quite a bit though, but probably not out of the ordinary, as clang-6 is also 20% slower than gcc.

Now, more interestingly, I took the first 4 random matrix sizes from your random benchmark and ran these again with both compilers:

>gemm_test.exe
1124 1215 1465
col major (checksum: 0) elapsed_ms: 982
row major (checksum: 0) elapsed_ms: 1000
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1947
row major (checksum: 0) elapsed_ms: 2011
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 973
row major (checksum: 0) elapsed_ms: 954
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 682
row major (checksum: 0) elapsed_ms: 733
$ ./gcc7_gemm_test
1124 1215 1465
col major (checksum: 0) elapsed_ms: 701
row major (checksum: 0) elapsed_ms: 701
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1353
row major (checksum: 0) elapsed_ms: 1342
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 434
row major (checksum: 0) elapsed_ms: 429
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 340
row major (checksum: 0) elapsed_ms: 344

Now MSVC is suddenly significantly slower. So it seems to depend on the matrix size...? I'm a bit at a loss.

Another, but perhaps unrelated, question is why the run time of your gcc-compiled executable is half that of mine. But your CPU has a 3.9 GHz turbo and mine has a 3.5 GHz turbo, so maybe it's that. Or because I'm running it in WSL.

I just double-checked, I'm definitely using Eigen-3.3.4. And I see you used gcc with c++11 while I used 14/17 but I checked and it doesn't make any difference.

Dobiasd (Owner) commented Feb 7, 2018

Cool, I did not even know CPUs have a turbo boost.
[KITT gif]

However, our values (3.9 GHz and 3.5 GHz) do not explain a performance difference of a factor of 2. But there are probably other factors as well, like RAM speed etc.
Or the WSL thing, yes.

But since it is not a constant performance difference and seems to depend on the matrix sizes, there has to be something else going on, I guess.

Does the system you want to deploy your model to in the end have the same configuration as your dev machine?

patrikhuber (Contributor Author) commented:

I just tried something else just to make really sure it's not some weird MSVC flag combination or some flag I'm missing. I created a "plain" project with the VS "New Project" wizard, pasted the gemm test code, and ran it in x64 release mode. And it was even 2x slower than MSVC on the command line 🤣
After a bit of fiddling with the flags, I found out that /GL (Whole Program Optimization) is the culprit, which is on by default. Once I turned it off, I was back to my old numbers from compiling with MSVC on the command line. For this one I should probably really file an MSVC bug 😆
But it's a good confirmation that I'm not missing some important flag or another trivial mistake.

What gcc version did you compile on?

I just tried it in a Linux VM instead of WSL and got the same results.

Does the system you want to deploy your model to in the end have the same configuration as your dev machine?

It's a cross-platform project so there isn't one specific system or machine, but msvc is one of the important platforms. But yea, it might be worth testing it on another Windows machine.

Dobiasd (Owner) commented Feb 7, 2018

Interesting results for Whole Program "Optimization". ;)
Let's see if the MSVC devs find out what's happening.

I'm using g++ 5.4.0 (the default version on Ubuntu 16.04.6, I guess).

By the way, you can use gcc (i.e. MinGW) to produce native Windows executables in many cases. Perhaps this helps with the run time.

patrikhuber (Contributor Author) commented Feb 7, 2018

Hmm, this is interesting. I tested on an i5-3550, which only has AVX and no AVX2 (both our main CPUs have AVX2). VS2017 is more or less equally fast as on my i7. But g++ (5.4.0) with -march=native is also significantly slower on the old i5 - in fact, very similar to VS2017 (maybe 10-20% faster in some cases). Interestingly, g++ 7.2.1 completely fails on the i5 and the benchmark is 2.5x slower... (this seems like a g++ performance regression).

It kind of looks to me as if this has to do with AVX and AVX2. MSVC cannot really optimise the code better when AVX2 is available; it's just as slow as with AVX. g++ and clang, on the other hand, seem about as fast as MSVC when only AVX is available on a CPU. But if AVX2 is available, then g++ and clang can really make use of it and blow MSVC out of the water.
I'm sure there is more going on, as this only explains part of the difference between MSVC and g++.

In any case I think fdeep's performance is really good now, the numbers on g++ with AVX2 are extremely impressive. The rest seems up to the compiler optimisers and/or Eigen.
I'm pretty sure we could optimise it quite a bit further with NNPACK for all sorts of CPUs, but I guess that's for the future.

patrikhuber (Contributor Author) commented Feb 7, 2018

Btw here's the profiler report from the latest fdeep version on MSVC:

[profiler screenshot]

Dobiasd (Owner) commented Feb 7, 2018

Good explanation. It could be true; at least it makes sense.

And out of the 71% of convolve_im2col, 54% is now used in Eigen functions, which looks better compared to your first screenshot, where both used almost the same amount (the Eigen part of convolve_im2col and the non-Eigen part of convolve_im2col).

So now the most can be gained by making MSVC utilize AVX2 in Eigen as well as g++ does. Perhaps this is a job for the Eigen devs. If we are lucky, they only need to adjust one of their many #ifdefs.
I suggest you ask there. :)

patrikhuber (Contributor Author) commented Feb 7, 2018

I've posted to the Eigen list.
I am also trying to report this gcc performance regression but they won't let me sign up to their bug tracker... apparently my email(s) exceed the spam threshold... :-O
If you compile with -march=core2 (with SSE3 etc., but without AVX2), you should see that g++5 does a good job (the same as MSVC without /arch:AVX), but g++-7 is suddenly 1.5-2x slower. But this issue would only show on systems that do not have AVX2 CPUs.

patrikhuber (Contributor Author) commented Feb 8, 2018

The people on the Eigen mailing list were extremely helpful and found out that it is because Eigen does not use FMA instructions on MSVC. Compiling the benchmark with -D__FMA__ results in the same run times for MSVC as for gcc! Wow! I hope they will integrate this #define for MSVC in the next Eigen release. FYI, here is the link to the mailing list, but I think the posts only update overnight (so it will be visible tomorrow): https://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2018/02/msg00001.html
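For reference, that simply means adding the define to the compiler invocation, e.g. with the flags from my earlier runs:

>cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo /D__FMA__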
In fdeep with my model, the run time went from the previous 27ms down to 21ms. Not as much as in the benchmark (which improved 1.5-2x), so it seems like MSVC has another bottleneck apart from the GEMM (which also explains why non-AVX2 MSVC is much slower than gcc too). But 21ms is quite nice now! 😃

Btw one person on the Eigen mailing list also found out that /GL (the "Whole Program Optimisation") makes it slower most likely because with that option, it doesn't emit FMA/AVX instructions anymore. Which I guess confirms that bug too.

And the link to the gcc bug report is here for reference.

Dobiasd (Owner) commented Feb 12, 2018

Awesome work! 👍

In the end I think we made good progress performance-wise here, not only for your model but for fdeep overall. :)

And on your way you also helped the Eigen people, the gcc devs and the users of each. 🥇

patrikhuber (Contributor Author) commented:

Yep, definitely, a pretty good outcome, and thank you very much for your help and your code changes again too! :-)
