Add faster Image Interface - use it to optimize Pooling, Convolution #206

Open
etrain opened this issue Jan 20, 2016 · 3 comments
etrain commented Jan 20, 2016

In a couple of common cases we don't iterate over images in a fast way, e.g.

var x = 0
while (x < numRows) {
  var y = 0
  while (y < numCols) {
    var c = 0
    while (c < numChannels) {
      out(x, y, c) = img.get(x, y, c) + 1.0
      c += 1
    }
    y += 1
  }
  x += 1
}

This is only fast if the image is stored in channel/column-major order (because of cache locality). It slows down the Pooler and the Convolver by up to an order of magnitude when their inputs are not in exactly the right order.
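
To make the cache-locality point concrete, here is a minimal sketch. It assumes a hypothetical flat-array image whose channel index varies fastest (the class and method names are made up for illustration and are not KeystoneML's API). With the channel loop innermost, the offsets touched are consecutive; with the row loop innermost, each step jumps by numCols * numChannels elements.

```scala
// Hypothetical illustration only; these names are not KeystoneML's API.
object LayoutSketch {
  // Assumed layout: channel varies fastest, then column (y), then row (x).
  final class ChannelFastestImage(val numRows: Int, val numCols: Int, val numChannels: Int) {
    val data = new Array[Double](numRows * numCols * numChannels)
    def index(x: Int, y: Int, c: Int): Int = c + numChannels * (y + numCols * x)
    def get(x: Int, y: Int, c: Int): Double = data(index(x, y, c))
  }

  // Loop order x, y, c (c innermost): records the flat offsets in visit order.
  def offsetsChannelInnermost(img: ChannelFastestImage): Array[Int] = {
    val out = new Array[Int](img.data.length)
    var i = 0
    var x = 0
    while (x < img.numRows) {
      var y = 0
      while (y < img.numCols) {
        var c = 0
        while (c < img.numChannels) {
          out(i) = img.index(x, y, c)
          i += 1
          c += 1
        }
        y += 1
      }
      x += 1
    }
    out
  }

  // Loop order c, y, x (x innermost): same elements, but a strided walk.
  def offsetsRowInnermost(img: ChannelFastestImage): Array[Int] = {
    val out = new Array[Int](img.data.length)
    var i = 0
    var c = 0
    while (c < img.numChannels) {
      var y = 0
      while (y < img.numCols) {
        var x = 0
        while (x < img.numRows) {
          out(i) = img.index(x, y, c)
          i += 1
          x += 1
        }
        y += 1
      }
      c += 1
    }
    out
  }
}
```

For a 2 x 3 x 4 image, the first traversal yields offsets 0, 1, 2, ... while the second starts 0, 12, 4, 16, ..., which is the access pattern that hurts once the image no longer fits in cache.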

One fix would be to come up with a nice Iterator interface that lets you iterate over an image in its "natural" order and get back a Tuple4[Int,Int,Int,Double] holding the position and pixel value. We need to figure out a good way to do this efficiently. One potential way is to make the iterator @specialized on tuples of this sort.
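
One possible shape for such an interface, sketched under assumptions: rather than returning a tuple per pixel, a cursor exposes the position and value as mutable fields and walks the backing array in storage order. PixelCursor and its layout (channel fastest, then column, then row) are hypothetical and only illustrate the idea; nothing here is an existing KeystoneML class.

```scala
// Hypothetical sketch of an allocation-free "natural order" iterator.
object CursorSketch {
  final class PixelCursor(data: Array[Double], numRows: Int, numCols: Int, numChannels: Int) {
    require(data.length == numRows * numCols * numChannels, "shape must match array length")

    private var i = -1
    // Current position and value, overwritten on every advance().
    var x = 0
    var y = 0
    var c = 0
    var value = 0.0

    // Moves to the next element in storage order; returns false when exhausted.
    def advance(): Boolean = {
      i += 1
      if (i >= data.length) {
        false
      } else {
        // Assumed layout: channel fastest, then column (y), then row (x).
        c = i % numChannels
        y = (i / numChannels) % numCols
        x = i / (numChannels * numCols)
        value = data(i)
        true
      }
    }
  }

  // Example consumer: sums all pixel values without creating per-pixel objects.
  def sum(data: Array[Double], numRows: Int, numCols: Int, numChannels: Int): Double = {
    val cur = new PixelCursor(data, numRows, numCols, numChannels)
    var s = 0.0
    while (cur.advance()) s += cur.value
    s
  }
}
```

The division and modulo per element are there for clarity; a tuned version would presumably update x, y, and c incrementally, and each image representation would supply its own index-to-position mapping.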

Let's see if we can rewrite the pooler and the convolver to take advantage of an interface like this and push both to respectable FLOP/memory throughput levels regardless of the Image representation used.

etrain added this to the 0.3.0 milestone Jan 20, 2016
etrain added the images label Jan 20, 2016

etrain commented Jan 20, 2016

cc @ericmjonas


etrain commented Feb 5, 2016

I have started some work on this; here's a summary of some early results.

I am benchmarking images of various sizes (given as "bytes") and have three implementations. The first is whi, which literally loops over the bytes in an image in their natural order and adds them up. The second is iter, which is effectively the proposal above (without object creation); this is ~20x slower than whi at present. The third is sum, which is how fast it is to just call .sum on the array. I also have ratios, iterwhile and itersum, that show the slowdown of iter relative to these two. Finally, there are GB/s figures for each method, measured on one core on my laptop. The raw while loop reaches about 30-70% of machine peak for me (12 GB/s). However, it's likely that the CIFAR images fit nicely into L1 cache, so I'd be surprised if they are really representative.

At any rate, there is some extra work to be done here to get this iterator approach actually fast. I will post this benchmark somewhere soon so that we have a record of how it looks, but for now I'm just going to focus on loop orders and optimized data types to make Convolve and Pool faster for big images.
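
For anyone wanting to reproduce this kind of measurement before the real benchmark is posted, a rough sketch of a timing harness follows. It is not the benchmark used for the numbers in this thread; whileSum and meanNanos are hypothetical helpers, and a proper JVM benchmark would need more careful warm-up and isolation than this.

```scala
// Hypothetical micro-benchmark helpers; not the code behind the reported numbers.
object BenchSketch {
  // Plain while loop over the backing array in storage order (the "whi" variant).
  def whileSum(a: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) {
      s += a(i)
      i += 1
    }
    s
  }

  // Mean wall-clock nanoseconds per call of f, after some warm-up calls for the JIT.
  def meanNanos(warmup: Int, reps: Int)(f: () => Double): Double = {
    var sink = 0.0
    var i = 0
    while (i < warmup) {
      sink += f()
      i += 1
    }
    val start = System.nanoTime()
    i = 0
    while (i < reps) {
      sink += f()
      i += 1
    }
    val elapsed = System.nanoTime() - start
    // Use the accumulated result so the timed work cannot be optimized away.
    if (sink.isNaN) println(sink)
    elapsed.toDouble / reps
  }

  // Example: time the while loop and Array.sum on a 32 x 32 x 3 array of doubles
  // (24576 bytes, the size reported for the Cifar rows).
  def run(): (Double, Double) = {
    val img = Array.fill(32 * 32 * 3)(1.0)
    val whi = meanNanos(1000, 10000)(() => whileSum(img))
    val sum = meanNanos(1000, 10000)(() => img.sum)
    (whi, sum)
  }
}
```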

                     V1                            class      whi      iter       sum iterwhile  itersum    bytes      GBs      iGBs      sGBs
1              Cifar100 ChannelMajorArrayVectorizedImage     3000     61500     14000  20.50000 4.296703    24576 8.192000 0.3996098 1.7554286
2              Cifar100  ColumnMajorArrayVectorizedImage     3000     77000     16000  22.20833 4.332418    24576 8.192000 0.3191688 1.5360000
3              Cifar100     RowMajorArrayVectorizedImage     3000     80000     18000  27.50000 3.500000    24576 8.192000 0.3072000 1.3653333
4             Cifar1000 ChannelMajorArrayVectorizedImage     3000     97500     18000  31.58333 4.260989    24576 8.192000 0.2520615 1.3653333
5             Cifar1000  ColumnMajorArrayVectorizedImage     3000     86500     17500  27.33333 4.366221    24576 8.192000 0.2841156 1.4043429
6             Cifar1000     RowMajorArrayVectorizedImage     3000     61000     15000  24.33333 4.307692    24576 8.192000 0.4028852 1.6384000
7            Cifar10000 ChannelMajorArrayVectorizedImage     3000     85500     19500  28.87500 4.186813    24576 8.192000 0.2874386 1.2603077
8            Cifar10000  ColumnMajorArrayVectorizedImage     3000     60500     19000  20.16667 3.413043    24576 8.192000 0.4062149 1.2934737
9            Cifar10000     RowMajorArrayVectorizedImage     3000     71500     33500  22.50000 3.566667    24576 8.192000 0.3437203 0.7336119
10 ConvolvedSolarFlares ChannelMajorArrayVectorizedImage 10839500 177089000  40396500  15.89807 4.237969 50400800 4.649735 0.2846072 1.2476526
11 ConvolvedSolarFlares  ColumnMajorArrayVectorizedImage  8086000 176195000  38173000  21.85181 5.164889 50400800 6.233094 0.2860513 1.3203259
12 ConvolvedSolarFlares     RowMajorArrayVectorizedImage 10415000 178678500 101219000  15.38485 1.756449 50400800 4.839251 0.2820753 0.4979381
13             ImageNet ChannelMajorArrayVectorizedImage   217000   4540500   1306500  20.44049 4.473178  1572864 7.248221 0.3464077 1.2038760
14             ImageNet  ColumnMajorArrayVectorizedImage   208000   4545000   1028500  21.82231 4.649135  1572864 7.561846 0.3460647 1.5292795
15             ImageNet     RowMajorArrayVectorizedImage   273500   5242500   1055000  15.92783 4.626985  1572864 5.750874 0.3000217 1.4908664
16          SolarFlares ChannelMajorArrayVectorizedImage   944000  19662000   6139500  19.99306 3.232331  6291456 6.664678 0.3199805 1.0247505
17          SolarFlares  ColumnMajorArrayVectorizedImage  1342500  21790500   3667500  18.83585 5.319151  6291456 4.686373 0.2887247 1.7154618
18          SolarFlares     RowMajorArrayVectorizedImage   971000  18231500   4114000  17.91037 4.786838  6291456 6.479357 0.3450871 1.5292795

etrain removed this from the 0.3.0 milestone Feb 5, 2016

etrain commented Feb 8, 2016

Just for posterity here: whi, iter, and sum measure the time taken (in ns) to do each operation. iterwhile and itersum are the slowdowns iter/whi and iter/sum, respectively; that is, how many times slower iter is vs. whi or vs. sum.

The last 3 columns are GB/s of whi, iter, and sum respectively.
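
As a worked check of how the derived columns relate to the raw ones, the small sketch below (helper names are hypothetical) recomputes some entries of row 1 from its bytes and timing values. Bytes divided by nanoseconds gives decimal GB/s directly.

```scala
// Hypothetical helpers that recompute derived columns from the raw ones.
object ThroughputSketch {
  // Bytes per nanosecond equals (decimal) gigabytes per second.
  def gbPerSec(bytes: Long, nanos: Double): Double = bytes / nanos

  // How many times slower the first timing is than the second.
  def slowdown(slowNanos: Double, fastNanos: Double): Double = slowNanos / fastNanos

  // Row 1 of the table: bytes = 24576, whi = 3000 ns, iter = 61500 ns, sum = 14000 ns.
  val rowOneGBs: Double = gbPerSec(24576L, 3000.0)        // 8.192
  val rowOneIterGBs: Double = gbPerSec(24576L, 61500.0)   // about 0.3996
  val rowOneSumGBs: Double = gbPerSec(24576L, 14000.0)    // about 1.7554
  val rowOneIterWhile: Double = slowdown(61500.0, 3000.0) // 20.5
}
```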
