Is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependency? #326

MashPlant · 2020-12-07T01:30:44Z

I am quite new to the polyhedral model, and may still be unfamiliar with related concepts, so please point out if I made any mistakes.

I would like to know is there any methods for parallelizing or loop tiling that automatically resolves data dependencies. To be more specific, consider the following one-dimensional stencil computation:

for (t = 1; t < T; t += 1)
  for (i = 1; i < N - 1; i += 1)
    A[t][i] = 0.25 * (A[t - 1][i + 1] - 2.0 * A[t - 1][i] +　A[t - 1][i - 1]);

Since computing A[t][i] needs to read A[t - 1][i + 1], the statement instance (t, i) have to be excuted after the statement instance (t - 1, i + 1). So the two-level loop cannot be simply tiled, otherwise data dependency will be violated.

However, the Computation::tile function in Tiramisu seems won't make any efforts to solve the data dependency:

#include <tiramisu/tiramisu.h>

using namespace tiramisu;

int main() {
  tiramisu::init("stencil");

  const int SIZE_T = 200, SIZE_N = 100;
  constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));

  var t("t", 1, T), i("i", 1, N - 1);

  computation A("A", {t, i}, p_float32);
  A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
//  var t0("t0"), i0("i0"), t1("t1"), i1("i1");
//  A.tile(t, i, 32, 32, t0, i0, t1, i1);
  
  buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_input);
  A.store_in(&b_A);

  tiramisu::codegen({&b_A}, "stencil.o");
}

Uncomment the two lines related to loop tiling, and the output Halide IR changes from:

Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce  {
  allocate _A_b0[float32 * 98 * 199]
  for (c1, 1, 199) {
    for (c3, 1, 98) {
      b_A[(c3 + int32((int64(c1)*(int64)100)))] = (((b_A[(int32((int64(c3) + (int64)1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))] - (b_A[(c3 + int32(((int64(c1)*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(c3) + (int64)-1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))])*0.250000f)
    }
  }
}

to:

Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce  {
  allocate _A_b0[float32 * 98 * 199]
  for (c1, 0, 7) {
    for (c3, 0, 4) {
      for (c5, (1 - min((c1*32), 1)), ((min((c1*32), 1) - max((c1*32), 168)) + 199)) {
        for (c7, (1 - min((c3*32), 1)), ((min((c3*32), 1) - max((c3*32), 67)) + 98)) {
          b_A[(((c3*32) + c7) + int32((int64(((c1*32) + c5))*(int64)100)))] = (((b_A[(int32((int64(((c3*32) + c7)) + (int64)1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))] - (b_A[(((c3*32) + c7) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(((c3*32) + c7)) + (int64)-1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))])*0.250000f)
        }
      }
    }
  }
}

where data dependency is violated, and test also fails.

I have heard that the Pluto algorithm can be adopted in such a scenario, which will automatically skew the iteration domain to solve the data dependency:

for (t = 1; t < T; t += 1)
  for (i = 1 + t; i < N - 1 + t; i += 1)
    A[t][i - t] = 0.25 * (A[t - 1][i - t + 1] - 2.0 * A[t - 1][i + t] + A[t - 1][i - t - 1]);

and the loop can be safely tiled. It is also possible to skew the iteration domain and perform loop tiling in Tiramisu:

#include <tiramisu/tiramisu.h>

using namespace tiramisu;

int main() {
  tiramisu::init("stencil");

  const int SIZE_T = 200, SIZE_N = 100;
  constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));

  var t("t", 1, T), i("i", 1, N - 1);

  computation A("A", {t, i}, p_float32);
  A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
  var nt("nt"), ni("ni");
  A.skew(t, i, 1, nt, ni);
  var t0("t0"), i0("i0"), t1("t1"), i1("i1");
  A.tile(nt, ni, 32, 32, t0, i0, t1, i1);

  buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_output);
  A.store_in(&b_A);

  tiramisu::codegen({&b_A}, "stencil.o");
}

It will pass the test. But it requires my observation of the loop patterns to make such a transformation. Moreover, if I want to paralize the loop, it requires not only skewing, but also synchronization and communication between parallel computation units. This seems complicated to me, but it can theoratically be automated through the Pluto algorithm. This is why I would like to know: is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependencies?

The text was updated successfully, but these errors were encountered:

rbaghdadi · 2020-12-09T09:01:36Z

Hi,

Yes you would need Tiramisu to apply skewing automatically for this type of stencils to be parallelized. The public Tiramisu does not check the correctness of transformations, so if you apply tiling on a code that should not be tiled you'll get wrong results. We have a private branch that checks for correctness but it never made its way to the master branch.

We have the skew() command which you and use use to apply skewing manually.

Currently Tiramisu does not do skewing automatically. Tiramisu uses the ISL library though which implements the Pluto algorithm. You can very easily call the Pluto algorithm in ISL and enable Tiramisu to automatically skew this code and parallelize it. We are also working on a new algorithm for automatic code parallelization that will solve this problem but it will not be ready soon.

So in short: Tiramisu does not apply skewing automatically currently, you can very easily add support for skewing if you want given that the Pluto algorithm is already implemented in ISL, we are just not calling it.

Please let me know if you have any other question.

MashPlant · 2020-12-12T13:08:43Z

Hi @rbaghdadi ,

Thank you for helping. I understand the situation now. Now the problem is that I'm not really familiar with the APIs in ISL, especially the ones related to isl_shcedule. I read the documentation of ISL and it seems that the Pluto algorithm implementation is closely related to isl_shcedule. Would you be kind enough to provide an example of using ISL to automatically skew and tile the loop in my example? Thanks a lot!

rbaghdadi · 2021-01-03T06:25:34Z

Here is how you can do dependence analysis in ISL: http://isl.gforge.inria.fr/user.html#Dependence-Analysis

rbaghdadi closed this as completed Jan 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependency? #326

Is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependency? #326

MashPlant commented Dec 7, 2020

rbaghdadi commented Dec 9, 2020

MashPlant commented Dec 12, 2020 •

edited

rbaghdadi commented Jan 3, 2021

Is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependency? #326

Is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependency? #326

Comments

MashPlant commented Dec 7, 2020

rbaghdadi commented Dec 9, 2020

MashPlant commented Dec 12, 2020 • edited

rbaghdadi commented Jan 3, 2021

MashPlant commented Dec 12, 2020 •

edited