Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependency? #326

Closed
MashPlant opened this issue Dec 7, 2020 · 3 comments

Comments

@MashPlant
Copy link

I am quite new to the polyhedral model, and may still be unfamiliar with related concepts, so please point out if I made any mistakes.

I would like to know is there any methods for parallelizing or loop tiling that automatically resolves data dependencies. To be more specific, consider the following one-dimensional stencil computation:

for (t = 1; t < T; t += 1)
  for (i = 1; i < N - 1; i += 1)
    A[t][i] = 0.25 * (A[t - 1][i + 1] - 2.0 * A[t - 1][i] + A[t - 1][i - 1]);

Since computing A[t][i] needs to read A[t - 1][i + 1], the statement instance (t, i) have to be excuted after the statement instance (t - 1, i + 1). So the two-level loop cannot be simply tiled, otherwise data dependency will be violated.

However, the Computation::tile function in Tiramisu seems won't make any efforts to solve the data dependency:

#include <tiramisu/tiramisu.h>

using namespace tiramisu;

int main() {
  tiramisu::init("stencil");

  const int SIZE_T = 200, SIZE_N = 100;
  constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));

  var t("t", 1, T), i("i", 1, N - 1);

  computation A("A", {t, i}, p_float32);
  A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
//  var t0("t0"), i0("i0"), t1("t1"), i1("i1");
//  A.tile(t, i, 32, 32, t0, i0, t1, i1);
  
  buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_input);
  A.store_in(&b_A);

  tiramisu::codegen({&b_A}, "stencil.o");
}

Uncomment the two lines related to loop tiling, and the output Halide IR changes from:

Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce  {
  allocate _A_b0[float32 * 98 * 199]
  for (c1, 1, 199) {
    for (c3, 1, 98) {
      b_A[(c3 + int32((int64(c1)*(int64)100)))] = (((b_A[(int32((int64(c3) + (int64)1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))] - (b_A[(c3 + int32(((int64(c1)*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(c3) + (int64)-1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))])*0.250000f)
    }
  }
}

to:

Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce  {
  allocate _A_b0[float32 * 98 * 199]
  for (c1, 0, 7) {
    for (c3, 0, 4) {
      for (c5, (1 - min((c1*32), 1)), ((min((c1*32), 1) - max((c1*32), 168)) + 199)) {
        for (c7, (1 - min((c3*32), 1)), ((min((c3*32), 1) - max((c3*32), 67)) + 98)) {
          b_A[(((c3*32) + c7) + int32((int64(((c1*32) + c5))*(int64)100)))] = (((b_A[(int32((int64(((c3*32) + c7)) + (int64)1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))] - (b_A[(((c3*32) + c7) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(((c3*32) + c7)) + (int64)-1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))])*0.250000f)
        }
      }
    }
  }
}

where data dependency is violated, and test also fails.

I have heard that the Pluto algorithm can be adopted in such a scenario, which will automatically skew the iteration domain to solve the data dependency:

for (t = 1; t < T; t += 1)
  for (i = 1 + t; i < N - 1 + t; i += 1)
    A[t][i - t] = 0.25 * (A[t - 1][i - t + 1] - 2.0 * A[t - 1][i + t] + A[t - 1][i - t - 1]);

and the loop can be safely tiled. It is also possible to skew the iteration domain and perform loop tiling in Tiramisu:

#include <tiramisu/tiramisu.h>

using namespace tiramisu;

int main() {
  tiramisu::init("stencil");

  const int SIZE_T = 200, SIZE_N = 100;
  constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));

  var t("t", 1, T), i("i", 1, N - 1);

  computation A("A", {t, i}, p_float32);
  A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
  var nt("nt"), ni("ni");
  A.skew(t, i, 1, nt, ni);
  var t0("t0"), i0("i0"), t1("t1"), i1("i1");
  A.tile(nt, ni, 32, 32, t0, i0, t1, i1);

  buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_output);
  A.store_in(&b_A);

  tiramisu::codegen({&b_A}, "stencil.o");
}

It will pass the test. But it requires my observation of the loop patterns to make such a transformation. Moreover, if I want to paralize the loop, it requires not only skewing, but also synchronization and communication between parallel computation units. This seems complicated to me, but it can theoratically be automated through the Pluto algorithm. This is why I would like to know: is there any methods in Tiramisu for parallelizing or loop tiling that automatically resolves data dependencies?

@rbaghdadi
Copy link
Collaborator

Hi,

Yes you would need Tiramisu to apply skewing automatically for this type of stencils to be parallelized. The public Tiramisu does not check the correctness of transformations, so if you apply tiling on a code that should not be tiled you'll get wrong results. We have a private branch that checks for correctness but it never made its way to the master branch.

We have the skew() command which you and use use to apply skewing manually.

Currently Tiramisu does not do skewing automatically. Tiramisu uses the ISL library though which implements the Pluto algorithm. You can very easily call the Pluto algorithm in ISL and enable Tiramisu to automatically skew this code and parallelize it. We are also working on a new algorithm for automatic code parallelization that will solve this problem but it will not be ready soon.

So in short: Tiramisu does not apply skewing automatically currently, you can very easily add support for skewing if you want given that the Pluto algorithm is already implemented in ISL, we are just not calling it.

Please let me know if you have any other question.

@MashPlant
Copy link
Author

MashPlant commented Dec 12, 2020

Hi @rbaghdadi ,

Thank you for helping. I understand the situation now. Now the problem is that I'm not really familiar with the APIs in ISL, especially the ones related to isl_shcedule. I read the documentation of ISL and it seems that the Pluto algorithm implementation is closely related to isl_shcedule. Would you be kind enough to provide an example of using ISL to automatically skew and tile the loop in my example? Thanks a lot!

@rbaghdadi
Copy link
Collaborator

Here is how you can do dependence analysis in ISL: http://isl.gforge.inria.fr/user.html#Dependence-Analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants