Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bring conditionals back since the HIR change. #90

Merged
merged 1 commit into from Dec 23, 2020
Merged

Conversation

philberty
Copy link
Member

Example:

fn main() {
    let mut x = 5;

    if x == 5 {
        x = 1;
    } else if x == 3 {
        x = 2;
    } else {
        x = 3;
    }
}

GIMPLE

void main ()
{
  i32 mut x;

  mut x = 5;
  if (mut x == 5) goto <D.188>; else goto <D.189>;
  <D.188>:
  {
    mut x = 1;
  }
  goto <D.190>;
  <D.189>:
  {
    if (mut x == 3) goto <D.191>; else goto <D.192>;
    <D.191>:
    {
      mut x = 2;
    }
    goto <D.193>;
    <D.192>:
    {
      mut x = 3;
    }
    <D.193>:
  }
  <D.190>:
}

@philberty philberty added this to the Core Datastructures milestone Dec 22, 2020
@philberty philberty added this to Reviewer approved in Data Structures 1 - Core Dec 22, 2020
@philberty philberty moved this from Reviewer approved to Review in progress in Data Structures 1 - Core Dec 22, 2020
Copy link
Member

@SimplyTheOther SimplyTheOther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR seems great.

fncontext fnctx = ctx->peek_fn ();
Bfunction *fndecl = fnctx.fndecl;
Location start_location = expr.get_locus ();
Location end_location = expr.get_closing_locus ();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relates to something I had thought about doing, but not in enough detail to open an enhancement issue. gcc supports something called "rich locations" in which three actual locations are encoded - a "start" location, a "caret" location, and an "end" location - basically like a location range with a specified value highlighted.

I couldn't find the page that I thought I originally read this in, but here's a quote from here:

As of GCC 6, ordinary locations changed from supporting just a point in the user’s source code to supporting three points: the caret location, plus a start and a finish:

      a = foo && bar;
          ~~~~^~~~~~
          |   |    |
          |   |    finish
          |   caret
          start

Instead of accessing start_location and end_location separately, these could be integrated into a single location value. There's also the technicality that the current definition of get_closing_locus() technically gets the beginning location of the last statement rather than the end location of the block (i.e. the right curly bracket), assuming that this would be the preferred behaviour.

The lexer would have to be modified to reflect this rich location system, and the parser and AST probably would have to be too (i.e. make a new location from the "start" location from one token and the "end" location of a different one). Also, the backend-independent Location abstraction would have to be modified to support this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this closing locus is mostly a placeholder it just returns get_locus. I think there needs to be an issue about the diagnostics in general. gccgo added extra specifiers to the format string which might give some inspiration. Though the Fortran front-end is usually deemed to be the gold standard these days.

Data Structures 1 - Core automation moved this from Review in progress to Reviewer approved Dec 23, 2020
@philberty philberty mentioned this pull request Dec 23, 2020
@philberty philberty merged commit aa2fbb5 into master Dec 23, 2020
Data Structures 1 - Core automation moved this from Reviewer approved to Done Dec 23, 2020
@philberty philberty deleted the phil/conditional branch December 23, 2020 14:11
@philberty philberty linked an issue Jan 3, 2021 that may be closed by this pull request
philberty pushed a commit that referenced this pull request Mar 3, 2021
… FMA.

This adds implementation for the optabs for complex additions.  With this the
following C code:

  void f90 (float complex a[restrict N], float complex b[restrict N],
	    float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] + (b[i] * I);
  }

generates

  f90:
	  add     r3, r2, #1600
  .L2:
	  vld1.32 {q8}, [r0]!
	  vld1.32 {q9}, [r1]!
	  vcadd.f32       q8, q8, q9, #90
	  vst1.32 {q8}, [r2]!
	  cmp     r3, r2
	  bne     .L2
	  bx      lr

instead of

  f90:
	  add     r3, r2, #1600
  .L2:
	  vld2.32 {d24-d27}, [r0]!
	  vld2.32 {d20-d23}, [r1]!
	  vsub.f32	q8, q12, q11
	  vadd.f32	q9, q13, q10
	  vst2.32 {d16-d19}, [r2]!
	  cmp     r3, r2
	  bne     .L2
	  bx      lr

gcc/ChangeLog:

	* config/arm/arm_mve.h (__arm_vcaddq_rot90_u8, __arm_vcaddq_rot270_u8,
	, __arm_vcaddq_rot90_s8, __arm_vcaddq_rot270_s8,
	__arm_vcaddq_rot90_u16, __arm_vcaddq_rot270_u16, __arm_vcaddq_rot90_s16,
	__arm_vcaddq_rot270_s16, __arm_vcaddq_rot90_u32,
	__arm_vcaddq_rot270_u32, __arm_vcaddq_rot90_s32,
	__arm_vcaddq_rot270_s32, __arm_vcmulq_rot90_f16,
	__arm_vcmulq_rot270_f16, __arm_vcmulq_rot180_f16,
	__arm_vcmulq_f16, __arm_vcaddq_rot90_f16, __arm_vcaddq_rot270_f16,
	__arm_vcmulq_rot90_f32, __arm_vcmulq_rot270_f32,
	__arm_vcmulq_rot180_f32, __arm_vcmulq_f32, __arm_vcaddq_rot90_f32,
	__arm_vcaddq_rot270_f32, __arm_vcmlaq_f16, __arm_vcmlaq_rot180_f16,
	__arm_vcmlaq_rot270_f16, __arm_vcmlaq_rot90_f16, __arm_vcmlaq_f32,
	__arm_vcmlaq_rot180_f32, __arm_vcmlaq_rot270_f32,
	__arm_vcmlaq_rot90_f32): Update builtin calls.
	* config/arm/arm_mve_builtins.def (vcaddq_rot90_u, vcaddq_rot270_u,
	vcaddq_rot90_s, vcaddq_rot270_s, vcaddq_rot90_f, vcaddq_rot270_f,
	vcmulq_f, vcmulq_rot90_f, vcmulq_rot180_f, vcmulq_rot270_f,
	vcmlaq_f, vcmlaq_rot90_f, vcmlaq_rot180_f, vcmlaq_rot270_f): Removed.
	(vcaddq_rot90, vcaddq_rot270, vcmulq, vcmulq_rot90, vcmulq_rot180,
	vcmulq_rot270, vcmlaq, vcmlaq_rot90, vcmlaq_rot180, vcmlaq_rot270):
	New.
	* config/arm/constraints.md (Dz): Include MVE.
	* config/arm/iterators.md (mve_rotsplit1, mve_rotsplit2): New.
	(rot): Add UNSPEC_VCMLS, UNSPEC_VCMUL and UNSPEC_VCMUL180.
	(rot_op, rotsplit1, rotsplit2, fcmac1, VCMLA_OP, VCMUL_OP): New.
	* config/arm/mve.md (VCADDQ_ROT270_S, VCADDQ_ROT90_S, VCADDQ_ROT270_U,
	VCADDQ_ROT90_U, VCADDQ_ROT270_F, VCADDQ_ROT90_F, VCMULQ_F,
	VCMULQ_ROT180_F, VCMULQ_ROT270_F, VCMULQ_ROT90_F, VCMLAQ_F,
	VCMLAQ_ROT180_F, VCMLAQ_ROT90_F, VCMLAQ_ROT270_F, VCADDQ_ROT270_S,
	VCADDQ_ROT270, VCADDQ_ROT90): Removed.
	(mve_rot, VCMUL): New.
	(mve_vcaddq_rot270_<supf><mode, mve_vcaddq_rot90_<supf><mode>,
	mve_vcaddq_rot270_f<mode>, mve_vcaddq_rot90_f<mode>, mve_vcmulq_f<mode,
	mve_vcmulq_rot180_f<mode>, mve_vcmulq_rot270_f<mode>,
	mve_vcmulq_rot90_f<mode>, mve_vcmlaq_f<mode>, mve_vcmlaq_rot180_f<mode>,
	mve_vcmlaq_rot270_f<mode>, mve_vcmlaq_rot90_f<mode>): Removed.
	(mve_vcmlaq<mve_rot><mode>, mve_vcmulq<mve_rot><mode>,
	mve_vcaddq<mve_rot><mode>, cadd<rot><mode>3, mve_vcaddq<mve_rot><mode>):
	New.
	(cmul<rot_op><mode>3): Exclude MVE types.
	* config/arm/unspecs.md (UNSPEC_VCMUL90, UNSPEC_VCMUL270): New.
	* config/arm/vec-common.md (cadd<rot><mode>3, cmul<rot_op><mode>3,
	arm_vcmla<rot><mode>, cml<fcmac1><rot_op><mode>4): New.
	* config/arm/unspecs.md (UNSPEC_VCMUL, UNSPEC_VCMUL180, UNSPEC_VCMLS,
	UNSPEC_VCMLS180): New.
	* config/arm/neon.md (cmul<rot_op><mode>3): New.
philberty pushed a commit that referenced this pull request Mar 3, 2021
This adds implementation for the optabs for add complex operations.  With this
the following C code:

  void f90 (float complex a[restrict N], float complex b[restrict N],
	    float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] + (b[i] * I);
  }

generates

  f90:
	  mov     x3, 0
	  .p2align 3,,7
  .L2:
	  ldr     q0, [x0, x3]
	  ldr     q1, [x1, x3]
	  fcadd   v0.4s, v0.4s, v1.4s, #90
	  str     q0, [x2, x3]
	  add     x3, x3, 16
	  cmp     x3, 1600
	  bne     .L2
	  ret

instead of

  f90:
	  add     x3, x1, 1600
	  .p2align 3,,7
  .L2:
	  ld2     {v4.4s - v5.4s}, [x0], 32
	  ld2     {v2.4s - v3.4s}, [x1], 32
	  fsub    v0.4s, v4.4s, v3.4s
	  fadd    v1.4s, v5.4s, v2.4s
	  st2     {v0.4s - v1.4s}, [x2], 32
	  cmp     x3, x1
	  bne     .L2
	  ret

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (cadd<rot><mode>3): New.
	* config/aarch64/iterators.md (SVE2_INT_CADD_OP): New.
	* config/aarch64/aarch64-sve.md (cadd<rot><mode>3): New.
	* config/aarch64/aarch64-sve2.md (cadd<rot><mode>3): New.
philberty pushed a commit that referenced this pull request Mar 3, 2021
This adds implementation for the optabs for complex additions.  With this the
following C code:

  void f90 (float complex a[restrict N], float complex b[restrict N],
	    float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] + (b[i] * I);
  }

generates

  f90:
	  add     r3, r2, #1600
  .L2:
	  vld1.32 {q8}, [r0]!
	  vld1.32 {q9}, [r1]!
	  vcadd.f32       q8, q8, q9, #90
	  vst1.32 {q8}, [r2]!
	  cmp     r3, r2
	  bne     .L2
	  bx      lr

instead of

  f90:
	  add     r3, r2, #1600
  .L2:
	  vld2.32 {d24-d27}, [r0]!
	  vld2.32 {d20-d23}, [r1]!
	  vsub.f32	q8, q12, q11
	  vadd.f32	q9, q13, q10
	  vst2.32 {d16-d19}, [r2]!
	  cmp     r3, r2
	  bne     .L2
	  bx      lr

gcc/ChangeLog:

	* config/arm/arm_mve.h (__arm_vcaddq_rot90_u8, __arm_vcaddq_rot270_u8,
	__arm_vcaddq_rot90_s8, __arm_vcaddq_rot270_s8,
	__arm_vcaddq_rot90_u16, __arm_vcaddq_rot270_u16,
	__arm_vcaddq_rot90_s16, __arm_vcaddq_rot270_s16,
	__arm_vcaddq_rot90_u32, __arm_vcaddq_rot270_u32,
	__arm_vcaddq_rot90_s32, __arm_vcaddq_rot270_s32,
	__arm_vcaddq_rot90_f16, __arm_vcaddq_rot270_f16,
	__arm_vcaddq_rot90_f32, __arm_vcaddq_rot270_f32):  Update builtin calls.
	* config/arm/arm_mve_builtins.def (vcaddq_rot90_u, vcaddq_rot270_u,
	vcaddq_rot90_s, vcaddq_rot270_s, vcaddq_rot90_f, vcaddq_rot270_f):
	Removed.
	(vcaddq_rot90, vcaddq_rot270): New.
	* config/arm/constraints.md (Dz): Include MVE.
	* config/arm/iterators.md (mve_rot): New.
	(supf): Remove VCADDQ_ROT270_S, VCADDQ_ROT270_U, VCADDQ_ROT90_S,
	VCADDQ_ROT90_U.
	(VCADDQ_ROT270, VCADDQ_ROT90): Removed.
	* config/arm/mve.md (mve_vcaddq_rot270_<supf><mode,
	mve_vcaddq_rot90_<supf><mode>, mve_vcaddq_rot270_f<mode>,
	mve_vcaddq_rot90_f<mode>): Removed.
	(mve_vcaddq<mve_rot><mode>, mve_vcaddq<mve_rot><mode>): New.
	* config/arm/unspecs.md (VCADDQ_ROT270_S, VCADDQ_ROT90_S,
	VCADDQ_ROT270_U, VCADDQ_ROT90_U, VCADDQ_ROT270_F,
	VCADDQ_ROT90_F): Removed.
	* config/arm/vec-common.md (cadd<rot><mode>3): New.
philberty pushed a commit that referenced this pull request Mar 3, 2021
This adds implementation for the optabs for complex operations.  With this the
following C code:

  void g (float complex a[restrict N], float complex b[restrict N],
	  float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] =  a[i] * b[i];
  }

generates

NEON:

g:
        movi    v3.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        mov     v0.16b, v3.16b
        ldr     q2, [x1, x3]
        ldr     q1, [x0, x3]
        fcmla   v0.4s, v1.4s, v2.4s, #0
        fcmla   v0.4s, v1.4s, v2.4s, #90
        str     q0, [x2, x3]
        add     x3, x3, 16
        cmp     x3, 1600
        bne     .L2
        ret

SVE:

g:
        mov     x3, 0
        mov     x4, 400
        ptrue   p1.b, all
        whilelo p0.s, xzr, x4
        mov     z3.s, #0
        .p2align 3,,7
.L2:
        ld1w    z1.s, p0/z, [x0, x3, lsl 2]
        ld1w    z2.s, p0/z, [x1, x3, lsl 2]
        movprfx z0, z3
        fcmla   z0.s, p1/m, z1.s, z2.s, #0
        fcmla   z0.s, p1/m, z1.s, z2.s, #90
        st1w    z0.s, p0, [x2, x3, lsl 2]
        incw    x3
        whilelo p0.s, x3, x4
        b.any   .L2
        ret

SVE2 (with int instead of float)
g:
        mov     x3, 0
        mov     x4, 400
        mov     z3.b, #0
        whilelo p0.s, xzr, x4
        .p2align 3,,7
.L2:
        ld1w    z1.s, p0/z, [x0, x3, lsl 2]
        ld1w    z2.s, p0/z, [x1, x3, lsl 2]
        movprfx z0, z3
        cmla    z0.s, z1.s, z2.s, #0
        cmla    z0.s, z1.s, z2.s, #90
        st1w    z0.s, p0, [x2, x3, lsl 2]
        incw    x3
        whilelo p0.s, x3, x4
        b.any   .L2
        ret

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (cml<fcmac1><conj_op><mode>4,
	cmul<conj_op><mode>3): New.
	* config/aarch64/iterators.md (UNSPEC_FCMUL,
	UNSPEC_FCMUL180, UNSPEC_FCMLA_CONJ, UNSPEC_FCMLA180_CONJ,
	UNSPEC_CMLA_CONJ, UNSPEC_CMLA180_CONJ, UNSPEC_CMUL, UNSPEC_CMUL180,
	FCMLA_OP, FCMUL_OP, conj_op, rotsplit1, rotsplit2, fcmac1, sve_rot1,
	sve_rot2, SVE2_INT_CMLA_OP, SVE2_INT_CMUL_OP, SVE2_INT_CADD_OP): New.
	(rot): Add UNSPEC_FCMUL, UNSPEC_FCMUL180.
	(rot_op): Renamed to conj_op.
	* config/aarch64/aarch64-sve.md (cml<fcmac1><conj_op><mode>4,
	cmul<conj_op><mode>3): New.
	* config/aarch64/aarch64-sve2.md (cml<fcmac1><conj_op><mode>4,
	cmul<conj_op><mode>3): New.
philberty pushed a commit that referenced this pull request Mar 3, 2021
This adds implementation for the optabs for complex operations.  With this the
following C code:

  void g (float complex a[restrict N], float complex b[restrict N],
	  float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] =  a[i] * b[i];
  }

generates

NEON:

g:
        vmov.f32        q11, #0.0  @ v4sf
        add     r3, r2, #1600
.L2:
        vmov    q8, q11  @ v4sf
        vld1.32 {q10}, [r1]!
        vld1.32 {q9}, [r0]!
        vcmla.f32       q8, q9, q10, #0
        vcmla.f32       q8, q9, q10, #90
        vst1.32 {q8}, [r2]!
        cmp     r3, r2
        bne     .L2
        bx      lr

MVE:

g:
        push    {lr}
        mov     lr, #100
        dls     lr, lr
.L2:
        vldrw.32        q1, [r1], #16
        vldrw.32        q2, [r0], #16
        vcmul.f32       q3, q2, q1, #0
        vcmla.f32       q3, q2, q1, #90
        vstrw.32        q3, [r2], #16
        le      lr, .L2
        ldr     pc, [sp], #4

instead of

g:
        add     r3, r2, #1600
.L2:
        vld2.32 {d20-d23}, [r0]!
        vld2.32 {d16-d19}, [r1]!
        vmul.f32        q14, q11, q9
        vmul.f32        q15, q11, q8
        vneg.f32        q14, q14
        vfma.f32        q15, q10, q9
        vfma.f32        q14, q10, q8
        vmov    q13, q15  @ v4sf
        vmov    q12, q14  @ v4sf
        vst2.32 {d24-d27}, [r2]!
        cmp     r3, r2
        bne     .L2
        bx      lr

and

g:
        add     r3, r2, #1600
.L2:
        vld2.32 {d20-d23}, [r0]!
        vld2.32 {d16-d19}, [r1]!
        vmul.f32        q15, q10, q8
        vmul.f32        q14, q10, q9
        vmls.f32        q15, q11, q9
        vmla.f32        q14, q11, q8
        vmov    q12, q15  @ v4sf
        vmov    q13, q14  @ v4sf
        vst2.32 {d24-d27}, [r2]!
        cmp     r3, r2
        bne     .L2
        bx      lr

respectively.

gcc/ChangeLog:

	* config/arm/iterators.md (rotsplit1, rotsplit2, conj_op, fcmac1,
	VCMLA_OP, VCMUL_OP): New.
	* config/arm/mve.md (mve_vcmlaq<mve_rot><mode>): Support vec_dup 0.
	* config/arm/neon.md (cmul<conj_op><mode>3): New.
	* config/arm/unspecs.md (UNSPEC_VCMLA_CONJ, UNSPEC_VCMLA180_CONJ,
	UNSPEC_VCMUL_CONJ): New.
	* config/arm/vec-common.md (cmul<conj_op><mode>3, arm_vcmla<rot><mode>,
	cml<fcmac1><conj_op><mode>4): New.
philberty pushed a commit that referenced this pull request Mar 3, 2021
The attached testcase ICEs due to a couple of issues.
In the testcase you have two SLP instances that share the majority of their
definition with each other.  One tree defines a COMPLEX_MUL sequence and the
other tree a COMPLEX_FMA.

The ice happens because:

1. the refcounts are wrong, in particular the FMA case doesn't correctly count
the references for the COMPLEX_MUL that it consumes.

2. when the FMA is created it incorrectly assumes it can just tear apart the MUL
node that it's consuming.  This is wrong and should only be done when there is
no more uses of the node, in which case the vector only pattern is no longer
relevant.

To fix the last part the SLP only pattern reset code was moved into
vect_free_slp_tree which results in cleaner code.  I also think it does belong
there since that function knows when there are no more uses of the node and so
the pattern should be unmarked, so when the the vectorizer is inspecting the BB
it doesn't find the now invalid vector only patterns.

The patch also clears the SLP_TREE_REPRESENTATIVE when stores are removed such
that we don't hit an error later trying to free the stmt_vec_info again.

Lastly it also tweaks the results of whether a pattern has been detected or not
to return true when another SLP instance has created a pattern that is only used
by a different instance (due to the trees being unshared).

Instead of ICEing this code now produces

        adrp    x1, .LANCHOR0
        add     x2, x1, :lo12:.LANCHOR0
        movi    v1.2s, 0
        mov     w0, 0
        ldr     x4, [x1, #:lo12:.LANCHOR0]
        ldrsw   x3, [x2, 16]
        ldr     x1, [x2, 8]
        ldrsw   x2, [x2, 20]
        ldr     d0, [x4]
        ldr     d2, [x1, x3, lsl 3]
        fcmla   v2.2s, v0.2s, v0.2s, #0
        fcmla   v2.2s, v0.2s, v0.2s, #90
        str     d2, [x1, x3, lsl 3]
        fcmla   v1.2s, v0.2s, v0.2s, #0
        fcmla   v1.2s, v0.2s, v0.2s, #90
        str     d1, [x1, x2, lsl 3]
        ret

PS. This testcase actually shows that the codegen we get in these cases is not
optimal. It should generate a MUL + ADD instead MUL + FMA.

But that's for GCC 12.

gcc/ChangeLog:

	PR tree-optimization/99149
	* tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
	buffer.
	(vect_slp_reset_pattern): Remove.
	(complex_fma_pattern::matches): Remove call to vect_slp_reset_pattern.
	(complex_mul_pattern::build, complex_fma_pattern::build,
	complex_fms_pattern::build): Fix ref counts.
	* tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern relevancy
	when node is being deleted.
	(vect_match_slp_patterns_2): Correct result of cache hit on patterns.
	(vect_schedule_slp): Invalidate SLP_TREE_REPRESENTATIVE of removed
	stores.
	* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.

gcc/testsuite/ChangeLog:

	PR tree-optimization/99149
	* g++.dg/vect/pr99149.cc: New test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Simple conditional statements
2 participants