Bring conditionals back since the HIR change. #90
Conversation
PR seems great.
fncontext fnctx = ctx->peek_fn ();
Bfunction *fndecl = fnctx.fndecl;
Location start_location = expr.get_locus ();
Location end_location = expr.get_closing_locus ();
This relates to something I had thought about doing, but not in enough detail to open an enhancement issue. gcc supports something called "rich locations" in which three actual locations are encoded - a "start" location, a "caret" location, and an "end" location - basically like a location range with a specified value highlighted.
I couldn't find the page that I thought I originally read this in, but here's a quote from here:
As of GCC 6, ordinary locations changed from supporting just a point in the user’s source code to supporting three points: the caret location, plus a start and a finish:
a = foo && bar;
    ~~~~^~~~~~
    |   |    |
    |   |    finish
    |   caret
    start
Instead of accessing start_location and end_location separately, these could be integrated into a single location value. There's also the technicality that the current definition of get_closing_locus() actually gets the beginning location of the last statement, rather than the end location of the block (i.e. the right curly bracket), assuming that would be the preferred behaviour.
The lexer would have to be modified to reflect this rich location system, and the parser and AST probably would too (i.e. make a new location from the "start" location of one token and the "end" location of a different one). The backend-independent Location abstraction would also have to be modified to support this.
Good point: this closing locus is mostly a placeholder; it just returns get_locus. I think there needs to be an issue about the diagnostics in general. gccgo added extra specifiers to the format string, which might give some inspiration, though the Fortran front-end is usually deemed the gold standard these days.
… FMA.

This adds implementation for the optabs for complex additions. With this
the following C code:

  void f90 (float complex a[restrict N], float complex b[restrict N],
            float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] + (b[i] * I);
  }

generates

  f90:
          add     r3, r2, #1600
  .L2:
          vld1.32 {q8}, [r0]!
          vld1.32 {q9}, [r1]!
          vcadd.f32 q8, q8, q9, #90
          vst1.32 {q8}, [r2]!
          cmp     r3, r2
          bne     .L2
          bx      lr

instead of

  f90:
          add     r3, r2, #1600
  .L2:
          vld2.32 {d24-d27}, [r0]!
          vld2.32 {d20-d23}, [r1]!
          vsub.f32 q8, q12, q11
          vadd.f32 q9, q13, q10
          vst2.32 {d16-d19}, [r2]!
          cmp     r3, r2
          bne     .L2
          bx      lr

gcc/ChangeLog:

        * config/arm/arm_mve.h (__arm_vcaddq_rot90_u8,
        __arm_vcaddq_rot270_u8, __arm_vcaddq_rot90_s8,
        __arm_vcaddq_rot270_s8, __arm_vcaddq_rot90_u16,
        __arm_vcaddq_rot270_u16, __arm_vcaddq_rot90_s16,
        __arm_vcaddq_rot270_s16, __arm_vcaddq_rot90_u32,
        __arm_vcaddq_rot270_u32, __arm_vcaddq_rot90_s32,
        __arm_vcaddq_rot270_s32, __arm_vcmulq_rot90_f16,
        __arm_vcmulq_rot270_f16, __arm_vcmulq_rot180_f16,
        __arm_vcmulq_f16, __arm_vcaddq_rot90_f16, __arm_vcaddq_rot270_f16,
        __arm_vcmulq_rot90_f32, __arm_vcmulq_rot270_f32,
        __arm_vcmulq_rot180_f32, __arm_vcmulq_f32, __arm_vcaddq_rot90_f32,
        __arm_vcaddq_rot270_f32, __arm_vcmlaq_f16, __arm_vcmlaq_rot180_f16,
        __arm_vcmlaq_rot270_f16, __arm_vcmlaq_rot90_f16, __arm_vcmlaq_f32,
        __arm_vcmlaq_rot180_f32, __arm_vcmlaq_rot270_f32,
        __arm_vcmlaq_rot90_f32): Update builtin calls.
        * config/arm/arm_mve_builtins.def (vcaddq_rot90_u, vcaddq_rot270_u,
        vcaddq_rot90_s, vcaddq_rot270_s, vcaddq_rot90_f, vcaddq_rot270_f,
        vcmulq_f, vcmulq_rot90_f, vcmulq_rot180_f, vcmulq_rot270_f,
        vcmlaq_f, vcmlaq_rot90_f, vcmlaq_rot180_f, vcmlaq_rot270_f):
        Removed.
        (vcaddq_rot90, vcaddq_rot270, vcmulq, vcmulq_rot90, vcmulq_rot180,
        vcmulq_rot270, vcmlaq, vcmlaq_rot90, vcmlaq_rot180,
        vcmlaq_rot270): New.
        * config/arm/constraints.md (Dz): Include MVE.
        * config/arm/iterators.md (mve_rotsplit1, mve_rotsplit2): New.
        (rot): Add UNSPEC_VCMLS, UNSPEC_VCMUL and UNSPEC_VCMUL180.
        (rot_op, rotsplit1, rotsplit2, fcmac1, VCMLA_OP, VCMUL_OP): New.
        * config/arm/mve.md (VCADDQ_ROT270_S, VCADDQ_ROT90_S,
        VCADDQ_ROT270_U, VCADDQ_ROT90_U, VCADDQ_ROT270_F, VCADDQ_ROT90_F,
        VCMULQ_F, VCMULQ_ROT180_F, VCMULQ_ROT270_F, VCMULQ_ROT90_F,
        VCMLAQ_F, VCMLAQ_ROT180_F, VCMLAQ_ROT90_F, VCMLAQ_ROT270_F,
        VCADDQ_ROT270_S, VCADDQ_ROT270, VCADDQ_ROT90): Removed.
        (mve_rot, VCMUL): New.
        (mve_vcaddq_rot270_<supf><mode>, mve_vcaddq_rot90_<supf><mode>,
        mve_vcaddq_rot270_f<mode>, mve_vcaddq_rot90_f<mode>,
        mve_vcmulq_f<mode>, mve_vcmulq_rot180_f<mode>,
        mve_vcmulq_rot270_f<mode>, mve_vcmulq_rot90_f<mode>,
        mve_vcmlaq_f<mode>, mve_vcmlaq_rot180_f<mode>,
        mve_vcmlaq_rot270_f<mode>, mve_vcmlaq_rot90_f<mode>): Removed.
        (mve_vcmlaq<mve_rot><mode>, mve_vcmulq<mve_rot><mode>,
        mve_vcaddq<mve_rot><mode>, cadd<rot><mode>3,
        mve_vcaddq<mve_rot><mode>): New.
        (cmul<rot_op><mode>3): Exclude MVE types.
        * config/arm/unspecs.md (UNSPEC_VCMUL90, UNSPEC_VCMUL270): New.
        * config/arm/vec-common.md (cadd<rot><mode>3, cmul<rot_op><mode>3,
        arm_vcmla<rot><mode>, cml<fcmac1><rot_op><mode>4): New.
        * config/arm/unspecs.md (UNSPEC_VCMUL, UNSPEC_VCMUL180,
        UNSPEC_VCMLS, UNSPEC_VCMLS180): New.
        * config/arm/neon.md (cmul<rot_op><mode>3): New.
This adds implementation for the optabs for add complex operations. With
this the following C code:

  void f90 (float complex a[restrict N], float complex b[restrict N],
            float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] + (b[i] * I);
  }

generates

  f90:
          mov     x3, 0
          .p2align 3,,7
  .L2:
          ldr     q0, [x0, x3]
          ldr     q1, [x1, x3]
          fcadd   v0.4s, v0.4s, v1.4s, #90
          str     q0, [x2, x3]
          add     x3, x3, 16
          cmp     x3, 1600
          bne     .L2
          ret

instead of

  f90:
          add     x3, x1, 1600
          .p2align 3,,7
  .L2:
          ld2     {v4.4s - v5.4s}, [x0], 32
          ld2     {v2.4s - v3.4s}, [x1], 32
          fsub    v0.4s, v4.4s, v3.4s
          fadd    v1.4s, v5.4s, v2.4s
          st2     {v0.4s - v1.4s}, [x2], 32
          cmp     x3, x1
          bne     .L2
          ret

gcc/ChangeLog:

        * config/aarch64/aarch64-simd.md (cadd<rot><mode>3): New.
        * config/aarch64/iterators.md (SVE2_INT_CADD_OP): New.
        * config/aarch64/aarch64-sve.md (cadd<rot><mode>3): New.
        * config/aarch64/aarch64-sve2.md (cadd<rot><mode>3): New.
This adds implementation for the optabs for complex additions. With this
the following C code:

  void f90 (float complex a[restrict N], float complex b[restrict N],
            float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] + (b[i] * I);
  }

generates

  f90:
          add     r3, r2, #1600
  .L2:
          vld1.32 {q8}, [r0]!
          vld1.32 {q9}, [r1]!
          vcadd.f32 q8, q8, q9, #90
          vst1.32 {q8}, [r2]!
          cmp     r3, r2
          bne     .L2
          bx      lr

instead of

  f90:
          add     r3, r2, #1600
  .L2:
          vld2.32 {d24-d27}, [r0]!
          vld2.32 {d20-d23}, [r1]!
          vsub.f32 q8, q12, q11
          vadd.f32 q9, q13, q10
          vst2.32 {d16-d19}, [r2]!
          cmp     r3, r2
          bne     .L2
          bx      lr

gcc/ChangeLog:

        * config/arm/arm_mve.h (__arm_vcaddq_rot90_u8,
        __arm_vcaddq_rot270_u8, __arm_vcaddq_rot90_s8,
        __arm_vcaddq_rot270_s8, __arm_vcaddq_rot90_u16,
        __arm_vcaddq_rot270_u16, __arm_vcaddq_rot90_s16,
        __arm_vcaddq_rot270_s16, __arm_vcaddq_rot90_u32,
        __arm_vcaddq_rot270_u32, __arm_vcaddq_rot90_s32,
        __arm_vcaddq_rot270_s32, __arm_vcaddq_rot90_f16,
        __arm_vcaddq_rot270_f16, __arm_vcaddq_rot90_f32,
        __arm_vcaddq_rot270_f32): Update builtin calls.
        * config/arm/arm_mve_builtins.def (vcaddq_rot90_u, vcaddq_rot270_u,
        vcaddq_rot90_s, vcaddq_rot270_s, vcaddq_rot90_f, vcaddq_rot270_f):
        Removed.
        (vcaddq_rot90, vcaddq_rot270): New.
        * config/arm/constraints.md (Dz): Include MVE.
        * config/arm/iterators.md (mve_rot): New.
        (supf): Remove VCADDQ_ROT270_S, VCADDQ_ROT270_U, VCADDQ_ROT90_S,
        VCADDQ_ROT90_U.
        (VCADDQ_ROT270, VCADDQ_ROT90): Removed.
        * config/arm/mve.md (mve_vcaddq_rot270_<supf><mode>,
        mve_vcaddq_rot90_<supf><mode>, mve_vcaddq_rot270_f<mode>,
        mve_vcaddq_rot90_f<mode>): Removed.
        (mve_vcaddq<mve_rot><mode>, mve_vcaddq<mve_rot><mode>): New.
        * config/arm/unspecs.md (VCADDQ_ROT270_S, VCADDQ_ROT90_S,
        VCADDQ_ROT270_U, VCADDQ_ROT90_U, VCADDQ_ROT270_F,
        VCADDQ_ROT90_F): Removed.
        * config/arm/vec-common.md (cadd<rot><mode>3): New.
This adds implementation for the optabs for complex operations. With this
the following C code:

  void g (float complex a[restrict N], float complex b[restrict N],
          float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] * b[i];
  }

generates

NEON:

  g:
          movi    v3.4s, 0
          mov     x3, 0
          .p2align 3,,7
  .L2:
          mov     v0.16b, v3.16b
          ldr     q2, [x1, x3]
          ldr     q1, [x0, x3]
          fcmla   v0.4s, v1.4s, v2.4s, #0
          fcmla   v0.4s, v1.4s, v2.4s, #90
          str     q0, [x2, x3]
          add     x3, x3, 16
          cmp     x3, 1600
          bne     .L2
          ret

SVE:

  g:
          mov     x3, 0
          mov     x4, 400
          ptrue   p1.b, all
          whilelo p0.s, xzr, x4
          mov     z3.s, #0
          .p2align 3,,7
  .L2:
          ld1w    z1.s, p0/z, [x0, x3, lsl 2]
          ld1w    z2.s, p0/z, [x1, x3, lsl 2]
          movprfx z0, z3
          fcmla   z0.s, p1/m, z1.s, z2.s, #0
          fcmla   z0.s, p1/m, z1.s, z2.s, #90
          st1w    z0.s, p0, [x2, x3, lsl 2]
          incw    x3
          whilelo p0.s, x3, x4
          b.any   .L2
          ret

SVE2 (with int instead of float):

  g:
          mov     x3, 0
          mov     x4, 400
          mov     z3.b, #0
          whilelo p0.s, xzr, x4
          .p2align 3,,7
  .L2:
          ld1w    z1.s, p0/z, [x0, x3, lsl 2]
          ld1w    z2.s, p0/z, [x1, x3, lsl 2]
          movprfx z0, z3
          cmla    z0.s, z1.s, z2.s, #0
          cmla    z0.s, z1.s, z2.s, #90
          st1w    z0.s, p0, [x2, x3, lsl 2]
          incw    x3
          whilelo p0.s, x3, x4
          b.any   .L2
          ret

gcc/ChangeLog:

        * config/aarch64/aarch64-simd.md (cml<fcmac1><conj_op><mode>4,
        cmul<conj_op><mode>3): New.
        * config/aarch64/iterators.md (UNSPEC_FCMUL, UNSPEC_FCMUL180,
        UNSPEC_FCMLA_CONJ, UNSPEC_FCMLA180_CONJ, UNSPEC_CMLA_CONJ,
        UNSPEC_CMLA180_CONJ, UNSPEC_CMUL, UNSPEC_CMUL180, FCMLA_OP,
        FCMUL_OP, conj_op, rotsplit1, rotsplit2, fcmac1, sve_rot1,
        sve_rot2, SVE2_INT_CMLA_OP, SVE2_INT_CMUL_OP,
        SVE2_INT_CADD_OP): New.
        (rot): Add UNSPEC_FCMUL, UNSPEC_FCMUL180.
        (rot_op): Renamed to conj_op.
        * config/aarch64/aarch64-sve.md (cml<fcmac1><conj_op><mode>4,
        cmul<conj_op><mode>3): New.
        * config/aarch64/aarch64-sve2.md (cml<fcmac1><conj_op><mode>4,
        cmul<conj_op><mode>3): New.
This adds implementation for the optabs for complex operations. With this
the following C code:

  void g (float complex a[restrict N], float complex b[restrict N],
          float complex c[restrict N])
  {
    for (int i=0; i < N; i++)
      c[i] = a[i] * b[i];
  }

generates

NEON:

  g:
          vmov.f32 q11, #0.0  @ v4sf
          add     r3, r2, #1600
  .L2:
          vmov    q8, q11  @ v4sf
          vld1.32 {q10}, [r1]!
          vld1.32 {q9}, [r0]!
          vcmla.f32 q8, q9, q10, #0
          vcmla.f32 q8, q9, q10, #90
          vst1.32 {q8}, [r2]!
          cmp     r3, r2
          bne     .L2
          bx      lr

MVE:

  g:
          push    {lr}
          mov     lr, #100
          dls     lr, lr
  .L2:
          vldrw.32 q1, [r1], #16
          vldrw.32 q2, [r0], #16
          vcmul.f32 q3, q2, q1, #0
          vcmla.f32 q3, q2, q1, #90
          vstrw.32 q3, [r2], #16
          le      lr, .L2
          ldr     pc, [sp], #4

instead of

  g:
          add     r3, r2, #1600
  .L2:
          vld2.32 {d20-d23}, [r0]!
          vld2.32 {d16-d19}, [r1]!
          vmul.f32 q14, q11, q9
          vmul.f32 q15, q11, q8
          vneg.f32 q14, q14
          vfma.f32 q15, q10, q9
          vfma.f32 q14, q10, q8
          vmov    q13, q15  @ v4sf
          vmov    q12, q14  @ v4sf
          vst2.32 {d24-d27}, [r2]!
          cmp     r3, r2
          bne     .L2
          bx      lr

and

  g:
          add     r3, r2, #1600
  .L2:
          vld2.32 {d20-d23}, [r0]!
          vld2.32 {d16-d19}, [r1]!
          vmul.f32 q15, q10, q8
          vmul.f32 q14, q10, q9
          vmls.f32 q15, q11, q9
          vmla.f32 q14, q11, q8
          vmov    q12, q15  @ v4sf
          vmov    q13, q14  @ v4sf
          vst2.32 {d24-d27}, [r2]!
          cmp     r3, r2
          bne     .L2
          bx      lr

respectively.

gcc/ChangeLog:

        * config/arm/iterators.md (rotsplit1, rotsplit2, conj_op, fcmac1,
        VCMLA_OP, VCMUL_OP): New.
        * config/arm/mve.md (mve_vcmlaq<mve_rot><mode>): Support vec_dup 0.
        * config/arm/neon.md (cmul<conj_op><mode>3): New.
        * config/arm/unspecs.md (UNSPEC_VCMLA_CONJ, UNSPEC_VCMLA180_CONJ,
        UNSPEC_VCMUL_CONJ): New.
        * config/arm/vec-common.md (cmul<conj_op><mode>3,
        arm_vcmla<rot><mode>, cml<fcmac1><conj_op><mode>4): New.
The attached testcase ICEs due to a couple of issues. In the testcase you have two SLP instances that share the majority of their definition with each other. One tree defines a COMPLEX_MUL sequence and the other tree a COMPLEX_FMA.

The ICE happens because:

1. the refcounts are wrong; in particular, the FMA case doesn't correctly count the references for the COMPLEX_MUL that it consumes.
2. when the FMA is created it incorrectly assumes it can just tear apart the MUL node that it's consuming. This is wrong and should only be done when there are no more uses of the node, in which case the vector-only pattern is no longer relevant.

To fix the last part, the SLP-only pattern reset code was moved into vect_free_slp_tree, which results in cleaner code. I also think it belongs there, since that function knows when there are no more uses of the node and so the pattern should be unmarked, so that when the vectorizer is inspecting the BB it doesn't find the now-invalid vector-only patterns.

The patch also clears SLP_TREE_REPRESENTATIVE when stores are removed, so that we don't hit an error later trying to free the stmt_vec_info again. Lastly, it also tweaks the result of whether a pattern has been detected to return true when another SLP instance has created a pattern that is only used by a different instance (due to the trees being unshared).

Instead of ICEing this code now produces

          adrp    x1, .LANCHOR0
          add     x2, x1, :lo12:.LANCHOR0
          movi    v1.2s, 0
          mov     w0, 0
          ldr     x4, [x1, #:lo12:.LANCHOR0]
          ldrsw   x3, [x2, 16]
          ldr     x1, [x2, 8]
          ldrsw   x2, [x2, 20]
          ldr     d0, [x4]
          ldr     d2, [x1, x3, lsl 3]
          fcmla   v2.2s, v0.2s, v0.2s, #0
          fcmla   v2.2s, v0.2s, v0.2s, #90
          str     d2, [x1, x3, lsl 3]
          fcmla   v1.2s, v0.2s, v0.2s, #0
          fcmla   v1.2s, v0.2s, v0.2s, #90
          str     d1, [x1, x2, lsl 3]
          ret

PS. This testcase actually shows that the codegen we get in these cases is not optimal. It should generate a MUL + ADD instead of MUL + FMA. But that's for GCC 12.
gcc/ChangeLog:

        PR tree-optimization/99149
        * tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate
        the buffer.
        (vect_slp_reset_pattern): Remove.
        (complex_fma_pattern::matches): Remove call to
        vect_slp_reset_pattern.
        (complex_mul_pattern::build, complex_fma_pattern::build,
        complex_fms_pattern::build): Fix ref counts.
        * tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern
        relevancy when node is being deleted.
        (vect_match_slp_patterns_2): Correct result of cache hit on
        patterns.
        (vect_schedule_slp): Invalidate SLP_TREE_REPRESENTATIVE of removed
        stores.
        * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
        value.

gcc/testsuite/ChangeLog:

        PR tree-optimization/99149
        * g++.dg/vect/pr99149.cc: New test.
Example:
GIMPLE