GCC ARM optimization flag should be -Os, not -O2 for GCC versions later than 4.5.3 #664

bikeNomad · 2014-11-06T18:03:31Z

The bug referenced in workspace_tools/toolchains/gcc.py

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46762

was fixed a long time ago (v4.5.3) in ARM GCC.
Because of this, we should consider changing the optimization level from -O2 to -Os.

Also, latest versions of ARM GCC (4.8) have added the -Og option, which is described as "Optimize for debugging experience rather than speed or size"; this results in considerably smaller code than the current -O0 optimization when DEBUG is set.

Perhaps some way could be added to allow those of us who are debugging and using later ARM gcc versions to use -Og instead of -O0?

0xc0170 · 2014-11-07T07:54:07Z

Hi, thanks for looking at gcc toolchain in Tools. You can send a pull request for these changes.

If we switch -O0 to -Og, we should state somewhere that it requires GCC 4.8.
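
A minimal sketch of such a version gate, assuming the build tool can obtain a bare version string such as the one printed by `gcc -dumpversion`. The helper names `parse_gcc_version` and `debug_opt_flag` are hypothetical and are not part of workspace_tools; the only fact taken from this thread is that -Og needs GCC 4.8 or later.

```python
import re


def parse_gcc_version(version_text):
    """Return (major, minor) parsed from a bare version string like '4.8.3'.

    A string with only a major number (for example '9') gets minor 0.
    """
    match = re.match(r"\s*(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError("no version number found in %r" % version_text)
    return int(match.group(1)), int(match.group(2) or 0)


def debug_opt_flag(version_text):
    """Hypothetical helper: pick -Og on GCC 4.8 or later, else fall back to -O0."""
    return "-Og" if parse_gcc_version(version_text) >= (4, 8) else "-O0"


if __name__ == "__main__":
    for version in ("4.7.4", "4.8.3"):
        print(version, debug_opt_flag(version))
```

Comparing tuples rather than strings avoids ordering mistakes once a version component reaches two digits.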

adamgreen · 2014-11-18T09:42:50Z

I just thought I would give my thoughts on this issue. Take it for what it is worth, which is probably not much :)

TL;DR I think the defaults for 'Release' builds should be -O2 and 'Debug' builds should be -O0.

While the following code snippet is a bit contrived, it does demonstrate issues I have really encountered with the various optimization settings in GCC over the last couple of years.

#include "LPC17xx.h"

volatile int g_LoopDummy;

int main(int argc, char** argv)
{
    LPC_GPIO1->FIODIR |= 1 << 18; // P1.18 connected to LED1
    while(1)
    {
        LPC_GPIO1->FIOPIN ^= 1 << 18; // Toggle P1.18
        for (int i = 0 ; i < 5000000 && !g_LoopDummy ; i++)
        {
        }
    }
    return 0;
}

This is the disassembly of main() when the optimization level was set to -O2, my preferred level.

00000120 <main>:
     120:   4b08        ldr r3, [pc, #32]   ; (144 <main+0x24>)
     122:   4909        ldr r1, [pc, #36]   ; (148 <main+0x28>)
     124:   681a        ldr r2, [r3, #0]
     126:   4618        mov r0, r3
     128:   f442 2280   orr.w   r2, r2, #262144 ; 0x40000
     12c:   601a        str r2, [r3, #0]
     12e:   6942        ldr r2, [r0, #20]
     130:   4b06        ldr r3, [pc, #24]   ; (14c <main+0x2c>)
     132:   f482 2280   eor.w   r2, r2, #262144 ; 0x40000
     136:   6142        str r2, [r0, #20]
     138:   680a        ldr r2, [r1, #0]
     13a:   2a00        cmp r2, #0
     13c:   d1f7        bne.n   12e <main+0xe>
     13e:   3b01        subs    r3, #1
     140:   d1fa        bne.n   138 <main+0x18>
     142:   e7f4        b.n 12e <main+0xe>
     144:   2009c020    .word   0x2009c020
     148:   10000354    .word   0x10000354
     14c:   004c4b40    .word   0x004c4b40

This is the disassembly of main() when -Os is instead used.

00000120 <main>:
     120:   4b08        ldr r3, [pc, #32]   ; (144 <main+0x24>)
     122:   681a        ldr r2, [r3, #0]
     124:   f442 2280   orr.w   r2, r2, #262144 ; 0x40000
     128:   601a        str r2, [r3, #0]
     12a:   461a        mov r2, r3
     12c:   6953        ldr r3, [r2, #20]
     12e:   f483 2380   eor.w   r3, r3, #262144 ; 0x40000
     132:   6153        str r3, [r2, #20]
     134:   4b04        ldr r3, [pc, #16]   ; (148 <main+0x28>)
     136:   4905        ldr r1, [pc, #20]   ; (14c <main+0x2c>)
     138:   6809        ldr r1, [r1, #0]
     13a:   2900        cmp r1, #0
     13c:   d1f6        bne.n   12c <main+0xc>
     13e:   3b01        subs    r3, #1
     140:   d1f9        bne.n   136 <main+0x16>
     142:   e7f3        b.n 12c <main+0xc>
     144:   2009c020    .word   0x2009c020
     148:   004c4b40    .word   0x004c4b40
     14c:   10000354    .word   0x10000354

These two examples demonstrate the type of issue I have often encountered with -Os code generation. The addresses of global volatile variables are constant but get treated as loop variant under -Os. In the above example, the address of g_LoopDummy is 0x10000354 in both builds. With -O2, this address is loaded into r1 at address 0x122, which is outside of any of the loops. With -Os, it is loaded into r1 at address 0x136, which places the load inside the inner loop so that it happens on every iteration. In my experience this causes a noticeable slowdown in some real-world driver code that performs such bit twiddling on device registers. I don't know why -Os does this. It just makes the code slower and doesn't result in smaller code.
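
The comparison above can be tallied from the two listings. The following sketch copies the inner-loop instructions (address and mnemonic only) from the -O2 and -Os disassemblies and counts instructions and loads per iteration. The names `INNER_LOOP_O2`, `INNER_LOOP_OS`, and `summarize` are made up for illustration, and the counts are instruction counts, not cycle counts, so they only hint at the relative cost.

```python
# Inner-loop instructions copied from the two listings above: (address, mnemonic).
INNER_LOOP_O2 = [
    (0x138, "ldr"), (0x13A, "cmp"), (0x13C, "bne.n"),
    (0x13E, "subs"), (0x140, "bne.n"),
]
INNER_LOOP_OS = [
    (0x136, "ldr"), (0x138, "ldr"), (0x13A, "cmp"), (0x13C, "bne.n"),
    (0x13E, "subs"), (0x140, "bne.n"),
]


def summarize(loop):
    """Return (instruction count, load count) for one trip through the loop."""
    loads = sum(1 for _, mnemonic in loop if mnemonic == "ldr")
    return len(loop), loads


# In both listings main() starts at 0x120 and the last literal-pool word is at
# 0x14c, so the function occupies the same number of bytes in either build.
SIZE_BYTES = (0x14C + 4) - 0x120

if __name__ == "__main__":
    print("-O2 inner loop (instructions, loads):", summarize(INNER_LOOP_O2))
    print("-Os inner loop (instructions, loads):", summarize(INNER_LOOP_OS))
    print("size of main() in both builds:", SIZE_BYTES, "bytes")
```

The extra load under -Os is the literal-pool fetch of the address of g_LoopDummy, while the total size of main() is identical in the two listings.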

On the plus side, -Os does typically generate smaller code, and I have used it in situations where I needed the smallest possible code. Because of issues such as the one in the above example, though, I don't use it until I really need to.

This is the disassembly of main() when it is built with -O0. It is quite a bit longer than any of the others.

00000120 <main>:
     120:   b084        sub sp, #16
     122:   9001        str r0, [sp, #4]
     124:   9100        str r1, [sp, #0]
     126:   4b0d        ldr r3, [pc, #52]   ; (15c <main+0x3c>)
     128:   4a0c        ldr r2, [pc, #48]   ; (15c <main+0x3c>)
     12a:   6812        ldr r2, [r2, #0]
     12c:   f442 2280   orr.w   r2, r2, #262144 ; 0x40000
     130:   601a        str r2, [r3, #0]
     132:   4b0a        ldr r3, [pc, #40]   ; (15c <main+0x3c>)
     134:   4a09        ldr r2, [pc, #36]   ; (15c <main+0x3c>)
     136:   6952        ldr r2, [r2, #20]
     138:   f482 2280   eor.w   r2, r2, #262144 ; 0x40000
     13c:   615a        str r2, [r3, #20]
     13e:   2300        movs    r3, #0
     140:   9303        str r3, [sp, #12]
     142:   e002        b.n 14a <main+0x2a>
     144:   9b03        ldr r3, [sp, #12]
     146:   3301        adds    r3, #1
     148:   9303        str r3, [sp, #12]
     14a:   9a03        ldr r2, [sp, #12]
     14c:   4b04        ldr r3, [pc, #16]   ; (160 <main+0x40>)
     14e:   429a        cmp r2, r3
     150:   dc03        bgt.n   15a <main+0x3a>
     152:   4b04        ldr r3, [pc, #16]   ; (164 <main+0x44>)
     154:   681b        ldr r3, [r3, #0]
     156:   2b00        cmp r3, #0
     158:   d0f4        beq.n   144 <main+0x24>
     15a:   e7ea        b.n 132 <main+0x12>
     15c:   2009c020    .word   0x2009c020
     160:   004c4b3f    .word   0x004c4b3f
     164:   10000354    .word   0x10000354

This is the same code compiled with -Og.

00000120 <main>:
     120:   b430        push    {r4, r5}
     122:   4b0b        ldr r3, [pc, #44]   ; (150 <main+0x30>)
     124:   681a        ldr r2, [r3, #0]
     126:   f442 2280   orr.w   r2, r2, #262144 ; 0x40000
     12a:   601a        str r2, [r3, #0]
     12c:   461c        mov r4, r3
     12e:   2500        movs    r5, #0
     130:   4908        ldr r1, [pc, #32]   ; (154 <main+0x34>)
     132:   4809        ldr r0, [pc, #36]   ; (158 <main+0x38>)
     134:   6963        ldr r3, [r4, #20]
     136:   f483 2380   eor.w   r3, r3, #262144 ; 0x40000
     13a:   6163        str r3, [r4, #20]
     13c:   462b        mov r3, r5
     13e:   e000        b.n 142 <main+0x22>
     140:   3301        adds    r3, #1
     142:   428b        cmp r3, r1
     144:   dcf6        bgt.n   134 <main+0x14>
     146:   6802        ldr r2, [r0, #0]
     148:   2a00        cmp r2, #0
     14a:   d0f9        beq.n   140 <main+0x20>
     14c:   e7f2        b.n 134 <main+0x14>
     14e:   bf00        nop
     150:   2009c020    .word   0x2009c020
     154:   004c4b3f    .word   0x004c4b3f
     158:   100003a4    .word   0x100003a4

The code is indeed smaller than when compiled with -O0. However, I don't know if the debugging experience will be quite what people expect when they create a 'Debug' build. The following shows a sample GDB session with this -Og compiled version.

(gdb) tbreak main
Temporary breakpoint 1 at 0x120: file main.c, line 21.
(gdb) c
Continuing.
Note: automatically using hardware breakpoints for read-only addresses.

Temporary breakpoint 1, main (argc=268435772, argv=0x0 <_reclaim_reent>) at main.c:21
21  {
(gdb) list
16  #include "LPC17xx.h"
17
18  volatile int g_LoopDummy;
19
20  int main(int argc, char** argv)
21  {
22      LPC_GPIO1->FIODIR |= 1 << 18; // P1.18 connected to LED1
23      while(1)
24      {
25          LPC_GPIO1->FIOPIN ^= 1 << 18; // Toggle P1.18
(gdb) break 25
Breakpoint 2 at 0x12c: file main.c, line 25.

Here I tried to set a breakpoint on line 25 which should be the line of code which toggles the P1.18 pin. If you look at the address this resolved to, 0x12c, in the disassembly you will see that this address is outside of the loop so it will only be hit once and then never again. This is not what a user would expect when debugging a "Debug" build.

(gdb) c
Continuing.

Breakpoint 2, main (argc=268435772, argv=0x0 <_reclaim_reent>) at main.c:25
25          LPC_GPIO1->FIOPIN ^= 1 << 18; // Toggle P1.18

Hits the breakpoint the first time.

(gdb) c
Continuing.
^C

Never hits it again and I end up manually breaking in.

(gdb) p argc
$1 = <optimized out>

This is a very contrived case for this simple code snippet, but let me assure you it happens with real code as well. The argc parameter isn't used by this code (and isn't really set by the code which calls main() either), so its value wasn't maintained. If I try to dump the argc parameter, I just get this warning. There are scenarios where a variable like this has been optimized out by the time you reach the code that crashes, even though access to it would tell you more about the scenario that led to the issue. Typically I want to have access to as many variables as possible in my 'Debug' builds.

dinau · 2014-11-20T14:39:37Z

Hi,
Adam's consideration is great.

I think it would be better to add an optimization option, for example:

python workspace/build.py -o opt-Os -t GCC_ARM ......
python workspace/make.py  -o opt-Os -t GCC_ARM ......
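
A minimal sketch of how an `opt-<level>` option like the one above might be turned into a GCC flag. Everything here is hypothetical: the function name `opt_flag_from_options`, the list of accepted levels, and the way options arrive as a list of strings are assumptions for illustration, not the existing build.py or make.py behaviour. The -O2 default reflects what this thread describes as the current setting.

```python
# Hypothetical set of levels accepted after the "opt-" prefix.
# Note that -Og is only understood by GCC 4.8 and later.
ALLOWED_LEVELS = ("O0", "O1", "O2", "O3", "Os", "Og")


def opt_flag_from_options(options, default="-O2"):
    """Return the GCC flag for the last opt-<level> option, or the default.

    Options that do not start with "opt-" are ignored so that they can be
    handled elsewhere.
    """
    flag = default
    for option in options:
        if option.startswith("opt-"):
            level = option[len("opt-"):]
            if level not in ALLOWED_LEVELS:
                raise ValueError("unsupported optimization level: %s" % option)
            flag = "-" + level
    return flag


if __name__ == "__main__":
    print(opt_flag_from_options(["opt-Os"]))
    print(opt_flag_from_options([]))
```

Rejecting unknown levels early gives a clearer error than passing an invalid flag through to the compiler.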

dinau

shirishb · 2014-12-16T23:24:15Z

As an "easy" task while browsing through the project python tools I took on @dinau's suggestion from above.

Sample implementation for GCC toolchain is here: shirishb/mbed@c3ea4e4

It does not resolve the core issue here, but if the approach is acceptable I can extend it to cover other toolchains and submit a pull request.

ECNU3D · 2014-12-17T01:53:21Z

Hi,

I totally agree with Adam's suggestion. The "-Og" option is still quite error-prone on the ARM back end currently, so adding a switch would be more flexible. @adamgreen For your sample code, I think you could report it on https://launchpad.net/gcc-arm-embedded if you haven't already done so. There's quite an active group on Launchpad working on the embedded GCC toolchain.

0xc0170 · 2014-12-17T07:24:16Z

I completely overlooked dinau's suggestion. I proposed it last year, but the feedback was negative, so I just added the debug_info option, which sets the optimization level to 0, as we use it now.

kjbracey · 2016-07-14T08:42:38Z

-Os certainly produces slower code than -O2. That's the trade-off.

(As to Adam's specific example - the optimisations being turned off by -Os are "high level" early ones that have a tendency to lead to ultimately bigger code. In some specific cases the optimisation may not have actually led to bigger final code but there's no multiple pass system to go back and try again if you find it wasn't a space benefit in the end.

In this case it presumably doesn't hoist the constant out because that optimisation can increase size - the extra register required to hold the address increases register pressure which may lead to more loads/stores. In this case it doesn't, because the loop contents are so simple.

And it knows that a literal load really isn't that expensive - the trade-off is different to hoisting a real subexpression.)

But I know that for everything we work on in the 6LoWPAN area, space not speed is the issue. We've got more processor power than we know what to do with, compared to the speed of 6LoWPAN networks. So I would always choose lower size in the size/speed tradeoff.

I think -Os would be a more sensible default than -O2, but given that it is a trade-off, there should indeed be an easy way for users to flip it to -O2 (or even higher).

On -O0 versus -Og, I agree with Adam. I did experiment with -Og while looking at the settings for a different build system, and concluded that -Og wasn't debuggable enough. We settled on -O0 for the debug builds.

ciarmcom · 2016-08-01T12:35:56Z

ARM Internal Ref: IOTMORF-312

0xc0170 · 2016-10-28T14:05:47Z

This should be resolved. The default profile for GCC specifies the optimization level as -Os; reference: https://github.com/ARMmbed/mbed-os/blob/master/tools/profiles/default.json#L8

bikeNomad changed the title ~~GCC ARM optimization flag should be -Os, not -O2 for GCC versions later than 4.6~~ GCC ARM optimization flag should be -Os, not -O2 for GCC versions later than 4.5.3 Nov 6, 2014

0xc0170 added the enhancement label Nov 7, 2014

0xc0170 mentioned this issue Jul 14, 2016

Add debug symbols to release builds (all toolchains) #2139

Closed

ciarmcom added the mirrored label Aug 1, 2016

sg- removed the mirrored label Aug 12, 2016

0xc0170 closed this as completed Oct 28, 2016

0xc0170 mentioned this issue Mar 9, 2018

Optimise debugging experience when building with gcc #6316

Closed

5 tasks
