New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix timing issues found in "Flash - clock and cache test" #4666

Merged
merged 5 commits into from Jul 13, 2017

Conversation

Projects
None yet
5 participants
@chrissnow
Contributor

chrissnow commented Jun 29, 2017

Description

ARMCC seemed to be inlining time_cpu_cycles() but with a different number of clock cycles in the loop, GCC worked fine.

Status

READY

Migrations

NO

Related PRs

#4640

Steps to test or reproduce

Outline the steps to test or reproduce the PR here.

Fix timing issues found in "Flash - clock and cache test"
ARMCC seemed to be inlining time_cpu_cycles() but with a different number of clock cycles in the loop, GCC worked fine.
@0xc0170

This comment has been minimized.

Member

0xc0170 commented Jun 29, 2017

Does IAR compile ? I would assume attribute is not supported there.

ARMCC seemed to be inlining time_cpu_cycles() but with a different number of clock cycles in the loop

Can you share assembly how it differs ? What exactly is causing this, if you can summarize it here regarding this test.

@chrissnow

This comment has been minimized.

Contributor

chrissnow commented Jun 29, 2017

Quoting @theotherjimmy

It looks like your inner timing loop is taking 8 cycles the first time, and 9 the second. Looking at the disassembly of my version of that test binary, the timing function time_cpu_cycles is being inlined at both call sites. This may be causing the problem. The patch below prevents the inlining of that critical timing function at each call site:

For reference the inner loop is made up of the following instructions:

 80058f0:       1e88            subs    r0, r1, #2  
 80058f2:       1e89            subs    r1, r1, #2
 80058f4:       9008            str     r0, [sp, #32]
 80058f6:       2900            cmp     r1, #0
 80058f8:       dcfa            bgt.n   80058f0 <_Z15time_cpu_cyclesj+0x34>

2 cycles from the subs + 2 from the str + 1 from the cmp + 1+ P (1-3) for the branch instruction. If P is 2 for most times through the loop when inlined at the flash_init_test call site (8 cycles), and P is 3 for the other call site (9 cycles), then the difference should go away when inlining is removed.

I haven't checked over the assembly after this change.

Does IAR compile ?

No, but it doesn't appear to be to do with me...

  * XDOT_L151CC::IAR::MBED-BUILD
        Building library mbed-build (XDOT_L151CC, IAR)
        Scan: mbed-os-new
        Compile [ 64.5%]: rtx_lib.c
        [Error] rtx_lib.c@50,2: [Pe065]: expected a ";"
        [Error] rtx_lib.c@116,2: [Pe065]: expected a ";"
        [Error] rtx_lib.c@120,2: [Pe065]: expected a ";"
        [Error] rtx_lib.c@164,2: [Pe065]: expected a ";"
        [Error] rtx_lib.c@168,2: [Pe065]: expected a ";"
        [Error] rtx_lib.c@184,2: [Pe065]: expected a ";"
        [Error] rtx_lib.c@188,16: [Pe065]: expected a ";"
        [Error] rtx_lib.c@335,2: [Pe079]: expected a type specifier
        [Error] rtx_lib.c@335,2: [Pe260]: explicit type is missing ("int" assumed)
        [Error] rtx_lib.c@335,2: [Pe141]: unnamed prototyped parameters not allowed when body is present
        [Error] rtx_lib.c@336,4: [Pe130]: expected a "{"
        [Warning] rtx_lib.c@337,4: [Pe174]: expression has no effect
        [Warning] rtx_lib.c@337,4: [Pe174]: expression has no effect
        [Error] rtx_lib.c@354,4: [Pe029]: expected an expression
        [Warning] rtx_lib.c@421,4: [Pe012]: parsing restarts here after previous syntax error
        [Warning] rtx_lib.c@337,4: [Pe174]: expression has no effect
        [Error] rtx_lib.c@421,5: [Pe065]: expected a ";"
        [Error] rtx_lib.c@421,3: [Pe169]: expected a declaration
        [Warning] rtx_lib.c@433,15: [Pe012]: parsing restarts here after previous syntax error
        [Warning] rtx_lib.c@49,30: [Pe177]: variable "os_isr_queue" was declared but never referenced
        [Warning] rtx_lib.c@123,30: [Pe177]: variable "os_idle_thread_attr" was declared but never referenced
        [Warning] rtx_lib.c@171,36: [Pe177]: variable "os_timer_thread_attr" was declared but never referenced
        [Warning] rtx_lib.c@191,0: [Pe177]: variable "os_timer_mq_attr" was declared but never referenced
@0xc0170

This comment has been minimized.

Member

0xc0170 commented Jun 29, 2017

What about #define ALLOWED_DRIFT_PPM 1000 //0.1%, this one is not too strict if one more cpu cycle makes the test fail?

Regarding failures with IAR, what version are you using, 7.8 should be OK (earlier versions had some bugs)

@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jun 29, 2017

@chrissnow Could you get it to compile on IAR?

@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jun 29, 2017

@c1728p9 Could you take a look?

@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Jun 29, 2017

The safest way to handle this would probably be to re-write this into assermbly. Something like this should work:

#ifdef __CC_ARM
MBED_FORCENOINLINE
__asm static void delay_loop(uint32_t count)
{
1
  SUBS a1, a1, #1
  BCS  %BT1
  BX   lr
}
#elif defined (__ICCARM__)
MBED_FORCENOINLINE
static void delay_loop(uint32_t count)
{
  __asm volatile(
    "loop: \n"
    " SUBS %0, %0, #1 \n"
    " BCS.n  loop\n"
    : "+r" (count)
    :
    : "cc"
  );
}
#else // GCC
MBED_FORCENOINLINE
static void delay_loop(uint32_t count)
{
  __asm__ volatile (
    "%=:\n\t"
#if defined(__thumb__) && !defined(__thumb2__)
    "SUB  %0, #1\n\t"
#else
    "SUBS %0, %0, #1\n\t"
#endif
    "BCS  %=b\n\t"
    : "+l" (count)
    :
    : "cc"
  );
}
#endif

You'll also need to add MBED_FORCENOINLINE or something similar to mbed_toolchain.h. For IAR you should be able to use this syntax:

#define MBED_FORCEINLINE _Pragma("inline=never")
@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jun 29, 2017

@0xc0170 Maybe you missed my point. The LOOP is 1 cycle off, making every pass through it 1 cycle more than it was before. This corresponds to 9 cycles taken for every 8 seen previously, making the ratio between after/before 1.125. That's 12.5% difference, not 1 cycle.

@chrissnow

This comment has been minimized.

Contributor

chrissnow commented Jun 29, 2017

That would tie up roughly with the results of the test.

@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jun 29, 2017

Yeah I think your test was about 12.495 % off. So close that the difference can be chalked up to measurement error.

@chrissnow

This comment has been minimized.

Contributor

chrissnow commented Jun 29, 2017

Given we don't care how long the test actually takes so long as the way it is executed doesn't affect it do we need to go down the assembly route?

I will sort out the IAR build with the MBED_FORCENOINLINE suggestion.

@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jun 30, 2017

@chrissnow The advantage of assembly would be to remove the store instruction. Removing the store might eliminate another cause of difference: different stack depth at the call site. This would affect execution time on a target where different RAM banks had different speeds, and the stack was on the border of two continuous banks. It's a small possibility, but then again so was the different ROM location bug.

@chrissnow

This comment has been minimized.

Contributor

chrissnow commented Jun 30, 2017

I have implemented the above suggestion and it seems to work nicely,
I dropped the PPM down to test the results and it's looking good.

FAIL: Values Not Within Delta 0 Expected 157801 Was 157799

Which is 12.5 PPM, well within the 1000 allowed.

@chrissnow

This comment has been minimized.

Contributor

chrissnow commented Jun 30, 2017

@0xc0170 I was using 7.2, using 8 I still have issues

        [Error] mbed_boot.c@571,0: [Pe020]: identifier "_MAX_LOCK" is undefined

Which I can find no definition of anywhere.

edit:
also

        [Error] mbed_retarget.cpp@832,47: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@832,15: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@833,30: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@833,15: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@834,30: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@834,15: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@847,47: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@847,15: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@848,30: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@848,15: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@849,30: [Pe393]: pointer to incomplete class type is not allowed
        [Error] mbed_retarget.cpp@849,0: [Pe393]: pointer to incomplete class type is not allowed
@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jun 30, 2017

@chrissnow We use 7.8, so IAR 8 may not work.

@@ -61,8 +102,7 @@ static int time_cpu_cycles(uint32_t cycles)
int timer_start = timer.read_us();
volatile uint32_t delay = (volatile uint32_t)cycles;

This comment has been minimized.

@theotherjimmy

theotherjimmy Jun 30, 2017

Contributor

You should not need volatile anymore

@chrissnow

This comment has been minimized.

Contributor

chrissnow commented Jun 30, 2017

@theotherjimmy I couldn't find 7.8 so used the trial of 8.
if you have IAR could you try and build, I'm pretty sure it will be ok as the main at least compiles.

@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jun 30, 2017

@chrissnow The version of IAR Embedded workbench that has EWARM 7.8 is Embedded Workbench 7.5, I think.

@theotherjimmy theotherjimmy added needs: CI and removed needs: work labels Jul 10, 2017

@theotherjimmy

This comment has been minimized.

Contributor

theotherjimmy commented Jul 10, 2017

/morph test

@mbed-bot

This comment has been minimized.

mbed-bot commented Jul 11, 2017

Result: SUCCESS

Your command has finished executing! Here's what you wrote!

/morph test

Output

mbed Build Number: 768

All builds and test passed!

@0xc0170

This comment has been minimized.

Member

0xc0170 commented Jul 11, 2017

@c1728p9 Happy with the update ( to use inline assembly delay loop as proposed above) ?

@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Jul 11, 2017

This looks good to me! Thanks for the changes @chrissnow

@0xc0170 0xc0170 merged commit 5a5d159 into ARMmbed:master Jul 13, 2017

4 checks passed

Cam-CI uvisor Build & Test Success
Details
ci/morph-test Job has completed
Details
continuous-integration/jenkins/pr-head This commit looks good
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
MBED_NOINLINE
__asm static void delay_loop(uint32_t count)
{
1

This comment has been minimized.

@0xc0170

0xc0170 Feb 21, 2018

Member

I got a question regarding this 1, is it required? I came across this code today, and wondering if this is line is needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment