Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor miscalculation in delayMicroseconds() #130

Closed
LaZsolt opened this issue Mar 15, 2020 · 13 comments
Closed

Minor miscalculation in delayMicroseconds() #130

LaZsolt opened this issue Mar 15, 2020 · 13 comments

Comments

@LaZsolt
Copy link

LaZsolt commented Mar 15, 2020

Hi!

I found a minor miscalculation in wiring.c and delayMicroseconds()

In case of 32 MHz clock speed, at line 201 microseconds multiplied by 8 takes 15 clock cycles.
us <<= 3; // x8 us, = 6 cycles not true

The assembly code:
63 e0 ldi r22, 0x03 ; 1 clock cycle
88 0f add r24, r24 ; 3 x 1 clock cycles
99 1f adc r25, r25 ; 3 x 1
6a 95 dec r22 ; 3 x 1
e1 f7 brne .-8 ; 2 x 2 + 1
; Sum 15 Cycles

@MCUdude
Copy link
Owner

MCUdude commented Mar 15, 2020

Thanks for reporting. I don't have time to look into this right now, but do you want to propose a fix, so that timing is accurate (and perhaps test it on actual hardware)?

@LaZsolt
Copy link
Author

LaZsolt commented Mar 16, 2020

Need some tests compilation and disassembling to view the effect of volatile. I think, the solution is:
`void delayMicroseconds( volatile unsigned int us)
{
#if F_CPU >= 32000000L
if (!us) return;

us <<= 1;
nop(); // This NOP will break optimization
us <<= 2;

us -= ?;
...`

@LaZsolt
Copy link
Author

LaZsolt commented Mar 22, 2020

Voliatile variable was not a good idea. Compiler generate too complex and big code.

But there is another problem even not using voliatile variable. If delayMicroseconds() called with constants only compiler generate function call free code. This example is from a complex code using timing in Onewire library.

Need more tricks. Thinking...

void delayMicroseconds( unsigned int us)
{
#if F_CPU >= 32000000L
if (!us) return;
us <<= 3;
us -= 3;
#endif
__asm__ __volatile__ (
"1: sbiw %0,1" "\n\t"
"brne 1b" : "=w" (us) : "0" (us)
);
}
delayMicroseconds(53);

compiled to:

d78: e3 ea ldi r30, 0xA3 ; 165
d7a: f1 e0 ldi r31, 0x01 ; 1
d7c: 31 97 sbiw r30, 0x01 ; 1
d7e: f1 f7 brne .-4 ; 0xd7c

@LaZsolt
Copy link
Author

LaZsolt commented May 4, 2020

Here are some compiled and disassembled codes. Link time optimization (LTO) must be disabled.

// 32 MHz
delayMicroseconds(3);       // = 6 cycles
    3302:   83 e0           ldi     r24, 0x03
    3304:   90 e0           ldi     r25, 0x00
    3306:   0e 94 a6 1b     call    0x374c      ; 0x374c <delayMicroseconds>
.
.

void delayMicroseconds(unsigned int us)
{
  if (!us) return;          // = 3 cycles, (4 when true)
    374c:   00 97           sbiw    r24, 0x00
    374e:   41 f0           breq    .+16        ; 0x3762 <delayMicroseconds+0x16>

  us <<= 3;          // x8 us, =15 cycles (instead of 6 cycles)
    3750:   23 e0           ldi     r18, 0x03   ; 1 c
    3752:   88 0f           add     r24, r24    ; 1 c x3
    3754:   99 1f           adc     r25, r25    ; 1 c x3
    3756:   2a 95           dec     r18         ; 1 c x3
    3758:   e1 f7           brne    .-8      	; 2+2+1 cycles
	                                            ; 0x3752 <delayMicroseconds+0x6>

  if (!(us -= 8)) return;   // = 3 cycles, (4 when true)
    375a:   08 97           sbiw    r24, 0x08
    375c:   11 f0           breq    .+4         ; 0x3762 <delayMicroseconds+0x16>

  // busy wait
  __asm__ __volatile__ (
    375e:   01 97           sbiw    r24, 0x01
    3760:   f1 f7           brne    .-4         ; 0x375e <delayMicroseconds+0x12>
  );
 
}                    // return = 4 cycles
    3762:   08 95           ret
// 24 MHz
delayMicroseconds(3);       // = 6 cycles
    3302:   83 e0           ldi     r24, 0x03
    3304:   90 e0           ldi     r25, 0x00
    3306:   0e 94 a6 1b     call    0x3758      ; 0x3758 <delayMicroseconds>
.
.

void delayMicroseconds(unsigned int us)
{
    3758:   9c 01           movw    r18, r24
  if (!us) return;         // = 1+2 cycles, (4 when true)
    375a:   89 2b           or      r24, r25
    375c:   49 f0           breq    .+18         ; 0x3770 <delayMicroseconds+0x18>

  us *= 6;          // x6 us, = 7+1 cycles
    375e:   46 e0           ldi     r20, 0x06
    3760:   42 9f           mul     r20, r18
    3762:   c0 01           movw    r24, r0
    3764:   43 9f           mul     r20, r19
    3766:   90 0d           add     r25, r0
    3768:   11 24           eor     r1, r1

  us -= 5;                  // = 2 cycles
    376a:   05 97           sbiw    r24, 0x05

  // busy wait
  __asm__ __volatile__ (
    376c:   01 97           sbiw    r24, 0x01
    376e:   f1 f7           brne    .-4          ; 0x376c <delayMicroseconds+0x14>
  );

}                    // return = 4 cycles
    3770:   08 95           ret
// 18.342 MHz
delayMicroseconds(3);       // = 6 cycles
    3302:   83 e0           ldi     r24, 0x03
    3304:   90 e0           ldi     r25, 0x00
    3306:   0e 94 a6 1b     call    0x3738      ; 0x3738 <delayMicroseconds>
.
.

void delayMicroseconds(unsigned int us)
{
  __asm__ __volatile__ (
    3738:   00 00           nop
    373a:   00 00           nop
    373c:   00 00           nop
    373e:   00 00           nop
  );              //just waiting 4 cycles
  if (us <= 1) return;      // = 3 cycles, (4 when true)
    3740:   82 30           cpi     r24, 0x02
    3742:   91 05           cpc     r25, r1
    3744:   38 f1           brcs    .+78        ; 0x3794 <delayMicroseconds+0x5c>

  us = (us << 2) + us; //x5 us,= 7 cycles
    3746:   9c 01           movw    r18, r24
    3748:   22 0f           add     r18, r18
    374a:   33 1f           adc     r19, r19
    374c:   22 0f           add     r18, r18
    374e:   33 1f           adc     r19, r19
    3750:   82 0f           add     r24, r18
    3752:   93 1f           adc     r25, r19

  if (us > 36)              // = 3 cycles, (4 when jump to else)
    3754:   85 32           cpi     r24, 0x25
    3756:   91 05           cpc     r25, r1
    3758:   f0 f0           brcs    .+60        ; 0x3796 <delayMicroseconds+0x5e>
  {
    us = (us >> 1) + (us >> 2) + (us >> 3) + (us >> 4); // x0.9375 us, = 50 cycles (not 20 cycles)
    us -= 24;               // = 2 cycles
    375a:   ac 01           movw    r20, r24    ; 1 c
    375c:   24 e0           ldi     r18, 0x04   ; 1 c
    375e:   56 95           lsr     r21         ; 1 c x4
    3760:   47 95           ror     r20         ; 1 c x4
    3762:   2a 95           dec     r18         ; 1 c x4
    3764:   e1 f7           brne    .-8         ; 2+2+2+1 0x375e <delayMicroseconds+0x26>
    3766:   9c 01           movw    r18, r24    ; 1 c
    3768:   36 95           lsr     r19         ; 1 c
    376a:   27 95           ror     r18         ; 1 c
    376c:   bc 01           movw    r22, r24    ; 1 c
    376e:   76 95           lsr     r23         ; 1 c
    3770:   67 95           ror     r22         ; 1 c
    3772:   76 95           lsr     r23         ; 1 c
    3774:   67 95           ror     r22         ; 1 c
    3776:   26 0f           add     r18, r22    ; 1 c
    3778:   37 1f           adc     r19, r23    ; 1 c
    377a:   23 51           subi    r18, 0x18   ; 1 cycle
    377c:   31 09           sbc     r19, r1     ; 1 cycle
    377e:   f3 e0           ldi     r31, 0x03   ; 1 c
    3780:   96 95           lsr     r25         ; 1 c x3
    3782:   87 95           ror     r24         ; 1 c x3
    3784:   fa 95           dec     r31         ; 1 c x3
    3786:   e1 f7           brne    .-8         ; 2+2+1 0x3780 <delayMicroseconds+0x48>
    3788:   82 0f           add     r24, r18    ; 1 c
    378a:   93 1f           adc     r25, r19    ; 1 c
    378c:   84 0f           add     r24, r20    ; 1 c
    378e:   95 1f           adc     r25, r21    ; 1 c
  }

  // busy wait
  __asm__ __volatile__ (
    3790:   01 97           sbiw    r24, 0x01
    3792:   f1 f7           brne    .-4      	; 0x3790 <delayMicroseconds+0x58>
  );

}                    // return = 4 cycles 
    3794:   08 95           ret

  else 
  { 
    us -= 8;                // = 4 cycles
    3796:       08 97       sbiw    r24, 0x08
    3798:       fb cf       rjmp    .-10     	; 0x3790 <delayMicroseconds+0x58>
  }

@LaZsolt
Copy link
Author

LaZsolt commented May 4, 2020

This comment was edited, because I realized the code was not as precise as I thought at first.

Now this code really precise.
If you don't want to delay zero time and every timing are constans (not variable), this most compact and precise code is for you. (Must be in Arduino.h) LTO does not matter any more,

#define delayMicroseconds(us)            \
(__extension__({                         \
  __asm__ __volatile__ (                 \
    "usL_%=:" "sbiw %0,1" "\n\t"         \
              "brne usL_%="              \
	          :  /* no outputs */    \
              :"w" ( (uint16_t) ((F_CPU * (uint32_t) us) / 4000000L) )  \
  );                                     \
}))

(If your MCU is a Chinese LGT8Fx, you have to replace 4000000L to 3000000L )

@LaZsolt
Copy link
Author

LaZsolt commented Aug 28, 2020

The code above has an error. Higher values of requested microseconds cause compiler 32 bit overflow.
The newer, better version:

#define delayMicroseconds(us)     \
(__extension__({                  \
  __asm__ __volatile__ (          \
    "usL_%=:" "sbiw %0,1" "\n\t"  \
              "brne usL_%="       \
              :  /* no outputs */ \
              :"w" ( (uint16_t) (( (F_CPU/1000) * (uint32_t) us ) / 4000L) )  \
  );                              \
}))

@LaZsolt
Copy link
Author

LaZsolt commented Sep 9, 2020

I created a new version of delayMicroseconds() function. I would be happy if someone check it: https://github.com/LaZsolt/delayMicroseconds/tree/master/for_MCUdude/function
And I calculated cycle count for every freq: https://github.com/LaZsolt/delayMicroseconds/tree/master/Analysis

@MCUdude
Copy link
Owner

MCUdude commented Sep 9, 2020

Sorry for staying quiet for so long. I'll see if I can look into it tonight.

But thanks for your work! Improvements are always welcome 🙂

@MCUdude
Copy link
Owner

MCUdude commented Sep 10, 2020

so, what would you recommend using?

#define delayMicroseconds(us)     \
(__extension__({                  \
  __asm__ __volatile__ (          \
    "usL_%=:" "sbiw %0,1" "\n\t"  \
              "brne usL_%="       \
              :  /* no outputs */ \
              :"w" ( (uint16_t) (( (F_CPU/1000) * (uint32_t) us ) / 4000L) )  \
  );                              \
}))

is of course way shorter, but I'm so inline assembly guy, so I can't really tell what's going on. Will the function above work for all F_CPU values? How about the code provided at https://github.com/LaZsolt/delayMicroseconds/tree/master/for_MCUdude/function?

Speaking of the code here: https://github.com/LaZsolt/delayMicroseconds/tree/master/for_MCUdude/function
have you actually tried some of the code on different clock speed to verify that is also does what it should in practice?

@LaZsolt
Copy link
Author

LaZsolt commented Sep 10, 2020

Yes, this simple inline assembly above will work all F_CPU values. ArduinoCore-samd delay.h defines almost the same code for delayMicroseconds().
I think may useable for microcore too.
But it has many limitations, which mean it is not fully compatibile with the old Arduino-AVR or current MiniCore delayMicroseconds() function.

  • The argument have to be compile time constants. In this case timing better than in current MiniCore.
  • Not recommended to use with runtime variable argument. It cause huge overhead in timing.
  • It always delays 1 more clock cycle on frequencies divisible by 4.
  • Requested microsecond cannot be 0, but using of __builtin_constant_p() can correct this mistake.
  • Requested microsecond cannot be 0, 1, 2, 3 on 1 MHz and cannot be 0, 1 on 2 MHz.

I tested it in my IOT web temperature sensor on 16 MHz, with DS18B20 sensors and ENC28J60 ehernet controller.
I have boards only on 16 MHz.
3 years ago matthijskooijman created a delay tester program, later I will run to test my code.

Calculated cyles and delays in microseconds:

requested us 32 MHz 20 MHz 16 MHz 12.5 MHz
1 33 - 1.03 21 - 1.05 17 - 1.06 13 - 1.04
2 65 - 2.03 41 - 2.05 33 - 2.06 25 - 2
3 97 - 3.03 61 - 3.05 49 - 3.06 37 - 2.96
4 129 - 4.03 81 - 4.05 65 - 4.06 49 - 3.92
5 161 - 5.03 101 - 5.05 81 - 5.06 61 - 4.88
6 193 - 6.03 121 - 6.05 97 - 6.06 73 - 5.84
7 225 - 7.03 141 - 7.05 113 - 7.06 85 - 6.8
8 257 - 8.03 161 - 8.05 129 - 8.06 101 - 8.08
9 289 - 9.03 181 - 9.05 145 - 9.06 113 - 9.04
17 545 - 17.03 341 - 17.05 273 - 17.06 213 - 17.04
33 1057 - 33.03 661 - 33.05 529 - 33.06 413 - 33.04
143 4577 - 143.03 2861 - 143.05 2289 - 143.06 1785 - 142.8
requested us 11.0592 MHz 9.6 MHz 8 MHz 2 MHz 1 MHz
1 9 - 0.81 9 - 0.94 9 - 1.13 262145 - 131072.5 262145 - 262145
2 21 - 1.9 17 - 1.77 17 - 2.13 5 - 2.5 262145 - 262145
3 33 - 2.98 29 - 3.02 25 - 3.13 5 - 2.5 262145 - 262145
4 45 - 4.07 37 - 3.85 33 - 4.13 9 - 4.5 5 - 5
5 53 - 4.79 49 - 5.1 41 - 5.13 9 - 4.5 5 - 5
6 65 - 5.88 57 - 5.94 49 - 6.13 13 - 6.5 5 - 5
7 77 - 6.96 65 - 6.77 57 - 7.13 13 - 6.5 5 - 5
8 89 - 8.04 77 - 8.02 65 - 8.13 17 - 8.5 9 - 9
9 97 - 8.77 85 - 8.85 73 - 9.13 17 - 8.5 9 - 9
17 189 - 17.09 161 - 16.77 137 - 17.13 33 - 16.5 17 - 17
33 365 - 33 317 - 33.02 265 - 33.125 65 - 32.5 33 - 33
143 1581 - 142.96 1373 - 143.02 1145 - 143.125 285 - 142. 141 - 141

@MCUdude
Copy link
Owner

MCUdude commented Sep 11, 2020

For MiniCore, and all the other Arduino cores for ATmega's (Mighty/Mega/MajorCore), I think it's best to keep compatibility and use the function you provided instead of the macro. I bet there is a lot of code out there that depends on delay times that are not known at compile time.

The tables you posted above, is this data from the macro or from the function?

@MCUdude
Copy link
Owner

MCUdude commented Sep 11, 2020

I connected my trusty signal generator to the clock input of the AVR development board I'm using to get a very accurate clock. I made a simple sketch that toggles a pin by witing to PINB. I tested your delayMicroseconds code with all clocks MiniCore supports, and it looks very good. I'm going to merge this code into MCUdude_corefiles. This way the new delayMicroseconds function will be available on MightyCore, MiniCore, MegaCore, and MajorCore. Thanks for your contribution!

MCUdude added a commit to MCUdude/MCUdude_corefiles that referenced this issue Sep 11, 2020
Thanks to @LaZsolt for the contribution.
Discussion here: MCUdude/MiniCore#130
@MCUdude MCUdude closed this as completed Sep 11, 2020
@MCUdude
Copy link
Owner

MCUdude commented Sep 11, 2020

The updated delayMicroseconds function will be available in the next MiniCore release!

MCUdude added a commit to MCUdude/MightyCore that referenced this issue Oct 3, 2020
…ffdd

bfadeffdd Remove tab
88cc4bfb4 Improve delayMicroseconds accuracy Thanks to @LaZsolt for the contribution. Discussion here: MCUdude/MiniCore#130
d81bb03ef Improve set-bit/clear-bit logic Usually, the sbo and cbi macros didn't compile down to a single instruction. The new implementation increases speed and reduces the compiled size
6977bd002 Improve formatting By adding names to parameters, it's easier to determine what each parameter is for in certain editors like VScode without having to look it up

git-subtree-dir: avr/cores/MCUdude_corefiles
git-subtree-split: bfadeffdd4b2e445c5bccfb97dd11d5c57ccc58b
MCUdude added a commit that referenced this issue Oct 3, 2020
…ffdd

bfadeffdd Remove tab
88cc4bfb4 Improve delayMicroseconds accuracy Thanks to @LaZsolt for the contribution. Discussion here: #130
d81bb03ef Improve set-bit/clear-bit logic Usually, the sbo and cbi macros didn't compile down to a single instruction. The new implementation increases speed and reduces the compiled size
6977bd002 Improve formatting By adding names to parameters, it's easier to determine what each parameter is for in certain editors like VScode without having to look it up
f18d41698 Add PROGMEM1..3 Makes it much easier to place and access stuff in far memory

git-subtree-dir: avr/cores/MCUdude_corefiles
git-subtree-split: bfadeffdd4b2e445c5bccfb97dd11d5c57ccc58b
MCUdude added a commit to MCUdude/MajorCore that referenced this issue Oct 3, 2020
bfadeffd Remove tab
88cc4bfb Improve delayMicroseconds accuracy Thanks to @LaZsolt for the contribution. Discussion here: MCUdude/MiniCore#130
d81bb03e Improve set-bit/clear-bit logic Usually, the sbo and cbi macros didn't compile down to a single instruction. The new implementation increases speed and reduces the compiled size
6977bd00 Improve formatting By adding names to parameters, it's easier to determine what each parameter is for in certain editors like VScode without having to look it up
f18d4169 Add PROGMEM1..3 Makes it much easier to place and access stuff in far memory

git-subtree-dir: avr/cores/MCUdude_corefiles
git-subtree-split: bfadeffdd4b2e445c5bccfb97dd11d5c57ccc58b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants