Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 19028 - Fix DELAY_US/NS cycle cost #19059

Closed
wants to merge 12 commits into from

Conversation

rhapsodyv
Copy link
Sponsor Member

@rhapsodyv rhapsodyv commented Aug 18, 2020

Description

DELAY_US/NS uses a busy loop in the function __delay_4cycles. The actual code fixes the cycles per iteration in 4. So, to delay 1000 us, it do 1000/4 iterations of the busy loop.

But, the busy loop inside the function __delay_4cycles can take 3 up to 6 cycles per iteration, according with Cortex M3 docs.

sub 1 cycle
cmp 1 cycle
bne 1 cycle + 1 to 3 cycles by pipeline refil 

https://stackoverflow.com/questions/28760617/pipeline-refill-cycles-for-instructions-in-arm
https://developer.arm.com/documentation/ddi0337/h/programmers-model/instruction-set-summary/cortex-m3-instructions

In our tests with STM32F103ZET6 and STM32F103VET6, the loop is taking 6 cycles per iteration. Even more, the lib maple define this loop as taking 6 cycles too in their function delay_us in ~/.platformio/packages/framework-arduinoststm32-maple/STM32F1/system/libmaple/include/libmaple/delay.h

Seems the best value is 6 for this busy loop.

But could have some core that is 4, as the original code was made.....

So, I renamed the function to __delay_Ncycles, created a overwritable define DELAY_CYCLES_ITERATION_COST that is 4 by default, but can be overwritten by the env build_flags = -DDELAY_CYCLES_ITERATION_COST=6

We can have a function that outputs if the value is correct or not:

DISABLE_ISRS();
uint32 end = systick_get_count();
__delay_4cycles(100);
uint32 start = systick_get_count();
ENABLE_ISRS();
uint32 cycles = (end - start) / 100.0;
if (DELAY_CYCLES_ITERATION_COST > cycles * 1.05 || DELAY_CYCLES_ITERATION_COST < cycles * 0.95) {
  SERIAL_ECHOLNPAIR("DELAY_CYCLES_ITERATION_COST is: ", DELAY_CYCLES_ITERATION_COST, " but SHOULD be: ", (uint32)cycles);
}

Searching the code, basically 2 things are using those functions that may be affected: LCD delays and _PULSE_WAIT in BABYSTEPPING when EXTRA_CYCLES_BABYSTEP > 20.

I created with help and advice of @sjasonsmith, a new busy loop based in systick, the is far more precise. If the HAL define the systick methods, it will be used. It fallbacks to the original delay_cycles.

Tests with the new systick based busy loop:

1000 Start: 8635382, End: 8636383, Diff: 1001
2000 Start: 8637103, End: 8639104, Diff: 2001
5000 Start: 8639253, End: 8644254, Diff: 5001
10000 Start: 8644403, End: 8654404, Diff: 10001
500 Start: 8654562, End: 8655063, Diff: 501
200 Start: 8655373, End: 8655574, Diff: 201

Benefits

Our DELAY_US and DELAY_NS were taking 50% more time.... It fix the problem.

Related Issues

#19028

@rhapsodyv rhapsodyv marked this pull request as draft August 18, 2020 01:13
@sjasonsmith sjasonsmith added A: STM32 Needs: Testing Testing is needed for this change labels Aug 18, 2020
@rhapsodyv rhapsodyv marked this pull request as ready for review August 18, 2020 03:46
@GMagician
Copy link
Contributor

Interesting thing...but I read into code:

Cortex-M3 through M7 can use the cycle counter of the DWT unit

and reading web page (https://www.anthonyvh.com/2017/05/18/cortex_m-cycle_counter/) a STM32 is mentioned.

So why STM32 doesn't use the debug timer? It seems that the code may be enabled only for a M7 processor (and if arm and thumb)

@sjasonsmith
Copy link
Contributor

I had some compilation issues building for my GTR board, but got it to work. Right now the code is Maple-specific. In HAL/STM32 you can read the systick value from SysTick->VAL.

You might be interested in this delayMicroseconds implementation from the STM32 framework, which has two different implementations based on DWT or SysTick. It looks like the GTR is using the DWT implementation. With some adaptation that could easily be turned into a delay_cycles or delay_ns implementation.
https://github.com/stm32duino/Arduino_Core_STM32/blob/59dda5c138badcb0a17a24f7a30fa81dececa0d7/cores/arduino/wiring_time.h#L63

When I booted with your change on a GTR (STM32F407), I see this:

DELAY_CYCLES_ITERATION_COST is: 6 but SHOULD be: 4

This still had an extra nop due to ARCH_PIPELINE_RELOAD_CYCLES. When I remove that:

DELAY_CYCLES_ITERATION_COST is: 6 but SHOULD be: 3

This makes it pretty clear that we get different behavior from different cores, so it is good we are addressing this.

@sjasonsmith
Copy link
Contributor

So why STM32 doesn't use the debug timer?

This is a question @rhapsodyv and I have asked as we explored this. Those are old comments which clearly do not actually match the code. The code I referenced from the STM32 framework (non-Maple) always uses DWT if it is available. That should be a more reliable way to solve this problem, but isn't available on all hardware. Any solution that can rely on CPU counters is going to be more reliable across hardware than instruction counting.

@GMagician
Copy link
Contributor

I tryed to force code compile for my samd51 and it seems to compile. SAMD51 is a cortex-M4

@GMagician
Copy link
Contributor

@sjasonsmith

but isn't available on all hardware

if present in M3 to M7 cortex, then it's present in many of the Marlin supported 32bits boards.
AFAIK out of this list are all 8 bits and esp boards...

@sjasonsmith
Copy link
Contributor

@GMagician some of the Malyan boards are STM32F0 (Cortex-M0).

@GMagician
Copy link
Contributor

GMagician commented Aug 18, 2020

I confirm, SAMD51 supports the debug timer (adafruit framework enables it automatically, no enableCycleCounter call is required)

but none of arm, thumb, __CORTEX_M` seem defined

@GMagician
Copy link
Contributor

GMagician commented Aug 18, 2020

OT. I found also this in Marlin stepper.cpp

> FORCE_INLINE int32_t Stepper::_eval_bezier_curve(const uint32_t curr_step) {
> #if defined(ARM) || defined(thumb)
> // For ARM Cortex M3/M4 CPUs, we have the optimized assembler version, that takes 43 cycles to execute

but even here __ARM__ and __thumb__ are not defined but __AVR__ is. Then "optimized cortex code" is never used

N.B. my environment is set for SAMD51

Ok...it's a intellisense issue

@GMagician
Copy link
Contributor

GMagician commented Aug 18, 2020

I checked cortex M0 documentation and, if correctly read, BNE is always 1 or 2 cycles (depending on taken or not).

So I think

#if __CORTEX_M >= 3

    // Cortex-M3 through M7 can use the cycle counter of the DWT unit
    // https://www.anthonyvh.com/2017/05/18/cortex_m-cycle_counter/

and original code in #else may work

OT @rhapsodyv you opened the Pandora's box ;-)

@AnHardt
Copy link
Member

AnHardt commented Aug 18, 2020

#15762, #15763
Maybe using a scalpel instead of a sword.
Maybe "!LPC*", or finding the right includes.

@rhapsodyv
Copy link
Sponsor Member Author

rhapsodyv commented Aug 18, 2020

#if __CORTEX_M >= 3

    // Cortex-M3 through M7 can use the cycle counter of the DWT unit
    // https://www.anthonyvh.com/2017/05/18/cortex_m-cycle_counter/

and original code in #else may work

It could work. The user tested and it didn't work, but, looking at the code, I think I know why. To use the DWT code, the HAL needs call enableCycleCounter() first. And right now, only HAL/STM32 does it. But even this HAL uses it only with M7....

HAL/STM32/HAL.cpp

// Needed for DELAY_NS() / DELAY_US() on CORTEX-M7
#if (defined(__arm__) || defined(__thumb__)) && __CORTEX_M == 7
  // HAL pre-initialization task
  // Force the preinit function to run between the premain() and main() function
  // of the STM32 arduino core
  __attribute__((constructor (102)))
  void HAL_preinit() {
    enableCycleCounter();
  }
#endif

I will try it, but I doubt that this pre init don't be incompatible or broke some boot loader, like chitu one...

Seems the safer way to use any delay relaying in cycle count, is by adjusting its divisor/multiplier by measuring it at runtime.....

@GMagician
Copy link
Contributor

To use the DWT code, the HAL needs call enableCycleCounter()

@rhapsodyv I just terminated a 2hr print with the new timer code, as stated before, AGCM4 doesn't need above function, but it doesn't hurt.

I also think that, as stated by @AnHardt and by links he posted, if LPC doesnt support it, it must be excluded by this.

@rhapsodyv
Copy link
Sponsor Member Author

@GMagician
We are measuring the timer this way:

  start = micros();
  DELAY_US(1000);
  // delay_us(1000);
  end = micros();
  SERIAL_ECHOLNPAIR("1 Start: ", start, ", End: ", end, ", Diff: ", end - start);

  start = micros();
  DELAY_US(2000);
  // delay_us(1000);
   end = micros();
  SERIAL_ECHOLNPAIR("2 Start: ", start, ", End: ", end, ", Diff: ", end - start);

Any print should work, as it affect only a few parts of the code. The user found the bug working with leds...

@rhapsodyv
Copy link
Sponsor Member Author

Seems __CORTEX_M isn't defined in STM32F1 ...... . . . . . .. .. . . . .. . . . . Seems a STM32 define only... I will check for the lib maple equivalent...

@rhapsodyv
Copy link
Sponsor Member Author

rhapsodyv commented Aug 18, 2020

Seems lib maple don't support DWT, or it use other naming.....

[framework-arduinoststm32-maple] grep DWT * -ri                                                                                                      10:16:29
STM32F1/system/libmaple/include/libmaple/scb.h:#define SCB_DFSR_DWTTRAP                (1U << 2)
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cm4.h:#define SCB_DFSR_DWTTRAP_Pos                2                                             /*!< SCB DFSR: DWTTRAP Position */
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cm4.h:#define SCB_DFSR_DWTTRAP_Msk               (1UL << SCB_DFSR_DWTTRAP_Pos)                  /*!< SCB DFSR: DWTTRAP Mask */

back to the cycle count fix? 🤔

@rhapsodyv rhapsodyv changed the title Fix 19028 - Fix DELAY_US cycle usage by using a register Fix 19028 - Fix DELAY_US/NS cycle cost Aug 18, 2020
@GMagician
Copy link
Contributor

Seems __CORTEX_M isn't defined in STM32F1...Seems a STM32 define only.

Also SAMD51 define it. I was confused by intellisense saying it wasn't. Once compiled I saw exactly what was and what wasn't defined

@rhapsodyv I checked "test" code on my AGCM4, here are results:

1 Start: 30744108, End: 30745109, Diff: 1001

2 Start: 30748367, End: 30750368, Diff: 2001

back to the cycle count fix? 🤔

not necessary, you may define yourself the missing registers (if SoC really support them)

@rhapsodyv
Copy link
Sponsor Member Author

back to the cycle count fix? 🤔

not necessary, you may define yourself the missing registers (if SoC really support them)

But yet I need a way to find the current __CORTEX_M or implement the correct M detection too.... because maple dont define it...

[framework-arduinoststm32-maple] grep __CORTEX_M * -r                                                                                                11:05:12
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cm4.h:#define __CORTEX_M                (0x04)                                                       /*!< Cortex core                    */
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cm4.h:#if (__CORTEX_M != 0x04)
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cm4.h:#if (__CORTEX_M != 0x04)
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cmInstr.h:#if       (__CORTEX_M >= 0x03)
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cmInstr.h:#endif /* (__CORTEX_M >= 0x03) */
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cmInstr.h:#if       (__CORTEX_M >= 0x03)
STM32F4/cores/maple/libmaple/usbF4/VCP/core_cmInstr.h:#endif /* (__CORTEX_M >= 0x03) */

@GMagician
Copy link
Contributor

GMagician commented Aug 18, 2020

STM32F4/cores/maple/libmaple/usbF4/VCP/core_cm4.h:#define __CORTEX_M (0x04)

What cortex version is STM32F4? M3?

Edit: I checked it's M4

@rhapsodyv
Copy link
Sponsor Member Author

@GMagician
Copy link
Contributor

I think I'm missing something...above define initialize __CORTEX_M to 4 maybe such include is not used/included?

@rhapsodyv
Copy link
Sponsor Member Author

I think I'm missing something...above define initialize __CORTEX_M to 4 maybe such include is not used/included?

That code is some usbf4 specific, not a M detector, as stm core have....

stm32duino detect the core and include the right definition. Seems lib maple dont have it.
stm32duino:

system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f777xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f765xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f767xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f723xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f722xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f732xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f733xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f730xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f756xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f769xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f779xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f745xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f750xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32F7xx/Include/stm32f746xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h753xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h747xx.h:#include "core_cm7.h"                 /*!< Cortex-M7 processor and core peripherals          */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h745xx.h:#include "core_cm7.h"                 /*!< Cortex-M7 processor and core peripherals          */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h750xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h755xx.h:#include "core_cm7.h"                 /*!< Cortex-M7 processor and core peripherals          */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h743xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h757xx.h:#include "core_cm7.h"                 /*!< Cortex-M7 processor and core peripherals          */
system/Drivers/CMSIS/Device/ST/STM32H7xx/Include/stm32h742xx.h:#include "core_cm7.h"                     /*!< Cortex-M7 processor and core peripherals      */

And a lot more, for a lot of stm32 versions....

@rhapsodyv
Copy link
Sponsor Member Author

rhapsodyv commented Aug 18, 2020

Using lib maple, we don't know which M version is....

Can we assume that it would be always M3 or M4?!

@GMagician
Copy link
Contributor

I don't know the STM32 boards so I can't answer, but if in doubt it's always possible to add a -D on the compile args, specific to STM SoCs

@GMagician
Copy link
Contributor

@rhapsodyv but isn't maple to be discontinued? I saw some PR to remove it

@thinkyhead
Copy link
Member

How close is this PR to a useful form with respect to the availability of DWT (and the NOP-counting fall-back)? I can see that what has been added is sensible and already lends an improvement. So, how much of this is worth merging now, and then what are the next steps?

@sjasonsmith
Copy link
Contributor

I am swamped at work and won't be able to spend time on anything right now. If others are familiar with the architectures and can make sound recommendations, that is fine.

@X-Ryl669
Copy link
Contributor

X-Ryl669 commented Jan 8, 2021

@rhapsodyv I checked the code about the usage of DELAY_NS macro. I've found that it's being used on platform like ESP32 and Linux that both have an pre-emptive kernel (FreeRTOS for the former and well, linux for the latter) so I wonder how the delay is ensured in that case, since the CPU can be interrupted by the scheduler anytime.
Do you expect that delay(X) means waiting at least X ns (but it can be 1ms or more due to scheduling), or it is supposed to pause the CPU for the as-close as possible to X (and in that case, what is the margin allowed)?
If it's the former, then a NOP loop is likely to work most of the time.
If it's the latter, reading the CPU's cycle counter via DWT would be better since it would detect scheduling.

@sjasonsmith This PR is good for the NOP waiting / M0 case but some code must be written for using DWT registers. It can be a macro, something like this (pseudo code):

#define DELAY_CYCLE(X) __asm__ __volatile__ { \
   A(".syntax unified") \
   A("ldr    r0,=E0001004") \ // Read DWT counter
   A("add  r0, r0, %0")       // Compute end time
   L("1") \
   A("ldr     r1,=E0001004") \ // Read DWT counter again in a loop
   A("cmp   r1, r0") \         // Check if we've reached the requested end time
   A("blt 1b") \               // No ? Let's jump back again
    : // no output
    : "r" (X) // input is only read
    : "r0", "r1" // used clobbers
   }

Obviously, such a loop can only work if there is a DWT at address E00010004 (else it'll read 0 and never exit)

@GMagician
Copy link
Contributor

GMagician commented Jan 8, 2021

@X-Ryl669 AFAIK time must be the one requested, here is why all this PR came to life.
In a multitask system you will be never sure about this time because scheduler can interrupt such code (unless some kind of critical semaphore is set).

@X-Ryl669
Copy link
Contributor

X-Ryl669 commented Jan 8, 2021

Yes, you are right. The issue with "simply" counting NOP cycles is that you can have both the delay in case of scheduling:

  1. Code calls DELAY_NS(600) => This gives 600 cycles to wait
  2. Start of loop, counter = 600
  3. <Scheduling happens here, you loose 1ms (for example)>
  4. Counter continue counting down to 0, so in the end you have waited "scheduler delay + 600 cycles"

With a cycle counter (like DWT), this would have given:

  1. Code calls DELAY_NS(600) => This gives 600 cycles to wait
  2. Start of loop, end_counter = current_cycle_counter + 600
  3. <Scheduling happens here, you loose 1ms (for example)>
  4. current_cycle_counter is now way above end_counter, you get out of the loop immediately. In the end you have waited "scheduler delay" only.

@ejtagle
Copy link
Contributor

ejtagle commented Jan 8, 2021

Completely agree with @GMagician and @X-Ryl669

Let's split cases:

With preemptive multitaskers (linux/ESP32/...) in user mode code, there is no way to ensure timing. Reason is the pointed one by @X-Ryl669 . So, those functions get "at least" the specified amount of delay, but could potentially get more due to task scheduling.

Without preemptive multitasking, and interrupts enabled, under ARM, even if executing the delay functions under an interrupt handler, you get exactly the exact same situation as before, as interrupt handlers can be preempted by higher priority interrupt handlers. So, the delay functions should be considered as "delay at least" the specified amount of time.

If you need exact timing, then you need some sort of interrupt disabling while performing the critical timing section (but this worsens the interrupt time response of other devices), and so i consider that using this to get more than 10uS of delay to be a no-go

The other approach is a hardware based solution (for example, ESP32 allows any pin to perform a timed pulsing), and some other processors also offer that, but for selected pins/ports, not all of them.

The main question here is if makes sense. LCD timing is not critical in the sense, that occasionally having a larger delay does not cause problems at all.

Babystepping, on the contrary, could be a problem. But babystepping could be handled by "injecting" deltas into the bresenhan algorithm, or into the planner, instead of using a delay macro (i think that would be the best approach here)

@X-Ryl669
Copy link
Contributor

X-Ryl669 commented Jan 8, 2021

I came to this PR because of the faulty NeoPixel behavior on STM32F1. The NeoPixel code is also a good example of "grunt" code, since it's burning CPU to match the timing requirement of WS2812 RGB leds. So, in the end, there are multiple places where delay_ns is used and this includes:

  • Some SPI screen (I've only found ST7920 code using this, but maybe I'm missing something)
  • Software SPI bitbanging (which is probably the only place with the stepper code where precise timing is required)
  • Stepper motor pulsing code (important)
  • In MAX6675 temperature sensor reading code (and the only usage is for waiting once on initialisation for 0.1µs) and MAX7219 (with a delay of 0.250µs)
  • NeoPixels (WS2812 strips are quite tolerant to their timing but not 150% off, and I think a solution like FastLED is a better idea anyway since it's not using CPU to count cycles but DMA engine instead)

For SPI screen, having longer timing is usually accepted by the device, provided the clock is in sync with the MOSI/MISO line.
For software bitbanging, unless if it's being used as a slave SPI, this should not be a big issue if the timing were a bit longer than expected for the same reason as above.
For stepper motor, I can't say. It's probably a place where interrupt are disabled anyway. The default delay is 0.65µs (so it's the limit that's reachable with a timer on STM32F1 @ 72MHz once all overheads are accounted for, but likely unreachable for smaller CPU)
For MAX6675, I've not checked, maybe waiting more than expected at reset time does not hurt -to be tested-
For NeoPixels it's an issue but IMHO NeoPixels library is not a the best solution for the problem.

So, in the end, I think precise timing is truely required for steppers motors so indexing the wait to a cycle counter or a cycle count is unavoidable. I would prefer DWT for the reason above since it'll be more correct in case of being interrupted than a dumb NOP counter.

@GMagician
Copy link
Contributor

GMagician commented Jan 8, 2021

The only "vital" part where timing is required (aim of this PR) is stepper handling, all other parts aren't so important (SPI, LCD and so on).
Please note that neopixel are a really beast in my optinion since timing is strict and, I know samd51 code but I think other are the same, precise timing is done by disabling all other interrupts (really bad idea since also stepper timing may be damaged by that)

@GMagician
Copy link
Contributor

GMagician commented Jan 8, 2021

@rhapsodyv

Two different functions, one for DELAY_US based on SysTick

It's a long time when I analyzed this but I saw this old post: #19059 (comment), don't know how it's programmed in other SoC and don't know how precise may be when programmed for 1ms like samd does

@ejtagle
Copy link
Contributor

ejtagle commented Jan 9, 2021

@GMagician: I agree 100% with you here. Let me explain the "reason" on why timing precision is so important here on the stepper ISR: Most motor drivers perform step interpolation, that means, they internally use 256 levels of current to create the sinusoidal waveforms required to smoothly move the stepper motors.
But none of those drivers require 256 pulses per revolution. Why? Because they internally interpolate time. They measure time between stepping pulses, split that time into the actual number of internally used steps and try to interpolate timewise the steps between the MPU step pulses.
Now imagine what happens when timing between pulses is not uniform: Yes, you guessed it: Sudden jumps in speed and acceleration while the motor driver tries to follow the MPU step pulses while internally interpolating timewise the required steps...

1 similar comment
@ejtagle
Copy link
Contributor

ejtagle commented Jan 9, 2021

@GMagician: I agree 100% with you here. Let me explain the "reason" on why timing precision is so important here on the stepper ISR: Most motor drivers perform step interpolation, that means, they internally use 256 levels of current to create the sinusoidal waveforms required to smoothly move the stepper motors.
But none of those drivers require 256 pulses per revolution. Why? Because they internally interpolate time. They measure time between stepping pulses, split that time into the actual number of internally used steps and try to interpolate timewise the steps between the MPU step pulses.
Now imagine what happens when timing between pulses is not uniform: Yes, you guessed it: Sudden jumps in speed and acceleration while the motor driver tries to follow the MPU step pulses while internally interpolating timewise the required steps...

@GMagician
Copy link
Contributor

GMagician commented Jan 9, 2021

@ejtagle what is not clear to me is: if timer is SysTick is programmed 1ms timeout like in SAMD51 is done. On framework this is what is done to wait uS in framework:

#ifdef __SAMD51__
/*
 * On SAMD51, use the (32bit) cycle count maintained by the DWT unit,
 * and count exact number of cycles elapsed, rather than guessing how
 * many cycles a loop takes, which is dangerous in the presence of
 * cache.  The overhead of the call and internal code is "about" 20
 * cycles.  (at 120MHz, that's about 1/6 us)
 */
void delayMicroseconds(unsigned int us)
{
  uint32_t start, elapsed;
  uint32_t count;

  if (us == 0)
    return;

  count = us * (VARIANT_MCK / 1000000) - 20;  // convert us to cycles.
  start = DWT->CYCCNT;  //CYCCNT is 32bits, takes 37s or so to wrap.
  while (1) {
    elapsed = DWT->CYCCNT - start;
    if (elapsed >= count)
      return;
  }
}

EDIT: on samd51 systick LOAD value is 999

@X-Ryl669
Copy link
Contributor

X-Ryl669 commented Jan 9, 2021

That's exactly DWT wait as described above. Just a remark, the code could be more precise if, instead of computing the difference in the loop, the end position in time is computed once outside of the loop and the loop consist of only: while(DWT->CYCCNT < end) {}.
There is no point checking for a null delay in that case, leading to a lot less overhead.

This kind of code isn't safe in case of roll over unless end is declared signed (int32_t end) so that the comparison is done with checking the sign bit.

@X-Ryl669
Copy link
Contributor

X-Ryl669 commented Jan 9, 2021

I would be interested to know how precise we could get with a C function compared to a ASM version. If it's 1/6 µs or less (so < 116ns), there is no point in writing a assembler version since in the code base, there isn't any call to DELAY_NS shorter than ~100ns.

@GMagician
Copy link
Contributor

GMagician commented Jan 9, 2021

@X-Ryl669 the samd51 code is save even for overlap since it works with 32bits unsigned and works with subtraction:
elapsed = DWT->CYCCNT - start;

EDIT: AFAIK signed 32 logic is not safe you need to consider overflow (0xFFFFFFFF -> 0 and 0x7FFFFFFF -> 0x80000000)

@X-Ryl669
Copy link
Contributor

X-Ryl669 commented Jan 9, 2021

Yes. I was talking about my version with the loop that's only comparing the counter value with an end value. The loop body is empty, there's only 3 CPU instructions in the loop (load, compare, branch) so it's even less overhead than your version (load, sub, compare, branch) that's at least 4 CPU instructions.

Unlike your version, my version is only safe if compared as signed value. That's because 0xFFFFFFFF < 0x1 is true only if using signed 32 bits value (that's what happen when the counter roll over)

@X-Ryl669
Copy link
Contributor

X-Ryl669 commented Jan 9, 2021

Care must be taken when using signed logic and overflow, but I can guarantee the logic is safe here. There are only 3 cases here:

  1. start is close to overflow and end is computed below overflow, so you'll end up with end like "0xFFFFFFFE" = -2, and you'll either compare to value below this number (like 0xFFFFFFFC = -4) or above if interrupted (0x00000001). In both case, the ordering is respected since -4 < -2 and 1 > -2
  2. start is below overflow so is end (obvious)
  3. start is below overflow and end is above overflow (like 0x32). You'll have 2 cases: counter is below overflow (thus negative) and it's below end, or positive and it's obvious.

In the end, using signed logic, in that case is perfectly safe (unsigned logic is not).

EDIT: It turns out that I'm wrong. There's an issue with INT_MAX (so around 0x7FFFFFFF). So my code is wrong, only yours is correct see test code. You can safely remove the test for null delay and you'll lower the overhead.

@GMagician
Copy link
Contributor

GMagician commented Jan 9, 2021

@X-Ryl669 that's not my code, it's the one in framework, turns out that in samd51 using such function is "naturally" a better choice

@rhapsodyv
Copy link
Sponsor Member Author

So, the resume is: we should add a generic DWT delay as default, then fallback to systick, then fallback to NOP...

Yet, I'm concerned about some tests that @sjasonsmith did. In his tests, systick delay didn't work well for very small delays. But it could be related to inline, for example.
After the changes, we need to test all again, to see if we need keep a NOP delay for very small delays, or if DWT/systick are ok.

@GMagician
Copy link
Contributor

@rhapsodyv considering what reported by samd framework 20 DWT tick are 1/6us and timer wraps in 37 seconds, then such code may span from a little time to up to 15 sec, more or less.
SysTick, if really is loaded with a value of 999 (then counts from 999 to 0 and reload such value), has little time resolution then @sjasonsmith result don't surprise me

@ejtagle
Copy link
Contributor

ejtagle commented Jan 10, 2021

/*
 * Use the (32bit) cycle count maintained by the DWT unit,
 * and count exact number of cycles elapsed, rather than guessing how
 * many cycles a loop takes, which is dangerous in the presence of
 * cache.  The overhead of the call and internal code is "about" 20
 * cycles.  (at 120MHz, that's about 1/6 us)
 */
static FORCE_INLINE void delayMicroseconds(unsigned int us)
{
  uint32_t start, elapsed;
  uint32_t count;

  if (us == 0)
    return;

  count = us * (VARIANT_MCK / 1000000) - 20;  // convert us to cycles.
  start = DWT->CYCCNT;  //CYCCNT is 32bits, takes 37s or so to wrap.
  do {
    elapsed = DWT->CYCCNT - start;
  } while (elapsed < count);
}

This function should compile to 3 instructions (on the while). count will be precomputed for constant us values, just leaving a nearly neglible overhead (could be removed by substracting a cycle count from count

I would not be bothering with the SysTick version, and i would just keep the NOP version and the DWT version.
If you want a Systick version, the code is mostly the same, but you must also change the

elapsed = DWT->CYCCNT - start;

by something along the lines:

elapsed = (DWT->CYCCNT - start) % SysTickPeriod;

Not worth the trouble, IMHO

@GMagician
Copy link
Contributor

One option may be switch this inline function to a define and let each HAL to override it. If not overriden it may default to DWT or nop

@X-Ryl669
Copy link
Contributor

You should remove the if (us == 0) return; part, it's completely useless (it only adds overhead, there is no place in the code with a delay set to 0 and even if there was one, the loop would only run once and exit anyway).

@ejtagle
Copy link
Contributor

ejtagle commented Jan 11, 2021

You should remove the if (us == 0) return; part, it's completely useless (it only adds overhead, there is no place in the code with a delay set to 0 and even if there was one, the loop would only run once and exit anyway).

Agree 100%

@sjasonsmith
Copy link
Contributor

I suspect that if any of you are interested in taking this PR over @rhapsodyv might be ok with that.
It is clear that some of you know your ARM internals better than I do, and apparently have more availability to spend time on it right now.

@X-Ryl669
Copy link
Contributor

Here's my try at fixing this

@rhapsodyv
Copy link
Sponsor Member Author

Thanks @X-Ryl669 . I will close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: STM32 Needs: Testing Testing is needed for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants