Fix 19028 - Fix DELAY_US/NS cycle cost #19059
Interesting thing...but I read into code:
and, reading the web page (https://www.anthonyvh.com/2017/05/18/cortex_m-cycle_counter/), an STM32 is mentioned. So why doesn't STM32 use the debug timer? It seems that the code may be enabled only for an M7 processor (and only if arm and thumb) |
I had some compilation issues building for my GTR board, but got it to work. Right now the code is Maple-specific. In You might be interested in this When I booted with your change on a GTR (STM32F407), I see this:
This still had an extra
This makes it pretty clear that we get different behavior from different cores, so it is good we are addressing this. |
This is a question @rhapsodyv and I have asked as we explored this. Those are old comments which clearly do not actually match the code. The code I referenced from the STM32 framework (non-Maple) always uses DWT if it is available. That should be a more reliable way to solve this problem, but isn't available on all hardware. Any solution that can rely on CPU counters is going to be more reliable across hardware than instruction counting. |
I tried to force the code to compile for my SAMD51, and it seems to compile. SAMD51 is a Cortex-M4 |
If it is present in Cortex-M3 to M7, then it's present in many of the 32-bit boards Marlin supports. |
@GMagician some of the Malyan boards are STM32F0 (Cortex-M0). |
I confirm, SAMD51 supports the debug timer (adafruit framework enables it automatically, no
|
Ok... it's an IntelliSense issue |
I checked cortex M0 documentation and, if correctly read, BNE is always 1 or 2 cycles (depending on taken or not). So I think #if __CORTEX_M >= 3
// Cortex-M3 through M7 can use the cycle counter of the DWT unit
// https://www.anthonyvh.com/2017/05/18/cortex_m-cycle_counter/ and original code in OT @rhapsodyv you opened the Pandora's box ;-) |
It could work. The user tested and it didn't work, but, looking at the code, I think I know why. To use the DWT code, the HAL needs call
I will try it, but I worry that this pre-init could be incompatible with, or break, some bootloaders, like the Chitu one... It seems the safer way to use any delay that relies on a cycle count is to adjust its divisor/multiplier by measuring it at runtime... |
@rhapsodyv I just finished a 2 hr print with the new timer code. As stated before, AGCM4 doesn't need the above function, but it doesn't hurt. I also think that, as stated by @AnHardt and by the links he posted, if LPC doesn't support it, it must be excluded from this. |
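The runtime-measurement idea could look roughly like the sketch below. It is only an illustration: the helper names are hypothetical, the elapsed time is assumed to come from some independent timer, and the CPU clock is assumed to be a whole number of MHz.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical calibration helpers, not part of Marlin.
// iterations:  how many busy-loop iterations were executed
// elapsed_us:  how long they took, read from any independent timer
// f_cpu_hz:    CPU clock, assumed to be a whole number of MHz
inline uint32_t measured_cycles_per_iteration(uint32_t iterations,
                                              uint32_t elapsed_us,
                                              uint32_t f_cpu_hz) {
  assert(iterations > 0);
  const uint64_t cycles = static_cast<uint64_t>(elapsed_us) * (f_cpu_hz / 1000000u);
  // Round to the nearest whole cycle count per iteration.
  return static_cast<uint32_t>((cycles + iterations / 2) / iterations);
}

// Convert a requested cycle count into loop iterations using the measured cost.
inline uint32_t iterations_for_cycles(uint32_t cycles, uint32_t cycles_per_iteration) {
  assert(cycles_per_iteration > 0);
  return cycles / cycles_per_iteration;
}
```

For instance, 72000 iterations that take 6000 µs at 72 MHz work out to 6 cycles per iteration, which matches the value reported for STM32F103 in the PR description.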
@GMagician
Any print should work, as it affects only a few parts of the code. The user found the bug while working with LEDs... |
Seems |
It seems libmaple doesn't support DWT, or it uses different naming...
Back to the cycle count fix? 🤔 |
SAMD51 defines it too. I was confused by IntelliSense saying it didn't. Once compiled, I saw exactly what was and what wasn't defined. @rhapsodyv I checked the "test" code on my AGCM4; here are the results:
Not necessary; you can define the missing registers yourself (if the SoC really supports them) |
But I still need a way to find the current __CORTEX_M, or to implement correct M detection too... because maple doesn't define it...
|
What cortex version is STM32F4? M3? Edit: I checked it's M4 |
I think I'm missing something... the above define initializes __CORTEX_M to 4. Maybe that include is not used/included? |
That code is usbf4-specific, not an M detector like the STM core has... stm32duino detects the core and includes the right definition. It seems libmaple doesn't have it.
And a lot more, for a lot of STM32 versions... |
Using libmaple, we don't know which M version it is... Can we assume that it will always be M3 or M4?! |
I don't know the STM32 boards so I can't answer, but if in doubt it's always possible to add a -D on the compile args, specific to STM SoCs |
@rhapsodyv but isn't maple going to be discontinued? I saw some PRs to remove it |
How close is this PR to a useful form with respect to the availability of DWT (and the NOP-counting fall-back)? I can see that what has been added is sensible and already lends an improvement. So, how much of this is worth merging now, and then what are the next steps? |
I am swamped at work and won't be able to spend time on anything right now. If others are familiar with the architectures and can make sound recommendations, that is fine. |
@rhapsodyv I checked how the DELAY_NS macro is used. I found that it's used on platforms like ESP32 and Linux, which both have a preemptive kernel (FreeRTOS for the former and, well, Linux for the latter), so I wonder how the delay is ensured in that case, since the CPU can be interrupted by the scheduler at any time. @sjasonsmith This PR is good for the NOP waiting / M0 case, but some code must be written to use the DWT registers. It can be a macro, something like this (pseudo code):
Obviously, such a loop can only work if there is a DWT at address |
@X-Ryl669 AFAIK the delay must be the one requested; that is why this whole PR came to life. |
Yes, you are right. The issue with "simply" counting NOP cycles is that, when the scheduler intervenes, you get both delays added together:
With a cycle counter (like DWT), this would have given:
|
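A small host-side model may make the two cases above concrete. It is only a sketch under stated assumptions: the delay is interrupted exactly once, the interruption steals a fixed number of cycles, and all names are hypothetical rather than Marlin code.

```cpp
#include <cstdint>

// Hypothetical host-side model, not Marlin code.
// A delay of `requested` cycles is interrupted once, and the interruption
// steals `stolen` cycles of wall-clock time.

// Instruction counting: the loop only makes progress while it is running,
// so the stolen cycles are added on top of the requested delay.
inline uint32_t wall_cycles_instruction_count(uint32_t requested, uint32_t stolen) {
  return requested + stolen;
}

// Free-running cycle counter (DWT-style): the counter keeps ticking during
// the interruption, so the wait ends as soon as the loop runs again after
// the deadline. Assumption: the interruption starts `preempt_at` cycles into
// the delay, with preempt_at <= requested.
inline uint32_t wall_cycles_cycle_counter(uint32_t requested,
                                          uint32_t preempt_at,
                                          uint32_t stolen) {
  const uint32_t resume = preempt_at + stolen;  // wall time when the loop resumes
  return resume > requested ? resume : requested;
}
```

With a 1000-cycle request and a 400-cycle interruption, the counting loop takes 1400 cycles, while the counter-based wait takes 1000 cycles if the interruption ends before the deadline and only overshoots when the interruption itself runs past it.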
Completely agree with @GMagician and @X-Ryl669. Let's split the cases:
With preemptive multitaskers (Linux/ESP32/...) in user-mode code, there is no way to ensure timing. The reason is the one pointed out by @X-Ryl669. So, those functions give "at least" the specified amount of delay, but could potentially give more due to task scheduling.
Without preemptive multitasking, and with interrupts enabled, under ARM, even if the delay functions run inside an interrupt handler, you get exactly the same situation as before, as interrupt handlers can be preempted by higher-priority interrupt handlers.
So, the delay functions should be considered as "delay at least" the specified amount of time. If you need exact timing, then you need some sort of interrupt disabling while performing the critical timing section (but this worsens the interrupt response time of other devices), so I consider using this to get more than 10 µs of delay to be a no-go.
The other approach is a hardware-based solution (for example, ESP32 allows any pin to perform timed pulsing). Some other processors also offer that, but only for selected pins/ports, not all of them.
The main question here is whether it makes sense. LCD timing is not critical, in the sense that occasionally having a larger delay does not cause problems at all. Babystepping, on the contrary, could be a problem. But babystepping could be handled by "injecting" deltas into the Bresenham algorithm, or into the planner, instead of using a delay macro (I think that would be the best approach here) |
I came to this PR because of the faulty NeoPixel behavior on STM32F1. The NeoPixel code is also a good example of "grunt" code, since it's burning CPU to match the timing requirement of WS2812 RGB leds. So, in the end, there are multiple places where
For SPI screens, longer timing is usually accepted by the device, provided the clock is in sync with the MOSI/MISO line. So, in the end, I think precise timing is truly required for stepper motors, so tying the wait to a cycle counter or a cycle count is unavoidable. I would prefer DWT for the reason above, since it will be more correct when interrupted than a dumb NOP counter. |
The only "vital" part where timing is required (aim of this PR) is stepper handling, all other parts aren't so important (SPI, LCD and so on). |
It's been a long time since I analyzed this, but I saw this old post: #19059 (comment). I don't know how it's programmed in other SoCs, nor how precise it can be when programmed for 1 ms like SAMD does |
@GMagician: I agree 100% with you here. Let me explain the "reason" on why timing precision is so important here on the stepper ISR: Most motor drivers perform step interpolation, that means, they internally use 256 levels of current to create the sinusoidal waveforms required to smoothly move the stepper motors. |
@ejtagle what is not clear to me is how this works if SysTick is programmed with a 1 ms timeout, as is done on SAMD51. This is what the framework does to wait microseconds:
#ifdef __SAMD51__
/*
 * On SAMD51, use the (32bit) cycle count maintained by the DWT unit,
 * and count exact number of cycles elapsed, rather than guessing how
 * many cycles a loop takes, which is dangerous in the presence of
 * cache. The overhead of the call and internal code is "about" 20
 * cycles. (at 120MHz, that's about 1/6 us)
 */
void delayMicroseconds(unsigned int us)
{
  uint32_t start, elapsed;
  uint32_t count;
  if (us == 0)
    return;
  count = us * (VARIANT_MCK / 1000000) - 20; // convert us to cycles.
  start = DWT->CYCCNT; //CYCCNT is 32bits, takes 37s or so to wrap.
  while (1) {
    elapsed = DWT->CYCCNT - start;
    if (elapsed >= count)
      return;
  }
}
EDIT: on SAMD51 the SysTick LOAD value is 999 |
That's exactly the DWT wait as described above. Just a remark: the code could be more precise if, instead of computing the difference in the loop, the end position in time were computed once outside of the loop and the loop consisted of only: This kind of code isn't safe in case of rollover unless |
I would be interested to know how precise we could get with a C function compared to an ASM version. If it's 1/6 µs or less (so < ~167 ns), there is no point in writing an assembler version since, in the code base, there isn't any call to DELAY_NS shorter than ~100 ns. |
@X-Ryl669 the SAMD51 code is safe even on rollover, since it works with 32-bit unsigned values and uses subtraction: EDIT: AFAIK signed 32-bit logic is not safe; you need to consider overflow (0xFFFFFFFF -> 0 and 0x7FFFFFFF -> 0x80000000) |
Yes. I was talking about my version, with the loop that only compares the counter value with an end value. The loop body is empty; there are only 3 CPU instructions in the loop (load, compare, branch), so it has even less overhead than your version (load, sub, compare, branch), which is at least 4 CPU instructions. Unlike your version, my version is only safe if compared as signed values. That's because 0xFFFFFFFF < 0x1 is true only when using signed 32-bit values (that's what happens when the counter rolls over) |
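To put numbers on this, here is a small sketch of the nanoseconds-to-cycles arithmetic. The helper names are hypothetical; the 120 MHz clock and the roughly 20-cycle overhead are taken from the SAMD51 framework comment quoted in this thread.

```cpp
#include <cstdint>

// Hypothetical helpers, not Marlin code.

// Convert nanoseconds to CPU cycles, rounding down. The multiplication is
// done in 64 bits so it cannot overflow for 32-bit inputs.
constexpr uint32_t ns_to_cycles(uint32_t ns, uint32_t f_cpu_hz) {
  return static_cast<uint32_t>((static_cast<uint64_t>(ns) * f_cpu_hz) / 1000000000ull);
}

// Subtract a fixed call overhead, clamping at zero instead of wrapping.
constexpr uint32_t cycles_after_overhead(uint32_t cycles, uint32_t overhead) {
  return cycles > overhead ? cycles - overhead : 0;
}

// At 120 MHz a 100 ns delay is only 12 cycles, fewer than the ~20 cycles
// of overhead quoted for the C function.
static_assert(ns_to_cycles(100, 120000000u) == 12, "100 ns at 120 MHz");
static_assert(cycles_after_overhead(12, 20) == 0, "overhead exceeds the delay");
static_assert(cycles_after_overhead(ns_to_cycles(1000, 120000000u), 20) == 100, "1 us leaves 100 cycles");
```

Under these assumptions, a counter-based C routine cannot deliver delays as short as 100 ns exactly, so the shortest requests would still need the instruction-counting path or would simply run long.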
Care must be taken when using signed logic and overflow,
In the end, using signed logic in that case is perfectly safe (unsigned logic is not). EDIT: It turns out that I'm wrong. There's an issue with INT_MAX (so around 0x7FFFFFFF). So my code is wrong; only yours is correct (see test code). You can safely remove the test for null delay and you'll lower the overhead. |
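The rollover behaviour discussed above can be checked on a host with plain 32-bit arithmetic. This is a minimal sketch with hypothetical function names. The third variant is a common idiom that does not come from this thread: it keeps a precomputed end value but tests the wrapped difference, and it assumes delays shorter than 2^31 cycles.

```cpp
#include <cstdint>

// Wrap-safe (the framework's approach): unsigned subtraction is modulo 2^32,
// so (now - start) is the true elapsed count as long as fewer than 2^32
// cycles have passed.
inline bool elapsed_reached(uint32_t now, uint32_t start, uint32_t count) {
  return static_cast<uint32_t>(now - start) >= count;
}

// Not wrap-safe: comparing against a precomputed end value as plain unsigned
// reports "done" too early when `end` wrapped past zero but `now` has not.
inline bool deadline_reached_naive(uint32_t now, uint32_t end) {
  return now >= end;
}

// Precomputed end value, wrap-safe for delays below 2^31 cycles: the wrapped
// difference (now - end) lands in the lower half of the range only once the
// deadline has passed.
inline bool deadline_reached_wrapped(uint32_t now, uint32_t end) {
  return static_cast<uint32_t>(now - end) < 0x80000000u;
}
```

For example, with start = 0xFFFFFF00 and count = 0x200 the end value wraps to 0x100. At now = 0xFFFFFF80 the naive comparison already reports completion, while the other two correctly keep waiting until now reaches 0x100.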
@X-Ryl669 that's not my code, it's the one in framework, turns out that in samd51 using such function is "naturally" a better choice |
So, the summary is: we should add a generic DWT delay as the default, then fall back to SysTick, then fall back to NOP... Yet, I'm concerned about some tests that @sjasonsmith did. In his tests, the SysTick delay didn't work well for very small delays. But it could be related to inlining, for example. |
@rhapsodyv considering what the SAMD framework reports (20 DWT ticks are 1/6 µs and the timer wraps in 37 seconds), such code may span from a very short time up to about 15 sec, more or less. |
This function should compile to 3 instructions (in the while loop). count will be precomputed for constant us values, leaving a nearly negligible overhead (which could be removed by subtracting a cycle count from count). I would not bother with the SysTick version; I would just keep the NOP version and the DWT version. A SysTick version would need to replace elapsed = DWT->CYCCNT - start; by something along the lines of: elapsed = (DWT->CYCCNT - start) % SysTickPeriod; Not worth the trouble, IMHO |
One option may be to switch this inline function to a define and let each HAL override it. If not overridden, it may default to DWT or NOP |
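The override pattern suggested here could be sketched as follows. All names are hypothetical and this is not the Marlin HAL interface; the fallback only counts calls so that the selection can be observed on a host.

```cpp
#include <cstdint>

// Hypothetical names throughout; not the Marlin HAL interface.
static uint32_t g_fallback_calls = 0;
static uint32_t g_last_requested_cycles = 0;

// Stand-in for the generic fallback (for example a NOP-counting loop).
inline void fallback_delay_cycles(uint32_t cycles) {
  ++g_fallback_calls;
  g_last_requested_cycles = cycles;
}

// A HAL header included before this point could define HAL_DELAY_CYCLES,
// for example mapping it to a DWT-based routine. If none did, the generic
// fallback is used.
#ifndef HAL_DELAY_CYCLES
  #define HAL_DELAY_CYCLES(c) fallback_delay_cycles(c)
#endif
```

Callers would always go through HAL_DELAY_CYCLES, so a platform with a working cycle counter can opt in without the shared code needing to know which core it is running on.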
You should remove the |
Agree 100% |
I suspect that if any of you are interested in taking this PR over @rhapsodyv might be ok with that. |
Here's my try at fixing this |
Thanks @X-Ryl669 . I will close this one. |
Description
DELAY_US/NS uses a busy loop in the function __delay_4cycles. The current code fixes the cost at 4 cycles per iteration. So, to delay 1000 cycles, it runs 1000/4 iterations of the busy loop.
But the busy loop inside __delay_4cycles can take from 3 up to 6 cycles per iteration, according to the Cortex-M3 docs:
https://stackoverflow.com/questions/28760617/pipeline-refill-cycles-for-instructions-in-arm
https://developer.arm.com/documentation/ddi0337/h/programmers-model/instruction-set-summary/cortex-m3-instructions
In our tests with STM32F103ZET6 and STM32F103VET6, the loop takes 6 cycles per iteration. Moreover, libmaple also treats this loop as taking 6 cycles in its delay_us function, in ~/.platformio/packages/framework-arduinoststm32-maple/STM32F1/system/libmaple/include/libmaple/delay.h
So 6 seems to be the best value for this busy loop. But there could be some cores where it is 4, as the original code assumed...
So, I renamed the function to __delay_Ncycles and created an overridable define, DELAY_CYCLES_ITERATION_COST, that is 4 by default but can be overridden in the env with build_flags = -DDELAY_CYCLES_ITERATION_COST=6
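The effect of a wrong iteration cost can be worked through with a few lines of arithmetic. This is a sketch only: the helper functions are hypothetical, and just the macro name and its default of 4 come from the description above.

```cpp
#include <cstdint>

#ifndef DELAY_CYCLES_ITERATION_COST
  #define DELAY_CYCLES_ITERATION_COST 4  // default per the description; override with -D
#endif

// Hypothetical helper: busy-loop iterations used for `cycles`, given the
// per-iteration cost the code assumes.
constexpr uint32_t iterations_for(uint32_t cycles,
                                  uint32_t assumed_cost = DELAY_CYCLES_ITERATION_COST) {
  return cycles / assumed_cost;
}

// Hypothetical helper: cycles actually spent if each iteration really
// costs `real_cost` cycles.
constexpr uint32_t actual_cycles(uint32_t cycles, uint32_t assumed_cost, uint32_t real_cost) {
  return iterations_for(cycles, assumed_cost) * real_cost;
}

// Assuming 4 cycles on a core that needs 6 makes every delay 50% too long.
static_assert(actual_cycles(1200, 4, 6) == 1800, "4 assumed, 6 real");
// Matching the real cost gives back the requested delay.
static_assert(actual_cycles(1200, 6, 6) == 1200, "6 assumed, 6 real");
```

The 1800 versus 1200 result is the same 50% overshoot reported under Benefits.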
We can have a function that outputs if the value is correct or not:
Searching the code, basically 2 things use those functions and may be affected: LCD delays and _PULSE_WAIT in BABYSTEPPING when EXTRA_CYCLES_BABYSTEP > 20.
With the help and advice of @sjasonsmith, I created a new busy loop based on SysTick, which is far more precise. If the HAL defines the SysTick methods, it will be used. Otherwise, it falls back to the original delay_cycles.
Tests with the new systick based busy loop:
Benefits
Our DELAY_US and DELAY_NS were taking 50% more time than requested. This fixes the problem.
Related Issues
#19028