Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CONFIG_SKIP_BOOT_MSG makes newlib stdio races more likely #20067

Closed
mchesser opened this issue Nov 9, 2023 · 2 comments · Fixed by #20392
Closed

CONFIG_SKIP_BOOT_MSG makes newlib stdio races more likely #20067

mchesser opened this issue Nov 9, 2023 · 2 comments · Fixed by #20392

Comments

@mchesser
Copy link

mchesser commented Nov 9, 2023

After spending some time analyzing this issue, I believe that the actual cause is already known -- i.e., #4488 / #8619-comment-569952641 (and is actually documented in the release notes).

Feel free to close as a duplicate, but I thought I would leave the analysis below regarding the impact of CONFIG_SKIP_BOOT_MSG.


Description

The reentrant stdio functions in newlib initialize some shared data (via the function __sinit) the first time an IO function, e.g., puts, is used. Normally, before main, RIOT prints a message using puts (i.e., "This is RIOT" + version), which ends up calling __sinit before user code is executed, avoiding most of the issues with initialization.

However if CONFIG_SKIP_BOOT_MSG is set, the __sinit will not be called until the first print statement. If two threads attempt to print at a roughly similar time, then it is possible1 for the second thread to execute with a partially initialized reent object which causes various crashes depending on how much of the structure has been initialized.

1. Platforms that do not perform any locking as part of _lock_acquire (see: #8619-comment-569952641).

As an example, if we have code configured like this:

USEMODULE += ztimer_usec
USEMODULE += sched_round_robin
CFLAGS += -DCONFIG_SKIP_BOOT_MSG
BOARD=nucleo-f446re

That spawns multiple threads that print -- e.g.:

main.c
#include <stdio.h>
#include <stdint.h>

#include "thread.h"
#include "ztimer.h"
#include "timex.h"

uint32_t WORKER1_SPIN_TIME_US = (10 * US_PER_MS) - 30; // Approximate RR scheduling time - 30 usecs.
volatile uint32_t WORKER1_LOOP_SPIN = 5;               // Loop iterations for sub-microsecond adjustment.

void* worker1(void* arg)
{
    (void)arg;

    ztimer_spin(ZTIMER_USEC, WORKER1_SPIN_TIME_US);
    while (WORKER1_LOOP_SPIN != 0) { WORKER1_LOOP_SPIN -= 1; }

    puts("WORKER 1\n");

    while(1) {};
}

void* worker2(void* arg)
{
    (void)arg;
    puts("WORKER 2\n");

    while(1) {};
}

#define WORKER_STACKSIZE (THREAD_STACKSIZE_TINY+THREAD_EXTRA_STACKSIZE_PRINTF)

int main(void)
{
    static char w1_stack[WORKER_STACKSIZE];
    thread_create(w1_stack, sizeof(w1_stack), 7, THREAD_CREATE_STACKTEST, worker1, NULL, "T1");

    static char w2_stack[WORKER_STACKSIZE];
    thread_create(w2_stack, sizeof(w2_stack), 7, THREAD_CREATE_STACKTEST, worker2, NULL, "T2");

    return 0;
}

Then if worker1 is preempted in the middle of stdio initialization, e.g. at:

worker1 at main.c:27
  _puts_r
    __sinit
      __sfp
        __sfmoreglue
          ...
    <TIMER 1 IRQ>
    <PREEMPTION>

Then worker2 may crash when puts is called:

worker2 at main.c:39
  _puts_r
    __swsetup_r+52
@maribu
Copy link
Member

maribu commented Nov 9, 2023

Thanks for reporting this. In #8619 it is mentioned that the nano variant of newlib behaves differently, would you mind to check which version is used? Can you reproduce the issue in both cases? (BUILD_IN_DOCKER=1 will use a toolchain that has both nano and non-nano variant of the newlib.)

(Note that RIOT will by default use the nano version, but only if that version is available.)

In a different case were racy code did not result in garbled stdio (there seems to be a consensus that this is acceptable in most use cases), but in crashes, we opted to make this thread-safe no matter what. (This was for memory allocation.) So the consistent thing here would be to make sure that __sinit is called prior before calling main().

Also, it would be nice to also provide at least the option to have newlib thread-safe even for functions that do not crash when racing. Some people do like e.g. clean stdio output enough to spend resources on that :)

@mchesser
Copy link
Author

mchesser commented Nov 21, 2023

Took me a while to get back to this, but I've managed to reproduce with a docker based build of the target. Based on the output of test/sys/libc_newlib and the build script output (see below), this build appears to be using newlib-nano.

Compile command
arm-none-eabi-gcc \
        -DRIOT_FILE_RELATIVE=\"examples/print_race/main.c\" \
        -DRIOT_FILE_NOPATH=\"main.c\" \
        -DCONFIG_SKIP_BOOT_MSG -Werror -DCPU_FAM_STM32F4 -DSTM32F446xx -DCPU_LINE_STM32F446xx -DSTM32_FLASHSIZE=524288U -D__SYSTEM_STM32F4XX_H -DSYSTEM_STM32F4XX_H -mno-thumb-interwork -mcpu=cortex-m4 -mlittle-endian -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -ffunction-sections -fdata-sections -fshort-enums -ggdb -g3 -Os -DCPU_MODEL_STM32F446RE -DCPU_CORE_CORTEX_M4F -DCPU_HAS_BACKUP_RAM=1 -DRIOT_APPLICATION=\"print_race\" -DBOARD_NUCLEO_F446RE=\"nucleo-f446re\" -DRIOT_BOARD=BOARD_NUCLEO_F446RE -DCPU_STM32=\"stm32\" -DRIOT_CPU=CPU_STM32 -DMCU_STM32=\"stm32\" -DRIOT_MCU=MCU_STM32 -std=c11 -fwrapv -Wstrict-overflow -fno-common -ffunction-sections -fdata-sections -Wall -Wextra -Wmissing-include-dirs -DNDEBUG -fno-delete-null-pointer-checks -fdiagnostics-color -Wstrict-prototypes -Wold-style-definition -gz -Wformat=2 -Wformat-overflow -Wformat-truncation -fmacro-prefix-map=/RIOT/= -Wcast-align -DCPU_RAM_BASE=0x20000000 -DCPU_RAM_SIZE=0x20000 -include '/RIOT/examples/print_race/bin/nucleo-f446re/riotbuild/riotbuild.h'  -isystem /opt/gcc-arm-none-eabi-10.3-2021.10/arm-none-eabi/include/newlib-nano -I/RIOT/core/lib/include -I/RIOT/core/include -I/RIOT/drivers/include -I/RIOT/sys/include -I/RIOT/boards/nucleo-f446re/include -I/RIOT/boards/common/nucleo/include -I/RIOT/boards/common/stm32/include -I/RIOT/boards/common/nucleo64/include -I/RIOT/cpu/stm32/include -I/RIOT/build/stm32/cmsis/f4/Include -I/RIOT/cpu/stm32/include/clk -I/RIOT/cpu/cortexm_common/include -I/RIOT/sys/libc/include -I/RIOT/build/pkg/cmsis/CMSIS/Core/Include -I/RIOT/sys/auto_init/include -I/RIOT/examples/print_race/bin/nucleo-f446re/preprocessor -MQ '/RIOT/examples/print_race/bin/nucleo-f446re/application_print_race/main.o' -MD -MP -c -o /RIOT/examples/print_race/bin/nucleo-f446re/application_print_race/main.o /RIOT/examples/print_race/main.c

It ended up being quite difficult to get a consistent crash when attached to a debugger. I think this is partly because some of the data allocated as part of stdio initialization is within 'uninitialized' RAM, which happens to be initialized with data from any previous execution.

The method I ended with, was to try and force the firmware to call a null pointer in worker2 by carefully adjusting the delay to ensure that preemption occurs after zeroing the stdout structure and some initial setup, but before the function pointer for write is initialized. Checks on the stdout FILE structure that occur as part of puts (e.g., SWR must be set in flags), mean that preemption needs to within this region of code: findfp.c:60-67.

The null call then occurs as part of worker2->puts->puts_r->__swbuf_r->_fflush_r->__sflush_r at flush.c:195

Getting this exact crash, required sub-microsecond adjustment which I did via a busy loop, however I think other uses of stdio may result in a larger window for the crash.

Although sinit has slightly changed in newer versions newlibc, because of some fairly recent patches (staring from: https://sourceware.org/pipermail/newlib/2022/019283.html). I'm fairly sure the same problem still occurs (e.g., by causing the switch in a similar location)

@maribu maribu mentioned this issue Feb 8, 2024
14 tasks
maribu added a commit to maribu/RIOT that referenced this issue Feb 16, 2024
This eagerly calls `__sinit()` instead of lazy initialization upon the
first call to stdio (e.g. `puts()`, `printf()`). The issue is that
without locking (as is currently the case for all RIOT platforms but
ESP) two concurrent "first calls" may result in concurrent
initialization of the same structure and data corruption.

Fixes RIOT-OS#20067
maribu added a commit to maribu/RIOT that referenced this issue Feb 16, 2024
This eagerly calls `__sinit()` instead of lazy initialization upon the
first call to stdio (e.g. `puts()`, `printf()`). The issue is that
without locking (as is currently the case for all RIOT platforms but
ESP) two concurrent "first calls" may result in concurrent
initialization of the same structure and data corruption.

Fixes RIOT-OS#20067
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants