Crash Dumps

Angus Gratton edited this page Jul 6, 2016 · 15 revisions

Crash Dumps

When the ESP8266 CPU encounters a fatal exception, esp-open-rtos outputs a crash dump like this:

Fatal exception (29): 
epc1=0x40208560
epc2=0x00000000
epc3=0x40103117
excvaddr=0x00000000
depc=0x00000000
excsave1=0x40208558
Registers:
a0 40208558 a1 3fff5040 a2  00000000 a3  00000081
a4  00000001 a5  00000081 a6  00000000 a7  00000000
a8  00000000 a9  00000000 a10 00000000 a11 00000000
a12 00000000 a13 60000300 SAR 0000001e

Stack: SP=0x3fff5040
0x3fff5040: 00000000 00000000 00000000 00000000
0x3fff5050: 00000000 a5a5a5a5 a5a5a5a5 00000290
0x3fff5060: 3fff5408 00000000 00000000 00000000
0x3fff5070: 00000000 00000000 00000000 00000000
0x3fff5080: 00000000 00000000 00000000 00000000
0x3fff5090: 00000000 00000000 00000000 00000000
0x3fff50a0: 00000000 00000000 00000000 00000000
0x3fff50b0: 00000000 00000000 00000000 00000000

Free Heap: 42592
_heap_start 0x3fff1380 brk 0x3fff5480 supervisor sp 0x3ffffae0 sp-brk 42592 bytes
arena (total_size) 16640 fordblks (free_size) 0 uordblocks (used_size) 16640

The code which generates crash dumps can be found in core/debug_dumps.c

Exception Cause

The number in parentheses (ie (29) in the example above) is an Xtensa exception cause number:

Number Cause Reason
0 IllegalInstruction Illegal instruction was executed.
1 Syscall SYSCALL instruction encountered (esp-open-rtos does not use syscalls or have a syscall handler, so this is an error)
2 InstructionFetchError Attempt to execute instructions from an illegal address, or read error when fetching instruction
3 LoadStoreError Attempt to read/write memory in a manner not supported by the hardware (for example, trying to read/write a byte to a memory area that only supports word accesses)
4 Level1Interrupt (This is normal and should never generate a crash dump)
5 (reserved/unused) (Not used by ESP8266 hardware. Should never occur.)
6 IntegerDivideByZero CPU tried to divide by zero.
7..8 (reserved/unused) (Not used by ESP8266 hardware. Should never occur.)
9 LoadStoreAlignment Attempt to read/write memory with an unaligned address (for example, trying to read/write a 32-bit word at an address that is not a multiple of 4)
10..19 (reserved/unused) (Not used by ESP8266 hardware. Should never occur.)
20 InstFetchProhibited CPU tried to execute code from a memory region which is not valid for code
21..27 (reserved/unused) (Not used by ESP8266 hardware. Should never occur.)
28 LoadProhibited CPU tried to load memory from a region which is protected against reads
29 StoreProhibited CPU tried to store memory from a region which is protected against writes

Register Dump

Immediately following the exception number is a dump of the Xtensa registers at the time the exception occured.

EPCx registers

The program counter (PC) is the register in a CPU that points to the currently executing instruction.

EPCx gives the PC register address at different exception levels, which is the program addresses that were in use when the CPU encountered the exception:

epc1=0x40221970
epc2=0x00000000
epc3=0x00000000

EPC1 is the saved PC address that was in use when the exception occured, assuming it caused an exception at the normal level (level 1). This is almost always the PC address to use when debugging a crash dump.

EPC2 is used to save the current PC address when a debugger halted execution. If you're not using a debugger then you can expect it to be 0.

EPC3 is the saved PC from the last time the CPU jumped to the NMI (Non Maskable Interrupt) vector. This address is unrelated to the crash 99.999% of the time, unless the crash occurred inside the NMI exception handler (very rare, unless you're writing code that runs in the NMI). This address can still be informative as it shows you a place where the CPU was running, the last time an NMI occured (ie recently).

EXCVADDR register

EXCVADDR is the memory address the caused the exception, if there was one. If you get LoadStore, LoadProhibited or StoreProhibited exception cause then look at the excvaddr to see what load/store address caused the exception:

excvaddr=0x00000000

For example, reading from a NULL pointer causes a LoadProhibited exception (28) with EXCVADDR=0. Writing to a NULL pointer causes a StoreProhibited exception (29) with EXCVADDR=0 (writing to a NULL pointer is what caused the example fatal exception at the top of the page).

(Note that the entire region from 0x00000000 to 0x20000000 in the ESP8266 is protected against both reads and writes. This means that reading/writing to anywhere in that address range will generate an exception (a handy side-effect of this is that, as mentioned, NULL pointer accesses fall in this range, and thus generate exceptions)

EXCSAVE1 Register

excsave1=0x40208558

EXCSAVE1 is the PC address of normal (non-interrupt-context) execution from the last time an interrupt occured. Like EPC3, this often tells you a place where the CPU was running recently (but it may not be running there now). As interrupts are used for context switching between tasks, this program address is often related to the last context switch.

General Purpose Registers

Registers:
a0 40208558 a1 3fff5040 a2  00000000 a3  00000081
a4  00000001 a5  00000081 a6  00000000 a7  00000000
a8  00000000 a9  00000000 a10 00000000 a11 00000000
a12 00000000 a13 60000300 SAR 0000001e

The general purpose registers are manipulated as your code runs and aren't guaranteed to mean anything in particular, but they include control flow values and program variable values. Some of these may be useful! Once you find the line of source where the crash happened (see below), you can optionally look at the function disassembly to find out which registers were being used for what.

A register that often holds a predictable value is a0. When using the Xtensa "call0" ABI (as used by ESP8266), this register holds the return address from the current function call. (For more explanation of this, see "stack dump" section below.)

Finding source code

Once you have some PC addresses (like 0x40208560), you want to convert them to line numbers in your source code. There are ways to do this:

filteroutput.py

The filteroutput.py tool in the utils/ directory will automatically filter serial output and pick out program addresses, and inject the line numbers. If you use make test to flash & run your programs then filteroutput.py is already being used to monitor the serial port.

The crash dump from above, processed with filteroutput.py:

epc1=0x40208560

0x40208560: blinkenTask at blink.c:51 (discriminator 1)

epc2=0x00000000
epc3=0x40106530

0x40106530: memcmp at /home/gus/dev/esp/rtos/newlib/build/xtensa-lx106-elf/newlib/libc/string/../../../../../newlib/libc/string/memcmp.c:64

excvaddr=0x00000000
depc=0x00000000
excsave1=0x40208558

0x40208558: blinkenTask at blink.c:51 (discriminator 1)

The line printed under epc1 shows the line number where the program crashed.

filteroutput.py can be used to live monitor a serial port, like this:

filteroutput.py -p /dev/ttyACM0 -b 115200

Or you can pipe an existing log file through it, like this:

filteroutput.py < my_debug_log.txt

filteroutput.py will automatically look for an ELF file (.out file in the build directory) under the current directory. This is the file which is used to look up line numbers based on PC addresses. If run from a different directory, use the -e option to pass in the path to the ELF, ie

filteroutput.py -e ../build/my_program.out < my_debug_log.txt

Run filteroutput.py -h for more usage details.

addrsource

The addrsource tool in the utils/ directory of esp-open-rtos will give you more details about a crashing PC address.

Similar to filteroutput.py, addrsource automatically looks for an ELF file in the build directory. So if you are running addrsource from your app directory (i.e. the directory that has build and firmware subdirectories), you can just give it the EPC1 address as the only argument:

addrsource 0x40221970

If the current directory is somewhere else, you will need to also tell it where to find the ELF binary (.out file in the build directory) to use:

addrsource 0x40221970 myprogram/build/myprogram.out

The addrsource tool prints the line number, source function body, and disassembly for the exception address.

Just line numbers

If all you need is the filename and line number, the xtensa-lx106-elf-addr2line tool (provided as part of the cross-compiler suite from esp-open-sdk) is a simpler tool that just prints the source code line numbers based on provided addresses. filteroutput.py is running addr2line behind the scenes.

Launch addr2line with the path to the ELF binary:

xtensa-lx106-elf-addr2line -e build/program.out

Then type/paste in the addresses:

$ xtensa-lx106-elf-addr2line -e build/program.out
0x40221970
/path/to/program.c:76

Stack dump

Stack: SP=0x3fff5040
0x3fff5040: 00000000 00000000 00000000 00000000
0x3fff5050: 00000000 a5a5a5a5 a5a5a5a5 00000290
0x3fff5060: 3fff5408 00000000 00000000 00000000
0x3fff5070: 00000000 00000000 00000000 00000000
0x3fff5080: 00000000 00000000 00000000 00000000
0x3fff5090: 00000000 00000000 00000000 00000000
0x3fff50a0: 00000000 00000000 00000000 00000000
0x3fff50b0: 00000000 00000000 00000000 00000000

The crash dump includes a "stack dump" - a chunk of RAM read from the current task's stack. The current address of the stack pointer (SP) is printed on the first line.

FreeRTOS allocates each task's stack from the heap at runtime, so the stack dump may go past the end of the task's stack and include other memory (the stacks of other tasks, or other heap memory). FreeRTOS uses the magic value 0xa5a5a5a5 as an end-of-stack marker, so you can usually assume that anything after 0xa5a5a5a5 is no longer the stack of your program.

Reading a stack dump

You can use the stack dump to figure out the call stack at the time of the crash. It's not quite as simple as viewing a backtrace in gdb though (patches to make it more simple are welcome!)

Here's some contrived sample code that makes a bunch of nested function calls and then crashes:

void crash_now(void)
{
    *((volatile uint32_t *)0) = 0; // null pointer deref, instant crash
}

void call_crash_now(void)
{
    for(int i = 0; i < 3; i++) {
        crash_now();
    }
}

void should_we_crash(bool crash_now)
{
    if(crash_now) {
        call_crash_now();
    }
}

void init_task(void *pvParameters)
{
    should_we_crash(true);
    while(1) { }
}

(NB: If you actually compile this, add __attribute__((noinline)) after each void or gcc just inlines all these functions into one!)

Here's the crash dump:

Fatal exception (29):
epc1=0x40211a61

0x40211a61: crash_now at crash.c:16

epc2=0x00000000
epc3=0x00000000
excvaddr=0x00000000
depc=0x00000000
excsave1=0x40208551

0x40208551: call_crash_now at crash.c:22

Registers:
a0 40208551 a1 3fff5020 a2  00000000 a3  00000000

0x40208551: call_crash_now at crash.c:22

a4  00000000 a5  00000000 a6  00000000 a7  00000000
a8  00000000 a9  00000000 a10 00000000 a11 00000000
a12 00000000 a13 00000000 SAR 00000000

Stack: SP=0x3fff5020
0x3fff5020: 00000000 00000000 00000000 4020856d

0x4020856d: should_we_crash at crash.c:31

0x3fff5030: 00000000 00000000 00000000 4020857e

0x4020857e: init_task at crash.c:35

0x3fff5040: 00000000 00000000 00000000 00000000
0x3fff5050: 00000000 a5a5a5a5 a5a5a5a5 00000290

Reading from the top of the crash dump down:

EPC1 contains the actual crash address - caused by the null pointer dereference in the crash_now function.

Reading down the register dump, a0 has value 0x40208551. This is the return address for the crash_now function, and points to the call_crash_now calling function.

In the stack memory dump, the address 0x4020856d appears. This is the return address from call_crash_now, as it was pushed onto the stack. It points to the should_we_crash function that's one higher on the call stack.

Four words further along in the stack dump is 0x4020857e, which points to the init_task function which calls into should_we_crash. You'll usually find a task init function at the "end" of the stack, unless the stack is corrupt.

Further on in the stack dump, you can see the 0xa5a5a5a5 stack end marker - this confirms for us that anything after this is not part of our stack.

So we can quickly assemble an idea of the full call trace of the task at the time of the crash:

init_task -> should_we_crash -> call_crash_now -> crash_now.

Note that program addresses can appear in the stack dump for other reasons (function pointers, left over data in RAM that hasn't been overwritten yet, etc.) If an address in the stack dump doesn't seem like it could possibly have been part of the call trace, you can probably ignore it.

Return Address != Call Address

When you look at a gdb backtrace, you see the line numbers of each function call itself. When looking at a raw memory/register dump like this, you're looking at the return addresses - which is the instruction in the calling function that's to be executed after the called function returns.

For example, in the dump above one of the stack addresses is 0x4020857e which is crash.c line 31:

void __attribute__((noinline)) should_we_crash(bool crash_now)
{
    if(crash_now) {
        call_crash_now(); // <--- line 29 is here
    }
} // <--- line 31 is here

Line 31 is the address after call_crash_now() returns - it points to the next instruction executed. You have to look a few lines up in the C code to see where the call was made. For most source code, this is pretty obvious.

To get a precise line number, you can run addr2line on a value 3 less than the value in the dump. This works because each instruction is 3 bytes long.

$ xtensa-lx106-elf-addr2line -e build/blink.out
0x4020856d
crash.c:31
0x4020856a
crash.c:29

... bingo!

Heap Information

Free Heap: 42592
_heap_start 0x3fff1380 brk 0x3fff5480 supervisor sp 0x3ffffae0 sp-brk 42592 bytes
arena (total_size) 16640 fordblks (free_size) 0 uordblocks (used_size) 16640

The last part of the crash dump contains information about the heap. Usually the only value you need to worry about is "Free Heap" - if it's very low then this may explain why your program crashed. If it's a totally nonsense value then the heap may be corrupt.

You can get information about the free heap value at runtime by calling xPortGetFreeHeapSize().

The other lines printed are information about brk(), and internal heap information from mallinfo. Look in debug_dumps.c to see how these values are all derived:

  • _heap_start - Address in RAM where the heap starts (heap grows up from here.) This address is determined at link time, by the linker script.
  • brk - The "brk" system call takes never-used RAM and adds it to the data segment that is used by the heap. Before any heap is allocated, brk is equal to _heap_start. This address in RAM shows how high in RAM the segment used by the heap has ever gotten.
  • supervisor sp - Address in RAM of the "supervisor" (pre-FreeRTOS) stack pointer. This is how far the stack grew down from the very end of RAM, before the RTOS scheduler started. If the end of the heap (as determined by brk) grows close to this point, you may run out of heap and allocations will fail.
  • sp-brk - Literally, supervisor sp minus brk. This shows how many bytes of memory were never ever allocated to the heap. It lets you see the "high water mark" of heap usage. If this value is reasonably high then it's very unlikely you ran out of heap (although malloc() can still fail if the value it's trying to allocate is higher than what's available here.) The overall "free heap" size is calulcated from this value, plus the free memory inside heap's segment. sp-brk is guaranteed to be unfragmented, whereas free space inside the heap may be fragmented.
  • arena (aka total_size) - This is the total number of bytes inside the heap. It should be equal to brk-_heap_start. If that's not the case then the heap is probably corrupt.
  • fordblks (aka free_size) - This is the total number of bytes free in the heap. Usually this is memory that has been allocated and then freed, and not yet allocated again.
  • uordblocks (used_size) - This is the total number of bytes currently allocated in the heap.

Heap Fragmentation

The heap info dump doesn't tell you much about heap fragmentation - it's possible to have a lot of free heap, but allocating a chunk of contiguous memory will still fail because all the free memory is fragmented. A clue that this might happen is if the fordblks value is high (ie a lot of the free heap was previously malloced() as something else, so the free space might be in fragments.) You will always be able to allocate at least sp-brk number of bytes in one chunk, so if this value is high then fragmentation-related problems are unlikely.

Exceptions that get handled

Some regions (Instruction RAM & mapped SPI flash) in the ESP8266 address space require 4 byte aligned word reads and writes. Unaligned or short reads/writes will raise LoadStoreError exceptions.

esp-open-rtos includes a LoadStoreError handler (in core/exception_vectors.S) which will automatically "fix" unaligned or non-word reads by replacing them with the necessary aligned word reads. This lets you store string constants and other data in flash memory ("irom") and then read them character by character as if they were in RAM.

esp-open-rtos does not fix LoadStoreErrors caused by unaligned stores to these address spaces.

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.