gethostbyname test FAIL with NUCLEO_F746ZG and IAR #5622

jeromecoutant · 2017-11-30T15:27:55Z

Description

mbed test -m NUCLEO_F746ZG -t iar -vv -n tests-netsocket-gethostbyname

=> The test often FAILs.
The target can't connect during the DHCP procedure...

Test is OK with ARM and GCC

Target
NUCLEO_F746ZG

Toolchain
IAR

0xc0170 · 2017-11-30T16:31:27Z

This can be seen in #5597 (the last 2 CI test runs).

@kegilbert @studavekar I haven't seen it logged yet. Was this known prior to this report?

kegilbert · 2017-11-30T16:41:32Z

@0xc0170 I had seen that error on the linked PR but thought it was due to the PR itself. I'll look into it on our end.

kegilbert · 2017-12-01T00:21:57Z

I'm able to reproduce this locally as well on IAR.

0xc0170 · 2017-12-01T07:50:18Z

@kegilbert Thanks, I noticed that this is now real on master.

@SeppoTakalo @mikaleppanen @kjbracey-arm Can you help with this one?

kjbracey · 2017-12-04T09:34:28Z

Is this really specifically only the gethostbyname test failing? The failure appears to be an assert on return from the "connect" - getting a DHCP failure. So nothing to do with the gethostbyname bit.

What appears to be distinct about gethostbyname compared to other tests is that it does its connect in the test setup stage, rather than in a test case. That's presumably a clue. I'm not familiar enough with the test framework to interpret the clue, though.

0xc0170 · 2017-12-07T10:43:12Z

Thanks Kevin. It's just that one test and only one toolchain, and it seldom appears in our test logs.

> What appears to be distinct about gethostbyname compared to other tests is that it does its connect in the test setup stage, rather than in a test case. That's presumably a clue. Not familiar enough with the test framework to interpret the clue though.

@jeromecoutant Can we look at this in particular, if that would help?

Has anyone been able to reproduce this?

jeromecoutant · 2017-12-07T15:41:14Z

Test seems OK in morph test results #5576
http://mbed-os-logs.s3-website-us-west-1.amazonaws.com/?prefix=logs/5576/473

kegilbert · 2017-12-07T17:24:25Z

The error seems to be fairly intermittent: I was unable to reproduce it locally earlier this week, but saw it crop up again in a CI run the next day. Trying to look into it more now.

jarlamsa · 2017-12-12T12:15:00Z

I was able to reproduce this locally. It seems like a test-case issue, and I am preparing a fix for it.

jarlamsa · 2017-12-12T14:08:03Z

Made a PR: #5689
I am not entirely sure why the preprocessor macro seems to cause this to happen

kjbracey · 2017-12-12T15:03:49Z

Okay, seems we've got an environment where we can get 100% failure with the default develop build, but it passes with a debug build. That preprocessor macro change in #5689 must just be one example of a memory layout change that avoids the problem.

I guess the change in #5689 is that we are adding 1 pointer in RAM - the const char *, shifting everything by 4 bytes. (If it were const char * const, the pointer would be in ROM, so maybe wouldn't have changed anything? Might be worth trying.)
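To illustrate the pointer placement point, here is a minimal sketch. The thread does not show what #5689 actually changed, so the macro name, variable names and host string below are all hypothetical. Where each object ends up also depends on the toolchain and linker script, so treat the section names in the comments as the typical case rather than a guarantee.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro form: the string literal is used in place,
 * so no separate pointer object exists at all. */
#define HOST_MACRO "example.com"

/* Writable pointer to read-only characters. The pointer itself is
 * typically placed in .data (RAM), costing 4 bytes on a Cortex-M. */
const char *host_ram = "example.com";

/* Read-only pointer to read-only characters. The pointer can be
 * placed in .rodata (flash), leaving the RAM layout unchanged. */
const char *const host_rom = "example.com";

/* Returns 1 if all three spellings name the same host. */
int hosts_agree(void)
{
    assert(host_ram != NULL && host_rom != NULL);
    return strcmp(HOST_MACRO, host_ram) == 0 &&
           strcmp(host_ram, host_rom) == 0;
}
```

All three forms behave the same at run time. Only the first pointer variable would be expected to shift the addresses of other RAM objects, which is the kind of layout change being guessed at above.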

I think tomorrow's job is to JTAG debug the failing develop image to try to spot some corrupted memory stopping the connect from completing.

I am reminded of #4686 - the best guess we currently have there is that someone is corrupting a static variable in the HAL's serial code. Different platform and compiler, but maybe could be caused by the same generic memory corruption bug?

jeromecoutant · 2017-12-13T08:50:37Z

Now that #5576 is merged, maybe we can close this issue?

@0xc0170 @kegilbert, maybe check the next morph test results before closing?

0xc0170 · 2017-12-13T09:18:57Z

> Okay, seems we've got an environment where we can get 100% failure with the default develop build, but it passes with a debug build. That preprocessor macro change in #5689 must just be one example of a memory layout change that avoids the problem.

💯 Let us know if debugging this today leads to a fix

@jeromecoutant I would keep this open until @kjbracey-arm confirms

kjbracey · 2017-12-13T10:06:50Z

#5576 won't have resolved the underlying problem here, it may just have hidden it again. This should stay open until we track down the real cause. Should probably close #5689 and we'll make a new PR when/if we find the real culprit.

kjbracey · 2017-12-13T15:50:39Z

It appears it is a flaw in the STM32F Ethernet driver - I believe it's newly arising due to the Cortex-M7 STM chips.

The driver writes to the Ethernet DMA descriptor, then writes to the Ethernet peripheral telling it to go.

I assume the cache isn't active, but the store buffer is active. The write to the DMA descriptor hasn't cleared the store buffer by the time the Ethernet looks at it. So the Ethernet immediately goes back to sleep without transmitting.

The Cortex-M7 doesn't automatically drain the store buffer when you access Device memory.

Adding a DMB between the descriptor write and peripheral write solves the issue.
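The ordering described above can be sketched as follows. This is a host-runnable illustration, not the actual STM32 HAL code: the descriptor layout, the `DESC_OWN` bit position and the `fake_tx_poll_demand` variable are simplified stand-ins, and the C11 fence is only a placeholder for the barrier. On a real Cortex-M target the barrier would be the CMSIS `__DMB()` intrinsic and the last store would go to the Ethernet DMA peripheral register.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: bit that hands the descriptor to the DMA. */
#define DESC_OWN (1u << 31)

/* Simplified, hypothetical transmit descriptor living in normal RAM. */
typedef struct {
    volatile uint32_t status;
    volatile uint32_t buffer_addr;
    volatile uint32_t length;
} tx_desc_t;

/* Stand-in for the memory-mapped "go" register of the Ethernet DMA. */
static volatile uint32_t fake_tx_poll_demand;

/* Placeholder barrier. On the target this would be CMSIS __DMB(). */
static inline void data_memory_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Fill in a descriptor, then tell the (simulated) peripheral to go. */
static void tx_submit(tx_desc_t *desc, uint32_t buf, uint32_t len)
{
    assert(desc != NULL);

    /* 1. Write the descriptor in RAM. */
    desc->buffer_addr = buf;
    desc->length = len;
    desc->status = DESC_OWN;

    /* 2. Make sure those writes have left the store buffer before
     *    the peripheral is told to look at the descriptor. */
    data_memory_barrier();

    /* 3. Kick the peripheral. */
    fake_tx_poll_demand = 1u;
}
```

Without step 2, nothing in this sequence forces the descriptor write to reach RAM before the peripheral write takes effect, which matches the symptom of the Ethernet going back to sleep without transmitting.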

So nothing to do with #4686, it seems.

Not quite sure why it comes and goes between builds - my current guess is that it's down to alignment of the descriptors, which will in turn affect the order that the descriptor hits the RAM.

SeppoTakalo · 2017-12-13T15:55:01Z

@screamerbg Please forward this issue to ST.

jarlamsa · 2017-12-14T07:03:41Z

If needed, I have a .bin and .elf for reproducing the issue.

0xc0170 · 2017-12-14T08:27:16Z

> If needed, I have a .bin and .elf for reproducing the issue.

If you can attach them, please do so here.

@ARMmbed/team-st-mcd

jarlamsa · 2017-12-14T08:34:12Z

Attached.
gethostbyname.zip

Pending official update from STM, add memory barriers to the Ethernet HAL code for the STM32F7xx family. Cortex-M7 has a merging write buffer that is not automatically flushed by accesses to devices, so without these DMBs, we sometimes lose synch with the transmitter. The DMBs are architecturally needed in every version of this HAL, but adding just to the STM32F7 version for now to clear test, as the problem has only been observed on Cortex-M7-based devices. Fixes #5622.

0xc0170 added devices: st type: bug labels Nov 30, 2017

0xc0170 mentioned this issue Dec 1, 2017

Add Critical Section HAL API specification #5346

Merged

4 tasks

jeromecoutant mentioned this issue Dec 7, 2017

STM32 UART init update #5570

Merged

jeromecoutant mentioned this issue Dec 13, 2017

Fixes NUCLEO_F746ZG gethostbyname test compiled with IAR #5689

Closed

kjbracey mentioned this issue Dec 18, 2017

Add memory barriers to STM32F7xx Ethernet #5720

Merged

sg- assigned kjbracey Dec 18, 2017

0xc0170 closed this as completed in #5720 Dec 22, 2017
