Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gethostbyname test FAIL with NUCLEO_F746ZG and IAR #5622

Closed
jeromecoutant opened this issue Nov 30, 2017 · 19 comments
Closed

gethostbyname test FAIL with NUCLEO_F746ZG and IAR #5622

jeromecoutant opened this issue Nov 30, 2017 · 19 comments

Comments

@jeromecoutant
Copy link
Collaborator

Description

mbed test -m NUCLEO_F746ZG -t iar -vv -n tests-netsocket-gethostbyname

=> test is often FAIL
Target can't connect during DHCP procedure...

Test is OK with ARM and GCC

Target
NUCLEO_F746ZG

Toolchain
IAR

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 30, 2017

To be seen here #5597 (the last 2 ci test run)

@kegilbert @studavekar I haven't seen it logged yet, was this known prior this report?

@kegilbert
Copy link
Contributor

@0xc0170 I had seen that error on the linked PR but thought it was due to the PR itself. I'll look into it on our end.

@kegilbert
Copy link
Contributor

I'm able to reproduce this locally as well on IAR.

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 1, 2017

@kegilbert Thanks, I noticed that this is now real on master.

@SeppoTakalo @mikaleppanen @kjbracey-arm Can you help with this one?

@kjbracey
Copy link
Contributor

kjbracey commented Dec 4, 2017

Is this really specifically only the gethostbyname test failing? The failure appears to be an assert on return from the "connect" - getting a DHCP failure. So nothing to do with the gethostbyname bit.

What appears to be distinct about gethostbyname compared to other tests is that it does its connect in the test setup stage, rather than in a test case. That's presumably a clue. Not familiar enough with the test frame work to interpret the clue though.

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 7, 2017

Thanks Kevin. It's just that one test and only one toolchain, and seldomly appears in our test logs.

What appears to be distinct about gethostbyname compared to other tests is that it does its connect in the test setup stage, rather than in a test case. That's presumably a clue. Not familiar enough with the test frame work to interpret the clue though.

@jeromecoutant Can we look at this particular, if that would help?

Has anyone been able to reproduce this?

@jeromecoutant
Copy link
Collaborator Author

@kegilbert
Copy link
Contributor

The error seems to be fairly intermittent, I was unable to locally reproduce earlier this week but saw it crop up again in a CI run the next day. Trying to look more into it now.

@jarlamsa
Copy link
Contributor

I was able to reproduce this locally, seems like test-case issue, I am preparing a fix for this.

@jarlamsa
Copy link
Contributor

Made a PR: #5689
I am not entirely sure why the preprocessor macro seems to cause this to happen

@kjbracey
Copy link
Contributor

Okay, seems we've got an environment where we can get 100% failure with the default develop build, but it passes with a debug build. That preprocessor macro change in #5689 must just be one example of a memory layout change that avoids the problem.

I guess the change in #5689 is that we are adding 1 pointer in RAM - the const char *, shifting everything by 4 bytes. (If it were const char * const, the pointer would be in ROM, so maybe wouldn't have changed anything? Might be worth trying.)

I think tomorrow's job is to JTAG debug the failing develop image to try to spot some corrupted memory stopping the connect from completing.

I am reminded of #4686 - the best guess we currently have there is that someone is corrupting a static variable in the HAL's serial code. Different platform and compiler, but maybe could be caused by the same generic memory corruption bug?

@jeromecoutant
Copy link
Collaborator Author

Now that #5576 is merged, maybe we can close this issue ?

@0xc0170 @kegilbert , maybe check next morph test results before closing ?

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 13, 2017

Okay, seems we've got an environment where we can get 100% failure with the default develop build, but it passes with a debug build. That preprocessor macro change in #5689 must just be one example of a memory layout change that avoids the problem.

💯 Let us know if debugging this today leads to a fix

@jeromecoutant I would keep this open until @kjbracey-arm confirms

@kjbracey
Copy link
Contributor

kjbracey commented Dec 13, 2017

#5576 won't have resolved the underlying problem here, it may just have hidden it again. This should stay open until we track down the real cause. Should probably close #5689 and we'll make a new PR when/if we find the real culprit.

@kjbracey
Copy link
Contributor

It appears it is a flaw in the STM32F Ethernet driver - I believe it's newly arising due to the Cortex-M7 STM chips.

The driver writes to the Ethernet DMA descriptor, then writes to the Ethernet peripheral telling it to go.

I assume the cache isn't active, but the store buffer is active. The write to the DMA descriptor hasn't cleared the store buffer by the time the Ethernet looks at it. So the Ethernet immediately goes back to sleep without transmitting.

The Cortex-M7 doesn't automatically drain the store buffer when you access Device memory.

Adding a DMB between the descriptor write and peripheral write solves the issue.

So nothing to do with #4686, it seems.

Not quite sure why it comes and goes between builds - my current guess is that it's down to alignment of the descriptors, which will in turn affect the order that the descriptor hits the RAM.

@SeppoTakalo
Copy link
Contributor

@screamerbg Please forward this issue to ST.

@jarlamsa
Copy link
Contributor

If needed, I have a .bin and .elf for reproducing the issue.

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 14, 2017

If needed, I have a .bin and .elf for reproducing the issue.

If you can attach it, please do here.

@ARMmbed/team-st-mcd

@jarlamsa
Copy link
Contributor

Attached.
gethostbyname.zip

kjbracey added a commit to kjbracey/mbed-os that referenced this issue Dec 20, 2017
Pending official update from STM, add memory barriers to the Ethernet
HAL code for the STM32F7xx family.

Cortex-M7 has a merging write buffer that is not automatically flushed
by accesses to devices, so without these DMBs, we sometimes lose synch
with the transmitter.

The DMBs are architecturally needed in every version of this HAL, but
adding just to the STM32F7 version for now to clear test, as the
problem has only been observed on Cortex-M7-based devices.

Fixes ARMmbed#5622.
adbridge pushed a commit that referenced this issue Dec 29, 2017
Pending official update from STM, add memory barriers to the Ethernet
HAL code for the STM32F7xx family.

Cortex-M7 has a merging write buffer that is not automatically flushed
by accesses to devices, so without these DMBs, we sometimes lose synch
with the transmitter.

The DMBs are architecturally needed in every version of this HAL, but
adding just to the STM32F7 version for now to clear test, as the
problem has only been observed on Cortex-M7-based devices.

Fixes #5622.
adbridge pushed a commit that referenced this issue Dec 29, 2017
Pending official update from STM, add memory barriers to the Ethernet
HAL code for the STM32F7xx family.

Cortex-M7 has a merging write buffer that is not automatically flushed
by accesses to devices, so without these DMBs, we sometimes lose synch
with the transmitter.

The DMBs are architecturally needed in every version of this HAL, but
adding just to the STM32F7 version for now to clear test, as the
problem has only been observed on Cortex-M7-based devices.

Fixes #5622.
adbridge pushed a commit that referenced this issue Jan 2, 2018
Pending official update from STM, add memory barriers to the Ethernet
HAL code for the STM32F7xx family.

Cortex-M7 has a merging write buffer that is not automatically flushed
by accesses to devices, so without these DMBs, we sometimes lose synch
with the transmitter.

The DMBs are architecturally needed in every version of this HAL, but
adding just to the STM32F7 version for now to clear test, as the
problem has only been observed on Cortex-M7-based devices.

Fixes #5622.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants