nuttx - statfs() might lead to halt #13087

Closed
BazookaJoe1900 opened this issue Oct 3, 2019 · 8 comments · Fixed by #13311
@BazookaJoe1900 (Member)

Note: this is a duplicate of an issue I opened on PX4/NuttX (original issue). I am not sure which tracker it belongs in, so one of the two can be closed.

Describe the bug
I added a mavlink message that checks the SD card status periodically (in contrast to the current STORAGE_INFORMATION implementation, which has to be requested by the ground station). While testing my message, I checked what happens if the SD card is removed during operation. Doing so caused the mavlink thread to stop.

First, do you consider this a problem? How should the system behave if the SD card is removed, or malfunctions, during flight? I only checked the scenario above (the added periodic-check mavlink message), but I suspect other code paths that require the SD card are much more critical, for example the commander reading a mission.

I think it is related to the fact that reading the status of a FAT32 volume is done by fat_statfs(), which waits on a semaphore via fat_semtake(fs).
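
For reference, the NuttX code path in question has roughly this shape (paraphrased from fs/fat/; names follow the sources but details may differ). The point is that the wait is unbounded, so the caller has no way to give up:

```c
/* Paraphrased sketch of the NuttX FAT code path, not an exact copy. */

void fat_semtake(struct fat_mountpt_s *fs)
{
  int ret;

  do
    {
      /* Blocks until the holder posts the semaphore; if the holder is
       * stuck in a failed SD write, this never returns.
       */

      ret = nxsem_wait(&fs->fs_sem);
    }
  while (ret == -EINTR);
}

static int fat_statfs(FAR struct inode *mountpt, FAR struct statfs *buf)
{
  FAR struct fat_mountpt_s *fs = mountpt->i_private;

  fat_semtake(fs);              /* <-- halts here when the card is gone */
  /* ... fill in *buf from the mounted volume ... */
  fat_semgive(fs);
  return OK;
}
```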

To Reproduce
Testing code can be found at:
https://github.com/BazookaJoe1900/Firmware/tree/testing-sd_removal

Steps to reproduce the behavior:

  1. Start logging (use 'logger on', for example)
  2. Remove the SD card
  3. Most of the time, mavlink will stop; you can also see with top that mavlink is at 0%. (A sketch of the blocking call follows below.)
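
The call that hangs on the mavlink side is an ordinary statfs() on the SD mount point. A minimal stand-alone sketch of that kind of periodic check (the mount path /fs/microsd and the loop are assumptions for illustration, not the PX4 sources):

```c
#include <stdio.h>
#include <sys/statfs.h>
#include <unistd.h>

int main(void)
{
  struct statfs buf;

  for (;;)
    {
      /* With the card removed while another task holds the FAT
       * semaphore, this call never returns.
       */

      if (statfs("/fs/microsd", &buf) == 0)
        {
          printf("free blocks: %lu of %lu\n",
                 (unsigned long)buf.f_bfree,
                 (unsigned long)buf.f_blocks);
        }

      sleep(1);
    }

  return 0;
}
```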
@dagar (Member) commented Oct 3, 2019

Yes, I consider this a problem if an SD card failure can cause the mavlink module or navigator (guessing) to stop responding. Can we handle the failure gracefully in dataman?

@BazookaJoe1900 (Member, Author)

That is not that simple; it is something deeper in the OS.
@davids5, any thoughts?

@davids5 (Member) commented Oct 4, 2019

Yes, I agree it should fail with a timeout. I will look into this, but it will be after the 16th.
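
For illustration, a bounded wait could look roughly like this, using the standard POSIX sem_timedwait() (which NuttX provides); the 2-second budget, the helper name, and the error handling are assumptions, not the actual fix that landed in #13311:

```c
/* Sketch only: bounding the wait instead of blocking forever. */

#include <errno.h>
#include <semaphore.h>
#include <time.h>

static int semtake_with_timeout(sem_t *sem)
{
  struct timespec abstime;

  clock_gettime(CLOCK_REALTIME, &abstime);
  abstime.tv_sec += 2;                   /* give up after ~2 seconds */

  if (sem_timedwait(sem, &abstime) != 0)
    {
      return -errno;                     /* -ETIMEDOUT if never released */
    }

  return 0;
}
```

A timeout at this level only unblocks the waiter; as noted in the next comment, the holder's stuck write also has to fail for the system to fully recover.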

@BazookaJoe1900 (Member, Author)

Note that the timeout needs to be somewhere in the writer, or whichever other process is holding the semaphore.

@dagar (Member) commented Oct 4, 2019

That is not that simple; it is something deeper in the OS.

I know; I mean that will be the ultimate test once the core problem is addressed. If in a mission, the failure also needs to trigger a failsafe, otherwise you'll get stuck on the last mission item.

@julianoes (Contributor)

If in a mission, the failure also needs to trigger a failsafe, otherwise you'll get stuck on the last mission item.

Presumably this is all implemented. It should complain about a failure to load the mission and be in the same state as after a finished mission.

@BazookaJoe1900 (Member, Author) commented Oct 15, 2019

From what I have seen, writing to the SD card and other operations on it don't have timeouts, so another task that is writing to the SD card (say, the logger) can block access to it. The blocking can be permanent if there was an error, for example removing the SD card. The commander will then try to access the SD card and get stuck there, because the writer will never free the semaphore.
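
To make that blocking chain concrete, here is a small self-contained illustration (plain POSIX, not PX4 code; the thread names and the pause() standing in for the stuck SD write are assumptions):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t g_fs_lock;

static void *writer(void *arg)          /* e.g. the logger */
{
  (void)arg;
  sem_wait(&g_fs_lock);                 /* holds the filesystem lock */
  printf("writer: lock held, write now hangs (card removed)\n");
  pause();                              /* stuck SD write never completes */
  sem_post(&g_fs_lock);                 /* never reached */
  return NULL;
}

static void *reader(void *arg)          /* e.g. commander or mavlink */
{
  (void)arg;
  printf("reader: calling statfs()...\n");
  sem_wait(&g_fs_lock);                 /* blocks forever: no timeout */
  printf("reader: never printed\n");
  sem_post(&g_fs_lock);
  return NULL;
}

int main(void)
{
  pthread_t a;
  pthread_t b;

  sem_init(&g_fs_lock, 0, 1);
  pthread_create(&a, NULL, writer, NULL);
  sleep(1);                             /* let the writer grab the lock */
  pthread_create(&b, NULL, reader, NULL);
  pthread_join(b, NULL);                /* hangs, like the mavlink thread */
  return 0;
}
```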

@mrpollo (Contributor) commented Oct 15, 2019

I'm guessing this is dev call material; would anyone here like to lead the discussion on the next dev call, Oct 16th?
