Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden flash handling in AP_Logger #13130

Merged
merged 1 commit into from
Jan 2, 2020
Merged

Conversation

andyp1per
Copy link
Collaborator

@andyp1per andyp1per commented Dec 30, 2019

This PR addresses a number of issues in the dataflash handling which I would appreciate feedback on:

  • mavlink backend needs to be the last backend - otherwise functions like AP_Logger::get_log_boundaries will return the wrong thing for block+mavlink
  • be really careful to catch aborted erases - I've added more protection around the case where someone starts a chip erase and then pulls the plug. I basically first erase the format header so that if something bad happens the erase will start again at the next boot.
  • take care to protect shared structures in io thread - not sure how much this matters but appears wrong
  • if flash corruption is detected try and recover whole files - I got in the state where I had good files on the chip followed by corruption. I have written some startup code that iterates through the good files and then erases the remaining part of the chip if corruption is found.
  • the code was erroneously referring to a 64k block as a "sector", this makes it really confusing. I have cleaned up the internal variables because I need both PagesPerSector and PagesPerBlock, but there could be more renaming
  • mavlink listing did not cope with wrapping - the case where logs did not start at 1 and hence the last log had an index greater than the number of logs. This is fixed and mission planner now reports correct log numbers and sizes in this instance
  • output status messages for useful functions like begin/end erase which can take some time

@andyp1per andyp1per force-pushed the pr-flash-fix branch 2 times, most recently from 515657d to f160f95 Compare December 30, 2019 20:00
Copy link
Contributor

@tridge tridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this!

libraries/AP_Logger/AP_Logger_DataFlash.cpp Outdated Show resolved Hide resolved
libraries/AP_Logger/AP_Logger_DataFlash.cpp Outdated Show resolved Hide resolved
libraries/AP_Logger/AP_Logger_DataFlash.cpp Outdated Show resolved Hide resolved
libraries/AP_Logger/AP_Logger_Block.cpp Show resolved Hide resolved
@andyp1per
Copy link
Collaborator Author

andyp1per commented Dec 30, 2019

Also @tridge I have seen AP_Logger_Block::find_last_page_of_log() return 0, when would this ever be the case (and it was not because the file wrapped)?
Update: this is because of the listing bug which I have subsequently fixed

@andyp1per andyp1per force-pushed the pr-flash-fix branch 2 times, most recently from f794d0c to a4d29de Compare December 31, 2019 16:37
@andyp1per
Copy link
Collaborator Author

Ok @tridge I've addressed your comments and fixed some bugs I found. I was successfully able to recover a log from a KakuteF7Mini with this which before caused mission planner to hang.

@andyp1per
Copy link
Collaborator Author

And yet another bug found and fixed. Turns out our mavlink listing support (used by mission planner) did not cope with wrapping

be really careful to catch aborted erases
take care to protect shared structures in io thread
if flash corruption is detected try and recover whole files
overwrite format in erase to make sure erase happens
output useful messages at critical times
a block is 64k a sector is 4k, rename internal variables appropriately
cope with log wrapping when sending log listings over mavlink
@tridge tridge merged commit 54b6e7b into ArduPilot:master Jan 2, 2020
@andyp1per andyp1per deleted the pr-flash-fix branch January 2, 2020 08:24
@rmackay9
Copy link
Contributor

rmackay9 commented Jan 3, 2020

I suspect we should backport this to the 4.0 branches (or at least Copter)?

@andyp1per
Copy link
Collaborator Author

@rmackay9 I agree, although it woud be nice to have a report that nothing has regressed. That said it's only small copters using these boards and the status quo in 4.0 is pretty broken.

@andyp1per
Copy link
Collaborator Author

@rmackay9 I am still seeing issues - I don't suggest you backport it at this time.

@rmackay9
Copy link
Contributor

rmackay9 commented Feb 4, 2020

FYI, this is included in Copter-4.0.2-rc4

@peterbarker
Copy link
Contributor

This PR did very bad things to log downloads in SITL. I wouldn't be surprised to know it's killed it on HW - I haven't tested yet.

Before:

MANUAL> Requesting log list
Log 1  numLogs 2 lastLog 2 size 691989 Mon May  4 19:25:31 2020
Log 2  numLogs 2 lastLog 2 size 220522 Mon May  4 19:28:31 2020

After:

Log 4  numLogs 2 lastLog 5 size 0 
Log 5  numLogs 2 lastLog 5 size 0 

The PR removed the distinction between log-on-disk and log-in-mavlink-packet - perhaps accidentally, perhaps not. Either way, SITL is looking for the wrong files.

@andyp1per
Copy link
Collaborator Author

It definitely works in hardware beacuse that was my specific use case that I tested and that did not work before. Even so - oops, mea culpa.

@andyp1per
Copy link
Collaborator Author

How would it have removed that distinction?

@andyp1per
Copy link
Collaborator Author

I can't reproduce what you are seeing. On current master:

LAND> log list
LAND> Requesting log list
Log 1  numLogs 9 lastLog 9 size 63488 Sat Apr 25 15:43:50 2020
Log 2  numLogs 9 lastLog 9 size 543616 Sat Apr 25 15:43:55 2020
Log 3  numLogs 9 lastLog 9 size 1095680 Sat Apr 25 15:44:12 2020
Log 4  numLogs 9 lastLog 9 size 1185792 Sat Apr 25 15:44:27 2020
Log 5  numLogs 9 lastLog 9 size 1214464 Sat Apr 25 15:44:43 2020
Log 6  numLogs 9 lastLog 9 size 1390592 Sat Apr 25 15:45:01 2020
Log 7  numLogs 9 lastLog 9 size 1056768 Sat Apr 25 15:45:15 2020
Log 8  numLogs 9 lastLog 9 size 638976 Sat Apr 25 15:45:25 2020
Log 9  numLogs 9 lastLog 9 size 18640896 Mon May  4 19:52:19 2020

@peterbarker
Copy link
Contributor

@andyp1per Try deleting the oldest 5 logs.

@andyp1per
Copy link
Collaborator Author

Ah yes that fails like you have. So it must be to do with the reading rather than storage since we have seen that the data is there.

@andyp1per
Copy link
Collaborator Author

Ok, if I comment out the changes to AP_Logger_MAVLinkLogTransfer.cpp I get data

STABILIZE> Requesting log list
Log 1  numLogs 5 lastLog 5 size 1390592 Sat Apr 25 15:45:01 2020
Log 2  numLogs 5 lastLog 5 size 1056768 Sat Apr 25 15:45:15 2020
Log 3  numLogs 5 lastLog 5 size 638976 Sat Apr 25 15:45:25 2020
Log 4  numLogs 5 lastLog 5 size 39380992 Mon May  4 19:46:12 2020
Log 5  numLogs 5 lastLog 5 size 8754678 Mon May  4 22:36:55 2020

but that will break FRAM-based logging list which this fixed. So somehow we need the happy medium of both.

@xfacta
Copy link

xfacta commented Jan 29, 2021

@andyp1per
Copy link
Collaborator Author

The issues were fixed in #14299

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants