Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mtd_spi_nor | littlefs: improve reliability with corrupted flash #14908

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

vincent-d
Copy link
Member

Contribution description

This PR improves the reliability of mtd_spi_nor and littlefs:

  • mtd_spi_nor: properly checks for WEL enabling/disabling, check for errors when writing/erasing and return a dedicated error code (-EIO).
  • littlefs: catch -EIO and translate it into LFS_ERR_CORRUPT that is internally used to realloc sectors if needed

Testing procedure

Check that littlefs(2) still works (examples/filesystem).
This has been tested with flash that had dead sectors. littlefs could detect them and recover.

@vincent-d vincent-d added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Area: drivers Area: Device drivers CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Area: fs Area: File systems labels Aug 31, 2020
Copy link
Contributor

@benpicco benpicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the deal with the Security register and will it be available on all flash devices?

Comment on lines 593 to 597
/* Read security register */
uint8_t rdscur;
mtd_spi_cmd_read(dev, dev->params->opcode->rdscur, &rdscur, sizeof(rdscur));
if (rdscur & SECURITY_PFAIL) {
DEBUG("mtd_spi_nor_write: ERR: page program failed!\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the relationship between the two.

@vincent-d
Copy link
Member Author

What's the deal with the Security register and will it be available on all flash devices?

On macronix flashes (at least, I haven't check other vendors actually :/ ), security register contains some flags: Erase fail bit and Program fail bit that are set when erase or program failed (usually meaning the sector is dead). See, for instance (page 83):
https://www.macronix.com/Lists/Datasheet/Attachments/7574/MX25L51245G,%203V,%20512Mb,%20v1.3.pdf

@benpicco
Copy link
Contributor

benpicco commented Aug 31, 2020

Hm I couldn't find such flags on neither GD25Q32C nor AT25SF041

@vincent-d
Copy link
Member Author

Hm I couldn't find such flags on neither GD25Q32C nor AT25SF041

d'oh! I'll revisit the patch to enable this with a compile flag

@benpicco
Copy link
Contributor

Is there some way to detect this at run-time?

@vincent-d
Copy link
Member Author

I guess it could be possible to read the jedec ID. But I'm not sure it's worth it.

@vincent-d vincent-d changed the title mtd_spi_nor | littlefs: mtd_spi_nor | littlefs: improve reliability with corrupted flash Sep 17, 2020
@fjmolinas
Copy link
Contributor

ping @benpicco

@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 22, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Mar 2, 2022
@fjmolinas fjmolinas removed the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Mar 2, 2022
@stale stale bot removed the State: stale State: The issue / PR has no activity for >185 days label Mar 2, 2022
Copy link
Contributor

@benpicco benpicco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for letting this slip through the cracks!
This actually looks really good, would merge after giving it a quick test.

Can you rebase this?

Comment on lines +39 to +55
#define STATUS_WIP 0x01u
#define STATUS_WEL 0x02u
#define STATUS_BP0 0x04u
#define STATUS_BP1 0x08u
#define STATUS_BP2 0x10u
#define STATUS_BP3 0x20u
#define STATUS_QE 0x40u
#define STATUS_SRWD 0x80u

#define SECURITY_SOTP 0x01u
#define SECURITY_LDSO 0x02u
#define SECURITY_PSB 0x04u
#define SECURITY_ESB 0x08u
#define SECURITY_XXXXX 0x10u
#define SECURITY_PFAIL 0x20u
#define SECURITY_EFAIL 0x40u
#define SECURITY_WPSEL 0x80u
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just move those to the .c file, no need to define them globally - then you can also get away without a prefix 😉

@stale
Copy link

stale bot commented Nov 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Nov 2, 2022
@crasbe
Copy link
Contributor

crasbe commented Apr 15, 2024

@benpicco Can we get this reactivated? The project I'm currently working on uses an ISSI flash which has the same security features (however the bit positions in the registers are different and the OpCodes are different as well). This PR would be a good base and reduce the amount of work significantly.

Given the age of the PR, it probably has merge conflicts now. Unfortunately it looks like @vincent-d is not active anymore, his last GitHub activity was in 2022 and there is not much to find about OTAkeys anymore either.

On a different note: I checked the code and there is no timeout for the added functions wait_for_write_enable_cleared and wait_for_write_enable_set, which would get the thread stuck if the chip does not answer.
(The function wait_for_write_complete which existed before and still exists in the current code does not have a timeout either, it counts the attempts and how many times it yielded. But there is no timeout either.)

I don't know how this is handled generally in other modules?

@stale stale bot removed the State: stale State: The issue / PR has no activity for >185 days label Apr 15, 2024
@benpicco
Copy link
Contributor

Sure, feel free to take over the PR.
There should of course be a timeout, we don't want block the application in case of a hardware error but instead give to opportunity to handle it gracefully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: drivers Area: Device drivers Area: fs Area: File systems Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants