Hardfault in lfs_dir_relocatingcommit, LFS 2.9, LFS_ASSERT(pdir) with pdir null #949

Open
HariTheCoder341 opened this issue Mar 1, 2024 · 3 comments
Labels
needs investigation no idea what is wrong

Comments

HariTheCoder341 commented Mar 1, 2024

We are experiencing a hardfault from time to time in the following scenario, which runs in a while(true) loop.
We have three directories: "Records", "ToSend", and "Archive".

  1. Create a file in "Records" and write data of random size to it
  2. Rename the file, moving it from "Records" to "ToSend"
  3. Copy the content of the "ToSend" file into a single file in "Archive", appending to it

The hardfault happens in the function "lfs_dir_relocatingcommit" at line 2226, triggered by LFS_ASSERT(pdir), because pdir is 0 (null).
It happens here:

01/03/2024 10:03:01: ../Core/lfs/lfs.c:2438:debug: Relocating {0xa5d, 0x1613} -> {0xcd1, 0xa5d}
01/03/2024 10:03:01: ../Core/lfs/lfs.c:2535:debug: Fixing move while relocating {0xa28, 0x3de} 0x0

After reboot and during mount:

01/03/2024 10:03:12: ../Core/lfs/lfs.c:4534:debug: Found pending gstate 0xcff0000000000a28000003de
01/03/2024 10:03:12: ../Core/lfs/lfs.c:4883:debug: Fixing move {0xa28, 0x3de} 0x0
01/03/2024 10:03:12: ../Core/lfs/lfs.c:4960:debug: Fixing half-orphan {0xa5d, 0x1613} -> {0xcd1, 0xa5d}
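
The pending gstate value in the mount log can be decoded by hand to cross-check it against the "Fixing move" line. The following is a minimal host-side sketch, assuming the value is printed as three 32-bit words (tag, pair[0], pair[1]) and that the tag follows the bit layout used by the tag helpers in lfs.c (an 11-bit type, a 10-bit id, and a 10-bit size below the top bit). The names `gstate_fields` and `gstate_parse` are hypothetical and not part of littlefs.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Hypothetical container for a decoded gstate; not a littlefs type.
struct gstate_fields {
    uint32_t tag;      // raw first word
    uint32_t pair[2];  // metadata pair the pending move refers to
    uint16_t type3;    // 11-bit type field of the tag
    uint16_t id;       // 10-bit id field of the tag
    uint16_t size;     // 10-bit size field of the tag
    bool has_move;     // true if the top type bits are nonzero
};

// Parse the 24 hex digits printed after "Found pending gstate".
// Accepts an optional "0x" prefix. Returns false on malformed input.
static bool gstate_parse(const char *hex, struct gstate_fields *out) {
    assert(hex != NULL && out != NULL);

    if (strncmp(hex, "0x", 2) == 0) {
        hex += 2;
    }
    if (strlen(hex) != 24) {
        return false;
    }
    if (sscanf(hex, "%8" SCNx32 "%8" SCNx32 "%8" SCNx32,
            &out->tag, &out->pair[0], &out->pair[1]) != 3) {
        return false;
    }

    // Assumed field layout, mirroring the masks of the lfs.c tag helpers.
    out->type3 = (uint16_t)((out->tag & 0x7ff00000u) >> 20);
    out->id = (uint16_t)((out->tag & 0x000ffc00u) >> 10);
    out->size = (uint16_t)(out->tag & 0x000003ffu);
    out->has_move = (out->tag & 0x70000000u) != 0;
    return true;
}
```

Under these assumptions, the value 0xcff0000000000a28000003de decodes to type 0x4ff (the value of LFS_TYPE_DELETE in lfs.h), id 0x0, and pair {0xa28, 0x3de}, which agrees with the "Fixing move {0xa28, 0x3de} 0x0" message logged right after it.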

We also noticed that lfs_dir_relocatingcommit is called with pdir = NULL in three points, line 2433, 2494, 2545.
Is this intentional?

Why could this problem happen, and how can we avoid it?

However, if we mount the fs at the beginning of each while cycle and unmount it at the end of the cycle, the problem does not happen.
EDIT: It still happens, only more slowly: it takes more iterations to reach the error.

Thank you

@geky added the "needs investigation" (no idea what is wrong) label Mar 8, 2024
Member

geky commented Mar 8, 2024

Hi @HariTheCoder341, thanks for creating an issue.

This is going to be difficult to debug. Is it possible to reduce this down into a locally-reproducible example? Preferably in a littlefs test case (example).

> We also noticed that lfs_dir_relocatingcommit is called with pdir = NULL in three points, line 2433, 2494, 2545.

Are you sure you're using v2.9? I don't think these line numbers quite line up.

The pdir argument here is actually a side-effect of lfs_dir_relocatingcommit. It contains the previous mdir in the dir's linked-list if the mdir needs to be dropped (mdir.count=0, LFS_OK_DROPPED). We need to find the pdir in lfs_dir_relocatingcommit to determine how to update the mlist, but we can't actually drop in lfs_dir_relocatingcommit without recursion. A lot of the mess in these functions is tip-toeing around a flattened recursive algorithm.

But not every call to lfs_dir_relocatingcommit can drop. This would not be solved by always providing pdir, because the layer above would not be able to handle the resulting LFS_OK_DROPPED state correctly.

The "Fixing move while relocating" message (here) is followed immediately by an lfs_dir_relocatingcommit call with a delete tag, which is probably resulting in a drop, which littlefs doesn't expect.
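
To make the pdir contract easier to picture, here is a deliberately simplified toy model, not littlefs code. It assumes a drop only happens when a delete empties an mdir that has a predecessor with its split flag set, and it returns an error code at the point where the real code would hit LFS_ASSERT(pdir). All names (`toy_mdir`, `toy_commit_delete`, the `TOY_*` codes) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical result codes for the toy model.
enum toy_state {
    TOY_OK = 0,           // commit done, nothing dropped
    TOY_OK_DROPPED = 1,   // dir became empty, caller must unlink it from pdir
    TOY_ERR_NO_PDIR = -1, // a drop was needed but the caller passed no pdir
};

// Hypothetical, heavily reduced stand-in for a metadata dir.
struct toy_mdir {
    int count;              // number of entries in this mdir
    bool split;             // true if this mdir links to a following mdir
    struct toy_mdir *pred;  // previous mdir in the linked list, or NULL
};

// Delete one entry. pdir is an out-parameter: it is only written when the
// dir must be dropped, so callers that cannot handle a drop pass NULL.
static int toy_commit_delete(struct toy_mdir *dir, struct toy_mdir *pdir) {
    assert(dir != NULL);

    if (dir->count > 0) {
        dir->count -= 1;
    }

    // An emptied mdir hanging off a split predecessor has to be dropped.
    if (dir->count == 0 && dir->pred != NULL && dir->pred->split) {
        if (pdir == NULL) {
            // Corresponds to the LFS_ASSERT(pdir) failure in the report.
            return TOY_ERR_NO_PDIR;
        }
        *pdir = *dir->pred;
        return TOY_OK_DROPPED;
    }
    return TOY_OK;
}
```

In this model, passing a non-NULL pdir everywhere would only silence the error path. The caller would still have to act on TOY_OK_DROPPED, which is the part the NULL-passing call sites are not written to do.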

@HariTheCoder341
Author

Hi @geky and thank you for your quick reply.

I confirm we are using 2.9.

I'm trying to reproduce the problem with a specific test function (written in C), attached.
With this function, the problem does not appear: we reached 725000 iterations without any reset over 1 day of running.
Our real process involves attributes. I don't know whether that is related, but I don't think so.
I will now try to enrich the test function by appending at the end of the process, instead of removing, in order to mimic the real scenario.

We also tried replacing (in the real environment) the lfs_rename function with a custom function that performs a file copy.
With this change, the problem does not occur.

I'll keep investigating and let you know...

test.zip

@HariTheCoder341
Author

Hi,
In the past few days I tried to write a test to reproduce the problem, without success.
The operations are the same as in the real application, but in a Linux environment they do not lead to the problem, which instead persists on STM32.
I attach the test code.

I confirm that by replacing the lfs_rename with a copy function, which replicates the purpose, we have no reset after extensive and continuous testing for 10 consecutive days.

I also generated a log file with LFS_YES_TRACE active, but it exceeds 1.5 GB, so I don't think it can be attached here or that it would be of much help. I'm only publishing the final part.

It takes a day and a half for the problem to occur... not knowing how to proceed, we will comment out the lfs_rename in favor of the copy for now.

lfstest.zip
log_pub.zip
