Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double swap on interrupted revert #485

Merged
merged 5 commits into from
May 31, 2019

Conversation

ccollins476ad
Copy link
Collaborator

@ccollins476ad ccollins476ad commented May 23, 2019

This fixes #480.

When mcuboot rewrites image trailers during a swap, some information is lost. If a reset occurs before the swap completes, mcuboot may not be able to determine which swap type to resume upon startup. Specifically, if a "revert" swap gets interupted, mcuboot will perform an extraneous swap on the subsequent boot. See #480 for details.

NOTE: A different PR (#482) adds unit tests which test this fix.

This PR adds an additional field to the image trailer: swap-type. The new trailer structure is illustrated below:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ~                                                               ~
    ~    Swap status (BOOT_MAX_IMG_SECTORS * min-write-size * 3)    ~
    ~                                                               ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                 Encryption key 0 (16 octets) [*]              |
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                 Encryption key 1 (16 octets) [*]              |
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                      Swap size (4 octets)                     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   Swap type   |           0xff padding (7 octets)             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   Copy done   |           0xff padding (7 octets)             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   Image OK    |           0xff padding (7 octets)             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                       MAGIC (16 octets)                       |
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

[*]: Only present if the encryption option is enabled (`MCUBOOT_ENC_IMAGES`).

The swap-type field contains one of the BOOT_SWAP_TYPE_[...] constants. Every time a trailer is written, this field is written along with it. When resuming an interrupted swap, mcuboot uses this field alone to determine the type of swap being resumed. For new swap operations (non-resume case), this field is not read at all; instead, mcuboot consults the boot_swap_tables array to determine the swap operation to perform (as it did prior to this PR).

Some additional changes were necessary to make all the simulated unit tests pass:

  • Before initiating a new swap operation, always write the image trailer to the scratch area. This step allows mcuboot to persist the swap-type field somewhere before erasing the trailer in the primary slot. If a reset occurs immediately after the erase, mcuboot recovers by using the trailer in the scratch area.

  • Related to the above: if the scratch area is being used to hold status bytes (because there are no spare sectors in the primary slot), erase the scratch area immediately after the trailer gets written to the primary slot. This eliminates ambiguity regarding the location of the current trailer in case a reset occurs shortly afterwards.

@MariuszSkamra
Copy link

I confirm this fix. I've tested this PR with my app from #480.

@ccollins476ad ccollins476ad force-pushed the prev-swap branch 2 times, most recently from 0aa34d3 to 8fd2b83 Compare May 24, 2019 16:43
@ccollins476ad
Copy link
Collaborator Author

Note: I force pushed a commit with documentation and comment changes. I did not change any code with this force push. (I messed up the trailer structure representation in my first commit).

@utzig
Copy link
Member

utzig commented May 24, 2019

@ccollins476ad Please cherry-pick the commit from PR #482

@ccollins476ad
Copy link
Collaborator Author

Thanks for the reminder, @utzig. I have pulled in your commits from #482.

@utzig
Copy link
Member

utzig commented May 29, 2019

@ccollins476ad From your commit message:

Prior to this change, the scratch image trailer had a different format from a slot image trailer. Specifically, the copy_done field was excluded from the scratch trailer.

More importantly than copy_done was that in the scratch area we only ever store a single sector's state, so for it's usage calculation there was: BOOT_STATUS_STATE_COUNT * min_write_sz, while for slotX it was: BOOT_STATUS_MAX_ENTRIES * BOOT_STATUS_STATE_COUNT * min_write_sz. Now BOOT_STATUS_MAX_ENTRIES is user configurable and might end up being a big value; not to say there's anything wrong with the change, but I thought this correction was requried! :-P

@ccollins476ad
Copy link
Collaborator Author

@utzig, that is a good point, and something that I completely missed. At the very least, I will update the commit message.

Do you think this could cause a problem? I think the scratch area is guaranteed to be large enough to accommodate this change, but I could use a second opinion!

@utzig
Copy link
Member

utzig commented May 29, 2019

@ccollins476ad I don't think it will cause issues; so I was thinking about this and it seems to me that if you cannot fit the data of the last sector with metadata on scratch, than it won't fit on the last sector as well, so in the end this seems like a good move into simplifying the code. Also metadata on scratch is only ever used it there's data to be swapped in any slot's last sectors, thus it's a bit of a corner case that would be better off having a specific test added on the simulator. Your changes LGTM, I just need to finish reviewing one commit (the big one!).

@@ -616,6 +642,15 @@ boot_set_pending(int permanent)
rc = boot_write_image_ok(fap);
}

if (rc == 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In BOOT_MAGIC_GOOD case swap_type should be checked and checked in case operation was interrupted here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think anyone would be able to schedule a new swap operation if we were resuming an interruption; what exact issue do you have mind?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think we need to worry about the resume case here. boot_set_pending() doesn't get called from the boot loader, it gets called from the application. If the application is running, there can't be a swap in progress.

This is a little weird, though. If the application tries to schedule a swap operation, and one is already scheduled, we just return 0. It seems like we ought to check if the two swap operations have the same type. If they are different, we should return an error. I think this should go in a separate PR, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was a power-down had occurred just before writing the swap_type. Shouldn't an application be able to recognize such a undone request unambiguously, and be able to renew boot_set_pending() after that? For sure it is possible to check that the application had already collected proper upgrade image.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvlsianpu I think I see what you mean.

Application schedules a "test" swap. This requires two flash writes:

  1. Write magic to secondary trailer.
  2. Write swap_type to secondary trailer.

If the device reboots after step 1 (before step 2), the trailer will be incomplete.

However, I don't think this is a problem. Since this is a new swap (not a resume), mcuboot won't even look at the swap_type field at startup. It should execute the "test" swap as expected.

If mcuboot then reboots in the middle of this swap, then we have a potential issue. Mcuboot will recognize that a swap was interrupted and needs to be resumed, but it won't be able to determine the swap type. I think this just puts us back where we were before this bug fix ("revert" will fail; all other swap types happen to work by chance). Since you cannot specify a "revert" swap type with the boot_set_pending() function, I don't think there is a problem here.

I do think there is a general issue of the device resetting with a partially-written trailer. This issue has always existed, though, so it is probably best left for a separate PR.

Please let me know if I am still missing something.

@utzig
Copy link
Member

utzig commented May 30, 2019

@ccollins476ad Please fixup the last commit.

ccollins476ad and others added 5 commits May 30, 2019 08:44
Prior to this change, the scratch image trailer had a different format
from a slot image trailer.  Specifically:

1. The scratch trailer only contained a single set of status entries
   (three bytes); the slot trailer contained `BOOT_STATUS_MAX_ENTRIES`
   sets of status entries.

2. The scratch trailer did not contain the `copy_done` field.

This inconsistency required some extra conditional logic in the trailer
handling code.  It is simpler to just use the same trailer format
everywhere.

This commit removes this inconsistency.  Now, the scratch trailer
structure is identical to that of the slot trailer.

Signed-off-by: Christopher Collins <ccollins@apache.org>
Signed-off-by: Christopher Collins <ccollins@apache.org>
This fixes mcu-tools#480.

When mcuboot rewrites image trailers during a swap, some information is
lost.  If a reset occurs before the swap completes, mcuboot may not be
able to determine what which swap type to resume upon startup.
Specifically, if a "revert" swap gets interupted, mcuboot will perform
an extraneous swap on the subsequent boot.  See
mcu-tools#480 for details.

This commit adds an additional field to the image trailer: `swap-type`.
The new trailer structure is illustrated below:

```
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ~                                                               ~
    ~    Swap status (BOOT_MAX_IMG_SECTORS * min-write-size * 3)    ~
    ~                                                               ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ~                 Encryption key 0 (16 octets) [*]              ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ~                 Encryption key 1 (16 octets) [*]              ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |           Swap size           |    0xff padding (4 octets)    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   Swap type   |           0xff padding (7 octets)             ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   Copy done   |           0xff padding (7 octets)             ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |   Image OK    |           0xff padding (7 octets)             ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    ~                       MAGIC (16 octets)                       ~
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

The `swap-type` field contains one of the `BOOT_SWAP_TYPE_[...]` constants.
Every time a trailer is written, this field is written along with it.
When resuming an interrupted swap, mcuboot uses this field alone to
determine the type of swap being resumed. For new swap operations
(non-resume case), this field is not read at all; instead, mcuboot
consults the `boot_swap_tables` array to determine the swap operation to
perform (as it did prior to this commit).

Some additional changes were necessary to make all the simulated unit
tests pass:

* Before initiating a new swap operation, always write the image trailer
to the scratch area.  This step allows mcuboot to persist the
`swap-type` field somewhere before erasing the trailer in the primary
slot.  If a reset occurs immediately after the erase, mcuboot recovers
by using the trailer in the scratch area.

* Related to the above: if the scratch area is being used to hold status
bytes (because there are no spare sectors in the primary slot), erase
the scratch area immediately after the trailer gets written to the
primary slot.  This eliminates ambiguity regarding the location of the
current trailer in case a reset occurs shortly afterwards.

Signed-off-by: Christopher Collins <ccollins@apache.org>
Make images slightly larger to allow more swap status metadata to be
written to flash, to increase amount of debugging info and possibility
of failures on random write fails.

Signed-off-by: Fabio Utzig <utzig@apache.org>
Signed-off-by: Christopher Collins <ccollins@apache.org>
This extends the test+revert case with an interruption on the revert
stage, as it was previously only interrupted on the test stage. For
simplicity the interruption happens on the same interruption point for
both test and revert stages.

Signed-off-by: Fabio Utzig <utzig@apache.org>
Signed-off-by: Christopher Collins <ccollins@apache.org>
@ccollins476ad
Copy link
Collaborator Author

Thanks, @utzig. I have fixed up that commit, and reworded the "scratch trailer structure" commit as discussed above.

Copy link
Member

@utzig utzig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I suggest waiting for an approval by @nvlsianpu as well!

@ccollins476ad ccollins476ad merged commit fc07eab into mcu-tools:master May 31, 2019
@ccollins476ad ccollins476ad deleted the prev-swap branch May 31, 2019 17:15
@utzig utzig mentioned this pull request Jun 4, 2019
nvlsianpu added a commit to nvlsianpu/zephyr that referenced this pull request Jun 4, 2019
Mcuboot changed TLV in PR: Fix double swap on interrupted revert
(mcu-tools/mcuboot#485)

Above bugfix changes a little way for upgrade request.

This path introduces re-formatted original mcuboot bootutil_misc.c
code as much as it was reasonable which includes the bugfix
adaptation and support for devices witch have bite erased sate different
than 1 as well.

Signed-off-by: Andrzej Puzdrowski <andrzej.puzdrowski@nordicsemi.no>
carlescufi pushed a commit to zephyrproject-rtos/zephyr that referenced this pull request Jun 18, 2019
Mcuboot changed TLV in PR: Fix double swap on interrupted revert
(mcu-tools/mcuboot#485)

Above bugfix changes a little way for upgrade request.

This path introduces re-formatted original mcuboot bootutil_misc.c
code as much as it was reasonable which includes the bugfix
adaptation and support for devices witch have bite erased sate different
than 1 as well.

Signed-off-by: Andrzej Puzdrowski <andrzej.puzdrowski@nordicsemi.no>
@nvlsianpu nvlsianpu added this to the 1.3.1 milestone Jul 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watchdog reset during bootloader stage breaks image swap
4 participants