Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: extend support to all Jetsons #10

Closed
wants to merge 33 commits into from

Conversation

madisongh
Copy link

Note that I still consider this a work in progress, but I thought I'd get these changes up for review for some early feedback.

This patch series extends Mender support to all of the current Jetson development kits: TX1, TX2, Xavier, and Nano.

Key changes:

  1. Supporting the non-U-Boot platforms (Xavier and TX2, where U-Boot is optional)
  2. Supporting Nano dev kits with SPI flash+SDcard, with U-Boot environment residing in SPI flash.
  3. Reworking the interface with nv_update_engine to eliminate the need for an ArtifactRebootEnter script.
  4. For TX2s with U-Boot, adding a mechanism to handle the possibility that the U-Boot environment location changes with an upgrade.

I've done some testing both manually and OTA on TX1, TX2 with and without U-Boot, Nano, and Xavier. Xavier has some issues that are probably not Mender-specific - this is the first time I've actually tried using the NVIDIA bootloader update support on that platform, and it looks like something may need fixing in meta-tegra. Other platforms look OK so far.

During testing I ran into some issues that required fixes in meta-tegra, so if you play with these changes, make sure you're up-to-date with the zeus branch there.

A couple of things in particular I'd like feedback on:

  • I'm not 100% sure I've got the U-boot variable migration scripts completely right, particularly for the rollback case. I'm willing to drop them altogether, since they're handling what is probably an uncommon use case.
  • I know @Ecordonnier already has a pending PR for the warrior branch for Nano support, and that uses the SDcard for U-Boot environment storage... I can see why, since it was harder to coax the Mender stuff into having it reside in the SPI flash. If that goes in, I can rework these changes to also use the SDcard for the environment (or make it an option).
  • I ran into some issues with the tegra-state-scripts recipe not propagating changes, particularly when I was deleting scripts. I was thinking of moving the installation and deployment of the scripts to recipes that need them, rather than having this separate recipe, which I think would improve the reliability.

As of L4T R32.x, nv_update_engine can be told not to reboot the
system after an update. This eliminates the need for the RebootEnter
script, which we can replace with install and rollback scripts to
handle the Tegra specifics.

This also eliminates the need for the mender-install-manual
command, since we no longer need to manually invoke the RebootEnter
script.

We also split out the U-Boot-specific scripts separately, so
they only get used on platforms with U-Boot.

Signed-off-by: Matt Madison <matt@madison.systems>
Tegra updates include the bootloaders, and if we are using U-Boot,
bootloader updates can change the location of the U-Boot environment.
So to ensure that Mender updates propagate U-Boot variables correctly,
we use scripts to ensure that needed variables are set as part of
the installation and rollback flows.

Signed-off-by: Matt Madison <matt@madison.systems>
A subset of those used in other meta layers.

Signed-off-by: Matt Madison <matt@madison.systems>
For Mender updates, we disable the nv_update_verifier service
and handle update verification through a commit state script
in the mender service.

Signed-off-by: Matt Madison <matt@madison.systems>
and rework the bbappends so we don't have to modify the tegra-binaries
recipe - doing so causes many recipes to get rebuilt when something
changes.

Signed-off-by: Matt Madison <matt@madison.systems>
U-Boot is either not supported or optional on some
Tegra platforms, so only include the mender-uboot
feature when U-Boot is the bootloader.

Signed-off-by: Matt Madison <matt@madison.systems>
A fake version for fw_printenv and fw_setenv for Mender
to use on non-U-Boot Tegra platforms.  The bootloader
interface is implemented in the state scripts instead.

Signed-off-by: Matt Madison <matt@madison.systems>
For cboot, the BUP payload is generated with the initrd image.
Also, remove the RDEPENDS on tegra-state-scripts, since that
recipe no longer installs files into the rootfs.

Signed-off-by: Matt Madison <matt@madison.systems>
to add a runtime dependency on the BUP payload and
fake U-Boot fw-utils for non-U-Boot Tegra platforms.

Signed-off-by: Matt Madison <matt@madison.systems>
We need to ensure that the state scripts are deployed before
the mender image is created.

Signed-off-by: Matt Madison <matt@madison.systems>
Bootloader redundancy isn't supported on tegra210 platforms yet,
so make sure we don't try to depend on BUP payload construction
for them.

Signed-off-by: Matt Madison <matt@madison.systems>
The variable migration scripts are *only* needed on tegra186
platforms that use U-Boot, to provide for the possibility of
relocation of the U-Boot environment during a bootloader update.

Only tegra186 and tegra194 platforms need the other scripts that
interface with NVIDIA's bootloader update system.

Signed-off-by: Matt Madison <matt@madison.systems>
to match the real u-boot-fw-utils.

Signed-off-by: Matt Madison <matt@madison.systems>
to support for TX1 and Nano development platforms.  Nano dev kits
use SPI flash for the U-Boot environment, and an SDcard for main
storage.

Signed-off-by: Matt Madison <matt@madison.systems>
to pull in a custom sdcard-layout file for Nano devkits.

Signed-off-by: Matt Madison <matt@madison.systems>
This will need to be reworked a bit for Nano production
modules, which have a different layout for their eMMCs.

Signed-off-by: Matt Madison <matt@madison.systems>
Signed-off-by: Matt Madison <matt@madison.systems>
to restore CONFIG_ENV_IS_IN_MMC.

Signed-off-by: Matt Madison <matt@madison.systems>
I thought that the Mender overrides could also be used for file paths, but
I was wrong, so handle the U-Boot vs. non-U-Boot cases via functions in
the recipe.

Signed-off-by: Matt Madison <matt@madison.systems>
@Ecordonnier
Copy link

Nice changes! I am back from holiday on the 6th and should have some time to review the branch.

Regarding the u-boot environment being in the SD-card in my PR, the reason I chose the SD-card instead of the SPI flash is that currently, in the company I am working for, only the SD card is flashed to provision a new Jetson Nano, and the SPI flash is left in its factory state. AFAIU having the environment in the SPI flash would add one more step to the provisioning process without any advantage for our use-case.

In the past, as I worked on projects having different machines/models using the same module but a different carrier-board, we solved this by using some identification resistor on the carrier-board connected to some GPIO of the module so that the software on the module could recognize the carrier-board, rather than by storing the machine ID in the SPI flash on the carrier-board. This has the advantage that it is more robust than storing the machine ID in software.
However maybe you have different use-cases in mind for which such an approach would not be feasible or less easy than storing the ID in the SPI flash.

@dwalkes
Copy link
Member

dwalkes commented Dec 31, 2019

@madisongh Thanks for sharing! This is great! I just read through, haven't tried the changes myself yet.

I'm not 100% sure I've got the U-boot variable migration scripts completely right, particularly for the rollback case. I'm willing to drop them altogether, since they're handling what is probably an uncommon use case.

Heroic efforts to handle this! I'd be less worried about them not working properly as I would be that they could somehow overwrite an area they shouldn't overwrite, although I can't think of a specific scenario where this would happen or would cause problems. If anything I'd say keep them in git and just remove from do_compile_append with a comment saying they aren't fully tested.

I know @Ecordonnier already has a pending PR for the warrior branch for Nano support, and that uses the SDcard for U-Boot environment storage... I can see why, since it was harder to coax the Mender stuff into having it reside in the SPI flash. If that goes in, I can rework these changes to also use the SDcard for the environment (or make it an option).

I don't have a strong opinion on this one, but since it's useful for @Ecordonnier to have this option it seems like it would be worth keeping it.

I ran into some issues with the tegra-state-scripts recipe not propagating changes, particularly when I was deleting scripts. I was thinking of moving the installation and deployment of the scripts to recipes that need them, rather than having this separate recipe, which I think would improve the reliability.

Fine with me.

@dwalkes
Copy link
Member

dwalkes commented Jan 6, 2020

I just read through, haven't tried the changes myself yet.

I tried today, using cboot-prebuilt configuration on TX2 and tegraflash to flash the new image.

I scp'd in the .mender file to the /tmp directory, then ran this command

mender -install /tmp/core-image-base-jetson-tx2.mender -log-file /data/mender-install.log -log-level debug

This completed successfully.
I see this output from nvbootctrl after the command is run

root@jetson-tx2:~# nvbootctrl get-current-slot
0
root@jetson-tx2:~# nvbootctrl dump-slots-info
magic:0x43424e00,             version: 3             features: 3             num_slots: 2
slot: 0,             priority: 14,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 15,             suffix: _b,             retry_count: 7,             boot_successful: 0
root@jetson-tx2:~#

Here's mender-install.log

I then rebooted, here's an example reboot log:
reboot-after-update.log

At this point I'm still on /dev/mmcblk0p1 for /. I was expecting to go to the APP2 partition for the rootfs in mmcblk0p30.

I also noticed I'm still on slot 0 after the reboot, was expecting to go to slot 1

root@jetson-tx2:~# nvbootctrl get-current-slot
0

and I noticed the retry counter went to 0 on slot 1

root@jetson-tx2:~# nvbootctrl dump-slots-info
magic:0x43424e00,             version: 3             features: 3             num_slots: 2
slot: 0,             priority: 14,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 15,             suffix: _b,             retry_count: 0,             boot_successful: 0
root@jetson-tx2:~#

I don't understand how the root filesystem is supposed to be setup in the cboot case. I see that fw_printenv and fw_setenv u-boot-fw-utils-fake files just use a /var/lib/mender/upgrade_available marker file which doesn't look like it will have anything to do with rootfs partition selection. I can see logic related to finding the other boot partition in redundant-boot-install-script but I don't see how this gets reflected in cboot or in the boot process, and I don't see anything about setting the application partition for the rootfs in the Bootloader Update section of the nvidia documentation based on nv_update_engine --install

Any pointers are appreciated. This is the first time I've tried to use cboot only as bootloader so I'm probably missing something obvious.

@madisongh
Copy link
Author

From the reboot log, it looks like there was something wrong with the BUP payload:

[ 1338.562920] reboot: Restarting system
[0000.200] C> ERROR: Highest Layer Module = 0x40, Lowest Layer Module = 0x40,
Aux Info = 0x0, Reason = 0xd

which caused the bootloader to revert back to slot 0.... the slots info you see reflect the unsuccessful boot.

The boot-time rootfs selection for the cboot case is driven by the boot slot. cboot passes the current boot slot as a kernel command line parameter, and the init script in the tegra-minimal-initramfs initrd uses that to decide which rootfs to mount. My posted changes may be missing something there, though, since the script in meta-tegra by default expects the slot 1 rootfs partition to be labeled APP_b instead of APP2.

@Ecordonnier
Copy link

I checked the code and didn't find any issue in the logic. Also see my comment, basically since my branch and yours are not compatible I think only your branch should get merged.

Feel free to take over the README.md change of my branch as well as update the wiki entry for jetson-nano once you send a PR to the upstream repository.

Regarding the u-boot environment I still think it would be better to have it in the sd-card (one step less for provisioning, meaning reduced cost of provisioning and reduced risk of error).

Signed-off-by: Matt Madison <matt@madison.systems>
to use the '_b' suffix implemented by NVIDIA's A/B boot slot
support and meta-tegra's initrd scripts. The naming isn't
important for the platforms using U-Boot, but make the
change across the board for consistency.

Signed-off-by: Matt Madison <matt@madison.systems>
@madisongh
Copy link
Author

@dwalkes I ran a bunch of tests today after reproducing the issue you're seeing. Looks like it might be related to the issue I was seeing on my Xavier. I don't see the issue with the TX2 I have set up for secure boot, though - bootloader updates work fine there (although my setup for that machine is a bit different from the default).

Bootloader updates also work fine on both Xavier and TX2/cboot when I don't include meta-mender-core/meta-mender-tegra in my layers. And they appear to work fine when using TX2/U-Boot with the mender stuff.

I'm not sure what's going on yet. I'm going to run some comparisons of the built artifacts from working vs. non-working builds and see if that will point me in the right direction.

@dwalkes
Copy link
Member

dwalkes commented Jan 9, 2020

Thanks @madisongh I spent a bit of time on it last night mostly just understanding more about how bup creation works in tegra. I was planning to take out mender layers as a next step which it looks like you already did. I should have more time to spend on it this weekend if you haven't solved by then. Were you able to find out what Reason = 0xd means in the MB1 error log? All I could find was this post and the resolution there doesn't completely make sense to me.

@madisongh
Copy link
Author

I tracked down the problem and will push updates to address that and the other issues reported here shortly.

We need back the mmc dev and part #defines, just as
we have for the TX1.

Signed-off-by: Matt Madison <matt@madison.systems>
The scripts were sending non-error output to stderr

Signed-off-by: Matt Madison <matt@madison.systems>
…e-scripts

* Split the rollback state script in two - pre-reboot and post-reboot.
  Mark the current (rolled-back-to) boot slot successful post-reboot.

* When using U-Boot on TX2, a boot failure detected after U-Boot takes
  control would desyncrhonize the the Tegra boot slot and Mender's
  rootfs slot.  Have the ArtifactRollback_Leave state script detect
  this and switch the boot slot to resynchronize them.

* Also with U-Boot on TX2, a failure in the Tegra bootloader (before
  U-Boot takes over) would cause the processor to switch boot slots
  back to the good one, and Mender would happily mark that boot as
  successful since U-Boot didn't detect the problem. To deal with
  this, add an ArtifactCommit_Enter state script to check that
  the boot slot matches the rootfs slot and if not, return an error
  so Mender rolls back the update.

* Remove the '-e' from the #! lines of the scripts.  Some use expr,
  which returns a non-zero status if the resulting expression is zero,
  and that was causing error exits from the script.

Signed-off-by: Matt Madison <matt@madison.systems>
For Jetson-Nano devices with SPI flash and SDcards, the default
is to use SPI flash for the U-Boot environment. However, if you
prefer to store the U-Boot environment on the SDcard, add

TEGRA_MENDER_UBOOT_ENV_IN_SPIFLASH = "0"

to your build configuration (e.g., local.conf) to override
the default.

Signed-off-by: Matt Madison <matt@madison.systems>
to increase the ENV partition to 256K, so it can actually
hold two full copies of the Mender-sized (128K) U-Boot
environment.

Signed-off-by: Matt Madison <matt@madison.systems>
* Set MENDER_RESERVED_SPACE_BOOTLOADER_DATA when using
  SDcard space for the U-Boot environment on Jetson-Nano
  devices.

* Decouple the sizing of the rootfs partition from the
  requested size for the image rootfs, so the partition
  is always a bit larger. In particular with SDCard
  partitioning, the image rootfs size was exceeding
  the partition size.

Also renamed 'mender_tegra_*' to 'tegra_mender_*' to
make sure that there are no conflicts with Mender's
own naming of things, and to be consistent with our
local variable names (which cannot begin with MENDER_).

Signed-off-by: Matt Madison <matt@madison.systems>
@madisongh
Copy link
Author

So "shortly" turned out to take longer than I expected, but I've pushed changes for:

  • Fixing the issue with non-U-boot devices -- the problem was with a discrepancy in setting the size of the APP partition between the real image and the initrafms image (which is where BUP payloads get generated for the cboot devices). The MB1 Boot Control Table used by the first-stage bootloader includes pointers to boot partitions located in the eMMC as direct offsets from the start of the eMMC (no reading of the GPT). Since the boot partitions come after the APP partition, sizing it differently in the two contexts caused the problem.
  • Removing the state scripts that did U-Boot environment variable migration, per earlier discussion.
  • Adding in a setting for allowing SDcard-resident U-Boot environment on Nanos.

I also did some fault injection testing and made additional changes to the state scripts to better handle update failures and rollbacks. In particular, there are checks to make sure that we don't let the Tegra A/B boot slot get out of sync with Mender's for different failure scenarios.

I'll run some further tests this weekend, but this latest should be in reasonable shape.

@dwalkes
Copy link
Member

dwalkes commented Jan 11, 2020

Thanks @madisongh!

I can confirm this fixes issues I noticed with cboot TX2 related updates. I've also tested TX2 u-boot, Nano with and without TEGRA_MENDER_UBOOT_ENV_IN_SPIFLASH and rollback in each scenario, all worked for me. I don't have an Xavier platform to test with but based on the relatively small number of changes from TX2 I'm not expecting issues with that platform.

I started a conversation on mender hub to get some feedback on the fake fw_printenv/fw_setenv approach here.

You are of course welcome to add your name to the maintainer list if you like but I'm guessing you have enough on your plate with meta-tegra. If you did want to maintain this it might make more sense to make it a part of meta-tegra. If not I'll plan to continue to maintain as I can through meta-mender-community.

Thanks for sharing your expertise and massive improvements with these contributions, very much appreciated!

The wrapper script to unlock the eMMC boot1 partition
is also needed there.

Signed-off-by: Matt Madison <matt@madison.systems>
For platforms that configure ROOTFSPART_SIZE directly we need
to base the MENDER_IMAGE_ROOTFS_SIZE setting on that value
rather than on MENDER_CALC_ROOTFS_SIZE.

Signed-off-by: Matt Madison <matt@madison.systems>
which is in bytes, rather than Kibytes.

Signed-off-by: Matt Madison <matt@madison.systems>
…tall script

Missed that on the last pass.

Signed-off-by: Matt Madison <matt@madison.systems>
@madisongh
Copy link
Author

I've pushed a handful a fixes for problems that cropped up in my testing this weekend. With these changes, Jetson-TX1 now updates properly, and the one system I have with a custom flash layout that hard-codes the value for ROOTFSPART_SIZE also works. Retested across all my devices this morning with these changes, all check out.

Let me know if you'd like me to rebase these changes and either re-push to this PR or open up a new one with the cleaned-up history.

@dwalkes
Copy link
Member

dwalkes commented Jan 12, 2020

Thanks @madisongh

Let me know if you'd like me to rebase these changes and either re-push to this PR or open up a new one with the cleaned-up history.

No need as far as I'm concerned. Assuming you want to keep these changes in meta-mender-community I think we need to wait for them to add a zeus branch before submitting a PR there. Depending on what happens with mendersoftware#119 I'll rebase these changes on top of warrior as needed.

@madisongh
Copy link
Author

You are of course welcome to add your name to the maintainer list if you like but I'm guessing you have enough on your plate with meta-tegra. If you did want to maintain this it might make more sense to make it a part of meta-tegra. If not I'll plan to continue to maintain as I can through meta-mender-community.

Let's keep it separate, although I'm open to making changes in meta-tegra to simplify the integration where it makes sense to do so. As you say, I already have my hands full, so I'm fine with handing off the ongoing maintenance, but feel free to reach out if you need something.

@madisongh
Copy link
Author

madisongh commented Jan 27, 2020

FYI, I have updates for L4T R32.3.1 on this branch. Flash layouts changed quite a bit between R32.2.x and R32.3.x, so OTA upgrades between them aren't going to be possible without a bunch of work (and even then, maybe not).

@dwalkes
Copy link
Member

dwalkes commented Jan 28, 2020

Thanks @madisongh

Flash layouts changed quite a bit between R32.2.x and R32.3.x, so OTA upgrades between them aren't going to be possible without a bunch of work (and even then, maybe not).

Would you suggest I just make the zeus branch for the meta-mender-mender community project target zeus-l4t-r32.3.1 for people who don't have specific requirements, assuming 32.3.x is most likely to be what's closest to matching flash layouts going forward?

@madisongh
Copy link
Author

Would you suggest I just make the zeus branch for the meta-mender-mender community project target zeus-l4t-r32.3.1 for people who don't have specific requirements, assuming 32.3.x is most likely to be what's closest to matching flash layouts going forward?

Might be a good idea. Another approach which I'm going to try is to use the 32.3.1 layouts with the 32.2.x-based zeus branch, to see if that will work. But going with 32.3.1 as the base has the added advantage that bootloader update support looks better in that version.

dwalkes added a commit that referenced this pull request Feb 16, 2020
…a-fix-flash"

This reverts commit 856bbb3, reversing
changes made to c41fe6e.

Performing this revert to make the rebase of zeus branches, which
weren't based on warrior, easier to perform.  See
#10 for
details.
dwalkes added a commit that referenced this pull request Feb 16, 2020
…son-nano"

This reverts commit c41fe6e, reversing
changes made to f524d9d.

Reverted to make rebase merges for zeus easier, since these were placed
on top of warrior before initial Nano support.  See

#10
and
mendersoftware#119 (comment)
@dwalkes
Copy link
Member

dwalkes commented Feb 16, 2020

I've opened #11 to track what I plan to merge on the zeus branch, targeting l4t r32.3.1 and reverting previous merges for warrior support before applying the changes on @madisongh zeus-l4t-r32.3.1 branch

@madisongh
Copy link
Author

Flash layouts changed quite a bit between R32.2.x and R32.3.x, so OTA upgrades between them aren't going to be possible without a bunch of work (and even then, maybe not).

I've been able to make this work by using an R32.3.1-based flash layout XML file for my R32.2.x builds.

dwalkes added a commit that referenced this pull request Apr 6, 2020
…a-fix-flash"

This reverts commit 856bbb3, reversing
changes made to c41fe6e.

Performing this revert to make the rebase of zeus branches, which
weren't based on warrior, easier to perform.  See
#10 for
details.
dwalkes added a commit that referenced this pull request Apr 6, 2020
…son-nano"

This reverts commit c41fe6e, reversing
changes made to f524d9d.

Reverted to make rebase merges for zeus easier, since these were placed
on top of warrior before initial Nano support.  See

#10
and
mendersoftware#119 (comment)
dwalkes added a commit that referenced this pull request Apr 19, 2020
…a-fix-flash"

This reverts commit 856bbb3, reversing
changes made to c41fe6e.

Performing this revert to make the rebase of zeus branches, which
weren't based on warrior, easier to perform.  See
#10 for
details.

Signed-off-by: Dan Walkes <danwalkes@trellis-logic.com>
dwalkes added a commit that referenced this pull request Apr 19, 2020
…son-nano"

This reverts commit c41fe6e, reversing
changes made to f524d9d.

Reverted to make rebase merges for zeus easier, since these were placed
on top of warrior before initial Nano support.  See

#10
and
mendersoftware#119 (comment)

Signed-off-by: Dan Walkes <danwalkes@trellis-logic.com>
@dwalkes dwalkes closed this Jul 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants