24.04: server: cloud-init: ssh_import_id not working due to network not yet ready #757
Comments
|
Hmm, at first thought: I have the systemd override below because cloud-init hangs for 120 seconds at boot if no network is found, so I lowered the timeout to 10 seconds. Maybe this is the culprit? https://github.com/Joshua-Riek/ubuntu-rockchip-settings/blob/noble/data/server/override.conf The full path of the override is /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf |
|
It renders itself as /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf in the end? I can try a new flash and hot patch the file before first boot =) |
|
Yeah, give that a try. I'd change the value to something more reasonable like 60 seconds; either way, I now think 10 seconds may be too aggressive. |
Looks like it has no effect, and judging by the uptime in the log it does not wait at all. Maybe systemd-networkd-wait-online.service considers loopback enough? |
|
I think it's because systemd handles the DNS (this has been a pain to deal with tbh). For example, if you flash the OS to an SD Card and chroot into it, you won't be able to run apt update because DNS is expected to be handled by systemd. Try to modify the file |
|
I don't think it's running the command at all. Running it as configured, it blocks forever, as it seems to wait for all devices to be up, and I only have 1 of 2 ports connected. |
|
Maybe we want to run Could the double ExecStart= assignment be an issue? |
Double |
I didn't chroot into it. I flashed it, then mounted the ext4 partition, modified the file, unmounted, powered off, pulled out the SD card and booted. Sure, I can give resolv.conf a try. However, looking at the logs, only loopback is up, and very close in time to that output it gives the error. Resolving hosts etc. works fine from the machine, and when I run the command post-boot it does not return before the timeout, because it seems to expect all interfaces to be up. |
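On the double `ExecStart=` question raised above: in systemd, assigning an empty value to `ExecStart=` resets the list of commands, so a drop-in that wants to replace (rather than add to) the packaged command normally starts with an empty assignment. The sketch below models only that reset rule in Python. It is not systemd's parser, and the unit contents are hypothetical stand-ins, not the actual override.conf from the linked repository.

```python
def effective_exec_start(unit_lines):
    """Return the ExecStart= commands left after applying the reset rule.

    Simplified on purpose: only ExecStart= lines are considered; sections,
    line continuations and every other directive are ignored.
    """
    commands = []
    for raw in unit_lines:
        line = raw.strip()
        if not line.startswith("ExecStart="):
            continue
        value = line[len("ExecStart="):].strip()
        if value == "":
            commands = []  # an empty assignment clears all earlier entries
        else:
            commands.append(value)
    return commands


# Hypothetical contents: the packaged unit followed by a drop-in override.
packaged = ["ExecStart=/lib/systemd/systemd-networkd-wait-online"]
drop_in = [
    "ExecStart=",
    "ExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=10",
]

print(effective_exec_start(packaged + drop_in))
```

Under this model the double assignment is the expected idiom rather than a bug: leaving out the empty line would keep the packaged command alongside the new one.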
|
resolv.conf seems to be a link to a non-existent file. I'll remove the link and replace it with a regular file with content. Still not working. I really think this wait thing is not running at all, or maybe it's running, but not as a prerequisite to "modules:config". |
|
Sidenote: Found this in the log too I think this part I didn't modify =) EDIT: link: https://github.com/Joshua-Riek/ubuntu-rockchip/blob/main/overlay/boot/firmware/user-data#L28 The error to resolve is not too unexpected as we use google dns now (and probably no hosts file "hack") =) But it seems it is not happy about groups under chpasswd. |
|
No idea what I am looking at but based on a google result of someone mentioning a similar issue I found an interesting command to run I have no idea what it does, but looks like "network-pre.target" ran before "cloud-init-local.service", maybe that wait thing needs to be in "network-pre.target"? https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1636912 Full output |
|
Very new to cloud-init, but maybe some of the "wants" are wrong. In "cloud-init.service" I find this. Maybe whatever wait command we want to override should be in the After script, and we are not in the Before? (So it runs after?) |
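For readers following the After/Before question: `Before=X` in unit A means the same as `After=A` in unit X, and both only order units that are already part of the boot transaction. A minimal sketch of that ordering idea, using Python's `graphlib`, is below. The edges are illustrative assumptions for this thread and were not read from a real system.

```python
from graphlib import TopologicalSorter

# unit -> set of units it is ordered After= (illustrative edges only)
after = {
    "systemd-networkd-wait-online.service": {"systemd-networkd.service"},
    "network-online.target": {"systemd-networkd-wait-online.service"},
    "cloud-config.service": {"network-online.target"},
}

# static_order() yields every unit after all of the units it depends on.
order = list(TopologicalSorter(after).static_order())
print(order)
```

Ordering alone does not start anything. If the wait-online service is not pulled in by a `Wants=` or `Requires=` somewhere, an `After=network-online.target` has nothing to wait for, which would be consistent with the stages running back to back.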
|
Found a verbose cloud-init log file =) https://gist.github.com/nilo85/16224aa34b79998a328bd0a5ddc888c0 |
|
It seems there is less than a second between cloud-init:init and cloud-init:modules:config, and I found this in the cloud-init docs:
So I suspect whatever is supposed to wait for the network doesn't really work (and I bet the command always times out, since without --any it seems to wait for all interfaces to come online). If this --timeout 120 ran before the modules, I would expect to see 120s pass between the stages. Time to go to bed =) |
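The difference between the default mode and `--any` can be pictured with a toy model. This is a deliberately simplified sketch, not systemd-networkd-wait-online's implementation: it ignores failed links, carrier states and timeouts, and the `Link` class and port names are hypothetical. The `required` flag stands in for netplan's `optional: true`, which tells networkd not to wait for that interface.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    online: bool            # carrier detected and configuration finished
    required: bool = True   # False stands in for netplan's "optional: true"


def wait_online_would_return(links, any_link=False):
    """Toy model: would the wait finish right now, given these link states?"""
    considered = [link for link in links if link.required]
    if not considered:
        raise ValueError("no required links: this case is not modelled here")
    if any_link:
        return any(link.online for link in considered)
    return all(link.online for link in considered)


# Hypothetical two-port board with only the first port plugged in.
ports = [Link("port0", online=True), Link("port1", online=False)]
print(wait_online_would_return(ports))                 # default mode
print(wait_online_would_return(ports, any_link=True))  # --any mode
```

In this model the unplugged second port keeps the default mode waiting, while `--any`, or marking the unused port as optional, lets the wait finish.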
|
I found out you can get a plotted image of the whole boot process with "systemd-analyze plot > something.svg"; maybe this shows what went on.
EDIT: It looks like the wait script runs concurrently with the cloud-init config stage (and in the wrong order).
EDIT2: I now see there is both a .target and a .service. The .service is probably the one we want; it runs a bit further down, after network-online.target. |
|
I just flashed 24.04 on an RPi4, and this is how the paths differ. It seems clear to me that on official RPi Ubuntu, cloud-config is supposed to run after network-online.target. Checking the paths for network-online.target: the content of this file is identical, and both seem to be after network-online.target.
EDIT: Now that I think about it, none of the plots contained the wait-online.service we "overrode", and on the RPi no such file exists, so I'm not sure what we override =)
EDIT: I patched the ssh-import-id script to run "ip a" and "nslookup google.com". Judging by the output, the network is indeed not up at the point it tries to run it.
EDIT: Wait, this looks weird? https://github.com/Joshua-Riek/ubuntu-rockchip-settings/blob/noble/data/meson.build#L8
EDIT: I noticed that the override was going to /lib/... and not /usr/lib like the default one. However, that didn't make a difference. Now, editing the cloud-config, it looks like maybe this getty thing might be the issue after all? It overrides Before, maybe =)
EDIT: No, I guess not, because After is still network-online.target. But why didn't it show up in the graph? This is a crash course in systemd =D
EDIT: Comparing the journal with "journalctl -b", I see that on the OPi the network is considered up immediately, while on the RPi it is way further down, so I wonder if there might be some issue with the netplan config.
EDIT: netplan looks identical except en* vs eth0. However, running "sudo journalctl -xeu systemd-networkd.service" I can see that at 08:45:13 it gets an IP from DHCP, but I get this a bit too early.
EDIT: Found something!!! |
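The "nslookup" check described above can also be written as a small polling helper, which makes it easy to see how long name resolution takes to become available. This is only a diagnostic sketch: `wait_for_dns` is a hypothetical helper, not part of cloud-init or ssh-import-id, and the resolver, sleep and clock are passed in so the logic can be exercised without a network.

```python
import socket
import time


def wait_for_dns(host, timeout=60.0, interval=1.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until `host` resolves; return False once `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        try:
            resolve(host, 443)
            return True
        except OSError:  # socket.gaierror is a subclass of OSError
            if clock() >= deadline:
                return False
            sleep(interval)


# Example call (needs a working network, so it is left commented out):
# print(wait_for_dns("github.com", timeout=30))
```

A helper like this only shows when DNS becomes usable. It does not fix the ordering problem discussed in this thread.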
|
Based on my latest finding, I have a strong suspicion this is the issue. Someone else had a similar issue updating to 24.04 from 23.XX here: https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/2063973/comments/3 Due to some name mismatch in initramfs vs. some udev renaming, could we have a similar issue? =) RPi netplan config: OPi netplan config: If I get this right, it probably does not like "zz-all-en" and expects "enP4p65s0".
EDIT: I patched the cloud-init provided network file with the explicit name of the interface, but still the same issue. I think I am so close to figuring it out, but maybe the next step is the udev rules magic knowledge etc. that I am missing. I have my hopes up that @Joshua-Riek can interpret this =)
FINAL EDIT: I got it working; this is the content of my network-config file: Not sure if it's the renderer and/or the removal of optional that did the trick in the end. However, the RPi config had optional, so I'm not sure. But now it works for my setup 🥳 Maybe this is the root cause of your issue where you wanted to shorten the timeout in the first place? |
|
Sorry, just catching up on this now. Because there are so many supported devices with different network interfaces, I tried to use a generic config to match all Ethernet interfaces (hence the "zz-all-en" and "zz-all-eth"). I think that I will need to keep track of the network interfaces for each board and set them accordingly during the image creation process. Thanks for the testing and for looking into this a whole bunch :) |
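One way to picture the per-board approach mentioned above is a small generator that writes a netplan-style network-config with an explicit interface name. This is a hedged sketch and not the project's build code: `render_network_config` and the `BOARD_INTERFACES` mapping are hypothetical, the board name is made up, and only the interface name "enP4p65s0" comes from this thread. The output is also not the network-config file from the earlier comment, whose content is not reproduced here.

```python
def render_network_config(interface, renderer="networkd", dhcp4=True,
                          optional=False):
    """Render a minimal netplan v2 document for one Ethernet interface."""
    lines = [
        "network:",
        "  version: 2",
        f"  renderer: {renderer}",
        "  ethernets:",
        f"    {interface}:",
        f"      dhcp4: {'true' if dhcp4 else 'false'}",
    ]
    if optional:
        # "optional: true" tells networkd not to wait for this interface.
        lines.append("      optional: true")
    return "\n".join(lines) + "\n"


# Hypothetical board-to-interface table kept by the image build.
BOARD_INTERFACES = {"example-board": "enP4p65s0"}

print(render_network_config(BOARD_INTERFACES["example-board"]))
```

Generating the file from a table keeps the interface name explicit per board, at the cost of having to maintain that table as boards are added.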
|
It was a nice journey into the modern Linux userspace =D The last time I touched something like this, init.d was the newest, coolest thing =D I think what I discovered here works well enough as a workaround for now. Users who depend on cloud-init also have control over network-config via the FAT partition, so it shouldn't be a biggie =) I have just invested enough time in my little setup that I really don't want to end up adding another node and having to run a lot of setup commands manually, so for me it was critical to get cloud-init working =D |
|
@nilo85 nice work debugging this
@Joshua-Riek Please don't do this. Overriding
Sounds likely.
I would guess the former. I'd be surprised if removing optional fixed your issue, and if it did that should probably be reported to netplan since that is the opposite of what I would expect. @Joshua-Riek Any ideas why netplan wouldn't detect the right backend automatically? Maybe Rockchip has NetworkManager installed as well? |
|
I will take a closer look into cloud-init over the weekend and see how this should be properly addressed. I've been taking a little break the past few days with the release of Ubuntu 24.04. |
|
@holmanb, I have a large portion of users who may not have ethernet and will not configure cloud-init for their use case. Because of this some users can experience a two-minute boot delay due to systemd-networkd-wait-online. Is there a way to properly adjust the timeout for systemd-networkd-wait-online or is it imperative that the service hangs for the full two minutes? |
Adjusting timeouts is the wrong approach to solving this problem. From @nilo85's comment about If you are using NetworkManager, then systemd-networkd.service (and associated units like systemd-networkd-wait-online.service) should NOT be enabled. I think that this is the real fix to this issue - pick one or the other. |
|
@Joshua-Riek If you actually do want to use systemd-networkd rather than NetworkManager, then I suggest that you revert this change:
Per the docs:
You actually do want However, if you decide to use only NetworkManager, then it really doesn't matter since this config only affects systemd-networkd - just make sure that you have systemd-networkd disabled. |
|
Thanks @holmanb for the detailed information, this is very insightful. I double checked and NetworkManager is not installed on the system, networkd is being used. That being the case, when using |
Thanks for checking, nevermind then on that.
Thanks for checking. This sounds like a possible repeat of LP: #2039083, however I think that the version of systemd shipped in 24.04 should have fixed this? Perhaps you could try the suggestion proposed in that bug:
|






Workaround
It turns out something changed in Ubuntu that doesn't like interfaces without strict matches, or something like that. In the post linked below, I link to an issue with a description and explain how I worked around it by providing my own network-config file for cloud-init.
#757 (comment)
Issue
Congrats on releasing 24.04! You worked long and hard for this.
I finally got around to playing with cloud-init, and I am currently having issues with importing SSH keys via GitHub.
I have the following snippet in user-data
cloud init logs look like this
It looks like it is trying too early, before the network is ready. It could very well be me doing something wrong, but all the other examples of ssh-import-id I have seen do not seem to do anything special, so maybe we need to tweak something here.