Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure parse_network_config uses fallback cfg when generate IMDS network cfg fails #549

Merged

Conversation

johnsonshi
Copy link
Contributor

@johnsonshi johnsonshi commented Aug 26, 2020

Azure datasource's parse_network_config throws a fatal uncaught exception when an exception is raised during generation of network config from IMDS metadata. This happens when IMDS metadata is invalid/corrupted (such as when it is missing network or interface metadata). This causes the rest of provisioning to fail.

This changes parse_network_config to be a non-fatal implementation. Additionally, when generating network config from IMDS metadata fails, fall back on generating fallback network config (_generate_network_config_from_fallback_config).

This also changes fallback network config generation (_generate_network_config_from_fallback_config) to blacklist an additional driver: mlx5_core.

@johnsonshi
Copy link
Contributor Author

@anhvoms @Moustafa-Moustafa @trstringer This prevents invalid/corrupted IMDS network metadata from causing provisioning as a whole to fail.

cloudinit/sources/DataSourceAzure.py Outdated Show resolved Hide resolved
cloudinit/sources/DataSourceAzure.py Outdated Show resolved Hide resolved
cloudinit/sources/DataSourceAzure.py Outdated Show resolved Hide resolved
@blackboxsw
Copy link
Collaborator

@johnsonshi thanks for this PR, could you attach as a comment the response of cloud-init query -all (I'm specifically interested in the network section as surfaced by IMDS on one of these failed nodes). If reproducible

Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One general question I have is if IMDS is in a state that cloud-init can't parse is IMDS network config content recoverable? As in, would retries buy us anything in these cases?

@johnsonshi
Copy link
Contributor Author

One general question I have is if IMDS is in a state that cloud-init can't parse is IMDS network config content recoverable? As in, would retries buy us anything in these cases?

I'm afraid not. The IMDS inconsistencies and delays aren't deterministic, and it's impossible to "determine" whether the complete network metadata returned is even complete or not. The root cause of these inconsistencies is due to a platform issue.

@johnsonshi
Copy link
Contributor Author

@johnsonshi thanks for this PR, could you attach as a comment the response of cloud-init query -all (I'm specifically interested in the network section as surfaced by IMDS on one of these failed nodes). If reproducible

We discovered these by looking at the cloud-init.log files of failed deployments. The IMDS response contained no network key in the instance metadata. We could wait until network is available in the response. However, normal behavior for IMDS is to not have network be delayed (it should be available way before). If there is a delay in network metadata availability, then it may be indicative of deeper platform issues (perhaps even the data within network isn't even complete).

2020-07-30 19:20:16,722 - handlers.py[DEBUG]: start: azure-ds/parse_network_config: 
2020-07-30 19:20:16,722 - DataSourceAzure.py[DEBUG]: Azure: generating network configuration from IMDS
2020-07-30 19:20:16,722 - handlers.py[DEBUG]: finish: azure-ds/parse_network_config: FAIL: 
2020-07-30 19:20:16,722 - util.py[WARNING]: failed stage init-local
2020-07-30 19:20:16,730 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 362, in main_init
    init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 678, in apply_network_config
    netcfg, src = self._find_networking_config()
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 643, in _find_networking_config
    if self.datasource and hasattr(self.datasource, 'network_config'):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 799, in network_config
    self._network_config = parse_network_config(nc_src)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 1332, in parse_network_config
    network_metadata = imds_metadata['network']
KeyError: 'network'

@blackboxsw blackboxsw self-assigned this Sep 10, 2020
Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @johnsonshi sorry for the delay here on response. I think we can drop that UT you mentioned and please update the pull request description to make note of the additional mlx5_core driver blacklist functionality on fallback config.

We'll be using the pull request description as your squashed merge commit for this PR when this lands.

cloudinit/sources/DataSourceAzure.py Show resolved Hide resolved
tests/unittests/test_datasource/test_azure.py Show resolved Hide resolved
@johnsonshi
Copy link
Contributor Author

Apologies I've got a couple of high priority items so this has been delayed.

@johnsonshi
Copy link
Contributor Author

@blackboxsw I've updated the PR description + added the comments on the VM instance SKUs with mlx4 and mlx5 hardware. Also dropped the redundant UT as previously discussed.

Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent @johnsonshi thanks for the additions. I've launched and upgraded to this version cloud-init and see correct network configuration emitted and network properly setup on non-infiniband eth0 device.

ubuntu@SRU-worked-azure:~$ grep Trace /var/log/cloud-init.log 
ubuntu@SRU-worked-azure:~$ cloud-init status --long
status: done
time: Thu, 24 Sep 2020 16:43:35 +0000
detail:
DataSourceAzure [seed=/var/lib/waagent]
ubuntu@SRU-worked-azure:~$ cat /etc/netplan/50-cloud-init.yaml 
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: true
            dhcp4-overrides: &id001
                route-metric: 100
            dhcp6: true
            dhcp6-overrides: *id001
            match:
                driver: hv_netvsc
                macaddress: 00:0d:3a:e2:d9:0e
            set-name: eth0
        eth1:
            dhcp4: true
            dhcp4-overrides: &id002
                route-metric: 200
            dhcp6: true
            dhcp6-overrides: *id002
            match:
                driver: hv_netvsc
                macaddress: 00:0d:3a:e2:de:17
            set-name: eth1
    version: 2
ubuntu@SRU-worked-azure:~$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:e2:d9:0e brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.4/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 ace:cab:deca:deed::4/128 scope global dynamic noprefixroute 
       valid_lft 17279947sec preferred_lft 8639947sec
    inet6 fe80::20d:3aff:fee2:d90e/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:e2:de:17 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fee2:de17/64 scope link 
       valid_lft forever preferred_lft forever
4: rename4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 00:0d:3a:e2:d9:0e brd ff:ff:ff:ff:ff:ff
5: rename5: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth1 state UP group default qlen 1000
    link/ether 00:0d:3a:e2:de:17 brd ff:ff:ff:ff:ff:ff

@blackboxsw blackboxsw merged commit 4316490 into canonical:master Sep 24, 2020
@johnsonshi johnsonshi deleted the make-azure-parse-network-config-non-fatal branch September 25, 2020 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants