Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Executing extensions sometimes fails with a race condition #1267

Closed
dougclayton opened this issue Jul 19, 2018 · 3 comments
Closed

Executing extensions sometimes fails with a race condition #1267

dougclayton opened this issue Jul 19, 2018 · 3 comments

Comments

@dougclayton
Copy link

This failure was observed with the CustomScript extension 2.0.6:

2018/07/19 18:12:51.155339 INFO Azure Linux Agent Version:2.2.18
2018/07/19 18:12:51.156534 INFO OS: centos 7.4.1708
2018/07/19 18:12:51.157695 INFO Python: 2.7.5
2018/07/19 18:12:51.158454 INFO Run daemon
2018/07/19 18:12:51.159240 INFO Activate resource disk
2018/07/19 18:12:51.169186 INFO Examining partition table
2018/07/19 18:12:51.192928 INFO GPT not detected, determining filesystem
2018/07/19 18:12:51.212716 INFO sfdisk with --part-type failed [1], retrying with -c
2018/07/19 18:12:51.217037 INFO sfdisk -c -f /dev/sdb 1 -n succeeded
2018/07/19 18:12:51.217392 INFO The partition is formatted with ntfs, updating partition type to 83
2018/07/19 18:12:51.222030 INFO sfdisk with --part-type failed [1], retrying with -c
2018/07/19 18:12:51.230838 INFO sfdisk -c  /dev/sdb 1 83 succeeded
2018/07/19 18:12:51.237868 INFO Format partition [mkfs.ext4 -F /dev/sdb1]
2018/07/19 18:12:53.908282 INFO Mount resource disk [mount /dev/sdb1 /mnt/resource]
2018/07/19 18:12:54.049868 INFO Resource disk /dev/sdb is mounted at /mnt/resource with ext4
2018/07/19 18:12:54.050962 INFO Clean protocol
2018/07/19 18:12:54.051381 INFO Running default provisioning handler
2018/07/19 18:12:54.063021 INFO Copying ovf-env.xml
2018/07/19 18:12:54.138803 INFO Successfully mounted dvd
2018/07/19 18:12:54.218682 INFO Detect protocol by file
2018/07/19 18:12:54.219165 INFO Clean protocol
2018/07/19 18:12:54.220523 INFO WireServer endpoint is not found. Rerun dhcp handler
2018/07/19 18:12:54.221650 INFO Test for route to 168.63.129.16
2018/07/19 18:12:54.223949 INFO Route to 168.63.129.16 exists
2018/07/19 18:12:54.225283 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:12:54.245818 INFO Fabric preferred wire protocol version:2015-04-05
2018/07/19 18:12:54.246745 INFO Wire protocol version:2012-11-30
2018/07/19 18:12:54.247142 WARNING Server preferred version:2015-04-05
2018/07/19 18:12:58.586462 INFO Starting provisioning
2018/07/19 18:12:58.588046 INFO Handle ovf-env.xml.
2018/07/19 18:12:58.591315 INFO Set hostname [bd0fy00000W]
2018/07/19 18:12:58.623701 INFO Publish hostname [bd0fy00000W]
2018/07/19 18:12:58.772194 INFO Examine /proc/net/route for primary interface
2018/07/19 18:12:58.772661 INFO Primary interface is [eth0]
2018/07/19 18:12:58.773424 INFO interface [lo] has flags [73], is loopback [True]
2018/07/19 18:12:58.774333 INFO Interface [lo] skipped
2018/07/19 18:12:58.775383 INFO interface [eth0] has flags [4163], is loopback [False]
2018/07/19 18:12:58.776375 INFO Interface [eth0] selected
2018/07/19 18:12:58.790172 INFO Examine /proc/net/route for primary interface
2018/07/19 18:12:58.790566 INFO Primary interface is [eth0]
2018/07/19 18:12:58.791675 INFO interface [lo] has flags [73], is loopback [True]
2018/07/19 18:12:58.792211 INFO Interface [lo] skipped
2018/07/19 18:12:58.793463 INFO interface [eth0] has flags [4163], is loopback [False]
2018/07/19 18:12:58.795159 INFO Interface [eth0] selected
2018/07/19 18:13:08.792296 INFO Create user account if not exists
2018/07/19 18:13:09.076512 INFO Configure sudoer
2018/07/19 18:13:09.079451 INFO Configure sshd
2018/07/19 18:13:09.080410 INFO Disabled SSH password-based authentication methods.
2018/07/19 18:13:09.081547 INFO Configured SSH client probing to keep connections alive.
2018/07/19 18:13:09.082540 INFO Deploy ssh public key.
2018/07/19 18:13:09.126857 INFO Save custom data
2018/07/19 18:13:09.277404 INFO Event: name=WALinuxAgent, op=Provision, message=Provision succeed, duration=15225
2018/07/19 18:13:11.633707 INFO Provisioning complete
2018/07/19 18:13:11.634194 INFO RDMA capabilities are not enabled, skipping
2018/07/19 18:13:11.635203 INFO Installed Agent WALinuxAgent-2.2.18 is the most current agent
2018/07/19 18:13:11.712889 INFO Agent WALinuxAgent-2.2.18 is running as the goal state agent
2018/07/19 18:13:11.717510 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:13:11.720455 INFO Event: name=WALinuxAgent, op=HeartBeat, message=, duration=0
2018/07/19 18:13:11.721358 INFO Start env monitor service.
2018/07/19 18:13:11.722556 INFO Configure routes
2018/07/19 18:13:11.724267 INFO Gateway:None
2018/07/19 18:13:11.724931 INFO Routes:None
2018/07/19 18:13:11.744904 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:13:11.745269 INFO WALinuxAgent-2.2.18 running as process 1929
2018/07/19 18:13:11.749191 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0
2018/07/19 18:13:11.750806 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:13:11.841953 INFO Set block dev timeout: sda with timeout: 300
2018/07/19 18:13:11.842365 INFO Set block dev timeout: sdb with timeout: 300
2018/07/19 18:13:11.853647 INFO Event: name=WALinuxAgent, op=Install, message=Agent WALinuxAgent-2.2.20 downloaded successfully, duration=0
2018/07/19 18:13:11.896156 INFO Event: name=WALinuxAgent, op=Install, message=Agent WALinuxAgent-2.2.25 downloaded successfully, duration=0
2018/07/19 18:13:11.936173 INFO Event: name=WALinuxAgent, op=Install, message=Agent WALinuxAgent-2.2.26 downloaded successfully, duration=0
2018/07/19 18:13:11.940809 INFO Agent WALinuxAgent-2.2.18 discovered WALinuxAgent-2.2.26 as an update and will exit
2018/07/19 18:13:12.640691 INFO Agent WALinuxAgent-2.2.18 launched with command 'python -u /usr/sbin/waagent -run-exthandlers' is successfully running
2018/07/19 18:13:12.641141 INFO Event: name=WALinuxAgent, op=Enable, message=Agent WALinuxAgent-2.2.18 launched with command 'python -u /usr/sbin/waagent -run-exthandlers' is successfully running, duration=0
2018/07/19 18:13:12.644234 INFO Determined Agent WALinuxAgent-2.2.26 to be the latest agent
2018/07/19 18:13:12.736611 INFO Agent WALinuxAgent-2.2.26 is running as the goal state agent
2018/07/19 18:13:12.752534 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:13:12.757569 INFO Start env monitor service.
2018/07/19 18:13:12.758753 INFO Configure routes
2018/07/19 18:13:12.759296 INFO Gateway:None
2018/07/19 18:13:12.760617 INFO Routes:None
2018/07/19 18:13:12.827245 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:13:12.834944 INFO WALinuxAgent-2.2.26 running as process 1940
2018/07/19 18:13:12.875674 INFO Event: name=WALinuxAgent, op=Partition, message=87, duration=0
2018/07/19 18:13:12.910323 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0
2018/07/19 18:13:12.934617 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:13:13.233968 INFO Wire server endpoint:168.63.129.16
2018/07/19 18:13:13.339454 INFO [Microsoft.Azure.Extensions.CustomScript-2.0.6] Target handler state: enabled
2018/07/19 18:13:13.347322 INFO [Microsoft.Azure.Extensions.CustomScript-2.0.6] [Enable] current handler state is: notinstalled
2018/07/19 18:13:13.490002 INFO Event: name=Microsoft.Azure.Extensions.CustomScript, op=Download, message=Download succeeded, duration=132
2018/07/19 18:13:13.490520 INFO [Microsoft.Azure.Extensions.CustomScript-2.0.6] Initialize extension directory
2018/07/19 18:13:13.491622 INFO [Microsoft.Azure.Extensions.CustomScript-2.0.6] Update settings file: 1.settings
2018/07/19 18:13:13.492289 INFO [Microsoft.Azure.Extensions.CustomScript-2.0.6] Install extension [bin/custom-script-shim install]
2018/07/19 18:13:13.517854 INFO Successfully added Azure fabric firewall rules
2018/07/19 18:13:13.526180 INFO Firewall rules:
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination         

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination         
       0        0 ACCEPT     tcp  --  *      *       0.0.0.0/0            168.63.129.16        owner UID match 0
       0        0 DROP       tcp  --  *      *       0.0.0.0/0            168.63.129.16        ctstate INVALID,NEW

2018/07/19 18:13:13.528288 INFO Purging disk cache, current incarnation is 1
2018/07/19 18:13:14.496545 ERROR Event: name=Microsoft.Azure.Extensions.CustomScript, op=Install, message=[ExtensionError] Non-zero exit code: 1, bin/custom-script-shim install, duration=0
2018/07/19 18:13:14.590861 INFO [Microsoft.Azure.Extensions.CustomScript-2.0.6] Remove extension handler directory: /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6
2018/07/19 18:13:14.606529 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 1, duration=1374
2018/07/19 18:28:13.351674 INFO Agent WALinuxAgent-2.2.26 launched with command 'python -u bin/WALinuxAgent-2.2.26-py2.7.egg -run-exthandlers' is successfully running
2018/07/19 18:28:13.357100 INFO Event: name=WALinuxAgent, op=Enable, message=Agent WALinuxAgent-2.2.26 launched with command 'python -u bin/WALinuxAgent-2.2.26-py2.7.egg -run-exthandlers' is successfully running, duration=0

The custom-script-shim bash script runs custom-script-extension but fails with this:

/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-shim: line 77: /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension: Text file busy

"Text file busy" is an error you get when bash tries to run a script that is currently open:

# touch test.sh
# chmod a+x test.sh
# cat > test.sh
^Z
[1]+  Stopped                 cat > test.sh
# ./test.sh
-bash: ./test.sh: Text file busy

This suggests that the custom-script-extension file may not have been closed properly from the extraction, or perhaps was modified by a background thread. I do not believe this is a CustomScript bug since it simply runs a binary in the extension's zip file, which is written when WALinuxAgent extracts the zip.

@hglkrijger
Copy link
Member

@dougclayton are you able to provide a repro for this case?

@dougclayton
Copy link
Author

I can't provide a reproduction beyond the details above. It happens repeatedly, but quite infrequently, with the standard CustomScript extension and the 2.2.26 waagent binary. I was hoping that the fact that a file could sometimes be open would be enough to spot the issue. I looked through the code myself and it seems like waagent is properly closing the files before running the extension, but I'm not familiar enough with the codebase to say for sure.

With the fixes for #1268 and #1269, this becomes less of an issue for us because we can now detect it and reimage the VM.

@hglkrijger
Copy link
Member

I am timing this out, as we can take no further action in the agent. In addition, the component which actually emits the error above is the custom script shim, so I think the investigation should start there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants