This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

feat: re-enable container runtime dir relocation #3393

Merged: 1 commit merged into Azure:aks-release-v0.47.0-1 on Jun 5, 2020

alexeldeib (Contributor) commented Jun 3, 2020

This reverts commit 5af21fe.

Reason for Change:

Issue Fixed:

Requirements:

Notes:
Not sure whether any material changes landed in between that would affect this; the revert applied fairly cleanly. Doing some manual testing.

alexeldeib added the aks label Jun 3, 2020
alexeldeib changed the title from "enable container runtime dir relocation" to "feat: enable container runtime dir relocation" Jun 3, 2020
alexeldeib changed the title from "feat: enable container runtime dir relocation" to "feat: re-enable container runtime dir relocation" Jun 3, 2020
alexeldeib (Contributor, Author) commented Jun 3, 2020

Manually tested with docker and containerd, since it's been a while since this was originally applied; it still looks good.

I'd like to avoid special-case logic for copying to the temp disk at provision time. If our target is an ephemeral OS disk or a data disk for docker, both will use a cached VHD, which avoids increasing provisioning latency. The temp disk will not be the default, so it won't increase latency by default; in the short/medium term we'll only allow it as an opt-in toggle, especially for targeted SKUs like the N series. Those users understand the cost and have already been paying the runtime cost of pulling images.

Once these SKUs support ephemeral OS disks, or AKS supports a data disk for the docker root, we would de-prioritize the temp disk and ask the same customers to use an ephemeral OS disk plus a data disk, which won't have the same latency issues. In short, I don't think it's worth detecting whether the user placed the root on "/mnt" versus a data disk: we would remove that logic in the future anyway, and we'd prefer that the temp disk not be used.
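At the runtime level, relocating without any special-case copy logic amounts to pointing each runtime at a new directory and letting it pull images on first use. The sketch below is a minimal illustration assuming Docker's `data-root` key in `daemon.json` and containerd's top-level `root` in `config.toml`; the helper `runtime_configs` and the `/mnt/docker` and `/mnt/containerd` paths are hypothetical, and this is not necessarily how aks-engine itself wires the option.

```python
import json
from typing import Tuple


def runtime_configs(data_dir: str) -> Tuple[str, str]:
    """Hypothetical helper: render config fragments that point Docker and
    containerd at a relocated data directory.

    The same code path is used whether `data_dir` is the temp disk (/mnt)
    or a data-disk mount; nothing is copied, so images are pulled over the
    network on first use.
    """
    # Docker reads its root directory from the "data-root" key in daemon.json.
    docker_daemon_json = json.dumps({"data-root": f"{data_dir}/docker"}, indent=2)
    # containerd keeps persistent state under the top-level `root` in config.toml.
    containerd_toml = f'root = "{data_dir}/containerd"\n'
    return docker_daemon_json, containerd_toml


if __name__ == "__main__":
    daemon_json, config_toml = runtime_configs("/mnt")
    print(daemon_json)
    print(config_toml, end="")
```

Because the helper treats every path the same way, there is no "/mnt versus data disk" branch to remove later.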

alexeldeib (Contributor, Author) commented Jun 3, 2020

I did some testing of provisioning time when using this option with a VHD (the AKS scenario), with docker on the temp disk ("reroot" below; this should trigger a network pull of images at provision time).

Standard_D8s_v3 for all tests

Round-trip create/delete, 1024 GB OS disk, default:

Time (mean ± σ):     679.197 s ± 114.975 s    [User: 10.768 s, System: 0.179 s]
Range (min … max):   552.897 s … 864.985 s    5 runs

Round-trip create/delete, 1024 GB OS disk, reroot:

Time (mean ± σ):     611.854 s ± 60.042 s    [User: 12.139 s, System: 0.187 s]
Range (min … max):   524.549 s … 668.768 s    5 runs

100 GB OS disk, default:

real	3m37.317s
user	0m11.176s
sys	0m0.102s

100 GB OS disk, reroot:

real	3m16.865s
user	0m12.421s
sys	0m0.118s

1024 GB OS disk, default:

real	2m51.709s
user	0m12.575s
sys	0m0.222s

1024 GB OS disk, reroot:

real	2m53.774s
user	0m6.842s
sys	0m0.127s
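To put these numbers side by side, here is a short Python calculation of the percent change from default to reroot, using the reported wall-clock times (the `real` line, or the hyperfine mean for the round-trip runs). The names `to_seconds`, `results`, and `relative_change` are illustrative; note that the round-trip figures are 5-run means while the others are single runs, so the rows are not directly comparable.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a `time` reading such as 3m37.317s into seconds."""
    return minutes * 60 + seconds


# Wall-clock times from the results above, keyed by (test, docker root placement).
results = {
    ("roundtrip 1024 GB", "default"): 679.197,  # hyperfine mean over 5 runs
    ("roundtrip 1024 GB", "reroot"): 611.854,   # hyperfine mean over 5 runs
    ("100 GB", "default"): to_seconds(3, 37.317),  # single `time` run
    ("100 GB", "reroot"): to_seconds(3, 16.865),
    ("1024 GB", "default"): to_seconds(2, 51.709),
    ("1024 GB", "reroot"): to_seconds(2, 53.774),
}


def relative_change(default: float, reroot: float) -> float:
    """Percent change of reroot relative to default; negative means reroot was faster."""
    return (reroot - default) / default * 100


for case in ("roundtrip 1024 GB", "100 GB", "1024 GB"):
    d = results[(case, "default")]
    r = results[(case, "reroot")]
    print(f"{case}: default {d:.1f} s, reroot {r:.1f} s, change {relative_change(d, r):+.1f}%")
```

This gives roughly -9.9% for the round-trip runs, -9.4% for 100 GB, and +1.2% for 1024 GB. The round-trip gap in means (about 67 s) is smaller than the default run's σ of about 115 s, so that difference should be read cautiously.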

Copying from a 100 GB OS disk (16 GB of data copied; this throttles the OS disk and hangs for a long time):

time cp -R /var/lib/docker /mnt/container

real	9m24.323s
user	0m1.495s
sys	0m27.279s
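For scale, the copy's effective throughput, and how it compares with the whole 100 GB reroot provisioning run, can be checked with a bit of arithmetic. This sketch only restates the measurements above and assumes "16 GB" means decimal gigabytes (10**9 bytes), which the source does not specify.

```python
# Figures from the copy test above. "16 GB" is assumed to mean decimal
# gigabytes (10**9 bytes); the source does not say.
copied_mb = 16 * 1000
copy_s = 9 * 60 + 24.323              # real 9m24.323s
reroot_provision_s = 3 * 60 + 16.865  # 100 GB, reroot: real 3m16.865s

throughput_mb_s = copied_mb / copy_s
ratio = copy_s / reroot_provision_s

print(f"copy: {copy_s:.3f} s, ~{throughput_mb_s:.1f} MB/s effective")
print(f"copy alone takes ~{ratio:.1f}x the 100 GB reroot provisioning time")
```

This works out to roughly 28.4 MB/s, and the copy alone takes about 2.9 times as long as the full 100 GB reroot provisioning run, which should already include the network pull of images.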

It seems we have an enormous cache, which makes copying it slow, but pulling a few images to the temp disk on fast VMs costs basically nothing, and in the small (100 GB) case we even gain some time by offloading from the OS disk. For the N-series SKUs we're primarily targeting with this change, this is promising.

Added: I'm waiting on a longer run with more data, but figured I'd share these results now, since they argue in favor of letting the network pull happen.

jackfrancis added this to In progress in backlog Jun 4, 2020

xuto2 (Contributor) commented Jun 5, 2020

/lgtm

acs-bot commented Jun 5, 2020

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alexeldeib, xuto2
To complete the pull request process, please assign devigned
You can assign the PR to them by writing /assign @devigned in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xuto2 merged commit c4e226e into Azure:aks-release-v0.47.0-1 Jun 5, 2020
backlog automation moved this from In progress to Done Jun 5, 2020
xuto2 added a commit that referenced this pull request Jun 5, 2020
xuto2 added a commit that referenced this pull request Jun 5, 2020

3 participants