Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

refactor: skip sgx driver installation if already exists in OS #3062

Conversation

Francis-Liu
Copy link
Contributor

Reason for Change:

Azure Ubuntu images is now shipped with Intel SGX driver by default. We skip the driver installation if "/dev/sgx" is detected. We keep the installation code in case of a custom image is used.

Issue Fixed:

NA

Requirements:

Notes:

None

@jackfrancis
Copy link
Member

Thanks @Francis-Liu! We are currently only testing SGX up to Kubernetes v1.16. Is that correct? Do SGX drivers + workloads still not work in Kubernetes v1.17+ ?

@jackfrancis
Copy link
Member

I ran our E2E tests against this change and SGX workloads were failing, indicating that we still need to do the drivers installation.

@Francis-Liu
Copy link
Contributor Author

Thank you @jackfrancis. Did you run the E2E testing using OpenEnclave Samples Tests? Regarding your first question, when testing aks-engine change, I did not explicitly specify which Kubernetes version to use. Could you elaborate your considerations on what might cause SGX driver and workload to be unable to run on Kubernetes v1.17+? From our side, we want the driver and workloads to be tested against latest Kubernetes version.

@jackfrancis
Copy link
Member

We are running this spec to test:

https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/workloads/sgx-test.yaml

I'm not aware of any reason why things wouldn't work on >= 1.17, but our test cluster configuration only tests 1.16 and below:

https://github.com/Azure/aks-engine/blob/master/test/e2e/test_cluster_configs/sgx.json

Maybe there's no good reason for that.

@Francis-Liu
Copy link
Contributor Author

The information you provided is absolutely important! Regardless of whether we install the driver on the fly or not, the SGX driver was updated by Intel in such a way that /dev/sgx has been changed from a char device to a folder. That seems to have explained why the e2e test workload failed. Did you have a chance to save the error message from the failed workload?

For the allowedOrchestratorVersions I agree that there should be no obvious reason latest version isn't supported. Once we fix the e2e test workload, I suggest we add v1.17, v1.18? to the list and give it a try.

@jackfrancis
Copy link
Member

From the pod:

Events:
  Type     Reason       Age                    From                                        Message
  ----     ------       ----                   ----                                        -------
  Normal   Scheduled    10m                    default-scheduler                           Successfully assigned default/sgx-test-4bnvt to k8s-agentpool-36985434-vmss000000
  Warning  FailedMount  6m38s (x2 over 8m53s)  kubelet, k8s-agentpool-36985434-vmss000000  Unable to attach or mount volumes: unmounted volumes=[dev-sgx], unattached volumes=[dev-sgx default-token-68zp8]: timed out waiting for the condition
  Warning  FailedMount  2m6s (x2 over 4m21s)   kubelet, k8s-agentpool-36985434-vmss000000  Unable to attach or mount volumes: unmounted volumes=[dev-sgx], unattached volumes=[default-token-68zp8 dev-sgx]: timed out waiting for the condition
  Warning  FailedMount  40s (x13 over 10m)     kubelet, k8s-agentpool-36985434-vmss000000  MountVolume.SetUp failed for volume "dev-sgx" : hostPath type check failed: /dev/sgx is not a character device

@jackfrancis
Copy link
Member

The pod spec:

$ k describe pod sgx-test-4bnvt
Name:           sgx-test-4bnvt
Namespace:      default
Priority:       0
Node:           k8s-agentpool-36985434-vmss000000/10.240.0.34
Start Time:     Tue, 14 Apr 2020 14:18:13 -0700
Labels:         controller-uid=8025f724-d4f7-4ece-9c78-86bfdc751ecd
                job-name=sgx-test
Annotations:    kubernetes.io/psp: privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Job/sgx-test
Containers:
  sgxtest:
    Container ID:  
    Image:         oeciteam/sgx-test
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /helloworld/host/helloworldhost
      /helloworld/enclave/helloworldenc.signed
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /dev/sgx from dev-sgx (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-68zp8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  dev-sgx:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/sgx
    HostPathType:  CharDevice
  default-token-68zp8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-68zp8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

@Francis-Liu
Copy link
Contributor Author

Thanks @jackfrancis. /dev/sgx volumeMounts needs some change. I will propose a fix.

Signed-off-by: Francis Liu <Francis.Liu2012@gmail.com>
@Francis-Liu Francis-Liu force-pushed the Francis-Liu/skip-sgx-driver-install-if-avail branch from c065e2f to dfa00d1 Compare April 15, 2020 23:34
@Francis-Liu
Copy link
Contributor Author

Hi @jackfrancis, I have updated this PR: (1) fixed a bug in cse_main.sh, -f --> -e; (2) removed the type: CharDevice check; and (3) changed document from "acc-16.04" to "aks-ubuntu-16.04". It passed test from my local environment.

@jackfrancis
Copy link
Member

/azp run pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov
Copy link

codecov bot commented Apr 16, 2020

Codecov Report

Merging #3062 into master will decrease coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3062      +/-   ##
==========================================
- Coverage   70.63%   70.58%   -0.05%     
==========================================
  Files         145      145              
  Lines       25151    25219      +68     
==========================================
+ Hits        17765    17802      +37     
- Misses       6283     6312      +29     
- Partials     1103     1105       +2     
Impacted Files Coverage Δ
pkg/engine/templates_generated.go 30.99% <ø> (-0.59%) ⬇️
pkg/engine/availabilitysets.go 87.75% <0.00%> (-12.25%) ⬇️
cmd/upgrade.go 45.25% <0.00%> (-1.85%) ⬇️
pkg/engine/virtualmachinescalesets.go 80.79% <0.00%> (-0.33%) ⬇️
pkg/engine/engine.go 86.31% <0.00%> (-0.28%) ⬇️
pkg/engine/cse.go 100.00% <0.00%> (ø)
pkg/engine/artifacts.go 100.00% <0.00%> (ø)
pkg/api/defaults-cloud-controller-manager.go 92.85% <0.00%> (ø)
pkg/api/addons.go 97.74% <0.00%> (+<0.01%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8fb97a2...dfa00d1. Read the comment docs.

Copy link
Member

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@jackfrancis jackfrancis merged commit 2b0d70a into Azure:master Apr 16, 2020
alexeldeib pushed a commit to alexeldeib/aks-engine that referenced this pull request Apr 21, 2020
…#3062)

Signed-off-by: Francis Liu <Francis.Liu2012@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants