cilium: 4 sig-network tests fail (sonobuoy) #144

Closed · garloff opened this issue Mar 2, 2022 · 17 comments · Fixed by #489

Assignees: mxmxchere
Labels: Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling), documentation (Improvements or additions to documentation), question (Further information is requested), SCS-VP06a (Related to tender lot SCS-VP06a), Sprint Hobart (2023, cwk 30+31)

Comments


garloff commented Mar 2, 2022

Testing a capo cluster with k8s-1.21.9, I consistently get 4 failed tests:
Failed tests:

[sig-network] Services should have session affinity timeout work for service with type clusterIP [LinuxOnly] [Conformance]
[sig-network] Services should have session affinity work for service with type clusterIP [LinuxOnly] [Conformance]
[sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]
[sig-network] Services should be able to switch session affinity for service with type clusterIP [LinuxOnly] [Conformance]

To Do: Investigate:

  • Are they dependent on k8s version?
  • Or anything that we'd need to take care of in our capi/capo setup?
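
For context, these failures come from running the Kubernetes conformance suite via sonobuoy; a minimal sketch of such a run (standard sonobuoy CLI - the exact invocation used for this cluster is not recorded in this issue):

# start the full certified-conformance suite and wait for it to finish
sonobuoy run --mode certified-conformance --wait
# show the aggregated pass/fail counts
sonobuoy status
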
garloff changed the title from "sonobuoy: 4 sig-network fails" to "sonobuoy: 4 sig-network tests fail" on Mar 2, 2022
garloff changed the title from "sonobuoy: 4 sig-network tests fail" to "cilium: 4 sig-network tests fail (sonobuoy)" on Mar 3, 2022

garloff commented Mar 3, 2022

This only happens with cilium.

garloff added the "documentation" (Improvements or additions to documentation) and "question" (Further information is requested) labels on Mar 17, 2022

garloff commented Sep 20, 2022

With k8s-1.24.4 and cilium v1.12.2, we are down to one failed test.

[sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]


garloff commented Oct 20, 2022

Retesting with k8s-1.25.3 and cilium v1.12.2, we still have the 4 failed tests:

Failed tests:                                                         
[It] [sig-network] Services should have session affinity work for service with type clusterIP [LinuxOnly] [Conformance]                     
[It] [sig-network] Services should have session affinity timeout work for service with type clusterIP [LinuxOnly] [Conformance]             
[It] [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]                                                                                                                             
[It] [sig-network] Services should be able to switch session affinity for service with type clusterIP [LinuxOnly] [Conformance]             

@chess-knight (Member)

> Retesting with k8s-1.25.3 and cilium v1.12.2, we still have the 4 failed tests:
>
> Failed tests:
> [It] [sig-network] Services should have session affinity work for service with type clusterIP [LinuxOnly] [Conformance]
> [It] [sig-network] Services should have session affinity timeout work for service with type clusterIP [LinuxOnly] [Conformance]
> [It] [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]
> [It] [sig-network] Services should be able to switch session affinity for service with type clusterIP [LinuxOnly] [Conformance]

Retesting with k8s-1.25.3 and cilium v1.13.3, only the HostPort test fails:

Failed tests:
[It] [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]


garloff commented Jun 22, 2023

Thanks for testing, @chess-knight -- we have progress, which is nice!
It would be good to understand whether this is a known cilium limitation or whether we can fix this by setting things up slightly differently.

@chess-knight (Member)

> It would be good to understand whether this is a known cilium limitation or whether we can fix this by setting things up slightly differently.

OK, I investigated a bit further and found this open upstream issue.
That issue is exactly about the HostPort conformance problem.
In that issue, and in multiple linked issues (e.g. this one), people state that they passed this conformance test by using some combination of additional cilium install parameters - but others still do not pass the test that way.

If I understand the mentioned issue correctly, there is still one sub-issue that needs to be resolved before HostPort is fully supported in the default installation.

jschoone added the SCS-VP06a (Related to tender lot SCS-VP06a) label Jul 26, 2023
@NotTheEvilOne (Contributor)

Side note: there is a known workaround: cilium/cilium#14287#issuecomment-1645325590

mxmxchere self-assigned this Jul 26, 2023

chess-knight commented Jul 28, 2023

I dived into the mentioned problem again, and here are my findings:

  1. The first thing I found is that cilium has a mode (kubeProxyReplacement, disabled by default) that allows it to replace or supplement kube-proxy, or to run in a kube-proxy-free environment.
    We install cilium via cilium-cli (see cilium install --version $CILIUM_VERSION), and this installation method auto-detects kube-proxy. If kube-proxy is present in the k8s cluster, the option becomes kubeProxyReplacement=disabled; if kube-proxy is not installed, it becomes kubeProxyReplacement=strict, which means kube-proxy is entirely replaced. In cilium v1.13 there is also a third option, kubeProxyReplacement=partial, which can be used to replace only some kube-proxy features (e.g. setting externalIPs.enabled lets cilium take care of externalIPs).

  2. The auto-detection described in point 1 is really tricky for our setup, because we install cilium right after the apiserver becomes reachable - at the same time kube-proxy is being installed.
    I ran this kubectl command right after the k8s API became reachable:

    $ kubectl get daemonset -n kube-system
    NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
    cilium   1         1         0       1            0           kubernetes.io/os=linux   1s
    
    $ kubectl get daemonset -n kube-system
    NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
    cilium       1         1         0       1            0           kubernetes.io/os=linux   3s
    kube-proxy   1         1         0       1            0           kubernetes.io/os=linux   2s

    So sometimes cilium gets installed as expected, but sometimes cilium is installed before kube-proxy exists, and the auto-detection then sets kubeProxyReplacement=strict, which can be unwanted.

    ℹ️  Using Cilium version 1.13.3
    🔮 Auto-detected cluster name: testcluster
    🔮 Auto-detected datapath mode: tunnel
    🔮 Auto-detected kube-proxy has been installed
    ℹ️  helm template --namespace kube-system cilium cilium/cilium --version 1.13.3 --set cluster.id=0,cluster.name=testcluster,encryption.nodeEncryption=false,ipam.operator.clusterPoolIPv4PodCIDRList[0]=192.168.0.0/16,kubeProxyReplacement=disabled,operator.replicas=1,serviceAccounts.cilium.name=cilium,serviceAccounts.operator.name=cilium-operator,tunnel=vxlan
    

    versus

    ℹ️  Using Cilium version 1.13.3
    🔮 Auto-detected cluster name: testcluster
    🔮 Auto-detected datapath mode: tunnel
    🔮 Auto-detected kube-proxy has not been installed
    ℹ️  Cilium will fully replace all functionalities of kube-proxy
    ℹ️  helm template --namespace kube-system cilium cilium/cilium --version 1.13.3 --set bpf.masquerade=true,cluster.id=0,cluster.name=testcluster,encryption.nodeEncryption=false,ipam.operator.clusterPoolIPv4PodCIDRList[0]=192.168.0.0/16,k8sServiceHost=213.131.230.24,k8sServicePort=6443,kubeProxyReplacement=strict,operator.replicas=1,serviceAccounts.cilium.name=cilium,serviceAccounts.operator.name=cilium-operator,tunnel=vxlan
    
  3. Because of point 2, we get different results in the certified-conformance tests.
    With kubeProxyReplacement=strict, only the HostPort test fails:

    Failed tests:
    [It] [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]
    

    The source code of the HostPort test can be found here - it fails with INFO: Can not connect from e2e-host-exec to pod(pod1) to serverIP: 127.0.0.1, port: 54323
    With kubeProxyReplacement=disabled, additional session affinity tests fail as well:

    Failed tests:
    [It] [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]
    [It] [sig-network] Services should have session affinity work for service with type clusterIP [LinuxOnly] [Conformance]
    [It] [sig-network] Services should be able to switch session affinity for service with type clusterIP [LinuxOnly] [Conformance]
    

    The source code of the session affinity tests can be found here - it fails with FAIL: Affinity should hold but didn't.

  4. The first thing I tried was the combination --helm-set cni.chainingMode=portmap --helm-set kubeProxyReplacement=strict, but it didn't work: in strict mode the portmap plugin option is ignored and replaced by cilium's native hostPort support, so the tests fail in the same way. There is also a documented limitation for hostPort on the loopback address.

  5. Based on points 3 and 4, and on the comments in cilium/cilium#14287 (SchedulerPredicates: pods with same hostPort but different hostIP and protocol), I came up with these options for passing the certified-conformance tests.
    First, we need kubeProxyReplacement=disabled or kubeProxyReplacement=partial, because the auto-detection behaviour of cilium install is pretty random (point 2). Then HostPort support can be activated with cni.chainingMode=portmap, and finally we need to enable session affinity with sessionAffinity=true; see also the HostPort and sessionAffinity docs.

  6. Based on point 5, I tested --helm-set kubeProxyReplacement=disabled --helm-set cni.chainingMode=portmap --helm-set sessionAffinity=true and got === Sonobuoy conformance tests passed in 6100s === at the end (which matches cilium/cilium#14287 (comment)). I also tried --helm-set kubeProxyReplacement=partial --helm-set cni.chainingMode=portmap --helm-set sessionAffinity=true and it passed as well. (A sketch of the resulting install command follows right after this list.)
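
    For reference, this is roughly what the adjusted installation step would run with those flags (a sketch only; the real command in our scripts may carry additional --helm-set options):

    # cilium-cli installation with the flags tested in point 6
    cilium install --version "$CILIUM_VERSION" \
      --helm-set kubeProxyReplacement=disabled \
      --helm-set cni.chainingMode=portmap \
      --helm-set sessionAffinity=true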

Conclusion:
I found a way how to pass conformance tests with Cilium CNI.

But we need to decide whether we want to replace kube-proxy with cilium by setting kubeProxyReplacement=strict - it seems the HostPort test will then fail until the upstream issue is solved (or we could ignore that one test, as cilium does). We could set kubeProxyReplacement=partial (see points 5 and 6, or cilium/cilium#14287 (comment)). Or we could set kubeProxyReplacement=disabled - this is what we have assumed until now, because it is the default. We also need to decide whether we want to allow the portmap plugin for hostPort functionality: according to k8s best practices it is not recommended to use hostPort unless absolutely necessary, but without this setting the tests will fail again. The sessionAffinity setting is not a critical decision for me, because I see it as standard in the k8s world.

I see the question about kube-proxy replacement as a separate issue that we can solve in the future. I also don't see the portmap plugin as a critical concern, because it is used by default in calico, and with it the certified-conformance tests pass.

So I am in favour of adding --helm-set kubeProxyReplacement=disabled --helm-set cni.chainingMode=portmap --helm-set sessionAffinity=true to the cilium install command. In my eyes it is also very close to what calico does, where there is no problem with the conformance tests.

P.S. I tested it with the default settings (k8s v1.25.11) and I hope it will work the same for other k8s versions. See also cilium/cilium#21060 (comment).

@mxmxchere (Contributor)

Wow, thanks for this very detailed analysis.

From my point of view we can set kubeProxyReplacement to disabled for now and also set the other flags, so that we pass the sonobuoy tests in the first place.

However, we should keep in mind that we want to use cilium's implementation of the Gateway API, which requires kubeProxyReplacement to be set to true.

I prefer to pass the tests first, make cilium the tested default, and deal with everything related to the Gateway API in the next step (after merging the PR).


mxmxchere commented Aug 8, 2023

Preface:
IMHO it is a bit cumbersome that the tests are identified by their full description; other testing suites (lynis, shellcheck, ...) use IDs for that, so I introduced my own IDs for this issue. The strings below are unique within the current set of tests listed by sonobuoy e2e in version 0.56.17:

TEST1='Services should have session affinity timeout work for service with type clusterIP'
TEST2='Services should have session affinity work for service with type clusterIP'
TEST3='HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol'
TEST4='Services should be able to switch session affinity for service with type clusterIP'

(I shortened them because the full strings include square brackets and they are used as a regex later on.)

I have had nearly every possible result (from 3 failing tests to 4 failing tests, a single failure, or all tests passing), all with k8s 1.25.11, cilium 1.13 and kubeProxyReplacement disabled. The most consistent result was 4 failing tests (6 or 7 times in a row now). The command I used to run the tests is:

sonobuoy run -p e2e --e2e-focus="$TEST1|$TEST2|$TEST3|$TEST4"

Unfortunately, this regex runs 7 tests according to sonobuoy results, while sonobuoy status sums it up to 4 tests...
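
For completeness, this is roughly how the outcome of such a focused run can be pulled and inspected afterwards (a sketch using standard sonobuoy subcommands; the tarball name is just an example):

results=$(sonobuoy retrieve)   # downloads something like ./202308..._sonobuoy_<uuid>.tar.gz
sonobuoy results "$results"    # per-plugin pass/fail summary
sonobuoy results "$results" --mode detailed --plugin e2e | grep -i fail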


mxmxchere commented Aug 8, 2023

OK, now I have caught up with what you found out, and I can reproduce that all tests pass with:

kubeProxyReplacement=disabled
cni.chainingMode=portmap
sessionAffinity=true
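
For reference, one way to double-check that these values actually made it into the running installation is to look at the generated cilium ConfigMap (the key names below are my assumption based on the Helm values and may differ between Cilium versions):

# inspect the rendered cilium configuration in the cluster
kubectl -n kube-system get configmap cilium-config -o yaml \
  | grep -E 'kube-proxy-replacement|cni-chaining-mode|enable-session-affinity'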

Regarding the hostPort thing: as far as I have understood, using it is not recommended (although, for example, cinder does). From a conformance and cluster-provider perspective it is still required to work. So regarding your question:

> We also need to decide whether we want to allow the portmap plugin for hostPort functionality: according to k8s best practices it is not recommended to use hostPort unless absolutely necessary, but without this setting the tests will fail again.

It is not recommended to use hostPort (from the perspective of the person who applies pod resources), but it is required to work (from the perspective of the person who wants to provide a certified cluster).

The fact that cilium achieves this by using the portmap plugin does not matter (this is how I understand it).

@chess-knight What do you think? Am I missing something?

chess-knight added a commit that referenced this issue Aug 8, 2023
Fixes #144

Signed-off-by: Roman Hros <roman.hros@dnation.cloud>
@chess-knight (Member)

Hello @mxmxchere,
while you were writing your comment, I was already on my way to creating SovereignCloudStack/issues#489.

@mxmxchere (Contributor)

Fantastic. So, as I read in your code, we both agree that we should set those three options and are good to go.

@chess-knight (Member)

> Fantastic. So, as I read in your code, we both agree that we should set those three options and are good to go.

Yes, I tested it successfully multiple times.

@chess-knight (Member)

> However, we should keep in mind that we want to use cilium's implementation of the Gateway API, which requires kubeProxyReplacement to be set to true.

Yes, you are right. We will need to find a way to deploy the Gateway API with cilium and still pass the conformance tests.
The tracking issue for this is #710.

@chess-knight (Member)

> Preface: IMHO it is a bit cumbersome that the tests are identified by their full description; other testing suites (lynis, shellcheck, ...) use IDs for that, so I introduced my own IDs for this issue. The strings below are unique within the current set of tests listed by sonobuoy e2e in version 0.56.17:
>
> TEST1='Services should have session affinity timeout work for service with type clusterIP'
> TEST2='Services should have session affinity work for service with type clusterIP'
> TEST3='HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol'
> TEST4='Services should be able to switch session affinity for service with type clusterIP'
$ sonobuoy e2e --focus 'Services.*clusterIP'
[It] [sig-network] Services should be able to switch session affinity for service with type clusterIP [LinuxOnly] [Conformance]
[It] [sig-network] Services should have session affinity timeout work for service with type clusterIP [LinuxOnly]
[It] [sig-network] Services should have session affinity work for service with type clusterIP [LinuxOnly] [Conformance]

Test TEST1 (the session affinity timeout test) is not a conformance test - it is missing [Conformance].
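
A quick way to see which of the matched tests actually carry the conformance tag is to filter the same listing:

sonobuoy e2e --focus 'Services.*clusterIP' | grep '\[Conformance\]'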

@chess-knight (Member)

> Unfortunately, this regex runs 7 tests according to sonobuoy results, while sonobuoy status sums it up to 4 tests...

From my experience, that is probably because there are always 3 additional e2e entries before and after the focused tests - [SynchronizedBeforeSuite], [SynchronizedAfterSuite] and [ReportAfterSuite].

jschoone added a commit that referenced this issue Aug 8, 2023
Fixes #144

Signed-off-by: Roman Hros <roman.hros@dnation.cloud>
Co-authored-by: Jan Schoone <6106846+jschoone@users.noreply.github.com>
jschoone added the Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling) label Aug 9, 2023
fdobrovolny and jschoone referenced this issue in further commits (cherry-picks of e8ef70f to release branches) on Sep 26, Sep 30, Oct 6 and Oct 30, 2023.
chess-knight added a commit that referenced this issue Nov 6, 2023
* Pass conformance tests with cilium (#489)

Fixes #144

(cherry picked from commit e8ef70f)
jschoone added the Sprint Hobart (2023, cwk 30+31) label Feb 28, 2024