NE-1444: Bump HaProxy to the latest version 2.8 #563

Merged: 2 commits merged on Feb 26, 2024

frobware
Contributor

This PR addresses the absence of the socat package in container builds, which has been reported as missing in metal-ipi jobs. It explicitly adds the socat package to the container builds to rectify this issue. The need for a direct inclusion of socat suggests a potential change in the baseline environment provided by RHEL 9, given that in previous RHEL 8 builds, socat was not explicitly included but was nonetheless present and extensively used for debugging HAProxy issues through its stats socket.

This PR reverts openshift/router#561.

cc @dgoodwin
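
For readers unfamiliar with the debugging workflow mentioned above: socat is typically used to send a single command to HAProxy's stats socket and print the reply. The following is a minimal Python sketch of that same exchange, not code from this PR. The function names `haproxy_command` and `parse_show_info` are hypothetical, the socket path must be supplied by the caller, and the sketch assumes the socket is in HAProxy's default non-interactive mode, where the connection is closed after one command.

```python
import socket


def haproxy_command(sock_path: str, command: str, timeout: float = 5.0) -> str:
    """Send one command to an HAProxy stats socket and return the raw reply.

    Assumes non-interactive mode: HAProxy answers one command and then
    closes the connection, so we read until end of stream.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        sock.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def parse_show_info(text: str) -> dict:
    """Parse the 'Key: value' lines printed by the 'show info' command."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info
```

Usage would look like `parse_show_info(haproxy_command(path, "show info"))["Version"]`, where `path` is wherever the router's HAProxy configuration binds its stats socket (an assumption to verify against the running configuration).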

…7829993043"

This reverts commit 4c436c6, reversing
changes made to d5b0316.

Add the socat package to resolve runtime issues seen with nightly
payloads where metal jobs are permafailing due to the missing socat
binary.
@openshift-ci-robot added the jira/valid-reference label Feb 14, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 14, 2024

@frobware: This pull request references TRT-1507 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

@frobware
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

Contributor

openshift-ci bot commented Feb 14, 2024

@frobware: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b90242f0-cb41-11ee-91c5-91b63600bbc7-0

@Miciah
Contributor

Miciah commented Feb 14, 2024

Does the e2e-metal-ipi-ovn-ipv6 job use the HAProxy image with the changes in the PR?

@Miciah
Contributor

Miciah commented Feb 14, 2024

Oh, is socat present in the image that CI builds but absent in the image that ART builds?

@dgoodwin
Contributor

> Oh, is socat present in the image that CI builds but absent in the image that ART builds?

Justin Pierce confirmed there is an active bug somewhere on the ART side where their tooling is randomly choosing either a bare bones base RHEL 9 image, or the actual OCP image it's supposed to use as the base of everything. We're suspecting that might be involved here in which case there's no rhyme or reason to what we'd see in any individual run, it's luck of the draw. They're hoping for a fix today.

@Miciah
Contributor

Miciah commented Feb 14, 2024

> > Oh, is socat present in the image that CI builds but absent in the image that ART builds?
>
> Justin Pierce confirmed there is an active bug somewhere on the ART side where their tooling is randomly choosing either a bare bones base RHEL 9 image, or the actual OCP image it's supposed to use as the base of everything. We're suspecting that might be involved here in which case there's no rhyme or reason to what we'd see in any individual run, it's luck of the draw. They're hoping for a fix today.

Does that imply that it is safe to merge the revert of the revert once it passes CI? And will the additional commit in this PR, to install socat explicitly, guard against the ART bug?

@dgoodwin
Contributor

The situation is a bit fuzzy :) but I think merging the revert of the revert would not be safe right now; it would be safe only if we had confirmed that the ART bug was fixed at the time we validated. However, with the explicit install of socat, which seems like good practice regardless (especially if it is not in the base image from RHEL), I think we could merge this as soon as we see green. That feels like it should cover us. I'm not 100% certain, but I'd be OK with a merge here if this runs green.

@frobware
Contributor Author

frobware commented Feb 14, 2024

> Does the e2e-metal-ipi-ovn-ipv6 job use the HAProxy image with the changes in the PR?

The instruction in the revert PR #561 was to run /payload-job periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm. I did that in #563 (comment).

@frobware
Contributor Author

> Oh, is socat present in the image that CI builds but absent in the image that ART builds?

If so, then I'd say that has only been the case recently. On a cluster running a nightly from a month ago:

% oc rsh -n openshift-ingress router-default-ff66df969-pqcpf
Defaulted container "router" out of: router, logs
sh-5.1$ socat
2024/02/13 13:46:14 socat[20] E exactly 2 addresses required (there are 0); use option "-h" for help
sh-5.1$ which socat
/usr/bin/socat
sh-5.1$
exit
[x1c] ~infra/ocp416
% oc version
Client Version: 4.14.12
Kustomize Version: v5.0.1
Server Version: 4.16.0-0.nightly-2024-01-15-064322
Kubernetes Version: v1.29.0+f629574
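
The manual `which socat` check above could also be automated as a smoke test. The following is a minimal sketch under stated assumptions: the helper name `missing_tools` is hypothetical, it is not part of the router repository, and it assumes a Python interpreter is available wherever the check runs (which may not hold inside the router image itself).

```python
import shutil


def missing_tools(required):
    """Return the names in `required` that cannot be found on PATH.

    shutil.which performs the same PATH lookup as the shell's `which`.
    """
    return [name for name in required if shutil.which(name) is None]
```

A CI step could call `missing_tools(["socat"])` and fail the build when the returned list is non-empty, so that a base-image change is caught before a payload job depends on the binary.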

@Miciah
Contributor

Miciah commented Feb 14, 2024

I agree that installing socat explicitly (for the openshift-kni-infra HAProxy instance and for debugging) is prudent.
/approve
/lgtm
/hold
We can remove the hold once the ART issue has been sorted out and we have a passing payload test.

@openshift-ci bot added the do-not-merge/hold label Feb 14, 2024
@openshift-ci bot added the lgtm label Feb 14, 2024
Contributor

openshift-ci bot commented Feb 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

@openshift-ci bot added the approved label Feb 14, 2024
@Miciah
Contributor

Miciah commented Feb 15, 2024

/payload-job periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

Contributor

openshift-ci bot commented Feb 15, 2024

@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9f9f9100-cbc3-11ee-8eb1-40b7035a5792-0

@frobware
Contributor Author

/retest

@dgoodwin
Contributor

/payload-job periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

Contributor

openshift-ci bot commented Feb 15, 2024

@dgoodwin: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/30fd2080-cbf7-11ee-90cb-38959efd6abb-0

@frobware
Contributor Author

/retest

@frobware
Contributor Author

/assign

@frobware
Contributor Author

/retest

@frobware changed the title from "TRT-1507: Resolve missing socat binary" to "NE-1444: Resolve missing socat binary" Feb 20, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 20, 2024

@frobware: This pull request references NE-1444 which is a valid jira issue.

@frobware
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 20, 2024

@frobware: This pull request references NE-1444 which is a valid jira issue.

@frobware
Contributor Author

The e2e-metal-ipi-ovn-ipv6 job may require:

@frobware changed the title from "NE-1444: Resolve missing socat binary" to "NE-1444: Bump HaProxy to the latest version 2.8" Feb 20, 2024
@frobware
Contributor Author

/retest

@Miciah
Contributor

Miciah commented Feb 22, 2024

/retest
since openshift/machine-config-operator#4201 merged.

@Miciah
Contributor

Miciah commented Feb 24, 2024

e2e-metal-ipi-ovn-ipv6 failed because two tests failed. First, [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers failed:

{  37 events happened too frequently

event [namespace/openshift-e2e-loki node/worker-1.ostest.test.metalkube.org pod/loki-promtail-2472j hmsg/16cc22e678 - Back-off restarting failed container prod-bearer-token in pod loki-promtail-2472j_openshift-e2e-loki(7d57f55c-d02c-49af-ac4a-dc2de067737d)] happened 180 times

The prod-bearer-token container was logging the following output:

level=info name=token-refresher ts=2024-02-23T03:02:35.768988846Z caller=main.go:169 msg=token-refresher
2024/02/23 03:02:35 OIDC provider initialization failed: Get "https://sso.redhat.com/auth/realms/redhat-external/.well-known/openid-configuration": proxyconnect tcp: dial tcp [fd00:1101::1]:8213: connect: connection refused

So some proxy that the CI job uses appears to be refusing connections. OCPBUGS-29478 appears to have been filed to track this issue.

Second, [sig-arch] events should not repeat pathologically for ns/openshift-image-registry also failed:

{  3 events happened too frequently

event happened 102 times, something is wrong: namespace/openshift-image-registry node/master-2.ostest.test.metalkube.org pod/node-ca-q5lpn hmsg/0e4d66f0ba - reason/FailedToRetrieveImagePullSecret Unable to retrieve some image pull secrets (node-ca-dockercfg-fqvwx); attempting to pull the image may not succeed. From: 02:55:16Z To: 02:55:17Z result=reject 
event happened 100 times, something is wrong: namespace/openshift-image-registry node/master-0.ostest.test.metalkube.org pod/node-ca-gbgwt hmsg/0e4d66f0ba - reason/FailedToRetrieveImagePullSecret Unable to retrieve some image pull secrets (node-ca-dockercfg-fqvwx); attempting to pull the image may not succeed. From: 02:55:24Z To: 02:55:25Z result=reject 
event happened 100 times, something is wrong: namespace/openshift-image-registry node/master-1.ostest.test.metalkube.org pod/node-ca-xkk6x hmsg/0e4d66f0ba - reason/FailedToRetrieveImagePullSecret Unable to retrieve some image pull secrets (node-ca-dockercfg-fqvwx); attempting to pull the image may not succeed. From: 02:55:09Z To: 02:55:10Z result=reject }

I haven't seen that failure on the router CI job before. I do see some similar-looking failures in other CI jobs, and some bug reports for "FailedToRetrieveImagePullSecret" for other platforms, but not for bare metal. Let's see whether it happens again on this PR.

/test e2e-metal-ipi-ovn-ipv6
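
When comparing repeated-event failures across runs, it can help to total the counts per namespace and reason instead of reading the raw lines. The following is a minimal sketch, assuming the "event happened N times, something is wrong: ..." line shape quoted above for the second failure. The names `EVENT_RE` and `summarize_events` are hypothetical; this is not part of the CI tooling, and lines in any other shape (such as the first failure's bracketed format) are silently skipped.

```python
import re
from collections import Counter

# Matches the line shape quoted in the second failure above; other
# layouts simply do not match and are ignored.
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"namespace/(?P<namespace>\S+) node/(?P<node>\S+) pod/(?P<pod>\S+) "
    r".*?reason/(?P<reason>\S+)"
)


def summarize_events(lines):
    """Sum the reported repeat counts per (namespace, reason) pair."""
    totals = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            key = (match["namespace"], match["reason"])
            totals[key] += int(match["count"])
    return totals
```

Feeding the three node-ca lines above into `summarize_events` groups them under a single (openshift-image-registry, FailedToRetrieveImagePullSecret) key, which makes it easier to see whether a later run shows the same pattern.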

@Miciah
Contributor

Miciah commented Feb 24, 2024

/payload-job periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

Contributor

openshift-ci bot commented Feb 24, 2024

@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0211f770-d2cb-11ee-8834-7bc5b0012730-0

@Miciah
Contributor

Miciah commented Feb 26, 2024

The most recent payload test is green.
/hold cancel

@openshift-ci bot removed the do-not-merge/hold label Feb 26, 2024
@openshift-merge-bot merged commit cf0442b into openshift:master Feb 26, 2024
8 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-haproxy-router-base-container-v4.16.0-202402261840.p0.gcf0442b.assembly.stream.el9 for distgit ose-haproxy-router-base.
All builds following this will include this PR.
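
The build name above appears to embed the merged commit as a `g`-prefixed field (`gcf0442b`, matching the merge commit cf0442b). The following is a minimal sketch for extracting it when checking whether a given build contains a change. It assumes that naming pattern holds in general, which this thread does not confirm, and `commit_from_build` is a hypothetical helper name.

```python
import re


def commit_from_build(build_name: str):
    """Return the short commit hash from a '.g<hex>.' field, or None.

    Assumption: the build name carries the source commit as a
    dot-delimited field made of 'g' followed by 7 to 40 hex digits.
    """
    match = re.search(r"\.g([0-9a-f]{7,40})\.", build_name)
    return match.group(1) if match else None
```

Comparing the result with the merge commit shown on the PR gives a quick consistency check before assuming that a nightly includes the fix.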

Contributor

openshift-ci bot commented Feb 26, 2024

@frobware: all tests passed!

@Miciah
Contributor

Miciah commented Feb 26, 2024

For anyone following along, NE-1444 is the tracker for the original HAProxy 2.8 bump, and NE-1506 is the tracker that we created for this PR (the revert of the revert of the HAProxy 2.8 bump). Since the PR has merged, I have closed NE-1506.

@frobware deleted the lolcat branch May 1, 2024 12:04
Labels: approved, jira/valid-reference, lgtm