Sleep-based synchronization between cache warmup and the first sync to Kong #2315

Open · 1 of 2 tasks · mflendrich opened this issue Mar 10, 2022 · 9 comments · May be fixed by #5483

Comments

@mflendrich (Member)

Is there an existing issue for this?

  • I have searched the existing issues

Problem Statement

To mitigate #2249, #2255 added a 5-second sleep before the first sync.

Proposed Solution

Replace the sleep with the solution proposed in #2249 (comment).

Additional information

No response

Acceptance Criteria

  • The first sync is blocked until the cache has been reliably filled with all the relevant Kubernetes objects (see the sketch below).
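
As a rough illustration of the acceptance criterion, here is a minimal sketch assuming controller-runtime (which KIC uses): `mgr.GetCache().WaitForCacheSync` is a real controller-runtime call in recent versions, while `waitThenSync` and the `firstSync` callback are hypothetical stand-ins for the actual synchronizer wiring.

```go
// A hypothetical barrier replacing the fixed 5s sleep: block the first sync to
// Kong until the manager's informer caches have been filled.
package barrier

import (
	"context"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
)

// waitThenSync blocks until the manager's caches have synced, then performs
// the first push of configuration to Kong. firstSync is a stand-in for
// whatever entry point the dataplane synchronizer actually exposes.
func waitThenSync(ctx context.Context, mgr ctrl.Manager, firstSync func(context.Context) error) error {
	// WaitForCacheSync returns false if ctx is cancelled before the caches
	// are filled; in that case we must not push a partial configuration.
	if !mgr.GetCache().WaitForCacheSync(ctx) {
		return fmt.Errorf("informer caches did not sync before shutdown")
	}
	return firstSync(ctx)
}
```
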
@rainest (Contributor) commented May 3, 2022

Do we indeed want this as an onboarding task? IIRC this was more to track that we're still using the sleep for lack of a great alternative and to collect any reports where the sleep is either insufficient or causes some unexpected issue.

#2249 (review of the problem and non-viable options for fixing it) and #2250 (a PoC implementation to demonstrate why the known viable option is not ideal) show prior work, but to be clear the proposal here is basically something we'd like to avoid in order to preserve separation between the controllers, even if it works in a technical sense. This issue says "do this" but this should probably still be a spike to try and come up with alternatives.

@shaneutt (Member) commented May 4, 2022

Indeed, I don't think we should use this as an onboarding task; we should mark this one as a nice-to-have.

@mflendrich (Member, Author)

collect any reports where the sleep is either insufficient or causes some unexpected issue

Sleeping is not a correct synchronization mechanism and, as such, IMO does not require further justification for fixing (for example, because changes to unrelated parts of the code may unexpectedly change how long the cache warmup takes). Whether it is likely to have visible consequences in the foreseeable future is unknown (maybe it's unlikely, but we cannot know that for sure). Issues like this add up and result in surprising bugs across the codebase. I'm happy to be convinced otherwise, though.

This issue says "do this" but this should probably still be a spike to try and come up with alternatives.

Good point. Would making this a spike, as @rainest proposed, make it a decent exploratory onboarding task (based on the prior art)?

we should mark this one as a nice to have

To me, personally, "nice to have" means something like "this is a feature that is not likely to ever draw our attention". We could consider fixing any technical debt a "nice to have" (and I'm viewing this issue as pure technical debt). Do we want to apply "nice to have" to technical debt items that pose only a maintenance problem? I see this label as belonging more on the functional side.

To summarize, here's my thinking/alternative proposal:

  • let's replace "do this" with "please explore and see if we can add a barrier that resolves the race in another way" (seems uncontroversial)
  • keep "area/debt" rather than "nice to have"
  • consider this "onboarding" because it's a pure Go issue that doesn't require much k8s/domain background

WDYT?

@shaneutt (Member) commented May 4, 2022

As long as it's clear that this is a spike, and that the work to be done first is to propose a solution rather than immediately implement one, that's fine with me. Also, it can be marked as both area/debt and nice-to-have at the same time.

@jiachinzhao (Contributor)

I have encountered a related issue when testing in our prod env.

Kong 2.8.0 db-less
Kong ingress controller 2.4.2
EKS 1.18

Our prod EKS env has 1000+ Services; I configured a single Ingress using a new IngressClass.

After starting a Kong pod and waiting for it to be running, requests to it return 503, "failure to get a peer from the ring-balancer"; after a while (maybe 1 or 2 minutes) responses recover to 200.

Debugging via the Kong Admin API, I found 1 route and 1 service, but no targets in the related upstream when the error occurred.

Ingress Controller log
2022-07-28 14:14:52 time="2022-07-28T06:14:52Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:49 time="2022-07-28T06:14:49Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:46 time="2022-07-28T06:14:46Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:43 time="2022-07-28T06:14:43Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:40 time="2022-07-28T06:14:40Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
2022-07-28 14:14:37 time="2022-07-28T06:14:37Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080

The Service is running normally in EKS, and this occurs consistently in our scenario.

Is there any plan or release version to fix it?

@rainest (Contributor) commented Jul 28, 2022

@jiachinzhao that's not this issue; if it were, you wouldn't have any service or route at all. That error generally means that your Service has no Endpoints because there are no Pods ready to serve it.

If you do still think there's an issue with the controller not properly seeing Endpoints it should, please open a separate issue.

@jiachinzhao (Contributor) commented Jul 29, 2022

OK, I will open a separate issue to explain.

Another related issue:

Steps to reproduce

  1. Kong 2.8.0 db-less
  2. Kong ingress controller 2.2.0
  3. EKS 1.18
  4. Apply a single Ingress rule
  5. Service and pod running normally

Start a Kong pod, then execute test-new.sh.

The script checks whether the ingress controller reports ready before or after Kong's routes have synced.

# Grab the pod IP of the new Kong pod in the kong namespace.
ip=`kubectl get pod -o wide -n kong | grep 'kong-new' | awk '{print $6}'`
echo $ip
# Every second: probe the controller's readiness endpoint, list Kong's routes,
# and send a test request through the proxy.
while true
do
   date
   echo -e "\ningress controller healthz"
   curl -v $ip:10254/healthz
   echo -e "\nkong routes"
   curl $ip:8001/routes
   echo -e "\nrequest pong"
   curl -H 'host: kong-new.test.io'  $ip:8000/v1/frontlogsrv/pong
   sleep 1
done

The output below shows that the ingress controller's healthz reports ready before Kong's routes have synced.

This leads to 404 responses while traffic is proxied to this pod.

In ingress controller 1.3.x, by contrast, healthz only became ready after the cache sync had succeeded.

Requests sent:

```
./test-new.sh 10.15.52.171

Fri Jul 29 09:33:59 UTC 2022
ingress controller healthz
curl: (7) Failed to connect to 10.15.52.171 port 10254: Connection refused
kong routes
{"data":[],"next":null}
request pong
{"message":"no Route matched with those values"}

(identical output at 09:34:00, 09:34:01 and 09:34:02: healthz connection refused, routes empty, request 404)

Fri Jul 29 09:34:03 UTC 2022
ingress controller healthz
< HTTP/1.1 200 OK
ok
kong routes
{"data":[],"next":null}
request pong
{"message":"no Route matched with those values"}

(identical output at 09:34:05, 09:34:06, 09:34:07 and 09:34:08: healthz returns 200 "ok", but routes are still empty and the request still returns 404)

Fri Jul 29 09:34:09 UTC 2022
ingress controller healthz
< HTTP/1.1 200 OK
ok
kong routes
{"data":[{"snis":null,"tags":null,"request_buffering":true,"response_buffering":true,"created_at":1659087249,"paths":["/v1/frontlogsrv/"],"methods":null,"https_redirect_status_code":426,"preserve_host":true,"service":{"id":"e56c138c-7c1d-5925-8e9e-698913f8e7ab"},"hosts":["kong-pre-release.klook.io"],"destinations":null,"protocols":["http","https"],"regex_priority":0,"path_handling":"v0","id":"bb31773d-8418-5943-b4ae-c88d8eae90a5","updated_at":1659087249,"headers":null,"name":"deploy.frontlogsrv-kong-new-test.00","strip_path":false,"sources":null}],"next":null}
request pong
frontlogsrv:01.:8080:active:service/deploy/frontlogsrv

(same at 09:34:10: healthz 200 "ok", routes populated, request succeeds)
```

@dcarley commented Jan 24, 2023

I thought that we had run into a startup/sync bug caused by this sleep but it actually turned out that the chart was using the wrong readiness endpoint: Kong/charts#716

@pmalek (Member) commented Jan 24, 2023

Hi @dcarley 👋

The problem you're trying to solve is not related to this bug. What the readyz endpoint in KIC currently does is return true once the first sync (i.e. sending Kong objects, translated from parsed Kubernetes resources, to the Kong instance(s)) has already happened.

Sometimes this sync can send fewer objects than we'd intend because the parser https://github.com/Kong/kubernetes-ingress-controller/blob/91bd96dd47a7d15dcc340d6d2c032ddd1a8ab22d/internal/dataplane/parser/parser.go hasn't managed to fill its caches yet. This is what #2250 tried to remediate by using the manager's client to fetch those k8s resources directly, bypassing the parser cache(s).

Your fix in Kong/charts#716 still makes sense, but it doesn't solve the issue described here.
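
For illustration, a minimal sketch of such a readiness gate, assuming controller-runtime's healthz package (checks can be registered through the manager via AddReadyzCheck); the `firstSyncDone` flag and its wiring are hypothetical and only mirror the behaviour described above, not the actual KIC implementation.

```go
// A sketch (not the actual KIC code) of a readiness check that only reports
// OK once the first complete configuration push to Kong has happened.
package readiness

import (
	"errors"
	"net/http"
	"sync/atomic"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// firstSyncDone would be flipped by the dataplane client after its first
// successful configuration update to the Kong Admin API (hypothetical wiring).
var firstSyncDone atomic.Bool

// ReadyzChecker satisfies healthz.Checker and could be registered with
// mgr.AddReadyzCheck("dataplane-synced", ReadyzChecker).
var ReadyzChecker healthz.Checker = func(_ *http.Request) error {
	if !firstSyncDone.Load() {
		return errors.New("initial configuration has not been pushed to Kong yet")
	}
	return nil
}
```
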
