Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test #715

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

gabemontero
Copy link
Contributor

@gabemontero gabemontero commented Feb 20, 2024

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Changes

Fixes #720

alternative to #712 with regard to still addressing the goroutine/memory leak, swtiching from the tkn client to k8s client, but with the reduction of threads, allows for retry when appropriate on UpdateLog errors.

/kind bug

These changes address a goroutine/memory leak that existed with the prior main branch vesrion,

Additionally

  • switch from tkn client to k8s pod log retrieval
  • fix panic, add debug, to e2e_gcs_test
  • rm tkn cli dependency
  • go mod tidy, vendor
  • add rbac for pod listing
  • move streamLogs onto reconciler thread
  • dump default worker pool thread count to 32 when logs are enabled
  • switch from grpc CloseSend to CloseAndRecv to address intermittent canceled context errors

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

  • [n/a ] Has Docs included if any changes are user facing
  • [ n/a] Has Tests included if any functionality added or changed
  • [ /] Tested your changes locally (if this is a code change)
  • [ /] Follows the commit message standard
  • [ /] Meets the Tekton contributor standards (including functionality, content, code)
  • [/ ] Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [n/a ] Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • [ n/a] Release notes contain the string "action required" if the change requires additional action from users switching to the new release

Release Notes

addresses UpdateLog calls from the watcher failing with canceled context, and allows for proper erro retry, while still fixing goroutine leak, which leads to overall memory leak, when storing logs

@tekton-robot tekton-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Feb 20, 2024
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 20, 2024
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7

@gabemontero gabemontero force-pushed the update-log-chan-goroutine-k8sclient branch from c9ed339 to b340071 Compare February 20, 2024 21:53
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7

@gabemontero gabemontero force-pushed the update-log-chan-goroutine-k8sclient branch from b340071 to 97634f4 Compare February 20, 2024 22:12
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.4% -14.0

@gabemontero
Copy link
Contributor Author

@sayan-biswas @avinal @enarha @ramessesii2 @adambkaplan FYI a first pass at replacing tkn cli with generic k8s get pod logs API calls

I could see this replacing #712 after a little more refining.... but curious on opinions / feedback

also really close to getting rid of launching the separate goroutine in sendLogs, but could also see deferring that to a follow pull request that also dives into verifying UpdateLogs completed in the API server to completely remove the intermittent canceled context UpdateLog instances we are seeing in Pavel's soak test. But I took a first pass at that today and that is going to require some unit test overall, as I've discovered the unit tests are not fully wired to return fake data.

@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 21, 2024
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7

@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.7% -7.7

@gabemontero gabemontero force-pushed the update-log-chan-goroutine-k8sclient branch from f9f94f4 to a9570cd Compare February 21, 2024 21:23
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.5% -7.9

@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.5% -7.9

@gabemontero gabemontero changed the title [WIP] switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test Feb 22, 2024
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 22, 2024
if podLogsErr != nil {
return podLogsErr
}
hdr := fmt.Sprintf("*** Logs for pod %s container %s ***\n", pod.Name, container.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as we merge #712 first so we can get some immediately relief with respect to the memory leak, then yes, I'll look more into the "Rainbow" decorator or whatever it was that I saw while investigating things, and see about keeping that format.

Also, I want to get a baseline in our prod env (vs. our separate stress tests in test envs) with the current multi-threaded format before moving on to this PR's further reduction in threads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not notice that the colouring package is used. Since we are storing the logs I don't think we should do the "rainbow" colours, but the format with [task-name] and [step-name] is what I meant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch to old format is up and passed e2e's (though a second e2e run is in progress as I type as I've removed debug and squashed related commits)

PTAL @sayan-biswas @avinal

@tekton-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2024
@sayan-biswas
Copy link
Contributor

Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.

I just have one comment on the log format.

*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b

@gabemontero
Copy link
Contributor Author

Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.

I just have one comment on the log format.

*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b

@sayan-biswas is your question here ^^ separate from the one in #715 (comment) ?

If it is, apologies I'm missing what you are getting at, and ask that you elaborate either here or in slack/team meetings. Thanks.

@sayan-biswas
Copy link
Contributor

@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.

@gabemontero
Copy link
Contributor Author

gabemontero commented Feb 26, 2024

@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.

OK thanks for confirming the comments refer to the same core notion @sayan-biswas

As to memory footprint, I am waiting to hear back from Pavel on when he can start a smoke test, so not yet.

This is a factor as to why I want to merge #712 first. We have validated the memory footprint is fixed there. Also, I want to see how this change looks in our prod env for a bit with more worker threads, particularly with the workqueue depth, so as to potentially compare with this change.

@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 56.1% -13.2

@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9

@gabemontero
Copy link
Contributor Author

flake on kind install of results in integration test ... trying again

/retest

@gabemontero
Copy link
Contributor Author

/retest

@gabemontero
Copy link
Contributor Author

OK @sayan-biswas @avinal @ramessesii2 the switch from CloseSend to CloseRecv (which calls CloseSend under the covers and then does a Recv on a LogSummary to apparently confirm the operation was successful) passed e2e's.

I am next going to attempt to re-introduce a child context with timeout and calling of its cancel to hopefully

  1. fully clean up the grpc goroutines currently blocked on context that Pavel has observed in the smoke test
  2. but not cancel the context prematurely such that the UpdateLog call is interrupted with 'canceled context'

@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5

@gabemontero
Copy link
Contributor Author

OK success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496

This PR now:

  • employs k8s pod log APIs with no goroutines launched by the watcher to allow for error inspection and intelligent retry
  • but employs tkn style line by line prefixing
  • it employs a child context with timeout and a call to its cancel func which should cleanup all the grpc client goroutines (at least 4 by newClientStreamWithParams alone) that get created as part of making the UpdateLog call to the results API server .... we'll need to confirm this with the next (and hopefully final for now) soak test round with @pmacik
  • but the switch from CloseSend to CloseAndRecv (which calls CloseSend under the covers, but then makes a Recv call on the LogSummary obj which at least with the integration tests seem to confirm the UpdateLog completes before our cancel func cleans things up; when we only used CloseSend the UpdateLog call would get aborted with a canceled context error

side bar - @khrm I'll be curious how your viper POC mirror / overlaps with the grpc machinations with the results api server

I'll work out with @pmacik on getting the next soak test run started, but ideally in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment.

Thanks

@gabemontero gabemontero force-pushed the update-log-chan-goroutine-k8sclient branch from 60a8dd1 to b8a991a Compare March 1, 2024 15:31
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5

@gabemontero
Copy link
Contributor Author

OK success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496

This PR now:

* employs k8s pod log APIs with no goroutines launched by the watcher to allow for error inspection and intelligent retry

* but employs tkn style line by line prefixing

* it employs a child context with timeout and a call to its cancel func which _should_ cleanup all the grpc client goroutines (at least 4 by `newClientStreamWithParams` alone) that get created as part of making the `UpdateLog` call to the results API server .... we'll need to confirm this with the next (and hopefully final for now) soak test round with @pmacik

* but the switch from `CloseSend` to `CloseAndRecv` (which calls `CloseSend` under the covers, but then makes a `Recv` call on the `LogSummary` obj which at least with the integration tests seem to confirm the `UpdateLog` completes before our cancel func cleans things up;  when we only used `CloseSend` the `UpdateLog` call would get aborted with a `canceled context error`

side bar - @khrm I'll be curious how your viper POC mirror / overlaps with the grpc machinations with the results api server

I'll work out with @pmacik on getting the next soak test run started, but ideally in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment.

Thanks

good news @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm the soak test completed over the weekend and both

  • no goroutine / memory lean .... only minor diffs in overall thread accumulation
  • the intermittent UploadLog canceled context error also appears to have been addressed

let's please go ahead and review and get this PR ready for merge. Aside from fixing the intermittent canceled context errors, this change also allows us to better retry on retryable errrors.

thanks

@gabemontero
Copy link
Contributor Author

/hold cancel

update PR description, issue fixed, release note, to address latest findings

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2024

// comparing closeErr with io.EOF does not work; and I could not find code / desc etc. constants in the grpc code that handled
// the wrapped EOF error we expect to get from grpc when things are "OK"
if logSummary, closeErr := logsClient.CloseAndRecv(); closeErr != nil && !strings.Contains(closeErr.Error(), "EOF") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this wait for the streaming to complete? If so, then shouldn't his hold the main reconciler loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes and yes .... all my recent work has proven beyond a doubt that spawning threads off of the reconciler thread for these grpc calls is absolutely untenable..... absolutely cannot go back to that mode of operation moving forward

  • go routine / mem leaks because contexts are not properly cleaned up
  • intermittent UpdateLog calls getting aborted because of canceled context
  • and as you know, since you've reported it to me previously, can't do proper retry on retryable errors with it

My commit to bump the worker thread pool to 32 when logging is enabled is in part a means to offset this, in case users inadvertently try stress tests with this ... that said, prior soak tests from Pavel when the threadiness was still 2 vs. 32 showed if "performs OK"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I'm fine losing that commit and leaving the default worker thread count to 2 if there is concern about 32 threads running with kind on one's laptop .... I seriously doubt that is the case given the pprof data Pavel accumulated so far, but the other issues were more of the focus.

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 26, 2024
@gabemontero gabemontero force-pushed the update-log-chan-goroutine-k8sclient branch from b8a991a to a8d623e Compare April 1, 2024 16:38
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2024
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2024
@gabemontero gabemontero force-pushed the update-log-chan-goroutine-k8sclient branch from a8d623e to 4b55355 Compare April 11, 2024 19:58
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2024
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 55.3% 51.2% -4.1

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
fix panic, add debug, to e2e_gcs_test
rm tkn cli dependency
go mod tidy, vendor
add rbac for pod listing
move streamLogs onto reconciler thread
mimic tkn client style prefixing on each line of log

These changes address a goroutine/memory leak that existed with the multi threaded tkn
client version, as well as now allow us to reconcile on retryable errors

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
…e goroutine cleanup by not interrupt UpdateLog call before it finishes

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
@gabemontero gabemontero force-pushed the update-log-chan-goroutine-k8sclient branch from 4b55355 to 5530c4b Compare April 17, 2024 18:39
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 17, 2024
@tekton-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 55.3% 51.2% -4.1

@gabemontero
Copy link
Contributor Author

/hold

still waiting to see in our prod env if this switch is necessary based on how the current tkn based solution performs

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
4 participants