
backend: caching: implemented caching for single-cluster #3499


Open

wants to merge 27 commits into main from caching-pagination-search

Conversation


@upsaurav12 upsaurav12 commented Jun 20, 2025

Implementation

I have implemented caching for single-cluster. For multi-cluster, I am trying to verify it by creating users and changing the use-context.

Here is how I have implemented it:

  1. I capture the response during ProxyRequest and store it in the cache as a string.
  2. The response body is converted from gzip to bytes and then to a string.
  3. If the user is making a request for the first time, a selfsubjectrulesreviews request is made first; if it returns Failure, the user is not authorized. If it doesn't, the request proceeds to read from the cache.
  4. A new key is generated for each new kind, namespace, and cluster, and it stays the same when the user makes the same request again.
  5. If the key is not present in the cache, the actual request goes to Kubernetes and the response is then stored in the cache.

However, there is more to add, such as keeping the cache in sync with the latest Kubernetes state. I am still deciding whether to continuously poll Kubernetes at a fixed interval (e.g. every 5s) or use watch events from Kubernetes.
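
To make steps 1-5 concrete, here is a minimal sketch of the key-generation and lookup flow described above; the function names and the plain map standing in for the cache store are illustrative, not the exact code in this PR.

package k8cache

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// generateKey builds a deterministic key from cluster, namespace, and kind,
// so repeated identical requests map to the same cache entry.
func generateKey(cluster, namespace, kind string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%s/%s", cluster, namespace, kind)))
	return hex.EncodeToString(sum[:])
}

// loadOrFetch returns the cached response if the key is present; otherwise it
// performs the real request to the Kubernetes API (fetch), stores the result,
// and returns it.
func loadOrFetch(store map[string]string, key string, fetch func() (string, error)) (string, error) {
	if body, ok := store[key]; ok {
		return body, nil
	}
	body, err := fetch()
	if err != nil {
		return "", err
	}
	store[key] = body
	return body, nil
}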

Test Results

=== RUN   TestInitialize
=== RUN   TestInitialize/initializes_responseCapture_with_defaults
--- PASS: TestInitialize/initializes_responseCapture_with_defaults (0.00s)
--- PASS: TestInitialize (0.00s)
=== RUN   TestExtractNamespace
=== RUN   TestExtractNamespace/valid_url_with_namespace
--- PASS: TestExtractNamespace/valid_url_with_namespace (0.00s)
=== RUN   TestExtractNamespace/empty_url
--- PASS: TestExtractNamespace/empty_url (0.00s)
=== RUN   TestExtractNamespace/invalid_url_format
--- PASS: TestExtractNamespace/invalid_url_format (0.00s)
--- PASS: TestExtractNamespace (0.00s)
=== RUN   TestGetResponseBody
=== RUN   TestGetResponseBody/valid_response
--- PASS: TestGetResponseBody/valid_response (0.00s)
=== RUN   TestGetResponseBody/empty_Response
--- PASS: TestGetResponseBody/empty_Response (0.00s)
=== RUN   TestGetResponseBody/empty_contentType
--- PASS: TestGetResponseBody/empty_contentType (0.00s)
--- PASS: TestGetResponseBody (0.00s)
=== RUN   TestGenerateKey
=== RUN   TestGenerateKey/url_was_valid_
--- PASS: TestGenerateKey/url_was_valid_ (0.00s)
=== RUN   TestGenerateKey/empty_cluster
--- PASS: TestGenerateKey/empty_cluster (0.00s)
--- PASS: TestGenerateKey (0.00s)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 20, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: upsaurav12
Once this PR has been reviewed and has the lgtm label, please assign joaquimrocha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 20, 2025

illume commented Jun 23, 2025

Nice one.

ps. See commit guidelines, and run make backend-lint to see lint issues locally.


@illume illume left a comment


Nice start. Thanks.

Would you mind separating the code out from headlamp.go? Code can be put into a separate module/package, and then headlamp.go can import it and use it. We are in the process of moving as much code out of headlamp.go as possible into package modules. It's all a bit too much code tangled up in there.

Could you please add tests? Please also add documentation to your functions.

I left a comment about commit message guidelines and running backend-lint.


upsaurav12 commented Jun 25, 2025

@illume I have extracted the cache implementation from headlamp.go. I am sorry I wasn't able to write tests because my system is running slow; I will write them and push.

I am not sure whether this implementation will work for multiple users; I am looking for a way to test it with service accounts. In development mode, however, it fetches clusters directly from the kubeconfig file and loads them without asking for a bearer token.

@upsaurav12 upsaurav12 force-pushed the caching-pagination-search branch 4 times, most recently from 2452767 to 9aa03d1 on June 26, 2025 12:29
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 27, 2025

illume commented Jun 27, 2025

How about making its own package, pkg/k8cache? I think it should not go inside pkg/cache, since its topic is slightly different from what pkg/cache already does.


upsaurav12 commented Jun 27, 2025

How about making its own package, pkg/k8cache? I think it should not go inside pkg/cache, since its topic is slightly different from what pkg/cache already does.

I was thinking the same, but eventually implemented it in the same package because of cache initialization and ease of importing functions.

I will modify it.

@upsaurav12

I have moved the code to pkg/k8cache as its own package.


upsaurav12 commented Jun 28, 2025

Hi, I have just tested this cache implementation with different service accounts as users: for example, a service account with no access to cluster resources, a user with limited access, and a user with full access. From my end it is working fine, showing only the data the user is authorized to see.

Screen.Recording.-.Jun.28.2025-VEED.mp4

However, there are still some points that need to be fixed/addressed.

  1. Initialization of the cache: Currently I am initializing the cache as a global variable, which I am not sure is the ideal approach for this project specifically. Is it fine to initialize it like this?
  2. Synchronization of cache data: There are two options, the watch API or polling Kubernetes at a fixed interval (a watch-based sketch follows this list). Which approach would be ideal for the project?
  3. Use of token: Would it be a good approach to use the token in key generation? I am already using an SSAR request here to authorize the user, but the K8s dashboard PR did use it.
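
For illustration, the watch-based option could look roughly like this with client-go; the watched resource (pods), the invalidate callback, and the function name are placeholders rather than code from this PR.

package k8cache

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchAndInvalidate drops cached entries whenever a watched resource changes,
// so the next request repopulates the cache with fresh data.
func watchAndInvalidate(ctx context.Context, cs *kubernetes.Clientset, invalidate func(kind string)) error {
	w, err := cs.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()

	for range w.ResultChan() {
		// Any add/update/delete event makes cached pod lists stale.
		invalidate("pods")
	}
	return nil
}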


@illume illume left a comment


I left a few notes.

@illume illume requested review from yolossn and removed request for joaquimrocha and skoeva July 3, 2025 08:13
@upsaurav12

I left a few notes.

I am sorry, I couldn't see these replies. I don't know why, but they weren't visible to me.
Thanks for the suggestions; I have made the changes you requested.


illume commented Jul 4, 2025

I am sorry, I couldn't see these replies. I don't know why, but they weren't visible to me.

No worries. I think I wrote them, then came back a few days later, noticed they were still pending, and submitted the review. I'm not sure why they kept the old date on them; I think it's a GitHub bug. That's why they weren't visible to you - I hadn't submitted the review yet.

return
}

isAllowed, err := k8cache.IsAllowed(r.URL, kContext, w, r)
Contributor

if isAllowed is false then a proper response has to be returned.

Contributor Author

I have added a new function ReturnAuthErrorResponse which returns the AuthErrResponse JSON object if the user is not authorized to access the resources.
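
For context, a rough sketch of what such an auth-error response could look like; the real ReturnAuthErrorResponse in this PR may use a different shape and fields.

package k8cache

import (
	"encoding/json"
	"net/http"
)

// writeAuthError writes a Kubernetes-style Status object with a 403 code,
// telling the client the request was not authorized.
func writeAuthError(w http.ResponseWriter, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusForbidden)
	_ = json.NewEncoder(w).Encode(map[string]interface{}{
		"kind":       "Status",
		"apiVersion": "v1",
		"status":     "Failure",
		"reason":     "Forbidden",
		"code":       http.StatusForbidden,
		"message":    msg,
	})
}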

Contributor

Cool, thanks. Is there a test case that covers this?

I think it's important that we test the authentication parts. I don't see a test for ReturnAuthErrorResponse, which is why I'm wondering if there are other test cases that check this part is correct?

Contributor Author

I haven't added test cases for the new functions I added after moving the cache logic to middleware, which is why you aren't seeing any tests for them.
If necessary, I will write those tests once I have completed the cache implementation (i.e. make the cache optional, plus other areas to improve).

Contributor

Sounds like a good plan.

// If the key was not found inside the cache then this will make actual call to k8's
// and this will capture the response body and convert the captured response to string.
// After converting it will store the response with the key and TTL of 10*min.
func RequestToK8sAndStore(k8scache cache.Cache[string], k *kubeconfig.Context,
Contributor

This function does make a request to the k8s API server, so let's improve/update the name for better understanding.

Contributor

also please remove unused args from this function.

Contributor Author

I have removed the kubeconfig.Context arg and changed the name from RequestToK8sAndStore to RequestK8ClusterAPIAndStore.

Contributor

@yolossn what do you think of the new name?

(To me the old name RequestToK8sAndStore was fine, so I'm not sure about this one)

// getClientMD is used to get a clientset for the given context and token.
// It will reuse clientsets if a matching one is already cached.
func getClientMD(k *kubeconfig.Context, token string) (*kubernetes.Clientset, error) {
contextKey := strings.Split(k.ClusterID, "+")
Contributor

what is the context behind this string split?

Contributor Author

Currently, I don't have a direct way to get the contextID. I can get it from k.ClusterID, but that contains too many things, e.g. /home/saurav/.kubeconfig+<contextID>, so to get the contextKey I have to split k.ClusterID.
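
A tiny illustration of the split described here; the helper name and the example context ID are hypothetical.

package k8cache

import "strings"

// contextIDFromClusterID extracts the trailing context ID from a ClusterID of
// the form "<kubeconfig path>+<contextID>", e.g. "/home/saurav/.kubeconfig+my-context".
func contextIDFromClusterID(clusterID string) string {
	parts := strings.Split(clusterID, "+")
	return parts[len(parts)-1]
}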

Contributor

If you can't think of a cleaner way to do this, can you please leave a comment there explaining what it is about?

Contributor

@upsaurav12 you can just pass the cluster name from the request here or use k.ContextName.

Contributor Author

Will update the code, thanks.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 14, 2025
func CacheMiddleWare(c *HeadlampConfig) mux.MiddlewareFunc {
return func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.Method == "POST" || r.Method == "UPDATE" || r.Method == "DELETE" || r.Method == "PUT" {
Contributor

I think this should be if != "GET", so it's an allow list of things we should support. It's usually better to use allow lists in case things change: if things change while it's just GET, it should still work. The other way around, if we've missed something or something else is added, this condition will break.

What about OPTIONS and HEAD? How should they be handled here?
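
For illustration, an allow-list check along these lines might look as follows; the helper is hypothetical, and including HEAD is just one possible answer to the question above.

package k8cache

import "net/http"

// cacheable reports whether a request's method is safe to serve from the cache.
// Only GET (and optionally HEAD) are allowed; everything else bypasses the cache.
func cacheable(r *http.Request) bool {
	switch r.Method {
	case http.MethodGet, http.MethodHead:
		return true
	default:
		return false
	}
}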

Contributor Author

I have updated the code and moved this logic into LoadFromCache, because requests with these methods should not go to the cache and should instead be forwarded as actual requests.

Also, there was a discussion about cache invalidation, where we need to remove stale data from the cache, but I am not sure what the workflow should be.
For example, if there is a POST request on a Secrets resource and that resource's data is present in the cache (under its key), we can invalidate the older data and store the new data.
To do this, after making the write request to Kubernetes (POST/DELETE), we check whether the key is in the cache: if present, remove it, otherwise proceed to store the new data.
@yolossn @illume WDYT?
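
A rough sketch of the invalidation workflow described above; the plain map standing in for the cache and the function name are illustrative only.

package k8cache

import "net/http"

// invalidateOnWrite removes a cached entry after a successful write to the
// Kubernetes API, so the next read repopulates the cache with fresh data.
func invalidateOnWrite(store map[string]string, method, key string) {
	switch method {
	case http.MethodPost, http.MethodPut, http.MethodPatch, http.MethodDelete:
		delete(store, key) // deleting a missing key is a no-op
	}
}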

Contributor

Sounds good. Please mark this resolved when you're done?

Contributor Author

Will push the code once it is completely done.

@illume illume requested a review from Copilot July 17, 2025 08:02
Copilot

This comment was marked as outdated.

@upsaurav12 upsaurav12 requested review from illume, yolossn and Copilot July 17, 2025 09:00

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces single-cluster caching for Kubernetes API proxy requests, adding key generation, response capture, and middleware integration.

  • Add k8cache package with key hashing, response capture, cache get/set, and filtering logic
  • Wire up CacheEnabled flag in configuration and server entrypoint to enable caching middleware
  • Provide comprehensive unit tests for key functions and cache behavior

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • backend/pkg/k8cache/key.go: Define CacheKey struct and SHA hashing utility
  • backend/pkg/k8cache/k8cache.go: Implement caching logic: response capture, key generation, load/store
  • backend/pkg/k8cache/k8cache_test.go: Add unit tests for hashing, header filtering, marshal/unmarshal
  • backend/pkg/headlampconfig/headlampConfig.go: Add CacheEnabled flag to headlamp configuration
  • backend/pkg/config/config.go: Add cache-enabled CLI/env flag to core config
  • backend/cmd/server.go: Integrate caching middleware based on CacheEnabled
  • backend/cmd/headlamp.go: Declare global cache instance and refactor cluster request handler
Comments suppressed due to low confidence (2)

backend/pkg/k8cache/k8cache.go:167

  • Update the comment to match the function name 'FilterHeadersForCache' instead of 'FilterHeaderCache' for clarity.
// FilterHeaderCache ensures that the cached headers accurately reflect the state of the

}

// UnmarshalCachedData deserialize a JSON string received from cache
// back into a CacheResposeData struct. This function is used to recontructing

Copilot AI Jul 17, 2025


Fix spelling in comment: 'CacheResposeData' should be 'CachedResponseData' and 'recontructing' should be 'reconstructing'.

Suggested change
// back into a CacheResposeData struct. This function is used to recontructing
// back into a CachedResponseData struct. This function is used to reconstructing


Kind: "Status",
APIVersion: "v1",
MetaData: MetaData{},
Message: fmt.Sprintf("%s is forbidden: User \"system:serviceaccount:default:%s\" cannot", last, contextKey) +

Copilot AI Jul 17, 2025


Add a space after 'cannot' or before the next '%s' to avoid concatenating words (e.g., 'cannot get').

Suggested change
Message: fmt.Sprintf("%s is forbidden: User \"system:serviceaccount:default:%s\" cannot", last, contextKey) +
Message: fmt.Sprintf("%s is forbidden: User \"system:serviceaccount:default:%s\" cannot ", last, contextKey) +


Comment on lines +303 to +305
// TestMarshallToStore tests whether the MarshallToStore
// serialized correctly that will be stored into the cache.
func TestMarshallToStore(t *testing.T) {

Copilot AI Jul 17, 2025


The test name and comment spell 'MarshallToStore' with two L's. It should be 'MarshalToStore' to match the function name.

Suggested change
// TestMarshallToStore tests whether the MarshallToStore
// serialized correctly that will be stored into the cache.
func TestMarshallToStore(t *testing.T) {
// TestMarshalToStore tests whether the MarshalToStore
// serialized correctly that will be stored into the cache.
func TestMarshalToStore(t *testing.T) {


attribute.String("cluster", mux.Vars(r)["clusterName"]),
)

defer span.End()

Copilot AI Jul 17, 2025


Calling 'defer span.End()' inside this helper ends the trace span immediately when this function returns, which may be before downstream logic executes. Consider ending the span in the caller after the handler completes.

Suggested change
defer span.End()
// Note: The caller is responsible for ending the span by calling span.End().



illume commented Jul 21, 2025

@yolossn @upsaurav12 I went through and marked a few conversations as resolved that were finished.

What's the plan to work on next? (what's left to do?)

Is it getting near a point where the code should be cleaned up and merged? Maybe now is a good time to start adding tests and breaking it up into clean commits?
