Fix Deployment Details Bug #699

Merged
merged 7 commits into from
Feb 2, 2022

NimJay
Collaborator

@NimJay NimJay commented Jan 28, 2022

Fixes #697

Background

  • Look at the footer of onlineboutique.dev.
  • You'll see details about the "Zone", "Pod", and "Cluster".
  • This info is loaded by the getDeploymentDetails function (upon every HTTP request) in the frontend service.
  • getDeploymentDetails is using a Google Cloud API (cloud.google.com/go/compute/metadata) to load the cluster name and zone.
  • This was causing issues on non-GCP deployments (e.g., Kind and EKS).
  • I've fixed the issue by doing 2 things:
    • perform getDeploymentDetails asynchronously (so that the HTTP request handler isn't blocked)
    • have getDeploymentDetails load necessary info once.
  • This certainly isn't a perfect solution — but fixing this issue is high priority.

Summary of Changes

1. New File: deployment_details.go

  • I've moved the logic related to loading deployment details into a new file called deployment_details.go.
  • Reason 1: Single responsibility principle.
  • Reason 2: The handlers.go file is too long in my opinion.

2. Don't Wait on Deployment Details

  • Previously, we loaded deployment details synchronously.
  • But now, we're using a goroutine to load deployment details asynchronously.

3. Only Load Deployment Details Once

  • Previously, every HTTP request that needed to display deployment details reloaded the deployment details (e.g., each request made a call to Google Cloud to figure out the zone and cluster name).
  • Now we're only loading it once!

4. Mutex

  • I have introduced the use of a mutex to ensure that two HTTP requests don't race to load the deployment details twice.
  • Basically, I want to avoid this scenario:
    • Request 1: Are the details loaded? Nope, gotta load it.
    • Request 2: Are the details loaded? Nope, gotta load it.
    • Request 1: Let everyone know I'm taking care of loading it.
    • Request 2: Let everyone know I'm taking care of loading it.
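The scenario above is a check-then-act race. A minimal sketch of how a mutex closes it, assuming a hypothetical loader type and claim method (neither name is from this pull-request): the check and the update happen under the same lock, so only one request can take responsibility for loading.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// loader tracks whether some request has already taken responsibility
// for loading. Hypothetical type, used only for this sketch.
type loader struct {
	mu      sync.Mutex
	claimed bool
}

// claim reports whether the caller should start loading.
// The check and the update happen under one lock, so exactly one caller wins.
func (l *loader) claim() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.claimed {
		return false
	}
	l.claimed = true
	return true
}

func main() {
	l := &loader{}
	var winners int32
	var wg sync.WaitGroup
	// Simulate 100 concurrent requests all asking "are the details loaded?".
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.claim() {
				atomic.AddInt32(&winners, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("requests that started a load:", atomic.LoadInt32(&winners)) // prints "requests that started a load: 1"
}
```

Without the lock, two goroutines could both observe claimed == false before either sets it, which is the double load described in the list.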

5. Front-end Message

  • The following message is now displayed in the footer appropriately:

Deployment details are either still loading or failed to load.
Try refreshing this page.

Testing

  • I have tested these changes on both a GKE cluster (e.g., see staging URL) and a local Kind cluster (i.e., see comment below). This works. 👍
  • If you would like to test these changes yourself on a Kind cluster:
  1. Install Kind: brew install kind
  2. Create a Kind cluster: kind create cluster
  3. Make sure you're using the Kind cluster: kubectx. You should see kind-kind highlighted.
  4. Pull the changes from this pull-request: git pull nimjay/non-gcp-bug.
  5. Build the modified frontend service: docker build -t gcr.io/my-project/new-front-end src/frontend
  6. Push the modified frontend service: docker push gcr.io/my-project/new-front-end
  7. Change the kubernetes-manifests.yaml file: use gcr.io/my-project/new-front-end instead of gcr.io/google-samples/microservices-demo/frontend:v0.3.5.
  8. kubectl apply -f kubernetes-manifests.yaml
  9. To access the deployment on your web browser as localhost:8080, run: kubectl port-forward <name-of-frontend-pod> 8080:8080. (Use kubectl get pods to get the name of your front-end pod.)

Additional Notes

  • I am very new to Golang — so I'm open to significantly refactoring my code. I'm guessing there's a TON of room for improvement. 😅

var metaServerClient = metadata.NewClient(&http.Client{})

// The use of httpRequest here lets us to see HTTP request info in the logs.
var log = httpRequest.Context().Value(ctxKeyLog{}).(logrus.FieldLogger)
Collaborator Author

I think, ideally, this loadDeploymentDetails function would not know about the httpRequest object.
So, in the future, we could pass in the log object instead of the httpRequest object into the function.
But such an improvement would be outside the scope of this pull-request.

@github-actions

🚲 PR staged at http://34.132.215.174

@NimJay
Collaborator Author

NimJay commented Jan 31, 2022

I have tested this on a Kind cluster (i.e., a non-GCP cluster) and it works! 😌
Ping me if you would like me to quickly showcase this:

Screen Shot 2022-01-31 at 10 50 03 AM

See the footer:

Deployment details are either still loading or failed to load. Try refreshing this page.

I think we should eventually work on actually displaying deployment details for non-GCP clusters (instead of the above message).
But this is out of scope for now — let's fix this P1 issue.

@NimJay NimJay marked this pull request as ready for review January 31, 2022 15:57
@NimJay NimJay requested a review from a team as a code owner January 31, 2022 15:57
@xtineskim
Contributor

I just tried with minikube, and the footer had Pod and Cluster shown as blank (instead of "Deployment details are either still loading or failed to load. Try refreshing this page."). Is this the expected behavior?
Screen Shot 2022-01-31 at 11 53 38 AM

DetailsAreNotLoaded = 0
DetailsAreLoading = 1
DetailsAreLoaded = 2
)
Collaborator Author

@NimJay NimJay Jan 31, 2022

I was inspired to write the enum this way by this article, Using Enums (and Enum Types) in Golang.
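For readers unfamiliar with the pattern, here is a minimal sketch of a typed enum using iota. The constant names mirror the diff context above; the loadStatus type name and the String method are assumptions added for illustration, not taken from the pull-request.

```go
package main

import "fmt"

// loadStatus is a small typed enum. The type name is hypothetical.
type loadStatus int

const (
	DetailsAreNotLoaded loadStatus = iota // 0
	DetailsAreLoading                     // 1
	DetailsAreLoaded                      // 2
)

// String makes the status readable in logs and fmt output.
func (s loadStatus) String() string {
	switch s {
	case DetailsAreNotLoaded:
		return "not loaded"
	case DetailsAreLoading:
		return "loading"
	case DetailsAreLoaded:
		return "loaded"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(int(DetailsAreLoading), DetailsAreLoading) // prints "1 loading"
}
```

Using iota produces the same values as writing 0, 1, 2 by hand, while the named type lets the compiler reject untyped mix-ups with plain int variables.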

@NimJay
Collaborator Author

NimJay commented Jan 31, 2022

@ckim328, thanks for testing these changes on minikube.
Ah, yes, I'm seeing the same output on Kind.
Thanks for catching that!

Screen Shot 2022-01-31 at 12 47 49 PM

What's happening?
The following message is only displayed while getDeploymentDetailsIfLoaded() is still loading (i.e., still trying to talk to Google Cloud to figure out zone/cluster info).

"Deployment details are either still loading or failed to load. Try refreshing this page."

The changes from this pull-request ensure that the frontend service only tries to talk to Google Cloud once.
So after talking to Google Cloud fails, we will see the following in the kubectl logs of the frontend pod:

... Failed to fetch the name of the cluster in which the pod is running ...
... Failed to fetch the Zone of the node where ...

So yes, this is expected.

If you want, instead of showing an empty value for zone and cluster, I can state, "failed to load" (or a similar message) or I can hide the "zone" and "cluster" labels completely?
What do you think?

@xtineskim
Contributor

xtineskim commented Jan 31, 2022

Thanks for the clarification @NimJay 👍
I think in the case where it is deployed locally, I'm OK with removing the labels.

It's ok if it's not captured in the scope of this P1 discussion, we can def address these later!

@@ -32,6 +32,9 @@
<b>Cluster: </b>{{index .deploymentDetails "CLUSTERNAME" }}<br/>
<b>Zone: </b>{{index .deploymentDetails "ZONE" }}<br/>
<b>Pod: </b>{{index .deploymentDetails "HOSTNAME" }}
{{ else }}
Deployment details are either still loading or failed to load.
Member

Not too opinionated on this but I think we should silently fail and not show anything rather than show an error message. Or just show "Deployment details are unavailable" or similar.
The reasoning behind this is that more often than not, the deployment details are simply not available, but the current diff makes it seem like the user should be reloading, or that there is something fundamentally wrong with their cluster's well being.
Thoughts?

Collaborator Author

@ckim328 suggested a similar "fail silently" user experience (i.e., hiding labels that have no values).
So I've gone ahead and implemented that. :)

I have removed the following verbiage:

or failed to load

because this was not true. If an error occurred, we would've just seen labels with no values.

I have preserved the following verbiage:

Deployment details are still loading.
Try refreshing this page.

because good error messages are actionable.
And there is a tutorial that asks developers to look at the deployment details (see #628).

Thanks, you two.
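One way the "hide labels that have no values" behavior could be expressed with html/template is sketched below. The template text and the renderFooter helper are assumptions for illustration and are not copied from footer.html; only the map keys and the loading message come from this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// footerTmpl is a hypothetical footer fragment. "with" skips a label when
// the looked-up value is empty; the outer "if" is false for a nil or empty map.
const footerTmpl = `{{ if .deploymentDetails }}` +
	`{{ with index .deploymentDetails "CLUSTERNAME" }}<b>Cluster: </b>{{ . }}<br/>{{ end }}` +
	`{{ with index .deploymentDetails "ZONE" }}<b>Zone: </b>{{ . }}<br/>{{ end }}` +
	`{{ with index .deploymentDetails "HOSTNAME" }}<b>Pod: </b>{{ . }}{{ end }}` +
	`{{ else }}Deployment details are still loading. Try refreshing this page.{{ end }}`

// renderFooter executes the template against the given details map.
func renderFooter(details map[string]string) (string, error) {
	t, err := template.New("footer").Parse(footerTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = t.Execute(&buf, map[string]interface{}{"deploymentDetails": details})
	return buf.String(), err
}

func main() {
	// Non-GCP case: only the pod hostname is known.
	out, err := renderFooter(map[string]string{"HOSTNAME": "frontend-abc"})
	fmt.Println(out, err) // prints "<b>Pod: </b>frontend-abc <nil>"

	// Details not loaded yet: the map is nil.
	out, err = renderFooter(nil)
	fmt.Println(out, err)
}
```

Note that a Go template "if" treats an empty map the same as a nil one, so a loaded-but-empty map would also show the loading message in this sketch.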

Comment on lines +43 to +45
deploymentDetailsMap["HOSTNAME"] = podHostname
deploymentDetailsMap["CLUSTERNAME"] = podCluster
deploymentDetailsMap["ZONE"] = podZone
Member

Why don't we just get this information once, and only once ever? We can do this from an init() function, with a package variable that we can query as many times as we want.

// deployment_details.go
package main

import (...)

var deploymentDetails map[string]string

func init() {
    // get the deployment details from the meta server

    deploymentDetails = map[string]string {
        "HOSTNAME": podHostname,
        "CLUSTERNAME": podCluster,
        "ZONE": podZone,
    }
}

Then from the handlers:

if err := templates.ExecuteTemplate(w, "home", map[string]interface{}{
    . . .
    . . .
    "deploymentDetails": deploymentDetails,
}

That would mean no mutex, no enums, no race conditions, no "loading / not loading", and reduce complexity.

Member

If the meta server is not available or reachable (e.g. if you're deploying locally), then you would leave the map blank, and that should be taken care of seamlessly by the HTML template with the "if deploymentDetails" bit.

Collaborator Author

Great suggestion! I'm glad you took a look at my changes.
I'm now loading the deployment details from func init().
I was able to get rid of the following complexities:

  • mutex
  • race conditions
  • enums

However, I wasn't able to get rid of the "loading / not loading" stuff. This is because we don't want the loading of details to block the main thread in non-GCP deployments (as described in #685).
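A minimal sketch of that combination, assuming a hypothetical fetchDetails function in place of the real metadata calls: init() only starts a goroutine, so startup is not delayed, and readers get nil until the load finishes. The atomic.Value and the ready channel are choices made for this sketch (to keep it race-free and demonstrable), not what the pull-request does.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// deploymentDetails holds a map[string]string once loading has finished.
var deploymentDetails atomic.Value

// ready is closed when the background load completes; used only by this demo.
var ready = make(chan struct{})

// fetchDetails is a hypothetical stand-in for the slow metadata lookups.
func fetchDetails() map[string]string {
	time.Sleep(50 * time.Millisecond)
	return map[string]string{"HOSTNAME": "frontend-example"}
}

func init() {
	// Start loading in the background so init (and therefore main) is not blocked.
	go func() {
		deploymentDetails.Store(fetchDetails())
		close(ready)
	}()
}

// currentDetails returns nil until the background load has stored a result.
func currentDetails() map[string]string {
	d, _ := deploymentDetails.Load().(map[string]string)
	return d
}

func main() {
	fmt.Println("loaded at startup:", currentDetails() != nil)
	<-ready
	fmt.Println("hostname:", currentDetails()["HOSTNAME"])
}
```

A handler would call currentDetails() and pass the result to the template; a nil result is the "still loading" state.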

Member

I wonder if we should still account for temporary unreachability of the metadata server. I thought the previous mutex approach took care of all possibilities. The current change would render the details never loaded if the first attempt on init fails.

Just a thought. We could still see how this works and add it as an improvement if need be.

Member

That's a fair point, though, in what scenario would the meta server be unavailable? I thought maybe if the GCE API (https://status.cloud.google.com/) is down completely, but the meta server is internal, no? The SLA for any arbitrary single-zone cluster is >= 99.5%, so I'm imagining the meta server's to be much higher, though that is solely intuition.

I think that's a very low probability, but even so a mutex seems overkill. We could do periodic retries in the goroutine until we get a result, for example (though I think that's still overkill).

I think it's safe to leave it as-is, but I'm also open to adding retries (taking into account that it may retry forever for non-GCP instances).

@github-actions

github-actions bot commented Feb 1, 2022

🚲 PR staged at http://34.132.215.174

DetailsAreLoaded = 2
)

var areDeploymentDetailsLoaded = false
var deploymentDetailsMap = make(map[string]string)
Member

This can stay as var deploymentDetailsMap map[string]string (without make()), and then you have an easy way to check whether the deployment details are loaded / present, which is deploymentDetailsMap != nil.
That would also remove the need for that additional areDeploymentDetailsLoaded variable.

Collaborator Author

Great suggestion! 👌
Now the declaration at the top looks like:

var deploymentDetailsMap map[string]string

and it's initialized in the loadDeploymentDetails() function:

func loadDeploymentDetails() {
	deploymentDetailsMap = make(map[string]string)

and the getDeploymentDetailsIfLoaded function is now:

func getDeploymentDetailsIfLoaded() map[string]string {
	return deploymentDetailsMap
}
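The nil check works because of how Go treats a declared-but-unassigned map. A small single-goroutine sketch of those semantics follows; the "frontend-example" value is a placeholder, and the real function fills the map from the metadata server instead.

```go
package main

import "fmt"

// deploymentDetailsMap is nil until loadDeploymentDetails runs.
var deploymentDetailsMap map[string]string

// loadDeploymentDetails is simplified here: it stores a placeholder value
// rather than querying the metadata server.
func loadDeploymentDetails() {
	deploymentDetailsMap = make(map[string]string)
	deploymentDetailsMap["HOSTNAME"] = "frontend-example"
}

func main() {
	fmt.Println(deploymentDetailsMap == nil)          // prints "true"
	fmt.Println(deploymentDetailsMap["ZONE"] == "")   // reading a nil map is safe: prints "true"
	loadDeploymentDetails()
	fmt.Println(deploymentDetailsMap == nil)          // prints "false"
	fmt.Println(deploymentDetailsMap["HOSTNAME"])     // prints "frontend-example"
}
```

Reading from a nil map returns the zero value, but writing to one panics, which is why the make() call has to come before the first assignment. When the load runs in a separate goroutine, the assignment would also need some synchronization to be free of data races; this sketch leaves that out.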


// The use of httpRequest here lets us to see HTTP request info in the logs.
var log = httpRequest.Context().Value(ctxKeyLog{}).(logrus.FieldLogger)
func initializeLogger() {
Member

I'm okay with having a single logger, but if so:

  • Let's put it in either main.go or its own logger.go, and
  • Let's use it everywhere. I notice some files (like main.go) use their own duplicated loggers.

Collaborator Author

I was thinking the same thing.
Let's track this in a different issue: #703
I'd say it's outside the scope of this P1.

@@ -50,18 +60,14 @@ func loadDeploymentDetails(httpRequest *http.Request) map[string]string {
"hostname": podHostname,
}).Debug("Loaded deployment details")

detailsLoadingStatus = DetailsAreLoaded
areDeploymentDetailsLoaded = true

return deploymentDetailsMap
}

func getDeploymentDetailsIfLoaded(httpRequest *http.Request) map[string]string {
Member

This is completely redundant if we change the deploymentDetailsMap to be nil at initialization (so remove the make() from there), so this entire method can go away.

Collaborator Author

@NimJay NimJay Feb 1, 2022

I've preserved the getDeploymentDetailsIfLoaded() method because it implies that the thing being returned might not have been loaded; thus, I assume anyone using/seeing getDeploymentDetailsIfLoaded() will cautiously check if it returns nil.
Thoughts?

@@ -50,18 +60,14 @@ func loadDeploymentDetails(httpRequest *http.Request) map[string]string {
"hostname": podHostname,
}).Debug("Loaded deployment details")

detailsLoadingStatus = DetailsAreLoaded
areDeploymentDetailsLoaded = true

return deploymentDetailsMap
Member

No need to return anything. The caller (go loadDeploymentDetails()) is not expecting anything returned. We just have to modify the global map and silently return (which is implicit at the end of the function, so we don't need a return statement at all).

Collaborator Author

Good catch!
I will work on not returning anything. :)

@NimJay
Collaborator Author

NimJay commented Feb 1, 2022

@bourgeoisor thanks for the new round of comments!
Great suggestions. I've responded to all of them — and applied most.
I've tested on a local Kind cluster, and everything works! 👍

@github-actions

github-actions bot commented Feb 2, 2022

🚲 PR staged at http://34.132.215.174

Member

@bourgeoisor bourgeoisor left a comment

LGTM, with the caveats of the open comment threads

@NimJay NimJay merged commit 6b25a1a into main Feb 2, 2022
@NimJay NimJay deleted the nimjay/non-gcp-bug branch February 2, 2022 15:43
@NimJay NimJay restored the nimjay/non-gcp-bug branch February 2, 2022 15:43
@NimJay NimJay deleted the nimjay/non-gcp-bug branch February 2, 2022 15:53
This was referenced Feb 2, 2022
sitaramkm pushed a commit to sitaramkm/microservices-demo that referenced this pull request Mar 27, 2022
* Improve loading of deployment details

* Hide missing deployment details labels

* Load deployment details once via init()

* Add comment about go loadDeploymentDetails()

* Simplify deployment_details.go

* Remove getDeploymentDetailsIfLoaded() function

Co-authored-by: Shabir Mohamed Abdul Samadh <7249208+Shabirmean@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Frontend blocking requests when not in GCP
4 participants