Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats/opentelemetry: add trace event for name resolution delay #8074

Open
wants to merge 56 commits into
base: master
Choose a base branch
from

Conversation

vinothkumarr227
Copy link
Contributor

@vinothkumarr227 vinothkumarr227 commented Feb 10, 2025

stats/opentelemetry: add trace event for name resolution delay.

RELEASE NOTES: None

Copy link

codecov bot commented Feb 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.00%. Comparing base (775150f) to head (a307b5a).
Report is 8 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8074      +/-   ##
==========================================
- Coverage   82.25%   82.00%   -0.26%     
==========================================
  Files         393      410      +17     
  Lines       39143    40248    +1105     
==========================================
+ Hits        32197    33004     +807     
- Misses       5616     5869     +253     
- Partials     1330     1375      +45     
Files with missing lines Coverage Δ
clientconn.go 90.82% <100.00%> (+0.26%) ⬆️
stats/opentelemetry/client_metrics.go 89.44% <100.00%> (+0.05%) ⬆️
stats/opentelemetry/client_tracing.go 84.00% <100.00%> (+7.52%) ⬆️
stats/opentelemetry/opentelemetry.go 75.00% <ø> (ø)
stream.go 81.55% <100.00%> (+0.03%) ⬆️

... and 48 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arjan-bal arjan-bal requested a review from purnesh42H February 17, 2025 07:16
@arjan-bal arjan-bal added this to the 1.71 Release milestone Feb 17, 2025
@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability labels Feb 17, 2025
},
events: []trace.Event{
{
Name: "Delayed LB pick complete",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is interesting. So, LB policy provided a new picker as well. Is this always the case when the rpc block on name resolution delay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// event is logged only once using an atomic flag, even across retries.
// Add a "Delayed name resolution complete" event to the call span
// if there was name resolution delay. In case of multiple retry attempts,
// ensure that event is added only once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: period (.) at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return nil
}

func setupRetryStubServer(t *testing.T, metricsOptions *opentelemetry.MetricsOptions, traceOptions *experimental.TraceOptions) *stubserver.StubServer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add docstring similar to setupStubServer summarizing the logic of retries

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// "Delayed name resolution complete" event is recorded in the call trace span
// only once if any of the retry attempt encountered a delay in name resolution
func (s) TestSpan_WithRetriesAndNameResolutionDelay(t *testing.T) {
func TestTraceSpan_WithRetriesAndNameResolutionDelay(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add (s) back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if err != nil {
t.Logf("FullDuplexCall error: %v", err)
rpcError <- err
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't return here. Let it continue as we want to fail gracefully below if error is received.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return
}

rpcError <- nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have to set nil? rpcError will just be appended with next error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

found := false
for _, span := range spans {
if span.Name == want.name && span.SpanKind.String() == want.spanKind {
found = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need found? Looks like you are breaking the first time itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable found was used to track whether a specific span name or match occurred during the loop execution. I have now renamed the variable to match to align with the following check: if !match { t.Errorf("Span not found: %s (kind: %s)", want.name, want.spanKind) }

@@ -148,6 +149,11 @@ type callInfo struct {
target string

method string
// nameResolutionEventAdded is an atomic flag that ensures the name
// resolution delay event is added to the call span only once. If a retry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

....name resolution delay event is added to the OpenTelemetry trace call span only once....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for second sentence. Can just say "only once across retry attempts"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -25,6 +25,7 @@ import (
otelinternaltracing "google.golang.org/grpc/stats/opentelemetry/internal/tracing"
)

const delayedResolutionEventName = "Delayed name resolution complete"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be single block

const (
     delayedResolutionEventName = "Delayed name resolution complete"
     tracerName = "grpc-go"
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

rpcError <- fmt.Errorf("UnaryCall failed: %w", err)
}()

go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't create another go routine. Make both Unary call and Streaming call in one go routine. Otherwise there can be unnecessary flakiness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented this by wrapping both waiting operations in separate select blocks, The errors are now stored in these channels and handled within their respective select blocks.

})

select {
case err := <-rpcError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wouldn't ensure if streaming and unary calls have finished. We need to wait until both are finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants