Skip to content

GRPC Client - TraceId mixup #2316

Open
Open
@ananthvanchip

Description

@ananthvanchip

Describe the bug

Corruption in the traceId that is being sent in the headers while making grpc calls. The issue is with the grpc-client-instrumentation.

Environment

  • OS: Linux
  • Go Version: 1.17+

To Reproduce

  1. Create a dummy grpc endpoint to invoke.
  2. Use a single connection and send multiple requests on the connection via different go-routines.
  3. Alternatively, create multiple connections and send multiple requests on those connections via different go-routines.
  4. You will see that the traceId will sometimes get mismatched (associated with the wrong go-routine.)

Expected behavior

TraceId should not be corrupted/exchanged b/w different requests.

Additional context

There are two issues in the way grpc-client is being impemented.

Issue 1

bpf_probe_read(&nextid, sizeof(nextid), (void *)(httpclient_ptr + (httpclient_nextid_pos)));

The nextId is used as the bridge b/w the go-routine of the request and the go-routine of the loopyWriter. The nextId is a merely an integer and it increases sequentially for every request. (Its odd numbers starting from 1). However, when multiple http2client are created (to different endpoints) they will all generate nextId from 1. So in a concurrent scenario, we will have different clients(or connections) which can have the same nextId and the association could get jumbled when the requests are being made. This will cause the wrong traceId to be sent to the wrong endpoint randomly and will create confusions.

Issue 2
We are adding a uprobe at the entry of
func (t *http2Client) NewStream(ctx context.Context, callHdr *CallHdr) (*ClientStream, error)

In that uprobe we are reading the nextId. The nextId is a global variable within the http2Client object. When two go-routines using the same connection/client make a request, they will create a new stream. For both of them, in the uprobe we will assign the same nextId. However, internally they will be assigned different nextId as the nextId assignment within the function is covered by a mutex lock.
https://github.com/grpc/grpc-go/blob/b89909b7bd0d9bd333aab291e90fec1fa8d45ce9/internal/transport/http2_client.go#L832

This again will cause different traceIds to be sent to different requests which will lead to eventual corruption.

Solution.
To solve both the issues, we can stay away from using nextId as the bridge b/w the requesting go-routine and the go-routine within loopyWriter. Instead, we can add a uprobe on
https://github.com/grpc/grpc-go/blob/09166b665e8b6442bc70e88cdef580da9e7b7b06/internal/transport/controlbuf.go#L352

The second parameter is a headerFrame which is available in the loopyWriter.handleHeaders method. The pointer to headerFrame can be the bridge. This will help us get rid of the entire nextId/streamId and the issues arising from that.

I am not a go/ebpf expert. Let me know what you think.. I can provide a patch on this if necessary.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions