bpf: parse mysql data in kernel space #118
base: main
Conversation
Thanks for this @mmat11!
I have one basic question before we look at this in more detail: what is the reason we'd want to parse the SQL protocol in eBPF?
In a sense, the rule we've followed so far has been that we only parse in C code the protocols where it's truly necessary, e.g. because we need to inject trace context. In this case, SQL parsing is simply extracting data so we can report the event.
One main drawback of implementing this in C is that it's harder to test and to implement additional logic. For example, one interesting aspect is the statement prepare. If we did this in userspace, perhaps we could cache the prepared statement per pid/SQL query and then find the SQL when the statement execute is made. This would be harder in C.
Is the main reason that we have such short buffers for the TCP data and you'd like to extract more context, e.g. larger query text? I thought about this more and I wonder if we can do the following:
- Look at some SQL like protocol detection, e.g. find out it's mysql, like you've done.
- Introduce a variable-size buffer event as a type for the ring buffer. We can communicate the max size from the userspace config as a volatile const.
- At the time of making the TCP event, we can create our usual event, but read more of the data now and store it after the current event. We tag the TCP event flags to say "look for a follow-up event when you read this", and then pull both to be processed.
I think we can reuse this approach more in general for other types of TCP events where we need larger buffers.
So I think if we added the response error detection in this file and had larger buffers, we'd get this for free, right? https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation/blob/main/pkg/internal/ebpf/common/sql_detect_mysql.go We had planned to implement the cache for prepared statements like we do for the gRPC connections. It will not catch statements which are prepared before we attach OBI, but it will catch all of them afterwards. |
Codecov Report - Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #118 +/- ##
===========================================
- Coverage 75.17% 57.03% -18.14%
===========================================
Files 177 176 -1
Lines 19135 18891 -244
===========================================
- Hits 14384 10774 -3610
- Misses 4001 7338 +3337
- Partials 750 779 +29
|
Good stuff! Thank you so much for putting in the effort, this is much appreciated. This is a long PR with a lot of implicit knowledge, so all I did was a first pass in terms of style and other small things. Once those are addressed, we can iterate on this to bring it to the finish line - also, if you have any questions, don't hesitate to bug me on Slack or elsewhere.
|
// Every mysql command packet is prefixed by a header
// https://mariadb.com/kb/en/0-packet/
struct mysql_hdr {
if you can, move the maps and the types to their own headers like we did in the pid/ subdir - unfortunately it's an ongoing effort and we did not have time to clean up the rest of the code, but new code could already do that.
do I need to make a mysql protocol directory? mixing these maps with other protocols' maps would be confusing imho
The structure we are experimenting with is tracer/{types,maps} - so for now I'd just chuck it in there for consistency. We do need to make a pass sometime soon and finish tidying up the code - it's on my list, but it's low priority at the moment.
@grcevski for context, this is the superseded PR: grafana/beyla#1848
The "unknown TCP" path will send to userspace the request+response event with a predetermined buffer size. Making events of variable size will also make things slightly more complicated and less consistent with event structs generation as we can't use bpf2go to generate and fill the structs but we need to parse them manually. One other argument is that this way every handler (both in kernel and userspace) is responsible for its own protocol after classification.
Note: prepared statement support is not implemented yet and the statement cache will be implemented in userspace. The cache would look like this: Having said that, both ways are doable for achieving this; I thought we agreed that doing the parsing in eBPF would be the preferred long-term solution. |
@grcevski makes good points (which I only saw after I published mine). To piggyback on his comments, the ring buffer does allow for variable sized events - we could have something like
we can then use Then in userspace we can cast that back to the equivalent struct and access data as usual.... |
The
Event unmarshaling should also be done manually in that case (see https://github.com/elastic/ebpfevents/blob/main/pkg/varlen/varlen.go). IIRC bpf2go will refuse to handle fields like |
I think we don't truly need variable size: we can keep what you had, 8K and 512, but these can now be constants supplied at load time when we do loadSpec, so we can make this configurable and allow users to choose how much they are willing to use to get more accurate data in exchange for extra CPU overhead. I think I understand the eBPF type issue, but what do you think about making a type that has only a few things: flags with the type (as in all of our eBPF event structures), the size of the buffer, and an empty 0-sized u8 buffer - essentially just a "header"? We then write the "variable" size buffer on the ring buffer after writing the "header" of the event. In userspace, we always just read the first byte. Would this approach work for you? The complexity of reading is done once, and we gain the capability to supply extra data for any protocol from now on. |
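To make that concrete, here is a minimal sketch of what such a "header" event type could look like (the names large_buf_event and k_max_payload are illustrative, not the actual OBI definitions, and u8/u32 are assumed to come from vmlinux.h):

volatile const u32 k_max_payload = 4096; // overridable from userspace at load time (loadSpec)

struct large_buf_event {
    u8 type;    // event type tag - the first byte userspace reads to pick a decoder
    u8 flags;   // e.g. which protocol the trailing payload belongs to
    u32 size;   // number of payload bytes written after this header
    u8 data[];  // empty, zero-sized buffer: the variable payload follows in the record
} __attribute__((packed));

Userspace would read type/flags first and then consume size bytes of payload, which keeps the "read the first byte" step cheap and protocol-agnostic.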
If we decide to go down the parsing-in-userspace route, why not then use bpf_ringbuf_output()? In most cases when using bpf_ringbuf_reserve() we already copy the data anyway, so instead of doing
for the mysql case, we could do:
In userspace, I am assuming (from what I saw of the cilium/ebpf source) that we'd have no issues reading this as a regular event as usual - so nothing changes. We could still have a "maximum size" flag as @grcevski is proposing, but IMHO this makes it simpler. So if we do decide to ship things to userspace for further processing, that would be my preference (assuming it works).
As for whether we should do that in the first place, I think we should evaluate both options. I am unsure what the best approach is here (userspace vs eBPF). I'd personally do things in eBPF only provided that the complexity can be kept at bay, given the lower event overhead (less pressure on the ring buffer, fewer allocations, fewer copies, etc.). So what would be the objective criteria for that? And tangential to this question (because it's an issue regardless of the approach chosen): how are we planning to deal with incomplete buffers? |
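As a hedged illustration of the bpf_ringbuf_output() suggestion above (names like events, tcp_req and u_buf are placeholders, and the usual vmlinux.h/bpf_helpers.h includes are assumed):

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);
} events SEC(".maps");

struct tcp_req {
    u32 len;
    u8 buf[256];
};

// reserve/commit style: grab a fixed-size slot on the ring buffer, copy the
// data into it, then submit the slot
static __always_inline void send_reserved(const void *u_buf) {
    struct tcp_req *ev = bpf_ringbuf_reserve(&events, sizeof(*ev), 0);
    if (!ev)
        return;
    ev->len = sizeof(ev->buf);
    bpf_probe_read_kernel(ev->buf, sizeof(ev->buf), u_buf);
    bpf_ringbuf_submit(ev, 0);
}

// output style: if the event has already been assembled somewhere (on the
// stack or in a scratch map) while parsing, bpf_ringbuf_output() copies it
// onto the ring buffer in a single call - no reserve/submit pair needed
static __always_inline void send_output(const struct tcp_req *ev) {
    bpf_ringbuf_output(&events, (void *)ev, sizeof(*ev), 0);
}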
@grcevski @rafaelroquetto I think we are discussing and trying to decide between two different topics:
In my opinion, it boils down to personal preference and confidence in the implementation. Event processing in userspace might in theory be less performant, since in eBPF any unknown buffer or inconsistent data can be discarded early. (I don't have any data/numbers to back this up, though.)
About extensibility:
About complexity:
About testing:
This decision is independent of this PR, and both the "unknown TCP" path and the mysql handler could benefit from variable-length buffers. We can perhaps discuss this further in the next SIG meeting? @rafaelroquetto Thanks again for the prompt review 🙏🏻, I'll wait for a decision on 1. before addressing the changes |
Hi @mmat11, I think you bring up a very valid point about the ability to filter in eBPF space; this would remove the overhead of sending the TCP packets for unknown protocols and make the use of the ring buffer more efficient. Let me summarise my thoughts on the topic - I think I have a proposal for a way forward:
Having said all of this, I think the filtering in eBPF space has a lot of merit, especially when users want this. Therefore I have a proposal that will allow us to do this, if you think it's sensible.
Thanks again for bringing this up and let me know what your thoughts are. |
Sounds good! Should we have this toggle per-protocol?
Given that the events are different from each other, wouldn't it be more consistent to use different event types? (just a nit, I'm fine with doing both)
By discard you mean let the "unknown TCP" fallback handle it?
Agree, we should use variable-length buffers - I will open an issue soon. Thanks, I will apply these changes and Rafael's review suggestions! |
Hey Mattia,
We can do that; we already have this functionality in userspace to filter out events we don't want, e.g. no SQL traces but SQL metrics, etc. Maybe we can leverage those toggles to remove the unwanted events after we've migrated most of this to the eBPF side, if we manage to do it all. Perhaps we can leave adding this option as a last step; for now we can benefit by detecting something like MySQL in eBPF and then dispatching directly to the userspace parsing.
Sure, let's add separate event types for the TCP data type. I would just keep the data type the same, along with the code that handles the buffer acquisition and nesting for distributed traces. I'm suggesting this so that we minimise how much code we have to duplicate. If it's not possible to make all protocols easily fit inside the instruction limit, perhaps we can tail call to, say, the MySQL handler and then tail call to the TCP event with a custom type.
Yeah, let's keep the unknown TCP fallback as it is today. If we manage to migrate most protocol parsing to eBPF space we can decide to add the option to not send the TCP fallback. Kafka is a lot of parsing code, so we need to add C unit tests. HTTP2/gRPC is also a lot of work compared to the fallback code we use at the moment in userspace. Redis and MongoDB (which someone is adding at the moment) are simple enough. |
Hi @grcevski and @rafaelroquetto, I did as discussed here and offline. The PR needs a bit of polishing and perhaps I missed some cleanups, like moving the maps into their own files as suggested by Rafael, but it can be re-reviewed now |
This is great @mmat11 ! Thank you so much for this, outstanding work!
I just have a bunch of nits (that you couldn't have known beforehand). I think once we iron those out, we should be good to go.
Again, thank you for all your effort and for bearing with me. Grazie!
bpf/generictracer/protocol_mysql.h
Outdated
return false;
}

static int parse_response_packet_type(struct mysql_hdr *hdr, const u8 *data, size_t data_len) {
for instance, could we have a buffer here at this stage in which we read data, and all of the parsing functions operate on that buffer rather than a cascade of bpf_probe_read_kernel? Or at least read bigger chunks?
I changed is_last_packet_ok() to read 25 bytes from the end on the stack so we can avoid all the small bpf_probe_read*.
Do you mind if we do this other optimization in a separate PR? From what I see we need 2 separate chunks, as sometimes we read from the start of the buffer and sometimes from the end; doing it on the stack would be ok, but ideally I'd put these chunks in a map
I'll do it in this PR
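For reference, this is roughly the shape of that change - a single bpf_probe_read_kernel() of the tail of the buffer onto the stack, with every later check operating on that copy (the constant and the helper name below are illustrative, not the exact code in the PR):

enum { k_tail_chunk_len = 25 }; // assumed: enough for the trailing OK/ERR/EOF packet

static __always_inline bool check_response_tail(const u8 *data, size_t data_len) {
    u8 chunk[k_tail_chunk_len];

    if (data_len < sizeof(chunk))
        return false;

    // one bpf_probe_read_kernel() instead of a cascade of small reads
    if (bpf_probe_read_kernel(chunk, sizeof(chunk), data + data_len - sizeof(chunk)) != 0)
        return false;

    // ...all further inspection (packet header, OK/ERR/EOF marker) happens on
    // the stack copy, with no additional probe reads...
    return true;
}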
@rafaelroquetto thanks again for the quick review, I think I covered everything here 1cc64e2 |
Signed-off-by: Mattia Meleleo <mattia.meleleo@coralogix.com>
This is a big PR with a lot of implicit knowledge, so I hope you don't mind me having to iterate on this.
// Every mysql command packet is prefixed by a header
// https://mariadb.com/kb/en/0-packet/
struct mysql_hdr {
u32 payload_length : 24;
I realised there's an issue here - the MySQL protocol defines integers to always be little-endian - using bitfields may break that assumption on big endian architectures.
Also, there are more general issues that can come back to bite us: the alignment of bitfields is unspecified, the compiler may choose to add padding there, etc. We shouldn't be using them; instead, we should do
u8 payload_length[3];
and then have a helper that returns something like
return ((u32) hdr->payload_length[0])
     | ((u32) hdr->payload_length[1] << 8)
     | ((u32) hdr->payload_length[2] << 16);
for an extended rationale, see the "Notes" section at https://cppreference.com/w/c/language/bit_field.html
don't we only support amd64 and arm64?
regarding alignment, is that true for explicitly packed structures? I have __attribute__((packed)) there
don't we only support amd64 and arm64?
we do, until we don't... but this is just one of the issues; overall, bitfields are problematic - if you look at the "Notes" section of the link above, you will see we'd be relying on a lot of undefined and implementation-defined behaviour, much of which is hard to spot and to enforce (the compiler will allow the code to compile, may not even emit a warning, but then the code misbehaves).
regarding alignment, is that true for explicitly packed structures? I have __attribute__((packed)) there
here's an example of bitfields in a packed structure, in which the bitfields themselves are not packed: https://godbolt.org/z/7zbaMrz9M
I don't see padding being added there 🤔 , the size is 5 because of the bitfield (3b => 1 byte) + the uint32 (4 bytes) and the offset is 1. If you remove the packed attribute, the size is 8 and the offset is 4 as it should be.
My bitfield size is also a multiple of 8 so there should be no issues doing that
What I meant is exactly that, sorry if the example was crappy - as a multiple of 8, your bitfield neatly aligns to the next byte, so there are no "spare" bits, which affects the size of the struct. If you pass -Wpadded to the example, you will see
warning: padding struct 'struct Test' with 3 bits to align 'b' [-Wpadded]
Even if there are no alignment issues with your bitfield (which is not guaranteed), the usage is still prone to all of the other issues described above.
// Metadata
bool hdr_arrived; // Signals whether or not to skip the first 4 bytes in the current buffer, as
// they arrived in a previous packet.
} __attribute__((packed));
reading about the mysql header, it's best to actually define two structs whose size you can take with sizeof() instead of manually setting the size constants above, e.g.
struct mysql_hdr {
    u8 payload_length[3]; // little-endian 24-bit value
    u8 sequence_id;
};
// can now write sizeof(struct mysql_hdr) rather than k_mysql_hdr_size
struct mysql_command_packet { // usually client requests
    struct mysql_hdr hdr;
    u8 command_id;
};
// ditto for sizeof(struct mysql_command_packet)
u8 response_packet_type;
int packet_type_response;

if (data_len == k_mysql_hdr_without_command_size) {
this would then read
- if (data_len == k_mysql_hdr_without_command_size) {
+ if (data_len == sizeof(struct mysql_hdr)) {
u8 response_packet_type;
int packet_type_response;

if (data_len == k_mysql_hdr_without_command_size) {
what happens if you get a subsequent packet that is also 4 bytes long? Imagine the first packet is the header, and then for the same connection info you get a 4-byte payload (e.g. header + COM_QUERY "SET") - you will overwrite the header info here.
What you could do instead is:
- check if you already have a header stored in the map
- perhaps check if the sequence number is sane
return 0;
}

if (data_len < (k_mysql_hdr_size + 1)) {
then here, you should bail only if you don't have any previously stored header
return 0;
}

struct mysql_hdr hdr = {};
you could then ditch the hdr_arrived metadata, and also this variable - at this point, either you have a previously stored header (what comes next is a payload) or you don't, and data contains the header + a payload chunk.
I think what I am trying to say is something along the lines of:
struct mysql_hdr *hdr = find_mysql_hdr(info);
if (hdr == NULL && data_len < sizeof(*hdr)) {
    bpf_dbg_printk("not mysql");
    return 0;
}
// consume header if we don't have any yet
if (hdr == NULL) {
    hdr = new_mysql_hdr(info);
    bpf_probe_read_kernel(hdr, ...);
    data_len -= sizeof(*hdr);
}
// nothing left to read
if (data_len == 0) {
    return 0;
}
// ---> at this stage, you have a header and some data to consume
As far as I know, the MySQL protocol is strictly request->response - we should leverage that, so that once you've parsed a command (request) you remember it, and then you only try to parse a response if you have successfully parsed a request of a command of interest - otherwise, you can drop the orphan response packet and bail early, which is good.
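A hedged sketch of what that request/response pairing could look like (the key and map names below are made up for illustration; OBI presumably already has a per-connection key type that would be reused instead):

struct conn_key {          // hypothetical per-connection key (5-tuple-ish)
    u32 saddr, daddr;
    u16 sport, dport;
};

struct mysql_req_state {
    u8 command_id;         // e.g. COM_QUERY, recorded when the request is parsed
    u8 seq_id;             // last sequence id seen, for sanity checks
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1024);
    __type(key, struct conn_key);
    __type(value, struct mysql_req_state);
} mysql_in_flight SEC(".maps");

// request path: remember the in-flight command
//   bpf_map_update_elem(&mysql_in_flight, &key, &state, BPF_ANY);
// response path: bail out early for orphan responses
//   if (!bpf_map_lookup_elem(&mysql_in_flight, &key)) return 0;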
// +------------+-------------+----------------------+
*packet_type = PACKET_TYPE_REQUEST;
break;
default:
so if we haven't found any request, are we interested in "orphan" responses?
orphan responses will be dropped later, both in the tcp codepath and in userspace; I can add another check here if you prefer
yeah, we should drop them early - there's no point in running all that code downstream just to drop them
// Tail call back into generic TCP handler.
// Once the header is fixed up, we can use the generic TCP handling code
// in order to reuse all the common logic.
bpf_tail_call(ctx, &jump_table, k_tail_protocol_tcp);
at this stage, we should have a header and a payload ready to be consumed, so couldn't we simply construct a specific event that contains the relevant bits of tcp_req plus the "raw buffer"?
otherwise, are we not sending two events, the tcp_large_buffer one and the tcp_req one? If that's the case, it means we are reading u_buf multiple times - once for the large buffer, and once as part of the original TCP codepath (which reads K_TCP_MAX_LEN=256) and puts it in the ringbuffer...
If my understanding is correct (and it may as well not be), then we should instead really dispatch everything just once via a large buffer to userspace and skip the common TCP codepath altogether.
It was agreed offline to reuse the tcp code path in order to rely on common code (such as traceparent handling) and edge cases that are already figured out (also see: #118 (comment)), maximizing code reuse at the cost of one additional tail call.
On the "non-large buffer" part: given that the tcp code path uses the reserve/commit API, the event we send is still fixed-size, so we would only save a copy by not using the fixed buffer. I kept it so it can be used as a fallback in case the large buffer is not found in the LRU
These aren't mutually exclusive in my view - we could (and should) move the reusable part of the code to a common place that can be invoked by both codepaths. But in my opinion there's no upside whatsoever in doing multiple passes on u_buf and submitting duplicated information (both events) when it's perfectly feasible not to do so - otherwise this is premature pessimisation in a hot path (tcp_sendmsg and tcp_recvmsg are invoked at a very high frequency). Separating things would also improve the clarity of otherwise dense code.
What you are proposing @rafaelroquetto is not easy to do with the way the code is currently structured. How about we keep it as is and then follow up with a PR to clean that part up? I feel like the changes are getting too large anyway. Nothing stops us from iterating on this; it doesn't have to be the final state.
I don’t feel comfortable merging this as-is for two main reasons:
- Duplicate events: We’re emitting both the large-buffer event and the original TCP event with a fixed payload.
- Excessive bpf_probe_read_kernel(): We call bpf_probe_read_kernel() far more often than necessary.
Originally, I understood the plan to be a single variable-buffer event into userspace - so I may be missing something.
That said, if you guys prefer to move forward as-is and address it in subsequent PRs, I won’t block it.
@rafaelroquetto nice! I'll address the excessive usage of bpf_probe_read today so we can hopefully get this in
}
)

var largeBuffers = expirable.NewLRU[largeBufferKey, largeBuffer](1024, nil, 5*time.Minute)
Most of the changes look good IMO now; I will do another review soon to see if there are other things I might have missed, but we should move these large buffers into the EBPFParseContext structure rather than making them static in the file. This is in preparation for OBI being vendored into the OTel Collector in the future, where essentially we might expect the OBI component to be reloaded. We used to have these statics everywhere; now they are all removed.
I believe we are good on integration tests for this PR, since we test MySQL, and as long as it doesn't break things we should be good. I'm going to spend some time today looking to see if I can tell why some of these tests are failing, if you need help. One thing I would do is add unit tests for the new bits of the if logic in the request handling, just to ensure we handle the path correctly. Ideally we should add integration tests for the variable buffers, but that can follow in another PR, since it's not blocking the default use of the product. |
@mmat11 another thing I realised is that sometimes the buffer we are reading may be userspace memory, not always kernel - if you suspect the failing tests could be related to this (and I don't mean to send you on the wrong path), perhaps we should consider using |
This PR allows parsing mysql requests in kernel space. The benefits are that bigger queries/buffers are supported and the spans are enriched with additional data and errors based on the response.
Prepared statement support will be introduced shortly after this is approved/merged.
It has only been tested manually, let me know if additional integration/e2e testing can be done.
TODO: