bpf: parse mysql data in kernel space #118


Open · wants to merge 3 commits into main from matt/sql-errors

Conversation


@mmat11 mmat11 commented Jun 5, 2025

This PR adds parsing of MySQL requests in kernel space. The benefits are that larger queries/buffers are supported and that spans are enriched with additional data and errors based on the response.

Prepared statement support will be introduced shortly after this is approved/merged.

It has only been tested manually; let me know if additional integration/e2e testing can be done.

TODO:

  • additional tests?

@mmat11 mmat11 requested a review from a team as a code owner June 5, 2025 13:42

linux-foundation-easycla bot commented Jun 5, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@grcevski grcevski left a comment

Thanks for this @mmat11!

I have one base question before we look more in detail here: what is the reason we'd want to parse the SQL protocol in eBPF?

In a sense, the rule we've followed so far is that we only parse in C code the protocols where it's strictly necessary, e.g. because we inject trace context. In this case, SQL parsing simply extracts data so we can report the event.

One main drawback of implementing this in C is that it's harder to test and to implement additional logic. For example, one interesting aspect is the statement prepare. If we did this in userspace, perhaps we could cache the prepared statement per pid/sql query and then find the SQL when the statement execute is made. This would be harder in C.

Is the main reason that we have such short buffers for the TCP data and you'd like to extract more context, e.g. larger query text? I thought about this more and I wonder if we can do the following:

  1. Look at some SQL like protocol detection, e.g. find out it's mysql, like you've done.
  2. Introduce a variable buffer event as a type for the ring buffer. We can communicate the max size from the userspace config as a volatile const.
  3. At the time of making the TCP event, we can create our usual event, but read more of the data now and store it after the current event. We tag the TCP event flags to say look for a follow-up event when you read it and then pull both to be processed.

I think we can reuse this approach more in general for other types of TCP events where we need larger buffers.


grcevski commented Jun 5, 2025

So if we added the response error detection in this file and had larger buffers, I think we'd get this for free, right? https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation/blob/main/pkg/internal/ebpf/common/sql_detect_mysql.go

We had planned to implement the cache for prepared statements like we do for the gRPC connections. It will not catch statements which are prepared before we attached OBI, but it will catch everything after.


codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 39.32039% with 125 lines in your changes missing coverage. Please review.

Project coverage is 57.03%. Comparing base (00f36a6) to head (c44d49f).

Files with missing lines Patch % Lines
pkg/components/sqlprune/mysql.go 14.63% 34 Missing and 1 partial ⚠️
pkg/app/request/span.go 0.00% 28 Missing ⚠️
pkg/components/ebpf/common/tcp_detect_transform.go 52.94% 22 Missing and 2 partials ⚠️
pkg/components/ebpf/common/sql_detect_transform.go 0.00% 14 Missing ⚠️
pkg/components/sqlprune/sqlparser.go 28.57% 10 Missing ⚠️
pkg/export/otel/traces.go 0.00% 4 Missing ⚠️
pkg/beyla/config.go 0.00% 2 Missing and 1 partial ⚠️
pkg/components/ebpf/common/tcp_large_buffer.go 90.90% 2 Missing and 1 partial ⚠️
pkg/app/request/metric_attributes.go 0.00% 2 Missing ⚠️
pkg/config/ebpf_tracer.go 75.00% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (00f36a6) and HEAD (c44d49f).

HEAD has 3 uploads less than BASE
Flag BASE (00f36a6) HEAD (c44d49f)
unittests 1 0
oats-test 1 0
integration-test 1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #118       +/-   ##
===========================================
- Coverage   75.17%   57.03%   -18.14%     
===========================================
  Files         177      176        -1     
  Lines       19135    18891      -244     
===========================================
- Hits        14384    10774     -3610     
- Misses       4001     7338     +3337     
- Partials      750      779       +29     
Flag Coverage Δ
integration-test ?
integration-test-arm 35.84% <39.32%> (-0.17%) ⬇️
k8s-integration-test 53.33% <5.82%> (-0.50%) ⬇️
oats-test ?
unittests ?

Flags with carried forward coverage won't be shown.


@rafaelroquetto rafaelroquetto left a comment

Good stuff! Thank you so much for putting in the effort; this is much appreciated. This is a long PR with a lot of implicit knowledge, so all I did was a first pass in terms of style and other small things. Once those are addressed, we can iterate on this to bring it to the finish line. Also, if you have any questions, don't hesitate to bug me on Slack or elsewhere.


// Every mysql command packet is prefixed by a header
// https://mariadb.com/kb/en/0-packet/
struct mysql_hdr {
Contributor

if you can, move the maps and the types to their own headers like we did in the pid/ subdir - unfortunately it's an ongoing effort and we did not have time to clean up the rest of the code, but new code could already do that.

Contributor Author

do I need to make a mysql protocol directory? mixing these maps with other protocols' maps would be confusing imho

Contributor

The structure we are experimenting with is tracer/{types,maps} - so for now I'd just chuck it in there for consistency. We do need to make a pass sometime soon and finish tidying up the code - it's in my list, but it's low priority at the moment.


mmat11 commented Jun 5, 2025

@grcevski for context, this is the superseded PR: grafana/beyla#1848

I have one base question before we look more in detail here, what is the reason we'd want to parse the SQL protocol in eBPF?

The "unknown TCP" path sends the request+response event to userspace with a predetermined buffer size. Making events variable-sized would also make things slightly more complicated and less consistent with event struct generation, as we can't use bpf2go to generate and fill the structs and would need to parse them manually. One other argument is that this way every handler (both in kernel and userspace) is responsible for its own protocol after classification.

For example, one interesting aspect is the statement prepare. If we did this in userspace perhaps we can cache the prepared statement per pid/sql query and then find the sql when the statement execute is made. This would be harder in C. We had planned to implement the cache for prepared statements like we do for the grpc connections. It will not catch statements which are prepared before we attached OBI, but it will catch all after.

Note: prepared statements support is not implemented yet, and the statement cache will be implemented in userspace. The cache would look like this: {prepared_statement_id, pid: query}. We already did this and it's not an issue with the current implementation (I left some TODO: prepared statements comments)

Having said that, both approaches are doable for achieving this; I thought we agreed that doing the parsing in eBPF would be the preferred long-term solution.

@rafaelroquetto commented

@grcevski makes good points (which I only saw after I published mine). To piggyback on his comments, the ring buffer does allow for variable sized events - we could have something like

struct mysql_event { // or whatever it's called
   u8 type;
   /* ... */
   u8 data_size; // or u16?
   unsigned char data[]; // flexible array member, must be last
};

we can then use void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags) to reserve the actual full event (including the trailing data payload), so it can be written in one go.

Then in userspace we can cast that back to the equivalent struct and access data as usual....


mmat11 commented Jun 5, 2025

@grcevski makes good points (which I only saw after I published mine). To piggyback on his comments, the ring buffer does allow for variable sized events - we could have something like

struct mysql_event { // or whatever it's called
   u8 type;
   /* ... */
   u8 data_size; // or u16?
   unsigned char data[]; // flexible array member, must be last
};

we can then use void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags) to reserve the actual full event (including the trailing data payload), so it can be written in one go.

The bpf_ringbuf_reserve API can't be used with variable sized events (as the size must be known at reserve time). In that case, the bpf_ringbuf_output API must be used (see https://github.com/elastic/ebpf/blob/main/GPL/Events/Varlen.h).

Then in userspace we can cast that back to the equivalent struct and access data as usual....

Event unmarshaling should also be done manually in that case (see https://github.com/elastic/ebpfevents/blob/main/pkg/varlen/varlen.go). IIRC bpf2go will refuse to handle fields like unsigned char data[];


grcevski commented Jun 5, 2025

The bpf_ringbuf_reserve API can't be used with variable sized events (as the size must be known at reserve time). In that case, the bpf_ringbuf_output API must be used (see https://github.com/elastic/ebpf/blob/main/GPL/Events/Varlen.h).

I think we don't truly need variable size. We can keep what you had, 8K and 512, but these can now be constants supplied at load time when we do loadSpec, so we can make this configurable and allow users to choose how much they are willing to use to get more accurate data in exchange for extra CPU overhead.

I think I understand the eBPF type issue, but what do you think about making a type that has only two things, like all of our eBPF event structures: flags with the type, the size of the buffer, and an empty zero-sized u8 buffer, essentially just a "header"? We then write the "variable"-size buffer on the ring buffer after writing the event "header". In userspace, we always read the first byte first, and once we know the "type" of the event we can proceed to read the size and eventually consume the trailing "variable" buffer.

Would this approach work for you? The complexity of reading is done once and we have the capability that will let us supply extra data for any protocol from now on.


rafaelroquetto commented Jun 5, 2025

If we decide to go down the parsing-in-userspace route, why not then use bpf_ringbuf_output()? In most cases when using bpf_ringbuf_reserve() we already copy the data anyway, so instead of doing

struct evt *trace = bpf_ringbuf_reserve(&events, sizeof(*trace), 0);
if (!trace)
    return 0;
__builtin_memcpy(trace, src, sizeof(*trace));
bpf_ringbuf_submit(trace, 0);

for the mysql case, we could do:

bpf_ringbuf_output(&events, trace, variable_trace_size, 0);

In userspace, I am assuming (from what I saw of the cilium/ebpf source) that we'd have no issues reading this as a regular event, so nothing changes. We could still have a "maximum size" flag as @grcevski is proposing, but IMHO this makes it simpler. So if we do decide to ship things to userspace for further processing, that would be my preference (assuming it works).

bpf2go is able to generate a struct with a trailing data field (it generates something like Data [0]uint8). I am not sure if that can be directly accessed from Go, but if not, we can just do it manually (like we do in ReinterpretCast).

As for whether we should do that in the first place, I think we should evaluate both options. I am unsure what the best approach is here (user space vs eBPF). I'd personally do things in eBPF only provided that the complexity can be kept at bay, given the lower event overhead (less pressure on the ring buffer, fewer allocations, fewer copies, etc.).

So what would be the objective criteria for that? And, tangential to this question (because it's an issue regardless of the approach chosen): how are we planning to deal with incomplete buffers?


mmat11 commented Jun 6, 2025

@grcevski @rafaelroquetto I think we are discussing and trying to decide between two different topics:

  1. Mysql protocol parsing in ebpf vs userspace
  2. Variable length event buffers

  1. Mysql protocol parsing in ebpf vs userspace

In my opinion, it boils down to personal preference and confidence in the implementation. Event processing in userspace might in theory be less performant, since in ebpf any unknown buffer or inconsistent data can be discarded early. (I don't have any data/numbers to back this up, though.)

About extensibility:
Any request and response is sent to userspace, so there are no technical limitations to supporting additional features such as prepared statements (with the known drawback of only knowing the statement IDs starting from the OBI process startup — but this limitation applies regardless of the implementation).

About complexity:
Given that we should keep the is_mysql eBPF function, what we save by implementing the rest in userspace is limited to the event-sending functions. We still need the most complex parts in place, specifically parsing the header and the most common response packet types, regardless of where we choose to implement them. Personally, I also value the fact that the mysql-specific code is self-contained and not mixed with other protocols, making it easier to understand the whole picture.

About testing:
While I agree that it's harder to test in general (specifically unit tests), I think most of the testing value comes from integration/e2e tests, which would remain transparent to this implementation detail. The mysql (and other products') network protocols don't tend to change too often, and what's worth testing, I believe, are the client implementations and edge cases, which is better done in integration/e2e tests anyway.


  2. Variable-length event buffers

This decision is independent of this PR, and both the "unknown TCP" path and the mysql handler could benefit from variable-length buffers. We can perhaps discuss this further in the next SIG meeting?


@rafaelroquetto Thanks again for the prompt review 🙏🏻, I'll wait for a decision on 1. before addressing the changes


grcevski commented Jun 6, 2025

Hi @mmat11, I think you bring a very valid point about the ability to filter in eBPF space, this would remove the overhead of sending the TCP packets for unknown protocols and make the use of the ring buffer more efficient.

Let me summarise my thoughts here on the topic and I think I have a proposal for a way forward:

  1. Personally, I think the decision between eBPF side parsing and user space parsing is not always personal preference. Where parsing at eBPF really matters is reducing the overall overhead of the protocol handling, e.g. being able to extract the useful information without passing the whole buffer to the user space. Another is the ability to parse protocols for correct handling, e.g. the HTTP2/gRPC stream id, without it we wouldn't be able to handle HTTP2/gRPC correctly. On the other hand, depending on the protocol it's much harder to parse and test in eBPF code than in Go user space code. For example, if we consider the complexity and the amount of changes that have been done over the course of years to the Kafka protocol, you'll see how developing that is much easier in userspace. Integration tests go only so far, unit tests are much better at testing handling 10 different protocol versions.
  2. I don't have particular speed concerns on the Go userspace side. The Go compiled code is reasonably efficient for the kinds of operations we do, and it's very easy to extend the protocols, e.g. if we wanted to start parsing MongoDB we could easily use the bson library for Go. From my own benchmarking, the most expensive part is the transfer of data between the kernel and userspace.
  3. Sending unknown TCP packets to userspace has its own merits, it's not always a bad thing. One example is the current backup path we have for parsing HTTP2/gRPC packets. Namely, HTTP2/gRPC clients tend to use keepalive aggressively, so it's very common that the connection is already established when OBI instruments the application. In this scenario the HTTP2 connect preamble is never seen and we use userspace parsing to detect the connection and establish the needed metadata. Without this backup path, no gRPC connections work in the OpenTelemetry DEMO for example. Also, we've had requests in the past to report unknown protocols in the traces, mostly for proprietary or esoteric protocols. While you can't see the operation, you can tell how long it took and what was the destination.
  4. Having one TCP handler place is beneficial for context propagation code reuse, where we handle the detection of nested SQL/Redis/Kafka requests under HTTP for example.

Having said all of this, I think the filtering at eBPF space has a lot of merit, especially when users want this. Therefore I have a proposal that will allow us to do this, if you think it's sensible.

  1. Start by adding an option to the config file that will allow us to enable eBPF side filtering. This will be passed through the eBPF side in generictracer.go as a constant. We'll have this off by default for now until we migrate enough of the protocol detection to eBPF side.
  2. Implement protocol detectors in eBPF like you've done with MySQL, but keep it light especially for complex protocols like Kafka, because we can always finish the detection in userspace. These protocol detectors still use the TCP path, don't make new event types, but they tag the TCP event with sub-type hint to be taken in consideration in userspace.
  3. Use the option in 1. to discard anything that's not detected.
  4. Provide means to send variable buffer sizes, keeping the defaults as they are now. 8K is a lot for each request, and users must consciously choose to enable/increase this. We've had users with 100s of nested SQLs under one HTTP transaction; some ORMs do this commonly. Imagine what the overhead would be if we captured 8K for each SQL and they were running at 100 QPS. We had code in the past for capturing extra buffers, but it was removed; I'm going to try to find it. The main issue with variable buffers is that you need to capture one at request start and another on the response, and you need to keep the request start around for the duration of the request. If we cannot make the types use variable buffers, what we used to do in the past is ship the larger buffer to userspace when the request start is seen; it would be held in an LRU until the full event arrives, and then we use the extra data.

Thanks again for bringing this up and let me know what your thoughts are.


mmat11 commented Jun 7, 2025

Start by adding an option to the config file that will allow us to enable eBPF side filtering. This will be passed through the eBPF side in generictracer.go as a constant. We'll have this off by default for now until we migrate enough of the protocol detection to eBPF side.

Sounds good! Should we have this toggle per-protocol?

Implement protocol detectors in eBPF like you've done with MySQL, but keep it light especially for complex protocols like Kafka, because we can always finish the detection in userspace. These protocol detectors still use the TCP path, don't make new event types, but they tag the TCP event with sub-type hint to be taken in consideration in userspace.

Given that the events are different from each other, wouldn't it be more consistent to use different event types? (Just a nit, I'm fine with doing both.)

Use the option in 1. to discard anything that's not detected.

By discard you mean let the "unknown TCP" fallback handle it?

Provide means to send variable buffer sizes, keeping the defaults as they are now. 8K is a lot for each request and users must consciously choose to enable/increase this. We've had users with 100s nested SQLs under a HTTP transaction, some ORMs do this commonly. Imagine what the overhead would be if we captured 8K for each SQL and they were running with 100 QPS? We had code in the past for capturing extra buffers, but it was removed. I'm going to try to find it. The main issue with variable buffers is that you need to capture one at request start and another on the response and you need to keep the request start around for the duration of the request. If we cannot make the types use variable buffers, what we used to do in the past is ship the larger buffer to userspace when the request start is seen, it would be held in an LRU until the full event arrives and then we use the extra data.

Agree, we should use variable length buffers -- I will open an issue soon

Thanks, will apply these changes and Rafael's review suggestions!


grcevski commented Jun 8, 2025

Hey Mattia,

Sounds good! Should we have this toggle per-protocol?

We can do that; we already have this functionality in user space to filter out events we don't want, e.g. no SQL traces, but SQL metrics etc. Maybe we can leverage those toggles to remove the unwanted events after we've migrated most to the eBPF side, if we manage to do it all. Perhaps we can leave adding this option as a last step; for now we can benefit by detecting something like MySQL in eBPF and then directing the userspace parsing directly.

Given the fact events are different from each other, wouldn't it be more consistent to use different event types? (just a nit, I'm fine with doing both)

Sure, let's add separate event types for the TCP data type. I would just keep the data type the same, as well as the code that handles the buffer acquisition and nesting for distributed traces. I'm suggesting this so that we minimise how much code we have to duplicate. If it's not possible to make all protocols easily fit inside the instruction limit, perhaps we can tail call to, say, the MySQL handler and then tail call to the TCP event with a custom type.

By discard you mean let the "unknown TCP" fallback handle it?

Yeah, let's keep the unknown TCP fallback as it is today. If we manage to migrate most protocol parsing to eBPF space, we can decide to add the option to not send the TCP fallback. Kafka is a lot of parsing code, so we'd need to add C unit tests. HTTP2/gRPC is also a lot of work compared to the fallback code we use at the moment in userspace. Redis and MongoDB (which someone is adding at the moment) are simple enough.

@mmat11 mmat11 force-pushed the matt/sql-errors branch from f66bcf7 to be5db5b Compare June 18, 2025 23:54

mmat11 commented Jun 18, 2025

Hi @grcevski and @rafaelroquetto, I did as discussed here and offline. The PR needs a bit of polishing, and perhaps I missed some cleanups like moving the maps into their own files as suggested by Rafael, but it can be re-reviewed now.

@rafaelroquetto rafaelroquetto left a comment

This is great @mmat11 ! Thank you so much for this, outstanding work!
I just have a bunch of nits (that you couldn't have known beforehand). I think once we iron those out, we should be good to go.

Again, thank you for all your effort and for bearing with me. Grazie!

return false;
}

static int parse_response_packet_type(struct mysql_hdr *hdr, const u8 *data, size_t data_len) {
Contributor

For instance, could we have a buffer here at this stage into which we read data, and have all of the parsing functions operate on that buffer rather than doing a cascade of bpf_probe_read_kernel calls? Or at least read bigger chunks?

Contributor Author

I changed is_last_packet_ok() to read 25 bytes from the end on the stack, so we can avoid all the small bpf_probe_read* calls.
Do you mind if we do this other optimization in a separate PR? From what I see, we need 2 separate chunks, as sometimes we read from the start of the buffer and sometimes from the end; doing it on the stack would be OK, but ideally I'd put these chunks in a map.

Contributor Author

I'll do it in this PR

@mmat11 mmat11 force-pushed the matt/sql-errors branch 2 times, most recently from 6130919 to 75c73be Compare June 19, 2025 22:39

mmat11 commented Jun 19, 2025

@rafaelroquetto thanks again for the quick review, I think I covered everything here: 1cc64e2
Let me know if I forgot something

@mmat11 mmat11 force-pushed the matt/sql-errors branch from 75c73be to a655ca0 Compare June 20, 2025 00:03
mmat11 added 2 commits June 20, 2025 02:20
Signed-off-by: Mattia Meleleo <mattia.meleleo@coralogix.com>
@mmat11 mmat11 force-pushed the matt/sql-errors branch from a655ca0 to 1cc64e2 Compare June 20, 2025 00:30
@mmat11 mmat11 force-pushed the matt/sql-errors branch from 1cc64e2 to c44d49f Compare June 20, 2025 01:00
@rafaelroquetto rafaelroquetto left a comment

This is a big PR with a lot of implicit knowledge, so I hope you don't mind me having to iterate on this.

// Every mysql command packet is prefixed by a header
// https://mariadb.com/kb/en/0-packet/
struct mysql_hdr {
u32 payload_length : 24;
Contributor

I realised there's an issue here - the MySQL protocol defines integers to always be little-endian - using bitfields may break that assumption on big endian architectures.

Also, there are more general issues that can come to bite: the alignment of bitfields is unspecified, the compiler may choose to add padding there, etc... we shouldn't be using it, instead, we should do

u8 payload_length[3];

and then have a helper that returns something like


	return ((u32) hdr->payload_length[0])
		| ((u32) hdr->payload_length[1] << 8)
		| ((u32) hdr->payload_length[2] << 16);

for an extended rationale, see the "Notes" section at https://cppreference.com/w/c/language/bit_field.html

Contributor Author

don't we only support amd64 and arm64?

regarding alignment, is that true for explicitly packed structures? I have __attribute__((packed)) there

Contributor

don't we only support amd64 and arm64?

We do, until we don't... but this is just one of the issues; overall, bitfields are problematic. If you look at the "Notes" section of the link above, you will see we'd be relying on a lot of undefined and implementation-defined behaviour, much of which is hard to spot and to enforce (the compiler will allow the code to compile, may not even emit a warning, but then the code misbehaves).

regarding alignment, is that true for explicitly packed structures? I have __attribute__((packed)) there

here's an example of bitfields in a packed structure, in which the bitfields themselves are not packed: https://godbolt.org/z/7zbaMrz9M

Contributor Author

I don't see padding being added there 🤔. The size is 5 because of the bitfield (3 bits => 1 byte) + the uint32 (4 bytes), and the offset is 1. If you remove the packed attribute, the size is 8 and the offset is 4, as it should be.
My bitfield size is also a multiple of 8, so there should be no issues doing that.

Contributor

What I meant is exactly that, sorry if the example was poor. As a multiple of 8, your bitfield neatly aligns to the next byte, so there are no "spare" bits, which affects the size of the struct. If you pass -Wpadded to the example, you will see

warning: padding struct 'struct Test' with 3 bits to align 'b' [-Wpadded]

Even if there are no alignment issues with your bitfield (which is not guaranteed), the usage is still prone to all of the other issues described above.

// Metadata
bool hdr_arrived; // Signals whether to skip the first 4 bytes in the current
// buffer, as they arrived in a previous packet.
} __attribute__((packed));
Contributor

Reading about the mysql header, it's best to actually define two structs whose sizes you can take with sizeof() instead of manually setting the size constants above, e.g.

struct mysql_hdr {
    u8 payload_length[3]; // little-endian 24-bit value
    u8 sequence_id;
};
// can now write sizeof(mysql_hdr) rather than k_mysql_hdr_size

struct mysql_command_packet { // usually client requests
    struct mysql_hdr hdr;
    u8 command_id;
};
// ditto for sizeof(mysql_command_packet)

u8 response_packet_type;
int packet_type_response;

if (data_len == k_mysql_hdr_without_command_size) {
Contributor

this would then read

Suggested change
if (data_len == k_mysql_hdr_without_command_size) {
if (data_len == sizeof(struct mysql_hdr)) {

u8 response_packet_type;
int packet_type_response;

if (data_len == k_mysql_hdr_without_command_size) {
Contributor

what happens if you get a subsequent packet that is also 4 bytes long? Imagine the first packet is the header, and then for the same connection info you get a 4-byte payload (e.g. header + COM_QUERY "SET"): you will overwrite the header info here.
What you could do instead is:

  1. check if you already have stored a header in the map
  2. perhaps check if the sequence number is sane

return 0;
}

if (data_len < (k_mysql_hdr_size + 1)) {
Contributor

then here, you should bail only if you don't have any previously stored header

return 0;
}

struct mysql_hdr hdr = {};
Contributor

you could then ditch the hdr_arrived metadata, and also this variable - at this point, either you have a previously stored header (what comes next is a payload) or you don't, and data contains the header + a payload chunk.

I think what I am trying to say is something along the lines of:

    struct mysql_hdr *hdr = find_mysql_hdr(info);

    if (hdr == NULL && data_len < sizeof(*hdr)) {
        bpf_dbg_printk("not mysql");
        return 0;
    }

    // consume header if we don't have any yet
    if (hdr == NULL) {
        hdr = new_mysql_hdr(info);
        bpf_probe_read_kernel(hdr, ...);
        data_len -= sizeof(*hdr);
    }

    // nothing left to read
    if (data_len == 0) {
        return 0;
    }

   // ---> at  this stage, you have a header and some data to consume

as far as I know the MySQL protocol is strictly request->response - we should leverage that: once you've parsed a command (request), remember it, and only try to parse a response if you have successfully parsed a request for a command of interest - otherwise, you can drop the orphan response packet and bail early, which is good.

// +------------+-------------+----------------------+
*packet_type = PACKET_TYPE_REQUEST;
break;
default:
Contributor

so if we haven't found any request, are we interested on "orphan" responses?

Contributor Author

orphan responses will be dropped both later in the tcp codepath and also in userspace, I can add another check here if you prefer

Contributor

yeah, we should drop them early - there's no point on running all that code downstream just to drop them

// Tail call back into generic TCP handler.
// Once the header is fixed up, we can use the generic TCP handling code
// in order to reuse all the common logic.
bpf_tail_call(ctx, &jump_table, k_tail_protocol_tcp);
Contributor

at this stage, we should have a header and a payload ready to be consumed, so couldn't we simply construct a specific event that contains the relevant bits of tcp_req plus the "raw buffer"?

otherwise, are we not sending two events, the tcp_large_buffer one and the tcp_req one? If that's the case, it means we are reading u_buf multiple times - once for the large buffer, and once as part of of the original TCP codepath (which reads K_TCP_MAX_LEN=256) and puts it in the ringbuffer...

If my understanding is correct (and it may as well not be), then we should instead really dispatch everything just once via a large buffer to userspace and skip the common TCP codepath altogether.

Contributor Author

It was agreed offline to reuse the tcp code path in order to have common code (such as traceparent) and edge cases already figured out (also see: #118 (comment)) and maximize code reuse at the cost of one additional tail call.

On the "non-large buffer" part, given the tcp code path uses the reserve/commit API, the event we send is still fixed size, so we just save a copy by not using the fixed buffer. I kept it in order to use it as a fallback in case the large buffer is not found in the LRU.

Contributor

These aren't mutually exclusive in my view - we could (and should) move the reusable part of the code to a common place that can be invoked by both codepaths. But in my opinion there's no upside whatsoever in doing multiple passes over u_buf and submitting duplicated information (both events) when it's perfectly feasible not to do so - otherwise this is premature pessimisation in a hot path (tcp_sendmsg and tcp_recvmsg are invoked at a very high frequency). Separating things would also improve clarity in otherwise dense code.

Contributor

What you are proposing @rafaelroquetto is not easy to do with the way the code is currently structured. How about we keep it as is and then follow up with a PR to clean that part up? I feel like the changes are getting too large anyway. Nothing stops us from iterating on this; it doesn't have to be the final state.

Contributor

I don’t feel comfortable merging this as-is for two main reasons:

  1. Duplicate events: We’re emitting both the large-buffer event and the original TCP event with a fixed payload.
  2. Excessive bpf_probe_read_kernel(): We call bpf_probe_read_kernel() far more often than necessary

Originally, I understood the plan to be a single variable-buffer event into userspace—so I may be missing something.

That said, if you guys prefer to move forward as-is and address it in subsequent PRs, I won’t block it.

Contributor

@mmat11, I spoke to @grcevski and he gave me the entire rationale on why it was decided to use two events, so it makes sense to me now.

Contributor Author

@rafaelroquetto nice! I'll address the excessive usage of bpf_probe_read today so we can hopefully get this in

}
)

var largeBuffers = expirable.NewLRU[largeBufferKey, largeBuffer](1024, nil, 5*time.Minute)
Contributor

Most of the changes look good IMO now, I will do another review soon to see if there are other things I might have missed, but we should move these large buffers into the EBPFParseContext structure, rather than making them static in the file. This is in preparation for OBI being vendored into the OTel collector in the future, where we might expect the OBI component to be reloaded. We used to have these statics everywhere; now they are all removed.

@grcevski
Contributor

  • additional tests?

I believe we are good on integration tests for this PR, since we test MySQL and as long as it doesn't break things should be good. I'm going to spend some time today looking to see if I can tell why some of these tests are failing if you need help.

One thing I would do is add unit tests for the new bits of the IF logic in the request handling, just to ensure we handle the path correctly.

Ideally we should add integration tests for the variable buffers, but that can follow in another PR, since it's not blocking the default use of the product.

@rafaelroquetto
Contributor

@mmat11 another thing I realised is that sometimes the buffer we are reading may be userspace memory, not always kernel - if you suspect the failing tests could be related to this (and I don't mean to send you on the wrong path), perhaps we should consider using bpf_probe_read() after all (instead of _read_kernel()), as it will dispatch to the correct _kernel or _user variant.
