
Conversation

vmx commented Nov 12, 2025

The libipld crate is deprecated. The usual transition path away from libipld is to ipld-core and serde_ipld_dagcbor. This crate, however, is so low-level that it should use cbor4ii directly.

cbor4ii is the CBOR library that serde_ipld_dagcbor is using.

The tests pass locally, but I haven't done any benchmarking yet. So this should be seen as a starting point; I'm happy to get it over the finish line if there's interest.

Copying the SliceReader from cbor4ii isn't ideal; maybe we can get an upstream fix. I've opened quininer/cbor4ii#50.
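
For readers new to the reader side of cbor4ii: the vendored SliceReader is essentially a byte slice plus a cursor that the decoder pulls data through. Below is a minimal sketch of that shape with simplified, assumed method names; the real vendored type implements cbor4ii's `dec::Read` trait, whose exact signature isn't reproduced here:

```rust
// Sketch only: a slice-backed reader with borrow-without-copy semantics.
// The vendored version implements cbor4ii's `dec::Read` instead of these inherent methods.
struct SliceReader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> SliceReader<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Borrow up to `want` bytes from the input without copying them.
    fn fill(&self, want: usize) -> &'a [u8] {
        let end = (self.pos + want).min(self.buf.len());
        &self.buf[self.pos..end]
    }

    /// Mark `n` bytes as consumed.
    fn advance(&mut self, n: usize) {
        self.pos = (self.pos + n).min(self.buf.len());
    }
}
```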

codspeed-hq bot commented Nov 12, 2025

CodSpeed Performance Report

Merging #80 will improve performance by ×2.5

Comparing vmx:remove-libipld (b64c304) with main (7a1eabd)

Summary

⚡ 161 improvements
✅ 31 untouched

Benchmarks breakdown

| Mode | Benchmark | BASE | HEAD | Change |
|---|---|---|---|---|
| Simulation | test_dag_cbor_decode[roundtrip01.json] | 19.2 µs | 16.6 µs | +15.55% |
| Simulation | test_dag_cbor_decode[roundtrip02.json] | 19.2 µs | 16.6 µs | +15.53% |
| Simulation | test_dag_cbor_decode[roundtrip03.json] | 19.2 µs | 16.6 µs | +15.15% |
| Simulation | test_dag_cbor_decode[roundtrip04.json] | 19.1 µs | 16.7 µs | +14.78% |
| Simulation | test_dag_cbor_decode[roundtrip05.json] | 21.2 µs | 18 µs | +17.72% |
| Simulation | test_dag_cbor_decode[roundtrip06.json] | 17.7 µs | 14.8 µs | +19.56% |
| Simulation | test_dag_cbor_decode[roundtrip07.json] | 17.6 µs | 14.8 µs | +18.71% |
| Simulation | test_dag_cbor_decode[roundtrip08.json] | 19.1 µs | 17.1 µs | +11.46% |
| Simulation | test_dag_cbor_decode[roundtrip09.json] | 21.7 µs | 17.6 µs | +23.68% |
| Simulation | test_dag_cbor_decode[roundtrip10.json] | 22.6 µs | 17.9 µs | +26.03% |
| Simulation | test_dag_cbor_decode[roundtrip11.json] | 19.8 µs | 17.5 µs | +13.07% |
| Simulation | test_dag_cbor_decode[roundtrip12.json] | 20.2 µs | 17.8 µs | +13.19% |
| Simulation | test_dag_cbor_decode[roundtrip13.json] | 20.2 µs | 17.9 µs | +13.3% |
| Simulation | test_dag_cbor_decode[roundtrip14.json] | 20.2 µs | 17.8 µs | +13.49% |
| Simulation | test_dag_cbor_decode[roundtrip15.json] | 19 µs | 16.7 µs | +14.05% |
| Simulation | test_dag_cbor_decode[roundtrip16.json] | 19.7 µs | 17.5 µs | +12.6% |
| Simulation | test_dag_cbor_decode[roundtrip17.json] | 19.7 µs | 17.4 µs | +13.04% |
| Simulation | test_dag_cbor_decode[roundtrip18.json] | 19.8 µs | 17.3 µs | +14.47% |
| Simulation | test_dag_cbor_decode[roundtrip19.json] | 19.7 µs | 17.3 µs | +13.77% |
| Simulation | test_dag_cbor_decode[roundtrip20.json] | 19.3 µs | 17 µs | +14.08% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

MarshalX (Owner) commented Nov 13, 2025

This is awesome!

I considered migrating to something else after the deprecation of libipld... During my investigation, I remember creating issues in one of your repos related to the recursion limit. I see that this is solved by the impl of dec::Read.

I love how compatible it is with the current test suite. Moreover, we do have some incredible performance boosts! For example, test_decode_car got +24% (858ms -> 693.6ms) and test_dag_cbor_decode_torture_cids an unbelievable +92% (93ms -> 48.4ms)!!!

However, we also got some performance regressions, which are worth checking:

  • test_dag_cbor_decode_real_data[canada.json] -26%
  • test_dag_cbor_encode_real_data[canada.json] -9%
  • test_dag_cbor_encode_real_data[citm_catalog.json] -3%

An interesting note (and possible hint) is that this is canada.json, which is mostly lists of floats. That's where to dig in.

"RecursionError: maximum recursion depth exceeded in DAG-CBOR decoding",
).restore(py);
)
.restore(py);
MarshalX (Owner) commented Nov 13, 2025

What code formatter do you use? Gonna make it a step in the CI pipeline

vmx (Author) replied:

Just a plain cargo fmt.

vmx (Author) commented Nov 14, 2025

@MarshalX I've pushed a new commit which should fix the performance regression. Can you please re-trigger a benchmark run?

MarshalX (Owner) commented Nov 14, 2025

@vmx done! Incredible results! Decoding gained x2.5 at peak and there are no problems with decoding anymore!!! What's left is only 3 benchmarks around encoding. canada.json is still at -9%.

btw, do you know why CID decoding sped up so much? maybe you are familiar with the differences

vmx added 2 commits November 17, 2025 15:22
The maximum recursion limit is tracked within `decode_dag_cbor_to_pyobject()`,
hence it doesn't need to be part of the SliceReader.
The latest version contains performance improvements.
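
Below is a sketch of the depth-tracking pattern described in the first commit message above: the decode function carries the remaining recursion budget itself, so the reader can stay a plain cursor. The names, the limit value, and the toy "CBOR" parsing are illustrative assumptions, not the crate's actual code:

```rust
// Illustrative only: the recursion budget lives in the decode function, not in the reader.
const MAX_DEPTH: usize = 1024;

#[derive(Debug)]
enum Decoded {
    Scalar,
    List(Vec<Decoded>),
}

fn decode_dag_cbor(mut input: &[u8]) -> Result<Decoded, String> {
    decode_item(&mut input, MAX_DEPTH)
}

fn decode_item(input: &mut &[u8], remaining_depth: usize) -> Result<Decoded, String> {
    if remaining_depth == 0 {
        return Err("maximum recursion depth exceeded in DAG-CBOR decoding".to_string());
    }
    match input.split_first() {
        // Pretend 0x80..=0x97 is a short array header (real CBOR parsing elided).
        Some((&byte, rest)) if (0x80..=0x97).contains(&byte) => {
            *input = rest;
            let len = (byte - 0x80) as usize;
            let mut items = Vec::with_capacity(len);
            for _ in 0..len {
                // Nested items get one less unit of the depth budget.
                items.push(decode_item(input, remaining_depth - 1)?);
            }
            Ok(Decoded::List(items))
        }
        Some((_, rest)) => {
            *input = rest;
            Ok(Decoded::Scalar)
        }
        None => Err("unexpected end of input".to_string()),
    }
}
```
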
vmx (Author) commented Nov 17, 2025

btw, do you know why CID decoding sped up so much?

A comment from you said that it's allocating. The new version shouldn't allocate. Maybe that makes the difference?

For the encoding I did some improvements upstream for numbers. That should at least get the canada.json test performance up. Let's see what happens with the others; I've had a hard time trying to figure out what makes them slower.

Then please re-trigger the benchmark once again.

MarshalX (Owner) commented Nov 17, 2025

I did some improvements upstream for numbers

Hero!

Canada encoding is 5% faster than one commit before 🔥 There's still a little degradation of 4%, but this is acceptable.

It's hard to believe, but somehow github.json encoding is now slower by 9%.

I am looking only at the real-data benchmarks, since the rest are so micro that they could be noise as hell.

Upd. The github.json benchmark is measured in µs. Maybe it's tiny enough to produce some noise... but historically it has always shown the correct number to rely on.

vmx (Author) commented Nov 17, 2025

I've tried things locally and saw large variations between runs. I then added some wall-clock timing within the Rust code. A large overhead (30x) and large variations came from the Python initialization part; the actual parsing time for the github.json one is really small.
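
For reference, the kind of ad-hoc wall-clock timing meant here is just std::time::Instant wrapped around the suspected hot spot; the helper below is purely illustrative and not part of the PR:

```rust
use std::time::Instant;

// Purely illustrative: time a closure and print the elapsed wall-clock time to stderr,
// so it shows up even when the code is driven from pytest.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    eprintln!("{label}: {:?}", start.elapsed());
    result
}
```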

Is there a way to re-run the current main branch benchmark again, just to see how big the variation is between runs?

MarshalX (Owner) commented:

Here is the result of the main benchmark run from today VS 26 days ago: https://codspeed.io/MarshalX/python-libipld/runs/compare/691bc274750130912a26cc99..68f8f140424026582c5e7fc4

vmx (Author) commented Nov 18, 2025

Here is the result of the main benchmark run from today VS 26 days ago: https://codspeed.io/MarshalX/python-libipld/runs/compare/691bc274750130912a26cc99..68f8f140424026582c5e7fc4

My takeaway from those two runs is:

  • the github.json one seems pretty stable
  • random other tests (twitter.json) can deviate by up to 9%, and that one is even in the ms range.

I'd still like to know why github.json is slower, but I'm not sure spending much more time on it is really worth it. As it's all mostly a single function, it's kind of hard to profile and investigate what's really going on (and I also haven't done that much with Rust projects yet).

MarshalX (Owner) commented:

I do agree with you. And I am ready to move forward.

Moreover, encoding is not yet critical for the atproto community, so it will not affect the major user base of the library at all.

Let's wait for the next upstream lib release with your perf boost and public API, and then merge it.

Thank you for your hard work!

Do the same as the original version and rely on Python for the
string UTF-8 validation.
MarshalX (Owner) commented Nov 20, 2025

I do recall some perf problems around PyString::new_bound; that's why I picked (#41) an unsafe approach to make direct CPython FFI calls instead of using the pyo3 wrapper. I do see calls to PyObject_GetMethod in the regression, which is possibly the pyo3 overhead.

Upd. Not sure how pyo3 has changed since last year. They did a great job with the new Bound API to eliminate overheads.

Upd2. I misread it; with from_bytes it looks like exactly the same FFI call as before: https://github.com/PyO3/pyo3/blob/d8e9a3860b5a08b8020364841808b2d3cb2f4f68/src/types/string.rs#L175-L183
Upd3. Yeah, they just added from_bytes in September this year; that's why the unsafe code was in place before.
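
To make the comparison concrete, here is a rough sketch of the two paths being discussed: the safe pyo3 constructor needs a `&str`, so UTF-8 validation happens on the Rust side, while a direct CPython FFI call leaves the validation to Python. pyo3 API details differ between versions, and converting the raw pointer back into a pyo3 object is deliberately left out:

```rust
use pyo3::prelude::*;
use pyo3::types::PyString;

// Safe pyo3 path: UTF-8 validation happens in Rust before Python ever sees the data.
// (`PyString::new` in recent pyo3 versions; older ones spell it `new_bound`.)
fn str_via_pyo3<'py>(py: Python<'py>, bytes: &[u8]) -> PyResult<Bound<'py, PyString>> {
    let s = std::str::from_utf8(bytes)
        .map_err(|e| pyo3::exceptions::PyValueError::new_err(e.to_string()))?;
    Ok(PyString::new(py, s))
}

// Direct FFI path: hand the raw bytes to CPython and let it do the UTF-8 validation.
// Error handling and wrapping of the returned pointer are omitted here.
unsafe fn str_via_ffi(bytes: &[u8]) -> *mut pyo3::ffi::PyObject {
    unsafe {
        pyo3::ffi::PyUnicode_DecodeUTF8(
            bytes.as_ptr().cast(),
            bytes.len() as pyo3::ffi::Py_ssize_t,
            std::ptr::null(),
        )
    }
}
```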

vmx (Author) commented Nov 20, 2025

In local testing it didn't make things slower, hence I've used it. Do I read your updates correctly that it's all good?

I run benchmarks via e.g. `uv run pytest -k 'test_dag_cbor_decode_real_data[github.json]' --benchmark-enable`. Is that the correct way?

Before the most recent change:
------------------------------------------- benchmark: 1 tests -------------------------------------------
Name (time in us)                                    Min      Mean    StdDev  Outliers  Rounds  Iterations
----------------------------------------------------------------------------------------------------------
test_dag_cbor_decode_real_data[github.json]     375.4650  479.6358  114.6506     487;0    1779           1
----------------------------------------------------------------------------------------------------------

After the change:
------------------------------------------- benchmark: 1 tests ------------------------------------------
Name (time in us)                                    Min      Mean   StdDev  Outliers  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test_dag_cbor_decode_real_data[github.json]     284.8470  325.7824  73.1020   268;358    2016           1
---------------------------------------------------------------------------------------------------------

vmx (Author) commented Nov 20, 2025

The latest regression shows that tests vary a lot between runs. The encoding code path did not change.

MarshalX (Owner) commented Nov 20, 2025

Yes, all good!

It is the correct way. At least, this is exactly how it runs inside the pipelines.

I'm starting to hate the results on CodSpeed. Maybe PGO adds this randomness... but the input data is static... The CI pipeline looks awkward to me. The PGO-gathering stage runs the benchmarks properly (benchmark: 192 tests, with the same table in the output as your local runs). But the CodSpeed benchmark run looks like it uses only tests (0 benchmarked)? Without doing proper rounds and iterations? That's really hard to tell, because they inject their own benchmark runner as far as I know...

CodSpeed had to disable the following plugins: pytest-benchmark

and they do use pytest-codspeed

MarshalX (Owner) commented:

Yeap, looks like CodSpeed was completely off and is not compatible with how pytest-benchmark defines benchmarks in the code... Let me dig into it and push fixes in a separate PR.

MarshalX (Owner) commented Nov 20, 2025

Welp, I spent a few hours playing around and here are my notes:

  • I do not think that we should rely on CodSpeed; today I discovered for myself how it actually works: https://codspeed.io/docs/instruments/cpu/overview. The most important thing is "A benchmark will be run only once and the CPU behavior will be simulated". So there is never more than one real run.
  • I do not think that we should rely on any CI/CD benchmarks, because this repo uses GitHub-hosted runners. As far as I learned today, the results vary by ±10-20% XD

I do think that we must use local bench comparisons only. The best thing I did was grouping the useful benchmarks. Here is how to start comparing locally:

# checkout main
uv pip install -v -e .  
uv run pytest . -m benchmark_main --benchmark-enable --benchmark-save=main
# checkout your branch
uv pip install -v -e .  
uv run pytest . -m benchmark_main --benchmark-enable --benchmark-save=cbor4ii

uv run pytest-benchmark compare --group-by="name" 

My local comparison:
[screenshot: local benchmark comparison]

Remote comparison using the new workflow:
[screenshot: remote benchmark comparison via the new workflow]
src: https://github.com/MarshalX/python-libipld/actions/runs/19549156715/attempts/2#summary-55975972813

Verdict: encoding is still 2-11% slower, which is strange because, without digging too deep, I cannot tell why. This correlates with the CodSpeed simulation.

vmx (Author) commented Nov 20, 2025

Verdict: encoding is still 2-11% slower, which is strange because, without digging too deep, I cannot tell why. This correlates with the CodSpeed simulation.

I also have no clue why encoding would be slower. I'll rerun the tests as you mentioned above (it would be good to have that in the README). In the past, local re-runs still had a pretty big variation, and I'm not sure why that is either. I also tried to run them directly from Rust through a binary, but even there the variations are large.

vmx (Author) commented Nov 20, 2025

I just couldn't give up. I think I've found the main issue. Please try again.

MarshalX (Owner) commented:

Decoding fails, but here are my local encoding tests:

main - the main branch
0002_cbor4ii - old writer
0003_cbor4ii - new writer

[screenshot: local encoding benchmark comparison]

Rust's BufWriter is highly optimized. Use it instead of a custom one.
Wrap it in a newtype so that we can implement `cbor4ii`'s `enc::Write` for it.
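
The shape of that change is roughly the newtype pattern below. Since both `BufWriter` and cbor4ii's `enc::Write` are foreign to this crate, the orphan rules require a local wrapper type; the trait shown here is a locally defined stand-in with an assumed signature, not cbor4ii's real definition:

```rust
use std::io::{self, BufWriter, Write as _};

// Stand-in for cbor4ii's `enc::Write`; the real trait's signature may differ.
trait EncWrite {
    type Error;
    fn push(&mut self, input: &[u8]) -> Result<(), Self::Error>;
}

// Newtype around std's BufWriter. In the actual code this is what allows
// implementing the foreign cbor4ii trait for a foreign std type.
struct Writer<W: io::Write>(BufWriter<W>);

impl<W: io::Write> EncWrite for Writer<W> {
    type Error = io::Error;

    fn push(&mut self, input: &[u8]) -> Result<(), Self::Error> {
        // Delegate to the buffered writer; flushing is handled separately by the caller.
        self.0.write_all(input)
    }
}
```
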
MarshalX (Owner) commented Nov 20, 2025

Looks like this is it! You did it @vmx! I would say that the perf is now the same, and the remaining ±2% is randomness.

I really like seeing how the max values are much lower with cbor4ii. I feel there's some potential here. We need to see the gains with PGO :)

vmx (Author) commented Nov 20, 2025

Please re-run again locally. I missed the flushing in the last version. Now the code is even closer to the original one, if you look at the full diff. There are no changes to the main encoding entry point.
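
On the flushing point: a buffered writer only hands its bytes to the underlying sink when the buffer fills up, is flushed, or is dropped, so a missing flush can silently truncate the encoded output. A minimal illustration with std's BufWriter (not the PR's actual code):

```rust
use std::io::{BufWriter, Write};

fn encode_to_vec() -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    {
        let mut writer = BufWriter::new(&mut out);
        writer.write_all(&[0xa0])?; // e.g. the header byte of an empty CBOR map
        // An explicit flush pushes the buffered bytes into `out` and surfaces any
        // I/O error; relying on Drop alone would swallow such errors silently.
        writer.flush()?;
    }
    Ok(out)
}
```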

MarshalX (Owner) commented:

Not sure what happened with twitter.json encoding this time, but the rest is OK.

0003_cbor4ii - your latest commit

[screenshot: local encoding benchmark comparison, latest commit]

Btw codspeed results are here!
[screenshot: CodSpeed results]
