Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: fixed incorrect bytestring encoding PlutusData #269

Merged
merged 9 commits into from
Oct 13, 2023

Conversation

theeldermillenial
Copy link
Contributor

@theeldermillenial theeldermillenial commented Sep 21, 2023

Fix #268

Summary

pycardano incorrectly encodes bytes longer than 64 bytes for PlutusData. Currently, a bytes element is encoded the same regardless of length, but if the length is larger than 64 bytes it should be broken up into 64 byte chunks per the following spec:
https://developers.cardano.org/docs/get-started/cardano-serialization-lib/transaction-metadata/#metadata-limitations

This PR fixes this issue by creating a dummy class to catch PlutusData objects during serialization and properly breaks up bytes values longer than 64 bytes in length.

If further explanation is needed, I can provide examples. This PR provides a unit test to verify the expected output is obtained with a long bytes input.

@nielstron
Copy link
Contributor

nielstron commented Sep 21, 2023

Is there a reason to introduce MetadataIndefiniteList? I don't see the benefit over using IndefiniteList directly

# Currently, cbor2 doesn't support indefinite list, therefore we need special
# handling here to explicitly write header (b'\x9f'), each body item, and footer (b'\xff') to
# the output bytestring.
encoder.write(b"\x9f")
for item in value:
encoder.encode(item)
if (
isinstance(value, MetadataIndefiniteList)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why the encoding as MetadatIndefiniteList is necessary, can we not simply encode all bytes longer than 64 bytes as a list?

Copy link
Contributor Author

@theeldermillenial theeldermillenial Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I'm still new to Cardano. I guess these were my thoughts and how I would push back on encoding all bytes the same way.

  1. If bytes length only has a restriction for metadata, will encoding them the same way in other parts of the message cause an issue?
  2. Why impose the same criteria on other parts of the message if it's not strictly required?
  3. Won't chunking bytes data cause nominally larger message lengths, and by extension, marginally higher tx fees?

If you feel these are non-issues, I think it's easy enough to remove the dummy class and encode everything the same way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in retrospect your comments on the issue I raised make more sense now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the cardano ledger generally specifies that the cbor encoding of bytestrings should be at most 64 bytes long each piece. This would prevent OOM attacks when reading very long bytestrings. However the ledger does not enforce this, leading to different implementations being abound on chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't OOM attack prevented my maximum transaction size anyway? Again, just playing devils advocate here. I'm still relatively new to Cardano.

I can revert changes back to the original without the dummy class for PlutusData. Just give me a yay or nay. With these changes I was able to successfully submit to smart contracts, so I know these changes work correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to bump this so I can finish it off :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer this without the dummy class - maybe you can revert to that and see if both the test cases and your submission pass?

then Jerry or I can create a test case for exactly this datum submission

@theeldermillenial
Copy link
Contributor Author

Removed the dummy class.

All unit tests pass, including the specific test created to test a long bytestring.

My specific use case works.

I'm happy to do any additional work required to finish this off. I didn't see any contributing doc, so I don't know how versioning is handled.

@codecov-commenter
Copy link

codecov-commenter commented Oct 1, 2023

Codecov Report

Merging #269 (81abc6a) into main (0a95536) will decrease coverage by 0.34%.
The diff coverage is 53.33%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##             main     #269      +/-   ##
==========================================
- Coverage   85.14%   84.81%   -0.34%     
==========================================
  Files          26       26              
  Lines        2995     3022      +27     
  Branches      719      728       +9     
==========================================
+ Hits         2550     2563      +13     
- Misses        335      345      +10     
- Partials      110      114       +4     
Files Coverage Δ
pycardano/metadata.py 94.59% <ø> (ø)
pycardano/plutus.py 87.05% <45.45%> (-2.02%) ⬇️
pycardano/serialization.py 83.43% <57.89%> (-1.65%) ⬇️

Copy link
Collaborator

@cffls cffls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! Posted a question in the code. Apology for the delayed response.

@@ -176,7 +176,14 @@ def default_encoder(
# the output bytestring.
encoder.write(b"\x9f")
for item in value:
encoder.encode(item)
if isinstance(item, bytes) and len(item) > 64:
encoder.write(b"\x5f")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only activated when an item is inside a indefinite list. Do we need to break byte strings that are not part of indefinite list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK we need to break all bytes that are longer than 64 bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have misunderstood, but it seemed to me that this was the best place to put it since all PlutusData are cast to IndefiniteList.

If I pull it out of the IndefiniteList block, will it be handled properly? I guess it should.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you (correctly) noticed that all PlutusData fields are part of an indefinite list. However plutusdata can also contain bytes without being part of PlutusData (i.e. pure bytes or bytes that are keys in dictionaries)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the final answer to pull it outside of the IndefiniteList block?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this documentation it seems that yes, we need dummy classes. But not for lists, for bytes! :)

I am also wondering if there are cases where integers are incorrectly encoded (when they exceed 64 bytes size) since I implemented a special case for this here: https://github.com/OpShin/uplc/blob/448f634cc1225de6dd7390b670b01396d2e71156/uplc/ast.py#L430

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I am seeing more and more the intuition behind all the custom classes in OpShin.

I realize it's a bigger lift, but is there any reason why we wouldn't just take OpShin's implementation and pull it over to here? Then, just rely on pycardano rather than duplicating efforts across repos?

I apologize if I'm speaking out of ignorance and there are things I'm not considering, but this seems like it might be the more lasting implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries at all. The code I wrote for OpShin/UPLC was created after pycardano was written, hence there might be a point in copying it over. Then again, the UPLC implementation is really only catered towards PlutusData, while PyCardano also handles serialization of all other kinds of things - not sure if anything will break.

Long story short: The only reason that there are two different implementations is that no one yet tried to unify them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I would like to have this done sooner rather than later. Can I just create a dummy class for bytes to patch this and open a more general issue about syncing datum handling between OpShin and pycardano?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sounds good to me! Would also prefer to get this resolved over any big open stale PR :)

@theeldermillenial
Copy link
Contributor Author

Alright, this was annoying. I made a new dummy class to wrap bytes. That seemed to properly wrap the PlutusData tests so that the unit tests pass (with a modification to the ground truth CBOR for one test), but wrecked about 20 other unit tests that did raw dictionary/list comparisons to primitives. I was able to get around that by modifying the comparator for the dummy class.

There's still one failing unit test that has to do with generating the script hash. I didn't have a chance to dig into it. Maybe you can take a look and more easily see what might be the issues. Once that is resolved, this should be good.

@theeldermillenial
Copy link
Contributor Author

Alright, as best as I can tell, the last failing test was caused by serializing the COST_MODEL. If dictionaries are serialized to bytes, then the hash should change with the new way data is serialized to cbor.

@@ -149,7 +149,7 @@ def test_script_data_hash():
redeemers = [Redeemer(unit, ExecutionUnits(1000000, 1000000))]
redeemers[0].tag = RedeemerTag.SPEND
assert ScriptDataHash.from_primitive(
"032d812ee0731af78fe4ec67e4d30d16313c09e6fb675af28f825797e8b5621d"
"b11ed6f6046df925b6409b850ac54a829cd1e7603145c9aaf765885d8ec64da7"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should change. If we use write the same test in Haskell, it would generate the same hash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as you noticed in the other comment, this changes because all bytes are being encoded the same way per nielstrons suggestion.

Your comment makes sense. If we only change encoding in metadata/plutusdata, then the hash would not change.

Comment on lines 268 to 269
elif isinstance(value, bytes):
return ByteString(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it seems incorrect to replace every bytes with ByteString. Instead, we just offer ByteString for users to use in PlutusData or Metadata.
For some internal implementation that generates bytes as intermediate values, e.g. script_data_hash, we don't want to change its type arbitrarily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be implementable by a simple parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was one of my original points and how I originally had it implemented.

Can someone please make a definitive final decision so I can fix and be done? I've implemented and reimplemented this multiple times.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielstron Could you elaborate how adding a parameter will work?

Copy link
Contributor

@nielstron nielstron Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not work as straightforward as I imagined it. @theeldermillenial was right, maybe we should just roll with the initial design. I appreciate the excourse though because now we know precisely which bytes to encode this way 😅 sorry for the divergence.

maybe we can document this (and ideally find some supporting documentation on the discrepancy in the implementation)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preferred approach is to offer users ByteString class to use, which the encoder can automatically break it down to byte array. If a bytes object is found longer than 64 instead, pycardano should raise an exception and recommend users to use ByteString.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cffls But should this error only be thrown for PolusData and Metadata? Or should we apply it globally?

My two cents is "only implement exactly what is defined". The bytes length restriction appears to be limited to "metadata", so maybe we only apply it to metadata.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have this check for metadata: https://github.com/Python-Cardano/pycardano/blob/main/pycardano/metadata.py#L40-L49

I thought this should be also enforced in PlutusData, which was the reason why this PR was raised. If not, I am fine with only providing ByteString as an option for user in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies. I misspoke. When I said Metadata, I also meant PlutusData.

Also, I now see exactly what you're saying, and I think you're solution makes the most sense. You are saying we should inject a check for long byte strings in PlutusData and throw an error similar to what is seen in Metadata. Part of that error message should indicate the user can use the new ByteString class to allow longer bytes.

I think this is the most transparent approach, and it keeps in line with what I see to be pycardano's philosophy of being very unbiased.

If this is what you mean, I'll make the changes and we can be done. I will revert any hashes I altered, since this should really only affect the test I created. If there's any other unit tests you would like to see, I'm happy to add them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is exactly what I meant. Please go ahead with this approach. Thank you for confirming. ☺️

Copy link
Collaborator

@cffls cffls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Some code failed mypy analysis, you can do make qa to check errors.

@theeldermillenial
Copy link
Contributor Author

Should be good to go now.

Now that I know all your qa stuff, it should be easier for me to contribute. I'm dedicated to building stuff on Cardano in Python, so I'll try to contribute as I find things. Like #273

Copy link
Collaborator

@cffls cffls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for your contribution!

@cffls cffls merged commit a6f76ea into Python-Cardano:main Oct 13, 2023
11 checks passed
@theeldermillenial theeldermillenial deleted the bugfix/datum-bytestring branch October 14, 2023 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Recreating long submission bytestream datum
4 participants