Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception due to unicode on testnet #297

Closed
johnalotoski opened this issue Sep 19, 2020 · 16 comments
Closed

Exception due to unicode on testnet #297

johnalotoski opened this issue Sep 19, 2020 · 16 comments

Comments

@johnalotoski
Copy link
Contributor

johnalotoski commented Sep 19, 2020

On testnet network, cardano-db-sync at tags/5.0.0 in extended mode, is throwing exceptions at block 1816809 due to unicode parsing/handling issues. It cannot sync past this block.

[db-sync-node:Info:234667] [2020-09-19 14:23:20.41 UTC] Starting chainSyncClient
[db-sync-node:Info:234667] [2020-09-19 14:23:20.48 UTC] Cardano.Db tip is at slot 6061839, block 1816809
[db-sync-node:Info:234672] [2020-09-19 14:23:20.48 UTC] Running DB thread
[db-sync-node:Info:234672] [2020-09-19 14:23:20.65 UTC] Rolling back to slot 6061839, hash 98414d12d1d1f05210dea6ce4082e1bcbbcfdf56343fd1cb44a8778c4c9ea57a
[db-sync-node:Info:234672] [2020-09-19 14:23:20.66 UTC] Deleting blocks numbered: []
[db-sync-node:Error:234672] [2020-09-19 14:23:21.03 UTC] runDBThread: DbInsertException "TxMetadata" (SqlError {sqlState = "22P05", sqlExecStatus = FatalError, sqlErrorMsg = "unsupported Unicode escape sequence", sqlErrorDetail = "\\u0000 cannot be converted to text.", sqlErrorHint = ""})
[db-sync-node:Error:234667] [2020-09-19 14:23:21.05 UTC] ChainSyncWithBlocksPtcl: DbInsertException "TxMetadata" (SqlError {sqlState = "22P05", sqlExecStatus = FatalError, sqlErrorMsg = "unsupported Unicode escape sequence", sqlErrorDetail = "\\u0000 cannot be converted to text.", sqlErrorHint = ""})
[db-sync-node.Subscription:Error:234663] [2020-09-19 14:23:21.05 UTC] [String "Application Exception: LocalAddress {getFilePath = \"/run/cardano-node/node.socket\"} DbInsertException \"TxMetadata\" (SqlError {sqlState = \"22P05\", sqlExecStatus = FatalError, sqlErrorMsg = \"unsupported Unicode escape sequence\", sqlErrorDetail = \"\\\\u0000 cannot be converted to text.\", sqlErrorHint = \"\"})",String "SubscriptionTrace"]
[db-sync-node.ErrorPolicy:Error:4] [2020-09-19 14:23:21.05 UTC] [String "ErrorPolicyUnhandledApplicationException (DbInsertException \"TxMetadata\" (SqlError {sqlState = \"22P05\", sqlExecStatus = FatalError, sqlErrorMsg = \"unsupported Unicode escape sequence\", sqlErrorDetail = \"\\\\u0000 cannot be converted to text.\", sqlErrorHint = \"\"}))",String "ErrorPolicyTrace",String "LocalAddress {getFilePath = \"/run/cardano-node/node.socket\"}"]
@erikd
Copy link
Contributor

erikd commented Sep 20, 2020

This is HUGE pain in the neck.

The tx metedata that is being inserted is

(1,S "\NUL\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX")

which is being encoded as JSON:

"\u0000\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003"

which of course is not valid UTF8 and hence the database rejects it.

Even worse, the above is encoded as UTF8 by the Haskell Text library as shown, but PostgreSQL then rejects it.

@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

Let me reiterate what a HUGE pain in the neck this is.

The string in question is decoded to UTF8 by the Haskell Text library and then passed to Postgres which throws an SqlError exception. I have tried catching this exception but catching the exception does not seem to work.

I can catch and log the exception:

[db-sync-node:Warning:37] [2020-09-21 00:26:18.77 UTC] insertTxMetadata: Failed to insert (1,S "\NUL...\ETX")
SqlError {sqlState = "22P05", sqlExecStatus = FatalError, sqlErrorMsg
 = "unsupported Unicode escape sequence", sqlErrorDetail = "\\u0000 cannot be converted to text.", sqlErrorHint = ""}

but then there is a second excpetion:

[db-sync-node:Error:37] [2020-09-21 00:26:18.77 UTC] runDBThread: SqlError {sqlState = "25P02", sqlExecStatus = FatalError, sqlErrorMsg = "current transaction is aborted, commands ignored until end of transaction
 block", sqlErrorDetail = "", sqlErrorHint = ""}

This suggests that I need to more fully validate the JSON before insertion.

@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

We can store invalid UTF8 in PostgreSQL, but only as bytea data not as the jsonb and if we go for the former Postgres's JSON query support will not be avaiable.

@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

I have tried a manual UTF8 decode:

decodeUtf8IO :: ByteString -> IO (Either Text.UnicodeException Text)
decodeUtf8IO = try . evaluate . Text.decodeUtf8With Text.strictDecode

and that too has the same problem.

@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

From: https://stackoverflow.com/questions/31671634/handling-unicode-sequences-in-postgresql

A null byte isn't legal in a PostgreSQL string. Neither is a zero code point.

@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

Currently think this is due to the way Haskell/Persistent serializes this to send to postgres. Currently trying to validate this theory.

@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

Aeson.encode already escapes the NUL char to \u0000 which Persistent then rejects.

Coming up with a neat fix for this is not going to be easy.

@loverdos
Copy link

I was looking into this and it seems these properties hold:

  • "\u0000\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003\u0003" is pretty valid as UTF-8, since UTF-8 does allow \u0000 to be represented directly.
  • \u0000 is not allowed for jsonb because it cannot be converted to text.

@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

The only temporary solution to encode it to a JSON, and then to Text and then detect the bad character sequence and reject the JSON object it if contains the bad sequence. I also wonder if \u0000 is the only sequence I need to reject,

@AlexITC
Copy link

AlexITC commented Sep 21, 2020

Couldn't you store two columns? the bytea, and the JSON when it doesn't contain the non-text character, so that anyone is free to use whatever they need.

@loverdos
Copy link

@erikd The \u0000 restriction is clearly stated here, along with all the other rules that Postgres follows. Bottom line: there are rejections and transformations going on.

To avoid a leaky abstraction, the exact values (as given by the user) must be stored. This roughly means either a blob (= no structure in the DB) or JSON string values that somehow preserve the original meaning: for instance base64-encoded string values in the DB (so recovering the structure at the expense of extra processing in the application in order to translate from base64 to the actual value).

erikd added a commit that referenced this issue Sep 21, 2020
TxMetadata is stored as JSON and that JSON is stored in a 'jsonb' column
of PostgreSQL. However, there are limitations to that Postgres 'jsonb'
data type, specifically, it cannot contain Uniciode NUL characters. This
temporary fix simply drops TxMetadata JSON objects that would otherwise
be rejected by Postgres. Hopefully a better solution will be will be
dreamt up and implemented later.

Temporary workaround fix for:
#297
erikd added a commit that referenced this issue Sep 21, 2020
TxMetadata is stored as JSON and that JSON is stored in a 'jsonb' column
in PostgreSQL. However, there are limitations to that Postgres 'jsonb'
data type. Specifically, it cannot contain Uniciode NUL characters. This
temporary fix simply drops TxMetadata JSON objects that would otherwise
be rejected by Postgres. Hopefully a better solution will be will be
dreamt up and implemented later.

Temporary workaround fix for:
#297
erikd added a commit that referenced this issue Sep 21, 2020
TxMetadata is stored as JSON and that JSON is stored in a 'jsonb' column
in PostgreSQL. However, there are limitations to that Postgres 'jsonb'
data type. Specifically, it cannot contain Uniciode NUL characters. This
temporary fix simply drops TxMetadata JSON objects that would otherwise
be rejected by Postgres. Hopefully a better solution will be will be
dreamt up and implemented later.

Temporary workaround fix for:
#297
erikd added a commit that referenced this issue Sep 21, 2020
TxMetadata is stored as JSON and that JSON is stored in a 'jsonb' column
in PostgreSQL. However, there are limitations to that Postgres 'jsonb'
data type. Specifically, it cannot contain Uniciode NUL characters. This
temporary fix simply drops TxMetadata JSON objects that would otherwise
be rejected by Postgres. Hopefully a better solution will be will be
dreamt up and implemented later.

Temporary workaround fix for:
#297
@erikd
Copy link
Contributor

erikd commented Sep 21, 2020

The temporary workaround fix (simply dropping metadata objects that cannot be inserted in Postgres) has been merged to master.

@ArturWieczorek
Copy link
Contributor

This how it looks now with temp fix:

[db-sync-node:Info:37] [2020-09-22 10:19:09.66 UTC] insertShelleyBlock: epoch 84, slot 6060000, block 1816721, hash 3ec15354c53deae4eb26a206cc3185f799e80bd09393f279bce7e53a7d633144
[db-sync-node:Warning:37] [2020-09-22 10:19:24.08 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Warning:37] [2020-09-22 10:19:24.08 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Warning:37] [2020-09-22 10:19:24.10 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Warning:37] [2020-09-22 10:19:24.10 UTC] insertTxMetadata: dropped due to a Unicode NUL character.
[db-sync-node:Info:37] [2020-09-22 10:19:24.93 UTC] insertShelleyBlock: epoch 84, slot 6065000, block 1816973, hash 0ab0dd5c36c6eb480b8bbd05508be952c5ed8597a4e422af25bd80a905a9368d

@ghost
Copy link

ghost commented Oct 16, 2020

Whether docker is repaired, and what is the repaired version

@erikd
Copy link
Contributor

erikd commented Oct 17, 2020

Its not fixed, but 5.0.x (but I would recommend 5.0.2) has a workaround that avoids this issue by dropping tx_metadata containing the Unicode NUL character.

@erikd
Copy link
Contributor

erikd commented Jun 11, 2021

Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants