set_type_codec() appears to assume a particular set_type_codec for the "char" datatype #617

zzzeek · 2020-09-16T17:07:05Z

asyncpg version: 0.21.0
PostgreSQL version: 11.8 fedora
Do you use a PostgreSQL SaaS? If so, which? Can you reproduce
the issue with a local PostgreSQL install?: N/A
Python version: 3.8.3
Platform: Fedora 31
Do you use pgbouncer?: no
Did you install asyncpg with pip?: yes
If you built asyncpg locally, which version of Cython did you use?: N/A
Can the issue be reproduced under both asyncio and
uvloop?: N/A

It appears that the implementation for set_type_codec() relies upon the results of the query TYPE_BY_NAME which itself is assumed to return a bytes value from the PostgreSQL "char" datatype.

I was previously unaware that PostgreSQL actually has two "char" variants bpchar and char, and in the documentation at https://magicstack.github.io/asyncpg/current/usage.html#type-conversion this is talking about the "bpchar" datatype. that's fine. However, when trying to normalize asyncpg's behavior against that of the psycopg2 and pg8000 drivers, both of which will give you back string for both of these types (we have determined this is also a bug in those drivers, as they fail to return arbirary bytes for such a datatype and likely was missed when they migrated to Python 3), I tried setting up a type_codec for "char" that would allow it to return strings:

    await conn.set_type_codec(
        "char",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=lambda value: value,
        format="text",
    )

that works, but when you do that, you no longer can use the set_type_codec method for anything else, because the behavior of the type is redefined outside of the assumptions made by is_scalar_type.

The example program below illustrates this failure when attempting to subsequently set up a codec for the JSONB datatype:

import asyncio
import json

import asyncpg


async def main(illustrate_bug):
    conn = await asyncpg.connect(
        user="scott", password="tiger", database="test"
    )

    if illustrate_bug:
        await conn.set_type_codec(
            "char",
            schema="pg_catalog",
            encoder=lambda value: value,
            decoder=lambda value: value,
            format="text",
        )

    await conn.set_type_codec(
        "jsonb",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=json.loads,
        format="text",
    )


print("no bug")
asyncio.run(main(False))


print("bug")
asyncio.run(main(True))

output:

no bug
bug
Traceback (most recent call last):
  File "test3.py", line 35, in <module>
    asyncio.run(main(True))
  File "/opt/python-3.8.3/lib/python3.8/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/opt/python-3.8.3/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "test3.py", line 21, in main
    await conn.set_type_codec(
  File "/home/classic/.venv3/lib/python3.8/site-packages/asyncpg/connection.py", line 991, in set_type_codec
    raise ValueError(
ValueError: cannot use custom codec on non-scalar type pg_catalog.jsonb

Since the "char" datatype is kind of an obscure construct, it's likely reasonable that asyncpg disallow setting up a type codec for this particular type, or perhaps it could emit a warning, but at the moment there doesn't seem to be documentation suggesting there are limitations on what kinds of type codecs can be constructed.

none of this is blocking us, just something we came across and I hope it's helpful to the asyncpg project. cc @fantix

The text was updated successfully, but these errors were encountered:

fantix · 2020-09-16T17:31:37Z

Thank you @zzzeek for the very comprehensive description here!

Putting on my asyncpg hat, I'm thinking now perhaps we should run the asyncpg built-in introspection queries with fixed codecs only, so that user-defined codecs won't affect the introspections. People could not only change how "char" is decoded, but also any other types. I just don't have ideas yet how that can be done, I'll try to propose a PR.

Fixes: MagicStack#617

elprans · 2020-09-17T18:38:16Z

I was previously unaware that PostgreSQL actually has two "char" variants bpchar and char,

bpchar is an internal name for "(b)lank (p)added character", which is char(n) in SQL. The other one is varchar(n), which is equivalent to text for all purposes of conversion. The docs refer to "char" and "varchar" as the standard names for those data types in SQL.

zzzeek · 2020-09-17T19:49:25Z

I was previously unaware that PostgreSQL actually has two "char" variants bpchar and char,

bpchar is an internal name for "(b)lank (p)added character", which is char(n) in SQL. The other one is varchar(n), which is equivalent to text for all purposes of conversion. The docs refer to "char" and "varchar" as the standard names for those data types in SQL.

So, the set_type_codec() method uses the internal names. It would help for clarity if these names were noted in the documentation, right in this chart, as well as a distinct row for the internal "char" datatype, which is also exposed by PostgreSQL in its information schema catalogs. This is one of those "everything is fine but I spent a lot of time pulling my hair out" kind of issues, I'm sure you know what I'm talking about.

here's a demo to illustrate the name "bpchar" as part of asyncpg's public API and that it is different from "char" (again credit to @fantix for showing me how to actually create a table with the internal "char" type):

import asyncio

import asyncpg


async def main():
    conn = await asyncpg.connect(
        user="scott", password="tiger", database="test"
    )

    await conn.execute(
        "CREATE TABLE IF NOT EXISTS test_char "
        '(sql_val char, internal_val "char")'
    )

    await conn.execute("DELETE FROM test_char")

    await conn.execute("INSERT INTO test_char VALUES ($1, $2)", "x", b"x")

    await conn.set_type_codec(
        "bpchar",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=lambda value: "im a bp char: %s" % (value,),
        format="text",
    )

    await conn.set_type_codec(
        "char",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=lambda value: "im an internal char: %s" % (value,),
        format="text",
    )

    rows = await conn.fetch("SELECT sql_val, internal_val FROM test_char")
    print(rows)


asyncio.run(main())

output:

$ python test3.py 
[<Record sql_val='im a bp char: x' internal_val='im an internal char: x'>]

elprans · 2020-09-17T21:28:37Z

Oh yeah, I totally forgot about the funny single-character "char" type in Postgres that is completely different from char! I can see how that may have led to some hair pulling :-)

This is definitely a bug. set_type_codec should correctly treat char as char(1) and not the weird "char".

I'm not sure it makes sense to include internal type names in the type chart given that Postgres itself doesn't document those names. Perhaps set_type_codec should have been written to refuse to take internal names, but that ship has sailed.

Currently, `Connection.set_type_codec()` only accepts type names as they appear in `pg_catalog.pg_type` and would refuse to handle a standard SQL spelling of a type like `character varying`. This is an oversight, as the internal type names aren't really supposed to be treated as public Postgres API. Additionally, for historical reasons, Postgres has a single-byte `"char"` type, which is distinct from both `varchar` and SQL `char`, which may lead to massive confusion if a user sets up a custom codec on it expecting to handle the `char(n)` type instead. Issue: #617.

elprans · 2020-09-18T00:58:37Z

@fantix, I submitted #619, which complements your fixes, but also creates a small conflict. Feel free to rebase onto that branch.

Fixes: MagicStack#617

Currently, `Connection.set_type_codec()` only accepts type names as they appear in `pg_catalog.pg_type` and would refuse to handle a standard SQL spelling of a type like `character varying`. This is an oversight, as the internal type names aren't really supposed to be treated as public Postgres API. Additionally, for historical reasons, Postgres has a single-byte `"char"` type, which is distinct from both `varchar` and SQL `char`, which may lead to massive confusion if a user sets up a custom codec on it expecting to handle the `char(n)` type instead. Issue: #617

Fixes: MagicStack#617

Fixes: #617

zzzeek mentioned this issue Sep 16, 2020

asyncpg introspection is broken by char codec sqlalchemy/sqlalchemy#5586

Closed

fantix added a commit to fantix/asyncpg that referenced this issue Sep 17, 2020

Disable custom data codec for internal introspection

c4cc89f

Fixes: MagicStack#617

fantix mentioned this issue Sep 17, 2020

Ignore custom data codec for internal introspection #618

Merged

elprans mentioned this issue Sep 18, 2020

Fix set_type_codec() to accept standard SQL type names #619

Merged

fantix added a commit to fantix/asyncpg that referenced this issue Sep 18, 2020

Disable custom data codec for internal introspection

821f279

Fixes: MagicStack#617

fantix added a commit to fantix/asyncpg that referenced this issue Sep 22, 2020

Disable custom data codec for internal introspection

a0ce948

Fixes: MagicStack#617

fantix added a commit to fantix/asyncpg that referenced this issue Sep 24, 2020

Ignore custom data codec for internal introspection

e41e1db

Fixes: MagicStack#617

fantix added a commit to fantix/asyncpg that referenced this issue Sep 24, 2020

Ignore custom data codec for internal introspection

c94d6fc

Fixes: MagicStack#617

elprans closed this as completed in #618 Sep 25, 2020

elprans pushed a commit that referenced this issue Sep 25, 2020

Ignore custom data codec for internal introspection (#618)

e064f59

Fixes: #617

elprans mentioned this issue Nov 26, 2020

Prefer SSL connections by default #660

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

set_type_codec() appears to assume a particular set_type_codec for the "char" datatype #617

set_type_codec() appears to assume a particular set_type_codec for the "char" datatype #617

zzzeek commented Sep 16, 2020

fantix commented Sep 16, 2020

elprans commented Sep 17, 2020

zzzeek commented Sep 17, 2020 •

edited

Loading

elprans commented Sep 17, 2020

elprans commented Sep 18, 2020

set_type_codec() appears to assume a particular set_type_codec for the "char" datatype #617

set_type_codec() appears to assume a particular set_type_codec for the "char" datatype #617

Comments

zzzeek commented Sep 16, 2020

fantix commented Sep 16, 2020

elprans commented Sep 17, 2020

zzzeek commented Sep 17, 2020 • edited Loading

elprans commented Sep 17, 2020

elprans commented Sep 18, 2020

zzzeek commented Sep 17, 2020 •

edited

Loading