Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set_type_codec() appears to assume a particular set_type_codec for the "char" datatype #617

Closed
zzzeek opened this issue Sep 16, 2020 · 5 comments · Fixed by #618
Closed

Comments

@zzzeek
Copy link

zzzeek commented Sep 16, 2020

  • asyncpg version: 0.21.0

  • PostgreSQL version: 11.8 fedora

  • Do you use a PostgreSQL SaaS? If so, which? Can you reproduce
    the issue with a local PostgreSQL install?
    : N/A

  • Python version: 3.8.3

  • Platform: Fedora 31

  • Do you use pgbouncer?: no

  • Did you install asyncpg with pip?: yes

  • If you built asyncpg locally, which version of Cython did you use?: N/A

  • Can the issue be reproduced under both asyncio and
    uvloop?
    : N/A

It appears that the implementation for set_type_codec() relies upon the results of the query TYPE_BY_NAME which itself is assumed to return a bytes value from the PostgreSQL "char" datatype.

I was previously unaware that PostgreSQL actually has two "char" variants bpchar and char, and in the documentation at https://magicstack.github.io/asyncpg/current/usage.html#type-conversion this is talking about the "bpchar" datatype. that's fine. However, when trying to normalize asyncpg's behavior against that of the psycopg2 and pg8000 drivers, both of which will give you back string for both of these types (we have determined this is also a bug in those drivers, as they fail to return arbirary bytes for such a datatype and likely was missed when they migrated to Python 3), I tried setting up a type_codec for "char" that would allow it to return strings:

    await conn.set_type_codec(
        "char",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=lambda value: value,
        format="text",
    )

that works, but when you do that, you no longer can use the set_type_codec method for anything else, because the behavior of the type is redefined outside of the assumptions made by is_scalar_type.

The example program below illustrates this failure when attempting to subsequently set up a codec for the JSONB datatype:

import asyncio
import json

import asyncpg


async def main(illustrate_bug):
    conn = await asyncpg.connect(
        user="scott", password="tiger", database="test"
    )

    if illustrate_bug:
        await conn.set_type_codec(
            "char",
            schema="pg_catalog",
            encoder=lambda value: value,
            decoder=lambda value: value,
            format="text",
        )

    await conn.set_type_codec(
        "jsonb",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=json.loads,
        format="text",
    )


print("no bug")
asyncio.run(main(False))


print("bug")
asyncio.run(main(True))

output:

no bug
bug
Traceback (most recent call last):
  File "test3.py", line 35, in <module>
    asyncio.run(main(True))
  File "/opt/python-3.8.3/lib/python3.8/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/opt/python-3.8.3/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "test3.py", line 21, in main
    await conn.set_type_codec(
  File "/home/classic/.venv3/lib/python3.8/site-packages/asyncpg/connection.py", line 991, in set_type_codec
    raise ValueError(
ValueError: cannot use custom codec on non-scalar type pg_catalog.jsonb

Since the "char" datatype is kind of an obscure construct, it's likely reasonable that asyncpg disallow setting up a type codec for this particular type, or perhaps it could emit a warning, but at the moment there doesn't seem to be documentation suggesting there are limitations on what kinds of type codecs can be constructed.

none of this is blocking us, just something we came across and I hope it's helpful to the asyncpg project. cc @fantix

@fantix
Copy link
Member

fantix commented Sep 16, 2020

Thank you @zzzeek for the very comprehensive description here!

Putting on my asyncpg hat, I'm thinking now perhaps we should run the asyncpg built-in introspection queries with fixed codecs only, so that user-defined codecs won't affect the introspections. People could not only change how "char" is decoded, but also any other types. I just don't have ideas yet how that can be done, I'll try to propose a PR.

@elprans
Copy link
Member

elprans commented Sep 17, 2020

I was previously unaware that PostgreSQL actually has two "char" variants bpchar and char,

bpchar is an internal name for "(b)lank (p)added character", which is char(n) in SQL. The other one is varchar(n), which is equivalent to text for all purposes of conversion. The docs refer to "char" and "varchar" as the standard names for those data types in SQL.

@zzzeek
Copy link
Author

zzzeek commented Sep 17, 2020

I was previously unaware that PostgreSQL actually has two "char" variants bpchar and char,

bpchar is an internal name for "(b)lank (p)added character", which is char(n) in SQL. The other one is varchar(n), which is equivalent to text for all purposes of conversion. The docs refer to "char" and "varchar" as the standard names for those data types in SQL.

So, the set_type_codec() method uses the internal names. It would help for clarity if these names were noted in the documentation, right in this chart, as well as a distinct row for the internal "char" datatype, which is also exposed by PostgreSQL in its information schema catalogs. This is one of those "everything is fine but I spent a lot of time pulling my hair out" kind of issues, I'm sure you know what I'm talking about.

here's a demo to illustrate the name "bpchar" as part of asyncpg's public API and that it is different from "char" (again credit to @fantix for showing me how to actually create a table with the internal "char" type):

import asyncio

import asyncpg


async def main():
    conn = await asyncpg.connect(
        user="scott", password="tiger", database="test"
    )

    await conn.execute(
        "CREATE TABLE IF NOT EXISTS test_char "
        '(sql_val char, internal_val "char")'
    )

    await conn.execute("DELETE FROM test_char")

    await conn.execute("INSERT INTO test_char VALUES ($1, $2)", "x", b"x")

    await conn.set_type_codec(
        "bpchar",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=lambda value: "im a bp char: %s" % (value,),
        format="text",
    )

    await conn.set_type_codec(
        "char",
        schema="pg_catalog",
        encoder=lambda value: value,
        decoder=lambda value: "im an internal char: %s" % (value,),
        format="text",
    )

    rows = await conn.fetch("SELECT sql_val, internal_val FROM test_char")
    print(rows)


asyncio.run(main())

output:

$ python test3.py 
[<Record sql_val='im a bp char: x' internal_val='im an internal char: x'>]

@elprans
Copy link
Member

elprans commented Sep 17, 2020

Oh yeah, I totally forgot about the funny single-character "char" type in Postgres that is completely different from char! I can see how that may have led to some hair pulling :-)

This is definitely a bug. set_type_codec should correctly treat char as char(1) and not the weird "char".

I'm not sure it makes sense to include internal type names in the type chart given that Postgres itself doesn't document those names. Perhaps set_type_codec should have been written to refuse to take internal names, but that ship has sailed.

elprans added a commit that referenced this issue Sep 18, 2020
Currently, `Connection.set_type_codec()` only accepts type names as they
appear in `pg_catalog.pg_type` and would refuse to handle a standard SQL
spelling of a type like `character varying`.  This is an oversight, as
the internal type names aren't really supposed to be treated as public
Postgres API.  Additionally, for historical reasons, Postgres has a
single-byte `"char"` type, which is distinct from both `varchar` and
SQL `char`, which may lead to massive confusion if a user sets up a
custom codec on it expecting to handle the `char(n)` type instead.

Issue: #617.
elprans added a commit that referenced this issue Sep 18, 2020
Currently, `Connection.set_type_codec()` only accepts type names as they
appear in `pg_catalog.pg_type` and would refuse to handle a standard SQL
spelling of a type like `character varying`.  This is an oversight, as
the internal type names aren't really supposed to be treated as public
Postgres API.  Additionally, for historical reasons, Postgres has a
single-byte `"char"` type, which is distinct from both `varchar` and
SQL `char`, which may lead to massive confusion if a user sets up a
custom codec on it expecting to handle the `char(n)` type instead.

Issue: #617.
@elprans
Copy link
Member

elprans commented Sep 18, 2020

@fantix, I submitted #619, which complements your fixes, but also creates a small conflict. Feel free to rebase onto that branch.

fantix added a commit to fantix/asyncpg that referenced this issue Sep 18, 2020
elprans added a commit that referenced this issue Sep 22, 2020
Currently, `Connection.set_type_codec()` only accepts type names as they
appear in `pg_catalog.pg_type` and would refuse to handle a standard SQL
spelling of a type like `character varying`.  This is an oversight, as
the internal type names aren't really supposed to be treated as public
Postgres API.  Additionally, for historical reasons, Postgres has a
single-byte `"char"` type, which is distinct from both `varchar` and
SQL `char`, which may lead to massive confusion if a user sets up a
custom codec on it expecting to handle the `char(n)` type instead.

Issue: #617
fantix added a commit to fantix/asyncpg that referenced this issue Sep 22, 2020
fantix added a commit to fantix/asyncpg that referenced this issue Sep 24, 2020
fantix added a commit to fantix/asyncpg that referenced this issue Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants