Name mangling in the SQL catalog results in ambiguity #6591
Last updated: 2019-04-30 12:36:02 +0200
Date: 2018-04-25 16:30:44 +0200
User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0
tl;dr: We should use an invalid character to concatenate names within sql_catalog_nme.
MonetDB persists an "SQL catalog", which maps SQL schema-table-column triplets to BAT ids within the pool. It's implemented using two BATs with fixed logical names sql_catalog_nme and sql_catalog_bid - a name column and a BAT id column.
Since only one string column is used, the triplet of strings we actually want to search for is mangled into a single string, by concatenating the names with underscore separators.
Example: If we're looking for schema "sys", table "ontime", column "flightdata", we'll actually look for "sys_ontime_flightdata" within "sql_catalog_nme", find it at a certain index in sql_catalog_nme, then get the BAT id from the corresponding index in sql_catalog_bid.
This approach is problematic, since schema, table and column names can themselves contain underscores. There is currently no escaping or underscore or special treatment for name collisions.
Example mclient session:
sql>create table ab ( cd_ef INTEGER );
It is, of course, perfectly valid to create an SQL table "ab" with column "cd_ef" and table ab_cd with column "ef".
The problem is even more severe when trying to perform the reverse lookup, i.e. trying to determine the SQL name triplet (schema, table, column) for a BAT (whose ID appears in sql_catalog_bid). Any presence of underscores in any of the three names creates an ambiguity when we wish to demangle. While MonetDB doesn't do this itself, it is not inconceivable to want to do this in forensic analysis.
Make the mangling unambiguous, by using a separator which cannot appear in any of the names. AFAICT, schema, table and column names all use the following grammar rules::
SQL language identifier> ::=
We could use something like a printing character such as ., , @ or %. I prefer the former - since that's what's often used intuitively, anyway.
Date: 2018-04-25 16:49:32 +0200
Removed my name: I don't need two copies.
Date: 2018-05-09 08:57:12 +0200
quoted identifiers allow all your alternatives, ie not that easy to fix
Date: 2018-09-24 02:14:25 +0200
Instead of using a different separator character, could we just escape the separator character ('_') in the filename if it occurs within an identifier? Escaping could be achieved by repeating the separator character twice.
Date: 2018-09-24 10:29:33 +0200
(In reply to James Watmuff from comment 3)
Sorry, that's still ambiguous. If you have a sequence of three underscores (because one of the constituent names starts or ends with an underscore) how would you group them?
We did come up with a potential solution to this problem, though. It will take time to implement (not because it's particularly difficult, just because there is plenty of other work): use various control characters instead of underscore (and in some other internally generated names). I'm assuming that control characters cannot be specified at the SQL level, so you cannot have names that contain those, so they should be good and unambiguous for internal use.
Date: 2018-09-24 11:13:15 +0200
(In reply to Sjoerd Mullender from comment 4)
Indeed, James Watmuff's suggestion is ambiguous
A control-character-based solution would work, obviously; but an unambiguous and possibly more readable escaping scheme can also be chosen. For example: The sequence X could be used, with X representing characters which would not used directly, such as a dot, an underscore and possibly others. This has the benefit of being printable, readable, and safe for copy-pasting as text; and it would not clash with backslash-based or brackets-based schemes. Of course, whatever works is fine.
I should also mention that, even for forensics, the inability to determine exact column names is not critical and can be worked around: One can locate the system tables in sql_catalog_nme as-is; and using these tables, determine the full names. This is what MonetDB itself does, I suppose, and it's what I now do with the MonetDB BBP reader library (https://bitbucket.org/eyalroz/monetdb-bbp-reader).
Date: 2018-09-24 11:43:08 +0200
(In reply to Eyal Rozenberg from comment 5)
Sorry, doesn't work either. As long as you can have a schema/table/column name that can contain the sequence that is used to glue those bits together, things are ambiguous. What could work if whenever such a sequence occurs inside one of those part, that sequence is somehow escaped. For instance, use . as glue, and whenever there is an underscore in the schema/table/column name, replace it with a double _ (__). Then when you're reading the glued-together name from left to right, whenever you encounter an underscore, the following character determines whether it's a separator (.) or part of the name ().
Date: 2018-09-24 12:14:45 +0200
(In reply to Sjoerd Mullender from comment 6)
It would work, I believe, since you also escape all underscores. So underscores would only ever be used as an escape character.
Date: 2018-09-25 01:19:08 +0200
Oops - yes my original suggestion was obviously flawed.
It just occurred to me that escaping SQL identifiers is already a solved problem using double quotes.
So what about just "schema"."table"."column", where (") inside an identifier is escaped as ("").
Date: 2018-11-28 17:54:29 +0100
I have recently been evaluating a number of engines for use with an existing platform/application. I've really liked what I have seen with MonetDB so far, but have run into this issue. Given the pre-existence of the code I have to work with and number of deploy instances, changing the schema is not a good option. Would love to see this get a little higher priority for a fix so that I could continue. Thanks.
Date: 2019-01-30 11:00:56 +0100
The bats within the logger are now found using an id. This solves any name
Date: 2019-01-30 11:04:23 +0100
For complete details, see https//devmonetdborg/hg/MonetDB?cmd=changeset;node=eb19fe85e464
The text was updated successfully, but these errors were encountered: