Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Name mangling in the SQL catalog results in ambiguity #6591

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Closed

Name mangling in the SQL catalog results in ambiguity #6591

monetdb-team opened this issue Nov 30, 2020 · 0 comments

Comments

@monetdb-team
Copy link

@monetdb-team monetdb-team commented Nov 30, 2020

Date: 2018-04-25 16:30:44 +0200
From: @eyalroz
To: SQL devs <>
Version: 11.29.7 (Mar2018-SP1)
CC: @aris-koning, hknief, james, @kutsurak, m.raasveldt, martin.van.dinther, @njnes

Last updated: 2019-04-30 12:36:02 +0200

Comment 26408

Date: 2018-04-25 16:30:44 +0200
From: @eyalroz

User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0
Build Identifier:

tl;dr: We should use an invalid character to concatenate names within sql_catalog_nme.

MonetDB persists an "SQL catalog", which maps SQL schema-table-column triplets to BAT ids within the pool. It's implemented using two BATs with fixed logical names sql_catalog_nme and sql_catalog_bid - a name column and a BAT id column.

Since only one string column is used, the triplet of strings we actually want to search for is mangled into a single string, by concatenating the names with underscore separators.

Example: If we're looking for schema "sys", table "ontime", column "flightdata", we'll actually look for "sys_ontime_flightdata" within "sql_catalog_nme", find it at a certain index in sql_catalog_nme, then get the BAT id from the corresponding index in sql_catalog_bid.

This approach is problematic, since schema, table and column names can themselves contain underscores. There is currently no escaping or underscore or special treatment for name collisions.

Example mclient session:

sql>create table ab ( cd_ef INTEGER );
operation successful
sql>create table ab_cd ( ef INTEGER );
CREATE TABLE: sys_ab_cd_ef conflicts

It is, of course, perfectly valid to create an SQL table "ab" with column "cd_ef" and table ab_cd with column "ef".

The problem is even more severe when trying to perform the reverse lookup, i.e. trying to determine the SQL name triplet (schema, table, column) for a BAT (whose ID appears in sql_catalog_bid). Any presence of underscores in any of the three names creates an ambiguity when we wish to demangle. While MonetDB doesn't do this itself, it is not inconceivable to want to do this in forensic analysis.

Proposed solution:

Make the mangling unambiguous, by using a separator which cannot appear in any of the names. AFAICT, schema, table and column names all use the following grammar rules::

SQL language identifier> ::=

[ { | }... ]

::=

::=

|

We could use something like a printing character such as ., , @ or %. I prefer the former - since that's what's often used intuitively, anyway.

Reproducible: Always

Comment 26409

Date: 2018-04-25 16:49:32 +0200
From: @sjoerdmullender

Removed my name: I don't need two copies.

Comment 26423

Date: 2018-05-09 08:57:12 +0200
From: @njnes

quoted identifiers allow all your alternatives, ie not that easy to fix

Comment 26622

Date: 2018-09-24 02:14:25 +0200
From: James Watmuff <>

quoted identifiers allow all your alternatives, ie not that easy to fix

Instead of using a different separator character, could we just escape the separator character ('_') in the filename if it occurs within an identifier? Escaping could be achieved by repeating the separator character twice.

Examples:
(sys, ab, cd) => sys_ab_cd
(sys, ab, cd_ef) => sys_ab_cd__ef
(sys, ab_cd, ef) => sys_ab__cd_ef
(sys, ab__cd, ef) => sys_ab____cd_ef

Comment 26623

Date: 2018-09-24 10:29:33 +0200
From: @sjoerdmullender

(In reply to James Watmuff from comment 3)

quoted identifiers allow all your alternatives, ie not that easy to fix

Instead of using a different separator character, could we just escape the
separator character ('_') in the filename if it occurs within an identifier?
Escaping could be achieved by repeating the separator character twice.

Examples:
(sys, ab, cd) => sys_ab_cd
(sys, ab, cd_ef) => sys_ab_cd__ef
(sys, ab_cd, ef) => sys_ab__cd_ef
(sys, ab__cd, ef) => sys_ab____cd_ef

Sorry, that's still ambiguous. If you have a sequence of three underscores (because one of the constituent names starts or ends with an underscore) how would you group them?

We did come up with a potential solution to this problem, though. It will take time to implement (not because it's particularly difficult, just because there is plenty of other work): use various control characters instead of underscore (and in some other internally generated names). I'm assuming that control characters cannot be specified at the SQL level, so you cannot have names that contain those, so they should be good and unambiguous for internal use.

Comment 26624

Date: 2018-09-24 11:13:15 +0200
From: @eyalroz

(In reply to Sjoerd Mullender from comment 4)

Indeed, James Watmuff's suggestion is ambiguous

A control-character-based solution would work, obviously; but an unambiguous and possibly more readable escaping scheme can also be chosen. For example: The sequence X could be used, with X representing characters which would not used directly, such as a dot, an underscore and possibly others. This has the benefit of being printable, readable, and safe for copy-pasting as text; and it would not clash with backslash-based or brackets-based schemes. Of course, whatever works is fine.

I should also mention that, even for forensics, the inability to determine exact column names is not critical and can be worked around: One can locate the system tables in sql_catalog_nme as-is; and using these tables, determine the full names. This is what MonetDB itself does, I suppose, and it's what I now do with the MonetDB BBP reader library (https://bitbucket.org/eyalroz/monetdb-bbp-reader).

Comment 26625

Date: 2018-09-24 11:43:08 +0200
From: @sjoerdmullender

(In reply to Eyal Rozenberg from comment 5)

(In reply to Sjoerd Mullender from comment 4)

Indeed, James Watmuff's suggestion is ambiguous

A control-character-based solution would work, obviously; but an unambiguous
and possibly more readable escaping scheme can also be chosen. For example:
The sequence X could be used, with X representing characters which would
not used directly, such as a dot, an underscore and possibly others. This
has the benefit of being printable, readable, and safe for copy-pasting as
text; and it would not clash with backslash-based or brackets-based schemes.
Of course, whatever works is fine.

Sorry, doesn't work either. As long as you can have a schema/table/column name that can contain the sequence that is used to glue those bits together, things are ambiguous. What could work if whenever such a sequence occurs inside one of those part, that sequence is somehow escaped. For instance, use . as glue, and whenever there is an underscore in the schema/table/column name, replace it with a double _ (__). Then when you're reading the glued-together name from left to right, whenever you encounter an underscore, the following character determines whether it's a separator (.) or part of the name ().

Comment 26626

Date: 2018-09-24 12:14:45 +0200
From: @eyalroz

(In reply to Sjoerd Mullender from comment 6)

Sorry, doesn't work either. As long as you can have a schema/table/column
name that can contain the sequence that is used to glue those bits together,
things are ambiguous.

It would work, I believe, since you also escape all underscores. So underscores would only ever be used as an escape character.

Comment 26627

Date: 2018-09-25 01:19:08 +0200
From: James Watmuff <>

Oops - yes my original suggestion was obviously flawed.

It just occurred to me that escaping SQL identifiers is already a solved problem using double quotes.

So what about just "schema"."table"."column", where (") inside an identifier is escaped as ("").

Comment 26701

Date: 2018-11-28 17:54:29 +0100
From: Herman Knief <>

I have recently been evaluating a number of engines for use with an existing platform/application. I've really liked what I have seen with MonetDB so far, but have run into this issue. Given the pre-existence of the code I have to work with and number of deploy instances, changing the schema is not a good option. Would love to see this get a little higher priority for a fix so that I could continue. Thanks.

Comment 26859

Date: 2019-01-30 11:00:56 +0100
From: @njnes

The bats within the logger are now found using an id. This solves any name
conflict introduced by the mangling. Solution will be part of the April
release.

Comment 26860

Date: 2019-01-30 11:04:23 +0100
From: MonetDB Mercurial Repository <>

Changeset eb19fe85e464 made by Niels Nes niels@cwi.nl in the MonetDB repo, refers to this bug.

For complete details, see https//devmonetdborg/hg/MonetDB?cmd=changeset;node=eb19fe85e464

Changeset description:

fix bug #6591, bats within the logger are found using the sqlid, which solves
any naming conflict introduced by the name mangling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant