Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,sql-parser: parse CREATE SOURCES #6299

Merged
merged 1 commit into from Apr 7, 2021

Conversation

JLDLaughlin
Copy link
Contributor

@JLDLaughlin JLDLaughlin commented Apr 1, 2021

Parses statements like:

 CREATE SOURCES FROM POSTGRES
    HOST        '...'
    PUBLICATION '...'
    NAMESPACE   '...'
    TABLES (...)

where each of the upstream sources shares a HOST, PUBLICATION, and
NAMESPACE, but has its own table name, alias, and optional column
definitions. Leaves planning and sequencing unimplemented.


This change is Reviewable

@JLDLaughlin JLDLaughlin force-pushed the create-sources branch 2 times, most recently from 0ad202e to 3d8c5c0 Compare April 1, 2021 17:31
@benesch
Copy link
Member

benesch commented Apr 1, 2021

Gah, sorry to throw a wrench in things here, but can we workshop the syntax a bit? Repeating the connection string and publication name will be a chore, and also won't give us the I believe desirable property that the connection string and publication name are the same across all the sources?

I was envisioning something less general purpose but more UX-focused, like:

CREATE SOURCES FROM POSTGRES
  HOST 'host=kanto user=ash password=teamrocket dbname=pokemon'
  PUBLICATION 'red'
  NAMESPACE 'generation1'
  [TABLES (
      <pg_name> [AS [[<mz_db>.]<mz_schema>.]<mz_table>] (<table_def>),
      ...
  )]

If there are other folks (yourself included!) who have strong opinions on the subject, then probably a design doc is warranted!

@JLDLaughlin JLDLaughlin marked this pull request as draft April 2, 2021 17:37
@JLDLaughlin JLDLaughlin force-pushed the create-sources branch 3 times, most recently from 5c92203 to 4cbdae5 Compare April 2, 2021 18:47
@JLDLaughlin JLDLaughlin marked this pull request as ready for review April 2, 2021 19:12
@JLDLaughlin JLDLaughlin requested review from petrosagg and benesch and removed request for petrosagg April 2, 2021 19:12
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PgTable<T: AstInfo> {
/// The name of the table to sync
pub name: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be changed to UnresolvedObjectName. We are parsing the column defs like a normal create table statement, I think we should parse the table name similarly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just taking a look now, that makes sense to me! Fixing. Thanks @mjibson !

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might have some difficulties with

but you can greatly simplify that whole query:

format!("SELECT {}::regclass::oid", table) (using params here didn't work in cockroach, but it might work in postgres. But the blah::regclass::oid thing is the magic that will get you the OID of the table you want).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hit the same problem when I tried to implement your suggestion in the other PR. Do you know if you can use parameters with this type of query? If not I'd personally stick with the original query

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petrosagg would you want to switch back to avoid potential SQL injection? or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we parse the name as a valid ObjectName then it should be ok, but if we accept it as a string literal I think we should use the long form query

table: &mut PgTable<Raw>,
) -> Result<(), anyhow::Error> {
let fetched_columns =
postgres_util::fetch_columns(conn, namespace, &table.name.to_ast_string()).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct anymore. table.name could be something like db.schema.table which will be converted to a single string which will break the query in fetch_columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Sorry, was pushing too quickly to take advantage of CI. I think I added a sufficient coverage for this in the new pg-cdc.td test cases.

@maddyblue
Copy link
Contributor

maddyblue commented Apr 2, 2021

Uh I'm rethinking my suggestion about UnresolvedObjectName because it's more difficult than I realized. I'd do it in a followup PR.

@JLDLaughlin
Copy link
Contributor Author

JLDLaughlin commented Apr 2, 2021

Edited: I think the changes I just pushed correctly swap over to UnresolvedObjectName, but please let me know if there's something I'm missing! Happy to switch back.

@JLDLaughlin JLDLaughlin force-pushed the create-sources branch 4 times, most recently from aa35f6a to d45031e Compare April 3, 2021 00:02
Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow review! Took me a while to spin up. Left you comments within!

table: Option<PgTable<T>>,
/// The tables to sync, if multiple tables
/// Will be set for `CREATE SOURCES`, but not `CREATE SOURCE`
tables: Option<Vec<PgTable<T>>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like something that wants to be an enum?

enum PostgresConnectorTableSpec<T> {
   Table(PgTable<T>),
   Tables(Vec<PgTable<T>>),
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, thinking about it more, despite the duplication I think it will be safer if you create an entirely new fork of the AST for the multiple sources case.

enum Connector<T> {
    // ...
    Postgres {
        conn: String,
        publication: String,
        table: String,
        columns: Vec<ColumnDef<T>>,
    }
}

enum MultiConnector {
    Postgres {
        conn: String,
        publication: String,
        tables: Vec<PgTable<T>>,
    }
}

pub struct PgTable<T: AstInfo> {
     pub name: UnresolvedObjectName,
     pub alias: Option<T::ObjectName>,
     pub columns: Vec<ColumnDef<T>>,
 }

That way in the planner you won't have to be switching on table vs tables. It also makes me a little nervous when valid ASTs don't serialize to valid SQL; e.g., with the current design, you can construct an AST that renders to "CREATE SOURCE ... TABLES (...)" which is not actually valid SQL.

}
if let Some(tables) = tables {
for table in tables.iter_mut() {
purify_postgres_connector(conn, table).await?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this case will ever get hit because this is within a CreateSource branch, and this can only happen with CREATE SOURCES?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This ended up being part of the reason I proposed creating two forks of the AST.)

if let Some(tables) = tables {
for table in tables {
visit_table_columns(&mut normalizer, table)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is part of the reason I'm a bit sketched out by the AST in its current form; you'll never actually hit the tables case here!)

.strip_prefix("\"")
.unwrap_or_else(|| &table)
.strip_suffix("\"")
.unwrap_or_else(|| &table)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, eek, this is going to mishandle table names that have single quotes in them, and possible identifiers with other sorts of special characters. I think the original query was definitely better here—sorry!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I see what's going on here. The idea with using unresolved names is that you'd no longer need the NAMESPACE option in CREATE SOURCE(S). Instead the namespace should come from the part before the dot; so

CREATE SOURCE blah FROM POSTGRES HOST 'foo:5432' TABLES (
    public.one AS renamed
    public.two AS renamed2
    weird_schema.three
)

So then here you'd require that table be an object name of two components, and pass the first component into the original query as the namespace and the second component into the original query as the table name.

@JLDLaughlin JLDLaughlin force-pushed the create-sources branch 3 times, most recently from 85d2d30 to 3b7f96e Compare April 7, 2021 00:55
Parses statements like:
	CREATE SOURCES FROM POSTGRES
	  HOST        '...'
	  PUBLICATION '...'
	  NAMESPACE   '...'
	  TABLES (...)
where each of the upstream sources shares a HOST, PUBLICATION, and
NAMESPACE, but has its own table name, alias, and optional column
definitions. Leaves planning and sequencing unimplemented.
@JLDLaughlin
Copy link
Contributor Author

I've made the suggested parsing/purifying/AST changes, but I punted on the UnresolvedObjectName for now! I can run back through and make that change for the single Postgres connector and the MultiConnector in another PR.

Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM!

@JLDLaughlin JLDLaughlin merged commit 630eef6 into MaterializeInc:main Apr 7, 2021
@JLDLaughlin JLDLaughlin deleted the create-sources branch April 7, 2021 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants