Feat: Add support for Trino Iceberg tables #2129

Merged
izeigerman merged 3 commits into SQLMesh:main from erindru:issue-1998-trino-iceberg-support
Feb 20, 2024

Conversation

@erindru
Collaborator

@erindru erindru commented Feb 15, 2024

  • Add the Iceberg connector to the integration test Trino instance
  • Add a configuration to run the integration tests against the Iceberg connector
  • Add the ability to probe the type of a catalog within an EngineAdapter
  • Use this ability to change how table properties are generated in the Trino adapter so it can use partitioned_by for Hive tables and partitioning for Iceberg tables

ref: Issue #1998
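For context, the connector-specific naming this PR handles can be sketched as follows (a minimal sketch only; the helper name is hypothetical, not SQLMesh's actual adapter API):

```python
# Hypothetical helper illustrating the difference this PR accounts for:
# Trino's Hive connector expects `partitioned_by` in WITH (...) table
# properties, while the Iceberg connector expects `partitioning`.
def partition_property_name(catalog_type: str) -> str:
    return "partitioning" if catalog_type == "iceberg" else "partitioned_by"
```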

@CLAassistant

CLAassistant commented Feb 15, 2024

CLA assistant check
All committers have signed the CLA.

@erindru erindru changed the title [WIP] Add support for 'partitioning' property in Trino Iceberg tables Add support for 'partitioning' property in Trino Iceberg tables Feb 15, 2024
Collaborator Author

I refactored this slightly to make it take up less memory. Rather than having an entire Postgres instance per metastore, I changed it to a single Postgres instance with a database per metastore.

Collaborator

Awesome! 👍

Collaborator Author

I found this gateway definition was repeated in multiple places - here, and where the engine_adapter fixture was created.

So rather than repeating the logic to generate it, I made it its own fixture.

Collaborator

Nice!

Collaborator Author

I'm not super familiar with sqlglot, is this a valid approach?

Collaborator

@georgesittas Can you confirm the best approach here?

Contributor

I'm not quite sure - what SQL code is this supposed to generate? Is there a way we can add a test to ensure we generate the correct DDL statement?

One observation is that instead of this="partitioning", we may need to do something like this=exp.var("partitioning") to have an expression in this vs a string (which is usually the case for Property expressions IIRC).

Collaborator Author

@georgesittas I added some tests in test_trino.py to demonstrate the generated SQL.

The tl;dr is that the generated SQL should be identical to Hive's, with the exception of the partitioned_by property being called partitioning instead.
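A hedged illustration of that tl;dr with simplified DDL strings (these are examples, not the exact output asserted in test_trino.py):

```python
# Simplified DDL showing the only intended difference between the two
# connectors: the name of the partitioning table property.
hive_ddl = "CREATE TABLE s.t (ds VARCHAR) WITH (partitioned_by=ARRAY['ds'])"
iceberg_ddl = "CREATE TABLE s.t (ds VARCHAR) WITH (partitioning=ARRAY['ds'])"

assert iceberg_ddl == hive_ddl.replace("partitioned_by", "partitioning")
```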

Contributor

Looks good, thanks 👍

Collaborator

@eakmanrq eakmanrq left a comment

This is amazing and a highly requested feature! Thanks for all the attention to detail you put into this.

One thing that is missing is documentation. Can you update the Trino documentation to include a note about Iceberg support?

Also, from reviewing this PR it seems like the code currently assumes a single catalog is used when connecting to Trino. For example, it always checks "current_catalog" to determine the connector, but the object being created could actually be in a different catalog and therefore use a different connector. If that is correct, I think that constraint is fine, but it should also be documented.

Collaborator

Is it expected that we could end up with multiple catalogs in the response? I'm wondering if instead we should raise if we get multiple, and otherwise do seq_get(connector_name, 0) or self.DEFAULT_CATALOG_TYPE (seq_get is part of sqlglot).

Collaborator Author

Catalogs in Trino need to have unique names or Trino will fail to start, so there should only ever be a single row returned.

fetchone() returns a single row as a tuple, so the length check verifies that there was a value in the first column, connector_name; if there is, that value gets returned.

If no records are returned, we just return the default.
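The logic described above can be sketched in plain Python (names are illustrative, not the adapter's actual code):

```python
# Illustrative sketch of handling a cursor.fetchone() result.
DEFAULT_CATALOG_TYPE = "hive"

def catalog_type_from_row(row):
    # fetchone() yields a tuple like ("iceberg",), or None when the query
    # matched no catalog; fall back to the default in the latter case.
    if row and row[0]:
        return row[0]
    return DEFAULT_CATALOG_TYPE
```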

Collaborator

Ah, I see now. You could then use seq_get(connector_name, 0) or self.DEFAULT_CATALOG_TYPE, or this is fine too.

Collaborator Author

I like that, it's more concise. I've added it.

Collaborator

Nice!

Collaborator

It seems like we should raise UnsupportedCatalogOperationError here if self.CATALOG_SUPPORT.is_unsupported. It is strange, for example, for MySQL to return a mysql current catalog when it doesn't support catalogs at all.
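A sketch of the suggested guard, reusing only the names mentioned in this review; SQLMesh's actual CatalogSupport enum and error class may differ:

```python
from enum import Enum

class CatalogSupport(Enum):
    # Illustrative members; the real enum may have different values.
    UNSUPPORTED = "unsupported"
    FULL_SUPPORT = "full_support"

    @property
    def is_unsupported(self) -> bool:
        return self is CatalogSupport.UNSUPPORTED

class UnsupportedCatalogOperationError(Exception):
    pass

def current_catalog(support: CatalogSupport, fetch_current):
    # e.g. MySQL: don't report a "current catalog" it can't actually use.
    if support.is_unsupported:
        raise UnsupportedCatalogOperationError("catalogs are not supported")
    return fetch_current()
```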

Collaborator Author

Fair call, it didn't even occur to me that catalogs might not be supported by some databases


@erindru
Collaborator Author

erindru commented Feb 16, 2024

Thanks for your review @eakmanrq, I'll make some tweaks based on the feedback.

One thing that is missing is documentation. Can you update the Trino documentation to include a note about Iceberg support?

Actually, I wasn't intending to commit you guys to supporting Iceberg until I was sure it was working myself :) This PR was an initial implementation, but I'm expecting other issues to crop up, so I wasn't going to state it was "supported" just yet.

However, I can update the docs if you're ok with that.

Also, from reviewing this PR it seems like the code currently assumes a single catalog is used when connecting to Trino. For example, it always checks "current_catalog" to determine the connector, but the object being created could actually be in a different catalog and therefore use a different connector. If that is correct, I think that constraint is fine, but it should also be documented.

You're absolutely right, I just wanted to gauge feedback before spending too much time making it flexible. _build_table_properties_exp doesn't have a parameter containing the table/catalog being built, and I was trying to avoid changing method signatures across all the adapters in the first instance. I'll revisit this.

@eakmanrq
Collaborator

Actually, I wasn't intending to commit you guys to supporting Iceberg until I was sure it was working myself :) This PR was an initial implementation, but I'm expecting other issues to crop up, so I wasn't going to state it was "supported" just yet.

We currently define supported as being able to pass all the integration tests. Since that is the case with your change, I think it would be correct to consider it supported.

@erindru erindru force-pushed the issue-1998-trino-iceberg-support branch from 3527990 to 9b38b2f Compare February 17, 2024 00:34
@erindru
Collaborator Author

erindru commented Feb 17, 2024

We currently define supported as being able to pass all the integration tests. Since that is the case with your change, I think it would be correct to consider it supported.

Cool, works for me! I've added a section to the docs.

Comment on lines 152 to 154
Contributor

Suggested change:

```diff
- property: exp.Property
- property = exp.PartitionedByProperty(
-     this=exp.Schema(expressions=partitioned_by),
- )
+ property: exp.Property = exp.PartitionedByProperty(
+     this=exp.Schema(expressions=partitioned_by),
+ )
```

@georgesittas
Contributor

The PR looks good to me, nice work @erindru! As I'm not very familiar with some of the details here, could you also take another look, @eakmanrq?

@erindru erindru changed the title Add support for 'partitioning' property in Trino Iceberg tables Feat: Add support for 'partitioning' property in Trino Iceberg tables Feb 19, 2024
@erindru erindru changed the title Feat: Add support for 'partitioning' property in Trino Iceberg tables Feat: Add support for Trino Iceberg tables Feb 19, 2024
- Use the catalog of the table instead of the catalog of the connection when probing the catalog type
- Add Trino unit tests showing the Iceberg code paths and generated SQL
- Add section on Iceberg to the Trino docs section
@erindru erindru force-pushed the issue-1998-trino-iceberg-support branch from b76ba6d to 32a56cc Compare February 19, 2024 19:05
Collaborator

@eakmanrq eakmanrq left a comment

Looks great! Thanks @erindru!

Feel free to merge if you're ready.

@erindru
Collaborator Author

erindru commented Feb 20, 2024

Thanks! I can't merge it, GitHub shows the following message:

Only those with write access to this repository can merge pull requests.

@izeigerman izeigerman merged commit 2fe2c71 into SQLMesh:main Feb 20, 2024
@erindru erindru deleted the issue-1998-trino-iceberg-support branch February 20, 2024 20:18