Skip to content

SigV4 Auth Support for Catalog Federation - Part 1: Entity Transformation System #1899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

XJDKC
Copy link
Member

@XJDKC XJDKC commented Jun 16, 2025

Milestones

This is Part 1 of the [Splitting] Initial SigV4 Auth Support for Catalog Federation. Upcoming parts will build on this system:

Introduction

This PR introduces a general-purpose Entity Transformation System to support future features like injecting service identity metadata into catalog entities for federated access (e.g. AWS Glue using SigV4). It provides a clean and extensible hook mechanism for enriching entities during specific lifecycle phases.

This system is intended to serve as the foundation for service identity injection and other dynamic transformations in subsequent parts of the SigV4 support work.

Key Components

  • EntityTransformer: A functional interface representing a single transformation to be applied to a PolarisBaseEntity.
  • EntityTransformationEngine: A runtime engine that applies a sequence of EntityTransformers to an entity based on the configured TransformationPoint.
  • TransformationPoint: Enum representing defined hook points in the entity lifecycle (e.g. CATALOG_PRE_PERSIST).
  • @AppliesTo annotation: A repeatable qualifier that marks a transformer as applicable to a given TransformationPoint.
  • EntityTransformationConfiguration: Optional config-driven registration of transformers by identifier (preserving ordering).

Example Configuration:

polaris.entity-mutation.mutators=no-op,catalog-connection-config

Flowchart

Catalog Federation - Creds Management

@XJDKC XJDKC force-pushed the rxing-catalog-federation-sigv4-part-1 branch from f5218b3 to 6a896c1 Compare June 16, 2025 09:22
@snazy
Copy link
Member

snazy commented Jun 16, 2025

@XJDKC I'm a bit concerned about updating the representation of entities, because those represent the persisted state. Maybe an opportunity to abstract such concerns from persistence?

@XJDKC
Copy link
Member Author

XJDKC commented Jun 16, 2025

@XJDKC I'm a bit concerned about updating the representation of entities, because those represent the persisted state. Maybe an opportunity to abstract such concerns from persistence?

Hey @snazy, it’s definitely possible with this framework, we could define a new TransformationPoint like CATALOG_PRE_LOAD and dynamically apply a transformer to inject the service identity info when loading the Catalog Entity from the metastore.

I think different vendors may have different views on this. For example, it's quite reasonable to persist the service identity info, since it's very unlikely to change. We already persist service identity in the metastore for storage config, so it might make sense to follow a similar pattern here. Even if we choose not to persist the service identity directly on the Catalog Entity, we still need to persist the mapping between the entity and its assigned identity somewhere (e.g. In a ServiceIdentityRegistry I'm planning to add)

Changing the service identity for a Catalog Entity is risky, e.g., users might configure their IAM role’s trust policy to allowlist the specific service identity initially assigned to their federated Polaris catalog. If Polaris later switches to a different identity when accessing the Glue Catalog, that integration could break unexpectedly.

So perhaps the right approach is to allow the service vendor to decide whether the service identity should be persisted or injected dynamically. WDYT?

@eric-maynard
Copy link
Contributor

eric-maynard commented Jun 16, 2025

For better or worse, we currently use entities both for the persisted state of an object and for an in-memory representation of the same. I'm still a fan of separating this out (something like the DAO proposal from months back) but that a big refactor that I'm not sure I'd want to block SigV4.

@snazy perhaps we can view this as a step in the right direction as it encourages us to treat entities as immutable and to use discrete "transformations" to generate new entities instead of trying to modify them in place or to construct lots of new entities in an adhoc way throughout the code base. Eventually, entities should be truly immutable and this may become the canonical way to transform an entity into a new one.

I am, however, a little concerned about how heavyweight this process could become... if I just want to update an entity's version today, I can call a method like setGrantRecordsVersion. After this change, it looks like the expectation is that I have a transformation engine, and then I construct and execute a transformation to accomplish this change?

@XJDKC
Copy link
Member Author

XJDKC commented Jun 16, 2025

For better or worse, we currently use entities both for the persisted state of an object and for an in-memory representation of the same. I'm still a fan of separating this out (something like the DAO proposal from months back) but that a big refactor that I'm not sure I'd want to block SigV4.

@snazy perhaps we can view this as a step in the right direction as it encourages us to treat entities as immutable and to use discrete "transformations" to generate new entities instead of trying to modify them in place or to construct lots of new entities in an adhoc way throughout the code base. Eventually, entities should be truly immutable and this may become the canonical way to transform an entity into a new one.

I am, however, a little concerned about how heavyweight this process could become... if I just want to update an entity's version today, I can call a method like setGrantRecordsVersion. After this change, it looks like the expectation is that I have a transformation engine, and then I construct and execute a transformation to accomplish this change?

Yes, the Entity Transformation System I'm adding won't update the entity in place, it creates a copy and applies the modifications there.

What I want to point out is that, this system isn't meant to replace the current set methods used in the persistence layer. Its purpose is to provide hooks for applying transformations at specific points in the entity lifecycle.

It's admittedly a bit heavyweight, since each transformer creates a new copy of the entity. But without a builder-style abstraction (like what we have for generated model classes), we can't apply the transformation efficiently.

Given that we may eventually refactor this area to introduce a proper DAO layer, and that we're currently only using the transformation system for a single SigV4-related transformer, it might make sense to proceed with the current approach and revisit the refactor later. WDYT?

XJDKC

This comment was marked as duplicate.

Copy link
Member Author

@XJDKC XJDKC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link some comments from @dennishuo in this PR as well:

*
* @param transformationPoint The point in the entity lifecycle where transformers should be
* applied.
* @param entity The original Polaris entity to mutate.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too bad we don't have a way to express C++-style const correctness as we move towards immutable PolarisBaseEntity, but maybe we can at least mention here that the intention is that implementations should not mutate the input entity, but instead only create the new transformed entity.

@@ -477,6 +479,14 @@ private void revokeGrantRecord(

ms.persistStorageIntegrationIfNeededInCurrentTxn(callCtx, catalog, integration);

// CATALOG_PRE_PERSIST transformation point
EntityTransformationEngine entityTransformationEngine = callCtx.getEntityTransformationEngine();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, unfortunately we didn't quite manage to refactor all of the shared logic nryerrn AtomicOperationMetaStoreManager and TransactionalMetaStoreManagerImpl into a base class yet; the best we have is the prepareToPersistNewEntity hook defined in BaseMetaStoreManager.

This means either this logic needs to be added in both the AtomicOperationMetaStoreManager and TransactionalMetaStoreManagerImpl, or needs to be abstracted into BaseMetaStoreManager somehow while making sure both of those impls invoke the shared logic. persistNewEntity isn't specific to catalog-creation so you'd need an enum on entityType there if you just put this logic in there.

@@ -188,6 +188,9 @@ polaris.oidc.principal-roles-mapper.type=default
# polaris.storage.gcp.token=token
# polaris.storage.gcp.lifespan=PT1H

# Polaris Entity Transformation Config
polaris.entity-transformation.transformers=no-op
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the one hand, it's useful to be able to have this kind of generalized transformer/mutator hook, but if the hook is also strongly coupled to particular features, we don't want to require separate configuration of the info injection here vs the core feature (e.g. sigv4) somewhere else.

Especially if we end up having feature configurations for the core feature, then it's important that the feature itself auto-configures itself to be functional. In a way then the classes that define the sigv4 functionality should be what injects the mutator/transformer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rulin: That's a good point, perhaps we can consider making some transformers tied to the feature parameters rather than the config itself if they are part of the core functionality of the feature.

@snazy
Copy link
Member

snazy commented Jun 18, 2025

Given that we may eventually refactor this area to introduce a proper DAO layer, and that we're currently only using the transformation system for a single SigV4-related transformer, it might make sense to proceed with the current approach and revisit the refactor later. WDYT?

Honestly, I think the approach is too heavyweight and too broad to target a very specific problem. It will be very hard for people to realize (later, by accident) that some entity instance does not reflect the "real" state. That will cause confusion and likely hard to debug issues.

I propose to separate the relevant parts into APIs/implementations that deal with the problem/use-case at hand, and definitely outside of the metastore - storage credential handling doesn't belong there IMO.

@XJDKC
Copy link
Member Author

XJDKC commented Jun 25, 2025

Given that we may eventually refactor this area to introduce a proper DAO layer, and that we're currently only using the transformation system for a single SigV4-related transformer, it might make sense to proceed with the current approach and revisit the refactor later. WDYT?

Honestly, I think the approach is too heavyweight and too broad to target a very specific problem. It will be very hard for people to realize (later, by accident) that some entity instance does not reflect the "real" state. That will cause confusion and likely hard to debug issues.

I propose to separate the relevant parts into APIs/implementations that deal with the problem/use-case at hand, and definitely outside of the metastore - storage credential handling doesn't belong there IMO.

Agree with the concern about embedding this logic in the persistence layer, moving it outside does make sense in general as the final entity is different from the entity we pass to the metastore. That said, I want to point out a few things specific to the storage config use case.

For storage config, the service identity isn't assigned by Polaris itself, it's provided by users via the CreateCatalogRequest. That's why I have this proposal to fix this issue. Also, the storage integration is currently created within the persistence layer:
TransactionalMetaStoreManagerImpl.java#L972-L978.

That's why I was leaning toward assigning the service identity in the persistence layer. For certain vendors, we may need to store and retrieve the mapping between realm and service identity info from the database. Doing this lookup in the persistence layer allows us to keep everything within the same transaction when persisting the catalog entity. Moving this from the persistence layer may end up causing inconsistency.

Adding some hook points is also very useful in general, e.g. it provides the ability to do some patches if there are some bugs and we end up persisting some wrong data in database.

We can also put the service identity injection logic here:
PolarisAdminService.java#L730-L755

WDYT? @eric-maynard @snazy @dennishuo @dimas-b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants