Skip to content

HIVE-27473: Rewrite MetaStoreClients to be composable #5771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ngsg
Copy link
Contributor

@ngsg ngsg commented Apr 14, 2025

Design document: Google docs

What changes were proposed in this pull request?

This patch refactors HiveMetaStoreClient and its subclasses in composable classes.

Why are the changes needed?

By making SessionHiveMetaStoreClient composable, this refactoring enables users to create and use custom Catalogs with Hive's session-level features, such as temporary tables.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

This patch is currently in draft state, so no tests have been added yet.

@okumin
Copy link
Contributor

okumin commented Apr 14, 2025

The content in Google Doc is perfectly aligned with my memory. I'm still reading the proposed changes. I'd appreciate it if the latest dependencies between MetaStoreClients somewhere were written somewhere

@ngsg
Copy link
Contributor Author

ngsg commented Apr 15, 2025

@okumin, I have added an Appendix section to the design document, which summarizes the dependencies between MetaStoreClients as well as which classes in the PR codebase provide specific features compared to the master branch. I'm not entirely sure if this aligns with your request, so please let me know if it doesn't address what you were looking for.

@ngsg
Copy link
Contributor Author

ngsg commented Apr 15, 2025

@deniskuzZ, Thank you for looking over the PR. I think the suggested class names make sense, and I'll update the PR accordingly.

Regarding HiveMetaStoreClientWithTmpTable, my initial thought was that we should separate features as much as possible to allow custom catalog implementations to selectively include only what they need. However, I have reconsidered my thought and now believe that the features originating from SessionHiveMetaStoreClient and HiveMetaStoreClientWithLocalCache should always wrap the custom catalog. As described in the figure 2 of the design document, the next commit will include four layers (Hook, Session, LocalCache, Thrift) instead of five (Hook, TmpTable, Session, LocalCache, Thrift).

import java.util.Map;

public class NoopHiveMetaStoreClientDelegator implements IMetaStoreClient {
// effectively final
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it could. I changed it to final.

CacheKey cacheKey = new CacheKey(KeyType.TABLE, req);
Table r = (Table) mscLocalCache.getIfPresent(cacheKey);
if (r == null) {
r = getDelegate().getTable(req);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use this.delegate instead of getDelegate().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, since we cannot remove the public accessor getDelegate() due to HiveMetaStoreClient, it would be better to keep using getDelegate() in inherited classes as well as HiveMetaStoreClient. I'll check HiveMetaStoreClient again and remove the public accessor if we can safely decouple it from underlying client.

this.delegate = delegate;
}

public IMetaStoreClient getDelegate() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should this be available outside of this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review. I made it possible to set and access delegate externally because it was necessary during the early implementation. Since this PR code is directly come from the PoC stage, the code is quite rough, but the next commit will likely be an improved version that addresses your feedback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, HiveMetaStoreClient needs the public accessor. I didn't investigate the code deeply yet, and I'll check whether we can decouple HiveMetaStoreCleint from its delegate and finally remove this public accessor.

return new GetPartitionsByNamesResult(result);
private static IMetaStoreClient createUnderlyingClient(Configuration conf, HiveMetaHookLoader hookLoader,
Boolean allowEmbedded) throws MetaException {
IMetaStoreClient thriftClient = new ThriftHiveMetaStoreClient(conf, allowEmbedded);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have ThriftHiveMetaStoreClient here it means we can not reuse the functionality of this class with other than ThriftHiveMetaStoreClient e.g. RESTClient for Iceberg REST catalog can not reuse the functionality of ql/HiveMetaStoreClientWithLocalCache. This should be an input as delegate I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't consider custom catalogs in this PR as it is not a consideration of HIVE-27473. In my opinion, HIVE-12679 and HIVE-28658 could address custom catalog issues based on the separated HMSClient layers introduced in this PR.

Copy link
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the dropTable-related methods, and they should work. I also think removing the zig-zag dependencies helps our maintainability of the Hive project.
Instead, we have to maintain lots of @Override in any classes, and third-party developers also have to do it. It could be a negative point we want to resolve someday.

public void createTable(Table tbl) throws MetaException, NoSuchObjectException, TException {
CreateTableRequest request = new CreateTableRequest(tbl);
createTable(request);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may also need createTable(Table tbl, EnvironmentContext envContext)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add createTable(Table tbl, EnvironmentContext envContext) since it is not a method of IMetaStoreClient.

It seems that only the test code uses this method, so I think we can safely skip this method by modifying the test to use createTable(CreateTableRequest request), which is a method of IMetaStoreClient and can be used to pass both Table and EnvironmentContext.

@Override
public void createTable(CreateTableRequest request)
throws MetaException, NoSuchObjectException, TException {
Table tbl = request.getTable();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be more consistent if we set the default catalog name when it's empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that your suggestion makes the code more consistent.

However, I have a question about the use of the default catalog and where the responsibility for setting it should lie. Currently, we use getDefaultCatalog() in many places to fill in empty catalog names. In my opinion, there should ideally be a single point of responsibility, or at least as few classes as possible, where the default catalog is defined and applied.

I also think we should minimize the behavior of proxy classes whenever possible. Rather than setting the default catalog within proxy classes, it might be better to pass null and let the underlying class, or even HMSHandler, handle it.

I'm not entirely sure this is a better approach, and I'm fine with continuing the current strategy of applying getDefaultCatalog() wherever needed. It would be helpful if you could share your opinion on this issue.

ignoreUnknownTable);
// SG:FIXME, introduce IMetaStoreClient.dropTable :: DropTableRequest -> void
// We need a new method that takes catName and checks tbl.isSetTxnId()
// Or does TxN stuff meaningful only if we are in default catalog?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to invoke the hook.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have enclosed the dropTable call with the hook. I'll check this method again as well as the noted issue when I revisit HookMetaStoreClientProxy.

@ngsg ngsg force-pushed the HIVE-27473-Rewrite-MetaStoreClients-to-be-composable branch from e620366 to 147329d Compare April 22, 2025 11:43
@ngsg
Copy link
Contributor Author

ngsg commented Apr 22, 2025

@okumin , @deniskuzZ , @zratkai, thank you for your review. I've updated the branch to address your feedback and improve code quality. The main changes are summarized as follows:

  • Renamed the new classes
  • Integrated the temporary table layer and session feature layer
  • Updated {Session/LocalCaching}MetaStoreClientProxy to handle ValidTxnWriteIdList properly
  • Modified SessionMetaStoreClientProxy to set processorCapability in advance, ensuring the cached key matches the actual request
  • Changed the access level of TempTable from package-private to public, in order to place all proxy classes under the same directory

There are still several remaining issues, including SG:FIXME markers and some unchecked areas. In the latest commit, I focused on SessionMetaStoreClientProxy and plan to move on to the other proxy classes. I’d appreciate it if you could review the changes and share any considerations I may have missed.

@okumin
Copy link
Contributor

okumin commented Apr 23, 2025

@okumin
Copy link
Contributor

okumin commented Apr 23, 2025

Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request.

The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are.

If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed!

@ngsg
Copy link
Contributor Author

ngsg commented Apr 24, 2025

I found Hive 4.0.0 has around 20 deprecated methods. Are we allowed to discontinue them before HIVE-27473?

I think removing the deprecated methods would be helpful for simplifying the work in HIVE-27473. However, I'm not entirely sure about the standard process for removing deprecated methods in Hive. If their removal is acceptable at this stage, I would be supportive of proceeding in that direction and able to work for it.

@ngsg
Copy link
Contributor Author

ngsg commented Apr 24, 2025

Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request.

The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are.

If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed!

Sure, it would be nice to make an intermediate design decision at this point.
Before raising the proposal on the mailing list, I have a small question about your comment: when you mentioned “existing integrations,” were you referring to integration tests, or to external integrations such as third-party catalogs? I’d like to make sure I fully understand the impact you're referring to, especially in terms of potential behavioral changes.

@okumin
Copy link
Contributor

okumin commented Apr 28, 2025

@ngsg

when you mentioned “existing integrations,” were you referring to integration tests, or to external integrations such as third-party catalogs?

I know Glue Data Catalog depends on the unmerged HIVE-12679, and I can infer that Databricks also uses or has used it from their release notes. Keeping the compatibility might help those users(it is also possible that they prefer a cleaner interface). I will involve them once we start the discussion.

@okumin
Copy link
Contributor

okumin commented Apr 28, 2025

I happened to find Apache Ranger also relies on IMetaStoreClient. I believe HIVE-27473 does not impact the feature, but I am leaving a comment nonetheless.

https://github.com/apache/ranger/blob/master/hive-agent/src/main/java/org/apache/ranger/authorization/hive/authorizer/RangerHiveAuthorizer.java

@ngsg
Copy link
Contributor Author

ngsg commented Apr 28, 2025

@okumin, I understand now. Thank you for the explanation.
I’ve also started a discussion on the mailing list. It would be great if you could take a look and help enrich it with your insights.
Thanks again for your support.

@deniskuzZ
Copy link
Member

i think we can drop the deprecated methods in 4.1 and right now review and mark methods that we want to discontinue

@ngsg
Copy link
Contributor Author

ngsg commented Jul 5, 2025

@deniskuzZ, I have updated the PR according to your renaming requests below. Regarding SynchronizedMetaStoreClient, since there are now two classes with that name in Hive.java, I decided to use FQN for the class formerly called as SynchronizedMetaStoreClientProxy, as it comes from a different package.

I hope these changes don't break anything. I'll check and fix anything I might have missed in the morning.

  • BaseMetaStoreClientProxy -> MetaStoreClientProxy
  • HookEnabledMetaStoreClientProxy -> HookEnabledMetaStoreClient
  • SynchronizedMetaStoreClientProxy -> SynchronizedMetaStoreClient

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, pending tests

@deniskuzZ
Copy link
Member

deniskuzZ commented Jul 5, 2025

@deniskuzZ, I have updated the PR according to your renaming requests below. Regarding SynchronizedMetaStoreClient, since there are now two classes with that name in Hive.java, I decided to use FQN for the class formerly called as SynchronizedMetaStoreClientProxy, as it comes from a different package.

I hope these changes don't break anything. I'll check and fix anything I might have missed in the morning.

  • BaseMetaStoreClientProxy -> MetaStoreClientProxy

@ngsg, how about [Base]MetaStoreClientProxy -> MetaStoreClientWrapper? i think that is the best fit here. Note, I can rename in the next PR.

also we can kill the hive-exec SynchronizedMetaStoreClient and replace with metastore.client.SynchronizedMetaStoreClient

private IMetaStoreClient syncMetaStoreClient;
....
@LimitedPrivate(value = {"Hive"})
public synchronized IMetaStoreClient getSynchronizedMSC() throws MetaException {
  if (syncMetaStoreClient == null) {
    syncMetaStoreClient = SynchronizedMetaStoreClient.newClient(getConf(), getMSC(true, false));
  }
  return syncMetaStoreClient;
}

@ngsg
Copy link
Contributor Author

ngsg commented Jul 6, 2025

@ngsg, how about [Base]MetaStoreClientProxy -> MetaStoreClientWrapper? i think that is the best fit here. Note, I can rename in the next PR.

I'll do that, as there remains minor work.


also we can kill the hive-exec SynchronizedMetaStoreClient and replace with metastore.client.SynchronizedMetaStoreClient

private IMetaStoreClient syncMetaStoreClient;
....
@LimitedPrivate(value = {"Hive"})
public synchronized IMetaStoreClient getSynchronizedMSC() throws MetaException {
  if (syncMetaStoreClient == null) {
    syncMetaStoreClient = SynchronizedMetaStoreClient.newClient(getConf(), getMSC(true, false));
  }
  return syncMetaStoreClient;
}

I think we can even drop getSynchronizedMSC, since there is no unsynchronized MetaStoreClient; now all MetaStoreClients are wrapped by metastore.client.SynchronizedMetaStoreClient. I'm not sure why we introduced two redundant synchronization mechanism in HiveMetaStoreClient.newSynchronizedClient (HIVE-1899) and hive-exec SynchronizedMetaStoreClient (HIVE-14204), but I believe that we can unify them into a single class. I'm OK with going with any weaker changes as well, but I would like to hear your thoughts on the proposed commit.
(The commit can be found in here.)


Regarding the failed test cases, I couldn't reproduce them in my local environment, and I believe they're likely unrelated to the changes.

mvn test -pl itests/qtest-iceberg,ql -Pitests -Dtest.groups= -Dtest=TestReplicationMetricCollector#testSuccessIncrLoadMetrics,TestIcebergLlapLocalCompactorCliDriver  -Dqfile=iceberg_major_compaction_partitioned.q

Copy link
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I retriggered CI

@deniskuzZ
Copy link
Member

I think we can even drop getSynchronizedMSC, since there is no unsynchronized MetaStoreClient; now all MetaStoreClients are wrapped by metastore.client.SynchronizedMetaStoreClient. I'm not sure why we introduced two redundant synchronization mechanism in HiveMetaStoreClient.newSynchronizedClient (HIVE-1899) and hive-exec SynchronizedMetaStoreClient (HIVE-14204), but I believe that we can unify them into a single class. I'm OK with going with any weaker changes as well, but I would like to hear your thoughts on the proposed commit.
(The commit can be found in here.)

The change makes sense given the current code, but not all clients were synchronized previously, were they?

@ngsg
Copy link
Contributor Author

ngsg commented Jul 6, 2025

I think we can even drop getSynchronizedMSC, since there is no unsynchronized MetaStoreClient; now all MetaStoreClients are wrapped by metastore.client.SynchronizedMetaStoreClient. I'm not sure why we introduced two redundant synchronization mechanism in HiveMetaStoreClient.newSynchronizedClient (HIVE-1899) and hive-exec SynchronizedMetaStoreClient (HIVE-14204), but I believe that we can unify them into a single class. I'm OK with going with any weaker changes as well, but I would like to hear your thoughts on the proposed commit.
(The commit can be found in here.)

The change makes sense given the current code, but not all clients were synchronized previously, were they?

Yes, if I understand correctly, client using the embedded metastore didn't support synchronized access.

@@ -6006,11 +6009,21 @@ public HiveMetaHook getHook(
}
};

IMetaStoreClient thriftClient = ThriftHiveMetaStoreClient.newClient(conf, allowEmbedded);
Copy link
Member

@deniskuzZ deniskuzZ Jul 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we refactor with a builder

IMetaStoreClient client = new MetaStoreClientBuilder(conf)
    .newThriftClient(allowEmbedded)
    .withLocalCache()
    .withSessionWrapper()
    .withHooks(hookLoader)
    .withRetry()
    .withSynchronizedWrapper()
    .build();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I think handling retry needs special cares as we create it using reflection. Let me refactor it and update the PR, soon.

@deniskuzZ
Copy link
Member

@ngsg, if MetaStoreClientBuilder ask takes time, let's include this and merge

@ngsg
Copy link
Contributor Author

ngsg commented Jul 6, 2025

@ngsg, if MetaStoreClientBuilder ask takes time, let's include this and merge

OK. I wrote MetaStoreClientBuilder, but I'm not fully confident about it yet. Let me address the requested changes on this and merge it.

@deniskuzZ
Copy link
Member

@ngsg, if MetaStoreClientBuilder ask takes time, let's include this and merge

OK. I wrote MetaStoreClientBuilder, but I'm not fully confident about it yet. Let me address the requested changes on this and merge it.

I'll merge the PR with a green build

Copy link

sonarqubecloud bot commented Jul 6, 2025

@deniskuzZ deniskuzZ merged commit 92f9966 into apache:master Jul 6, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants