-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-27473: Rewrite MetaStoreClients to be composable #5771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-27473: Rewrite MetaStoreClients to be composable #5771
Conversation
The content in Google Doc is perfectly aligned with my memory. I'm still reading the proposed changes. I'd appreciate it if the latest dependencies between MetaStoreClients somewhere were written somewhere |
ql/src/java/org/apache/hadoop/hive/ql/metadata/client/HiveMetaStoreClientWithLocalCache.java
Outdated
Show resolved
Hide resolved
...rc/java/org/apache/hadoop/hive/ql/metadata/client/HiveMetaStoreClientWithSessionFeature.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/hadoop/hive/metastore/client/NoopHiveMetaStoreClientDelegator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreClientWithTmpTable.java
Outdated
Show resolved
Hide resolved
@okumin, I have added an Appendix section to the design document, which summarizes the dependencies between MetaStoreClients as well as which classes in the PR codebase provide specific features compared to the master branch. I'm not entirely sure if this aligns with your request, so please let me know if it doesn't address what you were looking for. |
@deniskuzZ, Thank you for looking over the PR. I think the suggested class names make sense, and I'll update the PR accordingly. Regarding |
import java.util.Map; | ||
|
||
public class NoopHiveMetaStoreClientDelegator implements IMetaStoreClient { | ||
// effectively final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it could. I changed it to final.
CacheKey cacheKey = new CacheKey(KeyType.TABLE, req); | ||
Table r = (Table) mscLocalCache.getIfPresent(cacheKey); | ||
if (r == null) { | ||
r = getDelegate().getTable(req); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use this.delegate instead of getDelegate().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, since we cannot remove the public accessor getDelegate()
due to HiveMetaStoreClient
, it would be better to keep using getDelegate()
in inherited classes as well as HiveMetaStoreClient
. I'll check HiveMetaStoreClient
again and remove the public accessor if we can safely decouple it from underlying client.
this.delegate = delegate; | ||
} | ||
|
||
public IMetaStoreClient getDelegate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should this be available outside of this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. I made it possible to set and access delegate
externally because it was necessary during the early implementation. Since this PR code is directly come from the PoC stage, the code is quite rough, but the next commit will likely be an improved version that addresses your feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, HiveMetaStoreClient
needs the public accessor. I didn't investigate the code deeply yet, and I'll check whether we can decouple HiveMetaStoreCleint
from its delegate and finally remove this public accessor.
return new GetPartitionsByNamesResult(result); | ||
private static IMetaStoreClient createUnderlyingClient(Configuration conf, HiveMetaHookLoader hookLoader, | ||
Boolean allowEmbedded) throws MetaException { | ||
IMetaStoreClient thriftClient = new ThriftHiveMetaStoreClient(conf, allowEmbedded); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have ThriftHiveMetaStoreClient here it means we can not reuse the functionality of this class with other than ThriftHiveMetaStoreClient e.g. RESTClient for Iceberg REST catalog can not reuse the functionality of ql/HiveMetaStoreClientWithLocalCache. This should be an input as delegate I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't consider custom catalogs in this PR as it is not a consideration of HIVE-27473. In my opinion, HIVE-12679 and HIVE-28658 could address custom catalog issues based on the separated HMSClient layers introduced in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the dropTable-related methods, and they should work. I also think removing the zig-zag dependencies helps our maintainability of the Hive project.
Instead, we have to maintain lots of @Override
in any classes, and third-party developers also have to do it. It could be a negative point we want to resolve someday.
public void createTable(Table tbl) throws MetaException, NoSuchObjectException, TException { | ||
CreateTableRequest request = new CreateTableRequest(tbl); | ||
createTable(request); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also need createTable(Table tbl, EnvironmentContext envContext)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add createTable(Table tbl, EnvironmentContext envContext)
since it is not a method of IMetaStoreClient
.
It seems that only the test code uses this method, so I think we can safely skip this method by modifying the test to use createTable(CreateTableRequest request)
, which is a method of IMetaStoreClient
and can be used to pass both Table
and EnvironmentContext
.
@Override | ||
public void createTable(CreateTableRequest request) | ||
throws MetaException, NoSuchObjectException, TException { | ||
Table tbl = request.getTable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be more consistent if we set the default catalog name when it's empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that your suggestion makes the code more consistent.
However, I have a question about the use of the default catalog and where the responsibility for setting it should lie. Currently, we use getDefaultCatalog()
in many places to fill in empty catalog names. In my opinion, there should ideally be a single point of responsibility, or at least as few classes as possible, where the default catalog is defined and applied.
I also think we should minimize the behavior of proxy classes whenever possible. Rather than setting the default catalog within proxy classes, it might be better to pass null and let the underlying class, or even HMSHandler
, handle it.
I'm not entirely sure this is a better approach, and I'm fine with continuing the current strategy of applying getDefaultCatalog()
wherever needed. It would be helpful if you could share your opinion on this issue.
ignoreUnknownTable); | ||
// SG:FIXME, introduce IMetaStoreClient.dropTable :: DropTableRequest -> void | ||
// We need a new method that takes catName and checks tbl.isSetTxnId() | ||
// Or does TxN stuff meaningful only if we are in default catalog? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to invoke the hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have enclosed the dropTable
call with the hook. I'll check this method again as well as the noted issue when I revisit HookMetaStoreClientProxy
.
e620366
to
147329d
Compare
@okumin , @deniskuzZ , @zratkai, thank you for your review. I've updated the branch to address your feedback and improve code quality. The main changes are summarized as follows:
There are still several remaining issues, including |
I found Hive 4.0.0 has around 20 deprecated methods. Are we allowed to discontinue them before HIVE-27473? |
Can we raise the proposal to the mailing list and tie-break the best approaches at this point? People in #4444 should be very interested. Then, we can invest our resources in reviewing this pull request. The Option 2 of my original document might make integration clean. We will add a small change to Apache Hive, and third-party vendors implement the minimal number of methods. However, this approach doesn't allow existing integrations to run as they are. If we can make Option 1 clean and the community people agree with it, I think many people would be happy. Additionally, the HiveMetaStoreClient family with complicated dependencies will be decomposed! |
I think removing the deprecated methods would be helpful for simplifying the work in HIVE-27473. However, I'm not entirely sure about the standard process for removing deprecated methods in Hive. If their removal is acceptable at this stage, I would be supportive of proceeding in that direction and able to work for it. |
Sure, it would be nice to make an intermediate design decision at this point. |
I know Glue Data Catalog depends on the unmerged HIVE-12679, and I can infer that Databricks also uses or has used it from their release notes. Keeping the compatibility might help those users(it is also possible that they prefer a cleaner interface). I will involve them once we start the discussion. |
I happened to find Apache Ranger also relies on IMetaStoreClient. I believe HIVE-27473 does not impact the feature, but I am leaving a comment nonetheless. |
@okumin, I understand now. Thank you for the explanation. |
i think we can drop the deprecated methods in 4.1 and right now review and mark methods that we want to discontinue |
@deniskuzZ, I have updated the PR according to your renaming requests below. Regarding I hope these changes don't break anything. I'll check and fix anything I might have missed in the morning.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, pending tests
@ngsg, how about also we can kill the hive-exec
|
...ommon/src/main/java/org/apache/hadoop/hive/metastore/client/SynchronizedMetaStoreClient.java
Outdated
Show resolved
Hide resolved
I'll do that, as there remains minor work.
I think we can even drop Regarding the failed test cases, I couldn't reproduce them in my local environment, and I believe they're likely unrelated to the changes.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I retriggered CI
The change makes sense given the current code, but not all clients were synchronized previously, were they? |
Yes, if I understand correctly, client using the embedded metastore didn't support synchronized access. |
@@ -6006,11 +6009,21 @@ public HiveMetaHook getHook( | |||
} | |||
}; | |||
|
|||
IMetaStoreClient thriftClient = ThriftHiveMetaStoreClient.newClient(conf, allowEmbedded); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we refactor with a builder
IMetaStoreClient client = new MetaStoreClientBuilder(conf)
.newThriftClient(allowEmbedded)
.withLocalCache()
.withSessionWrapper()
.withHooks(hookLoader)
.withRetry()
.withSynchronizedWrapper()
.build();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but I think handling retry
needs special cares as we create it using reflection. Let me refactor it and update the PR, soon.
|
Design document: Google docs
What changes were proposed in this pull request?
This patch refactors
HiveMetaStoreClient
and its subclasses in composable classes.Why are the changes needed?
By making
SessionHiveMetaStoreClient
composable, this refactoring enables users to create and use custom Catalogs with Hive's session-level features, such as temporary tables.Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
This patch is currently in draft state, so no tests have been added yet.