Skip to content

HIVE-20189: Separate metastore client code into its own module #5924

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

deniskuzZ
Copy link
Member

@deniskuzZ deniskuzZ commented Jul 3, 2025

What changes were proposed in this pull request?

move client classes into it's own module

Why are the changes needed?

improve the structure of standalone-metastore classes

Does this PR introduce any user-facing change?

No

How was this patch tested?

jenkins

<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-standalone-metastore-common</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new module depends on the metastore-common, which doesn't reduce the size or dependencies from metastore-common, not sure if it's ok, for the user what will benefit from the new module?

Copy link
Member Author

@deniskuzZ deniskuzZ Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's expected for a client module to depend on a common module. The client module delivers core client functionality and can be enhanced later with features like caching.
Offering a 'client' JAR aligns with common conventions and is expected to be user-friendly.
ATM we have 2 distinct cache wrappers in different Hive modules, just because there was no structure:

  • org.apache.hadoop.hive.metastore.HiveClientCache
  • org.apache.iceberg.hive.CachedClientPool

Copy link
Member Author

@deniskuzZ deniskuzZ Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another point I don't understand, why ql and beeline modules depend on metastore-server? I think we should drop the dependency on server and move the classes into the metastore common or client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically agree that we can potentially improve the structure in the future. We will have obvious guidelines about how to organize files.

  • metastore-client: Client-specific files, which Server doesn't need
  • metastore-server: Server-specific files, which Client doesn't need
  • metastore-common: Common modules

another point I don't understand, why ql and beeline modules depend on metastore-server? I think we should drop the dependency on server and move the classes into the metastore common or client.

I guess ql requires metastore-server to use an embedded HMS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't embedded HMS only used in tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it isn't. It can be used when metastore.thrift.uris is empty. For example, our Hive docker image(not HMS docker image) can set up a HiveServer2 without HMS. It probably uses the embedded mode that runs HMS-equivalent threads in HS2.

/**
* Check if metastore is being used in embedded mode.
* This utility function exists so that the logic for determining the mode is same
* in HiveConf and HiveMetaStoreClient
* @param msUri - metastore server uri
* @return true if the metastore is embedded
*/
public static boolean isEmbeddedMetaStore(String msUri) {
return (msUri == null) || msUri.trim().isEmpty();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we can say it is for testing purposes and separate the classes from hive-exec.

@deniskuzZ
Copy link
Member Author

deniskuzZ commented Jul 11, 2025

I've kept the package unchanged for IMetaStoreClient, HiveMetaStoreClient, RetryingMetaStoreClient. Ideally, they should be moved under the client package.
@dengzhhu653, @ngsg, @okumin, WDYT should we refactor here (not a small PR already) or in a follow-up?

Copy link
Contributor

@ngsg ngsg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch looks reasonable to me. I quickly checked whether there are any remaining candidates that could be moved to standalone-metastore-client, and it seems that all client-only-related classes have been gathered into the client package. I'm also in favor of the new HiveMetaStoreClientBuilder class.

I left a few minor comments where I had questions. Also, it seems that the CI failed due to the unification of HiveClientCache. Could you update the TestPassProperties#testSequenceTableWriteReadMR accordingly?


Regarding IMetaStoreClient, HiveMetaStoreClient, and RetryingMetaStoreClient, I'm leaning towards refactoring them in a follow-up PR, as the large refactoring might obscure the existing changes.

@@ -41,7 +41,7 @@
import org.apache.hadoop.hive.metastore.api.PartitionsByExprRequest;
import org.apache.hadoop.hive.metastore.api.PartitionsStatsRequest;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.hadoop.hive.metastore.client.HiveMetaStoreClientUtils;
import org.apache.hadoop.hive.metastore.client.utils.HiveMetaStoreClientUtils;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order of import might be swapped as follows:

import org.apache.hadoop.hive.metastore.client.MetaStoreClientWrapper;
import org.apache.hadoop.hive.metastore.client.utils.HiveMetaStoreClientUtils;

@@ -92,7 +92,7 @@
import org.apache.hadoop.hive.metastore.api.TableValidWriteIds;
import org.apache.hadoop.hive.metastore.api.UniqueConstraintsRequest;
import org.apache.hadoop.hive.metastore.api.UniqueConstraintsResponse;
import org.apache.hadoop.hive.metastore.client.HiveMetaStoreClientUtils;
import org.apache.hadoop.hive.metastore.client.utils.HiveMetaStoreClientUtils;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same import order issue here.

@@ -65,15 +67,21 @@ public class RetryingMetaStoreClient implements InvocationHandler {
private final UserGroupInformation ugi;
private final int retryLimit;
private final long retryDelaySeconds;
private final ConcurrentHashMap<String, Long> metaCallTimeMap;
private final Map<String, Long> metaCallTimeMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about removing all uses of ConcurrentHashMap from this file? It seems that only one method uses ConcurrentHashMap, and the actual value passed to that method is always null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the tab size is set to 4 inside Iterator { }. Could we restore it to 2?

import org.apache.hadoop.hive.metastore.HiveMetaHookLoader;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.MetaException;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove the empty line and sort the import list?

Copy link


public HiveMetaStoreClientBuilder(Configuration conf) {
Preconditions.checkNotNull(conf);
this.conf = conf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify it with one-line like this.conf = Preconditions.checkNotNull(conf). I prefer Objects.requireNonNull for new code, though.

@@ -107,7 +107,7 @@ public HiveClientCache(final int timeout) {
/**
* @param timeout the length of time in seconds after a client is created that it should be automatically removed
*/
private HiveClientCache(final int timeout, final int initialCapacity, final int maxCapacity, final boolean enableStats) {
protected HiveClientCache(final int timeout, final int initialCapacity, final int maxCapacity, final boolean enableStats) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it simpler to make this public and directly instantiate HiveClientCache in HCatUtil?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants