Hive table metadata are not refreshed #62869
Please show me the details of step 3, "Add new data to HDFS". Does it mean adding a new partition, insert-overwriting an existing partition, or just creating a new file directly under an existing partition directory in HDFS? The table metadata cache is only refreshed when the partitions queried from the Hive metastore are newer than the cached ones.
New files are added directly to HDFS under an existing partition (I believe that's the correct terminology?). Here is the external Hive table DDL:
The output of the Spark job is stored in Parquet format as follows.
When running the same query in Superset, I get the correct number of rows, even when a new file is added to the path.
@xmb2 do you use
We judge whether the metadata cache needs to be refreshed by comparing the partitions returned from the Hive metastore. AFAIK, adding files directly on HDFS under existing partitions does not change any information in the Hive metastore, and it is not a use case we had in mind when designing the Hive engine. We expect users to update Hive tables through SQL rather than through lower-level operations on HDFS. That makes sense because the Hive metastore should be the single source of truth for how many partitions a Hive table has and what their current statuses are.
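The refresh rule described above can be sketched roughly as follows. This is an illustrative Python model, not ClickHouse's actual implementation; the `PartitionInfo` type and `needs_refresh` function are hypothetical names. The key point it demonstrates: files added directly on HDFS leave the metastore's partition records untouched, so the check never fires.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartitionInfo:
    """Partition metadata as reported by the Hive metastore (illustrative)."""
    name: str
    last_modified: int  # epoch seconds

def needs_refresh(cached: list, fresh: list) -> bool:
    """Refresh the table cache only if the metastore reports a changed
    partition set or a newer last-modified time for some partition."""
    cached_by_name = {p.name: p for p in cached}
    if {p.name for p in fresh} != set(cached_by_name):
        return True
    return any(p.last_modified > cached_by_name[p.name].last_modified
               for p in fresh)
```

Under this model, a direct HDFS write changes neither the partition names nor the metastore timestamps, so `needs_refresh` returns `False` and the stale cache is kept.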
No, we do not use the local cache setting. Thanks @taiyang-li for the explanation. I have just checked in code: we add a new partition once a day. Maybe we can disable the metadata cache and work as in Superset, mentioned before?
Yes. You can disable the metadata cache by modifying
For clarification: Superset, where the query returns all the data, is connected directly to Hive, so it's not a workaround for us. About disabling the cache: I was hoping for a settings option (as exists for other cache types), which could also benefit others who use the Hive engine, rather than having to build and maintain my own version of ClickHouse (I am not a programmer). I don't know how complex such a feature would be to implement, but it would definitely be appreciated.
@xmb2 I personally don't think this is a typical use case for the Hive engine, because the right way to get the HDFS partition directories to read is to first query metadata from the Hive metastore rather than accessing HDFS directly. Anyway, it is a trivial change; please feel free to contribute it to CH if you want this feature. I think I have already told you how to manage it.
Describe what's wrong
Querying a Hive engine table gives incomplete results when new data is stored (e.g. in Parquet format) after the HiveMetastoreClient has already fetched metadata for the table.
Does it reproduce on the most recent release?
Yes, on the current and previous LTS versions: v24.3.2, v23.8.12.
How to reproduce
Expected behavior
The query returns the actual number of rows in the given Parquet files.
Additional context
First time querying the data.
Run the SELECT again after a new Parquet file is stored (every half hour). This time, the query returns the same number as before because the client reads the same set of Hive files.
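The stale-read behaviour reported here can be modelled with a small sketch: once a partition's file listing is cached on the first query, files later added directly to the directory stay invisible until the cache is dropped (e.g. by a server restart). The cache dictionary and `list_partition_files` function are hypothetical illustrations, not ClickHouse code.

```python
import os
import tempfile

# Hypothetical listing cache, populated on first access (like the first SELECT).
_listing_cache = {}

def list_partition_files(path: str):
    if path not in _listing_cache:
        _listing_cache[path] = sorted(os.listdir(path))
    return _listing_cache[path]

part = tempfile.mkdtemp()
open(os.path.join(part, "part-0000.parquet"), "w").close()
first = list_partition_files(part)   # one file seen

open(os.path.join(part, "part-0001.parquet"), "w").close()
second = list_partition_files(part)  # still one file: the cache is stale

_listing_cache.clear()               # analogous to restarting the server
third = list_partition_files(part)   # now both files are seen
```

This mirrors the observation that only a restart of clickhouse-server makes the new Parquet file visible.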
The only workaround for now is restarting clickhouse-server.service. Maybe @taiyang-li can help?