
Hive table metadata are not refreshed #62869

Open

xmb2 opened this issue Apr 23, 2024 · 8 comments
Labels: experimental feature (Bug in the feature that should not be used in production), help wanted

Comments

xmb2 commented Apr 23, 2024

Describe what's wrong

Querying a Hive engine table returns incomplete results when new data is stored (e.g. in Parquet format) after HiveMetastoreClient has already fetched metadata for the table.

Does it reproduce on the most recent release?

Yes, on the current and previous LTS versions: v24.3.2 and v23.8.12.

How to reproduce

  • Create a Hive engine table partitioned by year, month and day. The input format is Parquet.
  • Get the number of rows in the table (the exact count query is shown after the DDL below).
  • Add new data to HDFS.
  • Run the same count query again.
CREATE TABLE hive.tracking_zbozi_click_logs
(
    `timestamp` Nullable(Int64),
    `aaa` Nullable(Int32),
    `bbb` Nullable(String),
    `ccc` Array(Tuple(ccca Nullable(String), cccb Nullable(String), cccd Nullable(Decimal(20, 0)))),
    `ddd` Nullable(Int32),
    `eee` Nullable(String),
    `fff` Nullable(String),
    `ggg` Nullable(Decimal(20, 0)),
    `hhh` Nullable(String),
    `iii` Nullable(String),
    `jjj` Nullable(String),
    `kkk` Nullable(String),
    `lll` Nullable(Bool),
    `mmm` Nullable(String),
    `year` UInt32,
    `month` UInt32,
    `day` UInt32
)
ENGINE = Hive('thrift://<hive-server>:9083', 'default', 'tracking_zbozi_click_logs')
PARTITION BY (year, month, day)
ORDER BY tuple()
SETTINGS input_format_parquet_allow_missing_columns = 1, input_format_max_rows_to_read_for_schema_inference = 100000
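
For reference, the count query used in the steps above, copied verbatim from the query log below:

SELECT year, month, day, count(1) AS clicks
FROM tracking_zbozi_click_logs
WHERE year = 2024 AND month = 4 AND day = 23
GROUP BY year, month, day
ORDER BY day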

Expected behavior

The query returns the actual number of rows in the given Parquet files.

Additional context

Querying the data for the first time:

2024.04.23 09:57:47.033532 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Debug> executeQuery: (from 127.0.0.1:58954)  SELECT year, month, day, count(1) as clicks FROM tracking_zbozi_click_logs WHERE year = 2024 AND month = 4 and day = 23 GROUP BY year, month, day ORDER BY day (stage: Complete)
2024.04.23 09:57:47.077291 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Trace> HiveMetastoreClient: Get table metadata for default.tracking_zbozi_click_logs
2024.04.23 09:57:47.162733 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Information> StorageHive: Collect 18 hive files to read
2024.04.23 09:57:47.214542 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Debug> executeQuery: Read 57653 rows, 675.62 KiB in 0.181149 sec., 318262.86648007995 rows/sec., 3.64 MiB/sec.

Running the SELECT again after a new Parquet file has been stored (a new file arrives every half hour). This time the query returns exactly the same number as before, because the client reads the same set of Hive files:

2024.04.23 10:17:36.150890 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Debug> executeQuery: (from 127.0.0.1:58954)  SELECT year, month, day, count(1) as clicks FROM tracking_zbozi_click_logs WHERE year = 2024 AND month = 4 and day = 23 GROUP BY year, month, day ORDER BY day (stage: Complete)
2024.04.23 10:17:36.154613 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Trace> HiveMetastoreClient: Get table metadata for default.tracking_zbozi_click_logs
2024.04.23 10:17:36.238725 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Information> StorageHive: Collect 18 hive files to read
2024.04.23 10:17:36.246530 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Debug> executeQuery: Read 57653 rows, 675.62 KiB in 0.095742 sec., 602170.4163272127 rows/sec., 6.89 MiB/sec.

The only workaround for now is restarting clickhouse-server.service. Maybe @taiyang-li can help?

xmb2 added the potential bug (To be reviewed by developers and confirmed/rejected) label on Apr 23, 2024
taiyang-li (Contributor) commented Apr 23, 2024

Please share the details of step 3, "Add new data to HDFS". Does it mean adding a new partition, doing an insert overwrite of an existing partition, or just creating a new file under an existing partition directly in HDFS?

The table metadata cache only refreshes when the partitions queried from the Hive metastore are newer than the current ones.

xmb2 (Author) commented Apr 23, 2024

New files are added directly to HDFS under an existing partition (I believe that's the correct terminology?). Here is the external Hive table DDL:

CREATE EXTERNAL TABLE `default.tracking_zbozi_click_logs`(
...
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
...
LOCATION
'hdfs://xxx/warehouse/tracking_zbozi_click_logs'

The output of the Spark job is stored in Parquet format as follows.

warehouse/tracking_zbozi_click_logs/year=2024/month=4/day=23/tracking.zbozi.click_log-2024-04-23-18-00-CEST-2024-04-23-18-30-CEST-0.parquet
warehouse/tracking_zbozi_click_logs/year=2024/month=4/day=23/tracking.zbozi.click_log-2024-04-23-18-30-CEST-2024-04-23-19-00-CEST-0.parquet
...

When running the same query in Superset, I get the correct number of rows, even when a new file is added to the path.

den-crane (Contributor) commented Apr 23, 2024

@xmb2 do you use local_cache_for_remote_fs?

taiyang-li (Contributor) commented Apr 24, 2024

Quoting xmb2's comment above: "New files are added directly to HDFS under an existing partition… When running the same query in Superset, I get the correct number of rows, even when a new file is added to the path."

We judge whether the metadata cache needs a refresh by the partitions returned from the Hive metastore. AFAIK, adding files under existing partitions directly on HDFS does not change any information in the Hive metastore, and it is not a use case we had in mind when designing the Hive engine. We expect users to update Hive tables through SQL rather than through lower-level operations on HDFS.

This makes sense because the Hive metastore should be the single source of truth for how many partitions a Hive table has and what their current statuses are.
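
As a hedged aside (this statement is not discussed in the thread): Hive's MSCK REPAIR TABLE registers partition directories that exist on HDFS but are missing from the metastore; it does not record new files inside partitions that are already registered, so it would not make this workflow visible to the metastore either.

MSCK REPAIR TABLE default.tracking_zbozi_click_logs;
-- Adds partitions found on HDFS but missing from the metastore.
-- Files added to an already-registered partition change nothing in
-- the metastore, so ClickHouse's cache check still sees no update.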

xmb2 (Author) commented Apr 24, 2024

@xmb2 do you use local_cache_for_remote_fs?

No, we do not use the local cache; the local_cache_for_remote_fs setting is disabled.

Thanks @taiyang-li for the explanation. I just checked our code: we add a new partition once a day (ALTER TABLE {table} ADD IF NOT EXISTS PARTITION(year = {year}, month = {month}, day = {day})) and after that store new files for that day directly in HDFS. So for our use case the Hive engine is currently not the best match. 😞
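
With hypothetical values filled into that template, the daily statement would look like this:

ALTER TABLE default.tracking_zbozi_click_logs
    ADD IF NOT EXISTS PARTITION (year = 2024, month = 4, day = 23);
-- Registers the day's partition in the metastore once. The Parquet
-- files stored under year=2024/month=4/day=23/ for the rest of the
-- day go directly to HDFS and never touch the metastore, so the
-- metadata cached by ClickHouse stays stale.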

Could we perhaps disable the metadata cache so that queries behave as in the Superset setup mentioned before?

taiyang-li (Contributor) commented

Yes. You can disable the metadata cache by modifying HiveMetastoreClient::HiveTableMetadata::shouldUpdate so that it always returns true. But be careful: this has a negative impact on performance.

xmb2 (Author) commented Apr 25, 2024

For clarification: Superset, where the query returns all the data, is connected directly to Hive, so it is not a workaround for us.

About disabling the cache: I was hoping for a settings option (such as exists for other cache types), which could also benefit others who use the Hive engine, rather than building and maintaining our own version of ClickHouse (I am not a programmer).

I don't know how complex such a feature would be to implement, but it would definitely be appreciated.

taiyang-li (Contributor) commented

Quoting the earlier comment: "You can disable the metadata cache by modifying HiveMetastoreClient::HiveTableMetadata::shouldUpdate so that it always returns true…"

@xmb2 I personally don't think this is a typical use case for the Hive engine, because the right way to get the HDFS partition directories to read is to first query metadata from the Hive metastore instead of accessing HDFS directly. Anyway, it is a trivial change; please feel free to contribute it to ClickHouse if you want this feature. I think I have already described how to manage it.

alexey-milovidov added the experimental feature (Bug in the feature that should not be used in production) and help wanted labels, and removed the potential bug (To be reviewed by developers and confirmed/rejected) label on Apr 27, 2024