
Hive table metadata are not refreshed #62869

Open

xmb2 opened this issue Apr 23, 2024 · 8 comments
Labels: experimental feature (Bug in the feature that should not be used in production), help wanted

Comments

xmb2 commented Apr 23, 2024

Describe what's wrong

Querying a Hive engine table returns incomplete results when new data is stored (e.g. in Parquet format) after HiveMetastoreClient has already fetched metadata for the table.

Does it reproduce on the most recent release?

Yes, on the current and previous LTS versions: v24.3.2 and v23.8.12.

How to reproduce

  • Create a Hive engine table partitioned by year, month and day. The input format is Parquet.
  • Get the number of rows in the table (the exact count query is shown after the DDL below).
  • Add new data to HDFS.
  • Run the same count query again.
CREATE TABLE hive.tracking_zbozi_click_logs
(
    `timestamp` Nullable(Int64),
    `aaa` Nullable(Int32),
    `bbb` Nullable(String),
    `ccc` Array(Tuple(ccca Nullable(String), cccb Nullable(String), cccd Nullable(Decimal(20, 0)))),
    `ddd` Nullable(Int32),
    `eee` Nullable(String),
    `fff` Nullable(String),
    `ggg` Nullable(Decimal(20, 0)),
    `hhh` Nullable(String),
    `iii` Nullable(String),
    `jjj` Nullable(String),
    `kkk` Nullable(String),
    `lll` Nullable(Bool),
    `mmm` Nullable(String),
    `year` UInt32,
    `month` UInt32,
    `day` UInt32
)
ENGINE = Hive('thrift://<hive-server>:9083', 'default', 'tracking_zbozi_click_logs')
PARTITION BY (year, month, day)
ORDER BY tuple()
SETTINGS input_format_parquet_allow_missing_columns = 1, input_format_max_rows_to_read_for_schema_inference = 100000
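
For reference, the count query used in the steps above, copied verbatim from the query log below:

SELECT year, month, day, count(1) AS clicks
FROM tracking_zbozi_click_logs
WHERE year = 2024 AND month = 4 AND day = 23
GROUP BY year, month, day
ORDER BY day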

Expected behavior

The query returns the actual number of rows in the given Parquet files.

Additional context

Querying the data for the first time:

2024.04.23 09:57:47.033532 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Debug> executeQuery: (from 127.0.0.1:58954)  SELECT year, month, day, count(1) as clicks FROM tracking_zbozi_click_logs WHERE year = 2024 AND month = 4 and day = 23 GROUP BY year, month, day ORDER BY day (stage: Complete)
2024.04.23 09:57:47.077291 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Trace> HiveMetastoreClient: Get table metadata for default.tracking_zbozi_click_logs
2024.04.23 09:57:47.162733 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Information> StorageHive: Collect 18 hive files to read
2024.04.23 09:57:47.214542 [ 2677872 ] {0cf4daf0-41c9-4916-ba35-5f5f347790be} <Debug> executeQuery: Read 57653 rows, 675.62 KiB in 0.181149 sec., 318262.86648007995 rows/sec., 3.64 MiB/sec.

Running the SELECT again after a new Parquet file has been stored (a new file arrives every half hour). This time the query returns exactly the same number as before, because the client reads the same set of Hive files:

2024.04.23 10:17:36.150890 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Debug> executeQuery: (from 127.0.0.1:58954)  SELECT year, month, day, count(1) as clicks FROM tracking_zbozi_click_logs WHERE year = 2024 AND month = 4 and day = 23 GROUP BY year, month, day ORDER BY day (stage: Complete)
2024.04.23 10:17:36.154613 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Trace> HiveMetastoreClient: Get table metadata for default.tracking_zbozi_click_logs
2024.04.23 10:17:36.238725 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Information> StorageHive: Collect 18 hive files to read
2024.04.23 10:17:36.246530 [ 2677872 ] {4ca9481f-6bd2-49d3-b563-4cd05e78094f} <Debug> executeQuery: Read 57653 rows, 675.62 KiB in 0.095742 sec., 602170.4163272127 rows/sec., 6.89 MiB/sec.

The only workaround for now is restarting clickhouse-server.service. Maybe @taiyang-li can help?

xmb2 added the potential bug (To be reviewed by developers and confirmed/rejected) label on Apr 23, 2024
taiyang-li (Contributor) commented Apr 23, 2024

Please share the details of step 3, "Add new data to HDFS". Does it mean adding a new partition, doing an insert overwrite of an existing partition, or just creating a new file under an existing partition directly in HDFS?

The table metadata cache only refreshes when the partitions queried from the Hive metastore are newer than the current ones.

xmb2 (Author) commented Apr 23, 2024

New files are added directly to HDFS under an existing partition (I believe that's the correct terminology?). Here is the external Hive table DDL:

CREATE EXTERNAL TABLE `default.tracking_zbozi_click_logs`(
...
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
...
LOCATION
'hdfs://xxx/warehouse/tracking_zbozi_click_logs'

The output of the Spark job is stored in Parquet format as follows.

warehouse/tracking_zbozi_click_logs/year=2024/month=4/day=23/tracking.zbozi.click_log-2024-04-23-18-00-CEST-2024-04-23-18-30-CEST-0.parquet
warehouse/tracking_zbozi_click_logs/year=2024/month=4/day=23/tracking.zbozi.click_log-2024-04-23-18-30-CEST-2024-04-23-19-00-CEST-0.parquet
...

When running the same query in Superset, I get the correct number of rows, even when a new file is added to the path.

den-crane (Contributor) commented Apr 23, 2024

@xmb2 do you use local_cache_for_remote_fs?

taiyang-li (Contributor) commented Apr 24, 2024

Quoting xmb2's comment above: "New files are added directly to HDFS under an existing partition… When running the same query in Superset, I get the correct number of rows, even when a new file is added to the path."

We judge whether the metadata cache needs a refresh by the partitions returned from the Hive metastore. AFAIK, adding files under existing partitions directly on HDFS does not change any information in the Hive metastore, and it is not a use case we had in mind when designing the Hive engine. We expect users to update Hive tables through SQL rather than through lower-level operations on HDFS.

This makes sense because the Hive metastore should be the single source of truth for how many partitions a Hive table has and what their current statuses are.
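
As a hedged aside (this statement is not discussed in the thread): Hive's MSCK REPAIR TABLE registers partition directories that exist on HDFS but are missing from the metastore; it does not record new files inside partitions that are already registered, so it would not make this workflow visible to the metastore either.

MSCK REPAIR TABLE default.tracking_zbozi_click_logs;
-- Adds partitions found on HDFS but missing from the metastore.
-- Files added to an already-registered partition change nothing in
-- the metastore, so ClickHouse's cache check still sees no update.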

xmb2 (Author) commented Apr 24, 2024

@xmb2 do you use local_cache_for_remote_fs?

No, we do not use the local cache; the local_cache_for_remote_fs setting is disabled.

Thanks @taiyang-li for the explanation. I just checked our code: we add a new partition once a day (ALTER TABLE {table} ADD IF NOT EXISTS PARTITION(year = {year}, month = {month}, day = {day})) and after that store new files for that day directly in HDFS. So for our use case the Hive engine is currently not the best match. 😞
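
With hypothetical values filled into that template, the daily statement would look like this:

ALTER TABLE default.tracking_zbozi_click_logs
    ADD IF NOT EXISTS PARTITION (year = 2024, month = 4, day = 23);
-- Registers the day's partition in the metastore once. The Parquet
-- files stored under year=2024/month=4/day=23/ for the rest of the
-- day go directly to HDFS and never touch the metastore, so the
-- metadata cached by ClickHouse stays stale.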

Could we perhaps disable the metadata cache so that queries behave as in the Superset setup mentioned before?

taiyang-li (Contributor) commented

Yes. You can disable the metadata cache by modifying HiveMetastoreClient::HiveTableMetadata::shouldUpdate so that it always returns true. But be careful: this has a negative impact on performance.

xmb2 (Author) commented Apr 25, 2024

For clarification: Superset, where the query returns all the data, is connected directly to Hive, so it is not a workaround for us.

About disabling the cache: I was hoping for a settings option (such as exists for other cache types), which could also benefit others who use the Hive engine, rather than building and maintaining our own version of ClickHouse (I am not a programmer).

I don't know how complex such a feature would be to implement, but it would definitely be appreciated.

taiyang-li (Contributor) commented

Quoting the earlier comment: "You can disable the metadata cache by modifying HiveMetastoreClient::HiveTableMetadata::shouldUpdate so that it always returns true…"

@xmb2 I personally don't think this is a typical use case for the Hive engine, because the right way to get the HDFS partition directories to read is to first query metadata from the Hive metastore instead of accessing HDFS directly. Anyway, it is a trivial change; please feel free to contribute it to ClickHouse if you want this feature. I think I have already described how to manage it.

alexey-milovidov added the experimental feature (Bug in the feature that should not be used in production) and help wanted labels, and removed the potential bug (To be reviewed by developers and confirmed/rejected) label on Apr 27, 2024