Problems using HDFS storage_policy for MergeTree with more than 300 GB #43112

Closed
ArsenyClean opened this issue Nov 10, 2022 · 6 comments
Labels
comp-hdfs feature st-wontfix (Known issue, no plans to fix it currently) unexpected behaviour

Comments

@ArsenyClean

Hello!

We use storage_policy=hdfs for MergeTree tables

  1. ClickHouse saves many small files

Each file is about 2 KB, so after saving only about 300 GB we hit an exception about the per-directory file limit.

See dfs.namenode.fs-limits.max-directory-items

https://github.com/naver/hadoop/blob/master/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java#L285

  2. load_balancing=round_robin does not work in our case, because there is no per-folder size limit, so the move_factor parameter has no effect either

We created 100 folders and 100 volumes for them, with one policy over all of them, but ClickHouse writes all the data into a single folder (a query to inspect the resulting policy is sketched below).

min_bytes_for_wide_part has no effect on this; each file is still about 2 KB.
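
As a rough check on how the 100 volumes ended up in the policy, a sketch: the configured volumes, disks and move_factor can be read from system.storage_policies. The policy name 'hdfs_100_volumes' below is a placeholder for whatever name is actually used in the storage configuration.

-- 'hdfs_100_volumes' is a placeholder policy name; substitute the real one
SELECT policy_name, volume_name, volume_priority, disks, max_data_part_size, move_factor
FROM system.storage_policies
WHERE policy_name = 'hdfs_100_volumes';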

ArsenyClean added the potential bug (To be reviewed by developers and confirmed/rejected) label Nov 10, 2022
@kssenii
Member

kssenii commented Nov 10, 2022

it should not write many small files if you inserted all data in one insert

please check

SELECT local_path, remote_path, size FROM system.remote_data_paths WHERE disk = 'hdfs_disk'
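
As a quick follow-up sketch on top of that query (assuming the disk really is named 'hdfs_disk' and these column names exist in your server version), an aggregate shows how many remote objects were written and their average size, which is what eventually hits dfs.namenode.fs-limits.max-directory-items:

-- 'hdfs_disk' and the column names follow the query above; adjust for your setup
SELECT
    count() AS objects,
    formatReadableSize(sum(size)) AS total_size,
    formatReadableSize(avg(size)) AS avg_object_size
FROM system.remote_data_paths
WHERE disk = 'hdfs_disk';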

den-crane added feature, unexpected behaviour, comp-hdfs and removed potential bug (To be reviewed by developers and confirmed/rejected) labels Nov 10, 2022
@den-crane
Contributor

related #40968

@den-crane
Contributor

With S3, objects are sharded into subfolders; with HDFS they are not.

S3

CREATE TABLE test_s3( c1  Int8, c2  Date, c3  Array(String)) ENGINE = MergeTree
ORDER BY (c1, c2) SETTINGS disk = 's3disk';

insert into test_s3 select 1,1,[] from numbers(10);

select remote_path FROM system.remote_data_paths limit 10;

┌─remote_path──────────────────────────────────────┐
│ denis/s3cached/fpk/uzqyqbvtejpxuyyltdknvjiztcstq │
│ denis/s3cached/mhj/ryhvqribntufhhaaussebheujgfay │
│ denis/s3cached/hje/voewtvkmwsqclivjahvypevfatmhk │
│ denis/s3cached/idp/vrybpmmlimfvlopeufjfexnswqkuk │
│ denis/s3cached/nff/wchzqebvmarfsjhwldhjmyrsjmidq │
│ denis/s3cached/ihw/addnwhalvvtvxopszigztizrpcnsd │
│ denis/s3cached/xeb/ukuqphbchejbswgtmjxlpsmeilipe │
│ denis/s3cached/hcn/wclmrbdvqqfzbqawqjplmhfgzvsyp │
│ denis/s3cached/otw/fpnfunznizlzyyjnsmtzqvdnbxoci │
│ denis/s3cached/wph/qhvtztdxcydezxyqsxawwqakuvpnj │
└──────────────────────────────────────────────────┘

HDFS

CREATE TABLE test_hdfs( c1  Int8, c2  Date, c3  Array(String)) ENGINE = MergeTree
ORDER BY (c1, c2) SETTINGS disk = 'hdfs';

insert into test_hdfs select 1,1,[] from numbers(10);

select remote_path FROM system.remote_data_paths limit 10;

┌─remote_path────────────────────────────────────────────────────────┐
│ hdfs://hdfs1:9000/temp/clickhouse/oqsowlekotqvslxcsmbyjhuidwsyulwt │
│ hdfs://hdfs1:9000/temp/clickhouse/wmafmhwmsrnyphpagkoncioyfqlijanb │
│ hdfs://hdfs1:9000/temp/clickhouse/kmclrzqxiajkiumhftvoypbgwcedsjbk │
│ hdfs://hdfs1:9000/temp/clickhouse/extykovxvclzesejrxctkfymdzqoafco │
│ hdfs://hdfs1:9000/temp/clickhouse/zibvjvkqdzweipcxpmtutjbdwexudukq │
│ hdfs://hdfs1:9000/temp/clickhouse/lzoeiijfekfqcfsibozlcgbplyvhezbv │
│ hdfs://hdfs1:9000/temp/clickhouse/idpmkhgbutkvtyrteuderpbfursfroam │
│ hdfs://hdfs1:9000/temp/clickhouse/ylnrydjstlkqgupclylpenchvwsigmom │
│ hdfs://hdfs1:9000/temp/clickhouse/hiaphzfshelfkcckighqbanqwafhwxwi │
│ hdfs://hdfs1:9000/temp/clickhouse/nobllxxzaguqnjbkhkpyauurbzetfmmi │
└────────────────────────────────────────────────────────────────────┘
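
To quantify the difference, a hedged sketch: grouping remote paths by their parent directory should show many small groups for the S3 layout and a single large group for the flat HDFS layout (the string expression below is just one way to strip the last path component):

-- group remote objects by parent directory (strip everything after the last '/')
SELECT
    substring(remote_path, 1, length(remote_path) - position(reverse(remote_path), '/')) AS remote_dir,
    count() AS objects
FROM system.remote_data_paths
GROUP BY remote_dir
ORDER BY objects DESC
LIMIT 10;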

@den-crane
Contributor

And that's not easy: we would need to pre-create folders in HDFS before putting a file.
Create 10000 folders during ClickHouse start?

alexey-milovidov added the st-wontfix (Known issue, no plans to fix it currently) label Apr 27, 2024
@alexey-milovidov
Member

We don't support HDFS.

@den-crane
Contributor

And that's not easy: we would need to pre-create folders in HDFS before putting a file. Create 10000 folders during ClickHouse start?

Alex Sapin suggested using an optimistic approach: if create_hdfs_object fails with a "folder does not exist" error, then create the folder and retry create_hdfs_object.
