Problems using HDFS storage_policy for MergeTree with more than 300 GB #43112

Closed
ArsenyClean opened this issue Nov 10, 2022 · 6 comments
Labels
comp-hdfs feature st-wontfix (Known issue, no plans to fix it currently) unexpected behaviour

Comments

@ArsenyClean

Hello!

We use storage_policy=hdfs for MergeTree tables

  1. ClickHouse saves many small files

Each file is about 2 KB, so after saving only about 300 GB we hit an exception about the per-directory file limit.

See dfs.namenode.fs-limits.max-directory-items

https://github.com/naver/hadoop/blob/master/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java#L285

  2. load_balancing=round_robin does not work in our case, because there is no per-folder size limit, so the move_factor parameter has no effect either

We created 100 folders and 100 volumes for them, with one policy over all of them, but ClickHouse writes all the data into a single folder (a query to inspect the resulting policy is sketched below).

min_bytes_for_wide_part has no effect on this; each file is still about 2 KB.
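
As a rough check on how the 100 volumes ended up in the policy, a sketch: the configured volumes, disks and move_factor can be read from system.storage_policies. The policy name 'hdfs_100_volumes' below is a placeholder for whatever name is actually used in the storage configuration.

-- 'hdfs_100_volumes' is a placeholder policy name; substitute the real one
SELECT policy_name, volume_name, volume_priority, disks, max_data_part_size, move_factor
FROM system.storage_policies
WHERE policy_name = 'hdfs_100_volumes';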

ArsenyClean added the potential bug (To be reviewed by developers and confirmed/rejected) label Nov 10, 2022
@kssenii
Member

kssenii commented Nov 10, 2022

it should not write many small files if you inserted all data in one insert

please check

SELECT local_path, remote_path, size FROM system.remote_data_paths WHERE disk = 'hdfs_disk'
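
As a quick follow-up sketch on top of that query (assuming the disk really is named 'hdfs_disk' and these column names exist in your server version), an aggregate shows how many remote objects were written and their average size, which is what eventually hits dfs.namenode.fs-limits.max-directory-items:

-- 'hdfs_disk' and the column names follow the query above; adjust for your setup
SELECT
    count() AS objects,
    formatReadableSize(sum(size)) AS total_size,
    formatReadableSize(avg(size)) AS avg_object_size
FROM system.remote_data_paths
WHERE disk = 'hdfs_disk';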

den-crane added feature, unexpected behaviour, comp-hdfs and removed potential bug (To be reviewed by developers and confirmed/rejected) labels Nov 10, 2022
@den-crane
Contributor

related #40968

@den-crane
Contributor

With S3, objects are sharded into subfolders; with HDFS they are not.

S3

CREATE TABLE test_s3( c1  Int8, c2  Date, c3  Array(String)) ENGINE = MergeTree
ORDER BY (c1, c2) SETTINGS disk = 's3disk';

insert into test_s3 select 1,1,[] from numbers(10);

select remote_path FROM system.remote_data_paths limit 10;

┌─remote_path──────────────────────────────────────┐
│ denis/s3cached/fpk/uzqyqbvtejpxuyyltdknvjiztcstq │
│ denis/s3cached/mhj/ryhvqribntufhhaaussebheujgfay │
│ denis/s3cached/hje/voewtvkmwsqclivjahvypevfatmhk │
│ denis/s3cached/idp/vrybpmmlimfvlopeufjfexnswqkuk │
│ denis/s3cached/nff/wchzqebvmarfsjhwldhjmyrsjmidq │
│ denis/s3cached/ihw/addnwhalvvtvxopszigztizrpcnsd │
│ denis/s3cached/xeb/ukuqphbchejbswgtmjxlpsmeilipe │
│ denis/s3cached/hcn/wclmrbdvqqfzbqawqjplmhfgzvsyp │
│ denis/s3cached/otw/fpnfunznizlzyyjnsmtzqvdnbxoci │
│ denis/s3cached/wph/qhvtztdxcydezxyqsxawwqakuvpnj │
└──────────────────────────────────────────────────┘

HDFS

CREATE TABLE test_hdfs( c1  Int8, c2  Date, c3  Array(String)) ENGINE = MergeTree
ORDER BY (c1, c2) SETTINGS disk = 'hdfs';

insert into test_hdfs select 1,1,[] from numbers(10);

select remote_path FROM system.remote_data_paths limit 10;

┌─remote_path────────────────────────────────────────────────────────┐
│ hdfs://hdfs1:9000/temp/clickhouse/oqsowlekotqvslxcsmbyjhuidwsyulwt │
│ hdfs://hdfs1:9000/temp/clickhouse/wmafmhwmsrnyphpagkoncioyfqlijanb │
│ hdfs://hdfs1:9000/temp/clickhouse/kmclrzqxiajkiumhftvoypbgwcedsjbk │
│ hdfs://hdfs1:9000/temp/clickhouse/extykovxvclzesejrxctkfymdzqoafco │
│ hdfs://hdfs1:9000/temp/clickhouse/zibvjvkqdzweipcxpmtutjbdwexudukq │
│ hdfs://hdfs1:9000/temp/clickhouse/lzoeiijfekfqcfsibozlcgbplyvhezbv │
│ hdfs://hdfs1:9000/temp/clickhouse/idpmkhgbutkvtyrteuderpbfursfroam │
│ hdfs://hdfs1:9000/temp/clickhouse/ylnrydjstlkqgupclylpenchvwsigmom │
│ hdfs://hdfs1:9000/temp/clickhouse/hiaphzfshelfkcckighqbanqwafhwxwi │
│ hdfs://hdfs1:9000/temp/clickhouse/nobllxxzaguqnjbkhkpyauurbzetfmmi │
└────────────────────────────────────────────────────────────────────┘
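
To quantify the difference, a hedged sketch: grouping remote paths by their parent directory should show many small groups for the S3 layout and a single large group for the flat HDFS layout (the string expression below is just one way to strip the last path component):

-- group remote objects by parent directory (strip everything after the last '/')
SELECT
    substring(remote_path, 1, length(remote_path) - position(reverse(remote_path), '/')) AS remote_dir,
    count() AS objects
FROM system.remote_data_paths
GROUP BY remote_dir
ORDER BY objects DESC
LIMIT 10;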

@den-crane
Contributor

And that's not easy: we would need to pre-create folders in HDFS before putting a file.
Create 10000 folders during ClickHouse start?

alexey-milovidov added the st-wontfix (Known issue, no plans to fix it currently) label Apr 27, 2024
@alexey-milovidov
Member

We don't support HDFS.

@den-crane
Contributor

And that's not easy: we would need to pre-create folders in HDFS before putting a file. Create 10000 folders during ClickHouse start?

Alex Sapin suggested using an optimistic approach: if create_hdfs_object fails with a "folder does not exist" error, then create the folder and retry create_hdfs_object.
