[Question] Can i read parquet data from HDFS? #443

Closed
wangxingda opened this issue Feb 27, 2024 · 6 comments

Comments

@wangxingda

I recompiled HugeCTR with -DENABLE_HDFS=ON, but I get this error when I read Parquet data from HDFS:


[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library
res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: failed to read a file
Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library
res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: failed to read a file
Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)

@wangxingda wangxingda added the question Further information is requested label Feb 27, 2024
@JacoCheung
Collaborator

Hi @wangxingda, thanks for trying HugeCTR with HDFS. We used to have a notebook sample demonstrating the usage of HDFS. Can you confirm that there is a _metadata.json file in your dataset source folder (following the instructions in the notebook sample)?

In addition, could you please post your CMake log here? I'd like to confirm whether the macro ENABLE_ARROW_PARQUET is defined.

@JacoCheung JacoCheung self-assigned this Feb 27, 2024
@wangxingda
Author

wangxingda commented Feb 28, 2024

@JacoCheung Thanks for your help. I just used the CMakeLists.txt from the main branch of the HugeCTR repo, and I confirm that my metadata file exists.

Did you notice this line in CMakeLists.txt: if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS)?
Does this mean that I cannot use both Parquet and HDFS at the same time?
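
For readers following along, the quoted condition can be modeled as a small truth table. This is only a sketch in Python, not HugeCTR code: the function name `arrow_parquet_branch_taken` is hypothetical, and it does nothing more than evaluate the boolean expression from the quoted if(...) line. It assumes that this branch is the one that enables the Arrow Parquet reader, which the thread suggests but the quoted line alone does not prove.

```python
def arrow_parquet_branch_taken(parquet_found, enable_hdfs=False,
                               enable_s3=False, enable_gcs=False):
    """Evaluate: Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS.

    Hypothetical helper that mirrors the quoted CMake condition; the
    arguments stand in for the CMake variables of the same names.
    """
    return (parquet_found and not enable_hdfs
            and not enable_s3 and not enable_gcs)


# Local-only build with Parquet found: the guarded branch is taken.
local_build = arrow_parquet_branch_taken(parquet_found=True)

# Same build configured with -DENABLE_HDFS=ON: the guarded branch is skipped.
hdfs_build = arrow_parquet_branch_taken(parquet_found=True, enable_hdfs=True)

print(local_build, hdfs_build)
```

Under that assumption, turning on any one of the remote backends is enough to skip the branch, which would be consistent with the "Rebuild with Arrow::Parquet Library" error above.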

The notebook seems to be out of date; I cannot run it successfully with both Parquet and HDFS.

@JacoCheung
Collaborator

Hi @wangxingda, thanks for the reminder! There was a breaking change to remote reading in the v23.02 release, which is where if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS) in CMakeLists.txt came into play.

Specifically, to optimize the reading process in HugeCTR, we have to know the row_group_size of all training data files (Parquet) in advance, before any actual data is read. The way we get this information is to use the Arrow Parquet reader to read the metadata from the Parquet files on the local filesystem.
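
The kind of metadata lookup described here can be sketched with pyarrow's Python API. This is not HugeCTR's C++ implementation: the helper name `row_group_sizes` is hypothetical, pyarrow has to be installed separately, and the example only shows that the row count of each row group can be read from the Parquet footer of a local file without loading any column data.

```python
def row_group_sizes(path):
    """Return the number of rows in each row group of a local Parquet file.

    Hypothetical helper for illustration; it reads only the file footer.
    """
    # Third-party dependency, imported lazily so that defining this
    # function does not require pyarrow to be installed.
    import pyarrow.parquet as pq

    metadata = pq.read_metadata(path)  # FileMetaData parsed from the footer
    return [metadata.row_group(i).num_rows
            for i in range(metadata.num_row_groups)]
```

Calling `row_group_sizes("part-0.parquet")` on a local file (the file name is a placeholder) returns one integer per row group, which is the information a reader would need before scheduling any actual reads.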

Therefore, HDFS has effectively been disabled since the v23.02 release. We should mark it as a known issue. Sorry for the inconvenience.

May I ask why you need HDFS and how important it is to you? Is this just an experiment? If you need HDFS support in the short term, could you try a release prior to v23.02?

@wangxingda
Author

@JacoCheung Thanks. I plan to use HugeCTR in a production environment, and the training data is stored in HDFS. Does the HugeCTR team have a plan to support HDFS with the Parquet format? I would very much like to see this feature supported.

@JacoCheung
Collaborator

Hi @wangxingda, thanks for your reply. Yes, we definitely intend to restore the remote IO (HDFS) feature. As I mentioned, this is an issue that needs to be fixed. We're planning to refactor our data reader and fix the HDFS problem. Until that happens, you can use a release prior to v23.02.

Thanks.

@wangxingda
Author

Thanks very much!
