Parquet-Tools (rowcount, schema, and dump) #203

Merged: 7 commits, Jun 13, 2024

KIwabuchi
Member

@KIwabuchi KIwabuchi commented Apr 20, 2024

Usage:

Usage
mpirun -np <#of ranks> ./parquet-tools [options]

Options
-c <subcommand>
  rowcount
    Return the number of rows in parquet files. If no subcommand option was specified, return the value stored in the metadata without actually reading the whole data and counting the number of lines.
  schema
    Show the schemas of parquet files.
  dump
    Dump data to files. One output file per rank.
-p <path>
  Parquet file path or a directory path that contains parquet files. All parquet files must have the same schema.
-h Show this help message.

Subcommand Usage
rowcount [options]
 Options
  -l Read rows w/o converting.
  -v Read rows converting to arrays of std::variant.
  -j Read rows converting to arrays of JSON objects.
dump -o <output file prefix> [options]
  -o <path> Prefix of output files.
 Options
  -v Dump rows converting to arrays of std::variant (default).
  -j Dump rows converting to arrays of JSON objects.

Count Rows Example

$ mpirun -np 2 ./tools/parquet_tools -c rowcount -p ../test/data/parquet_files_json/ -v
Read as variants.
Elapsed time: 0.000284546 seconds
#of rows = 3
#of conversion error lines = 0

Show Schema Example

$ mpirun -np 2 ./tools/parquet_tools -c schema -p ../test/data/parquet_files_json/ 
Schema
required group field_id=-1 schema {
  optional int64 field_id=-1 id;
  optional boolean field_id=-1 bool;
  optional int32 field_id=-1 int32;
  optional int64 field_id=-1 int64;
  optional float field_id=-1 float;
  optional double field_id=-1 double;
  optional binary field_id=-1 byte_array (String);
}

@KIwabuchi KIwabuchi changed the title Parquet Tools (rowcount and schema) Parquet-Tools (rowcount and schema) Apr 20, 2024
@KIwabuchi KIwabuchi changed the title Parquet-Tools (rowcount and schema) Parquet-Tools (rowcount, schema, and dump) May 16, 2024
- Add a cmake option `PIP_PYARROW_ROOT` to use Arrow and Parquet installed along with pyarrow by pip.

- Use pip to install Arrow and Parquet in the CI Test.
@KIwabuchi
Member Author

KIwabuchi commented Jun 8, 2024

@rogerpearce @steiltre

I changed YGM’s CMake file so that CMake can find Arrow installed by pip.
I had to change the CMake files substantially to keep supporting a normally installed Arrow and to clean up the CMake code.
The CI test passes on the latest Ubuntu (22.04) with pyarrow installed by pip 🎉

However, there was an error when linking Parquet to YGM executables on Ubuntu 20.04.
To fix the issue, we need to define the preprocessor macro _GLIBCXX_USE_CXX11_ABI=0, as described in the link below:
https://github.com/apache/arrow/pull/10582/files
As the macro affects the ABI (which is tricky), I would like your thoughts.
Possible directions are:

  • Define the macro only during the CI test on Ubuntu 20.04 (using the macro on Ubuntu 22.04 causes another issue).
  • Skip the Parquet program tests on Ubuntu 20.04.
  • Install Arrow using apt-get on Ubuntu 20.04 and use pyarrow on Ubuntu 22.04.

What do you think?


Usage example:

pip install pyarrow
PIP_PYARROW_ROOT=$(python -c "import pyarrow as pa; print(pa.get_library_dirs()[0])")
cmake ../ -DPIP_PYARROW_ROOT=$PIP_PYARROW_ROOT
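
If pyarrow is missing, the command substitution above leaves PIP_PYARROW_ROOT empty. A minimal sketch of a guard around that step follows; the helper name `pyarrow_root` is hypothetical and not part of this PR, and the sketch only prints the cmake command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): print the first pyarrow
# library directory, or return non-zero if it cannot be determined.
# $1 is the Python interpreter to use (defaults to "python").
pyarrow_root() {
    py="${1:-python}"
    # Fails if the interpreter is missing or pyarrow cannot be imported.
    dir=$("$py" -c "import pyarrow as pa; print(pa.get_library_dirs()[0])" 2>/dev/null) || return 1
    # Fails if the reported directory does not exist.
    [ -d "$dir" ] || return 1
    printf '%s\n' "$dir"
}

if root=$(pyarrow_root python); then
    # Only print the command here; run it from the build directory.
    echo "cmake ../ -DPIP_PYARROW_ROOT=$root"
else
    echo "pyarrow not found; run 'pip install pyarrow' first" >&2
fi
```

Passing the interpreter as an argument makes it easy to point the helper at a virtual environment's Python.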

@steiltre
Collaborator

steiltre commented Jun 9, 2024

I'm good with skipping the Parquet tests on 20.04. Being able to test on 22.04 should be good enough for our purposes.

@KIwabuchi
Member Author

Okay, I'll stop running the tests on 20.04.

@KIwabuchi
Member Author

@steiltre @rogerpearce
I changed the script to skip the tests on 20.04.
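
For reference, a version gate of this kind could look roughly like the sketch below. It is not the actual CI script from this PR: the function name `should_skip_parquet_tests` is hypothetical, and it assumes the runner exposes its release number as VERSION_ID in /etc/os-release.

```shell
#!/bin/sh
# Hypothetical CI gate; the actual script in the PR may differ.
# Returns 0 (skip) when the given Ubuntu version is 20.04.
should_skip_parquet_tests() {
    [ "$1" = "20.04" ]
}

# Read VERSION_ID (e.g. "22.04") from /etc/os-release when available.
version=""
if [ -r /etc/os-release ]; then
    version=$(. /etc/os-release && printf '%s' "${VERSION_ID:-}")
fi

if should_skip_parquet_tests "$version"; then
    echo "Skipping Parquet tests on Ubuntu $version"
else
    echo "Running Parquet tests"
fi
```

The sketch compares only the version string; a stricter gate would also check that the distribution ID is ubuntu.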

@steiltre
Collaborator

Thanks Keita. This all looks good to me. Going to merge in now.

@steiltre steiltre merged commit 7abf888 into LLNL:v0.7-dev Jun 13, 2024
8 checks passed