Skip to content

Provide convenient introspection of Parquet files #59

@hodgesrm

Description

@hodgesrm

When diagnosing problems in storage we often need to look at Parquet metadata. It would be much more productive to have a describe-parquet command to introspect files quickly. Example:

ice describe-parquet --help
Usage: ice-rest-catalog ice describe-parquet [options] [--json] [-c=<configFile>]
                                     [--log-level=<logLevel>] [<target>]
-a, --all Show everything
-s, --summary Show size, rows, number of row groups, size, compress_size, etc.
-c, --columns Show columns
-r --row-groups Show row groups
-d --row-group-details Show column stats within row group

Thoughts on alternatives.

  1. This command could alternatively be a wrapper around parquet-tools. The parquet-tools meta output is not bad but you have to run it through inconvenient jq commands to get readable output.

  2. ClickHouse has pretty good Parquet introspection but it's hard to read. The ice command would have the same information but obtained directly from the file without needing to use ClickHouse. Here's the ClickHouse example:

SELECT *
FROM s3('s3://<bucket>/nyc/taxis/data/1751985059861-5526da589bc474f953669f343505ec94f0074bfc701eb046184de439bfe0485f.parquet', ParquetMetaData)

Query id: 49c06e8f-45c4-4ff0-8587-95e7b8f8f8a5

Row 1:
──────
num_columns:             20
num_rows:                3475226 -- 3.48 million
num_row_groups:          1
format_version:          1.0
metadata_size:           4914
total_uncompressed_size: 94235387 -- 94.24 million
total_compressed_size:   56257069 -- 56.26 million
columns:                 [('VendorID','VendorID',1,0,'INT32','None','GZIP',815756,457781,'43.88%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tpep_pickup_datetime','tpep_pickup_datetime',1,0,'INT64','Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)','GZIP',27244805,13315093,'51.13%',['RLE','PLAIN','PLAIN_DICTIONARY','BIT_PACKED']),('tpep_dropoff_datetime','tpep_dropoff_datetime',1,0,'INT64','Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)','GZIP',27249800,13630784,'49.98%',['RLE','PLAIN','PLAIN_DICTIONARY','BIT_PACKED']),('passenger_count','passenger_count',1,0,'INT64','None','GZIP',1248716,565309,'54.73%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('trip_distance','trip_distance',1,0,'DOUBLE','None','GZIP',5458535,4879027,'10.62%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('RatecodeID','RatecodeID',1,0,'INT64','None','GZIP',543574,277677,'48.92%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('store_and_fwd_flag','store_and_fwd_flag',1,0,'BYTE_ARRAY','String','GZIP',35900,30404,'15.31%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('PULocationID','PULocationID',1,0,'INT32','None','GZIP',3559027,2657564,'25.33%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('DOLocationID','DOLocationID',1,0,'INT32','None','GZIP',3906660,3491779,'10.62%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('payment_type','payment_type',1,0,'INT64','None','GZIP',910126,459559,'49.51%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('fare_amount','fare_amount',1,0,'DOUBLE','None','GZIP',5325026,3884220,'27.06%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('extra','extra',1,0,'DOUBLE','None','GZIP',2071927,893528,'56.87%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('mta_tax','mta_tax',1,0,'DOUBLE','None','GZIP',371194,177152,'52.28%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tip_amount','tip_amount',1,0,'DOUBLE','None','GZIP',4929627,3756133,'23.8%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tolls_amount','tolls_amount',1,0,'DOUBLE','None','GZIP',1825209,405014,'77.81%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('improvement_surcharge','improvement_surcharge',1,0,'DOUBLE','None','GZIP',298522,160411,'46.26%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('total_amount','total_amount',1,0,'DOUBLE','None','GZIP',6513089,6185099,'5.036%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('congestion_surcharge','congestion_surcharge',1,0,'DOUBLE','None','GZIP',556454,303021,'45.54%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('Airport_fee','Airport_fee',1,0,'DOUBLE','None','GZIP',594481,272186,'54.21%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('cbd_congestion_fee','cbd_congestion_fee',1,0,'DOUBLE','None','GZIP',776959,455328,'41.4%',['RLE','PLAIN_DICTIONARY','BIT_PACKED'])]
row_groups:              [(4,20,3475226,94235387,56257069,[('VendorID','VendorID',457781,815756,true,(3475226,0,NULL,'1','7'),0),('tpep_pickup_datetime','tpep_pickup_datetime',13315093,27244805,true,(3475226,0,NULL,'1735678075000000','1738368044000000'),0),('tpep_dropoff_datetime','tpep_dropoff_datetime',13630784,27249800,true,(3475226,0,NULL,'1734508360000000','1738453451000000'),0),('passenger_count','passenger_count',565309,1248716,true,(2935077,540149,NULL,'0','9'),0),('trip_distance','trip_distance',4879027,5458535,true,(3475226,0,NULL,'0.000000','276423.570000'),0),('RatecodeID','RatecodeID',277677,543574,true,(2935077,540149,NULL,'1','99'),0),('store_and_fwd_flag','store_and_fwd_flag',30404,35900,true,(2935077,540149,NULL,'N','Y'),0),('PULocationID','PULocationID',2657564,3559027,true,(3475226,0,NULL,'1','265'),0),('DOLocationID','DOLocationID',3491779,3906660,true,(3475226,0,NULL,'1','265'),0),('payment_type','payment_type',459559,910126,true,(3475226,0,NULL,'0','5'),0),('fare_amount','fare_amount',3884220,5325026,true,(3475226,0,NULL,'-900.000000','863372.120000'),0),('extra','extra',893528,2071927,true,(3475226,0,NULL,'-7.500000','15.000000'),0),('mta_tax','mta_tax',177152,371194,true,(3475226,0,NULL,'-0.500000','10.500000'),0),('tip_amount','tip_amount',3756133,4929627,true,(3475226,0,NULL,'-86.000000','400.000000'),0),('tolls_amount','tolls_amount',405014,1825209,true,(3475226,0,NULL,'-126.940000','170.940000'),0),('improvement_surcharge','improvement_surcharge',160411,298522,true,(3475226,0,NULL,'-1.000000','1.000000'),0),('total_amount','total_amount',6185099,6513089,true,(3475226,0,NULL,'-901.000000','863380.370000'),0),('congestion_surcharge','congestion_surcharge',303021,556454,true,(2935077,540149,NULL,'-2.500000','2.500000'),0),('Airport_fee','Airport_fee',272186,594481,true,(2935077,540149,NULL,'-1.750000','6.750000'),0),('cbd_congestion_fee','cbd_congestion_fee',455328,776959,true,(3475226,0,NULL,'-0.750000','0.750000'),0)])]

1 row in set. Elapsed: 0.117 sec.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions