-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Description
When diagnosing problems in storage we often need to look at Parquet metadata. It would be much more productive to have a describe-parquet command to introspect files quickly. Example:
ice describe-parquet --help
Usage: ice-rest-catalog ice describe-parquet [options] [--json] [-c=<configFile>]
[--log-level=<logLevel>] [<target>]
-a, --all Show everything
-s, --summary Show size, rows, number of row groups, size, compress_size, etc.
-c, --columns Show columns
-r --row-groups Show row groups
-d --row-group-details Show column stats within row group
Thoughts on alternatives.
-
This command could alternatively be a wrapper around parquet-tools. The
parquet-tools meta
output is not bad but you have to run it through inconvenient jq commands to get readable output. -
ClickHouse has pretty good Parquet introspection but it's hard to read. The ice command would have the same information but obtained directly from the file without needing to use ClickHouse. Here's the ClickHouse example:
SELECT *
FROM s3('s3://<bucket>/nyc/taxis/data/1751985059861-5526da589bc474f953669f343505ec94f0074bfc701eb046184de439bfe0485f.parquet', ParquetMetaData)
Query id: 49c06e8f-45c4-4ff0-8587-95e7b8f8f8a5
Row 1:
──────
num_columns: 20
num_rows: 3475226 -- 3.48 million
num_row_groups: 1
format_version: 1.0
metadata_size: 4914
total_uncompressed_size: 94235387 -- 94.24 million
total_compressed_size: 56257069 -- 56.26 million
columns: [('VendorID','VendorID',1,0,'INT32','None','GZIP',815756,457781,'43.88%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tpep_pickup_datetime','tpep_pickup_datetime',1,0,'INT64','Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)','GZIP',27244805,13315093,'51.13%',['RLE','PLAIN','PLAIN_DICTIONARY','BIT_PACKED']),('tpep_dropoff_datetime','tpep_dropoff_datetime',1,0,'INT64','Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)','GZIP',27249800,13630784,'49.98%',['RLE','PLAIN','PLAIN_DICTIONARY','BIT_PACKED']),('passenger_count','passenger_count',1,0,'INT64','None','GZIP',1248716,565309,'54.73%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('trip_distance','trip_distance',1,0,'DOUBLE','None','GZIP',5458535,4879027,'10.62%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('RatecodeID','RatecodeID',1,0,'INT64','None','GZIP',543574,277677,'48.92%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('store_and_fwd_flag','store_and_fwd_flag',1,0,'BYTE_ARRAY','String','GZIP',35900,30404,'15.31%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('PULocationID','PULocationID',1,0,'INT32','None','GZIP',3559027,2657564,'25.33%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('DOLocationID','DOLocationID',1,0,'INT32','None','GZIP',3906660,3491779,'10.62%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('payment_type','payment_type',1,0,'INT64','None','GZIP',910126,459559,'49.51%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('fare_amount','fare_amount',1,0,'DOUBLE','None','GZIP',5325026,3884220,'27.06%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('extra','extra',1,0,'DOUBLE','None','GZIP',2071927,893528,'56.87%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('mta_tax','mta_tax',1,0,'DOUBLE','None','GZIP',371194,177152,'52.28%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tip_amount','tip_amount',1,0,'DOUBLE','None','GZIP',4929627,3756133,'23.8%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tolls_amount','tolls_amount',1,0,'DOUBLE','None','GZIP',1825209,405014,'77.81%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('improvement_surcharge','improvement_surcharge',1,0,'DOUBLE','None','GZIP',298522,160411,'46.26%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('total_amount','total_amount',1,0,'DOUBLE','None','GZIP',6513089,6185099,'5.036%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('congestion_surcharge','congestion_surcharge',1,0,'DOUBLE','None','GZIP',556454,303021,'45.54%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('Airport_fee','Airport_fee',1,0,'DOUBLE','None','GZIP',594481,272186,'54.21%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('cbd_congestion_fee','cbd_congestion_fee',1,0,'DOUBLE','None','GZIP',776959,455328,'41.4%',['RLE','PLAIN_DICTIONARY','BIT_PACKED'])]
row_groups: [(4,20,3475226,94235387,56257069,[('VendorID','VendorID',457781,815756,true,(3475226,0,NULL,'1','7'),0),('tpep_pickup_datetime','tpep_pickup_datetime',13315093,27244805,true,(3475226,0,NULL,'1735678075000000','1738368044000000'),0),('tpep_dropoff_datetime','tpep_dropoff_datetime',13630784,27249800,true,(3475226,0,NULL,'1734508360000000','1738453451000000'),0),('passenger_count','passenger_count',565309,1248716,true,(2935077,540149,NULL,'0','9'),0),('trip_distance','trip_distance',4879027,5458535,true,(3475226,0,NULL,'0.000000','276423.570000'),0),('RatecodeID','RatecodeID',277677,543574,true,(2935077,540149,NULL,'1','99'),0),('store_and_fwd_flag','store_and_fwd_flag',30404,35900,true,(2935077,540149,NULL,'N','Y'),0),('PULocationID','PULocationID',2657564,3559027,true,(3475226,0,NULL,'1','265'),0),('DOLocationID','DOLocationID',3491779,3906660,true,(3475226,0,NULL,'1','265'),0),('payment_type','payment_type',459559,910126,true,(3475226,0,NULL,'0','5'),0),('fare_amount','fare_amount',3884220,5325026,true,(3475226,0,NULL,'-900.000000','863372.120000'),0),('extra','extra',893528,2071927,true,(3475226,0,NULL,'-7.500000','15.000000'),0),('mta_tax','mta_tax',177152,371194,true,(3475226,0,NULL,'-0.500000','10.500000'),0),('tip_amount','tip_amount',3756133,4929627,true,(3475226,0,NULL,'-86.000000','400.000000'),0),('tolls_amount','tolls_amount',405014,1825209,true,(3475226,0,NULL,'-126.940000','170.940000'),0),('improvement_surcharge','improvement_surcharge',160411,298522,true,(3475226,0,NULL,'-1.000000','1.000000'),0),('total_amount','total_amount',6185099,6513089,true,(3475226,0,NULL,'-901.000000','863380.370000'),0),('congestion_surcharge','congestion_surcharge',303021,556454,true,(2935077,540149,NULL,'-2.500000','2.500000'),0),('Airport_fee','Airport_fee',272186,594481,true,(2935077,540149,NULL,'-1.750000','6.750000'),0),('cbd_congestion_fee','cbd_congestion_fee',455328,776959,true,(3475226,0,NULL,'-0.750000','0.750000'),0)])]
1 row in set. Elapsed: 0.117 sec.
Metadata
Metadata
Assignees
Labels
No labels