Reading data from partitioned S3 storage is not yet implemented:

```sql
CREATE TABLE forex_p (
    datetime DateTime64(3),
    bid String,
    ask String,
    base String,
    quote String,
    month String
)
ENGINE = S3('https://datasets-documentation.s3.amazonaws.com/forex/csv/year_month/2000{_partition_id}-tick.csv.zst', 'CSVWithNames')
PARTITION BY month;

SELECT count(*) FROM forex_p WHERE month = '05';
-- Reading from a partitioned S3 storage is not implemented yet. (NOT_IMPLEMENTED)
```
Usually one needs to read from the s3 table or table function with globs and filter by the `_file` or `_path` virtual columns to read only the files/paths we care about. However, with `extract` and a view, life can be made a little easier:
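For comparison, a minimal sketch of that conventional pattern (glob over the files, then prune with the `_file` virtual column; same dataset URL as above):

```sql
-- Glob over both months, then keep only rows coming from the May file
SELECT count(*)
FROM s3('https://datasets-documentation.s3.amazonaws.com/forex/csv/year_month/2000{05,06}-tick.csv.zst', 'CSVWithNames')
WHERE _file = '200005-tick.csv.zst';
```

The downside is that the full file name has to be spelled out in every query, which is what the view below avoids.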
```sql
SELECT count(*)
FROM s3('https://datasets-documentation.s3.amazonaws.com/forex/csv/year_month/2000{05,06}-tick.csv.zst', 'CSVWithNames');
-- 181425
-- 1 row in set. Elapsed: 3.632 sec. Processed 4.24 thousand rows, 38.14 KB (1.17 thousand rows/s., 10.50 KB/s.)

SELECT count(*)
FROM s3('https://datasets-documentation.s3.amazonaws.com/forex/csv/year_month/200005-tick.csv.zst', 'CSVWithNames');
-- 4238
-- 1 row in set. Elapsed: 2.634 sec.
```
```sql
CREATE VIEW s3_partition_read AS (
    SELECT
        *,
        _file,
        extract(_file, '2000([0-9]{2})') AS month
    FROM s3('https://datasets-documentation.s3.amazonaws.com/forex/csv/year_month/2000{05,06}-tick.csv.zst', 'CSVWithNames')
);

SELECT count(*) FROM s3_partition_read;
-- 181425

SELECT count(*) FROM s3_partition_read WHERE month = '05';
-- 4238
```
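The `extract(_file, '2000([0-9]{2})')` call pulls the month out of the file name with a capturing group: `extract` returns the first subpattern if the pattern has one, otherwise the whole match, and an empty string when nothing matches. In Python terms (a sketch of the regex semantics, not ClickHouse itself; the helper name is made up for illustration):

```python
import re


def extract(haystack: str, pattern: str) -> str:
    """Mimic ClickHouse extract(): first capture group if present,
    else the whole match, else an empty string."""
    m = re.search(pattern, haystack)
    if not m:
        return ''
    return m.group(1) if m.groups() else m.group(0)


# The view derives 'month' from the _file virtual column like this:
print(extract('200005-tick.csv.zst', r'2000([0-9]{2})'))  # -> 05
print(extract('200006-tick.csv.zst', r'2000([0-9]{2})'))  # -> 06
```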
I don't have verbose output to confirm that only the relevant file was read (any tips on how to get the list of transferred S3 files are appreciated), only the time difference. Also, for some reason, the number of processed rows only shows up above a certain amount of transferred data.
Hope that helps someone, good luck!