UnamedRus changed the title from "virtual column _row_number for file, s3, hdfs functions." to "virtual column _row_number, _error for file, s3, hdfs functions." on Dec 25, 2021
UnamedRus changed the title from "virtual column _row_number, _error for file, s3, hdfs functions." to "virtual column _row_number, _error, _raw_value for file, s3, hdfs functions." on Dec 25, 2021
Hi!
Just wanted to bump this feature request.
For Parquet files, some virtual columns would be quite useful:
_row_group_number - number of the row group within the current file that contains the given row
_row_number_in_row_group - number of the row within its row group
_row_number - number of the row within the file
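To make the intended relationship between these three columns concrete, here is a minimal sketch in plain Python. It is only an illustration of the proposed semantics, not ClickHouse code: the `RowPosition` type and the `row_positions` helper are hypothetical names, and zero-based numbering is an assumption, since the request does not say where counting starts.

```python
from typing import Iterator, List, NamedTuple


class RowPosition(NamedTuple):
    """Hypothetical container mirroring the three proposed virtual columns."""
    row_group_number: int          # _row_group_number
    row_number_in_row_group: int   # _row_number_in_row_group
    row_number: int                # _row_number


def row_positions(row_group_sizes: List[int]) -> Iterator[RowPosition]:
    """Yield one position per row, given the row count of each row group.

    Assumption: all three counters are zero-based.
    """
    row_number = 0
    for group_number, size in enumerate(row_group_sizes):
        for offset in range(size):
            yield RowPosition(group_number, offset, row_number)
            row_number += 1


# A file with two row groups holding 3 and 2 rows respectively.
positions = list(row_positions([3, 2]))
```

Because each row group's starting offset in the file can be derived from the sizes of the groups before it, row groups could in principle be numbered independently of each other, which is what would let such columns avoid the single-thread limitation described below.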
Right now it's possible to use row_number() over (), but it has some major disadvantages: it limits the thread count to 1, and it requires enough memory to materialize the full query result before any aggregation can be applied on top of such a table.
Such virtual columns would unlock some interesting options for processing Parquet files.
Use case
Describe the solution you'd like
Provide a way to get the row number, within the source file, of each row ingested via the s3/hdfs/file table functions.
Describe alternatives you've considered
But it's too slow.
Additional context
We could probably have some other columns as well, such as _raw, for rows that fail to parse.