Skip to content

Feature Request: Support Dynamic File Loading with url function instead of Table Function #74629

@sean98

Description

@sean98

Current Behavior

Currently, the url table function allows querying files hosted on an HTTP server as follows:

SELECT * FROM url('http://localhost:80/file1', 'JSONEachRow')

This produces a result like:

┌─result─────────────────────────────────┐
│ { "source": "file1", "key": "value1" } │
│ { "source": "file1", "key": "value2" } │
└────────────────────────────────────────┘

Although it works well for a single file, handling multiple files becomes very complicated.
For example, to query multiple files, I must use UNION:

SELECT * FROM url('http://localhost/file1', 'JSONEachRow')
UNION
SELECT * FROM url('http://localhost/file2', 'JSONEachRow')
UNION
SELECT * FROM url('http://localhost/file3', 'JSONEachRow');

This approach becomes not feasible when dealing with a large or dynamic number of files.

Desired Behavior

I would like to query multiple files dynamically by leveraging a table of file names.
For instance, given a table of files, I could construct a query like this:

SELECT url('http://localhost/' || fileName, 'JSONEachRow') FROM files;

The expected result is:

┌─result───────────────────────────────────────────────────────────────────────────┐
│ [{ "source": "file1", "key": "value1" }, { "source": "file1", "key": "value2" }] │
│ [{ "source": "file2", "key": "value1" }, { "source": "file2", "key": "value2" }] │
│ [{ "source": "file3", "key": "value1" }, { "source": "file3", "key": "value2" }] │
└──────────────────────────────────────────────────────────────────────────────────┘

Than I could use arrayJoin to unfold and flatten this data into a table format identical to the one produced using multiple UNION clauses.

This feature would be particularly helpful for cases where files are created dynamically, as I could use a materialized view to query new files automatically upon creation.

This feature might be also helpful for other table functions like file or s3

If there is already a way to achieve this behaviour efficiently, I would like to know!

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions