Update to make column aliases and named columns syntax complementary
Now they are both allowed, with column names being used for validation and
column aliases being used for column naming.
quodlibetor committed Jul 26, 2021
1 parent 1dd7344 commit f72d8ac
Showing 1 changed file with 41 additions and 17 deletions.
58 changes: 41 additions & 17 deletions doc/developer/design/20210713_S3_sources_with_headers.md
@@ -72,7 +72,7 @@ into their combination with S3, introducing new syntax for the CSV format that
explicitly lists all column names, which will be the purification target for all
CSV format clauses.

The new syntax is `WITH (HEADER AND)? NAMED COLUMNS (colname (, colname)*)`. See
The new syntax is `WITH HEADER, COLUMNS (colname (, colname)*)`. See
examples below.

[csv-purify]: https://github.com/MaterializeInc/materialize/blob/88cf93c3309ca62/src/sql/src/pure.rs#L480-L501
@@ -92,39 +92,59 @@ will be rewritten via purify to:
```sql
CREATE SOURCE example
FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket'
FORMAT CSV WITH HEADER AND NAMED COLUMNS (id, value);
FORMAT CSV WITH HEADER, COLUMNS (id, value);
```

Conversely, the following create source statement (and the equivalent without
headers) will be rejected if it is not immediately possible to determine the
headers because there is no object in the queue:
We preserve column aliases if present, so the following statement with the same
file containing `id,value` header line:

```sql
CREATE SOURCE example (a, b)
FROM S3 DISCOVER OBJECTS USING BUCKET SCAN 'bucket'
FORMAT CSV WITH HEADER;
```

will be rewritten to:

```sql
CREATE SOURCE example (a, b)
FROM S3 DISCOVER OBJECTS USING USING BUCKET SCAN 'bucket'
FORMAT CSV WITH HEADER, COLUMNS (id, value);
```

and the columns will have the names "a" and "b". If a source file or object is
encountered that has a header column that does not match the `COLUMNS (name,
..)` declaration, the dataflow will be put in an error state. But see future
work below for how we may handle this more gracefully in the future.

#### Handling missing objects for header discovery

As part of purification any source that must read a header will fail immediately
if it is not possible to determine the header.

Consider, the following create source statement will be rejected if it is not
immediately possible to determine the headers because there is no object in the
queue:

```sql
CREATE SOURCE example
FROM S3 DISCOVER OBJECTS USING SQS NOTIFICATIONS 'queuename'
FORMAT CSV WITH HEADER;
```

it will instead require that users specify the column names using one of the
following syntaxes:
it will instead require that users specify the column names using the following
syntax:

*
```sql
CREATE SOURCE example (id, value)
FROM S3 DISCOVER OBJECTS USING SQS NOTIFICATIONS 'queuename'
FORMAT CSV WITH HEADER;
```
*
```sql
CREATE SOURCE example
FROM S3 DISCOVER OBJECTS USING SQS NOTIFICATIONS 'queuename'
FORMAT CSV WITH NAMED COLUMNS (id, value);
FORMAT CSV HEADER, COLUMNS (id, value);
```

All syntaxes for CSV-formatted sources (`WITH n COLUMNS` without an alias, `WITH
n COLUMNS` with aliases, `WITH HEADER`) will always be rewritten to the named
columns syntax inside the catalog. Column aliases and the named columns syntax
will not be allowed to be used at the same time.
columns syntax inside the catalog.

This means that the following create source statement:

@@ -139,9 +159,13 @@ will be rewritten to:
```sql
CREATE SOURCE example
FROM ...
FORMAT CSV WITH NAMED COLUMNS (column1, column2);
FORMAT CSV WITH COLUMNS (column1, column2);
```

If both column aliases and the named column syntax is used simultaneously the
header columns will be verified against the source files/objects, and the
columns presented into the dataflow will get the names from the column aliases.
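
As an illustration of the combined form, here is a minimal sketch that reuses the syntax proposed above. The source name, queue name, and column names are only illustrative, and the exact keyword spelling is assumed to follow the `WITH HEADER, COLUMNS (...)` form introduced earlier in this document:

```sql
-- Sketch only: aliases (a, b) name the output columns, while
-- COLUMNS (id, value) is what each object's header line is checked against.
CREATE SOURCE example (a, b)
FROM S3 DISCOVER OBJECTS USING SQS NOTIFICATIONS 'queuename'
FORMAT CSV WITH HEADER, COLUMNS (id, value);
```

Under this reading, an object whose header line is `id,value` would be accepted and exposed with columns `a` and `b`, whereas an object with a different header would put the dataflow into an error state, as described earlier.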

## Alternatives and future work

### Supporting a subset of schema fields