### Changelog category (leave one):
- Backward Incompatible Change

### Changelog entry (a [user-readable short description](https://github.com/ClickHouse/ClickHouse/blob/master/docs/changelog_entry_guidelines.md) of the changes that goes into CHANGELOG.md):
- Introduce BOM encoding option for s3 CSV/TAB insertion

### Documentation entry for user-facing changes
If enabled, write a UTF-8 BOM (Byte Order Mark) at the beginning of CSV or TSV output. This helps Excel correctly identify the file encoding.

Usage:
```sql
CREATE TABLE test (id UInt8, name String) ENGINE = MergeTree() ORDER BY id;
INSERT INTO test VALUES (1, 'hello'), (2, 'world');
SELECT * FROM test FORMAT TabSeparated SETTINGS output_format_tsv_write_bom = 1;
```
Workflow [PR], commit [4a5fb4c] Summary: ❌
Pull request overview
This PR introduces BOM (Byte Order Mark) encoding options for CSV and TSV output formats to improve Excel compatibility. When enabled, a UTF-8 BOM (\xEF\xBB\xBF) is written at the beginning of the output.
Changes:
- Added two new settings: `output_format_csv_write_bom` and `output_format_tsv_write_bom`
- Modified CSV and TSV output format processors to write BOM when enabled
- Added comprehensive tests covering both formats with and without BOM
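To make the intended byte-level effect concrete, here is a minimal Python sketch of what "write BOM when enabled" means for a consumer. It is not the ClickHouse implementation; `write_csv` and `has_bom` are hypothetical helpers used only for illustration, and the rows mirror the usage example from the PR description.

```python
import codecs
import csv
import io

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def write_csv(rows, write_bom=False):
    """Serialize rows to CSV bytes, optionally prefixed with a UTF-8 BOM.

    Hypothetical helper: it only mimics the effect of a write-BOM setting.
    """
    text = io.StringIO()
    csv.writer(text, lineterminator="\n").writerows(rows)
    body = text.getvalue().encode("utf-8")
    # The BOM is emitted once, before the first row, never per row.
    return (UTF8_BOM if write_bom else b"") + body


def has_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(UTF8_BOM)


rows = [(1, "hello"), (2, "world")]
with_bom = write_csv(rows, write_bom=True)
without_bom = write_csv(rows)

# A reader that understands the BOM can strip it with the "utf-8-sig" codec.
decoded = with_bom.decode("utf-8-sig")
```

The two outputs differ only in the three leading bytes, so a test can plausibly compare the payload after the prefix against the output produced with the setting disabled.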
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/Core/FormatFactorySettings.h | Declares new boolean settings for CSV and TSV BOM output |
| src/Formats/FormatSettings.h | Adds write_bom fields to CSV and TSV format setting structures |
| src/Formats/FormatFactory.cpp | Wires up the new settings from context to format settings |
| src/Processors/Formats/Impl/CSVRowOutputFormat.cpp | Implements BOM writing in CSV output prefix |
| src/Processors/Formats/Impl/TabSeparatedRowOutputFormat.cpp | Implements BOM writing in TSV output prefix |
| tests/queries/0_stateless/03788_csv_tsv_write_bom.sql | Test queries for both formats with and without BOM |
| tests/queries/0_stateless/03788_csv_tsv_write_bom.reference | Expected output showing the BOM character (U+FEFF) at appropriate positions |
@scanhex12 Hi! I am curious what the process is for getting this merged. It seems the failing tests are unrelated to this commit, so I'm unsure what to do next.

@mmarkell Fast tests should always be green, so the failure is related. You need to add your setting to this file: https://github.com/ClickHouse/ClickHouse/blob/master/src/Core/SettingsChangesHistory.cpp#L44