BOM option to s3 insertion #94090

Open
mmarkell wants to merge 1 commit into ClickHouse:master from mmarkell:s3-bom

@mmarkell mmarkell commented Jan 13, 2026

### Changelog category (leave one):
- Backward Incompatible Change

### Changelog entry (a [user-readable short description](https://github.com/ClickHouse/ClickHouse/blob/master/docs/changelog_entry_guidelines.md) of the changes that goes into CHANGELOG.md):
- Introduce BOM encoding option for s3 CSV/TAB insertion

### Documentation entry for user-facing changes

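If enabled, write a UTF-8 BOM (Byte Order Mark) at the beginning of CSV or TSV output, controlled by `output_format_csv_write_bom` and `output_format_tsv_write_bom` respectively. This helps Excel correctly identify the file encoding.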

Usage:
```sql
CREATE TABLE test (id UInt8, name String) ENGINE = MergeTree() ORDER BY id;
INSERT INTO test VALUES (1, 'hello'), (2, 'world');
SELECT * FROM test FORMAT TabSeparated SETTINGS output_format_tsv_write_bom = 1;
```
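
To confirm the setting took effect, one can check that the output starts with the three bytes `EF BB BF`. The following is a minimal C++ sketch of such a check; `utf8_bom` and `hasUtf8Bom` are hypothetical names used for illustration and are not part of ClickHouse.

```cpp
#include <string_view>

// The three bytes of the UTF-8 encoded byte order mark (U+FEFF).
constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";

// Returns true if `data` begins with the UTF-8 BOM.
// Hypothetical helper for illustration; not part of ClickHouse.
bool hasUtf8Bom(std::string_view data)
{
    // substr clamps the length, so inputs shorter than 3 bytes are safe.
    return data.substr(0, utf8_bom.size()) == utf8_bom;
}
```

When writing test literals, keep the BOM in its own string literal (for example `"\xEF\xBB\xBF" "1\thello\n"`), since a hex escape followed directly by a hex digit would otherwise be parsed as one longer escape.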

CLAassistant commented Jan 13, 2026

CLA assistant check
All committers have signed the CLA.

@scanhex12 added the "can be tested" label (Allows running workflows for external contributors) Jan 13, 2026

clickhouse-gh bot commented Jan 13, 2026

Workflow [PR], commit [4a5fb4c]

Summary:

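| job_name | test_name | status | info | comment |
| --- | --- | --- | --- | --- |
| Fast test | | failure | | |
| | 02995_new_settings_history | FAIL | cidb | |
| Build (amd_debug) | | dropped | | |
| Build (amd_asan) | | dropped | | |
| Build (amd_tsan) | | dropped | | |
| Build (amd_msan) | | dropped | | |
| Build (amd_ubsan) | | dropped | | |
| Build (amd_binary) | | dropped | | |
| Build (arm_asan) | | dropped | | |
| Build (arm_binary) | | dropped | | |
| Build (arm_tsan) | | dropped | | |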

@scanhex12 scanhex12 self-assigned this Jan 13, 2026
@scanhex12 scanhex12 requested a review from Copilot January 13, 2026 17:49

Copilot AI left a comment

Pull request overview

This PR introduces BOM (Byte Order Mark) encoding options for CSV and TSV output formats to improve Excel compatibility. When enabled, a UTF-8 BOM (\xEF\xBB\xBF) is written at the beginning of the output.

Changes:

  • Added two new settings: output_format_csv_write_bom and output_format_tsv_write_bom
  • Modified CSV and TSV output format processors to write BOM when enabled
  • Added comprehensive tests covering both formats with and without BOM
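
The pattern described in the changes above, writing the BOM once in the output prefix when a flag is set, can be sketched as follows. This is a simplified, self-contained illustration and not the code from this PR: `TsvSettings`, `TsvWriter`, and `renderExample` are hypothetical names, only the `write_bom` field name is taken from the summary, and the sketch omits the escaping a real TSV writer performs.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-format settings struct; only the
// `write_bom` field name comes from the PR summary.
struct TsvSettings
{
    bool write_bom = false;
};

// Minimal TSV writer: emits an optional UTF-8 BOM once, before any row.
// No escaping of tabs or newlines is done in this sketch.
class TsvWriter
{
public:
    TsvWriter(std::ostream & out_, TsvSettings settings_) : out(out_), settings(settings_) {}

    // Called once before the first row, mirroring an "output prefix" step.
    void writePrefix()
    {
        if (settings.write_bom)
            out << "\xEF\xBB\xBF";
    }

    // Writes one row as tab-separated fields terminated by a newline.
    void writeRow(const std::vector<std::string> & fields)
    {
        for (std::size_t i = 0; i < fields.size(); ++i)
        {
            if (i != 0)
                out << '\t';
            out << fields[i];
        }
        out << '\n';
    }

private:
    std::ostream & out;
    TsvSettings settings;
};

// Renders the two example rows from the PR description's usage query.
std::string renderExample(bool write_bom)
{
    std::ostringstream out;
    TsvWriter writer(out, TsvSettings{write_bom});
    writer.writePrefix();
    writer.writeRow({"1", "hello"});
    writer.writeRow({"2", "world"});
    return out.str();
}
```

Putting the BOM in the prefix step rather than in the row writer keeps it at byte offset zero and ensures it appears exactly once per output stream.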

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/Core/FormatFactorySettings.h | Declares new boolean settings for CSV and TSV BOM output |
| src/Formats/FormatSettings.h | Adds write_bom fields to CSV and TSV format setting structures |
| src/Formats/FormatFactory.cpp | Wires up the new settings from context to format settings |
| src/Processors/Formats/Impl/CSVRowOutputFormat.cpp | Implements BOM writing in CSV output prefix |
| src/Processors/Formats/Impl/TabSeparatedRowOutputFormat.cpp | Implements BOM writing in TSV output prefix |
| tests/queries/0_stateless/03788_csv_tsv_write_bom.sql | Test queries for both formats with and without BOM |
| tests/queries/0_stateless/03788_csv_tsv_write_bom.reference | Expected output showing BOM character (U+FEFF) at appropriate positions |

@mmarkell (Author) commented:

@scanhex12 Hi! I am curious what the process is for getting this merged. The failing tests seem unrelated to this commit, so I'm unsure what to do next.

@scanhex12 (Member) commented:

@mmarkell Fast tests should always be green, so the failure is related. You need to add your new settings to this file: https://github.com/ClickHouse/ClickHouse/blob/master/src/Core/SettingsChangesHistory.cpp#L44

Development

Successfully merging this pull request may close these issues.

Add UTF8 BOM option to s3 Table Function
