Adding tokenizer factory to reduce code duplication by george-larionov · Pull Request #91529 · ClickHouse/ClickHouse

george-larionov · 2025-12-05T05:12:16Z

Adds a tokenizer factory to reduce code duplication in several places. A side effect of this change is that the three places where we use tokenizers (the regular text index, the bloom filter text index, and the tokens function) now share the same defaults and parameter checks. Previously, the bloom filter text index allowed any value for ngram_length, while the other two allowed only values between 2 and 8. I decided to allow values between 1 and 8 for all three, but this could be changed if needed.
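To illustrate the idea (not the actual ClickHouse C++ code), here is a minimal Python sketch of a factory that centralizes tokenizer creation and parameter checks, so that every call site gets the same defaults and the same ngram_length bounds. The class and function names, the "whitespace" tokenizer, and the default length of 3 are assumptions made for this example; only the 1 to 8 range comes from the description above.

```python
from typing import Callable, Dict, List, Sequence

# Bounds from the PR description; the default value is an assumption.
MIN_NGRAM_LENGTH = 1
MAX_NGRAM_LENGTH = 8
DEFAULT_NGRAM_LENGTH = 3

Tokenizer = Callable[[str], List[str]]
Creator = Callable[[Sequence[object]], Tokenizer]


def create_ngrams(params: Sequence[object]) -> Tokenizer:
    """Validate parameters once, then build an n-gram tokenizer."""
    if len(params) > 1:
        raise ValueError("ngrams tokenizer accepts at most one parameter")
    n = params[0] if params else DEFAULT_NGRAM_LENGTH
    if not isinstance(n, int) or isinstance(n, bool):
        raise ValueError("ngram_length must be an integer")
    if not MIN_NGRAM_LENGTH <= n <= MAX_NGRAM_LENGTH:
        raise ValueError(
            f"ngram_length must be between {MIN_NGRAM_LENGTH} "
            f"and {MAX_NGRAM_LENGTH}, got {n}"
        )
    # Sliding window of n characters over the input text.
    return lambda text: [text[i:i + n] for i in range(len(text) - n + 1)]


def create_whitespace(params: Sequence[object]) -> Tokenizer:
    """Hypothetical parameterless tokenizer, used only for illustration."""
    if params:
        raise ValueError("whitespace tokenizer accepts no parameters")
    return lambda text: text.split()


class TokenizerFactory:
    """Single place where tokenizers are looked up, validated, and built."""

    def __init__(self) -> None:
        self._creators: Dict[str, Creator] = {}

    def register(self, name: str, creator: Creator) -> None:
        self._creators[name] = creator

    def create(self, name: str, params: Sequence[object] = ()) -> Tokenizer:
        if name not in self._creators:
            raise ValueError(f"unknown tokenizer: {name}")
        return self._creators[name](params)


def create_default_factory() -> TokenizerFactory:
    factory = TokenizerFactory()
    factory.register("ngrams", create_ngrams)
    factory.register("whitespace", create_whitespace)
    return factory
```

In this sketch, each of the three call sites would call `create_default_factory().create(name, params)` instead of repeating its own checks. For instance, `create("ngrams", [1])` is accepted and yields one token per character, while `create("ngrams", [9])` is rejected with the same error everywhere.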

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Ngrams tokenizer can now be built with ngram_length = 1.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

clickhouse-gh · 2025-12-05T05:12:44Z

Workflow [PR], commit [c5de17c]

Summary: ❌

job_name	test_name	status	info
Stateless tests (arm_asan, targeted)		failure
	02346_text_index_hits	FAIL	cidb
Integration tests (amd_tsan, 2/6)		failure
	test_storage_s3_queue/test_5.py::test_failed_startup	FAIL	cidb
BuzzHouse (amd_debug)		failure
	Logical error: Inconsistent AST formatting in A: the query:	FAIL	cidb

rschu1ze · 2025-12-08T18:19:54Z

Stateless tests (arm_asan, targeted)

02346_text_index_hits: timed out, unrelated, will make a fix for this separately

Integration tests (amd_tsan, 2/6)

test_storage_s3_queue/test_5.py::test_failed_startup: flaky, see #88918

BuzzHouse (amd_debug)

Logical error: Inconsistent AST formatting in A: the query: unrelated (10+ open bugs exist)

clickhouse-gh bot added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Dec 5, 2025

george-larionov force-pushed the code-refactoring-add-tokenizer-factory branch 2 times, most recently from ae9d4a7 to 2408fe4, December 5, 2025 21:53

george-larionov added 9 commits December 6, 2025 20:23

adding simple tokenizer factory just for single instance for now

c672651

updating tokenizer factory to work for both text index and bloom filter text index

1cb17c3

tokenizer factory supports all instances of tokenizer creation

d2bdae1

using new factory to reduce repeated code in validator function

3fc8865

style fixes and updating bounds for ngram length

bb11b81

making buffer use correct overload

7f1eae0

updating test and adding some param checks

813c1be

updating doc and tests

50654ad

small error message fixes

ffffc73

george-larionov force-pushed the code-refactoring-add-tokenizer-factory branch from 2408fe4 to ffffc73, December 6, 2025 20:24

clickhouse-gh bot added the pr-improvement label (Pull request with some product improvements) and removed the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Dec 6, 2025

george-larionov marked this pull request as ready for review December 6, 2025 20:36

Merge remote-tracking branch 'ClickHouse/master' into code-refactoring-add-tokenizer-factory

2f84d3a

rschu1ze self-assigned this Dec 8, 2025

rschu1ze added 2 commits December 8, 2025 09:51

Minor fixes

59315ec

More minor fixes

c5de17c

rschu1ze approved these changes Dec 8, 2025

rschu1ze added this pull request to the merge queue Dec 8, 2025

Merged via the queue into master with commit 7fa1376 Dec 8, 2025
126 of 130 checks passed

rschu1ze deleted the code-refactoring-add-tokenizer-factory branch December 8, 2025 18:36

robot-ch-test-poll3 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Dec 8, 2025

This was referenced Dec 12, 2025

Missing validation on ngrambf index #91854

Closed

extra validation for ngram length #92024

Merged

rschu1ze mentioned this pull request Dec 19, 2025

Avoid checking the upper length limit for ngram tokenizer #92672

Merged

rschu1ze merged 12 commits into master from code-refactoring-add-tokenizer-factory

3 participants
