Skip to content

Adding tokenizer factory to reduce code duplication#91529

Merged
rschu1ze merged 12 commits intomasterfrom
code-refactoring-add-tokenizer-factory
Dec 8, 2025
Merged

Adding tokenizer factory to reduce code duplication#91529
rschu1ze merged 12 commits intomasterfrom
code-refactoring-add-tokenizer-factory

Conversation

@george-larionov
Copy link
Copy Markdown
Member

@george-larionov george-larionov commented Dec 5, 2025

Adding tokenizer factory to reduce code duplication in several places. A side effect of this change is that the 3 places where we use tokenizers (regular text index, bloom filter text index, and tokens function) now all use the same defaults and params checks. Before, the bloom filter text index allowed any value for ngram_length, while the other two allowed only values between 2 and 8. I decided to make the allowed values be between 1 and 8 for all, but this could be changed if needed.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Ngrams tokenizer can now be built with ngram_length = 1.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Dec 5, 2025

Workflow [PR], commit [c5de17c]

Summary:

job_name test_name status info comment
Stateless tests (arm_asan, targeted) failure
02346_text_index_hits FAIL cidb
Integration tests (amd_tsan, 2/6) failure
test_storage_s3_queue/test_5.py::test_failed_startup FAIL cidb
BuzzHouse (amd_debug) failure
Logical error: Inconsistent AST formatting in A: the query: FAIL cidb

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Dec 5, 2025
@george-larionov george-larionov force-pushed the code-refactoring-add-tokenizer-factory branch 2 times, most recently from ae9d4a7 to 2408fe4 Compare December 5, 2025 21:53
@george-larionov george-larionov force-pushed the code-refactoring-add-tokenizer-factory branch from 2408fe4 to ffffc73 Compare December 6, 2025 20:24
@clickhouse-gh clickhouse-gh bot added pr-improvement Pull request with some product improvements and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Dec 6, 2025
@george-larionov george-larionov marked this pull request as ready for review December 6, 2025 20:36
@rschu1ze rschu1ze self-assigned this Dec 8, 2025
@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Dec 8, 2025

Stateless tests (arm_asan, targeted)

  • 02346_text_index_hits: timed out, unrelated, will make a fix for this separately

Integration tests (amd_tsan, 2/6)

BuzzHouse (amd_debug)

  • Logical error: Inconsistent AST formatting in A: the query is unrelated (10+ open bugs exist)

@rschu1ze rschu1ze added this pull request to the merge queue Dec 8, 2025
Merged via the queue into master with commit 7fa1376 Dec 8, 2025
126 of 130 checks passed
@rschu1ze rschu1ze deleted the code-refactoring-add-tokenizer-factory branch December 8, 2025 18:36
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants