feat: extract tokenizer into separate project with multi-tokenizer support#256
Conversation
feat: extract tokenizer into separate project with multi-tokenizer support

- Create new TelegramSearchBot.Tokenizer project with abstraction layer:
  - ITokenizer interface (Tokenize, SafeTokenize, TokenizeWithOffsets)
  - ITokenizerFactory for factory pattern
  - TokenizerMetadata for tokenizer info
  - TokenizerType enum for tokenizer types
- Implement SmartChineseTokenizer using Lucene SmartChineseAnalyzer
- Add TokenizerFactory implementation
- Create Tokenizer.Tests project with 18 unit tests
- Update Search project to use ITokenizer interface:
  - LuceneManager, SimpleSearchService, SyntaxSearchService
  - PhraseQueryProcessor, ContentQueryBuilder, ExtQueryBuilder
- Refactor SearchHelper to use ITokenizer with SetTokenizer()
- Remove UnifiedTokenizer duplicate implementation
- Delete dead code in SimpleSearchService

All 398 tests pass (398 Passed, 0 Failed).
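The abstraction layer described above is C#. As a language-agnostic illustration only, here is a minimal Python sketch of the same shape: the names `TokenWithOffset`, `TokenizerType`, and the factory mirror the C# types, while the whitespace tokenizer is a hypothetical stand-in for Lucene's SmartChineseAnalyzer.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Protocol


@dataclass(frozen=True)
class TokenWithOffset:
    start: int  # inclusive
    end: int    # exclusive
    term: str


class TokenizerType(Enum):
    SMART_CHINESE = auto()


class Tokenizer(Protocol):
    def tokenize(self, text: str) -> List[str]: ...
    def safe_tokenize(self, text: str) -> List[str]: ...
    def tokenize_with_offsets(self, text: str) -> List[TokenWithOffset]: ...


class WhitespaceTokenizer:
    """Stand-in implementation; the real project uses Lucene's SmartChineseAnalyzer."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()

    def safe_tokenize(self, text: str) -> List[str]:
        # Never raises: falls back to the raw text on failure.
        try:
            return self.tokenize(text)
        except Exception:
            return [text] if text else []

    def tokenize_with_offsets(self, text: str) -> List[TokenWithOffset]:
        tokens, pos = [], 0
        for term in text.split():
            start = text.index(term, pos)
            end = start + len(term)
            tokens.append(TokenWithOffset(start, end, term))
            pos = end
        return tokens


class TokenizerFactory:
    def create(self, kind: TokenizerType) -> Tokenizer:
        if kind is TokenizerType.SMART_CHINESE:
            return WhitespaceTokenizer()
        raise ValueError(f"unsupported tokenizer type: {kind}")
```

The factory indirection is what lets search code depend only on the interface while concrete tokenizers are swapped per `TokenizerType`.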
📝 Walkthrough

Tokenization was extracted into a new TelegramSearchBot.Tokenizer project (abstractions, SmartChinese implementation, factory, and tests). Search code and tools were switched from the removed UnifiedTokenizer to the new ITokenizer abstraction; SearchHelper was refactored to use token offsets and a pluggable tokenizer.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🔍 PR Check Report

📋 Check Overview
🧪 Test Results
📊 Code Quality
📁 Test Artifacts
🔗 Related Links

This report was generated automatically by GitHub Actions.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs (1)
**38-51:** ⚠️ Potential issue | 🟠 Major

Phrase queries will fail for repeated tokens due to deduplication.

The `SafeTokenize()` method (via `SmartChineseTokenizer.Tokenize()`) uses a `HashSet<string>` internally (see `SmartChineseTokenizer.cs:29`), which:

- Loses duplicate tokens: a phrase like `"北京 北京"` produces only `["北京"]` instead of `["北京", "北京"]`
- Does not guarantee token order: `HashSet` ordering is undefined

This breaks `BuildPhraseQueryForField()`, which relies on positional indices (lines 67-68). For phrase search, the tokenizer must preserve both order and duplicates.

Consider using `TokenizeWithOffsets()`, which returns tokens in order (from Lucene's token stream), or create a separate tokenization method that preserves duplicates:

```diff
- var terms = _tokenizer.SafeTokenize(phraseText).ToList();
+ // Use TokenizeWithOffsets to preserve order and duplicates
+ var terms = _tokenizer.TokenizeWithOffsets(phraseText)
+     .Select(t => t.Term)
+     .ToList();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs` around lines 38 - 51, SafeTokenize (via SmartChineseTokenizer.Tokenize) uses a HashSet and drops duplicates and ordering, which breaks BuildPhraseQueryForField's positional logic; replace the call to _tokenizer.SafeTokenize(phraseText) in the PhraseRegex loop with an ordered tokenization that preserves duplicates and positions (e.g., use TokenizeWithOffsets or add a new tokenizer method that returns tokens in sequence including repeated tokens), then pass that ordered list to BuildPhraseQueryForField and _extOptimizer.BuildOptimizedExtPhraseQuery so phrase position indices remain correct.
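To make the hazard above concrete, a small sketch (Python, illustrative only; the project itself is C#) contrasts set-based tokenization with the ordered tokenization a phrase query needs:

```python
def set_based_tokenize(text):
    # Mirrors the HashSet<string> behaviour flagged above:
    # duplicates are collapsed and ordering is undefined.
    return list(set(text.split()))


def ordered_tokenize(text):
    # Preserves order and repeated terms, as positional phrase queries require.
    return text.split()


phrase = "北京 北京"
assert set_based_tokenize(phrase) == ["北京"]       # duplicate lost
assert ordered_tokenize(phrase) == ["北京", "北京"]  # positions intact
```

A phrase query built from the set-based output would assign only one position, so the query can no longer match the original two-token phrase.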
🧹 Nitpick comments (8)
TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs (1)
**11-11:** Consider caching tokenizer instances to avoid repeated Lucene analyzer initialization.

The factory creates a new `SmartChineseTokenizer` instance on each `Create()` call. Since `SmartChineseAnalyzer` (from Lucene.Net) initialization is expensive (it involves dictionary and resource loading) and `SmartChineseTokenizer` is thread-safe (all fields are readonly after construction), caching instances per `TokenizerType` would improve performance without introducing thread-safety concerns.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs` at line 11: The factory's Create() returns a new SmartChineseTokenizer each call, which causes repeated expensive Lucene analyzer initialization; modify TokenizerFactory to cache tokenizer instances per TokenizerType (e.g., store a readonly/static ConcurrentDictionary<TokenizerType, ITokenizer> or a lazy-initialized field) and return the cached SmartChineseTokenizer for TokenizerType.SmartChinese in Create(), ensuring the cache is populated once (thread-safe) and reused thereafter; reference TokenizerFactory.Create(), TokenizerType.SmartChinese, SmartChineseTokenizer and SmartChineseAnalyzer when making the change.

TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs (2)
**3-3:** Document that the `End` offset is exclusive.

Based on `SmartChineseTokenizer.TokenizeWithOffsets()`, which uses Lucene's `IOffsetAttribute.EndOffset` (exclusive), the `End` property follows Lucene conventions. This should be documented to prevent off-by-one errors.

```diff
-public record TokenWithOffset(int Start, int End, string Term);
+/// <summary>
+/// Represents a token with its character offsets in the source text.
+/// </summary>
+/// <param name="Start">Inclusive start offset.</param>
+/// <param name="End">Exclusive end offset (Lucene convention).</param>
+/// <param name="Term">The extracted token text.</param>
+public record TokenWithOffset(int Start, int End, string Term);
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs` at line 3: Add documentation to the TokenWithOffset record clarifying that the End offset is exclusive (follows Lucene's IOffsetAttribute.EndOffset semantics) to avoid off-by-one mistakes; update the XML doc comment on the TokenWithOffset declaration (and mention the End property) to state explicitly that Start is inclusive and End is exclusive and that offsets come from SmartChineseTokenizer.TokenizeWithOffsets()/Lucene conventions.
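A quick sketch of the inclusive-start/exclusive-end convention (Python, illustrative only): with Lucene-style offsets, slicing `text[start:end]` recovers each term exactly.

```python
def tokenize_with_offsets(text):
    # Hypothetical whitespace tokenizer; offsets follow the Lucene
    # convention: start is inclusive, end is exclusive.
    tokens, pos = [], 0
    for term in text.split():
        start = text.index(term, pos)
        end = start + len(term)  # exclusive, like IOffsetAttribute.EndOffset
        tokens.append((start, end, term))
        pos = end
    return tokens


text = "hello world"
for start, end, term in tokenize_with_offsets(text):
    assert text[start:end] == term  # exclusive end makes slicing exact
```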
**5-11:** Consider adding XML documentation to clarify method contracts.

The interface design is clean, but the distinction between `Tokenize()` and `SafeTokenize()` is unclear from the interface alone. Adding documentation would help implementers and consumers understand the expected behavior.

📝 Proposed documentation

```diff
 public interface ITokenizer {
+    /// <summary>
+    /// Tokenizes the input text. May throw exceptions on invalid input.
+    /// </summary>
+    /// <param name="text">Text to tokenize.</param>
+    /// <returns>List of tokens extracted from the text.</returns>
     IReadOnlyList<string> Tokenize(string text);
+
+    /// <summary>
+    /// Tokenizes the input text with exception handling.
+    /// Returns an empty list instead of throwing on errors.
+    /// </summary>
+    /// <param name="text">Text to tokenize.</param>
+    /// <returns>List of tokens, or an empty list if tokenization fails.</returns>
     IReadOnlyList<string> SafeTokenize(string text);
+
+    /// <summary>
+    /// Tokenizes the input text and returns tokens with their character offsets.
+    /// </summary>
+    /// <param name="text">Text to tokenize.</param>
+    /// <returns>List of tokens with start/end offsets (end is exclusive).</returns>
     IReadOnlyList<TokenWithOffset> TokenizeWithOffsets(string text);
+
+    /// <summary>
+    /// Metadata describing this tokenizer implementation.
+    /// </summary>
     TokenizerMetadata Metadata { get; }
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs` around lines 5-11: Add XML documentation to the ITokenizer interface to clarify the contract for each member: document ITokenizer.Tokenize(string) describing expected normalization, token boundaries, behavior on null/empty input and whether it may throw; document ITokenizer.SafeTokenize(string) to state how it differs (e.g., never throws, returns an empty list on invalid input, or falls back to a safe tokenization strategy); document TokenizeWithOffsets(string) to specify that the returned TokenWithOffset values align with original string indices and how offsets are computed; and document the Metadata property to explain what tokenizer capabilities and limits it exposes. Use the method/property names (Tokenize, SafeTokenize, TokenizeWithOffsets, Metadata, ITokenizer) so implementers can follow the exact expected behaviors.

TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs (2)
**35-47:** Test name is misleading: SafeTokenize error handling isn't actually tested.

The test `SafeTokenize_ReturnsNonEmptyList_EvenOnError` doesn't actually exercise error-handling behavior. `MockTokenizer.SafeTokenize` simply delegates to `Tokenize` without any error simulation. Consider either:

- Renaming to `SafeTokenize_ReturnsNonEmptyList_ForValidText`
- Or adding a mock that throws in `Tokenize` to verify `SafeTokenize` handles it gracefully

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 35-47: The test name claims to validate SafeTokenize's error handling but MockTokenizer.SafeTokenize just delegates to Tokenize and no error is simulated; either rename the test to SafeTokenize_ReturnsNonEmptyList_ForValidText to reflect the current behavior, or modify the test to validate error handling by creating a mock/stub whose Tokenize method throws (or returns null) and then call SafeTokenize to assert it still returns a non-empty, non-null result; target MockTokenizer and its Tokenize/SafeTokenize methods when making the change.
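The throwing-mock option can be sketched like this (Python analogue of the suggested C# stub; names are hypothetical): `tokenize()` always fails, proving that `safe_tokenize()` falls back instead of propagating the exception.

```python
class ThrowingTokenizer:
    """Stub whose tokenize() always fails, to exercise the safe path."""

    def tokenize(self, text):
        raise RuntimeError("simulated analyzer failure")

    def safe_tokenize(self, text):
        try:
            return self.tokenize(text)
        except Exception:
            # Fallback mirrors the SmartChineseTokenizer behaviour described
            # in this review: return the raw text as a single token.
            return [text] if text else []


tok = ThrowingTokenizer()
assert tok.safe_tokenize("北京天气") == ["北京天气"]  # no exception escapes
```

A test written against such a stub actually earns the `EvenOnError` suffix, unlike one that only feeds valid input.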
**73-87:** Missing test coverage for `TokenizeWithOffsets`.

The `MockTokenizer` implements `TokenizeWithOffsets` but no test exercises this method. Consider adding a test to validate the contract:

📝 Suggested test

```csharp
[Fact]
public void TokenizeWithOffsets_ReturnsTokensWithCorrectOffsets()
{
    // Arrange
    var tokenizer = new MockTokenizer();

    // Act
    var result = tokenizer.TokenizeWithOffsets("hello world");

    // Assert
    Assert.NotNull(result);
    Assert.Equal(2, result.Count);
    Assert.Equal(0, result[0].Start);
    Assert.Equal(5, result[0].End);
    Assert.Equal("hello", result[0].Term);
}
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 73-87: Add a unit test that exercises MockTokenizer.TokenizeWithOffsets to verify it returns correct TokenWithOffset entries: create a test (e.g., TokenizeWithOffsets_ReturnsTokensWithCorrectOffsets) that instantiates MockTokenizer, calls TokenizeWithOffsets("hello world"), asserts the result is not null, has Count == 2, and checks the first token's Start == 0, End == 5 and Term == "hello" (and optionally checks the second token's offsets/term); this ensures the TokenWithOffset contract is covered.

TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs (2)
**99-125:** Log callback tests don't verify actual logging behavior.

The tests with log callbacks only verify that the methods don't throw, but don't assert that the callback was actually invoked or verify log content. If the goal is to test logging integration, consider:

📝 Example enhancement for logging verification

```diff
 [Fact]
-public void Tokenize_WithLogAction_DoesNotThrow()
+public void Tokenize_WithLogAction_InvokesCallbackOnFailure()
 {
     // Arrange
     var logger = new List<string>();
     var tokenizerWithLog = new SmartChineseTokenizer(msg => logger.Add(msg));

-    // Act
+    // Act - trigger an error scenario if possible, or verify callback mechanism
     var result = tokenizerWithLog.Tokenize("测试文本");

-    // Assert
+    // Assert - at minimum verify the tokenizer works with callback attached
     Assert.NotNull(result);
+    // Note: logger may be empty for successful tokenization
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs` around lines 99-125: The tests Tokenize_WithLogAction_DoesNotThrow and SafeTokenize_WithLogAction_DoesNotThrow only assert no exception; update them to also verify the provided log callback was invoked by asserting on the logger list (e.g., Assert.True(logger.Count > 0) and/or Assert.Contains(expectedMessageSubstring, logger)) after calling SmartChineseTokenizer.Tokenize and SafeTokenize respectively; reference the SmartChineseTokenizer constructor (msg => logger.Add(msg)), the Tokenize and SafeTokenize calls, and assert logger contents or count to confirm actual logging behavior.
**6-126:** Missing test coverage for `TokenizeWithOffsets`.

The `SmartChineseTokenizer.TokenizeWithOffsets` method is not tested. This method is used for offset-based matching in search results and should have dedicated tests verifying correct offset calculation.

📝 Suggested tests

```csharp
[Fact]
public void TokenizeWithOffsets_ReturnsTokensWithValidOffsets()
{
    // Act
    var result = _tokenizer.TokenizeWithOffsets("今天天气真好");

    // Assert
    Assert.NotNull(result);
    Assert.NotEmpty(result);
    Assert.All(result, token =>
    {
        Assert.True(token.Start >= 0);
        Assert.True(token.End > token.Start);
        Assert.False(string.IsNullOrEmpty(token.Term));
    });
}

[Fact]
public void TokenizeWithOffsets_ReturnsEmptyList_ForEmptyText()
{
    // Act
    var result = _tokenizer.TokenizeWithOffsets("");

    // Assert
    Assert.NotNull(result);
    Assert.Empty(result);
}
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs` around lines 6-126: Add unit tests covering SmartChineseTokenizer.TokenizeWithOffsets: create tests in SmartChineseTokenizerTests that call _tokenizer.TokenizeWithOffsets with Chinese text and with an empty string, then assert the results are non-null and non-empty for normal text and empty for empty input; for the normal-text test, iterate returned tokens and assert each token has Term non-empty, Start >= 0 and End > Start to verify offset correctness. Ensure tests are named like TokenizeWithOffsets_ReturnsTokensWithValidOffsets and TokenizeWithOffsets_ReturnsEmptyList_ForEmptyText and use the existing _tokenizer instance (or construct a tokenizer with a log action when needed).

TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs (1)
**54-70:** `SafeTokenize` catch block may be unreachable.

The outer `catch` block in `SafeTokenize` (lines 60-69) may never execute because `Tokenize()` already catches all exceptions internally and returns a fallback list. Unless `Tokenize`'s catch block itself throws (unlikely), the `SafeTokenize` catch is defensive but unreachable.

This is fine as defensive programming, but consider whether this is the intended behavior, or whether `Tokenize` should propagate exceptions while only `SafeTokenize` handles them.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs` around lines 54-70: SafeTokenize's catch is effectively unreachable because Tokenize currently swallows exceptions internally; decide to centralize error handling by letting Tokenize propagate errors and having SafeTokenize perform the fallback. Change Tokenize (the method called by SafeTokenize) to remove or rethrow its internal catch (or at least rethrow after logging) so exceptions reach SafeTokenize, which will then log via _logAction and return the fallback token list in SafeTokenize; alternatively, if you prefer Tokenize to keep its own fallback, remove the outer try/catch in SafeTokenize to avoid dead code. Update Tokenize or SafeTokenize accordingly so only one of them handles exceptions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@TelegramSearchBot.Search/Tool/SearchHelper.cs`:
- Around line 10-25: The _tokenizer static is mutated without synchronization
causing races in GetTokenizer and SetTokenizer and leaking state across tests;
make initialization thread-safe (e.g., replace the current nullable _tokenizer
with a thread-safe Lazy<ITokenizer> or add a private static readonly lock object
like _tokenizerLock and use double‑check locking inside GetTokenizer and lock in
SetTokenizer) so lazy creation of
TelegramSearchBot.Tokenizer.Implementations.SmartChineseTokenizer is safe, and
add an internal ResetTokenizer() helper that clears the tokenizer under the same
lock so SearchHelperTests.cs can reset state between tests.
In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs`:
- Around line 75-103: TokenizeWithOffsets is missing a tokenStream.End() call
like Tokenize has; inside the try block, after the while
(tokenStream.IncrementToken()) loop and before disposing tokenStream, call
tokenStream.End() to properly finalize the token stream (and then you can read
final offsetAttribute values if needed) so the analyzer is correctly terminated
and the stream is ended before returning tokens.
- Around line 30-52: The TokenStream lifecycle in SmartChineseTokenizer.cs is
missing a call to tokenStream.End() after the token iteration; update the try
block so that after the while (tokenStream.IncrementToken()) loop completes you
call tokenStream.End() before disposing (still within the using scope) to ensure
end-of-stream finalization (offsets/positions) is applied; keep the existing
exception handling and logging around _logAction and retain adding the original
text to keywords on error.
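The locking scheme suggested for SearchHelper's static tokenizer can be sketched as follows (Python analogue, illustrative only; `object()` stands in for `SmartChineseTokenizer`): lazy creation under double-checked locking, plus the reset hook the tests need.

```python
import threading


class SearchHelper:
    _tokenizer = None
    _lock = threading.Lock()

    @classmethod
    def get_tokenizer(cls):
        if cls._tokenizer is None:          # first check, lock-free fast path
            with cls._lock:
                if cls._tokenizer is None:  # second check, under the lock
                    # Stand-in for constructing SmartChineseTokenizer once.
                    cls._tokenizer = object()
        return cls._tokenizer

    @classmethod
    def set_tokenizer(cls, tokenizer):
        with cls._lock:
            cls._tokenizer = tokenizer

    @classmethod
    def reset_tokenizer(cls):
        # Test-only hook: clears state under the same lock so test runs
        # don't leak a tokenizer into each other.
        with cls._lock:
            cls._tokenizer = None
```

The second null check inside the lock is what prevents two racing threads from each constructing (and one of them leaking) an analyzer.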
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dfc6637b-b22c-422e-ab4b-b3fb900f963d
📒 Files selected for processing (21)

- TelegramSearchBot.Search/Service/SimpleSearchService.cs
- TelegramSearchBot.Search/Service/SyntaxSearchService.cs
- TelegramSearchBot.Search/TelegramSearchBot.Search.csproj
- TelegramSearchBot.Search/Tokenizer/UnifiedTokenizer.cs
- TelegramSearchBot.Search/Tool/ContentQueryBuilder.cs
- TelegramSearchBot.Search/Tool/ExtQueryBuilder.cs
- TelegramSearchBot.Search/Tool/LuceneManager.cs
- TelegramSearchBot.Search/Tool/PhraseQueryProcessor.cs
- TelegramSearchBot.Search/Tool/SearchHelper.cs
- TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs
- TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs
- TelegramSearchBot.Tokenizer.Tests/TelegramSearchBot.Tokenizer.Tests.csproj
- TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs
- TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs
- TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs
- TelegramSearchBot.Tokenizer/Abstractions/TokenizerMetadata.cs
- TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
- TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs
- TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
- TelegramSearchBot.Tokenizer/TelegramSearchBot.Tokenizer.csproj
- TelegramSearchBot.sln
💤 Files with no reviewable changes (1)
- TelegramSearchBot.Search/Tokenizer/UnifiedTokenizer.cs
🧹 Nitpick comments (3)
TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs (1)
**13-21:** Consider implementing `IDisposable` to dispose the analyzer.

`SmartChineseAnalyzer` (via `Analyzer`) is disposable. The `_analyzer` field is created once and never disposed, which could lead to resource leaks if `SmartChineseTokenizer` instances are created and discarded frequently (e.g., in tests or dynamic factory scenarios).

♻️ Proposed fix

```diff
-public class SmartChineseTokenizer : ITokenizer {
+public class SmartChineseTokenizer : ITokenizer, IDisposable {
     private readonly Analyzer _analyzer;
     private readonly Action<string>? _logAction;
+    private bool _disposed;

     public SmartChineseTokenizer(Action<string>? logAction = null) {
         _analyzer = new SmartChineseAnalyzer(LuceneVersion.LUCENE_48);
         _logAction = logAction;
         Metadata = new TokenizerMetadata("SmartChinese", "Chinese", true);
     }
+
+    public void Dispose() {
+        if (!_disposed) {
+            _analyzer.Dispose();
+            _disposed = true;
+        }
+    }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs` around lines 13-21: SmartChineseTokenizer constructs a SmartChineseAnalyzer and never disposes it, risking resource leaks; implement IDisposable on SmartChineseTokenizer, store the analyzer in _analyzer (already present), add a public void Dispose() (or the Dispose(bool) pattern) that calls _analyzer.Dispose() and suppresses finalization as appropriate, and ensure consumers can call Dispose so the SmartChineseAnalyzer resources are released when SmartChineseTokenizer is no longer used.

TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs (2)
**54-77:** Missing test coverage for `TokenizeWithOffsets`.

The `ITokenizer` contract includes `TokenizeWithOffsets`, but there's no test validating this method. Consider adding a test to ensure implementers correctly produce offset information.

Additionally, `MockTokenizer.Tokenize` returns distinct tokens (via `Distinct()`), but `TokenizeWithOffsets` doesn't apply the same deduplication; this inconsistency could mask bugs in real implementations.

💡 Suggested test addition

```csharp
[Fact]
public void TokenizeWithOffsets_ReturnsCorrectOffsets()
{
    // Arrange
    var tokenizer = new MockTokenizer();

    // Act
    var result = tokenizer.TokenizeWithOffsets("hello world");

    // Assert
    Assert.Equal(2, result.Count);
    Assert.Equal(0, result[0].Start);
    Assert.Equal(5, result[0].End);
    Assert.Equal("hello", result[0].Term);
    Assert.Equal(6, result[1].Start);
    Assert.Equal(11, result[1].End);
    Assert.Equal("world", result[1].Term);
}
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 54-77: Add a unit test covering MockTokenizer.TokenizeWithOffsets to verify it returns correct start/end offsets and terms for input strings (e.g., "hello world"), and either update TokenizeWithOffsets or the test to respect the same deduplication behaviour as MockTokenizer.Tokenize (which uses Distinct()); specifically, locate MockTokenizer (implements ITokenizer) and ensure TokenizeWithOffsets produces offsets for the same set/order of tokens Tokenize returns, or adjust Tokenize to not dedupe, then add assertions that check Count, Start, End, and Term for each TokenWithOffset against expected values.
**32-43:** Test name is misleading: no error scenario is tested.

The test name `SafeTokenize_ReturnsNonEmptyList_EvenOnError` implies error handling is being tested, but `MockTokenizer.SafeTokenize` simply delegates to `Tokenize` without simulating any error condition. This test only verifies that `SafeTokenize` returns a non-empty list for valid input, identical to the `Tokenize` test.

Consider either:

- Renaming to `SafeTokenize_ReturnsNonEmptyList_ForValidText`
- Creating a mock that throws in `Tokenize` to actually test error fallback behavior

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs` around lines 32-43: The test name falsely claims error behavior; update the test to either rename it to SafeTokenize_ReturnsNonEmptyList_ForValidText or modify MockTokenizer to throw from Tokenize and assert SafeTokenize still returns a non-empty list; specifically, locate the test method SafeTokenize_ReturnsNonEmptyList_EvenOnError and either change its name to reflect valid-input behavior or implement a MockTokenizer override of Tokenize that throws an exception and verify SafeTokenize catches it and returns a non-empty collection.
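The `IDisposable` nitpick on SmartChineseTokenizer earlier in this review boils down to deterministic, idempotent disposal of the underlying analyzer. A minimal sketch (Python analogue with hypothetical names; `FakeAnalyzer` stands in for `SmartChineseAnalyzer`):

```python
class FakeAnalyzer:
    """Stand-in for a disposable Lucene analyzer."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class DisposableTokenizer:
    def __init__(self):
        self._analyzer = FakeAnalyzer()
        self._disposed = False

    def tokenize(self, text):
        if self._disposed:
            raise RuntimeError("tokenizer already disposed")
        return text.split()

    def dispose(self):
        if not self._disposed:  # idempotent, like the proposed C# Dispose()
            self._analyzer.close()
            self._disposed = True
```

The guard flag makes repeated `dispose()` calls safe, while use-after-dispose fails loudly instead of leaking a half-closed analyzer.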
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs`:
- Around line 54-77: Add a unit test covering MockTokenizer.TokenizeWithOffsets
to verify it returns correct start/end offsets and terms for input strings
(e.g., "hello world"), and either update TokenizeWithOffsets or the test to
respect the same deduplication behaviour as MockTokenizer.Tokenize (which uses
Distinct()); specifically, locate MockTokenizer (implements ITokenizer) and
ensure TokenizeWithOffsets produces offsets for the same set/order of tokens
Tokenize returns or adjust Tokenize to not dedupe, then add assertions that
check count, Start, End, and Term for each TokenWithOffset to match expected
values.
- Around line 32-43: The test name falsely claims error behavior; update the
test to either rename it to SafeTokenize_ReturnsNonEmptyList_ForValidText or
modify MockTokenizer to throw from Tokenize and assert SafeTokenize still
returns a non-empty list; specifically locate the test method
SafeTokenize_ReturnsNonEmptyList_EvenOnError and either change its name to
reflect valid input behavior or implement a MockTokenizer override of Tokenize
that throws an exception and verify SafeTokenize catches that and returns a
non-empty collection.
In `@TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs`:
- Around line 13-21: SmartChineseTokenizer constructs a SmartChineseAnalyzer and
never disposes it, risking resource leaks; implement IDisposable on
SmartChineseTokenizer, store the analyzer in _analyzer (already present), add a
public void Dispose() (or Dispose(bool) pattern) that calls _analyzer.Dispose()
and suppresses finalization as appropriate, and ensure any consumers can call
Dispose (or use a finalizer/IDisposable pattern) so the SmartChineseAnalyzer
resources are released when SmartChineseTokenizer is no longer used.
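One way to address the disposal finding — a sketch only, assuming the `_analyzer` field name from the comment and the Lucene.NET 4.8 `SmartChineseAnalyzer` API; the tokenization members are elided:

```csharp
using Lucene.Net.Analysis.Cn.Smart;
using Lucene.Net.Util;

public sealed class SmartChineseTokenizer : IDisposable
{
    private readonly SmartChineseAnalyzer _analyzer;
    private bool _disposed;

    public SmartChineseTokenizer()
    {
        _analyzer = new SmartChineseAnalyzer(LuceneVersion.LUCENE_48);
    }

    public void Dispose()
    {
        if (_disposed) return;
        _analyzer.Dispose(); // release the analyzer's underlying resources
        _disposed = true;
        // Sealed class with no finalizer, so GC.SuppressFinalize is unnecessary.
    }

    // ... Tokenize / SafeTokenize / TokenizeWithOffsets elided ...
}
```

Consumers resolving the tokenizer through DI would then register it with a lifetime whose container disposes it (e.g. singleton or scoped), so `Dispose` is actually called.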
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f86d06d-541c-40c1-a36a-226f65a2e25b
📒 Files selected for processing (9)
- TelegramSearchBot.Search/Tool/SearchHelper.cs
- TelegramSearchBot.Tokenizer.Tests/ITokenizerTests.cs
- TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs
- TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs
- TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs
- TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs
- TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
- TelegramSearchBot.Tokenizer/Implementations/SmartChineseTokenizer.cs
- TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
✅ Files skipped from review due to trivial changes (3)
- TelegramSearchBot.Tokenizer/Abstractions/TokenizerType.cs
- TelegramSearchBot.Tokenizer/Abstractions/ITokenizerFactory.cs
- TelegramSearchBot.Tokenizer/Abstractions/ITokenizer.cs
🚧 Files skipped from review as they are similar to previous changes (3)
- TelegramSearchBot.Tokenizer/Implementations/TokenizerFactory.cs
- TelegramSearchBot.Tokenizer.Tests/TokenizerFactoryTests.cs
- TelegramSearchBot.Tokenizer.Tests/SmartChineseTokenizerTests.cs
Actionable comments posted: 1
🧹 Nitpick comments (1)
AGENTS.md (1)
**53-65**: Consider adding a language specifier to the fenced code block. For ASCII diagrams, you can use `` ```text `` to silence the linter.

📝 Proposed fix

````diff
-```
+```text
 TelegramSearchBot.sln
 ├── TelegramSearchBot/          # Main console app
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@AGENTS.md` around lines 53 - 65, Update the fenced code block containing the ASCII project tree (starting with "TelegramSearchBot.sln" and the directory entries like "TelegramSearchBot/", "TelegramSearchBot.Common/", etc.) to include a language specifier (e.g., use ```text) so the linter/static analysis no longer flags the block; simply change the opening fence to ```text and leave the ASCII diagram content unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AGENTS.md`:
- Line 140: The doc line incorrectly references TypeScript/JavaScript constructs
("as any", "@ts-ignore", "@ts-expect-error"); replace that guidance with
C#-appropriate type-safety guidance by removing those JS/TS tokens and instead
recommending avoiding "dynamic", avoiding "unsafe" code blocks, and using proper
nullable reference type handling and explicit casts/guarding; update the
sentence around the existing phrase to read something like "Never use dynamic,
unsafe code, or ignore nullable reference warnings—handle nullability and
casting explicitly" so the guidance matches C# idioms.
---
Nitpick comments:
In `@AGENTS.md`:
- Around line 53-65: Update the fenced code block containing the ASCII project
tree (starting with "TelegramSearchBot.sln" and the directory entries like
"TelegramSearchBot/", "TelegramSearchBot.Common/", etc.) to include a language
specifier (e.g., use ```text) so the linter/static analysis no longer flags the
block; simply change the opening fence to ```text and leave the ASCII diagram
content unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a6b41995-c83e-46b0-9d09-c518ce2df147
📒 Files selected for processing (4)
- AGENTS.md
- TelegramSearchBot.Test/Manage/EditMcpConfServiceTests.cs
- TelegramSearchBot/Model/Mcp/McpConfState.cs
- TelegramSearchBot/Service/Manage/EditMcpConfService.cs
✅ Files skipped from review due to trivial changes (3)
- TelegramSearchBot/Model/Mcp/McpConfState.cs
- TelegramSearchBot/Service/Manage/EditMcpConfService.cs
- TelegramSearchBot.Test/Manage/EditMcpConfServiceTests.cs
Excerpt from the AGENTS.md diff under review:

```markdown
### Type Safety
- **Never** use `as any`, `@ts-ignore`, `@ts-expect-error`
```
Fix incorrect TypeScript/JavaScript references in C# documentation.
Line 140 references TypeScript/JavaScript constructs (`as any`, `@ts-ignore`, `@ts-expect-error`) that don't exist in C#. This appears to be copied from a TypeScript/JavaScript style guide. For C# type safety, relevant guidance would cover avoiding `dynamic`, avoiding `unsafe` code, and proper nullable reference handling.
🔧 Proposed fix
```diff
-- **Never** use `as any`, `@ts-ignore`, `@ts-expect-error`
+- Avoid `dynamic` keyword unless absolutely necessary
+- Avoid `unsafe` code blocks
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
- Avoid `dynamic` keyword unless absolutely necessary
- Avoid `unsafe` code blocks
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@AGENTS.md` at line 140, The doc line incorrectly references
TypeScript/JavaScript constructs ("as any", "@ts-ignore", "@ts-expect-error");
replace that guidance with C#-appropriate type-safety guidance by removing those
JS/TS tokens and instead recommending avoiding "dynamic", avoiding "unsafe" code
blocks, and using proper nullable reference type handling and explicit
casts/guarding; update the sentence around the existing phrase to read something
like "Never use dynamic, unsafe code, or ignore nullable reference
warnings—handle nullability and casting explicitly" so the guidance matches C#
idioms.
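As an illustration of the C#-idiomatic guidance the prompt asks for — a hedged sketch, not project code:

```csharp
using System;

public static class TypeSafetyExamples
{
    public static int Length(object value)
    {
        // Avoid: dynamic d = value; return d.Length;
        // (compiles for anything, fails only at runtime)

        // Prefer pattern matching with an explicit guard:
        if (value is string s)
        {
            return s.Length;
        }
        throw new ArgumentException(
            $"Expected string, got {value?.GetType().Name ?? "null"}");
    }

    public static string Describe(string? maybeName)
    {
        // Handle nullable reference types explicitly instead of
        // suppressing warnings with the null-forgiving operator (!).
        return maybeName is null ? "<unnamed>" : maybeName.ToUpperInvariant();
    }
}
```

Both methods make the failure modes visible at the call site, which is the C# analogue of banning `as any` and `@ts-ignore` in TypeScript.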
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
- New `TelegramSearchBot.Tokenizer` project with an abstraction layer for multi-tokenizer support
- Search project now depends on the `ITokenizer` interface instead of a direct implementation
New Project: TelegramSearchBot.Tokenizer
Abstractions:
- `ITokenizer` interface (`Tokenize`, `SafeTokenize`, `TokenizeWithOffsets`)
- `ITokenizerFactory` for factory pattern
- `TokenizerMetadata` for tokenizer info
- `TokenizerType` enum for tokenizer types
- `TokenWithOffset` record for offset information

Implementations:

- `SmartChineseTokenizer` using Lucene SmartChineseAnalyzer
- `TokenizerFactory` implementation

New Test Project: TelegramSearchBot.Tokenizer.Tests
Updated: Search Project
- `LuceneManager`, `SimpleSearchService`, `SyntaxSearchService`
- `PhraseQueryProcessor`, `ContentQueryBuilder`, `ExtQueryBuilder`
- `SearchHelper` now uses `ITokenizer` via `SetTokenizer()`

Removed
- `TelegramSearchBot.Search/Tokenizer/UnifiedTokenizer.cs` (duplicate implementation)

Verification
Future Extensibility
Adding new tokenizers (e.g., JiebaTokenizer) now only requires:
1. Adding a value to the `TokenizerType` enum
2. Implementing the `ITokenizer` interface
3. Registering it in `TokenizerFactory.Create()`

Summary by CodeRabbit
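The three extension steps listed under Future Extensibility could look roughly like this. Everything here is a hedged sketch: the `ITokenizer` stand-in omits `TokenizeWithOffsets`, the `JiebaTokenizer` body is a placeholder whitespace splitter rather than a real jieba-based segmenter, and the factory shape is assumed from the PR description:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the project's ITokenizer (TokenizeWithOffsets omitted).
public interface ITokenizer
{
    List<string> Tokenize(string text);
    List<string> SafeTokenize(string text);
}

// 1. Add a value to the TokenizerType enum
public enum TokenizerType { SmartChinese, Jieba }

// 2. Implement ITokenizer (whitespace split as a placeholder for a
//    real jieba segmenter)
public class JiebaTokenizer : ITokenizer
{
    public List<string> Tokenize(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries).ToList();

    public List<string> SafeTokenize(string text)
    {
        try { return Tokenize(text); }
        catch { return new List<string> { text }; }
    }
}

// 3. Register the new type in TokenizerFactory.Create()
public class TokenizerFactory
{
    public ITokenizer Create(TokenizerType type) => type switch
    {
        TokenizerType.Jieba => new JiebaTokenizer(),
        _ => throw new NotSupportedException($"No tokenizer for {type}")
    };
}
```

The switch expression keeps the factory the single place that maps `TokenizerType` values to implementations, which is what makes the one-enum-value-plus-one-class extension story work.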