@mart-r (Collaborator) commented Sep 15, 2025

The regex-based tokenizer did not split punctuation off from the words it was attached to, and the tagger tags any token containing punctuation as a skip word.
As a result, every such token was skipped and ignored.

This PR fixes that by splitting punctuation into separate tokens in the regex-based tokenizer example.

This should fix #128.

EDIT:
To be a bit more explicit regarding the regex tokenizer:

  • Before, it would tokenize know-how into know and -how
  • Now, it tokenizes know-how into know, -, and how
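
To illustrate the difference, here is a minimal sketch in Python. The two patterns and the `is_skip` helper are hypothetical: they were picked only to reproduce the before/after behaviour described above and are not the patterns or the tagger logic from this PR.

```python
import re
from typing import List, Pattern

# Hypothetical patterns (assumptions, not the ones changed in this PR).
# OLD: a punctuation character stays glued to the word that follows it.
OLD_PATTERN = re.compile(r"\w+|[^\w\s]\w*")
# NEW: every punctuation character becomes a token of its own.
NEW_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(pattern: Pattern[str], text: str) -> List[str]:
    """Return all non-overlapping matches of the pattern, left to right."""
    return pattern.findall(text)


def is_skip(token: str) -> bool:
    """Assumed stand-in for the tagger rule: skip tokens containing punctuation."""
    return any(not ch.isalnum() for ch in token)


def kept_tokens(pattern: Pattern[str], text: str) -> List[str]:
    """Tokens that survive the (assumed) skip-word filter."""
    return [tok for tok in tokenize(pattern, text) if not is_skip(tok)]


print(tokenize(OLD_PATTERN, "know-how"))     # ['know', '-how']
print(tokenize(NEW_PATTERN, "know-how"))     # ['know', '-', 'how']
print(kept_tokens(OLD_PATTERN, "know-how"))  # ['know']
print(kept_tokens(NEW_PATTERN, "know-how"))  # ['know', 'how']
```

With the old-style split, how is lost because -how contains punctuation. With the new-style split, only the standalone - is skipped and both words are kept.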

@tomolopolis (Member)

@mart-r force-pushed the bug/medcat/CU-869ag0tqj-fix-regex-tknz-dashes branch from 6aabc68 to e037bbd on September 18, 2025 14:54

@alhendrickson (Collaborator) left a comment

lgtm, now it's much simpler with your latest commit

@mart-r merged commit a1562ee into main on Sep 19, 2025
20 checks passed
@mart-r deleted the bug/medcat/CU-869ag0tqj-fix-regex-tknz-dashes branch on September 19, 2025 11:51
Development

Successfully merging this pull request may close these issues.

Words after hyphens are removed by CDBMaker
