Bug(medcat): CU-869ag0tqj Fix regex tokenizer dashes #138

mart-r · 2025-09-15T15:38:40Z

With the regex-based tokenizer, any word that contained punctuation ended up being ignored: the tagger tagged such tokens as skip words because they contained punctuation, so they were skipped.

This PR fixes that by separating punctuation into its own tokens in the regex-based tokenizer example.

This should fix #128.

EDIT:
To be a bit more explicit regarding the regex tokenizer:

Before, it would tokenize "know-how" into "know" and "-how".
Now, it tokenizes "know-how" into "know", "-", and "how".
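
For illustration, the before/after behaviour can be sketched with two toy patterns. The regexes and function names below are hypothetical and are not the ones used in MedCAT; they only reproduce the tokenization described above, and `is_skipped` is an assumed stand-in for the tagger's skip rule.

```python
import re

# Hypothetical patterns for illustration; NOT the actual MedCAT regexes.
# Old behaviour: an optional leading punctuation character stays glued to the word.
OLD_PATTERN = re.compile(r"[^\w\s]?\w+")
# New behaviour: a run of word characters OR a single punctuation character.
NEW_PATTERN = re.compile(r"\w+|[^\w\s]")
# Any character that is neither a word character nor whitespace.
PUNCT = re.compile(r"[^\w\s]")


def tokenize_old(text: str) -> list[str]:
    """Tokenize with punctuation left attached to the following word."""
    return OLD_PATTERN.findall(text)


def tokenize_new(text: str) -> list[str]:
    """Tokenize with each punctuation character split into its own token."""
    return NEW_PATTERN.findall(text)


def is_skipped(token: str) -> bool:
    # Assumed stand-in for the tagger rule: a token containing punctuation is skipped.
    return PUNCT.search(token) is not None


print(tokenize_old("know-how"))  # ['know', '-how']
print(tokenize_new("know-how"))  # ['know', '-', 'how']
# Tokens that survive the skip rule after the new tokenization:
print([t for t in tokenize_new("know-how") if not is_skipped(t)])  # ['know', 'how']
```

With the old split, "-how" carries the dash and is skipped as a whole; with the new split, only the standalone "-" token is skipped and "how" is kept.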

tomolopolis · 2025-09-15T15:38:44Z

Task linked: CU-869ag0tqj Fix issue with dashes and regex-tokenizer

alhendrickson

lgtm, now it's much simpler with your latest commit

mart-r added 10 commits September 18, 2025 15:54

CU-869ag0tqj: Add tests to show issue with dashes in CDB maker with regex-based tokenizer

56456f0

CU-869ag0tqj: Make a full build at test time when adding dash-based names

685eb80

CU-869ag0tqj: Separate starting punctuation in regex based tokenizer

9276ef7

CU-869ag0tqj: Use compiled regex for regex-based tokenizer

0f4b7f3

CU-869ag0tqj: Fix small typing issues

7ebcb87

CU-869ag0tqj: Add a small comment

cde7e34

CU-869ag0tqj: Update regex-based tokenizer tests

faf2e2c

CU-869ag0tqj: Refactor token getting in regex tokenizer somewhat

b99bedc

CU-869ag0tqj: Separate punctuation getting for regex tokenizing

e826e21

CU-869ag0tqj: Add some further comments in code

e037bbd

mart-r force-pushed the bug/medcat/CU-869ag0tqj-fix-regex-tknz-dashes branch from 6aabc68 to e037bbd September 18, 2025 14:54

mart-r added 2 commits September 19, 2025 09:06

CU-869ag0tqj: Fix regex tokenizer expected results

0ec0c7a

CU-869ag0tqj: Simplify regex for punctuation separation in regex tokenizer

c6963cf

alhendrickson approved these changes Sep 19, 2025

medcat-v2/tests/model_creation/test_cdb_maker.py (outdated, resolved)

CU-869ag0tqj: Add explicit list of expected names in CDB maker test with dashes

6a7e21a

mart-r merged commit a1562ee into main Sep 19, 2025
20 checks passed

mart-r deleted the bug/medcat/CU-869ag0tqj-fix-regex-tknz-dashes branch September 19, 2025 11:51

4 participants
