extend tokenizer regex #1

emphasize · 2023-06-14T20:56:25Z

This expands the capabilities of both the word and sentence tokenizers.

Tests for word tokenization

Test for sentence tokenization:

import regex as re  # third-party `regex` package, used here as a drop-in for `re`

# Split on whitespace that follows "." or "?", unless it comes right after
# a dotted abbreviation ("i.e.") or a two-letter title ("Dr.", "Jr.").
regex = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'
text = "Dr. Müller bought cheapsite.com for 1.5 million dollars on 8.Februar, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
sentences = re.split(regex, text)
for sentence in sentences:
    print(sentence)
assert sentences == [
    'Dr. Müller bought cheapsite.com for 1.5 million dollars on 8.Februar, i.e. he paid a lot for it.',
    'Did he mind?',
    "Adam Jones Jr. thinks he didn't.",
    "In any case, this isn't true...",
    "Well, with a probability of .9 it isn't.",
]

emphasize added 2 commits June 9, 2023 19:21

extend word tokenizer

ce8c62d

sentence tokenizer

6a41259

JarbasAl approved these changes Jun 14, 2023


JarbasAl requested a review from ChanceNCounter June 14, 2023 21:20

JarbasAl added the enhancement New feature or request label Jun 14, 2023

JarbasAl merged commit 783a0b6 into OpenVoiceOS:master Jun 15, 2023

emphasize deleted the refactor/word_tokenizer branch June 15, 2023 22:52
