extend tokenizer regex #1

Merged: 2 commits, Jun 15, 2023

Conversation

emphasize
Member

This expands the capabilities of both the word tokenizer and the sentence tokenizer.

Tests for word tokenization

Test for sentence tokenization:

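import regex as re  # third-party `regex` package, aliased to `re`

# Split on whitespace that follows '.' or '?', unless the preceding text looks like
# an abbreviation such as "i.e." (\w\.\w.) or a short title such as "Dr." ([A-Z][a-z]\.).
regex = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'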
text = "Dr. Müller bought cheapsite.com for 1.5 million dollars on 8.Februar, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."
sentences = re.split(regex, text)
for sentence in sentences:
  print(sentence)
assert sentences == [
  'Dr. Müller bought cheapsite.com for 1.5 million dollars on 8.Februar, i.e. he paid a lot for it.',
  'Did he mind?',
  "Adam Jones Jr. thinks he didn't.",
  "In any case, this isn't true...",
  "Well, with a probability of .9 it isn't."]

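The test above can be condensed into a small reusable helper. This is only a sketch: the names `SENTENCE_BOUNDARY` and `split_sentences` are hypothetical and are not taken from the PR, and it uses the standard-library `re` module instead of `regex`, which works here because every lookbehind in the pattern is fixed-width.

```python
import re  # standard library; the PR snippet imports the third-party `regex` package instead

# Same pattern as in the PR description, precompiled once.
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')


def split_sentences(text):
    """Hypothetical helper: split text at whitespace that follows '.' or '?'.

    Whitespace is not treated as a boundary when the preceding characters
    look like "i.e." (\\w\\.\\w.) or like "Dr." ([A-Z][a-z]\\.).
    """
    return SENTENCE_BOUNDARY.split(text)


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Müller paid a lot, i.e. too much. Did he mind?"):
        print(sentence)
```

Note that the pattern only looks for `.` and `?` before the whitespace, so under this sketch a string such as "Wow! Nice." comes back as a single sentence.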
JarbasAl added the "enhancement" (New feature or request) label on Jun 14, 2023
JarbasAl merged commit 783a0b6 into OpenVoiceOS:master on Jun 15, 2023
emphasize deleted the refactor/word_tokenizer branch on Jun 15, 2023, 22:52