In [1]:
import sys
sys.path.insert(0, "..")

In [2]:
import spacy
import medspacy

from medspacy.custom_tokenizer import create_medspacy_tokenizer

# Overview
This notebook shows how to create the default medspaCy tokenizer and compares it with spaCy's default English tokenizer on a representative snippet of short clinical text.

In [3]:
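# A pipeline can only use one tokenizer at a time, so we start from a blank English
# pipeline and keep both tokenizers as standalone callables for comparison.
# The medspaCy tokenizer splits on infix punctuation (e.g. 'h/o', 'chf+cp', etc.)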

nlp = spacy.blank("en")

In [4]:
spacy_tokenizer = nlp.tokenizer

In [5]:
medspacy_tokenizer = create_medspacy_tokenizer(nlp)

# Process our document with both the default and medspaCy tokenizers

In [6]:
example_text = r'Pt c\o n;v;d h\o chf+cp'

In [7]:
default_doc = spacy_tokenizer(example_text)

medspacy_doc = medspacy_tokenizer(example_text)

In the results, we can see that the medspaCy tokenizer splits much more aggressively on punctuation. This is intentional: it gives better handling of long sequences of punctuation, typos involving punctuation, and compound words joined with punctuation.

The medspaCy tokenizer is not always appropriate for a task, but it can often make rule-based pattern matching simpler and more accurate. Please test the default tokenizer, the medspaCy tokenizer, and any other options on your own data before settling on one for a particular project.
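As a rough illustration of what "aggressive on punctuation" means, the sketch below splits each whitespace-delimited chunk into runs of letters or digits and single punctuation characters. It is not medspaCy's implementation: `split_on_punctuation` is a hypothetical helper that uses only the standard-library `re` module, and the real tokenizer's prefix, infix, and suffix rules may differ on other inputs.

```python
import re


def split_on_punctuation(text):
    """Hypothetical stand-in: split every punctuation character into its own token."""
    tokens = []
    for chunk in text.split():
        # Either a run of letters/digits, or exactly one non-alphanumeric character.
        tokens.extend(re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", chunk))
    return tokens


example_text = r'Pt c\o n;v;d h\o chf+cp'
tokens = split_on_punctuation(example_text)
print(tokens)
# ['Pt', 'c', '\\', 'o', 'n', ';', 'v', ';', 'd', 'h', '\\', 'o', 'chf', '+', 'cp']
```

With every punctuation mark isolated as its own token, a rule only needs to match `h`, any single punctuation token, then `o`, rather than enumerating every way the abbreviation might be typed.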

In [8]:
print('Tokens in default tokenizer')
for token in default_doc:
    print(token.text)

Tokens in default tokenizer
Pt
c\o
n;v;d
h\o
chf+cp


In [9]:
print('Tokens in medspacy tokenizer')
for token in medspacy_doc:
    print(token.text)

Tokens in medspacy tokenizer
Pt
c
\
o
n
;
v
;
d
h
\
o
chf
+
cp
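
When comparing tokenizers on a larger sample, it can help to see which coarse tokens were split further rather than reading two flat lists. The sketch below is one possible way to do that with plain Python lists copied from the two outputs above; `group_subtokens` is a hypothetical helper and assumes the fine tokens concatenate exactly to the coarse ones.

```python
default_tokens = ['Pt', 'c\\o', 'n;v;d', 'h\\o', 'chf+cp']
medspacy_tokens = ['Pt', 'c', '\\', 'o', 'n', ';', 'v', ';', 'd',
                   'h', '\\', 'o', 'chf', '+', 'cp']


def group_subtokens(coarse, fine):
    """Pair each coarse token with the fine tokens that spell it out."""
    groups = []
    i = 0
    for tok in coarse:
        parts = []
        joined = ""
        while joined != tok:
            # Stop early if the fine tokens cannot rebuild this coarse token.
            if i >= len(fine) or not tok.startswith(joined + fine[i]):
                raise ValueError(f"tokenizations do not align at {tok!r}")
            parts.append(fine[i])
            joined += fine[i]
            i += 1
        groups.append((tok, parts))
    if i != len(fine):
        raise ValueError("fine tokenization has leftover tokens")
    return groups


groups = group_subtokens(default_tokens, medspacy_tokens)
for coarse_token, parts in groups:
    # Only report tokens that the finer tokenizer actually split.
    if len(parts) > 1:
        print(repr(coarse_token), '->', parts)
```

Only `Pt` is left untouched in this example; the other four default tokens are each split into three or five pieces.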
