In [1]:
import sys

In [2]:
sys.path.insert(0, "..")

# Overview
This notebook shows how to use pre- and postprocessing techniques in medspaCy.

In [3]:
import medspacy

from medspacy.ner import TargetRule

In [4]:
nlp = medspacy.load(enable=["target_matcher", "sectionizer"])



In [5]:
target_rules = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("radiotherapy", "TREATMENT",
              pattern=[{"LOWER": "xrt"}]),
    TargetRule("metastasis", "PROBLEM"),
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ],),
    TargetRule("Hypertension", "PROBLEM",
              pattern=[{"LOWER": {"IN": ["htn", "hypertension"]}}],),
    
]

In [6]:
nlp.get_pipe("target_matcher").add(target_rules)

In [7]:
with open("./discharge_summary.txt") as f:
    text = f.read()

# Preprocessing
In preprocessing, we'll take some steps to clean up the text. MedspaCy provides a class for destructive preprocessing, meaning that the original text is not preserved. In the future, we'd like to support non-destructive preprocessing as well.

The `Preprocessor` component is wrapped around the pipeline tokenizer and modifies the text before it is tokenized.
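
To make the wrapping idea concrete, here is a minimal sketch in plain Python. It is not medspaCy's implementation: `TextCleaningTokenizer`, `whitespace_tokenizer`, and `strip_brackets` are hypothetical names, and the whitespace splitter only stands in for a real spaCy tokenizer. The point is that the wrapper rewrites the text first and hands only the rewritten text to the tokenizer, which is why this style of preprocessing is destructive.

```python
import re


class TextCleaningTokenizer:
    """Hypothetical wrapper: clean the text, then delegate to the wrapped tokenizer."""

    def __init__(self, tokenizer, rules=()):
        self.tokenizer = tokenizer  # the original tokenizer callable
        self.rules = list(rules)    # callables that map str -> str

    def __call__(self, text):
        for rule in self.rules:
            text = rule(text)       # the original text is not kept anywhere
        return self.tokenizer(text)


def whitespace_tokenizer(text):
    # Stand-in for a real tokenizer; spaCy's tokenizer would return a Doc.
    return text.split()


def strip_brackets(text):
    # Remove MIMIC-style placeholders such as [**First Name3 (LF) 1893**].
    return re.sub(r"\[\*\*[^\]]+\]", "", text)


wrapped = TextCleaningTokenizer(whitespace_tokenizer, rules=[str.lower, strip_brackets])

original = "Attending:[**First Name3 (LF) 1893**] Chief Complaint: Abdominal pain"
tokens = wrapped(original)
print(tokens)
```

Because the tokens are built from the cleaned string, character offsets computed on them refer to the cleaned text and cannot be mapped back to `original` without extra bookkeeping.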

In [8]:
from medspacy.preprocess import Preprocessor

In [9]:
preprocessor = Preprocessor(nlp.tokenizer)

Unlike other components, this is not added using the `add_pipe` method, but is instead set to be `nlp.tokenizer`:

In [10]:
nlp.tokenizer = preprocessor

Rules are defined using the `PreprocessingRule` class. Each rule defines a pattern to match in the text and a modification to make whenever a match is found. The rule takes these arguments:
- `pattern`: A compiled regular expression defining the text to match
- `repl`: An optional replacement for the matched text. The default is an empty string, meaning that the matched text is removed. This can be either a string or a function to pass in to `re.sub`
- `callback`: A callback function which takes the match object as an argument and returns the replaced text. This can be used for modifications that go beyond replacing the matched span itself
- `desc`: An optional description of the rule
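
Since `repl` is described as something passed to `re.sub`, the difference between a string and a function replacement can be illustrated with the standard library alone. This is a sketch of the underlying `re` behaviour, not of medspaCy internals; `keep_month_day` and the sample `note` are made up for the example.

```python
import re

note = "Pt dx'd with HTN on [**2573-5-30**]."

date_pattern = re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]")

# String replacement: every match becomes the same fixed text.
fixed = date_pattern.sub("01-01-2010", note)

# Empty replacement: the matched text is simply removed.
removed = date_pattern.sub("", note)


def keep_month_day(match):
    # Hypothetical helper: drop the shifted year and keep "month-day".
    parts = match.group(0).strip("[]*").split("-")
    return "-".join(parts[-2:])


# Function replacement: re.sub calls the function with each match object,
# so the replacement can depend on what was matched.
partial = date_pattern.sub(keep_month_day, note)

print(fixed)
print(removed)
print(partial)
```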


Given our example discharge summary, we'll take the following steps to make the text cleaner and easier to work with:
- Lower-case the text (for demonstration purposes only; later steps are often case-insensitive unless explicitly told not to be)
- Replace MIMIC-style date brackets with "01-01-2010" and year brackets with "2010", then remove all other MIMIC-style bracketed placeholders
- Expand abbreviations such as "dx'd" and "tx'd" to simplify later processing
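
The order of these steps matters, which a small standalone sketch can show. It uses only `re` with a made-up `sample` string and a hypothetical `clean` helper, and it is not how medspaCy applies rules internally. Two things to notice: once the text is lower-cased, later patterns must be written in lower case, and the catch-all bracket rule has to run after the more specific date and year rules.

```python
import re

steps = [
    (re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]"), "01-01-2010"),  # dates
    (re.compile(r"\[\*\*\d{4}\*\*\]"), "2010"),                             # years
    (re.compile(r"dx'd"), "diagnosed"),
    (re.compile(r"tx'd"), "treated"),
    (re.compile(r"\[\*\*[^\]]+\]"), ""),  # catch-all for remaining placeholders
]


def clean(text, steps):
    # Lower-case first, then apply each (pattern, replacement) pair in order.
    text = text.lower()
    for pattern, repl in steps:
        text = pattern.sub(repl, text)
    return text


sample = "Dx'd [**2573-5-30**] by Dr. [**Last Name (STitle) 12**]; tx'd in [**2574**]."

cleaned = clean(sample, steps)

# Running the catch-all first deletes the dates before they can be replaced.
wrong_order = clean(sample, [steps[-1]] + steps[:-1])

print(cleaned)
print(wrong_order)
```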

In [11]:
from medspacy.preprocess import PreprocessingRule
import re

In [12]:
preprocess_rules = [
    # Any callable that maps text to text can be used as a rule
    lambda x: x.lower(),

    PreprocessingRule(
        re.compile(r"\[\*\*[\d]{1,4}-[\d]{1,2}(-[\d]{1,2})?\*\*\]"),
        repl="01-01-2010",
        desc="Replace MIMIC date brackets with a generic date."
    ),

    PreprocessingRule(
        re.compile(r"\[\*\*[\d]{4}\*\*\]"),
        repl="2010",
        desc="Replace MIMIC year brackets with a generic year."
    ),

    PreprocessingRule(
        re.compile(r"dx'd"), repl="Diagnosed",
        desc="Replace abbreviation"
    ),

    PreprocessingRule(
        re.compile(r"tx'd"), repl="Treated",
        desc="Replace abbreviation"
    ),

    # Catch-all: keep this after the more specific date and year rules
    PreprocessingRule(
        re.compile(r"\[\*\*[^\]]+\]"),
        desc="Remove all other bracketed placeholder text from MIMIC"
    )
]

In [13]:
preprocessor.add(preprocess_rules)

In [14]:
# Before preprocessing
print(text[:500])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging sh


In [15]:
doc = nlp(text)

In [16]:
# After preprocessing
print(doc.text[:500])

admission date:  01-01-2010              discharge date:   01-01-2010

date of birth:  01-01-2010             sex:   f

service: surgery

allergies:
hydrochlorothiazide

attending:
chief complaint:
abdominal pain

major surgical or invasive procedure:
picc line 01-01-2010
ercp w/ sphincterotomy 01-01-2010


history of present illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. imaging shows no evidence of metastasis.

past medi


# Postprocessing
The final component we'll introduce is the `postprocessor`. The postprocessor iterates through each entity and checks a series of conditions on it. If all conditions evaluate as `True`, then some action is taken on the entity. Typical use cases include removing an entity or changing one of its attributes.

For example, let's say that we want to exclude any entity which comes from the **"patient_instructions"** section, as these are typically not experienced by the patient and are purely hypothetical. We'll write a rule to remove any entity from `doc.ents` if it came from this section. 

The design pattern for a postprocessing rule is as follows:
- A `PostprocessingRule` contains a list of `patterns` and an `action` to take if all of the `patterns` evaluate as `True`
- Each `PostprocessingPattern` takes a `condition`, which evaluates as `True` or `False`. If all patterns return `True`, the action is taken
- Each pattern can take optional `condition_args` to pass into the condition check, and each rule takes optional `action_args`
- The module `postprocessing_functions` offers utility functions for the `condition` and `action` arguments
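
The design pattern above can be sketched in plain Python to show how the pieces fit together. This is an illustration under stated assumptions, not medspaCy's code: `Entity`, `Pattern`, `Rule`, `in_section`, `remove_ent`, and `postprocess` are hypothetical stand-ins, and the first rule that fires on an entity ends processing for that entity in this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Entity:
    # Stand-in for a spaCy Span that knows which section it came from.
    text: str
    section: str


@dataclass
class Pattern:
    condition: Callable[..., bool]
    condition_args: Tuple = ()

    def __call__(self, ent):
        return self.condition(ent, *self.condition_args)


@dataclass
class Rule:
    patterns: List[Pattern]
    action: Callable
    action_args: Tuple = ()
    description: str = ""

    def __call__(self, ent, i, ents):
        # The action runs only if every pattern evaluates as True.
        if all(pattern(ent) for pattern in self.patterns):
            self.action(ent, i, ents, *self.action_args)
            return True
        return False


def in_section(ent, section):
    return ent.section == section


def remove_ent(ent, i, ents):
    del ents[i]


def postprocess(ents, rules):
    ents = list(ents)
    # Walk backwards so a deletion does not shift indices still to be visited.
    for i in range(len(ents) - 1, -1, -1):
        for rule in rules:
            if rule(ents[i], i, ents):
                break
    return ents


rules = [
    Rule(
        patterns=[Pattern(in_section, condition_args=("patient_instructions",))],
        action=remove_ent,
        description="Remove any entities from the instructions section.",
    )
]

ents = [
    Entity("stroke", "history_of_present_illness"),
    Entity("abdominal pain", "patient_instructions"),
    Entity("abdominal pain", "patient_instructions"),
]

kept = postprocess(ents, rules)
print([ent.text for ent in kept])
```

Keeping conditions and actions as separate callables means a condition such as `in_section` can be reused across rules with different `condition_args`, while the action decides what happens to a matching entity.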

First, let's process only the last section of the note for our example without the postprocessor:

In [17]:
doc = nlp(text[-560:])

In [18]:
doc

discharge instructions:
patient may shower. please call your surgeon or return to the
emergency room if  experience fever >101.5, nausea, vomiting,
abdominal pain, shortness of breath, abdominal pain or any
significant  change in your medical condition. a

completed by:   md  01-01-2010 @ 1404
signed electronically by: dr.  
 on: fri 01-01-2010 8:03 am
(end of report)

In [19]:
from medspacy.postprocess import Postprocessor, PostprocessingRule, PostprocessingPattern
from medspacy.postprocess import postprocessing_functions

In [20]:
postprocessor = Postprocessor(debug=False)

In [21]:
nlp.add_pipe(postprocessor)

In [22]:
postprocess_rules = [
    # Instantiate our rule
    PostprocessingRule(
        # Pass in a list of patterns
        patterns=[
            # The pattern will check if the entity's section is "patient_instructions"
            PostprocessingPattern(condition=lambda ent: ent._.section_title == "patient_instructions"),
        ],
        # If all patterns are True, this entity will be removed.
        action=postprocessing_functions.remove_ent,
        description="Remove any entities from the instructions section."
    ),
    
]

In [23]:
postprocessor.add(postprocess_rules)

Let's look at the entities in our text before applying the postprocessor:

In [24]:
medspacy.visualization.visualize_ent(doc)

Now let's reprocess with our postprocessor:

In [25]:
doc = nlp(text[-560:])

In [26]:
medspacy.visualization.visualize_ent(doc)

This is a simple example. For more complex examples in the context of COVID-19 surveillance, see this repository: https://github.com/abchapman93/VA_COVID-19_NLP_BSV