In [1]:
import sys

In [2]:
sys.path.insert(0, "..")

In [3]:
import spacy
import medspacy

# Overview
The previous notebooks showed the components loaded by a default medSpaCy model: a custom tokenizer, a sentence splitter, a target matcher, and ConText. MedSpaCy also provides other components that can be instantiated and added to an existing pipeline:
- `Sectionizer`: Detecting section boundaries in a clinical note
- `Preprocessor`: Destructive preprocessing for simplifying or cleaning up text
- `Postprocessor`: Additional business logic for altering or removing entities at the end of pipeline processing

We'll start by loading some of the custom rules shown in previous notebooks:

In [4]:
nlp = medspacy.load()

In [5]:
from medspacy.ner import TargetRule

In [6]:
target_matcher = nlp.get_pipe("medspacy_target_matcher")

In [7]:
target_rules = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("radiotherapy", "TREATMENT",
              pattern=[{"LOWER": "xrt"}]),
    TargetRule("metastasis", "PROBLEM"),
    
]

In [8]:
target_matcher.add(target_rules)

In [9]:
with open("./discharge_summary.txt") as f:
    text = f.read()

# Section detection
We are often interested in which section of a clinical note an entity occurs in. This can be useful for setting attributes like temporality (similar to ConText) or for extracting entities from specific sections of the note.

MedSpaCy includes the `Sectionizer` class from the [clinical_sectionizer](https://github.com/medspacy/sectionizer) package. Similar to `ConTextComponent`, we can instantiate this with default rules and add new ones to fit our specific data. Section detection is especially dependent on your data, as each EHR will use different note formatting.

In [10]:
from medspacy.section_detection import Sectionizer

In [11]:
sectionizer = nlp.add_pipe("medspacy_sectionizer")

In [12]:
nlp.pipe_names

['medspacy_pyrush',
 'medspacy_target_matcher',
 'medspacy_context',
 'medspacy_sectionizer']

In [13]:
doc = nlp(text)

`visualize_ent` will now highlight section titles in addition to entities and context modifiers:

In [14]:
from medspacy.visualization import visualize_ent

In [15]:
visualize_ent(doc)

We can see here that the default rules did not catch the section title **"Brief Hospital Course"**. We can add a `SectionRule` to define which sections to match in the text. `SectionRule` is similar to `TargetRule` and `ConTextRule` and takes the following arguments:
- **"literal"**: An exact string to match if `pattern` is None
- **"category"**: The normalized section title
- **"pattern"** (opt): Either a regular expression string or a spaCy pattern (list of dicts) to match the text
- **"parents"** (opt): Optional parent sections, which will be explained in other notebooks

In [16]:
from medspacy.section_detection import SectionRule
section_rules = [
    SectionRule(literal="Brief Hospital Course:", category="hospital_course"),
    SectionRule("Major Surgical or Invasive Procedure:", "procedure",
               pattern=r"Major Surgical( or |/)Invasive Procedure:"),
    SectionRule("Assessment/Plan", "assessment_and_plan",
               pattern=[
                   {"LOWER": "assessment"},
                   {"LOWER": {"IN": ["and", "/", "&"]}},
                   {"LOWER": "plan"}
               ]),
]
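
To see what the regular expression in the `procedure` rule accepts, it can be tried on its own with Python's `re` module. This is only a sketch of the pattern's behavior: how the sectionizer applies it internally (for example, case handling) is not shown here, and the sample headers are illustrative.

```python
import re

# Regular expression copied from the "procedure" SectionRule.
procedure_pattern = re.compile(r"Major Surgical( or |/)Invasive Procedure:")

# Illustrative header variants; only the first appears in the sample note.
headers = [
    "Major Surgical or Invasive Procedure:",
    "Major Surgical/Invasive Procedure:",
    "Major Surgical and Invasive Procedure:",
]

# Record which variants the pattern accepts.
matches = {header: bool(procedure_pattern.search(header)) for header in headers}
for header, matched in matches.items():
    print(f"{matched!s:5}  {header}")
```

The first two variants match and the third does not, so a header written with "and" would need its own rule or a broader pattern.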

In [17]:
sectionizer.add(section_rules)

In [18]:
visualize_ent(nlp("""
Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitted on [**2573-5-30**]. Ultrasound at the time of
admission demonstrated pancreatic duct dilitation and an
edematous gallbladder. She was admitted to the ICU.

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]

Assessment/Plan:
Follow up in 3 weeks.
"""))

The sectionizer will add attributes to allow us to access section data. The `doc` object will have a number of section-related attributes:

In [19]:
# Normalized section titles
print(doc._.section_categories)

[None, 'other', 'allergy', 'chief_complaint', 'history_of_present_illness', 'past_medical_history', 'social_history', 'family_history', 'hospital_course', 'medications', 'observation_and_plan', 'patient_instructions', 'signature']


In [20]:
# The Spans of the doc representing section headers
doc._.section_titles

[,
 Service:,
 Allergies:,
 Chief Complaint:,
 History of Present Illness:,
 Past Medical History:,
 Social History:,
 Family History:,
 Brief Hospital Course:,
 Discharge Medications:,
 Discharge Diagnosis:,
 Discharge Instructions:,
 Signed electronically by:]

In [21]:
# Spans of the entire sections of the notes
doc._.section_spans[:5]

[Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
 
 Date of Birth:  [**2498-8-19**]             Sex:   F
 ,
 Service: SURGERY
 ,
 Allergies:
 Hydrochlorothiazide
 
 Attending:[**First Name3 (LF) 1893**],
 Chief Complaint:
 Abdominal pain
 
 Major Surgical or Invasive Procedure:
 PICC line [**6-25**]
 ERCP w/ sphincterotomy [**5-31**]
 
 ,
 History of Present Illness:
 74y female with type 2 dm and a recent stroke affecting her
 speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis. She is not receiving any chemo.
 ]

The sections themselves can be accessed as `Section` objects under `doc._.sections`:

In [22]:
print(doc._.sections[:5])

[Section(category=None at 0 : 0 in the doc with a body at 0 : 54 based on the rule None, Section(category=other at 54 : 56 in the doc with a body at 56 : 58 based on the rule SectionRule(literal="Service:", category="other", pattern=None, on_match=None, parents=None, parent_required=False), Section(category=allergy at 58 : 60 in the doc with a body at 60 : 78 based on the rule SectionRule(literal="ALLERGIES:", category="allergy", pattern=None, on_match=None, parents=None, parent_required=False), Section(category=chief_complaint at 78 : 81 in the doc with a body at 81 : 118 based on the rule SectionRule(literal="CHIEF COMPLAINT:", category="chief_complaint", pattern=None, on_match=None, parents=None, parent_required=False), Section(category=history_of_present_illness at 118 : 123 in the doc with a body at 123 : 163 based on the rule SectionRule(literal="HISTORY OF PRESENT ILLNESS:", category="history_of_present_illness", pattern=None, on_match=None, parents=None, parent_required=False)]

For each section detected in the note, we'll print:
- The normalized section title (`category`)
- The token offsets of the section header (`title_span`)
- The parent section
- The token offsets of the entire section (`section_span`)

These are explained in more detail in the `section_detection/` notebooks.

In [23]:
for section in doc._.sections:
    # title_span and section_span are (start, end) token offsets into the doc
    print(section.category, section.title_span, section.parent)
    print(section.section_span)
    print("----------------")

None (0, 0) None
(0, 54)
----------------
other (54, 56) None
(54, 58)
----------------
allergy (58, 60) None
(58, 78)
----------------
chief_complaint (78, 81) None
(78, 118)
----------------
history_of_present_illness (118, 123) None
(118, 163)
----------------
past_medical_history (163, 167) None
(163, 222)
----------------
social_history (222, 225) None
(222, 239)
----------------
family_history (239, 242) None
(239, 260)
----------------
hospital_course (260, 264) None
(260, 316)
----------------
medications (316, 319) None
(316, 411)
----------------
observation_and_plan (411, 414) None
(411, 430)
----------------
patient_instructions (430, 433) None
(430, 543)
----------------
signature (543, 547) None
(543, 596)
----------------
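
The offsets printed above suggest a simple relationship: each section starts at its title and ends where the next title begins, and its body is whatever follows the title. The sketch below reproduces the first five sections from their title offsets alone. The helper `section_and_body_spans` is hypothetical and written only for illustration; it is not part of medSpaCy and does not describe how the sectionizer is implemented.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def section_and_body_spans(title_spans: List[Span], doc_end: int) -> List[Tuple[Span, Span]]:
    """Hypothetical helper: derive (section_span, body_span) pairs from title offsets."""
    results = []
    for i, (title_start, title_end) in enumerate(title_spans):
        # A section ends where the next title starts, or at doc_end for the last one.
        section_end = title_spans[i + 1][0] if i + 1 < len(title_spans) else doc_end
        results.append(((title_start, section_end), (title_end, section_end)))
    return results


# Title offsets of the first five sections, taken from the output above.
title_spans = [(0, 0), (54, 56), (58, 60), (78, 81), (118, 123)]
# 163 is where the next title ("Past Medical History:") starts.
spans = section_and_body_spans(title_spans, doc_end=163)
for section_span, body_span in spans:
    print(section_span, body_span)
```

The computed section offsets agree with the `section_span` values printed above, and the body offsets agree with the "body at" positions in the `Section` representations.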


Each entity has similar attributes:

In [24]:
for ent in doc.ents:
    print(ent, ent._.section_category, ent._.section_title, ent._.section_span[:10])
    print()

Hydrochlorothiazide allergy Allergies: Allergies:
Hydrochlorothiazide

Attending:[**

Abdominal pain chief_complaint Chief Complaint: Chief Complaint:
Abdominal pain

Major Surgical or

stroke history_of_present_illness History of Present Illness: History of Present Illness:
74y female with type

abdominal pain history_of_present_illness History of Present Illness: History of Present Illness:
74y female with type

metastasis history_of_present_illness History of Present Illness: History of Present Illness:
74y female with type

Colon cancer past_medical_history Past Medical History: Past Medical History:
1. Colon cancer dx

hemicolectomy past_medical_history Past Medical History: Past Medical History:
1. Colon cancer dx

XRT past_medical_history Past Medical History: Past Medical History:
1. Colon cancer dx

stroke family_history Family History: Family History:
Mother with stroke at age 82

abdominal pain patient_instructions Discharge Instructions: Discharge Instructions:
Patient may 
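
One common use of these attributes is pulling out the entities found in a particular section. The sketch below groups the entities by section category using plain Python; the `(text, category)` pairs are copied from the output above so the example runs without a pipeline. With a real `Doc`, the same loop could read `ent._.section_category` from each entity in `doc.ents`.

```python
from collections import defaultdict

# (entity text, section category) pairs copied from the output above.
entity_sections = [
    ("Hydrochlorothiazide", "allergy"),
    ("Abdominal pain", "chief_complaint"),
    ("stroke", "history_of_present_illness"),
    ("abdominal pain", "history_of_present_illness"),
    ("metastasis", "history_of_present_illness"),
    ("Colon cancer", "past_medical_history"),
    ("hemicolectomy", "past_medical_history"),
    ("XRT", "past_medical_history"),
    ("stroke", "family_history"),
    ("abdominal pain", "patient_instructions"),
]

# Group entity texts under the category of the section they occur in.
by_section = defaultdict(list)
for text, category in entity_sections:
    by_section[category].append(text)

print(by_section["past_medical_history"])
```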

# Preprocessing
In preprocessing, we'll take some steps to clean up the text:
- Replace MIMIC-style date brackets with "01-01-2010" and year brackets with "2010", and remove all other MIMIC-style bracketed placeholders
- Replace abbreviations such as "dx'd" and "tx'd" to simplify later processing

Lower-casing is another common preprocessing step, but it is not applied here, since later steps are sometimes case-sensitive unless explicitly told not to be.

The preprocessing component is implemented in [nlp_preprocessor](https://github.com/medspacy/nlp_preprocessor).

In [25]:
from medspacy.preprocess import Preprocessor, PreprocessingRule
import re

In [26]:
preprocessor = Preprocessor(nlp.tokenizer)

In [27]:
preprocess_rules = [
    PreprocessingRule(
        r"\[\*\*[\d]{1,4}-[\d]{1,2}(-[\d]{1,2})?\*\*\]",
        repl="01-01-2010",
        desc="Replace MIMIC date brackets with a generic date."
    ),
    
    PreprocessingRule(
        r"\[\*\*[\d]{4}\*\*\]",
        repl="2010",
        desc="Replace MIMIC year brackets with a generic year."
    ),
    
    PreprocessingRule(
        r"dx'd",
        repl="Diagnosed",
        desc="Replace abbreviation"
    ),

    PreprocessingRule(
        r"tx'd",
        repl="Treated",
        desc="Replace abbreviation"
    ),

    # No repl is given, so matches are removed
    PreprocessingRule(
        r"\[\*\*[^\]]+\]",
        desc="Remove all other bracketed placeholder text from MIMIC"
    )
]

In [28]:
preprocessor.add(preprocess_rules)
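
The effect of these rules can be approximated with ordered `re.sub` calls. The sketch below is a stand-in for illustration, not medSpaCy's implementation: the `apply_rules` helper and the sample sentence are hypothetical, while the patterns and replacements are copied from the rules above.

```python
import re

# (pattern, replacement) pairs mirroring the rules above, in the same order.
rules = [
    (r"\[\*\*[\d]{1,4}-[\d]{1,2}(-[\d]{1,2})?\*\*\]", "01-01-2010"),
    (r"\[\*\*[\d]{4}\*\*\]", "2010"),
    (r"dx'd", "Diagnosed"),
    (r"tx'd", "Treated"),
    (r"\[\*\*[^\]]+\]", ""),
]


def apply_rules(text: str) -> str:
    """Hypothetical helper: apply each substitution in order."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text


# Illustrative sentence assembled from fragments of the sample note.
sample = (
    "Colon cancer dx'd in [**2554**], tx'd with hemicolectomy. "
    "Admitted on [**2573-5-30**] by [**First Name3 (LF) 1893**]."
)
cleaned = apply_rules(sample)
print(cleaned)
```

Order matters in this sketch: the catch-all bracket pattern comes last because it would also match the date and year placeholders and delete them before the more specific rules could replace them.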

In [29]:
nlp.tokenizer = preprocessor

In [30]:
preprocessed_doc = nlp(text)

In [31]:
# Compare the original text with the preprocessed Doc
print(text)

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis. She is not receiving any chemo.

Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Type II Diabetes Mellitus
3. Hypertension

Social History:
Married, former tobacco use. No alcohol or drug use.

Family History:
Mother with stroke at age 82. no early deaths.
2 daughters- healthy


Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitte

In [32]:
preprocessed_doc

Admission Date:  01-01-2010              Discharge Date:   01-01-2010

Date of Birth:  01-01-2010             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line 01-01-2010
ERCP w/ sphincterotomy 01-01-2010


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis. She is not receiving any chemo.

Past Medical History:
1. Colon cancer Diagnosed in 2010, Treated with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Type II Diabetes Mellitus
3. Hypertension

Social History:
Married, former tobacco use. No alcohol or drug use.

Family History:
Mother with stroke at age 82. no early deaths.
2 daughters- healthy


Brief Hospital Course:
Ms.  was admitted on 01-01-2010. Ultrasound at the time of
admission demonstrated pancre

# Postprocessing
The final component we'll introduce is the `Postprocessor`. The postprocessor iterates through each entity and checks a series of conditions on each. If all conditions evaluate as `True`, then some action is taken on the entity. Use cases include removing an entity or changing one of its attributes.

For example, let's say that we want to exclude any entity which comes from the **"patient_instructions"** section, as these are typically not experienced by the patient and are purely hypothetical. We'll write a rule to remove any entity from `doc.ents` if it came from this section. 

The design pattern for a postprocessing rule is as follows:
- A `PostprocessingRule` contains a list of `patterns` and an `action` to take if all of the conditions are met.
- Each `PostprocessingPattern` takes a `condition`, whose result is compared against `success_value`.
- Each pattern can take optional `condition_args` to pass into the condition check, and each rule takes optional `action_args`.
- The module `postprocessing_functions` offers utility functions for the `condition` and `action` arguments.
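
The design pattern can be sketched in plain Python. The classes below (`SketchPattern`, `SketchRule`) and the `remove_from_list` action are hypothetical stand-ins written for illustration; they are not medSpaCy's classes, and entities are represented as dictionaries rather than spaCy spans.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class SketchPattern:
    """Illustrative stand-in for a postprocessing pattern (not medSpaCy's class)."""
    condition: Callable[..., Any]
    success_value: Any = True
    condition_args: Tuple[Any, ...] = ()

    def matches(self, ent: dict) -> bool:
        # The condition's return value is compared against success_value.
        return self.condition(ent, *self.condition_args) == self.success_value


@dataclass
class SketchRule:
    """Illustrative stand-in for a postprocessing rule (not medSpaCy's class)."""
    patterns: List[SketchPattern]
    action: Callable[..., None]
    action_args: Tuple[Any, ...] = ()

    def apply(self, ent: dict, ents: List[dict]) -> None:
        # The action only runs when every pattern matches.
        if all(pattern.matches(ent) for pattern in self.patterns):
            self.action(ent, ents, *self.action_args)


def remove_from_list(ent: dict, ents: List[dict]) -> None:
    """Hypothetical action: drop the entity from the list."""
    ents.remove(ent)


ents = [
    {"text": "stroke", "section_category": "family_history"},
    {"text": "abdominal pain", "section_category": "patient_instructions"},
]
rule = SketchRule(
    patterns=[
        SketchPattern(lambda e: e["section_category"], success_value="patient_instructions")
    ],
    action=remove_from_list,
)
for ent in list(ents):  # iterate over a copy because the action mutates ents
    rule.apply(ent, ents)
print([e["text"] for e in ents])
```

Keeping the condition, the expected value, and the action separate means the same condition function can be reused with different success values or paired with different actions.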

In [33]:
from medspacy.postprocess import Postprocessor, PostprocessingRule, PostprocessingPattern
from medspacy.postprocess import postprocessing_functions

In [34]:
postprocessor = Postprocessor(nlp, debug=False) # Set to True for more verbose information about rule matching

In [35]:
# add_pipe returns the component registered on the pipeline; rules are added to this instance
postprocessor = nlp.add_pipe("medspacy_postprocessor")

In [36]:
postprocess_rules = [
    # Instantiate our rule
    PostprocessingRule(
        # Pass in a list of patterns
        patterns=[
            # The pattern will check if the entity's section is "patient_instructions"
            PostprocessingPattern(condition=lambda ent: ent._.section_category, success_value="patient_instructions"),
        ],
        # If all patterns are True, this entity will be removed.
        action=postprocessing_functions.remove_ent,
        description="Remove any entities from the instructions section."
    ),
    
]

Before adding the postprocessing rules, here are the entities in the doc:

In [37]:
print("Before:")
print()
for ent in doc.ents:
    print(ent, ent._.section_title, sep="  |  ")

Before:

Hydrochlorothiazide  |  Allergies:
Abdominal pain  |  Chief Complaint:
stroke  |  History of Present Illness:
abdominal pain  |  History of Present Illness:
metastasis  |  History of Present Illness:
Colon cancer  |  Past Medical History:
hemicolectomy  |  Past Medical History:
XRT  |  Past Medical History:
stroke  |  Family History:
abdominal pain  |  Discharge Instructions:
abdominal pain  |  Discharge Instructions:


In [38]:
postprocessor.add(postprocess_rules)

In [39]:
doc = nlp(text)

Afterwards, the entities in the `patient_instructions` section were removed.

In [40]:
print("After:")
print()
for ent in doc.ents:
    print(ent, ent._.section_title, sep="  |  ")

After:

Hydrochlorothiazide  |  Allergies:
Abdominal pain  |  Chief Complaint:
stroke  |  History of Present Illness:
abdominal pain  |  History of Present Illness:
metastasis  |  History of Present Illness:
Colon cancer  |  Past Medical History:
hemicolectomy  |  Past Medical History:
XRT  |  Past Medical History:
stroke  |  Family History:
