In [1]:
import spacy
import medspacy
from medspacy.visualization import visualize_ent, visualize_dep

# Overview
In this notebook, we'll look at how to extract clinical concepts and attributes from text.
- Target matching
- Section detection
- Context analysis

In [2]:
with open("./discharge_summary.txt") as f:
    text = f.read()

In [3]:
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Target extraction
In this step, we'll write rules to extract the main concepts we're interested in.

In this example, we'll use two utilities provided in `medspacy.ner` for rule-based matching: the `TargetMatcher` and `TargetRule`. However, you can use any spaCy components for adding spans to `doc.ents`, including pre-trained NER models or other [spaCy rule-based matching components](https://spacy.io/usage/rule-based-matching/).
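
To make the idea of rule-based target matching concrete, here is a minimal pure-Python sketch of case-insensitive literal matching. It is not how `TargetMatcher` is implemented; `SimpleRule`, `match_literals`, and the sample sentence are hypothetical names and data used only for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class SimpleRule(NamedTuple):
    """Hypothetical stand-in for a target rule: a literal phrase and a category."""
    literal: str
    category: str


def match_literals(text: str, rules: List[SimpleRule]) -> List[Tuple[str, str, int]]:
    """Return (matched text, category, start offset) for every rule hit."""
    matches = []
    for rule in rules:
        # Word boundaries keep "stroke" from matching inside "strokes".
        pattern = re.compile(r"\b" + re.escape(rule.literal) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            matches.append((m.group(0), rule.category, m.start()))
    # Sort by position so results read in document order.
    return sorted(matches, key=lambda item: item[2])


rules = [
    SimpleRule("abdominal pain", "PROBLEM"),
    SimpleRule("stroke", "PROBLEM"),
    SimpleRule("hemicolectomy", "TREATMENT"),
]
sample = "Abdominal pain after hemicolectomy. Mother with stroke at age 82."
found = match_literals(sample, rules)
for span_text, category, start in found:
    print(span_text, category, start, sep="  |  ")
```

Unlike this sketch, the medSpaCy components operate on spaCy tokens and write their matches to `doc.ents`, which is what the later pipeline steps build on.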

## Target concepts
In our text, we'll extract the following concepts:
- Diagnoses 
- Medications

In addition, we'll show how to attach a custom spaCy attribute to a target rule, using an ICD-10 diagnosis code as an example attribute of an entity.

In [4]:
from medspacy.ner import TargetMatcher, TargetRule

In [5]:
target_matcher = TargetMatcher(nlp)

In [6]:
nlp.add_pipe(target_matcher)

In [7]:
target_rules1 = [
    TargetRule(literal="abdominal pain", category="PROBLEM"),
    TargetRule("stroke", "PROBLEM"),
    TargetRule("hemicolectomy", "TREATMENT"),
    TargetRule("Hydrochlorothiazide", "TREATMENT"),
    TargetRule("colon cancer", "PROBLEM"),
    TargetRule("radiotherapy", "PROBLEM",
              pattern=[{"LOWER": "xrt"}]),
    TargetRule("metastasis", "PROBLEM"),
    
]

In [8]:
target_matcher.add(target_rules1)

In [9]:
doc = nlp(text)

In [10]:
visualize_ent(doc)

In [11]:
for ent in doc.ents:
    print(ent, ent.label_, ent._.target_rule.literal, sep="  |  ")
    print()

Hydrochlorothiazide  |  TREATMENT  |  Hydrochlorothiazide

Abdominal pain  |  PROBLEM  |  abdominal pain

stroke  |  PROBLEM  |  stroke

abdominal pain  |  PROBLEM  |  abdominal pain

metastasis  |  PROBLEM  |  metastasis

Colon cancer  |  PROBLEM  |  colon cancer

hemicolectomy  |  TREATMENT  |  hemicolectomy

XRT  |  PROBLEM  |  radiotherapy

stroke  |  PROBLEM  |  stroke

abdominal pain  |  PROBLEM  |  abdominal pain

abdominal pain  |  PROBLEM  |  abdominal pain



## Adding custom attributes
One of the most powerful features of spaCy is the ability to add [custom attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes) to spaCy objects (`Doc`, `Span`, or `Token`). These custom attributes are accessed through the `._` namespace (for example, `ent._.icd10`). As we'll see in later steps, medSpaCy adds several of these attributes in other components like `context` or `sectionizer`.

But it is sometimes useful to include a custom attribute as part of the target matching rule, rather than needing to build a separate component to add it. The `TargetRule` can also include a value for these attributes in the `attributes` argument. 

For example, let's say we want to map certain entities to [ICD-10 diagnosis codes](https://www.cdc.gov/nchs/icd/icd10cm.htm). One way we can do this is to include the diagnosis codes for concepts in our knowledge base. For instance, **"Type II Diabetes"** can be mapped to **"E11.9"**. We can add this to entities by first registering the extension for the `Span` class:


In [12]:
from spacy.tokens import Span

In [13]:
Span.set_extension("icd10", default="")

We can now include ICD-10 code values in the target rules. We'll map **"Type II Diabetes Mellitus"** to **"E11.9"** and **"Hypertension"** to **"I10"**:

In [14]:
target_rules2 = [
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ],
              attributes={"icd10": "E11.9"}),
    TargetRule("Hypertension", "PROBLEM",
              pattern=[{"LOWER": {"IN": ["htn", "hypertension"]}}],
              attributes={"icd10": "I10"}),
    
    
]

In [15]:
target_matcher.add(target_rules2)

In [16]:
doc = nlp(text)

Now, whenever one of these rules results in a match, the ICD-10 value can be accessed in `ent._.icd10`:

In [17]:
for ent in doc.ents:
    if ent._.icd10 != "":
        print(ent, ent._.icd10)

type 2 dm E11.9
Type II Diabetes Mellitus E11.9
Hypertension I10
Type 2 DM E11.9
HTN I10
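
To see why the diabetes rule matches both "type 2 dm" and "Type II Diabetes Mellitus", here is a simplified pure-Python sketch of the token-pattern semantics used above (`LOWER`, `IN`, and the optional operator `"OP": "?"`). This is not spaCy's matcher; `token_matches` and `match_at` are hypothetical helpers, and splitting on whitespace stands in for real tokenization.

```python
from typing import Dict, List, Optional

TokenSpec = Dict[str, object]


def token_matches(token: str, spec: TokenSpec) -> bool:
    """Check one token against a {"LOWER": ...} spec (a string or {"IN": [...]})."""
    expected = spec["LOWER"]
    lowered = token.lower()
    if isinstance(expected, dict):
        return lowered in expected["IN"]
    return lowered == expected


def match_at(tokens: List[str], start: int, pattern: List[TokenSpec]) -> Optional[int]:
    """Greedily match `pattern` at `start`; return the end index or None."""
    if not pattern:
        return start
    spec, rest = pattern[0], pattern[1:]
    end = None
    if start < len(tokens) and token_matches(tokens[start], spec):
        end = match_at(tokens, start + 1, rest)
    if end is None and spec.get("OP") == "?":
        # Optional token: retry while skipping this part of the pattern.
        end = match_at(tokens, start, rest)
    return end


diabetes_pattern = [
    {"LOWER": "type"},
    {"LOWER": {"IN": ["2", "ii", "two"]}},
    {"LOWER": {"IN": ["dm", "diabetes"]}},
    {"LOWER": "mellitus", "OP": "?"},
]

for phrase in ["type 2 dm", "Type II Diabetes Mellitus", "type 1 diabetes"]:
    tokens = phrase.split()
    end = match_at(tokens, 0, diabetes_pattern)
    print(phrase, "->", tokens[:end] if end is not None else None)
```

Because every token is compared in lowercase and the final "mellitus" token is optional, one rule covers the abbreviated and the fully spelled-out mentions, and both receive the same `icd10` value.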


# Context
Clinical text often contains mentions of concepts which the patient did not actually experience. For example:

- "There is *no evidence of* **pneumonia**"
- "*Mother* with **breast cancer**"
- "Patient presents for *r/o* **COVID-19**"

In all of these instances, we need to use the contextual clues around the entity to assert attributes like negation, experiencer, and uncertainty.

One method for this is the [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744). ConText links target entities like problems with semantic modifiers like those shown above. The medSpaCy implementation of ConText is [cycontext](https://github.com/medspacy/cycontext).

Here we'll show the basic usage of ConText. When instantiating ConText, we can use the default rules and then add additional rules as needed. See the [cycontext](https://github.com/medspacy/cycontext) repository for more detailed examples and tutorials.
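
As a rough intuition for how modifiers are linked to targets, the following pure-Python sketch assigns a modifier to a target when the target falls on the side of the sentence the modifier points to. It is a heavy simplification and not the cycontext implementation: the `MODIFIERS` table, its category names, and `find_modifiers` are assumptions made for illustration, and matching is done on raw substrings rather than tokens.

```python
from typing import List, Tuple

# (phrase, category, direction): hypothetical, simplified modifier rules.
MODIFIERS: List[Tuple[str, str, str]] = [
    ("no evidence of", "NEGATED_EXISTENCE", "FORWARD"),
    ("mother", "FAMILY", "FORWARD"),
    ("r/o", "POSSIBLE_EXISTENCE", "FORWARD"),
    ("is ruled out", "NEGATED_EXISTENCE", "BACKWARD"),
]


def find_modifiers(sentence: str, target: str) -> List[str]:
    """Return categories of modifiers whose direction covers `target`."""
    lowered = sentence.lower()
    target_start = lowered.find(target.lower())
    if target_start == -1:
        return []
    target_end = target_start + len(target)
    categories = []
    for phrase, category, direction in MODIFIERS:
        pos = lowered.find(phrase)  # naive substring search, first hit only
        if pos == -1:
            continue
        # FORWARD modifiers apply to targets on their right,
        # BACKWARD modifiers to targets on their left.
        if direction == "FORWARD" and pos + len(phrase) <= target_start:
            categories.append(category)
        elif direction == "BACKWARD" and pos >= target_end:
            categories.append(category)
    return categories


examples = [
    ("There is no evidence of pneumonia", "pneumonia"),
    ("Mother with breast cancer", "breast cancer"),
    ("Patient presents for r/o COVID-19", "COVID-19"),
]
for sentence, target in examples:
    print(target, find_modifiers(sentence, target))
```

The real algorithm additionally handles scope termination, multiple modifiers of the same type, and token-level patterns, none of which are modeled here.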

In [18]:
from medspacy.context import ConTextComponent, ConTextItem

In [19]:
context = ConTextComponent(nlp, rules="default")

In [20]:
nlp.add_pipe(context)

In [21]:
nlp.pipe_names

['tagger', 'parser', 'target_matcher', 'context']

In [22]:
doc = nlp("Mother with stroke at age 82.")

We can use the medSpaCy visualizers from the module `medspacy.visualization` to show the targets and modifiers in text. `visualize_dep` draws arrows between modifiers and targets to show which concepts are modified by which semantic modifiers:

In [23]:
visualize_ent(doc)

In [24]:
visualize_dep(doc)

In [25]:
short_doc = nlp("Colon cancer diagnosed in 2012")

We can add a new rule using the `ConTextItem` class:

In [26]:
item_data = [
    ConTextItem("diagnosed in <YEAR>", "HISTORICAL", 
                rule="BACKWARD", # Look "backwards" in the text (to the left)
               pattern=[
                   {"LOWER": "diagnosed"},
                   {"LOWER": "in"},
                   {"LOWER": {"REGEX": r"^[\d]{4}$"}}  # raw string avoids an invalid "\d" escape
               ])
]

In [27]:
context.add(item_data)
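
The `REGEX` entry in the pattern above is an ordinary Python regular expression that is checked against one token at a time (here, its lowercased text). A quick check with the standard `re` module shows which token strings it accepts; the sample tokens are made up for illustration.

```python
import re

# Same expression as in the ConTextItem pattern: exactly four digits.
YEAR_PATTERN = re.compile(r"^[\d]{4}$")

sample_tokens = ["2012", "12", "20123", "year"]
results = {token: bool(YEAR_PATTERN.search(token)) for token in sample_tokens}
for token, is_year in results.items():
    print(token, is_year)
```

The `^` and `$` anchors matter: without them, a longer number such as "20123" would also be accepted because it contains four consecutive digits.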

In [28]:
short_doc = nlp("Colon cancer diagnosed in 2012")

In [29]:
visualize_ent(short_doc)

In [30]:
visualize_dep(short_doc)

In addition to linking targets and modifiers, `cycontext` will also set attributes for each entity:

In [31]:
for ent in doc.ents:
    if any([ent._.is_negated, ent._.is_uncertain, ent._.is_historical, ent._.is_family, ent._.is_hypothetical, ]):
        print("'{0}' modified by {1} in: '{2}'".format(ent, ent._.modifiers, ent.sent))
        print()

'stroke' modified by (<TagObject> [Mother, FAMILY],) in: 'Mother with stroke at age 82.'



# Section detection
We are often interested in which section of a clinical note an entity occurs in. This can be useful for setting attributes like temporality (similar to ConText) or for extracting entities from specific sections of the note.

medSpaCy includes the `Sectionizer` class from the [clinical_sectionizer](https://github.com/medspacy/sectionizer) package. Similar to `ConTextComponent`, we can instantiate this with default rules and add new ones to fit our specific data. Section detection is especially dependent on your data, as each EHR will use different note formatting.

In [32]:
from medspacy.section_detection import Sectionizer

In [33]:
sectionizer = Sectionizer(nlp, patterns="default")

In [34]:
nlp.add_pipe(sectionizer)

In [35]:
doc = nlp(text)

`visualize_ent` will now highlight section titles in addition to entities and context modifiers:

In [36]:
visualize_ent(doc)

We can see here that the default rules did not catch the section title **"Brief Hospital Course"**. We can add a pattern to our sectionizer by passing in a dictionary with two key/value pairs:
- **"section_title"**: The normalized section title
- **"pattern"**: A spaCy pattern to match the text (either a string or a list of Token dictionaries, similar to other components)
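
The idea behind header-based section detection can be sketched in plain Python: find known header strings and cut the note at each one. This is only an illustration under simplifying assumptions (exact string headers, no token patterns) and not the `Sectionizer` implementation; `SECTION_PATTERNS`, `split_sections`, and the sample note are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical header-to-title mapping in the spirit of the patterns above.
SECTION_PATTERNS = {
    "Chief Complaint:": "chief_complaint",
    "Past Medical History:": "past_medical_history",
    "Brief Hospital Course:": "hospital_course",
}


def split_sections(note: str) -> List[Tuple[Optional[str], Optional[str], str]]:
    """Split `note` into (section_title, header, section_text) tuples."""
    header_regex = re.compile("|".join(re.escape(h) for h in SECTION_PATTERNS))
    headers = list(header_regex.finditer(note))
    sections = []
    # Text before the first recognized header gets no title or header.
    first_start = headers[0].start() if headers else len(note)
    if note[:first_start].strip():
        sections.append((None, None, note[:first_start]))
    for i, match in enumerate(headers):
        # A section runs from its header to the next header (or end of note).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(note)
        title = SECTION_PATTERNS[match.group(0)]
        sections.append((title, match.group(0), note[match.start():end]))
    return sections


note = (
    "Admission Date: today\n"
    "Chief Complaint:\nAbdominal pain\n"
    "Brief Hospital Course:\nAdmitted to the ICU.\n"
)
sections = split_sections(note)
for title, header, section_text in sections:
    print(title, "|", header, "|", repr(section_text))
```

Any header that is not listed, such as a site-specific title, is silently absorbed into the preceding section, which is why adding patterns for your own note formats matters.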

In [37]:
section_patterns = [
    {"section_title": "hospital_course", "pattern": "Brief Hospital Course:"}
]

In [38]:
sectionizer.add(section_patterns)

In [39]:
visualize_ent(nlp("""
Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitted on [**2573-5-30**]. Ultrasound at the time of
admission demonstrated pancreatic duct dilitation and an
edematous gallbladder. She was admitted to the ICU.
"""))

The sectionizer will add attributes to allow us to access section data. The `doc` object will have these 3 attributes:

In [40]:
# Normalized section titles
print(doc._.section_titles)

[None, 'other', 'chief_complaint', 'past_medical_history', 'social_history', 'family_history', 'hospital_course', 'medications', 'observation_and_plan', 'patient_instructions', 'signature']


In [41]:
# The Spans of the doc representing section headers
doc._.section_headers

[None,
 Service:,
 Chief Complaint:,
 Past Medical History:,
 Social History:,
 Family History:,
 Brief Hospital Course:,
 Discharge Medications:,
 Discharge Diagnosis:,
 Discharge Instructions:,
 Signed electronically by:]

In [42]:
# Spans of the actual sections of the notes
doc._.section_spans[:5]

[Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
 
 Date of Birth:  [**2498-8-19**]             Sex:   F
 ,
 Service: SURGERY
 
 Allergies:
 Hydrochlorothiazide
 
 Attending:[**First Name3 (LF) 1893**],
 Chief Complaint:
 Abdominal pain
 
 Major Surgical or Invasive Procedure:
 PICC line [**6-25**]
 ERCP w/ sphincterotomy [**5-31**]
 
 
 History of Present Illness:
 74y female with type 2 dm and a recent stroke affecting her
 speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.
 ,
 Past Medical History:
 1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
 chemo. Last colonoscopy showed: Last CEA was in the 8 range
 (down from 9)
 2. Type II Diabetes Mellitus
 3. Hypertension
 ,
 Social History:
 Married, former tobacco use. No alcohol or drug use.
 ]

These three attributes are also available zipped together as tuples in a single attribute, `doc._.sections`:

In [43]:
print(doc._.sections[:5])

[(None, None, Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

), ('other', Service:, Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
), ('chief_complaint', Chief Complaint:, Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.

), ('past_medical_history', Past Medical History:, Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Type II Diabetes Mellitus
3. Hypertension

), ('social_history', Social History:, Social History:
Married, former tobacco use. No alcohol or drug use.

)]


For each section detected in the note, we'll print out the **normalized section title**, **section header**, and **the first 25 tokens of the section**:

In [44]:
for (section_title, section_header, section) in doc._.sections:
    print(section_title, section_header)
    print(section[:25])
    print("----------------")

None None
Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-
----------------
other Service:
Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]

----------------
chief_complaint Chief Complaint:
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
----------------
past_medical_history Past Medical History:
Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
----------------
social_history Social History:
Social History:
Married, former tobacco use. No alcohol or drug use.


----------------
family_history Family History:
Family History:
Mother with stroke at age 82. no early deaths.
2 daughters- healthy



----------------
hospital_course Brief Hospital Course:
Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitted on [**2573-5
----------------
medications Discharge Medications:
Discharge Medications:
1. Miconazole Nitrate 2 % Powder S

Each entity has similar attributes:

In [45]:
for ent in doc.ents:
    print(ent, ent._.section_title)
    print()

Hydrochlorothiazide other

Abdominal pain chief_complaint

type 2 dm chief_complaint

stroke chief_complaint

abdominal pain chief_complaint

metastasis chief_complaint

Colon cancer past_medical_history

hemicolectomy past_medical_history

XRT past_medical_history

Type II Diabetes Mellitus past_medical_history

Hypertension past_medical_history

stroke family_history

Type 2 DM observation_and_plan

HTN observation_and_plan

abdominal pain patient_instructions

abdominal pain patient_instructions
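
Once each entity carries a section title, grouping entities by section is a few lines of standard Python. The sketch below uses the (entity, section title) pairs copied from the printed output above rather than a live `doc`; with a real `doc`, the same list could be built as `[(ent.text, ent._.section_title) for ent in doc.ents]`.

```python
from collections import defaultdict

# (entity text, section title) pairs copied from the printed output above.
entity_sections = [
    ("Hydrochlorothiazide", "other"),
    ("Abdominal pain", "chief_complaint"),
    ("type 2 dm", "chief_complaint"),
    ("stroke", "chief_complaint"),
    ("abdominal pain", "chief_complaint"),
    ("metastasis", "chief_complaint"),
    ("Colon cancer", "past_medical_history"),
    ("hemicolectomy", "past_medical_history"),
    ("XRT", "past_medical_history"),
    ("Type II Diabetes Mellitus", "past_medical_history"),
    ("Hypertension", "past_medical_history"),
    ("stroke", "family_history"),
    ("Type 2 DM", "observation_and_plan"),
    ("HTN", "observation_and_plan"),
    ("abdominal pain", "patient_instructions"),
    ("abdominal pain", "patient_instructions"),
]

# Collect entity texts under their section title, preserving document order.
by_section = defaultdict(list)
for ent_text, section_title in entity_sections:
    by_section[section_title].append(ent_text)

for section_title, ents in by_section.items():
    print(section_title, ents)
```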



# Postprocessing
The final component we'll introduce is the `postprocessor`. The postprocessor iterates through each entity and checks a series of conditions on each. If all conditions evaluate as `True`, then some action is taken on the entity. Use cases include removing an entity or changing one of its attributes.

For example, let's say that we want to exclude any entity which comes from the **"patient_instructions"** section, as these are typically not experienced by the patient and are purely hypothetical. We'll write a rule to remove any entity from `doc.ents` if it came from this section. 

The design pattern for a postprocessing rule is as follows:
- A `PostprocessingRule` contains a list of `patterns` and an `action` to take if all of the `patterns` evaluate as `True`
- Each `PostprocessingPattern` takes a `condition`, which evaluates as `True` or `False`. If all patterns return `True`, the action is taken
- Each pattern can take optional `condition_args` to pass into the condition check, and each rule takes optional `action_args`
- The module `postprocessing_functions` offers utility functions for the `condition` and `action` arguments
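
The design pattern described in the list above can be sketched in plain Python. The classes below are hypothetical analogues written for illustration, not the medSpaCy classes, and entities are represented as simple dictionaries instead of spaCy spans.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Entity = Dict[str, str]


@dataclass
class SketchPattern:
    """Hypothetical analogue of a postprocessing pattern: one boolean check."""
    condition: Callable[..., bool]
    condition_args: tuple = ()

    def __call__(self, ent: Entity) -> bool:
        return self.condition(ent, *self.condition_args)


@dataclass
class SketchRule:
    """Hypothetical analogue of a rule: act only if every pattern is True."""
    patterns: List[SketchPattern]
    action: Callable[[Entity, List[Entity]], None]
    description: str = ""

    def apply(self, ent: Entity, ents: List[Entity]) -> None:
        if all(pattern(ent) for pattern in self.patterns):
            self.action(ent, ents)


def in_section(ent: Entity, title: str) -> bool:
    return ent["section_title"] == title


def remove_ent(ent: Entity, ents: List[Entity]) -> None:
    ents.remove(ent)


ents = [
    {"text": "HTN", "section_title": "observation_and_plan"},
    {"text": "abdominal pain", "section_title": "patient_instructions"},
    {"text": "abdominal pain", "section_title": "patient_instructions"},
]
rule = SketchRule(
    patterns=[SketchPattern(in_section, ("patient_instructions",))],
    action=remove_ent,
    description="Remove any entities from the instructions section.",
)
# Iterate over a copy because the action mutates the list.
for ent in list(ents):
    rule.apply(ent, ents)
print([ent["text"] for ent in ents])
```

Keeping conditions and actions as plain callables means a rule can combine several independent checks without any of them needing to know what action follows.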

In [46]:
from medspacy.postprocess import Postprocessor, PostprocessingRule, PostprocessingPattern
from medspacy.postprocess import postprocessing_functions

In [47]:
postprocessor = Postprocessor(debug=False)

In [48]:
nlp.add_pipe(postprocessor)

In [49]:
postprocess_rules = [
    # Instantiate our rule
    PostprocessingRule(
        # Pass in a list of patterns
        patterns=[
            # The pattern will check if the entity's section is "patient_instructions"
            PostprocessingPattern(condition=lambda ent: ent._.section_title == "patient_instructions"),
        ],
        # If all patterns are True, this entity will be removed.
        action=postprocessing_functions.remove_ent,
        description="Remove any entities from the instructions section."
    ),
    
]

Before adding the postprocessing rules, here are the final 5 entities:

In [50]:
print("Before:")
for ent in doc.ents[-5:]:
    print(ent, ent._.section_title)

Before:
stroke family_history
Type 2 DM observation_and_plan
HTN observation_and_plan
abdominal pain patient_instructions
abdominal pain patient_instructions


In [51]:
postprocessor.add(postprocess_rules)

In [52]:
doc = nlp(text)

Afterwards, the final 2 entities have been removed:

In [53]:
print("After:")
for ent in doc.ents[-5:]:
    print(ent, ent._.section_title)

After:
Type II Diabetes Mellitus past_medical_history
Hypertension past_medical_history
stroke family_history
Type 2 DM observation_and_plan
HTN observation_and_plan
