# Overview
As we saw in the last notebook, spaCy doesn't work well on clinical text out of the box. We want to extract different types of information from clinical text than from news or Wikipedia articles, and clinical text itself is very different from general-domain language:
- It is very messy, with semi-structured formatting from the EHR
- Clinical documents include many abbreviations, some of which are ambiguous
- There are specific tasks needed in clinical NLP, such as detecting negation or uncertainty for concepts in the text

One of the most powerful features of spaCy is that it is **very customizable**. In addition to working with the default models provided in the core library, you can create your own [custom components](https://spacy.io/usage/processing-pipelines#custom-components) or add your own [extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes). Developers and researchers can then publish their spaCy extensions to the open-source community. Some examples of these openly available libraries are:

- [scispacy](https://allenai.github.io/scispacy/): Includes models trained on biomedical literature
- [medCAT](https://github.com/CogStack/MedCAT): Models trained for medical concept extraction

In this notebook, we'll use [medspacy](https://github.com/medspacy/medspacy), a newly released package for performing clinical NLP tasks in spaCy.

# medspacy
[MedSpaCy](https://github.com/medspacy/medspacy) is an open-source package maintained by NLP developers at the University of Utah and the US Department of Veterans Affairs. The goal of medSpaCy is to provide flexible, easy-to-use spaCy components for common clinical NLP tasks, such as:

- Concept extraction
- Negation detection
- Document section splitting

One early use of medSpaCy is a [biosurveillance system for identifying positive cases of COVID-19](https://openreview.net/forum?id=ZQ_HvBxcdCv) (currently pending review).

**MedSpaCy is still in beta**, and you are one of the first users!

# Using medSpaCy

In [45]:
import spacy
import medspacy

In [2]:
nlp = medspacy.load()

In [3]:
nlp.pipe_names

['tagger',
 'parser',
 'target_matcher',
 'sectionizer',
 'context',
 'postprocessor']

In [4]:
texts = [
    "Patient presents for management of Type II Diabetes Mellitus",
    "No evidence of pneumonia",
    "Past medical history significant for afib, CHF, and CKD Stage 3, now CKD stage five.",
    "Mother with breast cancer",
    "continue metformin for type 2 dm",
    "Her grandma was recently diagnosed with Alzheimer's"
]

## Concept extraction

In [5]:
target_matcher = nlp.get_pipe("target_matcher")

In [6]:
from medspacy.ner import TargetRule

In [7]:
target_rules = [
    TargetRule("pneumonia", "PROBLEM"),
    TargetRule("afib", "PROBLEM"),
    TargetRule("CHF", "PROBLEM"),
    TargetRule("Breast Cancer", "PROBLEM"),
    TargetRule("Alzheimer's", "PROBLEM"),
    TargetRule("metformin", "TREATMENT"),
]

We then add these rules to our target matcher:

In [8]:
target_matcher.add(target_rules)

The simplest rules match exact strings, as shown above. However, we can also write more complex patterns to match concepts whose form varies. For example, the same or similar concepts can be written in several ways:
- **"Type II Diabetes Mellitus"** and **"type 2 dm"**
- **"CKD Stage 3"** and **"CKD Stage Five"**

We can write more complex rules using token attributes to match multiple string formats at once:

In [9]:
target_rules2 = [
    TargetRule("CKD", "PROBLEM", pattern=[
        {"LOWER": "ckd"}, # Token 1
        {"LOWER": "stage"}, # Token 2
        {"LIKE_NUM": True} # Token 3
        ]),
    
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ]),
]

In [10]:
target_matcher.add(target_rules2)
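
To make the token-pattern semantics concrete, here is a minimal pure-Python sketch of how patterns like the ones above can be matched against a list of tokens. It is not how spaCy or medSpaCy implement matching: `token_matches`, `match_at`, `find_matches`, and `NUMBER_WORDS` are hypothetical names, the tokenizer is a simple regular expression, and only `LOWER`, `LIKE_NUM`, `IN`, and the `"?"` operator are handled.

```python
import re

# Illustration only: a tiny subset of token-pattern matching.
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}


def token_matches(token, spec):
    """Check a single token (a string) against a single pattern dict."""
    for attr, expected in spec.items():
        if attr == "OP":
            continue  # operators are handled in match_at
        if attr == "LOWER":
            value = token.lower()
            if isinstance(expected, dict):
                if value not in expected["IN"]:
                    return False
            elif value != expected:
                return False
        elif attr == "LIKE_NUM":
            like_num = token.isdigit() or token.lower() in NUMBER_WORDS
            if like_num != expected:
                return False
        else:
            raise ValueError(f"Attribute not supported in this sketch: {attr}")
    return True


def match_at(tokens, start, pattern):
    """Return the end index of a match that begins at `start`, or None."""
    i = start
    for spec in pattern:
        if i < len(tokens) and token_matches(tokens[i], spec):
            i += 1
        elif spec.get("OP") == "?":
            continue  # optional token is absent, try the next spec
        else:
            return None
    return i


def find_matches(text, pattern):
    """Return every matched span as a string, scanning left to right."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    matches = []
    i = 0
    while i < len(tokens):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            matches.append(" ".join(tokens[i:end]))
            i = end
        else:
            i += 1
    return matches


ckd_pattern = [{"LOWER": "ckd"}, {"LOWER": "stage"}, {"LIKE_NUM": True}]
diabetes_pattern = [
    {"LOWER": "type"},
    {"LOWER": {"IN": ["2", "ii", "two"]}},
    {"LOWER": {"IN": ["dm", "diabetes"]}},
    {"LOWER": "mellitus", "OP": "?"},
]

print(find_matches("now CKD Stage 3, then CKD stage five.", ckd_pattern))
print(find_matches("continue metformin for type 2 dm", diabetes_pattern))
```

Each dictionary describes exactly one token, which is why a single rule can cover both "CKD Stage 3" and "CKD stage five": the third token only has to look like a number, whether it is written as a digit or a word.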

In [11]:
docs = list(nlp.pipe(texts))

In [12]:
from spacy import displacy

In [13]:
displacy.render(docs, style="ent")

## Contextual analysis
Clinical text often contains mentions of concepts which the patient did not actually experience. For example:

- "There is *no evidence of* **pneumonia**"
- "*Mother* with **breast cancer**"
- "Patient presents for *r/o* **COVID-19**"

In all of these instances, we need to use the contextual clues around the entity to assert attributes like negation, experiencer, and uncertainty.

One method for this is the [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744). ConText links target entities like problems with semantic modifiers like those shown above. The medSpaCy implementation of ConText is [cycontext](https://github.com/medspacy/cycontext).

Here we'll show the basic usage of ConText. When instantiating ConText, we can use the default rules and then add more rules as needed. See the [cycontext](https://github.com/medspacy/cycontext) repository for more detailed examples and tutorials.
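
Before using the real component, it can help to see the core idea in a few lines. The sketch below is a heavily reduced, pure-Python illustration of ConText-style linking, not the cycontext implementation. It assumes every modifier only looks forward within a single sentence, and the names `MODIFIER_RULES`, `link_modifiers`, and `to_attributes` are hypothetical. The category labels `NEGATED_EXISTENCE` and `FAMILY` also appear in the output later in this notebook; using `POSSIBLE_EXISTENCE` for "r/o" is an assumption made for illustration.

```python
import re

# Hypothetical, reduced rule set: (modifier phrase, category).
# Assumption: all modifiers apply to targets that come after them.
MODIFIER_RULES = [
    ("no evidence of", "NEGATED_EXISTENCE"),
    ("mother", "FAMILY"),
    ("r/o", "POSSIBLE_EXISTENCE"),
]


def _find(phrase, text):
    """Find a phrase as a whole unit (not inside a longer word)."""
    return re.search(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", text)


def link_modifiers(sentence, targets):
    """Map each target found in the sentence to the modifier categories before it."""
    lowered = sentence.lower()
    links = {}
    for target in targets:
        target_match = _find(target.lower(), lowered)
        if target_match is None:
            continue
        categories = []
        for phrase, category in MODIFIER_RULES:
            modifier_match = _find(phrase, lowered)
            # Forward scope: the modifier must end before the target starts.
            if modifier_match and modifier_match.end() <= target_match.start():
                categories.append(category)
        links[target] = categories
    return links


def to_attributes(categories):
    """Turn linked categories into boolean attributes (illustrative keys)."""
    return {
        "is_negated": "NEGATED_EXISTENCE" in categories,
        "is_family": "FAMILY" in categories,
        "is_uncertain": "POSSIBLE_EXISTENCE" in categories,
    }


examples = [
    ("There is no evidence of pneumonia", ["pneumonia"]),
    ("Mother with breast cancer", ["breast cancer"]),
    ("Patient presents for r/o COVID-19", ["COVID-19"]),
]
for sentence, targets in examples:
    for target, categories in link_modifiers(sentence, targets).items():
        print(target, categories, to_attributes(categories))
```

The real algorithm is richer than this: rules can look backward or in both directions, and a modifier's scope can be cut short before the end of the sentence.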

In [14]:
from medspacy.ner import TargetRule
from medspacy.context import ConTextItem
from medspacy.visualization import visualize_ent, visualize_dep

In [15]:
doc = nlp("There is no evidence of pneumonia.")

In [16]:
visualize_ent(doc)

In [17]:
visualize_dep(doc)

In [18]:
ent = doc.ents[0]
print(ent)
print(ent._.is_negated)

pneumonia
True


In [19]:
for doc in docs:
    for ent in doc.ents:
        print("{0} is modified by {1}".format(ent, ent._.modifiers))

Type II Diabetes Mellitus is modified by ()
pneumonia is modified by (<TagObject> [No evidence of, NEGATED_EXISTENCE],)
afib is modified by (<TagObject> [Past medical history, HISTORICAL],)
CHF is modified by (<TagObject> [Past medical history, HISTORICAL],)
CKD Stage 3 is modified by (<TagObject> [Past medical history, HISTORICAL],)
CKD stage five is modified by (<TagObject> [Past medical history, HISTORICAL],)
breast cancer is modified by (<TagObject> [Mother, FAMILY],)
metformin is modified by ()
type 2 dm is modified by ()
Alzheimer's is modified by ()


In [20]:
idx = 3

In [21]:
visualize_ent(docs[idx])

In [22]:
visualize_dep(docs[idx])

### Adding rules to ConText


In [23]:
from medspacy.context import ConTextItem

In [24]:
context = nlp.get_pipe("context")

In [25]:
text = "Her grandma was recently diagnosed with Alzheimer's"
doc = nlp(text)

In [26]:
visualize_ent(doc)

In [27]:
new_item_data = [
    ConTextItem("grandma", "FAMILY"),
]

In [28]:
context.add(new_item_data)

In [29]:
doc = nlp(text)

In [30]:
visualize_ent(doc)

In [31]:
visualize_dep(doc)

## Section detection
Clinical notes often follow a certain structure. One example of this is the [SOAP note](https://www.globalpremeds.com/blog/2015/01/02/understanding-soap-format-for-clinical-rounds/). Different parts of a note have different significance. For example, a condition listed in the **Past Medical History** or **Problem List** is likely a historical condition which may not be relevant to the current patient visit, whereas the **Assessment/Plan** will contain more up-to-date diagnoses.

MedSpaCy will detect sections through the `sectionizer` component. 
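
As a rough mental model of what a sectionizer does, the sketch below splits a note on known section headers using regular expressions from the standard library. It is an illustration under simplifying assumptions, not the medSpaCy implementation: `SECTION_PATTERNS` and `split_sections` are hypothetical names, the header list is deliberately tiny, and any text before the first recognized header is ignored.

```python
import re

# Hypothetical header patterns mapped to normalized section titles.
SECTION_PATTERNS = {
    "past_medical_history": r"past medical history:",
    "family_history": r"family history:",
    "observation_and_plan": r"assessment/plan:|impression:",
}


def split_sections(text):
    """Return (section_title, section_body) tuples in document order."""
    # One named group per section title, so a match tells us which title fired.
    combined = "|".join(
        f"(?P<{title}>{pattern})" for title, pattern in SECTION_PATTERNS.items()
    )
    headers = list(re.finditer(combined, text, flags=re.IGNORECASE))
    sections = []
    for i, header in enumerate(headers):
        # A section body runs until the next header or the end of the note.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        sections.append((header.lastgroup, text[header.end():end].strip()))
    return sections


note = """Past Medical History:
1. Type II DM
2. Afib

Family History:
1. Breast Cancer

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm.
"""

for title, body in split_sections(note):
    print(title, "->", repr(body))
```

Several surface forms ("IMPRESSION:" and "Assessment/Plan:") map to the same normalized title here, which is the same kind of normalization that lets differently worded headers be treated as one section type.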

In [32]:
sectionizer = nlp.get_pipe("sectionizer")

In [33]:
text = """Past Medical History:
1. Type II DM
2. Afib
3. CKD Stage 3

Family History:
1. Breast Cancer


Reason for this examination: Possible pneumonia.

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm.
"""

In [34]:
doc = nlp(text)

In [35]:
visualize_ent(doc)

In [36]:
print(doc._.section_titles)

['past_medical_history', 'family_history', 'reason_for_examination', 'observation_and_plan', 'observation_and_plan']


In [37]:
for ent in doc.ents:
    print(ent, ent._.section_title)

Type II DM past_medical_history
Afib past_medical_history
CKD Stage 3 past_medical_history
Breast Cancer family_history
pneumonia reason_for_examination
pneumonia observation_and_plan
metformin observation_and_plan
type 2 dm observation_and_plan
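
One way to use these section labels downstream is to keep only the entities that come from sections describing the current visit. The sketch below works on the (entity, section) pairs printed above, copied into plain tuples so that it runs without medSpaCy. `CURRENT_SECTIONS` and `current_entities` are hypothetical names, and treating only `observation_and_plan` as "current" is an assumption.

```python
# (entity text, section title) pairs copied from the output above.
pairs = [
    ("Type II DM", "past_medical_history"),
    ("Afib", "past_medical_history"),
    ("CKD Stage 3", "past_medical_history"),
    ("Breast Cancer", "family_history"),
    ("pneumonia", "reason_for_examination"),
    ("pneumonia", "observation_and_plan"),
    ("metformin", "observation_and_plan"),
    ("type 2 dm", "observation_and_plan"),
]

# Assumption: only this section reflects the current assessment.
CURRENT_SECTIONS = {"observation_and_plan"}

current_entities = [ent for ent, section in pairs if section in CURRENT_SECTIONS]
print(current_entities)
```

Section filtering alone is not enough: "pneumonia" survives the filter even though the note says "No evidence of pneumonia", so in practice section labels are combined with contextual attributes such as negation.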


### Adding rules to the sectionizer

In [38]:
text = """Previous Medical History:
Pneumonia in 2012
"""

In [39]:
doc = nlp(text)

In [40]:
visualize_ent(doc)

In [41]:
pattern = {
    "section_title": "past_medical_history",
    "pattern": "Previous Medical History:"
}

In [42]:
sectionizer.add([pattern])

In [43]:
doc = nlp(text)
visualize_ent(doc)

# Using a pre-trained model
So far, we've been using **rule-based methods** to extract concepts from text. An alternative is **statistical NLP**, where you train a machine learning classifier to extract concepts based on annotated datasets.

We'll use a model trained on the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/). This model was trained on data for the first subtask in the shared task, referred to in the challenge as **"Clinically relevant events"**. For the purpose of this module, I specifically restricted it to the following labels of **clinical concepts**:
- **Problems:** Diagnoses, signs, and symptoms
- **Tests:** Lab and vital measurements
- **Treatments:** Medications, procedures, and therapies


The model has been pre-installed and is available under the name **"en_info_3700_i2b2_2012"**. To install it on your own machine, run this command:
```bash
pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
```

We can load this model with either spaCy or medSpaCy.

In [46]:
# Using spaCy
nlp = spacy.load("en_info_3700_i2b2_2012")

In [47]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [50]:
text = """Past Medical History:
1. Type II DM
2. Afib
3. CKD Stage 3

Family History:
1. Breast Cancer


Reason for this examination: Possible pneumonia.

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm."""
doc = nlp(text)

In [51]:
visualize_ent(doc)

In [56]:
# Using medSpaCy
nlp = medspacy.load("en_info_3700_i2b2_2012")

In [57]:
nlp.pipe_names

['tagger',
 'parser',
 'ner',
 'target_matcher',
 'sectionizer',
 'context',
 'postprocessor']

In [58]:
doc = nlp(text)

In [59]:
visualize_ent(doc)