# Overview
As we saw in the last notebook, spaCy doesn't work well on clinical text out of the box. The information we want to extract from clinical text differs from what we'd extract from news or Wikipedia articles. Clinical text is also very different from general-domain language:
- It is very messy, with semi-structured formatting from the electronic health record (EHR)
- Clinical documents include many abbreviations, some of which are ambiguous
- There are specific tasks needed in clinical NLP, such as detecting negation or uncertainty for concepts in the text

One of the most powerful features of spaCy is that it is **very customizable**. In addition to working with the default models provided in the core library, you can create your own [custom components](https://spacy.io/usage/processing-pipelines#custom-components) or add your own [extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes). Developers and researchers can then publish their spaCy extensions to the open-source community. Some examples of these openly available libraries are:

- [scispacy](https://allenai.github.io/scispacy/): Includes models trained on biomedical literature
- [medCAT](https://github.com/CogStack/MedCAT): Models trained for medical concept extraction

In this notebook, we'll use [medspacy](https://github.com/medspacy/medspacy), a newly released package for performing clinical NLP tasks in spaCy.

# medspacy
<img alt="MedSpaCy logo" src="https://github.com/medspacy/medspacy/raw/master/images/medspacy_logo.png">


[MedSpaCy](https://github.com/medspacy/medspacy) is an open-source package maintained by NLP developers at the University of Utah and the US Department of Veterans Affairs. The goal of medSpaCy is to provide flexible, easy-to-use spaCy components for common clinical NLP tasks, such as:

- Concept extraction
- Negation detection
- Document section splitting

One of the early uses of medSpaCy includes a [biosurveillance system for identifying positive cases of COVID-19](https://openreview.net/forum?id=ZQ_HvBxcdCv).

**MedSpaCy is still in beta**, and you are one of the first users!

# I. Getting started with medSpaCy
Let's get started with medSpaCy. Just like with spaCy, we'll load a model containing a processing pipeline. Unlike the typical spaCy models, this pipeline will include some additional components for specific clinical tasks.

In [1]:
import spacy
import medspacy

In [2]:
nlp = medspacy.load()

In [3]:
nlp.pipe_names

['tagger',
 'parser',
 'target_matcher',
 'sectionizer',
 'context',
 'postprocessor']

For data, we'll use these short example texts:

In [4]:
texts = [
    "Patient presents for management of Type II Diabetes Mellitus",
    "No evidence of pneumonia",
    "Past medical history significant for afib, CHF, and CKD Stage 3, now CKD stage five.",
    "Mother with breast cancer",
    "continue metformin for type 2 dm",
    "Her grandma was recently diagnosed with Alzheimer's"
]

### Discussion
What information would be useful to extract from these texts? What processing steps do you need to take?

# II. Concept extraction
The first step we'll take is to define the **target concepts** we're interested in. In the previous notebook, spaCy extracted concepts like **"PERSON"** and **"ORG"**. In this notebook, we'll extract the following labels:
- **"PROBLEM"**
- **"TREATMENT"**
- **"TEST"**

We'll start by building a **rule-based system**. In rule-based NLP, we define patterns to match concepts in text. SpaCy offers many [rule-based methods](https://spacy.io/usage/rule-based-matching). MedSpaCy uses a pipeline component called `TargetMatcher` and rules defined by a class called `TargetRule`. Extracted concepts will be stored as `Span` objects in `doc.ents`.

We can access the target matcher through the `get_pipe()` method:

In [5]:
target_matcher = nlp.get_pipe("target_matcher")
target_matcher

<target_matcher.target_matcher.TargetMatcher at 0x10e5e1dd0>

In [6]:
# Import class for defining rules
from medspacy.ner import TargetRule

Target rules require two positional arguments:
- `literal`: A span of text to match in the text (case insensitive)
- `category`: The label to assign to extracted concepts

Let's define rules to extract any relevant clinical concepts in the texts above.

### TODO
Finish the rules below to match any of the concepts in the text.

In [7]:
target_rules = [
    TargetRule("pneumonia", "PROBLEM"),
    TargetRule("afib", "PROBLEM"),
    TargetRule("CHF", "PROBLEM"),
    TargetRule("Breast Cancer", "PROBLEM"),
    TargetRule("Alzheimer's", "PROBLEM"),
    TargetRule("metformin", "TREATMENT"),
]

We then add these rules to our target matcher:

In [8]:
target_matcher.add(target_rules)
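
To build some intuition for what a literal rule does, here is a minimal sketch in plain Python. It is not how medSpaCy implements `TargetMatcher` (which matches on spaCy tokens); the `match_literals` function and the `RULES` list are hypothetical names used only for illustration, and the sketch assumes simple case-insensitive matching on whole words.

```python
import re

# Hypothetical (literal, category) pairs mirroring the TargetRule arguments above
RULES = [
    ("pneumonia", "PROBLEM"),
    ("Breast Cancer", "PROBLEM"),
    ("metformin", "TREATMENT"),
]

def match_literals(text, rules):
    """Return (start, end, matched_text, category) for each literal found in text."""
    spans = []
    for literal, category in rules:
        # \b keeps us from matching inside longer words; IGNORECASE makes
        # "Breast Cancer" match "breast cancer"
        pattern = r"\b" + re.escape(literal) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), m.group(), category))
    return sorted(spans)

for text in ["No evidence of pneumonia", "Mother with breast cancer"]:
    print(match_literals(text, RULES))
```

Each result carries character offsets and a label, which is roughly the information a `Span` in `doc.ents` gives us.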

The simplest rules match exact strings, as shown above. However, we can also write more complex patterns to match concepts that appear in varying forms. For example, the same or similar concepts can be written in multiple ways:
- **"Type II Diabetes Mellitus"** and **"type 2 dm"**
- **"CKD Stage 3"** and **"CKD Stage Five"**

We can write more complex rules using token attributes to match multiple string formats at once. A pattern is a list of dictionaries, each representing the conditions that one token must satisfy. See spaCy's documentation on [rule-based matching](https://spacy.io/usage/rule-based-matching) for more information on how these patterns work.

In [9]:
target_rules2 = [
    TargetRule("CKD", "PROBLEM", pattern=[
        {"LOWER": "ckd"}, # Token 1
        {"LOWER": "stage"}, # Token 2
        {"LIKE_NUM": True} # Token 3
        ]),
    
    TargetRule("Type II Diabetes Mellitus", "PROBLEM", 
              pattern=[
                  {"LOWER": "type"},
                  {"LOWER": {"IN": ["2", "ii", "two"]}},
                  {"LOWER": {"IN": ["dm", "diabetes"]}},
                  {"LOWER": "mellitus", "OP": "?"}
              ]),
]

In [10]:
target_matcher.add(target_rules2)
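
If token patterns feel abstract, the following sketch may help. It is a simplified, pure-Python imitation of how a pattern is checked token by token, and it is not spaCy's actual `Matcher`. It only supports the keys used above (`LOWER`, `LIKE_NUM`, and `OP: "?"`), and the helper names, the regex tokenizer, and the small `NUMBER_WORDS` set are all assumptions made for this example.

```python
import re

# Assumption: a tiny stand-in for spaCy's LIKE_NUM vocabulary
NUMBER_WORDS = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}

def token_matches(token, spec):
    """Check one token against one pattern dictionary."""
    for key, expected in spec.items():
        if key == "OP":
            continue  # operators are handled in match_at
        if key == "LOWER":
            value = token.lower()
            if isinstance(expected, dict):
                if value not in expected["IN"]:
                    return False
            elif value != expected:
                return False
        elif key == "LIKE_NUM":
            is_num = token.isdigit() or token.lower() in NUMBER_WORDS
            if is_num != expected:
                return False
        else:
            raise ValueError(f"Unsupported key in this sketch: {key}")
    return True

def match_at(tokens, i, pattern, p=0):
    """Return the end index of a match starting at token i, or None."""
    if p == len(pattern):
        return i
    spec = pattern[p]
    if i < len(tokens) and token_matches(tokens[i], spec):
        end = match_at(tokens, i + 1, pattern, p + 1)
        if end is not None:
            return end
    if spec.get("OP") == "?":  # optional token: try skipping it
        return match_at(tokens, i, pattern, p + 1)
    return None

def find_matches(text, pattern):
    """Return the matched token sequences, joined with spaces."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    matches, i = [], 0
    while i < len(tokens):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            matches.append(" ".join(tokens[i:end]))
            i = end
        else:
            i += 1
    return matches

ckd_pattern = [{"LOWER": "ckd"}, {"LOWER": "stage"}, {"LIKE_NUM": True}]
dm_pattern = [
    {"LOWER": "type"},
    {"LOWER": {"IN": ["2", "ii", "two"]}},
    {"LOWER": {"IN": ["dm", "diabetes"]}},
    {"LOWER": "mellitus", "OP": "?"},
]

print(find_matches("CKD Stage 3, now CKD stage five.", ckd_pattern))
print(find_matches("continue metformin for type 2 dm", dm_pattern))
```

Notice that one pattern covers both "CKD Stage 3" and "CKD stage five", which would otherwise need two separate literal rules.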

Now we can process the texts by calling `nlp.pipe(texts)`:

In [11]:
docs = list(nlp.pipe(texts))

Let's visualize the entities extracted in these docs.

In [12]:
from spacy import displacy

In [13]:
displacy.render(docs, style="ent")

# III. Contextual analysis
Clinical text often contains mentions of concepts which the patient did not actually experience. For example:

- "There is *no evidence of* **pneumonia**"
- "*Mother* with **breast cancer**"
- "Patient presents for *r/o* **COVID-19**"

In all of these instances, we need to use the contextual clues around the entity to assert attributes like negation, experiencer, and uncertainty.

One method for this is the [ConText algorithm](https://www.sciencedirect.com/science/article/pii/S1532046409000744). ConText links target entities like problems with semantic modifiers like those shown above. The medSpaCy implementation of ConText is [cycontext](https://github.com/medspacy/cycontext).

Here we'll show the basic usage of ConText. When instantiating ConText, we can use the default rules and then add additional rules as needed. See the [cycontext](https://github.com/medspacy/cycontext) repository for more detailed examples and tutorials.
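
Before running the real component, it can help to see the core idea in miniature. The sketch below is a heavily simplified stand-in for ConText, not the cycontext implementation: it assumes every modifier applies to any target that follows it in the same sentence, and it ignores rule direction and scope termination, which the real algorithm handles. The `MODIFIERS` and `TARGETS` collections, the category names, and `link_modifiers` are illustrative assumptions.

```python
# Assumed modifier phrases and categories, for illustration only
MODIFIERS = {
    "no evidence of": "NEGATED_EXISTENCE",
    "mother": "FAMILY",
    "r/o": "POSSIBLE_EXISTENCE",
}
TARGETS = ["pneumonia", "breast cancer", "covid-19"]

def link_modifiers(sentence):
    """Pair each target with every modifier that appears before it."""
    lowered = sentence.lower()
    links = []
    for target in TARGETS:
        t_start = lowered.find(target)
        if t_start == -1:
            continue
        for phrase, category in MODIFIERS.items():
            m_start = lowered.find(phrase)
            # Forward scope only: the modifier must end before the target starts
            if m_start != -1 and m_start + len(phrase) <= t_start:
                target_text = sentence[t_start:t_start + len(target)]
                links.append((target_text, category))
    return links

for sentence in [
    "There is no evidence of pneumonia",
    "Mother with breast cancer",
    "Patient presents for r/o COVID-19",
]:
    print(sentence, "->", link_modifiers(sentence))
```

Each output pair corresponds to one of the arrows we'll see when visualizing target/modifier relationships.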

In [14]:
from medspacy.ner import TargetRule
from medspacy.context import ConTextItem
from medspacy.visualization import visualize_ent, visualize_dep

In [15]:
doc = nlp("There is no evidence of pneumonia.")

We can visualize the target and modifiers using two functions from `medspacy.visualization`. `visualize_ent` will highlight the spans of both target and modifier concepts. `visualize_dep` will show arrows between concepts to show which targets are modified by modifiers.

In [16]:
visualize_ent(doc)

In [17]:
visualize_dep(doc)

### TODO
Change `idx` to go through each of the texts and see target/modifier relationships.

In [18]:
idx = 0

In [19]:
visualize_ent(docs[idx])

In [20]:
visualize_dep(docs[idx])

## Adding rules to ConText
MedSpaCy comes with default rules for matching targets and modifiers. But you'll often find new examples which aren't included in the default rules. Let's see now how to add a rule.

In the sentence **"Her grandma was recently diagnosed with Alzheimer's"**, medSpaCy fails to recognize that **"grandma"** is a **"FAMILY"** modifier. 

In [21]:
text = "Her grandma was recently diagnosed with Alzheimer's"
doc = nlp(text)

In [22]:
visualize_ent(doc)

We can add this rule by creating a `ConTextItem` and adding it to the `context` component.

In [23]:
from medspacy.context import ConTextItem

In [24]:
context = nlp.get_pipe("context")

This class takes the same two arguments as `TargetRule`, `literal` and `category`:

In [25]:
new_item_data = [
    ConTextItem(literal="grandma", category="FAMILY"),
]

In [26]:
context.add(new_item_data)

In [27]:
doc = nlp(text)

In [28]:
visualize_ent(doc)

In [29]:
visualize_dep(doc)

# IV.  Section detection
Clinical notes often follow a certain structure. One example of this is the [SOAP note](https://www.globalpremeds.com/blog/2015/01/02/understanding-soap-format-for-clinical-rounds/). Different parts of a note have different significance. For example, a condition listed in the **Past Medical History** or **Problem List** is likely a historical condition that may not be relevant to the current visit, whereas the **Assessment/Plan** will contain more up-to-date diagnoses.

MedSpaCy will detect sections through the `sectionizer` component. We can then visualize the section headers using `visualize_ent`.
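
As a rough mental model of section detection, consider the sketch below. It uses regular expressions rather than medSpaCy's `sectionizer`, and the `SECTION_HEADERS` mapping and `split_sections` function are hypothetical. It assumes that a section runs from its header to the next recognized header, and it drops any text before the first header.

```python
import re

# Assumed header strings mapped to normalized section titles
SECTION_HEADERS = {
    "Past Medical History:": "past_medical_history",
    "Family History:": "family_history",
    "Assessment/Plan:": "observation_and_plan",
}

def split_sections(text):
    """Return (normalized_title, body) pairs in document order."""
    hits = []
    for header, title in SECTION_HEADERS.items():
        for m in re.finditer(re.escape(header), text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), title))
    hits.sort()
    sections = []
    for idx, (_, end, title) in enumerate(hits):
        # A section body ends where the next header begins
        next_start = hits[idx + 1][0] if idx + 1 < len(hits) else len(text)
        sections.append((title, text[end:next_start].strip()))
    return sections

note = """Past Medical History:
1. Type II DM
2. Afib

Family History:
1. Breast Cancer

Assessment/Plan:
Continue metformin for type 2 dm.
"""

for title, body in split_sections(note):
    print(title, "->", repr(body))
```

A header that isn't in the mapping is simply absorbed into the previous section, which is why adding new section patterns matters in practice.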

In [30]:
sectionizer = nlp.get_pipe("sectionizer")

In [31]:
text = """Past Medical History:
1. Type II DM
2. Afib
3. CKD Stage 3

Family History:
1. Breast Cancer


Reason for this examination: Possible pneumonia.

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm.
"""

In [32]:
doc = nlp(text)

In [33]:
visualize_ent(doc)

We can see all of the section titles in the doc by calling `doc._.section_titles`. We can also see which section an entity occurred in using `ent._.section_title`:

In [34]:
print(doc._.section_titles)

['past_medical_history', 'family_history', 'reason_for_examination', 'observation_and_plan', 'observation_and_plan']


In [35]:
for ent in doc.ents:
    print(ent, ent._.section_title)

Type II DM past_medical_history
Afib past_medical_history
CKD Stage 3 past_medical_history
Breast Cancer family_history
pneumonia reason_for_examination
pneumonia observation_and_plan
metformin observation_and_plan
type 2 dm observation_and_plan
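
Section information becomes useful once we act on it. As a small sketch, suppose we copy the (entity, section) pairs printed above into a plain list. We can then group concepts by section or keep only those outside background sections. Treating `past_medical_history` and `family_history` as "background" is an assumption for this example, and `group_by_section` and `current_concepts` are hypothetical helper names.

```python
from collections import defaultdict

# (entity text, section title) pairs copied from the output above
pairs = [
    ("Type II DM", "past_medical_history"),
    ("Afib", "past_medical_history"),
    ("CKD Stage 3", "past_medical_history"),
    ("Breast Cancer", "family_history"),
    ("pneumonia", "reason_for_examination"),
    ("pneumonia", "observation_and_plan"),
    ("metformin", "observation_and_plan"),
    ("type 2 dm", "observation_and_plan"),
]

# Assumption: concepts in these sections are not about the current visit
BACKGROUND_SECTIONS = {"past_medical_history", "family_history"}

def group_by_section(pairs):
    """Collect entity texts under their section title."""
    grouped = defaultdict(list)
    for text, section in pairs:
        grouped[section].append(text)
    return dict(grouped)

def current_concepts(pairs):
    """Keep only entities found outside the background sections."""
    return [text for text, section in pairs if section not in BACKGROUND_SECTIONS]

print(group_by_section(pairs))
print(current_concepts(pairs))
```

With real `Doc` objects, the same filtering could be done by reading `ent._.section_title` directly in the loop.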


## Adding rules to the sectionizer
Just like with `context`, you'll want to add new section titles to the `sectionizer` component. We can do this by writing patterns which are dictionaries with two keys:
- `section_title`: The normalized section title
- `pattern`: The pattern to match in the text

We then add these patterns using `sectionizer.add()`.

For example, we can see below that medSpaCy fails to recognize **"Previous Medical History"** as equivalent to **"Past Medical History"**.

In [36]:
text = """Previous Medical History:
Pneumonia in 2012
"""

In [37]:
doc = nlp(text)

In [38]:
visualize_ent(doc)

Let's add a pattern here to match it.

In [39]:
sectionizer = nlp.get_pipe("sectionizer")

In [40]:
pattern = {
    "section_title": "past_medical_history",
    "pattern": "Previous Medical History:"
}

In [41]:
sectionizer.add([pattern])

In [42]:
doc = nlp(text)
visualize_ent(doc)

# V. Using a pre-trained model
So far, we've been using **rule-based methods** to extract concepts from text. An alternative is **statistical NLP**, where you train a machine learning classifier to extract concepts based on annotated datasets.

We'll use a model trained on the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/). This model was trained on data for the first subtask in the shared task, referred to in the challenge as **"Clinically relevant events"**. For the purpose of this module, I specifically restricted it to the following labels of **clinical concepts**:
- **Problems:** Diagnoses, signs, and symptoms
- **Tests:** Lab and vital measurements
- **Treatments:** Medications, procedures, and therapies


The model has been pre-installed and is available with the name **"en_info_3700_i2b2_2012"**. To install on your own machine, run this command to download and install the model:
```bash
pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
```

We can load this model using either spaCy or medSpaCy.

In [43]:
# Using spaCy
# nlp = spacy.load("en_info_3700_i2b2_2012")
# Using medSpaCy
nlp = medspacy.load("en_info_3700_i2b2_2012")

In [44]:
nlp.pipe_names

['tagger',
 'parser',
 'ner',
 'target_matcher',
 'sectionizer',
 'context',
 'postprocessor']

Let's see what labels will be predicted by the NER component:

In [45]:
ner = nlp.get_pipe("ner")
ner.labels

('PROBLEM', 'TEST', 'TREATMENT')

Now let's see what concepts are extracted by our model. Any of the target concepts in `doc.ents` will have been extracted by the statistical NER model. MedSpaCy will keep extracting the modifiers and section titles.

In [46]:
text = """Past Medical History:
1. Type II DM
2. Afib
3. CKD Stage 3

Family History:
1. Breast Cancer


Reason for this examination: Possible pneumonia.

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm."""
doc = nlp(text)

In [47]:
print(doc.ents)

(Type II DM, Afib, this examination, pneumonia, pneumonia, metformin)


In [48]:
visualize_ent(doc)
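
One quick way to compare the two approaches is to line up their outputs on the same note. The sketch below copies the entity texts printed earlier for the rule-based pipeline and for the statistical model, then uses set operations to see where they agree. The `compare` helper is hypothetical, and comparing lowercased text alone (ignoring labels and positions) is a simplifying assumption.

```python
# Entity texts copied from the outputs shown earlier in this notebook
rule_based = ["Type II DM", "Afib", "CKD Stage 3", "Breast Cancer",
              "pneumonia", "pneumonia", "metformin", "type 2 dm"]
model_based = ["Type II DM", "Afib", "this examination",
               "pneumonia", "pneumonia", "metformin"]

def compare(rules, model):
    """Return (shared, only_rules, only_model) as sets of lowercased text."""
    rules_set = {text.lower() for text in rules}
    model_set = {text.lower() for text in model}
    return rules_set & model_set, rules_set - model_set, model_set - rules_set

shared, only_rules, only_model = compare(rule_based, model_based)
print("Found by both:  ", sorted(shared))
print("Rules only:     ", sorted(only_rules))
print("Model only:     ", sorted(only_model))
```

The "rules only" and "model only" sets are good starting points for the discussion below.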

### Discussion
Compared with our rule-based system 