In [1]:
import sys

In [2]:
sys.path.insert(0, "..")

In [3]:
import spacy
from spacy.tokens import Span

import medspacy
from medspacy.preprocess import PreprocessingRule, Preprocessor
from medspacy.ner import TargetRule
from medspacy.context import ConTextRule
from medspacy.section_detection import Sectionizer
from medspacy.postprocess import PostprocessingRule, PostprocessingPattern, Postprocessor
from medspacy.postprocess import postprocessing_functions
from medspacy.visualization import visualize_ent, visualize_dep


import re

# Overview
In this notebook, we'll show how to use a pretrained model for target concept extraction instead of defining rules. We'll then add further components to show how medSpaCy can combine statistical NLP with rule-based components.

As an example, we'll download the model below, which was pretrained on clinical data from the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/). It was trained on the data for the first subtask of the shared task, referred to in the challenge as **"Clinically relevant events"**, and extracts the following **clinical concepts**:
- **Problems:** Diagnoses, signs, and symptoms
- **Tests:** Lab and vital measurements
- **Treatments:** Medications, procedures, and therapies

We can install this model with `pip` using this GitHub link:
```bash
pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
```

In [4]:
# !pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz

In [5]:
with open("./discharge_summary.txt") as f:
    text = f.read()

This model can now be loaded like any other spaCy model. As an example, we'll pass the model name to `medspacy.load()`. Since the trained NER component takes care of entity extraction, we can disable the `target_matcher` in our pipeline (although you may still want to add rule-based matching to reduce the model's false negatives):

In [6]:
nlp = medspacy.load("en_info_3700_i2b2_2012", disable=["target_matcher"])



In [7]:
nlp.pipe_names

['sentencizer', 'tagger', 'parser', 'ner', 'context']

In [8]:
ner = nlp.get_pipe("ner")

In [9]:
ner.labels

('PROBLEM', 'TEST', 'TREATMENT')

In [10]:
doc = nlp(text)

## Process our text
As in the previous notebook, we'll add new rules to some of our components. Let's first look at what our model extracts out of the box:

In [11]:
visualize_ent(doc)

### Preprocessing

In [12]:
preprocessor = Preprocessor(nlp.tokenizer)

In [13]:
nlp.tokenizer = preprocessor

In [14]:
preprocess_rules = [
    # Raw strings (r"...") keep the regex backslashes intact and avoid
    # invalid-escape warnings from Python.
    PreprocessingRule(
        re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]"),
        repl="01-01-2010",
        desc="Replace MIMIC date brackets with a generic date."
    ),

    PreprocessingRule(
        re.compile(r"\[\*\*\d{4}\*\*\]"),
        repl="2010",
        desc="Replace MIMIC year brackets with a generic year."
    ),

    PreprocessingRule(
        re.compile(r"dx'd"),
        repl="Diagnosed",
        desc="Replace abbreviation"
    ),

    PreprocessingRule(
        re.compile(r"tx'd"),
        repl="Treated",
        desc="Replace abbreviation"
    ),

    # No repl is given here, as in the original rule.
    PreprocessingRule(
        re.compile(r"\[\*\*[^\]]+\]"),
        desc="Remove all other bracketed placeholder text from MIMIC"
    )
]

In [15]:
preprocessor.add(preprocess_rules)
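
To make the effect of these rules concrete, here is a minimal sketch that uses only Python's `re` module. It is a stand-in for illustration, not medSpaCy's implementation: the `rules` list and the `apply_rules` helper are hypothetical names, the substitutions are case-sensitive, and we assume the rules are applied in the order they were added and that the last rule replaces its match with an empty string.

```python
import re

# Hypothetical stand-ins for the notebook's rules: (pattern, replacement) pairs.
rules = [
    (re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]"), "01-01-2010"),
    (re.compile(r"\[\*\*\d{4}\*\*\]"), "2010"),
    (re.compile(r"dx'd"), "Diagnosed"),
    (re.compile(r"tx'd"), "Treated"),
    (re.compile(r"\[\*\*[^\]]+\]"), ""),  # assumed: remove remaining placeholders
]


def apply_rules(text):
    """Apply each substitution in order and return the cleaned text."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text


cleaned = apply_rules("Colon cancer dx'd in [**2554**], tx'd with hemicolectomy")
print(cleaned)  # Colon cancer Diagnosed in 2010, Treated with hemicolectomy
```

The order matters in this sketch: the catch-all placeholder rule comes last, so dates and years are rewritten before any remaining brackets are removed.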

### Context

In [16]:
context = nlp.get_pipe("context")

In [17]:
context_rules = [
    # Treat "diagnosed in <4-digit year>" as a HISTORICAL modifier.
    ConTextRule("diagnosed in <YEAR>", "HISTORICAL",
               pattern=[
                   {"LOWER": "diagnosed"},
                   {"LOWER": "in"},
                   {"LOWER": {"REGEX": r"^\d{4}$"}}
               ])
]

In [18]:
context.add(context_rules)
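
The token pattern in this rule reads as: a token whose lowercase form is "diagnosed", then "in", then a token made of exactly four digits. The sketch below imitates that check in plain Python over a pre-split token list. It only illustrates the matching logic; spaCy's matcher does the real work, and `find_historical_triggers` is a hypothetical helper.

```python
import re

YEAR = re.compile(r"^\d{4}$")


def find_historical_triggers(tokens):
    """Return (start, end) token offsets where 'diagnosed in <YEAR>' occurs."""
    hits = []
    for i in range(len(tokens) - 2):
        first, second, third = (t.lower() for t in tokens[i:i + 3])
        if first == "diagnosed" and second == "in" and YEAR.match(third):
            hits.append((i, i + 3))
    return hits


# Tokens as they might look after preprocessing has normalized the text.
tokens = ["Colon", "cancer", "Diagnosed", "in", "2010", ",", "Treated"]
print(find_historical_triggers(tokens))  # [(2, 5)]
```

This also shows why the preprocessing step matters: the raw text "dx'd in [**2554**]" contains neither the word "diagnosed" nor a bare four-digit token, so the rule would not fire on it.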

### Section detection

In [19]:
sectionizer = Sectionizer(nlp, patterns="default")

In [20]:
nlp.add_pipe(sectionizer)

In [21]:
section_patterns = [
    {"section_title": "hospital_course", "pattern": "Brief Hospital Course:"}
]

In [22]:
sectionizer.add(section_patterns)
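
As a rough illustration of what a literal section pattern does, the sketch below looks up each pattern string in a note and records where the section title starts. This is a simplification under stated assumptions: the real sectionizer works on spaCy tokens and also applies its default patterns, and `find_section_headers` and the sample `note` are hypothetical.

```python
section_patterns = [
    {"section_title": "hospital_course", "pattern": "Brief Hospital Course:"}
]


def find_section_headers(text, patterns):
    """Return sorted (character offset, section_title) pairs for literal matches."""
    hits = []
    for rule in patterns:
        start = text.find(rule["pattern"])
        while start != -1:
            hits.append((start, rule["section_title"]))
            start = text.find(rule["pattern"], start + 1)
    return sorted(hits)


note = "Chief Complaint: pain\nBrief Hospital Course: The patient was admitted."
for offset, title in find_section_headers(note, section_patterns):
    print(offset, title)
```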

### Postprocessing
Here, we'll show another example of how postprocessing can be used. The NER component extracts **"married"** as a **"TREATMENT"** entity. While some might agree with this in a philosophical sense, it doesn't match our clinical definition very well. This shows a challenge of statistical NLP: we have relatively little control over what concepts are extracted by our model. But we can use some postprocessing rules to clean this up.

Postprocessing can be used to remove or clean up entities which we know are incorrect. In this example, we'll just remove any entity where the text is **"married"**:

In [23]:
postprocessor = Postprocessor(debug=False) 

In [24]:
nlp.add_pipe(postprocessor)

In [25]:
postprocess_rules = [
    PostprocessingRule(
        patterns=[
            PostprocessingPattern(condition=lambda ent: ent.lower_ == "married"),
        ],
        action=postprocessing_functions.remove_ent,
        description="Remove a specific misclassified span of text."
    ),
    
]

In [26]:
postprocessor.add(postprocess_rules)
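
Conceptually, this rule pairs a condition (a function that returns `True` for an entity) with an action (here, removal). The plain-Python sketch below shows that idea on a list of stand-in entities. `Ent` and `remove_if` are hypothetical names used only for illustration, not part of medSpaCy or spaCy.

```python
from dataclasses import dataclass


@dataclass
class Ent:
    """Hypothetical stand-in for an extracted entity."""
    text: str
    label: str


def remove_if(ents, condition):
    """Keep only the entities for which the condition is False."""
    return [ent for ent in ents if not condition(ent)]


ents = [
    Ent("colon cancer", "PROBLEM"),
    Ent("Married", "TREATMENT"),
    Ent("hemicolectomy", "TREATMENT"),
]
kept = remove_if(ents, lambda ent: ent.text.lower() == "married")
print([ent.text for ent in kept])  # ['colon cancer', 'hemicolectomy']
```

Comparing the lowercased text, as the notebook's condition does with `ent.lower_`, catches "Married" and "married" alike.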

# Process our document
Now, let's process the text with our complete pipeline and show the results:

In [27]:
nlp.pipe_names

['sentencizer',
 'tagger',
 'parser',
 'ner',
 'context',
 'sectionizer',
 'postprocessor']

In [28]:
doc = nlp(text)

In [29]:
visualize_ent(doc)

In [30]:
short_text = "Colon cancer dx'd in [**2554**], tx'd with hemicolectomy"
short_doc = nlp(short_text)

In [31]:
visualize_ent(short_doc)

In [32]:
visualize_dep(short_doc)