# Adding to the Sectionizer

By default, `clinical_sectionizer` comes with a number of built-in patterns. However, this list is not exhaustive, and your data will almost certainly contain sections that the default patterns do not capture.

In this notebook, we'll see how to add custom section patterns to our clinical sectionizer to recognize section headers which are not contained in the default knowledge base.

## Prerequisites
This notebook also uses some examples from the main medSpaCy package, [medspacy](https://github.com/medspacy/medspacy), which you can install with:

`pip install medspacy`

You will also need a small spaCy English model, which can be downloaded with:

`python -m spacy download en_core_web_sm`

In [2]:
import sys

In [3]:
sys.path.insert(0, "../..")

In [4]:
import spacy
from medspacy.section_detection import Sectionizer
from medspacy.section_detection import SectionRule

from medspacy.visualization import visualize_ent 

In [5]:
nlp = spacy.load("en_core_web_sm")

In [6]:
nlp.add_pipe("medspacy_sectionizer")

<medspacy.section_detection.sectionizer.Sectionizer at 0x1eaf080dd30>

In [7]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'medspacy_sectionizer']

In [8]:
# pull this back out so that we can modify this (i.e. add rules)
sectionizer = nlp.get_pipe("medspacy_sectionizer")

## Available default sections
The sectionizer has a pattern list provided by default. You can see this list in `medspacy/resources/section_patterns.json`. They cover a broad range of topics including past medical history, chief complaints, allergies, diagnoses, observations, etc. but are not specialized for any particular dataset, so adding or tuning rules might be required.

In this example demonstrating some of the default rules, we'll use the text below:

In [9]:
text = """
Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
 
Date of Birth:  [**2498-8-19**]             Sex:   F
 
Service: SURGERY
 
Allergies: 
Hydrochlorothiazide
 
Attending:[**First Name3  Last Name **] 
Chief Complaint:
Abdominal pain


Pertinent Results:
[**2573-5-30**] 09:10PM BLOOD WBC-19.2
"""

In [10]:
doc = nlp(text)

In [11]:
visualize_ent(doc)

In [12]:
doc._.section_titles

[, Service:, Allergies:, Chief Complaint:, Pertinent Results:]

The sectionizer correctly identifies some of the sections, such as **"Allergies"** and **"Chief Complaint"**. However, the document contains at least one other section which might be useful to extract:
- **"Admission Date"**: Many MIMIC notes start this way and you could consider this first section to be **visit_information**

## Add patterns
To recognize these sections, we can add **rules** to the sectionizer. Create an instance of a `SectionRule` with the following components:

* `category`: the normalized name of the section
* `literal`: a human-readable approximation of the text you are seeking to match; it is used with spaCy's `PhraseMatcher` if no `pattern` is provided
* `pattern`: optional, a dictionary using spaCy's [rule-based matching API](https://spacy.io/usage/rule-based-matching)

A `SectionRule` can also be created by calling `SectionRule.from_json` or `SectionRule.from_dict` to read from a JSON file or a dict with the same components.

In [13]:
new_patterns = [
    SectionRule(category="visit_information", literal="Admission Date:", 
            pattern=[{"LOWER": {"REGEX": "admi(t|ssion)"}}, {"LOWER": "date"}, {"LOWER": ":"}])
]
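To get a feel for which header tokens the first element of this pattern accepts, the regular expression can be tried on its own with Python's `re` module. This is only a rough sketch: the helper `first_token_matches` is hypothetical and not part of medspacy, and it assumes that spaCy's `REGEX` operator searches the lowercased token text rather than requiring a full match.

```python
import re

# The same regular expression used in the "LOWER" token pattern above.
ADMIT_RE = re.compile(r"admi(t|ssion)")


def first_token_matches(token_text: str) -> bool:
    """Approximate the {"LOWER": {"REGEX": ...}} check for a single token.

    Assumption: the regex is searched (not full-matched) against the
    lowercased token text. This helper is illustrative only.
    """
    return ADMIT_RE.search(token_text.lower()) is not None


# Try a few candidate first tokens of a section header.
for candidate in ["Admission", "ADMIT", "Admitted", "Readmission", "Discharge"]:
    print(candidate, first_token_matches(candidate))
```

Under that assumption, longer tokens such as "Admitted" or "Readmission" also pass this check because the expression is not anchored. If that is too permissive for your data, anchoring it (for example `^admi(t|ssion)$`) would be one way to tighten it.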

We add this list of patterns through the `sectionizer.add` method:

In [14]:
sectionizer.add(new_patterns)

Now if we reprocess and visualize our doc, we can see that the new header has been extracted:

In [15]:
doc = nlp(text)

Note that we now have Admission Date detected as a section.

In [16]:
visualize_ent(doc)

In [17]:
doc._.section_titles

[, Admission Date:, Service:, Allergies:, Chief Complaint:, Pertinent Results:]

## Loading a blank sectionizer
You can create a `Sectionizer` without the default patterns by passing `rules=None`, so that it contains only the custom rules you add afterwards:

In [18]:
blank_sectionizer = Sectionizer(nlp, rules=None)

## Loading a sectionizer with custom rules
You can also load a `Sectionizer` with custom rules from a JSON file that you specify.
```python
your_sectionizer = Sectionizer(nlp, rules='path/to/your_rules.json')
```
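As a rough sketch of what such a rules file might contain, the snippet below writes one with the standard `json` module. The top-level `"section_rules"` key is an assumption modeled on the layout of the default `section_patterns.json`; check that file in your installed version before relying on it. The temporary path and the example rules are illustrative only.

```python
import json
import os
import tempfile

# Assumed layout: a top-level "section_rules" list whose entries use the
# same components as SectionRule (category, literal, optional pattern).
rules = {
    "section_rules": [
        {
            "category": "visit_information",
            "literal": "Admission Date:",
            "pattern": [
                {"LOWER": {"REGEX": "admi(t|ssion)"}},
                {"LOWER": "date"},
                {"LOWER": ":"},
            ],
        },
        # A rule with only a literal and no token pattern.
        {"category": "allergies", "literal": "Allergies:"},
    ]
}

# Write the rules to a temporary location (illustrative path only).
rules_path = os.path.join(tempfile.mkdtemp(), "your_rules.json")
with open(rules_path, "w", encoding="utf-8") as f:
    json.dump(rules, f, indent=2)

# Read the file back to confirm that it is valid JSON.
with open(rules_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(f"Wrote {len(loaded['section_rules'])} rules to {rules_path}")
```

The resulting path could then be passed as the `rules` argument shown above.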