# Overview
This notebook will show a basic example of how to use the `Sectionizer` package from medspaCy.

## Prerequisites
This notebook also uses some examples from the main medspaCy package, [medspacy](https://github.com/medspacy/medspacy), which you can install with:

`pip install medspacy`

It also uses a statistical model trained on i2b2 data, which you can install with:

`pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz`

## Example text
We'll process this document below:

In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, "../..")

In [2]:
with open("../discharge_summary.txt") as f:
    text = f.read()

In [3]:
print(text[:500])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging sh


# Getting started
The `Sectionizer` component is used in the same way as any other spaCy component. We'll start by loading a spaCy model, creating a `Sectionizer` object, and then adding it to our pipeline.

In [4]:
import spacy

In [5]:
nlp = spacy.load("en_info_3700_i2b2_2012")



In [6]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [7]:
from medspacy.section_detection import Sectionizer

In [8]:
sectionizer = Sectionizer(nlp)

In [9]:
nlp.add_pipe(sectionizer)

In [10]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'sectionizer']

## Processing text
As an example, we'll process this text from MIMIC-II:

In [11]:
doc = nlp(text)

Just like that, we've processed our doc with medspaCy! Let's first take a look at the entities we've extracted, as well as the section headers. To do this, we'll use a visualizer from the medspaCy package, an extension of displaCy that supports medspaCy components. This function highlights all of the **entities** found in `doc.ents`, which were extracted by our model's `ner` component, as well as the section headers extracted by `sectionizer`:

In [12]:
from medspacy.visualization import visualize_ent

In [13]:
visualize_ent(doc)

The section titles are highlighted in gray with **<<>>** symbols around the normalized section title. As you can see, there is sometimes overlap between the targets and section headers, which causes duplicate text to be displayed.

# Extracted Section Information
Let's now see what information was extracted by `sectionizer`. When `sectionizer` processes a `doc`, it adds a number of custom attributes at the following levels:
- `Doc`: The entire document
- `Span`: A slice of the document, such as an entity
- `Token`: A single token

In spaCy, custom attributes are saved under the `var._` attribute. 

## Doc
Let's first look at all of the `section_categories` which were found in our text. `category` represents the normalized name specified in your rules.

In [14]:
doc._.section_categories

[None,
 'other',
 'allergies',
 'chief_complaint',
 'history_of_present_illness',
 'past_medical_history',
 'social_history',
 'family_history',
 'hospital_course',
 'medications',
 'observation_and_plan',
 'patient_instructions',
 'signature']

Next, let's find the spans of the text which were recognized as `section_titles`. `title` represents the title or header component of a section. This is the part that is matched using the rules provided to the `Sectionizer` component.

In [15]:
doc._.section_titles

[,
 Service:,
 Allergies:,
 Chief Complaint:,
 History of Present Illness:,
 Past Medical History:,
 Social History:,
 Family History:,
 Brief Hospital Course:,
 Discharge Medications:,
 Discharge Diagnosis:,
 Discharge Instructions:,
 Signed electronically by:]

Now, let's look at the `section_bodies`. `body` represents the text between one `title` and the next, and it is assigned to the category of the preceding matched `title`. Note that the `title` and the `body` do not overlap.

In [16]:
doc._.section_bodies

[Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
 
 Date of Birth:  [**2498-8-19**]             Sex:   F
 , SURGERY
 , 
 Hydrochlorothiazide
 
 Attending:[**First Name3 (LF) 1893**], 
 Abdominal pain
 
 Major Surgical or Invasive Procedure:
 PICC line [**6-25**]
 ERCP w/ sphincterotomy [**5-31**]
 
 , 
 74y female with type 2 dm and a recent stroke affecting her
 speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.
 , 
 1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
 chemo. Last colonoscopy showed: Last CEA was in the 8 range
 (down from 9)
 2. Type II Diabetes Mellitus
 3. Hypertension
 , 
 Married, former tobacco use. No alcohol or drug use.
 , 
 Mother with stroke at age 82. no early deaths.
 2 daughters- healthy
 
 , 
 Ms. [**Known patient lastname 2004**] was admitted on [**2573-5-30**]. Ultrasound at the time of
 admission demonstrated pancreatic duct dilitation and an
 edematous gallbladder

Finally, the entire section is available through `section_spans`. The `span` represents the combined `title` and `body`.

In [17]:
print(doc._.section_spans)

[Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

, Service: SURGERY

, Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
, Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


, History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.

, Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Type II Diabetes Mellitus
3. Hypertension

, Social History:
Married, former tobacco use. No alcohol or drug use.

, Family History:
Mother with stroke at age 82. no early deaths.
2 daughters- healthy


, Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitted on [**2573-5-
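
To make the relationship between `title`, `body`, and `span` more concrete, here is a minimal plain-Python sketch of rule-based splitting on character offsets. It is not how medspaCy implements section detection: `RULES`, `split_sections`, and the sample text are hypothetical names and data made up for illustration, and only exact literal headers are matched.

```python
import re

# Hypothetical rules: header literal -> normalized category.
RULES = {
    "Allergies:": "allergies",
    "Chief Complaint:": "chief_complaint",
    "History of Present Illness:": "history_of_present_illness",
}


def split_sections(text, rules):
    """Return a list of dicts with category, title, body, and span text."""
    # Longest literals first so that a longer header wins over a shorter prefix.
    literals = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(lit) for lit in literals))
    matches = list(pattern.finditer(text))

    sections = []
    # Text before the first header goes into a section with no title or category.
    first_start = matches[0].start() if matches else len(text)
    if first_start > 0:
        lead = text[:first_start]
        sections.append({"category": None, "title": "", "body": lead, "span": lead})
    for i, match in enumerate(matches):
        # A body runs from the end of its title to the start of the next title.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append(
            {
                "category": rules[match.group()],
                "title": match.group(),
                "body": text[match.end():body_end],
                "span": text[match.start():body_end],
            }
        )
    return sections


sample = "Sex: F\n\nAllergies:\nHydrochlorothiazide\n\nChief Complaint:\nAbdominal pain\n"
sections = split_sections(sample, RULES)
for section in sections:
    print(repr(section["category"]), repr(section["title"]), repr(section["body"]))
```

In this sketch, joining every `span` reproduces the original text, and text before the first matched header ends up in a section with no category, much like the `None` entry in `section_categories` above.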

## Section Objects

The sections are stored as custom `Section` objects that can also be accessed through the `doc`.

In [18]:
sections = doc._.sections
sections[1]

Section(category=other, title=Service:, body=SURGERY

, parent=None, rule=SectionRule(literal="Service:", category="other", pattern=None, on_match=None, parents=[], parent_required=False))

`Section` objects have a variety of properties:
* `doc`: the doc associated with the section
* `category`: returns the section category
* `title_start`: the index of the first token of the `title`
* `title_end`: the index of the last token of the `title`
* `title_span`: the span of the `title`
* `body_start`: the index of the first token of the `body`
* `body_end`: the index of the last token of the `body`
* `body_span`: the span of the `body`
* `section_span`: the span of the `title`+`body`
* `parent`: the `Section` that was matched as the parent of the current section (the parent-child pairings are covered in a future notebook)
* `rule`: the rule that matched the `title`

## Span
Now, for each entity extracted from the text, you can retrieve the associated `Section` object. In this example, let's look at the `category` of the `Section` associated with each entity.

In [19]:
for ent in doc.ents[:10]:
    print(ent, ent._.section.category, sep=", ")

Hydrochlorothiazide, allergies
Abdominal pain, chief_complaint
Invasive Procedure, chief_complaint
PICC line, chief_complaint
ERCP, chief_complaint
sphincterotomy, chief_complaint
a recent stroke, history_of_present_illness
abdominal pain, history_of_present_illness
Imaging, history_of_present_illness
metastasis, history_of_present_illness


Using the `Section`, you can also access the entire section span which contained the ent:

In [20]:
ent = doc.ents[0]
ent._.section.section_span

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
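
One way to picture how an entity could be tied to its section is a sorted list of section start indices combined with a binary search. The sketch below assumes exactly that; the token indices and the `category_for` helper are invented for illustration and do not come from medspaCy.

```python
from bisect import bisect_right

# Hypothetical section start token indices; each section runs until the next start.
section_starts = [0, 12, 20, 31]
section_categories = [None, "allergies", "chief_complaint", "history_of_present_illness"]


def category_for(token_index):
    """Return the category of the section containing the given token index."""
    # bisect_right finds the first start greater than token_index;
    # the section just before that position contains the token.
    position = bisect_right(section_starts, token_index) - 1
    return section_categories[position]


# Hypothetical entity start indices, for illustration only.
entities = {"Hydrochlorothiazide": 14, "Abdominal pain": 23, "a recent stroke": 40}
for text, start in entities.items():
    print(text, category_for(start), sep=", ")
```

With this kind of lookup, text under a header that matched no rule stays in the preceding section, which would be consistent with `PICC line` being reported under `chief_complaint` above.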

# Assertion attributes
In clinical NLP, it's important to account for certain attributes about extracted entities, such as whether a concept is **negated** or **historical**. This is handled at a sentence level by [cycontext](https://github.com/medspacy/cycontext), which looks for linguistic modifiers within the same sentence as an entity. However, the section in which a concept occurs can also inform these attributes.

In the example below, we know that:
- **"Pneumonia"** is not current because it occurs in the **Past Medical History**
- **"Penicillin"** and **"Allergies"** are not actually experienced; they're just listed in the allergies section. We call this **hypothetical**
- **"Diabetes"** is experienced by someone in the patient's family because it occurs in the **Family History**
- **"Chest pain"** is hypothetical because it occurs in the **Patient Education** section as a hypothetical event



This functionality can be turned on with the `add_attrs` argument in the constructor, which is `False` by default:

In [21]:
nlp = spacy.load("en_info_3700_i2b2_2012")



In [22]:
sectionizer = Sectionizer(nlp, add_attrs=True)

In [23]:
nlp.add_pipe(sectionizer)

In [24]:
text = """
Past Medical History:
pneumonia

Allergies: 
Penicillin

Family History:
Diabetes

Assessment and Plan:
Warfarin for PE

Patient Education:
You have been prescribed with a medication which is known to cause chest pain.
"""

In [25]:
doc = nlp(text)

In [26]:
visualize_ent(doc)

Each entity will have the following attributes defined under the `ent._` attribute. Each has a default value of `False`, which `sectionizer` can set to `True`:

- `is_negated`
- `is_uncertain`
- `is_historical`
- `is_family`
- `is_hypothetical`

Let's iterate through the entities and see which of these attributes are set on each one.

In [27]:
for ent in doc.ents:
    print(ent, ent._.section.category)
    print("Historical:", ent._.is_historical, "\tFamily:", ent._.is_family, "\tHypothetical:", ent._.is_hypothetical)
    print()

pneumonia past_medical_history
Historical: True 	Family: False 	Hypothetical: False

Allergies allergies
Historical: False 	Family: False 	Hypothetical: False

Penicillin allergies
Historical: False 	Family: False 	Hypothetical: False

Diabetes family_history
Historical: False 	Family: True 	Hypothetical: False

Warfarin family_history
Historical: False 	Family: True 	Hypothetical: False

PE family_history
Historical: False 	Family: True 	Hypothetical: False

a medication patient_education
Historical: False 	Family: False 	Hypothetical: False

chest pain patient_education
Historical: False 	Family: False 	Hypothetical: False



The attributes and sections are defined in a dictionary mapping **section categories** to attribute name/value pairs. You can find this in the `assertion_attributes_mapping` attribute:

In [28]:
sectionizer.assertion_attributes_mapping

{'past_medical_history': {'is_historical': True},
 'sexual_and_social_history': {'is_historical': True},
 'family_history': {'is_family': True},
 'patient_instructions': {'is_hypothetical': True},
 'education': {'is_hypothetical': True},
 'allergy': {'is_hypothetical': True}}
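
When an attribute is not set as expected, it can help to check which categories actually have an entry. The snippet below assumes the keys are compared against each section's category with an exact dictionary match, which is an assumption about the behavior and is not taken from the medspaCy source. It reuses the mapping and the categories printed above.

```python
# Mapping copied from the output above.
mapping = {
    "past_medical_history": {"is_historical": True},
    "sexual_and_social_history": {"is_historical": True},
    "family_history": {"is_family": True},
    "patient_instructions": {"is_hypothetical": True},
    "education": {"is_hypothetical": True},
    "allergy": {"is_hypothetical": True},
}

# Categories printed for the example entities above.
categories = ["past_medical_history", "allergies", "family_history", "patient_education"]

for category in categories:
    # Exact lookup: a near-miss such as "allergies" vs "allergy" finds nothing.
    attrs = mapping.get(category, {})
    print(category, attrs or "no entry")

# Categories that would receive no attributes under this assumption.
unmatched = [category for category in categories if category not in mapping]
print("unmatched:", unmatched)
```

Under that assumption, `allergies` and `patient_education` have no entry, which would be consistent with `Hypothetical: False` in the output for those entities.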

Additionally, you can define your own logic by constructing a dictionary like the one above, registering the `Span` extensions, and passing the dictionary to the `add_attrs` argument:
```python
sectionizer = Sectionizer(nlp, add_attrs={...})
```
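
As a rough sketch of what such a dictionary might look like, the snippet below builds a mapping keyed by the categories seen in the example output and checks the attribute names against the five listed earlier. `custom_mapping`, `KNOWN_ATTRS`, and `check_mapping` are hypothetical names for illustration; attribute names of your own would need to be added to `KNOWN_ATTRS` and registered as `Span` extensions.

```python
# Attribute names listed earlier in this notebook.
KNOWN_ATTRS = {"is_negated", "is_uncertain", "is_historical", "is_family", "is_hypothetical"}

# Hypothetical custom mapping keyed by the category names seen in the example output.
custom_mapping = {
    "past_medical_history": {"is_historical": True},
    "family_history": {"is_family": True},
    "allergies": {"is_hypothetical": True},
    "patient_education": {"is_hypothetical": True},
}


def check_mapping(mapping):
    """Raise ValueError if the mapping uses an attribute name outside KNOWN_ATTRS."""
    for category, attrs in mapping.items():
        unknown = set(attrs) - KNOWN_ATTRS
        if unknown:
            raise ValueError(f"{category}: unknown attributes {sorted(unknown)}")
    return mapping


check_mapping(custom_mapping)
# The checked dictionary would then be passed to the `add_attrs` argument described above.
```

A check like this mainly guards against typos in attribute names, which would otherwise be easy to miss.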