# I. Import libraries
Currently, all of the medspacy subpackages are stored in separate packages/repos. We eventually want to have them all as part of the `medspacy` namespace.

Some packages still need to be added to pip, which is a prerequisite for medspacy to be available as one clean install. For now, any packages not available on pip can be cloned from GitHub and installed manually (`python setup.py install`):

```bash
pip install PyRuSH
pip install cycontext
pip install clinical_sectionizer

git clone https://github.com/medspacy/nlp_preprocessor.git
git clone https://github.com/medspacy/target_matcher.git
git clone https://github.com/medspacy/nlp_postprocessor.git
```
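Before running the imports below, it can help to confirm which of these packages are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `check_installed` is hypothetical, and the module names are simply those used in the import cells of this notebook.

```python
import importlib.util

# Top-level module names as they are imported later in this notebook.
REQUIRED_MODULES = [
    "PyRuSH",
    "cycontext",
    "clinical_sectionizer",
    "nlp_preprocessor",
    "target_matcher",
    "nlp_postprocessor",
    "spacy",
]


def check_installed(module_names):
    """Map each top-level module name to True if Python can locate it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


status = check_installed(REQUIRED_MODULES)
for name, found in status.items():
    print(f"{name:25s} {'ok' if found else 'MISSING'}")
```

Any module reported as `MISSING` would need to be installed from pip or from its cloned repository before continuing.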

In [1]:
from cycontext import ConTextComponent
from cycontext.viz import visualize_ent

In [2]:
from clinical_sectionizer import Sectionizer

In [3]:
from PyRuSH import PyRuSHSentencizer

In [4]:
from nlp_preprocessor import Preprocessor, PreprocessingRule

In [5]:
from target_matcher import TargetMatcher, TargetRule

In [6]:
from nlp_postprocessor import Postprocessor, PostprocessingRule, PostprocessingPattern
from nlp_postprocessor import postprocessing_functions

In [7]:
import spacy

# II. Build the pipeline

In [8]:
nlp = spacy.blank("en")

### Preprocessor

In [9]:
preprocessor = Preprocessor(nlp.tokenizer)
nlp.tokenizer = preprocessor

### PyRuSH

In [10]:
pyrush = PyRuSHSentencizer("./resources/rush_rules.tsv")
nlp.add_pipe(pyrush)

### TargetMatcher

In [11]:
target_matcher = TargetMatcher(nlp)

In [12]:
nlp.add_pipe(target_matcher)

### Sectionizer

In [13]:
sectionizer = Sectionizer(nlp)

In [14]:
nlp.add_pipe(sectionizer)

### ConText

In [15]:
context = ConTextComponent(nlp)

In [16]:
nlp.add_pipe(context)

### Postprocessor

In [17]:
postprocessor = Postprocessor(debug=True)
nlp.add_pipe(postprocessor)

In [18]:
nlp.pipe_names

['sentencizer', 'target_matcher', 'sectionizer', 'context', 'postprocessor']

# II. Read in text
This is an excerpt from a MIMIC discharge summary.

In [19]:
with open("./discharge_summary.txt") as f:
    full_text = f.read()

In [20]:
print(full_text[:1000])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with hypertension, old MI and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain.  She had multiple bouts of
nausea and vomiting, with chills and decreased flatus.

Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Lymphedema from XRT, takes a diuretic
3. Cataracts
4. Hypertension
5. heart murmur - TTE in [**2567**] showed LA mod dilated, LV mildly
hypertrophied, aortic sclerosis, mild AI, mild MR.

Social History:
Married, former secretary, waitress. + tobacco 

# III. Add additional rules to various components
- **Preprocessor**: Clean up the text using the preprocessor
    - Lower-case (mainly for demo purposes)
    - Replace MIMIC-style brackets for dates and years with random dates
    - Remove all other MIMIC-style brackets
- **TargetMatcher**: Extended rule-based matching
    - Can use literal expressions or token patterns
    - Can pass in additional attributes to set on a matched entity
    - Can also define callbacks
- **Sectionizer**: Add "Chief Complaint"
- **Postprocessor**: Modify or remove entities based on custom logic
    - As an example, we'll remove all entities from **patient_instructions** sections

### Preprocessing

In [21]:
import re

In [22]:
import random
import time

def replace_with_random_date(match):
    return random_date("1/1/2000", "1/1/2020", random.random())

def replace_with_random_year(match):
    return str(random.randint(2000, 2020))

In [23]:
def str_time_prop(start, end, prop, fmt):
    """Get a time at a proportion of a range of two formatted times.

    start and end should be strings specifying times formatted in the
    given format (strftime-style), giving an interval [start, end].
    prop specifies the proportion of the interval to be taken after
    start.  The returned time will be in the specified format.
    """

    # Convert both endpoints to seconds since the epoch (local time)
    stime = time.mktime(time.strptime(start, fmt))
    etime = time.mktime(time.strptime(end, fmt))

    # Move prop of the way from start to end
    ptime = stime + prop * (etime - stime)

    return time.strftime(fmt, time.localtime(ptime))


def random_date(start, end, prop, fmt='%m/%d/%Y'):
    return str_time_prop(start, end, prop, fmt)

random_date("1/1/2008", "1/1/2009", random.random())

'01/24/2008'
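`time.mktime` works in local time and can raise an error for dates the platform's C library cannot represent. As an alternative, here is a minimal sketch of the same idea built on `datetime`. The name `random_date_dt` and its `rng` parameter are hypothetical and are not part of the notebook's code.

```python
import random
from datetime import datetime


def random_date_dt(start, end, fmt="%m/%d/%Y", rng=random):
    """Return a random date string between start and end (both given in fmt)."""
    start_dt = datetime.strptime(start, fmt)
    end_dt = datetime.strptime(end, fmt)
    # Pick a point at a random proportion of the interval.
    picked = start_dt + (end_dt - start_dt) * rng.random()
    return picked.strftime(fmt)


# Passing a seeded generator makes the result reproducible.
example = random_date_dt("1/1/2000", "1/1/2020", rng=random.Random(0))
print(example)
```

Accepting a generator as an argument is a design choice for testing: the replacement functions above could pass a seeded `random.Random` so that the same note always receives the same substitute dates.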

A preprocessing rule is any callable that takes text as input and returns text as output. The `PreprocessingRule` class offers a convenient API, but you can also pass in other objects or functions.

In [24]:
# Patterns are raw strings so that backslashes reach the regex engine unchanged
preprocess_rules = [
    lambda x: x.lower(),

    PreprocessingRule(
        re.compile(r"\[\*\*[\d]{1,4}-[\d]{1,2}(-[\d]{1,2})?\*\*\]"),
        repl=replace_with_random_date,
        desc="Replace MIMIC date brackets with a random date."
    ),

    PreprocessingRule(
        re.compile(r"\[\*\*[\d]{4}\*\*\]"),
        repl=replace_with_random_year,
        desc="Replace MIMIC year brackets with a random year."
    ),

    PreprocessingRule(
        re.compile(r"\[\*\*[^\]]+\]"),
        desc="Remove all other bracketed placeholder text from MIMIC"
    )
]

In [25]:
preprocessor.add(preprocess_rules)

In [26]:
nlp("""Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]
 
Date of Birth:  [**2498-8-19**]
Signed electronically by: DR. [**First Name8 (NamePattern2) **]
""")

admission date:  03/08/2010              discharge date:   09/02/2013
 
date of birth:  01/22/2001
signed electronically by: dr. 
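To make the rule mechanics concrete, here is a minimal plain-Python sketch of how a list of text-to-text callables could be applied in sequence. It is an illustration only, not the `nlp_preprocessor` implementation. `SimpleRule` and `apply_rules` are hypothetical names, and a fixed replacement date is used so the result is reproducible.

```python
import re


class SimpleRule:
    """Hypothetical stand-in for a regex-based rule: pattern plus replacement."""

    def __init__(self, pattern, repl="", desc=None):
        self.pattern = pattern
        self.repl = repl  # a string, or a function taking a match object
        self.desc = desc

    def __call__(self, text):
        return self.pattern.sub(self.repl, text)


def apply_rules(text, rules):
    """Apply each rule in order; every rule maps text to text."""
    for rule in rules:
        text = rule(text)
    return text


rules = [
    lambda x: x.lower(),
    SimpleRule(
        re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]"),
        repl=lambda match: "01/01/2000",  # fixed date instead of a random one
    ),
    SimpleRule(re.compile(r"\[\*\*[^\]]+\]")),  # drop any remaining brackets
]

cleaned = apply_rules(
    "Date of Birth:  [**2498-8-19**] by DR. [**First Name8 **]", rules
)
print(cleaned)
```

Order matters in this sketch: the catch-all bracket rule comes last, otherwise it would delete the date placeholders before the date rule could replace them.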

### TargetMatcher

In [27]:
target_rules = [
    TargetRule("abdominal pain", "PROBLEM"),
    TargetRule("hydrochlorothiazide", "TREATMENT"),
    TargetRule("cancer", "PROBLEM"),
    TargetRule("hypertension", "PROBLEM"),
    TargetRule("hypertension", "PROBLEM", pattern=[{"LOWER": "htn"}]),
    TargetRule("Lymphedema", "PROBLEM"),
    TargetRule("diuretic", "TREATMENT"),
    TargetRule("cataracts", "PROBLEM"),
    TargetRule("ulcer", "PROBLEM"),
    TargetRule("Myocardial Infarction", "PROBLEM"),
    # Give "Old MI" an explicit attribute of is_historical
    # But it still has the literal of "Myocardial Infarction"
    TargetRule(
        "Myocardial Infarction",
        "PROBLEM",
        pattern=[{"LOWER": "old"}, {"LOWER": "mi"}],
        attributes={"is_historical": True},
    ),
    TargetRule("Atrial Fibrillation", "PROBLEM"),
]
target_matcher.add(target_rules)

In [28]:
doc = nlp("Patient with htn and old mi.")

old mi

htn



In [29]:
visualize_ent(doc)

In [30]:
print("Span", "literal", "is_historical", sep="\t\t")
for ent in doc.ents:
    print(ent, ent._.target_rule.literal, ent._.is_historical, sep="\t\t")

Span		literal		is_historical
htn		hypertension		False
old mi		Myocardial Infarction		True
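The relationship between a token pattern, the rule's literal, and the extra attributes can be sketched in plain Python. This is a simplified illustration that only understands the `LOWER` key and pre-split tokens; it is not how spaCy or `target_matcher` implement matching, and `match_lower_pattern` is a hypothetical helper.

```python
def match_lower_pattern(tokens, pattern):
    """Return (start, end) token offsets where every {"LOWER": ...} item matches."""
    hits = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(tok.lower() == item["LOWER"] for tok, item in zip(window, pattern)):
            hits.append((start, start + width))
    return hits


tokens = ["Patient", "with", "htn", "and", "old", "mi", "."]

# (literal, label, token pattern, attributes), mirroring the TargetRule arguments
rules = [
    ("hypertension", "PROBLEM", [{"LOWER": "htn"}], {}),
    (
        "Myocardial Infarction",
        "PROBLEM",
        [{"LOWER": "old"}, {"LOWER": "mi"}],
        {"is_historical": True},
    ),
]

matches = []
for literal, label, pattern, attributes in rules:
    for start, end in match_lower_pattern(tokens, pattern):
        matches.append({
            "span": " ".join(tokens[start:end]),
            "literal": literal,
            "label": label,
            "is_historical": attributes.get("is_historical", False),
        })

for m in matches:
    print(m["span"], m["literal"], m["is_historical"], sep="\t")
```

The matched span text ("htn", "old mi") differs from the literal, which is why the literal is useful as a normalized name for the concept.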


### Sectionizer

In [31]:
section_rules = [
    {"section_title": "chief_complaint", "pattern": "Chief Complaint:"}
]

In [32]:
sectionizer.add(section_rules)
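As a rough illustration of what a section rule does, the sketch below splits a note at each matched header and labels the text that follows. It is not the `clinical_sectionizer` implementation: `split_sections` is a hypothetical helper, the `allergies` rule is added only for the example, and case-insensitive matching is an assumption made here because the preprocessor lower-cases the text.

```python
import re

section_rules = [
    {"section_title": "chief_complaint", "pattern": "Chief Complaint:"},
    {"section_title": "allergies", "pattern": "Allergies:"},  # example-only rule
]


def split_sections(text, rules):
    """Return (section_title, body) pairs, splitting at each matched header."""
    headers = []
    for rule in rules:
        pattern = re.escape(rule["pattern"])
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            headers.append((m.start(), m.end(), rule["section_title"]))
    headers.sort()  # process headers in document order

    sections = []
    for i, (start, end, title) in enumerate(headers):
        # A section body runs until the next header or the end of the text.
        next_start = headers[i + 1][0] if i + 1 < len(headers) else len(text)
        sections.append((title, text[end:next_start].strip()))
    return sections


note = "allergies:\nhydrochlorothiazide\n\nchief complaint:\nabdominal pain\n"
sections = split_sections(note, section_rules)
print(sections)
```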

### Postprocessor

In [33]:
text = "Diagnoses: Atrial Fibrillation. Patient Instructions: Come back if you experience abdominal pain."

In [34]:
doc = nlp(text)

abdominal pain

atrial fibrillation



In [35]:
visualize_ent(doc)

In [36]:
for ent in doc.ents:
    print(ent, ent._.section_title)

atrial fibrillation observation_and_plan
abdominal pain patient_instructions


In [37]:
postprocess_rules = [
    PostprocessingRule(
        patterns=[
            PostprocessingPattern(condition=lambda ent: ent._.section_title.upper() == "PATIENT_INSTRUCTIONS"),
        ],
        action=postprocessing_functions.remove_ent,
        description="Remove any entities from the instructions section."
    )
]

In [38]:
postprocessor.add(postprocess_rules)

In [39]:
text = "Diagnoses: Atrial Fibrillation. Patient Instructions: Come back if you experience abdominal pain."

In [40]:
doc = nlp(text)

abdominal pain
Passed: PostprocessingRule: None - Remove any entities from the instructions section. on ent: abdominal pain diagnoses: atrial fibrillation. patient instructions: come back if you experience abdominal pain.

atrial fibrillation



In [41]:
for ent in doc.ents:
    print(ent, ent._.section_title)

atrial fibrillation observation_and_plan
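The effect of this rule can be sketched without spaCy: a condition is evaluated on each entity, and entities for which it returns `True` are dropped. The `Ent` class and `remove_matching` function below are hypothetical stand-ins, not part of `nlp_postprocessor`.

```python
from dataclasses import dataclass


@dataclass
class Ent:
    """Hypothetical stand-in for an entity span with a section title."""
    text: str
    section_title: str


def remove_matching(ents, condition):
    """Keep only the entities for which the condition is False."""
    return [ent for ent in ents if not condition(ent)]


ents = [
    Ent("atrial fibrillation", "observation_and_plan"),
    Ent("abdominal pain", "patient_instructions"),
]

kept = remove_matching(
    ents,
    condition=lambda ent: ent.section_title.upper() == "PATIENT_INSTRUCTIONS",
)
print([ent.text for ent in kept])
```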


# IV. Process the entire document with the complete pipeline

In [42]:
doc = nlp(full_text)

abdominal pain
Passed: PostprocessingRule: None - Remove any entities from the instructions section. on ent: abdominal pain patient may shower. please call your surgeon or return to the
emergency room if  experience fever >101.5, nausea, vomiting,
abdominal pain, shortness of breath, abdominal pain or any
significant  change in your medical condition. a

completed by:   md  05/21/2016 @ 1404
signed electronically by: dr.  
 

abdominal pain
Passed: PostprocessingRule: None - Remove any entities from the instructions section. on ent: abdominal pain patient may shower. please call your surgeon or return to the
emergency room if  experience fever >101.5, nausea, vomiting,
abdominal pain, shortness of breath, abdominal pain or any
significant  change in your medical condition. a

completed by:   md  05/21/2016 @ 1404
signed electronically by: dr.  
 

htn

atrial fibrillation

hypertension

cataracts

diuretic

lymphedema

cancer

abdominal pain

old mi

hypertension

abdominal pain

hydro

In [43]:
visualize_ent(doc)