In [1]:
import spacy
import medspacy

# Overview
In this notebook, we'll look at two steps commonly performed on clinical text:
- Preprocessing
- Sentence splitting

In [2]:
with open("./discharge_summary.txt") as f:
    text = f.read()

In [3]:
nlp = spacy.blank("en")

# Preprocessing
In preprocessing, we'll take a few steps to clean up the text:
- Lower-case the text (for demonstration purposes only; some later steps are case-sensitive unless explicitly configured otherwise)
- Replace MIMIC-style date brackets with "01-01-2010" and year brackets with "2010", then remove all other MIMIC-style bracketed placeholders
- Expand abbreviations such as "dx'd" and "tx'd" to simplify later processing

The preprocessing component is implemented in [nlp_preprocessor](https://github.com/medspacy/nlp_preprocessor).
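
To make the effect of these steps concrete, here is a minimal sketch that applies the same kinds of substitutions with plain `re`. It is not medSpaCy's `Preprocessor`; the names `RULES` and `clean` and the sample string are made up for illustration.

```python
import re

# Ordered (pattern, replacement) pairs. These imitate the clean-up steps
# described above using only the standard library; they are an illustration,
# not medSpaCy's own implementation.
RULES = [
    (re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]"), "01-01-2010"),  # date brackets
    (re.compile(r"\[\*\*\d{4}\*\*\]"), "2010"),                             # year brackets
    (re.compile(r"dx'd"), "Diagnosed"),                                     # abbreviation
    (re.compile(r"tx'd"), "Treated"),                                       # abbreviation
    (re.compile(r"\[\*\*[^\]]+\]"), ""),                                    # any other placeholder
]


def clean(text):
    """Lower-case the text, then apply each substitution in order."""
    text = text.lower()
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text


# Made-up sample assembled from fragments that resemble the discharge summary.
sample = "Colon cancer dx'd in [**2554**], tx'd on [**2573-5-30**] by [**First Name3 (LF) 1893**]."
cleaned = clean(sample)
print(cleaned)
```

The order of the rules matters: the catch-all pattern comes last, because it would otherwise delete the date and year brackets before the more specific rules could replace them.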

In [4]:
from medspacy.preprocess import Preprocessor, PreprocessingRule
import re

In [5]:
preprocessor = Preprocessor(nlp.tokenizer)

In [6]:
preprocess_rules = [
    # Lower-case everything first, so the patterns below only see lower-case text.
    lambda x: x.lower(),

    # Raw strings keep the regex backslashes from being read as string escapes.
    PreprocessingRule(
        re.compile(r"\[\*\*\d{1,4}-\d{1,2}(-\d{1,2})?\*\*\]"),
        repl="01-01-2010",
        desc="Replace MIMIC date brackets with a generic date."
    ),

    PreprocessingRule(
        re.compile(r"\[\*\*\d{4}\*\*\]"),
        repl="2010",
        desc="Replace MIMIC year brackets with a generic year."
    ),

    PreprocessingRule(
        re.compile(r"dx'd"), repl="Diagnosed",
        desc="Replace abbreviation"
    ),

    PreprocessingRule(
        re.compile(r"tx'd"), repl="Treated",
        desc="Replace abbreviation"
    ),

    # Catch-all: listed last so it only removes brackets not handled above.
    PreprocessingRule(
        re.compile(r"\[\*\*[^\]]+\]"),
        desc="Remove all other bracketed placeholder text from MIMIC"
    )
]

In [7]:
preprocessor.add(preprocess_rules)

In [8]:
nlp.tokenizer = preprocessor

In [9]:
preprocessed_doc = nlp(text)

In [10]:
print(text[:1000])

Admission Date:  [**2573-5-30**]              Discharge Date:   [**2573-7-1**]

Date of Birth:  [**2498-8-19**]             Sex:   F

Service: SURGERY

Allergies:
Hydrochlorothiazide

Attending:[**First Name3 (LF) 1893**]
Chief Complaint:
Abdominal pain

Major Surgical or Invasive Procedure:
PICC line [**6-25**]
ERCP w/ sphincterotomy [**5-31**]


History of Present Illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. Imaging shows no evidence of metastasis.

Past Medical History:
1. Colon cancer dx'd in [**2554**], tx'd with hemicolectomy, XRT,
chemo. Last colonoscopy showed: Last CEA was in the 8 range
(down from 9)
2. Type II Diabetes Mellitus
3. Hypertension

Social History:
Married, former tobacco use. No alcohol or drug use.

Family History:
Mother with stroke at age 82. no early deaths.
2 daughters- healthy


Brief Hospital Course:
Ms. [**Known patient lastname 2004**] was admitted on [**2573-5-30**]. Ultrasound

In [11]:
preprocessed_doc

admission date:  01-01-2010              discharge date:   01-01-2010

date of birth:  01-01-2010             sex:   f

service: surgery

allergies:
hydrochlorothiazide

attending:
chief complaint:
abdominal pain

major surgical or invasive procedure:
picc line 01-01-2010
ercp w/ sphincterotomy 01-01-2010


history of present illness:
74y female with type 2 dm and a recent stroke affecting her
speech, who presents with 2 days of abdominal pain. imaging shows no evidence of metastasis.

past medical history:
1. colon cancer Diagnosed in 2010, Treated with hemicolectomy, xrt,
chemo. last colonoscopy showed: last cea was in the 8 range
(down from 9)
2. type ii diabetes mellitus
3. hypertension

social history:
married, former tobacco use. no alcohol or drug use.

family history:
mother with stroke at age 82. no early deaths.
2 daughters- healthy


brief hospital course:
ms.  was admitted on 01-01-2010. ultrasound at the time of
admission demonstrated pancreatic duct dilitation and an
edem

# Sentence segmentation
Sentence segmentation in medSpaCy is implemented in [PyRuSH](https://github.com/jianlins/PyRuSH). This package applies a series of rules, developed on clinical text, to locate sentence boundaries.
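
As a rough intuition for rule-based sentence splitting, the toy sketch below treats sentence-final punctuation followed by whitespace as a candidate boundary and rejects the candidate when the preceding word is a known abbreviation. This is not PyRuSH's algorithm or rule format; the abbreviation list, the function name `split_sentences`, and the sample text are all hypothetical.

```python
import re

# Hypothetical exception list: a period after these words does not end a sentence.
ABBREVIATIONS = {"ms.", "mr.", "mrs.", "dr."}

# Candidate boundary: '.', '?' or '!' followed by whitespace.
CANDIDATE = re.compile(r"[.?!]\s+")


def split_sentences(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        end = match.start() + 1  # keep the punctuation mark with the sentence
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # exception rule: not a sentence boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # whatever remains after the last boundary
    return sentences


# Made-up sample in the style of a preprocessed clinical note.
sample = "Ms. Smith was admitted on 01-01-2010. Ultrasound showed duct dilation. No early deaths."
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

A splitter without the exception rule would break after "Ms.", which is the kind of error that rules tuned on clinical text are meant to avoid.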

PyRuSH rules are defined in a resource file. When the sentence splitter is loaded as part of `medspacy.load()`, these rules are loaded automatically. When instantiating it on its own, we pass in the resource file included in the `medspacy` repository:

In [12]:
from medspacy.sentence_splitter import PyRuSHSentencizer

In [13]:
sentencizer = PyRuSHSentencizer(rules_path="../resources/rush_rules.tsv")

In [14]:
sentencizer

<PyRuSH.PyRuSHSentencizer.PyRuSHSentencizer at 0x114d530d0>

In [15]:
nlp.add_pipe(sentencizer)

In [16]:
nlp.pipe_names

['sentencizer']

In [17]:
doc = nlp(text)