# PHI Identification via SciSpaCy and MedSpaCy

# SciSpaCy
* python package containing spaCy models for processing, biomedical, scientific, or clinical data
* we will be using the en_ner_bc5cdr_md	NER model
    * this model is trained on the BC5CDR Corpus which consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical disease interactions

In [2]:
# install spacy (for base NLP capabilities)
!pip install spacy 


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [9]:
# install en_ner_bc5cdr_md model for named entity recognition (NER)
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz (119.8 MB)
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [10]:
# install base spacy english small model
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m37.9 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
!pip install scispacy


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [78]:
import spacy
import scispacy
# base spacy model
nlp = spacy.load("en_core_web_sm")
# create Language model containg all components to process text
sci_nlp = spacy.load("en_ner_bc5cdr_md")

In [131]:
class phi:
    def __init__(self,file_path):
        self.file_path = file_path
        self.content = None
        
    # read file and handle errors 
    def _read_file(self):
        try:
            with open(self.file_path) as file:
                self.content = file.read()
                return (self.content) 
        except FileNotFoundError:
            print("File not found")
        except Exception as e:
            print(e)
            self.content = None
    # print file
    def display_file(self):
        if self.content is None:
            self._read_file()
        print("FILE CONTENT:")
        print(self.content)   
        
    # Natural Entity Recognition (NER) on file
    def extract_PHI(self):
        doc = sci_nlp(self.content)
        print("IDENTIFIED PHI ELEMENTS:")
        return([(ent.text, ent.label_,(ent.start_char,ent.end_char)) for ent in doc.ents])

## Testing "out of box" capabilities 

In [38]:
# extract PHI info for phi_sample.txt
phi1 = phi("phi_sample.txt")
phi1.display_file()
phi1.extract_PHI()

FILE CONTENT:

Patient Account Number: A12345B678
Patient Admission Date: 10/12/2023
Patient Discharge Date: 10/14/2023
Patient Medical Record Number: MRN12345678
Patient Health Plan Beneficiary Number: HPBN98765432

Diagnosis: Hypertension, Type 2 Diabetes
Treatment Plan: Prescribed antihypertensive medication, lifestyle changes, insulin therapy

Billing Information:

1. Office Visit - $200.00
2. Blood Tests - $150.00
3. MRI Scan - $450.00
4. Medication - $100.00
Total: $900.00


MRI Scan Results:
MRI shows mild lumbar disc bulge at L4-L5 with no signs of nerve impingement.

Blood Test Results:
Blood glucose level: 145 mg/dL, Cholesterol: 220 mg/dL, Hemoglobin A1c: 7.2%

Prescribed Dosage Information:
100 mg of Metformin twice daily, 5 mg of Lisinopril once daily

Prescribed Medications:
Metformin, Lisinopril, Aspirin

Pre-existing Condition Information:
Chronic Hypertension, Type 2 Diabetes

Mental Health Information:
Patient reports feeling anxious and stressed due to recent family 

[('Hypertension', 'DISEASE', (214, 226)),
 ('Diabetes', 'DISEASE', (235, 243)),
 ('insulin', 'CHEMICAL', (319, 326)),
 ('nerve impingement', 'DISEASE', (549, 566)),
 ('glucose', 'CHEMICAL', (595, 602)),
 ('Cholesterol', 'CHEMICAL', (621, 632)),
 ('Metformin', 'CHEMICAL', (708, 717)),
 ('Lisinopril', 'CHEMICAL', (739, 749)),
 ('Metformin', 'CHEMICAL', (786, 795)),
 ('Lisinopril', 'CHEMICAL', (797, 807)),
 ('Aspirin', 'CHEMICAL', (809, 816)),
 ('Hypertension', 'DISEASE', (862, 874)),
 ('Type 2 Diabetes\n\n', 'DISEASE', (876, 893))]

In [129]:
# extract PHI info for patient_visit.txt
phi2 = phi("patient_visit.txt")
phi2.display_file()
phi2.extract_PHI()

FILE CONTENT:
Patient Name: John Doe  
Age: 54  
Date of Visit: 2024-11-26  

**Diagnosis:**  
Mr. John Doe was diagnosed with Type 2 Diabetes and Hypertension. Additionally, he showed early signs of chronic kidney disease based on lab results.

**Treatment Plan:**  
The patient is advised to follow a low-sodium, low-carb diet, exercise daily, and monitor blood sugar levels twice a day. A follow-up visit is scheduled in three months.  

**Billing Information:**  
- Initial consultation: $150  
- Blood tests: $200  
- MRI scan: $800  
**Total Bill:** $1,150  

**MRI Results:**  
MRI scans showed minor ischemic changes in the brain indicative of age-related conditions. No evidence of acute stroke.  

**Blood Test Results:**  
- Fasting glucose: 140 mg/dL (high)  
- HbA1c: 7.8% (high)  
- Serum creatinine: 1.4 mg/dL (high)  

**Prescribed Dosage Information:**  
- Metformin 500 mg, twice a day  
- Losartan 50 mg, once a day  

**Medications Prescribed:**  
- Metformin  
- Losartan  

**Pr

[('Type 2 Diabetes', 'DISEASE', (113, 128)),
 ('Hypertension', 'DISEASE', (133, 145)),
 ('chronic kidney disease', 'DISEASE', (186, 208)),
 ('low-sodium', 'CHEMICAL', (289, 299)),
 ('acute stroke', 'DISEASE', (677, 689)),
 ('glucose', 'CHEMICAL', (730, 737)),
 ('140', 'CHEMICAL', (739, 742)),
 ('creatinine', 'CHEMICAL', (789, 799)),
 ('Metformin', 'CHEMICAL', (860, 869)),
 ('Losartan', 'CHEMICAL', (894, 902)),
 ('Metformin', 'CHEMICAL', (956, 965)),
 ('Losartan', 'CHEMICAL', (970, 978)),
 ('hyperlipidemia', 'DISEASE', (1042, 1056)),
 ('anxiety disorder', 'DISEASE', (1066, 1082)),
 ('Referred', 'DISEASE', (1208, 1216)),
 ('shingles vaccine', 'CHEMICAL', (1360, 1376))]

# Rule based matching to extract prescriptions with dosage as well as results
* We can add custom rules to extract specific data

In [126]:
# notice how prescription of Metaformin 500 mg is not being extracted fully
for elem in phi2.extract_PHI():
    if 'Metformin' in elem:
        print(elem)

IDENTIFIED PHI ELEMENTS:
('Metformin', 'CHEMICAL', (860, 869))
('Metformin', 'CHEMICAL', (956, 965))


In [132]:
from spacy.matcher import Matcher
from spacy.tokens import Span
'''
    we can add a new rule/pattern to identify drugs w/ its associated dosage 
    and add it to the pipeline so it is recognized as an entity
    
    this pattern matches:
        Chemical Entity (drug) followed by a number (dosage) and ASCII chars (unit of dose)
'''
from spacy.matcher import Matcher
# this pattern matches a Chemical entity (drug) followed by a number (the dosage) and token of ASCII characters (unit of dose)
dosage_pattern = [{'ENT_TYPE':'CHEMICAL'},{'LIKE_NUM': True}, {'IS_ASCII': True}]
# create matcher obj against spacy's vocab (collection of all tokens)
matcher = Matcher(sci_nlp.vocab)
# add pattern to matcher
matcher.add("PRESCRIPTION", [dosage_pattern])
doc = sci_nlp(phi2._read_file())

matches = matcher(doc)  # Get all matched patterns
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Match: {span.text} (from {start} to {end})")

Match: Metformin 500 mg (from 206 to 209)
Match: Losartan 50 mg (from 215 to 218)


In [119]:
# #we can add rules to match drugs, with its associated dosage
# # we are going to add a new pattern for our model to extract call "prescription"
# from spacy.matcher import Matcher
# from spacy.tokens import Span
# from spacy.language import Language


# '''
#     we can add a new rule/pattern to identify drugs w/ its associated dosage 
#     and add it to the pipeline so it is recognized as an entity
    
#     this pattern matches:
#         Chemical Entity (drug) followed by a number (dosage) and ASCII chars (unit of dose)
# '''
# dosage_pattern = [{'ENT_TYPE':'CHEMICAL'},{'LIKE_NUM': True}, {'IS_ASCII': True}]

# # create matcher obj against (sci)spacy's vocab (collection of all tokens)
# matcher = Matcher(sci_nlp.vocab)
# # add pattern to matcher
# matcher.add("PRESCRIPTION", [dosage_pattern])
# # define function to add matched prescriptions as entities 
# @Language.component("prescription_component")
# def add_prescription_entities(doc):
#     # apply matcher to doc obj to get matches
#     matches = matcher(doc)
#     # store new entities we will be returning
#     new_ents = []
#     for match_id, start, end, in matches:
#         # get match as slice of doc
#         span = doc[start:end]
#         # create new entity 
#         new_ents.append(Span(doc,start, end, label="PRESCRIPTION"))
#     # add entity to doc.ents (named entities)
#     # doc.ents = list(doc.ents) + new_ents
#     doc.ents = list(doc.ents)+ spacy.util.filter_spans(new_ents) 

    
#     return doc
# # add custom fun to add prescription entities to pipeline after inital NER
# sci_nlp.add_pipe("prescription_component",last= True)
# # add new entities to docs.ents    
    

<function __main__.add_prescription_entities(doc)>

# Extracting Chemical, Disease, and Prescrpitions

In [118]:
# Display all components in the pipeline
for name, component in sci_nlp.pipeline:
    print(f"Component name: {name}, Component: {component}")


Component name: tok2vec, Component: <spacy.pipeline.tok2vec.Tok2Vec object at 0x133a500b0>
Component name: tagger, Component: <spacy.pipeline.tagger.Tagger object at 0x133a50ad0>
Component name: attribute_ruler, Component: <spacy.pipeline.attributeruler.AttributeRuler object at 0x1336c7010>
Component name: lemmatizer, Component: <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x1337393d0>
Component name: parser, Component: <spacy.pipeline.dep_parser.DependencyParser object at 0x133b39000>
Component name: ner, Component: <spacy.pipeline.ner.EntityRecognizer object at 0x133b38e40>


In [125]:
# sci_nlp.remove_pipe("prescription_component")
# sci_nlp.remove_pipe("prescription_component1")
# sci_nlp.remove_pipe("prescription_component2")
sci_nlp.remove_pipe("prescription_component")



('prescription_component', <function __main__.add_prescription_entities(doc)>)