

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb)


# **Detect PHI for Deidentification**

📌 To run this notebook yourself, you need to provide your license keys. Run the cell below to upload them, or open the file explorer on the left side of the screen and upload your license file as `spark_jsl.json`, the name the cell below looks for.
Otherwise, you can look at the example outputs at the bottom of the notebook.

## Colab Setup

Import license keys

In [45]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
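
The cells that follow read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from the loaded keys, so it may be worth checking for them before installing anything. The sketch below is one way to do that in plain Python; the helper name `missing_license_keys` and the sample values are hypothetical, and a real license file may contain additional fields.

```python
# Keys this notebook reads from the license file in later cells.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")

def missing_license_keys(keys):
    """Return the required keys that are absent or empty in `keys`."""
    return [name for name in REQUIRED_KEYS if not keys.get(name)]

# Placeholder values for illustration only; SECRET is deliberately left out.
example_keys = {"PUBLIC_VERSION": "4.3.0", "JSL_VERSION": "4.3.0"}
print(missing_license_keys(example_keys))
```

An empty list means every key the notebook relies on is present.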

In [46]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [47]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.3.0
Spark NLP_JSL Version : 4.3.0


## 🔎 Models


> * ### *`ner_deid_large`*
> * ### *`ner_deid_generic_augmented`*
> * ### *`ner_deid_subentity_augmented`*
> * ### *`ner_deid_subentity_augmented_i2b2`*
> * ### *`ner_deid_generic_glove`*
> * ### *`ner_deid_subentity_glove`*


**🔎 You can find all these models and more in the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Spark+NLP+for+Healthcare).**
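
The model names above pair with one of two embedding sets: the `*_glove` models with `glove_100d`, and the rest with `embeddings_clinical`. A minimal sketch of that lookup in plain Python follows; the helper name `embeddings_for` is hypothetical and only mirrors the branching used by this notebook's pipeline function.

```python
# Models in this notebook that were trained on GloVe embeddings.
GLOVE_MODELS = {"ner_deid_generic_glove", "ner_deid_subentity_glove"}

def embeddings_for(model_name):
    """Return the embeddings name this notebook pairs with `model_name`."""
    if model_name in GLOVE_MODELS:
        return "glove_100d"
    return "embeddings_clinical"

for name in ("ner_deid_large", "ner_deid_generic_glove"):
    print(name, "->", embeddings_for(name))
```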

# 📍 Detect PHI for Deidentification with `embeddings_clinical`

In [48]:
from sparknlp_display import NerVisualizer

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")
  
sentenceDetector = SentenceDetectorDLModel.pretrained()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")
  
tokenizer = Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

def run_pipeline (model, text):

  # GloVe-based models need glove_100d; all other models use the clinical embeddings
  if model in ("ner_deid_subentity_glove", "ner_deid_generic_glove"):
    word_embeddings = WordEmbeddingsModel.pretrained("glove_100d")\
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("embeddings")

  else:
    word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("embeddings")

  ner = MedicalNerModel.pretrained(model, "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    
  pipeline =  Pipeline(
      stages=[
          documentAssembler,
          sentenceDetector,
          tokenizer,
          word_embeddings,
          ner,
          ner_converter
          ])

  df = spark.createDataFrame(text, StringType()).toDF("text")
  result = pipeline.fit(df).transform(df)
  
  print("\n")
  print("<----------------- MODEL NAME:","\033[1m" + model + "\033[0m"," ----------------- >")
  print("\n")

  result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                       result.ner_chunk.metadata)).alias("cols")) \
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

  print("\n")
  
  # Collect once; calling collect() inside the loop would re-run the pipeline per document
  rows = result.collect()
  for row in rows:
    NerVisualizer().display(
      result = row,
      label_col = 'ner_chunk',
      document_col = 'document'
    )

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
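
Each run of the pipeline yields `(chunk, ner_label)` pairs like the ones printed in the tables below. As a rough illustration of how such pairs could feed a masking step, the sketch below replaces every detected chunk with its label using plain string substitution. This is not a Spark NLP feature: the function name `mask_chunks` and the sample sentence are hypothetical, and a production pipeline would more likely work from character offsets than from string matching.

```python
def mask_chunks(text, entities):
    """Replace every occurrence of each chunk in `text` with `<LABEL>`."""
    # Drop duplicate pairs while keeping their first-seen order.
    unique = list(dict.fromkeys(entities))
    # Replace longer chunks first so a short chunk never clips a longer one.
    for chunk, label in sorted(unique, key=lambda pair: len(pair[0]), reverse=True):
        text = text.replace(chunk, f"<{label}>")
    return text

# Hypothetical sentence and pairs, shaped like the tables in this notebook.
sample = "Mr. John Smith was seen at Mount Sinai Hospital on 02/04/2003."
pairs = [
    ("John Smith", "NAME"),
    ("Mount Sinai Hospital", "LOCATION"),
    ("02/04/2003", "DATE"),
    ("John Smith", "NAME"),
]
print(mask_chunks(sample, pairs))
# -> Mr. <NAME> was seen at <LOCATION> on <DATE>.
```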


## **🚩 ner_deid_large**

In [49]:
model = 'ner_deid_large'

In [50]:
sample_texts = [
"""Record date: 02-07-2003. MR: 5247840. HISTORY OF PRESENT ILLNESS: Mr. John Smith is a 60 years old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed on 23/03/2001 by the Mount Sinai Hospital. He work as a teacher. Mr. John Smith  underwent a resection there. He was to be admitted to the NYU Langone Hospital for cystectomy. The patient was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE:  Mr. John Smith presented to the Mount Sinai Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained.  Mr. John  Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease.  Following intervention, Mr. John Smith was admitted to Cardiology Service the direction of Dr. Oliver Hart.  He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. Mount Sinai Hospital, Madison Ave, New York, United States. Phone: (212)-455-1500."""]


In [51]:
run_pipeline(model, sample_texts)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_large download started this may take some time.
[OK!]


<----------------- MODEL NAME: ner_deid_large  ----------------- >


+--------------------+----------+
|chunk               |ner_label |
+--------------------+----------+
|02-07-2003          |DATE      |
|5247840             |ID        |
|John Smith          |NAME      |
|60                  |AGE       |
|23/03/2001          |DATE      |
|Mount Sinai Hospital|LOCATION  |
|teacher             |PROFESSION|
|John Smith          |NAME      |
|NYU Langone Hospital|LOCATION  |
|02/04/2003          |DATE      |
|John Smith          |NAME      |
|Mount Sinai Hospital|LOCATION  |
|John  Smith         |NAME      |
|John Smith          |NAME      |
|Oliver Hart         |NAME      |
|02/07/2003          |DATE      |
|Mount Sinai Hospital|LOCATION  |
|Madison Ave         |LOCATION  |
|New York            |LOCATION  |
|Unite

## **🚩 ner_deid_generic_augmented**

In [52]:
model = 'ner_deid_generic_augmented'

In [53]:
sample_texts = [
"""Record date: 08-24-2007. MSW : 5067003218. Mrs. Liam Davis is a 61-year-old , born in Los Angeles, white female status post right total knee replacement secondary to degenerative joint disease performed by Dr. Anderson Johnson and Dr. Amelia Martinez at Emory University Hospital on 08/21/2007. The patient was transfused with 2 units of autologous blood postoperatively. She received DVT prophylaxis with a combination of Coumadin, Lovenox, SCD boots, and TED stockings. The remainder of her postoperative course was uneventful. Mrs. Liam Davis was discharged on 08/24/2007 from Emory University Hospital. The patient reports that her last bowel movement was on 08/24/2007 just prior to her discharge from Emory University Hospital. She denies any urological symptoms such as dysuria, incomplete bladder emptying or other voiding difficulties. Mrs. Liam Davis reports having some right knee pain, which is most intense at a "certain position". Mrs. Liam Davis is unable to elaborate on which "certain position" causes her the most discomfort. Emory University Hospital. 75 Francis St, Boston. Phone: 617-732-5500"""
]

In [54]:
run_pipeline(model, sample_texts)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


<----------------- MODEL NAME: ner_deid_generic_augmented  ----------------- >


+----------------------------------------+---------+
|chunk                                   |ner_label|
+----------------------------------------+---------+
|08-24-2007                              |DATE     |
|5067003218                              |ID       |
|Liam Davis                              |NAME     |
|61-year-old                             |AGE      |
|Los Angeles                             |LOCATION |
|Anderson Johnson                        |NAME     |
|Amelia Martinez                         |NAME     |
|Emory University Hospital               |LOCATION |
|08/21/2007                              |DATE     |
|Liam Davis                              |NAME     |
|08/24/2007                              

## **🚩 ner_deid_subentity_augmented**

In [55]:
model = "ner_deid_subentity_augmented"

In [56]:
sample_text = [
"""Record date: 02-07-2003. MR: 5247840. HISTORY OF PRESENT ILLNESS: Mr. John Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed on 23/03/2001 by the Mount Sinai Hospital.  He underwent a resection there. The patient was to be admitted to the NYU Langone Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE:  Mr. John Smith presented to the Mount Sinai Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained.  The patient underwent a left heart catheterization, which revealed two vessel coronary artery disease.  Following intervention, Mr. John Smith was admitted to Cardiology Service the direction of Dr. Oliver Hart.  He  was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. Mount Sinai Hospital 1468 Madison Ave, New York, United States. Phone: 212-241-6500. Email: jdoe@mountsinai.org."""
]

In [57]:
run_pipeline(model, sample_text)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


<----------------- MODEL NAME: ner_deid_subentity_augmented  ----------------- >


+--------------------+-------------+
|chunk               |ner_label    |
+--------------------+-------------+
|02-07-2003          |DATE         |
|5247840             |MEDICALRECORD|
|John Smith          |PATIENT      |
|60-year-old         |AGE          |
|23/03/2001          |DATE         |
|Mount Sinai Hospital|HOSPITAL     |
|NYU Langone Hospital|HOSPITAL     |
|02/04/2003          |DATE         |
|John Smith          |PATIENT      |
|Mount Sinai Hospital|HOSPITAL     |
|John Smith          |PATIENT      |
|Oliver Hart         |DOCTOR       |
|02/07/2003          |DATE         |
|Mount Sinai Hospital|HOSPITAL     |
|1468 Madison Ave    |STREET       |
|New York            |CITY         |
|United States       |C

## **🚩 ner_deid_subentity_augmented_i2b2**

In [58]:
model = "ner_deid_subentity_augmented_i2b2"

In [59]:
sample_text = [
"""Record date: 19-08-2014. MSW : 647390883. Mr. Noah Lee is a 48 year old, born in Houston, male with stage IV chronic kidney disease, likely secondary to HIV nephropathy who presents to clinic for followup having missed prior clinic appointments. He was last seen in the Northwestern Memorial Hospital on 05/29/2007 by Dr. Monroe Thompson. 
This is the first time that I have met the patient. The patient's history of renal insufficiency dates back to 06/2006 when he was hospitalized for an HIV-associated complication. He is unclear of the exact reason for his hospitalization at that time, but he was diagnosed with renal insufficiency and was followed in our Renal Clinic for approximately one year. 
Mr. Noah Lee had a baseline creatinine during that time of between 3.2 to 3.3. When he was initially diagnosed with renal insufficiency, he had been noncompliant with his HAART regimen. Since that time, he has been very compliant with treatment for his HIV and is seeing Dr. Jones Davis in our Scripps Memorial Hospital. Mr. Noah Lee is currently on three-drug antiretroviral therapy. His last CD4 count in 03/2008 was 350.  The latest blood work that I have is from 06/11/2008 and shows a serum creatinine of 3.8, which represents a GFR of 22 and a potassium of 5.9. 
The only complaint that the patient has at this time is some difficulty sleeping. He was given Ambien by Dr. Mia Wilson, but this has not helped significantly with his difficulty sleeping. Mr. Noah Lee says that he has trouble getting to sleep. The Ambien will allow him to sleep for about two hours, and then he is awake again. The patient is tired during the day, but is not taking any daytime naps. He has no history of excessive snoring or apneic periods. Mr. Noah Lee has no history of falling asleep at work or while driving. He has never had a formal sleep study. Mr. Noah Lee does continue to work in sales . Scripps Memorial Hospital. 9888 Genesee Ave,  USA . Phone: (858)-834-1798"""
]

In [60]:
run_pipeline(model, sample_text)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented_i2b2 download started this may take some time.
[OK!]


<----------------- MODEL NAME: ner_deid_subentity_augmented_i2b2  ----------------- >


+------------------------------+----------+
|chunk                         |ner_label |
+------------------------------+----------+
|19-08-2014                    |DATE      |
|647390883                     |IDNUM     |
|Noah Lee                      |PATIENT   |
|48                            |AGE       |
|Houston                       |STATE     |
|Northwestern Memorial Hospital|HOSPITAL  |
|05/29/2007                    |DATE      |
|Monroe Thompson               |DOCTOR    |
|06/2006                       |DATE      |
|Noah Lee                      |PATIENT   |
|Jones Davis                   |DOCTOR    |
|Scripps Memorial Hospital     |HOSPITAL  |
|Noah Lee                      |PATIENT   |
|03/2008    

# 📍 Detect PHI for Deidentification with `glove_100d`

## **🚩 ner_deid_generic_glove**

In [61]:
model = "ner_deid_generic_glove"

In [62]:
sample_text = [
"""Record date: 02-07-2003. MR: 5247840. HISTORY OF PRESENT ILLNESS: Mr. John Smith is a 60 year old white male, who has a history of bladder cancer diagnosed on 23/03/2001 by the Mount Sinai Hospital.  He underwent a resection there. He work as a journalist. The patient was to be admitted to the NYU Langone Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE:  Mr. John Smith presented to the Mount Sinai Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained.  He underwent a left heart catheterization, which revealed two vessel coronary artery disease.  Following intervention, He was admitted to Cardiology Service the direction of Dr. Oliver Hart.  Mr. John Smith  was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. Mount Sinai Hospital 1468 Madison Ave, New York, USA. Phone : (212) 241-6500."""
]

In [63]:
run_pipeline(model, sample_text)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_deid_generic_glove download started this may take some time.
[OK!]


<----------------- MODEL NAME: ner_deid_generic_glove  ----------------- >


+--------------------+----------+
|chunk               |ner_label |
+--------------------+----------+
|02-07-2003          |DATE      |
|5247840             |ID        |
|John Smith          |NAME      |
|60                  |AGE       |
|23/03/2001          |DATE      |
|Mount Sinai Hospital|LOCATION  |
|journalist          |PROFESSION|
|NYU Langone Hospital|LOCATION  |
|02/04/2003          |DATE      |
|John Smith          |NAME      |
|Mount Sinai Hospital|LOCATION  |
|Oliver Hart         |NAME      |
|John Smith          |NAME      |
|02/07/2003          |DATE      |
|Mount Sinai Hospital|LOCATION  |
|1468 Madison Ave    |LOCATION  |
|New York            |LOCATION  |
|USA                 |LOCATION  |
|(212) 241-6500      |CONTACT  

## **🚩 ner_deid_subentity_glove**

In [64]:
model = "ner_deid_subentity_glove"

In [65]:
sample_text = [
"""Record date: 02-07-2003. MR: 5247840. HISTORY OF PRESENT ILLNESS: Mr. John Smith is a 60 year old white male, who has a history of bladder cancer diagnosed on 23/03/2001 by the Mount Sinai Hospital.  The patient  underwent a resection there. He work as a journalist. He was to be admitted to the NYU Langone Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE:  Mr. John Smith presented to the Mount Sinai Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained.  Mr. John  Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease.  Following intervention, The patient was admitted to Cardiology Service the direction of Dr. Oliver Hart.  Mr. John Smith  was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. Mount Sinai Hospital 1468 Madison Ave, New York, USA. Phone : (212) 241-6500.""",
]

In [66]:
run_pipeline(model, sample_text)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_deid_subentity_glove download started this may take some time.
[OK!]


<----------------- MODEL NAME: ner_deid_subentity_glove  ----------------- >


+--------------------+-------------+
|chunk               |ner_label    |
+--------------------+-------------+
|02-07-2003          |DATE         |
|5247840             |MEDICALRECORD|
|John Smith          |PATIENT      |
|60                  |AGE          |
|23/03/2001          |DATE         |
|Mount Sinai Hospital|HOSPITAL     |
|journalist          |PROFESSION   |
|NYU Langone Hospital|HOSPITAL     |
|02/04/2003          |DATE         |
|John Smith          |PATIENT      |
|Mount Sinai Hospital|HOSPITAL     |
|John  Smith         |PATIENT      |
|Oliver Hart         |DOCTOR       |
|John Smith          |PATIENT      |
|02/07/2003          |DATE         |
|Mount Sinai Hospital|HOSPITAL     |
|1468 Madison Ave    |STREET       |
|