[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/enterprise/healthcare/DeIdentification.ipynb)

<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

In [2]:
import json

with open('keys.json') as f:
    license_keys = json.load(f)

license_keys.keys()


dict_keys(['secret', 'SPARK_NLP_LICENSE', 'JSL_OCR_LICENSE', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'JSL_OCR_SECRET'])
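
The setup cell that follows reads `secret`, the license and AWS entries, and also a `version` entry from `license_keys`, while the listing printed above shows no `version` key. A small check like the one below can surface a missing entry before the install step fails. This is only a sketch: the helper name `missing_license_keys` and the `REQUIRED_KEYS` list are illustrative and were assembled from the keys the next cell accesses.

```python
# Keys read by the setup cell below; this list is an assumption derived
# from that cell, not an official specification of keys.json.
REQUIRED_KEYS = [
    "secret",
    "SPARK_NLP_LICENSE",
    "JSL_OCR_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "version",
]


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required entries that are absent from the loaded dict."""
    return [name for name in required if name not in keys]


# Example using the key names printed above (values are placeholders).
example_keys = dict.fromkeys(
    ["secret", "SPARK_NLP_LICENSE", "JSL_OCR_LICENSE",
     "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "JSL_OCR_SECRET"],
    "placeholder",
)
print(missing_license_keys(example_keys))
```

With your own file you would call `missing_license_keys(license_keys)` and add any reported entry to `keys.json` before continuing.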

In [3]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

secret = license_keys['secret']
os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['JSL_OCR_LICENSE'] = license_keys['JSL_OCR_LICENSE']
os.environ['AWS_ACCESS_KEY_ID']= license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']
version = license_keys['version']

! python -m pip install --upgrade spark-nlp-jsl==$version  --extra-index-url https://pypi.johnsnowlabs.com/$secret

import sparknlp

print (sparknlp.version())

import json
import os
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession


from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl



def start(secret):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:"+version) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+secret+"/spark-nlp-jsl-"+version+".jar")
      
    return builder.getOrCreate()


spark = start(secret) # if you want to start the session with custom params as in start function above
# sparknlp_jsl.start(secret)

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/9hk9l8ybo1
Collecting spark-nlp-jsl==2.5.1rc1
  Downloading https://pypi.johnsnowlabs.com/9hk9l8ybo1/spark-nlp-jsl/spark_nlp_jsl-2.5.1rc1-py3-none-any.whl
Collecting pyspark==2.4.4
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting spark-nlp==2.5.1
  Downloading https://files.pythonhosted.org/packages/df/b4/db653f8080a446de8ce981b262d85c85c61de7e920930726da0d1c6b4c65/spark_nlp-2.5.1-py2.py3-none-any.whl (121kB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc

# DeIdentification - version 2.5.1

## Example for Named Entity Recognition with De-Identification Pipeline

One of the major issues in the analysis of medical records is how to deal with the confidential nature of their content.

Let's imagine we have a clinical record that contains this heading:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
Record date: 2093-01-13<br>
David Hale, M.D.<br>
Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira<br>
Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street.<br>
</div>

A common requirement is to remove or obfuscate any fragment that contains, or potentially contains, data that could be linked to an individual, for instance:
- Names and surnames of the patient
- Names and surnames of the doctors
- Name of a medical center
- Name of a city or town
- Street address
- Telephone number
- E-mail address
- Date of birth (because combined with other data it could lead to the identification of patients)
- etc.

Spark NLP Enterprise provides pipeline functionality for locating fragments that contain sensitive personal information and anonymizing them if required. This notebook walks through an example of such a pipeline.

### Step 1: De-identification pipeline generation

In Spark NLP, text is annotated through pipelines, which are made up of annotator stages. In our case, the architecture of the de-identification pipeline will be:

* DocumentAssembler (text -> document)
* SentenceDetector (document -> sentence)
* Tokenizer (sentence -> token)
* WordEmbeddingsModel ([sentence, token] -> embeddings)
* NerDLModel (ner_deid_large) ([sentence, token, embeddings] -> ner)
* NerConverter ([sentence, token, ner] -> ner_chunk)
* DeIdentificationModel ([sentence, token, ner_chunk] -> deidentified)

So, starting from a text, we end up with a de-identified text.

We will use a pretrained NerDLModel for de-identification that uses word embeddings to recognize tokens that contain personal information. Its output (ner) is then converted into ner_chunk, which is consumed by another pretrained annotator (DeIdentificationModel) that finally generates the de-identified text.
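
Because each annotator consumes columns produced by earlier ones, the stage order matters. The sketch below restates the architecture listed above as plain Python data and checks that every input column is available when a stage needs it. It does not use Spark NLP at all; `STAGES` and `check_wiring` are illustrative names introduced here.

```python
# (stage name, input columns, output column), copied from the list above.
STAGES = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("WordEmbeddingsModel", ["sentence", "token"], "embeddings"),
    ("NerDLModel", ["sentence", "token", "embeddings"], "ner"),
    ("NerConverter", ["sentence", "token", "ner"], "ner_chunk"),
    ("DeIdentificationModel", ["sentence", "token", "ner_chunk"], "deidentified"),
]


def check_wiring(stages, initial_columns=("text",)):
    """Return (stage, missing_column) pairs for inputs nothing upstream produces."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        available.add(output)
    return problems


# An empty list means every input column is produced by an earlier stage.
print(check_wiring(STAGES))
```

Dropping the `WordEmbeddingsModel` entry from `STAGES`, for example, makes the check report that `NerDLModel` is missing its `embeddings` input.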

#### Step 1.1 Load all the components of the pipeline

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

The fourth annotator in the pipeline is "WordEmbeddingsModel". We will download a pretrained model available from "clinical/models" named "embeddings_clinical".

Be patient when running this cell.

The first time you call this pretrained model, it has to be downloaded to your local machine, which takes a while.

The size is about 1.7 GB and the file is typically saved in your home folder as

    ~HOMEFOLDER/cached_models/embeddings_clinical_en_2.0.2_2.4_1558454742956.zip

On later calls the model is loaded from your cached copy, but even then it has to be indexed each time, so expect to wait up to 5 minutes (depending on your machine).
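
To see whether a cached copy is already present before triggering the download, you could list the cache folder. The following is a rough sketch that assumes the `cached_models` folder sits directly in your home directory, as described above; `find_cached_models` is a hypothetical helper, not part of Spark NLP.

```python
from pathlib import Path


def find_cached_models(name, cache_dir=None):
    """List entries in the cache folder whose file name starts with `name`."""
    # Default location assumed from the path mentioned above.
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / "cached_models"
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.name.startswith(name))


# Prints an empty list if nothing has been downloaded yet.
print(find_cached_models("embeddings_clinical"))
```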

In [5]:
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [6]:
# Named Entity Recognition for clinical sensitive information. Includes names, phone numbers, addresses, etc

clinical_sensitive_entities = NerDLModel.pretrained("ner_deid_large", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_deid_large download started this may take some time.
Approximate size to download 14 MB
[OK!]


In [0]:
# NER converter: groups token-level NER tags into entity CHUNKS (required for the next step: de-identification)

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

In [17]:
deidentification_rules = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") \
  .setMode("obfuscate") \
  .setInputCols(["sentence", "token", "ner_chunk"]) \
  .setOutputCol("deidentified")

deidentify_large download started this may take some time.
Approximate size to download 54.7 KB
[OK!]


#### Step 1.2 Defining the stages of the pipeline
Now that we have created all the components, let's put them together into a pipeline.

In [0]:
# Build up the pipeline

pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_sensitive_entities,
    ner_converter,
    deidentification_rules
  ])

### Step 2: Get your model by fitting the pipeline with some data
Let's now see how our de-identification pipeline works with some data. We will use the following data containing personal information as an example:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
Record date: 2093-01-13<br>
David Hale, M.D.<br>
Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira<br>
Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street.<br>
</div>

We will create a Spark DataFrame containing the lines of this document:

In [19]:
# Build a small DataFrame with one line of the record per row

data = spark.createDataFrame([
  ["Record date: 2093-01-13"],
  ["David Hale, M.D."],
  ["Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira"],
  ["Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street."]
]).toDF("text")

data.show(truncate=False)

+--------------------------------------------------------------------------+
|text                                                                      |
+--------------------------------------------------------------------------+
|Record date: 2093-01-13                                                   |
|David Hale, M.D.                                                          |
|Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira          |
|Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street.|
+--------------------------------------------------------------------------+



Now we will create a model by fitting our pipeline to our content:

In [0]:
# Fit the pipeline to obtain a model; all annotators here are pretrained, so nothing is actually trained

model = pipeline.fit(data)

### Step 3: Transform your data with the model to de-identify content
Next, we transform our content using the newly generated model:

In [0]:
output = model.transform(data)

Let's compare the original sentences ('sentence.result') with the final de-identified text ('deidentified.result') generated by the pipeline:

In [22]:
%%time

# Apply the actual transformation

print("Original sentences:")
output.select("sentence.result").show(truncate=False)
print("Anonymized output:")
output.select("deidentified.result").show(truncate=False)


Original sentences:
+------------------------------------------------------------------------------+
|result                                                                        |
+------------------------------------------------------------------------------+
|[Record date: 2093-01-13]                                                     |
|[David Hale, M.D.]                                                            |
|[Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira]            |
|[Record date: 2079-11-09., Cocke County Baptist Hospital., 0295 Keats Street.]|
+------------------------------------------------------------------------------+

Anonymized output:
+---------------------------------------------------+
|result                                             |
+---------------------------------------------------+
|[Record date: 2093-01-18]                          |
|[IRA, M.D.]                                        |
|[Name: JENS MR. #7194334 Date: <DATE> PC

Surnames, dates, names of healthcare facilities and street addresses have been identified as potential personal information and replaced by surrogate values or generic masks.
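
The pipeline above sets `.setMode("obfuscate")`, which is why a name such as "David Hale" comes back as another name rather than as a placeholder. The toy sketch below contrasts the two ideas in plain Python: masking replaces a detected span with its entity label, while obfuscation replaces it with a surrogate value. It is not the library's implementation; the character spans are set by hand and the `SURROGATES` table is made up for illustration.

```python
def replace_spans(text, spans, replacement_for):
    """Replace (start, end, label) character spans, working right to left
    so that earlier offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + replacement_for(text[start:end], label) + text[end:]
    return text


def mask(_value, label):
    # Masking: keep only the entity type.
    return "<" + label + ">"


# Hypothetical surrogate table; a real system would draw from larger lists.
SURROGATES = {"NAME": "Jane Roe", "LOCATION": "12 Example Road"}


def obfuscate(value, label):
    # Obfuscation: swap in a plausible fake value, falling back to a mask.
    return SURROGATES.get(label, mask(value, label))


text = "David Hale, M.D."
spans = [(0, 10, "NAME")]  # character offsets of "David Hale", set by hand

print(replace_spans(text, spans, mask))       # <NAME>, M.D.
print(replace_spans(text, spans, obfuscate))  # Jane Roe, M.D.
```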

### Step 4: Use LightPipelines

Once you have created a model by fitting a pipeline with some data, you can use LightPipelines, which are faster and easier to use for testing or real-time queries.

Let's create a light_pipeline from our model:

In [0]:
light_pipeline = LightPipeline(model)

Now, just by calling the .annotate method of our light_pipeline, we can de-identify any content:

In [15]:
# Call annotate() in order to test a sentence or a list of sentences
ori_str = "Name: Smith García, DOB: 23/07/1977 Dr. Suarez. 17 Main Street, Miami Hospital, USA"
light_data = light_pipeline.annotate(ori_str)
print(ori_str)
print("".join(light_data['deidentified']))

Name: Smith García, DOB: 23/07/1977 Dr. Suarez. 17 Main Street, Miami Hospital, USA
Name: <NAME>, DOB: <DATE> Dr. <NAME>.<LOCATION>, <LOCATION>, <LOCATION>
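
`annotate` returns a dictionary keyed by output column, each entry holding a list of strings. For `deidentified` there is one string per detected sentence, which is why joining with an empty string produces `<NAME>.<LOCATION>` with no space in the output above. The sketch below uses a hand-written dictionary as a stand-in for `light_data`; the two-sentence split is inferred from the printed result and is an assumption.

```python
# Hand-written stand-in for the dictionary returned by annotate();
# the two-sentence split is inferred from the printed result above.
light_data_example = {
    "deidentified": [
        "Name: <NAME>, DOB: <DATE> Dr. <NAME>.",
        "<LOCATION>, <LOCATION>, <LOCATION>",
    ]
}

# Joining with a space keeps the sentence boundary readable.
print(" ".join(light_data_example["deidentified"]))
```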


Here we can see how the NerDL model for de-identification assigns the different NER classes to the tokens:

In [16]:
print("TOKEN (NER)")
print("============")
for i in range(len(light_data['token'])):
    print(light_data['token'][i] + " (" + light_data['ner'][i]+")")
    print("------------")

TOKEN (NER)
Name (O)
------------
: (O)
------------
Smith (B-NAME)
------------
García (I-NAME)
------------
, (O)
------------
DOB (O)
------------
: (O)
------------
23/07/1977 (B-DATE)
------------
Dr (O)
------------
. (O)
------------
Suarez (B-NAME)
------------
. (O)
------------
17 (B-LOCATION)
------------
Main (I-LOCATION)
------------
Street (I-LOCATION)
------------
, (O)
------------
Miami (B-LOCATION)
------------
Hospital (I-LOCATION)
------------
, (O)
------------
USA (B-LOCATION)
------------
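
The `B-` and `I-` prefixes in the listing above mark the beginning and the continuation of an entity. Grouping them is what turns token-level tags into the chunks that the de-identification stage consumes. The function below is a simplified pure-Python illustration of that grouping, using the tokens and tags printed above. It joins tokens with spaces, whereas the real NerConverter works from the original text offsets, so treat it as a sketch of the idea only.

```python
def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (label, text) chunks."""
    chunks = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            # Continuation of the chunk in progress.
            current_tokens.append(token)
            continue
        # Any other tag closes the chunk in progress, if there is one.
        if current_tokens:
            chunks.append((current_label, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            current_label, current_tokens = label, [token]
        else:
            current_label, current_tokens = None, []
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks


tokens = ["Name", ":", "Smith", "García", ",", "DOB", ":", "23/07/1977",
          "Dr", ".", "Suarez", ".", "17", "Main", "Street", ",",
          "Miami", "Hospital", ",", "USA"]
tags = ["O", "O", "B-NAME", "I-NAME", "O", "O", "O", "B-DATE",
        "O", "O", "B-NAME", "O", "B-LOCATION", "I-LOCATION", "I-LOCATION", "O",
        "B-LOCATION", "I-LOCATION", "O", "B-LOCATION"]

for label, text in iob_to_chunks(tokens, tags):
    print(label, "->", text)
```

The six chunks this produces line up with the six placeholders in the de-identified sentence printed earlier.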
