<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

# DeIdentification - version 2.4.0

## Example for Named Entity Recognition with De-Identification Pipeline

One of the major issues in the analysis of medical records is how to deal with the confidential nature of their content.

Let's imagine we have a clinical record that contains this heading:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
Record date: 2093-01-13<br>
David Hale, M.D.<br>
Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira<br>
Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street.<br>
</div>

A usual requirement is to remove or obfuscate any fragment that contains, or potentially contains, data that could be linked to an individual, for instance:
- Names and surnames of the patient
- Names and surnames of the doctors
- Name of a medical center
- Name of a City or Town
- Street address
- Telephone number
- e-mail
- Date of birth (because combined with other data could lead to identification of patients)
- etc...

Spark NLP Enterprise provides pipeline functionality that locates fragments containing sensitive personal information and anonymizes them if required. This notebook walks through an example of such a pipeline.
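
To make the goal concrete, here is a deliberately naive sketch of what masking means, using plain regular expressions rather than Spark NLP. The patterns and the `mask` helper are hypothetical and only cover dates and record numbers from the sample heading. Names and hospitals cannot be caught reliably this way, which is why the pipeline in this notebook relies on a NER model instead.

```python
import re

# Hypothetical, illustration-only patterns (not part of Spark NLP)
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{2}/\d{2}/\d{2,4}\b"),
    "ID": re.compile(r"#\d+"),
}

def mask(text):
    """Replace every pattern match with a generic <LABEL> mask."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub("<" + label + ">", text)
    return text

masked = mask("Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira")
print(masked)
```

Note that the patient and physician names are left untouched by this sketch.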

### Step 1. Prepare the environment

#### Install OpenSource spark-nlp and pyspark pip packages
As a first step we import the required Python dependencies, including some sparknlp components.

Make sure you have the required Python libraries (pyspark 2.4.4, spark-nlp 2.4.0) by running <code>pip list</code> and checking that the versions are correct.

If any of them is missing you can run:

<code>pip install --ignore-installed pyspark==2.4.4</code><br>
<code>pip install --ignore-installed spark-nlp==2.4.0</code><br>

The <code>--ignore-installed</code> parameter overwrites any previously installed version of the package.

<i>*If this cell fails, it means you have not properly set up the required environment. Please check the prerequisites guideline at http://www.johnsnowlabs.com</i>
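
As an alternative to reading the <code>pip list</code> output by eye, the version check can be sketched in Python. The `check_versions` helper is hypothetical. It assumes `importlib.metadata` is available (Python 3.8+) and falls back to the `importlib_metadata` backport on older interpreters, if that package is installed.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport, assumed to be installed

# Versions named in this notebook
REQUIRED = {"pyspark": "2.4.4", "spark-nlp": "2.4.0"}

def check_versions(required):
    """Map each package to (installed_version_or_None, matches_required)."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (installed, installed == wanted)
    return report

report = check_versions(REQUIRED)
for name, (installed, ok) in report.items():
    print(name, installed, "OK" if ok else "missing or different version")
```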

In [1]:
import sys, time

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.util import *
from sparknlp.embeddings import *

from sparknlp.embeddings import EmbeddingsHelper
from sparknlp.pretrained import ResourceDownloader

from pyspark.ml import Pipeline, PipelineModel

ModuleNotFoundError: No module named 'pyspark'

#### Install Licensed Spark-NLP package

We will also use some Spark-NLP enterprise functionality contained in the spark-nlp-jsl package.

You can check that spark-nlp-jsl is installed by running <code>pip list</code>. Check that the installed version is 2.4.0.

If it is not, you need to install it by using:

<code>pip install spark-nlp-jsl==2.4.0 --extra-index-url https://pypi.johnsnowlabs.com/##### --ignore-installed</code>

The ##### is a secret URL; if you have not received it please contact us at info@johnsnowlabs.com.

<i>*If the next cell fails, it means your licensed enterprise version is not properly installed, so please check the prerequisites guideline at http://www.johnsnowlabs.com/</i>

In [2]:
# If this fails, the pip module for the enterprise version has not been set up properly

from sparknlp_jsl.annotator import *
import sparknlp_jsl

#### Setup credentials to private JohnSnowLabs models repository with AWS-CLI

Now it is time to configure Spark-NLP to access the private JohnSnowLabs models repository. This access is done via the Amazon AWS command line interface (AWS CLI).

Instructions about how to install awscli are available at: 

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

Make sure you configure your credentials with <code>aws configure</code> following the instructions at:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Please substitute the ACCESS_KEY and SECRET_KEY with the credentials you have received. If you need your credentials, contact us at info@johnsnowlabs.com

#### Setup your license number in order to have access to private JohnSnowLabs models repository

You need to set the SPARK_NLP_LICENSE environment variable to the license number provided to you. If you need your license credentials, contact us at info@johnsnowlabs.com

In [19]:
import os
os.environ['SPARK_NLP_LICENSE']='######'
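
A small guard can catch a missing or placeholder license before the Spark session is started. This is only a sketch: `read_license` is a hypothetical helper, and treating a value made only of `#` characters as a placeholder is an assumption based on the masked value shown above.

```python
import os

def read_license(environ=None, name="SPARK_NLP_LICENSE"):
    """Return the license value, or raise if it is unset or still a placeholder."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    # Assumption: a value made only of '#' is the masked placeholder from this notebook
    if not value or set(value) == {"#"}:
        raise RuntimeError(name + " is not set to a real license value")
    return value

# Example with an explicit mapping, so the sketch runs without touching os.environ
print(read_license({"SPARK_NLP_LICENSE": "example-value"}))
```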

#### Start Spark session

The following cell initializes the Spark session in case you have run the Jupyter notebook directly. If you started the notebook using pyspark, this cell is simply ignored.

Initializing the Spark session takes some seconds (usually less than 1 minute) because the jar needs to be loaded from the server.

We will be using version 2.4.0 of Spark NLP Open Source and 2.4.0 of Spark NLP Enterprise Edition.

The ##### passed to <code>sparknlp_jsl.start("#####")</code> is a secret code provided as part of the license; if you have not received it please contact us at info@johnsnowlabs.com.

In [20]:
spark = sparknlp_jsl.start("#####") # Secret code provided as part of the license

### Step 2: De-identification pipeline generation

In Spark-NLP, text is annotated through pipelines. Pipelines are made up of various annotator stages. In our case the architecture of the de-identification pipeline will be:

* DocumentAssembler (text -> document)
* SentenceDetector (document -> sentence)
* Tokenizer (sentence -> token)
* WordEmbeddingsModel ([sentence, token] -> embeddings)
* NerDLModel (deidentify_dl) ([sentence, token, embeddings] -> ner)
* NerConverter ([sentence, token, ner] -> ner_chunk)
* DeIdentificationModel ([sentence, token, ner_chunk] -> deidentified)

So, starting from a text, we end up with a de-identified text.

We will use a pretrained model (NerDLModel deidentify_dl) that uses word embeddings to recognize tokens that contain personal information. We then transform its output (ner) into ner_chunk, which is used by another pretrained annotator (DeIdentificationModel) to generate the de-identified text.
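
The column wiring listed above can be sanity-checked without Spark. The sketch below is plain Python, not a Spark NLP API: `STAGES` simply restates the list from this section and `check_wiring` is a hypothetical helper that verifies every input column is produced by an earlier stage.

```python
# (stage name, input columns, output column), copied from the list above
STAGES = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("WordEmbeddingsModel", ["sentence", "token"], "embeddings"),
    ("NerDLModel", ["sentence", "token", "embeddings"], "ner"),
    ("NerConverter", ["sentence", "token", "ner"], "ner_chunk"),
    ("DeIdentificationModel", ["sentence", "token", "ner_chunk"], "deidentified"),
]

def check_wiring(stages, initial_columns):
    """Return (stage, missing_inputs) for every stage whose inputs are not yet available."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        missing = [col for col in inputs if col not in available]
        if missing:
            problems.append((name, missing))
        available.add(output)
    return problems

problems = check_wiring(STAGES, ["text"])
print(problems)
```

An empty list means each stage only consumes columns that already exist when it runs, which is why the order of the stages matters.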

#### Step 2.1 Load all the components of the pipeline

In [5]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

The fourth annotator in the pipeline is "WordEmbeddingsModel". We will download a pretrained model available from "clinical/models" named "embeddings_clinical".

Be patient when running this cell.

The first time you call this pretrained model it has to be downloaded to your local machine, which takes a while.

The size is about 1.7 GB and it will typically be saved in your home folder as

    ~HOMEFOLDER/cached_models/embeddings_clinical_en_2.0.2_2.4_1558454742956.zip

On subsequent calls the model is loaded from your cached copy, but even then it needs to be indexed each time, so expect to wait up to 5 minutes (depending on your machine).
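
To see what has already been downloaded, the cache folder can be listed with the standard library. This is a sketch under one assumption: that the `~HOMEFOLDER/cached_models` location mentioned above corresponds to `~/cached_models` on your machine. The `list_cached_models` helper is hypothetical.

```python
from pathlib import Path

def list_cached_models(cache_dir="~/cached_models"):
    """Return (name, size_in_bytes) for each entry directly under cache_dir."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    entries = []
    for path in sorted(root.iterdir()):
        if path.is_file():
            size = path.stat().st_size
        else:
            # Extracted models are folders, so add up the files they contain
            size = sum(f.stat().st_size for f in path.rglob("*") if f.is_file())
        entries.append((path.name, size))
    return entries

for name, size in list_cached_models():
    print(name, round(size / 1e6, 1), "MB")
```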

In [6]:
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


The size of the "deidentify_dl" NerDLModel is about 15 MB and it will typically be saved in your home folder as

    ~HOMEFOLDER/cached_models/deidentify_dl_en_2.0.2_2.4_1559669094458.zip

On subsequent calls the model is loaded from your cached copy, which usually takes a few seconds.

In [7]:
# Named Entity Recognition for clinical sensitive information. Includes names, phone numbers, addresses, etc

clinical_sensitive_entities = NerDLModel.pretrained("deidentify_dl", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

deidentify_dl download started this may take some time.
Approximate size to download 14.1 MB
[OK!]


In [8]:
# Named Entity Recognition concepts parser, transforms entities into CHUNKS (required for the next step: de-identification)

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

The size of the "deidentify_rb" DeIdentificationModel is about 4 KB and it will typically be saved in your home folder as

    ~HOMEFOLDER/cached_models/deidentify_rb_en_2.0.2_2.4_1559672122511.zip


In [10]:
deidentification_rules = DeIdentificationModel.pretrained("deidentify_rb", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "ner_chunk"]) \
  .setOutputCol("deidentified")

deidentify_rb download started this may take some time.
Approximate size to download 3.8 KB
[OK!]


#### Step 2.2 Defining the stages of the pipeline
Now that we have created all the components, let's put them together into a pipeline.

In [11]:
# Build up the pipeline

pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_sensitive_entities,
    ner_converter,
    deidentification_rules
  ])

### Step 3. Get your model by fitting the pipeline with some data
Let's now see how our de-identification pipeline works with some data. We will use the following data containing personal information as an example:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
Record date: 2093-01-13<br>
David Hale, M.D.<br>
Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira<br>
Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street.<br>
</div>

We will create a Spark DataFrame containing the lines of this document:

In [12]:
# Build a small DataFrame with one line of the record per row

data = spark.createDataFrame([
  ["Record date: 2093-01-13"],
  ["David Hale, M.D."],
  ["Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira"],
  ["Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street."]
]).toDF("text")

data.show(truncate=False)

+--------------------------------------------------------------------------+
|text                                                                      |
+--------------------------------------------------------------------------+
|Record date: 2093-01-13                                                   |
|David Hale, M.D.                                                          |
|Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira          |
|Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street.|
+--------------------------------------------------------------------------+
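
If the record arrives as a single multi-line string rather than a hand-written list, the rows can be derived from it. This is a minimal sketch in plain Python. The `record` variable is introduced here for illustration, and the final Spark call is left as a comment because it needs the running session from Step 1.

```python
record = """Record date: 2093-01-13
David Hale, M.D.
Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira
Record date: 2079-11-09. Cocke County Baptist Hospital. 0295 Keats Street.
"""

# One single-element list per non-empty line, the same shape passed to createDataFrame above
rows = [[line.strip()] for line in record.splitlines() if line.strip()]
print(len(rows))
print(rows[1])

# With a running session: data = spark.createDataFrame(rows).toDF("text")
```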



Now we will create a model by fitting our pipeline to our content:

In [13]:
# We convert the pipeline into a model, train any annotator if required (not the case)

model = pipeline.fit(data)

### Step 4. Transform your data with the model to deidentify content.
As a next step we transform our content using the new model generated:

In [14]:
output = model.transform(data)

Let's compare the original sentences ('sentence.result') with the final de-identified text ('deidentified.result') generated by the pipeline:

In [15]:
%%time

# Apply the actual transformation

print("Original sentences:")
output.select("sentence.result").show(truncate=False)
print("Annonymized output:")
output.select("deidentified.result").show(truncate=False)


Original sentences:
+------------------------------------------------------------------------------+
|result                                                                        |
+------------------------------------------------------------------------------+
|[Record date: 2093-01-13]                                                     |
|[David Hale, M.D.]                                                            |
|[Name: Hendrickson, Ora MR. #7194334 Date: 01/13/93 PCP: Oliveira]            |
|[Record date: 2079-11-09., Cocke County Baptist Hospital., 0295 Keats Street.]|
+------------------------------------------------------------------------------+

Annonymized output:
+---------------------------------------------------------------------+
|result                                                               |
+---------------------------------------------------------------------+
|[Record date: 2093-01-13]                                            |
|[<DOCTOR>, M.D.]      

Surnames, dates, names of healthcare facilities and street addresses have been identified as potential personal information and substituted with generic masks.

### Step 5. Use LightPipelines

Once you have created a model by fitting a pipeline with some data, you can use LightPipelines, which are faster and easier to use for testing or real-time queries.

Let's create a light_pipeline from our model:

In [16]:
light_pipeline = LightPipeline(model)

Now, by just calling the .annotate method of our light_pipeline, we can de-identify any content:

In [17]:
# Call annotate() in order to test a sentence or a list of sentences
ori_str = "Name: Smith García, DOB: 23/07/1977 Dr. Suarez. 17 Main Street, Miami Hospital, USA"
light_data = light_pipeline.annotate(ori_str)
print(ori_str)
print("".join(light_data['deidentified']))

Name: Smith García, DOB: 23/07/1977 Dr. Suarez. 17 Main Street, Miami Hospital, USA
Name: Smith <PATIENT>, DOB: 12/08/1977 Dr. <PATIENT>.17 <HOSPITAL>, <HOSPITAL>, <HOSPITAL>
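
The missing space in `<PATIENT>.17` appears to come from joining the results with an empty string. Assuming `light_data['deidentified']` holds one string per detected sentence (the list below is a hypothetical reconstruction, not captured output), joining with a space keeps the sentence boundaries readable:

```python
# Hypothetical reconstruction of light_data['deidentified'], one entry per sentence
deidentified = [
    "Name: Smith <PATIENT>, DOB: 12/08/1977 Dr. <PATIENT>.",
    "17 <HOSPITAL>, <HOSPITAL>, <HOSPITAL>",
]

joined_plain = "".join(deidentified)    # reproduces the output shown above
joined_spaced = " ".join(deidentified)  # keeps a space between sentences

print(joined_plain)
print(joined_spaced)
```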


Here we can see how the NerDL model for de-identification assigns the different NER classes to the tokens:

In [18]:
print("TOKEN (NER)")
print("============")
for i in range(len(light_data['token'])):
    print(light_data['token'][i] + " (" + light_data['ner'][i]+")")
    print("------------")

TOKEN (NER)
Name (O)
------------
: (O)
------------
Smith (O)
------------
García (I-PATIENT)
------------
, (O)
------------
DOB (O)
------------
: (O)
------------
23/07/1977 (I-DATE)
------------
Dr (O)
------------
. (O)
------------
Suarez (I-PATIENT)
------------
. (O)
------------
17 (O)
------------
Main (I-HOSPITAL)
------------
Street (I-HOSPITAL)
------------
, (O)
------------
Miami (I-HOSPITAL)
------------
Hospital (I-HOSPITAL)
------------
, (O)
------------
USA (I-HOSPITAL)
------------
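
The token-level tags above can be grouped into entity chunks to see roughly what the masking step receives. The sketch below is a simplified illustration in plain Python, not the actual NerConverter implementation: it merges consecutive tokens that share a label and ignores any B-/I- boundary distinction. The `group_chunks` helper is hypothetical; the tokens and tags are copied from the output above.

```python
from itertools import groupby

tokens = ["Name", ":", "Smith", "García", ",", "DOB", ":", "23/07/1977", "Dr", ".",
          "Suarez", ".", "17", "Main", "Street", ",", "Miami", "Hospital", ",", "USA"]
tags = ["O", "O", "O", "I-PATIENT", "O", "O", "O", "I-DATE", "O", "O",
        "I-PATIENT", "O", "O", "I-HOSPITAL", "I-HOSPITAL", "O",
        "I-HOSPITAL", "I-HOSPITAL", "O", "I-HOSPITAL"]

def group_chunks(tokens, tags):
    """Merge runs of consecutive tokens with the same entity label into (label, text) chunks."""
    chunks = []
    pairs = zip(tokens, tags)
    # Strip the "I-" style prefix so the grouping key is the bare entity label
    for label, run in groupby(pairs, key=lambda pair: pair[1].split("-", 1)[-1]):
        if label != "O":
            chunks.append((label, " ".join(token for token, _ in run)))
    return chunks

chunks = group_chunks(tokens, tags)
for label, text in chunks:
    print(label, "->", text)
```

The three separate HOSPITAL chunks line up with the three `<HOSPITAL>` masks in the de-identified string, and they also show that the street address and the country were labeled as a hospital in this example.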
