

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEID_MULTI.ipynb)


# Detect PHI for Generic Deidentification (multilingual)

Deidentification NER is a Named Entity Recognition model that annotates English, German, French, Italian, Spanish, Portuguese, and Romanian text to find protected health information (PHI) that may need to be de-identified. It has been trained with in-house annotated datasets using xlm-roberta-base multilingual embeddings.

> 📌To run this yourself, you need to provide your license keys to the notebook. Run the cell below and upload your license file when prompted. Alternatively, open the file explorer on the left side of the screen and upload the license file as `spark_jsl.json`, which is the name the cell below looks for. Otherwise, you can look at the example outputs at the bottom of the notebook.

## 🔧1. Colab Setup

- Import License Keys

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
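
The install and session-start cells below read `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` from the license file. A minimal sketch of a sanity check that could run before them is shown here. The helper name `missing_license_keys` is hypothetical and not part of any library, and a real license file may contain additional keys that this check ignores.

```python
# Keys referenced later in this notebook (pip install lines and sparknlp_jsl.start)
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required key names that are absent or empty in `keys`."""
    return [name for name in required if not keys.get(name)]

# Illustrative dictionary with placeholder values, not a real license
example_keys = {"SECRET": "placeholder", "PUBLIC_VERSION": "5.2.2"}
print(missing_license_keys(example_keys))
```

In the notebook itself, the same function could be called on `license_keys` right after `json.load(f)`, stopping early if the returned list is not empty.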

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [9]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.2.2
Spark NLP_JSL Version : 5.2.1


## 🔍2. Select the model `ner_deid_multilingual` and construct the pipeline

**🔎You can find all these models and more in the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Spark+NLP+for+Healthcare)**

In [10]:
model_name = "ner_deid_multilingual"

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner = MedicalNerModel.pretrained(model_name, "xx", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            embeddings,
                            ner,
                            ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

xlm_roberta_base download started this may take some time.
Approximate size to download 619.5 MB
[OK!]
ner_deid_multilingual download started this may take some time.
[OK!]


## 📝3. Create example inputs

In [32]:
text_list =  [
"""Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 years old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",

"""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",

"""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",

"""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""",

"""Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""",

"""Detalhes do paciente.
Nome do paciente:  Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos""",

"""Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""
]
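
The seven samples above follow the language order given in the introduction (English, French, German, Italian, Spanish, Portuguese, Romanian). Because chunk offsets restart at zero for every document, it can help to carry a document id and language tag alongside each text. The sketch below is one way to do that in plain Python. The helper `tag_with_language`, the `doc_id` field, and the ISO 639-1 codes are assumptions introduced for illustration.

```python
# Assumed ISO 639-1 codes, in the same order as the samples in text_list
LANGS = ["en", "fr", "de", "it", "es", "pt", "ro"]

def tag_with_language(texts, langs=LANGS):
    """Pair each text with a running document id and a language code."""
    if len(texts) != len(langs):
        raise ValueError("texts and langs must have the same length")
    return [
        {"doc_id": i, "lang": lang, "text": text}
        for i, (lang, text) in enumerate(zip(langs, texts))
    ]

# Short stand-in strings; in the notebook, text_list would be passed instead
rows = tag_with_language(["first sample", "second sample"], ["en", "fr"])
print(rows[1])
```

A list of dictionaries like this can be passed to `pd.DataFrame(...)`, so the extra columns travel with the `text` column through the pipeline.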

## 🚀4. Run the pipeline to find Entities

In [33]:
from pyspark.sql.functions import col, explode
import pandas as pd

df = spark.createDataFrame(pd.DataFrame({"text": text_list}))
result = model.transform(df)

# Explode the 'ner_chunk' to flatten the structure and access its fields directly
result_exploded = result.withColumn("ner_chunk", explode("ner_chunk"))

# Select and display the required fields with the correct order
result_exploded.select(
    col("ner_chunk.result").alias("ner_chunk"),
    col("ner_chunk.begin").alias("begin"),
    col("ner_chunk.end").alias("end"),
    col("ner_chunk.metadata.entity").alias("ner_label"),
    col("ner_chunk.metadata.confidence").alias("confidence")
).show(50, truncate=False)

+-----------------------------+-----+---+----------+----------+
|ner_chunk                    |begin|end|ner_label |confidence|
+-----------------------------+-----+---+----------+----------+
|2093-01-13                   |14   |23 |DATE      |0.8754    |
|David Hale                   |26   |35 |NAME      |0.7781    |
|Hendrickson                  |51   |61 |NAME      |0.9694    |
|7194334                      |74   |80 |ID        |0.9038    |
|01/13/93                     |89   |96 |DATE      |0.998     |
|Oliveira                     |104  |111|NAME      |0.9882    |
|25                           |114  |115|AGE       |0.9822    |
|1-11-2000                    |142  |150|DATE      |0.9742    |
|Cocke County Baptist Hospital|153  |181|LOCATION  |0.47597498|
|0295 Keats Street            |184  |200|LOCATION  |0.37460002|
|(302) 786-5227               |212  |225|CONTACT   |0.86242497|
|Brothers Coal-Mine           |292  |309|LOCATION  |0.53135   |
|Michel Martinez              |24   |38 
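
In the output above, `begin` and `end` are character offsets into each input text, and `end` is inclusive (for example, `2093-01-13` spans positions 14 to 23). The sketch below uses the first three rows of the table to show how such spans could be replaced with their labels in plain Python. The helper `mask_entities` is hypothetical and only illustrates the offset arithmetic; it is not the library's own de-identification step.

```python
def mask_entities(text, chunks):
    """Replace each (begin, end, label) span with <LABEL>; `end` is inclusive."""
    masked = text
    # Work from right to left so earlier offsets stay valid after each replacement
    for begin, end, label in sorted(chunks, reverse=True):
        masked = masked[:begin] + f"<{label}>" + masked[end + 1:]
    return masked

# Opening of the first English sample, with offsets copied from the table above
text = "Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson"
chunks = [(14, 23, "DATE"), (26, 35, "NAME"), (51, 61, "NAME")]

print(mask_entities(text, chunks))
# Record date : <DATE>, <NAME>, M.D., Name : <NAME>
```

Offsets are only valid against the document they came from, so chunks from different input texts should be masked separately.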

## 👀5. Visualization of Detected Entities

In [34]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

# Collect once: calling result.collect() inside the loop would recompute the pipeline on every iteration
rows = result.collect()

for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")

