![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.1.Clinical_Multi_Language_Deidentification.ipynb)


# Clinical Deidentification Multi Language

## Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
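The install and session cells that follow rely on `SECRET`, `JSL_VERSION`, and `PUBLIC_VERSION` being present in the license file. A minimal sketch of a pre-flight check is shown here; the helper name `missing_license_keys` and the placeholder values are hypothetical, and a real license file may contain additional keys.

```python
# Keys that the later cells of this notebook reference by name.
REQUIRED_KEYS = ("SECRET", "JSL_VERSION", "PUBLIC_VERSION")


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_keys.get(key)]


# Placeholder values for illustration only; not a real license.
example_keys = {"SECRET": "placeholder-secret", "JSL_VERSION": "6.1.1"}
print(missing_license_keys(example_keys))
```

Running such a check right after loading `spark_jsl.json` surfaces a missing key before the `pip install` commands fail with a less obvious error.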

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [3]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(secret = SECRET, params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


In [4]:
# Optional: start the session manually with custom parameters instead of calling sparknlp_jsl.start() as above
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "64G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")

    return builder.getOrCreate()

#spark = start(SECRET)

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# Deidentification Models in Different Languages

<center><b>Deidentification NER Models for Other Languages</b>

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_generic_de.html)  |de| 14| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/02/11/ner_deid_generic_fr.html)  |fr|
| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_subentity_de.html)  |de| 15| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/02/14/ner_deid_subentity_fr.html)  |fr|
| 3| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_generic_es.html)  |es| 16| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_generic_it_3_0.html)  |it|
| 4| [ner_deid_generic_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_augmented_es.html)  |es| 17| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_subentity_it_2_4.html)  |it|
| 5| [ner_deid_generic_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_generic_roberta_es.html)  |es| 18| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_generic_pt_3_0.html)  |pt|
| 6| [ner_deid_generic_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_roberta_augmented_es.html)  |es| 19| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_subentity_pt_3_0.html)  |pt|
| 7| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_subentity_es.html)  |es| 20| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_ro_3_0.html)  |ro|
| 8| [ner_deid_subentity_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_augmented_es.html)  |es| 21| [ner_deid_subentity_bert](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_bert_ro_3_0.html)  |ro|
| 9| [ner_deid_subentity_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_subentity_roberta_es.html)  |es| 22| [ner_deid_generic](https://nlp.johnsnowlabs.com/models)  |ro|
| 10| [ner_deid_subentity_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_roberta_augmented_es.html)  |es| 23| [ner_deid_generic_bert](https://nlp.johnsnowlabs.com/models)  |ro|
| 11| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2023/05/31/ner_deid_subentity_ar.html)  |ar| 24| [ner_deid_generic](https://nlp.johnsnowlabs.com/2023/05/30/ner_deid_generic_ar.html)  |ar|
| 12| [ner_deid_subentity_arabert](https://nlp.johnsnowlabs.com/2023/09/16/ner_deid_subentity_arabert_ar.html)  |ar| 25| [ner_deid_generic_arabert](https://nlp.johnsnowlabs.com/2023/09/16/ner_deid_generic_arabert_ar.html)  |ar|
| 13| [ner_deid_subentity_camelbert](https://nlp.johnsnowlabs.com/2023/09/22/ner_deid_subentity_camelbert_ar.html)  |ar| 26| [ner_deid_generic_camelbert](https://nlp.johnsnowlabs.com/2023/09/16/ner_deid_generic_camelbert_ar.html)  |ar|


# DE-IDENTIFICATION FOR GERMAN

## German Deidentification NER Models

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_generic_de.html)  |de| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_subentity_de.html)  |de|


Creating pipeline

In [5]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_de = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, FAX numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Employment information)
- Id (Social Security numbers, Medical record numbers, Internet protocol addresses)



In [6]:
ner_generic_de = MedicalNerModel.pretrained("ner_deid_generic", "de", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [7]:
ner_generic_de.getClasses()

['O',
 'I-LOCATION',
 'B-DATE',
 'I-NAME',
 'B-LOCATION',
 'I-DATE',
 'B-ID',
 'B-AGE',
 'B-CONTACT',
 'B-PROFESSION',
 'B-NAME']

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- User Name
- Profession
- Phone
- Country
- Age

In [8]:
ner_subentity_de = MedicalNerModel.pretrained("ner_deid_subentity", "de", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [9]:
ner_subentity_de.getClasses()

['O',
 'B-ORGANIZATION',
 'I-DOCTOR',
 'B-DOCTOR',
 'B-USERNAME',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-PROFESSION',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'B-CITY',
 'B-HOSPITAL',
 'B-DATE',
 'B-STREET',
 'B-PATIENT',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-COUNTRY']

### Pipeline

In [10]:
nlpPipeline_de = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_de,
        ner_generic_de,
        ner_converter_generic,
        ner_subentity_de,
        ner_converter_subentity,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_de = nlpPipeline_de.fit(empty_data)

In [11]:
text_de = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""

text_df_de = spark.createDataFrame([[text_de]]).toDF("text")
result_de = model_de.transform(text_df_de)

Results for `ner_deid_subentity`

In [12]:
result_de.select(F.explode(F.arrays_zip(result_de.ner_chunk_subentity.result,
                                        result_de.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|Michael Berger                        |PATIENT  |
|12 Dezember 2018                      |DATE     |
|Elisabeth-Krankenhaus in Bad Kissingen|HOSPITAL |
|Berger                                |PATIENT  |
|76                                    |AGE      |
+--------------------------------------+---------+



Results for `ner_deid_generic`

In [13]:
result_de.select(F.explode(F.arrays_zip(result_de.ner_chunk_generic.result,
                                        result_de.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------------------------+---------+
|chunk                    |ner_label|
+-------------------------+---------+
|Michael Berger           |NAME     |
|12 Dezember 2018         |DATE     |
|St. Elisabeth-Krankenhaus|LOCATION |
|Bad Kissingen            |LOCATION |
|Berger                   |NAME     |
|76                       |AGE      |
+-------------------------+---------+



## Deidentification

### Obfuscation mode

In [14]:
# Downloading custom faker entity list.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate.txt
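Before passing the downloaded file to `setObfuscateRefFile`, it can help to inspect which replacement terms it provides per entity. The sketch below assumes each line has the form `term#LABEL`; that layout, the `#` separator, the helper name `parse_obfuscation_lines`, and the sample entries are all assumptions, so check the actual contents of `obfuscate.txt` before relying on it.

```python
from collections import defaultdict


def parse_obfuscation_lines(lines, sep="#"):
    """Group replacement terms by entity label, assuming 'term<sep>LABEL' lines."""
    terms_by_label = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        # Split on the last separator so terms containing it stay intact.
        term, label = line.rsplit(sep, 1)
        terms_by_label[label.strip().upper()].append(term.strip())
    return dict(terms_by_label)


# Made-up sample lines for illustration; not taken from the real file.
sample_lines = [
    "Max Mustermann#PATIENT",
    "Erika Musterfrau#PATIENT",
    "Klinikum Beispielstadt#HOSPITAL",
    "",
]
print(parse_obfuscation_lines(sample_lines))
```

With the real file, the same function could be called as `parse_obfuscation_lines(open("obfuscate.txt", encoding="utf-8"))`.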

In [15]:
deid_masked_entity = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate.txt')\
    .setObfuscateRefSource("file")

In [16]:
nlpPipeline_de = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_de,
        ner_subentity_de,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_de = nlpPipeline_de.fit(empty_data)

In [17]:
deid_lp_de = LightPipeline(model_de)

In [18]:
text = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""

In [19]:
pd.set_option("display.max_colwidth", 100)

result_lp_de = deid_lp_de.annotate(text)

df_de = pd.DataFrame(list(zip(result_lp_de["masked_with_entity"],
                              result_lp_de["masked_with_chars"],
                              result_lp_de["masked_fixed_length_chars"],
                              result_lp_de["obfuscated"])),
                 columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_de

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,<PATIENT> wird am Morgen des <DATE> ins St. <HOSPITAL> eingeliefert.,[************] wird am Morgen des [**************] ins St. [************************************...,**** wird am Morgen des **** ins St. **** eingeliefert.,Rudolph Dippel wird am Morgen des 15 Dezember 2018 ins St. Asklepios Klinik Bad Oldsloe eingelie...
1,Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.,Herr [****] ist ** Jahre alt und hat zu viel Wasser in den Beinen.,Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen.,Herr Wilfried Reising ist 64 Jahre alt und hat zu viel Wasser in den Beinen.
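To make the three masking policies concrete, here is a pure-Python sketch that reproduces the masked strings shown for the second sentence. It is an illustration of the observed output, not the library's implementation: the function names are hypothetical, and the rule that chunks of one or two characters get plain asterisks without brackets is inferred from the single `76` example.

```python
def mask_text(text, spans, policy="entity_labels", fixed_length=4):
    """Mask (start, end, label) spans in text; end is exclusive, spans do not overlap."""
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        length = end - start
        if policy == "entity_labels":
            pieces.append(f"<{label}>")
        elif policy == "same_length_chars":
            # Brackets count toward the length; very short chunks get bare asterisks.
            pieces.append("*" * length if length <= 2 else "[" + "*" * (length - 2) + "]")
        elif policy == "fixed_length_chars":
            pieces.append("*" * fixed_length)
        else:
            raise ValueError(f"Unknown masking policy: {policy}")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_of(text, chunk, label):
    """Locate the first occurrence of chunk and return it as a labelled span."""
    start = text.index(chunk)
    return (start, start + len(chunk), label)


sentence = "Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."
spans = [span_of(sentence, "Berger", "PATIENT"), span_of(sentence, "76", "AGE")]

for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_text(sentence, spans, policy))
```

`entity_labels` keeps the entity type readable, `same_length_chars` preserves the character offsets of the surrounding text, and `fixed_length_chars` hides even the length of the original value.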


### Faker mode

In [20]:
deid_obfuscated_faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('de')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [21]:
nlpPipeline_de = Pipeline(stages=[
      documentAssembler,
      sentencerDL,
      tokenizer,
      word_embeddings_de,
      ner_subentity_de,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_de = nlpPipeline_de.fit(empty_data)

In [22]:
deid_lp_de = LightPipeline(model_de)

In [23]:
text = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""

In [24]:
pd.set_option("display.max_colwidth", 100)

result_de = deid_lp_de.annotate(text)

df_de = pd.DataFrame(list(zip(result_de["masked_with_entity"],
                              result_de["masked_with_chars"],
                              result_de["masked_fixed_length_chars"],
                              result_de["obfuscated"])),
                 columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_de

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,<PATIENT> wird am Morgen des <DATE> ins St. <HOSPITAL> eingeliefert.,[************] wird am Morgen des [**************] ins St. [************************************...,**** wird am Morgen des **** ins St. **** eingeliefert.,Valentine Volz wird am Morgen des 03 Januar 2019 ins St. Sankt Elisabeth Krankenhaus Köln eingel...
1,Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.,Herr [****] ist ** Jahre alt und hat zu viel Wasser in den Beinen.,Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen.,Herr Volz ist 60 Jahre alt und hat zu viel Wasser in den Beinen.


## Pretrained German Deidentification Pipeline

- We developed a pretrained clinical deidentification pipeline that can be used to deidentify PHI in German medical texts. The PHI is masked and obfuscated in the resulting text.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - City
    - Street
    - Country
    - User name
    - Profession
    - Phone
    - Age
    - Contact
    - ID
    - Zip
    - Account
    - SSN
    - Driver's License Number
    - Plate Number

In [25]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline_de = PretrainedPipeline("clinical_deidentification", "de", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [26]:
pd.set_option("display.max_colwidth", 100)

text = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus in Bad Kissingen eingeliefert.
Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.

Persönliche Daten :
ID-Nummer: T0110053F
Platte A-BC124
Kontonummer: DE89370400440532013000
SSN : 13110587M565
Lizenznummer: B072RRE2I55
Adresse : St.Johann-Straße 13 19300"""

result_de = deid_pipeline_de.annotate(text)

df_de = pd.DataFrame(list(zip(result_de["sentence"],
                              result_de["masked"],
                              result_de["masked_with_chars"],
                              result_de["masked_fixed_length_chars"],
                              result_de["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_de

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhau...,Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert.,Zusammenfassung : [************] wird am Morgen des [**************] ins [**********************...,Zusammenfassung : **** wird am Morgen des **** ins **** eingeliefert.,Zusammenfassung : Saban Steffen wird am Morgen des 12 Dezember 2018 ins Klinik St. Hedwig eingel...
1,Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.,Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.,Herr [************] ist ** Jahre alt und hat zu viel Wasser in den Beinen.,Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen.,Herr Saban Steffen ist 75 Jahre alt und hat zu viel Wasser in den Beinen.
2,Persönliche Daten :\nID-Nummer: T0110053F,Persönliche Daten :\nID-Nummer: <ID>,Persönliche Daten :\nID-Nummer: [*******],Persönliche Daten :\nID-Nummer: ****,Persönliche Daten :\nID-Nummer: W9009942Q
3,Platte A-BC124,Platte <PLATE>,Platte [*****],Platte ****,Platte Z-SL013
4,Kontonummer: DE89370400440532013000\nSSN : 13110587M565,Kontonummer: <ACCOUNT>\nSSN : <SSN>,Kontonummer: [********************]\nSSN : [**********],Kontonummer: ****\nSSN : ****,Kontonummer: 192837465738\nSSN : 02009476T454
5,Lizenznummer: B072RRE2I55,Lizenznummer: <DLN>,Lizenznummer: [*********],Lizenznummer: ****,Lizenznummer: S961KKX1V44
6,Adresse : St.Johann-Straße 13 19300,Adresse : <STREET> <ZIP>,Adresse : [*****************] [***],Adresse : **** ****,Adresse : Kösterring 4/1 08299
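The cells above build the DataFrame with `zip`, which silently truncates to the shortest column if the outputs ever differ in length. A hedged sketch of a stricter helper follows; it assumes `annotate` returns a dict mapping output column names to lists of strings (which is how the notebook indexes `result_de`), and the name `rows_from_annotation` and the stub data are hypothetical.

```python
def rows_from_annotation(result, columns):
    """Turn selected annotate() columns into row dicts, failing on length mismatch."""
    lengths = {col: len(result[col]) for col in columns}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return [
        dict(zip(columns, values))
        for values in zip(*(result[col] for col in columns))
    ]


# Stub standing in for a real annotate() result.
stub_result = {
    "sentence": ["Platte A-BC124", "Lizenznummer: B072RRE2I55"],
    "masked": ["Platte <PLATE>", "Lizenznummer: <DLN>"],
}
for row in rows_from_annotation(stub_result, ["sentence", "masked"]):
    print(row)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(...)`, with the dict keys becoming the column names.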


# DE-IDENTIFICATION FOR SPANISH

##   Spanish Deidentification NER Models
We have eight different models you can use:
* `ner_deid_generic`, detects 7 entities, uses SciWiki 300d embeddings.
* `ner_deid_generic_roberta`, same as previous, but uses Roberta Clinical Embeddings.
* `ner_deid_generic_augmented`, detects 8 entities (adds the 'SEX' entity), uses SciWiki 300d embeddings and has been trained with more data.
* `ner_deid_generic_roberta_augmented`, same as previous, but uses Roberta Clinical Embeddings.
* `ner_deid_subentity`, detects 13 entities, uses SciWiki 300d embeddings.
* `ner_deid_subentity_roberta`, same as previous, but uses Roberta Clinical Embeddings.
* `ner_deid_subentity_augmented`, detects 17 entities, uses SciWiki 300d embeddings and has been trained with more data.
* `ner_deid_subentity_roberta_augmented`, same as previous, but uses Roberta Clinical Embeddings.

Because the `augmented` models give better results than the non-augmented ones, this notebook showcases them.

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_generic_es.html)  |es| 5| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_subentity_es.html)  |es|
| 2| [ner_deid_generic_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_augmented_es.html)  |es| 6| [ner_deid_subentity_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_augmented_es.html)  |es|
| 3| [ner_deid_generic_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_generic_roberta_es.html)  |es| 7| [ner_deid_subentity_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_subentity_roberta_es.html)  |es|
| 4| [ner_deid_generic_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_roberta_augmented_es.html)  |es| 8| [ner_deid_subentity_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_roberta_augmented_es.html)  |es|
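The eight Spanish model names follow a regular pattern: a granularity (`generic` or `subentity`), an optional `_roberta` suffix for the embeddings, and an optional `_augmented` suffix. A small sketch that composes the name from those choices is given below; the helper `spanish_deid_model_name` is hypothetical and only encodes the names listed in the table.

```python
def spanish_deid_model_name(granularity, roberta=False, augmented=False):
    """Compose one of the eight Spanish deidentification NER model names."""
    if granularity not in ("generic", "subentity"):
        raise ValueError("granularity must be 'generic' or 'subentity'")
    name = f"ner_deid_{granularity}"
    if roberta:
        name += "_roberta"      # RoBERTa clinical embeddings variant
    if augmented:
        name += "_augmented"    # trained with more data
    return name


print(spanish_deid_model_name("generic", augmented=True))
print(spanish_deid_model_name("subentity", roberta=True, augmented=True))
```

Note that the RoBERTa variants need matching RoBERTa embeddings upstream in the pipeline rather than the SciWiki 300d embeddings used in this notebook.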


Creating pipeline for Sciwiki 300d-based augmented model

In [27]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_es = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
embeddings_sciwiki_300d download started this may take some time.
Approximate size to download 253.3 MB
[OK!]


###   NER Deid Generic (Augmented)

**`ner_deid_generic_augmented`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, FAX numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Employment information)
- Id (Social Security numbers, Medical record numbers, Internet protocol addresses)
- Sex



In [28]:
ner_generic_es = MedicalNerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic_augmented download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [29]:
ner_generic_es.getClasses()

['O',
 'I-LOCATION',
 'B-ORGANIZATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-PROFESSION',
 'B-CONTACT',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-SEX',
 'I-ORGANIZATION',
 'B-AGE',
 'I-SEX']

### NER Deid Subentity (Augmented)

**`ner_deid_subentity_augmented`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- User Name
- Profession
- Phone
- Country
- Age
- Sex
- Email
- ZIP
- ID
- Medical Record

In [30]:
ner_subentity_es = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity_augmented download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [31]:
ner_subentity_es.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-USERNAME',
 'B-PROFESSION',
 'I-ID',
 'B-CITY',
 'B-DATE',
 'B-PATIENT',
 'B-SEX',
 'I-SEX',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ID',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-STREET',
 'I-ORGANIZATION',
 'B-AGE',
 'I-HOSPITAL',
 'I-COUNTRY']

###   Pipeline

In [32]:
nlpPipeline_es = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_es,
        ner_generic_es,
        ner_converter_generic,
        ner_subentity_es,
        ner_converter_subentity,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_es = nlpPipeline_es.fit(empty_data)

In [33]:
text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2022 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."

text_df = spark.createDataFrame([[text]]).toDF("text")
result_es = model_es.transform(text_df)

Results for `ner_deid_subentity`

In [34]:
result_es.select(F.explode(F.arrays_zip(result_es.ner_chunk_subentity.result,
                                        result_es.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------+----------+
|chunk                  |ner_label |
+-----------------------+----------+
|Antonio Miguel Martínez|PATIENT   |
|un varón               |SEX       |
|35                     |AGE       |
|auxiliar de enfermería |PROFESSION|
|Cadiz                  |CITY      |
|España                 |COUNTRY   |
|14/03/2022             |DATE      |
|Clinica San Carlos     |HOSPITAL  |
+-----------------------+----------+



Results for `ner_deid_generic`

In [35]:
result_es.select(F.explode(F.arrays_zip(result_es.ner_chunk_generic.result,
                                        result_es.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------+----------+
|chunk                  |ner_label |
+-----------------------+----------+
|Antonio Miguel Martínez|NAME      |
|un varón               |SEX       |
|35                     |AGE       |
|auxiliar de enfermería |PROFESSION|
|Cadiz                  |LOCATION  |
|España                 |LOCATION  |
|14/03/2022             |DATE      |
|Clinica San Carlos     |LOCATION  |
+-----------------------+----------+
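Comparing the two result tables shows how the fine-grained subentity labels relate to the coarser generic ones for this example. The sketch below encodes only the correspondences visible in those tables; the mapping name `SUBENTITY_TO_GENERIC` is hypothetical, and labels that do not appear in this example (such as `DOCTOR` or `STREET`) are deliberately left out rather than guessed.

```python
# Correspondences observed in the two tables above; not an official mapping.
SUBENTITY_TO_GENERIC = {
    "PATIENT": "NAME",
    "CITY": "LOCATION",
    "COUNTRY": "LOCATION",
    "HOSPITAL": "LOCATION",
}


def to_generic(label):
    """Map a subentity label to its generic label; unchanged labels pass through."""
    return SUBENTITY_TO_GENERIC.get(label, label)


subentity_rows = [
    ("Antonio Miguel Martínez", "PATIENT"),
    ("un varón", "SEX"),
    ("35", "AGE"),
    ("auxiliar de enfermería", "PROFESSION"),
    ("Cadiz", "CITY"),
    ("España", "COUNTRY"),
    ("14/03/2022", "DATE"),
    ("Clinica San Carlos", "HOSPITAL"),
]

for chunk, label in subentity_rows:
    print(f"{chunk:25} {label:12} -> {to_generic(label)}")
```

The subentity model is the better fit when obfuscation should replace a hospital with a hospital and a city with a city, while the generic model is sufficient when every location can be treated alike.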



## Deidentification

### Obfuscation mode

In [36]:
# Downloading faker entity list.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_es.txt

In [37]:
deid_masked_entity = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_es.txt')\
    .setObfuscateRefSource("file")

In [38]:
nlpPipeline_es = Pipeline(stages=[
      documentAssembler,
      sentencerDL,
      tokenizer,
      word_embeddings_es,
      ner_subentity_es,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_es = nlpPipeline_es.fit(empty_data)

In [39]:
deid_lp_es = LightPipeline(model_es)

In [40]:
text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2022 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."

In [41]:
pd.set_option("display.max_colwidth", 100)

result_es = deid_lp_es.annotate(text)

df_es = pd.DataFrame(list(zip(result_es["masked_with_entity"],
                              result_es["masked_with_chars"],
                              result_es["masked_fixed_length_chars"],
                              result_es["obfuscated"])),
                  columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_es

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"<PATIENT>, <SEX> de <AGE> años de edad, de profesión <PROFESSION> y nacido en <CITY>, <COUNTRY>.","[*********************], [******] de ** años de edad, de profesión [********************] y naci...","****, **** de **** años de edad, de profesión **** y nacido en ****, ****.","Aurora Garrido Paez, m. de 27 años de edad, de profesión conserje y nacido en Valencia, España."
1,"Aún no estaba vacunado, se infectó con Covid-19 el dia <DATE> y tuvo que ir al Hospital. Fue tra...","Aún no estaba vacunado, se infectó con Covid-19 el dia [********] y tuvo que ir al Hospital. Fue...","Aún no estaba vacunado, se infectó con Covid-19 el dia **** y tuvo que ir al Hospital. Fue trata...","Aún no estaba vacunado, se infectó con Covid-19 el dia 17/04/2022 y tuvo que ir al Hospital. Fue..."


### Faker mode

In [42]:
deid_obfuscated_faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('es')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [43]:
nlpPipeline_es = Pipeline(stages=[
      documentAssembler,
      sentencerDL,
      tokenizer,
      word_embeddings_es,
      ner_subentity_es,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_es = nlpPipeline_es.fit(empty_data)

In [44]:
deid_lp_es = LightPipeline(model_es)

In [45]:
text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2022 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."

In [46]:
pd.set_option("display.max_colwidth", 100)

result_es = deid_lp_es.annotate(text)

df_es = pd.DataFrame(list(zip(result_es["masked_with_entity"],
                              result_es["masked_with_chars"],
                              result_es["masked_fixed_length_chars"],
                              result_es["obfuscated"])),
                  columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_es

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"<PATIENT>, <SEX> de <AGE> años de edad, de profesión <PROFESSION> y nacido en <CITY>, <COUNTRY>.","[*********************], [******] de ** años de edad, de profesión [********************] y naci...","****, **** de **** años de edad, de profesión **** y nacido en ****, ****.","Jose Tomas Ramon Vicente Morote, m de 37 años de edad, de profesión coordinador de proveedores d..."
1,"Aún no estaba vacunado, se infectó con Covid-19 el dia <DATE> y tuvo que ir al Hospital. Fue tra...","Aún no estaba vacunado, se infectó con Covid-19 el dia [********] y tuvo que ir al Hospital. Fue...","Aún no estaba vacunado, se infectó con Covid-19 el dia **** y tuvo que ir al Hospital. Fue trata...","Aún no estaba vacunado, se infectó con Covid-19 el dia 21/03/2022 y tuvo que ir al Hospital. Fue..."


## Pretrained Spanish Deidentification Pipeline

- We developed a clinical deidentification pretrained pipeline that can be used to deidentify PHI in Spanish medical texts. The PHI is masked and obfuscated in the resulting text.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - City
    - Street
    - Country
    - User name
    - Profession
    - Phone
    - Age
    - Contact
    - ID
    - ZIP
    - Account
    - SSN
    - Driver's License Number
    - Plate Number
    - Sex

|index|model|index|model|
|-----:|:-----|-----:|:-----|
| 1| [clinical_deidentification_augmented]()| 2| [clinical_deidentification]()|

In [47]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline_es = PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models")

clinical_deidentification_augmented download started this may take some time.
Approx size to download 268.3 MB
[OK!]


In [48]:
text = """Datos del paciente.
Nombre:  Ernesto.
Apellidos: Rivera Bueno.
NHC: 368503.
NASS: 26 63514095.
Domicilio:  Calle Miguel Benitez 90.
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 03/03/1946.
País: España.
Edad: 70 años Sexo: H.
Fecha de Ingreso: 12/12/2016.
Médico:  Ignacio Navarro Cuéllar NºCol: 28 28 70973.
Informe clínico del paciente: Paciente de 70 años de edad, minero jubilado, sin alergias medicamentosas conocidas, que presenta como antecedentes personales: accidente laboral antiguo con fracturas vertebrales y costales; intervenido de enfermedad de Dupuytren en mano derecha y by-pass iliofemoral izquierdo; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; enolismo activo, fumador de 20 cigarrillos / día.
Es derivado desde Atención Primaria por presentar hematuria macroscópica postmiccional en una ocasión y microhematuria persistente posteriormente, con micciones normales.
En la exploración física presenta un buen estado general, con abdomen y genitales normales; tacto rectal compatible con adenoma de próstata grado I/IV.
En la analítica de orina destaca la existencia de 4 hematíes/ campo y 0-5 leucocitos/campo; resto de sedimento normal.
Hemograma normal; en la bioquímica destaca una glucemia de 169 mg/dl y triglicéridos de 456 mg/dl; función hepática y renal normal. PSA de 1.16 ng/ml.
Las citologías de orina son repetidamente sospechosas de malignidad.
En la placa simple de abdomen se valoran cambios degenerativos en columna lumbar y calcificaciones vasculares en ambos hipocondrios y en pelvis.
La ecografía urológica pone de manifiesto la existencia de quistes corticales simples en riñón derecho, vejiga sin alteraciones con buena capacidad y próstata con un peso de 30 g.
En la UIV se observa normofuncionalismo renal bilateral, calcificaciones sobre silueta renal derecha y uréteres arrosariados con imágenes de adición en el tercio superior de ambos uréteres, en relación a pseudodiverticulosis ureteral. El cistograma demuestra una vejiga con buena capacidad, pero paredes trabeculadas en relación a vejiga de esfuerzo. La TC abdominal es normal.
La cistoscopia descubre la existencia de pequeñas tumoraciones vesicales, realizándose resección transuretral con el resultado anatomopatológico de carcinoma urotelial superficial de vejiga.
Remitido por: Ignacio Navarro Cuéllar c/ del Abedul 5-7, 2º dcha 28036 Madrid, España E-mail: nnavcu@hotmail.com.
"""

result_es = deid_pipeline_es.annotate(text)
print("\n".join(result_es['masked_with_chars']))
print("\n")
print("\n".join(result_es['masked']))
print("\n")
print("\n".join(result_es['masked_fixed_length_chars']))
print("\n")
print("\n".join(result_es['obfuscated']))

Datos [**********].
Nombre:  [*****].
Apellidos: [**********].
NHC: [****].
NASS: [*********].
Domicilio:  [*********************].
Localidad/ Provincia: [****].
CP: [***].
Datos asistenciales.
Fecha de nacimiento: [********].
País: [****].
Edad: ** años Sexo: *.
Fecha de Ingreso: [********].
Médico:  [*********************] NºCol: [*********].
Informe clínico [**********]: [******] ** ** años de edad, minero jubilado, sin alergias medicamentosas conocidas, que presenta como antecedentes personales: accidente laboral antiguo con fracturas vertebrales y costales; intervenido de enfermedad de Dupuytren en mano derecha y by-pass iliofemoral izquierdo;
Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; enolismo activo, fumador de 20 cigarrillos / día.
Es derivado desde Atención Primaria por presentar hematuria macroscópica postmiccional en una ocasión y microhematuria persistente posteriormente, con micciones normales.
En la exploración física presenta un buen estado general, 

# DE-IDENTIFICATION FOR FRENCH

## French Deidentification NER Models
We have two different models you can use:
* `ner_deid_generic`, detects 7 entities
* `ner_deid_subentity`, detects 15 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/02/11/ner_deid_generic_fr.html)  |fr| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/02/14/ner_deid_subentity_fr.html)  |fr|


Creating pipeline

In [49]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_fr = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)

In [50]:
ner_generic_fr = MedicalNerModel.pretrained("ner_deid_generic", "fr", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [51]:
ner_generic_fr.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-PROFESSION',
 'B-CONTACT',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-AGE',
 'I-AGE']

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- Username
- Profession
- Phone
- Country
- Age
- E-mail
- ZIP
- Medical Record

In [52]:
ner_subentity_fr = MedicalNerModel.pretrained("ner_deid_subentity", "fr", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [53]:
ner_subentity_fr.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-USERNAME',
 'B-PROFESSION',
 'B-CITY',
 'B-DATE',
 'I-MEDICALRECORD',
 'B-E-MAIL',
 'B-PATIENT',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-STREET',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-AGE',
 'I-COUNTRY']
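
The label list above follows the BIO scheme, so each entity type can appear twice (`B-` for the first token of a chunk, `I-` for the following ones). A minimal sketch in plain Python of collapsing such a list into entity types is shown below; the list is copied from the output above and `entity_types` is a hypothetical helper name, not part of Spark NLP. Labels are split on the first hyphen only, because `E-MAIL` itself contains one.

```python
# BIO labels copied from the ner_subentity_fr.getClasses() output above.
classes = ['O', 'B-MEDICALRECORD', 'B-ORGANIZATION', 'I-PROFESSION', 'B-DOCTOR',
           'B-USERNAME', 'B-PROFESSION', 'B-CITY', 'B-DATE', 'I-MEDICALRECORD',
           'B-E-MAIL', 'B-PATIENT', 'I-DOCTOR', 'I-CITY', 'I-DATE', 'B-COUNTRY',
           'B-ZIP', 'I-STREET', 'I-PATIENT', 'B-PHONE', 'I-PHONE', 'B-HOSPITAL',
           'B-STREET', 'I-ORGANIZATION', 'I-HOSPITAL', 'B-AGE', 'I-AGE', 'I-COUNTRY']


def entity_types(labels):
    """Collapse BIO tags (B-X / I-X) into a sorted list of entity types."""
    types = set()
    for label in labels:
        if label == "O":  # "outside" tag, not an entity
            continue
        # Split on the first hyphen only: 'B-E-MAIL' must yield 'E-MAIL'.
        _prefix, entity = label.split("-", 1)
        types.add(entity)
    return sorted(types)


found = entity_types(classes)
print(len(found))
print(found)
```

The count printed for this list is 15, which matches the number of entities stated for `ner_deid_subentity`.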

### Pipeline

In [54]:
nlpPipeline_fr = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_fr,
        ner_generic_fr,
        ner_converter_generic,
        ner_subentity_fr,
        ner_converter_subentity,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_fr = nlpPipeline_fr.fit(empty_data)

In [55]:
text = "J'ai vu en consultation Michel Martinez (49 ans), jardinier, adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."

text_df = spark.createDataFrame([[text]]).toDF("text")
result_fr = model_fr.transform(text_df)

Results for `ner_deid_generic`

In [56]:
result_fr.select(F.explode(F.arrays_zip(result_fr.ner_chunk_generic.result,
                                        result_fr.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------+----------+
|chunk                        |ner_label |
+-----------------------------+----------+
|Michel Martinez              |NAME      |
|49 ans                       |AGE       |
|jardinier                    |PROFESSION|
|Centre Hospitalier De Plaisir|LOCATION  |
|Mars 2015                    |DATE      |
+-----------------------------+----------+



Results for `ner_deid_subentity`

In [57]:
result_fr.select(F.explode(F.arrays_zip(result_fr.ner_chunk_subentity.result,
                                        result_fr.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------+----------+
|chunk                        |ner_label |
+-----------------------------+----------+
|Michel Martinez              |PATIENT   |
|49 ans                       |AGE       |
|jardinier                    |PROFESSION|
|Centre Hospitalier De Plaisir|HOSPITAL  |
|Mars 2015                    |DATE      |
+-----------------------------+----------+



## DeIdentification

### Obfuscation mode

In [58]:
# Download the obfuscation reference file that is passed to setObfuscateRefFile below.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_fr.txt
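
Before wiring the file into `DeIdentification`, it can help to check which entity labels it covers. The sketch below assumes each line has the form `value#LABEL`, which is our reading of the library's default separator and not something this notebook shows; the sample lines are hypothetical stand-ins (the real `obfuscate_fr.txt` is not reproduced here), and `parse_ref_lines` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical sample lines in the assumed "value#LABEL" layout.
sample_lines = [
    "Mme Célina Lelièvre#PATIENT",
    "Centre Hospitalier De Lorquin#HOSPITAL",
    "technicien logistique#PROFESSION",
    "",  # blank lines are skipped
]


def parse_ref_lines(lines, sep="#"):
    """Group replacement values by entity label."""
    by_label = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last separator so a value may itself contain it.
        value, found, label = line.rpartition(sep)
        if not found:
            raise ValueError(f"missing separator {sep!r} in line: {line!r}")
        by_label[label.strip()].append(value.strip())
    return dict(by_label)


refs = parse_ref_lines(sample_lines)
print(sorted(refs))
```

To inspect the downloaded file, the same helper could be fed `open("obfuscate_fr.txt", encoding="utf-8")`. An entity label that the NER model emits but the file does not cover is worth checking before relying on the obfuscated output.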

In [59]:
deid_masked_entity = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_fr.txt')\
    .setObfuscateRefSource("file")

In [60]:
nlpPipeline_fr = Pipeline(stages=[
      documentAssembler,
      sentencerDL,
      tokenizer,
      word_embeddings_fr,
      ner_subentity_fr,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_fr = nlpPipeline_fr.fit(empty_data)

In [61]:
deid_lp_fr = LightPipeline(model_fr)

In [62]:
text = "J'ai vu en consultation Michel Martinez (49 ans), jardinier, adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."

In [63]:
pd.set_option("display.max_colwidth", 200)

result_fr = deid_lp_fr.annotate(text)

df_fr = pd.DataFrame(list(zip(result_fr["masked_with_entity"],
                              result_fr["masked_with_chars"],
                              result_fr["masked_fixed_length_chars"],
                              result_fr["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_fr

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"J'ai vu en consultation <PATIENT> (<AGE>), <PROFESSION>, adressé au <HOSPITAL> pour un diabète mal contrôlé avec des symptômes datant de <DATE>.","J'ai vu en consultation [*************] ([****]), [*******], adressé au [***************************] pour un diabète mal contrôlé avec des symptômes datant de [*******].","J'ai vu en consultation **** (****), ****, adressé au **** pour un diabète mal contrôlé avec des symptômes datant de ****.","J'ai vu en consultation Mme Célina Lelièvre (55 ans), technicien logistique, adressé au Centre Hospitalier De Lorquin pour un diabète mal contrôlé avec des symptômes datant de Mars 0715."
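
The three masking columns above differ only in what replaces each detected chunk. The plain-Python sketch below mimics that behaviour so the policies are easy to compare; it is not the library's implementation. The bracket rule for `same_length_chars` (brackets count toward the length, and chunks of one or two characters get asterisks only) is an assumption inferred from the outputs in this notebook, and `mask_chunk`, `mask_text` and `locate` are hypothetical helper names.

```python
def mask_chunk(chunk, label, policy, fixed_len=4):
    """Return the replacement for one detected chunk under a masking policy."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "same_length_chars":
        # Assumption inferred from the outputs: brackets are part of the length.
        if len(chunk) > 2:
            return "[" + "*" * (len(chunk) - 2) + "]"
        return "*" * len(chunk)
    if policy == "fixed_length_chars":
        return "*" * fixed_len
    raise ValueError(f"unknown policy: {policy}")


def mask_text(text, chunks, policy, fixed_len=4):
    """Replace every (begin, end, label) span in text; end is exclusive."""
    parts, cursor = [], 0
    for begin, end, label in sorted(chunks):
        parts.append(text[cursor:begin])
        parts.append(mask_chunk(text[begin:end], label, policy, fixed_len))
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


def locate(text, chunk, label):
    """Build a (begin, end, label) span for the first occurrence of chunk."""
    begin = text.index(chunk)
    return (begin, begin + len(chunk), label)


sample = "Michel Martinez (49 ans), jardinier"
spans = [locate(sample, "Michel Martinez", "PATIENT"),
         locate(sample, "49 ans", "AGE"),
         locate(sample, "jardinier", "PROFESSION")]

for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_text(sample, spans, policy))
```

`same_length_chars` preserves the layout of the note but leaks the length of each value; `fixed_length_chars` hides the length at the cost of shifting character offsets.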


### Faker mode

In [64]:
deid_obfuscated_faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('fr')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [65]:
nlpPipeline_fr = Pipeline(stages=[
      documentAssembler,
      sentencerDL,
      tokenizer,
      word_embeddings_fr,
      ner_subentity_fr,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_fr = nlpPipeline_fr.fit(empty_data)

In [66]:
deid_lp_fr = LightPipeline(model_fr)

In [67]:
text = "J'ai vu en consultation Michel Martinez (49 ans), jardinier, adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."

In [68]:
pd.set_option("display.max_colwidth", 200)

result_fr = deid_lp_fr.annotate(text)

df_fr = pd.DataFrame(list(zip(result_fr["masked_with_entity"],
                              result_fr["masked_with_chars"],
                              result_fr["masked_fixed_length_chars"],
                              result_fr["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_fr

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"J'ai vu en consultation <PATIENT> (<AGE>), <PROFESSION>, adressé au <HOSPITAL> pour un diabète mal contrôlé avec des symptômes datant de <DATE>.","J'ai vu en consultation [*************] ([****]), [*******], adressé au [***************************] pour un diabète mal contrôlé avec des symptômes datant de [*******].","J'ai vu en consultation **** (****), ****, adressé au **** pour un diabète mal contrôlé avec des symptômes datant de ****.","J'ai vu en consultation Tristan Robyns (54 ans), administrateur de logiciels de laboratoire, adressé au Centre Hospitalier De Dax pour un diabète mal contrôlé avec des symptômes datant de mai 2015."


## Pretrained French Deidentification Pipeline

- We developed a clinical deidentification pretrained pipeline that can be used to deidentify PHI in French medical texts. The PHI is masked and obfuscated in the resulting text.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - Sex
    - City
    - Street
    - Country
    - ZIP
    - Username
    - Profession
    - Phone
    - Email
    - Age
    - ID number
    - Medical record number
    - Account number
    - SSN
    - Plate Number
    - IP address
    - URL

In [69]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline_fr = PretrainedPipeline("clinical_deidentification", "fr", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [70]:
text = """COMPTE-RENDU D'HOSPITALISATION
PRENOM : Jean
NOM : Dubois
NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058
ADRESSE : 18 Avenue Matabiau
VILLE : Grenoble
CODE POSTAL : 38000
DATE DE NAISSANCE : 03/03/1946
Âge : 70 ans
Sexe : H
COURRIEL : jdubois@hotmail.fr
DATE D'ADMISSION : 12/12/2016
MÉDÉCIN : Dr Michel Renaud
RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour.
Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale.
L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV.
L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal.
Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml.
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble
COURRIEL : mariebreton@chb.fr
"""

In [71]:
pd.set_option("display.max_colwidth", 100)

result_fr = deid_pipeline_fr.annotate(text)

df_fr = pd.DataFrame(list(zip(result_fr["sentence"],
                              result_fr["masked"],
                              result_fr["masked_with_chars"],
                              result_fr["masked_fixed_length_chars"],
                              result_fr["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_fr

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION
1,PRENOM : Jean,PRENOM : <PATIENT>,PRENOM : [**],PRENOM : ****,PRENOM : Mme Jérôme François De La Dufour
2,NOM : Dubois,NOM : <PATIENT>,NOM : [****],NOM : ****,NOM : Mme Marie Lamy
3,NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058,NUMÉRO DE SÉCURITÉ SOCIALE : <SSN>,NUMÉRO DE SÉCURITÉ SOCIALE : [***********],NUMÉRO DE SÉCURITÉ SOCIALE : ****,NUMÉRO DE SÉCURITÉ SOCIALE : 0431051740163
4,ADRESSE : 18 Avenue Matabiau,ADRESSE : <STREET>,ADRESSE : [****************],ADRESSE : ****,ADRESSE : Boulevard De Vallet
5,VILLE : Grenoble,VILLE : <CITY>,VILLE : [******],VILLE : ****,VILLE : Sainte Martin
6,CODE POSTAL : 38000,CODE POSTAL : <ZIP>,CODE POSTAL : [***],CODE POSTAL : ****,CODE POSTAL : 83111
7,DATE DE NAISSANCE : 03/03/1946,DATE DE NAISSANCE : <DATE>,DATE DE NAISSANCE : [********],DATE DE NAISSANCE : ****,DATE DE NAISSANCE : 03/03/1946
8,Âge : 70 ans,Âge : <AGE>,Âge : [****],Âge : ****,Âge : 75 ans
9,Sexe : H\nCOURRIEL : jdubois@hotmail.fr\nDATE D'ADMISSION : 12/12/2016,Sexe : <SEX>\nCOURRIEL : <E-MAIL>\nDATE D'ADMISSION : <DATE>,Sexe : *\nCOURRIEL : [****************]\nDATE D'ADMISSION : [********],Sexe : ****\nCOURRIEL : ****\nDATE D'ADMISSION : ****,Sexe : FEMME\nCOURRIEL : georgeslemonnier@live.com\nDATE D'ADMISSION : 12/12/2016
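
The table above is built with `zip`, which silently stops at the shortest list. If the output columns ever differed in length, rows would be dropped without any warning. A defensive sketch is shown below; the dictionary is a hand-written stand-in shaped like the result of `annotate()` (values copied from the first rows above), and `rows_from_annotations` is a hypothetical helper name.

```python
def rows_from_annotations(result, columns):
    """Zip per-sentence output columns into rows, refusing unequal lengths."""
    lengths = {col: len(result[col]) for col in columns}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    return [dict(zip(columns, values))
            for values in zip(*(result[col] for col in columns))]


# Stand-in for the dictionary returned by deid_pipeline_fr.annotate(text).
fake_result = {
    "sentence": ["COMPTE-RENDU D'HOSPITALISATION", "PRENOM : Jean", "NOM : Dubois"],
    "masked": ["COMPTE-RENDU D'HOSPITALISATION", "PRENOM : <PATIENT>", "NOM : <PATIENT>"],
}

rows = rows_from_annotations(fake_result, ["sentence", "masked"])
for row in rows:
    print(row["sentence"], "=>", row["masked"])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.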


# DE-IDENTIFICATION FOR ITALIAN

## Italian NER Deidentification Models
We have two different models you can use:
* `ner_deid_generic`, detects 8 entities
* `ner_deid_subentity`, detects 19 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_generic_it_3_0.html)  |it| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_subentity_it_2_4.html)  |it|


Creating pipeline

In [72]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_it = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)
- Sex

In [73]:
ner_generic_it = MedicalNerModel.pretrained("ner_deid_generic", "it", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [74]:
ner_generic_it.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-CONTACT',
 'B-PROFESSION',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-SEX',
 'I-SEX',
 'B-AGE']

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- Username
- Profession
- Phone
- Country
- Age
- Sex
- Email
- ZIP
- Medical Record Number
- Social Security Number
- ID Number
- URL

In [75]:
ner_subentity_it = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [76]:
ner_subentity_it.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-USERNAME',
 'B-PROFESSION',
 'B-URL',
 'I-URL',
 'B-CITY',
 'B-DATE',
 'I-MEDICALRECORD',
 'B-SEX',
 'B-PATIENT',
 'I-SEX',
 'I-DOCTOR',
 'I-CITY',
 'B-SSN',
 'I-DATE',
 'I-SSN',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-IDNUM',
 'B-STREET',
 'I-IDNUM',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-COUNTRY']

### Pipeline

In [77]:
nlpPipeline_it = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_it,
        ner_generic_it,
        ner_converter_generic,
        ner_subentity_it,
        ner_converter_subentity,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_it = nlpPipeline_it.fit(empty_data)

In [78]:
text = "Ho visto Gastone Montanariello (49 anni), virologo, riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."

text_df = spark.createDataFrame([[text]]).toDF("text")
result_it = model_it.transform(text_df)

Results for `ner_deid_generic`

In [79]:
result_it.select(F.explode(F.arrays_zip(result_it.ner_chunk_generic.result,
                                        result_it.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------+----------+
|chunk                |ner_label |
+---------------------+----------+
|Gastone Montanariello|NAME      |
|49                   |AGE       |
|virologo             |PROFESSION|
|Ospedale San Camillo |LOCATION  |
|marzo 2015           |DATE      |
+---------------------+----------+



Results for `ner_deid_subentity`

In [80]:
result_it.select(F.explode(F.arrays_zip(result_it.ner_chunk_subentity.result,
                                        result_it.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------+----------+
|chunk                |ner_label |
+---------------------+----------+
|Gastone Montanariello|PATIENT   |
|49                   |AGE       |
|virologo             |PROFESSION|
|Ospedale San Camillo |HOSPITAL  |
|marzo 2015           |DATE      |
+---------------------+----------+
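
The two result tables label the same chunks at different granularity: `PATIENT` in the subentity model corresponds to `NAME` in the generic one, and `HOSPITAL` to `LOCATION`. The sketch below writes that correspondence down as a lookup table and checks it against the two tables. The mapping is our reading of the entity lists in this notebook, not an official table; `DOCTOR -> NAME` in particular is an assumption, and labels whose generic class is not stated here (for example `COUNTRY`, `USERNAME`, `URL`) are deliberately left unmapped.

```python
# Coarse mapping from subentity labels to generic labels (partly assumed).
SUBENTITY_TO_GENERIC = {
    "PATIENT": "NAME",
    "DOCTOR": "NAME",            # assumption, not shown in the outputs above
    "HOSPITAL": "LOCATION",
    "CITY": "LOCATION",
    "STREET": "LOCATION",
    "ZIP": "LOCATION",
    "ORGANIZATION": "LOCATION",
    "PHONE": "CONTACT",
    "EMAIL": "CONTACT",
    "SSN": "ID",
    "MEDICALRECORD": "ID",
    "AGE": "AGE",
    "DATE": "DATE",
    "PROFESSION": "PROFESSION",
    "SEX": "SEX",
}

# (chunk, label) pairs copied from the two result tables above.
generic = [("Gastone Montanariello", "NAME"), ("49", "AGE"),
           ("virologo", "PROFESSION"), ("Ospedale San Camillo", "LOCATION"),
           ("marzo 2015", "DATE")]
subentity = [("Gastone Montanariello", "PATIENT"), ("49", "AGE"),
             ("virologo", "PROFESSION"), ("Ospedale San Camillo", "HOSPITAL"),
             ("marzo 2015", "DATE")]

# The models agree when every chunk matches and the labels map onto each other.
consistent = all(
    g_chunk == s_chunk and SUBENTITY_TO_GENERIC.get(s_label) == g_label
    for (g_chunk, g_label), (s_chunk, s_label) in zip(generic, subentity)
)
print(consistent)
```

A check like this is a cheap way to spot chunks on which the two models disagree when both are run in the same pipeline.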



## DeIdentification

### Obfuscation mode

In [81]:
# Download the obfuscation reference file that is passed to setObfuscateRefFile below.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_it.txt

In [82]:
deid_masked_entity = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_it.txt')\
    .setObfuscateRefSource("file")

In [83]:
nlpPipeline_it = Pipeline(stages=[
      documentAssembler,
      sentencerDL,
      tokenizer,
      word_embeddings_it,
      ner_subentity_it,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_it = nlpPipeline_it.fit(empty_data)

In [84]:
deid_lp_it = LightPipeline(model_it)

In [85]:
text = "Ho visto Gastone Montanariello (49 anni), virologo, riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."

In [86]:
pd.set_option("display.max_colwidth", 200)

result_it = deid_lp_it.annotate(text)

df_it = pd.DataFrame(list(zip(result_it["masked_with_entity"],
                              result_it["masked_with_chars"],
                              result_it["masked_fixed_length_chars"],
                              result_it["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_it

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Ho visto <PATIENT> (<AGE> anni), <PROFESSION>, riferito all' <HOSPITAL> per diabete mal controllato con sintomi risalenti a <DATE>.","Ho visto [*******************] (** anni), [******], riferito all' [******************] per diabete mal controllato con sintomi risalenti a [********].","Ho visto **** (**** anni), ****, riferito all' **** per diabete mal controllato con sintomi risalenti a ****.","Ho visto Dott. Gemma Vigliotti (43 anni), sanitario, riferito all' Azienda Unita' Sanitaria Locale Roma H per diabete mal controllato con sintomi risalenti a marzo 2815."
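
Note the obfuscated date in the last column: `marzo 2015` became `marzo 2815`, and the French example earlier produced `Mars 0715`. Free-form dates such as month plus year can evidently come out with an implausible year, so it may be worth scanning obfuscated text before it is used downstream. The sketch below is a simple sanity check, not a feature of the library; `implausible_years` is a hypothetical helper, the 1900 to 2100 bounds are arbitrary, and treating every standalone four-digit number as a year is a deliberate simplification.

```python
import re

# A standalone run of exactly four digits (not part of a longer number).
YEAR_PATTERN = re.compile(r"(?<!\d)\d{4}(?!\d)")


def implausible_years(text, earliest=1900, latest=2100):
    """Return four-digit numbers in text that fall outside [earliest, latest]."""
    return [match for match in YEAR_PATTERN.findall(text)
            if not earliest <= int(match) <= latest]


print(implausible_years("con sintomi risalenti a marzo 2815."))
print(implausible_years("avec des symptômes datant de mai 2015."))
```

Sentences flagged this way can be reviewed manually or re-run with a different obfuscation setup.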


### Faker mode

In [87]:
deid_obfuscated_faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('it')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [88]:
nlpPipeline_it = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_it,
        ner_subentity_it,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated_faker
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_it = nlpPipeline_it.fit(empty_data)

In [89]:
deid_lp_it = LightPipeline(model_it)

In [90]:
text = "Ho visto Gastone Montanariello (49 anni), virologo, riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."

In [91]:
pd.set_option("display.max_colwidth", 200)

result_it = deid_lp_it.annotate(text)

df_it = pd.DataFrame(list(zip(result_it["masked_with_entity"],
                              result_it["masked_with_chars"],
                              result_it["masked_fixed_length_chars"],
                              result_it["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_it

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Ho visto <PATIENT> (<AGE> anni), <PROFESSION>, riferito all' <HOSPITAL> per diabete mal controllato con sintomi risalenti a <DATE>.","Ho visto [*******************] (** anni), [******], riferito all' [******************] per diabete mal controllato con sintomi risalenti a [********].","Ho visto **** (**** anni), ****, riferito all' **** per diabete mal controllato con sintomi risalenti a ****.","Ho visto Adell Mora (45 anni), animator, riferito all' Elmore Community Hospital per diabete mal controllato con sintomi risalenti a marzo 2615."


## Pretrained Italian Deidentification Pipeline

- We developed a clinical deidentification pretrained pipeline that can be used to deidentify PHI in Italian medical texts. The PHI is masked and obfuscated in the resulting text.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - Sex
    - City
    - Street
    - Country
    - ZIP
    - Username
    - Profession
    - Phone
    - Email
    - Age
    - ID number
    - Medical record number
    - Account number
    - SSN
    - Plate Number
    - IP address
    - URL

In [92]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline_it = PretrainedPipeline("clinical_deidentification", "it", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [93]:
text = """RAPPORTO DI RICOVERO
NOME: Lodovico Fibonacci
CODICE FISCALE: MVANSK92F09W408A
INDIRIZZO: Viale Burcardo 7
CITTÀ : Napoli
CODICE POSTALE: 80139
DATA DI NASCITA: 03/03/1946
ETÀ: 70 anni
SESSO: M
EMAIL: lpizzo@tim.it
DATA DI AMMISSIONE: 12/12/2016
DOTTORE: Eva Viviani
RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml.

INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli
EMAIL: bferrabosco@poste.it
"""

In [94]:
pd.set_option("display.max_colwidth", None)

result_it = deid_pipeline_it.annotate(text)

df_it = pd.DataFrame(list(zip(result_it["sentence"],
                              result_it["masked"],
                              result_it["masked_with_chars"],
                              result_it["masked_fixed_length_chars"],
                              result_it["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_it

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO
1,NOME: Lodovico Fibonacci,NOME: <PATIENT>,NOME: [****************],NOME: ****,NOME: Cherubini
2,CODICE FISCALE: MVANSK92F09W408A,CODICE FISCALE: <SSN>,CODICE FISCALE: [**************],CODICE FISCALE: ****,CODICE FISCALE: BEXKDJ25Q32N731X
3,INDIRIZZO: Viale Burcardo 7\nCITTÀ : Napoli,INDIRIZZO: <STREET>\nCITTÀ : <CITY>,INDIRIZZO: [**************]\nCITTÀ : [****],INDIRIZZO: ****\nCITTÀ : ****,INDIRIZZO: Canale Adamo 11 Piano 7\nCITTÀ : Lodovico Salentino
4,CODICE POSTALE: 80139\nDATA DI NASCITA: 03/03/1946\nETÀ: 70 anni,CODICE POSTALE: <ZIP>DATA DI NASCITA: <DATE>\nETÀ: <AGE>anni,CODICE POSTALE: [***]DATA DI NASCITA: [********]\nETÀ: **anni,CODICE POSTALE: ****DATA DI NASCITA: ****\nETÀ: ****anni,CODICE POSTALE: 13462DATA DI NASCITA: 03/03/1946\nETÀ: 79anni
5,SESSO: M\nEMAIL: lpizzo@tim.it\nDATA DI AMMISSIONE: 12/12/2016,SESSO: <SEX>\nEMAIL: <E-MAIL>\nDATA DI AMMISSIONE: <DATE>,SESSO: *\nEMAIL: [***********]\nDATA DI AMMISSIONE: [********],SESSO: ****\nEMAIL: ****\nDATA DI AMMISSIONE: ****,SESSO: U\nEMAIL: henrywatson@world.com\nDATA DI AMMISSIONE: 12/12/2016
6,DOTTORE: Eva Viviani,DOTTORE: <DOCTOR>,DOTTORE: [*********],DOTTORE: ****,DOTTORE: Schiavone
7,"RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: <AGE>anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: **anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: ****anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: 79anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno."
8,"È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale."
9,"L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV."
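`zip` stops at the shortest input, so if one output column of `annotate` held fewer entries than `sentence`, rows would be dropped without warning when the DataFrame is built. The following is a minimal sketch of a guard against that. The helper name `rows_from_annotations` is hypothetical, and the dictionary is a hand-written stand-in for a pipeline result, not real model output.

```python
def rows_from_annotations(result, columns):
    """Zip the per-sentence lists of an annotate()-style dict into rows.

    Raises ValueError if the lists differ in length, since zip() would
    otherwise truncate to the shortest one without warning.
    """
    lengths = {col: len(result[col]) for col in columns}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return list(zip(*(result[col] for col in columns)))


# Hand-written stand-in for a pipeline result (not real model output).
sample = {
    "sentence": ["NOME: Lodovico Fibonacci", "ETÀ: 70 anni"],
    "masked": ["NOME: <PATIENT>", "ETÀ: <AGE>anni"],
}
rows = rows_from_annotations(sample, ["sentence", "masked"])
for row in rows:
    print(row)
```

The returned list of tuples can be passed to `pd.DataFrame(rows, columns=[...])` in the same way as the `list(zip(...))` call above.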


# DE-IDENTIFICATION FOR PORTUGUESE

## Portuguese NER Deidentification Models
We have two different models you can use:
* `ner_deid_generic`, detects 8 entities
* `ner_deid_subentity`, detects 15 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_generic_pt_3_0.html)  |pt| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_subentity_pt_3_0.html)  |pt|


Creating pipeline

In [95]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_pt = WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)
- Sex

In [96]:
ner_generic_pt = MedicalNerModel.pretrained("ner_deid_generic", "pt", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [97]:
ner_generic_pt.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-PROFESSION',
 'B-CONTACT',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-SEX',
 'I-SEX',
 'B-AGE']
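The class list mixes `B-` and `I-` tags with the outside tag `O`, so the number of distinct entity types is not obvious at a glance. As a small sketch (the helper `entity_types` is hypothetical, and the list is copied from the output above), the prefixes can be stripped and the remaining labels counted:

```python
def entity_types(classes):
    """Return the sorted entity types behind a list of BIO tags."""
    # split("-", 1) keeps labels that contain a hyphen themselves intact.
    return sorted({tag.split("-", 1)[1] for tag in classes if tag != "O"})


generic_classes = ['O', 'I-LOCATION', 'I-CONTACT', 'I-PROFESSION', 'I-NAME',
                   'I-DATE', 'B-ID', 'B-PROFESSION', 'B-CONTACT', 'I-ID',
                   'B-NAME', 'B-DATE', 'B-LOCATION', 'B-SEX', 'I-SEX', 'B-AGE']

types = entity_types(generic_classes)
print(len(types), types)
```

In a live session the same helper can be called on `ner_generic_pt.getClasses()` directly.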

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `SEX`, `EMAIL`, `ZIP`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`

In [98]:
ner_subentity_pt = MedicalNerModel.pretrained("ner_deid_subentity", "pt", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [99]:
ner_subentity_pt.getClasses()

['O',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-PROFESSION',
 'I-ID',
 'B-CITY',
 'B-DATE',
 'B-PATIENT',
 'B-SEX',
 'I-SEX',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ID',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-STREET',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-COUNTRY']

### Pipeline

In [100]:
nlpPipeline_pt = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_pt,
        ner_generic_pt,
        ner_converter_generic,
        ner_subentity_pt,
        ner_converter_subentity,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_pt = nlpPipeline_pt.fit(empty_data)
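Each stage reads columns that an earlier stage must have written, and a mismatched column name is an easy mistake to make when copying stages between languages. The sketch below checks that wiring in plain Python. The helper `check_wiring` is hypothetical and does not call Spark NLP; the column names are copied by hand from the stages defined above.

```python
def check_wiring(stages, initial=("text",)):
    """Return (stage, missing_column) pairs for inputs no earlier stage produced."""
    available = set(initial)
    problems = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                problems.append((name, col))
        available.add(output)
    return problems


# (stage name, input columns, output column), copied from the cells above.
stages_pt = [
    ("documentAssembler", ["text"], "document"),
    ("sentencerDL", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings_pt", ["document", "token"], "embeddings"),
    ("ner_generic_pt", ["sentence", "token", "embeddings"], "ner_deid_generic"),
    ("ner_converter_generic", ["sentence", "token", "ner_deid_generic"], "ner_chunk_generic"),
    ("ner_subentity_pt", ["sentence", "token", "embeddings"], "ner_deid_subentity"),
    ("ner_converter_subentity", ["sentence", "token", "ner_deid_subentity"], "ner_chunk_subentity"),
]
problems = check_wiring(stages_pt)
print(problems or "all input columns are produced by an earlier stage")
```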

In [101]:
text = """Detalhes do paciente.
Nome do paciente:  Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos
Data de admissão: 17/06/2016.
Doutora: Maria Santos"""

text_df = spark.createDataFrame([[text]]).toDF("text")
result_pt = model_pt.transform(text_df)

Results for `ner_deid_generic`

In [102]:
result_pt.select(F.explode(F.arrays_zip(result_pt.ner_chunk_generic.result,
                                        result_pt.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|Pedro Gonçalves  |NAME     |
|2569870          |ID       |
|Rua Das Flores 23|LOCATION |
|21754-987        |LOCATION |
|10/10/1963       |DATE     |
|53               |AGE      |
|17/06/2016       |DATE     |
|Maria Santos     |NAME     |
+-----------------+---------+
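The `select` above relies on `arrays_zip` to pair each chunk with the metadata entry at the same position and on `explode` to emit one row per pair. A plain-Python sketch of that pairing follows. The two lists are hand-copied stand-ins for one row's arrays, and the metadata dictionaries are reduced to the `entity` key for brevity.

```python
# Hand-copied stand-ins for the two arrays of one row (not a Spark call).
chunks = ["Pedro Gonçalves", "2569870", "Rua Das Flores 23"]
metadata = [{"entity": "NAME"}, {"entity": "ID"}, {"entity": "LOCATION"}]

# arrays_zip pairs elements by position; explode then yields one row per pair.
rows = [(chunk, meta["entity"]) for chunk, meta in zip(chunks, metadata)]
for chunk, label in rows:
    print(f"{chunk:<20}{label}")
```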



Results for `ner_deid_subentity`

In [103]:
result_pt.select(F.explode(F.arrays_zip(result_pt.ner_chunk_subentity.result,
                                        result_pt.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|Pedro Gonçalves  |PATIENT  |
|2569870          |ID       |
|Rua Das Flores 23|STREET   |
|21754-987        |ZIP      |
|10/10/1963       |DATE     |
|53               |AGE      |
|17/06/2016       |DATE     |
|Maria Santos     |DOCTOR   |
+-----------------+---------+
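Comparing the two result tables, every subentity label corresponds to a coarser generic label. The mapping below is a sketch read off the entity descriptions and the two tables in this section; it is an assumption, not something the models expose. Labels missing from the mapping pass through unchanged, which fits `ID`, `DATE`, `AGE`, `SEX` and `PROFESSION`. `COUNTRY` is not covered by the generic description above, so it is left unmapped here.

```python
# Assumed coarse mapping, read off the descriptions and result tables above.
SUBENTITY_TO_GENERIC = {
    "PATIENT": "NAME", "DOCTOR": "NAME",
    "STREET": "LOCATION", "CITY": "LOCATION", "ZIP": "LOCATION",
    "HOSPITAL": "LOCATION", "ORGANIZATION": "LOCATION",
    "PHONE": "CONTACT", "EMAIL": "CONTACT",
}


def to_generic(label):
    """Map a subentity label to its generic counterpart; others pass through."""
    return SUBENTITY_TO_GENERIC.get(label, label)


# Labels in the order they appear in the ner_deid_subentity table.
subentity_labels = ["PATIENT", "ID", "STREET", "ZIP", "DATE", "AGE", "DATE", "DOCTOR"]
generic_labels = [to_generic(label) for label in subentity_labels]
print(generic_labels)
```

For this sample text the mapped labels match the `ner_deid_generic` table row for row.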



## DeIdentification

### Obfuscation mode

In [104]:
# Download the Portuguese obfuscation reference file (replacement values per entity).
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_pt.txt
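It can help to inspect the reference file before handing it to the obfuscator. The sketch below assumes one `replacement#LABEL` pair per line with `#` as the separator; that layout is an assumption and should be checked against the downloaded file. The helper `parse_ref_lines` is hypothetical and the sample lines are illustrative, not the file's actual contents.

```python
from collections import defaultdict


def parse_ref_lines(lines, sep="#"):
    """Group replacement values by entity label from `value<sep>LABEL` lines."""
    replacements = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        value, label = line.rsplit(sep, 1)
        replacements[label.strip()].append(value.strip())
    return dict(replacements)


# Illustrative lines only; the real file may differ (assumed format).
sample_lines = ["Mendes#PATIENT", "Vicente#DOCTOR", "Avenida Morais, 87#STREET", ""]
table = parse_ref_lines(sample_lines)
for label, values in table.items():
    print(label, values)
```

To run it on the downloaded file, pass an open file handle in place of `sample_lines`, for example `parse_ref_lines(open("obfuscate_pt.txt", encoding="utf-8"))`.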

In [105]:
deid_masked_entity = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_pt.txt')\
    .setObfuscateRefSource("file")

In [106]:
nlpPipeline_pt = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_pt,
        ner_subentity_pt,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_pt = nlpPipeline_pt.fit(empty_data)

In [107]:
deid_lp_pt = LightPipeline(model_pt)

In [108]:
text = """Detalhes do paciente.
Nome do paciente: Antonio Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 23 anos
Data de admissão: 17/06/2016.
Doutora: Maria Santos"""

In [109]:
pd.set_option("display.max_colwidth", 200)

result_pt = deid_lp_pt.annotate(text)

df_pt = pd.DataFrame(list(zip(result_pt["sentence"],
                              result_pt["masked_with_entity"],
                              result_pt["masked_with_chars"],
                              result_pt["masked_fixed_length_chars"],
                              result_pt["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_pt

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.
1,Nome do paciente: Antonio Gonçalves,Nome do paciente: <PATIENT>,Nome do paciente: [***************],Nome do paciente: ****,Nome do paciente: Mendes
2,NHC: 2569870.,NHC: <ID>.,NHC: [*****].,NHC: ****.,NHC: 5634701.
3,Endereço: Rua Das Flores 23.\nCódigo Postal: 21754-987.,Endereço: <STREET>.\nCódigo Postal: <ZIP>.,Endereço: [***************].\nCódigo Postal: [*******].,Endereço: ****.\nCódigo Postal: ****.,"Endereço: Avenida Morais, 87.\nCódigo Postal: 58069-470."
4,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.
5,Data de nascimento: 10/10/1963.,Data de nascimento: <DATE>.,Data de nascimento: [********].,Data de nascimento: ****.,Data de nascimento: 26/11/1963.
6,Idade: 23 anos,Idade: <AGE> anos,Idade: ** anos,Idade: **** anos,Idade: 28 anos
7,Data de admissão: 17/06/2016.,Data de admissão: <DATE>.,Data de admissão: [********].,Data de admissão: ****.,Data de admissão: 03/08/2016.
8,\nDoutora: Maria Santos,\nDoutora: <DOCTOR>,\nDoutora: [**********],\nDoutora: ****,\nDoutora: Vicente
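The three masked columns differ only in what replaces each detected chunk. The function below is an illustrative re-implementation that reproduces the replacements visible in the table above. It is not the library's code, and the bracket rule for `same_length_chars` is inferred from the outputs shown in this notebook, so treat it as an assumption.

```python
def mask_chunk(chunk, label, policy, fixed_length=4):
    """Illustrative re-implementation of the three masking policies.

    The bracket rule for same_length_chars is inferred from the outputs in
    this notebook; the behavior for 3- and 4-character chunks is a guess.
    """
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "same_length_chars":
        n = len(chunk)
        # Longer chunks appear as [***] with the brackets counted in the
        # length; 1- and 2-character chunks appear as bare asterisks.
        return "[" + "*" * (n - 2) + "]" if n > 2 else "*" * n
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown policy: {policy}")


for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, mask_chunk("2569870", "ID", policy))
```

Under `same_length_chars` the replacement has the same length as the original chunk, so character offsets in the masked text still line up with the source.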


### Faker mode

In [110]:
deid_obfuscated_faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('pt')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [111]:
nlpPipeline_pt = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_pt,
        ner_subentity_pt,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated_faker
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_pt = nlpPipeline_pt.fit(empty_data)

In [112]:
deid_lp_pt = LightPipeline(model_pt)

In [113]:
pd.set_option("display.max_colwidth", 200)

result_pt = deid_lp_pt.annotate(text)

df_pt = pd.DataFrame(list(zip(result_pt["sentence"],
                              result_pt["masked_with_entity"],
                              result_pt["masked_with_chars"],
                              result_pt["masked_fixed_length_chars"],
                              result_pt["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_pt

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.
1,Nome do paciente: Antonio Gonçalves,Nome do paciente: <PATIENT>,Nome do paciente: [***************],Nome do paciente: ****,Nome do paciente: Juleen Albany
2,NHC: 2569870.,NHC: <ID>.,NHC: [*****].,NHC: ****.,NHC: 9810743.
3,Endereço: Rua Das Flores 23.\nCódigo Postal: 21754-987.,Endereço: <STREET>.\nCódigo Postal: <ZIP>.,Endereço: [***************].\nCódigo Postal: [*******].,Endereço: ****.\nCódigo Postal: ****.,Endereço: Jamesland.\nCódigo Postal: 96485-074.
4,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.
5,Data de nascimento: 10/10/1963.,Data de nascimento: <DATE>.,Data de nascimento: [********].,Data de nascimento: ****.,Data de nascimento: 13/11/1963.
6,Idade: 23 anos,Idade: <AGE> anos,Idade: ** anos,Idade: **** anos,Idade: 30 anos
7,Data de admissão: 17/06/2016.,Data de admissão: <DATE>.,Data de admissão: [********].,Data de admissão: ****.,Data de admissão: 21/07/2016.
8,\nDoutora: Maria Santos,\nDoutora: <DOCTOR>,\nDoutora: [**********],\nDoutora: ****,\nDoutora: Lilyan General
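In the table above both dates move forward by the same number of days, which is what one would expect if date obfuscation applies a single offset per document. That reading is an inference from this output, not a documented guarantee. The sketch below measures the offset and re-applies it with the standard library, assuming the `dd/mm/yyyy` format used in the sample text. Both helper names are hypothetical.

```python
from datetime import datetime, timedelta


def offset_days(original, obfuscated, fmt="%d/%m/%Y"):
    """Number of days between an original date and its obfuscated form."""
    delta = datetime.strptime(obfuscated, fmt) - datetime.strptime(original, fmt)
    return delta.days


def shift_date(text, days, fmt="%d/%m/%Y"):
    """Shift a date string by a fixed number of days, keeping its format."""
    return (datetime.strptime(text, fmt) + timedelta(days=days)).strftime(fmt)


# (original, obfuscated) pairs copied from the Faker-mode table above.
pairs = [("10/10/1963", "13/11/1963"), ("17/06/2016", "21/07/2016")]
offsets = [offset_days(original, obfuscated) for original, obfuscated in pairs]
print(offsets)
```

A shared offset keeps the interval between birth date and admission date intact, which matters when downstream analysis depends on elapsed time.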


## Pretrained Portuguese Deidentification Pipeline

- We developed a pretrained clinical deidentification pipeline that deidentifies protected health information (PHI) in Portuguese medical texts. It returns the text in both masked and obfuscated forms.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - Sex
    - City
    - Street
    - Country
    - ZIP
    - Username
    - Profession
    - Phone
    - Email
    - Age
    - ID number
    - Medical record number
    - Account number
    - SSN
    - Plate Number
    - IP address
    - URL

In [114]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline_pt = PretrainedPipeline("clinical_deidentification", "pt", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [115]:
text = """RELAÇÃO HOSPITALAR
NOME: Pedro Gonçalves
NHC: MVANSK92F09W408A
ENDEREÇO: Rua Burcardo 7
CÓDIGO POSTAL: 80139
DATA DE NASCIMENTO: 03/03/1946
IDADE: 70 anos
SEXO: Homens
E-MAIL: pgon21@tim.pt
DATA DE ADMISSÃO: 12/12/2016
DOUTORA: Eva Andrade
RELATO CLÍNICO: 70 anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.
Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.
O exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.
A urinálise mostrou 4 hemácias/campo e 0-5 leucócitos/campo; o resto do sedimento era normal.
O hemograma é normal; a bioquímica mostrou uma glicemia de 169 mg/dl e triglicerídeos 456 mg/dl; função hepática e renal são normais. PSA de 1,16 ng/ml.

DIRIGIDA A: Dr. Eva Andrade - Centro Hospitalar do Medio Ave - Avenida Dos Aliados, 56
E-MAIL: evandrade@poste.pt
"""

In [116]:
pd.set_option("display.max_colwidth", None)

result_pt = deid_pipeline_pt.annotate(text)

df_pt = pd.DataFrame(list(zip(result_pt["sentence"],
                              result_pt["masked"],
                              result_pt["masked_with_chars"],
                              result_pt["masked_fixed_length_chars"],
                              result_pt["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_pt

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,RELAÇÃO HOSPITALAR\nNOME: Pedro Gonçalves,RELAÇÃO HOSPITALAR\nNOME: <DOCTOR>,RELAÇÃO HOSPITALAR\nNOME: [*************],RELAÇÃO HOSPITALAR\nNOME: ****,RELAÇÃO HOSPITALAR\nNOME: Vasco Soares
1,NHC: MVANSK92F09W408A,NHC: <ID>,NHC: [**************],NHC: ****,NHC: VONAZL67M36T139N
2,ENDEREÇO: Rua Burcardo 7,ENDEREÇO: <STREET>,ENDEREÇO: [************],ENDEREÇO: ****,"ENDEREÇO: Rua Augusta, 19"
3,CÓDIGO POSTAL: 80139\nDATA DE NASCIMENTO: 03/03/1946,CÓDIGO POSTAL: <ZIP>\nDATA DE NASCIMENTO: <DATE>,CÓDIGO POSTAL: [***]\nDATA DE NASCIMENTO: [********],CÓDIGO POSTAL: ****\nDATA DE NASCIMENTO: ****,CÓDIGO POSTAL: 93046\nDATA DE NASCIMENTO: 03/03/1946
4,IDADE: 70 anos,IDADE: <AGE> anos,IDADE: ** anos,IDADE: **** anos,IDADE: 77 anos
5,SEXO: Homens,SEXO: <SEX>,SEXO: [****],SEXO: ****,SEXO: Mulher
6,E-MAIL: pgon21@tim.pt\nDATA DE ADMISSÃO: 12/12/2016,E-MAIL: <EMAIL>\nDATA DE ADMISSÃO: <DATE>,E-MAIL: [***********]\nDATA DE ADMISSÃO: [********],E-MAIL: ****\nDATA DE ADMISSÃO: ****,E-MAIL: richard@yahoo.pt\nDATA DE ADMISSÃO: 12/12/2016
7,DOUTORA: Eva Andrade,DOUTORA: <DOCTOR>,DOUTORA: [*********],DOUTORA: ****,DOUTORA: Eva Coutinho
8,"RELATO CLÍNICO: 70 anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: <AGE> anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: ** anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: **** anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: 77 anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia."
9,"Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV."


# DE-IDENTIFICATION FOR ROMANIAN


## Romanian NER Deidentification Models
We have four different models you can use:
* `ner_deid_generic`, detects 7 entities
* `ner_deid_generic_bert`
* `ner_deid_subentity`, detects 17 entities
* `ner_deid_subentity_bert`, detects 17 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_ro_3_0.html)  |ro| 3| [ner_deid_generic](https://nlp.johnsnowlabs.com/models)  |ro|
| 2| [ner_deid_subentity_bert](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_bert_ro_3_0.html)  |ro| 4| [ner_deid_generic_bert](https://nlp.johnsnowlabs.com/models)  |ro|


Creating pipeline

In [117]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_ro = WordEmbeddingsModel.pretrained("w2v_cc_300d", "ro")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("word_embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)

In [118]:
ner_generic_ro = MedicalNerModel.pretrained("ner_deid_generic", "ro", "clinical/models")\
    .setInputCols(["sentence","token","word_embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [119]:
ner_generic_ro.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-CONTACT',
 'B-PROFESSION',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-AGE',
 'I-AGE']

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `EMAIL`, `ZIP`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `FAX`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`


In [120]:
ner_subentity_ro = MedicalNerModel.pretrained("ner_deid_subentity", "ro", "clinical/models")\
    .setInputCols(["sentence","token","word_embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
Approximate size to download 14.4 MB
[OK!]


In [121]:
ner_subentity_ro.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-PROFESSION',
 'I-LOCATION-OTHER',
 'B-CITY',
 'B-DATE',
 'B-LOCATION-OTHER',
 'B-PATIENT',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-IDNUM',
 'B-STREET',
 'B-FAX',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-FAX',
 'I-AGE',
 'I-COUNTRY']

### Pipeline

In [122]:
nlpPipeline_ro = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_ro,
        ner_generic_ro,
        ner_converter_generic,
        ner_subentity_ro,
        ner_converter_subentity,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ro = nlpPipeline_ro.fit(empty_data)

In [123]:
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Timar
C.N.P : 2450502264401"""

text_df = spark.createDataFrame([[text]]).toDF("text")
result_ro = model_ro.transform(text_df)

Results for `ner_deid_generic`

In [124]:
result_ro.select(F.explode(F.arrays_zip(result_ro.ner_chunk_generic.result, result_ro.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr. 972        |LOCATION |
|Vaslui                      |LOCATION |
|737405 România              |LOCATION |
|+40(235)413773              |CONTACT  |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |NAME     |
|77                          |AGE      |
|Agota Evelyn Timar          |NAME     |
|2450502264401               |ID       |
+----------------------------+---------+



Results for `ner_deid_subentity`

In [125]:
result_ro.select(F.explode(F.arrays_zip(result_ro.ner_chunk_subentity.result,
                                        result_ro.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|HOSPITAL |
|Drumul Oprea Nr. 972        |STREET   |
|Vaslui                      |CITY     |
|737405                      |ZIP      |
|+40(235)413773              |PHONE    |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |PATIENT  |
|77                          |AGE      |
|Agota Evelyn Timar          |DOCTOR   |
|2450502264401               |IDNUM    |
+----------------------------+---------+
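The two models do not always draw the same chunk boundaries. In the tables above, `ner_deid_generic` tags `737405 România` as a single `LOCATION`, while `ner_deid_subentity` tags only `737405` as `ZIP`. The sketch below finds such differences with set operations; the dictionaries are copied by hand from the two tables.

```python
# chunk -> label, copied from the two result tables above.
generic = {
    "Spitalul Pentru Ochi de Deal": "LOCATION", "Drumul Oprea Nr. 972": "LOCATION",
    "Vaslui": "LOCATION", "737405 România": "LOCATION", "+40(235)413773": "CONTACT",
    "25 May 2022": "DATE", "BUREAN MARIA": "NAME", "77": "AGE",
    "Agota Evelyn Timar": "NAME", "2450502264401": "ID",
}
subentity = {
    "Spitalul Pentru Ochi de Deal": "HOSPITAL", "Drumul Oprea Nr. 972": "STREET",
    "Vaslui": "CITY", "737405": "ZIP", "+40(235)413773": "PHONE",
    "25 May 2022": "DATE", "BUREAN MARIA": "PATIENT", "77": "AGE",
    "Agota Evelyn Timar": "DOCTOR", "2450502264401": "IDNUM",
}

# Chunks whose exact text was found by only one of the two models.
only_generic = sorted(set(generic) - set(subentity))
only_subentity = sorted(set(subentity) - set(generic))
print("only generic:  ", only_generic)
print("only subentity:", only_subentity)
```

Because the deidentification stages in this notebook consume `ner_chunk_subentity`, text that only the generic model covers is left untouched by them.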



## DeIdentification

### Obfuscation mode

In [126]:
# Download the Romanian obfuscation reference file (replacement values per entity).
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_ro.txt

In [127]:
deid_masked_entity = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_ro.txt')\
    .setObfuscateRefSource("file")
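The three masking policies differ only in what replaces each detected chunk. The following plain-Python sketch mimics them outside Spark, under stated assumptions: `mask_chunks` is a hypothetical helper, chunks are given as `(begin, end_exclusive, label)` tuples, and the bracket rule for `same_length_chars` is inferred from the outputs shown in this notebook rather than from the library source.

```python
def mask_chunks(text, chunks, policy="entity_labels", fixed_length=4):
    """Replace each (begin, end_exclusive, label) span according to the policy."""
    pieces = []
    cursor = 0
    for begin, end, label in sorted(chunks):
        pieces.append(text[cursor:begin])  # untouched text before the chunk
        span_len = end - begin
        if policy == "entity_labels":
            pieces.append(f"<{label}>")
        elif policy == "same_length_chars":
            # Assumption: brackets count toward the length when the chunk is
            # longer than two characters; shorter chunks get asterisks only.
            if span_len > 2:
                pieces.append("[" + "*" * (span_len - 2) + "]")
            else:
                pieces.append("*" * span_len)
        elif policy == "fixed_length_chars":
            pieces.append("*" * fixed_length)
        else:
            raise ValueError(f"unknown policy: {policy}")
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last chunk
    return "".join(pieces)


sentence = "Tel: +40(235)413773"
phone_chunk = [(5, 19, "PHONE")]
for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_chunks(sentence, phone_chunk, policy))
```

`same_length_chars` preserves the character offsets of the surrounding text, which helps when downstream tools rely on positions; `fixed_length_chars` hides the length of the original value as well.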

In [128]:
nlpPipeline_ro = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_ro,
        ner_subentity_ro,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ro = nlpPipeline_ro.fit(empty_data)

In [129]:
deid_lp_ro = LightPipeline(model_ro)

In [130]:
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Timar
C.N.P : 2450502264401"""

In [131]:
pd.set_option("display.max_colwidth", 200)

result_ro = deid_lp_ro.annotate(text)

df_ro = pd.DataFrame(list(zip(result_ro["sentence"],
                              result_ro["masked_with_entity"],
                              result_ro["masked_with_chars"],
                              result_ro["masked_fixed_length_chars"],
                              result_ro["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ro

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România","<HOSPITAL>, <STREET> <CITY>, <ZIP> România","[**************************], [******************] [****], [****] România","****, **** ****, **** România","Centrul De Evaluare Si Tratament A Toxicodependentelor Primaria Municipiului Bucuresti, Aleea Voinea Anina, 868972 România"
1,Tel: +40(235)413773,Tel: <PHONE>,Tel: [************],Tel: ****,Tel: +97(362)906886
2,Data setului de analize: 25 May 2022,Data setului de analize: <DATE>,Data setului de analize: [*********],Data setului de analize: ****,Data setului de analize: 27 June 2022
3,"Nume si Prenume : BUREAN MARIA, Varsta: 77\nMedic : Agota Evelyn Timar","Nume si Prenume : <PATIENT>, Varsta: <AGE>\nMedic : <DOCTOR>","Nume si Prenume : [**********], Varsta: **\nMedic : [****************]","Nume si Prenume : ****, Varsta: ****\nMedic : ****","Nume si Prenume : DRAGULEASA DORINA, Varsta: 66\nMedic : Eftimie, Sinică"
4,C.N.P : 2450502264401,C.N.P : <IDNUM>,C.N.P : [***********],C.N.P : ****,C.N.P : 3927273359970
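A simple sanity check after obfuscation is to confirm that none of the original PHI chunks survive verbatim in the output. The sketch below uses values copied from the table above; `find_leaks` is a hypothetical helper and not part of Spark NLP.

```python
def find_leaks(chunks, deidentified_sentences):
    """Return the original PHI chunks that still appear verbatim in the output."""
    joined = "\n".join(deidentified_sentences)
    return [chunk for chunk in chunks if chunk in joined]


phi_chunks = ["BUREAN MARIA", "Agota Evelyn Timar", "2450502264401", "+40(235)413773"]
obfuscated = [
    "Tel: +97(362)906886",
    "Nume si Prenume : DRAGULEASA DORINA, Varsta: 66\nMedic : Eftimie, Sinică",
    "C.N.P : 3927273359970",
]
print(find_leaks(phi_chunks, obfuscated))
```

Very short chunks such as an age of `77` can match unrelated digits in the output, so a substring check like this is best limited to longer identifiers.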


### Faker mode

In [132]:
deid_obfuscated_faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('ro')\
    .setObfuscateDate(True)\
    .setIgnoreRegex(True)\
    .setObfuscateRefSource('faker')

In [133]:
nlpPipeline_ro = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_ro,
        ner_subentity_ro,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated_faker
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ro = nlpPipeline_ro.fit(empty_data)

In [134]:
deid_lp_ro = LightPipeline(model_ro)

In [135]:
pd.set_option("display.max_colwidth", 200)

result_ro = deid_lp_ro.annotate(text)

df_ro = pd.DataFrame(list(zip(result_ro["sentence"],
                              result_ro["masked_with_entity"],
                              result_ro["masked_with_chars"],
                              result_ro["masked_fixed_length_chars"],
                              result_ro["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ro

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România","<HOSPITAL>, <STREET> <CITY>, <ZIP> România","[**************************], [******************] [****], [****] România","****, **** ****, **** România","Sf. Spiridon Town Hospital, Principesa Elena Parcul Bacău, 202316 România"
1,Tel: +40(235)413773,Tel: <PHONE>,Tel: [************],Tel: ****,Tel: +31(706)340220
2,Data setului de analize: 25 May 2022,Data setului de analize: <DATE>,Data setului de analize: [*********],Data setului de analize: ****,Data setului de analize: 29 May 2022
3,"Nume si Prenume : BUREAN MARIA, Varsta: 77\nMedic : Agota Evelyn Timar","Nume si Prenume : <PATIENT>, Varsta: <AGE>\nMedic : <DOCTOR>","Nume si Prenume : [**********], Varsta: **\nMedic : [****************]","Nume si Prenume : ****, Varsta: ****\nMedic : ****","Nume si Prenume : MIRU CHIHAIA, Varsta: 71\nMedic : Roberta Adela Focșeneanu"
4,C.N.P : 2450502264401,C.N.P : <IDNUM>,C.N.P : [***********],C.N.P : ****,C.N.P : 7361617793314
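With `setObfuscateDate(True)`, dates are replaced with shifted dates in the same textual format: `25 May 2022` became `27 June 2022` in the file-based run and `29 May 2022` in the faker run. The sketch below reproduces that effect with the standard library. How `DeIdentification` actually chooses the shift is not shown in this notebook; the offsets used here are simply the ones consistent with the displayed outputs, and `shift_date` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def shift_date(date_text, days, fmt="%d %B %Y"):
    """Shift a date string by a number of days, keeping its format."""
    parsed = datetime.strptime(date_text, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)


print(shift_date("25 May 2022", 4))
print(shift_date("25 May 2022", 33))
```

Applying one consistent shift per document keeps the intervals between dates intact, which matters when the time between clinical events carries meaning.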


## Pretrained Romanian Deidentification Pipeline

- We developed a pretrained clinical deidentification pipeline that can be used to deidentify PHI in Romanian medical texts. The pipeline returns both masked and obfuscated versions of the input text.
- The pipeline can mask and obfuscate:
  - AGE,
  - CITY,
  - COUNTRY,
  - DATE,
  - DOCTOR,
  - EMAIL,
  - FAX,
  - HOSPITAL,
  - IDNUM,
  - LOCATION-OTHER,
  - MEDICALRECORD,
  - ORGANIZATION,
  - PATIENT,
  - PHONE,
  - PROFESSION,
  - STREET,
  - ZIP,
  - ACCOUNT,
  - LICENSE,
  - PLATE

In [136]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline_ro = PretrainedPipeline("clinical_deidentification", "ro", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.1 GB
[OK!]


In [137]:
text = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022
Varsta : 77, Nume si Prenume : BUREAN MARIA
Tel: +40(235)413773, E-mail : hale@gmail.com,
Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999,
Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """

The results can also be inspected vertically by building a pandas DataFrame:

In [138]:
pd.set_option("display.max_colwidth", None)

result_ro = deid_pipeline_ro.annotate(text)

df_ro = pd.DataFrame(list(zip(result_ro["sentence"],
                              result_ro["masked"],
                              result_ro["masked_with_chars"],
                              result_ro["masked_fixed_length_chars"],
                              result_ro["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ro

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022","Medic : Dr. <DOCTOR>, C.N.P : <IDNUM>, Data setului de analize: <DATE>","Medic : Dr. [**********], C.N.P : [***********], Data setului de analize: [*********]","Medic : Dr. ****, C.N.P : ****, Data setului de analize: ****","Medic : Dr. R.t., C.N.P : 1527271195574, Data setului de analize: 25 May 2022"
1,"Varsta : 77, Nume si Prenume : BUREAN MARIA","Varsta : <AGE>, Nume si Prenume : <PATIENT>","Varsta : **, Nume si Prenume : [**********]","Varsta : ****, Nume si Prenume : ****","Varsta : 68, Nume si Prenume : DRAGAN MIHAI"
2,"Tel: +40(235)413773, E-mail : hale@gmail.com,\nLicență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999,\nSpitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405","Tel: <PHONE>, E-mail : <EMAIL>,\nLicență : <LICENSE>, Înmatriculare : <PLATE>, Cont : <ACCOUNT>,\n<HOSPITAL> <STREET> <CITY>, <ZIP>","Tel: [************], E-mail : [************],\nLicență : [*********], Înmatriculare : [******], Cont : [******************],\n[**************************] [******************] [****], [****]","Tel: ****, E-mail : ****,\nLicență : ****, Înmatriculare : ****, Cont : ****,\n**** **** ****, ****","Tel: +57(182)548668, E-mail : tudorsmaranda@kappa.ro,\nLicență : C775129032F, Înmatriculare : HM172448, Cont : KHHO5029180812813651,\nCentrul Medical De Evaluare Si Recuperare Pentru Copii Si Tineri Cristian Serban Buzias Aleea Voinea Aiud, 686572"


# DE-IDENTIFICATION FOR ARABIC

## Arabic NER Deidentification Models
We have six different models you can use:
* `ner_deid_generic`, detects 8 entities
* `ner_deid_subentity`, detects 17 entities
* `ner_deid_subentity_arabert`, detects 17 entities
* `ner_deid_generic_arabert`, detects 8 entities
* `ner_deid_subentity_camelbert`, detects 17 entities
* `ner_deid_generic_camelbert`, detects 8 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [`ner_deid_subentity`](https://nlp.johnsnowlabs.com/2023/05/31/ner_deid_subentity_ar.html) |ar| 2| [`ner_deid_generic`](https://nlp.johnsnowlabs.com/2023/05/30/ner_deid_generic_ar.html) |ar|
| 3| [`ner_deid_subentity_arabert`](https://nlp.johnsnowlabs.com/2023/09/16/ner_deid_subentity_arabert_en.html) |ar| 4| [`ner_deid_generic_arabert`](https://nlp.johnsnowlabs.com/2023/09/16/ner_deid_generic_arabert_en.html) |ar|
| 5| [`ner_deid_subentity_camelbert`](https://nlp.johnsnowlabs.com/2023/09/22/ner_deid_subentity_camelbert_en.html) |ar| 6| [`ner_deid_generic_camelbert`](https://nlp.johnsnowlabs.com/2023/09/16/ner_deid_generic_camelbert_en.html) |ar|

Creating pipeline

In [139]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_ar = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d","ar")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
arabic_w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, FAX numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Employment information)
- Id (Social Security numbers, Medical record numbers, Internet protocol addresses)
- Sex



In [140]:
ner_generic_ar = MedicalNerModel.pretrained("ner_deid_generic", "ar", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
Approximate size to download 14.3 MB
[OK!]
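The NER model emits one IOB tag per token, and the converter stage groups consecutive `B-`/`I-` tags into chunks. The sketch below illustrates that grouping in plain Python. It is a simplification: `iob_to_chunks` is a hypothetical helper, the tags are made up for illustration, and tokens are joined with spaces, whereas the real converter works on character offsets.

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if tag == "O" or starts_new:
            # Close the chunk that was being built, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if tag != "O":
            current_tokens.append(token)
            current_label = label
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Illustrative tokens and tags (not actual model output).
tokens = ["اسم", "المريض", ":", "فاطمة", "علي"]
tags = ["O", "O", "O", "B-NAME", "I-NAME"]
print(iob_to_chunks(tokens, tags))
```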


In [141]:
ner_generic_ar.getClasses()

['O',
 'I-LOCATION',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-PROFESSION',
 'B-CONTACT',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-SEX',
 'I-SEX',
 'B-AGE',
 'I-AGE']
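The class list mixes `B-` and `I-` prefixes with the `O` tag. Stripping the prefixes gives the set of distinct entity types, which matches the 8 entities stated for `ner_deid_generic`. A short sketch using the list printed above:

```python
classes = ['O', 'I-LOCATION', 'I-PROFESSION', 'I-NAME', 'I-DATE', 'B-ID',
           'B-PROFESSION', 'B-CONTACT', 'B-NAME', 'B-DATE', 'B-LOCATION',
           'B-SEX', 'I-SEX', 'B-AGE', 'I-AGE']

# Drop the outside tag and the B-/I- prefixes, then de-duplicate.
entities = sorted({c.split("-", 1)[1] for c in classes if c != "O"})
print(len(entities), entities)
# → 8 ['AGE', 'CONTACT', 'DATE', 'ID', 'LOCATION', 'NAME', 'PROFESSION', 'SEX']
```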

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- User Name
- Profession
- Phone
- Country
- Age

In [142]:
ner_subentity_ar = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
Approximate size to download 14.3 MB
[OK!]


In [143]:
ner_subentity_ar.getClasses()

### Pipeline

In [144]:
nlpPipeline_ar = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_ar,
        ner_generic_ar,
        ner_converter_generic,
        ner_subentity_ar,
        ner_converter_subentity,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ar = nlpPipeline_ar.fit(empty_data)

In [145]:
text_ar = """ملاحظات سريرية - مريض السكري. التاريخ: 11 مايو 1999. اسم المريض: فاطمة علي. العنوان: شارع الحرية، مبنى رقم 456، حي السلام، القاهرة. الرمز البريدي: 67890. البلد: مصر. اسم المستشفى: مستشفى الشفاء. اسم الطبيب: د. محمد صلاح. تفاصيل الحالة: المريضة فاطمة علي، البالغة من العمر 42 عامًا، مصابة بمرض السكري من النوع 2. تشكو من زيادة في العطش والجوع المفرط والتبول المتكرر. تم تشخيصها بمرض السكري بعد فحص شامل وفحوصات مخبرية. الخطة: تم وصف دواء فموي لخفض مستوى السكر في الدم. يجب على المريضة مراجعة الطبيب بانتظام وإجراء اختبارات السكر في الدم بانتظام. يتعين على المريضة اتباع نظام غذائي صحي ومتوازن، يشمل الحد من استهلاك السكريات والنشويات. يجب مراقبة ضغط الدم والكولسترول أيضًا ومراعاة التعليمات الطبية المتعلقة بتلك الحالات. تعليم المريضة بشأن أعراض الارتفاع أو الانخفاض الحاد في مستوى السكر في الدم وكيفية التعامل معها."""

text_df_ar = spark.createDataFrame([[text_ar]]).toDF("text")
result_ar = model_ar.transform(text_df_ar)

Results for `ner_deid_subentity`

In [146]:
result_ar.select(F.explode(F.arrays_zip(result_ar.ner_chunk_subentity.result,
                                        result_ar.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+------------+---------+
|chunk       |ner_label|
+------------+---------+
|11 مايو 1999|DATE     |
|فاطمة علي   |DOCTOR   |
|456،        |ZIP      |
|القاهرة     |CITY     |
|67890       |ZIP      |
|مصر         |COUNTRY  |
|محمد صلاح   |DOCTOR   |
|42          |AGE      |
+------------+---------+



Results for `ner_deid_generic`

In [147]:
result_ar.select(F.explode(F.arrays_zip(result_ar.ner_chunk_generic.result,
                                        result_ar.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+------------+---------+
|chunk       |ner_label|
+------------+---------+
|11 مايو 1999|DATE     |
|فاطمة علي   |NAME     |
|شارع الحرية،|LOCATION |
|القاهرة     |LOCATION |
|67890       |LOCATION |
|مصر         |LOCATION |
|محمد صلاح   |NAME     |
+------------+---------+
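The two result tables can be compared by projecting the subentity labels onto the generic label set. The sketch below does this for the chunks printed above. The `SUB_TO_GENERIC` mapping and the `compare` helper are assumptions made for illustration; they are not part of the library.

```python
# Assumed mapping from subentity labels to generic labels.
SUB_TO_GENERIC = {
    "PATIENT": "NAME", "DOCTOR": "NAME",
    "CITY": "LOCATION", "ZIP": "LOCATION", "COUNTRY": "LOCATION",
    "STREET": "LOCATION", "HOSPITAL": "LOCATION",
    "DATE": "DATE", "AGE": "AGE",
}

subentity = [("11 مايو 1999", "DATE"), ("فاطمة علي", "DOCTOR"), ("456،", "ZIP"),
             ("القاهرة", "CITY"), ("67890", "ZIP"), ("مصر", "COUNTRY"),
             ("محمد صلاح", "DOCTOR"), ("42", "AGE")]
generic = [("11 مايو 1999", "DATE"), ("فاطمة علي", "NAME"), ("شارع الحرية،", "LOCATION"),
           ("القاهرة", "LOCATION"), ("67890", "LOCATION"), ("مصر", "LOCATION"),
           ("محمد صلاح", "NAME")]


def compare(subentity, generic, mapping):
    """Return chunks found by only one model, and chunks with conflicting labels."""
    sub = {chunk: mapping.get(label, label) for chunk, label in subentity}
    gen = dict(generic)
    only_sub = sorted(set(sub) - set(gen))
    only_gen = sorted(set(gen) - set(sub))
    conflicts = sorted(c for c in set(sub) & set(gen) if sub[c] != gen[c])
    return only_sub, only_gen, conflicts


only_sub, only_gen, conflicts = compare(subentity, generic, SUB_TO_GENERIC)
print("only subentity:", only_sub)
print("only generic:", only_gen)
print("conflicts:", conflicts)
```

Because the comparison works at generic granularity, it does not flag that the subentity model tagged the patient name `فاطمة علي` as `DOCTOR`; catching that kind of error requires checking the fine-grained labels against the source text.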



## DeIdentification

### Obfuscation mode

In [148]:
# Download the reference file with the replacement values used for file-based obfuscation.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_ar.txt

In [149]:
deid_masked_entity = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setLanguage('ar')\
    .setMaskingPolicy("entity_labels")

deid_masked_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setLanguage('ar')\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setLanguage('ar')\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('ar')\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_ar.txt')\
    .setObfuscateRefSource("file")

In [150]:
nlpPipeline_ar = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_ar,
        ner_subentity_ar,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ar = nlpPipeline_ar.fit(empty_data)

In [151]:
deid_lp_ar = LightPipeline(model_ar)

In [152]:
text = """
الملاحظات السريرية - مريض السكري
التاريخ: 11 مايو 1999
اسم المريض: فاطمة علي
العنوان: شارع الحرية ، حي السلام ، القاهرة
دولة: مصر
اسم المستشفى: مستشفى الشفاء
اسم الطبيب: د.محمد صلاح
"""

In [153]:
pd.set_option("display.max_colwidth", 200)

result_ar = deid_lp_ar.annotate(text)

df_ar = pd.DataFrame(list(zip(result_ar["sentence"],
                              result_ar["masked_with_entity"],
                              result_ar["masked_with_chars"],
                              result_ar["masked_fixed_length_chars"],
                              result_ar["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ar

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري
1,التاريخ: 11 مايو 1999,التاريخ: [تاريخ],التاريخ: [٭٭٭٭٭٭٭٭٭٭],التاريخ: ٭٭٭٭,التاريخ: 15 مايو 1999
2,اسم المريض: فاطمة علي,اسم المريض: [دكتور],اسم المريض: [٭٭٭٭٭٭٭],اسم المريض: ٭٭٭٭,اسم المريض: أمل
3,العنوان: شارع الحرية ، حي السلام ، القاهرة,العنوان: شارع الحرية ، حي [المدينة] ، [المدينة],العنوان: شارع الحرية ، حي [٭٭٭٭] ، [٭٭٭٭٭],العنوان: شارع الحرية ، حي ٭٭٭٭ ، ٭٭٭٭,العنوان: شارع الحرية ، حي أريانة الشرقية ، أبها
4,دولة: مصر,[البلد]: [البلد],[٭٭]: [٭],٭٭٭٭: ٭٭٭٭,دولة: مصر
5,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء
6,اسم الطبيب: د.محمد صلاح,اسم الطبيب: [دكتور],اسم الطبيب: [٭٭٭٭٭٭٭٭٭],اسم الطبيب: ٭٭٭٭,اسم الطبيب: أسيل


### Faker mode

In [154]:
deid_obfuscated_faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('ar')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [155]:
nlpPipeline_ar = Pipeline(
    stages=[
        documentAssembler,
        sentencerDL,
        tokenizer,
        word_embeddings_ar,
        ner_subentity_ar,
        ner_converter_subentity,
        deid_masked_entity,
        deid_masked_char,
        deid_masked_fixed_char,
        deid_obfuscated_faker
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ar = nlpPipeline_ar.fit(empty_data)

In [156]:
deid_lp_ar = LightPipeline(model_ar)

In [157]:
pd.set_option("display.max_colwidth", 200)

result_ar = deid_lp_ar.annotate(text)

df_ar = pd.DataFrame(list(zip(result_ar["sentence"],
                              result_ar["masked_with_entity"],
                              result_ar["masked_with_chars"],
                              result_ar["masked_fixed_length_chars"],
                              result_ar["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ar

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري,الملاحظات السريرية - مريض السكري
1,التاريخ: 11 مايو 1999,التاريخ: [تاريخ],التاريخ: [٭٭٭٭٭٭٭٭٭٭],التاريخ: ٭٭٭٭,التاريخ: 24 يونيو 1999
2,اسم المريض: فاطمة علي,اسم المريض: [دكتور],اسم المريض: [٭٭٭٭٭٭٭],اسم المريض: ٭٭٭٭,اسم المريض: صفية روان
3,العنوان: شارع الحرية ، حي السلام ، القاهرة,العنوان: شارع الحرية ، حي [المدينة] ، [المدينة],العنوان: شارع الحرية ، حي [٭٭٭٭] ، [٭٭٭٭٭],العنوان: شارع الحرية ، حي ٭٭٭٭ ، ٭٭٭٭,العنوان: شارع الحرية ، حي الساحلين ، الجيزة
4,دولة: مصر,[البلد]: [البلد],[٭٭]: [٭],٭٭٭٭: ٭٭٭٭,دولة: مصر
5,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء,اسم المستشفى: مستشفى الشفاء
6,اسم الطبيب: د.محمد صلاح,اسم الطبيب: [دكتور],اسم الطبيب: [٭٭٭٭٭٭٭٭٭],اسم الطبيب: ٭٭٭٭,اسم الطبيب: سمرين مروى


## Pretrained Arabic Deidentification Pipeline

- We developed a pretrained clinical deidentification pipeline that can be used to deidentify PHI in Arabic medical texts. The pipeline returns both masked and obfuscated versions of the input text.
- The pipeline can mask and obfuscate:
  - CONTACT,
  - NAME,
  - DATE,
  - ID,
  - LOCATION,
  - AGE,
  - PATIENT,
  - HOSPITAL,
  - ORGANIZATION,
  - CITY,
  - STREET,
  - USERNAME,
  - SEX,
  - IDNUM,
  - EMAIL,
  - ZIP,
  - MEDICALRECORD,
  - PROFESSION,
  - PHONE,
  - COUNTRY,
  - DOCTOR,
  - SSN,
  - ACCOUNT,
  - LICENSE,
  - DLN,
  - VIN


In [158]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline_ar = PretrainedPipeline("clinical_deidentification", "ar", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [159]:
text = """
ملاحظات سريرية - مريض الربو:
التاريخ: 30 مايو 2023
اسم المريضة: ليلى حسن
تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي 123456789012.
العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة
الرمز البريدي: 54321
البلد: المملكة العربية السعودية
اسم المستشفى: مستشفى النور
اسم الطبيب: د. أميرة أحمد
"""

The results can also be inspected vertically by building a pandas DataFrame:

In [160]:
pd.set_option("display.max_colwidth", None)

result_ar = deid_pipeline_ar.annotate(text)

df_ar = pd.DataFrame(list(zip(result_ar["sentence"],
                              result_ar["masked_with_entity"],
                              result_ar["masked_with_chars"],
                              result_ar["masked_fixed_length_chars"],
                              result_ar["obfuscated"])),
                 columns= ["Sentence", "masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ar

Unnamed: 0,Sentence,masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,ملاحظات سريرية - مريض الربو:\nالتاريخ: 30 مايو 2023,ملاحظات سريرية - مريض الربو:\nالتاريخ: [تاريخ] [تاريخ],ملاحظات سريرية - مريض الربو:\nالتاريخ: [٭٭٭٭٭] [٭٭],ملاحظات سريرية - مريض الربو:\nالتاريخ: ٭٭٭٭ ٭٭٭٭,ملاحظات سريرية - مريض الربو:\nالتاريخ: 30 يونيو 2024
1,اسم المريضة: ليلى حسن,اسم المريضة: [المريض],اسم المريضة: [٭٭٭٭٭٭],اسم المريضة: ٭٭٭٭,اسم المريضة: نوح شقيري
2,تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي 123456789012.,تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي [هاتف].,تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي [٭٭٭٭٭٭٭٭٭٭].,تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي ٭٭٭٭.,تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي 036925814703.
3,العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة,العنوان: شارع المعرفة، مبنى رقم [الرمز البريدي] [المدينة] [المدينة],العنوان: شارع المعرفة، مبنى رقم [٭٭] [٭٭٭٭٭٭٭٭٭] [٭],العنوان: شارع المعرفة، مبنى رقم ٭٭٭٭ ٭٭٭٭ ٭٭٭٭,العنوان: شارع المعرفة، مبنى رقم 814، كلميم سانت كاترين
4,الرمز البريدي: 54321,الرمز البريدي: [الرمز البريدي],الرمز البريدي: [٭٭٭],الرمز البريدي: ٭٭٭٭,الرمز البريدي: 29630
5,البلد: المملكة العربية السعودية,البلد: [المدينة] [البلد],البلد: [٭٭٭٭٭٭٭٭٭٭٭٭٭] [٭٭٭٭٭٭],البلد: ٭٭٭٭ ٭٭٭٭,البلد: زغوان الغربية السعودية
6,اسم المستشفى: مستشفى النور,اسم المستشفى: [الموقع],اسم المستشفى: [٭٭٭٭٭٭٭٭٭٭],اسم المستشفى: ٭٭٭٭,اسم المستشفى: شارع المدارس
7,اسم الطبيب: د. أميرة أحمد,اسم الطبيب: د. [دكتور],اسم الطبيب: د. [٭٭٭٭٭٭٭٭],اسم الطبيب: د. ٭٭٭٭,اسم الطبيب: د. علاء مدحت
