![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/4.1.Clinical_Multi_Language_Deidentification.ipynb)


# Clinical Deidentification Multi Language

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
nlp.install()

In [4]:
from johnsnowlabs import nlp, medical, visual
import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/4.2.4.spark_nlp_for_healthcare (1).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


In [5]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Deidentification Models in Different Languages

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_generic_de.html)  |de| 11| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/02/11/ner_deid_generic_fr.html)  |fr|
| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_subentity_de.html)  |de| 12| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/02/14/ner_deid_subentity_fr.html)  |fr|
| 3| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_generic_es.html)  |es| 13| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_generic_it_3_0.html)  |it|
| 4| [ner_deid_generic_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_augmented_es.html)  |es| 14| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_subentity_it_2_4.html)  |it|
| 5| [ner_deid_generic_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_generic_roberta_es.html)  |es| 15| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_generic_pt_3_0.html)  |pt|
| 6| [ner_deid_generic_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_roberta_augmented_es.html)  |es| 16| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_subentity_pt_3_0.html)  |pt|
| 7| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_subentity_es.html)  |es| 17| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_ro_3_0.html)  |ro|
| 8| [ner_deid_subentity_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_augmented_es.html)  |es| 18| [ner_deid_subentity_bert](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_bert_ro_3_0.html)  |ro|
| 9| [ner_deid_subentity_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_subentity_roberta_es.html)  |es| 19| [ner_deid_generic](https://nlp.johnsnowlabs.com/models)  |ro|
| 10| [ner_deid_subentity_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_roberta_augmented_es.html)  |es| 20| [ner_deid_generic_bert](https://nlp.johnsnowlabs.com/models)  |ro|


# DE-IDENTIFICATION FOR GERMAN

## German Deidentification NER Models 

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_generic_de.html)  |de| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_subentity_de.html)  |de|


Creating pipeline

In [6]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_de = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, FAX numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Employment information)
- Id (Social Security numbers, Medical record numbers, Internet protocol addresses)



In [7]:
ner_generic_de = medical.NerModel.pretrained("ner_deid_generic", "de", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
[OK!]
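`NerConverter` groups the token-level IOB tags emitted by the NER model into entity chunks. The sketch below is a simplified, pure-Python illustration of that grouping idea and not the library's implementation; the function name `iob_to_chunks` and the sample tags are hypothetical.

```python
def iob_to_chunks(tokens, tags):
    """Group tokens tagged B-X / I-X into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: flush the one in progress, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hypothetical tags for part of the German example sentence.
tokens = ["Michael", "Berger", "ist", "76", "Jahre", "alt"]
tags = ["B-NAME", "I-NAME", "O", "B-AGE", "O", "O"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

Joining tokens with a single space is a simplification; the real annotator works with character offsets, so chunk text keeps the original spacing.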


In [8]:
ner_generic_de.getClasses()

['O',
 'I-LOCATION',
 'B-DATE',
 'I-NAME',
 'B-LOCATION',
 'I-DATE',
 'B-ID',
 'B-AGE',
 'B-CONTACT',
 'B-PROFESSION',
 'B-NAME']
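The class list mixes `B-` and `I-` variants of each entity, so it can be handy to reduce it to the distinct entity names. A minimal sketch, assuming the list returned by `getClasses()` has been copied into a plain Python list (the helper name `entity_labels` is hypothetical):

```python
def entity_labels(classes):
    """Strip the B-/I- prefixes and return the sorted distinct entity names."""
    return sorted({c.split("-", 1)[1] for c in classes if "-" in c})


# Copied from the getClasses() output shown above.
classes_generic_de = [
    "O", "I-LOCATION", "B-DATE", "I-NAME", "B-LOCATION", "I-DATE",
    "B-ID", "B-AGE", "B-CONTACT", "B-PROFESSION", "B-NAME",
]
labels = entity_labels(classes_generic_de)
print(labels)
```

The seven names it yields correspond to the seven entity types listed for `ner_deid_generic`.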

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- User Name
- Profession
- Phone
- Country
- Age

In [9]:
ner_subentity_de = medical.NerModel.pretrained("ner_deid_subentity", "de", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
[OK!]


In [10]:
ner_subentity_de.getClasses()

['O',
 'B-ORGANIZATION',
 'I-DOCTOR',
 'B-DOCTOR',
 'B-USERNAME',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-PROFESSION',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'B-CITY',
 'B-HOSPITAL',
 'B-DATE',
 'B-STREET',
 'B-PATIENT',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-COUNTRY']

### Pipeline

In [11]:
nlpPipeline_de = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_de,
      ner_generic_de,
      ner_converter_generic,
      ner_subentity_de,
      ner_converter_subentity,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_de = nlpPipeline_de.fit(empty_data)

In [12]:
text_de = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""

text_df_de = spark.createDataFrame([[text_de]]).toDF("text")
result_de = model_de.transform(text_df_de)

Results for `ner_deid_subentity`

In [13]:
result_de.select(F.explode(F.arrays_zip(result_de.ner_chunk_subentity.result, result_de.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|Michael Berger                        |PATIENT  |
|12 Dezember 2018                      |DATE     |
|Elisabeth-Krankenhaus in Bad Kissingen|HOSPITAL |
|Berger                                |PATIENT  |
|76                                    |AGE      |
+--------------------------------------+---------+



Results for `ner_deid_generic`

In [14]:
result_de.select(F.explode(F.arrays_zip(result_de.ner_chunk_generic.result, result_de.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------------------------+---------+
|chunk                    |ner_label|
+-------------------------+---------+
|Michael Berger           |NAME     |
|12 Dezember 2018         |DATE     |
|St. Elisabeth-Krankenhaus|LOCATION |
|Bad Kissingen            |LOCATION |
|Berger                   |NAME     |
|76                       |AGE      |
+-------------------------+---------+



## DeIdentification

### Obfuscation mode

In [15]:
# Downloading custom faker entity list.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate.txt
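To make the role of the reference file concrete, here is a rough pure-Python sketch of file-based obfuscation lookup. It assumes each line has the form `replacement#LABEL`; that format, the helper names, and the sample entries are assumptions for illustration, so check the downloaded `obfuscate.txt` before relying on it.

```python
import random
from collections import defaultdict


def parse_obfuscation_lines(lines, sep="#"):
    """Build {LABEL: [replacement, ...]} from 'replacement<sep>LABEL' lines."""
    by_label = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        value, label = line.rsplit(sep, 1)
        by_label[label.strip().upper()].append(value.strip())
    return dict(by_label)


def pick_replacement(by_label, label, rng):
    """Return a random replacement for the label, or None if none is listed."""
    candidates = by_label.get(label)
    return rng.choice(candidates) if candidates else None


# Hypothetical entries, not the contents of the real file.
sample_lines = [
    "Max Mustermann#PATIENT",
    "Erika Musterfrau#PATIENT",
    "Musterklinik Nord#HOSPITAL",
    "",
]
ref = parse_obfuscation_lines(sample_lines)
rng = random.Random(0)
replacement = pick_replacement(ref, "PATIENT", rng)
print(ref)
print(replacement)
```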

In [17]:
deid_masked_entity = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate.txt')\
    .setObfuscateRefSource("file")

In [18]:
nlpPipeline_de = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_de,
      ner_subentity_de,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_de = nlpPipeline_de.fit(empty_data)

In [19]:
deid_lp_de = nlp.LightPipeline(model_de)

In [20]:
text = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""

In [21]:
import pandas as pd
pd.set_option("display.max_colwidth", 100)

result_lp_de = deid_lp_de.annotate(text)

df_de = pd.DataFrame(list(zip(result_lp_de["masked_with_entity"], result_lp_de["masked_with_chars"],
                           result_lp_de["masked_fixed_length_chars"], result_lp_de["obfuscated"])),
                 columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_de

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,<PATIENT> wird am Morgen des <DATE> ins St. <HOSPITAL> eingeliefert.,[************] wird am Morgen des [**************] ins St. [************************************...,**** wird am Morgen des **** ins St. **** eingeliefert.,Frau Irmi Graf wird am Morgen des 03-04-1994 ins St. Evangelisches Krankenhaus Königin Elisabeth...
1,Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.,Herr [****] ist ** Jahre alt und hat zu viel Wasser in den Beinen.,Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen.,Herr Renato Lorch ist 62 Jahre alt und hat zu viel Wasser in den Beinen.
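The three masking policies can be mimicked in a few lines of plain Python. This is a sketch inferred from the output above, not the annotator's code: it assumes `same_length_chars` uses brackets only when the chunk is longer than two characters, and the helper names and character offsets are hypothetical.

```python
def mask_chunk(text, label, policy, fixed_length=4):
    """Return the replacement for one detected chunk under a masking policy."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "same_length_chars":
        # Preserve the original length; brackets only fit when len(text) > 2.
        if len(text) > 2:
            return "[" + "*" * (len(text) - 2) + "]"
        return "*" * len(text)
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown policy: {policy}")


def mask_sentence(sentence, chunks, policy, fixed_length=4):
    """chunks: non-overlapping (begin, end_exclusive, label) character spans."""
    out, cursor = [], 0
    for begin, end, label in sorted(chunks):
        out.append(sentence[cursor:begin])
        out.append(mask_chunk(sentence[begin:end], label, policy, fixed_length))
        cursor = end
    out.append(sentence[cursor:])
    return "".join(out)


sentence = "Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."
chunks = [(5, 11, "PATIENT"), (16, 18, "AGE")]  # "Berger" and "76"

masked_labels = mask_sentence(sentence, chunks, "entity_labels")
masked_same = mask_sentence(sentence, chunks, "same_length_chars")
masked_fixed = mask_sentence(sentence, chunks, "fixed_length_chars")
print(masked_labels)
print(masked_same)
print(masked_fixed)
```

`same_length_chars` keeps sentence length and layout intact, while `fixed_length_chars` also hides how long the original value was.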


### Faker mode

In [22]:
deid_obfuscated_faker = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('de')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [23]:
nlpPipeline_de = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_de,
      ner_subentity_de,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_de = nlpPipeline_de.fit(empty_data)

In [24]:
deid_lp_de = nlp.LightPipeline(model_de)

In [25]:
text = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""

In [26]:
pd.set_option("display.max_colwidth", 100)

result_de = deid_lp_de.annotate(text)

df_de = pd.DataFrame(list(zip(result_de["masked_with_entity"], result_de["masked_with_chars"],
                           result_de["masked_fixed_length_chars"], result_de["obfuscated"])),
                 columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_de

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,<PATIENT> wird am Morgen des <DATE> ins St. <HOSPITAL> eingeliefert.,[************] wird am Morgen des [**************] ins St. [************************************...,**** wird am Morgen des **** ins St. **** eingeliefert.,Baumann Roos wird am Morgen des 01-25-1994 ins St. HOUSTON METHODIST ST. CATHERINE HOSPITAL eing...
1,Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.,Herr [****] ist ** Jahre alt und hat zu viel Wasser in den Beinen.,Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen.,Herr Eggers Münch ist 60 Jahre alt und hat zu viel Wasser in den Beinen.
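After masking or obfuscating, a quick sanity check is to confirm that none of the known PHI strings survive in the output. The helper below is a naive substring check written for illustration (the name `find_leaks` is hypothetical); it can raise false alarms, for example when an age such as "76" also appears inside an unrelated number.

```python
def find_leaks(deidentified_sentences, phi_strings):
    """Return the PHI strings that still occur verbatim (case-insensitive)."""
    joined = "\n".join(deidentified_sentences).casefold()
    return [phi for phi in phi_strings if phi.casefold() in joined]


phi = ["Michael Berger", "Berger", "76"]
original = ["Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."]
masked = ["Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen."]

leaks_before = find_leaks(original, phi)
leaks_after = find_leaks(masked, phi)
print(leaks_before)
print(leaks_after)
```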


## Pretrained German Deidentification Pipeline

- This pretrained pipeline deidentifies PHI (protected health information) in German medical texts. Its output contains both masked and obfuscated versions of the text.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - City
    - Street
    - Country
    - User name
    - Profession
    - Phone
    - Age
    - Contact
    - ID
    - Zip
    - Account
    - SSN
    - Driver's License Number
    - Plate Number

In [27]:
deid_pipeline_de = nlp.PretrainedPipeline("clinical_deidentification", "de", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [28]:
pd.set_option("display.max_colwidth", 100)

text = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus in Bad Kissingen eingeliefert. 
Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.

Persönliche Daten :
ID-Nummer: T0110053F
Platte A-BC124
Kontonummer: DE89370400440532013000
SSN : 13110587M565
Lizenznummer: B072RRE2I55
Adresse : St.Johann-Straße 13 19300"""

result_de = deid_pipeline_de.annotate(text)

df_de = pd.DataFrame(list(zip(result_de["sentence"], result_de["masked"],
                           result_de["masked_with_chars"], result_de["masked_fixed_length_chars"], result_de["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_de

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhau...,Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert.,Zusammenfassung : [************] wird am Morgen des [**************] ins [**********************...,Zusammenfassung : **** wird am Morgen des **** ins **** eingeliefert.,Zusammenfassung : Hollmann Burmeister wird am Morgen des 11-01-1970 ins SAN RAMON REGIONAL MEDIC...
1,Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.,Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.,Herr [************] ist ** Jahre alt und hat zu viel Wasser in den Beinen.,Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen.,Herr Hollmann Burmeister ist 57 Jahre alt und hat zu viel Wasser in den Beinen.
2,Persönliche Daten :\nID-Nummer: T0110053F,Persönliche Daten :\nID-Nummer: <ID>,Persönliche Daten :\nID-Nummer: [*******],Persönliche Daten :\nID-Nummer: ****,Persönliche Daten :\nID-Nummer: L6043236
3,Platte A-BC124,Platte <PLATE>,Platte [*****],Platte ****,Platte QA348G
4,Kontonummer: DE89370400440532013000\nSSN : 13110587M565,Kontonummer: <ACCOUNT>\nSSN : <SSN>,Kontonummer: [********************]\nSSN : [**********],Kontonummer: ****\nSSN : ****,Kontonummer: 192837465738\nSSN : 999-30-4262
5,Lizenznummer: B072RRE2I55,Lizenznummer: <DLN>,Lizenznummer: [*********],Lizenznummer: ****,Lizenznummer: S99913378
6,Adresse : St.Johann-Straße 13 19300,Adresse : <STREET> <ZIP>,Adresse : [*****************] [***],Adresse : **** ****,Adresse : Guntram-Hofmann-Gasse 6 00031
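The `list(zip(...))` pattern used to build the comparison table recurs throughout this notebook. A small helper can make it reusable; the sketch below assumes the `annotate` result is a dict mapping output-column names to equally long lists of strings, and the helper name `annotation_rows` is hypothetical.

```python
def annotation_rows(result, columns):
    """Turn a dict of parallel lists into a list of row dicts.

    columns maps output-column name -> display name, in display order.
    """
    keys = list(columns)
    lengths = {len(result[k]) for k in keys}
    if len(lengths) > 1:
        raise ValueError("output columns have different lengths")
    return [
        {columns[k]: value for k, value in zip(keys, values)}
        for values in zip(*(result[k] for k in keys))
    ]


# Stand-in for an annotate() result, using one row from the output above.
result = {
    "sentence": ["Platte A-BC124"],
    "masked": ["Platte <PLATE>"],
    "obfuscated": ["Platte QA348G"],
}
columns = {"sentence": "Sentence", "masked": "Masked", "obfuscated": "Obfuscated"}
rows = annotation_rows(result, columns)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.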


# DE-IDENTIFICATION FOR SPANISH

## Spanish Deidentification NER Models

We have eight different models you can use:
* `ner_deid_generic`, detects 7 entities, uses SciWiki 300d embeddings.
* `ner_deid_generic_roberta`, same as previous, but uses Roberta Clinical Embeddings.
* `ner_deid_generic_augmented`, detects 8 entities (now includes 'SEX' entity), uses SciWiki 300d embeddings and has been trained with more data
* `ner_deid_generic_roberta_augmented`, same as previous, but uses Roberta Clinical Embeddings.
* `ner_deid_subentity`, detects 13 entities, uses SciWiki 300d embeddings.
* `ner_deid_subentity_roberta`, same as previous, but uses Roberta Clinical Embeddings.
* `ner_deid_subentity_augmented`, detects 17 entities, uses SciWiki 300d embeddings and has been trained with more data.
* `ner_deid_subentity_roberta_augmented`, same as previous, but uses Roberta Clinical Embeddings.

Since the `augmented` models give better results than the non-augmented ones, this notebook showcases them.

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_generic_es.html)  |es| 5| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/18/ner_deid_subentity_es.html)  |es|
| 2| [ner_deid_generic_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_augmented_es.html)  |es| 6| [ner_deid_subentity_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_augmented_es.html)  |es|
| 3| [ner_deid_generic_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_generic_roberta_es.html)  |es| 7| [ner_deid_subentity_roberta](https://nlp.johnsnowlabs.com/2022/01/17/ner_deid_subentity_roberta_es.html)  |es|
| 4| [ner_deid_generic_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_generic_roberta_augmented_es.html)  |es| 8| [ner_deid_subentity_roberta_augmented](https://nlp.johnsnowlabs.com/2022/02/16/ner_deid_subentity_roberta_augmented_es.html)  |es|
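The eight Spanish model names follow a regular pattern: a granularity (`generic` or `subentity`) plus optional `_roberta` and `_augmented` suffixes. A small sketch of that naming scheme, with hypothetical helper names, can help when switching between variants:

```python
from itertools import product


def spanish_deid_model_name(granularity, roberta=False, augmented=False):
    """Compose a model name such as 'ner_deid_subentity_roberta_augmented'."""
    if granularity not in ("generic", "subentity"):
        raise ValueError("granularity must be 'generic' or 'subentity'")
    name = f"ner_deid_{granularity}"
    if roberta:
        name += "_roberta"
    if augmented:
        name += "_augmented"
    return name


def embeddings_family(model_name):
    """Embeddings each variant expects, as described in the list above."""
    if "_roberta" in model_name:
        return "Roberta Clinical Embeddings"
    return "SciWiki 300d embeddings"


all_names = sorted(
    spanish_deid_model_name(g, r, a)
    for g, r, a in product(("generic", "subentity"), (False, True), (False, True))
)
for name in all_names:
    print(name, "->", embeddings_family(name))
```

Pairing a model with the wrong embeddings is an easy mistake, so deriving both from one place keeps them in sync.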


Creating pipeline for Sciwiki 300d-based augmented model

In [29]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_es = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
embeddings_sciwiki_300d download started this may take some time.
Approximate size to download 253.3 MB
[OK!]


### NER Deid Generic (Augmented)

**`ner_deid_generic_augmented`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, FAX numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Employment information)
- Id (Social Security numbers, Medical record numbers, Internet protocol addresses)
- Sex



In [30]:
ner_generic_es = medical.NerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic_augmented download started this may take some time.
[OK!]


In [31]:
ner_generic_es.getClasses()

['O',
 'I-LOCATION',
 'B-ORGANIZATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-PROFESSION',
 'B-CONTACT',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-SEX',
 'I-ORGANIZATION',
 'B-AGE',
 'I-SEX']

### NER Deid Subentity (Augmented)

**`ner_deid_subentity_augmented`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- User Name
- Profession
- Phone
- Country
- Age
- Sex
- Email
- ZIP
- ID
- Medical Record

In [32]:
ner_subentity_es = medical.NerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [33]:
ner_subentity_es.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-USERNAME',
 'B-PROFESSION',
 'I-ID',
 'B-CITY',
 'B-DATE',
 'B-PATIENT',
 'B-SEX',
 'I-SEX',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ID',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-STREET',
 'I-ORGANIZATION',
 'B-AGE',
 'I-HOSPITAL',
 'I-COUNTRY']

### Pipeline

In [34]:
nlpPipeline_es = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_es,
      ner_generic_es,
      ner_converter_generic,
      ner_subentity_es,
      ner_converter_subentity,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_es = nlpPipeline_es.fit(empty_data)

In [35]:
text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2022 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."

text_df = spark.createDataFrame([[text]]).toDF("text")
result_es = model_es.transform(text_df)

Results for `ner_deid_subentity`

In [36]:
result_es.select(F.explode(F.arrays_zip(result_es.ner_chunk_subentity.result, result_es.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------+----------+
|chunk                  |ner_label |
+-----------------------+----------+
|Antonio Miguel Martínez|PATIENT   |
|un varón               |SEX       |
|35                     |AGE       |
|auxiliar de enfermería |PROFESSION|
|Cadiz                  |CITY      |
|España                 |COUNTRY   |
|14/03/2022             |DATE      |
|Clinica San Carlos     |HOSPITAL  |
+-----------------------+----------+



Results for `ner_deid_generic`

In [37]:
result_es.select(F.explode(F.arrays_zip(result_es.ner_chunk_generic.result, result_es.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------+----------+
|chunk                  |ner_label |
+-----------------------+----------+
|Antonio Miguel Martínez|NAME      |
|un varón               |SEX       |
|35                     |AGE       |
|auxiliar de enfermería |PROFESSION|
|Cadiz                  |LOCATION  |
|España                 |LOCATION  |
|14/03/2022             |DATE      |
|Clinica San Carlos     |LOCATION  |
+-----------------------+----------+
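Comparing the two result tables shows how the fine-grained subentity labels roll up into the generic ones. The sketch below derives that mapping from the chunks shown above by matching identical chunk text; it reflects this one example only and the helper name `label_mapping` is hypothetical.

```python
def label_mapping(fine, coarse):
    """Map each fine-grained label to the coarse labels seen on the same chunk."""
    coarse_by_chunk = dict(coarse)
    mapping = {}
    for chunk, fine_label in fine:
        coarse_label = coarse_by_chunk.get(chunk)
        if coarse_label is not None:
            mapping.setdefault(fine_label, set()).add(coarse_label)
    return mapping


# (chunk, label) pairs copied from the two result tables above.
subentity = [
    ("Antonio Miguel Martínez", "PATIENT"), ("un varón", "SEX"), ("35", "AGE"),
    ("auxiliar de enfermería", "PROFESSION"), ("Cadiz", "CITY"),
    ("España", "COUNTRY"), ("14/03/2022", "DATE"), ("Clinica San Carlos", "HOSPITAL"),
]
generic = [
    ("Antonio Miguel Martínez", "NAME"), ("un varón", "SEX"), ("35", "AGE"),
    ("auxiliar de enfermería", "PROFESSION"), ("Cadiz", "LOCATION"),
    ("España", "LOCATION"), ("14/03/2022", "DATE"), ("Clinica San Carlos", "LOCATION"),
]
mapping = label_mapping(subentity, generic)
for fine_label, coarse_labels in mapping.items():
    print(fine_label, "->", sorted(coarse_labels))
```

Matching on exact chunk text only works when both models agree on chunk boundaries, which they do here but need not in general.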



## DeIdentification

### Obfuscation mode

In [38]:
# Downloading faker entity list.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_es.txt

In [39]:
deid_masked_entity = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_es.txt')\
    .setObfuscateRefSource("file")

In [40]:
nlpPipeline_es = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_es,
      ner_subentity_es,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_es = nlpPipeline_es.fit(empty_data)

In [41]:
deid_lp_es = nlp.LightPipeline(model_es)

In [42]:
text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2022 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."

In [43]:
pd.set_option("display.max_colwidth", 100)

result_es = deid_lp_es.annotate(text)

df_es = pd.DataFrame(list(zip(result_es["masked_with_entity"], 
                           result_es["masked_with_chars"],
                           result_es["masked_fixed_length_chars"], 
                           result_es["obfuscated"])),
                  columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_es

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"<PATIENT>, <SEX> de <AGE> años de edad, de profesión <PROFESSION> y nacido en <CITY>, <COUNTRY>.","[*********************], [******] de ** años de edad, de profesión [********************] y naci...","****, **** de **** años de edad, de profesión **** y nacido en ****, ****.","Aurora Garrido Paez, M. de 36 años de edad, de profesión Conserje y nacido en Valladolid, España."
1,"Aún no estaba vacunado, se infectó con Covid-19 el dia <DATE> y tuvo que ir al Hospital. Fue tra...","Aún no estaba vacunado, se infectó con Covid-19 el dia [********] y tuvo que ir al Hospital. Fue...","Aún no estaba vacunado, se infectó con Covid-19 el dia **** y tuvo que ir al Hospital. Fue trata...","Aún no estaba vacunado, se infectó con Covid-19 el dia 11/05/2022 y tuvo que ir al Hospital. Fue..."


### Faker mode

In [44]:
deid_obfuscated_faker = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('es')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [45]:
nlpPipeline_es = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_es,
      ner_subentity_es,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_es = nlpPipeline_es.fit(empty_data)

In [46]:
deid_lp_es = nlp.LightPipeline(model_es)

In [47]:
text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2022 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."

In [48]:
pd.set_option("display.max_colwidth", 100)

result_es = deid_lp_es.annotate(text)

df_es = pd.DataFrame(list(zip(result_es["masked_with_entity"], 
                           result_es["masked_with_chars"],
                           result_es["masked_fixed_length_chars"], 
                           result_es["obfuscated"])),
                  columns= ["Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_es

Unnamed: 0,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"<PATIENT>, <SEX> de <AGE> años de edad, de profesión <PROFESSION> y nacido en <CITY>, <COUNTRY>.","[*********************], [******] de ** años de edad, de profesión [********************] y naci...","****, **** de **** años de edad, de profesión **** y nacido en ****, ****.","Douglas Chiva, H de 31 años de edad, de profesión Data processing manager y nacido en Tres Canto..."
1,"Aún no estaba vacunado, se infectó con Covid-19 el dia <DATE> y tuvo que ir al Hospital. Fue tra...","Aún no estaba vacunado, se infectó con Covid-19 el dia [********] y tuvo que ir al Hospital. Fue...","Aún no estaba vacunado, se infectó con Covid-19 el dia **** y tuvo que ir al Hospital. Fue trata...","Aún no estaba vacunado, se infectó con Covid-19 el dia 21/03/2022 y tuvo que ir al Hospital. Fue..."


## Pretrained Spanish Deidentification Pipeline

- This pretrained pipeline deidentifies PHI (protected health information) in Spanish medical texts. Its output contains both masked and obfuscated versions of the text.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - City
    - Street
    - Country
    - User name
    - Profession
    - Phone
    - Age
    - Contact
    - ID
    - ZIP
    - Account
    - SSN
    - Driver's License Number
    - Plate Number
    - Sex

|index|model|index|model|
|-----:|:-----|-----:|:-----|
| 1| `clinical_deidentification_augmented` | 2| `clinical_deidentification` |

In [49]:
deid_pipeline_es = nlp.PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models")

clinical_deidentification_augmented download started this may take some time.
Approx size to download 268.2 MB
[OK!]


In [50]:
text = """Datos del paciente.
Nombre:  Ernesto.
Apellidos: Rivera Bueno.
NHC: 368503.
NASS: 26 63514095.
Domicilio:  Calle Miguel Benitez 90.
Localidad/ Provincia: Madrid.
CP: 28016.
Datos asistenciales.
Fecha de nacimiento: 03/03/1946.
País: España.
Edad: 70 años Sexo: H.
Fecha de Ingreso: 12/12/2016.
Médico:  Ignacio Navarro Cuéllar NºCol: 28 28 70973.
Informe clínico del paciente: Paciente de 70 años de edad, minero jubilado, sin alergias medicamentosas conocidas, que presenta como antecedentes personales: accidente laboral antiguo con fracturas vertebrales y costales; intervenido de enfermedad de Dupuytren en mano derecha y by-pass iliofemoral izquierdo; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; enolismo activo, fumador de 20 cigarrillos / día.
Es derivado desde Atención Primaria por presentar hematuria macroscópica postmiccional en una ocasión y microhematuria persistente posteriormente, con micciones normales.
En la exploración física presenta un buen estado general, con abdomen y genitales normales; tacto rectal compatible con adenoma de próstata grado I/IV.
En la analítica de orina destaca la existencia de 4 hematíes/ campo y 0-5 leucocitos/campo; resto de sedimento normal.
Hemograma normal; en la bioquímica destaca una glucemia de 169 mg/dl y triglicéridos de 456 mg/dl; función hepática y renal normal. PSA de 1.16 ng/ml.
Las citologías de orina son repetidamente sospechosas de malignidad.
En la placa simple de abdomen se valoran cambios degenerativos en columna lumbar y calcificaciones vasculares en ambos hipocondrios y en pelvis.
La ecografía urológica pone de manifiesto la existencia de quistes corticales simples en riñón derecho, vejiga sin alteraciones con buena capacidad y próstata con un peso de 30 g.
En la UIV se observa normofuncionalismo renal bilateral, calcificaciones sobre silueta renal derecha y uréteres arrosariados con imágenes de adición en el tercio superior de ambos uréteres, en relación a pseudodiverticulosis ureteral. El cistograma demuestra una vejiga con buena capacidad, pero paredes trabeculadas en relación a vejiga de esfuerzo. La TC abdominal es normal.
La cistoscopia descubre la existencia de pequeñas tumoraciones vesicales, realizándose resección transuretral con el resultado anatomopatológico de carcinoma urotelial superficial de vejiga.
Remitido por: Ignacio Navarro Cuéllar c/ del Abedul 5-7, 2º dcha 28036 Madrid, España E-mail: nnavcu@hotmail.com.
"""

result_es = deid_pipeline_es.annotate(text)
print("\n".join(result_es['masked_with_chars']))
print("\n")
print("\n".join(result_es['masked']))
print("\n")
print("\n".join(result_es['masked_fixed_length_chars']))
print("\n")
print("\n".join(result_es['obfuscated']))

Datos [**********].
Nombre:  [*****].
Apellidos: [**********].
NHC: [****].
NASS: [*********].
Domicilio:  [*********************].
Localidad/ Provincia: [****].
CP: [***].
Datos asistenciales.
Fecha de nacimiento: [********].
País: [****].
Edad: ** años Sexo: *.
Fecha de Ingreso: [********].
Médico:  [*********************] NºCol: [*********].
Informe clínico [**********]: [******] ** ** años de edad, minero jubilado, sin alergias medicamentosas conocidas, que presenta como antecedentes personales: accidente laboral antiguo con fracturas vertebrales y costales; intervenido de enfermedad de Dupuytren en mano derecha y by-pass iliofemoral izquierdo;
Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; enolismo activo, fumador de 20 cigarrillos / día.
Es derivado desde Atención Primaria por presentar hematuria macroscópica postmiccional en una ocasión y microhematuria persistente posteriormente, con micciones normales.
En la exploración física presenta un buen esta

# DE-IDENTIFICATION FOR FRENCH

## French Deidentification NER Models
We have two different models you can use:
* `ner_deid_generic`, detects 7 entities
* `ner_deid_subentity`, detects 15 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/02/11/ner_deid_generic_fr.html)  |fr| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/02/14/ner_deid_subentity_fr.html)  |fr|


Creating pipeline

In [51]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_fr = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)

In [52]:
ner_generic_fr = medical.NerModel.pretrained("ner_deid_generic", "fr", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
[OK!]


In [53]:
ner_generic_fr.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-PROFESSION',
 'B-CONTACT',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-AGE',
 'I-AGE']

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- Username
- Profession
- Phone
- Country
- Age
- E-mail
- ZIP
- Medical Record

In [54]:
ner_subentity_fr = medical.NerModel.pretrained("ner_deid_subentity", "fr", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
[OK!]


In [55]:
ner_subentity_fr.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-USERNAME',
 'B-PROFESSION',
 'B-CITY',
 'B-DATE',
 'I-MEDICALRECORD',
 'B-E-MAIL',
 'B-PATIENT',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-STREET',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-AGE',
 'I-COUNTRY']
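The IOB tags returned by `getClasses()` can be collapsed into entity types to check the count quoted earlier for `ner_deid_subentity`. This is a minimal sketch in plain Python; `entity_types` is a hypothetical helper, and the tag list is copied from the output above.

```python
def entity_types(classes):
    """Return the sorted entity types behind a list of IOB tags."""
    types = set()
    for tag in classes:
        if tag == "O":  # "outside" tag, not an entity
            continue
        # Split on the first hyphen only, so 'B-E-MAIL' yields 'E-MAIL'.
        _, entity = tag.split("-", 1)
        types.add(entity)
    return sorted(types)


# Tags copied from the getClasses() output of the French subentity model.
subentity_classes_fr = [
    'O', 'B-MEDICALRECORD', 'B-ORGANIZATION', 'I-PROFESSION', 'B-DOCTOR',
    'B-USERNAME', 'B-PROFESSION', 'B-CITY', 'B-DATE', 'I-MEDICALRECORD',
    'B-E-MAIL', 'B-PATIENT', 'I-DOCTOR', 'I-CITY', 'I-DATE', 'B-COUNTRY',
    'B-ZIP', 'I-STREET', 'I-PATIENT', 'B-PHONE', 'I-PHONE', 'B-HOSPITAL',
    'B-STREET', 'I-ORGANIZATION', 'I-HOSPITAL', 'B-AGE', 'I-AGE', 'I-COUNTRY',
]

fr_types = entity_types(subentity_classes_fr)
print(len(fr_types))  # 15
print(fr_types)
```

With a live session the same helper could be called directly on `ner_subentity_fr.getClasses()`.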

### Pipeline

In [56]:
nlpPipeline_fr = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_fr,
      ner_generic_fr,
      ner_converter_generic,
      ner_subentity_fr,
      ner_converter_subentity,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_fr = nlpPipeline_fr.fit(empty_data)

In [57]:
text = "J'ai vu en consultation Michel Martinez (49 ans), jardinier, adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."

text_df = spark.createDataFrame([[text]]).toDF("text")
result_fr = model_fr.transform(text_df)

Results for `ner_deid_generic`

In [58]:
result_fr.select(F.explode(F.arrays_zip(result_fr.ner_chunk_generic.result, result_fr.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------+----------+
|chunk                        |ner_label |
+-----------------------------+----------+
|Michel Martinez              |NAME      |
|49 ans                       |AGE       |
|jardinier                    |PROFESSION|
|Centre Hospitalier De Plaisir|LOCATION  |
|Mars 2015                    |DATE      |
+-----------------------------+----------+



Results for `ner_deid_subentity`

In [59]:
result_fr.select(F.explode(F.arrays_zip(result_fr.ner_chunk_subentity.result, result_fr.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------+----------+
|chunk                        |ner_label |
+-----------------------------+----------+
|Michel Martinez              |PATIENT   |
|49 ans                       |AGE       |
|jardinier                    |PROFESSION|
|Centre Hospitalier De Plaisir|HOSPITAL  |
|Mars 2015                    |DATE      |
+-----------------------------+----------+
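For a single string, the same chunk and label pairs can also be read without a Spark DataFrame, by wrapping the fitted model in `nlp.LightPipeline` and calling `fullAnnotate`. The sketch below assumes each returned annotation exposes `.result` and a `.metadata` dict with an `entity` key, mirroring the `metadata` column queried above. `chunk_rows` is a hypothetical helper, and stand-in objects are used so the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def chunk_rows(annotations):
    """Turn NER chunk annotations into (chunk, ner_label) tuples."""
    return [(a.result, a.metadata["entity"]) for a in annotations]


# Stand-ins for real annotations. With a live session the list would come
# from something like:
#   nlp.LightPipeline(model_fr).fullAnnotate(text)[0]["ner_chunk_subentity"]
fake_chunks = [
    SimpleNamespace(result="Michel Martinez", metadata={"entity": "PATIENT"}),
    SimpleNamespace(result="49 ans", metadata={"entity": "AGE"}),
]

for chunk, label in chunk_rows(fake_chunks):
    print(f"{chunk:<20}{label}")
```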



## DeIdentification

### Obfuscation mode

In [60]:
# Download the French reference file of replacement values used for file-based obfuscation.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_fr.txt
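A custom reference file can be used in place of the downloaded one. The sketch below assumes the one-entry-per-line `value#LABEL` layout used by the workshop's obfuscation files. The names, the file name `my_obfuscate_fr.txt`, and the `load_reference` helper are all made up for illustration.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical entries; each line is "<replacement value>#<ENTITY LABEL>".
entries = """Camille Lefort#PATIENT
Hugo Perrin#PATIENT
Dr Claire Fontaine#DOCTOR
Clinique Des Tilleuls#HOSPITAL
"""

path = os.path.join(tempfile.mkdtemp(), "my_obfuscate_fr.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(entries)


def load_reference(path, sep="#"):
    """Group replacement values by entity label (a quick sanity check)."""
    by_label = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            value, label = line.rsplit(sep, 1)
            by_label[label].append(value)
    return dict(by_label)


reference = load_reference(path)
print(reference)
```

The resulting path could then be passed to `setObfuscateRefFile` together with `setObfuscateRefSource("file")`.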

In [61]:
deid_masked_entity = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_fr.txt')\
    .setObfuscateRefSource("file")
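The three `setMaskingPolicy` values configured above differ only in what replaces each detected chunk. The sketch below imitates them on a single chunk in plain Python. `mask_chunk` is a hypothetical helper, and the bracket rule for `same_length_chars` is an assumption read off this notebook's sample outputs (for example `[*******]` for the nine-character `jardinier`, but a bare `*` for the one-character `H`), not taken from the library source.

```python
def mask_chunk(chunk, label, policy, fixed_length=4):
    """Imitate the three masking policies on one detected chunk."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    if policy == "same_length_chars":
        # Assumption: the brackets count toward the original length, and
        # chunks too short to hold brackets become plain asterisks.
        if len(chunk) <= 2:
            return "*" * len(chunk)
        return "[" + "*" * (len(chunk) - 2) + "]"
    raise ValueError(f"unknown policy: {policy}")


for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_chunk("jardinier", "PROFESSION", policy))
```

Only `same_length_chars` preserves the original character count, which matters when downstream offsets must stay aligned with the source text.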

In [62]:
nlpPipeline_fr = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_fr,
      ner_subentity_fr,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_fr = nlpPipeline_fr.fit(empty_data)

In [63]:
deid_lp_fr = nlp.LightPipeline(model_fr)

In [64]:
text = "J'ai vu en consultation Michel Martinez (49 ans), jardinier, adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."

In [65]:
pd.set_option("display.max_colwidth", 200)

result_fr = deid_lp_fr.annotate(text)

df_fr = pd.DataFrame(list(zip(result_fr["masked_with_entity"], 
                           result_fr["masked_with_chars"],
                           result_fr["masked_fixed_length_chars"], 
                           result_fr["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_fr

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"J'ai vu en consultation <PATIENT> (<AGE>), <PROFESSION>, adressé au <HOSPITAL> pour un diabète mal contrôlé avec des symptômes datant de <DATE>.","J'ai vu en consultation [*************] ([****]), [*******], adressé au [***************************] pour un diabète mal contrôlé avec des symptômes datant de [*******].","J'ai vu en consultation **** (****), ****, adressé au **** pour un diabète mal contrôlé avec des symptômes datant de ****.","J'ai vu en consultation Raymond Chauvin (26), éducateur de jeunes enfants, adressé au CENTRE HOSPITALIER DU ROUVRAY pour un diabète mal contrôlé avec des symptômes datant de 12-06-1979."
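The `pd.DataFrame(list(zip(...)))` pattern recurs for every language in this notebook. As a sketch, the zipping step can be factored into a small helper that also fails loudly when an expected output column is missing. `result_to_rows` is a hypothetical name, and a stand-in dict replaces the real `annotate()` result so the snippet runs on its own.

```python
def result_to_rows(result, keys):
    """Zip the per-sentence lists of an annotate() result into row dicts."""
    missing = [k for k in keys if k not in result]
    if missing:
        raise KeyError(f"missing output columns: {missing}")
    return [dict(zip(keys, values)) for values in zip(*(result[k] for k in keys))]


# Stand-in for the dict returned by LightPipeline.annotate(text).
fake_result = {
    "masked_with_entity": ["J'ai vu <PATIENT> (<AGE>)."],
    "masked_fixed_length_chars": ["J'ai vu **** (****)."],
}

rows = result_to_rows(fake_result, ["masked_with_entity", "masked_fixed_length_chars"])
print(rows[0]["masked_with_entity"])
```

The list of row dicts can be handed straight to `pd.DataFrame(rows)`.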


### Faker mode

In [66]:
deid_obfuscated_faker = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('fr')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [67]:
nlpPipeline_fr = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_fr,
      ner_subentity_fr,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_fr = nlpPipeline_fr.fit(empty_data)

In [68]:
deid_lp_fr = nlp.LightPipeline(model_fr)

In [69]:
text = "J'ai vu en consultation Michel Martinez (49 ans), jardinier, adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."

In [70]:
pd.set_option("display.max_colwidth", 200)

result_fr = deid_lp_fr.annotate(text)

df_fr = pd.DataFrame(list(zip(result_fr["masked_with_entity"], 
                           result_fr["masked_with_chars"],
                           result_fr["masked_fixed_length_chars"], 
                           result_fr["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_fr

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"J'ai vu en consultation <PATIENT> (<AGE>), <PROFESSION>, adressé au <HOSPITAL> pour un diabète mal contrôlé avec des symptômes datant de <DATE>.","J'ai vu en consultation [*************] ([****]), [*******], adressé au [***************************] pour un diabète mal contrôlé avec des symptômes datant de [*******].","J'ai vu en consultation **** (****), ****, adressé au **** pour un diabète mal contrôlé avec des symptômes datant de ****.","J'ai vu en consultation Gaudo Duchaussoy (46 ans), Designer, fashion/clothing, adressé au CENTRE HOSPITALIER UNIVERSITAIRE DE MONTPELLIER pour un diabète mal contrôlé avec des symptômes datant de ..."


## Pretrained French Deidentification Pipeline

- We developed a pretrained clinical deidentification pipeline that deidentifies PHI in French medical texts. The resulting text is available both masked and obfuscated.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - Sex
    - City
    - Street
    - Country
    - ZIP
    - Username
    - Profession
    - Phone
    - Email
    - Age
    - ID number
    - Medical record number
    - Account number
    - SSN
    - Plate Number
    - IP address
    - URL

In [71]:
deid_pipeline_fr = nlp.PretrainedPipeline("clinical_deidentification", "fr", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [72]:
text = """COMPTE-RENDU D'HOSPITALISATION
PRENOM : Jean
NOM : Dubois
NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058
ADRESSE : 18 Avenue Matabiau
VILLE : Grenoble
CODE POSTAL : 38000
DATE DE NAISSANCE : 03/03/1946
Âge : 70 ans 
Sexe : H
COURRIEL : jdubois@hotmail.fr
DATE D'ADMISSION : 12/12/2016
MÉDÉCIN : Dr Michel Renaud
RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour.
Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale.
L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV.
L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal.
Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml.
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble
COURRIEL : mariebreton@chb.fr
"""

In [73]:
pd.set_option("display.max_colwidth", 100)

result_fr = deid_pipeline_fr.annotate(text)

df_fr = pd.DataFrame(list(zip(result_fr["sentence"], 
                           result_fr["masked"],
                           result_fr["masked_with_chars"], 
                           result_fr["masked_fixed_length_chars"], 
                           result_fr["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_fr

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION,COMPTE-RENDU D'HOSPITALISATION
1,PRENOM : Jean,PRENOM : <PATIENT>,PRENOM : [**],PRENOM : ****,PRENOM : Mme Ollivier
2,NOM : Dubois,NOM : <PATIENT>,NOM : [****],NOM : ****,NOM : Mme Traore
3,NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058,NUMÉRO DE SÉCURITÉ SOCIALE : <SSN>,NUMÉRO DE SÉCURITÉ SOCIALE : [***********],NUMÉRO DE SÉCURITÉ SOCIALE : ****,NUMÉRO DE SÉCURITÉ SOCIALE : 164033818514436
4,ADRESSE : 18 Avenue Matabiau,ADRESSE : <STREET>,ADRESSE : [****************],ADRESSE : ****,"ADRESSE : 731, boulevard de Legrand"
5,VILLE : Grenoble,VILLE : <CITY>,VILLE : [******],VILLE : ****,VILLE : Sainte Antoine
6,CODE POSTAL : 38000,CODE POSTAL : <ZIP>,CODE POSTAL : [***],CODE POSTAL : ****,CODE POSTAL : 37443
7,DATE DE NAISSANCE : 03/03/1946,DATE DE NAISSANCE : <DATE>,DATE DE NAISSANCE : [********],DATE DE NAISSANCE : ****,DATE DE NAISSANCE : 26/03/1946
8,Âge : 70 ans,Âge : <AGE>,Âge : [****],Âge : ****,Âge : 46
9,Sexe : H\nCOURRIEL : jdubois@hotmail.fr\nDATE D'ADMISSION : 12/12/2016,Sexe : <SEX>\nCOURRIEL : <E-MAIL>\nDATE D'ADMISSION : <DATE>,Sexe : *\nCOURRIEL : [****************]\nDATE D'ADMISSION : [********],Sexe : ****\nCOURRIEL : ****\nDATE D'ADMISSION : ****,Sexe : Femme\nCOURRIEL : georgeslemonnier@live.com\nDATE D'ADMISSION : 24/01/2017
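In the `Obfuscated` column above, dates keep their `dd/mm/yyyy` layout while the values move (03/03/1946 becomes 26/03/1946). The sketch below shows that idea with the standard library, assuming a fixed day offset. `shift_date` is a hypothetical helper, and it does not reproduce how the pipeline chooses its offsets.

```python
from datetime import datetime, timedelta


def shift_date(text, days, fmt="%d/%m/%Y"):
    """Shift a date string by a number of days, keeping its format."""
    shifted = datetime.strptime(text, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


print(shift_date("03/03/1946", 23))  # 26/03/1946
print(shift_date("12/12/2016", 43))  # 24/01/2017
```

The two offsets differ because they were chosen to match the two obfuscated dates in the sample output, which were not shifted by the same amount.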


# DE-IDENTIFICATION FOR ITALIAN

## Italian NER Deidentification Models
We have two different models you can use:
* `ner_deid_generic`, detects 8 entities
* `ner_deid_subentity`, detects 19 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_generic_it_3_0.html)  |it| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_subentity_it_2_4.html)  |it|


Creating pipeline

In [74]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_it = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)
- Sex

In [75]:
ner_generic_it = medical.NerModel.pretrained("ner_deid_generic", "it", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
[OK!]


In [76]:
ner_generic_it.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-CONTACT',
 'B-PROFESSION',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-SEX',
 'I-SEX',
 'B-AGE']

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

- Patient
- Doctor
- Hospital
- Date
- Organization
- City
- Street
- Username
- Profession
- Phone
- Country
- Age
- Sex
- Email
- ZIP
- Medical Record Number
- Social Security Number
- ID Number
- URL

In [77]:
ner_subentity_it = medical.NerModel.pretrained("ner_deid_subentity", "it", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
[OK!]


In [78]:
ner_subentity_it.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-USERNAME',
 'B-PROFESSION',
 'B-URL',
 'I-URL',
 'B-CITY',
 'B-DATE',
 'I-MEDICALRECORD',
 'B-SEX',
 'B-PATIENT',
 'I-SEX',
 'I-DOCTOR',
 'I-CITY',
 'B-SSN',
 'I-DATE',
 'I-SSN',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-IDNUM',
 'B-STREET',
 'I-IDNUM',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-COUNTRY']

### Pipeline

In [79]:
nlpPipeline_it = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_it,
      ner_generic_it,
      ner_converter_generic,
      ner_subentity_it,
      ner_converter_subentity,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_it = nlpPipeline_it.fit(empty_data)

In [80]:
text = "Ho visto Gastone Montanariello (49 anni), virologo, riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."

text_df = spark.createDataFrame([[text]]).toDF("text")
result_it = model_it.transform(text_df)

Results for `ner_deid_generic`

In [81]:
result_it.select(F.explode(F.arrays_zip(result_it.ner_chunk_generic.result, result_it.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------+----------+
|chunk                |ner_label |
+---------------------+----------+
|Gastone Montanariello|NAME      |
|49                   |AGE       |
|virologo             |PROFESSION|
|Ospedale San Camillo |LOCATION  |
|marzo 2015           |DATE      |
+---------------------+----------+



Results for `ner_deid_subentity`

In [82]:
result_it.select(F.explode(F.arrays_zip(result_it.ner_chunk_subentity.result, result_it.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------+----------+
|chunk                |ner_label |
+---------------------+----------+
|Gastone Montanariello|PATIENT   |
|49                   |AGE       |
|virologo             |PROFESSION|
|Ospedale San Camillo |HOSPITAL  |
|marzo 2015           |DATE      |
+---------------------+----------+



## DeIdentification

### Obfuscation mode

In [83]:
# Download the Italian reference file of replacement values used for file-based obfuscation.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_it.txt

In [84]:
deid_masked_entity = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_it.txt')\
    .setObfuscateRefSource("file")

In [85]:
nlpPipeline_it = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_it,
      ner_subentity_it,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_it = nlpPipeline_it.fit(empty_data)

In [86]:
deid_lp_it = nlp.LightPipeline(model_it)

In [87]:
text = "Ho visto Gastone Montanariello (49 anni), virologo, riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."

In [88]:
pd.set_option("display.max_colwidth", 200)

result_it = deid_lp_it.annotate(text)

df_it = pd.DataFrame(list(zip(result_it["masked_with_entity"], 
                           result_it["masked_with_chars"],
                           result_it["masked_fixed_length_chars"], 
                           result_it["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_it

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Ho visto <PATIENT> (<AGE> anni), <PROFESSION>, riferito all' <HOSPITAL> per diabete mal controllato con sintomi risalenti a <DATE>.","Ho visto [*******************] (** anni), [******], riferito all' [******************] per diabete mal controllato con sintomi risalenti a [********].","Ho visto **** (**** anni), ****, riferito all' **** per diabete mal controllato con sintomi risalenti a ****.","Ho visto Calogero (30 anni), Batteriologo., riferito all' Casa Di Cura Val Di Sieve per diabete mal controllato con sintomi risalenti a 08-16-1974."


### Faker mode

In [89]:
deid_obfuscated = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('it')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [90]:
nlpPipeline_it = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_it,
      ner_subentity_it,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_it = nlpPipeline_it.fit(empty_data)

In [91]:
deid_lp_it = nlp.LightPipeline(model_it)

In [92]:
text = "Ho visto Gastone Montanariello (49 anni), virologo, riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."

In [93]:
pd.set_option("display.max_colwidth", 200)

result_it = deid_lp_it.annotate(text)

df_it = pd.DataFrame(list(zip(result_it["masked_with_entity"], 
                              result_it["masked_with_chars"],
                              result_it["masked_fixed_length_chars"], 
                              result_it["obfuscated"])),
                 columns= ["Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_it

Unnamed: 0,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Ho visto <PATIENT> (<AGE> anni), <PROFESSION>, riferito all' <HOSPITAL> per diabete mal controllato con sintomi risalenti a <DATE>.","Ho visto [*******************] (** anni), [******], riferito all' [******************] per diabete mal controllato con sintomi risalenti a [********].","Ho visto **** (**** anni), ****, riferito all' **** per diabete mal controllato con sintomi risalenti a ****.","Ho visto Janese Betters (50 anni), Chemical engineer, riferito all' WESTERN STATE HOSPITAL per diabete mal controllato con sintomi risalenti a 04-20-1996."
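A quick sanity check on masked output is to confirm that none of the original PHI strings survive in it. This is a minimal sketch: `find_leaks` is a hypothetical helper, and its plain substring test is an assumption that ignores case and whitespace variants.

```python
def find_leaks(masked_text, phi_values):
    """Return the PHI strings that still appear verbatim in the text."""
    return [value for value in phi_values if value and value in masked_text]


original = ("Ho visto Gastone Montanariello (49 anni), virologo, riferito all' "
            "Ospedale San Camillo per diabete mal controllato con sintomi "
            "risalenti a marzo 2015.")
masked = ("Ho visto <PATIENT> (<AGE> anni), <PROFESSION>, riferito all' "
          "<HOSPITAL> per diabete mal controllato con sintomi risalenti a <DATE>.")

# Chunks detected by the subentity model in the results shown earlier.
phi_values = ["Gastone Montanariello", "49", "virologo",
              "Ospedale San Camillo", "marzo 2015"]

print(find_leaks(masked, phi_values))         # []
print(len(find_leaks(original, phi_values)))  # 5
```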


## Pretrained Italian Deidentification Pipeline

- We developed a pretrained clinical deidentification pipeline that deidentifies PHI in Italian medical texts. The resulting text is available both masked and obfuscated.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - Sex
    - City
    - Street
    - Country
    - ZIP
    - Username
    - Profession
    - Phone
    - Email
    - Age
    - ID number
    - Medical record number
    - Account number
    - SSN
    - Plate Number
    - IP address
    - URL

In [94]:
deid_pipeline_it = nlp.PretrainedPipeline("clinical_deidentification", "it", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [95]:
text = """RAPPORTO DI RICOVERO
NOME: Lodovico Fibonacci
CODICE FISCALE: MVANSK92F09W408A
INDIRIZZO: Viale Burcardo 7
CITTÀ : Napoli
CODICE POSTALE: 80139
DATA DI NASCITA: 03/03/1946
ETÀ: 70 anni 
SESSO: M
EMAIL: lpizzo@tim.it
DATA DI AMMISSIONE: 12/12/2016
DOTTORE: Eva Viviani
RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.
L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.
L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale.
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml.

INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli
EMAIL: bferrabosco@poste.it
"""

In [96]:
pd.set_option("display.max_colwidth", None)

result_it = deid_pipeline_it.annotate(text)

df_it = pd.DataFrame(list(zip(result_it["sentence"], 
                              result_it["masked"],
                              result_it["masked_with_chars"], 
                              result_it["masked_fixed_length_chars"], 
                              result_it["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_it

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO,RAPPORTO DI RICOVERO
1,NOME: Lodovico Fibonacci,NOME: <PATIENT>,NOME: [****************],NOME: ****,NOME: Scotto-Polani
2,CODICE FISCALE: MVANSK92F09W408A,CODICE FISCALE: <SSN>,CODICE FISCALE: [**************],CODICE FISCALE: ****,CODICE FISCALE: ECI-QLN77G15L455Y
3,INDIRIZZO: Viale Burcardo 7\nCITTÀ : Napoli,INDIRIZZO: <STREET>\nCITTÀ : <CITY>,INDIRIZZO: [**************]\nCITTÀ : [****],INDIRIZZO: ****\nCITTÀ : ****,INDIRIZZO: Viale Orlando 808\nCITTÀ : Sesto Raimondo
4,CODICE POSTALE: 80139\nDATA DI NASCITA: 03/03/1946\nETÀ: 70 anni,CODICE POSTALE: <ZIP>DATA DI NASCITA: <DATE>\nETÀ: <AGE>anni,CODICE POSTALE: [***]DATA DI NASCITA: [********]\nETÀ: **anni,CODICE POSTALE: ****DATA DI NASCITA: ****\nETÀ: ****anni,CODICE POSTALE: 53581DATA DI NASCITA: 05/03/1946\nETÀ: 5anni
5,SESSO: M\nEMAIL: lpizzo@tim.it\nDATA DI AMMISSIONE: 12/12/2016,SESSO: <SEX>\nEMAIL: <E-MAIL>\nDATA DI AMMISSIONE: <DATE>,SESSO: *\nEMAIL: [***********]\nDATA DI AMMISSIONE: [********],SESSO: ****\nEMAIL: ****\nDATA DI AMMISSIONE: ****,SESSO: U\nEMAIL: HenryWatson@world.com\nDATA DI AMMISSIONE: 04/01/2017
6,DOTTORE: Eva Viviani,DOTTORE: <DOCTOR>,DOTTORE: [*********],DOTTORE: ****,DOTTORE: Sig. Fredo Marangoni
7,"RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: <AGE>anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: **anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: ****anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno.","RAPPORTO CLINICO: 5anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno."
8,"È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale.","È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale."
9,"L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV.","L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV."


# DE-IDENTIFICATION FOR PORTUGUESE

## Portuguese NER Deidentification Models
We have two different models you can use:
* `ner_deid_generic`, detects 8 entities
* `ner_deid_subentity`, detects 15 entities

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_generic_pt_3_0.html)  |pt| 2| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/04/13/ner_deid_subentity_pt_3_0.html)  |pt|


Creating pipeline

In [97]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_pt = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")\
    .setInputCols(["document","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)
- Sex

In [98]:
ner_generic_pt = medical.NerModel.pretrained("ner_deid_generic", "pt", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
[OK!]


In [99]:
ner_generic_pt.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-PROFESSION',
 'B-CONTACT',
 'I-ID',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-SEX',
 'I-SEX',
 'B-AGE']
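
The class list above uses IOB tags, so each entity can appear twice (`B-` for the first token, `I-` for continuation tokens). A minimal sketch, assuming only the tag list printed above, that recovers the distinct entity names; `entity_names` is a hypothetical helper, not part of the library:

```python
def entity_names(iob_classes):
    """Return the sorted entity names behind a list of IOB tags such as 'B-NAME'."""
    names = set()
    for tag in iob_classes:
        if tag == "O":  # 'O' marks tokens outside any entity
            continue
        prefix, _, name = tag.partition("-")  # split at the first hyphen only
        if prefix in ("B", "I") and name:
            names.add(name)
    return sorted(names)


# Class list copied from the getClasses() output above.
generic_pt_classes = [
    'O', 'I-LOCATION', 'I-CONTACT', 'I-PROFESSION', 'I-NAME', 'I-DATE',
    'B-ID', 'B-PROFESSION', 'B-CONTACT', 'I-ID', 'B-NAME', 'B-DATE',
    'B-LOCATION', 'B-SEX', 'I-SEX', 'B-AGE',
]

generic_pt_entities = entity_names(generic_pt_classes)
print(len(generic_pt_entities), generic_pt_entities)
```

The same helper can be applied to the output of any `getClasses()` call in this notebook to check the entity count quoted for a model.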

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `SEX`, `EMAIL`, `ZIP`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`

In [100]:
ner_subentity_pt = medical.NerModel.pretrained("ner_deid_subentity", "pt", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
[OK!]


In [101]:
ner_subentity_pt.getClasses()

['O',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-PROFESSION',
 'I-ID',
 'B-CITY',
 'B-DATE',
 'B-PATIENT',
 'B-SEX',
 'I-SEX',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ID',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-STREET',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-COUNTRY']

### Pipeline

In [102]:
nlpPipeline_pt = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_pt,
      ner_generic_pt,
      ner_converter_generic,
      ner_subentity_pt,
      ner_converter_subentity,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_pt = nlpPipeline_pt.fit(empty_data)

In [103]:
text = """Detalhes do paciente.
Nome do paciente:  Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos 
Data de admissão: 17/06/2016.
Doutora: Maria Santos"""

text_df = spark.createDataFrame([[text]]).toDF("text")
result_pt = model_pt.transform(text_df)

Results for `ner_deid_generic`

In [104]:
result_pt.select(F.explode(F.arrays_zip(result_pt.ner_chunk_generic.result, result_pt.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|Pedro Gonçalves  |NAME     |
|2569870          |ID       |
|Rua Das Flores 23|LOCATION |
|21754-987        |LOCATION |
|10/10/1963       |DATE     |
|53               |AGE      |
|17/06/2016       |DATE     |
|Maria Santos     |NAME     |
+-----------------+---------+



Results for `ner_deid_subentity`

In [105]:
result_pt.select(F.explode(F.arrays_zip(result_pt.ner_chunk_subentity.result, result_pt.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|Pedro Gonçalves  |PATIENT  |
|2569870          |ID       |
|Rua Das Flores 23|STREET   |
|21754-987        |ZIP      |
|10/10/1963       |DATE     |
|53               |AGE      |
|17/06/2016       |DATE     |
|Maria Santos     |DOCTOR   |
+-----------------+---------+
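
The two result tables tag the same chunks at different granularity. A minimal sketch of how the subentity labels roll up into the generic ones; `SUBENTITY_TO_GENERIC` is a hypothetical mapping inferred from the two tables and the category descriptions of `ner_deid_generic`, and it is not shipped with the models:

```python
# Hypothetical roll-up from fine-grained labels to the generic categories.
SUBENTITY_TO_GENERIC = {
    "PATIENT": "NAME",
    "DOCTOR": "NAME",
    "STREET": "LOCATION",
    "CITY": "LOCATION",
    "ZIP": "LOCATION",
    "HOSPITAL": "LOCATION",
    "ORGANIZATION": "LOCATION",
    "PHONE": "CONTACT",
    "EMAIL": "CONTACT",
}


def to_generic(label):
    """Map a subentity label to its generic category; unknown labels pass through."""
    return SUBENTITY_TO_GENERIC.get(label, label)


# Rows copied from the ner_deid_subentity result table.
subentity_rows = [
    ("Pedro Gonçalves", "PATIENT"),
    ("2569870", "ID"),
    ("Rua Das Flores 23", "STREET"),
    ("21754-987", "ZIP"),
    ("10/10/1963", "DATE"),
    ("53", "AGE"),
    ("17/06/2016", "DATE"),
    ("Maria Santos", "DOCTOR"),
]

rolled_up = [(chunk, to_generic(label)) for chunk, label in subentity_rows]
for chunk, label in rolled_up:
    print(f"{chunk:<18}{label}")
```

For this sample the rolled-up labels match the `ner_deid_generic` table row by row.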



## DeIdentification

### Obfuscation mode

In [106]:
# Downloading faker entity list.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_pt.txt
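
The downloaded file supplies the replacement values used in obfuscation mode. A minimal sketch of reading such a file, assuming one `value#LABEL` pair per line with `#` as the separator (an assumption here; inspect `obfuscate_pt.txt` to confirm). `parse_ref_lines` and the sample lines are hypothetical:

```python
from collections import defaultdict


def parse_ref_lines(lines, sep="#"):
    """Group replacement values by entity label, skipping blank or malformed lines."""
    fakes = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or sep not in line:
            continue
        value, label = line.rsplit(sep, 1)  # the label follows the last separator
        fakes[label.strip().upper()].append(value.strip())
    return dict(fakes)


# Illustrative lines only; the real file may contain different values.
sample_lines = [
    "Diana Pereira#PATIENT",
    "Nelson Ferreira#DOCTOR",
    "",
    "Carlos Melo#DOCTOR",
]

parsed = parse_ref_lines(sample_lines)
print(parsed)
```

To inspect the real file, pass `open("obfuscate_pt.txt", encoding="utf-8")` instead of `sample_lines`.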

In [107]:
deid_masked_entity = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_pt.txt')\
    .setObfuscateRefSource("file")
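
The three masking policies differ only in what replaces each detected chunk. A minimal pure-Python sketch of that replacement rule, inferred from the outputs shown in this notebook rather than from the library source; `mask_chunk` is a hypothetical helper, and the behavior for chunks of exactly three characters is an assumption:

```python
def mask_chunk(text, label, policy, fixed_length=4):
    """Return the replacement string for one detected chunk under a masking policy."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "same_length_chars":
        n = len(text)
        # Short chunks become plain asterisks; longer ones keep their length
        # with brackets taking the first and last positions.
        return "*" * n if n <= 2 else "[" + "*" * (n - 2) + "]"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown masking policy: {policy}")


for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_chunk("Antonio Gonçalves", "PATIENT", policy))
```

`same_length_chars` preserves the character offsets of the surrounding text, while `fixed_length_chars` also hides how long the original value was.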

In [108]:
nlpPipeline_pt = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_pt,
      ner_subentity_pt,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_pt = nlpPipeline_pt.fit(empty_data)

In [109]:
deid_lp_pt = nlp.LightPipeline(model_pt)

In [110]:
text = """Detalhes do paciente.
Nome do paciente: Antonio Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 23 anos 
Data de admissão: 17/06/2016.
Doutora: Maria Santos"""

In [111]:
pd.set_option("display.max_colwidth", 200)

result_pt = deid_lp_pt.annotate(text)

df_pt = pd.DataFrame(list(zip(result_pt["sentence"],
                              result_pt["masked_with_entity"], 
                              result_pt["masked_with_chars"],
                              result_pt["masked_fixed_length_chars"], 
                              result_pt["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_pt

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.
1,Nome do paciente: Antonio Gonçalves,Nome do paciente: <PATIENT>,Nome do paciente: [***************],Nome do paciente: ****,Nome do paciente: Diana Pereira
2,NHC: 2569870.,NHC: <ID>.,NHC: [*****].,NHC: ****.,NHC: 602449 86.
3,Endereço: Rua Das Flores 23.\nCódigo Postal: 21754-987.,Endereço: <STREET>.\nCódigo Postal: <ZIP>.,Endereço: [***************].\nCódigo Postal: [*******].,Endereço: ****.\nCódigo Postal: ****.,"Endereço: Largo das Portas do Mar, 589.\nCódigo Postal: 74536-889."
4,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.
5,Data de nascimento: 10/10/1963.,Data de nascimento: <DATE>.,Data de nascimento: [********].,Data de nascimento: ****.,Data de nascimento: 31/10/1963.
6,Idade: 23 anos,Idade: <AGE> anos,Idade: ** anos,Idade: **** anos,Idade: 64 anos
7,Data de admissão: 17/06/2016.,Data de admissão: <DATE>.,Data de admissão: [********].,Data de admissão: ****.,Data de admissão: 23/07/2016.
8,\nDoutora: Maria Santos,\nDoutora: <DOCTOR>,\nDoutora: [**********],\nDoutora: ****,\nDoutora: Nelson Ferreira


### Faker mode

In [112]:
deid_obfuscated_faker = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('pt')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [113]:
nlpPipeline_pt = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_pt,
      ner_subentity_pt,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_pt = nlpPipeline_pt.fit(empty_data)

In [114]:
deid_lp_pt = nlp.LightPipeline(model_pt)

In [115]:
pd.set_option("display.max_colwidth", 200)

result_pt = deid_lp_pt.annotate(text)

df_pt = pd.DataFrame(list(zip(result_pt["sentence"],
                              result_pt["masked_with_entity"], 
                              result_pt["masked_with_chars"],
                              result_pt["masked_fixed_length_chars"], 
                              result_pt["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_pt

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.,Detalhes do paciente.
1,Nome do paciente: Antonio Gonçalves,Nome do paciente: <PATIENT>,Nome do paciente: [***************],Nome do paciente: ****,Nome do paciente: Marthe Adams
2,NHC: 2569870.,NHC: <ID>.,NHC: [*****].,NHC: ****.,NHC: H8117096.
3,Endereço: Rua Das Flores 23.\nCódigo Postal: 21754-987.,Endereço: <STREET>.\nCódigo Postal: <ZIP>.,Endereço: [***************].\nCódigo Postal: [*******].,Endereço: ****.\nCódigo Postal: ****.,Endereço: 4646 John R St.\nCódigo Postal: 00020.
4,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.,Dados de cuidados.
5,Data de nascimento: 10/10/1963.,Data de nascimento: <DATE>.,Data de nascimento: [********].,Data de nascimento: ****.,Data de nascimento: 01/12/1963.
6,Idade: 23 anos,Idade: <AGE> anos,Idade: ** anos,Idade: **** anos,Idade: 30 anos
7,Data de admissão: 17/06/2016.,Data de admissão: <DATE>.,Data de admissão: [********].,Data de admissão: ****.,Data de admissão: 16/07/2016.
8,\nDoutora: Maria Santos,\nDoutora: <DOCTOR>,\nDoutora: [**********],\nDoutora: ****,\nDoutora: Dr Bonnita Fragmin
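
With `setObfuscateDate(True)`, the `dd/MM/yyyy` dates in the table above keep their format but move forward by some number of days. A minimal sketch of that kind of shift using only the standard library; `shift_date` is a hypothetical helper, and the offsets 52 and 29 are simply the differences read off the table, not values taken from the library:

```python
from datetime import datetime, timedelta


def shift_date(date_text, days, fmt="%d/%m/%Y"):
    """Shift a date string by a number of days, keeping the original format."""
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


# Offsets measured from the faker-mode output: each date moved by a different amount.
print(shift_date("10/10/1963", 52))
print(shift_date("17/06/2016", 29))
```

Because the two dates move by different offsets, the interval between birth and admission is not preserved in this output.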


## Pretrained Portuguese Deidentification Pipeline

- We developed a clinical deidentification pretrained pipeline that can be used to deidentify PHI in Portuguese medical texts. The PHI will be masked and obfuscated in the resulting text.
- The pipeline can mask and obfuscate:
    - Patient
    - Doctor
    - Hospital
    - Date
    - Organization
    - Sex
    - City
    - Street
    - Country
    - ZIP
    - Username
    - Profession
    - Phone
    - Email
    - Age
    - ID number
    - Medical record number
    - Account number
    - SSN
    - Plate Number
    - IP address
    - URL

In [116]:
deid_pipeline_pt = nlp.PretrainedPipeline("clinical_deidentification", "pt", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.2 GB
[OK!]


In [117]:
text = """RELAÇÃO HOSPITALAR
NOME: Pedro Gonçalves
NHC: MVANSK92F09W408A
ENDEREÇO: Rua Burcardo 7
CÓDIGO POSTAL: 80139
DATA DE NASCIMENTO: 03/03/1946
IDADE: 70 anos
SEXO: Homens
E-MAIL: pgon21@tim.pt
DATA DE ADMISSÃO: 12/12/2016
DOUTORA: Eva Andrade
RELATO CLÍNICO: 70 anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.
Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.
O exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.
A urinálise mostrou 4 hemácias/campo e 0-5 leucócitos/campo; o resto do sedimento era normal.
O hemograma é normal; a bioquímica mostrou uma glicemia de 169 mg/dl e triglicerídeos 456 mg/dl; função hepática e renal são normais. PSA de 1,16 ng/ml.

DIRIGIDA A: Dr. Eva Andrade - Centro Hospitalar do Medio Ave - Avenida Dos Aliados, 56
E-MAIL: evandrade@poste.pt
"""

In [118]:
pd.set_option("display.max_colwidth", None)

result_pt = deid_pipeline_pt.annotate(text)

df_pt = pd.DataFrame(list(zip(result_pt["sentence"], 
                           result_pt["masked"],
                           result_pt["masked_with_chars"], 
                           result_pt["masked_fixed_length_chars"], 
                           result_pt["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_pt

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,RELAÇÃO HOSPITALAR\nNOME: Pedro Gonçalves,RELAÇÃO HOSPITALAR\nNOME: <DOCTOR>,RELAÇÃO HOSPITALAR\nNOME: [*************],RELAÇÃO HOSPITALAR\nNOME: ****,RELAÇÃO HOSPITALAR\nNOME: Eva Coutinho
1,NHC: MVANSK92F09W408A,NHC: <ID>,NHC: [**************],NHC: ****,NHC: 124 445 311
2,ENDEREÇO: Rua Burcardo 7,ENDEREÇO: <STREET>,ENDEREÇO: [************],ENDEREÇO: ****,"ENDEREÇO: Avenida Dos Aliados, 56"
3,CÓDIGO POSTAL: 80139\nDATA DE NASCIMENTO: 03/03/1946,CÓDIGO POSTAL: <ZIP>\nDATA DE NASCIMENTO: <DATE>,CÓDIGO POSTAL: [***]\nDATA DE NASCIMENTO: [********],CÓDIGO POSTAL: ****\nDATA DE NASCIMENTO: ****,CÓDIGO POSTAL: 4099\nDATA DE NASCIMENTO: 27/04/1946
4,IDADE: 70 anos,IDADE: <AGE> anos,IDADE: ** anos,IDADE: **** anos,IDADE: 36 anos
5,SEXO: Homens,SEXO: <SEX>,SEXO: [****],SEXO: ****,SEXO: Mulher
6,E-MAIL: pgon21@tim.pt\nDATA DE ADMISSÃO: 12/12/2016,E-MAIL: <EMAIL>\nDATA DE ADMISSÃO: <DATE>,E-MAIL: [***********]\nDATA DE ADMISSÃO: [********],E-MAIL: ****\nDATA DE ADMISSÃO: ****,E-MAIL: richard@yahoo.pt\nDATA DE ADMISSÃO: 17/12/2016
7,DOUTORA: Eva Andrade,DOUTORA: <DOCTOR>,DOUTORA: [*********],DOUTORA: ****,DOUTORA: Carlos Melo
8,"RELATO CLÍNICO: 70 anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: <AGE> anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: ** anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: **** anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia.","RELATO CLÍNICO: 36 anos, aposentado, sem alergia a medicamentos conhecida, com a seguinte história: ex-acidente de trabalho com fratura de vértebras e costelas; operado de doença de Dupuytren na mão direita e ponte ílio-femoral esquerda; diabetes tipo II, hipercolesterolemia e hiperuricemia; alcoolismo ativo, fuma 20 cigarros/dia."
9,"Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV.","Ele foi encaminhado a nós por apresentar hematúria macroscópica pós-evacuação em uma ocasião e microhematúria persistente posteriormente, com evacuação normal.\nO exame físico mostrou bom estado geral, com abdome e genitais normais; o toque retal foi compatível com adenoma de próstata grau I/IV."
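
After running a deidentification pipeline, it helps to confirm that none of the known PHI strings survive in the output. A minimal sketch of such a check using plain substring matching; `find_leaks` is a hypothetical helper, not a library function, and the sample rows are copied from the Obfuscated column above:

```python
def find_leaks(phi_values, deidentified_sentences):
    """Return (value, sentence_index) pairs where an original PHI string survives."""
    leaks = []
    for value in phi_values:
        needle = value.lower()
        for index, sentence in enumerate(deidentified_sentences):
            if needle in sentence.lower():
                leaks.append((value, index))
    return leaks


# PHI strings taken from the input text of this section.
original_phi = [
    "Pedro Gonçalves",
    "MVANSK92F09W408A",
    "Rua Burcardo 7",
    "pgon21@tim.pt",
    "Eva Andrade",
]

# A few rows of the Obfuscated column.
obfuscated_rows = [
    "NOME: Eva Coutinho",
    "NHC: 124 445 311",
    "ENDEREÇO: Avenida Dos Aliados, 56",
    "E-MAIL: richard@yahoo.pt",
    "DOUTORA: Carlos Melo",
]

leaks = find_leaks(original_phi, obfuscated_rows)
print(leaks)
```

A substring check only catches exact survivals. It would not flag a mislabeled entity such as the patient name in row 0, which the pipeline tagged as `<DOCTOR>` but still replaced.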


# DE-IDENTIFICATION FOR ROMANIAN


## Romanian NER Deidentification Models
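We have four different models you can use; this section demonstrates `ner_deid_generic` and `ner_deid_subentity`:
* `ner_deid_subentity`, detects 17 entities
* `ner_deid_subentity_bert`, detects 17 entities
* `ner_deid_generic`, detects 7 entities
* `ner_deid_generic_bert`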

|index|model|lang|index|model|lang|
|-----:|:-----|----|-----:|:-----|----|
| 1| [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_ro_3_0.html)  |ro| 3| [ner_deid_generic](https://nlp.johnsnowlabs.com/models)  |ro|
| 2| [ner_deid_subentity_bert](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_bert_ro_3_0.html)  |ro| 4| [ner_deid_generic_bert](https://nlp.johnsnowlabs.com/models)  |ro|


Creating pipeline

In [119]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings_ro = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "ro")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("word_embeddings")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]


### NER Deid Generic

**`ner_deid_generic`** extracts:
- Name
- Profession
- Age
- Date
- Contact (Telephone numbers, Email addresses)
- Location (Address, City, Postal code, Hospital Name, Organization)
- ID (Social Security numbers, Medical record numbers)

In [121]:
ner_generic_ro = medical.NerModel.pretrained("ner_deid_generic", "ro", "clinical/models")\
    .setInputCols(["sentence","token","word_embeddings"])\
    .setOutputCol("ner_deid_generic")

ner_converter_generic = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_deid_generic"])\
    .setOutputCol("ner_chunk_generic")

ner_deid_generic download started this may take some time.
[OK!]


In [122]:
ner_generic_ro.getClasses()

['O',
 'I-LOCATION',
 'I-CONTACT',
 'I-PROFESSION',
 'I-NAME',
 'I-DATE',
 'B-ID',
 'B-CONTACT',
 'B-PROFESSION',
 'B-NAME',
 'B-DATE',
 'B-LOCATION',
 'B-AGE',
 'I-AGE']

### NER Deid Subentity

**`ner_deid_subentity`** extracts:

`PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `EMAIL`, `ZIP`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `FAX`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`


In [123]:
ner_subentity_ro = medical.NerModel.pretrained("ner_deid_subentity", "ro", "clinical/models")\
    .setInputCols(["sentence","token","word_embeddings"])\
    .setOutputCol("ner_deid_subentity")

ner_converter_subentity = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_deid_subentity"])\
    .setOutputCol("ner_chunk_subentity")

ner_deid_subentity download started this may take some time.
[OK!]


In [124]:
ner_subentity_ro.getClasses()

['O',
 'B-MEDICALRECORD',
 'B-ORGANIZATION',
 'I-PROFESSION',
 'B-DOCTOR',
 'B-PROFESSION',
 'I-LOCATION-OTHER',
 'B-CITY',
 'B-DATE',
 'B-LOCATION-OTHER',
 'B-PATIENT',
 'I-DOCTOR',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'I-PATIENT',
 'B-PHONE',
 'I-PHONE',
 'B-HOSPITAL',
 'B-EMAIL',
 'B-IDNUM',
 'B-STREET',
 'B-FAX',
 'I-ORGANIZATION',
 'I-HOSPITAL',
 'B-AGE',
 'I-FAX',
 'I-AGE',
 'I-COUNTRY']

### Pipeline

In [125]:
nlpPipeline_ro = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_ro,
      ner_generic_ro,
      ner_converter_generic,
      ner_subentity_ro,
      ner_converter_subentity,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ro = nlpPipeline_ro.fit(empty_data)

In [126]:
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Timar
C.N.P : 2450502264401"""

text_df = spark.createDataFrame([[text]]).toDF("text")
result_ro = model_ro.transform(text_df)

Results for `ner_deid_generic`

In [127]:
result_ro.select(F.explode(F.arrays_zip(result_ro.ner_chunk_generic.result, result_ro.ner_chunk_generic.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr. 972        |LOCATION |
|Vaslui                      |LOCATION |
|737405 România              |LOCATION |
|+40(235)413773              |CONTACT  |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |NAME     |
|77                          |AGE      |
|Agota Evelyn Timar          |NAME     |
|2450502264401               |ID       |
+----------------------------+---------+



Results for `ner_deid_subentity`

In [128]:
result_ro.select(F.explode(F.arrays_zip(result_ro.ner_chunk_subentity.result, result_ro.ner_chunk_subentity.metadata)).alias("cols")) \
         .select(F.expr("cols['0']").alias("chunk"),
                 F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|HOSPITAL |
|Drumul Oprea Nr. 972        |STREET   |
|Vaslui                      |CITY     |
|737405                      |ZIP      |
|+40(235)413773              |PHONE    |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |PATIENT  |
|77                          |AGE      |
|Agota Evelyn Timar          |DOCTOR   |
|2450502264401               |IDNUM    |
+----------------------------+---------+
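
The two Romanian models do not always agree on chunk boundaries. A minimal sketch, using only the rows of the two result tables, that lists the chunk texts found by one model but not the other; `chunk_text_differences` is a hypothetical helper:

```python
def chunk_text_differences(first, second):
    """Return chunk texts unique to each of two (text, label) result lists."""
    first_texts = {text for text, _ in first}
    second_texts = {text for text, _ in second}
    return sorted(first_texts - second_texts), sorted(second_texts - first_texts)


# Rows copied from the ner_deid_generic result table.
generic_chunks = [
    ("Spitalul Pentru Ochi de Deal", "LOCATION"),
    ("Drumul Oprea Nr. 972", "LOCATION"),
    ("Vaslui", "LOCATION"),
    ("737405 România", "LOCATION"),
    ("+40(235)413773", "CONTACT"),
    ("25 May 2022", "DATE"),
    ("BUREAN MARIA", "NAME"),
    ("77", "AGE"),
    ("Agota Evelyn Timar", "NAME"),
    ("2450502264401", "ID"),
]

# Rows copied from the ner_deid_subentity result table.
subentity_chunks = [
    ("Spitalul Pentru Ochi de Deal", "HOSPITAL"),
    ("Drumul Oprea Nr. 972", "STREET"),
    ("Vaslui", "CITY"),
    ("737405", "ZIP"),
    ("+40(235)413773", "PHONE"),
    ("25 May 2022", "DATE"),
    ("BUREAN MARIA", "PATIENT"),
    ("77", "AGE"),
    ("Agota Evelyn Timar", "DOCTOR"),
    ("2450502264401", "IDNUM"),
]

only_generic, only_subentity = chunk_text_differences(generic_chunks, subentity_chunks)
print(only_generic, only_subentity)
```

Here the generic model folds the country into one `LOCATION` chunk, while the subentity model tags only the postal code as `ZIP` and leaves the country name untagged for this sample.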



## DeIdentification

### Obfuscation mode

In [129]:
# Downloading faker entity list.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/obfuscate_ro.txt

In [130]:
deid_masked_entity = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_entity")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")

deid_masked_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

deid_masked_fixed_char = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)

deid_obfuscated = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate_ro.txt')\
    .setObfuscateRefSource("file")

In [131]:
nlpPipeline_ro = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_ro,
      ner_subentity_ro,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ro = nlpPipeline_ro.fit(empty_data)

In [132]:
deid_lp_ro = nlp.LightPipeline(model_ro)

In [133]:
text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Timar
C.N.P : 2450502264401"""

In [134]:
pd.set_option("display.max_colwidth", 200)

result_ro = deid_lp_ro.annotate(text)

df_ro = pd.DataFrame(list(zip(result_ro["sentence"],
                              result_ro["masked_with_entity"], 
                              result_ro["masked_with_chars"],
                              result_ro["masked_fixed_length_chars"], 
                              result_ro["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ro

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România","<HOSPITAL>, <STREET> <CITY>, <ZIP> România","[**************************], [******************] [****], [****] România","****, **** ****, **** România","Academia Româna Spitalul Universitar de Urgenta “Elias”, Intrarea Diaconescu Anina, 654513 România"
1,Tel: +40(235)413773,Tel: <PHONE>,Tel: [************],Tel: ****,Tel: 0770 664 874
2,Data setului de analize: 25 May 2022,Data setului de analize: <DATE>,Data setului de analize: [*********],Data setului de analize: ****,Data setului de analize: 01-15-1988
3,"Nume si Prenume : BUREAN MARIA, Varsta: 77\nMedic : Agota Evelyn Timar","Nume si Prenume : <PATIENT>, Varsta: <AGE>\nMedic : <DOCTOR>","Nume si Prenume : [**********], Varsta: **\nMedic : [****************]","Nume si Prenume : ****, Varsta: ****\nMedic : ****","Nume si Prenume : Draguleasa Dorina, Varsta: 98\nMedic : NISTOR Eliana"
4,C.N.P : 2450502264401,C.N.P : <IDNUM>,C.N.P : [***********],C.N.P : ****,C.N.P : 626491510041


### Faker mode

In [135]:
deid_obfuscated_faker = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setLanguage('ro')\
    .setObfuscateDate(True)\
    .setObfuscateRefSource('faker')

In [136]:
nlpPipeline_ro = nlp.Pipeline(stages=[
      documentAssembler, 
      sentencerDL,
      tokenizer,
      word_embeddings_ro,
      ner_subentity_ro,
      ner_converter_subentity,
      deid_masked_entity,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated_faker
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model_ro = nlpPipeline_ro.fit(empty_data)

In [137]:
deid_lp_ro = nlp.LightPipeline(model_ro)

In [138]:
pd.set_option("display.max_colwidth", 200)

result_ro = deid_lp_ro.annotate(text)

df_ro = pd.DataFrame(list(zip(result_ro["sentence"],
                              result_ro["masked_with_entity"], 
                              result_ro["masked_with_chars"],
                              result_ro["masked_fixed_length_chars"], 
                              result_ro["obfuscated"])),
                 columns= ["Sentence", "Masked_with_entity", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ro

Unnamed: 0,Sentence,Masked_with_entity,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România","<HOSPITAL>, <STREET> <CITY>, <ZIP> România","[**************************], [******************] [****], [****] România","****, **** ****, **** România","Arad County Emergency Clinical Hospital, Petrovici Ștefănești, 00028 România"
1,Tel: +40(235)413773,Tel: <PHONE>,Tel: [************],Tel: ****,Tel: 0475 88 12 35
2,Data setului de analize: 25 May 2022,Data setului de analize: <DATE>,Data setului de analize: [*********],Data setului de analize: ****,Data setului de analize: 12-19-2002
3,"Nume si Prenume : BUREAN MARIA, Varsta: 77\nMedic : Agota Evelyn Timar","Nume si Prenume : <PATIENT>, Varsta: <AGE>\nMedic : <DOCTOR>","Nume si Prenume : [**********], Varsta: **\nMedic : [****************]","Nume si Prenume : ****, Varsta: ****\nMedic : ****","Nume si Prenume : Cadar Ivănescu, Varsta: 60\nMedic : Dr Matei Gherman"
4,C.N.P : 2450502264401,C.N.P : <IDNUM>,C.N.P : [***********],C.N.P : ****,C.N.P : BS:7922278


## Pretrained Romanian Deidentification Pipeline

- We developed a pretrained clinical deidentification pipeline that deidentifies protected health information (PHI) in Romanian medical texts. The pipeline returns the text with the PHI masked and with the PHI obfuscated.
- The pipeline can mask and obfuscate:
  - AGE
  - CITY
  - COUNTRY
  - DATE
  - DOCTOR
  - EMAIL
  - FAX
  - HOSPITAL
  - IDNUM
  - LOCATION-OTHER
  - MEDICALRECORD
  - ORGANIZATION
  - PATIENT
  - PHONE
  - PROFESSION
  - STREET
  - ZIP
  - ACCOUNT
  - LICENSE
  - PLATE
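
To make the three masking outputs concrete, here is a minimal plain-Python sketch of how detected spans could be rewritten under each policy. It is not the Spark NLP implementation: `mask_text` and the `(start, end, label)` span format are hypothetical helpers for illustration, and the same-length rule (brackets counted toward the length, bare asterisks for spans of two characters or fewer) is an assumption inferred from the outputs shown in this notebook.

```python
from typing import List, Tuple

# (start, end_exclusive, label): a hypothetical span format used only in this sketch.
Span = Tuple[int, int, str]


def mask_text(text: str, spans: List[Span], policy: str) -> str:
    """Replace each span in `text` according to one of three masking policies."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # copy the untouched text before the span
        length = end - start
        if policy == "entity_labels":
            # Replace the span with its entity label, e.g. <PHONE>.
            pieces.append(f"<{label}>")
        elif policy == "same_length_chars":
            # Assumption inferred from the notebook outputs: the brackets count
            # toward the original length, and very short spans get asterisks only.
            if length > 2:
                pieces.append("[" + "*" * (length - 2) + "]")
            else:
                pieces.append("*" * length)
        elif policy == "fixed_length_chars":
            # Always four asterisks, regardless of the original length.
            pieces.append("****")
        else:
            raise ValueError(f"unknown policy: {policy}")
        cursor = end
    pieces.append(text[cursor:])  # copy the tail after the last span
    return "".join(pieces)


sample = "Tel: +40(235)413773"
phone_span = [(5, 19, "PHONE")]
for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_text(sample, phone_span, policy))
```

The fixed-length variant hides the length of the original value as well as its content, which is why it appears as a separate column alongside the same-length variant.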

In [139]:
deid_pipeline_ro = nlp.PretrainedPipeline("clinical_deidentification", "ro", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.1 GB
[OK!]


In [140]:
text = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022 
Varsta : 77, Nume si Prenume : BUREAN MARIA 
Tel: +40(235)413773, E-mail : hale@gmail.com,
Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, 
Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """

The results can also be inspected vertically by creating a pandas DataFrame as follows:

In [141]:
pd.set_option("display.max_colwidth", None)

result_ro = deid_pipeline_ro.annotate(text)

df_ro = pd.DataFrame(list(zip(result_ro["sentence"], 
                           result_ro["masked"],
                           result_ro["masked_with_chars"], 
                           result_ro["masked_fixed_length_chars"], 
                           result_ro["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df_ro

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022","Medic : Dr. <DOCTOR>, C.N.P : <IDNUM>, Data setului de analize: <DATE>","Medic : Dr. [**********], C.N.P : [***********], Data setului de analize: [*********]","Medic : Dr. ****, C.N.P : ****, Data setului de analize: ****","Medic : Dr. Doina Gheorghiu, C.N.P : 6794561192919, Data setului de analize: 01-04-2001"
1,"Varsta : 77, Nume si Prenume : BUREAN MARIA","Varsta : <AGE>, Nume si Prenume : <PATIENT>","Varsta : **, Nume si Prenume : [**********]","Varsta : ****, Nume si Prenume : ****","Varsta : 91, Nume si Prenume : Dragomir Emilia"
2,"Tel: +40(235)413773, E-mail : hale@gmail.com,\nLicență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, \nSpitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405","Tel: <PHONE>, E-mail : <EMAIL>,\nLicență : <LICENSE>, Înmatriculare : <PLATE>, Cont : <ACCOUNT>, \n<HOSPITAL> <STREET> <CITY>, <ZIP>","Tel: [************], E-mail : [************],\nLicență : [*********], Înmatriculare : [******], Cont : [******************], \n[**************************] [******************] [****], [****]","Tel: ****, E-mail : ****,\nLicență : ****, Înmatriculare : ****, Cont : ****, \n**** **** ****, ****","Tel: 0248 551 376, E-mail : tudorsmaranda@kappa.ro,\nLicență : T003485962M, Înmatriculare : AR-65-UPQ, Cont : KHHO5029180812813651, \nCentrul Medical de Evaluare si Recuperare pentru Copii si Tineri Cristian Serban Buzias Aleea Voinea Curcani, 328479"
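
As a quick sanity check on output like the table above, one could confirm that none of the known original PHI values survive in a deidentified column. The sketch below is plain Python and assumes the result is a dict mapping column names to lists of strings, as in the `annotate` calls in this notebook; `find_leaks` is a hypothetical helper, not part of any library, and the sample values are copied from the table.

```python
from typing import Dict, List, Tuple


def find_leaks(result: Dict[str, List[str]],
               phi_values: List[str],
               column: str = "obfuscated") -> List[Tuple[int, str]]:
    """Return (sentence_index, value) pairs for PHI values still present in `column`."""
    leaks = []
    for index, sentence in enumerate(result[column]):
        lowered = sentence.casefold()
        for value in phi_values:
            # Naive case-insensitive substring match; short values such as ages
            # can produce false positives inside longer numbers.
            if value.casefold() in lowered:
                leaks.append((index, value))
    return leaks


# Sample values copied from the second row of the output table.
sample_result = {
    "sentence": ["Varsta : 77, Nume si Prenume : BUREAN MARIA"],
    "obfuscated": ["Varsta : 91, Nume si Prenume : Dragomir Emilia"],
}
known_phi = ["77", "BUREAN MARIA"]

print(find_leaks(sample_result, known_phi))              # checks the obfuscated column
print(find_leaks(sample_result, known_phi, "sentence"))  # checks the original sentences
```

An empty list for the obfuscated column only shows that the listed values are absent; it does not prove that every PHI mention in the text was detected.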
