![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.10.Fhir_DeIdentification.ipynb)

## 🔐 Spark Transformer for De-identifying FHIR Resources


> 📌 **Note: FhirDeIdentification Module Availability**  
>
> The `FhirDeIdentification` module is not enabled by default in your current package.  
> ➤ To use this module, please set `fhir_deid=True` in the `start()` function.  
> ➤ If you need help or access, feel free to contact us at 📧 [support@johnsnowlabs.com](mailto:support@johnsnowlabs.com).




This Spark-based transformer de-identifies **FHIR (Fast Healthcare Interoperability Resources)** documents according to configurable privacy rules.

It supports **field-level obfuscation** of FHIR JSON and XML documents, with each target field selected by a **FHIRPath**-style expression such as `Patient.name.given`.
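
To make the idea of path-based field selection concrete, here is a minimal plain-Python sketch of how a dotted path such as `Patient.name.given` can be resolved against a FHIR JSON document. The helper `select_values` is hypothetical and written only for illustration; it is not part of the library and covers only simple dotted paths, not the full FHIRPath language.

```python
from typing import Any, List


def select_values(resource: dict, path: str) -> List[Any]:
    """Collect every value reached by a dotted path such as 'Patient.name.given'."""
    root, *fields = path.split(".")
    # The first segment names the resource type; a mismatch selects nothing.
    if resource.get("resourceType") != root:
        return []
    current = [resource]
    for field in fields:
        next_level = []
        for node in current:
            if not isinstance(node, dict) or field not in node:
                continue
            value = node[field]
            # Repeating elements are JSON arrays; flatten them so the walk continues.
            if isinstance(value, list):
                next_level.extend(value)
            else:
                next_level.append(value)
        current = next_level
    return current


patient = {
    "resourceType": "Patient",
    "name": [{"family": "Doe", "given": ["John", "Michael"]}],
    "birthDate": "1970-01-01",
}
print(select_values(patient, "Patient.name.given"))  # ['John', 'Michael']
print(select_values(patient, "Patient.birthDate"))   # ['1970-01-01']
```

Each rule in this notebook pairs a path of this kind with an entity label, which decides how the selected values are replaced.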

---

## ✅ Key Features

- 📦 **FHIR Versions**: Supports `R4`, `R5`, and `DSTU3`
- 🧠 **Type-aware De-identification** strategies
- 🧾 **Parser Types**: Supports both `JSON` and `XML`
- ⚙️ **Modes**: `mask` or `obfuscate`

---





## 🔧 Parameters

| **Parameter** | **Description** |
|----------------|-----------------|
| **fhirVersion** | FHIR version to use for de-identification. Options: `R4`, `R5`, `DSTU3` |
| **parserType** | Parser type to use for de-identification. Options: `JSON`, `XML` |
| **mode** | Mode for Anonymizer. Options: `mask`, `obfuscate` |
| **dateEntities** | List of date entities. Default: [`DATE`, `DOB`, `DOD`] |
| **obfuscateDate** | When mode is `obfuscate`, determines whether to obfuscate dates. If True and obfuscation fails, `unnormalizedDateMode` is used. |
| **unnormalizedDateMode** | Mode to use for unformatted dates. Options: `mask`, `obfuscate`, `skip`. Default: `obfuscate` |
| **days** | Number of days to displace dates for obfuscation. If not set, a random value between 1 and 60 is used. |
| **dateFormats** | List of date formats to automatically displace if parsed. |
| **obfuscateRefSource** | Source for obfuscating entities (non-date). Options: `custom`, `faker`, `both` |
| **language** | Language for regex and faker data. Options: `en`, `de`, `es`, `fr`, `ar`, `ro`. Default: `en` |
| **seed** | Seed for deterministic obfuscation results. |
| **maskingPolicy** | Masking policy. Options: `same_length_chars`, `entity_labels`, `fixed_length_chars` |
| **fixedMaskLength** | Mask length when using the `fixed_length_chars` policy. |
| **sameLengthFormattedEntities** | List of formatted entities to preserve length during obfuscation. Supported: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE |
| **genderAwareness** | Use gender-aware names during obfuscation. Default: False |
| **ageRanges** | Integer list defining limits of age groups to preserve during obfuscation. |
| **selectiveObfuscationModes** | Dictionary of entity-wise obfuscation modes. Example: `{'NAME': ['mask_entity_labels'], 'PHONE': ['obfuscate']}` |
| **customFakers** | Custom dictionary of faker terms to use for specific entities. |
| **keepYear** | Keep year intact during date obfuscation. Default: False |
| **keepMonth** | Keep month intact during date obfuscation. Default: False |

---
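
The three `maskingPolicy` options in the parameter table can be pictured with a small plain-Python sketch. The function `mask_value` is hypothetical and only mirrors the policy names; the exact strings the library emits (for example, whether masks are wrapped in brackets) are an assumption here and may differ.

```python
def mask_value(value: str, label: str, policy: str, fixed_length: int = 4) -> str:
    """Illustrative masking only; not the library's implementation."""
    if policy == "entity_labels":
        # Replace the value with its entity label.
        return f"<{label}>"
    if policy == "same_length_chars":
        # Replace the value with a mask as long as the original.
        return "*" * len(value)
    if policy == "fixed_length_chars":
        # Replace the value with a mask of a fixed length (cf. fixedMaskLength).
        return "*" * fixed_length
    raise ValueError(f"Unknown masking policy: {policy}")


for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_value("Doe", "LAST_NAME", policy))
```

The fixed-length policy hides the length of the original value, whereas the same-length policy preserves it.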

💡 *Use this module to ensure privacy-compliant handling of healthcare data while maintaining structure and usability of FHIR resources.*

> ⚠️ **Attention**
>
> 💡 The `FhirDeIdentification` module is not supported by every version of the Spark NLP for Healthcare library.  
> 🚨 If you encounter an error when starting the Spark session, please contact 📧 info@johnsnowlabs.com.  
> ✅ For example, this notebook works properly with version 6.0.0.

# Setup

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
from pyspark.ml import Pipeline,PipelineModel
import sparknlp_jsl
import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
import warnings
warnings.filterwarnings('ignore')

spark = sparknlp_jsl.start(secret = license_keys["SECRET"],fhir_deid=True)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.0.0
Spark NLP_JSL Version : 6.0.0


In [None]:
from IPython.display import clear_output

# Util Functions

In [None]:
def read_fhir_as_dict(path: str) -> dict:
    # Load a FHIR JSON file and return it as a Python dict
    with open(path, "r") as f:
        return json.load(f)

In [None]:
def read_fhir_as_str(path: str) -> str:
    # Return the raw file contents without parsing
    with open(path, "r") as f:
        return f.read()

In [None]:
def parse_str_to_dict(fhir_str: str) -> dict:
  return json.loads(fhir_str)

In [None]:
def parse_dict_to_str(fhir_dict: dict) -> str:
  return json.dumps(fhir_dict)

In [None]:
def show_fhir_diff(original: dict, obfuscated: dict):
    """Show only changed fields between two FHIR dicts (nested-aware)"""
    def compare_dicts(d1, d2, path=""):
        diff = {}
        for key in d1:
            current_path = f"{path}.{key}" if path else key
            if key not in d2:
                pass
            elif isinstance(d1[key], dict) and isinstance(d2[key], dict):
                nested_diff = compare_dicts(d1[key], d2[key], current_path)
                diff.update(nested_diff)
            elif isinstance(d1[key], list) and isinstance(d2[key], list):
                for i, (item1, item2) in enumerate(zip(d1[key], d2[key])):
                    if isinstance(item1, dict) and isinstance(item2, dict):
                        nested_diff = compare_dicts(item1, item2, f"{current_path}[{i}]")
                        diff.update(nested_diff)
                    elif item1 != item2:
                        diff[f"{current_path}[{i}]"] = (item1, item2)
            elif d1[key] != d2[key]:
                diff[current_path] = (d1[key], d2[key])
        return diff

    differences = compare_dicts(original, obfuscated)

    for field, (orig_val, obf_val) in differences.items():
        print(f"Changed: {field} | {orig_val} → {obf_val}")
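
Note that `show_fhir_diff` reports only values that differ at paths present in both documents: keys found on one side only are skipped, and `zip` truncates lists to the shorter length. The following sketch is a hypothetical complement (the names `flatten` and `diff_report` are not from the notebook) that flattens both documents to path/value pairs and also reports added and removed paths.

```python
from typing import Any, Dict


def flatten(node: Any, path: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into {'a.b[0].c': value} pairs."""
    flat: Dict[str, Any] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, f"{path}.{key}" if path else key))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            flat.update(flatten(item, f"{path}[{i}]"))
    else:
        flat[path] = node
    return flat


def diff_report(original: dict, obfuscated: dict) -> Dict[str, dict]:
    """Return changed, removed, and added leaf paths between two documents."""
    before, after = flatten(original), flatten(obfuscated)
    return {
        "changed": {p: (before[p], after[p]) for p in before if p in after and before[p] != after[p]},
        "removed": {p: before[p] for p in before if p not in after},
        "added": {p: after[p] for p in after if p not in before},
    }


# Toy documents, made up for illustration
original = {"name": [{"family": "Doe", "given": ["John", "Michael"]}], "gender": "male"}
obfuscated = {"name": [{"family": "Cease", "given": ["Mylene"]}], "gender": "other", "active": True}
report = diff_report(original, obfuscated)
print(report["changed"])
print(report["removed"])
print(report["added"])
```

An empty `removed` and `added` section is a quick check that de-identification kept the document structure intact.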

# Basic Examples of Core Functionality


In [None]:
john_doe = """{
  "resourceType": "Patient",
  "id": "example",
  "name": [
    {
      "use": "official",
      "family": "Doe",
      "given": [
        "John",
        "Michael"
      ]
    }
  ],
  "telecom": [
    {
      "system": "email",
      "value": "john.doe@example.com"
    },
    {
      "system": "url",
      "value": "http://johndoe.com"
    }
  ],
  "birthDate": "1970-01-01",
  "gender": "male"
}"""


In [None]:
rules = {
  "Patient.birthDate" : "Date",
  "Patient.name.given" : "first_name",
  "Patient.name.family" : "last_name",
  "Patient.telecom.value" : "Email",
  "Patient.gender" : "GENDER"
}

fhir = (
    FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(88)
      .setCustomFakers(
          {
              "GENDER": ["female", "other"]
          }
      )
      .setObfuscateRefSource("both")
)

In [None]:
fhir_df = spark.createDataFrame([[john_doe]]).toDF("text")
fhir.transform(fhir_df).select("deid").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|deid                                                                                                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Cease","given":["Mylene","Anola"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"Aurora@google.com"}],

In [None]:
obfuscated_john = parse_str_to_dict(fhir.transform(fhir_df).select("deid").collect()[0]["deid"])
obfuscated_john

{'resourceType': 'Patient',
 'id': 'example',
 'name': [{'use': 'official',
   'family': 'Cease',
   'given': ['Mylene', 'Anola']}],
 'telecom': [{'system': 'email', 'value': 'Bryton@yahoo.com'},
  {'system': 'url', 'value': 'Aurora@google.com'}],
 'gender': 'other',
 'birthDate': '1970-01-21'}

In [None]:
show_fhir_diff(parse_str_to_dict(john_doe), obfuscated_john)

Changed: name[0].family | Doe → Cease
Changed: name[0].given[0] | John → Mylene
Changed: name[0].given[1] | Michael → Anola
Changed: telecom[0].value | john.doe@example.com → Bryton@yahoo.com
Changed: telecom[1].value | http://johndoe.com → Aurora@google.com
Changed: birthDate | 1970-01-01 → 1970-01-21
Changed: gender | male → other
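
The `birthDate` change above (1970-01-01 → 1970-01-21) is consistent with the fixed displacement configured by `setDays(20)`. As a rough illustration of that arithmetic, here is a plain-Python sketch; `shift_iso_date` is a hypothetical helper, not the library's code, and it handles only ISO 8601 dates and datetimes.

```python
from datetime import date, datetime, timedelta


def shift_iso_date(value: str, days: int) -> str:
    """Displace an ISO 8601 date or datetime string by a number of days."""
    delta = timedelta(days=days)
    if "T" in value:
        # Datetime with optional UTC offset; the offset is carried through unchanged.
        return (datetime.fromisoformat(value) + delta).isoformat()
    return (date.fromisoformat(value) + delta).isoformat()


print(shift_iso_date("1970-01-01", 20))                 # 1970-01-21
print(shift_iso_date("2015-11-01T17:00:14-05:00", 20))  # 2015-11-21T17:00:14-05:00
```

Because every date in a document moves by the same offset, intervals between events are preserved.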


## 🛡️ Direct FHIR De-identification Without Spark DataFrames


In [None]:
fake_john = fhir.deidentify(john_doe)
fake_john

'{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Cease","given":["Mylene","Anola"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"Aurora@google.com"}],"gender":"other","birthDate":"1970-01-21"}'

## ⚙️ De-identifying a List of FHIR Resources (Multi-threaded Under the Hood)



In [None]:
fakes = fhir.deidentify_list([john_doe, fake_john])
print(type(fakes))
fakes

<class 'list'>


['{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Cease","given":["Mylene","Anola"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"Aurora@google.com"}],"gender":"other","birthDate":"1970-01-21"}',
 '{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Capers","given":["Burna","Claria"]}],"telecom":[{"system":"email","value":"Adel@google.com"},{"system":"url","value":"Carrey@yahoo.com"}],"gender":"female","birthDate":"1970-02-10"}']
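
For intuition about processing a list of documents concurrently while returning results in input order, here is a generic sketch using Python's standard thread pool. It is not how the library is implemented; `fake_deidentify` is a hypothetical stand-in for the real call, and the toy documents are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_in_threads(items: List[str], worker: Callable[[str], str], max_workers: int = 4) -> List[str]:
    """Apply worker to every item on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever order the threads finish in.
        return list(pool.map(worker, items))


def fake_deidentify(doc: str) -> str:
    # Hypothetical stand-in for a per-document de-identification call.
    return doc.replace("Doe", "REDACTED")


docs = ['{"family": "Doe", "n": %d}' % i for i in range(5)]
results = process_in_threads(docs, fake_deidentify)
print(results[0])
```

Keeping results aligned with inputs is what lets the later examples compare `all_patients_contents[i]` with `all_patients_obfuscated[i]`.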

## XML Support


In [None]:
xml_john = """
<Patient xmlns="http://hl7.org/fhir">
  <id value="example"/>
  <name>
    <use value="official"/>
    <family value="Doe"/>
    <given value="John"/>
    <given value="Michael"/>
  </name>
  <telecom>
    <system value="email"/>
    <value value="john.doe@example.com"/>
  </telecom>
  <telecom>
    <system value="url"/>
    <value value="http://johndoe.com"/>
  </telecom>
  <birthDate value="1970-01-01"/>
  <gender value="male"/>
</Patient>
"""


In [None]:
rules = {
  "Patient.birthDate" : "Date",
  "Patient.name.given" : "first_name",
  "Patient.name.family" : "last_name",
  "Patient.telecom.value" : "Email",
  "Patient.gender" : "GENDER"
}

xml_fhir = (
    FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("XML")
      .setDays(20)
      .setSeed(88)
      .setCustomFakers(
          {
              "GENDER": ["female", "other"]
          }
      )
      .setObfuscateRefSource("both")
)

In [None]:
xml_fake_john = xml_fhir.deidentify(xml_john)
xml_fake_john

'<Patient xmlns="http://hl7.org/fhir"><id value="example"/><name><use value="official"/><family value="Cease"/><given value="Mylene"/><given value="Anola"/></name><telecom><system value="email"/><value value="Bryton@yahoo.com"/></telecom><telecom><system value="url"/><value value="Aurora@google.com"/></telecom><gender value="other"/><birthDate value="1970-01-21"/></Patient>'
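
The XML result is a single string, so inspecting individual fields requires parsing it. A minimal sketch with the standard library follows, using the output string shown above; the only assumption is the prefix name `f`, which is arbitrary and simply bound to the FHIR namespace.

```python
import xml.etree.ElementTree as ET

# Bind an arbitrary prefix to the FHIR namespace used by the document
FHIR_NS = {"f": "http://hl7.org/fhir"}

deid_xml = (
    '<Patient xmlns="http://hl7.org/fhir"><id value="example"/>'
    '<name><use value="official"/><family value="Cease"/>'
    '<given value="Mylene"/><given value="Anola"/></name>'
    '<telecom><system value="email"/><value value="Bryton@yahoo.com"/></telecom>'
    '<telecom><system value="url"/><value value="Aurora@google.com"/></telecom>'
    '<gender value="other"/><birthDate value="1970-01-21"/></Patient>'
)

root = ET.fromstring(deid_xml)
# FHIR XML stores primitive values in the "value" attribute of each element
family = root.find("f:name/f:family", FHIR_NS).get("value")
given = [el.get("value") for el in root.findall("f:name/f:given", FHIR_NS)]
birth_date = root.find("f:birthDate", FHIR_NS).get("value")

print(family, given, birth_date)
```

In the notebook, the same lookups can be run on `xml_fake_john` directly.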

## Encounter


In [None]:
!wget https://www.hl7.org/fhir/us/core/Encounter-example-1.json

--2025-09-29 17:35:41--  https://www.hl7.org/fhir/us/core/Encounter-example-1.json
Resolving www.hl7.org (www.hl7.org)... 3.18.251.254, 18.221.56.67
Connecting to www.hl7.org (www.hl7.org)|3.18.251.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2624 (2.6K) [application/json]
Saving to: ‘Encounter-example-1.json’


2025-09-29 17:35:41 (1.23 GB/s) - ‘Encounter-example-1.json’ saved [2624/2624]



In [None]:
encounter_example = read_fhir_as_str("Encounter-example-1.json")

In [None]:
rules = {
  "Encounter.id" : "ID",
  "Encounter.meta.lastUpdated" : "DATE",
  "Encounter.period.start" : "DATE",
  "Encounter.period.end" : "DATE",
  "Encounter.location.location.display" : "HOSPITAL"
}

encounter_fhir = (
    FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
)

In [None]:
obfuscated_encounter = encounter_fhir.deidentify(encounter_example)
obfuscated_encounter_dict = parse_str_to_dict(obfuscated_encounter)


In [None]:
show_fhir_diff(parse_str_to_dict(encounter_example), obfuscated_encounter_dict)

Changed: id | example-1 → vklpqgv-9
Changed: meta.lastUpdated | 2024-01-28T16:06:21-08:00 → 2024-02-17T16:06:21-08:00
Changed: period.start | 2015-11-01T17:00:14-05:00 → 2015-11-21T17:00:14-05:00
Changed: period.end | 2015-11-01T18:00:14-05:00 → 2015-11-21T18:00:14-05:00
Changed: location[0].location.display | Holy Family Hospital → ST. LUKE'S NORTHLAND HOSPITAL - SMITHVILLE


## Patient


In [None]:
!wget https://hl7.org/fhir/R4/patient-example-a.json

--2025-09-29 17:35:52--  https://hl7.org/fhir/R4/patient-example-a.json
Resolving hl7.org (hl7.org)... 3.18.251.254, 18.221.56.67
Connecting to hl7.org (hl7.org)|3.18.251.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690 (2.6K) [application/json]
Saving to: ‘patient-example-a.json’


2025-09-29 17:35:53 (1.56 GB/s) - ‘patient-example-a.json’ saved [2690/2690]



In [None]:
patient_example = read_fhir_as_str("patient-example-a.json")

In [None]:
rules = {
  "Patient.id" : "ID",
  "Patient.identifier.value" : "ID",
  "Patient.active" : "STATUS",
  "Patient.name.given" : "FIRST_NAME",
  "Patient.name.family": "LAST_NAME",
  "Patient.contact.organization.display" : "ORGANIZATION",
  "Patient.gender": "GENDER",
}

patient_fhir = (
    FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
      .setCustomFakers({
          "GENDER" : ["female"],
          "STATUS" : ["false"],
      })
)

In [None]:
patient_fake = (patient_fhir.deidentify(patient_example))
show_fhir_diff(parse_str_to_dict(patient_example), parse_str_to_dict(patient_fake))

Changed: id | pat1 → qla9
Changed: identifier[0].value | 654321 → 432109
Changed: active | True → False
Changed: name[0].family | Donald → Beagle
Changed: name[0].given[0] | Duck → Eudelia
Changed: gender | male → female
Changed: contact[0].organization.display | Walt Disney Corporation → EMCOR
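
In the outputs so far, values mapped to `ID` keep their length and character classes: digits stay digits, letters stay letters, and separators stay in place. This is an observation from the results shown, not a documented guarantee. The hypothetical helper below (`same_shape` is not part of the library) checks that property, which can be useful when downstream systems validate identifier formats.

```python
def char_class(ch: str) -> str:
    """Map a character to 'd' (digit), 'a' (letter), or itself (punctuation)."""
    if ch.isdigit():
        return "d"
    if ch.isalpha():
        return "a"
    return ch  # separators such as '-' must match exactly


def same_shape(original: str, obfuscated: str) -> bool:
    """True when both strings have the same length and per-position character class."""
    return len(original) == len(obfuscated) and all(
        char_class(a) == char_class(b) for a, b in zip(original, obfuscated)
    )


# Pairs taken from the Encounter and Patient outputs above
pairs = [("pat1", "qla9"), ("654321", "432109"), ("example-1", "vklpqgv-9")]
for original, obfuscated in pairs:
    print(original, "->", obfuscated, same_shape(original, obfuscated))
```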


# Real World Examples


In [None]:
!wget https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_latest.zip

--2025-09-29 17:36:11--  https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_latest.zip
Resolving synthetichealth.github.io (synthetichealth.github.io)... 185.199.110.153, 185.199.109.153, 185.199.108.153, ...
Connecting to synthetichealth.github.io (synthetichealth.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33868563 (32M) [application/x-zip-compressed]
Saving to: ‘synthea_sample_data_fhir_latest.zip’


2025-09-29 17:36:14 (44.2 MB/s) - ‘synthea_sample_data_fhir_latest.zip’ saved [33868563/33868563]



In [None]:
import zipfile
zip_path = 'synthea_sample_data_fhir_latest.zip'
extract_path = '/content/patients'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

## Simple Example with Patient

In [None]:
import os
from pyspark.sql import functions as F

# Find first .json file from extracted folder
extract_path = "/content/patients"
json_files = []

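# Use `filenames` rather than `files` to avoid shadowing `google.colab.files` imported in Setup
for root, dirs, filenames in os.walk(extract_path):
    for filename in filenames:
        if filename.endswith(".json"):
            json_files.append(os.path.join(root, filename))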

# Raise an error if no JSON found
if not json_files:
    raise FileNotFoundError("No JSON files found in the extracted folder.")

# Take the first one
first_json_path = json_files[0]
print(f"Using file: {first_json_path}")

# Load using binaryFile
patient_df = spark.read.format("binaryFile").load(first_json_path) \
    .select(
        F.input_file_name().alias("file_name"),
        F.decode(F.col("content"), "UTF-8").alias("text")
    )


Using file: /content/patients/Emmie273_Dibbert990_4de0cfdb-e8da-698b-0b53-882911ca2691.json


In [None]:
patient_df.show(truncate=200)

+--------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                             file_name|                                                                                                                                                                                                    text|
+--------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|file:///content/patients/Emmie273_Dibbert990_4de0cfdb-e8da-698b-0b53-882911ca2691.json|{\n  "resourceType": "Bundle",\n  "type": 

In [None]:
rules = {
  "Patient.name.given" : "FIRST_NAME",
  "Patient.name.family": "LAST_NAME",
  "Patient.name.prefix": "TITLE",
  "Patient.telecom.value" : "PHONE",
  "Patient.address.city" : "city",
  "Patient.address.country" : "COUNTRY",
  "Patient.address.postalCode" : "ZIP",
  "Patient.gender": "GENDER",
  "Patient.birthDate": "DATE",
  "Patient.deceasedDateTime": "DATE",
  "Patient.identifier.value" : "ID",
}

patient_fhir = (
    FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
      .setCustomFakers({
          "TITLE" : ["Dr.", "Mr.", "Mrs.", "Ms."],
          "GENDER" : ["male", "female"],
      })
)


In [None]:
collected_patient = patient_fhir.transform(patient_df).collect()[0]
show_fhir_diff(parse_str_to_dict(collected_patient["text"]), parse_str_to_dict(collected_patient["deid"]))

Changed: entry[0].resource.identifier[0].value | 4de0cfdb-e8da-698b-0b53-882911ca2691 → 2mv8demu-v6ml-476u-8u31-660799dl0479
Changed: entry[0].resource.identifier[1].value | 4de0cfdb-e8da-698b-0b53-882911ca2691 → 2mv8demu-v6ml-476u-8u31-660799dl0479
Changed: entry[0].resource.identifier[2].value | 999-11-7051 → 777-99-5839
Changed: entry[0].resource.identifier[3].value | S99979615 → R77757493
Changed: entry[0].resource.name[0].family | Dibbert990 → Knight
Changed: entry[0].resource.name[0].given[0] | Emmie273 → Reese
Changed: entry[0].resource.telecom[0].value | 555-262-4987 → 333-040-2765
Changed: entry[0].resource.gender | female → male
Changed: entry[0].resource.birthDate | 2008-04-14 → 2008-05-04
Changed: entry[0].resource.address[0].city | Taunton → Blacksburg
Changed: entry[0].resource.address[0].postalCode | 02780 → 80568
Changed: entry[0].resource.address[0].country | US → ERITREA
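
The diff paths above start with `entry[0].resource` because each Synthea file is a FHIR `Bundle` (see `"resourceType": "Bundle"` in the DataFrame preview), and the `Patient.*` rules were applied to the Patient resource inside it. To see what else a Bundle contains before writing rules, a small sketch like the following may help; `resource_type_counts` is a hypothetical helper and the toy bundle is made up for illustration.

```python
from collections import Counter


def resource_type_counts(bundle: dict) -> Counter:
    """Count the resource types found in a Bundle's entries."""
    return Counter(
        entry["resource"].get("resourceType", "Unknown")
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )


# Toy bundle, made up for illustration
toy_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Encounter", "id": "e2"}},
    ],
}
counts = resource_type_counts(toy_bundle)
print(counts.most_common())
```

In the notebook, `resource_type_counts(parse_str_to_dict(collected_patient["text"]))` would list the resource types in the loaded file.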


## Complete Dataset: All Patients

### Spark

In [None]:
from pyspark.sql import functions as F

all_df = spark.read.format("binaryFile") \
    .load("/content/patients/*.json") \
    .select(
        F.input_file_name().alias("file_name"),
        F.decode(F.col("content"), "UTF-8").alias("text")
    ).repartition(os.cpu_count() * 2).cache()


In [None]:
all_df.select("file_name").show(truncate=False, n=3)

+----------------------------------------------------------------------------------------------+
|file_name                                                                                     |
+----------------------------------------------------------------------------------------------+
|file:///content/patients/Mirna233_Irene779_Crooks415_2d1fb98d-b9c6-ff4b-77db-542d38ebab46.json|
|file:///content/patients/Wayne846_Kuhic920_4c307a40-2318-532b-76e8-c919bb2c06be.json          |
|file:///content/patients/Filiberto722_Boyle917_bfe8e255-0de6-21d5-f495-0050fcd4fa54.json      |
+----------------------------------------------------------------------------------------------+
only showing top 3 rows



In [None]:
%%time
patient_fhir.transform(all_df).write.mode("overwrite").parquet("content/all_deid.parquet")

CPU times: user 5.15 ms, sys: 2.22 ms, total: 7.38 ms
Wall time: 25.7 s


View a sample patient result

In [None]:
a_patient = spark.read.parquet("content/all_deid.parquet").limit(1).collect()[0]
print(a_patient["file_name"])

file:///content/patients/Nathan164_Schaden604_a7366165-34ce-48bf-a703-fcb4d6fd0ce1.json


In [None]:
show_fhir_diff(parse_str_to_dict(a_patient["text"]), parse_str_to_dict(a_patient["deid"]))

Changed: entry[0].resource.identifier[0].value | a7366165-34ce-48bf-a703-fcb4d6fd0ce1 → l5144943-12dv-26ue-l581-edu2m4em8dv9
Changed: entry[0].resource.identifier[1].value | a7366165-34ce-48bf-a703-fcb4d6fd0ce1 → l5144943-12dv-26ue-l581-edu2m4em8dv9
Changed: entry[0].resource.identifier[2].value | 999-28-6986 → 777-06-4764
Changed: entry[0].resource.identifier[3].value | S99943875 → R77721653
Changed: entry[0].resource.identifier[4].value | X67714985X → K45592763K
Changed: entry[0].resource.name[0].family | Schaden604 → Bayard
Changed: entry[0].resource.name[0].given[0] | Nathan164 → Leveda
Changed: entry[0].resource.name[0].given[1] | Douglass930 → Gaynel
Changed: entry[0].resource.name[0].prefix[0] | Mr. → Ms.
Changed: entry[0].resource.telecom[0].value | 555-455-5569 → 333-233-3347
Changed: entry[0].resource.gender | male → female
Changed: entry[0].resource.birthDate | 1947-11-03 → 1947-11-23
Changed: entry[0].resource.deceasedDateTime | 2024-10-12T02:58:04+00:00 → 2024-11-01T02:58:

### Spark-Less

In [None]:
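import glob
from pathlib import Path

# Path.read_text opens and closes each file, so no file handles are left open
all_patients_contents = [Path(path).read_text(encoding="utf-8") for path in glob.glob("/content/patients/*.json")]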

In [None]:
%%time
all_patients_obfuscated = patient_fhir.deidentify_list(all_patients_contents)

CPU times: user 17.8 s, sys: 3.1 s, total: 20.9 s
Wall time: 47.1 s


View a sample patient result

In [None]:
show_fhir_diff(parse_str_to_dict(all_patients_contents[0]), parse_str_to_dict(all_patients_obfuscated[0]))

Changed: entry[0].resource.identifier[0].value | 4de0cfdb-e8da-698b-0b53-882911ca2691 → 2mv8demu-v6ml-476u-8u31-660799dl0479
Changed: entry[0].resource.identifier[1].value | 4de0cfdb-e8da-698b-0b53-882911ca2691 → 2mv8demu-v6ml-476u-8u31-660799dl0479
Changed: entry[0].resource.identifier[2].value | 999-11-7051 → 777-99-5839
Changed: entry[0].resource.identifier[3].value | S99979615 → R77757493
Changed: entry[0].resource.name[0].family | Dibbert990 → Knight
Changed: entry[0].resource.name[0].given[0] | Emmie273 → Reese
Changed: entry[0].resource.telecom[0].value | 555-262-4987 → 333-040-2765
Changed: entry[0].resource.gender | female → male
Changed: entry[0].resource.birthDate | 2008-04-14 → 2008-05-04
Changed: entry[0].resource.address[0].city | Taunton → Blacksburg
Changed: entry[0].resource.address[0].postalCode | 02780 → 80568
Changed: entry[0].resource.address[0].country | US → ERITREA


View another sample patient result

In [None]:
show_fhir_diff(parse_str_to_dict(all_patients_contents[1]), parse_str_to_dict(all_patients_obfuscated[1]))

Changed: entry[0].resource.identifier[0].value | db60298f-7e8d-4810-1f24-6ebb3466f248 → mu48076e-5v6m-2698-9e02-4vuu1244e026
Changed: entry[0].resource.identifier[1].value | db60298f-7e8d-4810-1f24-6ebb3466f248 → mu48076e-5v6m-2698-9e02-4vuu1244e026
Changed: entry[0].resource.identifier[2].value | 999-25-5609 → 777-03-3487
Changed: entry[0].resource.identifier[3].value | S99930408 → R77718286
Changed: entry[0].resource.identifier[4].value | X74573091X → K52351879K
Changed: entry[0].resource.name[0].family | Lesch175 → Rooks
Changed: entry[0].resource.name[0].given[0] | Rafael239 → Lavonna
Changed: entry[0].resource.name[0].given[1] | Les282 → Mercedes
Changed: entry[0].resource.name[0].prefix[0] | Mr. → Ms.
Changed: entry[0].resource.telecom[0].value | 555-277-1613 → 333-055-9491
Changed: entry[0].resource.gender | male → female
Changed: entry[0].resource.birthDate | 2001-03-04 → 2001-03-24
Changed: entry[0].resource.address[0].city | Randolph → Avdimou
Changed: entry[0].resource.addre

## STU3

In [None]:
!wget https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_stu3_latest.zip

--2025-09-29 17:41:08--  https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_stu3_latest.zip
Resolving synthetichealth.github.io (synthetichealth.github.io)... 185.199.110.153, 185.199.109.153, 185.199.108.153, ...
Connecting to synthetichealth.github.io (synthetichealth.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19804574 (19M) [application/x-zip-compressed]
Saving to: ‘synthea_sample_data_fhir_stu3_latest.zip’


2025-09-29 17:42:14 (35.0 MB/s) - ‘synthea_sample_data_fhir_stu3_latest.zip’ saved [19804574/19804574]



In [None]:
import zipfile
zip_path_stu3 = 'synthea_sample_data_fhir_stu3_latest.zip'
extract_path_stu3 = '/content/patients_stu3'

with zipfile.ZipFile(zip_path_stu3, 'r') as zip_ref:
    zip_ref.extractall(extract_path_stu3)

In [None]:
rules = {
  "Patient.name.given" : "FIRST_NAME",
  "Patient.name.family": "LAST_NAME",
  "Patient.name.prefix": "TITLE",
  "Patient.telecom.value" : "PHONE",
  "Patient.address.city" : "city",
  "Patient.address.country" : "COUNTRY",
  "Patient.address.postalCode" : "ZIP",
  "Patient.gender": "GENDER",
  "Patient.birthDate": "DATE",
  "Patient.deceasedDateTime": "DATE",
  "Patient.identifier.value" : "ID",
}

stu3_fhir = (
    FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("STU3")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
      .setCustomFakers({
          "TITLE" : ["Dr.", "Mr.", "Mrs.", "Ms."],
          "GENDER" : ["male", "female"],
      })
)


In [None]:
from pyspark.sql import functions as F

all_df_stu3 = spark.read.format("binaryFile") \
    .load("/content/patients_stu3/*.json") \
    .select(
        F.input_file_name().alias("file_name"),
        F.decode(F.col("content"), "UTF-8").alias("text")
    ).repartition(os.cpu_count() * 2).cache()

In [None]:
print("Total number of patients:", all_df_stu3.count())


Total number of patients: 109


In [None]:
%%time
stu3_fhir.transform(all_df_stu3).write.mode("overwrite").parquet("content/all_deid_stu3.parquet")

CPU times: user 6.19 ms, sys: 2.14 ms, total: 8.33 ms
Wall time: 17.4 s


In [None]:
a_patient_stu3 = spark.read.parquet("content/all_deid_stu3.parquet").limit(1).collect()[0]
print(a_patient_stu3["file_name"])

file:///content/patients_stu3/Pricilla716_Ellena683_D'Amore443_288ba4b8-56da-b7e0-c16d-4ac111e41361.json


In [None]:
show_fhir_diff(parse_str_to_dict(a_patient_stu3["text"]), parse_str_to_dict(a_patient_stu3["deid"]))

Changed: entry[0].resource.identifier[0].value | 288ba4b8-56da-b7e0-c16d-4ac111e41361 → 066ul2u6-34ml-u5v8-d94m-2ld999v29149
Changed: entry[0].resource.identifier[1].value | 288ba4b8-56da-b7e0-c16d-4ac111e41361 → 066ul2u6-34ml-u5v8-d94m-2ld999v29149
Changed: entry[0].resource.identifier[2].value | 999-60-6581 → 777-48-4369
Changed: entry[0].resource.identifier[3].value | S99942031 → R77720819
Changed: entry[0].resource.identifier[4].value | X23674993X → K01452771K
Changed: entry[0].resource.name[0].family | D'Amore443 → Cal
Changed: entry[0].resource.name[0].given[0] | Pricilla716 → Margarito
Changed: entry[0].resource.name[0].given[1] | Ellena683 → Gilford
Changed: entry[0].resource.name[1].family | Emmerich580 → Canning
Changed: entry[0].resource.name[1].given[0] | Pricilla716 → Margarito
Changed: entry[0].resource.name[1].given[1] | Ellena683 → Gilford
Changed: entry[0].resource.telecom[0].value | 555-686-9216 → 333-464-7094
Changed: entry[0].resource.gender | female → male
Changed: