![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.10.Fhir_DeIdentification.ipynb)

## 🔐 Spark Transformer for De-identifying FHIR Resources


> 📌 **Note: FhirDeIdentification Module Availability**  
>
> The `FhirDeIdentification` module is not enabled by default in your current package.  
> ➤ To use this module, please set `fhir_deid=True` in the `start()` function.  
> ➤ If you need help or access, feel free to contact us at 📧 [support@johnsnowlabs.com](mailto:support@johnsnowlabs.com).




This Spark-based transformer de-identifies **FHIR (Fast Healthcare Interoperability Resources)** documents according to configurable privacy rules.

It supports **field-level obfuscation** of FHIR documents (JSON or XML) using **FHIRPath** expressions.
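
To make the idea of path-based field selection concrete, here is a minimal pure-Python sketch of how a dotted path such as `Patient.name.given` can address every matching leaf in a FHIR JSON document. The helper `find_paths` is hypothetical and only illustrates the concept; it is not how `FhirDeIdentification` resolves paths, and it handles only plain `Resource.field.subfield` paths, not full FHIRPath.

```python
import json

def find_paths(resource: dict, fhir_path: str):
    """Yield (concrete_path, value) for every leaf matched by a dotted path.

    Hypothetical helper: lists are fanned out element by element, and the
    first path segment must equal the document's resourceType.
    """
    head, *fields = fhir_path.split(".")
    if resource.get("resourceType") != head:
        return

    def walk(node, remaining, trail):
        if isinstance(node, list):
            # Visit every list element and record its index in the trail.
            for i, item in enumerate(node):
                yield from walk(item, remaining, f"{trail}[{i}]")
        elif not remaining:
            # No path segments left: this node is a matched value.
            yield trail, node
        elif isinstance(node, dict) and remaining[0] in node:
            key = remaining[0]
            next_trail = f"{trail}.{key}" if trail else key
            yield from walk(node[key], remaining[1:], next_trail)

    yield from walk(resource, fields, "")

patient = json.loads(
    '{"resourceType": "Patient",'
    ' "name": [{"family": "Doe", "given": ["John", "Michael"]}],'
    ' "birthDate": "1970-01-01"}'
)

print(list(find_paths(patient, "Patient.name.given")))
# [('name[0].given[0]', 'John'), ('name[0].given[1]', 'Michael')]
```

Each matched leaf is what a rule such as `"Patient.name.given": "first_name"` would target for replacement.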

---

## ✅ Key Features

- 📦 **FHIR Versions**: Supports `R4`, `R5`, and `DSTU3`
- 🧠 **Type-aware De-identification** strategies
- 🧾 **Parser Types**: Supports both `JSON` and `XML`
- ⚙️ **Modes**: `mask` or `obfuscate`

---





## 🔧 Parameters

| Parameter | Description |
|----------|-------------|
| `fhirVersion` | FHIR version (`R4`, `R5`, or `DSTU3`) |
| `parserType` | Parser type (`JSON` or `XML`) |
| `mode` | Mode for anonymizer (`mask` or `obfuscate`) |
| `dateEntities` | List of date entities (e.g., `['DATE', 'DOB', 'DOD']`) |
| `obfuscateDate` | Whether to obfuscate dates when `mode='obfuscate'` |
| `unnormalizedDateMode` | Fallback mode for unformatted dates (`mask`, `obfuscate`, or `skip`) |
| `days` | Number of days to shift dates; if not set, a random value between 1 and 60 is used |
| `dateFormats` | Recognized date formats |
| `obfuscateRefSource` | Source for obfuscation (`custom`, `faker`, `both`) |
| `language` | Language for regex and faker (`en`, `de`, `es`, `fr`, `ar`, `ro`) |
| `seed` | Random seed for reproducibility |
| `maskingPolicy` | `same_length_chars`, `entity_labels`, `fixed_length_chars` |
| `fixedMaskLength` | Used with `fixed_length_chars` |
| `sameLengthFormattedEntities` | Entities to keep same-length output (e.g., `PHONE`, `SSN`) |
| `genderAwareness` | Gender-aware name handling (affects performance) |
| `ageRanges` | List of age groups to preserve |
| `selectiveObfuscationModes` | Per-entity de-identification strategies |
| `customFakers` | Custom terms for obfuscation by entity |
| `keepYear`, `keepMonth` | Preserve year/month when obfuscating dates |

---

💡 *Use this module to ensure privacy-compliant handling of healthcare data while maintaining structure and usability of FHIR resources.*
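
As a rough illustration of the three `maskingPolicy` values in the table, the sketch below shows what each policy could produce for a single value. The function `mask_value` is hypothetical, and the exact output format (mask character, brackets, label delimiters) is an assumption; the library's real output may differ.

```python
def mask_value(value: str, label: str, policy: str, fixed_length: int = 4) -> str:
    """Illustrative masking only; output formats are assumptions, not library output."""
    if policy == "entity_labels":
        # Replace the value with its entity label.
        return f"<{label}>"
    if policy == "same_length_chars":
        # Replace every character so the mask is as long as the original value.
        return "*" * len(value)
    if policy == "fixed_length_chars":
        # Always emit a mask of the configured length (compare `fixedMaskLength`).
        return "*" * fixed_length
    raise ValueError(f"Unknown masking policy: {policy}")

for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_value("Killings", "LAST_NAME", policy))
```

A fixed-length mask hides the length of the original value, while a same-length mask preserves it, which matters for layout-sensitive fields.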

# Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install(fhir_deid=True)

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
import json
from pyspark.sql import functions as F

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start(fhir_deid=True)

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_9596 (9).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==6.0.0, 💊Spark-Healthcare==6.0.0, running on ⚡ PySpark==3.4.0


⚠️ Attention!
💡 The `FhirDeIdentification` module is not supported by every version of the Spark NLP for Healthcare library.
🚨 If you encounter an error when starting the Spark session, please contact 📧 info@johnsnowlabs.com.
✅ For example, this notebook works properly with version 6.0.0.

# Util Functions

In [None]:
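def read_fhir_as_dict(path: str) -> dict:
    """Read a FHIR JSON file and parse it into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

In [None]:
def read_fhir_as_str(path: str) -> str:
    """Read a FHIR file and return its raw text."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

In [None]:
def parse_str_to_dict(fhir_str: str) -> dict:
    """Parse a FHIR JSON string into a dict."""
    return json.loads(fhir_str)

In [None]:
def parse_dict_to_str(fhir_dict: dict) -> str:
    """Serialize a FHIR dict to a JSON string."""
    return json.dumps(fhir_dict)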

In [None]:
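def show_fhir_diff(original: dict, obfuscated: dict):
    """Show only changed fields between two FHIR dicts (nested-aware).

    Keys missing from `obfuscated`, keys that exist only in `obfuscated`, and
    list items beyond the shorter list are not reported.
    """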
    def compare_dicts(d1, d2, path=""):
        diff = {}
        for key in d1:
            current_path = f"{path}.{key}" if path else key
            if key not in d2:
                pass
            elif isinstance(d1[key], dict) and isinstance(d2[key], dict):
                nested_diff = compare_dicts(d1[key], d2[key], current_path)
                diff.update(nested_diff)
            elif isinstance(d1[key], list) and isinstance(d2[key], list):
                for i, (item1, item2) in enumerate(zip(d1[key], d2[key])):
                    if isinstance(item1, dict) and isinstance(item2, dict):
                        nested_diff = compare_dicts(item1, item2, f"{current_path}[{i}]")
                        diff.update(nested_diff)
                    elif item1 != item2:
                        diff[f"{current_path}[{i}]"] = (item1, item2)
            elif d1[key] != d2[key]:
                diff[current_path] = (d1[key], d2[key])
        return diff

    differences = compare_dicts(original, obfuscated)

    for field, (orig_val, obf_val) in differences.items():
        print(f"Changed: {field} | {orig_val} → {obf_val}")
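
For cases where removed or added fields also matter, a variant that returns the differences instead of printing them can be handy. The sketch below is a hypothetical alternative (`fhir_diff` and the `MISSING` sentinel are not part of the notebook or the library); it reports a field that exists on only one side and compares lists of unequal length.

```python
from itertools import zip_longest

MISSING = object()  # hypothetical sentinel meaning "absent on this side"

def fhir_diff(old, new, path=""):
    """Return {path: (old_value, new_value)} for every differing leaf."""
    if isinstance(old, dict) and isinstance(new, dict):
        diff = {}
        # Union of keys, so fields present on only one side are reported too.
        for key in old.keys() | new.keys():
            child = f"{path}.{key}" if path else key
            diff.update(fhir_diff(old.get(key, MISSING), new.get(key, MISSING), child))
        return diff
    if isinstance(old, list) and isinstance(new, list):
        diff = {}
        # zip_longest pads the shorter list with MISSING instead of truncating.
        for i, (a, b) in enumerate(zip_longest(old, new, fillvalue=MISSING)):
            diff.update(fhir_diff(a, b, f"{path}[{i}]"))
        return diff
    return {} if old == new else {path: (old, new)}

before = {"id": "x", "name": [{"family": "Doe"}], "gender": "male"}
after = {"id": "x", "name": [{"family": "Killings"}], "active": True}

for field, (old_val, new_val) in sorted(fhir_diff(before, after).items()):
    print(field, "|", "<missing>" if old_val is MISSING else old_val,
          "→", "<missing>" if new_val is MISSING else new_val)
```

Returning a dict keeps the comparison testable and leaves formatting to the caller.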

# Basic Examples to Show Functionalities


In [None]:
john_doe = """{
  "resourceType": "Patient",
  "id": "example",
  "name": [
    {
      "use": "official",
      "family": "Doe",
      "given": [
        "John",
        "Michael"
      ]
    }
  ],
  "telecom": [
    {
      "system": "email",
      "value": "john.doe@example.com"
    },
    {
      "system": "url",
      "value": "http://johndoe.com"
    }
  ],
  "birthDate": "1970-01-01",
  "gender": "male"
}"""


In [None]:
rules = {
  "Patient.birthDate" : "Date",
  "Patient.name.given" : "first_name",
  "Patient.name.family" : "last_name",
  "Patient.telecom.value" : "Email",
  "Patient.gender" : "GENDER"
}

fhir = (
    medical.FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(88)
      .setCustomFakers(
          {
              "GENDER": ["female", "other"]
          }
      )
      .setObfuscateRefSource("both")
)

In [None]:
fhir_df = spark.createDataFrame([[john_doe]]).toDF("text")
fhir.transform(fhir_df).select("deid").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|deid                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Killings","given":["Ellison","Isidor"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"A

In [None]:
obfuscated_john = parse_str_to_dict(fhir.transform(fhir_df).select("deid").collect()[0]["deid"])
obfuscated_john

{'resourceType': 'Patient',
 'id': 'example',
 'name': [{'use': 'official',
   'family': 'Killings',
   'given': ['Ellison', 'Isidor']}],
 'telecom': [{'system': 'email', 'value': 'Bryton@yahoo.com'},
  {'system': 'url', 'value': 'Aurora@google.com'}],
 'gender': 'other',
 'birthDate': '1970-01-21'}

In [None]:
show_fhir_diff(parse_str_to_dict(john_doe), obfuscated_john)

Changed: name[0].family | Doe → Killings
Changed: name[0].given[0] | John → Ellison
Changed: name[0].given[1] | Michael → Isidor
Changed: telecom[0].value | john.doe@example.com → Bryton@yahoo.com
Changed: telecom[1].value | http://johndoe.com → Aurora@google.com
Changed: birthDate | 1970-01-01 → 1970-01-21
Changed: gender | male → other
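
The `birthDate` change above is consistent with a plain 20-day shift, matching `setDays(20)`. The sketch below reproduces that arithmetic with the standard library. The helper `shift_fhir_date` is hypothetical and covers only full `YYYY-MM-DD` dates and ISO 8601 date-times; partial FHIR dates such as `1970` or `1970-01` are not handled.

```python
from datetime import date, datetime, timedelta

def shift_fhir_date(value: str, days: int) -> str:
    """Shift a full FHIR date or an ISO 8601 dateTime by a number of days."""
    delta = timedelta(days=days)
    if "T" in value:
        # Date-times keep their time of day and UTC offset.
        return (datetime.fromisoformat(value) + delta).isoformat()
    return (date.fromisoformat(value) + delta).isoformat()

print(shift_fhir_date("1970-01-01", 20))  # 1970-01-21
```

Applying the same offset to every date in a record keeps intervals between events intact, which is usually the reason to shift rather than randomize dates.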


## 🛡️ Direct FHIR De-identification Without Spark DataFrames


In [None]:
fake_john = fhir.deidentify(john_doe)
fake_john

'{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Killings","given":["Ellison","Isidor"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"Aurora@google.com"}],"gender":"other","birthDate":"1970-01-21"}'
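
Since the result is a plain JSON string, a quick sanity check is to confirm that de-identification changed values but not structure. The sketch below compares the "shape" of two documents (keys, list lengths, and leaf types) using abbreviated copies of the strings shown in this notebook; the helper `shape` is hypothetical.

```python
import json

def shape(node):
    """Reduce a JSON value to its structure: keys, list lengths, leaf type names."""
    if isinstance(node, dict):
        return {key: shape(value) for key, value in node.items()}
    if isinstance(node, list):
        return [shape(item) for item in node]
    return type(node).__name__

original = json.loads(
    '{"resourceType":"Patient","name":[{"family":"Doe","given":["John","Michael"]}],'
    '"birthDate":"1970-01-01","gender":"male"}'
)
deidentified = json.loads(
    '{"resourceType":"Patient","name":[{"family":"Killings","given":["Ellison","Isidor"]}],'
    '"gender":"other","birthDate":"1970-01-21"}'
)

# Key order differs between the two documents, but dict comparison ignores order.
print(shape(original) == shape(deidentified))  # True
```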

## ⚙️ Supports List of FHIR Resources with Multi-threaded Processing Under the Hood



In [None]:
fakes = fhir.deidentify_list([john_doe, fake_john])
print(type(fakes))
fakes

<class 'list'>


['{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Killings","given":["Ellison","Isidor"]}],"telecom":[{"system":"email","value":"Bryton@yahoo.com"},{"system":"url","value":"Aurora@google.com"}],"gender":"other","birthDate":"1970-01-21"}',
 '{"resourceType":"Patient","id":"example","name":[{"use":"official","family":"Nanas","given":["Almira","Elbridge"]}],"telecom":[{"system":"email","value":"Adel@google.com"},{"system":"url","value":"Carrey@yahoo.com"}],"gender":"female","birthDate":"1970-02-10"}']
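
The notebook does not show how `deidentify_list` distributes its work, so the following is only a generic sketch of one common pattern: mapping a function over a list with a thread pool while keeping results in input order. `map_in_threads` is a hypothetical helper, and `str.upper` stands in for a real per-document function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def map_in_threads(func: Callable[[str], str], documents: List[str],
                   max_workers: int = 4) -> List[str]:
    """Apply `func` to each document concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which task finishes first.
        return list(pool.map(func, documents))

print(map_in_threads(str.upper, ["first", "second", "third"]))
# ['FIRST', 'SECOND', 'THIRD']
```

Order preservation is what makes it safe to pair each output with its input by index, as the examples in this notebook do.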

## XML Support


In [None]:
xml_john = """
<Patient xmlns="http://hl7.org/fhir">
  <id value="example"/>
  <name>
    <use value="official"/>
    <family value="Doe"/>
    <given value="John"/>
    <given value="Michael"/>
  </name>
  <telecom>
    <system value="email"/>
    <value value="john.doe@example.com"/>
  </telecom>
  <telecom>
    <system value="url"/>
    <value value="http://johndoe.com"/>
  </telecom>
  <birthDate value="1970-01-01"/>
  <gender value="male"/>
</Patient>
"""


In [None]:
rules = {
  "Patient.birthDate" : "Date",
  "Patient.name.given" : "first_name",
  "Patient.name.family" : "last_name",
  "Patient.telecom.value" : "Email",
  "Patient.gender" : "GENDER"
}

xml_fhir = (
    medical.FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("XML")
      .setDays(20)
      .setSeed(88)
      .setCustomFakers(
          {
              "GENDER": ["female", "other"]
          }
      )
      .setObfuscateRefSource("both")
)

In [None]:
xml_fake_john = xml_fhir.deidentify(xml_john)
xml_fake_john

'<Patient xmlns="http://hl7.org/fhir"><id value="example"/><name><use value="official"/><family value="Killings"/><given value="Ellison"/><given value="Isidor"/></name><telecom><system value="email"/><value value="Bryton@yahoo.com"/></telecom><telecom><system value="url"/><value value="Aurora@google.com"/></telecom><gender value="other"/><birthDate value="1970-01-21"/></Patient>'
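
To inspect individual fields in FHIR XML, the standard library's `xml.etree.ElementTree` is sufficient as long as the FHIR namespace is supplied. The sketch below reads `value` attributes from an abbreviated Patient document; `xml_values` is a hypothetical helper and the prefix `f` is an arbitrary local alias for the namespace.

```python
import xml.etree.ElementTree as ET

FHIR_NS = {"f": "http://hl7.org/fhir"}  # "f" is just a local prefix

def xml_values(xml_text: str, path: str):
    """Return the `value` attribute of every element matching `path`."""
    root = ET.fromstring(xml_text.strip())
    return [element.get("value") for element in root.findall(path, FHIR_NS)]

original_xml = (
    '<Patient xmlns="http://hl7.org/fhir">'
    '<name><family value="Doe"/><given value="John"/><given value="Michael"/></name>'
    '<birthDate value="1970-01-01"/>'
    '</Patient>'
)

print(xml_values(original_xml, "f:name/f:given"))  # ['John', 'Michael']
print(xml_values(original_xml, "f:birthDate"))     # ['1970-01-01']
```

Running the same queries against the original and de-identified XML gives a field-by-field comparison similar to `show_fhir_diff` for JSON.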

## Encounter


In [None]:
!wget https://www.hl7.org/fhir/us/core/Encounter-example-1.json

In [None]:
encounter_example = read_fhir_as_str("Encounter-example-1.json")

In [None]:
rules = {
  "Encounter.id" : "ID",
  "Encounter.meta.lastUpdated" : "DATE",
  "Encounter.period.start" : "DATE",
  "Encounter.period.end" : "DATE",
  "Encounter.location.location.display" : "HOSPITAL"
}

encounter_fhir = (
    medical.FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
)

In [None]:
obfuscated_encounter = encounter_fhir.deidentify(encounter_example)
obfuscated_encounter_dict = parse_str_to_dict(obfuscated_encounter)


In [None]:
show_fhir_diff(parse_str_to_dict(encounter_example), obfuscated_encounter_dict)

Changed: id | example-1 → vklpqgv-9
Changed: meta.lastUpdated | 2024-01-28T16:06:21-08:00 → 2024-02-17T16:06:21-08:00
Changed: period.start | 2015-11-01T17:00:14-05:00 → 2015-11-21T17:00:14-05:00
Changed: period.end | 2015-11-01T18:00:14-05:00 → 2015-11-21T18:00:14-05:00
Changed: location[0].location.display | Holy Family Hospital → ST. LUKE'S NORTHLAND HOSPITAL - SMITHVILLE


## Patient


In [None]:
!wget https://hl7.org/fhir/R4/patient-example-a.json

In [None]:
patient_example = read_fhir_as_str("patient-example-a.json")

In [None]:
rules = {
  "Patient.id" : "ID",
  "Patient.identifier.value" : "ID",
  "Patient.active" : "STATUS",
  "Patient.name.given" : "FIRST_NAME",
  "Patient.name.family": "LAST_NAME",
  "Patient.contact.organization.display" : "ORGANIZATION",
  "Patient.gender": "GENDER",
}

patient_fhir = (
    medical.FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
      .setCustomFakers({
          "GENDER" : ["female"],
          "STATUS" : ["false"],
      })
)

In [None]:
patient_fake = (patient_fhir.deidentify(patient_example))
show_fhir_diff(parse_str_to_dict(patient_example), parse_str_to_dict(patient_fake))

Changed: id | pat1 → qla9
Changed: identifier[0].value | 654321 → 432109
Changed: active | True → False
Changed: name[0].family | Donald → Sander
Changed: name[0].given[0] | Duck → Netta
Changed: gender | male → female
Changed: contact[0].organization.display | Walt Disney Corporation → EMCOR


# Real World Examples


In [None]:
!wget https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_latest.zip

In [None]:
import zipfile
zip_path = 'synthea_sample_data_fhir_latest.zip'
extract_path = '/content/patients'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

## Simple Example with Patient

In [None]:
import os
from pyspark.sql import functions as F

# Find first .json file from extracted folder
extract_path = "/content/patients"
json_files = []

for root, dirs, files in os.walk(extract_path):
    for file in files:
        if file.endswith(".json"):
            json_files.append(os.path.join(root, file))

# Raise an error if no JSON found
if not json_files:
    raise FileNotFoundError("No JSON files found in the extracted folder.")

# Take the first one
first_json_path = json_files[0]
print(f"Using file: {first_json_path}")

# Load using binaryFile
patient_df = spark.read.format("binaryFile").load(first_json_path) \
    .select(
        F.input_file_name().alias("file_name"),
        F.decode(F.col("content"), "UTF-8").alias("text")
    )


Using file: /content/patients/Sharri659_Feest103_1f0029cb-195d-f68b-9e4a-61449d505028.json


In [None]:
patient_df.show(truncate=200)

+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                            file_name|                                                                                                                                                                                                    text|
+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|file:///content/patients/Sharri659_Feest103_1f0029cb-195d-f68b-9e4a-61449d505028.json|{\n  "resourceType": "Bundle",\n  "type": "tra
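
The loaded file is a FHIR `Bundle`, so individual resources sit under `entry[].resource`. Before writing rules, it can help to see which resource types a bundle contains. The sketch below does this for a tiny hand-made bundle; `resource_types` and `resources_of_type` are hypothetical helpers, and the sample data is invented for illustration.

```python
from collections import Counter

def resource_types(bundle: dict) -> Counter:
    """Count the resourceType of every entry in a FHIR Bundle."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )

def resources_of_type(bundle: dict, resource_type: str) -> list:
    """Return all resources of the given type from a FHIR Bundle."""
    return [
        entry["resource"]
        for entry in bundle.get("entry", [])
        if entry.get("resource", {}).get("resourceType") == resource_type
    ]

# Invented sample data, far smaller than a real Synthea bundle.
bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Encounter", "id": "e2"}},
    ],
}

print(resource_types(bundle))  # Counter({'Encounter': 2, 'Patient': 1})
```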

In [None]:
rules = {
  "Patient.name.given" : "FIRST_NAME",
  "Patient.name.family": "LAST_NAME",
  "Patient.name.prefix": "TITLE",
  "Patient.telecom.value" : "PHONE",
  "Patient.address.city" : "city",
  "Patient.address.country" : "COUNTRY",
  "Patient.address.postalCode" : "ZIP",
  "Patient.gender": "GENDER",
  "Patient.birthDate": "DATE",
  "Patient.deceasedDateTime": "DATE",
  "Patient.identifier.value" : "ID",
}

patient_fhir = (
    medical.FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("R4")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
      .setCustomFakers({
          "TITLE" : ["Dr.", "Mr.", "Mrs.", "Ms."],
          "GENDER" : ["male", "female"],
      })
)


In [None]:
collected_patient = patient_fhir.transform(patient_df).collect()[0]
show_fhir_diff(parse_str_to_dict(collected_patient["text"]), parse_str_to_dict(collected_patient["deid"]))

Changed: entry[0].resource.identifier[0].value | 1f0029cb-195d-f68b-9e4a-61449d505028 → 9e8807du-973m-e46u-7v2l-49227m383806
Changed: entry[0].resource.identifier[1].value | 1f0029cb-195d-f68b-9e4a-61449d505028 → 9e8807du-973m-e46u-7v2l-49227m383806
Changed: entry[0].resource.identifier[2].value | 999-59-6519 → 777-37-4397
Changed: entry[0].resource.identifier[3].value | S99938517 → R77716395
Changed: entry[0].resource.identifier[4].value | X29655014X → K07433892K
Changed: entry[0].resource.name[0].family | Feest103 → Dolores
Changed: entry[0].resource.name[0].given[0] | Sharri659 → Hershall
Changed: entry[0].resource.name[0].given[1] | Lue764 → Lennard
Changed: entry[0].resource.name[0].prefix[0] | Ms. → Dr.
Changed: entry[0].resource.telecom[0].value | 555-713-5108 → 333-591-3986
Changed: entry[0].resource.gender | female → male
Changed: entry[0].resource.birthDate | 2000-06-14 → 2000-07-04
Changed: entry[0].resource.address[0].city | Southborough → Harper
Changed: entry[0].resource.
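
In the output above, identifiers keep their layout: digits stay digits, letters stay letters, separators stay in place, and identical inputs map to identical outputs. The sketch below shows one generic way to get that behavior. It is not the library's algorithm; `scramble_preserving_format` is a hypothetical helper, and seeding the generator with the value itself is an assumption made so that equal inputs give equal outputs.

```python
import random
import string

def scramble_preserving_format(value: str, seed: int) -> str:
    """Replace digits with digits and letters with letters; keep everything else."""
    # Seeding with both the seed and the value makes the result deterministic
    # per input, so repeated identifiers stay consistent with each other.
    rng = random.Random(f"{seed}:{value}")
    scrambled = []
    for ch in value:
        if ch.isdigit():
            scrambled.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            scrambled.append(rng.choice(pool))
        else:
            # Separators such as "-" are kept so the format stays recognizable.
            scrambled.append(ch)
    return "".join(scrambled)

print(scramble_preserving_format("999-59-6519", seed=1000))
print(scramble_preserving_format("S99938517", seed=1000))
```

Because the replacement is drawn at random, a character can occasionally map to itself; a production implementation would need to decide whether that is acceptable.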

## Complete Dataset: All Patients

### Spark

In [None]:
from pyspark.sql import functions as F

all_df = spark.read.format("binaryFile") \
    .load("/content/patients/*.json") \
    .select(
        F.input_file_name().alias("file_name"),
        F.decode(F.col("content"), "UTF-8").alias("text")
    ).repartition(os.cpu_count() * 2).cache()


In [None]:
all_df.select("file_name").show(truncate=False, n=3)

+-------------------------------------------------------------------------------------+
|file_name                                                                            |
+-------------------------------------------------------------------------------------+
|file:/content/patients/Carlo647_Gislason620_b7ec4930-0a84-73bc-ffb7-8af376b4d991.json|
|file:/content/patients/Colin861_Medhurst46_57f13fc4-0882-fda3-e99a-7c8b8d0d086b.json |
|file:/content/patients/Kathern391_Nader710_a5e6cfe1-b6d7-9d18-540b-3234ebbad64c.json |
+-------------------------------------------------------------------------------------+
only showing top 3 rows



In [None]:
%%time
patient_fhir.transform(all_df).write.mode("overwrite").parquet("content/all_deid.parquet")

CPU times: user 685 ms, sys: 119 ms, total: 803 ms
Wall time: 2min 34s


View a sample patient result

In [None]:
a_patient = spark.read.parquet("content/all_deid.parquet").limit(1).collect()[0]
print(a_patient["file_name"])

file:/content/patients/Edwardo860_Hermann103_08969e90-844a-c907-4d3c-a4907951e35e.json


In [None]:
show_fhir_diff(parse_str_to_dict(a_patient["text"]), parse_str_to_dict(a_patient["deid"]))

Changed: entry[0].resource.identifier[0].value | 08969e90-844a-c907-4d3c-a4907951e35e → 86747v78-622l-d785-2m1d-l2785739v13v
Changed: entry[0].resource.identifier[1].value | 08969e90-844a-c907-4d3c-a4907951e35e → 86747v78-622l-d785-2m1d-l2785739v13v
Changed: entry[0].resource.identifier[2].value | 999-74-3374 → 777-52-1152
Changed: entry[0].resource.identifier[3].value | S99913786 → R77791564
Changed: entry[0].resource.identifier[4].value | X34252414X → K12030292K
Changed: entry[0].resource.name[0].family | Hermann103 → Cruel
Changed: entry[0].resource.name[0].given[0] | Edwardo860 → Melinda
Changed: entry[0].resource.name[0].given[1] | Ariel183 → Atha
Changed: entry[0].resource.name[0].prefix[0] | Mr. → Ms.
Changed: entry[0].resource.telecom[0].value | 555-930-2854 → 333-718-0632
Changed: entry[0].resource.gender | male → female
Changed: entry[0].resource.birthDate | 1915-04-04 → 1915-04-24
Changed: entry[0].resource.deceasedDateTime | 1982-03-13T05:31:36+00:00 → 1982-04-02T05:31:36+0

### Spark-Less

In [None]:
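import glob

# Read every patient file into memory, closing each file handle after use
all_patients_contents = []
for path in glob.glob("/content/patients/*.json"):
    with open(path, "r", encoding="utf-8") as f:
        all_patients_contents.append(f.read())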

In [None]:
%%time
all_patients_obfuscated = patient_fhir.deidentify_list(all_patients_contents)

CPU times: user 32.9 s, sys: 7.04 s, total: 40 s
Wall time: 2min 50s


View a sample patient result

In [None]:
show_fhir_diff(parse_str_to_dict(all_patients_contents[0]), parse_str_to_dict(all_patients_obfuscated[0]))

Changed: entry[0].resource.identifier[0].value | 1f0029cb-195d-f68b-9e4a-61449d505028 → 9e8807du-973m-e46u-7v2l-49227m383806
Changed: entry[0].resource.identifier[1].value | 1f0029cb-195d-f68b-9e4a-61449d505028 → 9e8807du-973m-e46u-7v2l-49227m383806
Changed: entry[0].resource.identifier[2].value | 999-59-6519 → 777-37-4397
Changed: entry[0].resource.identifier[3].value | S99938517 → R77716395
Changed: entry[0].resource.identifier[4].value | X29655014X → K07433892K
Changed: entry[0].resource.name[0].family | Feest103 → Dolores
Changed: entry[0].resource.name[0].given[0] | Sharri659 → Hershall
Changed: entry[0].resource.name[0].given[1] | Lue764 → Lennard
Changed: entry[0].resource.name[0].prefix[0] | Ms. → Dr.
Changed: entry[0].resource.telecom[0].value | 555-713-5108 → 333-591-3986
Changed: entry[0].resource.gender | female → male
Changed: entry[0].resource.birthDate | 2000-06-14 → 2000-07-04
Changed: entry[0].resource.address[0].city | Southborough → Harper
Changed: entry[0].resource.

View another sample patient result

In [None]:
show_fhir_diff(parse_str_to_dict(all_patients_contents[1]), parse_str_to_dict(all_patients_obfuscated[1]))

Changed: entry[0].resource.identifier[0].value | a54bb0fb-7e96-9488-505a-a71d58cdae85 → l32uu8eu-5v74-7266-383l-l59m36dmlv63
Changed: entry[0].resource.identifier[1].value | a54bb0fb-7e96-9488-505a-a71d58cdae85 → l32uu8eu-5v74-7266-383l-l59m36dmlv63
Changed: entry[0].resource.identifier[2].value | 999-28-1483 → 777-06-9261
Changed: entry[0].resource.identifier[3].value | S99985880 → R77763668
Changed: entry[0].resource.identifier[4].value | X34675298X → K12453076K
Changed: entry[0].resource.name[0].family | Gulgowski816 → Frames
Changed: entry[0].resource.name[0].given[0] | Wilson960 → Margy
Changed: entry[0].resource.name[0].given[1] | Rashad361 → Ilah
Changed: entry[0].resource.name[0].prefix[0] | Mr. → Ms.
Changed: entry[0].resource.telecom[0].value | 555-116-3691 → 333-994-1479
Changed: entry[0].resource.gender | male → female
Changed: entry[0].resource.birthDate | 1954-12-17 → 1955-01-06
Changed: entry[0].resource.address[0].city | Sandwich → Pana
Changed: entry[0].resource.addres

## STU3

In [None]:
!wget https://synthetichealth.github.io/synthea-sample-data/downloads/latest/synthea_sample_data_fhir_stu3_latest.zip

In [None]:
import zipfile
zip_path_stu3 = 'synthea_sample_data_fhir_stu3_latest.zip'
extract_path_stu3 = '/content/patients_stu3'

with zipfile.ZipFile(zip_path_stu3, 'r') as zip_ref:
    zip_ref.extractall(extract_path_stu3)

In [None]:
rules = {
  "Patient.name.given" : "FIRST_NAME",
  "Patient.name.family": "LAST_NAME",
  "Patient.name.prefix": "TITLE",
  "Patient.telecom.value" : "PHONE",
  "Patient.address.city" : "city",
  "Patient.address.country" : "COUNTRY",
  "Patient.address.postalCode" : "ZIP",
  "Patient.gender": "GENDER",
  "Patient.birthDate": "DATE",
  "Patient.deceasedDateTime": "DATE",
  "Patient.identifier.value" : "ID",
}

stu3_fhir = (
    medical.FhirDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setFhirVersion("STU3")
      .setParserType("JSON")
      .setDays(20)
      .setSeed(1000)
      .setObfuscateRefSource("both")
      .setCustomFakers({
          "TITLE" : ["Dr.", "Mr.", "Mrs.", "Ms."],
          "GENDER" : ["male", "female"],
      })
)


In [None]:
from pyspark.sql import functions as F

all_df_stu3 = spark.read.format("binaryFile") \
    .load("/content/patients_stu3/*.json") \
    .select(
        F.input_file_name().alias("file_name"),
        F.decode(F.col("content"), "UTF-8").alias("text")
    ).repartition(os.cpu_count() * 2).cache()

In [None]:
print("Total number of patients:", all_df_stu3.count())


Total number of patients: 129


In [None]:
%%time
stu3_fhir.transform(all_df_stu3).write.mode("overwrite").parquet("content/all_deid_stu3.parquet")

CPU times: user 526 ms, sys: 92.2 ms, total: 618 ms
Wall time: 1min 44s


In [None]:
a_patient_stu3 = spark.read.parquet("content/all_deid_stu3.parquet").limit(1).collect()[0]
print(a_patient_stu3["file_name"])

file:/content/patients_stu3/Horacio317_McClure239_609e481b-69f6-4833-4090-c83f5268a389.json


In [None]:
show_fhir_diff(parse_str_to_dict(a_patient_stu3["text"]), parse_str_to_dict(a_patient_stu3["deid"]))

Changed: entry[0].resource.identifier[0].value | 609e481b-69f6-4833-4090-c83f5268a389 → 487v269u-47e4-2611-2878-d61e3046l167
Changed: entry[0].resource.identifier[1].value | 609e481b-69f6-4833-4090-c83f5268a389 → 487v269u-47e4-2611-2878-d61e3046l167
Changed: entry[0].resource.identifier[2].value | 999-90-6100 → 777-78-4988
Changed: entry[0].resource.identifier[3].value | S99931725 → R77719503
Changed: entry[0].resource.identifier[4].value | X35130046X → K13918824K
Changed: entry[0].resource.name[0].family | McClure239 → Piles
Changed: entry[0].resource.name[0].given[0] | Horacio317 → Michell
Changed: entry[0].resource.name[0].given[1] | Thaddeus38 → Firman
Changed: entry[0].resource.name[0].prefix[0] | Mr. → Ms.
Changed: entry[0].resource.telecom[0].value | 555-125-1020 → 333-903-9808
Changed: entry[0].resource.gender | male → female
Changed: entry[0].resource.birthDate | 1916-12-17 → 1917-01-06
Changed: entry[0].resource.deceasedDateTime | 1972-12-15T08:33:00+00:00 → 1973-01-04T08:33: