![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.14.CDA_DeIdentification.ipynb)

## 🔐 Spark Transformer for De-identifying CDA (HL7 Clinical Document Architecture) Documents

This Spark-based transformer de-identifies **CDA (Clinical Document Architecture)** XML documents according to configurable privacy rules.

It supports **field-level obfuscation** of CDA XML documents using **XPath-like path expressions** and **NLP-based free-text de-identification**.

---

## ✅ Key Features

- 📦 **XPath-based de-identification**: Target specific CDA elements using dot (`.`) or slash (`/`) notation
- 🎯 **Attribute-level access**: De-identify XML attributes using `@` notation
- 🌐 **Namespace-aware**: Automatically handles the HL7 v3 namespace
- 🧠 **Structured field obfuscation**: Replace names, dates, addresses, phone numbers, and identifiers
- 📝 **Free-text de-identification**: Process narrative sections using NLP pipelines
- ⚙️ **Modes**: `mask` or `obfuscate`

---

| **Parameter**                   | **Description**                                                                                                                                                                                                                                                                                                                                                                                                                             |
| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **mode**                        | Mode for Anonymizer. Options: [`mask`, `obfuscate`]                                                                                                                                                                                                                                                                                                                                                                                         |
| **mappingRules**                | Dictionary mapping CDA paths to entity types (e.g. `{"recordTarget.patientRole.patient.name.given": "first_name"}`)                                                                                                                                                                                                                                                                                                                         |
| **freeTextPaths**               | List of CDA paths as strings to free-text elements for de-identification. Paths should be specified using CDA XML path notation (dot or slash separated).<br><br>These paths will be processed using the de-identification pipeline set via `setPipeline()`.<br><br>Example: `['component.structuredBody.component.section.text']`                                                                                                          |
| **pipeline**                    | Spark NLP Pipeline object (`PipelineModel` or `PretrainedPipeline`) containing de-identification stages.<br><br>The pipeline should perform the desired de-identification transformations for free-text sections specified in `freeTextPaths`.<br><br>Set via `setPipeline(spark, pipeline, outputCol="obfuscated")` <br><br> where `outputCol` is the output column name from the pipeline containing de-identified text (default: `"obfuscated"`). |
| **dateEntities**                | List of date entities. Default: [`DATE`, `DOB`, `DOD`]                                                                                                                                                                                                                                                                                                                                                                                      |
| **unnormalizedDateMode**        | Mode to use for unformatted dates. Options: [`mask`, `obfuscate`, `skip`]. Default: `obfuscate`                                                                                                                                                                                                                                                                                                                                             |
| **days**                        | Number of days to displace dates for obfuscation. If not set, a random value between 1 and 60 is used.                                                                                                                                                                                                                                                                                                                                      |
| **dateFormats**                 | List of date formats to automatically displace if parsed.                                                                                                                                                                                                                                                                                                                                                                                   |
| **obfuscateRefSource**          | Source for obfuscating entities (non-date). Options: [`custom`, `faker`, `both`]                                                                                                                                                                                                                                                                                                                                                            |
| **language**                    | Language for regex and faker data. Options: [`en`, `de`, `es`, `fr`, `ar`, `ro`]. Default: `en`                                                                                                                                                                                                                                                                                                                                             |
| **seed**                        | Seed for deterministic obfuscation results.                                                                                                                                                                                                                                                                                                                                                                                                 |
| **maskingPolicy**               | Masking policy. Options: [`same_length_chars`, `entity_labels`, `fixed_length_chars`]                                                                                                                                                                                                                                                                                                                                                       |
| **fixedMaskLength**             | Mask length when using `fixed_length_chars` policy.                                                                                                                                                                                                                                                                                                                                                                                         |
| **sameLengthFormattedEntities** | List of formatted entities to preserve length during obfuscation. Supported: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE                                                                                                                                                                                                                                                                                |
| **genderAwareness**             | Use gender-aware names during obfuscation. Default: False                                                                                                                                                                                                                                                                                                                                                                                   |
| **ageRanges**                   | Integer list defining limits of age groups to preserve during obfuscation.                                                                                                                                                                                                                                                                                                                                                                  |
| **selectiveObfuscationModes**   | Dictionary of entity-wise obfuscation modes. Example: `{'NAME': ['mask_entity_labels'], 'PHONE': ['obfuscate']}`                                                                                                                                                                                                                                                                                                                            |
| **customFakers**                | Custom dictionary of faker terms to use for specific entities.                                                                                                                                                                                                                                                                                                                                                                              |
| **keepYear**                    | Keep year intact during date obfuscation. Default: False                                                                                                                                                                                                                                                                                                                                                                                    |
| **keepMonth**                   | Keep month intact during date obfuscation. Default: False                                                                                                                                                                                                                                                                                                                                                                                   |
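
To make the `days` parameter concrete, here is a minimal plain-Python sketch of shifting an HL7 timestamp by a fixed number of days. It is an illustration only, not the transformer's implementation; the helper name `shift_hl7_timestamp` and the two supported layouts (`YYYYMMDD` and `YYYYMMDDHHMMSS`) are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical helper, not part of Spark NLP.
# Maps the length of an HL7 timestamp to its strptime/strftime layout.
_FORMATS = {8: "%Y%m%d", 14: "%Y%m%d%H%M%S"}

def shift_hl7_timestamp(value: str, days: int) -> str:
    """Shift an HL7 timestamp string by `days`, keeping its original layout."""
    fmt = _FORMATS.get(len(value))
    if fmt is None:
        raise ValueError(f"unsupported timestamp layout: {value!r}")
    shifted = datetime.strptime(value, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)

print(shift_hl7_timestamp("20240101120000", 20))  # 20240121120000
print(shift_hl7_timestamp("19800101", 20))        # 19800121
```

With `setDays(20)`, the first result matches the `effectiveTime` value in the obfuscated output of the direct de-identification example (`20240101120000` becomes `20240121120000`).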


# Setup

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
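
The install and start-up cells that follow read three entries from the license file: `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION`. A small check such as the sketch below can surface a malformed file before the `pip install` commands fail with a less obvious error. The helper name `missing_license_keys` is made up for this example.

```python
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(keys: dict) -> list:
    """Return the required keys that are absent or empty in `keys`."""
    return [k for k in REQUIRED_KEYS if not keys.get(k)]

# Placeholder dictionary for illustration; in the notebook, pass `license_keys`.
example = {"SECRET": "xxx", "PUBLIC_VERSION": "0.0.0"}
print(missing_license_keys(example))  # ['JSL_VERSION']
```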

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display


In [None]:
from pyspark.ml import Pipeline,PipelineModel
import sparknlp_jsl
import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output

spark = sparknlp_jsl.start(secret = license_keys["SECRET"])

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

# Basic Examples to Show Functionalities

In [None]:
cda_example = """<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">

    <typeId root="2.16.840.1.113883.1.3" extension="POCD_HD000040"/>
    <templateId root="2.16.840.1.113883.10.20.1"/>
    <id root="2.16.840.1.113883.19.5" extension="12345"/>
    <code code="34133-9" codeSystem="2.16.840.1.113883.6.1" displayName="Summarization of Episode Note"/>
    <title>Good Health Clinic Consultation Note</title>
    <effectiveTime value="20240101120000"/>

    <!-- ===================== PATIENT ===================== -->
    <recordTarget>
        <patientRole>
            <id root="2.16.840.1.113883.19.5" extension="998877"/>
            <addr use="HP">
                <streetAddressLine>123 Main Street</streetAddressLine>
                <city>Istanbul</city>
                <state>34</state>
                <postalCode>34000</postalCode>
                <country>TR</country>
            </addr>
            <telecom use="HP" value="+905551112233"/>
            <patient>
                <name>
                    <given>John</given>
                    <family>Doe</family>
                </name>
                <administrativeGenderCode code="M"/>
                <birthTime value="19800101"/>
            </patient>
        </patientRole>
    </recordTarget>

    <!-- ===================== AUTHOR ===================== -->
    <author>
        <time value="20240101113000"/>
        <assignedAuthor>
            <id root="2.16.840.1.113883.4.6" extension="111223333"/>
            <assignedPerson>
                <name>
                    <given>Emily</given>
                    <family>Clark</family>
                </name>
            </assignedPerson>
            <representedOrganization>
                <id root="2.16.840.1.113883.19.5" extension="ORG001"/>
                <name>Good Health Clinic</name>
            </representedOrganization>
        </assignedAuthor>
    </author>

    <!-- ===================== CUSTODIAN ===================== -->
    <custodian>
        <assignedCustodian>
            <representedCustodianOrganization>
                <id root="2.16.840.1.113883.19.5" extension="CUST001"/>
                <name>Good Health Clinic Organization</name>
            </representedCustodianOrganization>
        </assignedCustodian>
    </custodian>

    <!-- ===================== BODY ===================== -->
    <component>
        <structuredBody>

            <component>
                <section>
                    <code code="10164-2" codeSystem="2.16.840.1.113883.6.1" displayName="History of Present Illness"/>
                    <title>History of Present Illness</title>
                    <text>
                        Patient John Doe presented with chest pain and shortness of breath.
                        He lives at 456 Main Street, Istanbul. Contact number is +905551112233.
                    </text>
                </section>
            </component>

        </structuredBody>
    </component>

</ClinicalDocument>"""

In [None]:
rules = {
    "recordTarget.patientRole.patient.name.given": "first_name",
    "recordTarget.patientRole.patient.name.family": "last_name",
    "recordTarget.patientRole.addr.streetAddressLine": "Street",
    "recordTarget.patientRole.addr.city": "city",
    "recordTarget.patientRole.addr.postalCode": "ZIP",
    "recordTarget.patientRole.addr.country": "COUNTRY",
    "recordTarget.patientRole.telecom.value": "Phone",
    "recordTarget.patientRole.id.extension": "ID",
    "author.assignedAuthor.assignedPerson.name.given": "first_name",
    "author.assignedAuthor.assignedPerson.name.family": "last_name",
    "author.assignedAuthor.id.extension": "ID",
    "custodian.assignedCustodian.representedCustodianOrganization.name": "Organization",
    "custodian.assignedCustodian.representedCustodianOrganization.id.extension": "ID",
    "effectiveTime.value": "DATE",
    "author.time.value": "DATE",
    "recordTarget.patientRole.patient.birthTime.value": "DATE",
}

cda_deid = (
    CdaDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules)
      .setDays(20)
      .setSeed(88)
      .setObfuscateRefSource("both")
)

In [None]:
cda_df = spark.createDataFrame([[cda_example]]).toDF("text")
cda_deid.transform(cda_df).select("deid").show(truncate=False)


## 🛡️ Direct CDA De-identification Without Spark DataFrames

In [None]:
deidentified_cda = cda_deid.deidentify(cda_example)
print(deidentified_cda)

<?xml version="1.0" encoding="UTF-8" standalone="no"?><ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">

    <typeId extension="POCD_HD000040" root="2.16.840.1.113883.1.3"/>
    <templateId root="2.16.840.1.113883.10.20.1"/>
    <id extension="12345" root="2.16.840.1.113883.19.5"/>
    <code code="34133-9" codeSystem="2.16.840.1.113883.6.1" displayName="Summarization of Episode Note"/>
    <title>Good Health Clinic Consultation Note</title>
    <effectiveTime value="20240121120000"/>

    <recordTarget>
        <patientRole>
            <id extension="443322" root="2.16.840.1.113883.19.5"/>
            <addr use="HP">
                <streetAddressLine>1011 North Cooper Street</streetAddressLine>
                <city>Bakebare</city>
                <state>34</state>
                <postalCode>89555</postalCode>
                <country>AMERICAN SAMOA</country>
            </addr>
            <te

## ⚙️ Supports List of CDA Documents with Multi-threaded Processing Under the Hood

In [None]:
deidentified_list = cda_deid.deidentify_list([cda_example, deidentified_cda])
print(type(deidentified_list))
print(f"Number of documents: {len(deidentified_list)}")

<class 'list'>
Number of documents: 2
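
Because `deidentify_list` returns plain strings, a quick sanity check is to confirm that every result still parses as XML and keeps the `ClinicalDocument` root. The sketch below uses only the standard library. `is_cda_document` is a hypothetical helper name, and the check is structural only: it does not validate against the CDA schema.

```python
import xml.etree.ElementTree as ET

HL7_NS = "urn:hl7-org:v3"

def is_cda_document(xml_text: str) -> bool:
    """True if `xml_text` is well-formed and rooted at hl7:ClinicalDocument."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == f"{{{HL7_NS}}}ClinicalDocument"

sample = '<ClinicalDocument xmlns="urn:hl7-org:v3"><title>Note</title></ClinicalDocument>'
print(is_cda_document(sample))     # True
print(is_cda_document("<broken"))  # False
```

In the notebook, `all(is_cda_document(doc) for doc in deidentified_list)` would apply the check to the whole batch.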


## 📝 Path Notation Examples

CDA De-identification supports flexible path notation:

### Element Paths
Both dot (`.`) and slash (`/`) notation work:
- `recordTarget.patientRole.patient.name.given`
- `recordTarget/patientRole/patient/name/given`

### Attribute Paths
Attributes can be specified with or without the `@` prefix:
- `recordTarget.patientRole.telecom/@value` (explicit attribute)
- `recordTarget.patientRole.telecom.value` (auto-detected attribute)
- `recordTarget.patientRole.id.extension` (known attribute)

### Nested Structures
- `component.structuredBody.component.section.text`
- `component/structuredBody/component/section/text`
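
To illustrate what the notations above describe, the following sketch normalizes a path into element steps plus an optional attribute and resolves it against a namespaced document with the standard library. It is not the transformer's implementation: `split_cda_path` and `read_cda_path` are hypothetical helpers, and only the explicit `@` form is treated as an attribute here (auto-detection of attribute names such as `value` or `extension` is left out).

```python
import re
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

def split_cda_path(path: str):
    """Split 'a.b/@c' or 'a/b/c' into (['a', 'b'], 'c') or (['a', 'b', 'c'], None)."""
    parts = [p for p in re.split(r"[./]", path) if p]
    attribute = None
    if parts and parts[-1].startswith("@"):
        attribute = parts.pop()[1:]
    return parts, attribute

def read_cda_path(xml_text: str, path: str):
    """Return the element text or attribute value at `path`, or None if absent."""
    steps, attribute = split_cda_path(path)
    root = ET.fromstring(xml_text)
    # Every step is looked up in the HL7 v3 namespace, relative to the root.
    node = root.find("/".join(f"hl7:{s}" for s in steps), NS)
    if node is None:
        return None
    return node.get(attribute) if attribute else node.text

doc = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget><patientRole>
    <telecom use="HP" value="+905551112233"/>
    <patient><name><given>John</given></name></patient>
  </patientRole></recordTarget>
</ClinicalDocument>"""

print(read_cda_path(doc, "recordTarget.patientRole.patient.name.given"))  # John
print(read_cda_path(doc, "recordTarget/patientRole/telecom/@value"))      # +905551112233
```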

In [None]:
# Example with different path notations
rules_slash = {
    "recordTarget/patientRole/patient/name/given": "first_name",
    "recordTarget/patientRole/patient/name/family": "last_name",
    "recordTarget.patientRole.telecom/@value": "Phone",  # Mix of notations
    "recordTarget.patientRole.id.extension": "ID",
}

cda_deid_slash = (
    CdaDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules_slash)
      .setDays(20)
      .setSeed(88)
      .setObfuscateRefSource("both")
)

result_slash = cda_deid_slash.deidentify(cda_example)
print(result_slash[:1500])  # Show the first 1500 characters

<?xml version="1.0" encoding="UTF-8" standalone="no"?><ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">

    <typeId extension="POCD_HD000040" root="2.16.840.1.113883.1.3"/>
    <templateId root="2.16.840.1.113883.10.20.1"/>
    <id extension="12345" root="2.16.840.1.113883.19.5"/>
    <code code="34133-9" codeSystem="2.16.840.1.113883.6.1" displayName="Summarization of Episode Note"/>
    <title>Good Health Clinic Consultation Note</title>
    <effectiveTime value="20240101120000"/>

    <recordTarget>
        <patientRole>
            <id extension="443322" root="2.16.840.1.113883.19.5"/>
            <addr use="HP">
                <streetAddressLine>123 Main Street</streetAddressLine>
                <city>Istanbul</city>
                <state>34</state>
                <postalCode>34000</postalCode>
                <country>TR</country>
            </addr>
            <telecom use="HP" value=

## 🧠 Free-Text De-identification with NLP Pipelines

CDA documents often contain free-text narrative sections. Use `setFreeTextPaths()` to specify which sections should be processed with NLP de-identification pipelines.

In [None]:
deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark_optimized", "en", "clinical/models")



clinical_deidentification_docwise_benchmark_optimized download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
# Example with free-text paths
rules_with_freetext = {
    "recordTarget.patientRole.patient.name.given": "first_name",
    "recordTarget.patientRole.patient.name.family": "last_name",
    "recordTarget.patientRole.addr.streetAddressLine": "Address",
    "recordTarget.patientRole.telecom.value": "Phone",
    "author.assignedAuthor.assignedPerson.name.given": "first_name",
    "author.assignedAuthor.assignedPerson.name.family": "last_name",
}


cda_deid_freetext = (
    CdaDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules_with_freetext)
      .setFreeTextPaths([
          "component.structuredBody.component.section.text"
      ])
      .setPipeline(spark, deid_pipeline, "obfuscated")
      .setDays(20)
      .setSeed(88)
      .setObfuscateRefSource("both")
)

result_freetext = cda_deid_freetext.deidentify(cda_example)
print("Before de-identification:")
print(cda_example[2500:])
print()
print("De-identified CDA (free-text only):")
print(result_freetext[2500:])

Before de-identification:
 -->
    <component>
        <structuredBody>

            <component>
                <section>
                    <code code="10164-2" codeSystem="2.16.840.1.113883.6.1" displayName="History of Present Illness"/>
                    <title>History of Present Illness</title>
                    <text>
                        Patient John Doe presented with chest pain and shortness of breath.
                        He lives at 456 Main Street, Istanbul. Contact number is +905551112233.
                    </text>
                </section>
            </component>

        </structuredBody>
    </component>

</ClinicalDocument>

De-identified CDA (free-text only):
     <structuredBody>

            <component>
                <section>
                    <code code="10164-2" codeSystem="2.16.840.1.113883.6.1" displayName="History of Present Illness"/>
                    <title>History of Present Illness</title>
                    <text>Patient Valerie Ates pr
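
Slicing by character offset is fragile because the serialized output can differ in length from the input. As an alternative, the narrative can be pulled out by element. The sketch below is a hypothetical helper (`section_narratives` is not a Spark NLP function) built on the standard library; it returns the whitespace-normalized text of every `section/text` element under the structured body.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
TEXT_PATH = "hl7:component/hl7:structuredBody/hl7:component/hl7:section/hl7:text"

def section_narratives(xml_text: str) -> list:
    """Collect the narrative text of each section, with whitespace collapsed."""
    root = ET.fromstring(xml_text)
    # itertext() also gathers text nested inside inline markup within <text>.
    return [" ".join("".join(node.itertext()).split())
            for node in root.findall(TEXT_PATH, NS)]

doc = """<ClinicalDocument xmlns="urn:hl7-org:v3"><component><structuredBody>
  <component><section><text>
      Patient John Doe presented with chest pain.
  </text></section></component>
</structuredBody></component></ClinicalDocument>"""

print(section_narratives(doc))  # ['Patient John Doe presented with chest pain.']
```

Calling it on `cda_example` and on `result_freetext` gives a side-by-side view of the narrative before and after de-identification.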

## 🎯 Advanced Examples

### Custom Fakers

In [None]:
rules_custom = {
    "recordTarget.patientRole.patient.name.given": "first_name",
    "recordTarget.patientRole.patient.name.family": "last_name",
    "recordTarget.patientRole.patient.administrativeGenderCode.code": "GENDER",
    "author.assignedAuthor.assignedPerson.name.given": "first_name",
    "author.assignedAuthor.assignedPerson.name.family": "last_name",
}

cda_deid_custom = (
    CdaDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("obfuscate")
      .setMappingRules(rules_custom)
      .setDays(20)
      .setSeed(88)
      .setCustomFakers({
          "GENDER": ["F", "M", "O"]  # Custom gender codes
      })
      .setObfuscateRefSource("both")
)

result_custom = cda_deid_custom.deidentify(cda_example)
print(result_custom[600:1400])

    <recordTarget>
        <patientRole>
            <id extension="998877" root="2.16.840.1.113883.19.5"/>
            <addr use="HP">
                <streetAddressLine>123 Main Street</streetAddressLine>
                <city>Istanbul</city>
                <state>34</state>
                <postalCode>34000</postalCode>
                <country>TR</country>
            </addr>
            <telecom use="HP" value="+905551112233"/>
            <patient>
                <name>
                    <given>Mylene</given>
                    <family>Cease</family>
                </name>
                <administrativeGenderCode code="O"/>
                <birthTime value="19800101"/>
            </patient>
        </patientRole>
    </recordTarget
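
Conceptually, a custom faker list combined with a fixed seed means a replacement is drawn from the supplied values in a reproducible way. The plain-Python sketch below mirrors that idea. It is an analogy only, since the transformer's own random generator and selection logic are not shown in this notebook, and `pick_fake` is a hypothetical name.

```python
import random

def pick_fake(candidates, seed):
    """Draw one replacement from `candidates`, reproducibly for a given seed."""
    return random.Random(seed).choice(candidates)

genders = ["F", "M", "O"]
first = pick_fake(genders, seed=88)
second = pick_fake(genders, seed=88)
print(first == second)   # True: the same seed yields the same choice
print(first in genders)  # True: values come only from the custom list
```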


### Mask Mode

In [None]:
rules_masking = {
    "recordTarget.patientRole.patient.name.given": "first_name",
    "recordTarget.patientRole.patient.name.family": "last_name",
    "recordTarget.patientRole.addr.streetAddressLine": "Address",
    "recordTarget.patientRole.addr.city": "city",
    "recordTarget.patientRole.addr.postalCode": "ZIP",
    "recordTarget.patientRole.addr.country": "COUNTRY",
    "recordTarget.patientRole.telecom.value": "Phone",
    "recordTarget.patientRole.id.extension": "ID",
    "author.assignedAuthor.assignedPerson.name.given": "first_name",
    "author.assignedAuthor.assignedPerson.name.family": "last_name",
    "author.assignedAuthor.id.extension": "ID",
    "custodian.assignedCustodian.representedCustodianOrganization.name": "Organization",
    "custodian.assignedCustodian.representedCustodianOrganization.id.extension": "ID",
    # "effectiveTime.value": "DATE",
    # "author.time.value": "DATE",
    # "recordTarget.patientRole.patient.birthTime.value": "DATE",
}


In [None]:
cda_deid_mask = (
    CdaDeIdentification()
      .setInputCol("text")
      .setOutputCol("deid")
      .setMode("mask")
      .setMappingRules(rules_masking)
      .setMaskingPolicy("same_length_chars")
      .setObfuscateRefSource("both")
)

result_mask = cda_deid_mask.deidentify(cda_example)
print("Masked CDA (first 600 characters):")
print(result_mask[:600])

Masked CDA (first 600 characters):
<?xml version="1.0" encoding="UTF-8" standalone="no"?><ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">

    <typeId extension="POCD_HD000040" root="2.16.840.1.113883.1.3"/>
    <templateId root="2.16.840.1.113883.10.20.1"/>
    <id extension="12345" root="2.16.840.1.113883.19.5"/>
    <code code="34133-9" codeSystem="2.16.840.1.113883.6.1" displayName="Summarization of Episode Note"/>
    <title>Good Health Clinic Consultation Note</title>
    <effectiveTime value="20240101120000"/>
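
The three `maskingPolicy` values can be pictured with the following sketch. It follows the behavior documented for Spark NLP's text de-identification annotators (bracketed asterisks of equal total length, an entity label in angle brackets, or a fixed run of asterisks). Treat the exact characters as an assumption for `CdaDeIdentification`, since the masked fields fall outside the 600 characters printed above; `mask_value` is a hypothetical helper.

```python
def mask_value(value: str, entity: str, policy: str, fixed_length: int = 4) -> str:
    """Illustrative masking; not the transformer's implementation."""
    if policy == "entity_labels":
        return f"<{entity}>"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    if policy == "same_length_chars":
        # Values shorter than three characters get bare asterisks;
        # longer ones are bracketed so the total length is unchanged.
        if len(value) < 3:
            return "*" * len(value)
        return "[" + "*" * (len(value) - 2) + "]"
    raise ValueError(f"unknown masking policy: {policy!r}")

print(mask_value("John", "NAME", "same_length_chars"))   # [**]
print(mask_value("John", "NAME", "entity_labels"))       # <NAME>
print(mask_value("John", "NAME", "fixed_length_chars"))  # ****
```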



## 📊 Batch Processing with Spark DataFrames

In [None]:
# Create multiple CDA documents
cda_doc1 = cda_example
cda_doc2 = cda_example.replace("John", "Jane").replace("Doe", "Smith").replace("Emily", "Michael")

# Create DataFrame
cda_batch_df = spark.createDataFrame([
    [cda_doc1],
    [cda_doc2]
]).toDF("text")

# Transform
deidentified_batch = cda_deid.transform(cda_batch_df)
deidentified_batch.select("deid").show(truncate=False, n=1)


# Real World Examples

In [None]:
# Clone the ccda-samples repository (skipped if the directory already exists)
![ -d ccda-samples ] || git clone https://github.com/jddamore/ccda-samples.git


In [None]:
import os
import glob
from ipywidgets import Dropdown, Button, Output, VBox, HTML
from IPython.display import display, clear_output

# Find all XML files in the ccda-samples directory
ccda_samples_path = "ccda-samples"
xml_files = []

if os.path.exists(ccda_samples_path):
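    # Use `filenames` rather than `files` to avoid shadowing google.colab's `files` module
    for root, dirs, filenames in os.walk(ccda_samples_path):
        for filename in filenames:
            if filename.endswith('.xml'):
                full_path = os.path.join(root, filename)
                xml_files.append((filename, full_path))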

print(f"Found {len(xml_files)} CDA XML files")
if xml_files:
    print("\nSample files:")
    for name, path in xml_files[:5]:
        print(f"  - {name}")

Found 384 CDA XML files

Sample files:
  - SLI_CCD_e1Alice_HealthCompanion_HC_11022017_deidentified.xml
  - SLI_CCD_e1Alice_HealthCompanion_HC_11022017.xml
  - 5595_5_deidentified.xml
  - 5595_5.xml
  - CDA-E1-6-22-2015.xml
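
The widget cell below calls `extract_patient_info`, which is not defined in any cell shown up to this point. If it is missing in your session, a minimal stand-in along these lines would make the comparison step runnable. This is an assumed implementation, not the notebook author's: the returned keys and the choice of fields are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}
ROLE = "hl7:recordTarget/hl7:patientRole"

def extract_patient_info(xml_text: str) -> dict:
    """Assumed stand-in: pull a few patient fields from a CDA document."""
    root = ET.fromstring(xml_text)

    def text(path):
        node = root.find(f"{ROLE}/{path}", NS)
        return node.text.strip() if node is not None and node.text else None

    def attr(path, name):
        node = root.find(f"{ROLE}/{path}", NS)
        return node.get(name) if node is not None else None

    return {
        "given": text("hl7:patient/hl7:name/hl7:given"),
        "family": text("hl7:patient/hl7:name/hl7:family"),
        "birth_time": attr("hl7:patient/hl7:birthTime", "value"),
        "phone": attr("hl7:telecom", "value"),
    }

doc = """<ClinicalDocument xmlns="urn:hl7-org:v3"><recordTarget><patientRole>
  <telecom use="HP" value="+905551112233"/>
  <patient>
    <name><given>John</given><family>Doe</family></name>
    <birthTime value="19800101"/>
  </patient>
</patientRole></recordTarget></ClinicalDocument>"""

print(extract_patient_info(doc))
```

Fields that are absent from a document come back as `None`, so the comparison loop in the widget still prints one line per key.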


In [None]:
# Create widget for selecting CDA file
if xml_files:
    # Create options dictionary: filename -> filepath
    file_options = {name: path for name, path in xml_files}

    dropdown = Dropdown(
        options=file_options,
        description='CDA File:',
        style={'description_width': 'initial'},
        layout={'width': '500px'}
    )

    output = Output()

    def deidentify_selected(b):
        with output:
            clear_output()
            selected_path = dropdown.value
            # Get filename from path
            selected_name = os.path.basename(selected_path)
            print(f"Processing: {selected_name}")
            print("=" * 80)

            # Read the selected CDA file
            try:
                with open(selected_path, 'r', encoding='utf-8') as f:
                    cda_content = f.read()

                print(f"\n📄 Original CDA Document (first 500 characters):")
                print("-" * 80)
                print(cda_content[:500])
                print("...")

                # De-identify using our CDA de-identification transformer
                deidentified_content = cda_deid.deidentify(cda_content)

                print(f"\n🔐 De-identified CDA Document (first 500 characters):")
                print("-" * 80)
                print(deidentified_content[:500])
                print("...")

                # Extract and compare patient info.
                # extract_patient_info must already be defined in the notebook when the button is clicked.
                print(f"\n📊 Comparison:")
                print("-" * 80)
                orig_info = extract_patient_info(cda_content)
                deid_info = extract_patient_info(deidentified_content)

                print("\nOriginal Patient Info:")
                for key, value in orig_info.items():
                    print(f"  {key}: {value}")

                print("\nDe-identified Patient Info:")
                for key, value in deid_info.items():
                    print(f"  {key}: {value}")

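                # Save de-identified version next to the source file.
                # os.path.splitext only touches the final extension, unlike str.replace,
                # which would rewrite every '.xml' occurrence in the path.
                base, ext = os.path.splitext(selected_path)
                output_path = f"{base}_deidentified{ext}"
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(deidentified_content)
                print(f"\n✅ De-identified document saved to: {output_path}")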

            except Exception as e:
                print(f"❌ Error processing file: {e}")
                import traceback
                traceback.print_exc()

    button = Button(
        description='🔐 De-identify Selected CDA',
        button_style='success',
        layout={'width': '300px'}
    )
    button.on_click(deidentify_selected)

    display(HTML("<h3>Select a CDA file to de-identify:</h3>"))
    display(VBox([dropdown, button, output]))
else:
    print("No XML files found. Please check if the repository was cloned successfully.")

HTML(value='<h3>Select a CDA file to de-identify:</h3>')

VBox(children=(Dropdown(description='CDA File:', layout=Layout(width='500px'), options={'SLI_CCD_e1Alice_Healt…
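
The widget handles one file per click. For unattended runs, the same read, de-identify, and save steps can be wrapped in a plain function. The sketch below is an illustration under stated assumptions: `deidentified_path` and `deidentify_files` are hypothetical helper names, and `deidentify_fn` stands for any callable that maps an XML string to a de-identified XML string (in this notebook that would presumably be `cda_deid.deidentify`).

```python
import os

def deidentified_path(path, suffix="_deidentified"):
    """Insert suffix before the final file extension only."""
    base, ext = os.path.splitext(path)
    return f"{base}{suffix}{ext}"

def deidentify_files(paths, deidentify_fn):
    """Apply deidentify_fn to each file and write the result beside it.

    Returns (written, failed): a list of output paths and a list of
    (input_path, error_message) pairs for files that could not be processed.
    """
    written, failed = [], []
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8") as f:
                content = f.read()
            # Compute the result first so a failure leaves no empty output file
            result = deidentify_fn(content)
            out_path = deidentified_path(path)
            with open(out_path, "w", encoding="utf-8") as f:
                f.write(result)
            written.append(out_path)
        except Exception as exc:  # mirror the widget: report the error and continue
            failed.append((path, str(exc)))
    return written, failed
```

A call such as `deidentify_files([p for _, p in xml_files], cda_deid.deidentify)` would process every discovered file and collect failures instead of stopping at the first one.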

In [None]:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import QName

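def extract_patient_info(xml_str):
    """Extract basic patient fields from a CDA XML string.

    Each field is taken from the first matching element anywhere in the
    document. That is usually the patient, because recordTarget precedes the
    other participants in the CDA header, but the lookup is not scoped to it.
    """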
    root = ET.fromstring(xml_str)
    ns = {'hl7': 'urn:hl7-org:v3'}

    patient_info = {}

    # Patient name
    given = root.find('.//hl7:given', ns)
    family = root.find('.//hl7:family', ns)
    if given is not None:
        patient_info['given'] = given.text
    if family is not None:
        patient_info['family'] = family.text

    # Address
    street = root.find('.//hl7:streetAddressLine', ns)
    city = root.find('.//hl7:city', ns)
    if street is not None:
        patient_info['street'] = street.text
    if city is not None:
        patient_info['city'] = city.text

    # Telecom
    telecom = root.find('.//hl7:telecom[@value]', ns)
    if telecom is not None:
        patient_info['telecom'] = telecom.get('value')

    return patient_info

# Extract from original
original_info = extract_patient_info(cda_example)
print("Original Patient Info:")
for key, value in original_info.items():
    print(f"  {key}: {value}")

# Extract from de-identified
deid_info = extract_patient_info(deidentified_cda)
print("\nDe-identified Patient Info:")
for key, value in deid_info.items():
    print(f"  {key}: {value}")

Original Patient Info:
  given: John
  family: Doe
  street: 123 Main Street
  city: Istanbul
  telecom: +905551112233

De-identified Patient Info:
  given: Mylene
  family: Cease
  street: 1011 North Cooper Street
  city: Bakebare
  telecom: +450006667788
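
`extract_patient_info` searches the whole document with `.//`, so a name or address belonging to an author or custodian could be returned if the patient entry lacks that field. A sketch of a lookup scoped to `recordTarget/patientRole` follows, together with a small check that lists fields left unchanged after de-identification. Both function names are hypothetical, and the element paths assume the standard CDA R2 header layout.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

# Paths relative to ClinicalDocument/recordTarget/patientRole
FIELD_PATHS = {
    "given": "hl7:patient/hl7:name/hl7:given",
    "family": "hl7:patient/hl7:name/hl7:family",
    "street": "hl7:addr/hl7:streetAddressLine",
    "city": "hl7:addr/hl7:city",
}

def extract_record_target_info(xml_str):
    """Read patient fields only from the first recordTarget/patientRole."""
    root = ET.fromstring(xml_str)
    role = root.find("hl7:recordTarget/hl7:patientRole", NS)
    info = {}
    if role is None:
        return info
    for key, path in FIELD_PATHS.items():
        node = role.find(path, NS)
        if node is not None:
            info[key] = node.text
    telecom = role.find("hl7:telecom[@value]", NS)
    if telecom is not None:
        info["telecom"] = telecom.get("value")
    return info

def unchanged_fields(original, deidentified):
    """Return the sorted keys whose value is identical in both dictionaries."""
    return sorted(k for k, v in original.items() if deidentified.get(k) == v)
```

An empty list from `unchanged_fields(orig_info, deid_info)` means every extracted field was altered. A non-empty list points at fields to review in the de-identification configuration.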


In [None]:
# Process a specific CDA file directly
if xml_files:
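    # Take the first discovered XML file as an example. Discovery order follows
    # os.walk, so this may be a *_deidentified.xml file written by an earlier run.
    example_file = xml_files[0][1]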
    print(f"Processing: {example_file}")

    # Read the file
    with open(example_file, 'r', encoding='utf-8') as f:
        real_cda = f.read()

    print(f"\nOriginal CDA length: {len(real_cda)} characters")
    print(f"First 300 characters:\n{real_cda[:300]}...")

    # De-identify
    deidentified_real_cda = cda_deid.deidentify(real_cda)

    print(f"\nDe-identified CDA length: {len(deidentified_real_cda)} characters")
    print(f"First 300 characters:\n{deidentified_real_cda[:300]}...")

    # Compare patient information
    print("\n" + "=" * 80)
    print("Patient Information Comparison:")
    print("=" * 80)

    orig_patient = extract_patient_info(real_cda)
    deid_patient = extract_patient_info(deidentified_real_cda)

    print("\nOriginal:")
    for key, value in orig_patient.items():
        print(f"  {key}: {value}")

    print("\nDe-identified:")
    for key, value in deid_patient.items():
        print(f"  {key}: {value}")
else:
    print("No XML files available. Please clone the repository first.")

Processing: ccda-samples/Health Companion/SLI_CCD_e1Alice_HealthCompanion_HC_11022017_deidentified.xml

Original CDA length: 138370 characters
First 300 characters:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="style.xsl"?><!-- This XML document was retrieved from a public repository of certified EHR samples for Meaningful Use 
That repository states "The public will be able to view-download the submitted files fro...

De-identified CDA length: 138384 characters
First 300 characters:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="style.xsl"?><!-- This XML document was retrieved from a public repository of certified EHR samples for Meaningful Use 
That repository states "The public will be able to view-download the submitted files fro...

Patient Information Comparison:

Original:
  given: Roman
  family: Slim
  street: Address
  city: Hines
  telecom: ezk:+6(000)-660-6789

De-identified:
  give