# DICOM Metadata De-Identification Pipeline

In this notebook, we demonstrate how to **de-identify DICOM metadata** using **Healthcare NLP** and **Visual NLP** models.

The approach implemented here follows the methodology described in the blog post below, focusing specifically on **metadata-level PHI removal**:

https://medium.com/john-snow-labs/de-identifying-dicom-files-a-step-by-step-guide-with-john-snow-labs-visual-nlp-2c21b60f92a8

---

## Pipeline Overview

The pipeline is driven by a **strategy file (CSV)** that defines how each DICOM tag should be handled:

- Tags marked with **`cleanTag`** are routed through **NER-based PHI detection**
- All remaining tags are processed using **rule-based and deterministic cleaning logic**
- Nested sequences, private tags, and UID fields are handled according to DICOM best practices

This notebook focuses **exclusively on metadata de-identification**  
(no pixel-level PHI removal).

For a **complete Pixel + Metadata de-identification workflow**, refer to:  
**`SparkOCRDicomCombinedNerDeid.ipynb`**
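
The strategy-driven routing described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the three sample rows are illustrative, they reuse the column layout of the strategy file displayed later in this notebook, and `split_by_action` is a hypothetical helper.

```python
import csv
import io

# Illustrative rows in the strategy file's column layout
# (Tags, VR, Name, Status, Action, Option). Not the full file.
SAMPLE_STRATEGY = '''Tags,VR,Name,Status,Action,Option
"(0008, 002A)",DT,Acquisition DateTime,,shiftDateByRandomNbOfDays,
"(0008, 009C)",PN,Consulting Physician's Name,,remove,
"(0008, 1030)",LO,Study Description,,cleanTag,
'''


def split_by_action(rows):
    """Separate tags routed to NER (cleanTag) from tags handled by rules."""
    ner_tags, rule_tags = [], []
    for row in rows:
        target = ner_tags if row["Action"] == "cleanTag" else rule_tags
        target.append((row["Tags"], row["Action"]))
    return ner_tags, rule_tags


rows = list(csv.DictReader(io.StringIO(SAMPLE_STRATEGY)))
ner_tags, rule_tags = split_by_action(rows)
print("NER-routed tags:", ner_tags)
print("Rule-based tags:", rule_tags)
```

In the notebook itself this split is performed internally by `DicomToMetadata` and `DicomMetadataDeidentifier`; the sketch only shows the idea of dispatching on the `Action` column.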

---

## Notebook Workflow

The notebook is organized into the following steps:

### 1. Load DICOM Files
Read and inspect raw DICOM files from disk, including metadata structure and tag values.

### 2. View Strategy Files
Examine the CSV strategy file that defines **tag-level de-identification actions**, such as:
- `cleanTag`
- `remove`
- `hashId`
- `replaceWithLiteral`
- `shiftDateByRandomNbOfDays`

### 3. Run Metadata De-Identification Pipeline
Apply the metadata transformation pipeline and inspect intermediate stages, including:
- NER-based PHI detection on selected tags  
- Rule-based transformations for dates, UIDs, and identifiers  

### 4. Compare Original vs De-Identified Metadata
Analyze tag-level changes between:
- Original metadata  
- Cleaned / transformed metadata  

### 5. Extract PATIENT / UID Mappings
Generate mappings for the following identifiers to support consistency, traceability, and validation:
- Patient identifiers
- Study / Series / SOP Instance UIDs

### 6. Save De-Identified DICOMs to Disk
Persist cleaned DICOM files for downstream processing, validation, or deployment.

In [1]:
license = ""
secret = ""
nlp_secret = ""
aws_access_key = ""
aws_secret_key = ""

version = secret.split("-")[0]
spark_ocr_jar_path = "../../../target/scala-2.12"

## Start Spark session

In [2]:
import os

from sparkocr import start

if license:
    os.environ['JSL_OCR_LICENSE'] = license
    os.environ['SPARK_NLP_LICENSE'] = license

if aws_access_key:
    os.environ['AWS_ACCESS_KEY'] = aws_access_key
    os.environ['AWS_SECRET_ACCESS_KEY'] = aws_secret_key


spark = start(secret=secret,
              nlp_secret=nlp_secret,
              jar_path=spark_ocr_jar_path,
              nlp_internal="6.2.2")

spark

Spark version: 3.5.0
Spark NLP version: 6.2.0
Spark NLP for Healthcare version: 6.2.0
Spark OCR version: 6.2.2rc1

:: loading settings :: url = jar:file:/usr/local/lib/python3.11/dist-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5bc4c365-03f1-48f3-82a1-20ce9efc1c26;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;6.2.0 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-s3;1.12.500 in central
	found com.amazonaws#aws-java-sdk-kms;1.12.500 in central
	found com.amazonaws#aws-java-sdk-core;1.12.500 in central
	found commons-logging#commons-logging;1.1.3 in central
	found commons-codec#commons-codec;1.15 in central
	found org.apache.httpcomponents#httpclient;4.5.13 in central
	found org.apache.httpcomponents#httpcore;4.4.13 in central
	found software.amazon.ion#ion-java;1.0.2 in central
	found joda-time#joda-time;2.8.1 in central
	found com.amazonaws#jmespath-java;1.12.500 in central
	found com.g

## Import transformers and annotators

In [3]:
import os
import sys
import shutil

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp_jsl
from sparknlp_jsl.annotator import *

import sparkocr
from sparkocr.transformers import *
from sparkocr.utils import *
from sparkocr.enums import *
from sparkocr.schemas import BinarySchema

from pyspark.ml import PipelineModel, Pipeline
from pyspark.sql.functions import *

print(f"Spark NLP version: {sparknlp.version()}")
print(f"Spark NLP internal version: {sparknlp_jsl.version()}")
print(f"Spark OCR version: {sparkocr.version()}")

Spark NLP version: 6.2.0
Spark NLP internal version: 6.2.0
Spark OCR version: 6.2.2rc1


## Load DICOM Files

In [4]:
midib_root_path = './../data/dicom/midib/*.dcm'

dicom_df = spark.read.format("binaryFile").load(midib_root_path)

dicom_df

                                                                                

path,modificationTime,length,content
file:/workspace/D...,2025-12-18 12:45:43,8179050,[00 00 00 00 00 0...
file:/workspace/D...,2025-12-18 12:41:06,7022298,[00 00 00 00 00 0...


## Load Strategy File

In [5]:
import pkg_resources
import pandas as pd

default_csv_path = pkg_resources.resource_filename('sparkocr', "resources/ocr/dicom/midi_b_strategy_ner.csv")

strategy_df = pd.read_csv(default_csv_path)

strategy_df.head()

Unnamed: 0,Tags,VR,Name,Status,Action,Option
0,"(0008, 002A)",DT,Acquisition DateTime,,shiftDateByRandomNbOfDays,
1,"(0008, 009C)",PN,Consulting Physician's Name,,remove,
2,"(0008, 010D)",UI,Context Group Extension Creator UID,,hashId,
3,"(0010, 21D0)",DA,Last Menstrual Date,,shiftDateByRandomNbOfDays,
4,"(0010, 21F0)",LO,Patient's Religious Preference,,remove,


In [6]:
# all available actions in the present strategy file
strategy_df["Action"].unique()

array(['shiftDateByRandomNbOfDays', 'remove', 'hashId',
       'replaceWithRandomName', 'shiftUnixTimeStampRandom',
       'replaceWithLiteral', 'cleanTag', 'ensureTagExists',
       'patientHashId'], dtype=object)

## Tags Marked for `cleanTag` Action

In this section, we inspect all DICOM tags that are marked with the `cleanTag` action in the strategy file.

These tags are **extracted as text** and passed through the **NER-based PHI detection pipeline**, which identifies and de-identifies sensitive information in free-text metadata fields based on context.

In [7]:
dicom_to_metadata = DicomToMetadata() \
    .setInputCol("path") \
    .setOutputCol("metadata") \
    .setKeepInput(True) \
    .setExtractTagForNer(True) \
    .setTagMappingCol("tag_mapping") \
    .setTagCol("tag_text") \
    .setStrategyFile(default_csv_path)

In [8]:
metadata_result = dicom_to_metadata.transform(dicom_df).cache()

metadata_result.columns



['path',
 'modificationTime',
 'length',
 'content',
 'exception',
 'metadata',
 'tag_text',
 'tag_mapping']

In [9]:
metadata_result.select("path", "tag_text", "tag_mapping").show(10, True)

14:41:00, INFO Run DicomToMetadata                                  (0 + 2) / 2]
14:41:00, INFO Run DicomToMetadata
  warn_and_log(msg)
  warn_and_log(msg)
  warn_and_log(msg)
  warn_and_log(msg)
  warn_and_log(msg)
14:41:06, INFO DicomToMetadata : Total Rules Extracted 198
14:41:06, INFO DicomToMetadata : Total cleanTag Action Found : 14
14:41:06, INFO DicomToMetadata : Returning Metadata, Document Object and Tag Mapping for NER
14:41:06, INFO DicomToMetadata : Total Rules Extracted 198
14:41:06, INFO DicomToMetadata : Total cleanTag Action Found : 14
14:41:06, INFO DicomToMetadata : Returning Metadata, Document Object and Tag Mapping for NER


+--------------------+--------------------+--------------------+
|                path|            tag_text|         tag_mapping|
+--------------------+--------------------+--------------------+
|file:/workspace/D...|CHEST 2 VIEW<dico...|[00081030, 0008103E]|
|file:/workspace/D...|CHEST, PA & LATER...|[00081030, 001810...|
+--------------------+--------------------+--------------------+
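
The relationship between `tag_text` and `tag_mapping` can be sketched as follows. This assumes, based on the truncated output above and the `<dicom>` custom sentence bound used in the pipeline, that the tag values are joined with a `<dicom>` delimiter and line up one-to-one with the tag IDs in `tag_mapping`. The second text segment below is made up, and `pair_tags_with_text` is a hypothetical helper, not part of the library.

```python
DELIMITER = "<dicom>"  # assumed separator between tag values in tag_text


def pair_tags_with_text(tag_text, tag_mapping):
    """Map each tag ID to its text segment, assuming positional alignment."""
    segments = tag_text.split(DELIMITER)
    if len(segments) != len(tag_mapping):
        raise ValueError(
            f"{len(segments)} text segments but {len(tag_mapping)} tag IDs"
        )
    return dict(zip(tag_mapping, segments))


# First segment and tag IDs come from the output above; the second segment is illustrative.
pairs = pair_tags_with_text(
    "CHEST 2 VIEW<dicom>PA AND LATERAL",
    ["00081030", "0008103E"],
)
for tag, text in pairs.items():
    print(tag, "->", text)
```

Keeping one segment per tag is what would allow the de-identified text to be written back to the correct tag after the NER stages.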



                                                                                

## Metadata NER De-Identification Pipeline

In [10]:
dicom_to_metadata = DicomToMetadata() \
    .setInputCol("path") \
    .setOutputCol("metadata_original") \
    .setKeepInput(True) \
    .setExtractTagForNer(True) \
    .setTagMappingCol("tag_mapping") \
    .setTagCol("tag_text") \
    .setStrategyFile(default_csv_path)

t_document_assembler = DocumentAssembler() \
    .setInputCol("tag_text") \
    .setOutputCol("t_document")

t_sentencer = SentenceDetector()\
    .setInputCols(["t_document"])\
    .setOutputCol("t_sentence") \
    .setCustomBounds(["<dicom>"]) \
    .setUseCustomBoundsOnly(True)

t_tokenizer = Tokenizer() \
    .setInputCols(["t_sentence"]) \
    .setOutputCol("t_token") \
    .setContextChars(["_", "^", "."])

t_word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["t_sentence", "t_token"]) \
    .setOutputCol("t_embeddings")

t_ner_deidentify_dl = MedicalNerModel.pretrained("ner_deidentify_dl", "en", "clinical/models") \
    .setInputCols("t_sentence", "t_token", "t_embeddings") \
    .setOutputCol("t_ner_deidentify_dl")

t_ner_deidentify_dl_ner_converter = NerConverterInternal() \
    .setInputCols(["t_sentence", "t_token", "t_ner_deidentify_dl"]) \
    .setOutputCol("t_ner_deidentify_dl_ner_chunk") \
    .setWhiteList(["DOCTOR", "PATIENT", "ORGANIZATION", "LOCATION-OTHER", "COUNTRY", "CITY", "STATE", "STREET", "ZIP"]) \
    .setThreshold(0.82)

labels = ["CITY", "COUNTRY", "DATE", "DOCTOR", "HOSPITAL", "ORGANIZATION", "PATIENT", "PHONE", "PROFESSION", "STATE", "STREET", "ZIP"]
t_zeroshot_ner_deid_subentity_docwise_large = PretrainedZeroShotNER.pretrained("zeroshot_ner_deid_subentity_docwise_large", "en", "clinical/models") \
    .setInputCols("t_sentence", "t_token") \
    .setOutputCol("t_zeroshot_ner_deid_subentity_docwise_large") \
    .setPredictionThreshold(0.90) \
    .setLabels(labels)

t_zeroshot_ner_deid_subentity_docwise_large_ner_converter = NerConverterInternal()\
    .setInputCols("t_sentence", "t_token", "t_zeroshot_ner_deid_subentity_docwise_large")\
    .setOutputCol("t_zeroshot_ner_deid_subentity_docwise_large_ner_chunk")

codes = [
    "AK","TSARH","BH","FGH","PH","WGMC","GVH","MCH","MJM","SS","CWASH","RASC","CMCH","FMCC",
    "TLM","RMABCH","EBMC","MM","SCH","WM","AAARM","HM","WHAYCC","BGH","NH","MDCH","DSAPCC",
    "MBARCH","CMC","MLASGH","FWARMC","GFASCC","JCH","MKAAH","WNGH","PAAHMC","WBARCC","HG",
    "WG","HMCC","CHC","HCCH","CHAJH","PQACH","MSCH","MJH","HMAJMC","DHCH","GGAMH","LGMC",
    "HRG","HGAWC","WJADC","DC","LCH","WSC","BJAHGH","SSABM","HSG","HHAMM","BGCC","JASGH",
    "HWGH","AMC","KM","RS","DL","CL","LS","KP","CV","AR","CB","SE","ES","JD","CCH","DBAMCH",
    "MSAMM","SBAJH","RTAJCC","BASM","VHG","HPCC","HPAOM","AL","RD","CW","DTAMG","SCASM",
    "HFACG","YPH","MWM","SL","SW","VAGH","SJ", "AH", "CH", "JCASCH", "LBAWCH","ECC", "NRALG",
    "MWG", "WBAACH", "FBH", "CAAFMC", "BYARMC"
]

pattern = r"\b(" + "|".join(codes) + r")\b"

t_regex_matcher = RegexMatcher() \
    .setInputCols("t_sentence") \
    .setOutputCol("t_regex") \
    .setRules([
        # Match known hospital or facility codes (from predefined list)
        pattern + ";CODES",
        # Match professional titles like Dr., DR., dr
        r"\b[Dd][Rr]\b\.?;PROFESSION",
        # Match numeric IDs in format: 123.456.7890x123
        r"\d{3}\.\d{3}\.\d{4}x\d{3};ID",
        # Match compact date-like numeric patterns (YYMMDD etc.)
        r"\b(?:\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01]))\b;DATE",
        # Match phone-like patterns with optional parentheses and extension,
        # e.g. (151)265-4464x20; labeled as PHONE
        r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:x\d{1,5})?;PHONE",
        # Match SSN-style numeric patterns (e.g. 300-45-3175); labeled as DATE
        r"\b\d{3}-\d{2}-\d{4}\b;DATE"]) \
    .setDelimiter(";")

t_chunk_converter = ChunkConverter()\
    .setInputCols("t_regex")\
    .setOutputCol("t_regex_ner_chunk")

t_chunk_merger = ChunkMergeApproach()\
    .setInputCols('t_regex_ner_chunk', 't_ner_deidentify_dl_ner_chunk', 't_zeroshot_ner_deid_subentity_docwise_large_ner_chunk')\
    .setOutputCol('t_merged_ner_chunk')\
    .setMergeOverlapping(True)

filter_chunk = ["US", "MR BREAS", "MR BREAST BILAT", "MR BREASTUNI", "MIEDNICA", "WWO", "THINS", 
      "3D Rendering", "MAMMO", "CT", "DCE", "CA", "MC prostaat kliniek detectie-mc"]

t_chunk_filterer = ChunkFilterer() \
  .setInputCols(["t_sentence","t_merged_ner_chunk"]) \
  .setOutputCol("t_filtered") \
  .setCriteria("isin") \
  .setFilterValue("result") \
  .setBlackList(filter_chunk)

deid_obfuscated = DeIdentification() \
    .setInputCols(["t_sentence", "t_token", "t_filtered"]) \
    .setOutputCol("deid_documents") \
    .setMode("deid")

dicom_deidentifier = DicomMetadataDeidentifier() \
    .setInputCols(["path"]) \
    .setOutputCol("dicom_metadata_cleaned") \
    .setTagMappingCol("tag_mapping") \
    .setKeepInput(True) \
    .setTagCleanedCol("deid_documents") \
    .setStrategyFile(default_csv_path)

extract_final_tag = DicomToMetadata() \
    .setInputCol("dicom_metadata_cleaned") \
    .setOutputCol("metadata_cleaned") \
    .setKeepInput(True) \
    .setExtractTagForNer(False) \
    .setStrategyFile(default_csv_path)

pipeline = Pipeline(stages=[
    dicom_to_metadata,
    t_document_assembler,
    t_sentencer,
    t_tokenizer,
    t_word_embeddings,
    t_ner_deidentify_dl,
    t_ner_deidentify_dl_ner_converter,
    t_zeroshot_ner_deid_subentity_docwise_large,
    t_zeroshot_ner_deid_subentity_docwise_large_ner_converter,
    t_regex_matcher,
    t_chunk_converter,
    t_chunk_merger,
    t_chunk_filterer,
    deid_obfuscated,
    dicom_deidentifier,
    extract_final_tag
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
Download done! Loading the resource.
[OK!]
ner_deidentify_dl download started this may take some time.
Approximate size to download 14.1 MB
Download done! Loading the resource.

2025-12-19 14:41:24.256654: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-12-19 14:41:24.334194: W external/org_tensorflow/tensorflow/core/common_runtime/colocation_graph.cc:1218] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_de

[OK!]
zeroshot_ner_deid_subentity_docwise_large download started this may take some time.
Approximate size to download 1.5 GB
Download done! Loading the resource.
[OK!]
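
The regex rules passed to `RegexMatcher` above can be sanity-checked outside Spark. The sketch below copies four of the patterns and runs them through Python's `re` module. Treat it as an approximation: `RegexMatcher` runs on the JVM, whose regex engine can differ from Python's in edge cases, and `find_matches` is a hypothetical helper written for this check.

```python
import re

# Patterns and labels copied from the RegexMatcher rules in the pipeline cell.
RULES = [
    (r"\b[Dd][Rr]\b\.?", "PROFESSION"),
    (r"\b(?:\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01]))\b", "DATE"),
    (r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:x\d{1,5})?", "PHONE"),
    (r"\b\d{3}-\d{2}-\d{4}\b", "DATE"),
]


def find_matches(text):
    """Return (label, matched_text) pairs for every rule that fires on text."""
    hits = []
    for pattern, label in RULES:
        for match in re.finditer(pattern, text):
            hits.append((label, match.group(0)))
    return hits


for sample in ["Dr. Smith", "190526", "(151)265-4464x20", "300-45-3175", "20190526"]:
    print(repr(sample), "->", find_matches(sample))
```

Note that the six-digit date rule does not fire on an eight-digit `YYYYMMDD` value such as `20190526`; in this notebook, date tags of that form are handled by the `shiftDateByRandomNbOfDays` action rather than by the regex stage.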


In [11]:
result = pipeline.fit(dicom_df).transform(dicom_df).cache()

result.columns

['dicom_metadata_cleaned',
 'exception',
 'metadata_original',
 't_document',
 't_sentence',
 't_token',
 't_embeddings',
 't_ner_deidentify_dl',
 't_ner_deidentify_dl_ner_chunk',
 't_zeroshot_ner_deid_subentity_docwise_large',
 't_zeroshot_ner_deid_subentity_docwise_large_ner_chunk',
 't_regex',
 't_regex_ner_chunk',
 't_merged_ner_chunk',
 't_filtered',
 'deid_documents',
 'path',
 'metadata_cleaned']

### Extract Original Vs Cleaned Metadata to Pandas DF

In [12]:
import json

collect_result = []

for row in result.select("path", "metadata_original", "metadata_cleaned").toLocalIterator():
    data = row.asDict()

    metadata_original = json.loads(data["metadata_original"])
    metadata_cleaned = json.loads(data["metadata_cleaned"])

    # Compare each original tag with its de-identified counterpart
    for tag, original_entry in metadata_original.items():
        original_value = original_entry["value"]
        cleaned_value = metadata_cleaned[tag]["value"]
        value_changed = original_value != cleaned_value

        collect_result.append([data["path"], tag, original_entry["vr"], original_value, cleaned_value, value_changed])

14:41:45, INFO Run DicomToMetadata                                  (0 + 2) / 2]
14:41:45, INFO DicomToMetadata : Total Rules Extracted 198
14:41:45, INFO DicomToMetadata : Total cleanTag Action Found : 14
14:41:45, INFO DicomToMetadata : Returning Metadata, Document Object and Tag Mapping for NER
14:41:45, INFO Run DicomToMetadata
14:41:45, INFO DicomToMetadata : Total Rules Extracted 198
14:41:45, INFO DicomToMetadata : Total cleanTag Action Found : 14
14:41:45, INFO DicomToMetadata : Returning Metadata, Document Object and Tag Mapping for NER


Using CPUs


14:41:54, INFO Run DicomMetadataDeidentifier
14:41:54, INFO Run DicomMetadataDeidentifier
  warn_and_log(msg)
  warn_and_log(msg)
  warn_and_log(msg)
  warn_and_log(msg)
  warn_and_log(msg)
14:41:57, INFO DicomMetadataDeidentifier : Invalid Option : '' For Tag : '0008002A' And VR : 'DT'. Using Random instead.
14:41:57, INFO DicomMetadataDeidentifier : Updating Tag '0008002A' VR 'DT'
14:41:57, INFO DicomMetadataDeidentifier : Tag '0008009C' Present in Strategy file but missing in Dicom.
14:41:57, INFO DicomMetadataDeidentifier : Tag '0008010D' Present in Strategy file but missing in Dicom.
14:41:57, INFO DicomMetadataDeidentifier : Invalid Option : '' For Tag : '001021D0' And VR : 'DA'. Using Random instead.
14:41:57, INFO DicomMetadataDeidentifier : Updating Tag '001021D0' VR 'DA'
14:41:57, INFO DicomMetadataDeidentifier : Tag '001021F0' Present in Strategy file but missing in Dicom.
14:41:57, INFO DicomMetadataDeidentifier : Tag '0018700C' Present in Strategy file but missing in Dicom
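
The comparison loop above can also be written as a standalone helper that tolerates tags missing from the cleaned metadata. This is a sketch under the assumption, taken from the loop, that each metadata JSON maps a tag ID to an object with `vr` and `value` keys. The sample values and the `diff_metadata` name are illustrative only.

```python
import json


def diff_metadata(original_json, cleaned_json):
    """Return (tag, vr, original, cleaned, changed) rows for every original tag."""
    original = json.loads(original_json)
    cleaned = json.loads(cleaned_json)
    rows = []
    for tag, entry in original.items():
        original_value = entry.get("value")
        # A tag absent from the cleaned metadata is reported with value None.
        cleaned_value = cleaned.get(tag, {}).get("value")
        rows.append(
            (tag, entry.get("vr"), original_value, cleaned_value,
             original_value != cleaned_value)
        )
    return rows


# Illustrative metadata, not taken from a real file.
original_json = json.dumps({
    "00080005": {"vr": "CS", "value": "ISO_IR 100"},
    "00080020": {"vr": "DA", "value": "20190526"},
    "0008009C": {"vr": "PN", "value": "DOE^JANE"},
})
cleaned_json = json.dumps({
    "00080005": {"vr": "CS", "value": "ISO_IR 100"},
    "00080020": {"vr": "DA", "value": "20190506"},
})

for row in diff_metadata(original_json, cleaned_json):
    print(row)
```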

In [13]:
columns = ["File", "Tag", "VR", "Original_Value", "Cleaned_Value", "Is_Changed"]

metadata_result_df = pd.DataFrame(collect_result, columns=columns)
metadata_result_df.head(5)

Unnamed: 0,File,Tag,VR,Original_Value,Cleaned_Value,Is_Changed
0,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80005,CS,ISO_IR 100,ISO_IR 100,False
1,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80008,CS,"[ORIGINAL, PRIMARY, ]","[ORIGINAL, PRIMARY, ]",False
2,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80016,UI,1.2.840.10008.5.1.4.1.1.1.1,1.2.840.10008.5.1.4.1.1.1.1,False
3,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80018,UI,2.3.185.0.2.0801206.6.512.1221786140775342924,2.25.130377398866551351689158762623323288883,True
4,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80020,DA,20190526,20190506,True


### Filter All Tags That are DeIdentified

In [14]:
metadata_result_df[metadata_result_df["Is_Changed"]].head(5)

Unnamed: 0,File,Tag,VR,Original_Value,Cleaned_Value,Is_Changed
3,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80018,UI,2.3.185.0.2.0801206.6.512.1221786140775342924,2.25.130377398866551351689158762623323288883,True
4,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80020,DA,20190526,20190506,True
5,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80021,DA,20190526,20190525,True
6,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80022,DA,20190526,20190401,True
7,file:/workspace/DICOM_WORKSHOP/NEW_NBS/midib/T...,80023,DA,20190526,20190504,True
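
The date changes above come from the `shiftDateByRandomNbOfDays` action. The idea can be sketched with the standard library as below. This is not the library's implementation: the helper names, the backward-only shift, and the 60-day bound are assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta

DA_FORMAT = "%Y%m%d"  # DICOM DA values are written as YYYYMMDD


def shift_da(value, days):
    """Shift a DICOM DA string by a signed number of days."""
    shifted = datetime.strptime(value, DA_FORMAT) + timedelta(days=days)
    return shifted.strftime(DA_FORMAT)


def shift_da_randomly(value, max_days=60, rng=random):
    """Shift a date back by a random offset (max_days is a hypothetical bound)."""
    return shift_da(value, -rng.randint(1, max_days))


print(shift_da("20190526", -20))
print(shift_da_randomly("20190526"))
```

In the output above, each date tag in the same file received a different value, which suggests that the offset is drawn per tag when no `Option` is given in the strategy file (the log reports "Using Random instead").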


## Extract UID/Patient ID Mapping

In [15]:
from sparkocr.utils import generate_dicom_mapping

help(generate_dicom_mapping)

Help on function generate_dicom_mapping in module sparkocr.utils:

generate_dicom_mapping(df, original_col, final_col)
    Return mapping of old and new dicom identifiers.
    
    :param df : Dataframe containing original and cleaned metadata
    :param original_col : Dataframe column containing original metadata
    :param final_col : Dataframe column containing cleaned metadata
    :return {"uid_mapping" : {old_uid:new_uid}, "patient_mapping" : {old_patid:new_patid}



In [16]:
mapping_result = generate_dicom_mapping(result, "metadata_original", "metadata_cleaned")

columns = ["id_old", "id_new"]

uid_mapping = pd.DataFrame(list(mapping_result["uid_mapping"].items()), columns=columns)

uid_mapping.head(5)

Unnamed: 0,id_old,id_new
0,1.2.840.10008.5.1.4.1.1.1.1,1.2.840.10008.5.1.4.1.1.1.1
1,2.3.185.0.2.0801206.6.512.1221786140775342924,2.25.130377398866551351689158762623323288883
2,2.3.185.0.2.0801206.6.512.1530417802811153775,2.25.94627344826000112534085906369516834805
3,2.3.185.0.2.0801206.6.512.1696618440938865491,2.25.70571521897336622285826091149294726595
4,2.3.185.0.2.0801206.6.512.1517291030850877420,2.25.146107250882657279873475188628760276109
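
The replacement UIDs above all start with `2.25.`, the OID root reserved for UIDs derived from a 128-bit UUID-style integer. A deterministic way to build such a UID from an original one is sketched below. The use of SHA-256, the `salt` parameter, and the `hash_uid` name are assumptions for illustration; the notebook does not show how the `hashId` action computes its values.

```python
import hashlib


def hash_uid(uid, salt=""):
    """Derive a stable 2.25.<integer> UID from an original UID (illustrative only)."""
    digest = hashlib.sha256((salt + uid).encode("utf-8")).digest()
    # Use the first 128 bits so the result fits DICOM's 64-character UID limit.
    number = int.from_bytes(digest[:16], "big")
    return f"2.25.{number}"


original_uid = "2.3.185.0.2.0801206.6.512.1221786140775342924"
new_uid = hash_uid(original_uid)
print(new_uid, len(new_uid))
```

Because the mapping is deterministic, the same original UID always yields the same replacement, which keeps Study, Series, and SOP Instance references consistent across files.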


In [17]:
columns = ["id_old", "id_new"]

patient_mapping = pd.DataFrame(list(mapping_result["patient_mapping"].items()), columns=columns)

patient_mapping.head(20)

Unnamed: 0,id_old,id_new
0,2063640840,1512113109
1,913829086,131119190
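
Before relying on a mapping for traceability, it can be worth checking that no two original identifiers were mapped to the same new identifier. The sketch below runs such a check on the patient mapping shown above; `find_collisions` is a hypothetical helper, not part of `sparkocr`.

```python
def find_collisions(mapping):
    """Return {new_id: [old_ids]} for new IDs shared by more than one old ID."""
    seen = {}
    for old_id, new_id in mapping.items():
        seen.setdefault(new_id, []).append(old_id)
    return {new_id: olds for new_id, olds in seen.items() if len(olds) > 1}


# Values copied from the patient mapping output above.
patient_ids = {"2063640840": "1512113109", "913829086": "131119190"}
print(find_collisions(patient_ids))
```

An empty result means the mapping is one-to-one for the files processed. The same check applies to `mapping_result["uid_mapping"]`.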


## Save DICOM Files to Disk

In [19]:
root_path = "./dicom_deid_output"

os.makedirs(root_path, exist_ok=True)

for item in result.select("path", "dicom_metadata_cleaned").toLocalIterator():
    data = item.asDict()
    filename = os.path.basename(data["path"])

    file_out_path = os.path.join(root_path, filename)

    with open(file_out_path, "wb") as dicom_out:
        dicom_out.write(data["dicom_metadata_cleaned"])

In [20]:
# Clean up the demo output directory (skip this cell to keep the de-identified files)
shutil.rmtree(root_path)