<!DOCTYPE html>
<html>
<body>

<h1>DICOM Metadata Deidentification:</h1>

<p>This notebook provides a hands-on demonstration of deidentifying Protected Health Information (PHI) within DICOM metadata. We’ll walk through the key steps and include code examples to guide you through the process.</p>

<h2>Environment Setup:</h2>

<ul>
    <li><strong>Google Colab (Recommended):</strong> If you are using Google Colab, the environment is generally pre-configured for many dependencies.</li>
    <li><strong>Local Setup (If not using Colab):</strong>
        <ul>
            <li><strong>Homebrew:</strong> Ensure you have Homebrew installed (e.g., <code>/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"</code>).</li>
            <li><strong>Apache Spark:</strong> Install Apache Spark (e.g., <code>brew install apache-spark</code>). By default, OpenJDK 17 will be installed.</li>
            <li><strong>OpenJDK 17:</strong> Ensure <code>JAVA_HOME</code> is set before starting the <code>spark-ocr</code> session, e.g., <code>os.environ['JAVA_HOME'] = '/home/linuxbrew/.linuxbrew/Cellar/openjdk@17/17.0.14'</code>.</li>
        </ul>
    </li>
</ul>
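<p>For a local setup, the Java path step can be made a little safer by checking that the directory exists before exporting it. The sketch below is only an illustration: <code>running_in_colab</code> and <code>configure_java_home</code> are hypothetical helpers, the Colab check relies on the assumption that the <code>google.colab</code> module is already loaded inside a Colab kernel, and the JDK path is the one used in this notebook and will differ on other machines.</p>

```python
import os
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module is loaded (assumed to mean Colab)."""
    return "google.colab" in sys.modules


def configure_java_home(candidate: str, env=os.environ) -> bool:
    """Set JAVA_HOME to `candidate` only if that directory exists.

    Returns True when JAVA_HOME was set, False otherwise.
    """
    if os.path.isdir(candidate):
        env["JAVA_HOME"] = candidate
        return True
    return False


# Path taken from this notebook; adjust it to your Homebrew prefix and JDK version.
JAVA_CANDIDATE = "/home/linuxbrew/.linuxbrew/Cellar/openjdk@17/17.0.14"

if not running_in_colab():
    if not configure_java_home(JAVA_CANDIDATE):
        print(f"JDK not found at {JAVA_CANDIDATE}; set JAVA_HOME manually.")
```

<p>Passing the environment mapping as a parameter keeps the helper easy to try out with a plain dictionary before touching <code>os.environ</code>.</p>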

<h2>Required Libraries:</h2>

<ul>
    <li><strong>Spark-OCR:</strong> Provides text detection, Optical Character Recognition (OCR), and I/O operations for DICOM images.</li>
    <li><strong>Spark-NLP:</strong> Open-source library, installed automatically with Visual NLP.</li>
</ul>

<img
  src="https://www.johnsnowlabs.com/wp-content/uploads/2023/03/visual_nlp.png"
  alt="Visual NLP Diagram from John Snow Labs"
  style="max-width: 100%; height: auto;" >
</body>
</html>

<h2>Installation Steps</h2>
<ul>
    <li>Make sure the license file is present in the same directory and the <strong>license</strong> variable is set to its path.</li>
    <li>Set the environment variables.</li>
    <li>Install Visual-NLP.</li>
    <li><strong>Restart the session!</strong></li>
</ul>

In [None]:
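import json
import os

# Path to the license JSON file (must be set before running this cell)
license = ""

if license and license.endswith(".json"):
    # Export every key/value pair from the license file as an environment variable
    with open(license, "r") as creds_in:
        creds = json.load(creds_in)

    for key, value in creds.items():
        os.environ[key] = value
else:
    raise Exception("License JSON File is not specified")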

In [None]:
# Install Visual-NLP
!python3 -m pip install -q --upgrade spark-ocr==$OCR_VERSION --user --extra-index-url https://pypi.johnsnowlabs.com/$SPARK_OCR_SECRET

In [None]:
!pip install -q pandas matplotlib

In [None]:
# RESTART SESSION!!!

<h2>Notebook Objectives:</h2>

<p>This notebook will illustrate the following key stages of DICOM metadata deidentification:</p>

<ul>
    <li>Loading and processing DICOM files for metadata deidentification.</li>
    <li>Viewing and editing the configuration CSV.</li>
    <li>Understanding the available deidentification strategies.</li>
</ul>

<p>By following this notebook, you will gain practical experience in building a DICOM metadata deidentification pipeline using Visual-NLP.</p>

<h2>Start Visual-NLP Session with Healthcare-NLP</h2>

<h2>Configuration:</h2>

<p>To execute the code, you'll need to provide the following configuration details:</p>

<ol>
    <li><strong>License Information:</strong>
        <ul>
            <li>Specify the <code>Spark-OCR</code> license from your license file.</li>
        </ul>
    </li>
    <li><strong>Library Paths:</strong>
        <ul>
            <li>Provide the folder path to the <code>spark-ocr-assembly-[version].jar</code> file, or the <code>spark_ocr_secret</code> from your license file.</li>
        </ul>
    </li>
    <li><strong>AWS Credentials (for Model Downloads):</strong>
        <ul>
            <li>Supply your AWS access key (<code>aws_access_key</code>).</li>
            <li>Supply your AWS secret key (<code>aws_secret_key</code>).</li>
            <li>Supply your AWS session token (<code>aws_session_token</code>).</li>
        </ul>
    </li>
</ol>
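<p>Before starting the session, it can help to confirm that the license values were actually exported. The following is a minimal sketch, assuming the key names read by the session-start cell in this notebook (<code>SPARK_OCR_SECRET</code>, <code>SECRET</code>, <code>PUBLIC_VERSION</code>); <code>missing_license_keys</code> is a hypothetical helper, and your license file may contain additional keys.</p>

```python
import os

# Keys read by the session-start cell in this notebook.
REQUIRED_KEYS = ("SPARK_OCR_SECRET", "SECRET", "PUBLIC_VERSION")


def missing_license_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]


missing = missing_license_keys(os.environ)
if missing:
    print("Missing license values:", ", ".join(missing))
else:
    print("All required license values are set.")
```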

In [1]:
from sparkocr import start
import os
import json
import time
import shutil

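# Path to the license JSON file (must be set before running this cell)
license = ""

if license and license.endswith(".json"):
    # Export every key/value pair from the license file as an environment variable
    with open(license, "r") as creds_in:
        creds = json.load(creds_in)

    for key, value in creds.items():
        os.environ[key] = value
else:
    raise Exception("License JSON File is not specified")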

In [None]:
extra_configurations = {
    "spark.extraListeners": "com.johnsnowlabs.license.LicenseLifeCycleManager",
}

# Set the Java path
# Not required on Google Colab
os.environ['JAVA_HOME'] = '/home/linuxbrew/.linuxbrew/Cellar/openjdk@17/17.0.14'

spark = start(secret=os.environ["SPARK_OCR_SECRET"],
              nlp_secret=os.environ["SECRET"],
              nlp_internal=True,
              nlp_jsl=True,
              nlp_version=os.environ["PUBLIC_VERSION"],
              extra_conf=extra_configurations,
              use_gpu=False)

spark

<h2>Import Transformers and Annotators</h2>

In [3]:
import os
import sys
import time
from PIL import Image
import io
import json
import pandas as pd
import pkg_resources

import sparknlp
import sparkocr
from sparkocr.transformers import *
from sparkocr.utils import *
from sparkocr.enums import *

from pyspark.ml import PipelineModel, Pipeline
from sparknlp.pretrained import PretrainedPipeline

import pyspark.sql.functions as F

print(f"Spark NLP version: {sparknlp.version()}")
print(f"Spark OCR version: {sparkocr.version()}")



Spark NLP version: 5.5.2
Spark OCR version: 5.5.1rc4


In [4]:
def compare_dicom_metadata(dicom_original_metadata, dicom_redacted_metadata):
    """
    Compare Original and Redacted Metadata
    Add a new column denoting if value has changed
    Returns pandas dataframe
    """
    metadata_mapping = []

    for key in dicom_original_metadata.keys():
        # .get() avoids a KeyError when a tag was deleted during deidentification
        redacted_value = dicom_redacted_metadata.get(key)
        changed = dicom_original_metadata[key] != redacted_value
        metadata_mapping.append([key, dicom_original_metadata[key], redacted_value, changed])
    
    metadata_mapping_df = pd.DataFrame(metadata_mapping, columns=["Key", "Original Value", "Redacted Value", "Value Changed"])
    
    return metadata_mapping_df
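
<p>Because some strategies can drop a tag entirely, a comparison may also need to report keys that exist on only one side. The sketch below shows one way to do that; <code>diff_metadata</code> is a hypothetical variant for illustration, and the two dictionaries are toy values rather than real DICOM output.</p>

```python
import pandas as pd


def diff_metadata(original: dict, redacted: dict) -> pd.DataFrame:
    """Compare two metadata dicts over the union of their keys."""
    rows = []
    # Keep the original key order, then append keys that only exist after redaction.
    keys = list(original) + [k for k in redacted if k not in original]
    for key in keys:
        if key not in redacted:
            status = "removed"
        elif key not in original:
            status = "added"
        elif original[key] != redacted[key]:
            status = "changed"
        else:
            status = "unchanged"
        rows.append([key, original.get(key), redacted.get(key), status])
    return pd.DataFrame(rows, columns=["Key", "Original Value", "Redacted Value", "Status"])


# Toy dictionaries, not real DICOM output.
before = {"Modality": "CR", "InstitutionName": "Example Hospital", "PatientComments": "n/a"}
after = {"Modality": "CR", "InstitutionName": "<hidden>"}

print(diff_metadata(before, after))
```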

<h2>Extract Metadata from Dicom Object</h2>

In [5]:
dicom_to_metadata = DicomToMetadata() \
    .setInputCol("content") \
    .setOutputCol("metadata")

In [6]:
df = spark.read.format("binaryFile").load("./data/David_Douglas.dcm")

result = dicom_to_metadata.transform(df).select("metadata").collect()[0].asDict()["metadata"]

json_result = json.loads(result)

json_result

                                                                                

{'ImageType': "['DERIVED', 'PRIMARY', 'POST_PROCESSED', 'RT', '', '', '', '', '100000']",
 'SOPClassUID': '1.2.840.10008.5.1.4.1.1.1',
 'SOPInstanceUID': '2.25.112784503178059210578740147414000844278',
 'StudyDate': '20130713',
 'SeriesDate': '20130713',
 'AcquisitionDate': '20130713',
 'ContentDate': '20130713',
 'StudyTime': '085957',
 'SeriesTime': '090020.000',
 'AcquisitionTime': '090020.408',
 'ContentTime': '090143.249',
 'AccessionNumber': '20130714E864535',
 'Modality': 'CR',
 'Manufacturer': 'FUJIFILM Corporation',
 'InstitutionName': 'Scott Community Hospital',
 'InstitutionAddress': '334 Michael Manor Sarahview, PA 56560',
 "ReferringPhysician'sName": 'HUGHES^KATHLEEN',
 "ReferringPhysician'sAddress": '0544 Green Inlet Jeffreyland, HI 66060',
 'StudyDescription': 'XR CHEST AP PORTABLE for Douglas Davidson',
 'ProcedureCodeSequence': [{'CodeValue': 'CHESTAPPAP',
   'CodingSchemeDesignator': 'SECTRA',
   'CodeMeaning': 'XR CHEST AP PORTABLE'}],
 "PerformingPhysician'sName": '

<h2>Run Deidentify Dicom Metadata and Extract Results</h2>

In [7]:
csv_path = pkg_resources.resource_filename('sparkocr', "resources/ocr/dicom/default_dicom_strategy.csv")

strategy_csv = pd.read_csv(csv_path)

strategy_csv.head(10)

Unnamed: 0,Tags,VR,Name,Status,Action,Repeatable
0,"(0002,0100)",UI,Private Information Creator UID,,hashId,
1,"(0002,0102)",OB,Private Information,,hashId,
2,"(0004,1130)",CS,File-set ID,,hashId,
3,"(0004,1141)",CS,File-set Descriptor File ID,,hashId,
4,"(0004,1432)",UI,Private Record UID,,hashId,
5,"(0004,1500)",CS,Referenced File ID,,hashId,
6,"(0008,0012)",DA,Instance Creation Date,,shiftDateByRandomNbOfDays,
7,"(0008,0014)",UI,Instance Creator UID,,hashId,
8,"(0008,0020)",DA,Study Date,,shiftDateByFixedNbOfDays,112.0
9,"(0008,0021)",DA,Series Date,,shiftDateByRandomNbOfDays,
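
<p>To edit the configuration, one option is to change rows with pandas and save a copy that can then be passed to <code>setStrategyFile</code>. The sketch below works on a two-row inline sample that mirrors the named columns shown above so that it runs on its own; the output file name is hypothetical, and using the <code>Repeatable</code> column for the day count is an assumption inferred from the Study Date row.</p>

```python
import io
import os
import tempfile

import pandas as pd

# Inline sample mirroring the named columns above; in the notebook you would
# start from `strategy_csv` instead.
sample_csv = io.StringIO(
    'Tags,VR,Name,Status,Action,Repeatable\n'
    '"(0008,0020)",DA,Study Date,,shiftDateByFixedNbOfDays,112.0\n'
    '"(0008,0021)",DA,Series Date,,shiftDateByRandomNbOfDays,\n'
)
strategy = pd.read_csv(sample_csv)

# Switch Series Date to a fixed shift. Putting the day count in "Repeatable"
# follows the Study Date row and is an assumption about the file format.
is_series_date = strategy["Tags"] == "(0008,0021)"
strategy.loc[is_series_date, "Action"] = "shiftDateByFixedNbOfDays"
strategy.loc[is_series_date, "Repeatable"] = 112.0

# Hypothetical output location for the edited strategy file.
custom_csv_path = os.path.join(tempfile.gettempdir(), "custom_dicom_strategy.csv")
strategy.to_csv(custom_csv_path, index=False)

print(strategy[["Tags", "Name", "Action", "Repeatable"]])
```

<p>Giving both date tags the same fixed offset keeps them consistent with each other after deidentification, which a random shift per tag does not guarantee.</p>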


<h2>Metadata Cleaning Strategies</h2> 
<ol>
    <li><strong>remove</strong> – Replace the tag's value with placeholder text.</li>
    <li><strong>delete</strong> – Completely remove the tag from the dataset.</li>
    <li><strong>hashId</strong> – Replace the identifier with a hashed version to preserve structure but remove identifying information.</li>
    <li><strong>shiftDateByFixedNbOfDays</strong> – Shift date fields by a fixed number of days.</li>
    <li><strong>shiftDateByRandomNbOfDays</strong> – Shift date fields by a randomly selected number of days.</li>
    <li><strong>replaceWithLiteral</strong> – Replace the tag's value with a predefined literal value. Use the <strong>setPlaceholderText</strong> param to change the default value.</li>
    <li><strong>replaceWithRandomName</strong> – Replace names with randomly selected values from a list of predefined names.</li>
</ol>
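
<p>To make two of these strategies concrete, here is a small standalone sketch in plain Python. It is not the library's implementation: <code>shift_dicom_date</code> and <code>hash_identifier</code> are hypothetical helpers, and the SHA-256 digest truncated to 128 bits is an assumption chosen only to produce a stable <code>2.25.</code>-style value.</p>

```python
import hashlib
from datetime import datetime, timedelta


def shift_dicom_date(value: str, days: int) -> str:
    """Shift a DICOM DA value (YYYYMMDD) forward by `days` days."""
    shifted = datetime.strptime(value, "%Y%m%d") + timedelta(days=days)
    return shifted.strftime("%Y%m%d")


def hash_identifier(value: str) -> str:
    """Map an identifier to a stable '2.25.<integer>' string.

    Illustrative only: the digest and truncation are assumptions, not the
    algorithm Visual NLP applies for the hashId action.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Keep 128 bits so the decimal part has at most 39 digits.
    return "2.25." + str(int.from_bytes(digest[:16], "big"))


print(shift_dicom_date("20130713", 112))  # 20131102
print(hash_identifier("20130714E864535"))
```

<p>The first call uses the Study Date of the sample file and the fixed offset of 112 days listed for that tag in the strategy table.</p>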

In [8]:
dicom_deidentifier = DicomMetadataDeidentifier() \
    .setInputCols(["content"]) \
    .setOutputCol("dicom_metadata_cleaned") \
    .setRemovePrivateTags(False) \
    .setPlaceholderText("<hidden>") \
    .setStrategyFile(csv_path)

<h2>Example 1: Deidentifying David Douglas's DICOM Metadata</h2>

<p><strong>Input DICOM File Path:</strong> <code>./data/David_Douglas.dcm</code></p>

<p><strong>Number of Frames in Input:</strong> 1</p>

<p><strong>Output Deidentified DICOM File Path:</strong> <code>./data/result/david_douglas/</code></p>

In [9]:
df = spark.read.format("binaryFile").load("./data/David_Douglas.dcm")

dicom_deidentifier.transform(df).write \
  .format("binaryFormat") \
  .option("type", "dicom") \
  .option("field", "dicom_metadata_cleaned") \
  .option("nameField", "fileName") \
  .option("extension", "dcm") \
  .option("prefix", "de-id-") \
  .mode("overwrite") \
  .save("./data/result/david_douglas/")

13:54:22, INFO Run DicomMetadataDeidentifier
13:54:22, INFO Run DicomMetadataDeidentifier:vr:DA, action:shiftDateByFixedNbOfDays
13:54:22, INFO Run DicomMetadataDeidentifier:vr:DA, action:shiftDateByRandomNbOfDays
13:54:22, INFO Run DicomMetadataDeidentifier:vr:DA, action:shiftDateByRandomNbOfDays
13:54:22, INFO Run DicomMetadataDeidentifier:vr:DA, action:shiftDateByRandomNbOfDays
13:54:22, INFO Run DicomMetadataDeidentifier:vr:SH, action:hashId
  warn_and_log(msg)
13:54:22, INFO Run DicomMetadataDeidentifier:vr:LO, action:remove
13:54:22, INFO Run DicomMetadataDeidentifier:vr:ST, action:remove
13:54:22, INFO Run DicomMetadataDeidentifier:vr:PN, action:remove
13:54:22, INFO Run DicomMetadataDeidentifier:vr:ST, action:remove
13:54:22, INFO Run DicomMetadataDeidentifier:vr:PN, action:remove
13:54:22, INFO Run DicomMetadataDeidentifier:vr:PN, action:remove
13:54:22, INFO Run DicomMetadataDeidentifier:vr:LO, action:hashId
13:54:22, INFO Run DicomMetadataDeidentifier:vr:DA, action:shiftDate

<h2>Run DicomToMetadata and Extract Results</h2>

In [10]:
df = spark.read.format("binaryFile").load("./data/result/david_douglas/*.dcm")

deid_result = dicom_to_metadata.transform(df).select("metadata").collect()[0].asDict()["metadata"]

json_deid_result = json.loads(deid_result)

json_deid_result

{'ImageType': "['DERIVED', 'PRIMARY', 'POST_PROCESSED', 'RT', '', '', '', '', '100000']",
 'SOPClassUID': '1.2.840.10008.5.1.4.1.1.1',
 'SOPInstanceUID': '2.25.112784503178059210578740147414000844278',
 'StudyDate': '20131102',
 'SeriesDate': '20130820',
 'AcquisitionDate': '20130818',
 'ContentDate': '20130829',
 'StudyTime': '085957',
 'SeriesTime': '090020.000',
 'AcquisitionTime': '090020.408',
 'ContentTime': '090143.249',
 'AccessionNumber': '2.25.104661169016722362372224050231930259815',
 'Modality': 'CR',
 'Manufacturer': 'FUJIFILM Corporation',
 'InstitutionName': '<hidden>',
 'InstitutionAddress': '<hidden>',
 "ReferringPhysician'sName": '<hidden>',
 "ReferringPhysician'sAddress": '<hidden>',
 'StudyDescription': 'XR CHEST AP PORTABLE for Douglas Davidson',
 'ProcedureCodeSequence': [{'CodeValue': 'CHESTAPPAP',
   'CodingSchemeDesignator': 'SECTRA',
   'CodeMeaning': 'XR CHEST AP PORTABLE'}],
 "PerformingPhysician'sName": '<hidden>',
 'DerivationDescription': 'G1.0e#1.60+0.20

<h3>Compare the Tags with Changes</h3>

In [11]:
pd_df = compare_dicom_metadata(json_result, json_deid_result)
pd_df[pd_df["Value Changed"]].head(50)

Unnamed: 0,Key,Original Value,Redacted Value,Value Changed
3,StudyDate,20130713,20131102,True
4,SeriesDate,20130713,20130820,True
5,AcquisitionDate,20130713,20130818,True
6,ContentDate,20130713,20130829,True
11,AccessionNumber,20130714E864535,2.25.104661169016722362372224050231930259815,True
14,InstitutionName,Scott Community Hospital,<hidden>,True
15,InstitutionAddress,"334 Michael Manor Sarahview, PA 56560",<hidden>,True
16,ReferringPhysician'sName,HUGHES^KATHLEEN,<hidden>,True
17,ReferringPhysician'sAddress,"0544 Green Inlet Jeffreyland, HI 66060",<hidden>,True
20,PerformingPhysician'sName,BROWN^PETER,<hidden>,True
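
<p>A quick way to sanity-check the date strategies is to compute how far each date tag moved. The sketch below uses the original and redacted values copied from the table above; <code>shift_in_days</code> is a hypothetical helper.</p>

```python
from datetime import datetime

# (original, redacted) pairs copied from the comparison table above.
date_changes = {
    "StudyDate": ("20130713", "20131102"),
    "SeriesDate": ("20130713", "20130820"),
    "AcquisitionDate": ("20130713", "20130818"),
    "ContentDate": ("20130713", "20130829"),
}


def shift_in_days(original: str, redacted: str) -> int:
    """Return the number of days between two DICOM DA values (YYYYMMDD)."""
    fmt = "%Y%m%d"
    return (datetime.strptime(redacted, fmt) - datetime.strptime(original, fmt)).days


offsets = {tag: shift_in_days(*pair) for tag, pair in date_changes.items()}
print(offsets)
```

<p>StudyDate moves by exactly 112 days, matching the fixed offset in the strategy file, while the other three tags move by different amounts (38, 36, and 47 days). Dates that were identical in the original file therefore no longer agree with each other after random shifting.</p>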
