# De-identification of DICOM documents with an encapsulated PDF document

## Install the spark-ocr Python package
You need to specify:
- `license`
- the path to `spark-ocr-assembly-[version].jar` and `spark-nlp-jsl-[version]`
- or `secret` for Spark OCR and `nlp_secret` for Spark NLP Internal
- `aws_access_key` and `aws_secret_key` for downloading pretrained models

For more details about DICOM de-identification, please read:

 - [DICOM de-identification at scale in Visual NLP — Part 1.](https://medium.com/john-snow-labs/dicom-de-identification-at-scale-in-visual-nlp-part-1-68784177f5f0)

 - [DICOM de-identification at scale in Visual NLP — Part 2.](https://medium.com/john-snow-labs/dicom-de-identification-at-scale-in-visual-nlp-part-2-361af5e36412)

 - [DICOM de-identification at scale in Visual NLP — Part 3.](https://medium.com/john-snow-labs/dicom-de-identification-at-scale-in-visual-nlp-part-3-ac750be386cb)

In [4]:
license = ""
secret = ""
nlp_secret = ""
aws_access_key = ""
aws_secret_key = ""

version = secret.split("-")[0]
nlp_internal_version = str(nlp_secret.split("-")[0])
spark_ocr_jar_path = "../../../target/scala-2.12"
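
The cell above derives the package version from the text before the first `-` in the secret. A minimal sketch of that behaviour follows; the helper name `version_from_secret` and the example secret are made up for illustration and are not part of Spark OCR.

```python
def version_from_secret(secret: str) -> str:
    """Return the text before the first '-' (same logic as the cell above)."""
    return secret.split("-")[0]


# Hypothetical secret, used only to show the shape "<version>-<token>".
example_secret = "5.2.0-0123456789abcdef"

print(version_from_secret(example_secret))  # 5.2.0
print(repr(version_from_secret("")))        # ''
```

With an empty `secret`, the derived version is an empty string, so the `spark-ocr==$version` install line would have no version to pin. Fill in the secrets before running the install cell.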

## Install requirements

In [None]:
# NBVAL_SKIP
%pip install pydicom highdicom
%pip install --upgrade spark-nlp-jsl==5.1.1  --extra-index-url https://pypi.johnsnowlabs.com/$nlp_secret
%pip install spark-nlp==5.1.1
%pip install spark-ocr==$version --extra-index-url=https://pypi.johnsnowlabs.com/$secret --upgrade

## Start Spark session

In [2]:
from sparkocr import start
import os
from pyspark import SparkConf

if license:
    os.environ['JSL_OCR_LICENSE'] = license
    os.environ['SPARK_NLP_LICENSE'] = license

if aws_access_key:
    os.environ['AWS_ACCESS_KEY'] = aws_access_key
    os.environ['AWS_SECRET_ACCESS_KEY'] = aws_secret_key


spark = start(secret=secret,
              nlp_secret=nlp_secret,
              jar_path=spark_ocr_jar_path,
              nlp_internal=nlp_internal_version
             )

spark

Spark version: 3.5.0
Spark NLP version: 5.2.2
Spark NLP for Healthcare version: 5.2.1
Spark OCR version: 5.2.0

:: loading settings :: url = jar:file:/opt/conda/envs/trocrMetrics/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/ec2-user/.ivy2/cache
The jars for the packages stored in: /home/ec2-user/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4fb78512-0448-4517-806a-dbf83f6920a4;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;5.2.2 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.500 in central
	found com.github.universal-automata#liblevenshtein;3.0.0 in central
	found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
	found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
	found com.google.code.gson#gson;2.3 in central
	found it.unimi.dsi#fastutil;7.0.12 in spark-list
	found org.projectlombok#lombok;1.16.8 in central
	found com.google.cloud#google-cloud-storage;2.20.1 in spark-list
	found com.google.guava#guava;31.1-jre in spark-list
	found com.google

## Import transformers and annotators

In [3]:
import os
import sys

from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
import sparknlp_jsl
from sparknlp_jsl.annotator import *

import sparkocr
from sparkocr.transformers import *
from sparkocr.utils import *
from sparkocr.enums import *
from sparkocr.schemas import BinarySchema

from pyspark.ml import PipelineModel, Pipeline
from pyspark.sql.functions import *

print(f"Spark NLP version: {sparknlp.version()}")
print(f"Spark NLP internal version: {sparknlp_jsl.version()}")
print(f"Spark OCR version: {sparkocr.version()}")

Spark NLP version: 5.2.2
Spark NLP internal version: 5.2.1
Spark OCR version: 5.2.0


## Define the Spark NLP pipeline for text de-identification

In [4]:
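def deidentification_nlp_pipeline(input_column, prefix="", model="ner_deid_large"):
    """Build and fit a Spark NLP pipeline that finds sensitive entities in `input_column`.

    `prefix` is prepended to every intermediate column name;
    `model` is the name of the pretrained NER model to load.
    """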
    document_assembler = DocumentAssembler() \
        .setInputCol(input_column) \
        .setOutputCol(prefix + "document_raw")

    cleanUpPatterns = ["<[^>]*>", ":"]
    documentNormalizer = DocumentNormalizer() \
      .setInputCols(prefix + "document_raw") \
      .setOutputCol(prefix + "document") \
      .setAction("clean") \
      .setPatterns(cleanUpPatterns) \
      .setReplacement(" ") \
      .setPolicy("pretty_all") 

    # Sentence Detector annotator, processes various sentences per line
    sentence_detector = SentenceDetector() \
        .setInputCols([prefix + "document"]) \
        .setOutputCol(prefix + "sentence")

    tokenizer = Tokenizer() \
        .setInputCols([prefix + "sentence"]) \
        .setOutputCol(prefix + "token")

    # Clinical word embeddings
    word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
        .setInputCols([prefix + "sentence", prefix + "token"]) \
        .setOutputCol(prefix + "embeddings") \
        .setEnableInMemoryStorage(True)

    clinical_ner = MedicalNerModel.pretrained(model, "en", "clinical/models") \
        .setInputCols([prefix + "sentence", prefix + "token", prefix + "embeddings"]) \
        .setOutputCol(prefix + "ner")

    custom_ner_converter = NerConverter() \
        .setInputCols([prefix + "sentence", prefix + "token", prefix + "ner"]) \
        .setOutputCol(prefix + "ner_chunk") \
        .setWhiteList(['NAME', 'AGE', 'CONTACT', 'ID',
                   'LOCATION', 'PROFESSION', 'PERSON', 'DATE', 'DOCTOR'])

    nlp_pipeline = Pipeline(stages=[
            document_assembler,
            documentNormalizer,
            sentence_detector,
            tokenizer,
            word_embeddings,
            clinical_ner,
            custom_ner_converter
        ])
    empty_data = spark.createDataFrame([[""]]).toDF(input_column)
    nlp_model = nlp_pipeline.fit(empty_data)
    return nlp_model
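
The `DocumentNormalizer` stage above replaces HTML-like tags and colons with a space before sentence detection. The plain-Python sketch below approximates that clean-up with the same two patterns. The helper `approximate_clean` is hypothetical, and the final whitespace collapse is only an assumption about what the `"pretty_all"` policy does, not a reproduction of the annotator.

```python
import re

# Same patterns as cleanUpPatterns in the pipeline definition.
clean_up_patterns = ["<[^>]*>", ":"]


def approximate_clean(text, patterns=clean_up_patterns, replacement=" "):
    """Replace every pattern match with `replacement`, then tidy whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, replacement, text)
    # Assumption: collapse runs of whitespace and trim, roughly like "pretty_all".
    return " ".join(text.split())


print(approximate_clean("Patient: <b>John Smith</b>"))  # Patient John Smith
```

Using a space as the replacement, rather than an empty string, keeps tokens such as `DOB:01/02/1980` from being glued together after the colon is dropped.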

## Define the Spark OCR pipeline

In [6]:
# Extract the encapsulated PDF from the DICOM file
dicom_to_pdf = DicomToPdf() \
    .setInputCols(["path"]) \
    .setOutputCol("pdf") \
    .setKeepInput(True)

# Convert the PDF to an image
pdf_to_image = PdfToImage() \
    .setInputCol("pdf") \
    .setOutputCol("image") \
    .setFallBackCol("text_image")

# Recognize text
ocr = ImageToText() \
    .setInputCol("image") \
    .setOutputCol("text") \
    .setIgnoreResolution(False) \
    .setPageIteratorLevel(PageIteratorLevel.SYMBOL) \
    .setPageSegMode(PageSegmentationMode.SPARSE_TEXT) \
    .setConfidenceThreshold(70)

# Find the coordinates of sensitive data
position_finder = PositionFinder() \
    .setInputCols("ner_chunk") \
    .setOutputCol("regions") \
    .setPageMatrixCol("positions") \
    .setOcrScaleFactor(1)

# Hide sensitive data
drawRegions = ImageDrawRegions()  \
    .setInputCol("image")  \
    .setInputRegionsCol("regions")  \
    .setOutputCol("image_with_regions")  \
    .setFilledRect(True) \
    .setRectColor(Color.gray)

# Convert the image back to a PDF
image_to_pdf = ImageToPdf() \
    .setInputCol("image_with_regions") \
    .setOutputCol("pdf")

# Update the PDF in the DICOM file
dicom_update_pdf = DicomUpdatePdf() \
    .setInputCol("path") \
    .setInputPdfCol("pdf") \
    .setOutputCol("dicom") \
    .setKeepInput(True)

# De-identify metadata in the DICOM file
dicom_deidentifier = DicomMetadataDeidentifier() \
    .setInputCols(["dicom"]) \
    .setOutputCol("dicom_cleaned")

# OCR pipeline
pipeline = PipelineModel(stages=[
     dicom_to_pdf,
     pdf_to_image,
     ocr,
     deidentification_nlp_pipeline(input_column="text", prefix="", model="ner_deid_generic_augmented"),
     position_finder,
     drawRegions,
     image_to_pdf,
     dicom_update_pdf,
     dicom_deidentifier
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]
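
The stages are wired together through column names, so a mistyped column only shows up when the pipeline runs. The sketch below checks that wiring in plain Python, without Spark. The stage table is transcribed by hand from the cell above and `find_missing_inputs` is a hypothetical helper. It assumes that `ImageToText` also emits a `positions` column, which `PositionFinder` reads through `setPageMatrixCol("positions")`, and that the `binaryFile` reader supplies `path`, `modificationTime`, `length` and `content`.

```python
# Each entry: (stage name, columns it reads, columns it writes).
stages = [
    ("DicomToPdf", ["path"], ["pdf"]),
    ("PdfToImage", ["pdf"], ["image"]),
    # Assumption: ImageToText writes "positions" alongside "text".
    ("ImageToText", ["image"], ["text", "positions"]),
    ("deidentification_nlp_pipeline", ["text"], ["ner_chunk"]),
    ("PositionFinder", ["ner_chunk", "positions"], ["regions"]),
    ("ImageDrawRegions", ["image", "regions"], ["image_with_regions"]),
    ("ImageToPdf", ["image_with_regions"], ["pdf"]),
    ("DicomUpdatePdf", ["path", "pdf"], ["dicom"]),
    ("DicomMetadataDeidentifier", ["dicom"], ["dicom_cleaned"]),
]


def find_missing_inputs(stages, initial_columns):
    """Return (stage, column) pairs whose input is not produced upstream."""
    available = set(initial_columns)
    missing = []
    for name, inputs, outputs in stages:
        for column in inputs:
            if column not in available:
                missing.append((name, column))
        available.update(outputs)
    return missing


# Columns assumed to come from the binaryFile reader.
source_columns = ["path", "modificationTime", "length", "content"]
print(find_missing_inputs(stages, source_columns))  # []
```

Note that `pdf` is written twice, first by `DicomToPdf` and later by `ImageToPdf`, so `DicomUpdatePdf` presumably receives the redacted PDF rather than the original one.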


## Read DICOM files

In [8]:
dicom_path = './../data/dicom/encapsulated/*.dcm'
dicom_df = spark.read.format("binaryFile").load(dicom_path)

In [9]:
dicom_df.show()

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/home/ec2-us...|2024-02-26 22:43:57|651696|[00 00 00 00 00 0...|
+--------------------+-------------------+------+--------------------+



                                                                                

## Run the pipeline and store the results

In [10]:
# NBVAL_SKIP
from pyspark.sql.types import StringType

output_path = "./deidentified_pdf/"

def get_name(path, keep_subfolder_level=0):
    path = path.split("/")
    path[-1] = ".".join(path[-1].split('.')[:-1])
    return "/".join(path[-keep_subfolder_level-1:])

result = pipeline.transform(dicom_df)
result.withColumn("fileName", udf(get_name, StringType())(col("path"))) \
    .write \
    .format("binaryFormat") \
    .option("type", "dicom") \
    .option("field", "dicom_cleaned") \
    .option("prefix", "") \
    .option("nameField", "fileName") \
    .mode("overwrite") \
    .save(output_path)

24/02/27 02:16:43 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/02/27 02:16:55 ERROR PositionFinder: PositionFinder unmatched:::Annotation(type: chunk, begin: 946, end: 1000, result: Industries Served Computer software, Banking, Insurance), index: 9
02:17:00, INFO Run DicomMetadataDeidentifier
24/02/27 02:17:01 WARN BasicWriteTaskStatsTracker: Expected 1 files, but only saw 0. This could be due to the output format not writing empty files, or files being not immediately visible in the filesystem.
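
The `get_name` helper used in the write step strips the extension and keeps the last `keep_subfolder_level` folders of the path. The sketch below runs it on made-up paths, outside Spark, to show what ends up in `fileName`. `get_name_safe` is a hypothetical variant, not part of the notebook, that keeps names without an extension intact.

```python
import posixpath


def get_name(path, keep_subfolder_level=0):
    # Same logic as the notebook cell above.
    path = path.split("/")
    path[-1] = ".".join(path[-1].split('.')[:-1])
    return "/".join(path[-keep_subfolder_level-1:])


def get_name_safe(path, keep_subfolder_level=0):
    # Hypothetical variant: splitext leaves extension-less names untouched.
    parts = path.split("/")
    parts[-1] = posixpath.splitext(parts[-1])[0]
    return "/".join(parts[-keep_subfolder_level - 1:])


sample = "file:/data/dicom/encapsulated/report.v2.dcm"  # made-up path

print(get_name(sample))                                 # report.v2
print(get_name(sample, 1))                              # encapsulated/report.v2
print(repr(get_name("file:/data/dicom/README")))        # ''
print(get_name_safe("file:/data/dicom/README"))         # README
```

For `.dcm` inputs such as the ones read in this notebook, both versions return the same name. The difference only matters when a file has no extension, where the original helper yields an empty `fileName`.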
                                                                                

## Remove results

In [11]:
# NBVAL_SKIP
!rm -r -f ./deidentified_pdf