## Spark Session Setup for Visual NLP, Healthcare NLP, and Open-Source NLP

To use this notebook, you need to start a Spark session with the following John Snow Labs libraries:

- **[Spark-OCR / Visual-NLP](https://nlp.johnsnowlabs.com/docs/en/ocr)**
- **[Healthcare NLP](https://nlp.johnsnowlabs.com/licensed/api/python/)**
- **[Open-Source NLP](https://github.com/JohnSnowLabs/spark-nlp)**

### Required Environment Variables

Ensure you have a valid license file containing your credentials. The following environment variables must be set:

- `SPARK_NLP_LICENSE` (Healthcare)
- `SECRET` (Healthcare)
- `JSL_VERSION` (Healthcare)
- `SPARK_OCR_LICENSE` (Visual)
- `SPARK_OCR_SECRET` (Visual)
- `OCR_VERSION` (Visual)
- `PUBLIC_VERSION` (Open-Source)
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN`
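
Before installing anything, it can help to confirm that none of the variables listed above is missing. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of any John Snow Labs library, and the list of names is simply copied from the bullets above.

```python
import os

# Names copied from the list above; adjust if your license file differs.
REQUIRED_ENV_VARS = [
    "SPARK_NLP_LICENSE", "SECRET", "JSL_VERSION",
    "SPARK_OCR_LICENSE", "SPARK_OCR_SECRET", "OCR_VERSION",
    "PUBLIC_VERSION",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN",
]


def missing_env_vars(required=REQUIRED_ENV_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Report anything that still needs to be exported.
print("Missing variables:", missing_env_vars())
```

An empty list means every variable is present; it does not verify that the license itself is valid.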

### Notes

- For **text-only projects** (i.e., no visual data processing), you can use **`SPARK_NLP_LICENSE`**.
- For projects involving **visual data** (e.g., image or PDF processing), you should use **`SPARK_OCR_LICENSE`**.
- For projects that combine **visual** and **text** processing, either license can be used.
- All required key-value pairs **must be set as environment variables** to install and use the full functionality of these libraries.
- Ensure that you **restart** the session after installing all the required libraries.

<h2>Install Required Libraries</h2>

In [1]:
import os
import json

# Path to the license JSON file that holds the credentials listed above
license_path = "../spark_nlp_for_healthcare_spark_ocr_10057.json"

if not license_path or not license_path.endswith(".json"):
    raise ValueError("License JSON file is not specified")

with open(license_path, "r") as creds_in:
    creds = json.load(creds_in)

# Export every key-value pair so that pip and the libraries can read them
for key, value in creds.items():
    os.environ[key] = str(value)

In [2]:
!pip install --upgrade -q https://pypi.johnsnowlabs.com/$SPARK_OCR_SECRET/spark-ocr/spark_ocr-$OCR_VERSION-py3-none-any.whl

!pip install --upgrade -q https://pypi.johnsnowlabs.com/$SECRET/spark-nlp-jsl/spark_nlp_jsl-$JSL_VERSION-py3-none-any.whl

!pip install -q spark-nlp==$PUBLIC_VERSION

!pip install -q pandas

!pip install -q matplotlib

[notice] A new release of pip is available: 24.2 -> 25.1.1
[notice] To update, run: python -m pip install --upgrade pip
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spark-ocr 6.0.0 requires spark-nlp==5.5.3, but you have spark-nlp 6.0.0 which is incompatible.

In [None]:
### RESTART SESSION!!!
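
After the restart, it may be worth checking which versions were actually installed, because the pip output above reports that spark-ocr 6.0.0 expects spark-nlp 5.5.3 while spark-nlp 6.0.0 is present. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names are assumptions inferred from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Distribution names assumed from the install commands in this notebook.
print(installed_versions(["spark-ocr", "spark-nlp-jsl", "spark-nlp"]))
```

A `None` entry suggests the corresponding install step did not complete in the current environment.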

## Start Spark Session - Visual NLP, Healthcare NLP, Spark-NLP

In this section, we initialize the Spark session using the `start()` function from the **`sparkocr`** package.

This utility sets up a fully configured Spark session tailored for **Spark OCR** and optionally for **Spark NLP**, **Healthcare NLP**, and **GPU/Apple Silicon support**.

### Function Overview: `start()`

The `start()` function returns a ready-to-use `SparkSession` and accepts the following parameters:

- **`secret`**: Secret key required to download JAR files from the John Snow Labs server.
- **`jar_path`**: (Optional) Local path to a pre-downloaded JAR file.
- **`extra_conf`**: Additional Spark configuration; accepts a `SparkConf` object or a Python `dict`.
- **`master_url`**: URL for the Spark master (e.g., `"local[*]"`).
- **`nlp_version`**: Version of Spark NLP to use. If `None`, Spark NLP is not included.
- **`nlp_internal`**: Boolean indicating whether to include Spark NLP Internal.
- **`nlp_jsl`**: Boolean or version string to include Spark NLP for Healthcare (JSL).
- **`nlp_secret`**: Secret key for downloading Spark NLP Internal.
- **`m1`**: Set to `True` to enable support for Apple Silicon (M1/M2) Macs.
- **`keys_file`**: Path to a JSON file containing your credentials. Default is `'keys.json'`.
- **`logLevel`**: Logging level for Spark (e.g., `"WARN"`, `"INFO"`).
- **`use_gpu`**: Whether to enable GPU support for Spark NLP. Default is `False`.
- **`apple_silicon`**: Whether to use Apple Silicon binaries. Default is `False`.

In [None]:
from sparknlp.pretrained import PretrainedPipeline
from sparkocr import start
import os
import json
import time
import shutil

license = "../spark_nlp_for_healthcare_spark_ocr_10057.json"

extra_configurations = {
    "spark.extraListeners": "com.johnsnowlabs.license.LicenseLifeCycleManager",  # required
    "spark.sql.legacy.allowUntypedScalaUDF": "true",  # required
    "spark.executor.instances": "7",
    "spark.executor.cores": "16",
    "spark.executor.memory": "130G",
    "spark.driver.memory": "100G",
    "spark.sql.shuffle.partitions": "896",
}

# Not needed for Google Colab
os.environ['JAVA_HOME'] = '/home/linuxbrew/.linuxbrew/Cellar/openjdk@17/17.0.15'

spark = start(nlp_internal=True,
              nlp_jsl=True,
              use_gpu=False,
              extra_conf=extra_configurations,
              keys_file=license)

spark
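
The resource settings in the cell above are sized for a large machine. On a smaller one, the same dictionary can be assembled from the two entries marked as required plus whatever overrides fit the hardware. This is a sketch only: `build_extra_conf` is a hypothetical helper, and the memory and partition values are illustrative assumptions rather than recommendations.

```python
# The two entries marked "required" in the cell above.
REQUIRED_CONF = {
    "spark.extraListeners": "com.johnsnowlabs.license.LicenseLifeCycleManager",
    "spark.sql.legacy.allowUntypedScalaUDF": "true",
}


def build_extra_conf(overrides=None):
    """Merge user overrides with the required entries; required keys always win."""
    conf = dict(overrides or {})
    conf.update(REQUIRED_CONF)
    return conf


# Illustrative values for a small local machine (assumptions, tune as needed).
small_conf = build_extra_conf({
    "spark.driver.memory": "8G",
    "spark.sql.shuffle.partitions": "16",
})
print(small_conf)
```

The resulting dictionary can be passed as `extra_conf=small_conf` in the `start()` call.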

## Pretrained Pipelines for PHI Deidentification

This notebook demonstrates how to apply **PHI deidentification** to PDF documents using **pretrained pipelines** from **John Snow Labs** with a single `.transform()` call.

These pipelines are specifically designed to identify and redact **personally identifiable information (PII)** and **protected health information (PHI)**. The redaction is performed by overlaying **black bounding boxes** on top of detected sensitive content in the resulting PDF, preserving the original layout while ensuring privacy compliance.

Several pipelines are also **signature-aware**, meaning they can detect and remove **handwritten or scanned signatures**, enabling comprehensive document sanitization.

### Available Pipelines

- **[`pdf_deid_multilingual_name_plus`](https://nlp.johnsnowlabs.com/2025/05/17/pdf_deid_multilingual_name_plus.html)**  
  Redacts PHI entities such as: `HOSPITAL`, `NAME`, `PATIENT`, `ID`, `MEDICALRECORD`, `IDNUM`, `COUNTRY`, `LOCATION`, `STREET`, `STATE`, `ZIP`, `CONTACT`, `PHONE`, `DATE`.

- **[`pdf_deid_multilingual_name_plus_signature_aware`](https://nlp.johnsnowlabs.com/2025/05/17/pdf_deid_multilingual_name_plus_signature_aware.html)**  
  Extends the above pipeline with **signature detection and removal** capabilities.

- **[`pdf_obfuscate_multilingual_name_plus`](https://nlp.johnsnowlabs.com/2025/05/17/pdf_obfuscate_multilingual_name_plus.html)**  
  Detects the same PHI entities as the above pipelines but replaces them **in-place with obfuscated text of the same length** instead of drawing a redaction box. Does not include signature detection and removal.

- **[`pdf_deid_multi_model_context_pipeline`](https://nlp.johnsnowlabs.com/2025/05/09/pdf_deid_multi_model_context_pipeline.html)**  
  A context-aware pipeline that redacts entities such as: `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `HOSPITAL`, `IDNUM`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `USERNAME`, `ZIP`.

- **[`pdf_deid_multi_model_context_signature_aware_pipeline`](https://nlp.johnsnowlabs.com/2025/05/23/pdf_deid_multi_model_context_signature_aware_pipeline.html)**  
  Includes all functionalities of the context-aware pipeline with additional **signature detection and removal**.

- **[`pdf_obfuscation_multi_model_context_pipeline`](https://nlp.johnsnowlabs.com/2025/05/09/pdf_obfuscation_multi_model_context_pipeline.html)**  
  Detects the same PHI entities as the above pipelines but replaces them **in-place with obfuscated text of the same length** instead of drawing a redaction box. Does not include signature detection and removal.
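
The six pipelines above differ along three axes: the model family, whether they redact or obfuscate, and whether they handle signatures. The lookup below is a hypothetical convenience for picking one by those properties. The family labels `name_plus` and `multi_model_context` are shorthand introduced here, while the pipeline names are copied from the list.

```python
# (family, action, signature_aware) -> pretrained pipeline name from the list above
PIPELINES = {
    ("name_plus", "redact", False): "pdf_deid_multilingual_name_plus",
    ("name_plus", "redact", True): "pdf_deid_multilingual_name_plus_signature_aware",
    ("name_plus", "obfuscate", False): "pdf_obfuscate_multilingual_name_plus",
    ("multi_model_context", "redact", False): "pdf_deid_multi_model_context_pipeline",
    ("multi_model_context", "redact", True): "pdf_deid_multi_model_context_signature_aware_pipeline",
    ("multi_model_context", "obfuscate", False): "pdf_obfuscation_multi_model_context_pipeline",
}


def choose_pipeline(family, action="redact", signature_aware=False):
    """Return the listed pipeline name matching the requested properties."""
    key = (family, action, signature_aware)
    if key not in PIPELINES:
        raise ValueError(f"No pipeline listed for {key}")
    return PIPELINES[key]


print(choose_pipeline("multi_model_context", "redact", signature_aware=True))
```

Requesting an obfuscation pipeline with signature handling raises `ValueError`, which mirrors the list: neither obfuscation pipeline includes signature detection.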

<h2>Deidentification Pretrained Pipelines</h2>

In [4]:
pdf_deid_multi_model_context_pipeline = PretrainedPipeline("pdf_deid_multi_model_context_pipeline", "en", "clinical/ocr")

pdf_deid_multi_model_context_pipeline.model.stages

pdf_deid_multi_model_context_pipeline download started this may take some time.
Approx size to download 4.4 GB
[ | ]

25/05/27 16:10:20 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.


[OK!]


[PdfToImage_3039dbc55d70,
 ImageToText_e47a60ca0cbf,
 DocumentAssembler_47aaa7f19fed,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_e1689c9bc262,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_9d4a08b1c03d,
 NerConverter_f230dec390db,
 CONTEXTUAL-PARSER_c654a34a3fa4,
 CONTEXTUAL-PARSER_33af26b2fdc0,
 CONTEXTUAL-PARSER_1b2769552347,
 CONTEXTUAL-PARSER_07bff3c2b284,
 CONTEXTUAL-PARSER_96b9d29769eb,
 CONTEXTUAL-PARSER_7aefd98e984c,
 ENTITY_EXTRACTOR_2ddfc4744b44,
 CONTEXTUAL-PARSER_99792b93398d,
 REGEX_MATCHER_2b137485db5e,
 CONTEXTUAL-PARSER_30dc43a095c1,
 CONTEXTUAL-PARSER_f64d354f0266,
 CONTEXTUAL-PARSER_72e20622d44a,
 CONTEXTUAL-PARSER_563cfd5f905d,
 REGEX_MATCHER_52d38a3643c9,
 MERGE_b0a322ec2787,
 MERGE_97a646c7a333,
 XLM_ROBERTA_EMBEDDINGS_b8a75c006754,
 MedicalNerModel_59183c57aedb,
 NER_CONVERTER_a3d70cd61248,
 PretrainedZeroShotNER_ca8c4dfe310f,
 NER_CONVERTER_86c4cbb15961,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_b537928b54d8,
 MERGE_1b7360979a74,
 P

In [5]:
input_pdf_path = "./PDF_Original/Easy/PDF_Deid_Deidentification_0.pdf"
output_pdf_path = "./sample_result/pretrained/"

# load pdf file 
df = spark.read.format("binaryFile").load(input_pdf_path)
df

path,modificationTime,length,content
file:/workspace/p...,2025-05-27 16:00:05,29492,[25 50 44 46 2D 3...
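
The `content` column holds the raw file bytes. The first five bytes displayed, `25 50 44 46 2D`, decode to the `%PDF-` marker that opens a PDF file, which offers a quick sanity check that the file was read as binary. The helper `looks_like_pdf` below is a hypothetical illustration, not part of Spark or the John Snow Labs libraries.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content):
    """Return True if the byte content starts with the PDF header marker."""
    return bytes(content[:len(PDF_MAGIC)]) == PDF_MAGIC


# The five bytes visible in the DataFrame preview above.
header = bytes.fromhex("25 50 44 46 2D")
print(header)                  # b'%PDF-'
print(looks_like_pdf(header))  # True
```

On a real row, the same check could be applied to the collected `content` value, for example `looks_like_pdf(df.first()["content"])`.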


In [7]:
result = pdf_deid_multi_model_context_pipeline.transform(df)

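os.makedirs(output_pdf_path, exist_ok=True)  # make sure the output folder exists

for row in result.select("path", "pdf").toLocalIterator():
    # Strip the directory and the .pdf extension from the input path
    input_file_name = os.path.splitext(os.path.basename(row["path"]))[0]
    savename = os.path.join(output_pdf_path, input_file_name + "_deid_cleaned.pdf")
    # The context manager closes the file even if the write fails
    with open(savename, "wb") as pdf_file:
        pdf_file.write(row["pdf"])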

25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:12:52 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:12:52 WARN DAGScheduler: Broadcasting large task binary with size 1675.1 KiB
25/05/27 16:13:15 WARN DAGScheduler: Broadcasting large task binary with size 1197.8 KiB
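
The save loop is repeated for every pipeline in this notebook, so it could be factored into a small helper. The sketch below assumes each row provides the input path and the output PDF bytes, as the `path` and `pdf` columns do here. The names `output_name` and `save_pdf_rows` are hypothetical.

```python
import os


def output_name(path, suffix):
    """Build an output file name: the input's base name plus a suffix."""
    stem, ext = os.path.splitext(os.path.basename(path))
    return f"{stem}{suffix}{ext or '.pdf'}"


def save_pdf_rows(rows, output_dir, suffix="_deid_cleaned"):
    """Write (path, pdf_bytes) pairs into output_dir and return the written paths."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for path, pdf_bytes in rows:
        target = os.path.join(output_dir, output_name(path, suffix))
        with open(target, "wb") as handle:
            handle.write(pdf_bytes)
        written.append(target)
    return written
```

With a pipeline result, the call might look like `save_pdf_rows(((r["path"], r["pdf"]) for r in result.select("path", "pdf").toLocalIterator()), output_pdf_path)`.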
                                 

In [9]:
pdf_deid_multilingual_name_plus = PretrainedPipeline("pdf_deid_multilingual_name_plus", "en", "clinical/ocr")

pdf_deid_multilingual_name_plus.model.stages

pdf_deid_multilingual_name_plus download started this may take some time.
Approx size to download 3.8 GB
[ | ]

25/05/27 16:18:54 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.


[OK!]


[PdfToImage_1c5484056078,
 ImageToText_6cb91ed976f8,
 DocumentAssembler_a98ae552b35a,
 SentenceDetectorDLModel_c83c27f46b97,
 REGEX_TOKENIZER_49e33e5662c4,
 PretrainedZeroShotNER_ca8c4dfe310f,
 NER_CONVERTER_a9840264a1d1,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_0343c0826eea,
 XLM_ROBERTA_EMBEDDINGS_b8a75c006754,
 MedicalNerModel_59183c57aedb,
 NER_CONVERTER_8e4da16b943c,
 CONTEXTUAL-PARSER_100152bbc72d,
 ChunkConverter_40d7b165011b,
 MERGE_21afdc51d507,
 DE-IDENTIFICATION_5c638995ea56,
 NerOutputCleaner_be85681e16a9,
 PositionFinder_7b9598051b35,
 ImageDrawRegions_a95cd9174c93,
 ImageToPdf_c03962c85764]
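
Stage lists like the one above are long, and a count per stage type makes two pipelines easier to compare. The sketch below works on the printed identifiers and assumes each one has the form `<type>_<hex suffix>`, as in the output above. The helper names are hypothetical.

```python
from collections import Counter


def stage_type(stage_id):
    """Drop the trailing '_<suffix>' from a stage identifier."""
    return stage_id.rsplit("_", 1)[0]


def count_stage_types(stage_ids):
    """Count how many stages of each type appear in a list of identifiers."""
    return Counter(stage_type(s) for s in stage_ids)


# A few identifiers copied from the stage list above.
example = [
    "PdfToImage_1c5484056078",
    "NER_CONVERTER_a9840264a1d1",
    "NER_CONVERTER_0343c0826eea",
    "CONTEXTUAL-PARSER_100152bbc72d",
]
print(count_stage_types(example))
```

On a loaded pipeline this might be called as `count_stage_types(str(s) for s in pdf_deid_multilingual_name_plus.model.stages)`, assuming `str(stage)` yields the identifiers shown in the output.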

In [10]:
input_pdf_path = "./PDF_Original/Easy/PDF_Deid_Deidentification_1.pdf"
output_pdf_path = "./sample_result/pretrained/"

df = spark.read.format("binaryFile").load(input_pdf_path)
df

path,modificationTime,length,content
file:/workspace/p...,2025-05-27 16:00:05,29221,[25 50 44 46 2D 3...


In [11]:
result = pdf_deid_multilingual_name_plus.transform(df)

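os.makedirs(output_pdf_path, exist_ok=True)  # make sure the output folder exists

for row in result.select("path", "pdf").toLocalIterator():
    # Strip the directory and the .pdf extension from the input path
    input_file_name = os.path.splitext(os.path.basename(row["path"]))[0]
    savename = os.path.join(output_pdf_path, input_file_name + "_deid_cleaned.pdf")
    # The context manager closes the file even if the write fails
    with open(savename, "wb") as pdf_file:
        pdf_file.write(row["pdf"])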

25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:19:09 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:19:10 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB

Using CPUs


25/05/27 16:19:27 WARN DAGScheduler: Broadcasting large task binary with size 1230.1 KiB
25/05/27 16:19:28 WARN DAGScheduler: Broadcasting large task binary with size 1156.4 KiB
                                                                                

In [12]:
# Running Multiple Files

input_files_to_run = [
    "./PDF_Original/Easy/PDF_Deid_Deidentification_10.pdf",
    "./PDF_Original/Medium/PDF_Deid_Deidentification_Medium_0.pdf",
    "./PDF_Original/Hard/PDF_Deid_Deidentification_Hard_0.pdf",
]

output_pdf_path = "./sample_result/pretrained/"

df = spark.read.format("binaryFile").load(input_files_to_run)
df

path,modificationTime,length,content
file:/workspace/p...,2025-05-27 16:00:05,461436,[25 50 44 46 2D 3...
file:/workspace/p...,2025-05-27 16:00:05,273133,[25 50 44 46 2D 3...
file:/workspace/p...,2025-05-27 16:00:05,29192,[25 50 44 46 2D 3...
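
Instead of listing files by hand, the input list could be gathered from the difficulty folders on disk. The sketch below assumes a layout like the one used here, with one subfolder per difficulty level containing `.pdf` files. `collect_pdfs` is a hypothetical helper.

```python
from pathlib import Path


def collect_pdfs(root, limit_per_dir=None):
    """Return sorted PDF paths from each immediate subfolder of `root`."""
    paths = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        pdfs = sorted(folder.glob("*.pdf"))
        if limit_per_dir is not None:
            pdfs = pdfs[:limit_per_dir]  # keep only the first few per folder
        paths.extend(str(p) for p in pdfs)
    return paths
```

The result is a plain list of strings, so it can be passed to the reader in the same way as `input_files_to_run`, for example `spark.read.format("binaryFile").load(collect_pdfs("./PDF_Original", limit_per_dir=1))`.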


In [13]:
result = pdf_deid_multilingual_name_plus.transform(df)

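os.makedirs(output_pdf_path, exist_ok=True)  # make sure the output folder exists

for row in result.select("path", "pdf").toLocalIterator():
    # Strip the directory and the .pdf extension from the input path
    input_file_name = os.path.splitext(os.path.basename(row["path"]))[0]
    savename = os.path.join(output_pdf_path, input_file_name + "_deid_cleaned.pdf")
    # The context manager closes the file even if the write fails
    with open(savename, "wb") as pdf_file:
        pdf_file.write(row["pdf"])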

25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:19:47 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:19:49 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
25/05/27 16:20:16 WARN DAGScheduler: Broadcasting large task binary with size 1230.2 KiB
25/05/27 16:20:18 WARN DAGScheduler:

<h2>Obfuscation Pretrained Pipelines</h2>

In [15]:
pdf_obfuscation_multi_model_context_pipeline = PretrainedPipeline("pdf_obfuscation_multi_model_context_pipeline", "en", "clinical/ocr")

pdf_obfuscation_multi_model_context_pipeline.model.stages

pdf_obfuscation_multi_model_context_pipeline download started this may take some time.
Approx size to download 4.4 GB
[ | ]

25/05/27 16:27:58 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.


[OK!]


[PdfToImage_3039dbc55d70,
 ImageToText_e47a60ca0cbf,
 DocumentAssembler_47aaa7f19fed,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_e1689c9bc262,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_9d4a08b1c03d,
 NerConverter_f230dec390db,
 CONTEXTUAL-PARSER_c654a34a3fa4,
 CONTEXTUAL-PARSER_33af26b2fdc0,
 CONTEXTUAL-PARSER_1b2769552347,
 CONTEXTUAL-PARSER_07bff3c2b284,
 CONTEXTUAL-PARSER_96b9d29769eb,
 CONTEXTUAL-PARSER_7aefd98e984c,
 ENTITY_EXTRACTOR_2ddfc4744b44,
 CONTEXTUAL-PARSER_99792b93398d,
 REGEX_MATCHER_2b137485db5e,
 CONTEXTUAL-PARSER_30dc43a095c1,
 CONTEXTUAL-PARSER_f64d354f0266,
 CONTEXTUAL-PARSER_72e20622d44a,
 CONTEXTUAL-PARSER_563cfd5f905d,
 REGEX_MATCHER_52d38a3643c9,
 MERGE_b0a322ec2787,
 MERGE_97a646c7a333,
 XLM_ROBERTA_EMBEDDINGS_b8a75c006754,
 MedicalNerModel_59183c57aedb,
 NER_CONVERTER_c48b64fa3415,
 PretrainedZeroShotNER_ca8c4dfe310f,
 NER_CONVERTER_1e400595b6da,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_2eba5de8da64,
 MERGE_107935ba3a49,
 D

In [26]:
input_pdf_path = "./PDF_Original/Easy/PDF_Deid_Deidentification_0.pdf"
output_pdf_path = "./sample_result/pretrained/"

# load pdf file 
df = spark.read.format("binaryFile").load(input_pdf_path)
df

                                                                                

path,modificationTime,length,content
file:/workspace/p...,2025-05-27 16:00:05,29492,[25 50 44 46 2D 3...


In [27]:
result = pdf_obfuscation_multi_model_context_pipeline.transform(df)

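os.makedirs(output_pdf_path, exist_ok=True)  # make sure the output folder exists

for row in result.select("path", "pdf").toLocalIterator():
    # Strip the directory and the .pdf extension from the input path
    input_file_name = os.path.splitext(os.path.basename(row["path"]))[0]
    savename = os.path.join(output_pdf_path, input_file_name + "_obfuscated_cleaned.pdf")
    # The context manager closes the file even if the write fails
    with open(savename, "wb") as pdf_file:
        pdf_file.write(row["pdf"])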

25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:35:44 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:35:44 WARN DAGScheduler: Broadcasting large task binary with size 1814.7 KiB
25/05/27 16:36:05 WARN DAGScheduler: Broadcasting large task binary with size 1916.2 KiB
25/05/27 16:36:06 WARN DAGSchedul

In [29]:
pdf_obfuscate_multilingual_name_plus = PretrainedPipeline("pdf_obfuscate_multilingual_name_plus", "en", "clinical/ocr")

pdf_obfuscate_multilingual_name_plus.model.stages

pdf_obfuscate_multilingual_name_plus download started this may take some time.
Approx size to download 3.8 GB
[ | ]

25/05/27 16:42:32 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.


[OK!]


[PdfToImage_14122ccf67b6,
 ImageToText_eed07825764d,
 DocumentAssembler_49d9b6a09dc5,
 SentenceDetectorDLModel_c83c27f46b97,
 REGEX_TOKENIZER_c2accb5524f2,
 PretrainedZeroShotNER_ca8c4dfe310f,
 NER_CONVERTER_0aac6f98b0c3,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_95d4db8e7c98,
 XLM_ROBERTA_EMBEDDINGS_b8a75c006754,
 MedicalNerModel_59183c57aedb,
 NER_CONVERTER_4f0c1ff5bd22,
 CONTEXTUAL-PARSER_100152bbc72d,
 ChunkConverter_07ed9bf472d8,
 MERGE_8e521a48482a,
 DE-IDENTIFICATION_17861617b2f9,
 NerOutputCleaner_c20beb8d5394,
 PositionFinder_cb89dcf21399,
 ImageDrawRegions_a466e6d8f79d,
 ImageToPdf_2caf20d9efd7]

In [32]:
input_pdf_path = "./PDF_Original/Easy/PDF_Deid_Deidentification_1.pdf"
output_pdf_path = "./sample_result/pretrained/"

df = spark.read.format("binaryFile").load(input_pdf_path)
df

                                                                                

path,modificationTime,length,content
file:/workspace/p...,2025-05-27 16:00:05,29221,[25 50 44 46 2D 3...


In [33]:
result = pdf_obfuscate_multilingual_name_plus.transform(df)

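os.makedirs(output_pdf_path, exist_ok=True)  # make sure the output folder exists

for row in result.select("path", "pdf").toLocalIterator():
    # Strip the directory and the .pdf extension from the input path
    input_file_name = os.path.splitext(os.path.basename(row["path"]))[0]
    savename = os.path.join(output_pdf_path, input_file_name + "_obfuscated_cleaned.pdf")
    # The context manager closes the file even if the write fails
    with open(savename, "wb") as pdf_file:
        pdf_file.write(row["pdf"])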

25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Times-Bold
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Helvetica-Bold
25/05/27 16:44:58 WARN PDType1Font: Using fallback font LiberationSans for Times-Roman
25/05/27 16:44:58 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB

Using CPUs


25/05/27 16:45:12 WARN DAGScheduler: Broadcasting large task binary with size 1249.8 KiB
25/05/27 16:45:13 WARN DAGScheduler: Broadcasting large task binary with size 1157.2 KiB
