## Spark Session Setup for Visual NLP, Healthcare NLP, and Open-Source NLP

To use this notebook, you need to start a Spark session with the following John Snow Labs libraries:

- **[Spark-OCR / Visual-NLP](https://nlp.johnsnowlabs.com/docs/en/ocr)**
- **[Healthcare NLP](https://nlp.johnsnowlabs.com/licensed/api/python/)**
- **[Open-Source NLP](https://github.com/JohnSnowLabs/spark-nlp)**

### Required Environment Variables

Ensure you have a valid license file containing your credentials. The following environment variables must be set:

- `SPARK_NLP_LICENSE` (Healthcare)
- `SECRET` (Healthcare)
- `JSL_VERSION` (Healthcare)
- `SPARK_OCR_LICENSE` (Visual)
- `SPARK_OCR_SECRET` (Visual)
- `OCR_VERSION` (Visual)
- `PUBLIC_VERSION` (Open-Source)
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN`
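
A quick check before installing can catch a missing key early. The sketch below assumes the variable names listed above are the complete set your license file provides; `missing_env_vars` is a hypothetical helper, not part of any John Snow Labs package.

```python
import os

# Names taken from the list above; adjust if your license file differs.
REQUIRED_VARS = [
    "SPARK_NLP_LICENSE", "SECRET", "JSL_VERSION",
    "SPARK_OCR_LICENSE", "SPARK_OCR_SECRET", "OCR_VERSION",
    "PUBLIC_VERSION",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN",
]


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this after the license file has been loaded, and before the `pip install` cell, shows which keys still need to be supplied.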

### Notes

- For **text-only projects** (i.e., no visual data processing), you can use **`SPARK_NLP_LICENSE`**.
- For projects involving **visual data** (e.g., image or PDF processing), you should use **`SPARK_OCR_LICENSE`**.
- For projects involving both **visual** and **text** data, either license can be used.
- All required key-value pairs **must be set as environment variables** to install and use the full functionality of these libraries.
- Ensure that you **restart** the session after installing all the required libraries.

In [1]:
import os
import json
import time
import shutil

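# Path to the license JSON file that holds your credentials
license = ""

if license and "json" in license:
    # Export every key-value pair from the license file as an environment variable
    with open(license, "r") as creds_in:
        creds = json.load(creds_in)

    for key, value in creds.items():
        os.environ[key] = value
else:
    raise Exception("License JSON file is not specified")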

In [2]:
!pip install --upgrade -q https://pypi.johnsnowlabs.com/$SPARK_OCR_SECRET/spark-ocr/spark_ocr-$OCR_VERSION-py3-none-any.whl

!pip install --upgrade -q https://pypi.johnsnowlabs.com/$SECRET/spark-nlp-jsl/spark_nlp_jsl-$JSL_VERSION-py3-none-any.whl

!pip install -q spark-nlp==$PUBLIC_VERSION

!pip install -q pandas

!pip install -q matplotlib

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spark-ocr 6.0.0 requires spark-nlp==5.5.3, but you have spark-nlp 6.0.0 which is incompatible.

In [None]:
### RESTART SESSION!!!

## Start Spark Session - Visual NLP, Healthcare NLP, Spark-NLP

In this section, we initialize the Spark session using the `start()` function from the **`sparkocr`** package.

This utility sets up a fully configured Spark session tailored for **Spark OCR** and optionally for **Spark NLP**, **Healthcare NLP**, and **GPU/Apple Silicon support**.

### Function Overview: `start()`

The `start()` function returns a ready-to-use `SparkSession` and accepts the following parameters:

- **`secret`**: Secret key required to download JAR files from the John Snow Labs server.
- **`jar_path`**: (Optional) Local path to a pre-downloaded JAR file.
- **`extra_conf`**: Additional Spark configuration — can be a `SparkConf` object or a Python `dict`.
- **`master_url`**: URL for the Spark master (e.g., `"local[*]"`).
- **`nlp_version`**: Version of Spark NLP to use. If `None`, Spark NLP is not included.
- **`nlp_internal`**: Boolean indicating whether to include Spark NLP Internal.
- **`nlp_jsl`**: Boolean or version string to include Spark NLP for Healthcare (JSL).
- **`nlp_secret`**: Secret key for downloading Spark NLP Internal.
- **`m1`**: Set to `True` to enable support for Apple Silicon (M1/M2) Macs.
- **`keys_file`**: Path to a JSON file containing your credentials. Default is `'keys.json'`.
- **`logLevel`**: Logging level for Spark (e.g., `"WARN"`, `"INFO"`).
- **`use_gpu`**: Whether to enable GPU support for Spark NLP. Default is `False`.
- **`apple_silicon`**: Whether to use Apple Silicon binaries. Default is `False`.

In [None]:
from sparkocr import start
import os
import json
import time
import shutil

license = ""

extra_configurations = {
    "spark.extraListeners": "com.johnsnowlabs.license.LicenseLifeCycleManager", #required
    "spark.sql.legacy.allowUntypedScalaUDF" : "true", #required
    "spark.executor.instances" : "7", 
    "spark.executor.cores" : "16", 
    "spark.executor.memory" : "130G", 
    "spark.driver.memory" : "100G", 
    "spark.sql.shuffle.partitions" : "896"
}

# Not needed on Google Colab
os.environ['JAVA_HOME'] = '/home/linuxbrew/.linuxbrew/Cellar/openjdk@17/17.0.15'

spark = start(nlp_internal=True,
              nlp_jsl=True,
              use_gpu=False,
              extra_conf=extra_configurations,
              keys_file=license)

spark
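
The resource values in `extra_configurations` are sized for one specific cluster. A small helper can derive them from a few inputs instead. In this sketch, `build_extra_conf` and its `partitions_per_core` parameter are hypothetical; the multiplier of 8 is only an assumption, chosen because 7 executors × 16 cores × 8 reproduces the 896 shuffle partitions used above. The notebook does not state how that value was picked.

```python
def build_extra_conf(executors, cores_per_executor, executor_memory,
                     driver_memory, partitions_per_core=8):
    """Build a dict that could be passed as `extra_conf` to `start()`."""
    total_cores = executors * cores_per_executor
    return {
        # The two entries marked "required" in the notebook.
        "spark.extraListeners": "com.johnsnowlabs.license.LicenseLifeCycleManager",
        "spark.sql.legacy.allowUntypedScalaUDF": "true",
        # Resource sizing, kept as strings as in the notebook.
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": executor_memory,
        "spark.driver.memory": driver_memory,
        # Assumed heuristic: a fixed number of partitions per core.
        "spark.sql.shuffle.partitions": str(total_cores * partitions_per_core),
    }


conf = build_extra_conf(7, 16, "130G", "100G")
print(conf["spark.sql.shuffle.partitions"])  # 896
```

On a smaller machine, lowering the first four arguments scales the shuffle partition count down with the available cores.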

<h2>Import Visual NLP, Healthcare NLP and Spark-NLP</h2>

In [2]:
import json
import os

import numpy as np
import pandas as pd

# PySpark imports
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Necessary imports from Spark OCR library
import sparkocr
from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import *

# import sparknlp packages
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp_jsl
from sparknlp_jsl.annotator import *
from collections import Counter
from sparknlp.pretrained import PretrainedPipeline

In [3]:
def evaluate_predictions(SOURCE_GT_PATH, DF_SAVE_PATH, SAVE_MAPPING_PATH):
    """
    Compute per-file precision and recall, then print the averaged
    precision, recall, and F1 score.
    Saves a JSON file with predictions, ground truth, precision, and recall.
    """

    def calculate_metrics(preds, gts):
        # Multiset matching: each ground-truth item can be matched only once
        gt_counter = Counter(gts)
        pred_counter = Counter(preds)

        tp = 0
        for item in pred_counter:
            if item in gt_counter:
                tp += min(pred_counter[item], gt_counter[item])

        fp = sum(pred_counter.values()) - tp
        fn = sum(gt_counter.values()) - tp

        precision = tp / (tp + fp) if (tp + fp) else 0
        recall = tp / (tp + fn) if (tp + fn) else 0

        return precision, recall

    with open(SOURCE_GT_PATH, "r") as f:
        ground_truth = json.load(f)

    df_predictions = spark.read.format("parquet").load(DF_SAVE_PATH)

    predictions_by_file = {}

    for row in df_predictions.select("path").distinct().toLocalIterator():
        file_path = row.asDict()["path"]
        filename = os.path.basename(file_path)

        if filename not in ground_truth:
            continue

        extracted_results = []
        rows = df_predictions.filter(F.col("path") == file_path).select("positions_ner")

        for r in rows.toLocalIterator():
            for ner in r.asDict()["positions_ner"]:
                extracted_results.append(ner.asDict()["result"])

        predictions_by_file[filename] = extracted_results

    summary = {}
    all_precisions = []
    all_recalls = []

    for filename, predictions in predictions_by_file.items():
        gt_values = ground_truth[filename]
        gt_values = [i.replace("-year-old", "") for i in gt_values]
        predictions = [i.replace("-year-old", "") for i in predictions]
        precision, recall = calculate_metrics([i.replace(" ", "") for i in predictions], 
                                              [i.replace(" ", "") for i in gt_values])

        all_precisions.append(precision)
        all_recalls.append(recall)

        summary[filename] = {
            "precision": round(precision, 4),
            "recall": round(recall, 4),
            "gt": gt_values,
            "pred": predictions
        }

        print(f"Filename: {filename} | Precision: {precision:.4f} | Recall: {recall:.4f}")

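    # Macro-average over files; guard against empty input and zero denominators
    n_files = len(all_precisions)
    avg_precision = round(sum(all_precisions) / n_files, 4) if n_files else 0.0
    avg_recall = round(sum(all_recalls) / n_files, 4) if n_files else 0.0
    pr_sum = avg_precision + avg_recall
    f1_score = round(2 * (avg_precision * avg_recall) / pr_sum, 4) if pr_sum else 0.0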

    print(f"\nOverall Precision: {avg_precision}")
    print(f"Overall Recall: {avg_recall}")
    print(f"F1 Score: {f1_score}")

    with open(SAVE_MAPPING_PATH, "w") as f:
        json.dump(summary, f, indent=4)

    print(f"Mapping File Saved To : {SAVE_MAPPING_PATH}")
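
The scoring inside `evaluate_predictions` can be tried in isolation. The sketch below mirrors its normalization (dropping `-year-old`, then all spaces) and its multiset matching, where a repeated prediction only counts once per matching ground-truth item. The helper names and sample values are made up for illustration.

```python
from collections import Counter


def normalize(value):
    """Mirror the notebook's cleanup: drop '-year-old', then all spaces."""
    return value.replace("-year-old", "").replace(" ", "")


def precision_recall(preds, gts):
    """Multiset precision/recall: each ground-truth item can match once."""
    pred_counter = Counter(normalize(p) for p in preds)
    gt_counter = Counter(normalize(g) for g in gts)
    # Counter intersection keeps the minimum count per item (true positives).
    tp = sum((pred_counter & gt_counter).values())
    n_pred = sum(pred_counter.values())
    n_gt = sum(gt_counter.values())
    precision = tp / n_pred if n_pred else 0
    recall = tp / n_gt if n_gt else 0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean, returning 0 when both inputs are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0


# Made-up values for illustration only.
preds = ["John Smith", "John Smith", "45-year-old"]
gts = ["JohnSmith", "45", "Boston"]
p, r = precision_recall(preds, gts)
print(round(p, 4), round(r, 4), round(f1(p, r), 4))  # 0.6667 0.6667 0.6667
```

Here the duplicate `John Smith` prediction counts as one true positive and one false positive, and `Boston` is a miss, so both precision and recall are 2/3.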

## Threshold Parameters

- **`ner_threshold`**  
  Filters out entities based on their predicted confidence scores. This parameter is used in the `NerConverterInternal` stage to retain only high-confidence predictions.

- **`ocr_threshold`**  
  Filters out predicted text from OCR based on confidence scores, ensuring only reliable OCR outputs are used downstream.

- **`matcherWhitelist`**  
  Maps each entity label to a confidence threshold. Entities that NER detected at or above their label's threshold are then used for text matching, so that other occurrences of the same text are also caught.

- **`whitelist`**  
  Allows you to retain only specific entity classes by explicitly listing them.
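
To make the interplay of these settings concrete, here is a plain-Python sketch of label whitelisting combined with a per-label confidence cutoff. It only illustrates the idea; it is not how `NerConverterInternal` or `DeIdentification` are implemented, and the `filter_entities` helper and the sample entities are hypothetical.

```python
def filter_entities(entities, whitelist, thresholds, default_threshold=0.9):
    """Keep entities whose label is whitelisted and whose confidence
    reaches the threshold configured for that label."""
    kept = []
    for ent in entities:
        label = ent["label"]
        if label not in whitelist:
            continue
        if ent["confidence"] >= thresholds.get(label, default_threshold):
            kept.append(ent)
    return kept


# Hypothetical detections; labels and cutoffs follow the notebook's style.
entities = [
    {"text": "Jane Doe", "label": "NAME", "confidence": 0.72},
    {"text": "Mercy Hospital", "label": "HOSPITAL", "confidence": 0.72},
    {"text": "aspirin", "label": "DRUG", "confidence": 0.99},
]
kept = filter_entities(
    entities,
    whitelist=["NAME", "HOSPITAL"],
    thresholds={"NAME": 0.6, "HOSPITAL": 0.9},
)
print([e["text"] for e in kept])  # ['Jane Doe']
```

The same confidence of 0.72 passes for `NAME` but not for `HOSPITAL`, which is the effect of giving labels different cutoffs.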


In [4]:
# Ner Threshold
ner_threshold = 0.90

# OCR Output Threshold
ocr_threshold = 70

# NER whitelist entities
whitelist = ['HOSPITAL', 'NAME', 'PATIENT', 'ID', 'MEDICALRECORD', 'IDNUM', 'COUNTRY', 'LOCATION', 'STREET', 'STATE', 'ZIP', 'CONTACT', 'PHONE', 'DATE']

# The matcher applies regex matching based on entities already detected by NER.
# The per-label threshold selects which detected entities are used for matching.
matcherWhitelist = {'HOSPITAL': 0.9,
 'NAME': 0.6,
 'PATIENT': 0.9,
 'ID': 0.9,
 'MEDICALRECORD': 0.6,
 'IDNUM': 0.6,
 'COUNTRY': 0.9,
 'LOCATION': 0.9,
 'STREET': 0.9,
 'STATE': 0.9,
 'ZIP': 0.9,
 'CONTACT': 0.9,
 'PHONE': 0.9,
 'DATE': 0.9}

matcherWhitelist

{'HOSPITAL': 0.9,
 'NAME': 0.6,
 'PATIENT': 0.9,
 'ID': 0.9,
 'MEDICALRECORD': 0.6,
 'IDNUM': 0.6,
 'COUNTRY': 0.9,
 'LOCATION': 0.9,
 'STREET': 0.9,
 'STATE': 0.9,
 'ZIP': 0.9,
 'CONTACT': 0.9,
 'PHONE': 0.9,
 'DATE': 0.9}

In [None]:
pdf_to_image = PdfToImage() \
  .setInputCol("content") \
  .setSplitNumBatch(10) \
  .setOutputCol("image_raw") \
  .setImageType(ImageType.TYPE_3BYTE_BGR) \
  .setSplittingStategy(SplittingStrategy.FIXED_NUMBER_OF_PARTITIONS)

ocr = ImageToText() \
    .setInputCol("image_raw") \
    .setOutputCol("text") \
    .setIgnoreResolution(False) \
    .setPageIteratorLevel(PageIteratorLevel.SYMBOL) \
    .setPageSegMode(PageSegmentationMode.SPARSE_TEXT) \
    .setWithSpaces(True) \
    .setKeepLayout(False) \
    .setConfidenceThreshold(ocr_threshold)

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink_full") 

abbreviations = ['Bros', 'No', 'al', 'vs', 'etc', 'Fig', 'Dr', 'Prof', 'PhD', 'MD', 'Co', 'Corp', 'Inc', 'bros', 'VS', 'Vs', 'ETC', 'fig', 'dr', 'prof', 'PHD', 'phd', 'md', 'co', 'corp', 'inc', 'Jan', 'Feb', 'Mar', 'Apr', 'Jul', 'Aug', 'Sep', 'Sept', 'Oct', 'Nov', 'Dec', 'St', 'st', 'AM', 'PM', 'am', 'pm', 'e.g', 'f.e', 'i.e']
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setImpossiblePenultimates(abbreviations) \
    .setUseCustomBoundsOnly(False) \
    .setSplitLength(2147483647) \
    .setExplodeSentences(False)

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '"', "'"])


ner_docwise_large = PretrainedZeroShotNER().pretrained("zeroshot_ner_deid_subentity_docwise_large", "en", "clinical/models") \
    .setInputCols("sentence", "token") \
    .setOutputCol("ner_docwise_large") \
    .setLabels(["CITY", "COUNTRY", "PHONE", "IDNUM", "ID", "MEDICALRECORD", "DATE", "HOSPITAL", "ORGANIZATION", "STATE", "STREET"])

ner_chunk_docwise_large = NerConverterInternal() \
    .setInputCols("sentence", "token", "ner_docwise_large") \
    .setOutputCol("ner_chunk_docwise_large") \
    .setThreshold(ner_threshold)

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_deid = MedicalNerModel.pretrained("ner_deid_subentity_augmented_docwise", "en", "clinical/models")  \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_deid_subentity_augmented_docwise")

ner_deid_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_subentity_augmented_docwise"]) \
    .setOutputCol("ner_chunk_subentity_augmented_docwise") \
    .setWhiteList(["IDNUM", "MEDICALRECORD", "ZIP"]) \
    .setThreshold(0.60)

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols("sentence", "token") \
    .setOutputCol("xlm_embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(False)

ner = MedicalNerModel.pretrained("ner_deid_name_multilingual", "xx", "clinical/models") \
    .setInputCols(["sentence", "token", "xlm_embeddings"]) \
    .setOutputCol("ner_deid_name_multilingual")

ner_name_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid_name_multilingual"]) \
    .setOutputCol("ner_chunk_name_multilingual") \
    .setThreshold(0.60)

age_contextual_parser = ContextualParserModel.pretrained("age_parser", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("chunk_age")

age_chunk_converter = ChunkConverter() \
    .setInputCols(["chunk_age"]) \
    .setOutputCol("ner_chunk_age")

chunk_merger = ChunkMergeApproach() \
    .setInputCols('ner_chunk_subentity_augmented_docwise', 'ner_chunk_docwise_large', 'ner_chunk_name_multilingual', 'ner_chunk_age') \
    .setOutputCol('merged_ner_chunk') \
    .setMergeOverlapping(True)

deid_obfuscated = DeIdentification() \
    .setInputCols(["sentence", "token", "merged_ner_chunk"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setKeepMonth(True) \
    .setKeepYear(True) \
    .setObfuscateDate(True) \
    .setSameEntityThreshold(0.7) \
    .setKeepTextSizeForObfuscation(True) \
    .setFakerLengthOffset(2) \
    .setReturnEntityMappings(True) \
    .setDays(2) \
    .setMappingsColumn("aux") \
    .setIgnoreRegex(True) \
    .setGroupByCol("path") \
    .setRegion("us") \
    .setSeed(40) \
    .setConsistentObfuscation(True) \
    .setChunkMatching(matcherWhitelist)

cleaner = NerOutputCleaner() \
    .setInputCol("aux") \
    .setOutputCol("new_aux") \
    .setOutputNerCol("positions_ner")

position_finder = PositionFinder() \
    .setInputCols("positions_ner") \
    .setOutputCol("coordinates") \
    .setPageMatrixCol("positions")

draw_regions = ImageDrawRegions() \
  .setInputCol("image_raw") \
  .setInputRegionsCol("coordinates") \
  .setRectColor(Color.black) \
  .setFilledRect(True) \
  .setOutputCol("image_with_regions")

stages = [
    pdf_to_image,
    ocr,
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_docwise_large,
    ner_chunk_docwise_large,
    word_embeddings,
    ner_deid,
    ner_deid_converter,
    embeddings,
    ner,
    ner_name_converter,
    age_contextual_parser,
    age_chunk_converter,
    chunk_merger,
    deid_obfuscated,
    cleaner,
    position_finder,
    draw_regions
]

pipe = Pipeline(stages=stages)

In [6]:
stages

[PdfToImage_24383e7d72af,
 ImageToText_777a0e572b06,
 DocumentAssembler_334bbe45c83c,
 SentenceDetectorDLModel_c83c27f46b97,
 Tokenizer_cb576e40ba33,
 PretrainedZeroShotNER_ca8c4dfe310f,
 NerConverterInternal_f8dcb764cea1,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_ada39ac0d359,
 NerConverterInternal_0da57797b1b8,
 XLM_ROBERTA_EMBEDDINGS_b8a75c006754,
 MedicalNerModel_59183c57aedb,
 NerConverterInternal_6a42cb6c30fa,
 CONTEXTUAL-PARSER_100152bbc72d,
 ChunkConverter_ed78f9d4128b,
 ChunkMergeApproach_71a45f6027ab,
 DeIdentification_1802897a3390,
 NerOutputCleaner_817bf8a3d210,
 PositionFinder_d370170be46b,
 ImageDrawRegions_a15485c93c74]
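
With twenty stages, a mistyped column name is easy to miss. The sketch below checks that every input column is produced by an earlier stage. The table is transcribed by hand from the `setInputCol(s)` / `setOutputCol` calls in the pipeline cell, and it ignores columns that stages create implicitly (for example `positions` or `pagenum`). `find_unwired` is a hypothetical helper, not a Spark or John Snow Labs API.

```python
# (stage name, input columns, output columns), copied from the pipeline cell.
NER_IN = ["sentence", "token"]
STAGE_IO = [
    ("pdf_to_image", ["content"], ["image_raw"]),
    ("ocr", ["image_raw"], ["text"]),
    ("document_assembler", ["text"], ["document"]),
    ("sentence_detector", ["document"], ["sentence"]),
    ("tokenizer", ["sentence"], ["token"]),
    ("ner_docwise_large", NER_IN, ["ner_docwise_large"]),
    ("ner_chunk_docwise_large", NER_IN + ["ner_docwise_large"],
     ["ner_chunk_docwise_large"]),
    ("word_embeddings", NER_IN, ["embeddings"]),
    ("ner_deid", NER_IN + ["embeddings"],
     ["ner_deid_subentity_augmented_docwise"]),
    ("ner_deid_converter", NER_IN + ["ner_deid_subentity_augmented_docwise"],
     ["ner_chunk_subentity_augmented_docwise"]),
    ("embeddings", NER_IN, ["xlm_embeddings"]),
    ("ner", NER_IN + ["xlm_embeddings"], ["ner_deid_name_multilingual"]),
    ("ner_name_converter", NER_IN + ["ner_deid_name_multilingual"],
     ["ner_chunk_name_multilingual"]),
    ("age_contextual_parser", NER_IN, ["chunk_age"]),
    ("age_chunk_converter", ["chunk_age"], ["ner_chunk_age"]),
    ("chunk_merger",
     ["ner_chunk_subentity_augmented_docwise", "ner_chunk_docwise_large",
      "ner_chunk_name_multilingual", "ner_chunk_age"],
     ["merged_ner_chunk"]),
    # "aux" is the column named via setMappingsColumn("aux").
    ("deid_obfuscated", NER_IN + ["merged_ner_chunk"], ["obfuscated", "aux"]),
    ("cleaner", ["aux"], ["new_aux", "positions_ner"]),
    ("position_finder", ["positions_ner"], ["coordinates"]),
    ("draw_regions", ["image_raw", "coordinates"], ["image_with_regions"]),
]


def find_unwired(stage_io, initial_columns):
    """Return (stage, column) pairs whose input is not yet available."""
    available = set(initial_columns)
    problems = []
    for name, inputs, outputs in stage_io:
        for col_name in inputs:
            if col_name not in available:
                problems.append((name, col_name))
        available.update(outputs)
    return problems


# The binaryFile source provides at least "path" and "content".
print(find_unwired(STAGE_IO, ["path", "content"]))  # []
```

An empty list means each stage reads only columns that exist by the time it runs; reordering or renaming a stage would surface as a `(stage, column)` pair.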

## File and Directory Paths

- **`SOURCE_PDF_PATH`**  
  List of local folders containing the input PDF files.

- **`DF_SAVE_PATH`**  
  Path for saving intermediate DataFrame results. Useful for checkpointing across multiple pipeline stages.

- **`SOURCE_GT_PATH`**  
  Path to the JSON file containing the ground-truth entities for each PDF.

- **`SAVE_MAPPING_PATH`**  
  Path to the JSON file where per-file evaluation results (predictions, ground truth, precision, and recall) will be stored. The averaged precision, recall, and F1-score are printed rather than saved.

- **`SAVE_OUTPUT_PDF`**  
  Path to the folder where the output (redacted or annotated) PDF files will be saved.
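
Judging from how `evaluate_predictions` reads it, the ground-truth file is a single JSON object that maps each PDF's base filename to a list of expected PHI strings. The sketch below writes and reads such a file in a temporary directory; the filename, the Spark-style path, and the values are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical ground truth: base filename -> list of expected PHI strings.
ground_truth = {
    "example_report.pdf": ["Jane Doe", "01/02/2020", "45-year-old"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    gt_path = os.path.join(tmp_dir, "ground_truth.json")
    with open(gt_path, "w") as f:
        json.dump(ground_truth, f, indent=4)

    with open(gt_path, "r") as f:
        loaded = json.load(f)

# The `path` column holds a full location, so the lookup key is the
# base name, as in evaluate_predictions.
spark_path = "file:/data/PDF_Original/Easy/example_report.pdf"
key = os.path.basename(spark_path)
print(key in loaded, loaded[key])
```

Files whose base name is missing from the JSON object are skipped by `evaluate_predictions`, so a naming mismatch silently shrinks the evaluated set.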


<h2>Easy Dataset</h2>

<h4>Total Files : [ 30 Files ]</h4>

In [8]:
SOURCE_PDF_PATH = ["./PDF_Original/Easy/"]
DF_SAVE_PATH = "./df_temp/easy/" #should be regenerated
SOURCE_GT_PATH = "./Mapping/all_phi/pdf_deid_gts_easy.json"
SAVE_MAPPING_PATH = "./Mapping/all_phi/easy_result_mapping.json"
SAVE_OUTPUT_PDF = "./easy_pdf_output/"

os.makedirs(SAVE_OUTPUT_PDF, exist_ok=True)

In [None]:
df = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.pdf") \
    .load(SOURCE_PDF_PATH) \
    .filter(~col("path").contains("ipynb"))

result = pipe.fit(df).transform(df)
result.write.format('parquet').mode('overwrite').save(DF_SAVE_PATH)

In [11]:
evaluate_predictions(SOURCE_GT_PATH=SOURCE_GT_PATH, 
                     DF_SAVE_PATH=DF_SAVE_PATH, 
                     SAVE_MAPPING_PATH=SAVE_MAPPING_PATH)

Filename: PDF_Deid_Deidentification_2.pdf | Precision: 1.0000 | Recall: 0.9756
Filename: PDF_Deid_Deidentification_22.pdf | Precision: 1.0000 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_6.pdf | Precision: 1.0000 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_9.pdf | Precision: 1.0000 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_5.pdf | Precision: 0.9756 | Recall: 0.9756
Filename: PDF_Deid_Deidentification_8.pdf | Precision: 1.0000 | Recall: 0.9756
Filename: PDF_Deid_Deidentification_1.pdf | Precision: 1.0000 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_27.pdf | Precision: 1.0000 | Recall: 0.9268
Filename: PDF_Deid_Deidentification_28.pdf | Precision: 0.9535 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_0.pdf | Precision: 0.9535 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_19.pdf | Precision: 1.0000 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_11.pdf | Precision: 0.9762 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_17.pdf | Pr

In [12]:
OBFUSCATED_IMAGE_COL = "image_with_regions"

img_to_pdf = ImageToPdf() \
    .setPageNumCol("pagenum") \
    .setOriginCol("path") \
    .setOutputCol("pdf") \
    .setInputCol(OBFUSCATED_IMAGE_COL) \
    .setAggregatePages(True)

source = spark.read.format("parquet").load(DF_SAVE_PATH)
result_pdf = img_to_pdf.transform(source)

for row in result_pdf.select("path", "pdf").toLocalIterator():
    filename = row.asDict()["path"]
    basename = os.path.basename(filename)

    savename = os.path.join(SAVE_OUTPUT_PDF, basename)

    # Write the redacted PDF bytes; the context manager closes the file
    with open(savename, "wb") as pdf_file:
        pdf_file.write(row.asDict()["pdf"])


<h2>Medium Dataset</h2>

<h4>Total Files : [ 40 Files ( 30 Easy + 10 Medium ) ]</h4>

In [7]:
SOURCE_PDF_PATH = ["./PDF_Original/Easy/", "./PDF_Original/Medium/",]
DF_SAVE_PATH = "./df_temp/medium/" #should be regenerated
SOURCE_GT_PATH = "./Mapping/all_phi/pdf_deid_gts_medium.json"
SAVE_MAPPING_PATH = "./Mapping/all_phi/medium_result_mapping.json"
SAVE_OUTPUT_PDF = "./medium_pdf_output/"

os.makedirs(SAVE_OUTPUT_PDF, exist_ok=True)

In [None]:
df = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.pdf") \
    .load(SOURCE_PDF_PATH) \
    .filter(~col("path").contains("ipynb"))

result = pipe.fit(df).transform(df)
result.write.format('parquet').mode('overwrite').save(DF_SAVE_PATH)

In [9]:
evaluate_predictions(SOURCE_GT_PATH=SOURCE_GT_PATH, 
                     DF_SAVE_PATH=DF_SAVE_PATH, 
                     SAVE_MAPPING_PATH=SAVE_MAPPING_PATH)

Filename: PDF_Deid_Deidentification_Medium_5.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_6.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_8.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_9.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_3.pdf | Precision: 1.0000 | Recall: 0.8846
Filename: PDF_Deid_Deidentification_Medium_1.pdf | Precision: 0.9200 | Recall: 0.8846
Filename: PDF_Deid_Deidentification_Medium_2.pdf | Precision: 0.9333 | Recall: 0.8077
Filename: PDF_Deid_Deidentification_Medium_0.pdf | Precision: 0.9796 | Recall: 0.9231
Filename: PDF_Deid_Deidentification_Medium_4.pdf | Precision: 0.9800 | Recall: 0.9423
Filename: PDF_Deid_Deidentification_Medium_7.pdf | Precision: 0.9167 | Recall: 0.8462
Filename: PDF_Deid_Deidentification_16.pdf | Precision: 1.0000 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_20.pdf | Precision: 0.95

In [10]:
OBFUSCATED_IMAGE_COL = "image_with_regions"

img_to_pdf = ImageToPdf() \
    .setPageNumCol("pagenum") \
    .setOriginCol("path") \
    .setOutputCol("pdf") \
    .setInputCol(OBFUSCATED_IMAGE_COL) \
    .setAggregatePages(True)

source = spark.read.format("parquet").load(DF_SAVE_PATH)
result_pdf = img_to_pdf.transform(source)

for row in result_pdf.select("path", "pdf").toLocalIterator():
    filename = row.asDict()["path"]
    basename = os.path.basename(filename)

    savename = os.path.join(SAVE_OUTPUT_PDF, basename)

    # Only export the files added at this difficulty level
    if "Medium" in filename:
        with open(savename, "wb") as pdf_file:
            pdf_file.write(row.asDict()["pdf"])


<h2>Hard Dataset</h2>

<h4>Total Files : [ 50 Files ( 30 Easy + 10 Medium + 10 Hard) ]</h4>

In [14]:
SOURCE_PDF_PATH = ["./PDF_Original/Easy/", "./PDF_Original/Medium/",  "./PDF_Original/Hard/",]
DF_SAVE_PATH = "./df_temp/hard/" #should be regenerated
SOURCE_GT_PATH = "./Mapping/all_phi/pdf_deid_gts_hard.json"
SAVE_MAPPING_PATH = "./Mapping/all_phi/hard_result_mapping.json"
SAVE_OUTPUT_PDF = "./hard_pdf_output/"

os.makedirs(SAVE_OUTPUT_PDF, exist_ok=True)

In [None]:
df = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.pdf") \
    .load(SOURCE_PDF_PATH) \
    .filter(~col("path").contains("ipynb"))

result = pipe.fit(df).transform(df)
result.write.format('parquet').mode('overwrite').save(DF_SAVE_PATH)

In [16]:
evaluate_predictions(SOURCE_GT_PATH=SOURCE_GT_PATH, 
                     DF_SAVE_PATH=DF_SAVE_PATH, 
                     SAVE_MAPPING_PATH=SAVE_MAPPING_PATH)

Filename: PDF_Deid_Deidentification_Medium_9.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_5.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_0.pdf | Precision: 0.9796 | Recall: 0.9231
Filename: PDF_Deid_Deidentification_Medium_2.pdf | Precision: 0.9333 | Recall: 0.8077
Filename: PDF_Deid_Deidentification_Medium_8.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_1.pdf | Precision: 0.9200 | Recall: 0.8846
Filename: PDF_Deid_Deidentification_Medium_4.pdf | Precision: 0.9800 | Recall: 0.9423
Filename: PDF_Deid_Deidentification_Medium_7.pdf | Precision: 0.9167 | Recall: 0.8462
Filename: PDF_Deid_Deidentification_Medium_6.pdf | Precision: 0.9792 | Recall: 0.9038
Filename: PDF_Deid_Deidentification_Medium_3.pdf | Precision: 1.0000 | Recall: 0.8846
Filename: PDF_Deid_Deidentification_16.pdf | Precision: 1.0000 | Recall: 1.0000
Filename: PDF_Deid_Deidentification_20.pdf | Precision: 0.95

In [54]:
OBFUSCATED_IMAGE_COL = "image_with_regions"

img_to_pdf = ImageToPdf() \
    .setPageNumCol("pagenum") \
    .setOriginCol("path") \
    .setOutputCol("pdf") \
    .setInputCol(OBFUSCATED_IMAGE_COL) \
    .setAggregatePages(True)

source = spark.read.format("parquet").load(DF_SAVE_PATH)
result_pdf = img_to_pdf.transform(source)

for row in result_pdf.select("path", "pdf").toLocalIterator():
    filename = row.asDict()["path"]
    basename = os.path.basename(filename)

    savename = os.path.join(SAVE_OUTPUT_PDF, basename)

    # Only export the files added at this difficulty level
    if "Hard" in filename:
        with open(savename, "wb") as pdf_file:
            pdf_file.write(row.asDict()["pdf"])

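
The three export loops differ only in the output folder and the optional filename filter, so they could share one helper. This is a minimal sketch under that assumption; `save_pdfs` and the stand-in rows are hypothetical, and in the notebook the rows would come from `result_pdf.select("path", "pdf").toLocalIterator()`.

```python
import os
import tempfile


def save_pdfs(rows, output_dir, must_contain=None):
    """Write (path, pdf_bytes) pairs into output_dir.

    If must_contain is given, only paths containing that substring are
    saved. Returns the list of files written.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for path, pdf_bytes in rows:
        if must_contain is not None and must_contain not in path:
            continue
        target = os.path.join(output_dir, os.path.basename(path))
        with open(target, "wb") as pdf_file:
            pdf_file.write(pdf_bytes)
        written.append(target)
    return written


# Stand-in rows with fake PDF bytes, for illustration only.
rows = [
    ("file:/data/Easy/a.pdf", b"%PDF-easy"),
    ("file:/data/Hard/b.pdf", b"%PDF-hard"),
]
with tempfile.TemporaryDirectory() as out_dir:
    saved = save_pdfs(rows, out_dir, must_contain="Hard")
    print([os.path.basename(p) for p in saved])  # ['b.pdf']
```

Passing `must_contain=None` reproduces the Easy export, while `"Medium"` and `"Hard"` reproduce the filtered exports.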