# DataFog Model Evaluation Notebook


## Objective

This notebook provides a clear and transparent framework for evaluating the quality of ML models that extract features from images for downstream text annotation.

A common challenge for users is accurately screening out PII in images or scanned PDFs. Here we lay out a multi-step pipeline that scans and extracts text from a set of uploaded common business documents.

## Methodology

The section 'Manual PII annotation of images' below shows the five representative documents we used for testing, each alongside a human-annotated version. The evaluation proceeds in three steps:

* Step 1: extract text from images
* Step 2: scan the extracted text for PII
* Step 3: compare the results to the human-annotated version as ground truth
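
Step 3 can be sketched as a set comparison between the entities a model predicts and the entities a human annotated. The function name `score_entities` and the example strings are hypothetical, and exact matching after lowercasing is an assumption; the notebook itself does not specify a matching rule.

```python
def score_entities(predicted, ground_truth):
    """Return precision, recall and F1 for two collections of entity strings."""
    # Normalise case and surrounding whitespace so trivial differences do not count as misses
    pred = {p.strip().lower() for p in predicted if p and p.strip()}
    truth = {t.strip().lower() for t in ground_truth if t and t.strip()}

    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = (2 * precision * recall / total) if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder values for illustration only, not results from this notebook
example = score_entities(
    predicted=["Acme Corp", "01/01/2000", "Springfield"],
    ground_truth=["acme corp", "01/01/2000", "Jane Doe", "Main Street"],
)
print(example)
```

Set-based scoring ignores duplicates and partial overlaps, so a span-level or fuzzy comparison may be more appropriate for noisy OCR output.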

# Setup

## Install packages


In [None]:
!pip install tensorflow==2.15.0
!pip install keras==3.3.3
!pip install transformers==4.40.1 pyspark
!pip install pandas==2.2.2 Pillow Requests==2.31.0 spacy==3.4.4

In [None]:
!pip install https://huggingface.co/beki/en_spacy_pii_fast/resolve/main/en_spacy_pii_fast-any-py3-none-any.whl
!pip install https://huggingface.co/beki/en_spacy_pii_distilbert/resolve/main/en_spacy_pii_distilbert-any-py3-none-any.whl

## Manual PII annotation of images

<table>
  <tr>
    <th></th>
    <th>Original</th>
    <th>Human Annotated</th>
  </tr>
  <tr>
    <td>1</td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/a4fea9c9-fb73-4d23-84df-6679f303bb92" width="100%"></td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/59a1f270-a4d9-4a57-8364-106e3574832d" width="100%"></td>
  </tr>
  <tr>
    <td>2</td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/06e8c369-6f06-4dba-b622-95101f44853b" width="100%"></td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/f5e96478-fe6e-4143-acca-b6428dd886b6" width="100%"></td>
  </tr>
  <tr>
    <td>3</td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/a7de556a-3844-406e-b31d-6216a70c6667" width="100%"></td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/a2aea9af-8f7a-4af8-bd9b-33e5063b16eb" width="100%"></td>
  </tr>
  <tr>
    <td>4</td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/122ed6fa-4b1c-443d-951e-6c1ed9f1dab3" width="100%"></td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/d2be8d10-ed27-41e0-84a8-59b289207441" width="100%"></td>
  </tr>
  <tr>
    <td>5</td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/4e5788ac-41dc-493d-ad7f-d96b245acdaa" width="100%"></td>
    <td><img src="https://github.com/orgs/DataFog/projects/1/assets/61345237/30afa295-5f1c-48cf-b8c5-f11dd8ae8c7b" width="100%"></td>
  </tr>
</table>

## Convert manual annotation to DataFrame format
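
One possible shape for this step, assuming the annotations are transcribed by hand into a dictionary keyed by image name. The names `manual_annotations` and `annotations_to_rows` are hypothetical, the label list mirrors `PII_ANNOTATION_LABELS` defined later in the notebook, and the annotation strings are placeholders rather than values read from the images.

```python
PII_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]

# Placeholder annotations: replace the strings with the values highlighted in the annotated images
manual_annotations = {
    "medical_invoice": {"PER": ["<patient name>"], "DATE_TIME": ["<invoice date>"]},
    "sales_receipt": {"ORG": ["<company name>"]},
}


def annotations_to_rows(annotations, labels=PII_LABELS):
    """Build one row per image with a list of entities per label, in label order."""
    rows = []
    for image_name, by_label in annotations.items():
        unknown = set(by_label) - set(labels)
        if unknown:
            raise ValueError(f"Unknown labels for {image_name}: {sorted(unknown)}")
        rows.append({
            "image_name": image_name,
            # Same list-of-lists layout that the PII annotator UDF returns
            "human_annotated": [list(by_label.get(label, [])) for label in labels],
        })
    return rows


rows = annotations_to_rows(manual_annotations)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` and then matched to the model output by `image_name`.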

# Analysis


## Init Spark and Image-to-Text model

In [None]:
import requests
from PIL import Image
from io import BytesIO
import warnings
from transformers import DonutProcessor, VisionEncoderDecoderModel
import re
import json

import spacy
import pandas as pd
from spacy.tokens import Span
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Suppress warnings
warnings.filterwarnings("ignore")

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("DataFogEval") \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()

# Initialize Donut model and processor
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Set device and evaluation mode
device = "cpu"
model.to(device)
model.eval()

# Image URL set
image_set = {
    'medical_invoice': 'https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png',
    'sales_receipt': 'https://templates.invoicehome.com/sales-receipt-template-us-classic-white-750px.png',
    'press_release': 'https://newsroom.cisco.com/c/dam/r/newsroom/en/us/assets/a/y2023/m09/cisco_splunk_1200x675_v3.png',
    'insurance_claim_scanned_form': 'https://www.pdffiller.com/preview/101/35/101035394.png',
    'scanned_internal_record': 'https://www.pdffiller.com/preview/435/972/435972694.png'
}

# Function to download an image and convert it to RGB
def download_image(url):
    # A timeout keeps the notebook from hanging on an unresponsive host
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        return Image.open(BytesIO(response.content)).convert('RGB')
    else:
        raise Exception(f"Failed to download image from {url} (HTTP {response.status_code})")

# Pre-download and convert all images
processed_images = {name: download_image(url) for name, url in image_set.items()}

# Function to parse images
def parse_image(image):
    task_prompt = "<s_cord-v2>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    pixel_values = processor(image, return_tensors="pt").pixel_values

    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()

    result = processor.token2json(sequence)
    return json.dumps(result)

# Create a DataFrame from pre-processed images
image_df_data = [(name, parse_image(img)) for name, img in processed_images.items()]
df = spark.createDataFrame(image_df_data, ["image_name", "parsed_data"])

# Display DataFrame
df.show(truncate=False)




Output (truncated): Spark table with columns `image_name` and `parsed_data`.

In [None]:
# Save the extracted text as JSON under 'donut_extracted_text.json' (Spark writes a directory of part files)
df.write.json("donut_extracted_text.json")
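
Because `parse_image` returns a JSON string, the PII models in the following sections see keys such as `nm` and the JSON punctuation along with the document text. One option, which is an assumption and not part of the original pipeline, is to flatten the JSON into plain text first. The helper name `flatten_parsed_data` and the sample document are hypothetical.

```python
import json


def flatten_parsed_data(parsed_json):
    """Collect every leaf value of a JSON document into one space-separated string."""
    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)
        elif isinstance(node, str):
            if node.strip():
                yield node.strip()
        elif node is not None:
            # Numbers and booleans are kept as their string form
            yield str(node)

    return " ".join(walk(json.loads(parsed_json)))


# Illustrative sample only, not actual model output
sample = json.dumps({"menu": [{"nm": "Item A", "price": "10.00"}], "total": {"total_price": "10.00"}})
flat = flatten_parsed_data(sample)
print(flat)
```

Flattening discards the field structure, so it suits entity scanning but not analyses that depend on which field a value came from.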

## Set up custom Spark functions to broadcast over the DataFrame

In [None]:
PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]
MAXIMAL_STRING_SIZE = 1000000

def pii_annotator(text: str, broadcasted_nlp) -> list[list[str]]:
    """Extract PII entities from text using the broadcast spaCy model.

    Returns:
        list[list[str]]: Values as arrays in order defined in the PII_ANNOTATION_LABELS.
    """
    if text:
        if len(text) > MAXIMAL_STRING_SIZE:
            # Cut the strings for required sizes
            text = text[:MAXIMAL_STRING_SIZE]
        nlp = broadcasted_nlp.value
        doc = nlp(text)

        # Pre-create dictionary with labels matching to expected extracted entities
        classified_entities: dict[str, list[str]] = {
            _label: [] for _label in PII_ANNOTATION_LABELS
        }
        for ent in doc.ents:
            # Add extracted entities; skip labels outside PII_ANNOTATION_LABELS to avoid a KeyError
            if ent.label_ in classified_entities:
                classified_entities[ent.label_].append(ent.text)

        return list(classified_entities.values())
    else:
        return [[] for _ in PII_ANNOTATION_LABELS]

def broadcast_pii_annotator_udf(spark_session: SparkSession, spacy_model: str = "en_spacy_pii_fast"):
    """Broadcast PII annotator across Spark cluster and create UDF"""
    broadcasted_nlp = spark_session.sparkContext.broadcast(
        spacy.load(spacy_model)
    )

    pii_annotation_udf = udf(
        lambda text: pii_annotator(text, broadcasted_nlp),
        ArrayType(ArrayType(StringType())),
    )
    return pii_annotation_udf
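
The UDF returns a list of lists whose positions follow `PII_ANNOTATION_LABELS`, so the labels are implicit in the output. A minimal sketch of re-attaching them after collecting a row, assuming a hypothetical helper `label_annotations` and placeholder values:

```python
PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]


def label_annotations(annotation_lists, labels=PII_ANNOTATION_LABELS):
    """Pair each positional entity list with its label name."""
    if len(annotation_lists) != len(labels):
        raise ValueError(
            f"Expected {len(labels)} entity lists, got {len(annotation_lists)}"
        )
    return dict(zip(labels, annotation_lists))


# Shape taken from the UDF return type; the values are placeholders
labelled = label_annotations([["<date>"], [], [], ["<organisation>"], []])
print(labelled["ORG"])
```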


## Run: spaCy models only

### en_spacy_pii_fast


In [None]:
extract_features_udf = broadcast_pii_annotator_udf(spark, spacy_model="en_spacy_pii_fast")

df = df.withColumn("en_spacy_pii_fast", extract_features_udf(df.parsed_data))
df.show(truncate=False)

Output (truncated): Spark table with columns `image_name`, `parsed_data`, and `en_spacy_pii_fast`.

### en_spacy_pii_distilbert

In [None]:
extract_features_udf = broadcast_pii_annotator_udf(spark, spacy_model="en_spacy_pii_distilbert")

df = df.withColumn("en_spacy_pii_distilbert", extract_features_udf(df.parsed_data))
df.show(truncate=False)

Output (truncated): Spark table with columns `image_name`, `parsed_data`, `en_spacy_pii_fast`, and `en_spacy_pii_distilbert`.

## Run: Transformer models only



Before running the transformer models, upgrade `transformers` and restart the session. This works around a cross-dependency conflict between keras, tensorflow, and transformers.

In [None]:
!pip list

In [None]:
!pip install --upgrade transformers
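
After the restart, it can help to confirm which versions are actually active instead of scanning the full `pip list` output. A small sketch using the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["tensorflow", "keras", "transformers"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```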

### lakshyakh93/deberta_finetuned_pii

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from transformers import pipeline

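# After a session restart, uncomment the lines below to reinitialize Spark
# (df must also be recreated, since a restart clears all variables)
# spark = SparkSession.builder \
#     .appName("DataFogEval") \
#     .config("spark.driver.memory", "8g") \
#     .config("spark.executor.memory", "8g") \
#     .getOrCreate()

# Define a class to handle the transformer model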
class PIIModel:
    def __init__(self):
        # Load the model
        self.model = pipeline('token-classification', model='lakshyakh93/deberta_finetuned_pii')

    def classify(self, text):
        # Make predictions
        if text:
            detections = self.model(text)
            return any(detection['score'] > 0.85 for detection in detections)
        return False

# Initialize the model outside UDF to avoid reinitialization costs
pii_model = PIIModel()

# Define a UDF that uses the model
def classify_udf(text):
    return pii_model.classify(text)

# Register the UDF
spark.udf.register("classify_udf", classify_udf, BooleanType())

# Create a user-defined function for Spark DataFrame transformations
classify_udf_spark = udf(classify_udf, BooleanType())

# Assuming df is already defined and has a column 'parsed_data'
df = df.withColumn("deberta_finetuned_pii", classify_udf_spark(df.parsed_data))
df.show(truncate=False)


Output (truncated): Spark table with columns `image_name`, `parsed_data`, `en_spacy_pii_fast`, `en_spacy_pii_distilbert`, and `deberta_finetuned_pii`.

In [None]:
# Save the results as JSON under 'pii_annotation_results.json'
df.write.json("pii_annotation_results.json")
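
The boolean column above records only whether any token crossed the 0.85 threshold. A sketch of keeping the triggering labels as well, assuming the usual token-classification result shape (a list of dictionaries with `entity` or `entity_group`, `score`, and `word` keys). The helper name, the label `EXAMPLE_NAME`, and the mock detections are hypothetical.

```python
def summarize_detections(detections, threshold=0.85):
    """Group the words of high-confidence detections by entity label."""
    summary = {}
    for det in detections:
        if det["score"] > threshold:
            # Aggregated pipelines report 'entity_group'; unaggregated ones report 'entity'
            label = det.get("entity_group") or det.get("entity", "UNKNOWN")
            summary.setdefault(label, []).append(det["word"])
    return summary


# Hand-written placeholder in the shape of a token-classification result
mock_detections = [
    {"entity": "EXAMPLE_NAME", "score": 0.97, "word": "<name>"},
    {"entity": "EXAMPLE_CITY", "score": 0.40, "word": "<city>"},
]
summary = summarize_detections(mock_detections)
has_pii = bool(summary)
print(summary, has_pii)
```

Keeping the labels makes it possible to compare this model with the spaCy models per entity type, not only per document.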

### JasperLS/deberta-v3-base-pii-identifier-v2

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from transformers import pipeline


# Define a class to handle the transformer model
class PIIModel:
    def __init__(self):
        # Load the model
        self.model = pipeline("text-classification", model="JasperLS/deberta-v3-base-pii-identifier-v2")

    def classify(self, text):
        # Make predictions
        # NOTE: 'score' is the confidence of the predicted label, whichever label that is
        if text:
            detections = self.model(text)
            return any(detection['score'] > 0.85 for detection in detections)
        return False

# Initialize the model outside UDF to avoid reinitialization costs
pii_model = PIIModel()

# Define a UDF that uses the model
def classify_udf(text):
    return pii_model.classify(text)

# Register the UDF
spark.udf.register("classify_udf", classify_udf, BooleanType())

# Create a user-defined function for Spark DataFrame transformations
classify_udf_spark = udf(classify_udf, BooleanType())

# Assuming df is already defined and has a column 'parsed_data'
df = df.withColumn("deberta-v3-base-pii", classify_udf_spark(df.parsed_data))
df.show(truncate=True)


+--------------------+--------------------+--------------------+-----------------------+---------------------+-------------------+
|          image_name|         parsed_data|   en_spacy_pii_fast|en_spacy_pii_distilbert|deberta_finetuned_pii|deberta-v3-base-pii|
+--------------------+--------------------+--------------------+-----------------------+---------------------+-------------------+
|     medical_invoice|{"menu": [{"nm": ...|[[11, 11, 11], [C...|   [[07/01/23], [Col...|                 true|              false|
|       sales_receipt|{"menu": [{"nm": ...|[[11/02/2019, 26/...|   [[11/02/2019, 231...|                 true|               true|
|       press_release|{"text_sequence":...|[[], [], [], [], []]|   [[], [], [], [], []]|                 true|               true|
|insurance_claim_s...|{"nm": "Sample 15...|[[], [], [], [Sam...|   [[], [], [], [Sam...|                false|              false|
|scanned_internal_...|[{"nm": "OHO Hist...|[[], [], [], [OHO...|   [[], [Local Goes

In [None]:
# Save the full results as JSON under 'pii_annotation_results_full.json'
df.write.json("pii_annotation_results_full.json")
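
For a text-classification model, a score threshold alone does not say which class was predicted: a confident "no PII" prediction would also exceed 0.85. A label-aware check could look like the sketch below. The helper name `is_pii` and the label strings are placeholders; the actual label names would need to be read from the model, for example from `pii_model.model.model.config.id2label`.

```python
def is_pii(predictions, positive_labels, threshold=0.85):
    """Return True only if a PII-positive label was predicted above the threshold."""
    return any(
        pred["label"] in positive_labels and pred["score"] > threshold
        for pred in predictions
    )


# Placeholder label names; replace with the labels the model actually reports
POSITIVE_LABELS = {"PII_PLACEHOLDER"}

confident_negative = [{"label": "NO_PII_PLACEHOLDER", "score": 0.99}]
confident_positive = [{"label": "PII_PLACEHOLDER", "score": 0.90}]

print(is_pii(confident_negative, POSITIVE_LABELS))
print(is_pii(confident_positive, POSITIVE_LABELS))
```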