# üåê Comparative Analysis: Unstructured vs SparkNLP (Reader2Doc) for HTML Ingestion

This notebook demonstrates how to extract structured information from **HTML documents** using two powerful tools:

- **Unstructured**: Extracts elements (text, titles, tables, images) preserving document structure.
- **SparkNLP (Reader2Doc)**: Reads HTML files into Spark NLP pipelines as text documents for large-scale NLP tasks.

We'll analyze their differences in handling structured HTML content containing **headings**, **paragraphs**, **tables**, and **images**.

In [1]:
!pip install unstructured[local-inference]

Collecting unstructured[local-inference]
  Downloading unstructured-0.18.21-py3-none-any.whl.metadata (25 kB)
Collecting filetype (from unstructured[local-inference])
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured[local-inference])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured[local-inference])
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Collecting dataclasses-json (from unstructured[local-inference])
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting python-iso639 (from unstructured[local-inference])
  Downloading python_iso639-2025.11.16-py3-none-any.whl.metadata (15 kB)
Collecting langdetect (from unstructured[local-inference])
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [

## Generating Documents

In [2]:
!mkdir html_docs
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-456123.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-654987.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-789321.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-789351.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-789456.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-112233.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-334455.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-445566.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-556677.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-778899.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2025-12-000001.html
!wget -P html_docs !wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2025-12-000002.html

--2025-12-26 19:41:19--  http://!wget/
Resolving !wget (!wget)... failed: Name or service not known.
wget: unable to resolve host address ‚Äò!wget‚Äô
--2025-12-26 19:41:19--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/html/EHR-2024-11-456123.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20124 (20K) [text/plain]
Saving to: ‚Äòhtml_docs/EHR-2024-11-456123.html‚Äô


2025-12-26 19:41:19 (4.96 MB/s) - ‚Äòhtml_docs/EHR-2024-11-456123.html‚Äô saved [20124/20124]

FINISHED --2025-12-26 19:41:19--
Total wall clock time: 0.2s
Downloaded: 1 files, 20K in 0.004s (4.96 MB/s)
--2025-12-26 19:41:19--  http://!wget/
Resolving !wget (

In [3]:
!ls -lh html_docs

total 148K
-rw-r--r-- 1 root root 20K Dec 26 19:41 EHR-2024-11-456123.html
-rw-r--r-- 1 root root 21K Dec 26 19:41 EHR-2024-11-654987.html
-rw-r--r-- 1 root root 21K Dec 26 19:41 EHR-2024-11-789321.html
-rw-r--r-- 1 root root 23K Dec 26 19:41 EHR-2024-11-789351.html
-rw-r--r-- 1 root root 19K Dec 26 19:41 EHR-2024-11-789456.html
-rw-r--r-- 1 root root 16K Dec 26 19:41 EHR-2025-12-000001.html
-rw-r--r-- 1 root root 17K Dec 26 19:41 EHR-2025-12-000002.html


## 1Ô∏è‚É£ Unstructured: HTML Partition and Cleaning

Unstructured can parse HTML files and extract text, links, tables, and images into structured **elements**.

In [9]:
from unstructured.partition.html import partition_html
from unstructured.cleaners.core import (
    clean_extra_whitespace,
    replace_unicode_quotes,
    clean_non_ascii_chars,
    clean_bullets
)
import json
import os
import re

def remove_colons(text: str) -> str:
    return re.sub(r":", "", text)

def clean_element_text(text: str) -> str:
    """Clean up text extracted from HTML using Unstructured."""
    text = clean_extra_whitespace(text)
    text = replace_unicode_quotes(text)
    text = clean_non_ascii_chars(text)
    text = clean_bullets(text)
    text = remove_colons(text)  # only remove ":" characters
    return text.strip()

def ingest_and_clean_unstructured_html(html_path: str):
    """Extract and clean HTML text elements (Title, Table, etc.) returning simplified JSON."""
    elements = partition_html(filename=html_path)
    cleaned_output = []

    for el in elements:
        if hasattr(el, 'text') and el.text:
            cleaned_text = clean_element_text(el.text)
            cleaned_output.append({
                "filename": os.path.basename(html_path),
                "type": el.category if hasattr(el, 'category') else el.__class__.__name__,
                "text": cleaned_text
            })
    return cleaned_output

In [10]:
import os
import json

def process_html_directory(directory_path: str, output_json: str = "clean_output_unstructured.json"):
    """Process all HTML files in a directory and combine their extracted data."""
    all_results = []

    # Loop through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith(".html"):
            file_path = os.path.join(directory_path, filename)
            print(f"üîç Processing: {file_path}")

            try:
                output_html = ingest_and_clean_unstructured_html(file_path)
                all_results.extend(output_html)
            except Exception as e:
                print(f"‚ö†Ô∏è Error processing {filename}: {e}")

    # Print the clean JSON output to console
    print(json.dumps(all_results, indent=2, ensure_ascii=False))

    # Save all combined results to one JSON file
    with open(output_json, "w", encoding="utf-8") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)

    print(f"‚úÖ Finished! Output saved to: {output_json}")

In [11]:
directory = "html_docs"
process_html_directory(directory)

üîç Processing: html_docs/EHR-2024-11-456123.html
üîç Processing: html_docs/EHR-2024-11-789321.html
üîç Processing: html_docs/EHR-2024-11-654987.html
üîç Processing: html_docs/EHR-2025-12-000001.html
üîç Processing: html_docs/EHR-2025-12-000002.html
üîç Processing: html_docs/EHR-2024-11-789456.html
üîç Processing: html_docs/EHR-2024-11-789351.html
[
  {
    "filename": "EHR-2024-11-456123.html",
    "type": "Title",
    "text": "PATIENT MEDICAL SUMMARY"
  },
  {
    "filename": "EHR-2024-11-456123.html",
    "type": "UncategorizedText",
    "text": "Metropolitan Cancer Center | Electronic Health Record"
  },
  {
    "filename": "EHR-2024-11-456123.html",
    "type": "UncategorizedText",
    "text": "Generated November 15, 2024 | Report ID EHR-2024-11-456123"
  },
  {
    "filename": "EHR-2024-11-456123.html",
    "type": "Title",
    "text": "Patient Name"
  },
  {
    "filename": "EHR-2024-11-456123.html",
    "type": "UncategorizedText",
    "text": "DAVIDSON, JAMES"
  },
  {


## 2Ô∏è‚É£ SparkNLP: Reader2Doc and DocumentNormalizer for HTML

SparkNLP can read and normalize HTML text using the **Reader2Doc** component followed by **DocumentNormalizer**.

In [12]:
!pip install -q --force-reinstall pyspark==3.5.7

[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m317.4/317.4 MB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m200.5/200.5 kB[0m [31m16.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dataproc-spark-connect 1.0.1 requires pyspark[connect]~=4.0.0, but you have pyspark 3.5.7 which is incompatible.[0m[31m
[0m

In [14]:
 !pip install spark-nlp

Collecting spark-nlp
  Downloading spark_nlp-6.3.0-py2.py3-none-any.whl.metadata (19 kB)
Downloading spark_nlp-6.3.0-py2.py3-none-any.whl (744 kB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m745.0/745.0 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: spark-nlp
Successfully installed spark-nlp-6.3.0


In [15]:
import sparknlp
from sparknlp.reader.reader2doc import Reader2Doc
from sparknlp.annotator import DocumentNormalizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.7


In [17]:
print("Spark version:", spark.version)
print("Scala Version:", spark.sparkContext._jvm.scala.util.Properties.versionString())
print("Java Version:", spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))

Spark version: 3.5.7
Scala Version: version 2.12.18
Java Version: 17.0.17


In [18]:
from sparknlp.annotator import SentenceDetectorDLModel
from pyspark.sql.functions import explode

empty_df = spark.createDataFrame([], 'string').toDF('text')

reader2doc = Reader2Doc() \
    .setContentType('text/html') \
    .setContentPath(directory) \
    .setOutputCol('document')

normalizer = DocumentNormalizer() \
    .setInputCols(['document']) \
    .setOutputCol('normalized') \
    .setAutoMode("HTML_CLEAN") \
    .setPatterns([(":")])

sentence_detector = SentenceDetectorDLModel() \
    .pretrained() \
    .setInputCols(['normalized']) \
    .setOutputCol('sentences') \
    .setExplodeSentences(True)

pipeline = Pipeline(stages=[reader2doc, normalizer, sentence_detector])
model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.show(truncate=False, n=5)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence                                                                                                                                                                                                           

In [19]:
exploded_df.show(truncate=False, n=50)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [20]:
from pyspark.sql.functions import explode, col
import json

flat_df = (
    result_df
    .withColumn("sentence", explode("sentences"))  # explode the correct column
    .select(
        col("filename"),
        col("sentence.result").alias("result")      # extract the text part
    )
)


# --- Collect to driver (if small enough) ---
data = flat_df.toJSON().map(json.loads).collect()

# --- Pretty print or save ---
print(json.dumps(data, indent=2, ensure_ascii=False))

# Optional: save as a JSON file
with open("clean_output_sparknlp.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

print("‚úÖ Clean JSON saved to clean_output.json")


[
  {
    "filename": "EHR-2024-11-456123.html",
    "result": "PATIENT MEDICAL SUMMARY Metropolitan Cancer Center | Electronic Health Record Generated November 15, 2024 | Report ID EHR-2024-11-456123 Patient Name DAVIDSON, JAMES Medical Record Number MRN-2024-456123 Date of Birth 04/10/1962 (62 years) Sex Male Primary Physician Dr. Michael Chen, MD Last Visit 11/15/2024 ?"
  },
  {
    "filename": "EHR-2024-11-456123.html",
    "result": "Active & Chronic Diagnoses Non-Small Cell Lung Carcinoma (Active) Histology Adenocarcinoma Location Left upper lobe, apicoposterior segment Stage T2bN0M0 (Stage II) Tumor Size 3.1 cm Diagnosed March 2024 Status Treatment planning - Surgery + adjuvant chemotherapy Chronic Obstructive Pulmonary Disease (Chronic) Diagnosed 2018 Status Mild-moderate, managed with bronchodilators FEV1 2.10 L (65% predicted) Former Tobacco Use Disorder (Resolved) Pack-years 15 pack-years Quit Date 2006 (18 years smoke-free) Status Former smoker, excellent cessation Essenti

In [21]:
import json

with open("clean_output_sparknlp.json", "r", encoding="utf-8") as f:
    clean_output_sparknlp = json.load(f)

clean_output_sparknlp

[{'filename': 'EHR-2024-11-456123.html',
  'result': 'PATIENT MEDICAL SUMMARY Metropolitan Cancer Center | Electronic Health Record Generated November 15, 2024 | Report ID EHR-2024-11-456123 Patient Name DAVIDSON, JAMES Medical Record Number MRN-2024-456123 Date of Birth 04/10/1962 (62 years) Sex Male Primary Physician Dr. Michael Chen, MD Last Visit 11/15/2024 ?'},
 {'filename': 'EHR-2024-11-456123.html',
  'result': 'Active & Chronic Diagnoses Non-Small Cell Lung Carcinoma (Active) Histology Adenocarcinoma Location Left upper lobe, apicoposterior segment Stage T2bN0M0 (Stage II) Tumor Size 3.1 cm Diagnosed March 2024 Status Treatment planning - Surgery + adjuvant chemotherapy Chronic Obstructive Pulmonary Disease (Chronic) Diagnosed 2018 Status Mild-moderate, managed with bronchodilators FEV1 2.10 L (65% predicted) Former Tobacco Use Disorder (Resolved) Pack-years 15 pack-years Quit Date 2006 (18 years smoke-free) Status Former smoker, excellent cessation Essential Hypertension (Chro

## Tokens Comparison Unstructured vs SparkNLP

In [22]:
from collections import defaultdict

with open("clean_output_unstructured.json", "r", encoding="utf-8") as f:
    clean_output_unstructured = json.load(f)

def flatten_unstructured_output(data):
    grouped = defaultdict(list)
    for entry in data:
        grouped[entry["filename"]].append(entry["text"])

    flattened = []
    for filename, texts in grouped.items():
        joined_text = " ".join(texts)
        flattened.append({
            "filename": filename,
            "result": joined_text
        })
    return flattened

flaten_output_unstructured = flatten_unstructured_output(clean_output_unstructured)

In [23]:
flaten_output_unstructured

[{'filename': 'EHR-2024-11-456123.html',
  'result': 'PATIENT MEDICAL SUMMARY Metropolitan Cancer Center | Electronic Health Record Generated November 15, 2024 | Report ID EHR-2024-11-456123 Patient Name DAVIDSON, JAMES Medical Record Number MRN-2024-456123 Date of Birth 04/10/1962 (62 years) Sex Male Primary Physician Dr. Michael Chen, MD Last Visit 11/15/2024 Active & Chronic Diagnoses Histology Adenocarcinoma Location Left upper lobe, apicoposterior segment Stage T2bN0M0 (Stage II) Tumor Size 3.1 cm Diagnosed March 2024 Status Treatment planning - Surgery + adjuvant chemotherapy Diagnosed 2018 Status Mild-moderate, managed with bronchodilators FEV1 2.10 L (65% predicted) Pack-years 15 pack-years Quit Date 2006 (18 years smoke-free) Status Former smoker, excellent cessation Diagnosed 2015 Status Well-controlled on medication Molecular Testing Pending EGFR, ALK, ROS1, KRAS, and PD-L1 testing in progress. Results will guide adjuvant therapy decisions and inform future treatment options

In [24]:
def tokenize(text):
    return text.lower().split()

In [25]:
# Create dictionaries: filename ‚Üí tokens
tokens_unstructured_dict = {
    item["filename"]: tokenize(item["result"])
    for item in flaten_output_unstructured
}

In [26]:
tokens_unstructured_dict

{'EHR-2024-11-456123.html': ['patient',
  'medical',
  'summary',
  'metropolitan',
  'cancer',
  'center',
  '|',
  'electronic',
  'health',
  'record',
  'generated',
  'november',
  '15,',
  '2024',
  '|',
  'report',
  'id',
  'ehr-2024-11-456123',
  'patient',
  'name',
  'davidson,',
  'james',
  'medical',
  'record',
  'number',
  'mrn-2024-456123',
  'date',
  'of',
  'birth',
  '04/10/1962',
  '(62',
  'years)',
  'sex',
  'male',
  'primary',
  'physician',
  'dr.',
  'michael',
  'chen,',
  'md',
  'last',
  'visit',
  '11/15/2024',
  'active',
  '&',
  'chronic',
  'diagnoses',
  'histology',
  'adenocarcinoma',
  'location',
  'left',
  'upper',
  'lobe,',
  'apicoposterior',
  'segment',
  'stage',
  't2bn0m0',
  '(stage',
  'ii)',
  'tumor',
  'size',
  '3.1',
  'cm',
  'diagnosed',
  'march',
  '2024',
  'status',
  'treatment',
  'planning',
  '-',
  'surgery',
  '+',
  'adjuvant',
  'chemotherapy',
  'diagnosed',
  '2018',
  'status',
  'mild-moderate,',
  'managed'

In [27]:
tokens_sparknlp_dict = {
    item["filename"]: tokenize(item["result"])
    for item in clean_output_sparknlp
}

In [28]:
# --- Loop and append tokens ---
for item in clean_output_sparknlp:
    filename = item["filename"]
    text = item["result"]
    tokens = tokenize(text)
    tokens_sparknlp_dict[filename].extend(tokens)  # ‚úÖ append instead of overwrite

# --- Convert back to a normal dict ---
tokens_sparknlp_dict = dict(tokens_sparknlp_dict)

In [29]:
tokens_sparknlp_dict

{'EHR-2024-11-456123.html': ['follow-up',
  'plan',
  'post-surgical',
  'surveillance',
  'chest',
  'ct',
  'every',
  '6',
  'months',
  'for',
  'first',
  '2',
  'years',
  'then',
  'chest',
  'ct',
  'annually',
  'for',
  'years',
  '3-5',
  'cea',
  'monitoring',
  'every',
  '3-6',
  'months',
  'monitor',
  'for',
  'symptoms',
  'of',
  'recurrence',
  '(cough,',
  'dyspnea,',
  'hemoptysis,',
  'weight',
  'loss)',
  'pulmonary',
  'function',
  'testing',
  'annually',
  'smoking',
  'cessation',
  'support',
  'and',
  'counseling',
  'next',
  'scheduled',
  'appointments',
  'thoracic',
  'surgery',
  '-',
  'april',
  '8,',
  '2024',
  'provider',
  'dr.',
  'robert',
  'williams,',
  'md',
  'purpose',
  'left',
  'upper',
  'lobectomy',
  '(vats)',
  'medical',
  'oncology',
  '-',
  'may',
  '2024',
  'provider',
  'dr.',
  'michael',
  'chen,',
  'md',
  'purpose',
  'initiate',
  'adjuvant',
  'chemotherapy',
  '(cisplatin',
  '+',
  'pemetrexed)',
  'pulmonology

In [33]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/benchmark/ehr_tokens_ground_truth.json

--2025-12-26 20:32:52--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/SPARKNLP-1320-Benchmarkunstructured-vs-SparkNLP-in-unstructured-document-processing/open-source-nlp/data/readers/benchmark/ehr_tokens_ground_truth.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 152654 (149K) [text/plain]
Saving to: ‚Äòehr_tokens_ground_truth.json‚Äô


2025-12-26 20:32:52 (2.73 MB/s) - ‚Äòehr_tokens_ground_truth.json‚Äô saved [152654/152654]



In [34]:
with open("ehr_tokens_ground_truth.json", "r", encoding="utf-8") as f:
    tokens_groundtruth_dict = json.load(f)

In [35]:
import json
import os
import re
import unicodedata

def normalize_token(t):
    """Normalize tokens for reliable comparison."""
    t = unicodedata.normalize("NFKC", t)  # normalize unicode
    t = re.sub(r"\s+", " ", t.strip())    # collapse whitespace
    return t.lower()

def compare_sequence(predicted_tokens, groundtruth_tokens, output_dir=".", prefix="comparison"):
    """
    Compare two token lists ignoring case, order, and Unicode issues.
    Properly surfaces missing/extra tokens even when duplicates exist.
    """

    # --- Normalize ---
    pred_norm = [normalize_token(t) for t in predicted_tokens]
    gt_norm = [normalize_token(t) for t in groundtruth_tokens]

    # --- Convert to sets for difference detection ---
    pred_set, gt_set = set(pred_norm), set(gt_norm)

    missing_in_pred = sorted(list(gt_set - pred_set))  # in GT but not in pred
    extra_in_pred = sorted(list(pred_set - gt_set))    # in pred but not in GT

    # --- Compute metrics ---
    intersection = len(pred_set & gt_set)
    union = len(pred_set | gt_set)
    jaccard = round(intersection / union, 4) if union > 0 else 1.0
    match = pred_set == gt_set

    # --- Build results ---
    result_data = {
        "summary": {
            "match": match,
            "pred_count": len(pred_norm),
            "gt_count": len(gt_norm),
            "jaccard_similarity": jaccard,
            "missing_count": len(missing_in_pred),
            "extra_count": len(extra_in_pred),
        },
        "details": {
            "missing_in_pred": missing_in_pred,
            "extra_in_pred": extra_in_pred,
        },
    }

    # --- Save results ---
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, f"{prefix}_sequence_comparison.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(result_data, f, indent=2, ensure_ascii=False)

    print(f"‚úÖ Comparison done ‚Äî Match: {match}, Jaccard: {jaccard:.4f}")
    print(f"üî∏ Missing tokens: {len(missing_in_pred)}")
    print(f"üî∏ Extra tokens:   {len(extra_in_pred)}")
    print(f"üìÅ Saved to: {output_path}")

    return result_data

In [36]:
def match_groundtruth(html_filename, groundtruth_dict):
    """
    Find the corresponding ground truth entry for a given HTML filename.
    Matches by EHR ID (e.g., EHR-2024-11-654987).
    """
    match = re.search(r"(EHR-\d{4}-\d{2}-\d+)", html_filename)
    if not match:
        return None
    ehr_id = match.group(1)
    for gt_name in groundtruth_dict.keys():
        if ehr_id in gt_name:
            return gt_name
    return None

In [37]:
import os, re, json

# Master results dictionary
all_results = {}

# Main loop through all HTMLs
for html_name in tokens_unstructured_dict.keys():
    gt_name = match_groundtruth(html_name, tokens_groundtruth_dict)

    if gt_name is None:
        print(f"‚ö†Ô∏è No ground truth found for {html_name}")
        continue

    print(f"üîç Comparing {html_name} ‚Üî {gt_name}")
    all_results[html_name] = {}

    # Compare SparkNLP vs Ground Truth
    if html_name in tokens_sparknlp_dict:
        result_spark = compare_sequence(
            tokens_sparknlp_dict[html_name],
            tokens_groundtruth_dict[gt_name],
            output_dir="outputs/tmp",  # disable per-file output
            prefix=""
        )
        all_results[html_name]["sparknlp"] = result_spark

    # Compare Unstructured vs Ground Truth
    if html_name in tokens_unstructured_dict:
        result_unstructured = compare_sequence(
            tokens_unstructured_dict[html_name],
            tokens_groundtruth_dict[gt_name],
            output_dir="outputs/tmp",
            prefix=""
        )
        all_results[html_name]["unstructured"] = result_unstructured


# Save single consolidated JSON
output_path = "outputs/all_comparisons.json"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(all_results, f, indent=2, ensure_ascii=False)

print(f"‚úÖ All comparison results saved to {output_path}")

üîç Comparing EHR-2024-11-456123.html ‚Üî EHR-2024-11-456123_text_only.txt
‚úÖ Comparison done ‚Äî Match: False, Jaccard: 0.9713
üî∏ Missing tokens: 7
üî∏ Extra tokens:   7
üìÅ Saved to: outputs/tmp/_sequence_comparison.json
‚úÖ Comparison done ‚Äî Match: False, Jaccard: 0.9507
üî∏ Missing tokens: 17
üî∏ Extra tokens:   7
üìÅ Saved to: outputs/tmp/_sequence_comparison.json
üîç Comparing EHR-2024-11-789321.html ‚Üî EHR-2024-11-789321_text_only.txt
‚úÖ Comparison done ‚Äî Match: False, Jaccard: 0.9784
üî∏ Missing tokens: 5
üî∏ Extra tokens:   6
üìÅ Saved to: outputs/tmp/_sequence_comparison.json
‚úÖ Comparison done ‚Äî Match: False, Jaccard: 0.9686
üî∏ Missing tokens: 10
üî∏ Extra tokens:   6
üìÅ Saved to: outputs/tmp/_sequence_comparison.json
üîç Comparing EHR-2024-11-654987.html ‚Üî EHR-2024-11-654987_text_only.txt
‚úÖ Comparison done ‚Äî Match: False, Jaccard: 0.9711
üî∏ Missing tokens: 4
üî∏ Extra tokens:   9
üìÅ Saved to: outputs/tmp/_sequence_comparison.json
‚úÖ C