![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.13.End2End_Preannotation_and_Training_Pipeline.ipynb)

# **End2End Preannotation and Training Pipeline**

## Spark Setup

In [2]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

Saving spark_nlp_for_healthcare_spark_ocr_10320.json to spark_nlp_for_healthcare_spark_ocr_10320.json
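
The cells that follow read `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` from the license file, so it can help to fail early when one of them is missing. The following is a minimal sketch of such a check. The helper name `missing_license_keys` and the sample values are hypothetical, and a real license file may contain additional keys.

```python
import json

# Keys that later cells in this notebook read from the license file.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_keys.get(key)]

# Hypothetical license content, for illustration only.
sample = json.loads('{"SECRET": "xxx", "PUBLIC_VERSION": "6.1.3"}')
print(missing_license_keys(sample))  # ['JSL_VERSION']
```

An empty list means every key the setup cells depend on is present.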


In [3]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.1  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display


In [4]:
import os
import json
import numpy as np
import pandas as pd

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

params = {"spark.driver.memory":"48G", # Amount of memory to use for the driver process, i.e. where SparkContext is initialized
          "spark.kryoserializer.buffer.max":"2000M", # Maximum allowable size of Kryo serialization buffer, in MiB unless otherwise specified.
          "spark.driver.maxResultSize":"2000M"} # Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes.
                                                # Should be at least 1M, or 0 for unlimited.

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)
spark.sparkContext.setLogLevel("ERROR")
print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


## Loading the Pretrained Pipeline

This section loads the pretrained pipeline `clinical_deidentification_docwise_benchmark`, which masks and obfuscates sensitive information in medical text, such as names, ID numbers, contact information, locations, ages, and dates. Its stages are then listed so that later sections can refer to them by index.

In [5]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models")

clinical_deidentification_docwise_benchmark download started this may take some time.
Approx size to download 2.3 GB
[OK!]


In [6]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### Sample text

In [7]:
text = """
(NOTE) Patient Name: John Lee. MR#: 7789201 Location: LERE Date Reported: 2025-05-12 16:30
Specimen #RD23-4897 Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A
Electronically Signed Out By Dr. Smith, Dr. Carter, CT(ASCP) Date Reported: 2025-05-12 16:30
General Hospital Dr. Fan Gabriel 90210 CPT Code(s) A: 88305

General Hospital in New York City Dr. Williams, NYC, NY
(212) 555-7890 Patient Name: John Lee Accession #: GH-556672
Patient ID #: 7789201 Collected: 2025-05-10 Address:
123 Main Street, FALL RIVER
NIAGARA FALLS, NY 14304
Received: 2025-05-10 Reported: 2025-05-12
Soc. Sec. #: XXX-XX-1234 DOB/Age/Sex: 1973 (Age: 52) M
Physician(s): Dr. Jameson. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.
The following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.
· Chromosome analysis cytogenetics. (ADDENDUM REPORT TO FOLLOW.)
· Leukemic immunophenotyping flow cytometry.

...., and there is no evidence of dysplasia.
Fr/ap MATERIAL RECEIVED 6 SLIDES LABELED 032-1902, COLLECTED 2025-05-10
SPECIMEN SOURCE: GASTRIC, ILEUM AND RANDOM COLON, BIOPSIES
REFERRING FACILITY: NY
"""

## Extending the Pipeline with New Stages

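The pretrained pipeline is extended with two custom rule-based parsers (CPT codes and specimen IDs) and an `IOBTagger`. The `DocumentAssembler`, `InternalDocumentSplitter`, and `Tokenizer` defined below produce the `splitter` and `token` columns that the custom parsers consume when they are fitted on their own.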

In [8]:
document_assembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

splitter = (
            InternalDocumentSplitter()
            .setInputCols("document")
            .setOutputCol("splitter")
            .setSplitMode("recursive")
            .setSplitPatterns([r"\s+"])  # Split on whitespace (token-based)
            .setPatternsAreRegex(True)
            .setChunkSize(512)    # Maximum chunk length: 512 characters
            .setChunkOverlap(50)
            .setEnableSentenceIncrement(True)  # Like sentenceDetector
)

tokenizer = (
    Tokenizer()
    .setInputCols("splitter")
    .setOutputCol("token")
)
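
The splitter above is configured with `setChunkSize(512)` and `setChunkOverlap(50)`. As a rough mental model only, the sketch below packs whitespace-separated words into chunks of at most `chunk_size` characters and repeats up to `overlap` trailing characters at the start of the next chunk. It is not the `InternalDocumentSplitter` algorithm: recursive mode and sentence increments are ignored, and the function name `split_with_overlap` is hypothetical.

```python
import re

def split_with_overlap(text, chunk_size=512, overlap=50):
    """Greedy whitespace chunking with overlap (a simplification, not the library's algorithm)."""
    words = re.split(r"\s+", text.strip())
    chunks, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing words whose joined length is at most `overlap`.
            carried = []
            while current and len(" ".join([current[-1]] + carried)) <= overlap:
                carried.insert(0, current.pop())
            current = carried
        # A single word longer than chunk_size is kept whole rather than cut.
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Small demo with 4-character words so the overlap is easy to see.
demo_text = " ".join(f"w{i:03d}" for i in range(300))
demo_chunks = split_with_overlap(demo_text, chunk_size=100, overlap=20)
print(demo_chunks[0][-19:])  # w016 w017 w018 w019
print(demo_chunks[1][:19])   # w016 w017 w018 w019
```

The overlap means that an entity sitting near a chunk boundary is likely to appear whole in at least one of the two neighbouring chunks.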

### Create a Custom `CPT Code` Parser

Using `ContextualParserApproach`, a new parser is created to detect CPT (Current Procedural Terminology) codes within the text based on regex rules. This allows the pipeline to recognize a custom entity type not found in the standard de-identification pipeline.

In [9]:
cpt_rule = {
    "entity": "CPT_CODE",
    "ruleScope": "sentence",
    "regex": r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)",
    "matchScope": "token"
}

with open('cpt.json', 'w') as f:
    json.dump(cpt_rule, f)

cpt_parser = ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_cpt") \
    .setJsonPath("cpt.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

cpt_parser_pipeline = Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    cpt_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

cpt_parser_model = cpt_parser_pipeline.fit(empty_data)
cpt_parser_model.stages[-1].write().overwrite().save("./parsers/cpt_parser")

cpt_parser = ContextualParserModel.load("parsers/cpt_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_cpt")

In [10]:
annotations = LightPipeline(cpt_parser_model).annotate(text)

annotations["entity_cpt"]

['88305']
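
Before fitting a parser, it can be useful to try the rule's regex on a few strings with Python's `re` module. This is only a sketch: the parser runs on the JVM, so its regex engine and its `matchScope` handling may behave differently from `re`, and the sample strings other than the first are made up for illustration.

```python
import re

# Same pattern as in cpt_rule["regex"]; group 1 captures the five-digit code.
CPT_PATTERN = re.compile(r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)")

samples = [
    "Dr. Fan Gabriel 90210 CPT Code(s) A: 88305",  # taken from the sample text
    "CPT: 88112",                                  # hypothetical
    "CPT#88342",                                   # hypothetical
    "ZIP 90210",                                   # hypothetical, should not match
    "881234",                                      # hypothetical, six digits
]

cpt_matches = {s: CPT_PATTERN.findall(s) for s in samples}
for sample, found in cpt_matches.items():
    print(f"{sample!r} -> {found}")
```

Only codes that start with `88` and have exactly five digits are captured, so the ZIP-like `90210` in the sample text is ignored by this rule.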

### Create a Custom `Specimen ID` Parser

Similarly, a second `ContextualParserApproach` extracts specimen IDs from medical text and labels them `IDNUM`.

In [11]:
with open('specimen.json', 'w') as f:
    json.dump({
        "entity": "IDNUM",
        "ruleScope": "sentence",
        "regex": r"(?:Specimen(?:\s*(?:ID|Number|Code|#|No\.?)?:?)?\s*)?#?[A-Z]{1,5}[0-9]{2,4}-?[0-9]{3,6}",
        "contextLength": 25,
        "matchScope": "token"
    }, f)

specimen_parser = ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_specimen") \
    .setJsonPath("specimen.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

specimen_parser_pipeline = Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    specimen_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

specimen_parser_model = specimen_parser_pipeline.fit(empty_data)
specimen_parser_model.stages[-1].write().overwrite().save("./parsers/specimen_parser")

specimen_parser = ContextualParserModel.load("./parsers/specimen_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_specimen")

In [12]:
annotations = LightPipeline(specimen_parser_model).annotate(text)

annotations["entity_specimen"]

['#RD23-4897']
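
The specimen pattern can be checked the same way with Python's `re` module. This is a sketch under assumptions: `re` is case-sensitive here, whereas the parser is configured with `setCaseSensitive(False)`, and the raw regex match includes the optional `Specimen` prefix, whereas the parser output above shows only `#RD23-4897`, presumably because of `"matchScope": "token"`. The last sample string is hypothetical.

```python
import re

# Same pattern as written to specimen.json.
SPECIMEN_PATTERN = re.compile(
    r"(?:Specimen(?:\s*(?:ID|Number|Code|#|No\.?)?:?)?\s*)?#?[A-Z]{1,5}[0-9]{2,4}-?[0-9]{3,6}"
)

def first_match(text):
    """Return the first full regex match in text, or None."""
    match = SPECIMEN_PATTERN.search(text)
    return match.group(0) if match else None

print(first_match("Specimen #RD23-4897 Clinical History"))  # Specimen #RD23-4897
print(first_match("Accession #: GH-556672"))                # None
print(first_match("6 SLIDES LABELED 032-1902"))             # None
print(first_match("Specimen No. AB12-345678"))              # Specimen No. AB12-345678
```

The accession number `GH-556672` is not matched because the pattern requires digits directly after the leading letters.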

### **IOBTagger**

The `IOBTagger` converts the merged `ner_chunk` annotations into token-level labels in the IOB (Inside, Outside, Beginning) format. `MedicalNerApproach` later reads these labels from the `ner_label` column as its training targets.

In [13]:
iobTagger = sparknlp_jsl.annotator.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")
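
To make the IOB format concrete, here is a small pure-Python sketch of the conversion from character-offset chunks to token tags. It assumes inclusive `begin`/`end` offsets and chunk boundaries that line up with token boundaries. The function name `to_iob` and the example data are hypothetical, and this is not how `IOBTagger` is implemented internally.

```python
def to_iob(tokens, chunks):
    """tokens: list of (text, begin, end); chunks: list of (begin, end, label).

    Offsets are inclusive character positions. Returns one tag per token.
    """
    tags = []
    for _, tok_begin, tok_end in tokens:
        tag = "O"
        for chunk_begin, chunk_end, label in chunks:
            if tok_begin >= chunk_begin and tok_end <= chunk_end:
                # The first token of a chunk gets B-, the remaining tokens get I-.
                prefix = "B" if tok_begin == chunk_begin else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append(tag)
    return tags

# "Patient Name: John Lee" with one NAME chunk covering "John Lee".
tokens = [("Patient", 0, 6), ("Name", 8, 11), (":", 12, 12), ("John", 14, 17), ("Lee", 19, 21)]
chunks = [(14, 21, "NAME")]
iob_tags = to_iob(tokens, chunks)
print(iob_tags)  # ['O', 'O', 'O', 'B-NAME', 'I-NAME']
```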

### **Update the Chunk Merging Strategy**

The inputs of the `ChunkMergeModel`, which is responsible for merging entities from different sources, are updated to include the entities generated by the newly created `cpt_parser` and `specimen_parser`. This ensures that all entities found by both the pretrained models and the custom parsers are consolidated.

In [14]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()
merger_input_cols

['entity_icd10',
 'entity_email',
 'entity_ip_address',
 'entity_age',
 'entity_medicalrecord',
 'entity_ssn',
 'entity_account',
 'entity_vin',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_country',
 'entity_state',
 'entity_zip',
 'entity_plate',
 'entity_dln',
 'entity_license']

In [15]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()

chunk_merge_rulebase = deid_pipeline.model.stages[35]\
      .setInputCols(["entity_cpt", "entity_specimen"] + merger_input_cols)

### Update the De-identification Blacklist

In [16]:
deid_pipeline.model.stages[38]

ChunkMergeModel_5a3f1e608447

In [17]:
deid_pipeline.model.stages[38] = deid_pipeline.model.stages[38]\
                                      .setBlackList(['CPT_CODE'])
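
The call above blacklists the `CPT_CODE` label on the final merge stage. In the masked output shown later in the notebook, `88305` is left untouched, which is consistent with blacklisted labels being dropped before masking. A conceptual sketch of that filtering step follows. It is an assumption about the effect, not the library's implementation, and `drop_blacklisted` and the chunk dictionaries are hypothetical.

```python
def drop_blacklisted(chunks, blacklist):
    """Keep only the chunks whose label is not in the blacklist."""
    blocked = set(blacklist)
    return [chunk for chunk in chunks if chunk["label"] not in blocked]

# Hypothetical merged chunks, using values from the sample text.
merged_chunks = [
    {"text": "John Lee", "label": "NAME"},
    {"text": "88305", "label": "CPT_CODE"},
    {"text": "#RD23-4897", "label": "IDNUM"},
]

kept = drop_blacklisted(merged_chunks, ["CPT_CODE"])
print([chunk["text"] for chunk in kept])  # ['John Lee', '#RD23-4897']
```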

### Updated Stages

In [18]:
deid_pipeline.model.stages = (
    deid_pipeline.model.stages[:35]
    + [cpt_parser, specimen_parser, chunk_merge_rulebase]
    + deid_pipeline.model.stages[36:]
    + [iobTagger]
)
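
The slice expression above replaces one stage with three and appends one more at the end, which shifts the index of every later stage. The sketch below replays the same slicing on a list of placeholder strings. The list length of 40 is hypothetical, since the real stage list is longer than what is printed here.

```python
# Placeholder names standing in for the real stage objects (hypothetical length).
stages = [f"stage_{i}" for i in range(40)]

new_stages = (
    stages[:35]
    + ["cpt_parser", "specimen_parser", "chunk_merge_rulebase"]  # replaces the old stage 35
    + stages[36:]
    + ["iobTagger"]
)

print(len(stages), len(new_stages))   # 40 43
print(new_stages[35:38])              # ['cpt_parser', 'specimen_parser', 'chunk_merge_rulebase']
print(new_stages.index("stage_38"))   # 40
```

Every stage after index 35 moves two positions to the right, so index-based edits such as the blacklist update at index 38 need to happen before the splice, as they do in this notebook, or use the shifted index afterwards.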

In [19]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

## Save and Test the Modified Pipeline

In [20]:
empty_result = deid_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

deid_pipeline.model.write().overwrite().save("modified_pipeline")

In [21]:
# We are loading the pretrained pipeline using the `from_disk` method.
from sparknlp.pretrained import PretrainedPipeline

modified_pipeline = PretrainedPipeline.from_disk('modified_pipeline')

### Sample Result

In [22]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = modified_pipeline.transform(samples_df).cache()

In [78]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50,truncate=False)

+----------------------------------+-----+----+---------+----------+
|chunk                             |begin|end |ner_label|confidence|
+----------------------------------+-----+----+---------+----------+
|John Lee                          |22   |29  |NAME     |0.9999912 |
|7789201                           |37   |43  |IDNUM    |0.72      |
|LERE                              |55   |58  |LOCATION |0.861845  |
|2025-05-12                        |75   |84  |DATE     |NULL      |
|#RD23-4897                        |101  |110 |IDNUM    |0.50      |
|Smith                             |232  |236 |NAME     |0.9992543 |
|Carter                            |243  |248 |NAME     |0.9988757 |
|2025-05-12                        |275  |284 |DATE     |NULL      |
|General Hospital                  |292  |307 |LOCATION |0.9980348 |
|Fan Gabriel                       |313  |323 |NAME     |0.98504215|
|90210                             |325  |329 |IDNUM    |0.5666    |
|General Hospital                 

In [24]:
pd.set_option("display.max_colwidth", 1000)

result_df = result.selectExpr("text",
                              "mask_entity.result as masked_result",
                              "obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\n(NOTE) Patient Name: John Lee. MR#: 7789201 Location: LERE Date Reported: 2025-05-12 16:30\nSpecimen #RD23-4897 Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A\nElectronically Signed Out By Dr. Smith, Dr. Carter, CT(ASCP) Date Reported: 2025-05-12 16:30\nGeneral Hospital Dr. Fan Gabriel 90210 CPT Code(s) A: 88305\n\nGeneral Hospital in New York City Dr. Williams, NYC, NY\n(212) 555-7890 Patient Name: John Lee Accession #: GH-556672\nPatient ID #: 7789201 Collected: 2025-05-10 Address:\n123 Main Street, FALL RIVER\nNIAGARA FALLS, NY 14304\nReceived: 2025-05-10 Reported: 2025-05-12\nSoc. Sec. #: XXX-XX-1234 DOB/Age/Sex: 1973 (Age: 52) M\nPhysician(s): Dr. Jameson. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\n· Chromosome analysis cytogene...","[\n(NOTE) Patient Name: <NAME>. MR#: <IDNUM> Location: <LOCATION> Date Reported: <DATE> 16:30\nSpecimen <IDNUM> Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A\nElectronically Signed Out By Dr. <NAME>, Dr. <NAME>, CT(ASCP) Date Reported: <DATE> 16:30\n<LOCATION> Dr. <NAME> <IDNUM> CPT Code(s) A: 88305\n\n<LOCATION> in <LOCATION> City Dr. <NAME>, <LOCATION>, <LOCATION>\n<CONTACT> Patient Name: <NAME> Accession #: <IDNUM>\nPatient ID #: <IDNUM> Collected: <DATE> Address:\n<LOCATION>, <LOCATION>\n<LOCATION>, <LOCATION> <LOCATION>\nReceived: <DATE> Reported: <DATE>\nSoc. Sec. #: <IDNUM> DOB/Age/Sex: <DATE> (Age: <AGE>) M\nPhysician(s): Dr. <NAME>. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.\nThe following special studies were performed at <LOCATION>, <LOCATION> – <LOCATION>; <LOCATION> City.\n· <LOCATION> analysis cytogenetics. 
(ADDENDUM REPORT TO FOLLOW.)\n·...","[\n(NOTE) Patient Name: Gillie Allan. MR#: 0074518 Location: 4500 MEMORIAL DRIVE Date Reported: 2025-06-29 16:30\nSpecimen #SA52-9740 Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A\nElectronically Signed Out By Dr. Wanna, Dr. Malvin, CT(ASCP) Date Reported: 2025-06-29 16:30\n310 Ellis Street Dr. Marcelo Danes 41581 CPT Code(s) A: 88305\n\n310 Ellis Street in 2000 Boise Ave City Dr. Duwaine, 427 GUY PARK AVE, 16100 SOUTH FREEWAY\n(585) 666-0741 Patient Name: Gillie Allan Accession #: PU-663305\nPatient ID #: 0074518 Collected: 2025-06-27 Address:\n3255 Independence Street, 302 W MCNEESE ST\n4101 NW 89TH BLVD, 16100 SOUTH FREEWAY 59 KOCH AVE\nReceived: 2025-06-27 Reported: 2025-06-29\nSoc. Sec. #: WWW-WW-8529 DOB/Age/Sex: 1974 (Age: 44) M\nPhysician(s): Dr. Marchelle. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.\nThe following special studies were performed..."


## Prepare Data for Custom NER Model Training

In [None]:
# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/refs/heads/master/tutorials/academic/DeIdentification_Benchmarks_Text2Story2025/deidentification_benchmark_ground_truth_48_doc.csv

In [25]:
import pandas as pd
benchmark_df = pd.read_csv("./deidentification_benchmark_ground_truth_48_doc.csv")
benchmark_df

Unnamed: 0,doc_id,text,begin,end,chunk,chunk_label
0,1,"\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRINCIPAL DIAGNOSIS :\nTracheoesophageal fistula .\nASSOCIATED DIAGNOSIS :\nDiabetes mellitus , pneumonia , sepsis , respiratory failure , pleural effusion , postoperative encephalopathy , postoperative myocardial infarction , and thrombocytopenia .\nSPECIAL PROCEDURES AND OPERATIONS :\nFebruary 6 , 1994 , rigid bronchoscopy with biopsy and upper gastrointestinal endoscopy .\nASSOCIATED PROCEDURES :\nOn February 10 , 1994 , flexible bronchoscopy , flexible esophagoscopy , rigid bronchoscopy , transhiatal esophagectomy ( partial ) , substernal gastric interposition and jejunostomy .\nMultiple bronchoscopies , chest tube insertion .\nHISTORY OF PRESENT ILLNESS :\nThe patient was a 71 year old white female with a history of carcinoid lung cancer who presented for evaluation of a tracheoesophageal...",1,10,957770228,IDNUM
1,1,"\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRINCIPAL DIAGNOSIS :\nTracheoesophageal fistula .\nASSOCIATED DIAGNOSIS :\nDiabetes mellitus , pneumonia , sepsis , respiratory failure , pleural effusion , postoperative encephalopathy , postoperative myocardial infarction , and thrombocytopenia .\nSPECIAL PROCEDURES AND OPERATIONS :\nFebruary 6 , 1994 , rigid bronchoscopy with biopsy and upper gastrointestinal endoscopy .\nASSOCIATED PROCEDURES :\nOn February 10 , 1994 , flexible bronchoscopy , flexible esophagoscopy , rigid bronchoscopy , transhiatal esophagectomy ( partial ) , substernal gastric interposition and jejunostomy .\nMultiple bronchoscopies , chest tube insertion .\nHISTORY OF PRESENT ILLNESS :\nThe patient was a 71 year old white female with a history of carcinoid lung cancer who presented for evaluation of a tracheoesophageal...",11,14,FIH,LOCATION
2,1,"\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRINCIPAL DIAGNOSIS :\nTracheoesophageal fistula .\nASSOCIATED DIAGNOSIS :\nDiabetes mellitus , pneumonia , sepsis , respiratory failure , pleural effusion , postoperative encephalopathy , postoperative myocardial infarction , and thrombocytopenia .\nSPECIAL PROCEDURES AND OPERATIONS :\nFebruary 6 , 1994 , rigid bronchoscopy with biopsy and upper gastrointestinal endoscopy .\nASSOCIATED PROCEDURES :\nOn February 10 , 1994 , flexible bronchoscopy , flexible esophagoscopy , rigid bronchoscopy , transhiatal esophagectomy ( partial ) , substernal gastric interposition and jejunostomy .\nMultiple bronchoscopies , chest tube insertion .\nHISTORY OF PRESENT ILLNESS :\nThe patient was a 71 year old white female with a history of carcinoid lung cancer who presented for evaluation of a tracheoesophageal...",15,22,0408267,IDNUM
3,1,"\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRINCIPAL DIAGNOSIS :\nTracheoesophageal fistula .\nASSOCIATED DIAGNOSIS :\nDiabetes mellitus , pneumonia , sepsis , respiratory failure , pleural effusion , postoperative encephalopathy , postoperative myocardial infarction , and thrombocytopenia .\nSPECIAL PROCEDURES AND OPERATIONS :\nFebruary 6 , 1994 , rigid bronchoscopy with biopsy and upper gastrointestinal endoscopy .\nASSOCIATED PROCEDURES :\nOn February 10 , 1994 , flexible bronchoscopy , flexible esophagoscopy , rigid bronchoscopy , transhiatal esophagectomy ( partial ) , substernal gastric interposition and jejunostomy .\nMultiple bronchoscopies , chest tube insertion .\nHISTORY OF PRESENT ILLNESS :\nThe patient was a 71 year old white female with a history of carcinoid lung cancer who presented for evaluation of a tracheoesophageal...",23,33,46769/5v7d,IDNUM
4,1,"\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRINCIPAL DIAGNOSIS :\nTracheoesophageal fistula .\nASSOCIATED DIAGNOSIS :\nDiabetes mellitus , pneumonia , sepsis , respiratory failure , pleural effusion , postoperative encephalopathy , postoperative myocardial infarction , and thrombocytopenia .\nSPECIAL PROCEDURES AND OPERATIONS :\nFebruary 6 , 1994 , rigid bronchoscopy with biopsy and upper gastrointestinal endoscopy .\nASSOCIATED PROCEDURES :\nOn February 10 , 1994 , flexible bronchoscopy , flexible esophagoscopy , rigid bronchoscopy , transhiatal esophagectomy ( partial ) , substernal gastric interposition and jejunostomy .\nMultiple bronchoscopies , chest tube insertion .\nHISTORY OF PRESENT ILLNESS :\nThe patient was a 71 year old white female with a history of carcinoid lung cancer who presented for evaluation of a tracheoesophageal...",34,40,237890,IDNUM
...,...,...,...,...,...,...
1474,48,"6509414988 Theophilus\n\nAniceto ,\n\nUNIVERSITY OF MD CHARLES REGIONAL MEDICAL CENTER\n\nNIX SPECIALTY HEALTH CENTER\n\nCALIFORNIA Eupora ,\n\nRecords Coversheet\n\nDiego Foy\n\nPatient Name : 6509414988\n\nPatient MR# : Patient DOB : 2/16/1941\n\nDate 7/14/2021\n\nPrepared :\n\nNot\n\nSpecified\n\nDiagnosis : Document Type Page Range\n\nHistory & Physical/MD Notes 10-20\n\nConsults 21-24\n\nChemo/Radiation Therapy n/a\n\nLab n/a\n\nReports\n\nSummaries n/a\n\nDischarge\n\nNotes 25-27\n\nOperative\n\nRadiology Reports 28-76\n\nPathology Reports 77-83\n\nOther 84-87\n\neHealth Technologies\n\nJordanfort\n\nLake Erin , ALASKA 38720\n\nMAIN 966.255.9888\n\nFAX 644.508.1141 1 of 87\n\nPage\n\n",559,568,Lake Erin,LOCATION
1475,48,"6509414988 Theophilus\n\nAniceto ,\n\nUNIVERSITY OF MD CHARLES REGIONAL MEDICAL CENTER\n\nNIX SPECIALTY HEALTH CENTER\n\nCALIFORNIA Eupora ,\n\nRecords Coversheet\n\nDiego Foy\n\nPatient Name : 6509414988\n\nPatient MR# : Patient DOB : 2/16/1941\n\nDate 7/14/2021\n\nPrepared :\n\nNot\n\nSpecified\n\nDiagnosis : Document Type Page Range\n\nHistory & Physical/MD Notes 10-20\n\nConsults 21-24\n\nChemo/Radiation Therapy n/a\n\nLab n/a\n\nReports\n\nSummaries n/a\n\nDischarge\n\nNotes 25-27\n\nOperative\n\nRadiology Reports 28-76\n\nPathology Reports 77-83\n\nOther 84-87\n\neHealth Technologies\n\nJordanfort\n\nLake Erin , ALASKA 38720\n\nMAIN 966.255.9888\n\nFAX 644.508.1141 1 of 87\n\nPage\n\n",571,577,ALASKA,LOCATION
1476,48,"6509414988 Theophilus\n\nAniceto ,\n\nUNIVERSITY OF MD CHARLES REGIONAL MEDICAL CENTER\n\nNIX SPECIALTY HEALTH CENTER\n\nCALIFORNIA Eupora ,\n\nRecords Coversheet\n\nDiego Foy\n\nPatient Name : 6509414988\n\nPatient MR# : Patient DOB : 2/16/1941\n\nDate 7/14/2021\n\nPrepared :\n\nNot\n\nSpecified\n\nDiagnosis : Document Type Page Range\n\nHistory & Physical/MD Notes 10-20\n\nConsults 21-24\n\nChemo/Radiation Therapy n/a\n\nLab n/a\n\nReports\n\nSummaries n/a\n\nDischarge\n\nNotes 25-27\n\nOperative\n\nRadiology Reports 28-76\n\nPathology Reports 77-83\n\nOther 84-87\n\neHealth Technologies\n\nJordanfort\n\nLake Erin , ALASKA 38720\n\nMAIN 966.255.9888\n\nFAX 644.508.1141 1 of 87\n\nPage\n\n",578,583,38720,LOCATION
1477,48,"6509414988 Theophilus\n\nAniceto ,\n\nUNIVERSITY OF MD CHARLES REGIONAL MEDICAL CENTER\n\nNIX SPECIALTY HEALTH CENTER\n\nCALIFORNIA Eupora ,\n\nRecords Coversheet\n\nDiego Foy\n\nPatient Name : 6509414988\n\nPatient MR# : Patient DOB : 2/16/1941\n\nDate 7/14/2021\n\nPrepared :\n\nNot\n\nSpecified\n\nDiagnosis : Document Type Page Range\n\nHistory & Physical/MD Notes 10-20\n\nConsults 21-24\n\nChemo/Radiation Therapy n/a\n\nLab n/a\n\nReports\n\nSummaries n/a\n\nDischarge\n\nNotes 25-27\n\nOperative\n\nRadiology Reports 28-76\n\nPathology Reports 77-83\n\nOther 84-87\n\neHealth Technologies\n\nJordanfort\n\nLake Erin , ALASKA 38720\n\nMAIN 966.255.9888\n\nFAX 644.508.1141 1 of 87\n\nPage\n\n",590,602,966.255.9888,CONTACT


In [26]:
benchmark_df.count()

Unnamed: 0,0
doc_id,1479
text,1479
begin,1479
end,1479
chunk,1479
chunk_label,1479


### Preprocess Data with the Modified Pipeline

Run the entire dataset through the modified pipeline. This generates the `document`, `splitter`, `token`, `embeddings`, and `ner_label` annotations required for NER training downstream.

In [27]:
text_df = benchmark_df[["doc_id", "text"]].drop_duplicates()
input_spark_df = spark.createDataFrame(text_df).repartition(32)
input_spark_df.show()

+------+--------------------+
|doc_id|                text|
+------+--------------------+
|    20|\n263283549 ELMVH...|
|    45|Email : Hobbes@ya...|
|    41|6509414988 Theoph...|
|    19|\n755646616\nFIH\...|
|    30|\n793183831 PUMC\...|
|    48|6509414988 Theoph...|
|    47|critical result h...|
|    22|\n168165320\nFIH\...|
|    31|Legal Name:Page F...|
|    46|----- % Date : 8/...|
|     1|\n957770228\nFIH\...|
|     2|\n291181306\nFIH\...|
|    12|\n559197012\nFIH\...|
|    37|Pasco Bond DOB : ...|
|    11|\n935669761 PUMC\...|
|    27|\n059140531 ELMVH...|
|    28|\n733882247\nFIH\...|
|    39|SAN DIMAS COMMUNI...|
|    29|\n417344403 RWH\n...|
|    26|\n348983165\nFIH\...|
+------+--------------------+
only showing top 20 rows



In [28]:
results = modified_pipeline.transform(input_spark_df)
results.columns

['doc_id',
 'text',
 'document',
 'splitter',
 'token',
 'embeddings',
 'ner_clinical_large',
 'ner_chunk_clinical_large',
 'ner_deid_generic_docwise',
 'ner_deid_docwise_subentity',
 'ner_deid_generic_docwise_merged_conll',
 'ner_chunk_generic_docwise',
 'ner_chunk_subentity_docwise',
 'ner_chunk_merged_docwise',
 'ner_zero_shot',
 'ner_chunk_zero_shot_raw',
 'ner_deid_subentity_docwise_new',
 'ner_chunk_subentity_docwise_new_chunk',
 'ner_chunk_zero_shot',
 'deid_merged_ner_chunk',
 'entity_icd10',
 'entity_ssn',
 'entity_account',
 'entity_dln',
 'entity_plate',
 'entity_vin',
 'entity_license',
 'entity_country',
 'entity_state',
 'entity_age',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_zip',
 'entity_medicalrecord',
 'entity_email',
 'entity_ip_address',
 'entity_cpt',
 'entity_specimen',
 'deid_merged_ner_rulebased',
 'ner_chunk_raw',
 'ner_chunk_processed',
 'ner_chunk',
 'mask_entity',
 'obfuscated',
 'ner_label']

In [29]:
result_df = results.select('doc_id','text','document','splitter',
                          'token',"embeddings", 'ner_label')

In [30]:
result_df.show(2, truncate=40)

+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|doc_id|                                    text|                                document|                                splitter|                                   token|                              embeddings|                               ner_label|
+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|    20|\n263283549 ELMVH\n80655600\n5177642\...|[{document, 0, 2900, \n263283549 ELMV...|[{document, 1, 508, 263283549 ELMVH\n...|[{token, 1, 9, 263283549, {sentence -...|[{word_embeddings, 1, 9, 263283549, {...|[{named_entity, 1, 9, 

In [31]:
result_df.count()

48

### Persist Preprocessed Data

Save the annotated DataFrame to Parquet format. This is an optimization step to speed up the training process by avoiding re-computation.

In [32]:
%%time

n_partitions = 48

# WRITING THE DATA
result_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save(f"./data/result_df_{n_partitions}.parquet")


CPU times: user 212 ms, sys: 45.2 ms, total: 258 ms
Wall time: 1min 24s


## Train a Custom Medical NER Model

In [33]:
# READING THE DATA
n_partitions = 48
result_df = spark.read \
    .parquet(f"./data/result_df_{n_partitions}.parquet")\
    .repartition(n_partitions)

In [None]:
result_df.show(2)

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            splitter|               token|          embeddings|           ner_label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     1|\n957770228\nFIH\...|[{document, 0, 78...|[{document, 1, 50...|[{token, 1, 9, 95...|[{word_embeddings...|[{named_entity, 1...|
|     4|\n229937784\nFIH\...|[{document, 0, 37...|[{document, 1, 51...|[{token, 1, 9, 22...|[{word_embeddings...|[{named_entity, 1...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [34]:
(train_df, test_df) = result_df.randomSplit([0.8, 0.2], seed = 42)

In [35]:
test_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save(f"./data/test_df.parquet")

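### Use MedicalNerDLGraphChecker for NER

The `MedicalNerDLGraphChecker` processes the dataset to extract the graph parameters that `MedicalNerApproach` requires (tokens, labels, embedding dimensions). It runs as the first stage of the training pipeline, before `MedicalNerApproach`.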

In [36]:
embeddings = (WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
            .setInputCols(["splitter", "token"])
            .setOutputCol("embeddings"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [37]:
nerDLGraphChecker = MedicalNerDLGraphChecker()\
    .setInputCols(["splitter", "token"])\
    .setLabelColumn("ner_label")\
    .setEmbeddingsModel(embeddings)

### Configure and Run the MedicalNerApproach

In [38]:
nerTagger = (
    MedicalNerApproach()
    .setInputCols(["splitter", "token", "embeddings"])
    .setLabelColumn("ner_label")
    .setOutputCol("ner")
    .setMaxEpochs(30)
    .setBatchSize(8)
    .setRandomSeed(0)
    .setVerbose(1)
    .setValidationSplit(0.2)
    .setEvaluationLogExtended(True)
    .setEnableOutputLogs(True)
    .setIncludeConfidence(True)
    .setOutputLogsPath('ner_logs')
    .setEarlyStoppingCriterion(0.01)
    .setEarlyStoppingPatience(5)
    .setUseBestModel(False)
    # Optional settings:
    # .setTestDataset("./data/test_df.parquet")
    # .setEnableMemoryOptimizer(True)  # with limited memory and a large CoNLL file, set this to True to train batch by batch
    # .setDatasetInfo("NCBI_sample_short dataset")  # details regarding the dataset
)

ner_pipeline = Pipeline(
    stages=[
          nerDLGraphChecker,
          nerTagger
 ])

In [39]:
%%time
ner_model = ner_pipeline.fit(train_df)

CPU times: user 1.12 s, sys: 274 ms, total: 1.39 s
Wall time: 7min 34s


In [40]:
ner_model.stages[-1].getTrainingClassDistribution()

{'I-NAME': 465, 'I-CONTACT': 22, 'I-AGE': 80, 'I-IDNUM': 5, 'B-DATE': 617, 'I-DATE': 266, 'I-LOCATION': 373, 'B-NAME': 447, 'B-AGE': 64, 'B-LOCATION': 266, 'B-IDNUM': 170, 'O': 47073, 'B-CONTACT': 39}
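The distribution above is dominated by the `O` tag, and some entities have very few training tokens. As a quick sanity check, the counts can be aggregated per entity with plain Python. This is only a sketch over the dictionary printed above; the variable names are illustrative.

```python
from collections import defaultdict

# Tag counts copied from the getTrainingClassDistribution() output above.
dist = {
    'I-NAME': 465, 'I-CONTACT': 22, 'I-AGE': 80, 'I-IDNUM': 5,
    'B-DATE': 617, 'I-DATE': 266, 'I-LOCATION': 373, 'B-NAME': 447,
    'B-AGE': 64, 'B-LOCATION': 266, 'B-IDNUM': 170, 'O': 47073,
    'B-CONTACT': 39,
}

total_tokens = sum(dist.values())
o_share = dist['O'] / total_tokens

# Merge B- and I- counts into one token count per entity type.
tokens_per_entity = defaultdict(int)
for tag, count in dist.items():
    if tag != 'O':
        tokens_per_entity[tag.split('-', 1)[1]] += count

print(f"O share of training tokens: {o_share:.2%}")
for entity, count in sorted(tokens_per_entity.items(), key=lambda kv: kv[1]):
    print(f"{entity:<9}{count:>6}")
```

`CONTACT` comes out as the smallest class (61 tokens), so weaker scores for that entity would not be surprising.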

### Save the Trained NER Model and Review Logs

In [41]:
ner_model.stages[-1].write().overwrite().save('models/new_NER_model')

In [42]:
import os

# Print the first log file written to the training log directory
log_file = os.listdir("ner_logs")[0]

with open(os.path.join("ner_logs", log_file)) as f:
    print(f.read())

Name of the selected graph: medical-ner-dl/blstm_100_200_128_100.pb
Training started - total epochs: 30 - lr: 0.001 - batch size: 8 - labels: 13 - chars: 95 - training examples: 479


Epoch 1/30 started, lr: 0.001, dataset size: 479


Epoch 1/30 - 14.82s - loss: 2564.314 - avg training loss: 48.38328 - batches: 53
Quality on validation dataset (20.0%), validation examples = 95
time to finish evaluation: 1.28s
Total validation loss: 218.8025	Avg validation loss: 24.3114
label	 tp	 fp	 fn	 prec	 rec	 f1
I-NAME	 0	 5	 69	 0.0	 0.0	 0.0
I-CONTACT	 0	 0	 9	 0.0	 0.0	 0.0
I-AGE	 0	 0	 9	 0.0	 0.0	 0.0
B-DATE	 5	 19	 55	 0.20833333	 0.083333336	 0.11904763
I-DATE	 0	 0	 26	 0.0	 0.0	 0.0
I-LOCATION	 0	 0	 65	 0.0	 0.0	 0.0
B-NAME	 0	 0	 64	 0.0	 0.0	 0.0
B-AGE	 0	 0	 13	 0.0	 0.0	 0.0
B-LOCATION	 0	 0	 38	 0.0	 0.0	 0.0
B-IDNUM	 1	 0	 25	 1.0	 0.03846154	 0.074074075
B-CONTACT	 0	 0	 22	 0.0	 0.0	 0.0
tp: 6 fp: 24 fn: 395 labels: 11
Macro-average	 prec: 0.10984849, rec: 0.011072262, f1: 0.020
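The per-label figures in this log can be reproduced from the `tp`, `fp`, and `fn` columns. The sketch below recomputes precision, recall, and F1 for the epoch 1 rows and the macro averages over the 11 labels. It is a plain-Python check, not the library's evaluation code, and the helper name `prf` is illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# (tp, fp, fn) per label, copied from the epoch 1 validation block above.
rows = {
    "I-NAME": (0, 5, 69), "I-CONTACT": (0, 0, 9), "I-AGE": (0, 0, 9),
    "B-DATE": (5, 19, 55), "I-DATE": (0, 0, 26), "I-LOCATION": (0, 0, 65),
    "B-NAME": (0, 0, 64), "B-AGE": (0, 0, 13), "B-LOCATION": (0, 0, 38),
    "B-IDNUM": (1, 0, 25), "B-CONTACT": (0, 0, 22),
}

scores = {label: prf(*counts) for label, counts in rows.items()}
macro_prec = sum(s[0] for s in scores.values()) / len(scores)
macro_rec = sum(s[1] for s in scores.values()) / len(scores)
# Two common ways to report a "macro F1":
f1_of_macro_pr = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)
mean_of_label_f1 = sum(s[2] for s in scores.values()) / len(scores)

print(f"macro prec={macro_prec:.8f} rec={macro_rec:.8f}")
print(f"F1 of macro P/R={f1_of_macro_pr:.4f}, mean of label F1={mean_of_label_f1:.4f}")
```

The macro precision and recall match the logged values. The logged macro F1 is cut off at `0.020`, which appears more consistent with the harmonic mean of macro precision and recall than with the mean of the per-label F1 scores, but the truncated output does not settle this.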

## Evaluate the Newly Trained NER Model

In [53]:
test_df.show()

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            splitter|               token|          embeddings|           ner_label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    47|critical result h...|[{document, 0, 75...|[{document, 0, 50...|[{token, 0, 7, cr...|[{word_embeddings...|[{named_entity, 0...|
|     9|\n333145593\nFIH\...|[{document, 0, 76...|[{document, 1, 51...|[{token, 1, 9, 33...|[{word_embeddings...|[{named_entity, 1...|
|     1|\n957770228\nFIH\...|[{document, 0, 78...|[{document, 1, 50...|[{token, 1, 9, 95...|[{word_embeddings...|[{named_entity, 1...|
|    44|May 2035 Thursday...|[{document, 0, 39...|[{document, 0, 51...|[{token, 0, 2, Ma...|[{word_embeddings...|[{named_entity, 0...|
|     8|\n305265793\nFIH\...|[{document, 0, 11...|[{doc

In [63]:
# Convert the predicted IOB tags to chunks ...
ner_converter_fix = NerConverterInternal()\
      .setInputCols(["splitter", "token", "ner"])\
      .setOutputCol("temp_ner_chunk")

# ... and back to IOB tags, so the predictions form well-formed B-/I- sequences
iobTagger_fix = IOBTagger()\
      .setInputCols(["token", "temp_ner_chunk"])\
      .setOutputCol("ner")

pipeline_fix = Pipeline(
    stages=[
          ner_model.stages[-1],  # the trained MedicalNerModel
          ner_converter_fix,
          iobTagger_fix
 ]).fit(test_df)

In [64]:
pred_df = pipeline_fix.transform(test_df).cache()

In [69]:
pred_df.show()

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            splitter|               token|          embeddings|           ner_label|                 ner|      temp_ner_chunk|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    47|critical result h...|[{document, 0, 75...|[{document, 0, 50...|[{token, 0, 7, cr...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|[{chunk, 161, 166...|
|     9|\n333145593\nFIH\...|[{document, 0, 76...|[{document, 1, 51...|[{token, 1, 9, 33...|[{word_embeddings...|[{named_entity, 1...|[{named_entity, 1...|[{chunk, 1, 9, 33...|
|     1|\n957770228\nFIH\...|[{document, 0, 78...|[{document, 1, 50...|[{token, 1, 9, 95...|[{word_embeddings...|[{

In [70]:
from pyspark.sql import functions as F

pred_token_df = pred_df.select(F.explode(F.arrays_zip(pred_df.ner_label.metadata,
                                                  pred_df.ner_label.begin,
                                                  pred_df.ner_label.end,
                                                  pred_df.ner_label.result,
                                                  pred_df.ner.result)).alias("cols")) \
          .select(F.expr("cols['0']['word']").alias("token"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']").alias("gtruth"),
                  F.expr("cols['4']").alias("prediction"))\
          .toPandas()

pred_token_df

Unnamed: 0,token,begin,end,gtruth,prediction
0,critical,0,7,O,O
1,result,9,14,O,O
2,hand,16,19,O,O
3,delivered,21,29,O,O
4,to,31,32,O,O
...,...,...,...,...,...
8277,1,2413,2413,O,O
8278,of,2415,2416,O,O
8279,87,2418,2419,O,O
8280,Page,2422,2425,O,O


### Calculate Evaluation Metrics

Use the `NerDLMetrics` class to compute precision, recall, and F1-score for each entity. The evaluation is run in two modes: `full_chunk`, which scores whole entity chunks, and `partial_chunk_per_token`, which scores the tokens of each chunk individually.
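
To make the difference between chunk-level and token-level scoring concrete, here is a simplified, self-contained illustration on a toy tag sequence. It is not the `NerDLMetrics` implementation, and the helper names (`iob_to_chunks`, `chunk_counts`, `token_counts`) are hypothetical.

```python
def iob_to_chunks(tags):
    """Return the set of (start, end, label) spans in an IOB tag sequence."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        prefix, _, entity = tag.partition("-")
        continues = prefix == "I" and entity == label
        if start is not None and not continues:
            chunks.add((start, i - 1, label))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, entity
    return chunks

def chunk_counts(gold, pred):
    """Exact-span matching: a chunk counts only if boundaries and label agree."""
    g, p = iob_to_chunks(gold), iob_to_chunks(pred)
    return len(g & p), len(p - g), len(g - p)  # tp, fp, fn

def token_counts(gold, pred):
    """Per-token matching on entity labels, ignoring the B-/I- prefix and O."""
    strip = lambda t: t.split("-", 1)[-1]
    tp = fp = fn = 0
    for g, p in zip(map(strip, gold), map(strip, pred)):
        if p != "O" and p == g:
            tp += 1
        else:
            fp += p != "O"
            fn += g != "O"
    return tp, fp, fn

gold = ["B-NAME", "I-NAME", "O", "B-DATE", "O", "B-LOCATION", "I-LOCATION"]
pred = ["B-NAME", "O",      "O", "B-DATE", "O", "B-LOCATION", "I-LOCATION"]

print("chunk-level (tp, fp, fn):", chunk_counts(gold, pred))
print("token-level (tp, fp, fn):", token_counts(gold, pred))
```

In this toy case the truncated `NAME` prediction is a full miss at chunk level (one false positive and one false negative) but only one missed token at token level, which is why token-level scores tend to be higher.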

In [71]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

evaler = NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"),
                                          prediction_col="ner",
                                          label_col="ner_label",
                                          drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# show() prints the table and returns None, so it is not wrapped in print()
eval_result.selectExpr("avg(f1) as macro").show()
# Note: "micro" here is the average of per-entity F1 weighted by each entity's total
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+-----+----+----+-----+---------+------+------+
|  entity|   tp|  fp|  fn|total|precision|recall|    f1|
+--------+-----+----+----+-----+---------+------+------+
| CONTACT|  2.0| 0.0|12.0| 14.0|      1.0|0.1429|  0.25|
|    NAME| 42.0| 7.0|11.0| 53.0|   0.8571|0.7925|0.8235|
|    DATE|102.0|17.0|10.0|112.0|   0.8571|0.9107|0.8831|
|   IDNUM| 13.0| 0.0| 0.0| 13.0|      1.0|   1.0|   1.0|
|LOCATION| 37.0| 3.0|20.0| 57.0|    0.925|0.6491|0.7629|
|     AGE|  9.0| 2.0| 1.0| 10.0|   0.8182|   0.9|0.8571|
+--------+-----+----+----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.7627792916604318|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8151046887510546|
+------------------+

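
The `micro` value in this cell is computed as `sum(f1*total)/sum(total)`, i.e. an average of the per-entity F1 scores weighted by `total`. A micro-average in the usual sense pools `tp`, `fp`, and `fn` across entities first. The sketch below recomputes both from the `full_chunk` table so the two can be compared; it is a plain-Python check with illustrative names, not part of `NerDLMetrics`.

```python
# (tp, fp, fn) per entity, copied from the full_chunk table above.
counts = {
    "CONTACT": (2, 0, 12),
    "NAME": (42, 7, 11),
    "DATE": (102, 17, 10),
    "IDNUM": (13, 0, 0),
    "LOCATION": (37, 3, 20),
    "AGE": (9, 2, 1),
}

def f1(tp, fp, fn):
    """F1 written directly in terms of counts: 2tp / (2tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

per_entity_f1 = {entity: f1(*c) for entity, c in counts.items()}
support = {entity: tp + fn for entity, (tp, fp, fn) in counts.items()}  # the "total" column

macro_f1 = sum(per_entity_f1.values()) / len(per_entity_f1)
weighted_f1 = sum(per_entity_f1[e] * support[e] for e in counts) / sum(support.values())

# Pool the counts over all entities before computing a single F1.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
pooled_micro_f1 = f1(tp, fp, fn)

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f} pooled micro={pooled_micro_f1:.4f}")
```

The macro and weighted values agree with the `macro` and `micro` outputs above, while the pooled micro F1 comes out somewhat higher (about 0.83).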


In [72]:
evaler = NerDLMetrics(mode="partial_chunk_per_token")
eval_result_partial = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"), prediction_col="ner", label_col="ner_label", drop_o = True, case_sensitive = True).cache()

eval_result_partial.withColumn("precision", F.round(eval_result_partial["precision"],4))\
           .withColumn("recall", F.round(eval_result_partial["recall"],4))\
           .withColumn("f1", F.round(eval_result_partial["f1"],4)).sort("entity").show(100)
df_partial=eval_result_partial.toPandas()
print("partial_chunk_per_token")
# show() prints the table and returns None, so it is not wrapped in print()
eval_result_partial.selectExpr("avg(f1) as macro").show()
# Note: "micro" here is the average of per-entity F1 weighted by each entity's total
eval_result_partial.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+-----+----+----+-----+---------+------+------+
|  entity|   tp|  fp|  fn|total|precision|recall|    f1|
+--------+-----+----+----+-----+---------+------+------+
|     AGE| 23.0| 3.0| 3.0| 26.0|   0.8846|0.8846|0.8846|
| CONTACT|  2.0| 0.0|13.0| 15.0|      1.0|0.1333|0.2353|
|    DATE|214.0|18.0| 3.0|217.0|   0.9224|0.9862|0.9532|
|   IDNUM| 15.0| 0.0| 0.0| 15.0|      1.0|   1.0|   1.0|
|LOCATION| 78.0| 6.0|23.0|101.0|   0.9286|0.7723|0.8432|
|    NAME| 89.0| 6.0|13.0|102.0|   0.9368|0.8725|0.9036|
+--------+-----+----+----+-----+---------+------+------+

partial_chunk_per_token
+------------------+
|             macro|
+------------------+
|0.8033225739436283|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8943491499800822|
+------------------+



## Calculate Evaluation Metrics with sklearn (Token-Level Evaluation)

### Classification Report

In [73]:
# Explode ground-truth and predicted annotations into one row per token
flattener = (
    Flattener()
    .setInputCols("ner_label", "ner")
    .setExplodeSelectedFields({
        "ner_label": ["result as gt_result",
                      "metadata.word as gt_token",
                      "begin as gt_begin",
                      "end as gt_end"],
        "ner": ["result as pred_result",
                "metadata.word as pred_token",
                "begin as pred_begin",
                "end as pred_end"],
    })
    .setCleanAnnotations(True)
)

In [74]:
flattened_df = flattener.transform(pred_df)

In [75]:
classification_pandas_df = flattened_df.toPandas()

In [76]:
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

def strip_iob(tag):
    if tag == "O":
        return "O"
    return tag.split("-")[-1]  # "B-NAME" -> "NAME", "I-DATE" -> "DATE"

y_true = [strip_iob(x) for x in classification_pandas_df["gt_result"]]
y_pred = [strip_iob(x) for x in classification_pandas_df["pred_result"]]

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         AGE       0.88      0.88      0.88        26
     CONTACT       1.00      0.13      0.24        15
        DATE       0.92      0.99      0.95       217
       IDNUM       1.00      1.00      1.00        15
    LOCATION       0.93      0.77      0.84       101
        NAME       0.94      0.87      0.90       102
           O       0.99      1.00      1.00      7806

    accuracy                           0.99      8282
   macro avg       0.95      0.81      0.83      8282
weighted avg       0.99      0.99      0.99      8282



### Confusion Matrix

In [77]:
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = sorted(set(y_true + y_pred))

cm = confusion_matrix(y_true, y_pred, labels=labels)

print(pd.DataFrame(cm, index=labels, columns=labels))

          AGE  CONTACT  DATE  IDNUM  LOCATION  NAME     O
AGE        23        0     0      0         0     0     3
CONTACT     0        2     6      0         2     0     5
DATE        0        0   214      0         0     0     3
IDNUM       0        0     0     15         0     0     0
LOCATION    0        0     0      0        78     4    19
NAME        0        0     0      0         2    89    11
O           3        0    12      0         2     2  7787
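
Each row of the matrix is a true label and each column a predicted label, so per-class recall is the diagonal divided by the row sum and precision is the diagonal divided by the column sum. The following sketch recomputes both from the matrix printed above and lists where the `CONTACT` tokens went. It uses plain Python lists rather than the `cm` array, and the variable names are illustrative.

```python
labels = ["AGE", "CONTACT", "DATE", "IDNUM", "LOCATION", "NAME", "O"]

# Rows = true label, columns = predicted label (copied from the output above).
matrix = [
    [23, 0,   0,  0,  0,  0,    3],
    [ 0, 2,   6,  0,  2,  0,    5],
    [ 0, 0, 214,  0,  0,  0,    3],
    [ 0, 0,   0, 15,  0,  0,    0],
    [ 0, 0,   0,  0, 78,  4,   19],
    [ 0, 0,   0,  0,  2, 89,   11],
    [ 3, 0,  12,  0,  2,  2, 7787],
]

recall, precision = {}, {}
for i, label in enumerate(labels):
    row_sum = sum(matrix[i])                    # tokens that truly have this label
    col_sum = sum(row[i] for row in matrix)     # tokens predicted as this label
    recall[label] = matrix[i][i] / row_sum if row_sum else 0.0
    precision[label] = matrix[i][i] / col_sum if col_sum else 0.0

# Off-diagonal entries of one row show what a class is confused with.
contact_row = matrix[labels.index("CONTACT")]
contact_errors = {
    labels[j]: n for j, n in enumerate(contact_row)
    if labels[j] != "CONTACT" and n > 0
}

for label in labels:
    print(f"{label:<9} precision={precision[label]:.2f} recall={recall[label]:.2f}")
print("CONTACT tokens predicted as:", contact_errors)
```

The recomputed values line up with the classification report, and the `CONTACT` row shows that most of its missed tokens were tagged as `DATE` or `O`.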


## Create the Final Pipeline with the Custom NER Model

In [83]:
# We are loading the pretrained pipeline using the `from_disk` method.
from sparknlp.pretrained import PretrainedPipeline

modified_pipeline = PretrainedPipeline.from_disk('modified_pipeline')

In [84]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### New Stages

In [85]:
ner_deid_new = MedicalNerModel.load("models/new_NER_model")\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setOutputCol("ner_deid_new")

ner_deid_new_converter = NerConverter()\
      .setInputCols(["splitter", "token", "ner_deid_new"])\
      .setOutputCol("ner_chunk_new")

ner_deid = MedicalNerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")  \
      .setInputCols(["splitter", "token", "embeddings"]) \
      .setOutputCol("ner_deid_subentity_docwise")

ner_deid_converter = NerConverter()\
      .setInputCols(["splitter", "token", "ner_deid_subentity_docwise"])\
      .setOutputCol("ner_chunk_subentity_docwise")

chunk_merge_ner = ChunkMergeModel()\
    .setInputCols("ner_chunk_new", # New Trained Model
                  "ner_chunk_subentity_docwise")\
    .setOutputCol("deid_merged_ner_chunk")\
    .setOrderingFeatures(["ChunkLength","ChunkBegin"])\
    .setMergeOverlapping(True)\
    .setResetSentenceIndices(True)


ner_deid_subentity_docwise download started this may take some time.
Approximate size to download 8.9 MB
[OK!]


### Update Stages

In [86]:
modified_pipeline.model.stages = (
    modified_pipeline.model.stages[:4]
    + [ner_deid_new,
       ner_deid_new_converter,
       ner_deid,
       ner_deid_converter,
       chunk_merge_ner]
    + modified_pipeline.model.stages[18:]

)
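
The slice indices in this cell are easy to get wrong, so it can help to check which stages are kept and which are dropped before assigning. The sketch below does this on plain strings: the first 21 stage names are copied from the listing printed earlier, and the five replacement names stand in for the new stage objects. The `splice` helper is hypothetical.

```python
# First 21 stage names of the loaded pipeline, as printed above.
old_stages = [
    "DocumentAssembler_ae0f203deedd", "InternalDocumentSplitter_cc36578ceda6",
    "REGEX_TOKENIZER_2e85686aea12", "WORD_EMBEDDINGS_MODEL_9004b1d00302",
    "MedicalNerModel_1a8637089929", "NER_CONVERTER_1aef7e9d2de5",
    "MedicalNerModel_d92d47622e85", "MedicalNerModel_32184c1db80b",
    "MedicalNerModel_ada39ac0d359", "NER_CONVERTER_a99db4e6a79d",
    "NER_CONVERTER_4a9436714344", "NER_CONVERTER_ea6433988e18",
    "PretrainedZeroShotNER_5f30ab9002f1", "NER_CONVERTER_c97040caf7b3",
    "MedicalNerModel_b8b167ec3114", "NER_CONVERTER_06db473f3215",
    "ContextualEntityRuler_11ff6711ef6b", "ChunkMergeModel_95d6827691bb",
    "CONTEXTUAL-PARSER_bf2a6abaf5fa", "CONTEXTUAL-PARSER_ff6bad379d91",
    "CONTEXTUAL-PARSER_89341cae7221",
]

# Placeholders for the five stage objects defined under "New Stages".
new_stages = ["ner_deid_new", "ner_deid_new_converter", "ner_deid",
              "ner_deid_converter", "chunk_merge_ner"]

def splice(stages, keep_head, resume_at, replacement):
    """Keep stages[:keep_head], insert replacement, continue from stages[resume_at:]."""
    return stages[:keep_head] + replacement + stages[resume_at:]

updated = splice(old_stages, keep_head=4, resume_at=18, replacement=new_stages)
dropped = old_stages[4:18]

print("kept head :", old_stages[:4])
print("dropped   :", len(dropped), "stages, last =", dropped[-1])
print("resumes at:", updated[len(old_stages[:4]) + len(new_stages)])
```

With `[:4]` and `[18:]`, the assembler, splitter, tokenizer, and embeddings are kept, the 14 original NER, converter, ruler, and merger stages are dropped, and the pipeline resumes at the first contextual parser.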

In [87]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_30f0545272c2,
 NerConverter_31f30ad79323,
 MedicalNerModel_32184c1db80b,
 NerConverter_ce71d8707f31,
 ChunkMergeModel_db4f5459cedb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 CONTEXTUAL-PARSER_f8b8f9aafb9f,
 CONTEXTUAL-PARSER_7f824493eafc,
 REGEX_MATCHER_26934077fe57,
 REGEX_MATCHER_5fe3de8b5a4e,
 CONTEXTUAL-PARSER_b9ba22559fef,
 CONTEXTUAL-PARSER_367bd31082ee,
 MERGE_ddff59e8b14a,
 ChunkMergeModel_50feb5f97568,
 ContextualEntityRuler_08eeaa89c938,
 ChunkMe

### Reassemble and Save the Final Pipeline

With the stages rebuilt so that the original NER components are replaced by the new custom NER model and the reconfigured merger, the final pipeline is saved to disk.

In [88]:
empty_result = modified_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

modified_pipeline.model.write().overwrite().save("new_pipeline")

In [89]:
from sparknlp.pretrained import PretrainedPipeline

new_pipeline = PretrainedPipeline.from_disk('new_pipeline')

## Final Test of the New Pipeline

In [90]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = new_pipeline.transform(samples_df).cache()

In [91]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50,truncate=False)

+----------------------------------+-----+----+---------+----------+
|chunk                             |begin|end |ner_label|confidence|
+----------------------------------+-----+----+---------+----------+
|John Lee                          |22   |29  |NAME     |0.9972    |
|7789201                           |37   |43  |IDNUM    |0.72      |
|LERE                              |55   |58  |NAME     |0.9574    |
|2025-05-12                        |75   |84  |DATE     |NULL      |
|#RD23-4897                        |101  |110 |IDNUM    |0.50      |
|Smith                             |232  |236 |NAME     |0.9827    |
|Carter, CT(ASCP                   |243  |257 |NAME     |0.6975    |
|2025-05-12                        |275  |284 |DATE     |NULL      |
|Hospital                          |300  |307 |LOCATION |0.9498    |
|Fan Gabriel                       |313  |323 |NAME     |0.9885    |
|90210                             |325  |329 |IDNUM    |0.8879    |
|New York                         

In [92]:
pd.set_option("display.max_colwidth", 1000)
result_df = result.selectExpr("text","mask_entity.result as masked_result","obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\n(NOTE) Patient Name: John Lee. MR#: 7789201 Location: LERE Date Reported: 2025-05-12 16:30\nSpecimen #RD23-4897 Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A\nElectronically Signed Out By Dr. Smith, Dr. Carter, CT(ASCP) Date Reported: 2025-05-12 16:30\nGeneral Hospital Dr. Fan Gabriel 90210 CPT Code(s) A: 88305\n\nGeneral Hospital in New York City Dr. Williams, NYC, NY\n(212) 555-7890 Patient Name: John Lee Accession #: GH-556672\nPatient ID #: 7789201 Collected: 2025-05-10 Address:\n123 Main Street, FALL RIVER\nNIAGARA FALLS, NY 14304\nReceived: 2025-05-10 Reported: 2025-05-12\nSoc. Sec. #: XXX-XX-1234 DOB/Age/Sex: 1973 (Age: 52) M\nPhysician(s): Dr. Jameson. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\n· Chromosome analysis cytogene...","[\n(NOTE) Patient Name: <NAME>. MR#: <IDNUM> Location: <NAME> Date Reported: <DATE> 16:30\nSpecimen <IDNUM> Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A\nElectronically Signed Out By Dr. <NAME>, Dr. <NAME>) Date Reported: <DATE> 16:30\nGeneral <LOCATION> Dr. <NAME> <IDNUM> CPT Code(s) A: 88305\n\nGeneral Hospital in <LOCATION> City Dr. <NAME> <LOCATION>, <LOCATION>\n<CONTACT> Patient Name: <NAME> Accession #: <IDNUM>\nPatient ID #: <IDNUM> Collected: <DATE> Address:\n<LOCATION>, <LOCATION>, <LOCATION> <LOCATION>\nReceived: <DATE> Reported: <DATE>\nSoc. Sec. #: <DATE> DOB/Age/Sex: <DATE> (Age: <AGE>) M\nPhysician(s): Dr. <NAME>. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.\nThe following special studies were performed at <LOCATION>, <LOCATION>; <LOCATION> City.\n· Chromosome analysis cytogenetics. 
(ADDENDUM REPORT TO FOLLOW.)\n· Leukemic immunophenotypin...","[\n(NOTE) Patient Name: Gillie Allan. MR#: 0074518 Location: GUY Date Reported: 2025-06-29 16:30\nSpecimen #SA52-9740 Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A\nElectronically Signed Out By Dr. Wanna, Dr. Malvin JARRED) Date Reported: 2025-06-29 16:30\nGeneral 2329 Parker Road Dr. Marcelo Danes 41581 CPT Code(s) A: 88305\n\nGeneral Hospital in 2000 Boise Ave City Dr. Duwaine 427 GUY PARK AVE, 16100 SOUTH FREEWAY\n(585) 666-0741 Patient Name: Gillie Allan Accession #: PU-663305\nPatient ID #: 0074518 Collected: 2025-06-27 Address:\n3255 Independence Street, 116 PORTER DRIVE, 16100 SOUTH FREEWAY 59 KOCH AVE\nReceived: 2025-06-27 Reported: 2025-06-29\nSoc. Sec. #: XXX-XX-2934 DOB/Age/Sex: 1974 (Age: 44) M\nPhysician(s): Dr. Marchelle. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.\nThe following special studies were performed at 103 North Street, 1530 Nor..."
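
To see how the chunk offsets relate to the masked text, the sketch below applies the first four chunks from the table above to the first line of the sample note. It only illustrates offset-based replacement with inclusive `end` indices and is not the de-identification stage's implementation; the `mask` helper is hypothetical.

```python
# First line of the sample note (the leading newline is part of the text).
text = ("\n(NOTE) Patient Name: John Lee. MR#: 7789201 Location: LERE "
        "Date Reported: 2025-05-12 16:30")

# (begin, end, label) copied from the chunk table; `end` is inclusive.
chunks = [
    (22, 29, "NAME"),
    (37, 43, "IDNUM"),
    (55, 58, "NAME"),
    (75, 84, "DATE"),
]

def mask(text, chunks):
    """Replace each span with <LABEL>, working right to left so earlier
    offsets stay valid while the string length changes."""
    out = text
    for begin, end, label in sorted(chunks, reverse=True):
        out = out[:begin] + f"<{label}>" + out[end + 1:]
    return out

masked = mask(text, chunks)
print(repr(text[22:30]))  # the first chunk's surface form
print(masked)
```

The result matches the beginning of `masked_result` above.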
