![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.13.End2End_Preannotation_and_Training_Pipeline.ipynb)

# **End2End Preannotation and Training Pipeline**

## Spark Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.0  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import os
import json
import numpy as np
import pandas as pd

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

params = {"spark.driver.memory":"48G", # Amount of memory to use for the driver process, i.e. where SparkContext is initialized
          "spark.kryoserializer.buffer.max":"2000M", # Maximum allowable size of Kryo serialization buffer, in MiB unless otherwise specified.
          "spark.driver.maxResultSize":"2000M"} # Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes.
                                                # Should be at least 1M, or 0 for unlimited.

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)
spark.sparkContext.setLogLevel("ERROR")
print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


## Loading the Pretrained Pipeline

Spark NLP's pretrained pipeline, `clinical_deidentification_docwise_benchmark`, is loaded. This pipeline is designed to mask and obfuscate sensitive information in medical texts, such as names, ID numbers, contact information, locations, ages, and dates. The existing stages of the pipeline are examined to understand its structure.

In [4]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models")

clinical_deidentification_docwise_benchmark download started this may take some time.
Approx size to download 2.3 GB
[OK!]


In [5]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### Sample text

In [9]:
text = """
(NOTE) Patient Name: John Lee. MR#: 7789201 Location: LERE Date Reported: 2025-05-12 16:30
Specimen #RD23-4897 Clinical History: None Given. CLINICAL INFORMATION: Date of Last Menstrual Period: N/A
Electronically Signed Out By Dr. Smith, Dr. Carter, CT(ASCP) Date Reported: 2025-05-12 16:30
General Hospital Dr. Fan Gabriel 90210 CPT Code(s) A: 88305

General Hospital in New York City Dr. Williams, NYC, NY
(212) 555-7890 Patient Name: John Lee Accession #: GH-556672
Patient ID #: 7789201 Collected: 2025-05-10 Address:
123 Main Street, FALL RIVER
NIAGARA FALLS, NY 14304
Received: 2025-05-10 Reported: 2025-05-12
Soc. Sec. #: XXX-XX-1234 DOB/Age/Sex: 1973 (Age: 52) M
Physician(s): Dr. Jameson. Peripheral sequestration, i.e. splenomegaly or hepatomegaly should be excluded to be sure if peripheral sequestration is not present.
The following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.
· Chromosome analysis cytogenetics. (ADDENDUM REPORT TO FOLLOW.)
· Leukemic immunophenotyping flow cytometry.

...., and there is no evidence of dysplasia.
Fr/ap MATERIAL RECEIVED 6 SLIDES LABELED 032-1902, COLLECTED 2025-05-10
SPECIMEN SOURCE: GASTRIC, ILEUM AND RANDOM COLON, BIOPSIES
REFERRING FACILITY: NY
"""

## Extending the Pipeline with New Stages

New and customized stages are added to enhance the capabilities of the existing pipeline.

In [6]:
document_assembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

splitter = (
            InternalDocumentSplitter()
            .setInputCols("document")
            .setOutputCol("splitter")
            .setSplitMode("recursive")
            .setSplitPatterns([r"\s+"])  # Split on whitespace (token-based); raw string avoids an invalid-escape warning
            .setPatternsAreRegex(True)
            .setChunkSize(512)    # Maximum chunk length: 512 characters
            .setChunkOverlap(50)
            .setEnableSentenceIncrement(True)  # Like sentenceDetector
)

tokenizer = (
    Tokenizer()
    .setInputCols("splitter")
    .setOutputCol("token")
)

### Create a Custom `CPT Code` Parser

Using `ContextualParserApproach`, a new parser is created to detect CPT (Current Procedural Terminology) codes within the text based on regex rules. This allows the pipeline to recognize a custom entity type not found in the standard de-identification pipeline.

In [10]:
cpt_rule = {
    "entity": "CPT_CODE",
    "ruleScope": "sentence",
    "regex": r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)",
    "matchScope": "token"
}

with open('cpt.json', 'w') as f:
    json.dump(cpt_rule, f)

cpt_parser = ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_cpt") \
    .setJsonPath("cpt.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

cpt_parser_pipeline = Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    cpt_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

cpt_parser_model = cpt_parser_pipeline.fit(empty_data)
cpt_parser_model.stages[-1].write().overwrite().save("./parsers/cpt_parser")

cpt_parser = ContextualParserModel.load("parsers/cpt_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_cpt")

In [11]:
annotations = LightPipeline(cpt_parser_model).annotate(text)

annotations["entity_cpt"]

['88305']
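
The regex in `cpt_rule` can also be checked outside Spark. The following is a minimal sketch using Python's `re` module, with `re.IGNORECASE` standing in for `setCaseSensitive(False)`. The helper name `find_cpt_codes` and the sample strings are illustrative assumptions, and `ContextualParserApproach` uses its own regex engine and token matching, so edge cases may differ.

```python
import re

# Same pattern as the "regex" field of cpt_rule.
# re.IGNORECASE approximates setCaseSensitive(False).
CPT_PATTERN = re.compile(
    r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)",
    re.IGNORECASE,
)

def find_cpt_codes(text):
    """Return the captured five-digit codes (group 1), without any 'CPT' prefix."""
    return [m.group(1) for m in CPT_PATTERN.finditer(text)]

samples = [
    "General Hospital Dr. Fan Gabriel 90210 CPT Code(s) A: 88305",
    "CPT: 88304",
    "ZIP 90210",    # five digits, but not in the 88xxx range
    "ID 1288305",   # 88305 is embedded in a longer number, so \b fails
]
for s in samples:
    print(repr(s), "->", find_cpt_codes(s))
```

Because the pattern hard-codes the `88` prefix, it only recognizes five-digit codes that start with 88; other code ranges would need a broader rule.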

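### Create a Custom `Specimen ID` Parser

Similarly, a second parser is created with `ContextualParserApproach` to extract specimen IDs from medical texts. Its matches are labeled `IDNUM`, so they are treated like the other identifier entities in the pipeline.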

In [12]:
with open('specimen.json', 'w') as f:
    json.dump({
        "entity": "IDNUM",
        "ruleScope": "sentence",
        # Raw string so that \s and \. reach the regex engine unchanged
        "regex": r"(?:Specimen(?:\s*(?:ID|Number|Code|#|No\.?)?:?)?\s*)?#?[A-Z]{1,5}[0-9]{2,4}-?[0-9]{3,6}",
        "contextLength": 25,
        "matchScope": "token"
    }, f)

specimen_parser = ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_specimen") \
    .setJsonPath("specimen.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

specimen_parser_pipeline = Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    specimen_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

specimen_parser_model = specimen_parser_pipeline.fit(empty_data)
specimen_parser_model.stages[-1].write().overwrite().save("./parsers/specimen_parser")

specimen_parser = ContextualParserModel.load("./parsers/specimen_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_specimen")

In [13]:
annotations = LightPipeline(specimen_parser_model).annotate(text)

annotations["entity_specimen"]

['#RD23-4897']

### **IOBTagger**

The `IOBTagger` is added to tag the entities recognized by the Named Entity Recognition (NER) model in the IOB (Inside, Outside, Beginning) format. This format provides a standard data structure required for training the NER model.

In [14]:
iobTagger = sparknlp_jsl.annotator.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")
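
To make the IOB format concrete, here is a small pure-Python sketch of how chunk spans can be projected onto tokens. It is not the `IOBTagger` implementation; the function `to_iob`, the tuple layouts, and the sample offsets are assumptions chosen for illustration, with inclusive end offsets.

```python
def to_iob(tokens, chunks):
    """Assign IOB tags to tokens.

    tokens: list of (text, begin, end) with inclusive character offsets.
    chunks: list of (begin, end, label) with inclusive character offsets.
    """
    tags = ["O"] * len(tokens)
    for c_begin, c_end, label in chunks:
        first = True
        for i, (_, t_begin, t_end) in enumerate(tokens):
            # A token belongs to the chunk when it lies fully inside the chunk span
            if t_begin >= c_begin and t_end <= c_end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags

# "Patient Name: John Lee"
tokens = [("Patient", 0, 6), ("Name", 8, 11), (":", 12, 12), ("John", 14, 17), ("Lee", 19, 21)]
chunks = [(14, 21, "NAME")]

for (text, _, _), tag in zip(tokens, to_iob(tokens, chunks)):
    print(f"{text}\t{tag}")
```

The first token of each chunk gets a `B-` prefix, later tokens of the same chunk get `I-`, and everything else is `O`, which is the layout the training step expects in its label column.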

### **Update the Chunk Merging Strategy**

The inputs of the ChunkMergeModel, which is responsible for merging entities from different NER models, are updated to include the entities generated by the newly created cpt_parser and specimen_parser. This ensures that all entities found by both the pretrained models and our custom parsers are consolidated.

In [15]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()
merger_input_cols

['entity_icd10',
 'entity_email',
 'entity_ip_address',
 'entity_age',
 'entity_medicalrecord',
 'entity_ssn',
 'entity_account',
 'entity_vin',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_country',
 'entity_state',
 'entity_zip',
 'entity_plate',
 'entity_dln',
 'entity_license']

In [16]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()

chunk_merge_rulebase = deid_pipeline.model.stages[35]\
      .setInputCols(["entity_cpt", "entity_specimen"] + merger_input_cols)
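
Conceptually, a chunk merger takes several chunk columns and resolves overlaps between them. The sketch below is a simplified stand-in, not `ChunkMergeModel` itself: it assumes a "longest chunk wins, then earliest begin" policy and lets ties go to the source listed first. The function `merge_chunks` and the sample chunks are hypothetical.

```python
def merge_chunks(*sources):
    """Merge chunk lists; on overlap keep the longest chunk.

    Each chunk is (begin, end, label) with an inclusive end offset.
    Ties are broken by earliest begin, then by source order (stable sort).
    """
    candidates = [chunk for source in sources for chunk in source]
    # Longest first, then earliest begin
    candidates.sort(key=lambda c: (-(c[1] - c[0] + 1), c[0]))
    kept = []
    for begin, end, label in candidates:
        overlaps = any(begin <= k_end and k_begin <= end for k_begin, k_end, _ in kept)
        if not overlaps:
            kept.append((begin, end, label))
    return sorted(kept)

ner_chunks = [(7, 14, "NAME"), (45, 51, "IDNUM")]
rule_chunks = [(45, 51, "MEDICALRECORD"), (100, 104, "CPT_CODE")]
print(merge_chunks(ner_chunks, rule_chunks))
```

In this toy run the duplicate span at 45 to 51 keeps the label from the first source, while the non-overlapping `CPT_CODE` chunk from the rule-based source is simply added.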

### Update the De-identification Blacklist

In [17]:
deid_pipeline.model.stages[38]

ChunkMergeModel_5a3f1e608447

In [18]:
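# Blacklist CPT_CODE in this merger so that CPT chunks are dropped from its output
# and are therefore left unmasked by the de-identification stages that follow.
deid_pipeline.model.stages[38] = deid_pipeline.model.stages[38]\
                                      .setBlackList(['CPT_CODE'])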

### Updated Stages

In [19]:
deid_pipeline.model.stages = (
    deid_pipeline.model.stages[:35]
    + [cpt_parser, specimen_parser, chunk_merge_rulebase]
    + deid_pipeline.model.stages[36:]
    + [iobTagger]
)
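
The slicing above replaces the single stage at index 35 with three stages and appends one more at the end. A toy list makes the index arithmetic easier to follow; the stage names and the shortened list length here are placeholders, not the real pipeline stages.

```python
# Toy stand-in for deid_pipeline.model.stages (the real list is much longer)
stages = [f"stage_{i}" for i in range(6)]
idx = 3  # position of the merger being replaced (35 in the notebook)

new_stages = (
    stages[:idx]                                            # everything before the merger
    + ["cpt_parser", "specimen_parser", "updated_merger"]   # parsers must run before the merger that reads them
    + stages[idx + 1:]                                      # everything after the original merger
    + ["iob_tagger"]                                        # appended last, after the final chunk column exists
)
print(new_stages)
```

One stage is removed and four are added, so the list grows by three and every stage after the merger shifts by two positions.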

In [20]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

## Save and Test the Modified Pipeline

In [21]:
empty_result = deid_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

deid_pipeline.model.write().overwrite().save("modified_pipeline")

In [22]:
# We are loading the pretrained pipeline using the `from_disk` method.
from sparknlp.pretrained import PretrainedPipeline

modified_pipeline = PretrainedPipeline.from_disk('modified_pipeline')

### Sample Result

In [24]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = modified_pipeline.transform(samples_df).cache()

In [25]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50,truncate=False)

+----------------------------------+-----+----+---------+----------+
|chunk                             |begin|end |ner_label|confidence|
+----------------------------------+-----+----+---------+----------+
|John Lee                          |7    |14  |NAME     |0.9999188 |
|7789201                           |45   |51  |IDNUM    |0.71      |
|1973                              |80   |83  |DATE     |0.99705976|
|52                                |87   |88  |AGE      |0.99993765|
|GH-556672                         |110  |118 |IDNUM    |0.8692267 |
|XXX-XX-1234                       |139  |149 |IDNUM    |0.85107154|
|123 Main Street                   |160  |174 |LOCATION |0.9999253 |
|FALL RIVER                        |177  |186 |LOCATION |0.9986915 |
|NIAGARA FALLS                     |189  |201 |LOCATION |0.9978828 |
|NY                                |204  |205 |LOCATION |0.9999924 |
|14304                             |207  |211 |LOCATION |0.73      |
|RD23-4897                        
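
The `begin` and `end` columns are inclusive character offsets (for example, `John Lee` spans 7 to 14, which is eight characters). As a rough illustration of how such offsets can drive masking, the sketch below replaces each span with its label in angle brackets. It is a simplified assumption about the idea, not the pipeline's de-identification stage; `mask_text` and the sample sentence are hypothetical.

```python
def mask_text(text, chunks):
    """Replace each chunk span with <LABEL>.

    chunks: list of (begin, end, label) with inclusive end offsets,
    assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for begin, end, label in sorted(chunks):
        pieces.append(text[cursor:begin])  # untouched text before the chunk
        pieces.append(f"<{label}>")        # placeholder for the sensitive span
        cursor = end + 1                   # inclusive end, so resume one past it
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "Name: John Lee MRN: 7789201 CPT: 88305"
chunks = [(6, 13, "NAME"), (20, 26, "IDNUM")]
print(mask_text(text, chunks))
```

Any span that never reaches the masking step, such as a chunk filtered out upstream, stays verbatim in the output, as `88305` does here.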

In [26]:
pd.set_option("display.max_colwidth", 1000)

result_df = result.selectExpr("text",
                              "mask_entity.result as masked_result",
                              "obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\nName: John Lee\nMedical Record Number (MRN): 7789201\nDate of Birth / Age / Sex: 1973 / 52 / Male\nAccession #: GH-556672\nSocial Security #: XXX-XX-1234\nAddress: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304\nSpecimen #: RD23-4897\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 032-1902\nDate Collected: 2025-05-10\nDate Received: 2025-05-10\nRequesting Physician: Dr. Jameson\nReferring Facility: General Hospital, New York City, NY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytome...","[\nName: <NAME>\nMedical Record Number (MRN): <IDNUM>\nDate of Birth / Age / Sex: <DATE> / <AGE> / Male\nAccession #: <IDNUM>\nSocial Security #: <IDNUM>\nAddress: <LOCATION>, <LOCATION>, <LOCATION>, <LOCATION> <LOCATION>\nSpecimen #: <IDNUM>\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled <IDNUM>\nDate Collected: <DATE>\nDate Received: <DATE>\nRequesting Physician: Dr. 
<NAME>\nReferring Facility: <LOCATION>, <LOCATION> City, <LOCATION>\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at <LOCATION>, <LOCATION> – <LOCATION>; <LOCATION> City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytometry):\nDate Reported:...","[\nName: Gillie Allan\nMedical Record Number (MRN): 0074518\nDate of Birth / Age / Sex: 1974 / 44 / Male\nAccession #: PU-663305\nSocial Security #: WWW-WW-8529\nAddress: 3255 Independence Street, 302 W MCNEESE ST, 4101 NW 89TH BLVD, 16100 SOUTH FREEWAY 59 KOCH AVE\nSpecimen #: SA52-9740\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 125-8415\nDate Collected: 2025-06-27\nDate Received: 2025-06-27\nRequesting Physician: Dr. Marchelle\nReferring Facility: 310 Ellis Street, 2000 Boise Ave City, 16100 SOUTH FREEWAY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at 103 North Street, 16100 SOUTH FREEWAY – 969 Lakeland Drive; 2000 Boise Ave City.\nChromosome Analysis (C..."


##  Prepare Data for Custom NER Model Training

In [27]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/refs/heads/master/data/ner/eng.train -O eng.train

from sparknlp.training import CoNLL
data_conll = CoNLL(includeDocId=True,explodeSentences=True).readDataset(spark, "./eng.train")
data_conll.show(2)


+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|                 pos|               label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     X|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [28]:
data_conll.count()

14041

In [29]:
input_spark_df = data_conll.select("doc_id", "text")
input_spark_df.show(2, truncate=50)

+------+------------------------------------------------+
|doc_id|                                            text|
+------+------------------------------------------------+
|     X|EU rejects German call to boycott British lamb .|
|     X|                                 Peter Blackburn|
+------+------------------------------------------------+
only showing top 2 rows



### Preprocess Data with the Modified Pipeline

Run the entire dataset through our modified pipeline. This generates the `splitter`, `token`, and `embeddings` annotations, plus the IOB labels in `ner_label`, which are required for the NER training downstream.

In [30]:
results = modified_pipeline.transform(input_spark_df)
results.columns

['doc_id',
 'text',
 'document',
 'splitter',
 'token',
 'embeddings',
 'ner_clinical_large',
 'ner_chunk_clinical_large',
 'ner_deid_generic_docwise',
 'ner_deid_docwise_subentity',
 'ner_deid_generic_docwise_merged_conll',
 'ner_chunk_generic_docwise',
 'ner_chunk_subentity_docwise',
 'ner_chunk_merged_docwise',
 'ner_zero_shot',
 'ner_chunk_zero_shot_raw',
 'ner_deid_subentity_docwise_new',
 'ner_chunk_subentity_docwise_new_chunk',
 'ner_chunk_zero_shot',
 'deid_merged_ner_chunk',
 'entity_icd10',
 'entity_ssn',
 'entity_account',
 'entity_dln',
 'entity_plate',
 'entity_vin',
 'entity_license',
 'entity_country',
 'entity_state',
 'entity_age',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_zip',
 'entity_medicalrecord',
 'entity_email',
 'entity_ip_address',
 'entity_cpt',
 'entity_specimen',
 'deid_merged_ner_rulebased',
 'ner_chunk_raw',
 'ner_chunk_processed',
 'ner_chunk',
 'mask_entity',
 'obfuscated',
 'ner_label']

In [31]:
result_df = results.select('doc_id','text','document','splitter',
                          'token',"embeddings", 'ner_label')

In [32]:
result_df.show(2, truncate=40)

+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|doc_id|                                    text|                                document|                                splitter|                                   token|                              embeddings|                               ner_label|
+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|     X|EU rejects German call to boycott Bri...|[{document, 0, 47, EU rejects German ...|[{document, 0, 48, EU rejects German ...|[{token, 0, 1, EU, {sentence -> 0}, [...|[{word_embeddings, 0, 1, EU, {isOOV -...|[{named_entity, 0, 1, 

### Persist Preprocessed Data

Save the annotated DataFrame to Parquet format. This is an optimization step to speed up the training process by avoiding re-computation.

In [33]:
%%time

n_partitions = 48

# WRITING THE DATA
result_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save(f"./data/result_df_{n_partitions}.parquet")


CPU times: user 4.41 s, sys: 771 ms, total: 5.18 s
Wall time: 19min 45s


## Train a Custom Medical NER Model

In [34]:
# READING THE DATA
n_partitions = 48
result_df = spark.read \
    .parquet(f"./data/result_df_{n_partitions}.parquet")\
    .repartition(n_partitions)

In [35]:
result_df.count()

14041

In [40]:
result_df.show(2)

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            splitter|               token|          embeddings|           ner_label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|" Our policy has ...|[{document, 0, 12...|[{document, 0, 12...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|
|     X|Issuer : Birmingh...|[{document, 0, 38...|[{document, 0, 39...|[{token, 0, 5, Is...|[{word_embeddings...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [36]:
(train_df, test_df) = result_df.randomSplit([0.8, 0.2], seed = 42)

In [37]:
test_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save(f"./data/test_df.parquet")

### Use MedicalNerDLGraphChecker for NER

The `MedicalNerDLGraphChecker` processes the dataset to extract the required graph parameters (tokens, labels, and embedding dimensions) before training starts.

In [41]:
embeddings = (WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
            .setInputCols(["splitter", "token"])
            .setOutputCol("embeddings"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [42]:
nerDLGraphChecker = MedicalNerDLGraphChecker()\
    .setInputCols(["splitter", "token"])\
    .setLabelColumn("ner_label")\
    .setEmbeddingsModel(embeddings)
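
As a rough picture of what "graph parameters" means, the sketch below counts distinct labels and distinct characters in a toy dataset and pairs them with an embedding dimension. This is an assumption about the kind of quantities involved (the training log printed later in this notebook reports `labels` and `chars`), not the checker's actual logic. The function `graph_params`, the toy sentence, and the value 200 are illustrative.

```python
def graph_params(sentences, embedding_dim):
    """Summarize a labeled dataset.

    sentences: list of sentences, each a list of (token, label) pairs.
    embedding_dim: dimension of the word embeddings used for training.
    """
    labels = {label for sentence in sentences for _, label in sentence}
    chars = {ch for sentence in sentences for token, _ in sentence for ch in token}
    return {
        "n_labels": len(labels),
        "embedding_dim": embedding_dim,
        "n_chars": len(chars),
    }

toy = [[("John", "B-NAME"), ("Lee", "I-NAME"), ("in", "O"), ("NYC", "B-LOCATION")]]
print(graph_params(toy, embedding_dim=200))
```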

###  Configure and Run the MedicalNerApproach

In [43]:
nerTagger = MedicalNerApproach()\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setLabelColumn("ner_label")\
    .setOutputCol("ner")\
    .setMaxEpochs(30)\
    .setBatchSize(8)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')\
    .setEarlyStoppingCriterion(0.01)\
    .setEarlyStoppingPatience(5)\
    .setUseBestModel(False)
    # Optional settings (append to the chain above to enable them):
    # .setTestDataset("./data/test_df.parquet")
    # .setEnableMemoryOptimizer(True)  # train batch by batch when memory is limited and the CoNLL file is large
    # .setDatasetInfo("NCBI_sample_short dataset")  # free-text details about the dataset

ner_pipeline = Pipeline(
    stages=[
          nerDLGraphChecker,
          nerTagger
 ])
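
The configuration above combines an early-stopping criterion of 0.01 with a patience of 5. A generic patience-based rule can be sketched as follows; which metric Spark NLP monitors and exactly how it compares values are assumptions here, and `epochs_until_stop` with its sample scores is purely illustrative.

```python
def epochs_until_stop(val_scores, min_delta=0.01, patience=5):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs fail to improve
    the best score by more than `min_delta`.
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            best = score
            stale = 0      # meaningful improvement resets the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

scores = [0.50, 0.70, 0.80, 0.805, 0.806, 0.807, 0.808, 0.809, 0.81]
print(epochs_until_stop(scores))
```

In this toy series the score keeps creeping up after epoch 3, but never by more than 0.01, so five stale epochs accumulate and training would end well before the 30-epoch maximum.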

In [44]:
%%time
ner_model = ner_pipeline.fit(train_df)

CPU times: user 13.7 s, sys: 1.79 s, total: 15.5 s
Wall time: 47min 34s


In [45]:
ner_model.stages[-1].getTrainingClassDistribution()

{'I-NAME': 4461, 'I-CONTACT': 206, 'I-AGE': 31, 'I-IDNUM': 58, 'B-DATE': 3529, 'I-DATE': 494, 'I-LOCATION': 3782, 'B-NAME': 5216, 'B-AGE': 571, 'B-LOCATION': 10548, 'B-IDNUM': 152, 'O': 135913, 'B-CONTACT': 318}
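
The counts above can be converted to shares to see how the labels are balanced. This sketch uses only the dictionary printed by `getTrainingClassDistribution()`; the variable names are arbitrary.

```python
# Counts copied from the training class distribution above
dist = {
    'I-NAME': 4461, 'I-CONTACT': 206, 'I-AGE': 31, 'I-IDNUM': 58,
    'B-DATE': 3529, 'I-DATE': 494, 'I-LOCATION': 3782, 'B-NAME': 5216,
    'B-AGE': 571, 'B-LOCATION': 10548, 'B-IDNUM': 152, 'O': 135913,
    'B-CONTACT': 318,
}

total = sum(dist.values())
# Share of all training tokens per label, most frequent first
shares = {
    label: count / total
    for label, count in sorted(dist.items(), key=lambda kv: -kv[1])
}
for label, share in shares.items():
    print(f"{label:<12}{share:7.2%}")
```

Most tokens carry the `O` label, and labels such as `I-AGE` and `I-IDNUM` have only a few dozen examples, which is worth keeping in mind when reading the per-label scores.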

### Save the Trained NER Model and Review Logs

In [46]:
ner_model.stages[-1].write().overwrite().save('models/new_NER_model')

In [47]:
import os
log_file= os.listdir("ner_logs")[0]

with open (f"./ner_logs/{log_file}") as f:
    print(f.read())

Name of the selected graph: medical-ner-dl/blstm_100_200_128_100.pb
Training started - total epochs: 30 - lr: 0.001 - batch size: 8 - labels: 13 - chars: 84 - training examples: 11192


Epoch 1/30 started, lr: 0.001, dataset size: 11192


Epoch 1/30 - 80.17s - loss: 5240.444 - avg training loss: 4.6915345 - batches: 1117
Quality on validation dataset (20.0%), validation examples = 2238
time to finish evaluation: 11.82s
Total validation loss: 897.1212	Avg validation loss: 3.1259
label	 tp	 fp	 fn	 prec	 rec	 f1
I-NAME	 766	 112	 189	 0.87243736	 0.8020942	 0.8357883
I-CONTACT	 0	 0	 44	 0.0	 0.0	 0.0
I-AGE	 0	 0	 9	 0.0	 0.0	 0.0
I-IDNUM	 0	 0	 11	 0.0	 0.0	 0.0
B-DATE	 626	 33	 152	 0.9499241	 0.80462724	 0.8712595
I-DATE	 72	 4	 43	 0.94736844	 0.62608695	 0.7539267
I-LOCATION	 294	 63	 491	 0.8235294	 0.3745223	 0.5148862
B-NAME	 902	 351	 178	 0.7198723	 0.83518517	 0.7732533
B-AGE	 23	 4	 85	 0.8518519	 0.21296297	 0.34074077
B-LOCATION	 1597	 238	 622	 0.87029976	 0.71969354	 0.78

## Evaluate the Newly Trained NER Model

In [48]:
pred_df = ner_model.stages[-1].transform(test_df).cache()

In [49]:
pred_df.show()

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            splitter|               token|          embeddings|           ner_label|                 ner|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|" I have told Ban...|[{document, 0, 23...|[{document, 0, 23...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|" Our policy has ...|[{document, 0, 12...|[{document, 0, 12...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|" The workers kee...|[{document, 0, 12...|[{document, 0, 12...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|( Spain ) , Paul ...|[{document, 0, 63...|[{document, 0,

In [50]:
from pyspark.sql import functions as F

pred_token_df = pred_df.select(F.explode(F.arrays_zip(pred_df.ner_label.metadata,
                                                  pred_df.ner_label.begin,
                                                  pred_df.ner_label.end,
                                                  pred_df.ner_label.result,
                                                  pred_df.ner.result)).alias("cols")) \
          .select(F.expr("cols['0']['word']").alias("token"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']").alias("gtruth"),
                  F.expr("cols['4']").alias("prediction"))\
          .toPandas()

pred_token_df

Unnamed: 0,token,begin,end,gtruth,prediction
0,"""",0,0,O,O
1,I,2,2,O,O
2,have,4,7,O,O
3,told,9,12,O,O
4,Bangladesh,14,23,B-LOCATION,B-LOCATION
...,...,...,...,...,...
42711,Pakistan,41,48,B-LOCATION,B-LOCATION
42712,at,50,51,O,O
42713,The,53,55,B-LOCATION,B-LOCATION
42714,Oval,57,60,I-LOCATION,I-LOCATION


### Calculate Evaluation Metrics
Use the NerDLMetrics class to compute precision, recall, and F1-score for each entity. The evaluation is shown with both `full_chunk` and `partial_chunk_per_token` modes.

In [51]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

evaler = NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"),
                                          prediction_col="ner",
                                          label_col="ner_label",
                                          drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# .show() prints the table itself and returns None, so it is not wrapped in print()
eval_result.selectExpr("avg(f1) as macro").show()
# The "micro" value below is the support-weighted mean of the per-entity F1 scores
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+------+-----+-----+------+---------+------+------+
|  entity|    tp|   fp|   fn| total|precision|recall|    f1|
+--------+------+-----+-----+------+---------+------+------+
| CONTACT|  67.0|  7.0| 12.0|  79.0|   0.9054|0.8481|0.8758|
|    NAME|1210.0|105.0| 97.0|1307.0|   0.9202|0.9258| 0.923|
|    DATE| 891.0| 27.0| 22.0| 913.0|   0.9706|0.9759|0.9732|
|LOCATION|2465.0|234.0|255.0|2720.0|   0.9133|0.9063|0.9098|
|     AGE| 121.0| 11.0| 15.0| 136.0|   0.9167|0.8897| 0.903|
|   IDNUM|  28.0|  5.0| 13.0|  41.0|   0.8485|0.6829|0.7568|
+--------+------+-----+-----+------+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8902531689644534|
+------------------+

+------------------+
|             micro|
+------------------+
|0.9223345083066614|
+------------------+
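
The summary figures can be reproduced from the `tp`, `fp`, and `fn` columns of the `full_chunk` table. The sketch below recomputes per-entity F1, the macro average, and the support-weighted average that the cell labels `micro`. It also computes a pooled micro-F1 from the summed counts, which is a different quantity and is included only for comparison. All variable names are illustrative.

```python
# (entity, tp, fp, fn) copied from the full_chunk table
rows = [
    ("CONTACT", 67, 7, 12),
    ("NAME", 1210, 105, 97),
    ("DATE", 891, 27, 22),
    ("LOCATION", 2465, 234, 255),
    ("AGE", 121, 11, 15),
    ("IDNUM", 28, 5, 13),
]

def f1(tp, fp, fn):
    """F1 expressed directly in counts; equal to 2PR / (P + R)."""
    return 2 * tp / (2 * tp + fp + fn)

per_entity = {name: f1(tp, fp, fn) for name, tp, fp, fn in rows}
support = {name: tp + fn for name, tp, fp, fn in rows}  # the "total" column

# Unweighted mean of per-entity F1
macro = sum(per_entity.values()) / len(per_entity)
# Mean of per-entity F1 weighted by support (what the notebook calls "micro")
weighted = sum(per_entity[n] * support[n] for n in per_entity) / sum(support.values())
# Micro-F1 from pooled counts, for comparison
pooled_micro = f1(sum(r[1] for r in rows), sum(r[2] for r in rows), sum(r[3] for r in rows))

print(f"macro={macro:.4f} weighted={weighted:.4f} pooled_micro={pooled_micro:.4f}")
```

The macro average is pulled down by `IDNUM`, which has the lowest F1 and the smallest support, while the weighted figure is dominated by `LOCATION` and `NAME`.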


In [52]:
evaler = NerDLMetrics(mode="partial_chunk_per_token")
eval_result_partial = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"), prediction_col="ner", label_col="ner_label", drop_o = True, case_sensitive = True).cache()

eval_result_partial.withColumn("precision", F.round(eval_result_partial["precision"],4))\
           .withColumn("recall", F.round(eval_result_partial["recall"],4))\
           .withColumn("f1", F.round(eval_result_partial["f1"],4)).sort("entity").show(100)
df_partial=eval_result_partial.toPandas()
print("partial_chunk_per_token")
# .show() prints the table itself and returns None, so it is not wrapped in print()
eval_result_partial.selectExpr("avg(f1) as macro").show()
# The "micro" value below is the support-weighted mean of the per-entity F1 scores
eval_result_partial.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+------+-----+-----+------+---------+------+------+
|  entity|    tp|   fp|   fn| total|precision|recall|    f1|
+--------+------+-----+-----+------+---------+------+------+
|     AGE| 126.0| 11.0| 16.0| 142.0|   0.9197|0.8873|0.9032|
| CONTACT| 113.0|  9.0| 21.0| 134.0|   0.9262|0.8433|0.8828|
|    DATE|1037.0| 24.0| 23.0|1060.0|   0.9774|0.9783|0.9778|
|   IDNUM|  33.0| 12.0| 19.0|  52.0|   0.7333|0.6346|0.6804|
|LOCATION|3436.0|299.0|303.0|3739.0|   0.9199| 0.919|0.9195|
|    NAME|2219.0|133.0|115.0|2334.0|   0.9435|0.9507|0.9471|
+--------+------+-----+-----+------+---------+------+------+

partial_chunk_per_token
+------------------+
|             macro|
+------------------+
|0.8851369706910256|
+------------------+

+------------------+
|             micro|
+------------------+
|0.9337572286730484|
+------------------+


## Create the Final Pipeline with the Custom NER Model

In [53]:
# We are loading the pretrained pipeline using the `from_disk` method.
from sparknlp.pretrained import PretrainedPipeline

modified_pipeline = PretrainedPipeline.from_disk('modified_pipeline')

In [54]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### New Stages

In [55]:
# Custom NER model (the newly trained model), loaded from disk
ner_deid_new = MedicalNerModel.load("models/new_NER_model")\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setOutputCol("ner_deid_new")

ner_deid_new_converter = NerConverter()\
    .setInputCols(["splitter", "token", "ner_deid_new"])\
    .setOutputCol("ner_chunk_new")

# Pretrained de-identification NER model
ner_deid = MedicalNerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setOutputCol("ner_deid_subentity_docwise")

ner_deid_converter = NerConverter()\
    .setInputCols(["splitter", "token", "ner_deid_subentity_docwise"])\
    .setOutputCol("ner_chunk_subentity_docwise")

# Merge the chunks produced by the two NER models
chunk_merge_ner = ChunkMergeModel()\
    .setInputCols("ner_chunk_new", # New Trained Model
                  "ner_chunk_subentity_docwise")\
    .setOutputCol("deid_merged_ner_chunk")\
    .setOrderingFeatures(["ChunkLength","ChunkBegin"])\
    .setMergeOverlapping(True)\
    .setResetSentenceIndices(True)


ner_deid_subentity_docwise download started this may take some time.
Approximate size to download 8.9 MB
[OK!]
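To make the merge settings above more concrete, here is a minimal pure-Python sketch of one way overlapping chunks from two NER outputs could be resolved: prefer longer chunks, then earlier start offsets. It only illustrates the idea suggested by the `ChunkLength`, `ChunkBegin` ordering and is not the `ChunkMergeModel` implementation. The `Chunk` type, the `merge_chunks` function, and the sample spans are hypothetical.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    begin: int   # inclusive start offset
    end: int     # inclusive end offset
    label: str
    source: str  # name of the column that produced the chunk


def overlaps(a: Chunk, b: Chunk) -> bool:
    """True when the two inclusive spans share at least one character."""
    return a.begin <= b.end and b.begin <= a.end


def merge_chunks(*chunk_lists: List[Chunk]) -> List[Chunk]:
    """Keep the longest chunk among overlapping ones; ties go to the earlier begin."""
    candidates = [c for chunks in chunk_lists for c in chunks]
    candidates.sort(key=lambda c: (-(c.end - c.begin + 1), c.begin))
    kept: List[Chunk] = []
    for c in candidates:
        if not any(overlaps(c, k) for k in kept):
            kept.append(c)
    return sorted(kept, key=lambda c: c.begin)


# Hypothetical sample spans, not real model output
new_model = [Chunk(7, 14, "NAME", "ner_chunk_new"),
             Chunk(45, 51, "IDNUM", "ner_chunk_new")]
docwise = [Chunk(7, 10, "PATIENT", "ner_chunk_subentity_docwise"),
           Chunk(80, 83, "DATE", "ner_chunk_subentity_docwise")]

merged = merge_chunks(new_model, docwise)
for chunk in merged:
    print(chunk)
```

In this toy input the shorter `PATIENT` span is dropped because it overlaps the longer `NAME` span, while the non-overlapping `DATE` span from the second list is kept.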


### Update Stages

In [56]:
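modified_pipeline.model.stages = (
    # Keep the first four stages: document assembler, splitter, tokenizer, embeddings
    modified_pipeline.model.stages[:4]
    # Replace the original NER, converter, and merge stages (indices 4-17)
    + [ner_deid_new,
       ner_deid_new_converter,
       ner_deid,
       ner_deid_converter,
       chunk_merge_ner]
    # Keep everything from the first CONTEXTUAL-PARSER stage onward
    + modified_pipeline.model.stages[18:]
)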
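The hard-coded indices `4` and `18` above depend on the exact stage order listed earlier. As a hedged alternative, the sketch below finds the splice points by stage name on a list of plain strings. `splice_stages` is a hypothetical helper, and the shortened names stand in for the string form of the real stage objects.

```python
from typing import List


def splice_stages(stages: List[str], new_stages: List[str],
                  first_removed_prefix: str, resume_prefix: str) -> List[str]:
    """Replace the run that starts at the first stage matching
    `first_removed_prefix` and ends just before the next stage
    matching `resume_prefix`."""
    start = next(i for i, s in enumerate(stages)
                 if s.startswith(first_removed_prefix))
    resume = next(i for i, s in enumerate(stages)
                  if i > start and s.startswith(resume_prefix))
    return stages[:start] + new_stages + stages[resume:]


# Shortened, hypothetical stage names in the same order of roles as the real pipeline
original = [
    "DocumentAssembler", "InternalDocumentSplitter",
    "REGEX_TOKENIZER", "WORD_EMBEDDINGS_MODEL",
    "MedicalNerModel_a", "NER_CONVERTER_a", "ChunkMergeModel_a",
    "CONTEXTUAL-PARSER_1", "CONTEXTUAL-PARSER_2",
]
replacement = ["MedicalNerModel_new", "NerConverter_new", "ChunkMergeModel_new"]

updated = splice_stages(original, replacement,
                        first_removed_prefix="MedicalNerModel",
                        resume_prefix="CONTEXTUAL-PARSER")
print(updated)
```

With real stages, the same lookup could be applied to `str(stage)` for each element of `modified_pipeline.model.stages`; that mapping is an assumption and should be checked against the printed stage list.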

In [57]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_9f6778063b3b,
 NerConverter_40a17057accb,
 MedicalNerModel_32184c1db80b,
 NerConverter_1b62e3a80e9f,
 ChunkMergeModel_c577e9aa610b,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 CONTEXTUAL-PARSER_f8b8f9aafb9f,
 CONTEXTUAL-PARSER_7f824493eafc,
 REGEX_MATCHER_26934077fe57,
 REGEX_MATCHER_5fe3de8b5a4e,
 CONTEXTUAL-PARSER_92044d777a8a,
 CONTEXTUAL-PARSER_2a97125c9b93,
 MERGE_ddff59e8b14a,
 ChunkMergeModel_50feb5f97568,
 ContextualEntityRuler_08eeaa89c938,
 ChunkMe

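### Reassemble and Save the Final Pipeline

The stage list now has the original NER components replaced by our new custom NER model, the pretrained `ner_deid_subentity_docwise` model, and the reconfigured merger. Run the updated pipeline once on an empty text as a quick check, then save the final pipeline to disk as `new_pipeline`.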

In [58]:
empty_result = modified_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

modified_pipeline.model.write().overwrite().save("new_pipeline")

In [59]:
from sparknlp.pretrained import PretrainedPipeline

new_pipeline = PretrainedPipeline.from_disk('new_pipeline')

## Final Test of the New Pipeline

In [60]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = new_pipeline.transform(samples_df).cache()

In [61]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50,truncate=False)

+------------------------+-----+----+---------+----------+
|chunk                   |begin|end |ner_label|confidence|
+------------------------+-----+----+---------+----------+
|John Lee                |7    |14  |NAME     |0.9422    |
|7789201                 |45   |51  |IDNUM    |0.71      |
|1973                    |80   |83  |DATE     |0.9997    |
|52                      |87   |88  |DATE     |0.4974    |
|GH-556672               |110  |118 |IDNUM    |0.7311    |
|XXX-XX-1234             |139  |149 |IDNUM    |0.4833    |
|123 Main Street         |160  |174 |LOCATION |0.7568334 |
|FALL RIVER              |177  |186 |LOCATION |0.65885   |
|NIAGARA FALLS           |189  |201 |LOCATION |0.54719996|
|NY                      |204  |205 |LOCATION |0.9998    |
|14304                   |207  |211 |LOCATION |0.73      |
|RD23-4897               |225  |233 |IDNUM    |0.50      |
|032-1902                |331  |338 |DATE     |0.4176    |
|2025-05-10              |356  |365 |DATE     |NULL     
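The `arrays_zip` plus `explode` pattern above turns four parallel arrays into one row per chunk. The plain-Python sketch below shows the same reshaping without Spark, using three rows from the table above as sample data. The `explode_chunks` helper is hypothetical, and the metadata dictionaries are simplified to the two keys read here.

```python
from typing import Any, Dict, List


def explode_chunks(results: List[str], begins: List[int], ends: List[int],
                   metadata: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Zip the parallel arrays and emit one dict per chunk."""
    rows = []
    for chunk, begin, end, meta in zip(results, begins, ends, metadata):
        rows.append({
            "chunk": chunk,
            "begin": begin,
            "end": end,
            "ner_label": meta.get("entity"),
            # A missing key yields None, analogous to NULL in the Spark output
            "confidence": meta.get("confidence"),
        })
    return rows


# Values copied from the output table; metadata reduced to the keys used above
results = ["John Lee", "7789201", "2025-05-10"]
begins = [7, 45, 356]
ends = [14, 51, 365]
metadata = [
    {"entity": "NAME", "confidence": "0.9422"},
    {"entity": "IDNUM", "confidence": "0.71"},
    {"entity": "DATE"},  # no confidence entry for this chunk
]

for row in explode_chunks(results, begins, ends, metadata):
    print(row)
```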

In [62]:
pd.set_option("display.max_colwidth", 1000)
result_df = result.selectExpr("text","mask_entity.result as masked_result","obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\nName: John Lee\nMedical Record Number (MRN): 7789201\nDate of Birth / Age / Sex: 1973 / 52 / Male\nAccession #: GH-556672\nSocial Security #: XXX-XX-1234\nAddress: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304\nSpecimen #: RD23-4897\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 032-1902\nDate Collected: 2025-05-10\nDate Received: 2025-05-10\nRequesting Physician: Dr. Jameson\nReferring Facility: General Hospital, New York City, NY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytome...","[\nName: <NAME>\nMedical Record Number (MRN): <IDNUM>\nDate of Birth / Age / Sex: <DATE> / <DATE> / Male\nAccession #: <IDNUM>\nSocial Security #: <IDNUM>\nAddress: <LOCATION>, <LOCATION>, <LOCATION>, <LOCATION> <LOCATION>\nSpecimen #: <IDNUM>\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled <DATE>\nDate Collected: <DATE>\nDate Received: <DATE>\nRequesting Physician: Dr. 
<NAME>\nReferring Facility: <LOCATION>, <LOCATION> City, <LOCATION>\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at <LOCATION> <LOCATION>, <LOCATION> – <LOCATION>; <LOCATION> City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytometry):\nDat...","[\nName: Gillie Allan\nMedical Record Number (MRN): 0074518\nDate of Birth / Age / Sex: 1974 / 53 / Male\nAccession #: PU-663305\nSocial Security #: WWW-WW-8529\nAddress: 3255 Independence Street, 302 W MCNEESE ST, 4101 NW 89TH BLVD, 16100 SOUTH FREEWAY 59 KOCH AVE\nSpecimen #: SA52-9740\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 202-1902\nDate Collected: 2025-06-27\nDate Received: 2025-06-27\nRequesting Physician: Dr. Marchelle\nReferring Facility: 310 Ellis Street, 2000 Boise Ave City, 16100 SOUTH FREEWAY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at 227 Mountain Dr 55 Nicomedes Rivera Street, 16100 SOUTH FREEWAY – 969 Lakeland Drive; 2000 Boise Ave Cit..."
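To show how the `begin` and `end` offsets from the chunk table relate to the masked text, here is a minimal pure-Python sketch that replaces spans with `<LABEL>` placeholders. It is an illustration only, not how the pipeline's de-identification stages are implemented; `mask_text` is a hypothetical helper. The sample string is the start of the input text, and the two spans come from the chunk table, where `end` is inclusive.

```python
from typing import List, Tuple


def mask_text(text: str, chunks: List[Tuple[int, int, str]]) -> str:
    """Replace each (begin, end, label) span with <label>; `end` is inclusive."""
    masked = text
    # Work from right to left so earlier offsets stay valid after each replacement
    for begin, end, label in sorted(chunks, reverse=True):
        masked = masked[:begin] + f"<{label}>" + masked[end + 1:]
    return masked


sample = "\nName: John Lee\nMedical Record Number (MRN): 7789201\n"
chunks = [(7, 14, "NAME"), (45, 51, "IDNUM")]

masked = mask_text(sample, chunks)
print(masked)
```

Replacing from the last span backwards avoids having to shift offsets when a placeholder is shorter or longer than the text it replaces.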
