![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.14.End2End_Preannotation_and_Training_Pipeline.ipynb)

# **End2End Preannotation and Training Pipeline**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [4]:
import os
import json
import numpy as np
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/6.1.1.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==6.1.3, 💊Spark-Healthcare==6.1.1, running on ⚡ PySpark==3.4.0


## Loading the Pretrained Pipeline

Spark NLP's pretrained pipeline, `clinical_deidentification_docwise_benchmark`, is loaded. This pipeline is designed to mask and obfuscate sensitive information in medical texts, such as names, ID numbers, contact information, locations, ages, and dates. The existing stages of the pipeline are examined to understand its structure.

In [5]:
deid_pipeline = nlp.PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models")

clinical_deidentification_docwise_benchmark download started this may take some time.
Approx size to download 2.3 GB
[OK!]


In [6]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### Sample Text

In [7]:
text = """
Name: John Lee
Medical Record Number (MRN): 7789201
Date of Birth / Age / Sex: 1973 / 52 / Male
Accession #: GH-556672
Social Security #: XXX-XX-1234
Address: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304
Specimen #: RD23-4897
Material Received: Gastric, Ileum, and Random Colon Biopsies
Material Details: 6 slides labeled 032-1902
Date Collected: 2025-05-10
Date Received: 2025-05-10
Requesting Physician: Dr. Jameson
Referring Facility: General Hospital, New York City, NY
Clinical History: None Given.
Clinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.
CPT Code(s): 88305
Gastric, ileum, and random colon, biopsies:
No evidence of dysplasia.
The following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.
Chromosome Analysis (Cytogenetics): (Addendum report to follow.)
Leukemic Immunophenotyping (Flow Cytometry):
Date Reported: 2025-05-12, 16:30
Electronically Signed Out By:
Dr. Smith
Dr. Carter, CT(ASCP)
"""

## Extending the Pipeline with New Stages

New and customized stages are added to enhance the capabilities of the existing pipeline.

In [8]:
document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

splitter = (
            medical.InternalDocumentSplitter()
            .setInputCols("document")
            .setOutputCol("splitter")
            .setSplitMode("recursive")
            .setSplitPatterns([r"\s+"])  # Split on whitespace (token-based)
            .setPatternsAreRegex(True)
            .setChunkSize(512)    # Chunk length of 512 characters
            .setChunkOverlap(50)
            .setEnableSentenceIncrement(True)  # Like sentenceDetector
)

tokenizer = (
    nlp.Tokenizer()
    .setInputCols("splitter")
    .setOutputCol("token")
)

### Create a Custom `CPT Code` Parser

Using `ContextualParserApproach`, a new parser is created to detect CPT (Current Procedural Terminology) codes within the text based on regex rules. This allows the pipeline to recognize a custom entity type not found in the standard de-identification pipeline.

In [9]:
cpt_rule = {
    "entity": "CPT_CODE",
    "ruleScope": "sentence",
    "regex": r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)",
    "matchScope": "token"
}

with open('cpt.json', 'w') as f:
    json.dump(cpt_rule, f)

cpt_parser = medical.ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_cpt") \
    .setJsonPath("cpt.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

cpt_parser_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    cpt_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

cpt_parser_model = cpt_parser_pipeline.fit(empty_data)


In [10]:
cpt_parser_model.stages[-1].write().overwrite().save("./parsers/cpt_parser")

cpt_parser = medical.ContextualParserModel.load("parsers/cpt_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_cpt")

In [11]:
annotations = nlp.LightPipeline(cpt_parser_model).annotate(text)

annotations["entity_cpt"]

['88305']
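
As a quick sanity check outside Spark, the same pattern can be exercised with Python's standard `re` module. This is only an approximation: the parser applies the rule inside Spark with its own token and sentence scoping, while the sketch below scans whole strings and uses `re.IGNORECASE` as a stand-in for `setCaseSensitive(False)`. The helper name `find_cpt_codes` and the sample strings other than `CPT Code(s): 88305` are made up for illustration.

```python
import re

# Same pattern as the "regex" field written to cpt.json.
CPT_PATTERN = re.compile(
    r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)",
    re.IGNORECASE,
)


def find_cpt_codes(text):
    """Return the captured 5-digit code (group 1) of every match in text."""
    return [match.group(1) for match in CPT_PATTERN.finditer(text)]


# Hypothetical sample strings; only the first one comes from the sample report.
for sample in ["CPT Code(s): 88305", "CPT# 88112", "MRN: 7789201", "Zip 14304"]:
    print(sample, "->", find_cpt_codes(sample))
```

Only values that start with `88` and are exactly five digits long are captured, so the MRN and the ZIP code in the sample report are not confused with CPT codes.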

### Create a Custom `Specimen ID` Parser

Similarly, another parser is created with `ContextualParserApproach` to extract specimen IDs from medical texts.

In [12]:
with open('specimen.json', 'w') as f:
    json.dump({
        "entity": "IDNUM",
        "ruleScope": "sentence",
        "regex": r"(?:Specimen(?:\s*(?:ID|Number|Code|#|No\.?)?:?)?\s*)?#?[A-Z]{1,5}[0-9]{2,4}-?[0-9]{3,6}",
        "contextLength": 25,
        "matchScope": "token"
    }, f)

specimen_parser = medical.ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_specimen") \
    .setJsonPath("specimen.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

specimen_parser_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    specimen_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

specimen_parser_model = specimen_parser_pipeline.fit(empty_data)


In [13]:
specimen_parser_model.stages[-1].write().overwrite().save("./parsers/specimen_parser")

specimen_parser = medical.ContextualParserModel.load("./parsers/specimen_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_specimen")

In [14]:
annotations = nlp.LightPipeline(specimen_parser_model).annotate(text)

annotations["entity_specimen"]

['RD23-4897']

### IOBTagger

The `IOBTagger` is added to tag the entities recognized by the Named Entity Recognition (NER) model in the IOB (Inside, Outside, Beginning) format. This format provides a standard data structure required for training the NER model.

In [15]:
iobTagger = medical.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")
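
To make the IOB scheme concrete, here is a minimal pure-Python sketch of how chunk spans map to per-token tags. It is not the `IOBTagger` implementation; the helper `chunks_to_iob` and the tuple layouts are assumptions chosen for illustration, with inclusive character offsets as in the Spark NLP outputs shown in this notebook.

```python
def chunks_to_iob(tokens, chunks):
    """Tag each token as B-<label>, I-<label>, or O.

    tokens: list of (text, begin, end) with inclusive character offsets.
    chunks: list of (begin, end, label) with inclusive character offsets.
    """
    tags = []
    for _, tok_begin, tok_end in tokens:
        tag = "O"
        for chunk_begin, chunk_end, label in chunks:
            if tok_begin >= chunk_begin and tok_end <= chunk_end:
                # The first token of a chunk gets B-, the following ones get I-.
                prefix = "B-" if tok_begin == chunk_begin else "I-"
                tag = prefix + label
                break
        tags.append(tag)
    return tags


# Illustrative tokens for the string "Name: John Lee".
tokens = [("Name", 0, 3), (":", 4, 4), ("John", 6, 9), ("Lee", 11, 13)]
chunks = [(6, 13, "NAME")]

for (text, _, _), tag in zip(tokens, chunks_to_iob(tokens, chunks)):
    print(text, tag)
```

Each token ends up with exactly one tag, which is the per-token label layout that NER training consumes.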

### Update the Chunk Merging Strategy

The inputs of the `ChunkMergeModel` at stage index 35, which is responsible for merging entities from different annotators, are updated to include the entities generated by the newly created `cpt_parser` and `specimen_parser`. This ensures that the entities found by the pretrained models and by our custom parsers are consolidated.

In [16]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()
merger_input_cols

['entity_icd10',
 'entity_email',
 'entity_ip_address',
 'entity_age',
 'entity_medicalrecord',
 'entity_ssn',
 'entity_account',
 'entity_vin',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_country',
 'entity_state',
 'entity_zip',
 'entity_plate',
 'entity_dln',
 'entity_license']

In [17]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()

chunk_merge_rulebase = deid_pipeline.model.stages[35]\
      .setInputCols(["entity_cpt", "entity_specimen"] + merger_input_cols)

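### Update the De-identification Blacklist

CPT codes are procedure codes rather than patient identifiers, so the `CPT_CODE` label is added to the blacklist of the final `ChunkMergeModel` (stage index 38). Blacklisted entities are ignored by the merger, which is why `CPT Code(s): 88305` stays unmasked in the sample result further down.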

In [18]:
deid_pipeline.model.stages[38]

ChunkMergeModel_5a3f1e608447

In [19]:
deid_pipeline.model.stages[38] = deid_pipeline.model.stages[38]\
                                      .setBlackList(['CPT_CODE'])
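
The two merger updates above (extra input columns and a blacklist) can be pictured with a small pure-Python sketch. This is a simplified assumption, not the actual `ChunkMergeModel` algorithm: it resolves overlaps greedily by preferring longer chunks and then drops blacklisted labels. The helper `merge_chunks` and all offsets except those of `John Lee` are illustrative.

```python
def merge_chunks(chunk_lists, blacklist=()):
    """Merge chunks from several sources into one non-overlapping list.

    Each chunk is (begin, end, label) with inclusive offsets. Longer chunks
    win when two chunks overlap; an earlier begin offset breaks ties.
    """
    candidates = [chunk for chunks in chunk_lists for chunk in chunks]
    candidates.sort(key=lambda c: (-(c[1] - c[0] + 1), c[0]))

    kept = []
    for begin, end, label in candidates:
        overlaps = any(begin <= k_end and k_begin <= end for k_begin, k_end, _ in kept)
        if not overlaps:
            kept.append((begin, end, label))

    # Blacklisted labels are removed from the merged output.
    return sorted(chunk for chunk in kept if chunk[2] not in blacklist)


ner_chunks = [(7, 14, "NAME"), (240, 243, "IDNUM")]   # partial specimen match
specimen_chunks = [(240, 248, "IDNUM")]               # full specimen match
cpt_chunks = [(700, 704, "CPT_CODE")]

merged = merge_chunks([ner_chunks, specimen_chunks, cpt_chunks], blacklist=["CPT_CODE"])
print(merged)
```

In this toy run the longer specimen chunk replaces the partial one, and the `CPT_CODE` chunk is filtered out so it never reaches masking.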

### Updated Stages

In [20]:
deid_pipeline.model.stages = (
    deid_pipeline.model.stages[:35]
    + [cpt_parser, specimen_parser, chunk_merge_rulebase]
    + deid_pipeline.model.stages[36:]
    + [iobTagger]
)
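
The slicing above is easier to follow on a plain list. The sketch below uses placeholder stage names and a hypothetical pipeline length of 40 (the real stage list is truncated in the output above); `splice_stages` is an illustrative helper, not part of the library.

```python
def splice_stages(stages, merger_index, new_parsers, updated_merger, tail):
    """Insert new parsers before the merger at merger_index, swap in the
    updated merger, and append extra stages at the end."""
    return (
        stages[:merger_index]
        + new_parsers
        + [updated_merger]
        + stages[merger_index + 1:]
        + tail
    )


# Hypothetical 40-stage pipeline; only the indices matter here.
stages = [f"stage_{i}" for i in range(40)]
updated = splice_stages(
    stages,
    merger_index=35,
    new_parsers=["cpt_parser", "specimen_parser"],
    updated_merger="chunk_merge_rulebase",
    tail=["iobTagger"],
)

print(len(stages), "->", len(updated))
print(updated[35:38])
print(updated.index("stage_38"))
```

Two stages are inserted ahead of the old index 35, so every later stage shifts by two positions: the stage that used to sit at index 38 is found at index 40 after the splice.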

In [21]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

## Save and Test the Modified Pipeline

In [22]:
empty_result = deid_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

deid_pipeline.model.write().overwrite().save("modified_pipeline")

In [23]:
# We are loading the pretrained pipeline using the `from_disk` method.

modified_pipeline = nlp.PretrainedPipeline.from_disk('modified_pipeline')

### Sample Result

In [24]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = modified_pipeline.transform(samples_df).cache()

In [25]:
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50,truncate=False)

+----------------------------------+-----+----+---------+----------+
|chunk                             |begin|end |ner_label|confidence|
+----------------------------------+-----+----+---------+----------+
|John Lee                          |7    |14  |NAME     |0.9999188 |
|7789201                           |45   |51  |IDNUM    |0.71      |
|1973                              |80   |83  |DATE     |0.99705976|
|52                                |87   |88  |AGE      |0.99993765|
|GH-556672                         |110  |118 |IDNUM    |0.86922663|
|XXX-XX-1234                       |139  |149 |IDNUM    |0.85107106|
|123 Main Street                   |160  |174 |LOCATION |0.9999253 |
|FALL RIVER                        |177  |186 |LOCATION |0.9986915 |
|NIAGARA FALLS                     |189  |201 |LOCATION |0.9978828 |
|NY                                |204  |205 |LOCATION |0.9999924 |
|14304                             |207  |211 |LOCATION |0.73      |
|RD23-4897                        

In [26]:
pd.set_option("display.max_colwidth", 1000)

result_df = result.selectExpr("text",
                              "mask_entity.result as masked_result",
                              "obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\nName: John Lee\nMedical Record Number (MRN): 7789201\nDate of Birth / Age / Sex: 1973 / 52 / Male\nAccession #: GH-556672\nSocial Security #: XXX-XX-1234\nAddress: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304\nSpecimen #: RD23-4897\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 032-1902\nDate Collected: 2025-05-10\nDate Received: 2025-05-10\nRequesting Physician: Dr. Jameson\nReferring Facility: General Hospital, New York City, NY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytome...","[\nName: <NAME>\nMedical Record Number (MRN): <IDNUM>\nDate of Birth / Age / Sex: <DATE> / <AGE> / Male\nAccession #: <IDNUM>\nSocial Security #: <IDNUM>\nAddress: <LOCATION>, <LOCATION>, <LOCATION>, <LOCATION> <LOCATION>\nSpecimen #: <IDNUM>\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled <IDNUM>\nDate Collected: <DATE>\nDate Received: <DATE>\nRequesting Physician: Dr. 
<NAME>\nReferring Facility: <LOCATION>, <LOCATION> City, <LOCATION>\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at <LOCATION>, <LOCATION> – <LOCATION>; <LOCATION> City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytometry):\nDate Reported:...","[\nName: Gillie Allan\nMedical Record Number (MRN): 0074518\nDate of Birth / Age / Sex: 1974 / 44 / Male\nAccession #: PU-663305\nSocial Security #: WWW-WW-8529\nAddress: 3255 Independence Street, 302 W MCNEESE ST, 4101 NW 89TH BLVD, 16100 SOUTH FREEWAY 59 KOCH AVE\nSpecimen #: SA52-9740\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 125-8415\nDate Collected: 2025-06-27\nDate Received: 2025-06-27\nRequesting Physician: Dr. Marchelle\nReferring Facility: 310 Ellis Street, 2000 Boise Ave City, 16100 SOUTH FREEWAY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at 103 North Street, 16100 SOUTH FREEWAY – 969 Lakeland Drive; 2000 Boise Ave City.\nChromosome Analysis (C..."


##  Prepare Data for Custom NER Model Training

In [27]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/refs/heads/master/data/ner/eng.train -O eng.train

data_conll = nlp.CoNLL(includeDocId=True,explodeSentences=True).readDataset(spark, "./eng.train")
data_conll.show(2)


+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|                 pos|               label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     X|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [28]:
data_conll.count()

14041

In [29]:
input_spark_df = data_conll.select("doc_id", "text")
input_spark_df.show(2, truncate=50)

+------+------------------------------------------------+
|doc_id|                                            text|
+------+------------------------------------------------+
|     X|EU rejects German call to boycott British lamb .|
|     X|                                 Peter Blackburn|
+------+------------------------------------------------+
only showing top 2 rows



### Preprocess Data with the Modified Pipeline

Run the entire dataset through our modified pipeline. This generates the `splitter`, `token`, and `embeddings` annotations, plus the pre-annotated `ner_label` column, that the downstream NER training needs.

In [30]:
results = modified_pipeline.transform(input_spark_df)
results.columns

['doc_id',
 'text',
 'document',
 'splitter',
 'token',
 'embeddings',
 'ner_clinical_large',
 'ner_chunk_clinical_large',
 'ner_deid_generic_docwise',
 'ner_deid_docwise_subentity',
 'ner_deid_generic_docwise_merged_conll',
 'ner_chunk_generic_docwise',
 'ner_chunk_subentity_docwise',
 'ner_chunk_merged_docwise',
 'ner_zero_shot',
 'ner_chunk_zero_shot_raw',
 'ner_deid_subentity_docwise_new',
 'ner_chunk_subentity_docwise_new_chunk',
 'ner_chunk_zero_shot',
 'deid_merged_ner_chunk',
 'entity_icd10',
 'entity_ssn',
 'entity_account',
 'entity_dln',
 'entity_plate',
 'entity_vin',
 'entity_license',
 'entity_country',
 'entity_state',
 'entity_age',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_zip',
 'entity_medicalrecord',
 'entity_email',
 'entity_ip_address',
 'entity_cpt',
 'entity_specimen',
 'deid_merged_ner_rulebased',
 'ner_chunk_raw',
 'ner_chunk_processed',
 'ner_chunk',
 'mask_entity',
 'obfuscated',
 'ner_label']

In [31]:
result_df = results.select('doc_id','text','document','splitter',
                          'token',"embeddings", 'ner_label')

In [32]:
result_df.show(2, truncate=40)

+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|doc_id|                                    text|                                document|                                splitter|                                   token|                              embeddings|                               ner_label|
+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|     X|EU rejects German call to boycott Bri...|[{document, 0, 47, EU rejects German ...|[{document, 0, 48, EU rejects German ...|[{token, 0, 1, EU, {sentence -> 0}, [...|[{word_embeddings, 0, 1, EU, {isOOV -...|[{named_entity, 0, 1, 

### Persist Preprocessed Data

Save the annotated DataFrame to Parquet format. This is an optimization step to speed up the training process by avoiding re-computation.

In [33]:
%%time

n_partitions = 48

# WRITING THE DATA
result_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save(f"./data/result_df_{n_partitions}.parquet")


CPU times: user 124 ms, sys: 51.5 ms, total: 175 ms
Wall time: 16min 18s


## Train a Custom Medical NER Model

In [34]:
# READING THE DATA
n_partitions = 48
result_df = spark.read \
    .parquet(f"./data/result_df_{n_partitions}.parquet")\
    .repartition(n_partitions)

In [35]:
result_df.count()

14041

In [36]:
(train_df, test_df) = result_df.randomSplit([0.8, 0.2], seed = 42)

In [37]:
test_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save("./data/test_df.parquet")

### Use MedicalNerDLGraphChecker for NER

The `MedicalNerDLGraphChecker` (called here as `medical.NerDLGraphChecker`) processes the dataset to extract the required graph parameters (tokens, labels, embedding dimensions).

In [38]:
embeddings = (nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
            .setInputCols(["splitter", "token"])
            .setOutputCol("embeddings"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [39]:
nerDLGraphChecker = medical.NerDLGraphChecker()\
    .setInputCols(["splitter", "token"])\
    .setLabelColumn("ner_label")\
    .setEmbeddingsModel(embeddings)

###  Configure and Run the MedicalNerApproach

In [40]:
nerTagger = (
    medical.NerApproach()
    .setInputCols(["splitter", "token", "embeddings"])
    .setLabelColumn("ner_label")
    .setOutputCol("ner")
    .setMaxEpochs(30)
    .setBatchSize(8)
    .setRandomSeed(0)
    .setVerbose(1)
    .setValidationSplit(0.2)
    .setEvaluationLogExtended(True)
    .setEnableOutputLogs(True)
    .setIncludeConfidence(True)
    .setOutputLogsPath('ner_logs')
    .setEarlyStoppingCriterion(0.01)
    .setEarlyStoppingPatience(5)
    .setUseBestModel(False)
    # .setTestDataset("./data/test_df.parquet")
    # .setEnableMemoryOptimizer(True)  # with limited memory and a large dataset, set this to True to train batch by batch
    # .setDatasetInfo("NCBI_sample_short dataset")  # optional details about the dataset
)

ner_pipeline = nlp.Pipeline(
    stages=[
          nerDLGraphChecker,
          nerTagger
 ])

In [41]:
%%time
ner_model = ner_pipeline.fit(train_df)

CPU times: user 322 ms, sys: 112 ms, total: 433 ms
Wall time: 42min 28s


In [42]:
ner_model.stages[-1].getTrainingClassDistribution()

{'I-NAME': 4435, 'I-CONTACT': 175, 'I-AGE': 34, 'I-IDNUM': 59, 'B-DATE': 3550, 'I-DATE': 497, 'I-LOCATION': 3804, 'B-NAME': 5213, 'B-AGE': 572, 'B-LOCATION': 10538, 'B-IDNUM': 156, 'O': 136268, 'B-CONTACT': 312}

### Save the Trained NER Model and Review Logs

In [43]:
ner_model.stages[-1].write().overwrite().save('models/new_NER_model')

In [44]:
import os
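log_dir = "ner_logs"
# Pick the most recently modified log, in case earlier runs left files in the folder.
log_file = max(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)))

with open(os.path.join(log_dir, log_file)) as f:
    print(f.read())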

Name of the selected graph: medical-ner-dl/blstm_100_200_128_100.pb
Training started - total epochs: 30 - lr: 0.001 - batch size: 8 - labels: 13 - chars: 84 - training examples: 11194


Epoch 1/30 started, lr: 0.001, dataset size: 11194


Epoch 1/30 - 74.09s - loss: 5313.576 - avg training loss: 4.761269 - batches: 1116
Quality on validation dataset (20.0%), validation examples = 2238
time to finish evaluation: 13.60s
Total validation loss: 810.2688	Avg validation loss: 2.8232
label	 tp	 fp	 fn	 prec	 rec	 f1
I-NAME	 782	 232	 123	 0.77120316	 0.8640884	 0.81500787
I-CONTACT	 17	 20	 7	 0.45945945	 0.7083333	 0.55737704
I-AGE	 0	 0	 8	 0.0	 0.0	 0.0
I-IDNUM	 0	 0	 9	 0.0	 0.0	 0.0
B-DATE	 617	 37	 115	 0.94342506	 0.84289616	 0.89033186
I-DATE	 75	 18	 15	 0.8064516	 0.8333333	 0.81967217
I-LOCATION	 368	 144	 424	 0.71875	 0.46464646	 0.5644172
B-NAME	 892	 340	 159	 0.72402596	 0.8487155	 0.7814279
B-AGE	 62	 30	 67	 0.67391306	 0.48062015	 0.561086
B-LOCATION	 1697	 394	 447	 0.8115

## Evaluate the Newly Trained NER Model

In [45]:
pred_df = ner_model.stages[-1].transform(test_df).cache()

In [46]:
pred_df.show()

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            splitter|               token|          embeddings|           ner_label|                 ner|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|" Brush Wellman h...|[{document, 0, 15...|[{document, 0, 15...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|" It is almost im...|[{document, 0, 10...|[{document, 0, 10...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|" It was an off-s...|[{document, 0, 10...|[{document, 0, 10...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|" The two-day vis...|[{document, 0, 12...|[{document, 0,

In [47]:
from pyspark.sql import functions as F

pred_token_df = pred_df.select(F.explode(F.arrays_zip(pred_df.ner_label.metadata,
                                                  pred_df.ner_label.begin,
                                                  pred_df.ner_label.end,
                                                  pred_df.ner_label.result,
                                                  pred_df.ner.result)).alias("cols")) \
          .select(F.expr("cols['0']['word']").alias("token"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']").alias("gtruth"),
                  F.expr("cols['4']").alias("prediction"))\
          .toPandas()

pred_token_df

Unnamed: 0,token,begin,end,gtruth,prediction
0,"""",0,0,O,O
1,Brush,2,6,B-NAME,O
2,Wellman,8,14,I-NAME,I-NAME
3,has,16,18,O,O
4,been,20,23,O,O
...,...,...,...,...,...
41244,games,42,46,O,O
41245,behind,48,53,O,O
41246,first,55,59,O,O
41247,place,61,65,O,O


### Calculate Evaluation Metrics
Use the `NerDLMetrics` class to compute precision, recall, and F1-score for each entity. The evaluation is run in both `full_chunk` and `partial_chunk_per_token` modes.

In [48]:
evaler = medical.NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"),
                                          prediction_col="ner",
                                          label_col="ner_label",
                                          drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# show() prints the table itself and returns None, so it is not wrapped in print().
eval_result.selectExpr("avg(f1) as macro").show()
# The "micro" column below is the support-weighted mean of the per-entity F1 scores.
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+------+-----+-----+------+---------+------+------+
|  entity|    tp|   fp|   fn| total|precision|recall|    f1|
+--------+------+-----+-----+------+---------+------+------+
| CONTACT|  68.0|  7.0| 17.0|  85.0|   0.9067|   0.8|  0.85|
|    NAME|1192.0|109.0|107.0|1299.0|   0.9162|0.9176|0.9169|
|    DATE| 815.0| 25.0| 31.0| 846.0|   0.9702|0.9634|0.9668|
|   IDNUM|  24.0| 11.0| 17.0|  41.0|   0.6857|0.5854|0.6316|
|LOCATION|2452.0|181.0|212.0|2664.0|   0.9313|0.9204|0.9258|
|     AGE| 142.0| 22.0| 15.0| 157.0|   0.8659|0.9045|0.8847|
+--------+------+-----+-----+------+---------+------+------+

+-----------------+
|            macro|
+-----------------+
|0.862638263002126|
+-----------------+

+-----------------+
|            micro|
+-----------------+
|0.925448076564739|
+-----------------+


In [49]:
evaler = medical.NerDLMetrics(mode="partial_chunk_per_token")
eval_result_partial = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"), prediction_col="ner", label_col="ner_label", drop_o = True, case_sensitive = True).cache()

eval_result_partial.withColumn("precision", F.round(eval_result_partial["precision"],4))\
           .withColumn("recall", F.round(eval_result_partial["recall"],4))\
           .withColumn("f1", F.round(eval_result_partial["f1"],4)).sort("entity").show(100)
df_partial=eval_result_partial.toPandas()
print("partial_chunk_per_token")
# show() prints the table itself and returns None, so it is not wrapped in print().
eval_result_partial.selectExpr("avg(f1) as macro").show()
# The "micro" column below is the support-weighted mean of the per-entity F1 scores.
eval_result_partial.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+------+-----+-----+------+---------+------+------+
|  entity|    tp|   fp|   fn| total|precision|recall|    f1|
+--------+------+-----+-----+------+---------+------+------+
|     AGE| 150.0| 25.0| 17.0| 167.0|   0.8571|0.8982|0.8772|
| CONTACT| 113.0|  6.0| 21.0| 134.0|   0.9496|0.8433|0.8933|
|    DATE| 965.0| 22.0| 30.0| 995.0|   0.9777|0.9698|0.9738|
|   IDNUM|  31.0| 17.0| 20.0|  51.0|   0.6458|0.6078|0.6263|
|LOCATION|3466.0|221.0|252.0|3718.0|   0.9401|0.9322|0.9361|
|    NAME|2247.0|138.0|108.0|2355.0|   0.9421|0.9541|0.9481|
+--------+------+-----+-----+------+---------+------+------+

partial_chunk_per_token
+------------------+
|             macro|
+------------------+
|0.8757876037007678|
+------------------+

+------------------+
|             micro|
+------------------+
|0.9407430847696318|
+------------------+
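
The summary numbers can be re-derived by hand from the per-entity counts. The sketch below uses the `tp`, `fp`, and `fn` values from the `full_chunk` table and plain Python arithmetic; it is not how `NerDLMetrics` is implemented. It assumes `total = tp + fn`, which matches every row of that table, and it also computes a pooled micro F1 (from summed counts) for comparison with the support-weighted figure that the notebook labels `micro`.

```python
# (tp, fp, fn) per entity, copied from the full_chunk evaluation table.
counts = {
    "CONTACT": (68, 7, 17),
    "NAME": (1192, 109, 107),
    "DATE": (815, 25, 31),
    "IDNUM": (24, 11, 17),
    "LOCATION": (2452, 181, 212),
    "AGE": (142, 22, 15),
}


def f1_score(tp, fp, fn):
    """F1 expressed directly in counts: 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_entity_f1 = {label: f1_score(*c) for label, c in counts.items()}
support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}

# Unweighted mean of the per-entity F1 scores.
macro_f1 = sum(per_entity_f1.values()) / len(per_entity_f1)

# Support-weighted mean of the per-entity F1 scores (the notebook's "micro").
weighted_f1 = sum(per_entity_f1[l] * support[l] for l in counts) / sum(support.values())

# Pooled micro F1: sum the counts over all entities first, then compute F1 once.
tp_all = sum(c[0] for c in counts.values())
fp_all = sum(c[1] for c in counts.values())
fn_all = sum(c[2] for c in counts.values())
pooled_micro_f1 = f1_score(tp_all, fp_all, fn_all)

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f} pooled_micro={pooled_micro_f1:.4f}")
```

The macro and weighted values agree with the tables above to four decimals. The pooled micro F1 differs slightly from the weighted mean, and the low `IDNUM` score pulls the macro average down much more than either of the other two because every entity counts equally there.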


## Create the Final Pipeline with the Custom NER Model

In [64]:
# We are loading the pretrained pipeline using the `from_disk` method.

modified_pipeline = nlp.PretrainedPipeline.from_disk('modified_pipeline')

In [65]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### New Stages

In [66]:
ner_deid_new = medical.NerModel.load("models/new_NER_model")\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setOutputCol("ner_deid_new")

ner_deid_new_converter = nlp.NerConverter()\
      .setInputCols(["splitter", "token", "ner_deid_new"])\
      .setOutputCol("ner_chunk_new")

ner_deid = medical.NerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")  \
      .setInputCols(["splitter", "token", "embeddings"]) \
      .setOutputCol("ner_deid_subentity_docwise")

ner_deid_converter = nlp.NerConverter()\
      .setInputCols(["splitter", "token", "ner_deid_subentity_docwise"])\
      .setOutputCol("ner_chunk_subentity_docwise")

chunk_merge_ner = medical.ChunkMergeModel()\
    .setInputCols("ner_chunk_new", # New Trained Model
                  "ner_chunk_subentity_docwise")\
    .setOutputCol("deid_merged_ner_chunk")\
    .setOrderingFeatures(["ChunkLength","ChunkBegin"])\
    .setMergeOverlapping(True)\
    .setResetSentenceIndices(True)


ner_deid_subentity_docwise download started this may take some time.
Approximate size to download 8.9 MB
[OK!]


### Update Stages

In [60]:
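modified_pipeline.model.stages = (
    # Keep stages 0-3: DocumentAssembler, document splitter, tokenizer, embeddings
    modified_pipeline.model.stages[:4]
    # Replace the original NER models, converters, and ChunkMergeModel (indices 4-17)
    + [ner_deid_new,
       ner_deid_new_converter,
       ner_deid,
       ner_deid_converter,
       chunk_merge_ner]
    # Resume at index 18, the first contextual parser
    + modified_pipeline.model.stages[18:]
)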
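
The slice arithmetic in this cell is easier to check on plain strings. The sketch below uses the first 19 stage names from the stage listing in this notebook and placeholder strings for the five new stages. `splice_stages` is a hypothetical helper, not part of Spark NLP.

```python
from typing import List, Sequence

# First 19 stages of the original pipeline, copied from the stage listing
original_stages = [
    "DocumentAssembler_ae0f203deedd",         # 0
    "InternalDocumentSplitter_cc36578ceda6",  # 1
    "REGEX_TOKENIZER_2e85686aea12",           # 2
    "WORD_EMBEDDINGS_MODEL_9004b1d00302",     # 3
    "MedicalNerModel_1a8637089929",           # 4
    "NER_CONVERTER_1aef7e9d2de5",             # 5
    "MedicalNerModel_d92d47622e85",           # 6
    "MedicalNerModel_32184c1db80b",           # 7
    "MedicalNerModel_ada39ac0d359",           # 8
    "NER_CONVERTER_a99db4e6a79d",             # 9
    "NER_CONVERTER_4a9436714344",             # 10
    "NER_CONVERTER_ea6433988e18",             # 11
    "PretrainedZeroShotNER_5f30ab9002f1",     # 12
    "NER_CONVERTER_c97040caf7b3",             # 13
    "MedicalNerModel_b8b167ec3114",           # 14
    "NER_CONVERTER_06db473f3215",             # 15
    "ContextualEntityRuler_11ff6711ef6b",     # 16
    "ChunkMergeModel_95d6827691bb",           # 17
    "CONTEXTUAL-PARSER_bf2a6abaf5fa",         # 18
]

# Placeholder names standing in for the five new stage objects
new_stages = [
    "ner_deid_new",
    "ner_deid_new_converter",
    "ner_deid",
    "ner_deid_converter",
    "chunk_merge_ner",
]


def splice_stages(stages: Sequence[str], replacement: Sequence[str],
                  keep_head: int = 4, resume_at: int = 18) -> List[str]:
    """Keep stages[:keep_head], insert replacement, then continue from stages[resume_at:]."""
    return list(stages[:keep_head]) + list(replacement) + list(stages[resume_at:])


updated = splice_stages(original_stages, new_stages)
for index, name in enumerate(updated):
    print(index, name)
```

Printing the indices this way makes it easy to confirm that the 14 original stages at positions 4 to 17 are gone and that the first contextual parser directly follows the new merger.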

In [67]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### Reassemble and Save the Final Pipeline

With the original NER components replaced by the new custom NER model and the reconfigured chunk merger, the modified pipeline is run once on an empty DataFrame and then saved to disk as `new_pipeline`. It is then reloaded from that path for the final test.

In [68]:
empty_result = modified_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

modified_pipeline.model.write().overwrite().save("new_pipeline")

In [69]:
new_pipeline = nlp.PretrainedPipeline.from_disk('new_pipeline')

## Final Test of the New Pipeline

In [70]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = new_pipeline.transform(samples_df).cache()

In [71]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50,truncate=False)

+----------------------------------+-----+----+---------+----------+
|chunk                             |begin|end |ner_label|confidence|
+----------------------------------+-----+----+---------+----------+
|John Lee                          |7    |14  |NAME     |0.9999188 |
|7789201                           |45   |51  |IDNUM    |0.71      |
|1973                              |80   |83  |DATE     |0.99705976|
|52                                |87   |88  |AGE      |0.99993765|
|GH-556672                         |110  |118 |IDNUM    |0.86922663|
|XXX-XX-1234                       |139  |149 |IDNUM    |0.85107106|
|123 Main Street                   |160  |174 |LOCATION |0.9999253 |
|FALL RIVER                        |177  |186 |LOCATION |0.9986915 |
|NIAGARA FALLS                     |189  |201 |LOCATION |0.9978828 |
|NY                                |204  |205 |LOCATION |0.9999924 |
|14304                             |207  |211 |LOCATION |0.73      |
|RD23-4897                        
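
For readers less familiar with `arrays_zip` and `explode`, the plain-Python sketch below mirrors what the query above does to a single row: the parallel `result`, `begin`, `end`, and `metadata` arrays are zipped position by position, and each position becomes one output row. The `explode_chunks` helper is hypothetical, and the input holds only the first two chunks from the table above. It is not Spark code.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, int, None]]


def explode_chunks(results: List[str], begins: List[int], ends: List[int],
                   metadata: List[Dict[str, str]]) -> List[Row]:
    """Turn four parallel annotation arrays into one record per chunk."""
    rows: List[Row] = []
    # zip pairs the i-th element of every array (it stops at the shortest input)
    for chunk, begin, end, meta in zip(results, begins, ends, metadata):
        label: Optional[str] = meta.get("entity")
        confidence: Optional[str] = meta.get("confidence")
        rows.append({
            "chunk": chunk,
            "begin": begin,
            "end": end,
            "ner_label": label,
            "confidence": confidence,  # metadata values are kept as strings here
        })
    return rows


rows = explode_chunks(
    results=["John Lee", "7789201"],
    begins=[7, 45],
    ends=[14, 51],
    metadata=[
        {"entity": "NAME", "confidence": "0.9999188"},
        {"entity": "IDNUM", "confidence": "0.71"},
    ],
)
for row in rows:
    print(row)
```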

In [72]:
pd.set_option("display.max_colwidth", 1000)
result_df = result.selectExpr("text","mask_entity.result as masked_result","obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\nName: John Lee\nMedical Record Number (MRN): 7789201\nDate of Birth / Age / Sex: 1973 / 52 / Male\nAccession #: GH-556672\nSocial Security #: XXX-XX-1234\nAddress: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304\nSpecimen #: RD23-4897\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 032-1902\nDate Collected: 2025-05-10\nDate Received: 2025-05-10\nRequesting Physician: Dr. Jameson\nReferring Facility: General Hospital, New York City, NY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytome...","[\nName: <NAME>\nMedical Record Number (MRN): <IDNUM>\nDate of Birth / Age / Sex: <DATE> / <AGE> / Male\nAccession #: <IDNUM>\nSocial Security #: <IDNUM>\nAddress: <LOCATION>, <LOCATION>, <LOCATION>, <LOCATION> <LOCATION>\nSpecimen #: <IDNUM>\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled <IDNUM>\nDate Collected: <DATE>\nDate Received: <DATE>\nRequesting Physician: Dr. 
<NAME>\nReferring Facility: <LOCATION>, <LOCATION> City, <LOCATION>\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at <LOCATION>, <LOCATION> – <LOCATION>; <LOCATION> City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytometry):\nDate Reported:...","[\nName: Gillie Allan\nMedical Record Number (MRN): 0074518\nDate of Birth / Age / Sex: 1974 / 44 / Male\nAccession #: PU-663305\nSocial Security #: WWW-WW-8529\nAddress: 3255 Independence Street, 302 W MCNEESE ST, 4101 NW 89TH BLVD, 16100 SOUTH FREEWAY 59 KOCH AVE\nSpecimen #: SA52-9740\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 125-8415\nDate Collected: 2025-06-27\nDate Received: 2025-06-27\nRequesting Physician: Dr. Marchelle\nReferring Facility: 310 Ellis Street, 2000 Boise Ave City, 16100 SOUTH FREEWAY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at 103 North Street, 16100 SOUTH FREEWAY – 969 Lakeland Drive; 2000 Boise Ave City.\nChromosome Analysis (C..."
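
The `masked_result` column can be related back to the chunk table: each detected span is replaced by its label in angle brackets. The sketch below reproduces that on the first two lines of the sample text, using the inclusive `begin`/`end` offsets reported for `John Lee` and `7789201`. `mask_spans` is a hypothetical helper for illustration and makes no claim about how the de-identification stage is implemented.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (begin, end, label) with an inclusive end offset


def mask_spans(text: str, spans: List[Span]) -> str:
    """Replace each inclusive [begin, end] span with <LABEL>."""
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement
    for begin, end, label in sorted(spans, reverse=True):
        masked = masked[:begin] + f"<{label}>" + masked[end + 1:]
    return masked


# First two lines of the sample text, with offsets taken from the chunk table
text = "\nName: John Lee\nMedical Record Number (MRN): 7789201\n"
spans = [(7, 14, "NAME"), (45, 51, "IDNUM")]

masked = mask_spans(text, spans)
print(repr(masked))
```

Replacing from the rightmost span first avoids recomputing offsets, since a label such as `<NAME>` is usually a different length from the text it replaces.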
