![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.14.End2End_Preannotation_and_Training_Pipeline.ipynb)

# **End2End Preannotation and Training Pipeline**

## Required libraries and their versions

Spark NLP's **`TFGraphBuilder`** is required to build computation graphs for **`MedicalNerApproach`**. It is officially supported with **`Python 3.11`**, **`TensorFlow 2.12.0`**, **`TensorFlow Addons 0.20.0`**, and **`NumPy 1.23.5`**. Newer `TensorFlow (≥2.13)` or `NumPy (≥1.24)` versions are not compatible, and the `TensorFlow 1.15` fallback is deprecated.

In [None]:
! pip install -q tensorflow==2.12.0
! pip install -q tensorflow-addons==0.20.0

In [None]:
! pip uninstall -y numpy
! pip install "numpy==1.23.5"

**Note:** After running the installation commands above, **restart the runtime/session**
(`Runtime → Restart runtime`), then continue with the next step of the notebook.

## Spark Setup

In [None]:
import numpy as np
import tensorflow

print('NumPy Version: ',np.__version__)
print('TF Version: ', tensorflow.__version__)

NumPy Version:  1.23.5
TF Version:  2.12.0
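
The version limits stated in the requirements section can also be checked in code before continuing. The following is a minimal sketch: `parse_version` and `check_versions` are hypothetical helpers (not part of Spark NLP), and the bounds are simply the ones quoted in this notebook (`TensorFlow < 2.13`, `NumPy < 1.24`).

```python
import re


def parse_version(version):
    """Return (major, minor) from a version string such as '2.12.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_versions(tf_version, numpy_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Bounds taken from the requirements text of this notebook.
    if parse_version(tf_version) >= (2, 13):
        problems.append(f"TensorFlow {tf_version} is >= 2.13")
    if parse_version(numpy_version) >= (1, 24):
        problems.append(f"NumPy {numpy_version} is >= 1.24")
    return problems


# Versions printed by the cell above.
print(check_versions("2.12.0", "1.23.5"))
```

In the notebook itself, the arguments would be `tensorflow.__version__` and `np.__version__`.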


In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
import os
import json
import numpy as np
import pandas as pd

spark = nlp.start()

## Loading the Pretrained Pipeline

Spark NLP's pretrained pipeline, `clinical_deidentification_docwise_benchmark`, is loaded. This pipeline is designed to mask and obfuscate sensitive information in medical texts, such as names, ID numbers, contact information, locations, ages, and dates. The existing stages of the pipeline are examined to understand its structure.

In [None]:
deid_pipeline = nlp.PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models")

clinical_deidentification_docwise_benchmark download started this may take some time.
Approx size to download 2.3 GB
[OK!]


In [None]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

## Extending the Pipeline with New Stages

New and customized stages are added to enhance the capabilities of the existing pipeline.

In [None]:
document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

splitter = (
            medical.InternalDocumentSplitter()
            .setInputCols("document")
            .setOutputCol("splitter")
            .setSplitMode("recursive")
            .setSplitPatterns([r"\s+"])  # Split on whitespace (token-based)
            .setPatternsAreRegex(True)
            .setChunkSize(512)    # Chunk length of 512 characters
            .setChunkOverlap(50)
            .setEnableSentenceIncrement(True)  # Like sentenceDetector
)

tokenizer = (
    nlp.Tokenizer()
    .setInputCols("splitter")
    .setOutputCol("token")
)

### Create a Custom `CPT Code` Parser

Using `ContextualParserApproach`, a new parser is created to detect CPT (Current Procedural Terminology) codes within the text based on regex rules. This allows the pipeline to recognize a custom entity type not found in the standard de-identification pipeline.

In [None]:
cpt_rule = {
    "entity": "CPT_CODE",
    "ruleScope": "sentence",
    "regex": r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)",
    "matchScope": "token"
}

with open('cpt.json', 'w') as f:
    json.dump(cpt_rule, f)

cpt_parser = medical.ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_cpt") \
    .setJsonPath("cpt.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

cpt_parser_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    cpt_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

cpt_parser_model = cpt_parser_pipeline.fit(empty_data)


In [None]:
cpt_parser_model.stages[-1].write().overwrite().save("./parsers/cpt_parser")

cpt_parser = medical.ContextualParserModel.load("parsers/cpt_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_cpt")

In [None]:
# `text` is the sample report defined in the "Sample Text" section; run that cell first.
annotations = nlp.LightPipeline(cpt_parser_model).annotate(text)

annotations["entity_cpt"]

['88305']
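
The regex in the CPT rule can be sanity-checked with Python's `re` module before fitting the parser. This is only a sketch: the parser evaluates the pattern on the JVM, not with `re`, so the assumption here is that the two engines agree on the basic constructs used in this pattern. `find_cpt_codes` is a hypothetical helper, and `re.IGNORECASE` stands in for `setCaseSensitive(False)`.

```python
import re

# Same pattern as in `cpt_rule`; the capture group holds the 5-digit code.
CPT_PATTERN = re.compile(
    r"(?:CPT(?: Code\(s\)?|#|:)?\s*:?[\s#]*)?(\b88[0-9]{3}\b)",
    re.IGNORECASE,
)


def find_cpt_codes(text):
    """Return every 88xxx code captured by the pattern."""
    return CPT_PATTERN.findall(text)


for sample in ["CPT Code(s): 88305", "cpt# 88312", "ZIP 14304", "Order 188305"]:
    print(repr(sample), "->", find_cpt_codes(sample))
```

The word boundaries (`\b`) keep the pattern from firing inside longer digit runs such as `188305`.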

### Create a Custom `Specimen ID` Parser

Similarly, another parser is created with `ContextualParserApproach` to extract specimen IDs from medical texts.

In [None]:
with open('specimen.json', 'w') as f:
    json.dump({
        "entity": "IDNUM",
        "ruleScope": "sentence",
        # Raw string so that \s and \. reach the regex engine unchanged
        "regex": r"(?:Specimen(?:\s*(?:ID|Number|Code|#|No\.?)?:?)?\s*)?#?[A-Z]{1,5}[0-9]{2,4}-?[0-9]{3,6}",
        "contextLength": 25,
        "matchScope": "token"
    }, f)

specimen_parser = medical.ContextualParserApproach() \
    .setInputCols(["splitter", "token"]) \
    .setOutputCol("entity_specimen") \
    .setJsonPath("specimen.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

specimen_parser_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    splitter,
    tokenizer,
    specimen_parser
  ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

specimen_parser_model = specimen_parser_pipeline.fit(empty_data)


In [None]:
specimen_parser_model.stages[-1].write().overwrite().save("./parsers/specimen_parser")

specimen_parser = medical.ContextualParserModel.load("./parsers/specimen_parser") \
    .setInputCols(["splitter", "token"])\
    .setOutputCol("entity_specimen")

In [None]:
# `text` is the sample report defined in the "Sample Text" section; run that cell first.
annotations = nlp.LightPipeline(specimen_parser_model).annotate(text)

annotations["entity_specimen"]

['RD23-4897']
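
The output above contains only the ID itself (`RD23-4897`), without the `Specimen #:` prefix. As a rough stand-in for that behaviour, the sketch below tests the same pattern against whitespace-separated tokens. This is an assumption made for illustration, not a description of how `ContextualParserApproach` applies `matchScope`; the real pipeline uses its own tokenizer and regex engine, and `match_specimen_tokens` is a hypothetical helper.

```python
import re

# Same pattern as in specimen.json; IGNORECASE mirrors setCaseSensitive(False).
SPECIMEN_PATTERN = re.compile(
    r"(?:Specimen(?:\s*(?:ID|Number|Code|#|No\.?)?:?)?\s*)?"
    r"#?[A-Z]{1,5}[0-9]{2,4}-?[0-9]{3,6}",
    re.IGNORECASE,
)


def match_specimen_tokens(text):
    """Return whitespace-separated tokens that fully match the pattern."""
    return [tok for tok in text.split() if SPECIMEN_PATTERN.fullmatch(tok)]


for line in [
    "Specimen #: RD23-4897",
    "Accession #: GH-556672",
    "Material Details: 6 slides labeled 032-1902",
]:
    print(repr(line), "->", match_specimen_tokens(line))
```

Only the first line yields a match: `GH-556672` has no digits directly after its letters, and `032-1902` has no leading letters.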

### IOBTagger

The `IOBTagger` is added to tag the entities recognized by the Named Entity Recognition (NER) model in the IOB (Inside, Outside, Beginning) format. This format provides a standard data structure required for training the NER model.

In [None]:
iobTagger = medical.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")
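
To make the IOB format concrete, here is a small pure-Python sketch that converts chunk spans into one tag per token. It is not the `IOBTagger` implementation; `to_iob` and the tuple layouts are assumptions chosen for illustration, with inclusive character offsets as in the annotations shown in this notebook.

```python
def to_iob(tokens, chunks):
    """Assign one IOB tag per token.

    tokens: list of (text, begin, end) with inclusive offsets
    chunks: list of (begin, end, label) with inclusive offsets
    """
    tags = []
    for _, tok_begin, tok_end in tokens:
        tag = "O"  # default: token is outside every chunk
        for chunk_begin, chunk_end, label in chunks:
            if chunk_begin <= tok_begin and tok_end <= chunk_end:
                # First token of a chunk gets B-, the rest get I-.
                prefix = "B" if tok_begin == chunk_begin else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append(tag)
    return tags


# "John Lee lives in NY"
tokens = [("John", 0, 3), ("Lee", 5, 7), ("lives", 9, 13), ("in", 15, 16), ("NY", 18, 19)]
chunks = [(0, 7, "NAME"), (18, 19, "LOCATION")]
print(list(zip([t[0] for t in tokens], to_iob(tokens, chunks))))
```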

### Update the Chunk Merging Strategy

The input columns of the `ChunkMergeModel` at stage index 35, which merges the `entity_*` chunks produced by the pipeline's rule-based stages, are extended with the outputs of the newly created `cpt_parser` and `specimen_parser`. This ensures that the entities found by our custom parsers are consolidated with those from the existing stages.

In [None]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()
merger_input_cols

['entity_icd10',
 'entity_email',
 'entity_ip_address',
 'entity_age',
 'entity_medicalrecord',
 'entity_ssn',
 'entity_account',
 'entity_vin',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_country',
 'entity_state',
 'entity_zip',
 'entity_plate',
 'entity_dln',
 'entity_license']

In [None]:
merger_input_cols = deid_pipeline.model.stages[35].getInputCols()

chunk_merge_rulebase = deid_pipeline.model.stages[35]\
      .setInputCols(["entity_cpt", "entity_specimen"] + merger_input_cols)

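### Update the De-identification Blacklist

`CPT_CODE` is added to the blacklist of the `ChunkMergeModel` at stage index 38, so CPT code chunks are excluded from that stage's merged output. In the sample result later in this notebook, the CPT code `88305` is accordingly left unmasked.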

In [None]:
deid_pipeline.model.stages[38]

ChunkMergeModel_5a3f1e608447

In [None]:
deid_pipeline.model.stages[38] = deid_pipeline.model.stages[38]\
                                      .setBlackList(['CPT_CODE'])

### Updated Stages

In [None]:
deid_pipeline.model.stages = (
    deid_pipeline.model.stages[:35]
    + [cpt_parser, specimen_parser, chunk_merge_rulebase]
    + deid_pipeline.model.stages[36:]
    + [iobTagger]
)
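
The slicing used above replaces one stage with several and appends another at the end. The same list manipulation can be sketched on placeholder strings; `splice_stages` is a hypothetical helper and the stage names are dummies, not real annotators.

```python
def splice_stages(stages, index, replacements):
    """Return a new list where stages[index] is replaced by `replacements`."""
    if not 0 <= index < len(stages):
        raise IndexError(f"stage index {index} out of range")
    return stages[:index] + list(replacements) + stages[index + 1:]


# Placeholder pipeline with 40 stages.
stages = [f"stage_{i}" for i in range(40)]

new_stages = splice_stages(
    stages, 35, ["cpt_parser", "specimen_parser", "chunk_merge_rulebase"]
) + ["iob_tagger"]

print(len(stages), "->", len(new_stages))
print(new_stages[34:39])
```

Because one stage is replaced by three, every stage after index 35 shifts by two positions. Index-based edits such as the blacklist update are therefore easier to reason about when applied before the splice, as this notebook does.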

In [None]:
deid_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

## Save and Test the Modified Pipeline

In [None]:
empty_result = deid_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

deid_pipeline.model.write().overwrite().save("modified_pipeline")

In [None]:
# Load the saved, modified pipeline from disk with the `from_disk` method.

modified_pipeline = nlp.PretrainedPipeline.from_disk('modified_pipeline')

### Sample Text

In [None]:
text = """
Name: John Lee
Medical Record Number (MRN): 7789201
Date of Birth / Age / Sex: 1973 / 52 / Male
Accession #: GH-556672
Social Security #: XXX-XX-1234
Address: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304
Specimen #: RD23-4897
Material Received: Gastric, Ileum, and Random Colon Biopsies
Material Details: 6 slides labeled 032-1902
Date Collected: 2025-05-10
Date Received: 2025-05-10
Requesting Physician: Dr. Jameson
Referring Facility: General Hospital, New York City, NY
Clinical History: None Given.
Clinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.
CPT Code(s): 88305
Gastric, ileum, and random colon, biopsies:
No evidence of dysplasia.
The following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.
Chromosome Analysis (Cytogenetics): (Addendum report to follow.)
Leukemic Immunophenotyping (Flow Cytometry):
Date Reported: 2025-05-12, 16:30
Electronically Signed Out By:
Dr. Smith
Dr. Carter, CT(ASCP)
"""

### Sample Result

In [None]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = modified_pipeline.transform(samples_df).cache()

In [None]:
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50,truncate=False)

+----------------------------------+-----+----+---------+----------+
|chunk                             |begin|end |ner_label|confidence|
+----------------------------------+-----+----+---------+----------+
|John Lee                          |7    |14  |NAME     |0.9999188 |
|7789201                           |45   |51  |IDNUM    |0.71      |
|1973                              |80   |83  |DATE     |0.99705976|
|52                                |87   |88  |AGE      |0.99993765|
|GH-556672                         |110  |118 |IDNUM    |0.8692267 |
|XXX-XX-1234                       |139  |149 |IDNUM    |0.85107154|
|123 Main Street                   |160  |174 |LOCATION |0.9999253 |
|FALL RIVER                        |177  |186 |LOCATION |0.9986915 |
|NIAGARA FALLS                     |189  |201 |LOCATION |0.9978828 |
|NY                                |204  |205 |LOCATION |0.9999924 |
|14304                             |207  |211 |LOCATION |0.73      |
|RD23-4897                        

In [None]:
pd.set_option("display.max_colwidth", 1000)

result_df = result.selectExpr("text",
                              "mask_entity.result as masked_result",
                              "obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\nName: John Lee\nMedical Record Number (MRN): 7789201\nDate of Birth / Age / Sex: 1973 / 52 / Male\nAccession #: GH-556672\nSocial Security #: XXX-XX-1234\nAddress: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304\nSpecimen #: RD23-4897\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 032-1902\nDate Collected: 2025-05-10\nDate Received: 2025-05-10\nRequesting Physician: Dr. Jameson\nReferring Facility: General Hospital, New York City, NY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytome...","[\nName: <NAME>\nMedical Record Number (MRN): <IDNUM>\nDate of Birth / Age / Sex: <DATE> / <AGE> / Male\nAccession #: <IDNUM>\nSocial Security #: <IDNUM>\nAddress: <LOCATION>, <LOCATION>, <LOCATION>, <LOCATION> <LOCATION>\nSpecimen #: <IDNUM>\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled <IDNUM>\nDate Collected: <DATE>\nDate Received: <DATE>\nRequesting Physician: Dr. 
<NAME>\nReferring Facility: <LOCATION>, <LOCATION> City, <LOCATION>\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at <LOCATION>, <LOCATION> – <LOCATION>; <LOCATION> City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytometry):\nDate Reported:...","[\nName: Gillie Allan\nMedical Record Number (MRN): 0074518\nDate of Birth / Age / Sex: 1974 / 44 / Male\nAccession #: PU-663305\nSocial Security #: WWW-WW-8529\nAddress: 3255 Independence Street, 302 W MCNEESE ST, 4101 NW 89TH BLVD, 16100 SOUTH FREEWAY 59 KOCH AVE\nSpecimen #: SA52-9740\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 125-8415\nDate Collected: 2025-06-27\nDate Received: 2025-06-27\nRequesting Physician: Dr. Marchelle\nReferring Facility: 310 Ellis Street, 2000 Boise Ave City, 16100 SOUTH FREEWAY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at 103 North Street, 16100 SOUTH FREEWAY – 969 Lakeland Drive; 2000 Boise Ave City.\nChromosome Analysis (C..."


## Prepare Data for Custom NER Model Training

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/refs/heads/master/data/ner/eng.train -O eng.train

data_conll = nlp.CoNLL(includeDocId=True,explodeSentences=True).readDataset(spark, "./eng.train")
data_conll.show(2)


In [None]:
data_conll.count()

In [None]:
input_spark_df = data_conll.select("doc_id", "text")
input_spark_df.show(2, truncate=50)

+------+------------------------------------------------+
|doc_id|                                            text|
+------+------------------------------------------------+
|     X|EU rejects German call to boycott British lamb .|
|     X|                                 Peter Blackburn|
+------+------------------------------------------------+
only showing top 2 rows



### Preprocess Data with the Modified Pipeline

Run the entire dataset through the modified pipeline. This produces the `splitter`, `token`, and `embeddings` annotations, plus the IOB-formatted `ner_label` column, that NER training needs downstream.

In [None]:
results = modified_pipeline.transform(input_spark_df)
results.columns

['doc_id',
 'text',
 'document',
 'splitter',
 'token',
 'embeddings',
 'ner_clinical_large',
 'ner_chunk_clinical_large',
 'ner_deid_generic_docwise',
 'ner_deid_docwise_subentity',
 'ner_deid_generic_docwise_merged_conll',
 'ner_chunk_generic_docwise',
 'ner_chunk_subentity_docwise',
 'ner_chunk_merged_docwise',
 'ner_zero_shot',
 'ner_chunk_zero_shot_raw',
 'ner_deid_subentity_docwise_new',
 'ner_chunk_subentity_docwise_new_chunk',
 'ner_chunk_zero_shot',
 'deid_merged_ner_chunk',
 'entity_icd10',
 'entity_ssn',
 'entity_account',
 'entity_dln',
 'entity_plate',
 'entity_vin',
 'entity_license',
 'entity_country',
 'entity_state',
 'entity_age',
 'entity_date',
 'entity_phone',
 'entity_phone2',
 'entity_zip',
 'entity_medicalrecord',
 'entity_email',
 'entity_ip_address',
 'entity_cpt',
 'entity_specimen',
 'deid_merged_ner_rulebased',
 'ner_chunk_raw',
 'ner_chunk_processed',
 'ner_chunk',
 'mask_entity',
 'obfuscated',
 'ner_label']

In [None]:
result_df = results.select('doc_id','text','document','splitter',
                          'token',"embeddings", 'ner_label')

In [None]:
result_df.show(2, truncate=40)

+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|doc_id|                                    text|                                document|                                splitter|                                   token|                              embeddings|                               ner_label|
+------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+
|     X|EU rejects German call to boycott Bri...|[{document, 0, 47, EU rejects German ...|[{document, 0, 48, EU rejects German ...|[{token, 0, 1, EU, {sentence -> 0}, [...|[{word_embeddings, 0, 1, EU, {isOOV -...|[{named_entity, 0, 1, 

### Persist Preprocessed Data

Save the annotated DataFrame to Parquet format. This is an optimization step to speed up the training process by avoiding re-computation.

In [None]:
%%time

n_partitions = 48

# WRITING THE DATA
result_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save(f"./data/result_df_{n_partitions}.parquet")


CPU times: user 7.91 s, sys: 1.13 s, total: 9.04 s
Wall time: 27min 38s


## Train a Custom Medical NER Model

In [None]:
# READING THE DATA
n_partitions = 48
result_df = spark.read \
    .parquet(f"./data/result_df_{n_partitions}.parquet")\
    .repartition(n_partitions)

In [None]:
result_df.count()

14041

In [None]:
(train_df, test_df) = result_df.randomSplit([0.8, 0.2], seed = 42)

In [None]:
test_df.repartition(n_partitions).write.mode("overwrite").format("parquet")\
    .save(f"./data/test_df.parquet")

### Create the TensorFlow Graph for NER

The `TFGraphBuilder` annotator creates the TensorFlow graph as part of the model training pipeline. It inspects the data and builds a matching graph if a suitable version of TensorFlow is available. The graph is stored in the configured folder and loaded by the `MedicalNerApproach` annotator.

In [None]:
graph_folder_path = "medical_ner_graphs"

ner_graph_builder = medical.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["splitter", "token", "embeddings"]) \
    .setLabelColumn("ner_label")\
    .setGraphFolder(graph_folder_path)\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(50)\
    .setIsLicensed(True) # False -> if you want to use TFGraphBuilder with NerDLApproach

### Configure and Run the MedicalNerApproach

In [None]:
nerTagger = medical.NerApproach()\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setLabelColumn("ner_label")\
    .setOutputCol("ner")\
    .setMaxEpochs(30)\
    .setBatchSize(8)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')\
    .setEarlyStoppingCriterion(0.01)\
    .setEarlyStoppingPatience(5)\
    .setGraphFolder(graph_folder_path)\
    .setUseBestModel(False)
    # Optional settings; append them to the chain above to enable:
    # .setTestDataset("./data/test_df.parquet")
    # .setEnableMemoryOptimizer(True)  # with limited memory and a large dataset, set this to True to train batch by batch
    # .setDatasetInfo("NCBI_sample_short dataset")  # free-text details about the dataset

ner_pipeline = nlp.Pipeline(
    stages=[
          ner_graph_builder,
          nerTagger
 ])

In [None]:
%%time
ner_model = ner_pipeline.fit(train_df)

TF Graph Builder configuration:
Model name: ner_dl
Graph folder: medical_ner_graphs
Graph file name: auto
Build params: {'ntags': 13, 'embeddings_dim': 200, 'nchars': 85, 'is_medical': True, 'lstm_size': 50}


Instructions for updating:
non-resource variables are not supported in the long term
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


ner_dl graph exported to medical_ner_graphs/blstm_13_200_50_85.pb
CPU times: user 21.1 s, sys: 2.6 s, total: 23.7 s
Wall time: 23min 15s
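
In the log above, the exported file name `blstm_13_200_50_85.pb` lines up with the build parameters `ntags=13`, `embeddings_dim=200`, `lstm_size=50`, and `nchars=85`. The sketch below encodes that observed pattern to check whether a graph for a given configuration is already present in the graph folder. The naming scheme is inferred from this single log line, so treat it as an assumption; `expected_graph_name` and `graph_exists` are hypothetical helpers.

```python
import os


def expected_graph_name(ntags, embeddings_dim, lstm_size, nchars):
    """File name pattern inferred from the TFGraphBuilder log in this notebook."""
    return f"blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb"


def graph_exists(graph_folder, **build_params):
    """Check whether a graph file with the inferred name is in `graph_folder`."""
    path = os.path.join(graph_folder, expected_graph_name(**build_params))
    return os.path.isfile(path)


name = expected_graph_name(ntags=13, embeddings_dim=200, lstm_size=50, nchars=85)
print(name)
```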


In [None]:
ner_model.stages[-1].getTrainingClassDistribution()

{'I-NAME': 4343, 'I-CONTACT': 196, 'I-AGE': 32, 'I-IDNUM': 65, 'B-DATE': 3505, 'I-DATE': 487, 'I-LOCATION': 3870, 'B-NAME': 5130, 'B-AGE': 568, 'B-LOCATION': 10562, 'B-IDNUM': 165, 'O': 136242, 'B-CONTACT': 319}
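
The class distribution is easier to read when the `B-` and `I-` counts are summed per entity type. A minimal sketch using the dictionary printed above (`entity_totals` is a hypothetical helper):

```python
from collections import Counter

# Copied from the getTrainingClassDistribution() output above.
distribution = {
    "I-NAME": 4343, "I-CONTACT": 196, "I-AGE": 32, "I-IDNUM": 65,
    "B-DATE": 3505, "I-DATE": 487, "I-LOCATION": 3870, "B-NAME": 5130,
    "B-AGE": 568, "B-LOCATION": 10562, "B-IDNUM": 165, "O": 136242,
    "B-CONTACT": 319,
}


def entity_totals(dist):
    """Sum B-/I- token counts per entity type, ignoring the O tag."""
    totals = Counter()
    for tag, count in dist.items():
        if tag == "O":
            continue
        _, _, entity = tag.partition("-")
        totals[entity] += count
    return dict(totals)


totals = entity_totals(distribution)
all_tokens = sum(distribution.values())
for entity, count in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{entity:<9}{count:>7}  {count / all_tokens:.2%} of all tokens")
print(f"O share: {distribution['O'] / all_tokens:.2%}")
```

Rare classes such as `IDNUM` and `CONTACT` have few training tokens, which is worth keeping in mind when reading their scores in the evaluation tables later in this notebook.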

### Save the Trained NER Model and Review Logs

In [None]:
ner_model.stages[-1].write().overwrite().save('models/new_NER_model')

In [None]:
import os
log_file= os.listdir("ner_logs")[0]

with open (f"./ner_logs/{log_file}") as f:
    print(f.read())

Name of the selected graph: /content/medical_ner_graphs/blstm_13_200_50_85.pb
Training started - total epochs: 30 - lr: 0.001 - batch size: 8 - labels: 13 - chars: 84 - training examples: 11189


Epoch 1/30 started, lr: 0.001, dataset size: 11189


Epoch 1/30 - 45.13s - loss: 5704.2026 - avg training loss: 5.1158767 - batches: 1115
Quality on validation dataset (20.0%), validation examples = 2237
time to finish evaluation: 4.12s
Total validation loss: 914.8215	Avg validation loss: 3.1987
label	 tp	 fp	 fn	 prec	 rec	 f1
I-NAME	 708	 131	 160	 0.84386176	 0.8156682	 0.82952553
I-CONTACT	 0	 0	 55	 0.0	 0.0	 0.0
I-AGE	 0	 0	 9	 0.0	 0.0	 0.0
I-IDNUM	 0	 0	 5	 0.0	 0.0	 0.0
B-DATE	 635	 65	 119	 0.9071429	 0.84217507	 0.8734526
I-DATE	 70	 9	 47	 0.886076	 0.5982906	 0.7142857
I-LOCATION	 262	 123	 509	 0.68051946	 0.33981842	 0.4532872
B-NAME	 780	 212	 247	 0.78629035	 0.75949365	 0.7726597
B-AGE	 14	 0	 113	 1.0	 0.11023622	 0.19858158
B-LOCATION	 1679	 451	 495	 0.7882629	 0.7723091	 

## Evaluate the Newly Trained NER Model

In [None]:
pred_df = ner_model.stages[-1].transform(test_df).cache()

In [None]:
pred_df.show()

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            splitter|               token|          embeddings|           ner_label|                 ner|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     X|" It was one of t...|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|" The question is...|[{document, 0, 76...|[{document, 0, 77...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|" Ukraine 's bigg...|[{document, 0, 17...|[{document, 0, 17...|[{token, 0, 0, ",...|[{word_embeddings...|[{named_entity, 0...|[{named_entity, 0...|
|     X|- 4 Greg Norman (...|[{document, 0, 38...|[{document, 0,

In [None]:
from pyspark.sql import functions as F

pred_token_df = pred_df.select(F.explode(F.arrays_zip(pred_df.ner_label.metadata,
                                                  pred_df.ner_label.begin,
                                                  pred_df.ner_label.end,
                                                  pred_df.ner_label.result,
                                                  pred_df.ner.result)).alias("cols")) \
          .select(F.expr("cols['0']['word']").alias("token"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']").alias("gtruth"),
                  F.expr("cols['4']").alias("prediction"))\
          .toPandas()

pred_token_df

Unnamed: 0,token,begin,end,gtruth,prediction
0,"""",0,0,O,O
1,It,2,3,O,O
2,was,5,7,O,O
3,one,9,11,O,O
4,of,13,14,O,O
...,...,...,...,...,...
41656,Sunday,32,37,B-DATE,B-DATE
41657,:,39,39,O,O
41658,stated,0,5,O,O
41659,),7,7,O,O


### Calculate Evaluation Metrics

Use the `NerDLMetrics` class to compute precision, recall, and F1-score for each entity. The evaluation is shown with both `full_chunk` and `partial_chunk_per_token` modes.
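
For reference, precision, recall, and F1 follow directly from the true-positive, false-positive, and false-negative counts. The sketch below uses the standard definitions; `precision_recall_f1` is a hypothetical helper, and the example counts are those of the `IDNUM` row in the `full_chunk` table of this section.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# IDNUM in full_chunk mode: tp=25, fp=11, fn=26
precision, recall, f1 = precision_recall_f1(25, 11, 26)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```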

In [None]:
evaler = medical.NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"),
                                          prediction_col="ner",
                                          label_col="ner_label",
                                          drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# show() prints the table itself and returns None, so it is not wrapped in print().
eval_result.selectExpr("avg(f1) as macro").show()
# Note: the "micro" column is the support-weighted mean of the per-entity F1 scores.
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+------+-----+-----+------+---------+------+------+
|  entity|    tp|   fp|   fn| total|precision|recall|    f1|
+--------+------+-----+-----+------+---------+------+------+
| CONTACT|  68.0| 38.0| 29.0|  97.0|   0.6415| 0.701|  0.67|
|    NAME|1191.0|144.0|177.0|1368.0|   0.8921|0.8706|0.8812|
|    DATE| 830.0| 41.0| 48.0| 878.0|   0.9529|0.9453|0.9491|
|LOCATION|2295.0|280.0|347.0|2642.0|   0.8913|0.8687|0.8798|
|     AGE| 111.0|  5.0| 44.0| 155.0|   0.9569|0.7161|0.8192|
|   IDNUM|  25.0| 11.0| 26.0|  51.0|   0.6944|0.4902|0.5747|
+--------+------+-----+-----+------+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.7956707338734669|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8831835832312127|
+------------------+
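
The two averages can be reproduced from the `full_chunk` table with plain Python. The sketch below also computes a pooled micro F1 from the summed counts, for comparison: the value labelled `micro` in the notebook is `sum(f1 * total) / sum(total)`, i.e. a support-weighted mean of per-entity F1, which is close to but not identical with the pooled figure. Variable names are illustrative.

```python
# (entity, tp, fp, fn) rows copied from the full_chunk table above.
rows = [
    ("CONTACT", 68, 38, 29),
    ("NAME", 1191, 144, 177),
    ("DATE", 830, 41, 48),
    ("LOCATION", 2295, 280, 347),
    ("AGE", 111, 5, 44),
    ("IDNUM", 25, 11, 26),
]


def f1_score(tp, fp, fn):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_entity_f1 = {name: f1_score(tp, fp, fn) for name, tp, fp, fn in rows}
support = {name: tp + fn for name, tp, fp, fn in rows}  # the "total" column

# Unweighted mean over entity types.
macro_f1 = sum(per_entity_f1.values()) / len(per_entity_f1)

# Mean weighted by support (what the notebook labels "micro").
weighted_f1 = sum(per_entity_f1[n] * support[n] for n in support) / sum(support.values())

# Micro F1 from pooled counts across all entity types.
pooled_micro_f1 = f1_score(
    sum(tp for _, tp, _, _ in rows),
    sum(fp for _, _, fp, _ in rows),
    sum(fn for _, _, _, fn in rows),
)

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f} pooled_micro={pooled_micro_f1:.4f}")
```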


In [None]:
evaler = medical.NerDLMetrics(mode="partial_chunk_per_token")
eval_result_partial = evaler.computeMetricsFromDF(pred_df.select("ner_label","ner"), prediction_col="ner", label_col="ner_label", drop_o = True, case_sensitive = True).cache()

eval_result_partial.withColumn("precision", F.round(eval_result_partial["precision"],4))\
           .withColumn("recall", F.round(eval_result_partial["recall"],4))\
           .withColumn("f1", F.round(eval_result_partial["f1"],4)).sort("entity").show(100)
df_partial=eval_result_partial.toPandas()
print("partial_chunk_per_token")
# show() prints the table itself and returns None, so it is not wrapped in print().
eval_result_partial.selectExpr("avg(f1) as macro").show()
# Note: the "micro" column is the support-weighted mean of the per-entity F1 scores.
eval_result_partial.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+--------+------+-----+-----+------+---------+------+------+
|  entity|    tp|   fp|   fn| total|precision|recall|    f1|
+--------+------+-----+-----+------+---------+------+------+
|     AGE| 116.0|  5.0| 48.0| 164.0|   0.9587|0.7073| 0.814|
| CONTACT| 129.0| 47.0| 32.0| 161.0|    0.733|0.8012|0.7656|
|    DATE| 965.0| 35.0| 51.0|1016.0|    0.965|0.9498|0.9573|
|   IDNUM|  28.0| 14.0| 32.0|  60.0|   0.6667|0.4667| 0.549|
|LOCATION|3211.0|302.0|454.0|3665.0|    0.914|0.8761|0.8947|
|    NAME|2283.0|151.0|207.0|2490.0|    0.938|0.9169|0.9273|
+--------+------+-----+-----+------+---------+------+------+

partial_chunk_per_token
+------------------+
|             macro|
+------------------+
|0.8179912776610174|
+------------------+

+------------------+
|             micro|
+------------------+
|0.9066066198899188|
+------------------+


## Create the Final Pipeline with the Custom NER Model

In [None]:
# Reload the saved, modified pipeline from disk with the `from_disk` method.

modified_pipeline = nlp.PretrainedPipeline.from_disk('modified_pipeline')

In [None]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_1aef7e9d2de5,
 MedicalNerModel_d92d47622e85,
 MedicalNerModel_32184c1db80b,
 MedicalNerModel_ada39ac0d359,
 NER_CONVERTER_a99db4e6a79d,
 NER_CONVERTER_4a9436714344,
 NER_CONVERTER_ea6433988e18,
 PretrainedZeroShotNER_5f30ab9002f1,
 NER_CONVERTER_c97040caf7b3,
 MedicalNerModel_b8b167ec3114,
 NER_CONVERTER_06db473f3215,
 ContextualEntityRuler_11ff6711ef6b,
 ChunkMergeModel_95d6827691bb,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 C

### New Stages

In [None]:
# Newly trained NER model, loaded from local disk.
ner_deid_new = medical.NerModel.load("models/new_NER_model")\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setOutputCol("ner_deid_new")

ner_deid_new_converter = medical.NerConverterInternal()\
    .setInputCols(["splitter", "token", "ner_deid_new"])\
    .setOutputCol("ner_chunk_new")

# Pretrained de-identification NER model, kept alongside the new model.
ner_deid = medical.NerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")\
    .setInputCols(["splitter", "token", "embeddings"])\
    .setOutputCol("ner_deid_subentity_docwise")

ner_deid_converter = medical.NerConverterInternal()\
    .setInputCols(["splitter", "token", "ner_deid_subentity_docwise"])\
    .setOutputCol("ner_chunk_subentity_docwise")

# Merge the chunks from both models into a single column.
chunk_merge_ner = medical.ChunkMergeModel()\
    .setInputCols("ner_chunk_new",  # New Trained Model
                  "ner_chunk_subentity_docwise")\
    .setOutputCol("deid_merged_ner_chunk")\
    .setOrderingFeatures(["ChunkLength", "ChunkBegin"])\
    .setMergeOverlapping(True)\
    .setResetSentenceIndices(True)


ner_deid_subentity_docwise download started this may take some time.
Approximate size to download 8.9 MB
[OK!]
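
`setOrderingFeatures(["ChunkLength", "ChunkBegin"])` controls which chunk wins when the two NER outputs overlap. The plain-Python sketch below illustrates one reading of that setting (longer chunks first, then earlier start offsets). It is a toy under stated assumptions, not the actual `ChunkMergeModel` implementation; `resolve_overlaps` and the sample chunks are hypothetical.

```python
def resolve_overlaps(chunks):
    """Greedy overlap resolution: prefer longer chunks, then earlier begin.

    Each chunk is (begin, end, label, source) with an inclusive end offset.
    In this toy, exact ties keep whichever chunk was listed first.
    """
    ranked = sorted(chunks, key=lambda c: (-(c[1] - c[0] + 1), c[0]))
    kept = []
    for cand in ranked:
        overlaps = any(cand[0] <= k[1] and k[0] <= cand[1] for k in kept)
        if not overlaps:
            kept.append(cand)
    # Return the surviving chunks in document order.
    return sorted(kept, key=lambda c: c[0])

sample_chunks = [
    (160, 174, "LOCATION", "ner_chunk_new"),                # longer span
    (164, 174, "LOCATION", "ner_chunk_subentity_docwise"),  # shorter, overlapping span
    (7, 14, "NAME", "ner_chunk_new"),
    (7, 14, "PATIENT", "ner_chunk_subentity_docwise"),      # same span, different label
]

merged = resolve_overlaps(sample_chunks)
for chunk in merged:
    print(chunk)
```

Here the longer `LOCATION` span survives and the shorter one that overlaps it is dropped, so each region of text ends up with a single chunk.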


### **Update Stages**

In [None]:
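modified_pipeline.model.stages = (
    # Keep the document assembler, splitter, tokenizer, and word embeddings.
    modified_pipeline.model.stages[:4]
    # Replace the original NER, converter, and merger block (indices 4-17).
    + [ner_deid_new,
       ner_deid_new_converter,
       ner_deid,
       ner_deid_converter,
       chunk_merge_ner]
    # Resume from the first contextual parser onward.
    + modified_pipeline.model.stages[18:]
)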

In [None]:
modified_pipeline.model.stages

[DocumentAssembler_ae0f203deedd,
 InternalDocumentSplitter_cc36578ceda6,
 REGEX_TOKENIZER_2e85686aea12,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_7c293f91a5d1,
 NerConverter_4463ddf8ec64,
 MedicalNerModel_32184c1db80b,
 NerConverter_46fe911277ef,
 ChunkMergeModel_9ca1a03bef13,
 CONTEXTUAL-PARSER_bf2a6abaf5fa,
 CONTEXTUAL-PARSER_ff6bad379d91,
 CONTEXTUAL-PARSER_89341cae7221,
 CONTEXTUAL-PARSER_c6b9eded8d31,
 CONTEXTUAL-PARSER_9480c24bd9f8,
 CONTEXTUAL-PARSER_3886bce391c8,
 CONTEXTUAL-PARSER_0bb3fb75cd01,
 ENTITY_EXTRACTOR_6792f2f6e85a,
 ENTITY_EXTRACTOR_74ace4be4f73,
 CONTEXTUAL-PARSER_dfb32adc7555,
 REGEX_MATCHER_5003669d6422,
 CONTEXTUAL-PARSER_746a25662aa6,
 CONTEXTUAL-PARSER_079220479a3d,
 CONTEXTUAL-PARSER_f8b8f9aafb9f,
 CONTEXTUAL-PARSER_7f824493eafc,
 REGEX_MATCHER_26934077fe57,
 REGEX_MATCHER_5fe3de8b5a4e,
 CONTEXTUAL-PARSER_64158658a948,
 CONTEXTUAL-PARSER_56b2e8abcd9a,
 MERGE_ddff59e8b14a,
 ChunkMergeModel_50feb5f97568,
 ContextualEntityRuler_08eeaa89c938,
 ChunkMe

### Reassemble and Save the Final Pipeline

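With the stages updated, the pipeline now uses the new custom NER model, the `ner_deid_subentity_docwise` model, and the reconfigured chunk merger in place of the original NER components. The cell below runs the pipeline once on an empty string, then saves it to disk as `new_pipeline`.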

In [None]:
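# Run the updated pipeline once on an empty document before saving.
empty_result = modified_pipeline.transform(spark.createDataFrame([[""]]).toDF("text"))

# Persist the updated pipeline, overwriting any earlier copy.
modified_pipeline.model.write().overwrite().save("new_pipeline")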

In [None]:
new_pipeline = nlp.PretrainedPipeline.from_disk('new_pipeline')

## Final Test of the New Pipeline

In [None]:
samples_df = spark.createDataFrame([[text]]).toDF("text")

result = new_pipeline.transform(samples_df).cache()

In [None]:
from pyspark.sql import functions as F

# Flatten the ner_chunk annotations into one row per detected chunk.
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['confidence']").alias("confidence")).show(50, truncate=False)

+------------------------+-----+----+---------+----------+
|chunk                   |begin|end |ner_label|confidence|
+------------------------+-----+----+---------+----------+
|John Lee                |7    |14  |NAME     |0.9422    |
|7789201                 |45   |51  |IDNUM    |0.71      |
|1973                    |80   |83  |DATE     |0.9687    |
|52                      |87   |88  |AGE      |0.998     |
|GH-556672               |110  |118 |IDNUM    |0.7311    |
|XXX-XX-1234             |139  |149 |IDNUM    |0.4833    |
|123 Main Street         |160  |174 |LOCATION |0.9873    |
|FALL RIVER              |177  |186 |LOCATION |0.65885   |
|NIAGARA FALLS           |189  |201 |LOCATION |0.54719996|
|NY                      |204  |205 |LOCATION |0.9033    |
|14304                   |207  |211 |LOCATION |0.73      |
|RD23-4897               |225  |233 |IDNUM    |0.50      |
|032-1902                |331  |338 |DATE     |0.4176    |
|2025-05-10              |356  |365 |DATE     |NULL     
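
The `arrays_zip` plus `explode` pattern used in the cell above turns the parallel arrays inside the `ner_chunk` column into one row per chunk. A plain-Python analogy, using two values from the output above and a hypothetical `sample_row` dictionary standing in for a Spark row:

```python
# One "row" holding parallel arrays, as the ner_chunk column does conceptually.
sample_row = {
    "result": ["John Lee", "7789201"],
    "begin": [7, 45],
    "end": [14, 51],
    "metadata": [
        {"entity": "NAME", "confidence": "0.9422"},
        {"entity": "IDNUM", "confidence": "0.71"},
    ],
}

# arrays_zip: combine the i-th element of every array into one tuple.
zipped = list(zip(sample_row["result"], sample_row["begin"],
                  sample_row["end"], sample_row["metadata"]))

# explode + select: emit one flat record per tuple.
records = [
    {
        "chunk": chunk,
        "begin": begin,
        "end": end,
        "ner_label": meta["entity"],
        "confidence": meta.get("confidence"),  # None when the key is absent
    }
    for chunk, begin, end, meta in zipped
]

for record in records:
    print(record)
```

A chunk whose metadata has no `confidence` entry would come out as `None` here, which is analogous to a `NULL` in the Spark output.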

In [None]:
pd.set_option("display.max_colwidth", 1000)
result_df = result.selectExpr("text","mask_entity.result as masked_result","obfuscated.result as obfuscated_result").toPandas()
result_df

Unnamed: 0,text,masked_result,obfuscated_result
0,"\nName: John Lee\nMedical Record Number (MRN): 7789201\nDate of Birth / Age / Sex: 1973 / 52 / Male\nAccession #: GH-556672\nSocial Security #: XXX-XX-1234\nAddress: 123 Main Street, FALL RIVER, NIAGARA FALLS, NY 14304\nSpecimen #: RD23-4897\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 032-1902\nDate Collected: 2025-05-10\nDate Received: 2025-05-10\nRequesting Physician: Dr. Jameson\nReferring Facility: General Hospital, New York City, NY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at Barstow Heights Christus Southeast, NY – St Elizabeth; New York City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytome...","[\nName: <NAME>\nMedical Record Number (MRN): <IDNUM>\nDate of Birth / Age / Sex: <DATE> / <AGE> / Male\nAccession #: <IDNUM>\nSocial Security #: <IDNUM>\nAddress: <LOCATION>, <LOCATION>, <LOCATION>, <LOCATION> <LOCATION>\nSpecimen #: <IDNUM>\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled <DATE>\nDate Collected: <DATE>\nDate Received: <DATE>\nRequesting Physician: Dr. 
<NAME>\nReferring Facility: <LOCATION>, <LOCATION> City, <LOCATION>\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at <LOCATION> <LOCATION>, <LOCATION> – <LOCATION>; <LOCATION> City.\nChromosome Analysis (Cytogenetics): (Addendum report to follow.)\nLeukemic Immunophenotyping (Flow Cytometry):\nDate...","[\nName: Gillie Allan\nMedical Record Number (MRN): 0074518\nDate of Birth / Age / Sex: 1974 / 44 / Male\nAccession #: PU-663305\nSocial Security #: WWW-WW-8529\nAddress: 3255 Independence Street, 302 W MCNEESE ST, 4101 NW 89TH BLVD, 16100 SOUTH FREEWAY 59 KOCH AVE\nSpecimen #: SA52-9740\nMaterial Received: Gastric, Ileum, and Random Colon Biopsies\nMaterial Details: 6 slides labeled 202-1902\nDate Collected: 2025-06-27\nDate Received: 2025-06-27\nRequesting Physician: Dr. Marchelle\nReferring Facility: 310 Ellis Street, 2000 Boise Ave City, 16100 SOUTH FREEWAY\nClinical History: None Given.\nClinical Note: Peripheral sequestration, i.e., splenomegaly or hepatomegaly, should be excluded to be sure if peripheral sequestration is not present.\nCPT Code(s): 88305\nGastric, ileum, and random colon, biopsies:\nNo evidence of dysplasia.\nThe following special studies were performed at 227 Mountain Dr 55 Nicomedes Rivera Street, 16100 SOUTH FREEWAY – 969 Lakeland Drive; 2000 Boise Ave Cit..."

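
The `masked_result` column can be read as the original text with each detected span replaced by its label. As a rough plain-Python illustration (not the de-identification stage's actual implementation; `mask_text` is a hypothetical helper), using the first two chunks and their inclusive offsets from the chunk table shown earlier:

```python
def mask_text(text, chunks):
    """Replace each (begin, end, label) span with <label>.

    The end offset is inclusive, and chunks are assumed not to overlap.
    """
    out = []
    cursor = 0
    for begin, end, label in sorted(chunks):
        out.append(text[cursor:begin])  # untouched text before the chunk
        out.append(f"<{label}>")        # label in place of the chunk
        cursor = end + 1
    out.append(text[cursor:])           # remaining text after the last chunk
    return "".join(out)

sample_text = "\nName: John Lee\nMedical Record Number (MRN): 7789201\n"
sample_spans = [(7, 14, "NAME"), (45, 51, "IDNUM")]

masked = mask_text(sample_text, sample_spans)
print(masked)
```

Obfuscation follows the same span-replacement idea, except that each span is swapped for a realistic surrogate value rather than a label.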