![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.4.PipelineTracer_and_PipelineOutputParser.ipynb)

#   **📜 PipelineTracer and PipelineOutputParser**


## Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
!pip install --upgrade -q pyspark==3.5.0 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
!pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
!pip install -q spark-nlp-display

In [3]:

import json
import os

import pandas as pd

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.types as T

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

spark = sparknlp_jsl.start(secret = license_keys['SECRET'])

spark.sparkContext.setLogLevel("ERROR")

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.5.3
Spark NLP_JSL Version : 5.5.3


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# PipelineTracer



`PipelineTracer` traces the stages of a pipeline and reports information about them, including the entities, assertions, and relations the pipeline can produce.
It works with both `PipelineModel` and `PretrainedPipeline` objects.
It can also build a parser dictionary, which is then used to configure a `PipelineOutputParser`.


## **🔎 Methods**

`PipelineTracer` exposes the following methods:

- `printPipelineSchema`: Prints the schema of the pipeline.
- `createParserDictionary`: Returns a parser dictionary that can be used to create a `PipelineOutputParser`.
- `getPossibleEntities`: Returns a list of possible entities that the pipeline can include.
- `getPossibleAssertions`: Returns a list of possible assertions that the pipeline can include.
- `getPossibleRelations`: Returns a list of possible relations that the pipeline can include.
- `getPipelineStages`: Returns a list of `PipelineStage` objects that represent the stages of the pipeline.
- `getParserDictDirectly`: Returns a parser dictionary for a named pretrained pipeline without first creating a `PipelineTracer` object.
- `listAvailableModels`: Returns a list of available models for a given language and source.
- `showAvailableModels`: Prints a list of available models for a given language and source.

In [4]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser


### showAvailableModels

In [None]:
PipelineTracer.showAvailableModels(language="en", source="clinical/models")

clinical_deidentification
explain_clinical_doc_ade
explain_clinical_doc_biomarker
explain_clinical_doc_generic
explain_clinical_doc_granular
explain_clinical_doc_medication
explain_clinical_doc_oncology
explain_clinical_doc_public_health
explain_clinical_doc_radiology
explain_clinical_doc_risk_factors
explain_clinical_doc_vop
icd10cm_resolver_pipeline
icd10cm_rxnorm_resolver_pipeline
rxnorm_resolver_pipeline
snomed_resolver_pipeline


### listAvailableModels

In [None]:
for model in PipelineTracer.listAvailableModels():
  print(PipelineTracer.getParserDictDirectly(model))

{'document_identifier': 'clinical_deidentification', 'document_text': 'sentence', 'entities': ['ner_chunk'], 'assertions': [], 'resolutions': [], 'relations': [], 'summaries': [], 'deidentifications': [{'original': 'sentence', 'obfuscated': 'obfuscated', 'masked': ''}], 'classifications': []}
{'document_identifier': 'explain_clinical_doc_ade', 'document_text': 'document', 'entities': ['ner_chunks_ade'], 'assertions': ['assertion'], 'resolutions': [], 'relations': ['relations'], 'summaries': [], 'deidentifications': [], 'classifications': [{'classification_column_name': 'class', 'sentence_column_name': 'sentence'}]}
{'document_identifier': 'explain_clinical_doc_biomarker', 'document_text': 'document', 'entities': ['ner_biomarker_chunk'], 'assertions': [], 'resolutions': [], 'relations': ['re_oncology_biomarker_result_wip'], 'summaries': [], 'deidentifications': [], 'classifications': [{'classification_column_name': 'prediction', 'sentence_column_name': 'sentence'}]}
{'document_identifie
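The dictionaries printed above differ mainly in which categories are populated. A minimal sketch of summarizing that, using two of the dictionaries copied from the output; `populated_categories` is a hypothetical helper, not part of `sparknlp_jsl`:

```python
# Sample parser dictionaries copied from the output above (first two pipelines only).
parser_dicts = [
    {
        "document_identifier": "clinical_deidentification",
        "document_text": "sentence",
        "entities": ["ner_chunk"],
        "assertions": [],
        "resolutions": [],
        "relations": [],
        "summaries": [],
        "deidentifications": [
            {"original": "sentence", "obfuscated": "obfuscated", "masked": ""}
        ],
        "classifications": [],
    },
    {
        "document_identifier": "explain_clinical_doc_ade",
        "document_text": "document",
        "entities": ["ner_chunks_ade"],
        "assertions": ["assertion"],
        "resolutions": [],
        "relations": ["relations"],
        "summaries": [],
        "deidentifications": [],
        "classifications": [
            {"classification_column_name": "class", "sentence_column_name": "sentence"}
        ],
    },
]


def populated_categories(parser_dict):
    """Return the keys whose value is a non-empty list (hypothetical helper)."""
    return [key for key, value in parser_dict.items()
            if isinstance(value, list) and value]


summary = {d["document_identifier"]: populated_categories(d) for d in parser_dicts}
for name, categories in summary.items():
    print(name, "->", categories)
```

A summary like this can help when choosing which pretrained pipeline to load for a given task.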

### createParserDictionary

In [None]:
from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")


explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]


In [None]:
tracer = PipelineTracer(oncology_pipeline)
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}
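Because the parser dictionary is a plain Python `dict`, it can be inspected or edited before it is passed on. The sketch below checks its shape against the keys seen in the output above. `check_parser_dict` is a hypothetical helper, and the rule that list-valued keys may be omitted is an assumption based on the hand-written dictionary later in this notebook, not a documented guarantee:

```python
# Keys taken from the parser dictionary printed above.
LIST_KEYS = ("entities", "assertions", "resolutions", "relations",
             "summaries", "deidentifications", "classifications")
STRING_KEYS = ("document_identifier", "document_text")


def check_parser_dict(column_maps):
    """Return a list of problems; an empty list means the shape looks fine.

    Hypothetical helper, not part of sparknlp_jsl.
    """
    problems = []
    for key in STRING_KEYS:
        if not isinstance(column_maps.get(key), str):
            problems.append(f"{key!r} should be a string")
    for key in LIST_KEYS:
        # Assumption: a missing list key is acceptable and treated as empty.
        if not isinstance(column_maps.get(key, []), list):
            problems.append(f"{key!r} should be a list")
    unknown = set(column_maps) - set(LIST_KEYS) - set(STRING_KEYS)
    problems.extend(f"unexpected key {key!r}" for key in sorted(unknown))
    return problems


oncology_maps = {
    "document_identifier": "",
    "document_text": "document",
    "entities": ["merged_chunk", "merged_chunk_for_assertion"],
    "assertions": ["assertion"],
    "resolutions": [],
    "relations": ["all_relations"],
    "summaries": [],
    "deidentifications": [],
    "classifications": [],
}

problems_ok = check_parser_dict(oncology_maps)
problems_bad = check_parser_dict(
    {"document_text": "document", "entities": "merged_chunk", "relation": []}
)
print(problems_ok)
print(problems_bad)
```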

### printPipelineSchema

In [None]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_27a75510357d)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetectorDLModel
 |    |-- uid: string (SentenceDetectorDLModel_6bafc4746ea5)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_6e5cf9a1fd71)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |  

### getPossibleEntities

In [None]:
tracer.getPossibleEntities()

['Cycle_Number',
 'Direction',
 'Histological_Type',
 'Biomarker_Result',
 'Site_Other_Body_Part',
 'Hormonal_Therapy',
 'Death_Entity',
 'Targeted_Therapy',
 'Route',
 'Tumor_Finding',
 'Duration',
 'Pathology_Result',
 'Chemotherapy',
 'Date',
 'Radiotherapy',
 'Radiation_Dose',
 'Oncogene',
 'Cancer_Surgery',
 'Tumor_Size',
 'Staging',
 'Pathology_Test',
 'Cancer_Dx',
 'Age',
 'Site_Lung',
 'Site_Breast',
 'Site_Liver',
 'Site_Lymph_Node',
 'Response_To_Treatment',
 'Site_Brain',
 'Immunotherapy',
 'Race_Ethnicity',
 'Metastasis',
 'Smoking_Status',
 'Imaging_Test',
 'Relative_Date',
 'Line_Of_Therapy',
 'Unspecific_Therapy',
 'Site_Bone',
 'Gender',
 'Cycle_Count',
 'Cancer_Score',
 'Adenopathy',
 'Grade',
 'Biomarker',
 'Invasion',
 'Frequency',
 'Performance_Status',
 'Dosage',
 'Cycle_Day',
 'Anatomical_Site',
 'Size_Trend',
 'Posology_Information',
 'Cancer_Therapy',
 'Lymph_Node',
 'Tumor_Description',
 'Lymph_Node_Modifier',
 'Alcohol',
 'BMI',
 'Communicable_Disease',
 'Obes

### getPossibleAssertions

In [None]:
tracer.getPossibleAssertions()

['Past', 'Family', 'Absent', 'Hypothetical', 'Possible', 'Present']

### getPossibleRelations

In [None]:
tracer.getPossibleRelations()

['is_size_of', 'is_date_of', 'is_location_of', 'is_finding_of']

### getPipelineStages

In [None]:
stages = tracer.getPipelineStages()
for stage in stages:
    print(stage.__dict__())

{'uid': 'DocumentAssembler_27a75510357d', 'name': 'DocumentAssembler', 'index': 0, 'inputCol': StageField(inputCol, text, string), 'outputCol': StageField(outputCol, document, string), 'inputAnnotatorType': StageField(inputAnnotatorType, ----------, none), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'SentenceDetectorDLModel_6bafc4746ea5', 'name': 'SentenceDetectorDLModel', 'index': 1, 'inputCol': StageField(inputCols, [document], array), 'outputCol': StageField(outputCol, sentence, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'REGEX_TOKENIZER_6e5cf9a1fd71', 'name': 'TokenizerModel', 'index': 2, 'inputCol': StageField(inputCols, [sentence], array), 'outputCol': StageField(outputCol, token, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, token,

## With a Custom Pipeline




In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["TREATMENT", "PROBLEM"])

clinical_assertion = AssertionDLModel \
    .pretrained("assertion_dl_large", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setIncludeConfidence(True) \
    .setEntityAssertionCaseSensitive(False) \
    .setEntityAssertion({"treAtment": ["present"]}) \
    .setReplaceLabels({"PRESENT": "available", "absent": "none", "Conditional": "Optional"})

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_dl_large download started this may take some time.
[OK!]


In [None]:
tracer = PipelineTracer(model)
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['ner_chunk'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [None]:
tracer.getPossibleAssertions()

['available',
 'none',
 'hypothetical',
 'possible',
 'Optional',
 'associated_with_someone_else']
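The list above reflects the `setReplaceLabels` mapping configured on the assertion model: the replaced names appear even though the mapping keys use mixed casing. The sketch below reproduces that effect in plain Python. It is an illustration only, not the library's implementation, and the original label set is an assumption since the notebook does not print it:

```python
def replace_labels(labels, replacements):
    """Apply a label mapping, matching keys case-insensitively (illustrative only)."""
    lowered = {key.lower(): value for key, value in replacements.items()}
    return [lowered.get(label.lower(), label) for label in labels]


# Assumed original label set of the assertion model; not shown in this notebook.
assumed_original = ["present", "absent", "hypothetical", "possible",
                    "conditional", "associated_with_someone_else"]

# Mapping copied from the setReplaceLabels call above.
replacements = {"PRESENT": "available", "absent": "none", "Conditional": "Optional"}

renamed = replace_labels(assumed_original, replacements)
print(renamed)
```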

In [None]:
tracer.getPossibleEntities()

['TREATMENT', 'PROBLEM']
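One possible use of this list is to sanity-check label filters written elsewhere in an application. A minimal sketch, where `requested` is a hypothetical downstream filter:

```python
possible_entities = ["TREATMENT", "PROBLEM"]  # as returned above
requested = ["PROBLEM", "TEST"]               # hypothetical downstream filter

# Labels the pipeline can never emit, given the whitelist on the NER converter.
unknown = sorted(set(requested) - set(possible_entities))
if unknown:
    print("Labels not produced by this pipeline:", unknown)
```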

In [None]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_fe752ae88297)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetector
 |    |-- uid: string (SentenceDetector_f2e16279691e)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_07d24c81a279)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |    |-- outputCo

# StructuredJsonConverter
`StructuredJsonConverter` is an annotator that converts the raw annotations produced by a pipeline into a structured, prettified, JSON-style dictionary that is easy to read and convenient for API integration. A configurable schema mapping (`column_maps`, passed through `setConverterSchema`) tells the converter which output columns hold each annotation type: entities, assertions, resolutions, relations, summaries, deidentifications, and classifications.

## explain_clinical_doc_oncology

In [5]:
from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")


explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]


In [6]:
text = """The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response"""

data = spark.createDataFrame([text], T.StringType()).toDF("text")
data.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a c...|
+----------------------------------------------------------------------------------------------------+



In [7]:
result_df = oncology_pipeline.transform(data)
result_df.show(truncate = 40)

+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+---------------

In [8]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(oncology_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

**.setOutputAsStr(True)**

In [9]:
output_converter = StructuredJsonConverter()\
    .setOutputCol("result")\
    .setConverterSchema(column_maps)\
    .setCleanAnnotations(False)\
    .setReturnRelationEntities(True)\
    .setOutputAsStr(True)

json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  result|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"result":{"document_identifier":"6a6b6cdd-8011-4def-95a5-6cf8e789d529","document_text":["The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ova...|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [10]:
result_collections = json_output.collect()
# The column holds JSON text, so parse it with json.loads instead of eval.
json.loads(result_collections[0].result)

{'result': {'document_identifier': '6a6b6cdd-8011-4def-95a5-6cf8e789d529',
  'document_text': ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response'],
  'entities': [{'begin': '24',
    'chunk': 'computed tomography',
    'ner_source': 'ner_oncology_chunk',
    'end': '42',
    'ner_label': 'Imaging_Test',
    'chunk_id': '1b71b12a',
    'sentence': '0',
    'ner_confidence': '0.9575'},
   {
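With `setOutputAsStr(True)` the `result` column holds JSON text, so the standard `json` module is a natural way to decode it. The sketch below uses a hand-written sample in the same shape as the output above. The `null` value is an assumption added to show a case that a JSON parser handles and Python literal evaluation does not:

```python
import json

# Hand-written sample shaped like the string output above (not real pipeline output).
raw = (
    '{"result": {"document_identifier": "doc-1", '
    '"entities": [{"chunk": "computed tomography", "begin": "24", '
    '"end": "42", "ner_label": "Imaging_Test", "ner_source": null}]}}'
)

parsed = json.loads(raw)
first_entity = parsed["result"]["entities"][0]
print(first_entity["chunk"], first_entity["ner_source"])

# Offsets are strings in this output, so cast them before doing arithmetic.
span_length = int(first_entity["end"]) - int(first_entity["begin"]) + 1
print(span_length)
```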

**.setOutputAsStr(False)**

In [11]:
output_converter = StructuredJsonConverter()\
    .setOutputCol("result")\
    .setConverterSchema(column_maps)\
    .setCleanAnnotations(False)\
    .setReturnRelationEntities(True)\
    .setOutputAsStr(False)

json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  result|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{5d93f9ac-58dd-4b65-988f-1b3fd0d0b795, [The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later w...|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [12]:
result_collections = json_output.collect()
for record in result_collections:
    for k,v in column_maps.items():
        print(k,record.result[k])

document_identifier 5d93f9ac-58dd-4b65-988f-1b3fd0d0b795
document_text ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response']
entities [{'ner_label': 'Imaging_Test', 'sentence': '0', 'chunk': 'computed tomography', 'end': '42', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.9575', 'begin': '24', 'chunk_id': '1b71b12a'}, {'ner_label': 'Imaging_Test', 'sentence': '0', 'chunk': 'CT',

**.setParentSource("chunk")**

With the `.setParentSource("chunk")` option, the output is organized per chunk instead of following the base schema: each chunk record carries its own NER label, assertion, and relations.

Additionally, setting the `sentenceColumn` parameter with `.setSentenceColumn("sentence")` adds the sentence id and sentence text to each chunk record.

In [20]:
output_converter = StructuredJsonConverter()\
    .setOutputCol("result")\
    .setConverterSchema(column_maps)\
    .setCleanAnnotations(False)\
    .setReturnRelationEntities(True)\
    .setOutputAsStr(True)\
    .setParentSource("chunk")\
    .setSentenceColumn("sentence")

json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  result|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"result":[{"chunk_id":"1b71b12a","chunk":"computed tomography","begin":24,"end":42,"sentence_id":0,"sentence":"The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, whic...|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [22]:
result_collections = json_output.collect()
# The column holds JSON text, so parse it with json.loads instead of eval.
json.loads(result_collections[0].result)

{'result': [{'chunk_id': '1b71b12a',
   'chunk': 'computed tomography',
   'begin': 24,
   'end': 42,
   'sentence_id': 0,
   'sentence': 'The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass.',
   'ner_label': 'Imaging_Test',
   'ner_source': 'ner_oncology_chunk',
   'ner_confidence': '0.9575',
   'assertion': 'Past',
   'assertion_confidence': '1.0',
   'relations': []},
  {'chunk_id': 'ce9ac1a9',
   'chunk': 'CT',
   'begin': 45,
   'end': 46,
   'sentence_id': 0,
   'sentence': 'The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass.',
   'ner_label': 'Imaging_Test',
   'ner_source': 'ner_oncology_chunk',
   'ner_confidence': '0.9565',
   'assertion': 'Present',
   'assertion_confidence': '0.8937',
   'relations': []},
  {'chunk_id': '3576c965',
   'chunk': 'abdomen',
   'begin': 61,
   'end': 67,
   'sentence_id': 0,
   'sentence': 'The Patient underwent a 
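Since every chunk record is self-contained in this layout, ordinary Python is enough to regroup or filter the records. A minimal sketch using the first two records from the output above, trimmed to the fields used here; the 0.9 threshold is an arbitrary value chosen for illustration:

```python
from collections import defaultdict

# Two chunk records copied from the output above, trimmed to the fields used here.
chunks = [
    {"chunk": "computed tomography", "sentence_id": 0, "ner_label": "Imaging_Test",
     "assertion": "Past", "assertion_confidence": "1.0"},
    {"chunk": "CT", "sentence_id": 0, "ner_label": "Imaging_Test",
     "assertion": "Present", "assertion_confidence": "0.8937"},
]

# Group chunk texts by their assertion status.
by_assertion = defaultdict(list)
for record in chunks:
    by_assertion[record["assertion"]].append(record["chunk"])

# Confidence values are strings in the output, so cast before comparing.
confident = [r["chunk"] for r in chunks if float(r["assertion_confidence"]) >= 0.9]

print(dict(by_assertion))
print(confident)
```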

# PipelineOutputParser

`PipelineOutputParser` takes the raw output of a pretrained pipeline and returns it as a clear, prettified dictionary that is easy to read and process. It is intended to simplify API integration and streamline data analysis workflows.

## clinical_deidentification

In [None]:

from sparknlp.pretrained import PretrainedPipeline
pretrained_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

text = [
    """Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .""",
    """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",
]

results = pretrained_pipeline.fullAnnotate(text)


clinical_deidentification download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(pretrained_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "clinical_deidentification"})
column_maps

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': ['ner_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': []}

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("clinical_deidentification", "en", "clinical/models")
columns_directly

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': ['ner_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': []}

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'clinical_deidentification',
   'document_text': ['Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .',
    'PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .',
    'Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .'],
   'entities': [{'chunk_id': '78463532',
     'chunk': '2093-01-13',
     'begin': 14,
     'end': 23,
     'ner_label': 'DATE',
     'ner_source': None,
     'ner_confidence': None},
    {'chunk_id': '60a35054',
     'chunk': 'David Hale',
     'begin': 27,
     'end': 36,
     'ner_label': 'DOCTOR',
     'ner_source': 'ner_chunk_enriched',
     'ner_confidence': '0.9895'},
    {'chunk_id': '9d3e7907',
     'chunk': 'Hendrickson Ora',
     'begin': 55,
     'end': 69,
     'ner_label': 'PATIENT',
     'ner_source': 'ner_chunk_enriched',
     'ner_confidence': '0.99300003'},
    {'chunk_id': '81bc095c',
     'chunk': '7194334',
     'begin': 78,
     'e
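The `begin` and `end` values in the entities above are character offsets into the input text, and `end` is inclusive. The sketch below shows this with the first sentence of the sample text and three of the entities listed above; `mask` is a hypothetical helper that only illustrates the offset arithmetic and is not how the pipeline performs deidentification:

```python
# First sentence of the first sample text above.
document = ("Record date : 2093-01-13 , David Hale , M.D . , "
            "Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .")

# Entities copied from the parser output above, trimmed to the fields used here.
entities = [
    {"chunk": "2093-01-13", "begin": 14, "end": 23, "ner_label": "DATE"},
    {"chunk": "David Hale", "begin": 27, "end": 36, "ner_label": "DOCTOR"},
    {"chunk": "Hendrickson Ora", "begin": 55, "end": 69, "ner_label": "PATIENT"},
]


def mask(text, spans):
    """Replace each span with <LABEL>, right to left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s["begin"], reverse=True):
        text = text[:span["begin"]] + f"<{span['ner_label']}>" + text[span["end"] + 1:]
    return text


# `end` is inclusive, so the Python slice needs end + 1.
slices = [document[e["begin"]:e["end"] + 1] for e in entities]
masked = mask(document, entities)

print(slices)
print(masked)
```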

**entities**

In [None]:
pd.DataFrame.from_dict(result["result"][0]["entities"])

| | chunk_id | chunk | begin | end | ner_label | ner_source | ner_confidence |
|---|---|---|---|---|---|---|---|
| 0 | 78463532 | 2093-01-13 | 14 | 23 | DATE | | |
| 1 | 60a35054 | David Hale | 27 | 36 | DOCTOR | ner_chunk_enriched | 0.9895 |
| 2 | 9d3e7907 | Hendrickson Ora | 55 | 69 | PATIENT | ner_chunk_enriched | 0.99300003 |
| 3 | 81bc095c | 7194334 | 78 | 84 | MEDICALRECORD | entity_med | 0.71 |
| 4 | 3648e0b6 | 01/13/93 | 93 | 100 | DATE | | |
| 5 | 9356dcf7 | Oliveira | 110 | 117 | DOCTOR | ner_chunk_enriched | 0.9999 |
| 6 | 81eed3d2 | 25 | 121 | 122 | AGE | entity_age | 0.75 |
| 7 | 0d2359a4 | 2079-11-09 | 150 | 159 | DATE | | |
| 8 | 9ec42c27 | Cocke County Baptist Hospital | 163 | 191 | HOSPITAL | ner_chunk_enriched | 0.97572505 |
| 9 | 70015484 | 0295 Keats Street | 195 | 211 | STREET | ner_chunk_enriched | 0.7954333 |
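In the parsed entities, `ner_confidence` is either a string or `None`, so it needs converting before any numeric filtering. A minimal sketch with four of the entities above; `confidence` is a hypothetical helper and the 0.8 threshold is arbitrary:

```python
def confidence(entity):
    """Return ner_confidence as a float, or None when no score was reported."""
    value = entity.get("ner_confidence")
    return float(value) if value is not None else None


# Entities copied from the parser output above, trimmed to the fields used here.
entities = [
    {"chunk": "2093-01-13", "ner_label": "DATE", "ner_confidence": None},
    {"chunk": "David Hale", "ner_label": "DOCTOR", "ner_confidence": "0.9895"},
    {"chunk": "7194334", "ner_label": "MEDICALRECORD", "ner_confidence": "0.71"},
    {"chunk": "25", "ner_label": "AGE", "ner_confidence": "0.75"},
]

threshold = 0.8  # arbitrary cut-off chosen for illustration

# Entities without a score are treated as 0.0 here and reported separately.
high = [e["chunk"] for e in entities if (confidence(e) or 0.0) >= threshold]
unscored = [e["chunk"] for e in entities if confidence(e) is None]

print(high)
print(unscored)
```

Whether unscored entities should be dropped or kept depends on the application; the sketch only separates them so the choice is explicit.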


**deidentifications**

In [None]:
pd.DataFrame.from_dict(result["result"][0]["deidentifications"])

| | original | obfuscated | masked |
|---|---|---|---|
| 0 | `[Record date : 2093-01-13 , David Hale , M.D ....` | `[Record date : 2093-02-08 , Laurita Porta , M....` | `[Record date : <DATE> , <DOCTOR> , M.D . , Nam...` |


## icd10cm_resolver_pipeline

In [None]:
from sparknlp.pretrained import PretrainedPipeline

icd10cm_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")

text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage"""

results = icd10cm_pipeline.fullAnnotate(text)

# for stages in icd10cm_pipeline.model.stages:
#     try:
#         inputCol = stages.getInputCol()
#     except:
#         inputCol = stages.getInputCols()
#     print(inputCol,stages.getOutputCol())

icd10cm_resolver_pipeline download started this may take some time.
Approx size to download 3.3 GB
[OK!]


In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(icd10cm_pipeline)

# column_maps = pipeline_tracer.createParserDictionary()
# column_maps.update({"document_identifier": "icd10cm_resolver_pipeline"})
# column_maps

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("icd10cm_resolver_pipeline", "en", "clinical/models")
columns_directly

{'document_identifier': 'icd10cm_resolver_pipeline',
 'document_text': 'document',
 'entities': ['chunk'],
 'assertions': [],
 'resolutions': [{'vocab': 'icd10cm', 'resolver_column_name': 'icd10cm'}],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(columns_directly)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'icd10cm_resolver_pipeline',
   'document_text': ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage'],
   'entities': [{'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'begin': 39,
     'end': 67,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9424'},
    {'chunk_id': 'd280706c',
     'chunk': 'anisakiasis',
     'begin': 95,
     'end': 105,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9933'},
    {'chunk_id': '9df194a1',
     'chunk': 'fetal and neonatal hemorrhage',
     'begin': 135,
     'end': 163,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.7501'}],
   'assertions': [],
   'resolutions': [{'vocab': 'icd10cm',
     'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'code': 'O2

In [None]:
column_maps = {
    "document_identifier": "icd10cm_resolver_pipeline",
    "document_text": "document",
    "entities": ["chunk"],
    # "assertions": [],
    "resolutions": [
        {
            "vocab":"icd10",
            "resolver_column_name": "icd10cm"
        }
    ]
    # "relations": [],
    # "summaries": [],
    # "deidentifications": [],
    # "classifications":[]
}

pipeline_parser = PipelineOutputParser(column_maps)

parsed_result = pipeline_parser.run(results)
parsed_result

{'result': [{'document_identifier': 'icd10cm_resolver_pipeline',
   'document_text': ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage'],
   'entities': [{'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'begin': 39,
     'end': 67,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9424'},
    {'chunk_id': 'd280706c',
     'chunk': 'anisakiasis',
     'begin': 95,
     'end': 105,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9933'},
    {'chunk_id': '9df194a1',
     'chunk': 'fetal and neonatal hemorrhage',
     'begin': 135,
     'end': 163,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.7501'}],
   'assertions': [],
   'resolutions': [{'vocab': 'icd10',
     'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'code': 'O24.

**entities**

In [None]:
pd.DataFrame.from_dict(result["result"][0]["entities"])

Unnamed: 0,chunk_id,chunk,begin,end,ner_label,ner_source,ner_confidence
0,230909b1,gestational diabetes mellitus,39,67,PROBLEM,,0.9424
1,d280706c,anisakiasis,95,105,PROBLEM,,0.9933
2,9df194a1,fetal and neonatal hemorrhage,135,163,PROBLEM,,0.7501


In [None]:
pd.DataFrame.from_dict(result["result"][0]["resolutions"])

Unnamed: 0,vocab,chunk_id,chunk,code,resolutions,all_k_codes,all_k_resolutions,all_k_aux_labels,all_k_distances
0,icd10cm,230909b1,gestational diabetes mellitus,O24.919,O24.919,E11.9,O24.919:::E11.9,,0.0:::0.0
1,icd10cm,d280706c,anisakiasis,B81.0,B81.0,,B81.0:::,,0.0:::0.0
2,icd10cm,9df194a1,fetal and neonatal hemorrhage,P549,fetal or neonatal hemorrhage,P549:::P545:::O3689:::P101:::P548:::P528:::P26...,fetal or neonatal hemorrhage:::neonatal cutane...,,2.8578:::5.1898:::5.2974:::6.1457:::6.1773:::6...
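Both tables above share a `chunk_id` column, so each entity can be paired with its resolved code. The sketch below does that join in plain Python on a few fields copied from the output above; in the notebook the two lists would come from `result["result"][0]["entities"]` and `result["result"][0]["resolutions"]`.

```python
# Fields copied from the entities and resolutions output shown above.
entities = [
    {"chunk_id": "230909b1", "chunk": "gestational diabetes mellitus", "ner_label": "PROBLEM"},
    {"chunk_id": "d280706c", "chunk": "anisakiasis", "ner_label": "PROBLEM"},
    {"chunk_id": "9df194a1", "chunk": "fetal and neonatal hemorrhage", "ner_label": "PROBLEM"},
]
resolutions = [
    {"chunk_id": "230909b1", "vocab": "icd10cm", "code": "O24.919"},
    {"chunk_id": "d280706c", "vocab": "icd10cm", "code": "B81.0"},
    {"chunk_id": "9df194a1", "vocab": "icd10cm", "code": "P549"},
]

# Index the resolved codes by chunk_id, then attach them to each entity.
code_by_chunk = {r["chunk_id"]: r["code"] for r in resolutions}
joined = [{**e, "code": code_by_chunk.get(e["chunk_id"])} for e in entities]

for row in joined:
    print(row["chunk"], "->", row["code"])
```

Entities without a resolution get `None` as their code, because the lookup uses `dict.get`.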


## explain_clinical_doc_biomarker

In [None]:
from sparknlp.pretrained import PretrainedPipeline

biomarker_pipeline = PretrainedPipeline("explain_clinical_doc_biomarker", "en", "clinical/models")

results = biomarker_pipeline.fullAnnotate("""In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.""")


# for stages in biomarker_pipeline.model.stages:
#     try:
#         inputCol = stages.getInputCol()
#     except:
#         inputCol = stages.getInputCols()
#     print(inputCol,stages.getOutputCol())

explain_clinical_doc_biomarker download started this may take some time.
Approx size to download 2 GB
[OK!]


In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(biomarker_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_biomarker"})
column_maps

{'document_identifier': 'explain_clinical_doc_biomarker',
 'document_text': 'document',
 'entities': ['merged_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': ['re_oncology_biomarker_result_wip'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [{'classification_column_name': 'prediction',
   'sentence_column_name': 'sentence'}]}

In [None]:
# from sparknlp_jsl.pipeline_tracer import PipelineTracer

# columns_directly = PipelineTracer.getParserDictDirectly("explain_clinical_doc_biomarker", "en", "clinical/models")
# columns_directly

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_biomarker',
   'document_text': ['In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.'],
   'entities': [{'chunk_id': 'bc15add6',
     'chunk': 'positive',
     'begin': 84,
     'end': 91,
     'ner_label': 'Biomarker_Result',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.9672'},
    {'chunk_id': 'b473fd80',
     'chunk': 'CD9',
     'begin': 97,
     'end': 99,
     'ner_label': 'Biomarker',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.992'},
    {'chunk_id': '0252d08a',
     'chunk': 'CD10',
     'begin': 105,
     'end': 108,
     'ner_label': 'Bio

**entities**

In [None]:
pd.DataFrame.from_dict(result["result"][0]["entities"])

Unnamed: 0,chunk_id,chunk,begin,end,ner_label,ner_source,ner_confidence
0,bc15add6,positive,84,91,Biomarker_Result,ner_oncology_chunk,0.9672
1,b473fd80,CD9,97,99,Biomarker,ner_oncology_chunk,0.992
2,0252d08a,CD10,105,108,Biomarker,ner_oncology_chunk,0.9987
3,eeb6f7ec,tumor markers,151,163,Biomarker,ner_oncology_chunk,0.48290002
4,7ed223b4,elevated level,172,185,Biomarker_Result,ner_oncology_chunk,0.90779996
5,368cf412,Cyfra21-1,190,198,Biomarker,ner_oncology_chunk,0.9851
6,c7d6148b,4.77 ng/mL,201,210,Biomarker_Result,ner_oncology_chunk,0.9719
7,29a967e6,NSE,213,215,Biomarker,ner_oncology_chunk,0.9991
8,47876c89,19.60 ng/mL,218,228,Biomarker_Result,ner_oncology_chunk,0.96005
9,e3d3c90c,SCCA,235,238,Biomarker,ner_oncology_chunk,0.9979


In [None]:
pd.DataFrame.from_dict(result["result"][0]["relations"])

Unnamed: 0,relation,chunk1_id,chunk1,chunk2_id,chunk2,confidence,direction
0,is_finding_of,bc15add6,positive,b473fd80,CD9,0.9932805,both
1,is_finding_of,bc15add6,positive,0252d08a,CD10,0.9988914,both
2,is_finding_of,eeb6f7ec,tumor markers,7ed223b4,elevated level,0.90050846,both
3,O,eeb6f7ec,tumor markers,c7d6148b,4.77 ng/mL,0.7407979,both
4,O,eeb6f7ec,tumor markers,47876c89,19.60 ng/mL,0.9778502,both
5,O,eeb6f7ec,tumor markers,6189a4f9,2.58 ng/mL,0.9993332,both
6,is_finding_of,7ed223b4,elevated level,368cf412,Cyfra21-1,0.9950375,both
7,O,7ed223b4,elevated level,29a967e6,NSE,0.81141526,both
8,O,7ed223b4,elevated level,e3d3c90c,SCCA,0.9064728,both
9,is_finding_of,368cf412,Cyfra21-1,c7d6148b,4.77 ng/mL,0.9818734,both
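The relations table above mixes labeled relations with pairs labeled `O` (no relation). A minimal sketch of filtering it in plain Python, using five rows copied from the table; the 0.95 threshold is an arbitrary value chosen for illustration, and `float()` is applied in case the parser returns confidences as strings, which is an assumption.

```python
# Rows copied from the relations table above: (relation, chunk1, chunk2, confidence).
relations = [
    {"relation": "is_finding_of", "chunk1": "positive", "chunk2": "CD9", "confidence": 0.9932805},
    {"relation": "is_finding_of", "chunk1": "positive", "chunk2": "CD10", "confidence": 0.9988914},
    {"relation": "is_finding_of", "chunk1": "tumor markers", "chunk2": "elevated level", "confidence": 0.90050846},
    {"relation": "O", "chunk1": "tumor markers", "chunk2": "4.77 ng/mL", "confidence": 0.7407979},
    {"relation": "is_finding_of", "chunk1": "elevated level", "chunk2": "Cyfra21-1", "confidence": 0.9950375},
]

MIN_CONFIDENCE = 0.95  # illustrative threshold, not a recommended value

# Drop "O" pairs, then keep only the confident relations.
labeled = [r for r in relations if r["relation"] != "O"]
confident = [r for r in labeled if float(r["confidence"]) >= MIN_CONFIDENCE]

for r in confident:
    print(r["chunk1"], r["relation"], r["chunk2"])
```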


In [None]:
pd.DataFrame.from_dict(result["result"][0]["classifications"])

Unnamed: 0,category,sentence,sentence_id
0,1,"In the bone- marrow (BM) aspiration, blasts ac...",0
1,1,Measurements of serum tumor markers showed ele...,1
2,1,Immunohistochemical staining showed positive s...,2


## explain_clinical_doc_oncology

https://nlp.johnsnowlabs.com/2024/05/06/explain_clinical_doc_oncology_en.html

In [None]:
from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")

results = oncology_pipeline.fullAnnotate("""The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response""")

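# Print each stage's input column(s) and output column
for stage in oncology_pipeline.model.stages:
    try:
        input_cols = stage.getInputCol()
    except Exception:
        # stages without a single input column expose getInputCols() instead
        input_cols = stage.getInputCols()
    print(input_cols, stage.getOutputCol())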

explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]
text document
['document'] sentence
['sentence'] token
['sentence', 'token'] embeddings
['sentence', 'token', 'embeddings'] ner_oncology
['sentence', 'token', 'ner_oncology'] ner_oncology_chunk
['sentence', 'token', 'embeddings'] ner_oncology_anatomy_general
['sentence', 'token', 'ner_oncology_anatomy_general'] ner_oncology_anatomy_general_chunk
['sentence', 'token', 'embeddings'] ner_oncology_response_to_treatment
['sentence', 'token', 'ner_oncology_response_to_treatment'] ner_oncology_response_to_treatment_chunk
['sentence', 'token', 'embeddings'] ner_oncology_unspecific_posology
['sentence', 'token', 'ner_oncology_unspecific_posology'] ner_oncology_unspecific_posology_chunk
['sentence', 'token', 'embeddings'] ner_oncology_tnm
['sentence', 'token', 'ner_oncology_tnm'] ner_oncology_tnm_chunk
['sentence', 'token', 'embeddings'] ner_jsl
['sentence', 'token', 'ner_jsl'] ner_jsl_ch
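The listing above gives, for each stage, its input column(s) followed by its output column. A minimal sketch of turning the first few of those lines into a lookup that answers "which columns feed this one?"; the single string input `text` is wrapped in a list so every stage has the same shape. `upstream_columns` is a hypothetical helper, unrelated to what `PipelineTracer` does internally.

```python
# (input columns, output column) pairs copied from the first lines of the listing above.
stage_columns = [
    (["text"], "document"),
    (["document"], "sentence"),
    (["sentence"], "token"),
    (["sentence", "token"], "embeddings"),
    (["sentence", "token", "embeddings"], "ner_oncology"),
    (["sentence", "token", "ner_oncology"], "ner_oncology_chunk"),
]

# Map each output column to the columns that produced it.
producers = {output: inputs for inputs, output in stage_columns}


def upstream_columns(column):
    """Return every column that directly or indirectly feeds `column`."""
    seen = set()
    stack = [column]
    while stack:
        for parent in producers.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream_columns("ner_oncology_chunk")))
```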

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(oncology_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_oncology"})
column_maps

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [None]:
# from sparknlp_jsl.pipeline_tracer import PipelineTracer

# columns_directly = PipelineTracer.getParserDictDirectly("explain_clinical_doc_oncology", "en", "clinical/models")
# columns_directly

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_oncology',
   'document_text': ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response'],
   'entities': [{'chunk_id': '1b71b12a',
     'chunk': 'computed tomography',
     'begin': 24,
     'end': 42,
     'ner_label': 'Imaging_Test',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.9575'},
    {'chunk_id': 'ce9ac1a9'

In [None]:
pd.DataFrame.from_dict(result["result"][0]["entities"])

Unnamed: 0,chunk_id,chunk,begin,end,ner_label,ner_source,ner_confidence
0,1b71b12a,computed tomography,24,42,Imaging_Test,ner_oncology_chunk,0.9575
1,ce9ac1a9,CT,45,46,Imaging_Test,ner_oncology_chunk,0.9565
2,3576c965,abdomen,61,67,Site_Other_Body_Part,ner_oncology_chunk,0.9446
3,cff2288c,pelvis,73,78,Site_Other_Body_Part,ner_oncology_chunk,0.6514
4,98848a68,ovarian,104,110,Site_Other_Body_Part,ner_oncology_chunk,0.7915
5,d3e628e9,mass,112,115,Tumor_Finding,ner_oncology_chunk,0.9557
6,3d8b6be0,Pap smear,120,128,Pathology_Test,ner_oncology_chunk,0.96725
7,4d03018b,one month later,140,154,Relative_Date,ner_oncology_chunk,0.8786667
8,8de23a92,atypical glandular cells,173,196,Pathology_Result,ner_oncology_chunk,0.7270667
9,70affced,adenocarcinoma,213,226,Cancer_Dx,ner_oncology_chunk,0.9992


In [None]:
pd.DataFrame.from_dict(result["result"][0]["assertions"])

Unnamed: 0,chunk_id,chunk,assertion,assertion_source
0,1b71b12a,computed tomography,Past,assertion
1,ce9ac1a9,CT,Past,assertion
2,d3e628e9,mass,Present,assertion
3,3d8b6be0,Pap smear,Past,assertion
4,8de23a92,atypical glandular cells,Present,assertion
5,70affced,adenocarcinoma,Possible,assertion
6,71dddb8a,pathologic specimen,Past,assertion
7,63e46bca,extension,Present,assertion
8,ac5748d2,tumor,Present,assertion
9,5d80c8a0,enlarged,Present,assertion
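A quick way to summarize the assertion output is to count how often each status occurs. The sketch below does this with `collections.Counter` on the ten rows displayed above; in the notebook the list would be `result["result"][0]["assertions"]`.

```python
from collections import Counter

# (chunk, assertion) pairs copied from the ten rows displayed above.
assertions = [
    {"chunk": "computed tomography", "assertion": "Past"},
    {"chunk": "CT", "assertion": "Past"},
    {"chunk": "mass", "assertion": "Present"},
    {"chunk": "Pap smear", "assertion": "Past"},
    {"chunk": "atypical glandular cells", "assertion": "Present"},
    {"chunk": "adenocarcinoma", "assertion": "Possible"},
    {"chunk": "pathologic specimen", "assertion": "Past"},
    {"chunk": "extension", "assertion": "Present"},
    {"chunk": "tumor", "assertion": "Present"},
    {"chunk": "enlarged", "assertion": "Present"},
]

# Count the rows per assertion status.
status_counts = Counter(row["assertion"] for row in assertions)
print(status_counts.most_common())
```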


In [None]:
pd.DataFrame.from_dict(result["result"][0]["relations"])

Unnamed: 0,relation,chunk1_id,chunk1,chunk2_id,chunk2,confidence,direction
0,O,3576c965,abdomen,d3e628e9,mass,0.9439166,both
1,O,cff2288c,pelvis,d3e628e9,mass,0.9611397,both
2,is_location_of,98848a68,ovarian,d3e628e9,mass,0.922661,both
3,is_finding_of,3d8b6be0,Pap smear,70affced,adenocarcinoma,0.52542114,both
4,is_location_of,ac5748d2,tumor,74e8e40b,fallopian tubes,0.9026299,both
5,is_location_of,ac5748d2,tumor,76146911,appendix,0.6649267,both
6,O,ac5748d2,tumor,dc74e652,omentum,0.80328876,both
7,Chemotherapy-Dosage,c2e02074,Neoadjuvant chemotherapy,98f81754,500 mg/m2,1.0,both
8,Chemotherapy-Cycle_Count,c2e02074,Neoadjuvant chemotherapy,bb801681,6 cycles,1.0,both
9,Chemotherapy-Dosage,d5d30ff5,Cyclophosphamide,98f81754,500 mg/m2,1.0,both


## With a Custom Pipeline

In [None]:
def get_pipeline_model():
    documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = Tokenizer()\
        .setInputCols("sentence")\
        .setOutputCol("token")

    # ADE classifier
    sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade_augmented", "en", "clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("ade_classification")

    word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
        .setInputCols("sentence", "token")\
        .setOutputCol("word_embeddings")

    # to get PROBLEM and TEST entities
    clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
        .setInputCols(["sentence", "token", "word_embeddings"]) \
        .setOutputCol("clinical_ner")

    clinical_ner_chunk = NerConverterInternal()\
        .setInputCols("sentence","token","clinical_ner")\
        .setOutputCol("clinical_ner_chunk")\
        .setWhiteList(["PROBLEM","TEST"])

    # Assertion model trained on i2b2 (sampled from MIMIC) dataset
    assertion_jsl = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
        .setInputCols(["sentence", "clinical_ner_chunk", "word_embeddings"]) \
        .setOutputCol("assertion_jsl")\
        .setEntityAssertionCaseSensitive(False)

    # to get DRUG, DOSAGE, and DURATION entities
    posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
        .setInputCols(["sentence", "token", "word_embeddings"]) \
        .setOutputCol("posology_ner")

    posology_ner_chunk = NerConverterInternal()\
        .setInputCols("sentence","token","posology_ner")\
        .setOutputCol("posology_ner_chunk")\
        .setWhiteList(["DRUG","DOSAGE","DURATION"])

    # NER for de-identification
    deid_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
        .setInputCols(["sentence", "token", "word_embeddings"]) \
        .setOutputCol("deid_ner")

    deid_ner_chunk = NerConverterInternal()\
        .setInputCols(["sentence", "token", "deid_ner"])\
        .setOutputCol("deid_ner_chunk")

    # merge the chunks into a single ner_chunk
    chunk_merger = ChunkMergeApproach()\
        .setInputCols("clinical_ner_chunk","posology_ner_chunk")\
        .setOutputCol("merged_ner_chunk")\
        .setMergeOverlapping(False)

    obfuscation = DeIdentification()\
        .setInputCols(["sentence", "token", "deid_ner_chunk"]) \
        .setOutputCol("deidentified") \
        .setMode("obfuscate")\
        .setObfuscateDate(True)\
        .setObfuscateRefSource("faker") \
        .setMetadataMaskingPolicy("entity_labels")\
        .setOutputAsDocument(True)

    assertion_vop = AssertionDLModel.pretrained("assertion_vop_clinical", "en", "clinical/models") \
        .setInputCols(["sentence", "merged_ner_chunk", "word_embeddings"]) \
        .setOutputCol("assertion_vop")

    pos_tagger = PerceptronModel()\
        .pretrained("pos_clinical", "en", "clinical/models") \
        .setInputCols(["sentence", "token"])\
        .setOutputCol("pos_tags")

    dependency_parser = DependencyParserModel()\
        .pretrained("dependency_conllu", "en")\
        .setInputCols(["sentence", "pos_tags", "token"])\
        .setOutputCol("dependencies")

    generic_re = RelationExtractionModel()\
        .pretrained("generic_re")\
        .setInputCols(["word_embeddings", "pos_tags", "posology_ner_chunk", "dependencies"])\
        .setOutputCol("generic_re")\
        .setMaxSyntacticDistance(10)

    # convert chunks to doc to get sentence embeddings of them
    chunk2doc = Chunk2Doc()\
      .setInputCols("merged_ner_chunk")\
      .setOutputCol("doc_final_chunk")


    sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
        .setInputCols(["doc_final_chunk"])\
        .setOutputCol("sbert_embeddings")\
        .setCaseSensitive(False)

    # filter PROBLEM entity embeddings
    router_sentence_icd10 = Router() \
        .setInputCols("sbert_embeddings") \
        .setFilterFieldsElements(["PROBLEM"]) \
        .setOutputCol("problem_embeddings")

    # filter DRUG entity embeddings
    router_sentence_rxnorm = Router() \
        .setInputCols("sbert_embeddings") \
        .setFilterFieldsElements(["DRUG"]) \
        .setOutputCol("drug_embeddings")

    # use problem_embeddings only
    icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
        .setInputCols(["problem_embeddings"]) \
        .setOutputCol("icd10cm_code")\
        .setDistanceFunction("EUCLIDEAN")

    # use drug_embeddings only
    rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
        .setInputCols(["drug_embeddings"]) \
        .setOutputCol("rxnorm_code")\
        .setDistanceFunction("EUCLIDEAN")

    # summarization
    summarizer = MedicalSummarizer\
        .pretrained("summarizer_clinical_jsl")\
        .setInputCols(['document'])\
        .setOutputCol('summary')\
        .setMaxTextLength(512)\
        .setMaxNewTokens(512)

    pipeline = Pipeline(
        stages=[
            documentAssembler,
            sentenceDetector,
            tokenizer,
            sequenceClassifier,
            word_embeddings,
            clinical_ner,
            clinical_ner_chunk,
            assertion_jsl,
            posology_ner,
            posology_ner_chunk,
            deid_ner,
            deid_ner_chunk,
            chunk_merger,
            obfuscation,
            assertion_vop,
            pos_tagger,
            dependency_parser,
            generic_re,
            chunk2doc,
            sbiobert_embeddings,
            router_sentence_icd10,
            router_sentence_rxnorm,
            icd_resolver,
            rxnorm_resolver,
            summarizer
    ])

    empty_data = spark.createDataFrame([['']]).toDF("text")
    # model = pipeline.fit(empty_data)
    return pipeline.fit(empty_data)

big_pipeline_model = get_pipeline_model()

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
bert_sequence_classifier_ade_augmented download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_jsl_augmented download started this may take some time.
[OK!]
ner_posology download started this may take some time.
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]
assertion_vop_clinical download started this may take some time.
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_slim_billable_hcc

In [None]:
text = """
Ora Hendrickson, a 28-year-old female with a history of gestational diabetes, now type 2 diabetes, and obesity (BMI 33.5 kg/m²), presented with polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior, she completed a five-day course of amoxicillin for a respiratory infection and had been on dapagliflozin for six months.
On examination, she had dry oral mucosa and a benign abdomen. Key lab findings included serum glucose 111 mg/dL, bicarbonate 18 mmol/L, anion gap 20, triglycerides 508 mg/dL, and HbA1c 10%. Venous pH was 7.27, and serum lipase was normal at 43 U/L. Due to poor oral intake, she was admitted for starvation ketosis.
She also reported a two-week headache and anxiety when walking fast. Her father’s paralysis and workplace bullying were significant stressors, leading to insomnia treated with sleeping pills.
Ora, with insulin-dependent type 2 diabetes, coronary artery disease, and chronic renal insufficiency, was previously admitted for acute paraplegia. She developed pressure wounds on her left foot and sacral area. Transferred for further care, she was on multiple medications, including Fragmin, Xenaderm, Lantus, OxyContin, Avandia, and Neurontin. Pathology revealed tumor cells positive for estrogen and progesterone receptors.
Discharged with Avandia, Coumadin, metformin, and Lisinopril, she was also prescribed aspirin and an Albuterol inhaler for asthma.
"""

light_model = LightPipeline(big_pipeline_model)
results = light_model.fullAnnotate(text)

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(big_pipeline_model)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_oncology"})
column_maps

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': ['deid_ner_chunk',
  'posology_ner_chunk',
  'clinical_ner_chunk',
  'merged_ner_chunk'],
 'assertions': ['assertion_jsl', 'assertion_vop'],
 'resolutions': [{'vocab': 'icd10cm_code',
   'resolver_column_name': 'icd10cm_code'},
  {'vocab': 'rxnorm_code', 'resolver_column_name': 'rxnorm_code'}],
 'relations': ['generic_re'],
 'summaries': ['summary'],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'deidentified',
   'masked': ''}],
 'classifications': [{'classification_column_name': 'ade_classification',
   'sentence_column_name': 'sentence'}]}
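In the traced dictionary above, each `vocab` value simply repeats the resolver column name. A minimal sketch of replacing those values with shorter vocabulary labels while leaving the traced entries untouched; `rename_vocabs` and the labels `icd10` and `rxnorm` are illustrative choices, not names required by the library.

```python
import copy

# The 'resolutions' entry copied from the traced dictionary above.
traced_resolutions = [
    {"vocab": "icd10cm_code", "resolver_column_name": "icd10cm_code"},
    {"vocab": "rxnorm_code", "resolver_column_name": "rxnorm_code"},
]


def rename_vocabs(resolutions, new_names):
    """Return a deep copy of `resolutions` with vocab labels replaced.

    `new_names` maps a resolver column name to the label to use.
    Hypothetical helper for illustration; not part of sparknlp_jsl.
    """
    renamed = copy.deepcopy(resolutions)
    for entry in renamed:
        column = entry["resolver_column_name"]
        entry["vocab"] = new_names.get(column, entry["vocab"])
    return renamed


renamed_resolutions = rename_vocabs(
    traced_resolutions,
    {"icd10cm_code": "icd10", "rxnorm_code": "rxnorm"},
)
print(renamed_resolutions)
```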

In [None]:
column_maps = {
    'document_identifier': 'some document identifier',
    'document_text': 'document',
    'entities': ['clinical_ner_chunk', 'posology_ner_chunk', 'deid_ner_chunk'],
    'assertions': ['assertion_vop', 'assertion_jsl'],
    'resolutions': [{
            'vocab':"rxnorm",
            'resolver_column_name': 'rxnorm_code'
        },
        {
            'vocab':"icd10",
            'resolver_column_name': 'icd10cm_code'
    }],
    'relations': ['generic_re'],
    'summaries': ['summary'],
    'deidentifications' : [{
        "original": "document",
        "obfuscated": "deidentified",
        "masked": None  # if None, the masked text is looked up in the annotation metadata
    }],
    'classifications':[{
        "classification_column_name": "ade_classification",
        "sentence_column_name": "sentence",
    }]
}


pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results, return_relation_entities=True)

result['result'][0]

{'document_identifier': 'some document identifier',
 'document_text': ['\nOra Hendrickson, a 28-year-old female with a history of gestational diabetes, now type 2 diabetes, and obesity (BMI 33.5 kg/m²), presented with polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior, she completed a five-day course of amoxicillin for a respiratory infection and had been on dapagliflozin for six months.\nOn examination, she had dry oral mucosa and a benign abdomen. Key lab findings included serum glucose 111 mg/dL, bicarbonate 18 mmol/L, anion gap 20, triglycerides 508 mg/dL, and HbA1c 10%. Venous pH was 7.27, and serum lipase was normal at 43 U/L. Due to poor oral intake, she was admitted for starvation ketosis.\nShe also reported a two-week headache and anxiety when walking fast. Her father’s paralysis and workplace bullying were significant stressors, leading to insomnia treated with sleeping pills.\nOra, with insulin-dependent type 2 diabetes, coronary artery disease, and chronic r

In [None]:
pd.DataFrame.from_dict(result["result"][0]["entities"])

Unnamed: 0,chunk_id,chunk,begin,end,ner_label,ner_source,ner_confidence
0,8212ece0,Ora Hendrickson,1,15,NAME,deid_ner_chunk,0.95475
1,cac49f82,28-year-old,20,30,AGE,deid_ner_chunk,0.9995
2,d35ff417,gestational diabetes,57,76,PROBLEM,clinical_ner_chunk,0.95325
3,ecbc13bf,type 2 diabetes,83,97,PROBLEM,clinical_ner_chunk,0.79780006
4,f769b24d,obesity,104,110,PROBLEM,clinical_ner_chunk,0.9972
5,baf4ee42,BMI,113,115,TEST,clinical_ner_chunk,0.9027
6,4ba206e3,polyuria,145,152,PROBLEM,clinical_ner_chunk,0.9994
7,090833c0,polydipsia,155,164,PROBLEM,clinical_ner_chunk,0.9947
8,12396d0a,poor appetite,167,179,PROBLEM,clinical_ner_chunk,0.99905
9,1caa3436,vomiting,186,193,PROBLEM,clinical_ner_chunk,0.9879


In [None]:
pd.DataFrame.from_dict(result["result"][0]["assertions"])

Unnamed: 0,chunk_id,chunk,assertion,assertion_source
0,d35ff417,gestational diabetes,Present_Or_Past,assertion_vop
1,ecbc13bf,type 2 diabetes,Present_Or_Past,assertion_vop
2,f769b24d,obesity,Present_Or_Past,assertion_vop
3,baf4ee42,BMI,Present_Or_Past,assertion_vop
4,4ba206e3,polyuria,Present_Or_Past,assertion_vop
...,...,...,...,...
85,fdc7ac56,acute paraplegia,Present,assertion_jsl
86,5c901cee,pressure wounds on her left foot,Present,assertion_jsl
87,027dd522,Pathology,Possible,assertion_jsl
88,446b6ddb,tumor cells positive for estrogen and progeste...,Present,assertion_jsl


In [None]:
pd.DataFrame.from_dict(result["result"][0]["resolutions"])

Unnamed: 0,vocab,chunk_id,chunk,code,resolutions,all_k_codes,all_k_resolutions,all_k_aux_labels,all_k_distances
0,rxnorm,a078b619,amoxicillin,370576,amoxicillin Oral Suspension,370576:::540141:::1152900:::1152899:::370886::...,amoxicillin Oral Suspension:::amoxicillinan [a...,Clinical Drug Form:::Brand Name:::Clinical Dos...,0.0000:::2.9012:::6.0653:::6.0653:::6.4355:::6...
1,rxnorm,99b18548,dapagliflozin,1488568,dapagliflozin Oral Tablet,1488568:::1545653:::2627044:::1992672:::148856...,dapagliflozin Oral Tablet:::empagliflozin [emp...,Clinical Drug Form:::Ingredient:::Ingredient::...,0.0000:::4.4793:::5.0801:::5.7025:::5.7108:::6...
2,rxnorm,7252a3a2,Fragmin,281554,fragmin [fragmin],281554:::1739229:::217106:::361779:::217104:::...,fragmin [fragmin]:::fragarin [fragarin]:::ferr...,Brand Name:::Ingredient:::Brand Name:::Brand N...,0.0000:::7.4122:::7.8882:::7.9000:::7.9438:::8...
3,rxnorm,32aee897,Xenaderm,581754,xenaderm [xenaderm],581754:::2198949:::1307304:::202363:::1000108:...,xenaderm [xenaderm]:::xenleta [xenleta]:::xtan...,Brand Name:::Brand Name:::Brand Name:::Brand N...,0.0000:::6.2442:::6.8670:::6.9987:::7.2079:::7...
4,rxnorm,e419b1dd,Lantus,261551,lantus [lantus],261551:::151959:::377389:::202990:::196502:::6...,lantus [lantus]:::laratrim [laratrim]:::laches...,Brand Name:::Brand Name:::Clinical Drug Form::...,0.0000:::7.7323:::7.8976:::8.0829:::8.1541:::8...
5,rxnorm,5b3b5ece,OxyContin,218986,oxycontin [oxycontin],218986:::1373205:::32680:::1120014:::7804:::54...,oxycontin [oxycontin]:::Apis cerana worker sec...,Brand Name:::Ingredient:::Ingredient:::Brand N...,0.0000:::6.5183:::6.9746:::7.1005:::7.2795:::7...
6,rxnorm,094a24c9,Avandia,261455,avandia [avandia],261455:::352450:::607816:::2054097:::613324:::...,avandia [avandia]:::avandamet [avandamet]:::av...,Brand Name:::Brand Name:::Brand Name:::Brand N...,0.0000:::5.8763:::6.1972:::6.5783:::6.7766:::6...
7,rxnorm,3bc05671,Neurontin,196498,neurontin [neurontin],196498:::203803:::1311555:::218699:::1045325::...,neurontin [neurontin]:::nebcin [nebcin]:::nita...,Brand Name:::Brand Name:::Ingredient:::Brand N...,0.0000:::7.5014:::7.6132:::7.6755:::7.7288:::7...
8,rxnorm,702d3577,estrogen,4100,estrogens [estrogens],4100:::109022:::372083:::216993:::1165181:::37...,estrogens [estrogens]:::estradiol Drug Implant...,Ingredient:::Clinical Drug Form:::Clinical Dru...,3.3852:::4.9790:::5.1917:::5.4582:::6.0688:::6...
9,rxnorm,d247abf2,progesterone receptors,8727,progesterone [progesterone],8727:::815024:::1648167:::373627:::692987:::14...,progesterone [progesterone]:::estradiol / prog...,Ingredient:::Multiple Ingredients:::Clinical D...,5.2546:::7.1132:::7.2616:::7.2841:::7.2903:::7...
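The `all_k_*` columns above pack the top candidates into one string separated by `:::`. A minimal sketch of unpacking them into (code, distance) pairs, using only the first three candidates visible for `amoxicillin` in the table; the displayed values are truncated, so the real strings hold more entries.

```python
# First three candidates for "amoxicillin", copied from the table above.
row = {
    "chunk": "amoxicillin",
    "all_k_codes": "370576:::540141:::1152900",
    "all_k_distances": "0.0000:::2.9012:::6.0653",
}


def top_candidates(row, sep=":::"):
    """Split the packed columns into a list of (code, distance) tuples."""
    codes = row["all_k_codes"].split(sep)
    distances = [float(d) for d in row["all_k_distances"].split(sep)]
    return list(zip(codes, distances))


candidates = top_candidates(row)
for code, distance in candidates:
    print(code, distance)
```

The codes stay as strings, since identifiers such as ICD-10-CM codes are not numeric.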


In [None]:
pd.DataFrame.from_dict(result["result"][0]["relations"])

Unnamed: 0,relation,chunk1_id,chunk1,entity1,entity1_begin,entity1_end,chunk2_id,chunk2,entity2,entity2_begin,entity2_end,confidence,direction
0,DRUG-DRUG,7252a3a2,Fragmin,DRUG,1127,1133,32aee897,Xenaderm,DRUG,1136,1143,1.0,both
1,DRUG-DRUG,7252a3a2,Fragmin,DRUG,1127,1133,e419b1dd,Lantus,DRUG,1146,1151,1.0,both
2,DRUG-DRUG,7252a3a2,Fragmin,DRUG,1127,1133,5b3b5ece,OxyContin,DRUG,1154,1162,1.0,both
3,DRUG-DRUG,7252a3a2,Fragmin,DRUG,1127,1133,094a24c9,Avandia,DRUG,1165,1171,1.0,both
4,DRUG-DRUG,7252a3a2,Fragmin,DRUG,1127,1133,3bc05671,Neurontin,DRUG,1178,1186,1.0,both
5,DRUG-DRUG,32aee897,Xenaderm,DRUG,1136,1143,e419b1dd,Lantus,DRUG,1146,1151,1.0,both
6,DRUG-DRUG,32aee897,Xenaderm,DRUG,1136,1143,5b3b5ece,OxyContin,DRUG,1154,1162,1.0,both
7,DRUG-DRUG,32aee897,Xenaderm,DRUG,1136,1143,094a24c9,Avandia,DRUG,1165,1171,1.0,both
8,DRUG-DRUG,32aee897,Xenaderm,DRUG,1136,1143,3bc05671,Neurontin,DRUG,1178,1186,1.0,both
9,DRUG-DRUG,e419b1dd,Lantus,DRUG,1146,1151,5b3b5ece,OxyContin,DRUG,1154,1162,1.0,both
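With `return_relation_entities=True`, each relation row also carries the entity labels and offsets of both chunks. A minimal sketch of grouping the ten rows above by their first chunk, to see which drugs each one is paired with; only the chunk texts are copied here.

```python
from collections import defaultdict

# (chunk1, chunk2) pairs copied from the ten DRUG-DRUG rows above.
pairs = [
    ("Fragmin", "Xenaderm"), ("Fragmin", "Lantus"), ("Fragmin", "OxyContin"),
    ("Fragmin", "Avandia"), ("Fragmin", "Neurontin"),
    ("Xenaderm", "Lantus"), ("Xenaderm", "OxyContin"),
    ("Xenaderm", "Avandia"), ("Xenaderm", "Neurontin"),
    ("Lantus", "OxyContin"),
]

# Collect the second chunk of every relation under its first chunk.
related = defaultdict(list)
for chunk1, chunk2 in pairs:
    related[chunk1].append(chunk2)

for chunk1, others in related.items():
    print(chunk1, "->", ", ".join(others))
```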


In [None]:
pd.DataFrame.from_dict(result["result"][0]["deidentifications"])

Unnamed: 0,original,obfuscated,masked
0,"[\nOra Hendrickson, a 28-year-old female with ...","[ Jose Ngo, a 39-year-old female with a histor...","[ <NAME>, a <AGE> female with a history of ges..."


In [None]:
pd.DataFrame.from_dict(result["result"][0]["classifications"])

Unnamed: 0,category,sentence,sentence_id
0,ADE,"Ora Hendrickson, a 28-year-old female with a h...",0
1,ADE,"Two weeks prior, she completed a five-day cour...",1
2,ADE,"On examination, she had dry oral mucosa and a ...",2
3,ADE,Key lab findings included serum glucose 111 mg...,3
4,ADE,"Venous pH was 7.27, and serum lipase was norma...",4
5,ADE,"Due to poor oral intake, she was admitted for ...",5
6,ADE,She also reported a two-week headache and anxi...,6
7,ADE,"Ora, with insulin-dependent type 2 diabetes, c...",7
8,ADE,She developed pressure wounds on her left foot...,8
9,ADE,"Transferred for further care, she was on multi...",9
