![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/academic/NER_Benchmarks/NER_Performance_Comparison_Of_Healthcare_NLP_VS_Cloud_Solutions.ipynb)

# Comparing Named Entity Recognition Performance:

John Snow Labs, OpenAI, Anthropic Claude, Google Cloud Platform, Azure Health Data Services, and Amazon Comprehend Medical

This notebook presents a comparative analysis of the Named Entity Recognition (NER) solutions offered by the Healthcare NLP library, OpenAI GPT-4o, Anthropic Claude 3.7 Sonnet, Azure Health Data Services, and Amazon Comprehend Medical. We evaluate how well each solution detects clinical entities, using a benchmark dataset annotated by domain experts.
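
The scoring scheme is not spelled out at this point in the notebook. As a rough illustration of how detected entities could be compared against expert annotations, the sketch below computes exact-match precision, recall, and F1 over `(begin, end, label)` spans. The function name `span_prf` and the sample spans are hypothetical, and exact matching is an assumption; a benchmark may equally use partial or token-level matching.

```python
from typing import Iterable, Tuple

# A span is (begin, end, label); the end offset is treated as inclusive.
Span = Tuple[int, int, str]


def span_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over (begin, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical annotations, for illustration only (not benchmark data).
gold = [(0, 6, "DRUG"), (10, 18, "DRUG"), (25, 31, "DRUG")]
predicted = [(0, 6, "DRUG"), (10, 18, "DRUG"), (40, 45, "DRUG"), (50, 55, "DRUG")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Using sets keeps the comparison order-independent and ignores duplicate predictions, which is usually what is wanted when several systems emit spans in different orders.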


# Setup





In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.0 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
! pip install -q openpyxl boto3

In [4]:
import os
import json
import requests
import pandas as pd

import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

from zipfile import ZipFile
from io import BytesIO

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"52G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 5.5.3
Spark NLP_JSL Version : 5.5.3


# Utils

In [5]:
def explode_light_result(light_result):
    id,  chunks, entities, begin, end = [], [], [], [], []
    for idx, item in enumerate(light_result):
        for annotation in item["ner_chunk"]:
            id.append(idx)
            chunks.append(annotation.result)
            entities.append(annotation.metadata["entity"])
            begin.append(annotation.begin)
            end.append(annotation.end)

    return pd.DataFrame({"id":id, "chunk":chunks, "begin":begin, "end":end, "entity":entities})

# JohnSnowLabs Pipelines

| Pipeline Name                                     | Predicted Entity           |
|---------------------------------------------------|----------------------------|
| ner_admission_discharge_benchmark_pipeline        | ADMISSION_DISCHARGE        |
| ner_alcohol_use_benchmark_pipeline                | ALCOHOL_USE                |
| ner_body_part_benchmark_pipeline                  | BODY_PART                  |
| ner_drug_benchmark_pipeline                       | DRUG                       |
| ner_procedure_benchmark_pipeline                  | PROCEDURE                  |
| ner_test_benchmark_pipeline                       | TEST                       |
| ner_treatment_benchmark_pipeline                  | TREATMENT                  |
| ner_consumption_benchmark_pipeline                | CONSUMPTION                |
| ner_grade_stage_severity_benchmark_pipeline       | GRADE_STAGE_SEVERITY       |
| ner_medical_condition_disorder_benchmark_pipeline | MEDICAL_CONDITION_DISORDER |
| ner_problem_benchmark_pipeline                    | PROBLEM                    |
| ner_substance_use_benchmark_pipeline              | SUBSTANCE_USE              |
| ner_symptom_or_sign_benchmark_pipeline            | SYMPTOM_OR_SIGN            |
| ner_tobaco_use_benchmark_pipeline                 | TOBACO_USE                 |
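
Every row in the table follows the same naming pattern: the lower-cased entity label wrapped in `ner_` and `_benchmark_pipeline`. A small helper can make that explicit when iterating over several entities. The helper name `benchmark_pipeline_name` is hypothetical, and the pattern is inferred from the table only; it is not a documented API.

```python
def benchmark_pipeline_name(entity_label: str) -> str:
    """Build the pipeline name from an entity label, following the table's pattern."""
    return f"ner_{entity_label.lower()}_benchmark_pipeline"


# A few labels taken from the table above.
labels = ["ADMISSION_DISCHARGE", "PROBLEM", "SYMPTOM_OR_SIGN"]

for label in labels:
    print(label, "->", benchmark_pipeline_name(label))
```

The resulting string can then be passed as the first argument to `PretrainedPipeline`, as the cells below do with literal names.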


## Problem

This pipeline can be used to extract `problem` information (diseases, disorders, injuries, symptoms, signs, etc.) from medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_problem_benchmark_pipeline", "en", "clinical/models")


ner_problem_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """HISTORY OF PRESENT ILLNESS :
Mr. He is a 77 year old male with squamous cell carcinoma of the lung .
Over the pat three to four weeks , he started having increased dyspnea and noted wheezing .
A bronchoscopy showed protrusion of the tumor into the right main stem bronchus with a positive needle biopsy , washings and brushings for squamous cell carcinoma .
A computerized tomography scan showed a large subcarinal mass .
PAST MEDICAL HISTORY :
His past medical history was significant for malignant bladder tumor in 1991 .
PHYSICAL EXAMINATION :
On physical examination , Mr. He had very marked inspiratory and expiratory stridor .
There were no nodes present .
The breath sounds were somewhat decreased through both lung fields .
His cardiac examination did not show any murmur , gallop , or cardiomegaly .
There was no hepatosplenomegaly , and no peripheral edema .
HOSPITAL COURSE :
A bronchoscopy with the intention of coring out tumor was carried out by Dr. Reg He , but all the tumor was extrinsic to the airway and he was unable to relieve the obstruction .
The tumor now involves the trachea as well as the right main bronchus .
His major complaint was of persistent severe coughing and secretions .
"""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,squamous cell carcinoma,63,85,PROBLEM
1,0,dyspnea,164,170,PROBLEM
2,0,wheezing,182,189,PROBLEM
3,0,tumor,233,237,PROBLEM
4,0,squamous cell carcinoma,332,354,PROBLEM
5,0,mass,415,418,PROBLEM
6,0,malignant bladder tumor,490,512,PROBLEM
7,0,murmur,773,778,PROBLEM
8,0,cardiomegaly,794,805,PROBLEM
9,0,hepatosplenomegaly,822,839,PROBLEM
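
The `begin` and `end` columns are character offsets into the input text, and the numbers above suggest that `end` is inclusive (85 - 63 + 1 equals the 23 characters of "squamous cell carcinoma"). The sketch below checks that reading against the first two lines of the sample text. The helper name `offsets_match` is hypothetical, and only the first table row is reproduced.

```python
# First two lines of the sample text used above.
text = (
    "HISTORY OF PRESENT ILLNESS :\n"
    "Mr. He is a 77 year old male with squamous cell carcinoma of the lung .\n"
)

# First row of the annotation table above.
rows = [{"chunk": "squamous cell carcinoma", "begin": 63, "end": 85}]


def offsets_match(source_text: str, row: dict) -> bool:
    """Return True when the inclusive [begin, end] slice reproduces the chunk."""
    return source_text[row["begin"]: row["end"] + 1] == row["chunk"]


mismatches = [row for row in rows if not offsets_match(text, row)]
print("mismatched rows:", mismatches)
```

The `+ 1` matters when slicing in Python, since slice ends are exclusive while the reported `end` is not.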


**Spark Transform Result**

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")

result = ner_pipeline.transform(data)

result_df = result.selectExpr("explode(ner_chunk) as ner_chunk")\
                  .selectExpr("ner_chunk.result as chunk",
                              "ner_chunk.begin",
                              "ner_chunk.end",
                              "ner_chunk.metadata['entity'] as ner_label",
                              #"ner_chunk.metadata['confidence'] as confidence"
                              ).toPandas()
result_df

Unnamed: 0,chunk,begin,end,ner_label
0,squamous cell carcinoma,63,85,PROBLEM
1,dyspnea,164,170,PROBLEM
2,wheezing,182,189,PROBLEM
3,tumor,233,237,PROBLEM
4,squamous cell carcinoma,332,354,PROBLEM
5,mass,415,418,PROBLEM
6,malignant bladder tumor,490,512,PROBLEM
7,murmur,773,778,PROBLEM
8,cardiomegaly,794,805,PROBLEM
9,hepatosplenomegaly,822,839,PROBLEM


**PipelineTracer**

In [None]:
pipeline_tracer = PipelineTracer(ner_pipeline)

pipeline_tracer.getPossibleEntities()

['PROBLEM']

**PipelineOutputParser**

In [None]:
column_maps = {
    'document_identifier': 'ner_pipeline',
    'document_text': 'document',
    'entities': ['ner_chunk'],
    'assertions': [],
    'resolutions': [],
    'relations': [],
    'summaries': [],
    'deidentifications': [],
    'classifications': []
}

pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(light_result) #light_result is defined above
result['result'][0]

{'document_identifier': 'ner_pipeline',
 'document_id': 0,
 'document_text': ['HISTORY OF PRESENT ILLNESS :\nMr. He is a 77 year old male with squamous cell carcinoma of the lung .\nOver the pat three to four weeks , he started having increased dyspnea and noted wheezing .\nA bronchoscopy showed protrusion of the tumor into the right main stem bronchus with a positive needle biopsy , washings and brushings for squamous cell carcinoma .\nA computerized tomography scan showed a large subcarinal mass .\nPAST MEDICAL HISTORY :\nHis past medical history was significant for malignant bladder tumor in 1991 .\nPHYSICAL EXAMINATION :\nOn physical examination , Mr. He had very marked inspiratory and expiratory stridor .\nThere were no nodes present .\nThe breath sounds were somewhat decreased through both lung fields .\nHis cardiac examination did not show any murmur , gallop , or cardiomegaly .\nThere was no hepatosplenomegaly , and no peripheral edema .\nHOSPITAL COURSE :\nA bronchoscopy with 

In [None]:
pd.DataFrame.from_dict(result["result"][0]["entities"])

Unnamed: 0,chunk_id,chunk,begin,end,ner_label,ner_source,ner_confidence
0,ffa4da18,squamous cell carcinoma,63,85,PROBLEM,disorder_matcher,
1,19ce5087,dyspnea,164,170,PROBLEM,symptom_matcher,
2,ec6478f3,wheezing,182,189,PROBLEM,symptom_matcher,
3,2c4712cc,tumor,233,237,PROBLEM,disorder_matcher,
4,5da039d5,squamous cell carcinoma,332,354,PROBLEM,disorder_matcher,
5,9fea044d,mass,415,418,PROBLEM,symptom_matcher,
6,a83c1a14,malignant bladder tumor,490,512,PROBLEM,disorder_matcher,
7,a1efa133,murmur,773,778,PROBLEM,symptom_matcher,
8,446e84c0,cardiomegaly,794,805,PROBLEM,symptom_matcher,
9,d4975f71,hepatosplenomegaly,822,839,PROBLEM,symptom_matcher,
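
The `ner_source` column shows which component inside the pipeline produced each chunk. A quick tally of that column gives a sense of how the components contribute. The sketch below assumes the parsed entities form a list of dictionaries with the keys shown in the table (which the `from_dict` call suggests but does not guarantee), and reproduces only the first four rows.

```python
from collections import Counter

# First four rows of the parsed entities shown above, as plain dictionaries.
entities = [
    {"chunk": "squamous cell carcinoma", "begin": 63, "end": 85,
     "ner_label": "PROBLEM", "ner_source": "disorder_matcher"},
    {"chunk": "dyspnea", "begin": 164, "end": 170,
     "ner_label": "PROBLEM", "ner_source": "symptom_matcher"},
    {"chunk": "wheezing", "begin": 182, "end": 189,
     "ner_label": "PROBLEM", "ner_source": "symptom_matcher"},
    {"chunk": "tumor", "begin": 233, "end": 237,
     "ner_label": "PROBLEM", "ner_source": "disorder_matcher"},
]

# Count how many chunks each source component contributed.
source_counts = Counter(entity["ner_source"] for entity in entities)

for source, count in source_counts.most_common():
    print(source, count)
```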


In [None]:
from sparknlp_display import NerVisualizer
visualiser = NerVisualizer()

visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')

## Drug

This pipeline can be used to extract posology information in medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_drug_benchmark_pipeline", "en", "clinical/models")


ner_drug_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """The patient was admitted to the Surgical Intensive Care Unit postoperatively with stable hemodynamics on Lidocaine at one , Dobutamine at 200 and Nipride .
The patient was extubated by postoperative day # 1 but was noted to be relatively hypoxemic with high oxygen requirement .
Chest x-ray demonstrating pulmonary edema .
Aggressive diuresis was undertaken and the patient responded , albeit sluggishly .
In addition , he remained agitated .
This was attributed to mild hypoxia and / or his underlying psychiatric diagnoses and he was treated with Haldol appropriately .
On day # 2 the patient continued to diurese and this was maintained with the Lasix/ Mannitol infusion .
His urine output remained at 150 to 200 cc. an hour .
Despite this , his chest x-ray continued to show severe pulmonary edema and the clinical picture correlated .
He required anti-hypertensive therapy initially with Nipride which was changed to Hydralazine to avoid shunting .
Consequently his Pronestyl was discontinued .
In addition , the patient was given magnesium to bring his level above 2 ."""


**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,Lidocaine,105,113,DRUG
1,0,Dobutamine,124,133,DRUG
2,0,Nipride,146,152,DRUG
3,0,Haldol,549,554,DRUG
4,0,Lasix/,649,654,DRUG
5,0,Mannitol,656,663,DRUG
6,0,Nipride,893,899,DRUG
7,0,Hydralazine,922,932,DRUG
8,0,Pronestyl,971,979,DRUG


## Treatment

This pipeline can be used to extract `treatments` mentioned in medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_treatment_benchmark_pipeline", "en", "clinical/models")


ner_treatment_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """IN SUMMARY :
The patient was assessed as a 72 year old woman with a background of stage IIIC ovarian carcinoma and documented local recurrence who presents for line 2 of cycle 1 chemotherapy with Adriamycin , Ifex and MESNA .
She also has a low-grade fever of unknown etiology , has a background history of deep venous thrombosis and is therefore currently on anticoagulation and she shows evidence of dehydration and failure to thrive .
It was decided at that time to hold off with the chemotherapy .
The patient was started on Ampicillin and Gentamicin for urinary tract infection which ultimately grew out Escherichia coli sensitive to the above antibiotics and for right lower lobe pneumonia on x-ray .
She was started on nebulizers around-the-clock and chest physical therapy .
On 6/5/94 they were 21 and 2.2 respectively .
Her Coumadin anticoagulation was adjusted to give a prothrombin time between 16 and 18 and an I and R of 2.5-3 .
On June 5 , 1994 it was decided that Mrs. Neathe was not stable enough with a line 2 cycle I chemotherapy with Ifex , Adriamycin and MESNA .
She was therefore well hydrated and was started on her chemotherapy .
In view of her kidney damage it was suggested to change her intravenous antibiotics from Ancef and Gentamicin to Ancef and ciprofloxacin which she tolerated well .
A Neuro-Oncology consult was sought which felt this was probably secondary to Ifex intoxication and her chemotherapy was stopped .
An electroencephalogram was requested and was negative .
No computerized tomography scan or magnetic resonance imaging study of the head was performed .
"""


**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,chemotherapy,178,189,TREATMENT
1,0,anticoagulation,360,374,TREATMENT
2,0,chemotherapy,487,498,TREATMENT
3,0,antibiotics,649,659,TREATMENT
4,0,nebulizers,726,735,TREATMENT
5,0,physical therapy,764,779,TREATMENT
6,0,anticoagulation,842,856,TREATMENT
7,0,chemotherapy,1035,1046,TREATMENT
8,0,chemotherapy,1138,1149,TREATMENT
9,0,antibiotics,1225,1235,TREATMENT


## Body Part

This pipeline can be used to extract all types of anatomical references in medical text. It is a single-entity pipeline and generalizes all anatomical references to a single entity.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_body_part_benchmark_pipeline", "en", "clinical/models")


ner_body_part_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """HISTORY OF PRESENT ILLNESS :
Mr. He is a 77 year old male with squamous cell carcinoma of the lung .
The initial presentation was the orifice of the right upper lobe with mediastinoscopy positive for subcarinal lymph node .
He received 6,000 rads to his mediastinum with external beam .
Subsequently , he returned to Goo where he received another 3,000 rads via an endobronchial catheter , apparently due to recurrence .
Over the pat three to four weeks , he started having increased dyspnea and noted wheezing .
A bronchoscopy showed protrusion of the tumor into the right main stem bronchus with a positive needle biopsy , washings and brushings for squamous cell carcinoma .
A computerized tomography scan showed a large subcarinal mass .
PAST MEDICAL HISTORY :
His past medical history was significant for malignant bladder tumor in 1991 .
PHYSICAL EXAMINATION :
On physical examination , Mr. He had very marked inspiratory and expiratory stridor .
There were no nodes present .
The breath sounds were somewhat decreased through both lung fields .
His cardiac examination did not show any murmur , gallop , or cardiomegaly .
There was no hepatosplenomegaly , and no peripheral edema .
HOSPITAL COURSE :
A bronchoscopy with the intention of coring out tumor was carried out by Dr. Reg He , but all the tumor was extrinsic to the airway and he was unable to relieve the obstruction .
The tumor now involves the trachea as well as the right main bronchus .
His major complaint was of persistent severe coughing and secretions .
Ultimately , only codeine at 30 mg q6h controlled him and this was very affective .
Inhalers provide only mild relief .
He is aware of his prognosis .
The patient was also seen by Dr. Lenchermoi Fyfesaul of the Oncology Service who did not feel that chemotherapy had anything of promise to offer .
Mr. He is an anxious man , but very pleasant .
He and his family understand his prognosis ."""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,lung,94,97,BODY_PART
1,0,upper lobe,155,164,BODY_PART
2,0,subcarinal lymph node,200,220,BODY_PART
3,0,mediastinum,254,264,BODY_PART
4,0,main stem bronchus,574,591,BODY_PART
5,0,subcarinal,724,733,BODY_PART
6,0,bladder,820,826,BODY_PART
7,0,nodes,967,971,BODY_PART
8,0,lung fields,1038,1048,BODY_PART
9,0,airway,1332,1337,BODY_PART


## Procedure

This pipeline can be used to extract `procedure` mentions in medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_procedure_benchmark_pipeline", "en", "clinical/models")


ner_procedure_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """PRINCIPAL PROCEDURE :
4-2-93 , right and left heart catheterization ( transeptal ) with coronary graft and left ventriculogram .
4-8-93 , open heart aortic valve replacement and bypass of right coronary artery .

Mr. No was on the pump for 2 hours and 25 minutes , with an aortic crossclamp time of 1 hour and 48 minutes .
Postoperatively , the patient was extubated on the first postoperative day .
He had a good deal of pulmonary congestion .
He seemed to be doing well until the morning of 4-13-93 when he suddenly became pulseless .
There was no evidence of ventricular fibrillation .
Cardiopulmonary resuscitation was immediately undertaken , but was not successful .
The patient had his chest opened for any evidence of tamponade and there was no evidence of bleeding .
The heart appeared to be flaccid .
We really have no good explanation of what this was all about .
He was being treated for a possible pneumonia .
He did have a good deal of pulmonary congestion , but this occurred suddenly and unexpectedly .
Cardiopulmonary resuscitation efforts were carried out for virtually one hour , but were unsuccessful .
The patient was pronounced dead on the morning of 4-13-93 .
A post mortem examination will be performed .
"""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,heart catheterization,46,66,PROCEDURE
1,0,transeptal,70,79,PROCEDURE
2,0,coronary graft,88,101,PROCEDURE
3,0,left ventriculogram,107,125,PROCEDURE
4,0,open heart aortic valve replacement,138,172,PROCEDURE
5,0,bypass,178,183,PROCEDURE
6,0,aortic crossclamp,273,289,PROCEDURE
7,0,extubated,357,365,PROCEDURE
8,0,Cardiopulmonary resuscitation,589,617,PROCEDURE
9,0,Cardiopulmonary resuscitation,1019,1047,PROCEDURE


## Test

This pipeline can be used to extract `test` mentions in medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_test_benchmark_pipeline", "en", "clinical/models")


ner_test_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """PHYSICAL EXAMINATION :
On physical examination , the patient was a well developed , stocky gentleman .
The blood pressure was 115/80 , pulse 80 , respirations of 20 , venous pressure elevated at 3 cm above the clavicle at 90 degrees .
There were very small , barely palpable carotid pulses .
There was dullness at the right base , with a high diaphragm and possibly some fluid .
The cardiac examination showed a left ventricular tap at the fifth intercostal space left of the midclavicular line .
There was a grade II / VI systolic ejection murmur in the aortic area , no third sound , and paradoxical splitting of the second sound .
The liver was not palpable .
There were diminished pulses in the legs .
LABORATORY DATA :
The hemoglobin was 14.4 grams percent , white blood count 6,900 , platelet count 125,000 , sodium 137 mEq. per liter , potassium of 4.7 , BUN and creatinine of 23 and 1.3 mg percent .
The electrocardiogram showed left ventricular hypertrophy and non-specific ST-T wave changes .
The chest film showed massive cardiomegaly with pulmonary venous engorgement ."""


**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,PHYSICAL EXAMINATION,0,19,TEST
1,0,physical examination,26,45,TEST
2,0,blood pressure,107,120,TEST
3,0,pulse,135,139,TEST
4,0,respirations,146,157,TEST
5,0,venous pressure,167,181,TEST
6,0,pulses,283,288,TEST
7,0,cardiac examination,383,401,TEST
8,0,pulses,685,690,TEST
9,0,hemoglobin,728,737,TEST


## Grade Stage Severity

This pipeline can be used to extract biomarker, grade, stage, and severity information from medical text. `GRADE_STAGE_SEVERITY` covers mentions of pathological grading, staging, severity, and other modifiers of diseases and cancers.



In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_grade_stage_severity_benchmark_pipeline", "en", "clinical/models")


ner_grade_stage_severity_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """IDENTIFYING DATA AND CHIEF COMPLAINT :
Mrs. Saujule T. Neathe is a 73 year old Gravida 6 Para 4 abortions 2 with a background history of stage IIIC papillary serous adenocarcinoma of the ovary who presents on 5/30/94 for failure to thrive , right lower lobe pneumonia and obstructive uropathy .
This course of chemotherapy was complicated only by persistent diarrhea and some nausea and vomiting .
An abdominopelvic computerized tomography scan during that admission showed recurrent disease in the pelvis with bilateral hydronephrosis .
Two days following discharge , however , the patient was admitted to Sephsandpot Center because of a decreased urinary output and persistent nausea and vomiting and anorexia .
PHYSICAL EXAMINATION :
( on admission ) showed the patient to be low-grade febrile with temperature of 99.6. She was noted to be a thin , cachectic woman in no apparent distress . Head and neck examination remarkable only for extreme dry mucous membranes consistent with dehydration .
Abdomen :
showed well healed midline scar , non-distended , non-tender , bowel sounds were good , multiple small nodules were palpated subcutaneously in the upper abdomen which was non-tender , there was no costovertebral angle tenderness .
Pelvic and rectal examinations confirmed recurrence of tumor mass in the pelvis .
The patient was occult blood negative .
Extremities : showed no evidence of acute deep venous thrombosis . However , left leg had two plus pitting edema to the knee whereas the right leg had minimal edema .
"""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,stage IIIC,137,146,GRADE_STAGE_SEVERITY
1,0,persistent,347,356,GRADE_STAGE_SEVERITY
2,0,recurrent,474,482,GRADE_STAGE_SEVERITY
3,0,persistent,668,677,GRADE_STAGE_SEVERITY
4,0,low-grade,779,787,GRADE_STAGE_SEVERITY
5,0,apparent,874,881,GRADE_STAGE_SEVERITY
6,0,extreme,940,946,GRADE_STAGE_SEVERITY
7,0,multiple,1097,1104,GRADE_STAGE_SEVERITY
8,0,small,1106,1110,GRADE_STAGE_SEVERITY
9,0,acute,1398,1402,GRADE_STAGE_SEVERITY


## Admission Discharge

This pipeline can be used to extract `admission` and `discharge` mentions in medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_admission_discharge_benchmark_pipeline", "en", "clinical/models")


ner_admission_discharge_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """
ADMISSION DATE :
12-6-93
DISCHARGE DATE :
12-9-93
IDENTIFYING DATA :
This 75 year old female was transferred from Iming Medical Center for angioplasty .
PRINCIPAL DIAGNOSIS :
Unstable angina .
ASSOCIATED DIAGNOSIS :
Hypertension .
PRINCIPAL PROCEDURE :
Right and circumflex angioplasty , cardiac catheterization on 12-6-93 .
HISTORY OF PRESENT ILLNESS :
This 75 year old woman was previously admitted here in November 1993 for chronic angina .
She had mild mitral regurgitation and a slightly diminished ejection fraction .
There was a 90% right coronary stenosis which was reduced to 30 with a balloon angioplasty .
There were three lesions in the circumflex , dilated successfully .
However , the low circumflex marginal vessel could not be crossed with the balloon .
"""


**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,ADMISSION,1,9,ADMISSION_DISCHARGE
1,0,DISCHARGE,26,34,ADMISSION_DISCHARGE
2,0,admitted,393,400,ADMISSION_DISCHARGE


## Medical Condition Disorder

This pipeline can be used to extract `medical condition disorder` information from medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_medical_condition_disorder_benchmark_pipeline", "en", "clinical/models")


ner_medical_condition_disorder_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """The patient was admitted to the Surgical Intensive Care Unit postoperatively with stable hemodynamics on Lidocaine at one , Dobutamine at 200 and Nipride .
The patient was extubated by postoperative day # 1 but was noted to be relatively hypoxemic with high oxygen requirement .
Chest x-ray demonstrating pulmonary edema .
Aggressive diuresis was undertaken and the patient responded , albeit sluggishly .
In addition , he remained agitated .
This was attributed to mild hypoxia and / or his underlying psychiatric diagnoses and he was treated with Haldol appropriately .
On day # 2 the patient continued to diurese and this was maintained with the Lasix/ Mannitol infusion .
His urine output remained at 150 to 200 cc. an hour .
Despite this , his chest x-ray continued to show severe pulmonary edema and the clinical picture correlated .
He required anti-hypertensive therapy initially with Nipride which was changed to Hydralazine to avoid shunting .
Consequently his Pronestyl was discontinued .
In addition , the patient was given magnesium to bring his level above 2 ."""


**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity


## Symptom or Sign

This pipeline can be used to extract `symptom or sign` information in medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_symptom_or_sign_benchmark_pipeline", "en", "clinical/models")

ner_symptom_or_sign_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """IDENTIFYING DATA AND CHIEF COMPLAINT :
Mrs. Saujule T. Neathe is a 73 year old Gravida 6 Para 4 abortions 2 with a background history of stage IIIC papillary serous adenocarcinoma of the ovary who presents on 5/30/94 for failure to thrive , right lower lobe pneumonia and obstructive uropathy .
This course of chemotherapy was complicated only by persistent diarrhea and some nausea and vomiting .
An abdominopelvic computerized tomography scan during that admission showed recurrent disease in the pelvis with bilateral hydronephrosis .
Two days following discharge , however , the patient was admitted to Sephsandpot Center because of a decreased urinary output and persistent nausea and vomiting and anorexia .
PHYSICAL EXAMINATION :
( on admission ) showed the patient to be low-grade febrile with temperature of 99.6. She was noted to be a thin , cachectic woman in no apparent distress . Head and neck examination remarkable only for extreme dry mucous membranes consistent with dehydration .
Abdomen :
showed well healed midline scar , non-distended , non-tender , bowel sounds were good , multiple small nodules were palpated subcutaneously in the upper abdomen which was non-tender , there was no costovertebral angle tenderness .
Pelvic and rectal examinations confirmed recurrence of tumor mass in the pelvis .
The patient was occult blood negative .
Extremities : showed no evidence of acute deep venous thrombosis . However , left leg had two plus pitting edema to the knee whereas the right leg had minimal edema .
"""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,failure to thrive,221,237,SYMPTOM_OR_SIGN
1,0,nausea,376,381,SYMPTOM_OR_SIGN
2,0,vomiting,387,394,SYMPTOM_OR_SIGN
3,0,decreased urinary output,639,662,SYMPTOM_OR_SIGN
4,0,nausea,679,684,SYMPTOM_OR_SIGN
5,0,vomiting,690,697,SYMPTOM_OR_SIGN
6,0,febrile,789,795,SYMPTOM_OR_SIGN
7,0,thin,845,848,SYMPTOM_OR_SIGN
8,0,cachectic,852,860,SYMPTOM_OR_SIGN
9,0,dry mucous membranes,948,967,SYMPTOM_OR_SIGN


## Consumption

This pipeline can be used to extract consumption-related information (alcohol, smoking/tobacco, and substance use) from medical text. Alcohol refers to beverages containing ethanol, a psychoactive substance that is widely consumed for its pleasurable effects. Smoking typically involves inhaling smoke from burning tobacco, a highly addictive substance. Substance covers mentions of illegal recreational drug use, as well as substances that can create dependency, such as caffeine and tea. Examples: overdose, cocaine, illicit substance intoxication, coffee.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_consumption_benchmark_pipeline", "en", "clinical/models")


ner_consumption_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """SOCIAL HISTORY : The patient is a nonsmoker . Denies any alcohol or illicit drug use . The patient does live with his family .
SOCIAL HISTORY : The patient smokes approximately 2 packs per day times greater than 40 years . He does drink occasional alcohol approximately 5 to 6 alcoholic drinks per month . He denies any drug use . He is a retired liquor store owner .
SOCIAL HISTORY : Patient admits alcohol use , Drinking is described as heavy , Patient denies illegal drug use , Patient denies STD history , Patient denies tobacco use .
SOCIAL HISTORY : The patient is employed in the finance department . He is a nonsmoker . He does consume alcohol on the weekend as much as 3 to 4 alcoholic beverages per day on the weekends . He denies any IV drug use or abuse .
SOCIAL HISTORY : The patient is a smoker . Admits to heroin use , alcohol abuse as well . Also admits today using cocaine .
"""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,nonsmoker,34,42,CONSUMPTION
1,0,alcohol,57,63,CONSUMPTION
2,0,illicit drug use,68,83,CONSUMPTION
3,0,smokes,156,161,CONSUMPTION
4,0,drink,231,235,CONSUMPTION
5,0,alcohol,248,254,CONSUMPTION
6,0,alcoholic drinks,277,292,CONSUMPTION
7,0,drug use,320,327,CONSUMPTION
8,0,alcohol use,400,410,CONSUMPTION
9,0,Drinking,414,421,CONSUMPTION
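
The `begin` and `end` columns in the table above appear to be character offsets into the input text, with `end` pointing at the last character of the chunk (inclusive). A minimal sketch that checks this reading against the first sentence of the sample text; `chunk_at` is a hypothetical helper, not part of the library:

```python
# First sentence of the sample text used above.
text = "SOCIAL HISTORY : The patient is a nonsmoker . Denies any alcohol or illicit drug use ."

# (chunk, begin, end) rows copied from the result table above.
rows = [
    ("nonsmoker", 34, 42),
    ("alcohol", 57, 63),
    ("illicit drug use", 68, 83),
]

def chunk_at(text, begin, end):
    """Return the substring for an inclusive [begin, end] character span."""
    return text[begin:end + 1]

for chunk, begin, end in rows:
    # Each recovered substring should equal the chunk reported in the table.
    print(chunk, "->", repr(chunk_at(text, begin, end)))
```

Because the end offset is inclusive, a plain Python slice needs `end + 1`.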


## Alcohol Usage

This pipeline can be used to extract alcohol usage information in medical text.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_alcohol_use_benchmark_pipeline", "en", "clinical/models")


ner_alcohol_use_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """SOCIAL HISTORY : The patient is a nonsmoker . Denies any alcohol or illicit drug use . The patient does live with his family .
SOCIAL HISTORY : The patient smokes approximately 2 packs per day times greater than 40 years . He does drink occasional alcohol approximately 5 to 6 alcoholic drinks per month . He denies any drug use . He is a retired liquor store owner .
SOCIAL HISTORY : Patient admits alcohol use , Drinking is described as heavy , Patient denies illegal drug use , Patient denies STD history , Patient denies tobacco use .
SOCIAL HISTORY : The patient is employed in the finance department . He is a nonsmoker . He does consume alcohol on the weekend as much as 3 to 4 alcoholic beverages per day on the weekends . He denies any IV drug use or abuse .
SOCIAL HISTORY : She is married .Employed with the US Post Office .She is a mother of three . Denies tobacco , alcohol or illicit drug use . MEDICATIONS . Coumadin 1 mg daily .Last INR was on Tuesday , August 14 , 2007 , and her INR was 2.3.2 . Amiodarone 100 mg p.o . daily .
"""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,alcohol,57,63,ALCOHOL_USE
1,0,drink,231,235,ALCOHOL_USE
2,0,alcohol,248,254,ALCOHOL_USE
3,0,alcoholic drinks,277,292,ALCOHOL_USE
4,0,liquor,347,352,ALCOHOL_USE
5,0,alcohol use,400,410,ALCOHOL_USE
6,0,Drinking,414,421,ALCOHOL_USE
7,0,consume alcohol,636,650,ALCOHOL_USE
8,0,alcoholic beverages,685,703,ALCOHOL_USE
9,0,alcohol,879,885,ALCOHOL_USE


## Tobacco Usage

This pipeline can be used to detect and label smoking-related entities within medical text. Smoking/Tobacco typically involves inhaling smoke from burning tobacco, a highly addictive substance.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_tobaco_use_benchmark_pipeline", "en", "clinical/models")


ner_tobaco_use_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """SOCIAL HISTORY : The patient is a nonsmoker . Denies any alcohol or illicit drug use . The patient does live with his family .
SOCIAL HISTORY : The patient smokes approximately 2 packs per day times greater than 40 years . He does drink occasional alcohol approximately 5 to 6 alcoholic drinks per month . He denies any drug use . He is a retired liquor store owner .
SOCIAL HISTORY : Patient admits alcohol use , Drinking is described as heavy , Patient denies illegal drug use , Patient denies STD history , Patient denies tobacco use .
SOCIAL HISTORY : The patient is employed in the finance department . He is a nonsmoker . He does consume alcohol on the weekend as much as 3 to 4 alcoholic beverages per day on the weekends . He denies any IV drug use or abuse .
SOCIAL HISTORY : She is married .Employed with the US Post Office .She is a mother of three . Denies tobacco , alcohol or illicit drug use . MEDICATIONS . Coumadin 1 mg daily .Last INR was on Tuesday , August 14 , 2007 , and her INR was 2.3.2 . Amiodarone 100 mg p.o . daily .
"""


**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,nonsmoker,34,42,TOBACO_USE
1,0,smokes,156,161,TOBACO_USE
2,0,tobacco,525,531,TOBACO_USE
3,0,nonsmoker,616,624,TOBACO_USE
4,0,tobacco,869,875,TOBACO_USE


## Substance Usage

This pipeline can be used to extract `substance usage` information from medical text. SUBSTANCE_USE: Mentions of illegal recreational drug use, as well as substances that can create dependency, including caffeine and tea, e.g. “overdose, cocaine, illicit substance intoxication, coffee”.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("ner_substance_use_benchmark_pipeline", "en", "clinical/models")


ner_substance_use_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
text = """SOCIAL HISTORY : The patient is a nonsmoker . Denies any alcohol or illicit drug use . The patient does live with his family .
SOCIAL HISTORY : The patient smokes approximately 2 packs per day times greater than 40 years . He does drink occasional alcohol approximately 5 to 6 alcoholic drinks per month . He denies any drug use . He is a retired liquor store owner .
SOCIAL HISTORY : Patient admits alcohol use , Drinking is described as heavy , Patient denies illegal drug use , Patient denies STD history , Patient denies tobacco use .
SOCIAL HISTORY : The patient is employed in the finance department . He is a nonsmoker . He does consume alcohol on the weekend as much as 3 to 4 alcoholic beverages per day on the weekends . He denies any IV drug use or abuse .
SOCIAL HISTORY : The patient is a smoker . Admits to heroin use , alcohol abuse as well . Also admits today using cocaine .
"""

**Light Pipeline, Annotation Result**

In [None]:
light_result = ner_pipeline.fullAnnotate(text)

explode_light_result(light_result)

Unnamed: 0,id,chunk,begin,end,entity
0,0,illicit drug use,68,83,SUBSTANCE_USE
1,0,drug use,320,327,SUBSTANCE_USE
2,0,illegal drug use,462,477,SUBSTANCE_USE
3,0,IV drug use,745,755,SUBSTANCE_USE
4,0,heroin use,821,830,SUBSTANCE_USE
5,0,using cocaine,876,888,SUBSTANCE_USE


# Benchmarks

## Dataset

In [6]:
!wget -q https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/academic/NER_Benchmarks/ner_cloud_benchmark_text_df.xlsx
!wget -q https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/academic/NER_Benchmarks/ner_cloud_benchmark_token_df_ground_truth.xlsx


In [23]:
text_df = pd.read_excel("ner_cloud_benchmark_text_df.xlsx")
text_df.head(2)

Unnamed: 0,idx,task_name,text
0,2,task_id_2,\n010760828\nFIH\n4604997\n76732/99w3\n503426\...
1,8,task_id_8,\n405507617\nFIH\n2887168\n132052\n543394\n11/...


In [24]:
input_spark_df = spark.createDataFrame(text_df)
input_spark_df.show(2, truncate=50)

+---+---------+--------------------------------------------------+
|idx|task_name|                                              text|
+---+---------+--------------------------------------------------+
|  2|task_id_2|\n010760828\nFIH\n4604997\n76732/99w3\n503426\n...|
|  8|task_id_8|\n405507617\nFIH\n2887168\n132052\n543394\n11/1...|
+---+---------+--------------------------------------------------+
only showing top 2 rows



In [13]:
list(text_df.idx)

[2, 8, 36, 53, 65, 72, 77, 106, 107, 119, 234, 235, 239, 242, 255]

In [34]:
token_df = pd.read_excel("ner_cloud_benchmark_token_df_ground_truth.xlsx").reset_index(drop=True)
token_df.ner_label = token_df.ner_label.apply(lambda x: x.split("-")[-1])
token_df.tail()

Unnamed: 0,idx,task_name,token,begin,end,ner_label
8888,255,task_id_255,Patient,958,964,O
8889,255,task_id_255,denies,966,971,O
8890,255,task_id_255,tobacco,973,979,CONSUMPTION
8891,255,task_id_255,use,981,983,O
8892,255,task_id_255,.,985,985,O
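
The `split("-")[-1]` step above keeps only the part of each tag after the last hyphen. A small sketch of its effect, assuming the raw `ner_label` column holds BIO-style tags such as `B-CONSUMPTION` (the raw values are not shown in this notebook, so this is an assumption); `strip_bio_prefix` is a hypothetical name:

```python
def strip_bio_prefix(tag):
    """Drop a leading B-/I- marker, keeping only the entity type."""
    return tag.split("-")[-1]

# Assumed BIO-style inputs; "O" has no hyphen and passes through unchanged.
examples = ["B-CONSUMPTION", "I-CONSUMPTION", "O"]
print([strip_bio_prefix(t) for t in examples])
```

Note that an entity type that itself contains a hyphen would be truncated by this approach; the labels used in this benchmark use underscores, so that does not arise here.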


In [35]:
token_df.idx.unique()

array([  2,   8,  36,  53,  65,  72,  77, 106, 107, 116, 234, 235, 239,
       242, 255])
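
Since predictions are later joined to ground-truth tokens by document, it can be worth confirming that both files cover the same document ids. A minimal sketch using the two id lists exactly as printed above; in the notebook the same check could be run on `text_df.idx` and `token_df.idx` directly:

```python
# Ids copied from the two outputs above.
text_ids = [2, 8, 36, 53, 65, 72, 77, 106, 107, 119, 234, 235, 239, 242, 255]
token_ids = [2, 8, 36, 53, 65, 72, 77, 106, 107, 116, 234, 235, 239, 242, 255]

# Ids present in one frame but not the other.
only_in_text = sorted(set(text_ids) - set(token_ids))
only_in_tokens = sorted(set(token_ids) - set(text_ids))

print("only in text_df:", only_in_text)
print("only in token_df:", only_in_tokens)
```

With the values as printed, the lists differ in one position (119 versus 116). Whether that reflects the data or only the displayed output cannot be determined from this notebook alone.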

## JohnSnowLabs

In [36]:
label_list = [
    'ADMISSION_DISCHARGE',
    'PROBLEM', # 'SYMPTOM_OR_SIGN', 'MEDICAL_CONDITION_DISORDER',
    'GRADE_STAGE_SEVERITY',
    'TEST', # 'TEST_RESULT',
    'PROCEDURE',
    'TREATMENT',
    'DRUG',
    'BODY_PART',
    'CONSUMPTION', # 'TOBACO_USE', 'ALCOHOL_USE', 'SUBSTANCE_USE'
]
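
The inline comments in `label_list` suggest that some finer-grained labels are folded into a coarser benchmark label. One reading of those comments, written as a lookup table; `FINE_TO_COARSE` and `to_benchmark_label` are hypothetical names used only for illustration:

```python
# Hypothetical mapping, read off the inline comments in label_list above.
FINE_TO_COARSE = {
    "SYMPTOM_OR_SIGN": "PROBLEM",
    "MEDICAL_CONDITION_DISORDER": "PROBLEM",
    "TEST_RESULT": "TEST",
    "TOBACO_USE": "CONSUMPTION",
    "ALCOHOL_USE": "CONSUMPTION",
    "SUBSTANCE_USE": "CONSUMPTION",
}

def to_benchmark_label(label):
    """Map a fine-grained label to its benchmark label; others pass through."""
    return FINE_TO_COARSE.get(label, label)

print(to_benchmark_label("ALCOHOL_USE"))
print(to_benchmark_label("DRUG"))
```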

In [37]:
def apply_spark_pred(main_token_df,spark_pred):
    """Label each ground-truth token with the predicted chunk that fully contains it."""
    token_df = main_token_df.copy()
    token_df["spark"] = "O"  # default prediction: outside any entity
    for spark_i, spark_row in spark_pred.iterrows():
        task_name = spark_row["task_name"]
        # Only tokens from the same document can match this chunk
        for token_i, token_row in token_df[token_df.task_name==task_name].iterrows():
            # The token span must lie inside the chunk span (both ends inclusive)
            if (token_row["begin"] >= spark_row["begin"]) and (token_row["end"] <= spark_row["end"]) :
                token_df.loc[token_i,"spark"] = spark_row["pred_label"]
                token_df.loc[token_i,"ner_source"] = spark_row["ner_source"]

    return token_df
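
The containment rule used in `apply_spark_pred` can be illustrated without Spark or pandas. The sketch below uses the four tokens from the tail of `token_df` shown earlier and one hypothetical predicted chunk covering "tobacco use"; `label_tokens` is an illustrative stand-in, not the notebook's function:

```python
# Tokens copied from the tail of token_df shown earlier (inclusive offsets).
tokens = [
    {"token": "Patient", "begin": 958, "end": 964},
    {"token": "denies", "begin": 966, "end": 971},
    {"token": "tobacco", "begin": 973, "end": 979},
    {"token": "use", "begin": 981, "end": 983},
]

# Hypothetical predicted chunk spanning "tobacco use".
chunks = [{"begin": 973, "end": 983, "pred_label": "CONSUMPTION"}]

def label_tokens(tokens, chunks):
    """Give each token the label of a chunk that fully contains it, else 'O'."""
    labels = []
    for tok in tokens:
        label = "O"
        for ch in chunks:
            if tok["begin"] >= ch["begin"] and tok["end"] <= ch["end"]:
                label = ch["pred_label"]
        labels.append(label)
    return labels

print(label_tokens(tokens, chunks))
```

Under this rule a chunk that is wider than the annotated span still labels every token it covers, so in this hypothetical case "use" would count as a token-level false positive against the ground truth, which marks it `O`.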

In [38]:
def create_selected_replacement_dict(selected_labels,label_list):
    replacement_dict = {}
    for label in label_list:
        if label not in selected_labels:
            replacement_dict[label] = "O"
    return replacement_dict
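
A short usage sketch of `create_selected_replacement_dict`, repeated here so the example is self-contained. It shows how the returned dictionary turns a multi-label ground truth into a one-label-versus-`O` problem; the `gold` list is made up for illustration:

```python
label_list = [
    'ADMISSION_DISCHARGE', 'PROBLEM', 'GRADE_STAGE_SEVERITY', 'TEST',
    'PROCEDURE', 'TREATMENT', 'DRUG', 'BODY_PART', 'CONSUMPTION',
]

def create_selected_replacement_dict(selected_labels, label_list):
    # Every label that is not selected is mapped to "O".
    replacement_dict = {}
    for label in label_list:
        if label not in selected_labels:
            replacement_dict[label] = "O"
    return replacement_dict

replacement = create_selected_replacement_dict(["DRUG"], label_list)

# Illustrative ground-truth labels for four tokens.
gold = ["DRUG", "O", "PROBLEM", "TEST"]
binary_gold = [replacement.get(label, label) for label in gold]
print(binary_gold)
```

Only the selected label survives; every other entity label collapses to `O`, which matches the two-row reports printed for each pipeline.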


In [39]:
def explode_spark_result(spark_result):
    return spark_result.selectExpr("idx","task_name",
                                        "explode(ner_chunk) as ner_chunks")\
                .selectExpr("idx","task_name",
                            "ner_chunks.result as pred_chunk",
                            "ner_chunks.begin",
                            "ner_chunks.end",
                            "ner_chunks.metadata.entity as pred_label",
                            "ner_chunks.metadata.ner_source",
                            "ner_chunks.metadata.confidence")\
                .toPandas()

In [40]:
pipeline_dict = {
    "ner_problem_benchmark_pipeline":'PROBLEM',
    "ner_drug_benchmark_pipeline":'DRUG',
    "ner_treatment_benchmark_pipeline":'TREATMENT',
    "ner_body_part_benchmark_pipeline":'BODY_PART',
    "ner_procedure_benchmark_pipeline":'PROCEDURE',
    "ner_test_benchmark_pipeline":'TEST',
    "ner_grade_stage_severity_benchmark_pipeline":'GRADE_STAGE_SEVERITY',
    "ner_admission_discharge_benchmark_pipeline":'ADMISSION_DISCHARGE',
    "ner_consumption_benchmark_pipeline":'CONSUMPTION',
}

In [41]:
from sparknlp.pretrained import PretrainedPipeline
from sklearn.metrics import classification_report

for pipeline_name, predicted_label in pipeline_dict.items():
    nlp_pipeline_model = PretrainedPipeline(pipeline_name, "en", "clinical/models")
    spark_result = nlp_pipeline_model.transform(input_spark_df)
    spark_pred = explode_spark_result(spark_result)
    replaced_df = apply_spark_pred(token_df, spark_pred)
    selected_labels = create_selected_replacement_dict([predicted_label], label_list)
    replaced_df.ner_label = replaced_df.ner_label.replace(selected_labels)
    print(classification_report(replaced_df["ner_label"], replaced_df["spark"], digits=3))


ner_problem_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
              precision    recall  f1-score   support

           O      0.988     0.997     0.992      8153
     PROBLEM      0.961     0.862     0.909       740

    accuracy                          0.986      8893
   macro avg      0.974     0.929     0.951      8893
weighted avg      0.985     0.986     0.985      8893

ner_drug_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
              precision    recall  f1-score   support

        DRUG      0.992     0.907     0.947       129
           O      0.999     1.000     0.999      8764

    accuracy                          0.999      8893
   macro avg      0.995     0.953     0.973      8893
weighted avg      0.999     0.999     0.999      8893

ner_treatment_benchmark_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
              pre

**All Benchmarks with full dataset**

ner_problem_benchmark_pipeline
```
              precision    recall  f1-score   support
           O      0.989     0.996     0.993     76426
     PROBLEM      0.948     0.866     0.905      6145
    accuracy                          0.986     82571
   macro avg      0.969     0.931     0.949     82571
weighted avg      0.986     0.986     0.986     82571
```


ner_drug_benchmark_pipeline
```
              precision    recall  f1-score   support
        DRUG      0.989     0.957     0.973      1373
           O      0.999     1.000     1.000     81198
    accuracy                          0.999     82571
   macro avg      0.994     0.978     0.986     82571
weighted avg      0.999     0.999     0.999     82571
```
ner_treatment_benchmark_pipeline
```
              precision    recall  f1-score   support
           O      0.999     0.999     0.999     82021
   TREATMENT      0.900     0.918     0.909       550
    accuracy                          0.999     82571
   macro avg      0.950     0.959     0.954     82571
weighted avg      0.999     0.999     0.999     82571
```

ner_body_part_benchmark_pipeline
```
              precision    recall  f1-score   support
   BODY_PART      0.777     0.895     0.832      2049
           O      0.997     0.993     0.995     80522
    accuracy                          0.991     82571
   macro avg      0.887     0.944     0.914     82571
weighted avg      0.992     0.991     0.991     82571
```

ner_procedure_benchmark_pipeline
```
              precision    recall  f1-score   support
           O      0.999     0.997     0.998     81024
   PROCEDURE      0.869     0.936     0.901      1547
    accuracy                          0.996     82571
   macro avg      0.934     0.967     0.950     82571
weighted avg      0.996     0.996     0.996     82571
```



ner_test_benchmark_pipeline
```
              precision    recall  f1-score   support
           O      0.995     0.996     0.996     79006
        TEST      0.914     0.897     0.906      3565
    accuracy                          0.992     82571
   macro avg      0.955     0.946     0.951     82571
weighted avg      0.992     0.992     0.992     82571
```

ner_grade_stage_severity_benchmark_pipeline
```
                      precision    recall  f1-score   support
GRADE_STAGE_SEVERITY      0.722     0.904     0.803       689
                   O      0.999     0.997     0.998     81882
            accuracy                          0.996     82571
           macro avg      0.861     0.951     0.900     82571
        weighted avg      0.997     0.996     0.996     82571
```

ner_admission_discharge_benchmark_pipeline
```
                     precision    recall  f1-score   support
ADMISSION_DISCHARGE      0.983     0.986     0.984       799
                  O      1.000     1.000     1.000     81772
           accuracy                          1.000     82571
          macro avg      0.991     0.993     0.992     82571
       weighted avg      1.000     1.000     1.000     82571
```

ner_consumption_benchmark_pipeline
```
              precision    recall  f1-score   support
 CONSUMPTION      0.988     0.977     0.983       662
           O      1.000     1.000     1.000     81909
    accuracy                          1.000     82571
   macro avg      0.994     0.989     0.991     82571
weighted avg      1.000     1.000     1.000     82571
```
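
The derived columns in these reports can be recomputed from the per-class rows. The sketch below uses the `ner_problem_benchmark_pipeline` full-dataset numbers copied from above and the standard definitions of F1, macro average, and support-weighted average; because the inputs are already rounded to three digits, small differences in the last digit are possible for other rows:

```python
# (precision, recall, support) copied from the ner_problem_benchmark_pipeline table.
rows = {
    "O": (0.989, 0.996, 76426),
    "PROBLEM": (0.948, 0.866, 6145),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

problem_f1 = f1(rows["PROBLEM"][0], rows["PROBLEM"][1])

# Macro average: unweighted mean over classes.
macro_recall = sum(r for _, r, _ in rows.values()) / len(rows)

# Weighted average: mean over classes weighted by support.
total_support = sum(s for _, _, s in rows.values())
weighted_recall = sum(r * s for _, r, s in rows.values()) / total_support

print(round(problem_f1, 3), round(macro_recall, 3), round(weighted_recall, 3))
```

The weighted averages are dominated by the `O` class, which makes up most tokens, so the entity row and the macro average are the more informative figures when comparing pipelines.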

# Cloud Providers

## AWS


https://docs.aws.amazon.com/comprehend-medical/latest/dev/textanalysis-entitiesv2.html


**AWS Comprehend Medical Entities**


- ANATOMY: Detects references to the parts of the body or body systems and the locations of those parts or systems.
- BEHAVIORAL_ENVIRONMENTAL_SOCIAL: Detects the behaviors and conditions in the environment that impact a person's health. This includes tobacco usage, alcohol consumption, recreational drug usage, allergies, gender, and race/ethnicity.
- MEDICAL_CONDITION: Detects the signs, symptoms, and diagnoses of medical conditions.
- MEDICATION: Detects medication and dosage information on the patient.
- PROTECTED_HEALTH_INFORMATION: Detects the patient's personal information.
- TEST_TREATMENT_PROCEDURE: Detects the procedures that are used to determine a medical condition.
- TIME_EXPRESSION: Detects entities related to time when they are associated with a detected entity.

All of these categories are detected by the DetectEntitiesV2 operation. For analysis specific to detecting PHI, use DetectPHI on single files and StartPHIDetectionJob for batch analysis.

Amazon Comprehend Medical detects information in the following classes:

- Entity: A text reference to the name of relevant objects, such as people, treatments, medications, and medical conditions. For example, ibuprofen.
- Category: The generalized grouping to which an entity belongs. For example, ibuprofen is part of the MEDICATION category.
- Type: The type of entity detected within a single category. For example, ibuprofen is in the GENERIC_NAME type in the MEDICATION category.
- Attribute: Information related to an entity, such as the dosage of a medication. For example, 200 mg is an attribute of the ibuprofen entity.
- Trait: Something that Amazon Comprehend Medical understands about an entity, based on context. For example, a medication has the NEGATION trait if a patient is not taking it.
- Relationship Type: The relationship between an entity and an attribute.



### Credentials

In [None]:
import boto3
from botocore.client import Config

region_name = "region" # replace with your own region
AWS_ACCESS_KEY_ID = "AKXXXX"  # replace with your own access key id
AWS_SECRET_ACCESS_KEY = "szYYYYYYYYYYYYYYYYYYYY" # replace with your own secret access key

session = boto3.Session(aws_access_key_id=AWS_ACCESS_KEY_ID,
                        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                        region_name=region_name)

mfa_serial = "arn:aws:iam::XXXXXXXXXXXXX:mfa/YYYYYYYY" # replace with your own MFA device ARN
mfa_token = input('Please enter your 6 digit MFA code:')

sts = session.client('sts')
MFA_validated_token = sts.get_session_token(SerialNumber=mfa_serial, TokenCode=mfa_token)
config = Config(connect_timeout=3600, read_timeout=70)
MFA_validated_token

In [None]:
# Extract validated credentials for role assumption
validated_access_key_id = MFA_validated_token['Credentials']['AccessKeyId']
validated_secret_access_key = MFA_validated_token['Credentials']['SecretAccessKey']
validated_session_token = MFA_validated_token['Credentials']['SessionToken']

temp_sts_client = boto3.client(
    'sts',
    aws_access_key_id=validated_access_key_id,
    aws_secret_access_key=validated_secret_access_key,
    aws_session_token=validated_session_token
)

# Replace the placeholder below with your own role ARN; it should follow this format
target_role_arn = "arn:aws:iam::XXXXXXXXXXXXX:role/YourRoleName"

# Assume the desired role
response = temp_sts_client.assume_role(
    RoleArn=target_role_arn,
    RoleSessionName='MedComp'
)

# response

In [None]:
import boto3
# temporary creds
tmp_access_key_id = response['Credentials']['AccessKeyId']
tmp_secret_access_key = response['Credentials']['SecretAccessKey']
tmp_session_token = response['Credentials']['SessionToken']


client = boto3.client(service_name='comprehendmedical',
                      region_name='us-west-2',
                      aws_access_key_id = tmp_access_key_id,
                      aws_secret_access_key = tmp_secret_access_key,
                      aws_session_token = tmp_session_token
                     )

### Test

In [None]:
text = """
Jennifer Smith is 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
"""

aws_response = client.detect_entities_v2(Text = text)
aws_response.keys()

dict_keys(['Entities', 'UnmappedAttributes', 'ModelVersion', 'ResponseMetadata'])

In [None]:
aws_response["Entities"][0]

{'Id': 19,
 'BeginOffset': 1,
 'EndOffset': 15,
 'Score': 0.9934670329093933,
 'Text': 'Jennifer Smith',
 'Category': 'PROTECTED_HEALTH_INFORMATION',
 'Type': 'NAME',
 'Traits': []}

In [None]:
def extract_entites_from_aws_respose(aws_response):
    entity_row = []
    for entity in aws_response["Entities"]:
        entity_row.append({
            "start":entity["BeginOffset"],
            "end":entity["EndOffset"],
            "text":entity["Text"],
            "label":(entity["Type"]).upper(),
            #"category":(entity["Category"]).upper(),
            #"traits": entity["Traits"],
            #"attribute": entity["Attributes"] if "Attributes" in entity.keys() else []
        })
    return entity_row
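
In the sample entity shown earlier, `BeginOffset` 1 and `EndOffset` 15 cover the 14-character string 'Jennifer Smith', which indicates an end-exclusive span, whereas the ground-truth token table uses inclusive end offsets. A minimal sketch of reconciling the two conventions before comparing spans with tokens; `to_inclusive` is a hypothetical helper and not necessarily how the benchmark itself aligns them:

```python
def to_inclusive(span):
    """Convert an end-exclusive span dict to the inclusive-end convention."""
    return {**span, "end": span["end"] - 1}

# Start of the test text used above (it begins with a newline).
text = "\nJennifer Smith is 28-year-old female"

# First entity as returned by the extraction function above.
span = {"start": 1, "end": 15, "text": "Jennifer Smith", "label": "NAME"}

# End-exclusive spans work directly with Python slicing.
exclusive_slice = text[span["start"]:span["end"]]

# After conversion, the slice needs end + 1, like the token table offsets.
inclusive = to_inclusive(span)
inclusive_slice = text[inclusive["start"]:inclusive["end"] + 1]

print(repr(exclusive_slice), repr(inclusive_slice), inclusive["end"])
```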

In [None]:
extract_entites_from_aws_respose(aws_response)

[{'start': 1, 'end': 15, 'text': 'Jennifer Smith', 'label': 'NAME'},
 {'start': 19, 'end': 21, 'text': '28', 'label': 'AGE'},
 {'start': 31, 'end': 37, 'text': 'female', 'label': 'GENDER'},
 {'start': 56,
  'end': 85,
  'text': 'gestational diabetes mellitus',
  'label': 'DX_NAME'},
 {'start': 96, 'end': 107, 'text': 'eight years', 'label': 'TIME_TO_DX_NAME'},
 {'start': 108,
  'end': 129,
  'text': 'prior to presentation',
  'label': 'TIME_TO_DX_NAME'},
 {'start': 154, 'end': 171, 'text': 'diabetes mellitus', 'label': 'DX_NAME'},
 {'start': 173, 'end': 177, 'text': 'T2DM', 'label': 'DX_NAME'},
 {'start': 201,
  'end': 225,
  'text': 'HTG-induced pancreatitis',
  'label': 'DX_NAME'},
 {'start': 226, 'end': 237, 'text': 'three years', 'label': 'TIME_TO_DX_NAME'},
 {'start': 238,
  'end': 259,
  'text': 'prior to presentation',
  'label': 'TIME_TO_DX_NAME'},
 {'start': 290, 'end': 299, 'text': 'hepatitis', 'label': 'DX_NAME'},
 {'start': 318, 'end': 326, 'text': 'one-week', 'label': 'TIM

## Azure


https://learn.microsoft.com/en-us/azure/ai-services/language-service/text-analytics-for-health/concepts/health-entity-categories


Text Analytics for health detects medical concepts that fall under the following categories.

- BODY_STRUCTURE - Body systems, anatomic locations or regions, and body sites. For example, arm, knee, abdomen, nose, liver, head, respiratory system, lymphocytes.
- EXAMINATION_NAME – Diagnostic procedures and tests, including vital signs and body measurements. For example, MRI, ECG, HIV test, hemoglobin, platelets count, scale systems such as Bristol stool scale.
- DATE - Full date relating to a medical condition, examination, treatment, medication, or administrative event.
- DIRECTION – Directional terms that may relate to a body structure, medical condition, examination, or treatment, such as: left, lateral, upper, posterior.
- FREQUENCY - Describes how often a medical condition, examination, treatment, or medication occurred, occurs, or should occur.
- TIME - Temporal terms relating to the beginning and/or length (duration) of a medical condition, examination, treatment, medication, or administrative event.
- MEASUREMENT_UNIT – The unit of measurement related to an examination or a medical condition measurement.
- MEASUREMENT_VALUE – The value related to an examination or a medical condition measurement.
- ADMINISTRATIVE_EVENT – Events that relate to the healthcare system but of an administrative/semi-administrative nature. For example, registration, admission, trial, study entry, transfer, discharge, hospitalization, hospital stay.
- DIAGNOSIS – Disease, syndrome, poisoning. For example, breast cancer, Alzheimer’s, HTN, CHF, spinal cord injury.
- SYMPTOM_OR_SIGN – Subjective or objective evidence of disease or other diagnoses. For example, chest pain, headache, dizziness, rash, SOB, abdomen was soft, good bowel sounds, well nourished.
- CONDITION_QUALIFIER - Qualitative terms that are used to describe a medical condition. All the following subcategories are considered qualifiers:
- CONDITION_SCALE – Qualitative terms that characterize the condition by a scale, which is a finite ordered list of values.
- MEDICATION_CLASS – A set of medications that have a similar mechanism of action, a related mode of action, a similar chemical structure, and/or are used to treat the same disease. For example, ACE inhibitor, opioid, antibiotics, pain relievers.
- MEDICATION_NAME – Medication mentions, including copyrighted brand names, and non-brand names. For example, Ibuprofen.
- SUBSTANCE_USE – Mentions of use of legal or illegal drugs, tobacco or alcohol. For example, smoking, drinking, or heroin use.
- TREATMENT_NAME – Therapeutic procedures. For example, knee replacement surgery, bone marrow transplant, TAVI, diet.

### Credentials

In [None]:
!pip install -q azure-ai-textanalytics==5.2.0

In [None]:
import os

os.environ["LANGUAGE_KEY"] = "XXXXXXXXXXXXXX"
os.environ["LANGUAGE_ENDPOINT"] = "https://xxxxxxxxxxxx.azure.com/"


# This example requires environment variables named "LANGUAGE_KEY" and "LANGUAGE_ENDPOINT"
key = os.environ.get('LANGUAGE_KEY')
endpoint = os.environ.get('LANGUAGE_ENDPOINT')

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Authenticate the client using your key and endpoint
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=endpoint,
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

### Test

In [None]:
documents = [
    """Jennifer Smith is 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting."""
]

poller = client.begin_analyze_healthcare_entities(documents)
result = poller.result()

azure_response = [doc for doc in result if not doc.is_error][0]

In [None]:
def extract_entites_from_azure_respose(azure_response):
    entity_row = []
    for j, entity_base in enumerate(azure_response["entities"]):

        try:
            entity = dict(entity_base)
            entity_row.append({
                "start":  int(entity["offset"]),
                "end": int(entity["offset"]) + int(entity["length"]),
                "text": entity["text"],
                "label": entity["category"] if entity["category"] is None else entity["category"].upper(),
                #"category":entity["subcategory"] if entity["subcategory"] is None else entity["subcategory"].upper(),
                #"assertion": entity["assertion"] if entity["assertion"] is None else dict(entity["assertion"]),
                })
        except Exception as e:
            print(j, e)
            #print(f"{i}: entity: ",entity)

    return entity_row
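
Calling `.upper()` on the category yields names without underscores (for example `ADMINISTRATIVEEVENT`), while the category list above writes them as `ADMINISTRATIVE_EVENT`. If the underscore form is wanted, a small converter can insert the separators first. This is a sketch that assumes the service returns camel-cased category names such as `AdministrativeEvent`; `to_upper_snake` is a hypothetical helper:

```python
import re

def to_upper_snake(category):
    """Convert e.g. 'AdministrativeEvent' to 'ADMINISTRATIVE_EVENT'; None passes through."""
    if category is None:
        return None
    # Insert "_" at every lowercase/digit -> uppercase boundary, then upper-case.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", category).upper()

for name in ["AdministrativeEvent", "SymptomOrSign", "Diagnosis", None]:
    print(name, "->", to_upper_snake(name))
```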

In [None]:
extract_entites_from_azure_respose(azure_response)

[{'start': 18, 'end': 29, 'text': '28-year-old', 'label': 'AGE'},
 {'start': 30, 'end': 36, 'text': 'female', 'label': 'GENDER'},
 {'start': 55,
  'end': 84,
  'text': 'gestational diabetes mellitus',
  'label': 'DIAGNOSIS'},
 {'start': 95, 'end': 106, 'text': 'eight years', 'label': 'TIME'},
 {'start': 116,
  'end': 128,
  'text': 'presentation',
  'label': 'ADMINISTRATIVEEVENT'},
 {'start': 144,
  'end': 170,
  'text': 'type two diabetes mellitus',
  'label': 'DIAGNOSIS'},
 {'start': 172, 'end': 176, 'text': 'T2DM', 'label': 'DIAGNOSIS'},
 {'start': 189, 'end': 196, 'text': 'episode', 'label': 'COURSE'},
 {'start': 200,
  'end': 224,
  'text': 'HTG-induced pancreatitis',
  'label': 'DIAGNOSIS'},
 {'start': 225, 'end': 236, 'text': 'three years', 'label': 'TIME'},
 {'start': 246,
  'end': 258,
  'text': 'presentation',
  'label': 'ADMINISTRATIVEEVENT'},
 {'start': 283, 'end': 298, 'text': 'acute hepatitis', 'label': 'DIAGNOSIS'},
 {'start': 317, 'end': 325, 'text': 'one-week', 'label'

## GCP

### Credentials

In [None]:
GOOGLE_APPLICATION_CREDENTIALS = {
  "type": "service_account",
  "project_id": "<project_id>",
  "private_key_id": "<private_key_id>",
  "private_key": "-----BEGIN PRIVATE KEY-----\nXXXXXXXXXXXXXXXX\n-----END PRIVATE KEY-----\n",
  "client_email": "XXXXXXXXX@<project_id>.iam.gserviceaccount.com",
  "client_id": "<client_id>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/XXXXXX<project_id>.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}


# Defining the credential key-value pairs as local variables
locals().update(GOOGLE_APPLICATION_CREDENTIALS)

# Adding the credential key-value pairs to environment variables
os.environ.update(GOOGLE_APPLICATION_CREDENTIALS)

with open('google_license_keys.json', 'w') as f:
  json.dump(GOOGLE_APPLICATION_CREDENTIALS, f)

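# `!export` runs in a subshell and does not persist, so set the variable from Python instead
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./google_license_keys.json"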

!gcloud init

#  Pick configuration to use: -> 1
# You must sign in to continue -> Y
# enter the verification code provided in your browser: from your account
# Pick cloud project to use:

In [None]:
!gcloud projects add-iam-policy-binding PROJECT_ID \
    --member serviceAccount:SERVICE_ACCOUNT_ID \
    --role roles/healthcare.nlpServiceViewer

In [None]:
import requests
import subprocess
import json


def get_access_token():
    result = subprocess.run(['gcloud', 'auth', 'print-access-token'], capture_output=True, text=True)
    return result.stdout.strip()

access_token = get_access_token()
access_token[:10]

'ya29.a0AZY'

In [None]:
sample_text = """Jennifer Smith is 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting."""


In [None]:
project_id = GOOGLE_APPLICATION_CREDENTIALS["project_id"]
url = f"https://healthcare.googleapis.com/v1/projects/{project_id}/locations/us-central1/services/nlp:analyzeEntities"

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

data = {
  "documentContent": sample_text
}

response = requests.post(url, headers=headers, data=json.dumps(data))

if response.status_code == 200:
    print("Request successful!")
    # Process the response data here
else:
    print("Request failed:", response.text)

Request successful!


In [None]:
def extract_entites_from_gcp_response(gcp_response):
    entity_row = []
    for i, entity in enumerate(gcp_response["entityMentions"]):
        assertions = {
            "temporal":entity["temporalAssessment"]["value"] if "temporalAssessment" in  entity.keys() else None,
            "certainty":entity["certaintyAssessment"]["value"] if "certaintyAssessment" in entity.keys() else None,
        }
        entity_row.append({
            "start":entity["text"]["beginOffset"],
            "end":entity["text"]["beginOffset"] + len(entity["text"]["content"]),
            "text":entity["text"]["content"],
            "label": entity["type"] ,
            #"assertion": assertions,
            #"resolutions": entity["linkedEntities"] if "linkedEntities" in entity.keys() else None
        })
    return entity_row


In [None]:
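# Parse the JSON body; avoids eval(), which is unsafe and fails on JSON literals such as null/true/false
gcp_response = response.json()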

extract_entites_from_gcp_response(gcp_response)

[{'start': 55,
  'end': 84,
  'text': 'gestational diabetes mellitus',
  'label': 'PROBLEM'},
 {'start': 144,
  'end': 170,
  'text': 'type two diabetes mellitus',
  'label': 'PROBLEM'},
 {'start': 172, 'end': 176, 'text': 'T2DM', 'label': 'PROBLEM'},
 {'start': 200,
  'end': 224,
  'text': 'HTG-induced pancreatitis',
  'label': 'PROBLEM'},
 {'start': 283, 'end': 288, 'text': 'acute', 'label': 'SEVERITY'},
 {'start': 289, 'end': 298, 'text': 'hepatitis', 'label': 'PROBLEM'},
 {'start': 337, 'end': 345, 'text': 'polyuria', 'label': 'PROBLEM'},
 {'start': 347, 'end': 360, 'text': 'poor appetite', 'label': 'PROBLEM'},
 {'start': 366, 'end': 374, 'text': 'vomiting', 'label': 'PROBLEM'}]
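
The response body is JSON, so a JSON parser (`json.loads`, or `response.json()` in `requests`) is the appropriate way to read it. Python's `eval` only works when the body happens to also be a valid Python literal. The sketch below uses a hypothetical body shaped like the fields read above, with a `null` value added to show the difference; the real payload may contain other keys:

```python
import json

# Hypothetical response body; "linkedEntities": null is added for illustration.
body = (
    '{"entityMentions": [{"text": {"content": "polyuria", "beginOffset": 337}, '
    '"type": "PROBLEM", "linkedEntities": null}]}'
)

# json.loads maps JSON null/true/false to None/True/False.
parsed = json.loads(body)

# eval treats `null` as an undefined Python name and raises NameError.
try:
    eval(body)
    eval_ok = True
except NameError:
    eval_ok = False

print(parsed["entityMentions"][0]["type"], eval_ok)
```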

## Sonnet

In [None]:
!pip install -q anthropic

### Credentials

In [None]:
import os
from getpass import getpass

ANTHROPIC_API_KEY = getpass('Please enter your ANTHROPIC_API_KEY:')

# anthropic.Anthropic() reads the key from the ANTHROPIC_API_KEY environment variable
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY



Please enter your ANTHROPIC_API_KEY:··········


In [None]:
import anthropic
client = anthropic.Anthropic()
client.models.list(limit=20)

In [None]:
ENTITY_INSTRUCTIONS = f"""
    You are an expert medical annotator with extensive experience in labeling medical entities within clinical texts. Your role is to accurately identify and annotate Clinical entities in the provided text, following the specified entity types.

    ### Instructions:
    1. **Review the Text**: Carefully read the text to understand its medical context.
    2. **Identify Clinical Entities**: Locate any terms or phrases that represent Clinical Entities, based on the following entity types and their descriptions:
        **Clinical Entity Labels:**
        - **ADMISSION_DISCHARGE**: Information related to hospital admission or discharge.
        - **BODY_PART**: Specific body parts or anatomical locations.
        - **SYMPTOM_OR_SIGN**: Symptoms or clinical signs reported by the patient.
        - **MEDICAL_CONDITION_DISORDER**: Diagnosed diseases, disorders, or medical conditions.
        - **GRADE_STAGE_SEVERITY**: Severity, grade, or stage of a condition, disease, or symptom.
        - **TEST**: Medical tests, laboratory investigations, or imaging procedures.
        - **TEST_RESULT**: Results or findings of medical tests.
        - **PROCEDURE**: Surgical or non-surgical medical procedures.
        - **TREATMENT**: Therapies, interventions, or medical treatments.
        - **DRUG**: Medications or pharmaceutical substances.
        - **TOBACCO_USE**: Mentions of tobacco consumption.
        - **ALCOHOL_USE**: Mentions of alcohol consumption.
        - **SUBSTANCE_USE**: Mentions of drug or substance abuse.
    3. **Annotate Entities**: Extract clinical entities from the following text and categorize them into predefined labels.
    4. **Response Format**: Return the results in a structured JSON format, including each entity's text and its corresponding label.
    5. DO NOT return any other text or explanation like 'Here are the entities..' or 'JSON:...'.

"""
EXAMPLE_OUTPUT = """
### Example Output:
[
  { "extracted_chunk": "", "entity_label": "" },
  { "extracted_chunk": "", "entity_label": "" },
  ...
]
"""

JSON_NSTRUCTIONS = """
Please convert the given text to proper JSON format.
Ensure the malformed string is properly formatted by enclosing it in square brackets and removing any trailing commas.
"""

In [None]:
def anthropic_generation(INSTRUCTIONS, EXAMPLE_OUTPUT, text, model_name="claude-3-7-sonnet-20250219"):
    """Send the instructions, an example output, and the text to Claude; return the raw reply text."""
    response = client.messages.create(
        model=model_name,
        max_tokens=5000,
        messages=[
            {"role": "user", "content": INSTRUCTIONS},
            {"role": "assistant", "content": EXAMPLE_OUTPUT},
            {"role": "user", "content": text},
        ],
    )
    return response.content[0].text

### Test

In [None]:
import json
import re

def extract_entites_from_anthropic_response(text, anthropic_response):
    entity_list = []
    try:
        # json.loads parses the reply as data; eval would execute it as Python code
        response_list = json.loads(anthropic_response)
    except json.JSONDecodeError as e:
        # Ask the model to repair the malformed output, then parse again
        fixed_response = anthropic_generation(JSON_NSTRUCTIONS, EXAMPLE_OUTPUT, anthropic_response,
                                              model_name="claude-3-7-sonnet-20250219")
        response_list = json.loads(fixed_response)
        print("fixed", e)

    # Punctuation-free, lowercased copy of the text, used only when the exact match fails
    normalized_text = re.sub(r'[^\w\s]', '', text.replace("\n", " ")).lower()
    for response_chunk in response_list:
        word = response_chunk["extracted_chunk"]
        label = response_chunk["entity_label"]
        # First try an exact, word-bounded match in the original text
        matches = [(m.group(), m.start(), m.end(), label)
                   for m in re.finditer(rf"\b{re.escape(word)}\b", text)]
        if len(matches) == 0:
            # Fallback: match in the normalized text. These offsets index into
            # normalized_text, so they can drift from positions in `text` when
            # punctuation was removed before the match.
            normalized_word = re.sub(r'[^\w\s]', '', word.replace("\n", " ")).lower()
            matches = [(word, m.start(), m.end(), label)
                       for m in re.finditer(rf"\b{re.escape(normalized_word)}\b", normalized_text)]

        for match in matches:
            chunk = {
                "text": match[0],
                "label": match[3],
                "start": match[1],
                "end": match[2]
            }
            if chunk not in entity_list:
                entity_list.append(chunk)
    return sorted(entity_list, key=lambda x: x['start'])
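When the fallback matches against a punctuation-stripped copy of the text, the resulting offsets refer to that copy rather than to the original string. One way to keep the offsets aligned, sketched here under the assumption that a character-level index map is acceptable, is to record where each kept character came from. The helper names `normalize_with_map` and `find_in_original` are hypothetical and not part of the notebook.

```python
import re

def normalize_with_map(text):
    """Lowercase `text` and drop punctuation, remembering where each kept character came from."""
    chars, index_map = [], []
    for i, ch in enumerate(text):
        ch = " " if ch == "\n" else ch
        if re.match(r"[\w\s]", ch):
            for lowered in ch.lower():
                chars.append(lowered)
                index_map.append(i)
    return "".join(chars), index_map

def find_in_original(text, phrase):
    """Return (start, end) spans in the original `text` for punctuation-insensitive matches of `phrase`."""
    norm_text, index_map = normalize_with_map(text)
    norm_phrase, _ = normalize_with_map(phrase)
    if not norm_phrase.strip():
        return []
    spans = []
    for m in re.finditer(rf"\b{re.escape(norm_phrase)}\b", norm_text):
        # Map the first and last matched characters back to their original positions
        spans.append((index_map[m.start()], index_map[m.end() - 1] + 1))
    return spans

# Hypothetical example: the casing differs, so an exact match would fail
demo_text = "Pain (severe), HTG-induced pancreatitis noted."
spans = find_in_original(demo_text, "HTG-induced Pancreatitis")
recovered = [demo_text[s:e] for s, e in spans]
print(spans, recovered)
```

Slicing the original text with the returned spans yields the entity as it actually appears, including punctuation that the normalized copy dropped.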

In [None]:
anthropic_response = anthropic_generation(ENTITY_INSTRUCTIONS, EXAMPLE_OUTPUT, text,
          model_name = "claude-3-7-sonnet-20250219")

entity_row = extract_entites_from_anthropic_response(text, anthropic_response)

entity_row

[{'text': 'gestational diabetes mellitus',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 56,
  'end': 85},
 {'text': 'type two diabetes mellitus',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 145,
  'end': 171},
 {'text': 'T2DM',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 173,
  'end': 177},
 {'text': 'HTG-induced pancreatitis',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 201,
  'end': 225},
 {'text': 'acute hepatitis',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 284,
  'end': 299},
 {'text': 'polyuria', 'label': 'SYMPTOM_OR_SIGN', 'start': 338, 'end': 346},
 {'text': 'poor appetite',
  'label': 'SYMPTOM_OR_SIGN',
  'start': 348,
  'end': 361},
 {'text': 'vomiting', 'label': 'SYMPTOM_OR_SIGN', 'start': 367, 'end': 375}]

## GPT

In [None]:
!pip install -q openai

### Credentials

In [None]:
import os
from getpass import getpass

from openai import OpenAI

OPENAI_API_KEY = getpass('Please enter your open_api_key:')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Note: this rebinds `client`, which referred to the Anthropic client in the Sonnet section above
client = OpenAI(api_key=OPENAI_API_KEY)

Please enter your open_api_key:··········


In [None]:
ENTITY_INSTRUCTIONS = f"""
    You are an expert medical annotator with extensive experience in labeling medical entities within clinical texts. Your role is to accurately identify and annotate Clinical entities in the provided text, following the specified entity types.

    ### Instructions:
    1. **Review the Text**: Carefully read the text to understand its medical context.
    2. **Identify Clinical Entities**: Locate any terms or phrases that represent Clinical Entities, based on the following entity types and their descriptions:
        **Clinical Entity Labels:**
        - **ADMISSION_DISCHARGE**: Information related to hospital admission or discharge.
        - **BODY_PART**: Specific body parts or anatomical locations.
        - **SYMPTOM_OR_SIGN**: Symptoms or clinical signs reported by the patient.
        - **MEDICAL_CONDITION_DISORDER**: Diagnosed diseases, disorders, or medical conditions.
        - **GRADE_STAGE_SEVERITY**: Severity, grade, or stage of a condition, disease, or symptom.
        - **TEST**: Medical tests, laboratory investigations, or imaging procedures.
        - **TEST_RESULT**: Results or findings of medical tests.
        - **PROCEDURE**: Surgical or non-surgical medical procedures.
        - **TREATMENT**: Therapies, interventions, or medical treatments.
        - **DRUG**: Medications or pharmaceutical substances.
        - **TOBACCO_USE**: Mentions of tobacco consumption.
        - **ALCOHOL_USE**: Mentions of alcohol consumption.
        - **SUBSTANCE_USE**: Mentions of drug or substance abuse.
    3. **Annotate Entities**: Extract clinical entities from the following text and categorize them into predefined labels.
    4. **Response Format**: Return the results in a structured JSON format, including each entity's text and its corresponding label.
    5. DO NOT return any other text or explanation like 'Here are the entities..' or 'JSON:...'.

"""
EXAMPLE_OUTPUT = """
### Example Output:
[
  { "extracted_chunk": "", "entity_label": "" },
  { "extracted_chunk": "", "entity_label": "" },
  ...
]
"""

JSON_NSTRUCTIONS = """
Please convert the given text to proper JSON format.
Ensure the malformed string is properly formatted by enclosing it in square brackets and removing any trailing commas.
"""

In [None]:
def gpt_generation(INSTRUCTIONS, EXAMPLE_OUTPUT, text,
                   gpt_model_name="gpt-4.5-preview", temperature=0):
    """Send the instructions, an example output, and the text to the chat model; return the raw reply."""
    SYSTEM_PROMPT = "You are a smart and intelligent medical assistant system."

    response = client.chat.completions.create(
        model=gpt_model_name,
        response_format={"type": "json_object"},  # JSON mode
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": INSTRUCTIONS},
            {"role": "assistant", "content": EXAMPLE_OUTPUT},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

### Test

In [None]:
import re

def find_iter_entities(gpt_response):
    """Pull every {"extracted_chunk": ..., "entity_label": ...} record out of the raw reply."""
    # Regular expression to match each record
    pattern = r'\{\s*"extracted_chunk":\s*"([^"]+)",\s*"entity_label":\s*"([^"]+)"\s*\}'

    # Find all matches
    matches = re.finditer(pattern, gpt_response)

    # Extract matches into a list of dictionaries
    return [{"extracted_chunk": m.group(1), "entity_label": m.group(2)} for m in matches]

def extract_entites_from_gpt_response(text, gpt_response):
    entity_list = []
    try:
        response_list = find_iter_entities(gpt_response)
    except Exception as e:
        try:
            # Ask the model to repair the output, then extract again
            fixed_response = gpt_generation(JSON_NSTRUCTIONS, EXAMPLE_OUTPUT, gpt_response)
            response_list = find_iter_entities(fixed_response)
            print("fixed", e)
        except Exception as retry_error:
            response_list = []
            print(retry_error)

    # Punctuation-free, lowercased copy of the text, used only when the exact match fails
    normalized_text = re.sub(r'[^\w\s]', '', text.replace("\n", " ")).lower()

    for response_chunk in response_list:
        word = response_chunk["extracted_chunk"]
        label = response_chunk["entity_label"]
        # First try an exact, word-bounded match in the original text
        matches = [(m.group(), m.start(), m.end(), label)
                   for m in re.finditer(rf"\b{re.escape(word)}\b", text)]
        if len(matches) == 0:
            # Fallback: match in the normalized text. These offsets index into
            # normalized_text, so they can drift from positions in `text` when
            # punctuation was removed before the match.
            normalized_word = re.sub(r'[^\w\s]', '', word.replace("\n", " ")).lower()
            matches = [(word, m.start(), m.end(), label)
                       for m in re.finditer(rf"\b{re.escape(normalized_word)}\b", normalized_text)]

        for match in matches:
            chunk = {
                "text": match[0],
                "label": match[3],
                "start": match[1],
                "end": match[2]
            }
            if chunk not in entity_list:
                entity_list.append(chunk)
    return sorted(entity_list, key=lambda x: x['start'])
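Because the request uses JSON mode, the reply may arrive as a JSON object that wraps the record list rather than as a bare list. The following sketch tries strict JSON parsing first and only falls back to the record regex when parsing fails. The helper `parse_entities` and the wrapper key `"entities"` in the demo are assumptions for illustration; the actual key chosen by the model is not specified in this notebook.

```python
import json
import re

RECORD_PATTERN = re.compile(
    r'\{\s*"extracted_chunk":\s*"([^"]+)",\s*"entity_label":\s*"([^"]+)"\s*\}'
)

def parse_entities(raw):
    """Parse a model reply into a list of {"extracted_chunk", "entity_label"} records."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        parsed = None
    if isinstance(parsed, dict):
        # JSON mode returns an object; take the first list-valued field, if any
        parsed = next((v for v in parsed.values() if isinstance(v, list)), None)
    if isinstance(parsed, list):
        return [
            {"extracted_chunk": r["extracted_chunk"], "entity_label": r["entity_label"]}
            for r in parsed
            if isinstance(r, dict) and "extracted_chunk" in r and "entity_label" in r
        ]
    # Fallback for truncated or otherwise malformed replies
    return [
        {"extracted_chunk": m.group(1), "entity_label": m.group(2)}
        for m in RECORD_PATTERN.finditer(raw or "")
    ]

# Hypothetical replies: one wrapped in an object, one truncated mid-record
wrapped = '{"entities": [{"extracted_chunk": "polyuria", "entity_label": "SYMPTOM_OR_SIGN"}]}'
truncated = '[{"extracted_chunk": "vomiting", "entity_label": "SYMPTOM_OR_SIGN"}, {"extracted_chunk": "poly'

from_wrapped = parse_entities(wrapped)
from_truncated = parse_entities(truncated)
print(from_wrapped, from_truncated)
```

Parsing with `json.loads` first handles escaped quotes inside an entity string, which the regex alone would not match; the regex remains useful for salvaging complete records from a reply that was cut off.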

In [None]:
gpt_response = gpt_generation(ENTITY_INSTRUCTIONS, EXAMPLE_OUTPUT, text)

entity_row = extract_entites_from_gpt_response(text, gpt_response)

entity_row

[{'text': 'gestational diabetes mellitus',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 56,
  'end': 85},
 {'text': 'type two diabetes mellitus',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 145,
  'end': 171},
 {'text': 'T2DM',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 173,
  'end': 177},
 {'text': 'HTG-induced pancreatitis',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 201,
  'end': 225},
 {'text': 'acute hepatitis',
  'label': 'MEDICAL_CONDITION_DISORDER',
  'start': 284,
  'end': 299},
 {'text': 'polyuria', 'label': 'SYMPTOM_OR_SIGN', 'start': 338, 'end': 346},
 {'text': 'poor appetite',
  'label': 'SYMPTOM_OR_SIGN',
  'start': 348,
  'end': 361},
 {'text': 'vomiting', 'label': 'SYMPTOM_OR_SIGN', 'start': 367, 'end': 375}]
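With every system reduced to the same list-of-dicts shape, predictions can be scored against gold annotations. The sketch below uses exact span-and-label matching, which is only one possible rule; the benchmark's own scoring may use a different criterion (for example partial overlap), and `span_prf` and the demo data are hypothetical.

```python
def span_prf(gold, predicted):
    """Precision, recall, and F1 under exact (start, end, label) matching."""
    gold_set = {(e["start"], e["end"], e["label"]) for e in gold}
    pred_set = {(e["start"], e["end"], e["label"]) for e in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted spans (not taken from the benchmark dataset)
gold = [
    {"start": 0, "end": 8, "label": "SYMPTOM_OR_SIGN"},
    {"start": 13, "end": 21, "label": "SYMPTOM_OR_SIGN"},
]
predicted = [
    {"start": 0, "end": 8, "label": "SYMPTOM_OR_SIGN"},
    {"start": 13, "end": 20, "label": "SYMPTOM_OR_SIGN"},  # boundary is off by one
]

precision, recall, f1 = span_prf(gold, predicted)
print(precision, recall, f1)
```

Exact matching penalizes small boundary differences, so systems that use different label sets or offset conventions need to be aligned before their scores are comparable.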