![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SparkNLP_vs_ChatGPT.ipynb)

## **SparkNLP vs ChatGPT**

### Example text

In [None]:
sample_texts = [
    """A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy, total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma (mucinous-type carcinoma, stage Ic) 1 year ago. Patient's medical compliance was poor and failed to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2). Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast in 2 months. Core needle biopsy revealed metaplastic carcinoma. Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2 Dosage), and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response, followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting. Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions. The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation associated with adenomyoepithelioma. Immunohistochemistry study showed that the tumor cells are positive for epithelial markers-cytokeratin (AE1/AE3) stain, and myoepithelial markers, including cytokeratin 5/6 (CK 5/6), p63, and S100 stains. The dissected axillary lymph nodes showed metastastic carcinoma with negative hormone receptors in 3 nodes. The patient was staged as pT3N1aM0, with histologic tumor grade III""",
    """The patient was operated. The breast conserving surgery was performed. The clip in the right breast was localized by the ROLL method under X-ray guidance. The lesion in her left breast was localized by ROLL method under US guidance. The pathologic results were the remnant foci of high-grade DCIS in the right breast""",
    """Chest computed tomography (CT) showed pulmonary lesions in posterior segment of right upper lobe, and peripheral lung cancer with multiple pulmonary metastases. Multiple metastases of the thoracic vertebrae, sternum, and ribs were considered, which were similar to previous CT images""",
    """A chest CT scan revealed a lesion in the lower lobe of the right lung. The histopathological examination indicated an invasive nonmucinous adenocarcinoma. The cancer was classified as stage IIb (T1bN1M0). The patient subsequently received four cycles of gemcitabine (1,000 mg/m2 i.v. on d1 and d8) plus cisplatin (75 mg/m2 i.v. on d1) as postoperative adjuvant chemotherapy, and was followed up every 3 months thereafter. In 2016, a contrast-enhanced chest CT scan revealed mediastinal lymphadenopathy and multiple pleural nodules, indicating recurrence of the lung cancer""",
    """There was no differentiated component characteristic of adenocarcinoma, acinar cell carcinoma or hepatoid carcinoma of the pancreas. The surgical margins were negative for neoplastic infiltration. There was no evidence of perineural invasion. No lymph node metastasis was shown""",
]
sample_texts = [text.replace("\n"," ") for text in sample_texts]
sample_texts

["A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy, total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma (mucinous-type carcinoma, stage Ic) 1 year ago. Patient's medical compliance was poor and failed to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2). Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast in 2 months. Core needle biopsy revealed metaplastic carcinoma. Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2 Dosage), and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response, followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting. Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions. The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation associat

## OpenAI

### Install the `openai` Python package

In [None]:
!pip install --upgrade -q openai

In [None]:
from getpass import getpass
OPENAI_API_KEY = getpass('Please enter your OpenAI API key: ')

import os
api_key = {
    "OPENAI_API_KEY":OPENAI_API_KEY
}
locals().update(api_key)
os.environ.update(api_key)

### Set necessary prompts

In [None]:
SYSTEM_PROMPT = "You are a smart and intelligent Named Entity Recognition (NER) system. I will provide you the definition of the entities you need to extract, the sentence from which you extract the entities, and the output format with examples. You can only extract a token once with one entity; choose wisely using the provided definitions of the entities."

GUIDELINES_PROMPT = ("""Here are the labels of the Oncology model with their descriptions:
- `Adenopathy`: Mentions of pathological findings of the lymph nodes.
- `Age`: All mentions of ages, past or present, related to the patient or to anybody else.
- `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
- `Biomarker_Result`: Terms or values that are identified as the result of a biomarker.
- `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction.
- `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score").
- `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment.
- `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy".
- `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles").
- `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5").
- `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle").
- `Date`: Mentions of exact dates, in any format, including day number, month and/or year.
- `Death_Entity`: Words that indicate the death of the patient or someone else (including family members), such as "died" or "passed away".
- `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower".
- `Dosage`: The quantity prescribed by the physician for an active ingredient.
- `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks").
- `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid").
- `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father").
- `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated")
- `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary".
- `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy".
- `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan".
- `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy".
- `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category.
- `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment").
- `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions.
- `Oncogene`: Mentions of genes that are implicated in the etiology of cancer.
- `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells").
- `Pathology_Test`: Mentions of biopsies or tests that use tissue samples.
- `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4").
- `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups.
- `Radiotherapy`: Terms that indicate the use of Radiotherapy.
- `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement".
- `Relative_Date`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "yesterday" or "three years later").
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal").
- `Site_Bone`: Anatomical terms that refer to the human skeleton.
- `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum).
- `Site_Breast`: Anatomical terms that refer to the breasts.
- `Site_Liver`: Anatomical terms that refer to the liver.
- `Site_Lung`: Anatomical terms that refer to the lungs.
- `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies.
- `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities.
- `Smoking_Status`: All mentions of smoking related to the patient or to someone else.
- `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced".
- `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy".
- `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm").
- `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm").
- `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy").
""")


EXAMPLE_OUTPUT= """Output Format:
Examples:
1. Sentence: She had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago.
The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast.
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.
{"Output": {'Direction': ['left', 'right'],
             'Cancer_Surgery': ['mastectomy', 'axillary lymph node dissection'],
             'Cancer_Dx': ['breast cancer', 'cancer'],
             'Relative_Date': ['twenty years ago', '13 years later'],
             'Tumor_Finding': ['tumor'],
             'Biomarker_Result': ['positive'],
             'Biomarker': ['ER', 'PR'],
             'Radiotherapy': ['radiotherapy'],
             'Site_Breast': ['breast'],
             'Response_To_Treatment': ['recurred'],
             'Site_Lung': ['lung'],
             'Metastasis': ['metastasis'],
             'Chemotherapy': ['adriamycin', 'cyclophosphamide'],
             'Dosage': ['60 mg/m2', '600 mg/m2'],
             'Cycle_Count': ['six courses'],
             'Line_Of_Therapy': ['first line']}}
2. Sentence: {}
{"Output": {}}"""

In [None]:
from openai import OpenAI
client = OpenAI(api_key = OPENAI_API_KEY)

In [None]:
def generate_ner(sentence, gpt_model_name = "gpt-4", temperature=0):
  response = client.chat.completions.create(
                  model=gpt_model_name,
                  #response_format={ "type": "json_object" },
                  temperature=temperature,
                  messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": GUIDELINES_PROMPT},
                    {"role": "assistant", "content": EXAMPLE_OUTPUT},
                    {"role": "user", "content": sentence}
                  ]
                )
  return response

### GPT-4 Results

In [None]:
import ast
import pandas as pd

main_df = pd.DataFrame(columns=["text", "GPT4 Result"])

for text in sample_texts:
    response = generate_ner(text)
    print(text)
    # Parse the reply as a Python literal; unlike eval(), literal_eval never executes code
    unsorted_dict = ast.literal_eval(response.choices[0].message.content)['Output']
    print("\n")
    row = pd.DataFrame([[text, unsorted_dict]], columns=["text", "GPT4 Result"])
    main_df = pd.concat([main_df, row], axis=0).reset_index(drop=True)

A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy, total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma (mucinous-type carcinoma, stage Ic) 1 year ago. Patient's medical compliance was poor and failed to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2). Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast in 2 months. Core needle biopsy revealed metaplastic carcinoma. Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2 Dosage), and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response, followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting. Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions. The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation associated
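
Model replies are not guaranteed to be well-formed Python literals, so parsing them directly can fail on a single bad reply. The block below is a minimal sketch of a more defensive parser, assuming the reply follows the `{"Output": {...}}` shape shown in `EXAMPLE_OUTPUT`. The helper name `parse_ner_reply` and the choice to return an empty dict on failure are assumptions made for illustration, not part of the original notebook.

```python
import ast
import json


def parse_ner_reply(content):
    """Parse a reply shaped like {"Output": {...}} without executing it."""
    try:
        # Handles Python-style literals, including mixed single/double quotes.
        parsed = ast.literal_eval(content)
    except (ValueError, SyntaxError, TypeError):
        try:
            # Fall back to strict JSON in case the model answered in JSON.
            parsed = json.loads(content)
        except json.JSONDecodeError:
            return {}
    if not isinstance(parsed, dict):
        return {}
    output = parsed.get("Output", {})
    return output if isinstance(output, dict) else {}


# Reply in the mixed-quote style used by EXAMPLE_OUTPUT.
reply = """{"Output": {'Age': ['65-year-old'], 'Gender': ['woman']}}"""
entities = parse_ner_reply(reply)

# A reply that is not a literal at all yields an empty result instead of raising.
malformed = parse_ner_reply("Sorry, I cannot help with that.")
```

Returning an empty dict keeps the loop running across all sample texts; raising instead may be preferable when a silent miss would skew the comparison.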

In [None]:
main_df

|  | text | GPT4 Result |
|---|---|---|
| 0 | A 65-year-old woman had a history of debulking... | {'Age': ['65-year-old'], 'Gender': ['woman'], ... |
| 1 | The patient was operated. The breast conservin... | {'Cancer_Surgery': ['breast conserving surgery... |
| 2 | Chest computed tomography (CT) showed pulmonar... | {'Imaging_Test': ['Chest computed tomography',... |
| 3 | A chest CT scan revealed a lesion in the lower... | {'Imaging_Test': ['chest CT scan', 'contrast-e... |
| 4 | There was no differentiated component characte... | {'Histological_Type': ['adenocarcinoma', 'acin... |


## **Spark NLP**

📌 To run this yourself, you will need to upload your license keys to the notebook. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload license_keys.json to the folder that opens. Otherwise, you can look at the example outputs at the bottom of the notebook.

### Installation

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
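
The later cells read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from the license file, so a missing key only surfaces as a confusing `pip` or `start()` failure. The following is a minimal sketch of an early check, assuming those three keys are the ones needed (they are the ones this notebook uses). `load_license` is a hypothetical helper, and the demo file holds placeholder values rather than real credentials.

```python
import json
import os
import tempfile

# Key names taken from how the license is used elsewhere in this notebook.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def load_license(path):
    """Load a license JSON file and fail early if an expected key is missing."""
    with open(path) as f:
        keys = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in keys]
    if missing:
        raise KeyError(f"License file is missing: {', '.join(missing)}")
    return keys


# Demo with a throwaway file holding placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "spark_jsl.json")
    with open(demo_path, "w") as f:
        json.dump({"SECRET": "xxx", "PUBLIC_VERSION": "0.0.0", "JSL_VERSION": "0.0.0"}, f)
    demo_keys = load_license(demo_path)
```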

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [None]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.3.2
Spark NLP_JSL Version : 5.3.2


### **`ner_oncology` pipeline**

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(["-", r"\/"])

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            word_embeddings,
                            ner,
                            ner_converter])

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_oncology download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame(sample_texts, StringType()).toDF("text").coalesce(1)
data = data.withColumn("idx", F.monotonically_increasing_id())

data.show()

+--------------------+---+
|                text|idx|
+--------------------+---+
|A 65-year-old wom...|  0|
|The patient was o...|  1|
|Chest computed to...|  2|
|A chest CT scan r...|  3|
|There was no diff...|  4|
+--------------------+---+



### Spark NLP Result

In [None]:
result = pipeline.fit(data).transform(data)

result_df = result.select("idx",F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select("idx",F.expr("cols['0']").alias("chunk"),
              #F.expr("cols['1']").alias("begin"),
              #F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              #F.expr("cols['3']['confidence']").alias("confidence")
              ).toPandas()

result_df

|  | idx | chunk | ner_label |
|---|---|---|---|
| 0 | 0 | 65-year-old | Age |
| 1 | 0 | woman | Gender |
| 2 | 0 | debulking surgery | Cancer_Surgery |
| 3 | 0 | bilateral | Direction |
| 4 | 0 | oophorectomy | Cancer_Surgery |
| ... | ... | ... | ... |
| 135 | 4 | neoplastic infiltration | Invasion |
| 136 | 4 | perineural | Site_Other_Body_Part |
| 137 | 4 | invasion | Invasion |
| 138 | 4 | lymph node | Site_Lymph_Node |
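
The Spark NLP output is one row per chunk, while the GPT-4 output is one dictionary per text. A minimal sketch of bringing the rows into the same label-to-chunks shape is given below, using plain tuples in the `(idx, chunk, ner_label)` column order of `result_df` (for instance as produced by `result_df.itertuples(index=False)`). `rows_to_entity_dicts` is a hypothetical helper, and the rows are a handful copied from the output above.

```python
from collections import defaultdict


def rows_to_entity_dicts(rows):
    """Group (idx, chunk, ner_label) rows into {idx: {label: [chunks]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for idx, chunk, label in rows:
        grouped[idx][label].append(chunk)
    # Convert the nested defaultdicts to plain dicts for easier printing.
    return {idx: dict(labels) for idx, labels in grouped.items()}


# A few rows copied from the output above.
rows = [
    (0, "65-year-old", "Age"),
    (0, "woman", "Gender"),
    (0, "debulking surgery", "Cancer_Surgery"),
    (0, "oophorectomy", "Cancer_Surgery"),
    (4, "neoplastic infiltration", "Invasion"),
    (4, "invasion", "Invasion"),
]
spark_entities = rows_to_entity_dicts(rows)
```

Keeping the chunks as lists rather than joined strings preserves chunk boundaries, which matters when a chunk itself contains a comma.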


## Comparison

In [None]:
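# Join all chunks that share a label within each text: one row per (idx, ner_label)
res_df = result_df.groupby(["idx", "ner_label"]).agg({'chunk': ', '.join})

# Render each group as "'Label' : [chunk1, chunk2]"
for i,row in res_df.iterrows():
  new_row = f"'{i[1]}' : [{row.chunk}]"
  res_df.loc[i,"new_row" ] = new_row

# Collapse to a single string per text
res_df = res_df.groupby(["idx"]).agg({'new_row': ', '.join})

# Assign the column explicitly so pandas aligns it with main_df on idx
main_df["SparkNLP Result"] = res_df["new_row"]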

main_df

|  | text | GPT4 Result | SparkNLP Result |
|---|---|---|---|
| 0 | A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy, total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma (mu... | {'Age': ['65-year-old'], 'Gender': ['woman'], 'Cancer_Surgery': ['debulking surgery', 'bilateral oophorectomy', 'omentectomy', 'total anterior hysterectomy', 'radical pelvic lymph nodes dissection... | 'Age' : [65-year-old], 'Biomarker' : [epithelial markers-cytokeratin, AE1/AE3, myoepithelial markers, cytokeratin 5/6, CK 5/6, p63, S100, hormone receptors], 'Biomarker_Result' : [positive, negati... |
| 1 | The patient was operated. The breast conserving surgery was performed. The clip in the right breast was localized by the ROLL method under X-ray guidance. The lesion in her left breast was localiz... | {'Cancer_Surgery': ['breast conserving surgery'], 'Direction': ['right', 'left'], 'Site_Breast': ['breast', 'breast'], 'Imaging_Test': ['X-ray', 'US'], 'Pathology_Result': ['remnant foci of high-g... | 'Cancer_Dx' : [DCIS], 'Cancer_Surgery' : [breast conserving surgery], 'Direction' : [right, left, right], 'Gender' : [her], 'Grade' : [high-grade], 'Imaging_Test' : [X-ray guidance, US guidance], ... |
| 2 | Chest computed tomography (CT) showed pulmonary lesions in posterior segment of right upper lobe, and peripheral lung cancer with multiple pulmonary metastases. Multiple metastases of the thoracic... | {'Imaging_Test': ['Chest computed tomography', 'CT'], 'Site_Lung': ['pulmonary', 'right upper lobe', 'lung'], 'Cancer_Dx': ['peripheral lung cancer'], 'Metastasis': ['multiple pulmonary metastases... | 'Cancer_Dx' : [peripheral lung cancer], 'Direction' : [posterior, right], 'Imaging_Test' : [Chest computed tomography, CT, CT images], 'Metastasis' : [metastases, metastases], 'Site_Bone' : [thora... |
| 3 | A chest CT scan revealed a lesion in the lower lobe of the right lung. The histopathological examination indicated an invasive nonmucinous adenocarcinoma. The cancer was classified as stage IIb (T... | {'Imaging_Test': ['chest CT scan', 'contrast-enhanced chest CT scan'], 'Tumor_Finding': ['multiple pleural nodules'], 'Direction': ['lower', 'right'], 'Site_Lung': ['lobe of the right lung'], 'His... | 'Adenopathy' : [mediastinal lymphadenopathy], 'Cancer_Dx' : [adenocarcinoma, cancer, lung cancer], 'Chemotherapy' : [gemcitabine, cisplatin, adjuvant chemotherapy], 'Cycle_Count' : [four cycles], ... |
| 4 | There was no differentiated component characteristic of adenocarcinoma, acinar cell carcinoma or hepatoid carcinoma of the pancreas. The surgical margins were negative for neoplastic infiltration.... | {'Histological_Type': ['adenocarcinoma', 'acinar cell carcinoma', 'hepatoid carcinoma'], 'Site_Other_Body_Part': ['pancreas'], 'Pathology_Result': ['surgical margins were negative for neoplastic i... | 'Cancer_Dx' : [adenocarcinoma, carcinoma, carcinoma of the pancreas], 'Histological_Type' : [acinar cell, hepatoid], 'Invasion' : [neoplastic infiltration, invasion], 'Metastasis' : [metastasis], ... |
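
Reading the two result columns side by side is error-prone for long texts. The sketch below shows one possible way to diff them per label, assuming both results are available as label-to-chunks dictionaries. `compare_entities` is a hypothetical helper, matching is exact and case-insensitive, and the inputs are a subset of the values visible for the second sample text in the table above.

```python
def compare_entities(gpt, spark):
    """Per label, split chunks into shared, GPT-only and Spark-NLP-only lists."""
    report = {}
    for label in sorted(set(gpt) | set(spark)):
        # Lower-case and de-duplicate so repeated mentions count once.
        g = {c.lower() for c in gpt.get(label, [])}
        s = {c.lower() for c in spark.get(label, [])}
        report[label] = {
            "both": sorted(g & s),
            "gpt_only": sorted(g - s),
            "spark_only": sorted(s - g),
        }
    return report


# Subset of the values shown for the second sample text.
gpt = {"Cancer_Surgery": ["breast conserving surgery"],
       "Direction": ["right", "left"],
       "Imaging_Test": ["X-ray", "US"]}
spark = {"Cancer_Surgery": ["breast conserving surgery"],
         "Direction": ["right", "left", "right"],
         "Imaging_Test": ["X-ray guidance", "US guidance"],
         "Gender": ["her"]}
report = compare_entities(gpt, spark)
```

Exact matching treats `X-ray` and `X-ray guidance` as different chunks, so it understates partial agreement; a span-overlap comparison would need the `begin` and `end` offsets that are commented out in the Spark NLP result cell.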
