![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/30.4.Text_Classification_with_LargeFewShotClassifier.ipynb)

# Large FewShot Classifier Model

In this notebook, we explore how SetFit delivers strong performance with only a limited amount of labeled data. SetFit uses contrastive learning to expand a small set of labeled examples into thousands of training pairs, making the most of what is available. While it may not match the performance of Bert For Sequence Classification (BFSC) models trained on full datasets, it offers a good balance of efficiency and effectiveness.
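To make the pair-expansion idea concrete, here is a minimal sketch in plain Python. It is not SetFit's actual sampling code (the library has its own pair sampler); the helper name `build_contrastive_pairs` and the toy texts are made up for illustration. It only shows why a handful of labeled examples yields many more training pairs.

```python
from itertools import combinations


def build_contrastive_pairs(examples):
    """Return (text_a, text_b, target) triples.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    Hypothetical helper: SetFit uses its own sampling strategy internally.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs


# Toy labeled set with made-up texts: 2 examples for each of 3 classes.
examples = [
    ("knee replacement", "Orthopedic"),
    ("hip fracture repair", "Orthopedic"),
    ("colonoscopy findings", "Gastroenterology"),
    ("esophageal reflux", "Gastroenterology"),
    ("elevated PSA", "Urology"),
    ("kidney stone", "Urology"),
]

pairs = build_contrastive_pairs(examples)
positives = [pair for pair in pairs if pair[2] == 1.0]
print(len(examples), "examples ->", len(pairs), "pairs,", len(positives), "positive")
```

The number of unordered pairs grows quadratically with the number of examples, which is what lets contrastive fine-tuning get useful signal out of very few labels.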

## **Colab Setup**

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
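The later cells read `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` from the license file. A small check like the one below can catch a missing key before the install cells fail; the helper name `missing_license_keys` is hypothetical and the values are placeholders, not real credentials.

```python
# Keys that the following cells of this notebook reference.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_keys.get(key)]


# Placeholder values for illustration only.
example_keys = {"SECRET": "xxx", "PUBLIC_VERSION": "5.4.0"}
print(missing_license_keys(example_keys))
```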

In [None]:
# Installing setfit transformers onnx-tf skl2onnx
! pip install --upgrade -q setfit transformers onnx-tf skl2onnx==1.16.0 onnxruntime==1.17.0
! pip install --upgrade -q optimum
! pip install sentence_transformers

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [5]:
import setfit
import onnx
import tensorflow as tf
import shutil
import copy
import sentence_transformers
import torch
import torch.onnx
import transformers
# import onnx_tf

from transformers import pipeline, set_seed, BioGptTokenizer, BioGptForCausalLM, AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from setfit import Trainer, SetFitModel, TrainingArguments, sample_dataset
from skl2onnx import convert_sklearn
from datasets import Dataset
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
#skl2onnx is needed to convert sklearn LogisticRegression to ONNX

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

In [6]:
import json
import os

import numpy as np
import pandas as pd

import sparknlp
import sparknlp_jsl

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.4.0
Spark NLP_JSL Version : 5.4.0


In [7]:
#disable gpu
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# Pretrained Models


| Model Name              | Description |
|-------------------------|-------------|
|[large_fewshot_classifier_age_group](https://nlp.johnsnowlabs.com/2024/08/15/large_fewshot_classifier_age_group_en)      | This few-shot classification model identifies and classifies the age group of a person mentioned in health documents. |
|[large_fewshot_classifier_ade](https://nlp.johnsnowlabs.com/2024/08/12/large_fewshot_classifier_ade_en)      | This few-shot classification model identifies and classifies tweets reporting Adverse Drug Events (ADEs). |

In [8]:
#Pipeline for pretrained `large_fewshot_classifier_ade` model

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

large_few_shot_classifier = LargeFewShotClassifierModel()\
    .pretrained("large_fewshot_classifier_ade", "en", "clinical/models")\
    .setInputCols("document")\
    .setOutputCol("prediction")

pipeline = sparknlp.base.Pipeline().setStages([
    documentAssembler,
    large_few_shot_classifier
])

large_fewshot_classifier_ade download started this may take some time.
[OK!]


In [9]:
large_few_shot_classifier.getClasses()

['ADE', 'noADE']

In [13]:
data = spark.createDataFrame([
    ["The patient developed severe liver toxicity after taking the medication for three weeks"],
    ["He experienced no complications during the treatment and reported feeling much better."],
    ["She experienced a sudden drop in blood pressure after the administration of the new drug."],
    ["The doctor recommended a daily dosage of the vitamin supplement to improve her health."]
]).toDF("text")

result = pipeline.fit(data).transform(data)

result.selectExpr("text", "prediction.result[0] as prediction").show(truncate=False)

+-----------------------------------------------------------------------------------------+----------+
|text                                                                                     |prediction|
+-----------------------------------------------------------------------------------------+----------+
|The patient developed severe liver toxicity after taking the medication for three weeks  |ADE       |
|He experienced no complications during the treatment and reported feeling much better.   |noADE     |
|She experienced a sudden drop in blood pressure after the administration of the new drug.|ADE       |
|The doctor recommended a daily dosage of the vitamin supplement to improve her health.   |noADE     |
+-----------------------------------------------------------------------------------------+----------+



# Training

## Dataset

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mtsamples_classifier.csv

In [None]:
df = pd.read_csv("./mtsamples_classifier.csv").dropna().reset_index(drop=True)
df.rename(columns={'category': 'label'}, inplace=True)
df['label']=df['label'].astype(str)
df["text"] = df["text"].astype(str)
df["length"] = df["text"].apply(lambda x: len(x))
#df = df[df["length"] <= 512].reset_index(drop=True)
df

| | label | text | length |
|---|---|---|---|
| 0 | Gastroenterology | PROCEDURES PERFORMED: Colonoscopy. INDICATION... | 3058 |
| 1 | Gastroenterology | OPERATION 1. Ivor-Lewis esophagogastrectomy. ... | 5937 |
| 2 | Gastroenterology | PREOPERATIVE DIAGNOSES: 1. Gastroesophageal r... | 2422 |
| 3 | Gastroenterology | PROCEDURE: Colonoscopy. PREOPERATIVE DIAGNOSE... | 1886 |
| 4 | Gastroenterology | PREOPERATIVE DIAGNOSIS: Right colon tumor. PO... | 3999 |
| ... | ... | ... | ... |
| 625 | Urology | PREOPERATIVE DIAGNOSIS: Bilateral vesicourete... | 2871 |
| 626 | Urology | HISTORY OF PRESENT ILLNESS: The patient prese... | 2008 |
| 627 | Urology | Thank you for allowing me to help in the eval... | 3213 |
| 628 | Urology | PREOPERATIVE DIAGNOSIS: Bilateral undescended... | 3766 |


In [None]:
df["label"].value_counts()

| label | count |
|---|---|
| Orthopedic | 222 |
| Gastroenterology | 157 |
| Neurology | 141 |
| Urology | 110 |


In [None]:
from sklearn.model_selection import train_test_split
train_data_df, test_data_df = train_test_split(df, test_size=0.2, random_state=55, stratify= df['label'])

In [None]:
train_data_df.label.value_counts()

| label | count |
|---|---|
| Orthopedic | 178 |
| Gastroenterology | 125 |
| Neurology | 113 |
| Urology | 88 |


In [None]:
test_data_df.label.value_counts()

| label | count |
|---|---|
| Orthopedic | 44 |
| Gastroenterology | 32 |
| Neurology | 28 |
| Urology | 22 |
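The counts above can be used to confirm that `stratify` kept the class mix nearly identical across the splits. The following sketch only does arithmetic on the numbers reported by the three `value_counts()` cells; the helper `proportions` is an illustrative name.

```python
# Class counts copied from the value_counts() outputs in this notebook.
full = {"Orthopedic": 222, "Gastroenterology": 157, "Neurology": 141, "Urology": 110}
train = {"Orthopedic": 178, "Gastroenterology": 125, "Neurology": 113, "Urology": 88}
test = {"Orthopedic": 44, "Gastroenterology": 32, "Neurology": 28, "Urology": 22}


def proportions(counts):
    """Convert absolute class counts into class shares that sum to 1."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


for label in full:
    print(
        f"{label:<17} full={proportions(full)[label]:.3f} "
        f"train={proportions(train)[label]:.3f} test={proportions(test)[label]:.3f}"
    )

# Largest difference between any split's class share and the full dataset's share.
max_gap = max(
    abs(proportions(split)[label] - proportions(full)[label])
    for split in (train, test)
    for label in full
)
print(f"largest deviation from the full-dataset share: {max_gap:.4f}")
```

Every class share in the train and test splits stays within one percentage point of its share in the full dataset.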


## Preprocess

In [None]:
#base sentence embeddings model to use
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"

!mkdir -p  ./models
#location to save the trained model
EXPORT_PATH = "./models/mtsamples_classifier/"
#Whether to use a differentiable head. If not, SetFit uses sklearn LogisticRegression for the classifier;
#otherwise a custom PyTorch model is used. A differentiable head makes a more powerful/sensitive classifier but
#requires more data. If only a few examples per class are available, use a non-differentiable classifier
USE_DIFFERENTIABLE_HEAD = False
USE_NUMERIC_LABELS = True

In [None]:
#preprocess data
#NOTE: a SetFit model with a differentiable head only accepts numeric (integer) labels. In that case you need to
#convert the labels to ints and save the original labels in a separate text file called labels.txt (see below)
def preprocessing_data(df, labels = None):
    if not labels:
        labels = list(df.label.unique())

    if USE_DIFFERENTIABLE_HEAD or USE_NUMERIC_LABELS:
        df["label_classes"] =  df["label"]
        df["label"] = df.apply(lambda x: labels.index(x["label"]), axis=1)
    #df["length"] = df["text"].apply(lambda x: len(x))
    #df = df[df["length"] <= 512]
    return df, labels
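The label encoding performed by `preprocessing_data` can also be expressed with two lookup dictionaries, which avoids a linear `labels.index()` search per row and makes decoding explicit. This is an alternative sketch, not the notebook's code; `label2id` and `id2label` are illustrative names, and the label order is the one this notebook prints for `labels`.

```python
# Label order as printed by the notebook for `labels`.
labels = ["Gastroenterology", "Orthopedic", "Neurology", "Urology"]

# Forward and inverse lookups, built once.
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = dict(enumerate(labels))

# Encode a few class names to integer ids, then decode them back.
encoded = [label2id[name] for name in ["Urology", "Orthopedic", "Gastroenterology"]]
decoded = [id2label[idx] for idx in encoded]

print(encoded)
print(decoded)
```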

In [None]:
#create training/test datasets

#Number of samples per class to use in the training set. The larger, the better, but slower
N_SAMPLES_PER_CLASS = {
    "Orthopedic": 1,
    "Gastroenterology": 1,
    "Neurology" : 1,
    "Urology": 1
}

train_data_df_p, labels = preprocessing_data(train_data_df)
test_data_df_p, _ = preprocessing_data(test_data_df, labels)

#add numeric labels, if needed
for k in list(N_SAMPLES_PER_CLASS.keys()):
    N_SAMPLES_PER_CLASS[labels.index(k)] = N_SAMPLES_PER_CLASS[k]

# train_data_df_p = train_data_df_p.groupby("label").apply(lambda x: x.sample(N_SAMPLES_PER_CLASS[x["label"].iloc[0]]))
# train_data_df_p = train_data_df_p.reset_index(drop=True)


train_dataset = Dataset.from_pandas(train_data_df_p)
test_dataset = Dataset.from_pandas(test_data_df_p)

# train_dataset = sample_dataset(train_dataset, label_column="label", num_samples=N_SAMPLES)


In [None]:
labels

['Gastroenterology', 'Orthopedic', 'Neurology', 'Urology']

In [None]:
train_dataset

Dataset({
    features: ['label', 'text', 'length', 'label_classes', '__index_level_0__'],
    num_rows: 504
})

In [None]:
test_dataset

Dataset({
    features: ['label', 'text', 'length', 'label_classes', '__index_level_0__'],
    num_rows: 126
})

## Training

In [None]:
#Create the model
if USE_DIFFERENTIABLE_HEAD:
    model = SetFitModel.from_pretrained(MODEL_NAME,
                                        use_differentiable_head = True,
                                        head_params={"out_features": len(labels)})
    model.labels = list(range(0, len(labels)))
else:
    model = SetFitModel.from_pretrained(MODEL_NAME)
    model.labels = labels

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [None]:
model.labels

['Gastroenterology', 'Orthopedic', 'Neurology', 'Urology']

In [None]:
args = TrainingArguments(
    # loss=sentence_transformers.losses.OnlineContrastiveLoss,
    max_steps = 10,
    batch_size=1,
    num_epochs=1
)
args.eval_strategy = args.evaluation_strategy
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset
)


In [None]:
trainer.train()

***** Running training *****
  Num unique pairs = 76
  Batch size = 1
  Num epochs = 1
  Total optimization steps = 10


In [None]:
trainer.state.log_history[-5:]

[{'embedding_loss': 0.1996, 'learning_rate': 2e-05, 'epoch': 0.01, 'step': 1},
 {'train_runtime': 30.5238,
  'train_samples_per_second': 0.328,
  'train_steps_per_second': 0.328,
  'epoch': 0.13,
  'step': 10}]
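The `epoch` values in the log follow from the training arguments, assuming the trainer reports the epoch as completed steps divided by steps per epoch (an assumption about the logging, not something the notebook states). With 76 unique pairs and a batch size of 1, ten steps cover only a fraction of one epoch:

```python
num_unique_pairs = 76  # "Num unique pairs" in the training log
batch_size = 1         # TrainingArguments(batch_size=1)
max_steps = 10         # TrainingArguments(max_steps=10)

# Ceiling division: number of optimizer steps needed to see every pair once.
steps_per_epoch = -(-num_unique_pairs // batch_size)

epoch_after_first_step = 1 / steps_per_epoch
epoch_at_end = max_steps / steps_per_epoch

print(f"{epoch_after_first_step:.2f}")
print(f"{epoch_at_end:.2f}")
```

The two printed values match the `'epoch': 0.01` and `'epoch': 0.13` entries in `log_history`, which confirms that `max_steps` stops training long before the single requested epoch completes.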

In [None]:
USE_DIFFERENTIABLE_HEAD

False

In [None]:
#convert numeric label ids to 32 bit ints
if USE_DIFFERENTIABLE_HEAD:
    model.labels = [int(l_i) for l_i in model.labels]

In [None]:
model.save_pretrained(EXPORT_PATH)

In [None]:
EXPORT_PATH

'./models/mtsamples_classifier/'

In [None]:
labels

['Gastroenterology', 'Orthopedic', 'Neurology', 'Urology']

In [None]:
#Save the original labels. Not necessary if using a non-differentiable head, but doing it in all cases for consistency
with open(f"{EXPORT_PATH}/labels.txt", "wt") as F:
    for label in labels:
        F.write(label + "\n")

## Evaluation

In [None]:
trainer.evaluate(test_dataset)

***** Running evaluation *****


{'accuracy': 0.9444444444444444}

In [None]:
test_predictions = []
test_labels = []
batch_i = 0
batch_size = 16
while batch_i * batch_size < len(test_dataset):
    batch = test_dataset[batch_i * batch_size: (batch_i + 1)* batch_size]
    # print(batch)
    predicted_labels = model.predict(batch["text"])
    # Predictions and gold labels may be strings or numeric ids; map ids back to class names
    test_predictions.extend([p if isinstance(p, str) else labels[int(p)] for p in predicted_labels])
    test_labels.extend([y if isinstance(y, str) else labels[int(y)] for y in batch["label"]])
    batch_i += 1

In [None]:
report = classification_report(test_labels, test_predictions)
print(report)

                  precision    recall  f1-score   support

Gastroenterology       0.89      1.00      0.94        32
       Neurology       0.96      0.93      0.95        28
      Orthopedic       0.98      0.98      0.98        44
         Urology       0.95      0.82      0.88        22

        accuracy                           0.94       126
       macro avg       0.94      0.93      0.94       126
    weighted avg       0.95      0.94      0.94       126



# Export into LargeFewShotClassifier

In [None]:
#location of set fit model
SETFIT_MODEL_PATH = "./models/mtsamples_classifier"

#location of the exported ONNX version of the model
OUTPUT_PATH = "./models/mtsamples_classifier_onnx"

#location of the exported SparkNLP model
SPARKNLP_OUTPUT_PATH = "./models/sparknlp/mtsamples_classifier"



!test -d $OUTPUT_PATH || mkdir $OUTPUT_PATH
!test -d $SETFIT_MODEL_PATH || mkdir $SETFIT_MODEL_PATH

!ls -l $OUTPUT_PATH

total 0


In [None]:
#load the underlying sentence embeddings model
ssent_emb = SentenceTransformer(SETFIT_MODEL_PATH)

#load set fit model
set_fit_model = setfit.SetFitModel.from_pretrained(SETFIT_MODEL_PATH)

#check if the classifier(head) is differentiable
set_fit_model.has_differentiable_head

False

In [None]:
#customized sentence embeddings model
class JSLSetFitEmbeddings(torch.nn.Module):
    def forward(self, input_ids, attention_mask):

        return self.sent_emb(input={"input_ids": input_ids, "attention_mask": attention_mask})["sentence_embedding"]

    def __init__(self, sent_emb):
        super().__init__()
        self.sent_emb = sent_emb

#customized classifier model
class JSLSetFitDifferentiableClassifier(torch.nn.Module):
    def forward(self, embeddings):
        return self.diff_classifier({"sentence_embedding": embeddings})["probs"]

    def __init__(self, diff_classifier):
        super().__init__()
        self.diff_classifier = diff_classifier


In [None]:
#load the underlying sentence embeddings model
ssent_emb = SentenceTransformer(SETFIT_MODEL_PATH)

#load tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(SETFIT_MODEL_PATH)

#load customized sentence embeddings model
jsl_set_fit_embeddings = JSLSetFitEmbeddings(ssent_emb)

In [None]:
#dummy inputs for the embeddings model
dummy_input_ids = torch.ones(1, 5, requires_grad=False).int().to("cpu")
dummy_attention_mask = torch.ones(1, 5, requires_grad=False).int().to("cpu")
dummy_args = tuple([dummy_input_ids, dummy_attention_mask])

dynamic_axes = {
    'input_ids': {0: 'batch', 1: 'max_seq'},
    'attention_mask': {0: 'batch', 1: 'max_seq'}
}

In [None]:
#get the shape of the embeddings
sample_output_shape = jsl_set_fit_embeddings(input_ids = dummy_input_ids, attention_mask=dummy_attention_mask).shape

In [None]:
# Export the embeddings model to ONNX format
torch.onnx.export (
    jsl_set_fit_embeddings,
    (dummy_input_ids, dummy_attention_mask),
    f"{OUTPUT_PATH}/embeddings.onnx",
    input_names=['input_ids', 'attention_mask'],
    output_names=['embeddings'],
    opset_version=12,
    do_constant_folding=True,
    dynamic_axes=dynamic_axes,
    export_params=True,
    verbose=False
)

In [None]:
set_fit_model.model_head.class_weight

In [None]:
#Export the classifier. We export different models, depending on whether the head is differentiable or not

if not set_fit_model.has_differentiable_head:
    #Head not differentiable, e.g. using sklearn
    print("Exporting non differentiable classifier")
    from skl2onnx.common.data_types import FloatTensorType
    initial_type = [('embeddings', FloatTensorType([None, sample_output_shape[1]]))]
    set_fit_model.model_head.classes_ = np.array(set_fit_model.labels)
    classifier_onnx = convert_sklearn(set_fit_model.model_head, initial_types=initial_type)

    with open(f"{OUTPUT_PATH}/classifier.onnx", "wb") as F:
        F.write(classifier_onnx.SerializeToString())
else:
    #Head differentiable, use standard PyTorch to ONNX export
    print("Exporting differentiable classifier")
    dummy_embeddings = torch.ones(sample_output_shape).to("cpu")
    set_fit_model.model_head.classes_ = np.array(set_fit_model.labels)  # added
    jsl_set_fit_head = JSLSetFitDifferentiableClassifier(set_fit_model.model_head)
    # # # Export to ONNX format
    torch.onnx.export (
        jsl_set_fit_head,
        (dummy_embeddings,),
        f"{OUTPUT_PATH}/classifier.onnx",
        input_names=['embeddings'],
        output_names=['output_probability'],
        opset_version=12,
        do_constant_folding=True,
        dynamic_axes={"embeddings": {0: 'batch', 1: 'max_seq'}},
        export_params=True,
        verbose=False
    )

Exporting non differentiable classifier


In [None]:
set_fit_model.model_head.classes_

array(['Gastroenterology', 'Orthopedic', 'Neurology', 'Urology'],
      dtype='<U16')

In [None]:
#Create the assets folder
!test -d $OUTPUT_PATH/assets || mkdir $OUTPUT_PATH/assets

#save vocabulary
tokenizer.save_vocabulary(f"{OUTPUT_PATH}/assets")

('./models/mtsamples_classifier_onnx/assets/vocab.txt',)

In [None]:
#if the labels are already specified in a separate file, load them, otherwise get them from the model
#Mind that the labels MUST be specified in a separate file when using a differentiable head
if os.path.isfile(f"{SETFIT_MODEL_PATH}/labels.txt"):
    shutil.copy(f"{SETFIT_MODEL_PATH}/labels.txt", f"{OUTPUT_PATH}/assets/labels.txt")
else:
    with open(f"{OUTPUT_PATH}/assets/labels.txt", "wt") as F:
        for label in set_fit_model.labels:
            F.write(label + "\n")

## Test Base Model

In [None]:
#Sample texts
texts = [
    "The patient was diagnosed with early-onset Parkinson's disease, and the neurologist recommended starting dopamine therapy immediately.",
    "The orthopedic surgeon performed a successful knee replacement surgery, allowing the patient to regain mobility.",
    "The patient presented with symptoms of Crohn's disease, and a colonoscopy was scheduled for further evaluation.",
    "PSA levels were elevated, prompting further tests to rule out prostate cancer."
]
inputs = tokenizer.batch_encode_plus(texts, padding=True)
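With `padding=True`, the tokenizer pads every sequence in the batch to the length of the longest one and returns an attention mask marking real tokens with 1 and padding with 0. The sketch below reproduces that behavior on made-up token ids; `pad_batch` is a hypothetical helper and `pad_id=1` is an assumed placeholder, not a value read from this tokenizer.

```python
def pad_batch(token_id_lists, pad_id=1):
    """Right-pad each id list to the longest length and build attention masks.

    pad_id is an assumed placeholder; a real tokenizer supplies its own value.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_id_lists
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, not real tokenizer output.
batch = pad_batch([[0, 11, 12, 13, 2], [0, 21, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Both lists end up rectangular, which is what the ONNX embeddings model expects for its `input_ids` and `attention_mask` inputs.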

In [None]:
#Load the models in ONNX
import onnxruntime as ort

session_options = ort.SessionOptions()
session_options.log_severity_level = 3

onnx_embds = ort.InferenceSession(f"{OUTPUT_PATH}/embeddings.onnx", sess_options=session_options)
onnx_head = ort.InferenceSession(f"{OUTPUT_PATH}/classifier.onnx", sess_options=session_options)

In [None]:
#Do ONNX inference. The output differs depending on the type of head
embeddings = onnx_embds.run(["embeddings"], {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]})[0]
if set_fit_model.has_differentiable_head:
    with open(f"{SETFIT_MODEL_PATH}/labels.txt", "rt") as F:
        labels = list(map(lambda x: x.strip(), F.readlines()))
        print(labels)
    probs = onnx_head.run(["output_probability"], {"embeddings": embeddings})[0]
    for ex in probs:
        print(labels[np.argmax(ex)], np.max(ex), ex)
else:
    print(onnx_head.run(["output_label", "output_probability"], {"embeddings": embeddings}))

[array(['Neurology', 'Orthopedic', 'Gastroenterology', 'Urology'],
      dtype=object), [{'Gastroenterology': 0.07261137664318085, 'Neurology': 0.7793239951133728, 'Orthopedic': 0.10372629761695862, 'Urology': 0.04433833062648773}, {'Gastroenterology': 0.033801596611738205, 'Neurology': 0.0959932953119278, 'Orthopedic': 0.7814074754714966, 'Urology': 0.08879761397838593}, {'Gastroenterology': 0.9320468306541443, 'Neurology': 0.029405836015939713, 'Orthopedic': 0.01963229663670063, 'Urology': 0.01891503296792507}, {'Gastroenterology': 0.09149256348609924, 'Neurology': 0.27800387144088745, 'Orthopedic': 0.07599826157093048, 'Urology': 0.5545052886009216}]]
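For the non-differentiable head, the second ONNX output is a list with one label-to-probability dictionary per input text. The sketch below picks the most likely label from each dictionary; the probabilities are copied from the output above and rounded to four decimals, and `top_label` is an illustrative helper name.

```python
def top_label(prob_by_label):
    """Return the (label, probability) pair with the highest probability."""
    label = max(prob_by_label, key=prob_by_label.get)
    return label, prob_by_label[label]


# Probabilities copied from the ONNX output above, rounded to four decimals.
probabilities = [
    {"Gastroenterology": 0.0726, "Neurology": 0.7793, "Orthopedic": 0.1037, "Urology": 0.0443},
    {"Gastroenterology": 0.0338, "Neurology": 0.0960, "Orthopedic": 0.7814, "Urology": 0.0888},
    {"Gastroenterology": 0.9320, "Neurology": 0.0294, "Orthopedic": 0.0196, "Urology": 0.0189},
    {"Gastroenterology": 0.0915, "Neurology": 0.2780, "Orthopedic": 0.0760, "Urology": 0.5545},
]

for probs in probabilities:
    label, score = top_label(probs)
    print(f"{label:<17} {score:.4f}")
```

The selected labels agree with the `output_label` array, and the fourth text (the PSA example) is the least confident prediction of the four.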


## Load and Save into SparkNLP

In [None]:
#Set the model architecture. Currently we only support MPNet.
#Supporting other architectures requires implementing the corresponding tokenizer, the rest should work out of the box.

model_architecture = None
if "MPNetModel" in ssent_emb[0].auto_model.config.architectures: # Access the config from the underlying Transformer model
    model_architecture = "mpnet"
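In the cell above, `model_architecture` silently stays `None` when the base model is not MPNet. A stricter variant could fail early with a clear message. The sketch below is one possible guard; `SUPPORTED_ARCHITECTURES` and `resolve_architecture` are hypothetical names, and the mapping contains only the MPNet entry the notebook mentions.

```python
# Only the architecture named in this notebook; extend if more become supported.
SUPPORTED_ARCHITECTURES = {"MPNetModel": "mpnet"}


def resolve_architecture(architectures):
    """Map a config's `architectures` list to a supported architecture name.

    Raises ValueError instead of returning None for unsupported models.
    """
    for name in architectures or []:
        if name in SUPPORTED_ARCHITECTURES:
            return SUPPORTED_ARCHITECTURES[name]
    raise ValueError(f"Unsupported architecture(s): {architectures}")


print(resolve_architecture(["MPNetModel"]))
```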

In [None]:
#export model to SparkNLP
sparknlp_jsl.annotator.LargeFewShotClassifierModel\
    .loadSavedModel(OUTPUT_PATH,
                    spark,
                    model_architecture=model_architecture,
                    has_differentiable_head=set_fit_model.has_differentiable_head)\
    .setCaseSensitive(not tokenizer.do_lower_case)\
    .write().overwrite().save(SPARKNLP_OUTPUT_PATH)

## Pipeline

In [None]:
#define pipeline
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

large_few_shot_classifier = LargeFewShotClassifierModel()\
    .load(SPARKNLP_OUTPUT_PATH)\
    .setInputCols("document")\
    .setOutputCol("prediction")

pipeline = Pipeline().setStages([
    document_assembler,
    large_few_shot_classifier
])

In [None]:
large_few_shot_classifier.getClasses()

['Orthopedic', 'Gastroenterology', 'Neurology', 'Urology']

In [None]:
#Sample texts were defined above
#sample Spark data
spark_df = spark.createDataFrame([[texts[0]],
                                  [texts[1]],
                                  [texts[2]],
                                  [texts[3]]]).toDF("text")

spark_df.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The patient was diagnosed with early-onset Parkinson's disease, and the neurologist recommended s...|
|The orthopedic surgeon performed a successful knee replacement surgery, allowing the patient to r...|
|The patient presented with symptoms of Crohn's disease, and a colonoscopy was scheduled for furth...|
|                      PSA levels were elevated, prompting further tests to rule out prostate cancer.|
+----------------------------------------------------------------------------------------------------+



In [None]:
results = pipeline.fit(spark_df).transform(spark_df)

#show results
results.selectExpr("text","prediction.result[0] as prediction").show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------+
|                                                                                                text|      prediction|
+----------------------------------------------------------------------------------------------------+----------------+
|The patient was diagnosed with early-onset Parkinson's disease, and the neurologist recommended s...|       Neurology|
|The orthopedic surgeon performed a successful knee replacement surgery, allowing the patient to r...|      Orthopedic|
|The patient presented with symptoms of Crohn's disease, and a colonoscopy was scheduled for furth...|Gastroenterology|
|                      PSA levels were elevated, prompting further tests to rule out prostate cancer.|         Urology|
+----------------------------------------------------------------------------------------------------+----------------+



## Evaluation in SparkNLP

In [None]:
test_data_df

| | label | text | length | label_classes |
|---|---|---|---|---|
| 429 | 1 | HISTORY OF PRESENT ILLNESS: The patient is a ... | 7258 | Orthopedic |
| 576 | 3 | CHIEF COMPLAINT: This 61-year-old male presen... | 4335 | Urology |
| 29 | 0 | HISTORY OF PRESENT ILLNESS: Ms. Connor is a 5... | 3560 | Gastroenterology |
| 166 | 2 | PREOPERATIVE DIAGNOSIS: Brain tumor left temp... | 2749 | Neurology |
| 579 | 3 | CHIEF COMPLAINT: Left flank pain and unable t... | 1270 | Urology |
| ... | ... | ... | ... | ... |
| 509 | 1 | PREOPERATIVE DIAGNOSIS: Left carpal tunnel sy... | 2428 | Orthopedic |
| 413 | 1 | PREOPERATIVE DIAGNOSES: 1. Left carpal tunnel... | 7669 | Orthopedic |
| 186 | 2 | CC: Progressive memory and cognitive decline.... | 5233 | Neurology |
| 273 | 2 | REASON FOR REFERRAL: The patient is a 58-year... | 31031 | Neurology |


In [None]:
test_spark_df = spark.createDataFrame(test_data_df)

test_results = pipeline.fit(test_spark_df).transform(test_spark_df)

In [None]:
test_results.selectExpr("text", "label","label_classes" ,"prediction.result[0] as prediction").show(5, truncate=100)

+----------------------------------------------------------------------------------------------------+-----+----------------+----------------+
|                                                                                                text|label|   label_classes|      prediction|
+----------------------------------------------------------------------------------------------------+-----+----------------+----------------+
| HISTORY OF PRESENT ILLNESS: The patient is a 26-year-old female, referred to Physical Therapy fo...|    1|      Orthopedic|      Orthopedic|
| CHIEF COMPLAINT: This 61-year-old male presents today with recent finding of abnormal serum PSA ...|    3|         Urology|Gastroenterology|
| HISTORY OF PRESENT ILLNESS: Ms. Connor is a 50-year-old female who returns to clinic for a wound...|    0|Gastroenterology|Gastroenterology|
| PREOPERATIVE DIAGNOSIS: Brain tumor left temporal lobe. POSTOPERATIVE DIAGNOSIS: Brain tumor lef...|    2|       Neurology|       Neurology|

In [None]:
res_df = test_results.selectExpr("text","label_classes", "prediction.result[0] as prediction").toPandas()
res_df

| | text | label_classes | prediction |
|---|---|---|---|
| 0 | HISTORY OF PRESENT ILLNESS: The patient is a ... | Orthopedic | Orthopedic |
| 1 | CHIEF COMPLAINT: This 61-year-old male presen... | Urology | Gastroenterology |
| 2 | HISTORY OF PRESENT ILLNESS: Ms. Connor is a 5... | Gastroenterology | Gastroenterology |
| 3 | PREOPERATIVE DIAGNOSIS: Brain tumor left temp... | Neurology | Neurology |
| 4 | CHIEF COMPLAINT: Left flank pain and unable t... | Urology | Gastroenterology |
| ... | ... | ... | ... |
| 121 | PREOPERATIVE DIAGNOSIS: Left carpal tunnel sy... | Orthopedic | Orthopedic |
| 122 | PREOPERATIVE DIAGNOSES: 1. Left carpal tunnel... | Orthopedic | Orthopedic |
| 123 | CC: Progressive memory and cognitive decline.... | Neurology | Neurology |
| 124 | REASON FOR REFERRAL: The patient is a 58-year... | Neurology | Neurology |


In [None]:
report = classification_report(res_df["label_classes"], res_df["prediction"], digits=4)
print(report)

                  precision    recall  f1-score   support

Gastroenterology     0.9143    1.0000    0.9552        32
       Neurology     0.9286    0.9286    0.9286        28
      Orthopedic     0.9773    0.9773    0.9773        44
         Urology     0.9474    0.8182    0.8780        22

        accuracy                         0.9444       126
       macro avg     0.9419    0.9310    0.9348       126
    weighted avg     0.9452    0.9444    0.9435       126
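The two summary rows of the report are simple aggregates of the per-class rows: the macro average is the unweighted mean across classes, and the weighted average weights each class by its support. The sketch below recomputes both F1 aggregates from the per-class F1 scores and supports printed above.

```python
# (f1-score, support) per class, copied from the classification report above.
per_class = {
    "Gastroenterology": (0.9552, 32),
    "Neurology": (0.9286, 28),
    "Orthopedic": (0.9773, 44),
    "Urology": (0.8780, 22),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
```

The macro score sits below the weighted score because Urology, the smallest class, also has the lowest F1 and is not down-weighted in the macro average.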

