![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/26.0.Voice_of_Patient_Models.ipynb)

# **Voice of Patient Models**

This notebook describes the pretrained Voice of Patient (VOP) models, which extract healthcare-related terms from documents written in patients' own words, and gives an example for each model type.

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.0

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load license data and start a session with all jars the user has access to
spark = nlp.start()

In [5]:
spark

## **List of Pretrained Models**

In [6]:
df = pd.DataFrame()
for model_type in ['MedicalNerModel', 'AssertionDLModel', 'MedicalBertForSequenceClassification','GenericClassifierModel']:
    model_list = sorted(list(set([model[0] for model in medical.InternalResourceDownloader.returnPrivateModels(model_type) if 'vop' in model[0]])))
    if len(model_list) > 0:
      if model_type == "MedicalNerModel":
        model_list = list(filter(lambda x: "wip" not in x, model_list))
      df = pd.concat([df, pd.DataFrame(model_list, columns = [model_type])], axis = 1)

df.fillna('')

|    | MedicalNerModel | AssertionDLModel | MedicalBertForSequenceClassification |
|---:|:----------------|:-----------------|:-------------------------------------|
| 0 | ner_vop | assertion_vop_clinical | bert_sequence_classifier_vop_drug_side_effect |
| 1 | ner_vop_anatomy | assertion_vop_clinical_large | bert_sequence_classifier_vop_hcp_consult |
| 2 | ner_vop_anatomy_emb_clinical_large | assertion_vop_clinical_medium | bert_sequence_classifier_vop_self_report |
| 3 | ner_vop_anatomy_emb_clinical_medium | | bert_sequence_classifier_vop_side_effect |
| 4 | ner_vop_anatomy_langtest | | bert_sequence_classifier_vop_sound_medical |
| 5 | ner_vop_clinical_dept | | |
| 6 | ner_vop_clinical_dept_emb_clinical_large | | |
| 7 | ner_vop_clinical_dept_emb_clinical_medium | | |
| 8 | ner_vop_clinical_dept_langtest | | |
| 9 | ner_vop_demographic | | |
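
The filtering logic in the listing cell can be tried without a license. The sketch below applies the same two rules (keep names containing `vop`, drop NER names containing `wip`) to a small hand-written list. The tuples, the placeholder version strings, and the name `ner_vop_wip` are hypothetical stand-ins; only the assumption that the first element is the model name comes from the `model[0]` access in the cell.

```python
# Hypothetical catalogue entries; only the first element (the model name) is used,
# mirroring the `model[0]` access in the listing cell.
catalogue = [
    ("ner_vop", "en", "x.y.z"),
    ("ner_vop_wip", "en", "x.y.z"),      # made-up work-in-progress name
    ("ner_vop_anatomy", "en", "x.y.z"),
    ("ner_jsl", "en", "x.y.z"),          # not a VOP model, filtered out
    ("ner_vop", "en", "x.y.z"),          # duplicate entry, removed by the set
]


def select_vop_models(entries, drop_wip=True):
    """Return sorted, de-duplicated model names that contain 'vop'."""
    names = sorted({entry[0] for entry in entries if "vop" in entry[0]})
    if drop_wip:
        # Same rule the notebook applies to MedicalNerModel names
        names = [name for name in names if "wip" not in name]
    return names


print(select_vop_models(catalogue))
```

Passing `drop_wip=False` keeps the work-in-progress names, which corresponds to the branch the notebook takes for the non-NER model types.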


## NER Models

The NER models from the list include different entity groups and levels of granularity.

In [7]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setSplitChars(["-", r"\/"])

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

## ner_vop_treatment
ner_vop_treatment = medical.NerModel.pretrained("ner_vop_treatment", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_vop_treatment")

ner_converter_vop_treatment = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_vop_treatment"]) \
    .setOutputCol("ner_chunk_vop_treatment")

## ner_vop
ner_vop = medical.NerModel.pretrained("ner_vop", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_vop")

ner_converter_vop = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_vop"]) \
    .setOutputCol("ner_chunk_vop")

## ner_vop_test
ner_vop_test = medical.NerModel.pretrained("ner_vop_test", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_vop_test")

ner_converter_vop_test = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_vop_test"]) \
    .setOutputCol("ner_chunk_vop_test")

ner_stages = [document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_vop_treatment,
    ner_converter_vop_treatment,
    ner_vop,
    ner_converter_vop,
    ner_vop_test,
    ner_converter_vop_test]

ner_pipeline = nlp.Pipeline(stages=ner_stages)

empty_data = spark.createDataFrame([[""]]).toDF("text")

ner_model = ner_pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_vop_treatment download started this may take some time.
[OK!]
ner_vop download started this may take some time.
[OK!]
ner_vop_test download started this may take some time.
[OK!]


In [8]:
ner_vop_labels = sorted(list(set([label.split('-')[-1] for label in ner_vop.getClasses() if label != 'O'])))

len(ner_vop_labels)

31
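
The label count comes from stripping the `B-`/`I-` prefixes from the model's classes and discarding the `O` tag. A minimal sketch of that step on a hand-written class list (the list is a hypothetical stand-in shaped like IOB-tagged classes, not the real output of `getClasses()`):

```python
# Hypothetical IOB-tagged classes, used only to illustrate the prefix stripping.
classes = ["O", "B-Drug", "I-Drug", "B-Dosage", "I-Dosage", "B-Test"]


def entity_labels(iob_classes):
    """Drop the 'O' tag and the B-/I- prefixes, then de-duplicate and sort."""
    # rpartition('-')[2] keeps the text after the last '-', like split('-')[-1]
    return sorted({c.rpartition("-")[2] for c in iob_classes if c != "O"})


print(entity_labels(classes))
```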

In [9]:
label_df = pd.DataFrame()
for column in range((len(ner_vop_labels)//10)+1):
  label_df = pd.concat([label_df, pd.DataFrame(ner_vop_labels, columns = [''])[column*10:(column+1)*10].reset_index(drop= True)], axis = 1)

label_df.fillna('')

|    |  |  |  |  |
|---:|:--|:--|:--|:--|
| 0 | AdmissionDischarge | Employment | PsychologicalCondition | VitalTest |
| 1 | Age | Form | RelationshipStatus | |
| 2 | Allergen | Frequency | Route | |
| 3 | BodyPart | Gender | Substance | |
| 4 | ClinicalDept | HealthStatus | SubstanceQuantity | |
| 5 | DateTime | InjuryOrPoisoning | Symptom | |
| 6 | Disease | Laterality | Test | |
| 7 | Dosage | MedicalDevice | TestResult | |
| 8 | Drug | Modifier | Treatment | |
| 9 | Duration | Procedure | Vaccine | |


In [10]:
ner_vop_treatment_labels = sorted(list(set([label.split('-')[-1] for label in ner_vop_treatment.getClasses() if label != 'O'])))

print(ner_vop_treatment_labels)

['Dosage', 'Drug', 'Duration', 'Form', 'Frequency', 'Procedure', 'Route', 'Treatment']


In [11]:
ner_vop_test_labels = sorted(list(set([label.split('-')[-1] for label in ner_vop_test.getClasses() if label != 'O'])))

print(ner_vop_test_labels)

['Measurements', 'Test', 'TestResult', 'VitalTest']


In [12]:
sample_text_1 = '''Hello, I am a 20-year-old woman who was diagnosed with hyperthyroidism around a month ago. For approximately four months, I've been experiencing symptoms such as feeling light-headed, battling poor digestion, dealing with anxiety attacks, depression, a sharp pain on my left side chest, an elevated heart rate, and a significant loss of weight. Due to these conditions, I was admitted to the hospital and just got discharged recently. During my hospital stay, a number of different tests were carried out by various physicians who initially struggled to pinpoint my actual medical condition. These tests included numerous blood tests, a brain MRI, an ultrasound scan, and an endoscopy. At long last, I was examined by a homeopathic doctor who finally diagnosed me with hyperthyroidism, indicating my TSH level was at a low 0.15 while my T3 and T4 levels were normal. Additionally, I was found to be deficient in vitamins B12 and D. Hence, I've been on a regimen of vitamin D supplements once a week and a daily dose of 1000 mcg of vitamin B12. I've been undergoing homeopathic treatment for the last 40 days and underwent a second test after a month which showed my TSH level increased to 0.5. While I'm noticing a slight improvement in my feelings of weakness and depression, over the last week, I've encountered two new challenges: difficulty breathing and a dramatically increased heart rate. I'm now at a crossroads where I am unsure if I should switch to allopathic treatment or continue with homeopathy. I understand that thyroid conditions take a while to improve, but I'm wondering if both treatments would require the same duration for recovery. Several of my acquaintances have recommended transitioning to allopathy and warn against taking risks, given the potential of developing severe complications. Please forgive any errors in my English and thank you for your understanding.'''

sample_text_2 = '''Following a visit to the nephrology department for a routine kidney function check-up, I underwent a urine test. The results revealed that I was suffering from chronic kidney disease, prompting the initiation of necessary medication for its control.'''

sample_text_3 = '''My grandmother was identified with high cholesterol and had to alter her daily habits. She also has to consume statins and eat a low-sodium diet to maintain her cholesterol levels. It's required a significant adaptation, but she's managing quite well.'''

In [13]:
data = spark.createDataFrame(pd.DataFrame([sample_text_1, sample_text_2, sample_text_3], columns = ['text']))

In [14]:
results = ner_model.transform(data).collect()
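
Besides the visualizer, the collected chunks can be flattened into plain tuples for inspection. The helper below is a sketch, not part of the library: `chunks_to_rows` is a hypothetical name, and it assumes each chunk annotation exposes `result`, `begin`, `end`, and a `metadata` mapping with an `entity` key. The demo objects, their labels, and their offsets are hand-made stand-ins, not model output.

```python
from types import SimpleNamespace


def chunks_to_rows(annotations):
    """Flatten chunk annotations into (chunk, entity, begin, end) tuples."""
    return [(a.result, a.metadata.get("entity"), a.begin, a.end) for a in annotations]


# Hand-made stand-ins carrying only the attributes the helper reads;
# the entity labels and offsets are invented for illustration.
demo_chunks = [
    SimpleNamespace(result="statins", begin=0, end=6, metadata={"entity": "Drug"}),
    SimpleNamespace(result="low-sodium diet", begin=20, end=34, metadata={"entity": "Treatment"}),
]

for row in chunks_to_rows(demo_chunks):
    print(row)
```

With a live session, the corresponding call would be along the lines of `chunks_to_rows(results[2]['ner_chunk_vop_treatment'])`, assuming the collected rows expose the attributes listed above.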

In [15]:
visualiser = nlp.viz.NerVisualizer()

In [16]:
from google.colab import widgets

t = widgets.TabBar(["ner_vop_treatment", "ner_vop_test", "ner_vop"])

with t.output_to(0):
    visualiser.display(results[2], label_col='ner_chunk_vop_treatment')

with t.output_to(1):
    visualiser.display(results[1], label_col='ner_chunk_vop_test')

with t.output_to(2):
    visualiser.display(results[0], label_col='ner_chunk_vop')

## Assertion Model

<div align="center">

|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
| 1        | [assertion_vop_clinical](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_en.html)     | Hypothetical_Or_Absent, Present_Or_Past, SomeoneElse |
| 2          | [assertion_vop_clinical_medium](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_medium_en.html)       | Hypothetical_Or_Absent, Present_Or_Past, SomeoneElse |
| 3          | [assertion_vop_clinical_large](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_large_en.html)       | Hypothetical_Or_Absent, Present_Or_Past, SomeoneElse |


</div>

The [assertion status model](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_en.html) predicts whether an NER chunk refers to a positive finding for the patient (`Present_Or_Past`), to a family member or another person (`SomeoneElse`), or to something that is mentioned but not present (`Hypothetical_Or_Absent`).

In [17]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = medical.NerModel.pretrained("ner_vop", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setBlackList(['DATETIME',  'GENDER', 'AGE', 'SUBSTANCEQUANTITY','FORM', 'ADMISSIONDISCHARGE', 'TESTRESULT', 'TEST',
                  'MEDICALDEVICE','CLINICALDEPT','DRUG', 'ROUTE', 'DURATION',"DOSAGE",'FREQUENCY', 'BODYPART',
                   ])

assertion = medical.AssertionDLModel.pretrained("assertion_vop_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner,
        ner_converter,
        assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

asr_pipe = pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_vop download started this may take some time.
[OK!]
assertion_vop_clinical download started this may take some time.
[OK!]


In [18]:
assertion.getClasses()


['Hypothetical_Or_Absent', 'Present_Or_Past', 'SomeoneElse']

In [19]:
sample_text = '''Hello, I am a 20-year-old woman who was diagnosed with hyperthyroidism around a month ago. For approximately four months, I've been experiencing symptoms such as feeling light-headed, battling poor digestion, dealing with anxiety attacks, depression, a sharp pain on my left side chest, an elevated heart rate, and a significant loss of weight. Due to these conditions, I was admitted to the hospital and just got discharged recently. During my hospital stay, a number of different tests were carried out by various physicians who initially struggled to pinpoint my actual medical condition. These tests included numerous blood tests, a brain MRI, an ultrasound scan, and an endoscopy. At long last, I was examined by a homeopathic doctor who finally diagnosed me with hyperthyroidism, indicating my TSH level was at a low 0.15 while my T3 and T4 levels were normal. Additionally, I was found to be deficient in vitamins B12 and D. Hence, I've been on a regimen of vitamin D supplements once a week and a daily dose of 1000 mcg of vitamin B12. I've been undergoing homeopathic treatment for the last 40 days and underwent a second test after a month which showed my TSH level increased to 0.5. While I'm noticing a slight improvement in my feelings of weakness and depression, over the last week, I've encountered two new challenges: difficulty breathing and a dramatically increased heart rate. I'm now at a crossroads where I am unsure if I should switch to allopathic treatment or continue with homeopathy. I understand that thyroid conditions take a while to improve, but I'm wondering if both treatments would require the same duration for recovery. Several of my acquaintances have recommended transitioning to allopathy and warn against taking risks, given the potential of developing severe complications. Please forgive any errors in my English and thank you for your understanding.'''

lp = nlp.LightPipeline(asr_pipe)

lr = lp.fullAnnotate([sample_text])[0]

In [20]:
vis = nlp.viz.AssertionVisualizer()

vis.display(lr, 'ner_chunk', 'assertion')
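
For a tabular view of the same result, each chunk can be paired with its assertion status. This is a sketch under one assumption: that the `assertion` output holds one annotation per chunk, in the same order as `ner_chunk`. The helper name `pair_chunks_with_assertions` is hypothetical, and the demo dictionary is a hand-made stand-in for `lr`, not model output.

```python
from types import SimpleNamespace


def pair_chunks_with_assertions(light_result, chunk_col="ner_chunk", assertion_col="assertion"):
    """Zip chunk and assertion annotations into one dict per chunk."""
    rows = []
    for chunk, status in zip(light_result[chunk_col], light_result[assertion_col]):
        rows.append({
            "chunk": chunk.result,
            "entity": chunk.metadata.get("entity"),
            "assertion": status.result,
        })
    return rows


# Hand-made stand-in for `lr`; labels are illustrative only.
demo_lr = {
    "ner_chunk": [
        SimpleNamespace(result="hyperthyroidism", metadata={"entity": "Disease"}),
        SimpleNamespace(result="severe complications", metadata={"entity": "Disease"}),
    ],
    "assertion": [
        SimpleNamespace(result="Present_Or_Past"),
        SimpleNamespace(result="Hypothetical_Or_Absent"),
    ],
}

for row in pair_chunks_with_assertions(demo_lr):
    print(row)
```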

## Classification Model

In [21]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = nlp.Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_vop_side_effect", "en", "clinical/models")\
    .setInputCols(["document",'token'])\
    .setOutputCol("prediction")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        sequenceClassifier
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

bert_sequence_classifier_vop_side_effect download started this may take some time.
[OK!]


In [22]:
sample_text = '''Hello, folks! Recently, my physician prescribed a medication named "SereniCalm" for my stress issues, but instead of soothing my nerves, it transformed me into a sluggish, apathetic shadow. I found myself roaming about as if I was running on severe sleep deprivation, devoid of any emotions or vitality. It was as though my mind was stuck in a perpetual state of standby. Certainly not the kind of stress relief I was expecting, right?'''

In [23]:
classification_data = spark.createDataFrame(pd.DataFrame([sample_text], columns = ['text']))

In [24]:
classification_results = model.transform(classification_data)

In [25]:
classification_results.select("text", "prediction.result").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|text                                                                                                                                                                                                                                                                                                                                                                                                                                               |result |
+-----------------------------------------------------------------------------------------------------------

## Pretrained NER Profiling Pipelines

We can use pretrained NER profiling pipelines for exploring all the available pretrained NER models at once.

- `ner_profiling_vop`: returns results for all VOP NER models.

For more examples, please check [this notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/07.1.Pretrained_NER_Profiling_Pipelines.ipynb).





<center><b>NER Profiling VOP Model List</b>

|  |  |  |
|--------------|-----------------|-----------------|
| ner_vop_clinical_dept | ner_vop_temporal | ner_vop_test |
| ner_vop | ner_vop_problem | ner_vop_problem_reduced |
| ner_vop_demographic | ner_vop_anatomy | ner_vop_treatment |




</center>

In [26]:
vop_profiling_pipeline = nlp.PretrainedPipeline("ner_profiling_vop", "en", "clinical/models")

ner_profiling_vop download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [27]:
text = """Hello, I am a 20-year-old woman who was diagnosed with hyperthyroidism around a month ago.For approximately four months, I've been experiencing symptoms such as feeling light-headed, battling poor digestion, dealing with anxiety attacks, depression, a sharp pain on my left side chest, an elevated heart rate, and a significant loss of weight. Due to these conditions, I was admitted to the hospital and just got discharged recently."""

In [28]:
vop_result = vop_profiling_pipeline.fullAnnotate(text)[0]
vop_result.keys()

dict_keys(['ner_chunk_vop_problem_reduced', 'ner_vop_clinical_dept', 'ner_vop_temporal', 'ner_chunk_vop_test', 'document', 'ner_vop_test', 'ner_vop', 'ner_vop_problem', 'ner_vop_problem_reduced', 'ner_vop_treatment', 'ner_chunk_vop_problem', 'ner_chunk_vop', 'ner_vop_demographic', 'ner_chunk_vop_anatomy', 'ner_chunk_vop_clinical_dept', 'ner_chunk_vop_treatment', 'token', 'ner_chunk_vop_temporal', 'embeddings', 'ner_vop_anatomy', 'ner_chunk_vop_demographic', 'sentence'])
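
The keys fall into three groups: token-level label columns (`ner_vop*`), chunk columns (`ner_chunk_vop*`), and shared pipeline outputs. A small sketch that separates them, using the key list printed above copied into a plain Python list:

```python
# Keys copied from the `vop_result.keys()` output above.
keys = [
    "ner_chunk_vop_problem_reduced", "ner_vop_clinical_dept", "ner_vop_temporal",
    "ner_chunk_vop_test", "document", "ner_vop_test", "ner_vop", "ner_vop_problem",
    "ner_vop_problem_reduced", "ner_vop_treatment", "ner_chunk_vop_problem",
    "ner_chunk_vop", "ner_vop_demographic", "ner_chunk_vop_anatomy",
    "ner_chunk_vop_clinical_dept", "ner_chunk_vop_treatment", "token",
    "ner_chunk_vop_temporal", "embeddings", "ner_vop_anatomy",
    "ner_chunk_vop_demographic", "sentence",
]

# One label per token, in IOB format
token_level = sorted(k for k in keys if k.startswith("ner_vop"))
# One entry per merged entity chunk
chunk_level = sorted(k for k in keys if k.startswith("ner_chunk_"))
# Shared pipeline outputs (document, sentence, token, embeddings)
shared = sorted(set(keys) - set(token_level) - set(chunk_level))

print(len(token_level), len(chunk_level), shared)
```

Every chunk column has a token-level counterpart, so the nine models in the list above each contribute two keys.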

In [29]:
def get_token_results(light_result):
    """Build a token-level table with one label column per VOP NER model."""

    tokens = [j.result for j in light_result["token"]]
    sentences = [j.metadata["sentence"] for j in light_result["token"]]
    begins = [j.begin for j in light_result["token"]]
    ends = [j.end for j in light_result["token"]]

    # Skip the chunk outputs ("ner_chunk_*"): they hold one entry per entity, not one per token
    model_list = [a for a in light_result.keys() if (a not in ["sentence", "token"] and "_chunk" not in a)]

    df = pd.DataFrame({'sentence': sentences, 'begin': begins, 'end': ends, 'token': tokens})

    for model_name in model_list:

        # One row per annotation; keep only the predicted label
        temp_df = pd.DataFrame(light_result[model_name])
        temp_df["jsl_label"] = temp_df.iloc[:, 0].apply(lambda x: x.result)
        temp_df = temp_df[["jsl_label"]]

        temp_df.columns = [model_name]
        df = pd.concat([df, temp_df], axis=1)

    # Filter columns to include only sentence, begin, end, token and all columns that start with 'ner_vop'
    filtered_df = df.loc[:, ['sentence', 'begin', 'end', 'token'] + [col for col in df.columns if col.startswith('ner_vop')]]

    return filtered_df

In [30]:
get_token_results(vop_result)

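|    | sentence | begin | end | token | ner_vop_clinical_dept | ner_vop_temporal | ner_vop_test | ner_vop | ner_vop_problem | ner_vop_problem_reduced | ner_vop_treatment | ner_vop_demographic | ner_vop_anatomy |
|---:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 0 | 0 | 4 | Hello | O | O | O | O | O | O | O | O | O |
| 1 | 0 | 5 | 5 | , | O | O | O | O | O | O | O | O | O |
| 2 | 0 | 7 | 7 | I | O | O | O | O | O | O | O | O | O |
| 3 | 0 | 9 | 10 | am | O | O | O | O | O | O | O | O | O |
| 4 | 0 | 12 | 12 | a | O | O | O | O | O | O | O | O | O |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | 2 | 404 | 407 | just | O | O | O | O | O | O | O | O | O |
| 76 | 2 | 409 | 411 | got | O | O | O | O | O | O | O | O | O |
| 77 | 2 | 413 | 422 | discharged | B-AdmissionDischarge | O | O | B-AdmissionDischarge | O | O | O | O | O |
| 78 | 2 | 424 | 431 | recently | O | B-DateTime | O | B-DateTime | O | O | O | O | O |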