![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_HPO.ipynb)

# `sbiobertresolve_loinc_augmented` **Models**

This model maps clinical entities extracted by an NER model to LOINC codes using `sbiobert_base_cased_mli` Sentence BERT embeddings. It was trained on an augmented version of the dataset used by the previous LOINC resolver models.

## 1. Colab Setup

**Import license keys**

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
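
Before installing, it can help to confirm that the uploaded file contains the entries the later cells rely on (`SECRET`, `JSL_VERSION`, `PUBLIC_VERSION`). The following is a minimal sketch; `missing_license_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration and are not part of Spark NLP.

```python
# Hypothetical helper, not part of Spark NLP: report which expected
# entries are absent from (or empty in) the uploaded license dictionary.
REQUIRED_KEYS = ("SECRET", "JSL_VERSION", "PUBLIC_VERSION")


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required names that are missing or empty in `keys`."""
    return [name for name in required if not keys.get(name)]


# Example with placeholder values (not real credentials)
example_keys = {"SECRET": "placeholder", "JSL_VERSION": "placeholder"}
print(missing_license_keys(example_keys))
```

In the notebook, calling `missing_license_keys(license_keys)` right after loading the JSON and stopping on a non-empty result would surface a bad license file before the `pip install` commands fail.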

**Install dependencies**

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

## 2. Start Spark Session

**Import dependencies into Python and start the Spark session**

In [3]:
# Import sparknlp & sparknlp_jsl packages
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

# Import Pyspark packages
from pyspark.sql import SparkSession
from pyspark.sql import functions as F 
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd
import numpy as np 

spark = sparknlp_jsl.start(license_keys['SECRET'])

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.4
Spark NLP_JSL Version : 3.5.2


## 3. Select the model and construct the pipeline

In [20]:
# Alternative usage
# NER_MODEL_NAME = "clinical_ner"  # setWhiteList(["Test"])

NER_MODEL_NAME = "ner_jsl"  
WhiteList = ["Test","BMI","HDL","LDL","Medical_Device","Temperature","Total_Cholesterol","Triglycerides","Blood_Pressure", "ImagingFindings"]

RESOLVER_MODEL_NAME = "sbiobertresolve_loinc_augmented"
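
The whitelist restricts which NER labels are passed on to the resolver: chunks whose label is not listed are dropped. The plain-Python sketch below only illustrates that effect. `keep_whitelisted` is a hypothetical helper, the sample chunks are made up, and the exact matching rules of `setWhiteList` (for example, case handling) are not reproduced here.

```python
# Illustrative only: mimics the effect of a label whitelist on
# (chunk text, entity label) pairs. This is not the Spark NLP implementation.
def keep_whitelisted(chunks, whitelist):
    """Keep only the chunks whose entity label appears in `whitelist`."""
    allowed = set(whitelist)
    return [(text, label) for text, label in chunks if label in allowed]


# Made-up sample chunks for demonstration
sample_chunks = [
    ("pH", "Test"),
    ("pneumonia", "Disease_Syndrome_Disorder"),
    ("blood pressure", "Blood_Pressure"),
]

print(keep_whitelisted(sample_chunks, ["Test", "Blood_Pressure"]))
```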

**Create the pipeline**

In [21]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentence')

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained(NER_MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(WhiteList)

c2doc = Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

resolver = SentenceEntityResolverModel.pretrained(RESOLVER_MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["ner_chunk_doc", "sbert_embeddings"]) \
    .setOutputCol("resolution")


nlp_pipeline = Pipeline(
    stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        resolver
  ])

empty_df = spark.createDataFrame([[""]]).toDF('text')

model = nlp_pipeline.fit(empty_df)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_loinc_augmented download started this may take some time.
[OK!]
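
Each stage reads columns produced by earlier stages, so a misspelled column name only shows up once the pipeline is fitted. As a rough aid, the wiring can be checked in plain Python before fitting. In this sketch the `STAGES` table is transcribed by hand from the pipeline above, and `find_unwired_inputs` is a hypothetical helper, not a Spark NLP function.

```python
# Column wiring transcribed from the pipeline definition:
# (stage name, input columns, output column)
STAGES = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings", ["sentence", "token"], "embeddings"),
    ("ner", ["sentence", "token", "embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
    ("c2doc", ["ner_chunk"], "ner_chunk_doc"),
    ("sbert_embedder", ["ner_chunk_doc"], "sbert_embeddings"),
    ("resolver", ["ner_chunk_doc", "sbert_embeddings"], "resolution"),
]


def find_unwired_inputs(stages, initial_columns=("text",)):
    """Return (stage, column) pairs whose input is not produced upstream."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        available.add(output)
    return problems


print(find_unwired_inputs(STAGES))  # an empty list means every input is wired
```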


## 4. Create example inputs

In [22]:
sample_text = [
        
"""DATABASE: Bilateral lower lobe pneumonia, greater on the right. Arterial blood gases on 2 L of oxygen, pH 7.48, pO2 79, and pCO2 35. BLOOD STUDIES: Hematocrit is 43, WBC 59,300 with a left shift, and platelet count 394,000. Sodium is 130, potassium 3.8, chloride 97, bicarbonate 24, BUN 14, creatinine 0.8, random blood sugar 147, and calcium 9.4.""",

"""A 66 years old male  was admited to Gastroenterology Clinic by elective hospital admission. His symptoms that lasted for the last 2 months: Asthenia, fatigue, increase of abdominal circumference, RUQ disconfort and jaundice. His lab results are as following: Hemoglobin – 12,1 g/dl MCV – 125 fL AST (GOT) – 102 U/L ALT (GPT) – 91 U/L GGT – 143 U/L BiT (total Bilirubin) – 1,92 g/dl Albumin – 3,9 g/dl   Triglycerides - 205 mg/dl Total cholesterol - 189 mg/dl.""",

"""The patient is a 35 years old, male, admitted on 04/09/2019. He accidentally discovered to be infected with hepatitis B virus on 26/08/ 2019. He has no symptoms at the moment of examination. Past medical history: laparoscopic cholecystectomy due to gallstones. Personal history: active smoker, 1 pack/day, moderate alcohol intake.
Serological markers on 26.08.2020
•            HBV DNA = 2700 Ul/mL (> 2000 UI/ml)
•            HBsAg = Positive
•            HB envelope Ag = Negative
•            HB envelope antibody = Positive

Other lab results:
•         Haptoglobin 1.68 g/L
•         GGT 184 U/L
•         Blood glucose 96.5 mg/dL""",

"""The 62 years old woman was admitted to Internal Medicine Department following chest pain, dyspnoea and diarrhea in the last three weeks. At admission, her blood pressure was 156/88 mmHg, with a pulse of 89 bpm. The temperature was 36.3 C. The lab results showed a Hemoglobin of 9.2 g/dL, a total cholesterol level of 239 mg/dL, Serum albumin of 5.4 g/dL. An ECG was performed and it showed normal sinus rhytm.""",

"""Final diagnosis:  FLT3 gene mutation analysis: DNA was extracted from the peripheral blood specimen and a polymerase chain reaction (PCR)-based assay performed that is designed to detect the presence of two separate mutations in the FLT3 gene: (1) internal tandem duplication within a susceptible region that includes coding sequence for the intracellular juxtamembrane domain and (2) point mutations in the codon for ASP835.   Negative. Neither expansion of the region susceptible to internal tandem duplication nor changes consistent with mutation of the codon for ASP835 were identified."""

]

In [23]:
from pyspark.sql.types import StringType

df = spark.createDataFrame(sample_text,StringType()).toDF('text')

df.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|DATABASE: Bilateral lower lobe pneumonia, greater on the right. Arterial blood gases on 2 L of ox...|
|A 66 years old male  was admited to Gastroenterology Clinic by elective hospital admission. His s...|
|The patient is a 35 years old, male, admitted on 04/09/2019. He accidentally discovered to be inf...|
|The 62 years old woman was admitted to Internal Medicine Department following chest pain, dyspnoe...|
|Final diagnosis:  FLT3 gene mutation analysis: DNA was extracted from the peripheral blood specim...|
+----------------------------------------------------------------------------------------------------+



## 5. Use the pipeline to create outputs

In [24]:
result = model.transform(df)

result.select(F.explode(F.arrays_zip("ner_chunk.result", 
                                      "ner_chunk.begin", 
                                      "ner_chunk.end",
                                      "ner_chunk.metadata",
                                      "resolution.result",
                                      "resolution.metadata",)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("entity"),
                          F.expr("cols['4']").alias("LOINC_code"),
                          F.expr("cols['5']['resolved_text']").alias("description"),
                          F.expr("cols['5']['all_k_results']").alias("all_codes"),
                          F.expr("cols['5']['all_k_resolutions']").alias("resolutions"),
                         ).show(truncate=40)

+--------------------+-----+---+------+-----------+--------------------+----------------------------------------+----------------------------------------+
|               chunk|begin|end|entity| LOINC_code|         description|                               all_codes|                             resolutions|
+--------------------+-----+---+------+-----------+--------------------+----------------------------------------+----------------------------------------+
|Arterial blood gases|   64| 83|  Test|   LA9488-3|arterial blood gases|LA9488-3:::24336-0:::2708-6:::60835-6...|arterial blood gases::: blood gases::...|
|              oxygen|   95|100|  Test|  LA25411-2|              oxygen|LA25411-2:::LP30884-8:::LP100008-4:::...|oxygen:::oxygen content:::oxygen capa...|
|                  pH|  103|104|  Test|  LA22405-7|                  ph|LA22405-7:::49014-4:::44261-6:::LP949...|ph::: ph::: phq:::phq::: pd::: pl:::p...|
|                 pO2|  112|114|  Test|    11556-8|                 po
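
The `all_codes` and `resolutions` columns hold the candidate list as a single `:::`-separated string. The sketch below splits both strings and pairs them by position, which is an assumption based on how the table lines up rather than a documented guarantee. It also picks the first candidate that looks like a plain numeric LOINC code, on the further assumption that `LA`- and `LP`-prefixed identifiers are answer and part codes rather than test codes. `pair_candidates` and `first_term_code` are hypothetical helpers.

```python
import re

SEPARATOR = ":::"
# Assumption: a LOINC term code is digits, a hyphen, then one check digit.
TERM_CODE = re.compile(r"^\d+-\d$")


def pair_candidates(all_codes, all_resolutions):
    """Split both ':::'-separated strings and pair entries by position."""
    codes = [code.strip() for code in all_codes.split(SEPARATOR)]
    names = [name.strip() for name in all_resolutions.split(SEPARATOR)]
    return list(zip(codes, names))


def first_term_code(pairs):
    """Return the first (code, name) pair with a numeric-style code, or None."""
    for code, name in pairs:
        if TERM_CODE.match(code):
            return code, name
    return None


# First two candidates of the "Arterial blood gases" row shown above
pairs = pair_candidates("LA9488-3:::24336-0", "arterial blood gases::: blood gases")
print(pairs)
print(first_term_code(pairs))
```

Whether the top-ranked `LA` code or the first numeric code is the better answer depends on the use case; the filter is only one possible post-processing choice.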

## 6. Visualize results

In [25]:
from sparknlp_display import EntityResolverVisualizer

resolver_viz = EntityResolverVisualizer()


# Collect the results once instead of re-running the pipeline for every row
rows = result.collect()

for row in rows:
    resolver_viz.display(result = row, label_col = "ner_chunk", resolution_col="resolution")
    print("\n\n")
