![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_LOINC.ipynb)

# `sbiobertresolve_loinc_augmented` **Models**

This model maps extracted clinical NER entities to LOINC codes using `sbiobert_base_cased_mli` Sentence BERT embeddings. It was trained on an augmented version of the dataset used for the previous LOINC resolver models.
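
To make the idea of embedding-based resolution concrete, here is a minimal toy sketch of picking the closest code for a chunk vector. It is not the model's implementation: the three-dimensional vectors are invented for illustration, the helper names `cosine` and `resolve` are hypothetical, and cosine similarity is an assumed stand-in for whatever distance the resolver uses internally. The codes are taken from the output shown later in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def resolve(chunk_vec, index, k=2):
    """Return the k (code, score) pairs most similar to chunk_vec."""
    scored = [(code, cosine(chunk_vec, vec)) for code, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy index: the vectors are made up, not real sentence embeddings.
toy_index = {
    "11556-8": [0.9, 0.1, 0.0],    # po2
    "11557-6": [0.7, 0.6, 0.1],    # pco2
    "LA22405-7": [0.0, 0.2, 0.9],  # ph
}

query = [0.85, 0.15, 0.05]  # pretend embedding of the chunk "pO2"
top = resolve(query, toy_index, k=2)
print([code for code, _ in top])
```

In the real pipeline the vectors come from `sbiobert_base_cased_mli` and the candidate list is returned in the resolver's metadata rather than computed by hand.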

## 1. Colab Setup

**Import license keys**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
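
The cells in this section read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from the license file. A small check such as the sketch below could flag a missing entry before the install commands run. The helper name `missing_license_keys` is hypothetical, and the list of required keys is inferred only from the names used in this notebook, so a given license file may contain more.

```python
# Hypothetical helper: not part of Spark NLP or the John Snow Labs tooling.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_keys.get(key)]

# Placeholder dict for illustration; in the notebook, pass the loaded license_keys.
example = {"SECRET": "xxx", "PUBLIC_VERSION": "4.2.8"}
print(missing_license_keys(example))
```

An empty list means every key used by the following cells is present.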

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


# 🔎 MODELS

<div align="center">

| **Index** | **LOINC Models**        |
|---------------|----------------------|
| 1          | [sbiobertresolve_loinc](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_loinc_en.html)       |
| 2        | [sbiobertresolve_loinc_numeric](https://nlp.johnsnowlabs.com/2023/08/01/sbiobertresolve_loinc_numeric_en.html)     |
| 3          | [sbiobertresolve_loinc_augmented](https://nlp.johnsnowlabs.com/2023/08/01/sbiobertresolve_loinc_augmented_en.html)       |



</div>

## 2. Select the model and construct the pipeline

In [4]:
# Alternative usage
# NER_MODEL_NAME = "clinical_ner"  # setWhiteList(["Test"])

NER_MODEL_NAME = "ner_jsl"  
WhiteList = ["Test","BMI","HDL","LDL","Medical_Device","Temperature","Total_Cholesterol","Triglycerides","Blood_Pressure", "ImagingFindings"]

RESOLVER_MODEL_NAME = "sbiobertresolve_loinc_augmented"

**Create the pipeline**

In [5]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentence')

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained(NER_MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(WhiteList)

c2doc = Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

resolver = SentenceEntityResolverModel.pretrained(RESOLVER_MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("resolution")


nlp_pipeline = Pipeline(
    stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        resolver
  ])


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_loinc_augmented download started this may take some time.
[OK!]


## 3. Create example inputs

In [6]:
sample_text = [

"""DATABASE: Bilateral lower lobe pneumonia, greater on the right. Arterial blood gases on 2 L of oxygen, pH 7.48, pO2 79, and pCO2 35. BLOOD STUDIES: Hematocrit is 43, WBC 59,300 with a left shift, and platelet count 394,000. Sodium is 130, potassium 3.8, chloride 97, bicarbonate 24, BUN 14, creatinine 0.8, random blood sugar 147, and calcium 9.4.""",

"""A 66 years old male  was admited to Gastroenterology Clinic by elective hospital admission. His symptoms that lasted for the last 2 months: Asthenia, fatigue, increase of abdominal circumference, RUQ disconfort and jaundice. His lab results are as following: Hemoglobin – 12,1 g/dl MCV – 125 fL AST (GOT) – 102 U/L ALT (GPT) – 91 U/L GGT – 143 U/L BiT (total Bilirubin) – 1,92 g/dl Albumin – 3,9 g/dl   Triglycerides - 205 mg/dl Total cholesterol - 189 mg/dl.""",

"""The patient is a 35 years old, male, admitted on 04/09/2019. He accidentally discovered to be infected with hepatitis B virus on 26/08/ 2019. He has no symptoms at the moment of examination. Past medical history: laparoscopic cholecystectomy due to gallstones. Personal history: active smoker, 1 pack/day, moderate alcohol intake.
Serological markers on 26.08.2020
•            HBV DNA = 2700 Ul/mL (> 2000 UI/ml)
•            HBsAg = Positive
•            HB envelope Ag = Negative
•            HB envelope antibody = Positive

Other lab results:
•         Haptoglobin 1.68 g/L
•         GGT 184 U/L
•         Blood glucose 96.5 mg/dL""",

"""The 62 years old woman was admitted to Internal Medicine Department following chest pain, dyspnoea and diarrhea in the last three weeks. At admission, her blood pressure was 156/88 mmHg, with a pulse of 89 bpm. The temperature was 36.3 C. The lab results showed a Hemoglobin of 9.2 g/dL, a total cholesterol level of 239 mg/dL, Serum albumin of 5.4 g/dL. An ECG was performed and it showed normal sinus rhytm.""",

"""Final diagnosis:  FLT3 gene mutation analysis: DNA was extracted from the peripheral blood specimen and a polymerase chain reaction (PCR)-based assay performed that is designed to detect the presence of two separate mutations in the FLT3 gene: (1) internal tandem duplication within a susceptible region that includes coding sequence for the intracellular juxtamembrane domain and (2) point mutations in the codon for ASP835.   Negative. Neither expansion of the region susceptible to internal tandem duplication nor changes consistent with mutation of the codon for ASP835 were identified."""

]

In [7]:
from pyspark.sql.types import StringType, IntegerType

df = spark.createDataFrame(sample_text, StringType()).toDF('text')

df.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|DATABASE: Bilateral lower lobe pneumonia, greater on the right. Arterial blood gases on 2 L of ox...|
|A 66 years old male  was admited to Gastroenterology Clinic by elective hospital admission. His s...|
|The patient is a 35 years old, male, admitted on 04/09/2019. He accidentally discovered to be inf...|
|The 62 years old woman was admitted to Internal Medicine Department following chest pain, dyspnoe...|
|Final diagnosis:  FLT3 gene mutation analysis: DNA was extracted from the peripheral blood specim...|
+----------------------------------------------------------------------------------------------------+



## 4. Use the pipeline to create outputs

In [8]:
result = nlp_pipeline.fit(df).transform(df)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata,
                                     result.resolution.result,
                                     result.resolution.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("entity"),
              F.expr("cols['4']").alias("LOINC_code"),
              F.expr("cols['5']['resolved_text']").alias("description"),
              F.expr("cols['5']['all_k_results']").alias("all_codes"),
              F.expr("cols['5']['all_k_resolutions']").alias("resolutions")).show(truncate=40)

+--------------------+-----+---+------+----------+--------------------+----------------------------------------+----------------------------------------+
|               chunk|begin|end|entity|LOINC_code|         description|                               all_codes|                             resolutions|
+--------------------+-----+---+------+----------+--------------------+----------------------------------------+----------------------------------------+
|Arterial blood gases|   64| 83|  Test|  LA9488-3|arterial blood gases|LA9488-3:::24336-0:::2708-6:::60835-6...|arterial blood gases::: blood gases::...|
|                  pH|  103|104|  Test| LA22405-7|                  ph|LA22405-7:::49014-4:::44261-6:::LP949...|ph::: ph::: phq:::phq::: pd::: pl:::p...|
|                 pO2|  112|114|  Test|   11556-8|                 po2|11556-8:::75390-5:::53725-8:::11557-6...| po2::: pn2::: peo2::: pco2::: ap2:::...|
|                pCO2|  124|127|  Test|   11557-6|                pco2|11557
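
The `all_codes` and `resolutions` columns above hold the candidate list as single strings separated by `:::`. The sketch below shows one way to turn them into (code, description) pairs in plain Python. The helper name `pair_candidates` is hypothetical, and it assumes the two strings are parallel, so that the i-th code belongs to the i-th description. The sample strings use only the first two candidates visible in the first output row.

```python
def pair_candidates(all_codes, all_resolutions, sep=":::"):
    """Split two ':::'-delimited strings and pair them position by position."""
    codes = [code.strip() for code in all_codes.split(sep)]
    texts = [text.strip() for text in all_resolutions.split(sep)]
    # zip stops at the shorter list, so a trailing unmatched item is dropped
    return list(zip(codes, texts))

# First two candidates from the "Arterial blood gases" row shown above
codes = "LA9488-3:::24336-0"
texts = "arterial blood gases::: blood gases"

for code, text in pair_candidates(codes, texts):
    print(code, "->", text)
```

The same function could be applied per row after converting the selected columns to pandas.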

## 5. Visualize results

In [9]:
from sparknlp_display import EntityResolverVisualizer

resolver_viz = EntityResolverVisualizer()


# Collect once: calling collect() inside the loop would recompute the pipeline output on every iteration
for row in result.collect():
    resolver_viz.display(result = row, label_col = "ner_chunk", resolution_col="resolution")
    print("\n\n")
