![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_CLINICAL_ABBREVIATION_ACRONYM.ipynb)

# `sbiobertresolve_clinical_abbreviation_acronym` **Models**

This model maps clinical abbreviations and acronyms to their meanings using `sbiobert_base_cased_mli` Sentence BERT embeddings. It is an improved version of the base model and includes more varied data.

## 1. Colab Setup

**Import license keys**

In [None]:
import json, os
from google.colab import files

# Upload the license file only if it is not already in the working directory
if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


## 2. Select the model and construct the pipeline

In [4]:
MODEL_NAME = "sbiobertresolve_clinical_abbreviation_acronym"

**Create the pipeline**

In [5]:
document_assembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")

ner_converter = NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['ABBR'])

sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
      .setInputCols(["document", "ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setChunkWeight(0.5)\
      .setCaseSensitive(True)

abbr_resolver = SentenceEntityResolverModel.pretrained(MODEL_NAME, "en", "clinical/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("abbr_meaning")\
      .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        sentence_chunk_embeddings,
        abbr_resolver
  ])


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_abbreviation_clinical download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
[OK!]
sbiobertresolve_clinical_abbreviation_acronym download started this may take some time.
[OK!]
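
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`, so candidate meanings are compared to the chunk embedding by Euclidean distance. The following is a minimal sketch of that idea in plain Python, not the library's implementation: the 3-dimensional vectors are invented for illustration (real `sbiobert_base_cased_mli` embeddings are much larger), and the helper `euclidean` is a hypothetical name.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors, invented for illustration only (not real model embeddings)
candidates = {
    "interventional radiology": [0.9, 0.1, 0.0],
    "immediate-release": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of the chunk "IR"

# Rank candidate meanings from nearest to farthest
ranked = sorted(candidates, key=lambda label: euclidean(query, candidates[label]))
print(ranked[0])
```

With these toy values the nearest candidate is the first entry in `ranked`; in the real pipeline the ranking depends entirely on the pretrained embeddings.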


## 3. Create example inputs

In [6]:
sample_text = [
"""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""",

"""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. 
    Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""
 ]

In [7]:
from pyspark.sql.types import StringType, IntegerType

df = spark.createDataFrame(sample_text, StringType()).toDF('text')
df.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSE...|
|Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA...|
+----------------------------------------------------------------------------------------------------+



## 4. Use the pipeline to create outputs

In [8]:
result = resolver_pipeline.fit(df).transform(df)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata, 
                                     result.abbr_meaning.result, 
                                     result.abbr_meaning.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("entity"),
              F.expr("cols['4']").alias("abbr_meaning"),
              F.expr("cols['5']['all_k_resolutions']").alias("all_k_resolutions")).show(truncate = 50)

+-----+-----+---+------+------------------------------------+--------------------------------------------------+
|chunk|begin|end|entity|                        abbr_meaning|                                 all_k_resolutions|
+-----+-----+---+------+------------------------------------+--------------------------------------------------+
|   IR|   30| 31|  ABBR|            interventional radiology|interventional radiology:::immediate-release:::...|
|  CBC|  126|128|  ABBR|                Complete Blood Count|Complete Blood Count:::Complete blood count:::b...|
|   AB|  164|165|  ABBR|           blood group in ABO system|              blood group in ABO system:::abortion|
| VDRL|  194|197|  ABBR|Venereal disease research laboratory|Venereal disease research laboratory:::venous b...|
|  HIV|  252|254|  ABBR|        human immunodeficiency virus|human immunodeficiency virus:::blood group in A...|
+-----+-----+---+------+------------------------------------+--------------------------------------------------+
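
Two details of the table above are easy to miss: the `end` offset is inclusive, and `all_k_resolutions` is a single string whose candidates are separated by `:::`. The sketch below checks both in plain Python using values copied from the output above (the `IR` row for the offsets, the `AB` row for the candidates, since it is the only one not truncated by `show`). The variable names are illustrative only.

```python
# First sentence of the first sample text, with offsets taken from the "IR" row
text = "The patient admitted from the IR for aggressive irrigation of the Miami pouch."
begin, end = 30, 31

# `end` is inclusive, so add 1 when slicing
chunk = text[begin:end + 1]
print(chunk)

# `all_k_resolutions` value from the "AB" row
all_k_resolutions = "blood group in ABO system:::abortion"

# Split on the ":::" separator to get the ranked candidate meanings
candidates = all_k_resolutions.split(":::")
print(candidates)
```

The first candidate in the list matches the value reported in the `abbr_meaning` column for the same row.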

## 5. Visualize results

In [9]:
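from sparknlp_display import AssertionVisualizer

# AssertionVisualizer is reused here to show each abbreviation next to its resolved meaning
assertion_vis = AssertionVisualizer()

# Collect the results once rather than once per iteration
rows = result.collect()

for row in rows:
    assertion_vis.display(result = row, label_col = "ner_chunk", assertion_col = "abbr_meaning")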