![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_SNOMED.ipynb)

## **Resolve Clinical Health Information using the SNOMED taxonomy**

To run this notebook yourself, you need to upload your license keys. Run the cell below to upload them, or open the file explorer on the left side of the screen and upload your license file to the folder that opens (the setup cell looks for a file named `spark_jsl.json`). Otherwise, you can simply read the example outputs shown in the notebook.

## **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
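
The following cells read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from this file, so it can help to fail early when one of them is missing. A minimal sketch, assuming those three are the only keys this notebook needs; `missing_license_keys` is a hypothetical helper, not part of Spark NLP:

```python
# Keys referenced later in this notebook (assumed to be the full set needed).
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in order."""
    return [key for key in required if not license_keys.get(key)]

# Placeholder values for illustration only, not real credentials.
example_keys = {"SECRET": "placeholder-secret", "PUBLIC_VERSION": "4.2.8"}
print(missing_license_keys(example_keys))
```

Calling the helper on the loaded `license_keys` dictionary right after `json.load` would surface a truncated or outdated license file before any `pip install` step runs.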

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


# **🔎 About the models**

📌 **sbiobertresolve_snomed_drug** --> *This model maps detected drug entities to SNOMED codes using sbiobert_base_cased_mli Sentence Bert Embeddings.*



📌 **sbiobertresolve_clinical_snomed_procedures_measurements** --> *This model maps medical entities to SNOMED codes using sent_biobert_clinical_base_cased Sentence Bert Embeddings. The corpus of this model includes the Procedures and Measurement domains.*


### **📌Helper Function:**

In [4]:
# Returns the entity resolution results as a pandas DataFrame

def get_codes_from_df(result_df, chunk, output_col, hcc=False):
    
    
    if hcc:
        
        df = result_df.select(F.explode(F.arrays_zip(result_df[chunk].result, 
                                                     result_df[chunk].metadata, 
                                                     result_df[output_col].result, 
                                                     result_df[output_col].metadata)).alias("cols")) \
                      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
                              F.expr("cols['0']").alias("ner_chunk"),
                              F.expr("cols['1']['entity']").alias("entity"), 
                              F.expr("cols['2']").alias("icd10_code"),
                              F.expr("cols['3']['all_k_results']").alias("all_codes"),
                              F.expr("cols['3']['all_k_resolutions']").alias("resolutions"),
                              F.expr("cols['3']['all_k_aux_labels']").alias("hcc_list")).toPandas()


        codes = []
        resolutions = []
        hcc_all = []

        # Use a distinct loop name so the `hcc` flag argument is not shadowed
        for code, resolution, hcc_labels in zip(df['all_codes'], df['resolutions'], df['hcc_list']):

            codes.append(code.split(':::'))
            resolutions.append(resolution.split(':::'))
            hcc_all.append(hcc_labels.split(":::"))

        df['all_codes'] = codes  
        df['resolutions'] = resolutions
        df['hcc_list'] = hcc_all
        
    else:
                       
        df = result_df.select(F.explode(F.arrays_zip(result_df[chunk].result, 
                                                           result_df[chunk].metadata, 
                                                           result_df[output_col].result, 
                                                           result_df[output_col].metadata)).alias("cols")) \
                                     .select(F.expr("cols['1']['sentence']").alias("sent_id"),
                                             F.expr("cols['0']").alias("ner_chunk"),
                                             F.expr("cols['1']['entity']").alias("entity"), 
                                             F.expr("cols['2']").alias(f"{output_col}"),
                                             F.expr("cols['3']['all_k_results']").alias("all_codes"),
                                             F.expr("cols['3']['all_k_resolutions']").alias("resolutions")).toPandas()



        codes = []
        resolutions = []

        for code, resolution in zip(df['all_codes'], df['resolutions']):

            codes.append(code.split(':::'))
            resolutions.append(resolution.split(':::'))

        df['all_codes'] = codes  
        df['resolutions'] = resolutions
        
    
    return df
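
The loops at the end of `get_codes_from_df` turn the resolver's `:::`-delimited metadata strings into Python lists. A stand-alone sketch of that step, using a shortened copy of the `Fragmin` candidates from the drug output in this notebook; `split_candidates` is a hypothetical name introduced only for this illustration:

```python
def split_candidates(joined, sep=":::"):
    """Split one metadata string into its individual candidates."""
    return joined.split(sep)

# Shortened metadata values, in the shape the helper reads them.
all_k_results = "9487801000001106:::130752006:::9486501000001106"
all_k_resolutions = "Fragmin:::Fragilysin (substance):::Faverin"

codes = split_candidates(all_k_results)
resolutions = split_candidates(all_k_resolutions)

# Codes and resolutions are positionally aligned, so they can be paired.
for code, resolution in zip(codes, resolutions):
    print(code, resolution)
```

Keeping the codes as strings, as the helper does, avoids any reformatting of the longer identifiers.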

# **🔎 "sbiobertresolve_snomed_drug" model**

### **🔎Define Spark NLP pipeline**

In [5]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained('ner_posology_large', "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['DRUG'])

c2doc = Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedding = BertSentenceEmbeddings\
      .pretrained('sbiobert_base_cased_mli','en', "clinical/models")\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained('sbiobertresolve_snomed_drug', "en", "clinical/models") \
        .setInputCols(["sbert_embeddings"]) \
        .setOutputCol("snomed_code")\
        .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
      stages = [
          documentAssembler,
          sentenceDetector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          c2doc,
          sbert_embedding,
          snomed_resolver])

data_ner = spark.createDataFrame([[""]]).toDF("text")
models = resolver_pipeline.fit(data_ner)
light_model = LightPipeline(models)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_snomed_drug download started this may take some time.
[OK!]
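
The pipeline above sets `setDistanceFunction("EUCLIDEAN")` on the resolver. As a rough illustration of what ranking candidates by Euclidean distance means, here is a toy sketch in plain Python. The three-dimensional vectors are made up, `rank_by_euclidean` is a hypothetical helper, and the real model compares much larger sentence embeddings with its own internal search, so this is not the library's implementation:

```python
import math

def rank_by_euclidean(query, candidates):
    """Return candidate names sorted from nearest to farthest from `query`."""
    return sorted(candidates, key=lambda name: math.dist(query, candidates[name]))

# Made-up 3-d vectors standing in for sentence embeddings.
concept_vectors = {
    "aspirin": [0.9, 0.1, 0.0],
    "insulin": [0.0, 0.8, 0.3],
    "folic acid": [0.2, 0.1, 0.9],
}
chunk_vector = [0.8, 0.2, 0.1]

print(rank_by_euclidean(chunk_vector, concept_vectors))
```

The nearest candidate plays the role of the resolved code, and the remaining ones correspond to the alternative candidates the resolver also reports.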


In [6]:
sample_text = """She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, aspirin 81 mg daily, magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin."""


clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")
snomed_sdf = models.transform(clinical_note_df)

In [7]:
res_pd = get_codes_from_df(snomed_sdf, 'ner_chunk', 'snomed_code', hcc=False)

In [8]:
res_pd.head(10)

index,sent_id,ner_chunk,entity,snomed_code,all_codes,resolutions
0,0,Fragmin,DRUG,9487801000001106,"[9487801000001106, 130752006, 9486501000001106, 9536201000001105, 9455401000001105, 9522301000001103, 9681001000001107, 83545007, 9486401000001107, 428726008, 9521501000001101, 13127501000001103, ...","[Fragmin, Fragilysin (substance), Faverin, Froop (product), Frumil, frisium, Isopto Frin, Fumitremorgen, Faslodex, Denufosol, Felicium, FerroEss, Frangula (substance), Flamrase EC, Fiasp (product)..."
1,0,OxyContin,DRUG,9296001000001100,"[9296001000001100, 373470001, 27499006, 9296201000001106, 55452001, 96375007, 230091000001108, 112118000, 9237201000001100, 781644002, 9237601000001103, 12094501000001101, 25571003]","[OxyCONTIN, Oxychlorosene, Oxyphencyclimine, Oxymycin (product), Oxycodone (substance), Cyoctol, Oxyargin, Oxyntomodulin, Celectol, Omadacycline (substance), Celontin, Cleosensa (product), Clodant..."
2,0,folic acid,DRUG,63718003,"[63718003, 6247001, 226316008, 432165000, 438451000124100, 9455001000001100, 418558000, 792796007, 43289005, 420207003, 419441003, 327452001, 418777000, 9454201000001105, 126224002]","[Folic acid, Folic acid-containing product, Folic acid supplement agent, L-methyl folic acid, Folate supplement, Folicare, Ferrous sulfate- and folic acid-containing product, Folate and folate der..."
3,0,levothyroxine,DRUG,10071011000001106,"[10071011000001106, 710809001, 768532006, 126202002, 768531004, 59170000, 73187006, 783637002, 847003, 61899008, 38076006, 60760000, 61275002, 86211008, 130608005]","[Levothyroxine, Levothyroxine (substance), Levothyroxine-containing product, Levothyroxine sodium, Levothyroxine- and liothyronine-containing product, Dextrothyroxine, Thyroxine, Levothyroxine sod..."
4,0,aspirin,DRUG,387458008,"[387458008, 7947003, 5145711000001107, 426365001, 412566001, 25796002, 87303007, 319796006, 398767009, 785413006, 9515101000001102, 358427004, 735135008, 770875005, 28367311000001105]","[Aspirin, Aspirin-containing product, Aspirin powder, Aspirin, buffered, Buffered aspirin-containing product, Aluminium aspirin, Cephapirin, Aspirin- and dipyridamole-containing product, Aspirin- ..."
5,0,magnesium citrate,DRUG,12495006,"[12495006, 387401007, 21691008, 15531411000001106, 408112000, 387202002, 419458008, 7168001, 421128005, 420129007, 53691001, 60468008, 116125009, 768454005, 49399009, 373762005]","[Magnesium citrate, Magnesium carbonate, Magnesium trisilicate, Magnesium chloride, Magnesium malate, Magnesium sulphate, Magnesium glycinate, Magnesium carbonate-containing product, Magnesium asp..."
6,0,insulin,DRUG,67866001,"[67866001, 325072002, 414515005, 39487003, 411530000, 771372005, 706973004, 66384003, 130734006, 77001006, 411529005, 4700006, 126210001, 417524005, 412210000, 96367001, 422346007]","[Insulin, Insulin aspart, Insulin detemir, Insulin-containing product, Insulin glulisine, Insulin antagonist, Bound insulin (substance), Isophane insulin, Insulin protease, Insulin reductase, Insu..."
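
In every row above, `snomed_code` matches the first entry of `all_codes`, so the lists read as ranked candidates with the best match first. A small sketch of pulling the top few (code, resolution) pairs for one row, assuming that ordering holds in general; the row is a shortened copy of the `aspirin` result and `top_candidates` is a hypothetical helper:

```python
def top_candidates(row, k=3):
    """Pair the first k codes with their resolutions for one result row."""
    return list(zip(row["all_codes"][:k], row["resolutions"][:k]))

# Shortened copy of the `aspirin` row from the output above.
row = {
    "ner_chunk": "aspirin",
    "snomed_code": "387458008",
    "all_codes": ["387458008", "7947003", "5145711000001107"],
    "resolutions": ["Aspirin", "Aspirin-containing product", "Aspirin powder"],
}

for code, resolution in top_candidates(row, k=2):
    print(code, resolution)
```

Because a pandas row supports the same `row["column"]` access, the helper should also work with `res_pd.apply(top_candidates, axis=1)`, though that usage is untested here.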


In [9]:
from sparknlp_display import EntityResolverVisualizer

light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'snomed_code',
               document_col='document'
               )

# **🔎 "sbiobertresolve_clinical_snomed_procedures_measurements" model**

### **🔎Define Spark NLP pipeline**

In [10]:
clinical_ner = MedicalNerModel.pretrained('ner_clinical', "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")

c2doc = Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc")

sbert_embedding = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")\
      .setCaseSensitive(False)

snomed_resolver = SentenceEntityResolverModel.pretrained('sbiobertresolve_clinical_snomed_procedures_measurements', "en", 'clinical/models') \
        .setInputCols(["sbert_embeddings"]) \
        .setOutputCol("snomed_code")\
        .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
      stages = [
          documentAssembler,
          sentenceDetector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          c2doc,
          sbert_embedding,
          snomed_resolver])

data_ner = spark.createDataFrame([[""]]).toDF("text")
models = resolver_pipeline.fit(data_ner)
light_model = LightPipeline(models)

ner_clinical download started this may take some time.
[OK!]
sent_biobert_clinical_base_cased download started this may take some time.
Approximate size to download 386.6 MB
[OK!]
sbiobertresolve_clinical_snomed_procedures_measurements download started this may take some time.
[OK!]


In [11]:
sample_text = """Nature and course of the diagnosis has been discussed with the patient. Based on her presentation without any history of obvious fall or trauma and past history of malignant melanoma. At the present time, I would recommend obtaining a bone scan and repeat x-rays, which will include AP pelvis, femur, hip including knee.  She denies any pain elsewhere , at the present time, this appears to be fracture.

With the above fracture and presentation, she needs a left hip hemiarthroplasty versus cemented type. Indication, risk, and benefits of left hip hemiarthroplasty has been discussed with the patient, which includes, but not limited to bleeding, infection, blood vessel injury, dislocation early and late, persistent pain,  myositis ossificans, need for conversion to total hip replacement surgery, revision surgery, pulmonary embolism, risk of anesthesia, need for blood transfusion, and cardiac arrest. She understands above and is willing to undergo further procedure. The goal and the functional outcome have been explained. Further plan will be discussed with her once we obtain the bone scan and the radiographic studies. We will also await for the oncology feedback and clearance."""
    

clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")

snomed_result = models.transform(clinical_note_df)

In [12]:
res_pd = get_codes_from_df(snomed_result, 'ner_chunk', 'snomed_code', hcc=False)

In [13]:
res_pd.head(10)

index,sent_id,ner_chunk,entity,snomed_code,all_codes,resolutions
0,1,trauma,PROBLEM,164315006,"[164315006, 105040009, 845791000000106, 386405009, 171350003, 52527007, 310486006, 6108008, 273373001, 273993002, 281116008, 446391000124101, 397164006, 848121000000104, 178107009, 178106000, 7233...","[O/E - fever - fast fall-crisis, Heavy metal screen, Reinsch method, Black fracture index, Rape trauma treatment, Road traffic accident injury examination, Intelligence test/S-B, Limb exsanguinati..."
1,1,malignant melanoma,PROBLEM,254347002,"[254347002, 254325002, 254343003, 254349004, 254346006, 254345005, 254348007, 254324003, 254352007, 258292004, 258284008, 106246001, 254319001, 396518001, 258286005]","[TNM Malignant melanoma of iris staging, TNM Malignant melanoma of skin staging, TNM Malignant melanoma of eyelid staging, TNM Malignant melanoma of choroid staging, TNM Malignant melanoma of uvea..."
2,2,a bone scan,TEST,61553000,"[61553000, 168654006, 1129461000000103, 432635009, 113113000, 241463004, 24737001, 281621005, 432056005, 241478006, 168775008, 168681006, 1083711000000109, 241511008, 169238009, 241490008, 1687170...","[Ophthalmic biometry by ultrasound echography, A-mode, Stress X-ray thumb, Automated ultrasonography of breast, Ultrasonography of axilla, Ophthalmic echography, A-mode, US scan of neck vessels, E..."
3,2,repeat x-rays,TEST,168654006,"[168654006, 268428008, 171229005, 42075002, 168731009, 168637003, 168702005, 60619004, 84492002, 168717009, 303937007, 168681006, 168775008, 281616007, 241082003, 79760008, 241089007, 168770003, 3...","[Stress X-ray thumb, Erect abdominal X-ray, Screening chest X-ray, Diagnostic radiography of skull, Standard chest X-ray, Plain X-ray radius, Plain X-ray abdomen, Diagnostic radiography of finger,..."
4,2,"AP pelvis, femur, hip including knee",TEST,11541000087104,"[11541000087104, 12311000087102, 11421000087107, 432672003, 21301000087103, 3991004, 16731000087109, 14871000087107, 431250008, 15351000087106, 15221000087108, 721043004, 15231000087105, 153910000...","[MRI of pelvis and right hip, MRI of pelvis and bilateral hips, MRI of pelvis and left hip, MRI of pelvis and hip, MRI of pelvis and bilateral hips with contrast, MRI of pelvis, prostate and bladd..."
5,3,any pain,PROBLEM,408950005,"[408950005, 710855004, 494711000000101, 247375001, 763306003, 423184003, 274796008, 713131007, 709480005, 273685000, 164299002, 810601000000106, 273454004, 444821009, 225784007, 737899003, 7706370...","[Acute pain control, Assessment for sign of discomfort, Brief pain inventory, Arc of pain in joint, Abbey Pain Scale, Adult pain assessment, Examination of pain sensation, Assessment of dizziness,..."
6,3,fracture,PROBLEM,845791000000106,"[845791000000106, 179212005, 183668008, 179140008, 20781004, 171745006, 179183002, 178106000, 178107009, 178102003, 229509005, 6108008, 38014004, 178851002, 787795000, 252697005, 229435008, 302361...","[Black fracture index, Primary skeletal traction of fracture, Total avulsion of nail plate, Primary functional bracing of fracture, Partial excision of nail and nail matrix, Freeing of spinal teth..."
7,4,the above fracture,PROBLEM,179183002,"[179183002, 229450004, 224591000000106, 179212005, 78840003, 275176004, 229447002, 171745006, 59875008, 179140008, 112740000, 8717008, 302382009, 179184008, 24433001, 2442008, 431986003, 76177002,...","[Primary skin traction of fracture, Manipulation of the inferior radioulnar joint - non-surgical, Complex reduction of abnormal tissue to free spinal cord, Primary skeletal traction of fracture, C..."
8,4,a left hip hemiarthroplasty,TREATMENT,723251002,"[723251002, 785850002, 735196007, 735262008, 386649003, 248811000000107, 734003004, 735261001, 735194005, 239469004, 280469005, 280464000, 179423007, 47458005, 735193004, 179421009, 179321008, 635...","[Hemiarthroplasty of left shoulder, Reverse prosthetic total arthroplasty of shoulder, Revision of left total hip arthroplasty, Revision of left total knee arthroplasty, Partial hip replacement by..."
9,4,cemented type,TREATMENT,41294004,"[41294004, 234803005, 53213008, 234797006, 78185000, 5016005, 10372006, 234798001, 76585002, 56159008, 19955001, 90564003, 39070008, 20882001, 14549001, 179406003, 42114005, 237561000000106, 83200...","[Silicate cement, per restoration, Fit veneer to tooth, Restoration, silicate, Final cementation of crown to tooth - laboratory constructed, Pontic cast, high noble metal, Pontic, resin with high ..."
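
The resolver in this section covers the Procedures and Measurement domains, so `PROBLEM` chunks such as `trauma` or `fracture` land on loosely related procedure concepts in the table above. One option is to keep only `TEST` and `TREATMENT` rows before using the codes. A sketch on a few rows copied from that output, in plain Python; treating those two labels as the relevant ones is an assumption, and the same restriction could presumably be applied upstream with `setWhiteList` on the `NerConverter`, as the drug pipeline does for `DRUG`:

```python
# Labels assumed to correspond to procedures and measurements.
KEEP_LABELS = {"TEST", "TREATMENT"}

# A few rows copied from the output above (candidate lists omitted).
rows = [
    {"ner_chunk": "trauma", "entity": "PROBLEM", "snomed_code": "164315006"},
    {"ner_chunk": "a bone scan", "entity": "TEST", "snomed_code": "61553000"},
    {"ner_chunk": "repeat x-rays", "entity": "TEST", "snomed_code": "168654006"},
    {"ner_chunk": "a left hip hemiarthroplasty", "entity": "TREATMENT", "snomed_code": "723251002"},
]

kept = [row for row in rows if row["entity"] in KEEP_LABELS]
for row in kept:
    print(row["ner_chunk"], row["entity"], row["snomed_code"])
```

On the pandas frame itself, the equivalent filter would be `res_pd[res_pd["entity"].isin(KEEP_LABELS)]`.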


In [14]:
from sparknlp_display import EntityResolverVisualizer

light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'snomed_code',
               document_col='document'
               )