![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/ER_SNOMED.ipynb)

## **Resolve Clinical Health Information using the SNOMED taxonomy**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. Otherwise, you can look at the example outputs at the bottom of the notebook.

## **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print("Please upload your John Snow Labs license using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.

jsl.install()

## Start Session

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all the jars the user has access to
spark = jsl.start()

# **🔎 About the models**

📌 **sbiobertresolve_snomed_drug** --> *This model maps detected drug entities to SNOMED codes using sbiobert_base_cased_mli Sentence Bert Embeddings.*



📌 **sbiobertresolve_clinical_snomed_procedures_measurements** --> *This model maps medical entities to SNOMED codes using sent_biobert_clinical_base_cased Sentence Bert Embeddings. The corpus of this model includes the Procedures and Measurement domains.*
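
As a rough intuition for what an embedding-based resolver does, the sketch below ranks candidate terms by Euclidean distance to a query vector. The three-dimensional vectors, the `toy_index` dictionary, and the `resolve` helper are all made up for illustration (the codes and terms are copied from the aspirin and folic acid rows of the example output in this notebook). The real models use high-dimensional Sentence BERT embeddings and their own internal index, so treat this as a conceptual sketch, not the library's implementation.

```python
import math

# Hypothetical toy index: (code, term) -> made-up 3-d embedding.
toy_index = {
    ("387458008", "Aspirin"): [0.9, 0.1, 0.0],
    ("7947003", "Aspirin-containing product"): [0.8, 0.3, 0.1],
    ("63718003", "Folic acid"): [0.0, 0.9, 0.4],
}


def resolve(query_vector, index, k=2):
    """Return the k entries closest to query_vector as (code, term, distance)."""
    scored = [(code, term, math.dist(query_vector, vector))
              for (code, term), vector in index.items()]
    # Smaller Euclidean distance means a closer match.
    return sorted(scored, key=lambda item: item[2])[:k]


# Made-up embedding standing in for the chunk "aspirin".
top_k = resolve([0.88, 0.12, 0.02], toy_index)
for code, term, distance in top_k:
    print(code, term, round(distance, 3))
```

The first entry plays the role of the resolved code, and the remaining entries correspond to the alternative candidates that the notebook later exposes as `all_codes` and `resolutions`.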


### **📌Helper Function:**

In [None]:
# Returns the resolver results of a Spark DataFrame as a pandas DataFrame.

from pyspark.sql import functions as F


def get_codes_from_df(result_df, chunk, output_col, hcc=False):
    # With hcc=True the code column is named "icd10_code" and the HCC labels are added.
    code_col = "icd10_code" if hcc else output_col

    columns = [F.expr("cols['1']['sentence']").alias("sent_id"),
               F.expr("cols['0']").alias("ner_chunk"),
               F.expr("cols['1']['entity']").alias("entity"),
               F.expr("cols['2']").alias(code_col),
               F.expr("cols['3']['all_k_results']").alias("all_codes"),
               F.expr("cols['3']['all_k_resolutions']").alias("resolutions")]
    list_cols = ["all_codes", "resolutions"]

    if hcc:
        columns.append(F.expr("cols['3']['all_k_aux_labels']").alias("hcc_list"))
        list_cols.append("hcc_list")

    # Zip the chunk and resolver annotations together, then emit one row per chunk.
    df = result_df.select(F.explode(F.arrays_zip(result_df[chunk].result,
                                                 result_df[chunk].metadata,
                                                 result_df[output_col].result,
                                                 result_df[output_col].metadata)).alias("cols")) \
                  .select(*columns).toPandas()

    # The top-k candidates arrive as ':::'-separated strings; split them into lists.
    for col in list_cols:
        df[col] = [value.split(":::") for value in df[col]]

    return df
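
The splitting step in the helper can be tried without a Spark session. In the sketch below, the input strings are hand-written to mimic the `:::`-separated format the helper expects in `all_k_results` and `all_k_resolutions`; the values are copied from the folic acid row of the example output, and `split_candidates` is a hypothetical name used only here.

```python
def split_candidates(joined, sep=":::"):
    """Split a ':::'-joined candidate string into a list of items."""
    return joined.split(sep)


# Hand-written stand-ins for the resolver metadata strings.
all_k_results = "63718003:::6247001"
all_k_resolutions = "Folic acid:::Folic acid-containing product"

codes = split_candidates(all_k_results)
resolutions = split_candidates(all_k_resolutions)

print(codes)
print(resolutions)
```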

# **🔎 "sbiobertresolve_snomed_drug" model**

### **🔎Define Spark NLP pipeline**

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained('ner_posology_large', "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['DRUG'])

c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedding = nlp.BertSentenceEmbeddings\
      .pretrained('sbiobert_base_cased_mli','en', "clinical/models")\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

snomed_resolver = medical.SentenceEntityResolverModel.pretrained('sbiobertresolve_snomed_drug', "en", "clinical/models") \
        .setInputCols(["ner_chunk", "sbert_embeddings"]) \
        .setOutputCol("snomed_code")\
        .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
      stages = [
          documentAssembler,
          sentenceDetector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          c2doc,
          sbert_embedding,
          snomed_resolver])

data_ner = spark.createDataFrame([[""]]).toDF("text")
models = resolver_pipeline.fit(data_ner)
light_model = LightPipeline(models)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_snomed_drug download started this may take some time.
[OK!]


In [None]:
sample_text = """She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, aspirin 81 mg daily, magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin."""


clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")
snomed_sdf = models.transform(clinical_note_df)

In [None]:
res_pd = get_codes_from_df(snomed_sdf, 'ner_chunk', 'snomed_code', hcc=False)

In [None]:
res_pd.head(10)

| | sent_id | ner_chunk | entity | snomed_code | all_codes | resolutions |
|---|---|---|---|---|---|---|
| 0 | 0 | Fragmin | DRUG | 9487801000001106 | [9487801000001106, 130752006, 9486501000001106... | [Fragmin, Fragilysin (substance), Faverin, Fro... |
| 1 | 0 | OxyContin | DRUG | 9296001000001100 | [9296001000001100, 373470001, 27499006, 929620... | [OxyCONTIN, Oxychlorosene, Oxyphencyclimine, O... |
| 2 | 0 | folic acid | DRUG | 63718003 | [63718003, 6247001, 226316008, 432165000, 4384... | [Folic acid, Folic acid-containing product, Fo... |
| 3 | 0 | levothyroxine | DRUG | 10071011000001106 | [10071011000001106, 710809001, 768532006, 1262... | [Levothyroxine, Levothyroxine (substance), Lev... |
| 4 | 0 | aspirin | DRUG | 387458008 | [387458008, 7947003, 5145711000001107, 4263650... | [Aspirin, Aspirin-containing product, Aspirin ... |
| 5 | 0 | magnesium citrate | DRUG | 12495006 | [12495006, 387401007, 21691008, 15531411000001... | [Magnesium citrate, Magnesium carbonate, Magne... |
| 6 | 0 | insulin | DRUG | 67866001 | [67866001, 325072002, 414515005, 39487003, 411... | [Insulin, Insulin aspart, Insulin detemir, Ins... |
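
The `all_codes` and `resolutions` columns appear to be parallel lists (the first code matches `snomed_code` and the first resolution matches the chunk), so pairing them up makes the alternatives for each chunk easier to review. The sketch below uses plain Python with rows hand-copied from the visible part of the output above; the lists are truncated there, so only the leading entries are included. `top_candidates` is a hypothetical helper name, not part of the library.

```python
# Rows hand-copied from the visible (truncated) output above.
rows = [
    {"ner_chunk": "Fragmin",
     "all_codes": ["9487801000001106", "130752006", "9486501000001106"],
     "resolutions": ["Fragmin", "Fragilysin (substance)", "Faverin"]},
    {"ner_chunk": "aspirin",
     "all_codes": ["387458008", "7947003", "5145711000001107"],
     "resolutions": ["Aspirin", "Aspirin-containing product"]},
]


def top_candidates(row, k=2):
    """Pair each candidate code with its resolution text and keep the first k."""
    # zip stops at the shorter list, so unevenly truncated inputs stay aligned.
    return list(zip(row["all_codes"], row["resolutions"]))[:k]


for row in rows:
    print(row["ner_chunk"], top_candidates(row))
```

With the real `res_pd`, the same idea could be applied row by row, for example through `res_pd.apply(..., axis=1)`.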


In [None]:
from sparknlp_display import EntityResolverVisualizer

light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'snomed_code',
               document_col='document'
               )

# **🔎 "sbiobertresolve_clinical_snomed_procedures_measurements" model**

### **🔎Define Spark NLP pipeline**

In [None]:
clinical_ner = medical.NerModel.pretrained('ner_clinical', "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")

c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc")

sbert_embedding = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")\
      .setCaseSensitive(False)

snomed_resolver = medical.SentenceEntityResolverModel.pretrained('sbiobertresolve_clinical_snomed_procedures_measurements', "en", 'clinical/models') \
        .setInputCols(["ner_chunk", "sbert_embeddings"]) \
        .setOutputCol("snomed_code")\
        .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
      stages = [
          documentAssembler,
          sentenceDetector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          c2doc,
          sbert_embedding,
          snomed_resolver])

data_ner = spark.createDataFrame([[""]]).toDF("text")
models = resolver_pipeline.fit(data_ner)
light_model = LightPipeline(models)

ner_clinical download started this may take some time.
[OK!]
sent_biobert_clinical_base_cased download started this may take some time.
Approximate size to download 386.6 MB
[OK!]
sbiobertresolve_clinical_snomed_procedures_measurements download started this may take some time.
[OK!]


In [None]:
sample_text = """Nature and course of the diagnosis has been discussed with the patient. Based on her presentation without any history of obvious fall or trauma and past history of malignant melanoma. At the present time, I would recommend obtaining a bone scan and repeat x-rays, which will include AP pelvis, femur, hip including knee.  She denies any pain elsewhere , at the present time, this appears to be fracture.

With the above fracture and presentation, she needs a left hip hemiarthroplasty versus cemented type. Indication, risk, and benefits of left hip hemiarthroplasty has been discussed with the patient, which includes, but not limited to bleeding, infection, blood vessel injury, dislocation early and late, persistent pain,  myositis ossificans, need for conversion to total hip replacement surgery, revision surgery, pulmonary embolism, risk of anesthesia, need for blood transfusion, and cardiac arrest. She understands above and is willing to undergo further procedure. The goal and the functional outcome have been explained. Further plan will be discussed with her once we obtain the bone scan and the radiographic studies. We will also await for the oncology feedback and clearance."""
    

clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")

snomed_result = models.transform(clinical_note_df)

In [None]:
res_pd = get_codes_from_df(snomed_result, 'ner_chunk', 'snomed_code', hcc=False)

In [None]:
res_pd.head(10)

| | sent_id | ner_chunk | entity | snomed_code | all_codes | resolutions |
|---|---|---|---|---|---|---|
| 0 | 1 | trauma | PROBLEM | 164315006 | [164315006, 105040009, 845791000000106, 386405... | [O/E - fever - fast fall-crisis, Heavy metal s... |
| 1 | 1 | malignant melanoma | PROBLEM | 254347002 | [254347002, 254325002, 254343003, 254349004, 2... | [TNM Malignant melanoma of iris staging, TNM M... |
| 2 | 2 | a bone scan | TEST | 61553000 | [61553000, 168654006, 1129461000000103, 432635... | [Ophthalmic biometry by ultrasound echography,... |
| 3 | 2 | repeat x-rays | TEST | 168654006 | [168654006, 268428008, 171229005, 42075002, 16... | [Stress X-ray thumb, Erect abdominal X-ray, Sc... |
| 4 | 2 | AP pelvis, femur, hip including knee | TEST | 11541000087104 | [11541000087104, 12311000087102, 1142100008710... | [MRI of pelvis and right hip, MRI of pelvis an... |
| 5 | 3 | any pain | PROBLEM | 408950005 | [408950005, 710855004, 494711000000101, 247375... | [Acute pain control, Assessment for sign of di... |
| 6 | 3 | fracture | PROBLEM | 845791000000106 | [845791000000106, 179212005, 183668008, 179140... | [Black fracture index, Primary skeletal tracti... |
| 7 | 4 | the above fracture | PROBLEM | 179183002 | [179183002, 229450004, 224591000000106, 179212... | [Primary skin traction of fracture, Manipulati... |
| 8 | 4 | a left hip hemiarthroplasty | TREATMENT | 723251002 | [723251002, 785850002, 735196007, 735262008, 3... | [Hemiarthroplasty of left shoulder, Reverse pr... |
| 9 | 4 | cemented type | TREATMENT | 41294004 | [41294004, 234803005, 53213008, 234797006, 781... | [Silicate cement, per restoration, Fit veneer ... |
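
Since this resolver's corpus covers the Procedures and Measurement domains, the `PROBLEM` chunks above are still forced onto procedure or measurement concepts. One possible way to handle this is to keep only the entity labels the resolver is meant for. The sketch below filters hand-copied rows in plain Python; the choice of `TEST` and `TREATMENT` as the labels to keep is an assumption for illustration, not a recommendation from the model documentation.

```python
# Rows hand-copied from the output above (only a few columns kept).
rows = [
    {"ner_chunk": "malignant melanoma", "entity": "PROBLEM", "snomed_code": "254347002"},
    {"ner_chunk": "a bone scan", "entity": "TEST", "snomed_code": "61553000"},
    {"ner_chunk": "a left hip hemiarthroplasty", "entity": "TREATMENT", "snomed_code": "723251002"},
]

# Assumption: these labels are the ones that fit a procedures/measurements resolver.
KEEP_LABELS = {"TEST", "TREATMENT"}

filtered = [row for row in rows if row["entity"] in KEEP_LABELS]

for row in filtered:
    print(row["ner_chunk"], row["entity"], row["snomed_code"])
```

The same restriction could instead be applied inside the pipeline with `setWhiteList` on the `NerConverter`, as the drug pipeline earlier in this notebook does for `DRUG`.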


In [None]:
from sparknlp_display import EntityResolverVisualizer

light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'snomed_code',
               document_col='document'
               )