![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/ER_RXNORM.ipynb)

## **Resolve Drugs using the RxNorm taxonomy**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. Otherwise, you can look at the example outputs at the bottom of the notebook.

## **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print("Please Upload your John Snow Labs License using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM.
# Make sure to restart your notebook afterwards for the changes to take effect.

jsl.install()

## **Start Spark Session**

In [None]:
from johnsnowlabs import *
# Automatically load license data and start a session with all jars user has access to
spark = jsl.start()

### **🔎 About the models**

📌 **sbiobertresolve_rxnorm**--> *This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using sbiobert_base_cased_mli Sentence Bert Embeddings.*


📌 **sbiobertresolve_rxnorm_disposition** --> *This model maps medication entities (like drugs/ingredients) to RxNorm codes and their dispositions using sbiobert_base_cased_mli Sentence Bert Embeddings. It must be used with sbiobert_base_cased_mli as the embeddings and ner_posology as the NER model, with DRUG set in .setWhiteList().*

📌 **sbiobertresolve_rxnorm_augmented_re** --> *This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes without specifying the relations between the entities (relations are calculated on the fly inside the annotator) using sbiobert_base_cased_mli Sentence Bert Embeddings (EntityChunkEmbeddings). The embeddings used in this model are calculated with the following weights: {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}. EntityChunkEmbeddings with those weights is required in the pipeline to get the best results.*
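
To make the role of the entity weights more concrete, here is a minimal pure-Python sketch. It assumes the weights are applied as a normalized weighted average of the entity vectors, which may not match the annotator's internals exactly. The function name `weighted_chunk_embedding` and the two-dimensional vectors are hypothetical and exist only for illustration.

```python
# Toy illustration: combine entity embeddings with per-entity weights.
# Assumption: weights are applied as a normalized weighted average;
# the real EntityChunkEmbeddings annotator may differ in detail.

ENTITY_WEIGHTS = {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}


def weighted_chunk_embedding(entity_vectors, weights=ENTITY_WEIGHTS):
    """entity_vectors: list of (entity_label, vector) pairs of equal dimension."""
    if not entity_vectors:
        raise ValueError("need at least one entity vector")
    dim = len(entity_vectors[0][1])
    total_weight = 0.0
    combined = [0.0] * dim
    for label, vector in entity_vectors:
        w = weights.get(label, 0.0)  # unknown labels contribute nothing
        total_weight += w
        for i, value in enumerate(vector):
            combined[i] += w * value
    if total_weight == 0.0:
        raise ValueError("no weighted entity present")
    return [value / total_weight for value in combined]


# Made-up 2-dimensional vectors standing in for real sentence embeddings.
example = weighted_chunk_embedding([
    ("DRUG", [1.0, 0.0]),
    ("STRENGTH", [0.0, 1.0]),
])
print(example)
```

With these weights the DRUG vector dominates the result, so the strength, route, and form only nudge the combined embedding toward a more specific RxNorm concept.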



### **🔎 Helper Function**


In [None]:
# Returns the resolution results as a pandas DataFrame (one row per resolved chunk).
import pyspark.sql.functions as F

def get_codes_from_df(result_df, chunk, output_col, hcc= False):
    
    
    if hcc:
        
        df = result_df.select(F.explode(F.arrays_zip(result_df[chunk].result, 
                                                     result_df[chunk].metadata, 
                                                     result_df[output_col].result, 
                                                     result_df[output_col].metadata)).alias("cols")) \
                      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
                              F.expr("cols['0']").alias("ner_chunk"),
                              F.expr("cols['1']['entity']").alias("entity"), 
                              F.expr("cols['2']").alias("icd10_code"),
                              F.expr("cols['3']['all_k_results']").alias("all_codes"),
                              F.expr("cols['3']['all_k_resolutions']").alias("resolutions"),
                              F.expr("cols['3']['all_k_aux_labels']").alias("hcc_list")).toPandas()

        codes = []
        resolutions = []
        hcc_all = []

        for code, resolution, hcc_labels in zip(df['all_codes'], df['resolutions'], df['hcc_list']):

            codes.append(code.split(':::'))
            resolutions.append(resolution.split(':::'))
            hcc_all.append(hcc_labels.split(":::"))

        df['all_codes'] = codes  
        df['resolutions'] = resolutions
        df['hcc_list'] = hcc_all
        
    else:
                       
        df = result_df.select(F.explode(F.arrays_zip(result_df[chunk].result, 
                                                     result_df[chunk].metadata, 
                                                     result_df[output_col].result, 
                                                     result_df[output_col].metadata)).alias("cols")) \
                      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
                              F.expr("cols['0']").alias("ner_chunk"),
                              F.expr("cols['1']['entity']").alias("entity"), 
                              F.expr("cols['2']").alias(f"{output_col}"),
                              F.expr("cols['3']['all_k_results']").alias("all_codes"),
                              F.expr("cols['3']['all_k_resolutions']").alias("resolutions")).toPandas()



        codes = []
        resolutions = []

        for code, resolution in zip(df['all_codes'], df['resolutions']):

            codes.append(code.split(':::'))
            resolutions.append(resolution.split(':::'))

        df['all_codes'] = codes  
        df['resolutions'] = resolutions
        
    
    return df
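
The core of the helper is the `:::` split applied to the resolver metadata. The following standalone sketch shows that step without Spark. The function name `split_candidates` is hypothetical, and the sample strings are short prefixes of the values that appear in this notebook's outputs.

```python
# Minimal sketch of the ':::' splitting done inside get_codes_from_df.
# split_candidates is a hypothetical name used only for this illustration.

SEPARATOR = ":::"


def split_candidates(all_codes, all_resolutions, sep=SEPARATOR):
    """Pair each candidate code with its resolution text."""
    codes = all_codes.split(sep)
    resolutions = all_resolutions.split(sep)
    return list(zip(codes, resolutions))


# Shortened sample strings based on the Aspirin row in this notebook's output.
pairs = split_candidates(
    "1191:::405403:::218266",
    "aspirin:::ysp aspirin:::med aspirin",
)
for code, resolution in pairs:
    print(code, resolution)
```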

# **📌 "sbiobertresolve_rxnorm" model**

### **🔎Define Spark NLP pipeline**

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")
'''
drug_normalizer = DrugNormalizer() \
      .setInputCols("document") \
      .setOutputCol("drug_normalized") \
      .setPolicy('all')'''

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['DRUG'])

c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedder = nlp.BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models") \
      .setInputCols(["ner_chunk_doc", "sbert_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
    stages = [
        documentAssembler,
        #drug_normalizer,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        posology_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        rxnorm_resolver])

data_ner = spark.createDataFrame([[""]]).toDF("text")

model = resolver_pipeline.fit(data_ner)


sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm download started this may take some time.
[OK!]
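
The resolver above is configured with `.setDistanceFunction("EUCLIDEAN")`. As a rough conceptual sketch only, and not the library's actual implementation, ranking candidates by Euclidean distance can be pictured as follows. The names `rank_by_euclidean` and `toy_index`, the codes, and the vectors are all made up for illustration.

```python
import math

# Conceptual sketch: rank candidate codes by Euclidean distance to a query
# embedding. The real resolver works on high-dimensional sentence embeddings
# and its internals are not shown in this notebook.


def rank_by_euclidean(query, candidates, k=3):
    """Return the k closest (code, distance) pairs, nearest first."""
    scored = [(math.dist(query, vector), code) for code, vector in candidates.items()]
    scored.sort()
    return [(code, round(distance, 4)) for distance, code in scored[:k]]


# Hypothetical 2-dimensional "embeddings" keyed by placeholder codes.
toy_index = {
    "code_a": [0.0, 0.0],
    "code_b": [3.0, 4.0],
    "code_c": [1.0, 1.0],
}

ranking = rank_by_euclidean([0.0, 0.0], toy_index, k=2)
print(ranking)
```

Smaller distances mean closer matches, which is why the first entry of `all_k_results` in the resolver metadata is the one reported as the resolved code.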


In [None]:
sample_text = """The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. Hydrochlorothiazide 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain."""

clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")

rxnorm_result = model.transform(clinical_note_df)

In [None]:
res_pd = get_codes_from_df(rxnorm_result, 'ner_chunk', 'rxnorm_code', hcc=False)

In [None]:
res_pd.head(10)

| | sent_id | ner_chunk | entity | rxnorm_code | all_codes | resolutions |
|---|---|---|---|---|---|---|
| 0 | 3 | Aspirin | DRUG | 1191 | [1191, 405403, 218266, 215448, 215568, 1154070... | [aspirin, ysp aspirin, med aspirin, aspirin-an... |
| 1 | 4 | Humulin N | DRUG | 92880 | [92880, 218686, 92879, 261588, 92881, 1372744,... | [humulin n, neotricin hc, humulin l, nabi-hb, ... |
| 2 | 4 | insulin | DRUG | 5856 | [5856, 484319, 139825, 1740938, 274783, 86009,... | [insulin, insulin detemir, insulin detemir, in... |
| 3 | 5 | Hydrochlorothiazide | DRUG | 5487 | [5487, 1162786, 2396, 91217, 203165, 82027, 11... | [hydrochlorothiazide, hydrochlorothiazide oral... |
| 4 | 6 | Nitroglycerin | DRUG | 4917 | [4917, 360398, 1868493, 1159827, 1159829, 3797... | [nitroglycerin, able brand of nitroglycerin, n... |
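
Once the results are in tabular form, a common next step is to group the resolved codes by sentence. The sketch below does this in plain Python with the values from the result table above. The function name `codes_by_sentence` is hypothetical, and treating `sent_id` as a string is an assumption about how the metadata value is typed.

```python
from collections import defaultdict

# Rows copied from the result table above; sent_id is kept as a string,
# which is an assumption about how the metadata value is typed.
rows = [
    {"sent_id": "3", "ner_chunk": "Aspirin", "rxnorm_code": "1191"},
    {"sent_id": "4", "ner_chunk": "Humulin N", "rxnorm_code": "92880"},
    {"sent_id": "4", "ner_chunk": "insulin", "rxnorm_code": "5856"},
    {"sent_id": "5", "ner_chunk": "Hydrochlorothiazide", "rxnorm_code": "5487"},
    {"sent_id": "6", "ner_chunk": "Nitroglycerin", "rxnorm_code": "4917"},
]


def codes_by_sentence(rows):
    """Group (chunk, code) pairs under the sentence they were found in."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["sent_id"]].append((row["ner_chunk"], row["rxnorm_code"]))
    return dict(grouped)


by_sentence = codes_by_sentence(rows)
print(by_sentence["4"])
```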


In [None]:
from sparknlp_display import EntityResolverVisualizer

light_model = LightPipeline(model)
light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'rxnorm_code',
               document_col='document'
               )

# **📌 "sbiobertresolve_rxnorm_disposition" model**

### **🔎Define Spark NLP pipeline**

In [None]:
rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained('sbiobertresolve_rxnorm_disposition', "en", "clinical/models") \
        .setInputCols(["ner_chunk", "sbert_embeddings"]) \
        .setOutputCol("rxnorm_code")\
        .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
    stages = [
              documentAssembler,
              sentenceDetector,
              tokenizer,
              word_embeddings,
              posology_ner,
              ner_converter,
              c2doc,
              sbert_embedder,
              rxnorm_resolver
              ])

data_ner = spark.createDataFrame([[""]]).toDF("text")
model = resolver_pipeline.fit(data_ner)
light_model = LightPipeline(model)

sbiobertresolve_rxnorm_disposition download started this may take some time.
[OK!]


In [None]:
sample_text = """The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. Hydrochlorothiazide 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain."""


clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")
rxnorm_sdf = model.transform(clinical_note_df)

In [None]:
res_pd = get_codes_from_df(rxnorm_sdf, 'ner_chunk', 'rxnorm_code', hcc=False)

In [None]:
res_pd.head(10)

| | sent_id | ner_chunk | entity | rxnorm_code | all_codes | resolutions |
|---|---|---|---|---|---|---|
| 0 | 3 | Aspirin | DRUG | 1191 | [1191, 405403, 218266, 215448, 215568, 1154070... | [aspirin, ysp aspirin, med aspirin, aspirin-an... |
| 1 | 4 | Humulin N | DRUG | 92880 | [92880, 218686, 92879, 261588, 92881, 1372744,... | [humulin n, neotricin hc, humulin l, nabi-hb, ... |
| 2 | 4 | insulin | DRUG | 5856 | [5856, 484319, 139825, 1740938, 274783, 86009,... | [insulin, insulin detemir, insulin detemir, in... |
| 3 | 5 | Hydrochlorothiazide | DRUG | 5487 | [5487, 1162786, 2396, 91217, 203165, 82027, 11... | [hydrochlorothiazide, hydrochlorothiazide oral... |
| 4 | 6 | Nitroglycerin | DRUG | 4917 | [4917, 360398, 1868493, 1159827, 1159829, 3797... | [nitroglycerin, able brand of nitroglycerin, n... |


In [None]:
from sparknlp_display import EntityResolverVisualizer

light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'rxnorm_code',
               document_col='document'
               )

# **📌 "sbiobertresolve_rxnorm_augmented_re" model**

***🔎 For more information, see the [Sentence Entity Resolvers with EntityChunkEmbeddings notebook](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.2.Sentence_Entity_Resolvers_with_EntityChunkEmbeddings.ipynb).***

### **🔎Define Spark NLP pipeline**

In [None]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

posology_ner_model = medical.NerModel\
    .pretrained("ner_posology_large", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pos_tager = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos_tag")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tag", "token"])\
    .setOutputCol("dependencies")

drug_chunk_embeddings = medical.EntityChunkEmbeddings()\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(5)

drug_chunk_embeddings.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})
drug_chunk_embeddings.setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})

rxnorm_re = medical.SentenceEntityResolverModel\
      .pretrained("sbiobertresolve_rxnorm_augmented_re", "en","clinical/models")\
      .setInputCols(["ner_chunk","drug_chunk_embeddings"])\
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

rxnorm_pipeline_re = Pipeline(
    stages = [
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        drug_chunk_embeddings,
        rxnorm_re
        ])


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
[OK!]
sbiobertresolve_rxnorm_augmented_re download started this may take some time.
[OK!]


In [None]:
sample_text = """The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. Hydrochlorothiazide 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain."""


clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")
rxnorm_sdf =  rxnorm_pipeline_re.fit(clinical_note_df).transform(clinical_note_df)

In [None]:
rxnorm_sdf.select('rxnorm_code.metadata').show(truncate = 200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                metadata|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{all_k_results -> 1535484:::434451:::247138:::315431:::243670:::252857:::318272:::211830:::211832:::404658, all_k_distances -> 3.2895:::3.3071:::3.3071:::3.3071:::3.3071:::3.3071:::3.3071:::3.3071...|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
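
The metadata shown above packs the candidate codes and their distances into `:::`-separated strings. A small sketch of unpacking them and keeping only the closest candidates follows. The function name `candidates_within` and the threshold value are hypothetical, and the sample strings are truncated to the first three entries visible in the output.

```python
# Sketch: unpack all_k_results / all_k_distances and filter by distance.
# candidates_within and the 3.3 threshold are illustrative assumptions.


def candidates_within(metadata, max_distance, sep=":::"):
    """Return (code, distance) pairs whose distance is at most max_distance."""
    codes = metadata["all_k_results"].split(sep)
    distances = [float(d) for d in metadata["all_k_distances"].split(sep)]
    return [(code, dist) for code, dist in zip(codes, distances) if dist <= max_distance]


# First three entries of the metadata printed above.
sample_metadata = {
    "all_k_results": "1535484:::434451:::247138",
    "all_k_distances": "3.2895:::3.3071:::3.3071",
}

close = candidates_within(sample_metadata, max_distance=3.3)
print(close)
```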

In [None]:
rxnorm_df = rxnorm_sdf.select(F.explode(F.arrays_zip(rxnorm_sdf.drug_chunk_embeddings.result, 
                                                     rxnorm_sdf.drug_chunk_embeddings.begin,
                                                     rxnorm_sdf.drug_chunk_embeddings.end,
                                                     rxnorm_sdf.rxnorm_code.result, 
                                                     rxnorm_sdf.rxnorm_code.metadata,
                                                     rxnorm_sdf.drug_chunk_embeddings.metadata)).alias("col"))\
                      .select(F.expr("col['0']").alias("chunk"),
                              F.expr("col['1']").alias("begin"),
                              F.expr("col['2']").alias("end"),
                              F.expr("col['5']['target_entity']").alias("entity_type"),
                              F.expr("col['3']").alias("RxNorm_code"),
                              F.expr("col['4']['all_k_resolutions']").alias("all_k_resolutions") ,
                              F.expr("col['4']['all_k_results']").alias("all_k_codes"),
                              F.expr("col['4']['all_k_aux_labels']").alias("all_k_labels")
                              ).toPandas()

In [None]:
rxnorm_df

| | chunk | begin | end | entity_type | RxNorm_code | all_k_resolutions | all_k_codes | all_k_labels |
|---|---|---|---|---|---|---|---|---|
| 0 | Aspirin 81 milligrams | 306 | 326 | DRUG | 1535484 | aspirin 81 MG Oral Film:::aspirin 130 MG Oral ... | 1535484:::434451:::247138:::315431:::243670:::... | Clinical Drug:::Clinical Drug:::Clinical Drug:... |
| 1 | Humulin | 334 | 340 | DRUG | 1359720 | insulin isophane, human 70 UNT/ML / insulin, r... | 1359720:::106892:::1654858:::106900:::1654855:... | Branded Drug:::Branded Drug:::Branded Drug:::B... |
| 2 | insulin | 345 | 351 | DRUG | 2179743 | insulin, regular, human Injection:::insulin, r... | 2179743:::2179750:::311054:::415061:::1543200:... | Clinical Drug Form:::Clinical Drug:::Clinical ... |
| 3 | Hydrochlorothiazide 50 mg | 370 | 394 | DRUG | 316051 | hydrochlorothiazide 50 MG:::hydrochlorothiazid... | 316051:::197770:::438741:::316049:::310798:::3... | Clinical Drug Comp:::Clinical Drug:::Clinical ... |
| 4 | Nitroglycerin 1/150 sublingually | 402 | 433 | DRUG | 446769 | nitroglycerin 0.15 MG/ACTUAT:::nitroglycerin 0... | 446769:::409008:::1295706:::486146:::486148:::... | Clinical Drug Comp:::Clinical Drug:::Clinical ... |
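
Because `all_k_labels` is aligned position by position with `all_k_codes`, the labels can be used to pick a candidate of a particular RxNorm term type. The sketch below shows one way to do that in plain Python. The function name `first_with_label` is hypothetical, and the sample strings contain only the first two entries of the Hydrochlorothiazide row above.

```python
# Sketch: pick the first candidate code that carries a wanted label.
# first_with_label is a hypothetical helper, not part of the library.


def first_with_label(all_k_codes, all_k_labels, wanted_label, sep=":::"):
    """Return the first code whose aligned label equals wanted_label, else None."""
    for code, label in zip(all_k_codes.split(sep), all_k_labels.split(sep)):
        if label == wanted_label:
            return code
    return None


# First two entries of the Hydrochlorothiazide row in the table above.
codes = "316051:::197770"
labels = "Clinical Drug Comp:::Clinical Drug"

clinical_drug = first_with_label(codes, labels, "Clinical Drug")
print(clinical_drug)
```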


In [None]:
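# Fit the EntityChunkEmbeddings pipeline from this section; `model` still refers to the previous section's pipeline.
light_model = LightPipeline(rxnorm_pipeline_re.fit(clinical_note_df))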
light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'rxnorm_code',
               document_col='document'
               )