![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/05.4.Sentence_Entity_Resolvers_with_EntityChunkEmbeddings.ipynb)

# Sentence Entity Resolvers with EntityChunkEmbeddings

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [4]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/5.0.0.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


In [5]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## Creating pipeline

In [6]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

posology_ner_model = medical.NerModel()\
    .pretrained("ner_posology_large", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pos_tager = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos_tag")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tag", "token"])\
    .setOutputCol("dependencies")

entity_chunk_embeddings = medical.EntityChunkEmbeddings()\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(3)

entity_chunk_embeddings.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})
entity_chunk_embeddings.setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})

rxnorm_re = medical.SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented_re", "en","clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipeline_re = nlp.Pipeline(
    stages = [
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re
        ])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
[OK!]
sbiobertresolve_rxnorm_augmented_re download started this may take some time.
[OK!]


### Sample Text

In [7]:
sampleText = ["The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen.",
              "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"]

sample_df = pd.DataFrame({'text':sampleText}).reset_index()

In [8]:
data_df = spark.createDataFrame(sample_df)

results = rxnorm_pipeline_re.fit(data_df).transform(data_df)

## Chunks extracted by NER model

In [9]:
results.select("index", F.explode(F.arrays_zip(results.ner_chunk.result,
                                               results.ner_chunk.metadata)).alias("cols")) \
       .select("index", F.expr("cols['0']").alias("chunk"),
                        F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----+----------+---------+
|index|chunk     |ner_label|
+-----+----------+---------+
|0    |metformin |DRUG     |
|0    |500 mg    |STRENGTH |
|0    |tablet    |FORM     |
|0    |2.5 mg    |STRENGTH |
|0    |coumadin  |DRUG     |
|0    |ibuprofen |DRUG     |
|1    |metformin |DRUG     |
|1    |400 mg    |STRENGTH |
|1    |coumadin  |DRUG     |
|1    |5 mg      |STRENGTH |
|1    |coumadin  |DRUG     |
|1    |amlodipine|DRUG     |
|1    |10 MG     |STRENGTH |
|1    |tablet    |FORM     |
+-----+----------+---------+



## Chunks merged by the internal relation extraction feature

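- We specified which entities to merge with the `.setTargetEntities` parameter of the EntityChunkEmbeddings annotator. Each `DRUG` chunk is combined with its related `STRENGTH`, `ROUTE`, and `FORM` chunks, subject to the limit set by `.setMaxSyntacticDistance(3)`:
`.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})`

- EntityChunkEmbeddings computes the embeddings of these merged chunks according to the weights specified in the `.setEntityWeights` parameter: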

`.setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})`
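
The effect of these weights can be pictured with a small sketch. It assumes the merged embedding is a weighted average of the individual chunk vectors. The helper `weighted_average` and the 2-dimensional toy vectors are hypothetical and are not part of the library, which operates on sentence embeddings internally.

```python
def weighted_average(vectors, labels, weights):
    """Combine per-chunk vectors into one vector using per-label weights.

    Illustrative only: a hypothetical helper, not the annotator's actual code.
    """
    total = sum(weights[label] for label in labels)
    dim = len(vectors[0])
    return [
        sum(weights[label] * vec[i] for vec, label in zip(vectors, labels)) / total
        for i in range(dim)
    ]


# Same weights as passed to .setEntityWeights in the pipeline
entity_weights = {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}

# Toy vectors standing in for the embeddings of "metformin", "500 mg", "tablet"
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_labels = ["DRUG", "STRENGTH", "FORM"]

merged = weighted_average(toy_vectors, toy_labels, entity_weights)
print(merged)
```

With a weight of 0.8 on `DRUG`, the merged vector stays closest to the drug name, while strength and form shift it only slightly. That is the kind of shift that would let the resolver tell `metformin 500 mg tablet` apart from plain `metformin`.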

In [10]:
results.select('index','drug_chunk_embeddings.result').show(truncate = False)

+-----+--------------------------------------------------------------------+
|index|result                                                              |
+-----+--------------------------------------------------------------------+
|0    |[metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]               |
|1    |[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|
+-----+--------------------------------------------------------------------+



## RxNorm Results

In [11]:
results.select('index', F.explode(F.arrays_zip(results.drug_chunk_embeddings.result,
                                               results.rxnorm_code.result,
                                               results.rxnorm_code.metadata)).alias("col"))\
        .select('index', F.expr("col['0']").alias("chunk"),
                         F.expr("col['1']").alias("rxnorm_code_"),
                         F.expr("col['2']['resolved_text']").alias("Concept_Name")).show(truncate = 50)

+-----+-----------------------+------------+---------------------------------+
|index|                  chunk|rxnorm_code_|                     Concept_Name|
+-----+-----------------------+------------+---------------------------------+
|    0|metformin 500 mg tablet|      860974|   metformin hydrochloride 500 MG|
|    0|        2.5 mg coumadin|      855313|warfarin sodium 2.5 MG [Coumadin]|
|    0|              ibuprofen|     1747293|              ibuprofen Injection|
|    1|       metformin 400 mg|      332809|                 metformin 400 MG|
|    1|          coumadin 5 mg|      855333|  warfarin sodium 5 MG [Coumadin]|
|    1|               coumadin|      202421|                         Coumadin|
|    1|amlodipine 10 MG tablet|      308135|     amlodipine 10 MG Oral Tablet|
+-----+-----------------------+------------+---------------------------------+

