![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/SentenceEntityResolverModel.ipynb)

# **SentenceEntityResolverModel**

This notebook covers the parameters and usage of `SentenceEntityResolverModel`. This annotator maps clinical entities to a particular ontology or curated dataset using sentence embeddings.

**📖 Learning Objectives:**

1. Understand the application and relevance of these models in healthcare data analysis

2. Map clinical entities to standard codes (ICD-10, RxNorm, SNOMED, etc.)

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [SentenceEntityResolverModel](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#sentenceentityresolver)

- Python Docs : [SentenceEntityResolverModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/resolution/sentence_entity_resolver/index.html)

- Scala Docs : [SentenceEntityResolverModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/finance/chunk_classification/resolution/SentenceEntityResolverModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp).

## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [4]:
import pyspark.sql.functions as F

spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE_EMBEDDINGS`

- Output: `ENTITY`

## **🔎 Parameters**


- `distanceFunction`: Determines how the distance between different entities will be calculated. Either `COSINE` or `EUCLIDEAN`.

- `neighbours`: The number of neighbours to consider when computing the distances.

- `caseSensitive`: Whether to consider text casing or not.

- `threshold`: Maximum distance allowed for a candidate; neighbours farther away than this value are discarded.

- `DoExceptionHandling`: If `True`, the annotator tries to process the data as usual, and if exception-causing data (e.g. a corrupted record or document) is passed to it, a warning containing the exception message is emitted.


### `setDistanceFunction`



Defines the distance function to use, either Euclidean or cosine.  

In [5]:
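# The whole input text is treated as a single chunk, so the document
# column is named "ner_chunk" and passed straight to the embedder.
documentAssembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("ner_chunk")
)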

sbert_embedder = (
    nlp.BertSentenceEmbeddings.pretrained(
        "sbiobert_base_cased_mli", "en", "clinical/models"
    )
    .setInputCols(["ner_chunk"])
    .setOutputCol("sentence_embeddings")
    .setCaseSensitive(False)
)

rxnorm_resolver = (
    medical.SentenceEntityResolverModel.pretrained(
        "sbiobertresolve_rxnorm_augmented", "en", "clinical/models"
    )
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("rxnorm_code")
    .setDistanceFunction("EUCLIDEAN")
)

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[OK!]


In [6]:
text = 'metformin 100 mg'
df = spark.createDataFrame([[text]]).toDF("text")

In [7]:
results = rxnorm_pipelineModel.transform(df)

In [8]:
results.show(truncate=100)

+----------------+----------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|            text|                                                 ner_chunk|                                                                                 sentence_embeddings|                                                                                         rxnorm_code|
+----------------+----------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|metformin 100 mg|[{document, 0, 15, metformin 100 mg, {sentence -> 0}, []}]|[{sentence_embeddings, 0, 15, metformin 100 mg, {sentence -> 0, token -> metformin 

In [9]:
results.select("ner_chunk.result", "rxnorm_code.result").show()

+------------------+--------+
|            result|  result|
+------------------+--------+
|[metformin 100 mg]|[861024]|
+------------------+--------+



In [10]:
rxnorm_resolver.setDistanceFunction("COSINE")

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)
results = rxnorm_pipelineModel.transform(df)

results.select("ner_chunk.result", "rxnorm_code.result").show()

+------------------+--------+
|            result|  result|
+------------------+--------+
|[metformin 100 mg]|[861026]|
+------------------+--------+
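
The two runs above return different top codes (`861024` with `EUCLIDEAN`, `861026` with `COSINE`). The following is a minimal sketch of why the two functions can rank candidates differently. It uses plain Python and made-up 2-dimensional vectors; the names `query`, `candidates`, `code_A` and `code_B` are hypothetical and are not taken from the resolver model.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy data (assumption, for illustration only).
query = [1.0, 0.0]
candidates = {
    "code_A": [3.0, 0.0],  # same direction as the query, but far away
    "code_B": [1.0, 0.5],  # close to the query, slightly different direction
}

nearest_euclidean = min(
    candidates, key=lambda code: euclidean_distance(query, candidates[code])
)
nearest_cosine = min(
    candidates, key=lambda code: cosine_distance(query, candidates[code])
)

print(nearest_euclidean)  # code_B
print(nearest_cosine)     # code_A
```

Euclidean distance is sensitive to vector magnitude, while cosine distance only compares direction, so the nearest candidate can change when the distance function changes.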



### `setNeighbours`

Sets the number of nearest neighbours retrieved as candidate codes. These candidates are later filtered by the threshold value.

In [11]:
rxnorm_resolver.setNeighbours(5)

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)
results = rxnorm_pipelineModel.transform(df)
results.select(
    F.explode(
        F.arrays_zip(
            results.rxnorm_code.result, results.rxnorm_code.metadata
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("rxnorm_code"),
    F.expr("cols['1']['all_k_results']").alias("all_k_results"),
    F.expr("cols['1']['all_k_resolutions']").alias("all_k_resolutions"),
).show(
    truncate=False
)


+-----------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|rxnorm_code|all_k_results                             |all_k_resolutions                                                                                                                                                                                                                                      |
+-----------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|861026     |861026:::451225:::316844:::451900:::439132|metformin hydrochloride 100 m
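
In the output above, the `all_k_results` metadata field holds the candidate codes as one string joined by `:::`. As a small sketch, the string can be split back into a Python list once it has been collected to the driver. The helper name `split_candidates` is hypothetical, and the input string is copied from the output above.

```python
def split_candidates(value, sep=":::"):
    """Split a ':::'-joined metadata string into a list of candidates."""
    return value.split(sep)


# Copied from the `all_k_results` column shown above (neighbours = 5).
all_k_results = "861026:::451225:::316844:::451900:::439132"

candidate_codes = split_candidates(all_k_results)

print(candidate_codes[0])    # 861026
print(len(candidate_codes))  # 5
```

The first element matches the code reported in the `rxnorm_code` column, and the list length matches the value passed to `setNeighbours`.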

In [12]:
rxnorm_resolver.setNeighbours(50)

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)
results = rxnorm_pipelineModel.transform(df)
results.select(
    F.explode(
        F.arrays_zip(
            results.rxnorm_code.result, results.rxnorm_code.metadata
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("rxnorm_code"),
    F.expr("cols['1']['all_k_results']").alias("all_k_results"),
    F.expr("cols['1']['all_k_resolutions']").alias("all_k_resolutions"),
).show(
    truncate=False
)



### `setCaseSensitive`

Controls whether text casing is taken into account. Pretrained models may have been trained as case sensitive or case insensitive, so changing this parameter can produce unexpected results depending on the model used.

In [13]:
rxnorm_resolver.setCaseSensitive(False)

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)
results = rxnorm_pipelineModel.transform(df)
results.select(
    F.explode(
        F.arrays_zip(
            results.rxnorm_code.result, results.rxnorm_code.metadata
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("rxnorm_code"),
    F.expr("cols['1']['all_k_results']").alias("all_k_results"),
    F.expr("cols['1']['all_k_resolutions']").alias("all_k_resolutions"),
).show(
    truncate=False
)



In [14]:
rxnorm_resolver.setCaseSensitive(True)

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)
results = rxnorm_pipelineModel.transform(df)
results.select(
    F.explode(
        F.arrays_zip(
            results.rxnorm_code.result, results.rxnorm_code.metadata
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("rxnorm_code"),
    F.expr("cols['1']['all_k_results']").alias("all_k_results"),
    F.expr("cols['1']['all_k_resolutions']").alias("all_k_resolutions"),
).show(
    truncate=False
)



### `setThreshold`

This parameter removes neighbours whose distance is greater than the threshold value. A suitable value depends on whether you are using the cosine or the Euclidean distance function, because the two produce distances on different scales.
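
The interaction between the number of neighbours and the threshold can be sketched in plain Python. This is an illustration under stated assumptions, not the annotator's actual implementation: the function `nearest_within_threshold`, the toy `index` and the inclusive comparison (`<=`) are all hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_within_threshold(query, index, k, threshold):
    """Take the k nearest codes, then drop those farther than threshold."""
    scored = sorted(
        (euclidean_distance(query, vector), code)
        for code, vector in index.items()
    )
    top_k = scored[:k]
    return [code for distance, code in top_k if distance <= threshold]


# Toy index (assumption): distances from the query are 1, 3, 8 and 20.
index = {
    "code_A": [0.0, 1.0],
    "code_B": [0.0, 3.0],
    "code_C": [0.0, 8.0],
    "code_D": [0.0, 20.0],
}
query = [0.0, 0.0]

loose = nearest_within_threshold(query, index, k=3, threshold=1000.0)
strict = nearest_within_threshold(query, index, k=3, threshold=7.0)

print(loose)   # ['code_A', 'code_B', 'code_C']
print(strict)  # ['code_A', 'code_B']
```

Cosine distance, computed as one minus cosine similarity, always lies between 0 and 2, so a threshold that is meaningful for Euclidean distance (such as 7) would filter nothing under cosine distance.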

In [15]:
rxnorm_resolver.getThreshold()

1000.0

In [16]:
rxnorm_resolver.setDistanceFunction("EUCLIDEAN").setThreshold(500)

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)
results = rxnorm_pipelineModel.transform(df)
results.select(
    F.explode(
        F.arrays_zip(
            results.rxnorm_code.result, results.rxnorm_code.metadata
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("rxnorm_code"),
    F.expr("cols['1']['all_k_results']").alias("all_k_results"),
    F.expr("cols['1']['all_k_resolutions']").alias("all_k_resolutions"),
).show(
    truncate=False
)



In [17]:
rxnorm_resolver.setDistanceFunction("EUCLIDEAN").setThreshold(7)

rxnorm_pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, sbert_embedder, rxnorm_resolver]
)
results = rxnorm_pipelineModel.transform(df)
results.select(
    F.explode(
        F.arrays_zip(
            results.rxnorm_code.result, results.rxnorm_code.metadata
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("rxnorm_code"),
    F.expr("cols['1']['all_k_results']").alias("all_k_results"),
    F.expr("cols['1']['all_k_resolutions']").alias("all_k_resolutions"),
).show(
    truncate=False
)


+-----------+----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|rxnorm_code|all_k_results                                                                                             |all_k_resolutions                                                                                                                                                                                                  