![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **SentenceEntityResolverApproach**

This notebook covers the parameters and usage of `SentenceEntityResolverApproach`. This annotator trains a `SentenceEntityResolverModel` that maps sentence embeddings to entities in a knowledge base.

**📖 Learning Objectives:**

1. Understand the application and relevance of these models in healthcare data analysis, particularly in coding and classification tasks related to healthcare ontologies like ICD-10, RxNorm, SNOMED, etc.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [SentenceEntityResolverApproach](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#sentenceentityresolver)

- Python Docs : [SentenceEntityResolverApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/resolution/sentence_entity_resolver/index.html)

- Scala Docs : [SentenceEntityResolverApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/finance/chunk_classification/resolution/SentenceEntityResolverApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pandas as pd

spark = nlp.start(hardware_target="gpu")
spark

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
🤓 Looks like you are missing some jars, trying fetching them ...
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Downloading 🫘+🚀 Java Library spark-nlp-gpu-assembly-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched gpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE_EMBEDDINGS`

- Output: `ENTITY`

## **🔎 Parameters**


General parameters:

- `labelCol` : Column name for the value we are trying to resolve. Usually this contains the entity ID in the knowledge base (e.g., the ICD-10 code).

- `normalizedCol`: Column name for the original, normalized description.

- `aux_label_col`: Auxiliary label which maps resolved entities to additional labels (set with `setAuxLabelCol`).

- `useAuxLabel`: Whether to use the auxiliary column or not. Default value is `False`.

- `distanceFunction`: Determines how the distance between different entities will be calculated.

- `confidenceFunction`: Which function to use to calculate confidence: either `INVERSE` or `SOFTMAX`.

- `caseSensitive`: Whether to ignore case in tokens for embeddings matching (default: `False`).

- `threshold`: Threshold value for the last distance calculated (default: 5.0).

- `missAsEmpty`: Whether or not to return an empty annotation on unmatched chunks (default: `True`).
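To make the two `confidenceFunction` options more concrete, here is a minimal sketch of how a list of neighbour distances could be turned into confidence scores. The formulas (a softmax over negated distances, and normalized inverse distances) are assumptions chosen for illustration, and the function names are hypothetical; the library's internal computation may differ.

```python
import math


def softmax_confidence(distances):
    """Assumed formula: softmax over negated distances (closer => higher score)."""
    closest = min(distances)
    # Subtracting the smallest distance keeps the exponents <= 0 (numerically stable)
    # without changing the result of the softmax.
    exps = [math.exp(-(d - closest)) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_confidence(distances, eps=1e-9):
    """Assumed formula: inverse distances, normalized to sum to 1."""
    inv = [1.0 / (d + eps) for d in distances]
    total = sum(inv)
    return [v / total for v in inv]


# Hypothetical distances from one query to three candidate entities.
distances = [1.0, 2.0, 4.0]
soft = softmax_confidence(distances)
inv = inverse_confidence(distances)

for d, s, i in zip(distances, soft, inv):
    print(f"distance={d:.1f}  softmax={s:.3f}  inverse={i:.3f}")
```

Under these assumed formulas both variants rank candidates identically, but the softmax concentrates more of the probability mass on the closest candidate.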




<br/>

When finetuning an existing model, there are additional parameters:

- `pretrainedModelPath`: Path to an already trained SentenceEntityResolverModel. This pretrained model is used as the starting point for training the new one. The path can be a local file path, a distributed file system path (HDFS, DBFS), or a cloud storage location (S3).

- `overrideExistingCodes`: Whether to override the existing codes with new data when continuing training from a pretrained model. Default value is `False` (keep all the codes).

- `dropCodesList`: A list of codes in a pretrained model that will be omitted when the training process begins with a pretrained model.
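As a rough mental model of the last two parameters, the toy function below treats a resolver as a plain mapping from code to training texts. This is only one plausible reading of the descriptions above, not the library's implementation: `merge_codes` is a hypothetical helper, and the text `"dizzy"` is made up for the example.

```python
def merge_codes(pretrained, new_data, override_existing=False, drop_codes=()):
    """Toy model: combine the codes of a pretrained resolver with new training data."""
    dropped = set(drop_codes)
    # Codes listed in drop_codes are omitted from the pretrained side.
    merged = {code: list(texts) for code, texts in pretrained.items()
              if code not in dropped}
    for code, texts in new_data.items():
        if override_existing or code not in merged:
            # Assumed meaning of "override": the new texts replace the old ones.
            merged[code] = list(texts)
        else:
            # Otherwise the existing texts are kept and the new ones are appended.
            merged[code].extend(texts)
    return merged


pretrained = {
    "108367008": ["Dislocation of joint"],
    "404640003": ["Dizziness"],
}
new_data = {
    "404640003": ["dizzy"],          # hypothetical new text for an existing code
    "271681002": ["Stomach ache"],   # a code the pretrained side does not have
}

kept = merge_codes(pretrained, new_data, drop_codes=["108367008"])
overridden = merge_codes(pretrained, new_data, override_existing=True)
print(kept)
print(overridden)
```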

### **Prepare Data**

We will use sample data with SNOMED codes.

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.test.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.train.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.validation.txt

In [None]:
cols = ["snomed_code", "concept_name", "snomed_text"]

aap_tr = pd.read_csv(
    "AskAPatient.fold-0.train.txt",
    sep="\t",
    encoding="ISO-8859-1",
    header=None,
    names=cols
)
aap_tr["snomed_code"] = aap_tr.snomed_code.apply(str)

aap_ts = pd.read_csv(
    "AskAPatient.fold-0.test.txt", sep="\t", header=None, names=cols
)
aap_ts["snomed_code"] = aap_ts.snomed_code.apply(str)

aap_vl = pd.read_csv(
    "AskAPatient.fold-0.validation.txt", sep="\t", header=None, names=cols
)
aap_vl["snomed_code"] = aap_vl.snomed_code.apply(str)

In [None]:
aap_tr.head()

|   | snomed_code | concept_name | snomed_text |
|---|---|---|---|
| 0 | 108367008 | Dislocation of joint | Dislocation of joint |
| 1 | 3384011000036100 | Arthrotec | Arthrotec |
| 2 | 166717003 | Serum creatinine raised | Serum creatinine raised |
| 3 | 3877011000036101 | Lipitor | Lipitor |
| 4 | 402234004 | Foot eczema | Foot eczema |


In [None]:
# Create spark dataframes

aap_train_sdf = spark.createDataFrame(aap_tr)
aap_test_sdf = spark.createDataFrame(aap_ts)
aap_val_sdf = spark.createDataFrame(aap_vl)

In [None]:
aap_train_sdf.show()

+----------------+--------------------+--------------------+
|     snomed_code|        concept_name|         snomed_text|
+----------------+--------------------+--------------------+
|       108367008|Dislocation of joint|Dislocation of joint|
|3384011000036100|           Arthrotec|           Arthrotec|
|       166717003|Serum creatinine ...|Serum creatinine ...|
|3877011000036101|             Lipitor|             Lipitor|
|       402234004|         Foot eczema|         Foot eczema|
|       404640003|           Dizziness|           Dizziness|
|       271681002|        Stomach ache|        Stomach ache|
|        76948002|         Severe pain|         Severe pain|
|        36031001|        Burning feet|        Burning feet|
|        76948002|         Severe pain|         Severe pain|
|        42399005|       Renal failure|       Renal failure|
|       288227007|Myalgia/myositis ...|Myalgia/myositis ...|
|       419723007|       Mentally dull|       Mentally dull|
|       248490000|    Bl

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("snomed_text")\
    .setOutputCol("sentence")

bert_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence"])\
    .setOutputCol("bert_embeddings")
    # .setCaseSensitive(False)

embeddings_pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    bert_embeddings])

embeddings_model = embeddings_pipeline.fit(aap_train_sdf)
snomed_data = embeddings_model.transform(aap_train_sdf)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [None]:
snomed_data.show()

+----------------+--------------------+--------------------+--------------------+--------------------+
|     snomed_code|        concept_name|         snomed_text|            sentence|     bert_embeddings|
+----------------+--------------------+--------------------+--------------------+--------------------+
|       108367008|Dislocation of joint|Dislocation of joint|[{document, 0, 19...|[{sentence_embedd...|
|3384011000036100|           Arthrotec|           Arthrotec|[{document, 0, 8,...|[{sentence_embedd...|
|       166717003|Serum creatinine ...|Serum creatinine ...|[{document, 0, 22...|[{sentence_embedd...|
|3877011000036101|             Lipitor|             Lipitor|[{document, 0, 6,...|[{sentence_embedd...|
|       402234004|         Foot eczema|         Foot eczema|[{document, 0, 10...|[{sentence_embedd...|
|       404640003|           Dizziness|           Dizziness|[{document, 0, 8,...|[{sentence_embedd...|
|       271681002|        Stomach ache|        Stomach ache|[{document, 0

### **Train Model**

To train the model, we need to indicate the ground truth code (present in the `snomed_code` column) and the ground truth normalized text (present in the `snomed_text` column).

Optional parameters are:

- `distanceFunction`: either `EUCLIDEAN` or `COSINE`.
- `caseSensitive`: `True` or `False`, controlling case sensitivity.
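The effect of the distance function can be illustrated with a small nearest-neighbour lookup over toy vectors. This is a conceptual sketch only: the two-dimensional vectors, the codes `A`, `B`, `C`, the `resolve` helper, and the simple distance cut-off are all assumptions for illustration, not the library's training or inference code.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on direction, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def resolve(query, knowledge_base, distance, neighbours=2, threshold=float("inf")):
    """Return up to `neighbours` (code, distance) pairs within `threshold`, closest first."""
    scored = sorted((distance(query, vec), code) for code, vec in knowledge_base.items())
    return [(code, d) for d, code in scored[:neighbours] if d <= threshold]


# Hypothetical knowledge base: code -> embedding vector.
kb = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [3.0, 0.0]}
query = [1.2, 0.1]

euclid_top = resolve(query, kb, euclidean)
cosine_top = resolve(query, kb, cosine_distance)
within_threshold = resolve(query, kb, euclidean, neighbours=3, threshold=1.0)

print("euclidean:", euclid_top)
print("cosine:   ", cosine_top)
print("euclidean, threshold=1.0:", within_threshold)
```

With these toy vectors the Euclidean ranking prefers `A` and `B`, while the cosine ranking prefers `A` and `C`, because `C` points in the same direction as `A` and cosine distance ignores magnitude.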

In [None]:
bertExtractor = (
    medical.SentenceEntityResolverApproach()
    .setNeighbours(25)
    .setThreshold(1000)
    .setInputCols("bert_embeddings")
    .setNormalizedCol("snomed_text")
    .setLabelCol("snomed_code")
    .setOutputCol("snomed_pred")
    .setDistanceFunction("EUCLIDIAN") # Or COSINE
    .setCaseSensitive(False)
)

%time snomed_model = bertExtractor.fit(snomed_data)

CPU times: user 5.52 s, sys: 751 ms, total: 6.27 s
Wall time: 17min 3s


In [None]:
# Save the model if you will need it later
snomed_model.write().overwrite().save("biobertresolve_snomed_askapatient")

In [None]:
prediction_Model = nlp.PipelineModel(
    stages=[embeddings_model, snomed_model]
)

aap_test_pred = prediction_Model.transform(aap_test_sdf).cache()
aap_val_pred = prediction_Model.transform(aap_val_sdf).cache()

In [None]:
aap_test_pred.selectExpr(
    "snomed_code",
    "concept_name",
    "snomed_text",
    "snomed_pred[0].result",
    "snomed_pred[0].metadata.resolved_text",
    "snomed_pred[0].metadata.all_k_resolutions",
).show(truncate=50)

+----------------+------------------------------------+--------------------------------+---------------------+--------------------------------------+--------------------------------------------------+
|     snomed_code|                        concept_name|                     snomed_text|snomed_pred[0].result|snomed_pred[0].metadata[resolved_text]|        snomed_pred[0].metadata[all_k_resolutions]|
+----------------+------------------------------------+--------------------------------+---------------------+--------------------------------------+--------------------------------------------------+
|       108367008|                Dislocation of joint|                     dislocating|            387603000|                           balance off|balance off:::Impaired mobility:::Reduced mobil...|
|3384011000036100|                           Arthrotec|                       Arthrotec|     3384011000036100|                             Arthrotec|                                         Arthro

In [None]:
aap_val_pred.selectExpr(
    "snomed_code",
    "concept_name",
    "snomed_text",
    "snomed_pred[0].result",
    "snomed_pred[0].metadata.resolved_text",
    "snomed_pred[0].metadata.all_k_resolutions",
).show(truncate=50)


+----------------+---------------------+------------------------------+---------------------+--------------------------------------+--------------------------------------------------+
|     snomed_code|         concept_name|                   snomed_text|snomed_pred[0].result|snomed_pred[0].metadata[resolved_text]|        snomed_pred[0].metadata[all_k_resolutions]|
+----------------+---------------------+------------------------------+---------------------+--------------------------------------+--------------------------------------------------+
|       267032009|   Tired all the time|persisten feeling of tiredness|             84229001|                     extreme tiredness|extreme tiredness:::feeling tired a lot:::feeli...|
|        22298006|Myocardial infarction|                  HEART ATTACK|             22298006|                          HEART ATTACH|HEART ATTACH:::HEADACHES:::LIGHT HEADED:::HAIR ...|
|3877011000036101|              Lipitor|                       LIPITOR|     3877

In [None]:
preds_test = aap_test_pred.selectExpr(
    "snomed_code as ytrue", "snomed_pred[0].result as ypred"
).toPandas()
preds_test.head()

|   | ytrue | ypred |
|---|---|---|
| 0 | 108367008 | 387603000 |
| 1 | 3384011000036100 | 3384011000036100 |
| 2 | 166717003 | 166717003 |
| 3 | 3877011000036101 | 3877011000036101 |
| 4 | 402234004 | 402234004 |


In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(preds_test.ytrue, preds_test.ypred)

0.8016147635524798
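For intuition, `accuracy_score` on string labels is the fraction of rows where the predicted code exactly equals the true code. The sketch below recomputes that by hand on just the five rows shown by `preds_test.head()`, so its result describes this small sample and not the full test set; `exact_match_accuracy` is a hypothetical helper name.

```python
# The five (ytrue, ypred) rows displayed by preds_test.head().
ytrue = ["108367008", "3384011000036100", "166717003", "3877011000036101", "402234004"]
ypred = ["387603000", "3384011000036100", "166717003", "3877011000036101", "402234004"]


def exact_match_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted code equals the true code."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


accuracy = exact_match_accuracy(ytrue, ypred)
# Listing the mismatches is often more informative than the score alone.
misses = [(t, p) for t, p in zip(ytrue, ypred) if t != p]

print(accuracy)
print(misses)
```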

### **Train Model with Auxiliary Information**

We can add auxiliary information to our model. Here we add an auxiliary column holding the concept name of each code, which is more general than the specific text occurrence of the code. The auxiliary information appears in the `all_k_aux_labels` field of the metadata.

In [None]:
bertExtractor_aux = (
    medical.SentenceEntityResolverApproach()
    .setNeighbours(25)
    .setThreshold(1000)
    .setInputCols("bert_embeddings")
    .setNormalizedCol("snomed_text")
    .setLabelCol("snomed_code")
    .setOutputCol("snomed_pred")
    .setDistanceFunction("EUCLIDIAN")
    .setCaseSensitive(False)
    .setUseAuxLabel(True)
    .setAuxLabelCol("concept_name")
)

%time snomed_aux_model = bertExtractor_aux.fit(snomed_data)

CPU times: user 11 s, sys: 1.46 s, total: 12.4 s
Wall time: 34min 10s


In [None]:
# Save the model if you will need it later
snomed_aux_model.write().overwrite().save(
    "biobertresolve_snomed_askapatient_aux"
)

In [None]:
aux_prediction_Model = nlp.PipelineModel(
    stages=[embeddings_model, snomed_aux_model]
)

aap_test_pred_aux = aux_prediction_Model.transform(aap_test_sdf).cache()
aap_val_pred_aux = aux_prediction_Model.transform(aap_val_sdf).cache()

In [None]:
aap_test_pred_aux.selectExpr(
    "snomed_code",
    "concept_name",
    "snomed_text",
    "snomed_pred[0].result",
    "snomed_pred[0].metadata.resolved_text",
    "snomed_pred[0].metadata.all_k_resolutions",
    "snomed_pred[0].metadata.all_k_aux_labels",
).show(truncate=50)

+----------------+------------------------------------+--------------------------------+---------------------+--------------------------------------+--------------------------------------------------+--------------------------------------------------+
|     snomed_code|                        concept_name|                     snomed_text|snomed_pred[0].result|snomed_pred[0].metadata[resolved_text]|        snomed_pred[0].metadata[all_k_resolutions]|         snomed_pred[0].metadata[all_k_aux_labels]|
+----------------+------------------------------------+--------------------------------+---------------------+--------------------------------------+--------------------------------------------------+--------------------------------------------------+
|       108367008|                Dislocation of joint|                     dislocating|            387603000|                           balance off|balance off:::Impaired mobility:::Reduced mobil...|Impairment of balance:::Impaired mobility:::

In [None]:
aap_val_pred_aux.selectExpr(
    "snomed_code",
    "concept_name",
    "snomed_text",
    "snomed_pred[0].result",
    "snomed_pred[0].metadata.resolved_text",
    "snomed_pred[0].metadata.all_k_resolutions",
    "snomed_pred[0].metadata.all_k_aux_labels",
).show(truncate=50)

+----------------+---------------------+------------------------------+---------------------+--------------------------------------+--------------------------------------------------+--------------------------------------------------+
|     snomed_code|         concept_name|                   snomed_text|snomed_pred[0].result|snomed_pred[0].metadata[resolved_text]|        snomed_pred[0].metadata[all_k_resolutions]|         snomed_pred[0].metadata[all_k_aux_labels]|
+----------------+---------------------+------------------------------+---------------------+--------------------------------------+--------------------------------------------------+--------------------------------------------------+
|       267032009|   Tired all the time|persisten feeling of tiredness|             84229001|                     extreme tiredness|extreme tiredness:::feeling tired a lot:::feeli...|Fatigue:::Tired all the time:::Feeling tired:::...|
|        22298006|Myocardial infarction|                  HE
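The `all_k_resolutions` and `all_k_aux_labels` metadata fields shown above are single strings whose entries are separated by `:::`. A minimal sketch for pairing each candidate text with its auxiliary label follows. It assumes the two lists are aligned by position, and it uses only the first two entries visible in the truncated output; `split_candidates` is a hypothetical helper.

```python
def split_candidates(field, sep=":::"):
    """Split a ':::'-delimited metadata string into a list of entries."""
    return field.split(sep) if field else []


# First two entries of each field, as visible in the truncated output above.
resolutions = "extreme tiredness:::feeling tired a lot"
aux_labels = "Fatigue:::Tired all the time"

# Assumption: the i-th resolution corresponds to the i-th auxiliary label.
pairs = list(zip(split_candidates(resolutions), split_candidates(aux_labels)))

for text, label in pairs:
    print(f"{text!r} -> {label!r}")
```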

In [None]:
preds_test_aux = aap_test_pred_aux.selectExpr(
    "snomed_code as ytrue", "snomed_pred[0].result as ypred"
).toPandas()
accuracy_score(preds_test_aux.ytrue, preds_test_aux.ypred)

0.8016147635524798

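## **Fine-tune Existing Models**

Here we continue training from the `biobertresolve_snomed_askapatient` model saved earlier, using the fine-tuning parameters described above.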

In [None]:
bertExtractor_ft = (
    medical.SentenceEntityResolverApproach()
    .setNeighbours(25)
    .setThreshold(1000)
    .setInputCols("bert_embeddings")
    .setNormalizedCol("snomed_text")
    .setLabelCol("snomed_code")
    .setOutputCol("snomed_pred")
    .setPretrainedModelPath("biobertresolve_snomed_askapatient")
    .setOverrideExistingCodes(False) # False (default) keeps all existing codes; True overrides them with the new data
    .setDropCodesList(["108367008", "3384011000036100"]) # If not set, keep all codes
)

In [None]:
%%time
model_ft = bertExtractor_ft.fit(snomed_data)

CPU times: user 5.63 s, sys: 702 ms, total: 6.33 s
Wall time: 17min 8s
