![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/EntityChunkEmbeddings.ipynb)

# **EntityChunkEmbeddings**

This notebook covers the parameters and usage of `EntityChunkEmbeddings`.

**📖 Learning Objectives:**

1. Comprehend the need for Entity Chunk Embeddings and their relationship with BERT Sentence embeddings.

2. Understand the concept of a weighted average vector representation of related entity chunks.

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation: [EntityChunkEmbeddings](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#entitychunkembeddings)

- Python Docs: [EntityChunkEmbeddings](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/embeddings/entity_chunk_embeddings/index.html)

- Scala Docs: [EntityChunkEmbeddings](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/embeddings/EntityChunkEmbeddings.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DEPENDENCY`, `CHUNK`

- Output: `SENTENCE_EMBEDDINGS`

## **🔎 Parameters**


- `targetEntities`: (dict) Maps each target entity to a list of its related entities. An empty list means that all other entities are treated as related to that target. Entity names are case-insensitive. *At least one target entity must be set.*

- `entityWeights`: (dict) The relative weights of the entities. If not set, all entities have equal weight. If the dictionary is non-empty, any entity missing from it gets a weight of 0. The notation `TARGET_ENTITY:RELATED_ENTITY` sets the weight of an entity when it is related to a specific target entity (e.g. `"DRUG:SYMPTOM" -> 0.3f`). Entity names are case-insensitive.

- `maxSyntacticDistance`: (Int) The maximum syntactic distance between the target (e.g. drug) entity and its related entities. Default value is 2.
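
The rules in the list above can be restated as a small plain-Python sketch. This is only an illustration of the documented behaviour, not the annotator's implementation: `is_related` and `lookup_weight` are hypothetical helpers, and the precedence of a `TARGET:RELATED` key over a plain entity key is an assumption.

```python
def is_related(target_entities, target, candidate):
    """Return True if `candidate` counts as related to `target`."""
    # Entity names are case-insensitive, so normalise everything to upper case.
    normalised = {
        key.upper(): [value.upper() for value in values]
        for key, values in target_entities.items()
    }
    target, candidate = target.upper(), candidate.upper()
    if target not in normalised or candidate == target:
        return False
    related = normalised[target]
    # An empty list means every other entity is related to the target.
    return not related or candidate in related


def lookup_weight(entity_weights, target, entity):
    """Resolve the weight of `entity` when it is attached to `target`."""
    if not entity_weights:
        return 1.0  # weights not set: all entities weigh the same
    normalised = {key.upper(): weight for key, weight in entity_weights.items()}
    # Assumption: a TARGET:RELATED key takes precedence over a plain entity key.
    relation_key = f"{target.upper()}:{entity.upper()}"
    if relation_key in normalised:
        return normalised[relation_key]
    # Entities missing from a non-empty dictionary get weight 0.
    return normalised.get(entity.upper(), 0.0)


print(is_related({"DRUG": []}, "drug", "Frequency"))
print(lookup_weight({"DRUG": 0.8}, "DRUG", "FORM"))
```

With these helpers, `{"DRUG": []}` relates every other entity to `DRUG`, while a non-empty weight dictionary that omits `FORM` gives `FORM` a weight of 0.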


### `setTargetEntities()`



In [None]:
documenter = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

sentence_detector = (
    nlp.SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")
)

tokenizer = nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")

embeddings = (
    nlp.WordEmbeddingsModel()
    .pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(["sentence", "token"])
    .setOutputCol("embeddings")
)

posology_ner_model = (
    medical.NerModel()
    .pretrained("ner_posology_large", "en", "clinical/models")
    .setInputCols(["sentence", "token", "embeddings"])
    .setOutputCol("ner")
)

ner_converter = (
    medical.NerConverterInternal()
    .setInputCols("sentence", "token", "ner")
    .setOutputCol("ner_chunk")
)

pos_tager = (
    nlp.PerceptronModel()
    .pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols("sentence", "token")
    .setOutputCol("pos_tag")
)

dependency_parser = (
    nlp.DependencyParserModel()
    .pretrained("dependency_conllu", "en")
    .setInputCols(["sentence", "pos_tag", "token"])
    .setOutputCol("dependencies")
)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]


Let's check the available entities in the model:

In [None]:
set([x.split("-")[-1] for x in posology_ner_model.getClasses()])

{'DOSAGE', 'DRUG', 'DURATION', 'FORM', 'FREQUENCY', 'O', 'ROUTE', 'STRENGTH'}
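
The expression in the cell above keeps the text after the last hyphen of each class name, which collapses tags that share a label but differ in prefix. A minimal sketch of that step, assuming the classes follow the usual `B-`/`I-` tagging scheme (the `classes` list here is a made-up sample, not the model's actual output):

```python
# Hypothetical tag list in IOB format, similar in shape to what getClasses() returns.
classes = ["O", "B-DRUG", "I-DRUG", "B-STRENGTH", "I-STRENGTH", "B-FORM"]

# Keep only the text after the last hyphen, then deduplicate with a set.
entities = {tag.split("-")[-1] for tag in classes}

print(sorted(entities))  # ['DRUG', 'FORM', 'O', 'STRENGTH']
```

The `O` label has no hyphen, so `split("-")[-1]` returns it unchanged, which is why it also appears in the notebook's output.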

Let's set `DRUG` as the target entity and relate it only to `STRENGTH`, `ROUTE`, and `FORM`.

In [None]:
entity_chunk_embeddings = (
    medical.EntityChunkEmbeddings()
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols(["ner_chunk", "dependencies"])
    .setOutputCol("drug_chunk_embeddings")
)

# DRUG is the target entity; only STRENGTH, ROUTE and FORM chunks
# are treated as related to it.
entity_chunk_embeddings.setTargetEntities(
    {"DRUG": ["STRENGTH", "ROUTE", "FORM"]}
)

rxnorm_re = (
    medical.SentenceEntityResolverModel.pretrained(
        "sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models"
    )
    .setInputCols(["drug_chunk_embeddings"])
    .setOutputCol("rxnorm_code")
    .setDistanceFunction("EUCLIDEAN")
)

rxnorm_pipeline_re = nlp.Pipeline(
    stages=[
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re,
    ]
)

rxnorm_model = rxnorm_pipeline_re.fit(
    spark.createDataFrame([[""]]).toDF("text")
)

sbiobert_base_cased_mli download started this may take some time.
[OK!]
sbiobertresolve_rxnorm_augmented_re download started this may take some time.
[OK!]
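
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`. As a rough intuition for what a Euclidean nearest-neighbour lookup does, here is a toy sketch in plain Python. The two-dimensional `index`, its code names, and the `query` vector are invented for illustration; the real resolver works on the model's own embeddings and index.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical index: candidate code -> embedding of its description.
index = {"CODE_A": [0.0, 0.0], "CODE_B": [3.0, 4.0]}

# Hypothetical embedding of a chunk to resolve.
query = [2.5, 3.5]

# Pick the candidate whose embedding lies closest to the query.
best = min(index, key=lambda code: euclidean(query, index[code]))
print(best)  # CODE_B
```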


In [None]:
data_df = spark.createDataFrame(
    [
        [
            "The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen."
        ],
        [
            "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"
        ],
    ]
).toDF("text")
data_df.show(truncate=150)

+----------------------------------------------------------------------------------------+
|                                                                                    text|
+----------------------------------------------------------------------------------------+
|   The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen.|
|The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet|
+----------------------------------------------------------------------------------------+



In [None]:
results = rxnorm_model.transform(data_df)

In [None]:
results.select(
    "drug_chunk_embeddings.result", "drug_chunk_embeddings.embeddings"
).show(truncate=200)

+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                              result|                                                                                                                                                                                              embeddings|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|               [metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]|[[0.13060856, 0.26946267, -0.507028, 0.772429, 0.7356907, 0.096247405, -0.5546375, 0.053429697, -0.55345106, 0.484

In [None]:
results.select(
    F.explode(
        F.arrays_zip(
            results.ner_chunk.result,
            results.ner_chunk.metadata,
            results.rxnorm_code.result,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("chunk"),
    F.expr("cols['1']['entity']").alias("ner_label"),
    F.expr("cols['2']").alias("rxnorm_code"),
).show(
    truncate=False
)

+----------+---------+-----------+
|chunk     |ner_label|rxnorm_code|
+----------+---------+-----------+
|metformin |DRUG     |860974     |
|500 mg    |STRENGTH |855313     |
|tablet    |FORM     |1747293    |
|2.5 mg    |STRENGTH |null       |
|coumadin  |DRUG     |null       |
|ibuprofen |DRUG     |null       |
|metformin |DRUG     |332809     |
|400 mg    |STRENGTH |855333     |
|coumadin  |DRUG     |202421     |
|5 mg      |STRENGTH |308135     |
|coumadin  |DRUG     |null       |
|amlodipine|DRUG     |null       |
|10 MG     |STRENGTH |null       |
|tablet    |FORM     |null       |
+----------+---------+-----------+
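
One caveat when reading this table: `arrays_zip` pairs arrays by position and pads the shorter one with nulls. The `drug_chunk_embeddings.result` column shown earlier holds three merged chunks for the first sentence, while `ner_chunk` holds six, so the three codes most likely belong to the merged chunks rather than to the NER chunks on the same rows. The sketch below emulates this in plain Python with the values from the first sentence, assuming the resolver emits one code per embedding in the same order:

```python
from itertools import zip_longest

# Values taken from the first row of the outputs above.
ner_chunks = ["metformin", "500 mg", "tablet", "2.5 mg", "coumadin", "ibuprofen"]
merged_chunks = ["metformin 500 mg tablet", "2.5 mg coumadin", "ibuprofen"]
rxnorm_codes = ["860974", "855313", "1747293"]

# Position-by-position pairing with None padding, mimicking arrays_zip.
by_ner_chunk = list(zip_longest(ner_chunks, rxnorm_codes))

# Pairing each code with the merged chunk it was presumably resolved from.
by_merged_chunk = list(zip(merged_chunks, rxnorm_codes))

for chunk, code in by_merged_chunk:
    print(chunk, "->", code)
```

Under that assumption, zipping `drug_chunk_embeddings.result` with `rxnorm_code.result` instead of `ner_chunk.result` would give one aligned row per merged chunk.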



### `setEntityWeights()`
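
As a rough illustration of what entity weights do, the sketch below combines three toy vectors into a single weighted average. It assumes the combined vector is normalised by the sum of the weights, which this notebook does not confirm; `weighted_average` is a hypothetical helper and the three-dimensional vectors stand in for real chunk embeddings.

```python
def weighted_average(vectors, weights):
    """Combine equally sized vectors into one, weighting each by its entity weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    dim = len(vectors[0])
    return [
        sum(weight * vector[i] for vector, weight in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Toy 3-dimensional vectors standing in for real chunk embeddings.
drug = [1.0, 0.0, 0.0]
strength = [0.0, 1.0, 0.0]
form = [0.0, 0.0, 1.0]

# DRUG dominates the result; STRENGTH and FORM contribute less.
combined = weighted_average([drug, strength, form], [0.8, 0.2, 0.2])
print([round(value, 3) for value in combined])
```

A larger weight pulls the combined vector towards that entity's embedding, so the resolver's nearest match is driven mostly by the heavily weighted entity.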



Setting weights at entity level:

In [None]:
entity_chunk_embeddings.setEntityWeights(
    {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}
)

rxnorm_pipeline_re = nlp.Pipeline(
    stages=[
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re,
    ]
)
rxnorm_model = rxnorm_pipeline_re.fit(
    spark.createDataFrame([[""]]).toDF("text")
)

In [None]:
results = rxnorm_model.transform(data_df)

In [None]:
results.select(
    F.explode(
        F.arrays_zip(
            results.ner_chunk.result,
            results.ner_chunk.metadata,
            results.rxnorm_code.result,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("chunk"),
    F.expr("cols['1']['entity']").alias("ner_label"),
    F.expr("cols['2']").alias("rxnorm_code"),
).show(
    truncate=False
)

+----------+---------+-----------+
|chunk     |ner_label|rxnorm_code|
+----------+---------+-----------+
|metformin |DRUG     |860974     |
|500 mg    |STRENGTH |855313     |
|tablet    |FORM     |1747293    |
|2.5 mg    |STRENGTH |null       |
|coumadin  |DRUG     |null       |
|ibuprofen |DRUG     |null       |
|metformin |DRUG     |332809     |
|400 mg    |STRENGTH |855333     |
|coumadin  |DRUG     |202421     |
|5 mg      |STRENGTH |308135     |
|coumadin  |DRUG     |null       |
|amlodipine|DRUG     |null       |
|10 MG     |STRENGTH |null       |
|tablet    |FORM     |null       |
+----------+---------+-----------+



Setting weights at relation level, using `TARGET_ENTITY:RELATED_ENTITY` keys:

In [None]:
entity_chunk_embeddings.setEntityWeights(
    {"DRUG:STRENGTH": 0.8, "DRUG:ROUTE": 0.2, "DRUG:FORM": 0.1}
)

rxnorm_pipeline_re = nlp.Pipeline(
    stages=[
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re,
    ]
)

rxnorm_model = rxnorm_pipeline_re.fit(
    spark.createDataFrame([[""]]).toDF("text")
)

results = rxnorm_model.transform(data_df)

In [None]:
results.select(
    F.explode(
        F.arrays_zip(
            results.ner_chunk.result,
            results.ner_chunk.metadata,
            results.rxnorm_code.result,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("chunk"),
    F.expr("cols['1']['entity']").alias("ner_label"),
    F.expr("cols['2']").alias("rxnorm_code"),
).show(
    truncate=False
)

+----------+---------+-----------+
|chunk     |ner_label|rxnorm_code|
+----------+---------+-----------+
|metformin |DRUG     |1359144    |
|500 mg    |STRENGTH |1373121    |
|tablet    |FORM     |410532     |
|2.5 mg    |STRENGTH |null       |
|coumadin  |DRUG     |null       |
|ibuprofen |DRUG     |null       |
|metformin |DRUG     |361818     |
|400 mg    |STRENGTH |211920     |
|coumadin  |DRUG     |410532     |
|5 mg      |STRENGTH |1089071    |
|coumadin  |DRUG     |null       |
|amlodipine|DRUG     |null       |
|10 MG     |STRENGTH |null       |
|tablet    |FORM     |null       |
+----------+---------+-----------+



### `setMaxSyntacticDistance()`
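
One common way to think about syntactic distance is the number of edges on the shortest path between two tokens in the dependency tree. The sketch below computes that with a breadth-first search. Both the definition and the toy parse are assumptions made for illustration; the annotator's exact measure may differ.

```python
from collections import deque


def syntactic_distance(heads, start, goal):
    """Number of dependency edges on the shortest path between two tokens.

    `heads[i]` is the index of token i's head, or -1 for the root.
    """
    # Build an undirected adjacency list from the head pointers.
    neighbours = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].add(head)
            neighbours[head].add(child)
    # Breadth-first search from `start` until `goal` is reached.
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # tokens are not connected


# Hypothetical parse of "given metformin 500 mg tablet".
tokens = ["given", "metformin", "500", "mg", "tablet"]
heads = [-1, 0, 3, 1, 1]

print(syntactic_distance(heads, 1, 3))  # metformin -> mg
print(syntactic_distance(heads, 2, 4))  # 500 -> tablet
```

In the run below, raising the limit from the default of 2 to 5 leaves the resolved codes identical to those in the previous table.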



In [None]:
entity_chunk_embeddings.setMaxSyntacticDistance(5)

rxnorm_pipeline_re = nlp.Pipeline(
    stages=[
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re,
    ]
)

rxnorm_model = rxnorm_pipeline_re.fit(
    spark.createDataFrame([[""]]).toDF("text")
)

results = rxnorm_model.transform(data_df)

In [None]:
results.select(
    F.explode(
        F.arrays_zip(
            results.ner_chunk.result,
            results.ner_chunk.metadata,
            results.rxnorm_code.result,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("chunk"),
    F.expr("cols['1']['entity']").alias("ner_label"),
    F.expr("cols['2']").alias("rxnorm_code"),
).show(
    truncate=False
)

+----------+---------+-----------+
|chunk     |ner_label|rxnorm_code|
+----------+---------+-----------+
|metformin |DRUG     |1359144    |
|500 mg    |STRENGTH |1373121    |
|tablet    |FORM     |410532     |
|2.5 mg    |STRENGTH |null       |
|coumadin  |DRUG     |null       |
|ibuprofen |DRUG     |null       |
|metformin |DRUG     |361818     |
|400 mg    |STRENGTH |211920     |
|coumadin  |DRUG     |410532     |
|5 mg      |STRENGTH |1089071    |
|coumadin  |DRUG     |null       |
|amlodipine|DRUG     |null       |
|10 MG     |STRENGTH |null       |
|tablet    |FORM     |null       |
+----------+---------+-----------+

