![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **EntityChunkEmbeddings**

This notebook covers the parameters and usage of `EntityChunkEmbeddings`. This annotator computes a weighted average of the embeddings of multiple named entity chunk annotations.

**📖 Learning Objectives:**

1. Understand the need for Entity Chunk Embeddings and their relationship with BERT sentence embeddings, including the concept of a weighted average vector representation of related entity chunks.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [EntityChunkEmbeddings](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#entitychunkembeddings)

- Python Docs : [EntityChunkEmbeddings](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/embeddings/entity_chunk_embeddings/index.html)

- Scala Docs : [EntityChunkEmbeddings](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/embeddings/EntityChunkEmbeddings.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

nlp.install()

In [4]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/4.4.4.spark_nlp_for_healthcare (2).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.4, 💊Spark-Healthcare==4.4.4, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DEPENDENCY`, `CHUNK`

- Output: `SENTENCE_EMBEDDINGS`

## **🔎 Parameters**


- `targetEntities`: (dict) The target entities mapped to lists of their related entities. A target entity with an empty list of related entities means all other entities are assumed to be related to it. Entity names are case-insensitive.

- `entityWeights`: (dict) The relative weights of the entities. If not set, all entities have equal weights. If the dictionary is non-empty and an entity is not in it, that entity's weight is set to 0. The notation TARGET_ENTITY:RELATED_ENTITY can be used to specify the weight of an entity that is related to a specific target entity (e.g. "DRUG:SYMPTOM" -> 0.3f). Entity names are case-insensitive.

- `maxSyntacticDistance`: (Int) Maximal syntactic distance between a target entity (for example, a drug) and its related entities. Default value is 2.


### `setMaxSyntacticDistance()`



Maximal syntactic distance between related entities. Default value is 2.
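
To build some intuition for this parameter, here is a minimal sketch that treats syntactic distance as the number of dependency edges on the path between two tokens. The notebook does not state how the library measures the distance, so this definition is an assumption. The helper `tree_distance` and the toy parse are hypothetical and hand-written, not output of the dependency parser.

```python
from collections import deque


def tree_distance(heads, a, b):
    """Count the dependency edges on the path between tokens a and b.

    heads[i] is the index of token i's head, or -1 for the root.
    Assumption: this edge count is only one plausible reading of
    "syntactic distance"; the library may define it differently.
    """
    # Turn head pointers into an undirected adjacency map.
    neighbors = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].add(head)
            neighbors[head].add(child)

    # Breadth-first search from a until b is reached.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # tokens are not connected


# Hand-written toy parse for "given metformin 500 mg tablet".
tokens = ["given", "metformin", "500", "mg", "tablet"]
heads = [-1, 0, 3, 1, 1]

drug_to_number = tree_distance(heads, 1, 2)  # metformin -> mg -> 500
verb_to_number = tree_distance(heads, 0, 2)  # given -> metformin -> mg -> 500
```

Under this reading, a limit of 2 would still pair "metformin" with "500", while a token three edges away would be left out unless the limit is raised to 3.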

In [15]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

posology_ner_model = medical.NerModel()\
    .pretrained("ner_posology_large", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pos_tager = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos_tag")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tag", "token"])\
    .setOutputCol("dependencies")

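# Combine each NER chunk with its syntactically related chunks into one embedding.
# maxSyntacticDistance is raised here from its default of 2 to 3.
entity_chunk_embeddings = medical.EntityChunkEmbeddings()\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(3)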


rxnorm_re = medical.SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented_re", "en","clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipeline_re = nlp.Pipeline(
    stages = [
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re
        ])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
[OK!]
sbiobertresolve_rxnorm_augmented_re download started this may take some time.
[OK!]


In [16]:
sampleText = ["The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen.",
              "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"]

sample_df = pd.DataFrame({'text':sampleText}).reset_index()

In [18]:
data_df = spark.createDataFrame(sample_df)

results = rxnorm_pipeline_re.fit(data_df).transform(data_df)

In [19]:
results.select("index", F.explode(F.arrays_zip(results.ner_chunk.result,
                                               results.ner_chunk.metadata)).alias("cols")) \
       .select("index", F.expr("cols['0']").alias("chunk"),
                        F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----+----------+---------+
|index|chunk     |ner_label|
+-----+----------+---------+
|0    |metformin |DRUG     |
|0    |500 mg    |STRENGTH |
|0    |tablet    |FORM     |
|0    |2.5 mg    |STRENGTH |
|0    |coumadin  |DRUG     |
|0    |ibuprofen |DRUG     |
|1    |metformin |DRUG     |
|1    |400 mg    |STRENGTH |
|1    |coumadin  |DRUG     |
|1    |5 mg      |STRENGTH |
|1    |coumadin  |DRUG     |
|1    |amlodipine|DRUG     |
|1    |10 MG     |STRENGTH |
|1    |tablet    |FORM     |
+-----+----------+---------+



### `setTargetEntities()`



Target entities and their related entities. A target entity mapped to an empty list of related entities means all other entities are assumed to be related to it. The default value is an empty dictionary, meaning no target entities are specified.

Entity names are case insensitive.
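
The rules above can be sketched in plain Python. This is only an illustration of the described semantics (an empty list relates every other entity, and names are compared case-insensitively). The helper `is_related` is hypothetical and not part of the library.

```python
def is_related(target_entities, target, candidate):
    """Return True if `candidate` counts as related to `target`.

    Hypothetical helper that mirrors the documented rules; it does not
    call the library and may differ from its internal behavior.
    """
    # Normalize all names so that the comparison is case-insensitive.
    normalized = {
        key.upper(): [name.upper() for name in related]
        for key, related in target_entities.items()
    }
    target, candidate = target.upper(), candidate.upper()

    if target not in normalized:
        return False  # not declared as a target entity
    related = normalized[target]
    if not related:
        # Empty list: every other entity is treated as related.
        return candidate != target
    return candidate in related


explicit = {"DRUG": ["STRENGTH", "ROUTE", "FORM"]}
open_ended = {"DRUG": []}

strength_ok = is_related(explicit, "drug", "strength")
frequency_ok = is_related(explicit, "DRUG", "FREQUENCY")
frequency_open = is_related(open_ended, "DRUG", "FREQUENCY")
```

With the explicit mapping, only `STRENGTH`, `ROUTE`, and `FORM` chunks qualify; with the empty list, a `FREQUENCY` chunk would qualify as well.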

In [24]:
entity_chunk_embeddings = medical.EntityChunkEmbeddings()\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(3)

entity_chunk_embeddings.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})


rxnorm_re = medical.SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented_re", "en","clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipeline_re = nlp.Pipeline(
    stages = [
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re
        ])

sbiobert_base_cased_mli download started this may take some time.
[OK!]
sbiobertresolve_rxnorm_augmented_re download started this may take some time.
[OK!]


In [25]:
sampleText = ["The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen.",
              "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"]

sample_df = pd.DataFrame({'text':sampleText}).reset_index()

In [26]:
data_df = spark.createDataFrame(sample_df)

results = rxnorm_pipeline_re.fit(data_df).transform(data_df)



- The relations between entities were specified with the `.setTargetEntities` parameter of the `EntityChunkEmbeddings` annotator:    
`.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})`


In [27]:
results.select('index','drug_chunk_embeddings.result').show(truncate = False)

+-----+--------------------------------------------------------------------+
|index|result                                                              |
+-----+--------------------------------------------------------------------+
|0    |[metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]               |
|1    |[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|
+-----+--------------------------------------------------------------------+



### `setEntityWeights()`



Relative weights of entities. By default the dictionary is empty and all entities have equal weights. If it is non-empty and some entity is not in it, then its weight is set to 0.

The notation TARGET_ENTITY:RELATED_ENTITY can be used to specify the weight of an entity that is related to a specific target entity (e.g. "DRUG:SYMPTOM" -> 0.3f).

Entity names are case insensitive.
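
As a rough illustration of these lookup rules, the sketch below resolves a weight from a dictionary in plain Python. The helper `entity_weight` is hypothetical. Letting a `TARGET_ENTITY:RELATED_ENTITY` key take precedence over a plain entity key is an assumption, since the text above does not say which one wins.

```python
def entity_weight(weights, entity, target=None):
    """Resolve the weight of `entity`, optionally relative to `target`.

    Hypothetical helper that mirrors the documented rules; it is not
    the library's implementation.
    """
    if not weights:
        return 1.0  # empty dictionary: all entities weigh the same

    # Entity names are case-insensitive, so normalize the keys.
    normalized = {key.upper(): float(value) for key, value in weights.items()}
    entity = entity.upper()

    if target is not None:
        pair_key = f"{target.upper()}:{entity}"
        if pair_key in normalized:
            # Assumption: the pair-specific weight wins over the plain one.
            return normalized[pair_key]

    # Entities missing from a non-empty dictionary get weight 0.
    return normalized.get(entity, 0.0)


weights = {"DRUG": 0.8, "drug:symptom": 0.3}

default_weight = entity_weight({}, "DRUG")
pair_weight = entity_weight(weights, "Symptom", target="DRUG")
missing_weight = entity_weight(weights, "FORM")
```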

In [28]:
entity_chunk_embeddings = medical.EntityChunkEmbeddings()\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(3)

entity_chunk_embeddings.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})
entity_chunk_embeddings.setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})



rxnorm_re = medical.SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented_re", "en","clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipeline_re = nlp.Pipeline(
    stages = [
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re
        ])

sbiobert_base_cased_mli download started this may take some time.
[OK!]
sbiobertresolve_rxnorm_augmented_re download started this may take some time.
[OK!]


In [29]:
sampleText = ["The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen.",
              "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"]

sample_df = pd.DataFrame({'text':sampleText}).reset_index()

In [30]:
data_df = spark.createDataFrame(sample_df)

results = rxnorm_pipeline_re.fit(data_df).transform(data_df)

In [31]:
results.select("index", F.explode(F.arrays_zip(results.ner_chunk.result,
                                               results.ner_chunk.metadata)).alias("cols")) \
       .select("index", F.expr("cols['0']").alias("chunk"),
                        F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----+----------+---------+
|index|chunk     |ner_label|
+-----+----------+---------+
|0    |metformin |DRUG     |
|0    |500 mg    |STRENGTH |
|0    |tablet    |FORM     |
|0    |2.5 mg    |STRENGTH |
|0    |coumadin  |DRUG     |
|0    |ibuprofen |DRUG     |
|1    |metformin |DRUG     |
|1    |400 mg    |STRENGTH |
|1    |coumadin  |DRUG     |
|1    |5 mg      |STRENGTH |
|1    |coumadin  |DRUG     |
|1    |amlodipine|DRUG     |
|1    |10 MG     |STRENGTH |
|1    |tablet    |FORM     |
+-----+----------+---------+




- The relations between entities were specified with the `.setTargetEntities` parameter of the `EntityChunkEmbeddings` annotator:    
`.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})`

- `EntityChunkEmbeddings` calculates the embeddings of the combined chunks according to the weights specified with the `.setEntityWeights` parameter:

`.setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})`
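
To make the weighting concrete, here is a minimal sketch of a weighted average over toy two-dimensional vectors using these weights. The vectors are made up for illustration. Dividing by the sum of the weights is an assumption, because the notebook does not state how the library normalizes. `weighted_average` is a hypothetical helper.

```python
def weighted_average(vectors, weights):
    """Average equal-length vectors, scaling each one by its weight.

    Assumption: the result is normalized by the sum of the weights;
    the library may normalize differently.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Toy stand-ins for the embeddings of a DRUG, STRENGTH and FORM chunk.
drug = [1.0, 0.0]
strength = [0.0, 1.0]
form = [1.0, 1.0]

combined = weighted_average([drug, strength, form], [0.8, 0.2, 0.2])
```

Because `DRUG` carries the largest weight, the combined vector stays closest to the drug vector, with the strength and form chunks shifting it only slightly.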

In [32]:
results.select('index','drug_chunk_embeddings.result').show(truncate = False)

+-----+--------------------------------------------------------------------+
|index|result                                                              |
+-----+--------------------------------------------------------------------+
|0    |[metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]               |
|1    |[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|
+-----+--------------------------------------------------------------------+

