![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/BertSentenceChunkEmbeddings.ipynb)

# **BertSentenceChunkEmbeddings**

This notebook covers the parameters and usage of the `BertSentenceChunkEmbeddings` annotator.

**📖 Learning Objectives:**

1. Understand how to use `BertSentenceChunkEmbeddings`.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [BertSentenceChunkEmbeddings](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#bertsentencechunkembeddings)

- Python Docs : [BertSentenceChunkEmbeddings](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/embeddings/bert_sentence_embeddings/index.html#sparknlp_jsl.annotator.embeddings.bert_sentence_embeddings.BertSentenceChunkEmbeddings.name)

- Scala Docs : [BertSentenceChunkEmbeddings](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/embeddings/BertSentenceChunkEmbeddings.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/05.6.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb).

## **📜 Background**


`BertSentenceChunkEmbeddings` is an annotator that produces chunk embeddings which take into account the context of the sentence the chunk appears in. It is an extension of `BertSentenceEmbeddings` that combines the embedding of a chunk with the embedding of the surrounding sentence. For each input chunk annotation, it finds the corresponding sentence, computes the BERT sentence embedding of both the chunk and the sentence, and averages them. The resulting embeddings are useful when you need a numerical representation of a text chunk that is sensitive to the context it appears in.
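
The step "finds the corresponding sentence" can be pictured with character offsets. The following is a minimal sketch in plain Python, assuming that a chunk belongs to the sentence whose character range contains it; `Span`, `span_of`, and `find_enclosing_sentence` are hypothetical names for illustration and are not part of Spark NLP.

```python
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical stand-in for an annotation: inclusive character offsets plus text."""
    begin: int
    end: int
    text: str


def span_of(fragment: str, text: str) -> Span:
    """Build a Span for the first occurrence of `fragment` in `text`."""
    begin = text.index(fragment)
    return Span(begin, begin + len(fragment) - 1, fragment)


def find_enclosing_sentence(chunk: Span, sentences: List[Span]) -> Optional[Span]:
    """Return the first sentence whose character range contains the chunk."""
    for sentence in sentences:
        if sentence.begin <= chunk.begin and chunk.end <= sentence.end:
            return sentence
    return None


text = "She has diabetes. She denies chest pain."
sentences = [span_of("She has diabetes.", text), span_of("She denies chest pain.", text)]
chunk = span_of("chest pain", text)

# The chunk is paired with the second sentence, which supplies its context.
print(find_enclosing_sentence(chunk, sentences).text)
```

The annotator then embeds both the chunk text and the paired sentence text; how it resolves the pairing internally is not shown in this notebook, so treat the containment rule above as an assumption.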

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at [Models Hub](https://nlp.johnsnowlabs.com/models?task=Embeddings).

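According to the documentation, the default model is `sent_small_bert_L2_768` if no name is provided. Note that in the run recorded in this notebook, calling `pretrained()` without arguments downloads `sbiobert_base_cased_mli`.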

Sources :

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf)

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


In [None]:
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`

- Output: `SENTENCE_EMBEDDINGS`

## **🔎 Parameters**


- `inputCols`: The names of the columns containing the input annotations. Accepts a single column name or an array of names.
- `outputCol`: The name of the output column that holds the generated `SENTENCE_EMBEDDINGS` annotations. Only one column can be specified.
- `chunkWeight`: Relative weight of the chunk embeddings in comparison to the sentence embeddings. The value should be between 0 and 1. The default is 0.5, which gives the chunk and sentence embeddings equal weight.
- `maxSentenceLength`: Maximum sentence length to process. The default is 128.
- `caseSensitive`: Whether the input text is treated as case sensitive when computing embeddings.
- `doExceptionHandling`: If `True`, an exception raised on a record is reported as a warning and processing continues. The default is `False`.



All the parameters can be set using the corresponding setter method in camel case, for example `.setInputCols()`.
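
The naming convention can be sketched in plain Python: prefix the parameter name with `set` and capitalize its first letter. The helper names `setter_name` and `apply_params`, and the `FakeAnnotator` class, are hypothetical and exist only so the sketch runs without Spark NLP installed.

```python
def setter_name(param: str) -> str:
    """Map a parameter name such as 'chunkWeight' to its setter, 'setChunkWeight'."""
    return "set" + param[0].upper() + param[1:]


def apply_params(annotator, params: dict):
    """Call the matching setter for every entry in `params` and return the annotator."""
    for name, value in params.items():
        getattr(annotator, setter_name(name))(value)
    return annotator


class FakeAnnotator:
    """Hypothetical stand-in that records the values passed to its setters."""

    def __init__(self):
        self.values = {}

    def setChunkWeight(self, value):
        self.values["chunkWeight"] = value
        return self

    def setMaxSentenceLength(self, value):
        self.values["maxSentenceLength"] = value
        return self


configured = apply_params(FakeAnnotator(), {"chunkWeight": 0.5, "maxSentenceLength": 128})
print(setter_name("inputCols"))
print(configured.values)
```

With a real annotator the setters are usually chained directly, as the pipeline cells in this notebook do with `.setInputCols(...)` and `.setOutputCol(...)`.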

### `inputCols` and `outputCol`

These parameters define the names of the columns containing the sentence and `CHUNK` annotations that `BertSentenceChunkEmbeddings` needs as input, and the name of the new column containing the generated embeddings.

Let's define a pipeline to process raw texts into `SENTENCE` and `CHUNK` annotations along with the `BertSentenceChunkEmbeddings`:

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embedding = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embedding")

nerTagger = medical.NerModel.pretrained('ner_clinical_large', 'en', 'clinical/models') \
    .setInputCols(["sentence", "token", "word_embedding"]) \
    .setOutputCol("ner")

nerConverter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token","ner"]) \
    .setOutputCol("ner_chunk")

sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained() \
    .setInputCols(["sentence", "ner_chunk"]) \
    .setOutputCol("sentence_chunk_embeddings")

pipeline = nlp.Pipeline(
    stages = [documentAssembler,
              sentence,
              tokenizer,
              word_embedding,
              nerTagger,
              nerConverter,
              sentence_chunk_embeddings
    ])



embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical_large download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
[OK!]


In [None]:
text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , and associated with an acute hepatitis , presented with a one-week history of polyuria , poor appetite , and vomiting."

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

In [None]:
result.select("sentence_chunk_embeddings").show(truncate=False)


In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.sentence_chunk_embeddings.result,
                                                 result.sentence_chunk_embeddings.embeddings)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("embeddings"))

result_df.show(50, truncate=False)

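
Once the rows are collected to the driver, each `embeddings` value is a list of floats, so two chunks can be compared with cosine similarity. The sketch below uses short toy vectors rather than real model output, and `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for 768-dimensional chunk embeddings.
vec_a = [0.2, 0.1, 0.7]
vec_b = [0.2, 0.1, 0.7]
vec_c = [0.7, -0.2, 0.0]

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

On the DataFrame above, one possible usage would be `rows = result_df.collect()` followed by `cosine_similarity(rows[0]["embeddings"], rows[1]["embeddings"])`.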

In [None]:
sentence_chunk_embeddings.getStorageRef()

'BERT_SENTENCE_EMBEDDINGS_0bee53f1b2cc'

In [None]:
sentence_chunk_embeddings.getDimension()

768

In [None]:
sentence_chunk_embeddings.getMaxSentenceLength()

128

In [None]:
sentence_chunk_embeddings.getCaseSensitive()

True

### `chunkWeight`

Relative weight of chunk embeddings in comparison to sentence embeddings.

The value should be between 0 and 1. The default is 0.5, which gives the chunk and sentence embeddings equal weight.
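
One way to read this parameter is as a linear interpolation between the two vectors. The sketch below is an assumption consistent with the description (0.5 gives equal weight, 0 keeps only the sentence, 1 keeps only the chunk); it is not taken from the annotator's source code, and `combine` is a hypothetical helper.

```python
def combine(chunk_vec, sentence_vec, chunk_weight=0.5):
    """Weighted average: chunk_weight * chunk + (1 - chunk_weight) * sentence."""
    if not 0.0 <= chunk_weight <= 1.0:
        raise ValueError("chunk_weight must be between 0 and 1")
    if len(chunk_vec) != len(sentence_vec):
        raise ValueError("vectors must have the same length")
    return [
        chunk_weight * c + (1.0 - chunk_weight) * s
        for c, s in zip(chunk_vec, sentence_vec)
    ]


# Toy 2-dimensional vectors standing in for real embeddings.
chunk_vec = [2.0, 4.0]
sentence_vec = [0.0, 8.0]

print(combine(chunk_vec, sentence_vec))       # equal weight
print(combine(chunk_vec, sentence_vec, 0.0))  # sentence embedding only
print(combine(chunk_vec, sentence_vec, 1.0))  # chunk embedding only
```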

**.setChunkWeight(0)**

When `chunkWeight` is set to 0, the chunk embedding receives zero weight and the embedding of the surrounding sentence alone is returned.

In [None]:
sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained() \
    .setInputCols(["sentence", "ner_chunk"]) \
    .setOutputCol("sentence_chunk_embeddings")\
    .setChunkWeight(0)

pipeline = nlp.Pipeline(
    stages = [documentAssembler,
              sentence,
              tokenizer,
              word_embedding,
              nerTagger,
              nerConverter,
              sentence_chunk_embeddings
              ])


sbiobert_base_cased_mli download started this may take some time.
[OK!]


In [None]:
result = pipeline.fit(data).transform(data)

result_df = result.select(F.explode(F.arrays_zip(result.sentence_chunk_embeddings.result,
                                                 result.sentence_chunk_embeddings.embeddings)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("embeddings"))

result_df.show(50, truncate=False)


**.setChunkWeight(1)**

When `chunkWeight` is set to 1, the chunk embedding receives the full weight and the sentence embedding does not contribute to the result.

In [None]:
sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained() \
    .setInputCols(["sentence", "ner_chunk"]) \
    .setOutputCol("sentence_chunk_embeddings")\
    .setChunkWeight(1)

pipeline = nlp.Pipeline(
    stages = [documentAssembler,
              sentence,
              tokenizer,
              word_embedding,
              nerTagger,
              nerConverter,
              sentence_chunk_embeddings
              ])


sbiobert_base_cased_mli download started this may take some time.
[OK!]


In [None]:
result = pipeline.fit(data).transform(data)

result_df = result.select(F.explode(F.arrays_zip(result.sentence_chunk_embeddings.result,
                                                 result.sentence_chunk_embeddings.embeddings)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("embeddings"))

result_df.show(50, truncate=False)


### `doExceptionHandling`

The `doExceptionHandling` parameter makes an annotator robust to corrupted inputs that would otherwise interrupt processing. When it is enabled, the annotator attempts to process the data as usual. If it encounters exception-causing data (e.g., a corrupted record or document), it emits a warning with the relevant exception message and continues processing the rest of the records in the same batch. By default, this parameter is set to `False`, meaning the process throws an exception and halts to inform users of the issue.

The parameter is enabled with `.setDoExceptionHandling(True)`. The example below sets it on an `AssertionFilterer`:

```
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setDoExceptionHandling(True)
```
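
The difference between the two settings can be sketched in plain Python. This is an illustration of the described behavior only, not the annotator's implementation: `process_batch` and `toy_annotate` are hypothetical, and returning `None` for a failed record is an assumption.

```python
import warnings


def process_batch(records, annotate, do_exception_handling=False):
    """Annotate each record; on failure either re-raise or warn and keep going."""
    results = []
    for record in records:
        try:
            results.append(annotate(record))
        except Exception as exc:
            if not do_exception_handling:
                raise  # default behavior: halt and surface the error
            warnings.warn(f"Skipping record {record!r}: {exc}")
            results.append(None)  # assumption: failed records yield no annotation
    return results


def toy_annotate(record):
    """Hypothetical annotator that fails on anything that is not a string."""
    if not isinstance(record, str):
        raise TypeError("expected a string")
    return record.upper()


# With handling enabled, the corrupted middle record does not stop the batch.
print(process_batch(["first", None, "third"], toy_annotate, do_exception_handling=True))
```

With `do_exception_handling=False`, the same call raises `TypeError` at the second record and the third record is never processed.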

