![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/ChunkKeyPhraseExtraction.ipynb)

# **ChunkKeyPhraseExtraction**

This notebook covers the parameters and usage of `ChunkKeyPhraseExtraction`, an annotator that extracts key phrases from text.

**📖 Learning Objectives:**

1. Understand how to extract key phrases from texts.

2. Become comfortable using the different parameters of `ChunkKeyPhraseExtraction`.


**🔗 Helpful Links:**

- Documentation : [ChunkKeyPhraseExtraction](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunkkeyphraseextraction)

- Python Docs : [ChunkKeyPhraseExtraction](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/chunk_key_phrase_extraction/index.html#sparknlp_jsl.annotator.chunker.chunk_key_phrase_extraction.ChunkKeyPhraseExtraction)

- Scala Docs : [ChunkKeyPhraseExtraction](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/chunker/ChunkKeyPhraseExtraction.html)

- For extended examples of usage, see the [Spark Healthcare NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

## **📜 Background**


`ChunkKeyPhraseExtraction` extracts key phrases from texts. The model compares the chunks against the corresponding sentences/documents and selects the chunks that are most representative of the broader text context (i.e., the document or the sentence they belong to). This makes it possible, for example, to get a quick overview of a document from its most relevant phrases.
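
The idea of comparing chunks against their context can be sketched in plain Python. This is only an illustration, not the annotator's implementation: it assumes each chunk and the document have already been turned into embedding vectors and that cosine similarity is the comparison measure. The vectors and the names `cosine` and `rank_chunks` are made up for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_chunks(doc_vec, chunk_vecs, top_n=3):
    """Return the top_n (chunk, similarity) pairs, most similar first."""
    scored = [(chunk, cosine(doc_vec, vec)) for chunk, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up 3-dimensional vectors, for illustration only.
doc_vec = [1.0, 1.0, 0.0]
chunk_vecs = {
    "chunk_a": [1.0, 0.9, 0.1],  # points almost the same way as the document
    "chunk_b": [0.0, 0.2, 1.0],  # points in a very different direction
    "chunk_c": [1.0, 0.0, 0.0],  # partially aligned with the document
}

ranked = rank_chunks(doc_vec, chunk_vecs, top_n=2)
for chunk, similarity in ranked:
    print(chunk, round(similarity, 4))
```

Chunks whose vectors point in nearly the same direction as the document vector rank first, which is the sense in which a chunk is "representative" of its context here.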

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_10494.json to spark_nlp_for_healthcare_spark_ocr_10494 (1).json


In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_10494 (1).json
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_10494 (1).json
👌 JSL-Home is up to date! 
👌 Everything is already installed, no changes made


In [None]:
import pyspark.sql.functions as F

spark = nlp.start()

Spark Session already created, some configs may not take.
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_10494 (1).json


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT` , `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**




- `setConcatenateSentences(value: Boolean)`: Concatenate the input sentence/document annotations before computing their embedding. Default: `true`.

- `setDivergence(value: Float)`: Set the level of divergence of the extracted key phrases. The value must be in the interval [0, 1]. Default: `0.0`.

- `setDocumentLevelProcessing(value: Boolean)`: Extract key phrases from the whole document (`true`) or from particular sentences which the chunks refer to (`false`). Default: `true`.

- `setDropPunctuation(value: Boolean)`: Remove punctuation marks from input chunks.

- `setSelectMostDifferent(value: Boolean)`: Return the top N key phrases that are the most different from each other.

- `setTopN(value: Int)`: Set the number of key phrases to extract. Default: `3`.




### topN

Set the number of key phrases to extract. The default value is 3.



In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")\
    .setCustomBounds(['\n'])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars([".", ",", ";", ":", "!", "?", "*", "(", ")", "\"", "'","+","%","-",'='])\
    .setSplitChars([r'\[', r'\]', '\n'])

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# The ner_jsl clinical NER model is used
jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setTopN(1)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+-------------+---------------------+----------+
|        token|            ner_label|confidence|
+-------------+---------------------+----------+
|          The|                    O|    0.9997|
|      patient|                    O|    0.9818|
|          was|                    O|    0.9739|
|   prescribed|                    O|    0.8993|
|            1|             B-Dosage|    0.9827|
|      capsule|             I-Dosage|     0.268|
|           of|                    O|    0.8908|
|        Advil|     B-Drug_BrandName|    0.9617|
|          for|           B-Duration|    0.9878|
|            5|           I-Duration|    0.9064|
|         days|           I-Duration|    0.9584|
|            .|                    O|    0.9941|
|           He|             B-Gender|       1.0|
|          was|                    O|      0.99|
|         seen|                    O|    0.9609|
|           by|                    O|    0.9767|
|          the|                    O|     0.869|
|endocrinology|     

In [None]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+---------------+------------------+------------------+
|result        |entity         |DocumentSimilarity|MMRScore          |
+--------------+---------------+------------------+------------------+
|insulin lispro|Drug_Ingredient|0.4873421379191952|0.4873421379191952|
+--------------+---------------+------------------+------------------+



### divergence

The divergence value determines how different from each other the extracted key phrases are. The value must be in the interval [0, 1]. The higher the value, the more divergence is enforced. The default value is 0.0.

In [None]:
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setDivergence(0.4)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+-------------------+------------------+--------------------+
|result        |entity             |DocumentSimilarity|MMRScore            |
+--------------+-------------------+------------------+--------------------+
|insulin lispro|Drug_Ingredient    |0.4873421379191952|0.2924052943706591  |
|discharged    |Admission_Discharge|0.416718843385142 |0.18774081504282925 |
|1000 mg       |Strength           |0.3377889480145588|0.046319304486829194|
+--------------+-------------------+------------------+--------------------+
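
The `MMRScore` column suggests a maximal marginal relevance (MMR) style of selection. The sketch below is an assumption rather than the library's code: it scores each candidate as `(1 - divergence) * DocumentSimilarity - divergence * (highest similarity to an already selected phrase)`, which agrees with the first row above (0.6 × 0.4873 ≈ 0.2924). The `DocumentSimilarity` values are copied from the outputs in this notebook; the pairwise phrase similarities in `pair_sim` are invented for illustration, and `mmr_select` is a hypothetical helper.

```python
def mmr_select(doc_sim, pair_sim, top_n=3, divergence=0.0):
    """Greedily pick top_n phrases, trading relevance against redundancy.

    doc_sim:  {phrase: similarity to the document}
    pair_sim: {frozenset({a, b}): similarity between phrases a and b}
    """
    selected = []                # (phrase, score) pairs in pick order
    remaining = sorted(doc_sim)  # sorted for deterministic tie-breaking

    def score(phrase):
        # Redundancy is the highest similarity to anything already picked.
        redundancy = max(
            (pair_sim[frozenset((phrase, chosen))] for chosen, _ in selected),
            default=0.0,
        )
        return (1 - divergence) * doc_sim[phrase] - divergence * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append((best, score(best)))
        remaining.remove(best)
    return selected

# DocumentSimilarity values taken from the outputs shown in this notebook.
doc_sim = {
    "insulin lispro": 0.4873421379191952,
    "insulin glargine": 0.4565283676909086,
    "discharged": 0.416718843385142,
    "1000 mg": 0.3377889480145588,
}
# Hypothetical phrase-to-phrase similarities (not produced by the model).
pair_sim = {
    frozenset(("insulin lispro", "insulin glargine")): 0.80,
    frozenset(("insulin lispro", "discharged")): 0.16,
    frozenset(("insulin lispro", "1000 mg")): 0.39,
    frozenset(("insulin glargine", "discharged")): 0.20,
    frozenset(("insulin glargine", "1000 mg")): 0.30,
    frozenset(("discharged", "1000 mg")): 0.20,
}

plain = mmr_select(doc_sim, pair_sim, top_n=3, divergence=0.0)
diverse = mmr_select(doc_sim, pair_sim, top_n=3, divergence=0.4)
print([phrase for phrase, _ in plain])
print([phrase for phrase, _ in diverse])
```

With these invented pairwise values, a divergence of 0.4 pushes out the second insulin phrase in favour of less redundant ones, while a divergence of 0.0 reduces the score to plain document similarity.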



### documentLevelProcessing

If set to True, the model will extract key phrases from the whole document. If set to False, the model will extract key phrases from each sentence separately. The default value is True.

In [None]:
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setDocumentLevelProcessing(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+----------------+-------------------+-------------------+-------------------+
|result          |entity             |DocumentSimilarity |MMRScore           |
+----------------+-------------------+-------------------+-------------------+
|1 capsule       |Dosage             |0.5559222282402198 |0.5559222282402198 |
|for 5 days      |Duration           |0.51812956000978   |0.51812956000978   |
|insulin lispro  |Drug_Ingredient    |0.46288566725284874|0.46288566725284874|
|Advil           |Drug_BrandName     |0.44433004633805423|0.44433004633805423|
|discharged      |Admission_Discharge|0.4426689342759711 |0.4426689342759711 |
|insulin glargine|Drug_Ingredient    |0.41479957993703415|0.41479957993703415|
|fro 3 months    |Duration           |0.40924149430875195|0.40924149430875195|
|SGLT2 inhibitors|Drug_Ingredient    |0.3928558273491294 |0.3928558273491294 |
+----------------+-------------------+-------------------+-------------------+
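
The eight rows above are consistent with keeping up to three phrases (the default `topN`) in each of the three sentences of the example text. A minimal sketch of that per-sentence selection follows. It is not the annotator's code: the sentence indices were assigned by reading the example text, the scores are rounded from the table above, and the two rows marked hypothetical are invented so that one sentence has more than three candidates. `top_n_per_group` is a name made up for this sketch.

```python
from collections import defaultdict

def top_n_per_group(scored_chunks, top_n=3):
    """Keep the top_n highest-scoring phrases within each sentence."""
    groups = defaultdict(list)
    for sentence_id, phrase, score in scored_chunks:
        groups[sentence_id].append((phrase, score))

    kept = []
    for sentence_id in sorted(groups):
        best = sorted(groups[sentence_id], key=lambda p: p[1], reverse=True)[:top_n]
        kept.extend((sentence_id, phrase, score) for phrase, score in best)
    return kept

# (sentence index, phrase, score); scores rounded from the table above.
scored_chunks = [
    (0, "1 capsule", 0.5559),
    (0, "Advil", 0.4443),
    (0, "for 5 days", 0.5181),
    (1, "discharged", 0.4427),
    (1, "insulin glargine", 0.4148),
    (1, "insulin lispro", 0.4629),
    (1, "metformin", 0.30),  # hypothetical score, for illustration
    (1, "1000 mg", 0.25),    # hypothetical score, for illustration
    (2, "SGLT2 inhibitors", 0.3929),
    (2, "fro 3 months", 0.4092),
]

kept = top_n_per_group(scored_chunks, top_n=3)
for sentence_id, phrase, score in kept:
    print(sentence_id, phrase, score)
```

Because the limit applies per sentence, the total number of key phrases can exceed `topN` when the text has several sentences.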



### concatenateSentences

This parameter is only used if `documentLevelProcessing` is set to `True`. If `concatenateSentences` is `True`, the model concatenates the document/sentence input annotations and computes a single embedding. If it is `False`, the model computes an embedding for each sentence separately and averages the resulting vectors. Default: `True`.
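
The difference between the two modes can be illustrated with a toy encoder. Everything here is an assumption made for the sketch: `embed` is a stand-in that returns L2-normalised word counts over a tiny vocabulary, whereas the annotator uses a pretrained sentence embedding model, and the sentences are artificial.

```python
import math

def embed(text, vocab):
    """Toy stand-in for a sentence encoder: L2-normalised word counts."""
    words = text.lower().split()
    counts = [float(words.count(w)) for w in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def concatenated_embedding(sentences, vocab):
    """Join the sentences first, then embed the result once."""
    return embed(" ".join(sentences), vocab)

def averaged_embedding(sentences, vocab):
    """Embed each sentence separately, then average the vectors."""
    vectors = [embed(s, vocab) for s in sentences]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

vocab = ["insulin", "metformin", "advil"]
sentences = ["advil", "insulin insulin insulin metformin"]

concatenated = concatenated_embedding(sentences, vocab)
averaged = averaged_embedding(sentences, vocab)
print([round(x, 4) for x in concatenated])
print([round(x, 4) for x in averaged])
```

In this toy setting, averaging gives the one-word sentence the same weight as the longer one, so the `advil` component is larger in the averaged vector than in the concatenated one. Which mode suits a real document better depends on how uneven its sentences are.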

In [None]:
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setDocumentLevelProcessing(True) \
    .setConcatenateSentences(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+---------------+------------------+------------------+
|result        |entity         |DocumentSimilarity|MMRScore          |
+--------------+---------------+------------------+------------------+
|1 capsule     |Dosage         |0.3192493587576557|0.3192493587576557|
|for 5 days    |Duration       |0.2975460320830571|0.2975460320830571|
|insulin lispro|Drug_Ingredient|0.2893682687100746|0.2893682687100746|
+--------------+---------------+------------------+------------------+



### dropPunctuation

If set to True, the model removes punctuation marks from the input chunks.

In [None]:
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setDropPunctuation(True)


nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+----------------+-------------------+------------------+------------------+
|result          |entity             |DocumentSimilarity|MMRScore          |
+----------------+-------------------+------------------+------------------+
|insulin lispro  |Drug_Ingredient    |0.4873421379191952|0.4873421379191952|
|insulin glargine|Drug_Ingredient    |0.4565283676909086|0.4565283676909086|
|discharged      |Admission_Discharge|0.416718843385142 |0.416718843385142 |
+----------------+-------------------+------------------+------------------+



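### selectMostDifferent

If set to True, the model returns the top N key phrases that are the most different from each other.

In [None]: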
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setSelectMostDifferent(True)


nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+----------------+-------------------+------------------+------------------+
|result          |entity             |DocumentSimilarity|MMRScore          |
+----------------+-------------------+------------------+------------------+
|insulin glargine|Drug_Ingredient    |0.4565283676909086|0.4565283676909086|
|discharged      |Admission_Discharge|0.416718843385142 |0.416718843385142 |
|1000 mg         |Strength           |0.3377889480145588|0.3377889480145588|
+----------------+-------------------+------------------+------------------+

