![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **ChunkKeyPhraseExtraction**

This notebook will cover the different parameters and usages of `ChunkKeyPhraseExtraction`. This annotator extracts key phrases from texts.

**📖 Learning Objectives:**

1. Understand how to extract key phrases from texts.

2. Become comfortable using the different parameters of `ChunkKeyPhraseExtraction`.


**🔗 Helpful Links:**

- Documentation : [ChunkKeyPhraseExtraction](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunkkeyphraseextraction)

- Python Docs : [ChunkKeyPhraseExtraction](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/chunk_key_phrase_extraction/index.html#sparknlp_jsl.annotator.chunker.chunk_key_phrase_extraction.ChunkKeyPhraseExtraction)

- Scala Docs : [ChunkKeyPhraseExtraction](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/chunker/ChunkKeyPhraseExtraction.html)

- For extended examples of usage, see the [Spark Healthcare NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

## **📜 Background**


`ChunkKeyPhraseExtraction` extracts key phrases from texts. The model compares the chunks against the corresponding sentences/documents and selects the chunks that are most representative of the broader text context (i.e., the document or the sentence they belong to). Selecting the most relevant phrases in this way gives, for example, a quick overview of a document.
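One way to picture the selection described above is as a ranking by embedding similarity. The sketch below is a minimal illustration under that assumption: it uses cosine similarity and made-up 3-dimensional vectors, and the helper names `cosine` and `rank_chunks` are hypothetical (they are not part of the Spark NLP API).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_chunks(doc_vec, chunk_vecs, top_n=3):
    """Return the top_n (chunk, similarity) pairs, most similar first."""
    scored = [(chunk, cosine(doc_vec, vec)) for chunk, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy vectors invented for illustration; a real pipeline would use
# sentence embeddings of the document and of each chunk.
doc_vec = [1.0, 1.0, 0.0]
chunk_vecs = {
    "insulin lispro": [1.0, 0.8, 0.1],
    "discharged": [0.2, 1.0, 0.9],
    "He": [0.0, 0.1, 1.0],
}

top = rank_chunks(doc_vec, chunk_vecs, top_n=2)
for chunk, score in top:
    print(f"{chunk}: {score:.3f}")
```

Chunks whose vectors point in roughly the same direction as the document vector rank first; the least related chunk (`He` in this toy data) is dropped.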

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from johnsnowlabs import nlp

nlp.install(force_browser=True)

In [None]:
from johnsnowlabs import nlp, medical

spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.2, 💊Spark-Healthcare==5.1.2, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT` , `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `topN` : The number of key phrases to select. Default: 3.

- `selectMostDifferent` : Pre-select topN * 2 key phrases and out of those select the topN that are the most different from each other.

- `divergence` : The divergence value determines how different from each other the extracted key phrases are. Default: 0.0.

- `documentLevelProcessing`: A flag indicating whether to extract key phrases from the document level, i.e. from all the sentences available at a given row, rather than from the particular sentences the chunks refer to. Default: True.

- `concatenateSentences` : A flag indicating whether to concatenate all input document/sentence annotations before computing their embedding. This parameter is only used if `documentLevelProcessing` is set to True. Default: True.

- `dropPunctuation` :  This parameter indicates whether to remove punctuation marks from the input chunks. Chunks coming from NER models are not affected. Default: True.
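The `selectMostDifferent` strategy listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the exact criterion the annotator uses for "most different" is not given here, so the sketch assumes the subset with the lowest total pairwise cosine similarity is chosen. The function name `select_most_different` and all vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_most_different(doc_vec, chunk_vecs, top_n):
    # Step 1: pre-select the 2 * top_n chunks closest to the document.
    ranked = sorted(chunk_vecs, key=lambda c: cosine(doc_vec, chunk_vecs[c]), reverse=True)
    candidates = ranked[: 2 * top_n]

    # Step 2 (assumed criterion): among all subsets of size top_n, keep the
    # one whose members are least similar to each other.
    def redundancy(subset):
        return sum(cosine(chunk_vecs[a], chunk_vecs[b]) for a, b in combinations(subset, 2))

    return list(min(combinations(candidates, top_n), key=redundancy))

# Toy 2-dimensional vectors invented for illustration.
doc_vec = [1.0, 1.0]
chunk_vecs = {
    "insulin glargine": [1.0, 0.9],
    "insulin lispro": [0.9, 1.0],
    "metformin": [1.0, 0.2],
    "discharged": [0.2, 1.0],
    "He": [-1.0, 0.5],
}

selected = select_most_different(doc_vec, chunk_vecs, top_n=2)
print(selected)
```

In this toy data a plain top-2 ranking would return the two near-identical insulin chunks, whereas the diversity step prefers two chunks that are still among the closest candidates but differ from each other.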




### `topN`

Set the number of key phrases to extract. The default value is 3.



In [42]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")\
    .setCustomBounds(['\n'])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars([".", ",", ";", ":", "!", "?", "*", "(", ")", "\"", "'","+","%","-",'='])\
    .setSplitChars(['\\[', '\\]', '\n'])

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# The general clinical NER model (ner_jsl) provides the candidate chunks
jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setTopN(1)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [43]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [44]:
import pyspark.sql.functions as F

In [45]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+-------------+---------------------+----------+
|        token|            ner_label|confidence|
+-------------+---------------------+----------+
|          The|                    O|    0.9997|
|      patient|                    O|    0.9818|
|          was|                    O|    0.9739|
|   prescribed|                    O|    0.8993|
|            1|             B-Dosage|    0.9827|
|      capsule|             I-Dosage|     0.268|
|           of|                    O|    0.8908|
|        Advil|     B-Drug_BrandName|    0.9617|
|          for|           B-Duration|    0.9878|
|            5|           I-Duration|    0.9064|
|         days|           I-Duration|    0.9584|
|            .|                    O|    0.9941|
|           He|             B-Gender|       1.0|
|          was|                    O|      0.99|
|         seen|                    O|    0.9609|
|           by|                    O|    0.9767|
|          the|                    O|     0.869|
|endocrinology|     

In [46]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+---------------+------------------+------------------+
|result        |entity         |DocumentSimilarity|MMRScore          |
+--------------+---------------+------------------+------------------+
|insulin lispro|Drug_Ingredient|0.4873421379191952|0.4873421379191952|
+--------------+---------------+------------------+------------------+



### `divergence`

The divergence value determines how different from each other the extracted key phrases are. The value must be in the interval [0, 1]. The higher the value, the more divergence is enforced. The default value is 0.0.
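The `MMRScore` column in the outputs suggests a Maximal Marginal Relevance (MMR) style selection. The sketch below assumes the common greedy form `(1 - divergence) * DocumentSimilarity - divergence * (max similarity to already selected phrases)`; this is an assumption, not the annotator's documented formula. It is at least consistent with the outputs in this notebook: with the default divergence of 0.0 the two score columns are equal, and with 0.4 the first row's `MMRScore` is about 0.6 times its `DocumentSimilarity`. The document similarities are taken from the notebook output, while the pairwise chunk similarities and the name `mmr_select` are hypothetical.

```python
def mmr_select(doc_sims, pair_sims, top_n=3, divergence=0.0):
    """Greedy MMR: trade relevance to the document against redundancy."""
    def pair_sim(a, b):
        return pair_sims.get((a, b), pair_sims.get((b, a), 0.0))

    selected = []                 # list of (chunk, mmr_score)
    remaining = list(doc_sims)
    while remaining and len(selected) < top_n:
        def mmr(chunk):
            # Redundancy is 0.0 while nothing has been selected yet.
            redundancy = max((pair_sim(chunk, s) for s, _ in selected), default=0.0)
            return (1 - divergence) * doc_sims[chunk] - divergence * redundancy

        best = max(remaining, key=mmr)
        selected.append((best, mmr(best)))
        remaining.remove(best)
    return selected

# DocumentSimilarity values as reported in this notebook's output.
doc_sims = {
    "insulin lispro": 0.4873421379191952,
    "discharged": 0.416718843385142,
    "1000 mg": 0.3377889480145588,
}
# Hypothetical chunk-to-chunk similarities, invented for illustration.
pair_sims = {
    ("insulin lispro", "discharged"): 0.15,
    ("insulin lispro", "1000 mg"): 0.40,
    ("discharged", "1000 mg"): 0.20,
}

picked = mmr_select(doc_sims, pair_sims, top_n=3, divergence=0.4)
for chunk, score in picked:
    print(f"{chunk}: {score:.4f}")
```

Because the pairwise similarities are invented, only the first score is comparable with the notebook output; later scores depend on how similar each candidate is to the phrases already chosen.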

In [47]:
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setDivergence(0.4)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [48]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+-------------------+------------------+--------------------+
|result        |entity             |DocumentSimilarity|MMRScore            |
+--------------+-------------------+------------------+--------------------+
|insulin lispro|Drug_Ingredient    |0.4873421379191952|0.2924052943706591  |
|discharged    |Admission_Discharge|0.416718843385142 |0.18774081504282925 |
|1000 mg       |Strength           |0.3377889480145588|0.046319304486829194|
+--------------+-------------------+------------------+--------------------+



### `documentLevelProcessing`

If set to True, the model will extract key phrases from the whole document. If set to False, the model will extract key phrases from each sentence separately. The default value is True.
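The difference between the two modes can be sketched as a change in the scope over which chunks compete. The sketch assumes that, at sentence level, `topN` is applied separately within each sentence; the eight rows returned for the three-sentence example in this section with the default `topN` of 3 would fit that reading (3 + 3 + 2), but this is an inference. All similarity values and the function name `select` are hypothetical.

```python
from collections import defaultdict

# (chunk text, sentence index, similarity to whole document, similarity to own sentence)
# All numbers are toy values invented for illustration.
chunks = [
    ("1 capsule", 0, 0.30, 0.55),
    ("Advil", 0, 0.25, 0.45),
    ("insulin lispro", 1, 0.50, 0.46),
    ("discharged", 1, 0.40, 0.44),
    ("SGLT2 inhibitors", 2, 0.20, 0.39),
]

def select(chunks, top_n, document_level):
    if document_level:
        # All chunks compete together, scored against the whole document.
        ranked = sorted(chunks, key=lambda c: c[2], reverse=True)
        return [c[0] for c in ranked[:top_n]]

    # Chunks compete only within their own sentence.
    by_sentence = defaultdict(list)
    for c in chunks:
        by_sentence[c[1]].append(c)
    picked = []
    for idx in sorted(by_sentence):
        ranked = sorted(by_sentence[idx], key=lambda c: c[3], reverse=True)
        picked.extend(c[0] for c in ranked[:top_n])
    return picked

doc_level = select(chunks, top_n=1, document_level=True)
sent_level = select(chunks, top_n=1, document_level=False)
print(doc_level)
print(sent_level)
```

With `top_n=1`, the document-level call yields a single phrase for the whole text, while the sentence-level call yields one phrase per sentence.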

In [49]:
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setDocumentLevelProcessing(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [50]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+----------------+-------------------+-------------------+-------------------+
|result          |entity             |DocumentSimilarity |MMRScore           |
+----------------+-------------------+-------------------+-------------------+
|1 capsule       |Dosage             |0.5559222282402198 |0.5559222282402198 |
|for 5 days      |Duration           |0.51812956000978   |0.51812956000978   |
|insulin lispro  |Drug_Ingredient    |0.46288566725284874|0.46288566725284874|
|Advil           |Drug_BrandName     |0.44433004633805423|0.44433004633805423|
|discharged      |Admission_Discharge|0.4426689342759711 |0.4426689342759711 |
|insulin glargine|Drug_Ingredient    |0.41479957993703415|0.41479957993703415|
|fro 3 months    |Duration           |0.40924157090884944|0.40924157090884944|
|SGLT2 inhibitors|Drug_Ingredient    |0.39285578965605455|0.39285578965605455|
+----------------+-------------------+-------------------+-------------------+



### `concatenateSentences`

This parameter is only used if `documentLevelProcessing` is set to True. If `concatenateSentences` is set to True, the model concatenates the document/sentence input annotations and computes a single embedding. If it is set to False, the model computes the embedding of each sentence separately and averages the resulting embedding vectors at the end. Default: True.
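The two ways of building the document embedding can be contrasted with a toy encoder. In the sketch below, `embed` is a hypothetical stand-in (L2-normalised word counts over a tiny vocabulary) for the sentence embedding model that the pipeline actually downloads; only the concatenate-versus-average logic is the point.

```python
import math

VOCAB = ["insulin", "advil", "days", "units"]  # toy vocabulary

def embed(text):
    """Toy stand-in for a sentence encoder: L2-normalised word counts."""
    words = text.lower().split()
    counts = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def document_embedding(sentences, concatenate):
    if concatenate:
        # One embedding computed over the joined text.
        return embed(" ".join(sentences))
    # One embedding per sentence, averaged dimension by dimension.
    vectors = [embed(s) for s in sentences]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

sentences = [
    "advil for 5 days",
    "40 units of insulin glargine 12 units of insulin lispro",
]

concatenated = document_embedding(sentences, concatenate=True)
averaged = document_embedding(sentences, concatenate=False)
print([round(x, 3) for x in concatenated])
print([round(x, 3) for x in averaged])
```

With concatenation, the longer sentence dominates the result; with averaging, each sentence contributes equally regardless of its length. That difference in the reference vector is one reason the selected key phrases and their scores can change between the two settings.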

In [51]:
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("key_phrase_chunks") \
    .setDocumentLevelProcessing(True) \
    .setConcatenateSentences(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    ner_converter,
    key_phrase_extractor ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

result = chunk_filter_model.transform(filter_df)

sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [52]:
result.selectExpr("explode(key_phrase_chunks) AS key_phrase").selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+---------------+------------------+------------------+
|result        |entity         |DocumentSimilarity|MMRScore          |
+--------------+---------------+------------------+------------------+
|1 capsule     |Dosage         |0.3192493587576557|0.3192493587576557|
|for 5 days    |Duration       |0.2975460320830571|0.2975460320830571|
|insulin lispro|Drug_Ingredient|0.2893682687100746|0.2893682687100746|
+--------------+---------------+------------------+------------------+

