![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/9.Chunk_Key_Phrase_Extraction.ipynb)

# Chunk Key Phrase Extraction

In this notebook, you will learn how to extract key phrases with `ChunkKeyPhraseExtraction`, which uses Sentence BERT embeddings to select the keywords and key phrases that are most similar to a document. The annotator accepts the output of an NER model, NGramGenerator, or YAKE, and it can generate a similarity score for each chunk produced by any (clinical) NER model. In other words, you can sort your clinical entities by their importance to the document or sentence in which they appear. You can also use this annotator to pick up clinical chunks that a pretrained NER model missed, and to summarize a whole document into a few important sentences or phrases.
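
As a rough illustration of the similarity ranking described above, the sketch below scores a few candidate chunks against a document vector with cosine similarity and sorts them. The three-dimensional vectors and the names `chunk_a`, `chunk_b`, and `chunk_c` are hypothetical; in the pipeline the vectors would come from the Sentence BERT model, and the annotator's internal implementation may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_document_similarity(doc_vec, chunk_vecs):
    """Return (chunk, score) pairs sorted from most to least similar."""
    scored = [(chunk, cosine_similarity(vec, doc_vec))
              for chunk, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors, for illustration only.
doc_vec = [1.0, 1.0, 0.0]
chunk_vecs = {
    "chunk_a": [1.0, 0.9, 0.1],
    "chunk_b": [0.0, 0.2, 1.0],
    "chunk_c": [1.0, 0.0, 0.0],
}

ranking = rank_by_document_similarity(doc_vec, chunk_vecs)
for chunk, score in ranking:
    print(f"{chunk}: {score:.4f}")
```

The chunk whose vector points most nearly in the same direction as the document vector is ranked first, which is the idea behind the `DocumentSimilarity` value reported later in this notebook.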

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [None]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

**Let's start by creating a Spark DataFrame.**

In [4]:
text = """
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight 
years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior 
episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute 
hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week 
history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation, 
she was treated with a five-day course of amoxicillin for a respiratory tract infection. 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for 
HTG. She had been on dapagliflozin for six months at the time of presentation . Physical examination 
on presentation was significant for dry oral mucosa ; significantly, her abdominal examination was 
benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were: 
serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 
508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27. 
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept
 hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis, 
as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained 
six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21, 
serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L. 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was 
centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by 
lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion 
gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by 
her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the 
endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin 
lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors 
should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.
""".strip().replace("\n", "")

empty_data = spark.createDataFrame([[""]]).toDF("text")
textDF = spark.createDataFrame([[text]]).toDF("text")

# with NGramGenerator
First, we will extract key phrases from N-grams by feeding the output of `NGramGenerator` to `ChunkKeyPhraseExtraction`.

In [5]:
documenter = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

ngram_generator = NGramGenerator()\
    .setInputCols(["clean_tokens"])\
    .setOutputCol("ngrams")\
    .setN(3)

ngram_key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ngrams"])\
    .setOutputCol("ngram_key_phrases")

ngram_pipeline = Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    stop_words_cleaner,
    ngram_generator,
    ngram_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [6]:
ngram_results = ngram_pipeline.fit(empty_data).transform(textDF)

**Let's show the N-gram results.**

In [7]:
ngram_results.selectExpr("explode(ngrams) AS key_phrase_candidate").show(30,truncate=False)

+-------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                 |
+-------------------------------------------------------------------------------------+
|{chunk, 2, 34, 28-year-old female history, {sentence -> 0, chunk -> 0}, []}          |
|{chunk, 14, 49, female history gestational, {sentence -> 0, chunk -> 1}, []}         |
|{chunk, 28, 58, history gestational diabetes, {sentence -> 0, chunk -> 2}, []}       |
|{chunk, 39, 67, gestational diabetes mellitus, {sentence -> 0, chunk -> 3}, []}      |
|{chunk, 51, 77, diabetes mellitus diagnosed, {sentence -> 0, chunk -> 4}, []}        |
|{chunk, 60, 89, mellitus diagnosed years, {sentence -> 0, chunk -> 5}, []}           |
|{chunk, 69, 95, diagnosed years prior, {sentence -> 0, chunk -> 6}, []}              |
|{chunk, 85, 111, years prior presentation, {sentence -> 0, chunk -> 7}, []}          |
|{chunk, 91, 126, prior presenta
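
The candidates above can be read as three-token sliding windows over the tokens that remain after stop-word removal. The following sketch reproduces that idea in plain Python. The helper `generate_ngrams` and the tiny stop-word set are hypothetical stand-ins for `NGramGenerator` and `StopWordsCleaner`, and tokenization details such as punctuation handling are ignored.

```python
def generate_ngrams(tokens, n, stop_words):
    """Drop stop words (case-insensitively), then slide a window of n tokens."""
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    return [" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]


# Hypothetical, deliberately tiny stop-word list; the pretrained cleaner is much larger.
stop_words = {"a", "with", "of"}

# Whitespace tokenization of the opening words of the example text.
tokens = "A 28-year-old female with a history of gestational diabetes mellitus".split()

ngrams = generate_ngrams(tokens, 3, stop_words)
for ngram in ngrams:
    print(ngram)
# 28-year-old female history
# female history gestational
# history gestational diabetes
# gestational diabetes mellitus
```

These four strings match the first four candidates in the table above, which suggests the simplified picture is adequate for the start of the text.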

**Check the key phrases extracted from the N-gram results.**

In [None]:
ngram_results.selectExpr("explode(ngram_key_phrases) AS ngram_key_phrases").show(truncate=170)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                         ngram_key_phrases|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 117, 144, subsequent type diabetes, {sentence -> 0, chunk -> 10, DocumentSimilarity -> 0.7503709522732224, MMRScore -> 0.450222589254171}, [0.27310005, -1.5514...|
|{chunk, 51, 77, diabetes mellitus diagnosed, {sentence -> 0, chunk -> 4, DocumentSimilarity -> 0.6953793671393519, MMRScore -> 0.08263042253576852}, [-0.41446662, -1.9...|
|{chunk, 186, 221, HTG-induced pancreatitis years, {sentence -> 0, chunk -> 19, DocumentSimilarity -> 0.6817062970203589, MMRScore -> 0

**Show the selected key phrases, their cosine similarity to the document, their Maximal Marginal Relevance (MMR) score, and the sentence in which each key phrase was found.**

In [None]:
ngram_results.select(F.explode(F.arrays_zip("ngram_key_phrases.result","ngram_key_phrases.metadata")).alias("cols"))\
                .select(F.expr("cols['0']").alias("key_phrase"),
                        F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                        F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                        F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+--------------------------------+-------------------+-------------------+--------+
|key_phrase                      |DocumentSimilarity |MMRScore           |sentence|
+--------------------------------+-------------------+-------------------+--------+
|subsequent type diabetes        |0.7503709522732224 |0.450222589254171  |0       |
|diabetes mellitus diagnosed     |0.6953793671393519 |0.08263042253576852|0       |
|HTG-induced pancreatitis years  |0.6817062970203589 |0.09449712467163301|0       |
|hepatitis obesity               |0.6666053470245074 |0.10480013167241414|0       |
|mellitus diagnosed years        |0.6389213524714412 |0.09924533638593624|0       |
|vomiting                        |0.5824238606842447 |0.13604053241615346|0       |
|admitted starvation ketosis     |0.5789875069392564 |0.0980740790595363 |0       |
|five-day amoxicillin respiratory|0.5330658910562247 |0.11179597232070618|0       |
|33.5 kg/m2                      |0.46256524566846674|0.08639870081536791|0 
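
To make the two score columns easier to interpret, here is a minimal sketch of greedy Maximal Marginal Relevance selection. It assumes the common form `(1 - divergence) * similarity_to_document - divergence * max_similarity_to_already_selected`; this formula is an assumption, not something confirmed by the library. The first row of the table above is at least consistent with it, since (1 - 0.4) × 0.7504 ≈ 0.4502 when nothing has been selected yet. The function `mmr_select` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr_select(doc_vec, candidates, top_n, divergence):
    """Greedily pick up to top_n candidates, trading relevance against redundancy.

    Returns a list of (name, document_similarity, mmr_score) tuples.
    """
    doc_sim = {name: cosine(vec, doc_vec) for name, vec in candidates.items()}
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        best = None
        for name in remaining:
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[name], candidates[chosen]) for chosen, _, _ in selected),
                default=0.0,
            )
            score = (1 - divergence) * doc_sim[name] - divergence * redundancy
            if best is None or score > best[2]:
                best = (name, doc_sim[name], score)
        selected.append(best)
        remaining.remove(best[0])
    return selected


# Hypothetical toy vectors: chunk_a and chunk_a_variant are near-duplicates.
doc_vec = [1.0, 1.0, 0.0]
candidates = {
    "chunk_a": [1.0, 0.25, 0.0],
    "chunk_a_variant": [1.0, 0.2, 0.0],
    "chunk_b": [0.2, 1.0, 0.0],
}

picked = mmr_select(doc_vec, candidates, top_n=2, divergence=0.4)
for name, sim, score in picked:
    print(f"{name}: DocumentSimilarity={sim:.4f}, MMRScore={score:.4f}")
```

In this toy setup the near-duplicate `chunk_a_variant` is passed over in favor of `chunk_b`, which may help explain why every row after the first in the table has a much lower `MMRScore` than its `DocumentSimilarity`.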

# with NER Model

Now we will show how to get key phrases from NER chunks by feeding `ChunkKeyPhraseExtraction` with the output of `NerConverter`.

In [None]:
documenter = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

ner_key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_key_phrases")

ner_pipeline = Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    embeddings, 
    ner_tagger, 
    ner_converter, 
    ner_key_phrase_extractor
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
ner_results = ner_pipeline.fit(empty_data).transform(textDF)

In [None]:
# ner_chunk results

ner_results.select(F.explode(F.arrays_zip("ner_chunks.result","ner_chunks.metadata")).alias("cols"))\
           .select(F.expr("cols['0']").alias("ner_chunk"),
                   F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+-----------------------------+-------------------------+
|ner_chunk                    |label                    |
+-----------------------------+-------------------------+
|28-year-old                  |Age                      |
|female                       |Gender                   |
|gestational diabetes mellitus|Diabetes                 |
|eight years prior            |RelativeDate             |
|type two diabetes mellitus   |Diabetes                 |
|T2DM                         |Diabetes                 |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder|
|three years prior            |RelativeDate             |
|acute                        |Modifier                 |
|hepatitis                    |Communicable_Disease     |
|obesity                      |Obesity                  |
|body mass index              |Symptom                  |
|33.5 kg/m2                   |Weight                   |
|one-week                     |Duration                 |
|polyuria     

**Show the key phrase results and scores we got using NER chunks.**

In [None]:
ner_results.select(F.explode(F.arrays_zip("ner_key_phrases.result","ner_key_phrases.metadata")).alias("cols"))\
           .select(F.expr("cols['0']").alias("key_phrase"),
                   F.expr("cols['1']['entity']").alias("label"),
                   F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                   F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                   F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+--------------------------+-------------------------+-------------------+-------------------+--------+
|key_phrase                |label                    |DocumentSimilarity |MMRScore           |sentence|
+--------------------------+-------------------------+-------------------+-------------------+--------+
|type two diabetes mellitus|Diabetes                 |0.7639750686118073 |0.4583850593816694 |0       |
|HTG-induced pancreatitis  |Disease_Syndrome_Disorder|0.66933222897749   |0.10416352343367463|0       |
|hepatitis                 |Communicable_Disease     |0.6052963003635777 |0.06844932298636924|0       |
|vomiting                  |Symptom                  |0.5824238606842447 |0.14864184723537974|0       |
|starvation ketosis        |Disease_Syndrome_Disorder|0.5540200581900556 |0.09014297211757516|9       |
|lipemia                   |Disease_Syndrome_Disorder|0.5382696947813763 |0.04719088741033595|8       |
|obesity                   |Obesity                  |0.50025000

# with NGramGenerator and NER Model

We can also get key phrases by merging the N-gram and NER chunks.

In [None]:
documenter = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

ngram_generator = NGramGenerator()\
    .setInputCols(["clean_tokens"])\
    .setOutputCol("ngrams")\
    .setN(3)
        
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

chunk_merger = ChunkMergeApproach()\
    .setInputCols("ngrams", "ner_chunks")\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(False)

ngram_ner_key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")

ngram_ner_pipeline = Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    stop_words_cleaner,
    ngram_generator,
    embeddings, 
    ner_tagger, 
    ner_converter, 
    chunk_merger,
    ngram_ner_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
ngram_ner_results = ngram_ner_pipeline.fit(empty_data).transform(textDF)

**Show the merged key phrase candidates. Candidates labeled `UNK` come from NGramGenerator; the others come from the `ner_jsl` model.**

In [None]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate").show(30,truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 2, 34, 28-year-old female history, {entity -> UNK, chunk -> 0, sentence -> 0}, []}                                               |
|{chunk, 2, 12, 28-year-old, {entity -> Age, confidence -> 1.0, chunk -> 1, sentence -> 0}, []}                                           |
|{chunk, 14, 49, female history gestational, {entity -> UNK, chunk -> 2, sentence -> 0}, []}                                              |
|{chunk, 14, 19, female, {entity -> Gender, confidence -> 0.9985, chunk -> 3, sentence -> 0}, []}                                         |
|{chunk, 28, 58, his

In [None]:
# NER chunk results

ngram_ner_results.select(F.explode(F.arrays_zip("merged_chunks.result","merged_chunks.metadata")).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).filter("label != 'UNK'").show(50, truncate=False)

+-----------------------------+-------------------------+
|key_phrase_candidate         |label                    |
+-----------------------------+-------------------------+
|28-year-old                  |Age                      |
|female                       |Gender                   |
|gestational diabetes mellitus|Diabetes                 |
|eight years prior            |RelativeDate             |
|type two diabetes mellitus   |Diabetes                 |
|T2DM                         |Diabetes                 |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder|
|three years prior            |RelativeDate             |
|acute                        |Modifier                 |
|hepatitis                    |Communicable_Disease     |
|obesity                      |Obesity                  |
|body mass index              |Symptom                  |
|33.5 kg/m2                   |Weight                   |
|one-week                     |Duration                 |
|polyuria     

In [None]:
# ngram results

ngram_ner_results.select(F.explode(F.arrays_zip("merged_chunks.result","merged_chunks.metadata")).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).filter("label == 'UNK'").show(50, truncate=False)

+--------------------------------+-----+
|key_phrase_candidate            |label|
+--------------------------------+-----+
|28-year-old female history      |UNK  |
|female history gestational      |UNK  |
|history gestational diabetes    |UNK  |
|gestational diabetes mellitus   |UNK  |
|diabetes mellitus diagnosed     |UNK  |
|mellitus diagnosed years        |UNK  |
|diagnosed years prior           |UNK  |
|years prior presentation        |UNK  |
|prior presentation subsequent   |UNK  |
|presentation subsequent type    |UNK  |
|subsequent type diabetes        |UNK  |
|type diabetes mellitus          |UNK  |
|diabetes mellitus (             |UNK  |
|mellitus ( T2DM                 |UNK  |
|( T2DM ),                       |UNK  |
|T2DM ), prior                   |UNK  |
|), prior episode                |UNK  |
|prior episode HTG-induced       |UNK  |
|episode HTG-induced pancreatitis|UNK  |
|HTG-induced pancreatitis years  |UNK  |
|pancreatitis years prior        |UNK  |
|years prior pre

In [None]:
# merged (NER chunk + ngram) results

ngram_ner_results.select(F.explode(F.arrays_zip("merged_chunks.result","merged_chunks.metadata")).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+--------------------------------+-------------------------+
|key_phrase_candidate            |label                    |
+--------------------------------+-------------------------+
|28-year-old female history      |UNK                      |
|28-year-old                     |Age                      |
|female history gestational      |UNK                      |
|female                          |Gender                   |
|history gestational diabetes    |UNK                      |
|gestational diabetes mellitus   |UNK                      |
|gestational diabetes mellitus   |Diabetes                 |
|diabetes mellitus diagnosed     |UNK                      |
|mellitus diagnosed years        |UNK                      |
|diagnosed years prior           |UNK                      |
|eight years prior               |RelativeDate             |
|years prior presentation        |UNK                      |
|prior presentation subsequent   |UNK                      |
|presentation subsequent

**Show the key phrase candidates and their source (NER or NGramGenerator).**

In [None]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate")\
                 .selectExpr("key_phrase_candidate.result AS key_phrase_candidate",
                             "IF(key_phrase_candidate.metadata.entity = 'UNK', 'ngram', 'NER') AS source",
                             "key_phrase_candidate.metadata.sentence")\
                 .show(50, truncate=False)

+--------------------------------+------+--------+
|key_phrase_candidate            |source|sentence|
+--------------------------------+------+--------+
|28-year-old female history      |ngram |0       |
|28-year-old                     |NER   |0       |
|female history gestational      |ngram |0       |
|female                          |NER   |0       |
|history gestational diabetes    |ngram |0       |
|gestational diabetes mellitus   |ngram |0       |
|gestational diabetes mellitus   |NER   |0       |
|diabetes mellitus diagnosed     |ngram |0       |
|mellitus diagnosed years        |ngram |0       |
|diagnosed years prior           |ngram |0       |
|eight years prior               |NER   |0       |
|years prior presentation        |ngram |0       |
|prior presentation subsequent   |ngram |0       |
|presentation subsequent type    |ngram |0       |
|subsequent type diabetes        |ngram |0       |
|type diabetes mellitus          |ngram |0       |
|type two diabetes mellitus    

**Show the extracted key phrases and their scores.**

In [None]:
ngram_ner_results.select(F.explode(F.arrays_zip("key_phrases.result","key_phrases.metadata")).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase"),
                         F.expr("cols['1']['entity']").alias("label"),
                         F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                         F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                         F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+--------------------------------+--------+-------------------+-------------------+--------+
|key_phrase                      |label   |DocumentSimilarity |MMRScore           |sentence|
+--------------------------------+--------+-------------------+-------------------+--------+
|type two diabetes mellitus      |Diabetes|0.7639750686118073 |0.4583850593816694 |0       |
|subsequent type diabetes        |UNK     |0.7503709443591438 |0.08298243928224425|0       |
|HTG-induced pancreatitis years  |UNK     |0.6817062970203589 |0.11246275270031031|0       |
|hepatitis obesity               |UNK     |0.6666053470245074 |0.1177052008980295 |0       |
|mellitus diagnosed years        |UNK     |0.6389213391545323 |0.08129479185432026|0       |
|history gestational diabetes    |UNK     |0.6219876368539883 |0.0950104202982544 |0       |
|vomiting                        |UNK     |0.5824238088130589 |0.14864183399720493|0       |
|admitted starvation ketosis     |UNK     |0.5789875069392564 |0.12008

**Show the extracted key phrases and their sources.**

In [None]:
ngram_ner_results.selectExpr("explode(key_phrases) AS key_phrase")\
                 .selectExpr(
                     "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                     "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                     "key_phrase.metadata.DocumentSimilarity",
                     "key_phrase.metadata.MMRScore",
                     "key_phrase.metadata.sentence")\
                 .show(truncate=False)

+--------------------------------+------+-------------------+-------------------+--------+
|key_phrase                      |source|DocumentSimilarity |MMRScore           |sentence|
+--------------------------------+------+-------------------+-------------------+--------+
|type two diabetes mellitus      |NER   |0.7639750686118073 |0.4583850593816694 |0       |
|subsequent type diabetes        |ngrams|0.7503709443591438 |0.08298243928224425|0       |
|HTG-induced pancreatitis years  |ngrams|0.6817062970203589 |0.11246275270031031|0       |
|hepatitis obesity               |ngrams|0.6666053470245074 |0.1177052008980295 |0       |
|mellitus diagnosed years        |ngrams|0.6389213391545323 |0.08129479185432026|0       |
|history gestational diabetes    |ngrams|0.6219876368539883 |0.0950104202982544 |0       |
|vomiting                        |ngrams|0.5824238088130589 |0.14864183399720493|0       |
|admitted starvation ketosis     |ngrams|0.5789875069392564 |0.12008073486190007|0       |

**Now we will change the default embeddings of `ChunkKeyPhraseExtraction` (`sbert_jsl_medium_uncased`) to `sbiobert_base_cased_mli` and see the results.**

In [None]:
ngram_ner_key_phrase_bio = ChunkKeyPhraseExtraction.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")

ngram_ner_bio_pipeline = Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    stop_words_cleaner,
    ngram_generator,
    embeddings, 
    ner_tagger, 
    ner_converter, 
    chunk_merger,
    ngram_ner_key_phrase_bio
])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [None]:
ngram_ner_bio_results = ngram_ner_bio_pipeline.fit(empty_data).transform(textDF)

In [None]:
# sbiobert_base_cased_mli

ngram_ner_bio_results.selectExpr("explode(key_phrases) AS key_phrase")\
                     .selectExpr(
                         "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                         "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                         "key_phrase.metadata.DocumentSimilarity",
                         "key_phrase.metadata.MMRScore",
                         "key_phrase.metadata.sentence")\
                     .show(truncate=False)

+--------------------------------+------+-------------------+----------------------+--------+
|key_phrase                      |source|DocumentSimilarity |MMRScore              |sentence|
+--------------------------------+------+-------------------+----------------------+--------+
|one-week history polyuria       |ngrams|0.6088061606084895 |0.36528371088016365   |0       |
|HTG-induced pancreatitis years  |ngrams|0.5841954619507056 |0.11729245038840502   |0       |
|insulin glargine night          |ngrams|0.5542673378450764 |0.07800352732841304   |0       |
|history gestational diabetes    |ngrams|0.5492484696251535 |0.07024020676412779   |0       |
|28-year-old female history      |ngrams|0.5053959908272412 |0.1261157344429971    |0       |
|admitted starvation ketosis     |ngrams|0.5019899057081652 |0.04270516736484414   |0       |
|triglycerides 508 mg/dL         |ngrams|0.4961037081394255 |0.10460739879049963   |0       |
|vomiting weeks                  |ngrams|0.4524399229033075 

In [None]:
# sbert_jsl_medium_uncased (default)

ngram_ner_results.selectExpr("explode(key_phrases) AS key_phrase")\
                 .selectExpr(
                     "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                     "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                     "key_phrase.metadata.DocumentSimilarity",
                     "key_phrase.metadata.MMRScore",
                     "key_phrase.metadata.sentence")\
                 .show(truncate=False)

+--------------------------------+------+-------------------+-------------------+--------+
|key_phrase                      |source|DocumentSimilarity |MMRScore           |sentence|
+--------------------------------+------+-------------------+-------------------+--------+
|type two diabetes mellitus      |NER   |0.7639750686118073 |0.4583850593816694 |0       |
|subsequent type diabetes        |ngrams|0.7503709443591438 |0.08298243928224425|0       |
|HTG-induced pancreatitis years  |ngrams|0.6817062970203589 |0.11246275270031031|0       |
|hepatitis obesity               |ngrams|0.6666053470245074 |0.1177052008980295 |0       |
|mellitus diagnosed years        |ngrams|0.6389213391545323 |0.08129479185432026|0       |
|history gestational diabetes    |ngrams|0.6219876368539883 |0.0950104202982544 |0       |
|vomiting                        |ngrams|0.5824238088130589 |0.14864183399720493|0       |
|admitted starvation ketosis     |ngrams|0.5789875069392564 |0.12008073486190007|0       |

**Let's set `.setConcatenateSentences(False)` to see the results when the document embedding is computed as the average of the sentence-level embeddings.**

In [None]:
ngram_ner_key_phrase_sent = ChunkKeyPhraseExtraction.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")\
    .setConcatenateSentences(False)

ngram_ner_sent_pipeline = Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    stop_words_cleaner,
    ngram_generator,
    embeddings, 
    ner_tagger, 
    ner_converter, 
    chunk_merger,
    ngram_ner_key_phrase_sent
])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [None]:
ngram_ner_sent_results = ngram_ner_sent_pipeline.fit(empty_data).transform(textDF)

In [None]:
# .setConcatenateSentences(False)

ngram_ner_sent_results.selectExpr("explode(key_phrases) AS key_phrase")\
                      .selectExpr(
                          "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                          "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                          "key_phrase.metadata.DocumentSimilarity",
                          "key_phrase.metadata.MMRScore",
                          "key_phrase.metadata.sentence")\
                      .show(50,truncate=False)

+------------------------------------+------+--------------------+--------------------+--------+
|key_phrase                          |source|DocumentSimilarity  |MMRScore            |sentence|
+------------------------------------+------+--------------------+--------------------+--------+
|one-week history polyuria           |ngrams|0.1572482220441977  |0.09434893697560838 |0       |
|HTG-induced pancreatitis years      |ngrams|0.15313217083038327 |-0.14134553456113808|0       |
|insulin glargine night              |ngrams|0.13809328775688035 |-0.13992695390370963|0       |
|female history gestational          |ngrams|0.10183598443402134 |-0.11137552851495591|0       |
|HbA1c 10%                           |ngrams|0.09713650805171166 |-0.13530046786569241|0       |
|dapagliflozin T2DM atorvastatin     |ngrams|0.09236916291173515 |-0.09540790835310184|0       |
|days prior admission                |ngrams|0.08338127287781387 |-0.1567861341022248 |0       |
|33.5 kg/m2                   
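
The effect of `setConcatenateSentences(False)` can be illustrated with a small, self-contained sketch. It is not the annotator's implementation: it uses made-up 3-dimensional vectors in place of real Sentence BERT embeddings, and the helper names `mean_vector` and `cosine` are hypothetical. The document vector is taken as the average of the sentence vectors, and each candidate phrase is scored by its cosine similarity to that average.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy sentence embeddings (illustrative values only)
sentence_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
# Document embedding = average of the sentence embeddings
document_vector = mean_vector(sentence_vectors)

# Toy candidate phrase embeddings (illustrative values only)
candidates = {
    "phrase_a": [1.0, 1.0, 0.0],
    "phrase_b": [0.0, 0.0, 1.0],
}
similarities = {name: cosine(vec, document_vector) for name, vec in candidates.items()}
ranked = sorted(similarities, key=similarities.get, reverse=True)
print(ranked)
```

Because the averaged vector mixes all sentences equally, a phrase that matches only one sentence closely can score lower against it than against a single embedding of the concatenated text, which is one possible reason for the lower `DocumentSimilarity` values in this run.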

In [None]:
# .setConcatenateSentences(True) # default

ngram_ner_bio_results.selectExpr("explode(key_phrases) AS key_phrase")\
                     .selectExpr(
                         "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                         "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                         "key_phrase.metadata.DocumentSimilarity",
                         "key_phrase.metadata.MMRScore",
                         "key_phrase.metadata.sentence")\
                     .show(truncate=False)

+--------------------------------+------+-------------------+----------------------+--------+
|key_phrase                      |source|DocumentSimilarity |MMRScore              |sentence|
+--------------------------------+------+-------------------+----------------------+--------+
|one-week history polyuria       |ngrams|0.6088061606084895 |0.36528371088016365   |0       |
|HTG-induced pancreatitis years  |ngrams|0.5841954619507056 |0.11729245038840502   |0       |
|insulin glargine night          |ngrams|0.5542673378450764 |0.07800352732841304   |0       |
|history gestational diabetes    |ngrams|0.5492484696251535 |0.07024020676412779   |0       |
|28-year-old female history      |ngrams|0.5053959908272412 |0.1261157344429971    |0       |
|admitted starvation ketosis     |ngrams|0.5019899057081652 |0.04270516736484414   |0       |
|triglycerides 508 mg/dL         |ngrams|0.4961037081394255 |0.10460739879049963   |0       |
|vomiting weeks                  |ngrams|0.4524399229033075 
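
The `MMRScore` column is consistent with a Maximal Marginal Relevance (MMR) selection controlled by `setDivergence`. The sketch below is an assumption about how such a score can be computed, not the annotator's actual code: it scores each candidate as `(1 - divergence) * similarity_to_document - divergence * max_similarity_to_selected`. Under that assumption the first selected phrase carries no redundancy penalty, which agrees with the first rows of the tables above (for example, 0.6 × 0.6088 ≈ 0.3653 with a divergence of 0.4). The function name `mmr_select` and all similarity values are hypothetical.

```python
def mmr_select(doc_sim, pair_sim, top_n, divergence):
    """Greedy MMR: prefer candidates that are similar to the document
    but not too similar to phrases that were already selected."""
    selected = []  # (candidate, mmr_score) pairs in selection order
    remaining = set(doc_sim)
    while remaining and len(selected) < top_n:
        best, best_score = None, float("-inf")
        for cand in sorted(remaining):  # sorted for deterministic tie-breaking
            # Highest similarity to any phrase picked so far (0.0 if none yet)
            redundancy = max(
                (pair_sim[frozenset((cand, chosen))] for chosen, _ in selected),
                default=0.0,
            )
            score = (1 - divergence) * doc_sim[cand] - divergence * redundancy
            if score > best_score:
                best, best_score = cand, score
        selected.append((best, best_score))
        remaining.remove(best)
    return selected

# Toy similarities (illustrative values only)
doc_sim = {"a": 0.9, "b": 0.8, "c": 0.5}
pair_sim = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.10,
    frozenset(("b", "c")): 0.20,
}
result = mmr_select(doc_sim, pair_sim, top_n=3, divergence=0.4)
for phrase, score in result:
    print(phrase, round(score, 3))
```

In this toy data, `b` is more similar to the document than `c`, but it is nearly redundant with the already selected `a`, so `c` is picked before it. A higher divergence strengthens that penalty and yields more varied key phrases.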

# with YAKE Keyword Extraction
 
Let's get the key phrases using `YakeKeywordExtraction` and compare the results with `ChunkKeyPhraseExtraction`.

In [None]:
documenter = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

keywords = YakeKeywordExtraction() \
    .setInputCols("clean_tokens") \
    .setOutputCol("yake") \
    .setMinNGrams(1) \
    .setMaxNGrams(3)\
    .setNKeywords(20)

yake_key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "yake"])\
    .setOutputCol("yake_key_phrases")

yake_pipeline = Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    stop_words_cleaner,
    keywords,
    yake_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[OK!]


In [None]:
yake_results = yake_pipeline.fit(empty_data).transform(textDF)

**Let's check the YAKE keyword extraction results and scores.**

In [None]:
yake_results.selectExpr("explode(yake) AS key_phrase_candidate").show(30,truncate=False)

+---------------------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                               |
+---------------------------------------------------------------------------------------------------+
|{chunk, 91, 95, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                         |
|{chunk, 100, 111, presentation, {score -> 0.03159661517236814, sentence -> 0}, []}                 |
|{chunk, 169, 173, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                       |
|{chunk, 223, 227, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                       |
|{chunk, 232, 243, presentation, {score -> 0.03159661517236814, sentence -> 0}, []}                 |
|{chunk, 445, 449, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                       |
|{chunk, 454, 465, presentation, {score -> 0.03159661517236814, sentence -> 0}, []

In [None]:
scores = yake_results.selectExpr("explode(arrays_zip(yake.result, yake.metadata)) as resultTuples") \
                     .selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")

In [None]:
scores.orderBy("score", ascending=False).show(20, truncate = False)

+-----------------------------+--------------------+
|keyword                      |score               |
+-----------------------------+--------------------+
|presentation revealed glucose|0.042288671969857625|
|prior analysis due           |0.040835303439078076|
|days prior admission         |0.040835303439078076|
|serum                        |0.03799266202473545 |
|serum                        |0.03799266202473545 |
|serum                        |0.03799266202473545 |
|serum                        |0.03799266202473545 |
|serum                        |0.03799266202473545 |
|diagnosed years prior        |0.03799142861997926 |
|pancreatitis years prior     |0.03799142861997926 |
|patient treated insulin      |0.03390418358075854 |
|presentation                 |0.03159661517236814 |
|presentation                 |0.03159661517236814 |
|presentation                 |0.03159661517236814 |
|presentation                 |0.03159661517236814 |
|presentation                 |0.0315966151723

**Show the 10 distinct keywords with the highest YAKE scores.**

In [None]:
scores.select("keyword", "score").distinct().orderBy("score", ascending=False).show(10, truncate = False)

+-----------------------------+--------------------+
|keyword                      |score               |
+-----------------------------+--------------------+
|presentation revealed glucose|0.042288671969857625|
|prior analysis due           |0.040835303439078076|
|days prior admission         |0.040835303439078076|
|serum                        |0.03799266202473545 |
|diagnosed years prior        |0.03799142861997926 |
|pancreatitis years prior     |0.03799142861997926 |
|patient treated insulin      |0.03390418358075854 |
|presentation                 |0.03159661517236814 |
|anion gap elevated           |0.031568192739369824|
|years prior                  |0.030808818777992058|
+-----------------------------+--------------------+
only showing top 10 rows
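
Note that in the YAKE algorithm a lower score indicates a more relevant keyword, so a descending sort surfaces the least relevant candidates first. The plain-Python sketch below uses a few of the (keyword, score) pairs printed above, removes duplicates, and ranks them by ascending score. `rank_yake_keywords` is a hypothetical helper for illustration, not a Spark NLP function.

```python
def rank_yake_keywords(pairs, top_n=None):
    """Drop duplicate keywords and sort by ascending YAKE score
    (lower score = more relevant keyword)."""
    unique = {}
    for keyword, score in pairs:
        # Keep the lowest score seen for each keyword
        if keyword not in unique or score < unique[keyword]:
            unique[keyword] = score
    ranked = sorted(unique.items(), key=lambda item: item[1])
    return ranked if top_n is None else ranked[:top_n]

# A subset of the (keyword, score) rows from the output above
pairs = [
    ("presentation revealed glucose", 0.042288671969857625),
    ("serum", 0.03799266202473545),
    ("serum", 0.03799266202473545),
    ("patient treated insulin", 0.03390418358075854),
    ("presentation", 0.03159661517236814),
    ("years prior", 0.030808818777992058),
]
top_keywords = rank_yake_keywords(pairs, top_n=3)
for keyword, score in top_keywords:
    print(f"{keyword}: {score:.4f}")
```

On the Spark DataFrame itself, the equivalent would be to order by `score` in ascending order before calling `show`.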



**Now we can compare the results with `ChunkKeyPhraseExtraction`.**

In [None]:
yake_results.select(F.explode(F.arrays_zip("yake_key_phrases.result","yake_key_phrases.metadata")).alias("cols"))\
            .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                    F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                    F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                    F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+------------------------------------+-------------------+--------------------+--------+
|key_phrase_candidate                |DocumentSimilarity |MMRScore            |sentence|
+------------------------------------+-------------------+--------------------+--------+
|pancreatitis years prior            |0.6491587146812722 |0.38949524428591314 |0       |
|diagnosed years prior               |0.38594469396979897|-0.0344821482240133 |0       |
|respiratory tract infection         |0.34452766290310755|-0.06206619829710405|0       |
|patient treated insulin             |0.3413457416284759 |-0.03756776546543272|0       |
|serum                               |0.3371024001999838 |0.024651856331397548|0       |
|presentation revealed glucose       |0.31458360368143906|-0.1175232903425274 |0       |
|examination presentation significant|0.29099950377907047|-0.07343299615665819|0       |
|prior analysis due                  |0.22501711661945623|-0.13367738582793495|0       |
|prior               

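**As you can see, the `ChunkKeyPhraseExtraction` ranking is more relevant than the raw `YakeKeywordExtraction` ranking: it re-orders the YAKE candidates by semantic similarity to the document, so a clinically specific phrase such as `pancreatitis years prior` comes first, while a generic one such as `prior analysis due` drops toward the bottom of the list.**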