![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/19.0.Chunk_Key_Phrase_Extraction.ipynb)

# Chunk Key Phrase Extraction

This notebook shows how to extract key phrases with `ChunkKeyPhraseExtraction`, an annotator that uses Sentence BERT embeddings to select the keywords and key phrases most similar to a document. It accepts the output of an NER model, `NGramGenerator`, or YAKE, and produces a similarity score for each chunk, so you can rank the entities coming out of any (clinical) NER model by their importance to the document or sentence they appear in. You can also use it to pick up clinical chunks that a pretrained NER model missed, or to summarize a whole document in a few important sentences or phrases.

`ChunkKeyPhraseExtraction` uses BERT sentence embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and sentence or document annotations. The model compares the chunks against the corresponding sentences or documents and selects the chunks that are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by `NGramGenerator`, `TextMatcher`, or `NerConverter`. The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.
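
The ranking idea described above can be sketched in plain Python. This is only an illustration, not the annotator's implementation: the three-dimensional vectors are made-up stand-ins for sentence embeddings, and `cosine_similarity` and `rank_chunks` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_chunks(doc_vector, chunk_vectors):
    """Sort chunk texts by similarity to the document vector, highest first."""
    scored = [(text, cosine_similarity(doc_vector, vec))
              for text, vec in chunk_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values, for illustration only).
doc_vector = [1.0, 1.0, 0.0]
chunk_vectors = {
    "type two diabetes mellitus": [1.0, 0.9, 0.1],
    "vomiting": [0.2, 1.0, 0.8],
    "five-day course": [0.0, 0.1, 1.0],
}

ranked = rank_chunks(doc_vector, chunk_vectors)
for text, score in ranked:
    print(f"{text}: {score:.3f}")
```

Chunks whose vectors point in nearly the same direction as the document vector score close to 1 and end up at the top of the list.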

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at [Models Hub](https://nlp.johnsnowlabs.com/models?task=Embeddings).

If no name is provided, the default model is `"sbert_jsl_medium_uncased"`.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.0

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()

In [None]:
spark

**Let's start by creating a Spark DataFrame.**

In [None]:
text = """
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight
years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior
episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute
hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week
history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation,
she was treated with a five-day course of amoxicillin for a respiratory tract infection.
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for
HTG. She had been on dapagliflozin for six months at the time of presentation . Physical examination
on presentation was significant for dry oral mucosa ; significantly, her abdominal examination was
benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were:
serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides
508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27.
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept
hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis,
as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained
six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21,
serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L.
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was
centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by
lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion
gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by
her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the
endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin
lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors
should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.
""".strip().replace("\n", "")

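# Note: the replace("\n", "") call above removes line breaks without inserting a space,
# so words that meet at a line break are fused (e.g. "eight" + "years" -> "eightyears").
# These fused tokens are visible in the outputs below.
empty_data = spark.createDataFrame([[""]]).toDF("text")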
textDF = spark.createDataFrame([[text]]).toDF("text")

# with NGramGenerator
First, we feed `ChunkKeyPhraseExtraction` with the output of `NGramGenerator` to extract key phrases from N-gram candidates.
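
As a rough mental model of what an N-gram generator with `n = 3` produces, here is a sliding-window sketch in plain Python. It only illustrates the basic idea; `make_ngrams` is a hypothetical helper, and the real annotator may treat sentence boundaries and short inputs differently.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens as they might look after stop-word removal
# (taken from the start of the sample text).
clean_tokens = ["28-year-old", "female", "history",
                "gestational", "diabetes", "mellitus"]

trigrams = make_ngrams(clean_tokens, 3)
for phrase in trigrams:
    print(phrase)
```

Each trigram then becomes one key phrase candidate that is scored against the surrounding text.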

In [None]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

ngram_generator = nlp.NGramGenerator()\
    .setInputCols(["clean_tokens"])\
    .setOutputCol("ngrams")\
    .setN(3)

ngram_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ngrams"])\
    .setOutputCol("ngram_key_phrases")

ngram_pipeline = nlp.Pipeline(
    stages=[
        documenter,
        sentencer,
        tokenizer,
        stop_words_cleaner,
        ngram_generator,
        ngram_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [None]:
ngram_results = ngram_pipeline.fit(empty_data).transform(textDF)

**Let's show the N-gram results.**

In [None]:
ngram_results.selectExpr("explode(ngrams) AS key_phrase_candidate").show(30,truncate=False)

+------------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                      |
+------------------------------------------------------------------------------------------+
|{chunk, 2, 34, 28-year-old female history, {sentence -> 0, chunk -> 0}, []}               |
|{chunk, 14, 49, female history gestational, {sentence -> 0, chunk -> 1}, []}              |
|{chunk, 28, 58, history gestational diabetes, {sentence -> 0, chunk -> 2}, []}            |
|{chunk, 39, 67, gestational diabetes mellitus, {sentence -> 0, chunk -> 3}, []}           |
|{chunk, 51, 77, diabetes mellitus diagnosed, {sentence -> 0, chunk -> 4}, []}             |
|{chunk, 60, 88, mellitus diagnosed eightyears, {sentence -> 0, chunk -> 5}, []}           |
|{chunk, 69, 94, diagnosed eightyears prior, {sentence -> 0, chunk -> 6}, []}              |
|{chunk, 79, 110, eightyears prior presentation, {sentence -> 0, chunk

**Check the key phrases extracted from the N-gram candidates.**

In [None]:
ngram_results.selectExpr("explode(ngram_key_phrases) AS ngram_key_phrases").show(truncate=170)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                         ngram_key_phrases|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 116, 143, subsequent type diabetes, {sentence -> 0, chunk -> 10, DocumentSimilarity -> 0.7542784653887057, MMRScore -> 0.4525670972166234}, [0.27309972, -1.551...|
|{chunk, 264, 291, acutehepatitis obesity, {sentence -> 0, chunk -> 24, DocumentSimilarity -> 0.7374926662801939, MMRScore -> 0.10970933303037028}, [0.5545075, -1.97073...|
|{chunk, 51, 77, diabetes mellitus diagnosed, {sentence -> 0, chunk -> 4, DocumentSimilarity -> 0.7098781656838193, MMRScore -> 0.09725

**Show the selected key phrases, their cosine similarity to the document, their Maximal Marginal Relevance (MMR) score, and the sentence in which each key phrase was found.**

In [None]:
import pyspark.sql.functions as F

ngram_results.select(F.explode(F.arrays_zip(ngram_results.ngram_key_phrases.result,
                                            ngram_results.ngram_key_phrases.metadata)).alias("cols"))\
              .select(F.expr("cols['0']").alias("key_phrase"),
                      F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                      F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                      F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+--------------------------------+-------------------+-------------------+--------+
|key_phrase                      |DocumentSimilarity |MMRScore           |sentence|
+--------------------------------+-------------------+-------------------+--------+
|subsequent type diabetes        |0.7542784653887057 |0.4525670972166234 |0       |
|acutehepatitis obesity          |0.7374926662801939 |0.10970933303037028|0       |
|diabetes mellitus diagnosed     |0.7098781656838193 |0.09725892639838402|0       |
|mellitus diagnosed eightyears   |0.6947002544930619 |0.15954200585177963|0       |
|HTG-induced pancreatitis years  |0.6887379262018941 |0.0987161543688535 |0       |
|starvation ketosis,as reported  |0.6012425192364881 |0.10975994923902782|0       |
|vomiting                        |0.5715590600142633 |0.11051665834892604|0       |
|five-day amoxicillin respiratory|0.5284766720015965 |0.09226861433405384|0       |
|33.5 kg/m2                      |0.47599076672601626|0.08337351428371803|0 
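
The `MMRScore` column reflects Maximal Marginal Relevance, which trades relevance to the document against redundancy with phrases that were already picked. The sketch below implements the textbook greedy form, `(1 - divergence) * sim(candidate, document) - divergence * max sim(candidate, selected)`. Whether the annotator uses exactly this formula is an assumption. It is at least consistent with the first row above, where 0.4526 is roughly 0.6 × 0.7543 for a divergence of 0.4. The function name `mmr_select` and all similarity values in the example are invented for illustration.

```python
def mmr_select(doc_sim, pair_sim, top_n, divergence):
    """Greedy Maximal Marginal Relevance selection.

    doc_sim:  {candidate: similarity to the document}
    pair_sim: {(a, b): similarity between candidates}, keys in sorted order
    Returns a list of (candidate, mmr_score) in selection order.
    """
    def sim(a, b):
        return pair_sim[tuple(sorted((a, b)))]

    selected = []
    remaining = set(doc_sim)
    while remaining and len(selected) < top_n:
        best, best_score = None, float("-inf")
        for cand in sorted(remaining):  # sorted for deterministic tie-breaking
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((sim(cand, s) for s, _ in selected), default=0.0)
            score = (1 - divergence) * doc_sim[cand] - divergence * redundancy
            if score > best_score:
                best, best_score = cand, score
        selected.append((best, best_score))
        remaining.remove(best)
    return selected

# Invented similarities, for illustration only.
doc_sim = {"type two diabetes": 0.75, "diabetes mellitus": 0.71, "vomiting": 0.57}
pair_sim = {
    ("diabetes mellitus", "type two diabetes"): 0.90,
    ("diabetes mellitus", "vomiting"): 0.10,
    ("type two diabetes", "vomiting"): 0.15,
}

picked = mmr_select(doc_sim, pair_sim, top_n=2, divergence=0.4)
for phrase, score in picked:
    print(f"{phrase}: {score:.3f}")
```

In this toy run the second pick is "vomiting" rather than "diabetes mellitus": the latter is more similar to the document but nearly duplicates the first pick, so its score is pushed down. A higher divergence strengthens that effect.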

# with NER Model

Now we will show how to get key phrases from NER chunks by feeding `ChunkKeyPhraseExtraction` with the output of `NerConverter`.

In [None]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

ner_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_key_phrases")

ner_pipeline = nlp.Pipeline(
    stages=[
        documenter,
        sentencer,
        tokenizer,
        embeddings,
        ner_tagger,
        ner_converter,
        ner_key_phrase_extractor
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [None]:
ner_results = ner_pipeline.fit(empty_data).transform(textDF)

In [None]:
# ner_chunk results

ner_results.select(F.explode(F.arrays_zip(ner_results.ner_chunks.result,
                                          ner_results.ner_chunks.metadata)).alias("cols"))\
           .select(F.expr("cols['0']").alias("ner_chunk"),
                   F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+-----------------------------+----------------------------+
|ner_chunk                    |label                       |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|obesity                      |Obesity                     |
|body mass index              |BMI                         |
|BMI                          |BMI                         |
|33.5 kg/m2                   |BMI                         |
|polyuria                     |Symptom                     |
|polydipsia                   |Symptom                     |
|poor appetite          

**Show the key phrase results and scores we got using NER chunks.**

In [None]:
ner_results.select(F.explode(F.arrays_zip(ner_results.ner_key_phrases.result,
                                          ner_results.ner_key_phrases.metadata)).alias("cols"))\
           .select(F.expr("cols['0']").alias("key_phrase"),
                   F.expr("cols['1']['entity']").alias("label"),
                   F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                   F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                   F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+-----------------------------+-------------------------+-------------------+--------------------+--------+
|key_phrase                   |label                    |DocumentSimilarity |MMRScore            |sentence|
+-----------------------------+-------------------------+-------------------+--------------------+--------+
|type two diabetes mellitus   |Diabetes                 |0.76186796033805   |0.4571207943671777  |0       |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder|0.6774719133598364 |0.10904726738662779 |0       |
|gestational diabetes mellitus|Diabetes                 |0.6607611708524267 |0.04530583849463887 |0       |
|vomiting                     |Symptom                  |0.5715590410750901 |0.1421229274693381  |0       |
|lipemia                      |Symptom                  |0.5511428523347072 |0.05491482232831607 |6       |
|obesity                      |Obesity                  |0.5088106156642999 |0.08970662191682993 |0       |
|33.5 kg/m2                 

# with NGramGenerator and NER Model

We can also extract key phrases from the merged N-gram and NER chunks.
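
Conceptually, merging without collapsing overlaps amounts to pooling both chunk lists and ordering them by position. The plain-Python sketch below illustrates that idea only; `merge_chunks` is a hypothetical helper, not the `ChunkMergeApproach` implementation. Offsets and labels follow the outputs shown in this notebook, and the NER span for "gestational diabetes mellitus" is assumed to equal the n-gram span.

```python
def merge_chunks(ngram_chunks, ner_chunks):
    """Combine two chunk lists, keep overlapping chunks, order by position.

    Each chunk is a dict with "begin", "end", "text" and optionally "entity".
    N-gram chunks carry no entity, so they are labelled "UNK" here.
    """
    merged = [dict(chunk, entity=chunk.get("entity", "UNK"))
              for chunk in ngram_chunks + ner_chunks]
    # sorted() is stable, so chunks with identical spans keep their input order.
    return sorted(merged, key=lambda chunk: (chunk["begin"], chunk["end"]))

ngram_chunks = [
    {"begin": 2, "end": 34, "text": "28-year-old female history"},
    {"begin": 39, "end": 67, "text": "gestational diabetes mellitus"},
]
ner_chunks = [
    {"begin": 2, "end": 12, "text": "28-year-old", "entity": "Age"},
    {"begin": 14, "end": 19, "text": "female", "entity": "Gender"},
    {"begin": 39, "end": 67, "text": "gestational diabetes mellitus", "entity": "Diabetes"},
]

merged_chunks = merge_chunks(ngram_chunks, ner_chunks)
for chunk in merged_chunks:
    print(chunk["begin"], chunk["end"], chunk["text"], chunk["entity"])
```

Because overlapping chunks are kept, the same span can appear twice, once as an `UNK` n-gram and once with its NER label.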

In [None]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

ngram_generator = nlp.NGramGenerator()\
    .setInputCols(["clean_tokens"])\
    .setOutputCol("ngrams")\
    .setN(3)

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("ngrams", "ner_chunks")\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(False)

ngram_ner_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")

ngram_ner_pipeline = nlp.Pipeline(
    stages=[
        documenter,
        sentencer,
        tokenizer,
        stop_words_cleaner,
        ngram_generator,
        embeddings,
        ner_tagger,
        ner_converter,
        chunk_merger,
        ngram_ner_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [None]:
ngram_ner_results = ngram_ner_pipeline.fit(empty_data).transform(textDF)

**Show the merged key phrase candidates. Chunks labeled `UNK` come from `NGramGenerator`; the others come from the `ner_jsl` model.**

In [None]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate").show(30,truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 2, 12, 28-year-old, {chunk -> 0, confidence -> 0.999, ner_source -> ner_chunks, entity -> Age, sentence -> 0}, []}                                        |
|{chunk, 2, 34, 28-year-old female history, {entity -> UNK, chunk -> 1, sentence -> 0}, []}                                                                        |
|{chunk, 14, 19, female, {chunk -> 2, confidence -> 0.9999, ner_source -> ner_chunks, entity -> Gender, sentence -> 0}, []}                                        |
|{chunk, 1

In [None]:
# NER chunk results

ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
                                                ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).filter("label != 'UNK'").show(50, truncate=False)

+-----------------------------+----------------------------+
|key_phrase_candidate         |label                       |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|obesity                      |Obesity                     |
|body mass index              |BMI                         |
|BMI                          |BMI                         |
|33.5 kg/m2                   |BMI                         |
|polyuria                     |Symptom                     |
|polydipsia                   |Symptom                     |
|poor appetite          

In [None]:
# ngram results

ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
                                                ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).filter("label == 'UNK'").show(50, truncate=False)

+-------------------------------------+-----+
|key_phrase_candidate                 |label|
+-------------------------------------+-----+
|28-year-old female history           |UNK  |
|female history gestational           |UNK  |
|history gestational diabetes         |UNK  |
|gestational diabetes mellitus        |UNK  |
|diabetes mellitus diagnosed          |UNK  |
|mellitus diagnosed eightyears        |UNK  |
|diagnosed eightyears prior           |UNK  |
|eightyears prior presentation        |UNK  |
|prior presentation subsequent        |UNK  |
|presentation subsequent type         |UNK  |
|subsequent type diabetes             |UNK  |
|type diabetes mellitus               |UNK  |
|diabetes mellitus (                  |UNK  |
|mellitus ( T2DM                      |UNK  |
|( T2DM ),                            |UNK  |
|T2DM ), priorepisode                 |UNK  |
|), priorepisode HTG-induced          |UNK  |
|priorepisode HTG-induced pancreatitis|UNK  |
|HTG-induced pancreatitis years   

In [None]:
# merged (NER chunk + ngram) results

ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
                                                ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+-------------------------------------+-------------------------+
|key_phrase_candidate                 |label                    |
+-------------------------------------+-------------------------+
|28-year-old                          |Age                      |
|28-year-old female history           |UNK                      |
|female                               |Gender                   |
|female history gestational           |UNK                      |
|history gestational diabetes         |UNK                      |
|gestational diabetes mellitus        |UNK                      |
|gestational diabetes mellitus        |Diabetes                 |
|diabetes mellitus diagnosed          |UNK                      |
|mellitus diagnosed eightyears        |UNK                      |
|diagnosed eightyears prior           |UNK                      |
|eightyears prior presentation        |UNK                      |
|prior presentation subsequent        |UNK                      |
|presentat

**Show the key phrase candidates and their source (NER or NGramGenerator).**

In [None]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate")\
                 .selectExpr("key_phrase_candidate.result AS key_phrase_candidate",
                             "IF(key_phrase_candidate.metadata.entity = 'UNK', 'ngram', 'NER') AS source",
                             "key_phrase_candidate.metadata.sentence")\
                 .show(50, truncate=False)

+-------------------------------------+------+--------+
|key_phrase_candidate                 |source|sentence|
+-------------------------------------+------+--------+
|28-year-old                          |NER   |0       |
|28-year-old female history           |ngram |0       |
|female                               |NER   |0       |
|female history gestational           |ngram |0       |
|history gestational diabetes         |ngram |0       |
|gestational diabetes mellitus        |ngram |0       |
|gestational diabetes mellitus        |NER   |0       |
|diabetes mellitus diagnosed          |ngram |0       |
|mellitus diagnosed eightyears        |ngram |0       |
|diagnosed eightyears prior           |ngram |0       |
|eightyears prior presentation        |ngram |0       |
|prior presentation subsequent        |ngram |0       |
|presentation subsequent type         |ngram |0       |
|subsequent type diabetes             |ngram |0       |
|type diabetes mellitus               |ngram |0 

**Show the extracted key phrases and their scores.**

In [None]:
ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.key_phrases.result,
                                                ngram_ner_results.key_phrases.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase"),
                         F.expr("cols['1']['entity']").alias("label"),
                         F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                         F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                         F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+-------------------------------------+--------+-------------------+-------------------+--------+
|key_phrase                           |label   |DocumentSimilarity |MMRScore           |sentence|
+-------------------------------------+--------+-------------------+-------------------+--------+
|type two diabetes mellitus           |Diabetes|0.76186796033805   |0.4571207943671777 |0       |
|acutehepatitis obesity               |UNK     |0.7374928122832567 |0.12142129017129283|0       |
|mellitus diagnosed eightyears        |UNK     |0.6947002544930619 |0.13722957778469508|0       |
|priorepisode HTG-induced pancreatitis|UNK     |0.6733035437803803 |0.10736566460437774|0       |
|history gestational diabetes         |UNK     |0.6203001333116362 |0.09399802155154968|0       |
|starvation ketosis,as reported       |UNK     |0.6012425192364881 |0.11195222789989229|0       |
|vomiting                             |UNK     |0.5715590600142633 |0.14212302176265743|0       |
|five-day amoxicilli

**Show the extracted key phrases and their sources.**

In [None]:
ngram_ner_results.selectExpr("explode(key_phrases) AS key_phrase")\
                 .selectExpr(
                     "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                     "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                     "key_phrase.metadata.DocumentSimilarity",
                     "key_phrase.metadata.MMRScore",
                     "key_phrase.metadata.sentence")\
                 .show(truncate=False)

+-------------------------------------+------+-------------------+-------------------+--------+
|key_phrase                           |source|DocumentSimilarity |MMRScore           |sentence|
+-------------------------------------+------+-------------------+-------------------+--------+
|type two diabetes mellitus           |NER   |0.76186796033805   |0.4571207943671777 |0       |
|acutehepatitis obesity               |ngrams|0.7374928122832567 |0.12142129017129283|0       |
|mellitus diagnosed eightyears        |ngrams|0.6947002544930619 |0.13722957778469508|0       |
|priorepisode HTG-induced pancreatitis|ngrams|0.6733035437803803 |0.10736566460437774|0       |
|history gestational diabetes         |ngrams|0.6203001333116362 |0.09399802155154968|0       |
|starvation ketosis,as reported       |ngrams|0.6012425192364881 |0.11195222789989229|0       |
|vomiting                             |ngrams|0.5715590600142633 |0.14212302176265743|0       |
|five-day amoxicillin respiratory     |n

**Now we will replace the default embeddings of `ChunkKeyPhraseExtraction` (`sbert_jsl_medium_uncased`) with `sbiobert_base_cased_mli` and compare the results.**

In [None]:
ngram_ner_key_phrase_bio = medical.ChunkKeyPhraseExtraction.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")

ngram_ner_bio_pipeline = nlp.Pipeline(
    stages=[
        documenter,
        sentencer,
        tokenizer,
        stop_words_cleaner,
        ngram_generator,
        embeddings,
        ner_tagger,
        ner_converter,
        chunk_merger,
        ngram_ner_key_phrase_bio
])

sbiobert_base_cased_mli download started this may take some time.
[OK!]


In [None]:
ngram_ner_bio_results = ngram_ner_bio_pipeline.fit(empty_data).transform(textDF)

In [None]:
# sbiobert_base_cased_mli

ngram_ner_bio_results.selectExpr("explode(key_phrases) AS key_phrase")\
                     .selectExpr(
                         "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                         "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                         "key_phrase.metadata.DocumentSimilarity",
                         "key_phrase.metadata.MMRScore",
                         "key_phrase.metadata.sentence")\
                     .show(truncate=False)

+----------------------------------+------+-------------------+--------------------+--------+
|key_phrase                        |source|DocumentSimilarity |MMRScore            |sentence|
+----------------------------------+------+-------------------+--------------------+--------+
|HTG-induced pancreatitis years    |ngrams|0.6191193756992782 |0.3714716401805231  |0       |
|presented one-weekhistory polyuria|ngrams|0.5924186948898029 |0.12179446625993492 |0       |
|history gestational diabetes      |ngrams|0.572639269174604  |0.08427472753988768 |0       |
|acutehepatitis obesity            |ngrams|0.5583371363841239 |0.048648416385437154|0       |
|insulin glargine night            |ngrams|0.5489178576662569 |0.09796492750949362 |0       |
|admitted starvation ketosis,as    |ngrams|0.5292541710925237 |0.06368558862634016 |0       |
|28-year-old female history        |ngrams|0.49122686026516654|0.12579575849157976 |0       |
|triglycerides508 mg/dL            |ngrams|0.490567531280216

In [None]:
# sbert_jsl_medium_uncased (default)

ngram_ner_results.selectExpr("explode(key_phrases) AS key_phrase")\
                 .selectExpr(
                     "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                     "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                     "key_phrase.metadata.DocumentSimilarity",
                     "key_phrase.metadata.MMRScore",
                     "key_phrase.metadata.sentence")\
                 .show(truncate=False)

+-------------------------------------+------+-------------------+-------------------+--------+
|key_phrase                           |source|DocumentSimilarity |MMRScore           |sentence|
+-------------------------------------+------+-------------------+-------------------+--------+
|type two diabetes mellitus           |NER   |0.76186796033805   |0.4571207943671777 |0       |
|acutehepatitis obesity               |ngrams|0.7374928122832567 |0.12142129017129283|0       |
|mellitus diagnosed eightyears        |ngrams|0.6947002544930619 |0.13722957778469508|0       |
|priorepisode HTG-induced pancreatitis|ngrams|0.6733035437803803 |0.10736566460437774|0       |
|history gestational diabetes         |ngrams|0.6203001333116362 |0.09399802155154968|0       |
|starvation ketosis,as reported       |ngrams|0.6012425192364881 |0.11195222789989229|0       |
|vomiting                             |ngrams|0.5715590600142633 |0.14212302176265743|0       |
|five-day amoxicillin respiratory     |n

**Let's set `.setConcatenateSentences(False)` so that embeddings are computed per sentence and then averaged to obtain the document embedding, and check how the results change.**
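
To make the averaging step concrete, here is a minimal plain-Python sketch with made-up 3-dimensional vectors (not model output). The helper names `cosine` and `mean_vector` are hypothetical and only illustrate the idea of building a document vector from sentence vectors. The annotator's internal computation is not shown in this notebook and may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy "sentence embeddings" (assumed values, one vector per sentence).
sentence_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Document vector obtained by averaging the sentence vectors.
document_embedding = mean_vector(sentence_embeddings)

# A candidate chunk that points in the same direction as the first sentence.
chunk_embedding = [1.0, 0.0, 0.0]

sim_to_own_sentence = cosine(chunk_embedding, sentence_embeddings[0])
sim_to_document = cosine(chunk_embedding, document_embedding)

print(f"similarity to own sentence:   {sim_to_own_sentence:.4f}")
print(f"similarity to averaged document: {sim_to_document:.4f}")
```

In this toy case the chunk matches its own sentence exactly but scores lower against the averaged document vector, because the average also reflects the other sentences.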

In [None]:
ngram_ner_key_phrase_sent = medical.ChunkKeyPhraseExtraction.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")\
    .setConcatenateSentences(False)

ngram_ner_sent_pipeline = nlp.Pipeline(
    stages=[
        documenter,
        sentencer,
        tokenizer,
        stop_words_cleaner,
        ngram_generator,
        embeddings,
        ner_tagger,
        ner_converter,
        chunk_merger,
        ngram_ner_key_phrase_sent
])

sbiobert_base_cased_mli download started this may take some time.
[OK!]


In [None]:
ngram_ner_sent_results = ngram_ner_sent_pipeline.fit(empty_data).transform(textDF)

In [None]:
# .setConcatenateSentences(False)

ngram_ner_sent_results.selectExpr("explode(key_phrases) AS key_phrase")\
                      .selectExpr(
                          "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                          "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                          "key_phrase.metadata.DocumentSimilarity",
                          "key_phrase.metadata.MMRScore",
                          "key_phrase.metadata.sentence")\
                      .show(50,truncate=False)

+----------------------------------------+------+--------------------+--------------------+--------+
|key_phrase                              |source|DocumentSimilarity  |MMRScore            |sentence|
+----------------------------------------+------+--------------------+--------------------+--------+
|HTG-induced pancreatitis years          |ngrams|0.17186920790286173 |0.10312152883939826 |0       |
|mellitus diagnosed eightyears           |ngrams|0.1319050034000668  |-0.13428264249974303|0       |
|12 units insulinlispro                  |ngrams|0.1137953678355261  |-0.09712590153751575|0       |
|female history gestational              |ngrams|0.10601142845349688 |-0.10343722456040866|0       |
|atorvastatin gemfibrozil forHTG         |ngrams|0.09678836928743394 |-0.14838669818912978|0       |
|polydipsia                              |ngrams|0.08661954196983217 |-0.12535456998042302|0       |
|for three days                          |NER   |0.06084925296064442 |-0.14891718803483797|

In [None]:
# .setConcatenateSentences(True) # default

ngram_ner_bio_results.selectExpr("explode(key_phrases) AS key_phrase")\
                     .selectExpr(
                         "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                         "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                         "key_phrase.metadata.DocumentSimilarity",
                         "key_phrase.metadata.MMRScore",
                         "key_phrase.metadata.sentence")\
                     .show(truncate=False)

+----------------------------------+------+-------------------+--------------------+--------+
|key_phrase                        |source|DocumentSimilarity |MMRScore            |sentence|
+----------------------------------+------+-------------------+--------------------+--------+
|HTG-induced pancreatitis years    |ngrams|0.6191193756992782 |0.3714716401805231  |0       |
|presented one-weekhistory polyuria|ngrams|0.5924186948898029 |0.12179446625993492 |0       |
|history gestational diabetes      |ngrams|0.572639269174604  |0.08427472753988768 |0       |
|acutehepatitis obesity            |ngrams|0.5583371363841239 |0.048648416385437154|0       |
|insulin glargine night            |ngrams|0.5489178576662569 |0.09796492750949362 |0       |
|admitted starvation ketosis,as    |ngrams|0.5292541710925237 |0.06368558862634016 |0       |
|28-year-old female history        |ngrams|0.49122686026516654|0.12579575849157976 |0       |
|triglycerides508 mg/dL            |ngrams|0.490567531280216
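
The `MMRScore` column can be read with Maximal Marginal Relevance (MMR) in mind. In the tables above, the first row satisfies `MMRScore ≈ (1 - divergence) × DocumentSimilarity` (for example, 0.6191 × 0.6 ≈ 0.3715 with `setDivergence(0.4)`), which is what the standard MMR formula gives when nothing has been selected yet. The sketch below assumes that standard form: relevance to the document minus a penalty for similarity to phrases already picked. The function `mmr_select` and the toy vectors are hypothetical, and the annotator's exact implementation is not shown in this notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(doc_vec, candidates, top_n, divergence):
    """Greedy MMR selection (assumed standard formulation).

    candidates: dict mapping phrase -> embedding vector.
    Returns a list of (phrase, document_similarity, mmr_score) in pick order.
    """
    doc_sim = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        best = None
        for phrase in remaining:
            # Penalty: highest similarity to any phrase already selected.
            redundancy = max(
                (cosine(candidates[phrase], candidates[s]) for s, _, _ in selected),
                default=0.0,
            )
            score = (1 - divergence) * doc_sim[phrase] - divergence * redundancy
            if best is None or score > best[2]:
                best = (phrase, doc_sim[phrase], score)
        selected.append(best)
        remaining.remove(best[0])
    return selected

# Toy 2-dimensional embeddings (assumed values, not model output).
doc_vec = [1.0, 1.0]
candidates = {
    "a": [1.0, 0.6],   # most similar to the document
    "b": [1.0, 0.55],  # near-duplicate of "a"
    "c": [0.55, 1.0],  # as relevant as "b" but different from "a"
}

result = mmr_select(doc_vec, candidates, top_n=2, divergence=0.4)
for phrase, sim, score in result:
    print(phrase, round(sim, 4), round(score, 4))
```

With these toy vectors the second pick is `c` rather than the near-duplicate `b`, and later picks can receive negative scores once the redundancy penalty outweighs the relevance term, as in the tables above.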

# With YAKE Keyword Extraction

Let's extract key phrases using `YakeKeywordExtraction` and compare the results with `ChunkKeyPhraseExtraction`.

In [None]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

keywords = nlp.YakeKeywordExtraction() \
    .setInputCols("clean_tokens") \
    .setOutputCol("yake") \
    .setMinNGrams(1) \
    .setMaxNGrams(3)\
    .setNKeywords(20)

yake_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "yake"])\
    .setOutputCol("yake_key_phrases")

yake_pipeline = nlp.Pipeline(
    stages=[
        documenter,
        sentencer,
        tokenizer,
        stop_words_cleaner,
        keywords,
        yake_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [None]:
yake_results = yake_pipeline.fit(empty_data).transform(textDF)

**Let's check the YAKE keyword extraction results and scores. In YAKE, a lower score means a more relevant keyword.**

In [None]:
yake_results.selectExpr("explode(yake) AS key_phrase_candidate").show(30,truncate=False)

+------------------------------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                                        |
+------------------------------------------------------------------------------------------------------------+
|{chunk, 90, 94, prior, {score -> 0.04164870944906833, sentence -> 0}, []}                                   |
|{chunk, 99, 110, presentation, {score -> 0.0430694053298915, sentence -> 0}, []}                            |
|{chunk, 221, 225, prior, {score -> 0.04164870944906833, sentence -> 0}, []}                                 |
|{chunk, 230, 241, presentation, {score -> 0.0430694053298915, sentence -> 0}, []}                           |
|{chunk, 441, 445, prior, {score -> 0.04164870944906833, sentence -> 0}, []}                                 |
|{chunk, 714, 725, presentation, {score -> 0.0430694053298915, sentence -> 0}, []}                           |
|

In [None]:
scores = yake_results.selectExpr("explode(arrays_zip(yake.result, yake.metadata)) as resultTuples") \
                     .selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")

In [None]:
scores.orderBy("score").show(20, truncate = False)

+-----------------------------------+--------------------+
|keyword                            |score               |
+-----------------------------------+--------------------+
|prior presentation                 |0.023475043991031782|
|prior presentation                 |0.023475043991031782|
|eightyears prior presentation      |0.02754913936410175 |
|prior presentation subsequent      |0.02754913936410175 |
|years prior presentation           |0.02754913936410175 |
|prior                              |0.04164870944906833 |
|prior                              |0.04164870944906833 |
|prior                              |0.04164870944906833 |
|prior                              |0.04164870944906833 |
|prior                              |0.04164870944906833 |
|presentation                       |0.0430694053298915  |
|presentation                       |0.0430694053298915  |
|presentation                       |0.0430694053298915  |
|presentation                       |0.0430694053298915 

**Show the distinct YAKE keywords ordered by score, best (lowest) first. `show()` prints the first 20 rows by default.**

In [None]:
scores.select("keyword", "score").distinct().orderBy("score").show(truncate = False)

+--------------------------------------+--------------------+
|keyword                               |score               |
+--------------------------------------+--------------------+
|prior presentation                    |0.023475043991031782|
|eightyears prior presentation         |0.02754913936410175 |
|prior presentation subsequent         |0.02754913936410175 |
|years prior presentation              |0.02754913936410175 |
|prior                                 |0.04164870944906833 |
|presentation                          |0.0430694053298915  |
|patient treated insulin               |0.04328339466593739 |
|presentation revealed glucose         |0.047720606621427554|
|days prior admission                  |0.04778174700869625 |
|prior analysis due                    |0.04778174700869625 |
|physical examinationon presentation   |0.04947128993462831 |
|examinationon presentation significant|0.04965145049602467 |
|presentation significant dry          |0.04965145049602467 |
|obtaine
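
To keep exactly the top N distinct keywords, the same deduplicate-then-sort logic can be sketched in plain Python. The rows below are copied from the YAKE output above, the helper `top_distinct_keywords` is a hypothetical name, and the sketch relies on lower YAKE scores being better.

```python
# (keyword, score) pairs copied from the YAKE output above, duplicates included.
rows = [
    ("prior presentation", 0.023475043991031782),
    ("prior presentation", 0.023475043991031782),
    ("eightyears prior presentation", 0.02754913936410175),
    ("prior", 0.04164870944906833),
    ("prior", 0.04164870944906833),
    ("presentation", 0.0430694053298915),
    ("patient treated insulin", 0.04328339466593739),
]

def top_distinct_keywords(rows, n):
    """Keep the lowest score per keyword, then return the n best keywords."""
    best = {}
    for keyword, score in rows:
        if keyword not in best or score < best[keyword]:
            best[keyword] = score
    # Lower YAKE score means a more relevant keyword, so sort ascending.
    return sorted(best.items(), key=lambda item: item[1])[:n]

top3 = top_distinct_keywords(rows, 3)
for keyword, score in top3:
    print(f"{keyword}: {score}")
```

On a Spark DataFrame the equivalent would be to deduplicate, order by `score`, and limit to N rows before calling `show()`.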

**Now we can compare the results with `ChunkKeyPhraseExtraction`.**

In [None]:
yake_results.select(F.explode(F.arrays_zip(yake_results.yake_key_phrases.result,
                                           yake_results.yake_key_phrases.metadata)).alias("cols"))\
            .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                    F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                    F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                    F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+--------------------------------------+-------------------+---------------------+--------+
|key_phrase_candidate                  |DocumentSimilarity |MMRScore             |sentence|
+--------------------------------------+-------------------+---------------------+--------+
|diabetes mellitus                     |0.69929147989059   |0.4195749046067621   |0       |
|eightyears prior presentation         |0.44308267284190533|0.14512543436472033  |0       |
|patient treated insulin               |0.3529302120453061 |0.014201883221691414 |0       |
|presentation revealed glucose         |0.33960110644740266|-0.1025127835892978  |0       |
|examinationon presentation significant|0.2811638977037049 |-0.07404473180447277 |0       |
|presentation significant dry          |0.2635832099174683 |-0.031241953167655767|0       |
|removed prior analysis                |0.26338243417464785|-0.10195427259635886 |0       |
|prior                                 |0.22823906334937366|-0.06576009573146158

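**As you can see, `ChunkKeyPhraseExtraction` ranks a clinically meaningful phrase (`diabetes mellitus`) first, whereas the top `YakeKeywordExtraction` keywords are generic phrases such as `prior presentation`.**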