![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/19.0.Chunk_Key_Phrase_Extraction.ipynb)

# Chunk Key Phrase Extraction

This notebook shows how to extract key phrases with `ChunkKeyPhraseExtraction`, which uses Sentence BERT embeddings to select the keywords and key phrases that are most similar to a document. The annotator accepts candidate chunks from an NER model, NGramGenerator, or YAKE, and produces a similarity score for each chunk, including every NER chunk coming out of any (clinical) NER model. That is, you can sort your clinical entities by their importance with respect to the document or sentence they appear in. You can also use this annotator to recover clinical chunks that a pretrained NER model missed, and to summarize a whole document as a few important sentences or phrases.

`ChunkKeyPhraseExtraction` uses BERT sentence embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and sentence or document annotations. The model compares the chunks against the corresponding sentences or documents and selects the chunks that are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.
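
To make the selection idea concrete, here is a minimal pure-Python sketch of greedy Maximal Marginal Relevance (MMR) over toy vectors. It is not the annotator's implementation, which this notebook does not show. The formula `(1 - divergence) * similarity - divergence * redundancy`, the function names, and the 3-D "embeddings" are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_key_phrases(doc_vec, candidates, top_n=2, divergence=0.4):
    """Greedy MMR: trade relevance to the document against redundancy.

    candidates maps phrase -> embedding (toy vectors in this sketch).
    Returns a list of (phrase, document_similarity, mmr_score).
    """
    doc_sim = {p: cosine(doc_vec, v) for p, v in candidates.items()}
    selected = []
    remaining = set(candidates)
    while remaining and len(selected) < top_n:
        best, best_score = None, -math.inf
        for p in sorted(remaining):  # sorted so ties resolve deterministically
            # Highest similarity to anything already selected (0.0 if none yet).
            redundancy = max(
                (cosine(candidates[p], candidates[s]) for s, _, _ in selected),
                default=0.0,
            )
            score = (1 - divergence) * doc_sim[p] - divergence * redundancy
            if score > best_score:
                best, best_score = p, score
        selected.append((best, doc_sim[best], best_score))
        remaining.remove(best)
    return selected

# Toy 3-D vectors, invented for illustration only.
doc = [1.0, 1.0, 0.0]
toy = {
    "type two diabetes": [1.0, 0.2, 0.0],
    "diabetes type two": [1.0, 0.1, 0.0],  # near-duplicate of the first phrase
    "vomiting": [0.0, 1.0, 0.0],           # less similar to doc, but not redundant
}
picked = select_key_phrases(doc, toy, top_n=2, divergence=0.4)
for phrase, sim, mmr in picked:
    print(phrase, round(sim, 4), round(mmr, 4))
```

In this toy setup the second pick is "vomiting" rather than the near-duplicate phrase, because the redundancy penalty outweighs the near-duplicate's higher document similarity.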

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at [Models Hub](https://nlp.johnsnowlabs.com/models?task=Embeddings).

If no name is provided, the default model is `"sbert_jsl_medium_uncased"`.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [4]:
from johnsnowlabs import nlp, medical, visual

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/5.0.0.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


**Let's start by creating a Spark DataFrame.**

In [5]:
text = """
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight
years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior
episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute
hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week
history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation,
she was treated with a five-day course of amoxicillin for a respiratory tract infection.
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for
HTG. She had been on dapagliflozin for six months at the time of presentation . Physical examination
on presentation was significant for dry oral mucosa ; significantly, her abdominal examination was
benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were:
serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides
508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27.
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept
hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis,
as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained
six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21,
serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L.
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was
centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by
lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion
gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by
her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the
endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin
lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors
should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.
""".strip().replace("\n", "")

empty_data = spark.createDataFrame([[""]]).toDF("text")
textDF = spark.createDataFrame([[text]]).toDF("text")

# with NGramGenerator
First, we will show how to get key phrases from n-grams by feeding `ChunkKeyPhraseExtraction` with the output of `NGramGenerator`.
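
As a rough picture of what the candidate-generation stages do, the following pure-Python sketch removes stop words and then slides a window of three tokens over what remains. It only mimics the idea. The tiny `STOP_WORDS` set and the whitespace tokenization are assumptions for illustration, not the pretrained `stopwords_en` list or the Spark NLP tokenizer.

```python
def sliding_ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stop-word list, far smaller than the pretrained `stopwords_en`.
STOP_WORDS = {"a", "with", "of"}

tokens = "A 28-year-old female with a history of gestational diabetes mellitus".split()
clean_tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
trigrams = sliding_ngrams(clean_tokens, n=3)

for phrase in trigrams:
    print(phrase)
```

The four trigrams this produces match the first candidates in the N-gram output shown further down in this section.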

In [6]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])  # raw strings avoid invalid-escape warnings

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

ngram_generator = nlp.NGramGenerator()\
    .setInputCols(["clean_tokens"])\
    .setOutputCol("ngrams")\
    .setN(3)

ngram_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ngrams"])\
    .setOutputCol("ngram_key_phrases")

ngram_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    stop_words_cleaner,
    ngram_generator,
    ngram_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [7]:
ngram_results = ngram_pipeline.fit(empty_data).transform(textDF)

**Let's show the N-gram results.**

In [8]:
ngram_results.selectExpr("explode(ngrams) AS key_phrase_candidate").show(30,truncate=False)

+-------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                 |
+-------------------------------------------------------------------------------------+
|{chunk, 2, 34, 28-year-old female history, {sentence -> 0, chunk -> 0}, []}          |
|{chunk, 14, 49, female history gestational, {sentence -> 0, chunk -> 1}, []}         |
|{chunk, 28, 58, history gestational diabetes, {sentence -> 0, chunk -> 2}, []}       |
|{chunk, 39, 67, gestational diabetes mellitus, {sentence -> 0, chunk -> 3}, []}      |
|{chunk, 51, 77, diabetes mellitus diagnosed, {sentence -> 0, chunk -> 4}, []}        |
|{chunk, 60, 89, mellitus diagnosed years, {sentence -> 0, chunk -> 5}, []}           |
|{chunk, 69, 95, diagnosed years prior, {sentence -> 0, chunk -> 6}, []}              |
|{chunk, 85, 111, years prior presentation, {sentence -> 0, chunk -> 7}, []}          |
|{chunk, 91, 126, prior presenta

**Check the key phrases from N-Gram results.**

In [9]:
ngram_results.selectExpr("explode(ngram_key_phrases) AS ngram_key_phrases").show(truncate=170)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                         ngram_key_phrases|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 117, 144, subsequent type diabetes, {sentence -> 0, chunk -> 10, DocumentSimilarity -> 0.7503709042309195, MMRScore -> 0.45022256042878817}, [0.27309972, -1.55...|
|{chunk, 51, 77, diabetes mellitus diagnosed, {sentence -> 0, chunk -> 4, DocumentSimilarity -> 0.6953793514459397, MMRScore -> 0.08263038145485851}, [-0.41446584, -1.9...|
|{chunk, 186, 221, HTG-induced pancreatitis years, {sentence -> 0, chunk -> 19, DocumentSimilarity -> 0.6817062650187332, MMRScore -> 0

**Show the selected key phrases, their cosine similarity to the document, their Maximal Marginal Relevance (MMR) score, and the sentence in which each key phrase was found.**

In [10]:
import pyspark.sql.functions as F

ngram_results.select(F.explode(F.arrays_zip(ngram_results.ngram_key_phrases.result,
                                            ngram_results.ngram_key_phrases.metadata)).alias("cols"))\
              .select(F.expr("cols['0']").alias("key_phrase"),
                      F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                      F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                      F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+--------------------------------+-------------------+-------------------+--------+
|key_phrase                      |DocumentSimilarity |MMRScore           |sentence|
+--------------------------------+-------------------+-------------------+--------+
|subsequent type diabetes        |0.7503709042309195 |0.45022256042878817|0       |
|diabetes mellitus diagnosed     |0.6953793514459397 |0.08263038145485851|0       |
|HTG-induced pancreatitis years  |0.6817062650187332 |0.0944971574913091 |0       |
|hepatitis obesity               |0.6666053818205334 |0.1048001061312131 |0       |
|mellitus diagnosed years        |0.6389212855125315 |0.09924532015688659|0       |
|vomiting                        |0.5824238653885708 |0.136040534374704  |0       |
|admitted starvation ketosis     |0.5789874431296478 |0.09807396880407693|0       |
|five-day amoxicillin respiratory|0.5330654630848009 |0.11179576567368699|0       |
|33.5 kg/m2                      |0.46256522165839314|0.08639873294897249|0 
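
One way to read the two score columns: in the first row of the table above, `MMRScore` is very close to `(1 - divergence) * DocumentSimilarity` with the `divergence` of 0.4 set in the pipeline. The sketch below checks that on values copied from the table. The textbook MMR form `(1 - divergence) * similarity - divergence * redundancy` is an assumption here, since the notebook does not show the annotator's internals, and the "implied redundancy" it derives is only meaningful if that assumption holds.

```python
import math

# (key_phrase, DocumentSimilarity, MMRScore) copied from the table above.
rows = [
    ("subsequent type diabetes", 0.7503709042309195, 0.45022256042878817),
    ("diabetes mellitus diagnosed", 0.6953793514459397, 0.08263038145485851),
    ("HTG-induced pancreatitis years", 0.6817062650187332, 0.0944971574913091),
    ("vomiting", 0.5824238653885708, 0.136040534374704),
]
DIVERGENCE = 0.4  # same value as .setDivergence(0.4) in the pipeline

# The first selected phrase has nothing to be redundant with, so under the
# assumed formula its MMR score reduces to (1 - divergence) * similarity.
_, first_sim, first_mmr = rows[0]
first_row_matches = math.isclose(
    (1 - DIVERGENCE) * first_sim, first_mmr, rel_tol=1e-6
)

# Under the same assumption, solve for the redundancy term of the later rows.
implied_redundancy = {
    phrase: ((1 - DIVERGENCE) * sim - mmr) / DIVERGENCE
    for phrase, sim, mmr in rows[1:]
}

print(first_row_matches)
for phrase, value in implied_redundancy.items():
    print(phrase, round(value, 3))
```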

# with NER Model

Now we will show how to get key phrases from NER chunks by feeding `ChunkKeyPhraseExtraction` with the output of `NerConverter`.

In [11]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])  # raw strings avoid invalid-escape warnings

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

ner_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_key_phrases")

ner_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    embeddings,
    ner_tagger,
    ner_converter,
    ner_key_phrase_extractor
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [12]:
ner_results = ner_pipeline.fit(empty_data).transform(textDF)

In [13]:
# ner_chunk results

ner_results.select(F.explode(F.arrays_zip(ner_results.ner_chunks.result,
                                          ner_results.ner_chunks.metadata)).alias("cols"))\
           .select(F.expr("cols['0']").alias("ner_chunk"),
                   F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+-----------------------------+----------------------------+
|ner_chunk                    |label                       |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|eight years prior            |RelativeDate                |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|acute                        |Modifier                    |
|hepatitis                    |Disease_Syndrome_Disorder   |
|obesity                      |Obesity                     |
|body mass index              |BMI                         |
|BMI                          |BMI                         |
|33.5 kg/m2             

**Show the key phrase results and scores we got using NER chunks.**

In [14]:
ner_results.select(F.explode(F.arrays_zip(ner_results.ner_key_phrases.result,
                                          ner_results.ner_key_phrases.metadata)).alias("cols"))\
           .select(F.expr("cols['0']").alias("key_phrase"),
                   F.expr("cols['1']['entity']").alias("label"),
                   F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                   F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                   F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+-----------------------------+-------------------------+-------------------+-------------------+--------+
|key_phrase                   |label                    |DocumentSimilarity |MMRScore           |sentence|
+-----------------------------+-------------------------+-------------------+-------------------+--------+
|type two diabetes mellitus   |Diabetes                 |0.7639750070708976 |0.45838502245712215|0       |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder|0.6693323086686712 |0.10416350437786537|0       |
|gestational diabetes mellitus|Diabetes                 |0.6605010338982674 |0.04514975631594115|0       |
|hepatitis                    |Disease_Syndrome_Disorder|0.605296336069333  |0.0684493843188052 |0       |
|vomiting                     |Symptom                  |0.5824237404149156 |0.14864183026208302|0       |
|starvation ketosis           |Disease_Syndrome_Disorder|0.5540198117579427 |0.09014276468109875|9       |
|lipemia                      |Sympto
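
Once the scores are collected to the driver, ranking entities by importance is plain Python. The sketch below groups the visible rows of the table above by NER label and keeps those above a cut-off. Annotation metadata values are stored as strings, so the score is cast to `float` first. The `min_similarity` threshold and the `rank_by_label` helper are hypothetical, introduced only for this example.

```python
from collections import defaultdict

# Rows copied from the table above, in the shape of collected annotations.
annotations = [
    {"result": "type two diabetes mellitus",
     "metadata": {"entity": "Diabetes", "DocumentSimilarity": "0.7639750070708976"}},
    {"result": "HTG-induced pancreatitis",
     "metadata": {"entity": "Disease_Syndrome_Disorder", "DocumentSimilarity": "0.6693323086686712"}},
    {"result": "gestational diabetes mellitus",
     "metadata": {"entity": "Diabetes", "DocumentSimilarity": "0.6605010338982674"}},
    {"result": "hepatitis",
     "metadata": {"entity": "Disease_Syndrome_Disorder", "DocumentSimilarity": "0.605296336069333"}},
    {"result": "vomiting",
     "metadata": {"entity": "Symptom", "DocumentSimilarity": "0.5824237404149156"}},
    {"result": "starvation ketosis",
     "metadata": {"entity": "Disease_Syndrome_Disorder", "DocumentSimilarity": "0.5540198117579427"}},
]

def rank_by_label(annotations, min_similarity=0.6):
    """Group key phrases by entity label, best score first (hypothetical helper)."""
    ranked = defaultdict(list)
    for ann in annotations:
        score = float(ann["metadata"]["DocumentSimilarity"])  # string -> float
        if score >= min_similarity:
            ranked[ann["metadata"]["entity"]].append((ann["result"], score))
    for label in ranked:
        ranked[label].sort(key=lambda pair: pair[1], reverse=True)
    return dict(ranked)

by_label = rank_by_label(annotations)
for label, phrases in by_label.items():
    print(label, [p for p, _ in phrases])
```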

# with NGramGenerator and NER Model

We can also get key phrases by merging N-gram and NER chunks.

In [15]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])  # raw strings avoid invalid-escape warnings

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

ngram_generator = nlp.NGramGenerator()\
    .setInputCols(["clean_tokens"])\
    .setOutputCol("ngrams")\
    .setN(3)

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("ngrams", "ner_chunks")\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(False)

ngram_ner_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")

ngram_ner_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    stop_words_cleaner,
    ngram_generator,
    embeddings,
    ner_tagger,
    ner_converter,
    chunk_merger,
    ngram_ner_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [16]:
ngram_ner_results = ngram_ner_pipeline.fit(empty_data).transform(textDF)

**Show the merged key phrase candidates. Candidates labeled `UNK` come from NGramGenerator; the others come from the `ner_jsl` model.**

In [17]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate").show(30,truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 2, 12, 28-year-old, {chunk -> 0, confidence -> 0.9989, ner_source -> ner_chunks, entity -> Age, sentence -> 0}, []}                                           |
|{chunk, 2, 34, 28-year-old female history, {entity -> UNK, chunk -> 1, sentence -> 0}, []}                                                                            |
|{chunk, 14, 19, female, {chunk -> 2, confidence -> 0.9998, ner_source -> ner_chunks, entity -> Gender, sentence -> 0}, []}                                
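
The merged output above can be pictured as a simple union of the two candidate lists, with n-grams tagged `UNK` and overlapping chunks kept side by side. The sketch below reproduces the order of the three rows shown by sorting on `(begin, end)`. That sort key, the `Chunk` type, and the `merge_keep_overlaps` helper are assumptions for illustration. The actual logic of `ChunkMergeApproach` is not shown in this notebook.

```python
from typing import NamedTuple

class Chunk(NamedTuple):
    begin: int
    end: int
    text: str
    entity: str

def merge_keep_overlaps(ngrams, ner_chunks):
    """Union of both candidate lists; n-grams carry no NER label, so tag them UNK."""
    tagged_ngrams = [Chunk(b, e, t, "UNK") for b, e, t in ngrams]
    # Assumed ordering: by start offset, then end offset. Overlaps are not merged.
    return sorted(tagged_ngrams + list(ner_chunks), key=lambda c: (c.begin, c.end))

# Offsets and labels copied from the three rows shown above.
ngrams = [(2, 34, "28-year-old female history")]
ner_chunks = [
    Chunk(2, 12, "28-year-old", "Age"),
    Chunk(14, 19, "female", "Gender"),
]

merged = merge_keep_overlaps(ngrams, ner_chunks)
for chunk in merged:
    print(chunk.begin, chunk.end, chunk.text, chunk.entity)
```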

In [18]:
# NER chunk results

ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
                                                ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).filter("label != 'UNK'").show(50, truncate=False)

+-----------------------------+----------------------------+
|key_phrase_candidate         |label                       |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|eight years prior            |RelativeDate                |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|acute                        |Modifier                    |
|hepatitis                    |Disease_Syndrome_Disorder   |
|obesity                      |Obesity                     |
|body mass index              |BMI                         |
|BMI                          |BMI                         |
|33.5 kg/m2             

In [19]:
# ngram results

ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
                                                ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).filter("label == 'UNK'").show(50, truncate=False)

+--------------------------------+-----+
|key_phrase_candidate            |label|
+--------------------------------+-----+
|28-year-old female history      |UNK  |
|female history gestational      |UNK  |
|history gestational diabetes    |UNK  |
|gestational diabetes mellitus   |UNK  |
|diabetes mellitus diagnosed     |UNK  |
|mellitus diagnosed years        |UNK  |
|diagnosed years prior           |UNK  |
|years prior presentation        |UNK  |
|prior presentation subsequent   |UNK  |
|presentation subsequent type    |UNK  |
|subsequent type diabetes        |UNK  |
|type diabetes mellitus          |UNK  |
|diabetes mellitus (             |UNK  |
|mellitus ( T2DM                 |UNK  |
|( T2DM ),                       |UNK  |
|T2DM ), prior                   |UNK  |
|), prior episode                |UNK  |
|prior episode HTG-induced       |UNK  |
|episode HTG-induced pancreatitis|UNK  |
|HTG-induced pancreatitis years  |UNK  |
|pancreatitis years prior        |UNK  |
|years prior pre

In [20]:
# merged (NER chunk + ngram) results

ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
                                                ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                         F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+--------------------------------+-------------------------+
|key_phrase_candidate            |label                    |
+--------------------------------+-------------------------+
|28-year-old                     |Age                      |
|28-year-old female history      |UNK                      |
|female                          |Gender                   |
|female history gestational      |UNK                      |
|history gestational diabetes    |UNK                      |
|gestational diabetes mellitus   |UNK                      |
|gestational diabetes mellitus   |Diabetes                 |
|diabetes mellitus diagnosed     |UNK                      |
|mellitus diagnosed years        |UNK                      |
|diagnosed years prior           |UNK                      |
|eight years prior               |RelativeDate             |
|years prior presentation        |UNK                      |
|prior presentation subsequent   |UNK                      |
|presentation subsequent

**Show the key phrase candidates and their source (NER or NGramGenerator).**

In [21]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate")\
                 .selectExpr("key_phrase_candidate.result AS key_phrase_candidate",
                             "IF(key_phrase_candidate.metadata.entity = 'UNK', 'ngram', 'NER') AS source",
                             "key_phrase_candidate.metadata.sentence")\
                 .show(50, truncate=False)

+--------------------------------+------+--------+
|key_phrase_candidate            |source|sentence|
+--------------------------------+------+--------+
|28-year-old                     |NER   |0       |
|28-year-old female history      |ngram |0       |
|female                          |NER   |0       |
|female history gestational      |ngram |0       |
|history gestational diabetes    |ngram |0       |
|gestational diabetes mellitus   |ngram |0       |
|gestational diabetes mellitus   |NER   |0       |
|diabetes mellitus diagnosed     |ngram |0       |
|mellitus diagnosed years        |ngram |0       |
|diagnosed years prior           |ngram |0       |
|eight years prior               |NER   |0       |
|years prior presentation        |ngram |0       |
|prior presentation subsequent   |ngram |0       |
|presentation subsequent type    |ngram |0       |
|subsequent type diabetes        |ngram |0       |
|type diabetes mellitus          |ngram |0       |
|type two diabetes mellitus    

**Show the extracted key phrases and their scores.**

In [22]:
ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.key_phrases.result,
                                                ngram_ner_results.key_phrases.metadata)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("key_phrase"),
                         F.expr("cols['1']['entity']").alias("label"),
                         F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                         F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                         F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+--------------------------------+--------+------------------+-------------------+--------+
|key_phrase                      |label   |DocumentSimilarity|MMRScore           |sentence|
+--------------------------------+--------+------------------+-------------------+--------+
|type two diabetes mellitus      |Diabetes|0.7639750070708976|0.45838502245712215|0       |
|subsequent type diabetes        |UNK     |0.7503709096929252|0.08298241742241608|0       |
|HTG-induced pancreatitis years  |UNK     |0.6817062650187332|0.11246275206508999|0       |
|hepatitis obesity               |UNK     |0.6666053818205334|0.11770528988892309|0       |
|mellitus diagnosed years        |UNK     |0.6389214429221233|0.08129482429630597|0       |
|history gestational diabetes    |UNK     |0.6219874892149222|0.09501043513375096|0       |
|vomiting                        |UNK     |0.5824237404149156|0.14864183026208302|0       |
|admitted starvation ketosis     |UNK     |0.5789874431296478|0.1200807110323821

**Show the extracted key phrases and their sources.**

In [23]:
ngram_ner_results.selectExpr("explode(key_phrases) AS key_phrase")\
                 .selectExpr(
                     "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                     "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                     "key_phrase.metadata.DocumentSimilarity",
                     "key_phrase.metadata.MMRScore",
                     "key_phrase.metadata.sentence")\
                 .show(truncate=False)

+--------------------------------+------+------------------+-------------------+--------+
|key_phrase                      |source|DocumentSimilarity|MMRScore           |sentence|
+--------------------------------+------+------------------+-------------------+--------+
|type two diabetes mellitus      |NER   |0.7639750070708976|0.45838502245712215|0       |
|subsequent type diabetes        |ngrams|0.7503709096929252|0.08298241742241608|0       |
|HTG-induced pancreatitis years  |ngrams|0.6817062650187332|0.11246275206508999|0       |
|hepatitis obesity               |ngrams|0.6666053818205334|0.11770528988892309|0       |
|mellitus diagnosed years        |ngrams|0.6389214429221233|0.08129482429630597|0       |
|history gestational diabetes    |ngrams|0.6219874892149222|0.09501043513375096|0       |
|vomiting                        |ngrams|0.5824237404149156|0.14864183026208302|0       |
|admitted starvation ketosis     |ngrams|0.5789874431296478|0.1200807110323821 |0       |
|five-day 

**Now we will change the default embeddings of `ChunkKeyPhraseExtraction` (`sbert_jsl_medium_uncased`) to `sbiobert_base_cased_mli` and see the results.**

In [24]:
ngram_ner_key_phrase_bio = medical.ChunkKeyPhraseExtraction.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")

ngram_ner_bio_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    stop_words_cleaner,
    ngram_generator,
    embeddings,
    ner_tagger,
    ner_converter,
    chunk_merger,
    ngram_ner_key_phrase_bio
])

sbiobert_base_cased_mli download started this may take some time.
[OK!]


In [25]:
ngram_ner_bio_results = ngram_ner_bio_pipeline.fit(empty_data).transform(textDF)

In [26]:
# sbiobert_base_cased_mli

ngram_ner_bio_results.selectExpr("explode(key_phrases) AS key_phrase")\
                     .selectExpr(
                         "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                         "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                         "key_phrase.metadata.DocumentSimilarity",
                         "key_phrase.metadata.MMRScore",
                         "key_phrase.metadata.sentence")\
                     .show(truncate=False)

+--------------------------------+------+------------------+---------------------+--------+
|key_phrase                      |source|DocumentSimilarity|MMRScore             |sentence|
+--------------------------------+------+------------------+---------------------+--------+
|one-week history polyuria       |ngrams|0.6088061125469578|0.3652836820432435   |0       |
|HTG-induced pancreatitis years  |ngrams|0.5841955188706899|0.11729246446435823  |0       |
|insulin glargine night          |ngrams|0.554267391784666 |0.0780035469858345   |0       |
|history gestational diabetes    |ngrams|0.5492484987631553|0.07024020699922845  |0       |
|28-year-old female history      |ngrams|0.5053958986212832|0.1261156582506048   |0       |
|admitted starvation ketosis     |ngrams|0.5019901637035843|0.04270525148295462  |0       |
|triglycerides 508 mg/dL         |ngrams|0.4961037809414283|0.10460743281284249  |0       |
|vomiting weeks                  |ngrams|0.4524401257460687|0.004469616663227260

In [27]:
# sbert_jsl_medium_uncased (default)

ngram_ner_results.selectExpr("explode(key_phrases) AS key_phrase")\
                 .selectExpr(
                     "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                     "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                     "key_phrase.metadata.DocumentSimilarity",
                     "key_phrase.metadata.MMRScore",
                     "key_phrase.metadata.sentence")\
                 .show(truncate=False)

+--------------------------------+------+------------------+-------------------+--------+
|key_phrase                      |source|DocumentSimilarity|MMRScore           |sentence|
+--------------------------------+------+------------------+-------------------+--------+
|type two diabetes mellitus      |NER   |0.7639750070708976|0.45838502245712215|0       |
|subsequent type diabetes        |ngrams|0.7503709096929252|0.08298241742241608|0       |
|HTG-induced pancreatitis years  |ngrams|0.6817062650187332|0.11246275206508999|0       |
|hepatitis obesity               |ngrams|0.6666053818205334|0.11770528988892309|0       |
|mellitus diagnosed years        |ngrams|0.6389214429221233|0.08129482429630597|0       |
|history gestational diabetes    |ngrams|0.6219874892149222|0.09501043513375096|0       |
|vomiting                        |ngrams|0.5824237404149156|0.14864183026208302|0       |
|admitted starvation ketosis     |ngrams|0.5789874431296478|0.1200807110323821 |0       |
|five-day 
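
To quantify how much the two embedding models agree, you can compare the selected phrases directly. The sketch below uses only the rows visible in the two tables above (eight per model, so it is not the full top 10) and computes their overlap. The Jaccard index is a choice made for this example, not something the annotator reports.

```python
# Key phrases visible in the tables above (truncated output, so not the full top 10).
jsl_phrases = [
    "type two diabetes mellitus",
    "subsequent type diabetes",
    "HTG-induced pancreatitis years",
    "hepatitis obesity",
    "mellitus diagnosed years",
    "history gestational diabetes",
    "vomiting",
    "admitted starvation ketosis",
]
bio_phrases = [
    "one-week history polyuria",
    "HTG-induced pancreatitis years",
    "insulin glargine night",
    "history gestational diabetes",
    "28-year-old female history",
    "admitted starvation ketosis",
    "triglycerides 508 mg/dL",
    "vomiting weeks",
]

bio_set = set(bio_phrases)
# Phrases chosen by both models, in the order of the default model's ranking.
shared = [p for p in jsl_phrases if p in bio_set]
# Jaccard index: size of the intersection divided by size of the union.
jaccard = len(set(jsl_phrases) & bio_set) / len(set(jsl_phrases) | bio_set)

print(shared)
print(round(jaccard, 3))
```

Among the visible rows, three phrases are selected by both models, which suggests the choice of embeddings changes the ranking substantially for this document.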

**Let's set `.setConcatenateSentences(False)` to check the results when the document embedding is computed by averaging the sentence-level embeddings.**
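
As a rough illustration of what averaging sentence embeddings means, the sketch below builds a document vector as the mean of per-sentence vectors and compares a chunk against it with cosine similarity. It is a plain-Python toy, not the annotator's implementation: the helper names (`cosine`, `mean_vector`) and the 3-dimensional vectors are hypothetical stand-ins for real Sentence BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical sentence embeddings (real SBERT vectors have hundreds of dimensions).
sentence_embeddings = [
    [1.0, 0.0, 0.0],  # sentence 0
    [0.0, 1.0, 0.0],  # sentence 1
]
# Hypothetical embedding of a chunk that matches sentence 0 exactly.
chunk_embedding = [1.0, 0.0, 0.0]

# Document embedding as the average of the sentence embeddings.
document_embedding = mean_vector(sentence_embeddings)

sim_to_sentence = cosine(chunk_embedding, sentence_embeddings[0])
sim_to_document = cosine(chunk_embedding, document_embedding)

print("similarity to own sentence:", round(sim_to_sentence, 4))
print("similarity to averaged document:", round(sim_to_document, 4))
```

In this toy, a chunk that matches one sentence perfectly is noticeably less similar to the averaged document vector. This shows why similarity scores can shift when the document representation changes, although the size of the shift in the real model depends on the embeddings themselves.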

In [28]:
ngram_ner_key_phrase_sent = medical.ChunkKeyPhraseExtraction.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "merged_chunks"])\
    .setOutputCol("key_phrases")\
    .setConcatenateSentences(False)

ngram_ner_sent_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    stop_words_cleaner,
    ngram_generator,
    embeddings,
    ner_tagger,
    ner_converter,
    chunk_merger,
    ngram_ner_key_phrase_sent
])

sbiobert_base_cased_mli download started this may take some time.
[OK!]


In [29]:
ngram_ner_sent_results = ngram_ner_sent_pipeline.fit(empty_data).transform(textDF)

In [30]:
# .setConcatenateSentences(False)

ngram_ner_sent_results.selectExpr("explode(key_phrases) AS key_phrase")\
                      .selectExpr(
                          "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                          "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                          "key_phrase.metadata.DocumentSimilarity",
                          "key_phrase.metadata.MMRScore",
                          "key_phrase.metadata.sentence")\
                      .show(50,truncate=False)

+------------------------------------+------+--------------------+--------------------+--------+
|key_phrase                          |source|DocumentSimilarity  |MMRScore            |sentence|
+------------------------------------+------+--------------------+--------------------+--------+
|one-week history polyuria           |ngrams|0.15724819757004135 |0.094348922291114   |0       |
|HTG-induced pancreatitis years      |ngrams|0.15313216081344025 |-0.14134556064734288|0       |
|insulin glargine night              |ngrams|0.13809330204188794 |-0.1399268870233959 |0       |
|female history gestational          |ngrams|0.10183596369471094 |-0.11137552183779727|0       |
|HbA1c 10%                           |ngrams|0.09713648992851633 |-0.13530046967508486|0       |
|dapagliflozin T2DM atorvastatin     |ngrams|0.09236916621708836 |-0.09540790773024888|0       |
|days prior admission                |ngrams|0.0833812435760314  |-0.1567860522307669 |0       |
|33.5 kg/m2                   

In [31]:
# .setConcatenateSentences(True) # default

ngram_ner_bio_results.selectExpr("explode(key_phrases) AS key_phrase")\
                     .selectExpr(
                         "SUBSTRING(key_phrase.result, 0, 40) as key_phrase",
                         "IF(key_phrase.metadata.entity = 'UNK', 'ngrams', 'NER') AS source",
                         "key_phrase.metadata.DocumentSimilarity",
                         "key_phrase.metadata.MMRScore",
                         "key_phrase.metadata.sentence")\
                     .show(truncate=False)

+--------------------------------+------+------------------+---------------------+--------+
|key_phrase                      |source|DocumentSimilarity|MMRScore             |sentence|
+--------------------------------+------+------------------+---------------------+--------+
|one-week history polyuria       |ngrams|0.6088061125469578|0.3652836820432435   |0       |
|HTG-induced pancreatitis years  |ngrams|0.5841955188706899|0.11729246446435823  |0       |
|insulin glargine night          |ngrams|0.554267391784666 |0.0780035469858345   |0       |
|history gestational diabetes    |ngrams|0.5492484987631553|0.07024020699922845  |0       |
|28-year-old female history      |ngrams|0.5053958986212832|0.1261156582506048   |0       |
|admitted starvation ketosis     |ngrams|0.5019901637035843|0.04270525148295462  |0       |
|triglycerides 508 mg/dL         |ngrams|0.4961037809414283|0.10460743281284249  |0       |
|vomiting weeks                  |ngrams|0.4524401257460687|0.004469616663227260
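
The `MMRScore` column, together with `.setDivergence(0.4)`, points to a Maximal Marginal Relevance style ranking, in which a candidate is rewarded for similarity to the document and penalized for similarity to phrases already selected. The sketch below is a generic greedy MMR in plain Python, not the annotator's actual code: the function `mmr_select`, the candidate labels, and every value in `doc_sim` and `pair_sim` are hypothetical. The only thing checked against the output above is that the top row's `MMRScore` is approximately `(1 - 0.4)` times its `DocumentSimilarity`.

```python
def mmr_select(doc_sim, pair_sim, top_n, divergence):
    """Greedy Maximal Marginal Relevance.

    doc_sim:  candidate -> similarity to the document
    pair_sim: frozenset({a, b}) -> similarity between candidates a and b
    Returns a list of (candidate, mmr_score) in selection order.
    """
    selected = []
    remaining = set(doc_sim)
    while remaining and len(selected) < top_n:
        best, best_score = None, None
        for cand in sorted(remaining):  # sorted for deterministic tie-breaking
            # Penalty: highest similarity to anything already selected (0 if none).
            penalty = max(
                (pair_sim[frozenset((cand, chosen))] for chosen, _ in selected),
                default=0.0,
            )
            score = (1 - divergence) * doc_sim[cand] - divergence * penalty
            if best_score is None or score > best_score:
                best, best_score = cand, score
        selected.append((best, best_score))
        remaining.remove(best)
    return selected

# Hypothetical similarities: phrase_a and phrase_b are near-duplicates.
doc_sim = {"phrase_a": 0.90, "phrase_b": 0.85, "phrase_c": 0.50}
pair_sim = {
    frozenset(("phrase_a", "phrase_b")): 0.95,
    frozenset(("phrase_a", "phrase_c")): 0.10,
    frozenset(("phrase_b", "phrase_c")): 0.10,
}

ranked = mmr_select(doc_sim, pair_sim, top_n=3, divergence=0.4)
for phrase, score in ranked:
    print(f"{phrase}: {score:.2f}")

# Sanity check against the first row of the table above (values copied from it).
first_row_doc_sim = 0.6088061125469578
first_row_mmr = 0.3652836820432435
print(abs((1 - 0.4) * first_row_doc_sim - first_row_mmr) < 1e-6)
```

With these toy numbers, `phrase_c` is picked before `phrase_b` even though `phrase_b` is closer to the document, because `phrase_b` nearly duplicates the already selected `phrase_a`. Under this reading, a higher divergence would favor more varied key phrases.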

# With YAKE Keyword Extraction

Let's get the key phrases using `YakeKeywordExtraction` and compare the results with `ChunkKeyPhraseExtraction`.

In [32]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \
    .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
    .setInputCols("tokens")\
    .setOutputCol("clean_tokens")\
    .setCaseSensitive(False)

keywords = nlp.YakeKeywordExtraction() \
    .setInputCols("clean_tokens") \
    .setOutputCol("yake") \
    .setMinNGrams(1) \
    .setMaxNGrams(3)\
    .setNKeywords(20)

yake_key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(10) \
    .setDivergence(0.4)\
    .setInputCols(["sentences", "yake"])\
    .setOutputCol("yake_key_phrases")

yake_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    stop_words_cleaner,
    keywords,
    yake_key_phrase_extractor
])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
sbert_jsl_medium_uncased download started this may take some time.
[OK!]


In [33]:
yake_results = yake_pipeline.fit(empty_data).transform(textDF)

**Let's check the YAKE Keyword Extraction results and scores.**

In [34]:
yake_results.selectExpr("explode(yake) AS key_phrase_candidate").show(30,truncate=False)

+---------------------------------------------------------------------------------------------------+
|key_phrase_candidate                                                                               |
+---------------------------------------------------------------------------------------------------+
|{chunk, 91, 95, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                         |
|{chunk, 100, 111, presentation, {score -> 0.03159661517236814, sentence -> 0}, []}                 |
|{chunk, 169, 173, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                       |
|{chunk, 223, 227, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                       |
|{chunk, 232, 243, presentation, {score -> 0.03159661517236814, sentence -> 0}, []}                 |
|{chunk, 445, 449, prior, {score -> 0.029673513395379065, sentence -> 0}, []}                       |
|{chunk, 454, 465, presentation, {score -> 0.03159661517236814, sentence -> 0}, []

In [35]:
scores = yake_results.selectExpr("explode(arrays_zip(yake.result, yake.metadata)) as resultTuples") \
                     .selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")

In [36]:
scores.orderBy("score").show(20, truncate = False)

+------------------------------------+--------------------+
|keyword                             |score               |
+------------------------------------+--------------------+
|years prior presentation            |0.006335399690627251|
|years prior presentation            |0.006335399690627251|
|prior presentation                  |0.011644010991495998|
|prior presentation                  |0.011644010991495998|
|prior presentation                  |0.011644010991495998|
|weeks prior presentation            |0.020272229518351368|
|prior presentation subsequent       |0.020272229518351368|
|respiratory tract infection         |0.02568455658449274 |
|respiratory tract infection         |0.02568455658449274 |
|anion gap                           |0.025965846371439553|
|anion gap                           |0.025965846371439553|
|anion gap                           |0.025965846371439553|
|physical examination presentation   |0.02840600503736659 |
|obtained hours presentation         |0.

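**Show the distinct YAKE results sorted by score (lower scores indicate more relevant keywords); `show()` prints the first 20 rows by default.**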

In [37]:
scores.select("keyword", "score").distinct().orderBy("score").show(truncate = False)

+------------------------------------+--------------------+
|keyword                             |score               |
+------------------------------------+--------------------+
|years prior presentation            |0.006335399690627251|
|prior presentation                  |0.011644010991495998|
|weeks prior presentation            |0.020272229518351368|
|prior presentation subsequent       |0.020272229518351368|
|respiratory tract infection         |0.02568455658449274 |
|anion gap                           |0.025965846371439553|
|physical examination presentation   |0.02840600503736659 |
|obtained hours presentation         |0.028532992974589392|
|examination presentation significant|0.028532992974589392|
|prior                               |0.029673513395379065|
|years prior                         |0.030808818777992058|
|anion gap elevated                  |0.031568192739369824|
|presentation                        |0.03159661517236814 |
|patient treated insulin             |0.
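
YAKE assigns lower scores to more relevant keywords, which is why the cells above sort in ascending order. The plain-Python sketch below mirrors the `distinct().orderBy("score")` step on a handful of rows copied from the output above and keeps only the first `n`; the helper name `top_keywords` is hypothetical. In Spark itself, passing a row count to `.show()` (for example `.show(10, truncate=False)`) would limit the printed rows in the same way.

```python
# A few (keyword, score) rows copied from the YAKE output above, duplicates included.
rows = [
    ("years prior presentation", 0.006335399690627251),
    ("years prior presentation", 0.006335399690627251),
    ("prior presentation", 0.011644010991495998),
    ("prior presentation", 0.011644010991495998),
    ("respiratory tract infection", 0.02568455658449274),
    ("anion gap", 0.025965846371439553),
    ("anion gap", 0.025965846371439553),
]

def top_keywords(rows, n):
    """Drop duplicate (keyword, score) pairs and keep the n lowest-scoring ones."""
    unique = sorted(set(rows), key=lambda row: (row[1], row[0]))
    return unique[:n]

top3 = top_keywords(rows, 3)
for keyword, score in top3:
    print(f"{keyword:30s} {score:.6f}")
```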

**Now we can compare the results with `ChunkKeyPhraseExtraction`.**

In [38]:
yake_results.select(F.explode(F.arrays_zip(yake_results.yake_key_phrases.result,
                                           yake_results.yake_key_phrases.metadata)).alias("cols"))\
            .select(F.expr("cols['0']").alias("key_phrase_candidate"),
                    F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
                    F.expr("cols['1']['MMRScore']").alias("MMRScore"),
                    F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+------------------------------------+-------------------+---------------------+--------+
|key_phrase_candidate                |DocumentSimilarity |MMRScore             |sentence|
+------------------------------------+-------------------+---------------------+--------+
|pancreatitis years prior            |0.6491587981298977 |0.38949529435509045  |0       |
|diagnosed years prior               |0.3859445763022015 |-0.0344821915617822  |0       |
|respiratory tract infection         |0.3445273826262119 |-0.062066348309385205|0       |
|patient treated insulin             |0.3413457909676454 |-0.03756772028761124 |0       |
|serum                               |0.3371023353250487 |0.024651772661135135 |0       |
|presentation revealed glucose       |0.31458369371811307|-0.11752323182333316 |0       |
|examination presentation significant|0.2909995142975872 |-0.07343303575954463 |0       |
|prior analysis due                  |0.22501722247002298|-0.1336773170416204  |0       |
|prior    

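**As you can see, re-ranking the YAKE candidates with `ChunkKeyPhraseExtraction` puts a clinically specific phrase ("pancreatitis years prior") at the top, whereas the raw `YakeKeywordExtraction` ranking is led by generic phrases such as "years prior presentation" and "prior presentation".**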