![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.2.Medical_Text_Summarization_with_Extractive_Approach.ipynb)


## Colab Setup

📌 To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload your license file as `spark_jsl.json`, which is the file name the cell below checks for.
Otherwise, you can look at the example outputs saved in the notebook.

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.pretrained import ResourceDownloader
from sparknlp_jsl.pretrained import InternalResourceDownloader

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import glob
import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.3.1
Spark NLP_JSL Version : 5.3.1


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# ExtractiveSummarization


**Extractive summarization** is a technique used in Natural Language Processing (NLP) that aims to generate a concise summary by extracting the most important information from a given text. Unlike ***abstractive summarization***, which involves generating new sentences to capture the essence of the content, ***extractive summarization*** directly selects and concatenates existing sentences or phrases from the original text.

The process typically includes preprocessing the text, scoring sentences against one or more importance criteria, ranking them by that score, and selecting the top-ranked sentences for the final summary. Because every sentence in the summary is taken from the source, extractive summarization is favored for its objectivity and for preserving the factual accuracy of the original text.
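The ranking-and-selection idea can be illustrated with a small, self-contained sketch. This is not the implementation behind the `ExtractiveSummarization` annotator: it assumes bag-of-words counts in place of BERT sentence embeddings, and it assumes a simple centrality score (the sum of a sentence's cosine similarities to all other sentences). The names `bow_vector`, `cosine`, `toy_extractive_summary`, and `toy_sentences` are hypothetical and exist only for this example.

```python
import math
from collections import Counter


def bow_vector(sentence):
    """Toy stand-in for a sentence embedding (assumption): lowercase word counts."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_extractive_summary(sentences, summary_size=2):
    """Score each sentence by its total similarity to the others (assumed
    criterion), keep the top `summary_size`, and return them in document order."""
    vectors = [bow_vector(s) for s in sentences]
    scores = [
        sum(cosine(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:summary_size])]


toy_sentences = [
    "Residual disease predicts survival.",
    "Radical surgery reduces residual disease.",
    "The weather was sunny.",
    "Survival improves when residual disease is minimal.",
]

print(toy_extractive_summary(toy_sentences, summary_size=2))
```

On this toy input, the two sentences that share the most vocabulary with the rest of the text are kept and the unrelated weather sentence is dropped. The selected sentences are copied unchanged from the input, which is the defining property of the extractive approach.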

**Parameters**


`similarityThreshold`: Sets the minimum cosine similarity at which two sentences are considered similar.

`summarySize`: Sets the number of sentences to include in the summary.

In [None]:
documenter = DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("documents")

# pretrained() with no arguments loads the default model (sentence_detector_dl)
sentence_detector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["documents"])\
    .setOutputCol("sentences")

sentence_embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128")\
    .setInputCols(["sentences"])\
    .setOutputCol("sentence_embeddings")

summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(2)\
    .setSimilarityThreshold(0)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [None]:
sampleText = """
One of David Cameron 's closest friends and Conservative allies, George Osborne rose rapidly after becoming MP for Tatton in 2001. Michael Howard promoted him from shadow chief secretary to the Treasury to shadow chancellor in May 2005, at the age of 34. Mr Osborne took a key role in the election campaign and has been at the forefront of the debate on how to deal with the recession and the UK's spending deficit. Even before Mr Cameron became leader the two were being likened to Labour's Blair/Brown duo. The two have emulated them by becoming prime minister and chancellor, but will want to avoid the spats. Before entering Parliament, he was a special adviser in the agriculture department when the Tories were in government and later served as political secretary to William Hague. The BBC understands that as chancellor, Mr Osborne, along with the Treasury will retain responsibility for overseeing banks and financial regulation. Mr Osborne said the coalition government was planning to change the tax system "to make it fairer for people on low and middle incomes", and undertake "long-termstructural reform" of the banking sector, education and the welfare state.
""".strip()

sampleText

'One of David Cameron \'s closest friends and Conservative allies, George Osborne rose rapidly after becoming MP for Tatton in 2001. Michael Howard promoted him from shadow chief secretary to the Treasury to shadow chancellor in May 2005, at the age of 34. Mr Osborne took a key role in the election campaign and has been at the forefront of the debate on how to deal with the recession and the UK\'s spending deficit. Even before Mr Cameron became leader the two were being likened to Labour\'s Blair/Brown duo. The two have emulated them by becoming prime minister and chancellor, but will want to avoid the spats. Before entering Parliament, he was a special adviser in the agriculture department when the Tories were in government and later served as political secretary to William Hague. The BBC understands that as chancellor, Mr Osborne, along with the Treasury will retain responsibility for overseeing banks and financial regulation. Mr Osborne said the coalition government was planning to 

In [None]:
data = spark.createDataFrame([[sampleText]]).toDF("text")

result = model.transform(data)

In [None]:

result.select("summaries").show(truncate = False)


In [None]:
light_model = LightPipeline(model)

light_result = light_model.annotate(sampleText)

light_result["summaries"]

["Mr Osborne took a key role in the election campaign and has been at the forefront of the debate on how to deal with the recession and the UK's spending deficit. The BBC understands that as chancellor, Mr Osborne, along with the Treasury will retain responsibility for overseeing banks and financial regulation."]
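A quick way to confirm that a summary is extractive is to check that each of its sentences occurs verbatim in the source. The sketch below assumes the summary has already been split into sentences; `is_extractive` is a hypothetical helper, not part of Spark NLP, and the strings are a few sentences copied from the sample text above.

```python
def is_extractive(source, summary_sentences):
    """Return True if every summary sentence occurs verbatim in the source text."""
    return all(sentence in source for sentence in summary_sentences)


# Three consecutive sentences copied from the sample text, for illustration only.
source_excerpt = (
    "Michael Howard promoted him from shadow chief secretary to the Treasury "
    "to shadow chancellor in May 2005, at the age of 34. "
    "Even before Mr Cameron became leader the two were being likened to "
    "Labour's Blair/Brown duo. "
    "The two have emulated them by becoming prime minister and chancellor, "
    "but will want to avoid the spats."
)

extracted = [
    "Even before Mr Cameron became leader the two were being likened to "
    "Labour's Blair/Brown duo.",
]
paraphrased = ["Osborne and Cameron were compared to Blair and Brown."]

print(is_extractive(source_excerpt, extracted))    # sentence copied from the source
print(is_extractive(source_excerpt, paraphrased))  # reworded, so not extractive
```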

**PubMed data**

In [None]:
text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.
A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.
The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).
Overall, residual disease was the only independent predictor of survival. Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis."""

text

'Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\nA retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\nThe study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual 

In [None]:
light_result = light_model.annotate(text)
light_result["summaries"]

['The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\nThe study comprised 194 patients, including 144 with carcinomatosis. Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors.']

## summarySize

Sets the number of sentences to include in the summary.

In [None]:
summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(4)\
    .setSimilarityThreshold(0)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)

**Patient posts**

In [None]:
text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months.
Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem.
Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily.
I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate.
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""

light_result = light_model.annotate(text)
light_result["summaries"]

["Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. Bcs i heard that thyroid take time to start recover. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems."]

**Clinical data**

In [None]:
summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(10)\
    .setSimilarityThreshold(0)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)

In [None]:
text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43.
She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image.
She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year.
She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss.
She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.
ALLERGIES: She has no known drug allergies.
REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""

light_result = light_model.annotate(text)
light_result["summaries"]

['She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss. PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath. FAMILY HISTORY: Pertinent for obesity and hypertension. PHYSICAL EXAM: This is a pleasant female in no acute distress. Cardiovascular is normal sinus rhythm. This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. She will also see my nutritionist and social worker and have an upper endoscopy.']

## similarityThreshold


Sets the minimum cosine similarity at which two sentences are considered similar.
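To make the effect of a threshold concrete, the sketch below counts which sentence pairs in a small, made-up similarity matrix count as similar at `0` and at `0.8`. It is an illustration only: the helper `similar_pairs` and the matrix `toy_similarity` are hypothetical, and the use of `>=` for the comparison is an assumption rather than documented behavior of the annotator.

```python
def similar_pairs(similarity, threshold):
    """Return index pairs (i, j) with i < j whose similarity is at least
    `threshold` (the inclusive comparison is an assumption)."""
    n = len(similarity)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if similarity[i][j] >= threshold
    ]


# Made-up cosine similarities between three sentences.
toy_similarity = [
    [1.00, 0.85, 0.30],
    [0.85, 1.00, 0.10],
    [0.30, 0.10, 1.00],
]

print(similar_pairs(toy_similarity, 0.0))  # every pair counts as similar
print(similar_pairs(toy_similarity, 0.8))  # only the closest pair remains
```

A higher threshold leaves fewer sentence pairs connected, which can change which sentences rank highest.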


In [None]:
summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(10)\
    .setSimilarityThreshold(0.8)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)

In [None]:
text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43.
She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image.
She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year.
She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss.
She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.
ALLERGIES: She has no known drug allergies.
REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""

light_result = light_model.annotate(text)
light_result["summaries"]

['Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis. ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy.  Once this is completed, we will submit her to her insurance company for approval.\n']