![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb)


# 🪄 Medical Text Summarization

![IMAGE](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/databricks/python/data/Automated_Summarization_Clinical_Notes.png?raw=true)

🔎 Text Summarization is a natural language processing (NLP) task that involves condensing a lengthy text document into a shorter, more compact version while still retaining the most important information and meaning. The goal is to produce a summary that accurately represents the content of the original text in a concise form.

🔎 There are different approaches to text summarization, including `extractive methods`, which identify and extract important sentences or phrases from the text, and `abstractive methods`, which generate new text based on the content of the original text.

![IMAGE](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/databricks/python/data/Summarization_Methods_vs_Quality_Dimensions.png?raw=true)
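To make the extractive idea concrete, here is a minimal sketch in plain Python. It is not how any John Snow Labs model works: it is a simple word-frequency heuristic, and the function name `extractive_summary` and the example text are made up for illustration. It scores each sentence by the average frequency of its words and returns the top-scoring sentences unchanged, which is the defining property of an extractive method.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, copied verbatim from the text."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Word frequencies over the whole text (lowercased, letters only).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, keep the best ones, restore original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


# Hypothetical example text, written for this sketch.
example = ("The patient has hypertension. "
           "The patient takes atenolol for hypertension. "
           "Weather was nice.")
print(extractive_summary(example, num_sentences=1))
```

An abstractive model, by contrast, is free to produce sentences that never appear in the input, which is what the rest of this notebook demonstrates.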

🔎 The `MedicalSummarizer` annotator uses a transformer-based model, T5, to create a concise summary of medical text given in a clinical context. It helps to quickly summarize complex medical information.

## Colab Setup

📌 To run this yourself, you need to upload your license keys to the notebook. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.
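Before installing anything, it can help to check that the uploaded file contains the entries the setup cells rely on. The sketch below is an assumption-laden convenience, not part of the John Snow Labs tooling: the helper name `missing_license_keys` is hypothetical, and the list of required names is taken only from what this notebook references (`SECRET`, `PUBLIC_VERSION`, `JSL_VERSION`). A real license file may contain additional entries.

```python
import json

# Names referenced by the setup cells in this notebook (assumed to be the minimum needed).
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty in the license dict."""
    return [name for name in required if not license_keys.get(name)]


# Hypothetical, incomplete license content used only to show the check.
sample = json.loads('{"SECRET": "xxxx", "PUBLIC_VERSION": ""}')
print(missing_license_keys(sample))
```

If the returned list is non-empty, the `pip install` commands that interpolate those values are likely to fail.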

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import json
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

import sparknlp
import sparknlp_jsl

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np
import textwrap

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.0.0
Spark NLP_JSL Version : 5.0.0


# 📍Text Summarization with Abstractive Approach

### 🔎 Models

<div align="center">

| **Index** | **Summarizer Models**        |
|---------------|----------------------|
| 1        | [summarizer_clinical_jsl](https://nlp.johnsnowlabs.com/2023/03/25/summarizer_clinical_jsl.html)     |
| 2          | [summarizer_clinical_jsl_augmented](https://nlp.johnsnowlabs.com/2023/03/30/summarizer_clinical_jsl_augmented_en.html)       |
| 3      | [summarizer_biomedical_pubmed](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_biomedical_pubmed_en.html)    |
| 4      | [summarizer_generic_jsl](https://nlp.johnsnowlabs.com/2023/03/30/summarizer_generic_jsl_en.html)    |
| 5    | [summarizer_clinical_questions](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_clinical_questions_en.html) |
| 6    | [summarizer_radiology](https://nlp.johnsnowlabs.com/2023/04/23/summarizer_jsl_radiology_en.html) |
| 7    | [summarizer_clinical_guidelines_large](https://nlp.johnsnowlabs.com/2023/05/08/summarizer_clinical_guidelines_large_en.html) |
| 8    | [summarizer_clinical_laymen](https://nlp.johnsnowlabs.com/2023/05/31/summarizer_clinical_laymen_en.html) |
</div>

## 📍Benchmark Report

Our clinical summarizer models, with only 250M parameters, perform 30-35% better than non-clinical SOTA text summarizers with 500M parameters on the BLEU and ROUGE benchmarks. That is, they score about 30% higher with half the parameters of the other LLMs. See the details below.
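For readers unfamiliar with these metrics, the sketch below shows a unigram-overlap F1 score in the spirit of ROUGE-1. It is illustrative only: this report does not say which ROUGE variant or which implementation produced the `rouge` column, so the function `rouge1_f1` and its whitespace tokenization are assumptions, not the evaluation code behind the tables.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference summary and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the patient denies chest pain", "patient denies chest pain and edema"))
```

Scores range from 0 (no shared words) to 1 (identical bags of words); higher is better.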


### 🔎Benchmark on Samsum Dataset

| model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 |
|--|--|--|--|--|--|--|
| philschmid/flan-t5-base-samsum | 240M | 0.2734 | 0.1813 | 0.8938 | 0.9133 | 0.9034 |
| linydub/bart-large-samsum | 500M | 0.3060 | 0.2168 | 0.8961 | 0.9065 | 0.9013 |
| philschmid/bart-large-cnn-samsum | 500M | 0.3794 | 0.1262 | 0.8599 | 0.9153 | 0.8867 |
| transformersbook/pegasus-samsum | 570M | 0.3049 | 0.1543 | 0.8942 | 0.9183 | 0.9061 |
| summarizer_generic_jsl | 240M | 0.2703 | 0.1932 | 0.8944 | 0.9161 | 0.9051 |


### 🔎Benchmark on MtSamples Summarization Dataset

| model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 |
|--|--|--|--|--|--|--|
| philschmid/flan-t5-base-samsum | 250M | 0.1919 | 0.1124 | 0.8409 | 0.8964 | 0.8678 |
| linydub/bart-large-samsum | 500M | 0.1586 | 0.0732 | 0.8747 | 0.8184 | 0.8456 |
| philschmid/bart-large-cnn-samsum | 500M | 0.2170 | 0.1299 | 0.8846 | 0.8436 | 0.8636 |
| transformersbook/pegasus-samsum | 500M | 0.1924 | 0.0965 | 0.8920 | 0.8149 | 0.8517 |
| summarizer_clinical_jsl | 250M | 0.4836 | 0.4188 | 0.9041 | 0.9374 | 0.9204 |
| summarizer_clinical_jsl_augmented | 250M | 0.5119 | 0.4545 | 0.9282 | 0.9526 | 0.9402 |
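One way to read the MtSamples table above is to compare each clinical model with the best non-clinical baseline on the same metric. The sketch below copies the `rouge` and `bleu` columns from the table and computes that gap as an absolute difference in score. The helper name `gap_over_best_baseline` is made up for this sketch, and whether the headline figure in this report was computed this way is not stated.

```python
# rouge / bleu values copied from the MtSamples table above.
mtsamples = {
    "philschmid/flan-t5-base-samsum": {"rouge": 0.1919, "bleu": 0.1124},
    "linydub/bart-large-samsum": {"rouge": 0.1586, "bleu": 0.0732},
    "philschmid/bart-large-cnn-samsum": {"rouge": 0.2170, "bleu": 0.1299},
    "transformersbook/pegasus-samsum": {"rouge": 0.1924, "bleu": 0.0965},
    "summarizer_clinical_jsl": {"rouge": 0.4836, "bleu": 0.4188},
    "summarizer_clinical_jsl_augmented": {"rouge": 0.5119, "bleu": 0.4545},
}

CLINICAL_MODELS = ("summarizer_clinical_jsl", "summarizer_clinical_jsl_augmented")


def gap_over_best_baseline(scores, model, metric):
    """Absolute score difference between `model` and the best non-clinical model."""
    best_baseline = max(
        values[metric] for name, values in scores.items() if name not in CLINICAL_MODELS
    )
    return scores[model][metric] - best_baseline


for model in CLINICAL_MODELS:
    for metric in ("rouge", "bleu"):
        gap = gap_over_best_baseline(mtsamples, model, metric)
        print("{:<36} {:<6} +{:.4f}".format(model, metric, gap))
```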


### 🔎Benchmark on MIMIC Summarization Dataset

| model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 |
|--|--|--|--|--|--|--|
| philschmid/flan-t5-base-samsum | 250M | 0.1910 | 0.1037 | 0.8708 | 0.9056 | 0.8879 |
| linydub/bart-large-samsum | 500M | 0.1252 | 0.0382 | 0.8933 | 0.8440 | 0.8679 |
| philschmid/bart-large-cnn-samsum | 500M | 0.1795 | 0.0889 | 0.9172 | 0.8978 | 0.9074 |
| transformersbook/pegasus-samsum | 570M | 0.1425 | 0.0582 | 0.9171 | 0.8682 | 0.8920 |
| summarizer_clinical_jsl | 250M | 0.395 | 0.2962 | 0.895 | 0.9316 | 0.913 |
| summarizer_clinical_jsl_augmented | 250M | 0.3964 | 0.307 | 0.9109 | 0.9452 | 0.9227 |

## 📃 summarizer_clinical_jsl

Summarize clinical notes, encounters, critical care notes, discharge notes, reports, etc.

In [4]:
text = """ Patient with hypertension, syncope, and spinal stenosis - for recheck.
 (Medical Transcription Sample Report)
 SUBJECTIVE:
 The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.
 PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS:
 Reviewed and unchanged from the dictation on 12/03/2003.
 MEDICATIONS:
 Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""


data = spark.createDataFrame([[text]]).toDF("text")
data.show(truncate = 60)

+------------------------------------------------------------+
|                                                        text|
+------------------------------------------------------------+
| Patient with hypertension, syncope, and spinal stenosis ...|
+------------------------------------------------------------+



In [5]:
document_assembler = DocumentAssembler()\
            .setInputCol('text')\
            .setOutputCol('document')

summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")\
            .setInputCols(['document'])\
            .setOutputCol('summary')\
            .setMaxTextLength(512)\
            .setMaxNewTokens(512)

pipeline = Pipeline(stages=[
            document_assembler,
            summarizer])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = model.transform(data)

result.show()

summarizer_clinical_jsl download started this may take some time.
[OK!]
+--------------------+--------------------+--------------------+
|                text|            document|             summary|
+--------------------+--------------------+--------------------+
| Patient with hyp...|[{document, 0, 68...|[{document, 0, 24...|
+--------------------+--------------------+--------------------+



In [6]:
result.select("summary.result").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She denies chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She is on multiple medications and has Elocon cream and Synalar cream for rash.]|
+---

### 📍 LightPipelines

In [7]:
text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""

light_model = LightPipeline(model)
light_result = light_model.annotate(text)

document_text = textwrap.fill(light_result['document'][0], width=120)
summary_text = textwrap.fill(light_result['summary'][0], width=120)

print("➤ Document :\n{}".format(document_text))
print("\n")
print("➤ Summary : \n{}".format(summary_text))
print("\n")

➤ Document :
The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to
presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear
weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his
ankle in the past. SOCIAL HISTORY: He does not drink or smoke. MEDICAL DECISION MAKING: He had an x-ray of his ankle
that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain
over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a
splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics. DISPOSITION:
Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length
his sleep and if he has continued pain to foll
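As the cell above indexes it, the value returned by `annotate` behaves like a dictionary that maps each output column name to a list of strings. Under that assumption, a small helper can pair every source text with its summary instead of repeating the print statements. The helper name `format_pairs` and the mock dictionary are hypothetical; the mock stands in for a real `annotate` result so the sketch runs without Spark.

```python
import textwrap


def format_pairs(result, source_key, summary_key="summary", width=120):
    """Pair each source text with its summary and wrap both to `width` columns."""
    blocks = []
    pairs = zip(result[source_key], result[summary_key])
    for i, (source, summary) in enumerate(pairs, start=1):
        blocks.append("➤ Document {}:\n{}\n\n➤ Summary {}:\n{}".format(
            i, textwrap.fill(source, width=width),
            i, textwrap.fill(summary, width=width)))
    return "\n\n".join(blocks)


# Hand-written mock with the same shape as the annotate() result used above.
mock_result = {
    "document": ["The patient twisted his right ankle while playing basketball."],
    "summary": ["Right ankle injury during basketball."],
}
print(format_pairs(mock_result, source_key="document"))
```

The same helper works with `source_key="sentence"` when the pipeline summarizes one paragraph at a time.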

### 🚩 Summaries from paragraphs in text
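The pipeline in this section configures the sentence detector with `setCustomBounds(["\n"])` and `setUseCustomBoundsOnly(True)`, so the text is split at newlines only and each paragraph is summarized separately. The plain-Python sketch below approximates that splitting step. It is an assumption about the effect, not the annotator's implementation: the real annotator returns annotations with character offsets, and dropping empty pieces is a choice made here for the sketch.

```python
def split_paragraphs(text):
    """Split on newlines only and drop empty pieces left by blank lines."""
    return [piece.strip() for piece in text.split("\n") if piece.strip()]


# Hypothetical note with blank lines between sections.
note = "PRESENT ILLNESS: Nausea since yesterday.\n\nPLAN: Admit and start IV antibiotics."
for i, paragraph in enumerate(split_paragraphs(note), start=1):
    print(i, paragraph)
```

Because sentence-ending punctuation is ignored, a multi-sentence paragraph stays in one piece, which gives the summarizer more context per unit.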

In [8]:
document_assembler = DocumentAssembler()\
            .setInputCol('text')\
            .setOutputCol('document')

sentenceDetector = SentenceDetectorDLModel\
            .pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
            .setInputCols(["document"])\
            .setOutputCol("sentence")\
            .setCustomBounds(["\n"])\
            .setUseCustomBoundsOnly(True)

summarizer = MedicalSummarizer\
            .pretrained("summarizer_clinical_jsl")\
            .setInputCols(['sentence'])\
            .setOutputCol('summary')\
            .setMaxTextLength(512)\
            .setMaxNewTokens(512)

pipeline = Pipeline(stages=[
            document_assembler,
            sentenceDetector,
            summarizer])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
summarizer_clinical_jsl download started this may take some time.
[OK!]


In [9]:
text = """PRESENT ILLNESS: The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago. He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his right side and back. He feels like he was on it but has not done so. He has overall malaise and a low-grade temperature of 100.3. He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He denies any outright chills or blood per rectum.

PHYSICAL EXAMINATION: His temperature is 100.3, blood pressure 129/59, respirations 16, heart rate 84. He is drowsy, but easily arousable and appropriate with conversation. He is oriented to person, place, and situation. He is normocephalic, atraumatic. His sclerae are anicteric. His mucous membranes are somewhat tacky. His neck is supple and symmetric. His respirations are unlabored and clear. He has a regular rate and rhythm. His abdomen is soft. He has diffuse right upper quadrant tenderness, worse focally, but no rebound or guarding. He otherwise has no organomegaly, masses, or abdominal hernias evident. His extremities are symmetrical with no edema. His posterior tibial pulses are palpable and symmetric. He is grossly nonfocal neurologically.

PLAN: He will be admitted and placed on IV antibiotics. We will get an ultrasound this morning. He will need his gallbladder out, probably with intraoperative cholangiogram. Hopefully, the stone will pass this way. Due to his anatomy, an ERCP would prove quite difficult if not impossible unless laparoscopic assisted. Dr. X will see him later this morning and discuss the plan further. The patient understands."""

light_model = LightPipeline(model)
light_result = light_model.annotate(text)

In [10]:
for i in range(len(light_result['sentence'])):
    document_text = textwrap.fill(light_result['sentence'][i], width=120)
    summary_text = textwrap.fill(light_result['summary'][i], width=120)

    print("➤ Document {}: \n{}".format(i+1, document_text))
    print("\n")
    print("➤ Summary {}: \n{}".format(i+1, summary_text))
    print("\n")

➤ Document 1: 
PRESENT ILLNESS: The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago. He has
lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00 when he developed nausea and
right upper quadrant pain, which apparently wrapped around toward his right side and back. He feels like he was on it
but has not done so. He has overall malaise and a low-grade temperature of 100.3. He denies any prior similar or lesser
symptoms. His last normal bowel movement was yesterday. He denies any outright chills or blood per rectum.


➤ Summary 1: 
A 28-year-old patient who had gastric bypass surgery nearly one year ago developed nausea and right upper quadrant pain,
which wrapped around his right side and back. He has malaise and a low-grade temperature of 100.3. He denies any
previous symptoms and has no other symptoms.


➤ Document 2: 
PHYSICAL EXAMINATION: His temperature is 100.3, blood pressure 129/59, respirations 16, he

### 🚩 setRefineSummary

**Our text summarization method now supports a map-reduce approach for section-wise summarization.**

New parameters in the `MedicalSummarizer` annotator let users work around token limitations and give more flexibility in medical summarization. They are intended to help generate more accurate and comprehensive summaries of medical documents. With the map-reduce approach, `MedicalSummarizer` condenses distinct text segments repeatedly until the desired length is reached.

The following parameters have been introduced:

- `setRefineSummary`: Set to `True` for refined summarization, at increased computational cost.
- `setRefineSummaryTargetLength`: Target length of the summary in tokens (delimited by whitespace). Effective only when `setRefineSummary=True`.
- `setRefineChunkSize`: Desired size of the refined chunks. Should correspond to the LLM context window size in tokens. Effective only when `setRefineSummary=True`.
- `setRefineMaxAttempts`: Number of attempts to re-summarize chunks that exceed `setRefineSummaryTargetLength` before stopping. Effective only when `setRefineSummary=True`.

These enhancements to the MedicalSummarizer annotator represent our ongoing commitment to providing state-of-the-art tools for healthcare professionals and researchers, facilitating more efficient and accurate medical text analysis.
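The map-reduce idea described above can be sketched in plain Python. This is not the `MedicalSummarizer` implementation, which is not shown in this notebook. The function `refine_summary` and its arguments are hypothetical and only mirror the parameter names above, tokens are approximated by whitespace splitting, and `toy_summarize` is a stand-in for a model call that simply keeps the first half of each chunk.

```python
def toy_summarize(chunk):
    """Stand-in for a real model call: keeps the first half of the words."""
    words = chunk.split()
    return " ".join(words[: max(1, len(words) // 2)])


def refine_summary(text, summarize, target_length=100, chunk_size=512, max_attempts=3):
    """Repeatedly summarize chunks and rejoin them until the text is short enough."""
    current = text
    for _ in range(max_attempts):
        tokens = current.split()
        if len(tokens) <= target_length:
            break  # already within the target length, stop refining
        # Map: summarize each chunk of at most `chunk_size` tokens independently.
        chunks = [" ".join(tokens[i:i + chunk_size])
                  for i in range(0, len(tokens), chunk_size)]
        partial_summaries = [summarize(chunk) for chunk in chunks]
        # Reduce: join the partial summaries into the next candidate text.
        current = " ".join(partial_summaries)
    return current


# Hypothetical long input made of 400 placeholder tokens.
long_text = " ".join("w{}".format(i) for i in range(400))
refined = refine_summary(long_text, toy_summarize, target_length=100, chunk_size=50)
print(len(refined.split()))
```

The attempt limit matters because a summarizer is not guaranteed to shrink its input; without it the loop could run indefinitely on text that never reaches the target length.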

In [11]:
document_assembler = DocumentAssembler()\
            .setInputCol('text')\
            .setOutputCol('document')

summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")\
            .setInputCols(["document"])\
            .setOutputCol("summary")\
            .setMaxTextLength(512)\
            .setMaxNewTokens(512)\
            .setDoSample(True)\
            .setRefineSummary(True)\
            .setRefineSummaryTargetLength(100)\
            .setRefineMaxAttempts(3)\
            .setRefineChunkSize(512)

pipeline = Pipeline(stages=[
            document_assembler,
            summarizer])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

summarizer_clinical_jsl download started this may take some time.
[OK!]


In [12]:
text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""

light_model = LightPipeline(model)
light_result = light_model.annotate(text)

In [13]:
light_result["summary"]

['An x-ray showed spondylolisis of the ankle in a 17, 17, and 17-months old man who was playing basketball in the gym after slipping. He has no other injury notes, but an anx ray revealed small fracture of the ankle. He requested Moritinos for pain and he was discharged with crutes and Motrinos. The physician gave his sprates & tuxet for resting. He also gave him Motrines, if pain worsenes and advised him on returning for followup.']

In [14]:
for i in range(len(light_result['document'])):
    document_text = textwrap.fill(light_result['document'][i], width=120)
    summary_text = textwrap.fill(light_result['summary'][i], width=120)

    print("➤ Document {}: \n{}".format(i+1, document_text))
    print("\n")
    print("➤ Summary {}: \n{}".format(i+1, summary_text))
    print("\n")

➤ Document 1: 
The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to
presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear
weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his
ankle in the past. SOCIAL HISTORY: He does not drink or smoke. MEDICAL DECISION MAKING: He had an x-ray of his ankle
that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain
over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a
splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics. DISPOSITION:
Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length
his sleep and if he has continued pain to fo

In [15]:
text = """To determine whether a course of low-dose indomethacin therapy, when initiated within 24 hours of birth, would decrease ductal shunting in premature infants who received prophylactic surfactant in the delivery room. Ninety infants, with birth weights of 600 to 1250 gm, were entered into a prospective, randomized, controlled trial to receive either indomethacin, 0.1 mg/kg per dose, or placebo less than 24 hours and again every 24 hours for six doses. Echocardiography was performed on day 1 before treatment and on day 7, 24 hours after treatment. A hemodynamically significant patent ductus arteriosus (PDA) was confirmed with an out-of-study echocardiogram, and the nonresponders were treated with standard indomethacin or ligation. Forty-three infants received indomethacin (birth weight, 915 +/- 209 gm; gestational age, 26.4 +/- 1.6 weeks; 25 boys), and 47 received placebo (birth weight, 879 +/- 202 gm; gestational age, 26.4 +/- 1.8 weeks; 22 boys) (P = not significant). Of 90 infants, 77 (86%) had a PDA by echocardiogram on the first day of life before study treatment; 84% of these PDAs were moderate or large in size in the indomethacin-treated group compared with 93% in the placebo group. Nine of forty indomethacin-treated infants (21%) were study-dose nonresponders compared with 22 (47%) of 47 placebo-treated infants (p < 0.018). There were no significant differences between both groups in any of the long-term outcome variables, including intraventricular hemorrhage, duration of oxygen therapy, endotracheal intubation, duration of stay in neonatal intensive care unit, time to regain birth weight or reach full caloric intake, incidence of bronchopulmonary dysplasia, and survival. No significant differences were noted in the incidence of oliguria, elevated plasma creatinine concentration, thrombocytopenia, pulmonary hemorrhage, or necrotizing enterocolitis. 
The prophylactic use of low doses of indomethacin, when initiated in the first 24 hours of life in low birth weight infants who receive prophylactic surfactant in the delivery room, decreases the incidence of left-to-right shunting at the level of the ductus arteriosus."""

light_result = light_model.annotate(text)

In [16]:
light_result["summary"]

["A trial was performed on 90 newborn babies with low birthweight babies who received 0.1mg/kg pro dose indomethadiain or placebo less less then 24 hours for sixdosed. A hemodynamicly significant brevet duplex (DPA), was confirmated by echocordogram, with 47 of the infant'' nonrespirators treated. There was a no significant change from the placebo-controlled study group to re-intravenously-treated newborn infants who had PPAs."]

## 📃 summarizer_clinical_jsl_augmented

In [17]:
document_assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

med_summarizer = MedicalSummarizer\
    .pretrained("summarizer_clinical_jsl_augmented")\
    .setInputCols("document")\
    .setOutputCol("summary")\
    .setMaxNewTokens(115)\
    .setMaxTextLength(1024)

pipeline = Pipeline(stages=[document_assembler,
                            med_summarizer])


model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

summarizer_clinical_jsl_augmented download started this may take some time.
[OK!]


In [18]:
text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\\n Overall, residual disease was the only independent predictor of survival. Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis."""

light_model = LightPipeline(model)
light_result = light_model.annotate(text)

document_text = textwrap.fill(light_result['document'][0], width=120)
summary_text = textwrap.fill(light_result['summary'][0], width=120)

print("➤ Document :\n{}".format(document_text))
print("\n")
print("➤ Summary : \n{}".format(summary_text))
print("\n")

➤ Document :
Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the
extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the
effect of aggressive surgical resection on ovarian cancer patient survival.\n A retrospective cohort study of
consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing
primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual
disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\n The study
comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5
years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition
of optimal cytoreduction). Considering all patie

In [19]:
text = """Medical Specialty: Neurology, Sample Name: Ulnar Nerve Transposition
Description: Subcutaneous ulnar nerve transposition. A curvilinear incision was made over the medial elbow, starting proximally at the medial intermuscular septum, curving posterior to the medial epicondyle, then curving anteriorly along the path of the ulnar nerve. Dissection was carried down to the ulnar nerve. (Medical Transcription Sample Report)

PROCEDURE: Subcutaneous ulnar nerve transposition.

PROCEDURE IN DETAIL: After administering appropriate antibiotics and MAC anesthesia, the upper extremity was prepped and draped in the usual standard fashion. The arm was exsanguinated with Esmarch, and the tourniquet inflated to 250 mmHg.

A curvilinear incision was made over the medial elbow, starting proximally at the medial intermuscular septum, curving posterior to the medial epicondyle, then curving anteriorly along the path of the ulnar nerve. Dissection was carried down to the ulnar nerve. Branches of the medial antebrachial and the medial brachial cutaneous nerves were identified and protected.

Osborne's fascia was released, an ulnar neurolysis performed, and the ulnar nerve was mobilized. Six cm of the medial intermuscular septum was excised, and the deep periosteal origin of the flexor carpi ulnaris was released to avoid kinking of the nerve as it was moved anteriorly.

The subcutaneous plane just superficial to the flexor-pronator mass was developed. Meticulous hemostasis was maintained with bipolar electrocautery. The nerve was transposed anteriorly, superficial to the flexor-pronator mass. Motor branches were dissected proximally and distally to avoid tethering or kinking the ulnar nerve.

A semicircular medially based flap of flexor-pronator fascia was raised and sutured to the subcutaneous tissue in such a way as to prevent the nerve from relocating. The subcutaneous tissue and skin were closed with simple interrupted sutures. Marcaine with epinephrine was injected into the wound. The elbow was dressed and splinted. The patient was awakened and sent to the recovery room in good condition, having tolerated the procedure well.
"""

In [20]:
light_result = light_model.annotate(text)
light_result["summary"]

['The report describes a subcutaneous ulnar nerve transposition procedure performed on a patient under MAC anesthesia. The procedure involved making an incision over the medial elbow, excising the medial intermuscular septum, performing an ulnar neurolysis, and mobilizing the ulnar nerve. The nerve was transposed anteriorly and motor branches were dissected to avoid tethering or kinking. A flap of flexor-pronator fascia was raised and sutured to prevent the']

## 📃 summarizer_biomedical_pubmed
This model is a modified version of the Flan-T5 (LLM) summarization model, fine-tuned by John Snow Labs on biomedical datasets (PubMed abstracts). It can generate summaries of up to 512 tokens from an input text of at most 1024 tokens.

![image](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/databricks/python/data/Automated_Summarization_Clinical_Notes_pubmed.png?raw=true)

In [21]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = MedicalSummarizer\
    .pretrained("summarizer_biomedical_pubmed")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

pipeline = Pipeline(stages=[
    document_assembler,
    summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

summarizer_biomedical_pubmed download started this may take some time.
[OK!]


In [22]:
text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\\n Overall, residual disease was the only independent predictor of survival. Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis."""

light_result = light_model.annotate(text)

document_text = textwrap.fill(light_result['document'][0], width=120)
summary_text = textwrap.fill(light_result['summary'][0], width=120)

print("➤ Document :\n{}".format(document_text))
print("\n")
print("➤ Summary : \n{}".format(summary_text))
print("\n")

➤ Document :
Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the
extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the
effect of aggressive surgical resection on ovarian cancer patient survival.\n A retrospective cohort study of
consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing
primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual
disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\n The study
comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5
years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition
of optimal cytoreduction). Considering all patie

## 📃 summarizer_clinical_questions

This model is a Flan-T5-based summarization model (LLM) fine-tuned by John Snow Labs on medical questions exchanged through clinical channels (clinic, email, call center, etc.). It can generate summaries of up to 512 tokens from an input text of at most 1024 tokens.

In [23]:
summarizer = MedicalSummarizer()\
    .pretrained("summarizer_clinical_questions")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

pipeline = Pipeline(stages=[
    document_assembler,
    summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

summarizer_clinical_questions download started this may take some time.
[OK!]


In [24]:
text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""

light_result = light_model.annotate(text)

document_text = textwrap.fill(light_result['document'][0], width=120)
summary_text = textwrap.fill(light_result['summary'][0], width=120)

print("➤ Document :\n{}".format(document_text))
print("\n")
print("➤ Summary : \n{}".format(summary_text))
print("\n")


➤ Document :
 Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor
digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because
of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound
scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an
appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and
T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000
mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a
little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking
problem and very rapid heartrate. I just wan

## 📃 summarizer_radiology

This model summarizes radiology reports while preserving the important information.

In [25]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = MedicalSummarizer()\
    .pretrained("summarizer_radiology", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

pipeline = Pipeline(stages=[
    document_assembler,
    summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

summarizer_radiology download started this may take some time.
[OK!]


In [26]:
text = """INDICATIONS: Peripheral vascular disease with claudication.

RIGHT:
1. Normal arterial imaging of right lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic.
4. Ankle brachial index is 0.96.

LEFT:
1. Normal arterial imaging of left lower extremity.
2. Peak systolic velocity is normal.
3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic.
4. Ankle brachial index is 1.06.

IMPRESSION:
Normal arterial imaging of both lower lobes.
"""

light_result = light_model.annotate(text)

document_text = textwrap.fill(light_result['document'][0], width=120)
summary_text = textwrap.fill(light_result['summary'][0], width=120)

print("➤ Document :\n{}".format(document_text))
print("\n")
print("➤ Summary : \n{}".format(summary_text))
print("\n")


➤ Document :
INDICATIONS: Peripheral vascular disease with claudication.  RIGHT: 1. Normal arterial imaging of right lower extremity.
2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic. 4. Ankle brachial index is 0.96.  LEFT: 1.
Normal arterial imaging of left lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic
throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06.  IMPRESSION: Normal
arterial imaging of both lower lobes.


➤ Summary : 
The patient has peripheral vascular disease with claudication. The right lower extremity shows normal arterial imaging,
but the peak systolic velocity is normal. The arterial waveform is triphasic throughout, except for the posterior tibial
artery, which is biphasic. The ankle brachial index is 0.96. The impression is normal arterial imaging of both lower
lobes.




## 📃 summarizer_clinical_guidelines_large

This Medical Summarizer model produces succinct summaries of clinical guidelines.

In [4]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = MedicalSummarizer.pretrained("summarizer_clinical_guidelines_large", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(768)\
    .setMaxNewTokens(512)

pipeline = Pipeline(stages=[
    document,
    summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

summarizer_clinical_guidelines_large download started this may take some time.
[OK!]


In [5]:
text = """Clinical Guidelines for Breast Cancer:
Breast cancer is the most common type of cancer among women. It occurs when the cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women.
The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as:
- A personal or family history of breast cancer
- A genetic mutation, such as BRCA1 or BRCA2
- Exposure to radiation
- Age (most commonly occurring in women over 50)
- Early onset of menstruation or late menopause
- Obesity
- Hormonal factors, such as taking hormone replacement therapy
Breast cancer may not present symptoms during its early stages. Symptoms typically manifest as the disease progresses. Some notable symptoms include:
- A lump or thickening in the breast or underarm area
- Changes in the size or shape of the breast
- Nipple discharge
- Nipple changes in appearance, such as inversion or flattening
- Redness or swelling in the breast
Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include:
- Surgery (such as lumpectomy or mastectomy)
- Radiation therapy
- Chemotherapy
- Hormone therapy
- Targeted therapy
Early detection is crucial for the successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. If you notice any changes in your breast tissue, consult with your healthcare provider immediately."""


light_result = light_model.annotate(text)

In [6]:
document_text = textwrap.fill(light_result['document'][0], width=120)
summary_text = textwrap.fill(light_result['summary'][0], width=120)

print("➤ Document :\n{}".format(document_text))
print("\n")
print("➤ Summary : \n{}".format(summary_text))

➤ Document :
Clinical Guidelines for Breast Cancer: Breast cancer is the most common type of cancer among women. It occurs when the
cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells
to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women. The exact
cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast
cancer, such as: - A personal or family history of breast cancer - A genetic mutation, such as BRCA1 or BRCA2 - Exposure
to radiation - Age (most commonly occurring in women over 50) - Early onset of menstruation or late menopause - Obesity
- Hormonal factors, such as taking hormone replacement therapy Breast cancer may not present symptoms during its early
stages. Symptoms typically manifest as the disease progresses. Some notable symptoms include: - A lump or thickening in
the breast or underarm area - Changes 

## 📃 summarizer_clinical_laymen

This model is a Flan-T5-based summarization model (LLM) fine-tuned by John Snow Labs on a custom dataset so that its summaries avoid clinical jargon. It can generate summaries of up to 512 tokens from an input text of at most 1024 tokens.

In [7]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = MedicalSummarizer.pretrained("summarizer_clinical_laymen", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxNewTokens(512)\
    .setRefineSummary(True)\
    .setRefineSummaryTargetLength(100)\
    .setRefineMaxAttempts(3)\
    .setRefineChunkSize(512)

pipeline = Pipeline(stages=[
    document_assembler,
    summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

summarizer_clinical_laymen download started this may take some time.
[OK!]


In [8]:
text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.\n\nALLERGIES: She has no known drug allergies.REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis
ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""

light_result = light_model.annotate(text)

In [9]:
document_text = textwrap.fill(light_result['document'][0], width=120)
summary_text = textwrap.fill(light_result['summary'][0], width=120)

print("➤ Document :\n{}".format(document_text))
print("\n")
print("➤ Summary : \n{}".format(summary_text))

➤ Document :
Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is
a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical
weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image.
She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and
she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within
a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with
15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-
pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three
months in 2008 with a ten-pound weight

![image](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/databricks/python/data/Spark_NLP_for_Healthcare_vs_Others.png?raw=true)

# 📍Text Summarization with Extractive Approach

## ExtractiveSummarization

**Extractive summarization** is a technique used in Natural Language Processing (NLP) that aims to generate a concise summary by extracting the most important information from a given text. Unlike ***abstractive summarization***, which involves generating new sentences to capture the essence of the content, ***extractive summarization*** directly selects and concatenates existing sentences or phrases from the original text.

Extractive summarization focuses on extracting the most relevant information rather than generating new content. The process typically includes preprocessing the text, identifying important sentences using various criteria, ranking them based on their importance, and selecting the top-ranked sentences for the final summary. Extractive summarization is favored for its objectivity: because the summary reuses sentences from the source verbatim, it preserves the factual accuracy of the original text.

**Parameters**


- `similarityThreshold`: Sets the minimum cosine similarity at which two sentences are considered similar.

- `summarySize`: Sets the number of sentences to include in the summary.
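
The preprocess, score, rank, and select steps described above can be sketched in plain Python. This is only an illustration of the general idea and not the algorithm that `ExtractiveSummarization` uses internally: it assumes bag-of-words counts in place of sentence embeddings, a naive regex sentence splitter, and a score equal to the sum of a sentence's similarities to the other sentences. The names `cosine` and `extractive_summary` are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extractive_summary(text, summary_size=2, similarity_threshold=0.0):
    # 1. Preprocess: naive sentence split, then lowercase bag-of-words vectors
    #    (a stand-in for real sentence embeddings).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    vectors = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]

    # 2. Score: sum the similarities to all other sentences that reach the threshold.
    scores = []
    for i, vi in enumerate(vectors):
        total = 0.0
        for j, vj in enumerate(vectors):
            if i != j:
                sim = cosine(vi, vj)
                if sim >= similarity_threshold:
                    total += sim
        scores.append(total)

    # 3. Rank by score and keep the top `summary_size` sentences.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:summary_size]

    # 4. Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(top))


sample = "The nerve was released. The nerve was mobilized. Weather is sunny."
print(extractive_summary(sample, summary_size=2))
```

In this toy input the first two sentences share most of their words, so they score highest and the unrelated third sentence is dropped.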

In [10]:
documenter = DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("documents")

sentence_detector = SentenceDetectorDLModel() \
    .pretrained()\
    .setInputCols("documents") \
    .setOutputCol("sentences")

sentence_embeddings = BertSentenceEmbeddings()\
    .pretrained("sent_small_bert_L2_128")\
    .setInputCols(["sentences"])\
    .setOutputCol("sentence_embeddings")

summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(2)\
    .setSimilarityThreshold(0)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [11]:
sampleText = """
One of David Cameron 's closest friends and Conservative allies, George Osborne rose rapidly after becoming MP for Tatton in 2001. Michael Howard promoted him from shadow chief secretary to the Treasury to shadow chancellor in May 2005, at the age of 34. Mr Osborne took a key role in the election campaign and has been at the forefront of the debate on how to deal with the recession and the UK's spending deficit. Even before Mr Cameron became leader the two were being likened to Labour's Blair/Brown duo. The two have emulated them by becoming prime minister and chancellor, but will want to avoid the spats. Before entering Parliament, he was a special adviser in the agriculture department when the Tories were in government and later served as political secretary to William Hague. The BBC understands that as chancellor, Mr Osborne, along with the Treasury will retain responsibility for overseeing banks and financial regulation. Mr Osborne said the coalition government was planning to change the tax system "to make it fairer for people on low and middle incomes", and undertake "long-termstructural reform" of the banking sector, education and the welfare state.
""".strip()

sampleText

'One of David Cameron \'s closest friends and Conservative allies, George Osborne rose rapidly after becoming MP for Tatton in 2001. Michael Howard promoted him from shadow chief secretary to the Treasury to shadow chancellor in May 2005, at the age of 34. Mr Osborne took a key role in the election campaign and has been at the forefront of the debate on how to deal with the recession and the UK\'s spending deficit. Even before Mr Cameron became leader the two were being likened to Labour\'s Blair/Brown duo. The two have emulated them by becoming prime minister and chancellor, but will want to avoid the spats. Before entering Parliament, he was a special adviser in the agriculture department when the Tories were in government and later served as political secretary to William Hague. The BBC understands that as chancellor, Mr Osborne, along with the Treasury will retain responsibility for overseeing banks and financial regulation. Mr Osborne said the coalition government was planning to 

In [12]:
data = spark.createDataFrame([[sampleText]]).toDF("text")

result = model.transform(data)

In [13]:
light_model = LightPipeline(model)

light_result = light_model.annotate(sampleText)

light_result["summaries"]

["Mr Osborne took a key role in the election campaign and has been at the forefront of the debate on how to deal with the recession and the UK's spending deficit. The BBC understands that as chancellor, Mr Osborne, along with the Treasury will retain responsibility for overseeing banks and financial regulation."]

### PubMed data

In [14]:
text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.
A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.
The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).
Overall, residual disease was the only independent predictor of survival. Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis."""

text

'Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\nA retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\nThe study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual 

In [15]:
light_result = light_model.annotate(text)
light_result["summaries"]

['The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\nThe study comprised 194 patients, including 144 with carcinomatosis. Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors.']

### ➤ summarySize

Sets the number of sentences to include in the summary.

In [16]:
summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(4)\
    .setSimilarityThreshold(0)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)


### Patient posts

In [17]:
text = """
Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months.
Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem.
Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily.
I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate.
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
"""

light_result = light_model.annotate(text)
light_result["summaries"]

["Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. Bcs i heard that thyroid take time to start recover. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems."]

### Clinical data

In [18]:
summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(10)\
    .setSimilarityThreshold(0)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)

In [19]:
text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43.
She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image.
She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year.
She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss.
She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.
ALLERGIES: She has no known drug allergies.
REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""

light_result = light_model.annotate(text)
light_result["summaries"]

['She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss. PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath. FAMILY HISTORY: Pertinent for obesity and hypertension. PHYSICAL EXAM: This is a pleasant female in no acute distress. Cardiovascular is normal sinus rhythm. This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. She will also see my nutritionist and social worker and have an upper endoscopy.']

### ➤ similarityThreshold


Sets the minimal cosine similarity between sentences to consider them
similar.
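To make the threshold concrete, here is a minimal, library-independent sketch of how a cosine similarity cutoff decides which sentence pairs count as similar. The toy vectors and the helper names (`cosine_similarity`, `similar_pairs`) are illustrative assumptions. This is not the annotator's internal implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, threshold):
    """Return index pairs (i, j), i < j, whose similarity is at least `threshold`."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy 3-dimensional "sentence embeddings", made up for illustration.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

print(similar_pairs(toy_embeddings, 0.8))   # strict cutoff
print(similar_pairs(toy_embeddings, 0.05))  # permissive cutoff
```

In this sketch, raising the threshold leaves fewer sentence pairs counted as similar, and lowering it leaves more.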

In [20]:
summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(10)\
    .setSimilarityThreshold(0.8)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)

In [21]:
text = """Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43.
She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image.
She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year.
She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss.
She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.
ALLERGIES: She has no known drug allergies.
REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""

light_result = light_model.annotate(text)
light_result["summaries"]

['Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis. ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy.  Once this is completed, we will submit her to her insurance company for approval.\n']

### ➤ setReturnSingleDocument


Determines whether to compile the selected sentences into a single document.

In [22]:
summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")\
    .setSummarySize(10)\
    .setSimilarityThreshold(0.8)\
    .setReturnSingleDocument(True)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)

In [23]:
light_result = light_model.annotate(text)
light_result["summaries"]

['Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis. ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy.  Once this is completed, we will submit her to her insurance company for approval.\n']

# 📍 Two-Stage Text Summarization: Extractive Methods & Abstractive Methods

✔︎ When working with long texts, our primary objective is to extract the most pertinent information. To achieve this, we first apply `ExtractiveSummarization`, which selects the most important sentences while preserving the original wording and factual content of the text. This step strips superfluous detail while retaining key information. The condensed text is then passed to the `MedicalSummarizer` model, which shortens it further within a medical context. This two-stage approach aims to produce a manageable summary of long medical documents while keeping the abstractive model's input short.
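The control flow of the two stages can be sketched without Spark NLP. In the sketch below, both stages are deliberately naive stand-ins: `extract_stage` keeps the longest sentences and `abstract_stage` only truncates, whereas a real abstractive model rewrites the text. All function names and the sample text are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter on ., !, ? followed by whitespace (illustration only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_stage(text, summary_size):
    """Stand-in extractive step: keep the longest sentences, in original order."""
    sentences = split_sentences(text)
    ranked = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
    keep = sorted(ranked[:summary_size])
    return " ".join(sentences[i] for i in keep)


def abstract_stage(text, max_words):
    """Stand-in abstractive step: a real model would rewrite; here we only truncate."""
    return " ".join(text.split()[:max_words])


def two_stage_summarize(text, summary_size, max_words):
    condensed = extract_stage(text, summary_size)  # stage 1: select sentences
    return abstract_stage(condensed, max_words)    # stage 2: shorten further


doc = (
    "Cancer is a genetic disease driven by mutations. It is complex. "
    "Immunotherapy has shown success in some cancers. More research is needed."
)
print(two_stage_summarize(doc, summary_size=2, max_words=12))
```

The point of the sketch is the ordering: the second stage only ever sees the sentences that survived the first stage.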

➤ So, to illustrate our two-step approach, let's consider an example:

In [24]:
documenter = DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("documents")

sentence_detector = SentenceDetectorDLModel() \
    .pretrained()\
    .setInputCols("documents") \
    .setOutputCol("sentences")

sentence_embeddings = BertSentenceEmbeddings()\
    .pretrained("sent_small_bert_L2_128")\
    .setInputCols(["sentences"])\
    .setOutputCol("sentence_embeddings")

summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("extractive_summaries")\
    .setSummarySize(12)\
    .setSimilarityThreshold(0.5)\
    .setReturnSingleDocument(True)

medical_summarizer = MedicalSummarizer.pretrained("summarizer_biomedical_pubmed", "en", "clinical/models")\
    .setInputCols(["extractive_summaries"])\
    .setOutputCol("medical_summaries")\
    .setMaxTextLength(768)\
    .setMaxNewTokens(512)\
    .setDoSample(True)\
    .setRefineSummary(True)\
    .setRefineSummaryTargetLength(100)\
    .setRefineMaxAttempts(3)\
    .setRefineChunkSize(512)

pipeline = Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer,
        medical_summarizer
        ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = LightPipeline(model)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
summarizer_biomedical_pubmed download started this may take some time.
[OK!]


In [25]:
text = """Cancer continues to be one of the leading causes of death globally, despite advancements in our understanding of its biology and the development of more effective treatments. It's a complex disease with various types and subtypes, each characterized by rapid cell growth and the ability to invade other tissues.

Over the years, research has revealed that cancer is fundamentally a genetic disease, driven by mutations in the DNA. These mutations can be inherited or acquired, and they disrupt the normal regulation of cell growth, leading to uncontrolled proliferation and eventually tumor formation. Common genes associated with cancer, known as oncogenes and tumor suppressor genes, have been identified and extensively studied, providing valuable insights into the molecular mechanisms of cancer and leading to the development of targeted therapies.

Yet, the fight against cancer is far from over. The disease's complexity, along with its ability to adapt and evolve, poses significant challenges to treatment. Many cancers develop resistance to therapy, and metastatic disease - where the cancer spreads to other parts of the body - remains hard to treat.

One promising area of research is immunotherapy, which involves harnessing the power of the immune system to fight cancer. Several immunotherapies, including immune checkpoint inhibitors and CAR-T cell therapy, have shown remarkable success in treating certain types of cancer. However, they are not effective for all patients and can cause serious side effects, highlighting the need for further research.

Moreover, early detection of cancer significantly increases the chances of successful treatment. As such, there is a great deal of interest in developing more accurate and reliable methods for early cancer detection, such as liquid biopsy and novel imaging technologies.

Cancer research is a highly active field with rapid advancements. Continued research and innovation, driven by a deeper understanding of cancer biology, are crucial to developing more effective strategies for prevention, detection, and treatment of this formidable disease."""

In [26]:
light_result = light_model.annotate(text)

In [27]:
# Result of ExtractiveSummarization
light_result["extractive_summaries"]

["These mutations can be inherited or acquired, and they disrupt the normal regulation of cell growth, leading to uncontrolled proliferation and eventually tumor formation. Common genes associated with cancer, known as oncogenes and tumor suppressor genes, have been identified and extensively studied, providing valuable insights into the molecular mechanisms of cancer and leading to the development of targeted therapies. Yet, the fight against cancer is far from over. The disease's complexity, along with its ability to adapt and evolve, poses significant challenges to treatment. Many cancers develop resistance to therapy, and metastatic disease - where the cancer spreads to other parts of the body - remains hard to treat. One promising area of research is immunotherapy, which involves harnessing the power of the immune system to fight cancer. Several immunotherapies, including immune checkpoint inhibitors and CAR-T cell therapy, have shown remarkable success in treating certain types o

In [28]:
# Result of MedicalSummarizer
light_result["medical_summaries"]

['There is currently no robust evidence to support the development of new therapies to improve the quality, efficiencity, or effectiveness of treatment of patients with cancer, or with patients who are not eligible to receive regular screening for cancer. Further high-level trials of new therapies to address the potential for re-exposure to current treatments, and to identify the most efficient treatment target are required are urgent.']

# 📍 Comparing Map-Reduce Based MedicalSummarizer and Extractive Summarization

In [4]:
text = """Medical Specialty: Gastroenterology, Sample Name: Wound Check - Status Post APR
Description: This is a pleasant 50-year-old female who has undergone an APR secondary to refractory ulcerative colitis. Overall, her quality of life has significantly improved since she had her APR. She is functioning well with her ileostomy. (Medical Transcription Sample Report)

HISTORY OF PRESENT ILLNESS: Ms. Connor is a 50-year-old female who returns to clinic for a wound check. The patient underwent an APR secondary to refractory ulcerative colitis. Subsequently, she developed a wound infection, which has since healed. On our most recent visit to our clinic, she has her perineal stitches removed and presents today for followup of her perineal wound. She describes no drainage or erythema from her bottom. She is having good ostomy output. She does not describe any fevers, chills, nausea, or vomiting. The patient does describe some intermittent pain beneath the upper portion of the incision as well as in the right lower quadrant below her ostomy. She has been taking Percocet for this pain and it does work. She has since run out has been trying extra strength Tylenol, which will occasionally help this intermittent pain. She is requesting additional pain medications for this occasional abdominal pain, which she still experiences.

PHYSICAL EXAMINATION: Temperature 95.8, pulse 68, blood pressure 132/73, and weight 159 pounds. This is a pleasant female in no acute distress. The patient's abdomen is soft, nontender, nondistended with a well-healed midline scar. There is an ileostomy in the right hemiabdomen, which is pink, patent, productive, and protuberant. There are no signs of masses or hernias over the patient's abdomen.

ASSESSMENT AND PLAN: This is a pleasant 50-year-old female who has undergone an APR secondary to refractory ulcerative colitis. Overall, her quality of life has significantly improved since she had her APR. She is functioning well with her ileostomy. She did have concerns or questions about her diet and we discussed the BRAT diet, which consisted of foods that would slow down the digestive tract such as bananas, rice, toast, cheese, and peanut butter. I discussed the need to monitor her ileostomy output and preferential amount of daily output is 2 liters or less. I have counseled her on refraining from soft drinks and fruit drinks. I have also discussed with her that this diet is moreover a trial and error and that she may try certain foods that did not agree with her ileostomy, however others may and that this is something she will just have to perform trials with over the next several months until she finds what foods that she can and cannot eat with her ileostomy. She also had questions about her occasional abdominal pain. I told her that this was probably continue to improve as months went by and I gave her a refill of her Percocet for the continued occasional pain. I told her that this would the last time I would refill the Percocet and if she has continued pain after she finishes this bottle then she would need to start ibuprofen or Tylenol if she had continued pain. The patient then brought up some right hand and arm numbness, which has been there postsurgically and was thought to be from positioning during surgery. This is all primarily gone away except for a little bit of numbness at the tip of the third digit as well as some occasional forearm muscle cramping. I told her that I felt that this would continue to improve as it has done over the past two months since her surgery. I told her to continue doing hand exercises as she has been doing and this seems to be working for her. Overall, I think she has healed from her surgery and is doing very well. 
Again, her quality of life is significantly improved. She is happy with her performance. We will see her back in six months just for a general routine checkup and see how she is doing at that time."""

## ➮ Map-Reduce Based MedicalSummarizer

✔︎ Used with its map-reduce refinement parameters, our MedicalSummarizer model generates an abstractive summary: it rephrases and rewrites the text to condense it further. The goal is a shorter, more manageable summary.
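As a rough mental model of the map-reduce idea, the sketch below splits a text into chunks, summarizes each chunk (map), joins the partial summaries (reduce), and repeats until the result is short enough or the attempts run out. The arguments `chunk_size`, `target_length`, and `max_attempts` are only loosely analogous to `setRefineChunkSize`, `setRefineSummaryTargetLength`, and `setRefineMaxAttempts`. How the annotator actually combines them is an assumption here, `toy_summarize` is a hypothetical stand-in for a model call, and lengths are counted in whitespace-separated words rather than model tokens.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_summarize(chunk, keep_ratio=0.5):
    """Hypothetical stand-in for a model call: keeps the first half of the words."""
    words = chunk.split()
    keep = max(1, int(len(words) * keep_ratio))
    return " ".join(words[:keep])


def map_reduce_summarize(text, chunk_size, target_length, max_attempts):
    """Summarize chunk by chunk, then repeat on the joined result if it is still too long."""
    summary = text
    for _ in range(max_attempts):
        if len(summary.split()) <= target_length:
            break
        partials = [toy_summarize(c) for c in chunk_words(summary, chunk_size)]  # map
        summary = " ".join(partials)                                            # reduce
    return summary


long_text = " ".join(f"w{i}" for i in range(400))
result = map_reduce_summarize(long_text, chunk_size=64, target_length=100, max_attempts=3)
print(len(result.split()))
```

Because each chunk is summarized independently, no single model call has to read the whole document, which is the main appeal of this pattern for long clinical notes.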

In [5]:
document_assembler = DocumentAssembler()\
            .setInputCol('text')\
            .setOutputCol('document')

summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl_augmented", "en", "clinical/models")\
            .setInputCols(["document"])\
            .setOutputCol("summary")\
            .setMaxTextLength(768)\
            .setMaxNewTokens(512)\
            .setDoSample(True)\
            .setRefineSummary(True)\
            .setRefineSummaryTargetLength(100)\
            .setRefineMaxAttempts(3)\
            .setRefineChunkSize(512)

pipeline = Pipeline(stages=[
            document_assembler,
            summarizer])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

summarizer_clinical_jsl_augmented download started this may take some time.
[OK!]


In [6]:
light_result = light_model.annotate(text)
light_result["summary"]

['A 90-something female with torgutan disease underwent anterior Retroperitom procedure. The patient has experienced improving intestinal ileostome and intermittent abdomen problems. She has undergored alcohol, Soft foods with an endostolic plan for ileostomy, stop drinking alcohol with an acetive pump and medication to manage pain. The doctor advised stopping taking Alcohol and Soft food, adjusting gastrointestinal diet, and reducing pain with Percoc and a change in diet to relieve nerve pain. A check-up is planned for 6-7 months after surgery.']

## ➮ Extractive Summarization

✔︎ Extractive Summarization formulates a summary by identifying and extracting the most pertinent sentences from the source text. Rather than generating new content, the selected sentences maintain their original form and structure. The advantage of this method lies in its prioritization of factual accuracy by preserving the original context of the information. It also has a lower tendency to produce misleading or incorrect information since it utilizes direct portions of the text. This is particularly valuable in summarizing sensitive and complex documents, such as medical texts.
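The selection step described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: bag-of-words counts stand in for the BERT sentence embeddings used in this notebook, and sentences are ranked by their total cosine similarity to all other sentences, which is one common centrality heuristic and not necessarily the ranking that `ExtractiveSummarization` uses. The helper names and the sample note are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on ., !, ? followed by whitespace (illustration only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Lowercased word counts; a crude substitute for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extractive_summary(text, summary_size):
    """Pick the `summary_size` most central sentences, returned verbatim in original order."""
    sentences = split_sentences(text)
    vectors = [bag_of_words(s) for s in sentences]
    scores = [
        sum(cosine(v, u) for j, u in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:summary_size]
    return [sentences[i] for i in sorted(top)]


note = (
    "The patient reports abdominal pain. The abdominal pain improves with Percocet. "
    "She enjoys gardening. The patient requests Percocet for pain."
)
for sentence in extractive_summary(note, summary_size=3):
    print(sentence)
```

Every sentence in the output is copied unchanged from the input, which is why this family of methods cannot introduce wording that was not in the source.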

In [7]:
sentence_detector = SentenceDetectorDLModel() \
    .pretrained()\
    .setInputCols("document") \
    .setOutputCol("sentences")

sentence_embeddings = BertSentenceEmbeddings()\
    .pretrained("sent_small_bert_L2_128")\
    .setInputCols(["sentences"])\
    .setOutputCol("sentence_embeddings")

summarizer = ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("extractive_summaries")\
    .setSummarySize(10)\
    .setSimilarityThreshold(0)\
    .setReturnSingleDocument(True)

pipeline = Pipeline(stages=[
            document_assembler,
            sentence_detector,
            sentence_embeddings,
            summarizer])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = LightPipeline(model)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [8]:
light_result = light_model.annotate(text)
light_result["extractive_summaries"]

['She is functioning well with her ileostomy. On our most recent visit to our clinic, she has her perineal stitches removed and presents today for followup of her perineal wound. The patient does describe some intermittent pain beneath the upper portion of the incision as well as in the right lower quadrant below her ostomy. She has been taking Percocet for this pain and it does work. She has since run out has been trying extra strength Tylenol, which will occasionally help this intermittent pain. She is requesting additional pain medications for this occasional abdominal pain, which she still experiences. There is an ileostomy in the right hemiabdomen, which is pink, patent, productive, and protuberant. She is functioning well with her ileostomy. I have also discussed with her that this diet is moreover a trial and error and that she may try certain foods that did not agree with her ileostomy, however others may and that this is something she will just have to perform trials with over

📌 While Extractive Summarization provides a more detailed and objective summary by preserving the original context and meaning of the text, MedicalSummarizer, with its map-reduce parameters, compresses the text into a more condensed format to facilitate a broader overview.

# 📍 Medical Text Summarization Comparison

## ➮ Spark NLP for Healthcare vs Other SOTA Models

**Flan-T5-base-samsum**

- model_name = "philschmid/flan-t5-base-samsum"
- model_size = 250M
- base_model = flan-t5
- dataset = samsum
- domain = general
- owner = google (fine-tuned)
- code_availability = fine-tuning code is not available
- checkpoints_availability = Available
- link_to_repo = https://huggingface.co/philschmid/flan-t5-base-samsum/tree/main

Reported metrics
- Loss: 1.3716
- Rouge1: 47.2358
- Rouge2: 23.5135
- Rougel: 39.6266
- Rougelsum: 43.3458
- Gen Len: 17.3907
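For readers unfamiliar with the ROUGE scores listed above, the following is a simplified sketch of ROUGE-N as n-gram overlap between a reference summary and a candidate summary. It assumes whitespace tokenization and lowercasing with no stemming, so its numbers will not match those of standard ROUGE packages exactly. It returns fractions in the range 0 to 1, whereas the reported scores appear to be on a 0 to 100 scale. The function name `rouge_n` and the sample sentences are illustrative.

```python
from collections import Counter


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap as precision, recall, and F1."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    overlap = sum((ref & cand).values())  # each n-gram counted at most min(ref, cand) times
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the patient has hypertension"
candidate = "patient has hypertension and syncope"
print(rouge_n(reference, candidate, n=1))
print(rouge_n(reference, candidate, n=2))
```

Rouge1 and Rouge2 correspond to `n=1` and `n=2`. RougeL is based on the longest common subsequence instead of fixed-length n-grams and is not covered by this sketch.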

**Flan-T5-base**

- model_name = "google/flan-t5-base"
- model_size = 250M
- base_model = flan-t5
- domain = general
- owner = google
- checkpoints_availability = Available
- link_to_repo = https://huggingface.co/google/flan-t5-base

**Pegasus Samsum**

- model_name = transformersbook/pegasus-samsum
- model_size = 570M
- base_model = google/pegasus-cnn_dailymail
- dataset = samsum
- domain = general
- owner = google (fine-tuned)
- code_availability = https://github.com/nlp-with-transformers/notebooks/blob/main/06_summarization.ipynb
- checkpoints_availability = Available
- link_to_repo = https://github.com/nlp-with-transformers/notebooks/blob/main/06_summarization.ipynb

**Bart-large-samsum**

- model_name = linydub/bart-large-samsum
- model_size = 500M
- base_model = facebook/bart-large
- dataset = samsum
- domain = general
- owner = facebook (fine-tuned)
- code_availability = fine-tuning code is not available
- checkpoints_availability = https://huggingface.co/linydub/bart-large-samsum
- link_to_repo = https://github.com/linydub/azureml-greenai-txtsum

Reported metrics
- eval_rouge1: 55.0234
- eval_rouge2: 29.6005
- eval_rougeL: 44.914
- eval_rougeLsum: 50.464
- predict_rouge1: 53.4345
- predict_rouge2: 28.7445
- predict_rougeL: 44.1848
- predict_rougeLsum: 49.1874

**Bart-large-cnn-samsum**

- model_name = philschmid/bart-large-cnn-samsum
- model_size = 500M
- base_model = facebook/bart-large
- dataset = samsum
- domain = general
- owner = facebook (fine-tuned)
- code_availability = fine-tuning code is not available
- checkpoints_availability = Available
- link_to_repo = https://huggingface.co/philschmid/bart-large-cnn-samsum

## Comparison

In [9]:
# df[['original','flan-t5-base-samsum', 'flan-t5-base', 'bart-large-samsum','pegasus-samsum',
#     'pegasus-xsum', 'bart-large-cnn-samsum', 'sparknlp_aug', 'sparknlp_slim']].T.to_markdown()

|                       | radiology_text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | clinical_text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| original              | INDICATIONS: Peripheral vascular disease with claudication.\n\nRIGHT:\n1. Normal arterial imaging of right lower extremity.\n2. Peak systolic velocity is normal.\n3. Arterial waveform is triphasic.\n4. Ankle brachial index is 0.96.\n\nLEFT:\n1. Normal arterial imaging of left lower extremity.\n\n2. Peak systolic velocity is normal.\n3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic.\n4. Ankle brachial index is 1.06.\n\nIMPRESSION:\nNormal arterial imaging of both lower extremities.\n | Patient with hypertension, syncope, and spinal stenosis - for recheck.  (Medical Transcription Sample Report) SUBJECTIVE:  The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS:  Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. |
| flan-t5-base-samsum   | RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06. IMPRESSION: Normal arterial imaging of both lower extremities.                                                                                                                                                                                                                                                      | A patient with hypertension, syncope, and spinal stenosis for recheck.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| flan-t5-base          | RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 0.06                                                                                                                                                                                                                                                                                                                      | - recheck of a patient with hypertension, syncope, and spinal stenosis                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| bart-large-samsum     | Peripheral vascular disease with claudication. Arterial waveform is triphasic. Peak systolic velocity is normal. Ankle brachial index is 1.06.                                                                                                                                                                                                                                                                                                                                                                                                             | A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She has Atenolol 50 mg daily, Premarin 0.625 mg daily Premarin, calcium with vitamin D two to three pills daily, multivitamin daily                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| pegasus-samsum        | Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. IMPRESSION: Normal arterial imaging of both lower extremities.                                                                                                                                                                                                                                                                                                                                                                                           | The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She has Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily.                                                                                                                                                                                                                                                                                                                                       |
| pegasus-xsum          | Arterial imaging of both lower extremities.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | A case report of a 78-year-old woman with hypertension, syncope, and spinal stenosis.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| bart-large-cnn-samsum | Peripheral vascular disease with claudication. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. Ankle brachial index is 1.06.    IMAGINATION: Normal arterial imaging of both lower extremities.                                                                                                                                                                                                                                                                                                          | The patient is 78-year-old female with hypertension, syncope, and spinal stenosis. She has Atenolol 50 mg daily, Premarin 0.625 mg daily and calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream  0.01%.                                                                                                                                                                                                                                                                                                                                                                       |
| sparknlp_aug          | The patient has peripheral vascular disease with claudication and underwent normal arterial imaging of both lower extremities. The right lower extremity showed normal arterial imaging with normal peak systolic velocity, triphasic arterial waveform, and ankle brachial index of 0.96. The left lower extremity showed normal arterial imaging with triphasic arterial waveform except for the posterior tibial artery where it was biphasic. The ankle brachial index was 0.06.                                                                       | A 78-year-old female with hypertension, syncope, and spinal stenosis returns for a recheck. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. Her medications include Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, and TriViFlor. She also has Elocon cream and Synalar cream for rash.                                                                                                                                                                                                                                                                                                                                                  |
| sparknlp_slim         | The patient has peripheral vascular disease with claudication and underwent normal arterial imaging of both lower extremities. The peak systolic velocity is normal, but the arterial waveform is triphasic throughout, except for the posterior tibial artery where it is biphasic. The ankle brachial index is 0.06. The impression is that the arterial imaging of both lower extremities is normal.                                                                                                                                                    | A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She denies chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She is on multiple medications and has Elocon cream and Synalar cream for rash.                                                                                                                                                                                                                                                                                                                                                                                                                                                    |

### Summarization with GPT-4

In [10]:
!pip install bert-score

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Collecting transformers>=3.0.0 (from bert-score)
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers>=3.0.0->bert-score)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers>=3.0.0->bert-score)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [11]:
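import nltk
from bert_score import score

def generate_scores(res, ref):
    """Print BERTScore precision, recall, and F1 for candidate `res` against reference `ref`."""

    # score() takes a list of candidates and a list of references and returns
    # one tensor each for precision, recall, and F1 (one value per pair).
    berts_p, berts_r, berts_f = score([res], [ref], lang="en", return_hash=False)
    berts_p, berts_r, berts_f = round(float(berts_p[0]), 4), round(float(berts_r[0]), 4), round(float(berts_f[0]), 4)

    print('BERT Score Precision:', berts_p)
    print('BERT Score Recall:', berts_r)
    print('BERT Score F1:', berts_f)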
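
To make the three numbers printed by `generate_scores` easier to interpret, here is a minimal sketch of the greedy matching that BERTScore is built on. It is not the `bert_score` implementation: the library computes contextual token embeddings with a pretrained model, whereas the 2-d vectors below are hypothetical toy values, and the helper names `cosine` and `greedy_bertscore` are assumptions introduced only for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_bertscore(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from per-token embeddings via greedy matching."""
    # Precision: match every candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: match every reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical toy "embeddings": the candidate covers two of the three reference tokens exactly.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_bertscore(candidate, reference)
print("precision:", round(p, 4), "recall:", round(r, 4), "f1:", round(f, 4))
```

In this toy case precision is perfect because every candidate token has an exact match in the reference, while recall is lower because one reference token is only partially covered. The same reading applies to the scores in this section: high precision with lower recall suggests a summary that agrees with the reference but omits part of it.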


In [12]:
radiology_summary_gpt4 = '''The radiology report indicates that the patient has peripheral vascular disease with claudication. The arterial imaging of both lower extremities is normal. The peak systolic velocity is normal, and the arterial waveform is triphasic in both extremities except for the posterior tibial artery in the left extremity, which is biphasic. The ankle brachial index values are 0.96 and 1.06 for the right and left extremities, respectively.'''

radiology_summary_gpt4


'The radiology report indicates that the patient has peripheral vascular disease with claudication. The arterial imaging of both lower extremities is normal. The peak systolic velocity is normal, and the arterial waveform is triphasic in both extremities except for the posterior tibial artery in the left extremity, which is biphasic. The ankle brachial index values are 0.96 and 1.06 for the right and left extremities, respectively.'

In [13]:
# generate_scores(summary_dict['sparknlp_slim']['radiology_text'], radiology_summary_gpt4)
#
# BERT Score Precision: 0.9614
# BERT Score Recall: 0.9453
# BERT Score F1: 0.9533

In [14]:
# generate_scores(summary_dict['sparknlp_aug']['radiology_text'], radiology_summary_gpt4)
#
# BERT Score Precision: 0.9358
# BERT Score Recall: 0.9285
# BERT Score F1: 0.9321

In [15]:
# generate_scores(summary_dict['bart-large-cnn-samsum']['radiology_text'], radiology_summary_gpt4)
#
# BERT Score Precision: 0.9124
# BERT Score Recall: 0.9088
# BERT Score F1: 0.9106

In [17]:
clinical_summary_gpt4 = '''The report is about a 78-year-old female patient with hypertension, syncope, and spinal stenosis who returns for a recheck. She denies experiencing chest pain, palpitations, orthopnea, nocturnal dyspnea, or edema. Her past medical history remains unchanged since the last dictation on 12/03/2003. The patient's medications include Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, TriViFlor, Elocon cream, and Synalar cream.'''
clinical_summary_gpt4

"The report is about a 78-year-old female patient with hypertension, syncope, and spinal stenosis who returns for a recheck. She denies experiencing chest pain, palpitations, orthopnea, nocturnal dyspnea, or edema. Her past medical history remains unchanged since the last dictation on 12/03/2003. The patient's medications include Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, TriViFlor, Elocon cream, and Synalar cream."

In [18]:
# generate_scores(summary_dict['sparknlp_slim']['clinical_text'], clinical_summary_gpt4)
#
# BERT Score Precision: 0.9549
# BERT Score Recall: 0.8891
# BERT Score F1: 0.9208

In [19]:
# generate_scores(summary_dict['sparknlp_aug']['clinical_text'], clinical_summary_gpt4)
#
# BERT Score Precision: 0.9597
# BERT Score Recall: 0.9311
# BERT Score F1: 0.9452

In [20]:
# generate_scores(summary_dict['bart-large-cnn-samsum']['clinical_text'], clinical_summary_gpt4)
#
# BERT Score Precision: 0.8855
# BERT Score Recall: 0.8763
# BERT Score F1: 0.8809
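
The scores are spread over several cells, so a side-by-side view may help. The sketch below simply copies the values from the commented outputs into a dictionary and ranks the models by F1 for each text; the names `reported` and `rank_by_f1` are assumptions made for this illustration, and nothing here is recomputed with `bert_score`.

```python
# BERTScore values copied from the commented cell outputs: (precision, recall, F1).
# The reference in every case is the GPT-4 summary of the same text.
reported = {
    "radiology_text": {
        "sparknlp_slim": (0.9614, 0.9453, 0.9533),
        "sparknlp_aug": (0.9358, 0.9285, 0.9321),
        "bart-large-cnn-samsum": (0.9124, 0.9088, 0.9106),
    },
    "clinical_text": {
        "sparknlp_slim": (0.9549, 0.8891, 0.9208),
        "sparknlp_aug": (0.9597, 0.9311, 0.9452),
        "bart-large-cnn-samsum": (0.8855, 0.8763, 0.8809),
    },
}

def rank_by_f1(scores):
    """Return model names ordered from highest to lowest F1."""
    return sorted(scores, key=lambda name: scores[name][2], reverse=True)

for text_type, scores in reported.items():
    print(text_type)
    for name in rank_by_f1(scores):
        p, r, f1 = scores[name]
        print(f"  {name:<22} P={p:.4f}  R={r:.4f}  F1={f1:.4f}")
```

Because the reference is itself a model-generated summary, these numbers measure agreement with GPT-4 rather than agreement with a human-written gold summary, and two documents are too few to generalize from.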