![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **ExtractiveSummarization**

This notebook will cover the different parameters and usages of `ExtractiveSummarization`.

**📖 Learning Objectives:**

1. Background: Understand the `ExtractiveSummarization` annotator.

2. Colab setup.

3. Become comfortable with using the different parameters of the annotator.

**🔗 Helpful Links:**

- Python Docs : [ExtractiveSummarization](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/embeddings/extractive_summarization/index.html#sparknlp_jsl.annotator.embeddings.extractive_summarization.ExtractiveSummarization)

- Scala Docs: [ExtractiveSummarization](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/embeddings/ExtractiveSummarization.html)

- For extended examples of usage, see [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb).


## **📜 Background**

**Extractive summarization** is a technique used in Natural Language Processing (NLP) that aims to generate a concise summary by extracting the most important information from a given text. Unlike ***abstractive summarization***, which involves generating new sentences to capture the essence of the content, ***extractive summarization*** directly selects and concatenates existing sentences or phrases from the original text.

The process typically includes preprocessing the text, scoring sentences against one or more importance criteria, ranking them by score, and selecting the top-ranked sentences for the final summary. Because every sentence in the summary is copied verbatim from the source, extractive summarization is favored for its objectivity and for preserving the factual accuracy of the original text.
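
The steps above (preprocess, score, rank, select) can be sketched in a few lines of plain Python. This is only a toy illustration that scores sentences by average word frequency; it is not how the `ExtractiveSummarization` annotator works internally, and the helper names `split_sentences` and `summarize` are made up for this example.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, summary_size=1):
    """Return the `summary_size` highest-scoring sentences in original order."""
    sentences = split_sentences(text)

    # Preprocess: lowercase word tokens for every sentence.
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    frequencies = Counter(word for tokens in tokenized for word in tokens)

    # Score: average text-wide frequency of the words in each sentence.
    scores = [
        sum(frequencies[w] for w in tokens) / len(tokens) if tokens else 0.0
        for tokens in tokenized
    ]

    # Rank by score (ties keep the earlier sentence), then select the top ones.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    selected = sorted(ranked[:summary_size])
    return [sentences[i] for i in selected]


toy_text = (
    "The patient has hypertension. "
    "The patient takes medication for hypertension daily. "
    "She enjoys gardening."
)
print(summarize(toy_text, summary_size=1))
print(summarize(toy_text, summary_size=2))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary.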

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8744_2.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.2.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.2.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.2.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.2.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8744_2.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.2.0-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.2.0 installed! ✅ Heal the planet with NLP! 


In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

# Automatically load the license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8744_2.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.2.0, 💊Spark-Healthcare==5.2.0, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**



- Input:  ``DOCUMENT``, ``SENTENCE_EMBEDDINGS``  
- Output:  ``CHUNK``  

## **🔎 Parameters**

`summarySize`: Number of sentences to summarize the text. Default is 1.

`returnSingleDocument`: Compile the selected sentences into a single document.

`similarityThreshold`: Minimal cosine similarity between sentences to consider them similar. Default is 0, which means no threshold is used.



# ✍  Explaining `ExtractiveSummarization` with an Example

## **💻 Pipeline**

In [None]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("documents")

sentence_detector = nlp.SentenceDetectorDLModel() \
    .pretrained()\
    .setInputCols("documents") \
    .setOutputCol("sentences")

sentence_embeddings = nlp.BertSentenceEmbeddings()\
    .pretrained("sent_small_bert_L2_128")\
    .setInputCols(["sentences"])\
    .setOutputCol("sentence_embeddings")

summarizer = medical.ExtractiveSummarization()\
    .setInputCols(["sentences", "sentence_embeddings"])\
    .setOutputCol("summaries")

pipeline = nlp.Pipeline(
    stages=[
        documenter,
        sentence_detector,
        sentence_embeddings,
        summarizer
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [None]:
text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43.
She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image.
She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year.
She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss.
She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.
PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.
PAST SURGICAL HISTORY: Pertinent for cholecystectomy.
PSYCHOLOGICAL HISTORY: Negative.
SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.
FAMILY HISTORY: Pertinent for obesity and hypertension.
MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.
ALLERGIES: She has no known drug allergies.
REVIEW OF SYSTEMS: Negative.
PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval.
"""

In [None]:
summarizer._params

[Param(parent='ExtractiveSummarization_1c8d96d0cef1', name='inputCols', doc='previous annotations columns, if renamed'),
 Param(parent='ExtractiveSummarization_1c8d96d0cef1', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'),
 Param(parent='ExtractiveSummarization_1c8d96d0cef1', name='outputCol', doc='output annotation column. can be left default.'),
 Param(parent='ExtractiveSummarization_1c8d96d0cef1', name='returnSingleDocument', doc='Compile the selected sentences into a single document.'),
 Param(parent='ExtractiveSummarization_1c8d96d0cef1', name='similarityThreshold', doc='Minimal cosine similarity between sentences to consider them similar. Default is 0 whichmeans no threhsold is used (i.e. a continuous vversion of LexRank is applied)'),
 Param(parent='ExtractiveSummarization_1c8d96d0cef1', name='summarySize', doc='Number of sentences to summarize the text. Default is 1.')]

## ▶ `summarySize`

Number of sentences to summarize the text. Default is 1.

In [None]:
print("■ SENTENCE ■: ", text, "\n")

for summarySize in [1, 2, 5]:
    # Rebuild the summarizer with a different number of sentences to keep
    summarizer = medical.ExtractiveSummarization()\
        .setInputCols(["sentences", "sentence_embeddings"])\
        .setOutputCol("summaries")\
        .setSummarySize(summarySize)

    pipeline = nlp.Pipeline(stages=[documenter, sentence_detector, sentence_embeddings, summarizer])
    model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
    light_model = LightPipeline(model)
    light_result = light_model.annotate(text)

    # Each entry in 'summaries' is one selected chunk
    for summary in light_result["summaries"]:
        print("■" * 120)
        print("■ SUMMARY ■: ", f".summarySize = {summarySize}", summary, "\n\n")


■ SENTENCE ■:  Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43.
She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image.
She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year.
She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weigh

## ▶ `returnSingleDocument`

Compile the selected sentences into a single document.

In [None]:
print("■ SENTENCE ■: ", text, "\n")

for returnSingleDocument in [True, False]:
    print("■" * 120)
    print("■ returnSingleDocument = ", returnSingleDocument)

    # Same five-sentence summary, returned either as one chunk or as separate chunks
    summarizer = medical.ExtractiveSummarization()\
        .setInputCols(["sentences", "sentence_embeddings"])\
        .setOutputCol("summaries")\
        .setSummarySize(5)\
        .setReturnSingleDocument(returnSingleDocument)

    pipeline = nlp.Pipeline(stages=[documenter, sentence_detector, sentence_embeddings, summarizer])
    model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
    light_model = LightPipeline(model)
    light_result = light_model.annotate(text)

    # Number the returned chunks starting from 1
    for i, summary in enumerate(light_result["summaries"], start=1):
        print("-" * 120)
        print(i, " ■ SUMMARY ■: ", summary, "\n\n")



`.setReturnSingleDocument(True)` --> the five selected sentences are returned as **`one`** document

`.setReturnSingleDocument(False)` --> the five selected sentences are returned as **`five`** documents
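
The difference between the two output shapes can be mimicked with a small plain-Python helper. This is a sketch only: `compile_summary` is a hypothetical function written for this illustration, and joining with a single space is an assumption, not necessarily how the annotator concatenates sentences.

```python
def compile_summary(selected_sentences, return_single_document):
    """Hypothetical helper mimicking the two output shapes described above."""
    if return_single_document:
        # One entry holding all selected sentences joined together
        # (the single-space separator is an assumption).
        return [" ".join(selected_sentences)]
    # One entry per selected sentence.
    return list(selected_sentences)


selected = ["She is single.", "She does not smoke."]

print(len(compile_summary(selected, True)))   # number of entries with a single document
print(len(compile_summary(selected, False)))  # number of entries with one per sentence
```

In both cases the same sentences are selected; only the packaging of the result changes.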


## ▶ `similarityThreshold`

Minimal cosine similarity between sentences to consider them similar.

Default is 0, which means no threshold is used (i.e. a continuous version of LexRank is applied).
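
To make the role of the threshold more concrete, here is a minimal pure-Python sketch of LexRank-style scoring. It rests on several assumptions that are not taken from the library: the function names, the damping factor of 0.85, the fixed number of power iterations, the `>=` comparison against the threshold, and the toy 2-D "embeddings" are all illustrative choices. With a threshold of 0 the cosine similarities are used directly as edge weights (continuous version); with a positive threshold an unweighted edge is kept only between sufficiently similar sentences.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lexrank_scores(vectors, similarity_threshold=0.0, damping=0.85, iterations=50):
    """Score sentences by centrality in a cosine-similarity graph."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    if similarity_threshold > 0:
        # Thresholded variant: unweighted edge only between similar sentences.
        weights = [[1.0 if s >= similarity_threshold else 0.0 for s in row] for row in sim]
    else:
        # Continuous variant: non-negative similarities act as edge weights.
        weights = [[max(s, 0.0) for s in row] for row in sim]

    # Row-normalise into a transition matrix (uniform row if a row is all zeros).
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total for w in row] if total else [1.0 / n] * n)

    # Power iteration with damping, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy unit vectors at 0, 20, 40 and 90 degrees; the last one is the outlier.
angles = [0, 20, 40, 90]
toy_vectors = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

continuous = lexrank_scores(toy_vectors, similarity_threshold=0.0)
thresholded = lexrank_scores(toy_vectors, similarity_threshold=0.9)
print([round(s, 3) for s in continuous])
print([round(s, 3) for s in thresholded])
```

In this toy setup the outlier vector receives the lowest score in the continuous variant, while with a threshold of 0.9 the middle vector at 20 degrees, which is the only one linked to two neighbours, scores highest.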

In [None]:
print("■ SENTENCE ■: ", text, "\n")

for similarityThreshold in [0, 0.5, 1]:
    print("■" * 120)
    print("■ similarityThreshold = ", similarityThreshold)

    # Keep the summary size fixed and vary only the similarity threshold
    summarizer = medical.ExtractiveSummarization()\
        .setInputCols(["sentences", "sentence_embeddings"])\
        .setOutputCol("summaries")\
        .setSummarySize(5)\
        .setReturnSingleDocument(True)\
        .setSimilarityThreshold(similarityThreshold)

    pipeline = nlp.Pipeline(stages=[documenter, sentence_detector, sentence_embeddings, summarizer])
    model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
    light_model = LightPipeline(model)
    light_result = light_model.annotate(text)

    for summary in light_result["summaries"]:
        print("-" * 120)
        print("■ SUMMARY ■: ", f".similarityThreshold = {similarityThreshold}", summary, "\n\n")

