

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_SUMMARIZATION.ipynb)



# **Summarize text**

## 1. Colab Setup

In [None]:
# Install java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!java -version

# Install pyspark
!pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
!pip install --ignore-installed spark-nlp

# Update environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
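
Re-running the setup cell prepends the Java `bin` directory to `PATH` each time. A minimal sketch of an idempotent alternative, assuming you want the directory listed only once; the helper `prepend_to_path` and the `demo_env` dictionary are illustrative and not part of the original notebook:

```python
import os


def prepend_to_path(env: dict, directory: str) -> None:
    """Put `directory` at the front of env['PATH'] unless it is already listed."""
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in entries:
        entries.insert(0, directory)
    env["PATH"] = os.pathsep.join(entries)


# Demonstrated on a plain dict so the real environment is untouched
demo_env = {"PATH": os.pathsep.join(["/usr/local/bin", "/usr/bin"])}
java_bin = "/usr/lib/jvm/java-8-openjdk-amd64/bin"
prepend_to_path(demo_env, java_bin)
prepend_to_path(demo_env, java_bin)  # second call leaves PATH unchanged
print(demo_env["PATH"])
```

In the notebook itself the same helper could be called as `prepend_to_path(os.environ, os.environ["JAVA_HOME"] + "/bin")`, since `os.environ` supports the same `get` and item-assignment operations.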

In [1]:
import pandas as pd
import numpy as np
import os
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## 2. Start Spark Session

In [2]:
spark = sparknlp.start()

## 3. Select the model to use

In [3]:
#MODEL_NAME = 't5_small'
MODEL_NAME = 't5_base'

## 4. Examples to try on the model

In [4]:
text_list = ["""
                             The belgian duo took to the dance floor on monday night with some friends. manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven. louis van gaal’s side currently sit two points clear of liverpool in fourth.
                             """,
                             """
                             Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus; the former concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, and areas under or between curves. These two branches are related to each other by the fundamental theorem of calculus, and they make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.[1] Infinitesimal calculus was developed independently in the late 17th century by Isaac Newton and Gottfried Wilhelm Leibniz.[2][3] Today, calculus has widespread uses in science, engineering, and economics.[4] In mathematics education, calculus denotes courses of elementary mathematical analysis, which are mainly devoted to the study of functions and limits. The word calculus (plural calculi) is a Latin word, meaning originally "small pebble" (this meaning is kept in medicine – see Calculus (medicine)). Because such pebbles were used for calculation, the meaning of the word has evolved and today usually means a method of computation. It is therefore used for naming specific methods of calculation and related theories, such as propositional calculus, Ricci calculus, calculus of variations, lambda calculus, and process calculus.
                             """,
                             """
            The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world". The painting's novel qualities include the subject's enigmatic expression, the monumentality of the composition, the subtle modelling of forms, and the atmospheric illusionism.
                            """,
                            """
            John Snow (15 March 1813 – 16 June 1858) was an English physician and a leader in the development of anaesthesia and medical hygiene. He is considered one of the founders of modern epidemiology, in part because of his work in tracing the source of a cholera outbreak in Soho, London, in 1854, which he curtailed by removing the handle of a water pump. Snow's findings inspired the adoption of anaesthesia as well as fundamental changes in the water and waste systems of London, which led to similar changes in other cities, and a significant improvement in general public health around the world.
                             """,
                             """
            Pierre-Simon, marquis de Laplace (23 March 1749 – 5 March 1827) was a French scholar and polymath whose work was important to the development of engineering, mathematics, statistics, physics, astronomy, and philosophy. He summarized and extended the work of his predecessors in his five-volume Mécanique Céleste (Celestial Mechanics) (1799–1825). This work translated the geometric study of classical mechanics to one based on calculus, opening up a broader range of problems. In statistics, the Bayesian interpretation of probability was developed mainly by Laplace.
                             """,
                             """
            The Guadeloupe amazon (Amazona violacea) is a hypothetical extinct species of parrot that is thought to have been endemic to the Lesser Antillean island region of Guadeloupe. Described by 17th- and 18th-century writers, it is thought to have been related to, or possibly the same as, the extant imperial amazon. A tibiotarsus and an ulna bone from the island of Marie-Galante may belong to the Guadeloupe amazon. According to contemporary descriptions, its head, neck and underparts were mainly violet or slate, mixed with green and black; the back was brownish green; and the wings were green, yellow and red. It had iridescent feathers, and was able to raise a "ruff" of feathers around its neck. It fed on fruits and nuts, and the male and female took turns sitting on the nest. French settlers ate the birds and destroyed their habitat. Rare by 1779, the species appears to have become extinct by the end of the 18th century.
                             """,
                             """
            Mount Tai is a mountain of historical and cultural significance located north of the city of Tai'an, in Shandong province, China. The tallest peak is the Jade Emperor Peak, which is commonly reported as being 1,545 meters tall, but is officially described by the PRC government as 1,532.7 meters tall. It is associated with sunrise, birth, and renewal, and is often regarded the foremost of the five. Mount Tai has been a place of worship for at least 3,000 years and served as one of the most important ceremonial centers of China during large portions of this period.
                             """]

## 5. Define the Spark NLP pipeline

The `T5Transformer` model is able to perform 18 different tasks (ref.: [this paper](https://arxiv.org/abs/1910.10683)). Each task is selected by a text prefix; to summarize text, we set the prefix `summarize:` as the model's task.

In [5]:
# Prefix to be used on the T5Transformer().setTask(<<prefix>>)
task_prefix = 'summarize:'
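
T5 frames every task as text-to-text, so the task is chosen by text placed in front of the input. The sketch below illustrates that idea in plain Python. It assumes the prefix is joined to the document with a single space; how `T5Transformer` combines them internally is not shown in this notebook, and the helper `build_t5_input` is hypothetical.

```python
def build_t5_input(task_prefix: str, text: str) -> str:
    """Prepend a task prefix to the text (assumed format: '<prefix> <text>')."""
    return f"{task_prefix.strip()} {text.strip()}"


# Illustrative input; surrounding whitespace is stripped before joining
example = build_t5_input(
    "summarize:",
    "  The Mona Lisa is a half-length portrait painting.  ",
)
print(example)
```

With Spark NLP you do not build this string yourself: passing the prefix to `setTask` is enough, as the next cell shows.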

In [6]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

t5 = T5Transformer.pretrained(MODEL_NAME) \
    .setTask(task_prefix) \
    .setMaxOutputLength(200) \
    .setInputCols(["documents"]) \
    .setOutputCol("T5")

pipeline = Pipeline(stages=[document_assembler, t5])

t5_base download started this may take some time.
Approximate size to download 446 MB
[OK!]


## 6. Run the pipeline

In [7]:
# Fit on empty data frame (model is pretrained)
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = pipeline.fit(empty_df)

# Create Light Pipeline
lmodel = LightPipeline(pipeline_model)

# Send example texts to spark data frame
text_df = spark.createDataFrame(pd.DataFrame({'text': text_list}))

# Predict with the Pipeline model
result = pipeline_model.transform(text_df)

# Predict with the Light Pipeline model
res = lmodel.fullAnnotate(text_list)
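
`fullAnnotate` returns one dictionary per input text, keyed by output column name, where each value is a list of annotation objects exposing a `result` attribute (this is the structure the printing loop in the next section reads). A minimal sketch of collecting the summaries into a plain list; the helper `collect_summaries` is hypothetical, and `SimpleNamespace` stand-ins replace real annotations so the example runs without Spark:

```python
from types import SimpleNamespace


def collect_summaries(annotated, column="T5"):
    """Return the first annotation's text for `column` from each result dict."""
    summaries = []
    for row in annotated:
        annotations = row.get(column, [])
        # Guard against an empty annotation list for a given input
        summaries.append(annotations[0].result if annotations else None)
    return summaries


# Stand-in objects mimicking the structure accessed as r['T5'][0].result
fake_res = [
    {"documents": [SimpleNamespace(result="some input text")],
     "T5": [SimpleNamespace(result="a short summary")]},
    {"documents": [SimpleNamespace(result="another input")], "T5": []},
]

print(collect_summaries(fake_res))
```

With the objects from this notebook, the call would be `collect_summaries(res)`.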

## 7. Visualize the results

Using Light Pipeline:

In [12]:
for r in res:
    print(f"Text: {r['documents'][0].result.strip()}\n\nSummary: {r['T5'][0].result}\n")
    print("-----------------------------------------------------------------------\n")

Text: The belgian duo took to the dance floor on monday night with some friends. manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven. louis van gaal’s side currently sit two points clear of liverpool in fourth.

Summary: manchester united face newcastle in the premier league on wednesday . louis van gaal's side currently sit two points clear of liverpool .

-----------------------------------------------------------------------

Text: Calculus, originally called infinitesimal calculus or "the calculus of infinitesimals", is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations. It has two major branches, differential calculus and integral calculus; the former concerns instantaneous rates of change, and the slopes of curves, while integral calculus concerns accumulation of quantities, a

Using the pipeline model:

In [15]:
result.select('text', 'T5.result').show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                                text|                                                                                              result|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|
                             The belgian duo took to the dance floor on monday night with some f...|[manchester united face newcastle in the premier league on wednesday . louis van gaal's side curr...|
|
                             Calculus, originally called infinitesimal calculus or "the calculus...|[the term calculus is a Latin word, meaning originally "small pebble" it is used f