![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/18.0.Summarization.ipynb)

# 🎬 Colab Setup

In [None]:
!pip install -q pyspark==3.4.1 spark-nlp==5.3.2

In [2]:
import sparknlp

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd

spark = sparknlp.start()

# To enable GPU mode and High RAM, comment out the line above and uncomment the next one
# spark = sparknlp.start(gpu=True)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 5.3.2
Apache Spark version: 3.4.1


# Download BART Model and Create Spark NLP Pipeline

In [3]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Can take in document or sentence columns
bart = BartTransformer.pretrained(name="distilbart_xsum_12_6",lang='en') \
    .setInputCols('document')\
    .setOutputCol("Bart")\
    .setMaxOutputLength(100)

# Build pipeline with BART
pipeline = Pipeline().setStages([
    documentAssembler,
    bart
])

distilbart_xsum_12_6 download started this may take some time.
Approximate size to download 699.7 MB
[OK!]


# Summarize documents

In [4]:
# Set the task to summarization for BART
bart.setTask('summarize')

BartTRANSFORMER_41525e20b6b3

In [5]:
# https://www.reuters.com/article/instant-article/idCAKBN2AA2WF

text = """(Reuters) - Mastercard Inc said on Wednesday it was planning to offer support for some cryptocurrencies on its network this year, joining a string of big-ticket firms that have pledged similar support.

The credit-card giant’s announcement comes days after Elon Musk’s Tesla Inc revealed it had purchased $1.5 billion of bitcoin and would soon accept it as a form of payment.

Asset manager BlackRock Inc and payments companies Square and PayPal have also recently backed cryptocurrencies.

Mastercard already offers customers cards that allow people to transact using their cryptocurrencies, although without going through its network.

"Doing this work will create a lot more possibilities for shoppers and merchants, allowing them to transact in an entirely new form of payment. This change may open merchants up to new customers who are already flocking to digital assets," Mastercard said. (mstr.cd/3tLaPZM)

Mastercard specified that not all cryptocurrencies will be supported on its network, adding that many of the hundreds of digital assets in circulation still need to tighten their compliance measures.

Many cryptocurrencies have struggled to win the trust of mainstream investors and the general public due to their speculative nature and potential for money laundering.
"""

df=spark.createDataFrame([[text]]).toDF('text')

In [6]:
#Predict on text data with BART
annotated_df = pipeline.fit(df).transform(df)
annotated_df.select(['bart.result']).show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[The world’s largest credit - Mastercard giant Mastercard has announced it will begin to support for some of the digital currency.]|
+-----------------------------------------------------------------------------------------------------------------------------------+



In [7]:
v = annotated_df.take(1)
print(f"Original Length {len(v[0].text)}   Summarized Length : {len(v[0].Bart[0].result)} ")


Original Length 1284   Summarized Length : 129 


In [8]:
# Full summarized text
v[0].Bart[0].result

'The world’s largest credit - Mastercard giant Mastercard has announced it will begin to support for some of the digital currency.'

# Explore BART Parameters and Play with Params


### Sampling Methods


Sampling means we **randomly** draw the next token from a probability distribution over words.
That distribution is conditioned on all previous tokens in the text.

By default the distribution covers every word in BART's vocabulary, and many of those candidates are poor choices for the next token.

There are two methods of reshaping and drawing from that distribution:

1. **Top-K Sampling**: Take the k most likely words from the original distribution. Redistribute the probability mass among those k words and draw according to the new probabilities.

2. **Top-P (Nucleus) Sampling**: Take the smallest possible set of N words whose combined probability is at least p. Redistribute the probability mass among those N words and draw according to the new probabilities.
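
To make the two methods above concrete, here is a minimal pure-Python sketch of both filters. It is an illustration only, not Spark NLP's internal implementation: the function names and the toy next-token distribution are invented for this example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng):
    """Draw one token according to the (renormalized) probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution, invented for illustration
next_token_probs = {"bitcoin": 0.5, "payments": 0.3, "network": 0.15, "banana": 0.05}

top_k = top_k_filter(next_token_probs, k=2)    # keeps the 2 most likely tokens
top_p = top_p_filter(next_token_probs, p=0.9)  # keeps tokens until 90% of the mass is covered

rng = random.Random(0)  # seeded so the draw is repeatable
print(top_k)
print(top_p)
print(sample(top_p, rng))
```

In both cases the unlikely tail (here `banana`) is removed before drawing, which is why sampled summaries stay closer to plausible text than sampling from the full vocabulary would.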



Additionally, both methods can be tweaked with the following parameter:

- **temperature**: Parameter of the softmax function that reshapes the distribution computed by the model. The closer the value is to 0, the more deterministic sampling becomes: the tails of the distribution get slimmer and the probabilities of outlier words move closer to 0. Values closer to 1 make the tails fatter, which makes outliers more probable and generic results less probable.
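
The effect of temperature can be sketched with a plain softmax. This is a minimal illustration, assuming the usual formulation in which the logits are divided by the temperature before the softmax; the function name and the logits are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens
logits = [2.0, 1.0, 0.1]

sharp = softmax_with_temperature(logits, 0.1)  # close to 0: nearly deterministic
flat = softmax_with_temperature(logits, 1.0)   # 1.0: the model's unscaled distribution

print([round(p, 4) for p in sharp])
print([round(p, 4) for p in flat])
```

With the low temperature almost all of the probability mass lands on the top token, while at 1.0 the less likely tokens keep a noticeable share.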


These parameters are shared by all methods:
- **beamSize**: Number of beams in the beam search
- **ignoreTokenIds**: A list of token ids which are ignored in the decoder's output (default: [])
- **noRepeatNgramSize**: If set to int > 0, all ngrams of that size can only occur once
- **repetitionPenalty**: The parameter for repetition penalty. 1.0 means no penalty. See https://arxiv.org/pdf/1909.05858.pdf
- **task**: Transformer's task, e.g. 'is it true that' (default: empty, current: generate)
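
As a rough illustration of what `noRepeatNgramSize` and `repetitionPenalty` do during decoding, here is a small pure-Python sketch. Both helper functions are hypothetical and are not Spark NLP code. The penalty follows a commonly used variant (divide positive logits, multiply negative ones), which is an assumption here and may differ from what Spark NLP does internally.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size:
        return set()
    # The last (n-1) tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make already generated token ids less likely; penalty 1.0 changes nothing."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

# Hypothetical partial output: "the card" has already been produced once
tokens = ["the", "card", "giant", "the"]
print(banned_next_tokens(tokens, 2))  # with bigram blocking, "card" may not follow "the" again

# Hypothetical logits indexed by token id; ids 0 and 1 were already generated
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 2.0))
```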

### Play with temperature
Set the temperature higher to make BART's output more random/creative and the text less coherent.
The temperature must satisfy `0 < temperature <= 1`.
You must set `bart.setDoSample(True)` to get non-deterministic results.

In [18]:
bart.setTemperature(0.5)
bart.setDoSample(True)

BartTRANSFORMER_41525e20b6b3

In [19]:
#Predict on text data with BART
annotated_df = pipeline.fit(df).transform(df)
annotated_df.select(['bart.result']).show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                               

### Play with beamSize

When `beamSize = 1`, decoding uses greedy search; when `beamSize > 1`, the output is decoded with beam search.
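
The difference between greedy and beam search can be seen on a toy example. The sketch below is not how Spark NLP implements decoding: the transition table stands in for a model and is entirely made up, and `beam_search` is a hypothetical helper.

```python
import math

# Hypothetical toy "model": next-token probabilities conditioned on the last token
TRANSITIONS = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}

def beam_search(beam_size, steps=2):
    """Return the best (tokens, log-probability) pair found with the given beam size."""
    beams = [(["<s>"], 0.0)]  # each beam is (tokens so far, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in TRANSITIONS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the `beam_size` highest-scoring partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

greedy_tokens, greedy_score = beam_search(beam_size=1)  # greedy: commits to "A" at step 1
beam_tokens, beam_score = beam_search(beam_size=2)      # also keeps "B" alive

print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

Greedy search picks the locally best first token and cannot recover, while a beam of 2 keeps the second candidate and ends up with a more probable full sequence. The price is more computation per step, which is consistent with larger beam sizes taking longer to run.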


In [12]:
bart.setBeamSize(1)

BartTRANSFORMER_41525e20b6b3

In [14]:
#Predict on text data with BART
annotated_df = pipeline.fit(df).transform(df)
annotated_df.select(['bart.result']).show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------+
|[Early support for many cryptocurrencies has come in for a growing number of major firms to become more popular with the general public.]|
+-----------------------------------------------------------------------------------------------------------------------------------------+



In [15]:
# set beam size to 4
bart.setBeamSize(4)

BartTRANSFORMER_41525e20b6b3

In [17]:
%%time
#Predict on text data with BART
annotated_df = pipeline.fit(df).transform(df)
annotated_df.select(['bart.result']).show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[inous cryptocurrencies, the world-famous investors are betting on them to win the trust of mainstream investors, are set to be offered an announcement that is a new form of payment.]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

CPU times: user 358 ms, sys: 50.6 ms, total: 409 ms
Wall time: 58.3 s