![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


#  **GPT2Transformer**

This notebook will cover the different parameters and usages of `GPT2Transformer`. This annotator displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation.

The parameters described below let you customize its behavior when the defaults do not fit your needs.

**📖 Learning Objectives:**

1. Understand how to use `GPT2Transformer`.

2. Become comfortable using the different parameters of the `GPT2Transformer`.


**🔗 Helpful Links:**

- Python Docs : [GPT2Transformer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/seq2seq/gpt2_transformer/index.html)

- Scala Docs : [GPT2Transformer](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/seq2seq/GPT2Transformer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).

- [OpenAI Github page](https://github.com/openai/gpt-2)

- [Academic paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) 

## **📜 Background**

GPT-2 is a large transformer-based language model with 1.5 billion parameters, released by OpenAI researchers in 2019 and trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data, and a predecessor of GPT-3, ChatGPT and GPT-4.

On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

## **🎬 Colab Setup**

In [2]:
# Install PySpark and Spark NLP
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4



In [3]:
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import GPT2Transformer


spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

## **🔎 Parameters**

- `configProtoBytes()`: Sets configProto from tensorflow, serialized into byte array.

- `doSample()`: Sets whether or not to use sampling, use greedy decoding otherwise.

- `ignoreTokenIds()`: A list of token ids which are ignored in the decoder’s output.

- `maxOutputLength()`: Sets maximum length of output text.

- `minOutputLength()`: Sets minimum length of the sequence to be generated.

- `noRepeatNgramSize()`: Sets size of n-grams that can only occur once. If set to int > 0, all ngrams of that size can only occur once.

- `repetitionPenalty()`: Sets the parameter for repetition penalty. 1.0 means no penalty. See https://arxiv.org/pdf/1909.05858.pdf

- `task()`: Sets the transformer’s task, e.g. summarize

- `temperature()`: Sets the value used to modulate the next-token probabilities. It is a parameter of the softmax function that affects the distribution computed by the model. The closer the value is to 0, the more deterministic the output becomes: the tails of the distribution get thinner and the probabilities of outlier words approach 0. Values closer to 1 make the tails fatter, which makes outliers more probable and generic results less probable.

### `doSample()`, `topK` and `topP`

Sampling means we randomly draw from a probability distribution over the words in the GPT-2 vocabulary (the words present during the training of the model). In GPT-2, the next word can be chosen with two different decoding methods: `greedy search` or `sampling`.

If `doSample` is set to `False`, the method is `greedy search`: at each step, the generated word is the one with the highest conditional probability (given the previous words).

If `doSample` is set to `True`, the next word is drawn at random from the model's probability distribution instead of always taking the most likely one. In this case, we can add filtering criteria using:

- `topK`: Keeps only the `k` most likely next words as candidates for sampling.

- `topP`: Also known as [Nucleus Sampling](https://arxiv.org/abs/1904.09751). Keeps the most likely words, in decreasing order of probability, until their cumulative probability reaches `p`.

When using either (or both) of the above filtering criteria, the sampling set is reduced and the remaining probabilities are rescaled to sum to one. The next word is then randomly selected from this set according to each word's probability.
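The filtering and rescaling step can be illustrated with a minimal, library-independent sketch. The function `filter_top_k_top_p` and the toy probabilities are hypothetical and are not part of Spark NLP; the sketch also assumes that top-k filtering is applied before top-p, which is a common convention rather than a confirmed detail of `GPT2Transformer`.

```python
import random


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Hypothetical helper: keep the top_k most probable tokens, then the
    smallest prefix of them whose cumulative probability reaches top_p,
    and rescale the survivors so they sum to one."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:  # top_k == 0 disables top-k filtering
        ranked = ranked[:top_k]
    total_after_k = sum(p for _, p in ranked)

    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total_after_k
        if cumulative >= top_p:  # nucleus reached
            break

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-word distribution (made-up numbers, not real GPT-2 output).
toy_probs = {"I": 0.5, "The": 0.2, "He": 0.15, "My": 0.1, "Zebra": 0.05}

top_k_only = filter_top_k_top_p(toy_probs, top_k=2)
top_p_only = filter_top_k_top_p(toy_probs, top_p=0.8)

# Draw the next word from the reduced, rescaled distribution.
rng = random.Random(0)
tokens, weights = zip(*top_p_only.items())
next_token = rng.choices(tokens, weights=weights, k=1)[0]

print(top_k_only)
print(top_p_only)
print(next_token)
```

With these toy numbers, `top_k=2` keeps only the two most likely words, while `top_p=0.8` keeps three words because the first two cover only 70% of the probability mass.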

For additional details, you can review the following reference:

- [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)

Let's create an example to experiment with:

In [4]:
example = 'My name is Leonardo.'

spark_df = spark.createDataFrame([[example]]).toDF("text")
document = DocumentAssembler().setInputCol("text").setOutputCol("document").transform(spark_df)
document.show()

+--------------------+--------------------+
|                text|            document|
+--------------------+--------------------+
|My name is Leonardo.|[{document, 0, 19...|
+--------------------+--------------------+



First, we use the default value of the parameter by setting `doSample` to `False`:

In [5]:
gpt2_model = (
    GPT2Transformer.pretrained("gpt2")
    .setInputCols("document")
    .setDoSample(False)
    .setOutputCol("generation")
)

gpt2 download started this may take some time.
Approximate size to download 442.7 MB
[OK!]


In [6]:
gpt2_model.transform(document).select("generation.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



Now, let's try sampling from the top 10 words only:

In [7]:
gpt2_model.setDoSample(True).setTopK(10).transform(document).select("generation.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I love to read. I'm a writer. I write. I am in charge of my art. I have been in charge all my life of writing for the best part of a decade, and in my 30s I'm]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



Next, we sample from the smallest set of words whose cumulative probability reaches `80%`. We set `topK` to zero so that only the `topP` criterion applies:

In [8]:
gpt2_model.setDoSample(True).setTopK(0).setTopP(0.8).transform(document).select("generation.result").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. The God of the Light is named after my father, President Leonardo, who was very good to me. But when I go to Tokyo to learn about Jesus, I come to the Archbishop of Tokyo, who, you know, he]|
+---------------------------------------------------------------------------------------------------------------------------------------

Finally, we can combine `topK` and `topP` together:

In [9]:
gpt2_model.setDoSample(True).setTopK(50).setTopP(0.8).transform(document).select("generation.result").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo.

You see, our group looks like.
 'Well, well. What else are we doing here, our friends?'

That's why they've all been welcomed here. I can't even pretend that they're]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



The outputs differ because each one was sampled in a different way, from a different sampling space.

### `.setMaxOutputLength()`


Let's limit the output to 20 tokens (roughly, words and punctuation marks):

In [10]:
gpt2_model.setMaxOutputLength(20).setDoSample(False).transform(document).select("generation.result").show(truncate=False)

+--------------------------------------------------------------------------------+
|result                                                                          |
+--------------------------------------------------------------------------------+
|[ My name is Leonardo. I am a man of letters. I have been a man for many years.]|
+--------------------------------------------------------------------------------+



### `.setMinOutputLength()`


Now, let's require the output to have at least 21 tokens and at most 30:

In [11]:
gpt2_model.setMinOutputLength(21).setMaxOutputLength(30).setDoSample(False).transform(document).select("generation.result").show(truncate=False)


+---------------------------------------------------------------------------------------------------------------+
|result                                                                                                         |
+---------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I]|
+---------------------------------------------------------------------------------------------------------------+



### `.setNoRepeatNgramSize()`





Longer outputs tend to have repeated words:

In [12]:
example2 = "I love Spark NLP and I love NLP"
document2 = DocumentAssembler().setInputCol("text").setOutputCol("document").transform(spark.createDataFrame([[example2]], ["text"]))

In [13]:
gpt2_model.setDoSample(False).setMaxOutputLength(200).transform(document2).select("generation.result").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ I love Spark NLP and I love NLP. I love the way they're designed. I like the way the NLP is designed.

I love the fact that they're not just a bunch of different things.

In [14]:
gpt2_model.setDoSample(False).setMaxOutputLength(200).setNoRepeatNgramSize(2).transform(document2).select("generation.result").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                               
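The idea behind `noRepeatNgramSize` can be sketched in plain Python: before choosing the next word, find every word that would complete an n-gram already present in the text and exclude it. The helper `banned_next_tokens` below is hypothetical and works on whitespace-separated words for readability; the real annotator operates on model token ids, and its exact implementation is not shown in this notebook.

```python
def banned_next_tokens(generated, ngram_size):
    """Hypothetical helper: return the tokens that would repeat an n-gram
    of length `ngram_size` that already occurs in `generated`."""
    # No restriction when disabled or when no full n-gram exists yet.
    if ngram_size <= 0 or len(generated) < ngram_size:
        return set()

    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - ngram_size + 1:])

    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        # If an earlier n-gram starts with the same prefix, its last token
        # would create a repeated n-gram, so it is banned.
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


words = "I love Spark NLP and I love".split()

banned_bigrams = banned_next_tokens(words, 2)
banned_disabled = banned_next_tokens(words, 0)

print(banned_bigrams)
print(banned_disabled)
```

Here the text ends in "love", and the bigram "love Spark" has already occurred, so "Spark" is the only word excluded at the next step when the size is 2. A size of 0 excludes nothing.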

### `.setTemperature()`

The temperature is a parameter that rescales the softmax values of each candidate word. With `1.0`, the softmax output remains unchanged. Values closer to zero sharpen the distribution, concentrating the probability on the most likely candidates and making the output more deterministic. Values above `1.0` flatten the distribution, making the probabilities of the candidates more equal and the choice of words more diverse (in some cases also generating incorrect sentences).
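To see the effect of temperature numerically, here is a minimal sketch of a softmax in which the logits are divided by the temperature before normalization. The function name `softmax_with_temperature` and the toy logits are assumptions made for illustration; they are not taken from Spark NLP.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Hypothetical helper: softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for three candidate words (made-up numbers).
toy_logits = [2.0, 1.0, 0.1]

probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_low = softmax_with_temperature(toy_logits, 0.1)
probs_high = softmax_with_temperature(toy_logits, 2.0)

for name, probs in [("T=1.0", probs_default), ("T=0.1", probs_low), ("T=2.0", probs_high)]:
    print(name, [round(p, 4) for p in probs])
```

At a low temperature nearly all of the probability goes to the top candidate, which is why sampling with a very small temperature behaves almost like greedy decoding.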

Let's experiment with different values of temperature in a new example.

In [16]:
gpt2_model = GPT2Transformer.pretrained("gpt2")\
    .setInputCols("document")\
    .setOutputCol("generation")\
    .setDoSample(True)\
    .setMaxOutputLength(20)

text = 'Welcome, this course is'
document3 = DocumentAssembler().setInputCol("text").setOutputCol("document").transform(spark.createDataFrame([[text]], ["text"]))

for temp in [1.0, 0.75, 0.5, 0.25, 0.01, 0.0001]:
    print(f'{25*"-"} Generation with Temperature = {temp} {25*"-"}')
    gpt2_model.setTemperature(temp).transform(document3).select('generation.result').show(truncate=False)

gpt2 download started this may take some time.
Approximate size to download 442.7 MB
[OK!]
------------------------- Generation with Temperature = 1.0 -------------------------
+-------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
|[ Welcome, this course is for the informational, educational and creative learners who want to explore and discover the applications]|
+-------------------------------------------------------------------------------------------------------------------------------------+

------------------------- Generation with Temperature = 0.75 -------------------------
+------------------------------------------------------

### `.setRepetitionPenalty()`

Setting the penalty to 0 makes it harder to generate text; here the output degenerates into repeated commas:

In [34]:
gpt2_model = GPT2Transformer.pretrained("gpt2")\
    .setInputCols("document")\
    .setOutputCol("generation")\
    .setDoSample(False)\
    .setTopK(0)\
    .setTopP(1.0)\
    .setMinOutputLength(10)\
    .setMaxOutputLength(20)\
    .setNoRepeatNgramSize(0)\
    .setTemperature(1.0)\
    .setRepetitionPenalty(0)

gpt2_model.transform(document3).select('generation.result').show(truncate=False)

gpt2 download started this may take some time.
Approximate size to download 442.7 MB
[OK!]
+------------------------------------------+
|result                                    |
+------------------------------------------+
|[ Welcome, this course is,,,,,,,,,,,,,,,,]|
+------------------------------------------+



1.0 means no penalty:

In [35]:
gpt2_model.setRepetitionPenalty(1.0).transform(document3).select('generation.result').show(truncate=False)

+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[ Welcome, this course is for you.

The course is designed to help you learn how to build]|
+------------------------------------------------------------------------------------------+



Let's set it to 10.0:


In [41]:
gpt2_model.setRepetitionPenalty(10.0).transform(document3).select('generation.result').show(truncate=False)

+-----------------------------------------------------------------------------------------+
|result                                                                                   |
+-----------------------------------------------------------------------------------------+
|[ Welcome, this course is for you.
The first thing I want to say about the class of 2015]|
+-----------------------------------------------------------------------------------------+
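A repetition penalty in the style of the paper linked in the parameter list is commonly implemented by lowering the scores of tokens that have already been generated. The sketch below follows one widespread convention (divide positive scores by the penalty, multiply negative ones), and assumes a penalty greater than 0. Whether `GPT2Transformer` applies the penalty in exactly this way is an assumption, and the function name and toy scores are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Hypothetical helper: make already generated tokens less likely.
    Assumes penalty > 0; a penalty of 1.0 leaves the scores unchanged."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Positive scores shrink, negative scores become more negative.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


# Toy scores for a four-token vocabulary (made-up numbers).
toy_scores = [3.0, 1.5, -0.5, 0.2]
already_generated = [0, 2]

no_penalty = apply_repetition_penalty(toy_scores, already_generated, 1.0)
strong_penalty = apply_repetition_penalty(toy_scores, already_generated, 10.0)

print(no_penalty)
print(strong_penalty)
```

With the strong penalty, token 0 is no longer the highest-scoring candidate, so greedy decoding would pick a token that has not been used yet.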

