![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/16.GPT2_Transformer_In_Spark_NLP.ipynb)

# GPT2Transformer: OpenAI Text-To-Text Transformer

GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation.

Pretrained models can be loaded with the `pretrained()` method of `GPT2Transformer`.

## Colab Setup

In [None]:
! pip install -q spark-nlp==3.4.2 pyspark==3.2.0

In [None]:
import sparknlp

spark = sparknlp.start(spark32=True)

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 3.4.2
Apache Spark version: 3.2.0


**GPT2 Models In Spark NLP**

*   `gpt2`
*   `gpt2_medium`
*   `gpt2_distilled`
*   `gpt2_large`

The default model is `"gpt2"` if no name is provided. For available pretrained models, see the [Spark NLP Models Hub](https://nlp.johnsnowlabs.com/models?q=gpt2).



## GPT2 Pipeline 

Now, let's create a Spark NLP pipeline with the `gpt2_medium` model and check the results.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")
    
gpt2 = GPT2Transformer.pretrained("gpt2_medium") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setMinOutputLength(25) \
    .setOutputCol("generation")
    
pipeline = Pipeline().setStages([documentAssembler,
                                 gpt2])

data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("generation.result").show(truncate=False)

gpt2_medium download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I'm a student at the University of California, Berkeley. I've been studying computer science for the past two years. I have a PhD in computer scienc

We can display the documentation of all parameters, along with their default values and any user-supplied values, using the `explainParams()` function.

In [None]:
gpt2.explainParams()

"batchSize: Size of every batch (default: 4)\nconfigProtoBytes: ConfigProto from tensorflow, serialized into byte array. Get with config_proto.SerializeToString() (undefined)\ndoSample: Whether or not to use sampling; use greedy decoding otherwise (default: False, current: False)\nignoreTokenIds: A list of token ids which are ignored in the decoder's output (default: [])\ninputCols: previous annotations columns, if renamed (current: ['documents'])\nlazyAnnotator: Whether this AnnotatorModel acts as lazy in RecursivePipelines (default: False)\nmaxOutputLength: Maximum length of output text (default: 20, current: 50)\nminOutputLength: Minimum length of the sequence to be generated (default: 0, current: 25)\nnoRepeatNgramSize: If set to int > 0, all ngrams of that size can only occur once (default: 0, current: 3)\noutputCol: output annotation column. can be left default. (current: generation)\nrepetitionPenalty: The parameter for repetition penalty. 1.0 means no penalty. See `this paper <

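`explainParams()` returns a single string with the parameters separated by `\n`, so `print(gpt2.explainParams())` shows one parameter per line. If the values are needed programmatically, the sketch below parses that string into a dictionary. `parse_explain_params` is a hypothetical helper, not part of Spark NLP, and it assumes each parameter sits on its own line in the `name: doc (default: ..., current: ...)` layout seen in the output above.

```python
import re

def parse_explain_params(explained):
    """Parse an explainParams() string into {name: {doc, default, current}}.

    Assumption: one parameter per line, formatted as
    'name: doc (default: X, current: Y)', 'name: doc (current: Y)',
    'name: doc (default: X)' or 'name: doc (undefined)'.
    """
    params = {}
    for line in explained.split("\n"):
        if ": " not in line:
            continue  # skip lines that do not look like 'name: doc'
        name, rest = line.split(": ", 1)
        # The value summary is the last parenthesised group on the line.
        match = re.search(r"\s*\(([^()]*)\)$", rest)
        info = match.group(1) if match else ""
        doc = rest[:match.start()] if match else rest
        default = re.search(r"default: (.*?)(?:, current: |$)", info)
        current = re.search(r"current: (.*)$", info)
        params[name] = {
            "doc": doc,
            "default": default.group(1) if default else None,
            "current": current.group(1) if current else None,
        }
    return params

# A few lines copied from the output above, used here as sample input.
sample = (
    "batchSize: Size of every batch (default: 4)\n"
    "doSample: Whether or not to use sampling; use greedy decoding otherwise "
    "(default: False, current: False)\n"
    "maxOutputLength: Maximum length of output text (default: 20, current: 50)\n"
    "inputCols: previous annotations columns, if renamed (current: ['documents'])"
)
parsed = parse_explain_params(sample)
for name, details in parsed.items():
    print(name, details["default"], details["current"])
```

All values stay as strings; converting them to numbers or booleans is left out to keep the sketch short.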
Let's run the model on more sentences and set the `.setDoSample()` parameter to `True`. This parameter controls whether sampling is used; when it is `False` (the default), the model uses greedy decoding. <br/>
We also set the `.setTopK()` parameter, which is the number of highest-probability vocabulary tokens to keep for top-k filtering (50 by default).
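To make the difference between greedy decoding and top-k sampling concrete, here is a minimal pure-Python sketch of a single decoding step. It is not Spark NLP's implementation: the function names (`top_k_filter`, `softmax`, `next_token`) and the toy scores are assumptions made purely for illustration.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep only the k highest-scoring tokens."""
    if k <= 0 or k >= len(logits):
        return dict(logits)
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def next_token(logits, do_sample=False, top_k=50, rng=None):
    """Pick the next token greedily, or by sampling from the top-k tokens."""
    if not do_sample:
        # Greedy decoding: always take the single most likely token.
        return max(logits, key=logits.get)
    probs = softmax(top_k_filter(logits, top_k))
    rng = rng or random.Random()
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy scores for the token that might follow "My name is" (made-up numbers).
toy_logits = {" Leonardo": 2.0, " John": 1.5, " Maria": 1.0, " the": -1.0}

greedy = next_token(toy_logits)
sampled = next_token(toy_logits, do_sample=True, top_k=2, rng=random.Random(0))
print("greedy:", greedy)
print("sampled from top 2:", sampled)
```

Greedy decoding returns the same token for the same input every time, whereas sampling can return any of the `top_k` candidates. A smaller `top_k` narrows the candidates to the most likely tokens.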

In [None]:
sample_texts= [[1, "Mey name is Leonardo"], [2, "My name is Leonardo and I come from Rome."],
               [3, "My name is"], [4, "What is the difference between diesel and petrol?"]]

sample_df= spark.createDataFrame(sample_texts).toDF("id", "text")

In [None]:
gpt2 = GPT2Transformer.pretrained("gpt2_medium") \
        .setInputCols(["documents"]) \
        .setMaxOutputLength(50) \
        .setMinOutputLength(25) \
        .setDoSample(True)\
        .setTopK(20)\
        .setOutputCol("generation")

pipeline = Pipeline().setStages([documentAssembler,
                                 gpt2])

gpt2_medium download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


In [None]:
result = pipeline.fit(sample_df).transform(sample_df)
result.select("id", "generation.result").show(truncate=False)

+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |result                                                                                                                                                                                                                                       |
+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |[ Mey name is Leonardo Dessau. He was the first and only player to be chosen as a captain on a World Cup squad. His nickname in the United States has become "Pete in the back pocket", after his favorite player from the USA]              |
|2  |[ My name is Leonar

### Changing the Transformer's task

Now, let's change the transformer's task. We can ask GPT-2 to check a statement by setting the `.setTask()` parameter to **"Is it true that"**. <br/>
With `setTask("Is it true that")`, the model prepends the expression "Is it true that" to the input text and then generates a continuation.
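The effect of the task on the prompt can be pictured as plain string concatenation. The helper below, `build_prompt`, is hypothetical and not part of Spark NLP. The single-space join is an assumption about how the task and the text are combined, not a confirmed detail of the library.

```python
def build_prompt(task, text):
    """Prefix the input text with the task string.

    Assumption: task and text are joined by a single space, and an
    empty task leaves the text unchanged.
    """
    task = (task or "").strip()
    return f"{task} {text}" if task else text

inputs = ["Donald Trump is rich?", "Pink Floyd is rock band?"]
prompts = [build_prompt("Is it true that", text) for text in inputs]
for prompt in prompts:
    print(prompt)
```

The model then continues from this combined prompt, which is why the generated text begins with the task expression.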

In [None]:
gpt2 = GPT2Transformer.pretrained("gpt2_medium") \
    .setTask("Is it true that") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation")

pipeline = Pipeline().setStages([documentAssembler,
                                 gpt2])

gpt2_medium download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


In [None]:
sample_text= [[1, "Donald Trump is rich?"],
              [2, "Pink Floyd is rock band?"]]

sample_df= spark.createDataFrame(sample_text).toDF("id", "text")

result = pipeline.fit(sample_df).transform(sample_df)
result.select("id", "generation.result").show(truncate=False)

+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |result                                                                                                                                                                                                          |
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |[ Is it true that Donald Trump is rich?\n\nHe is rich, of course, but the truth is that he's not actually rich. He's the wealthiest member of the U.S. Senate.\n\nAnd that's because Trump owns his]            |
|2  |[ Is it true that Pink Floyd is rock band? I'm looking at you – David Gilmour (bassist and lyricist) who did it. I'm gonna take this qu