![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/HuggingFace%20in%20Spark%20NLP%20-%20T5.ipynb)

## Import T5 models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- You can import T5 models via `T5Model`. These models are usually listed under the `Text2Text Generation` category and have `T5` in their labels.
- This is a very computationally expensive module, especially on longer sequences. An accelerator such as a GPU is recommended.
- Reference: [T5Model](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model)
- Some [example models](https://huggingface.co/models?other=T5)

## Export and Save HuggingFace model

- Let's install the `transformers` package and its dependencies.
- We lock `tensorflow` to version `2.8`.
- We lock `transformers` to version `4.35.2`. This doesn't mean it won't work with future releases.
- We will also need `sentencepiece` for tokenization.

In [None]:
!pip install -q --upgrade transformers==4.35.2 sentencepiece tensorflow==2.8
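
Once the install has finished, it can help to confirm that the pinned versions are the ones actually active in the runtime. The following is a minimal sketch using only the standard library; the helper names `matches_pin` and `check_pins` are hypothetical and not part of Spark NLP or `transformers`.

```python
from importlib import metadata

# Pins taken from the install command; None means "any version is fine".
PINS = {"transformers": "4.35.2", "tensorflow": "2.8", "sentencepiece": None}

def matches_pin(installed, wanted):
    """True when `installed` starts with the dot-separated components of `wanted`."""
    if installed is None:
        return False
    if wanted is None:
        return True
    want = wanted.split(".")
    return installed.split(".")[:len(want)] == want

def check_pins(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (installed, matches_pin(installed, wanted))
    return report

if __name__ == "__main__":
    for name, (installed, ok) in check_pins(PINS).items():
        print(f"{name}: installed={installed} ok={ok}")
```

Comparing version components rather than raw string prefixes avoids treating a version such as `2.80.0` as a match for the `2.8` pin.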


- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow-based models. T5 needs custom signatures for Spark NLP, so in this notebook we export the TF `SavedModel` with `tf.saved_model.save` instead.
- We'll use the [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) model from HuggingFace as an example.
- In addition to the model we also need to save the tokenizer. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.

In [None]:
import transformers
# Model name, either HF (e.g. "google/flan-t5-base") or a local path
MODEL_NAME = "google/flan-t5-base"

# Path to store the exported models
EXPORT_PATH = f"exported/{MODEL_NAME}"

Exporting this model involves several steps. We need to

1. separate the encoder and decoder and their cache tensors
2. create a wrapper to create the right model signatures
3. export the preprocessor to the `assets` folder
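
To make the cache handling in step 1 less abstract, here is a minimal pure-Python sketch of the round trip that the export code performs with `make_cache_tensors` and `convert_cache`. It uses nested lists and placeholder strings instead of TensorFlow tensors. The function names `pack_cache` and `unpack_cache` are hypothetical, and reading the first two entries of each layer as self-attention key/value and the last two as cross-attention key/value is an assumption about the decoder's cache layout.

```python
def pack_cache(cache):
    """Split each layer's 4-entry cache into two per-layer stacks.

    `cache` is a sequence with one 4-tuple per decoder layer.
    Entries 0-1 go to the first stack, entries 2-3 to the second.
    """
    cache1 = [list(layer[0:2]) for layer in cache]
    cache2 = [list(layer[2:4]) for layer in cache]
    return cache1, cache2

def unpack_cache(cache1, cache2, num_layers):
    """Rebuild the per-layer 4-tuples from the two stacks."""
    return tuple(
        tuple(cache1[i][j] for j in range(2)) + tuple(cache2[i][j] for j in range(2))
        for i in range(num_layers)
    )

# Two decoder layers with placeholder entries (assumed meaning: self key/value, cross key/value)
cache = (("sk0", "sv0", "ck0", "cv0"), ("sk1", "sv1", "ck1", "cv1"))
cache1, cache2 = pack_cache(cache)
restored = unpack_cache(cache1, cache2, num_layers=2)
```

Packing the cache into two stacked tensors lets a fixed-signature `SavedModel` function pass it in and out, since a nested tuple of variable depth cannot be declared directly in an input signature.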

Don't worry if this next step seems overwhelming. Once you run the next cell everything should be exported to the right place!

In [None]:
import tensorflow as tf
from transformers import TFT5ForConditionalGeneration

def convert_cache(cache_tensor1, cache_tensor2, num_layers):
    # Rebuild the per-layer cache tuples from the two stacked cache tensors
    return tuple([tuple([cache_tensor1[i,j] for j in range(2)] + [cache_tensor2[i,j] for j in range(2)]) for i in range(num_layers)])

def make_cache_tensors(cache):
    # Stack entries 0-1 and entries 2-3 of every layer's cache into two tensors
    return tf.stack([[k for k in l[0:2]] for l in cache]), tf.stack([[k for k in l[2:4]] for l in cache])

class T5ExportModel(TFT5ForConditionalGeneration):
    use_cache = True

    @tf.function(
        input_signature=[
            {
                "encoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="encoder_input_ids"),
                "encoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="encoder_attention_mask")
            }
        ], jit_compile=False
    )
    def encoder_serving(self, inputs):
        return {
            "last_hidden_state": self.encoder(input_ids=inputs["encoder_input_ids"], attention_mask=inputs["encoder_attention_mask"])[0]
        }

    @tf.function(
        input_signature=[

            {
                "decoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_input_ids"),
                "decoder_encoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_encoder_attention_mask"),
                "decoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_attention_mask"),
                "encoder_state": tf.TensorSpec((None, None,  None), tf.float32, name="encoder_state")
            }
        ], jit_compile=False
    )
    def decoder_init_serving(self, inputs):
        decoder_output = self.decoder(
                  input_ids=inputs["decoder_input_ids"],
                  encoder_hidden_states=inputs["encoder_state"],
                  encoder_attention_mask=inputs["decoder_encoder_attention_mask"],
                  attention_mask=inputs["decoder_attention_mask"],
              )
        sequence_output = decoder_output[0]
        cache = decoder_output[1]
        cache_tensor1, cache_tensor2 = make_cache_tensors(cache)

        if self.config.tie_word_embeddings:
            sequence_output = sequence_output * (self.config.d_model ** -0.5)
            logits = self.shared(sequence_output, mode="linear")
        else:
            logits = self.lm_head(sequence_output)

        if self.use_cache:
            return {
                    "output_0": logits,
                    "output_cache1": cache_tensor1,
                    "output_cache2": cache_tensor2
                }
        else:
            return {
                "output_0": logits
            }

    @tf.function(
        input_signature=[

            {
                "decoder_cached_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_cached_input_ids"),
                "decoder_cached_encoder_attention": tf.TensorSpec((None, None), tf.int32, name="decoder_cached_encoder_attention"),
                "decoder_cached_encoder_state": tf.TensorSpec((None, None,  None), tf.float32, name="decoder_cached_encoder_state"),
                "decoder_cached_cache1": tf.TensorSpec((None, 2, None, None, None, 64), tf.float32, name="decoder_cached_cache1"),
                "decoder_cached_cache2": tf.TensorSpec((None, 2, None, None, None, 64), tf.float32, name="decoder_cached_cache2")
            }
        ], jit_compile=False
    )
    def decoder_cached_serving(self, inputs):
        decoder_output = self.decoder(
                  input_ids=inputs["decoder_cached_input_ids"],
                  encoder_hidden_states=inputs["decoder_cached_encoder_state"],
                  encoder_attention_mask=inputs["decoder_cached_encoder_attention"],
                  past_key_values=convert_cache(
                      inputs["decoder_cached_cache1"],
                      inputs["decoder_cached_cache2"],
                      self.config.num_decoder_layers)
              )
        sequence_output = decoder_output[0]
        cache = decoder_output[1]
        cache_tensor1, cache_tensor2 = make_cache_tensors(cache)

        if self.config.tie_word_embeddings:
            sequence_output = sequence_output * (self.config.d_model ** -0.5)
            logits = self.shared(sequence_output, mode="linear")
        else:
            logits = self.lm_head(sequence_output)

        return {
                "decoder_cached_output": logits,
                "decoder_cached_output_cache1": cache_tensor1,
                "decoder_cached_output_cache2": cache_tensor2
            }

    def export(self, path, use_cache):
        self.use_cache = use_cache
        if use_cache:
            signatures = {
                "encoder": self.encoder_serving,
                "decoder_init": self.decoder_init_serving,
                "decoder_cached": self.decoder_cached_serving
            }
        else:
            signatures = {
                "encoder": self.encoder_serving,
                "decoder_init": self.decoder_init_serving,
            }

        # Save to the path passed by the caller instead of the global EXPORT_PATH
        tf.saved_model.save(self, path, signatures=signatures)

# Import either directly from TF or convert from PyTorch
try:
    model = T5ExportModel.from_pretrained(MODEL_NAME)
except Exception:
    # Fall back to converting the PyTorch weights
    model = T5ExportModel.from_pretrained(MODEL_NAME, from_pt=True)

model.export(EXPORT_PATH, use_cache=True)

All PyTorch model weights were used when initializing T5ExportModel.

All the weights of T5ExportModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ExportModel for predictions without further training.
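
One detail worth checking before exporting a different checkpoint is the hard-coded `64` in the cached decoder signature. Following `make_cache_tensors`, the six dimensions appear to be (layers, 2, batch, heads, sequence length, per-head size), so the last value should equal the model's per-head dimension (`d_kv` in the HuggingFace T5 config). The sketch below is a hedged, pure-Python illustration of that check; `cache_spec_shape` and `shape_matches` are hypothetical helpers, and the concrete shape used is only an example.

```python
def cache_spec_shape(d_kv):
    """Shape used for the cache inputs: only the pair axis and the head size are fixed."""
    return (None, 2, None, None, None, d_kv)

def shape_matches(spec, actual):
    """True when `actual` fits `spec`, treating None as a wildcard dimension."""
    return len(spec) == len(actual) and all(
        s is None or s == a for s, a in zip(spec, actual)
    )

# Example: 12 layers, batch 1, 12 heads, 7 cached steps, head size 64
example_cache_shape = (12, 2, 1, 12, 7, 64)
fits = shape_matches(cache_spec_shape(64), example_cache_shape)
```

If a checkpoint reports a different `d_kv`, the `64` in the two cache `TensorSpec` entries would presumably need to change to match.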


In [None]:
from transformers import T5Tokenizer

# Create assets
!mkdir -p {EXPORT_PATH}/assets

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(f"{EXPORT_PATH}/assets/")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


('exported/google/flan-t5-base/assets/tokenizer_config.json',
 'exported/google/flan-t5-base/assets/special_tokens_map.json',
 'exported/google/flan-t5-base/assets/spiece.model',
 'exported/google/flan-t5-base/assets/added_tokens.json')

Let's have a look inside these two directories and see what we are dealing with:

In [None]:
!ls -l {EXPORT_PATH}

total 20836
drwxr-xr-x 2 root root     4096 Dec  9 16:58 assets
-rw-r--r-- 1 root root 21326986 Dec  9 16:56 saved_model.pb
drwxr-xr-x 2 root root     4096 Dec  9 16:56 variables


In [None]:
!ls -l {EXPORT_PATH}/assets

total 808
-rw-r--r-- 1 root root   2593 Dec  9 16:58 added_tokens.json
-rw-r--r-- 1 root root   2543 Dec  9 16:58 special_tokens_map.json
-rw-r--r-- 1 root root 791656 Dec  9 16:58 spiece.model
-rw-r--r-- 1 root root  20789 Dec  9 16:58 tokenizer_config.json
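
The same check can be scripted, which is handy when exporting several models in a row. This is a minimal sketch based only on the listings above: it assumes that `saved_model.pb`, `variables`, and `assets/spiece.model` are the entries worth verifying, and `missing_export_files` is a hypothetical helper name.

```python
from pathlib import Path

# Entries seen in the directory listings of the export folder
EXPECTED = ("saved_model.pb", "variables", "assets/spiece.model")

def missing_export_files(export_path, expected=EXPECTED):
    """Return the expected entries that do not exist under `export_path`."""
    root = Path(export_path)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_export_files(EXPORT_PATH)` should return an empty list when the export and the tokenizer cells both ran successfully.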


## Import and Save T5 in Spark NLP

- Let's install and set up Spark NLP in Google Colab.
- This part is pretty easy via our simple script.

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.3 and Spark NLP 5.2.0
setup Colab for PySpark 3.2.3 and Spark NLP 5.2.0


Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `T5Transformer`, which allows us to load the model.
- Most params will be set automatically. They can also be set later, after loading the model in `T5Transformer`, during runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params. The first is the path to the exported model. The second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [None]:
from sparknlp.annotator import *

T5 = T5Transformer.loadSavedModel(EXPORT_PATH, spark)\
  .setUseCache(True) \
  .setTask("summarize:") \
  .setMaxOutputLength(200)

Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [None]:
T5.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {EXPORT_PATH}

Awesome  😎 !

This is your T5 model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

total 988436
drwxr-xr-x 3 root root       4096 Dec  9 17:06 fields
drwxr-xr-x 2 root root       4096 Dec  9 17:06 metadata
-rw-r--r-- 1 root root     791656 Dec  9 17:08 t5_spp
-rw-r--r-- 1 root root 1011349768 Dec  9 17:08 t5_tensorflow


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny T5 model 😊

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

test_data = spark.createDataFrame([
    ["Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
       "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
       " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
       "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
       "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
       "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
       "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
       "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
       "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
       "learning for NLP, we release our data set, pre-trained models, and code."]
]).toDF("text")


document_assembler = DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

T5 = T5Transformer.load(f"{MODEL_NAME}_spark_nlp") \
  .setInputCols(["document"]) \
  .setOutputCol("summary")

pipeline = Pipeline().setStages([document_assembler, T5])

result = pipeline.fit(test_data).transform(test_data)
result.select("summary.result").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------+
|result                                                                                                     |
+-----------------------------------------------------------------------------------------------------------+
|[We introduce a unified framework that converts text-to-text language problems into a text-to-text format.]|
+-----------------------------------------------------------------------------------------------------------+



That's it! You can now go wild and use hundreds of T5 models from HuggingFace 🤗 in Spark NLP 🚀
