![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/HuggingFace%20in%20Spark%20NLP%20-%20Longformer.ipynb)

## Import Longformer models from HuggingFace 🤗  into Spark NLP 🚀 

Let's keep in mind a few things before we start 😊 

- This feature is only available in `Spark NLP 3.2.x` and later, so please make sure you have upgraded to the latest Spark NLP release.
- You can import Longformer models from HuggingFace, but they have to be compatible with `TensorFlow` and they have to be in the `Fill Mask` category. This means you cannot use Longformer models trained/fine-tuned on a specific task such as token/sequence classification.

## Export and Save HuggingFace model

- Let's install `HuggingFace` and `TensorFlow`. You don't need `TensorFlow` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We pin TensorFlow to version `2.4.1` and Transformers to `4.8.1`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.

In [None]:
!pip install -q transformers==4.8.1 tensorflow==2.4.1
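
Since only specific versions are reported as tested, it can help to confirm what actually got installed. The following is a minimal sketch using only the standard library; the `TESTED` dictionary and the `installed_version` helper are illustrative names, not part of Spark NLP or HuggingFace.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions this notebook reports as tested (taken from the text above).
TESTED = {"transformers": "4.8.1", "tensorflow": "2.4.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name, tested in TESTED.items():
    found = installed_version(name)
    status = "matches" if found == tested else "differs from"
    print(f"{name}: installed={found}, {status} tested version {tested}")
```

A mismatch is not necessarily a problem, it only tells you that you are outside the tested combination.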


- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow-based models. We will use that to save the model as a TF `SavedModel`.
- We'll use the [longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096) model from HuggingFace as an example.
- In addition to `TFLongformerModel` we also need to save the `LongformerTokenizer`. This is the same for every model; these are the assets needed for tokenization inside Spark NLP.

In [None]:
from transformers import LongformerTokenizer, TFLongformerModel

MODEL_NAME = 'allenai/longformer-base-4096'

# let's keep the tokenizer variable, we need it later
tokenizer = LongformerTokenizer.from_pretrained(MODEL_NAME)
# let's save the tokenizer
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

# in case no TF/Keras weights are provided for the model,
# we can use `from_pt` to convert the PyTorch weights to TensorFlow
try:
  print('try downloading TF weights')
  model = TFLongformerModel.from_pretrained(MODEL_NAME)
except Exception:
  print('try downloading PyTorch weights')
  model = TFLongformerModel.from_pretrained(MODEL_NAME, from_pt=True)

model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)

Let's have a look inside these two directories and see what we are dealing with:

In [None]:
!ls -l {MODEL_NAME}

In [None]:
!ls -l {MODEL_NAME}/saved_model/1

total 58556
drwxr-xr-x 2 root root     4096 Aug  8 14:04 assets
-rw-r--r-- 1 root root 59950593 Aug  8 14:04 saved_model.pb
drwxr-xr-x 2 root root     4096 Aug  8 14:04 variables


In [None]:
!ls -l {MODEL_NAME}_tokenizer

total 1336
-rw-r--r-- 1 root root 456318 Aug  8 13:59 merges.txt
-rw-r--r-- 1 root root    772 Aug  8 13:59 special_tokens_map.json
-rw-r--r-- 1 root root   1326 Aug  8 13:59 tokenizer_config.json
-rw-r--r-- 1 root root 898822 Aug  8 13:59 vocab.json


- As you can see, we need the SavedModel from the `saved_model/1/` path
- We will also need the `vocab.json` and `merges.txt` files from the tokenizer
- All we need to do is convert `vocab.json` to `vocab.txt` and then copy both `vocab.txt` and `merges.txt` into `saved_model/1/assets`, which is where Spark NLP will look for them

In [None]:
# let's save the vocab as a txt file: one token per line, ordered by token id
vocab = tokenizer.get_vocab()
with open('{}_tokenizer/vocab.txt'.format(MODEL_NAME), 'w', encoding='utf-8') as f:
    for item in sorted(vocab, key=vocab.get):
        f.write("%s\n" % item)

# let's copy both vocab.txt and merges.txt files to saved_model/1/assets
!cp {MODEL_NAME}_tokenizer/vocab.txt {MODEL_NAME}/saved_model/1/assets
!cp {MODEL_NAME}_tokenizer/merges.txt {MODEL_NAME}/saved_model/1/assets
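
Before moving on, it may be worth checking that the export folder has the layout described above. The sketch below uses only the standard library; `EXPECTED`, `missing_export_files`, and `count_vocab_lines` are hypothetical helper names introduced for this check, and the list of expected files is simply what this notebook produces, not an official Spark NLP specification.

```python
from pathlib import Path

# Same value as MODEL_NAME earlier in this notebook.
MODEL_NAME = "allenai/longformer-base-4096"

# Files and folders this notebook expects under saved_model/1 (an assumption
# based on the directory listings above).
EXPECTED = ("saved_model.pb", "variables", "assets/vocab.txt", "assets/merges.txt")


def missing_export_files(model_dir):
    """Return the expected paths that are absent under <model_dir>/saved_model/1."""
    root = Path(model_dir) / "saved_model" / "1"
    return [str(root / rel) for rel in EXPECTED if not (root / rel).exists()]


def count_vocab_lines(model_dir):
    """Count the entries in assets/vocab.txt (one token per line)."""
    vocab_txt = Path(model_dir) / "saved_model" / "1" / "assets" / "vocab.txt"
    with open(vocab_txt, encoding="utf-8", newline="\n") as f:
        return sum(1 for _ in f)


missing = missing_export_files(MODEL_NAME)
if missing:
    print("Missing:", *missing, sep="\n  ")
else:
    print("All expected files found;", count_vocab_lines(MODEL_NAME), "vocab entries")
```

If everything is in place, the reported number of vocab entries can be compared against `len(tokenizer.get_vocab())` as an extra sanity check.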

## Import and Save Longformer in Spark NLP


- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `LongformerEmbeddings`, which allows us to load a TensorFlow model in SavedModel format
- Most params can be set later when you load this model in `LongformerEmbeddings` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- `setStorageRef` is very important. When you train a task like NER or any Text Classification, we use this reference to bind the trained model to these specific embeddings, so you won't load different embeddings by mistake and see terrible results 😊
- It's up to you what you put in `setStorageRef`, but it cannot be changed later on. We usually use the name of the model to be clear, but you can get creative if you want!
- The `dimension` param is purely cosmetic and won't change anything. It's mostly there so you can later check the dimension of your model via `.getDimension`, so set it accordingly.
- NOTE: `loadSavedModel` only accepts local paths and not distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. That is why we use `write.save` afterwards, so we can use `.load()` from any file system.
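
Because of the local-path restriction in the note above, a quick guard can catch an `hdfs://`, `s3://`, or `dbfs:/` path before it is handed to `loadSavedModel`. This is only a heuristic sketch: `is_local_path` is a hypothetical helper, not a Spark NLP function, and it conservatively treats any path with a URI scheme (including `file://`) as non-local.

```python
from urllib.parse import urlparse


def is_local_path(path):
    """Heuristic: True when `path` carries no URI scheme such as hdfs:// or s3://."""
    scheme = urlparse(path).scheme
    # A one-letter "scheme" is most likely a Windows drive letter (C:\models).
    return scheme == "" or len(scheme) == 1


saved_model_path = "allenai/longformer-base-4096/saved_model/1"
if not is_local_path(saved_model_path):
    raise ValueError("Copy the SavedModel to local disk first: " + saved_model_path)
print("Looks like a local path:", saved_model_path)
```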



In [None]:
from sparknlp.annotator import *

longformer = LongformerEmbeddings.loadSavedModel(
     '{}/saved_model/1'.format(MODEL_NAME),
     spark
 )\
 .setInputCols(["sentence",'token'])\
 .setOutputCol("embeddings")\
 .setCaseSensitive(True)\
 .setDimension(768)\
 .setMaxSentenceLength(4096)\
 .setStorageRef('longformer_base_4096')
 

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [None]:
longformer.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome 😎  !

This is your Longformer model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀 

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

total 348600
drwxr-xr-x 5 root root      4096 Aug  8 14:08 fields
-rw-r--r-- 1 root root 356956378 Aug  8 14:14 longformer_tensorflow
drwxr-xr-x 2 root root      4096 Aug  8 14:08 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny Longformer model 😊 

In [None]:
longformer_loaded = LongformerEmbeddings.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["sentence",'token'])\
  .setOutputCol("embeddings")
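
To actually produce embeddings, the loaded annotator needs `sentence` and `token` columns upstream. Here is a minimal sketch of one way to wire that up, assuming the standard `DocumentAssembler`, `SentenceDetector`, and `Tokenizer` annotators; `build_embedding_pipeline` and the `text` input column are names chosen for this example.

```python
def build_embedding_pipeline(embeddings):
    """Assemble document -> sentence -> token -> embeddings stages."""
    # Imported inside the function so that defining it needs no Spark runtime.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import SentenceDetector, Tokenizer
    from sparknlp.base import DocumentAssembler

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    return Pipeline(stages=[document, sentence, token, embeddings])


# Usage inside this notebook (needs the `spark` session and `longformer_loaded`):
# data = spark.createDataFrame([["Spark NLP can now run Longformer."]]).toDF("text")
# result = build_embedding_pipeline(longformer_loaded).fit(data).transform(data)
# result.select("embeddings.embeddings").show(1)
```

The output column names match the `setInputCols(["sentence", "token"])` call above, which is what ties the stages together.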

In [None]:
longformer_loaded.getStorageRef()

'longformer_base_4096'

That's it! You can now go wild and use hundreds of Longformer models from HuggingFace 🤗 in Spark NLP 🚀 
