[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20LongformerForQuestionAnswering.ipynb)

## Import LongformerForQuestionAnswering models from HuggingFace 🤗 into Spark NLP 🚀

Let's keep in mind a few things before we start 😊

- This feature is available only in `Spark NLP 4.0.0` and later, so please make sure you have upgraded to the latest Spark NLP release
- You can import Longformer models trained/fine-tuned for question answering via `LongformerForQuestionAnswering` or `TFLongformerForQuestionAnswering`. These models are usually listed under the `Question Answering` category and have `longformer` in their labels
- Reference: [TFLongformerForQuestionAnswering](https://huggingface.co/docs/transformers/model_doc/longformer#transformers.TFLongformerForQuestionAnswering)
- Some [example models](https://huggingface.co/models?filter=longformer&pipeline_tag=question-answering)

## Export and Save HuggingFace model

- Let's install `HuggingFace` and `TensorFlow`. Spark NLP itself does not need `TensorFlow` to be installed, but we need it to load and save models from HuggingFace.
- We pin TensorFlow to version `2.7.1` and Transformers to `4.19.2`. Later releases may work too, but these are the versions that have been tested successfully.
- `sentencepiece` is installed as well, although the Longformer tokenizer itself is a byte-level BPE tokenizer based on `vocab.json` and `merges.txt`


In [None]:
!pip install -q transformers==4.19.2 tensorflow==2.7.1 sentencepiece
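
To confirm that the environment actually matches the tested versions, a small check like the following can be run after the install cell. This is a minimal sketch using only the standard library; `TESTED_VERSIONS` and `version_mismatches` are hypothetical names introduced here and are not part of Spark NLP or Transformers.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions the notebook reports as tested; adjust if you pin differently.
TESTED_VERSIONS = {"transformers": "4.19.2", "tensorflow": "2.7.1"}


def version_mismatches(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in version_mismatches(TESTED_VERSIONS).items():
        print(f"{package}: tested with {wanted}, found {installed}")
```

An empty result means both packages match the pinned versions. A mismatch is not necessarily an error, since other versions may work as well.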

- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow based models. We will use that to save the model as a TF `SavedModel`.
- We'll use the [valhalla/longformer-base-4096-finetuned-squadv1](https://huggingface.co/valhalla/longformer-base-4096-finetuned-squadv1) model from HuggingFace as an example
- In addition to `TFLongformerForQuestionAnswering` we also need to save the `LongformerTokenizer`. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.

In [None]:
from transformers import TFLongformerForQuestionAnswering, LongformerTokenizer 

MODEL_NAME = 'valhalla/longformer-base-4096-finetuned-squadv1'

tokenizer = LongformerTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

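try:
  # use the TensorFlow weights if the model repository provides them
  model = TFLongformerForQuestionAnswering.from_pretrained(MODEL_NAME)
except Exception:
  # otherwise convert the PyTorch weights (this requires PyTorch to be installed)
  model = TFLongformerForQuestionAnswering.from_pretrained(MODEL_NAME, from_pt=True)
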
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
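
Because `MODEL_NAME` contains a slash, the format strings above resolve to nested directories under `valhalla/`, and the tokenizer directory ends up as a sibling of the model directory rather than inside it. The sketch below only illustrates this path arithmetic with the standard library. It does not touch the file system, and the variable names are illustrative.

```python
from pathlib import PurePosixPath

MODEL_NAME = "valhalla/longformer-base-4096-finetuned-squadv1"

# The same format strings the notebook uses when saving
tokenizer_dir = PurePosixPath("./{}_tokenizer/".format(MODEL_NAME))
model_dir = PurePosixPath("./{}".format(MODEL_NAME))

# Location of the TF SavedModel produced by save_pretrained(..., saved_model=True)
saved_model_dir = model_dir / "saved_model" / "1"

print(tokenizer_dir)
print(saved_model_dir)
# Both directories share the same parent directory, "valhalla"
print(tokenizer_dir.parent == model_dir.parent)
```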

Let's have a look inside these two directories and see what we are dealing with:

In [None]:
!ls -l {MODEL_NAME}

In [None]:
!ls -l {MODEL_NAME}/saved_model/1

In [None]:
!ls -l {MODEL_NAME}_tokenizer

- As you can see, we need the SavedModel from the `saved_model/1/` path
- We will also need the `vocab.json` and `merges.txt` files from the tokenizer
- All that is left is to convert `vocab.json` to `vocab.txt` and copy both `vocab.txt` and `merges.txt` into `saved_model/1/assets`, where Spark NLP will look for them
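
The sketch below is an alternative way to do this conversion, not the notebook's own method: it reads a `vocab.json` file and writes the tokens sorted by id, under the assumption that each line number in `vocab.txt` is meant to correspond to a token id. `vocab_json_to_txt` is a hypothetical helper, and the demo uses a toy vocabulary rather than the real file.

```python
import json
import tempfile
from pathlib import Path


def vocab_json_to_txt(vocab_json_path, vocab_txt_path):
    """Write one token per line, ordered by token id. Returns the token count."""
    with open(vocab_json_path, encoding="utf-8") as f:
        vocab = json.load(f)  # mapping of token -> id
    tokens = sorted(vocab, key=vocab.get)  # order tokens by their id
    with open(vocab_txt_path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + "\n")
    return len(tokens)


if __name__ == "__main__":
    # Toy vocabulary, deliberately stored out of id order
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "vocab.json"
        dst = Path(tmp) / "vocab.txt"
        src.write_text(json.dumps({"hello": 2, "<s>": 0, "<pad>": 1}), encoding="utf-8")
        vocab_json_to_txt(src, dst)
        print(dst.read_text(encoding="utf-8").splitlines())
```

Sorting explicitly removes any dependence on the order in which the dictionary happens to be iterated.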

In [None]:
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)

# let's save the vocab as a txt file (UTF-8, since the tokens contain non-ASCII characters)
with open('{}_tokenizer/vocab.txt'.format(MODEL_NAME), 'w', encoding='utf-8') as f:
    for item in tokenizer.get_vocab().keys():
        f.write("%s\n" % item)

# let's copy both vocab.txt and merges.txt files to saved_model/1/assets
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
!cp {MODEL_NAME}_tokenizer/merges.txt {asset_path}

Voila! We have our `vocab.txt` and `merges.txt` inside the assets directory

In [None]:
!ls -l {MODEL_NAME}/saved_model/1/assets
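
Instead of reading the listing by eye, the same check can be done programmatically. This is a minimal sketch, assuming the two required files are exactly `vocab.txt` and `merges.txt` as described above; `missing_assets` is a hypothetical helper, and the demo runs against a temporary directory rather than the real assets path.

```python
import tempfile
from pathlib import Path

REQUIRED_ASSETS = ("vocab.txt", "merges.txt")


def missing_assets(asset_dir, required=REQUIRED_ASSETS):
    """Return the required file names that are absent or empty in asset_dir."""
    asset_dir = Path(asset_dir)
    return [
        name
        for name in required
        if not (asset_dir / name).is_file() or (asset_dir / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    # Demo on a temporary directory that only contains vocab.txt
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "vocab.txt").write_text("<s>\n", encoding="utf-8")
        print(missing_assets(tmp))
```

In the notebook, the call would be `missing_assets('{}/saved_model/1/assets'.format(MODEL_NAME))`, and an empty list means both files are in place.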

## Import and Save LongformerForQuestionAnswering in Spark NLP


- Let's install and setup Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `LongformerForQuestionAnswering`, which allows us to load a TensorFlow model in SavedModel format
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `LongformerForQuestionAnswering` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.



In [None]:
from sparknlp.annotator import *
from sparknlp.base import *

spanClassifier = LongformerForQuestionAnswering.loadSavedModel(
     '{}/saved_model/1'.format(MODEL_NAME),
     spark
 )\
  .setInputCols(["document_question",'document_context'])\
  .setOutputCol("answer")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(512)

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [None]:
spanClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome 😎!

This is your LongformerForQuestionAnswering model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny LongformerForQuestionAnswering model in a Spark NLP 🚀 pipeline!

In [None]:
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier_loaded = LongformerForQuestionAnswering.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["document_question",'document_context'])\
  .setOutputCol("answer")

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier_loaded
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)

result.select("answer.result").show(1, False)

That's it! You can now go wild and use hundreds of `LongformerForQuestionAnswering` models from HuggingFace 🤗 in Spark NLP 🚀
