[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20XlmRBertaForSequenceClassification.ipynb)

## Import XlmRoBertaForSequenceClassification models from HuggingFace 🤗  into Spark NLP 🚀 

Let's keep in mind a few things before we start 😊 

- This feature is only available in `Spark NLP 3.4.x` and later, so please make sure you have upgraded to the latest Spark NLP release
- You can import XLM-RoBERTa models trained/fine-tuned for sequence classification via `XLMRobertaForSequenceClassification` or `TFXLMRobertaForSequenceClassification`. These models are usually under the `Text Classification` category and have `xlm-roberta` in their labels
- Reference: [TFXLMRobertaForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/xlmroberta#transformers.TFXLMRobertaForSequenceClassification)
- Some [example models](https://huggingface.co/models?filter=xlm-roberta&pipeline_tag=text-classification)

## Export and Save HuggingFace model

- Let's install `HuggingFace` and `TensorFlow`. You don't need `TensorFlow` installed to use Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock TensorFlow on version `2.11.0` and Transformers on `4.25.1`. This doesn't mean it won't work with future releases, but these are the versions that have been tested successfully.
- XLMRobertaTokenizer requires the `SentencePiece` library, so we install that as well.

In [1]:
!pip install -q transformers==4.25.1 tensorflow==2.11.0 sentencepiece
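
To confirm which versions actually ended up in the environment, a minimal sketch along these lines may help. It only reads package metadata from the standard library; the `TESTED_VERSIONS` mapping and the `compare_versions` helper are illustrative names, not part of Spark NLP or Transformers.

```python
from importlib import metadata

# Versions this notebook reports as tested; other versions may work too.
TESTED_VERSIONS = {"transformers": "4.25.1", "tensorflow": "2.11.0"}


def compare_versions(tested=TESTED_VERSIONS):
    """Return {package: (installed_or_None, tested)} for each pinned package."""
    report = {}
    for package, wanted in tested.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, wanted)
    return report


for package, (installed, wanted) in compare_versions().items():
    status = "ok" if installed == wanted else "differs from tested version"
    print(f"{package}: installed={installed}, tested={wanted} ({status})")
```

A mismatch is not necessarily a problem; it only tells you that you are outside the combination that was tested here.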

- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow-based models. We will use that to save the model as a TF `SavedModel`.
- We'll use the [cardiffnlp/twitter-xlm-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment) model from HuggingFace as an example
- In addition to `TFXLMRobertaForSequenceClassification` we also need to save the `XLMRobertaTokenizer`. This is the same for every model; these are assets needed for tokenization inside Spark NLP.

In [2]:
from transformers import TFXLMRobertaForSequenceClassification, XLMRobertaTokenizer 
import tensorflow as tf

MODEL_NAME = 'cardiffnlp/twitter-xlm-roberta-base-sentiment'

tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

# in case there is no TF/Keras file provided in the model,
# we can use `from_pt` to convert the PyTorch weights to TensorFlow
try:
  print('try downloading TF weights')
  model = TFXLMRobertaForSequenceClassification.from_pretrained(MODEL_NAME)
except Exception:
  print('try downloading PyTorch weights')
  model = TFXLMRobertaForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)

# Define TF Signature
@tf.function(
  input_signature=[
      {
          "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
          "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask")
      }
  ]
)
def serving_fn(input):
    return model(input)

model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True, signatures={"serving_default": serving_fn})



INFO:tensorflow:Assets written to: ./cardiffnlp/twitter-xlm-roberta-base-sentiment/saved_model/1/assets
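
The serving signature above declares two `int32` inputs of shape `(batch, sequence)`: `input_ids` and `attention_mask`. As a rough illustration of what such a batch looks like, here is a pure-Python sketch. The `pad_batch` helper is hypothetical, the token ids are made up, and the default `pad_id=1` is an assumption; check `tokenizer.pad_token_id` for the value your tokenizer actually uses.

```python
def pad_batch(sequences, pad_id=1):
    """Pad token-id lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only
batch = pad_batch([[0, 87, 5161, 2], [0, 87, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Both lists have the same rectangular shape, which is what the `(None, None)` dimensions in the `TensorSpec` entries allow for: any batch size and any sequence length.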




Let's have a look inside these two directories and see what we are dealing with:

In [3]:
!ls -l {MODEL_NAME}

total 2202472
-rw-r--r--  1 maziyar  staff         915 Dec 15 18:34 config.json
drwxr-xr-x  3 maziyar  staff          96 Dec 15 18:34 saved_model
-rw-r--r--  1 maziyar  staff  1112473408 Dec 15 18:34 tf_model.h5


In [4]:
!ls -l {MODEL_NAME}/saved_model/1

total 18968
drwxr-xr-x  2 maziyar  staff       64 Dec 15 18:34 assets
-rw-r--r--  1 maziyar  staff       55 Dec 15 18:34 fingerprint.pb
-rw-r--r--  1 maziyar  staff   167652 Dec 15 18:34 keras_metadata.pb
-rw-r--r--  1 maziyar  staff  9535557 Dec 15 18:34 saved_model.pb
drwxr-xr-x  4 maziyar  staff      128 Dec 15 18:34 variables


In [5]:
!ls -l {MODEL_NAME}_tokenizer

total 9920
-rw-r--r--  1 maziyar  staff  5069051 Dec 15 18:33 sentencepiece.bpe.model
-rw-r--r--  1 maziyar  staff      167 Dec 15 18:33 special_tokens_map.json
-rw-r--r--  1 maziyar  staff      698 Dec 15 18:33 tokenizer_config.json
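
Because `MODEL_NAME` contains a slash, the format strings used in this notebook produce nested paths under a `cardiffnlp/` directory rather than a single flat folder. The sketch below spells out where each piece lands; `export_paths` is an illustrative helper, not part of any library.

```python
from pathlib import PurePosixPath

MODEL_NAME = "cardiffnlp/twitter-xlm-roberta-base-sentiment"


def export_paths(model_name):
    """Return the directories the export cell writes to, relative to the working directory."""
    model_dir = PurePosixPath(model_name)
    return {
        "model": model_dir,
        "saved_model": model_dir / "saved_model" / "1",
        "assets": model_dir / "saved_model" / "1" / "assets",
        # '{}_tokenizer' appends the suffix to the last path component
        "tokenizer": model_dir.with_name(model_dir.name + "_tokenizer"),
    }


for name, path in export_paths(MODEL_NAME).items():
    print(f"{name:12s} {path}")
```

The tokenizer directory is therefore a sibling of the model directory inside `cardiffnlp/`, not a subdirectory of it.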


- As you can see, we need the SavedModel from the `saved_model/1/` path
- We also need the `sentencepiece.bpe.model` file from the tokenizer
- All we need to do is copy the `sentencepiece.bpe.model` file into `saved_model/1/assets`, where Spark NLP will look for it
- In addition to the vocabulary, we also need the `labels` and their `ids`, which are saved inside the model's config. We will save these in `labels.txt`

In [6]:
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)

# let's copy sentencepiece.bpe.model file to saved_model/1/assets
!cp {MODEL_NAME}_tokenizer/sentencepiece.bpe.model {asset_path}

In [7]:
# get label2id dictionary 
labels = model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)

with open(asset_path+'/labels.txt', 'w') as f:
    f.write('\n'.join(labels))
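
The ordering step matters because `labels.txt` carries no ids, only one label per line, so the line position has to stand in for the id. A small sketch of that sorting on a hypothetical `label2id` mapping (not read from the real model config) is shown below; the contiguity check is an extra safeguard added here for illustration.

```python
def labels_in_id_order(label2id):
    """Return label names sorted by id, checking the ids are 0..n-1 with no gaps."""
    ordered = sorted(label2id, key=label2id.get)
    ids = [label2id[label] for label in ordered]
    if ids != list(range(len(ordered))):
        raise ValueError(f"label ids are not contiguous from 0: {ids}")
    return ordered


# Hypothetical mapping, deliberately listed out of id order
example_label2id = {"positive": 2, "negative": 0, "neutral": 1}
ordered = labels_in_id_order(example_label2id)
print("\n".join(ordered))
```

Sorting a dict iterates over its keys, and `key=label2id.get` ranks each key by its id, which is the same idiom the cell above relies on.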

Voila! We have our `sentencepiece.bpe.model` and `labels.txt` inside the assets directory

In [8]:
! ls -l {asset_path}

total 9912
-rw-r--r--  1 maziyar  staff       25 Dec 15 18:34 labels.txt
-rw-r--r--  1 maziyar  staff  5069051 Dec 15 18:34 sentencepiece.bpe.model
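
Before handing the directory to Spark NLP, it can be worth checking programmatically that everything listed above is in place. The following is a minimal sketch using only the standard library; `missing_export_files` and `REQUIRED_ASSETS` are illustrative names, and the list of expected files simply mirrors what this notebook produces. The demo runs on a stand-in temporary directory so it works without the real export.

```python
import tempfile
from pathlib import Path

# Files this notebook places in the assets directory
REQUIRED_ASSETS = ("sentencepiece.bpe.model", "labels.txt")


def missing_export_files(saved_model_dir):
    """Return the relative paths expected under saved_model/1 that are missing."""
    root = Path(saved_model_dir)
    expected = [Path("saved_model.pb"), Path("variables")]
    expected += [Path("assets") / name for name in REQUIRED_ASSETS]
    return [rel.as_posix() for rel in expected if not (root / rel).exists()]


# Stand-in export that is deliberately missing the SentencePiece file
with tempfile.TemporaryDirectory() as tmp:
    fake_export = Path(tmp)
    (fake_export / "variables").mkdir()
    (fake_export / "assets").mkdir()
    (fake_export / "saved_model.pb").write_bytes(b"")
    (fake_export / "assets" / "labels.txt").write_text("negative\nneutral\npositive")
    missing = missing_export_files(fake_export)
    print("missing:", missing)
```

On the real export you would call `missing_export_files('{}/saved_model/1'.format(MODEL_NAME))` and expect an empty list.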


## Import and Save XlmRoBertaForSequenceClassification in Spark NLP


- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [9]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.1 and Spark NLP 4.2.5
setup Colab for PySpark 3.2.1 and Spark NLP 4.2.5


Let's start Spark with Spark NLP included via our simple `start()` function

In [10]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `XlmRoBertaForSequenceClassification`, which allows us to load a TensorFlow model in SavedModel format
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `XlmRoBertaForSequenceClassification` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, which is the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.



In [11]:
from sparknlp.annotator import *

sequenceClassifier = XlmRoBertaForSequenceClassification\
  .loadSavedModel('{}/saved_model/1'.format(MODEL_NAME), spark)\
  .setInputCols(["document",'token'])\
  .setOutputCol("class")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(128)

- Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [12]:
sequenceClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

Let's clean up stuff we don't need anymore

In [13]:
!rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome 😎  !

This is your XlmRoBertaForSequenceClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀 

In [14]:
! ls -l {MODEL_NAME}_spark_nlp

total 2231984
drwxr-xr-x  4 maziyar  staff         128 Dec 15 18:35 fields
drwxr-xr-x  6 maziyar  staff         192 Dec 15 18:35 metadata
-rw-r--r--  1 maziyar  staff  1121735053 Dec 15 18:35 xlm_roberta_classification_tensorflow
-rw-r--r--  1 maziyar  staff     5069051 Dec 15 18:35 xlmroberta_spp
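
Since the saved model is a plain directory, one possible way to move it to another machine is to pack it into a single archive first. This is only a sketch of that idea using the standard library, not a Spark NLP feature; `archive_model_dir` is a hypothetical helper, and the demo uses a small stand-in directory instead of the real model.

```python
import shutil
import tempfile
from pathlib import Path


def archive_model_dir(model_dir, archive_base):
    """Zip the contents of model_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir))


# Stand-in directory that mimics a saved model folder
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "model_spark_nlp"
    (fake_model / "metadata").mkdir(parents=True)
    (fake_model / "metadata" / "part-00000").write_text("{}")

    archive = archive_model_dir(fake_model, Path(tmp) / "model_spark_nlp")

    # Unpack somewhere else to confirm the round trip preserves the layout
    restored = Path(tmp) / "restored"
    shutil.unpack_archive(archive, str(restored))
    restored_files = sorted(
        p.relative_to(restored).as_posix() for p in restored.rglob("*") if p.is_file()
    )
    print(archive, restored_files)
```

After unpacking on the target machine, the restored directory is what you would pass to `.load()`.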


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny XlmRoBertaForSequenceClassification model 😊 

In [15]:
sequenceClassifier_loaded = XlmRoBertaForSequenceClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("class")

You can see what labels were used to train this model via `getClasses` function:

In [16]:
sequenceClassifier_loaded.getClasses()

['positive', 'negative', 'neutral']

This is how you can use your loaded classifier model in Spark pipeline:

In [18]:
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    sequenceClassifier_loaded    
])

# couple of simple examples
example = spark.createDataFrame([['사랑해!'], ["T'estimo! ❤️"], ["I love you!"], ['Mahal kita!']]).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "class.result").show()

+------------+----------+
|        text|    result|
+------------+----------+
|     사랑해!|[positive]|
|T'estimo! ❤️|[positive]|
| I love you!|[positive]|
| Mahal kita!|[positive]|
+------------+----------+



That's it! You can now go wild and use hundreds of `XlmRoBertaForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀 
