![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/19.01.BertForSequenceClassification.ipynb)

# Text Classification (Sequence Classification) with Transformers

This notebook covers the parameters and usage of Transformer-based classification annotators.

**📖 Learning Objectives:**

1. Be able to create a pipeline for text classification using a Transformer-based annotator.

2. Understand how to use the annotators for predictions.

3. Become comfortable using the different parameters of the annotators.

4. Import Transformers models from Hugging Face to Spark NLP.


**🔗 Helpful Links:**

- Documentation : [Transformers in Spark NLP](https://nlp.johnsnowlabs.com/docs/en/transformers)

- Python Docs : [BertForSequenceClassification](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/bert_for_sequence_classification/index.html#sparknlp.annotator.classifier_dl.bert_for_sequence_classification.BertForSequenceClassification)

- Scala Docs : [BertForSequenceClassification](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForSequenceClassification)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public/).

## Transformers and Spark NLP

Since version `3.1.0`, Spark NLP has supported importing models exported from `HuggingFace` 🤗 and `TF Hub` into Spark NLP 🚀 annotators. You can use the `saved_model` feature in HuggingFace to export a model in a few lines of code and import any of the following types of models into Spark NLP.



<div align="center">

| **Architecture** | **Annotator**                       |
|------------------|-------------------------------------|
| Albert        | AlbertForSequenceClassification     |
| BERT          | BertForSequenceClassification       |
| CamemBERT     | CamemBertForSequenceClassification  |
| DeBERTa       | DeBertaForSequenceClassification    |
| DistilBERT    | DistilBertForSequenceClassification |
| Longformer    | LongformerForSequenceClassification |
| RoBERTa       | RoBertaForSequenceClassification    |
| XLM-RoBERTa   | XlmRoBertaForSequenceClassification |
| Xlnet         | XlnetForSequenceClassification      |

</div>



> We will keep working on the remaining annotators and extend this support to additional Transformers models. To stay up to date, visit [this page](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) on the compatibility and development of the TF Hub and HuggingFace adaptations for Spark NLP. Stay tuned for the next releases.

### Text Classification

As mentioned above, Spark NLP already implements many different Transformers models. For text classification specifically, it provides the `XForSequenceClassification` family of annotators, where `X` can be any of:

- `BERT` ([BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), Jacob Devlin et al.): Randomly replaces a fraction of the input tokens (for example, 15% of them) with a _[MASK]_ token or a random token in order to learn a language model. Given two sentences, training involves two tasks:
    - Predict the original tokens at the corrupted positions.
    - Predict whether the two sentences are consecutive or not.
- `ALBERT` ([ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), Zhenzhong Lan et al.): Same as BERT, with changes in some hyperparameters that optimize memory usage. During training, instead of predicting whether the two sentences are consecutive, the model predicts whether they were swapped (two consecutive sentences are given as input, and the model predicts whether they appear in the correct order).
- `RoBERTa` ([RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692), Yinhan Liu et al.): Same as BERT, but with some different training methods (e.g., dynamic masking, where the masked positions change during training instead of being fixed once).
- `CamemBERT` ([CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894), Louis Martin et al.): Based on the RoBERTa model, trained on a French dataset.
- `DistilBERT` ([DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108), Victor Sanh et al.): Distilled version of BERT (the number of parameters was reduced by knowledge distillation, transferring what a larger model learned to a smaller one).
- `Longformer` ([Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150), Iz Beltagy et al.): Allows the use of up to 4096 tokens instead of the usual limit of 512. To limit the added computational cost, it replaces dense attention matrices with sparse representations.
- `XlmRoBerta` ([Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116), Alexis Conneau et al.): Applies the training methods from RoBERTa to the XLM model.
- `Xlnet` ([XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237), Zhilin Yang et al.): Unlike the token masking applied in BERT models, it trains the language model over permutations of the token order.
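
The masking objective described for `BERT` above can be sketched in plain Python. This is a simplified illustration, not the original training code: the function name `mask_tokens`, the toy vocabulary, and the choice to corrupt a fixed number of positions are assumptions made for the example. The 80/10/10 split between `[MASK]`, a random token, and the unchanged token follows the BERT paper.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Corrupt roughly `mask_rate` of the tokens, BERT-style (simplified sketch)."""
    rng = random.Random(seed)
    # Simplification: pick a fixed number of positions instead of sampling each token independently
    n_select = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    corrupted = list(tokens)
    for pos in positions:
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, positions

tokens = "the movie was brilliant and the cast was even better than expected".split()
vocab = ["film", "actor", "boring", "great", "plot"]  # toy vocabulary, hypothetical
corrupted, positions = mask_tokens(tokens, vocab)
print(corrupted)
print("positions the model must predict:", positions)
```

During pretraining, the model sees `corrupted` and is trained to recover the original tokens at `positions`.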


For more details on these models and others available on HuggingFace, please visit the [HuggingFace documentation](https://huggingface.co/docs/transformers/model_summary).

## **🎬 Colab Setup**

In [None]:
! pip install -q pyspark==3.1.2 spark-nlp==4.3.1



In [None]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

spark

Spark NLP version:  4.3.1
Apache Spark version:  3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `CATEGORY`

## **🔎 Parameters**

- `caseSensitive`: Controls whether case is ignored in index lookups; the default depends on the model.

- `maxSentenceLength`: Maximum sentence length to process, limited to 512 for all models except `Longformer`, which has a limit of 4096.

- `batchSize`: Size of every batch, by default 8. Larger values allow faster processing but require more memory.

- `configProtoBytes`: ConfigProto from TensorFlow, serialized into a byte array. Get it with `config_proto.SerializeToString()`.

- `coalesceSentences`: Instead of one class per sentence (if `inputCols` is `sentence`), output one class per document by averaging probabilities in all sentences, by default `False`.

- `activation`: Whether to convert the logits into scores with softmax or sigmoid, by default `softmax`.



## Defining the Spark NLP Pipeline

In [None]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification, AlbertForSequenceClassification, DistilBertForSequenceClassification, SentenceDetector
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

Let's prepare the prerequisite columns first, so we can use them in different annotators.

In [None]:
document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

tokenizer = Tokenizer() \
        .setInputCols(['document']) \
        .setOutputCol('token')
        
pipeline = Pipeline(stages=[document_assembler,
                            tokenizer])

In [None]:
example_df = spark.createDataFrame([["The movie was brilliant."]]).toDF("text")

example_df = pipeline.fit(example_df).transform(example_df)

## 📍 **BertForSequenceClassification**

In [None]:
bert_cls = BertForSequenceClassification.pretrained("bert_classifier_fabriceyhc_base_uncased_imdb", "en") \
        .setInputCols(['document', 'token']) \
        .setOutputCol('class')

bert_classifier_fabriceyhc_base_uncased_imdb download started this may take some time.
Approximate size to download 390.9 MB
[OK!]


In [None]:
result = bert_cls.transform(example_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|               class|
+--------------------+--------------------+--------------------+--------------------+
|The movie was bri...|[{document, 0, 23...|[{token, 0, 2, Th...|[{category, 0, 23...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result.select("class.result").show(truncate=False)

+------+
|result|
+------+
|[pos] |
+------+



In [None]:
result.select("class").show(truncate=False)

+------------------------------------------------------------------------------------+
|class                                                                               |
+------------------------------------------------------------------------------------+
|[{category, 0, 23, pos, {sentence -> 0, neg -> 3.6695242E-4, pos -> 0.9996331}, []}]|
+------------------------------------------------------------------------------------+



In [None]:
bert_cls.extractParamMap()

{Param(parent='BERT_FOR_SEQUENCE_CLASSIFICATION_4fcc53fde2bc', name='activation', doc='Whether to calculate logits via Softmax or Sigmoid. Default is Softmax'): 'softmax',
 Param(parent='BERT_FOR_SEQUENCE_CLASSIFICATION_4fcc53fde2bc', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='BERT_FOR_SEQUENCE_CLASSIFICATION_4fcc53fde2bc', name='coalesceSentences', doc="Instead of 1 class per sentence (if inputCols is '''sentence''') output 1 class per document by averaging probabilities in all sentences."): False,
 Param(parent='BERT_FOR_SEQUENCE_CLASSIFICATION_4fcc53fde2bc', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='BERT_FOR_SEQUENCE_CLASSIFICATION_4fcc53fde2bc', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='BERT_FOR_SEQUENCE_CLASSIFICATION_4fcc53fde2bc', name='maxSentenceLength', doc='Max sentence length to process'): 256,
 Param(parent='BERT_FOR_SEQUENCE_CLAS

In [None]:
bert_cls.getCaseSensitive()

False

In [None]:
bert_cls.getMaxSentenceLength()

256

## 📍 **AlbertForSequenceClassification**

### **`coalesceSentences`** parameter : 

➤ Instead of 1 class per sentence (if inputCols is sentence) output 1 class per document by averaging probabilities in all sentences.

Because almost all transformer models have a maximum sequence length (512 tokens for BERT), this parameter lets you feed every sentence into the model separately and then average the probabilities into a single prediction for the whole document, instead of getting one prediction per sentence.

➤ Now let's feed our sentences to the sequence classification model and see what happens when we set the parameter to `False` and then to `True`.

In [None]:
document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

sentenceDetector = SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = Tokenizer() \
        .setInputCols(['sentence']) \
        .setOutputCol('token')

albert_cls = AlbertForSequenceClassification \
  .pretrained('albert_base_sequence_classifier_imdb', 'en') \
  .setInputCols(['token', 'sentence']) \
  .setOutputCol('class')\
  .setCoalesceSentences(False)
        
pipeline = Pipeline(stages=[document_assembler,
                            sentenceDetector,
                            tokenizer,
                            albert_cls])

albert_base_sequence_classifier_imdb download started this may take some time.
Approximate size to download 42.8 MB
[OK!]


In [None]:
example_df = spark.createDataFrame([["The movie was brilliant. It was so exciting."]]).toDF("text")

In [None]:
result = pipeline.fit(example_df).transform(example_df)
result.select("class.result").show(truncate=False)

+----------+
|result    |
+----------+
|[pos, pos]|
+----------+



In [None]:
result.select("class").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|class                                                                                                                                                                   |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{category, 0, 23, pos, {sentence -> 0, neg -> 0.012881186, pos -> 0.98711884}, []}, {category, 25, 43, pos, {sentence -> 1, neg -> 0.028350135, pos -> 0.9716499}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



👆🏻 As you can see, it made separate predictions for each sentence in the text.

In [None]:
albert_cls = AlbertForSequenceClassification \
  .pretrained('albert_base_sequence_classifier_imdb', 'en') \
  .setInputCols(['token', 'sentence']) \
  .setOutputCol('class')\
  .setCoalesceSentences(True)
        
pipeline = Pipeline(stages=[document_assembler,
                            sentenceDetector,
                            tokenizer,
                            albert_cls])

albert_base_sequence_classifier_imdb download started this may take some time.
Approximate size to download 42.8 MB
[OK!]


In [None]:
result = pipeline.fit(example_df).transform(example_df)
result.select("class.result").show(truncate=False)

+------+
|result|
+------+
|[pos] |
+------+



In [None]:
result.select("class").show(truncate=False)

+-----------------------------------------------------------------------------------+
|class                                                                              |
+-----------------------------------------------------------------------------------+
|[{category, 0, 23, pos, {sentence -> 0, neg -> 0.02061566, pos -> 0.97938436}, []}]|
+-----------------------------------------------------------------------------------+



👆🏻 As you can see, when we used the parameter **setCoalesceSentences(True)**, the model made a single class prediction per document by taking the average of the probabilities across all the sentences, instead of one class per sentence.
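
The averaging can be reproduced by hand from the per-sentence probabilities printed earlier with `setCoalesceSentences(False)`. The sketch below is a plain-Python approximation of what the parameter is documented to do, not Spark NLP's internal code; `coalesce_sentences` is a hypothetical helper name.

```python
from collections import defaultdict

def coalesce_sentences(sentence_probs):
    """Average per-class probabilities over all sentences of one document."""
    totals = defaultdict(float)
    for probs in sentence_probs:
        for label, p in probs.items():
            totals[label] += p
    n = len(sentence_probs)
    averaged = {label: total / n for label, total in totals.items()}
    # The document-level class is the one with the highest averaged probability
    best = max(averaged, key=averaged.get)
    return best, averaged

# Per-sentence probabilities copied from the output with setCoalesceSentences(False)
per_sentence = [
    {"neg": 0.012881186, "pos": 0.98711884},
    {"neg": 0.028350135, "pos": 0.9716499},
]

label, averaged = coalesce_sentences(per_sentence)
print(label, averaged)
```

The averaged values agree with the `neg` and `pos` probabilities in the coalesced output above, up to floating-point rounding.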

## 📍 **DistilBertForSequenceClassification for French**

### **`activation`** parameter : 

This parameter specifies whether the model's output logits are converted into scores with the softmax or the sigmoid activation function.


➤ **Sigmoid:** The sigmoid function maps each output value independently to a value between 0 and 1, so the scores of different classes do not have to sum to 1. It is often used in binary classification problems. Because each class is scored independently, it is less suitable when exactly one of many classes must be chosen.

➤ **Softmax:** The softmax function is used with as many output nodes as there are classes and transforms their values into a probability distribution that sums to 1. It is therefore the usual choice for multi-class classification problems where class probabilities need to be estimated.
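
To make the difference concrete, here is a small sketch in plain Python (no Spark required) that applies both functions to the same pair of logits. The logit values and the helper names `softmax` and `sigmoid` are made up for illustration; they are not taken from any model in this notebook.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution that sums to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each logit independently into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

logits = [-2.0, 3.0]  # hypothetical raw scores for the classes ['neg', 'pos']

softmax_scores = softmax(logits)
sigmoid_scores = sigmoid(logits)

print("softmax:", softmax_scores, "sum =", sum(softmax_scores))
print("sigmoid:", sigmoid_scores, "sum =", sum(sigmoid_scores))
```

With softmax the two scores sum to 1, while the sigmoid scores are computed independently and generally do not. For these logits both functions rank `pos` highest, which is why the predicted label is often the same in binary problems.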

In [None]:
example_df = spark.createDataFrame([[
    """Deuxième long métrage de Pasolini, Mamma Roma contient déjà la plupart des obsessions de son auteur et notamment la relation si importante dans la construction de l’être humain entre la mère et son fils adolescent. Anna Magnani est déchirante en figure presque universelle de la maman putain représentative de la Ville éternelle. Le film avance à travers des foules de symboles et la fin où le jeune homme termine sa vie en crucifié martyr de la société est d’un implacable réalisme poétique qui annonce les avancées extrêmes ultérieures de Salo. La construction du récit est d’une puissance hors du commun et Pasolini se montre déjà un grand cinéaste qui a assimilé la technique et les possibilités de ce nouvel outil."""
    ]]).toDF("text")

document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

tokenizer = Tokenizer() \
        .setInputCols(['document']) \
        .setOutputCol('token')

distilbert_cls = DistilBertForSequenceClassification.pretrained("distilbert_multilingual_sequence_classifier_allocine", "fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")\
    .setActivation("sigmoid")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    distilbert_cls
])

result = pipeline.fit(example_df).transform(example_df)

distilbert_multilingual_sequence_classifier_allocine download started this may take some time.
Approximate size to download 484.2 MB
[OK!]


In [None]:
result.select("class.result").show(truncate=False)

+------+
|result|
+------+
|[pos] |
+------+



In [None]:
distilbert_cls.getClasses()

['neg', 'pos']

In [None]:
distilbert_cls.getActivation()

'sigmoid'

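➤ For binary classification, either the softmax or the sigmoid activation function can be used, and sigmoid is a common choice in this case. For multi-class classification, where exactly one class applies to each input, softmax is the appropriate choice because it produces probabilities that sum to 1; sigmoid scores each class independently, which suits multi-label problems.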

##  📍 **Using LightPipeline**

[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP-specific pipelines, equivalent to a Spark ML Pipeline but meant to deal with smaller amounts of data. They're useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine in a multi-threaded fashion, **becoming more than 10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. This feature is quite useful for getting predictions for a few lines of text from a trained ML model.

For more details, check the following [Medium post](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1).

This class accepts a string or a list of strings as input, without the need to transform your text into a Spark DataFrame. The [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) method returns a dictionary (or a list of dictionaries if a list is passed as input) with the results of each step in the pipeline. To retrieve all metadata from the annotators in the result, use the [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) method instead, which always returns a list.

To extract the results from the object, you just need to parse the dictionary.

Let's use the `bert_classifier_fabriceyhc_base_uncased_imdb` model with `LightPipeline` and call `.fullAnnotate()` on sample data.

In [None]:
from sparknlp.base import LightPipeline

In [None]:
tokenClassifier = BertForSequenceClassification \
    .pretrained('bert_classifier_fabriceyhc_base_uncased_imdb', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class')

pipeline = Pipeline(stages=[document_assembler, 
                            tokenizer,
                            tokenClassifier])

empty_df = spark.createDataFrame([['']]).toDF("text")
model = pipeline.fit(empty_df)

bert_classifier_fabriceyhc_base_uncased_imdb download started this may take some time.
Approximate size to download 390.9 MB
[OK!]


In [None]:
light_model= LightPipeline(model)
light_result= light_model.fullAnnotate("The film didn't make me cry, or laugh, or even think about it. I left the theater the same way I went in. What about the screenplay? Is it necessary to repeat the same situation ten times just to give the audience an idea of the hard time he had along with his kid? Also the relationship with his wife is weird. The film does not explain why she makes one of the most important decisions a woman can make in a lifetime. Is she bad, or just weak?.")[0]

In [None]:
light_result

{'document': [Annotation(document, 0, 445, The film didn't make me cry, or laugh, or even think about it. I left the theater the same way I went in. What about the screenplay? Is it necessary to repeat the same situation ten times just to give the audience an idea of the hard time he had along with his kid? Also the relationship with his wife is weird. The film does not explain why she makes one of the most important decisions a woman can make in a lifetime. Is she bad, or just weak?., {}, [])],
 'token': [Annotation(token, 0, 2, The, {'sentence': '0'}, []),
  Annotation(token, 4, 7, film, {'sentence': '0'}, []),
  Annotation(token, 9, 14, didn't, {'sentence': '0'}, []),
  Annotation(token, 16, 19, make, {'sentence': '0'}, []),
  Annotation(token, 21, 22, me, {'sentence': '0'}, []),
  Annotation(token, 24, 26, cry, {'sentence': '0'}, []),
  Annotation(token, 27, 27, ,, {'sentence': '0'}, []),
  Annotation(token, 29, 30, or, {'sentence': '0'}, []),
  Annotation(token, 32, 36, laugh, {'s

In [None]:
light_result["class"]

[Annotation(category, 0, 445, neg, {'sentence': '0', 'neg': '0.9991806', 'pos': '8.193569E-4'}, [])]

In [None]:
light_result.keys()

dict_keys(['document', 'token', 'class'])

In [None]:
tokenClassifier.getClasses()

['pos', 'neg']
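
Parsing the `fullAnnotate()` result usually means pulling the predicted label and its score out of each annotation in the output column. The sketch below shows one way to do that, assuming each annotation exposes `result` and `metadata` attributes as in the printed output above. `FakeAnnotation` is a hypothetical stand-in so the example runs without Spark, and `summarize_classes` is a hypothetical helper name.

```python
from dataclasses import dataclass, field

@dataclass
class FakeAnnotation:
    """Hypothetical stand-in exposing only the `result` and `metadata` attributes."""
    result: str
    metadata: dict = field(default_factory=dict)

def summarize_classes(full_result, output_col="class"):
    """Extract label, confidence and all class scores from one fullAnnotate() result."""
    summary = []
    for ann in full_result[output_col]:
        # Metadata values are strings; 'sentence' is an index, not a class score
        scores = {k: float(v) for k, v in ann.metadata.items() if k != "sentence"}
        summary.append({
            "label": ann.result,
            "confidence": scores.get(ann.result),
            "scores": scores,
        })
    return summary

# Values copied from the light_result["class"] output shown above
sample = {
    "class": [
        FakeAnnotation("neg", {"sentence": "0", "neg": "0.9991806", "pos": "8.193569E-4"})
    ]
}

summary = summarize_classes(sample)
print(summary)
```

In the notebook, the same helper would be called as `summarize_classes(light_result)`.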

# From HuggingFace to Spark NLP

Here you will learn how to export a model from HuggingFace to Spark NLP. 

For compatibility details and examples, check [this page](https://nlp.johnsnowlabs.com/docs/en/transformers#import-transformers-into-spark-nlp).

## Export and Save HuggingFace model

- Let's install HuggingFace `transformers` and `TensorFlow`. You don't need `TensorFlow` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We pin TensorFlow to version `2.11.0` and Transformers to `4.25.1`. This doesn't mean it won't work with future releases, but these are the versions that have been tested successfully.

In [None]:
!pip install -q transformers==4.25.1 tensorflow==2.11.0


- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow-based models. We will use it to save the model as a TF `SavedModel`.
- We'll use the [finiteautomata/beto-sentiment-analysis](https://huggingface.co/finiteautomata/beto-sentiment-analysis) model from HuggingFace as an example.
- In addition to `TFBertForSequenceClassification`, we also need to save the `BertTokenizer`. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.

In [None]:
from transformers import TFBertForSequenceClassification, BertTokenizer
import tensorflow as tf

MODEL_NAME = 'finiteautomata/beto-sentiment-analysis'

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

# In case no TF/Keras weights are provided for the model,
# fall back to `from_pt` to convert the PyTorch weights to TensorFlow
try:
  print('try downloading TF weights')
  model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME)
except Exception:
  print('try downloading PyTorch weights')
  model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)

# Define the TF serving signature (input names, shapes, and dtypes)
@tf.function(
  input_signature=[
      {
          "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
          "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
          "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
      }
  ]
)
def serving_fn(inputs):
    return model(inputs)

model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True, signatures={"serving_default": serving_fn})

try downloading TF weights
try downloading PyTorch weights


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


➤ Let's have a look inside these two directories and see what we are dealing with:

In [None]:
!ls -l {MODEL_NAME}

total 429416
-rw-r--r-- 1 root root       873 Mar  7 19:23 config.json
drwxr-xr-x 3 root root      4096 Mar  7 19:23 saved_model
-rw-r--r-- 1 root root 439713116 Mar  7 19:23 tf_model.h5


In [None]:
!ls -l {MODEL_NAME}/saved_model/1

total 9244
drwxr-xr-x 2 root root    4096 Mar  7 19:23 assets
-rw-r--r-- 1 root root      55 Mar  7 19:23 fingerprint.pb
-rw-r--r-- 1 root root  167033 Mar  7 19:23 keras_metadata.pb
-rw-r--r-- 1 root root 9282572 Mar  7 19:23 saved_model.pb
drwxr-xr-x 2 root root    4096 Mar  7 19:23 variables


In [None]:
!ls -l {MODEL_NAME}_tokenizer

total 252
-rw-r--r-- 1 root root     78 Mar  7 19:22 added_tokens.json
-rw-r--r-- 1 root root    125 Mar  7 19:22 special_tokens_map.json
-rw-r--r-- 1 root root    596 Mar  7 19:22 tokenizer_config.json
-rw-r--r-- 1 root root 241796 Mar  7 19:22 vocab.txt


- As you can see, we need the SavedModel from the `saved_model/1/` path
- We will also need `vocab.txt` from the tokenizer
- All we need to do is copy `vocab.txt` to `saved_model/1/assets`, where Spark NLP will look for it
- In addition to the vocabulary, we also need the `labels` and their `ids`, which are saved inside the model's config. We will save them in `labels.txt`

In [None]:
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)

!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}

In [None]:
# get label2id dictionary 
labels = model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)

with open(asset_path+'/labels.txt', 'w') as f:
    f.write('\n'.join(labels))

➤ We now have our `vocab.txt` and `labels.txt` inside the assets directory

In [None]:
! ls -l {MODEL_NAME}/saved_model/1/assets

total 244
-rw-r--r-- 1 root root     11 Mar  7 19:26 labels.txt
-rw-r--r-- 1 root root 241796 Mar  7 19:26 vocab.txt
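
The two asset-preparation steps above (copying `vocab.txt` and writing the labels ordered by id) can also be done in pure Python instead of shell commands. This is a minimal sketch under stated assumptions: `write_assets` is a hypothetical helper, and the demo uses a temporary directory with a made-up vocabulary and a made-up `label2id` mapping rather than the real model files.

```python
import shutil
import tempfile
from pathlib import Path

def write_assets(tokenizer_dir, assets_dir, label2id):
    """Copy vocab.txt into the assets folder and write labels.txt ordered by label id."""
    assets = Path(assets_dir)
    assets.mkdir(parents=True, exist_ok=True)
    shutil.copy(Path(tokenizer_dir) / "vocab.txt", assets / "vocab.txt")
    # Line i of labels.txt must hold the label whose id is i, hence the sort by id
    labels = sorted(label2id, key=label2id.get)
    (assets / "labels.txt").write_text("\n".join(labels))
    return labels

# Demo on throwaway files; all names and values below are hypothetical
with tempfile.TemporaryDirectory() as tmp:
    tokenizer_dir = Path(tmp) / "demo_tokenizer"
    tokenizer_dir.mkdir()
    (tokenizer_dir / "vocab.txt").write_text("[PAD]\n[UNK]\nhola\n")

    assets_dir = Path(tmp) / "saved_model" / "1" / "assets"
    demo_label2id = {"POS": 2, "NEG": 0, "NEU": 1}  # deliberately out of order

    ordered = write_assets(tokenizer_dir, assets_dir, demo_label2id)
    written = (assets_dir / "labels.txt").read_text()
    copied_vocab = (assets_dir / "vocab.txt").exists()

print(ordered)
```

Sorting by id matters because the classifier's output index is used to look up the label, so a dictionary that happens to be stored in a different order would otherwise produce mislabeled predictions.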


## Import and Save BertForSequenceClassification in Spark NLP

- Let's use the `loadSavedModel` function in `BertForSequenceClassification`, which allows us to load a TensorFlow model in SavedModel format
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `BertForSequenceClassification` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, which is the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.

In [None]:
sequenceClassifier = BertForSequenceClassification.loadSavedModel(
     '{}/saved_model/1'.format(MODEL_NAME),
     spark
 )\
  .setInputCols(["document",'token'])\
  .setOutputCol("class")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(128)

➤ Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [None]:
sequenceClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

➤ Let's clean up stuff we don't need anymore

In [None]:
!rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome 😎  !

This is your BertForSequenceClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀 

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

total 438116
-rw-r--r-- 1 root root 448618276 Mar  7 19:28 bert_classification_tensorflow
drwxr-xr-x 5 root root      4096 Mar  7 19:27 fields
drwxr-xr-x 2 root root      4096 Mar  7 19:27 metadata


➤ Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForSequenceClassification model 😊

In [None]:
sequenceClassifier_loaded = BertForSequenceClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("class")

➤ That's it! You can now go wild and use hundreds of `BertForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀 

➤ You can see what labels were used to train this model via the `getClasses` function:

In [None]:
sequenceClassifier_loaded.getClasses()

['NEU', 'POS', 'NEG']
