![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/15.Import_Transformers_Into_Spark_NLP.ipynb)

## Import Transformers from HuggingFace 🤗  into Spark NLP 🚀 
Keep in mind that this feature is only available in Spark NLP 3.2.x and later, so please make sure you have upgraded to the latest Spark NLP release.


## Export and Save HuggingFace model

- Let's install HuggingFace `transformers` and `TensorFlow`. Spark NLP itself doesn't need `TensorFlow` to be installed; we only need it here to load and save models from HuggingFace.
- We pin TensorFlow to version `2.11.0` and Transformers to `4.15.0`. This doesn't mean it won't work with future releases, but these are the versions that have been tested successfully.

In [None]:
!pip install -q transformers==4.15.0 tensorflow==2.11.0
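
If you are unsure which versions ended up in your environment, a check along these lines can help. This is only a sketch: `TESTED_VERSIONS` and `compare_versions` are hypothetical names introduced here, and the pinned versions are simply the ones mentioned above.

```python
from importlib import metadata

# Versions the notebook reports as tested; adjust if you pin differently.
TESTED_VERSIONS = {"transformers": "4.15.0", "tensorflow": "2.11.0"}


def compare_versions(tested, get_version=metadata.version):
    """Return {package: (tested, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in tested.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in compare_versions(TESTED_VERSIONS).items():
    print(f"{package}: tested with {wanted}, found {installed}")
```

A mismatch is not necessarily a problem; it only tells you that you are outside the combination that was tested.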

- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow-based models. We will use it to save the model as a TF `SavedModel`.
- We'll use the [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model from HuggingFace as an example.
- In addition to `TFBertForTokenClassification`, we also need to save the `BertTokenizer`. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.

In [None]:
from transformers import TFBertForTokenClassification, BertTokenizer 

MODEL_NAME = 'dslim/bert-base-NER'

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

# If the model repository provides no TF/Keras weights,
# fall back to `from_pt=True` to convert the PyTorch weights to TensorFlow
try:
  print('try downloading TF weights')
  model = TFBertForTokenClassification.from_pretrained(MODEL_NAME)
except Exception:
  print('try downloading PyTorch weights')
  model = TFBertForTokenClassification.from_pretrained(MODEL_NAME, from_pt=True)

model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)

try downloading TF weights


All model checkpoint layers were used when initializing TFBertForTokenClassification.

All the layers of TFBertForTokenClassification were initialized from the model checkpoint at dslim/bert-base-NER.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.
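
Because `MODEL_NAME` contains a slash, the format strings used above produce nested paths: the tokenizer directory ends up next to the model directory, both under `dslim/`. The sketch below only manipulates path strings to show this (nothing is read from disk); the variable names are illustrative.

```python
from pathlib import PurePosixPath

MODEL_NAME = "dslim/bert-base-NER"

# The same format strings the notebook uses.
tokenizer_dir = PurePosixPath("./{}_tokenizer/".format(MODEL_NAME))
model_dir = PurePosixPath("./{}".format(MODEL_NAME))
saved_model_dir = model_dir / "saved_model" / "1"

print(tokenizer_dir)    # dslim/bert-base-NER_tokenizer
print(saved_model_dir)  # dslim/bert-base-NER/saved_model/1
```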


Let's have a look inside these two directories and see what we are dealing with:

In [None]:
!ls -l {MODEL_NAME}

total 421084
-rw-r--r-- 1 root root       999 Apr  4 07:13 config.json
drwxr-xr-x 3 root root      4096 Apr  4 07:13 saved_model
-rw-r--r-- 1 root root 431179756 Apr  4 07:13 tf_model.h5


In [None]:
!ls -l {MODEL_NAME}/saved_model/1

total 6600
drwxr-xr-x 2 root root    4096 Apr  4 07:13 assets
-rw-r--r-- 1 root root      55 Apr  4 07:13 fingerprint.pb
-rw-r--r-- 1 root root  164710 Apr  4 07:13 keras_metadata.pb
-rw-r--r-- 1 root root 6577406 Apr  4 07:13 saved_model.pb
drwxr-xr-x 2 root root    4096 Apr  4 07:13 variables


In [None]:
!ls -l {MODEL_NAME}_tokenizer

total 220
-rw-r--r-- 1 root root    112 Apr  4 07:12 special_tokens_map.json
-rw-r--r-- 1 root root    552 Apr  4 07:12 tokenizer_config.json
-rw-r--r-- 1 root root 213450 Apr  4 07:12 vocab.txt
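
The same check can be done programmatically instead of reading `ls` output. This is a minimal sketch based on the listings above; `missing_export_files` is a hypothetical helper, not part of transformers or Spark NLP, and the list of expected entries is an assumption drawn from what this particular export produced.

```python
from pathlib import Path


def missing_export_files(model_dir, tokenizer_dir):
    """Return the expected files/directories that are absent, as strings."""
    saved_model = Path(model_dir) / "saved_model" / "1"
    expected = [
        saved_model / "saved_model.pb",
        saved_model / "variables",
        saved_model / "assets",
        Path(tokenizer_dir) / "vocab.txt",
    ]
    return [str(path) for path in expected if not path.exists()]


missing = missing_export_files("dslim/bert-base-NER", "dslim/bert-base-NER_tokenizer")
print("missing:", missing or "nothing")
```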


- As you can see, we need the SavedModel from the `saved_model/1/` path.
- We will also need `vocab.txt` from the tokenizer.
- All we need to do is copy `vocab.txt` to `saved_model/1/assets`, which is where Spark NLP will look for it.
- In addition to the vocabulary, we also need the `labels` and their `ids`, which are saved inside the model's config. We will save them to `labels.txt`.

In [None]:
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)

!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}

In [None]:
# get label2id dictionary 
labels = model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)

with open(asset_path+'/labels.txt', 'w') as f:
    f.write('\n'.join(labels))
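
The two cells above can also be written without the `!cp` shell escape. The sketch below does the same two steps in plain Python on a temporary directory; `write_assets` is a hypothetical helper and `toy_label2id` is a made-up mapping for illustration, not the real config of `dslim/bert-base-NER`.

```python
import shutil
import tempfile
from pathlib import Path


def write_assets(vocab_file, asset_dir, label2id):
    """Copy vocab.txt into asset_dir and write labels.txt ordered by id."""
    asset_dir = Path(asset_dir)
    asset_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(vocab_file, asset_dir / "vocab.txt")
    # sorted() over a dict iterates its keys; key=label2id.get orders them by id.
    labels = sorted(label2id, key=label2id.get)
    (asset_dir / "labels.txt").write_text("\n".join(labels), encoding="utf-8")
    return labels


# Made-up mapping for illustration only.
toy_label2id = {"B-PER": 2, "O": 0, "I-PER": 1}

with tempfile.TemporaryDirectory() as tmp:
    vocab = Path(tmp) / "vocab.txt"
    vocab.write_text("[PAD]\n[UNK]\n", encoding="utf-8")
    ordered = write_assets(vocab, Path(tmp) / "assets", toy_label2id)
    print(ordered)  # ['O', 'I-PER', 'B-PER']
```

The point of sorting by id is that line `i` of `labels.txt` then holds the label with id `i`.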

Voila! We have our `vocab.txt` and `labels.txt` inside the assets directory.

In [None]:
! ls -l {MODEL_NAME}/saved_model/1/assets

total 216
-rw-r--r-- 1 root root     51 Apr  4 07:13 labels.txt
-rw-r--r-- 1 root root 213450 Apr  4 07:13 vocab.txt


## Import and Save BertForTokenClassification in Spark NLP


- Let's install and set up Spark NLP in Google Colab

In [None]:
! pip install -q pyspark==3.3.0 spark-nlp==4.3.2

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp

spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `BertForTokenClassification`, which allows us to load a TensorFlow model in SavedModel format.
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `BertForTokenClassification` at runtime, so don't worry about what you set them to now.
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, which is the `spark` variable we started earlier via `sparknlp.start()`.
- NOTE: `loadSavedModel` only accepts local paths, not distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. That is why we use `write.save` afterwards, so we can `.load()` the model from any file system.
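
Since `loadSavedModel` needs a local path, it can be handy to catch a remote URI before calling it. The helper below is a rough heuristic written for this notebook, not a Spark NLP API: `is_local_path` is a hypothetical name, and it only inspects the URI scheme of the string.

```python
from urllib.parse import urlparse


def is_local_path(path):
    """Heuristic: True for plain paths and file:// URIs, False for other schemes."""
    scheme = urlparse(path).scheme
    # A one-letter "scheme" is most likely a Windows drive letter such as C:\models.
    return scheme in ("", "file") or len(scheme) == 1


for candidate in ("dslim/bert-base-NER/saved_model/1",
                  "hdfs://namenode/models/bert",
                  "s3://bucket/models/bert",
                  "dbfs:/models/bert"):
    print(candidate, "->", "local" if is_local_path(candidate) else "remote")
```

The host, bucket, and directory names in the loop are placeholders.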



In [None]:
from sparknlp.annotator import *
from sparknlp.base import *

tokenClassifier = BertForTokenClassification.loadSavedModel(
    '{}/saved_model/1'.format(MODEL_NAME),
    spark
)\
    .setInputCols(["document", 'token'])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(128)

- Let's save it to disk so it is easier to move around and can be used later via the `.load` function

In [None]:
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

Let's clean up stuff we don't need anymore

In [None]:
! rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome!

This is your BertForTokenClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀 

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

total 427164
-rw-r--r-- 1 root root 437401448 Apr  4 07:15 bert_classification_tensorflow
drwxr-xr-x 5 root root      4096 Apr  4 07:15 fields
drwxr-xr-x 2 root root      4096 Apr  4 07:15 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForTokenClassification model 

In [None]:
tokenClassifier_loaded = BertForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")

That's it! You can now go wild and use hundreds of `BertForTokenClassification` models from HuggingFace 🤗 in Spark NLP 🚀 


You can see which labels were used to train this model via the `getClasses` function:

In [None]:
tokenClassifier_loaded.getClasses()

['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER', 'B-MISC', 'B-ORG', 'O', 'B-PER']

This is how you can use your loaded classifier model in a Spark NLP 🚀 pipeline:

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    tokenClassifier_loaded    
])

# couple of simple examples
example = spark.createDataFrame([["My name is Sarah and I live in London"],
                                 ["My name is Clara and I live in Berkeley, California."]]).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "ner.result").show(truncate=False)

+----------------------------------------------------+------------------------------------------------+
|text                                                |result                                          |
+----------------------------------------------------+------------------------------------------------+
|My name is Sarah and I live in London               |[O, O, O, B-PER, O, O, O, O, B-LOC]             |
|My name is Clara and I live in Berkeley, California.|[O, O, O, B-PER, O, O, O, O, B-LOC, O, B-LOC, O]|
+----------------------------------------------------+------------------------------------------------+
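
The `result` column holds one IOB tag per token. To read entities out of it, the tags can be grouped back into spans. The sketch below does this in plain Python on the second example row; `group_entities` is a hypothetical helper, and the token list is an assumption about how the sentence was tokenized (it matches the 12 tags shown). Inside a pipeline, Spark NLP's `NerConverter` annotator is meant for this kind of grouping.

```python
def group_entities(tokens, tags):
    """Merge IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if current_tokens and not continues:
            # The running entity ends here: flush it.
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # A stray I- tag without a preceding B- starts a new entity here.
            current_tokens.append(token)
            current_type = entity_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Assumed tokenization of the second example sentence.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in",
          "Berkeley", ",", "California", "."]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"]

print(group_entities(tokens, tags))
# [('Clara', 'PER'), ('Berkeley', 'LOC'), ('California', 'LOC')]
```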

