[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20DistilBertForSequenceClassification.ipynb)

## Import DistilBertForSequenceClassification models from HuggingFace 🤗  into Spark NLP 🚀 

Let's keep in mind a few things before we start 😊 

- This feature is only available in `Spark NLP 3.3.3` and later, so please make sure you have upgraded to the latest Spark NLP release
- You can import DistilBERT models trained/fine-tuned for sequence (text) classification via `DistilBertForSequenceClassification` or `TFDistilBertForSequenceClassification`. These models are usually under the `Text Classification` category and have `distilbert` in their labels
- Reference: [TFDistilBertForSequenceClassification](https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertforsequenceclassification)
- Some [example models](https://huggingface.co/models?filter=distilbert&pipeline_tag=text-classification)

## Export and Save HuggingFace model

- Let's install HuggingFace `transformers` and `TensorFlow`. You don't need `TensorFlow` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We lock TensorFlow on version `2.4.1` and Transformers on `4.12.5`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.
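
If you run this notebook in an environment that already has these packages, a quick check like the one below can tell you whether the installed versions differ from the ones pinned in the install cell. This is only a sketch: the helper name `find_version_mismatches` and the `TESTED_VERSIONS` dictionary are made up for illustration, and a mismatch is a hint rather than a guarantee that something will break.

```python
from importlib import metadata

# Versions pinned by the install cell in this notebook.
TESTED_VERSIONS = {"transformers": "4.12.5", "tensorflow": "2.4.1"}


def find_version_mismatches(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in find_version_mismatches(TESTED_VERSIONS).items():
    print(f"{package}: tested with {wanted}, found {installed or 'nothing installed'}")
```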

In [1]:
!pip install -q transformers==4.12.5 tensorflow==2.4.1


- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow-based models. We will use that to save the model as a TF `SavedModel`.
- We'll use the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model from HuggingFace as an example
- In addition to `TFDistilBertForSequenceClassification`, we also need to save the `DistilBertTokenizer`. This step is the same for every model: these are the assets needed for tokenization inside Spark NLP.

In [2]:
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer 

MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

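try:
  # Prefer native TensorFlow weights when the checkpoint provides them
  model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
except Exception:
  # Otherwise convert the PyTorch weights to TensorFlow on the fly
  model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)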
    
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.






INFO:tensorflow:Assets written to: ./distilbert-base-uncased-finetuned-sst-2-english/saved_model/1/assets


Let's have a look inside these two directories and see what we are dealing with:

In [3]:
!ls -l {MODEL_NAME}

total 261684
-rw-r--r-- 1 root root       735 Nov 20 19:07 config.json
drwxr-xr-x 3 root root      4096 Nov 20 19:07 saved_model
-rw-r--r-- 1 root root 267952512 Nov 20 19:07 tf_model.h5


In [4]:
!ls -l {MODEL_NAME}/saved_model/1

total 4288
drwxr-xr-x 2 root root    4096 Nov 20 19:07 assets
-rw-r--r-- 1 root root 4381569 Nov 20 19:07 saved_model.pb
drwxr-xr-x 2 root root    4096 Nov 20 19:07 variables


In [5]:
!ls -l {MODEL_NAME}_tokenizer

total 236
-rw-r--r-- 1 root root    112 Nov 20 19:07 special_tokens_map.json
-rw-r--r-- 1 root root    429 Nov 20 19:07 tokenizer_config.json
-rw-r--r-- 1 root root 231508 Nov 20 19:07 vocab.txt


- As you can see, we need the SavedModel from the `saved_model/1/` path
- We will also need `vocab.txt` from the tokenizer
- All we need to do is copy `vocab.txt` to `saved_model/1/assets`, which is where Spark NLP will look for it
- In addition to the vocabulary, we also need the `labels` and their `ids`, which are saved inside the model's config. We will save them inside `labels.txt`

In [6]:
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)

!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
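
Outside a notebook, where the `!cp` shell magic is not available, the same copy step could be done in plain Python. The sketch below assumes the directory layout shown above; the helper name `copy_vocab` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_vocab(tokenizer_dir, asset_dir):
    """Copy vocab.txt from the tokenizer directory into the SavedModel assets directory."""
    source = Path(tokenizer_dir) / "vocab.txt"
    if not source.is_file():
        raise FileNotFoundError(f"no vocab.txt in {tokenizer_dir}")
    target_dir = Path(asset_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # harmless if it already exists
    return Path(shutil.copy(source, target_dir / "vocab.txt"))


# Example usage with the variables defined in this notebook:
# copy_vocab("{}_tokenizer".format(MODEL_NAME), asset_path)
```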

In [7]:
# get label2id dictionary 
labels = model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)

with open(asset_path+'/labels.txt', 'w') as f:
    f.write('\n'.join(labels))
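
The cell above sorts the labels by id before writing them, which suggests that the line order of `labels.txt` is what carries each label's id. The sketch below illustrates that idea with a hypothetical two-label mapping listed out of order; `write_labels` and `read_labels` are illustrative helpers, not part of Spark NLP or HuggingFace.

```python
import tempfile
from pathlib import Path


def write_labels(label2id, path):
    """Write one label per line, ordered by id, so line i holds the label with id i."""
    ordered = sorted(label2id, key=label2id.get)
    Path(path).write_text("\n".join(ordered), encoding="utf-8")
    return ordered


def read_labels(path):
    """Rebuild the label -> id mapping from the line order of the file."""
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    return {label: index for index, label in enumerate(lines)}


# Hypothetical mapping, deliberately listed out of order.
example_label2id = {"POSITIVE": 1, "NEGATIVE": 0}

with tempfile.TemporaryDirectory() as tmp:
    labels_file = Path(tmp) / "labels.txt"
    write_labels(example_label2id, labels_file)
    # Reading the file back recovers the original ids from the line positions.
    assert read_labels(labels_file) == example_label2id
```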

Voila! We have our `vocab.txt` and `labels.txt` inside the assets directory

In [8]:
!ls -l {MODEL_NAME}/saved_model/1/assets

total 232
-rw-r--r-- 1 root root     17 Nov 20 19:07 labels.txt
-rw-r--r-- 1 root root 231508 Nov 20 19:07 vocab.txt
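
Instead of reading the directory listing by eye, the same check can be scripted. This is a minimal sketch that assumes `vocab.txt` and `labels.txt` are the only extra assets needed, as described above; `missing_assets` is a hypothetical helper name.

```python
from pathlib import Path

REQUIRED_ASSETS = ("vocab.txt", "labels.txt")


def missing_assets(saved_model_dir, required=REQUIRED_ASSETS):
    """Return the required asset files that are absent or empty under <saved_model_dir>/assets."""
    assets = Path(saved_model_dir) / "assets"
    return [
        name for name in required
        if not (assets / name).is_file() or (assets / name).stat().st_size == 0
    ]


# Example usage with the variables defined in this notebook:
# problems = missing_assets("{}/saved_model/1".format(MODEL_NAME))
# An empty list means both files are in place.
```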


## Import and Save DistilBertForSequenceClassification in Spark NLP


- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy with our simple script

In [9]:
# entirely ignore this part
# it's for Colab only
# Install pyspark spark-nlp
! pip install --upgrade --force-reinstall pyspark==3.1.2 https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/tmp/spark_nlp-3.3.2-py2.py3-none-any.whl

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory", "16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/tmp/spark-nlp-gpu-assembly-3.3.2.jar")\
    .getOrCreate()

Collecting spark-nlp==3.3.2
  Downloading https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/tmp/spark_nlp-3.3.2-py2.py3-none-any.whl (133 kB)
Collecting pyspark==3.1.2
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=6be97c2be055c8e8d53410ceb35a7afd73d77cb85164909adc80bbef1bf6d867
  Stored in directory: /root/.cache/pip/wheels/a5/0a/c1/9561f6fecb759579a7d863dcd846daaa95f598744e71b02c77
Successfully built pyspark
Installing collected packages: py4j, spark-nlp, pyspark
Successfully installed py4j-0.10.9 pyspa

In [None]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's start Spark with Spark NLP included via our simple `start()` function

In [None]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `DistilBertForSequenceClassification`, which allows us to load a TensorFlow model in SavedModel format
- Most params, such as `setMaxSentenceLength`, can be set later at runtime when you load this model in `DistilBertForSequenceClassification`, so don't worry about the values you set now
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`
- NOTE: `loadSavedModel` accepts only local paths, not distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. That is why we use `write.save` afterwards, so we can use `.load()` from any file system
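
Because of the local-path restriction in the note above, it can help to reject remote URIs before calling `loadSavedModel`. The sketch below is one way to do that using only the standard library; `is_local_path` is a hypothetical helper, and it assumes POSIX-style paths such as those used in Colab (a Windows drive letter like `C:` would be misread as a URI scheme).

```python
from urllib.parse import urlparse


def is_local_path(path):
    """Return True when the path has no URI scheme or uses the file scheme."""
    scheme = urlparse(path).scheme.lower()
    return scheme in ("", "file")


# Example usage with the variable defined in this notebook:
# saved_model_path = "{}/saved_model/1".format(MODEL_NAME)
# if not is_local_path(saved_model_path):
#     raise ValueError("copy the SavedModel to local disk first")
```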



In [10]:
from sparknlp.annotator import *

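# Load the exported TF SavedModel; vocab.txt and labels.txt are read from its assets directory.
# Note: this example checkpoint is uncased, so setCaseSensitive(False) may be the better fit.
# Like the other params, it can still be changed after loading.
sequenceClassifier = DistilBertForSequenceClassification.loadSavedModel(
    '{}/saved_model/1'.format(MODEL_NAME),
    spark
)\
  .setInputCols(["sentence", 'token'])\
  .setOutputCol("class")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(128)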

- Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [11]:
sequenceClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

Let's clean up stuff we don't need anymore

In [12]:
!rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome 😎  !

This is your DistilBertForSequenceClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀 

In [13]:
! ls -l {MODEL_NAME}_spark_nlp

total 265744
-rw-r--r-- 1 root root 272111080 Nov 20 19:09 distilbert_classification_tensorflow
drwxr-xr-x 5 root root      4096 Nov 20 19:08 fields
drwxr-xr-x 2 root root      4096 Nov 20 19:08 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny DistilBertForSequenceClassification model 😊 

In [14]:
sequenceClassifier_loaded = DistilBertForSequenceClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["sentence",'token'])\
  .setOutputCol("class")
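
To actually run predictions, the loaded annotator needs upstream stages that produce its two input columns. The following is a minimal sketch, not the only way to wire it: it assumes a `DocumentAssembler` whose output column is simply named `sentence` to match the input columns set above, plus a default `Tokenizer`. The helper name `build_classification_pipeline` and the sample sentence are made up for illustration.

```python
def build_classification_pipeline(classifier):
    """Chain a DocumentAssembler and a Tokenizer in front of a loaded classifier."""
    # Imported lazily so that defining this helper does not require Spark to be installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer

    # The output column is named "sentence" only to match setInputCols(["sentence", "token"]).
    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("sentence")
    tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    return Pipeline(stages=[document_assembler, tokenizer, classifier])


# Example usage, assuming `spark` and `sequenceClassifier_loaded` exist as in the cells above:
# data = spark.createDataFrame([["I love this movie!"]]).toDF("text")
# pipeline = build_classification_pipeline(sequenceClassifier_loaded)
# pipeline.fit(data).transform(data).select("text", "class.result").show(truncate=False)
```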

That's it! You can now go wild and use hundreds of `DistilBertForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀 
