[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/TF%20Hub%20in%20Spark%20NLP%20-%20ALBERT.ipynb)

## Import ALBERT models from TF Hub into Spark NLP 🚀 

Let's keep in mind a few things before we start 😊 

- This feature is only available in `Spark NLP 3.1.x` and later, so please make sure you have upgraded to the latest Spark NLP release
- You can import any ALBERT model from TF Hub, but it has to be in the `TF2.0 Saved Model` format. This means you cannot use the `ALBERT models for TF1`, which are `DEPRECATED`
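
The version requirement above can be checked programmatically before going any further. The helper below is a minimal sketch using only the standard library; the name `meets_minimum_version` is hypothetical and is not part of Spark NLP.

```python
import re


def meets_minimum_version(version, minimum=(3, 1, 0)):
    """Return True if a dotted version string is at least `minimum`."""
    parts = []
    for piece in version.split("."):
        # Keep only the leading digits, so "0rc1" is read as 0.
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as "3.1" before comparing tuples.
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= tuple(minimum)
```

Once Spark NLP is installed, the string returned by `sparknlp.version()` can be passed to this helper, for example `meets_minimum_version(sparknlp.version())`.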

## Save TF Hub model

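- We need to install `tensorflow` and `tensorflow-hub` to load the model and re-export it with an explicit signature
- We can simply load the model with `hub.KerasLayer` and save it locally in the SavedModel format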
- We'll use [albert_en_base](https://tfhub.dev/tensorflow/albert_en_base/3) model from TF Hub as an example


In [None]:
!rm -rf /content/*

In [None]:
!pip install -q tensorflow==2.4.1 tensorflow-hub

In [None]:
EXPORTED_MODEL = 'albert_en_base'
TF_HUB_URL = 'https://tfhub.dev/tensorflow/albert_en_base/3'

In [None]:
import tensorflow as tf
import tensorflow_hub as hub

# Load the ALBERT encoder from TF Hub as a frozen (non-trainable) Keras layer
encoder = hub.KerasLayer(TF_HUB_URL, trainable=False)

# Wrap the encoder so the exported signature exposes only `sequence_output`
@tf.function
def my_module_encoder(input_mask, input_word_ids, input_type_ids):
    inputs = {
        'input_mask': input_mask,
        'input_word_ids': input_word_ids,
        'input_type_ids': input_type_ids
    }
    outputs = {
        'sequence_output': encoder(inputs)['sequence_output']
    }
    return outputs

tf.saved_model.save(
    encoder, 
    EXPORTED_MODEL, 
    signatures=my_module_encoder.get_concrete_function(
        input_mask=tf.TensorSpec(shape=(None, None), dtype=tf.int32),
        input_word_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32),
        input_type_ids=tf.TensorSpec(shape=(None, None), dtype=tf.int32)
    ), 
    options=None
)



INFO:tensorflow:Assets written to: albert_en_base/assets


Let's have a look inside the exported directory and its `assets` subdirectory to see what we are dealing with:

In [None]:
!ls -l {EXPORTED_MODEL}

total 6140
drwxr-xr-x 2 root root    4096 Jul 13 13:00 assets
-rw-r--r-- 1 root root 6276123 Jul 13 13:00 saved_model.pb
drwxr-xr-x 2 root root    4096 Jul 13 13:00 variables


In [None]:
!ls -l {EXPORTED_MODEL}/assets

total 744
-rw-r--r-- 1 root root 760289 Jul 13 13:00 30k-clean.model
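
The same inspection can be done from Python rather than by reading the `ls` output. This is a minimal sketch that only checks for the files shown in the listings above; the function name `check_saved_model_layout` is hypothetical, and the checks are an assumption about what a usable export looks like, not a Spark NLP API.

```python
from pathlib import Path


def check_saved_model_layout(model_dir):
    """Return a list of problems found in an exported SavedModel directory."""
    root = Path(model_dir)
    problems = []
    if not (root / "saved_model.pb").is_file():
        problems.append("missing saved_model.pb")
    if not (root / "variables").is_dir():
        problems.append("missing variables/ directory")
    # The SentencePiece model ships as a *.model file under assets/.
    assets = root / "assets"
    if not assets.is_dir() or not any(assets.glob("*.model")):
        problems.append("no SentencePiece *.model file in assets/")
    return problems
```

An empty list means the three expected pieces are present; in this notebook the call would be `check_saved_model_layout(EXPORTED_MODEL)`.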


- The `SentencePiece` model is already in the `assets` directory, but let's rename it to something Spark NLP recognizes
- We are all set! We can go to Spark NLP 😊 

In [None]:
!mv {EXPORTED_MODEL}/assets/*.model {EXPORTED_MODEL}/assets/spiece.model
!ls -l {EXPORTED_MODEL}/assets

total 744
-rw-r--r-- 1 root root 760289 Jul 13 13:00 spiece.model
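
Outside a notebook, the rename can be done in plain Python. The sketch below is one possible equivalent of the `mv` command above; `rename_sentencepiece_asset` is a hypothetical helper name. Unlike the shell glob, it refuses to proceed unless exactly one `*.model` file is present.

```python
from pathlib import Path


def rename_sentencepiece_asset(model_dir, target_name="spiece.model"):
    """Rename the single *.model file in assets/ to `target_name`."""
    assets = Path(model_dir) / "assets"
    candidates = sorted(assets.glob("*.model"))
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one *.model file in {assets}, "
            f"found {len(candidates)}"
        )
    target = assets / target_name
    # Skip the rename when the file already has the expected name.
    if candidates[0] != target:
        candidates[0].rename(target)
    return target
```

Calling it a second time on the same directory leaves the file untouched, so the cell can be re-run safely.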


## Import and Save ALBERT in Spark NLP


- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [11]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-07-13 13:18:28--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-07-13 13:18:29--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1608 (1.6K) [text/plain]
Saving to: ‘STDOUT’

-                     0%[                    ]       0  --.-KB/s               setup Colab for PySpark 3.0.3 and Spark NLP 3.1.2

2021-07-13 13:18:29 (1.61 

Let's start Spark with Spark NLP included via our simple `start()` function

In [12]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

- Let's use the `loadSavedModel` function in `AlbertEmbeddings`, which allows us to load a TensorFlow model in the SavedModel format
- Most params can be set later, when you load this model in `AlbertEmbeddings` at runtime, so don't worry about what you set them to now
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, which is the `spark` variable we started earlier via `sparknlp.start()`
- `setStorageRef` is very important. When you train a task such as NER or any text classification, Spark NLP uses this reference to bind the trained model to these specific embeddings, so you won't load different embeddings by mistake and see terrible results 😊
- It's up to you what you put in `setStorageRef`, but it cannot be changed later on. We usually use the name of the model to be clear, but you can get creative if you want! 
- The `dimension` param is purely cosmetic and won't change anything. It's mostly there so you can check the dimension of your model later via `.getDimension`, so set it accordingly.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.

In [14]:
from sparknlp.annotator import *

albert = AlbertEmbeddings.loadSavedModel(
     EXPORTED_MODEL,
     spark
 )\
 .setInputCols(["sentence",'token'])\
 .setOutputCol("albert")\
 .setCaseSensitive(False)\
 .setDimension(768)\
 .setStorageRef(EXPORTED_MODEL) 

- Let's save it on disk so it is easier to move around and can be loaded later via the `.load` function

In [15]:
albert.write().overwrite().save("./{}_spark_nlp".format(EXPORTED_MODEL))

Let's clean up stuff we don't need anymore

In [16]:
!rm -rf {EXPORTED_MODEL}

Awesome 😎  !

This is your ALBERT model from TF Hub loaded and saved by Spark NLP 🚀 

In [17]:
! ls -l {EXPORTED_MODEL}_spark_nlp

total 44052
-rw-r--r-- 1 root root   760289 Jul 13 13:21 albert_spp
-rw-r--r-- 1 root root 44336140 Jul 13 13:21 albert_tensorflow
drwxr-xr-x 3 root root     4096 Jul 13 13:21 fields
drwxr-xr-x 2 root root     4096 Jul 13 13:21 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny ALBERT model 😊 

In [18]:
albert_loaded = AlbertEmbeddings.load("./{}_spark_nlp".format(EXPORTED_MODEL))\
  .setInputCols(["sentence",'token'])\
  .setOutputCol("albert")\
  .setCaseSensitive(False)

In [19]:
albert_loaded.getStorageRef()

'albert_en_base'
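
To actually produce embeddings, the loaded model has to sit behind annotators that create the `sentence` and `token` columns it reads. The function below is a minimal sketch of such a pipeline, assuming a running Spark session and the model saved earlier in this notebook. `build_albert_pipeline` is a hypothetical helper, and the choice of `SentenceDetector` and `Tokenizer` as upstream stages is an assumption rather than something this notebook prescribes.

```python
def build_albert_pipeline(model_path):
    """Assemble a Spark ML Pipeline that ends in the saved ALBERT embeddings."""
    # Imported lazily so that defining this function does not require Spark.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, AlbertEmbeddings

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    embeddings = (
        AlbertEmbeddings.load(model_path)
        .setInputCols(["sentence", "token"])
        .setOutputCol("albert")
    )
    return Pipeline(stages=[document, sentence, token, embeddings])


# Example usage inside the notebook (requires the `spark` session from above):
# data = spark.createDataFrame([["Spark NLP can load ALBERT models."]]).toDF("text")
# pipeline = build_albert_pipeline("./albert_en_base_spark_nlp")
# result = pipeline.fit(data).transform(data)
# result.select("albert.embeddings").show(1)
```

The embeddings end up in the `albert` output column, one annotation per token.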

That's it! You can now go wild and import ALBERT models from TF Hub in Spark NLP 🚀 
