# How to import BERT checkpoints into Spark NLP
## We use BERTimbau - Portuguese BERT as an example
source: https://github.com/neuralmind-ai/portuguese-bert

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/portuguese/Export_BERT_model_to_Spark_NLP_BertEmbeddings.ipynb)

Install TensorFlow 1.15.0

In [None]:
! pip install -q tensorflow==1.15.0 tensorflow-hub

ERROR: tensorflow-probability 0.11.0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.


We need the `modeling` module from the original BERT repo

In [None]:
!git clone https://github.com/google-research/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.85 KiB | 4.18 MiB/s, done.
Resolving deltas: 100% (185/185), done.


Import TensorFlow (in v1 compatibility mode) and the `modeling` module from the cloned BERT repo. We'll start Apache Spark with Spark NLP later, when we save the `BertEmbeddings` model

In [None]:
import sys
import tensorflow.compat.v1 as tf
from bert import modeling
import shutil
import os
from shutil import copyfile

tf.get_logger().setLevel('WARN')
tf.disable_v2_behavior()

print(tf.__version__)
print(tf.keras.__version__)

Instructions for updating:
non-resource variables are not supported in the long term
1.15.0
2.2.4-tf


In [None]:
def save_model(config_path, meta_path, ckpt_path, export_dir):

    with tf.Graph().as_default():
        tf.random.set_random_seed(44)
        # these names are important, we look for these in Spark NLP when we feed the BERT model
        bert_inputs = dict(
            input_ids=tf.placeholder(dtype=tf.int32, shape=(None, None), name="input_ids"),
            input_mask=tf.placeholder(dtype=tf.int32, shape=(None, None), name="input_mask"),
            segment_ids=tf.placeholder(dtype=tf.int32, shape=(None, None), name="segment_ids")
        )

        with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                              log_device_placement=False)) as sess:

            with tf.device('/gpu:0'):

                bert_config = modeling.BertConfig.from_json_file(config_path)

                model = modeling.BertModel(
                    config=bert_config,
                    is_training=False,
                    input_ids=bert_inputs['input_ids'],
                    input_mask=bert_inputs['input_mask'],
                    token_type_ids=bert_inputs['segment_ids'],
                    use_one_hot_embeddings=False
                )

                # this name is important, we look for this when we want to fetch the result
                # as you already guessed, you can do whatever you want within the TensorFlow with this output
                # as long as the result is DT_FLOAT with the shape of (-1, -1, 768) you can use the same name 
                # and access the results in Spark NLP               
                sequence_output = tf.identity(model.get_sequence_output(), name="sequence_output")
                bert_outputs = dict(
                    sequence_output=sequence_output
                )

                tf.train.Saver().restore(sess, ckpt_path)

                # Initialize lookup tables only. Running tf.global_variables_initializer()
                # here would overwrite the weights just restored from the checkpoint.
                init_op = tf.initializers.tables_initializer(name='init_all_tables')

                sess.run(init_op)

                shutil.rmtree(export_dir, ignore_errors=True)

                tf.saved_model.simple_save(
                    sess,
                    export_dir,
                    inputs=bert_inputs,
                    outputs=bert_outputs,
                    legacy_init_op=init_op
                )

In [None]:
# Let's download some BERT Checkpoints
!wget https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/bert-base-portuguese-cased_tensorflow_checkpoint.zip
!unzip bert-base-portuguese-cased_tensorflow_checkpoint.zip -d bert-base-portuguese-cased_tensorflow_checkpoint

# The Portuguese vocab.txt is not included in the checkpoint archive,
# so it has to be downloaded separately
# (most BERT models ship with vocab.txt included)
!wget -P bert-base-portuguese-cased_tensorflow_checkpoint "https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/vocab.txt"

--2020-09-10 11:54:28--  https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/bert-base-portuguese-cased_tensorflow_checkpoint.zip
Resolving neuralmind-ai.s3.us-east-2.amazonaws.com (neuralmind-ai.s3.us-east-2.amazonaws.com)... 52.219.88.48
Connecting to neuralmind-ai.s3.us-east-2.amazonaws.com (neuralmind-ai.s3.us-east-2.amazonaws.com)|52.219.88.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1205655266 (1.1G) [application/zip]
Saving to: ‘bert-base-portuguese-cased_tensorflow_checkpoint.zip’


2020-09-10 11:54:42 (85.6 MB/s) - ‘bert-base-portuguese-cased_tensorflow_checkpoint.zip’ saved [1205655266/1205655266]

Archive:  bert-base-portuguese-cased_tensorflow_checkpoint.zip
  inflating: bert-base-portuguese-cased_tensorflow_checkpoint/bert_config.json  
  inflating: bert-base-portuguese-cased_tensorflow_checkpoint/model.ckpt.data-00000-of-00001  
  inflating: bert-base-portuguese-cased_tensorflow_checkpoint/model.ckpt.index  
  i
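
Because `vocab.txt` is downloaded separately here, it can be worth confirming that it belongs to this checkpoint. The sketch below compares the number of entries in `vocab.txt` with the `vocab_size` field of `bert_config.json`. The helper name `check_vocab_matches_config` is hypothetical (it is not part of BERT, TensorFlow, or Spark NLP), and it assumes one token per line in `vocab.txt`.

```python
import json
from pathlib import Path


def check_vocab_matches_config(checkpoint_dir):
    """Return (entries in vocab.txt, vocab_size from bert_config.json, whether they match).

    Hypothetical helper for illustration; assumes one token per line in vocab.txt.
    """
    checkpoint_dir = Path(checkpoint_dir)
    with open(checkpoint_dir / "bert_config.json", encoding="utf-8") as f:
        declared = json.load(f)["vocab_size"]
    with open(checkpoint_dir / "vocab.txt", encoding="utf-8") as f:
        entries = sum(1 for _ in f)
    return entries, declared, entries == declared
```

For this notebook it could be called as `check_vocab_matches_config('bert-base-portuguese-cased_tensorflow_checkpoint')`; a mismatch would suggest that the vocabulary file does not belong to the checkpoint.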

In [None]:
def export_bert(pretrain_path, save_path):

    config_path = pretrain_path + '/bert_config.json'
    meta_path = pretrain_path + '/model.ckpt.meta'
    ckpt_path = pretrain_path + '/model.ckpt'
    vocab = pretrain_path + '/vocab.txt'

    save_model(config_path, meta_path, ckpt_path, save_path)
    os.makedirs(os.path.dirname(save_path+"/assets/"), exist_ok=True)
    # Spark NLP needs vocab.txt in assets with the same name
    copyfile(vocab, save_path+"/assets/vocab.txt")

In [None]:
export_bert('/content/bert-base-portuguese-cased_tensorflow_checkpoint', './bert_saved_models/bert-base-portuguese-cased')





The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.simple_save.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
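
After the export, the target folder should contain the SavedModel itself plus the `assets/vocab.txt` file that `export_bert` copies in. The following is a minimal sketch of a layout check, assuming the standard SavedModel layout (`saved_model.pb` and a `variables/` folder). The helper name `missing_export_files` is hypothetical, and the list of expected files is an assumption rather than a Spark NLP specification.

```python
from pathlib import Path

# Files we expect after export_bert(); this list is an assumption based on the
# standard SavedModel layout plus the vocab.txt copied into assets/.
EXPECTED_FILES = (
    "saved_model.pb",
    "variables/variables.index",
    "assets/vocab.txt",
)


def missing_export_files(export_dir):
    """Return the expected files that are absent from export_dir (hypothetical helper)."""
    export_dir = Path(export_dir)
    return [rel for rel in EXPECTED_FILES if not (export_dir / rel).is_file()]
```

An empty list from `missing_export_files('./bert_saved_models/bert-base-portuguese-cased')` would indicate that the folder has the files this notebook relies on.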


This is what the SavedModel looks like in terms of inputs and outputs:

In [None]:
!saved_model_cli show --all --dir /content/bert_saved_models/bert-base-portuguese-cased/


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_ids'] tensor_info:
        dtype: DT_INT32
        shape: (-1, -1)
        name: input_ids:0
    inputs['input_mask'] tensor_info:
        dtype: DT_INT32
        shape: (-1, -1)
        name: input_mask:0
    inputs['segment_ids'] tensor_info:
        dtype: DT_INT32
        shape: (-1, -1)
        name: segment_ids:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['sequence_output'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1, 768)
        name: sequence_output:0
  Method name is: tensorflow/serving/predict
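
The tensor names, dtypes and shapes in this listing are what matter for Spark NLP, as noted in the comments of `save_model`. The sketch below encodes those expectations as a small pure-Python check. The helper `signature_problems` is hypothetical, and it expects the signature to be transcribed by hand into dictionaries of `name -> (dtype, shape)`; it does not parse the `saved_model_cli` output itself.

```python
# Names taken from the placeholders and the output tensor defined in save_model().
EXPECTED_INPUTS = ("input_ids", "input_mask", "segment_ids")
EXPECTED_OUTPUT = "sequence_output"


def signature_problems(inputs, outputs, hidden_size=768):
    """List mismatches between a signature and what this notebook exports.

    inputs / outputs: dicts mapping tensor name -> (dtype string, shape tuple).
    Hypothetical helper for illustration only.
    """
    problems = []
    for name in EXPECTED_INPUTS:
        if name not in inputs:
            problems.append("missing input: " + name)
            continue
        dtype, shape = inputs[name]
        if dtype != "DT_INT32" or tuple(shape) != (-1, -1):
            problems.append("unexpected dtype/shape for input: " + name)
    if EXPECTED_OUTPUT not in outputs:
        problems.append("missing output: " + EXPECTED_OUTPUT)
    else:
        dtype, shape = outputs[EXPECTED_OUTPUT]
        if dtype != "DT_FLOAT" or tuple(shape) != (-1, -1, hidden_size):
            problems.append("unexpected dtype/shape for output: " + EXPECTED_OUTPUT)
    return problems
```

Passing the values from the listing above should return an empty list. For a model with a different hidden size, `hidden_size` would need to be changed along with the `setDimension` value used when loading the model.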


Let's load our new BERT SavedModel from TF as a `BertEmbeddings` model in Spark NLP:

Let's set up Apache Spark and Java first (only needed on Google Colab)

In [None]:
import os

# Install java
! apt-get update -qq  > /dev/null
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install -q pyspark==2.4.6
! pip install -q spark-nlp

openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)


In [None]:
import sparknlp
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.base import *

spark=sparknlp.start()

In [None]:
# we need to pass the path to the SavedModel and
# the active SparkSession
bert = BertEmbeddings.loadSavedModel('/content/bert_saved_models/bert-base-portuguese-cased/', spark)\
 .setInputCols(["sentence", "token"])\
 .setOutputCol("bert")\
 .setCaseSensitive(True)\
 .setDimension(768)

The `bert` variable is now the final `BertEmbeddings` model. You can use it directly in your Pipeline, or save it and load it later, just like a pretrained model, without keeping the original BERT SavedModel around:

In [None]:
bert.write().save('./BertEmbeddings_bert-base-portuguese-cased')

Let's use our Portuguese `BertEmbeddings` model in a pipeline for a test:

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

# you can load an offline model by using .load(PATH)
bert = BertEmbeddings.load('/content/BertEmbeddings_bert-base-portuguese-cased') \
 .setInputCols(["sentence", "token"])\
 .setOutputCol("bert")

pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        bert        
    ]
)


In [None]:
prediction_data = spark.createDataFrame([["A alemanha é um lugar legal"]]).toDF("text")

prediction = pipeline.fit(prediction_data).transform(prediction_data)


In [None]:
# Tokens as carried in the `bert` annotations (the `result` field)
prediction.select("bert.result").show(1, False)

+----------------------------------+
|result                            |
+----------------------------------+
|[A, alemanha, é, um, lugar, legal]|
+----------------------------------+



In [None]:
# Embeddings from Portuguese BERT SavedModel
prediction.select("bert.embeddings").show(1, truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                          embeddings|
+----------------------------------------------------------------------------------------------------+
|[[0.7554269, -1.4238819, 0.2617143, -0.39890784, 0.1543039, 0.07270624, 0.2696601, -0.39731884, -...|
+----------------------------------------------------------------------------------------------------+
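
To work with these values outside of Spark, the tokens and their vectors can be paired up once a row has been collected. This is a minimal sketch in plain Python: the helper `pair_tokens_with_vectors` is hypothetical, and the default `expected_dim=768` is taken from the `setDimension(768)` call above.

```python
def pair_tokens_with_vectors(tokens, vectors, expected_dim=768):
    """Pair each token with its embedding and check the vector size.

    tokens: list of strings; vectors: list of lists of floats.
    Hypothetical helper for illustration only.
    """
    if len(tokens) != len(vectors):
        raise ValueError("number of tokens and vectors differ")
    for token, vector in zip(tokens, vectors):
        if len(vector) != expected_dim:
            raise ValueError("unexpected embedding size for token: " + token)
    # A list of pairs (not a dict) keeps repeated tokens and their order.
    return list(zip(tokens, vectors))
```

Assuming `prediction` from the cells above, one way to use it would be `row = prediction.select("bert.result", "bert.embeddings").first()` followed by `pair_tokens_with_vectors(row[0], row[1])`.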



You can remove everything except `BertEmbeddings_bert-base-portuguese-cased`, which is all you need. You can zip it and download it for later! :)

In [None]:
!zip -r /content/BertEmbeddings_bert-base-portuguese-cased.zip /content/BertEmbeddings_bert-base-portuguese-cased/

  adding: content/BertEmbeddings_bert-base-portuguese-cased/ (stored 0%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/bert_tensorflow (deflated 0%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/ (stored 0%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/vocabulary/ (stored 0%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/vocabulary/.part-00001.crc (stored 0%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/vocabulary/.part-00000.crc (stored 0%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/vocabulary/part-00000 (deflated 78%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/vocabulary/part-00001 (deflated 78%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/vocabulary/_SUCCESS (stored 0%)
  adding: content/BertEmbeddings_bert-base-portuguese-cased/fields/vocabulary/._SUCCESS.crc (stored 0%)
  adding: content/BertEmbeddings_bert-base-