![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/Fine_Tuned_Sentence_Bert_in_Spark_NLP.ipynb)

# Exporting Fine Tuned Sentence-BERT Models and Importing them into Spark NLP 🚀

This notebook walks through exporting a fine-tuned BERT model that generates sentence embeddings and importing it into Spark NLP. First, let's install the dependencies we need.

In [None]:
!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

!pip install -q transformers==4.30.0 tensorflow==2.11.0
!pip install -q datasets

Installing PySpark 3.2.3 and Spark NLP 5.2.3
setup Colab for PySpark 3.2.3 and Spark NLP 5.2.3

In [None]:
from transformers import AutoTokenizer, TFAutoModelForMaskedLM
from transformers import TFBertModel, BertTokenizer, TFBertForMaskedLM
import tensorflow as tf

In [None]:
OUTPUT_PATH = "/content/sbert_tf"
! mkdir -p $OUTPUT_PATH

## Exporting original models

We first export the original model and import it into Spark NLP. We will use it later, to compare it to the fine-tuned one.

In [None]:
MODEL_NAME = "bert-base-cased"
# load the tokenizer, then save a copy so its vocab.txt can be reused later
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained("{}/{}_tokenizer".format(OUTPUT_PATH, MODEL_NAME))

model = TFBertModel.from_pretrained(MODEL_NAME, from_pt=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already

In [None]:
def mean_pooling(token_embeddings, attention_mask):
    input_mask_expanded = tf.cast(
        tf.repeat(
            tf.expand_dims(attention_mask, -1),
            repeats=token_embeddings.shape[-1],
            axis=-1,
        ),
        tf.float32,
    )
    return tf.reduce_sum(
        token_embeddings * input_mask_expanded, axis=1
    ) / tf.clip_by_value(
        tf.reduce_sum(input_mask_expanded, axis=1),
        clip_value_min=1e-9,
        clip_value_max=4096,
    )


# Define TF Signature
@tf.function(
    input_signature=[
        {
            "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
            "attention_mask": tf.TensorSpec(
                (None, None), tf.int32, name="attention_mask"
            ),
            "token_type_ids": tf.TensorSpec(
                (None, None), tf.int32, name="token_type_ids"
            ),
        }
    ]
)
def serving_fn(input):
    outputs = model(input, output_hidden_states=True)
    # compute sentence embedding by averaging token embeddings
    pooler_output = mean_pooling(outputs.hidden_states[-1], input["attention_mask"])
    # compute sentence embedding by taking the built in pooler output,
    # which currently is actually the CLS embedding. This doesn't work well,
    # so avoid using it
    # pooled_output = outputs.pooler_output
    return {"pooler_output": pooler_output}
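
To make it easier to see what the `mean_pooling` helper above computes, here is a minimal sketch of the same idea in plain Python for a single sentence. The function name `masked_mean_pooling` and the toy vectors are hypothetical and only serve as illustration; the notebook itself uses the TensorFlow version.

```python
def masked_mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions.

    token_embeddings: list of token vectors (each a list of floats)
    attention_mask:   list of 0/1 flags, 1 for real tokens, 0 for padding
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                sums[j] += value
    # Same guard as the lower clip bound in mean_pooling: avoid dividing by zero
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]


# Hypothetical toy input: three real tokens and one padding token, dimension 2
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
toy_mask = [1, 1, 1, 0]
pooled = masked_mean_pooling(toy_embeddings, toy_mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector does not influence the result, which is the reason the attention mask is passed into the pooling step.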

In [None]:
# save model to local directory

MODEL_NAME_w_sign = "./{}_w_sign".format(MODEL_NAME)

model.save_pretrained(
    "{}/{}".format(OUTPUT_PATH, MODEL_NAME_w_sign),
    saved_model=True,
    signatures={"serving_default": serving_fn},
)



In [None]:
!cp {OUTPUT_PATH}/{MODEL_NAME}_tokenizer/vocab.txt {OUTPUT_PATH}/{MODEL_NAME_w_sign}/saved_model/1/assets

In [None]:
import sparknlp

from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql import functions as F

spark = sparknlp.start()

spark

In [None]:
sent_bert = (
    BertSentenceEmbeddings.loadSavedModel(
        "{}/{}/saved_model/1".format(OUTPUT_PATH, MODEL_NAME_w_sign), spark
    )
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")
    .setCaseSensitive(True)
    .setDimension(768)
    .setStorageRef("sent_bert_base_cased")
)

In [None]:
sent_bert.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = (
    SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)

embeddings = (
    BertSentenceEmbeddings.load("./{}_spark_nlp".format(MODEL_NAME))
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings")
)

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

text = [["I hate cancer"], ["Antibiotics aren't painkiller"]]

data = spark.createDataFrame(text).toDF("text")

result = nlp_pipeline.fit(data).transform(data)

In [None]:
result.select(
    F.explode(
        F.arrays_zip(result.sentence.result, result.sentence_embeddings.embeddings)
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("embeddings")
).show(
    truncate=100
)

+-----------------------------+----------------------------------------------------------------------------------------------------+
|                     sentence|                                                                                          embeddings|
+-----------------------------+----------------------------------------------------------------------------------------------------+
|                I hate cancer|[0.675583, 0.05248031, -0.2677794, -0.02619921, -0.068684764, -0.038617752, 0.29574826, 0.0209077...|
|Antibiotics aren't painkiller|[0.3458845, -0.06992405, 0.15711522, 0.36460966, -0.04376867, -0.21441574, -0.3123266, 0.00353415...|
+-----------------------------+----------------------------------------------------------------------------------------------------+



Let's restart the session at this point, so we have some more RAM available.

## Training and Exporting custom fine-tuned models

In this section, we fine-tune `bert-base-cased` on the `wikitext` data set. To produce sentence embeddings, we also need a pooling operation over the token embeddings.

First, we load the pretrained model and the data set.

In [None]:
from transformers import AutoTokenizer, BertTokenizer, TFAutoModelForMaskedLM

OUTPUT_PATH = "/content/sbert_tf"

MODEL_NAME = "bert-base-cased"
# load the tokenizer, then save a copy so its vocab.txt can be reused later
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained("{}/{}_tokenizer".format(OUTPUT_PATH, MODEL_NAME))

model = TFAutoModelForMaskedLM.from_pretrained(MODEL_NAME, from_pt=True)

All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


### Data Set Pre-processing

In [None]:
from datasets import load_dataset


dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

We need to tokenize the data to create token ids and preprocess the text into batches, so that the model can accept it as input.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])


tokenized_datasets = dataset.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (546 > 512). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (574 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (529 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (686 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (528 > 512). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]

In [None]:
# block_size = tokenizer.model_max_length
block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, though you could add padding instead if the model supports it
    # In this, as in all things, we advise you to follow your heart
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
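
To illustrate what `group_texts` does, the following sketch applies the same concatenate-and-chunk logic to a tiny hand-made batch. The helper name `group_into_blocks`, the block size of 4, and the token ids are made up for the example; they are not taken from the real data set.

```python
def group_into_blocks(examples, block_size):
    # Concatenate every column (e.g. input_ids, attention_mask) across examples
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For masked language modeling, labels start out as a copy of the inputs
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


# Hypothetical tokenized batch: 13 tokens in total
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 10, 11, 12, 13, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])
# [[101, 7, 8, 102], [101, 9, 102, 101], [10, 11, 12, 13]]
```

Note that blocks cross sentence boundaries and that the 13th token is dropped because it does not fill a complete block.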

Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]

### Fine Tuning

Now we can start the training.

In [None]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

model.compile(optimizer=optimizer, jit_compile=True, metrics=["accuracy"])

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="np"
)
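
For intuition about what the collator does with `mlm_probability=0.15`, here is a simplified sketch of the usual BERT masking recipe in plain Python: roughly 15% of the non-special tokens are selected, and of those about 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The function `mask_tokens`, the token ids, and the vocabulary size below are assumptions for illustration; the real collator works on batched arrays and may differ in detail.

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    if rng is None:
        rng = random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token in enumerate(input_ids):
        # Special tokens such as [CLS] and [SEP] are never selected
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = token  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = mask_id  # about 80%: replace with the mask token
        elif roll < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # about 10%: random token
        # remaining ~10%: keep the original token in the input
    return masked, labels


# Hypothetical ids; in the notebook they would come from the tokenizer
toy_ids = [101, 1188, 1110, 170, 2774, 5650, 102]
masked_ids, toy_labels = mask_tokens(
    toy_ids, mask_id=103, vocab_size=28996, special_ids={101, 102},
    mlm_probability=0.15, rng=random.Random(0),
)
print(masked_ids)
print(toy_labels)
```

Positions with label `-100` do not contribute to the loss, so the model is only trained on the selected tokens.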

In [None]:
train_set = model.prepare_tf_dataset(
    lm_datasets["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

validation_set = model.prepare_tf_dataset(
    lm_datasets["validation"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

As an example, we train for only 1 epoch with 10 steps per epoch. For a serious fine-tuning run, you will want to choose higher values.

In [None]:
model.fit(
    train_set,
    epochs=1,
    steps_per_epoch=10,
)

Cause: for/else statement not yet supported


Cause: for/else statement not yet supported


<keras.callbacks.History at 0x7e9e93727a90>

In [None]:
# save the model in case Hugging Face checkpoints are needed in the future
FINETUNED_MODEL_NAME = f"{OUTPUT_PATH}/{MODEL_NAME}_fine-tuned"


model.save_pretrained(FINETUNED_MODEL_NAME, saved_model=True)



We just saved the fine-tuned model as a Hugging Face checkpoint. However, to import it into Spark NLP we need to modify the signature of the model. As previously mentioned, we create sentence embeddings by pooling the token embeddings. We define a new model signature that includes the `mean_pooling` operation, and then save the custom model.

In [None]:
def mean_pooling(token_embeddings, attention_mask):
    input_mask_expanded = tf.cast(
        tf.repeat(
            tf.expand_dims(attention_mask, -1),
            repeats=token_embeddings.shape[-1],
            axis=-1,
        ),
        tf.float32,
    )
    return tf.reduce_sum(
        token_embeddings * input_mask_expanded, axis=1
    ) / tf.clip_by_value(
        tf.reduce_sum(input_mask_expanded, axis=1),
        clip_value_min=1e-9,
        clip_value_max=4096,
    )


# Define TF Signature
@tf.function(
    input_signature=[
        {
            "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
            "attention_mask": tf.TensorSpec(
                (None, None), tf.int32, name="attention_mask"
            ),
            "token_type_ids": tf.TensorSpec(
                (None, None), tf.int32, name="token_type_ids"
            ),
        }
    ]
)
def serving_fn(input):
    outputs = model(input, output_hidden_states=True)
    # compute sentence embedding by averaging token embeddings
    pooler_output = mean_pooling(outputs.hidden_states[-1], input["attention_mask"])
    # compute sentence embedding by taking the built in pooler output,
    # which currently is actually the CLS embedding. This doesn't work well,
    # so avoid using it
    # pooled_output = outputs.pooler_output
    return {"pooler_output": pooler_output}

In [None]:
# Save model to local directory

model.save_pretrained(
    "{}_w_sign".format(FINETUNED_MODEL_NAME),
    saved_model=True,
    signatures={"serving_default": serving_fn},
)



In [None]:
FINETUNED_MODEL_NAME_w_sign = f"{FINETUNED_MODEL_NAME}_w_sign"

!cp {OUTPUT_PATH}/{MODEL_NAME}_tokenizer/vocab.txt {FINETUNED_MODEL_NAME_w_sign}/saved_model/1/assets

## Importing the model into Spark NLP

It's best to restart the runtime again here, so we don't go over the RAM limit.
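
After a restart it can be useful to confirm that the export on disk is complete before loading it. The sketch below checks a directory for the entries the earlier steps produce (`saved_model.pb`, `variables`, and the copied `assets/vocab.txt`). The helper `missing_export_files` and the list of expected entries are assumptions based on the export steps in this notebook, and the demo runs on a throwaway directory rather than the real export.

```python
import tempfile
from pathlib import Path

# Entries expected in the SavedModel directory (assumption, see lead-in)
EXPECTED_ENTRIES = ("saved_model.pb", "variables", "assets/vocab.txt")


def missing_export_files(export_dir):
    """Return the expected entries that are not present under export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Demo on a temporary directory that mimics an export before and after
# vocab.txt has been copied into the assets folder
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp)
    (export / "saved_model.pb").touch()
    (export / "variables").mkdir()
    (export / "assets").mkdir()
    before_copy = missing_export_files(export)
    (export / "assets" / "vocab.txt").touch()
    after_copy = missing_export_files(export)

print(before_copy)  # ['assets/vocab.txt']
print(after_copy)   # []
```

In this notebook, the directory to check would be the `saved_model/1` folder of the fine-tuned export; an empty list means nothing is missing.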

In [None]:
import sparknlp

from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql import functions as F

spark = sparknlp.start()

spark



In [None]:
OUTPUT_PATH = "/content/sbert_tf"
MODEL_NAME = "bert-base-cased"
FINETUNED_MODEL_NAME = f"{OUTPUT_PATH}/{MODEL_NAME}_fine-tuned"

sent_bert = (
    BertSentenceEmbeddings.loadSavedModel(
        f"{FINETUNED_MODEL_NAME}_w_sign/saved_model/1", spark
    )
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")
    .setCaseSensitive(True)
    .setDimension(768)
    .setStorageRef("sent_bert_base_cased")
)

In [None]:
sent_bert.write().overwrite().save("./{}_fine-tuned_spark_nlp".format(MODEL_NAME))

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = (
    SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)

embeddings = (
    BertSentenceEmbeddings.load("./{}_fine-tuned_spark_nlp".format(MODEL_NAME))
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings")
)

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

text = [["I hate cancer"], ["Antibiotics aren't painkiller"]]

data = spark.createDataFrame(text).toDF("text")

result = nlp_pipeline.fit(data).transform(data)

In [None]:
result.select(
    F.explode(
        F.arrays_zip(result.sentence.result, result.sentence_embeddings.embeddings)
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("embeddings")
).show(
    truncate=100
)

+-----------------------------+----------------------------------------------------------------------------------------------------+
|                     sentence|                                                                                          embeddings|
+-----------------------------+----------------------------------------------------------------------------------------------------+
|                I hate cancer|[0.6494873, 0.073490426, -0.29895884, -0.009830964, -0.09348484, -0.039925538, 0.3101672, 0.02736...|
|Antibiotics aren't painkiller|[0.28350386, -0.09607246, 0.11028457, 0.36982596, -0.1297523, -0.2121249, -0.3344884, 0.008855367...|
+-----------------------------+----------------------------------------------------------------------------------------------------+



## Inference: Comparing the fine-tuned and the original model

We can now compare the embeddings of the base model and the fine-tuned model. We use cosine similarity as the measure: values close to 1 mean the two models produce nearly the same direction for a sentence.
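
As a reminder of what cosine similarity measures, here is a minimal sketch in plain Python. The function name `cosine_sim` and the toy vectors are hypothetical; the notebook itself relies on scikit-learn for the actual comparison.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_sim([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

The measure ignores vector length, so two embeddings that differ only in scale still score 1.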

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = (
    SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)

embeddings_fine_tuned = (
    BertSentenceEmbeddings.load("./{}_fine-tuned_spark_nlp".format(MODEL_NAME))
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings_finetuned")
)

embeddings_original = (
    BertSentenceEmbeddings.load("./{}_spark_nlp".format(MODEL_NAME))
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings_original")
)


nlp_pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        embeddings_fine_tuned,
        embeddings_original,
    ]
)

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

text = [["I hate cancer"], ["Antibiotics aren't painkiller"]]

data = spark.createDataFrame(text).toDF("text")

data = data.coalesce(1).withColumn("index", monotonically_increasing_id())

result = nlp_pipeline.fit(data).transform(data)

In [None]:
import pyspark.sql.functions as F

df = result.select(
    "index",
    F.explode(
        F.arrays_zip(
            result.sentence.result,
            result.sentence_embeddings_finetuned.embeddings,
            result.sentence_embeddings_original.embeddings,
        )
    ).alias("cols"),
).select(
    "index",
    F.expr("cols['0']").alias("sentence"),
    F.expr("cols['1']").alias("sentence_embeddings_finetuned"),
    F.expr("cols['2']").alias("sentence_embeddings_original"),
)
df.show(truncate=50)

+-----+-----------------------------+--------------------------------------------------+--------------------------------------------------+
|index|                     sentence|                     sentence_embeddings_finetuned|                      sentence_embeddings_original|
+-----+-----------------------------+--------------------------------------------------+--------------------------------------------------+
|    0|                I hate cancer|[0.6494875, 0.07349018, -0.29895863, -0.0098310...|[0.67558324, 0.052480347, -0.2677792, -0.026199...|
|    1|Antibiotics aren't painkiller|[0.28350395, -0.096072316, 0.11028453, 0.369825...|[0.34588462, -0.06992395, 0.15711544, 0.3646099...|
+-----+-----------------------------+--------------------------------------------------+--------------------------------------------------+



In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

pdf = df.toPandas()

X = np.stack(pdf.sentence_embeddings_original.values)
Y = np.stack(pdf.sentence_embeddings_finetuned.values)
sk_sim = cosine_similarity(X, Y)


# iterate over the rows already collected in pdf instead of re-counting df
for i in range(len(pdf)):
    df.filter(F.col("index") == i).select(
        "sentence", "sentence_embeddings_original"
    ).show(truncate=False)
    df.filter(F.col("index") == i).select(
        "sentence", "sentence_embeddings_finetuned"
    ).show(truncate=False)
    print(f"cos_sim: {sk_sim[i, i]}\n\n\n")

+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------