[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QC-kBux1ZltvboH5-TXJML86yZT_ZVc7?usp=sharing)

# Semantic similarity using pretrained encoders


source: https://keras.io/examples/nlp/semantic_similarity_with_keras_hub/

## Introduction

Semantic similarity is the task of determining how close two sentences are in meaning.
This example uses the SNLI (Stanford Natural Language Inference) corpus to predict sentence semantic similarity.
 - https://nlp.stanford.edu/projects/snli/


Browse the architectures available in KerasHub:

- https://keras.io/keras_hub/api/models/

More pretrained checkpoints are available on Kaggle Models:
- https://www.kaggle.com/models

In [None]:
# !pip install -q --upgrade keras-hub
# !pip install -q --upgrade keras  # Upgrade to Keras 3.

In [None]:
import numpy as np
import tensorflow as tf
import keras
import keras_hub
import tensorflow_datasets as tfds

To load the SNLI dataset, we use the tensorflow-datasets library. The dataset
contains over 550,000 samples in total. However, to ensure that this example runs
quickly, we use only 20% of the training samples.

## Overview of SNLI Dataset

Every sample in the dataset contains three components: `hypothesis`, `premise`,
and `label`. The premise represents the original caption provided to the author of the pair,
while the hypothesis refers to the hypothesis caption created by the author of
the pair. The label is assigned by annotators to indicate the similarity between
the two sentences.

The dataset contains three possible similarity label values: Contradiction, Entailment,
and Neutral. Contradiction represents completely dissimilar sentences, while Entailment
denotes similar meaning sentences. Lastly, Neutral refers to sentences where no clear
similarity or dissimilarity can be established between them.
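
In the loaded data the label arrives as an integer rather than a name. A minimal sketch of decoding it, assuming the ordering 0 = entailment, 1 = neutral, 2 = contradiction (the same mapping used in the inference cell of this notebook) and -1 for pairs without a usable label. `decode_label` is a hypothetical helper name, not part of any library:

```python
# Assumed label ordering for the TFDS "snli" dataset; -1 marks a missing label.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode_label(label_id):
    """Return the label name, or 'unlabeled' for ids outside the mapping (e.g. -1)."""
    return LABEL_NAMES.get(int(label_id), "unlabeled")


decoded = [decode_label(i) for i in [0, 2, -1, 1]]
print(decoded)
```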

In [None]:
snli_train = tfds.load("snli", split="train[:20%]")
snli_val = tfds.load("snli", split="validation")
snli_test = tfds.load("snli", split="test")

# Here is what the samples look like: we take the first batch of four
# samples from the test split.
sample = snli_test.batch(4).take(1).get_single_element()
sample

{'hypothesis': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A girl is entertaining on stage',
        b'A group of people posing in front of a body of water.',
        b"The group of people aren't inide of the building.",
        b'The people are taking a carriage ride.'], dtype=object)>,
 'label': <tf.Tensor: shape=(4,), dtype=int64, numpy=array([0, 0, 0, 0])>,
 'premise': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A girl in a blue leotard hula hoops on a stage with balloon shapes in the background.',
        b'A group of people taking pictures on a walkway in front of a large body of water.',
        b'Many people standing outside of a place talking to each other in front of a building that has a sign that says "HI-POINTE."',
        b'Three people are riding a carriage pulled by four horses.'],
       dtype=object)>}

### Preprocessing

In our dataset, we have identified that some samples have missing or incorrectly labeled
data, which is denoted by a value of -1. To ensure the accuracy and reliability of our model,
we simply filter out these samples from our dataset.

In [None]:

def filter_labels(sample):
    return sample["label"] >= 0
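
To see what this predicate does without loading the dataset, here is a plain-Python sketch of the same rule applied to a few in-memory records. The sentences and the `keep_labeled` name are made up for illustration; only the `label >= 0` condition comes from the cell above:

```python
# Hypothetical records shaped like the TFDS samples (plain Python, no TensorFlow).
samples = [
    {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": 0},
    {"premise": "A man sleeps.", "hypothesis": "A man is awake.", "label": 2},
    {"premise": "Two kids play.", "hypothesis": "They are siblings.", "label": -1},
]


def keep_labeled(sample):
    # Same predicate as filter_labels: drop samples whose label is -1.
    return sample["label"] >= 0


kept = [s for s in samples if keep_labeled(s)]
dropped = len(samples) - len(kept)
print(f"kept {len(kept)} of {len(samples)} samples, dropped {dropped}")
```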


Here's a utility function that splits the example into an `(x, y)` tuple that is suitable
for `model.fit()`. By default, `keras_hub.models.BertClassifier` will tokenize and pack
together raw strings using a `"[SEP]"` token during training. Therefore, this label
splitting is all the data preparation that we need to perform.

In [None]:

def split_labels(sample):
    x = (sample["hypothesis"], sample["premise"])
    y = sample["label"]
    return x, y


train_ds = (
    snli_train.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)
val_ds = (
    snli_val.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)
test_ds = (
    snli_test.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)
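
Because `.batch(16)` keeps the final partial batch by default, the number of steps per epoch is the sample count divided by the batch size, rounded up. The sketch below works through that arithmetic in both directions. The helper names and the sample count of 1000 are hypothetical:

```python
import math


def steps_per_epoch(num_samples, batch_size=16):
    """Number of batches produced by .batch(batch_size) when the remainder is kept."""
    return math.ceil(num_samples / batch_size)


def sample_range_for_steps(steps, batch_size=16):
    """Smallest and largest sample counts that yield exactly `steps` batches."""
    return (steps - 1) * batch_size + 1, steps * batch_size


steps = steps_per_epoch(1000)  # hypothetical dataset size
low, high = sample_range_for_steps(steps)
print(f"{steps} steps, consistent with {low} to {high} samples")
```

The second helper is handy for reading a progress bar: a step count reported during training pins the number of filtered samples down to a window of one batch.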


## Establishing a baseline with BERT

We use the BERT model from KerasHub to establish a baseline for our semantic similarity
task. The `keras_hub.models.BertClassifier` class attaches a classification head to the BERT
Backbone.

KerasHub models have built-in tokenization capabilities that handle tokenization by default
based on the selected model. However, users can also use custom preprocessing techniques
as per their specific needs. If we pass a tuple as input, the model will tokenize all the
strings and concatenate them with a `"[SEP]"` separator.
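
The packing described above can be illustrated with a toy version that uses whitespace tokens instead of WordPiece. This is a simplified sketch of BERT-style segment packing, not the KerasHub implementation: `pack_pair` is a hypothetical helper, and it raises an error on long inputs where a real preprocessor would truncate.

```python
def pack_pair(first, second, max_len=18):
    """Toy BERT-style packing: [CLS] first [SEP] second [SEP], padded to max_len."""
    a = first.lower().split()
    b = second.lower().split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and its [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    if len(tokens) > max_len:
        raise ValueError("input too long for this sketch (no truncation implemented)")
    pad = max_len - len(tokens)
    padding_mask = [1] * len(tokens) + [0] * pad  # 1 = real token, 0 = padding
    return tokens + ["[PAD]"] * pad, segment_ids + [0] * pad, padding_mask


tokens, segment_ids, padding_mask = pack_pair(
    "A girl is entertaining on stage",
    "A girl hula hoops on a stage",
)
print(tokens)
print(segment_ids)
print(padding_mask)
```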

We load this model with pretrained weights through the `from_preset()` method, which
also accepts a custom preprocessor if we want to supply our own. For the SNLI dataset, we set `num_classes` to 3.

Pretrained model: https://www.kaggle.com/models/keras/bert

In [None]:
bert_classifier = keras_hub.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=3
)

In [None]:
%%time
bert_classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(5e-5),
    metrics=["accuracy"],
)

bert_classifier.fit(train_ds, validation_data=val_ds, epochs=1)
bert_classifier.evaluate(test_ds)

6867/6867 ━━━━━━━━━━━━━━━━━━━━ 198s 26ms/step - accuracy: 0.6034 - loss: 0.8602 - val_accuracy: 0.7650 - val_loss: 0.5798
614/614 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - accuracy: 0.7673 - loss: 0.5710
CPU times: user 1min 51s, sys: 11 s, total: 2min 3s
Wall time: 3min 26s


[0.580228865146637, 0.7630293369293213]
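
The two values returned by `evaluate()` are the loss and the accuracy. To make the loss figure easier to interpret, here is a plain-Python sketch of sparse categorical cross-entropy computed from raw logits, which is what `from_logits=True` asks for. The logits are hypothetical, and the functions are illustrative rather than the Keras implementation:

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


logits = [2.0, 0.5, -1.0]  # hypothetical scores for the three classes
probs = softmax(logits)
loss_correct = sparse_categorical_crossentropy(logits, 0)  # true class scored highest
loss_wrong = sparse_categorical_crossentropy(logits, 2)    # true class scored lowest
uniform_loss = sparse_categorical_crossentropy([0.0, 0.0, 0.0], 1)
print(probs, loss_correct, loss_wrong, uniform_loss)
```

A model that guesses uniformly over three classes has a loss of ln 3, about 1.0986, so a test loss near 0.58 is well below the chance level.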

## Save and Reload the model

If you want to run several experiments, save the results of the current training and reload the original model before each new run.

In [None]:
# bert_classifier.save("bert_classifier.keras")
# restored_model = keras.models.load_model("bert_classifier.keras")
# restored_model.evaluate(test_ds)

**Next steps**: improve the model by fine-tuning a RoBERTa checkpoint.

## Inference example

In [None]:
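def check_similarity(model, sentence1, sentence2):
    # Class indices follow the SNLI label order used by the dataset.
    labels = ["entailment", "neutral", "contradiction"]
    # Pass the pair as a tuple of string batches, matching the input format used
    # for training, so the built-in preprocessor packs both segments itself.
    pair = (tf.constant([str(sentence1)]), tf.constant([str(sentence2)]))

    logits = model.predict(pair, verbose=0)[0]
    proba = tf.nn.softmax(logits).numpy()  # the classifier outputs raw logits
    idx = int(np.argmax(proba))
    return labels[idx], f"{proba[idx] * 100:.2f}%"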

In [None]:
label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}

hypotheses = [
    "A child is eating ice cream near the park.",
    "A woman is sleeping on the couch."
]
premises = [
    "The kid is enjoying a sweet outside.",
    "Someone is on the couch reading."
]

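# Combine both segments into one string with a literal "[SEP]" marker.
# Note: training passed (hypothesis, premise) tuples and let the preprocessor
# insert the separator, so this single-string form may be tokenized differently.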
inputs = [
    f"{h} [SEP] {p}"
    for (h, p) in zip(hypotheses, premises)
]

raw_logits = bert_classifier.predict(inputs, batch_size=2)
probs = tf.nn.softmax(raw_logits, axis=-1).numpy()

for i, combined in enumerate(inputs):
    pred_idx = int(np.argmax(probs[i]))
    print(f"Input: {combined}")
    print(f"Predicted label = {label_map[pred_idx]} (confidence {probs[i][pred_idx]:.2f})")
    print("------")

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 562ms/step
Input: A child is eating ice cream near the park. [SEP] The kid is enjoying a sweet outside.
Predicted label = neutral (confidence 0.49)
------
Input: A woman is sleeping on the couch. [SEP] Someone is on the couch reading.
Predicted label = contradiction (confidence 0.90)
------
