# **Testing Sentence Similarity**

The models I have tried so far for the chatbot have struggled with the input dataset: it is very small, and there are many different ways to phrase the same sentence. With about 50 different categories (classes) and only about 350 entries, accuracy has been very low. My goal here is to test whether a user's sentence is similar to a sentence already in the dataset, rather than predicting the category directly.

## The Model

For this I chose the model all-MiniLM-L6-v2, as it was the most popular and most recently updated option I found. I tried a few inputs on its Hugging Face model page, and it seemed to do fairly well for my intended use in the chatbot. Here is a link to the model:

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

I also used this guide to help create my model using Tensorflow:

https://www.philschmid.de/tensorflow-sentence-transformers

## Setting Up My Environment

First, I install `transformers` and `sentence-transformers`, which are prerequisites for the model.

In [None]:
!pip install transformers[tf] -q --upgrade
!pip install sentence-transformers -q



## Preprocessing

Here my data is in a CSV file in which each text input is mapped to a category. In this case we only care about the text inputs, since we will compare user inputs directly against them to check for similarity.

This dataset can be found here:
https://github.com/BridgetteBXP13/CS-4395.001---Human-Language-Technologies/blob/main/Chatbot/Data/Inputs.csv

In [None]:
# Import pandas
import pandas as pd
# Load our data
url = 'https://raw.githubusercontent.com/BridgetteBXP13/CS-4395.001---Human-Language-Technologies/main/Chatbot/Data/Inputs.csv'
df = pd.read_csv(url)
print("\nOur loaded dataframe:\n")
# Printing a long frame shows only its first and last rows
print(df)


Our loaded dataframe:

                                                 Input        Category
0                                Snakes are aggressive        Behavior
1                                 Are they aggressive         Behavior
2    Snakes will not bite unless you try to approac...        Behavior
3                              Do snakes like to bite?        Behavior
4                                  Snakes chase people        Behavior
..                                                 ...             ...
325                             Snakes are emotionless           Brain
326                               Can snakes grow hair            Body
327                                    Snakes are mean        Behavior
328                            Why should snakes exist  Snake Benefits
329                                Are snakes any good  Snake Benefits

[330 rows x 2 columns]

In [None]:
# Save the 'Input' column as a list of strings for TensorFlow
inputs = df["Input"].tolist()
print("\nOur first five inputs in our new list:\n")
print(inputs[:5])


Our first five inputs in our new list:

['Snakes are aggressive', 'Are they aggressive ', 'Snakes will not bite unless you try to approach/handle them.', 'Do snakes like to bite?', 'Snakes chase people']


## Create TensorFlow Model

Here I relied heavily on the guide mentioned above, as the instructions on Hugging Face were PyTorch-based and I wanted to use TensorFlow and Keras for this project. In these steps we create a Keras layer that wraps the pretrained model so it can be used from TensorFlow.

In [None]:
import tensorflow as tf
from transformers import TFAutoModel

class TFSentenceTransformer(tf.keras.layers.Layer):
    def __init__(self, model_name_or_path, **kwargs):
        super().__init__()
        # Load the pretrained transformers model
        self.model = TFAutoModel.from_pretrained(model_name_or_path, **kwargs)

    def call(self, inputs, normalize=True):
        # Run the model on the tokenized inputs
        model_output = self.model(inputs)
        # Perform pooling. In this case, mean pooling.
        embeddings = self.mean_pooling(model_output, inputs["attention_mask"])
        # Normalize the embeddings if wanted
        if normalize:
            embeddings = self.normalize(embeddings)
        return embeddings

    def mean_pooling(self, model_output, attention_mask):
        # First element of model_output contains all token embeddings
        token_embeddings = model_output[0]
        # Expand the mask to the embedding shape so padding tokens contribute zero
        input_mask_expanded = tf.cast(
            tf.broadcast_to(tf.expand_dims(attention_mask, -1), tf.shape(token_embeddings)),
            tf.float32
        )
        summed = tf.math.reduce_sum(token_embeddings * input_mask_expanded, axis=1)
        # Clip the token counts to avoid dividing by zero
        counts = tf.clip_by_value(
            tf.math.reduce_sum(input_mask_expanded, axis=1), 1e-9, tf.float32.max
        )
        return summed / counts

    def normalize(self, embeddings):
        # L2-normalize each embedding along the feature axis
        embeddings, _ = tf.linalg.normalize(embeddings, 2, axis=1)
        return embeddings
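
To make the pooling step in the layer above easier to follow, here is a minimal NumPy sketch of the same idea: average the token embeddings while ignoring padding positions, then scale each result to unit length. The function names `masked_mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are my own illustrative assumptions, not part of the guide or the model.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, skipping padding."""
    # Shape (batch, seq_len, 1) so the mask broadcasts over the embedding axis
    mask = attention_mask[..., np.newaxis].astype(np.float32)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token counts to avoid dividing by zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def l2_normalize(vectors):
    """Scale each row vector to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy batch (hypothetical values): 2 sentences, 3 token slots, 2-dim embeddings
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
], dtype=np.float32)
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

The padding slot in the first sentence holds large values on purpose: because its mask entry is 0, it has no effect on the pooled result.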

In [None]:
from transformers import AutoTokenizer

# Hugging Face model id
model_id = 'sentence-transformers/all-MiniLM-L6-v2'

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFSentenceTransformer(model_id)

# Run inference & create embeddings
encoded_input = tokenizer(inputs[:12], padding=True, truncation=True, return_tensors='tf')
sentence_embedding = model(encoded_input)
print("\nOur Embedded Sentence Tensorflow Shape:\n")
print(sentence_embedding.shape)

All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at sentence-transformers/all-MiniLM-L6-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.



Our Embedded Sentence Tensorflow Shape:

(12, 384)
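
Since the layer L2-normalizes its output by default, the cosine similarity between two sentence embeddings reduces to a dot product. The following is a minimal NumPy sketch of how a user input could be matched against the dataset embeddings. The function name `most_similar` and the toy 2-dimensional unit vectors are illustrative assumptions; they stand in for real 384-dimensional outputs such as `sentence_embedding.numpy()`.

```python
import numpy as np

def most_similar(query_embedding, corpus_embeddings, top_k=3):
    """Return (row index, cosine score) pairs for the top_k closest rows.

    Assumes every vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = corpus_embeddings @ query_embedding
    # Sort by descending score and keep the best top_k rows
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors (hypothetical) in place of real sentence embeddings
corpus = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
])
query = np.array([0.8, 0.6])

matches = most_similar(query, corpus, top_k=2)
print(matches)
```

With real embeddings, the returned index could be used to look up the matching row in `df` and its category. Whether a score counts as "similar enough" would need a threshold chosen by experiment.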


## Running Inference and Testing Results

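Here I run a few test sentences through both my TensorFlow model and the reference sentence-transformers model, and check whether the two produce the same embeddings. This verifies the TensorFlow port; it does not yet compare the sentences against the dataset.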

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer

compare_inputs = ["Whales love me", "Taxes are very high", "I lost my phone today", 
                  "Snakes can't breathe", "Snakes like people", "Snakes always bite people",
                  "Snakes lay eggs", "What snakes are venomous?"]

# loading sentence transformers
st_model = SentenceTransformer(model_id,device="cpu")
for compare_input in compare_inputs:
  # run inference with sentence transformers
  st_embeddings = st_model.encode(compare_input)
  # run inference with TFSentenceTransformer
  encoded_input = tokenizer(compare_input, return_tensors="tf")
  tf_embeddings =  model(encoded_input)

  # compare embeddings
  are_results_close = np.allclose(tf_embeddings.numpy()[0],st_embeddings, rtol=1e-30, atol=1e-07)
  print("Comparing: ", compare_input)
  print(f"Results close: {are_results_close}")

Comparing:  Whales love me
Results close: False
Comparing:  Taxes are very high
Results close: False
Comparing:  I lost my phone today
Results close: False
Comparing:  Snakes can't breathe
Results close: False
Comparing:  Snakes like people
Results close: True
Comparing:  Snakes always bite people
Results close: True
Comparing:  Snakes lay eggs
Results close: False
Comparing:  What snakes are venomous?
Results close: True
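
The mixed True/False results above may say more about the tolerance than about the models. `np.allclose(a, b)` passes when `|a - b| <= atol + rtol * |b|`, and with `rtol=1e-30` and `atol=1e-07` the allowed difference is close to float32 rounding precision. The sketch below uses made-up vectors (not real embeddings) and a hypothetical offset of `5e-7` to show how such a small difference fails the strict check but passes a looser one.

```python
import numpy as np

# Made-up float32 values standing in for a few embedding components
reference = np.array([0.12, -0.05, 0.33], dtype=np.float32)
# Hypothetical tiny numerical drift between two implementations
drifted = reference + np.float32(5e-7)

# Same tolerances as the notebook cell above
strict = np.allclose(drifted, reference, rtol=1e-30, atol=1e-07)
# A looser absolute tolerance, chosen here only for illustration
loose = np.allclose(drifted, reference, rtol=1e-30, atol=1e-05)

print("strict:", strict)
print("loose:", loose)
```

If the real embeddings differ only at this scale, a looser `atol` would likely report all eight sentences as close, though I have not rerun the notebook to confirm that.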
