## Fine-Tuning with Sentence Transformers
In this notebook, we will learn about the `sentence_transformers` library and then fine-tune a `bert-base-uncased` model on two different types of datasets:
- Triplets
- Sentence pairs with labels (SNLI)

Let's first install the required dependencies and modules

In [1]:
!pip install sentence-transformers datasets scikit-learn numpy matplotlib huggingface_hub



We will use `all-MiniLM-L6-v2` to explore the features of Sentence Transformers.

In [2]:
# model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")



### Converting text to embeddings
Let’s first try converting a given text into embeddings. You might have used OpenAI embeddings for this, but their embedding models are paid. As an alternative, you can use models from Hugging Face, or any other open-source embedding model, with Sentence Transformers to generate vector embeddings for your sentences.

Now we will define the sentences we want to embed in a list, and then use the `encode` method of our model to generate the embeddings.

In [3]:
# Generating Embeddings

# Our sentences we like to encode
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173692e-02 -4.28515822e-02 -1.56286228e-02  1.40537275e-02
  3.95537838e-02  1.21796295e-01  2.94333231e-02 -3.17524225e-02
  3.54960114e-02 -7.93140009e-02  1.75878741e-02 -4.04369459e-02
  4.97259796e-02  2.54912712e-02 -7.18699917e-02  8.14968571e-02
  1.47075264e-03  4.79627997e-02 -4.50336300e-02 -9.92174968e-02
 -2.81769559e-02  6.45045862e-02  4.44670469e-02 -4.76217717e-02
 -3.52952033e-02  4.38671820e-02 -5.28565943e-02  4.33004927e-04
  1.01921484e-01  1.64072644e-02  3.26997042e-02 -3.45986970e-02
  1.21339401e-02  7.94871300e-02  4.58340021e-03  1.57778692e-02
 -9.68205743e-03  2.87625752e-02 -5.05807027e-02 -1.55794024e-02
 -2.87906863e-02 -9.62280016e-03  3.15556079e-02  2.27349699e-02
  8.71449932e-02 -3.85027602e-02 -8.84718373e-02 -8.75496864e-03
 -2.12343670e-02  2.08924040e-02 -9.02078152e-02 -5.25732264e-02
 -1.05638923e-02  2.88311318e-02 -1.61455013e-02  6.17840420e-03
 -1.23234

### Cosine Similarity between Sentences

You can use cosine similarity to measure how similar two sentences are. Sentence Transformers provides a utility for computing the cosine similarity score between two embeddings, so let’s see it in action!

First, we will import the required modules and convert our sentences into embeddings using the same model we used before.

In [4]:
# Finding cosine similarity
from sentence_transformers import SentenceTransformer, util

# Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")


Now we can find the cosine similarity between these two embeddings using the `util.cos_sim` function.

In [5]:
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.6153]])
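To make the score above less of a black box, here is a minimal sketch of the cosine similarity formula in plain Python: the dot product of two vectors divided by the product of their lengths. The toy vectors are hypothetical values chosen for illustration, not real model output.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical values, not real embeddings)
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # points in the same direction as u
w = [-3.0, 0.0, 1.0]  # orthogonal to u (their dot product is 0)

print(cosine_similarity(u, v))  # close to 1: same direction
print(cosine_similarity(u, w))  # close to 0: unrelated directions
```

Because the score only depends on the angle between the vectors, scaling an embedding up or down does not change its similarity to other embeddings.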


### Semantic Search
In semantic search, you start with a query (it can be a sentence or an image) and convert it into an embedding. You then compare the query embedding against the embeddings of your stored sentences using cosine similarity.

Once we have the similarity scores for all sentences, we sort the sentences by score in descending order, so that the most similar sentence is at the top. We can specify the number of similar sentences we want as “k”.
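The score-then-sort procedure described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the 2-dimensional vectors and the helper name `top_k_similar` are hypothetical and stand in for real embeddings and library calls.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_similar(query_vec, corpus_vecs, k):
    """Return (corpus index, score) pairs for the k most similar corpus vectors."""
    scored = [(idx, cosine_similarity(query_vec, vec)) for idx, vec in enumerate(corpus_vecs)]
    # Highest score first, then keep only the first k entries
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 2-dimensional "embeddings" standing in for a real corpus
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

for idx, score in top_k_similar(toy_query, toy_corpus, k=2):
    print(idx, round(score, 4))
```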

Let’s see it in action!

First, we will define the existing sentences that act as our database: this is the list from which we want to find the top k similar sentences. We have to convert these sentences into embeddings so that we can compute cosine similarity on them.

In [6]:
# Semantic Search
from sentence_transformers import SentenceTransformer, util
import torch

# Corpus with example sentences
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

Now we will define our queries and, for each query, find the top 5 most similar sentences in the corpus.

In [7]:
# Query sentences:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    # print(top_results)
    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))





Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a 

Additionally, instead of using `util.cos_sim` and then picking the top k results yourself, you can use the `util.semantic_search` function to do the same thing in one call.

In [8]:
# Using semantic_search utility from sentence transformers
top_k = 5
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)
    similar_results = util.semantic_search(
        query_embeddings=query_embedding,
        corpus_embeddings=corpus_embeddings,
        top_k=top_k,
    )
    print("===============\n")
    print(f"Similar Sentences for '{query}'")
    for result in similar_results[0]:
        print(f"{corpus[result['corpus_id']]} (score: {result['score']})")


Similar Sentences for 'A man is eating pasta.'
A man is eating food. (score: 0.7035488486289978)
A man is eating a piece of bread. (score: 0.527198851108551)
A man is riding a horse. (score: 0.18889541923999786)
A man is riding a white horse on an enclosed ground. (score: 0.10469918698072433)
A cheetah is running behind its prey. (score: 0.09803034365177155)

Similar Sentences for 'Someone in a gorilla costume is playing a set of drums.'
A monkey is playing drums. (score: 0.6432532668113708)
A woman is playing violin. (score: 0.25641557574272156)
A man is riding a horse. (score: 0.1388726532459259)
A man is riding a white horse on an enclosed ground. (score: 0.11909158527851105)
A cheetah is running behind its prey. (score: 0.1079866960644722)

Similar Sentences for 'A cheetah chases prey on across a field.'
A cheetah is running behind its prey. (score: 0.825321614742279)
A man is eating food. (score: 0.13989514112472534)
A monkey is playing drums. (score: 0.12919360399246216)
A man i
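`util.semantic_search` returns one list of hits per query, and each hit is a dictionary with a `corpus_id` and a `score`. The sketch below works on a hard-coded list shaped like that return value (scores copied, rounded, from the first query above) and keeps only hits above a cut-off. The helper `filter_hits` and the `MIN_SCORE` threshold are hypothetical and only serve as an illustration.

```python
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# Hits for one query, in the same shape as one entry of the semantic_search result
hits = [
    {"corpus_id": 0, "score": 0.7035},
    {"corpus_id": 1, "score": 0.5272},
    {"corpus_id": 3, "score": 0.1889},
    {"corpus_id": 6, "score": 0.1047},
    {"corpus_id": 8, "score": 0.0980},
]

def filter_hits(hits, min_score):
    """Keep only the hits whose score reaches the given threshold."""
    return [hit for hit in hits if hit["score"] >= min_score]

MIN_SCORE = 0.5  # hypothetical cut-off; tune it for your own data
for hit in filter_hits(hits, MIN_SCORE):
    print(f"{corpus[hit['corpus_id']]} (score: {hit['score']:.4f})")
```

A threshold like this is useful when returning nothing is better than returning a weak match, since top k alone always returns k results.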

## Training a model using a triplets dataset
Now that we have enough knowledge about Sentence Transformers, let's fine-tune a pre-trained base model.
Please [read the blog](https://www.ionio.ai/blog/fine-tuning-embedding-models-using-sentence-transformers-code-included) for a more detailed explanation of the fine-tuning process.

Let’s first pull our base model and add a pooling layer on top of it, so that we get a fixed-size, 768-dimensional embedding as output regardless of the sentence length.

In [None]:
# Training a bert model using sentence transformer
from sentence_transformers import SentenceTransformer, models
import torch

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
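To see what the pooling layer does, here is a minimal sketch of mean pooling, which is the default mode of `models.Pooling`: the per-token vectors are averaged into a single sentence vector, and padding tokens are ignored through the attention mask. The token vectors below are hypothetical 3-dimensional values rather than real 768-dimensional BERT outputs.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real_tokens = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real_tokens += 1
            for j, value in enumerate(vector):
                summed[j] += value
    return [total / n_real_tokens for total in summed]

# Hypothetical 4-token input; the last token is padding and must not count
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding token
]
mask = [1, 1, 1, 0]

print(mean_pool(tokens, mask))
```

The output always has the same length as one token vector, no matter how many tokens the sentence contains.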

Now let’s pull our dataset. We are going to use `embedding-data/QQP_triplets`, but you can use any other triplet dataset if you want.

In [None]:
# Using QQP Triples dataset for training
from datasets import load_dataset

dataset_id = "embedding-data/QQP_triplets"
dataset = load_dataset(dataset_id)

Let’s take a look at what a single example in the dataset looks like.

In [None]:
dataset['train']['set'][0]

As we can see, each example has a query, a list of positive sentences that are similar to the query, and a list of negative sentences that are not similar to it.

We can’t pass these examples directly to our model, because we first have to convert them into a format that Sentence Transformers understands. Every training example must be an `InputExample`, so we will convert our dataset into this format.

To keep things simple, we will take only the first sentence from both the `pos` and `neg` lists. In a production scenario, you might want to use the full lists for better performance and accuracy.

In [None]:
from sentence_transformers import InputExample

train_examples = []
train_data = dataset['train']['set']
# To keep training fast, we only use the first 1,000 examples
n_examples = 1000
for i in range(n_examples):
    example = train_data[i]
    # Keep only the first positive and the first negative sentence
    train_examples.append(
        InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]])
    )
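If you do want to use every positive and negative sentence rather than only the first of each, one possible approach is to emit one triplet per positive/negative combination. The sketch below assumes a record with the same `query`, `pos` and `neg` keys as the dataset rows; the record contents and the helper name `expand_triplets` are hypothetical.

```python
from itertools import product

def expand_triplets(record):
    """Yield one (query, positive, negative) tuple per pos/neg combination."""
    for pos, neg in product(record["pos"], record["neg"]):
        yield (record["query"], pos, neg)

# Hypothetical record with the same keys as a dataset row
record = {
    "query": "How do I learn Python?",
    "pos": ["What is the best way to learn Python?"],
    "neg": ["How do I learn Java?", "What is a python snake?"],
}

triplets = list(expand_triplets(record))
for triplet in triplets:
    print(triplet)
```

Each tuple can then be wrapped as `InputExample(texts=list(triplet))`. Keep in mind that the number of training examples grows with the product of the two list lengths.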

Now let’s create our dataloader

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

Now let’s define our loss function. The `losses` module of Sentence Transformers gives us access to a range of loss functions.

We just have to attach the model to the triplet loss function.

In [None]:
from sentence_transformers import losses
train_loss = losses.TripletLoss(model)
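As a rough illustration of what a triplet loss optimizes, the sketch below computes `max(d(anchor, positive) - d(anchor, negative) + margin, 0)` with Euclidean distance, which is also the default distance of the library's `TripletLoss`. The 2-dimensional vectors and the margin values are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# Hypothetical embeddings for a query (anchor), a similar and a dissimilar sentence
anchor = [0.0, 0.0]
positive = [3.0, 4.0]  # distance 5 from the anchor
negative = [6.0, 8.0]  # distance 10 from the anchor

print(triplet_loss(anchor, positive, negative, margin=1.0))  # gap of 5 already exceeds the margin
print(triplet_loss(anchor, positive, negative, margin=7.0))  # gap of 5 is smaller than the margin
```

Training pushes the loss toward zero, which means pulling positives closer to the query and pushing negatives away until the gap exceeds the margin.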

And now we are ready. Let’s combine everything we prepared and fine-tune the model using the `model.fit` method, which takes a list of (dataloader, loss function) pairs as its training objectives.

In [None]:
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4)

Now let’s push this fine-tuned model to Hugging Face so that we can share it with other people and they can also see what we cooked!

First, log in to Hugging Face using your access token.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

After that, call the `save_to_hub` method to push your model to Hugging Face.

In [None]:
model.save_to_hub(
    "distilroberta-base-sentence-transformer-triplets",  # Give a name to your model
    organization="0xSH1V4M",  # Your Hugging Face username
    train_datasets=["embedding-data/QQP_triplets"],
)

## Training a model using labeled sentences dataset

Now let’s try to fine-tune a model using a different dataset. This time we will use a dataset in which each example contains a pair of sentences and a label that describes the relationship between the two sentences.

Let’s first load our model and add pooling to it.

In [None]:
from sentence_transformers import SentenceTransformer, models
import torch

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

We will use the `snli` dataset to train this model. Its data is in the format described above.

In [None]:
from datasets import load_dataset

# Using snli as a dataset
snli = load_dataset('snli', split='train')
# and remove rows without a gold label (label == -1)
snli = snli.filter(
    lambda x: x['label'] != -1
)
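For reference, SNLI labels are integers: 0 means entailment, 1 means neutral, 2 means contradiction, and -1 marks pairs for which the annotators did not agree on a label. The sketch below applies the same filter to a few hand-written rows; the rows themselves are hypothetical and only mimic the shape of SNLI examples.

```python
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical rows in the same shape as SNLI examples
rows = [
    {"premise": "A man is eating.", "hypothesis": "A person eats.", "label": 0},
    {"premise": "A man is eating.", "hypothesis": "The man is sad.", "label": 1},
    {"premise": "A man is eating.", "hypothesis": "Nobody is eating.", "label": 2},
    {"premise": "A man is eating.", "hypothesis": "He eats soup.", "label": -1},
]

# Same condition as the filter above: drop rows without a gold label
clean_rows = [row for row in rows if row["label"] != -1]

for row in clean_rows:
    print(LABEL_NAMES[row["label"]], "|", row["hypothesis"])
```

Dropping the -1 rows matters because the softmax loss used later expects every label to be one of the three class ids.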

Let’s take a look at what an example in the dataset looks like.

In [None]:
print(snli[0])

Now let’s convert each example into the `InputExample` format.

In [None]:
from sentence_transformers import InputExample
from tqdm.auto import tqdm  # so we see progress bar

train_samples = []
for row in tqdm(snli):
    train_samples.append(InputExample(
        texts=[row['premise'], row['hypothesis']],
        label=row['label']
    ))

Now let’s define our dataloader and loss function. For this type of dataset, we will use the softmax loss function.

In [None]:
from torch.utils.data import DataLoader
from sentence_transformers import losses

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=3
)
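Conceptually, the softmax loss embeds both sentences, concatenates the two embeddings `u` and `v` with their element-wise absolute difference `|u - v|`, feeds that vector to a linear classifier with one output per label, and applies cross-entropy. The sketch below walks through that computation in plain Python. The 2-dimensional embeddings and the classifier weights are hypothetical; in the real loss the weights are learned during training.

```python
import math

def build_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(softmax(logits)[label])

# Hypothetical 2-dimensional sentence embeddings for premise and hypothesis
u = [0.5, -1.0]
v = [0.25, 1.0]
features = build_features(u, v)  # length is 3 * embedding dimension

# Hypothetical classifier weights: one row of 6 weights per label (3 labels)
weights = [
    [0.1, 0.0, 0.2, 0.0, 0.0, -0.5],
    [0.0, 0.1, 0.0, 0.1, 0.3, 0.0],
    [-0.2, 0.0, 0.0, 0.2, 0.0, 0.5],
]
logits = [sum(w * f for w, f in zip(row, features)) for row in weights]

print(features)
print(cross_entropy(logits, label=2))
```

This is why `sentence_embedding_dimension` and `num_labels` have to be passed to the loss: together they determine the shape of the classifier layer.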

Now let’s train our model!

In [None]:
epochs = 1
# Warm up the learning rate over the first 10% of training steps (adjust as needed)
warmup_steps = int(len(train_dataloader) * epochs * 0.1)
# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=epochs, warmup_steps=warmup_steps)
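To get a feel for what the warmup setting does, here is a small sketch of a linear warmup followed by a linear decay, which is the shape of the default `WarmupLinear` schedule in `model.fit`. It returns a multiplier that is applied to the base learning rate. The dataloader size of 1,000 batches and the helper name `warmup_linear_factor` are assumptions made for the example.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical sizes: 1,000 batches per epoch, 1 epoch, 10% warmup
steps_per_epoch = 1000
epochs = 1
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)

for step in (0, 50, 100, 550, 1000):
    print(step, warmup_linear_factor(step, warmup_steps, total_steps))
```

Starting with a small learning rate keeps the first updates from disturbing the pre-trained weights too much while the freshly initialized classifier layer is still producing noisy gradients.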

Let's save our fine-tuned model to Hugging Face.

In [None]:
model.save_to_hub(
    "distilroberta-base-sentence-transformer-snli", # Give a name to your model
    organization="0xSH1V4M", # Your Huggingface Username
    train_datasets=["snli"],
)

## Evaluation
Now it’s time to compare our fine-tuned models with the base model and analyze how each of them performs.

We will first get the vector embeddings of some sentences using each model, then reduce these embeddings to 2 dimensions using the t-SNE technique, and finally plot the embeddings on a 2D graph using matplotlib.

We will use these sentences for testing:

In [None]:
from sentence_transformers import SentenceTransformer, models
# Evaluation
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

Let’s first get the embeddings of these sentences using the `bert-base-uncased` model, which is our base model.

In [None]:
model = SentenceTransformer("bert-base-uncased")
# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

Now let’s reduce the embedding dimensions using TSNE

In [None]:
import numpy as np
from sklearn.manifold import TSNE
embeddings = np.array(sentence_embeddings)
tsne = TSNE(n_components=2, random_state=42,perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)

Now that we have 2D embeddings, we will cluster the original, full-dimensional sentence embeddings with KMeans and use the resulting cluster labels to color the points. This makes it easier to visualize how each model groups the sentences and where the embeddings sit in vector space.

In [None]:
from sklearn.cluster import KMeans
# Perform kmean clustering
num_clusters = 3
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(sentence_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

If you print the `cluster_assignment` array, you will see one cluster label per sentence.
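For intuition about where these labels come from, the following is a minimal sketch of the basic k-means loop (Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its points. It is not scikit-learn's implementation, and the 2D points and fixed starting centroids are hypothetical values chosen to give an obvious grouping.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, n_iters=10):
    """Alternate between assigning points and updating centroids."""
    labels = []
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical 2D points forming three obvious groups
points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [10.0, 0.0], [9.8, 0.2]]
initial_centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]

labels = kmeans(points, [c[:] for c in initial_centroids])
print(labels)
```

The label numbers are arbitrary ids: with a different starting point the same groups can come out numbered differently, so a colour in one plot does not necessarily mean the same cluster in another plot.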

Now let’s plot these embeddings in 2D vector space using matplotlib.

In [None]:
import matplotlib.pyplot as plt

# Assuming your 2D embeddings are stored in 'embeddings_2d'

# Create a scatter plot
plt.figure(figsize=(6, 4))  # Adjust figure size as needed
colors = ["red","green","blue"]

for index,embedding in enumerate(embeddings_2d):
  plt.scatter(embedding[0],embedding[1],color=colors[cluster_assignment[index]])
# plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])  # Use first two columns for x and y

# Optional: Add labels and title
plt.xlabel("X")
plt.ylabel("Y")
plt.title("BERT Base Model")

# Optional: Add sentence labels (consider using for small datasets like yours)
for i, sentence in enumerate(sentences):
  plt.annotate(sentence, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.grid(False)
plt.show()

### Plotting the results of fine-tuned model using triplets

First pull the model from huggingface

In [None]:
model = SentenceTransformer('0xSH1V4M/distilroberta-base-sentence-transformer-triplets')

Prepare the sentence embeddings

In [None]:
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
sentence_embeddings = model.encode(sentences)

Apply KMeans algorithm to perform clustering on embeddings

In [None]:
from sklearn.cluster import KMeans
# Perform kmean clustering
num_clusters = 3
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(sentence_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

Apply TSNE algorithm for dimension reduction

In [None]:
embeddings = np.array(sentence_embeddings)
tsne = TSNE(n_components=2, random_state=42,perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)
# print(embeddings_2d)

Plot the 2D embeddings on 2D graph

In [None]:
import matplotlib.pyplot as plt
colors = ["red","green","blue"]

# Create a scatter plot
plt.figure(figsize=(6, 4))  # Adjust figure size as needed
for index,embedding in enumerate(embeddings_2d):
  plt.scatter(embedding[0],embedding[1],color=colors[cluster_assignment[index]])
  # plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])  # Use first two columns for x and y

# Optional: Add labels and title
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Fine-tuned with triplets")

# Optional: Add sentence labels (consider using for small datasets like yours)
for i, sentence in enumerate(sentences):
  plt.annotate(sentence, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.grid(False)
plt.show()

### Plotting the results of fine-tuned model using snli

Pull the model from huggingface

In [None]:
model = SentenceTransformer('0xSH1V4M/distilroberta-base-sentence-transformer-snli')

Prepare the sentence embeddings

In [None]:
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
sentence_embeddings = model.encode(sentences)

Apply KMeans algorithm to perform clustering on embeddings

In [None]:
from sklearn.cluster import KMeans
# Perform kmean clustering
num_clusters = 3
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(sentence_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

Apply TSNE Algorithm for dimension reduction

In [None]:
embeddings = np.array(sentence_embeddings)
tsne = TSNE(n_components=2, random_state=42,perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)

Plot the 2D embeddings on 2D Graph using matplotlib

In [None]:
import matplotlib.pyplot as plt
colors = ["red","green","blue"]

# Create a scatter plot
plt.figure(figsize=(6, 4))  # Adjust figure size as needed
for index,embedding in enumerate(embeddings_2d):
  plt.scatter(embedding[0],embedding[1],color=colors[cluster_assignment[index]])
  # plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])  # Use first two columns for x and y

# Optional: Add labels and title
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Fine-tuned with SNLI")

# Optional: Add sentence labels (consider using for small datasets like yours)
for i, sentence in enumerate(sentences):
  plt.annotate(sentence, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.grid(False)
plt.show()