# **Introduction**


"SentenceTransformers was designed in such way that fine-tuning your own sentence / text embeddings models is easy. It provides most of the building blocks that you can stick together to tune embeddings for your specific task."

There is no single training strategy that fits all use cases; the right one depends on the available data and the type of task.


Natural Language Inference (NLI) for Training Sentence Transformers
---

Among the multiple approaches to training sentence transformers, we take a dive into training with NLI datasets.

NLI focuses on identifying sentence pairs that infer or do not infer one another.

We use two datasets here:



1.   The **Stanford Natural Language Inference (SNLI)** corpus contains 550k sentence pairs.
2.   The **Multi-Genre NLI (MNLI)** corpus contains 393k sentence pairs.

A combination of both gives 943k sentence pairs. Each pair includes a **premise** and a **hypothesis**, and each pair is *assigned a label*:



0: entailment, i.e. the premise suggests the hypothesis.

1: neutral, the premise and hypothesis could both be true, but they are not necessarily related.

2: contradiction, the premise and hypothesis contradict each other.
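
The label ids can be mapped back to readable names with a small lookup table. The sketch below is only an illustration: the names `NLI_LABELS` and `describe` are assumptions made for this example and are not part of the datasets library, and the row is written by hand in the same premise/hypothesis/label format.

```python
# Label ids as listed above; the dictionary name is an assumption for this sketch
NLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}


def describe(row):
    """Return a readable summary of one premise/hypothesis pair."""
    # rows without a usable label fall back to "unlabelled"
    name = NLI_LABELS.get(row["label"], "unlabelled")
    return f"{row['premise']!r} -> {row['hypothesis']!r}: {name}"


# hand-written row in the premise/hypothesis/label format
row = {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
    "label": 1,
}
print(describe(row))
```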


---

**How it works:**

During model training, we pass **sentence A (the premise)** into BERT, after which **sentence B (the hypothesis)** goes in as the next step.

Then the **model gets optimised** with **softmax loss** using the label field.


## **Step 1:** - **Data Preparation Phase**

Using the datasets library from Hugging Face, we will **download and merge** the two datasets (SNLI & MNLI).

In [1]:
!pip install datasets #installing the module


In [2]:
import datasets

snli = datasets.load_dataset('snli', split='train')

snli


Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})

In [3]:
print(snli[0]) #first line

{'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1}


In [4]:
m_nli = datasets.load_dataset('glue', 'mnli', split='train')

m_nli


Dataset({
    features: ['premise', 'hypothesis', 'label', 'idx'],
    num_rows: 392702
})

In [5]:
m_nli = m_nli.remove_columns(['idx'])  # drop 'idx' so both datasets share the same columns
snli = snli.cast(m_nli.features)  # align the SNLI feature types with MNLI
dataset = datasets.concatenate_datasets([snli, m_nli])  # merge the two datasets

In [6]:
dataset

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 942854
})

In [7]:
print(dataset[10]) #print row 10

{'premise': 'An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background.', 'hypothesis': 'A boy flips a burger.', 'label': 2}


**Data Cleaning**

The datasets contain *-1 values in the label feature* where no confident class could be assigned. Let's **remove them using the filter method**.

In [8]:
print(len(dataset))
# rows labelled -1 are those where no class could be decided, so we remove them
dataset = dataset.filter(
    lambda x: x['label'] != -1
)

print(len(dataset))

942854



942069


**Tokenization** in NLP

Here, our dataset gets tokenized.

Tokenization splits text (sentences, phrases or whole documents) into smaller units called tokens.

The tokens could be words, numbers or punctuation marks. In tokenization, smaller units are created by locating word boundaries, that is, the ending point of one word and the beginning of the next. It is usually the first step before stemming and lemmatization.

---

**Example:**

String: "This is a big dog".

After tokenization, this string becomes ['This', 'is', 'a', 'big', 'dog'].


---

**Purposes of Tokenization:**

1. To count the number of words in the text
2. To count the frequency of the word, that is, the number of times a particular word is present
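
Both purposes can be shown with a few lines of plain Python. This is a minimal sketch that assumes simple word-boundary tokenization; the helper `simple_tokenize` is hypothetical and is not what BERT uses (BERT's tokenizer splits text into WordPiece subword units).

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Split text into lowercase word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


text = "This is a big dog. The dog is big."
tokens = simple_tokenize(text)

# keep only word tokens for counting (drops the punctuation marks)
word_tokens = [t for t in tokens if t.isalnum()]
word_counts = Counter(word_tokens)

print(len(word_tokens))      # number of words in the text
print(word_counts["dog"])    # how often one particular word occurs
```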


We convert our human-readable sentences into transformer-readable tokens.

Both premise and hypothesis **features must be split** into their own *input_ids* and *attention_mask* tensors.
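
To make the two tensors concrete, here is a dependency-free sketch of what padding to a fixed length does. The function `pad_and_mask`, the token ids and the default `pad_id=0` are illustrative assumptions; in the notebook itself the tokenizer produces both lists.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token_ids to max_length and build the matching attention mask."""
    ids = token_ids[:max_length]          # truncation
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    # padding positions get pad_id in input_ids and 0 in the attention mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# hypothetical token ids for a five-token sentence (not real vocabulary ids)
input_ids, attention_mask = pad_and_mask([11, 12, 13, 14, 15], max_length=8)
print(input_ids)
print(attention_mask)
```

The attention mask is what later lets the pooling step ignore the padded positions.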

In [9]:
!pip install sentence-transformers


In [10]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [11]:
all_cols = ['label']

for part in ['premise', 'hypothesis']:
    dataset = dataset.map(
        lambda x: tokenizer(
            x[part], max_length=128, padding='max_length',
            truncation=True
        ), batched=True
    )
    for col in ['input_ids', 'attention_mask']:
        dataset = dataset.rename_column(
            col, part+'_'+col
        )
        all_cols.append(part+'_'+col)
print(all_cols)


['label', 'premise_input_ids', 'premise_attention_mask', 'hypothesis_input_ids', 'hypothesis_attention_mask']


### **Data into Model Prep**

Let's prepare the data that we need to be read into the model.

First, we convert the dataset features into PyTorch tensors and then initialise a data loader (to feed the data into our model during training).

In [12]:
import torch

In [13]:
# convert dataset features to PyTorch tensors
dataset.set_format(type='torch', columns=all_cols)

# initialize the dataloader
batch_size = 16
loader = torch.utils.data.DataLoader(
    dataset, batch_size=batch_size, shuffle=True
)

End

# **Training using Softmax Loss**

We optimise with softmax loss, as seen in the SBERT paper.

"Although this was used to train the first sentence transformer model, it is no longer the go-to training approach. Instead, the **MNR loss approach** is most common today."

## **Model Preparation**

Thankfully, we aren't training the SBERT model from scratch. We begin with an already pretrained BERT model (and tokenizer) and fine-tune it using what is called a 'siamese'-BERT architecture.

How? 

Given a sentence pair, we feed sentence A into BERT first, then feed sentence B once BERT has finished processing sentence A.

"This has the effect of creating a siamese-like network where we can imagine two identical BERTs are being trained in parallel on sentence pairs. In reality, there is just a single model processing two sentences one after the other."


---

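For each sentence, BERT outputs one 768-dimensional embedding per token (128 of them here, since we padded to max_length=128; BERT accepts at most 512 tokens). We convert these into a single average embedding (aka sentence embedding) using mean-pooling, which gives two sentence embeddings per step.

Let's call the embedding of **sentence A** *u* and the embedding of **sentence B** *v*.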

In [14]:
from transformers import BertModel

# start from a pretrained bert-base-uncased model
model = BertModel.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Pooling operation, first let's define a function called **mean_pool**.

In [15]:
# define mean pooling function
def mean_pool(token_embeds, attention_mask):
    # reshape attention_mask to cover 768-dimension embeddings
    in_mask = attention_mask.unsqueeze(-1).expand(
        token_embeds.size()
    ).float()
    # perform mean-pooling but exclude padding tokens (specified by in_mask)
    pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(
        in_mask.sum(1), min=1e-9
    )
    return pool

We apply the resized mask in_mask to the token embeddings to exclude padding tokens from the mean pooling operation. 

Technically:

Mean pooling takes the average of the values across each dimension to produce a single value per dimension. This brings our tensor size per sentence from (128 * 768), given the max_length=128 used during tokenization, to (1 * 768).
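
The arithmetic is easier to see on a toy input. The sketch below re-implements masked mean pooling with plain Python lists; the function name `mean_pool_lists` and the two-dimensional toy vectors are assumptions for illustration and stand in for the 768-dimensional tensors used by the real function.

```python
def mean_pool_lists(token_embeds, attention_mask):
    """Average token vectors over the sequence, skipping positions where the mask is 0."""
    dim = len(token_embeds[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeds, attention_mask):
        if keep:  # padding tokens (mask == 0) are ignored
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    # same guard against division by zero as the clamp in mean_pool
    return [t / max(count, 1e-9) for t in total]


# three token vectors of dimension 2; the last one is padding
token_embeds = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
pooled = mean_pool_lists(token_embeds, attention_mask)
print(pooled)
```

Without the mask, the padding vector would pull the average towards values that carry no information about the sentence.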

### Concatenation operation using PyTorch. 

Concatenate embeddings (u & v)


In [16]:
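# u and v are placeholders so that this cell runs on its own; in training they are the mean-pooled embeddings
u, v = torch.rand(1, 768), torch.rand(1, 768)
uv_abs = torch.abs(torch.sub(u, v))  # produces |u-v| tensor
# then we concatenate
x = torch.cat([u, v, uv_abs], dim=-1)

u (sentence A) & v (sentence B) are random placeholders here. In the training loop further down, they are the mean-pooled embeddings of each sentence pair.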



**Feed-forward neural network (FFNN).**

The concatenated vector is then fed into the FFNN, which processes it and outputs three activation values, one for each of our label classes: entailment, neutral, and contradiction.
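
As a rough, dependency-free illustration of these two steps, the sketch below builds the (u, v, |u-v|) vector and passes it through a single linear layer. The helper names, the two-dimensional embeddings and the weights are all made up for this example; the real layer maps 768*3 inputs to 3 outputs and its weights are learned.

```python
def concat_features(u, v):
    """Build the (u, v, |u-v|) feature vector from two sentence embeddings."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def linear(x, weights, bias):
    """Apply one linear layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


u = [1.0, 0.0]   # toy embedding of sentence A
v = [0.5, 2.0]   # toy embedding of sentence B
x = concat_features(u, v)  # length is 3 * embedding dimension

# made-up weights: three rows, one per label class
weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
bias = [0.0, 0.0, 0.0]
logits = linear(x, weights, bias)
print(x)
print(logits)
```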

In [None]:
# we would initialize the feed-forward NN first
ffnn = torch.nn.Linear(768*3, 3)

# then later in the code process our concatenated vector with it
x = ffnn(x)

**Calculate Softmax Loss**

In [None]:
# as before, we would initialize the loss function first
loss_func = torch.nn.CrossEntropyLoss()

# then later in the code add them to the process
x = loss_func(x, label)  # label is our *true* 0, 1, 2 class
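
PyTorch's CrossEntropyLoss applies a softmax to the raw activations and then takes the negative log-probability of the true class, which is why this setup is referred to as softmax loss. A minimal sketch of that arithmetic for a single example follows; the logits are made-up numbers and the helper names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Turn raw activations into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 0.5, -1.0]  # made-up activations for entailment, neutral, contradiction
probs = softmax(logits)
print(probs)
print(cross_entropy(logits, 0))  # small loss: class 0 has the highest activation
print(cross_entropy(logits, 2))  # larger loss: class 2 has the lowest activation
```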

With this loss, we optimise the model using the Adam optimizer with a learning rate of 2e-5 and a linear warmup over the first 10% of the training steps.

In [22]:
from transformers.optimization import get_linear_schedule_with_warmup

# we would initialize everything first
optim = torch.optim.Adam(model.parameters(), lr=2e-5)
# and setup a warmup for the first ~10% steps
total_steps = int(len(dataset) / batch_size)
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(
    optim, num_warmup_steps=warmup_steps,
    # num_training_steps is the full number of steps, warmup included
    num_training_steps=total_steps
)

# then during the training loop we update the scheduler per step
scheduler.step()
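
To see what a linear warmup schedule does to the learning rate, the multiplier can be written out by hand. This is a sketch of the general shape (linear rise during warmup, then linear decay to zero), not the library's code; the function name `linear_warmup_factor` is an assumption, and the step counts come from the 942069 rows and batch size of 16 used above.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: rises linearly to 1, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 942069 // 16             # one epoch over the filtered dataset
warmup_steps = int(0.1 * total_steps)  # first ~10% of the steps
base_lr = 2e-5

for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    lr = base_lr * linear_warmup_factor(step, warmup_steps, total_steps)
    print(step, lr)
```

The learning rate starts at zero, reaches 2e-5 at the end of warmup, and falls back to zero at the final step.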



PyTorch training loop

In [None]:
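from tqdm.auto import tqdm

# use a GPU when one is available; the model and the classifier head must sit on the same device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
ffnn.to(device)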

# 1 epoch should be enough, increase if wanted
for epoch in range(1):
    model.train()  # make sure model is in training mode
    # initialize the dataloader loop with tqdm (tqdm == progress bar)
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # zero all gradients on each new step
        optim.zero_grad()
        # prepare batches and move all to the active device
        inputs_ids_a = batch['premise_input_ids'].to(device)
        inputs_ids_b = batch['hypothesis_input_ids'].to(device)
        attention_a = batch['premise_attention_mask'].to(device)
        attention_b = batch['hypothesis_attention_mask'].to(device)
        label = batch['label'].to(device)
        # extract token embeddings from BERT
        u = model(
            inputs_ids_a, attention_mask=attention_a
        )[0]  # all token embeddings A
        v = model(
            inputs_ids_b, attention_mask=attention_b
        )[0]  # all token embeddings B
        # get the mean pooled vectors
        u = mean_pool(u, attention_a)
        v = mean_pool(v, attention_b)
        # build the |u-v| tensor
        uv = torch.sub(u, v)
        uv_abs = torch.abs(uv)
        # concatenate u, v, |u-v|
        x = torch.cat([u, v, uv_abs], dim=-1)
        # process concatenated tensor through FFNN
        x = ffnn(x)
        # calculate the 'softmax-loss' between predicted and true label
        loss = loss_func(x, label)
        # using loss, calculate gradients and then optimize
        loss.backward()
        optim.step()
        # update learning rate scheduler
        scheduler.step()
        # update the tqdm progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Save Model (softmax loss)

In [None]:
import os

model_path = './sbert_test_a'

if not os.path.exists(model_path):
    os.mkdir(model_path)

model.save_pretrained(model_path)

# **Fine Tune with Sentence Transformers**

We again use the SNLI and MNLI corpora, but this time we convert them into the format sentence-transformers expects using its InputExample class.

First of all, we need to download and merge the two datasets as in the previous steps.

In [24]:
import datasets

# download
snli = datasets.load_dataset('snli', split='train')
mnli = datasets.load_dataset('glue', 'mnli', split='train')

# format for merge
mnli = mnli.remove_columns(['idx'])
snli = snli.cast(mnli.features)

# merge
nli = datasets.concatenate_datasets([snli, mnli])
del snli, mnli

# and remove bad rows
nli = nli.filter(
    lambda x: x['label'] != -1
)


Format data for sentence-transformers. 

In [25]:
from sentence_transformers import InputExample
from tqdm.auto import tqdm  # so we see progress bar

train_samples = []
for row in tqdm(nli):
    train_samples.append(InputExample(
        texts=[row['premise'], row['hypothesis']],
        label=row['label']
    ))


In [26]:
from torch.utils.data import DataLoader

batch_size = 16

loader = DataLoader(
    train_samples, shuffle=True, batch_size=batch_size)

We initialise a DataLoader as we did before. From here, we want to begin setting up the model. In sentence-transformers we build models using different modules.

What do we need?

1.   A transformer model module.
2.   A mean pooling module.

The transformer models are loaded from Hugging Face, so we define bert-base-uncased as before.

In [27]:
from sentence_transformers import models, SentenceTransformer

bert = models.Transformer('bert-base-uncased')
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(modules=[bert, pooler])

model

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Now that we have our data and model, we initialise the softmax loss used to optimise the model.


In [28]:
from sentence_transformers import losses

loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3)  # NLI dataset has [0, 1, 2] labels

Let's train our model for a single epoch, with a warmup over the first 10% of the training steps.

In [None]:
epochs = 1
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='./sbert_test_b',
    show_progress_bar=True,
)

Finally, the new model is saved to ./sbert_test_b. We can load the model from that location using either the SentenceTransformer or Hugging Face’s from_pretrained methods!

# **Compare SBERT Models**

In [None]:
sentences = [
    "the fifty mannequin heads floating in the pool kind of freaked them out",
    "she swore she just saw her sushi move",
    "he embraced his new life as an eggplant",
    "my dentist tells me that chewing bricks is very bad for your teeth",
    "the dental specialist recommended an immediate stop to flossing with construction materials",
    "i used to practice weaving with spaghetti three hours a day",
    "the white water rafting trip was suddenly halted by the unexpected brick wall",
    "the person would knit using noodles for a few hours daily",
    "it was always dangerous to drive with him since he insisted the safety cones were a slalom course",
    "the woman thinks she saw her raw fish and rice change position"
]

After producing sentence embeddings, we will calculate the cosine similarity between all possible sentence pairs, producing a simple but insightful semantic textual similarity (STS) test.

We define two new functions: sts_process, which builds the sentence embeddings and compares them with cosine similarity, and sim_matrix, which constructs a similarity matrix from all possible pairs.

In [None]:
import numpy as np
from sentence_transformers.util import cos_sim  # cosine similarity between tensors

# build embeddings and calculate cosine similarity
def sts_process(sentence_a, sentence_b, model):
    vecs = []  # init list of sentence vecs
    for sentence in [sentence_a, sentence_b]:
        # build input_ids and attention_mask tensors with tokenizer
        input_ids = tokenizer(
            sentence, max_length=512, padding='max_length',
            truncation=True, return_tensors='pt'
        )
        # process tokens through model and extract token embeddings
        token_embeds = model(**input_ids).last_hidden_state
        # mean-pool token embeddings to create sentence embeddings
        sentence_embeds = mean_pool(token_embeds, input_ids['attention_mask'])
        vecs.append(sentence_embeds)
    # calculate cosine similarity between pairs and return numpy array
    return cos_sim(vecs[0], vecs[1]).detach().numpy()

# controller function to build similarity matrix
def sim_matrix(model):
    # initialize empty zeros array to store similarity scores
    sim = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        # add similarity scores to the similarity matrix
        sim[i:,i] = sts_process(sentences[i], sentences[i:], model)
    return sim
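
The comparison logic above can also be sketched without any dependencies, which makes the cosine formula and the lower-triangular fill explicit. The function names and the two-dimensional vectors below are assumptions for illustration; real sentence embeddings have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Fill the lower triangle (diagonal included) with pairwise similarities."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[j][i] = cosine_similarity(vectors[i], vectors[j])
    return sim


vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy sentence embeddings
sim = similarity_matrix(vectors)
for row in sim:
    print(row)
```

Only the lower triangle is filled because cosine similarity is symmetric, so each pair needs to be scored once.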

Then we just run each model through the sim_matrix function.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('./sbert_test_a')

sim = sim_matrix(model)  # build similarity scores matrix
sns.heatmap(sim, annot=True)  # visualize heatmap

References:

1.   Training Sentence Transformers: https://www.sbert.net/docs/training/overview.html
2.   Types of Tokenization in NLP: https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/


