<h2> Fine Tune Sentence Transformer Embeddings without Model Training </h2>

In [1]:
from transformers import AutoTokenizer, TFAutoModel
from scipy.spatial.distance import cosine
from sklearn.preprocessing import normalize
import tensorflow as tf
import pandas as pd
import numpy as np
import json
import os

<b> Sentence Transformers:</b> <p>Sentence Transformers is a Python framework for state-of-the-art sentence and text embeddings. Its models map sentences and paragraphs to a dense vector space (384 dimensions for the model used here) and can be used for tasks like semantic search and paraphrase mining.</p>

<b> Paraphrase-MiniLM-L6-v2:</b> <p> This model is based on BERT with 6 Transformer encoder layers. It can handle up to 512 tokens and returns a dense vector representation with 384 features.</p>

In [2]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = tf.cast(tf.tile(tf.expand_dims(attention_mask, -1), [1, 1, token_embeddings.shape[-1]]), tf.float32)
    return tf.math.reduce_sum(token_embeddings * input_mask_expanded, 1) / tf.math.maximum(tf.math.reduce_sum(input_mask_expanded, 1), 1e-9)


#Encode text
def encode(tokenizer,texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')
    
    # Compute token embeddings
    model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = tf.math.l2_normalize(embeddings, axis=1)

    return embeddings

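#cosine similarity function
def cosine_similarity(vector1,vector2):
    # flatten to 1-D first: encode() returns shape (1, 384) and some SciPy versions reject 2-D input
    vector1 = np.asarray(vector1).ravel()
    vector2 = np.asarray(vector2).ravel()
    return (1- cosine(vector1,vector2))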
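
To see why the attention mask matters in mean pooling, here is a minimal NumPy sketch on a made-up batch. The array values and the helper name `masked_mean` are illustrative assumptions, not part of the notebook. Padding positions are zeroed out before summing, and the sum is divided by the number of real tokens rather than by the padded length.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over real tokens only (mask == 1)."""
    mask = attention_mask[..., np.newaxis].astype(np.float32)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)              # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1e-9)                 # (batch, 1), avoids division by zero
    return summed / counts

# Toy batch: 1 sentence, 3 positions (the last one is padding), 2 dimensions
toy_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
toy_mask = np.array([[1, 1, 0]])

pooled = masked_mean(toy_embeddings, toy_mask)  # padding ignored
naive = toy_embeddings.mean(axis=1)             # padding included
print(pooled)
print(naive)
```

In this toy case the masked average is the mean of the two real token vectors, while the naive average is dragged far away by the padding row.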

In [3]:
tokenizer = AutoTokenizer.from_pretrained('./sentence-transformer-paraphrase-MiniLM-L6-v2')
model = TFAutoModel.from_pretrained('./sentence-transformer-paraphrase-MiniLM-L6-v2')

All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at ./sentence-transformer-paraphrase-MiniLM-L6-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


<b> Model Architecture</b>

<div style="overflow-y: scroll; height:400px;">
<pre>
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=384, out_features=1536, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=1536, out_features=384, bias=True)
          (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=384, out_features=1536, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=1536, out_features=384, bias=True)
          (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (2): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=384, out_features=1536, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=1536, out_features=384, bias=True)
          (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (3): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=384, out_features=1536, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=1536, out_features=384, bias=True)
          (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (4): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=384, out_features=1536, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=1536, out_features=384, bias=True)
          (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (5): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=384, out_features=1536, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=1536, out_features=384, bias=True)
          (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=384, out_features=384, bias=True)
    (activation): Tanh()
  )
)
</pre>
</div>

<h3> Objective: </h3>
<ol>
    <li> Modify existing word embeddings to adapt them to our use case</li>
    <li> Add New Words to Vocabulary </li>
</ol>

<h4> Vocabulary Exploration </h4>

In [4]:
vocab_wrd2idx = tokenizer.vocab
vocab_idx2wrd = {v:k for k,v in vocab_wrd2idx.items()}

print(len(vocab_wrd2idx))

sorted(vocab_wrd2idx.items(),key=lambda x:x[1])[0:10]

30522


[('[PAD]', 0),
 ('[unused0]', 1),
 ('[unused1]', 2),
 ('[unused2]', 3),
 ('[unused3]', 4),
 ('[unused4]', 5),
 ('[unused5]', 6),
 ('[unused6]', 7),
 ('[unused7]', 8),
 ('[unused8]', 9)]

<p> We have 30,522 tokens in the vocab of the paraphrase-MiniLM-L6-v2 model.
994 of them are unused placeholders, named [unused0] through [unused993].
These are the tokens we can replace to add new tokens to our vocab.</p>
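
One way to list the placeholder slots is sketched below on a small stand-in dictionary. The `toy_vocab` contents are made up; in the notebook the same expression could presumably be run against `vocab_wrd2idx` instead.

```python
import re

# Stand-in for tokenizer.vocab: token -> index (contents are illustrative only)
toy_vocab = {'[PAD]': 0, '[unused0]': 1, '[unused1]': 2, '[UNK]': 3, 'statement': 4, '[unused2]': 5}

# Placeholder tokens look like [unused<number>]
unused_pattern = re.compile(r'\[unused\d+\]')
unused_tokens = sorted(
    (tok for tok in toy_vocab if unused_pattern.fullmatch(tok)),
    key=toy_vocab.get,  # order by vocabulary index
)
print(len(unused_tokens), unused_tokens)
```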

In [5]:
#extracting model weights
model_weights = model.get_weights()

In [6]:
#getting vocab weights
vocab_weights = model_weights[0]

<h3>1. Modify existing word embeddings to adapt them to our use case </h3>

Depending on your use case, the same word can carry different meanings in different contexts. Transformers can handle such cases in principle, but in many situations:

<ul>
    <li>The input doesn't provide enough context for the model to work with</li>
    <li>The degree of closeness the model assigns to two words is not satisfactory</li>
</ul>

That is why fine-tuning on your own data is usually recommended. If fine-tuning is not an option for any reason, the following approach can help.

<p> E.g.: in a finance dataset, "statement" and "bill" are used synonymously, but a pretrained model
    is likely not to treat them as closely related as you would like.</p>

Let's understand the distribution of words close to statement

In [7]:
#Function to find top_k similar words
def most_similar(search_word,top_k=5):
    """This function takes a word, computes the cosine similarity between the given word
       and all other words, and returns the top_k most similar words"""
    search_idx = vocab_wrd2idx[search_word]
    similarity_ls=[]
    for word_idx,word_embed in enumerate(vocab_weights):
        similarity_score = cosine_similarity(vocab_weights[search_idx],word_embed)
        similarity_ls.append((word_idx,vocab_idx2wrd[word_idx],similarity_score))
    
    return pd.DataFrame(similarity_ls,columns=['word_index','word','score']).sort_values(by='score',ascending=False)[0:top_k]
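
The loop in `most_similar` makes one SciPy call per vocabulary row. As a possible alternative, the same ranking can be computed with a single matrix product. The sketch below uses a tiny made-up embedding matrix and a hypothetical helper `top_k_similar`, so the names and numbers are assumptions rather than notebook output.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, top_k=5):
    """Return (row_index, cosine_score) pairs for the rows closest to embeddings[query_idx]."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)   # normalise every row once
    scores = unit @ unit[query_idx]                # cosine similarity against every row
    order = np.argsort(-scores)[:top_k]            # highest score first
    return [(int(i), float(scores[i])) for i in order]

toy_embeddings = np.array([
    [1.0, 0.0],    # row 0: the query
    [1.0, 1.0],    # row 1: 45 degrees away from the query
    [0.0, 1.0],    # row 2: orthogonal to the query
    [-1.0, 0.0],   # row 3: opposite direction
])
result = top_k_similar(toy_embeddings, query_idx=0, top_k=3)
print(result)
```

The returned row indices could then be mapped back to words through a dictionary such as `vocab_idx2wrd`.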

In [8]:
most_similar('statement',top_k=10)

index  word_index          word     score
4861         4861     statement       1.0
8635         8635    statements  0.635414
15974       15974  spokesperson  0.359564
7615         7615       comment   0.35895
8874         8874  announcement  0.354991
14056       14056     spokesman  0.353516
3661         3661        letter  0.352468
8170         8170   declaration  0.337585
12629       12629       remarks  0.329412
23617       23617     assertion  0.319534


From the above output we see that the model captures the general meaning of the word "statement". The training corpus was broad, so the model learned a general model of language rather than a domain-specific one.

Let's understand the vector closeness between statement and bill

In [9]:
cosine_similarity(vocab_weights[vocab_wrd2idx['statement']],vocab_weights[vocab_wrd2idx['bill']])

0.1263842135667801

From the above result, we can see that the similarity between "statement" and "bill" is much lower than we would like.
With a transformer model this comparison is not completely accurate, though, because we are only looking at the embeddings from the first (input) layer, where no context modelling has been done yet.

Let's look at the similarity between encoded sentences with the model

In [10]:
text1 = 'please send me the statement'
text2 = 'please transfer the statement'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)

0.7034520506858826

Both sentences are similar, and we get a similarity score of 0.70, which is reasonable.

In [11]:
text1 = 'please send me the bill'
text2 = 'please transfer the statement'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)

0.3917175233364105

With "statement" replaced by "bill" in one sentence, the score drops sharply, from 0.70 to 0.39. Ideally, we would like it to stay above 0.5.

<h3> Let's understand the role of context here</h3>

<b>Example 1:</b> Without proper context

In [12]:
text1 = 'i did not recieve any bill' #from the bank
text2 = "i did'nt got any statement" #from the bank

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)

0.3873918354511261

<b> Result: </b> The similarity score is very low because the model uses the general meaning of "bill" here.

<b>Example 2:</b> With Added Context

In [13]:
text1 = 'i did not recieve any bill from the bank'
text2 = "i did'nt got any statement from the bank"

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)

0.6437106728553772

<b>Result:</b> Just by adding "from the bank", we can observe a significant improvement in the similarity score, from 0.38 to 0.64.

When working with real data, however, proper context will not always be present. Let's go back to our previous example and modify the embeddings to get a more reasonable similarity score.

In [14]:
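def weighted_average_of_vectors(word_wghts):
    """This function takes words and their weights and creates a new unit-length vector
       by blending the word vectors"""
    
    vector = np.zeros(384, dtype=np.float64)
    
    # Note: each word vector is scaled by (1 - weight), so a larger value means a smaller
    # contribution; with two equal weights (0.5/0.5) this is a plain average
    for key,value in word_wghts.items():
        vector += (1 - value) * vocab_weights[vocab_wrd2idx[key]]
    
    # L2-normalize the blended vector
    vector = normalize(vector.reshape(1,-1))[0]
    
    return vector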
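
To make the arithmetic in `weighted_average_of_vectors` concrete, here is a small pure-Python sketch on made-up 2-D vectors. The helper names `blend` and `cos` and the toy vectors are assumptions for illustration. It mirrors the function above: each vector is scaled by `1 - weight`, the results are summed, and the sum is scaled to unit length.

```python
import math

def blend(vectors, weights):
    """Sum (1 - weight) * vector for each word, then scale the result to unit length."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word, weight in weights.items():
        for i, component in enumerate(vectors[word]):
            total[i] += (1 - weight) * component
    norm = math.sqrt(sum(c * c for c in total))
    return [c / norm for c in total]

def cos(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy stand-ins for two embedding rows (orthogonal on purpose)
toy_vectors = {'statement': [1.0, 0.0], 'bill': [0.0, 1.0]}

equal = blend(toy_vectors, {'statement': 0.5, 'bill': 0.5})
skewed = blend(toy_vectors, {'statement': 0.2, 'bill': 0.8})

print(cos(equal, toy_vectors['statement']), cos(equal, toy_vectors['bill']))
print(cos(skewed, toy_vectors['statement']), cos(skewed, toy_vectors['bill']))
```

With equal weights the blended vector sits halfway between the two inputs. With unequal weights, the word given the larger value ends up contributing less, because the scale factor is `1 - weight`.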

In [15]:
#modifying the embedding of statement to incorporate meaning of bill as well in the vector

stmt_wghts = {
    'statement':0.5,
    'bill':0.5
}

vocab_weights[vocab_wrd2idx['statement']] = weighted_average_of_vectors(stmt_wghts)

In [16]:
#setting updated weights
model_weights[0] = vocab_weights
model.set_weights(model_weights)

In [17]:
most_similar('statement',top_k=10)

index  word_index         word     score
4861         4861    statement       1.0
3021         3021         bill  0.761461
8236         8236        bills  0.520205
8635         8635   statements  0.465498
3661         3661       letter  0.393868
6094         6094  legislation  0.375259
2552         2552          act  0.331694
2928         2928         mark   0.31321
3189         3189       report  0.308153
3820         3820    agreement  0.294916


<b>Result</b>: From the above result we can observe that the "bill" and "statement" vectors are now close together in the vector space.

<b> Example 3:</b> After Updating the Embedding

In [18]:
text1 = 'please send me the bill'
text2 = 'please transfer the statement'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)

0.5900149941444397

<b>Result:</b> With the embeddings modified, the similarity score for this sentence pair jumps from 0.39 (the same pair before the update, In [11]) to 0.59.

<h3>2. Add New Words to Vocabulary </h3>

In [19]:
def add_word(word_to_replace,word_to_add,word_wghts):
    '''This function takes the token to replace, the word to add and the word weights,
       and replaces the token and its embedding.
    '''
    #editing the dictionaries: the new word takes over the index of the replaced token
    vocab_index = vocab_wrd2idx[word_to_replace]
    vocab_wrd2idx[word_to_add] = vocab_index
    del vocab_wrd2idx[vocab_idx2wrd[vocab_index]]
    vocab_idx2wrd[vocab_index] = word_to_add
    
    #overwriting the embedding row with the blended vector
    vocab_weights[vocab_index] = weighted_average_of_vectors(word_wghts)
    
    #setting the updated weights on the model
    model_weights[0] = vocab_weights
    model.set_weights(model_weights)

In [20]:
#checking if words are in vocab

words_to_add = ["stmt","ftp",'http']

[wrd in tokenizer.vocab for wrd in words_to_add]

[False, False, True]

<b>Result:</b> Only "http" is present in the vocab; "ftp" and "stmt" (short for "statement") are not. (You could always replace such words during preprocessing; this is just for demonstration.)

Let's try to add the 'ftp' and 'stmt' to the vocabulary

In [21]:
#defining weights for new words

ftp_wghts = {
    'file':0.3,
    'transfer':0.3,
    'protocol':0.4,
}

stmt_wghts = {
    'statement':0.5,
    'bill':0.5
}

#adding words
add_word('[unused900]','ftp',ftp_wghts)
add_word('[unused902]','stmt',stmt_wghts)

In [22]:
most_similar('stmt')

index  word_index         word     score
907           907         stmt       1.0
3021         3021         bill  0.955079
4861         4861    statement  0.919354
8236         8236        bills  0.658201
6094         6094  legislation  0.446626


In [23]:
most_similar('ftp')

index  word_index      word     score
905           905       ftp       1.0
5371         5371      file  0.691518
4651         4651  transfer  0.664068
8778         8778  protocol  0.572106
6764         6764     files  0.536686


<b> Inference: </b> From the tables above we can infer that the newly added words have sensible vector representations and sit close to words with similar meanings.

We have successfully added the new words and their embeddings to the model, but the tokenizer vocab has not been updated, and it cannot be updated directly.
So we will write our vocab dictionary to tokenizer.json and then load the updated tokenizer.

In [24]:
#Updating the tokenizer
model_dir = './sentence-transformer-paraphrase-MiniLM-L6-v2'
tokenizer_path = os.path.join(model_dir,'tokenizer.json')

with open(tokenizer_path,'r',encoding='utf-8') as f:
    tokenizer_json = json.load(f)
    
#replacing the vocab with our edited dictionary, sorted by token index
tokenizer_json['model']['vocab']=dict(sorted(vocab_wrd2idx.items(),key=lambda x:x[1]))

#renaming old vocab file to keep a backup (os.rename works on any OS, unlike the Windows-only `ren`)
os.rename(tokenizer_path,os.path.join(model_dir,'tokenizer_old.json'))

#writing the updated tokenizer file
with open(tokenizer_path,'w',encoding='utf-8') as f:
    json.dump(tokenizer_json,f)
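
The same read, modify, write cycle can be tried safely on a throwaway file first. The sketch below builds a tiny stand-in JSON file with the same `model` / `vocab` nesting used above. The file contents, the temporary directory and the helper `swap_token` are assumptions for illustration, not the real tokenizer file. A backup copy is kept before the file is overwritten.

```python
import json
import os
import shutil
import tempfile

def swap_token(vocab, old_token, new_token):
    """Return a copy of vocab where new_token takes over old_token's index."""
    updated = dict(vocab)
    updated[new_token] = updated.pop(old_token)
    return dict(sorted(updated.items(), key=lambda item: item[1]))

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'tokenizer.json')
backup_path = os.path.join(workdir, 'tokenizer_old.json')

# Stand-in file: only the nesting mirrors the cell above
with open(path, 'w', encoding='utf-8') as f:
    json.dump({'model': {'vocab': {'[PAD]': 0, '[unused0]': 1, 'statement': 2}}}, f)

with open(path, 'r', encoding='utf-8') as f:
    tokenizer_json = json.load(f)

tokenizer_json['model']['vocab'] = swap_token(tokenizer_json['model']['vocab'], '[unused0]', 'stmt')

shutil.copyfile(path, backup_path)  # keep the original before overwriting
with open(path, 'w', encoding='utf-8') as f:
    json.dump(tokenizer_json, f)

with open(path, 'r', encoding='utf-8') as f:
    reloaded_vocab = json.load(f)['model']['vocab']
backup_exists = os.path.exists(backup_path)
print(reloaded_vocab, backup_exists)

shutil.rmtree(workdir)  # clean up the throwaway directory
```

The new word inherits the index of the placeholder it replaces, which is what keeps the tokenizer aligned with the embedding row that was overwritten in the model.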

In [25]:
updated_tokenizer = AutoTokenizer.from_pretrained('./sentence-transformer-paraphrase-MiniLM-L6-v2')

In [26]:
#checking if words are in vocab

words_to_add = ["stmt","ftp",'http']

[wrd in updated_tokenizer.vocab for wrd in words_to_add]

[True, True, True]

<b> Inference: </b> The New Tokenizer has all the updated words.

<h3> Sample 1: </h3>

<h6> With Old Tokenizer without the word stmt added </h6>

In [27]:
text1 = 'the statement is incorrect'
text2 = 'the stmt is wrong'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)

0.3759283423423767

<h6> With Updated Tokenizer containing the word stmt</h6>

In [28]:
text1 = 'the statement is incorrect'
text2 = 'the stmt is wrong'

vector1 = encode(updated_tokenizer,text1)
vector2 = encode(updated_tokenizer,text2)

cosine_similarity(vector1,vector2)

0.8123685717582703

<b> Result:</b> From the above example we observe a significant improvement in the similarity score (from 0.38 to 0.81), as the model now has an understanding of the word "stmt". Earlier, the old tokenizer was breaking the word <i>"stmt"</i> into <i>'st'</i> and <i>'##mt'</i> because it couldn't find "stmt" in the vocab.
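
The split described above follows from WordPiece's greedy longest-match-first lookup. Below is a simplified sketch of that lookup. The function `wordpiece_split` and both toy vocabularies are assumptions for illustration; the real tokenizer also handles normalisation, punctuation and length limits.

```python
def wordpiece_split(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1                          # try a shorter piece
        if match is None:
            return [unk_token]                # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

old_vocab = {'st', '##mt', 'statement', 'the'}
new_vocab = old_vocab | {'stmt'}

print(wordpiece_split('stmt', old_vocab))  # ['st', '##mt']
print(wordpiece_split('stmt', new_vocab))  # ['stmt']
```

Once the full word is in the vocabulary, the longest match succeeds on the first try, so the word maps to a single token and therefore to the single embedding row that was written for it.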

<h3> Sample 2: </h3>

<h6> With Old Tokenizer without the word ftp added </h6>

In [31]:
text1 = 'unable to upload the data through ftp'
text2 = 'file transfer protocol upload is not working'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)

0.6022258400917053

<h6> With Updated Tokenizer containing the word ftp</h6>

In [30]:
text1 = 'unable to upload the data through ftp'
text2 = 'file transfer protocol upload is not working'

vector1 = encode(updated_tokenizer,text1)
vector2 = encode(updated_tokenizer,text2)

cosine_similarity(vector1,vector2)

0.766219437122345

<b> Result:</b> In this case as well, we can observe a good increase in the similarity score (from 0.60 to 0.77).

<p>
    <b> Conclusion: From the above experiments </b>
    <ul>
        <li>We learnt how to add and modify vectors manually to improve the model's performance</li>
        <li>We explored the model architecture and the vocabulary, including the [unusedXXX] placeholder tokens that can be repurposed for new words</li>
        <li>We built functions like:
            <ul>
                <li><i>most_similar</i> to understand word similarity in vector space </li>
                <li><i>add_word, weighted_average_of_vectors</i> to add new words to our vocab</li>
            </ul>
        </li>
        <li>We understood the importance of context for embeddings</li>
    </ul>
</p>