
Fine Tuning Sentence Transformer Vectors without Model Training

Read on Medium: fine-tuning-sentence-transformer-vectors-without-model-training

In this tutorial, we will learn how to fine-tune a Sentence Transformer without model training.

What are we trying to achieve?

  1. Modify existing embeddings to adapt them to our use case.
  2. Add new words to the vocabulary.

For this tutorial we are going to use the “sentence-transformers/paraphrase-MiniLM-L6-v2” model.

What are Sentence Transformers?

Sentence Transformers is a Python framework for state-of-the-art sentence and text embeddings. This model maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like semantic search and paraphrase mining.

paraphrase-MiniLM-L6-v2 is based on BERT with 6 Transformer encoder layers; it can handle up to 512 tokens and returns a dense vector representation with 384 features.

Let’s load the pretrained Sentence Transformer model and the tokenizer:

from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained('./sentence-transformer-paraphrase-MiniLM-L6-v2')
model = TFAutoModel.from_pretrained('./sentence-transformer-paraphrase-MiniLM-L6-v2')
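
The snippets in the rest of this tutorial also rely on an encode(tokenizer, text) helper and a scalar cosine_similarity(vec1, vec2) function whose definitions are not shown in this README, plus a few standard imports. The sketch below is one plausible implementation, assuming mean pooling over the model's last hidden state; the exact helpers used to produce the numbers below may differ slightly.

import numpy as np
import pandas as pd
import json, os
from sklearn.preprocessing import normalize

def cosine_similarity(vec1, vec2):
    """Cosine similarity between two 1-D vectors, returned as a plain float."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

def encode(tokenizer, text):
    """Encode a sentence into a single 384-d vector by mean pooling the token vectors
    of the model's last hidden state (the model loaded above is used as a global)."""
    tokens = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
    output = model(**tokens)
    return np.mean(output.last_hidden_state.numpy()[0], axis=0)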

Vocabulary Exploration

vocab_wrd2idx = tokenizer.vocab
vocab_idx2wrd = {v:k for k,v in vocab_wrd2idx.items()}

print(len(vocab_wrd2idx))
#Output:
30522

sorted(vocab_wrd2idx.items(),key=lambda x:x[1])[0:10]
#Output:
[('[PAD]', 0),
 ('[unused0]', 1),
 ('[unused1]', 2),
 ('[unused2]', 3),
 ('[unused3]', 4),
 ('[unused4]', 5),
 ('[unused5]', 6),
 ('[unused6]', 7),
 ('[unused7]', 8),
 ('[unused8]', 9)]

We have 30,522 tokens in the vocab of the paraphrase-MiniLM-L6-v2 model. It contains 994 placeholder tokens ([unused0] through [unused993]) that are never used; these are the tokens we can replace to add new tokens to our vocab.
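
To see which placeholder tokens are available, you can filter the vocab for the [unused prefix (a quick check using the dictionaries defined above):

unused_tokens = [wrd for wrd in vocab_wrd2idx if wrd.startswith('[unused')]
print(len(unused_tokens), unused_tokens[:3])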

#extracting model weights
model_weights = model.get_weights()
#getting vocab weights
vocab_weights = model_weights[0]

We extract the model weights and select the first weight matrix, which contains the token embeddings.
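
A quick sanity check on the shape, assuming the embedding matrix is indeed the first array returned by get_weights():

print(vocab_weights.shape)
#expected: (30522, 384) -- one 384-dimensional vector per vocab token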

1. Modify Existing Word Embeddings to Adapt Them to Our Use Case

Depending on your use case, the same word can have different meanings in different contexts. Transformers are capable of handling such scenarios, but in a lot of places:

  • There isn’t enough context to work with
  • The degree to which the model considers the words to be close is not satisfactory

That’s why fine-tuning on the data you are working with is required; but if you are unable to do that for any reason, the following can help.

E.g., in a finance dataset, “statements” and “bills” are used synonymously, but since you are using a pretrained model, there is a high possibility that it does not treat them as similar to the degree you would like.

Let’s understand the distribution of words close to “statement”:

#Function to find top_k similar words
def most_similar(search_word,top_k=5):
    """This function takes a word, computes the cosine similarity between the given word
    and all other words, and returns the top_k most similar words"""
    search_idx = vocab_wrd2idx[search_word]
    similarity_ls=[]
    for word_idx,word_embed in enumerate(vocab_weights):
        similarity_score = cosine_similarity(vocab_weights[search_idx],word_embed)
        similarity_ls.append((word_idx,vocab_idx2wrd[word_idx],similarity_score))

    return pd.DataFrame(similarity_ls,columns=['word_index','word','score']).sort_values(by='score',ascending=False)[0:top_k]

most_similar('statement',top_k=10)
![Image1](./Images/1_bmtEjOAJmrNH5WxFfQgkLA.png)

From the above output we see that the word “statement” is treated in its general sense: the data distribution of the corpus used for pretraining is fairly broad in terms of domains/categories, so a general model of the language is obtained.

Let’s understand the vector closeness between “statement” and “bill”:

cosine_similarity(vocab_weights[vocab_wrd2idx['statement']],vocab_weights[vocab_wrd2idx['bill']])
#Output:
0.1263842135667801

From the above result, we can see that the similarity between the words “statement” and “bill” is substantially lower than we would like it to be. With a transformer model this comparison technique is not completely accurate, though, as we are just taking the embeddings from the first layer, where no context modelling has been done yet.

Let’s look at the similarity between sentences encoded with the model:

text1 = 'please send me the statement'
text2 = 'please transfer the statement'
vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)
cosine_similarity(vector1,vector2)
#Output:
0.7034520506858826

Both sentences are similar, and we get a similarity score of 0.70, which is reasonable.

text1 = 'please send me the bill'
text2 = 'please transfer the statement'
vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)
cosine_similarity(vector1,vector2)
#Output:
0.3917175233364105

Now, with “statement” replaced by “bill” in one sentence, the score drops sharply. Ideally, we would like it to be at least above 0.5.

Let’s understand the role of context here

Example 1: Without proper context

text1 = 'i did not recieve any bill' #from the bank
text2 = "i did'nt got any statement" #from the bank
vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)
cosine_similarity(vector1,vector2)
#Output:
0.3873918354511261

Result: The similarity score is very low, as the general meaning of “bill” is considered here.

Example 2: With Added Context

text1 = 'i did not recieve any bill from the bank'
text2 = "i did'nt got any statement from the bank"
vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)
cosine_similarity(vector1,vector2)
#Output:
0.6437106728553772

Result: Just by adding “from the bank”, we observe a significant improvement in the similarity score, from 0.38 to 0.64.

But when working with real data, a proper context will not always be present.

Let’s go back to our previous example and try to modify the embeddings to get a reasonable similarity score.

def weighted_average_of_vectors(word_wghts):
    """This function takes words and their weights and creates a new vector by taking a weighted avg of the vectors"""

    vector = np.zeros(384, dtype=np.float64)

    for key,value in word_wghts.items():
        vector += (1 - value) * vocab_weights[vocab_wrd2idx[key]]

    vector = normalize(vector.reshape(1,-1))[0]

    return vector
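
Note that the function scales each vector by (1 - weight) before summing, so with the equal 0.5/0.5 weights used below the result is simply the L2-normalized average of the two original vectors. A quick check that it returns a unit vector of the expected size:

v = weighted_average_of_vectors({'statement': 0.5, 'bill': 0.5})
print(v.shape, round(float(np.linalg.norm(v)), 4))
#expected: (384,) 1.0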
#modifying the embedding of statement to incorporate the meaning of bill as well in the vector

stmt_wghts = {
    'statement':0.5,
    'bill':0.5
}

vocab_weights[vocab_wrd2idx['statement']] = weighted_average_of_vectors(stmt_wghts)

#setting the updated weights
model_weights[0] = vocab_weights
model.set_weights(model_weights)

most_similar('statement',top_k=10)
![Image2](./Images/2_Vacu3UJ37QAK7kY_C5nsTA.png)

Result: From the above output, we can observe that the “bill” and “statement” vectors are now close to each other in the vector space.
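
You can also verify the change directly on the token vectors: since the new “statement” vector is a normalized blend of the original “statement” and “bill” vectors, recomputing the pairwise similarity should give a value well above the original 0.126.

cosine_similarity(vocab_weights[vocab_wrd2idx['statement']],vocab_weights[vocab_wrd2idx['bill']])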

Example 3: After Updating the Embedding

text1 = 'please send me the bill'
text2 = 'please transfer the statement'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)
#Output:
0.5900149941444397

Result: Now, with the embedding modified, we see a significant jump in the similarity score for this sentence pair, from 0.39 earlier to 0.59.

2. Add New Words to Vocabulary

Suppose you have trained your own model, or you are using a pretrained one, and you observe that some important words are missing from the vocabulary, but you don’t want to retrain the model on new data. You can follow the method below to add new words/tokens to your model.

#checking if words are in vocab

words_to_add = ["stmt","ftp",'http']

[wrd in tokenizer.vocab for wrd in words_to_add]
#Output:
[False, False, True]

Result: Only “http” is present in the vocab; “ftp” and “stmt” (short for statement) are not present (you could always replace such words during preprocessing; this is just for demonstration).

Let’s try to add the words ‘ftp’ and ‘stmt’ to the vocabulary.

def add_word(word_to_replace,word_to_add,word_wghts):
    '''This function takes the token to replace, the word to add and the word weights,
    and replaces the token and its embedding.
    '''
    #editing the dictionary
    vocab_index = vocab_wrd2idx[word_to_replace]
    vocab_wrd2idx[word_to_add] = vocab_index
    del vocab_wrd2idx[vocab_idx2wrd[vocab_index]]
    vocab_idx2wrd[vocab_index] = word_to_add

    #adding the vector
    vocab_weights[vocab_index] = weighted_average_of_vectors(word_wghts)

    #setting the weights
    model_weights[0] = vocab_weights
    model.set_weights(model_weights)
#defining weights for new words

ftp_wghts = {
    'file':0.3,
    'transfer':0.3,
    'protocol':0.4,
}

stmt_wghts = {
    'statement':0.5,
    'bill':0.5
}

#adding words
add_word('[unused900]','ftp',ftp_wghts)
add_word('[unused902]','stmt',stmt_wghts)
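
At this point the in-memory vocab dictionaries and the embedding matrix have been updated, but the tokenizer itself has not. A quick check on the dictionaries:

print('ftp' in vocab_wrd2idx, 'stmt' in vocab_wrd2idx)
print('[unused900]' in vocab_wrd2idx, '[unused902]' in vocab_wrd2idx)
#expected: True True, then False False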

Let’s check the most similar words to the newly added words

most_similar('stmt')
most_similar('ftp')
![Image3](./Images/3_43K9A5v54tVUAC1-3qdjmw.png)

Inference: From the above table, we can infer that the newly added words have proper vector representations and are close to words with similar meanings.

We have successfully added the new words and their embeddings, but the tokenizer vocab is not yet updated. It cannot be updated directly, so we will write our vocab dictionary to tokenizer.json and then load the updated tokenizer.

#Updating the tokenizer
with open('./sentence-transformer-paraphrase-MiniLM-L6-v2/tokenizer.json','r',encoding='utf-8') as f:
    tokenizer_json = json.load(f)

tokenizer_json['model']['vocab'] = dict(sorted(vocab_wrd2idx.items(),key=lambda x:x[1]))

#renaming the old vocab file
os.chdir('./sentence-transformer-paraphrase-MiniLM-L6-v2')
!ren tokenizer.json tokenizer_old.json
os.chdir('../')

with open('./sentence-transformer-paraphrase-MiniLM-L6-v2/tokenizer.json','w',encoding='utf-8') as f:
    json.dump(tokenizer_json,f)
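
Note that !ren is a Windows shell command executed from the notebook. If you are on Linux/macOS, or prefer to stay in Python, the rename step above could equivalently be done with os.replace (using the same paths as above):

#backing up the original tokenizer file before writing the updated one
os.replace('./sentence-transformer-paraphrase-MiniLM-L6-v2/tokenizer.json',
           './sentence-transformer-paraphrase-MiniLM-L6-v2/tokenizer_old.json')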

Loading the updated Tokenizer:

updated_tokenizer = AutoTokenizer.from_pretrained('./sentence-transformer-paraphrase-MiniLM-L6-v2')

#checking if words are in vocab
words_to_add = ["stmt","ftp",'http']

[wrd in updated_tokenizer.vocab for wrd in words_to_add]
#Output:
[True, True, True]

Inference: The new tokenizer has all the added words.

Now, let’s compare the similarity between the sentences before and after adding the missing words.

Sample 1:

With the old tokenizer, without the word “stmt” added:

text1 = 'the statement is incorrect'
text2 = 'the stmt is wrong'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)
#Output
0.3759283423423767

With the updated tokenizer, containing the word “stmt”:

text1 = 'the statement is incorrect'
text2 = 'the stmt is wrong'

vector1 = encode(updated_tokenizer,text1)
vector2 = encode(updated_tokenizer,text2)

cosine_similarity(vector1,vector2)
#Output
0.8123685717582703

Result: From the above example, we observe a significant improvement in the similarity score, as the model now has an understanding of the word “stmt”. Earlier, the old tokenizer was breaking “stmt” into ‘st’ and ‘##mt’ because it couldn’t find “stmt” in the vocab.
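
You can see this difference directly by tokenizing the same sentence with both tokenizers (expected splits per the description above):

print(tokenizer.tokenize('the stmt is wrong'))
#['the', 'st', '##mt', 'is', 'wrong']
print(updated_tokenizer.tokenize('the stmt is wrong'))
#['the', 'stmt', 'is', 'wrong']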

Sample 2:

With the old tokenizer, without the word “ftp” added:

text1 = 'unable to upload the data through ftp'
text2 = 'file transfer protocol upload is not working'

vector1 = encode(tokenizer,text1)
vector2 = encode(tokenizer,text2)

cosine_similarity(vector1,vector2)
#Output:
0.6022258400917053

With the updated tokenizer, containing the word “ftp”:

text1 = 'unable to upload the data through ftp'
text2 = 'file transfer protocol upload is not working'

vector1 = encode(updated_tokenizer,text1)
vector2 = encode(updated_tokenizer,text2)

cosine_similarity(vector1,vector2)
#Output:
0.766219437122345

Result: In this case as well, we can observe a good increase in the similarity score.

Conclusion: From the above experiments:

  • We learnt how to add new tokens and modify existing vectors manually to improve model performance
  • We understood the model architecture, and why there are so many [unusedXXX] tokens available in the vocabulary
  • We understood the importance of context for embeddings
  • We built functions such as:

most_similar, to explore word similarity in the vector space

add_word and weighted_average_of_vectors, to add new words to our vocab

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Connect On: LinkedIn: Suraj Kumar