<a href="https://colab.research.google.com/github/Rishabh-Thapliyal/Google-colab/blob/main/Text_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [26]:
import pandas as pd
import numpy as np

* For embeddings, roughly from weakest to strongest: GloVe < fastText < Transformers

* For similarity scores: Cosine Similarity, Jaccard Similarity, Euclidean Distance

### **Theory**

* Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations with BERT (10,000 × 9,999 / 2 ≈ 50 million sentence pairs, roughly 65 hours), because BERT scores each pair jointly.

* Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses **siamese and triplet network structures** to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
* This reduces the effort of finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.

* A common method to address clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences are close. SBERT follows this approach.

* The most commonly used approach with plain BERT is to average the BERT output layer (known as BERT embeddings) or to use the output of the first token (the [CLS] token). This common practice yields rather bad sentence embeddings, often worse than averaging GloVe embeddings (Pennington et al., 2014).

* BERT is modified by adding a pooling layer and training it within siamese and triplet network structures to derive semantically meaningful sentence embeddings.

* SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-size sentence embedding. The SBERT authors experiment with three pooling strategies: using the output of the CLS token, computing the mean of all output vectors (MEAN strategy), and computing a max-over-time of the output vectors (MAX strategy). The default configuration is MEAN.

* To fine-tune BERT / RoBERTa, the authors create siamese and triplet networks (Schroff et al., 2015) to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine similarity.
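
The three pooling strategies named above can be sketched on toy data. This is a minimal illustration in plain Python, not SBERT's implementation: the token values and the helper names `pool_cls`, `pool_mean`, and `pool_max` are made up for the example, and padding is ignored for simplicity.

```python
# Toy token embeddings for one sentence: 3 tokens x 4 dimensions (made-up values).
# Row 0 plays the role of the [CLS] token.
token_embeddings = [
    [0.2, -1.0, 0.5, 0.0],
    [0.6, 0.4, -0.1, 1.0],
    [-0.2, 0.0, 0.8, 2.0],
]


def pool_cls(tokens):
    """CLS strategy: use the first token's vector as the sentence embedding."""
    return list(tokens[0])


def pool_mean(tokens):
    """MEAN strategy: average every dimension over all tokens."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def pool_max(tokens):
    """MAX strategy: take the maximum of every dimension over all tokens."""
    return [max(column) for column in zip(*tokens)]


cls_vec = pool_cls(token_embeddings)
mean_vec = pool_mean(token_embeddings)
max_vec = pool_max(token_embeddings)

# Each strategy turns a (tokens x dim) matrix into a single dim-sized vector.
print(len(cls_vec), len(mean_vec), len(max_vec))
```

Whatever the strategy, the result has a fixed size regardless of sentence length, which is what makes the embeddings directly comparable.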



## **1. Using Sentence Transformers**

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. It is based on the paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks".

In [27]:
pip install -U sentence-transformers



In [28]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

# model = SentenceTransformer('all-MiniLM-L6-v2')

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1')
embeddings = model.encode(sentences)
# 2 x 384


*   The model starts from the pretrained microsoft/MiniLM-L12-H384-uncased model (12-layer, 384-hidden, 33M parameters, 2.7x faster than BERT-Base).

*   It was then fine-tuned on a dataset of 1B sentence pairs with a **contrastive learning objective**: given a sentence from a pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in the dataset.
*  List of all the models: https://www.sbert.net/docs/pretrained_models.html
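
The contrastive objective described above can be sketched as a cross-entropy over in-batch candidates. This is only an illustration under stated assumptions, not the model's training script: the function name `in_batch_contrastive_loss`, the `scale` value of 20, and the toy vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the true partner of anchors[i]
    and every other positive in the batch serves as a negative.
    `scale` is an assumed temperature-like factor applied to the similarities."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability assigned to the true partner.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy unit vectors (made-up values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own partner
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the wrong partner

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each sentence is closest to its true partner and large when it is closer to another sentence in the batch, which is the signal that pulls paired sentences together during fine-tuning.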

In [29]:
model.parameters

<bound method Module.parameters of SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)>

In [30]:
embeddings.shape

(2, 384)

In [31]:
embeddings[0]

array([ 8.62785801e-02,  9.26549211e-02, -8.09492369e-04, -5.85748963e-02,
       -5.25480844e-02, -1.44912368e-02,  6.40500411e-02, -2.11268738e-02,
        9.52945948e-02, -1.30148241e-02,  6.97807521e-02,  4.59239706e-02,
        5.80606200e-02, -1.36662470e-02, -1.22170355e-02,  6.24098768e-03,
        3.48226614e-02,  1.27639445e-02, -1.16786696e-01, -1.95600390e-02,
        1.02444917e-01,  4.14308682e-02, -2.81891436e-03, -3.11293807e-02,
        1.23362392e-02, -2.74112020e-02,  1.93532649e-02,  1.66055784e-02,
        5.82100339e-02, -1.06136920e-02, -6.14987090e-02,  7.81148896e-02,
        5.24102040e-02, -1.12290876e-02, -3.93548645e-02,  3.38851027e-02,
       -1.42190745e-02,  1.36735234e-02, -1.08781740e-01, -1.73655935e-02,
       -3.70806269e-02, -7.76304603e-02,  7.11951181e-02, -1.09157767e-02,
       -1.70923211e-02, -5.06156534e-02, -1.18527366e-02,  2.49665454e-02,
        3.80927101e-02, -1.41296551e-01, -8.80392194e-02, -2.02588644e-02,
       -7.47827962e-02, -

## **Internal working of Sentence Transformer**

In [32]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

In [33]:
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
# sentences = ['This is an example sentence', 'This is an example sentence']
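# Load the tokenizer and the underlying BERT-architecture model from the Hugging Face Hub
# (note: this rebinds `model`, replacing the SentenceTransformer object created earlier)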
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L12-v1')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L12-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

In [34]:
encoded_input

# token_type_ids are binary segment IDs. They matter when the input is a pair of texts, for example in question answering tasks.

{'input_ids': tensor([[ 101, 2023, 2003, 2019, 2742, 6251,  102],
        [ 101, 2169, 6251, 2003, 4991,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}

In [35]:
model_output = model(**encoded_input)

In [36]:
model_output

# 2 x 7 x 384

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 2.1318e-01,  7.7631e-02,  5.0232e-02,  ...,  7.4465e-02,
          -6.2309e-02,  9.9766e-02],
         [ 3.4441e-01,  7.4522e-01, -5.7266e-02,  ..., -5.0167e-03,
           9.2369e-01,  3.5426e-01],
         [-8.3162e-02,  5.3951e-01,  1.9987e-01,  ...,  3.0141e-01,
           6.7261e-01,  1.2288e-01],
         ...,
         [ 5.5261e-01,  4.5401e-02, -2.7609e-01,  ...,  3.4287e-01,
          -5.3887e-01,  8.1662e-01],
         [ 7.7691e-01,  4.2356e-01,  1.1663e-01,  ...,  3.1073e-02,
          -1.0080e-01, -7.1610e-01],
         [ 2.1318e-01,  7.7628e-02,  5.0233e-02,  ...,  7.4465e-02,
          -6.2308e-02,  9.9759e-02]],

        [[ 2.1315e-01,  2.0354e-01,  2.0434e-02,  ...,  1.0548e-01,
          -9.9703e-02,  1.1387e-01],
         [ 3.4222e-01,  3.0913e-01, -1.2724e-01,  ..., -9.6097e-02,
           1.4940e-01,  5.0645e-02],
         [ 6.6650e-01, -8.5140e-04,  1.1246e-01,  ...,  2.0663e-01,
          -1.

Since the token embeddings at BERT's output layer are contextual, the BERT authors argue that the [CLS] token captures enough semantic information for its vector to represent the entire sentence, so it can be used for classification tasks.

**last_hidden_state**
Sequence of hidden-states at the output of the last layer of the model.

**pooler_output**
Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.

**This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.**

In [37]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains output of last_hidden_state .... 2*7*384
    step_1 = attention_mask.unsqueeze(-1)
    print('Step 1 shape:',step_1.shape)
    print(step_1)
    input_mask_expanded = step_1.expand(token_embeddings.size()).float()
    print('input_mask_expanded shape:', input_mask_expanded.shape)
    print(input_mask_expanded)
    numerator = torch.sum(token_embeddings * input_mask_expanded, 1) # element wise multiplication followed by sum across dim=1 (2*7*384)
    print('numerator shape', numerator.shape)
    print(numerator)
    denominator = torch.clamp(input_mask_expanded.sum(1), min=1e-9) # number of actual tokens
    print('denominator shape:', denominator.shape)
    print(denominator)
    return numerator/denominator
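
To see why the attention mask matters, here is a small sketch of the same idea in plain Python on toy data. The helper name `masked_mean` and the values are made up for illustration; the last token is treated as padding.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            summed = [s + x for s, x in zip(summed, vector)]
    # Same guard as torch.clamp(..., min=1e-9): avoids dividing by zero
    # when a row consists only of padding.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)  # average over the two real tokens only
print(naive)   # plain average, distorted by the padding row
```

Without the mask, the padding row would be averaged in and shift the sentence embedding, so sentences of different lengths in one batch would not be comparable.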

In [130]:
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

Step 1 shape: torch.Size([2, 7, 1])
tensor([[[1],
         [1],
         [1],
         [1],
         [1],
         [1],
         [1]],

        [[1],
         [1],
         [1],
         [1],
         [1],
         [1],
         [0]]])
input_mask_expanded shape: torch.Size([2, 7, 384])
tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [0., 0., 0.,  ..., 0., 0., 0.]]])
numerator shape torch.Size([2, 384])
tensor([[ 2.2812e+00,  2.4498e+00, -2.1403e-02, -1.5487e+00, -1.3894e+00,
         -3.8315e-01,  1.6935e+00, -5.5860e-01,  2.5196e+00, -3.4411e-01,
    

In [131]:
sentence_embeddings.shape

torch.Size([2, 384])

In [39]:
# Normalize embeddings
norm_sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# sentence_embeddings --> 2 * 384
# F.normalize performs L_p normalization of the input over the given dimension; dim=1 normalizes each row (one sentence vector) across its columns.
# ||x||_p = (|x_1|^p + |x_2|^p + ... + |x_n|^p)^(1/p)
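
The L_p normalization formula in the comment above can be checked by hand with a small plain-Python sketch. The helper name `lp_normalize` is made up for illustration; the `eps` guard mirrors the idea of dividing by `max(norm, eps)` so that an all-zero vector does not cause a division by zero.

```python
def lp_normalize(vector, p=2.0, eps=1e-12):
    """Divide a vector by its L_p norm: (sum(|x_i|^p))^(1/p)."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / max(norm, eps) for x in vector]


v = [3.0, -4.0]

l2 = lp_normalize(v, p=2)  # L2 norm of v is 5
l1 = lp_normalize(v, p=1)  # L1 norm of v is 7

print(l2)
print(l1)
```

After L2 normalization the squared components sum to 1, and after L1 normalization the absolute values sum to 1.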

In [40]:
# showing for p=1
F.normalize(sentence_embeddings, p=1, dim=1)[0][:5]

tensor([ 5.7188e-03,  6.1415e-03, -5.3656e-05, -3.8825e-03, -3.4831e-03],
       grad_fn=<SliceBackward0>)

In [41]:
sentence_embeddings[0][0]/abs(sentence_embeddings[0]).sum()

tensor(0.0057, grad_fn=<DivBackward0>)

In [42]:
norm_sentence_embeddings

tensor([[ 8.6279e-02,  9.2655e-02, -8.0949e-04, -5.8575e-02, -5.2548e-02,
         -1.4491e-02,  6.4050e-02, -2.1127e-02,  9.5295e-02, -1.3015e-02,
          6.9781e-02,  4.5924e-02,  5.8061e-02, -1.3666e-02, -1.2217e-02,
          6.2410e-03,  3.4823e-02,  1.2764e-02, -1.1679e-01, -1.9560e-02,
          1.0244e-01,  4.1431e-02, -2.8189e-03, -3.1129e-02,  1.2336e-02,
         -2.7411e-02,  1.9353e-02,  1.6606e-02,  5.8210e-02, -1.0614e-02,
         -6.1499e-02,  7.8115e-02,  5.2410e-02, -1.1229e-02, -3.9355e-02,
          3.3885e-02, -1.4219e-02,  1.3674e-02, -1.0878e-01, -1.7366e-02,
         -3.7081e-02, -7.7630e-02,  7.1195e-02, -1.0916e-02, -1.7092e-02,
         -5.0616e-02, -1.1853e-02,  2.4967e-02,  3.8093e-02, -1.4130e-01,
         -8.8039e-02, -2.0259e-02, -7.4783e-02, -1.2329e-02,  6.1770e-02,
          1.7937e-02, -5.5000e-02,  1.1949e-01,  1.3047e-02, -3.7607e-02,
         -9.8207e-04,  5.3434e-02, -3.0955e-02,  7.8055e-03,  6.9016e-02,
          9.3829e-03, -1.6446e-02,  3.

In [43]:
# norm_sentence_embeddings is same as embeddings
embeddings

array([[ 8.62785801e-02,  9.26549211e-02, -8.09492369e-04,
        -5.85748963e-02, -5.25480844e-02, -1.44912368e-02,
         6.40500411e-02, -2.11268738e-02,  9.52945948e-02,
        -1.30148241e-02,  6.97807521e-02,  4.59239706e-02,
         5.80606200e-02, -1.36662470e-02, -1.22170355e-02,
         6.24098768e-03,  3.48226614e-02,  1.27639445e-02,
        -1.16786696e-01, -1.95600390e-02,  1.02444917e-01,
         4.14308682e-02, -2.81891436e-03, -3.11293807e-02,
         1.23362392e-02, -2.74112020e-02,  1.93532649e-02,
         1.66055784e-02,  5.82100339e-02, -1.06136920e-02,
        -6.14987090e-02,  7.81148896e-02,  5.24102040e-02,
        -1.12290876e-02, -3.93548645e-02,  3.38851027e-02,
        -1.42190745e-02,  1.36735234e-02, -1.08781740e-01,
        -1.73655935e-02, -3.70806269e-02, -7.76304603e-02,
         7.11951181e-02, -1.09157767e-02, -1.70923211e-02,
        -5.06156534e-02, -1.18527366e-02,  2.49665454e-02,
         3.80927101e-02, -1.41296551e-01, -8.80392194e-0

## **Similarity metrics**

In [46]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import euclidean_distances

In [47]:
cosine_similarity(norm_sentence_embeddings.detach().numpy())

array([[0.99999994, 0.3889224 ],
       [0.3889224 , 0.9999998 ]], dtype=float32)

The Jaccard similarity (Intersection/Union) is especially effective when the order of items is irrelevant and only the presence or absence of elements is examined. For Jaccard similarity, we don't need word vectors.

In [48]:
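def jaccard_similarity(set1, set2):
    # size of the intersection of the two sets
    intersection = len(set1.intersection(set2))
    # size of the union of the two sets
    union = len(set1.union(set2))
    # two empty sets have no defined ratio; return 0.0 instead of dividing by zero
    if union == 0:
        return 0.0
    return intersection / union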

In [49]:
set_a = [set(w.split()) for w in sentences] # split sentences into words

In [50]:
set_a

[{'This', 'an', 'example', 'is', 'sentence'},
 {'Each', 'converted', 'is', 'sentence'}]

In [51]:
j = jaccard_similarity(set_a[0], set_a[1])
j

0.2857142857142857

Euclidean distance takes a bit more time and computation than the other two. Choosing between Euclidean distance and cosine similarity comes down to whether magnitude or only orientation matters: cosine similarity compares the angle between two vectors, while Euclidean distance also reflects their lengths.

In [52]:
euclidean_distances(norm_sentence_embeddings.detach().numpy())

array([[0.       , 1.1055113],
       [1.1055113, 0.       ]], dtype=float32)
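
For unit-length vectors such as the normalized embeddings above, the two measures are tied together by `distance = sqrt(2 - 2 * cosine)`, so they rank pairs identically. The sketch below checks this on toy vectors and on the two numbers printed above; the helper names and toy values are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two toy unit vectors.
u = [0.6, 0.8]
v = [1.0, 0.0]

cos_uv = cosine(u, v)
dist_uv = euclidean(u, v)
dist_from_cos = math.sqrt(2 - 2 * cos_uv)

# The same identity applied to the cosine similarity reported above.
notebook_cos = 0.3889224
predicted_dist = math.sqrt(2 - 2 * notebook_cos)

print(dist_uv, dist_from_cos)
print(predicted_dist)  # compare with the Euclidean distance printed above
```

The identity only holds when both vectors have length 1; for unnormalized vectors the two measures can disagree.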

## **2. Using GloVe Word Embeddings**

In [55]:
embeddings_index = {}
# Each line of the GloVe file holds a word followed by its vector components, separated by spaces
with open('/content/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = values[0]  # the first entry is the word
        coefs = np.asarray(values[1:], dtype='float32')  # the remaining entries are the embedding vector for the word
        embeddings_index[word] = coefs

print('GloVe data loaded')

GloVe data loaded


In [56]:
len(embeddings_index.keys())

41726

In [57]:
embeddings_index['sentence']

array([-0.47255  ,  0.22225  , -0.63452  , -0.76602  ,  1.0656   ,
        0.76451  ,  1.6131   ,  0.87044  ,  0.17932  ,  1.078    ,
       -0.55141  , -0.22577  , -0.22832  , -0.40384  ,  1.7432   ,
       -0.78039  , -0.89178  , -0.50943  ,  0.33377  ,  0.39782  ,
        0.16417  ,  0.22198  , -0.061903 , -0.0040991, -0.61842  ,
       -2.7304   ,  0.88745  ,  0.045329 ,  0.22605  ,  0.35428  ,
        1.975    , -1.4989   , -0.72365  ,  0.096346 ,  1.203    ,
       -0.38698  ,  1.0918   , -0.18933  , -0.15378  , -0.22674  ,
       -0.63325  ,  0.80934  ,  0.36922  ,  0.81327  , -0.46229  ,
       -1.0106   ,  0.42674  ,  0.37537  ,  0.06769  ,  0.13591  ],
      dtype=float32)

In [58]:
sentences

['This is an example sentence', 'Each sentence is converted']

In [59]:
# preprocessing step

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

## Iterate over the data to preprocess by removing stopwords
lines_without_stopwords=[]
for line in sentences:
    line = line.lower()
    line_by_words = re.findall(r'(?:\w+)', line, flags = re.UNICODE) # remove punctuation and split
    new_line=[]
    for word in line_by_words:
        if word not in stop:
            new_line.append(word)
    lines_without_stopwords.append(new_line)
texts = lines_without_stopwords

print(texts)

[['example', 'sentence'], ['sentence', 'converted']]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [60]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_NUM_WORDS = 100
MAX_SEQUENCE_LENGTH = 10
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS) #top most frequent MAX_NUM_WORDS only to be tokenised
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)


print(data.shape)

Found 3 unique tokens.
(2, 10)


In [61]:
sequences

[[2, 1], [1, 3]]

In [62]:
data

array([[0, 0, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 3]], dtype=int32)

In [63]:
word_index

{'sentence': 1, 'example': 2, 'converted': 3}

In [67]:
# from keras.layers import Embedding
# from keras.initializers import Constant

# EMBEDDING_DIM must match the dimension of the vectors in embeddings_index
EMBEDDING_DIM = embeddings_index.get('a').shape[0]

num_words = min(MAX_NUM_WORDS, len(word_index)) + 1 # +1 because word_index starts at 1; index 0 is used for padding

embedding_matrix = np.zeros((num_words, EMBEDDING_DIM)) # row 0 stays all-zero; rows are filled from index 1

for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word) ## get embeddings of the word from Glove
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector


In [68]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.47255   ,  0.22225   , -0.63451999, -0.76602   ,  1.06560004,
         0.76450998,  1.61310005,  0.87044001,  0.17931999,  1.07799995,
        -0.55141002, -0.22577   , -0.22832   , -0.40384001,  1.74319994,
        -0.78039002, -0.89178002, -0.50942999,  0.

In [69]:
np.count_nonzero(data[0], axis=0)

2

In [70]:
sent_embd = []
for sent in data:
    # one row per real (non-padding) token in the sentence
    arr = np.zeros([np.count_nonzero(sent), EMBEDDING_DIM])
    i = 0
    for idx in sent:
        if idx > 0:  # index 0 is padding
            arr[i] = embedding_matrix[idx]
            i += 1

    # sentence embedding = mean of its word vectors
    sent_embd.append(np.mean(arr, axis=0))

In [71]:
sent_embd

[array([ 0.02154501,  0.39568499, -0.41605499, -0.3789872 ,  0.74128503,
         0.67976499,  0.77989403,  0.01910999, -0.01891501,  0.69422497,
        -0.22894501,  0.06373   ,  0.026595  , -0.37846   ,  0.98907997,
        -0.36805001, -0.43733551, -0.25152755,  0.158575  , -0.14897001,
         0.0919945 , -0.15273999, -0.1010065 ,  0.10776045, -0.24075   ,
        -1.99935007, -0.003355  , -0.0688855 ,  0.22974   ,  0.148013  ,
         2.61155003, -0.99342003, -0.36785999, -0.360052  ,  0.70740998,
        -0.282675  ,  0.53152999, -0.044986  , -0.15161   ,  0.01668   ,
        -0.22203   ,  0.47978   ,  0.27599999,  0.65689498, -0.24391099,
        -0.38194498,  0.26634999,  0.255745  ,  0.03836635,  0.267765  ]),
 array([-0.1178    , -0.26522001, -0.241885  , -0.94631001,  0.99638003,
         0.38484449,  0.27785003,  0.29427001, -0.01741   ,  0.79103497,
         0.15105498, -0.032985  , -0.184515  , -0.2134395 ,  0.63524497,
        -0.39866351, -0.29206501, -0.321725  , -0
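
Once each sentence is reduced to the average of its word vectors, the two sentences can be compared with cosine similarity. The sketch below shows the idea in plain Python with hypothetical 3-dimensional vectors standing in for GloVe rows; the values and helper names are made up. Unlike the loop above, it skips words that have no vector instead of counting them as zeros.

```python
import math

# Hypothetical 3-d word vectors standing in for GloVe rows (made-up values).
toy_vectors = {
    "example": [1.0, 0.0, 1.0],
    "sentence": [0.0, 1.0, 1.0],
    "converted": [1.0, 1.0, 0.0],
}


def average_embedding(tokens, vectors):
    """Mean of the vectors of all tokens found in `vectors`; None if none are found."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return None
    return [sum(column) / len(found) for column in zip(*found)]


def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


sent_a = average_embedding(["example", "sentence"], toy_vectors)
sent_b = average_embedding(["sentence", "converted"], toy_vectors)
similarity = cosine(sent_a, sent_b)

print(sent_a, sent_b)
print(similarity)
```

With the real data, the same comparison could be run on the two arrays in `sent_embd`.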

## **3. Word2Vec Word Embeddings**

Use case: map equivalent queries, e.g. `queries = ['iphone red 256gb', 'red iphone 256gb']`, to the same item page / topic page.

* Treat two queries as similar when their similarity score exceeds a chosen threshold.
* Cluster the embeddings, then take the query with the highest volume in each cluster.
* Use product description similarity for the topic page.

Next steps:

* GenAI embeddings
* Query --> topic page
* PEFT
* Customizing LLMs
* Clustering part
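
The threshold-and-cluster idea in these notes can be sketched as follows. This is a rough illustration under loud assumptions: it uses Jaccard similarity on query tokens rather than embeddings, a simple greedy grouping instead of a real clustering algorithm, and the third query, the search volumes, and the `THRESHOLD` value are all made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical queries with made-up search volumes.
query_volume = {
    "iphone red 256gb": 120,
    "red iphone 256gb": 300,
    "samsung galaxy case": 80,
}
THRESHOLD = 0.8  # assumed cut-off above which two queries count as similar


def group_queries(volumes, threshold):
    """Greedy grouping: a query joins the first group whose first member it matches."""
    groups = []
    for query in volumes:
        tokens = set(query.split())
        for group in groups:
            if jaccard(tokens, set(group[0].split())) >= threshold:
                group.append(query)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([query])
    return groups


groups = group_queries(query_volume, THRESHOLD)
# Pick the highest-volume query of each group as its representative.
canonical = [max(group, key=query_volume.get) for group in groups]

print(groups)
print(canonical)
```

Because Jaccard ignores word order, the two iPhone queries score 1.0 and land in the same group; an embedding-based similarity could be swapped in for `jaccard` to also catch paraphrases that share no tokens.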
