<a href="https://colab.research.google.com/github/Azizkhaled/NLP_with_Aziz/blob/main/Similarity_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install transformers



We want to measure the similarity between the following five sentences and find which one is most similar to the first sentence.

In [2]:
sentences = ["Parallel lines have so much in common. It's a shame they'll never meet",
        "I'm reading a book on anti-gravity. it's impossible to put down.",
        "Time flies like an arrow; fruit flies like a banana. The universe has a sense of humor.",
        "Why did the Egyptian pharaoh go to therapy? To work through his pyramid complex issues!",
        "Parallel lines must be the ultimate introverts. They're so distant, even geometry can't fix it!"
        ]

## Method 1: Transformers and PyTorch

## Step 1: Tokenize the sentences

In [3]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

# initialize dictionary that will contain tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # tokenize sentence and append to dictionary lists
    new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True,
                                       padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
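
The loop above can also be written as one batched call. A Hugging Face tokenizer accepts a list of strings and returns `input_ids` and `attention_mask` already stacked, so the manual `torch.stack` step is not needed. This is a minimal sketch, not part of the original notebook: the helper name `tokenize_batch` is hypothetical, and it assumes the `tokenizer` and `sentences` objects defined earlier.

```python
def tokenize_batch(tokenizer, sentences, max_length=128):
    """Tokenize a list of sentences in one call (hypothetical helper)."""
    # padding='max_length' pads every sentence to max_length, as the loop above does;
    # truncation=True cuts off anything longer than max_length.
    return tokenizer(
        sentences,
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='pt',
    )

# Usage with the tokenizer loaded earlier (assumed to be in scope):
# tokens = tokenize_batch(tokenizer, sentences)
# tokens['input_ids'].shape
```

For a BERT tokenizer the returned object also carries `token_type_ids`, which the model accepts when called with `model(**tokens)`.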

## Step 2: Build the dense vectors


### 1. Get the dense vectors embeddings

The dense vector representations of our text are contained in the 'last_hidden_state' tensor of the model outputs, which we access like so:

In [4]:
outputs = model(**tokens)
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [5]:
embeddings = outputs.last_hidden_state
embeddings

tensor([[[ 0.0191,  1.0852,  1.4530,  ...,  0.5818,  0.1906,  0.7413],
         [ 0.2287,  0.5216,  1.0775,  ..., -0.0472, -0.0768,  0.4471],
         [ 0.2503,  0.0941,  1.2539,  ..., -0.1584, -0.1444,  0.9768],
         ...,
         [ 0.2560,  0.4388,  1.7174,  ..., -0.0043, -0.2570,  0.5177],
         [ 0.4631,  0.4670,  1.3358,  ..., -0.0327, -0.2891,  0.2209],
         [ 0.4356,  0.4081,  1.2824,  ..., -0.1472, -0.3632,  0.1564]],

        [[-0.2938,  1.0964,  0.6436,  ...,  0.5726,  0.1074,  0.8784],
         [-0.0638,  1.1400,  1.1656,  ...,  0.4666,  0.0043,  0.6349],
         [-0.3595,  0.9355,  1.3937,  ...,  0.5706,  0.1614,  0.4186],
         ...,
         [ 0.0152,  0.6866,  1.1129,  ...,  0.6757, -0.0667,  0.6100],
         [-0.0367,  0.7764,  0.8585,  ...,  0.3809, -0.1189,  0.7557],
         [-0.0919,  0.7379,  0.9367,  ...,  0.4508, -0.2233,  0.6574]],

        [[-0.2888,  0.5438, -0.3228,  ...,  1.0956,  0.5095,  0.1958],
         [-0.2769,  1.3928, -0.5112,  ...,  0

In [6]:
embeddings.shape

torch.Size([5, 128, 768])

### 2. Perform *mean pooling*

We need to perform a mean pooling operation on these token embeddings to create a single vector encoding per sentence (the sentence embedding). To do this, we multiply each value in our embeddings tensor by its respective attention_mask value, so that padding tokens are ignored.
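
In formula form, the masked mean can be written as follows. The symbols are introduced here for illustration only: $h_{i,t}$ is the hidden state of token $t$ in sentence $i$, and $m_{i,t} \in \{0, 1\}$ is its attention-mask value.

$$
s_i = \frac{\sum_{t=1}^{T} m_{i,t}\, h_{i,t}}{\max\left(\sum_{t=1}^{T} m_{i,t},\ \epsilon\right)}
$$

Here $T = 128$ is the padded sequence length and $\epsilon$ is a small constant that guards against division by zero (the code in this notebook uses $10^{-9}$).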

#### a. Inspect our attention_mask tensor

To perform mean pooling, we first take the attention_mask tensor and check its shape.

In [7]:
attention_mask = tokens['attention_mask']
attention_mask.shape

torch.Size([5, 128])

#### b. Expand our attention mask

We need to expand our attention mask to the same size as our embeddings.

In [8]:
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape

torch.Size([5, 128, 768])

#### c. Multiply the two tensors to apply the attention mask

In [11]:
masked_embeddings = embeddings * mask
masked_embeddings.shape

torch.Size([5, 128, 768])
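
The explicit `expand` call makes the shapes easy to follow, but PyTorch broadcasting should give the same result from a mask of shape `[batch, seq, 1]`. The sketch below checks this on small toy tensors; the `toy_*` names are illustrative and do not touch the notebook's real `embeddings` or `attention_mask`.

```python
import torch

# Toy stand-ins for the real tensors: 2 sentences, 3 tokens, 4 hidden dims.
toy_embeddings = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
toy_mask = torch.tensor([[1, 1, 0],
                         [1, 0, 0]])

# Explicit expansion, as in the steps above.
expanded = toy_mask.unsqueeze(-1).expand(toy_embeddings.size()).float()
masked_explicit = toy_embeddings * expanded

# Broadcasting: the [2, 3, 1] mask is stretched across the hidden dimension.
masked_broadcast = toy_embeddings * toy_mask.unsqueeze(-1)

print(torch.equal(masked_explicit, masked_broadcast))
print(masked_broadcast[0, 2])  # padded position in the first toy sentence
```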

In [12]:
masked_embeddings

tensor([[[ 0.0191,  1.0852,  1.4530,  ...,  0.5818,  0.1906,  0.7413],
         [ 0.2287,  0.5216,  1.0775,  ..., -0.0472, -0.0768,  0.4471],
         [ 0.2503,  0.0941,  1.2539,  ..., -0.1584, -0.1444,  0.9768],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ..., -0.0000, -0.0000,  0.0000]],

        [[-0.2938,  1.0964,  0.6436,  ...,  0.5726,  0.1074,  0.8784],
         [-0.0638,  1.1400,  1.1656,  ...,  0.4666,  0.0043,  0.6349],
         [-0.3595,  0.9355,  1.3937,  ...,  0.5706,  0.1614,  0.4186],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [-0.0000,  0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000]],

        [[-0.2888,  0.5438, -0.3228,  ...,  1.0956,  0.5095,  0.1958],
         [-0.2769,  1.3928, -0.5112,  ...,  0

#### d. Sum the remaining embeddings along axis 1

In [13]:
summed = torch.sum(masked_embeddings, 1)
summed.shape

torch.Size([5, 768])

In [14]:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape

torch.Size([5, 768])

#### e. Calculate the mean

In [20]:
mean_pooled = summed / summed_mask
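
Steps a to e can be collected into one reusable function. This is a sketch under the same assumptions as the cells above (token embeddings of shape `[batch, seq, hidden]` and a 0/1 attention mask of shape `[batch, seq]`); the name `mean_pool` is hypothetical.

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """Masked mean over the token axis (hypothetical helper combining steps a-e)."""
    # [batch, seq] -> [batch, seq, 1]; broadcasting covers the hidden dimension.
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_embeddings * mask).sum(dim=1)
    # Number of real tokens per sentence, clamped to avoid division by zero.
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts

# Toy check: one sentence, three token slots, the last one is padding.
toy_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])
print(mean_pool(toy_embeddings, toy_mask))  # average of the first two token vectors only
```

With the notebook's variables, the call would be `mean_pool(outputs.last_hidden_state, tokens['attention_mask'])`.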

## Step 3: Calculate similarity

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

# convert from PyTorch tensor to numpy array
mean_pooled = mean_pooled.detach().numpy()

# cosine similarity of the first sentence against each of the others
similarity = cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)
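
Cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends only on direction and not on length. As an alternative to scikit-learn, the comparison could stay in PyTorch, which avoids the NumPy conversion. The sketch below uses `torch.nn.functional.cosine_similarity` on toy vectors; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

# Toy sentence vectors: the first row plays the role of the base sentence.
vectors = torch.tensor([[1.0, 0.0],
                        [2.0, 0.0],   # same direction as the base
                        [0.0, 3.0]])  # orthogonal to the base

base = vectors[0].unsqueeze(0)   # shape [1, 2]
others = vectors[1:]             # shape [2, 2]

# The base row is broadcast against every other row; dim=1 compares along the feature axis.
scores = F.cosine_similarity(base, others, dim=1)
print(scores)
```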

In [41]:
print('The base sentence: \n \t', sentences[0])

for sentence, sim in zip(sentences[1:], similarity[0]):
  print('\n Test:\t ','{', sentence,'}')
  print(' Similarity to base:\t ', sim)

The base sentence: 
 	 Parallel lines have so much in common. It's a shame they'll never meet

 Test:	  { I'm reading a book on anti-gravity. it's impossible to put down. }
 Similarity to base:	  0.4178294

 Test:	  { Time flies like an arrow; fruit flies like a banana. The universe has a sense of humor. }
 Similarity to base:	  0.3460104

 Test:	  { Why did the Egyptian pharaoh go to therapy? To work through his pyramid complex issues! }
 Similarity to base:	  0.2968526

 Test:	  { Parallel lines must be the ultimate introverts. They're so distant, even geometry can't fix it! }
 Similarity to base:	  0.73824716



So, as intended, the most similar sentence is the last one (similarity of about 0.738), which expresses a similar idea to our first sentence in largely different words.
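
Rather than reading the winner off the printed list, the best match can be picked with `argmax`. A small sketch using the scores copied from the output above; the variable names are illustrative.

```python
import numpy as np

# Scores copied from the printed output above; entry k compares
# sentences[0] with sentences[k + 1].
scores = np.array([0.4178294, 0.3460104, 0.2968526, 0.73824716])

best = int(np.argmax(scores))     # position within sentences[1:]
best_sentence_index = best + 1    # position within the full `sentences` list
print(best_sentence_index, scores[best])
```

With the notebook's variables, `sentences[1:][int(np.argmax(similarity[0]))]` would return the text of the closest sentence.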


## Method 2: Embeddings with Sentence-Transformers (faster and easier)

In [None]:
pip install sentence_transformers

In [44]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

In [45]:
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(5, 768)

In [46]:
sentence_embeddings

array([[ 0.25765783,  0.8095684 ,  1.6172917 , ...,  0.17301115,
        -0.1627667 ,  0.50747234],
       [-0.17729716,  1.0527868 ,  1.0287317 , ...,  0.498814  ,
        -0.06910551,  0.38188493],
       [-0.10345218,  0.85123116, -0.23889843, ...,  0.9513718 ,
         0.37633184,  0.03318928],
       [ 0.18458131,  1.0976444 ,  0.30456337, ..., -0.5765321 ,
        -0.13038449,  0.24313515],
       [ 0.23966917,  1.15642   ,  1.438236  , ..., -0.11000173,
        -0.01586977,  0.5518208 ]], dtype=float32)

In [47]:
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.4178294 , 0.3460104 , 0.2968526 , 0.73824716]], dtype=float32)
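
The call above compares only the first sentence with the rest. To see every pair at once, a full similarity matrix can be computed by normalizing each row and taking dot products. This is a sketch with NumPy on toy data; `pairwise_cosine` is a hypothetical helper, not a library function.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Cosine similarity between every pair of rows (hypothetical helper)."""
    # Scale each row to unit length; the dot product of unit vectors is their cosine.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

toy = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [3.0, 3.0]])
print(pairwise_cosine(toy).round(3))
```

Applied to the notebook's data, `pairwise_cosine(sentence_embeddings)` would return a 5 x 5 matrix whose first row, after the diagonal entry, should correspond to the four scores shown above.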