<h1 align="center">Sentence Transformers: Sentence Embedding, Sentence Similarity, Semantic Search and Clustering |Code</h1>

Data Scientist: Dr. Eddy Giusepe Chirinos Isidro

Study links:

* [semantic-similarity](https://github.com/abhinavthomas/semantic-similarity)

* [Sentence Transformers](https://www.youtube.com/watch?v=OlhNZg4gOvA)


# Sentence Embeddings

<font color="orange">Sentence embedding is the collective name for a set of techniques in natural language processing (NLP) in which sentences are mapped to vectors of real numbers.</font>

Use cases:

* Sentence Embedding

* Sentence Similarity

* Semantic Search

* Clustering

# Generating Embeddings

In [5]:
#!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Load the pre-trained model ---> https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer('all-MiniLM-L6-v2')


In [6]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
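
The printed modules suggest the pipeline: the Transformer produces one vector per token, `Pooling` averages them (`pooling_mode_mean_tokens: True`), and `Normalize` scales the result to unit length. Below is a minimal pure-Python sketch of the last two steps. It assumes made-up 3-dimensional token vectors (the real model uses 384 dimensions), and the helper names `mean_pool` and `l2_normalize` are hypothetical, not part of the library.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vector):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy data (assumption): three tokens, the last one is padding (mask = 0).
token_vectors = [[1.0, 2.0, 2.0], [3.0, 0.0, 2.0], [9.0, 9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_embedding = l2_normalize(mean_pool(token_vectors, attention_mask))
print(sentence_embedding)
```

Whatever the sentence length, the result is a single fixed-size vector, which is why `embeddings.shape` depends only on the number of sentences and the model dimension.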

In [2]:
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.']


embeddings = model.encode(sentences)

In [3]:
embeddings.shape

(2, 384)

In [4]:
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")


Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173841e-02 -4.28515412e-02 -1.56286340e-02  1.40537862e-02
  3.95537652e-02  1.21796295e-01  2.94333789e-02 -3.17524001e-02
  3.54959518e-02 -7.93140009e-02  1.75878368e-02 -4.04369719e-02
  4.97259907e-02  2.54912432e-02 -7.18699992e-02  8.14968348e-02
  1.47069513e-03  4.79627140e-02 -4.50335890e-02 -9.92174968e-02
 -2.81769503e-02  6.45046309e-02  4.44670804e-02 -4.76217270e-02
 -3.52952182e-02  4.38671783e-02 -5.28565720e-02  4.33049776e-04
  1.01921476e-01  1.64072402e-02  3.26996855e-02 -3.45986858e-02
  1.21339634e-02  7.94871226e-02  4.58342116e-03  1.57778692e-02
 -9.68206860e-03  2.87626293e-02 -5.05806319e-02 -1.55793894e-02
 -2.87907142e-02 -9.62278992e-03  3.15556526e-02  2.27349550e-02
  8.71449485e-02 -3.85027640e-02 -8.84718820e-02 -8.75497609e-03
 -2.12343279e-02  2.08924059e-02 -9.02078152e-02 -5.25732487e-02
 -1.05638532e-02  2.88311187e-02 -1.61454845e-02  6.17841305e-03
 -1.23234

# Cosine-Similarity

In [9]:
emb1 = model.encode("I am eating Apple")
emb2 = model.encode("I like fruits")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.5398]])
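
Cosine similarity is the dot product of two vectors divided by the product of their norms, so it ranges from -1 (opposite direction) to 1 (same direction). Here is a minimal sketch on toy vectors rather than real embeddings; `cosine_similarity` is an illustrative helper, not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumption): parallel, perpendicular, and opposite directions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because the model shown above ends with a `Normalize()` module, its embeddings already have unit length, so for this model the cosine similarity coincides with the plain dot product.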


## Computing `cosine similarity` between all pairs

<font color="orange">Note that below we compare the similarity of every sentence with every other sentence.</font>

For example, the first row of the resulting matrix is:

$[1.0000,  0.7553, -0.1050,  0.2474, -0.0704, -0.0333,  0.1707,  0.0476,  0.0630]$

It is read as follows: sentence 1 vs. sentence 1 = 1.0000 | sentence 1 vs. sentence 2 = 0.7553 | sentence 1 vs. sentence 3 = -0.1050 | sentence 1 vs. sentence 4 = 0.2474 | ...

In [11]:
# Compute cosine similarity between all pairs
sentences = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.'
          ]

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

cos_sim

tensor([[ 1.0000,  0.7553, -0.1050,  0.2474, -0.0704, -0.0333,  0.1707,  0.0476,
          0.0630],
        [ 0.7553,  1.0000, -0.0610,  0.1442, -0.0809, -0.0216,  0.1157,  0.0362,
          0.0216],
        [-0.1050, -0.0610,  1.0000, -0.1088,  0.0217, -0.0413, -0.0928,  0.0231,
          0.0247],
        [ 0.2474,  0.1442, -0.1088,  1.0000, -0.0348,  0.0362,  0.7369,  0.0821,
          0.1389],
        [-0.0704, -0.0809,  0.0217, -0.0348,  1.0000, -0.1654, -0.0592,  0.1961,
          0.2564],
        [-0.0333, -0.0216, -0.0413,  0.0362, -0.1654,  1.0000,  0.0769, -0.0380,
         -0.0895],
        [ 0.1707,  0.1157, -0.0928,  0.7369, -0.0592,  0.0769,  1.0000,  0.0495,
          0.1191],
        [ 0.0476,  0.0362,  0.0231,  0.0821,  0.1961, -0.0380,  0.0495,  1.0000,
          0.6433],
        [ 0.0630,  0.0216,  0.0247,  0.1389,  0.2564, -0.0895,  0.1191,  0.6433,
          1.0000]])

In [12]:
# Add every pair to a list together with its cosine similarity score
all_sentence_combinations = []

for i in range(len(cos_sim)-1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append((cos_sim[i][j], i, j))
all_sentence_combinations       


[(tensor(0.7553), 0, 1),
 (tensor(-0.1050), 0, 2),
 (tensor(0.2474), 0, 3),
 (tensor(-0.0704), 0, 4),
 (tensor(-0.0333), 0, 5),
 (tensor(0.1707), 0, 6),
 (tensor(0.0476), 0, 7),
 (tensor(0.0630), 0, 8),
 (tensor(-0.0610), 1, 2),
 (tensor(0.1442), 1, 3),
 (tensor(-0.0809), 1, 4),
 (tensor(-0.0216), 1, 5),
 (tensor(0.1157), 1, 6),
 (tensor(0.0362), 1, 7),
 (tensor(0.0216), 1, 8),
 (tensor(-0.1088), 2, 3),
 (tensor(0.0217), 2, 4),
 (tensor(-0.0413), 2, 5),
 (tensor(-0.0928), 2, 6),
 (tensor(0.0231), 2, 7),
 (tensor(0.0247), 2, 8),
 (tensor(-0.0348), 3, 4),
 (tensor(0.0362), 3, 5),
 (tensor(0.7369), 3, 6),
 (tensor(0.0821), 3, 7),
 (tensor(0.1389), 3, 8),
 (tensor(-0.1654), 4, 5),
 (tensor(-0.0592), 4, 6),
 (tensor(0.1961), 4, 7),
 (tensor(0.2564), 4, 8),
 (tensor(0.0769), 5, 6),
 (tensor(-0.0380), 5, 7),
 (tensor(-0.0895), 5, 8),
 (tensor(0.0495), 6, 7),
 (tensor(0.1191), 6, 8),
 (tensor(0.6433), 7, 8)]
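
The nested loop visits each unordered pair `(i, j)` with `i < j` exactly once, which gives `n * (n - 1) / 2` pairs (36 for the 9 sentences above). The same enumeration can be sketched with `itertools.combinations`. The matrix below is a toy with illustrative values, not model output.

```python
from itertools import combinations

# Toy symmetric similarity matrix (assumption: illustrative values only).
toy_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

# combinations(range(n), 2) yields each unordered pair (i, j) with i < j once,
# skipping the diagonal and the mirrored lower triangle.
pairs = [(toy_sim[i][j], i, j) for i, j in combinations(range(len(toy_sim)), 2)]
print(pairs)
```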

In [13]:
# Sort the list by highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top 5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], score))


Top 5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474
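
When only the top few pairs are needed, sorting the whole list is not strictly necessary. One possible alternative is `heapq.nlargest` from the standard library, sketched here on toy `(score, i, j)` tuples with plain floats. With real tensors one might first convert each score with `.item()`.

```python
import heapq

# Toy (score, i, j) tuples in the same layout as all_sentence_combinations
# (assumption: made-up scores, plain floats instead of tensors).
toy_pairs = [(0.8, 0, 1), (0.1, 0, 2), (0.3, 1, 2), (0.6, 2, 3), (-0.2, 0, 3)]

# Keep only the k pairs with the highest score, already in descending order.
top_k = heapq.nlargest(2, toy_pairs, key=lambda pair: pair[0])
for score, i, j in top_k:
    print(f"{i} - {j}: {score:.4f}")
```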


# Semantic Search

<font color="orange">Semantic search means searching by meaning, as opposed to `lexical search`, in which the search engine looks for literal matches of the query words or their variants without understanding the overall meaning of the query.</font>

Thus, `semantic search` describes a search engine's attempt to produce the most accurate SERP (`Search Engine Results Page`) results possible, based on the searcher's intent, the context of the query, and the relationships between words. This matters because people say and ask things in different ways, languages, and tones.
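
To make the contrast concrete, here is a rough sketch of a purely lexical scorer that counts shared words. The function `lexical_score`, the query, and the documents are all made up for illustration. A paraphrase with no words in common scores zero, which is the gap that embedding-based search tries to close.

```python
def lexical_score(query, document):
    """Count how many distinct query words also appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)

# Hypothetical query and documents (assumption).
query = "cheap flights"
documents = [
    "cheap flights to Lisbon",   # shares both query words
    "low-cost airfare deals",    # similar meaning, no shared words
]

for doc in documents:
    print(lexical_score(query, doc), doc)
```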

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clips/mfaq') # Pre-trained model from HuggingFace --> https://huggingface.co/clips/mfaq

In [None]:
# The clips/mfaq model card asks for questions to be prefixed with <Q> and answers with <A>
question = "<Q>How many models can I host on HuggingFace?"

answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."


query_embedding = model.encode(question)
corpus_embeddings = model.encode([answer_1, answer_2, answer_3])


print(util.semantic_search(query_embedding, corpus_embeddings))
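
`util.semantic_search` returns one list per query, and each list holds dictionaries with the keys `corpus_id` and `score`, sorted by decreasing score. The pure-Python sketch below mimics that output shape on made-up 2-dimensional vectors. `toy_semantic_search` and `cosine` are hypothetical helpers for illustration, not the library's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_semantic_search(query_vectors, corpus_vectors, top_k=10):
    """For each query, rank corpus entries by cosine similarity (highest first)."""
    results = []
    for query in query_vectors:
        hits = [{"corpus_id": idx, "score": cosine(query, doc)}
                for idx, doc in enumerate(corpus_vectors)]
        hits.sort(key=lambda hit: hit["score"], reverse=True)
        results.append(hits[:top_k])
    return results

# Made-up vectors standing in for real embeddings (assumption).
query_vectors = [[1.0, 0.0]]
corpus_vectors = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]

results = toy_semantic_search(query_vectors, corpus_vectors, top_k=2)
print(results)
```

The answer text for the best hit can then be looked up by indexing the original answer list with `corpus_id`.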