# Text Embeddings

In this notebook, we'll explore sentence transformers, a family of models that turn text into embeddings: fixed-length numeric vectors in which semantically similar texts land close together. Embeddings are commonly used for:

* Search (where results are ranked by relevance to a query string)
* Clustering (where text strings are grouped by similarity)
* Recommendations (where items with related text strings are recommended)
* Anomaly detection (where outliers with little relatedness are identified)
* Diversity measurement (where similarity distributions are analyzed)
* Classification (where text strings are classified by their most similar label)
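
As a minimal sketch of the search use case in the list above, the snippet below ranks a few documents against a query by cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings come from a model and have hundreds of dimensions), and `cosine` and `rank` are hypothetical helper names, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return document names sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Toy "embeddings": hand-written vectors, NOT produced by a real model
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.7, 0.3, 0.1],
    "stocks": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

ranking = rank(query, docs)
print(ranking)
```

The same pattern covers classification by most similar label: treat each label's embedding as a "document" and pick the top-ranked one.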



## Setup

In [1]:
!pip install -qqq transformers sentence_transformers openai

# Embeddings using Sentence Transformers

This section uses [Sentence Transformers](https://www.sbert.net/) to generate text embeddings, then computes the cosine similarity between two sentences as a measure of how close they are in meaning.

In [2]:
# Import the libraries used in this section
from sentence_transformers import SentenceTransformer  # Loads pre-trained models that produce text embeddings
from sklearn.metrics.pairwise import cosine_similarity  # Computes cosine similarity between vectors
from rich import print  # rich's print replaces the built-in print for nicer console output

## Generate Embeddings

In [3]:
# Load a small pre-trained model (it maps each sentence to a 384-dimensional vector)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

# Encode the sentences by calling model.encode()
embeddings = model.encode(sentences)

# Print each sentence alongside its embedding
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

## Cosine Similarity

In [4]:
# Initialize a SentenceTransformer model ('all-mpnet-base-v2' pre-trained model)
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Define a list of sentences for which we want to calculate embeddings and similarity
sentences = ["This is an example sentence", "This is NOT an example sentence"]

# Generate text embeddings for the provided sentences using the initialized model
embeddings = model.encode(sentences)

# Calculate cosine similarity between the first and second embeddings.
# cosine_similarity expects 2D inputs, so each vector is wrapped in a list; the result is a 1x1 matrix
similarity_score = cosine_similarity([embeddings[0]], [embeddings[1]])

# Print the computed cosine similarity score
print(similarity_score)
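
For reference, the score printed by this cell is the cosine of the angle between the two embedding vectors. Writing them as $u$ and $v$ with $n$ components each (symbols introduced here for illustration only), the standard definition is:

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           = \frac{\sum_{i=1}^{n} u_i v_i}
                  {\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}
```

The value ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). As a quick worked case, $u = (1, 0)$ and $v = (1, 1)$ give $1 / (1 \cdot \sqrt{2}) \approx 0.707$.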

# Embeddings using OpenAI

You can also use the OpenAI API to generate text embeddings with their `text-embedding-ada-002` model. You can find more details about OpenAI embeddings [here](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings). Once generated, the embeddings can be stored in a vector database for downstream applications such as search, recommendations, and RAG-based chatbots.

In [5]:
from google.colab import userdata
openai_api_key = userdata.get('openai_api_key')

In [6]:
from openai import OpenAI
client = OpenAI(api_key=openai_api_key)

response = client.embeddings.create(
  model="text-embedding-ada-002",
  input="The food was delicious and the waiter...",
  encoding_format="float"
)
print(response)
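
The printed response is a structured object rather than a bare vector. As a hedged sketch, assuming the response follows the shape used by the `openai` Python client (a `data` list whose items each carry an `index` and an `embedding` list of floats), the vectors can be pulled out as shown below. `extract_embeddings` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a real response so the snippet runs without an API key.

```python
from types import SimpleNamespace

def extract_embeddings(response):
    """Return the embedding vectors from a response, ordered by input index."""
    items = sorted(response.data, key=lambda item: item.index)
    return [list(item.embedding) for item in items]

# Stand-in for a real API response (values are made up, not model output)
fake_response = SimpleNamespace(
    data=[
        SimpleNamespace(index=1, embedding=[0.3, 0.4]),
        SimpleNamespace(index=0, embedding=[0.1, 0.2]),
    ]
)

vectors = extract_embeddings(fake_response)
print(len(vectors), len(vectors[0]))
```

With the real `response` from the cell above, `extract_embeddings(response)[0]` would be the vector for the single input string; for `text-embedding-ada-002` it should contain 1,536 floats.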

# Text Embeddings Leaderboard

The Massive Text Embedding Benchmark (MTEB) [Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) evaluates and compares text embedding models. At the time of writing, it covers 129 datasets spanning 113 languages and reports 18,556 scores across 157 models. To submit and evaluate your own model, see the MTEB GitHub repository; the MTEB paper describes the benchmark's tasks, metrics, and evaluation criteria in detail.

