# Embedding
## What Is an Embedding
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
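
As a minimal illustration of "distance as relatedness", the sketch below computes cosine similarity for three hand-made 3-dimensional vectors. The vectors and the helper name `cosine_similarity_1d` are invented for this example; real embeddings from the models listed below have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration (not real model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

sim_cat_kitten = cosine_similarity_1d(cat, kitten)
sim_cat_car = cosine_similarity_1d(cat, car)

# Cosine distance is 1 - similarity: a smaller distance means more related
print(f"cat vs kitten: similarity={sim_cat_kitten:.3f}, distance={1 - sim_cat_kitten:.3f}")
print(f"cat vs car:    similarity={sim_cat_car:.3f}, distance={1 - sim_cat_car:.3f}")
```

With these toy values, the cat/kitten pair scores much higher than the cat/car pair, which is the behavior a good embedding model aims for.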

## Usage of Embedding
Embeddings are commonly used for:

- **Search** (where results are ranked by relevance to a query string)
- **Clustering** (where text strings are grouped by similarity)
- **Recommendations** (where items with related text strings are recommended)
- **Anomaly detection** (where outliers with little relatedness are identified)
- **Diversity measurement** (where similarity distributions are analyzed)
- **Classification** (where text strings are classified by their most similar label)
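
Several of these uses reduce to the same operation: compare one vector against a set of candidates and pick the closest. The sketch below shows this for classification by most similar label. The 2-dimensional vectors, the label names, and the `classify` helper are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify(item_vector, label_vectors):
    """Return the label whose vector is most similar to item_vector."""
    return max(label_vectors, key=lambda label: cosine(item_vector, label_vectors[label]))

# Toy label and item vectors, invented for this example
label_vectors = {"animal": [1.0, 0.1], "drink": [0.1, 1.0]}
item_vectors = {"fox": [0.9, 0.2], "tea": [0.2, 0.8]}

predictions = {name: classify(vec, label_vectors) for name, vec in item_vectors.items()}
print(predictions)
```

Search works the same way, except that the candidates are sorted by similarity instead of only the top one being returned.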

## Embedding models

|MODEL|~ PAGES PER DOLLAR|PERFORMANCE ON MTEB EVAL|MAX INPUT (TOKENS)|
|:--------:|:--------:|:--------:|:--------:|
|text-embedding-3-small|62,500|62.3%|8191|
|text-embedding-3-large|9,615|64.6%|8191|
|text-embedding-ada-002|12,500|61.0%|8191|

For more information, see the official OpenAI [Embeddings](https://platform.openai.com/docs/guides/embeddings) guide.


## Setup

In [3]:
import os
from dotenv import load_dotenv, find_dotenv
# find_dotenv() locates the .env file by walking up parent directories until it is found
# load_dotenv() loads the environment variables defined in that file
# override=True lets values in .env override existing system environment variables
load_dotenv(find_dotenv(), override=True)

apiKey = os.environ.get('OPENAI_API_KEY')

In [2]:
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [15]:
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
client = OpenAI(api_key=apiKey)

## Getting Embeddings 

In [25]:

cat_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input="cat",
)

car_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input="car",
)

dog_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input="dog",
)
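
The three requests above can probably be merged into one, because the embeddings endpoint also accepts a list of strings as `input` and returns one item per string. A minimal sketch, assuming the `client` created in Setup; `embed_batch` is a hypothetical helper name, not part of the `openai` package.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several strings with a single API request.

    Returns a dict mapping each input string to its embedding vector.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the input it belongs to,
    # so sort by it rather than relying on the response order.
    items = sorted(response.data, key=lambda item: item.index)
    return {text: item.embedding for text, item in zip(texts, items)}
```

Usage would look like `vectors = embed_batch(client, ["cat", "car", "dog"])`, after which `vectors["cat"]` holds the same kind of list as `cat_embedding.data[0].embedding`.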

In [None]:
#print(cat_embedding)
print(car_embedding.data[0].embedding)

## Get Similarities Using Embedding

We need the NumPy and pandas libraries.
To check whether a library is installed, run `%pip show LibraryName`. It prints the installed version, or a warning that the package was not found.

In [6]:
import numpy as np
import pandas as pd

# Read the data file
data = pd.read_csv('assets/words.csv')
data

Unnamed: 0,text
0,fox
1,opossum
2,black
3,purple
4,badger
5,coffee
6,rabbit
7,hare
8,soda
9,yellow


### Check Pricing
- Use the tiktoken library to count the number of tokens in your data
- Check the [pricing page](https://openai.com/api/pricing/) to see how much you will pay

In [4]:
import tiktoken as tkn

In [8]:
# Create a list of the words
words = list(data['text'])
enc = tkn.encoding_for_model(model_name="text-embedding-3-small")
total_tokens = sum(len(enc.encode(word)) for word in words)
print(f"Total tokens: {total_tokens}")
# Price assumed here: $0.00002 per 1,000 tokens, so divide by 1000 for the per-token price
cost = total_tokens * (0.00002/1000)
print(f"Cost: ${cost:.20f}")


Total tokens: 62
Cost: $0.00000124000000000000
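
The arithmetic in the cell above can be wrapped in a small helper so the price is stated once and is easy to change. This is a sketch: `estimate_embedding_cost` is a hypothetical name, and the default of $0.02 per million tokens is simply the rate the cell above uses, so confirm it against the pricing page before relying on it.

```python
def estimate_embedding_cost(total_tokens, usd_per_million_tokens=0.02):
    """Estimate the USD cost of embedding `total_tokens` tokens."""
    return total_tokens * usd_per_million_tokens / 1_000_000

# Same 62 tokens as in the output above
print(f"Cost: ${estimate_embedding_cost(62):.8f}")
```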


In [9]:
# Create a function to get the embeddings of a word
def get_embeddings(word):
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=word,
    )    
    return embedding.data[0].embedding

### Store the embedding for each word
- After computing the embeddings for your data, save them to permanent storage. This avoids requesting the same embeddings again every time you process a query, which saves both time and money.

In [24]:
data['embedding'] = data['text'].apply(get_embeddings)
data

Unnamed: 0,text,embedding
0,fox,"[-0.03355345129966736, 0.001505932305008173, -..."
1,opossum,"[0.03196421265602112, 0.03251579403877258, -0...."
2,black,"[0.01116090640425682, -0.012063156813383102, -..."
3,purple,"[0.02472393959760666, -0.034728385508060455, -..."
4,badger,"[0.021644549444317818, -0.022302215918898582, ..."
5,coffee,"[-0.010105200111865997, 0.0037400261498987675,..."
6,rabbit,"[0.006220409180969, -0.015751302242279053, 0.0..."
7,hare,"[0.01832250878214836, -0.0220760777592659, 0.0..."
8,soda,"[0.018451469019055367, -0.025056716054677963, ..."
9,yellow,"[-0.014238273724913597, -0.018038872629404068,..."


In [25]:
# Save the data to a CSV file
data.to_csv('assets/words_with_embeddings.csv', index=False)
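
The same "compute once, reuse later" idea also applies to individual lookups. The sketch below keeps a JSON file as a simple cache and only calls the embedding function on a cache miss. The helper names are invented for illustration, and `embed_fn` stands for any function that takes a string and returns a list of floats, such as `get_embeddings` above.

```python
import json
import os

def load_cache(path):
    """Load the cache dict from a JSON file, or start empty if the file is missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def get_cached_embedding(word, cache, embed_fn):
    """Return the embedding for `word`, calling embed_fn only on a cache miss."""
    if word not in cache:
        cache[word] = embed_fn(word)
    return cache[word]

def save_cache(cache, path):
    """Write the cache dict back to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f)
```

JSON is used here because it round-trips lists of floats without the string parsing that a CSV file requires.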

### Perform Semantic Search

In [7]:
# Read the data file
words_data = pd.read_csv('assets/words_with_embeddings.csv')
#words_data.head()
# The CSV stores each embedding as a plain string; parse it back into a NumPy array
words_data['embedding'] = words_data['embedding'].apply(eval).apply(np.array)
words_data.head()


Unnamed: 0,text,embedding
0,fox,"[-0.03355345129966736, 0.001505932305008173, -..."
1,opossum,"[0.03196421265602112, 0.03251579403877258, -0...."
2,black,"[0.01116090640425682, -0.012063156813383102, -..."
3,purple,"[0.02472393959760666, -0.034728385508060455, -..."
4,badger,"[0.021644549444317818, -0.022302215918898582, ..."


Use cosine similarity to measure how close each word is to the search term.
See [Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity) for more information.
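
For reference, the cosine similarity of two vectors $\mathbf{a}$ and $\mathbf{b}$ is the cosine of the angle $\theta$ between them:

$$
\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
= \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}
$$

The value ranges from $-1$ to $1$, and a higher value means the vectors point in more similar directions. According to OpenAI's embeddings guide, its embeddings are normalized to length 1, in which case the denominator is 1 and the cosine similarity is simply the dot product.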

In [37]:
%pip install scikit-learn -q

Note: you may need to restart the kernel to use updated packages.


In [28]:
# Calculate the embedding of the search term
milk_embedding = get_embeddings('milk')
words_data['cosine_similarity'] = words_data['embedding'].apply(lambda x: cosine_similarity([milk_embedding], [x])[0][0])
words_data

Unnamed: 0,text,embedding,cosine_similarity
20,red,"[-0.022114494815468788, -0.010901302099227905,...",0.216396
23,blue,"[-0.0011808431008830667, -0.016510002315044403...",0.238026
3,purple,"[0.02472393959760666, -0.034728385508060455, -...",0.270218
35,white,"[0.0032686665654182434, -0.028257660567760468,...",0.314009
41,orange,"[-0.025900742039084435, -0.005591441877186298,...",0.219831
11,deer,"[0.060825735330581665, -0.03296774625778198, 0...",0.236315
39,gray,"[0.004804749041795731, -0.020221514627337456, ...",0.197774
2,black,"[0.01116090640425682, -0.012063156813383102, -...",0.23311
34,brown,"[-0.030695391818881035, -0.00717085599899292, ...",0.258159
0,fox,"[-0.03355345129966736, 0.001505932305008173, -...",0.198153


In [29]:
# Sort the data by cosine similarity
words_data = words_data.sort_values(by='cosine_similarity', ascending=False)
words_data

Unnamed: 0,text,embedding,cosine_similarity
21,milk,"[0.03963107243180275, -0.00417668791487813, -0...",1.0
32,pasta,"[-0.05013786256313324, -0.04082654416561127, 0...",0.378354
10,cappuccino,"[-0.024895260110497475, -0.030884910374879837,...",0.377345
8,soda,"[0.018451469019055367, -0.025056716054677963, ...",0.370392
5,coffee,"[-0.010105200111865997, 0.0037400261498987675,...",0.340499
40,water,"[0.0030196786392480135, 0.017477955669164658, ...",0.333474
22,tea,"[-0.01637290231883526, -0.030905066058039665, ...",0.325191
35,white,"[0.0032686665654182434, -0.028257660567760468,...",0.314009
29,salad,"[-0.0006578968022949994, -0.03706714138388634,...",0.310149
24,sandwich,"[-0.005682190880179405, -0.04893353208899498, ...",0.2885


In [20]:
cat_car_similarity = cosine_similarity([cat_embedding.data[0].embedding], [car_embedding.data[0].embedding])[0][0]
cat_car_similarity

0.5156173418297144

In [26]:
cat_dog_similarity = cosine_similarity([cat_embedding.data[0].embedding], [dog_embedding.data[0].embedding])[0][0]
cat_dog_similarity

0.602501803670552

## Search for a Complex Search Term

In [33]:
# Read the data file
data = pd.read_csv('assets/words_with_embeddings.csv')
data['embedding'] = data['embedding'].apply(eval).apply(np.array)
milk_vector = data['embedding'].iloc[21]  # Embedding of 'milk' (row 21)
tea_vector = data['embedding'].iloc[22]  # Embedding of 'tea' (row 22)

# Adding the two vectors gives a query vector that lies between both words
complex_vector = milk_vector + tea_vector


In [35]:
data['cosine_similarity'] = data['embedding'].apply(lambda x: cosine_similarity([complex_vector], [x])[0][0])
data = data.sort_values(by='cosine_similarity', ascending=False)
data

Unnamed: 0,text,embedding,cosine_similarity
22,tea,"[-0.01637290231883526, -0.030905066058039665, ...",0.814
21,milk,"[0.03963107243180275, -0.00417668791487813, -0...",0.814
5,coffee,"[-0.010105200111865997, 0.0037400261498987675,...",0.584376
8,soda,"[0.018451469019055367, -0.025056716054677963, ...",0.455589
10,cappuccino,"[-0.024895260110497475, -0.030884910374879837,...",0.448989
40,water,"[0.0030196786392480135, 0.017477955669164658, ...",0.441671
32,pasta,"[-0.05013786256313324, -0.04082654416561127, 0...",0.431922
29,salad,"[-0.0006578968022949994, -0.03706714138388634,...",0.415339
35,white,"[0.0032686665654182434, -0.028257660567760468,...",0.406285
3,purple,"[0.02472393959760666, -0.034728385508060455, -...",0.394688
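
The identical 0.814 scores for tea and milk are expected rather than a coincidence. If two vectors a and b both have length 1 (assumed here, since OpenAI states that its embeddings are normalized), the cosine similarity between a + b and either of them simplifies to sqrt((1 + a·b) / 2). The sketch below checks this with the milk/tea similarity of 0.325191 from the sorted 'milk' results above; `similarity_to_sum` is a name invented for this example.

```python
import math

def similarity_to_sum(dot_ab):
    """Cosine similarity between (a + b) and a, for unit vectors a and b.

    (a + b)·a = 1 + a·b and |a + b| = sqrt(2 + 2 a·b),
    so the ratio simplifies to sqrt((1 + a·b) / 2).
    """
    return math.sqrt((1 + dot_ab) / 2)

milk_tea_similarity = 0.325191  # tea row in the 'milk' search results
print(round(similarity_to_sum(milk_tea_similarity), 3))
```

The result matches the 0.814 shown for both rows, which suggests the combined vector sits symmetrically between the two words.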


# Note
As a breaking change, OpenAI removed the `embeddings_utils` module from the `openai` package and, at the time of writing, had not yet updated the OpenAI Cookbook. If a tutorial or course references a function from this module, you can copy its code manually from https://github.com/openai/openai-cookbook/blob/main/examples/utils/embeddings_utils.py until OpenAI fixes this and provides an alternative.