# Playing with embeddings

In this notebook, I take a small subset of a dataset (5 rows), create an embedding for each row with the `text-embedding-ada-002` model,
and use cosine similarity to measure how similar each embedding is to the embedding of a new text.
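
As a reminder of what cosine similarity measures, here is a minimal pure-Python sketch. The function name `cosine_sim` and the toy 3-dimensional vectors are assumptions made for illustration; this is not the implementation in `openai.embeddings_utils`, which the notebook uses later.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (real embeddings have many more dimensions)
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_sim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.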

The [openai-cookbook](https://github.com/openai/openai-cookbook/tree/main) has useful notebooks.

# Import & set params

In [1]:
import pandas as pd
import tiktoken

from openai.embeddings_utils import get_embedding, cosine_similarity
import openai
openai.api_key = "sk-..."  # <-------- ENTER YOUR API KEY HERE

I used this model:

- Model name: `text-embedding-ada-002`
- Tokenizer: `cl100k_base`
- Max input tokens: `8191`
- Output dimensions: `1536` (i.e., dimensions in the embedding)

In [2]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum input length for text-embedding-ada-002 is 8191 tokens

# Load data

In [3]:
input_datapath = "data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)

| | Time | ProductId | UserId | Score | Summary | Text | combined |
|---|---|---|---|---|---|---|---|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | Title: Arrived in pieces; Content: Not pleased... |


In [4]:
# subsample to 1k most recent reviews and remove samples that are too long
top_n = 1000
df = df.sort_values("Time").tail(top_n * 2)  # first cut to the 2k most recent entries, assuming less than half will be filtered out
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

1000
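
The filtering logic above can be sketched without pandas or tiktoken. In this sketch, the list of dicts stands in for the DataFrame and `count_tokens` is a hypothetical whitespace word count standing in for the `cl100k_base` token count; neither is what the notebook actually uses.

```python
def count_tokens(text):
    # Hypothetical stand-in: counts whitespace-separated words, not real tokens
    return len(text.split())


def most_recent_within_limit(rows, top_n, max_tokens):
    # Keep the 2 * top_n most recent rows as a buffer for the length filter
    recent = sorted(rows, key=lambda r: r["Time"])[-2 * top_n:]
    # Drop rows that are too long to embed
    short_enough = [r for r in recent if count_tokens(r["combined"]) <= max_tokens]
    # Keep the top_n most recent rows that survived the filter
    return short_enough[-top_n:]


rows = [
    {"Time": 1, "combined": "old review"},
    {"Time": 2, "combined": "a very long review " * 10},
    {"Time": 3, "combined": "recent review"},
    {"Time": 4, "combined": "newest review"},
]
kept = most_recent_within_limit(rows, top_n=2, max_tokens=5)
print([r["Time"] for r in kept])  # prints [3, 4]
```

Taking twice as many rows before filtering leaves room to still end up with `top_n` rows after the overly long ones are removed.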

In [5]:
df.head(2)

| | ProductId | UserId | Score | Summary | Text | combined | n_tokens |
|---|---|---|---|---|---|---|---|
| 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... | 52 |
| 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not Wolfgang Puck good | Honestly, I have to admit that I expected a li... | Title: Good, but not Wolfgang Puck good; Conte... | 178 |


# Create embeddings

In [6]:
print(f"Total number of tokens in the data: {df.n_tokens.sum()}")

Total number of tokens in the data: 95895


Select the first 5 reviews

In [7]:
df1 = df.iloc[:5, :].copy() # create a copy

In [8]:
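# Different ids show that df and df1 are distinct objects;
# the .copy() call above is what guarantees that the data is not shared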
print(id(df))
print(id(df1))

139969035748160
139965568612864
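
Different `id()` values only show that two names refer to distinct objects, not that their data is independent. The sketch below illustrates the difference with the standard library only; the `bytearray` and `memoryview` objects are an analogy chosen for illustration, not how pandas stores data.

```python
buffer = bytearray(b"abcde")
view = memoryview(buffer)[:3]   # distinct object that shares the underlying bytes
snapshot = bytes(buffer[:3])    # distinct object holding its own copy

# Both derived objects have ids different from the original
distinct_ids = id(view) != id(buffer) and id(snapshot) != id(buffer)

buffer[0] = ord("z")            # modify the original
view_sees_change = view[0] == ord("z")
snapshot_sees_change = snapshot[0] == ord("z")
print(distinct_ids, view_sees_change, snapshot_sees_change)  # prints True True False
```

Modifying the original and checking whether the derived object changes is a more direct test of independence than comparing ids.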


`openai.Embedding.create()` accepts a single string or a list of strings as `input`; here we pass a list.

In [9]:
df1.combined.to_list()

['Title: where does one  start...and stop... with a treat like this; Content: Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone',
 "Title: Good, but not Wolfgang Puck good; Content: Honestly, I have to admit that I expected a little better. That's not to say that this is bad coffee - in fact it's quite bold without being too acidic, and pretty satisfying overall. I think my main problem is that Wolfgang Puck's name is attached to it, so perhaps it set my expectations a little high. I have a Wolfgang Puck knife set that I adore, and is very high quality for what I paid for it. This coffee was on sale, so it was well worth it also, I just hoped for something that would knock my socks off - which it didn't. I also purchased the Breakfast blend, and Jamaica me crazy at the same time. The breakfast blend was the best, in my opinion, and the jamaican coffee smelled the best, but was 

The API key must be set before calling the API, either directly as in the first cell or through the `OPENAI_API_KEY` environment variable (see https://github.com/openai/openai-python#usage)

In [10]:
df1_embedding = openai.Embedding.create(
    input=df1.combined.to_list(),
    model=embedding_model)

In [11]:
df1_embedding.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [12]:
print(f'Object: {df1_embedding["object"]}')
print(f'\nModel: {df1_embedding["model"]}')
print(f'\nUsage:\n{df1_embedding["usage"]}')

Object: list

Model: text-embedding-ada-002-v2

Usage:
{
  "prompt_tokens": 497,
  "total_tokens": 497
}


From the embedding model we get 5 vectors representing the 5 reviews.

In [13]:
len(df1_embedding["data"])

5
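
To make the response layout concrete, here is a sketch that works on a hand-written mock rather than a real API response. The top-level keys match the ones printed above; the per-item `index` field, the 2-dimensional vectors, and the helper name `embeddings_in_input_order` are assumptions made for illustration.

```python
# Hypothetical mock of an embeddings response (tiny vectors, items out of order)
mock_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
    "model": "text-embedding-ada-002-v2",
    "usage": {"prompt_tokens": 7, "total_tokens": 7},
}


def embeddings_in_input_order(response):
    # Sort by the assumed "index" field so vector i matches input text i
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


vectors = embeddings_in_input_order(mock_response)
print(vectors)  # prints [[0.1, 0.2], [0.3, 0.4]]
```

Sorting by index is a defensive choice: it keeps each vector aligned with its input text even if the items were ever returned in a different order.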

All vectors have the same length (i.e., the output dimensions: `1536`)

In [14]:
for i in range(len(df1_embedding["data"])):
    print(f'Vector {i} - {len(df1_embedding["data"][i]["embedding"])}')

Vector 0 - 1536
Vector 1 - 1536
Vector 2 - 1536
Vector 3 - 1536
Vector 4 - 1536


First 10 dimensions of the first vector

In [15]:
df1_embedding["data"][0]["embedding"][:10]

[0.0070097302086651325,
 -0.02735169231891632,
 0.010535212233662605,
 -0.014569243416190147,
 0.004429187625646591,
 0.019943369552493095,
 0.0007039796910248697,
 -0.0221012681722641,
 -0.019201163202524185,
 -0.013668973930180073]
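
OpenAI's documentation describes its embeddings as normalized to length 1, which would explain why the individual components are so small. This notebook does not verify that, so treat it as an assumption. The sketch below uses toy 2-dimensional vectors and hypothetical helper names to show how to check the norm, and why cosine similarity reduces to a dot product for unit-length vectors.

```python
import math


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    norm = l2_norm(vector)
    return [x / norm for x in vector]


# Toy vectors scaled to unit length
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit-length vectors the denominator is 1, so cosine equals the dot product
cosine = dot(a, b) / (l2_norm(a) * l2_norm(b))
print(l2_norm(a), dot(a, b), cosine)
```

Calling `l2_norm(df1_embedding["data"][0]["embedding"])` would check the assumption on the first real vector.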

In [16]:
# Print the embeddings for each text
for i, embedding in enumerate(df1_embedding["data"]):
    print(f'Text\n{df1.combined.to_list()[i]}')
    print(f'Embedding (first 10 dimensions):\n{embedding["embedding"][:10]}')
    print("="*50)

Text
Title: where does one  start...and stop... with a treat like this; Content: Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone
Embedding (first 10 dimensions):
[0.0070097302086651325, -0.02735169231891632, 0.010535212233662605, -0.014569243416190147, 0.004429187625646591, 0.019943369552493095, 0.0007039796910248697, -0.0221012681722641, -0.019201163202524185, -0.013668973930180073]
Text
Title: Good, but not Wolfgang Puck good; Content: Honestly, I have to admit that I expected a little better. That's not to say that this is bad coffee - in fact it's quite bold without being too acidic, and pretty satisfying overall. I think my main problem is that Wolfgang Puck's name is attached to it, so perhaps it set my expectations a little high. I have a Wolfgang Puck knife set that I adore, and is very high quality for what I paid for it. This coffee was on sale, so it was well worth

Add the embeddings to the data frame

In [17]:
# Collect one embedding vector per review, in the order returned by the API
embeddings_to_df = [item["embedding"] for item in df1_embedding["data"]]

In [18]:
len(embeddings_to_df)

5

In [19]:
df1["embedding"] = embeddings_to_df

In [20]:
df1

| | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding |
|---|---|---|---|---|---|---|---|---|
| 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... | 52 | [0.0070097302086651325, -0.02735169231891632, ... |
| 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not Wolfgang Puck good | Honestly, I have to admit that I expected a li... | Title: Good, but not Wolfgang Puck good; Conte... | 178 | [-0.003115636995062232, -0.009948926977813244,... |
| 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertise coconut as an ingredient more... | First, these should be called Mac - Coconut ba... | Title: Should advertise coconut as an ingredie... | 78 | [-0.017541345208883286, -2.462551128701307e-05... |
| 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato soup | I have a hard time finding packaged food of an... | Title: Best tomato soup; Content: I have a har... | 111 | [-0.0013143676333129406, -0.011042999103665352... |
| 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertise coconut as an ingredient more... | First, these should be called Mac - Coconut ba... | Title: Should advertise coconut as an ingredie... | 78 | [-0.017541345208883286, -2.462551128701307e-05... |
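
Rows 296 and 294 show the same user, the same summary and text preview, and the same leading embedding values under two different product IDs. When texts repeat like this, one option is to embed each distinct text once and map the result back to every row. The sketch below is a hypothetical helper, with placeholder vectors standing in for API results.

```python
def unique_texts_with_mapping(texts):
    """Return the distinct texts in first-seen order and, for each input
    position, the index of its text in that distinct list."""
    position_of = {}
    unique = []
    mapping = []
    for text in texts:
        if text not in position_of:
            position_of[text] = len(unique)
            unique.append(text)
        mapping.append(position_of[text])
    return unique, mapping


texts = ["coconut bars", "tomato soup", "coconut bars"]
unique, mapping = unique_texts_with_mapping(texts)

# Placeholder vectors standing in for one API result per unique text
fake_embeddings = [[float(i)] for i in range(len(unique))]
per_row = [fake_embeddings[j] for j in mapping]
print(unique, mapping, per_row)
```

With only 5 rows the saving is negligible, but on the full dataset it would reduce the number of tokens sent to the API.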


# Search for similarity

In [21]:
def get_embedding(text, model):
    """
    Return the embedding of a text for a given model.

    Note: this redefines the `get_embedding` imported from `openai.embeddings_utils`.
    """
    # Replace newlines with spaces before embedding
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]


def search_reviews(df: pd.DataFrame, product_description: str, n: int = 3):
    """
    Rank reviews by cosine similarity to a product description.

    Args:
        df (DataFrame): A pandas dataframe with the embedding of each review in the column "embedding".
        product_description (str): Product description to compare against each review.
        n (int): Number of most similar reviews to return.

    Returns:
        DataFrame: The n rows of df with the highest cosine similarity, sorted in
        descending order. Note that df itself is modified in place: it gains a
        "similarities" column.
    """
    embedding = get_embedding(product_description, model=embedding_model)
    df["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, embedding))
    res = df.sort_values("similarities", ascending=False).head(n)
    return res

Test `get_embedding()`

In [22]:
new_text_embedding = get_embedding("delicious beans", model=embedding_model)
print(len(new_text_embedding))
print(new_text_embedding[:5])

1536
[-0.015345171093940735, -0.01350956130772829, -0.016731781885027885, -0.0221989955753088, -0.007210381329059601]


Try `cosine_similarity()`

In [23]:
df1.embedding.apply(lambda x: cosine_similarity(x, new_text_embedding))

0      0.773732
297    0.784666
296    0.751474
295    0.798300
294    0.751474
Name: embedding, dtype: float64
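
Ranking these scores is a sort followed by a cut. The sketch below repeats that step in plain Python using the five values printed above; the helper name `most_similar` is hypothetical.

```python
# Similarities copied from the output above, keyed by DataFrame index
similarities = {0: 0.773732, 297: 0.784666, 296: 0.751474, 295: 0.798300, 294: 0.751474}


def most_similar(scores, n=3):
    # Sort (index, score) pairs by score, highest first, and keep the first n
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:n]


ranked = most_similar(similarities)
for index, score in ranked:
    print(index, score)
```

The best match for "delicious beans" is row 295 (the tomato soup review), followed by rows 297 and 0.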

Test `search_reviews()`

In [24]:
res = search_reviews(df1, 'delicious beans', n=3)

In [25]:
res

| | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding | similarities |
|---|---|---|---|---|---|---|---|---|---|
| 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato soup | I have a hard time finding packaged food of an... | Title: Best tomato soup; Content: I have a har... | 111 | [-0.0013143676333129406, -0.011042999103665352... | 0.7983 |
| 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not Wolfgang Puck good | Honestly, I have to admit that I expected a li... | Title: Good, but not Wolfgang Puck good; Conte... | 178 | [-0.003115636995062232, -0.009948926977813244,... | 0.784666 |
| 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... | 52 | [0.0070097302086651325, -0.02735169231891632, ... | 0.773732 |
