# Get embeddings from dataset

This notebook shows how to get embeddings for a large dataset.


## 1. Load the dataset

The dataset used in this example is [fine-food reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews) from Amazon. It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be either positive or negative. Each review has a ProductId, UserId, Score, review title (Summary), and review body (Text).

We combine the review summary and the review text into a single string. The model encodes this combined string and outputs a single embedding vector.

In [None]:
%pip install -qU pandas openai tiktoken transformers plotly matplotlib scikit-learn torch torchvision scipy

To run this notebook, you need to install pandas, openai, tiktoken, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.

In [8]:
import pandas as pd
import tiktoken
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/api/openai/v1/",
    api_key="NotNeeded",
)

def get_embedding(text, model="multilingual-e5-large"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

embedding_model = "multilingual-e5-large"
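# cl100k_base is an OpenAI tokenizer, not the one multilingual-e5-large uses,
# so the token counts below are approximate; max_tokens leaves a margin for that
embedding_encoding = "cl100k_base"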
max_tokens = 490  # the maximum for multilingual-e5-large is 512
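
`get_embedding` sends one request per text. The OpenAI embeddings endpoint also accepts a list of inputs, so requests can be grouped into batches. The following is a minimal sketch under stated assumptions: `batched`, `get_embeddings_batched`, and `embed_batch` are hypothetical names that do not appear in this notebook, and it is an assumption that the local server behind `base_url` accepts list inputs in the same way.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def get_embeddings_batched(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        # Same newline cleanup as get_embedding
        cleaned = [text.replace("\n", " ") for text in batch]
        vectors = embed_batch(cleaned)
        if len(vectors) != len(cleaned):
            raise RuntimeError("embed_batch returned an unexpected number of vectors")
        embeddings.extend(vectors)
    return embeddings


# Hypothetical wiring for this notebook (assumes the server accepts list inputs):
# def embed_batch(batch):
#     response = client.embeddings.create(input=batch, model=embedding_model)
#     ordered = sorted(response.data, key=lambda item: item.index)
#     return [item.embedding for item in ordered]
```

Passing the embedding call in as a function keeps the batching logic independent of the client, so it can be tried with a stub before pointing it at a real server.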


In [9]:
# load & inspect dataset
input_datapath = "data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)

| id | Time | ProductId | UserId | Score | Summary | Text | combined |
|---|---|---|---|---|---|---|---|
| 0 | 1351123200 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one start...and stop... with a tre... | Wanted to save some to bring to my Chicago fam... | Title: where does one start...and stop... wit... |
| 1 | 1351123200 | B003JK537S | A3JBPC3WFUT5ZP | 1 | Arrived in pieces | Not pleased at all. When I opened the box, mos... | Title: Arrived in pieces; Content: Not pleased... |


In [10]:
# subsample to 1k most recent reviews and remove samples that are too long
top_n = 1000
df = df.sort_values("Time").tail(top_n * 2)  # first cut to the 2k most recent entries, assuming fewer than half will be filtered out
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)

997
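
The cell above drops reviews that exceed `max_tokens`. An alternative, not used in this notebook, is to truncate over-long reviews so that no rows are lost. The sketch below assumes a tokenizer that exposes paired encode and decode functions (tiktoken's `Encoding` objects provide both); `truncate_to_max_tokens` is a hypothetical helper, and a whitespace tokenizer stands in so that the example runs without tiktoken.

```python
from typing import Callable, List, Sequence


def truncate_to_max_tokens(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    max_tokens: int,
) -> str:
    """Return `text` unchanged if it fits, else its first `max_tokens` tokens decoded."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Stand-in whitespace tokenizer (an assumption, only for illustration)
def ws_encode(text: str) -> List[str]:
    return text.split()


def ws_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


example = truncate_to_max_tokens("one two three four five", ws_encode, ws_decode, 3)
print(example)  # one two three

# Hypothetical wiring with the notebook's tokenizer:
# df["combined"] = df.combined.apply(
#     lambda x: truncate_to_max_tokens(x, encoding.encode, encoding.decode, max_tokens)
# )
```

Truncation keeps every review but discards the tail of long ones, so whether it is preferable to filtering depends on how much of the signal sits at the end of the text.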

## 2. Get embeddings and save them for future reuse

In [11]:
# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))
df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")
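
Because `to_csv` writes each embedding list as its string representation, the `embedding` column comes back as strings when the file is read again. A minimal sketch of converting such a cell back into a list of floats follows; `parse_embedding` is a hypothetical helper name, and the sample values are the first two components of the output shown later in this notebook.

```python
import ast
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Parse the string form of a list of numbers, e.g. "[1.0, -2.5]"."""
    value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    if not isinstance(value, (list, tuple)):
        raise ValueError("expected a list of numbers")
    return [float(x) for x in value]


# A list cell ends up in the CSV as str(list)
stored = str([1.296478509902954, -0.3666992783546448])
restored = parse_embedding(stored)

# Hypothetical usage after reloading the saved file:
# df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv", index_col=0)
# df["embedding"] = df["embedding"].apply(parse_embedding)
```

`ast.literal_eval` is used here rather than `eval` because it accepts only Python literals, which is safer for data read from disk.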

In [14]:
a = get_embedding("hi", model=embedding_model)
a

[1.296478509902954,
 -0.3666992783546448,
 1.0193829536437988,
 -3.505366325378418,
 2.9113759994506836,
 -4.551811695098877,
 -1.6316187381744385,
 1.5415558815002441,
 3.2211217880249023,
 -1.797011375427246,
 2.7726845741271973,
 1.0712194442749023,
 -5.645284652709961,
 -2.722586154937744,
 -4.917831897735596,
 -1.620875597000122,
 0.5320620536804199,
 2.244501829147339,
 1.7897518873214722,
 -0.299452006816864,
 0.12827062606811523,
 -2.172783851623535,
 -1.9538499116897583,
 -4.592774868011475,
 0.33893483877182007,
 -1.9262950420379639,
 -3.390446424484253,
 -3.1221556663513184,
 -2.4302477836608887,
 -5.025638103485107,
 0.9955859780311584,
 0.26253437995910645,
 -3.4192605018615723,
 -1.4674066305160522,
 -2.006012439727783,
 4.100264549255371,
 0.987169086933136,
 1.804114818572998,
 -2.1979775428771973,
 1.8712148666381836,
 -2.2755215167999268,
 6.361814498901367,
 -0.8530470132827759,
 -3.521940231323242,
 0.46580642461776733,
 -0.1951206624507904,
 0.5615508556365967,
 -0