# `fr2ex.embedding.embed_many` demo

SPDX-License-Identifier: 0BSD

For use on the actual data set of repository names, see `main.ipynb`.

In [1]:
import logging

import msgpack
import numpy as np

import fr2ex

In [2]:
logging.basicConfig(level=logging.INFO)

In [3]:
TEXTS = [
    'Gee whiz!',
    'Golly wow!',
    'Well, shucks!',
    'The meeting is this afternoon!',
]

## Loading or querying for an embedding

This is the normal way to use `embed_many`. Cached embeddings are used when available; otherwise, the model is queried.

In [4]:
embeddings = fr2ex.embedding.embed_many(TEXTS)

INFO:root:Reading cached embeddings.


In [5]:
embeddings.shape

(4, 1536)

In [6]:
embeddings @ np.transpose(embeddings)

array([[0.99999976, 0.9236445 , 0.88080657, 0.7808358 ],
       [0.9236445 , 0.9999999 , 0.86993897, 0.76927197],
       [0.88080657, 0.86993897, 1.0000004 , 0.7570356 ],
       [0.7808358 , 0.76927197, 0.7570356 , 1.0000001 ]], dtype=float32)
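The diagonal is 1 up to rounding, which suggests the rows are unit vectors. If so, this product is the matrix of pairwise cosine similarities. A minimal sketch of that relationship, using synthetic vectors rather than real embeddings (`vectors`, `unit`, and `similarities` are illustrative names, not part of `fr2ex`):

```python
import numpy as np

# Synthetic stand-in vectors (hypothetical data, not real embeddings).
rng = np.random.default_rng(0)
vectors = rng.standard_normal((4, 8)).astype(np.float32)

# Scale each row to unit length, so a dot product equals a cosine similarity.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Entry [i, j] is the cosine similarity between rows i and j.
similarities = unit @ unit.T

# Each row's similarity with itself is 1, up to float32 rounding.
print(np.diag(similarities))
```

For vectors that are not already normalized, the plain product would give unscaled dot products, so the normalization step matters.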

## Comparing separate results

The embedding model (`text-embedding-ada-002`) is nondeterministic: separate calls to embed the same text often return different results. The results are approximately equal, but the differences are larger than the rounding error of the data type (`float32`) would explain, so the model really is nondeterministic.

In [7]:
# Two saved sets of embeddings for the same texts, from separate calls to the model.
data_dir = fr2ex.paths.data_dir
old_save = data_dir / 'embeddings-c5e7a088e88de307e7076d8e19ef5913-old.msgpack'
new_save = data_dir / 'embeddings-c5e7a088e88de307e7076d8e19ef5913.msgpack'

with open(old_save, 'rb') as file:
    loaded1 = msgpack.load(file)

with open(new_save, 'rb') as file:
    loaded2 = msgpack.load(file)

(loaded1 == loaded2).all()

False

In [8]:
delta = loaded2 - loaded1
delta.min(), delta.max(), abs(delta).mean()

(-0.00031512976, 0.00011936715, 1.4676229e-05)
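One way to gauge these differences against the precision of `float32` is to compare them with the type's machine epsilon. This assumes the components have magnitude at most 1, as they would for unit vectors, so that rounding error per component is no larger than about one epsilon. A rough sketch, where `max_abs_delta` is copied by hand from the output above:

```python
import numpy as np

# Machine epsilon for float32: the gap between 1.0 and the next larger value.
eps = np.finfo(np.float32).eps

# Largest absolute difference reported in the output above.
max_abs_delta = 0.00031512976

# How many multiples of epsilon the largest difference spans.
print(eps, max_abs_delta / eps)
```

A difference thousands of times larger than epsilon is hard to attribute to storing or loading the values in `float32`.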

In [9]:
ratio = loaded2 / loaded1
ratio.min(), ratio.max(), ratio.mean()

(0.09388434, 5.549115, 1.0012113)
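The wide range of ratios need not contradict the small absolute differences. When a component is close to zero, even a tiny absolute change produces a large relative change. A small illustration with made-up numbers (the arrays `a` and `b` are hypothetical, not taken from the saved embeddings):

```python
import numpy as np

# Hypothetical components: one of ordinary size, one close to zero.
a = np.array([0.5, 0.00002], dtype=np.float32)

# Apply the same small absolute shift to both components.
b = a + np.float32(0.0001)

# The ratio stays near 1 for the larger component but is far from 1
# for the near-zero component, though both moved by the same amount.
ratio = b / a
print(abs(b - a).max(), ratio)
```

For this reason, an absolute tolerance, such as the `atol` parameter of `np.allclose`, is often a better fit than a ratio when comparing vectors whose components may be near zero.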

In [10]:
[np.dot(row1, row2) for row1, row2 in zip(loaded1, loaded2)]

[1.0000001, 1.0, 0.99999917, 0.99999875]
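Assuming the rows are unit vectors, each value above is the cosine similarity between an embedding and its counterpart from the other call, so values this close to 1 mean the two results point in almost exactly the same direction. The same row-wise dot products can also be computed without a Python loop. A sketch with synthetic arrays (`first` and `second` are illustrative stand-ins for the two loaded arrays):

```python
import numpy as np

# Synthetic stand-ins for two arrays of the same shape.
rng = np.random.default_rng(0)
first = rng.standard_normal((4, 8)).astype(np.float32)
second = rng.standard_normal((4, 8)).astype(np.float32)

# Loop form, as in the cell above: one dot product per pair of rows.
looped = [np.dot(row1, row2) for row1, row2 in zip(first, second)]

# Vectorized form: multiply matching entries and sum along each row.
vectorized = np.einsum('ij,ij->i', first, second)

print(np.allclose(looped, vectorized, atol=1e-5))
```

For four rows the loop is perfectly adequate; the vectorized form mainly helps when there are many rows.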