# `fr2ex.embedding.embed_many` demo

For use on the actual data set of repository names, see `main.ipynb`.

In [1]:
import logging

import numpy as np

import fr2ex

In [2]:
logging.basicConfig(level=logging.INFO)

In [3]:
TEXTS = [
    'Gee whiz!',
    'Golly wow!',
    'Well, shucks!',
    'The meeting is this afternoon!',
]

## Loading or querying for an embedding

This is the normal way to call `embed_many`. Cached embeddings are loaded when they are available; otherwise the model is queried.

In [4]:
embeddings = fr2ex.embedding.embed_many(TEXTS)

INFO:root:Reading cached embeddings.
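
The log line shows that the result came from the cache rather than from a new request. The following is a minimal sketch of a load-or-compute cache of that general kind. It is not the `fr2ex` implementation: `cached_embed`, `fake_compute`, the SHA-256 key, and the JSON file format are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_embed(texts, compute, cache_dir):
    """Return embeddings for texts, reusing a file in cache_dir if present."""
    # Key the cache file on a hash of the exact input texts (an assumption).
    key = hashlib.sha256(json.dumps(texts).encode('utf-8')).hexdigest()
    path = Path(cache_dir) / f'embeddings-{key}.json'

    # Cache hit: load the saved result and skip the computation.
    if path.exists():
        return json.loads(path.read_text(encoding='utf-8'))

    # Cache miss: compute, save for next time, and return.
    result = compute(texts)
    path.write_text(json.dumps(result), encoding='utf-8')
    return result


calls = []


def fake_compute(texts):
    """Hypothetical stand-in for a real embedding request."""
    calls.append(texts)
    return [[float(len(text))] for text in texts]


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_embed(['a', 'bc'], fake_compute, cache_dir)
    second = cached_embed(['a', 'bc'], fake_compute, cache_dir)

# The second call is served from the file, so fake_compute runs only once.
print(first == second, len(calls))
```

Keying on the exact input means any change to the texts produces a different file name, so stale results are never returned for new input.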


In [5]:
embeddings.shape

(4, 1536)

In [6]:
embeddings @ np.transpose(embeddings)

array([[0.9999999 , 0.92363995, 0.88078564, 0.7806773 ],
       [0.92363995, 1.0000004 , 0.86989045, 0.7691041 ],
       [0.88078564, 0.86989045, 0.99999976, 0.7568327 ],
       [0.7806773 , 0.7691041 , 0.7568327 , 1.0000002 ]], dtype=float32)
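
The diagonal entries are all approximately 1, which suggests that each row is a unit vector, in which case the product above is a matrix of cosine similarities. A minimal sketch of that relationship, using small made-up vectors rather than real embeddings (the `toy` array is a stand-in, not model output):

```python
import numpy as np

# Made-up stand-in for an embedding matrix; not real model output.
toy = np.array([[3.0, 4.0, 0.0],
                [0.0, 5.0, 12.0]], dtype=np.float32)

# Scale each row to unit length.
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)

# For unit-length rows, the Gram matrix holds cosine similarities.
similarities = unit @ unit.T

# General cosine similarity, computed without normalizing first.
norms = np.linalg.norm(toy, axis=1)
expected = (toy @ toy.T) / np.outer(norms, norms)

print(np.allclose(similarities, expected))
print(np.allclose(np.diag(similarities), 1.0))
```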

## Comparing separate results

The embedding model (`text-embedding-ada-002`) is nondeterministic: separate calls that embed the same text often return different results. The results are approximately equal, but the differences are larger than `float32` rounding error can account for, so the variation comes from the model itself rather than from the data type.

In [7]:
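import msgpack

# The comparisons below need NumPy arrays. Plain msgpack.load does not return
# them, so this assumes msgpack has already been patched to decode arrays
# (for example with msgpack_numpy.patch()).
with open('embeddings-c5e7a088e88de307e7076d8e19ef5913-old.msgpack', 'rb') as file: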
    loaded1 = msgpack.load(file)

with open('embeddings-c5e7a088e88de307e7076d8e19ef5913.msgpack', 'rb') as file:
    loaded2 = msgpack.load(file)

(loaded1 == loaded2).all()

False

In [8]:
delta = loaded2 - loaded1
delta.min(), delta.max(), abs(delta).mean()

(-0.0003888011, 0.00031512976, 3.5482637e-05)
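
To judge whether differences of this size could come from `float32` rounding alone, they can be compared with the data type's machine epsilon. The sketch below uses the mean absolute delta printed above and assumes that components of a 1536-dimensional unit vector have a typical magnitude near `1 / sqrt(1536)`. That scale is a rough estimate, not something measured from the data.

```python
import numpy as np

eps = np.finfo(np.float32).eps  # relative spacing of float32 values near 1.0

# Summary value copied from the output above.
mean_abs_delta = 3.5482637e-05

# Assumed typical component magnitude for a 1536-dimensional unit vector.
typical_component = 1.0 / np.sqrt(1536)

# Rounding a value of this size to float32 perturbs it by at most about
# half a unit in the last place, roughly eps / 2 times its magnitude.
rounding_scale = typical_component * eps / 2

# How many times larger the observed noise is than the rounding scale.
noise_to_rounding = mean_abs_delta / rounding_scale
print(noise_to_rounding)
```

Under these assumptions the observed noise exceeds the rounding scale by several orders of magnitude.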

In [9]:
ratio = loaded2 / loaded1
ratio.min(), ratio.max(), ratio.mean()

(-2.1692996, 857.599, 1.1443367)
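
The extreme ratios do not contradict the small deltas. An elementwise ratio blows up wherever a component is close to zero, and it turns negative wherever the noise flips the sign of a tiny component. The sketch below shows this with made-up numbers (the arrays `a` and `b` are illustrative, not taken from the files), along with a relative error based on vector norms, which is one possible more robust summary:

```python
import numpy as np

# Made-up vector: the last two components are close to zero.
a = np.array([0.5, -0.25, 1e-7, -2e-5], dtype=np.float32)
noise = np.array([1e-5, -2e-5, 3e-5, 4e-5], dtype=np.float32)
b = a + noise

# Near 1 for the large components, far from 1 for the tiny ones.
ratio = b / a
print(ratio)

# Relative error measured against the whole vector instead.
relative_error = np.linalg.norm(b - a) / np.linalg.norm(a)
print(relative_error)
```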

In [10]:
[np.dot(row1, row2) for row1, row2 in zip(loaded1, loaded2)]

[0.9999985, 0.99999833, 0.9999981, 0.99999875]
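
The row-wise dot products can also be computed without a Python loop. A sketch using `np.einsum` on made-up arrays (`x` and `y` stand in for the two loaded matrices):

```python
import numpy as np

# Made-up stand-ins for two embedding matrices of the same shape.
x = np.array([[1.0, 0.0], [0.6, 0.8]], dtype=np.float32)
y = np.array([[0.0, 1.0], [0.8, 0.6]], dtype=np.float32)

# Multiply matching elements and sum along each row.
rowwise = np.einsum('ij,ij->i', x, y)

# Same result as the list comprehension over zipped rows.
looped = [np.dot(row1, row2) for row1, row2 in zip(x, y)]

print(np.allclose(rowwise, looped))
```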