In [1]:
include("../src/Juissie.jl")
using .Juissie

## Embedding.jl

The Embedding package contains the `Embedder` struct and some support functions for it. An `Embedder` is simply a wrapper for the embedding model and the tokenizer associated with that model.

The embedding model takes a human-readable string and converts it into a neural network representation of that string. This representation is the output of the embedding model's hidden state, concretely a matrix of floats. With a sufficiently well-trained embedding model, two semantically similar strings will have similar embeddings; in other words, the distance between the two embeddings in N-dimensional space will be relatively small.

For a simple example, the strings "Dog" and "Hound" would probably have nearby embeddings, while "Dog" and "business card" would have more distant ones.
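
To make the idea of "nearby" embeddings concrete, here is a minimal sketch of cosine similarity on hand-made vectors. The helper `toy_cosine_similarity` and the three 3-dimensional vectors are assumptions invented for illustration; they are not Juissie functions or real model outputs.

```julia
using LinearAlgebra  # standard library: dot, norm

# Hypothetical helper, not part of Juissie: cosine of the angle between two vectors.
toy_cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Made-up 3-dimensional "embeddings", chosen by hand for illustration only.
toy_dog   = Float32[0.9, 0.1, 0.0]
toy_hound = Float32[0.8, 0.2, 0.1]
toy_card  = Float32[0.0, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1;
# nearly orthogonal vectors score close to 0.
println(toy_cosine_similarity(toy_dog, toy_hound))
println(toy_cosine_similarity(toy_dog, toy_card))
```

Cosine similarity depends only on direction, not magnitude, which is one reason it is a common choice for comparing embeddings.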

### Initialize the Embedder struct

The `Embedder` constructor takes a Hugging Face model name. The model is downloaded from Hugging Face and initialized via the external `HuggingFace` package. This provides the `model` and its associated `tokenizer`, which are both stored in the `Embedder`.


In [6]:
embedder = Embedder("BAAI/bge-small-en-v1.5");
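
The wrapper pattern described above can be sketched with a toy stand-in. Everything here (`ToyEmbedder`, `toy_tokenize`, `toy_model`, `toy_embed`) is hypothetical and exists only to show the shape of "a tokenizer plus a model behind one struct"; the real `Embedder` wraps a Hugging Face model and its internals may differ.

```julia
# Hypothetical stand-in for illustration only; the real `Embedder` lives in Juissie
# and wraps a Hugging Face model, not these toy functions.
struct ToyEmbedder{T,M}
    tokenizer::T  # callable: text -> vector of tokens
    model::M      # callable: vector of tokens -> Vector{Float32}
end

# Toy tokenizer: lowercase the text and split on whitespace.
toy_tokenize(text::AbstractString) = split(lowercase(text))

# Toy "model": bucket tokens by their length into a fixed-size count vector.
function toy_model(tokens; dims::Int = 8)
    features = zeros(Float32, dims)
    for tok in tokens
        features[mod1(length(tok), dims)] += 1f0
    end
    return features
end

# Assumed flow: tokenize first, then run the model on the tokens.
toy_embed(e::ToyEmbedder, text::AbstractString) = e.model(e.tokenizer(text))

toy = ToyEmbedder(toy_tokenize, toy_model)
println(toy_embed(toy, "This is sample text"))
```

Keeping the tokenizer and model together in one struct means callers cannot accidentally pair a model with a tokenizer it was not trained with.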

### Generate embedding for a provided text

To convert human-readable text into an embedding, use the `embed(...)` function.

In [5]:
text = "This is sample text for testing"
embedding = embed(embedder, text)

embedding_dog = embed(embedder, "dog")
embedding_hound = embed(embedder, "hound")
embedding_other = embed(embedder, "busines card")

# calculate the cosine similarity
dog_to_hound = cosine_similarity(embedding_dog, embedding_hound)
dog_to_other = cosine_similarity(embedding_dog, embedding_other)

# higher means more similar
println(dog_to_hound)  # 0.7871117
println(dog_to_other)  # 0.5119946

0.7871117
0.5119946
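
Pairwise scores like these are often used to rank several candidates against one query. The sketch below shows that ranking step on hand-made vectors; `rank_cosine`, the `query` vector, and the candidate values are all assumptions for illustration, not Juissie's `cosine_similarity` or real embeddings.

```julia
using LinearAlgebra  # standard library: dot, norm

# Hypothetical helper, not Juissie's `cosine_similarity`.
rank_cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Hand-made vectors standing in for real embeddings (values are made up).
query = Float32[1.0, 0.0, 0.0]
candidates = Dict(
    "hound" => Float32[0.9, 0.1, 0.0],
    "puppy" => Float32[0.7, 0.3, 0.0],
    "business card" => Float32[0.0, 0.1, 0.9],
)

# Score every candidate, then sort from most to least similar.
scores = [(label, rank_cosine(query, v)) for (label, v) in candidates]
ranked = sort(scores; by = last, rev = true)

for (label, score) in ranked
    println(rpad(label, 15), round(score; digits = 3))
end
```

With real embeddings, the same loop would call `embed` once per candidate and compare each result to the query embedding.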
