In [1]:
include("../src/Juissie.jl")
using .Juissie

## Embedding.jl

`Embedding.jl` defines the `Embedder` struct and a few support functions for it. An `Embedder` is a thin wrapper around an embedding model and the tokenizer associated with that model.

The embedding model takes a human-readable string and converts it into a numerical representation of that string. This representation is the output of the embedding model's hidden state, concretely a matrix of floats. With a sufficiently well-trained embedding model, two semantically similar strings have similar embeddings. In other words, the distance between the two embeddings in N-dimensional space is relatively small.

As a simple example, the strings "Dog" and "Hound" would probably have nearby embeddings, while the strings "Dog" and "business card" would have embeddings that are farther apart.
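
To make the idea of "distance between embeddings" concrete, here is a minimal sketch using only Julia's standard library. The three vectors are hypothetical, hand-picked values for illustration; they are not produced by any model, and real embeddings have far more dimensions.

```julia
using LinearAlgebra  # standard library: provides norm

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration only.
dog           = [0.9, 0.8, 0.1]
hound         = [0.8, 0.9, 0.2]
business_card = [0.1, 0.2, 0.9]

# Euclidean distance: the length of the difference between two arrays.
euclidean(a, b) = norm(a - b)

println(euclidean(dog, hound))          # small: the vectors are close together
println(euclidean(dog, business_card))  # larger: the vectors are far apart
```

The same `euclidean` helper also works on matrices of equal size, since `norm` treats a matrix as a flat collection of numbers by default.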

### Initialize the Embedder struct

The `Embedder` constructor takes a Hugging Face model name. The model is downloaded from Hugging Face and initialized via the external `HuggingFace` package. This provides the `model` and its associated `tokenizer`, both of which are stored in the `Embedder`.


In [2]:
embedder = Embedder("BAAI/bge-small-en-v1.5")

Embedder(BertTextEncoder(
├─ TextTokenizer(MatchTokenization(WordPieceTokenization(bert_uncased_tokenizer, WordPiece(vocab_size = 30522, unk = [UNK], max_char = 100)), 5 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 30522, unk = [UNK], unki = 101),
├─ startsym = [CLS],
├─ endsym = [SEP],
├─ padsym = [PAD],
├─ trunc = 512,
└─ process = Pipelines:
  ╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
  ╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
  ╰─ target[(token, segment)] := SequenceTemplate{String}([CLS]:<type=1> Input[1]:<type=1> [SEP]:<type=1> (Input[2]:<type=2> [SEP]:<type=2>)...)(target.token)
  ╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(512))(target.token)
  ╰─ target[token] := TextEncodeBase.trunc_and_pad(512, [PAD], tail, tail)(target.token)
  ╰─ target[token] := TextEncodeBase.nested2batch(target.token)
  ╰─ target[segment] := TextEncodeBase.trunc_and_pad(512

### Generate embedding for a provided text

To convert human-readable text into an embedding, use the `embed(...)` function.

In [8]:
text = "This is sample text for testing"
embedding = embed(embedder, text)

embedding_dog = embed(embedder, "dog")
embedding_hound = embed(embedder, "hound")
embedding_other = embed(embedder, "busines card")

using Pkg
Pkg.add(url="https://github.com/JuliaStats/Distances.jl")
using Distances

# Calculate the Euclidean distance between each pair of embeddings
dog_to_hound = evaluate(Euclidean(), embedding_dog, embedding_hound)
dog_to_other = evaluate(Euclidean(), embedding_dog, embedding_other)

println(dog_to_hound)  # 6.1125574
println(dog_to_other)  # 9.211692

[32m[1m    Updating[22m[39m git-repo `https://github.com/JuliaStats/Distances.jl`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/Projects/School/SoftwareParadigms/csci_6221/Project.toml`
[32m[1m  No Changes[22m[39m to `~/Projects/School/SoftwareParadigms/csci_6221/Manifest.toml`


6.1125574
9.211692
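
The pairwise comparison above generalizes naturally to ranking several candidate strings by their distance to a query. The sketch below is one possible way to do that, not part of Juissie: `rank_by_distance` and `toy_embed` are hypothetical names introduced here. `toy_embed` is a stand-in that counts letters so the example runs without downloading a model; it captures spelling rather than meaning.

```julia
using LinearAlgebra  # standard library: provides norm

# Rank candidate strings by Euclidean distance to a query string.
# `embed_fn` is any function that maps a string to a numeric array;
# every returned array is assumed to have the same shape.
function rank_by_distance(embed_fn, query, candidates)
    q = embed_fn(query)
    distances = [norm(q - embed_fn(c)) for c in candidates]
    order = sortperm(distances)  # indices from nearest to farthest
    return [(candidates[i], distances[i]) for i in order]
end

# Hypothetical stand-in embedding: the count of each letter a-z in the text.
toy_embed(text) = [Float64(count(==(ch), lowercase(text))) for ch in 'a':'z']

ranked = rank_by_distance(toy_embed, "dog", ["business card", "dogs", "hound"])
for (candidate, distance) in ranked
    println(candidate, " => ", distance)
end
```

With the `embedder` from this notebook, one would presumably pass `text -> embed(embedder, text)` as `embed_fn`, assuming `embed` returns arrays of equal shape for every input, as the distance calculation above suggests.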
