Embeddings
===
MAIC - Spring, Week 4<br>
```
  _____________
 /0   /     \  \
/  \ M A I C/  /\
\ / *      /  / /
 \___\____/  @ /
          \_/_/
```
(Rosie is not needed!)

Prereqs:
- Install [VSCode](https://code.visualstudio.com/)
- Install [Python](https://www.python.org/downloads/)
- Ensure you can run notebooks in VSCode.

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

Embeddings are an extremely useful tool in modern machine learning, allowing raw text to be transformed into numerical representations that models can understand.
They are also a popular interview topic, used to test a candidate’s understanding of vector spaces, similarity metrics, and real-world applications.
Beyond that, embeddings are incredibly common in ML, powering everything from search engines and recommendation systems to chatbots and fraud detection.
You'll see embeddings being used everywhere if you look! Here are just some models, projects, and papers that make use of embeddings:
- [The original transformer paper - the basis of modern LLMs](https://arxiv.org/pdf/1706.03762)
- [RAG systems - often used to give LLMs comprehensive access to much more information than they could normally use at once](https://en.wikipedia.org/wiki/Retrieval-augmented_generation)
- [Image generation models such as Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion)
  - Note: images are generated *in embedding space*!
- [Audio-continuation models such as RAVE](https://github.com/acids-ircam/RAVE)
- Modern image search makes extensive use of embeddings
- Modern recommendation algorithms also use embeddings
- Even some papers published by MSOE students involve the use of embeddings! Here are a few:
  - [Agent simulation with LLMs](https://arxiv.org/pdf/2409.13753)
  - [Strategy masking - a technique to control model behavior](https://arxiv.org/pdf/2501.05501)

**What *are* Embeddings?**

At the lowest level, an embedding is just stored as a list of numbers.

[img of list of numbers - idea for imgs: we can link to imgs hosted on this repo]

This list of numbers is best interpreted as a point or direction in some very high-dimensional space that represents something. In the case of text-based models, embeddings are used to represent words and sentences. Let's assume we already have a model that can embed any word.

[img]
"Some" -> [numbers]
"Embedded" -> [numbers]
"Words" -> [numbers]

In practice, embeddings range from tens of dimensions to over 1000. For simplicity, let's only conceptualize things in two dimensions for now.

[img of numbers as point in space]

But how do we actually interpret these directions in space as being words? The answer is that different directions in the space represent different aspects of a word:
- one direction may encode "past tense,"
- another may encode "physical verb,"
- and a third may encode "fast-ness."

In the case above, the embedding of the word "ran" may point roughly along the average of the three directions encoding "past tense," "physical verb," and "fast-ness."
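
To make the "directions encode aspects" idea concrete, here is a minimal sketch in a toy 3-dimensional space. The three directions and both word vectors are hand-picked, hypothetical values (no model produced them), and `feature_strength` is a helper name invented for this illustration. Real models learn their directions during training, and those directions are rarely this clean or this easy to name.

```python
import numpy as np

# Hypothetical, hand-picked "feature directions" in a toy 3D space.
past_tense = np.array([1.0, 0.0, 0.0])
physical_verb = np.array([0.0, 1.0, 0.0])
fastness = np.array([0.0, 0.0, 1.0])

# Toy embedding for "ran": the average of all three directions.
emb_ran = (past_tense + physical_verb + fastness) / 3

# Toy embedding for "sat": past tense and physical, but not fast.
emb_sat = (past_tense + physical_verb) / 2


def feature_strength(embedding, direction):
    """Measure how far an embedding points along a unit-length direction."""
    return float(np.dot(embedding, direction))


print(feature_strength(emb_ran, fastness))  # about 0.33: "ran" has some fast-ness
print(feature_strength(emb_sat, fastness))  # 0.0: "sat" has none
```

The dot product with a direction acts like a question asked of the embedding: "how much of this aspect do you contain?"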

[Here is a one-minute video that illustrates this concept using real-world embeddings.](https://www.youtube.com/watch?v=FJtFZwbvkI4)

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

**That seems neat. How can *I* use embeddings?**

Let's set things up!

It's really easy to get started with embeddings. You can even run small embedding models on your laptop!

We'll be using `sentence-transformers` to run [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) - a model that embeds sentences into 384 dimensions.

In [None]:
# Shouldn't take longer than 5 mins
%pip install sentence-transformers
%pip install tf-keras
%pip install numpy

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

In [None]:
# Our model of choice is supplied here. You can find many more on Hugging Face:
# https://huggingface.co/models?sort=trending&search=embed
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("This is an embedded text example.")

# Our embeddings are just lists of numbers stored as NumPy arrays.
# NumPy is a library that makes it easier to manipulate arrays.
print(embedding)
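
Printing hundreds of numbers is hard to read, so it can help to summarize an embedding instead. The sketch below assumes a helper named `summarize_embedding` (invented here, not part of `sentence-transformers`). So that it runs without downloading the model, it uses a random 384-dimensional stand-in vector, which is not a real embedding.

```python
import numpy as np


def summarize_embedding(vec):
    """Return basic facts about an embedding stored as a 1-D NumPy array."""
    vec = np.asarray(vec)
    return {
        "dimensions": vec.shape[0],
        "dtype": str(vec.dtype),
        "length": float(np.linalg.norm(vec)),
    }


# Stand-in for `embedding`: random numbers, NOT the output of a model.
rng = np.random.default_rng(seed=0)
stand_in = rng.normal(size=384).astype(np.float32)

print(summarize_embedding(stand_in))
```

In the notebook, calling `summarize_embedding(embedding)` on the real vector should report 384 dimensions. If the reported length is close to 1.0, the vector is unit length, which matters when comparing embeddings with a dot product.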

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

Now that we (hopefully) have a working embedding model, let's put it to use.

Let's first try the example from the previously linked 3Blue1Brown video. But before that, we need to understand how to measure "distance" in embedding space.

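A common choice is cosine similarity: the cosine of the angle between two embedding vectors. It ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction), and it ignores vector length, so only direction matters. When both vectors have unit length, cosine similarity is simply their dot product.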
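
Cosine similarity can be computed from the dot product and the two vector lengths: $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. Here is a minimal sketch, assuming a helper named `cosine_similarity` (written for this illustration, not taken from the original notebook) and three tiny hand-made 2D examples.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over both lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Same direction, different lengths: similarity close to 1.
same_direction = cosine_similarity([1, 2], [2, 4])

# Perpendicular vectors: similarity of 0.
perpendicular = cosine_similarity([1, 0], [0, 3])

# Opposite directions: similarity close to -1.
opposite = cosine_similarity([1, 2], [-1, -2])

print(same_direction, perpendicular, opposite)
```

In the notebook, the same helper can be applied to real embeddings, for example `cosine_similarity(emb_uncle, emb_aunt)` once those vectors have been computed.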

In [None]:
emb_uncle = model.encode("uncle")
emb_aunt = model.encode("aunt")
emb_man = model.encode("man")
emb_woman = model.encode("woman")

In [None]:
np.dot(emb_uncle, emb_aunt) # dot product; this equals cosine similarity only when both vectors have unit length

In [None]:
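analogy = emb_uncle - emb_man + emb_woman # this sum is generally not unit length, so divide by both lengths to get a cosine similarity
np.dot(analogy, emb_aunt) / (np.linalg.norm(analogy) * np.linalg.norm(emb_aunt)) # not working - maybe use a word-level model? But I wanted to keep things simple with only one model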
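
To see what the analogy arithmetic is meant to do when the geometry is ideal, here is a sketch with hand-made 2D vectors. These values are hypothetical and were chosen so that one axis means "gender" and the other means "parent's sibling"; no model produced them, and real embeddings are much noisier.

```python
import numpy as np

# Hypothetical hand-made 2D "embeddings" (not produced by any model).
# Axis 0: gender (-1 = male, +1 = female). Axis 1: "parent's sibling"-ness.
toy = {
    "man": np.array([-1.0, 0.0]),
    "woman": np.array([1.0, 0.0]),
    "uncle": np.array([-1.0, 1.0]),
    "aunt": np.array([1.0, 1.0]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# "uncle" minus "man" removes the male component; adding "woman" adds the female one.
analogy = toy["uncle"] - toy["man"] + toy["woman"]

# Rank every toy word by how closely it points in the analogy's direction.
scores = {word: cosine_similarity(analogy, vec) for word, vec in toy.items()}
best = max(scores, key=scores.get)

print(best)  # aunt
```

With real sentence-level embeddings the result is usually less tidy than this, because single words are only one of many things the model was trained to represent.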

TODO

more on visualizing embeddings
- bar plot
- img
- pca

some search task

some clustering task

maybe: high level of embeddings in transformer/attention?