Embeddings
===
MAIC - Spring, Week 4<br>
```
  _____________
 /0   /     \  \
/  \ M A I C/  /\
\ / *      /  / /
 \___\____/  @ /
          \_/_/
```
(Rosie is not needed!)

Prereqs:
- Install [VSCode](https://code.visualstudio.com/)
- Install [Python](https://www.python.org/downloads/)
- Ensure you can run notebooks in VSCode.

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

Embeddings are an extremely useful tool in modern machine learning, allowing raw text to be transformed into numerical representations that models can understand.
They are also a popular interview question to test a candidate’s understanding of vector spaces, similarity metrics, and real-world applications.
Beyond that, embeddings are incredibly common in ML, powering everything from search engines and recommendation systems to chatbots and fraud detection.
You'll see embeddings being used everywhere if you look! Here are just some models, projects, and papers that make use of embeddings:
- [The original transformer paper - the basis of modern LLMs](https://arxiv.org/pdf/1706.03762)
- [RAG systems - often used to give LLMs comprehensive access to much more information than they could normally use at once](https://en.wikipedia.org/wiki/Retrieval-augmented_generation)
- [Image generation models such as Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion)
  - Note: images are generated *in embedding space*!
- [Audio-continuation models such as RAVE](https://github.com/acids-ircam/RAVE)
- Modern image search makes extensive use of embeddings
- Modern recommendation algorithms also use embeddings
- Even some papers published by MSOE students involve the use of embeddings! Here are a few:
  - [Agent simulation with LLMs](https://arxiv.org/pdf/2409.13753)
  - [Strategy masking - a technique to control model behavior](https://arxiv.org/pdf/2501.05501)

**What *are* Embeddings?**

At the lowest level, an embedding is just stored as a list of numbers. This could be an embedding: `[0.1, 0.2, -0.3]`.

This list of numbers is best interpreted as a point or direction in a high-dimensional space, where the location encodes what the embedded item means. In the case of text-based models, embeddings represent words and sentences.

In practice, embeddings range from tens of dimensions to over 1000. For simplicity, let's only conceptualize things in two or three dimensions for now - that way we can actually visualize what's going on.

The image below shows how embedded words of the phrase `some embedded text` can be thought of as directions in space - the embedding point describes direction relative to the point (0,0).

<img src="https://raw.githubusercontent.com/MSOE-AI-Club/workshops/refs/heads/main/Embeddings/img1.png" width=1000px>

But how do we actually interpret these directions in space as being words? The answer is that different directions in the space represent different aspects of a word -
- one direction may encode "past tense,"
- another may encode the idea of "running" or "to run."

In this example, the embedding of the word "ran" may point along the average of the directions encoding "past tense" and "to run."

<img src="https://raw.githubusercontent.com/MSOE-AI-Club/workshops/refs/heads/main/Embeddings/img2.png" width=600px>
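
As a rough sketch of this idea, here is a toy 2D example. The two direction vectors are hand-picked for illustration (real models learn their directions and do not expose labeled axes like this), and `average` and `angle_degrees` are hypothetical helper names. It only shows that averaging two directions gives a vector pointing partway between them.

```python
import math

# Hypothetical, hand-picked 2D directions; real embedding directions are learned, not labeled.
past_tense = (0.0, 1.0)
to_run = (1.0, 0.0)

def average(u, v):
    """Element-wise average of two equal-length vectors."""
    return tuple((a + b) / 2 for a, b in zip(u, v))

def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / (norm_u * norm_v)))

ran = average(past_tense, to_run)
print(ran)  # (0.5, 0.5)

# "ran" sits at roughly the same angle (about 45 degrees) from both directions.
print(angle_degrees(ran, past_tense), angle_degrees(ran, to_run))
```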

This topic naturally leads into another important point: embeddings that are *closer* in embedding space are also *closer* in meaning. The word "ran" will be closer to "walked" than to "stapler." This is because words with increasingly different meanings point, *by definition,* in increasingly different directions to encode those meanings.

<img src="https://raw.githubusercontent.com/MSOE-AI-Club/workshops/refs/heads/main/Embeddings/img3.png" width=600px>

NOTE: we'll watch this during the workshop:

[Here is a one-minute video that illustrates this concept using real-world embeddings.](https://www.youtube.com/watch?v=FJtFZwbvkI4)

**How is it possible for there to be directions dedicated to ideas as specific as "Italian-ness" and "WWII Axis leaders?"**

One might expect that the directions in embedding space would represent more general concepts.

If directions can be allocated to specific ideas like "WWII Axis leaders," how are there enough directions left to represent everything else, from "60s British pop bands" to "computer keyboard layouts"?!?!

In two or three dimensions, it's *not* really possible to have directions this specific. But remember that text embeddings typically have tens to thousands of dimensions.

As the number of dimensions grows, the number of possible points and directions in a space grows MUCH more quickly.

Let's work with the constraint that points with different meanings must be at least one unit apart. The exact distance is somewhat arbitrary, but there really is a "minimum" distance below which two points effectively mean the same thing. Let's also say that each coordinate must stay in the range 0 to 1. This is also somewhat arbitrary, but machine learning models often keep values in a bounded range so that they don't grow without limit. With these constraints, we can only fit two points in one dimension:

<img src="https://raw.githubusercontent.com/MSOE-AI-Club/workshops/refs/heads/main/Embeddings/img4.png" width=600px>

These two points (or two directions relative to a center point) probably can't encode much information. But what if we extend into the second dimension with the same constraints?

<img src="https://raw.githubusercontent.com/MSOE-AI-Club/workshops/refs/heads/main/Embeddings/img5.png" width=600px>

We now have *four* points (or four directions). And if we went to three dimensions we'd have eight points - imagine extruding the four points of this square into a cube. In general, our constraints will allow N dimensions to encode $2^N$ unique directions.

- Only 10 dimensions gets us over 1000 directions.
- 20 dimensions gets us over 1 million directions.
- And at 1000 dimensions, we have **more possible unique directions than atoms in the observable universe,** and we can interpolate between them to embed specific words or sentences!
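
The counting argument above can be checked with a few lines of Python. This sketch enumerates the corners of the unit hypercube (every coordinate is 0 or 1) for small N and then uses the $2^N$ formula for larger N. `unit_cube_corners` is an illustrative helper name, and the figure of $10^{80}$ atoms is a commonly quoted rough estimate, used here as an assumption.

```python
from itertools import product

def unit_cube_corners(n_dims):
    """All points whose coordinates are each 0 or 1: the corners of the unit hypercube."""
    return list(product((0, 1), repeat=n_dims))

# Enumerate the corners for the dimensions we can still picture.
for n in (1, 2, 3):
    corners = unit_cube_corners(n)
    print(n, len(corners), corners)

# Enumerating corners gets expensive quickly, so use 2**N directly for larger N.
print(2**10)             # 1024
print(2**20)             # 1048576
print(2**1000 > 10**80)  # True (10**80 is an assumed rough estimate of the atom count)
```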

The act of adding just *one* dimension EXPONENTIALLY increases how many things we can fit in the space! So think about adding a dimension to a 3D space... *1000 times*.

<img src="https://www.i2tutorials.com/wp-content/media/2019/09/Curse-of-Dimensionality-i2tutorials.png" width=1000px>

Although we can't *see* the directions encoding things like "WWII Axis leaders," there is clearly enough room in the space for such directions to exist.

**Who decided that there should be directions for these particular ideas?**

These directions are not something humans designed directly. Instead, these directions *emerge* from the process of training the model.

The model learns from a huge amount of text and starts to recognize patterns, like which words tend to appear in similar contexts.

As it processes more and more language, the model "figures out" what sort of information it should store in the directions of an embedding space - even though no one explicitly programmed it to do that!

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

**That seems neat. How can *I* use embeddings?**

Let's set things up!

It's really easy to get started with embeddings. You can even run small embedding models on your laptop!

We'll be using `sentence-transformers` to run [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) - a model that embeds sentences into 384 dimensions.

In [None]:
# Shouldn't take longer than 5 mins
%pip install sentence-transformers
%pip install tf-keras
%pip install numpy

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

In [None]:
# Our model of choice is supplied here. You can find many more on Hugging Face:
# https://huggingface.co/models?sort=trending&search=embed
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("This is an embedded text example.")

# Our embeddings are just lists of numbers stored as NumPy arrays.
# NumPy is a library that makes it easier to manipulate arrays.
print(embedding)

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

Now that we (hopefully) have a working embedding model, let's put it to use.

Let's first try the example from the previously linked 3Blue1Brown video. But before that, we need to understand how to measure "distance" in embedding space.

[TODO - explain cosine similarity]
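
As a starting point, cosine similarity is the dot product of two vectors divided by the product of their lengths. It ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction), and it ignores how long the vectors are. The sketch below is a minimal illustration, not the library's implementation: the helper name `cosine_similarity` and the toy vectors are made up, and it uses only the standard library so it runs without the model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction, different length)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (perpendicular)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite directions)
```

With NumPy arrays, the same quantity is `np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))`. If both vectors already have length 1 (worth checking with `np.linalg.norm` on your model's output), the denominator is 1 and a plain `np.dot` gives the same number.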

In [12]:
emb_uncle = model.encode("uncle")
emb_aunt = model.encode("aunt")
emb_man = model.encode("man")
emb_woman = model.encode("woman")

# ^ visualize?

In [13]:
np.dot(emb_uncle, emb_aunt)

0.73387706

In [None]:
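# "uncle" - "man" + "woman" should land near "aunt", and this score is higher than the one above.
# Caveat: the combined vector is generally not unit length, so this raw dot product is not a true cosine similarity.
np.dot(emb_uncle - emb_man + emb_woman, emb_aunt)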

# If you're done and still waiting, try the activity with other analogies: what about Hitler? Other terms?

0.98456275
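
One caveat for the arithmetic above: adding and subtracting vectors usually changes their length, so a raw dot product mixes direction with length. The sketch below uses made-up 2D vectors (not real model output) to show how the raw dot product and the cosine similarity can differ. The helper names `dot`, `norm`, and `cosine_similarity` are illustrative.

```python
import math

def dot(u, v):
    """Sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two lengths."""
    return dot(u, v) / (norm(u) * norm(v))

# Hypothetical unit-length toy vectors: axis 0 ~ "family relation", axis 1 ~ "gender".
uncle, aunt = (0.8, 0.6), (0.8, -0.6)
man, woman = (0.0, 1.0), (0.0, -1.0)

# uncle - man + woman, computed element by element.
analogy = tuple(u - m + w for u, m, w in zip(uncle, man, woman))

print(norm(analogy))                     # longer than 1, so no longer unit length
print(dot(analogy, aunt))                # raw dot product; exceeds 1, which a cosine never can
print(cosine_similarity(analogy, aunt))  # length-independent comparison
print(cosine_similarity(uncle, aunt))    # baseline: uncle vs. aunt
```

In this toy setup the analogy vector is still closer in direction to `aunt` than `uncle` is, but the size of the raw dot product overstates the match. Dividing by the lengths (or normalizing the combined vector first) keeps the two scores comparable.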

TODO

more on visualizing embeddings
- bar plot
- img - 100 pixels in an img are visually a lot more info than 2d direction
- pca

some search task

some clustering task

maybe: high level of embeddings in transformer/attention?