# About Vector Embeddings
- Vector embeddings are fundamental to vector search, semantic search, RAG, and therefore to many LLM applications (and plenty of other AI technologies). The term is interchangeable with "embedding".
- An embedding is a numeric representation of data (usually text, images, or audio) in a continuous vector space.
- Example: [0.03, -0.42, 0.88, 0.11, 0.24 ...]. On its own this is just a list of numbers; we need to interpret and process it to make it useful.
### So what actually are vector embeddings?
- As mentioned above, an embedding is a vector (array) of floating-point numbers that represents some piece of data.
- They can have MANY dimensions, often hundreds or thousands.
- They're most often used for text, but can also be used for other data types (images, audio, or multimodal data such as text + image).
- Embeddings **capture** the relatedness/similarity/proximity of the data they represent. For text, the embeddings of 2 strings tell you how similar the strings are; for images, how similar the images are. This is a **core** property of embeddings, and it is what makes them inherently useful in AI applications.
  - The **distance** between 2 vectors measures their relatedness.
  - Small distances indicate high relatedness/similarity.
  - Large distances indicate low relatedness/similarity.
  - Example:
    - Two text embeddings:
      - "How to boil an egg" → `[0.23, -0.11, 0.91, 0.04, 0.67]`
      - "Steps to cook a hard-boiled egg" → `[0.21, -0.10, 0.89, 0.05, 0.65]`
      - These are **highly related** — small distance between vectors.
    - Versus:
      - "How to boil an egg" → `[0.23, -0.11, 0.91, 0.04, 0.67]`
      - "Best hiking trails in California" → `[0.85, 0.67, -0.33, 0.21, -0.55]`
      - These are **unrelated** — large distance between vectors.
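
To make the comparison concrete, here is a minimal sketch that measures the straight-line (Euclidean) distance between the illustrative vectors above. The variable names are made up for this example, and the 5-value vectors are the toy values from the list, not real model output.

```python
import math

# Toy vectors copied from the example above (not real model output).
boil_egg = [0.23, -0.11, 0.91, 0.04, 0.67]
hard_boiled_egg = [0.21, -0.10, 0.89, 0.05, 0.65]
hiking_trails = [0.85, 0.67, -0.33, 0.21, -0.55]

# math.dist returns the Euclidean distance between two points.
related_distance = math.dist(boil_egg, hard_boiled_egg)
unrelated_distance = math.dist(boil_egg, hiking_trails)

print(f"related:   {related_distance:.3f}")
print(f"unrelated: {unrelated_distance:.3f}")
```

The first distance comes out at roughly 0.04 and the second at roughly 2.0, which matches the "small distance means related" intuition.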
#### Some use cases for embeddings:
  - **Text embeddings:**
    - **Search**: Rank documents, FAQs, or product listings by relevance to a user’s query (e.g., “best protein for muscle gain” → retrieves relevant blogs, not just exact matches).
    - **Clustering**: Group similar news articles, Reddit posts, or support tickets (e.g., organize tickets into categories without predefined labels).
    - **Recommendations**: Suggest similar articles, videos, or job listings based on semantic similarity (e.g., “users who read X also read…”).
    - **Anomaly detection**: Spot unusual messages or user behavior (e.g., a support ticket with a very low similarity to any known category).
    - **Diversity measurement**: Ensure recommendations aren’t too similar (e.g., content shown is varied in embedding space).
    - **Classification**: Use similarity to labeled examples to classify texts (e.g., classify emails as spam vs not spam via nearest-neighbor logic in embedding space).

  - **Image and audio embeddings:**
    - **Search**: Retrieve similar images from a dataset (e.g., Google reverse image search or fashion image search).
    - **Clustering**: Group images/audio by visual or auditory similarity (e.g., organize wildlife camera footage by animal type).
    - **Recommendations**: Suggest visually or sonically similar items (e.g., Spotify recommending music with similar acoustic embeddings).
    - **Multimodal matching**: Match an image to its corresponding caption, or vice versa (e.g., "find me an image that matches this description").

  - **Multimodal (Text + Image) embeddings:**
    - **Cross-modal retrieval**: Search images using text, or find matching text for an image (e.g., “red car on snowy road” → returns matching images).
    - **Visual question answering**: Use both image and text embeddings to answer questions about a picture.
    - **Content moderation**: Detect unsafe or mismatched image+text combinations (e.g., misleading captions).
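
As one illustration of the use cases above, the nearest-neighbor classification idea can be sketched in a few lines. Everything here is hypothetical: the 3-value vectors are hand-made toy data standing in for real embeddings, and `classify` is a name invented for this sketch.

```python
import math

# Hypothetical labeled examples: (toy 3-value embedding, label).
# Real embeddings would come from an embedding model and be much longer.
labeled_examples = [
    ([0.9, 0.1, 0.0], "spam"),
    ([0.8, 0.2, 0.1], "spam"),
    ([0.1, 0.9, 0.2], "not spam"),
    ([0.0, 0.8, 0.3], "not spam"),
]

def classify(embedding):
    """Return the label of the closest labeled example (1-nearest-neighbor)."""
    nearest = min(labeled_examples, key=lambda ex: math.dist(embedding, ex[0]))
    return nearest[1]

print(classify([0.85, 0.15, 0.05]))
print(classify([0.05, 0.85, 0.25]))
```

The same pattern, finding the closest stored vector to a new one, underlies search and recommendations as well; only what you do with the nearest neighbors changes.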


Here is a real-life example of getting the vector embedding for a string in code, using OpenAI's embedding models.
- Most AI labs/companies have their own embedding models. In many LLM workflows/pipelines, embedding is a step that is followed by a database lookup (or a similar retrieval method) to help generate quality outputs for the user.
- Different embedding models **will** give you different values for the same input. The point here (and of this repo in general) is to show what embeddings are and how they work; OpenAI's model is just a practical example (there will be more practical examples down the line) to accelerate your understanding.

In [6]:
%pip install -r requirements.txt

In [None]:
# A practical example: generating the embedding vector for a string.

import os

from openai import OpenAI

# Read the API key from the environment instead of hardcoding it in the notebook.
api_key = os.environ["OPENAI_API_KEY"]
client = OpenAI(api_key=api_key)

response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-small"
)

print(response.data[0].embedding)

# You can take the resultant embedding vector, save it in a vector database (there are many technologies for this too!), and reuse it later on for whatever you want.

### Dimensions
- The dimension of an embedding is simply the number of values in it. (So an embedding of dimension d is just a vector with d values.)
- More dimensions give the model more room to capture precision, complexity, meaning, and nuance in the data you're representing.
  - However, this comes at the cost of higher storage and slower retrieval.
- When storage or retrieval time becomes a problem, you would want to reduce dimensions.
- Reducing dimensions can also help remove noise (data that doesn't contribute much) and keep the more important (principal) components.
- There are also **methods** to **visualize** the embedding space on a plane/graph. This will make more intuitive sense when we get to that part. (Common methods are PCA, t-SNE, and UMAP; they reduce our vectors to 2D or 3D, which lets us actually plot them.)
  - Reducing dimensions also helps us visualize our vector embeddings!
- Many models have a default dimension size. There are multiple reasons for this:
  - It allows vectors to be compared directly.
  - Distance measurement methods require vectors to be in the same space (the vectors have the same number of dimensions, and each dimension represents the same type of information across all vectors).
  - Unless told otherwise, a model outputs vectors of its default dimension size. This fixed size allows them to be stored in vector databases and fed into other models that expect a certain input size/format.
  - Keeping dimensions consistent means that a given dimension always contributes to capturing the same feature.
- For example, `text-embedding-3-small`, the embedding model used in the code example above, has a default dimension size of 1536, which makes its vectors relatively cheap to store and compare. However, its embeddings probably capture meaning less precisely than those from the OpenAI embedding model `text-embedding-3-large`, which has a default dimension of 3072.
- Keep in mind that dimension count isn’t everything; embedding quality matters too. Reducing dimensionality (e.g. with PCA or other techniques) is often possible without hurting performance much, if the **base** embedding model is strong.

Now here's a code sample with the same OpenAI library, showing you how to reduce dimensions on an already generated output. In this case, we already have a 1536-dimension embedding, so we shorten it manually by keeping only the first 256 values. After truncating, we must re-normalize the vector so that it has unit length (an L2 norm of 1) again.

In [None]:
import numpy as np

def normalize_l2(x): # normalization method
    x = np.array(x)
    if x.ndim == 1:
        norm = np.linalg.norm(x)
        if norm == 0:
            return x
        return x / norm
    else:
        norm = np.linalg.norm(x, 2, axis=1, keepdims=True)
        return np.where(norm == 0, x, x / norm)


response = client.embeddings.create(
    model="text-embedding-3-small", input="Testing 123", encoding_format="float"
)

cut_dim = response.data[0].embedding[:256]
norm_dim = normalize_l2(cut_dim)

print(norm_dim)

The other method is to request the smaller size directly from the API by passing in the `dimensions` parameter.

In [2]:
# Here's an example of us getting a 1024-dimension embedding straight as output from the model.
def embed(string):
    response = client.embeddings.create(
        input=string,
        model="text-embedding-3-small",
        dimensions=1024
    )
    return response
inp_string = "Your text string goes here"
print(embed(inp_string).data[0].embedding)

[0.005935892462730408, 0.01994006708264351, -0.021623319014906883, -0.021461468189954758, -0.054640963673591614, -0.035024598240852356, 0.031997982412576675, 0.004179806914180517, 0.012972373515367508, 0.007441108580678701, -0.0019503069343045354, 0.01828918419778347, -0.0014859962975606322, -0.009087944403290749, 0.06927230954170227, 0.05823405832052231, -0.03185231611132622, 0.011467156931757927, -0.04671025276184082, 0.05781324580311775, -0.0004463552322704345, 0.034959856420755386, -0.015877602621912956, 0.03813214227557182, 0.020020993426442146, 0.019438328221440315, -0.002042359672486782, 0.023581719025969505, 0.047195807099342346, -0.04360271245241165, -0.030266173183918, -0.05784561485052109, 0.02796788699924946, -0.06383410841226578, -0.03729051351547241, 0.04894379898905754, 0.07477524876594543, 0.016994375735521317, -0.018062593415379524, -0.04781084135174751, 0.0256534144282341, 0.008513372391462326, 0.05198660492897034, 0.008181577548384666, -0.02783840522170067, 0.0605647

### Distance in Embedding Space

- **Distance** is a measure of how similar or dissimilar two vectors (embeddings) are in the vector space.
- In the context of embeddings, **smaller distances mean higher similarity**, while **larger distances mean lower similarity**.
- Calculating distance between embeddings is fundamental for tasks like search, clustering, recommendations, and anomaly detection.

#### Why is Distance Important?
- It quantifies the "relatedness" between two pieces of data (e.g., texts, images).
- Enables ranking, grouping, and matching based on semantic similarity.


### Common Distance Metrics

#### 1. **Euclidean Distance**
- The "straight line" distance between two points in space.
- Formula (for vectors **a** and **b**):  
    $$ d_{euclidean}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$
- **Use case:** Good for embeddings where magnitude matters and vectors are not normalized.

#### 2. **Cosine Similarity / Cosine Distance**
- Measures the cosine of the angle between two vectors.
- **Cosine similarity** ranges from -1 (opposite) to 1 (identical).
- **Cosine distance** is usually defined as `1 - cosine similarity`.
- Formula:  
    $$ \text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \|b\|} $$
- **Use case:** Most common for text embeddings, as it focuses on direction (semantic meaning) rather than magnitude.

#### 3. **Manhattan (L1) Distance**
- Sum of the absolute differences of their coordinates.
- Formula:  
    $$ d_{manhattan}(a, b) = \sum_{i=1}^{n} |a_i - b_i| $$
- **Use case:** Useful when differences along each dimension are equally important.

#### 4. **Dot Product**
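- Multiplies the two vectors element-wise and sums the result, so it reflects both the angle between them and their magnitudes. For unit-length vectors it equals cosine similarity.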
- Formula:  
    $$ a \cdot b = \sum_{i=1}^{n} a_i b_i $$
- **Use case:** Sometimes used directly for similarity, especially in neural networks.
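
The formulas above translate almost directly into NumPy. This is a minimal sketch rather than a tuned implementation; the function names and the two small test vectors are made up for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    # Square root of the summed squared differences.
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def manhattan_distance(a, b):
    # Sum of the absolute differences along each dimension.
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def dot_product(a, b):
    # Sum of the element-wise products.
    return float(np.dot(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Small hand-picked vectors so the results are easy to check by hand.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

print("euclidean:", euclidean_distance(a, b))
print("manhattan:", manhattan_distance(a, b))
print("dot:      ", dot_product(a, b))
print("cosine distance:", cosine_distance(a, b))
```

For these vectors the differences are (3, 4, 0), so the Euclidean distance is 5, the Manhattan distance is 7, and the dot product is 4 + 12 + 9 = 25.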


### Which Distance Metric Should You Use?
- **Cosine similarity** is preferred for most NLP/text embedding tasks.
- **Euclidean distance** is useful when embeddings are not normalized.
- **Manhattan distance** can be robust to outliers in some cases.
- The choice depends on your data, embedding model, and application.


### Example: Calculating Cosine Similarity in Python

```python
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))
```

- Replace `a` and `b` with your embedding vectors to compute their similarity.  
- For distance, use `1 - cosine_similarity(a, b)`.

In [None]:
# Here's an example of running the cosine similarity function on 2 vector embeddings (that we're going to generate here directly).
from sklearn.metrics.pairwise import cosine_similarity

# Say you use OpenAI to embed:
query_vec = embed("What is a transformer?").data[0].embedding
doc_vec = embed("A transformer is a deep learning model...").data[0].embedding

cos_sim = cosine_similarity([query_vec], [doc_vec])
# Don't be worried about warnings about divide by zero incidents. They are common in this scenario


In [12]:
# These are the embeddings we did the cos sim operation on
print(query_vec)
print(doc_vec)

# Here I'll display results of the operation to you
print("Cosine similarity between query and document:", cos_sim)
print("Pairwise cosine similarity between the first vector in X and the first vector in Y:", cos_sim[0][0])

[-0.008213493973016739, -0.01397327147424221, -0.10429587960243225, -0.018803220242261887, -0.00051253626588732, -0.0074128080159425735, -0.010718868114054203, 0.012584984302520752, 0.010086067952215672, 0.033241406083106995, -0.0021663736552000046, 0.007871265523135662, -0.024330539628863335, -0.025880256667733192, 0.026939228177070618, 0.03218243271112442, 0.014786873012781143, -0.022535452619194984, 0.01693064719438553, 0.03378380835056305, -0.005708120297640562, -0.0035417466424405575, 0.0359017513692379, 0.021450651809573174, 0.0187773909419775, -0.03114929050207138, -0.00983423925936222, 0.015432587824761868, 0.01032498199492693, -0.036831580102443695, 0.05563480034470558, -0.028566429391503334, -0.048506107181310654, 0.003887203987687826, 0.028747230768203735, -9.685724216978997e-05, 0.026965057477355003, 0.018674077466130257, 0.0028976458124816418, 0.0679292157292366, -0.01455441489815712, -0.02724917232990265, -0.01628493145108223, 0.04166153073310852, -0.0076710935682058334, 

The cosine similarity is 0.571, so the cosine distance is 1 - 0.571 = 0.429, which means there is **some** alignment/similarity between the strings we embedded.

## Semantic Search
- This is the next thing we're going to learn.
- Before semantic search, the most popular way to search was keyword search.
- Before I explain keyword search, there are 2 terms we must understand when it comes to search.
  - 1. **Query**: this is the question you are asking. It may be something like "Where is the Eiffel Tower?"
  - 2. **Document(s)** (aka responses): the sentences you search through, which could potentially contain information that answers your question.
- Keyword search is essentially where the program looks for the document (response) with the largest number of words in common with the query.
### Example: Limitations of Keyword Search

Let’s say the query is:

> **"Where is the Eiffel Tower?"**

With **keyword search**, the system ranks responses based on the number of overlapping words with the query.

#### Responses:

1. *The Eiffel Tower is located in Paris.*  
   → ✅ **4 words in common** (`The`, `Eiffel`, `Tower`, `is`)  
2. *Where is my favorite mug?*  
   → **2 words in common** (`Where`, `is`)  
3. *The pyramids are in Egypt.*  
   → **1 word in common** (`The`)  
4. *Birds fly in the sky.*  
   → **1 word in common** (`the`)

In this case, the correct response (*"The Eiffel Tower is located in Paris."*) ranks the highest, but only **luckily**.

But now imagine we add a new response:

> **"Where in the world is my Eiffel-shaped souvenir?"**  
→ **4 words in common** (`Where`, `the`, `is`, and `Eiffel` if the hyphenated word is split)

This response ties with the correct one for first place, despite being **irrelevant**.
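
The word-overlap ranking described above can be sketched as follows. The helper names are invented for this example, and the tokenization rules (lowercasing, splitting on anything that is not a letter) are an assumption; a different tokenizer would give different counts.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of letters, so "Eiffel-shaped" yields
    # "eiffel" and "shaped". This is one simple choice among many.
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query, response):
    # Number of distinct words the query and the response share.
    return len(tokenize(query) & tokenize(response))

query = "Where is the Eiffel Tower?"
responses = [
    "The Eiffel Tower is located in Paris.",
    "Where is my favorite mug?",
    "The pyramids are in Egypt.",
    "Birds fly in the sky.",
    "Where in the world is my Eiffel-shaped souvenir?",
]

# Rank responses by how many words they share with the query.
ranked = sorted(responses, key=lambda r: overlap_score(query, r), reverse=True)
for response in ranked:
    print(overlap_score(query, response), response)
```

Running this makes the weakness visible: the souvenir sentence scores as well as the correct answer, even though it says nothing about where the Eiffel Tower is.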

### Takeaway

**Keyword search** rewards surface-level word overlap, not **meaning**.

We can improve keyword search by removing stop words such as “the”, “and”, and “is”, or by using methods like TF-IDF to tell relevant words apart from non-relevant ones. However, because of ambiguity, synonyms, and other roadblocks in language, there will always be cases in which keyword search fails to find the best response.

This is where **semantic search** comes in: it ranks results by similarity of meaning, not just shared words.

### How does Semantic Search work?
- It uses a text embedding model to turn the query and each response into vectors (lists of numbers).
- It uses a similarity measure to find the response vector that is most similar to the query vector.
- It outputs the response corresponding to this most similar vector.

### Visualizing Semantic Search as a Matrix

Think of semantic search as working with a **2D matrix** where:
- Each row represents a document/response
- Each column represents a dimension in the embedding space
- Your query becomes a vector (row) in this same space

The goal is to find the document vector that's **closest** to your query vector.

```
Query: [0.2, 0.8, 0.1, 0.9, ...]

Doc 1: [0.3, 0.7, 0.2, 0.8, ...] ← Closest match!
Doc 2: [0.1, 0.2, 0.9, 0.1, ...]
Doc 3: [0.8, 0.1, 0.3, 0.2, ...]
```

The system calculates the distance (for example, cosine distance) between your query and each document, then returns the document with the smallest distance, which is the same as the highest similarity.
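
Here is a minimal sketch of that lookup using the matrix above. It assumes each vector has only the four values shown (the `...` in the illustration is ignored), and it uses cosine similarity as the measure; the variable names are made up for this example.

```python
import numpy as np

# Only the four values shown in the illustration; real embeddings are much longer.
query = np.array([0.2, 0.8, 0.1, 0.9])
docs = np.array([
    [0.3, 0.7, 0.2, 0.8],  # Doc 1
    [0.1, 0.2, 0.9, 0.1],  # Doc 2
    [0.8, 0.1, 0.3, 0.2],  # Doc 3
])

# Scale every vector to unit length so a dot product equals cosine similarity.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# One matrix-vector product scores every document against the query.
similarities = docs_unit @ query_unit

# Highest cosine similarity is the same as smallest cosine distance.
best = int(np.argmax(similarities))
print("Similarities:", np.round(similarities, 3))
print(f"Best match: Doc {best + 1}")
```

Doc 1 wins here, as the illustration suggests. With many documents, a vector database does this same comparison using an index instead of scoring every row.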

In a 2D plane, Euclidean distance is the most intuitive way to measure the distance between 2 embeddings. Here's an example where 8 vector embeddings have been shrunk to length 2 (originally length 1024) via dimensionality reduction algorithms. As they are now vectors like [0.23, 0.44], we can plot each embedding as a point in the plane, using its 2 values as coordinates. Here is a visual example of this, courtesy of Cohere:

![Vector Embeddings Visualization](https://cohere.com/_next/image?url=https%3A%2F%2Fcohere-ai.ghost.io%2Fcontent%2Fimages%2F2024%2F10%2Fd0c031b-image.png&w=3840&q=75)

As mentioned earlier, if 2 points are close, meaning their embedding values are nearly equal, then the underlying items are likely very similar. So points near each other on this 2D plane are likely highly related, and querying one would likely return the other as a response.
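
To show how vectors can be shrunk to 2 values for plotting, here is a minimal PCA sketch built on NumPy's SVD. It is an assumption that PCA is the method used; the screenshot above may have been produced with a different algorithm. The 8 vectors of length 1024 are random stand-ins, not real embeddings, so the resulting layout carries no meaning.

```python
import numpy as np

# Random stand-ins for 8 real embeddings of length 1024.
rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(8, 1024))

# PCA step 1: center the data so each dimension has mean 0.
centered = embeddings - embeddings.mean(axis=0)

# PCA step 2: the rows of vt are the principal directions,
# ordered from most to least variance explained.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# PCA step 3: project onto the top 2 directions to get 2D coordinates.
points_2d = centered @ vt[:2].T

print(points_2d.shape)
```

Each row of `points_2d` is an (x, y) pair that can be passed to a plotting library. With real embeddings in place of the random data, related items would tend to land near each other.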
