## From Recipes to Embeddings: Cooking with Vectors

Once the recipes are cleaned and chunked, we reach one of the most exciting stages in our journey — **embeddings**. If chunking is about breaking down messy text into digestible bites, embeddings are about giving those bites a *flavor profile* that machines can actually understand.  

Think about our recipe PDF. After the preprocessing work — dealing with two-column layouts, filtering out non-recipe pages, and parsing tricky measurement tables — we’re left with logical text chunks: recipe titles, ingredient lists, step-by-step instructions, and maybe even some nutritional tables. Each of these chunks is still just plain text. To us humans, words like *“sugar”*, *“milk”*, or *“bake at 350°F”* carry meaning instantly. But for a computer, these are just strings of characters unless we find a way to represent them in a mathematical space.  

That’s where embeddings come in.  

---

### What are Embeddings?

At their core, embeddings are **numerical representations of text**. Instead of storing “1 cup sugar” as raw text, we transform it into a high-dimensional vector — a long list of numbers that capture its meaning. These vectors are designed so that similar pieces of text end up close together in this mathematical space.  

For example:  

- The chunk “Preheat oven to 350°F” should land close to “Heat oven to 180°C” because both describe the same cooking action.  
- “1 cup sugar” and “200 grams sugar” should cluster together, because they refer to the same ingredient (even if units differ).  
- “Cake” should end up closer to “muffin” than to “tomato soup.”  

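In other words, embeddings turn words, sentences, or whole documents into **points in space**, where *closeness signals similarity*: the smaller the distance between two points (or the higher their cosine similarity), the more alike their meanings.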
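
To make "closeness" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made up for illustration and do not come from any model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "cake": [0.9, 0.8, 0.1],
    "muffin": [0.8, 0.9, 0.2],
    "tomato soup": [0.1, 0.2, 0.9],
}

cake_vs_muffin = cosine_similarity(toy_vectors["cake"], toy_vectors["muffin"])
cake_vs_soup = cosine_similarity(toy_vectors["cake"], toy_vectors["tomato soup"])
print(f"cake vs muffin:      {cake_vs_muffin:.3f}")
print(f"cake vs tomato soup: {cake_vs_soup:.3f}")
```

With these toy values, "cake" scores much higher against "muffin" than against "tomato soup", which is the behavior we hope a real embedding model shows on real text.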

---

### Why Do We Need Embeddings for Recipes?

Recipes are naturally **semi-structured**: they have titles, lists of ingredients, instructions, and sometimes tables. Without embeddings, searching for a recipe by keyword is brittle — “flourless cake” might not appear in a search for “chocolate dessert.” But with embeddings, we can search by *meaning*.  

This opens up some interesting possibilities:  

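- **Semantic search**: Ask, “Show me recipes with no dairy,” and retrieve relevant chunks, even if the text doesn’t literally use the word *“dairy.”* Negated queries like this one are harder for embedding models than plain topical ones, so it is worth checking the results against the ingredient lists.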
- **Ingredient substitution**: Query “egg replacement” and surface recipes that mention “flaxseed egg” or “applesauce.”  
- **Contextual retrieval**: If the query is “quick desserts under 30 minutes,” embeddings can help pull together the right mix of instructions, ingredient lists, and prep times.  

Embeddings, therefore, become the bridge between our messy human recipe book and a structured, intelligent search experience.  

---

### How Do We Generate Embeddings?

In practice, generating embeddings involves feeding each recipe chunk into a pre-trained embedding model. These models are trained on large amounts of text and are designed to capture linguistic and semantic relationships. For our use case:  

1. **Chunking**: Each recipe (or section of a recipe) is broken down into smaller, meaningful chunks.  
2. **Embedding**: Each chunk is passed through the embedding model, producing a vector representation.  
3. **Storing**: The vectors are saved in a vector index or database (e.g., FAISS, Pinecone, Weaviate), where they can be efficiently compared.

Now, instead of storing raw text only, we store both the text *and* its embedding.  
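
The three steps above can be sketched end to end in a few lines. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in that hashes words into buckets (it captures word overlap, not meaning), and `store` is a plain Python list rather than a real vector database. The sample chunks are invented.

```python
import math
import zlib

DIM = 256  # toy dimensionality; real models use e.g. 384 or 768

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model.

    Hashes each word into one of DIM buckets and L2-normalizes the counts.
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Step 1: chunks (invented examples).
chunks = [
    "Preheat oven to 350 degrees F and grease two cake pans",
    "Whisk together sugar, eggs, and oil until smooth",
    "Simmer tomatoes with garlic and basil for 20 minutes",
]

# Steps 2 and 3: embed each chunk and keep the text next to its vector.
store = [{"text": c, "embedding": toy_embed(c)} for c in chunks]

def search(query, k=1):
    """Return the k stored texts whose vectors score highest against the query."""
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, rec["embedding"])), rec["text"])
        for rec in store
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

print(search("grease the cake pans"))
```

Keeping the text alongside the vector matters because the vector alone cannot be turned back into the recipe: the search finds the nearest vectors, and the stored text is what gets shown to the user.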

---

### A Concrete Example

Suppose we chunked our PDF into the following two snippets:  

- *“Step 1: Preheat oven to 350°F (175°C). Grease and flour two 9-inch cake pans.”*  
- *“Step 2: In a bowl, whisk together sugar, eggs, and oil until smooth.”*  

To the embedding model, these are just text inputs. For each one it returns a vector, something like this (values are illustrative):

```python
[0.032, -0.145, 0.876, ... , 0.217]  # vector for Step 1
[0.118, -0.092, 0.934, ... , 0.451]  # vector for Step 2
```

---

### Local Embedding Models: Pros and Cons

| Model                                | Dim. | Pros                                                                 | Cons                                                                 |
|--------------------------------------|------|----------------------------------------------------------------------|----------------------------------------------------------------------|
| **SentenceTransformers (all-MiniLM-L6-v2)** | 384  | - Very lightweight, runs on CPU/GPU<br>- Fast inference<br>- Good general-purpose performance | - Lower dimensionality may limit nuance<br>- Less accurate on domain-specific data |
| **SentenceTransformers (all-mpnet-base-v2)** | 768  | - Strong semantic performance<br>- Widely benchmarked<br>- Still relatively efficient | - Slower than MiniLM<br>- Larger memory footprint |
| **intfloat/e5-small / e5-base / e5-large** | 384 / 768 / 1024 | - Optimized for retrieval (query/document embedding)<br>- High accuracy on search/retrieval tasks<br>- Open-source | - Larger variants need more GPU/CPU<br>- Slightly slower than MiniLM for same dimension<br>- Expects `query: ` / `passage: ` prefixes on inputs |
| **GTE (gte-small / gte-base / gte-large)** | 384 / 768 / 1024 | - Competitive accuracy on retrieval benchmarks<br>- A separate multilingual variant exists (gte-multilingual-base) | - The small/base/large checkpoints are English-only<br>- Larger models require more VRAM<br>- Still evolving community support |

---

**Quick takeaways**:  
- If you want **speed + lightweight**, go with **MiniLM** or **e5-small**.  
- For **balanced accuracy and efficiency**, **mpnet-base** or **e5-base** are solid choices.  
- If you need **multilingual** support, look at dedicated multilingual checkpoints such as **multilingual-e5** or **gte-multilingual-base**; the GTE small/base/large models listed above are English-only.
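
Dimensionality also affects how much space the stored vectors take. As a rough back-of-the-envelope sketch, assuming vectors are stored as 32-bit floats and ignoring index overhead and the model weights themselves, the raw size is chunks × dimensions × 4 bytes. The corpus size of 100,000 chunks is a hypothetical number picked for illustration.

```python
BYTES_PER_FLOAT32 = 4

def index_size_mb(num_chunks, dim):
    """Raw size of num_chunks float32 vectors, in megabytes (10**6 bytes)."""
    return num_chunks * dim * BYTES_PER_FLOAT32 / 1_000_000

# 100_000 chunks is a hypothetical corpus size chosen for illustration.
for name, dim in [("all-MiniLM-L6-v2", 384), ("e5-base", 768), ("e5-large", 1024)]:
    print(f"{name:18s} dim={dim:4d}  {index_size_mb(100_000, dim):7.1f} MB")
```

For a small recipe collection the difference is negligible, so model choice here can be driven by retrieval quality and inference speed rather than storage.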


```python
# Imports and setup
import sys
import os
import json

# Newer LangChain releases expose this class from langchain_community.embeddings.
from langchain.embeddings import HuggingFaceBgeEmbeddings

# Make the project root importable so the shared helpers can be found.
project_root = os.path.abspath(os.path.join("..", ".."))
sys.path.append(project_root)
from common.helper import read_pdf, extract_recipes_from_pdf

inp_dir = os.path.join(project_root, "data", "chunks")
MODEL_NAME = "intfloat/e5-base-v2"
```


```python
# Load the recipe chunks, embed one of them, and print the result.
import warnings

# 1. Suppress warnings for cleaner output.
warnings.filterwarnings("ignore")

# 2. Initialize the embedding model with L2 normalization enabled.
embeddings = HuggingFaceBgeEmbeddings(
    model_name=MODEL_NAME,
    encode_kwargs={"normalize_embeddings": True},
)

# 3. Load the recipe chunks from the JSON file and print how many there are.
with open(os.path.join(inp_dir, "recipe_chunks.json"), "r", encoding="utf-8") as f:
    chunks = json.load(f)
print(f"Loaded {len(chunks)} chunks")

# 4. Embed the first chunk as an example.
# Note: the E5 model card recommends a "passage: " prefix for documents and
# "query: " for queries; the raw chunk text is embedded here.
chunk_text = chunks[0]["content"]
chunk_embeddings = embeddings.embed_documents([chunk_text])

print(f"Chunk text: {chunk_text[:200]}...")
print(f"Embedding vector (first 10 values): {chunk_embeddings[0][:10]}  ...")
```

```
Loaded 23 chunks
Chunk text: Directions Step 1 Preheat oven to 350 degrees F (175 degrees C). Grease and fl our two 9 inch, round, cake pans; cover bottoms with waxed paper. Step 2 In a large bowl, combine fl our, 2 cups sugar, c...
Embedding vector (first 10 values): [-0.006904602516442537, 0.001459900289773941, -0.04020014405250549, 0.009997756220400333, 0.057940173894166946, -0.05789573863148689, 0.057525020092725754, 0.01736609824001789, -0.0022118366323411465, -0.017856286838650703]  ...
```
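
The `normalize_embeddings` option asks for unit-length vectors. One practical consequence, sketched below with made-up numbers rather than real model output, is that the plain dot product of two normalized vectors equals their cosine similarity, so comparisons downstream can use the cheaper dot product.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (what normalize_embeddings=True requests)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Made-up raw vectors standing in for un-normalized model output.
raw_query = [0.3, -1.2, 0.8, 2.0]
raw_chunk = [0.1, -0.9, 1.1, 1.7]

# Cosine similarity computed the long way on the raw vectors.
cosine = dot(raw_query, raw_chunk) / (
    math.sqrt(dot(raw_query, raw_query)) * math.sqrt(dot(raw_chunk, raw_chunk))
)
# Dot product after normalizing both vectors to unit length.
unit_dot = dot(l2_normalize(raw_query), l2_normalize(raw_chunk))

print(f"cosine similarity of raw vectors: {cosine:.6f}")
print(f"dot product of unit vectors:      {unit_dot:.6f}")
```

The two printed values agree up to floating-point rounding.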
