# Creating Vector DB for RAG Pipeline

- **Authors:** Riyaadh Gani and Damilola Ogunleye
- **Project:** Food Recognition & Recipe LLM  
- **Purpose:** Creating VectorDB of recipe data and combining with RAG for the model

---

## Overview

This notebook converts the RecipeNLG dataset into a vector database using sentence embeddings. The LLM can query this vector DB to retrieve similar recipes as context for its answers, which should improve the results.
FAISS is used to build the vector DB.

**Output:** A FAISS index file (`recipe_index_<N>.faiss`, where `<N>` is the number of recipes) for recipe retrieval, built from RecipeNLG data

**Current Indexes:**
- Small: top 50,000 recipes in the database
- xsmol: 10,000 recipes
- recipe_index_10000.faiss
- recipe_index_1000.faiss

In [1]:
%pip install faiss-cpu sentence-transformers numpy pandas tqdm

Note: you may need to restart the kernel to use updated packages.


In [2]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import faiss
import numpy as np

# Load the cleaned recipe data (later cells read the 'prompt' and 'response' columns)
df = pd.read_csv('../datasets/Cleaned/clean_recipes_1000.csv')

# Print the number of rows in the dataset
print(f"Number of rows in dataset: {len(df)}")



Number of rows in dataset: 1000


In [3]:
# Optionally limit to the first N rows for faster processing (e.g. 10,000 or 50,000)
# df = df.head(10000)

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [4]:
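# Check the dataset size and estimate the memory needed for the embeddings
# (384 = embedding dimension of all-MiniLM-L6-v2, 4 = bytes per float32 value)
print(f"Dataset size: {len(df)}")
print(f"Memory estimate: {len(df) * 384 * 4 / 1e9:.2f} GB")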

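# Encode in chunks of `batch_size` texts so progress is reported per chunk
def create_embeddings_batched(texts, batch_size=1000):
    """Encode texts with the global `model` and return a (len(texts), dim) array."""
    all_embeddings = []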
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
        
        batch_embeddings = model.encode(
            batch,
            show_progress_bar=True,
            batch_size=32,
            convert_to_numpy=True
        )
        all_embeddings.append(batch_embeddings)
        
    return np.vstack(all_embeddings)

# Use batched approach
embeddings = create_embeddings_batched(df['prompt'].tolist())

Dataset size: 1000
Memory estimate: 0.00 GB
Processing batch 1/1


Batches: 100%|██████████| 32/32 [00:01<00:00, 25.19it/s]
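Because the estimate above is printed in GB, it rounds to 0.00 for the 1,000-recipe set. As a rough sketch, the same arithmetic (rows × 384 dimensions × 4 bytes per float32 value) can be reported in MB for the dataset sizes mentioned in this notebook. The helper name `embedding_memory_mb` is hypothetical and not part of the original code.

```python
def embedding_memory_mb(n_rows, dim=384, bytes_per_value=4):
    """Estimate the size of an (n_rows, dim) float32 embedding matrix in megabytes."""
    return n_rows * dim * bytes_per_value / 1e6


# Dataset sizes mentioned in this notebook
for n_rows in (1_000, 10_000, 50_000):
    print(f"{n_rows:>6} recipes: {embedding_memory_mb(n_rows):.2f} MB")
```

This covers only the raw embedding matrix; the model weights and the DataFrame itself are not included in the estimate.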


In [5]:
# Check embeddings first
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings dtype: {embeddings.dtype}")
print(f"Memory usage: {embeddings.nbytes / 1e9:.2f} GB")
print(f"Contains NaN: {np.isnan(embeddings).any()}")
print(f"Contains Inf: {np.isinf(embeddings).any()}")


Embeddings shape: (1000, 384)
Embeddings dtype: float32
Memory usage: 0.00 GB
Contains NaN: False
Contains Inf: False


In [6]:
from sklearn.preprocessing import normalize

# L2-normalize each row so that the inner product of two rows equals their cosine similarity
print("Normalizing with sklearn...")
embeddings = embeddings.astype('float32')
embeddings = normalize(embeddings, norm='l2', axis=1)

Normalizing with sklearn...
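The reason for normalizing is that, for unit-length vectors, the inner product equals the cosine similarity. The following is a minimal NumPy check of that identity on random vectors; the array names and the random data are illustrative assumptions, not the recipe embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 384)).astype("float32")  # stand-in for recipe embeddings
query = rng.normal(size=384).astype("float32")         # stand-in for a query embedding

# Cosine similarity computed directly from the raw vectors
cosine = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# L2-normalize first, then take a plain inner product
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
inner_product = unit_vectors @ unit_query

# The two results agree up to floating-point error
print(np.allclose(cosine, inner_product, atol=1e-5))
```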


In [7]:
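# Build an exact (brute-force) inner-product index; with unit vectors this ranks by cosine similarity
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)

# Save the index to disk, named after the number of recipes it contains
output_file = f'recipe_index_{len(df)}.faiss'
faiss.write_index(index, output_file)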
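`IndexFlatIP` performs an exhaustive search: it scores the query against every stored vector by inner product and returns the k highest scores. The NumPy sketch below reproduces that ranking on a toy array to make the behaviour explicit. It is an illustration only, not how FAISS is implemented, and `brute_force_top_k` is a hypothetical helper name.

```python
import numpy as np


def brute_force_top_k(embeddings, query, k=3):
    """Return (scores, indices) of the k rows with the largest inner product."""
    scores = embeddings @ query
    k = min(k, len(scores))          # cannot return more rows than exist
    top = np.argsort(-scores)[:k]    # row indices sorted by descending score
    return scores[top], top


# Toy unit-length vectors standing in for recipe embeddings
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
toy_query = np.array([0.0, 1.0], dtype="float32")

scores, indices = brute_force_top_k(toy_embeddings, toy_query, k=2)
print(indices, scores)
```

Row 1 matches the query exactly and row 2 is the next closest, so those two rows are returned in that order.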

In [8]:
def retrieve_recipes(query, k=3):
    """Retrieve top-k most similar recipes"""
    
    # Embed the query as float32 (required by FAISS) and L2-normalize it in place
    query_embedding = model.encode([query], convert_to_numpy=True).astype('float32')
    faiss.normalize_L2(query_embedding)
    
    # Search
    scores, indices = index.search(query_embedding, k)
    
    # Get results
    results = []
    for idx, score in zip(indices[0], scores[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than k results exist
            continue
        results.append({
            'prompt': df.iloc[idx]['prompt'],
            'response': df.iloc[idx]['response'],
            'similarity': float(score)
        })
    
    return results

# Test it
query = "I have chicken, rice, and vegetables. What can I make?"
results = retrieve_recipes(query, k=3)

for i, result in enumerate(results, 1):
    print(f"\n--- Recipe {i} (similarity: {result['similarity']:.3f}) ---")
    print(f"Prompt: {result['prompt'][:100]}...")
    print(f"Response: {result['response'][:150]}...")


--- Recipe 1 (similarity: 0.720) ---
Prompt: i have these ingredients: 1 12 cups kluski egg noodles, uncooked, 1 cup shredded cabbage, 34 cup cub...
Response: you could make vegetable soup wtih kluski noodles. here are the instructions: 1. in 5-quart saucepan stir together all the ingredients heat to boiling...

--- Recipe 2 (similarity: 0.719) ---
Prompt: i have these ingredients: 1 (15 ounce) can mandarin oranges, drained, 1 (16 ounce) bag mixed salad g...
Response: you could make asian inspired salad. here are the instructions: 1. in sauce pan heat water to boiling and add in frozen soy beans, cook as directed on...

--- Recipe 3 (similarity: 0.695) ---
Prompt: i have these ingredients: 3 cups apples, diced, 3 cups onions, diced, 3 cups celery, diced, 1 tables...
Response: you could make apple rice stuffing. here are the instructions: 1. saute apples, onions and celery in butter. 2. add prepared stuffing, rice and broth....
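
To use these results in the RAG pipeline, the retrieved recipes need to be combined with the user's question before being passed to the LLM. The following is a minimal sketch of that step. The function name `build_rag_prompt`, the `max_chars` limit, and the prompt wording are all assumptions for illustration; the project's actual prompt template is not shown in this notebook.

```python
def build_rag_prompt(query, results, max_chars=500):
    """Combine retrieved recipes and the user question into one prompt string."""
    context_blocks = []
    for i, result in enumerate(results, 1):
        context_blocks.append(
            f"Recipe {i} (similarity {result['similarity']:.3f}):\n"
            f"{result['response'][:max_chars]}"  # truncate long recipes
        )
    context = "\n\n".join(context_blocks)
    return (
        "Use the following retrieved recipes as context.\n\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Toy input in the same shape that retrieve_recipes returns
example_results = [
    {
        "prompt": "i have these ingredients: rice, apples",
        "response": "you could make apple rice stuffing.",
        "similarity": 0.695,
    },
]
print(build_rag_prompt("I have chicken, rice, and vegetables. What can I make?", example_results))
```

In the notebook itself, `example_results` would be replaced by the output of `retrieve_recipes(query, k=3)`.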
