# Creating Vector DB for RAG Pipeline

- **Authors:** Riyaadh Gani and Damilola Ogunleye
- **Project:** Food Recognition & Recipe LLM  
- **Purpose:** Build a vector database of recipe data and use it for retrieval-augmented generation (RAG) with the model

---

## Overview

This notebook converts the RecipeNLG dataset into a vector database using sentence embeddings. The LLM can query this vector database to retrieve relevant recipes as context, which should improve the quality of its answers.
FAISS is used to build the vector index.

**Output:** A FAISS index (`recipe_index_xsmol.faiss`) and a retrieval function for recipe lookup, based on RecipeNLG data

In [1]:
%pip install faiss-cpu sentence-transformers numpy pandas tqdm

Note: you may need to restart the kernel to use updated packages.


In [2]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import faiss
import numpy as np

# Load your data
df = pd.read_csv('../datasets/Cleaned/clean_recipes.csv')

# find length of dataset and print it
print(f"Number of rows in dataset: {len(df)}")



Number of rows in dataset: 5155414


In [3]:
# Limit to the first 10,000 rows for faster processing
df = df.head(10000)

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [4]:
# Check your dataset size first
print(f"Dataset size: {len(df)}")
print(f"Memory estimate: {len(df) * 384 * 4 / 1e9:.2f} GB")

# Process in batches if too large
def create_embeddings_batched(texts, batch_size=1000):
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
        
        batch_embeddings = model.encode(
            batch,
            show_progress_bar=True,
            batch_size=32,
            convert_to_numpy=True
        )
        all_embeddings.append(batch_embeddings)
        
    return np.vstack(all_embeddings)

# Use batched approach
embeddings = create_embeddings_batched(df['prompt'].tolist())

Dataset size: 10000
Memory estimate: 0.02 GB
Processing batch 1/10


Batches: 100%|██████████| 32/32 [00:01<00:00, 16.07it/s]


Processing batch 2/10


Batches: 100%|██████████| 32/32 [00:01<00:00, 27.97it/s]


Processing batch 3/10


Batches: 100%|██████████| 32/32 [00:01<00:00, 29.10it/s]


Processing batch 4/10


Batches: 100%|██████████| 32/32 [00:01<00:00, 28.10it/s]


Processing batch 5/10


Batches: 100%|██████████| 32/32 [00:01<00:00, 29.97it/s]


Processing batch 6/10


Batches: 100%|██████████| 32/32 [00:01<00:00, 31.54it/s]


Processing batch 7/10


Batches: 100%|██████████| 32/32 [00:00<00:00, 34.92it/s]


Processing batch 8/10


Batches: 100%|██████████| 32/32 [00:00<00:00, 34.49it/s]


Processing batch 9/10


Batches: 100%|██████████| 32/32 [00:00<00:00, 37.30it/s]


Processing batch 10/10


Batches: 100%|██████████| 32/32 [00:00<00:00, 35.15it/s]
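
The memory estimate printed at the start of this cell assumes 384-dimensional float32 embeddings (4 bytes per value), so it grows linearly with the number of rows. As a rough sketch, the same arithmetic can be applied to the full 5,155,414-row dataset to show why only a subset is embedded here. The helper name `embedding_memory_gb` is hypothetical, and the figure covers only the raw embedding matrix, not the model or the DataFrame.

```python
def embedding_memory_gb(n_rows: int, dim: int = 384, bytes_per_value: int = 4) -> float:
    """Size in GB (1e9 bytes) of an n_rows x dim matrix of fixed-width values."""
    return n_rows * dim * bytes_per_value / 1e9

# Subset used in this notebook vs. the full dataset loaded in cell 2
subset_gb = embedding_memory_gb(10_000)
full_gb = embedding_memory_gb(5_155_414)

print(f"10,000 rows:    {subset_gb:.2f} GB")
print(f"5,155,414 rows: {full_gb:.2f} GB")
```

Under these assumptions the full dataset would need roughly 7.9 GB for the embeddings alone, which is why the batched function above is worth keeping even if the subset fits comfortably in memory.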


In [5]:
# Check embeddings first
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings dtype: {embeddings.dtype}")
print(f"Memory usage: {embeddings.nbytes / 1e9:.2f} GB")
print(f"Contains NaN: {np.isnan(embeddings).any()}")
print(f"Contains Inf: {np.isinf(embeddings).any()}")


Embeddings shape: (10000, 384)
Embeddings dtype: float32
Memory usage: 0.02 GB
Contains NaN: False
Contains Inf: False


In [6]:
from sklearn.preprocessing import normalize

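# L2-normalize each embedding so that inner product equals cosine similarity.
# FAISS expects float32 input, so cast before normalizing.
print("Normalizing with sklearn...")
embeddings = embeddings.astype('float32')
embeddings = normalize(embeddings, norm='l2', axis=1)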

Normalizing with sklearn...
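
Normalization matters because the FAISS index used in this notebook (`IndexFlatIP`) ranks results by inner product. For unit-length vectors the inner product equals cosine similarity, which the sketch below checks on two toy vectors. It is written in plain Python so it runs without extra dependencies; the helper names `l2_normalize`, `dot`, and `cosine` are illustrative and not part of the notebook.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for two embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

inner_of_normalized = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(inner_of_normalized, cosine(a, b)))  # the two quantities agree
```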


In [7]:
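# Build an exact inner-product index; on L2-normalized vectors this ranks by cosine similarity
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)

# Save the index to disk for later reuse
faiss.write_index(index, 'recipe_index_xsmol.faiss')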

In [8]:
def retrieve_recipes(query, k=3):
    """Retrieve top-k most similar recipes"""
    
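    # Embed the query as float32, then L2-normalize it in place
    # (faiss.normalize_L2 requires a float32 array)
    query_embedding = model.encode([query], convert_to_numpy=True).astype('float32')
    faiss.normalize_L2(query_embedding)
    
    # Search the index for the k nearest recipes
    scores, indices = index.search(query_embedding, k)
    
    # Map row positions back to the DataFrame
    results = []
    for idx, score in zip(indices[0], scores[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than k results exist
            continue
        results.append({
            'prompt': df.iloc[idx]['prompt'],
            'response': df.iloc[idx]['response'],
            'similarity': float(score)
        })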
    
    return results

# Test it
query = "I have chicken, rice, and vegetables. What can I make?"
results = retrieve_recipes(query, k=3)

for i, result in enumerate(results, 1):
    print(f"\n--- Recipe {i} (similarity: {result['similarity']:.3f}) ---")
    print(f"Prompt: {result['prompt'][:100]}...")
    print(f"Response: {result['response'][:150]}...")


--- Recipe 1 (similarity: 0.775) ---
Prompt: i have these ingredients: 2 cup uncooked rice, 1 bell pepper, 1 small onion, 2 celery stalks, 1 lb g...
Response: you could make lucy special. here are the instructions: 1. preheat oven to 400 2. break up and brown sausage in large skillet 3. chop onion, celery, a...

--- Recipe 2 (similarity: 0.764) ---
Prompt: how do i make chicken and rice?...
Response: to make chicken and rice, you'll need: chicken wings, uncooked, 1 can cream of chicken soup, 1 can cream of mushroom soup, 1 c. uncooked rice, 1 pkg. ...

--- Recipe 3 (similarity: 0.754) ---
Prompt: i have these ingredients: 1 tbsp. oil, 1 lb. boneless skinless chicken breasts, cut into bite-size p...
Response: you could make speedy chicken veggie stir-fry skillet. here are the instructions: 1. heat oil in large skillet on medium-high heat. 2. add chicken coo...
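
To use these results in the RAG step, the retrieved recipes need to be folded into the prompt sent to the LLM. The sketch below shows one possible way to do that. `build_rag_prompt`, the `min_similarity` cutoff, the `max_chars` truncation, and the prompt wording are all assumptions for illustration and are not taken from the project. It expects a list of dicts in the shape returned by `retrieve_recipes`.

```python
def build_rag_prompt(query, results, min_similarity=0.5, max_chars=500):
    """Combine retrieved recipes and the user query into a single prompt string."""
    context_blocks = []
    for i, r in enumerate(results, 1):
        # Skip weak matches so they do not dilute the context (cutoff is an assumption)
        if r["similarity"] < min_similarity:
            continue
        context_blocks.append(
            f"[Recipe {i}]\nQ: {r['prompt'][:max_chars]}\nA: {r['response'][:max_chars]}"
        )
    context = "\n\n".join(context_blocks) if context_blocks else "(no relevant recipes found)"
    return (
        "Use the recipes below as context when answering.\n\n"
        f"{context}\n\n"
        f"User question: {query}"
    )

# Toy results in the same shape as retrieve_recipes() output (values are made up)
example_results = [
    {"prompt": "how do i make chicken and rice?",
     "response": "to make chicken and rice, you'll need: ...",
     "similarity": 0.76},
    {"prompt": "how do i make fudge?",
     "response": "to make fudge, ...",
     "similarity": 0.31},
]
prompt = build_rag_prompt("I have chicken and rice. What can I make?", example_results)
print(prompt)
```

In the notebook itself, `example_results` would be replaced by the output of `retrieve_recipes(query, k=3)`; a suitable similarity cutoff would need to be tuned on real queries.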
