# Project Session 1.2: BakeryAI - Semantic Search & Product Discovery

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dhlxNlW3D_sddj0bqP7op0QAmIky_bFB?usp=sharing)

## 🎯 Today's Goal

Add **intelligent product search** to BakeryAI using embeddings!

### What We'll Build:

✅ Embeddings for all bakery products  
✅ Semantic search ("find something fruity and light")  
✅ Smart product recommendations  
✅ Multi-provider comparison for responses  

### Why This Matters:

Customers don't always know exact product names. With semantic search:
- "I want something chocolatey for kids" → Finds relevant products
- "Looking for a low-calorie dessert" → Searches by nutrition
- "Need a cake for a vegan friend" → Filters by dietary restrictions

### 🚀 BakeryAI Progress: 20% → 40%
```
[████████░░░░░░░░░░░░] 40%
```

In [1]:
!pip install -q langchain langchain-openai langchain-anthropic langchain-community
!pip install -q pandas numpy python-dotenv


In [2]:
import pandas as pd
import numpy as np
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

import os
from google.colab import userdata

# Set the OpenAI API key from Colab secrets, falling back to a default value
def set_openai_api_key(default_key: str = "YOUR_API_KEY") -> None:
    """Set OPENAI_API_KEY from the Colab secret MDX_OPENAI_API_KEY, or use default_key."""
    try:
        os.environ["OPENAI_API_KEY"] = userdata.get("MDX_OPENAI_API_KEY")
    except Exception:
        # Secret missing or notebook access not granted: use the fallback
        os.environ["OPENAI_API_KEY"] = default_key

set_openai_api_key()
#set_openai_api_key("sk-...")

# Verify that a real key (not the placeholder) is loaded
if os.getenv("OPENAI_API_KEY") not in (None, "", "YOUR_API_KEY"):
    print("✅ OpenAI API key loaded successfully!")
else:
    print("❌ OpenAI API key not found. Add MDX_OPENAI_API_KEY to Colab secrets or pass a key to set_openai_api_key()")

# Initialize models
llm = ChatOpenAI(model="gpt-5-nano")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

✅ OpenAI API key loaded successfully!


In [3]:
!git clone https://github.com/IvanReznikov/mdx-langchain-conclave

Cloning into 'mdx-langchain-conclave'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 19 (delta 1), reused 19 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (19/19), 239.34 KiB | 3.11 MiB/s, done.
Resolving deltas: 100% (1/1), done.


In [4]:
# Load bakery data
try:
    cakes_df = pd.read_csv('/content/mdx-langchain-conclave/data/cake_descriptions.csv', encoding='cp1252')
    print(f"✅ Loaded {len(cakes_df)} cake products")
except FileNotFoundError:
    print("⚠️  Using sample data")
    cakes_df = pd.DataFrame({
        'Name': ['Chocolate Truffle Cake', 'Vanilla Bean Cake', 'Red Velvet Cake',
                 'Lemon Drizzle Cake', 'Strawberry Shortcake'],
        'Category': ['Chocolate', 'Vanilla', 'Specialty', 'Fruit', 'Fruit'],
        'Description': [
            'Rich chocolate cake with truffle filling and dark chocolate ganache',
            'Classic vanilla cake with Madagascar vanilla bean specks',
            'Velvety red cake with cream cheese frosting',
            'Light lemon-flavored sponge with citrus glaze',
            'Fresh strawberries with light whipped cream'
        ],
        'Ingredients': [
            'flour, eggs, cocoa, chocolate, butter, sugar',
            'flour, eggs, vanilla beans, butter, sugar, milk',
            'flour, eggs, cocoa, buttermilk, cream cheese',
            'flour, eggs, lemon, butter, sugar',
            'flour, eggs, strawberries, cream, sugar'
        ],
        'Energy_kcal': [450, 380, 420, 320, 290],
        'Restrictions': ['none', 'none', 'none', 'none', 'none'],
        'Available': [True, True, True, True, True]
    })

cakes_df.head()

✅ Loaded 22 cake products


| Unnamed: 0 | Name | Category | Ingredients | Description | Energy_kcal | Weight_grams | Restrictions | Delivery_time_hr | Available |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Torta della Nonna Amore | G | Ricotta, pine nuts, lemon zest, vanilla, sugar... | A nostalgic Tuscan-inspired cake with creamy r... | 320 | 850 | Contains dairy, gluten, eggs | 24 | Yes |
| 1 | Festiva della Sicilia | G | Almonds, citrus zest, mascarpone, candied oran... | A zesty almond cake with mascarpone frosting a... | 360 | 900 | Contains nuts, dairy, eggs | 24 | Yes |
| 2 | Dubai Midnight Pistachio Fantasy | G | Pistachios, rose syrup, cardamom, white chocol... | Lush green cake with exotic floral pistachio n... | 420 | 1000 | Contains nuts, dairy | 36 | Yes |
| 3 | Ferrari Redline Fudge | G | Dark chocolate, red glaze, fudge, espresso syrup | Dense chocolate fudge cake with red mirror gla... | 450 | 950 | Contains dairy, caffeine | 24 | Yes |
| 4 | Goalpost Delight - Soccer Fan Cake | G | Vanilla sponge, fondant, jam, whipped cream | Themed cake with stadium design, perfect for f... | 390 | 1200 | May contain artificial colors | 24 | Yes |


## 1. Creating Product Embeddings

Transform each product into a vector that captures its semantic meaning.

In [5]:
def create_product_text(row):
    """Create rich text representation of product"""
    return f"""
    Product: {row['Name']}
    Category: {row['Category']}
    Description: {row['Description']}
    Ingredients: {row['Ingredients']}
    Calories: {row['Energy_kcal']} kcal
    Dietary Info: {row['Restrictions']}
    """.strip()

# Create product texts
cakes_df['product_text'] = cakes_df.apply(create_product_text, axis=1)

print("📝 Sample Product Text:\n")
print(cakes_df['product_text'].iloc[0])

📝 Sample Product Text:

Product: Torta della Nonna Amore
    Category: G
    Description: A nostalgic Tuscan-inspired cake with creamy ricotta and lemon warmth
    Ingredients: Ricotta, pine nuts, lemon zest, vanilla, sugar, flour
    Calories: 320 kcal
    Dietary Info: Contains dairy, gluten, eggs


In [6]:
# Generate embeddings for all products
print("🔄 Generating embeddings for all products...\n")

product_texts = cakes_df['product_text'].tolist()
product_embeddings = embeddings.embed_documents(product_texts)

print(f"✅ Generated {len(product_embeddings)} embeddings")
print(f"📊 Embedding dimension: {len(product_embeddings[0])}")
print(f"💾 Total vectors: {len(cakes_df)} products")

🔄 Generating embeddings for all products...

✅ Generated 22 embeddings
📊 Embedding dimension: 3072
💾 Total vectors: 22 products
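
Every run of the cell above calls the embedding API again. A minimal sketch of caching the vectors on disk with NumPy is shown here; the helper names `save_embeddings` and `load_embeddings`, the file name, and the stand-in vectors are assumptions for illustration, not part of the notebook.

```python
import os
import tempfile

import numpy as np


def save_embeddings(vectors, path):
    """Store a list of embedding vectors as a 2-D float32 array."""
    matrix = np.asarray(vectors, dtype=np.float32)
    np.save(path, matrix)
    return matrix.shape


def load_embeddings(path):
    """Load the cached matrix back as a list of lists."""
    return np.load(path).tolist()


# Stand-in vectors; in the notebook this would be `product_embeddings`.
demo_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
cache_path = os.path.join(tempfile.mkdtemp(), "product_embeddings.npy")

shape = save_embeddings(demo_vectors, cache_path)
restored = load_embeddings(cache_path)
print(shape, len(restored))
```

Storing as `float32` halves the file size compared with NumPy's default `float64`, at the cost of a small rounding difference that is usually negligible for similarity ranking.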


## 2. Semantic Product Search

Find products based on meaning, not just keywords!

In [7]:
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    """Calculate similarity between two vectors"""
    return dot(vec1, vec2) / (norm(vec1) * norm(vec2))

def semantic_product_search(query, top_k=3):
    """
    Find products semantically similar to the query

    Args:
        query: Customer's search query
        top_k: Number of results to return

    Returns:
        DataFrame with top matching products and scores
    """
    # Embed the query
    query_embedding = embeddings.embed_query(query)

    # Calculate similarities
    similarities = []
    for prod_embedding in product_embeddings:
        sim = cosine_similarity(query_embedding, prod_embedding)
        similarities.append(sim)

    # Add to dataframe
    results_df = cakes_df.copy()
    results_df['similarity'] = similarities

    # Sort and return top k
    top_results = results_df.nlargest(top_k, 'similarity')
    return top_results[['Name', 'Description', 'Energy_kcal', 'similarity']]

# Test semantic search
print("🔍 Semantic Search Test\n")
print("Query: 'Something rich and indulgent'\n")
print(semantic_product_search("Something rich and indulgent", top_k=3))

🔍 Semantic Search Test

Query: 'Something rich and indulgent'

                      Name                                        Description  \
21  Chocolate Truffle Cake              Deep, rich chocolate cake for purists   
3    Ferrari Redline Fudge  Dense chocolate fudge cake with red mirror gla...   
8       Velvet Dream Cloud  Fluffy, cloud-like cake with light, elegant cr...   

    Energy_kcal  similarity  
21          470    0.477568  
3           450    0.459690  
8           350    0.438766  
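
The search above loops over products in Python, which is fine for 22 cakes. For larger catalogues, the same cosine ranking can be computed in one matrix operation. The sketch below uses toy 2-D vectors instead of real embeddings, and `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    docs = np.asarray(doc_matrix, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalise once so cosine similarity reduces to a dot product.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = docs_unit @ query_unit
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy vectors standing in for product embeddings.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_cosine([1.0, 0.1], toy_docs, k=2)
print(indices, scores)
```

In the notebook, `doc_matrix` would be `product_embeddings` and the returned indices could be passed to `cakes_df.iloc[...]`.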


In [8]:
# More search examples
test_queries = [
    "light and refreshing dessert",
    "something with citrus flavor",
    "low calorie option",
    "best for chocolate lovers"
]

for query in test_queries:
    print(f"\n🔍 Query: '{query}'")
    results = semantic_product_search(query, top_k=2)
    print("Top Matches:")
    for idx, row in results.iterrows():
        print(f"  {row['Name']} (similarity: {row['similarity']:.3f})")
    print("-" * 60)


🔍 Query: 'light and refreshing dessert'
Top Matches:
  Strawberry Shortcake (similarity: 0.486)
  Velvet Dream Cloud (similarity: 0.455)
------------------------------------------------------------

🔍 Query: 'something with citrus flavor'
Top Matches:
  Festiva della Sicilia (similarity: 0.399)
  Lemon Drizzle Cake (similarity: 0.374)
------------------------------------------------------------

🔍 Query: 'low calorie option'
Top Matches:
  Strawberry Shortcake (similarity: 0.343)
  Cheesecake (similarity: 0.342)
------------------------------------------------------------

🔍 Query: 'best for chocolate lovers'
Top Matches:
  Chocolate Truffle Cake (similarity: 0.480)
  ChocoCaramel Birthday Surprise (similarity: 0.469)
------------------------------------------------------------
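
Embedding similarity does not compare numbers, so a query such as "low calorie option" is not guaranteed to surface the cakes with the lowest `Energy_kcal`. One possible approach is to apply a numeric cap before ranking by similarity. The sketch below assumes a DataFrame with `Energy_kcal` and `similarity` columns, as returned by `semantic_product_search`; the product names and scores are made up for illustration.

```python
import pandas as pd


def filter_by_max_calories(candidates, max_kcal, top_k=2):
    """Keep candidates at or below max_kcal, then rank by similarity."""
    within_budget = candidates[candidates["Energy_kcal"] <= max_kcal]
    return within_budget.nlargest(top_k, "similarity")


# Hypothetical products and similarity scores, for illustration only.
candidates = pd.DataFrame({
    "Name": ["Cake A", "Cake B", "Cake C", "Cake D"],
    "Energy_kcal": [470, 340, 360, 320],
    "similarity": [0.40, 0.35, 0.34, 0.30],
})

print(filter_by_max_calories(candidates, max_kcal=350))
```

To use this with real data, request a wide candidate set first (for example `top_k=len(cakes_df)`) so the cap is applied before the list is truncated.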


## 3. Integrate Semantic Search with BakeryAI Chatbot

In [9]:
from langchain_core.messages import SystemMessage, HumanMessage

def smart_bakery_assistant(customer_query):
    """
    BakeryAI with semantic search capabilities
    """
    # Find relevant products
    relevant_products = semantic_product_search(customer_query, top_k=3)

    # Create context with search results
    context = "Based on your request, here are our most relevant products:\n\n"
    for idx, row in relevant_products.iterrows():
        context += f"- {row['Name']}: {row['Description']} ({row['Energy_kcal']} kcal)\n"

    # Create prompt
    system_prompt = f"""
    You are BakeryAI, a helpful bakery assistant.

    {context}

    Use this information to provide a helpful, personalized response.
    Recommend the most suitable product and explain why.
    """

    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=customer_query)
    ]

    response = llm.invoke(messages)
    return response.content, relevant_products

# Test the smart assistant
query = "I'm looking for something not too heavy, preferably with fruit"
print(f"🙋 Customer: {query}\n")
response, products = smart_bakery_assistant(query)
print(f"🍰 BakeryAI: {response}\n")
print("📊 Products Considered:")
print(products[['Name', 'similarity']])

🙋 Customer: I'm looking for something not too heavy, preferably with fruit

🍰 BakeryAI: Strawberry Shortcake. It’s light and airy with a refreshing strawberry cream, so it won’t feel heavy. It also features fruit, and at about 360 kcal it’s the lightest option among your choices.

If you want something a bit more festive but still fruit-forward, the Birthday Blast Berry Bomb is another good option (fruity and 380 kcal), just a touch heavier. Want me to place an order or tailor features ( cutters, no nuts, etc.)?

📊 Products Considered:
                         Name  similarity
17       Strawberry Shortcake    0.362519
6   Birthday Blast Berry Bomb    0.346719
19                 Cheesecake    0.331008


## 4. Multi-Provider Comparison

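Compare how different models answer the same customer question. This example uses three OpenAI models; chat models from other providers (for example via the `langchain-anthropic` package installed above) can be added to the dictionary in the same way.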

In [10]:
# Compare different models
models_to_test = {
    "GPT-3.5-turbo": ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7),
    "GPT-4o": ChatOpenAI(model="gpt-4o", temperature=0.7),
    "GPT-5-nano": ChatOpenAI(model="gpt-5-nano")
}

def compare_model_responses(query):
    """Compare responses from different models"""
    # Get relevant products
    relevant_products = semantic_product_search(query, top_k=2)

    context = "Relevant products:\n"
    for idx, row in relevant_products.iterrows():
        context += f"- {row['Name']}: {row['Description']}\n"

    prompt = f"{context}\n\nCustomer question: {query}\n\nProvide a brief recommendation (2 sentences max)."

    print(f"🙋 Customer Query: {query}\n")
    print("=" * 70)

    for model_name, model in models_to_test.items():
        response = model.invoke(prompt)
        print(f"\n[{model_name}]")
        print(response.content)
        print("-" * 70)

# Test comparison
compare_model_responses("What's your most popular chocolate cake?")

🙋 Customer Query: What's your most popular chocolate cake?

======================================================================

[GPT-3.5-turbo]
Our most popular chocolate cake is the Chocolate Truffle Cake. Its intense chocolate flavor and rich texture make it a favorite among chocolate lovers. If you're looking for a decadent treat, this cake is sure to satisfy your cravings.
----------------------------------------------------------------------

[GPT-4o]
Our most popular chocolate cake is the Chocolate Truffle Cake, known for its deep, rich flavor that delights purists. If you prefer a classic twist with cream and cherry, the Black Forest Cake is another excellent choice.
----------------------------------------------------------------------

[GPT-5-nano]
Our Chocolate Truffle Cake is our most popular chocolate cake. It’s a deep, rich chocolate experience for purists.
----------------------------------------------------------------------
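
Reading the replies side by side is subjective, so it can help to record a few simple measurements as well. The following is a minimal sketch that times each call and counts words; `compare_callables` and the stub functions are hypothetical and stand in for real models so the example runs without API keys.

```python
import time


def compare_callables(prompt, models):
    """Call each model-like function with the prompt; record reply and latency."""
    rows = []
    for name, call in models.items():
        start = time.perf_counter()
        reply = call(prompt)
        elapsed = time.perf_counter() - start
        rows.append({
            "model": name,
            "seconds": elapsed,
            "words": len(reply.split()),
            "reply": reply,
        })
    return rows


# Stub "models": plain functions that return fixed strings.
stub_models = {
    "short-stub": lambda p: "Try the Chocolate Truffle Cake.",
    "long-stub": lambda p: "Try the Chocolate Truffle Cake. It is rich and deep.",
}

for row in compare_callables("Most popular chocolate cake?", stub_models):
    print(f"{row['model']}: {row['words']} words in {row['seconds']:.4f}s")
```

With the LangChain models from the cell above, each entry could be wrapped as `lambda p, m=model: m.invoke(p).content`.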


## 5. Building a Product Similarity Matrix

Find which products are most similar to each other.

In [11]:
def find_similar_products(product_name, top_k=3):
    """
    Find products similar to a given product
    Useful for 'customers also liked' recommendations
    """
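    # Find the product (literal, case-insensitive match on the name)
    matches = cakes_df[cakes_df['Name'].str.contains(product_name, case=False, regex=False)]
    if matches.empty:
        raise ValueError(f"No product matching '{product_name}'")
    # Assumes cakes_df keeps its default RangeIndex, so labels equal positions
    product_idx = matches.index[0]
    product_embedding = product_embeddings[product_idx]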

    # Calculate similarities
    similarities = []
    for idx, prod_embedding in enumerate(product_embeddings):
        if idx != product_idx:  # Exclude the product itself
            sim = cosine_similarity(product_embedding, prod_embedding)
            similarities.append((idx, sim))

    # Sort and get top k
    similarities.sort(key=lambda x: x[1], reverse=True)
    top_indices = [idx for idx, _ in similarities[:top_k]]

    results = cakes_df.iloc[top_indices][['Name', 'Description']].copy()
    results['similarity'] = [sim for _, sim in similarities[:top_k]]

    return results

# Test similar products
print("🍰 Customers who liked 'Chocolate Truffle Cake' also liked:\n")
print(find_similar_products("Chocolate Truffle", top_k=3))

🍰 Customers who liked 'Chocolate Truffle Cake' also liked:

                 Name                                        Description  \
11  Black Forest Cake  Classic German cake with cream, cherry, and a ...   
12      Tiramisu Cake    Layered coffee dessert cake with creamy filling   
13    Red Velvet Cake   Rich red cake with soft crumb and tangy frosting   

    similarity  
11    0.793170  
12    0.767687  
13    0.748826  
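
The function above compares one product against the rest on each call. A full pairwise similarity matrix can be computed once and then reused for every lookup. This is a sketch on toy 2-D vectors; `similarity_matrix` and `most_similar` are hypothetical helper names.

```python
import numpy as np


def similarity_matrix(vectors):
    """Return the n x n matrix of pairwise cosine similarities."""
    matrix = np.asarray(vectors, dtype=float)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit @ unit.T


def most_similar(sim, index, top_k=2):
    """Indices of the top_k rows closest to `index`, excluding itself."""
    scores = sim[index].copy()
    scores[index] = -np.inf  # never recommend the product itself
    return np.argsort(scores)[::-1][:top_k]


# Toy vectors standing in for product embeddings.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sim = similarity_matrix(toy_vectors)
print(np.round(sim, 3))
print(most_similar(sim, index=0, top_k=1))
```

The matrix is symmetric with ones on the diagonal, so for 22 products it holds 22 × 22 values and is cheap to keep in memory.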


## 6. Dietary Restriction Search

Help customers find products matching dietary needs.

In [12]:
def search_by_dietary_needs(dietary_requirement):
    """
    Search products based on dietary restrictions

    Args:
        dietary_requirement: e.g., 'vegan', 'gluten-free', 'low-calorie'
    """
    # Semantic search with dietary context
    query = f"cakes suitable for {dietary_requirement} diet"
    results = semantic_product_search(query, top_k=3)

    # Also filter by restrictions if available
    if 'Restrictions' in cakes_df.columns:
        # Additional filtering logic here
        pass

    return results

# Test dietary search
print("🔍 Dietary Search: 'low calorie'\n")
print(search_by_dietary_needs("low calorie"))

🔍 Dietary Search: 'low calorie'

                    Name                                        Description  \
18    Lemon Drizzle Cake         Moist, zingy cake with crisp sugar coating   
17  Strawberry Shortcake  Light, airy, and refreshing strawberry cream cake   
13       Red Velvet Cake   Rich red cake with soft crumb and tangy frosting   

    Energy_kcal  similarity  
18          340    0.486109  
17          360    0.475042  
13          390    0.474466  
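
The filtering step in the function above is left as a placeholder. One way it might be filled in is a plain text match on the `Restrictions` column, sketched below with three rows copied from the dataset preview; `exclude_allergens` is a hypothetical helper. Substring matching is naive, and a product that does not mention an allergen is not thereby confirmed free of it.

```python
import pandas as pd


def exclude_allergens(products, allergens):
    """Drop rows whose Restrictions text mentions any listed allergen."""
    text = products["Restrictions"].fillna("").str.lower()
    mask = pd.Series(False, index=products.index)
    for allergen in allergens:
        # Literal substring match, case-insensitive via lower-casing.
        mask |= text.str.contains(allergen.lower(), regex=False)
    return products[~mask]


# Three rows taken from the dataset preview.
sample = pd.DataFrame({
    "Name": ["Torta della Nonna Amore", "Festiva della Sicilia",
             "Ferrari Redline Fudge"],
    "Restrictions": ["Contains dairy, gluten, eggs",
                     "Contains nuts, dairy, eggs",
                     "Contains dairy, caffeine"],
})

print(exclude_allergens(sample, ["nuts", "gluten"])["Name"].tolist())
```

Applying a filter like this to the full product table before the semantic ranking keeps excluded cakes from taking up slots in the top results.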


## 7. Creating a Product Recommendation Engine

In [13]:
def intelligent_recommendation(occasion, preferences, budget=None):
    """
    Comprehensive recommendation based on multiple factors

    Args:
        occasion: e.g., 'birthday', 'wedding', 'office party'
        preferences: e.g., 'chocolate lover', 'health conscious'
        budget: Optional budget constraint
    """
    # Combine factors into search query
    search_query = f"{occasion} cake for {preferences}"

    # Get semantic matches
    matches = semantic_product_search(search_query, top_k=3)

    # Create personalized recommendation
    context = f"""
    Customer needs: {occasion} cake
    Preferences: {preferences}
    Budget: {budget if budget else 'flexible'}

    Top matching products:
    """

    for idx, row in matches.iterrows():
        context += f"\n- {row['Name']}: {row['Description']} ({row['Energy_kcal']} kcal)"

    context += "\n\nProvide a thoughtful recommendation with reasoning."

    response = llm.invoke(context)
    return response.content, matches

# Test recommendations
recommendation, products = intelligent_recommendation(
    occasion="birthday party for a 10-year-old",
    preferences="kids who love chocolate",
    budget="moderate"
)

print("🎂 BakeryAI Recommendation:\n")
print(recommendation)
print("\n📊 Products Analyzed:")
print(products[['Name', 'similarity']])

🎂 BakeryAI Recommendation:

Recommendation: Chocolate Truffle Cake

Reasoning:
- Direct fit with the preference: chocolate-loving kids will be most excited about a deep, rich chocolate cake.
- Birthday party appeal: a classic choice that’s widely loved by kids and easy to slice for a moderate-sized group.
- Moderate-budget alignment: typically sits well within a reasonable party budget, especially if you choose a standard size and simple decorations.
- Customization options: can add a fun birthday message, themed sprinkles, candles, or a chocolatey topper to match the party vibe.

Helpful tips:
- If some guests aren’t as into chocolate, you can pair the Chocolate Truffle Cake with a lighter option (e.g., a Berry Bomb) or offer chocolate cupcakes alongside for variety.
- For a show-stopping look without breaking the budget, consider a coordinated topper (e.g., “Happy 10th Birthday” and a sporty or magical theme) or a chocolate-dusted/ganache finish.
- Know your guest count to pick the r

## 🎯 Exercise 3: Build a Smart Search Interface

**Task**: Create a comprehensive search function that:
1. Takes natural language query
2. Performs semantic search
3. Filters by availability and dietary restrictions
4. Returns ranked results with explanations

In [14]:
def advanced_product_search(query, filters=None):
    """
    Advanced search with filtering and ranking

    Args:
        query: Natural language search query
        filters: Dict with keys like 'max_calories', 'restrictions', 'category'

    Returns:
        Ranked list of products with explanations
    """
    # TODO: Implement advanced search
    # Hint: Combine semantic search with filtering
    pass

# Test your function
# filters = {'max_calories': 350, 'category': 'Fruit'}
# results = advanced_product_search("something light and fruity", filters=filters)
# print(results)

## 🎯 Exercise 4: Customer Preference Learning

**Task**: Build a system that:
1. Tracks customer queries over time
2. Identifies preference patterns using embeddings
3. Proactively suggests products they'll like

In [15]:
class CustomerPreferenceTracker:
    def __init__(self):
        self.query_history = []
        self.query_embeddings = []

    def add_query(self, query):
        """Track customer queries"""
        # TODO: Implement query tracking
        pass

    def get_preference_profile(self):
        """Analyze queries to understand preferences"""
        # TODO: Analyze query patterns
        pass

    def suggest_products(self, k=3):
        """Suggest products based on preference history"""
        # TODO: Use aggregated embeddings to recommend
        pass

# Test
# tracker = CustomerPreferenceTracker()
# tracker.add_query("chocolate cakes")
# tracker.add_query("rich desserts")
# tracker.add_query("indulgent treats")
# suggestions = tracker.suggest_products()
# print(suggestions)

## Summary: What We Built

### ✅ Session 1.2 Achievements:

1. **Product Embeddings**: Vector representations of all cakes
2. **Semantic Search**: Find products by meaning, not just keywords
3. **Smart Assistant**: Chatbot with semantic search integration
4. **Similar Products**: "Customers also liked" recommendations
5. **Multi-Provider Support**: Compare different LLM responses
6. **Intelligent Recommendations**: Context-aware product suggestions

### 🚀 BakeryAI Progress: 40%

```
[████████░░░░░░░░░░░░] 40%
```

### Key Capabilities Added:

✨ **Natural Language Search**: "light and fruity" → relevant products  
✨ **Smart Matching**: Understands context and intent  
✨ **Personalization**: Groundwork for tracking and learning preferences (Exercise 4)  
✨ **Flexibility**: The same prompt can be run against multiple LLMs  

### Next: Notebook 1.3

We'll add **prompt templates** for:
- Consistent customer interactions
- Order confirmations
- Structured product recommendations
- Multi-language support