# NL2NoSQL + Vector Search: Hybrid Search System

This notebook demonstrates a hybrid search system that converts natural language questions into MongoDB queries and combines them with vector search.

## Overview
- Run MongoDB with Docker
- Store Samsung Electronics product data (including specification arrays)
- Use Azure OpenAI to:
  1. Convert natural language to MongoDB queries (NL2NoSQL)
  2. Run semantic vector search using text embeddings
  3. Combine both in a hybrid search (structured queries + semantic search)

## Requirements
- Docker (for running MongoDB container)
- Azure OpenAI API key
- pymongo, openai, python-dotenv, and numpy libraries
- Azure OpenAI Text Embedding model

## 1. Install and Import Required Libraries

In [1]:
# Install required packages (add pymongo)
import subprocess
import sys

# Install pymongo
try:
    import pymongo
except ImportError:
    print("Installing pymongo...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pymongo"])
    import pymongo

print(f"pymongo version: {pymongo.__version__}")

pymongo version: 4.15.3


In [2]:
import os
import json
from pymongo import MongoClient
from openai import AzureOpenAI
from dotenv import load_dotenv
import time

# Load environment variables
load_dotenv(override=True)

# Initialize Azure OpenAI client
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-15-preview"
)

deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")
embedding_model = os.getenv("EMBEDDING_MODEL_NAME", "text-embedding-3-large")
embedding_dimensions = int(os.getenv("AZURE_OPENAI_EMBEDDING_DIMENSIONS", "3072"))

print(f"Azure OpenAI client initialized")
print(f"  - GPT deployment: {deployment_name}")
print(f"  - Embedding model: {embedding_model}")
print(f"  - Embedding dimensions: {embedding_dimensions}")

Azure OpenAI client initialized
  - GPT deployment: gpt-4.1-mini
  - Embedding model: text-embedding-3-large
  - Embedding dimensions: 3072


## 2. Run MongoDB with Docker

Start MongoDB in a Docker container named `mongodb-nl2nosql`, publishing port 27017 and creating a root user for this local demo.

In [3]:
# Start MongoDB container
import subprocess

# Remove existing container if any
subprocess.run(["docker", "rm", "-f", "mongodb-nl2nosql"], capture_output=True)

# Run MongoDB container
result = subprocess.run([
    "docker", "run", "-d",
    "--name", "mongodb-nl2nosql",
    "-p", "27017:27017",
    "-e", "MONGO_INITDB_ROOT_USERNAME=admin",
    "-e", "MONGO_INITDB_ROOT_PASSWORD=admin123",
    "mongo:latest"
], capture_output=True, text=True)

if result.returncode == 0:
    print("✅ MongoDB container started successfully")
    print(f"Container ID: {result.stdout.strip()}")
    # Wait for MongoDB to start completely
    print("Waiting for MongoDB to start...")
    time.sleep(5)
else:
    print(f"❌ Container start failed: {result.stderr}")

✅ MongoDB container started successfully
Container ID: b48774fbbf5e877f2eaf6d6c56a42fee6e17e267bec38221849fb9e7a63e7c9f
Waiting for MongoDB to start...
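
The fixed `time.sleep(5)` above may be too short on a slow machine and longer than needed on a fast one. A minimal sketch of polling instead, assuming a hypothetical helper `wait_until_ready` (not part of this notebook) that retries any zero-argument check until it stops raising:

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], object],
                     timeout_s: float = 30.0,
                     interval_s: float = 0.5) -> bool:
    """Call `check` repeatedly until it succeeds or the timeout expires.

    `check` is any zero-argument callable that raises while the service
    is still starting. Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            check()
            return True
        except Exception:
            # Give up once the deadline has passed; otherwise wait and retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

With pymongo, the check could be something like `lambda: MongoClient(uri, serverSelectionTimeoutMS=1000).admin.command("ping")`, where `uri` is the connection string used in the next section; the `ping` command raises until the server accepts connections.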


## 3. Connect to MongoDB and Setup Database

In [4]:
# Connect to MongoDB
MONGO_URI = "mongodb://admin:admin123@localhost:27017/"
mongo_client = MongoClient(MONGO_URI)

# Create database and collection
db = mongo_client["samsung_products"]
collection = db["products"]

# Delete existing data (fresh start)
collection.delete_many({})

print("✅ MongoDB connection established")
print(f"Database: {db.name}")
print(f"Collection: {collection.name}")

✅ MongoDB connection established
Database: samsung_products
Collection: products


## 4. Generate Samsung Electronics Product Sample Data

Generate sample Samsung Electronics products with varied specifications. Each product stores its specifications as an array of `{name, value, unit}` objects in the `specifications` field.

In [5]:
# Samsung Electronics product sample data
samsung_products = [
    {
        "product_id": "GNT900X5L",
        "name": "Galaxy Book4 Pro",
        "category": "Laptop",
        "brand": "Samsung",
        "price": 2490000,
        "release_date": "2024-01-15",
        "specifications": [
            {"name": "Screen Size", "value": 16, "unit": "inches"},
            {"name": "Resolution", "value": "3200x2000", "unit": "pixels"},
            {"name": "Processor", "value": "Intel Core Ultra 7", "unit": ""},
            {"name": "Memory", "value": 16, "unit": "GB"},
            {"name": "Storage", "value": 512, "unit": "GB"},
            {"name": "Weight", "value": 1.55, "unit": "kg"}
        ]
    },
    {
        "product_id": "GNT750X3N",
        "name": "Galaxy Book4 Ultra",
        "category": "Laptop",
        "brand": "Samsung",
        "price": 3290000,
        "release_date": "2024-02-20",
        "specifications": [
            {"name": "Screen Size", "value": 14, "unit": "inches"},
            {"name": "Resolution", "value": "2880x1800", "unit": "pixels"},
            {"name": "Processor", "value": "Intel Core Ultra 9", "unit": ""},
            {"name": "Memory", "value": 32, "unit": "GB"},
            {"name": "Storage", "value": 1024, "unit": "GB"},
            {"name": "Weight", "value": 1.21, "unit": "kg"}
        ]
    },
    {
        "product_id": "GNT350X2A",
        "name": "Galaxy Book3",
        "category": "Laptop",
        "brand": "Samsung",
        "price": 1590000,
        "release_date": "2023-08-10",
        "specifications": [
            {"name": "Screen Size", "value": 15.6, "unit": "inches"},
            {"name": "Resolution", "value": "1920x1080", "unit": "pixels"},
            {"name": "Processor", "value": "Intel Core i5-1335U", "unit": ""},
            {"name": "Memory", "value": 8, "unit": "GB"},
            {"name": "Storage", "value": 256, "unit": "GB"},
            {"name": "Weight", "value": 1.78, "unit": "kg"}
        ]
    },
    {
        "product_id": "GNT940X5M",
        "name": "Galaxy Book3 Pro 360",
        "category": "Laptop",
        "brand": "Samsung",
        "price": 2190000,
        "release_date": "2023-09-25",
        "specifications": [
            {"name": "Screen Size", "value": 13.3, "unit": "inches"},
            {"name": "Resolution", "value": "1920x1080", "unit": "pixels"},
            {"name": "Processor", "value": "Intel Core i7-1360P", "unit": ""},
            {"name": "Memory", "value": 16, "unit": "GB"},
            {"name": "Storage", "value": 512, "unit": "GB"},
            {"name": "Weight", "value": 1.16, "unit": "kg"}
        ]
    },
    {
        "product_id": "MON32LU711",
        "name": "ViewFinity S9 5K",
        "category": "Monitor",
        "brand": "Samsung",
        "price": 2290000,
        "release_date": "2024-03-01",
        "specifications": [
            {"name": "Screen Size", "value": 27, "unit": "inches"},
            {"name": "Resolution", "value": "5120x2880", "unit": "pixels"},
            {"name": "Refresh Rate", "value": 60, "unit": "Hz"},
            {"name": "Panel Type", "value": "IPS", "unit": ""},
            {"name": "Brightness", "value": 600, "unit": "nits"}
        ]
    },
    {
        "product_id": "MON49G95T",
        "name": "Odyssey OLED G9",
        "category": "Monitor",
        "brand": "Samsung",
        "price": 2690000,
        "release_date": "2023-11-15",
        "specifications": [
            {"name": "Screen Size", "value": 49, "unit": "inches"},
            {"name": "Resolution", "value": "5120x1440", "unit": "pixels"},
            {"name": "Refresh Rate", "value": 240, "unit": "Hz"},
            {"name": "Panel Type", "value": "OLED", "unit": ""},
            {"name": "Response Time", "value": 0.03, "unit": "ms"}
        ]
    },
    {
        "product_id": "TAB-S9-ULTRA",
        "name": "Galaxy Tab S9 Ultra",
        "category": "Tablet",
        "brand": "Samsung",
        "price": 1650000,
        "release_date": "2024-01-05",
        "specifications": [
            {"name": "Screen Size", "value": 14.6, "unit": "inches"},
            {"name": "Resolution", "value": "2960x1848", "unit": "pixels"},
            {"name": "Processor", "value": "Snapdragon 8 Gen 2", "unit": ""},
            {"name": "Memory", "value": 12, "unit": "GB"},
            {"name": "Storage", "value": 256, "unit": "GB"},
            {"name": "Weight", "value": 0.732, "unit": "kg"}
        ]
    },
    {
        "product_id": "PHONE-S24-ULTRA",
        "name": "Galaxy S24 Ultra",
        "category": "Smartphone",
        "brand": "Samsung",
        "price": 1698400,
        "release_date": "2024-01-17",
        "specifications": [
            {"name": "Screen Size", "value": 6.8, "unit": "inches"},
            {"name": "Resolution", "value": "3120x1440", "unit": "pixels"},
            {"name": "Processor", "value": "Snapdragon 8 Gen 3", "unit": ""},
            {"name": "Memory", "value": 12, "unit": "GB"},
            {"name": "Storage", "value": 256, "unit": "GB"},
            {"name": "Weight", "value": 0.232, "unit": "kg"},
            {"name": "Battery", "value": 5000, "unit": "mAh"}
        ]
    }
]

# Insert data (embeddings will be added later)
result = collection.insert_many(samsung_products)
print(f"✅ {len(result.inserted_ids)} product documents inserted")
print(f"Inserted document IDs: {result.inserted_ids}")

✅ 8 product documents inserted
Inserted document IDs: [ObjectId('691d45441507408bb53e7ef8'), ObjectId('691d45441507408bb53e7ef9'), ObjectId('691d45441507408bb53e7efa'), ObjectId('691d45441507408bb53e7efb'), ObjectId('691d45441507408bb53e7efc'), ObjectId('691d45441507408bb53e7efd'), ObjectId('691d45441507408bb53e7efe'), ObjectId('691d45441507408bb53e7eff')]


### 4-1. Product Description Text Generation Function

Generate a natural language description of each product to use as the input text for its embedding.

In [6]:
def generate_product_description(product: dict) -> str:
    """
    Convert product information to natural language description.
    
    Args:
        product: Product dictionary
        
    Returns:
        Product description text
    """
    specs_text = []
    for spec in product.get("specifications", []):
        value = spec["value"]
        unit = spec["unit"]
        name = spec["name"]
        specs_text.append(f"{name} {value}{unit}")
    
    description = f"{product['name']} is a {product['category']} product from {product['brand']}. "
    description += f"The price is {product['price']:,} won, "
    description += f"and the main specifications are {', '.join(specs_text)}."
    
    return description

# Test
test_product = samsung_products[0]
test_description = generate_product_description(test_product)
print("Product description generation example:")
print(test_description)

Product description generation example:
Galaxy Book4 Pro is a Laptop product from Samsung. The price is 2,490,000 won, and the main specifications are Screen Size 16inches, Resolution 3200x2000pixels, Processor Intel Core Ultra 7, Memory 16GB, Storage 512GB, Weight 1.55kg.


### 4-2. Embedding Generation Function

Generate text embeddings using Azure OpenAI.

In [7]:
def get_embedding(text: str) -> list:
    """
    Generate text embedding using Azure OpenAI.
    
    Args:
        text: Text to embed
        
    Returns:
        Embedding vector (list)
    """
    try:
        response = client.embeddings.create(
            model=embedding_model,
            input=text,
            dimensions=embedding_dimensions
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"❌ Embedding generation error: {e}")
        return []

# Test
test_text = "high performance laptop"
test_embedding = get_embedding(test_text)
print(f"✅ Embedding generation complete")
print(f"   - Input text: {test_text}")
print(f"   - Embedding dimensions: {len(test_embedding)}")
print(f"   - Embedding sample (first 5): {test_embedding[:5]}")

✅ Embedding generation complete
   - Input text: high performance laptop
   - Embedding dimensions: 3072
   - Embedding sample (first 5): [-0.01976715214550495, 0.014754588715732098, -0.007876885123550892, 0.0053997463546693325, -0.00455876812338829]


### 4-3. Add Embeddings to Product Data

Generate descriptions and add embeddings for all products.

In [8]:
# Add description and embedding to each product
print("Generating product embeddings...")
for idx, product in enumerate(samsung_products, 1):
    description = generate_product_description(product)
    embedding = get_embedding(description)
    
    product["description"] = description
    product["embedding"] = embedding
    
    print(f"  [{idx}/{len(samsung_products)}] {product['name']} - Embedding generated")

print(f"\n✅ All product embeddings generated")
print(f"   - Total {len(samsung_products)} products")
print(f"   - Embedding dimensions: {len(samsung_products[0]['embedding'])}")

Generating product embeddings...
  [1/8] Galaxy Book4 Pro - Embedding generated
  [2/8] Galaxy Book4 Ultra - Embedding generated
  [3/8] Galaxy Book3 - Embedding generated
  [4/8] Galaxy Book3 Pro 360 - Embedding generated
  [5/8] ViewFinity S9 5K - Embedding generated
  [6/8] Odyssey OLED G9 - Embedding generated
  [7/8] Galaxy Tab S9 Ultra - Embedding generated
  [8/8] Galaxy S24 Ultra - Embedding generated

✅ All product embeddings generated
   - Total 8 products
   - Embedding dimensions: 3072


### 4-4. Update MongoDB with Embedding Data

Save the generated embeddings to MongoDB.

In [9]:
# Update MongoDB with embedding data
print("Updating embedding data in MongoDB...")
for product in samsung_products:
    collection.update_one(
        {"product_id": product["product_id"]},
        {
            "$set": {
                "description": product["description"],
                "embedding": product["embedding"]
            }
        }
    )

print(f"✅ Embedding data updated for {len(samsung_products)} products")

# A vector search index would normally be created here (MongoDB Atlas Vector Search).
# Note: the local MongoDB container used in this notebook has no vector index,
# so cosine similarity is computed directly in Python instead.
print("\n⚠️ Local MongoDB does not support vector indexes.")
print("   We will perform cosine similarity calculations directly instead.")

Updating embedding data in MongoDB...
✅ Embedding data updated for 8 products

⚠️ Local MongoDB does not support vector indexes.
   We will perform cosine similarity calculations directly instead.
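
On a deployment that does provide Atlas Vector Search, the brute-force similarity loop could be replaced by a `$vectorSearch` aggregation stage. A minimal sketch of building that pipeline, assuming an index already exists on the `embedding` field; the index name `vector_index` and the helper `build_vector_search_pipeline` are hypothetical, and nothing here is executed by this notebook:

```python
from typing import Any, Dict, List


def build_vector_search_pipeline(query_vector: List[float],
                                 index_name: str = "vector_index",  # hypothetical index name
                                 limit: int = 5,
                                 num_candidates: int = 100) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that starts with an Atlas $vectorSearch stage."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": "embedding",            # field that stores the vectors
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Return a few fields plus the similarity score of each hit.
            "$project": {
                "_id": 0,
                "name": 1,
                "category": 1,
                "price": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=3)
print(pipeline[0]["$vectorSearch"]["limit"])
```

The resulting list would be passed to `collection.aggregate(pipeline)`. Per the note above, the local container has no such index, so the rest of the notebook ranks products in Python.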


## 5. Verify Data

In [10]:
# Check total product count
total_count = collection.count_documents({})
print(f"Total products: {total_count}\n")

# Retrieve one sample product
sample_product = collection.find_one({"category": "Laptop"})
print("Sample product information:")
print(json.dumps(sample_product, indent=2, ensure_ascii=False, default=str))

Total products: 8

Sample product information:
{
  "_id": "691d45441507408bb53e7ef8",
  "product_id": "GNT900X5L",
  "name": "Galaxy Book4 Pro",
  "category": "Laptop",
  "brand": "Samsung",
  "price": 2490000,
  "release_date": "2024-01-15",
  "specifications": [
    {
      "name": "Screen Size",
      "value": 16,
      "unit": "inches"
    },
    {
      "name": "Resolution",
      "value": "3200x2000",
      "unit": "pixels"
    },
    {
      "name": "Processor",
      "value": "Intel Core Ultra 7",
      "unit": ""
    },
    {
      "name": "Memory",
      "value": 16,
      "unit": "GB"
    },
    {
      "name": "Storage",
      "value": 512,
      "unit": "GB"
    },
    {
      "name": "Weight",
      "value": 1.55,
      "unit": "kg"
    }
  ],
  "description": "Galaxy Book4 Pro is a Laptop product from Samsung. The price is 2,490,000 won, and the main specifications are Screen Size 16inches, Resolution 3200x2000pixels, Processor Intel Core Ultra 7, Memory 16GB, Storage 51

## 6. Function to Convert Natural Language to MongoDB Query

Use Azure OpenAI to convert natural language questions into MongoDB queries.

In [11]:
def nl_to_mongodb_query(natural_language_query: str) -> dict:
    """
    Convert natural language questions to MongoDB queries.
    
    Args:
        natural_language_query: Natural language question
        
    Returns:
        MongoDB query dictionary
    """
    
    system_prompt = """You are a MongoDB query expert. Please convert user's natural language questions into MongoDB queries.

Database Schema:
- Collection name: products
- Fields:
  - product_id: Product ID (string)
  - name: Product name (string)
  - category: Category (string: Laptop, Monitor, Tablet, Smartphone)
  - brand: Brand (string)
  - price: Price (number)
  - release_date: Release date (string, ISO 8601 format)
  - specifications: Specification information array
    - name: Spec name (e.g., Screen Size, Resolution, Processor, Memory, Storage, Weight)
    - value: Spec value (number or string)
    - unit: Unit (e.g., inches, GB, kg, pixels)

Important Instructions:
1. Since specifications is an array, you must use $elemMatch.
2. Screen size is in the form {name: "Screen Size", value: number, unit: "inches"} within the specifications array.
3. Comparison operators: $lt (less than), $lte (less than or equal), $gt (greater than), $gte (greater than or equal), $eq (equal)
4. Response must return only a valid JSON-formatted MongoDB query.
5. Return only the query without additional explanations.

Examples:
Question: "Laptops with screen smaller than 15 inches"
Query: {"category": "Laptop", "specifications": {"$elemMatch": {"name": "Screen Size", "value": {"$lt": 15}}}}

Question: "Products with memory of 16GB or more"
Query: {"specifications": {"$elemMatch": {"name": "Memory", "value": {"$gte": 16}}}}
"""
    
    user_prompt = f"Please convert the following natural language question into a MongoDB query: {natural_language_query}"
    
    try:
        response = client.chat.completions.create(
            model=deployment_name,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0,
            max_tokens=500
        )
        
        query_string = response.choices[0].message.content.strip()
        
        # Remove JSON code blocks (```json ... ``` format)
        if query_string.startswith("```"):
            query_string = query_string.split("```")[1]
            if query_string.startswith("json"):
                query_string = query_string[4:]
            query_string = query_string.strip()
        
        # Parse JSON
        mongodb_query = json.loads(query_string)
        return mongodb_query
        
    except Exception as e:
        print(f"❌ Query conversion error: {e}")
        # Caution: an empty filter matches every document when passed to collection.find()
        return {}

print("✅ nl_to_mongodb_query function defined")

✅ nl_to_mongodb_query function defined
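
The `$elemMatch` instruction in the prompt matters because a query written with dotted paths, such as `{"specifications.name": "Memory", "specifications.value": {"$gte": 16}}`, lets each condition be satisfied by a different array element. The sketch below emulates both behaviours in plain Python. It is a simplified model of the "name equals, value at least" case, not MongoDB's actual matcher, and the helper names are hypothetical:

```python
from typing import Any, Dict, List

Spec = Dict[str, Any]


def is_number(value: Any) -> bool:
    """True for int/float values (bool excluded); strings never match a numeric bound."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def matches_elem_match(specs: List[Spec], name: str, minimum: float) -> bool:
    """$elemMatch semantics: ONE element must satisfy both conditions."""
    return any(
        s["name"] == name and is_number(s["value"]) and s["value"] >= minimum
        for s in specs
    )


def matches_dotted_fields(specs: List[Spec], name: str, minimum: float) -> bool:
    """Dotted-path semantics: each condition may be met by a DIFFERENT element."""
    has_name = any(s["name"] == name for s in specs)
    has_value = any(is_number(s["value"]) and s["value"] >= minimum for s in specs)
    return has_name and has_value


# Subset of the Galaxy Book3 specifications from the sample data (8 GB memory).
galaxy_book3_specs = [
    {"name": "Screen Size", "value": 15.6, "unit": "inches"},
    {"name": "Memory", "value": 8, "unit": "GB"},
    {"name": "Storage", "value": 256, "unit": "GB"},
]

print(matches_elem_match(galaxy_book3_specs, "Memory", 16))     # False
print(matches_dotted_fields(galaxy_book3_specs, "Memory", 16))  # True: Storage 256 >= 16
```

Under the dotted-path form, the 8 GB Galaxy Book3 would wrongly count as having "16GB or more memory", because its Storage value satisfies the numeric condition.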


## 7. Natural Language Query Execution Function

In [12]:
def search_products_with_nl(natural_language_query: str):
    """
    Search products using natural language questions.
    
    Args:
        natural_language_query: Natural language question
    """
    print(f"\n{'='*80}")
    print(f"🔍 Question: {natural_language_query}")
    print(f"{'='*80}")
    
    # 1. Convert natural language to MongoDB query
    mongodb_query = nl_to_mongodb_query(natural_language_query)
    print(f"\n📝 Generated MongoDB query:")
    print(json.dumps(mongodb_query, indent=2, ensure_ascii=False))
    
    # 2. Execute query
    results = list(collection.find(mongodb_query))
    print(f"\n✅ Search results: {len(results)} products found")
    
    # 3. Display results
    if results:
        print("\n" + "="*80)
        for idx, product in enumerate(results, 1):
            print(f"\n[{idx}] {product['name']}")
            print(f"   - Category: {product['category']}")
            print(f"   - Price: {product['price']:,} won")
            print(f"   - Specifications:")
            for spec in product['specifications']:
                print(f"     • {spec['name']}: {spec['value']} {spec['unit']}")
    else:
        print("\n⚠️ No products match the search criteria.")
    
    print("\n" + "="*80)
    return results

print("✅ search_products_with_nl function defined")

✅ search_products_with_nl function defined
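
Because `search_products_with_nl` passes the model's output straight to `collection.find`, it may be worth checking the query first. A minimal sketch, assuming an allow-list approach; `ALLOWED_FIELDS`, `ALLOWED_OPERATORS`, `validate_query`, and `parse_and_validate` are hypothetical names, and the operator list is an illustrative choice rather than a complete policy:

```python
import json
from typing import Any

# Field names from the schema in the system prompt (top level and inside specifications).
ALLOWED_FIELDS = {"product_id", "name", "category", "brand", "price",
                  "release_date", "specifications", "value", "unit"}
# Illustrative allow-list; anything else (for example $where) is rejected.
ALLOWED_OPERATORS = {"$lt", "$lte", "$gt", "$gte", "$eq", "$ne", "$in",
                     "$and", "$or", "$elemMatch"}


def validate_query(node: Any) -> None:
    """Recursively reject operators or field names outside the allow-lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.startswith("$"):
                if key not in ALLOWED_OPERATORS:
                    raise ValueError(f"operator not allowed: {key}")
            elif key.split(".")[0] not in ALLOWED_FIELDS:
                raise ValueError(f"unknown field: {key}")
            validate_query(value)
    elif isinstance(node, list):
        for item in node:
            validate_query(item)


def parse_and_validate(raw: str) -> dict:
    """Parse a JSON query string and refuse empty or disallowed queries."""
    query = json.loads(raw)
    if not isinstance(query, dict) or not query:
        raise ValueError("expected a non-empty JSON object")
    validate_query(query)
    return query


ok = parse_and_validate(
    '{"category": "Laptop", "specifications": '
    '{"$elemMatch": {"name": "Screen Size", "value": {"$lt": 15}}}}'
)
print(ok["category"])
```

Rejecting the empty object also avoids silently returning every product when query conversion fails.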


## 8. Natural Language Query Examples

Search for products using various natural language questions.

### Example 1: Find laptops with screens smaller than 15 inches

In [13]:
results = search_products_with_nl("Find laptops with screen smaller than 15 inches")


🔍 Question: Find laptops with screen smaller than 15 inches

📝 Generated MongoDB query:
{
  "category": "Laptop",
  "specifications": {
    "$elemMatch": {
      "name": "Screen Size",
      "value": {
        "$lt": 15
      }
    }
  }
}

✅ Search results: 2 products found


[1] Galaxy Book4 Ultra
   - Category: Laptop
   - Price: 3,290,000 won
   - Specifications:
     • Screen Size: 14 inches
     • Resolution: 2880x1800 pixels
     • Processor: Intel Core Ultra 9 
     • Memory: 32 GB
     • Storage: 1024 GB
     • Weight: 1.21 kg

[2] Galaxy Book3 Pro 360
   - Category: Laptop
   - Price: 2,190,000 won
   - Specifications:
     • Screen Size: 13.3 inches
     • Resolution: 1920x1080 pixels
     • Processor: Intel Core i7-1360P 
     • Memory: 16 GB
     • Storage: 512 GB
     • Weight: 1.16 kg



### Example 2: Find products with 16GB or more memory

In [14]:
results = search_products_with_nl("Show products with 16GB or more memory")


🔍 Question: Show products with 16GB or more memory

📝 Generated MongoDB query:
{
  "specifications": {
    "$elemMatch": {
      "name": "Memory",
      "value": {
        "$gte": 16
      }
    }
  }
}

✅ Search results: 3 products found


[1] Galaxy Book4 Pro
   - Category: Laptop
   - Price: 2,490,000 won
   - Specifications:
     • Screen Size: 16 inches
     • Resolution: 3200x2000 pixels
     • Processor: Intel Core Ultra 7 
     • Memory: 16 GB
     • Storage: 512 GB
     • Weight: 1.55 kg

[2] Galaxy Book4 Ultra
   - Category: Laptop
   - Price: 3,290,000 won
   - Specifications:
     • Screen Size: 14 inches
     • Resolution: 2880x1800 pixels
     • Processor: Intel Core Ultra 9 
     • Memory: 32 GB
     • Storage: 1024 GB
     • Weight: 1.21 kg

[3] Galaxy Book3 Pro 360
   - Category: Laptop
   - Price: 2,190,000 won
   - Specifications:
     • Screen Size: 13.3 inches
     • Resolution: 1920x1080 pixels
     • Processor: Intel Core i7-1360P 
     • Memory: 16 GB
     • St

### Example 3: Find lightweight laptops (under 1.5kg)

In [15]:
results = search_products_with_nl("Find laptops lighter than 1.5kg")


🔍 Question: Find laptops lighter than 1.5kg

📝 Generated MongoDB query:
{
  "category": "Laptop",
  "specifications": {
    "$elemMatch": {
      "name": "Weight",
      "value": {
        "$lt": 1.5
      },
      "unit": "kg"
    }
  }
}

✅ Search results: 2 products found


[1] Galaxy Book4 Ultra
   - Category: Laptop
   - Price: 3,290,000 won
   - Specifications:
     • Screen Size: 14 inches
     • Resolution: 2880x1800 pixels
     • Processor: Intel Core Ultra 9 
     • Memory: 32 GB
     • Storage: 1024 GB
     • Weight: 1.21 kg

[2] Galaxy Book3 Pro 360
   - Category: Laptop
   - Price: 2,190,000 won
   - Specifications:
     • Screen Size: 13.3 inches
     • Resolution: 1920x1080 pixels
     • Processor: Intel Core i7-1360P 
     • Memory: 16 GB
     • Storage: 512 GB
     • Weight: 1.16 kg



### Example 4: Find products under 2 million won

In [16]:
results = search_products_with_nl("Find products under 2 million won")


🔍 Question: Find products under 2 million won

📝 Generated MongoDB query:
{
  "price": {
    "$lt": 2000000
  }
}

✅ Search results: 3 products found


[1] Galaxy Book3
   - Category: Laptop
   - Price: 1,590,000 won
   - Specifications:
     • Screen Size: 15.6 inches
     • Resolution: 1920x1080 pixels
     • Processor: Intel Core i5-1335U 
     • Memory: 8 GB
     • Storage: 256 GB
     • Weight: 1.78 kg

[2] Galaxy Tab S9 Ultra
   - Category: Tablet
   - Price: 1,650,000 won
   - Specifications:
     • Screen Size: 14.6 inches
     • Resolution: 2960x1848 pixels
     • Processor: Snapdragon 8 Gen 2 
     • Memory: 12 GB
     • Storage: 256 GB
     • Weight: 0.732 kg

[3] Galaxy S24 Ultra
   - Category: Smartphone
   - Price: 1,698,400 won
   - Specifications:
     • Screen Size: 6.8 inches
     • Resolution: 3120x1440 pixels
     • Processor: Snapdragon 8 Gen 3 
     • Memory: 12 GB
     • Storage: 256 GB
     • Weight: 0.232 kg
     • Battery: 5000 mAh



### Example 5: Laptops with 512GB or more storage

In [17]:
results = search_products_with_nl("Show laptops with 512GB or more storage")


🔍 Question: Show laptops with 512GB or more storage

📝 Generated MongoDB query:
{
  "category": "Laptop",
  "specifications": {
    "$elemMatch": {
      "name": "Storage",
      "value": {
        "$gte": 512
      }
    }
  }
}

✅ Search results: 3 products found


[1] Galaxy Book4 Pro
   - Category: Laptop
   - Price: 2,490,000 won
   - Specifications:
     • Screen Size: 16 inches
     • Resolution: 3200x2000 pixels
     • Processor: Intel Core Ultra 7 
     • Memory: 16 GB
     • Storage: 512 GB
     • Weight: 1.55 kg

[2] Galaxy Book4 Ultra
   - Category: Laptop
   - Price: 3,290,000 won
   - Specifications:
     • Screen Size: 14 inches
     • Resolution: 2880x1800 pixels
     • Processor: Intel Core Ultra 9 
     • Memory: 32 GB
     • Storage: 1024 GB
     • Weight: 1.21 kg

[3] Galaxy Book3 Pro 360
   - Category: Laptop
   - Price: 2,190,000 won
   - Specifications:
     • Screen Size: 13.3 inches
     • Resolution: 1920x1080 pixels
     • Processor: Intel Core i7-1360P 
   

### Example 6: Monitors with high refresh rate

In [18]:
results = search_products_with_nl("Find monitors with refresh rate above 100Hz")


🔍 Question: Find monitors with refresh rate above 100Hz

📝 Generated MongoDB query:
{
  "category": "Monitor",
  "specifications": {
    "$elemMatch": {
      "name": "Refresh Rate",
      "value": {
        "$gt": 100
      }
    }
  }
}

✅ Search results: 1 products found


[1] Odyssey OLED G9
   - Category: Monitor
   - Price: 2,690,000 won
   - Specifications:
     • Screen Size: 49 inches
     • Resolution: 5120x1440 pixels
     • Refresh Rate: 240 Hz
     • Panel Type: OLED 
     • Response Time: 0.03 ms





---

## Vector Search

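Next, implement semantic search: embed the query text and rank products by the cosine similarity between the query embedding and each stored product embedding.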

### 10. Cosine Similarity Calculation Function

In [19]:
import numpy as np

def cosine_similarity(vec1: list, vec2: list) -> float:
    """
    Calculate cosine similarity between two vectors.
    
    Args:
        vec1: First vector
        vec2: Second vector
        
    Returns:
        Cosine similarity (value between -1 and 1; closer to 1 means more similar)
    """
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    
    if norm1 == 0 or norm2 == 0:
        return 0.0
    
    return float(dot_product / (norm1 * norm2))

print("✅ cosine_similarity function defined")

✅ cosine_similarity function defined
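
Cosine similarity depends only on the angle between two vectors, not on their lengths, and ranges from -1 to 1. A small worked example in pure Python (the helper `cosine` is a hypothetical stand-in that uses `math` rather than NumPy so it runs on its own):

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity in pure Python; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine([1, 0], [2, 0]))   # 1.0  (same direction, different length)
print(cosine([1, 0], [0, 3]))   # 0.0  (orthogonal)
print(cosine([1, 0], [-4, 0]))  # -1.0 (opposite direction)
```

Scores for real text embeddings tend to sit in a much narrower band, which is why the ranking between products matters more than the absolute value.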


### 11. Vector Search Function

In [20]:
def vector_search(query_text: str, top_k: int = 5):
    """
    Perform semantic vector search.
    
    Args:
        query_text: Search query text
        top_k: Number of top results to return
        
    Returns:
        List of search results (sorted by similarity)
    """
    print(f"\n{'='*80}")
    print(f"🔍 Vector Search: {query_text}")
    print(f"{'='*80}")
    
    # 1. Embed query text
    query_embedding = get_embedding(query_text)
    print(f"\n📝 Query embedding generated (dimensions: {len(query_embedding)})")
    
    # 2. Get all products
    all_products = list(collection.find({"embedding": {"$exists": True}}))
    
    # 3. Calculate cosine similarity
    results_with_scores = []
    for product in all_products:
        if "embedding" in product and product["embedding"]:
            similarity = cosine_similarity(query_embedding, product["embedding"])
            results_with_scores.append((product, similarity))
    
    # 4. Sort by similarity
    results_with_scores.sort(key=lambda x: x[1], reverse=True)
    
    # 5. Top k results
    top_results = results_with_scores[:top_k]
    
    print(f"\n✅ Search results: Top {len(top_results)} products")
    print("\n" + "="*80)
    
    for idx, (product, score) in enumerate(top_results, 1):
        print(f"\n[{idx}] {product['name']} (Similarity: {score:.4f})")
        print(f"   - Category: {product['category']}")
        print(f"   - Price: {product['price']:,} won")
        print(f"   - Description: {product.get('description', 'N/A')[:100]}...")
    
    print("\n" + "="*80)
    return top_results

print("✅ vector_search function defined")

✅ vector_search function defined


### 12. Vector Search Examples

#### Example 1: Search for "high-performance gaming products"

In [21]:
results = vector_search("high-performance gaming products", top_k=3)


🔍 Vector Search: high-performance gaming products

📝 Query embedding generated (dimensions: 3072)

✅ Search results: Top 3 products


[1] Odyssey OLED G9 (Similarity: 0.2824)
   - Category: Monitor
   - Price: 2,690,000 won
   - Description: Odyssey OLED G9 is a Monitor product from Samsung. The price is 2,690,000 won, and the main specific...

[2] ViewFinity S9 5K (Similarity: 0.2725)
   - Category: Monitor
   - Price: 2,290,000 won
   - Description: ViewFinity S9 5K is a Monitor product from Samsung. The price is 2,290,000 won, and the main specifi...

[3] Galaxy Book4 Ultra (Similarity: 0.2309)
   - Category: Laptop
   - Price: 3,290,000 won
   - Description: Galaxy Book4 Ultra is a Laptop product from Samsung. The price is 3,290,000 won, and the main specif...



#### Example 2: Search for "portable devices for work"

In [22]:
results = vector_search("portable devices for work", top_k=3)


🔍 Vector Search: portable devices for work

📝 Query embedding generated (dimensions: 3072)

✅ Search results: Top 3 products


[1] Galaxy Book4 Pro (Similarity: 0.2486)
   - Category: Laptop
   - Price: 2,490,000 won
   - Description: Galaxy Book4 Pro is a Laptop product from Samsung. The price is 2,490,000 won, and the main specific...

[2] Galaxy Book4 Ultra (Similarity: 0.2400)
   - Category: Laptop
   - Price: 3,290,000 won
   - Description: Galaxy Book4 Ultra is a Laptop product from Samsung. The price is 3,290,000 won, and the main specif...

[3] Galaxy Book3 Pro 360 (Similarity: 0.2372)
   - Category: Laptop
   - Price: 2,190,000 won
   - Description: Galaxy Book3 Pro 360 is a Laptop product from Samsung. The price is 2,190,000 won, and the main spec...



#### Example 3: Search for "high-resolution display products"

In [23]:
results = vector_search("high-resolution display products", top_k=3)


🔍 Vector Search: high-resolution display products

📝 Query embedding generated (dimensions: 3072)

✅ Search results: Top 3 products


[1] ViewFinity S9 5K (Similarity: 0.4033)
   - Category: Monitor
   - Price: 2,290,000 won
   - Description: ViewFinity S9 5K is a Monitor product from Samsung. The price is 2,290,000 won, and the main specifi...

[2] Odyssey OLED G9 (Similarity: 0.3631)
   - Category: Monitor
   - Price: 2,690,000 won
   - Description: Odyssey OLED G9 is a Monitor product from Samsung. The price is 2,690,000 won, and the main specific...

[3] Galaxy Book4 Ultra (Similarity: 0.2868)
   - Category: Laptop
   - Price: 3,290,000 won
   - Description: Galaxy Book4 Ultra is a Laptop product from Samsung. The price is 3,290,000 won, and the main specif...


---

## Hybrid Search (NL2NoSQL + Vector Search)

Implement hybrid search combining structured queries and semantic search.

### 13. Hybrid Search Function

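Apply the NL2NoSQL filter first, then re-rank the filtered products by cosine similarity between the query embedding and each product embedding. If the filter matches no products, the function falls back to pure vector search.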

In [24]:
def hybrid_search(natural_language_query: str, top_k: int = 5):
    """
    Hybrid search: NL2NoSQL filtering + Vector search re-ranking
    
    Args:
        natural_language_query: Natural language question
        top_k: Number of top results to return
        
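    Returns:
        List of (product, similarity) tuples sorted by similarity, highest first.
        If the filter matches nothing, returns the result of vector_search().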
    """
    print(f"\n{'='*80}")
    print(f"🔍 Hybrid Search: {natural_language_query}")
    print(f"{'='*80}")
    
    # 1. NL2NoSQL: Structured filtering
    print(f"\n[Step 1] Generating NL2NoSQL query...")
    mongodb_query = nl_to_mongodb_query(natural_language_query)
    print(f"Generated MongoDB query:")
    print(json.dumps(mongodb_query, indent=2, ensure_ascii=False))
    
    # 2. Execute MongoDB query
    filtered_products = list(collection.find(mongodb_query))
    print(f"\n✅ Filtering results: {len(filtered_products)} products")
    
    if not filtered_products:
        print("\n⚠️ No products match the filtering criteria.")
        print("   Switching to pure vector search...")
        return vector_search(natural_language_query, top_k)
    
    # 3. Vector search: Semantic re-ranking of filtered results
    print(f"\n[Step 2] Re-ranking with vector search...")
    query_embedding = get_embedding(natural_language_query)
    
    results_with_scores = []
    for product in filtered_products:
        if "embedding" in product and product["embedding"]:
            similarity = cosine_similarity(query_embedding, product["embedding"])
            results_with_scores.append((product, similarity))
    
    # 4. Sort by similarity
    results_with_scores.sort(key=lambda x: x[1], reverse=True)
    
    # 5. Top k results
    top_results = results_with_scores[:top_k]
    
    print(f"\n✅ Final results: Top {len(top_results)} products")
    print("\n" + "="*80)
    
    for idx, (product, score) in enumerate(top_results, 1):
        print(f"\n[{idx}] {product['name']} (Similarity: {score:.4f})")
        print(f"   - Category: {product['category']}")
        print(f"   - Price: {product['price']:,} won")
        print(f"   - Key specifications:")
        for spec in product['specifications'][:3]:  # Show first 3 specs
            print(f"     • {spec['name']}: {spec['value']} {spec['unit']}")
    
    print("\n" + "="*80)
    return top_results

print("✅ hybrid_search function defined")

✅ hybrid_search function defined
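
The re-ranking step inside `hybrid_search` can be tried in isolation. The following is a minimal sketch, not the notebook's own code: `cosine_sim` and `rerank_by_similarity` are hypothetical names chosen so they do not shadow the notebook's `cosine_similarity`, and the 3-dimensional vectors are made up for illustration (the real embeddings have 3072 dimensions).

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rerank_by_similarity(query_embedding, candidates, top_k=5):
    """Score each candidate against the query and return the top_k, highest first."""
    scored = [
        (doc, cosine_sim(query_embedding, doc["embedding"]))
        for doc in candidates
        if doc.get("embedding")  # documents without an embedding are skipped
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy documents standing in for the filtered MongoDB results (illustrative only)
toy_products = [
    {"name": "B", "embedding": [0.6, 0.8, 0.0]},
    {"name": "A", "embedding": [1.0, 0.0, 0.0]},
    {"name": "C"},  # no embedding, so it never reaches the ranking
]
ranked = rerank_by_similarity([1.0, 0.0, 0.0], toy_products, top_k=2)
print([doc["name"] for doc, _ in ranked])
```

As in `hybrid_search`, a document without an embedding is dropped silently, so the number of ranked results can be smaller than the number of filtered products.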


### 14. Hybrid Search Examples

Search examples considering both structured conditions and semantic meaning.

#### Example 1: "Lightweight high-performance laptop" (weight < 1.5kg + semantic)

In [25]:
results = hybrid_search("High-performance laptop lighter than 1.5kg", top_k=3)


🔍 Hybrid Search: High-performance laptop lighter than 1.5kg

[Step 1] Generating NL2NoSQL query...
Generated MongoDB query:
{
  "category": "Laptop",
  "specifications": {
    "$elemMatch": {
      "name": "Weight",
      "value": {
        "$lt": 1.5
      }
    }
  }
}

✅ Filtering results: 2 products

[Step 2] Re-ranking with vector search...

✅ Final results: Top 2 products


[1] Galaxy Book4 Ultra (Similarity: 0.3537)
   - Category: Laptop
   - Price: 3,290,000 won
   - Key specifications:
     • Screen Size: 14 inches
     • Resolution: 2880x1800 pixels
     • Processor: Intel Core Ultra 9 

[2] Galaxy Book3 Pro 360 (Similarity: 0.3259)
   - Category: Laptop
   - Price: 2,190,000 won
   - Key specifications:
     • Screen Size: 13.3 inches
     • Resolution: 1920x1080 pixels
     • Processor: Intel Core i7-1360P 
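
The generated filter uses `$elemMatch` so that the name and the value condition must hold for the same element of the `specifications` array. With plain dot notation, MongoDB lets each condition be satisfied by a different element. The sketch below emulates both behaviors in plain Python to show the difference; the helper names and the sample document are hypothetical, and no database is required.

```python
def elem_match(specs, name, upper_bound):
    """Emulates {"$elemMatch": {"name": name, "value": {"$lt": upper_bound}}}:
    a single array element must satisfy both conditions."""
    return any(
        s["name"] == name
        and isinstance(s["value"], (int, float))
        and s["value"] < upper_bound
        for s in specs
    )

def dot_notation_match(specs, name, upper_bound):
    """Emulates {"specifications.name": name,
                 "specifications.value": {"$lt": upper_bound}}:
    each condition may be satisfied by a different array element."""
    has_name = any(s["name"] == name for s in specs)
    has_small_value = any(
        isinstance(s["value"], (int, float)) and s["value"] < upper_bound
        for s in specs
    )
    return has_name and has_small_value

# Hypothetical laptop: weight is above 1.5, but another spec has a value below 1.5
heavy_laptop_specs = [
    {"name": "Weight", "value": 1.86, "unit": "kg"},
    {"name": "Webcam", "value": 1.0, "unit": "MP"},
]
print(elem_match(heavy_laptop_specs, "Weight", 1.5))          # False
print(dot_notation_match(heavy_laptop_specs, "Weight", 1.5))  # True
```

The dot-notation form would wrongly include this laptop in a "lighter than 1.5kg" search, which is why the element-level form is the safer choice for specification arrays.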


#### Example 2: "Laptop for work with large storage" (storage >= 512GB + semantic)

In [26]:
results = hybrid_search("Laptop for work with 512GB or more storage", top_k=3)


🔍 Hybrid Search: Laptop for work with 512GB or more storage

[Step 1] Generating NL2NoSQL query...
Generated MongoDB query:
{
  "category": "Laptop",
  "specifications": {
    "$elemMatch": {
      "name": "Storage",
      "value": {
        "$gte": 512
      }
    }
  }
}

✅ Filtering results: 3 products

[Step 2] Re-ranking with vector search...

✅ Final results: Top 3 products


[1] Galaxy Book4 Pro (Similarity: 0.3276)
   - Category: Laptop
   - Price: 2,490,000 won
   - Key specifications:
     • Screen Size: 16 inches
     • Resolution: 3200x2000 pixels
     • Processor: Intel Core Ultra 7 

[2] Galaxy Book4 Ultra (Similarity: 0.3229)
   - Category: Laptop
   - Price: 3,290,000 won
   - Key specifications:
     • Screen Size: 14 inches
     • Resolution: 2880x1800 pixels
     • Processor: Intel Core Ultra 9 

[3] Galaxy Book3 Pro 360 (Similarity: 0.3191)
   - Category: Laptop
   - Price: 2,190,000 won
   - Key specifications:
     • Screen Size: 13.3 inches
     • Resolution: 19

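#### Example 3: "High-performance gaming monitor" (category=Monitor + semantic)

In this run the generated filter also added a case-insensitive `name` regex for "gaming". No monitor name contains that word, so the filter matched nothing and the function fell back to pure vector search, which is why a Laptop appears in the results.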

In [27]:
results = hybrid_search("High-performance gaming monitor", top_k=3)


🔍 Hybrid Search: High-performance gaming monitor

[Step 1] Generating NL2NoSQL query...
Generated MongoDB query:
{
  "category": "Monitor",
  "name": {
    "$regex": "gaming",
    "$options": "i"
  }
}

✅ Filtering results: 0 products

⚠️ No products match the filtering criteria.
   Switching to pure vector search...

🔍 Vector Search: High-performance gaming monitor

📝 Query embedding generated (dimensions: 3072)

✅ Search results: Top 3 products


[1] Odyssey OLED G9 (Similarity: 0.4543)
   - Category: Monitor
   - Price: 2,690,000 won
   - Description: Odyssey OLED G9 is a Monitor product from Samsung. The price is 2,690,000 won, and the main specific...

[2] ViewFinity S9 5K (Similarity: 0.4386)
   - Category: Monitor
   - Price: 2,290,000 won
   - Description: ViewFinity S9 5K is a Monitor product from Samsung. The price is 2,290,000 won, and the main specifi...

[3] Galaxy Book4 Ultra (Similarity: 0.2659)
   - Category: Laptop
   - Price: 3,290,000 won
   - Description: Galaxy Bo
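
When a filter returns zero products because of an over-specific text condition, one option is to relax the filter before giving up on structured filtering entirely. The sketch below is an assumption about how that could look, not part of the notebook: `drop_regex_conditions` is a hypothetical helper that only handles flat, top-level filters.

```python
def drop_regex_conditions(query: dict) -> dict:
    """Return a copy of a flat MongoDB filter without top-level $regex conditions."""
    return {
        field: condition
        for field, condition in query.items()
        if not (isinstance(condition, dict) and "$regex" in condition)
    }

# The filter generated for "High-performance gaming monitor" in the run above
generated_query = {
    "category": "Monitor",
    "name": {"$regex": "gaming", "$options": "i"},
}
relaxed_query = drop_regex_conditions(generated_query)
print(relaxed_query)  # {'category': 'Monitor'}
```

Inside `hybrid_search`, the relaxed filter could be tried with `collection.find(relaxed_query)` before the fallback to `vector_search`, so that the category constraint survives and only monitors are re-ranked.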

#### Example 4: "Premium products under 2 million won" (price + semantic)

In [28]:
results = hybrid_search("Premium products under 2 million won", top_k=3)


🔍 Hybrid Search: Premium products under 2 million won

[Step 1] Generating NL2NoSQL query...
Generated MongoDB query:
{
  "price": {
    "$lt": 2000000
  }
}

✅ Filtering results: 3 products

[Step 2] Re-ranking with vector search...

✅ Final results: Top 3 products


[1] Galaxy Tab S9 Ultra (Similarity: 0.3613)
   - Category: Tablet
   - Price: 1,650,000 won
   - Key specifications:
     • Screen Size: 14.6 inches
     • Resolution: 2960x1848 pixels
     • Processor: Snapdragon 8 Gen 2 

[2] Galaxy S24 Ultra (Similarity: 0.3569)
   - Category: Smartphone
   - Price: 1,698,400 won
   - Key specifications:
     • Screen Size: 6.8 inches
     • Resolution: 3120x1440 pixels
     • Processor: Snapdragon 8 Gen 3 

[3] Galaxy Book3 (Similarity: 0.3504)
   - Category: Laptop
   - Price: 1,590,000 won
   - Key specifications:
     • Screen Size: 15.6 inches
     • Resolution: 1920x1080 pixels
     • Processor: Intel Core i5-1335U 

## 15. Clean Up

Clean up the MongoDB container when done.

In [29]:
# Close MongoDB connection
mongo_client.close()
print("✅ MongoDB connection closed")

# Stop and remove Docker container
result = subprocess.run(["docker", "stop", "mongodb-nl2nosql"], capture_output=True, text=True)
if result.returncode == 0:
    print("✅ MongoDB container stopped")

result = subprocess.run(["docker", "rm", "mongodb-nl2nosql"], capture_output=True, text=True)
if result.returncode == 0:
    print("✅ MongoDB container removed")

✅ MongoDB connection closed
✅ MongoDB container stopped
✅ MongoDB container removed


## Summary

This notebook implemented the following:

### 1. Basic Setup
- **MongoDB Setup**: Run MongoDB container using Docker
- **Data Modeling**: Structure Samsung Electronics product data with specifications array
- **Embedding Generation**: Generate vector embeddings of product descriptions using Azure OpenAI Text Embedding

### 2. Three Search Methods
1. **NL2NoSQL**: Convert natural language to MongoDB queries for structured filtering
   - Exact numeric comparisons (e.g., "smaller than 15 inches")
   - Category filtering
   - Price range search

2. **Vector Search**: Semantic similarity-based search
   - Find semantically similar products
   - Calculate similarity with natural language descriptions
   - Recommend related products without exact conditions

3. **Hybrid Search**: Combine NL2NoSQL + Vector Search
   - Step 1: Filter with structured queries
   - Step 2: Semantic re-ranking with vector search
   - Balance between accuracy and flexibility

### Key Technology Stack
- **Database**: MongoDB (NoSQL)
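- **AI Models**:
  - Azure OpenAI GPT chat deployment (`gpt-4.1-mini` in this run) for NL2NoSQL conversion
  - Azure OpenAI Text Embedding (`text-embedding-3-large`, 3072 dimensions) for semantic search
- **Container**: Docker
- **Python Libraries**: pymongo, openai, python-dotenv, numpy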

### Search Method Comparison

| Method | Advantages | Disadvantages | Use Cases |
|--------|-----------|---------------|-----------|
| NL2NoSQL | Exact condition filtering, fast | No semantic similarity | Clear condition search |
| Vector Search | Semantic similarity, flexible | Hard to apply exact conditions | Conceptual search |
| Hybrid | Accuracy + Flexibility | Increased processing time | Complex condition search |

### Extension Ideas
- MongoDB Atlas Vector Search integration (production environment)
- Support for more complex queries (AND, OR, nested conditions)
- Generate natural language responses for search results
- Improve search results based on user feedback
- Multi-language search support
- Search log analysis and personalization
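
For the first extension idea, the in-Python similarity loop could be replaced by the Atlas `$vectorSearch` aggregation stage. The sketch below only builds the pipeline and does not run it. It assumes an Atlas cluster with a vector index on the `embedding` field; the index name `vector_index` and the helper `build_vector_search_pipeline` are hypothetical. Fields used in `filter` must also be declared as filter fields in that index, and Atlas accepts only a restricted set of operators there, so filters with `$elemMatch` or `$regex` like the ones generated above may need to be applied in a later `$match` stage instead.

```python
def build_vector_search_pipeline(query_vector, top_k=5, pre_filter=None,
                                 index_name="vector_index", path="embedding"):
    """Build an aggregation pipeline that uses the Atlas $vectorSearch stage."""
    stage = {
        "index": index_name,         # assumed index name, adjust to your cluster
        "path": path,                # document field that holds the embedding
        "queryVector": query_vector,
        # Candidates considered before the final limit; must be >= limit
        "numCandidates": min(max(top_k * 20, 100), 10000),
        "limit": top_k,
    }
    if pre_filter:
        stage["filter"] = pre_filter
    return [
        {"$vectorSearch": stage},
        # Return a few display fields plus the similarity score
        {"$project": {
            "_id": 0,
            "name": 1,
            "category": 1,
            "price": 1,
            "score": {"$meta": "vectorSearchScore"},
        }},
    ]

pipeline = build_vector_search_pipeline(
    query_vector=[0.1, 0.2, 0.3],            # placeholder, use get_embedding(...)
    top_k=3,
    pre_filter={"price": {"$lt": 2000000}},
)
# On an Atlas cluster: results = list(collection.aggregate(pipeline))
```

This stage is not available in the local Docker MongoDB container used in this notebook, which is why the notebook computes similarities in Python.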