## PART 1: ENVIRONMENT SETUP (5 minutes)

### Step 1.1: Create accounts (all FREE)

```bash
# 1. Google Colab - FREE GPU T4
https://colab.research.google.com/

# 2. MongoDB Atlas - FREE 512MB (M0 cluster)
https://www.mongodb.com/cloud/atlas/register
→ Sign up → Create FREE cluster
→ Database Access → Add user (username + password)
→ Network Access → Add IP Address → Allow access from anywhere (0.0.0.0/0)
→ Database → Connect → Drivers → Copy connection string
→ Save it as: CONNECTION_STRING

# 3. Upstash Redis - FREE 10,000 requests/day
https://upstash.com/
→ Sign up → Create Database → Choose the "Free" plan
→ Copy: UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN

# 4. Hugging Face - FREE
https://huggingface.co/
→ Settings → Access Tokens → New token
→ Copy the token; it is used to download models
```
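
The credentials collected above can be kept out of the notebook by exporting them as environment variables and reading them at runtime. A minimal sketch, assuming the variable names `MONGODB_URI`, `UPSTASH_REDIS_REST_URL`, `UPSTASH_REDIS_REST_TOKEN`, and `HF_TOKEN`; `load_credentials` is a hypothetical helper, not part of any library:

```python
import os

# Variable names assumed for this sketch; adjust them to however you store your secrets.
REQUIRED = ("MONGODB_URI", "UPSTASH_REDIS_REST_URL", "UPSTASH_REDIS_REST_TOKEN")
OPTIONAL = ("HF_TOKEN",)


def load_credentials(env=None):
    """Return a dict of credentials, raising if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    creds = {name: env[name] for name in REQUIRED}
    # Optional values default to None when unset.
    creds.update({name: env.get(name) for name in OPTIONAL})
    return creds
```

Calling `load_credentials()` at the top of the notebook fails early with a readable message instead of a connection error later on.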

### Step 1.2: Open a new Google Colab notebook

```python
# Select Runtime → Change runtime type → T4 GPU
# Verify that the GPU is attached
!nvidia-smi
```

## PART 2: INSTALL LIBRARIES (10 minutes)

### Step 2.1: Install packages

In [None]:
# Cell 1: Fix the NumPy version conflict
# Pin NumPy below 2.0 so it stays compatible with the Torch and Transformers builds in use
!pip install "numpy<2.0"

# Kill the current process so Colab restarts the runtime and picks up the pinned NumPy
import os
print("Restarting the runtime to apply the NumPy change...")
os.kill(os.getpid(), 9)
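
After the runtime restarts, it can be worth confirming that the pinned version is the one actually installed. A minimal sketch using only the standard library; `is_numpy_1x` is a hypothetical helper name:

```python
from importlib import metadata


def is_numpy_1x(version: str) -> bool:
    """True when a version string such as '1.26.4' has a major version below 2."""
    return int(version.split(".")[0]) < 2


try:
    installed = metadata.version("numpy")
    print(f"NumPy {installed} - compatible: {is_numpy_1x(installed)}")
except metadata.PackageNotFoundError:
    print("NumPy is not installed in this environment")
```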

## PART 3: CONFIGURATION SETUP (5 minutes)

### Step 3.1: Enter credentials


In [None]:
import os

from pymongo import MongoClient
from upstash_redis import Redis  # assumes the upstash-redis package is installed

# MongoDB Atlas (free M0 cluster): connection string read from the environment
MONGODB_URI = os.getenv("MONGODB_URI")
DATABASE_NAME = "test"
COLLECTION_NAME = "foods"

# Upstash Redis (free tier): REST URL and token read from the environment
REDIS_URL = os.getenv("UPSTASH_REDIS_REST_URL")
REDIS_TOKEN = os.getenv("UPSTASH_REDIS_REST_TOKEN")

# Hugging Face (optional but recommended)
HF_TOKEN = os.getenv('HF_TOKEN')  # Speeds up model downloads

# Test MongoDB connection
print("Testing MongoDB connection...")
mongo_client = MongoClient(MONGODB_URI)
db = mongo_client[DATABASE_NAME]
collection = db[COLLECTION_NAME]

# Round-trip test: insert a document, then delete it
test_doc = {"test": "connection"}
collection.insert_one(test_doc)
collection.delete_one({"test": "connection"})
print(f"✅ MongoDB connected: {db.name}")

print("\nTesting Redis connection...")
redis_client = Redis(url=REDIS_URL, token=REDIS_TOKEN)
redis_client.set("test", "hello")
print(f"✅ Redis connected: {redis_client.get('test')}")

### Step 4.1: Configure 4-bit quantization

### Step 4.2: Load the model and processor

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted successfully!")

## PART 5: SET UP THE EMBEDDING MODEL (5 minutes)

### Step 5.1: Load the embedding model

In [None]:
import torch
from langchain_huggingface import HuggingFaceEmbeddings  # assumes the langchain-huggingface package is installed

print("=" * 60)
print("PART 5: SET UP THE EMBEDDING MODEL")
print("=" * 60)

print("\n✅ Using: MANUAL EMBEDDING with multilingual-e5-small")
print("   Reason: good support for both Vietnamese and English")
print("   Dimensions: 384 (matches the existing index)")

print("\nLoading multilingual-e5-small...")
print("⏳ This takes about 1-2 minutes...")

embedding_model = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-small",
    model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

# Test embedding
test_text = "High protein breakfast recipe"
test_embedding = embedding_model.embed_query(test_text)
print(f"\n✅ Embedding model loaded successfully!")
print(f"   Embedding dimension: {len(test_embedding)}")
print(f"   Device: {embedding_model.model_kwargs['device']}")
print(f"   Sample embedding: {test_embedding[:5]}")
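
The model card for the E5 family recommends starting each input with `query: ` for search queries and `passage: ` for indexed documents, while the cells in this notebook embed raw text. The sketch below shows one way to add those prefixes. `with_e5_prefix` is a hypothetical helper, and whether the prefixes improve retrieval on this dataset has not been measured here:

```python
def with_e5_prefix(text: str, kind: str) -> str:
    """Prepend the E5-style prefix ('query' or 'passage') unless it is already present."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = f"{kind}: "
    stripped = text.strip()
    return stripped if stripped.startswith(prefix) else prefix + stripped


print(with_e5_prefix("High protein breakfast recipe", "query"))
```

With the model loaded above, a query would be embedded as `embedding_model.embed_query(with_e5_prefix(q, "query"))`. If the convention is adopted, stored documents need to be re-embedded with the `passage` prefix so both sides stay consistent.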

### Step 5.2: Create a function to extract text from a food document

In [None]:
# Category mapping (from FOOD_CATEGORIES)
CATEGORY_LABELS = {
    0: 'Dairy', 1: 'Eggs', 2: 'Fish', 3: 'Gluten', 4: 'Peanuts',
    5: 'Sesame', 6: 'Shellfish', 7: 'Soy', 8: 'Tree Nuts',
    9: 'Chocolate', 10: 'Cilantro', 11: 'Kale', 12: 'Mayonnaise',
    13: 'Mushrooms', 14: 'Mustard', 15: 'Olives', 16: 'Onions',
    17: 'Pickles', 18: 'Protein Powder', 19: 'Shakes & Smoothies', 20: 'Sugar',
    21: 'Blue Cheese', 22: 'Butter', 23: 'Cheese', 24: 'Cottage Cheese',
    25: 'Cream', 26: 'Goat Cheese', 27: 'Milk', 28: 'Whey Powder', 29: 'Yogurt',
    30: 'Red Meat', 31: 'Beef', 32: 'Lamb', 33: 'Pork & Bacon',
    34: 'Sausages and Luncheon Meats', 35: 'Poultry', 36: 'Chicken',
    37: 'Duck', 38: 'Turkey', 39: 'Cod', 40: 'Salmon', 41: 'Sardines',
    42: 'Tilapia', 43: 'Trout & Snapper', 44: 'Tuna', 45: 'Clams',
    46: 'Crab', 47: 'Lobster', 48: 'Mussels', 49: 'Oysters',
    50: 'Scallops', 51: 'Shrimp', 52: 'Squid', 53: 'Vegetables',
    54: 'Artichoke', 55: 'Arugula', 56: 'Asparagus', 57: 'Beets',
    58: 'Bell Peppers', 59: 'Broccoli', 60: 'Brussel Sprouts',
    61: 'Cabbage', 62: 'Carrots', 63: 'Cauliflower', 64: 'Celery',
    65: 'Chili Peppers', 66: 'Cucumber', 67: 'Eggplant', 68: 'Garlic',
    69: 'Lettuce', 70: 'Potatoes & Yams', 71: 'Radish', 72: 'Spinach',
    73: 'Squash', 74: 'Tomato', 75: 'Zucchini', 76: 'Fruit',
    77: 'Apple', 78: 'Avocado', 79: 'Banana', 80: 'Blueberries',
    81: 'Coconut', 82: 'Dates', 83: 'Grapes', 84: 'Kiwi',
    85: 'Lemon', 86: 'Lime', 87: 'Mango', 88: 'Melon',
    89: 'Orange', 90: 'Peaches & Plums', 91: 'Pineapple',
    92: 'Raisins', 93: 'Raspberries', 94: 'Strawberries',
    95: 'Edamame', 96: 'Soy Milk', 97: 'Soy Sauce', 98: 'Tempeh',
    99: 'Tofu', 100: 'Grains', 101: 'Barley', 102: 'Bread',
    103: 'Breakfast Cereals', 104: 'Corn', 105: 'Oats', 106: 'Pastas',
    107: 'Quinoa', 108: 'Rice', 109: 'Rye', 110: 'Wheat',
    111: 'Legumes', 112: 'Beans', 113: 'Chickpeas', 114: 'Hummus',
    115: 'Lentils', 116: 'Almonds', 117: 'Brazil Nuts', 118: 'Cashews',
    119: 'Hazelnuts', 120: 'Pecans', 121: 'Pistachios', 122: 'Walnuts',
    123: 'Fish Sauce', 124: 'Honey', 125: 'Ketchup', 126: 'Mayonnaise',
    127: 'Mustard', 128: 'Pickles', 129: 'Spices and Herbs',
    130: 'Sweets', 131: 'Soups, Sauces, and Gravies',
    132: 'Baked Products', 133: 'Beverages', 134: 'Fast Foods',
    135: 'Ethnic Foods', 136: 'Supplements'
}
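
Several labels in this mapping appear under more than one ID (`Mayonnaise` at 12 and 126, `Mustard` at 14 and 127, `Pickles` at 17 and 128), so a label-to-ID lookup is not one-to-one. A small sketch for finding such repeats, shown on an excerpt of the mapping; `find_duplicate_labels` is a hypothetical helper:

```python
from collections import defaultdict

# Excerpt of CATEGORY_LABELS, just enough to show the repeated labels.
labels_excerpt = {
    12: 'Mayonnaise', 14: 'Mustard', 17: 'Pickles', 20: 'Sugar',
    126: 'Mayonnaise', 127: 'Mustard', 128: 'Pickles',
}


def find_duplicate_labels(labels):
    """Map each label that occurs more than once to the sorted IDs that use it."""
    ids_by_label = defaultdict(list)
    for cat_id, label in labels.items():
        ids_by_label[label].append(cat_id)
    return {label: sorted(ids) for label, ids in ids_by_label.items() if len(ids) > 1}


print(find_duplicate_labels(labels_excerpt))
```

Because the embedding text only uses the label strings, the repeats are harmless for `create_embedding_text`; they matter only if labels are ever mapped back to IDs.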

In [None]:
def create_embedding_text(food_doc):
    """
    Build rich text content from a food document for embedding.

    Combines several important fields:
    - Name
    - Description
    - Categories
    - Directions
    - Properties (meal type, dietary, cooking method)
    - Major ingredients
    - Nutrition highlights (plus sugar, sodium, cholesterol, ...)
    """
    parts = []

    # 1. Name (the most important field)
    name = food_doc.get('name', '')
    if name:
        parts.append(f"Name: {name}")

    # 2. Description
    desc = food_doc.get('description')
    if desc and str(desc) not in ['nan', 'NaN', '']:
        parts.append(f"Description: {desc}")

    # 3. Categories
    categories = food_doc.get('categories', [])
    if categories and isinstance(categories, list):
        category_names = [CATEGORY_LABELS.get(cat, '') for cat in categories]
        category_names = [c for c in category_names if c]  # Remove empty strings
        if category_names:
            parts.append(f"Categories: {', '.join(category_names)}")

    # 4. Directions (cooking steps)
    directions = food_doc.get('directions', [])
    if directions and isinstance(directions, list):
        # Join the steps and cap the length at 500 characters
        directions_text = ' '.join(str(step) for step in directions)[:500]
        parts.append(f"Instructions: {directions_text}")

    # 5. Properties
    prop = food_doc.get('property') or {}  # tolerate a missing or null field

    # Meal types
    meal_types = []
    if prop.get('isBreakfast'): meal_types.append('breakfast')
    if prop.get('isLunch'): meal_types.append('lunch')
    if prop.get('isDinner'): meal_types.append('dinner')
    if prop.get('isSnack'): meal_types.append('snack')
    if prop.get('isDessert'): meal_types.append('dessert')
    if meal_types:
        parts.append(f"Meal types: {', '.join(meal_types)}")

    # Dietary properties
    dietary = []
    if prop.get('isHighProtein'): dietary.append('high protein')
    if prop.get('isLowCarb'): dietary.append('low carb')
    if prop.get('isLowFat'): dietary.append('low fat')
    if prop.get('isHighFiber'): dietary.append('high fiber')
    if prop.get('isLowSodium'): dietary.append('low sodium')
    if dietary:
        parts.append(f"Dietary: {', '.join(dietary)}")

    # Cooking methods
    cooking = []
    if prop.get('needsMicrowave'): cooking.append('microwave')
    if prop.get('needsOven'): cooking.append('oven')
    if prop.get('needsStove'): cooking.append('stove')
    if prop.get('needsGrill'): cooking.append('grill')
    if prop.get('needsBlender'): cooking.append('blender')
    if prop.get('needsSlowCooker'): cooking.append('slow cooker')
    if cooking:
        parts.append(f"Cooking methods: {', '.join(cooking)}")

    # Time info
    total_time = prop.get('totalTime')
    if total_time and total_time > 0:
        parts.append(f"Total time: {total_time} minutes")

    complexity = prop.get('complexity')
    if complexity:
        if complexity < 3:
            parts.append("Difficulty: very easy")
        elif complexity < 5:
            parts.append("Difficulty: easy")
        elif complexity < 7:
            parts.append("Difficulty: medium")
        else:
            parts.append("Difficulty: hard")

    # Dish type
    dish_types = []
    if prop.get('mainDish'): dish_types.append('main dish')
    if prop.get('sideDish'): dish_types.append('side dish')
    if dish_types:
        parts.append(f"Dish type: {', '.join(dish_types)}")

    # Major ingredients
    major_ing = prop.get('majorIngredients', '')
    if major_ing:
        # Clean up: "microwaved-sweet-potato" → "microwaved sweet potato"
        major_ing_clean = major_ing.replace('-', ' ')
        parts.append(f"Main ingredients: {major_ing_clean}")

    # 6. Nutrition highlights (extended)
    nutrition = food_doc.get('nutrition') or {}  # tolerate a missing or null field
    nutr_parts = []

    calories = nutrition.get('calories')
    if calories and calories > 0:
        nutr_parts.append(f"{round(calories)} calories")

    protein = nutrition.get('proteins')
    if protein and protein > 5:
        nutr_parts.append(f"{round(protein, 1)}g protein")

    carbs = nutrition.get('carbs')
    if carbs and carbs > 10:
        nutr_parts.append(f"{round(carbs, 1)}g carbs")

    fats = nutrition.get('fats')
    if fats and fats > 5:
        nutr_parts.append(f"{round(fats, 1)}g fat")

    fiber = nutrition.get('fiber')
    if fiber and fiber > 3:
        nutr_parts.append(f"{round(fiber, 1)}g fiber")

    # Other important nutrients
    sugar = nutrition.get('sugar')
    if sugar and sugar > 5:
        nutr_parts.append(f"{round(sugar, 1)}g sugar")

    sodium = nutrition.get('sodium')
    if sodium and sodium > 200:
        nutr_parts.append(f"{round(sodium)}mg sodium")

    cholesterol = nutrition.get('cholesterol')
    if cholesterol and cholesterol > 50:
        nutr_parts.append(f"{round(cholesterol)}mg cholesterol")

    vitaminC = nutrition.get('vitC')
    if vitaminC and vitaminC > 10:
        nutr_parts.append(f"{round(vitaminC, 1)}mg vitamin C")

    calcium = nutrition.get('calcium')
    if calcium and calcium > 100:
        nutr_parts.append(f"{round(calcium)}mg calcium")

    iron = nutrition.get('iron')
    if iron and iron > 2:
        nutr_parts.append(f"{round(iron, 1)}mg iron")

    potassium = nutrition.get('potassium')
    if potassium and potassium > 300:
        nutr_parts.append(f"{round(potassium)}mg potassium")

    if nutr_parts:
        parts.append(f"Nutrition: {', '.join(nutr_parts)}")

    # Combine all parts
    text = '. '.join(parts)
    return text


# Test the function with one document
print("\n" + "=" * 60)
print("Testing create_embedding_text function...")
print("=" * 60)

sample_doc = collection.find_one()
if sample_doc:
    test_text = create_embedding_text(sample_doc)
    print(f"\n📄 Sample document: {sample_doc.get('name', 'Unknown')}")
    print(f"\n📝 Generated embedding text:")
    print(f"{test_text}")
    print(f"\n   Length: {len(test_text)} characters")
else:
    print("⚠️ No documents found in collection")

print("\n✅ Function ready!")
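
The nutrition part of `create_embedding_text` repeats the same threshold-and-round pattern twelve times. A table-driven sketch of that step, using the same field names, thresholds, and rounding as the function above; `NUTRIENT_RULES` and `nutrition_highlights` are hypothetical names introduced for illustration:

```python
# (field, minimum the value must exceed, decimals to keep, suffix)
NUTRIENT_RULES = [
    ("calories", 0, 0, " calories"),
    ("proteins", 5, 1, "g protein"),
    ("carbs", 10, 1, "g carbs"),
    ("fats", 5, 1, "g fat"),
    ("fiber", 3, 1, "g fiber"),
    ("sugar", 5, 1, "g sugar"),
    ("sodium", 200, 0, "mg sodium"),
    ("cholesterol", 50, 0, "mg cholesterol"),
    ("vitC", 10, 1, "mg vitamin C"),
    ("calcium", 100, 0, "mg calcium"),
    ("iron", 2, 1, "mg iron"),
    ("potassium", 300, 0, "mg potassium"),
]


def nutrition_highlights(nutrition):
    """Return formatted strings for every nutrient whose value exceeds its threshold."""
    parts = []
    for field, minimum, decimals, suffix in NUTRIENT_RULES:
        value = nutrition.get(field)
        # Non-numeric and missing values are skipped; NaN fails the comparison.
        if isinstance(value, (int, float)) and value > minimum:
            rounded = round(value, decimals) if decimals else round(value)
            parts.append(f"{rounded}{suffix}")
    return parts


sample = {"calories": 250.4, "proteins": 12.34, "carbs": 8, "sodium": 350}
print(nutrition_highlights(sample))
# ['250 calories', '12.3g protein', '350mg sodium']
```

Keeping the thresholds in one table makes them easier to review and tune than a dozen separate `if` blocks.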

## PART 6: UPDATE EMBEDDINGS FOR ALL DOCUMENTS (20-30 minutes)

### Step 6.1: Inspect the current collection

In [None]:
print("\n" + "=" * 60)
print("PART 6: UPDATE EMBEDDINGS")
print("=" * 60)

# Count the documents in the collection
total_docs = collection.count_documents({})
print(f"\n📊 Total documents in collection: {total_docs}")

# Count how many documents already have an embedding
docs_with_embedding = collection.count_documents({"embedding": {"$exists": True}})
docs_without_embedding = total_docs - docs_with_embedding

print(f"   Documents with embedding: {docs_with_embedding}")
print(f"   Documents without embedding: {docs_without_embedding}")

# Estimate time
estimated_time = (total_docs * 0.5) / 60  # ~0.5s per document
print(f"\n⏱️ Estimated time: {estimated_time:.1f} minutes")

proceed = input("\n👉 Proceed with embedding generation? (yes/no): ").strip().lower()

if proceed != 'yes':
    print("❌ Cancelled. Run this cell again when ready.")

### Step 6.2: Generate embeddings for all documents

In [None]:
if proceed == 'yes':
    print("\n" + "=" * 60)
    print("GENERATING EMBEDDINGS...")
    print("=" * 60)

    import time
    from datetime import datetime

    updated_count = 0
    error_count = 0
    start_time = time.time()

    # Process all documents
    for doc in collection.find({}):
        try:
            # 1. Create embedding text
            text_content = create_embedding_text(doc)

            # 2. Generate embedding
            embedding = embedding_model.embed_query(text_content)

            # 3. Update document
            collection.update_one(
                {"_id": doc["_id"]},
                {
                    "$set": {
                        "text_content": text_content,
                        "embedding": embedding,
                        "embedding_updated_at": datetime.utcnow()
                    }
                }
            )

            updated_count += 1

            # Progress indicator
            if updated_count % 50 == 0:
                elapsed = time.time() - start_time
                rate = updated_count / elapsed
                remaining = (total_docs - updated_count) / rate
                print(f"   Progress: {updated_count}/{total_docs} "
                      f"({updated_count/total_docs*100:.1f}%) "
                      f"- ETA: {remaining/60:.1f} min")

        except Exception as e:
            error_count += 1
            print(f"   ❌ Error for {doc.get('name', 'Unknown')}: {str(e)}")
            if error_count > 10:
                print("   Too many errors. Stopping...")
                break

    # Summary
    elapsed_total = time.time() - start_time
    print("\n" + "=" * 60)
    print("EMBEDDING GENERATION COMPLETE!")
    print("=" * 60)
    print(f"✅ Successfully updated: {updated_count} documents")
    print(f"❌ Errors: {error_count} documents")
    print(f"⏱️ Total time: {elapsed_total/60:.1f} minutes")
    print(f"⚡ Average speed: {updated_count/elapsed_total:.2f} docs/sec")

    # Verify
    final_count = collection.count_documents({"embedding": {"$exists": True}})
    print(f"\n📊 Final count with embeddings: {final_count}/{total_docs}")
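
The loop above embeds one document per call. On a GPU it is usually faster to embed several texts at once, so the sketch below groups documents into batches. It takes the text builder and the embedding function as parameters so it runs without a database or a model; `batched` and `embed_in_batches` are hypothetical helpers, and the batch size of 32 is an arbitrary starting point:

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def embed_in_batches(docs, make_text, embed_many, batch_size=32):
    """Yield (doc_id, text, vector) triples, embedding up to `batch_size` texts per call."""
    for chunk in batched(docs, batch_size):
        texts = [make_text(doc) for doc in chunk]
        vectors = embed_many(texts)
        for doc, text, vector in zip(chunk, texts, vectors):
            yield doc["_id"], text, vector


# Demo with stand-ins: the "embedding" is just the text length.
demo_docs = [{"_id": i, "name": f"food-{i}"} for i in range(5)]
for doc_id, text, vector in embed_in_batches(
    demo_docs, lambda d: d["name"], lambda texts: [[len(t)] for t in texts], batch_size=2
):
    print(doc_id, text, vector)
```

In the notebook, `make_text` would be `create_embedding_text` and `embed_many` would be `embedding_model.embed_documents`; each triple can then be written back with the same `update_one` call used above.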

### Step 6.3: Test vector search

In [None]:
def mongodb_vector_search(query, k=5):
    """
    Vector search in MongoDB Atlas

    Args:
        query: Text query (English or Vietnamese)
        k: Number of results to return

    Returns:
        List of matching documents with scores
    """
    query_embedding = embedding_model.embed_query(query)

    # Aggregation pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",  # Name of your index
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 100,
                "limit": k
            }
        },
        {
            "$project": {
                "name": 1,
                "text_content": 1,
                "categories": 1,
                "nutrition.calories": 1,
                "nutrition.proteins": 1,
                "nutrition.carbs": 1,
                "nutrition.fats": 1,
                "nutrition.fiber": 1,
                "nutrition.sugar": 1,
                "nutrition.sodium": 1,
                "nutrition.cholesterol": 1,
                "nutrition.vitC": 1,
                "nutrition.calcium": 1,
                "nutrition.iron": 1,
                "nutrition.potassium": 1,
                "property.isBreakfast": 1,
                "property.isLunch": 1,
                "property.isDinner": 1,
                "property.isSnack": 1,
                "property.isDessert": 1,
                "property.mainDish": 1,
                "property.sideDish": 1,
                "property.totalTime": 1,
                "property.complexity": 1,
                "property.isHighProtein": 1,
                "property.isLowCarb": 1,
                "property.isLowFat": 1,
                "property.isHighFiber": 1,
                "property.majorIngredients": 1,
                "score": {"$meta": "vectorSearchScore"}
            }
        }
    ]

    # Assumes `collection` has already been defined, for example:
    # from pymongo import MongoClient
    # client = MongoClient("your_connection_string")
    # db = client["your_database"]
    # collection = db["foods"]

    results = list(collection.aggregate(pipeline))
    return results


In [None]:
def print_search_result(doc, index):
    """Helper that pretty-prints a single search result."""
    print(f"\n{'─' * 60}")
    print(f"🍽️  {index}. {doc.get('name', 'Unknown')}")
    print(f"{'─' * 60}")
    print(f"📊 Match Score: {doc.get('score', 0):.4f}")

    # Categories
    categories = doc.get('categories', [])
    if categories:
        cat_names = [CATEGORY_LABELS.get(c, f'Cat#{c}') for c in categories]
        print(f"🏷️  Categories: {', '.join(cat_names)}")

    # Nutrition
    nutr = doc.get('nutrition', {})
    nutr_info = []
    if nutr.get('calories'):
        nutr_info.append(f"{nutr['calories']:.0f} cal")
    if nutr.get('proteins'):
        nutr_info.append(f"{nutr['proteins']:.1f}g protein")
    if nutr.get('carbs'):
        nutr_info.append(f"{nutr['carbs']:.1f}g carbs")
    if nutr.get('fats'):
        nutr_info.append(f"{nutr['fats']:.1f}g fat")
    if nutr.get('fiber'):
        nutr_info.append(f"{nutr['fiber']:.1f}g fiber")
    if nutr.get('sugar'):
        nutr_info.append(f"{nutr['sugar']:.1f}g sugar")

    if nutr_info:
        print(f"🥗 Nutrition: {' | '.join(nutr_info)}")

    # Additional nutrition
    extra_nutr = []
    if nutr.get('sodium') and nutr['sodium'] > 100:
        extra_nutr.append(f"{nutr['sodium']:.0f}mg sodium")
    if nutr.get('cholesterol') and nutr['cholesterol'] > 10:
        extra_nutr.append(f"{nutr['cholesterol']:.0f}mg cholesterol")
    if nutr.get('vitC') and nutr['vitC'] > 5:
        extra_nutr.append(f"{nutr['vitC']:.1f}mg vitamin C")
    if nutr.get('calcium') and nutr['calcium'] > 50:
        extra_nutr.append(f"{nutr['calcium']:.0f}mg calcium")
    if nutr.get('iron') and nutr['iron'] > 1:
        extra_nutr.append(f"{nutr['iron']:.1f}mg iron")
    if nutr.get('potassium') and nutr['potassium'] > 200:
        extra_nutr.append(f"{nutr['potassium']:.0f}mg potassium")

    if extra_nutr:
        print(f"💊 Minerals: {' | '.join(extra_nutr)}")

    # Properties
    prop = doc.get('property', {})

    # Meal types
    meal_types = []
    if prop.get('isBreakfast'): meal_types.append('Breakfast')
    if prop.get('isLunch'): meal_types.append('Lunch')
    if prop.get('isDinner'): meal_types.append('Dinner')
    if prop.get('isSnack'): meal_types.append('Snack')
    if prop.get('isDessert'): meal_types.append('Dessert')
    if meal_types:
        print(f"⏰ Meal: {', '.join(meal_types)}")

    # Dish type
    dish_type = []
    if prop.get('mainDish'): dish_type.append('Main dish')
    if prop.get('sideDish'): dish_type.append('Side dish')
    if dish_type:
        print(f"🍽️  Type: {', '.join(dish_type)}")

    # Dietary tags
    dietary = []
    if prop.get('isHighProtein'): dietary.append('High Protein')
    if prop.get('isLowCarb'): dietary.append('Low Carb')
    if prop.get('isLowFat'): dietary.append('Low Fat')
    if prop.get('isHighFiber'): dietary.append('High Fiber')
    if dietary:
        print(f"🏷️  Tags: {', '.join(dietary)}")

    # Time & complexity
    if prop.get('totalTime'):
        print(f"⏱️  Time: {prop['totalTime']} min", end='')
        complexity = prop.get('complexity', 0)
        if complexity:
            if complexity < 3:
                diff = "Very Easy"
            elif complexity < 5:
                diff = "Easy"
            elif complexity < 7:
                diff = "Medium"
            else:
                diff = "Hard"
            print(f" | Difficulty: {diff}")
        else:
            print()

    # Major ingredients
    major_ing = prop.get('majorIngredients', '')
    if major_ing:
        print(f"🥘 Ingredients: {major_ing.replace('-', ' ')}")

    # Text content preview
    text_content = doc.get('text_content', '')
    if text_content:
        preview = text_content[:150] + "..." if len(text_content) > 150 else text_content
        print(f"📝 Content: {preview}")
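
The complexity-to-difficulty mapping appears both here and in `create_embedding_text`. A small sketch that pulls it into one place, using the same cut-offs (below 3, below 5, below 7, otherwise hard); `difficulty_label` is a hypothetical helper name:

```python
def difficulty_label(complexity):
    """Map a numeric complexity score to a label; returns None when the score is missing or 0."""
    if not complexity:
        return None
    if complexity < 3:
        return "very easy"
    if complexity < 5:
        return "easy"
    if complexity < 7:
        return "medium"
    return "hard"


for score in (2, 4, 6, 9):
    print(score, difficulty_label(score))
```

Both functions could call this helper, with `print_search_result` applying `.title()` to get its capitalized variant.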

In [None]:
# ============================================================================
# RUN TESTS
# ============================================================================
print("\n" + "=" * 60)
print("TESTING VECTOR SEARCH")
print("=" * 60)

# Test case 1: English - High protein breakfast
print("\n🔍 Test 1: High Protein Breakfast")
print("Query: 'high protein breakfast recipes'")

results = mongodb_vector_search("high protein breakfast recipes", k=5)

if results:
    print(f"\n✅ Found {len(results)} results:")
    for i, doc in enumerate(results, 1):
        print_search_result(doc, i)
else:
    print("\n❌ No results found!")
    print("\n⚠️ Possible issues:")
    print("   1. Index name incorrect (check: 'vector_index')")
    print("   2. Index not ready yet (wait 1-2 minutes)")
    print("   3. Path incorrect (check: 'embedding')")
    print("   4. Collection empty or no embeddings generated")

# Test case 2: Vietnamese query
print("\n" + "=" * 60)
print("\n🔍 Test 2: Vietnamese Query")
print("Query: 'món ăn sáng giàu protein ít carb'")

results = mongodb_vector_search("món ăn sáng giàu protein ít carb", k=5)

if results:
    print(f"\n✅ Found {len(results)} results:")
    for i, doc in enumerate(results, 1):
        print_search_result(doc, i)
else:
    print("\n❌ No results found!")

# Test case 3: Nutrition-based query
print("\n" + "=" * 60)
print("\n🔍 Test 3: Nutrition-Based Query")
print("Query: 'low calorie high fiber vegetable side dish'")

results = mongodb_vector_search("low calorie high fiber vegetable side dish", k=5)

if results:
    print(f"\n✅ Found {len(results)} results:")
    for i, doc in enumerate(results, 1):
        print_search_result(doc, i)
else:
    print("\n❌ No results found!")

# Test case 4: Category + dietary preference
print("\n" + "=" * 60)
print("\n🔍 Test 4: Category + Dietary")
print("Query: 'quick easy microwave sweet potato low fat'")

results = mongodb_vector_search("quick easy microwave sweet potato low fat", k=3)

if results:
    print(f"\n✅ Found {len(results)} results:")
    for i, doc in enumerate(results, 1):
        print_search_result(doc, i)
else:
    print("\n❌ No results found!")

# Test case 5: Specific nutrient search
print("\n" + "=" * 60)
print("\n🔍 Test 5: Specific Nutrients")
print("Query: 'foods high in potassium and vitamin C low sugar'")

results = mongodb_vector_search("foods high in potassium and vitamin C low sugar", k=5)

if results:
    print(f"\n✅ Found {len(results)} results:")
    for i, doc in enumerate(results, 1):
        print_search_result(doc, i)
else:
    print("\n❌ No results found!")

# Final summary
# NOTE: `results` holds only the output of Test 5, so this check reflects the last query alone
print("\n" + "=" * 60)
if results:
    print("✅ VECTOR SEARCH WORKING!")
    print("\n📊 Search capabilities verified:")
    print("   ✓ English queries")
    print("   ✓ Vietnamese queries")
    print("   ✓ Nutrition-based search")
    print("   ✓ Category + dietary filters")
    print("   ✓ Specific nutrient search")
    print("   ✓ Rich metadata projection")
else:
    print("⚠️ VECTOR SEARCH NEEDS TROUBLESHOOTING")
    print("\n🔧 Check:")
    print("   1. MongoDB Atlas vector index created")
    print("   2. Embeddings generated for all documents")
    print("   3. Index name matches 'vector_index'")
    print("   4. Collection has data")
print("=" * 60)
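
The final summary looks only at the `results` variable left over from the last query. A sketch of a small runner that records the outcome of every query and reports which ones returned nothing; `run_search_checks`, `summarize`, and the stub `fake_search` are hypothetical names used for illustration:

```python
def run_search_checks(search_fn, queries, k=5):
    """Run each query and return {label: number_of_results}."""
    return {label: len(search_fn(query, k=k)) for label, query in queries.items()}


def summarize(counts):
    """Return (all_passed, failed_labels); a check passes if it found at least one result."""
    failed = [label for label, count in counts.items() if count == 0]
    return (not failed, failed)


def fake_search(query, k=5):
    # Stand-in for mongodb_vector_search so the sketch runs without a database.
    return [] if "xyz" in query else [{"name": "demo"}] * min(k, 2)


queries = {"English": "high protein breakfast", "Nonsense": "xyz"}
counts = run_search_checks(fake_search, queries)
all_passed, failed = summarize(counts)
print(counts, all_passed, failed)
```

Passing `mongodb_vector_search` instead of `fake_search` would run the same checks against the real index.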

## PART 7: RAG SYSTEM WITH A TEXT-ONLY LLM

In [None]:
# ============================================================================
# 2. QUANTIZATION CONFIG (4-bit to save VRAM)
# ============================================================================

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

print("⚙️ Quantization: 4-bit NF4")
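
As a rough guide to why 4-bit loading matters on a T4, weight memory scales with bits per parameter. The sketch below computes the size of the weights alone; the 7-billion-parameter figure is a hypothetical example, since this section does not say which model is loaded, and the estimate ignores quantization constants, activations, and the KV cache:

```python
def approx_weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # hypothetical 7B-parameter model, for illustration only
print(f"fp16 : {approx_weight_memory_gb(n_params, 16):.1f} GB")
print(f"4-bit: {approx_weight_memory_gb(n_params, 4):.1f} GB")
```

Under these assumptions the weights drop from about 14 GB to about 3.5 GB, which leaves headroom on a 16 GB T4 for the embedding model and generation.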

In [None]:
# ============================================================================
# 3. LOAD MODEL & TOKENIZER
# ============================================================================

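from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: `model_name` must already hold the Hugging Face ID of the causal LM to load;
# it is not defined anywhere in this section.
print("\n🔄 Loading model...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    # token=HF_TOKEN  # Uncomment if the model requires authentication
)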

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
    # token=HF_TOKEN
)

print("✅ Model loaded successfully!")
print(f"📍 Model device: {model.device}")
print(f"💾 Model dtype: {model.dtype}")


In [None]:
# ============================================================================
# 4. MONGODB VECTOR SEARCH
# ============================================================================

import os

from pymongo import MongoClient


def mongodb_vector_search(query, k=5):
    """
    Retrieve relevant documents from MongoDB.
    """

    # Generate query embedding
    query_embedding = embedding_model.embed_query(query)

    # MongoDB connection (connection string read from the MONGODB_URI environment variable)
    client = MongoClient(os.getenv("MONGODB_URI"))
    db = client["test"]
    collection = db["foods"]

    # Vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 100,
                "limit": k
            }
        },
        {
            "$project": {
                "name": 1,
                "text_content": 1,
                "nutrition": 1,
                "property": 1,
                "categories": 1,
                "score": {"$meta": "vectorSearchScore"}
            }
        }
    ]

    results = list(collection.aggregate(pipeline))
    return results

In [None]:
# ============================================================================
# 5. RAG FUNCTION
# ============================================================================

def format_context(results):
    """
    Format search results into a context string
    """
    context_parts = []

    for i, doc in enumerate(results, 1):
        # Use text_content (it already holds the full text that was embedded)
        text = doc.get('text_content', '')
        if text:
            context_parts.append(f"[Document {i}]\n{text}")

    return "\n\n".join(context_parts)


def rag_query(question, k=5, max_new_tokens=512, temperature=0.7):
    """
    Main RAG function: Retrieve + Generate

    Args:
        question: User's question
        k: Number of documents to retrieve
        max_new_tokens: Max tokens in response
        temperature: Generation temperature (0.0-1.0)

    Returns:
        Generated answer
    """
    print(f"\n🔍 Retrieving relevant documents for: '{question}'")

    # Step 1: Retrieve
    results = mongodb_vector_search(question, k=k)

    if not results:
        return "❌ No relevant documents found in the database."

    print(f"✅ Found {len(results)} relevant documents")

    # Step 2: Format context
    context = format_context(results)

    # Step 3: Create prompt
    prompt = f"""You are a helpful nutrition assistant. Use the following context to answer the user's question accurately and concisely.

Context:
{context}

Question: {question}

Answer: Provide a clear, accurate answer based on the context above. If the context doesn't contain enough information, say so."""

    # Step 4: Tokenize
    messages = [
        {"role": "system", "content": "You are a helpful nutrition assistant."},
        {"role": "user", "content": prompt}
    ]

    # Different tokenizers have different chat templates
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Step 5: Generate
    print("🤖 Generating answer...")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Step 6: Decode
    # Only decode the generated part (skip input)
    generated_tokens = outputs[0][inputs['input_ids'].shape[1]:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    return answer.strip()
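
Retrieved context is pasted into the prompt verbatim, so a large `k` or long `text_content` values can crowd out the question or exceed the model's context window. A minimal sketch of a capped variant of `format_context`; the name `format_context_capped` and the 4000-character default are assumptions for illustration, and a token-based budget would be more precise:

```python
def format_context_capped(results, max_chars=4000):
    """Join `text_content` fields into numbered blocks, stopping once `max_chars` would be exceeded.

    The first non-empty block is always kept, even if it is longer than the budget.
    Separators between blocks are not counted, so the cap is approximate.
    """
    blocks, used = [], 0
    for i, doc in enumerate(results, 1):
        text = doc.get("text_content", "")
        if not text:
            continue
        block = f"[Document {i}]\n{text}"
        if blocks and used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)


demo = [{"text_content": "a" * 50}, {"text_content": "b" * 50}, {"text_content": "c" * 50}]
print(format_context_capped(demo, max_chars=130).count("[Document"))
```

Because results arrive ordered by score, cutting from the end drops the least relevant documents first.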

In [None]:
# ============================================================================
# 6. TEST RAG SYSTEM
# ============================================================================

if __name__ == "__main__":
    print("\n" + "=" * 60)
    print("TESTING RAG SYSTEM")
    print("=" * 60)

    # Test questions
    test_questions = [
        "What are some high protein breakfast recipes?",
        "Show me low calorie vegetable side dishes",
        "Which foods are high in potassium?",
        "Món ăn sáng giàu protein là gì?",  # Vietnamese
        "What can I cook quickly in a microwave?"
    ]

    for i, question in enumerate(test_questions, 1):
        print(f"\n{'─' * 60}")
        print(f"Question {i}: {question}")
        print('─' * 60)

        answer = rag_query(question, k=3, max_new_tokens=256)

        print(f"\n💬 Answer:\n{answer}")

    print("\n" + "=" * 60)
    print("✅ RAG TESTING COMPLETE")
    print("=" * 60)

In [None]:
# ============================================================================
# 7. INTERACTIVE MODE (Optional)
# ============================================================================

def interactive_mode():
    """
    Interactive chat mode
    """
    print("\n" + "=" * 60)
    print("🤖 RAG NUTRITION ASSISTANT - Interactive Mode")
    print("=" * 60)
    print("Type 'exit' or 'quit' to stop\n")

    while True:
        question = input("❓ Your question: ").strip()

        if question.lower() in ['exit', 'quit', 'q']:
            print("\n👋 Goodbye!")
            break

        if not question:
            continue

        answer = rag_query(question, k=5)
        print(f"\n💬 Answer:\n{answer}\n")
        print("─" * 60 + "\n")

# Start interactive mode (comment out the call below to skip it)
interactive_mode()