# Option A: Domain-Specific AI Assistant

**Module:** 16 - Capstone Project (Phase 4)
**Time:** 35-45 hours total
**Difficulty:** ⭐⭐⭐⭐⭐

---

## 🎯 Project Overview

Build a complete AI assistant specialized for a specific domain:
- **Fine-tuned 70B LLM** using QLoRA
- **RAG system** with domain knowledge base
- **Custom tools** for domain-specific operations
- **Production API** with streaming responses
- **Evaluation framework** to measure quality

---

## 🎯 Learning Objectives

By completing this project, you will:
- [ ] Fine-tune a 70B parameter model with QLoRA on DGX Spark
- [ ] Build a production-quality RAG system
- [ ] Implement custom tools with proper error handling
- [ ] Create a streaming API with FastAPI
- [ ] Evaluate your assistant's performance systematically

---

## 📚 Prerequisites

- Module 10: LLM Fine-tuning (QLoRA)
- Module 13: AI Agents (RAG, tools)
- Module 12: Deployment (FastAPI, streaming)
- Module 11: Quantization (inference optimization)

---

## 🌍 Real-World Context

Domain-specific AI assistants are transforming industries:

| Domain | Example Use Cases | Value Created |
|--------|------------------|---------------|
| **DevOps/Cloud** | AWS CLI help, infrastructure advice | Faster deployments, fewer errors |
| **Finance** | Trading analysis, compliance checking | Better decisions, reduced risk |
| **Healthcare** | Medical literature search, drug interactions | Improved patient care |
| **Legal** | Contract analysis, case research | Faster research, reduced costs |
| **Code Review** | PR analysis, best practices | Higher code quality |

Companies such as Bloomberg (BloombergGPT), Harvey (legal AI), and Hippocratic AI (healthcare) have built specialized models designed to outperform general-purpose models on tasks in their own domains.

---

## 🧒 ELI5: What is a Domain-Specific AI Assistant?

> **Imagine you're in a foreign country and need help.** You could ask:
>
> 1. **A random tourist** - They might help, but don't know the area
> 2. **A local guide** - They know the streets, restaurants, and customs
>
> **General AI models are like tourists.** They know a lot about everything, but nothing deeply.
>
> **Domain-specific assistants are like local guides.** They've been:
> - **Trained** on domain knowledge (fine-tuning)
> - **Given a reference book** (RAG knowledge base)
> - **Equipped with tools** (custom functions)
>
> The result? An assistant that speaks your domain's language, knows its nuances, and can actually DO things in that domain - not just talk about them.

---

## 🏗️ System Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                     Domain-Specific AI Assistant                     │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐            │
│  │  FastAPI     │───▶│  Orchestrator│───▶│  Response    │            │
│  │  Endpoint    │    │              │    │  Streamer    │            │
│  └──────────────┘    └───────┬──────┘    └──────────────┘            │
│                              │                                       │
│         ┌────────────────────┼────────────────────┐                  │
│         ▼                    ▼                    ▼                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐            │
│  │    RAG       │    │  Fine-tuned  │    │    Tool      │            │
│  │  Retriever   │    │     LLM      │    │  Executor    │            │
│  └───────┬──────┘    └──────────────┘    └───────┬──────┘            │
│          │                                       │                   │
│  ┌───────┴──────┐                        ┌───────┴──────┐            │
│  │ Vector Store │                        │  Tool        │            │
│  │ (FAISS)      │                        │  Registry    │            │
│  └───────┬──────┘                        └──────────────┘            │
│          │                                                           │
│  ┌───────┴──────┐                                                    │
│  │  Knowledge   │                                                    │
│  │  Base (Docs) │                                                    │
│  └──────────────┘                                                    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
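
The data flow in the diagram can be sketched in a few lines of plain Python. This is only an illustration with stand-in components: `retrieve`, `generate`, and `run_tool` are hypothetical stubs rather than the classes built later in this notebook, and the `TOOL:` prefix is an invented convention for signalling a tool request.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the three backends in the diagram
def retrieve(query: str) -> List[str]:
    """RAG Retriever stub: return context passages for the query."""
    return [f"(doc about {query})"]

def generate(prompt: str) -> str:
    """Fine-tuned LLM stub: ask for a tool when the prompt mentions cost."""
    if "cost" in prompt.lower() and "Tool result" not in prompt:
        return "TOOL:estimate_cost"
    return f"Answer based on: {prompt}"

def run_tool(name: str) -> str:
    """Tool Executor stub: look the tool up in a tiny registry."""
    registry: Dict[str, Callable[[], str]] = {"estimate_cost": lambda: "$2.30/month"}
    return registry[name]()

def orchestrate(query: str) -> str:
    """Orchestrator: retrieve context, call the LLM, run a tool if requested."""
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    reply = generate(prompt)
    if reply.startswith("TOOL:"):
        tool_output = run_tool(reply[len("TOOL:"):])
        # Feed the tool result back so the LLM can write the final answer
        reply = generate(f"{prompt}\n\nTool result: {tool_output}")
    return reply

print(orchestrate("What does S3 storage cost?"))
```

The second `generate` call is the important design point: tool output goes back through the model instead of being returned to the user directly.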

---

## Part 1: Environment Setup & Domain Selection

In [None]:
# Environment Setup
import os
import sys
from pathlib import Path
from datetime import datetime

# Check GPU availability
import torch

print("🚀 OPTION A: DOMAIN-SPECIFIC AI ASSISTANT")
print("="*60)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
if torch.cuda.is_available():
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    # Without a CUDA device, get_device_properties(0) would raise
    print("\nGPU: Not available")
print(f"CUDA Version: {torch.version.cuda}")
print(f"PyTorch Version: {torch.__version__}")

In [None]:
# Domain Selection Helper

DOMAIN_OPTIONS = {
    "aws": {
        "name": "AWS Infrastructure Assistant",
        "description": "Help with AWS CLI, services, and best practices",
        "knowledge_sources": [
            "AWS Documentation",
            "AWS CLI Reference",
            "AWS Best Practices Guides",
        ],
        "example_tools": [
            "validate_cli_command",
            "estimate_cost",
            "check_security_group",
        ],
        "training_data_ideas": [
            "Stack Overflow AWS questions",
            "AWS re:Post discussions",
            "Synthetic CLI command Q&A",
        ],
    },
    "finance": {
        "name": "Financial Analysis Assistant",
        "description": "Help with market analysis, financial metrics, and reports",
        "knowledge_sources": [
            "SEC Filings (10-K, 10-Q)",
            "Financial News Archives",
            "Investment Research Reports",
        ],
        "example_tools": [
            "calculate_ratios",
            "fetch_stock_data",
            "compare_companies",
        ],
        "training_data_ideas": [
            "Financial Q&A datasets",
            "Analyst report summaries",
            "Earnings call transcripts",
        ],
    },
    "code_review": {
        "name": "Code Review Assistant",
        "description": "Help review PRs, suggest improvements, check for bugs",
        "knowledge_sources": [
            "Language-specific style guides",
            "Security best practices",
            "Design patterns documentation",
        ],
        "example_tools": [
            "run_linter",
            "check_security",
            "generate_tests",
        ],
        "training_data_ideas": [
            "GitHub PR comments",
            "Code review Q&A",
            "Bug fix examples",
        ],
    },
    "medical": {
        "name": "Medical Literature Assistant",
        "description": "Help search and summarize medical research",
        "knowledge_sources": [
            "PubMed Abstracts",
            "Clinical Guidelines",
            "Drug Databases",
        ],
        "example_tools": [
            "search_pubmed",
            "check_drug_interactions",
            "summarize_study",
        ],
        "training_data_ideas": [
            "Medical Q&A datasets",
            "PubMed question-answer pairs",
            "Clinical scenario responses",
        ],
    },
    "custom": {
        "name": "Custom Domain",
        "description": "Define your own domain",
        "knowledge_sources": ["Define your sources"],
        "example_tools": ["Define your tools"],
        "training_data_ideas": ["Define your data strategy"],
    },
}

print("\n🎯 AVAILABLE DOMAINS")
print("="*60)

for key, domain in DOMAIN_OPTIONS.items():
    print(f"\n📌 {key}: {domain['name']}")
    print(f"   {domain['description']}")
    print(f"   Knowledge: {', '.join(domain['knowledge_sources'][:2])}...")
    print(f"   Tools: {', '.join(domain['example_tools'][:2])}...")

In [None]:
# Select your domain (modify this cell)

# ========================================
# CONFIGURE YOUR PROJECT HERE
# ========================================

SELECTED_DOMAIN = "aws"  # Options: aws, finance, code_review, medical, custom
PROJECT_NAME = "aws-assistant"
BASE_MODEL = "meta-llama/Llama-3.3-70B-Instruct"

# ========================================

domain_config = DOMAIN_OPTIONS[SELECTED_DOMAIN]

print(f"\n✅ Project Configuration")
print("="*60)
print(f"Project Name: {PROJECT_NAME}")
print(f"Domain: {domain_config['name']}")
print(f"Base Model: {BASE_MODEL}")
print(f"\nKnowledge Sources:")
for source in domain_config['knowledge_sources']:
    print(f"  • {source}")
print(f"\nPlanned Tools:")
for tool in domain_config['example_tools']:
    print(f"  • {tool}")

---

## Part 2: Knowledge Base & RAG Setup

The RAG system is the foundation of your assistant's domain expertise.
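
The template below normalizes every embedding to unit length and searches with an inner-product index. For unit vectors the inner product equals cosine similarity, which the following sketch shows in plain Python. The three-dimensional "embeddings" are made-up numbers chosen for readability, and `l2_normalize` and `inner_product` are hypothetical helper names.

```python
import math
from typing import Dict, List

def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings" (invented values, 3 dimensions)
query = [1.0, 2.0, 0.0]
docs: Dict[str, List[float]] = {
    "s3_guide": [2.0, 4.0, 0.0],   # same direction as the query
    "ec2_guide": [0.0, 0.0, 5.0],  # orthogonal to the query
    "iam_guide": [1.0, 0.0, 0.0],  # partially aligned
}

q = l2_normalize(query)
scores = {name: inner_product(q, l2_normalize(vec)) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.3f}")
```

Vector length no longer matters after normalization: `s3_guide` is twice as long as the query yet scores about 1.0. `iam_guide` scores about 0.447, so a score threshold of 0.5 would filter it out.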

In [None]:
# RAG System Implementation Template

from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import json

@dataclass
class Document:
    """A document in the knowledge base."""
    content: str
    metadata: Dict[str, Any]
    embedding: Optional[List[float]] = None
    
@dataclass
class RetrievalResult:
    """Result from RAG retrieval."""
    documents: List[Document]
    scores: List[float]
    query: str
    latency_ms: float

class RAGSystem:
    """
    RAG System for the domain-specific assistant.
    
    This is a template - implement the methods for your specific use case.
    """
    
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-m3",
        chunk_size: int = 512,
        chunk_overlap: int = 50,
    ):
        self.embedding_model_name = embedding_model
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.documents: List[Document] = []
        self.index = None  # Will be FAISS index
        self._embedding_model = None
        
    def _load_embedding_model(self):
        """Lazy load the embedding model."""
        if self._embedding_model is None:
            try:
                from sentence_transformers import SentenceTransformer
                # Fall back to CPU when no GPU is visible
                self._embedding_model = SentenceTransformer(
                    self.embedding_model_name,
                    device="cuda" if torch.cuda.is_available() else "cpu"
                )
                print(f"✅ Loaded embedding model: {self.embedding_model_name}")
            except Exception as e:
                print(f"❌ Failed to load embedding model: {e}")
                raise
        return self._embedding_model
    
    def add_documents(self, documents: List[Dict[str, Any]]):
        """
        Add documents to the knowledge base.
        
        Args:
            documents: List of {"content": str, "metadata": dict}
        """
        import time
        import numpy as np
        
        start = time.time()
        model = self._load_embedding_model()
        
        # Chunk documents
        chunks = []
        for doc in documents:
            doc_chunks = self._chunk_text(doc["content"])
            for i, chunk in enumerate(doc_chunks):
                chunks.append({
                    "content": chunk,
                    "metadata": {**doc.get("metadata", {}), "chunk_id": i}
                })
        
        print(f"📄 Created {len(chunks)} chunks from {len(documents)} documents")
        
        # Generate embeddings
        contents = [c["content"] for c in chunks]
        embeddings = model.encode(contents, show_progress_bar=True)
        
        # Create Document objects
        for chunk, embedding in zip(chunks, embeddings):
            self.documents.append(Document(
                content=chunk["content"],
                metadata=chunk["metadata"],
                embedding=embedding.tolist()
            ))
        
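        # Build the index on the first call; on later calls append to it so
        # FAISS row ids stay aligned with positions in self.documents
        if self.index is None:
            self._build_index(embeddings)
        else:
            import faiss
            new_vectors = np.array(embeddings).astype("float32")
            faiss.normalize_L2(new_vectors)
            self.index.add(new_vectors)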
        
        elapsed = time.time() - start
        print(f"✅ Added {len(chunks)} chunks in {elapsed:.1f}s")
    
    def _chunk_text(self, text: str) -> List[str]:
        """Split text into overlapping chunks."""
        words = text.split()
        chunks = []
        
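        # Guard against a zero or negative stride when overlap >= chunk size
        step = max(1, self.chunk_size - self.chunk_overlap)
        for i in range(0, len(words), step):
            chunk_words = words[i:i + self.chunk_size]
            # Always keep the first chunk so short documents are not dropped;
            # skip tiny trailing fragments already covered by the overlap
            if i == 0 or len(chunk_words) > 50:
                chunks.append(" ".join(chunk_words))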
        
        return chunks
    
    def _build_index(self, embeddings):
        """Build FAISS index from embeddings."""
        try:
            import faiss
            import numpy as np
            
            embeddings_array = np.array(embeddings).astype('float32')
            dim = embeddings_array.shape[1]
            
            # Normalize first so inner product equals cosine similarity
            # (this must happen before IVF training, which clusters these vectors)
            faiss.normalize_L2(embeddings_array)
            
            # Use IVF for larger datasets
            if len(embeddings) > 10000:
                nlist = min(100, len(embeddings) // 100)
                quantizer = faiss.IndexFlatIP(dim)
                # IndexIVFFlat defaults to L2 distance, so request inner product explicitly
                self.index = faiss.IndexIVFFlat(
                    quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT
                )
                self.index.train(embeddings_array)
            else:
                self.index = faiss.IndexFlatIP(dim)
            
            self.index.add(embeddings_array)
            
            print(f"✅ Built FAISS index with {self.index.ntotal} vectors")
        except ImportError:
            print("⚠️ FAISS not installed. Install with: pip install faiss-gpu")
    
    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        score_threshold: float = 0.5
    ) -> RetrievalResult:
        """
        Retrieve relevant documents for a query.
        
        Args:
            query: User's question
            top_k: Number of documents to retrieve
            score_threshold: Minimum relevance score
            
        Returns:
            RetrievalResult with documents and scores
        """
        import time
        import numpy as np
        
        start = time.time()
        
        if self.index is None:
            return RetrievalResult([], [], query, 0)
        
        # Embed query
        model = self._load_embedding_model()
        query_embedding = model.encode([query])
        
        # Normalize
        import faiss
        query_array = np.array(query_embedding).astype('float32')
        faiss.normalize_L2(query_array)
        
        # Search
        scores, indices = self.index.search(query_array, top_k)
        
        # Filter by threshold and collect results
        result_docs = []
        result_scores = []
        
        for score, idx in zip(scores[0], indices[0]):
            if idx >= 0 and score >= score_threshold:
                result_docs.append(self.documents[idx])
                result_scores.append(float(score))
        
        elapsed_ms = (time.time() - start) * 1000
        
        return RetrievalResult(
            documents=result_docs,
            scores=result_scores,
            query=query,
            latency_ms=elapsed_ms
        )
    
    def save(self, path: str):
        """Save the RAG system to disk."""
        import faiss
        import pickle
        
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        
        # Save FAISS index (explicit None check; do not rely on object truthiness)
        if self.index is not None:
            faiss.write_index(self.index, str(path / "index.faiss"))
        
        # Save documents (without embeddings to save space)
        docs_data = [
            {"content": d.content, "metadata": d.metadata}
            for d in self.documents
        ]
        with open(path / "documents.json", "w") as f:
            json.dump(docs_data, f)
        
        print(f"✅ Saved RAG system to {path}")
    
    @classmethod
    def load(cls, path: str) -> "RAGSystem":
        """Load a RAG system from disk."""
        import faiss
        
        path = Path(path)
        rag = cls()
        
        # Load index
        index_path = path / "index.faiss"
        if index_path.exists():
            rag.index = faiss.read_index(str(index_path))
        
        # Load documents
        docs_path = path / "documents.json"
        if docs_path.exists():
            with open(docs_path) as f:
                docs_data = json.load(f)
            rag.documents = [
                Document(content=d["content"], metadata=d["metadata"])
                for d in docs_data
            ]
        
        print(f"✅ Loaded RAG system with {len(rag.documents)} documents")
        return rag

print("✅ RAG System template defined")
print("\nTo use:")
print("  rag = RAGSystem()")
print("  rag.add_documents([{'content': '...', 'metadata': {...}}])")
print("  results = rag.retrieve('your question')")
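
The interaction between `chunk_size` and `chunk_overlap` is easier to see with small numbers. The sketch below uses a hypothetical helper, `sliding_chunks`, that mirrors the sliding-window idea of `_chunk_text` but omits the minimum-length filter so that every window is visible.

```python
from typing import List

def sliding_chunks(words: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a word list into windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    return [words[i:i + chunk_size] for i in range(0, len(words), stride)]

words = [f"w{i}" for i in range(10)]
chunks = sliding_chunks(words, chunk_size=4, chunk_overlap=1)

for chunk in chunks:
    print(chunk)
```

Each window starts three words after the previous one and repeats its last word. The final window is a single word that the previous chunk already contains, which is the kind of fragment a minimum chunk length is meant to discard. With the template defaults (512 and 50) the stride is 462 words.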

---

## Part 3: Fine-Tuning with QLoRA

Fine-tune Llama 3.3 70B on your domain-specific data using QLoRA.

In [None]:
# QLoRA Fine-tuning Configuration Template

LORA_CONFIG = {
    # LoRA parameters
    "r": 64,                    # LoRA rank (higher = more capacity, more memory)
    "lora_alpha": 128,          # Scaling factor (usually 2*r)
    "lora_dropout": 0.05,       # Dropout for regularization
    "target_modules": [         # Modules to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

QUANTIZATION_CONFIG = {
    "load_in_4bit": True,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

TRAINING_CONFIG = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,  # Effective batch size = 16
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
    "max_seq_length": 2048,
    "logging_steps": 10,
    "save_steps": 100,
    "bf16": True,              # Use bfloat16 (native Blackwell support)
    "gradient_checkpointing": True,  # Save memory
}

# Memory estimation
def estimate_memory():
    """Estimate memory requirements for training."""
    base_model_4bit = 35  # GB
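    # Rough LoRA size: each adapted projection adds about 2 * r * hidden weights per layer
    # (treats every projection as hidden x hidden; Llama 70B has 80 layers, hidden size 8192)
    num_layers, hidden_size, bytes_per_param = 80, 8192, 2  # bf16 adapter weights
    lora_params = (LORA_CONFIG["r"] * 2 * hidden_size * len(LORA_CONFIG["target_modules"])
                   * num_layers * bytes_per_param) / 1e9  # GB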
    optimizer_states = lora_params * 8  # AdamW states
    activations = TRAINING_CONFIG["per_device_train_batch_size"] * TRAINING_CONFIG["max_seq_length"] * 8192 * 4 / 1e9
    
    total = base_model_4bit + lora_params + optimizer_states + activations + 10  # +10 for overhead
    
    print("\n💾 MEMORY ESTIMATION")
    print("="*50)
    print(f"Base Model (4-bit): ~{base_model_4bit:.1f} GB")
    print(f"LoRA Parameters: ~{lora_params:.1f} GB")
    print(f"Optimizer States: ~{optimizer_states:.1f} GB")
    print(f"Activations: ~{activations:.1f} GB")
    print(f"Overhead: ~10 GB")
    print(f"-"*50)
    print(f"Estimated Total: ~{total:.1f} GB")
    print(f"DGX Spark Available: 128 GB")
    print(f"Status: {'✅ Fits!' if total < 120 else '⚠️ Tight on memory'}")

estimate_memory()
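
For a tighter figure than a single hidden-size approximation, the adapter can be counted per projection. LoRA attaches two matrices to each adapted layer, A (r × d_in) and B (d_out × r), which adds r · (d_in + d_out) trainable weights. The shapes below are assumptions based on the published Llama 3 70B configuration (hidden size 8192, MLP size 28672, 8 KV heads of dimension 128, 80 layers); verify them against the `config.json` of the model you actually load. `lora_param_count` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple

# Assumed Llama 3 70B shapes as (d_in, d_out); check against config.json
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 8192, 28672, 1024, 80
PROJECTION_SHAPES: Dict[str, Tuple[int, int]] = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank: int, target_modules: Iterable[str] = tuple(PROJECTION_SHAPES)) -> int:
    """Trainable LoRA weights: rank * (d_in + d_out) per adapted projection, per layer."""
    per_layer = sum(
        rank * (PROJECTION_SHAPES[name][0] + PROJECTION_SHAPES[name][1])
        for name in target_modules
    )
    return per_layer * NUM_LAYERS

params = lora_param_count(rank=64)
print(f"Trainable LoRA parameters: {params:,}")
print(f"Adapter weights in bf16: {params * 2 / 1e9:.2f} GB")
print(f"Adapter weights in fp32: {params * 4 / 1e9:.2f} GB")
```

Under these assumptions, rank 64 on all seven projections gives roughly 0.83 billion trainable weights. Restricting `target_modules` to `q_proj` and `v_proj` cuts that by about a factor of six, which is one lever if memory gets tight.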

In [None]:
# Data Preparation for Fine-tuning

def prepare_training_data(examples: List[Dict[str, str]], tokenizer) -> List[str]:
    """
    Format training examples as chat-template strings.
    
    Args:
        examples: List of {"instruction": ..., "input": ..., "output": ...}
        tokenizer: Tokenizer whose chat template is applied
        
    Returns:
        One formatted (untokenized) text string per example
    """
    formatted = []
    
    for ex in examples:
        # Format as chat
        messages = [
            {"role": "system", "content": f"You are a helpful {SELECTED_DOMAIN} assistant."},
            {"role": "user", "content": ex["instruction"] + (f"\n\nContext: {ex['input']}" if ex.get("input") else "")},
            {"role": "assistant", "content": ex["output"]}
        ]
        
        # Apply chat template
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        formatted.append(text)
    
    return formatted

# Example training data format
EXAMPLE_TRAINING_DATA = [
    {
        "instruction": "How do I create an S3 bucket using the AWS CLI?",
        "input": "",
        "output": """To create an S3 bucket using the AWS CLI, use the following command:

```bash
aws s3 mb s3://your-bucket-name --region us-east-1
```

Key points:
- `s3 mb` stands for "S3 make bucket"
- Bucket names must be globally unique
- Use only lowercase letters, numbers, and hyphens (periods are also allowed but best avoided)
- Specify the region for data residency requirements

For additional options like enabling versioning:
```bash
aws s3api put-bucket-versioning --bucket your-bucket-name --versioning-configuration Status=Enabled
```"""
    },
    {
        "instruction": "What are the best practices for S3 bucket security?",
        "input": "",
        "output": """Here are essential S3 security best practices:

1. **Block Public Access** (most important):
   ```bash
   aws s3api put-public-access-block --bucket your-bucket --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
   ```

2. **Enable Default Encryption**:
   ```bash
   aws s3api put-bucket-encryption --bucket your-bucket --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
   ```

3. **Enable Versioning** for data protection
4. **Use IAM Policies** with least privilege
5. **Enable Access Logging** for audit trails
6. **Consider S3 Object Lock** for compliance"""
    },
]

print("✅ Training data template ready")
print(f"\nExample data format:")
print(json.dumps(EXAMPLE_TRAINING_DATA[0], indent=2)[:500] + "...")
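
The fine-tuning script further down loads a JSON Lines file and passes `dataset_text_field="text"`, so each line of that file needs a `"text"` key. A minimal sketch of that step follows. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the placeholder strings stand in for the output of `prepare_training_data`.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List

def write_jsonl(texts: Iterable[str], path: Path) -> int:
    """Write one {"text": ...} object per line and return the number of rows."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count

def read_jsonl(path: Path) -> List[dict]:
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder strings standing in for chat-formatted training examples
formatted = ["<chat-formatted example 1>", "<chat-formatted example 2\nwith a newline>"]

out_path = Path(tempfile.mkdtemp()) / "training.jsonl"
rows_written = write_jsonl(formatted, out_path)
rows = read_jsonl(out_path)
print(f"Wrote {rows_written} rows to {out_path}")
```

Reading the file back before a long training run is a cheap check that multi-line answers, such as those containing shell snippets, survived serialization.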

In [None]:
# Fine-tuning Script Template
# This is a template - you'll run this as a separate script for training

# No newline after the opening quotes, so the shebang is the first line of the saved file
FINETUNE_SCRIPT = '''#!/usr/bin/env python3
"""
QLoRA Fine-tuning Script for Domain-Specific Assistant

Usage:
    python finetune.py --data_path data/training.jsonl --output_dir models/finetuned
"""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="meta-llama/Llama-3.3-70B-Instruct")
    parser.add_argument("--data_path", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--max_steps", type=int, default=500)
    args = parser.parse_args()
    
    # Quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
    )
    
    # Load model
    print(f"Loading model: {args.model_name}")
    # Llama is supported natively by transformers, so trust_remote_code is not needed
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    
    # Prepare model for training
    model = prepare_model_for_kbit_training(model)
    
    # LoRA config
    lora_config = LoraConfig(
        r=64,
        lora_alpha=128,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    
    # Load dataset
    dataset = load_dataset("json", data_files=args.data_path)["train"]
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=args.output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=10,
        save_steps=100,
        bf16=True,
        gradient_checkpointing=True,
        max_steps=args.max_steps,  # a positive value overrides num_train_epochs
    )
    
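    # Trainer (keyword names here vary across TRL releases; check the
    # SFTTrainer/SFTConfig signature of the version you have installed)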
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,
        dataset_text_field="text",
        max_seq_length=2048,
    )
    
    # Train
    print("Starting training...")
    trainer.train()
    
    # Save
    trainer.save_model(args.output_dir)
    print(f"Model saved to {args.output_dir}")

if __name__ == "__main__":
    main()
'''

# Save the script
script_path = Path(f"/workspace/{PROJECT_NAME}/scripts") if Path("/workspace").exists() else Path(f"./{PROJECT_NAME}/scripts")
script_path.mkdir(parents=True, exist_ok=True)

with open(script_path / "finetune.py", "w") as f:
    f.write(FINETUNE_SCRIPT)

print(f"✅ Fine-tuning script saved to: {script_path / 'finetune.py'}")
print("\nTo run fine-tuning:")
print(f"  python {script_path / 'finetune.py'} --data_path data/training.jsonl --output_dir models/finetuned")
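
Because the script sets both `num_train_epochs` and `max_steps`, it helps to know how many optimizer steps an epoch takes. The arithmetic is sketched below with the batch settings from `TRAINING_CONFIG`. The dataset size of 5,000 examples is an invented figure, the helper names are hypothetical, and the Trainer's own rounding may differ by a step.

```python
import math

def optimizer_steps(
    num_examples: int,
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    num_epochs: int,
    num_devices: int = 1,
) -> int:
    """Approximate optimizer steps for a full run of num_epochs."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

def epochs_covered(
    max_steps: int,
    num_examples: int,
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> float:
    """How many passes over the data a fixed step budget amounts to."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    return max_steps * effective_batch / num_examples

full_run = optimizer_steps(5000, per_device_batch_size=2, gradient_accumulation_steps=8, num_epochs=3)
coverage = epochs_covered(500, 5000, per_device_batch_size=2, gradient_accumulation_steps=8)
print(f"3 epochs need about {full_run} steps; max_steps=500 covers {coverage:.1f} epochs")
```

For a dataset of this size the default `--max_steps 500` stops training a little past the midpoint of the second epoch, so raise it if you want all three epochs.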

---

## Part 4: Custom Tools Implementation

Tools allow your assistant to perform actions, not just generate text.

In [None]:
# Tool Framework

from typing import Callable, Dict, Any, List
from dataclasses import dataclass
import inspect
import json

@dataclass
class ToolDefinition:
    """Definition of a tool for the assistant."""
    name: str
    description: str
    parameters: Dict[str, Any]  # JSON Schema
    function: Callable
    
    def to_openai_format(self) -> Dict:
        """Convert to OpenAI tool format."""
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": self.parameters,
            }
        }

class ToolRegistry:
    """Registry for managing available tools."""
    
    def __init__(self):
        self.tools: Dict[str, ToolDefinition] = {}
    
    def register(self, tool: ToolDefinition):
        """Register a tool."""
        self.tools[tool.name] = tool
        print(f"✅ Registered tool: {tool.name}")
    
    def get(self, name: str) -> ToolDefinition:
        """Get a tool by name."""
        return self.tools.get(name)
    
    def execute(self, name: str, **kwargs) -> str:
        """Execute a tool with given arguments."""
        tool = self.get(name)
        if not tool:
            return f"Error: Unknown tool '{name}'"
        
        try:
            result = tool.function(**kwargs)
            return json.dumps(result) if not isinstance(result, str) else result
        except Exception as e:
            return f"Error executing {name}: {str(e)}"
    
    def get_all_definitions(self) -> List[Dict]:
        """Get all tool definitions in OpenAI format."""
        return [tool.to_openai_format() for tool in self.tools.values()]

# Create registry
registry = ToolRegistry()

# Example tools for AWS domain
def validate_cli_command(command: str) -> Dict:
    """
    Validate an AWS CLI command syntax.
    
    Args:
        command: The AWS CLI command to validate
        
    Returns:
        Validation result with any errors or warnings
    """
    # Simple validation - in production, this would be more sophisticated
    errors = []
    warnings = []
    
    if not command.startswith("aws "):
        errors.append("Command should start with 'aws'")
    
    # Check for common issues
    if "--region" not in command and "s3" in command:
        warnings.append("Consider specifying --region for S3 operations")
    
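    # `aws s3 rm` has no confirmation prompt; match whole tokens so that
    # flags such as --region are not mistaken for -r
    tokens = command.split()
    if "rm" in tokens and "--recursive" in tokens and "--dryrun" not in tokens:
        warnings.append("Recursive delete runs without confirmation - preview it with --dryrun first")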
    
    return {
        "valid": len(errors) == 0,
        "command": command,
        "errors": errors,
        "warnings": warnings,
    }

registry.register(ToolDefinition(
    name="validate_cli_command",
    description="Validate the syntax of an AWS CLI command and check for common issues",
    parameters={
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "The AWS CLI command to validate"
            }
        },
        "required": ["command"]
    },
    function=validate_cli_command
))

def estimate_cost(service: str, usage: Dict[str, float]) -> Dict:
    """
    Estimate AWS service cost.
    
    Args:
        service: AWS service name (e.g., "s3", "ec2")
        usage: Usage metrics (e.g., {"storage_gb": 100, "requests": 1000000})
    """
    # Simplified pricing - in production, use AWS Pricing API
    pricing = {
        "s3": {
            "storage_gb": 0.023,  # $/GB/month
            "requests": 0.0000004,  # $ per request
        },
        "ec2": {
            "hours": 0.0416,  # t3.medium
        },
    }
    
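    service_pricing = pricing.get(service.lower())
    if service_pricing is None:
        # Fail loudly instead of reporting a misleading $0.00 estimate
        raise ValueError(f"No pricing data for service '{service}'")
    total = 0.0
    breakdown = []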
    
    for metric, value in usage.items():
        if metric in service_pricing:
            cost = value * service_pricing[metric]
            total += cost
            breakdown.append(f"{metric}: ${cost:.2f}")
    
    return {
        "service": service,
        "estimated_monthly_cost": f"${total:.2f}",
        "breakdown": breakdown,
        "note": "Estimates based on us-east-1 pricing. Actual costs may vary."
    }

registry.register(ToolDefinition(
    name="estimate_cost",
    description="Estimate monthly AWS service costs based on usage",
    parameters={
        "type": "object",
        "properties": {
            "service": {
                "type": "string",
                "description": "AWS service name (s3, ec2, etc.)"
            },
            "usage": {
                "type": "object",
                "description": "Usage metrics (e.g., {storage_gb: 100})"
            }
        },
        "required": ["service", "usage"]
    },
    function=estimate_cost
))

# Test tools
print("\n🧪 Testing tools:")
print(registry.execute("validate_cli_command", command="aws s3 ls s3://my-bucket"))
print(registry.execute("estimate_cost", service="s3", usage={"storage_gb": 100, "requests": 1000000}))
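
Before a tool can run, the model's request has to be parsed and checked. The sketch below assumes the model emits a JSON object of the form `{"name": ..., "arguments": {...}}`. That wire format is an assumption, not something this notebook has fixed, and `TOOLS`, `word_count`, and `dispatch_tool_call` are hypothetical names used only here.

```python
import json
from typing import Any, Dict

def word_count(text: str) -> Dict[str, int]:
    """Toy tool used only for this illustration."""
    return {"words": len(text.split())}

# Hypothetical mini-registry: tool name -> function and JSON-Schema-style parameters
TOOLS: Dict[str, Dict[str, Any]] = {
    "word_count": {
        "function": word_count,
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    }
}

def dispatch_tool_call(raw: str) -> str:
    """Parse a JSON tool call, check its arguments, and run the tool."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"Error: tool call is not valid JSON ({exc.msg})"
    if not isinstance(call, dict):
        return "Error: tool call must be a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return f"Error: unknown tool {name!r}"
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return "Error: 'arguments' must be a JSON object"
    schema = TOOLS[name]["parameters"]
    missing = [p for p in schema.get("required", []) if p not in args]
    unexpected = [a for a in args if a not in schema.get("properties", {})]
    if missing or unexpected:
        return f"Error: missing={missing}, unexpected={unexpected}"
    try:
        return json.dumps(TOOLS[name]["function"](**args))
    except Exception as exc:  # report tool failures to the model instead of crashing
        return f"Error executing {name}: {exc}"

print(dispatch_tool_call('{"name": "word_count", "arguments": {"text": "aws s3 ls"}}'))
print(dispatch_tool_call('{"name": "word_count", "arguments": {}}'))
```

Every failure path returns a string, never an exception, so the orchestrator can pass the message back to the model and let it retry with corrected arguments.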

---

## Part 5: Orchestrator & API

The orchestrator ties everything together.
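
Streaming means handing tokens to the client as soon as they are produced, not after the full answer is complete. The sketch below shows the pattern with an async generator that wraps each token in a Server-Sent Events frame. `fake_token_stream` is a hypothetical stand-in for the model, and the `[DONE]` sentinel is a convention borrowed from other streaming APIs, not a requirement.

```python
import asyncio
import json
from typing import AsyncGenerator, List

async def fake_token_stream(text: str) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for the LLM: yield one word at a time."""
    for word in text.split():
        await asyncio.sleep(0)  # yield control to the event loop between tokens
        yield word + " "

async def stream_response(query: str) -> AsyncGenerator[str, None]:
    """Wrap each token in an SSE frame, then signal completion."""
    async for token in fake_token_stream(f"Echoing your question: {query}"):
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"

async def collect(query: str) -> List[str]:
    """Drain the stream into a list (a real client would render frames as they arrive)."""
    return [frame async for frame in stream_response(query)]

frames = asyncio.run(collect("list my buckets"))
print(f"{len(frames)} frames, first: {frames[0]!r}")
```

Inside a notebook an event loop is already running, so use `frames = await collect("list my buckets")` there in place of `asyncio.run`. An async generator of this shape is what a FastAPI `StreamingResponse` can consume.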

In [None]:
# Orchestrator Template

from typing import AsyncGenerator, Optional
import asyncio

class AssistantOrchestrator:
    """
    Orchestrates RAG, LLM, and tools for the domain assistant.
    
    This coordinates:
    1. Query understanding
    2. RAG retrieval
    3. LLM generation
    4. Tool execution
    5. Response streaming
    """
    
    def __init__(
        self,
        rag_system: RAGSystem,
        tool_registry: ToolRegistry,
        model_name: str = "meta-llama/Llama-3.3-70B-Instruct",
        system_prompt: Optional[str] = None,
    ):
        self.rag = rag_system
        self.tools = tool_registry
        self.model_name = model_name
        self.system_prompt = system_prompt or self._default_system_prompt()
        self._model = None
        self._tokenizer = None
    
    def _default_system_prompt(self) -> str:
        return f"""You are an expert {SELECTED_DOMAIN} assistant. Your role is to:

1. Answer questions accurately using your knowledge and the provided context
2. Use available tools when they can help provide better answers
3. Be concise but thorough
4. Cite sources when using retrieved information
5. Admit when you're uncertain

Available tools: {', '.join(self.tools.tools.keys())}

When using tools, explain what you're doing and why."""
    
    def _load_model(self):
        """Load the fine-tuned model."""
        if self._model is None:
            import torch
            from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
            
            print(f"Loading model: {self.model_name}")
            
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_quant_type="nf4",
            )
            
            self._model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                quantization_config=bnb_config,
                device_map="auto",
                trust_remote_code=True,
            )
            self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            
            print("✅ Model loaded")
    
    def process_query(
        self,
        query: str,
        conversation_history: Optional[List[Dict]] = None,
        use_rag: bool = True,
    ) -> Dict:
        """
        Process a user query and generate a response.
        
        Args:
            query: The user's question
            conversation_history: Previous messages
            use_rag: Whether to retrieve context
            
        Returns:
            Response with answer, sources, and tool calls
        """
        self._load_model()
        
        # Step 1: Retrieve context
        context = ""
        sources = []
        if use_rag and self.rag.index is not None:
            retrieval = self.rag.retrieve(query, top_k=3)
            if retrieval.documents:
                context = "\n\n".join([
                    f"[Source {i+1}]: {doc.content}"
                    for i, doc in enumerate(retrieval.documents)
                ])
                sources = [
                    {"content": doc.content[:200], "score": score}
                    for doc, score in zip(retrieval.documents, retrieval.scores)
                ]
        
        # Step 2: Build prompt
        messages = [{"role": "system", "content": self.system_prompt}]
        
        if conversation_history:
            messages.extend(conversation_history)
        
        user_content = query
        if context:
            user_content = f"Context:\n{context}\n\nQuestion: {query}"
        
        messages.append({"role": "user", "content": user_content})
        
        # Step 3: Generate response
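        prompt = self._tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,  # end the prompt with the assistant header
        )
        # The chat template already inserts special tokens, so don't add them twice
        inputs = self._tokenizer(
            prompt, return_tensors="pt", add_special_tokens=False
        ).to(self._model.device)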
        
        with torch.no_grad():
            outputs = self._model.generate(
                **inputs,
                max_new_tokens=1024,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self._tokenizer.eos_token_id,
            )
        
        response_text = self._tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True
        )
        
        # Step 4: Check for tool calls (simplified)
        tool_results = []
        # In production, parse response for tool call requests
        
        return {
            "answer": response_text,
            "sources": sources,
            "tool_results": tool_results,
            "query": query,
        }

print("✅ Orchestrator template defined")
print("\nTo use:")
print("  orchestrator = AssistantOrchestrator(rag_system, tool_registry)")
print("  response = orchestrator.process_query('How do I create an S3 bucket?')")
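
Step 4 of `process_query` leaves tool-call parsing as a placeholder. The sketch below shows one way it could be filled in. It assumes the model has been prompted to wrap each call in `<tool_call>...</tool_call>` tags containing JSON with `name` and `arguments` keys; that convention, and the helper names `parse_tool_calls` and `run_tool_calls`, are illustrative and not part of the template. The `execute` argument is assumed to behave like `registry.execute(name, **kwargs)` in the tool tests earlier.

```python
import json
import re
from typing import Any, Callable, Dict, List

# Assumed convention: the model wraps each call in <tool_call>{...}</tool_call>.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(response_text: str) -> List[Dict[str, Any]]:
    """Extract well-formed tool calls; silently skip payloads that are not valid JSON."""
    calls = []
    for raw in TOOL_CALL_PATTERN.findall(response_text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            args = payload.get("arguments", {})
            calls.append({
                "name": payload["name"],
                "arguments": args if isinstance(args, dict) else {},
            })
    return calls


def run_tool_calls(
    calls: List[Dict[str, Any]],
    execute: Callable[..., Any],
) -> List[Dict[str, Any]]:
    """Run each call through execute(name, **arguments), recording errors instead of raising."""
    results = []
    for call in calls:
        try:
            output = execute(call["name"], **call["arguments"])
            results.append({"tool": call["name"], "output": output})
        except Exception as exc:
            results.append({"tool": call["name"], "error": str(exc)})
    return results
```

Inside `process_query`, something like `tool_results = run_tool_calls(parse_tool_calls(response_text), self.tools.execute)` could replace the empty list, provided the system prompt teaches the model the same tag format.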

In [None]:
# FastAPI Server Template

API_SERVER_CODE = '''#!/usr/bin/env python3
"""
FastAPI Server for Domain-Specific AI Assistant

Usage:
    uvicorn api_server:app --host 0.0.0.0 --port 8000
"""

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import List, Optional, Dict, Any
import asyncio
import json

app = FastAPI(
    title="Domain AI Assistant API",
    description="API for the domain-specific AI assistant",
    version="1.0.0"
)

# Pydantic models
class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]
    stream: bool = True
    temperature: float = 0.7
    max_tokens: int = 2048

class ChatResponse(BaseModel):
    id: str
    message: Message
    sources: List[Dict[str, Any]] = []
    usage: Dict[str, int]

# Global orchestrator (loaded on startup)
orchestrator = None

@app.on_event("startup")
async def startup():
    """Load models on startup."""
    global orchestrator
    # Load your orchestrator here
    # orchestrator = AssistantOrchestrator(...)
    print("✅ Assistant loaded")

@app.get("/health")
async def health():
    """Health check endpoint."""
    import torch
    return {
        "status": "healthy",
        "gpu_available": torch.cuda.is_available(),
        "gpu_memory_used_gb": torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0,
    }

@app.post("/v1/chat/completions")
async def chat(request: ChatRequest):
    """
    Chat endpoint compatible with OpenAI format.
    """
    if orchestrator is None:
        raise HTTPException(status_code=503, detail="Assistant not loaded")
    
    # Get the last user message
    user_message = request.messages[-1].content
    history = [m.dict() for m in request.messages[:-1]]
    
    if request.stream:
        return StreamingResponse(
            generate_stream(user_message, history),
            media_type="text/event-stream"
        )
    else:
        # process_query blocks, so run it in a worker thread (Python 3.9+)
        result = await asyncio.to_thread(orchestrator.process_query, user_message, history)
        return ChatResponse(
            id="chat-" + str(hash(user_message))[:8],
            message=Message(role="assistant", content=result["answer"]),
            sources=result.get("sources", []),
            usage={"prompt_tokens": 0, "completion_tokens": 0}
        )

async def generate_stream(query: str, history: list):
    """Stream response tokens."""
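    # Run the blocking model call in a worker thread so the event loop stays responsive
    result = await asyncio.to_thread(orchestrator.process_query, query, history)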
    
    # Simulate streaming (in production, use model's streaming capability)
    words = result["answer"].split()
    for i, word in enumerate(words):
        chunk = {
            "id": "chatcmpl-" + str(i),
            "object": "chat.completion.chunk",
            "choices": [{
                "delta": {"content": word + " "},
                "index": 0,
                "finish_reason": None
            }]
        }
        yield f"data: {json.dumps(chunk)}\\n\\n"
        await asyncio.sleep(0.02)  # Simulate generation time
    
    yield "data: [DONE]\\n\\n"

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

# Save API server
with open(script_path / "api_server.py", "w", encoding="utf-8") as f:
    f.write(API_SERVER_CODE)

print(f"✅ API server saved to: {script_path / 'api_server.py'}")
print("\nTo run:")
print("  uvicorn api_server:app --host 0.0.0.0 --port 8000")
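
The server template simulates streaming by splitting a finished answer into words. The sketch below isolates the event formatting so it can be fed by any iterator of text fragments. In a real deployment those fragments might come from the `TextIteratorStreamer` class in `transformers`, with `generate` running in a background thread; that wiring is not shown here. The function name `sse_chunks` and the default `completion_id` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def sse_chunks(tokens: Iterable[str], completion_id: str = "chatcmpl-demo") -> Iterator[str]:
    """Wrap text fragments as OpenAI-style server-sent event chunks."""
    for token in tokens:
        chunk = {
            "id": completion_id,  # one id shared by every chunk of this completion
            "object": "chat.completion.chunk",
            "choices": [{"delta": {"content": token}, "index": 0, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"

    # Final chunk carries an empty delta and the reason generation stopped
    final = {
        "id": completion_id,
        "object": "chat.completion.chunk",
        "choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}],
    }
    yield f"data: {json.dumps(final)}\n\n"
    yield "data: [DONE]\n\n"
```

Passing fragments through unchanged, instead of appending a space to every word as the simulation does, preserves the model's original whitespace and punctuation.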

---

## Part 6: Evaluation Framework

How will you measure your assistant's quality?

In [None]:
# Evaluation Framework

@dataclass
class TestCase:
    """A test case for evaluation."""
    query: str
    expected_answer: str  # Or key points that should be covered
    category: str  # e.g., "factual", "procedural", "troubleshooting"
    difficulty: str  # "easy", "medium", "hard"
    required_tool: Optional[str] = None

class AssistantEvaluator:
    """Evaluate assistant quality."""
    
    def __init__(self, orchestrator: AssistantOrchestrator):
        self.orchestrator = orchestrator
        self.results = []
    
    def run_test_suite(self, test_cases: List[TestCase]) -> Dict:
        """
        Run a test suite and collect metrics.
        """
        import time
        
        results = {
            "total": len(test_cases),
            "passed": 0,
            "failed": 0,
            "by_category": {},
            "by_difficulty": {},
            "latencies": [],
            "details": [],
        }
        
        for tc in test_cases:
            start = time.time()
            passed = False
            detail = {
                "query": tc.query,
                "category": tc.category,
                "difficulty": tc.difficulty,
            }

            try:
                response = self.orchestrator.process_query(tc.query)
                latency = (time.time() - start) * 1000
                results["latencies"].append(latency)
                detail["latency_ms"] = latency

                # Simple evaluation: check if key terms are present
                passed = self._evaluate_response(response["answer"], tc.expected_answer)
            except Exception as e:
                # A crash counts as a failure and is recorded with its error
                detail["error"] = str(e)

            detail["passed"] = passed
            results["details"].append(detail)
            results["passed" if passed else "failed"] += 1

            # Track by category and by difficulty (errored cases included)
            for group, key in (("by_category", tc.category), ("by_difficulty", tc.difficulty)):
                bucket = results[group].setdefault(key, {"passed": 0, "total": 0})
                bucket["total"] += 1
                if passed:
                    bucket["passed"] += 1
        
        # Calculate summary metrics
        results["accuracy"] = results["passed"] / results["total"] if results["total"] > 0 else 0
        results["avg_latency_ms"] = sum(results["latencies"]) / len(results["latencies"]) if results["latencies"] else 0
        results["p95_latency_ms"] = sorted(results["latencies"])[int(len(results["latencies"]) * 0.95)] if results["latencies"] else 0
        
        return results
    
    def _evaluate_response(self, response: str, expected: str) -> bool:
        """
        Evaluate if response covers expected content.
        
        This is a simple keyword-based evaluation.
        In production, use LLM-as-judge or human evaluation.
        """
        response_lower = response.lower()
        
        # Extract key terms from the comma-separated expected answer
        key_terms = [term.strip().lower() for term in expected.split(",") if term.strip()]
        if not key_terms:
            return False  # nothing to check against

        # Count key terms that appear in the response (substring match)
        matches = sum(1 for term in key_terms if term in response_lower)

        # Pass if at least 50% of key terms are present
        return matches >= len(key_terms) * 0.5

# Example test cases
EXAMPLE_TEST_CASES = [
    TestCase(
        query="How do I create an S3 bucket?",
        expected_answer="aws, s3, mb, bucket, --region",
        category="procedural",
        difficulty="easy"
    ),
    TestCase(
        query="What's the difference between S3 Standard and Glacier?",
        expected_answer="storage class, retrieval, cost, archive, access",
        category="factual",
        difficulty="medium"
    ),
    TestCase(
        query="My Lambda function is timing out. How do I fix it?",
        expected_answer="timeout, memory, cold start, configuration, optimize",
        category="troubleshooting",
        difficulty="hard"
    ),
]

print("✅ Evaluation framework ready")
print(f"\nExample test cases: {len(EXAMPLE_TEST_CASES)}")
for tc in EXAMPLE_TEST_CASES:
    print(f"  • [{tc.difficulty}] {tc.category}: {tc.query[:50]}...")
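
To make the keyword rule concrete, here is a small worked example of the same idea that returns the coverage fraction and the missing terms instead of a bare pass/fail. The helper name `keyword_coverage` is hypothetical and not part of `AssistantEvaluator`; it is only meant to help when debugging why a test case failed.

```python
from typing import List, Tuple


def keyword_coverage(response: str, expected: str) -> Tuple[float, List[str]]:
    """Return (fraction of expected terms found, list of missing terms)."""
    text = response.lower()
    terms = [t.strip().lower() for t in expected.split(",") if t.strip()]
    if not terms:
        return 0.0, []
    missing = [t for t in terms if t not in text]
    return (len(terms) - len(missing)) / len(terms), missing


score, missing = keyword_coverage(
    "Run aws s3 mb s3://my-bucket to create the bucket.",
    "aws, s3, mb, bucket, --region",
)
print(score, missing)  # 0.8 ['--region']
```

Four of the five terms are found, so this response would pass the 50% threshold. Note that matching is by substring, so a short term such as `mb` would also match inside an unrelated word like "number".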

---

## ⚠️ Common Mistakes

### Mistake 1: RAG Retrieval Too Narrow
```python
# ❌ Wrong: Only retrieve based on exact keywords
results = search(query, method="keyword_match")

# ✅ Right: Use semantic search + reranking
results = semantic_search(query, top_k=10)
reranked = rerank(query, results, top_k=3)
```
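
The two-stage pattern above can be sketched with pluggable scorers. This is only an illustration of the control flow: `retrieve_then_rerank`, `token_overlap`, and `phrase_bonus` are hypothetical names, and the two toy scorers are lexical stand-ins. In a real system the first stage would typically be embedding similarity over the FAISS index and the second a slower, more accurate reranking model.

```python
from typing import Callable, List, Sequence

# (query, document) -> relevance score; higher is better
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    recall_score: Scorer,
    rerank_score: Scorer,
    recall_k: int = 10,
    top_k: int = 3,
) -> List[str]:
    """Stage 1 keeps recall_k candidates; stage 2 reorders them and keeps top_k."""
    candidates = sorted(docs, key=lambda d: recall_score(query, d), reverse=True)[:recall_k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_k]


# Toy stand-in for a cheap first-stage scorer: fraction of query words in the doc
def token_overlap(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


# Toy stand-in for a reranker: overlap plus a bonus for an exact phrase match
def phrase_bonus(query: str, doc: str) -> float:
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return token_overlap(query, doc) + bonus
```

Retrieving a wider candidate set first (`recall_k` larger than `top_k`) gives the reranker a chance to promote documents the cheap scorer ranked too low.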

### Mistake 2: No Tool Error Handling
```python
# ❌ Wrong: Assume tools always work
result = tool.execute(args)
return f"Result: {result}"

# ✅ Right: Handle errors gracefully
try:
    result = tool.execute(args)
    return f"Result: {result}"
except ToolError as e:
    return f"I tried to use {tool.name} but encountered an error: {e}. Let me try another approach..."
```
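
The snippet above references a `ToolError` type without defining it. One possible shape, offered as a sketch: a custom exception for expected tool failures plus a wrapper that turns any failure into structured data the orchestrator can pass back to the model. Both `ToolError` and `safe_execute` are hypothetical names here, not part of the tool framework defined earlier.

```python
from typing import Any, Callable, Dict


class ToolError(Exception):
    """Raised by a tool when it cannot complete its work (hypothetical type)."""


def safe_execute(name: str, func: Callable[..., Any], **kwargs: Any) -> Dict[str, Any]:
    """Call a tool function and always return a dict, never raise."""
    try:
        return {"ok": True, "tool": name, "result": func(**kwargs)}
    except ToolError as exc:
        # Expected failure: the message is safe to show to the model or user
        return {"ok": False, "tool": name, "error": str(exc)}
    except Exception as exc:
        # Unexpected failure, likely a bug in the tool: include the exception type
        return {"ok": False, "tool": name, "error": f"unexpected {type(exc).__name__}: {exc}"}
```

Returning a dict with an `ok` flag keeps the failure path explicit, so the caller can decide whether to retry, fall back to a plain answer, or report the problem.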

### Mistake 3: No Response Validation
```python
# ❌ Wrong: Return LLM output directly
return model.generate(prompt)

# ✅ Right: Validate and filter
response = model.generate(prompt)
if contains_hallucination(response, context):
    response = regenerate_with_warning(prompt)
if contains_sensitive_data(response):
    response = redact(response)
return response
```
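
For the AWS example domain, `contains_sensitive_data` and `redact` could start as a simple pattern check. The sketch below covers only AWS access key IDs (20 characters beginning with `AKIA` or `ASIA`); the placeholder text is an arbitrary choice, and a real filter would need more patterns (secret keys, tokens, account-specific data). Hallucination detection is a harder problem and is not attempted here.

```python
import re

# AWS access key IDs: "AKIA" (long-term) or "ASIA" (temporary) plus 16 uppercase alphanumerics
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def contains_sensitive_data(text: str) -> bool:
    """True if the text appears to contain an AWS access key ID."""
    return bool(AWS_ACCESS_KEY_ID.search(text))


def redact(text: str) -> str:
    """Replace every matched key ID with a fixed placeholder."""
    return AWS_ACCESS_KEY_ID.sub("[REDACTED_AWS_KEY_ID]", text)
```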

---

## 🎉 Checkpoint

You now have templates for:
- ✅ RAG system with FAISS index
- ✅ QLoRA fine-tuning configuration
- ✅ Custom tool framework
- ✅ Orchestrator for coordination
- ✅ FastAPI server with streaming
- ✅ Evaluation framework

---

## 🚀 Implementation Roadmap

### Week 1: Foundation
- [ ] Collect and preprocess domain documents
- [ ] Build and test RAG system
- [ ] Create training dataset (100+ examples)

### Week 2: Model Development
- [ ] Fine-tune base model with QLoRA
- [ ] Implement domain-specific tools
- [ ] Basic integration testing

### Week 3: Integration
- [ ] Build orchestrator
- [ ] Create FastAPI server
- [ ] End-to-end testing

### Week 4: Optimization
- [ ] Profile and optimize memory usage
- [ ] Improve retrieval quality
- [ ] Add caching where appropriate

### Week 5: Evaluation
- [ ] Run comprehensive test suite
- [ ] Compare to baselines
- [ ] Iterate based on results

### Week 6: Documentation
- [ ] Complete technical report
- [ ] Record demo video
- [ ] Prepare presentation

---

## 📖 Further Reading

- [RAG Best Practices](https://www.pinecone.io/learn/retrieval-augmented-generation/)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [Building Production LLM Applications](https://huyenchip.com/2023/04/11/llm-engineering.html)

---

In [None]:
# 🧹 Cleanup
import gc

# Clear any loaded models
if 'model' in dir():
    del model
if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()

print("✅ Cleanup complete")
if torch.cuda.is_available():
    print(f"\nGPU Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB allocated")