# LOCAL RAG SETUP WITH OLLAMA - JUPYTER NOTEBOOK WORKFLOW

- This notebook transitions from Azure OpenAI to local Ollama deployment
- **Target**: Python 3.11+ (3.12 recommended for best library support)
- **Hardware**: Mac M3 Max (or better) with 64GB RAM (or better)

## 1: Environment Setup and Validation

In [9]:
import sys
import subprocess
import os
from pathlib import Path

print("=== ENVIRONMENT VALIDATION ===")
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Check if we're in a virtual environment
def check_virtual_env():
    """Check if running in a virtual environment"""
    return (
        hasattr(sys, 'real_prefix') or  # virtualenv
        (hasattr(sys, 'base_prefix') and sys.base_prefix != sys.prefix) or  # venv
        'conda' in sys.executable.lower()  # conda
    )

in_venv = check_virtual_env()
print(f"In virtual environment: {in_venv}")

if not in_venv:
    print("\n‚ö†Ô∏è  WARNING: Not in a virtual environment!")
    print("Recommended: Create a virtual environment for this project")
    print("Run in terminal:")
    print("  python3.12 -m venv local_rag_env")
    print("  source local_rag_env/bin/activate  # On Mac/Linux")
    print("  pip install jupyter")
    print("  jupyter notebook")

=== ENVIRONMENT VALIDATION ===
Python version: 3.12.10 (main, Apr  8 2025, 11:35:47) [Clang 17.0.0 (clang-1700.0.13.3)]
Python executable: /Users/ndana/envs/local_rag_env/bin/python
In virtual environment: True


## 2: Install Required Dependencies

**Note**: Run this cell only once, or when you need to update packages

In [11]:
install_packages = True  # Set to False after first run

if install_packages:
    print("=== INSTALLING DEPENDENCIES ===")
    
    # Core packages for local RAG
    packages = [
        "llama-index>=0.9.0",                   # Main RAG framework
        "llama-index-llms-ollama",              # Ollama LLM integration
        "llama-index-embeddings-ollama",        # Ollama embeddings (if available)
        "lama-index-llms-azure-openai",         # Azure OpenAI LLM integration
        "llama-index-embeddings-azure-openai",  # Azure OpenAI embeddings
        "llama-index-embeddings-huggingface",   # Hugging Face embeddings
        "sentence-transformers",                # Local embedding models
        "chromadb",                             # Vector database
        "pypdf",                                # PDF processing
        "requests",                             # HTTP client for Ollama API
        "numpy",                                # Numerical operations
        "pandas",                               # Data manipulation
        "tqdm",                                 # Progress bars
        "python-dotenv",                        # Environment variables
    ]                
    
    for package in packages:
        try:
            print(f"Installing {package}...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        except subprocess.CalledProcessError as e:
            print(f"‚ùå Failed to install {package}: {e}")
    
    print("‚úÖ Package installation complete!")
else:
    print("Skipping package installation (set install_packages=True to reinstall)")


=== INSTALLING DEPENDENCIES ===
Installing llama-index>=0.9.0...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing llama-index-llms-ollama...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing llama-index-embeddings-ollama...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing lama-index-llms-azure-openai...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[31mERROR: Could not find a version that satisfies the requirement lama-index-llms-azure-openai (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for lama-index-llms-azure-openai[0m[31m
[0mhuggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


‚ùå Failed to install lama-index-llms-azure-openai: Command '['/Users/ndana/envs/local_rag_env/bin/python', '-m', 'pip', 'install', 'lama-index-llms-azure-openai']' returned non-zero exit status 1.
Installing llama-index-embeddings-azure-openai...
Installing llama-index-embeddings-huggingface...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing sentence-transformers...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing chromadb...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing pypdf...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing requests...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing numpy...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing pandas...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing tqdm...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Installing python-dotenv...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


‚úÖ Package installation complete!


## 3: Ollama Setup and Model Management

In [3]:
import requests
import json

print("=== OLLAMA SETUP AND VALIDATION ===")

# Check if Ollama is running
def check_ollama_status():
    """Check if Ollama service is running"""
    try:
        response = requests.get("http://localhost:11434/api/version", timeout=5)
        if response.status_code == 200:
            version_info = response.json()
            print(f"‚úÖ Ollama is running - Version: {version_info.get('version', 'unknown')}")
            return True
    except requests.exceptions.RequestException:
        pass
    
    print("‚ùå Ollama is not running")
    print("Please start Ollama:")
    print("  - Open terminal and run: ollama serve")
    print("  - Or start Ollama app from Applications")
    return False

ollama_running = check_ollama_status()

# List available models
def list_ollama_models():
    """List currently pulled Ollama models"""
    try:
        response = requests.get("http://localhost:11434/api/tags")
        if response.status_code == 200:
            models = response.json().get('models', [])
            print(f"\nüìã Available Ollama models ({len(models)} total):")
            for model in models:
                name = model['name']
                size = model.get('size', 0) / (1024**3)  # Convert to GB
                modified = model.get('modified_at', '')[:10]  # Date only
                print(f"  ‚Ä¢ {name:<25} | {size:.1f}GB | {modified}")
            return [model['name'] for model in models]
        else:
            print("Failed to fetch model list")
            return []
    except requests.exceptions.RequestException as e:
        print(f"Error fetching models: {e}")
        return []

available_models = list_ollama_models() if ollama_running else []

# Recommended models for your hardware
recommended_models = {
    "mistral:7b-instruct": "Mistral 7B - Excellent general purpose model (~4GB)",
    "mixtral:8x7b": "Mixtral 8x7B - Very capable, mixture of experts (~26GB)", 
    # "llama3:8b-instruct": "Llama 3 8B - Latest Meta model (~4.7GB)",
    # "codellama:13b-instruct": "Code Llama 13B - Code-focused (~7GB)",
    "nomic-embed-text:latest": "Nomic Embed - Local embedding model (~274MB)"
}

print(f"\nüéØ RECOMMENDED MODELS FOR YOUR HARDWARE (64GB RAM):")
for model, description in recommended_models.items():
    status = "‚úÖ Available" if model in available_models else "‚¨áÔ∏è  Pull needed"
    print(f"  {model:<25} | {description} | {status}")


=== OLLAMA SETUP AND VALIDATION ===
‚úÖ Ollama is running - Version: 0.7.1

üìã Available Ollama models (30 total):
  ‚Ä¢ nomic-embed-text:latest   | 0.3GB | 2025-05-25
  ‚Ä¢ mistral:7b-instruct       | 3.8GB | 2025-05-25
  ‚Ä¢ deepseek-r1:70b           | 39.6GB | 2025-05-07
  ‚Ä¢ qwen3:30b                 | 17.3GB | 2025-05-07
  ‚Ä¢ phi4-reasoning:plus       | 10.4GB | 2025-05-07
  ‚Ä¢ llama3.3:70b              | 39.6GB | 2025-05-07
  ‚Ä¢ qwen3:32b                 | 18.8GB | 2025-05-07
  ‚Ä¢ gemma3:27b                | 16.2GB | 2025-04-19
  ‚Ä¢ gemma3:12b                | 7.6GB | 2025-04-19
  ‚Ä¢ llama3.2:3b               | 1.9GB | 2025-04-19
  ‚Ä¢ granite-code:20b          | 10.8GB | 2025-04-19
  ‚Ä¢ starcoder2:7b             | 3.8GB | 2025-04-19
  ‚Ä¢ granite-code:8b           | 4.3GB | 2025-04-19
  ‚Ä¢ starcoder2:latest         | 1.6GB | 2025-04-19
  ‚Ä¢ granite3.2-vision:latest  | 2.3GB | 2025-04-19
  ‚Ä¢ granite3.1-moe:latest     | 1.9GB | 2025-04-19
  ‚Ä¢ granite3.2:8b         

## 4: Pull Recommended Models (if needed)

**NOTE**: Only run this cell if you need to pull new models


In [4]:
pull_models = False  # Set to True to pull missing models

if pull_models and ollama_running:
    print("=== PULLING RECOMMENDED MODELS ===")
    
    # Start with smaller models first
    models_to_pull = [
        "nomic-embed-text:latest",      # Embedding model first
        "mistral:7b-instruct",   # Small but capable LLM
        "mixtral:8x7b", # Uncomment for larger, more capable model
    ]
    
    for model in models_to_pull:
        if model not in available_models:
            print(f"\nüì• Pulling {model}...")
            print("This may take several minutes depending on model size...")
            
            # Use subprocess to show real-time progress
            try:
                result = subprocess.run(
                    ["ollama", "pull", model],
                    capture_output=False,  # Show output in real-time
                    text=True,
                    timeout=1800  # 30 minute timeout
                )
                if result.returncode == 0:
                    print(f"‚úÖ Successfully pulled {model}")
                else:
                    print(f"‚ùå Failed to pull {model}")
            except subprocess.TimeoutExpired:
                print(f"‚è∞ Timeout pulling {model} - this may indicate a slow connection")
            except FileNotFoundError:
                print("‚ùå Ollama CLI not found. Please install Ollama first.")
                break
        else:
            print(f"‚úÖ {model} already available")
    
    # Refresh model list
    available_models = list_ollama_models()
else:
    print("Skipping model pull (set pull_models=True to pull missing models)")


Skipping model pull (set pull_models=True to pull missing models)


## 5: Test Local Model Connections

In [5]:
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import traceback

print("=== TESTING LOCAL MODEL CONNECTIONS ===")

# Test LLM connection
target_model = "mixtral:8x7b"  # Default model for testing
def test_local_llm(model_name=target_model):
    """Test local LLM connection and response"""
    print(f"\nüß† Testing LLM: {model_name}")
    
    if model_name not in available_models:
        print(f"‚ùå Model {model_name} not available. Available models: {available_models}")
        return None
    
    try:
        # Initialize Ollama LLM
        llm = Ollama(
            model=model_name,
            base_url="http://localhost:11434",
            request_timeout=60.0,
        )
        
        # Test with a simple prompt
        test_prompt = "Explain what RAG (Retrieval-Augmented Generation) is in one sentence."
        print(f"Prompt: {test_prompt}")
        
        response = llm.complete(test_prompt)
        print(f"Response: {response.text.strip()}")
        print("‚úÖ LLM test successful!")
        return llm
        
    except Exception as e:
        print(f"‚ùå LLM test failed: {e}")
        traceback.print_exc()
        return None

# Test embedding model
def test_local_embeddings():
    """Test local embedding model"""
    print(f"\nüî¢ Testing Embedding Model")
    
    try:
        # Use HuggingFace sentence transformers (runs locally)
        # This is more reliable than Ollama embeddings which may not be available
        embed_model = HuggingFaceEmbedding(
            model_name="sentence-transformers/all-MiniLM-L6-v2",  # Small, fast model
            max_length=512,
        )
        
        # Test embedding
        test_texts = [
            "This is a test document about machine learning.",
            "RAG systems combine retrieval with generation.",
            "Local models run on your own hardware."
        ]
        
        print("Testing embeddings for sample texts...")
        embeddings = []
        for text in test_texts:
            emb = embed_model.get_text_embedding(text)
            embeddings.append(emb)
            print(f"  '{text[:30]}...' -> {len(emb)} dimensions")
        
        # Test similarity
        import numpy as np
        def cosine_similarity(a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        
        sim1 = cosine_similarity(embeddings[0], embeddings[1])
        sim2 = cosine_similarity(embeddings[0], embeddings[2])
        
        print(f"Similarity (ML vs RAG): {sim1:.3f}")
        print(f"Similarity (ML vs Local): {sim2:.3f}")
        print("‚úÖ Embedding test successful!")
        return embed_model
        
    except Exception as e:
        print(f"‚ùå Embedding test failed: {e}")
        traceback.print_exc()
        return None

# Run tests
local_llm = test_local_llm()
local_embeddings = test_local_embeddings()


=== TESTING LOCAL MODEL CONNECTIONS ===

üß† Testing LLM: mixtral:8x7b
Prompt: Explain what RAG (Retrieval-Augmented Generation) is in one sentence.
Response: lexiafriendly RAG, or Retrieval-Augmented Generation, is a method that combines using an initialized model with retrieving relevant information from external sources to enhance the generation of responses.
‚úÖ LLM test successful!

üî¢ Testing Embedding Model
Testing embeddings for sample texts...
  'This is a test document about ...' -> 384 dimensions
  'RAG systems combine retrieval ...' -> 384 dimensions
  'Local models run on your own h...' -> 384 dimensions
Similarity (ML vs RAG): 0.204
Similarity (ML vs Local): 0.087
‚úÖ Embedding test successful!


## 6: Create Local RAG Configuration

In [6]:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

print("=== CONFIGURING LOCAL RAG SYSTEM ===")

if local_llm and local_embeddings:
    # Configure LlamaIndex global settings
    Settings.llm = local_llm
    Settings.embed_model = local_embeddings
    
    # Configure text splitter (improved chunking based on your previous work)
    Settings.node_parser = SentenceSplitter(
        chunk_size=800,      # Similar to your max_chars
        chunk_overlap=100,   # Overlap for context preservation
        paragraph_separator="\n\n",  # Better paragraph detection
        secondary_chunking_regex="[^,.;„ÄÇ]+[,.;„ÄÇ]?",  # Sentence-level backup
    )
    
    print("‚úÖ Local RAG configuration complete!")
    print(f"LLM Model: {local_llm.model}")
    print(f"Embedding Model: {local_embeddings.model_name}")
    print(f"Chunk Size: {Settings.node_parser.chunk_size}")
    print(f"Chunk Overlap: {Settings.node_parser.chunk_overlap}")
    
    # Test the complete pipeline with more comprehensive examples
    print("\nüîÑ Testing Complete Pipeline...")
    from llama_index.core import VectorStoreIndex, Document
    
    # Create test documents that mirror your use case
    test_docs = [
        Document(text="""
        Aeroflow Technical Specifications:
        The Aeroflow system features advanced aerodynamic design principles that optimize fuel efficiency 
        and reduce environmental impact. The propulsion system incorporates cutting-edge technology 
        with hybrid-electric capabilities. Maintenance intervals are scheduled every 500 operational hours 
        or 6 months, whichever comes first. The comprehensive warranty covers all major components 
        for 24 months from delivery date.
        """),
        Document(text="""
        EcoSprint Performance Features:
        EcoSprint represents the next generation of eco-friendly transportation technology. 
        The system utilizes regenerative energy capture and advanced battery management systems. 
        Low maintenance requirements include basic inspections every 1000 hours of operation. 
        Environmental certifications include ISO 14001 compliance. Extended warranty coverage 
        provides 36 months of comprehensive protection for all electronic and mechanical systems.
        """),
        Document(text="""
        Retrieval-Augmented Generation Implementation:
        RAG systems combine large language models with external knowledge retrieval mechanisms. 
        This approach significantly reduces hallucination rates while improving response accuracy. 
        Local deployment offers privacy benefits and eliminates dependency on cloud services. 
        Chunking strategies are critical for optimal performance, with typical sizes ranging 
        from 500-1000 characters depending on content type.
        """),
    ]
    
    # Create vector index with performance monitoring
    try:
        print("Creating vector index from test documents...")
        index = VectorStoreIndex.from_documents(
            test_docs,
            embed_model=local_embeddings,
            show_progress=True
        )
        
        # Create query engine with custom settings
        query_engine = index.as_query_engine(
            llm=local_llm,
            similarity_top_k=3,  # Retrieve top 3 most relevant chunks
            response_mode="compact"  # Efficient response generation
        )
        
        # Test multiple queries to validate different scenarios
        test_queries = [
            "What are the maintenance requirements for Aeroflow?",
            "Compare the warranty periods for both products.",
            "What are the benefits of local RAG deployment?",
            "How does EcoSprint handle environmental compliance?"
        ]
        
        for i, query in enumerate(test_queries, 1):
            print(f"\n--- Test Query {i} ---")
            print(f"Query: {query}")
            
            try:
                response = query_engine.query(query)
                print(f"Response: {response.response}")
                
                # Show source information if available
                if hasattr(response, 'source_nodes') and response.source_nodes:
                    print(f"Sources used: {len(response.source_nodes)} chunks")
                
            except Exception as e:
                print(f"‚ùå Query failed: {e}")
        
        print("\n‚úÖ Complete RAG pipeline test successful!")
        
        # Store for use in later cells
        global test_query_engine, test_index
        test_query_engine = query_engine
        test_index = index
        
    except Exception as e:
        print(f"‚ùå Pipeline test failed: {e}")
        traceback.print_exc()
        
else:
    print("‚ùå Cannot configure RAG system - LLM or embedding model initialization failed")


=== CONFIGURING LOCAL RAG SYSTEM ===
‚úÖ Local RAG configuration complete!
LLM Model: mixtral:8x7b
Embedding Model: sentence-transformers/all-MiniLM-L6-v2
Chunk Size: 800
Chunk Overlap: 100

üîÑ Testing Complete Pipeline...
Creating vector index from test documents...


Parsing nodes:   0%|          | 0/3 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/3 [00:00<?, ?it/s]


--- Test Query 1 ---
Query: What are the maintenance requirements for Aeroflow?
Response:  The Aeroflow system has scheduled maintenance intervals every 500 operational hours or 6 months, whichever comes first. The comprehensive warranty covers all major components for 24 months from the delivery date.
Sources used: 3 chunks

--- Test Query 2 ---
Query: Compare the warranty periods for both products.
Response:  The EcoSprint provides 36 months of comprehensive protection for all electronic and mechanical systems, while the Aeroflow offers a warranty coverage of 24 months from the delivery date for all major components. Therefore, the EcoSprint has a longer warranty period compared to the Aeroflow.
Sources used: 3 chunks

--- Test Query 3 ---
Query: What are the benefits of local RAG deployment?
Response:  Benefits of local Retrieval-Augmented Generation (RAG) system deployment include privacy advantages as sensitive data is processed and stored locally without the need for cloud servi

## 7: Migration Helper Fcns - Azure to Local 

In [7]:
print("=== MIGRATION HELPER FUNCTIONS ===")

def create_local_rag_components(llm_model="mixtral:8x7b", embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
    """Create local equivalents of your Azure components with optimized settings"""
    
    components = {}
    
    # LLM (replaces AzureOpenAI)
    components['llm'] = Ollama(
        model=llm_model,
        base_url="http://localhost:11434",
        request_timeout=120.0,
        # Additional parameters for better performance
        temperature=0.1,  # Lower temperature for more consistent responses
        num_predict=512,  # Limit response length for efficiency
    )
    
    # Embedding Model (replaces AzureOpenAIEmbedding) 
    components['embed_model'] = HuggingFaceEmbedding(
        model_name=embedding_model,
        max_length=512,
        normalize=True,  # Normalize embeddings for better similarity calculation
    )
    
    # Enhanced Text Splitter (replaces your chunking function)
    components['text_splitter'] = SentenceSplitter(
        chunk_size=800,
        chunk_overlap=100,
        paragraph_separator="\n\n",
        secondary_chunking_regex="[^,.;„ÄÇ]+[,.;„ÄÇ]?",
    )
    
    return components

def migrate_query_engine_tools(components, aeroflow_content=None, ecosprint_content=None):
    """
    Migrate your existing query engine tools from Azure to local deployment
    This function recreates your Aeroflow and EcoSprint tools with local models
    """
    from llama_index.core.tools import QueryEngineTool
    from llama_index.core import VectorStoreIndex, Document
    
    # Use provided content or create placeholder documents
    if aeroflow_content is None:
        aeroflow_content = """
        Aeroflow Technical Specifications and Features:
        
        Design Philosophy: Advanced aerodynamic engineering with computational fluid dynamics optimization.
        The Aeroflow system incorporates state-of-the-art materials science and precision manufacturing.
        
        Performance Features:
        - Hybrid-electric propulsion system with 40% improved fuel efficiency
        - Advanced composite materials reducing weight by 25%
        - Integrated sensor arrays for real-time performance monitoring
        - Automated flight path optimization using machine learning algorithms
        
        Technology Integration:
        - Next-generation avionics with touch-screen interfaces
        - Redundant safety systems with triple-backup protocols
        - Weather-adaptive control systems
        - Satellite-based navigation with precision landing capabilities
        
        Maintenance Schedule:
        - Routine inspections every 500 operational hours
        - Major overhaul required at 5,000 hours
        - Component replacement following manufacturer guidelines
        - Digital maintenance logging with predictive analytics
        
        Warranty Coverage:
        - Comprehensive 24-month warranty from delivery
        - Extended warranty options available up to 60 months
        - Global service network with certified technicians
        - 24/7 technical support hotline
        """
    
    if ecosprint_content is None:
        ecosprint_content = """
        EcoSprint Environmental Technology Specifications:
        
        Sustainability Focus: Designed for minimal environmental impact with maximum operational efficiency.
        EcoSprint represents breakthrough innovation in green transportation technology.
        
        Environmental Features:
        - Zero-emission electric drive system with regenerative capabilities
        - Solar panel integration for auxiliary power generation
        - Recyclable materials comprising 80% of total construction
        - Carbon-neutral manufacturing process certification
        
        Advanced Technology:
        - Intelligent battery management with thermal optimization
        - Predictive maintenance using IoT sensors throughout the system
        - Mobile app integration for remote monitoring and control
        - Over-the-air software updates for continuous improvement
        
        Operational Efficiency:
        - Extended range capabilities up to 500km on single charge
        - Fast-charging technology achieving 80% capacity in 30 minutes
        - Adaptive power management based on usage patterns
        - Integration with smart grid systems for optimal charging
        
        Maintenance Requirements:
        - Minimal maintenance design with self-diagnostic capabilities
        - Scheduled inspections every 1,000 operational hours
        - Component health monitoring through integrated sensors
        - Automated maintenance alerts and scheduling
        
        Warranty and Support:
        - Industry-leading 36-month comprehensive warranty
        - Extended service agreements available
        - ISO 14001 environmental compliance certification
        - Remote diagnostic capabilities with expert technical support
        """
    
    # Create documents from content
    aeroflow_docs = [Document(text=aeroflow_content)]
    ecosprint_docs = [Document(text=ecosprint_content)]
    
    print("Creating local indexes for product specifications...")
    
    # Create local indexes with your specified models
    aeroflow_index = VectorStoreIndex.from_documents(
        aeroflow_docs,
        embed_model=components['embed_model'],
        node_parser=components['text_splitter'],
        show_progress=True
    )
    
    ecosprint_index = VectorStoreIndex.from_documents(
        ecosprint_docs,
        embed_model=components['embed_model'], 
        node_parser=components['text_splitter'],
        show_progress=True
    )
    
    # Create query engine tools (exactly like your original code structure)
    aeroflow_tool = QueryEngineTool.from_defaults(
        query_engine=aeroflow_index.as_query_engine(
            llm=components['llm'],
            similarity_top_k=3,
            response_mode="compact"
        ),
        name="Aeroflow specifications",
        description="Contains information about Aeroflow: Design, features, technology, maintenance, warranty"
    )
    
    ecosprint_tool = QueryEngineTool.from_defaults(
        query_engine=ecosprint_index.as_query_engine(
            llm=components['llm'],
            similarity_top_k=3,
            response_mode="compact"
        ),
        name="EcoSprint specifications", 
        description="Contains information about EcoSprint: Design, features, technology, maintenance, warranty"
    )
    
    print("‚úÖ Successfully created local query engine tools!")
    return aeroflow_tool, ecosprint_tool, aeroflow_index, ecosprint_index

def create_local_router_agent(aeroflow_tool, ecosprint_tool, components):
    """
    Create the local equivalent of your RouterQueryEngine
    This replaces your Azure-based router with local models
    """
    from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
    from llama_index.core.selectors import LLMSingleSelector
    
    # Create router agent with local LLM (exactly like your original code)
    router_agent = RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(llm=components['llm']),
        query_engine_tools=[
            aeroflow_tool,
            ecosprint_tool,
        ],
        verbose=True
    )
    
    print("‚úÖ Local router agent created successfully!")
    return router_agent

# Create components using your successful configuration
print("Creating local RAG components with your working models...")
local_components = create_local_rag_components(
    llm_model="mixtral:8x7b",  # Your successful model
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"  # Your successful model
)

print("‚úÖ Local RAG components created and ready for migration!")
print("\nComponents created:")
print(f"  - LLM: {local_components['llm'].model}")
print(f"  - Embeddings: {local_components['embed_model'].model_name}")
print(f"  - Text Splitter: {local_components['text_splitter'].chunk_size} chars with {local_components['text_splitter'].chunk_overlap} overlap")


=== MIGRATION HELPER FUNCTIONS ===
Creating local RAG components with your working models...
‚úÖ Local RAG components created and ready for migration!

Components created:
  - LLM: mixtral:8x7b
  - Embeddings: sentence-transformers/all-MiniLM-L6-v2
  - Text Splitter: 800 chars with 100 overlap


## 8: Performance Monitoring and Optimization

In [8]:
import time
import psutil
import os

print("=== PERFORMANCE MONITORING SETUP ===")

def monitor_system_resources():
    """Monitor system resources during RAG operations"""
    stats = {
        'cpu_percent': psutil.cpu_percent(interval=1),
        'memory_percent': psutil.virtual_memory().percent,
        'memory_used_gb': psutil.virtual_memory().used / (1024**3),
        'memory_total_gb': psutil.virtual_memory().total / (1024**3),
    }
    
    print(f"üíª System Resources:")
    print(f"  CPU Usage: {stats['cpu_percent']:.1f}%")
    print(f"  Memory Usage: {stats['memory_percent']:.1f}% ({stats['memory_used_gb']:.1f}GB / {stats['memory_total_gb']:.1f}GB)")
    
    return stats

def benchmark_rag_operation(query_engine, query, iterations=3):
    """Benchmark RAG query performance"""
    print(f"\n‚è±Ô∏è  Benchmarking Query: '{query}'")
    
    times = []
    for i in range(iterations):
        start_time = time.time()
        response = query_engine.query(query)
        end_time = time.time()
        
        query_time = end_time - start_time
        times.append(query_time)
        print(f"  Iteration {i+1}: {query_time:.2f}s")
    
    avg_time = sum(times) / len(times)
    print(f"  Average time: {avg_time:.2f}s")
    print(f"  Response length: {len(response.response)} characters")
    
    return avg_time, response

# Monitor current system state
initial_stats = monitor_system_resources()

print("\n‚úÖ Ready for local RAG operations!")
print("\nNext steps:")
print("1. Share your Azure OpenAI code for specific migration")
print("2. Test with your actual PDF documents") 
print("3. Optimize chunk sizes and model parameters")
print("4. Plan Docker containerization strategy")

=== PERFORMANCE MONITORING SETUP ===
üíª System Resources:
  CPU Usage: 16.6%
  Memory Usage: 82.1% (46.9GB / 64.0GB)

‚úÖ Ready for local RAG operations!

Next steps:
1. Share your Azure OpenAI code for specific migration
2. Test with your actual PDF documents
3. Optimize chunk sizes and model parameters
4. Plan Docker containerization strategy
