# AI-Driven OLED Assistant (Domain-Specific RAG)

## Project Overview

**Goal:** Build an intelligent AI assistant for OLED display engineering and technical support.

**Key Features:**
- Strict RAG: Document-based answers only; rejects queries when relevant documents are not found
- Relevance Score: 3-tier decision system (RAG / NO_ANSWER / OFF_TOPIC) based on similarity + sigmoid transformation
- Quantitative Evaluation: LLM-as-a-judge scoring for answer quality (Specificity / Relevance / Factuality)
- Engineering Support: Answers questions about OLED processes, device properties, and optical simulations

---
## Required Packages

### Core Dependencies
- **langchain**: LangChain framework (RAG pipeline construction)
- **langchain-community**: PDF/DOCX loaders and integrations
- **langchain-openai**: Chat API wrapper for OpenAI and OpenAI-compatible servers (e.g., Ollama/vLLM)
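- **docarray**: Python vector search engine (lightweight and fast)
- **chromadb**: Backend for the persistent Chroma vector store used later in this notebook (not listed in the install cell below; install it separately if missing)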
- **pypdf**: PDF text extraction
- **docx2txt**: DOCX text extraction
- **tiktoken**: Token counting (for cost estimation)
- **python-dotenv**: Environment variable management (.env)

In [None]:
# Cell 1
# %pip install --upgrade pip
# %pip install -U langchain langchain-community langchain-openai
# %pip install -U docarray pypdf docx2txt tiktoken openai python-dotenv
# %pip install -U sentence-transformers  # 🔑 For HuggingFace embeddings (this notebook uses BAAI/bge-m3)

## Environment Variables and API Key Configuration

**What This Section Does:**
- Loads environment variables from the `.env` file in the project directory
- Checks for `OPENAI_API_KEY`; the key is optional here because **the current Strict RAG experiment uses local Mistral-Nemo (Ollama)**
- Prepares the key for future OpenAI-based LLM evaluation (e.g., gpt-4o-mini)
- The `python-dotenv` package automatically reads the `.env` file
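
As a rough illustration of how the key can stay optional, the sketch below resolves connection settings from environment variables. The helper `resolve_llm_settings`, the `LLM_BASE_URL` variable, and the placeholder key value are hypothetical and not part of this notebook; `http://localhost:11434/v1` is Ollama's default OpenAI-compatible endpoint.

```python
import os

# Hypothetical helper: not part of the notebook, shown only to illustrate the idea.
def resolve_llm_settings(provider, env=None):
    """Return connection settings for a local or hosted OpenAI-compatible server."""
    env = os.environ if env is None else env
    if provider == "llama_local":
        # Local servers such as Ollama ignore the key, but clients usually need a non-empty value.
        return {
            "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
            "api_key": env.get("OPENAI_API_KEY", "not-needed"),
        }
    # Hosted OpenAI: the key is mandatory, so fail early with a clear message.
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required for the hosted OpenAI provider")
    return {"base_url": "https://api.openai.com/v1", "api_key": api_key}

# With an empty environment, the local provider still resolves to usable settings.
local = resolve_llm_settings("llama_local", env={})
print(local["base_url"])
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.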

In [None]:
# Cell 3: Load OpenAI API Key from .env file
# This cell loads environment variables (especially the API key) before any API calls

import os
from dotenv import load_dotenv

# Load all variables from .env file into environment
# The .env file should be in the same directory as this notebook
load_dotenv()

# Check if API key was successfully loaded
if os.environ.get("OPENAI_API_KEY"):
    print("✅ OpenAI API key loaded successfully from .env file")
else:
    print("❌ Warning: OPENAI_API_KEY not found in .env file")
    print("   Please create a .env file with: OPENAI_API_KEY=your-key-here")

## Configuration (Hyperparameters)

These are the main hyperparameters you can adjust to customize the OLED Assistant system.
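
Before a long run it can help to check the chosen values against the ranges documented in the configuration cell. The sketch below is one possible way to do that; `validate_config` is a hypothetical helper, and its bounds are copied from the comments in that cell rather than enforced anywhere in the notebook.

```python
def validate_config(chunk_size, chunk_overlap, top_k, threshold):
    """Return a list of warnings for values outside the documented ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    warnings = []
    if not 600 <= chunk_size <= 8000:
        warnings.append(f"CHUNK_SIZE={chunk_size} is outside 600-8000")
    # Overlap is documented as 10-20% of the chunk size.
    ratio = chunk_overlap / chunk_size
    if not 0.10 <= ratio <= 0.20:
        warnings.append(f"CHUNK_OVERLAP is {ratio:.0%} of CHUNK_SIZE (documented: 10-20%)")
    if not 2 <= top_k <= 6:
        warnings.append(f"TOP_K_DOCUMENTS={top_k} is outside 2-6")
    if not 0.50 <= threshold <= 0.65:
        warnings.append(f"RELEVANCE_THRESHOLD={threshold} is outside 0.50-0.65")
    return warnings

# The baseline settings (3000 / 500, top-k 4, threshold 0.6) produce no warnings.
print(validate_config(3000, 500, 4, 0.6))  # prints []
```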

In [None]:
# Cell 5: Hyperparameters
# =======================================
# Hyperparameter Tuning Strategy (Strict RAG):
# =======================================
# Step 1: RAG Answer Quality Optimization
#   - Tuning parameters: CHUNK_SIZE, CHUNK_OVERLAP, TOP_K_DOCUMENTS
#
# Step 2: Strict RAG Threshold Settings
#   - Tuning parameters: RELEVANCE_THRESHOLD
#   - Goal: Return 'No Answer' if information is not in the documents

DOCS_FOLDER = "../data"  # Folder containing OLED technical documents
DB_PATH = "../chroma_db" # Vector DB storage path (Persistence)
# LLM Configuration: Using local Mistral-Nemo 12B (Ollama)
LLM_MODEL = "mistral-nemo"      # Must match Ollama model name
LLM_PROVIDER = "llama_local"    # Local OpenAI-compatible server
# ========================================
# STEP 1: RAG Answer Quality Settings
# ========================================
# CHUNK_SIZE: Chunk size (in characters)
#    - Recommended range: 1500 ~ 4000 (600 ~ 8000)
#    - Smaller: Precise search, but risk of context fragmentation
#    - Larger: Context preservation, but risk of topic mixing
#    - Technical docs/papers: 2000~3000 recommended (~1 page)
#    - Embedding input limit: 8192 tokens (~32,000 chars); applies to gte-large-en-v1.5 and to BAAI/bge-m3, the model used below
#
# CHUNK combinations tested in this hyperparameter tuning:
#   1)  800 /  120   (small chunks, fine-grained search)
#   2) 2000 /  300
#   3) 3000 /  500   <- baseline (current default); good results
#   4) 4500 /  600   acceptable results; ~30 min
#   5) 6000 /  800   mediocre results; ~1 hr
#   6) 8000 / 1000   (large chunks, max context preservation) mediocre results; ~1 hr
#
# Round 1: Extreme comparison 800/120 vs 8000/1000 -> additional experiments with surviving range
# Round 2: Compare final candidates with 3000/500 baseline to finalize
CHUNK_SIZE = 3000  # Baseline: 3000 (change per experiment)

# CHUNK_OVERLAP: Overlap between chunks (in characters)
#    - Recommended range: 10-20% of CHUNK_SIZE (100~1000)
#    - Ensures context continuity, prevents info loss at boundaries
CHUNK_OVERLAP = 500  # Baseline: 500 (change per experiment)

# TOP_K_DOCUMENTS: Number of chunks to retrieve
#    - Recommended range: 3 ~ 6 (2 ~ 6)
#    - Smaller: Faster, only top accurate results
#    - Larger: Diverse perspectives, but more noise
TOP_K_DOCUMENTS = 4  # Recommended: 4 (balanced)

# LLM_TEMPERATURE: Answer creativity
#    - 0.0: Deterministic, fact-based (suitable for Strict RAG)
#    - 0.7+: Creative, diverse expressions
LLM_TEMPERATURE = 0.2  # Current: 0.2; use 0.0 for fully deterministic, fact-based answers

# ========================================
# STEP 2: Strict RAG Settings
# ========================================
# SIGMOID_MIDPOINT: Center point of score distribution
#    - Recommended range: 0.45, 0.48, 0.5, 0.52, 0.55, 0.58, 0.60
#    - Higher: Lower scores become even lower (stricter)
#    - Adjust after testing with new embedding model
SIGMOID_MIDPOINT = 0.50  # Recommended: 0.50-0.55 (adjust after testing)

# SIGMOID_STEEPNESS: Score separation strength
#    - Recommended range: 8, 10, 12, 14, 16, 18, 20
#    - Higher: Mid-range scores pushed to extremes
SIGMOID_STEEPNESS = 18  # Current: 18 (above the 12-15 band usually recommended)

# RELEVANCE_THRESHOLD: Answer eligibility threshold
#    - Recommended range: 0.50 ~ 0.65
#    - Returns 'No Answer' if score is below this
#    - Adjust after testing with new embedding model
RELEVANCE_THRESHOLD = 0.6

# Test Questions
# Various questions to test Relevance Score distribution
# Expected distribution: High(0.8+) -> Medium(0.5-0.7) -> Low(0.2-0.4) -> Very Low(<0.2)
TEST_QUERIES = [
    # Case 1: High relevance, must-be-RAG
    "What are the key degradation mechanisms of a blue phosphorescent OLED?",
    
    # Case 2: High but slightly lower than Case 1
    "How does exciton diffusion length affect charge separation efficiency in organic semiconductor devices?",
    
    # Case 3: Display but non-OLED
    "What are the major challenges in mass transfer processes for MicroLED displays?",  
    
    # Case 4: Non-display semiconductor device
    "How does doping concentration affect the electron mobility in silicon MOSFETs?",
    
    # Case 5: Science but non-electronics
    "Describe the physical principles used to reduce aerodynamic drag in automotive design",

    # Case 6: Non-science, off-topic
    "Recommend a good hiking trail near Santa Clara for a weekend trip",  
]

print("=" * 80)
print("Configuration Complete!")
print("=" * 80)
print(f"Document Folder: {DOCS_FOLDER}")
print(f"DB Path: {DB_PATH}")
print("=" * 80)

## Logging, Monitoring & Error Handling Setup

**Why add this before hyperparameter tuning?**

When experimenting with various hyperparameters, you need:
1. **Logging**: Record what's happening (for debugging)
2. **Monitoring**: Track performance (response time, cost)
3. **Error Handling**: Handle errors gracefully (API failures, timeouts)

---

### What We Will Track

**Performance Metrics:**
- Response time per query
- Token usage and estimated cost
- Mode distribution (RAG vs NO_ANSWER vs OFF_TOPIC)

**Error Handling:**
- LLM API errors (Rate limits, timeouts from OpenAI or local Mistral OpenAI-compatible servers)
- Document loading errors (file not found, corrupted PDF)
- Automatic retry for transient errors

**Experiment Tracking:**
- Auto-save results to CSV files
- Compare different hyperparameter settings
- Find optimal settings based on data
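
Once several runs are recorded, finding the best setting amounts to reading the CSV back and ranking the rows. The sketch below assumes the column names `chunk_size`, `chunk_overlap`, `rel_margin`, and `avg_rag_score`, with `N/A` marking missing values. The sample rows are invented purely for illustration and are not measured results; `to_float` and `best_run` are hypothetical helpers.

```python
import csv
import io

# Made-up rows for illustration only; real rows would come from hyperparameter_experiments.csv.
SAMPLE_CSV = """chunk_size,chunk_overlap,rel_margin,avg_rag_score
800,120,0.210,6.50
3000,500,0.340,7.80
8000,1000,N/A,7.10
"""

def to_float(value):
    """Convert a CSV cell to float, treating 'N/A' or blanks as missing."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def best_run(csv_text, metric):
    """Return the row with the highest value of `metric`, skipping missing values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    scored = [(to_float(row[metric]), row) for row in rows]
    scored = [(score, row) for score, row in scored if score is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]

best = best_run(SAMPLE_CSV, "rel_margin")
print(best["chunk_size"], best["chunk_overlap"])
```

To rank real experiments, read the file contents with `open(...).read()` and pass them in place of `SAMPLE_CSV`.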

---

In [None]:
# Cell 7: Logging, Monitoring & Error Handling Setup
# This cell sets up logging, cost tracking, error handling, and experiment tracking

import logging
import time
import csv
import os
from datetime import datetime
from functools import wraps
import tiktoken

# ========================================
# 📝 LOGGING SETUP
# ========================================
# Create logs directory if it doesn't exist
os.makedirs("logs", exist_ok=True)

# Configure logging
# Remove and close existing handlers first so this cell can be re-run without duplicating log output
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
    handler.close()  # Clean up handler resources

# Then apply the base configuration
log_filename = f'logs/AI_OLED_assistant_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_filename),
        logging.StreamHandler()  # Also print to console
    ],
    force=True  # Python 3.8+: force reconfiguration to avoid duplicate handlers
)

logger = logging.getLogger(__name__)
logger.info(f"📝 Logging initialized: {log_filename}")

# ========================================
# 💰 COST TRACKING
# ========================================
class CostTracker:
    """
    Track token usage and estimated API costs.
    
    OpenAI list prices are hard-coded below and may be outdated; verify them before relying on the estimate.
    - Example (GPT-4o-mini): $0.150 per 1M input tokens, $0.600 per 1M output tokens
    """
    
    # Pricing per 1M tokens (USD)
    PRICING = {
        "gpt-5": {"input": 1.25, "output": 10.00},
        "gpt-5-mini": {"input": 0.25, "output": 2.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        # Local LLM (Mistral, etc.) has zero cost, only track tokens
        "mistral-nemo": {"input": 0.0, "output": 0.0},
    }
    
    def __init__(self, model_name="gpt-4o-mini"):
        """Initialize cost tracker for a specific model."""
        self.model_name = model_name
        # Use safe default encoding for model names unknown to tiktoken
        try:
            self.encoding = tiktoken.encoding_for_model(model_name)
        except Exception:
            self.encoding = tiktoken.get_encoding("cl100k_base")
        self.total_input_tokens = 0
        self.total_output_tokens = 0
    
    def count_tokens(self, text):
        """Count tokens in a text string."""
        return len(self.encoding.encode(text))
    
    def add_tokens(self, input_text, output_text):
        """
        Add token counts for input and output.
        
        Args:
            input_text: Input prompt text
            output_text: Model response text
        """
        input_tokens = self.count_tokens(input_text)
        output_tokens = self.count_tokens(output_text)
        
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        
        return input_tokens, output_tokens
    
    def get_cost(self):
        """Calculate total cost in USD."""
        if self.model_name not in self.PRICING:
            return 0.0  # Unknown model
        
        pricing = self.PRICING[self.model_name]
        input_cost = (self.total_input_tokens / 1_000_000) * pricing["input"]
        output_cost = (self.total_output_tokens / 1_000_000) * pricing["output"]
        
        return input_cost + output_cost
    
    def get_summary(self):
        """Get summary of token usage and cost."""
        return {
            "input_tokens": self.total_input_tokens,
            "output_tokens": self.total_output_tokens,
            "total_tokens": self.total_input_tokens + self.total_output_tokens,
            "estimated_cost_usd": self.get_cost()
        }
    
    def reset(self):
        """Reset token counters."""
        self.total_input_tokens = 0
        self.total_output_tokens = 0

# Initialize global cost tracker
cost_tracker = CostTracker(model_name=LLM_MODEL)

# ========================================
# 🛡️ ERROR HANDLING UTILITIES
# ========================================
def retry_on_api_error(max_retries=2, delay=2):
    """
    Decorator to retry function on API errors.
    
    Args:
        max_retries: Maximum number of retry attempts
        delay: Delay in seconds between retries
    
    Usage:
        @retry_on_api_error(max_retries=2, delay=2)
        def api_call():
            # Your API call here
            pass
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    error_type = type(e).__name__
                    
                    # Log the error
                    if attempt < max_retries:
                        logger.warning(f"⚠️ {error_type} on attempt {attempt + 1}/{max_retries + 1}: {str(e)}")
                        logger.info(f"🔄 Retrying in {delay} seconds...")
                        time.sleep(delay)
                    else:
                        logger.error(f"❌ Failed after {max_retries + 1} attempts: {str(e)}")
            
            # If all retries failed, raise the last exception
            raise last_exception
        
        return wrapper
    return decorator

def safe_file_load(file_path, loader_class):
    """
    Safely load a file with error handling.
    
    Args:
        file_path: Path to file
        loader_class: LangChain loader class (PyPDFLoader or Docx2txtLoader)
    
    Returns:
        List of loaded documents, or empty list if failed
    """
    try:
        loader = loader_class(file_path)
        documents = loader.load()
        logger.info(f"✅ Loaded: {os.path.basename(file_path)} ({len(documents)} pages/sections)")
        return documents
    except FileNotFoundError:
        logger.error(f"❌ File not found: {file_path}")
        return []
    except PermissionError:
        logger.error(f"❌ Permission denied: {file_path}")
        return []
    except Exception as e:
        logger.error(f"❌ Error loading {os.path.basename(file_path)}: {type(e).__name__} - {str(e)}")
        return []

# ========================================
# 📊 EXPERIMENT TRACKING
# ========================================
class ExperimentTracker:
    """Track and save hyperparameter experiment results."""
    
    def __init__(self, csv_filename="hyperparameter_experiments.csv"):
        """Initialize experiment tracker."""
        self.csv_filename = csv_filename
        self.current_experiment = {}
        
        # Create CSV file with headers if it doesn't exist
        if not os.path.exists(csv_filename):
            with open(csv_filename, 'w', newline='') as f:
                # CSV headers for Strict RAG (no LLM baseline comparison)
                writer = csv.DictWriter(f, fieldnames=[
                    'timestamp',  # 1

                    # RAG Quality (Step 1)
                    'chunk_size', 'chunk_overlap', 'top_k', 'temperature',  # 2-5

                    # System Architecture (Step 2) - Used for Strict RAG only
                    'relevance_threshold', 'sigmoid_midpoint', 'sigmoid_steepness',  # 6-8

                    # Performance
                    'avg_response_time_sec', 'total_tokens', 'estimated_cost_usd',  # 9-11

                    # Mode Distribution (Strict RAG: RAG, NO_ANSWER_IN_DOCS, OFF_TOPIC only)
                    'mode_rag_count', 'mode_no_answer_count', 'mode_off_topic_count', 'mode_error_count',  # 12-15

                    # Score Metrics (RAG only, no LLM baseline)
                    'avg_relevance_score', 'avg_rag_score',  # 16-17
                    'avg_rag_specificity', 'avg_rag_relevance', 'avg_rag_factuality',  # 18-20

                    # Relevance separation metrics (Q1–Q2 vs Q5–Q6)
                    'rel_mean_pos', 'rel_mean_neg', 'rel_gap', 'rel_margin',  # 21-24

                    'notes'  # 25
                ])
                writer.writeheader()
    
    def start_experiment(self, config):
        """
        Start tracking a new experiment.
        
        Args:
            config: Dictionary with hyperparameter settings
        """
        # Strict RAG modes: RAG, NO_ANSWER_IN_DOCS, OFF_TOPIC only
        self.current_experiment = {
            'timestamp': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            **config,
            'response_times': [],
            'mode_counts': {
                'RAG': 0, 
                'NO_ANSWER_IN_DOCS': 0,
                'OFF_TOPIC': 0,
                'ERROR': 0
            }
        }
        logger.info(f"🧪 Starting experiment with config: {config}")
    
    def log_query(self, response_time, mode):
        """Log a single query result."""
        self.current_experiment['response_times'].append(response_time)
        self.current_experiment['mode_counts'][mode] += 1
    
    def save_experiment(self, 
                       avg_relevance_score=None, avg_rag_score=None,
                       avg_rag_specificity=None, avg_rag_relevance=None, avg_rag_factuality=None,
                       rel_mean_pos=None, rel_mean_neg=None, rel_gap=None, rel_margin=None,
                       notes=""):
        """Save Strict RAG experiment results to CSV.

        Args
        ----
        avg_relevance_score : float | None
            Average relevance score (0-1)
        avg_rag_score : float | None
            Average RAG answer quality (1-10)
        avg_rag_specificity : float | None
            Average RAG specificity score (1-10)
        avg_rag_relevance : float | None
            Average RAG relevance score (1-10)
        avg_rag_factuality : float | None
            Average RAG factuality score (1-10)
        rel_mean_pos : float | None
            Q1-Q2 (OLED core) relevance average
        rel_mean_neg : float | None
            Q5-Q6 (OFF-TOPIC) relevance average
        rel_gap : float | None
            rel_mean_pos - rel_mean_neg
        rel_margin : float | None
            min(Q1-Q2) - max(Q5-Q6)
        notes : str
            Additional notes for the experiment
        """
        cost_summary = cost_tracker.get_summary()
        
        # Calculate averages
        avg_response_time = sum(self.current_experiment['response_times']) / len(self.current_experiment['response_times']) if self.current_experiment['response_times'] else 0
        
        # Row data for Strict RAG (matches CSV header order)
        row = {
            'timestamp': self.current_experiment['timestamp'],

            # RAG Quality (Step 1)
            'chunk_size': self.current_experiment.get('chunk_size'),
            'chunk_overlap': self.current_experiment.get('chunk_overlap'),
            'top_k': self.current_experiment.get('top_k'),
            'temperature': self.current_experiment.get('temperature'),

            # System Architecture (Step 2) - Used for Strict RAG only
            'relevance_threshold': self.current_experiment.get('relevance_threshold'),
            'sigmoid_midpoint': self.current_experiment.get('sigmoid_midpoint'),
            'sigmoid_steepness': self.current_experiment.get('sigmoid_steepness'),
            
            # Performance
            'avg_response_time_sec': f"{avg_response_time:.2f}",
            'total_tokens': cost_summary['total_tokens'],
            'estimated_cost_usd': f"${cost_summary['estimated_cost_usd']:.4f}",

            # Mode Distribution (Strict RAG: RAG, NO_ANSWER_IN_DOCS, OFF_TOPIC only)
            'mode_rag_count': self.current_experiment['mode_counts']['RAG'],
            'mode_no_answer_count': self.current_experiment['mode_counts']['NO_ANSWER_IN_DOCS'],
            'mode_off_topic_count': self.current_experiment['mode_counts']['OFF_TOPIC'],
            'mode_error_count': self.current_experiment['mode_counts']['ERROR'],
            
            # Score Metrics (RAG only, no LLM baseline)
            'avg_relevance_score': f"{avg_relevance_score:.3f}" if avg_relevance_score is not None else "N/A",
            'avg_rag_score': f"{avg_rag_score:.2f}" if avg_rag_score is not None else "N/A",
            'avg_rag_specificity': f"{avg_rag_specificity:.2f}" if avg_rag_specificity is not None else "N/A",
            'avg_rag_relevance': f"{avg_rag_relevance:.2f}" if avg_rag_relevance is not None else "N/A",
            'avg_rag_factuality': f"{avg_rag_factuality:.2f}" if avg_rag_factuality is not None else "N/A",

            # Relevance separation metrics (Q1–Q2 vs Q5–Q6)
            'rel_mean_pos': f"{rel_mean_pos:.3f}" if rel_mean_pos is not None else "N/A",
            'rel_mean_neg': f"{rel_mean_neg:.3f}" if rel_mean_neg is not None else "N/A",
            'rel_gap': f"{rel_gap:.3f}" if rel_gap is not None else "N/A",
            'rel_margin': f"{rel_margin:.3f}" if rel_margin is not None else "N/A",

            'notes': notes
        }
        
        # Append to CSV
        with open(self.csv_filename, 'a', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=row.keys())
            writer.writerow(row)
        
        logger.info(f"💾 Experiment saved to {self.csv_filename}")
        logger.info(f"📊 Summary: Avg time={avg_response_time:.2f}s, Cost=${cost_summary['estimated_cost_usd']:.4f}")

# Initialize global experiment tracker
experiment_tracker = ExperimentTracker()

print("✅ Logging, Monitoring & Error Handling configured successfully!")
print("📝 Logs will be saved to: logs/")
print("📊 Experiment results will be saved to: hyperparameter_experiments.csv")
print(f"💰 Cost tracking enabled for model: {LLM_MODEL}")

## Logging, Monitoring & Error Handling Usage

**Automatic Logging:**
- All operations are logged to `logs/AI_OLED_assistant_[timestamp].log`
- Includes timestamps, error messages, and performance metrics
- Written to both the log file and the notebook console

**Cost Tracking:**
- Token usage for each query is automatically calculated
- Estimated API cost is computed in real-time
- Summary is printed after each experiment
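
The cost estimate is simple arithmetic: tokens divided by one million, multiplied by the per-million price. The sketch below reproduces that calculation using the gpt-4o-mini and mistral-nemo prices from the tracker's pricing table. The token counts are arbitrary example values, and `estimate_cost` is a hypothetical stand-alone helper.

```python
# Prices in USD per 1M tokens, copied from the tracker's pricing table.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "mistral-nemo": {"input": 0.0, "output": 0.0},  # Local model: tokens are counted, cost is zero
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of a run; unknown models are treated as free."""
    price = PRICING.get(model)
    if price is None:
        return 0.0
    input_cost = (input_tokens / 1_000_000) * price["input"]
    output_cost = (output_tokens / 1_000_000) * price["output"]
    return input_cost + output_cost

# Example: 120,000 input tokens and 30,000 output tokens (arbitrary values).
print(f"${estimate_cost('gpt-4o-mini', 120_000, 30_000):.4f}")  # prints $0.0360
```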

**Error Handling:**
- Automatic retry on API failure (up to 2 times)
- File loading errors are logged
- User-friendly error messages are returned

**Experiment Tracking:**
- All hyperparameter experiments are saved to `hyperparameter_experiments.csv`
- Includes response time, token usage, cost, mode distribution, etc.

---

## Document Loading Setup

**What This Section Does:**
Configuration section for document loading. Actual loading is performed in the next section.

**Environment:**
- Designed for local execution
- Reads OLED technical documents directly from the folder set in `DOCS_FOLDER` (`../data` by default)

## Document Loading and Chunking

**What This Section Does:**
This is where OLED technical documents are actually loaded and processed.

**Step-by-step Process:**

1. **File Discovery** (`glob.glob` with `recursive=True`)
   - Scans for PDF/DOCX files in `data/` folder and **all subfolders**
   - Documents organized in folders are automatically recognized

2. **Document Loading** (`PyPDFLoader`, `Docx2txtLoader`)
   - PDF → text conversion (one page = one document)
   - DOCX → text conversion (one file = one document)
   - `safe_file_load()` wrapper prevents corrupted files from crashing the notebook

3. **Text Chunking** (`RecursiveCharacterTextSplitter`)
   - Splits long documents into smaller chunks
   - Each chunk is `CHUNK_SIZE` characters (e.g., current setting: 3000)
   - `CHUNK_OVERLAP` characters overlap between chunks (e.g., current setting: 500)
   - Why overlap? Prevents context loss when sentences are cut at chunk boundaries

**Why We Chunk Documents:**
- **LLM token limit**: Models have maximum input size (e.g., 128K tokens)
- **Better retrieval**: Smaller chunks match queries more accurately
- **Context preservation**: Overlap prevents important info loss at boundaries
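
To make the overlap mechanics concrete, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally, since that class prefers paragraph and sentence boundaries. `sliding_window_chunks` is a hypothetical helper that only shows how consecutive chunks share `overlap` characters.

```python
def sliding_window_chunks(text, chunk_size, overlap):
    """Split text into fixed-size windows where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # Each new chunk starts this many characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text
    return chunks

print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

In the toy output each chunk ends with the character that begins the next one, which is the same boundary protection `CHUNK_OVERLAP` provides at a larger scale.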

In [None]:
# Cell 10: Load and Chunk Documents
# This cell loads all PDF and DOCX files from the docs folder and splits them into chunks

import os
import glob
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader
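# Newer LangChain releases ship the splitters in langchain_text_splitters; fall back for older installs
try:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.text_splitter import RecursiveCharacterTextSplitter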

# ========================================
# 0. Fast path: If ChromaDB already exists, skip loading & chunking
#    - If DB exists, skip PDF loading + chunking, load existing DB in Cell 12
# ========================================
if os.path.exists(DB_PATH) and os.listdir(DB_PATH):
    print(f"📂 Existing ChromaDB found at {DB_PATH}")
    print("⏭️ Skipping document loading & chunking (using existing DB only).")
    logger.info(f"Skip loading/chunking because DB already exists at {DB_PATH}")

else:
    # ========================================
    # STEP 1: Find all PDF and DOCX files
    # ========================================
    logger.info(f"📂 Scanning folder: {DOCS_FOLDER}")

    # Check if docs folder exists before proceeding
    if not os.path.exists(DOCS_FOLDER):
        logger.error(f"❌ Docs folder not found: {DOCS_FOLDER}")
        raise FileNotFoundError(f"Documents folder '{DOCS_FOLDER}' does not exist")

    # Use glob to find all PDF and DOCX files in the folder (including subfolders)
    # recursive=True and ** pattern enables searching in subdirectories
    pdf_files = glob.glob(os.path.join(DOCS_FOLDER, "**/*.pdf"), recursive=True)
    docx_files = glob.glob(os.path.join(DOCS_FOLDER, "**/*.docx"), recursive=True)

    print(f"📂 Found {len(pdf_files)} PDF files and {len(docx_files)} DOCX files in {DOCS_FOLDER}/ (including subfolders)")
    logger.info(f"Found {len(pdf_files)} PDFs and {len(docx_files)} DOCX files")

    # Check if any files found - warn if folder is empty
    if len(pdf_files) == 0 and len(docx_files) == 0:
        logger.warning(f"⚠️ No PDF or DOCX files found in {DOCS_FOLDER}/")
        print(f"⚠️ WARNING: No documents found. Please add PDF or DOCX files to {DOCS_FOLDER}/")

    # ========================================
    # STEP 2: Load all documents into memory
    # ========================================
    all_documents = []  # This will store all loaded document pages/sections

    # Load PDF files using PyPDFLoader (one page = one document)
    # safe_file_load() wraps the loader with error handling (won't crash on corrupt files)
    print("\n📄 Loading PDF files...")
    for pdf_file in pdf_files:
        # safe_file_load() handles errors gracefully - returns [] on failure
        documents = safe_file_load(pdf_file, PyPDFLoader)
        all_documents.extend(documents)  # Add all pages from this PDF to our collection

    # Load DOCX files using Docx2txtLoader (one file = one document)
    print("\n📄 Loading DOCX files...")
    for docx_file in docx_files:
        documents = safe_file_load(docx_file, Docx2txtLoader)
        all_documents.extend(documents)

    # ========================================
    # STEP 3: Validate that documents were loaded
    # ========================================
    if len(all_documents) == 0:
        logger.error("❌ No documents were successfully loaded")
        raise ValueError("Failed to load any documents. Please check file formats and permissions.")

    print(f"\n✅ Total documents loaded: {len(all_documents)}")
    logger.info(f"Successfully loaded {len(all_documents)} document sections")

    # ========================================
    # STEP 4: Split documents into chunks
    # ========================================
    # RecursiveCharacterTextSplitter splits text intelligently:
    # - Tries to split at paragraph breaks first, then sentences, then words
    # - Ensures chunks are approximately CHUNK_SIZE characters
    # - Adds CHUNK_OVERLAP characters of overlap between chunks to preserve context
    try:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=CHUNK_SIZE,      # Target size: CHUNK_SIZE characters per chunk
            chunk_overlap=CHUNK_OVERLAP  # Overlap: CHUNK_OVERLAP characters between chunks
        )
        # Split all documents into chunks
        docs = text_splitter.split_documents(all_documents)
        # What split_documents() does:
        #   1. Reads each Document's page_content.
        #   2. Splits text into ~CHUNK_SIZE-character chunks (paragraph/sentence boundaries preferred).
        #   3. Creates a new Document object for each chunk.
        #   4. Preserves original metadata (source, page, etc.).
        # Result: docs = [chunk_1, chunk_2, chunk_3, ...]
        # Each chunk is a Document object
        print(f"🔹 Total chunks created: {len(docs)}")
        logger.info(f"Created {len(docs)} text chunks (size={CHUNK_SIZE}, overlap={CHUNK_OVERLAP})")
    except Exception as e:
        logger.error(f"❌ Error splitting documents: {str(e)}")
        raise

## Embedding and Vector Store Creation

**What This Section Does:**
Converts all document chunks into numerical vectors (embeddings) and builds a searchable index.

**Why We Need Embeddings:**
- Raw text cannot be directly compared by computers
- Embeddings convert text into numerical vectors that capture **semantic meaning**
- Similar text → similar vectors → vector comparison enables relevant document retrieval

**Process:**

1. **Embedding Model Initialization**
   - Using **BAAI/bge-m3** (multilingual, multi-task embedding, medium size)
   - Based on sentence-transformers, easily usable with `HuggingFaceEmbeddings`
   - Rationale: lighter than gte-large, so it runs at a practical speed on a local Mac (MPS) while retaining good retrieval quality

2. **Vector Embedding Generation**
   - Each document chunk is converted to a vector (high-dimensional number array)
   - These vectors represent the **semantic meaning** of the text

3. **Vector Store Construction** (`ChromaDB`)
   - Stores all embeddings for fast retrieval
   - Builds an index enabling similarity search
   - Supports local persistent storage

**Next Steps:**
- When you ask a question, the query is also converted to an embedding
- The system compares the query embedding with all document embeddings
- Returns the most similar documents (based on cosine similarity scores)

**Key Advantages:**
- Goes beyond simple keyword matching (understands "OLED" and "organic light-emitting diode" are related)
- Captures semantic relationships (e.g., "phosphorescent materials" vs "blue phosphorescence")
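
The similarity comparison can be illustrated without any model at all. In the sketch below the three-dimensional vectors are toy values invented for the example (real embeddings have far more dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers rather than ChromaDB calls.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors invented for illustration; they do not come from a real embedding model.
toy_vectors = {
    "blue_phosphorescent_oled": [0.9, 0.1, 0.0],
    "microled_mass_transfer": [0.4, 0.8, 0.1],
    "hiking_trails": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

for name, score in top_k(query, toy_vectors, k=2):
    print(f"{name}: {score:.3f}")
```

When embeddings are already normalized to unit length, the dot product alone equals the cosine similarity, so the division can be skipped.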

In [None]:
# Cell 12: Vector Store Construction (ChromaDB with Persistence)
# This cell embeds documents and saves to ChromaDB, or loads existing DB if present.
import os
import time
import shutil
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# ========================================
# STEP 1: Embedding Model Initialization
# ========================================
# BAAI/bge-m3: Medium-sized multilingual/multi-task embedding model
#    - sentence-transformers based -> easy to use with HuggingFaceEmbeddings
#    - Lighter than gte-large, realistic speed on Mac(MPS)
EMBEDDING_MODEL = "BAAI/bge-m3"

try:
    print("Loading embedding model...")
    print(f"Model: {EMBEDDING_MODEL}")
    
    import torch
    
    # Device configuration
    if torch.cuda.is_available():
        device = 'cuda'
    elif torch.backends.mps.is_available():
        device = 'mps'
    else:
        device = 'cpu'
    print(f"Device: {device.upper()}")
    
    # Load BGE-m3 embedding model
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL,
        model_kwargs={
            'device': device,
        },
        encode_kwargs={
            'normalize_embeddings': True,  # Normalize for cosine similarity
            'batch_size': 16,              # Batch size (adjustable between 8~32)
        }
    )
    print(f"Embedding model loaded! ({EMBEDDING_MODEL})")
    
except Exception as e:
    print(f"Failed to load embedding model: {e}")
    raise

# ========================================
# STEP 2: Load or Create ChromaDB (Check -> Exist?)
# ========================================
def get_vectorstore():
    # Check if DB folder exists and is not empty
    if os.path.exists(DB_PATH) and os.listdir(DB_PATH):
        print(f"\nExisting ChromaDB found: {DB_PATH}")
        print("Loading existing DB... (skipping embedding)")
        
        vectorstore = Chroma(
            persist_directory=DB_PATH,
            embedding_function=embeddings
        )
        print("Existing DB loaded!")
        return vectorstore
    else:
        print(f"\nNo existing DB found. Embedding documents from scratch.")
        print(f"Embedding {len(docs)} chunks... (this may take a while)")
        
        # Remove empty folder if exists before creating new DB
        if os.path.exists(DB_PATH):
            shutil.rmtree(DB_PATH)
            
        vectorstore = Chroma.from_documents(
            documents=docs,
            embedding=embeddings,
            persist_directory=DB_PATH
        )
        # Recent Chroma versions (0.4+) persist automatically when persist_directory is set,
        # so no explicit vectorstore.persist() call is needed.
        print(f"New DB created and saved: {DB_PATH}")
        return vectorstore

try:
    vectorstore = get_vectorstore()
    # Verify data
    count = vectorstore._collection.count()
    print(f"Document chunks stored in DB: {count}")
except Exception as e:
    print(f"Error processing vector store: {e}")
    raise

## Strict RAG System

**What This Is:**
The **core engine** of the OLED Assistant. It is a Strict RAG system that treats the documents as the **primary source** and allows the LLM's own OLED/physics knowledge only as a supplement when the retrieved context is partial.

**Problem It Solves:**
- If the answer is in documents -> Answer with RAG using documents first (limited LLM knowledge as backup if needed)
- If documents have no relevant content -> Return "No Answer" or "Information not found in documents"
- Questions clearly unrelated to OLED -> Auto-reject (OFF_TOPIC)

**How the Decision Process Works:**

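1. **Relevance Score Calculation (0.0 - 1.0)**
   - Convert the query to an embedding
   - Compare it with the document embeddings and retrieve the top-K chunks
   - Convert each Chroma distance to a similarity with `1 / (1 + distance)` and average the results
   - Apply a **sigmoid transformation** to the average to amplify separation (the exact ranges depend on `SIGMOID_MIDPOINT` and `SIGMOID_STEEPNESS`):
     - OLED-related queries -> pushed toward the 0.80-0.99 range
     - Unrelated queries -> pushed toward the 0.01-0.20 range
   - Sigmoid of the average similarity = **Relevance Score**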

2. **Three-Tier Decision System:**

   **TIER 1: RAG Mode** (Relevance >= `RELEVANCE_THRESHOLD`)
   - Documents are highly relevant
   - Use RAG directly (documents + LLM)
   - Result: Green answer

   **TIER 2: No Information in Documents** (RAG returns "no info")
   - Documents are relevant but don't have the specific answer
   - Return "Information not found"
   - Result: Orange (missing information)

   **TIER 3: Off-Topic Auto-Rejection** (Relevance < `RELEVANCE_THRESHOLD`)
   - Clearly unrelated to OLED displays
   - Immediate rejection (no LLM call -> cost savings)
   - Result: Red (auto-rejected)
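
The three tiers above can be condensed into one small decision function. This is a minimal sketch, assuming a hypothetical helper named `decide_mode` and an illustrative threshold of 0.5; the mode labels match those used by the advisor, but the notebook performs this logic inside `StrictRAGAdvisor.query()`.

```python
def decide_mode(relevance_score, relevance_threshold, rag_answer=None):
    """Map a relevance score (and, if available, the RAG answer) to a mode label."""
    if relevance_score < relevance_threshold:
        # Tier 3: rejected before any LLM call is made
        return "OFF_TOPIC"
    if rag_answer is not None and "Information not found" in rag_answer:
        # Tier 2: documents were relevant, but the LLM reported no answer
        return "NO_ANSWER_IN_DOCS"
    # Tier 1: relevant documents and a usable answer
    return "RAG"

# Illustrative threshold only; the notebook reads RELEVANCE_THRESHOLD from its config
THRESHOLD = 0.5

print(decide_mode(0.91, THRESHOLD, "Blue emitters degrade via exciton-polaron annihilation."))
print(decide_mode(0.78, THRESHOLD, "Information not found in the provided OLED documents."))
print(decide_mode(0.07, THRESHOLD))
```

Checking the threshold first is what saves cost: off-topic queries never reach the LLM.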

---

### Understanding the Score System

**1. Relevance Score (0.0 - 1.0)**
- **Purpose**: "Is the document relevant to this query?"
- **Calculation:**
  1. Query -> embedding vector
  2. Retrieve the top-K document chunks with their Chroma distance scores
  3. Convert each distance to a similarity (`1 / (1 + distance)`) and average
  4. Apply sigmoid transformation (amplify separation)
- **Usage**: Determines which mode to use
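
To see how the sigmoid step spreads scores apart, here is a small worked example. The midpoint of 0.5 and steepness of 10 are illustrative assumptions, not the notebook's configured `SIGMOID_MIDPOINT` and `SIGMOID_STEEPNESS`, and `sigmoid_relevance` is a hypothetical helper name.

```python
import math

def sigmoid_relevance(avg_similarity, midpoint, steepness):
    """Logistic transform centered at `midpoint`; larger `steepness` sharpens the split."""
    return 1.0 / (1.0 + math.exp(-steepness * (avg_similarity - midpoint)))

# Assumed parameters for illustration only
MIDPOINT, STEEPNESS = 0.5, 10.0

for avg in (0.3, 0.5, 0.7):
    score = sigmoid_relevance(avg, MIDPOINT, STEEPNESS)
    print(f"avg similarity {avg:.1f} -> relevance {score:.3f}")
```

With these assumed parameters, an average similarity at the midpoint maps to exactly 0.5, while values 0.2 above or below it land near 0.88 and 0.12, so a modest gap in raw similarity becomes a wide gap in relevance score.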

**2. RAG Quality Score (1-10)**
- **Purpose**: "How good is the RAG answer?"
- **Calculation:** LLM evaluates with 3 metrics, then averages:
  - **Specificity** (1-10): How specific is the answer?
  - **Relevance** (1-10): Does it actually answer the question?
  - **Factuality** (1-10): Does it contain verifiable facts and data?
- **Usage**: Evaluate RAG system performance

**Key Distinction:**
- **Relevance Score** = Quality of **document matching** (input quality)
- **RAG Score** = Quality of **the answer itself** (output quality)
- They are **independent** - high relevance can still produce low-quality answers
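
The RAG Quality Score is the plain average of the three judge metrics. A minimal sketch follows; the helper name `overall_rag_score` and the example scores are made up for illustration.

```python
def overall_rag_score(scores):
    """Average the three judge metrics into a single 1-10 quality score."""
    keys = ("specificity", "relevance", "factuality")
    return sum(scores[k] for k in keys) / len(keys)

# Hypothetical judge output for one answer
example_scores = {"specificity": 8, "relevance": 9, "factuality": 7}
overall = overall_rag_score(example_scores)
print(overall)  # 8.0
```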

In [None]:
# Cell 14: Strict RAG Advisor Implementation
# This cell implements a Strict RAG system that answers only based on document content.
import json
import math
import os
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from IPython.display import HTML, display


def create_llm(provider: str, model_name: str, temperature: float):
    """LLM factory function.
    
    - provider="openai"  -> OpenAI official endpoint (e.g., gpt-4o-mini)
    - provider="llama_local" -> OpenAI-compatible local/internal endpoint (e.g., Ollama, vLLM, internal LLM server)
    
    By separating this way, you can easily switch between OpenAI models and local Mistral/Llama/Gemma
    by just changing the provider / model_name.
    """
    if provider == "openai":
        # Current setting: OpenAI API (e.g., gpt-4o-mini)
        return ChatOpenAI(model=model_name, temperature=temperature)
    elif provider == "llama_local":
        # Local/internal OpenAI-compatible server (e.g., Ollama, vLLM)
        # Default is Ollama OpenAI endpoint: http://localhost:11434/v1
        base_url = os.getenv("LOCAL_LLM_BASE_URL", os.getenv("LLAMA_BASE_URL", "http://localhost:11434/v1"))
        api_key = os.getenv("LOCAL_LLM_API_KEY", os.getenv("LLAMA_API_KEY", "ollama"))  # Ollama doesn't check tokens
        return ChatOpenAI(
            model=model_name,
            temperature=temperature,
            base_url=base_url,
            api_key=api_key,
        )
    else:
        raise ValueError(f"Unknown LLM provider: {provider}")


class StrictRAGAdvisor:
    """Strict RAG System: Returns 'No Answer' if documents don't have the answer or question is off-topic."""
    
    def __init__(self, vectorstore, llm_provider, llm_model, relevance_threshold, top_k, temperature, sigmoid_midpoint, sigmoid_steepness):
        self.vectorstore = vectorstore
        self.relevance_threshold = relevance_threshold
        self.top_k = top_k
        self.sigmoid_midpoint = sigmoid_midpoint
        self.sigmoid_steepness = sigmoid_steepness
        
        # Create LLM here -> easily switch by changing provider later
        self.llm = create_llm(provider=llm_provider, model_name=llm_model, temperature=temperature)
        
        # Strict RAG prompt: documents come first, but the LLM may add supplementary knowledge (not a hard block)
        rag_prompt_template = """You are an OLED Display Technical Assistant.
Answer the question using the provided context documents as your PRIMARY source.

RULES:
1. Always read the Context carefully and base your answer as much as possible on the Context.
2. If the Context contains partial but relevant information, you MAY use your own OLED/physics knowledge to fill in missing logical steps.
3. ONLY when the Context is clearly irrelevant or provides almost no signal, say: "Information not found in the provided OLED documents."
4. Never contradict the facts given in the Context.
5. Do NOT hallucinate specific numbers, experimental conditions, or paper titles that are not supported by the Context.

Context: {context}

Question: {question}

Answer:"""
        
        RAG_PROMPT = PromptTemplate(
            template = rag_prompt_template,
            input_variables = ["context", "question"]
        )
        
        self.rag_chain = RetrievalQA.from_chain_type(
            llm = self.llm,
            chain_type = "stuff",
            retriever = vectorstore.as_retriever(search_kwargs={"k": top_k}),
            chain_type_kwargs = {"prompt": RAG_PROMPT}
        )
        
    def get_relevance_score(self, query):
        # Chroma's similarity_search_with_score returns a distance, not a similarity:
        # lower is better, and 0 means identical vectors.
        docs_with_scores = self.vectorstore.similarity_search_with_score(query, k=self.top_k)
        if not docs_with_scores:
            return 0.0
        
        # Map each distance d >= 0 to a similarity in (0, 1] with 1 / (1 + d):
        # d = 0 gives 1.0, and larger distances approach 0.
        scores = [1.0 / (1.0 + distance) for _doc, distance in docs_with_scores]
        
        avg_score = sum(scores) / len(scores)
        
        # Sigmoid transformation to spread score distribution
        sigmoid_score = 1 / (1 + math.exp(-self.sigmoid_steepness * (avg_score - self.sigmoid_midpoint)))
        return sigmoid_score
    
    def query(self, question):
        # Strict RAG Logic
        relevance_score = self.get_relevance_score(question)
        
        result = {
            "answer": None, 
            "mode": None, 
            "relevance_score": relevance_score,
            "retrieved_docs": []
        }
        
        # Threshold check (reject immediately if document similarity is low)
        if relevance_score >= self.relevance_threshold:
            result["mode"] = "RAG"
            result["retrieved_docs"] = self.vectorstore.similarity_search(question, k=self.top_k)
            # Run RAG Chain (prompt instructs to say 'no info' if not found)
            rag_response = self.rag_chain.invoke({"query": question})["result"]
            result["answer"] = rag_response
            
            # Secondary check if LLM returned 'no info'
            if "Information not found" in rag_response or (
                "provided context" in rag_response and "does not contain" in rag_response
            ):
                result["mode"] = "NO_ANSWER_IN_DOCS"
                result["answer"] = "No Answer: Information not found in RAG documents."
        else:
            result["mode"] = "OFF_TOPIC"
            result["answer"] = "No Answer: This question is not related to OLED display or no relevant documents found."
            logger.info(f"🚫 Low relevance rejection: {relevance_score:.3f}")
            
        return result

# Initialize Strict Advisor
advisor = StrictRAGAdvisor(
    vectorstore = vectorstore,
    llm_provider = LLM_PROVIDER,  # Can change to "llama_local" later
    llm_model = LLM_MODEL,
    relevance_threshold = RELEVANCE_THRESHOLD,
    top_k = TOP_K_DOCUMENTS,
    temperature = LLM_TEMPERATURE,
    sigmoid_midpoint = SIGMOID_MIDPOINT,
    sigmoid_steepness = SIGMOID_STEEPNESS
)

print("✅ Strict RAG Advisor (OLED) initialized successfully!")
print(f"📊 Strict Threshold = {RELEVANCE_THRESHOLD}")

## Enhanced Query Function with Monitoring

**What This Section Does:**
Wraps the advisor's `query` method with automatic monitoring, logging, and error handling.

**Why It's Needed:**
- Track performance metrics (response time, cost, token usage)
- Handle errors gracefully (API failures, timeouts)
- Log experiment data for hyperparameter tuning
- Systematically compare different settings

**What Gets Tracked:**
- **Response time**: Time taken for each query (seconds)
- **Token usage & cost**: Input/output tokens and estimated API cost
- **Mode distribution**: Query count per mode (RAG, NO_ANSWER, OFF_TOPIC, etc.)
- **Errors & retries**: Automatic retry on transient failures (up to 2 times)

**How It Works:**
- `monitored_query()` function wraps `advisor.query()`
- All metrics are automatically logged to the experiment tracker
- Returns same structure as `advisor.query()` + additional metadata
- `ask()` helper provides one-line interface for quick queries
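
The retry-and-timing pattern behind the wrapper can be sketched as follows. The notebook's own `retry_on_api_error` decorator is defined in another cell and may differ; `retry_with_delay` and `timed_call` are hypothetical stand-ins written for this example.

```python
import functools
import time

def retry_with_delay(max_retries=2, delay=0.0):
    """Hypothetical decorator: re-run the wrapped call up to `max_retries` extra times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt >= max_retries:
                        raise  # Retries exhausted: surface the last error
                    attempt += 1
                    time.sleep(delay)
        return wrapper
    return decorator

def timed_call(func, *args, **kwargs):
    """Run `func` and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start

# Simulated flaky backend: fails twice, then succeeds
calls = {"count": 0}

@retry_with_delay(max_retries=2, delay=0.0)
def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

answer, elapsed = timed_call(flaky_query)
print(answer, calls["count"])
```

Note that the timer wraps the retried call, so the reported response time includes any failed attempts and delays.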

In [None]:
# Cell 16: Enhanced Query Function with Monitoring
# This cell creates a wrapper function that adds monitoring, error handling, and cost tracking

def monitored_query(advisor, question, track_cost=True, log_to_experiment=True):
    """
    Enhanced query function with automatic monitoring and error handling.
    
    What This Function Does:
    1. Wraps advisor.query() with timing, error handling, and retry logic
    2. Tracks token usage and calculates estimated API cost
    3. Logs query results to experiment tracker for analysis
    4. Returns same format as advisor.query() + additional metadata
    
    Args:
        advisor: StrictRAGAdvisor instance (the main advisor object for OLED RAG)
        question: User's question text
        track_cost: Whether to track token usage and cost (default: True)
        log_to_experiment: Whether to log the query to the experiment tracker (default: True)
    
    Returns:
        dict: Result dictionary with answer, metadata, response_time, token counts
    """
    start_time = time.time()  # Start timer for response time calculation
    result = None
    error_occurred = False
    
    try:
        logger.info(f"🔍 Processing query: {question[:50]}...")
        
        # Wrap advisor.query() with automatic retry on API errors
        # @retry_on_api_error automatically retries up to 2 times on failures
        @retry_on_api_error(max_retries=2, delay=2)
        def query_with_retry():
            return advisor.query(question)
        
        # Call the wrapped query method
        result = query_with_retry()
        
        # Calculate response time (elapsed time since start)
        response_time = time.time() - start_time
        result['response_time'] = response_time
        
        # Track token usage and cost if enabled
        if track_cost:
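            # Count tokens for the question and the answer via cost_tracker (tiktoken-based).
            # Note: only the question is counted as input; the retrieved context that is
            # also sent to the LLM is not included, so input tokens are underestimated.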
            input_text = question
            output_text = result['answer']
            input_tokens, output_tokens = cost_tracker.add_tokens(input_text, output_text)
            
            # Store token counts in result for display/analysis
            result['input_tokens'] = input_tokens
            result['output_tokens'] = output_tokens
        
        # Log successful query completion with key metrics
        mode = result['mode']
        logger.info(f"✅ Query completed in {response_time:.2f}s | Mode: {mode} | Relevance: {result['relevance_score']:.3f}")
        
        # Track this query in experiment tracker for statistical analysis
        # This accumulates data for comparing different hyperparameter settings
        if log_to_experiment and hasattr(experiment_tracker, 'current_experiment') and experiment_tracker.current_experiment:
            experiment_tracker.log_query(response_time, mode)
        
        return result
        
    except Exception as e:
        # Error handling: Catch any exceptions and return error result
        error_occurred = True
        response_time = time.time() - start_time
        
        logger.error(f"❌ Query failed after {response_time:.2f}s: {type(e).__name__} - {str(e)}")
        
        # Return error result with user-friendly message
        return {
            'mode': 'ERROR',
            'answer': f"Sorry, an error occurred while processing your question: {str(e)}",
            'relevance_score': 0.0,
            'response_time': response_time,
            'error': str(e)
        }

# Create a convenience function that uses the global advisor
def ask(question):
    """
    Convenient shortcut function to ask a question with automatic monitoring.
    
    This is a simpler interface - just call ask(question) instead of 
    monitored_query(advisor, question).
    
    Usage:
        result = ask("What are the key degradation mechanisms of a blue phosphorescent OLED?")
        print(result['answer'])
    
    Args:
        question: User's question text
    
    Returns:
        dict: Result dictionary with answer and metadata
    """
    return monitored_query(advisor, question)

print("✅ Monitored query function ready!")
print("💡 Use ask(question) for quick queries with automatic monitoring")

## Testing the Strict RAG System

Test the advisor with sample questions to see how RAG, NO_ANSWER, and OFF_TOPIC modes are selected.

---
### Quick Reference: The Score System

```
┌─────────────────────────────────────────────────────────────────┐
│  RELEVANCE SCORE (0.0 - 1.0)                                    │
│  Question: "Does the document match the query?"                 │
│  • 0.85 -> Documents highly relevant to query ✓                 │
│  • 0.54 -> Documents somewhat matching (uncertain zone)         │
│  • 0.30 -> Documents not relevant to query ✗                    │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  RAG SCORE (1-10)                                               │
│  Question: "How good is the RAG answer?"                        │
│  • Measures: Specificity + Relevance + Factuality               │
│  • 8.5 -> High-quality answer with specific facts from docs     │
│  • 6.0 -> Acceptable answer but lacking details                 │
└─────────────────────────────────────────────────────────────────┘

KEY INSIGHTS:
   1. Low Relevance = No OLED-related documents -> OFF_TOPIC rejection
   2. High Relevance + RAG "no info" = Relevant docs but no specific answer -> NO_ANSWER
   3. High Relevance + Answer success = Accurate document-based answer -> RAG mode
```

In [None]:
# Cell 18: Run Advisor Core Logic with Test Questions
# Execute advisor with predefined test questions and format the results for display

# HTML formatting function - for better result display
def show_response(result, question):
    """
    Display Advisor response in a formatted HTML box.
    Adapted for Strict RAG mode (Project 2).
    """
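    from html import escape  # Local import; binds only `escape`, so the `html` string below is unaffected
    
    mode = result['mode']
    # Escape user- and model-generated text before embedding it in HTML
    answer = escape(str(result['answer']))
    question = escape(str(question))
    relevance_score = result['relevance_score']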
    
    # Color coding by mode (Strict RAG: RAG, NO_ANSWER_IN_DOCS, OFF_TOPIC only)
    if mode == "RAG":
        color = "#4caf50"  # Green
        icon = "🟢"
        mode_text = "RAG Mode (Document-based Response)"
    elif mode == "NO_ANSWER_IN_DOCS":
        color = "#ff9800"  # Orange
        icon = "🟠"
        mode_text = "No Answer (No relevant content in documents)"
    elif mode == "OFF_TOPIC":
        color = "#f44336"  # Red
        icon = "🔴"
        mode_text = "Off-Topic Rejection (Low Relevance, Auto-rejected)"
    elif mode == "ERROR":
        color = "#f44336"  # Red
        icon = "❌"
        mode_text = "System Error"
    else:
        color = "#9e9e9e"  # Gray
        icon = "⚪"
        mode_text = f"Unknown Mode: {mode}"
    
    # Create metadata line (including relevance score)
    metadata = f"{icon} <strong>{mode_text}</strong> | Relevance Score: {relevance_score:.3f}"
    
    html = f"""
    <div style="
        background-color:#f9f9f9;
        border-left: 6px solid {color};
        padding: 15px;
        margin: 15px 0;
        border-radius: 4px;
        font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Arial, sans-serif;">
        <div style="color:#333; font-size:14px; margin-bottom:10px;">
            <strong>Question:</strong> {question}
        </div>
        <div style="color:#666; font-size:13px; margin-bottom:10px;">
            {metadata}
        </div>
        <div style="color:#111; font-size:14px; line-height:1.6; white-space: pre-wrap;">
            <strong>Answer:</strong><br>{answer}
        </div>
    </div>
    """
    display(HTML(html))

# Run advisor core logic with test questions
print("Running advisor core logic with test questions...\n")

# Start experiment tracking
experiment_tracker.start_experiment({
    # RAG Quality (Step 1)
    'chunk_size': CHUNK_SIZE,
    'chunk_overlap': CHUNK_OVERLAP,
    'top_k': TOP_K_DOCUMENTS,
    'temperature': LLM_TEMPERATURE,
    
    # System Architecture (Step 2) - Used for Strict RAG only
    'relevance_threshold': RELEVANCE_THRESHOLD,
    'sigmoid_midpoint': SIGMOID_MIDPOINT,
    'sigmoid_steepness': SIGMOID_STEEPNESS
})

# Reset cost tracker for this experiment
cost_tracker.reset()

# Store relevance scores for later averaging
relevance_scores = []

for i, question in enumerate(TEST_QUERIES, 1):
    print(f"\n{'='*80}")
    print(f"Test {i}/{len(TEST_QUERIES)}")
    print(f"{'='*80}")
    
    # Use monitored_query() instead of advisor.query()
    result = monitored_query(advisor, question)
    show_response(result, question)
    
    # Store relevance score
    if 'relevance_score' in result:
        relevance_scores.append(result['relevance_score'])
    
    # Display performance metrics
    if 'response_time' in result:
        print(f"Response Time: {result['response_time']:.2f}s")
    if 'input_tokens' in result and 'output_tokens' in result:
        print(f"Tokens: {result['input_tokens']} input + {result['output_tokens']} output = {result['input_tokens'] + result['output_tokens']} total")

# Print experiment summary
print(f"\n{'='*80}")
print("Experiment Summary")
print(f"{'='*80}")
cost_summary = cost_tracker.get_summary()
print(f"Total Tokens: {cost_summary['total_tokens']:,}")
print(f"Estimated Cost: ${cost_summary['estimated_cost_usd']:.4f}")
print(f"Mode Distribution:")
for mode, count in experiment_tracker.current_experiment['mode_counts'].items():
    if count > 0:
        print(f"   - {mode}: {count}")

## Quantitative Evaluation: System Quality

**What This Section Does:**
Evaluates the quality of answers generated by the Strict RAG system.

**Evaluation Process:**
1. **For each test question:**
   - Generate answer using the Strict RAG Advisor
   - If an answer is provided (not rejected), evaluate with LLM-as-a-judge

2. **Evaluation Metrics** (1-10 scale):
   - **Specificity**: How detailed is the answer?
   - **Relevance**: Does it directly answer the question?
   - **Factuality**: Does it contain verifiable facts from the documents?
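
Because an LLM judge does not always reply with clean JSON, the parsing step benefits from some validation. The sketch below shows one way to do it; `parse_judge_scores` is a hypothetical helper, and clamping scores to the 1-10 range is a design choice of this example rather than something the notebook's evaluator does.

```python
import json

REQUIRED_KEYS = ("specificity", "relevance", "factuality")

def parse_judge_scores(response_text):
    """Parse the span from the first '{' to the last '}' and validate it; None on failure."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    scores = {}
    for key in REQUIRED_KEYS:
        value = data.get(key)
        # Reject missing or non-numeric values (bool is excluded explicitly)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        scores[key] = min(10, max(1, value))  # Clamp to the 1-10 scale
    return scores

# Made-up judge replies for illustration
print(parse_judge_scores('Sure! {"specificity": 8, "relevance": 9, "factuality": 12}'))
print(parse_judge_scores("I cannot score this answer."))
```

Returning `None` on failure lets the caller decide whether to retry, skip the question, or substitute a default, instead of silently recording neutral scores.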

In [None]:
# Cell 20: Quantitative Evaluation - System Quality Check
# This cell implements the RAGEvaluator class for evaluating system performance.
# Note: LLM baseline comparison has been removed to focus on Strict RAG quality.

import json
import pandas as pd

class RAGEvaluator:
    """Evaluate the final advisor answer quality."""
    
    def __init__(self, llm, advisor):
        self.llm = llm
        self.advisor = advisor
    
    def get_final_system_answer(self, question):
        """Get the final answer from the full advisor system."""
        # track_cost=False, log_to_experiment=False to avoid double counting
        result = monitored_query(self.advisor, question, track_cost=False, log_to_experiment=False)
        return result
    
    def score_answer(self, question, answer):
        """Score an answer on specificity, relevance, and factuality (1-10)."""
        evaluation_prompt = f"""You are an expert evaluator. Score the following answer on these three criteria (scale 1-10):

1. SPECIFICITY: How specific and detailed is the answer? (1 = vague, 10 = very specific with details)
2. RELEVANCE: How relevant is the answer to the question? (1 = off-topic, 10 = directly answers question)
3. FACTUALITY: Does the answer contain verifiable facts and data? (1 = no facts/opinions only, 10 = rich with facts and data)

Question: {question}

Answer: {answer}

Respond ONLY with a JSON object in this exact format (no other text):
{{"specificity": <score>, "relevance": <score>, "factuality": <score>}}"""
        
        try:
            response = self.llm.invoke(evaluation_prompt).content
            start_idx = response.find('{')
            end_idx = response.rfind('}') + 1
            json_str = response[start_idx:end_idx]
            scores = json.loads(json_str)
            return scores
        except Exception:
            # Fall back to neutral scores if the judge call fails or returns malformed JSON
            return {"specificity": 5, "relevance": 5, "factuality": 5}
    
    def evaluate_system(self, question):
        """Evaluate final system answer quality."""
        result = self.get_final_system_answer(question)
        final_answer = result.get("answer", "")
        mode = result.get("mode", "UNKNOWN")
        
        # If no answer (rejected), we can't score specificity/factuality nicely
        # But we can track that it was rejected.
        if mode in ["OFF_TOPIC", "NO_ANSWER_IN_DOCS"]:
            return {
                "question": question,
                "final_answer": final_answer,
                "mode": mode,
                "final_specificity": 0,
                "final_relevance": 0,
                "final_factuality": 0,
                "final_word_count": 0
            }
            
        # Score the answer
        final_scores = self.score_answer(question, final_answer)
        final_word_count = len(final_answer.split())
        
        return {
            "question": question,
            "final_answer": final_answer,
            "mode": mode,
            "final_specificity": final_scores["specificity"],
            "final_relevance": final_scores["relevance"],
            "final_factuality": final_scores["factuality"],
            "final_word_count": final_word_count,
        }

# Initialize evaluator
evaluator = RAGEvaluator(advisor.llm, advisor)

# Run evaluation
print("📊 Running system quality evaluation...\n")
evaluation_results = []

for i, question in enumerate(TEST_QUERIES, 1):
    print(f"Evaluating {i}/{len(TEST_QUERIES)}: {question[:50]}...")
    result = evaluator.evaluate_system(question)
    evaluation_results.append(result)
    print("  ✓ Complete")

print("\n✅ Evaluation complete!")

## Evaluation Results Analysis
This section analyzes the performance of the Strict RAG system, calculating quality scores for answered questions and tracking rejection counts.
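
Alongside the quality averages, the analysis cell logs how well relevance scores separate on-topic from off-topic questions. The two separation metrics can be sketched as below; `separation_metrics` is a hypothetical helper and the scores are made-up values, not measured results.

```python
def separation_metrics(pos_scores, neg_scores):
    """Gap = difference of means; margin = worst positive minus best negative."""
    mean_pos = sum(pos_scores) / len(pos_scores)
    mean_neg = sum(neg_scores) / len(neg_scores)
    return {
        "mean_pos": mean_pos,
        "mean_neg": mean_neg,
        "gap": mean_pos - mean_neg,
        "margin": min(pos_scores) - max(neg_scores),
    }

# Illustrative relevance scores: two on-topic and two off-topic questions
metrics = separation_metrics(pos_scores=[0.9, 0.8], neg_scores=[0.2, 0.3])
print({name: round(value, 3) for name, value in metrics.items()})
```

A positive margin means at least one threshold separates every positive example from every negative one; the gap alone can look healthy even when individual scores overlap.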

In [None]:
# Cell 22: System Evaluation Analysis
# This cell displays performance metrics for the Strict RAG system.

# Convert results to DataFrame
df = pd.DataFrame(evaluation_results)

# Filter only answered questions for scoring stats (exclude rejections)
# .copy() avoids pandas' SettingWithCopyWarning when 'final_overall' is added below
answered_df = df[~df['mode'].isin(['OFF_TOPIC', 'NO_ANSWER_IN_DOCS'])].copy()

if not answered_df.empty:
    avg_specificity = answered_df['final_specificity'].mean()
    avg_relevance = answered_df['final_relevance'].mean()
    avg_factuality = answered_df['final_factuality'].mean()
    avg_words = answered_df['final_word_count'].mean()
    
    # Calculate overall score
    answered_df['final_overall'] = (answered_df['final_specificity'] + answered_df['final_relevance'] + answered_df['final_factuality']) / 3
    avg_overall = answered_df['final_overall'].mean()
else:
    avg_specificity = 0
    avg_relevance = 0
    avg_factuality = 0
    avg_words = 0
    avg_overall = 0

# Display summary statistics
print("=" * 80)
print("📊 STRICT RAG SYSTEM PERFORMANCE SUMMARY")
print("=" * 80)
print(f"Total Questions: {len(df)}")
print(f"Answered: {len(answered_df)}")
print(f"Rejected/No Data: {len(df) - len(answered_df)}")
print("-" * 80)
print(f"{'Metric (Answered Queries Only)':<35} {'Score':<15}")
print("-" * 80)
print(f"{'Specificity (1-10)':<35} {avg_specificity:<15.2f}")
print(f"{'Relevance (1-10)':<35} {avg_relevance:<15.2f}")
print(f"{'Factuality (1-10)':<35} {avg_factuality:<15.2f}")
print(f"{'Overall Score (1-10)':<35} {avg_overall:<15.2f}")
print(f"{'Average Word Count':<35} {avg_words:<15.1f}")
print()

# Display detailed results table
print("=" * 80)
print("📋 DETAILED RESULTS BY QUESTION")
print("=" * 80)
display(df[['question', 'mode', 'final_answer']])

# Calculate average relevance score from Cell 18's data
# relevance_scores list is collected from Cell 18
if 'relevance_scores' in dir() and len(relevance_scores) > 0:
    avg_relevance_score = sum(relevance_scores) / len(relevance_scores)
else:
    avg_relevance_score = None
    print("⚠️ Warning: relevance_scores not found. Run Cell 18 first.")

# Calculate Q1-Q2 vs Q5-Q6 separation (for CSV logging)
relevance_mean_pos = None
relevance_mean_neg = None
relevance_gap = None
relevance_margin = None

try:
    if 'relevance_scores' in dir() and len(relevance_scores) >= 6:
        # Q1, Q2 -> OLED core questions (Positive)
        pos_indices = [0, 1]
        # Q5, Q6 -> Completely OFF-TOPIC questions (Negative)
        neg_indices = [4, 5]

        pos_scores = [float(relevance_scores[i]) for i in pos_indices]
        neg_scores = [float(relevance_scores[i]) for i in neg_indices]

        relevance_mean_pos = sum(pos_scores) / len(pos_scores)
        relevance_mean_neg = sum(neg_scores) / len(neg_scores)
        relevance_gap = relevance_mean_pos - relevance_mean_neg
        relevance_margin = min(pos_scores) - max(neg_scores)
    else:
        print("⚠️ Need at least 6 relevance scores (Q1-Q6) to compute separation metrics.")
except Exception as e:
    print("⚠️ Failed to compute relevance separation metrics:", e)

# Save experiment results
if hasattr(experiment_tracker, 'save_experiment'):
    # notes: Save brief summary text (also record LLM settings)
    notes_str = (
        f"model={LLM_MODEL}, provider={LLM_PROVIDER}, "
        f"Strict RAG Evaluation: {len(answered_df)}/{len(df)} answered."
    )

    experiment_tracker.save_experiment(
        avg_relevance_score=avg_relevance_score,
        avg_rag_score=avg_overall,
        avg_rag_specificity=avg_specificity,
        avg_rag_relevance=avg_relevance,
        avg_rag_factuality=avg_factuality,
        rel_mean_pos=relevance_mean_pos,
        rel_mean_neg=relevance_mean_neg,
        rel_gap=relevance_gap,
        rel_margin=relevance_margin,
        notes=notes_str,
    )

    # Console output always uses the same format
    mean_pos_str = "N/A" if relevance_mean_pos is None else f"{relevance_mean_pos:.3f}"
    mean_neg_str = "N/A" if relevance_mean_neg is None else f"{relevance_mean_neg:.3f}"
    gap_str = "N/A" if relevance_gap is None else f"{relevance_gap:.3f}"
    margin_str = "N/A" if relevance_margin is None else f"{relevance_margin:.3f}"

    # Compare against None so a legitimate 0.0 average is still printed
    if avg_relevance_score is not None:
        print(f"\n📊 Average Relevance Score: {avg_relevance_score:.3f}")
    print(
        f"📊 Relevance separation (Q1-Q2 vs Q5-Q6): "
        f"mean_pos={mean_pos_str}, "
        f"mean_neg={mean_neg_str}, "
        f"gap={gap_str}, margin={margin_str}"
    )
    print("✅ Experiment results saved to CSV.")

## Interactive Q&A Interface
Now you can ask the AI-Driven OLED Assistant your own questions directly!

In [None]:
# Cell 26: Interactive Q&A Interface
# Note: This cell closes any previous UI first, so re-running it does not register duplicate handlers

import ipywidgets as widgets
from IPython.display import clear_output, display, HTML

# Global flag to prevent duplicate registration
if 'OLED_UI_INITIALIZED' not in dir():
    OLED_UI_INITIALIZED = False

# Remove existing UI if present
if OLED_UI_INITIALIZED:
    try:
        oled_ui_box.close()
    except Exception:
        # The previous UI box may be missing or already closed; ignore
        pass

# Create widgets
oled_question_input = widgets.Textarea(
    value='',
    placeholder='Enter your OLED Display related question...',
    description='Question:',
    layout=widgets.Layout(width='90%', height='80px'),
    style={'description_width': '50px'}
)

oled_submit_btn = widgets.Button(
    description='Get Answer',
    button_style='primary',
    layout=widgets.Layout(width='150px')
)

oled_output = widgets.Output()

def oled_handle_submit(btn):
    with oled_output:
        clear_output(wait=True)
        question = oled_question_input.value.strip()
        if not question:
            print("Please enter a question!")
            return
        
        print("Analyzing question...")
        result = advisor.query(question)
        
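        from html import escape  # Local import; binds only `escape`, so the `html` string below is unaffected
        
        mode = result['mode']
        # Escape user- and model-generated text before embedding it in HTML
        answer = escape(str(result['answer']))
        question = escape(question)
        score = result['relevance_score']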
        
        if mode == "RAG":
            color, icon, text = "#4caf50", "🟢", "RAG Mode (Document-based)"
        elif mode == "NO_ANSWER_IN_DOCS":
            color, icon, text = "#ff9800", "🟠", "No Answer (Not in documents)"
        else:
            color, icon, text = "#f44336", "🔴", "Off-Topic (Rejected)"
        
        html = f'''<div style="background:#f5f5f5; border-left:5px solid {color}; padding:12px; margin:8px 0; border-radius:4px;">
<b>Q:</b> {question}<br>
<span style="color:#666">{icon} {text} | Relevance: {score:.3f}</span><br><br>
<b>A:</b> {answer}
</div>'''
        display(HTML(html))
        
        if mode == 'RAG' and result.get('retrieved_docs'):
            print("\nRetrieved documents:")
            for i, doc in enumerate(result['retrieved_docs'][:2], 1):
                print(f"  {i}. {doc.page_content[:120]}...")

oled_submit_btn.on_click(oled_handle_submit)

# Bundle into UI box
oled_ui_box = widgets.VBox([
    widgets.HTML("<h3>💬 AI-Driven OLED Assistant</h3>"),
    oled_question_input,
    oled_submit_btn,
    oled_output
])

OLED_UI_INITIALIZED = True
display(oled_ui_box)