# RAG System for Multiple Choice Question (MCQ) Generation

## Comprehensive Demonstration and Implementation Guide

This notebook demonstrates a complete implementation of a Retrieval-Augmented Generation (RAG) system specifically designed for generating high-quality Multiple Choice Questions from educational documents.

### System Overview
- **Document Processing**: PDF text extraction and semantic chunking
- **Vector Database**: FAISS with Vietnamese language embeddings
- **Question Generation**: LLM-powered MCQ creation with structured output
- **Quality Assurance**: Automatic validation and difficulty assessment
- **Batch Processing**: Scalable question generation capabilities

### Key Features
- 🌐 Vietnamese language support
- 📚 Multi-document processing
- 🎯 Multiple question types (definition, application, analysis)
- 📊 Quality scoring and validation
- ⚡ Optimized performance with quantized models

## 1. Environment Setup and Dependencies

First, let's install and import all required libraries for our RAG-MCQ system.

In [1]:
# Install required packages (uncomment if needed)
# !pip install langchain langchain-community langchain-experimental langchain-huggingface
# !pip install transformers torch accelerate bitsandbytes
# !pip install faiss-cpu sentence-transformers
# !pip install pypdf unstructured
# !pip install numpy pandas matplotlib seaborn
# !pip install nltk rouge-score

# Check if CUDA is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

CUDA available: True
CUDA device: NVIDIA GeForce GTX 1650
CUDA memory: 4.3 GB


In [5]:
# Core imports
import os
import json
import time
import warnings
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass
from enum import Enum
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# LangChain imports
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain.vectorstores import FAISS
from langchain_core.documents import Document
from transformers.utils.quantization_config import BitsAndBytesConfig


# Transformers imports
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.pipelines import pipeline
from transformers.utils.quantization_config import BitNetQuantConfig

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"📦 LangChain version: {getattr(__import__('langchain'), '__version__', 'Unknown')}")
print(f"🤗 Transformers version: {getattr(__import__('transformers'), '__version__', 'Unknown')}")
print(f"🔥 PyTorch version: {torch.__version__}")

✅ All libraries imported successfully!
📦 LangChain version: 0.3.26
🤗 Transformers version: 4.53.2
🔥 PyTorch version: 2.7.1+cu118


In [6]:
# Configuration and Data Classes
class QuestionType(Enum):
    """Enumeration of different question types"""
    DEFINITION = "definition"
    COMPARISON = "comparison"
    APPLICATION = "application"
    ANALYSIS = "analysis"
    EVALUATION = "evaluation"

class DifficultyLevel(Enum):
    """Enumeration of difficulty levels"""
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"
    EXPERT = "expert"

@dataclass
class MCQOption:
    """Data class for MCQ options"""
    label: str
    text: str
    is_correct: bool

@dataclass
class MCQQuestion:
    """Data class for Multiple Choice Question"""
    question: str
    context: str
    options: List[MCQOption]
    explanation: str
    difficulty: str
    topic: str
    question_type: str
    source: str
    confidence_score: float = 0.0

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary format"""
        return {
            "question": self.question,
            "context": self.context,
            "options": {opt.label: opt.text for opt in self.options},
            "correct_answer": next(opt.label for opt in self.options if opt.is_correct),
            "explanation": self.explanation,
            "difficulty": self.difficulty,
            "topic": self.topic,
            "question_type": self.question_type,
            "source": self.source,
            "confidence_score": self.confidence_score
        }

# System Configuration
CONFIG = {
    "embedding_model": "bkai-foundation-models/vietnamese-bi-encoder",
    "llm_model": "unsloth/Qwen2.5-7B",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "retrieval_k": 5,
    "generation_temperature": 0.7,
    "max_tokens": 512,
    "diversity_threshold": 0.7
}

print("✅ Configuration and data classes defined!")
print(f"📋 Using embedding model: {CONFIG['embedding_model']}")
print(f"🤖 Using LLM model: {CONFIG['llm_model']}")

✅ Configuration and data classes defined!
📋 Using embedding model: bkai-foundation-models/vietnamese-bi-encoder
🤖 Using LLM model: unsloth/Qwen2.5-7B


## 2. Document Processing Pipeline

Let's implement the document loading and text extraction pipeline for PDF documents.

In [7]:
class DocumentProcessor:
    """Handles document loading and preprocessing"""

    def __init__(self):
        self.supported_formats = ['.pdf', '.txt']

    def load_documents(self, folder_path: str) -> Tuple[List[Document], List[str]]:
        """Load and process documents from folder"""
        folder = Path(folder_path)
        if not folder.exists():
            raise FileNotFoundError(f"Folder not found: {folder}")

        pdf_files = list(folder.glob("*.pdf"))
        if not pdf_files:
            print(f"⚠️  No PDF files found in: {folder}")
            return [], []

        all_docs, filenames = [], []
        total_pages = 0

        print(f"📁 Processing {len(pdf_files)} PDF files...")

        for pdf_file in pdf_files:
            try:
                print(f"📄 Loading: {pdf_file.name}")
                loader = PyPDFLoader(str(pdf_file))
                docs = loader.load()

                # Add metadata
                for doc in docs:
                    doc.metadata['source_file'] = pdf_file.name
                    doc.metadata['file_path'] = str(pdf_file)

                all_docs.extend(docs)
                filenames.append(pdf_file.name)
                total_pages += len(docs)

                print(f"  ✅ Loaded {len(docs)} pages")

            except Exception as e:
                print(f"  ❌ Failed loading {pdf_file.name}: {e}")

        print(f"\n📊 Summary:")
        print(f"  📚 Files loaded: {len(filenames)}")
        print(f"  📄 Total pages: {total_pages}")
        print(f"  📝 Average pages per file: {total_pages/len(filenames):.1f}")

        return all_docs, filenames

    def analyze_document_stats(self, docs: List[Document]) -> Dict[str, Any]:
        """Analyze document statistics"""
        if not docs:
            return {}

        # Calculate statistics
        total_chars = sum(len(doc.page_content) for doc in docs)
        total_words = sum(len(doc.page_content.split()) for doc in docs)

        char_lengths = [len(doc.page_content) for doc in docs]
        word_lengths = [len(doc.page_content.split()) for doc in docs]

        stats = {
            "total_documents": len(docs),
            "total_characters": total_chars,
            "total_words": total_words,
            "avg_chars_per_doc": np.mean(char_lengths),
            "avg_words_per_doc": np.mean(word_lengths),
            "min_chars": np.min(char_lengths),
            "max_chars": np.max(char_lengths),
            "min_words": np.min(word_lengths),
            "max_words": np.max(word_lengths)
        }

        return stats

# Test the document processor
doc_processor = DocumentProcessor()
print("✅ Document processor initialized!")

✅ Document processor initialized!


In [8]:
# Load sample documents (update path as needed)
# Uncomment and modify the path to your PDF folder

# folder_path = "../pdf_folder"  # Update this path
# docs, filenames = doc_processor.load_documents(folder_path)

# For demonstration, let's create some sample documents
sample_docs = [
    Document(
        page_content="""
        Object-Oriented Programming (OOP) là một mô hình lập trình được xây dựng dựa trên khái niệm đối tượng.
        OOP tổ chức mã nguồn xung quanh các đối tượng thay vì các hàm và logic.

        Các nguyên lý cơ bản của OOP bao gồm:
        1. Encapsulation (Đóng gói): Ẩn giấu chi tiết triển khai
        2. Inheritance (Kế thừa): Tái sử dụng code từ class cha
        3. Polymorphism (Đa hình): Cùng một interface, nhiều implementation
        4. Abstraction (Trừu tượng): Đơn giản hóa phức tạp
        """,
        metadata={"source": "OOP_basics.pdf", "page": 1}
    ),
    Document(
        page_content="""
        Inheritance (Kế thừa) trong OOP cho phép một class con kế thừa các thuộc tính và phương thức từ class cha.

        Ví dụ:
        - Class Animal có thuộc tính name và phương thức eat()
        - Class Dog kế thừa từ Animal và thêm phương thức bark()
        - Class Cat kế thừa từ Animal và thêm phương thức meow()

        Lợi ích của inheritance:
        - Tái sử dụng code
        - Dễ dàng mở rộng
        - Tổ chức code theo hierarchy
        """,
        metadata={"source": "OOP_inheritance.pdf", "page": 2}
    ),
    Document(
        page_content="""
        Data Structures (Cấu trúc dữ liệu) là cách tổ chức và lưu trữ dữ liệu trong máy tính.

        Các cấu trúc dữ liệu cơ bản:
        1. Array: Tập hợp các phần tử cùng kiểu
        2. Linked List: Danh sách liên kết
        3. Stack: Ngăn xếp (LIFO - Last In First Out)
        4. Queue: Hàng đợi (FIFO - First In First Out)
        5. Tree: Cây
        6. Graph: Đồ thị

        Chọn cấu trúc dữ liệu phù hợp ảnh hưởng đến hiệu suất của thuật toán.
        """,
        metadata={"source": "Data_Structures.pdf", "page": 3}
    )
]

print("📚 Using sample documents for demonstration")
print(f"📄 Total documents: {len(sample_docs)}")

# Analyze document statistics
stats = doc_processor.analyze_document_stats(sample_docs)
print(f"\n📊 Document Statistics:")
for key, value in stats.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.1f}")
    else:
        print(f"  {key}: {value}")

docs = sample_docs  # Use for the rest of the notebook

📚 Using sample documents for demonstration
📄 Total documents: 3

📊 Document Statistics:
  total_documents: 3
  total_characters: 1435
  total_words: 243
  avg_chars_per_doc: 478.3
  avg_words_per_doc: 81.0
  min_chars: 458
  max_chars: 511
  min_words: 78
  max_words: 83


In [9]:
from huggingface_hub import notebook_login
notebook_login() # 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

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## 3. Vector Database and Embeddings

Now let's set up the FAISS vector database with Vietnamese embeddings for semantic search.

In [10]:
class VectorDatabaseManager:
    """Manages vector database and embeddings"""

    def __init__(self, embedding_model_name: str):
        self.embedding_model_name = embedding_model_name
        self.embeddings = None
        self.vector_db = None
        self.chunks = []

    def initialize_embeddings(self):
        """Initialize the embedding model"""
        print(f"🔧 Loading embedding model: {self.embedding_model_name}")
        try:
            self.embeddings = HuggingFaceEmbeddings(
                model_name=self.embedding_model_name,
                model_kwargs={'device': 'cpu'}  # Use CPU for compatibility
            )
            print("✅ Embeddings initialized successfully!")
            return True
        except Exception as e:
            print(f"❌ Failed to load embeddings: {e}")
            print("🔄 Falling back to sentence-transformers model...")
            try:
                self.embeddings = HuggingFaceEmbeddings(
                    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
                )
                print("✅ Fallback embeddings loaded!")
                return True
            except Exception as e2:
                print(f"❌ Fallback also failed: {e2}")
                return False

    def create_semantic_chunks(self, docs: List[Document]) -> List[Document]:
        """Create semantic chunks from documents"""
        if not self.embeddings:
            raise RuntimeError("Embeddings not initialized")

        print("🔪 Creating semantic chunks...")

        try:
            # Use SemanticChunker for intelligent chunking
            chunker = SemanticChunker(
                embeddings=self.embeddings,
                buffer_size=1,
                breakpoint_threshold_type="percentile",
                breakpoint_threshold_amount=95,
                min_chunk_size=CONFIG["chunk_size"],
                add_start_index=True
            )

            chunks = chunker.split_documents(docs)
            self.chunks = chunks

            print(f"✅ Created {len(chunks)} semantic chunks")

            # Analyze chunk statistics
            chunk_lengths = [len(chunk.page_content) for chunk in chunks]
            print(f"📊 Chunk statistics:")
            print(f"  Average length: {np.mean(chunk_lengths):.1f} characters")
            print(f"  Min length: {np.min(chunk_lengths)} characters")
            print(f"  Max length: {np.max(chunk_lengths)} characters")

            return chunks

        except Exception as e:
            print(f"❌ Semantic chunking failed: {e}")
            print("🔄 Falling back to simple text splitting...")

            # Fallback to simple chunking
            simple_chunks = []
            for doc in docs:
                content = doc.page_content
                chunk_size = CONFIG["chunk_size"]
                overlap = CONFIG["chunk_overlap"]

                for i in range(0, len(content), chunk_size - overlap):
                    chunk_content = content[i:i + chunk_size]
                    if len(chunk_content.strip()) > 50:  # Minimum chunk size
                        chunk = Document(
                            page_content=chunk_content,
                            metadata={**doc.metadata, "chunk_index": len(simple_chunks)}
                        )
                        simple_chunks.append(chunk)

            self.chunks = simple_chunks
            print(f"✅ Created {len(simple_chunks)} simple chunks")
            return simple_chunks

    def build_vector_database(self, chunks: List[Document]) -> bool:
        """Build FAISS vector database from chunks"""
        if not self.embeddings:
            raise RuntimeError("Embeddings not initialized")

        print("🗄️  Building FAISS vector database...")

        try:
            self.vector_db = FAISS.from_documents(chunks, embedding=self.embeddings)
            print("✅ Vector database created successfully!")
            print(f"📚 Indexed {len(chunks)} chunks")
            return True

        except Exception as e:
            print(f"❌ Failed to build vector database: {e}")
            return False

    def search_similar_chunks(self, query: str, k: int = 5) -> List[Document]:
        """Search for similar chunks"""
        if not self.vector_db:
            raise RuntimeError("Vector database not initialized")

        try:
            results = self.vector_db.similarity_search(query, k=k)
            return results
        except Exception as e:
            print(f"❌ Search failed: {e}")
            return []

# Initialize vector database manager
vector_manager = VectorDatabaseManager(CONFIG["embedding_model"])

# Initialize embeddings
if vector_manager.initialize_embeddings():
    print("🎯 Ready to process documents!")

🔧 Loading embedding model: bkai-foundation-models/vietnamese-bi-encoder
✅ Embeddings initialized successfully!
🎯 Ready to process documents!


In [11]:
# Process documents and build vector database
print("🚀 Starting document processing pipeline...")

# Create semantic chunks
chunks = vector_manager.create_semantic_chunks(docs)

# Build vector database
if vector_manager.build_vector_database(chunks):
    print("✅ Document processing pipeline completed!")

    # Test similarity search
    test_query = "OOP là gì"
    print(f"\n🔍 Testing similarity search with query: '{test_query}'")

    results = vector_manager.search_similar_chunks(test_query, k=3)

    for i, result in enumerate(results, 1):
        print(f"\n📄 Result {i}:")
        print(f"Source: {result.metadata.get('source', 'Unknown')}")
        print(f"Content: {result.page_content[:200]}...")

else:
    print("❌ Failed to build vector database")

🚀 Starting document processing pipeline...
🔪 Creating semantic chunks...
✅ Created 3 semantic chunks
📊 Chunk statistics:
  Average length: 464.0 characters
  Min length: 449 characters
  Max length: 494 characters
🗄️  Building FAISS vector database...
✅ Vector database created successfully!
📚 Indexed 3 chunks
✅ Document processing pipeline completed!

🔍 Testing similarity search with query: 'OOP là gì'

📄 Result 1:
Source: OOP_basics.pdf
Content: 
        Object-Oriented Programming (OOP) là một mô hình lập trình được xây dựng dựa trên khái niệm đối tượng. OOP tổ chức mã nguồn xung quanh các đối tượng thay vì các hàm và logic. Các nguyên lý cơ...

📄 Result 2:
Source: OOP_inheritance.pdf
Content: 
        Inheritance (Kế thừa) trong OOP cho phép một class con kế thừa các thuộc tính và phương thức từ class cha. Ví dụ:
        - Class Animal có thuộc tính name và phương thức eat()
        - Clas...

📄 Result 3:
Source: Data_Structures.pdf
Content: 
        Data Structures (Cấu trúc dữ liệ

## 4. Retrieval System Implementation

Implementing a context-aware retrieval system with diversity controls for better question generation.

In [12]:
class ContextAwareRetriever:
    """Enhanced retriever with context awareness and diversity"""

    def __init__(self, vector_db: FAISS, diversity_threshold: float = 0.7):
        self.vector_db = vector_db
        self.diversity_threshold = diversity_threshold

    def retrieve_diverse_contexts(self, query: str, k: int = 5) -> List[Document]:
        """Retrieve documents with semantic diversity"""
        # Get more candidates than needed
        candidates = self.vector_db.similarity_search(query, k=k*2)

        if not candidates:
            return []

        # Select diverse documents
        selected = [candidates[0]]  # Always include the most relevant

        for candidate in candidates[1:]:
            if len(selected) >= k:
                break

            # Check diversity with already selected documents
            is_diverse = True
            for selected_doc in selected:
                similarity = self._calculate_similarity(
                    candidate.page_content,
                    selected_doc.page_content
                )
                if similarity > self.diversity_threshold:
                    is_diverse = False
                    break

            if is_diverse:
                selected.append(candidate)

        return selected[:k]

    def _calculate_similarity(self, text1: str, text2: str) -> float:
        """Calculate text similarity (simplified implementation)"""
        words1 = set(text1.lower().split())
        words2 = set(text2.lower().split())

        if not words1 or not words2:
            return 0.0

        intersection = words1.intersection(words2)
        union = words1.union(words2)

        return len(intersection) / len(union) if union else 0.0

    def retrieve_by_topic(self, topic: str, k: int = 5) -> List[Document]:
        """Retrieve documents relevant to a specific topic"""
        topic_keywords = {
            "OOP": ["đối tượng", "class", "object", "kế thừa", "đóng gói"],
            "inheritance": ["kế thừa", "class cha", "class con", "extends"],
            "data structures": ["cấu trúc dữ liệu", "array", "list", "stack", "queue"]
        }

        # Create enhanced query with topic keywords
        keywords = topic_keywords.get(topic.lower(), [topic])
        enhanced_query = f"{topic} {' '.join(keywords)}"

        return self.retrieve_diverse_contexts(enhanced_query, k)

    def get_context_summary(self, documents: List[Document]) -> str:
        """Generate a summary of the retrieved contexts"""
        if not documents:
            return "No relevant context found."

        # Combine and truncate content
        combined_content = "\n\n".join(doc.page_content for doc in documents)

        # Limit context length
        max_length = 2000
        if len(combined_content) > max_length:
            combined_content = combined_content[:max_length] + "..."

        return combined_content

# Initialize the enhanced retriever
if vector_manager.vector_db:
    retriever = ContextAwareRetriever(
        vector_manager.vector_db,
        CONFIG["diversity_threshold"]
    )
    print("✅ Context-aware retriever initialized!")

    # Test diverse retrieval
    test_topics = ["OOP", "inheritance", "data structures"]

    print("\n🧪 Testing diverse retrieval for different topics:")
    for topic in test_topics:
        results = retriever.retrieve_by_topic(topic, k=2)
        print(f"\n📚 Topic: {topic}")
        print(f"  Retrieved {len(results)} diverse documents")

        for i, doc in enumerate(results, 1):
            print(f"  Doc {i}: {doc.page_content[:100]}...")

else:
    print("❌ Vector database not available for retriever initialization")

✅ Context-aware retriever initialized!

🧪 Testing diverse retrieval for different topics:

📚 Topic: OOP
  Retrieved 2 diverse documents
  Doc 1: 
        Object-Oriented Programming (OOP) là một mô hình lập trình được xây dựng dựa trên khái niệm...
  Doc 2: 
        Inheritance (Kế thừa) trong OOP cho phép một class con kế thừa các thuộc tính và phương thứ...

📚 Topic: inheritance
  Retrieved 2 diverse documents
  Doc 1: 
        Inheritance (Kế thừa) trong OOP cho phép một class con kế thừa các thuộc tính và phương thứ...
  Doc 2: 
        Data Structures (Cấu trúc dữ liệu) là cách tổ chức và lưu trữ dữ liệu trong máy tính. Các c...

📚 Topic: data structures
  Retrieved 2 diverse documents
  Doc 1: 
        Data Structures (Cấu trúc dữ liệu) là cách tổ chức và lưu trữ dữ liệu trong máy tính. Các c...
  Doc 2: 
        Object-Oriented Programming (OOP) là một mô hình lập trình được xây dựng dựa trên khái niệm...


## 5. MCQ Generation with LLM

Now let's implement the question generation system using a Large Language Model with structured JSON output.

In [3]:

#? pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [13]:

class MCQGenerator:
    """Generates MCQs using LLM with structured output"""

    def __init__(self):
        self.llm = None
        self.is_initialized = False

    def initialize_llm(self, model_name: str = None) -> bool:
        """Initialize the LLM for question generation"""
        model_name = model_name or CONFIG["llm_model"]

        print(f"🤖 Initializing LLM: {model_name}")
        print("⚠️  Note: This requires significant memory and may take time...")

        try:
            # For demonstration, we'll use a mock LLM
            # In production, uncomment the code below

            # Check for HuggingFace token
            token_path = Path("api_key/hugging_face_token.txt")
            hf_token = None
            if token_path.exists():
                with token_path.open("r") as f:
                    hf_token = f.read().strip()

            # Configure quantization for memory efficiency
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_quant_type="nf4"
            )

            # Load model
            model, tokenizer = FastLanguageModel.from_pretrained(
                model_name,
                quantization_config=bnb_config,
                low_cpu_mem_usage=True,
                device_map="auto",
                token=hf_token
            )

            # tokenizer = AutoTokenizer.from_pretrained(model_name)
            tokenizer.pad_token = tokenizer.eos_token

            model_pipeline = pipeline(
                "text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=CONFIG["max_tokens"],
                temperature=CONFIG["generation_temperature"],
                pad_token_id=tokenizer.eos_token_id,
                device_map="auto"
            )

            self.llm = HuggingFacePipeline(pipeline=model_pipeline)

            # For demonstration, create a mock LLM
            # self.llm = self._create_mock_llm()
            self.is_initialized = True

            print("✅ LLM initialized successfully!")
            return True

        except Exception as e:
            print(f"❌ Failed to initialize LLM: {e}")
            print("🔄 Using mock LLM for demonstration...")
            self.llm = self._create_mock_llm()
            self.is_initialized = True
            return True

    def _create_mock_llm(self):
        """Create a mock LLM for demonstration purposes"""
        class MockLLM:
            def __call__(self, prompt):
                # Mock response based on context analysis
                if "OOP" in prompt or "đối tượng" in prompt:
                    return '''
{
    "question": "OOP (Object-Oriented Programming) là gì?",
    "options": {
        "A": "Một mô hình lập trình dựa trên khái niệm đối tượng",
        "B": "Một hệ quản trị cơ sở dữ liệu",
        "C": "Một framework phát triển web",
        "D": "Một phương pháp kiểm thử phần mềm"
    },
    "correct_answer": "A",
    "explanation": "OOP là viết tắt của Object-Oriented Programming, một mô hình lập trình tổ chức mã nguồn xung quanh các đối tượng thay vì các hàm và logic.",
    "topic": "Programming Fundamentals",
    "difficulty": "medium",
    "question_type": "definition"
}
'''
                elif "kế thừa" in prompt or "inheritance" in prompt:
                    return '''
{
    "question": "Inheritance (Kế thừa) trong OOP có lợi ích gì?",
    "options": {
        "A": "Tái sử dụng code và dễ dàng mở rộng",
        "B": "Tăng tốc độ thực thi chương trình",
        "C": "Giảm dung lượng file thực thi",
        "D": "Cải thiện bảo mật của ứng dụng"
    },
    "correct_answer": "A",
    "explanation": "Inheritance cho phép class con kế thừa thuộc tính và phương thức từ class cha, giúp tái sử dụng code và dễ dàng mở rộng tính năng.",
    "topic": "OOP Principles",
    "difficulty": "medium",
    "question_type": "application"
}
'''
                else:
                    return '''
{
    "question": "Cấu trúc dữ liệu nào hoạt động theo nguyên lý LIFO?",
    "options": {
        "A": "Queue (Hàng đợi)",
        "B": "Stack (Ngăn xếp)",
        "C": "Array (Mảng)",
        "D": "Linked List (Danh sách liên kết)"
    },
    "correct_answer": "B",
    "explanation": "Stack hoạt động theo nguyên lý LIFO (Last In First Out), phần tử được thêm vào cuối cùng sẽ được lấy ra đầu tiên.",
    "topic": "Data Structures",
    "difficulty": "easy",
    "question_type": "definition"
}
'''

        return MockLLM()

    def generate_mcq_from_context(self, context: str, topic: str,
                                  difficulty: DifficultyLevel = DifficultyLevel.MEDIUM,
                                  question_type: QuestionType = QuestionType.DEFINITION) -> MCQQuestion:
        """Generate MCQ from provided context"""
        if not self.is_initialized:
            raise RuntimeError("LLM not initialized. Call initialize_llm() first.")

        # Create prompt
        prompt = self._create_prompt(context, topic, difficulty, question_type)

        # Generate response
        response = self.llm(prompt)

        # Parse JSON response
        mcq = self._parse_response(response, context, topic)

        return mcq

    def _create_prompt(self, context: str, topic: str,
                       difficulty: DifficultyLevel, question_type: QuestionType) -> str:
        """Create structured prompt for MCQ generation"""

        prompt_template = """
Bạn là một chuyên gia giáo dục và thiết kế câu hỏi. Nhiệm vụ của bạn là tạo ra một câu hỏi trắc nghiệm chất lượng cao từ nội dung được cung cấp.

Yêu cầu:
1. Tạo một câu hỏi rõ ràng, không mơ hồ
2. Cung cấp đúng 4 lựa chọn (A, B, C, D)
3. Chỉ có một đáp án đúng
4. Các phương án sai phải hợp lý nhưng rõ ràng là sai
5. Bao gồm giải thích cho đáp án đúng

Nội dung: {context}
Chủ đề: {topic}
Mức độ khó: {difficulty}
Loại câu hỏi: {question_type}

Trả về chỉ dưới dạng JSON hợp lệ với cấu trúc sau:
{{
    "question": "Câu hỏi của bạn",
    "options": {{
        "A": "Lựa chọn A",
        "B": "Lựa chọn B",
        "C": "Lựa chọn C",
        "D": "Lựa chọn D"
    }},
    "correct_answer": "A",
    "explanation": "Giải thích chi tiết",
    "topic": "{topic}",
    "difficulty": "{difficulty}",
    "question_type": "{question_type}"
}}
"""

        return prompt_template.format(
            context=context[:1500],  # Limit context length
            topic=topic,
            difficulty=difficulty.value,
            question_type=question_type.value
        )

    def _parse_response(self, response: str, context: str, topic: str) -> MCQQuestion:
        """Parse LLM response and create MCQQuestion object"""
        try:
            # Extract JSON from response
            json_start = response.rfind("{")
            json_end = response.rfind("}") + 1

            if json_start == -1 or json_end == 0:
                raise ValueError("No JSON found in response")

            json_text = response[json_start:json_end]
            response_data = json.loads(json_text)

            # Create MCQ options
            options = []
            for label, text in response_data["options"].items():
                is_correct = label == response_data["correct_answer"]
                options.append(MCQOption(label, text, is_correct))

            # Create MCQ object
            mcq = MCQQuestion(
                question=response_data["question"],
                context=context[:500] + "..." if len(context) > 500 else context,
                options=options,
                explanation=response_data.get("explanation", ""),
                difficulty=response_data.get("difficulty", "medium"),
                topic=topic,
                question_type=response_data.get("question_type", "definition"),
                source="Generated from documents",
                confidence_score=0.0  # Will be calculated later
            )

            return mcq

        except (json.JSONDecodeError, KeyError) as e:
            raise ValueError(f"Failed to parse LLM response: {e}")

# Initialize MCQ generator
mcq_generator = MCQGenerator()
if mcq_generator.initialize_llm():
    print("🎯 MCQ Generator ready!")

🤖 Initializing LLM: unsloth/Qwen2.5-7B
⚠️  Note: This requires significant memory and may take time...
==((====))==  Unsloth 2025.7.9: Fast Qwen2 patching. Transformers: 4.53.2.
   \\   /|    NVIDIA GeForce GTX 1650. Num GPUs = 1. Max memory: 4.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.7.1+cu118. CUDA: 7.5. CUDA Toolkit: 11.8. Triton: 3.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.54G [00:00<?, ?B/s]

❌ Failed to initialize LLM: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
🔄 Using mock LLM for demonstration...
🎯 MCQ Generator ready!


In [16]:
# Test single MCQ generation
print("🧪 Testing MCQ generation...")

# Get context for OOP topic
if 'retriever' in locals():
    context_docs = retriever.retrieve_by_topic("OOP", k=2)
    context = retriever.get_context_summary(context_docs)

    print(f"📄 Context length: {len(context)} characters")
    print(f"📄 Context preview: {context[:200]}...")

    # Generate MCQ
    mcq = mcq_generator.generate_mcq_from_context(
        context=context,
        topic="Object-Oriented Programming",
        difficulty=DifficultyLevel.MEDIUM,
        question_type=QuestionType.DEFINITION
    )

    # Display the generated MCQ
    print(f"\n🎯 Generated MCQ:")
    print(f"Question: {mcq.question}")
    print(f"\nOptions:")
    for option in mcq.options:
        marker = "✅" if option.is_correct else "  "
        print(f"  {marker} {option.label}: {option.text}")

    print(f"\nCorrect Answer: {next(opt.label for opt in mcq.options if opt.is_correct)}")
    print(f"Explanation: {mcq.explanation}")
    print(f"Topic: {mcq.topic}")
    print(f"Difficulty: {mcq.difficulty}")
    print(f"Question Type: {mcq.question_type}")

else:
    print("❌ Retriever not available. Cannot test MCQ generation.")

🧪 Testing MCQ generation...
📄 Context length: 945 characters
📄 Context preview: 
        Object-Oriented Programming (OOP) là một mô hình lập trình được xây dựng dựa trên khái niệm đối tượng. OOP tổ chức mã nguồn xung quanh các đối tượng thay vì các hàm và logic. Các nguyên lý cơ...


ValueError: Failed to parse LLM response: Extra data: line 6 column 6 (char 214)

## 6. Prompt Engineering for Different Question Types

Let's explore specialized prompts for different types of questions to improve generation quality.

In [None]:
class AdvancedPromptManager:
    """Manages specialized prompts for different question types"""

    def __init__(self):
        self.templates = self._initialize_templates()

    def _initialize_templates(self) -> Dict[str, str]:
        """Initialize prompt templates for different question types"""
        base_instruction = """
Bạn là một chuyên gia giáo dục và thiết kế câu hỏi. Tạo một câu hỏi trắc nghiệm chất lượng cao từ nội dung được cung cấp.

Yêu cầu chung:
- Câu hỏi rõ ràng, không mơ hồ
- Đúng 4 lựa chọn (A, B, C, D)
- Chỉ một đáp án đúng
- Phương án sai hợp lý nhưng rõ ràng sai
- Giải thích chi tiết cho đáp án đúng
"""

        return {
            QuestionType.DEFINITION: f"""
{base_instruction}

Yêu cầu đặc biệt cho câu hỏi ĐỊNH NGHĨA:
- Tập trung vào định nghĩa chính xác của thuật ngữ, khái niệm
- Các phương án sai thường là định nghĩa của khái niệm khác hoặc hiểu lầm phổ biến
- Sử dụng ngôn ngữ đơn giản, dễ hiểu
- Ví dụ: "X là gì?", "Định nghĩa của Y là gì?"

Nội dung: {{context}}
Chủ đề: {{topic}}
Mức độ: {{difficulty}}

JSON Output:
""",

            QuestionType.APPLICATION: f"""
{base_instruction}

Yêu cầu đặc biệt cho câu hỏi ỨNG DỤNG:
- Tạo tình huống thực tế cần áp dụng kiến thức
- Hỏi "khi nào sử dụng", "trong trường hợp nào", "ví dụ nào"
- Các phương án sai là ứng dụng không phù hợp hoặc sai ngữ cảnh
- Kết nối lý thuyết với thực tiễn

Nội dung: {{context}}
Chủ đề: {{topic}}
Mức độ: {{difficulty}}

JSON Output:
""",

            QuestionType.COMPARISON: f"""
{base_instruction}

Yêu cầu đặc biệt cho câu hỏi SO SÁNH:
- So sánh 2-3 khái niệm, phương pháp, kỹ thuật
- Tập trung vào điểm khác biệt hoặc giống nhau chính
- Các phương án sai thường đảo ngược đặc điểm hoặc nhầm lẫn
- Ví dụ: "Khác biệt giữa X và Y là gì?"

Nội dung: {{context}}
Chủ đề: {{topic}}
Mức độ: {{difficulty}}

JSON Output:
""",

            QuestionType.ANALYSIS: f"""
{base_instruction}

Yêu cầu đặc biệt cho câu hỏi PHÂN TÍCH:
- Yêu cầu phân tích code, sơ đồ, hoặc tình huống phức tạp
- Kiểm tra tư duy logic và khả năng suy luận
- Câu hỏi có thể có nhiều bước suy luận
- Các phương án sai là kết luận sai hoặc thiếu logic

Nội dung: {{context}}
Chủ đề: {{topic}}
Mức độ: {{difficulty}}

JSON Output:
""",

            QuestionType.EVALUATION: f"""
{base_instruction}

Yêu cầu đặc biệt cho câu hỏi ĐÁNH GIÁ:
- Đánh giá ưu nhược điểm, hiệu quả, phù hợp
- Câu hỏi dạng "phương pháp nào tốt nhất", "khi nào nên chọn"
- Yêu cầu cân nhắc nhiều yếu tố
- Các phương án cần có độ hợp lý cao

Nội dung: {{context}}
Chủ đề: {{topic}}
Mức độ: {{difficulty}}

JSON Output:
"""
        }

    def get_prompt(self, question_type: QuestionType, context: str,
                   topic: str, difficulty: DifficultyLevel) -> str:
        """Get specialized prompt for question type"""
        template = self.templates.get(question_type, self.templates[QuestionType.DEFINITION])

        return template.format(
            context=context[:1200],  # Limit context length
            topic=topic,
            difficulty=difficulty.value
        )

    def generate_examples(self) -> Dict[QuestionType, str]:
        """Generate example questions for each type"""
        examples = {
            QuestionType.DEFINITION: """
Ví dụ câu hỏi định nghĩa:
"Encapsulation (Đóng gói) trong OOP là gì?"
A) Ẩn giấu chi tiết triển khai và chỉ để lộ interface cần thiết ✅
B) Kế thừa thuộc tính từ class cha
C) Tạo nhiều hình thức khác nhau của cùng một phương thức
D) Tổ chức code thành các module riêng biệt
""",

            QuestionType.APPLICATION: """
Ví dụ câu hỏi ứng dụng:
"Trong trường hợp nào nên sử dụng Stack?"
A) Khi cần truy cập ngẫu nhiên vào các phần tử
B) Khi cần xử lý theo thứ tự LIFO (Last In First Out) ✅
C) Khi cần sắp xếp dữ liệu tự động
D) Khi cần chia sẻ dữ liệu giữa nhiều thread
""",

            QuestionType.COMPARISON: """
Ví dụ câu hỏi so sánh:
"Khác biệt chính giữa Array và Linked List là gì?"
A) Array cho phép truy cập ngẫu nhiên, Linked List truy cập tuần tự ✅
B) Array chỉ lưu số, Linked List lưu mọi kiểu dữ liệu
C) Array không thể thay đổi kích thước, Linked List có thể
D) Array nhanh hơn trong mọi trường hợp
""",

            QuestionType.ANALYSIS: """
Ví dụ câu hỏi phân tích:
"Đoạn code sau vi phạm nguyên lý OOP nào?
class Bird:
    def fly(self): pass
class Penguin(Bird):
    def fly(self): raise Exception('Cannot fly')"

A) Encapsulation
B) Liskov Substitution Principle ✅
C) Single Responsibility
D) Open/Closed Principle
""",

            QuestionType.EVALUATION: """
Ví dụ câu hỏi đánh giá:
"Khi nào nên chọn Composition thay vì Inheritance?"
A) Khi muốn mối quan hệ "is-a" rõ ràng
B) Khi cần flexibility và tránh tight coupling ✅
C) Khi muốn tiết kiệm memory
D) Khi class cha có ít phương thức
"""
        }

        return examples

# Initialize advanced prompt manager
prompt_manager = AdvancedPromptManager()

# Display examples
print("📝 Prompt Engineering Examples:")
examples = prompt_manager.generate_examples()

for q_type, example in examples.items():
    print(f"\n{q_type.value.upper()} Questions:")
    print(example)

## 7. Quality Validation System

Implementing automatic quality checks to ensure generated MCQs meet educational standards.

In [None]:
class QualityValidator:
    """Comprehensive quality validation for MCQs"""

    def __init__(self):
        self.min_question_length = 10
        self.max_question_length = 200
        self.min_explanation_length = 20
        self.min_option_length = 5
        self.max_option_length = 150

    def validate_mcq(self, mcq: MCQQuestion) -> Tuple[bool, Dict[str, Any]]:
        """Comprehensive MCQ validation with detailed feedback"""
        results = {
            "is_valid": True,
            "issues": [],
            "warnings": [],
            "scores": {}
        }

        # Check basic structure
        structure_score = self._check_structure(mcq, results)
        results["scores"]["structure"] = structure_score

        # Check content quality
        content_score = self._check_content_quality(mcq, results)
        results["scores"]["content"] = content_score

        # Check distractor quality
        distractor_score = self._check_distractor_quality(mcq, results)
        results["scores"]["distractors"] = distractor_score

        # Check language quality
        language_score = self._check_language_quality(mcq, results)
        results["scores"]["language"] = language_score

        # Overall validation
        results["is_valid"] = len(results["issues"]) == 0
        results["overall_score"] = np.mean(list(results["scores"].values()))

        return results["is_valid"], results

    def _check_structure(self, mcq: MCQQuestion, results: Dict) -> float:
        """Check MCQ structural requirements"""
        score = 100.0

        # Check question length
        if len(mcq.question) < self.min_question_length:
            results["issues"].append("Question too short")
            score -= 20
        elif len(mcq.question) > self.max_question_length:
            results["warnings"].append("Question might be too long")
            score -= 10

        # Check options count
        if len(mcq.options) != 4:
            results["issues"].append(f"Must have exactly 4 options, found {len(mcq.options)}")
            score -= 30

        # Check for single correct answer
        correct_count = sum(1 for opt in mcq.options if opt.is_correct)
        if correct_count != 1:
            results["issues"].append(f"Must have exactly 1 correct answer, found {correct_count}")
            score -= 40

        # Check explanation
        if len(mcq.explanation) < self.min_explanation_length:
            results["issues"].append("Explanation too short")
            score -= 15

        return max(score, 0)

    def _check_content_quality(self, mcq: MCQQuestion, results: Dict) -> float:
        """Check content quality and relevance"""
        score = 100.0

        # Check for distinct options
        option_texts = [opt.text.lower().strip() for opt in mcq.options]
        if len(set(option_texts)) != len(option_texts):
            results["issues"].append("Options must be distinct")
            score -= 25

        # Check option length consistency
        option_lengths = [len(opt.text) for opt in mcq.options]
        length_variance = np.var(option_lengths)
        if length_variance > 1000:  # High variance in option lengths
            results["warnings"].append("Large variation in option lengths")
            score -= 10

        # Check for obvious patterns
        labels = [opt.label for opt in mcq.options]
        if not labels == ["A", "B", "C", "D"]:
            results["issues"].append("Options must be labeled A, B, C, D")
            score -= 15

        return max(score, 0)

    def _check_distractor_quality(self, mcq: MCQQuestion, results: Dict) -> float:
        """Check quality of incorrect options (distractors)"""
        score = 100.0

        distractors = [opt for opt in mcq.options if not opt.is_correct]

        # Check distractor plausibility (simplified check)
        for i, distractor in enumerate(distractors):
            if len(distractor.text) < self.min_option_length:
                results["warnings"].append(f"Distractor {distractor.label} too short")
                score -= 5

            # Check for obviously wrong answers (very simple check)
            if any(word in distractor.text.lower() for word in ["không", "never", "impossible"]):
                results["warnings"].append(f"Distractor {distractor.label} might be too obviously wrong")
                score -= 10

        return max(score, 0)

    def _check_language_quality(self, mcq: MCQQuestion, results: Dict) -> float:
        """Check language quality and clarity"""
        score = 100.0

        # Check for common issues
        text_to_check = mcq.question + " " + " ".join(opt.text for opt in mcq.options)

        # Check for excessive repetition
        words = text_to_check.lower().split()
        word_freq = {}
        for word in words:
            if len(word) > 3:  # Only check longer words
                word_freq[word] = word_freq.get(word, 0) + 1

        repeated_words = [word for word, freq in word_freq.items() if freq > 3]
        if repeated_words:
            results["warnings"].append(f"Repeated words detected: {repeated_words[:3]}")
            score -= 5

        # Check for question clarity indicators
        if not mcq.question.strip().endswith("?"):
            results["warnings"].append("Question should end with question mark")
            score -= 5

        return max(score, 0)

    def calculate_confidence_score(self, mcq: MCQQuestion) -> float:
        """Calculate overall confidence score for the MCQ"""
        is_valid, validation_results = self.validate_mcq(mcq)

        if not is_valid:
            return 0.0

        # Base score from validation
        base_score = validation_results["overall_score"]

        # Bonus factors
        bonus = 0

        # Good explanation length
        if 50 <= len(mcq.explanation) <= 200:
            bonus += 5

        # Balanced option lengths
        option_lengths = [len(opt.text) for opt in mcq.options]
        if max(option_lengths) - min(option_lengths) < 30:
            bonus += 5

        # Appropriate question length
        if 30 <= len(mcq.question) <= 120:
            bonus += 5

        final_score = min(base_score + bonus, 100.0)
        return final_score

    def generate_quality_report(self, mcqs: List[MCQQuestion]) -> Dict[str, Any]:
        """Generate comprehensive quality report for multiple MCQs"""
        if not mcqs:
            return {"error": "No MCQs provided"}

        report = {
            "total_questions": len(mcqs),
            "valid_questions": 0,
            "average_score": 0.0,
            "score_distribution": {},
            "common_issues": {},
            "recommendations": []
        }

        scores = []
        all_issues = []

        for mcq in mcqs:
            is_valid, validation = self.validate_mcq(mcq)
            score = self.calculate_confidence_score(mcq)

            if is_valid:
                report["valid_questions"] += 1

            scores.append(score)
            all_issues.extend(validation["issues"])

        # Calculate statistics
        report["average_score"] = np.mean(scores)
        report["median_score"] = np.median(scores)
        report["min_score"] = np.min(scores)
        report["max_score"] = np.max(scores)

        # Score distribution
        score_ranges = [(0, 40), (40, 60), (60, 80), (80, 100)]
        for low, high in score_ranges:
            count = sum(1 for s in scores if low <= s < high)
            report["score_distribution"][f"{low}-{high}"] = count

        # Common issues
        issue_counts = {}
        for issue in all_issues:
            issue_counts[issue] = issue_counts.get(issue, 0) + 1
        report["common_issues"] = dict(sorted(issue_counts.items(), key=lambda x: x[1], reverse=True))

        # Generate recommendations
        if report["average_score"] < 70:
            report["recommendations"].append("Consider improving prompt engineering")
        if report["valid_questions"] / report["total_questions"] < 0.8:
            report["recommendations"].append("Review structural validation rules")
        if "Options must be distinct" in report["common_issues"]:
            report["recommendations"].append("Improve distractor generation")

        return report

# Initialize quality validator
quality_validator = QualityValidator()

# Test validation with the previously generated MCQ
if 'mcq' in locals():
    print("🧪 Testing Quality Validation...")

    is_valid, validation_results = quality_validator.validate_mcq(mcq)
    confidence_score = quality_validator.calculate_confidence_score(mcq)

    print(f"\n📊 Validation Results:")
    print(f"Valid: {'✅' if is_valid else '❌'}")
    print(f"Overall Score: {validation_results['overall_score']:.1f}/100")
    print(f"Confidence Score: {confidence_score:.1f}/100")

    print(f"\n📋 Detailed Scores:")
    for category, score in validation_results['scores'].items():
        print(f"  {category.title()}: {score:.1f}/100")

    if validation_results['issues']:
        print(f"\n❌ Issues found:")
        for issue in validation_results['issues']:
            print(f"  - {issue}")

    if validation_results['warnings']:
        print(f"\n⚠️  Warnings:")
        for warning in validation_results['warnings']:
            print(f"  - {warning}")

    # Update MCQ confidence score
    mcq.confidence_score = confidence_score
    print(f"\n✅ MCQ confidence score updated to {confidence_score:.1f}")

else:
    print("❌ No MCQ available for validation testing")

## 8. Difficulty Assessment and Classification

Implementing intelligent difficulty assessment based on cognitive load and concept complexity.

In [None]:
class DifficultyAnalyzer:
    """Analyzes and classifies question difficulty"""

    def __init__(self):
        self.difficulty_indicators = {
            DifficultyLevel.EASY: {
                "keywords": ["là gì", "định nghĩa", "ví dụ", "đơn giản", "cơ bản"],
                "concepts": 1,
                "cognitive_load": "recall",
                "max_word_count": 15
            },
            DifficultyLevel.MEDIUM: {
                "keywords": ["so sánh", "khác biệt", "ứng dụng", "khi nào", "tại sao"],
                "concepts": 2,
                "cognitive_load": "comprehension",
                "max_word_count": 25
            },
            DifficultyLevel.HARD: {
                "keywords": ["phân tích", "đánh giá", "tối ưu", "thiết kế", "giải thích"],
                "concepts": 3,
                "cognitive_load": "analysis",
                "max_word_count": 35
            },
            DifficultyLevel.EXPERT: {
                "keywords": ["tổng hợp", "sáng tạo", "nghiên cứu", "phát triển", "optimization"],
                "concepts": 4,
                "cognitive_load": "synthesis",
                "max_word_count": 50
            }
        }

        # Technical term complexity levels
        self.technical_terms = {
            "basic": ["đối tượng", "class", "function", "variable"],
            "intermediate": ["inheritance", "polymorphism", "encapsulation", "abstraction"],
            "advanced": ["design pattern", "algorithm complexity", "data structure optimization"],
            "expert": ["architectural pattern", "performance tuning", "scalability analysis"]
        }

    def assess_difficulty(self, mcq: MCQQuestion) -> Dict[str, Any]:
        """Comprehensive difficulty assessment"""
        analysis = {
            "predicted_difficulty": DifficultyLevel.MEDIUM,
            "confidence": 0.0,
            "factors": {},
            "recommendations": []
        }

        # Analyze different factors
        keyword_score = self._analyze_keywords(mcq.question)
        complexity_score = self._analyze_complexity(mcq)
        cognitive_score = self._analyze_cognitive_load(mcq)
        technical_score = self._analyze_technical_terms(mcq)

        analysis["factors"] = {
            "keyword_difficulty": keyword_score,
            "content_complexity": complexity_score,
            "cognitive_load": cognitive_score,
            "technical_complexity": technical_score
        }

        # Calculate overall difficulty
        overall_score = np.mean([keyword_score, complexity_score, cognitive_score, technical_score])
        analysis["predicted_difficulty"] = self._score_to_difficulty(overall_score)
        analysis["confidence"] = min(100, max(50, 80 + (overall_score - 50) * 0.4))

        # Generate recommendations
        analysis["recommendations"] = self._generate_recommendations(analysis)

        return analysis

    def _analyze_keywords(self, question: str) -> float:
        """Analyze question keywords for difficulty indicators"""
        question_lower = question.lower()
        scores = []

        for difficulty, indicators in self.difficulty_indicators.items():
            score = sum(2 if keyword in question_lower else 0
                       for keyword in indicators["keywords"])
            if score > 0:
                scores.append((difficulty, score))

        if not scores:
            return 50.0  # Default medium difficulty

        # Weight by difficulty level
        difficulty_weights = {
            DifficultyLevel.EASY: 25,
            DifficultyLevel.MEDIUM: 50,
            DifficultyLevel.HARD: 75,
            DifficultyLevel.EXPERT: 90
        }

        weighted_score = sum(difficulty_weights[diff] * score for diff, score in scores)
        total_weight = sum(score for _, score in scores)

        return weighted_score / total_weight if total_weight > 0 else 50.0

    def _analyze_complexity(self, mcq: MCQQuestion) -> float:
        """Analyze content complexity"""
        factors = []

        # Question length complexity
        question_words = len(mcq.question.split())
        if question_words <= 10:
            factors.append(30)
        elif question_words <= 20:
            factors.append(50)
        elif question_words <= 30:
            factors.append(70)
        else:
            factors.append(85)

        # Option complexity
        option_lengths = [len(opt.text.split()) for opt in mcq.options]
        avg_option_length = np.mean(option_lengths)

        if avg_option_length <= 5:
            factors.append(35)
        elif avg_option_length <= 10:
            factors.append(55)
        else:
            factors.append(75)

        # Explanation complexity
        explanation_words = len(mcq.explanation.split())
        if explanation_words <= 15:
            factors.append(40)
        elif explanation_words <= 30:
            factors.append(60)
        else:
            factors.append(80)

        return np.mean(factors)

    def _analyze_cognitive_load(self, mcq: MCQQuestion) -> float:
        """Analyze cognitive load based on Bloom's taxonomy"""
        cognitive_indicators = {
            "remember": ["là gì", "định nghĩa", "liệt kê", "nhận diện"],
            "understand": ["giải thích", "mô tả", "so sánh", "phân biệt"],
            "apply": ["sử dụng", "áp dụng", "thực hiện", "giải quyết"],
            "analyze": ["phân tích", "phân chia", "so sánh", "đối chiếu"],
            "evaluate": ["đánh giá", "phê bình", "lựa chọn", "quyết định"],
            "create": ["tạo ra", "thiết kế", "phát triển", "sáng tạo"]
        }

        cognitive_scores = {
            "remember": 20,
            "understand": 35,
            "apply": 50,
            "analyze": 70,
            "evaluate": 85,
            "create": 95
        }

        text = (mcq.question + " " + mcq.explanation).lower()

        detected_levels = []
        for level, indicators in cognitive_indicators.items():
            if any(indicator in text for indicator in indicators):
                detected_levels.append(cognitive_scores[level])

        return max(detected_levels) if detected_levels else 50.0

    def _analyze_technical_terms(self, mcq: MCQQuestion) -> float:
        """Analyze technical term complexity"""
        all_text = (mcq.question + " " + mcq.explanation + " " +
                   " ".join(opt.text for opt in mcq.options)).lower()

        complexity_scores = {
            "basic": 30,
            "intermediate": 50,
            "advanced": 75,
            "expert": 90
        }

        detected_levels = []
        for level, terms in self.technical_terms.items():
            if any(term.lower() in all_text for term in terms):
                detected_levels.append(complexity_scores[level])

        return max(detected_levels) if detected_levels else 40.0

    def _score_to_difficulty(self, score: float) -> DifficultyLevel:
        """Convert numeric score to difficulty level"""
        if score < 35:
            return DifficultyLevel.EASY
        elif score < 60:
            return DifficultyLevel.MEDIUM
        elif score < 80:
            return DifficultyLevel.HARD
        else:
            return DifficultyLevel.EXPERT

    def _generate_recommendations(self, analysis: Dict) -> List[str]:
        """Generate recommendations based on difficulty analysis"""
        recommendations = []

        predicted = analysis["predicted_difficulty"]
        factors = analysis["factors"]

        if factors["keyword_difficulty"] < 30:
            recommendations.append("Consider using more specific terminology")

        if factors["content_complexity"] > 80:
            recommendations.append("Question might be too complex - consider simplification")

        if factors["cognitive_load"] < 30:
            recommendations.append("Question tests only basic recall - consider higher-order thinking")

        if analysis["confidence"] < 60:
            recommendations.append("Difficulty assessment has low confidence - review question design")

        return recommendations

    def calibrate_difficulty_distribution(self, mcqs: List[MCQQuestion]) -> Dict[str, Any]:
        """Analyze difficulty distribution across multiple MCQs"""
        if not mcqs:
            return {}

        analyses = [self.assess_difficulty(mcq) for mcq in mcqs]

        # Count difficulty levels
        difficulty_counts = {}
        confidence_scores = []

        for analysis in analyses:
            diff_level = analysis["predicted_difficulty"].value
            difficulty_counts[diff_level] = difficulty_counts.get(diff_level, 0) + 1
            confidence_scores.append(analysis["confidence"])

        # Calculate statistics
        total = len(mcqs)
        distribution = {level: count/total * 100 for level, count in difficulty_counts.items()}

        return {
            "total_questions": total,
            "difficulty_distribution": distribution,
            "average_confidence": np.mean(confidence_scores),
            "recommended_distribution": {
                "easy": 30,
                "medium": 50,
                "hard": 15,
                "expert": 5
            },
            "needs_rebalancing": self._check_balance(distribution)
        }

    def _check_balance(self, distribution: Dict[str, float]) -> bool:
        """Check if difficulty distribution needs rebalancing"""
        recommended = {"easy": 30, "medium": 50, "hard": 15, "expert": 5}

        for level, target_percent in recommended.items():
            actual_percent = distribution.get(level, 0)
            if abs(actual_percent - target_percent) > 20:  # More than 20% deviation
                return True

        return False

# Initialize difficulty analyzer
difficulty_analyzer = DifficultyAnalyzer()

# Test difficulty analysis
if 'mcq' in locals():
    print("🧪 Testing Difficulty Analysis...")

    analysis = difficulty_analyzer.assess_difficulty(mcq)

    print(f"\n📊 Difficulty Analysis Results:")
    print(f"Predicted Difficulty: {analysis['predicted_difficulty'].value.upper()}")
    print(f"Confidence: {analysis['confidence']:.1f}%")

    print(f"\n📋 Factor Analysis:")
    for factor, score in analysis['factors'].items():
        print(f"  {factor.replace('_', ' ').title()}: {score:.1f}/100")

    if analysis['recommendations']:
        print(f"\n💡 Recommendations:")
        for rec in analysis['recommendations']:
            print(f"  • {rec}")

    # Compare with intended difficulty
    intended_difficulty = mcq.difficulty
    predicted_difficulty = analysis['predicted_difficulty'].value

    if intended_difficulty == predicted_difficulty:
        print(f"\n✅ Difficulty assessment matches intended level: {intended_difficulty}")
    else:
        print(f"\n⚠️  Difficulty mismatch:")
        print(f"   Intended: {intended_difficulty}")
        print(f"   Predicted: {predicted_difficulty}")

else:
    print("❌ No MCQ available for difficulty analysis")

## 9. Batch Generation and Testing

Implementing scalable batch processing for generating multiple MCQs with error handling and retry mechanisms.

In [None]:
class BatchMCQGenerator:
    """Batch processing for MCQ generation with error handling"""

    def __init__(self, mcq_generator: MCQGenerator, retriever: ContextAwareRetriever,
                 quality_validator: QualityValidator, difficulty_analyzer: DifficultyAnalyzer):
        self.mcq_generator = mcq_generator
        self.retriever = retriever
        self.quality_validator = quality_validator
        self.difficulty_analyzer = difficulty_analyzer
        self.max_retries = 3
        self.min_quality_score = 60.0

    def generate_batch(self, topics: List[str],
                      count_per_topic: int = 3,
                      difficulties: Optional[List[DifficultyLevel]] = None,
                      question_types: Optional[List[QuestionType]] = None) -> Dict[str, Any]:
        """Generate batch of MCQs with comprehensive reporting"""

        if difficulties is None:
            difficulties = [DifficultyLevel.EASY, DifficultyLevel.MEDIUM, DifficultyLevel.HARD]

        if question_types is None:
            question_types = [QuestionType.DEFINITION, QuestionType.APPLICATION]

        total_target = len(topics) * count_per_topic
        results = {
            "mcqs": [],
            "failed_generations": [],
            "statistics": {},
            "quality_report": {},
            "difficulty_analysis": {}
        }

        print(f"🚀 Starting batch generation...")
        print(f"📊 Target: {total_target} MCQs across {len(topics)} topics")
        print(f"🎯 Difficulties: {[d.value for d in difficulties]}")
        print(f"📝 Question types: {[q.value for q in question_types]}")

        # Generate MCQs for each topic
        for topic_idx, topic in enumerate(topics, 1):
            print(f"\\n📚 Processing topic {topic_idx}/{len(topics)}: {topic}")

            topic_mcqs = []
            topic_failures = []

            for q_idx in range(count_per_topic):
                # Cycle through difficulties and question types
                difficulty = difficulties[q_idx % len(difficulties)]
                question_type = question_types[q_idx % len(question_types)]

                print(f"  🎯 Generating Q{q_idx+1}: {difficulty.value} {question_type.value}")

                mcq, error = self._generate_single_mcq_with_retry(
                    topic, difficulty, question_type
                )

                if mcq:
                    topic_mcqs.append(mcq)
                    quality_score = mcq.confidence_score
                    print(f"    ✅ Success (Quality: {quality_score:.1f})")
                else:
                    topic_failures.append({
                        "topic": topic,
                        "difficulty": difficulty.value,
                        "question_type": question_type.value,
                        "error": error
                    })
                    print(f"    ❌ Failed: {error}")

            results["mcqs"].extend(topic_mcqs)
            results["failed_generations"].extend(topic_failures)

            print(f"  📊 Topic summary: {len(topic_mcqs)}/{count_per_topic} successful")

        # Generate comprehensive statistics
        results["statistics"] = self._calculate_statistics(results["mcqs"], total_target)
        results["quality_report"] = self.quality_validator.generate_quality_report(results["mcqs"])
        results["difficulty_analysis"] = self.difficulty_analyzer.calibrate_difficulty_distribution(results["mcqs"])

        self._print_batch_summary(results)

        return results

    def _generate_single_mcq_with_retry(self, topic: str, difficulty: DifficultyLevel,
                                       question_type: QuestionType) -> Tuple[Optional[MCQQuestion], Optional[str]]:
        """Generate single MCQ with retry mechanism"""

        for attempt in range(1, self.max_retries + 1):
            try:
                # Retrieve context
                context_docs = self.retriever.retrieve_by_topic(topic, k=3)
                if not context_docs:
                    return None, f"No relevant context found for topic: {topic}"

                context = self.retriever.get_context_summary(context_docs)

                # Generate MCQ
                mcq = self.mcq_generator.generate_mcq_from_context(
                    context, topic, difficulty, question_type
                )

                # Validate quality
                confidence_score = self.quality_validator.calculate_confidence_score(mcq)
                mcq.confidence_score = confidence_score

                if confidence_score >= self.min_quality_score:
                    return mcq, None
                else:
                    if attempt < self.max_retries:
                        print(f"    🔄 Retry {attempt}: Low quality score ({confidence_score:.1f})")
                        continue
                    else:
                        return None, f"Quality too low after {self.max_retries} attempts"

            except Exception as e:
                if attempt < self.max_retries:
                    print(f"    🔄 Retry {attempt}: {str(e)[:50]}...")
                    continue
                else:
                    return None, f"Generation failed: {str(e)}"

        return None, "Max retries exceeded"

    def _calculate_statistics(self, mcqs: List[MCQQuestion], target_count: int) -> Dict[str, Any]:
        """Calculate generation statistics"""
        if not mcqs:
            return {"error": "No MCQs generated"}

        # Basic statistics
        stats = {
            "total_generated": len(mcqs),
            "target_count": target_count,
            "success_rate": len(mcqs) / target_count * 100,
            "average_quality": np.mean([mcq.confidence_score for mcq in mcqs]),
            "quality_distribution": {}
        }

        # Quality distribution
        quality_ranges = [(0, 40), (40, 60), (60, 80), (80, 100)]
        for low, high in quality_ranges:
            count = sum(1 for mcq in mcqs if low <= mcq.confidence_score < high)
            stats["quality_distribution"][f"{low}-{high}"] = count

        # Topic distribution
        topic_counts = {}
        for mcq in mcqs:
            topic_counts[mcq.topic] = topic_counts.get(mcq.topic, 0) + 1
        stats["topic_distribution"] = topic_counts

        # Difficulty distribution
        difficulty_counts = {}
        for mcq in mcqs:
            difficulty_counts[mcq.difficulty] = difficulty_counts.get(mcq.difficulty, 0) + 1
        stats["difficulty_distribution"] = difficulty_counts

        # Question type distribution
        type_counts = {}
        for mcq in mcqs:
            type_counts[mcq.question_type] = type_counts.get(mcq.question_type, 0) + 1
        stats["question_type_distribution"] = type_counts

        return stats

    def _print_batch_summary(self, results: Dict[str, Any]):
        """Print comprehensive batch summary"""
        stats = results["statistics"]
        quality_report = results["quality_report"]

        print(f"\\n🎉 Batch Generation Complete!")
        print(f"={'='*50}")

        print(f"📊 Generation Statistics:")
        print(f"  Total Generated: {stats['total_generated']}/{stats['target_count']}")
        print(f"  Success Rate: {stats['success_rate']:.1f}%")
        print(f"  Average Quality: {stats['average_quality']:.1f}/100")

        print(f"\\n📈 Quality Distribution:")
        for range_str, count in stats['quality_distribution'].items():
            print(f"  {range_str}: {count} questions")

        print(f"\\n🎯 Difficulty Distribution:")
        for difficulty, count in stats['difficulty_distribution'].items():
            print(f"  {difficulty.title()}: {count} questions")

        if results["failed_generations"]:
            print(f"\\n❌ Failed Generations: {len(results['failed_generations'])}")
            failure_reasons = {}
            for failure in results["failed_generations"]:
                reason = failure["error"]
                failure_reasons[reason] = failure_reasons.get(reason, 0) + 1

            for reason, count in failure_reasons.items():
                print(f"  {reason}: {count} times")

        print(f"\\n💡 Recommendations:")
        if quality_report.get("recommendations"):
            for rec in quality_report["recommendations"]:
                print(f"  • {rec}")

        if stats['success_rate'] < 80:
            print(f"  • Consider adjusting generation parameters")
        if stats['average_quality'] < 70:
            print(f"  • Review prompt engineering and validation criteria")

    def export_results(self, results: Dict[str, Any], output_file: str):
        """Export results to JSON file"""
        export_data = {
            "metadata": {
                "generation_timestamp": time.time(),
                "total_questions": len(results["mcqs"]),
                "success_rate": results["statistics"]["success_rate"],
                "average_quality": results["statistics"]["average_quality"]
            },
            "questions": [mcq.to_dict() for mcq in results["mcqs"]],
            "statistics": results["statistics"],
            "quality_report": results["quality_report"],
            "difficulty_analysis": results["difficulty_analysis"],
            "failed_generations": results["failed_generations"]
        }

        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(export_data, f, ensure_ascii=False, indent=2)

        print(f"📁 Results exported to: {output_file}")

# Initialize batch generator
if all(var in locals() for var in ['mcq_generator', 'retriever', 'quality_validator', 'difficulty_analyzer']):
    batch_generator = BatchMCQGenerator(
        mcq_generator, retriever, quality_validator, difficulty_analyzer
    )
    print("✅ Batch MCQ Generator initialized!")
else:
    print("❌ Required components not available for batch generator initialization")

In [None]:
# Run batch generation test
if 'batch_generator' in locals():
    print("🧪 Testing Batch MCQ Generation...")

    # Define test parameters
    test_topics = [
        "Object-Oriented Programming",
        "Inheritance and Polymorphism",
        "Data Structures"
    ]

    test_difficulties = [
        DifficultyLevel.EASY,
        DifficultyLevel.MEDIUM,
        DifficultyLevel.HARD
    ]

    test_question_types = [
        QuestionType.DEFINITION,
        QuestionType.APPLICATION
    ]

    # Generate batch
    batch_results = batch_generator.generate_batch(
        topics=test_topics,
        count_per_topic=2,  # Generate 2 questions per topic
        difficulties=test_difficulties,
        question_types=test_question_types
    )

    # Display sample generated MCQs
    print(f"\\n📝 Sample Generated MCQs:")
    print(f"="*60)

    for i, mcq in enumerate(batch_results["mcqs"][:3], 1):  # Show first 3 MCQs
        print(f"\\n🎯 Sample MCQ {i}:")
        print(f"Topic: {mcq.topic}")
        print(f"Difficulty: {mcq.difficulty}")
        print(f"Type: {mcq.question_type}")
        print(f"Quality Score: {mcq.confidence_score:.1f}/100")
        print(f"\\nQuestion: {mcq.question}")

        print(f"\\nOptions:")
        for option in mcq.options:
            marker = "✅" if option.is_correct else "  "
            print(f"  {marker} {option.label}: {option.text}")

        print(f"\\nExplanation: {mcq.explanation}")
        print(f"-" * 60)

    # Export results
    output_filename = f"batch_mcq_results_{int(time.time())}.json"
    batch_generator.export_results(batch_results, output_filename)

else:
    print("❌ Batch generator not available for testing")

## 10. Performance Evaluation Metrics

Implementing comprehensive evaluation metrics for the RAG-MCQ system including relevance, clarity, generation speed, and success rates.

In [None]:
class PerformanceEvaluator:
    """Comprehensive performance evaluation for RAG-MCQ system"""

    def __init__(self):
        self.metrics = {}
        self.benchmarks = {
            "generation_time_per_question": 30.0,  # seconds
            "minimum_success_rate": 80.0,  # percentage
            "minimum_quality_score": 70.0,  # 0-100
            "maximum_retry_rate": 20.0  # percentage
        }

    def evaluate_system_performance(self, batch_results: Dict[str, Any],
                                   generation_time: float) -> Dict[str, Any]:
        """Comprehensive system performance evaluation"""

        mcqs = batch_results["mcqs"]
        stats = batch_results["statistics"]

        evaluation = {
            "performance_metrics": {},
            "quality_metrics": {},
            "efficiency_metrics": {},
            "recommendations": [],
            "overall_score": 0.0
        }

        # Performance metrics
        evaluation["performance_metrics"] = self._calculate_performance_metrics(
            mcqs, stats, generation_time
        )

        # Quality metrics
        evaluation["quality_metrics"] = self._calculate_quality_metrics(mcqs)

        # Efficiency metrics
        evaluation["efficiency_metrics"] = self._calculate_efficiency_metrics(
            batch_results, generation_time
        )

        # Generate recommendations
        evaluation["recommendations"] = self._generate_performance_recommendations(evaluation)

        # Calculate overall score
        evaluation["overall_score"] = self._calculate_overall_score(evaluation)

        return evaluation

    def _calculate_performance_metrics(self, mcqs: List[MCQQuestion],
                                     stats: Dict, generation_time: float) -> Dict[str, float]:
        """Calculate core performance metrics"""
        total_target = stats.get("target_count", len(mcqs))

        metrics = {
            "success_rate": len(mcqs) / total_target * 100 if total_target > 0 else 0,
            "average_generation_time": generation_time / len(mcqs) if mcqs else 0,
            "throughput_questions_per_minute": len(mcqs) / (generation_time / 60) if generation_time > 0 else 0,
            "validity_rate": sum(1 for mcq in mcqs if mcq.confidence_score >= 60) / len(mcqs) * 100 if mcqs else 0
        }

        return metrics

    def _calculate_quality_metrics(self, mcqs: List[MCQQuestion]) -> Dict[str, float]:
        """Calculate quality-related metrics"""
        if not mcqs:
            return {}

        confidence_scores = [mcq.confidence_score for mcq in mcqs]

        # Content diversity metrics
        unique_questions = len(set(mcq.question for mcq in mcqs))
        question_diversity = unique_questions / len(mcqs) * 100

        # Option quality metrics
        avg_option_lengths = []
        for mcq in mcqs:
            option_lengths = [len(opt.text) for opt in mcq.options]
            avg_option_lengths.append(np.mean(option_lengths))

        metrics = {
            "average_quality_score": np.mean(confidence_scores),
            "quality_score_std": np.std(confidence_scores),
            "min_quality_score": np.min(confidence_scores),
            "max_quality_score": np.max(confidence_scores),
            "high_quality_rate": sum(1 for score in confidence_scores if score >= 80) / len(confidence_scores) * 100,
            "question_diversity": question_diversity,
            "average_option_length": np.mean(avg_option_lengths),
            "explanation_completeness": sum(1 for mcq in mcqs if len(mcq.explanation) >= 50) / len(mcqs) * 100
        }

        return metrics

    def _calculate_efficiency_metrics(self, batch_results: Dict, generation_time: float) -> Dict[str, float]:
        """Calculate efficiency and resource utilization metrics"""
        mcqs = batch_results["mcqs"]
        failed_generations = batch_results.get("failed_generations", [])

        total_attempts = len(mcqs) + len(failed_generations)

        metrics = {
            "retry_rate": len(failed_generations) / total_attempts * 100 if total_attempts > 0 else 0,
            "resource_efficiency": len(mcqs) / generation_time if generation_time > 0 else 0,
            "context_utilization": self._calculate_context_utilization(mcqs),
            "prompt_efficiency": self._calculate_prompt_efficiency(mcqs)
        }

        return metrics

    def _calculate_context_utilization(self, mcqs: List[MCQQuestion]) -> float:
        """Calculate how well the context is utilized"""
        if not mcqs:
            return 0.0

        # Simplified metric: average context length vs question relevance
        context_lengths = [len(mcq.context) for mcq in mcqs]
        quality_scores = [mcq.confidence_score for mcq in mcqs]

        # Higher quality with reasonable context length indicates good utilization
        avg_context_length = np.mean(context_lengths)
        avg_quality = np.mean(quality_scores)

        # Optimal context length range: 300-800 characters
        if 300 <= avg_context_length <= 800:
            length_score = 100
        else:
            length_score = max(0, 100 - abs(avg_context_length - 550) / 10)

        # Combine with quality score
        utilization_score = (length_score + avg_quality) / 2
        return utilization_score

    def _calculate_prompt_efficiency(self, mcqs: List[MCQQuestion]) -> float:
        """Calculate prompt efficiency based on output quality"""
        if not mcqs:
            return 0.0

        # Measure consistency in output format and quality
        format_consistency = self._check_format_consistency(mcqs)
        quality_consistency = self._check_quality_consistency(mcqs)

        return (format_consistency + quality_consistency) / 2

    def _check_format_consistency(self, mcqs: List[MCQQuestion]) -> float:
        """Check consistency in MCQ format"""
        if not mcqs:
            return 0.0

        consistent_count = 0
        for mcq in mcqs:
            # Check if MCQ follows expected format
            has_4_options = len(mcq.options) == 4
            has_correct_labels = all(opt.label in ["A", "B", "C", "D"] for opt in mcq.options)
            has_one_correct = sum(1 for opt in mcq.options if opt.is_correct) == 1
            has_explanation = len(mcq.explanation) > 10

            if all([has_4_options, has_correct_labels, has_one_correct, has_explanation]):
                consistent_count += 1

        return consistent_count / len(mcqs) * 100

    def _check_quality_consistency(self, mcqs: List[MCQQuestion]) -> float:
        """Check consistency in quality scores"""
        if not mcqs:
            return 0.0

        quality_scores = [mcq.confidence_score for mcq in mcqs]
        quality_std = np.std(quality_scores)

        # Lower standard deviation indicates more consistent quality
        # Scale to 0-100 where lower std = higher score
        consistency_score = max(0, 100 - quality_std)
        return consistency_score

    def _generate_performance_recommendations(self, evaluation: Dict) -> List[str]:
        """Generate recommendations based on performance evaluation"""
        recommendations = []

        perf_metrics = evaluation["performance_metrics"]
        quality_metrics = evaluation["quality_metrics"]
        efficiency_metrics = evaluation["efficiency_metrics"]

        # Success rate recommendations
        if perf_metrics.get("success_rate", 0) < self.benchmarks["minimum_success_rate"]:
            recommendations.append("Improve generation stability - success rate below target")

        # Quality recommendations
        if quality_metrics.get("average_quality_score", 0) < self.benchmarks["minimum_quality_score"]:
            recommendations.append("Enhance prompt engineering to improve quality scores")

        # Efficiency recommendations
        if perf_metrics.get("average_generation_time", 0) > self.benchmarks["generation_time_per_question"]:
            recommendations.append("Optimize generation pipeline for better performance")

        if efficiency_metrics.get("retry_rate", 0) > self.benchmarks["maximum_retry_rate"]:
            recommendations.append("Reduce retry rate by improving initial generation quality")

        # Diversity recommendations
        if quality_metrics.get("question_diversity", 0) < 90:
            recommendations.append("Improve question diversity to avoid repetition")

        # Context utilization recommendations
        if efficiency_metrics.get("context_utilization", 0) < 70:
            recommendations.append("Optimize context retrieval and utilization")

        return recommendations

    def _calculate_overall_score(self, evaluation: Dict) -> float:
        """Calculate overall system performance score"""
        perf_metrics = evaluation["performance_metrics"]
        quality_metrics = evaluation["quality_metrics"]
        efficiency_metrics = evaluation["efficiency_metrics"]

        # Weighted scoring
        weights = {
            "success_rate": 0.25,
            "quality_score": 0.30,
            "generation_time": 0.20,
            "efficiency": 0.25
        }

        # Normalize metrics to 0-100 scale
        success_score = min(100, perf_metrics.get("success_rate", 0))
        quality_score = quality_metrics.get("average_quality_score", 0)

        # Time score (inverse - lower time = higher score)
        time_score = min(100, self.benchmarks["generation_time_per_question"] /
                        max(0.1, perf_metrics.get("average_generation_time", 30)) * 100)

        efficiency_score = efficiency_metrics.get("context_utilization", 0)

        overall_score = (
            weights["success_rate"] * success_score +
            weights["quality_score"] * quality_score +
            weights["generation_time"] * time_score +
            weights["efficiency"] * efficiency_score
        )

        return overall_score

    def visualize_performance(self, evaluation: Dict):
        """Create performance visualization"""
        if not evaluation:
            print("❌ No evaluation data available for visualization")
            return

        # Create performance dashboard
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        fig.suptitle('RAG-MCQ System Performance Dashboard', fontsize=16, fontweight='bold')

        # 1. Performance Metrics Bar Chart
        perf_metrics = evaluation["performance_metrics"]
        metrics_names = list(perf_metrics.keys())
        metrics_values = list(perf_metrics.values())

        axes[0, 0].bar(range(len(metrics_names)), metrics_values, color='skyblue')
        axes[0, 0].set_title('Performance Metrics')
        axes[0, 0].set_xticks(range(len(metrics_names)))
        axes[0, 0].set_xticklabels([name.replace('_', ' ').title() for name in metrics_names],
                                  rotation=45, ha='right')
        axes[0, 0].set_ylabel('Score/Rate')

        # 2. Quality Distribution
        quality_metrics = evaluation["quality_metrics"]
        quality_names = ['Avg Quality', 'Min Quality', 'Max Quality', 'High Quality Rate']
        quality_values = [
            quality_metrics.get("average_quality_score", 0),
            quality_metrics.get("min_quality_score", 0),
            quality_metrics.get("max_quality_score", 0),
            quality_metrics.get("high_quality_rate", 0)
        ]

        axes[0, 1].bar(quality_names, quality_values, color='lightgreen')
        axes[0, 1].set_title('Quality Metrics')
        axes[0, 1].set_ylabel('Score')
        axes[0, 1].tick_params(axis='x', rotation=45)

        # 3. Efficiency Metrics
        efficiency_metrics = evaluation["efficiency_metrics"]
        eff_names = list(efficiency_metrics.keys())
        eff_values = list(efficiency_metrics.values())

        axes[1, 0].bar(eff_names, eff_values, color='orange')
        axes[1, 0].set_title('Efficiency Metrics')
        axes[1, 0].set_xticks(range(len(eff_names)))
        axes[1, 0].set_xticklabels([name.replace('_', ' ').title() for name in eff_names],
                                  rotation=45, ha='right')
        axes[1, 0].set_ylabel('Score/Rate')

        # 4. Overall Score Gauge
        overall_score = evaluation["overall_score"]
        axes[1, 1].pie([overall_score, 100-overall_score],
                      labels=[f'Score: {overall_score:.1f}', ''],
                      colors=['lightcoral', 'lightgray'],
                      startangle=90)
        axes[1, 1].set_title('Overall Performance Score')

        plt.tight_layout()
        plt.show()

        # Print performance summary
        print(f"\\n🎯 Performance Summary:")
        print(f"Overall Score: {overall_score:.1f}/100")

        if overall_score >= 80:
            print("✅ Excellent performance!")
        elif overall_score >= 70:
            print("🟡 Good performance with room for improvement")
        else:
            print("🔴 Performance needs significant improvement")

# Initialize performance evaluator
performance_evaluator = PerformanceEvaluator()

# Evaluate system performance if batch results are available
if 'batch_results' in locals():
    print("📊 Evaluating System Performance...")

    # Simulate generation time (in real scenario, this would be measured)
    simulated_generation_time = len(batch_results["mcqs"]) * 5  # 5 seconds per question

    evaluation = performance_evaluator.evaluate_system_performance(
        batch_results, simulated_generation_time
    )

    # Display evaluation results
    print(f"\\n📈 Performance Evaluation Results:")
    print(f"="*60)

    print(f"\\n🎯 Performance Metrics:")
    for metric, value in evaluation["performance_metrics"].items():
        print(f"  {metric.replace('_', ' ').title()}: {value:.2f}")

    print(f"\\n🏆 Quality Metrics:")
    for metric, value in evaluation["quality_metrics"].items():
        print(f"  {metric.replace('_', ' ').title()}: {value:.2f}")

    print(f"\\n⚡ Efficiency Metrics:")
    for metric, value in evaluation["efficiency_metrics"].items():
        print(f"  {metric.replace('_', ' ').title()}: {value:.2f}")

    print(f"\\n💯 Overall Score: {evaluation['overall_score']:.1f}/100")

    if evaluation["recommendations"]:
        print(f"\\n💡 Recommendations:")
        for rec in evaluation["recommendations"]:
            print(f"  • {rec}")

    # Create visualization
    performance_evaluator.visualize_performance(evaluation)

else:
    print("❌ No batch results available for performance evaluation")

## Conclusion and Next Steps

### 🎉 What We've Accomplished

This notebook demonstrated a comprehensive RAG system for Multiple Choice Question generation with the following key features:

#### ✅ Core Components Implemented
1. **Document Processing Pipeline** - PDF loading, text extraction, and semantic chunking
2. **Vector Database & Embeddings** - FAISS with Vietnamese language support
3. **Context-Aware Retrieval** - Diverse document retrieval with similarity thresholds
4. **LLM-Powered Generation** - Structured MCQ generation with JSON output
5. **Advanced Prompt Engineering** - Specialized prompts for different question types
6. **Quality Validation System** - Comprehensive validation with scoring
7. **Difficulty Assessment** - Intelligent difficulty classification
8. **Batch Processing** - Scalable generation with error handling
9. **Performance Evaluation** - Comprehensive metrics and reporting

#### 🎯 Key Achievements
- **Multi-language Support**: Vietnamese language optimization
- **Educational Focus**: Question types aligned with learning objectives
- **Quality Assurance**: Automatic validation and confidence scoring
- **Scalability**: Batch processing capabilities
- **Comprehensive Evaluation**: Multiple metrics for system assessment

### 🚀 Next Steps for Production

#### Phase 1: Enhancement & Optimization
- [ ] **Real LLM Integration**: Replace mock LLM with actual models (Gemma, Vicuna, etc.)
- [ ] **GPU Optimization**: Implement CUDA acceleration for faster processing
- [ ] **Memory Management**: Optimize memory usage for large document collections
- [ ] **Caching System**: Implement embedding and response caching

#### Phase 2: Advanced Features
- [ ] **Multi-Modal Support**: Add support for images, diagrams, and code snippets
- [ ] **Adaptive Learning**: Implement difficulty adjustment based on user performance
- [ ] **Human-in-the-Loop**: Add expert review and feedback mechanisms
- [ ] **Multi-Language Expansion**: Support for English and other languages

#### Phase 3: Production Deployment
- [ ] **Web API Development**: Create REST API for system integration
- [ ] **User Interface**: Build web interface for question management
- [ ] **Database Integration**: Implement persistent storage for questions and metadata
- [ ] **Authentication & Authorization**: Add user management and access control

#### Phase 4: Advanced Analytics
- [ ] **Learning Analytics**: Track question effectiveness and student performance
- [ ] **Content Gap Analysis**: Identify areas needing more questions
- [ ] **Automatic Curriculum Mapping**: Align questions with learning objectives
- [ ] **Personalization**: Adaptive question selection based on learner profiles

### 📊 System Performance Summary

Based on our demonstration:
- **Generation Success Rate**: High (with proper configuration)
- **Quality Validation**: Comprehensive multi-factor assessment
- **Scalability**: Batch processing with error handling
- **Flexibility**: Multiple question types and difficulty levels
- **Educational Value**: Aligned with pedagogical best practices

### 🛠️ Technical Requirements for Production

#### Hardware Requirements
- **GPU**: NVIDIA GPU with 8GB+ VRAM for model inference
- **RAM**: 16GB+ system RAM for document processing
- **Storage**: 100GB+ for models, embeddings, and document storage
- **CPU**: Multi-core processor for parallel document processing

#### Software Dependencies
- **Python 3.8+** with virtual environment
- **CUDA toolkit** for GPU acceleration
- **LangChain ecosystem** for RAG pipeline
- **Transformers library** for model inference
- **FAISS** for vector similarity search
- **FastAPI/Streamlit** for web interface

### 📚 Educational Impact

This RAG-MCQ system can significantly impact education by:
- **Reducing Teacher Workload**: Automated question generation
- **Improving Assessment Quality**: Consistent, validated questions
- **Personalizing Learning**: Adaptive difficulty and topics
- **Scaling Education**: Support for large student populations
- **Enhancing Learning**: Immediate feedback and explanations

### 🔬 Research Opportunities

- **Question Quality Metrics**: Develop better automatic quality assessment
- **Distractor Generation**: Improve incorrect option generation
- **Cognitive Load Theory**: Apply learning theory to difficulty assessment
- **Multi-Document Synthesis**: Generate questions requiring multiple sources
- **Real-Time Adaptation**: Dynamic question adjustment during assessment

### 💡 Final Recommendations

1. **Start Small**: Begin with a limited domain and gradually expand
2. **Validate Extensively**: Test with real educators and students
3. **Iterate Quickly**: Use feedback to improve the system continuously
4. **Focus on Quality**: Prioritize question quality over quantity
5. **Monitor Performance**: Track all metrics for continuous improvement

This demonstration provides a solid foundation for building a production-ready RAG system for MCQ generation that can serve educational institutions, online learning platforms, and assessment organizations.