# 📚 Original Document Processing Pipeline (No Additional Chunking)

## 🎯 Project Overview
This pipeline processes the **original 568 manually created document chunks** without performing any additional automatic chunking. The goal is to preserve the semantic integrity and contextual boundaries that were established during manual preprocessing.

## 🔄 Processing Philosophy: "Respect the Original Chunking"

### 📋 Background Context
- **Initial Dataset**: 4 comprehensive academic/reference materials
  - 📖 NLP Course Material
  - ☁️ Cloud Computing (ICC) Course Material  
  - 🇵🇰 Pakistan Studies Course Material
  - 📜 "The Sealed Nectar" - Award-winning biography of Prophet Muhammad
- **Manual Preprocessing**: Each lecture/chapter (2000-5000 words) was manually divided into ~500-word contextually coherent segments
- **Result**: 568 semantically meaningful document chunks with preserved context boundaries

### 🚫 What We're NOT Doing
- ❌ **No Additional Chunking**: Avoiding further fragmentation that might break semantic flow
- ❌ **No Token-Based Splitting**: Not using arbitrary token limits to split documents
- ❌ **No Automatic Boundary Detection**: Trusting human judgment over algorithmic splitting
- ❌ **No Size Normalization**: Preserving natural document length variations

### ✅ What We ARE Doing
- ✅ **Metadata Enhancement**: Enriching each document with comprehensive metadata
- ✅ **Keyword Extraction**: Keywords drawn from filenames and from content (capitalized phrases, acronyms, quoted terms, list items)
- ✅ **Category Optimization**: Domain-specific processing weights (Technical vs Narrative)
- ✅ **Quality Analysis**: Document quality metrics and content type classification
- ✅ **Preservation Focus**: Maintaining original chunk boundaries and context

## 🎯 Core Hypothesis
> **"Manual chunking with domain expertise produces better semantic units than algorithmic splitting"**

We hypothesize that the original 568 manually created chunks will provide:
- Better contextual coherence
- Improved retrieval accuracy  
- More meaningful document boundaries
- Enhanced user experience in RAG applications

## 📊 Processing Methodology

### 🔍 Document Analysis Pipeline
1. **File Ingestion**: Read original 568 .txt files as individual units
2. **Content Cleaning**: Normalize whitespace and formatting without altering structure
3. **Metadata Extraction**: 
   - Filename-based keywords
   - Content-based technical terms, acronyms, definitions
   - Document statistics (tokens, words, sentences, paragraphs)
4. **Category-Specific Weighting**:
   - **ICC/NLP**: Higher technical weight (1.5-1.8) for CS concepts
   - **Pakistan Studies/Sealed Nectar**: Narrative weight (1.0-1.2) for historical content
5. **Quality Metrics**: A simplified sentence-length readability score, content type classification, and words-per-token density
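
As a rough illustration of steps 3 and 5, the sketch below computes the document statistics and the readability score with the standard library only. It assumes a regex sentence splitter in place of NLTK and leaves out token counting, which needs the model tokenizer. `document_stats` is a hypothetical helper name, not part of the pipeline.

```python
import re

def document_stats(text: str) -> dict:
    """Word, sentence, and paragraph counts plus a simplified readability score."""
    words = text.split()
    # Regex stand-in for sentence splitting; empty trailing pieces are dropped.
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    avg_len = len(words) / len(sentences) if sentences else 0.0
    # Only the sentence-length term of Flesch Reading Ease is used
    # (no syllable term), clamped to the 0-100 range.
    readability = min(100.0, max(0.0, 206.835 - 1.015 * avg_len)) if avg_len else 0.0
    return {
        'word_count': len(words),
        'sentence_count': len(sentences),
        'paragraph_count': len(paragraphs),
        'avg_sentence_length': avg_len,
        'readability_score': readability,
    }

sample = "Cloud computing is elastic. It scales on demand.\n\nContainers share a kernel."
stats = document_stats(sample)
print(stats)
```

Because the syllable term is left out, the score stays at 100 until the average sentence length reaches roughly 105 words, so it separates documents far less than a full Flesch score would.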

### 📁 Output Structure
```
processed_documents_568/
├── processed_documents.json          # All 568 documents with metadata
├── documents_icc.json                # Cloud Computing documents
├── documents_nlp.json                # NLP course documents  
├── documents_pakistan_studies.json   # Pakistan Studies documents
├── documents_sealed_nectar.json      # Biography documents
├── processing_statistics.json        # Comprehensive analytics
└── summary_report.txt               # Human-readable summary
```
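
A minimal sketch for checking the output, assuming the layout above and the `category` field that each document record carries. `count_by_category` is a hypothetical helper name.

```python
import json
import os
from collections import Counter

def count_by_category(output_dir: str = "processed_documents_568") -> Counter:
    """Count the documents per category in processed_documents.json."""
    path = os.path.join(output_dir, "processed_documents.json")
    with open(path, "r", encoding="utf-8") as f:
        documents = json.load(f)
    return Counter(doc["category"] for doc in documents)

# Only runs when the pipeline output is present in the working directory.
if os.path.exists(os.path.join("processed_documents_568", "processed_documents.json")):
    print(count_by_category())
```

The per-category totals should add up to the number of records in `processed_documents.json` and match the sizes of the four `documents_*.json` files.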

## 🧪 Experimental Design
This processing approach enables direct comparison between:
- **Original 568 Chunks** (this pipeline) vs **704 Auto-Generated Chunks** (previous pipeline)
- Manual semantic boundaries vs algorithmic splitting
- Domain expertise vs automated processing
- Context preservation vs token optimization

## 🎯 Success Metrics
- **Retrieval Quality**: Semantic relevance of retrieved documents
- **Context Coherence**: Completeness of information in individual chunks  
- **User Experience**: Readability and usefulness of retrieved content
- **System Performance**: Processing efficiency and response accuracy
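
One way to put a number on retrieval quality is a hit rate at k. The sketch below assumes a small hand-labeled query set that maps each query to the source file expected to answer it. No such set is part of this notebook, and the query IDs and file names are made up for illustration.

```python
from typing import Dict, List

def hit_rate_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, str], k: int = 3) -> float:
    """Fraction of queries whose expected source file appears in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(
        1 for query, target in relevant.items()
        if target in retrieved.get(query, [])[:k]
    )
    return hits / len(relevant)

# Hypothetical labels and retrieval results, for illustration only.
relevant = {"q1": "icc04.txt", "q2": "nlp10.txt"}
retrieved = {"q1": ["icc04.txt", "icc02.txt"], "q2": ["nlp03.txt", "nlp07.txt"]}
rate = hit_rate_at_k(retrieved, relevant, k=3)
print(rate)
```

Running the same labeled queries against both the 568-chunk and the 704-chunk stores would give one comparable number per pipeline, provided hits are judged against a unit both share, such as the source lecture.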

---

## 🚀 Ready for RAG Pipeline Integration
The output keeps the **semantic integrity** of your original preprocessing and adds **enhanced metadata** (keywords, document statistics, content type) for use at retrieval time. Each document keeps its intended contextual boundaries.

In [2]:
import os
import re
import json
import hashlib
from typing import List, Dict, Tuple, Optional
from transformers import AutoTokenizer
from pathlib import Path
import nltk
from collections import Counter
import numpy as np

# Download required NLTK data (run once)
for resource in ('punkt', 'punkt_tab'):  # newer NLTK releases look for 'punkt_tab'
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource)

class DocumentProcessor:
    def __init__(self, 
                 model_name: str = "intfloat/e5-base"):
        """
        Document Processor that treats each file as a single unit without additional chunking
        
        Args:
            model_name: Embedding model for tokenization
        """
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.document_counter = 0
        
        # Category-specific settings for better domain handling
        self.category_configs = {
            'ICC': {'technical_weight': 1.5, 'context_boost': True},
            'NLP': {'technical_weight': 1.8, 'context_boost': True},
            'Pakistan Studies': {'narrative_weight': 1.2, 'context_boost': False},
            'Sealed Nectar': {'narrative_weight': 1.0, 'context_boost': False}
        }
        
    def estimate_tokens(self, text: str) -> int:
        """Accurate token estimation"""
        return len(self.tokenizer.tokenize(text))
    
    def extract_enhanced_keywords(self, filename: str, text: str = "") -> Dict[str, object]:
        """Enhanced keyword extraction with context awareness"""
        name_without_ext = os.path.splitext(filename)[0]  # strip only the trailing extension
        separators = ['_', '-', ' ', '.', '(', ')', '[', ']']
        filename_keywords = [name_without_ext]
        
        for sep in separators:
            new_keywords = []
            for keyword in filename_keywords:
                new_keywords.extend(keyword.split(sep))
            filename_keywords = new_keywords
        
        filename_keywords = [
            re.sub(r'\s+', ' ', kw.strip().lower()) 
            for kw in filename_keywords 
            if len(kw.strip()) > 2 and kw.strip().lower() not in {'txt', 'file', 'doc', 'the', 'and', 'for', 'with'}
        ]
        
        content_keywords = []
        technical_terms_from_content = [] # Explicitly define to avoid confusion with 'technical_terms' in embedding code
        if text:
            capitalized_phrases = re.findall(r'\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b', text)
            acronyms = re.findall(r'\b[A-Z]{2,}\b', text)
            quoted_terms = re.findall(r'["\']([^"\']{3,30})["\']', text)
            numbered_points = re.findall(r'(?:^|\n)\s*(?:\d+\.|\•|\-)\s*([A-Z][^.\n]{10,50})', text, re.MULTILINE)
            
            # Combine all potential content keywords
            # 'technical_terms_from_content' will be used for 'has_technical_terms' flag later
            technical_terms_from_content = capitalized_phrases + acronyms 
            content_keywords = technical_terms_from_content + quoted_terms + numbered_points
            content_keywords = [kw.lower().strip() for kw in content_keywords if len(kw.strip()) > 2]
            technical_terms_from_content = [kw.lower().strip() for kw in technical_terms_from_content if len(kw.strip()) > 2]

        return {
            'filename_keywords': list(set(filename_keywords)),
            'content_keywords': list(set(content_keywords)), # All derived keywords from content
            'technical_terms': list(set(technical_terms_from_content)), # Specifically technical terms for embedding prefix
            'all_keywords': list(set(filename_keywords + content_keywords)),
            'keyword_string': ' '.join(set(filename_keywords + content_keywords))
        }
    
    def clean_text(self, text: str) -> str:
        """Clean and normalize text content"""
        text = re.sub(r'\n\s*\n\s*\n+', '\n\n', text)
        text = re.sub(r'[ \t]+', ' ', text)
        text = text.strip()
        return text
    
    def create_document_metadata(self, text_content: str, source_info: Dict, 
                               keyword_data: Dict, category_config: Dict) -> Dict: # Renamed 'text' to 'text_content' for clarity
        """Create comprehensive document metadata for better retrieval"""
        
        doc_hash = hashlib.md5(text_content.encode()).hexdigest()[:12]
        
        # The 'enhanced_text' was for a previous embedding strategy, might not be directly used
        # by the new Original568EmbeddingGenerator's prepare_text_for_embedding if metadata is used as prefix.
        # Keeping it for now as it doesn't harm.
        enhanced_text_for_embedding = text_content 
        if category_config.get('context_boost'):
            enhanced_text_for_embedding = f"[{source_info['category']}] {text_content}"
        
        word_count = len(text_content.split())
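        try:
            sentence_count = len(nltk.sent_tokenize(text_content))
        except Exception:
            # Fall back to a regex split if the NLTK tokenizer data is unavailable
            sentence_count = len([s for s in re.split(r'[.!?]+', text_content) if s.strip()])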
        
        token_count = self.estimate_tokens(text_content)
        paragraphs = len([p for p in text_content.split('\n\n') if p.strip()])
        avg_words_per_sentence = word_count / sentence_count if sentence_count > 0 else 0
        
        # --- MODIFICATION 1: Define boolean flags ---
        has_technical_terms_flag = len(keyword_data['technical_terms']) > 5 # Use the specific technical_terms list
        is_structured_flag = bool(re.search(r'(?:^|\n)\s*(?:\d+\.|\•|\-)', text_content, re.MULTILINE))
        has_definitions_flag = bool(re.search(r'["\']([^"\']{10,50})["\']', text_content))

        # --- MODIFICATION 2: Create content_type string ---
        doc_content_types_list = []
        if has_technical_terms_flag: doc_content_types_list.append("Technical")
        if is_structured_flag: doc_content_types_list.append("Structured")
        if has_definitions_flag: doc_content_types_list.append("Definitions")
        content_type_str = ", ".join(doc_content_types_list) if doc_content_types_list else "General"

        document_data = {
            'document_id': self.document_counter,
            'document_hash': doc_hash,
            
            # --- MODIFICATION 3: Main text content now under 'content' key ---
            'content': text_content, 
            'enhanced_text': enhanced_text_for_embedding, 
            
            'category': source_info['category'],
            'source_file': source_info['file_path'],
            'file_name': source_info['file_name'],
            
            'filename_keywords': keyword_data['filename_keywords'],
            'content_keywords': keyword_data['content_keywords'], # General content keywords
            'technical_terms': keyword_data['technical_terms'], # Specific technical terms for embedding prefix
            'all_keywords': keyword_data['all_keywords'],
            'keyword_string': keyword_data['keyword_string'],
            
            'token_count': token_count,
            'word_count': word_count,
            'sentence_count': sentence_count,
            'paragraph_count': paragraphs,
            
            'technical_weight': category_config.get('technical_weight', 1.0),
            'narrative_weight': category_config.get('narrative_weight', 1.0),
            
            'document_density': word_count / token_count if token_count > 0 else 0,
            'avg_sentence_length': avg_words_per_sentence,
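            # Simplified score: only the sentence-length term of Flesch Reading Ease is used
            # (the syllables-per-word term is omitted), so the clamp returns 100 whenever the
            # average sentence length is below roughly 105 words.
            'readability_score': min(100, max(0, 206.835 - 1.015 * avg_words_per_sentence)) if avg_words_per_sentence > 0 else 0,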
            
            # Store the boolean flags as before (can be useful for other things)
            'has_technical_terms': has_technical_terms_flag,
            'is_structured': is_structured_flag,
            'has_definitions': has_definitions_flag,

            # --- MODIFICATION 4: Add the new 'content_type' string field ---
            'content_type': content_type_str
        }
        
        self.document_counter += 1
        return document_data
    
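    def process_document_folders(self, base_path: str) -> Tuple[List[Dict], Dict]:
        """Process every .txt file in the known category folders under base_path.

        Returns the list of document records and a dictionary of processing statistics.
        """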
        folder_categories = {
            'icc_text_files': 'ICC',
            'NLP_text_files': 'NLP',
            'pakSt_text_files': 'Pakistan Studies',
            'Sealed_nectar_text_files': 'Sealed Nectar'
        }
        
        all_documents = []
        detailed_stats = {
            'processing_stats': {}, 'keyword_stats': {}, 'quality_metrics': {},
            'document_size_distribution': [], 'content_analysis': {}
        }
        
        print("🚀 Document Processing Started (No Additional Chunking - v2 with 'content' & 'content_type')")
        print("=" * 70)
        
        for folder_name, category in folder_categories.items():
            folder_path = os.path.join(base_path, folder_name)
            if not os.path.exists(folder_path):
                print(f"⚠️  Folder not found: {folder_name}")
                continue
            
            print(f"📁 Processing: {category} ({folder_name})")
            folder_documents, folder_stats = [], {'files_processed': 0, 'empty_files': 0, 'error_files': 0}
            folder_keywords, document_sizes = set(), []
            content_types_summary = {'technical': 0, 'structured': 0, 'with_definitions': 0} # based on boolean flags
            
            for file_name in sorted(os.listdir(folder_path)):  # sorted so document IDs are reproducible
                if file_name.endswith('.txt'):
                    file_path = os.path.join(folder_path, file_name)
                    try:
                        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                            raw_content = f.read() # Changed variable name
                        
                        if not raw_content.strip():
                            folder_stats['empty_files'] += 1
                            continue
                        
                        cleaned_content = self.clean_text(raw_content) # Use cleaned content
                        
                        source_info = {'category': category, 'file_path': file_path, 'file_name': file_name}
                        keyword_data = self.extract_enhanced_keywords(file_name, cleaned_content) # Pass cleaned_content
                        category_config = self.category_configs.get(category, {})
                        
                        document_data = self.create_document_metadata(
                            cleaned_content, source_info, keyword_data, category_config
                        )
                        
                        folder_documents.append(document_data)
                        folder_stats['files_processed'] += 1
                        
                        folder_keywords.update(document_data['all_keywords'])
                        document_sizes.append(document_data['token_count'])
                        
                        if document_data['has_technical_terms']: content_types_summary['technical'] += 1
                        if document_data['is_structured']: content_types_summary['structured'] += 1
                        if document_data['has_definitions']: content_types_summary['with_definitions'] += 1
                        
                        sample_keywords = ', '.join(list(document_data['all_keywords'])[:3]) # Ensure it's a list
                        print(f"   ✅ {file_name}: {document_data['token_count']} tokens, {document_data['word_count']} words [{sample_keywords}]")
                        
                    except Exception as e:
                        folder_stats['error_files'] += 1
                        print(f"   ❌ Error: {file_name} - {str(e)[:50]}...")
            
            detailed_stats['processing_stats'][category] = folder_stats
            detailed_stats['processing_stats'][category]['documents_processed'] = len(folder_documents)
            detailed_stats['processing_stats'][category]['avg_tokens_per_document'] = np.mean(document_sizes) if document_sizes else 0
            detailed_stats['processing_stats'][category]['token_std'] = np.std(document_sizes) if document_sizes else 0
            
            detailed_stats['keyword_stats'][category] = {
                'unique_keywords': len(folder_keywords),
                'sample_keywords': list(folder_keywords)[:10],
                'avg_keywords_per_document': np.mean([len(d['all_keywords']) for d in folder_documents]) if folder_documents else 0
            }
            
            detailed_stats['content_analysis'][category] = content_types_summary
            detailed_stats['document_size_distribution'].extend(document_sizes)
            all_documents.extend(folder_documents)
            
            print(f"   📊 Summary: {folder_stats['files_processed']} files processed")
            print(f"   🏷️  Keywords: {len(folder_keywords)} unique, avg {detailed_stats['keyword_stats'][category]['avg_keywords_per_document']:.1f} per document")
            print(f"   📝 Content Summary (based on flags): {content_types_summary['technical']} technical, {content_types_summary['structured']} structured, {content_types_summary['with_definitions']} with definitions")
            print()
        
        detailed_stats['quality_metrics'] = self.calculate_quality_metrics(all_documents)
        self.print_enhanced_summary(detailed_stats, all_documents)
        return all_documents, detailed_stats
    
    def calculate_quality_metrics(self, documents: List[Dict]) -> Dict:
        if not documents: return {}
        token_counts = [d['token_count'] for d in documents]
        word_counts = [d['word_count'] for d in documents]
        readability_scores = [d['readability_score'] for d in documents]
        
        return {
            'total_documents': len(documents),
            'token_distribution': {'mean': np.mean(token_counts), 'std': np.std(token_counts), 'min': np.min(token_counts), 'max': np.max(token_counts), 'median': np.median(token_counts)},
            'word_distribution': {'mean': np.mean(word_counts), 'std': np.std(word_counts), 'min': np.min(word_counts), 'max': np.max(word_counts), 'median': np.median(word_counts)},
            'content_quality': {
                'avg_readability': np.mean(readability_scores),
                'documents_with_technical_terms': sum(1 for d in documents if d['has_technical_terms']),
                'structured_documents': sum(1 for d in documents if d['is_structured']),
                'documents_with_definitions': sum(1 for d in documents if d['has_definitions'])
            },
            'keyword_coverage': {
                'avg_keywords_per_document': np.mean([len(d['all_keywords']) for d in documents]),
                'documents_with_content_keywords': sum(1 for d in documents if len(d['content_keywords']) > 0)
            }
        }
    
    def print_enhanced_summary(self, stats: Dict, all_documents: List[Dict]):
        print("=" * 70)
        print("📊 DOCUMENT PROCESSING SUMMARY (NO ADDITIONAL CHUNKING - v2)")
        print("=" * 70)
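        if not all_documents:
            # quality_metrics is empty in this case, so the lookups below would fail
            print("⚠️  No documents were processed. Check that the input folders exist.")
            return
        processing_stats, quality_metrics = stats['processing_stats'], stats['quality_metrics']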
        total_files, total_documents = sum(s['files_processed'] for s in processing_stats.values()), len(all_documents)
        avg_tokens, avg_words = quality_metrics['token_distribution']['mean'], quality_metrics['word_distribution']['mean']
        
        print(f"📈 Overall Results:")
        print(f"   Files processed: {total_files}")
        print(f"   Documents created: {total_documents}")
        print(f"   Average tokens per document: {avg_tokens:.1f} ± {quality_metrics['token_distribution']['std']:.1f}")
        print(f"   Average words per document: {avg_words:.1f} ± {quality_metrics['word_distribution']['std']:.1f}")
        print(f"   Token range: {quality_metrics['token_distribution']['min']:.0f} - {quality_metrics['token_distribution']['max']:.0f}")
        
        # Count documents by new 'content_type' string
        content_type_counts = Counter(doc.get('content_type', 'Unknown') for doc in all_documents)
        print(f"\n⭐ Content Types (derived string):")
        for c_type, count in content_type_counts.items():
            print(f"   - {c_type}: {count} documents")
        print()
        
        print(f"📂 By Category:")
        print("-" * 50)
        for category, stat in processing_stats.items():
            kw_stat = stats['keyword_stats'][category]
            content_stat = stats['content_analysis'][category] # This uses boolean flags for summary
            print(f"{category:<20} {stat['files_processed']:>3} files → {stat['documents_processed']:>4} documents")
            print(f"{'':>20} avg: {stat['avg_tokens_per_document']:>5.0f} tokens, {kw_stat['unique_keywords']:>3} keywords")
            print(f"{'':>20} content (flags): {content_stat['technical']} tech, {content_stat['structured']} structured")
        print()
        
        print(f"⭐ Content Quality (flags):")
        print(f"   Average readability score: {quality_metrics['content_quality']['avg_readability']:.1f}")
        print(f"   Technical documents (flag): {quality_metrics['content_quality']['documents_with_technical_terms']}/{total_documents}")
        print(f"   Structured documents (flag): {quality_metrics['content_quality']['structured_documents']}/{total_documents}")
        print(f"   Documents with definitions (flag): {quality_metrics['content_quality']['documents_with_definitions']}/{total_documents}")
        print(f"   Documents with content keywords: {quality_metrics['keyword_coverage']['documents_with_content_keywords']}/{total_documents}")
        print()
        print("✅ Document processing completed!")
        print("🔥 Ready for RAG pipeline with original 568 files as:")
        print("   • Individual document units (main text now in 'content' field)")
        print("   • Enhanced metadata extraction (includes 'content_type' string)")
        print("   • Category-specific optimization")
        print("   • Comprehensive keyword analysis")
    
    def save_documents(self, documents: List[Dict], stats: Dict, output_dir: str = "processed_documents_568"):
        os.makedirs(output_dir, exist_ok=True)
        documents_file = os.path.join(output_dir, "processed_documents.json")
        with open(documents_file, 'w', encoding='utf-8') as f:
            json.dump(documents, f, indent=2, ensure_ascii=False)
        
        stats_file = os.path.join(output_dir, "processing_statistics.json")
        with open(stats_file, 'w', encoding='utf-8') as f:
            def convert_numpy(obj):
                """Convert NumPy scalars and arrays to plain Python types for JSON."""
                if isinstance(obj, np.integer): return int(obj)
                if isinstance(obj, np.floating): return float(obj)
                if isinstance(obj, np.ndarray): return obj.tolist()
                raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
            json.dump(stats, f, indent=2, ensure_ascii=False, default=convert_numpy)
        
        # Group documents by category in memory, then write each category file once.
        # Appending to files already on disk would duplicate documents on every re-run.
        documents_by_category = {}
        for document in documents:
            category = document['category'].replace(' ', '_').lower()
            documents_by_category.setdefault(category, []).append(document)
        category_counts = {}
        for category, category_documents in documents_by_category.items():
            category_file = os.path.join(output_dir, f"documents_{category}.json")
            with open(category_file, 'w', encoding='utf-8') as f:
                json.dump(category_documents, f, indent=2, ensure_ascii=False)
            category_counts[category] = len(category_documents)
            
        summary_file = os.path.join(output_dir, "summary_report.txt")
        with open(summary_file, 'w', encoding='utf-8') as f:
            f.write("DOCUMENT PROCESSING SUMMARY (v2 - 'content' & 'content_type')\n")
            f.write("=" * 50 + "\n\n")
            f.write(f"Total documents processed: {len(documents)}\n")
            f.write("Processing approach: No additional chunking (original files as units)\n\n")
            f.write("Documents by category:\n")
            for category, count in category_counts.items():
                f.write(f"  - {category.replace('_', ' ').title()}: {count} documents\n")
            f.write(f"\nAverage document size: {stats['quality_metrics']['token_distribution']['mean']:.0f} tokens\n")
            f.write(f"Size range: {stats['quality_metrics']['token_distribution']['min']:.0f} - {stats['quality_metrics']['token_distribution']['max']:.0f} tokens\n")
            content_type_counts_report = Counter(doc.get('content_type', 'Unknown') for doc in documents)
            f.write("\nContent Types (derived string):\n")
            for c_type, count in content_type_counts_report.items():
                 f.write(f"  - {c_type}: {count} documents\n")
        
        print(f"💾 Documents saved to: {output_dir}/")
        print(f"   • All documents: processed_documents.json ({len(documents)} documents)")
        print(f"   • Statistics: processing_statistics.json") 
        print(f"   • By category: documents_[category].json")
        print(f"   • Summary: summary_report.txt")

def main():
    processor = DocumentProcessor(model_name="intfloat/e5-base")
    documents_path = "documents" # Ensure this path points to your source text files
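    all_documents, stats = processor.process_document_folders(documents_path)
    if not all_documents:
        # Nothing to save or preview (for example, when documents_path is wrong)
        return all_documents, stats
    processor.save_documents(all_documents, stats)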
    
    print("\n" + "=" * 70)
    print("📝 SAMPLE PROCESSED DOCUMENTS (v2)")
    print("=" * 70)
    categories_shown = set()
    for doc in all_documents:  # Show the first document from each category
        if doc['category'] in categories_shown:
            continue
        print(f"\n🏷️  Category: {doc['category']}")
        print(f"📄 File: {doc['file_name']}")
        print(f"🔤 Size: {doc['token_count']} tokens | {doc['word_count']} words | {doc['sentence_count']} sentences")
        print(f"🆕 Content Type: {doc['content_type']}") # Display new field
        print(f"🏷️  Keywords: {', '.join(list(doc['all_keywords'])[:5])}")
        print(f"⚖️  Weights: Tech={doc['technical_weight']}, Narrative={doc['narrative_weight']}")
        print(f"📊 Quality: Readability={doc['readability_score']:.1f}, Density={doc['document_density']:.2f}")
        # Show boolean flags for comparison
        print(f"🔍 Flags: Technical={doc['has_technical_terms']}, Structured={doc['is_structured']}, Definitions={doc['has_definitions']}")
        print(f"📜 Content preview: {doc['content'][:200]}...") # Preview from 'content' field
        print("-" * 50)
        categories_shown.add(doc['category'])
        if len(categories_shown) >= len(processor.category_configs):  # one sample per category
            break
    return all_documents, stats

if __name__ == "__main__":
    documents, statistics = main()
    print(f"\n🎉 Document processing complete! (v2)")
    print(f"Processed {len(documents)} documents from original files.")
    print("Output 'processed_documents.json' now uses 'content' for main text and includes 'content_type' string.")
    print("Ready for embedding generation and vector database indexing!")

Token indices sequence length is longer than the specified maximum sequence length for this model (676 > 512). Running this sequence through the model will result in indexing errors


🚀 Document Processing Started (No Additional Chunking - v2 with 'content' & 'content_type')
📁 Processing: ICC (icc_text_files)
   ✅ icc01_API_Evolution_Data_Formats_&_The_Emergence_of_Standards_SOAP_&_REST.txt: 676 tokens, 489 words [issue, acid, wsdl]
   ✅ icc02_Advantages_r_Disadvantages_of_Private_Cloud_&_Hybrid_Cloud_Introduction.txt: 563 tokens, 445 words [because, compared, public]
   ✅ icc03_Advantages_r_Disadvantages_of_Public_Cloud_&_Private_Cloud_Introduction.txt: 535 tokens, 429 words [because, public, minimal investment]
   ✅ icc04_Alternatives_to_VMs_Containers_Introduction.txt: 552 tokens, 441 words [to resource, portability, because]
   ✅ icc05_Basic_Security_Terms_Continued_&_Threat_Agents.txt: 512 tokens, 422 words [the probability of a threat successfully occurring, continued, authentication]
   ✅ icc06_Benefits_and_Steps_of_Cloud_Threat_Modeling.txt: 564 tokens, 396 words [clearly, define mitigation strategies:, privilege]
   ✅ icc07_Big_Data_Definition_Sources_Examp
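
The tokenizer warning in this output reports a 676-token document against a 512-token model limit, which suggests that longer documents may be truncated at embedding time. A minimal sketch for listing the affected documents, assuming the `file_name` and `token_count` fields written by the processor and the 512-token limit quoted in the warning. `over_limit` is a hypothetical helper name.

```python
from typing import Dict, List

def over_limit(documents: List[Dict], max_tokens: int = 512) -> List[str]:
    """Return the file names of documents whose token_count exceeds max_tokens."""
    return [d["file_name"] for d in documents if d["token_count"] > max_tokens]

# Token counts copied from the first lines of the log; file names shortened.
docs = [
    {"file_name": "icc01", "token_count": 676},
    {"file_name": "icc03", "token_count": 535},
    {"file_name": "icc05", "token_count": 512},
]
flagged = over_limit(docs)
print(flagged)
```

Special tokens added at embedding time may reduce the usable budget slightly, so a document sitting exactly at the limit is worth checking as well.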

# Embeddings were generated on Google Colab with the embedder "intfloat/e5-base"

### AND

# A FAISS (Facebook AI Similarity Search) vector store was created and indexed over the embedded dataset
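
The Colab code for this step is not included here. As a library-free sketch of what a similarity lookup over the stored vectors amounts to, the example below runs a brute-force cosine search. It assumes cosine similarity over L2-normalized vectors; the index type and metric actually used are not shown in this notebook, and the 3-dimensional toy vectors stand in for real embeddings.

```python
import math
from typing import List, Tuple

def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query: List[float], vectors: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) pairs for the k most similar vectors."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Toy document vectors, for illustration only.
doc_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
results = top_k([1.0, 0.1, 0.0], doc_vectors, k=2)
print(results)
```

With only 568 vectors an exhaustive search like this is cheap, so an exact (flat) index would be a reasonable choice over an approximate one.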


---


# Load the Models Once in Memory
## This is done so you don't have to reload them every time.
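
A minimal sketch of the load-once idea using `functools.lru_cache`. The notebook itself keeps the models on a retriever object; this is only an alternative pattern. `get_model` is a hypothetical stand-in loader that returns a placeholder dictionary instead of a real model.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs

@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Stand-in loader: a real one would construct the embedding model or LLM here."""
    global load_calls
    load_calls += 1
    return {"name": name}

first = get_model("intfloat/e5-base")
second = get_model("intfloat/e5-base")  # served from the cache, no second load
print(first is second, load_calls)
```

Either way, the cost of loading is paid once per session and every later query reuses the same object.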

---



In [4]:
import json
import numpy as np
import faiss
import time
from typing import List, Dict, Any, Tuple
from sentence_transformers import SentenceTransformer
import pickle
import os
from dataclasses import dataclass, field
from pathlib import Path

# Try to import llama-cpp-python
try:
    from llama_cpp import Llama
    LLAMA_CPP_AVAILABLE = True
except ImportError:
    LLAMA_CPP_AVAILABLE = False
    print("Warning: llama-cpp-python not installed. LLM generation will be simulated.")
    print("         Install with: pip install llama-cpp-python (see the project README for GPU build options)")

# Configuration for RAG pipeline targeting the 568 original documents
@dataclass
class RAGConfig568:
    """Configuration class for RAG pipeline using 568 original documents"""
    # Paths pointing to the output of FAISSVectorStore568
    # Assumes 'faiss_vector_store_568' is in the same directory as the script
    base_faiss_path: str = "faiss_vector_store_568" # Base directory for the 568 FAISS store
    
    # Derived paths
    faiss_index_path: str = field(init=False)
    document_metadata_path: str = field(init=False) # Changed from chunk_metadata_path
    vector_store_metadata_path: str = field(init=False)
    
    # Model paths
    embedding_model_name: str = "intfloat/e5-base"
    llm_model_path: str = "models/llama-3.2-1b-instruct-q4_k_m.gguf"  # IMPORTANT: Update this path if different
    
    # Retrieval parameters
    top_k_dense: int = 3
    similarity_threshold: float = 0.75 # Adjust based on testing
    
    # LLM parameters
    max_tokens: int = 1200
    temperature: float = 0.32 # Slightly increased for a bit more variability if desired
    top_p: float = 0.92
    context_length: int = 5000 # Llama 3.2 1B supports far longer contexts (up to 128K tokens); a conservative value keeps memory use low

    def __post_init__(self):
        self.faiss_index_path = os.path.join(self.base_faiss_path, "faiss_index_568.index")
        self.document_metadata_path = os.path.join(self.base_faiss_path, "document_metadata_568.json")
        self.vector_store_metadata_path = os.path.join(self.base_faiss_path, "vector_store_metadata_568.json")

class DenseRetriever568:
    """Handles dense retrieval for the 568 original documents"""
    
    def __init__(self, config: RAGConfig568):
        self.config = config
        self.embedding_model = None
        self.faiss_index = None
        self.document_metadata = None # Changed from chunk_metadata
        self.vector_store_metadata = None
        
    def initialize(self):
        print("\n" + "="*80)
        print("🔄 INITIALIZING DENSE RETRIEVER (FOR 568 ORIGINAL DOCUMENTS)")
        print("="*80)
        
        print(f"📥 Loading embedding model ({self.config.embedding_model_name})...")
        start_time = time.time()
        self.embedding_model = SentenceTransformer(self.config.embedding_model_name)
        print(f"✅ Embedding model loaded in {time.time() - start_time:.2f} seconds")
        
        print(f"📥 Loading FAISS index from: {self.config.faiss_index_path}")
        start_time = time.time()
        if not os.path.exists(self.config.faiss_index_path):
            raise FileNotFoundError(f"FAISS index not found at: {self.config.faiss_index_path}\n"
                                    f"Ensure 'faiss_vector_store_568' folder is correctly placed.")
        self.faiss_index = faiss.read_index(self.config.faiss_index_path)
        print(f"✅ FAISS index loaded in {time.time() - start_time:.2f} seconds")
        print(f"📊 Index contains {self.faiss_index.ntotal} vectors (should be 568).")
        
        print(f"📥 Loading document metadata from: {self.config.document_metadata_path}")
        if not os.path.exists(self.config.document_metadata_path):
            raise FileNotFoundError(f"Document metadata not found at: {self.config.document_metadata_path}\n"
                                    f"Ensure 'faiss_vector_store_568' folder is correctly placed.")
        with open(self.config.document_metadata_path, 'r', encoding='utf-8') as f:
            self.document_metadata = json.load(f)
        
        print(f"📥 Loading vector store metadata from: {self.config.vector_store_metadata_path}")
        if not os.path.exists(self.config.vector_store_metadata_path):
            print(f"⚠️ Vector store metadata not found at: {self.config.vector_store_metadata_path}. Proceeding...")
            self.vector_store_metadata = {}
        else:
            with open(self.config.vector_store_metadata_path, 'r', encoding='utf-8') as f:
                self.vector_store_metadata = json.load(f)
            
        print(f"✅ Loaded metadata for {len(self.document_metadata)} documents.")
        print("🚀 Dense Retriever (568) initialized successfully!")
        
    def embed_query(self, query: str) -> np.ndarray:
        prefixed_query = f"query: {query}" # E5 specific prefix for queries
        start_time = time.time()
        embedding = self.embedding_model.encode([prefixed_query], normalize_embeddings=True).astype(np.float32)
        embed_time_ms = (time.time() - start_time) * 1000
        print(f"⚡ Query embedded in {embed_time_ms:.1f}ms")
        return embedding[0]
    
    def search_similar_documents(self, query_embedding: np.ndarray) -> List[Dict[str, Any]]: # Renamed
        start_time = time.time()
        scores, indices = self.faiss_index.search(
            query_embedding.reshape(1, -1), 
            self.config.top_k_dense
        )
        search_time_ms = (time.time() - start_time) * 1000
        print(f"🔍 FAISS search completed in {search_time_ms:.1f}ms")
        
        results = []
        for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
            if 0 <= idx < len(self.document_metadata) and score >= self.config.similarity_threshold:
                doc_data = self.document_metadata[idx].copy()
                doc_data['similarity_score'] = float(score)
                doc_data['retrieval_rank'] = i + 1
                results.append(doc_data)
            elif score < self.config.similarity_threshold:
                break
        
        print(f"📋 Retrieved {len(results)} relevant documents (min threshold: {self.config.similarity_threshold:.2f})")
        return results
    
    def retrieve(self, query: str) -> List[Dict[str, Any]]:
        print(f"\n" + "-"*80)
        print(f"🔍 RETRIEVAL PHASE (568 Docs) FOR QUERY: '{query[:100]}{'...' if len(query) > 100 else ''}'")
        print("-"*80)
        
        query_embedding = self.embed_query(query)
        results = self.search_similar_documents(query_embedding) # Renamed
        
        if results:
            print(f"\n📊 TOP {len(results)} RETRIEVED DOCUMENTS (Summary):")
            for i, result_doc in enumerate(results):
                # Use 'file_name' as it's present in your document_metadata_568.json
                file_name = result_doc.get('file_name', Path(result_doc.get('source_file', 'Unknown')).name)
                print(f"  {result_doc['retrieval_rank']}. File: {file_name} (Score: {result_doc['similarity_score']:.4f})")
        else:
            print("\n⚠️ No documents retrieved for this query based on set parameters.")
        return results

class LLMGenerator568: # Renamed for clarity, though functionality is general
    """Handles LLM generation using LLaMA 3.2-1B"""
    
    def __init__(self, config: RAGConfig568):
        self.config = config
        self.llm = None
        
    def initialize(self):
        print("\n" + "="*80)
        print("🤖 INITIALIZING LLM GENERATOR")
        print("="*80)
        
        if not LLAMA_CPP_AVAILABLE:
            print("⚠️  llama-cpp-python not available. LLM generation will be simulated.")
            return
        if not os.path.exists(self.config.llm_model_path):
            print(f"⚠️  LLM model not found at: {self.config.llm_model_path}")
            print("📝 Please download a LLaMA GGUF model and update RAGConfig568.llm_model_path.")
            print("   Using simulation mode for LLM generation.")
            return
        
        print(f"📥 Loading LLaMA GGUF model from: {self.config.llm_model_path}...")
        start_time = time.time()
        try:
            self.llm = Llama(
                model_path=self.config.llm_model_path,
                n_ctx=self.config.context_length,
                n_threads=os.cpu_count(),
                verbose=False,
                n_gpu_layers=-1 
            )
            print(f"✅ LLM loaded in {time.time() - start_time:.2f} seconds")
            print("🚀 LLM Generator initialized successfully!")
        except Exception as e:
            print(f"❌ Error loading LLM: {e}. Using simulation mode.")
            self.llm = None # Ensure LLM is None if loading failed
    
    def format_prompt(self, query: str, retrieved_documents: List[Dict[str, Any]]) -> str: # Renamed
        context_parts = []
        for i, doc in enumerate(retrieved_documents): # Renamed
            # Use 'file_name' for source, and 'content' for the text, as per your 568 doc structure
            source_file_name = doc.get('file_name', Path(doc.get('source_file', 'Unknown Document')).name)
            text = doc.get('content', '') # Key change: use 'content'
            score = doc.get('similarity_score', 0.0)
            context_parts.append(
                f"<document id={i+1} source=\"{source_file_name}\" relevance={score:.3f}>\n{text}\n</document>"
            )
        context_string = "\n".join(context_parts)
        
        prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an intelligent AI assistant for course materials. Your primary task is to answer questions accurately using ONLY the provided context documents.

INSTRUCTIONS:
1. **BASE YOUR ANSWER ENTIRELY ON THE PROVIDED DOCUMENTS.** Do not use any external knowledge.
2. **CITE SOURCES METICULOUSLY**: When you use information from a document, cite it immediately using the format [Document X: source_filename.ext].
3. **BE SPECIFIC**: If a document provides a detail, include it.
4. **NO RELEVANT INFORMATION**: If NO documents contain relevant information for the query, respond EXACTLY with: "Based on the provided documents, I could not find specific information to answer your query." Do not add any other commentary.
5. **SYNTHESIZE**: Combine information from multiple documents if they all contribute to the answer, citing each piece of information.
6. **DIRECT QUOTES**: If a direct quote is useful and concise, you can use it, but always attribute it.
7. **DO NOT HALLUCINATE**: If the information isn't in the documents, you don't know it for the purpose of this task.

<|eot_id|><|start_header_id|>user<|end_header_id|>

Context Documents:
{context_string}

Question: {query}

Based ONLY on the provided documents, answer the question. Cite your sources.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
        return prompt
    
    def generate_response(self, query: str, retrieved_documents: List[Dict[str, Any]]) -> Dict[str, Any]: # Renamed
        print(f"\n" + "-"*80)
        print(f"🤖 GENERATION PHASE (568 Docs)")
        print("-"*80)
        
        if not retrieved_documents:
            print("⚠️ No relevant documents were retrieved. Returning the fixed fallback response.")
            return {
                'response': "Based on the provided documents, I could not find specific information to answer your query.",
                'sources': [], 'generation_time': 0, 'token_count': 0, 'simulated': True, 'no_context': True
            }
        
        prompt = self.format_prompt(query, retrieved_documents)
        
        if self.llm is None:
            print("🔄 Simulating LLM response...")
            time.sleep(1.0)
            sources_list = sorted(list(set(doc.get('file_name', 'Unknown') for doc in retrieved_documents)))
            simulated_response = (f"SIMULATED RESPONSE: Based on documents like {', '.join(sources_list[:2])}..., "
                                  f"the answer to '{query}' would be synthesized here. "
                                  f"LLM would cite sources like [Document 1: {sources_list[0] if sources_list else 'N/A'}].")
            print(f"✅ Simulated response generated.")
            return {
                'response': simulated_response, 'sources': sources_list, 'generation_time': 1.0, 
                'token_count': len(simulated_response.split()), 'simulated': True
            }
        
        print(f"🧠 Calling LLaMA GGUF model for generation...")
        start_time = time.time()
        try:
            output = self.llm.create_completion(
                prompt, max_tokens=self.config.max_tokens, temperature=self.config.temperature,
                top_p=self.config.top_p, stop=["<|eot_id|>"], echo=False
            )
            generation_time = time.time() - start_time
            response_text = output['choices'][0]['text'].strip()
            sources_list = sorted(list(set(doc.get('file_name', 'Unknown') for doc in retrieved_documents)))
            print(f"✅ LLM Response generated in {generation_time:.2f} seconds. Tokens: {output['usage']['completion_tokens']}")
            return {
                'response': response_text, 'sources': sources_list, 'generation_time': generation_time,
                'token_count': output['usage']['completion_tokens'], 'simulated': False
            }
        except Exception as e:
            print(f"❌ Error during LLM generation: {e}")
            return {
                'response': f"Error during LLM generation: {str(e)}", 'sources': [], 
                'generation_time': 0, 'token_count': 0, 'error': True, 'simulated': False
            }

class RAGPipeline568: # Renamed
    """Complete RAG Pipeline for 568 original documents"""
    
    def __init__(self, config: RAGConfig568 = None):
        self.config = config or RAGConfig568()
        self.retriever = DenseRetriever568(self.config)
        self.generator = LLMGenerator568(self.config) # Using the (potentially renamed) LLMGenerator
        
    def initialize(self):
        print("\n" + "="*80)
        print("🚀 INITIALIZING COMPLETE RAG PIPELINE (FOR 568 ORIGINAL DOCUMENTS)")
        print("="*80)
        self.retriever.initialize()
        self.generator.initialize()
        print("\n" + "="*80)
        print("✅ RAG PIPELINE (568) READY!")
        print("="*80)
    
    def save_pipeline(self, filepath: str = "rag_pipeline_568_v2.pkl"): # Changed default filename
        print(f"\n💾 Saving RAG pipeline (568) to {filepath}...")
        # Ensure directory exists
        os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
        try:
            with open(filepath, 'wb') as f:
                pickle.dump(self, f)
            print(f"✅ RAG pipeline (568) saved successfully!")
        except Exception as e:
            print(f"❌ Error saving pipeline: {e}")
    
    @classmethod
    def load_pipeline(cls, filepath: str = "rag_pipeline_568_v2.pkl"): # Changed default filename
        print(f"\n📥 Loading RAG pipeline (568) from {filepath}...")
        if not os.path.exists(filepath):
            print(f"❌ Pipeline file not found at {filepath}. Please run setup first.")
            return None
        try:
            with open(filepath, 'rb') as f:
                pipeline = pickle.load(f)
            print(f"✅ RAG pipeline (568) loaded successfully!")
            return pipeline
        except Exception as e:
            print(f"❌ Error loading pipeline: {e}")
            return None
        
    def query(self, question: str) -> Dict[str, Any]:
        start_time = time.time()
        retrieved_documents = self.retriever.retrieve(question) # Renamed
        retrieval_time = time.time() - start_time
        
        generation_start = time.time()
        result = self.generator.generate_response(question, retrieved_documents) # Renamed
        generation_time = time.time() - generation_start
        total_time = time.time() - start_time
        
        final_result = {
            'query': question,
            'retrieved_documents_count': len(retrieved_documents), # Renamed
            'retrieval_time': retrieval_time,
            'generation_time': generation_time,
            'total_time': total_time,
            'response': result['response'],
            'sources': result['sources'],
            'document_details': retrieved_documents, # Renamed
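            'no_context': result.get('no_context', False),
            'simulated': result.get('simulated', False) # Lets display code flag answers not produced by the LLM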
        }
        return final_result

def setup_rag_pipeline_568(): # Renamed
    """Setup and initialize the RAG pipeline for 568 original documents"""
    print("="*80)
    print("🚀 RAG PIPELINE (568 DOCUMENTS) SETUP AND INITIALIZATION")
    print("="*80)
    
    # --- IMPORTANT LOCAL SETUP STEPS for 568 ---
    # 1. Ensure required libraries are installed.
    # 2. Download FAISS vector store folder:
    #    Download 'faiss_vector_store_568' from your Google Drive and place it
    #    in the same directory as this script (or update RAGConfig568.base_faiss_path).
    # 3. Download LLaMA GGUF model and update RAGConfig568.llm_model_path.
    # ---
    
    config = RAGConfig568() # Use the new config
    rag = RAGPipeline568(config) # Use the new pipeline class
    
    try:
        rag.initialize()
        # Default save path for 568 pipeline
        rag.save_pipeline(filepath=os.path.join("pipelines", "rag_pipeline_568_v2.pkl")) # Save in a subfolder
        
        print("\n" + "="*80)
        print("✅ SETUP COMPLETE (568 DOCUMENTS)!")
        print("The RAG pipeline for 568 original documents has been initialized and saved.")
        print(f"  Saved to: {os.path.join('pipelines', 'rag_pipeline_568_v2.pkl')}")
        print("You can now run a query session using this pipeline.")
        print("="*80)
        return rag
        
    except FileNotFoundError as fnf_error:
        print(f"\n❌ CRITICAL FILE NOT FOUND during setup: {fnf_error}")
        print("   Please ensure all required FAISS store files are in the correct location:")
        print(f"   Expected FAISS base path: {config.base_faiss_path}")
        print(f"   - Index: {config.faiss_index_path}")
        print(f"   - Document Metadata: {config.document_metadata_path}")
        print("   Also check your LLM model path.")
        return None
    except Exception as e:
        print(f"\n❌ An error occurred during setup: {e}")
        import traceback
        traceback.print_exc()
        return None

if __name__ == "__main__":
    # This will set up and save the pipeline for the 568 original documents
    pipeline_568 = setup_rag_pipeline_568()
    
    # Example of how you might run a query if setup was successful
    if pipeline_568:
        print("\nRunning a sample query with the initialized 568 pipeline...")
        sample_query = "What are the advantages of private cloud?"
        response = pipeline_568.query(sample_query)
        print("\n" + "="*80)
        print(f"Sample Query: {response['query']}")
        print(f"Response:\n{response['response']}")
        print(f"\nSources used: {response['sources']}")
        print(f"Total time: {response['total_time']:.2f}s")
        print("="*80)

        # To load and use later (in a different script or session):
        # loaded_pipeline = RAGPipeline568.load_pipeline(filepath=os.path.join("pipelines", "rag_pipeline_568_v2.pkl"))
        # if loaded_pipeline:
        #     response = loaded_pipeline.query("Another query")
        #     print(response['response'])

🚀 RAG PIPELINE (568 DOCUMENTS) SETUP AND INITIALIZATION

🚀 INITIALIZING COMPLETE RAG PIPELINE (FOR 568 ORIGINAL DOCUMENTS)

🔄 INITIALIZING DENSE RETRIEVER (FOR 568 ORIGINAL DOCUMENTS)
📥 Loading embedding model (intfloat/e5-base)...
✅ Embedding model loaded in 3.70 seconds
📥 Loading FAISS index from: faiss_vector_store_568\faiss_index_568.index
✅ FAISS index loaded in 0.00 seconds
📊 Index contains 568 vectors (should be 568).
📥 Loading document metadata from: faiss_vector_store_568\document_metadata_568.json
📥 Loading vector store metadata from: faiss_vector_store_568\vector_store_metadata_568.json
✅ Loaded metadata for 568 documents.
🚀 Dense Retriever (568) initialized successfully!

🤖 INITIALIZING LLM GENERATOR
📥 Loading LLaMA GGUF model from: models/llama-3.2-1b-instruct-q4_k_m.gguf...


llama_context: n_ctx_per_seq (5000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


✅ LLM loaded in 2.28 seconds
🚀 LLM Generator initialized successfully!

✅ RAG PIPELINE (568) READY!

💾 Saving RAG pipeline (568) to pipelines\rag_pipeline_568_v2.pkl...
✅ RAG pipeline (568) saved successfully!

✅ SETUP COMPLETE (568 DOCUMENTS)!
The RAG pipeline for 568 original documents has been initialized and saved.
  Saved to: pipelines\rag_pipeline_568.pkl
You can now run a query session using this pipeline.

Running a sample query with the initialized 568 pipeline...

--------------------------------------------------------------------------------
🔍 RETRIEVAL PHASE (568 Docs) FOR QUERY: 'What are the advantages of private cloud?'
--------------------------------------------------------------------------------
⚡ Query embedded in 132.1ms
🔍 FAISS search completed in 0.0ms
📋 Retrieved 3 relevant documents (min threshold: 0.75)

📊 TOP 3 RETRIEVED DOCUMENTS (Summary):
  1. File: icc02_Advantages_r_Disadvantages_of_Private_Cloud_&_Hybrid_Cloud_Introduction.txt (Score: 0.8706)
  2. File

## Now write your query; a response generally takes about 50 seconds

### Ask as many queries as you want. The output displays all the information; click the scrollable element in the output so that you can see every part of it

### Or copy the cell output and paste it into Word if you do not know what the scrollable element is

### Type exit when you want to stop

# max tokens 800 and collect 4 files
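
"Collect 4 files" presumably corresponds to `top_k_dense = 4` in the config. The sketch below is a dependency-free illustration of what retrieval with a top-k limit plus a similarity threshold does. It assumes the FAISS index scores by inner product over unit-length vectors, so that scores are cosine similarities. The helper names and the toy vectors are hypothetical and are not part of the pipeline.

```python
import math
from typing import Dict, List, Tuple


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_above_threshold(
    query: List[float],
    docs: Dict[str, List[float]],
    top_k: int = 4,
    threshold: float = 0.75,
) -> List[Tuple[str, float]]:
    """Return up to top_k (name, score) pairs whose cosine score meets the threshold."""
    q = normalize(query)
    scored = [
        (name, sum(a * b for a, b in zip(q, normalize(vec))))
        for name, vec in docs.items()
    ]
    # Highest score first, mirroring the order a similarity search returns.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in scored[:top_k] if score >= threshold]


# Hypothetical toy vectors; real embeddings have far more dimensions.
toy_docs = {
    "doc_a.txt": [1.0, 0.0, 0.0],
    "doc_b.txt": [0.9, 0.1, 0.0],
    "doc_c.txt": [0.0, 1.0, 0.0],
    "doc_d.txt": [0.7, 0.7, 0.0],
    "doc_e.txt": [0.8, 0.0, 0.6],
}
hits = top_k_above_threshold([1.0, 0.0, 0.0], toy_docs)
for rank, (name, score) in enumerate(hits, start=1):
    print(f"{rank}. {name} (score: {score:.4f})")
```

With these toy vectors, three of the five documents pass the 0.75 threshold even though up to four are allowed. Lowering the threshold to 0.38 would let a fourth document through, which is why the number of retrieved files depends on both settings together.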


In [None]:
import pickle
import time
from pathlib import Path
from typing import Dict, Any
import os # Added for os.path.join

# Define the expected path for the 568 pipeline
# Assuming the query script is in the same root directory as the 'pipelines' folder
DEFAULT_PIPELINE_568_PATH = os.path.join("pipelines", "rag_pipeline_568.pkl")

def load_rag_pipeline_568(filepath: str = DEFAULT_PIPELINE_568_PATH): # Renamed and updated default
    """Load a pre-initialized RAG pipeline (568 version) from file"""
    print(f"\n📥 Loading RAG pipeline (568) from {filepath}...")
    try:
        with open(filepath, 'rb') as f:
            # This will require RAGPipeline568 and its dependent classes to be defined
            # or importable in the environment where this script is run.
            # If running as a standalone script, you might need to include class definitions
            # or ensure they are in an importable module.
            pipeline = pickle.load(f)
        print(f"✅ RAG pipeline (568) loaded successfully!")
        return pipeline
    except FileNotFoundError:
        print(f"❌ Pipeline file not found at {filepath}")
        print("   Please run the setup script (e.g., rag_setup_568.py) first to initialize and save the pipeline.")
        return None
    except Exception as e:
        print(f"❌ Error loading pipeline: {e}")
        import traceback
        traceback.print_exc() # More details on error
        return None

def display_query_result_568(result: Dict[str, Any]): # Renamed
    """Display the results of a RAG query (568 version) in a formatted way"""
    print(f"\n" + "="*80)
    print(f"FINAL RAG RESPONSE FOR: '{result['query']}'")
    print("="*80)
    
    print(f"\n📝 Generated Response:")
    print("-" * 70)
    print(result['response'])
    print("-" * 70)
    
    # Display details of the retrieved documents
    # The key in 'result' should be 'document_details' as per your updated RAGPipeline568
    if result.get('document_details'):
        print(f"\n📚 Retrieved Documents (Top {result.get('retrieved_documents_count', 0)}):") # Updated key
        print("-" * 70)
        for j, doc in enumerate(result['document_details']): # Updated variable name
            # Use 'file_name' directly as it's expected in the document metadata
            file_name = doc.get('file_name', 'Unknown_File') 
            print(f"  {j+1}. Document: {file_name}")
            print(f"     Category: {doc.get('category', 'N/A')}")
            print(f"     Relevance Score: {doc.get('similarity_score', 0):.4f}")
            # Preview of the text used for embedding, if available
            if 'text_for_embedding' in doc:
                preview_text = doc['text_for_embedding']
            elif 'content' in doc: # Fallback to 'content'
                preview_text = doc['content']
            else:
                preview_text = doc.get('text', '') # Last fallback
            # print(f"     Preview: {preview_text[:150]}...")
            print("-" * 70)
    elif result.get('no_context'):
        print("\n⚠️ No relevant documents were retrieved to answer this query based on the provided context.")
    else:
        print("\n⚠️ No document details found in the result.")

    
    print(f"\n📊 Pipeline Metrics:")
    print(f"  • Retrieval Time: {result.get('retrieval_time', 0):.3f} seconds")
    print(f"  • Generation Time: {result.get('generation_time', 0):.3f} seconds")
    print(f"  • Total Time: {result.get('total_time', 0):.3f} seconds")
    if result.get('simulated') or result.get('simulated_llm_response'): # If your pipeline adds this flag
        print("  ⚠️ LLM Response was SIMULATED.")
    print("\n" + "="*80 + "\n")

def run_interactive_session_568(rag_pipeline): # Renamed
    """Run an interactive query session with the loaded RAG pipeline (568 version)"""
    print("\n" + "="*80)
    print("🚀 RAG PIPELINE (568 DOCUMENTS) INTERACTIVE SESSION")
    print("Type 'exit' or 'quit' to end the session.")
    print("="*80)
    
    while True:
        print("\n" + "-"*50)
        user_query = input("🤔 Your question: ").strip()
        
        if user_query.lower() in ['exit', 'quit']:
            print("\nGoodbye! 👋")
            break
        if not user_query:
            print("⚠️ Please enter a question.")
            continue
            
        try:
            print("\n🔄 Processing your question...")
            result = rag_pipeline.query(user_query) # This calls RAGPipeline568.query
            
            print(f"\n💡 **Answer:**")
            print("-" * 50)
            print(result['response'])
            print("-" * 50)
            
            if result.get('sources'):
                print(f"\n📚 **Sources cited (original files):** {', '.join(result['sources'])}")
            elif result.get('no_context'):
                print(f"\n📚 No specific documents found to answer the query based on provided context.")


            print(f"⏱️ **Response time:** {result.get('total_time', 0):.2f} seconds")
            
        except Exception as e:
            print(f"❌ Error during query processing: {e}")
            import traceback
            traceback.print_exc()

def run_single_query_568(rag_pipeline, query: str): # Renamed
    """Run a single query (568 version) without interactive mode"""
    print(f"\n🔍 Processing single query: '{query}'")
    
    try:
        result = rag_pipeline.query(query)
        display_query_result_568(result) # Use the 568-specific display
        return result
    except Exception as e:
        print(f"❌ Error processing query: {e}")
        return None

def main_query_568(): # Renamed
    """Main function for the query session (568 version)"""
    print("="*80)
    print("🚀 RAG PIPELINE (568 DOCUMENTS) QUERY SESSION")
    print("="*80)
    
    # --- Ensure class definitions are available ---
    # If RAGPipeline568 and its dependencies (RAGConfig568, DenseRetriever568, LLMGenerator568)
    # are not in this file, they need to be imported from the setup script.
    # e.g., from rag_setup_568_script import RAGPipeline568, RAGConfig568, DenseRetriever568, LLMGenerator568
    # For simplicity if running as a standalone file after setup, you might copy those class defs here
    # OR ensure the setup script creates a module you can import.

    # Load the pre-initialized 568 pipeline
    rag_pipeline_568 = load_rag_pipeline_568() # Use the 568-specific loader
    
    if rag_pipeline_568 is None:
        print("\n❌ Could not load RAG pipeline for 568 documents.")
        print("   Please ensure you have run the setup script for the 568-document pipeline,")
        print(f"   and the file '{DEFAULT_PIPELINE_568_PATH}' exists.")
        return
    
    print("\n✅ RAG pipeline (568) loaded and ready!")
    
    # Example: Run a few predefined queries or start interactive session
    # run_single_query_568(rag_pipeline_568, "What is cloud computing security?")
    # run_single_query_568(rag_pipeline_568, "Explain the concept of APIs in NLP.")
    
    run_interactive_session_568(rag_pipeline_568) # Use the 568-specific session runner

if __name__ == "__main__":
    # Unpickling custom objects requires their class definitions to be resolvable
    # when pickle.load() runs: RAGPipeline568, RAGConfig568, DenseRetriever568 and
    # LLMGenerator568 must be defined in this file or imported, for example:
    # from rag_setup_568 import RAGPipeline568, RAGConfig568, DenseRetriever568, LLMGenerator568
    # (adjust the module name to match your setup script). Without them, pickle
    # cannot reconstruct the objects and the load fails.
    print("NOTE: For this query script to successfully unpickle 'rag_pipeline_568.pkl',")
    print("      the class definitions (RAGPipeline568, RAGConfig568, etc.) must be accessible")
    print("      in this script's environment (e.g., defined here or imported from the setup script's module).")
    print("      Proceeding with main_query_568 assuming they are available.\n")

    main_query_568()

NOTE: For this query script to successfully unpickle 'rag_pipeline_568.pkl',
      the class definitions (RAGPipeline568, RAGConfig568, etc.) must be accessible
      in this script's environment (e.g., defined here or imported from the setup script's module).
      Proceeding with main_query_568 assuming they are available.

🚀 RAG PIPELINE (568 DOCUMENTS) QUERY SESSION

📥 Loading RAG pipeline (568) from pipelines\rag_pipeline_568.pkl...


llama_context: n_ctx_per_seq (4500) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


✅ RAG pipeline (568) loaded successfully!

✅ RAG pipeline (568) loaded and ready!

🚀 RAG PIPELINE (568 DOCUMENTS) INTERACTIVE SESSION
Type 'exit' or 'quit' to end the session.

--------------------------------------------------

🔄 Processing your question...

--------------------------------------------------------------------------------
🔍 RETRIEVAL PHASE (568 Docs) FOR QUERY: 'tell me about the confusion matrix, accuracy, recall, precision and F1 score also explain with examp...'
--------------------------------------------------------------------------------
⚡ Query embedded in 2033.3ms
🔍 FAISS search completed in 1.0ms
📋 Retrieved 5 relevant documents (min threshold: 0.38)

📊 TOP 5 RETRIEVED DOCUMENTS (Summary):
  1. File: lec3-(f) Evaluation measure.txt (Score: 0.8532)
  2. File: lec3-(k) F-measure.txt (Score: 0.8378)
  3. File: lec3-(g) Confusion Matrix.txt (Score: 0.8341)
  4. File: lec3-(i) Precision vs recall.txt (Score: 0.8334)
  5. File: lec3-(m) Multi-class Confusion Matrix

# MAX token 800
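
A minimal sketch of switching the generation limit to 800 tokens without editing the class defaults, using `dataclasses.replace`. `GenerationSettings` is a hypothetical stand-in that copies only the LLM fields of `RAGConfig568`, and `prompt_token_budget` is an illustrative helper. It assumes that the prompt and the generated tokens share one context window, as they do in llama.cpp.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical stand-in for the LLM fields of the pipeline config."""
    max_tokens: int = 1200
    temperature: float = 0.32
    top_p: float = 0.92
    context_length: int = 5000


def prompt_token_budget(settings: GenerationSettings) -> int:
    """Tokens left for the prompt once room for the answer is reserved."""
    return settings.context_length - settings.max_tokens


defaults = GenerationSettings()
# replace() returns a new instance with only the named fields changed.
short_answers = replace(defaults, max_tokens=800)

print(f"default max_tokens: {defaults.max_tokens}")
print(f"override max_tokens: {short_answers.max_tokens}")
print(f"prompt budget with override: {prompt_token_budget(short_answers)} tokens")
```

A smaller `max_tokens` leaves more of the context window for retrieved documents, which matters when more files are collected per query. A pickled pipeline carries the config values it was saved with, so changing the class defaults alone does not affect a pipeline that has already been saved.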


In [None]:
import pickle
import time
from pathlib import Path
from typing import Dict, Any
import os # Added for os.path.join

# Define the expected path for the 568 pipeline
# Assuming the query script is in the same root directory as the 'pipelines' folder
DEFAULT_PIPELINE_568_PATH = os.path.join("pipelines", "rag_pipeline_568_v2.pkl")

def load_rag_pipeline_568(filepath: str = DEFAULT_PIPELINE_568_PATH): # Renamed and updated default
    """Load a pre-initialized RAG pipeline (568 version) from file"""
    print(f"\n📥 Loading RAG pipeline (568) from {filepath}...")
    try:
        with open(filepath, 'rb') as f:
            # This will require RAGPipeline568 and its dependent classes to be defined
            # or importable in the environment where this script is run.
            # If running as a standalone script, you might need to include class definitions
            # or ensure they are in an importable module.
            pipeline = pickle.load(f)
        print(f"✅ RAG pipeline (568) loaded successfully!")
        return pipeline
    except FileNotFoundError:
        print(f"❌ Pipeline file not found at {filepath}")
        print("   Please run the setup script (e.g., rag_setup_568.py) first to initialize and save the pipeline.")
        return None
    except Exception as e:
        print(f"❌ Error loading pipeline: {e}")
        import traceback
        traceback.print_exc() # More details on error
        return None

def display_query_result_568(result: Dict[str, Any]): # Renamed
    """Display the results of a RAG query (568 version) in a formatted way"""
    print(f"\n" + "="*80)
    print(f"FINAL RAG RESPONSE FOR: '{result['query']}'")
    print("="*80)
    
    print(f"\n📝 Generated Response:")
    print("-" * 70)
    print(result['response'])
    print("-" * 70)
    
    # Display details of the retrieved documents, stored under 'document_details'
    if result.get('document_details'):
        print(f"\n📚 Retrieved Documents (Top {result.get('retrieved_documents_count', 0)}):")
        print("-" * 70)
        for j, doc in enumerate(result['document_details']):
            # 'file_name' is expected in the document metadata
            file_name = doc.get('file_name', 'Unknown_File')
            print(f"  {j+1}. Document: {file_name}")
            print(f"     Category: {doc.get('category', 'N/A')}")
            print(f"     Relevance Score: {doc.get('similarity_score', 0):.4f}")
            # Text preview: prefer the embedded text, then 'content', then 'text'
            preview_text = doc.get('text_for_embedding') or doc.get('content') or doc.get('text', '')
            # print(f"     Preview: {preview_text[:150]}...")
            print("-" * 70)
    elif result.get('no_context'):
        print("\n⚠️ No relevant documents were retrieved to answer this query based on the provided context.")
    else:
        print("\n⚠️ No document details found in the result.")

    
    print(f"\n📊 Pipeline Metrics:")
    print(f"  • Retrieval Time: {result.get('retrieval_time', 0):.3f} seconds")
    print(f"  • Generation Time: {result.get('generation_time', 0):.3f} seconds")
    print(f"  • Total Time: {result.get('total_time', 0):.3f} seconds")
    if result.get('simulated_llm_response'): # If your pipeline adds this flag
        print("  ⚠️ LLM Response was SIMULATED.")
    print("\n" + "="*80 + "\n")

def run_interactive_session_568(rag_pipeline):
    """Run an interactive query session with the loaded RAG pipeline (568 version)"""
    print("\n" + "="*80)
    print("🚀 RAG PIPELINE (568 DOCUMENTS) INTERACTIVE SESSION")
    print("Type 'exit' or 'quit' to end the session.")
    print("="*80)
    
    while True:
        print("\n" + "-"*50)
        user_query = input("🤔 Your question: ").strip()
        
        if user_query.lower() in ['exit', 'quit']:
            print("\nGoodbye! 👋")
            break
        if not user_query:
            print("⚠️ Please enter a question.")
            continue
            
        try:
            print("\n🔄 Processing your question...")
            result = rag_pipeline.query(user_query) # This calls RAGPipeline568.query
            
            print(f"\n💡 **Answer:**")
            print("-" * 50)
            print(result['response'])
            print("-" * 50)
            
            if result.get('sources'):
                print(f"\n📚 **Sources cited (original files):** {', '.join(result['sources'])}")
            elif result.get('no_context'):
                print("\n📚 No specific documents found to answer the query based on provided context.")

            print(f"⏱️ **Response time:** {result.get('total_time', 0):.2f} seconds")
            
        except Exception as e:
            print(f"❌ Error during query processing: {e}")
            import traceback
            traceback.print_exc()

def run_single_query_568(rag_pipeline, query: str):
    """Run a single query (568 version) without interactive mode"""
    print(f"\n🔍 Processing single query: '{query}'")
    
    try:
        result = rag_pipeline.query(query)
        display_query_result_568(result)
        return result
    except Exception as e:
        print(f"❌ Error processing query: {e}")
        return None

def main_query_568():
    """Main function for the query session (568 version)"""
    print("="*80)
    print("🚀 RAG PIPELINE (568 DOCUMENTS) QUERY SESSION")
    print("="*80)
    
    # RAGPipeline568 and its dependencies (RAGConfig568, DenseRetriever568, LLMGenerator568)
    # must be defined in this file or imported from the setup script before loading.

    # Load the pre-initialized 568 pipeline
    rag_pipeline_568 = load_rag_pipeline_568()
    
    if rag_pipeline_568 is None:
        print("\n❌ Could not load RAG pipeline for 568 documents.")
        print("   Please ensure you have run the setup script for the 568-document pipeline,")
        print(f"   and the file '{DEFAULT_PIPELINE_568_PATH}' exists.")
        return
    
    print("\n✅ RAG pipeline (568) loaded and ready!")
    
    # Example: Run a few predefined queries or start interactive session
    # run_single_query_568(rag_pipeline_568, "What is cloud computing security?")
    # run_single_query_568(rag_pipeline_568, "Explain the concept of APIs in NLP.")
    
    run_interactive_session_568(rag_pipeline_568) # Use the 568-specific session runner

if __name__ == "__main__":
    # pickle stores a reference to each class (module name plus class name), not the
    # class itself. RAGPipeline568, RAGConfig568, DenseRetriever568 and LLMGenerator568
    # must therefore be resolvable under the module they were pickled from before
    # pickle.load() is called. If they live in the setup script, import them first, e.g.:
    # from rag_setup_568 import RAGPipeline568, RAGConfig568, DenseRetriever568, LLMGenerator568
    # (adjust the module name). If the pipeline was pickled from a notebook or from a
    # script run directly, the classes were recorded under __main__ and must be defined in this file.
    print("NOTE: For this query script to successfully unpickle 'rag_pipeline_568.pkl',")
    print("      the class definitions (RAGPipeline568, RAGConfig568, etc.) must be accessible")
    print("      in this script's environment (e.g., defined here or imported from the setup script's module).")
    print("      Proceeding with main_query_568 assuming they are available.\n")

    main_query_568()

NOTE: For this query script to successfully unpickle 'rag_pipeline_568.pkl',
      the class definitions (RAGPipeline568, RAGConfig568, etc.) must be accessible
      in this script's environment (e.g., defined here or imported from the setup script's module).
      Proceeding with main_query_568 assuming they are available.

🚀 RAG PIPELINE (568 DOCUMENTS) QUERY SESSION

📥 Loading RAG pipeline (568) from pipelines\rag_pipeline_568_v2.pkl...


llama_context: n_ctx_per_seq (5000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


✅ RAG pipeline (568) loaded successfully!

✅ RAG pipeline (568) loaded and ready!

🚀 RAG PIPELINE (568 DOCUMENTS) INTERACTIVE SESSION
Type 'exit' or 'quit' to end the session.

--------------------------------------------------

🔄 Processing your question...

--------------------------------------------------------------------------------
🔍 RETRIEVAL PHASE (568 Docs) FOR QUERY: 'Define Vector Model, its steps and explain with example''
--------------------------------------------------------------------------------
⚡ Query embedded in 197.6ms
🔍 FAISS search completed in 0.0ms
📋 Retrieved 3 relevant documents (min threshold: 0.75)

📊 TOP 3 RETRIEVED DOCUMENTS (Summary):
  1. File: lec6-(f) Vector Space Models (VSMs).txt (Score: 0.8376)
  2. File: lec4-(e) N-gram Model.txt (Score: 0.8220)
  3. File: lec1-(a) Natural Language Processing.txt (Score: 0.8141)

--------------------------------------------------------------------------------
🤖 GENERATION PHASE (568 Docs)
----------------------
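
The retrieval log above reports the top 3 documents with a minimum score threshold of 0.75. The sketch below illustrates that kind of thresholded top-k selection in plain Python. It is not the pipeline's code: the pipeline uses FAISS, whose index type and similarity metric are not shown here, so cosine similarity, the brute-force scan, and the names `cosine` and `retrieve` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
    min_score: float = 0.75,
) -> List[Tuple[int, float]]:
    """Return up to top_k (doc_index, score) pairs whose score is at least min_score.

    Brute-force stand-in for an index search: score every document,
    keep the best top_k, then drop anything below the threshold.
    """
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, score) for i, score in scored[:top_k] if score >= min_score]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
    docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.8]]
    for idx, score in retrieve([1.0, 0.0], docs):
        print(f"doc {idx}: {score:.4f}")
```

Because the threshold is applied after the top-k cut, a query can return fewer than `top_k` documents, or none at all, which is the case the `no_context` branch in the query script handles.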

# max token 1200 and 3 files collected results


NOTE: For this query script to successfully unpickle 'rag_pipeline_568.pkl',
      the class definitions (RAGPipeline568, RAGConfig568, etc.) must be accessible
      in this script's environment (e.g., defined here or imported from the setup script's module).
      Proceeding with main_query_568 assuming they are available.

🚀 RAG PIPELINE (568 DOCUMENTS) QUERY SESSION

📥 Loading RAG pipeline (568) from pipelines\rag_pipeline_568_v2.pkl...


llama_context: n_ctx_per_seq (5000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


✅ RAG pipeline (568) loaded successfully!

✅ RAG pipeline (568) loaded and ready!

🚀 RAG PIPELINE (568 DOCUMENTS) INTERACTIVE SESSION
Type 'exit' or 'quit' to end the session.

--------------------------------------------------

🔄 Processing your question...

--------------------------------------------------------------------------------
🔍 RETRIEVAL PHASE (568 Docs) FOR QUERY: 'Define Vector Model, its steps and explain with example'
--------------------------------------------------------------------------------
⚡ Query embedded in 268.2ms
🔍 FAISS search completed in 0.0ms
📋 Retrieved 3 relevant documents (min threshold: 0.75)

📊 TOP 3 RETRIEVED DOCUMENTS (Summary):
  1. File: lec6-(f) Vector Space Models (VSMs).txt (Score: 0.8453)
  2. File: lec4-(e) N-gram Model.txt (Score: 0.8299)
  3. File: lec1-(a) Natural Language Processing.txt (Score: 0.8217)

--------------------------------------------------------------------------------
🤖 GENERATION PHASE (568 Docs)
-----------------------
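
The NOTE printed at the start of each run says the pipeline classes must be accessible before unpickling. A minimal sketch of why follows, using only the standard library. The module name `toy_setup_module` and the class `ToyPipeline` are hypothetical stand-ins for the setup script and `RAGPipeline568`; nothing here touches the real pipeline file.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "toy_setup_module"  # hypothetical stand-in for the setup script

MODULE_SOURCE = (
    "class ToyPipeline:\n"
    "    def __init__(self, name):\n"
    "        self.name = name\n"
)


def demo_unpickle_needs_class() -> dict:
    """Pickle an object, then unpickle it without and with its class importable."""
    outcome = {}
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, MODULE_NAME + ".py").write_text(MODULE_SOURCE)

        # "Setup script": the class is importable, so the object can be pickled.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        payload = pickle.dumps(module.ToyPipeline("demo"))

        # "Query script" where the module cannot be found: unpickling fails,
        # because the payload only names the module and class.
        sys.path.remove(tmp)
        del sys.modules[MODULE_NAME]
        try:
            pickle.loads(payload)
            outcome["without_module"] = "loaded"
        except ImportError:
            outcome["without_module"] = "failed"

        # "Query script" with the module importable again: unpickling succeeds.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            outcome["with_module"] = pickle.loads(payload).name
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
    return outcome


if __name__ == "__main__":
    print(demo_unpickle_needs_class())
```

The same reasoning applies to classes defined in a notebook cell: they are recorded under `__main__`, so a separate script can only load the file if it defines those classes in its own `__main__` as well.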