# 📊 Resume Retrieval System - RAG Evaluation

## Resume Retrieval System Using RAG

### 🎯 Key Features:
1. **PDF Upload & Parsing**: Upload and parse PDF resumes
2. **3 Embedding Models**: Choose among three embedding models
3. **3 Chunking Strategies**: Multiple chunking strategies that can be compared side by side
4. **Top-K Retrieval**: Control the number of results returned (1-10)
5. **Evaluation Metrics**: Comprehensive evaluation metrics
6. **Interactive UI**: Integrated interactive interface

### 📦 Environment:
- Python 3.11 (Conda)
- VS Code Local Development
- SentenceTransformer (Free, Offline)

---

## 📦 Section 1: Install & Import Libraries

In [1]:
# Install required packages (run once)
%pip install pandas chromadb sentence-transformers nltk pdfplumber gradio tabulate PyMuPDF --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import libraries
import pandas as pd
import numpy as np
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import pdfplumber
import re
import math
import os
from pathlib import Path
from typing import List, Dict, Tuple, Any, Optional
from collections import defaultdict
from tabulate import tabulate
import warnings
warnings.filterwarnings('ignore')

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("✅ All libraries imported successfully!")
print(f"📂 Working Directory: {os.getcwd()}")



✅ All libraries imported successfully!
📂 Working Directory: c:\Users\abrah\RAG


## ⚙️ Section 2: Configuration Settings

### Adjustable System Settings

In [3]:
# ============================================
# ⚙️ SYSTEM CONFIGURATION
# ============================================

class Config:
    """
    Central configuration for the Resume RAG System.
    All settings can be modified here or through the UI.
    """
    
    # -----------------------------------------
    # 🔤 EMBEDDING MODELS (Free, Local)
    # -----------------------------------------
    EMBEDDING_MODELS = {
        "all-MiniLM-L6-v2": {
            "name": "MiniLM (Fast)",
            "description": "Lightweight, fast inference. Good for quick experiments.",
            "dimension": 384
        },
        "all-mpnet-base-v2": {
            "name": "MPNet (High Quality)", 
            "description": "Best quality for semantic similarity tasks.",
            "dimension": 768
        },
        "BAAI/bge-base-en-v1.5": {
            "name": "BGE (Strong Retrieval)",
            "description": "Optimized for retrieval tasks. Strong semantic matching.",
            "dimension": 768
        }
    }
    
    DEFAULT_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
    
    # -----------------------------------------
    # ✂️ CHUNKING SETTINGS
    # -----------------------------------------
    CHUNKING_STRATEGIES = {
        "fixed": "Fixed-Length (300-400 chars)",
        "sentence": "Sentence-Based",
        "layout": "Layout-Aware (Sections/Headings)"
    }
    
    # Chunk parameters
    CHUNK_SIZE = 350           # Target chunk size (chars)
    CHUNK_SIZE_MIN = 300       # Minimum chunk size
    CHUNK_SIZE_MAX = 400       # Maximum chunk size
    CHUNK_OVERLAP_PERCENT = 10  # Overlap percentage (10%)
    
    @property
    def chunk_overlap(self):
        return int(self.CHUNK_SIZE * self.CHUNK_OVERLAP_PERCENT / 100)
    
    # Sentence-based settings
    MAX_SENTENCES_PER_CHUNK = 5
    
    # -----------------------------------------
    # 🔍 RETRIEVAL SETTINGS
    # -----------------------------------------
    TOP_K_MIN = 1
    TOP_K_MAX = 10
    TOP_K_DEFAULT = 3  # Recommended for CV documents
    
    # -----------------------------------------
    # 📁 FILE SETTINGS
    # -----------------------------------------
    SUPPORTED_FORMATS = [".pdf", ".txt"]
    PDF_PARSER = "pdfplumber"  # Options: "pdfplumber", "pymupdf"
    
    # -----------------------------------------
    # 📂 STORAGE PATHS
    # -----------------------------------------
    UPLOAD_FOLDER = "./uploaded_resumes"
    CHROMA_PERSIST_DIR = "./chroma_db"

# Initialize config
config = Config()

# Create necessary directories
os.makedirs(config.UPLOAD_FOLDER, exist_ok=True)
os.makedirs(config.CHROMA_PERSIST_DIR, exist_ok=True)

# Display configuration
print("⚙️ SYSTEM CONFIGURATION")
print("=" * 50)
print(f"\n📦 Embedding Models Available:")
for model_id, info in config.EMBEDDING_MODELS.items():
    marker = "→" if model_id == config.DEFAULT_EMBEDDING_MODEL else " "
    print(f"  {marker} {info['name']}: {model_id}")

print(f"\n✂️ Chunking Settings:")
print(f"   Chunk Size: {config.CHUNK_SIZE_MIN}-{config.CHUNK_SIZE_MAX} chars")
print(f"   Overlap: {config.CHUNK_OVERLAP_PERCENT}%")

print(f"\n🔍 Top-K Range: {config.TOP_K_MIN}-{config.TOP_K_MAX} (Default: {config.TOP_K_DEFAULT})")

print(f"\n📁 Directories:")
print(f"   Upload: {config.UPLOAD_FOLDER}")
print(f"   ChromaDB: {config.CHROMA_PERSIST_DIR}")

print("\n✅ Configuration loaded!")

⚙️ SYSTEM CONFIGURATION

📦 Embedding Models Available:
  → MiniLM (Fast): all-MiniLM-L6-v2
    MPNet (High Quality): all-mpnet-base-v2
    BGE (Strong Retrieval): BAAI/bge-base-en-v1.5

✂️ Chunking Settings:
   Chunk Size: 300-400 chars
   Overlap: 10%

🔍 Top-K Range: 1-10 (Default: 3)

📁 Directories:
   Upload: ./uploaded_resumes
   ChromaDB: ./chroma_db

✅ Configuration loaded!
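
The chunking settings above imply a 35-character overlap and a 315-character stride between chunk starts. The following is a minimal sketch of that arithmetic. The constants mirror `Config`, but the helpers `overlap_chars` and `estimated_chunks` are hypothetical names introduced here, and the estimate assumes a plain sliding window without the word-boundary adjustment the notebook applies later.

```python
import math

CHUNK_SIZE = 350            # mirrors Config.CHUNK_SIZE
CHUNK_OVERLAP_PERCENT = 10  # mirrors Config.CHUNK_OVERLAP_PERCENT


def overlap_chars(chunk_size: int, overlap_percent: int) -> int:
    """Characters shared by two consecutive chunks."""
    return int(chunk_size * overlap_percent / 100)


def estimated_chunks(text_length: int, chunk_size: int, overlap_percent: int) -> int:
    """Rough chunk count for a plain sliding window (no word-boundary snapping)."""
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - overlap_chars(chunk_size, overlap_percent)
    return 1 + math.ceil((text_length - chunk_size) / stride)


overlap = overlap_chars(CHUNK_SIZE, CHUNK_OVERLAP_PERCENT)
stride = CHUNK_SIZE - overlap
print(overlap, stride)                                             # 35 315
print(estimated_chunks(1000, CHUNK_SIZE, CHUNK_OVERLAP_PERCENT))   # 4
```

Real chunk counts can differ slightly from this estimate because each chunk end is moved to a nearby space.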


## 📄 Section 3: PDF Parsing & Text Extraction

### Reading and Parsing PDF Resumes
- `pdfplumber` handles simple files
- `PyMuPDF` handles complex layouts
- Text is extracted while preserving structure

In [4]:
# ============================================
# 📄 PDF PARSER CLASS
# ============================================

class PDFParser:
    """
    PDF parsing with support for pdfplumber and PyMuPDF.
    Extracts text while preserving structure for CV documents.
    """
    
    @staticmethod
    def parse_with_pdfplumber(pdf_path: str) -> Dict[str, Any]:
        """
        Parse PDF using pdfplumber (recommended for simple CVs).
        """
        try:
            text_content = []
            metadata = {}
            
            with pdfplumber.open(pdf_path) as pdf:
                metadata['pages'] = len(pdf.pages)
                metadata['parser'] = 'pdfplumber'
                
                for i, page in enumerate(pdf.pages):
                    page_text = page.extract_text()
                    if page_text:
                        text_content.append(page_text)
            
            full_text = "\n\n".join(text_content)
            metadata['char_count'] = len(full_text)
            metadata['word_count'] = len(full_text.split())
            
            return {
                'success': True,
                'text': full_text,
                'metadata': metadata
            }
            
        except Exception as e:
            return {
                'success': False,
                'error': str(e),
                'text': '',
                'metadata': {}
            }
    
    @staticmethod
    def parse_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
        """
        Parse PDF using PyMuPDF (better for complex layouts).
        """
        try:
            import fitz  # PyMuPDF
            
            metadata = {}
            
            # The context manager closes the document even if extraction fails
            with fitz.open(pdf_path) as doc:
                metadata['pages'] = len(doc)
                metadata['parser'] = 'pymupdf'
                text_content = [page.get_text() for page in doc]
            
            full_text = "\n\n".join(text_content)
            metadata['char_count'] = len(full_text)
            metadata['word_count'] = len(full_text.split())
            
            return {
                'success': True,
                'text': full_text,
                'metadata': metadata
            }
            
        except Exception as e:
            return {
                'success': False,
                'error': str(e),
                'text': '',
                'metadata': {}
            }
    
    @classmethod
    def parse(cls, pdf_path: str, parser: str = "pdfplumber") -> Dict[str, Any]:
        """
        Parse PDF using specified parser.
        """
        if parser == "pymupdf":
            return cls.parse_with_pymupdf(pdf_path)
        return cls.parse_with_pdfplumber(pdf_path)

# Test parser
print("✅ PDF Parser class ready!")
print(f"   Using: {config.PDF_PARSER}")

✅ PDF Parser class ready!
   Using: pdfplumber
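
Both parser methods return the same result dictionary (`success`, `text`, `metadata`), which makes it straightforward to chain them. The sketch below is one possible way to fall back to a second parser when the first returns no usable text, for example on a scanned PDF. `parse_with_fallback` and the two stub parsers are hypothetical and exist only so the example runs without a PDF file.

```python
from typing import Any, Callable, Dict, Sequence

ParseFn = Callable[[str], Dict[str, Any]]


def parse_with_fallback(path: str, parsers: Sequence[ParseFn]) -> Dict[str, Any]:
    """Try each parser in order; return the first result with non-empty text."""
    last = {'success': False, 'error': 'no parsers supplied', 'text': '', 'metadata': {}}
    for parse in parsers:
        result = parse(path)
        if result.get('success') and result.get('text', '').strip():
            return result
        last = result  # remember the most recent unsuccessful attempt
    return last


# Stub parsers standing in for the two PDFParser methods (hypothetical).
def empty_parser(path: str) -> Dict[str, Any]:
    return {'success': True, 'text': '', 'metadata': {'parser': 'stub-a'}}


def working_parser(path: str) -> Dict[str, Any]:
    return {'success': True, 'text': 'Jane Doe\nSkills: Python', 'metadata': {'parser': 'stub-b'}}


result = parse_with_fallback("resume.pdf", [empty_parser, working_parser])
print(result['metadata']['parser'])  # stub-b
```

In the notebook, the list could be `[PDFParser.parse_with_pdfplumber, PDFParser.parse_with_pymupdf]`.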


In [5]:
# ============================================
# 📁 RESUME MANAGER - Handle PDF & CSV Data
# ============================================

class ResumeManager:
    """
    Manages resume data from both CSV and PDF uploads.
    """
    
    def __init__(self):
        self.resumes = []  # List of {id, text, source, category, metadata}
        self.resume_count = 0
    
    def load_from_csv(self, csv_path: str, text_column: str = "Resume", 
                      category_column: str = "Category", limit: Optional[int] = None) -> int:
        """
        Load resumes from CSV file.
        """
        df = pd.read_csv(csv_path)
        
        if limit:
            df = df.head(limit)
        
        for idx, row in df.iterrows():
            self.resumes.append({
                'id': f"csv_{self.resume_count}",
                'text': str(row[text_column]),
                'source': 'csv',
                'category': str(row.get(category_column, 'Unknown')),
                'filename': csv_path,
                'metadata': {'original_index': idx}
            })
            self.resume_count += 1
        
        return len(df)
    
    def add_pdf(self, pdf_path: str, category: str = "Uploaded") -> Dict:
        """
        Add a single PDF resume.
        """
        result = PDFParser.parse(pdf_path, config.PDF_PARSER)
        
        if result['success']:
            resume_id = f"pdf_{self.resume_count}"
            filename = os.path.basename(pdf_path)
            
            self.resumes.append({
                'id': resume_id,
                'text': result['text'],
                'source': 'pdf',
                'category': category,
                'filename': filename,
                'metadata': result['metadata']
            })
            self.resume_count += 1
            
            return {
                'success': True,
                'id': resume_id,
                'filename': filename,
                'pages': result['metadata'].get('pages', 0),
                'words': result['metadata'].get('word_count', 0)
            }
        
        return {'success': False, 'error': result.get('error', 'Unknown error')}
    
    def add_multiple_pdfs(self, pdf_folder: str, category: str = "Uploaded") -> List[Dict]:
        """
        Add all PDFs from a folder.
        """
        results = []
        pdf_files = list(Path(pdf_folder).glob("*.pdf"))
        
        for pdf_path in pdf_files:
            result = self.add_pdf(str(pdf_path), category)
            results.append(result)
        
        return results
    
    def get_all_resumes(self) -> List[Dict]:
        """Return all loaded resumes."""
        return self.resumes
    
    def get_dataframe(self) -> pd.DataFrame:
        """Return resumes as DataFrame."""
        return pd.DataFrame(self.resumes)
    
    def clear(self):
        """Clear all resumes."""
        self.resumes = []
        self.resume_count = 0
    
    def summary(self) -> str:
        """Get summary of loaded resumes."""
        df = self.get_dataframe()
        if df.empty:
            return "No resumes loaded."
        
        summary = f"📊 Total Resumes: {len(df)}\n"
        summary += f"   From CSV: {len(df[df['source'] == 'csv'])}\n"
        summary += f"   From PDF: {len(df[df['source'] == 'pdf'])}\n"
        summary += f"\n📁 Categories:\n"
        for cat, count in df['category'].value_counts().items():
            summary += f"   - {cat}: {count}\n"
        return summary

# Initialize Resume Manager
resume_manager = ResumeManager()
print("✅ Resume Manager initialized!")

✅ Resume Manager initialized!


In [6]:
# ============================================
# 📊 LOAD CSV DATA (Optional - for evaluation)
# ============================================

# Check if CSV exists
CSV_FILENAME = "UpdatedResumeDataSet.csv"

if os.path.exists(CSV_FILENAME):
    # Load sample data for evaluation
    num_loaded = resume_manager.load_from_csv(CSV_FILENAME, limit=20)
    print(f"✅ Loaded {num_loaded} resumes from CSV for evaluation")
    print(resume_manager.summary())
else:
    print(f"⚠️ CSV file '{CSV_FILENAME}' not found.")
    print("   You can upload PDF resumes through the UI instead.")

✅ Loaded 20 resumes from CSV for evaluation
📊 Total Resumes: 20
   From CSV: 20
   From PDF: 0

📁 Categories:
   - Data Science: 20



## 🔧 Section 4: Text Cleaning & Utilities

In [9]:
def clean_text(text: str) -> str:
    """
    Clean and normalize resume text.
    """
    if not isinstance(text, str):
        return ""
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove special characters but keep important punctuation.
    # '+' is kept so that terms such as "C++" survive, matching
    # ChunkingStrategies.clean_text.
    text = re.sub(r'[^\w\s.,;:()\-@#&/+]', '', text)
    
    # Fix spacing around punctuation
    text = re.sub(r'\s+([.,;:])', r'\1', text)
    
    return text.strip()

# Test cleaning
if resume_manager.resumes:
    sample_text = resume_manager.resumes[0]['text'][:200]
    print("🔧 Sample text cleaning:")
    print(f"Original: {sample_text[:100]}...")
    print(f"Cleaned: {clean_text(sample_text)[:100]}...")
else:
    print("⚠️ No resumes loaded yet. Load resumes first to test cleaning.")

🔧 Sample text cleaning:
Original: Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, ...
Cleaned: Skills  Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, J...
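
The cleaned sample keeps two spaces between "Skills" and "Programming". That follows from the order of operations in `clean_text`: whitespace is collapsed first, and the `*` is removed afterwards, leaving its neighbouring spaces side by side. The sketch below contrasts the two orders. The function names are hypothetical, and the symbol pattern is copied from the cell above without further changes.

```python
import re

DROP = r'[^\w\s.,;:()\-@#&/]'  # characters outside this set are removed


def clean_collapse_first(text: str) -> str:
    """Same order as clean_text: collapse whitespace, then drop symbols."""
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(DROP, '', text)
    return text.strip()


def clean_collapse_last(text: str) -> str:
    """Alternative order: drop symbols first, then collapse whitespace."""
    text = re.sub(DROP, '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


sample = "Skills * Programming Languages: Python"
print(repr(clean_collapse_first(sample)))  # 'Skills  Programming Languages: Python'
print(repr(clean_collapse_last(sample)))   # 'Skills Programming Languages: Python'
```

Whether the extra space matters depends on the tokenizer of the embedding model; most subword tokenizers ignore repeated spaces.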


## ✂️ Section 5: Chunking Strategies

### 3 Strategies to Select and Compare:

| # | Strategy | Description | Best For |
|---|----------|-------------|----------|
| 1 | **Fixed-Length** | 300-400 chars with 10% overlap | Consistent chunk sizes |
| 2 | **Sentence-Based** | Group sentences together | Natural text boundaries |
| 3 | **Layout-Aware** | Split by headings/sections | Structured CVs |
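
To make the sentence-based row concrete, here is a minimal sketch of greedy sentence grouping. It assumes a naive regex splitter in place of NLTK's `sent_tokenize`, and `group_sentences` is a hypothetical name. A chunk is closed once either the sentence cap or the character cap is reached.

```python
import re
from typing import List


def group_sentences(text: str, max_sentences: int = 5, max_chars: int = 400) -> List[str]:
    """Greedily pack sentences into chunks, closing a chunk once either cap is hit."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        current.append(sentence)
        joined = ' '.join(current)
        if len(current) >= max_sentences or len(joined) >= max_chars:
            chunks.append(joined)
            current = []
    if current:  # flush the trailing partial group
        chunks.append(' '.join(current))
    return chunks


demo = "Built ETL jobs. Tuned SQL queries. Led a team of four. Shipped weekly. Mentored interns."
print(group_sentences(demo, max_sentences=2))
# ['Built ETL jobs. Tuned SQL queries.', 'Led a team of four. Shipped weekly.', 'Mentored interns.']
```

Because the length check runs after a sentence has been appended, a chunk can exceed `max_chars`, so sentence-based chunks are not bounded the way fixed-length chunks are.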

In [10]:
# ============================================
# ✂️ CHUNKING STRATEGIES (3 Options)
# ============================================

class ChunkingStrategies:
    """
    Three chunking strategies for CV/Resume documents.
    All strategies use config settings for consistency.
    """
    
    # Section headers commonly found in CVs
    SECTION_HEADERS = [
        # English
        "skills", "technical skills", "skill", "programming",
        "experience", "work experience", "professional experience", "employment",
        "education", "academic", "qualification", "degree",
        "projects", "personal projects", "achievements", "accomplishments",
        "summary", "profile", "objective", "about me", "introduction",
        "certifications", "certificates", "training", "courses",
        "languages", "interests", "hobbies", "references",
        "contact", "personal information", "details"
    ]
    
    @staticmethod
    def clean_text(text: str) -> str:
        """Clean and normalize text."""
        if not isinstance(text, str):
            return ""
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'[^\w\s.,;:()\-@#&/+]', '', text)
        text = re.sub(r'\s+([.,;:])', r'\1', text)
        return text.strip()
    
    @classmethod
    def fixed_length(cls, text: str, 
                     chunk_size: Optional[int] = None, 
                     overlap_percent: Optional[int] = None) -> List[Dict]:
        """
        Strategy 1: Fixed-Length Chunking
        - Chunk size: 300-400 characters
        - Overlap: 10%
        """
        # Explicit None checks so that overlap_percent=0 is honored
        if chunk_size is None:
            chunk_size = config.CHUNK_SIZE
        if overlap_percent is None:
            overlap_percent = config.CHUNK_OVERLAP_PERCENT
        overlap = int(chunk_size * overlap_percent / 100)
        
        text = cls.clean_text(text)
        chunks = []
        start = 0
        chunk_idx = 0
        
        while start < len(text):
            end = start + chunk_size
            
            # Try to break at a word boundary within 50 chars of the target end
            if end < len(text):
                # Clamp the window so small chunk sizes cannot search before `start`
                window_start = max(start + 1, end - 50)
                space_idx = text.rfind(' ', window_start, end + 50)
                if space_idx > start:
                    end = space_idx
            
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunks.append({
                    'text': chunk_text,
                    'start': start,
                    'end': end,
                    'index': chunk_idx,
                    'strategy': 'fixed'
                })
                chunk_idx += 1
            
            # Always advance by at least one character to guarantee termination
            start = max(end - overlap, start + 1)
        
        return chunks if chunks else [{'text': text[:chunk_size], 'start': 0, 'end': len(text), 'index': 0, 'strategy': 'fixed'}]
    
    @classmethod
    def sentence_based(cls, text: str, 
                       max_sentences: int = None) -> List[Dict]:
        """
        Strategy 2: Sentence-Based Chunking
        - Groups sentences together
        - Respects natural text boundaries
        """
        max_sentences = max_sentences or config.MAX_SENTENCES_PER_CHUNK
        
        text = cls.clean_text(text)
        
        try:
            sentences = sent_tokenize(text)
        except Exception:
            # Fall back to a naive regex split if the NLTK tokenizer is unavailable
            sentences = re.split(r'[.!?]+', text)
            sentences = [s.strip() for s in sentences if s.strip()]
        
        chunks = []
        current_sentences = []
        chunk_idx = 0
        
        for sentence in sentences:
            sentence = sentence.strip()
            if not sentence:
                continue
            
            current_sentences.append(sentence)
            
            # Check if we should create a chunk
            current_text = ' '.join(current_sentences)
            
            if len(current_sentences) >= max_sentences or len(current_text) >= config.CHUNK_SIZE_MAX:
                chunks.append({
                    'text': current_text,
                    'start': 0,
                    'end': len(current_text),
                    'index': chunk_idx,
                    'strategy': 'sentence',
                    'sentence_count': len(current_sentences)
                })
                chunk_idx += 1
                current_sentences = []
        
        # Add remaining sentences
        if current_sentences:
            remaining_text = ' '.join(current_sentences)
            chunks.append({
                'text': remaining_text,
                'start': 0,
                'end': len(remaining_text),  # chunk-relative, as in the loop above
                'index': chunk_idx,
                'strategy': 'sentence',
                'sentence_count': len(current_sentences)
            })
        
        return chunks if chunks else [{'text': text[:config.CHUNK_SIZE], 'index': 0, 'strategy': 'sentence'}]
    
    @classmethod
    def layout_aware(cls, text: str) -> List[Dict]:
        """
        Strategy 3: Layout-Aware Chunking
        - Splits by CV sections (Skills, Experience, Education, etc.)
        - Preserves document structure
        """
        text = cls.clean_text(text)
        
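        # Split into lines/sentences. clean_text() above has already collapsed
        # newlines into spaces, so in practice only the sentence-boundary
        # alternative of this pattern can match.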
        lines = re.split(r'[\n\r]+|(?<=[.!?])\s+', text)
        lines = [l.strip() for l in lines if l.strip()]
        
        def is_section_header(line: str) -> bool:
            normalized = line.lower().strip()
            # Check against known headers
            for header in cls.SECTION_HEADERS:
                if header in normalized and len(normalized) < 60:
                    return True
            # Check for common patterns
            if re.match(r'^[A-Z][A-Z\s]+:?\s*$', line) and len(line) < 40:
                return True
            if re.match(r'^\d+\.\s*[A-Z]', line):
                return True
            return False
        
        chunks = []
        current_section = ""
        current_content = []
        chunk_idx = 0
        
        for line in lines:
            if is_section_header(line):
                # Save previous section
                if current_content:
                    section_text = f"{current_section}: " if current_section else ""
                    section_text += ' '.join(current_content)
                    
                    # Split if too large
                    if len(section_text) > config.CHUNK_SIZE_MAX:
                        sub_chunks = cls.fixed_length(section_text)
                        for sc in sub_chunks:
                            sc['index'] = chunk_idx
                            sc['strategy'] = 'layout'
                            sc['section'] = current_section
                            chunks.append(sc)
                            chunk_idx += 1
                    else:
                        chunks.append({
                            'text': section_text.strip(),
                            'index': chunk_idx,
                            'strategy': 'layout',
                            'section': current_section
                        })
                        chunk_idx += 1
                
                current_section = line.strip()
                current_content = []
            else:
                current_content.append(line)
        
        # Add final section (unlike earlier sections, it is not split when it
        # exceeds CHUNK_SIZE_MAX)
        if current_content:
            section_text = f"{current_section}: " if current_section else ""
            section_text += ' '.join(current_content)
            chunks.append({
                'text': section_text.strip(),
                'index': chunk_idx,
                'strategy': 'layout',
                'section': current_section
            })
        
        return chunks if chunks else [{'text': text[:config.CHUNK_SIZE], 'index': 0, 'strategy': 'layout'}]
    
    @classmethod
    def chunk(cls, text: str, strategy: str = "fixed") -> List[Dict]:
        """
        Apply chunking strategy.
        
        Args:
            text: Text to chunk
            strategy: "fixed", "sentence", or "layout"
        """
        if strategy == "sentence":
            return cls.sentence_based(text)
        elif strategy == "layout":
            return cls.layout_aware(text)
        else:
            return cls.fixed_length(text)

# Test chunking strategies
print("✅ Chunking Strategies ready!")
print("\n📊 Testing with sample resume...")

if resume_manager.resumes:
    test_text = resume_manager.resumes[0]['text']
    
    print("\n" + "=" * 60)
    for strategy in ["fixed", "sentence", "layout"]:
        chunks = ChunkingStrategies.chunk(test_text, strategy)
        avg_len = sum(len(c['text']) for c in chunks) / len(chunks) if chunks else 0
        print(f"\n{config.CHUNKING_STRATEGIES[strategy]}:")
        print(f"   Chunks: {len(chunks)}")
        print(f"   Avg Length: {avg_len:.0f} chars")
        print(f"   Sample: {chunks[0]['text'][:100]}...")

✅ Chunking Strategies ready!

📊 Testing with sample resume...


Fixed-Length (300-400 chars):
   Chunks: 14
   Avg Length: 369 chars
   Sample: Skills  Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, J...

Sentence-Based:
   Chunks: 10
   Avg Length: 469 chars
   Sample: Skills  Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, J...

Layout-Aware (Sections/Headings):
   Chunks: 1
   Avg Length: 4695 chars
   Sample: Skills  Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, J...
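
The Layout-Aware run produced a single 4695-character chunk. One likely contributor is that `layout_aware` calls `clean_text` before splitting into lines, and `clean_text` replaces every newline with a space, so headings no longer sit on their own lines. The final section is also appended without a size check. The sketch below illustrates the first effect on a made-up resume snippet. It is an illustration of the mechanism, not a diagnosis of the actual dataset, which may contain no newlines to begin with.

```python
import re

raw = "SKILLS\nPython, SQL\n\nEDUCATION\nBSc Computer Science"

# Normalizing first replaces every newline with a space ...
normalized = re.sub(r'\s+', ' ', raw)
lines_after = [l for l in re.split(r'[\n\r]+', normalized) if l.strip()]

# ... whereas splitting the raw text first keeps each heading on its own line.
lines_before = [re.sub(r'\s+', ' ', l).strip()
                for l in re.split(r'[\n\r]+', raw) if l.strip()]

print(len(lines_after), lines_after)
# 1 ['SKILLS Python, SQL EDUCATION BSc Computer Science']
print(len(lines_before), lines_before)
# 4 ['SKILLS', 'Python, SQL', 'EDUCATION', 'BSc Computer Science']
```

Splitting on line breaks before normalizing each line would let the heading patterns in `is_section_header` see lines such as `SKILLS` and `EDUCATION` in isolation.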


## 🔗 Section 6: Embedding Models & Vector Database

### 3 Embedding Models Available:
| Model | Speed | Quality | Dimension |
|-------|-------|---------|-----------|
| MiniLM | ⚡ Fast | Good | 384 |
| MPNet | 🔥 Medium | Best | 768 |
| BGE | 🎯 Medium | Strong Retrieval | 768 |

In [11]:
# ============================================
# 🔗 EMBEDDING MODEL MANAGER
# ============================================

class EmbeddingManager:
    """
    Manages multiple embedding models with dynamic switching.
    """
    
    def __init__(self):
        self.models = {}
        self.current_model_id = None
        self.current_model = None
    
    def load_model(self, model_id: str) -> bool:
        """
        Load an embedding model (cached for reuse).
        """
        if model_id not in config.EMBEDDING_MODELS:
            print(f"❌ Unknown model: {model_id}")
            return False
        
        if model_id in self.models:
            self.current_model = self.models[model_id]
            self.current_model_id = model_id
            print(f"✅ Using cached model: {config.EMBEDDING_MODELS[model_id]['name']}")
            return True
        
        print(f"🔄 Loading model: {config.EMBEDDING_MODELS[model_id]['name']}...")
        try:
            self.models[model_id] = SentenceTransformer(model_id)
            self.current_model = self.models[model_id]
            self.current_model_id = model_id
            print(f"✅ Model loaded successfully!")
            return True
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            return False
    
    def encode(self, texts: List[str]) -> List[List[float]]:
        """
        Generate embeddings for texts.
        """
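        if self.current_model is None:
            # Fall back to the default model; fail loudly if it cannot be loaded
            if not self.load_model(config.DEFAULT_EMBEDDING_MODEL):
                raise RuntimeError("No embedding model could be loaded.")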
        
        return self.current_model.encode(texts).tolist()
    
    def get_model_info(self) -> Dict:
        """Get info about current model."""
        if self.current_model_id:
            return config.EMBEDDING_MODELS[self.current_model_id]
        return {}

# Initialize Embedding Manager
embedding_manager = EmbeddingManager()

# Load default model
embedding_manager.load_model(config.DEFAULT_EMBEDDING_MODEL)

print(f"\n📊 Available Models:")
for model_id, info in config.EMBEDDING_MODELS.items():
    marker = "→" if model_id == embedding_manager.current_model_id else " "
    print(f"  {marker} {info['name']}: {info['description']}")

🔄 Loading model: MiniLM (Fast)...
✅ Model loaded successfully!

📊 Available Models:
  → MiniLM (Fast): Lightweight, fast inference. Good for quick experiments.
    MPNet (High Quality): Best quality for semantic similarity tasks.
    BGE (Strong Retrieval): Optimized for retrieval tasks. Strong semantic matching.


In [12]:
# ============================================
# 🗄️ VECTOR DATABASE (ChromaDB)
# ============================================

class VectorStore:
    """
    ChromaDB vector store for resume retrieval.
    Supports dynamic model and strategy switching.
    """
    
    def __init__(self):
        self.client = chromadb.Client()
        self.collections = {}  # {name: collection}
        self.indexed_resumes = {}  # {collection_name: [resume_ids]}
    
    def create_collection(self, name: str, force_recreate: bool = False) -> chromadb.Collection:
        """
        Create or get a collection.
        """
        safe_name = re.sub(r'[^a-zA-Z0-9_-]', '_', name)[:50]
        
        if force_recreate and safe_name in self.collections:
            del self.collections[safe_name]
        
        if safe_name not in self.collections:
            # Drop any stale collection with the same name before recreating it
            try:
                self.client.delete_collection(safe_name)
            except Exception:
                pass  # Nothing to delete
            # Use cosine distance so that `1 - distance` in search() is a cosine
            # similarity; ChromaDB's default space is squared L2.
            self.collections[safe_name] = self.client.create_collection(
                name=safe_name,
                metadata={"hnsw:space": "cosine"}
            )
            self.indexed_resumes[safe_name] = []
        
        return self.collections[safe_name]
    
    def index_resumes(self, resumes: List[Dict], 
                      strategy: str = "fixed",
                      collection_name: str = None,
                      progress_callback = None) -> Dict:
        """
        Index resumes into vector store.
        """
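        if not collection_name:
            collection_name = f"resumes_{strategy}_{embedding_manager.current_model_id.split('/')[-1]}"
        
        # Normalize the name the same way create_collection() does, so the name
        # returned to the caller matches the key that search() looks up
        # (e.g. "bge-base-en-v1.5" contains a '.' that would otherwise be rewritten).
        collection_name = re.sub(r'[^a-zA-Z0-9_-]', '_', collection_name)[:50]
        
        collection = self.create_collection(collection_name, force_recreate=True)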
        
        all_ids = []
        all_docs = []
        all_metas = []
        
        total = len(resumes)
        
        for i, resume in enumerate(resumes):
            if progress_callback:
                progress_callback(i / total, f"Processing resume {i+1}/{total}")
            
            # Chunk the resume
            chunks = ChunkingStrategies.chunk(resume['text'], strategy)
            
            for j, chunk in enumerate(chunks):
                chunk_id = f"{resume['id']}_chunk_{j}"
                all_ids.append(chunk_id)
                all_docs.append(chunk['text'])
                all_metas.append({
                    'resume_id': resume['id'],
                    'category': resume.get('category', 'Unknown'),
                    'filename': resume.get('filename', ''),
                    'chunk_index': j,
                    'strategy': strategy
                })
        
        if progress_callback:
            progress_callback(0.9, "Generating embeddings...")
        
        # Generate embeddings
        all_embeddings = embedding_manager.encode(all_docs)
        
        if progress_callback:
            progress_callback(0.95, "Indexing into database...")
        
        # Add to collection
        collection.add(
            ids=all_ids,
            documents=all_docs,
            metadatas=all_metas,
            embeddings=all_embeddings
        )
        
        self.indexed_resumes[collection_name] = [r['id'] for r in resumes]
        
        return {
            'collection_name': collection_name,
            'total_chunks': len(all_ids),
            'total_resumes': len(resumes),
            'strategy': strategy,
            'model': embedding_manager.current_model_id
        }
    
    def search(self, query: str, collection_name: str, 
               top_k: int = 3, 
               return_unique_resumes: bool = True) -> List[Dict]:
        """
        Search for relevant resumes.
        """
        if collection_name not in self.collections:
            return []
        
        collection = self.collections[collection_name]
        
        # Generate query embedding
        query_embedding = embedding_manager.encode([query])[0]
        
        # Search
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k * 3 if return_unique_resumes else top_k
        )
        
        if not results['documents'] or not results['documents'][0]:
            return []
        
        # Format results
        formatted_results = []
        seen_resumes = set()
        
        for i, (doc, meta, dist) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )):
            resume_id = meta.get('resume_id', '')
            
            if return_unique_resumes:
                if resume_id in seen_resumes:
                    continue
                seen_resumes.add(resume_id)
            
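            # Convert distance to similarity. This reads as cosine similarity only if
            # the collection was created with cosine distance ("hnsw:space": "cosine");
            # under Chroma's default squared-L2 space the value can fall below 0.
            similarity = 1 - dist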
            
            formatted_results.append({
                'rank': len(formatted_results) + 1,
                'resume_id': resume_id,
                'category': meta.get('category', ''),
                'filename': meta.get('filename', ''),
                'chunk_text': doc,
                'similarity': similarity,
                'distance': dist
            })
            
            if len(formatted_results) >= top_k:
                break
        
        return formatted_results
    
    def get_collection_stats(self, collection_name: str) -> Dict:
        """Get statistics about a collection."""
        if collection_name not in self.collections:
            return {}
        
        collection = self.collections[collection_name]
        return {
            'name': collection_name,
            'total_chunks': collection.count(),
            'total_resumes': len(self.indexed_resumes.get(collection_name, []))
        }

# Initialize Vector Store
vector_store = VectorStore()
print("✅ Vector Store (ChromaDB) initialized!")

✅ Vector Store (ChromaDB) initialized!
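
The `search` method above over-fetches `top_k * 3` chunks and then keeps only the first (best-ranked) chunk per resume. The sketch below isolates that deduplication step on plain dictionaries. The helper name `dedupe_by_resume` and the sample hits are hypothetical, and the hits are assumed to arrive sorted by ascending distance, as a vector store query normally returns them.

```python
from typing import Dict, List


def dedupe_by_resume(hits: List[Dict], top_k: int) -> List[Dict]:
    """Keep the best-ranked chunk per resume, preserving the incoming order."""
    seen = set()
    unique: List[Dict] = []
    for hit in hits:  # assumed to be sorted by ascending distance
        resume_id = hit["resume_id"]
        if resume_id in seen:
            continue  # a better-ranked chunk of this resume was already kept
        seen.add(resume_id)
        unique.append({**hit, "rank": len(unique) + 1})
        if len(unique) >= top_k:
            break
    return unique


# Hypothetical chunk-level hits: one resume dominates the top of the list.
hits = [
    {"resume_id": "csv_2", "distance": 0.21},
    {"resume_id": "csv_2", "distance": 0.24},
    {"resume_id": "csv_6", "distance": 0.30},
    {"resume_id": "csv_2", "distance": 0.33},
    {"resume_id": "csv_8", "distance": 0.41},
]
deduped = dedupe_by_resume(hits, top_k=3)
print([(h["rank"], h["resume_id"]) for h in deduped])
```

The 3x over-fetch is a heuristic: if a few resumes contribute most of the nearest chunks, fewer than `top_k` unique resumes can come back.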


## 📝 Section 7: Ground Truth for Evaluation (Optional)

### Evaluation questions with expected answers
> This section is optional; it is intended for academic evaluation only.

In [14]:
# First, let's analyze what information is in our resumes
print("📋 Sample resume content for Ground Truth preparation:")
print("=" * 60)

if resume_manager.resumes:
    for i in range(min(3, len(resume_manager.resumes))):
        resume = resume_manager.resumes[i]
        print(f"\n--- Resume {i} (Category: {resume['category']}) ---")
        print(resume['text'][:500])
        print("...")
else:
    print("⚠️ No resumes loaded. Please run the CSV loading cell first.")

📋 Sample resume content for Ground Truth preparation:

--- Resume 0 (Category: Data Science) ---
Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, Naïve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib
...

--- Resume 1 (Category: Data Science) ---
Education Details 
May 2013 to May 2017 B.E   UIT-RGPV
Data Scientist 

Data Scientist - Matelabs
Skill Details 
Python- Exprience - Less than 1 year months
Statsmodels- Exprience - 12 months
AWS- Exprience - Less than 1 year months
Machine learning- Exprience - Less than 1 year months
Sklearn- Exprience - Less than 1 year months
Scipy- Exprience -

In [15]:
# Define Ground Truth: Questions + Expected Relevant Resume IDs
# Based on the actual content in our dataset

GROUND_TRUTH = [
    {
        "id": "Q1",
        "type": "Direct Information",
        "question": "Find resumes with Python programming experience",
        "relevant_ids": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
        "keywords": ["python", "programming"]
    },
    {
        "id": "Q2",
        "type": "Direct Information",
        "question": "Who has experience with Machine Learning?",
        "relevant_ids": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
        "keywords": ["machine learning", "ml", "deep learning"]
    },
    {
        "id": "Q3",
        "type": "Skills Query",
        "question": "Candidates with SQL database skills",
        "relevant_ids": ["0", "2", "3", "4", "6", "8"],
        "keywords": ["sql", "database", "mysql"]
    },
    {
        "id": "Q4",
        "type": "Experience Query",
        "question": "Who worked at major tech companies or consulting firms?",
        "relevant_ids": ["0", "3", "6", "8"],
        "keywords": ["ernst", "young", "deloitte", "accenture", "wipro", "ibm"]
    },
    {
        "id": "Q5",
        "type": "Skills Query",
        "question": "Find candidates with Data Visualization skills like Tableau",
        "relevant_ids": ["0", "3", "5", "6", "8"],
        "keywords": ["tableau", "visualization", "d3.js", "plotly"]
    },
    {
        "id": "Q6",
        "type": "Experience Query",
        "question": "Candidates with NLP or Natural Language Processing experience",
        "relevant_ids": ["0", "6", "7", "8"],
        "keywords": ["nlp", "natural language", "text mining", "sentiment"]
    },
    {
        "id": "Q7",
        "type": "Multi-step Query",
        "question": "Data Scientists with both Python and deep learning frameworks like TensorFlow or Keras",
        "relevant_ids": ["1", "6", "7"],
        "keywords": ["tensorflow", "keras", "deep learning", "neural network"]
    },
    {
        "id": "Q8",
        "type": "Comparison Query",
        "question": "Who has cloud platform experience like AWS or GCP?",
        "relevant_ids": ["1", "6", "8"],
        "keywords": ["aws", "gcp", "google cloud", "azure", "cloud"]
    }
]

print(f"✅ Created {len(GROUND_TRUTH)} Ground Truth questions")
print("\n📊 Question Types Distribution:")
types = {}
for q in GROUND_TRUTH:
    types[q['type']] = types.get(q['type'], 0) + 1
for t, c in types.items():
    print(f"  - {t}: {c}")

✅ Created 8 Ground Truth questions

📊 Question Types Distribution:
  - Direct Information: 2
  - Skills Query: 2
  - Experience Query: 2
  - Multi-step Query: 1
  - Comparison Query: 1


In [16]:
# Display Ground Truth table
gt_table = []
for q in GROUND_TRUTH:
    gt_table.append([
        q['id'],
        q['type'],
        q['question'][:50] + "...",
        len(q['relevant_ids'])
    ])

print("\n📋 Ground Truth Questions:")
print(tabulate(gt_table,
               headers=['ID', 'Type', 'Question', '# Relevant'],
               tablefmt='grid'))


📋 Ground Truth Questions:
+------+--------------------+-------------------------------------------------------+--------------+
| ID   | Type               | Question                                              |   # Relevant |
+======+====================+=======================================================+==============+
| Q1   | Direct Information | Find resumes with Python programming experience...    |           10 |
+------+--------------------+-------------------------------------------------------+--------------+
| Q2   | Direct Information | Who has experience with Machine Learning?...          |           10 |
+------+--------------------+-------------------------------------------------------+--------------+
| Q3   | Skills Query       | Candidates with SQL database skills...                |            6 |
+------+--------------------+-------------------------------------------------------+--------------+
| Q4   | Experience Query   | Who worked at major tech companies or consulting f... |            4 |
+------+--------------------+--------------------------------

## 📈 Section 8: Evaluation Metrics

### Evaluation metrics:
| Metric | Description |
|--------|-------------|
| **Precision@K** | Fraction of the top K results that are relevant |
| **Recall@K** | Fraction of all relevant results that are retrieved in the top K |
| **MRR** | Mean reciprocal rank of the first relevant result |
| **MAP** | Mean average precision |
| **nDCG@K** | Normalized Discounted Cumulative Gain |
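
For reference, the definitions below restate the table for a single query, with binary relevance $\mathrm{rel}_i \in \{0, 1\}$ at rank $i$, relevant set $R$, and $n$ retrieved results. They follow the binary-relevance conventions of the `RetrievalMetrics` class in the next cell; other conventions exist (graded relevance, a different AP normalizer) and would give different numbers.

$$
\begin{aligned}
\mathrm{Precision@}K &= \frac{1}{K}\sum_{i=1}^{K}\mathrm{rel}_i \\
\mathrm{Recall@}K &= \frac{1}{|R|}\sum_{i=1}^{K}\mathrm{rel}_i \\
\mathrm{RR} &= \frac{1}{\min\{\, i : \mathrm{rel}_i = 1 \,\}} \quad (0 \text{ if no relevant result is retrieved}) \\
\mathrm{AP} &= \frac{1}{|R|}\sum_{i=1}^{n}\mathrm{rel}_i \cdot \frac{1}{i}\sum_{j=1}^{i}\mathrm{rel}_j \\
\mathrm{DCG@}K &= \sum_{i=1}^{K}\frac{\mathrm{rel}_i}{\log_2(i+1)}, \qquad
\mathrm{nDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}
\end{aligned}
$$

MRR and MAP are the means of RR and AP over all queries. IDCG@K is the DCG@K of an ideal ranking that places $\min(K, |R|)$ relevant results first.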

In [17]:
class RetrievalMetrics:
    """
    Evaluation metrics for retrieval systems.
    """
    
    @staticmethod
    def precision_at_k(retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
        """
        Precision@K: What proportion of retrieved items are relevant?
        """
        if k == 0:
            return 0.0
        retrieved_k = retrieved_ids[:k]
        hits = sum(1 for doc_id in retrieved_k if doc_id in relevant_ids)
        return hits / k
    
    @staticmethod
    def recall_at_k(retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
        """
        Recall@K: What proportion of relevant items are retrieved?
        """
        if not relevant_ids:
            return 0.0
        retrieved_k = retrieved_ids[:k]
        hits = sum(1 for doc_id in retrieved_k if doc_id in relevant_ids)
        return hits / len(relevant_ids)
    
    @staticmethod
    def mrr(retrieved_ids: List[str], relevant_ids: set) -> float:
        """
        Reciprocal rank for one query: 1/rank of the first relevant result
        (0.0 if none is retrieved). MRR is the mean of this value over all queries.
        """
        for i, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                return 1.0 / i
        return 0.0
    
    @staticmethod
    def average_precision(retrieved_ids: List[str], relevant_ids: set) -> float:
        """
        Average Precision for a single query.
        """
        if not relevant_ids:
            return 0.0
        
        ap_sum = 0.0
        hits = 0
        
        for i, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                ap_sum += hits / i
        
        if hits == 0:
            return 0.0
        
        return ap_sum / len(relevant_ids)
    
    @staticmethod
    def dcg_at_k(retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
        """
        Discounted Cumulative Gain at K.
        """
        dcg = 0.0
        for i, doc_id in enumerate(retrieved_ids[:k], start=1):
            rel = 1.0 if doc_id in relevant_ids else 0.0
            if rel > 0:
                dcg += rel / math.log2(i + 1)
        return dcg
    
    @classmethod
    def ndcg_at_k(cls, retrieved_ids: List[str], relevant_ids: set, k: int) -> float:
        """
        Normalized DCG at K.
        """
        if not relevant_ids:
            return 0.0
        
        dcg = cls.dcg_at_k(retrieved_ids, relevant_ids, k)
        
        # Ideal ranking: all relevant docs first
        ideal_order = list(relevant_ids)[:k]
        idcg = cls.dcg_at_k(ideal_order, relevant_ids, k)
        
        if idcg == 0.0:
            return 0.0
        
        return dcg / idcg

# Test metrics
test_retrieved = ["0", "1", "5", "2", "3"]
test_relevant = {"0", "2", "3", "4"}

print("🧪 Testing metrics with sample data:")
print(f"  Retrieved: {test_retrieved}")
print(f"  Relevant:  {test_relevant}")
print(f"\n  Precision@5: {RetrievalMetrics.precision_at_k(test_retrieved, test_relevant, 5):.3f}")
print(f"  Recall@5:    {RetrievalMetrics.recall_at_k(test_retrieved, test_relevant, 5):.3f}")
print(f"  MRR:         {RetrievalMetrics.mrr(test_retrieved, test_relevant):.3f}")
print(f"  MAP:         {RetrievalMetrics.average_precision(test_retrieved, test_relevant):.3f}")
print(f"  nDCG@5:      {RetrievalMetrics.ndcg_at_k(test_retrieved, test_relevant, 5):.3f}")

🧪 Testing metrics with sample data:
  Retrieved: ['0', '1', '5', '2', '3']
  Relevant:  {'3', '2', '4', '0'}

  Precision@5: 0.600
  Recall@5:    0.750
  MRR:         1.000
  MAP:         0.525
  nDCG@5:      0.710
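
The sample values above can be reproduced by hand. In the test list the relevant resumes appear at ranks 1, 4, and 5, and there are 4 relevant resumes in total. The following sketch recomputes the metrics from those ranks alone, using only the standard library; the variable names are illustrative and not part of the notebook.

```python
import math

# Ranks (1-based) at which relevant resumes appear in the sample run.
hit_ranks = [1, 4, 5]
num_relevant = 4  # size of the relevant set {"0", "2", "3", "4"}
k = 5

precision = len(hit_ranks) / k
recall = len(hit_ranks) / num_relevant
reciprocal_rank = 1 / hit_ranks[0]

# Average precision: precision at each hit rank, divided by the relevant-set size.
average_precision = sum(
    (n_hits / rank) for n_hits, rank in enumerate(hit_ranks, start=1)
) / num_relevant

# DCG with binary gains, and the ideal DCG with all relevant items ranked first.
dcg = sum(1 / math.log2(rank + 1) for rank in hit_ranks)
idcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(k, num_relevant) + 1))
ndcg = dcg / idcg

print(f"P@5={precision:.3f} R@5={recall:.3f} RR={reciprocal_rank:.3f} "
      f"AP={average_precision:.3f} nDCG@5={ndcg:.3f}")
```

Average precision works out to (1/1 + 2/4 + 3/5) / 4 = 0.525, matching the MAP line in the output.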


## 🔄 Section 8 (continued): Evaluation Pipeline

### Running evaluation for each chunking strategy

In [18]:
def evaluate_strategy(strategy_name: str, collection: chromadb.Collection, 
                      ground_truth: List[Dict], k: int = 10) -> Dict[str, float]:
    """
    Evaluate a chunking strategy using Ground Truth questions.
    Returns average metrics across all queries.
    """
    all_precision = []
    all_recall = []
    all_mrr = []
    all_map = []
    all_ndcg = []
    
    query_results = []  # Store detailed results
    
    for gt in ground_truth:
        query = gt['question']
        relevant_ids = set(gt['relevant_ids'])
        
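        # Query the collection with an embedding from the same model used for
        # indexing (mirrors VectorStore.search); query_texts would instead use
        # the collection's default embedding function.
        query_embedding = embedding_manager.encode([query])[0]
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=k * 3  # Over-fetch chunks, then dedupe to unique resumes
        )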
        
        # Extract unique resume IDs from chunk results
        retrieved_resume_ids = []
        seen_ids = set()
        
        if results['metadatas'] and results['metadatas'][0]:
            for meta in results['metadatas'][0]:
                resume_id = meta.get('resume_id', '')
                if resume_id and resume_id not in seen_ids:
                    seen_ids.add(resume_id)
                    retrieved_resume_ids.append(resume_id)
                    if len(retrieved_resume_ids) >= k:
                        break
        
        # Calculate metrics
        precision = RetrievalMetrics.precision_at_k(retrieved_resume_ids, relevant_ids, k)
        recall = RetrievalMetrics.recall_at_k(retrieved_resume_ids, relevant_ids, k)
        mrr = RetrievalMetrics.mrr(retrieved_resume_ids, relevant_ids)
        map_score = RetrievalMetrics.average_precision(retrieved_resume_ids, relevant_ids)
        ndcg = RetrievalMetrics.ndcg_at_k(retrieved_resume_ids, relevant_ids, k)
        
        all_precision.append(precision)
        all_recall.append(recall)
        all_mrr.append(mrr)
        all_map.append(map_score)
        all_ndcg.append(ndcg)
        
        query_results.append({
            'query_id': gt['id'],
            'query_type': gt['type'],
            'retrieved': retrieved_resume_ids[:5],
            'precision': precision,
            'recall': recall,
            'mrr': mrr
        })
    
    return {
        'Precision@K': np.mean(all_precision),
        'Recall@K': np.mean(all_recall),
        'MRR': np.mean(all_mrr),
        'MAP': np.mean(all_map),
        'nDCG@K': np.mean(all_ndcg),
        'query_results': query_results
    }

print("✅ Evaluation pipeline ready!")

✅ Evaluation pipeline ready!


In [20]:
# Run evaluation for all chunking strategies
K = 10  # Number of results to retrieve

# Map strategy names to the keys used in config
strategies = {
    "Fixed-Length": "fixed",
    "Sentence-Based": "sentence",
    "Layout-Aware": "layout"
}

evaluation_results = {}

print("🔄 Running evaluation for each chunking strategy...")
print("=" * 60)

if not resume_manager.resumes:
    print("❌ No resumes loaded. Please run the CSV loading cell first.")
else:
    for strategy_name, strategy_key in strategies.items():
        print(f"\n📊 Evaluating: {strategy_name}")
        
        # Index resumes with this strategy
        print(f"   Creating collection and indexing...")
        result = vector_store.index_resumes(
            resume_manager.resumes,
            strategy=strategy_key
        )
        collection = vector_store.collections[result['collection_name']]
        print(f"   Collection size: {collection.count()} chunks")
        
        # Evaluate
        print(f"   Running queries...")
        results = evaluate_strategy(strategy_name, collection, GROUND_TRUTH, k=K)
        evaluation_results[strategy_name] = results
        
        print(f"   ✅ Precision@{K}: {results['Precision@K']:.3f}")
        print(f"   ✅ Recall@{K}: {results['Recall@K']:.3f}")
        print(f"   ✅ MRR: {results['MRR']:.3f}")

    print("\n" + "=" * 60)
    print("✅ Evaluation complete!")

🔄 Running evaluation for each chunking strategy...

📊 Evaluating: Fixed-Length
   Creating collection and indexing...
   Collection size: 186 chunks
   Running queries...
   ✅ Precision@10: 0.000
   ✅ Recall@10: 0.000
   ✅ MRR: 0.000

📊 Evaluating: Sentence-Based
   Creating collection and indexing...
   Collection size: 124 chunks
   Running queries...
   ✅ Precision@10: 0.000
   ✅ Recall@10: 0.000
   ✅ MRR: 0.000

📊 Evaluating: Layout-Aware
   Creating collection and indexing...
   Collection size: 68 chunks
   Running queries...
   ✅ Precision@10: 0.000
   ✅ Recall@10: 0.000
   ✅ MRR: 0.000

✅ Evaluation complete!


## 📊 Section 9: Results Comparison Table

### Comparison table across strategies

In [21]:
# Create comparison table
comparison_data = []

for strategy_name, results in evaluation_results.items():
    comparison_data.append([
        strategy_name,
        f"{results['Precision@K']:.4f}",
        f"{results['Recall@K']:.4f}",
        f"{results['MRR']:.4f}",
        f"{results['MAP']:.4f}",
        f"{results['nDCG@K']:.4f}"
    ])

print("\n" + "=" * 80)
print("📊 CHUNKING STRATEGY COMPARISON TABLE")
print("=" * 80)
print(tabulate(
    comparison_data,
    headers=['Strategy', f'Precision@{K}', f'Recall@{K}', 'MRR', 'MAP', f'nDCG@{K}'],
    tablefmt='grid',
    numalign='center'
))
print("=" * 80)


📊 CHUNKING STRATEGY COMPARISON TABLE
+----------------+----------------+-------------+-------+-------+-----------+
| Strategy       |  Precision@10  |  Recall@10  |  MRR  |  MAP  |  nDCG@10  |
+================+================+=============+=======+=======+===========+
| Fixed-Length   |       0        |      0      |   0   |   0   |     0     |
+----------------+----------------+-------------+-------+-------+-----------+
| Sentence-Based |       0        |      0      |   0   |   0   |     0     |
+----------------+----------------+-------------+-------+-------+-----------+
| Layout-Aware   |       0        |      0      |   0   |   0   |     0     |
+----------------+----------------+-------------+-------+-------+-----------+


In [22]:
# Create a pandas DataFrame for better visualization
results_df = pd.DataFrame(comparison_data, 
                          columns=['Strategy', f'Precision@{K}', f'Recall@{K}', 'MRR', 'MAP', f'nDCG@{K}'])
results_df = results_df.set_index('Strategy')

# Convert to numeric for analysis
results_df_numeric = results_df.astype(float)

print("\n📊 Results DataFrame:")
display(results_df)


📊 Results DataFrame:

| Strategy | Precision@10 | Recall@10 | MRR | MAP | nDCG@10 |
|----------|--------------|-----------|-----|-----|---------|
| Fixed-Length | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Sentence-Based | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Layout-Aware | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [23]:
# Determine the best strategy
# Calculate overall score (average of all metrics)
overall_scores = results_df_numeric.mean(axis=1)
best_strategy = overall_scores.idxmax()
best_score = overall_scores.max()

print("\n" + "=" * 60)
print("🏆 FINAL RESULTS")
print("=" * 60)

print("\n📈 Overall Scores (Average of all metrics):")
for strategy, score in overall_scores.sort_values(ascending=False).items():
    marker = "🥇" if strategy == best_strategy else "  "
    print(f"  {marker} {strategy}: {score:.4f}")

print(f"\n🎯 Best Chunking Strategy: {best_strategy}")
print(f"   Overall Score: {best_score:.4f}")

# Best per metric
print("\n📊 Best Strategy per Metric:")
for col in results_df_numeric.columns:
    best_for_metric = results_df_numeric[col].idxmax()
    best_value = results_df_numeric[col].max()
    print(f"  - {col}: {best_for_metric} ({best_value:.4f})")


🏆 FINAL RESULTS

📈 Overall Scores (Average of all metrics):
  🥇 Fixed-Length: 0.0000
     Sentence-Based: 0.0000
     Layout-Aware: 0.0000

🎯 Best Chunking Strategy: Fixed-Length
   Overall Score: 0.0000

📊 Best Strategy per Metric:
  - Precision@10: Fixed-Length (0.0000)
  - Recall@10: Fixed-Length (0.0000)
  - MRR: Fixed-Length (0.0000)
  - MAP: Fixed-Length (0.0000)
  - nDCG@10: Fixed-Length (0.0000)
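
`idxmax` returns the first label when several rows share the maximum, so with every score at 0.0000 "Fixed-Length" is reported as best only because it comes first. The sketch below shows one way to surface ties instead of silently picking a winner. It works on a plain dictionary rather than the DataFrame, and `pick_best` and its tolerance are hypothetical, not part of the notebook.

```python
from typing import Dict, Optional


def pick_best(overall_scores: Dict[str, float], tol: float = 1e-9) -> Optional[str]:
    """Return the single top-scoring strategy, or None when the top score is tied."""
    if not overall_scores:
        return None
    top = max(overall_scores.values())
    leaders = [name for name, score in overall_scores.items() if abs(score - top) <= tol]
    return leaders[0] if len(leaders) == 1 else None


# Scores as reported in the run above: all strategies are tied at zero.
scores = {"Fixed-Length": 0.0, "Sentence-Based": 0.0, "Layout-Aware": 0.0}
best = pick_best(scores)
print(best if best is not None else "No clear winner (tied scores)")
```

With the all-zero scores above, the sketch reports a tie rather than naming a best strategy.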


## 📋 Section 10: Detailed Query Results

In [24]:
# Show detailed results for best strategy
print(f"\n📋 Detailed Query Results for Best Strategy: {best_strategy}")
print("=" * 80)

detailed_results = []
for qr in evaluation_results[best_strategy]['query_results']:
    detailed_results.append([
        qr['query_id'],
        qr['query_type'],
        str(qr['retrieved'][:3]),
        f"{qr['precision']:.3f}",
        f"{qr['recall']:.3f}",
        f"{qr['mrr']:.3f}"
    ])

print(tabulate(
    detailed_results,
    headers=['Query', 'Type', 'Top 3 Retrieved', 'Precision', 'Recall', 'MRR'],
    tablefmt='grid'
))


📋 Detailed Query Results for Best Strategy: Fixed-Length
+---------+--------------------+------------------------------+-------------+----------+-------+
| Query   | Type               | Top 3 Retrieved              |   Precision |   Recall |   MRR |
+=========+====================+==============================+=============+==========+=======+
| Q1      | Direct Information | ['csv_2', 'csv_12', 'csv_6'] |           0 |        0 |     0 |
+---------+--------------------+------------------------------+-------------+----------+-------+
| Q2      | Direct Information | ['csv_6', 'csv_16', 'csv_7'] |           0 |        0 |     0 |
+---------+--------------------+------------------------------+-------------+----------+-------+
| Q3      | Skills Query       | ['csv_8', 'csv_18', 'csv_3'] |           0 |        0 |     0 |
+---------+--------------------+------------------------------+-------------+----------+-------+
| Q4      | Experience Query   | ['csv_3', 'csv_13', 'csv_6'] |           0 |        0 |     0 |
+---------+--------------------+------------------------------+---

In [25]:
# Compare Ground Truth vs Retrieved for each question
print("\n📊 Ground Truth vs Retrieved Results Comparison")
print("=" * 80)

for i, gt in enumerate(GROUND_TRUTH):
    qr = evaluation_results[best_strategy]['query_results'][i]
    print(f"\n{gt['id']}: {gt['question'][:60]}...")
    print(f"  Expected:  {gt['relevant_ids'][:5]}")
    print(f"  Retrieved: {qr['retrieved'][:5]}")
    print(f"  Precision: {qr['precision']:.3f} | Recall: {qr['recall']:.3f} | MRR: {qr['mrr']:.3f}")


📊 Ground Truth vs Retrieved Results Comparison

Q1: Find resumes with Python programming experience...
  Expected:  ['0', '1', '2', '3', '4']
  Retrieved: ['csv_2', 'csv_12', 'csv_6', 'csv_16', 'csv_8']
  Precision: 0.000 | Recall: 0.000 | MRR: 0.000

Q2: Who has experience with Machine Learning?...
  Expected:  ['0', '1', '2', '3', '4']
  Retrieved: ['csv_6', 'csv_16', 'csv_7', 'csv_17', 'csv_8']
  Precision: 0.000 | Recall: 0.000 | MRR: 0.000

Q3: Candidates with SQL database skills...
  Expected:  ['0', '2', '3', '4', '6']
  Retrieved: ['csv_8', 'csv_18', 'csv_3', 'csv_13', 'csv_6']
  Precision: 0.000 | Recall: 0.000 | MRR: 0.000

Q4: Who worked at major tech companies or consulting firms?...
  Expected:  ['0', '3', '6', '8']
  Retrieved: ['csv_3', 'csv_13', 'csv_6', 'csv_16', 'csv_9']
  Precision: 0.000 | Recall: 0.000 | MRR: 0.000

Q5: Find candidates with Data Visualization skills like Tableau...
  Expected:  ['0', '3', '5', '6', '8']
  Retrieved: ['csv_6', 'csv_16', 'csv_0',
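
The comparison above points to a likely cause of the all-zero metrics: the collection returns `resume_id` values such as `csv_2`, while `GROUND_TRUTH` lists bare indices such as `"2"`, so the two sets never intersect. The sketch below shows one possible fix, normalizing IDs before comparing them. It assumes that `csv_<n>` corresponds to ground-truth index `<n>`, which the notebook does not confirm, and `normalize_id` is a hypothetical helper.

```python
def normalize_id(resume_id: str, prefix: str = "csv_") -> str:
    """Strip the assumed 'csv_' prefix so IDs match the ground-truth format."""
    return resume_id[len(prefix):] if resume_id.startswith(prefix) else resume_id


# Q1 values copied from the output above.
retrieved = ["csv_2", "csv_12", "csv_6", "csv_16", "csv_8"]
relevant = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

raw_hits = sum(1 for rid in retrieved if rid in relevant)
normalized_hits = sum(1 for rid in retrieved if normalize_id(rid) in relevant)
print(f"hits without normalization: {raw_hits}, with normalization: {normalized_hits}")
```

Even after normalization, IDs such as `csv_12` and `csv_16` have no counterpart in a ground truth that only lists indices 0 to 9, so the relevant sets may also need to be extended before the metrics become meaningful.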

## 🖥️ Section 11: Interactive Gradio UI

### An integrated interactive interface that includes:
- 📄 PDF file upload
- 🔤 Embedding model selection
- ✂️ Chunking strategy selection
- 🔍 Search with Top-K control
- 📊 Display of results and metrics

In [28]:
# ============================================
# 🖥️ GRADIO INTERACTIVE UI
# ============================================

import gradio as gr

# Global state
current_collection = None

def upload_pdfs(files, category):
    """Handle PDF file uploads."""
    global current_collection
    
    if not files:
        return "❌ No files uploaded", resume_manager.summary()
    
    results = []
    for file in files:
        result = resume_manager.add_pdf(file.name, category or "Uploaded")
        if result['success']:
            results.append(f"✅ {result['filename']} ({result['pages']} pages, {result['words']} words)")
        else:
            results.append(f"❌ {os.path.basename(file.name)}: {result.get('error', 'Unknown error')}")
    
    current_collection = None  # Reset to force re-indexing
    
    return "\n".join(results), resume_manager.summary()

def change_embedding_model(model_name):
    """Switch embedding model."""
    global current_collection
    
    # Find model ID from name
    model_id = None
    for mid, info in config.EMBEDDING_MODELS.items():
        if info['name'] == model_name:
            model_id = mid
            break
    
    if model_id:
        success = embedding_manager.load_model(model_id)
        current_collection = None  # Reset to force re-indexing
        if success:
            return f"✅ Switched to {model_name}"
        return f"❌ Failed to load {model_name}"
    return "❌ Unknown model"

def index_resumes(strategy_name, progress=gr.Progress()):
    """Index all resumes with selected strategy."""
    global current_collection
    
    if not resume_manager.resumes:
        return "❌ No resumes loaded. Please upload PDFs or load CSV first.", ""
    
    # Find strategy key from name
    strategy_key = None
    for key, name in config.CHUNKING_STRATEGIES.items():
        if name == strategy_name:
            strategy_key = key
            break
    
    if not strategy_key:
        strategy_key = "fixed"
    
    progress(0, desc="Starting indexing...")
    
    result = vector_store.index_resumes(
        resume_manager.resumes,
        strategy=strategy_key,
        progress_callback=lambda p, msg: progress(p, desc=msg)
    )
    
    current_collection = result['collection_name']
    
    stats = f"""
✅ **Indexing Complete!**

📊 **Statistics:**
- Total Resumes: {result['total_resumes']}
- Total Chunks: {result['total_chunks']}
- Strategy: {strategy_name}
- Model: {config.EMBEDDING_MODELS[result['model']]['name']}
"""
    return stats, current_collection

def search_resumes(query, top_k):
    """Search for relevant resumes using the current collection."""
    global current_collection
    
    if not query.strip():
        return "❌ Please enter a search query."
    
    # Auto-index if no collection exists
    if not current_collection:
        if not resume_manager.resumes:
            return "❌ No resumes loaded. Please upload PDFs or load CSV first."
        # Auto-index with default strategy
        result = vector_store.index_resumes(
            resume_manager.resumes,
            strategy="fixed"
        )
        current_collection = result['collection_name']
    
    results = vector_store.search(query, current_collection, top_k=int(top_k))
    
    if not results:
        return "No results found."
    
    output = f"## 🔍 Search Results for: '{query}'\n\n"
    output += f"**Top-K:** {int(top_k)} | **Total Resumes:** {len(resume_manager.resumes)}\n\n"
    output += "---\n\n"
    
    for r in results:
        # Clamp to 0..10 so a negative or >1 similarity cannot distort the bar
        filled = max(0, min(10, int(r['similarity'] * 10)))
        similarity_bar = "🟢" * filled + "⚪" * (10 - filled)
        
        output += f"### #{r['rank']} - {r['filename'] or r['resume_id']}\n"
        output += f"**Category:** {r['category']} | **Similarity:** {r['similarity']:.3f} {similarity_bar}\n\n"
        output += f"**Relevant Text:**\n"
        output += f"> {r['chunk_text'][:500]}{'...' if len(r['chunk_text']) > 500 else ''}\n\n"
        output += "---\n\n"
    
    return output

def compare_strategies(query, top_k, progress=gr.Progress()):
    """Compare all chunking strategies."""
    if not query.strip():
        return "❌ Please enter a search query."
    
    if not resume_manager.resumes:
        return "❌ No resumes loaded."
    
    comparison = "## 📊 Strategy Comparison\n\n"
    comparison += f"**Query:** {query}\n\n"
    
    for i, (strategy_key, strategy_name) in enumerate(config.CHUNKING_STRATEGIES.items()):
        progress((i + 1) / len(config.CHUNKING_STRATEGIES), desc=f"Testing {strategy_name}...")
        
        # Index with this strategy
        result = vector_store.index_resumes(
            resume_manager.resumes,
            strategy=strategy_key
        )
        
        # Search
        results = vector_store.search(query, result['collection_name'], top_k=int(top_k))
        
        comparison += f"### {strategy_name}\n"
        comparison += f"- Chunks created: {result['total_chunks']}\n"
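        # Format the top hit outside the f-string: a conditional placed after ':'
        # is parsed as part of the format spec and raises ValueError.
        top = results[0] if results else None
        top_similarity = f"{top['similarity']:.3f}" if top else "N/A"
        comparison += f"- Top result: {top['filename'] if top else 'N/A'}\n"
        comparison += f"- Top similarity: {top_similarity}\n\n"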
        
        if results:
            comparison += "**Top 3 Results:**\n"
            for r in results[:3]:
                comparison += f"  - {r['filename'] or r['resume_id']}: {r['similarity']:.3f}\n"
        
        comparison += "\n---\n\n"
    
    return comparison

# Build UI
with gr.Blocks(title="📄 Resume RAG System", theme=gr.themes.Soft()) as demo:
    gr.Markdown("""
    # 📄 Resume Retrieval System (RAG)
    
    A resume retrieval system using RAG, without an LLM
    
    ### 📋 Instructions:
    1. Upload PDF resumes or use existing CSV data
    2. Select embedding model and chunking strategy
    3. Click "Index Resumes" to process
    4. Search using natural language queries
    """)
    
    with gr.Tabs():
        # Tab 1: Upload & Configure
        with gr.Tab("📤 Upload & Configure"):
            with gr.Row():
                with gr.Column(scale=2):
                    file_upload = gr.File(
                        label="📄 Upload PDF Resumes",
                        file_count="multiple",
                        file_types=[".pdf"],
                        type="filepath"
                    )
                    category_input = gr.Textbox(
                        label="Category (optional)",
                        placeholder="e.g., Data Science, Engineering",
                        value="Uploaded"
                    )
                    upload_btn = gr.Button("📤 Upload Files", variant="primary")
                    upload_status = gr.Textbox(label="Upload Status", lines=3)
                
                with gr.Column(scale=1):
                    resume_summary = gr.Textbox(
                        label="📊 Resume Summary",
                        lines=10,
                        value=resume_manager.summary()
                    )
            
            upload_btn.click(
                upload_pdfs,
                inputs=[file_upload, category_input],
                outputs=[upload_status, resume_summary]
            )
        
        # Tab 2: Model & Strategy
        with gr.Tab("⚙️ Model & Strategy"):
            with gr.Row():
                with gr.Column():
                    model_dropdown = gr.Dropdown(
                        label="🔤 Embedding Model",
                        choices=[info['name'] for info in config.EMBEDDING_MODELS.values()],
                        value=config.EMBEDDING_MODELS[config.DEFAULT_EMBEDDING_MODEL]['name']
                    )
                    model_status = gr.Textbox(label="Model Status")
                    model_dropdown.change(
                        change_embedding_model,
                        inputs=[model_dropdown],
                        outputs=[model_status]
                    )
                
                with gr.Column():
                    strategy_dropdown = gr.Dropdown(
                        label="✂️ Chunking Strategy",
                        choices=list(config.CHUNKING_STRATEGIES.values()),
                        value=config.CHUNKING_STRATEGIES["fixed"]
                    )
            
            with gr.Row():
                index_btn = gr.Button("🔄 Index Resumes", variant="primary", scale=2)
                collection_display = gr.Textbox(label="Collection Name", scale=1)
            
            index_status = gr.Markdown()
            
            index_btn.click(
                index_resumes,
                inputs=[strategy_dropdown],
                outputs=[index_status, collection_display]
            )
        
        # Tab 3: Search
        with gr.Tab("🔍 Search"):
            with gr.Row():
                query_input = gr.Textbox(
                    label="🔍 Search Query",
                    placeholder="e.g., 'Python machine learning experience'",
                    lines=2,
                    scale=3
                )
                top_k_slider = gr.Slider(
                    minimum=config.TOP_K_MIN,
                    maximum=config.TOP_K_MAX,
                    value=config.TOP_K_DEFAULT,
                    step=1,
                    label="Top-K Results",
                    scale=1
                )
            
            search_btn = gr.Button("🔍 Search", variant="primary")
            search_results = gr.Markdown()
            
            search_btn.click(
                search_resumes,
                inputs=[query_input, top_k_slider],
                outputs=[search_results]
            )
            
            gr.Markdown("### 💡 Example Queries:")
            gr.Examples(
                examples=[
                    ["Python programming experience", 3],
                    ["Machine learning and data analysis", 5],
                    ["SQL database skills", 3],
                    ["Cloud experience AWS or GCP", 3],
                    ["Natural language processing NLP", 3]
                ],
                inputs=[query_input, top_k_slider]
            )
        
        # Tab 4: Compare Strategies
        with gr.Tab("📊 Compare Strategies"):
            gr.Markdown("### Compare all 3 chunking strategies on the same query")
            
            compare_query = gr.Textbox(
                label="Query for Comparison",
                placeholder="Enter query to test all strategies..."
            )
            compare_k = gr.Slider(
                minimum=1, maximum=5, value=3, step=1,
                label="Top-K for Comparison"
            )
            compare_btn = gr.Button("📊 Compare All Strategies", variant="primary")
            compare_results = gr.Markdown()
            
            compare_btn.click(
                compare_strategies,
                inputs=[compare_query, compare_k],
                outputs=[compare_results]
            )

print("✅ Gradio UI ready!")

✅ Gradio UI ready!


In [31]:
# ============================================
# 🚀 LAUNCH GRADIO UI
# ============================================

# Launch the interface (auto-select available port)
demo.launch(
    share=False,  # Set to True if you want a public link
    server_name="127.0.0.1",
    server_port=None,  # Auto-select available port
    inbrowser=True  # Open browser automatically
)

* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.




Traceback (most recent call last):
  File "c:\Users\abrah\anaconda3\envs\rag\Lib\site-packages\gradio\queueing.py", line 763, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\abrah\anaconda3\envs\rag\Lib\site-packages\gradio\route_utils.py", line 354, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\abrah\anaconda3\envs\rag\Lib\site-packages\gradio\blocks.py", line 2125, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\abrah\anaconda3\envs\rag\Lib\site-packages\gradio\blocks.py", line 1607, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\abrah\anaconda3\envs\rag\Lib\site-packages\anyio\to_thread.py", line 61, in run_sync
    return await get

## 📝 Section 12: Summary & How to Use

### How to Use:

#### 1️⃣ Run the cells in order
- Run all cells from start to finish

#### 2️⃣ Upload resumes
- Go to the "Upload & Configure" tab
- Upload PDF files
- Or use the data from the CSV

#### 3️⃣ Set up the model
- Choose an Embedding Model
- Choose a Chunking Strategy
- Click "Index Resumes"

#### 4️⃣ Search
- Type your query (e.g., "Who knows Python?")
- Set the number of results (Top-K)
- Click Search

#### 5️⃣ Compare
- Open the "Compare Strategies" tab
- Compare the performance of the different strategies
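
Step 3 asks for a chunking strategy. As a rough illustration of what a fixed-size strategy with a percentage overlap does, here is a minimal sketch. The function name `fixed_size_chunks` and its parameters are hypothetical and not taken from the notebook; the notebook's own implementation (which works with a `CHUNK_SIZE_MIN` to `CHUNK_SIZE_MAX` range) may behave differently.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap_percent: float) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share roughly overlap_percent of chunk_size characters.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap_percent must be in [0, 100)")

    overlap = int(chunk_size * overlap_percent / 100)
    # Guard against a zero step so the loop always advances.
    step = max(1, chunk_size - overlap)

    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


example = fixed_size_chunks("abcdefghij", chunk_size=4, overlap_percent=50)
print(example)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With a 4-character chunk and 50% overlap, each chunk starts 2 characters after the previous one, so neighbouring chunks share half their text. A larger overlap keeps more context across chunk boundaries at the cost of indexing more chunks.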

In [None]:
# ============================================
# 📊 SYSTEM SUMMARY
# ============================================

print("=" * 60)
print("📄 RESUME RAG SYSTEM - SUMMARY")
print("=" * 60)

print(f"""
🎯 FEATURES:
✅ PDF Upload & Parsing (pdfplumber)
✅ 3 Embedding Models (MiniLM, MPNet, BGE)
✅ 3 Chunking Strategies (Fixed, Sentence, Layout)
✅ Top-K Control (1-10, default=3)
✅ Strategy Comparison Tool
✅ Interactive Gradio UI

📦 CONFIGURATION:
• Chunk Size: {config.CHUNK_SIZE_MIN}-{config.CHUNK_SIZE_MAX} chars
• Overlap: {config.CHUNK_OVERLAP_PERCENT}%
• Default Model: {config.DEFAULT_EMBEDDING_MODEL}
• Default Top-K: {config.TOP_K_DEFAULT}

📁 DATA:
{resume_manager.summary()}

🚀 TO START:
1. Run all cells in order
2. Upload PDFs or use CSV data
3. Select model & strategy
4. Index resumes
5. Search!

✅ SYSTEM READY!
""")

---

## 📌 END OF NOTEBOOK

### Implemented Features ✅

| Feature | Status |
|---------|--------|
| PDF Upload & Parsing | ✅ |
| 3 Embedding Models | ✅ |
| 3 Chunking Strategies | ✅ |
| Top-K Control (1-10) | ✅ |
| Interactive Gradio UI | ✅ |
| Strategy Comparison | ✅ |
| Evaluation Metrics | ✅ |
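
The table lists evaluation metrics without restating them here. As a generic illustration of how Top-K retrieval is commonly scored, the sketch below computes precision@k and recall@k. This is an assumption about typical retrieval metrics, not a copy of the notebook's own metric code; the function names and the resume IDs are hypothetical, and the retrieved IDs are assumed to be unique.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the k requested results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if k <= 0 or not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranking returned for one query, best match first.
ranked_ids = ["resume_1", "resume_4", "resume_2"]
# Hypothetical ground truth: resumes a human judged relevant to that query.
relevant_ids = {"resume_1", "resume_2", "resume_3"}

p_at_2 = precision_at_k(ranked_ids, relevant_ids, k=2)
r_at_2 = recall_at_k(ranked_ids, relevant_ids, k=2)
```

In this toy case one of the top two results is relevant, so precision@2 is 1/2, while recall@2 is 1/3 because only one of the three relevant resumes was found. If several chunks from the same resume can be returned, deduplicating by resume ID before scoring avoids counting the same resume twice.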

### 🔧 Technical Stack:
- **PDF Parser**: pdfplumber (with PyMuPDF option)
- **Embeddings**: SentenceTransformer (Free, Local)
- **Vector DB**: ChromaDB (In-memory)
- **UI**: Gradio
- **Environment**: VS Code + Conda Python 3.11
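
Conceptually, the embedding model and vector database in this stack turn search into ranking chunk vectors by their similarity to the query vector and keeping the best K. The sketch below shows that idea in plain Python using cosine similarity. It is not ChromaDB's implementation, and the distance metric the notebook's collection actually uses is not shown in this section; `cosine_similarity`, `top_k_chunks`, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max(k, 0)]


# Toy 2-D "embeddings" (made-up values; real models produce hundreds of dimensions).
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_results = top_k_chunks([1.0, 0.1], toy_chunks, k=2)
top_indices = [index for index, _ in toy_results]  # [0, 2]
```

The query vector points almost along the first axis, so chunk 0 ranks first and the diagonal chunk 2 second. A vector database does the same ranking with an approximate nearest-neighbour index so that it stays fast on large collections.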