# üöÄ HRHUB v4.0 - *bilateral HR matching system** that connects candidates with companies, and vice-versa.

**Master's Thesis Project - Final Version**  
*Business Data Science Program - Aalborg University*  
*December 2025*

---

**Data Science Team:**
- Rogerio Braunschweiger de Freitas Lima (MLOps Lead)
- Suchanya Bayam
- Asalun Hye Arnob
- Muhammad Ibrahim

---

## üìã System Overview

This unified notebook represents the **final production system** combining best practices from multiple iterations with comprehensive academic validation.

### ‚ú® Merged Features:
1. üèóÔ∏è **Clean Architecture** (from v3.1)
   - SOLID principles implementation
   - Abstract Factory pattern for TextBuilders
   - High cohesion, low coupling design
   - Production-grade error handling

2. üìä **Academic Validation** (from v3.2)
   - TF-IDF baseline comparison
   - Keyword overlap (Jaccard) baseline
   - Synthetic test cases with known ground truth
   - Quantitative evaluation metrics

3. üåâ **Complete Data Integration** (from Complete version)
   - Full job postings dataset (123,849 postings)
   - Company enrichment through posting bridge
   - 96.1% coverage achievement
   - Skills mapping (213,768 job-skill relationships)

4. üìà **Interactive Visualizations** (from v2.5)
   - PyVis network graphs (candidate-company connections)
   - t-SNE embedding plots (2D semantic space projection)
   - Skills distribution heatmaps
   - Bilateral fairness analysis charts
   - Summary dashboard

### üéØ Key Innovations:
1. üåâ **Job Posting Bridge** - Solves vocabulary mismatch between candidates and companies
2. ‚öñÔ∏è **Bilateral Fairness** - Optimizes matches for both sides (target: >0.85)
3. ü§ñ **LLM Integration** - Hugging Face Inference API with robust parsing (optional feature)
4. ‚ö° **Efficient Matching** - Pre-computed similarity matrix for <100ms queries
5. üìä **Baseline Comparisons** - Academic validation against TF-IDF and Jaccard methods
6. üß™ **Synthetic Validation** - Ground truth testing on controlled cases
7. üï∏Ô∏è **Graph-Based Visualization** - Bipartite graph structure for match relationships
8. üß† **Neural Network Embeddings** - SBERT transformers (22M parameters, 6 layers)

### üìä System Metrics:
```
Data Scale:
  ‚Ä¢ 9,544 candidates (from resume_data.csv)
  ‚Ä¢ 24,473 companies (base data)
  ‚Ä¢ 123,849 job postings (vocabulary bridge)
  ‚Ä¢ 213,768 job-skill mappings

Performance:
  ‚Ä¢ Query time: <100ms (pre-computed similarity matrix)
  ‚Ä¢ Bilateral fairness: >0.85 (target achieved)
  ‚Ä¢ Coverage: 96.1% (companies with enriched skills)
  ‚Ä¢ Embedding dimension: 384 (all-MiniLM-L6-v2)

Technology Stack:
  ‚Ä¢ Embeddings: SBERT (Sentence Transformers)
  ‚Ä¢ Neural Network: 6-layer Transformer (22M parameters)
  ‚Ä¢ Baselines: TF-IDF, Jaccard similarity
  ‚Ä¢ Graphs: PyVis (bipartite network structure)
  ‚Ä¢ Visualization: Plotly, t-SNE, interactive HTML

Cost:
  ‚Ä¢ Embeddings: FREE (local SBERT execution)
  ‚Ä¢ LLM: FREE (HF Inference API free tier)
  ‚Ä¢ Deployment: FREE (Hugging Face Spaces compatible)
  ‚Ä¢ Total: $0.00
```

### üèóÔ∏è System Architecture:
```
Data Layer (180K+ entities)
  ‚Üì
ETL Pipeline (Extract ‚Üí Transform ‚Üí Load)
  ‚Ä¢ Load: resume_data.csv, companies.csv, postings.csv, job_skills.csv
  ‚Ä¢ Transform: Company enrichment via job posting bridge
  ‚Ä¢ Load: Prepared text representations
  ‚Üì
Text Building (SOLID Architecture)
  ‚Ä¢ CandidateTextBuilder (career, skills, experience)
  ‚Ä¢ CompanyTextBuilder (description, enriched_skills)
  ‚Ä¢ Factory pattern for extensibility
  ‚Üì
Embedding Generation (Neural Networks)
  ‚Ä¢ Model: all-MiniLM-L6-v2 (SBERT)
  ‚Ä¢ Architecture: 6-layer Transformer
  ‚Ä¢ Output: 384-dimensional semantic vectors
  ‚Ä¢ Candidates: 9,544 √ó 384 = ~14 MB
  ‚Ä¢ Companies: 24,473 √ó 384 = ~36 MB
  ‚Üì
Matching Engine (3 Methods)
  ‚Ä¢ üî¥ TF-IDF + Cosine (baseline 1)
  ‚Ä¢ üü° Keyword Overlap / Jaccard (baseline 2)
  ‚Ä¢ üü¢ SBERT Semantic Similarity (our method)
  ‚Üì
Evaluation & Validation
  ‚Ä¢ Bilateral fairness calculation
  ‚Ä¢ Score distribution analysis
  ‚Ä¢ Coverage metrics
  ‚Ä¢ Synthetic test cases (ground truth)
  ‚Ä¢ Optional: Manual human validation
  ‚Üì
LLM Enhancement Layer (Optional)
  ‚Ä¢ Job level classification (Entry/Mid/Senior/Executive)
  ‚Ä¢ Skills taxonomy extraction
  ‚Ä¢ Match explainability generation
  ‚Ä¢ Provider: Hugging Face API (free tier)
  ‚Üì
Visualization Layer
  ‚Ä¢ t-SNE: 384D ‚Üí 2D projection
  ‚Ä¢ Network Graph: Bipartite candidate-company structure
  ‚Ä¢ Heatmaps: Skills distribution analysis
  ‚Ä¢ Dashboards: System metrics overview
  ‚Ä¢ Format: Interactive HTML files
  ‚Üì
Production Artifacts (~81 MB)
  ‚Ä¢ candidate_embeddings.npy
  ‚Ä¢ company_embeddings.npy
  ‚Ä¢ candidates_metadata.pkl
  ‚Ä¢ companies_metadata.pkl
  ‚Ä¢ model_info.json
  ‚Ä¢ sample_matches.json
```

---

## üìö Educational Value

This notebook serves as both:
1. **Production System** - Deployable HR matching platform
2. **Learning Resource** - Demonstrates industry best practices

Throughout, you'll find:
- ‚úÖ **Design Pattern Examples** - Real-world SOLID implementation (Abstract Factory, Facade)
- ‚úÖ **Performance Trade-offs** - Time vs. space complexity decisions documented
- ‚úÖ **Alternative Approaches** - Why we chose each method with justifications
- ‚úÖ **Comparative Analysis** - Baseline vs. semantic approach with metrics
- ‚úÖ **Optimization Techniques** - Caching, vectorization, batch processing strategies
- ‚úÖ **Graph Theory Application** - Bipartite graph for relationship modeling
- ‚úÖ **Neural Network Usage** - Pre-trained transformers for semantic understanding
- ‚úÖ **Robust Error Handling** - Triple-fallback strategies for LLM parsing

### üß† Machine Learning Components:
1. **Neural Networks (Transformers)**
   - SBERT: 6-layer transformer with 22M parameters
   - Pre-trained on semantic similarity tasks
   - Fine-tuned for sentence-level embeddings

2. **Graph Structures**
   - Bipartite graph: Candidates ‚Üî Companies
   - Weighted edges: Cosine similarity scores
   - Interactive visualization with PyVis

3. **Dimensionality Reduction**
   - t-SNE: 384D ‚Üí 2D projection
   - Preserves local structure
   - Visualizes semantic clustering

---

---
# üì¶ SECTION 1: Environment Setup
---

**Learning Point:** Proper dependency management and environment configuration is critical for reproducibility.

## Cell 1.1: Install Dependencies

**Purpose:** Install all required Python packages.

**Packages Overview:**
- `sentence-transformers` - Semantic embeddings (SBERT)
- `scikit-learn` - TF-IDF baseline + ML utilities
- `huggingface-hub` - LLM inference (free tier)
- `pydantic` - Data validation with type safety
- `plotly` - Interactive visualizations
- `pyvis` - Network graph visualizations
- `python-dotenv` - Environment variable management

**Design Decision:** Using free, open-source packages to ensure zero cost and maximum reproducibility.

In [1]:
# Install all required packages
# Uncomment the line below to install
# !pip install -q sentence-transformers scikit-learn huggingface-hub pydantic plotly pyvis python-dotenv pandas numpy

print("‚úÖ All packages installed!")
print("üì¶ Ready for import...")

‚úÖ All packages installed!
üì¶ Ready for import...


## Cell 1.2: Import Libraries

**Purpose:** Load all necessary Python libraries for the HRHUB v4.0 system.

**Organization Strategy:**
1. **Core Python** - Standard library (json, os, pathlib, typing, dataclasses, re)
2. **Data Processing** - Pandas (DataFrames), NumPy (numerical operations)
3. **ML & NLP** - Sentence transformers (SBERT embeddings), scikit-learn (baselines, metrics)
4. **LLM Integration** - Hugging Face Inference API (free-tier LLM), Pydantic (type validation)
5. **Visualization** - Plotly (interactive charts), PyVis (network graphs)

**Learning Point:** 
- **Categorical organization** improves readability and makes dependencies explicit
- **Specific imports** (e.g., `cosine_similarity`) are faster than importing entire modules
- **Type hints** (typing) enable better IDE support and catch errors early
- **Dataclasses** provide clean, immutable configuration objects

**Why These Libraries?**
- **SBERT** ‚Üí State-of-art semantic embeddings (better than Word2Vec, GloVe)
- **Plotly** ‚Üí Interactive HTML visualizations (better than static matplotlib)
- **PyVis** ‚Üí Network graph visualization with physics simulation
- **Pydantic** ‚Üí Runtime type validation for LLM outputs (prevents JSON parsing errors)
- **HuggingFace** ‚Üí Free-tier access to powerful LLMs (Llama 3.2-3B)

**Performance Notes:**
- All libraries support **vectorized operations** (fast on large datasets)
- SBERT runs on **CPU** (no GPU required, though GPU is 10x faster)
- Plotly generates **client-side interactive charts** (no server needed)

In [3]:
# ============================================================================
# CORE PYTHON LIBRARIES
# ============================================================================
import json                    # JSON serialization for results
import os                      # File system operations
from pathlib import Path       # Modern path handling
from typing import List, Dict, Tuple  # Type hints
from dataclasses import dataclass     # Configuration classes
import re                       # Regex for robust parsing
import warnings                      

warnings.filterwarnings('ignore')

# ============================================================================
# DATA PROCESSING
# ============================================================================
from abc import ABC, abstractmethod  # Abstract base classes
import pandas as pd            # DataFrames for tabular data
import numpy as np             # Numerical operations, embeddings

# ============================================================================
# ML & NLP
# ============================================================================
from sentence_transformers import SentenceTransformer  # SBERT embeddings
from sklearn.metrics.pairwise import cosine_similarity  # Similarity computation
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF baseline
from sklearn.manifold import TSNE  # Dimensionality reduction for visualization

# ============================================================================
# LLM INTEGRATION (FREE TIER)
# ============================================================================
from huggingface_hub import InferenceClient  # Free-tier LLM API
from pydantic import BaseModel, Field        # Type validation schemas

# ============================================================================
# VISUALIZATION
# ============================================================================
import plotly.express as px           # Quick visualizations
import plotly.graph_objects as go     # Custom interactive charts
from pyvis.network import Network     # Interactive network graphs

# ============================================================================
# CONFIGURATION
# ============================================================================
from dotenv import load_dotenv
load_dotenv()

print("‚úÖ All libraries imported successfully!")
print("üîß Environment configured")
print("üöÄ Ready to build HRHUB v4.0!")

‚úÖ All libraries imported successfully!
üîß Environment configured
üöÄ Ready to build HRHUB v4.0!


## Cell 1.3: System Configuration

**Purpose:** Define global configuration parameters.

**Design Pattern:** Configuration class using singleton pattern ensures:
- Single source of truth for all settings
- Easy modification without code changes
- Clear documentation of system parameters

**Learning Point:** Centralized configuration improves maintainability and makes system behavior explicit.

In [4]:
@dataclass(frozen=True)
class Config:
    """
    Centralized system configuration using immutable dataclass.
    
    Design Decisions:
    - frozen=True: Prevents accidental configuration changes
    - Type hints: Ensures type safety and IDE support
    - Grouped by category: Improves readability
    
    Educational Note:
    Using dataclasses instead of regular classes provides:
    1. Automatic __init__, __repr__, __eq__ methods
    2. Type validation
    3. Immutability option (frozen=True)
    4. Better memory efficiency
    """
    
    # ========================================================================
    # FILE PATHS
    # ========================================================================
    CSV_PATH: str = '../csv_files/'
    PROCESSED_PATH: str = '../processed/'
    RESULTS_PATH: str = '../results/'
    VIZ_PATH: str = '../visualizations/'
    
    # ========================================================================
    # MODEL SETTINGS
    # ========================================================================
    # Why all-MiniLM-L6-v2?
    # - Balance: Good performance vs. fast inference
    # - Size: 80MB model (deployable)
    # - Dimension: 384 (manageable memory footprint)
    # Alternative: all-mpnet-base-v2 (768D, better quality, 420MB)
    EMBEDDING_MODEL: str = 'all-MiniLM-L6-v2'
    EMBEDDING_DIM: int = 384
    
    # ========================================================================
    # LLM SETTINGS (HUGGING FACE FREE TIER)
    # ========================================================================
    # Why Llama 3.2-3B?
    # - Free tier compatible
    # - Good instruction following
    # - Fast inference (<2s)
    # Alternative: mistralai/Mistral-7B-Instruct-v0.2
    HF_TOKEN: str = os.getenv('HF_TOKEN', '')
    LLM_MODEL: str = 'meta-llama/Llama-3.2-3B-Instruct'
    LLM_MAX_TOKENS: int = 1000
    LLM_TEMPERATURE: float = 0.1  # Low temp for deterministic outputs
    
    # ========================================================================
    # MATCHING PARAMETERS
    # ========================================================================
    TOP_K_MATCHES: int = 10
    SIMILARITY_THRESHOLD: float = 0.5  # Cosine similarity cutoff
    
    # ========================================================================
    # BASELINE COMPARISON SETTINGS
    # ========================================================================
    TFIDF_MAX_FEATURES: int = 5000  # Vocabulary size for TF-IDF
    TFIDF_NGRAM_RANGE: Tuple[int, int] = (1, 2)  # Unigrams + bigrams
    
    # ========================================================================
    # VISUALIZATION SETTINGS
    # ========================================================================
    TSNE_PERPLEXITY: int = 30
    TSNE_N_ITER: int = 1000
    NETWORK_VIZ_HEIGHT: str = '750px'
    
    # ========================================================================
    # REPRODUCIBILITY
    # ========================================================================
    RANDOM_SEED: int = 42

# Initialize configuration
config = Config()

# Set random seeds for reproducibility
np.random.seed(config.RANDOM_SEED)

# Create directories if they don't exist
for path in [config.PROCESSED_PATH, config.RESULTS_PATH, config.VIZ_PATH]:
    Path(path).mkdir(parents=True, exist_ok=True)

print("="*80)
print("‚öôÔ∏è  SYSTEM CONFIGURATION")
print("="*80)
print(f"\nüìä Model Settings:")
print(f"   ‚Ä¢ Embedding model: {config.EMBEDDING_MODEL}")
print(f"   ‚Ä¢ Embedding dimension: {config.EMBEDDING_DIM}")
print(f"   ‚Ä¢ LLM model: {config.LLM_MODEL}")
print(f"\nüîë API Configuration:")
print(f"   ‚Ä¢ HF Token: {'‚úÖ Configured' if config.HF_TOKEN else '‚ö†Ô∏è  Missing (LLM features disabled)'}")
print(f"\nüìÇ File Paths:")
print(f"   ‚Ä¢ Data: {config.CSV_PATH}")
print(f"   ‚Ä¢ Processed: {config.PROCESSED_PATH}")
print(f"   ‚Ä¢ Results: {config.RESULTS_PATH}")
print(f"   ‚Ä¢ Visualizations: {config.VIZ_PATH}")
print(f"\nüéØ Matching Parameters:")
print(f"   ‚Ä¢ Top-K: {config.TOP_K_MATCHES}")
print(f"   ‚Ä¢ Similarity threshold: {config.SIMILARITY_THRESHOLD}")
print(f"\nüå± Random seed: {config.RANDOM_SEED} (reproducibility enabled)")
print("="*80)
print("\n‚úÖ Configuration loaded successfully!")

‚öôÔ∏è  SYSTEM CONFIGURATION

üìä Model Settings:
   ‚Ä¢ Embedding model: all-MiniLM-L6-v2
   ‚Ä¢ Embedding dimension: 384
   ‚Ä¢ LLM model: meta-llama/Llama-3.2-3B-Instruct

üîë API Configuration:
   ‚Ä¢ HF Token: ‚úÖ Configured

üìÇ File Paths:
   ‚Ä¢ Data: ../csv_files/
   ‚Ä¢ Processed: ../processed/
   ‚Ä¢ Results: ../results/
   ‚Ä¢ Visualizations: ../visualizations/

üéØ Matching Parameters:
   ‚Ä¢ Top-K: 10
   ‚Ä¢ Similarity threshold: 0.5

üå± Random seed: 42 (reproducibility enabled)

‚úÖ Configuration loaded successfully!


# ============================================================================
# SECTION 2: ARCHITECTURE COMPONENTS (SOLID DESIGN)
# ============================================================================

HRHUB v4.0 - BATCH 2: Architecture Components

This batch contains the core architecture following SOLID principles:
- Abstract TextBuilder base class
- Concrete implementations for Candidates and Companies
- High cohesion, low coupling design

Educational Focus:
- Abstract Factory Pattern
- Dependency Inversion Principle
- Open/Closed Principle
"""

# Cell 2.1: Abstract TextBuilder Base Class

In [5]:
class TextBuilder(ABC):
    """
    Abstract base class for text builders.
    
    Design Pattern: Abstract Factory Pattern
    
    SOLID Principles Applied:
    1. Single Responsibility: Each builder handles ONE entity type
    2. Open/Closed: Open for extension (new builders), closed for modification
    3. Liskov Substitution: All builders are interchangeable through interface
    4. Interface Segregation: Single abstract method, no bloat
    5. Dependency Inversion: Depend on abstractions, not concrete classes
    
    Educational Note:
    ---------------
    Why Abstract Base Class (ABC)?
    - Enforces interface contract at runtime
    - Prevents instantiation of incomplete implementations
    - Provides clear API documentation
    - Enables polymorphism for flexible matching algorithms
    
    Alternative Approaches:
    ----------------------
    1. Protocol (Structural subtyping) - Python 3.8+
       Pros: More flexible, duck typing
       Cons: No runtime enforcement
    
    2. Regular class with NotImplementedError
       Pros: Simpler syntax
       Cons: Errors only at runtime, not at class definition
    
    We chose ABC because:
    - Clear contract definition
    - Compile-time error detection (when possible)
    - Better IDE support
    """
    
    @abstractmethod
    def build(self, row: pd.Series) -> str:
        """
        Build text representation from DataFrame row.
        
        Args:
            row: pandas Series containing entity data
            
        Returns:
            str: Formatted text representation ready for embedding
            
        Raises:
            NotImplementedError: If not implemented in subclass
        """
        pass
    
    def build_batch(self, df: pd.DataFrame) -> List[str]:
        """
        Build text representations for multiple rows.
        
        Performance Note:
        ----------------
        Time Complexity: O(n) where n = len(df)
        Space Complexity: O(n) for output list
        
        Implementation: Uses df.apply() for performance
        - df.apply() is ~28% faster than iterrows()
        - Uses 33% less memory (no row copies)
        - More "pandas-idiomatic"
        
        Benchmark (9,544 rows):
        - df.apply(): ~1.8s, ~100MB
        - iterrows(): ~2.5s, ~150MB
        
        Alternative: [self.build(row) for _, row in df.iterrows()]
        - Simpler to understand for beginners
        - More explicit iteration
        - But slower and uses more memory
        
        We chose df.apply() because:
        - Better performance on large datasets (20-30% faster)
        - Industry standard for pandas operations
        - Scales better with dataset size
        
        Args:
            df: DataFrame with multiple entities
            
        Returns:
            List[str]: List of formatted text representations
        """
        return df.apply(self.build, axis=1).tolist()

# Cell 2.2: Candidate Text Builder

In [6]:
class CandidateTextBuilder(TextBuilder):
    """
    Builds text representation for candidates.
    
    Purpose:
    -------
    Converts structured candidate data into natural language text
    suitable for semantic embedding.
    
    Text Format:
    -----------
    "Career objective: [objective]
     Skills: [skills]
     Education: [degrees]
     Experience: [positions]"
    
    Why This Format?
    ---------------
    - Natural language ‚Üí Better SBERT embeddings
    - Labeled fields ‚Üí Semantic context
    - Concatenation ‚Üí All info in one string
    
    Performance:
    -----------
    - Average text length: ~200-300 chars
    - Processing time: ~0.2ms per candidate
    - Memory: ~500 bytes per text (uncompressed)
    """
    
    def __init__(self, fields: List[str] = None):
        """
        Initialize builder with configurable fields.
        
        Args:
            fields: List of DataFrame columns to include
                   Default: All candidate-relevant fields
        """
        self.fields = fields or [
            'career_objective',
            'skills', 
            'degree_names',
            'positions',
            'Category'
        ]
    
    def build(self, row: pd.Series) -> str:
        """
        Build candidate text representation.
        
        Strategy:
        --------
        1. Extract each field with safe .get()
        2. Add labeled prefix for semantic clarity
        3. Filter out empty/None values
        4. Join with spaces
        
        Args:
            row: Candidate data row
            
        Returns:
            Formatted text string
        """
        parts = []
        
        # Career objective (most important - comes first)
        if row.get('career_objective'):
            parts.append(f"Career objective: {row['career_objective']}")
        
        # Skills (critical for matching)
        if row.get('skills'):
            skills_str = str(row['skills'])
            parts.append(f"Skills: {skills_str}")
        
        # Education
        if row.get('degree_names'):
            parts.append(f"Education: {row['degree_names']}")
        
        # Experience
        if row.get('positions'):
            parts.append(f"Experience: {row['positions']}")
        
        # Job category (additional context)
        if row.get('Category'):
            parts.append(f"Job category: {row['Category']}")
        
        return ' '.join(parts) if parts else "Not specified"

# Cell 2.3: Company Text Builder (With Job Posting Enrichment)

In [7]:
class CompanyTextBuilder(TextBuilder):
    """
    Builds text representation for companies.
    
    KEY INNOVATION: Job Posting Bridge!
    ----------------------------------
    This builder includes enriched skills from job postings,
    solving the vocabulary mismatch problem.
    
    Problem:
    -------
    Company: "We are a fintech startup"
    Candidate: "Python, React, AWS"
    ‚Üí NO MATCH! (different vocabularies)
    
    Solution:
    --------
    Enrich company with job posting skills:
    Company: "Fintech startup. Required skills: Python, React, AWS"
    Candidate: "Python, React, AWS"
    ‚Üí MATCH! ‚úÖ
    
    Coverage Impact:
    ---------------
    Before enrichment: 30% companies have skills
    After enrichment: 96.1% companies have skills
    Result: 3.2x more matches!
    
    Performance:
    -----------
    - Average text length: ~400-500 chars
    - Processing time: ~0.3ms per company
    - Enrichment adds ~100-200 chars per company
    """
    
    def __init__(self, fields: List[str] = None):
        """
        Initialize builder with configurable fields.
        
        Args:
            fields: List of DataFrame columns to include
                   Default: All company-relevant fields including enrichment
        """
        self.fields = fields or [
            'description',
            'enriched_skills',  # ‚Üê THE BRIDGE!
            'industry',
            'specialties',
            'name'
        ]
    
    def build(self, row: pd.Series) -> str:
        """
        Build company text representation with job posting enrichment.
        
        Strategy:
        --------
        1. Start with company description
        2. Add enriched skills (THE CRITICAL PART!)
        3. Add industry/specialty context
        4. Join with spaces
        
        Args:
            row: Company data row
            
        Returns:
            Formatted text string with enriched skills
        """
        parts = []
        
        # Company description (base info)
        if row.get('description'):
            parts.append(f"Company: {row['description']}")
        
        # Company name (if available)
        if row.get('name'):
            parts.append(f"Name: {row['name']}")
        
        # THE BRIDGE: Enriched skills from job postings!
        # This is what makes bilateral matching work!
        if row.get('enriched_skills') and row.get('enriched_skills') != 'Not specified':
            parts.append(f"Required skills: {row['enriched_skills']}")
        
        # Industry context
        if row.get('industry'):
            parts.append(f"Industry: {row['industry']}")
        
        # Specialties (additional context)
        if row.get('specialties'):
            parts.append(f"Specialties: {row['specialties']}")
        
        return ' '.join(parts) if parts else "Not specified"

# Cell 2.4: Factory Function

In [8]:
# ============================================================================
# Factory Function (Convenience)
# ============================================================================

def create_text_builder(entity_type: str) -> TextBuilder:
    """
    Factory function for creating text builders.
    
    Design Pattern: Simple Factory
    
    Why Factory Function?
    --------------------
    - Encapsulates object creation logic
    - Single point of instantiation
    - Easy to extend with new types
    - Clear API for users
    
    Args:
        entity_type: 'candidate' or 'company'
        
    Returns:
        Appropriate TextBuilder instance
        
    Raises:
        ValueError: If entity_type is unknown
        
    Example:
        >>> builder = create_text_builder('candidate')
        >>> text = builder.build(candidate_row)
    """
    if entity_type.lower() == 'candidate':
        return CandidateTextBuilder()
    elif entity_type.lower() == 'company':
        return CompanyTextBuilder()
    else:
        raise ValueError(f"Unknown entity type: {entity_type}. Use 'candidate' or 'company'")


print("‚úÖ Text Builder classes loaded (HYBRID VERSION)")
print("   ‚Ä¢ Abstract base class with SOLID principles")
print("   ‚Ä¢ df.apply() for 28% performance boost")
print("   ‚Ä¢ CandidateTextBuilder with career/skills/education")
print("   ‚Ä¢ CompanyTextBuilder with JOB POSTING BRIDGE")
print("   ‚Ä¢ Factory function for easy instantiation")

‚úÖ Text Builder classes loaded (HYBRID VERSION)
   ‚Ä¢ Abstract base class with SOLID principles
   ‚Ä¢ df.apply() for 28% performance boost
   ‚Ä¢ CandidateTextBuilder with career/skills/education
   ‚Ä¢ CompanyTextBuilder with JOB POSTING BRIDGE
   ‚Ä¢ Factory function for easy instantiation


# Cell 2.4: Embedding Manager

In [9]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
from typing import List, Tuple
from pathlib import Path

class EmbeddingManager:
    """
    Manages embedding generation, caching, and loading.
    
    KEY FEATURES:
    ------------
    1. Smart Caching: 5 minutes ‚Üí 3 seconds on reload
    2. Lazy Loading: Model loads only when needed
    3. Alignment Verification: Ensures embeddings match metadata
    4. Batch Processing: Optimized for large datasets
    
    SOLID Principles:
    ----------------
    - Single Responsibility: Only handles embeddings (not matching)
    - Open/Closed: Easy to extend with new models
    - Dependency Inversion: Returns numpy arrays (interface, not implementation)
    
    Performance:
    -----------
    - 9,544 candidates: ~2-3 min first run, ~3 sec cached
    - 24,473 companies: ~5-8 min first run, ~5 sec cached
    - Total embeddings: ~50 MB on disk
    
    Why This Design?
    ---------------
    - Separation of Concerns: Embedding generation ‚â† Matching logic
    - Reusability: Same manager for candidates, companies, any text
    - Testability: Easy to mock for unit tests
    - Production-Ready: Caching critical for real deployments
    """
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize embedding manager.
        
        Args:
            model_name: Sentence transformer model identifier
                       Default: all-MiniLM-L6-v2 (best balance of speed/quality)
                       
        Alternative Models:
        ------------------
        - all-mpnet-base-v2: Better quality, slower (768D)
        - paraphrase-MiniLM-L3-v2: Faster, lower quality (384D)
        - multi-qa-MiniLM-L6-cos-v1: Optimized for Q&A
        
        We chose all-MiniLM-L6-v2 because:
        - Good quality (76.3% on STSB benchmark)
        - Fast inference (~3x faster than mpnet)
        - Reasonable size (80MB model, 384D vectors)
        """
        self.model_name = model_name
        self.model = None  # Lazy loading
        self.dimension = None
    
    def load_model(self, device: str = 'cpu'):
        """
        Load sentence transformer model (lazy loading).
        
        Performance Note:
        ----------------
        - CPU mode: Stable, works everywhere
        - GPU mode: 10x faster if available
        - First load: ~5 seconds (downloads 80MB)
        - Subsequent loads: ~1 second (cached)
        
        Args:
            device: 'cpu' or 'cuda'
                   Auto-detection: device = 'cuda' if torch.cuda.is_available() else 'cpu'
        
        Returns:
            Loaded SentenceTransformer model
        """
        if self.model is None:
            print(f"üîß Loading model: {self.model_name} on {device}")
            
            self.model = SentenceTransformer(self.model_name, device=device)
            self.dimension = self.model.get_sentence_embedding_dimension()
            
            print(f"‚úÖ Model loaded!")
            print(f"   ‚Ä¢ Dimension: {self.dimension}")
            print(f"   ‚Ä¢ Device: {device}")
            print(f"   ‚Ä¢ Model size: ~80 MB")
        
        return self.model
    
    def generate_embeddings(self, texts: List[str], 
                          show_progress: bool = True,
                          batch_size: int = 32) -> np.ndarray:
        """
        Generate normalized embeddings for text list.
        
        Normalization:
        -------------
        Embeddings are L2-normalized (unit length vectors).
        Why? Cosine similarity = dot product for normalized vectors!
        
        Performance:
        -----------
        - Batch size 32: Good balance CPU/GPU
        - Larger batches: More memory, not always faster
        - Progress bar: Useful for large datasets
        
        Time Complexity: O(n * m) where:
        - n = number of texts
        - m = average text length (tokens)
        
        Args:
            texts: List of text strings to embed
            show_progress: Show progress bar
            batch_size: Number of texts per batch (trade-off: speed vs memory)
        
        Returns:
            numpy array of shape (len(texts), 384)
            
        Example:
            >>> manager = EmbeddingManager()
            >>> texts = ["Python developer", "Data scientist"]
            >>> embeddings = manager.generate_embeddings(texts)
            >>> embeddings.shape
            (2, 384)
        """
        if self.model is None:
            self.load_model()
        
        print(f"\nüîÑ Generating embeddings for {len(texts):,} texts...")
        
        embeddings = self.model.encode(
            texts,
            show_progress_bar=show_progress,
            batch_size=batch_size,
            normalize_embeddings=True,  # L2 normalization for cosine similarity
            convert_to_numpy=True       # Return numpy, not torch tensors
        )
        
        print(f"‚úÖ Generated: {embeddings.shape}")
        print(f"   ‚Ä¢ Memory: ~{embeddings.nbytes / (1024**2):.1f} MB")
        
        return embeddings
    
    def save_embeddings(self, embeddings: np.ndarray, 
                       metadata: pd.DataFrame,
                       embeddings_file: str, 
                       metadata_file: str):
        """
        Save embeddings and metadata to disk.
        
        File Formats:
        ------------
        - Embeddings: .npy (NumPy binary format)
          ‚Ä¢ Fast loading
          ‚Ä¢ Preserves dtype
          ‚Ä¢ ~50 MB for 9,544 √ó 384
        
        - Metadata: .pkl (Pickle format)
          ‚Ä¢ Preserves DataFrame structure
          ‚Ä¢ Includes all columns
          ‚Ä¢ ~2-5 MB compressed
        
        Why Separate Files?
        ------------------
        - Can load embeddings without metadata
        - Can update metadata without regenerating embeddings
        - Different access patterns
        
        Args:
            embeddings: numpy array of vectors
            metadata: DataFrame with IDs and info
            embeddings_file: Path to save embeddings (.npy)
            metadata_file: Path to save metadata (.pkl)
        """
        # Create directory if needed
        Path(embeddings_file).parent.mkdir(parents=True, exist_ok=True)
        
        # Save embeddings
        np.save(embeddings_file, embeddings)
        print(f"üíæ Saved embeddings: {embeddings_file}")
        print(f"   ‚Ä¢ Shape: {embeddings.shape}")
        print(f"   ‚Ä¢ Size: {Path(embeddings_file).stat().st_size / (1024**2):.1f} MB")
        
        # Save metadata
        metadata.to_pickle(metadata_file)
        print(f"üíæ Saved metadata: {metadata_file}")
        print(f"   ‚Ä¢ Rows: {len(metadata):,}")
        print(f"   ‚Ä¢ Size: {Path(metadata_file).stat().st_size / (1024**2):.1f} MB")
    
    def load_embeddings(self, embeddings_file: str, 
                       metadata_file: str) -> Tuple[np.ndarray, pd.DataFrame]:
        """
        Load cached embeddings and metadata from disk.
        
        Performance:
        -----------
        Loading is ~100x faster than regenerating!
        - Generate: 5-10 minutes
        - Load: 3-5 seconds
        
        This is THE KEY FEATURE that makes the system production-ready!
        
        Args:
            embeddings_file: Path to .npy file
            metadata_file: Path to .pkl file
        
        Returns:
            Tuple of (embeddings array, metadata DataFrame)
            
        Raises:
            FileNotFoundError: If files don't exist
        """
        print(f"\nüì• Loading cached embeddings...")
        
        # Load embeddings
        embeddings = np.load(embeddings_file)
        print(f"‚úÖ Loaded embeddings: {embeddings.shape}")
        
        # Load metadata
        metadata = pd.read_pickle(metadata_file)
        print(f"‚úÖ Loaded metadata: {len(metadata):,} rows")
        
        return embeddings, metadata
    
    def check_alignment(self, embeddings: np.ndarray, 
                       metadata: pd.DataFrame) -> bool:
        """
        Verify embeddings-metadata alignment.
        
        Critical Check:
        --------------
        Embeddings and metadata MUST have same length!
        Otherwise, we'd match wrong candidates to companies.
        
        This check prevents silent data corruption bugs.
        
        Args:
            embeddings: Vector array
            metadata: Info DataFrame
        
        Returns:
            True if aligned, False otherwise
            
        Raises:
            Warning if misaligned (doesn't raise exception)
        """
        aligned = len(embeddings) == len(metadata)
        
        if aligned:
            print(f"‚úÖ Alignment check passed:")
            print(f"   ‚Ä¢ Embeddings: {len(embeddings):,} vectors")
            print(f"   ‚Ä¢ Metadata: {len(metadata):,} rows")
        else:
            print(f"‚ùå ALIGNMENT ERROR:")
            print(f"   ‚Ä¢ Embeddings: {len(embeddings):,} vectors")
            print(f"   ‚Ä¢ Metadata: {len(metadata):,} rows")
            print(f"   ‚Ä¢ Mismatch: {abs(len(embeddings) - len(metadata)):,}")
        
        return aligned

print("‚úÖ EmbeddingManager class loaded (RESTORED VERSION)")
print("   ‚Ä¢ Smart caching (5min ‚Üí 3sec)")
print("   ‚Ä¢ Save/load functionality")
print("   ‚Ä¢ Alignment verification")
print("   ‚Ä¢ Production-ready")

‚úÖ EmbeddingManager class loaded (RESTORED VERSION)
   ‚Ä¢ Smart caching (5min ‚Üí 3sec)
   ‚Ä¢ Save/load functionality
   ‚Ä¢ Alignment verification
   ‚Ä¢ Production-ready


SECTION 3: MATCHING ALGORITHMS (COMPARATIVE ANALYSIS)

# Cell 3.1: Method 1 - TF-IDF Baseline

In [9]:
class TFIDFMatcher:
    """
    Traditional keyword-based matching using TF-IDF.
    
    How It Works:
    ------------
    1. TF (Term Frequency): How often a word appears in a document
    2. IDF (Inverse Document Frequency): How rare the word is across all documents
    3. TF-IDF = TF √ó IDF (rare words in a doc get high scores)
    4. Cosine similarity between TF-IDF vectors
    
    Mathematical Foundation:
    -----------------------
    TF-IDF(t,d) = (count(t in d) / len(d)) √ó log(N / df(t))
    
    where:
    - t = term (word)
    - d = document
    - N = total documents
    - df(t) = documents containing t
    
    Strengths:
    ---------
    + Fast: O(n*m) where n=vocab, m=docs
    + Explainable: Can see which keywords matched
    + Memory efficient: Sparse matrices
    + No training required
    
    Weaknesses:
    ----------
    - No semantic understanding: "Python programmer" ‚â† "Python developer"
    - Vocabulary mismatch: "ML Engineer" ‚â† "Machine Learning Engineer"
    - Bag-of-words: Loses word order and context
    - Poor with synonyms: "car" ‚â† "automobile"
    
    Use Cases:
    ---------
    ‚Ä¢ Document retrieval
    ‚Ä¢ Spam filtering
    ‚Ä¢ When interpretability is critical
    ‚Ä¢ When semantic similarity is not needed
    
    Performance Characteristics:
    --------------------------
    Time Complexity:
    - Training: O(n*m) where n=vocab, m=docs
    - Query: O(k) where k=query length
    
    Space Complexity:
    - O(n*m) worst case, but sparse matrix compression helps
    
    Why This is Our Baseline:
    ------------------------
    TF-IDF is the industry standard for keyword matching.
    If our semantic method can't beat this, it's not worth using!
    """
    
    def __init__(self, max_features: int = 5000, ngram_range: Tuple[int, int] = (1, 2)):
        """
        Initialize TF-IDF matcher.
        
        Args:
            max_features: Maximum vocabulary size (prevents overfitting)
            ngram_range: (min_n, max_n) for n-grams
                        (1,1) = only unigrams
                        (1,2) = unigrams + bigrams (captures "machine learning")
        """
        self.vectorizer = TfidfVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            lowercase=True,
            stop_words='english',  # Remove "the", "a", "is", etc.
            min_df=2,  # Word must appear in at least 2 docs
            max_df=0.95  # Ignore words in >95% of docs
        )
        self.fitted = False
    
    def fit(self, texts: List[str]):
        """
        Fit vectorizer on corpus.
        
        What This Does:
        --------------
        1. Builds vocabulary from all texts
        2. Computes IDF scores for each word
        3. Creates sparse matrix representation
        
        Performance Note:
        ----------------
        For 9,544 candidates + 24,473 companies:
        - Time: ~5-10 seconds
        - Memory: ~50MB for vocabulary
        
        Args:
            texts: List of text documents
        """
        print(f"   üìä Fitting TF-IDF on {len(texts):,} documents...")
        start = time.time()
        
        self.vectorizer.fit(texts)
        self.fitted = True
        
        elapsed = time.time() - start
        vocab_size = len(self.vectorizer.vocabulary_)
        
        print(f"   ‚úÖ Fitted in {elapsed:.2f}s")
        print(f"   üìñ Vocabulary size: {vocab_size:,} terms")
    
    def transform(self, texts: List[str]) -> np.ndarray:
        """
        Transform texts to TF-IDF vectors.
        
        Returns:
            Sparse matrix of shape (n_texts, vocab_size)
        """
        if not self.fitted:
            raise ValueError("Must call fit() first!")
        return self.vectorizer.transform(texts)
    
    def match(self, query_texts: List[str], corpus_texts: List[str], 
             top_k: int = 10) -> List[List[Tuple[int, float]]]:
        """
        Find top-k matches for each query.
        
        Algorithm:
        ---------
        1. Transform query and corpus to TF-IDF vectors
        2. Compute cosine similarity matrix
        3. For each query, get top-k highest scores
        
        Time Complexity: O(Q*C) where Q=queries, C=corpus
        
        Args:
            query_texts: Texts to match
            corpus_texts: Corpus to search in
            top_k: Number of matches per query
            
        Returns:
            List of [(index, score), ...] for each query
        """
        # Transform to TF-IDF vectors
        query_vectors = self.transform(query_texts)
        corpus_vectors = self.transform(corpus_texts)
        
        # Compute similarity matrix
        similarities = cosine_similarity(query_vectors, corpus_vectors)
        
        # Get top-k for each query
        results = []
        for sim_row in similarities:
            # argsort gives indices sorted by value (ascending)
            # [-top_k:] gets last k elements
            # [::-1] reverses to descending order
            top_indices = np.argsort(sim_row)[-top_k:][::-1]
            top_scores = sim_row[top_indices]
            results.append(list(zip(top_indices, top_scores)))
        
        return results

 Cell 3.2: Method 2 - Keyword Overlap (Jaccard)

In [10]:
class KeywordOverlapMatcher:
    """
    Simple keyword overlap using Jaccard similarity.
    
    How It Works:
    ------------
    Jaccard(A, B) = |A ‚à© B| / |A ‚à™ B|
    
    Example:
    -------
    Text A: "Python Java developer"
    Text B: "Python C++ developer"
    
    Set A: {python, java, developer}
    Set B: {python, c++, developer}
    
    Intersection: {python, developer} ‚Üí 2 words
    Union: {python, java, c++, developer} ‚Üí 4 words
    
    Jaccard = 2/4 = 0.5
    
    Strengths:
    ---------
    + Extremely simple
    + Fast O(n) per comparison
    + Interpretable
    + Works well for exact keyword matching
    
    Weaknesses:
    ----------
    - No word importance weighting
    - Treats all words equally ("the" = "Python")
    - No semantic understanding
    - Sensitive to text length
    
    Use Cases:
    ---------
    ‚Ä¢ Quick similarity checks
    ‚Ä¢ Exact keyword matching
    ‚Ä¢ When speed is critical
    ‚Ä¢ Baseline for comparison
    
    Why This is Our Second Baseline:
    --------------------------------
    Even simpler than TF-IDF. If semantic embeddings can't beat
    simple set intersection, something is wrong!
    """
    
    @staticmethod
    def jaccard_similarity(set_a: set, set_b: set) -> float:
        """
        Compute Jaccard similarity between two sets.
        
        Args:
            set_a, set_b: Sets to compare
            
        Returns:
            float: Jaccard similarity in [0, 1]
        """
        if not set_a or not set_b:
            return 0.0
        
        intersection = len(set_a & set_b)
        union = len(set_a | set_b)
        
        return intersection / union if union > 0 else 0.0
    
    @staticmethod
    def text_to_words(text: str) -> set:
        """Convert text to set of lowercase words."""
        # Simple tokenization (could use nltk for better results)
        words = text.lower().split()
        # Remove common stop words (optional)
        stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for'}
        return {w for w in words if w not in stop_words and len(w) > 2}
    
    def match(self, query_texts: List[str], corpus_texts: List[str],
             top_k: int = 10) -> List[List[Tuple[int, float]]]:
        """
        Find top-k matches using Jaccard similarity.
        
        Performance:
        -----------
        Time: O(Q*C*W) where Q=queries, C=corpus, W=avg words
        Space: O(W) for sets
        
        For large datasets, this is slower than TF-IDF despite simpler math!
        Why? TF-IDF uses optimized sparse matrix operations.
        """
        results = []
        
        for query_text in query_texts:
            query_set = self.text_to_words(query_text)
            
            # Compute similarity with all corpus texts
            similarities = []
            for corpus_text in corpus_texts:
                corpus_set = self.text_to_words(corpus_text)
                sim = self.jaccard_similarity(query_set, corpus_set)
                similarities.append(sim)
            
            # Get top-k
            similarities = np.array(similarities)
            top_indices = np.argsort(similarities)[-top_k:][::-1]
            top_scores = similarities[top_indices]
            
            results.append(list(zip(top_indices, top_scores)))
        
        return results

# Cell 3.3: Method 3 - SBERT Semantic Embeddings (OUR METHOD)

In [11]:
class SBERTMatcher:
    """
    Semantic matching using Sentence-BERT embeddings.
    
    How It Works:
    ------------
    1. Use pre-trained BERT model fine-tuned for semantic similarity
    2. Each text ‚Üí 384-dimensional vector (embedding)
    3. Similar texts have high cosine similarity in embedding space
    
    Key Innovation:
    --------------
    Unlike bag-of-words methods, SBERT understands:
    - "Python programmer" ‚âà "Python developer" (synonyms)
    - "ML Engineer" ‚âà "Machine Learning Engineer" (abbreviations)
    - "Data Scientist" ‚âà "Data Analyst" (related roles)
    
    Mathematical Foundation:
    -----------------------
    Text ‚Üí BERT ‚Üí Pooling ‚Üí L2 Normalization ‚Üí 384-D vector
    
    Similarity = cosine(v1, v2) = (v1 ¬∑ v2) / (||v1|| ||v2||)
    
    Why This Works:
    --------------
    BERT was pre-trained on billions of words, learning:
    - Word relationships (king - man + woman ‚âà queen)
    - Context understanding (bank of river vs. bank account)
    - Semantic similarity (car ‚âà automobile)
    
    Strengths:
    ---------
    + Semantic understanding: Handles synonyms, paraphrases
    + Context-aware: Word meaning depends on surrounding words
    + Dense representations: Every dimension is meaningful
    + Transfer learning: Pre-trained on massive datasets
    
    Weaknesses:
    ----------
    - Slower: Neural network forward pass
    - Black box: Hard to explain why texts match
    - Memory: 384 dimensions vs. sparse TF-IDF
    - Requires pre-trained model (~80MB)
    
    Performance Characteristics:
    --------------------------
    Time Complexity:
    - Embedding: O(L) per text where L=text length
    - Query: O(1) with pre-computed embeddings
    
    Space Complexity:
    - O(n*384) for embeddings
    - ~3KB per entity (384 floats √ó 8 bytes)
    
    Optimization Strategies:
    ----------------------
    1. Batch processing: Embed 32-128 texts at once
    2. Caching: Pre-compute and save embeddings
    3. GPU acceleration: 10x faster with CUDA
    4. Quantization: 8-bit embeddings (4x smaller, 1% accuracy loss)
    
    Why This is Our PRIMARY Method:
    ------------------------------
    Semantic understanding is ESSENTIAL for HR matching because:
    - Candidates use different terms than companies
    - Skills have many equivalent expressions
    - Context matters (Python for data vs. Python for web)
    """
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize SBERT matcher.
        
        Model Selection Rationale:
        -------------------------
        all-MiniLM-L6-v2:
        ‚Ä¢ Balanced: Quality vs. speed
        ‚Ä¢ Size: 80MB (deployable)
        ‚Ä¢ Speed: ~10ms per text on CPU
        ‚Ä¢ Quality: 68.06 on STS benchmark
        
        Alternatives:
        ‚Ä¢ all-mpnet-base-v2: Better quality (768D), but 420MB
        ‚Ä¢ paraphrase-MiniLM-L3-v2: Faster but lower quality
        
        Args:
            model_name: HuggingFace model identifier
        """
        print(f"   üß† Loading SBERT model: {model_name}...")
        start = time.time()
        
        self.model = SentenceTransformer(model_name)
        
        elapsed = time.time() - start
        print(f"   ‚úÖ Model loaded in {elapsed:.2f}s")
        print(f"   üìè Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
    
    def embed(self, texts: List[str], batch_size: int = 32, 
             show_progress: bool = True) -> np.ndarray:
        """
        Embed texts to semantic vectors.
        
        Optimization Notes:
        ------------------
        1. Batch size 32: Good balance for CPU
           - Too small: Underutilizes model
           - Too large: Memory issues
        
        2. show_progress_bar: Useful for large datasets
        
        3. convert_to_numpy: Returns numpy array (default is list)
        
        Performance:
        -----------
        For 9,544 candidates:
        - Time: ~30 seconds on CPU
        - Time: ~3 seconds on GPU
        - Memory: 9,544 √ó 384 √ó 4 bytes = 14MB
        
        Args:
            texts: List of texts to embed
            batch_size: Number of texts to process at once
            show_progress: Show progress bar
            
        Returns:
            numpy array of shape (n_texts, 384)
        """
        print(f"   üîÑ Embedding {len(texts):,} texts...")
        start = time.time()
        
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=show_progress,
            convert_to_numpy=True,
            normalize_embeddings=True  # L2 normalize for cosine similarity
        )
        
        elapsed = time.time() - start
        print(f"   ‚úÖ Embedded in {elapsed:.2f}s ({len(texts)/elapsed:.0f} texts/sec)")
        print(f"   üì¶ Shape: {embeddings.shape}")
        
        return embeddings
    
    def match(self, query_embeddings: np.ndarray, corpus_embeddings: np.ndarray,
             top_k: int = 10) -> List[List[Tuple[int, float]]]:
        """
        Find top-k matches using cosine similarity.
        
        Optimization:
        ------------
        Pre-computed embeddings make this EXTREMELY fast:
        - Matrix multiplication: O(Q*C*D) but highly optimized
        - For 10K x 20K comparison: ~100ms
        
        Why So Fast?
        -----------
        1. NumPy/BLAS optimizations
        2. L2-normalized vectors: cosine = dot product
        3. Batch matrix operations
        
        Args:
            query_embeddings: Shape (n_queries, 384)
            corpus_embeddings: Shape (n_corpus, 384)
            top_k: Number of matches per query
            
        Returns:
            List of [(index, score), ...] for each query
        """
        # Compute similarity matrix
        # Since vectors are L2-normalized, cosine = dot product
        similarities = query_embeddings @ corpus_embeddings.T
        
        # Get top-k for each query
        results = []
        for sim_row in similarities:
            top_indices = np.argsort(sim_row)[-top_k:][::-1]
            top_scores = sim_row[top_indices]
            results.append(list(zip(top_indices, top_scores)))
        
        return results

# Cell 3.4: Comparative Analysis

In [12]:
def compare_methods(query_texts: List[str], corpus_texts: List[str],
                   top_k: int = 10) -> Dict[str, any]:
    """
    Compare all three matching methods.
    
    This function runs all three methods and collects:
    - Match quality (similarity scores)
    - Performance (time taken)
    - Top matches from each method
    
    Educational Purpose:
    -------------------
    Shows empirically that semantic embeddings outperform
    traditional keyword-based methods for HR matching.
    
    Returns:
        Dict with results from each method
    """
    results = {}
    
    print("\n" + "="*80)
    print("üî¨ COMPARATIVE ANALYSIS: THREE MATCHING METHODS")
    print("="*80)
    
    # ========================================================================
    # Method 1: TF-IDF
    # ========================================================================
    print("\nüî¥ METHOD 1: TF-IDF + Cosine Similarity")
    print("-" * 80)
    start = time.time()
    
    tfidf_matcher = TFIDFMatcher()
    all_texts = query_texts + corpus_texts
    tfidf_matcher.fit(all_texts)
    tfidf_results = tfidf_matcher.match(query_texts, corpus_texts, top_k)
    
    tfidf_time = time.time() - start
    tfidf_avg_score = np.mean([score for matches in tfidf_results 
                               for _, score in matches])
    
    results['tfidf'] = {
        'matches': tfidf_results,
        'time': tfidf_time,
        'avg_score': tfidf_avg_score
    }
    
    print(f"   ‚è±Ô∏è  Time: {tfidf_time:.2f}s")
    print(f"   üìä Avg similarity: {tfidf_avg_score:.4f}")
    
    # ========================================================================
    # Method 2: Keyword Overlap
    # ========================================================================
    print("\nüü° METHOD 2: Keyword Overlap (Jaccard)")
    print("-" * 80)
    start = time.time()
    
    jaccard_matcher = KeywordOverlapMatcher()
    jaccard_results = jaccard_matcher.match(query_texts, corpus_texts, top_k)
    
    jaccard_time = time.time() - start
    jaccard_avg_score = np.mean([score for matches in jaccard_results 
                                 for _, score in matches])
    
    results['jaccard'] = {
        'matches': jaccard_results,
        'time': jaccard_time,
        'avg_score': jaccard_avg_score
    }
    
    print(f"   ‚è±Ô∏è  Time: {jaccard_time:.2f}s")
    print(f"   üìä Avg similarity: {jaccard_avg_score:.4f}")
    
    # ========================================================================
    # Method 3: SBERT
    # ========================================================================
    print("\nüü¢ METHOD 3: SBERT Semantic Embeddings")
    print("-" * 80)
    start = time.time()
    
    sbert_matcher = SBERTMatcher()
    query_embeddings = sbert_matcher.embed(query_texts, show_progress=False)
    corpus_embeddings = sbert_matcher.embed(corpus_texts, show_progress=False)
    sbert_results = sbert_matcher.match(query_embeddings, corpus_embeddings, top_k)
    
    sbert_time = time.time() - start
    sbert_avg_score = np.mean([score for matches in sbert_results 
                               for _, score in matches])
    
    results['sbert'] = {
        'matches': sbert_results,
        'time': sbert_time,
        'avg_score': sbert_avg_score
    }
    
    print(f"   ‚è±Ô∏è  Time: {sbert_time:.2f}s")
    print(f"   üìä Avg similarity: {sbert_avg_score:.4f}")
    
    # ========================================================================
    # Summary
    # ========================================================================
    print("\n" + "="*80)
    print("üìä SUMMARY COMPARISON")
    print("="*80)
    print(f"\n{'Method':<25} {'Time (s)':<12} {'Avg Score':<12} {'Speedup':<10}")
    print("-" * 80)
    
    base_time = tfidf_time
    for method_name, method_results in results.items():
        speedup = base_time / method_results['time']
        print(f"{method_name:<25} "
              f"{method_results['time']:>10.2f}s  "
              f"{method_results['avg_score']:>10.4f}  "
              f"{speedup:>8.2f}x")
    
    print("="*80)
    
    return results

# SECTION 4: DATA LOADING & ENRICHMENT

# Cell 4.1: Data Loader Class

In [17]:
# ============================================================================
# Cell 4.1: Data Loader Class (COMPLETE VERSION WITH ALL METHODS!)
# ============================================================================

class DataLoader:
    """
    Centralized data loading with validation and error handling.
    
    Loads ALL CSV files needed for complete company enrichment:
    - resume_data.csv (candidates)
    - companies.csv (base company data)
    - postings.csv (job postings)
    - job_skills.csv (job-skill mappings)
    - company_industries.csv (company-industry mappings)
    - company_specialities.csv (company-specialty mappings)
    """
    
    def __init__(self, csv_path: str = '../csv_files/'):
        self.csv_path = csv_path
        self.datasets = {}
    
    def load_candidates(self) -> pd.DataFrame:
        """Load candidate profiles from resume_data.csv"""
        print("üìÇ Loading candidates...")
        
        file_path = f"{self.csv_path}resume_data.csv"
        
        try:
            df = pd.read_csv(file_path)
            
            if df.empty:
                raise ValueError("Candidates file is empty!")
            
            self.datasets['candidates'] = df
            print(f"   ‚úÖ Loaded {len(df):,} candidates")
            print(f"   üìä Columns: {df.shape[1]}")
            
            return df
            
        except FileNotFoundError:
            raise FileNotFoundError(
                f"Candidates file not found: {file_path}\n"
                f"Please ensure resume_data.csv is in {self.csv_path}"
            )
        except Exception as e:
            raise Exception(f"Error loading candidates: {e}")
    
    def load_companies_base(self) -> pd.DataFrame:
        """Load base company data"""
        print("üìÇ Loading companies (base)...")
        
        file_path = f"{self.csv_path}companies.csv"
        
        try:
            df = pd.read_csv(file_path)
            
            if df.empty:
                raise ValueError("Companies file is empty!")
            
            self.datasets['companies_base'] = df
            print(f"   ‚úÖ Loaded {len(df):,} companies")
            
            return df
            
        except FileNotFoundError:
            raise FileNotFoundError(f"Companies file not found: {file_path}")
        except Exception as e:
            raise Exception(f"Error loading companies: {e}")
    
    def load_job_postings(self) -> pd.DataFrame:
        """Load job postings dataset"""
        print("üìÇ Loading job postings...")
        
        file_path = f"{self.csv_path}postings.csv"
        
        try:
            df = pd.read_csv(file_path)
            
            if df.empty:
                print("   ‚ö†Ô∏è  Postings file is empty")
                return pd.DataFrame()
            
            self.datasets['postings'] = df
            print(f"   ‚úÖ Loaded {len(df):,} postings")
            
            if 'company_id' in df.columns:
                print(f"   üè¢ Unique companies: {df['company_id'].nunique():,}")
            
            return df
            
        except FileNotFoundError:
            print(f"   ‚ö†Ô∏è  Postings file not found: {file_path}")
            print(f"   üí° Continuing without posting enrichment")
            return pd.DataFrame()
        except Exception as e:
            print(f"   ‚ö†Ô∏è  Error loading postings: {e}")
            return pd.DataFrame()
    
    def load_job_skills(self) -> pd.DataFrame:
        """Load job-to-skills mapping"""
        print("üìÇ Loading job skills...")
        
        file_path = f"{self.csv_path}job_skills.csv"
        
        try:
            df = pd.read_csv(file_path)
            
            if df.empty:
                print("   ‚ö†Ô∏è  Job skills file is empty")
                return pd.DataFrame()
            
            self.datasets['job_skills'] = df
            print(f"   ‚úÖ Loaded {len(df):,} job-skill mappings")
            
            return df
            
        except FileNotFoundError:
            print(f"   ‚ö†Ô∏è  Job skills file not found")
            return pd.DataFrame()
        except Exception as e:
            print(f"   ‚ö†Ô∏è  Error loading job skills: {e}")
            return pd.DataFrame()
    
    def load_company_industries(self) -> pd.DataFrame:
        """
        Load company industries mapping.
        
        File: company_industries.csv
        Columns: company_id, industry
        """
        print("üìÇ Loading company industries...")
        
        file_path = f"{self.csv_path}company_industries.csv"
        
        try:
            df = pd.read_csv(file_path)
            
            if df.empty:
                print("   ‚ö†Ô∏è  Company industries file is empty")
                return pd.DataFrame()
            
            self.datasets['company_industries'] = df
            print(f"   ‚úÖ Loaded {len(df):,} company-industry mappings")
            
            if 'company_id' in df.columns:
                print(f"   üè¢ Unique companies: {df['company_id'].nunique():,}")
            
            return df
            
        except FileNotFoundError:
            print(f"   ‚ö†Ô∏è  Company industries not found: {file_path}")
            print(f"   üí° Continuing without industry data")
            return pd.DataFrame()
        except Exception as e:
            print(f"   ‚ö†Ô∏è  Error loading company industries: {e}")
            return pd.DataFrame()
    
    def load_company_specialties(self) -> pd.DataFrame:
        """
        Load company specialties mapping.
        
        File: company_specialities.csv (note: specialITies, not specialTies!)
        Columns: company_id, speciality
        """
        print("üìÇ Loading company specialties...")
        
        # Note: Your file is named 'specialities' (British spelling)
        file_path = f"{self.csv_path}company_specialities.csv"
        
        try:
            df = pd.read_csv(file_path)
            
            if df.empty:
                print("   ‚ö†Ô∏è  Company specialties file is empty")
                return pd.DataFrame()
            
            self.datasets['company_specialties'] = df
            print(f"   ‚úÖ Loaded {len(df):,} company-specialty mappings")
            
            if 'company_id' in df.columns:
                print(f"   üè¢ Unique companies: {df['company_id'].nunique():,}")
            
            return df
            
        except FileNotFoundError:
            print(f"   ‚ö†Ô∏è  Company specialties not found: {file_path}")
            print(f"   üí° Continuing without specialty data")
            return pd.DataFrame()
        except Exception as e:
            print(f"   ‚ö†Ô∏è  Error loading company specialties: {e}")
            return pd.DataFrame()
    
    def load_all(self) -> Dict[str, pd.DataFrame]:
        """
        Load all datasets at once.
        
        Returns dict with all loaded DataFrames.
        """
        print("\n" + "="*80)
        print("üì• LOADING ALL DATASETS")
        print("="*80 + "\n")
        
        # Core datasets (MUST succeed)
        try:
            candidates = self.load_candidates()
        except Exception as e:
            raise Exception(f"Failed to load candidates: {e}")
        
        try:
            companies = self.load_companies_base()
        except Exception as e:
            raise Exception(f"Failed to load companies: {e}")
        
        # Optional enrichment datasets (can fail gracefully)
        postings = self.load_job_postings()
        job_skills = self.load_job_skills()
        industries = self.load_company_industries()
        specialties = self.load_company_specialties()
        
        print("\n" + "="*80)
        print("üìä DATASET SUMMARY")
        print("="*80)
        
        total_rows = sum(len(df) for df in self.datasets.values())
        print(f"\n   Total entities: {total_rows:,}")
        print(f"   Datasets loaded: {len(self.datasets)}")
        
        for name, df in self.datasets.items():
            memory_mb = df.memory_usage(deep=True).sum() / (1024**2)
            print(f"   ‚Ä¢ {name}: {len(df):,} rows ({memory_mb:.1f} MB)")
        
        print("\n" + "="*80)
        
        return self.datasets

print("‚úÖ DataLoader class loaded (COMPLETE VERSION)!")
print("   ‚Ä¢ Loads candidates (resume_data.csv)")
print("   ‚Ä¢ Loads companies (companies.csv)")
print("   ‚Ä¢ Loads postings (postings.csv)")
print("   ‚Ä¢ Loads job skills (job_skills.csv)")
print("   ‚Ä¢ Loads industries (company_industries.csv)")
print("   ‚Ä¢ Loads specialties (company_specialities.csv)")

‚úÖ DataLoader class loaded (COMPLETE VERSION)!
   ‚Ä¢ Loads candidates (resume_data.csv)
   ‚Ä¢ Loads companies (companies.csv)
   ‚Ä¢ Loads postings (postings.csv)
   ‚Ä¢ Loads job skills (job_skills.csv)
   ‚Ä¢ Loads industries (company_industries.csv)
   ‚Ä¢ Loads specialties (company_specialities.csv)


# Cell 4.2: Company Enrichment Engine

In [18]:
# ============================================================================
# Cell 4.2: Company Enrichment Engine (HYBRID - COMPLETE VERSION)
# ============================================================================

class CompanyEnricher:
    """
    Enriches company profiles with COMPLETE data from multiple sources.
    
    THE KEY INNOVATION OF HRHUB + COMPREHENSIVE DATA AGGREGATION!
    
    Data Sources:
    ------------
    1. Job Postings ‚Üí Skills (THE BRIDGE!)
    2. Company Industries ‚Üí Industry list
    3. Company Specialties ‚Üí Specialty list
    4. Job Postings ‚Üí Job titles, salary data, posting counts
    
    Why Multiple Sources?
    --------------------
    - Skills alone aren't enough
    - Industries provide context
    - Specialties show focus areas
    - Job titles show hiring patterns
    - Salaries help with level matching
    
    Problem Solved:
    --------------
    Companies: "We are a fintech startup"
    Candidates: "Python, React, AWS"
    ‚Üí NO MATCH!
    
    After Enrichment:
    ----------------
    Companies: "Fintech startup. Industries: Finance, Technology. 
                Skills: Python, React, AWS. Job Titles: Software Engineer"
    Candidates: "Python, React, AWS"
    ‚Üí MATCH! ‚úÖ
    
    Coverage Impact:
    ---------------
    Before: 30% companies with complete data
    After: 96.1% companies with enriched skills
          100% with industries/specialties
    """
    
    @staticmethod
    def aggregate_industries(company_industries_df: pd.DataFrame) -> pd.DataFrame:
        """
        Aggregate industries per company.
        
        Example:
        -------
        company_id=123:
        - row 1: "Finance"
        - row 2: "Technology"
        Result: "Finance, Technology"
        
        Args:
            company_industries_df: DataFrame with (company_id, industry)
        
        Returns:
            DataFrame with (company_id, industries_list)
        """
        print("\n1Ô∏è‚É£  Aggregating industries...")
        
        if company_industries_df.empty:
            print("   ‚ö†Ô∏è  No industry data")
            return pd.DataFrame(columns=['company_id', 'industries_list'])
        
        industries_grouped = company_industries_df.groupby('company_id')['industry'].apply(
            lambda x: ', '.join(x.dropna().astype(str).unique())
        ).reset_index()
        
        industries_grouped.columns = ['company_id', 'industries_list']
        
        print(f"   ‚úÖ Aggregated: {len(industries_grouped):,} companies")
        
        return industries_grouped
    
    @staticmethod
    def aggregate_specialties(company_specialties_df: pd.DataFrame) -> pd.DataFrame:
        """
        Aggregate specialties per company.
        
        Note: Your CSV has 'speciality' (not 'specialty')
        
        Args:
            company_specialties_df: DataFrame with (company_id, speciality)
        
        Returns:
            DataFrame with (company_id, specialties_list)
        """
        print("\n2Ô∏è‚É£  Aggregating specialties...")
        
        if company_specialties_df.empty:
            print("   ‚ö†Ô∏è  No specialty data")
            return pd.DataFrame(columns=['company_id', 'specialties_list'])
        
        specialties_grouped = company_specialties_df.groupby('company_id')['speciality'].apply(
            lambda x: ', '.join(x.dropna().astype(str).unique())
        ).reset_index()
        
        specialties_grouped.columns = ['company_id', 'specialties_list']
        
        print(f"   ‚úÖ Aggregated: {len(specialties_grouped):,} companies")
        
        return specialties_grouped
    
    @staticmethod
    def extract_skills_from_postings(postings_df: pd.DataFrame,
                                     job_skills_df: pd.DataFrame) -> pd.DataFrame:
        """
        Extract and aggregate skills per company from job postings.
        
        THE BRIDGE - Most critical function!
        
        Uses skill_abr (your column name, not skill_name)
        
        Args:
            postings_df: Job postings with company_id
            job_skills_df: Job-skill mappings with skill_abr
            
        Returns:
            DataFrame with (company_id, enriched_skills)
        """
        print("\n3Ô∏è‚É£  Extracting skills from job postings...")
        
        if postings_df.empty or job_skills_df.empty:
            print("   ‚ö†Ô∏è  No postings or skills data")
            return pd.DataFrame(columns=['company_id', 'enriched_skills'])
        
        # Merge postings with skills on job_id
        merged = postings_df.merge(
            job_skills_df,
            on='job_id',
            how='inner'
        )
        
        print(f"   üìä Merged: {len(merged):,} job-skill pairs")
        
        if merged.empty:
            print("   ‚ö†Ô∏è  No matches found!")
            return pd.DataFrame(columns=['company_id', 'enriched_skills'])
        
        # Group by company and aggregate skill_abr
        company_skills = merged.groupby('company_id')['skill_abr'].apply(
            lambda x: ', '.join(sorted(set(str(s) for s in x if pd.notna(s))))
        ).reset_index()
        
        company_skills.columns = ['company_id', 'enriched_skills']
        
        print(f"   ‚úÖ Enriched: {len(company_skills):,} companies")
        
        return company_skills
    
    @staticmethod
    def aggregate_job_postings(postings_df: pd.DataFrame,
                               skills_per_company: pd.DataFrame) -> pd.DataFrame:
        """
        Aggregate job posting metadata per company.
        
        Extracts:
        - Job titles (top 10 per company)
        - Salary data (avg median, avg max)
        - Total posting count
        
        Args:
            postings_df: Job postings DataFrame
            skills_per_company: Already computed skills (for merge)
        
        Returns:
            DataFrame with posting aggregates
        """
        print("\n4Ô∏è‚É£  Aggregating job posting metadata...")
        
        if postings_df.empty:
            print("   ‚ö†Ô∏è  No postings data")
            return pd.DataFrame(columns=[
                'company_id', 'posted_job_titles', 
                'avg_med_salary', 'avg_max_salary', 'total_postings'
            ])
        
        # Aggregate per company
        job_data = postings_df.groupby('company_id').agg({
            'title': lambda x: ', '.join(x.dropna().astype(str).unique()[:10]),  # Top 10 titles
            'med_salary': 'mean',
            'max_salary': 'mean',
            'job_id': 'count'
        }).reset_index()
        
        job_data.columns = [
            'company_id', 'posted_job_titles',
            'avg_med_salary', 'avg_max_salary', 'total_postings'
        ]
        
        print(f"   ‚úÖ Aggregated: {len(job_data):,} companies")
        
        return job_data
    
    @staticmethod
    def enrich_companies(companies_df: pd.DataFrame,
                        postings_df: pd.DataFrame,
                        job_skills_df: pd.DataFrame,
                        company_industries_df: pd.DataFrame = None,
                        company_specialties_df: pd.DataFrame = None) -> pd.DataFrame:
        """
        COMPLETE company enrichment from ALL sources.
        
        Enrichment Pipeline:
        -------------------
        1. Aggregate industries (from company_industries.csv)
        2. Aggregate specialties (from company_specialities.csv)
        3. Extract skills from job postings (THE BRIDGE!)
        4. Aggregate job metadata (titles, salaries, counts)
        5. Merge everything with left joins
        6. Fill missing values
        7. Validate completeness
        
        Args:
            companies_df: Base company data
            postings_df: Job postings
            job_skills_df: Job-skill mappings
            company_industries_df: Company-industry mappings (optional)
            company_specialties_df: Company-specialty mappings (optional)
        
        Returns:
            Fully enriched company DataFrame
        """
        print("\n" + "="*80)
        print("üåâ COMPLETE COMPANY ENRICHMENT")
        print("="*80)
        
        # ====================================================================
        # STEP 1: Aggregate Industries
        # ====================================================================
        if company_industries_df is not None and not company_industries_df.empty:
            industries = CompanyEnricher.aggregate_industries(company_industries_df)
        else:
            print("\n1Ô∏è‚É£  ‚ö†Ô∏è  No industry data - skipping")
            industries = pd.DataFrame(columns=['company_id', 'industries_list'])
        
        # ====================================================================
        # STEP 2: Aggregate Specialties
        # ====================================================================
        if company_specialties_df is not None and not company_specialties_df.empty:
            specialties = CompanyEnricher.aggregate_specialties(company_specialties_df)
        else:
            print("\n2Ô∏è‚É£  ‚ö†Ô∏è  No specialty data - skipping")
            specialties = pd.DataFrame(columns=['company_id', 'specialties_list'])
        
        # ====================================================================
        # STEP 3: Extract Skills (THE BRIDGE!)
        # ====================================================================
        skills = CompanyEnricher.extract_skills_from_postings(
            postings_df, job_skills_df
        )
        
        # ====================================================================
        # STEP 4: Aggregate Job Posting Metadata
        # ====================================================================
        job_meta = CompanyEnricher.aggregate_job_postings(postings_df, skills)
        
        # ====================================================================
        # STEP 5: Merge Everything (Left joins to preserve all companies)
        # ====================================================================
        print("\n5Ô∏è‚É£  Merging all data sources...")
        
        enriched = companies_df.copy()
        
        # Merge industries
        if not industries.empty:
            enriched = enriched.merge(industries, on='company_id', how='left')
        
        # Merge specialties
        if not specialties.empty:
            enriched = enriched.merge(specialties, on='company_id', how='left')
        
        # Merge skills
        if not skills.empty:
            enriched = enriched.merge(skills, on='company_id', how='left')
        
        # Merge job metadata
        if not job_meta.empty:
            enriched = enriched.merge(job_meta, on='company_id', how='left')
        
        print(f"   ‚úÖ Merged shape: {enriched.shape}")
        
        # ====================================================================
        # STEP 6: Fill Missing Values
        # ====================================================================
        print("\n6Ô∏è‚É£  Filling missing values...")
        
        fill_values = {
            'name': 'Unknown Company',
            'description': 'No description available',
            'industries_list': 'General',
            'specialties_list': 'Not specified',
            'enriched_skills': 'Not specified',
            'posted_job_titles': 'Various positions',
            'avg_med_salary': 0,
            'avg_max_salary': 0,
            'total_postings': 0
        }
        
        for col, default_val in fill_values.items():
            if col in enriched.columns:
                before_nulls = enriched[col].isna().sum()
                enriched[col] = enriched[col].fillna(default_val)
                
                # Fix empty strings
                if enriched[col].dtype == 'object':
                    enriched[col] = enriched[col].replace('', default_val)
                
                if before_nulls > 0:
                    print(f"   ‚úÖ {col:25s} {before_nulls:>6,} nulls ‚Üí filled")
        
        # ====================================================================
        # STEP 7: Validation & Coverage Report
        # ====================================================================
        print("\n7Ô∏è‚É£  Validation...")
        print("="*80)
        
        critical_cols = [
            'name', 'description', 'industries_list', 'specialties_list',
            'enriched_skills', 'posted_job_titles'
        ]
        
        all_ok = True
        for col in critical_cols:
            if col in enriched.columns:
                nulls = enriched[col].isna().sum()
                empties = (enriched[col] == '').sum()
                issues = nulls + empties
                
                status = '‚úÖ' if issues == 0 else '‚ùå'
                print(f"{status} {col:25s} {issues:>6,} issues")
                
                if issues > 0:
                    all_ok = False
        
        print("="*80)
        
        # Coverage statistics
        has_skills = ~enriched['enriched_skills'].isin(['', 'Not specified'])
        has_industries = ~enriched['industries_list'].isin(['', 'General', 'Not specified'])
        
        skills_coverage = (has_skills.sum() / len(enriched)) * 100
        industries_coverage = (has_industries.sum() / len(enriched)) * 100
        
        print(f"\nüìä COVERAGE REPORT:")
        print(f"   ‚Ä¢ Total companies: {len(enriched):,}")
        print(f"   ‚Ä¢ With enriched skills: {has_skills.sum():,} ({skills_coverage:.1f}%)")
        print(f"   ‚Ä¢ With industries: {has_industries.sum():,} ({industries_coverage:.1f}%)")
        print(f"   ‚Ä¢ Status: {'üéØ EXCELLENT!' if all_ok else '‚ö†Ô∏è  Has issues'}")
        
        print("\n" + "="*80)
        
        return enriched

print("‚úÖ CompanyEnricher class loaded (COMPLETE HYBRID VERSION)")
print("   ‚Ä¢ Industries aggregation")
print("   ‚Ä¢ Specialties aggregation")
print("   ‚Ä¢ Skills extraction (JOB POSTING BRIDGE!)")
print("   ‚Ä¢ Job metadata (titles, salaries, counts)")
print("   ‚Ä¢ Complete validation")

‚úÖ CompanyEnricher class loaded (COMPLETE HYBRID VERSION)
   ‚Ä¢ Industries aggregation
   ‚Ä¢ Specialties aggregation
   ‚Ä¢ Skills extraction (JOB POSTING BRIDGE!)
   ‚Ä¢ Job metadata (titles, salaries, counts)
   ‚Ä¢ Complete validation


# Cell 4.3: Text Preparation Pipeline

In [19]:
# ============================================================================
# Cell 4.4: Text Preparation Pipeline (CORRETO E COMPLETO!)
# ============================================================================

class TextPreparationPipeline:
    """
    End-to-end text preparation using SOLID architecture.
    
    This class orchestrates the complete text building pipeline:
    1. Text building (using TextBuilder classes)
    2. Validation (check for empty/missing data)
    3. Quality assurance (length statistics, coverage)
    4. Error handling (graceful degradation)
    
    Design Pattern: Facade Pattern
    --------------
    Simplifies complex subsystem (multiple TextBuilders) with simple interface.
    
    Benefits:
    --------
    - Single entry point for text preparation
    - Consistent validation across entity types
    - Centralized error handling
    - Easy to extend with new entity types
    
    SOLID Principles:
    ----------------
    - Single Responsibility: Only orchestrates text building
    - Open/Closed: Easy to add new prepare_X_texts methods
    - Dependency Inversion: Depends on TextBuilder abstraction
    """
    
    @staticmethod
    def prepare_candidate_texts(candidates_df: pd.DataFrame) -> List[str]:
        """
        Prepare candidate texts using CandidateTextBuilder.
        
        Pipeline Steps:
        --------------
        1. Validate input DataFrame
        2. Initialize CandidateTextBuilder
        3. Build texts in batch (using df.apply for performance)
        4. Validate output (check for empty texts)
        5. Report statistics (count, length, coverage)
        
        Quality Checks:
        --------------
        - Ensures DataFrame has required columns
        - Counts empty/missing texts
        - Calculates average text length
        - Reports coverage percentage
        
        Args:
            candidates_df: Candidate DataFrame with columns:
                          - career_objective (recommended)
                          - skills (recommended)
                          - degree_names (optional)
                          - positions (optional)
                          - Category (optional)
        
        Returns:
            List[str]: List of candidate text representations
            
        Raises:
            ValueError: If DataFrame is empty
            
        Performance:
        -----------
        - Time: ~1.8s for 9,544 candidates (using df.apply)
        - Memory: ~500 bytes per text (uncompressed)
        
        Example:
            >>> texts = TextPreparationPipeline.prepare_candidate_texts(candidates_df)
            >>> len(texts)
            9544
            >>> texts[0][:50]
            'Career objective: Seeking Python developer role...'
        """
        print("\nüìù PREPARING CANDIDATE TEXTS")
        print("-" * 80)
        
        # ====================================================================
        # STEP 1: Input Validation
        # ====================================================================
        if candidates_df.empty:
            raise ValueError("Candidates DataFrame is empty!")
        
        print(f"   Input: {len(candidates_df):,} candidates")
        
        # Check for recommended columns (not required, just warning)
        recommended_cols = ['career_objective', 'skills', 'degree_names', 'positions']
        missing_cols = [col for col in recommended_cols if col not in candidates_df.columns]
        
        if missing_cols:
            print(f"   ‚ö†Ô∏è  Missing columns: {', '.join(missing_cols)}")
            print(f"   üí° Texts will be less informative")
        
        # ====================================================================
        # STEP 2: Initialize Builder (NO IMPORT NEEDED!)
        # ====================================================================
        # Classes are already defined in this notebook!
        builder = CandidateTextBuilder()
        
        # ====================================================================
        # STEP 3: Build Texts (Batch Processing)
        # ====================================================================
        print(f"   üîÑ Building texts...")
        
        try:
            texts = builder.build_batch(candidates_df)
        except Exception as e:
            print(f"   ‚ùå Error building texts: {e}")
            raise
        
        # ====================================================================
        # STEP 4: Validate Output
        # ====================================================================
        # Check for empty/missing texts
        # Builder returns "Not specified" for empty, not "Profile not available"!
        empty_texts = [t for t in texts if not t or t.strip() == "" or t == "Not specified"]
        valid_texts = [t for t in texts if t and t.strip() and t != "Not specified"]
        
        empty_count = len(empty_texts)
        valid_count = len(valid_texts)
        
        print(f"\n   üìä Validation:")
        print(f"      ‚Ä¢ Total texts: {len(texts):,}")
        print(f"      ‚Ä¢ Valid texts: {valid_count:,}")
        print(f"      ‚Ä¢ Empty/missing: {empty_count:,}")
        
        if empty_count > 0:
            coverage = (valid_count / len(texts)) * 100
            print(f"      ‚Ä¢ Coverage: {coverage:.1f}%")
            
            if coverage < 90:
                print(f"      ‚ö†Ô∏è  Low coverage! Check data quality")
        
        # ====================================================================
        # STEP 5: Statistics
        # ====================================================================
        if valid_texts:
            lengths = [len(t) for t in valid_texts]
            avg_length = np.mean(lengths)
            median_length = np.median(lengths)
            min_length = np.min(lengths)
            max_length = np.max(lengths)
            
            print(f"\n   üìè Text Length Statistics:")
            print(f"      ‚Ä¢ Average: {avg_length:.0f} chars")
            print(f"      ‚Ä¢ Median: {median_length:.0f} chars")
            print(f"      ‚Ä¢ Range: {min_length} - {max_length} chars")
        
        print(f"\n   ‚úÖ Candidate texts prepared!")
        print("-" * 80)
        
        return texts
    
    @staticmethod
    def prepare_company_texts(companies_df: pd.DataFrame) -> List[str]:
        """
        Prepare company texts using CompanyTextBuilder.
        
        CRITICAL: This includes enriched skills from job postings!
        
        Pipeline Steps:
        --------------
        1. Validate input DataFrame
        2. Check for enriched_skills column (THE BRIDGE!)
        3. Initialize CompanyTextBuilder
        4. Build texts in batch
        5. Validate output
        6. Report statistics including enrichment coverage
        
        Quality Checks:
        --------------
        - Ensures DataFrame has required columns
        - Verifies enriched_skills column exists
        - Counts companies with real skills vs "Not specified"
        - Reports enrichment impact on text quality
        
        Args:
            companies_df: Enriched company DataFrame with columns:
                         - description (recommended)
                         - enriched_skills (CRITICAL - from job postings!)
                         - industry (optional)
                         - specialties (optional)
                         - name (optional)
        
        Returns:
            List[str]: List of company text representations
            
        Raises:
            ValueError: If DataFrame is empty
            Warning: If enriched_skills column missing
            
        Performance:
        -----------
        - Time: ~4.5s for 24,473 companies (using df.apply)
        - Memory: ~800 bytes per text (uncompressed)
        
        Example:
            >>> texts = TextPreparationPipeline.prepare_company_texts(companies_df)
            >>> len(texts)
            24473
            >>> texts[0][:50]
            'Company: Tech startup in AI. Required skills: P...'
        """
        print("\nüìù PREPARING COMPANY TEXTS")
        print("-" * 80)
        
        # ====================================================================
        # STEP 1: Input Validation
        # ====================================================================
        if companies_df.empty:
            raise ValueError("Companies DataFrame is empty!")
        
        print(f"   Input: {len(companies_df):,} companies")
        
        # ====================================================================
        # STEP 2: Check for Critical Columns
        # ====================================================================
        # enriched_skills is CRITICAL for job posting bridge!
        if 'enriched_skills' not in companies_df.columns:
            print(f"   ‚ùå CRITICAL: 'enriched_skills' column missing!")
            print(f"   üí° Run CompanyEnricher.enrich_companies() first!")
            print(f"   ‚ö†Ô∏è  Texts will lack job posting bridge (low match quality)")
        else:
            # Check how many have real skills
            has_skills = ~companies_df['enriched_skills'].isin(['', 'Not specified'])
            skills_coverage = (has_skills.sum() / len(companies_df)) * 100
            
            print(f"   üìä Enrichment Status:")
            print(f"      ‚Ä¢ With skills: {has_skills.sum():,} companies")
            print(f"      ‚Ä¢ Coverage: {skills_coverage:.1f}%")
            
            if skills_coverage < 80:
                print(f"      ‚ö†Ô∏è  Low enrichment! Expected >90%")
        
        # Check for other recommended columns
        recommended_cols = ['description', 'industry', 'name']
        missing_cols = [col for col in recommended_cols if col not in companies_df.columns]
        
        if missing_cols:
            print(f"   ‚ö†Ô∏è  Missing columns: {', '.join(missing_cols)}")
        
        # ====================================================================
        # STEP 3: Initialize Builder (NO IMPORT NEEDED!)
        # ====================================================================
        builder = CompanyTextBuilder()
        
        # ====================================================================
        # STEP 4: Build Texts (Batch Processing)
        # ====================================================================
        print(f"   üîÑ Building texts...")
        
        try:
            texts = builder.build_batch(companies_df)
        except Exception as e:
            print(f"   ‚ùå Error building texts: {e}")
            raise
        
        # ====================================================================
        # STEP 5: Validate Output
        # ====================================================================
        # Check for empty/missing texts
        empty_texts = [t for t in texts if not t or t.strip() == "" or t == "Not specified"]
        valid_texts = [t for t in texts if t and t.strip() and t != "Not specified"]
        
        empty_count = len(empty_texts)
        valid_count = len(valid_texts)
        
        print(f"\n   üìä Validation:")
        print(f"      ‚Ä¢ Total texts: {len(texts):,}")
        print(f"      ‚Ä¢ Valid texts: {valid_count:,}")
        print(f"      ‚Ä¢ Empty/missing: {empty_count:,}")
        
        if empty_count > 0:
            coverage = (valid_count / len(texts)) * 100
            print(f"      ‚Ä¢ Coverage: {coverage:.1f}%")
            
            if coverage < 95:
                print(f"      ‚ö†Ô∏è  Some companies have incomplete data")
        
        # ====================================================================
        # STEP 6: Statistics
        # ====================================================================
        if valid_texts:
            lengths = [len(t) for t in valid_texts]
            avg_length = np.mean(lengths)
            median_length = np.median(lengths)
            min_length = np.min(lengths)
            max_length = np.max(lengths)
            
            print(f"\n   üìè Text Length Statistics:")
            print(f"      ‚Ä¢ Average: {avg_length:.0f} chars")
            print(f"      ‚Ä¢ Median: {median_length:.0f} chars")
            print(f"      ‚Ä¢ Range: {min_length} - {max_length} chars")
            
            # Check enrichment impact on length
            if 'enriched_skills' in companies_df.columns:
                with_skills = companies_df['enriched_skills'] != 'Not specified'
                texts_with_skills = [texts[i] for i in range(len(texts)) if with_skills.iloc[i]]
                
                if texts_with_skills:
                    avg_with_skills = np.mean([len(t) for t in texts_with_skills])
                    print(f"\n   üåâ Job Posting Bridge Impact:")
                    print(f"      ‚Ä¢ Avg length (with skills): {avg_with_skills:.0f} chars")
                    print(f"      ‚Ä¢ Avg length (overall): {avg_length:.0f} chars")
                    print(f"      ‚Ä¢ Enrichment adds ~{avg_with_skills - avg_length:.0f} chars per company")
        
        print(f"\n   ‚úÖ Company texts prepared!")
        print("-" * 80)
        
        return texts

print("‚úÖ TextPreparationPipeline class loaded (CORRETO E COMPLETO!)")
print("   ‚Ä¢ NO imports needed (classes already in notebook)")
print("   ‚Ä¢ Robust validation")
print("   ‚Ä¢ Comprehensive statistics")
print("   ‚Ä¢ Error handling")
print("   ‚Ä¢ Enrichment verification")

‚úÖ TextPreparationPipeline class loaded (CORRETO E COMPLETO!)
   ‚Ä¢ NO imports needed (classes already in notebook)
   ‚Ä¢ Robust validation
   ‚Ä¢ Comprehensive statistics
   ‚Ä¢ Error handling
   ‚Ä¢ Enrichment verification


# Cell 4.4: Complete ETL Pipeline

In [20]:
# ============================================================================
# Cell 4.4: Complete ETL Pipeline (CORRETO E COMPLETO!)
# ============================================================================

def run_etl_pipeline(csv_path: str = '../csv_files/') -> Dict[str, any]:
    """
    Run complete ETL pipeline with FULL data enrichment.
    
    Pipeline Stages:
    ---------------
    1. EXTRACT: Load ALL CSV files (candidates, companies, postings, skills, industries, specialties)
    2. TRANSFORM: Enrich companies with ALL available data sources
    3. LOAD: Build texts and package results
    
    Data Sources:
    ------------
    - resume_data.csv ‚Üí Candidates
    - companies.csv ‚Üí Base company data
    - postings.csv ‚Üí Job postings
    - job_skills.csv ‚Üí Job-skill mappings (THE BRIDGE!)
    - company_industries.csv ‚Üí Company-industry mappings
    - company_specialities.csv ‚Üí Company-specialty mappings
    
    Returns:
    -------
    Dict with:
    - candidates_df: Candidate DataFrame
    - companies_df: FULLY enriched company DataFrame
    - candidate_texts: List of candidate text representations
    - company_texts: List of company text representations (with enrichment!)
    - stats: Coverage and count statistics
    - raw_data: Original datasets (for reference)
    
    Example:
        >>> results = run_etl_pipeline()
        >>> companies = results['companies_df']
        >>> companies['enriched_skills'].head()
    """
    print("\n" + "="*80)
    print("üè≠ COMPLETE ETL PIPELINE: EXTRACT ‚Üí TRANSFORM ‚Üí LOAD")
    print("="*80)
    
    # ========================================================================
    # STAGE 1: EXTRACT - Load ALL Data Sources
    # ========================================================================
    print("\n1Ô∏è‚É£  EXTRACT: Loading all data sources...")
    print("-" * 80)
    
    loader = DataLoader(csv_path)
    
    # Core datasets (required)
    print("\nüìÇ Loading core datasets...")
    candidates_df = loader.load_candidates()
    companies_base = loader.load_companies_base()
    
    # Enrichment datasets (optional but important!)
    print("\nüìÇ Loading enrichment datasets...")
    postings_df = loader.load_job_postings()
    job_skills_df = loader.load_job_skills()
    
    # Additional enrichment (NEW - from company CSVs)
    print("\nüìÇ Loading additional company data...")
    company_industries_df = loader.load_company_industries()
    company_specialties_df = loader.load_company_specialties()
    
    # ====================================================================
    # Verification: Ensure core data loaded
    # ====================================================================
    if candidates_df is None or candidates_df.empty:
        raise ValueError("‚ùå CRITICAL: Candidates data failed to load!")
    
    if companies_base is None or companies_base.empty:
        raise ValueError("‚ùå CRITICAL: Companies data failed to load!")
    
    print("\n‚úÖ All data sources loaded!")
    print(f"   ‚Ä¢ Candidates: {len(candidates_df):,}")
    print(f"   ‚Ä¢ Companies (base): {len(companies_base):,}")
    print(f"   ‚Ä¢ Job postings: {len(postings_df):,}" if not postings_df.empty else "   ‚ö†Ô∏è  No job postings")
    print(f"   ‚Ä¢ Job skills: {len(job_skills_df):,}" if not job_skills_df.empty else "   ‚ö†Ô∏è  No job skills")
    print(f"   ‚Ä¢ Industries: {len(company_industries_df):,}" if not company_industries_df.empty else "   ‚ö†Ô∏è  No industries")
    print(f"   ‚Ä¢ Specialties: {len(company_specialties_df):,}" if not company_specialties_df.empty else "   ‚ö†Ô∏è  No specialties")
    
    # ========================================================================
    # STAGE 2: TRANSFORM - Enrich Companies with ALL Sources
    # ========================================================================
    print("\n2Ô∏è‚É£  TRANSFORM: Enriching company data...")
    print("-" * 80)
    
    # COMPLETE enrichment with ALL data sources!
    companies_df = CompanyEnricher.enrich_companies(
        companies_df=companies_base,
        postings_df=postings_df,
        job_skills_df=job_skills_df,
        company_industries_df=company_industries_df,      # ‚Üê NEW!
        company_specialties_df=company_specialties_df     # ‚Üê NEW!
    )
    
    # ====================================================================
    # Verification: Ensure enrichment succeeded
    # ====================================================================
    if companies_df is None or companies_df.empty:
        raise ValueError("‚ùå CRITICAL: Company enrichment failed!")
    
    # Check for critical enrichment column
    if 'enriched_skills' not in companies_df.columns:
        print("   ‚ö†Ô∏è  WARNING: enriched_skills column missing!")
        print("   üí° Adding column with default values...")
        companies_df['enriched_skills'] = 'Not specified'
    
    # Report enrichment quality
    has_skills = ~companies_df['enriched_skills'].isin(['', 'Not specified'])
    skills_coverage = (has_skills.sum() / len(companies_df)) * 100
    
    has_industries = 'industries_list' in companies_df.columns and \
                     ~companies_df['industries_list'].isin(['', 'General', 'Not specified'])
    
    has_specialties = 'specialties_list' in companies_df.columns and \
                      ~companies_df['specialties_list'].isin(['', 'Not specified'])
    
    print(f"\nüìä Enrichment Quality:")
    print(f"   ‚Ä¢ Skills coverage: {skills_coverage:.1f}%")
    
    if has_industries is not False:
        industries_coverage = (has_industries.sum() / len(companies_df)) * 100
        print(f"   ‚Ä¢ Industries coverage: {industries_coverage:.1f}%")
    
    if has_specialties is not False:
        specialties_coverage = (has_specialties.sum() / len(companies_df)) * 100
        print(f"   ‚Ä¢ Specialties coverage: {specialties_coverage:.1f}%")
    
    # ====================================================================
    # Build Text Representations
    # ====================================================================
    print("\nüìù Building text representations...")
    
    candidate_texts = TextPreparationPipeline.prepare_candidate_texts(candidates_df)
    company_texts = TextPreparationPipeline.prepare_company_texts(companies_df)
    
    # ====================================================================
    # Verification: Ensure texts were created
    # ====================================================================
    if not candidate_texts or len(candidate_texts) == 0:
        raise ValueError("‚ùå CRITICAL: Candidate text preparation failed!")
    
    if not company_texts or len(company_texts) == 0:
        raise ValueError("‚ùå CRITICAL: Company text preparation failed!")
    
    # Alignment check
    if len(candidate_texts) != len(candidates_df):
        raise ValueError(f"‚ùå ALIGNMENT ERROR: {len(candidate_texts)} texts ‚â† {len(candidates_df)} candidates")
    
    if len(company_texts) != len(companies_df):
        raise ValueError(f"‚ùå ALIGNMENT ERROR: {len(company_texts)} texts ‚â† {len(companies_df)} companies")
    
    print(f"\n‚úÖ Text representations created!")
    print(f"   ‚Ä¢ Candidate texts: {len(candidate_texts):,}")
    print(f"   ‚Ä¢ Company texts: {len(company_texts):,}")
    
    # ========================================================================
    # STAGE 3: LOAD - Package Results
    # ========================================================================
    print("\n3Ô∏è‚É£  LOAD: Packaging results...")
    print("-" * 80)
    
    # Calculate comprehensive statistics
    stats = {
        'n_candidates': len(candidates_df),
        'n_companies': len(companies_df),
        'n_postings': len(postings_df) if not postings_df.empty else 0,
        'n_job_skills': len(job_skills_df) if not job_skills_df.empty else 0,
        'n_industries': len(company_industries_df) if not company_industries_df.empty else 0,
        'n_specialties': len(company_specialties_df) if not company_specialties_df.empty else 0,
        'skills_coverage_pct': skills_coverage,
        'candidate_text_avg_length': int(np.mean([len(t) for t in candidate_texts])),
        'company_text_avg_length': int(np.mean([len(t) for t in company_texts]))
    }
    
    # Package complete results
    results = {
        # Processed data (ready for matching)
        'candidates_df': candidates_df,
        'companies_df': companies_df,
        'candidate_texts': candidate_texts,
        'company_texts': company_texts,
        
        # Statistics
        'stats': stats,
        
        # Raw data (for reference/debugging)
        'raw_data': {
            'postings_df': postings_df,
            'job_skills_df': job_skills_df,
            'company_industries_df': company_industries_df,
            'company_specialties_df': company_specialties_df
        }
    }
    
    # ========================================================================
    # Final Report
    # ========================================================================
    print("\n" + "="*80)
    print("‚úÖ ETL PIPELINE COMPLETE")
    print("="*80)
    
    print(f"\nüìä Final Statistics:")
    print(f"   Entities:")
    print(f"   ‚Ä¢ Candidates: {stats['n_candidates']:,}")
    print(f"   ‚Ä¢ Companies: {stats['n_companies']:,}")
    print(f"   ‚Ä¢ Job postings: {stats['n_postings']:,}")
    print(f"   ‚Ä¢ Job-skill mappings: {stats['n_job_skills']:,}")
    print(f"   ‚Ä¢ Industry mappings: {stats['n_industries']:,}")
    print(f"   ‚Ä¢ Specialty mappings: {stats['n_specialties']:,}")
    
    print(f"\n   Quality:")
    print(f"   ‚Ä¢ Skills coverage: {stats['skills_coverage_pct']:.1f}%")
    print(f"   ‚Ä¢ Candidate text length: {stats['candidate_text_avg_length']} chars (avg)")
    print(f"   ‚Ä¢ Company text length: {stats['company_text_avg_length']} chars (avg)")
    
    print(f"\nüöÄ Ready for embedding generation and matching!")
    print("="*80)
    
    return results

print("‚úÖ Complete ETL pipeline function loaded!")
print("   ‚Ä¢ Loads ALL data sources (6 CSV files)")
print("   ‚Ä¢ COMPLETE company enrichment (skills + industries + specialties)")
print("   ‚Ä¢ Robust validation and error handling")
print("   ‚Ä¢ Comprehensive statistics")

‚úÖ Complete ETL pipeline function loaded!
   ‚Ä¢ Loads ALL data sources (6 CSV files)
   ‚Ä¢ COMPLETE company enrichment (skills + industries + specialties)
   ‚Ä¢ Robust validation and error handling
   ‚Ä¢ Comprehensive statistics


# SECTION 5: EXECUTE COMPLETE PIPELINE

In [21]:
# ============================================================================
# SECTION 5: EXECUTE COMPLETE PIPELINE (WITH SMART CACHING!)
# ============================================================================

# ============================================================================
# Cell 5.1: Load Data & Build Texts
# ============================================================================

print("\n" + "="*80)
print("SECTION 5: EXECUTE COMPLETE PIPELINE")
print("="*80)

print("\n" + "="*80)
print("STEP 1: Data Loading & Text Preparation")
print("="*80)

# Run ETL pipeline (from Batch 4)
etl_results = run_etl_pipeline(config.CSV_PATH)

# Extract results
candidates_df = etl_results['candidates_df']
companies_df = etl_results['companies_df']
candidate_texts = etl_results['candidate_texts']
company_texts = etl_results['company_texts']
coverage_pct = etl_results['stats']['skills_coverage_pct']

print(f"\n‚úÖ Data prepared:")
print(f"   ‚Ä¢ Candidates: {len(candidates_df):,}")
print(f"   ‚Ä¢ Companies: {len(companies_df):,}")
print(f"   ‚Ä¢ Candidate texts: {len(candidate_texts):,}")
print(f"   ‚Ä¢ Company texts: {len(company_texts):,}")
print(f"   ‚Ä¢ Skills coverage: {coverage_pct:.1f}%")

# ============================================================================
# Cell 5.2: Generate SBERT Embeddings (WITH SMART CACHING!)
# ============================================================================

print("\n" + "="*80)
print("STEP 2: Generating Semantic Embeddings (SBERT)")
print("="*80)

# ============================================================================
# FORCE CPU MODE (stable, no CUDA errors)
# ============================================================================
import torch
torch.cuda.is_available = lambda: False  # Force CPU

print("\n‚ö†Ô∏è  GPU disabled - using CPU mode (slower but stable)")

# ============================================================================
# Initialize EmbeddingManager (FOR SMART CACHING!)
# ============================================================================
embedding_manager = EmbeddingManager(model_name=config.EMBEDDING_MODEL)

# Define cache file paths
cache_dir = config.PROCESSED_PATH
Path(cache_dir).mkdir(parents=True, exist_ok=True)

cand_embeddings_file = f'{cache_dir}candidate_embeddings.npy'
cand_metadata_file = f'{cache_dir}candidate_metadata.pkl'
comp_embeddings_file = f'{cache_dir}company_embeddings.npy'
comp_metadata_file = f'{cache_dir}company_metadata.pkl'

# ============================================================================
# TRY TO LOAD CACHED EMBEDDINGS (5min ‚Üí 3sec!)
# ============================================================================
print("\nüîç Checking for cached embeddings...")

try:
    # Try loading candidate embeddings
    candidate_embeddings, cand_meta = embedding_manager.load_embeddings(
        cand_embeddings_file,
        cand_metadata_file
    )
    
    # Try loading company embeddings
    company_embeddings, comp_meta = embedding_manager.load_embeddings(
        comp_embeddings_file,
        comp_metadata_file
    )
    
    # Verify alignment
    cand_aligned = embedding_manager.check_alignment(candidate_embeddings, candidates_df)
    comp_aligned = embedding_manager.check_alignment(company_embeddings, companies_df)
    
    if cand_aligned and comp_aligned:
        print("\n‚úÖ LOADED FROM CACHE!")
        print(f"   üöÄ Saved ~10 minutes of computation time!")
        embeddings_cached = True
    else:
        print("\n‚ö†Ô∏è  Cache alignment mismatch - regenerating...")
        embeddings_cached = False

except (FileNotFoundError, Exception) as e:
    print(f"\n‚ö†Ô∏è  Cache not found or invalid: {e}")
    print("   üìù Will generate fresh embeddings...")
    embeddings_cached = False

# ============================================================================
# GENERATE EMBEDDINGS IF NOT CACHED
# ============================================================================
if not embeddings_cached:
    print("\nüîÑ Generating embeddings (this will take ~10 minutes on CPU)...")
    print("   ‚òï Perfect time for a coffee break!")
    
    # Generate candidate embeddings
    print("\nüîÑ Embedding candidates...")
    print(f"   üìä Processing {len(candidate_texts):,} texts...")
    
    candidate_embeddings = embedding_manager.generate_embeddings(
        candidate_texts,
        show_progress=True,
        batch_size=32
    )
    
    # Generate company embeddings
    print("\nüîÑ Embedding companies...")
    print(f"   üìä Processing {len(company_texts):,} texts...")
    
    company_embeddings = embedding_manager.generate_embeddings(
        company_texts,
        show_progress=True,
        batch_size=32
    )
    
    # ========================================================================
    # SAVE EMBEDDINGS FOR NEXT TIME! (CRITICAL!)
    # ========================================================================
    print("\nüíæ Saving embeddings for future use...")
    
    # Prepare metadata (just IDs for alignment)
    cand_metadata = candidates_df[['candidate_id']].copy() if 'candidate_id' in candidates_df.columns else candidates_df.reset_index()[['index']].rename(columns={'index': 'candidate_id'})
    comp_metadata = companies_df[['company_id']].copy() if 'company_id' in companies_df.columns else companies_df.reset_index()[['index']].rename(columns={'index': 'company_id'})
    
    # Save candidate embeddings
    embedding_manager.save_embeddings(
        candidate_embeddings,
        cand_metadata,
        cand_embeddings_file,
        cand_metadata_file
    )
    
    # Save company embeddings
    embedding_manager.save_embeddings(
        company_embeddings,
        comp_metadata,
        comp_embeddings_file,
        comp_metadata_file
    )
    
    print("\n‚úÖ Embeddings saved! Next run will be 100x faster!")

# ============================================================================
# FINAL SUMMARY
# ============================================================================
print("\n" + "="*80)
print("üìä EMBEDDING GENERATION COMPLETE")
print("="*80)

print(f"\n‚úÖ Final embeddings:")
print(f"   ‚Ä¢ Candidate embeddings: {candidate_embeddings.shape}")
print(f"   ‚Ä¢ Company embeddings: {company_embeddings.shape}")
print(f"   ‚Ä¢ Total memory: ~{(candidate_embeddings.nbytes + company_embeddings.nbytes) / (1024**2):.1f} MB")
print(f"   ‚Ä¢ Cached: {'‚úÖ Yes (next run will be instant!)' if not embeddings_cached else '‚úÖ Loaded from cache'}")

# ============================================================================
# Cell 5.3: Verify Embedding Quality
# ============================================================================

print("\n" + "="*80)
print("STEP 3: Embedding Quality Verification")
print("="*80)

# ====================================================================
# Check for NaN or infinite values
# ====================================================================
print("\nüîç Data integrity checks:")

cand_has_nan = np.any(np.isnan(candidate_embeddings))
cand_has_inf = np.any(np.isinf(candidate_embeddings))
comp_has_nan = np.any(np.isnan(company_embeddings))
comp_has_inf = np.any(np.isinf(company_embeddings))

cand_valid = not (cand_has_nan or cand_has_inf)
comp_valid = not (comp_has_nan or comp_has_inf)

print(f"   Candidates:")
print(f"      ‚Ä¢ Valid: {'‚úÖ' if cand_valid else '‚ùå'}")
if not cand_valid:
    print(f"      ‚Ä¢ Has NaN: {'Yes ‚ùå' if cand_has_nan else 'No ‚úÖ'}")
    print(f"      ‚Ä¢ Has Inf: {'Yes ‚ùå' if cand_has_inf else 'No ‚úÖ'}")

print(f"\n   Companies:")
print(f"      ‚Ä¢ Valid: {'‚úÖ' if comp_valid else '‚ùå'}")
if not comp_valid:
    print(f"      ‚Ä¢ Has NaN: {'Yes ‚ùå' if comp_has_nan else 'No ‚úÖ'}")
    print(f"      ‚Ä¢ Has Inf: {'Yes ‚ùå' if comp_has_inf else 'No ‚úÖ'}")

if not (cand_valid and comp_valid):
    raise ValueError("‚ùå CRITICAL: Embeddings contain invalid values!")

# ====================================================================
# Check embedding statistics
# ====================================================================
print(f"\nüìä Embedding statistics:")

print(f"\n   Candidates ({candidate_embeddings.shape}):")
print(f"      ‚Ä¢ Mean:  {candidate_embeddings.mean():.6f}")
print(f"      ‚Ä¢ Std:   {candidate_embeddings.std():.6f}")
print(f"      ‚Ä¢ Min:   {candidate_embeddings.min():.6f}")
print(f"      ‚Ä¢ Max:   {candidate_embeddings.max():.6f}")

print(f"\n   Companies ({company_embeddings.shape}):")
print(f"      ‚Ä¢ Mean:  {company_embeddings.mean():.6f}")
print(f"      ‚Ä¢ Std:   {company_embeddings.std():.6f}")
print(f"      ‚Ä¢ Min:   {company_embeddings.min():.6f}")
print(f"      ‚Ä¢ Max:   {company_embeddings.max():.6f}")

# ====================================================================
# Normalization check (should be L2-normalized)
# ====================================================================
print(f"\nüîç Normalization check (embeddings should be L2-normalized):")

cand_norms = np.linalg.norm(candidate_embeddings, axis=1)
comp_norms = np.linalg.norm(company_embeddings, axis=1)

cand_normalized = np.allclose(cand_norms, 1.0, atol=1e-5)
comp_normalized = np.allclose(comp_norms, 1.0, atol=1e-5)

print(f"   ‚Ä¢ Candidates L2-normalized: {'‚úÖ' if cand_normalized else '‚ùå'}")
print(f"      Mean norm: {cand_norms.mean():.6f} (should be ~1.0)")

print(f"   ‚Ä¢ Companies L2-normalized: {'‚úÖ' if comp_normalized else '‚ùå'}")
print(f"      Mean norm: {comp_norms.mean():.6f} (should be ~1.0)")

if not (cand_normalized and comp_normalized):
    print("\n   ‚ö†Ô∏è  WARNING: Embeddings not properly normalized!")
    print("   üí° This may affect cosine similarity calculations")

# ====================================================================
# Sample similarity check (sanity test)
# ====================================================================
print(f"\nüß™ Sample similarity test:")

# Compute similarity between first candidate and first 5 companies
sample_similarities = cosine_similarity(
    candidate_embeddings[0:1],
    company_embeddings[0:5]
)[0]

print(f"   Candidate 0 vs Companies 0-4:")
for i, sim in enumerate(sample_similarities):
    print(f"      ‚Ä¢ Company {i}: {sim:.4f}")

if np.all(sample_similarities == sample_similarities[0]):
    print("\n   ‚ö†Ô∏è  WARNING: All similarities are identical!")
    print("   üí° This suggests embeddings may not be diverse")
else:
    print(f"\n   ‚úÖ Similarities show variation (range: {sample_similarities.min():.4f} - {sample_similarities.max():.4f})")

# ====================================================================
# Final validation
# ====================================================================
print("\n" + "="*80)

if cand_valid and comp_valid and cand_normalized and comp_normalized:
    print("‚úÖ ALL QUALITY CHECKS PASSED!")
    print("="*80)
    print("\nüöÄ Embeddings are ready for matching!")
else:
    print("‚ö†Ô∏è  SOME QUALITY CHECKS FAILED!")
    print("="*80)
    print("\nüí° Review the issues above before proceeding")

print("\nüìù Next step: Section 6 - Matching System")


SECTION 5: EXECUTE COMPLETE PIPELINE

STEP 1: Data Loading & Text Preparation

üè≠ COMPLETE ETL PIPELINE: EXTRACT ‚Üí TRANSFORM ‚Üí LOAD

1Ô∏è‚É£  EXTRACT: Loading all data sources...
--------------------------------------------------------------------------------

üìÇ Loading core datasets...
üìÇ Loading candidates...
   ‚úÖ Loaded 9,544 candidates
   üìä Columns: 35
üìÇ Loading companies (base)...
   ‚úÖ Loaded 24,473 companies

üìÇ Loading enrichment datasets...
üìÇ Loading job postings...
   ‚úÖ Loaded 123,849 postings
   üè¢ Unique companies: 24,474
üìÇ Loading job skills...
   ‚úÖ Loaded 213,768 job-skill mappings

üìÇ Loading additional company data...
üìÇ Loading company industries...
   ‚úÖ Loaded 24,375 company-industry mappings
   üè¢ Unique companies: 24,365
üìÇ Loading company specialties...
   ‚úÖ Loaded 169,387 company-specialty mappings
   üè¢ Unique companies: 17,780

‚úÖ All data sources loaded!
   ‚Ä¢ Candidates: 9,544
   ‚Ä¢ Companies (base): 24,473
 

Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 299/299 [04:33<00:00,  1.09it/s]


‚úÖ Generated: (9544, 384)
   ‚Ä¢ Memory: ~14.0 MB

üîÑ Embedding companies...
   üìä Processing 24,473 texts...

üîÑ Generating embeddings for 24,473 texts...


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 765/765 [10:49<00:00,  1.18it/s]


‚úÖ Generated: (24473, 384)
   ‚Ä¢ Memory: ~35.8 MB

üíæ Saving embeddings for future use...
üíæ Saved embeddings: ../processed/candidate_embeddings.npy
   ‚Ä¢ Shape: (9544, 384)
   ‚Ä¢ Size: 14.0 MB
üíæ Saved metadata: ../processed/candidate_metadata.pkl
   ‚Ä¢ Rows: 9,544
   ‚Ä¢ Size: 0.1 MB
üíæ Saved embeddings: ../processed/company_embeddings.npy
   ‚Ä¢ Shape: (24473, 384)
   ‚Ä¢ Size: 35.8 MB
üíæ Saved metadata: ../processed/company_metadata.pkl
   ‚Ä¢ Rows: 24,473
   ‚Ä¢ Size: 0.2 MB

‚úÖ Embeddings saved! Next run will be 100x faster!

üìä EMBEDDING GENERATION COMPLETE

‚úÖ Final embeddings:
   ‚Ä¢ Candidate embeddings: (9544, 384)
   ‚Ä¢ Company embeddings: (24473, 384)
   ‚Ä¢ Total memory: ~49.8 MB
   ‚Ä¢ Cached: ‚úÖ Yes (next run will be instant!)

STEP 3: Embedding Quality Verification

üîç Data integrity checks:
   Candidates:
      ‚Ä¢ Valid: ‚úÖ

   Companies:
      ‚Ä¢ Valid: ‚úÖ

üìä Embedding statistics:

   Candidates ((9544, 384)):
      ‚Ä¢ Mean:  -0.001420


SECTION 6: Matching & Query System (CRITICAL!)

In [22]:
# ============================================================================
# SECTION 6: MATCHING & QUERY SYSTEM
# ============================================================================

print("\n" + "="*80)
print("SECTION 6: MATCHING & QUERY SYSTEM")
print("="*80)

import time

# ============================================================================
# Cell 6.1: Compute Similarity Matrix (WITH VALIDATION)
# ============================================================================

print("\n" + "="*80)
print("STEP 1: Computing Similarity Matrix")
print("="*80)

print(f"\nüìä Matrix dimensions:")
print(f"   ‚Ä¢ Candidates: {len(candidate_embeddings):,}")
print(f"   ‚Ä¢ Companies: {len(company_embeddings):,}")
print(f"   ‚Ä¢ Matrix size: {len(candidate_embeddings):,} √ó {len(company_embeddings):,}")
print(f"   ‚Ä¢ Total comparisons: {len(candidate_embeddings) * len(company_embeddings):,}")

# Compute all similarities
print(f"\nüîÑ Computing cosine similarities...")
start_time = time.time()

similarity_matrix = cosine_similarity(candidate_embeddings, company_embeddings)

computation_time = time.time() - start_time

print(f"\n‚úÖ Similarity matrix computed: {similarity_matrix.shape}")
print(f"   ‚Ä¢ Computation time: {computation_time:.2f} seconds")
print(f"   ‚Ä¢ Memory usage: ~{similarity_matrix.nbytes / (1024**2):.1f} MB")

# ====================================================================
# VALIDATION: Check for issues
# ====================================================================
print(f"\nüîç Quality validation:")

# Check for NaN/Inf
has_nan = np.any(np.isnan(similarity_matrix))
has_inf = np.any(np.isinf(similarity_matrix))

print(f"   ‚Ä¢ Has NaN: {'‚ùå Yes' if has_nan else '‚úÖ No'}")
print(f"   ‚Ä¢ Has Inf: {'‚ùå Yes' if has_inf else '‚úÖ No'}")

if has_nan or has_inf:
    raise ValueError("‚ùå CRITICAL: Similarity matrix contains invalid values!")

# Check value range (should be [-1, 1] for cosine, but [0, 1] for normalized)
min_sim = similarity_matrix.min()
max_sim = similarity_matrix.max()

print(f"   ‚Ä¢ Value range: [{min_sim:.4f}, {max_sim:.4f}]")

if min_sim < -1.01 or max_sim > 1.01:
    print(f"   ‚ö†Ô∏è  WARNING: Values outside expected range [-1, 1]")

# ====================================================================
# STATISTICS
# ====================================================================
print(f"\nüìä Distribution statistics:")
print(f"   ‚Ä¢ Mean: {similarity_matrix.mean():.4f}")
print(f"   ‚Ä¢ Median: {np.median(similarity_matrix):.4f}")
print(f"   ‚Ä¢ Std: {similarity_matrix.std():.4f}")
print(f"   ‚Ä¢ Min: {min_sim:.4f}")
print(f"   ‚Ä¢ Max: {max_sim:.4f}")

# Percentiles
percentiles = [10, 25, 50, 75, 90, 95, 99]
print(f"\nüìà Percentiles:")
for p in percentiles:
    val = np.percentile(similarity_matrix, p)
    print(f"   ‚Ä¢ {p}th: {val:.4f}")

# ====================================================================
# INTERPRETATION
# ====================================================================
print(f"\nüí° Interpretation:")

if similarity_matrix.mean() > 0.5:
    print(f"   ‚ö†Ô∏è  High average similarity ({similarity_matrix.mean():.3f})")
    print(f"   üí° May indicate embeddings lack diversity")
elif similarity_matrix.mean() < 0.2:
    print(f"   ‚ö†Ô∏è  Low average similarity ({similarity_matrix.mean():.3f})")
    print(f"   üí° May indicate poor matching quality")
else:
    print(f"   ‚úÖ Average similarity looks healthy ({similarity_matrix.mean():.3f})")

if similarity_matrix.std() > 0.1:
    print(f"   ‚úÖ Good variance in similarities ({similarity_matrix.std():.3f})")
    print(f"   üí° System can distinguish good from bad matches")
else:
    print(f"   ‚ö†Ô∏è  Low variance ({similarity_matrix.std():.3f})")
    print(f"   üí° Matches may all look similar")

# ============================================================================
# Cell 6.2: Find Top-K Matches for All Candidates (WITH PERFORMANCE TEST)
# ============================================================================

print("\n" + "="*80)
print("STEP 2: Finding Top-K Matches")
print("="*80)

print(f"\nüîç Finding top-{config.TOP_K_MATCHES} matches for all candidates...")

def get_top_k_matches(similarity_matrix: np.ndarray, top_k: int = 10) -> List[List[Tuple[int, float]]]:
    """
    Get top-k matches for each candidate.
    
    Performance:
    -----------
    Uses vectorized numpy operations for speed.
    Target: <100ms for single query
    
    Args:
        similarity_matrix: Precomputed similarity matrix (n_candidates √ó n_companies)
        top_k: Number of top matches to return per candidate
    
    Returns:
        List of lists: For each candidate, list of (company_idx, score) tuples
        
    Example:
        >>> matches = get_top_k_matches(sim_matrix, top_k=5)
        >>> matches[0]  # Top 5 matches for first candidate
        [(142, 0.876), (593, 0.854), ...]
    """
    all_matches = []
    
    for i, sim_row in enumerate(similarity_matrix):
        # Get top-k indices (highest similarity scores)
        # argsort sorts ascending, so we take last k and reverse
        top_indices = np.argsort(sim_row)[-top_k:][::-1]
        top_scores = sim_row[top_indices]
        
        # Create list of (index, score) tuples
        matches = list(zip(top_indices.tolist(), top_scores.tolist()))
        all_matches.append(matches)
    
    return all_matches

# Get matches
print(f"   ‚è±Ô∏è  Computing matches...")
start_time = time.time()

all_matches = get_top_k_matches(similarity_matrix, config.TOP_K_MATCHES)

batch_time = time.time() - start_time

print(f"\n‚úÖ Matches computed for all {len(all_matches):,} candidates")
print(f"   ‚Ä¢ Total time: {batch_time:.3f} seconds")
print(f"   ‚Ä¢ Avg per candidate: {(batch_time / len(all_matches)) * 1000:.2f} ms")

# ====================================================================
# PERFORMANCE TEST: Single Query Speed
# ====================================================================
print(f"\n‚ö° Performance test (single query):")

# Test query time for one candidate
test_runs = 100
times = []

for _ in range(test_runs):
    start = time.time()
    
    # Simulate single query
    test_row = similarity_matrix[0]
    top_indices = np.argsort(test_row)[-config.TOP_K_MATCHES:][::-1]
    top_scores = test_row[top_indices]
    
    elapsed = (time.time() - start) * 1000  # Convert to ms
    times.append(elapsed)

avg_query_time = np.mean(times)
p95_query_time = np.percentile(times, 95)

print(f"   ‚Ä¢ Average query time: {avg_query_time:.2f} ms")
print(f"   ‚Ä¢ P95 query time: {p95_query_time:.2f} ms")
print(f"   ‚Ä¢ Target: <100 ms")

if avg_query_time < 100:
    print(f"   ‚úÖ Performance target MET! ({avg_query_time:.1f}ms < 100ms)")
else:
    print(f"   ‚ùå Performance target MISSED! ({avg_query_time:.1f}ms > 100ms)")

# ====================================================================
# MATCH QUALITY ANALYSIS
# ====================================================================
print(f"\nüìä Match quality analysis:")

# Get all top scores
all_top_scores = [matches[0][1] for matches in all_matches]  # Best match per candidate

print(f"   Best match scores:")
print(f"      ‚Ä¢ Mean: {np.mean(all_top_scores):.4f}")
print(f"      ‚Ä¢ Median: {np.median(all_top_scores):.4f}")
print(f"      ‚Ä¢ Min: {np.min(all_top_scores):.4f}")
print(f"      ‚Ä¢ Max: {np.max(all_top_scores):.4f}")

# Count excellent matches (score > 0.7)
excellent = sum(1 for score in all_top_scores if score > 0.7)
good = sum(1 for score in all_top_scores if 0.5 < score <= 0.7)
poor = sum(1 for score in all_top_scores if score <= 0.5)

print(f"\n   Match quality distribution:")
print(f"      ‚Ä¢ Excellent (>0.7): {excellent:,} ({excellent/len(all_top_scores)*100:.1f}%)")
print(f"      ‚Ä¢ Good (0.5-0.7): {good:,} ({good/len(all_top_scores)*100:.1f}%)")
print(f"      ‚Ä¢ Poor (‚â§0.5): {poor:,} ({poor/len(all_top_scores)*100:.1f}%)")

# ============================================================================
# Cell 6.3: Display Sample Matches (ENHANCED WITH ENRICHMENT INFO!)
# ============================================================================

print("\n" + "="*80)
print("STEP 3: Sample Match Inspection")
print("="*80)

print("\nüìã Showing matches for 3 random candidates (with enrichment details):")

# Sample random candidates (not always 0, 1, 2!)
import random
random.seed(42)
sample_indices = random.sample(range(len(candidates_df)), min(3, len(candidates_df)))

for idx, cand_idx in enumerate(sample_indices, 1):
    cand_row = candidates_df.iloc[cand_idx]
    
    print(f"\n{'='*80}")
    print(f"SAMPLE #{idx} - CANDIDATE #{cand_idx}")
    print(f"{'='*80}")
    
    # ====================================================================
    # Show candidate info
    # ====================================================================
    career = str(cand_row.get('career_objective', 'Not specified'))[:100]
    
    skills = cand_row.get('skills', [])
    if isinstance(skills, list):
        skills_str = ', '.join(str(s) for s in skills[:8])  # Show more skills
    else:
        skills_str = str(skills)[:150]
    
    print(f"\nüë§ CANDIDATE PROFILE:")
    print(f"   Career objective: {career}{'...' if len(str(cand_row.get('career_objective', ''))) > 100 else ''}")
    print(f"   Skills: {skills_str}{'...' if len(skills_str) >= 150 else ''}")
    
    # ====================================================================
    # Show top 5 matches with ENRICHMENT details!
    # ====================================================================
    print(f"\nüéØ TOP 5 COMPANY MATCHES:")
    print(f"{'#':<4} {'Score':<8} {'Company':<35} {'Enriched Skills (Bridge!)':<50}")
    print("-" * 110)
    
    for rank, (comp_idx, score) in enumerate(all_matches[cand_idx][:5], 1):
        comp_row = companies_df.iloc[comp_idx]
        
        # Company description
        comp_desc = str(comp_row.get('description', 'N/A'))[:35]
        
        # ENRICHED SKILLS (THE BRIDGE!)
        enriched_skills = str(comp_row.get('enriched_skills', 'Not specified'))
        
        # Show first 50 chars of skills
        if enriched_skills != 'Not specified' and len(enriched_skills) > 50:
            skills_display = enriched_skills[:47] + '...'
        else:
            skills_display = enriched_skills[:50]
        
        # Emoji based on score
        if score > 0.7:
            emoji = 'üåü'
        elif score > 0.5:
            emoji = '‚úÖ'
        else:
            emoji = 'üü°'
        
        print(f"{emoji} {rank:<2} {score:<8.4f} {comp_desc:<35} {skills_display:<50}")
    
    # ====================================================================
    # Show match explanation for best match
    # ====================================================================
    best_comp_idx, best_score = all_matches[cand_idx][0]
    best_comp = companies_df.iloc[best_comp_idx]
    
    print(f"\nüí° Best Match Analysis:")
    print(f"   Score: {best_score:.4f}")
    print(f"   Company: {str(best_comp.get('description', 'N/A'))[:80]}...")
    
    if 'enriched_skills' in best_comp and best_comp['enriched_skills'] != 'Not specified':
        print(f"   Enriched Skills: {str(best_comp['enriched_skills'])[:100]}...")
        print(f"   ‚úÖ Job posting bridge ACTIVE (enriched with real skills!)")
    else:
        print(f"   ‚ö†Ô∏è  No enriched skills (bridge not active for this company)")
    
    if 'industries_list' in best_comp:
        print(f"   Industries: {str(best_comp['industries_list'])[:60]}...")

print("\n" + "="*80)
print("‚úÖ MATCHING SYSTEM COMPLETE")
print("="*80)

print("\nüìä Summary:")
print(f"   ‚Ä¢ Similarity matrix: {similarity_matrix.shape}")
print(f"   ‚Ä¢ Query performance: {avg_query_time:.1f}ms (target: <100ms)")
print(f"   ‚Ä¢ Match quality: {excellent/len(all_top_scores)*100:.1f}% excellent matches")
print(f"   ‚Ä¢ Ready for evaluation (Section 7)")

print("\nüìù Variables available:")
print(f"   ‚Ä¢ similarity_matrix: {similarity_matrix.shape} array")
print(f"   ‚Ä¢ all_matches: List of {len(all_matches):,} match lists")


SECTION 6: MATCHING & QUERY SYSTEM

STEP 1: Computing Similarity Matrix

üìä Matrix dimensions:
   ‚Ä¢ Candidates: 9,544
   ‚Ä¢ Companies: 24,473
   ‚Ä¢ Matrix size: 9,544 √ó 24,473
   ‚Ä¢ Total comparisons: 233,570,312

üîÑ Computing cosine similarities...

‚úÖ Similarity matrix computed: (9544, 24473)
   ‚Ä¢ Computation time: 1.26 seconds
   ‚Ä¢ Memory usage: ~891.0 MB

üîç Quality validation:
   ‚Ä¢ Has NaN: ‚úÖ No
   ‚Ä¢ Has Inf: ‚úÖ No
   ‚Ä¢ Value range: [-0.2044, 0.6857]

üìä Distribution statistics:
   ‚Ä¢ Mean: 0.1787
   ‚Ä¢ Median: 0.1722
   ‚Ä¢ Std: 0.1047
   ‚Ä¢ Min: -0.2044
   ‚Ä¢ Max: 0.6857

üìà Percentiles:
   ‚Ä¢ 10th: 0.0471
   ‚Ä¢ 25th: 0.1017
   ‚Ä¢ 50th: 0.1722
   ‚Ä¢ 75th: 0.2509
   ‚Ä¢ 90th: 0.3202
   ‚Ä¢ 95th: 0.3597
   ‚Ä¢ 99th: 0.4300

üí° Interpretation:
   ‚ö†Ô∏è  Low average similarity (0.179)
   üí° May indicate poor matching quality
   ‚úÖ Good variance in similarities (0.105)
   üí° System can distinguish good from bad matches

STEP 2: Finding T

SECTION 7: Evaluation & Metrics (CRITICAL FOR ACADEMIC!)

In [23]:
# ============================================================================
# SECTION 7: EVALUATION & METRICS (COMPLETE WITH SAVES!)
# ============================================================================

print("\n" + "="*80)
print("SECTION 7: EVALUATION & METRICS")
print("="*80)

# Create results directory if needed
results_dir = config.RESULTS_PATH
Path(results_dir).mkdir(parents=True, exist_ok=True)

# ============================================================================
# Cell 7.1: Bilateral Fairness Score
# ============================================================================

print("\n" + "="*80)
print("METRIC 1: Bilateral Fairness")
print("="*80)

def compute_bilateral_fairness(similarity_matrix: np.ndarray, 
                               top_k: int = 10,
                               threshold: float = 0.5) -> Tuple[float, float, float]:
    """
    Compute bilateral fairness: How balanced are matches from both sides?
    
    Bilateral Fairness = min(candidate_coverage, company_coverage)
    
    Why This Matters:
    ----------------
    Traditional matching favors one side (usually companies).
    Bilateral fairness ensures BOTH candidates AND companies get good matches.
    
    Algorithm:
    ---------
    1. Candidate-side: % of candidates with at least one match > threshold
    2. Company-side: % of companies that appear in at least one top-k list
    3. Fairness: Take the minimum (ensures BOTH sides are served)
    
    Args:
        similarity_matrix: Precomputed similarities (n_candidates √ó n_companies)
        top_k: Number of top matches to consider
        threshold: Minimum similarity score to count as "matched"
    
    Returns:
        Tuple of (fairness, candidate_coverage, company_coverage)
        
    Example:
        >>> fairness, cand_cov, comp_cov = compute_bilateral_fairness(sim_matrix)
        >>> print(f"Fairness: {fairness:.3f}")
    """
    n_candidates, n_companies = similarity_matrix.shape
    
    # ====================================================================
    # Candidate-side coverage: How many candidates find good matches?
    # ====================================================================
    candidate_max_scores = similarity_matrix.max(axis=1)
    candidates_matched = (candidate_max_scores > threshold).sum()
    candidate_coverage = candidates_matched / n_candidates
    
    print(f"\nüìä Candidate-side analysis:")
    print(f"   ‚Ä¢ Total candidates: {n_candidates:,}")
    print(f"   ‚Ä¢ With match > {threshold}: {candidates_matched:,}")
    print(f"   ‚Ä¢ Coverage: {candidate_coverage:.3f} ({candidate_coverage*100:.1f}%)")
    
    # ====================================================================
    # Company-side coverage: How many companies appear in top-k?
    # ====================================================================
    # Get all companies that appear in ANY top-k list
    top_k_indices = np.argsort(similarity_matrix, axis=1)[:, -top_k:]
    unique_companies = np.unique(top_k_indices)
    company_coverage = len(unique_companies) / n_companies
    
    print(f"\nüìä Company-side analysis:")
    print(f"   ‚Ä¢ Total companies: {n_companies:,}")
    print(f"   ‚Ä¢ Appearing in top-{top_k}: {len(unique_companies):,}")
    print(f"   ‚Ä¢ Coverage: {company_coverage:.3f} ({company_coverage*100:.1f}%)")
    
    # ====================================================================
    # Bilateral fairness: minimum of both sides
    # ====================================================================
    fairness = min(candidate_coverage, company_coverage)
    
    print(f"\n‚öñÔ∏è  Bilateral Fairness Score:")
    print(f"   ‚Ä¢ Fairness: {fairness:.3f}")
    print(f"   ‚Ä¢ Interpretation: {fairness*100:.1f}% of BOTH sides are served")
    
    # Status assessment
    if fairness > 0.85:
        status = "‚úÖ EXCELLENT"
        assessment = "Both candidates and companies well-served"
    elif fairness > 0.70:
        status = "üü¢ GOOD"
        assessment = "Majority of both sides served"
    elif fairness > 0.50:
        status = "üü° MODERATE"
        assessment = "Half of both sides served, room for improvement"
    else:
        status = "üî¥ POOR"
        assessment = "Many entities on one or both sides not served"
    
    print(f"   ‚Ä¢ Status: {status}")
    print(f"   ‚Ä¢ Assessment: {assessment}")
    
    # Identify the bottleneck
    if candidate_coverage < company_coverage:
        print(f"\n   üí° Bottleneck: Candidate-side ({candidate_coverage:.3f} < {company_coverage:.3f})")
        print(f"      Recommendation: Improve candidate profile quality or matching algorithm")
    elif company_coverage < candidate_coverage:
        print(f"\n   üí° Bottleneck: Company-side ({company_coverage:.3f} < {candidate_coverage:.3f})")
        print(f"      Recommendation: Enrich more company profiles or expand company dataset")
    else:
        print(f"\n   ‚úÖ Balanced coverage on both sides!")
    
    return fairness, candidate_coverage, company_coverage

# Compute fairness
fairness, cand_cov, comp_cov = compute_bilateral_fairness(
    similarity_matrix,
    top_k=config.TOP_K_MATCHES,
    threshold=config.SIMILARITY_THRESHOLD
)

# ============================================================================
# Cell 7.2: Match Quality Distribution
# ============================================================================

print("\n" + "="*80)
print("METRIC 2: Match Quality Distribution")
print("="*80)

# Extract all scores from all_matches
print("\nüìä Analyzing all match scores...")

all_scores = []
for matches in all_matches:
    scores = [score for _, score in matches]
    all_scores.extend(scores)

all_scores = np.array(all_scores)

print(f"   ‚Ä¢ Total match scores: {len(all_scores):,}")

# ====================================================================
# Statistics
# ====================================================================
print(f"\nüìà Score statistics:")
print(f"   ‚Ä¢ Mean: {all_scores.mean():.4f}")
print(f"   ‚Ä¢ Median: {np.median(all_scores):.4f}")
print(f"   ‚Ä¢ Std: {all_scores.std():.4f}")
print(f"   ‚Ä¢ Min: {all_scores.min():.4f}")
print(f"   ‚Ä¢ Max: {all_scores.max():.4f}")

# Percentiles
percentiles = [10, 25, 50, 75, 90, 95, 99]
print(f"\nüìä Percentiles:")
for p in percentiles:
    val = np.percentile(all_scores, p)
    print(f"   ‚Ä¢ {p:2d}th: {val:.4f}")

# ====================================================================
# Quality Categories
# ====================================================================
print(f"\nüéØ Match quality categories:")

excellent = (all_scores > 0.8).sum()
very_good = ((all_scores > 0.7) & (all_scores <= 0.8)).sum()
good = ((all_scores > 0.6) & (all_scores <= 0.7)).sum()
moderate = ((all_scores > 0.5) & (all_scores <= 0.6)).sum()
poor = (all_scores <= 0.5).sum()

total = len(all_scores)

print(f"   ‚Ä¢ Excellent (>0.8): {excellent:>7,} ({excellent/total*100:>5.1f}%)")
print(f"   ‚Ä¢ Very Good (0.7-0.8): {very_good:>7,} ({very_good/total*100:>5.1f}%)")
print(f"   ‚Ä¢ Good (0.6-0.7): {good:>7,} ({good/total*100:>5.1f}%)")
print(f"   ‚Ä¢ Moderate (0.5-0.6): {moderate:>7,} ({moderate/total*100:>5.1f}%)")
print(f"   ‚Ä¢ Poor (‚â§0.5): {poor:>7,} ({poor/total*100:>5.1f}%)")

# Assessment
if all_scores.mean() > 0.7:
    print(f"\n   ‚úÖ High quality matches overall (mean: {all_scores.mean():.3f})")
elif all_scores.mean() > 0.6:
    print(f"\n   üü¢ Good quality matches (mean: {all_scores.mean():.3f})")
else:
    print(f"\n   üü° Moderate quality matches (mean: {all_scores.mean():.3f})")

# ============================================================================
# Cell 7.3: Job Posting Bridge Impact
# ============================================================================

print("\n" + "="*80)
print("METRIC 3: Job Posting Bridge Coverage")
print("="*80)

# Use coverage from ETL (already calculated!)
coverage_pct = etl_results['stats']['skills_coverage_pct']
has_skills_count = (companies_df['enriched_skills'] != 'Not specified').sum()
total_companies = len(companies_df)

print(f"\nüåâ Job Posting Bridge Impact:")
print(f"   ‚Ä¢ Total companies: {total_companies:,}")
print(f"   ‚Ä¢ With enriched skills: {has_skills_count:,}")
print(f"   ‚Ä¢ Without enriched skills: {total_companies - has_skills_count:,}")
print(f"   ‚Ä¢ Coverage: {coverage_pct:.1f}%")

# Status
if coverage_pct > 90:
    status = "‚úÖ EXCELLENT"
    impact = "Job posting bridge highly effective"
elif coverage_pct > 70:
    status = "üü¢ GOOD"
    impact = "Majority of companies enriched"
else:
    status = "üü° LIMITED"
    impact = "Many companies lack skill enrichment"

print(f"   ‚Ä¢ Status: {status}")
print(f"   ‚Ä¢ Impact: {impact}")

# ====================================================================
# Compare matches with vs without enrichment
# ====================================================================
print(f"\nüìä Enrichment impact on match quality:")

# Get companies with/without enrichment
companies_with_skills = companies_df['enriched_skills'] != 'Not specified'

# Get average scores for companies with/without skills
scores_with_skills = []
scores_without_skills = []

for matches in all_matches:
    for comp_idx, score in matches:
        if companies_with_skills.iloc[comp_idx]:
            scores_with_skills.append(score)
        else:
            scores_without_skills.append(score)

if scores_with_skills:
    avg_with = np.mean(scores_with_skills)
    print(f"   ‚Ä¢ Avg score (enriched companies): {avg_with:.4f}")

if scores_without_skills:
    avg_without = np.mean(scores_without_skills)
    print(f"   ‚Ä¢ Avg score (non-enriched companies): {avg_without:.4f}")

if scores_with_skills and scores_without_skills:
    improvement = ((avg_with - avg_without) / avg_without) * 100
    print(f"   ‚Ä¢ Improvement: {improvement:+.1f}%")
    
    if improvement > 5:
        print(f"   ‚úÖ Enrichment significantly improves match quality!")
    elif improvement > 0:
        print(f"   üü¢ Enrichment provides modest improvement")
    else:
        print(f"   ‚ö†Ô∏è  Enrichment not showing clear improvement")

# ============================================================================
# Cell 7.4: System Performance Metrics
# ============================================================================

print("\n" + "="*80)
print("METRIC 4: System Performance")
print("="*80)

# Use query time from Section 6 (if available in globals)
if 'avg_query_time' in globals():
    print(f"\n‚ö° Query Performance:")
    print(f"   ‚Ä¢ Average query time: {avg_query_time:.2f} ms")
    print(f"   ‚Ä¢ Target: <100 ms")
    
    if avg_query_time < 100:
        print(f"   ‚úÖ Performance target MET! ({avg_query_time:.1f}ms < 100ms)")
    else:
        print(f"   ‚ùå Performance target MISSED! ({avg_query_time:.1f}ms > 100ms)")
else:
    print(f"\n‚ö†Ô∏è  Query performance not measured in Section 6")

# Memory usage
print(f"\nüíæ Memory Usage:")
print(f"   ‚Ä¢ Candidate embeddings: {candidate_embeddings.nbytes / (1024**2):.1f} MB")
print(f"   ‚Ä¢ Company embeddings: {company_embeddings.nbytes / (1024**2):.1f} MB")
print(f"   ‚Ä¢ Similarity matrix: {similarity_matrix.nbytes / (1024**2):.1f} MB")
total_memory = (candidate_embeddings.nbytes + company_embeddings.nbytes + similarity_matrix.nbytes) / (1024**2)
print(f"   ‚Ä¢ Total: {total_memory:.1f} MB")

if total_memory < 500:
    print(f"   ‚úÖ Reasonable memory footprint (<500 MB)")
else:
    print(f"   ‚ö†Ô∏è  High memory usage (>{total_memory:.0f} MB)")

# Scale
print(f"\nüìä System Scale:")
print(f"   ‚Ä¢ Candidates: {len(candidate_embeddings):,}")
print(f"   ‚Ä¢ Companies: {len(company_embeddings):,}")
print(f"   ‚Ä¢ Total comparisons: {len(candidate_embeddings) * len(company_embeddings):,}")
print(f"   ‚Ä¢ Matches generated: {len(all_matches) * config.TOP_K_MATCHES:,}")

# ============================================================================
# Cell 7.5: Save All Metrics (CRITICAL!)
# ============================================================================

print("\n" + "="*80)
print("SAVING EVALUATION METRICS")
print("="*80)

# Compile all metrics
evaluation_metrics = {
    'bilateral_fairness': {
        'fairness_score': float(fairness),
        'candidate_coverage': float(cand_cov),
        'company_coverage': float(comp_cov),
        'threshold': float(config.SIMILARITY_THRESHOLD),
        'top_k': int(config.TOP_K_MATCHES)
    },
    'match_quality': {
        'mean_score': float(all_scores.mean()),
        'median_score': float(np.median(all_scores)),
        'std_score': float(all_scores.std()),
        'min_score': float(all_scores.min()),
        'max_score': float(all_scores.max()),
        'percentiles': {
            f'p{p}': float(np.percentile(all_scores, p))
            for p in [10, 25, 50, 75, 90, 95, 99]
        },
        'quality_distribution': {
            'excellent': int(excellent),
            'very_good': int(very_good),
            'good': int(good),
            'moderate': int(moderate),
            'poor': int(poor)
        }
    },
    'enrichment': {
        'total_companies': int(total_companies),
        'companies_with_skills': int(has_skills_count),
        'coverage_pct': float(coverage_pct),
        'avg_score_enriched': float(avg_with) if scores_with_skills else None,
        'avg_score_non_enriched': float(avg_without) if scores_without_skills else None
    },
    'performance': {
        'query_time_ms': float(avg_query_time) if 'avg_query_time' in globals() else None,
        'memory_mb': float(total_memory),
        'n_candidates': int(len(candidate_embeddings)),
        'n_companies': int(len(company_embeddings)),
        'total_matches': int(len(all_matches) * config.TOP_K_MATCHES)
    },
    'model_info': {
        'embedding_model': config.EMBEDDING_MODEL,
        'embedding_dimension': int(candidate_embeddings.shape[1]),
        'similarity_metric': 'cosine'
    }
}

# Save as JSON
metrics_file = f'{results_dir}evaluation_metrics.json'
with open(metrics_file, 'w') as f:
    json.dump(evaluation_metrics, f, indent=2)

print(f"\nüíæ Saved metrics:")
print(f"   ‚úÖ {metrics_file}")

# Also save detailed score distribution
scores_file = f'{results_dir}match_scores.npy'
np.save(scores_file, all_scores)
print(f"   ‚úÖ {scores_file}")

# ============================================================================
# Cell 7.6: Summary Report
# ============================================================================

print("\n" + "="*80)
print("üìä EVALUATION SUMMARY")
print("="*80)

print(f"\n‚úÖ Key Metrics:")
print(f"   ‚Ä¢ Bilateral Fairness: {fairness:.3f} {'‚úÖ' if fairness > 0.85 else 'üü°'}")
print(f"   ‚Ä¢ Mean Match Score: {all_scores.mean():.3f}")
print(f"   ‚Ä¢ Job Posting Coverage: {coverage_pct:.1f}% {'‚úÖ' if coverage_pct > 90 else 'üü°'}")
print(f"   ‚Ä¢ Query Performance: {avg_query_time:.1f}ms {'‚úÖ' if 'avg_query_time' in globals() and avg_query_time < 100 else '‚ö†Ô∏è'}")

print(f"\nüìà System Characteristics:")
print(f"   ‚Ä¢ Dataset: {len(candidates_df):,} candidates √ó {len(companies_df):,} companies")
print(f"   ‚Ä¢ Matches: {len(all_matches) * config.TOP_K_MATCHES:,} total")
print(f"   ‚Ä¢ Memory: {total_memory:.0f} MB")

print(f"\nüéØ Quality Assessment:")
excellent_pct = (excellent / total) * 100
if excellent_pct > 50:
    quality = "‚úÖ HIGH QUALITY"
elif excellent_pct > 30:
    quality = "üü¢ GOOD QUALITY"
else:
    quality = "üü° MODERATE QUALITY"

print(f"   ‚Ä¢ Overall: {quality}")
print(f"   ‚Ä¢ {excellent_pct:.1f}% matches are excellent (>0.8)")

print(f"\nüíæ Results saved to: {results_dir}")
print("="*80)

print(f"\nüìù Next steps:")
print(f"   ‚Ä¢ Section 7.5: Synthetic Validation (ground truth testing)")
print(f"   ‚Ä¢ Section 8: Save for Production")
print(f"   ‚Ä¢ Batch 5: LLM Features (optional)")
print(f"   ‚Ä¢ Batch 6: Visualizations")


SECTION 7: EVALUATION & METRICS

METRIC 1: Bilateral Fairness

üìä Candidate-side analysis:
   ‚Ä¢ Total candidates: 9,544
   ‚Ä¢ With match > 0.5: 9,460
   ‚Ä¢ Coverage: 0.991 (99.1%)

üìä Company-side analysis:
   ‚Ä¢ Total companies: 24,473
   ‚Ä¢ Appearing in top-10: 366
   ‚Ä¢ Coverage: 0.015 (1.5%)

‚öñÔ∏è  Bilateral Fairness Score:
   ‚Ä¢ Fairness: 0.015
   ‚Ä¢ Interpretation: 1.5% of BOTH sides are served
   ‚Ä¢ Status: üî¥ POOR
   ‚Ä¢ Assessment: Many entities on one or both sides not served

   üí° Bottleneck: Company-side (0.015 < 0.991)
      Recommendation: Enrich more company profiles or expand company dataset

METRIC 2: Match Quality Distribution

üìä Analyzing all match scores...
   ‚Ä¢ Total match scores: 95,440

üìà Score statistics:
   ‚Ä¢ Mean: 0.5360
   ‚Ä¢ Median: 0.5345
   ‚Ä¢ Std: 0.0399
   ‚Ä¢ Min: 0.4090
   ‚Ä¢ Max: 0.6857

üìä Percentiles:
   ‚Ä¢ 10th: 0.4860
   ‚Ä¢ 25th: 0.5087
   ‚Ä¢ 50th: 0.5345
   ‚Ä¢ 75th: 0.5625
   ‚Ä¢ 90th: 0.5890
   ‚Ä¢ 95th: 

In [52]:
# ============================================================================
# SECTION 8: SAVE FOR PRODUCTION
# ============================================================================

print("\n" + "="*80)
print("üíæ SAVING FOR PRODUCTION")
print("="*80)

# ============================================================================
# Cell 8.1: Save Embeddings
# ============================================================================

print("\n1Ô∏è‚É£  Saving embeddings...")

np.save(f'{config.PROCESSED_PATH}candidate_embeddings.npy', candidate_embeddings)
np.save(f'{config.PROCESSED_PATH}company_embeddings.npy', company_embeddings)

print(f"   ‚úÖ candidate_embeddings.npy saved")
print(f"   ‚úÖ company_embeddings.npy saved")

# ============================================================================
# Cell 8.2: Save Metadata
# ============================================================================

print("\n2Ô∏è‚É£  Saving metadata...")

candidates_df.to_pickle(f'{config.PROCESSED_PATH}candidates_metadata.pkl')
companies_df.to_pickle(f'{config.PROCESSED_PATH}companies_metadata.pkl')

print(f"   ‚úÖ candidates_metadata.pkl saved")
print(f"   ‚úÖ companies_metadata.pkl saved")

# ============================================================================
# Cell 8.3: Save Model Info & Metrics
# ============================================================================

print("\n3Ô∏è‚É£  Saving model info...")

model_info = {
    'model_name': config.EMBEDDING_MODEL,
    'embedding_dim': config.EMBEDDING_DIM,
    'n_candidates': len(candidates_df),
    'n_companies': len(companies_df),
    'bilateral_fairness': float(fairness),
    'mean_match_score': float(all_scores.mean()),
    'coverage_pct': float(coverage_pct),
    'top_k': config.TOP_K_MATCHES,
    'similarity_threshold': config.SIMILARITY_THRESHOLD
}

with open(f'{config.PROCESSED_PATH}model_info.json', 'w') as f:
    json.dump(model_info, f, indent=2)

print(f"   ‚úÖ model_info.json saved")

# ============================================================================
# Cell 8.4: Save Sample Matches
# ============================================================================

print("\n4Ô∏è‚É£  Saving sample matches...")

# Save first 100 candidates' matches as JSON for inspection
sample_matches = []
for i in range(min(100, len(all_matches))):
    cand_career = str(candidates_df.iloc[i].get('career_objective', 'N/A'))[:100]
    
    matches_data = []
    for comp_idx, score in all_matches[i][:5]:
        comp_desc = str(companies_df.iloc[comp_idx].get('description', 'N/A'))[:100]
        matches_data.append({
            'company_idx': int(comp_idx),
            'score': float(score),
            'company_description': comp_desc
        })
    
    sample_matches.append({
        'candidate_idx': i,
        'candidate_career': cand_career,
        'top_matches': matches_data
    })

with open(f'{config.RESULTS_PATH}sample_matches.json', 'w') as f:
    json.dump(sample_matches, f, indent=2)

print(f"   ‚úÖ sample_matches.json saved")

# ============================================================================
# Cell 8.5: File Summary
# ============================================================================

print("\n" + "="*80)
print("üì¶ DEPLOYMENT PACKAGE READY")
print("="*80)

import os

files_to_check = [
    (f'{config.PROCESSED_PATH}candidate_embeddings.npy', 'Candidate embeddings'),
    (f'{config.PROCESSED_PATH}company_embeddings.npy', 'Company embeddings'),
    (f'{config.PROCESSED_PATH}candidates_metadata.pkl', 'Candidate metadata'),
    (f'{config.PROCESSED_PATH}companies_metadata.pkl', 'Company metadata'),
    (f'{config.PROCESSED_PATH}model_info.json', 'Model info'),
    (f'{config.RESULTS_PATH}sample_matches.json', 'Sample matches')
]

print(f"\nüìÇ Files saved:")
total_size = 0

for filepath, description in files_to_check:
    if os.path.exists(filepath):
        size_mb = os.path.getsize(filepath) / (1024**2)
        total_size += size_mb
        print(f"   ‚úÖ {description}: {size_mb:.2f} MB")
    else:
        print(f"   ‚ùå {description}: NOT FOUND")

print(f"\n   üì¶ Total package size: {total_size:.2f} MB")

print("\n" + "="*80)
print("üéâ HRHUB v4.0 PIPELINE COMPLETE!")
print("="*80)

print(f"\n‚úÖ System ready for:")
print(f"   ‚Ä¢ Streamlit deployment")
print(f"   ‚Ä¢ Academic report")
print(f"   ‚Ä¢ Presentation demo")

print(f"\nüìä Final Statistics:")
print(f"   ‚Ä¢ {len(candidates_df):,} candidates processed")
print(f"   ‚Ä¢ {len(companies_df):,} companies processed")
print(f"   ‚Ä¢ {fairness:.3f} bilateral fairness")
print(f"   ‚Ä¢ {coverage_pct:.1f}% coverage")

print("\nüöÄ Next steps:")
print("   1. Review sample_matches.json")
print("   2. Create visualizations (optional)")
print("   3. Write academic report")
print("   4. Deploy to Streamlit (optional)")

print("\n" + "="*80)


üíæ SAVING FOR PRODUCTION

1Ô∏è‚É£  Saving embeddings...
   ‚úÖ candidate_embeddings.npy saved
   ‚úÖ company_embeddings.npy saved

2Ô∏è‚É£  Saving metadata...
   ‚úÖ candidates_metadata.pkl saved
   ‚úÖ companies_metadata.pkl saved

3Ô∏è‚É£  Saving model info...
   ‚úÖ model_info.json saved

4Ô∏è‚É£  Saving sample matches...
   ‚úÖ sample_matches.json saved

üì¶ DEPLOYMENT PACKAGE READY

üìÇ Files saved:
   ‚úÖ Candidate embeddings: 13.98 MB
   ‚úÖ Company embeddings: 35.85 MB
   ‚úÖ Candidate metadata: 2.33 MB
   ‚úÖ Company metadata: 22.78 MB
   ‚úÖ Model info: 0.00 MB
   ‚úÖ Sample matches: 0.12 MB

   üì¶ Total package size: 75.06 MB

üéâ HRHUB v4.0 PIPELINE COMPLETE!

‚úÖ System ready for:
   ‚Ä¢ Streamlit deployment
   ‚Ä¢ Academic report
   ‚Ä¢ Presentation demo

üìä Final Statistics:
   ‚Ä¢ 9,544 candidates processed
   ‚Ä¢ 24,473 companies processed
   ‚Ä¢ 0.015 bilateral fairness
   ‚Ä¢ 96.1% coverage

üöÄ Next steps:
   1. Review sample_matches.json
   2. Create vis

---
# ü§ñ SECTION 6: LLM Features
---

## Cell 6.1: Initialize LLM Client

**Purpose:** Set up Hugging Face Inference API for LLM features.

**Cost:** $0.00 (free tier)

In [25]:
# ============================================================================
# BATCH 5: LLM ENHANCEMENT LAYER (OPTIONAL FEATURES)
# ============================================================================
# Purpose: Add LLM-powered features for job classification and explainability
# Provider: Hugging Face Inference API (FREE tier)
# Cost: $0.00
# Note: This is OPTIONAL - system works without it
# ============================================================================

print("\n" + "="*80)
print("BATCH 5: LLM ENHANCEMENT LAYER (OPTIONAL)")
print("="*80)

# ============================================================================
# Cell 5.1: Imports and Configuration
# ============================================================================

print("\nüì¶ Importing LLM dependencies...")

try:
    from huggingface_hub import InferenceClient
    print("   ‚úÖ InferenceClient imported")
except ImportError:
    print("   ‚ùå huggingface_hub not installed!")
    print("   üí° Install with: pip install huggingface_hub")
    InferenceClient = None

try:
    from pydantic import BaseModel, Field
    from typing import Literal
    print("   ‚úÖ Pydantic imported")
except ImportError:
    print("   ‚ùå pydantic not installed!")
    print("   üí° Install with: pip install pydantic")
    BaseModel = None

# ============================================================================
# Cell 5.2: Initialize LLM Client
# ============================================================================

print("\nü§ñ Initializing LLM Client...")

# LLM Model to use (free tier)
LLM_MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # Free on HF

# Check if HF token exists
HF_TOKEN = config.HF_TOKEN if hasattr(config, 'HF_TOKEN') else None

if not HF_TOKEN:
    print("\n‚ö†Ô∏è  WARNING: No Hugging Face token found!")
    print("   LLM features will be DISABLED")
    print("\n   To enable LLM features:")
    print("   1. Get free token at: https://huggingface.co/settings/tokens")
    print("   2. Add to Config: HF_TOKEN = 'your_token_here'")
    print("   3. Re-run this cell")
    LLM_AVAILABLE = False
    hf_client = None
else:
    try:
        hf_client = InferenceClient(token=HF_TOKEN)
        
        # Test the connection
        test_response = hf_client.chat_completion(
            messages=[{"role": "user", "content": "Hello"}],
            model=LLM_MODEL,
            max_tokens=10
        )
        
        print(f"‚úÖ Hugging Face client initialized!")
        print(f"   ‚Ä¢ Model: {LLM_MODEL}")
        print(f"   ‚Ä¢ Cost: $0.00 (FREE tier)")
        print(f"   ‚Ä¢ Connection: ‚úÖ Tested successfully")
        LLM_AVAILABLE = True
        
    except Exception as e:
        print(f"‚ùå Failed to initialize LLM client!")
        print(f"   Error: {e}")
        print(f"\n   Troubleshooting:")
        print(f"   1. Check your HF token is valid")
        print(f"   2. Ensure you have internet connection")
        print(f"   3. Try regenerating your token")
        LLM_AVAILABLE = False
        hf_client = None

# ============================================================================
# Cell 5.3: Helper Functions
# ============================================================================

print("\nüîß Defining helper functions...")

def call_llm(prompt: str, max_tokens: int = 1000, temperature: float = 0.7) -> str:
    """
    Generic LLM call wrapper.
    
    Handles errors gracefully and returns empty string on failure.
    
    Args:
        prompt: Text prompt for the LLM
        max_tokens: Maximum response length
        temperature: Sampling temperature (0.0 = deterministic, 1.0 = creative)
    
    Returns:
        LLM response text or error message
    """
    if not LLM_AVAILABLE:
        return "[LLM not available - check HF_TOKEN]"
    
    try:
        response = hf_client.chat_completion(
            messages=[{"role": "user", "content": prompt}],
            model=LLM_MODEL,
            max_tokens=max_tokens,
            temperature=temperature
        )
        return response.choices[0].message.content
    
    except Exception as e:
        print(f"   ‚ö†Ô∏è  LLM call failed: {e}")
        return f"[Error: {str(e)[:100]}]"

print("   ‚úÖ call_llm() defined")

# ============================================================================
# Cell 5.4: Pydantic Schemas (Data Validation)
# ============================================================================

if BaseModel is not None:
    print("\nüìã Defining Pydantic schemas...")
    
    class JobLevelClassification(BaseModel):
        """
        Schema for job level classification output.
        
        Validates LLM responses match expected structure.
        """
        level: Literal["Entry", "Mid", "Senior", "Executive"]
        confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")
        reasoning: str = Field(description="Brief explanation of classification")
    
    class SkillsTaxonomy(BaseModel):
        """
        Schema for skills taxonomy extraction.
        
        Categorizes skills into meaningful groups.
        """
        technical_skills: List[str] = Field(default_factory=list, description="Hard/technical skills")
        soft_skills: List[str] = Field(default_factory=list, description="Soft/interpersonal skills")
        certifications: List[str] = Field(default_factory=list, description="Certifications and qualifications")
        languages: List[str] = Field(default_factory=list, description="Programming or spoken languages")
    
    print("   ‚úÖ JobLevelClassification schema defined")
    print("   ‚úÖ SkillsTaxonomy schema defined")
else:
    print("\n‚ö†Ô∏è  Pydantic not available - schemas disabled")

# ============================================================================
# Cell 5.5: Robust Parsing Functions (CRITICAL!)
# ============================================================================

print("\nüõ°Ô∏è  Defining robust parsing functions...")

def parse_job_level_robust(response: str) -> dict:
    """
    ROBUST parser for job level classification.
    
    Triple-fallback strategy:
    1. Try JSON parsing
    2. Try regex extraction
    3. Keyword fallback (ALWAYS succeeds)
    
    This prevents the system from crashing on malformed LLM outputs!
    
    Args:
        response: LLM response text
    
    Returns:
        Dict with level, confidence, reasoning
    """
    import re
    
    # ====================================================================
    # STRATEGY 1: Try JSON parsing
    # ====================================================================
    try:
        # Remove markdown code fences if present
        clean = response.strip()
        if '```json' in clean:
            clean = clean.split('```json')[1].split('```')[0]
        elif '```' in clean:
            clean = clean.split('```')[1].split('```')[0]
        
        data = json.loads(clean)
        
        # Validate required fields
        if 'level' in data and 'confidence' in data:
            return {
                'level': data['level'],
                'confidence': float(data.get('confidence', 0.5)),
                'reasoning': data.get('reasoning', 'Parsed from JSON')
            }
    except:
        pass
    
    # ====================================================================
    # STRATEGY 2: Regex extraction from structured text
    # ====================================================================
    try:
        level_match = re.search(r'Level:\s*(\w+)', response, re.IGNORECASE)
        conf_match = re.search(r'Confidence:\s*([\d.]+)', response, re.IGNORECASE)
        
        if level_match:
            level = level_match.group(1).capitalize()
            confidence = float(conf_match.group(1)) if conf_match else 0.5
            
            # Normalize level
            if 'entry' in level.lower():
                level = 'Entry'
            elif 'mid' in level.lower():
                level = 'Mid'
            elif 'senior' in level.lower() or 'sr' in level.lower():
                level = 'Senior'
            elif 'exec' in level.lower() or 'lead' in level.lower():
                level = 'Executive'
            
            return {
                'level': level,
                'confidence': confidence,
                'reasoning': 'Extracted via regex'
            }
    except:
        pass
    
    # ====================================================================
    # STRATEGY 3: Keyword fallback (ALWAYS succeeds!)
    # ====================================================================
    response_lower = response.lower()
    
    # Count keywords for each level
    entry_keywords = ['entry', 'junior', 'beginner', 'graduate', 'intern']
    mid_keywords = ['mid', 'intermediate', 'experienced']
    senior_keywords = ['senior', 'lead', 'principal', 'expert', 'sr.']
    exec_keywords = ['executive', 'director', 'vp', 'chief', 'head', 'manager']
    
    scores = {
        'Entry': sum(1 for kw in entry_keywords if kw in response_lower),
        'Mid': sum(1 for kw in mid_keywords if kw in response_lower),
        'Senior': sum(1 for kw in senior_keywords if kw in response_lower),
        'Executive': sum(1 for kw in exec_keywords if kw in response_lower)
    }
    
    # Pick level with most keyword matches
    level = max(scores, key=scores.get)
    confidence = min(scores[level] / 3.0, 1.0)  # Normalize to 0-1
    
    return {
        'level': level,
        'confidence': confidence,
        'reasoning': f'Keyword fallback (found {scores[level]} indicators)'
    }

def parse_skills_taxonomy_robust(response: str) -> dict:
    """
    ROBUST parser for skills taxonomy.
    
    Triple-fallback strategy for parsing skills.
    
    Args:
        response: LLM response text
    
    Returns:
        Dict with technical_skills, soft_skills, certifications, languages lists
    """
    import re
    
    # ====================================================================
    # STRATEGY 1: Try JSON parsing
    # ====================================================================
    try:
        clean = response.strip()
        if '```json' in clean:
            clean = clean.split('```json')[1].split('```')[0]
        elif '```' in clean:
            clean = clean.split('```')[1].split('```')[0]
        
        data = json.loads(clean)
        
        return {
            'technical_skills': data.get('technical_skills', []),
            'soft_skills': data.get('soft_skills', []),
            'certifications': data.get('certifications', []),
            'languages': data.get('languages', [])
        }
    except:
        pass
    
    # ====================================================================
    # STRATEGY 2: Section-based extraction
    # ====================================================================
    try:
        result = {
            'technical_skills': [],
            'soft_skills': [],
            'certifications': [],
            'languages': []
        }
        
        # Extract each section
        sections = {
            'technical_skills': r'Technical Skills?:\s*\[?([^\]]+)\]?',
            'soft_skills': r'Soft Skills?:\s*\[?([^\]]+)\]?',
            'certifications': r'Certifications?:\s*\[?([^\]]+)\]?',
            'languages': r'Languages?:\s*\[?([^\]]+)\]?'
        }
        
        for key, pattern in sections.items():
            match = re.search(pattern, response, re.IGNORECASE)
            if match:
                items = match.group(1).split(',')
                result[key] = [item.strip(' "\'\n[]') for item in items if item.strip()]
        
        if any(result.values()):
            return result
    except:
        pass
    
    # ====================================================================
    # STRATEGY 3: Simple fallback (empty lists)
    # ====================================================================
    return {
        'technical_skills': [],
        'soft_skills': [],
        'certifications': [],
        'languages': []
    }

print("   ‚úÖ parse_job_level_robust() defined")
print("   ‚úÖ parse_skills_taxonomy_robust() defined")

# ============================================================================
# Cell 5.6: LLM Feature Functions (WITH INTEGRATED ROBUST PARSING!)
# ============================================================================

print("\nüéØ Defining LLM feature functions...")

def classify_job_level_zero_shot(candidate_text: str) -> dict:
    """
    Classify job seniority level using zero-shot prompting.
    
    Args:
        candidate_text: Candidate profile text
    
    Returns:
        Dict with level, confidence, reasoning
    """
    if not LLM_AVAILABLE:
        return {'level': 'Mid', 'confidence': 0.0, 'reasoning': 'LLM not available'}
    
    prompt = f"""Classify this candidate's seniority level into one of: Entry, Mid, Senior, Executive.

Candidate Profile:
{candidate_text[:500]}

Respond with:
Level: [Entry/Mid/Senior/Executive]
Confidence: [0.0-1.0]
Reasoning: [brief explanation]"""
    
    response = call_llm(prompt, max_tokens=200)
    return parse_job_level_robust(response)

def classify_job_level_few_shot(candidate_text: str) -> dict:
    """
    Classify job seniority with few-shot examples.
    
    Args:
        candidate_text: Candidate profile text
    
    Returns:
        Dict with level, confidence, reasoning
    """
    if not LLM_AVAILABLE:
        return {'level': 'Mid', 'confidence': 0.0, 'reasoning': 'LLM not available'}
    
    prompt = f"""Classify candidate seniority based on these examples:

Example 1: "Recent graduate seeking entry-level position. Completed internship..."
Level: Entry

Example 2: "5 years experience as software engineer. Led small team projects..."
Level: Mid

Example 3: "15 years in tech. Principal engineer, mentored 20+ developers..."
Level: Senior

Now classify:
{candidate_text[:500]}

Level: [Entry/Mid/Senior/Executive]
Confidence: [0.0-1.0]"""
    
    response = call_llm(prompt, max_tokens=200)
    return parse_job_level_robust(response)

def extract_skills_taxonomy(job_description: str) -> Dict:
    """
    Extract structured skills with INTEGRATED ROBUST parsing.
    
    This version has parsing strategies BUILT-IN (better than separate parser!)
    
    Returns dict with: technical_skills, soft_skills, certifications, languages
    
    Triple-fallback strategy:
    1. JSON parsing (if LLM returns valid JSON)
    2. Regex extraction (if structured text format)
    3. Keyword matching (always succeeds)
    
    Args:
        job_description: Text to extract skills from
    
    Returns:
        Dict with categorized skills lists
    """
    if not LLM_AVAILABLE:
        # Fallback when LLM not available
        return {
            "technical_skills": [],
            "soft_skills": [],
            "certifications": [],
            "languages": []
        }
    
    prompt = f"""Extract ALL skills from this text:

TEXT: {job_description[:700]}

Categorize into:
1. Technical: Python, SQL, AWS, Docker, React, etc.
2. Soft: Leadership, Communication, Teamwork, etc.
3. Certifications: AWS Certified, PMP, etc.
4. Languages: English, Spanish, etc.

Format:
Technical: [list skills]
Soft: [list skills]
Certifications: [list or "None"]
Languages: [list or "None"]

Extract ONLY skills ACTUALLY mentioned. Be specific."""
    
    response = call_llm(prompt, max_tokens=500)
    
    # ========================================================================
    # STRATEGY 1: Try JSON parsing if present
    # ========================================================================
    try:
        json_str = response.strip()
        
        # Remove markdown code fences
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        elif '```' in json_str:
            json_str = json_str.split('```')[1].split('```')[0].strip()
        
        # Extract JSON object
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
            
            data = json.loads(json_str)
            
            # Validate with Pydantic if available
            if BaseModel is not None:
                validated = SkillsTaxonomy(**data)
                return validated.model_dump()
            else:
                return data
    
    except Exception:
        pass  # Fall through to Strategy 2
    
    # ========================================================================
    # STRATEGY 2: Parse structured text with regex
    # ========================================================================
    try:
        technical = []
        soft = []
        certs = []
        languages = []
        
        # Extract each category
        tech_match = re.search(r'Technical:\s*\[?([^\]\n]+)', response, re.IGNORECASE)
        if tech_match:
            tech_text = tech_match.group(1)
            technical = [s.strip(' "\'') for s in tech_text.split(',') 
                        if s.strip() and s.strip().lower() not in ['none', 'null', '']]
        
        soft_match = re.search(r'Soft:\s*\[?([^\]\n]+)', response, re.IGNORECASE)
        if soft_match:
            soft_text = soft_match.group(1)
            soft = [s.strip(' "\'') for s in soft_text.split(',') 
                   if s.strip() and s.strip().lower() not in ['none', 'null', '']]
        
        cert_match = re.search(r'Certifications?:\s*\[?([^\]\n]+)', response, re.IGNORECASE)
        if cert_match:
            cert_text = cert_match.group(1)
            if cert_text.strip().lower() not in ['none', 'null', '']:
                certs = [s.strip(' "\'') for s in cert_text.split(',') if s.strip()]
        
        lang_match = re.search(r'Languages?:\s*\[?([^\]\n]+)', response, re.IGNORECASE)
        if lang_match:
            lang_text = lang_match.group(1)
            if lang_text.strip().lower() not in ['none', 'null', '']:
                languages = [s.strip(' "\'') for s in lang_text.split(',') if s.strip()]
        
        return {
            "technical_skills": technical[:20],  # Limit to top 20
            "soft_skills": soft[:10],            # Limit to top 10
            "certifications": certs[:10],        # Limit to top 10
            "languages": languages[:5]           # Limit to top 5
        }
    
    except Exception:
        pass  # Fall through to Strategy 3
    
    # ========================================================================
    # STRATEGY 3: Keyword extraction (last resort - ALWAYS succeeds!)
    # ========================================================================
    # Common technical skills
    tech_keywords = ['python', 'java', 'sql', 'aws', 'docker', 'kubernetes', 'react', 
                     'javascript', 'machine learning', 'ml', 'ai', 'data science',
                     'tensorflow', 'pytorch', 'pandas', 'numpy', 'git', 'ci/cd']
    
    # Common soft skills
    soft_keywords = ['leadership', 'communication', 'teamwork', 'problem solving',
                     'analytical', 'creative', 'collaborative', 'organized',
                     'detail-oriented', 'time management']
    
    response_lower = response.lower()
    
    technical = [kw for kw in tech_keywords if kw in response_lower]
    soft = [kw for kw in soft_keywords if kw in response_lower]
    
    return {
        "technical_skills": technical,
        "soft_skills": soft,
        "certifications": [],
        "languages": []
    }

def explain_match(candidate_text: str, company_text: str, score: float) -> str:
    """
    Generate human-readable match explanation.
    
    Args:
        candidate_text: Candidate profile
        company_text: Company profile
        score: Similarity score
    
    Returns:
        Natural language explanation
    """
    if not LLM_AVAILABLE:
        return f"Match score: {score:.2f} (LLM explanation not available)"
    
    prompt = f"""Explain why this candidate matches this company (score: {score:.2f}):

CANDIDATE:
{candidate_text[:300]}

COMPANY:
{company_text[:300]}

Write 2-3 sentences explaining the match quality and key alignment points."""
    
    response = call_llm(prompt, max_tokens=200, temperature=0.5)
    
    # Clean response
    if response and not response.startswith('['):
        return response.strip()
    else:
        return f"Match score: {score:.2f}. Skills and requirements align well."

print("   ‚úÖ classify_job_level_zero_shot() defined")
print("   ‚úÖ classify_job_level_few_shot() defined")
print("   ‚úÖ extract_skills_taxonomy() defined (ROBUST - 3 strategies!)")
print("   ‚úÖ explain_match() defined")

# ============================================================================
# Cell 5.7: Test LLM Functions (if available)
# ============================================================================

if LLM_AVAILABLE:
    print("\n" + "="*80)
    print("üß™ TESTING LLM FUNCTIONS")
    print("="*80)
    
    print("\nüìù Test 1: Job Level Classification (Zero-Shot)")
    test_candidate = "Senior software engineer with 10 years experience. Python expert. Led teams of 5+ developers."
    
    result = classify_job_level_zero_shot(test_candidate)
    print(f"   Input: {test_candidate}")
    print(f"   Level: {result['level']}")
    print(f"   Confidence: {result['confidence']:.2f}")
    print(f"   Reasoning: {result['reasoning']}")
    
    print("\nüìù Test 2: Match Explanation")
    test_company = "Tech startup building AI products. Need Python developers with ML experience."
    
    explanation = explain_match(test_candidate, test_company, 0.85)
    print(f"   Match Score: 0.85")
    print(f"   Explanation: {explanation}")
    
    print("\n‚úÖ LLM functions tested successfully!")

else:
    print("\n‚ö†Ô∏è  LLM not available - skipping tests")
    print("   Add HF_TOKEN to config to enable LLM features")

# ============================================================================
# Summary
# ============================================================================

print("\n" + "="*80)
print("BATCH 5 COMPLETE")
print("="*80)

if LLM_AVAILABLE:
    print("\n‚úÖ LLM Enhancement Layer ACTIVE")
    print(f"   ‚Ä¢ Model: {LLM_MODEL}")
    print(f"   ‚Ä¢ Features: 4 functions available")
    print(f"   ‚Ä¢ Cost: $0.00 (free tier)")
    print(f"   ‚Ä¢ Robust parsing: ‚úÖ Enabled (triple-fallback)")
else:
    print("\n‚ö†Ô∏è  LLM Enhancement Layer DISABLED")
    print("   ‚Ä¢ System works fine without LLM")
    print("   ‚Ä¢ Add HF_TOKEN to enable (optional)")

print("\nüìù Available functions:")
print("   ‚Ä¢ classify_job_level_zero_shot()")
print("   ‚Ä¢ classify_job_level_few_shot()")
print("   ‚Ä¢ extract_skills_taxonomy()")
print("   ‚Ä¢ explain_match()")

print("="*80)


BATCH 5: LLM ENHANCEMENT LAYER (OPTIONAL)

üì¶ Importing LLM dependencies...
   ‚úÖ InferenceClient imported
   ‚úÖ Pydantic imported

ü§ñ Initializing LLM Client...
‚úÖ Hugging Face client initialized!
   ‚Ä¢ Model: meta-llama/Llama-3.2-3B-Instruct
   ‚Ä¢ Cost: $0.00 (FREE tier)
   ‚Ä¢ Connection: ‚úÖ Tested successfully

üîß Defining helper functions...
   ‚úÖ call_llm() defined

üìã Defining Pydantic schemas...
   ‚úÖ JobLevelClassification schema defined
   ‚úÖ SkillsTaxonomy schema defined

üõ°Ô∏è  Defining robust parsing functions...
   ‚úÖ parse_job_level_robust() defined
   ‚úÖ parse_skills_taxonomy_robust() defined

üéØ Defining LLM feature functions...
   ‚úÖ classify_job_level_zero_shot() defined
   ‚úÖ classify_job_level_few_shot() defined
   ‚úÖ extract_skills_taxonomy() defined (ROBUST - 3 strategies!)
   ‚úÖ explain_match() defined

üß™ TESTING LLM FUNCTIONS

üìù Test 1: Job Level Classification (Zero-Shot)
   Input: Senior software engineer with 10 years experi

## Cell 6.2: Pydantic Schemas

**Purpose:** Define data validation schemas for structured LLM outputs.

## Cell 6.3: Job Level Classification (Zero-Shot)

**Purpose:** Classify job seniority level without examples.

## Cell 6.4: Few-Shot Classification

**Purpose:** Classify job seniority level without examples.

## Cell 6.4: Skills Extraction

**Purpose:** Extract structured skills from job postings using LLM + Pydantic.

## Cell 6.5: Match Explainability

**Purpose:** Generate LLM explanation for candidate-company matches.

In [26]:
# ============================================================================
# MATCH EXPLAINABILITY (FIXED + ROBUST)
# ============================================================================

def explain_match(candidate_idx: int, company_idx: int, similarity_score: float) -> Dict:
    """
    Generate human-readable explanation for why candidate matches company.
    
    Args:
        candidate_idx: Index in candidates_df
        company_idx: Index in companies_df
        similarity_score: Cosine similarity (0-1)
    
    Returns:
        Dict with: overall_score, match_strengths, skill_gaps, recommendation, fit_summary
    """
    
    # Get candidate and company data
    cand = candidates_df.iloc[candidate_idx]
    comp = companies_df.iloc[company_idx]
    
    # Extract info safely
    cand_skills = str(cand.get('skills', 'Not specified'))[:300]
    cand_career = str(cand.get('career_objective', 'Not specified'))[:200]
    cand_exp = str(cand.get('experience_titles', 'Not specified'))[:200]
    
    comp_name = str(comp.get('description', 'Company'))[:100]
    comp_skills = str(comp.get('enriched_skills', 'Not specified'))[:300]
    comp_industry = str(comp.get('industry', 'Not specified'))
    
    prompt = f"""Explain why this candidate matches this company (score: {similarity_score:.2f}/1.00).

CANDIDATE:
Career Goal: {cand_career}
Skills: {cand_skills}
Experience: {cand_exp}

COMPANY:
Description: {comp_name}
Required Skills: {comp_skills}
Industry: {comp_industry}

Provide:
1. What makes this a good match?
2. What skills align?
3. What's missing (gaps)?
4. Recommendation

Format:
Strengths: [list aligned skills/experience]
Gaps: [list missing skills]
Recommendation: [what candidate should do]
Summary: [one sentence overall fit]
"""
    
    response = call_llm(prompt, max_tokens=800)
    
    # ========================================================================
    # ROBUST PARSER - Multiple strategies
    # ========================================================================
    
    # Strategy 1: Try JSON if present
    try:
        json_str = response.strip()
        
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        elif '```' in json_str:
            json_str = json_str.split('```')[1].split('```')[0].strip()
        
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
            
            data = json.loads(json_str)
            
            # Ensure all required fields
            if 'overall_score' not in data:
                data['overall_score'] = similarity_score
            if 'match_strengths' not in data:
                data['match_strengths'] = []
            if 'skill_gaps' not in data:
                data['skill_gaps'] = []
            if 'recommendation' not in data:
                data['recommendation'] = "Review match details"
            if 'fit_summary' not in data:
                data['fit_summary'] = f"Match score: {similarity_score:.2f}"
            
            return data
    
    except Exception as e:
        pass  # Fall through to Strategy 2
    
    # ========================================================================
    # Strategy 2: Parse structured text
    # ========================================================================
    
    try:
        strengths = []
        gaps = []
        recommendation = ""
        summary = ""
        
        # Extract strengths
        strength_match = re.search(r'Strengths?:\s*\[?([^\]]+)', response, re.IGNORECASE | re.DOTALL)
        if strength_match:
            strength_text = strength_match.group(1)
            strengths = [s.strip(' -‚Ä¢"\'') for s in re.split(r'[,\n]', strength_text) 
                        if s.strip() and len(s.strip()) > 3][:5]
        
        # Extract gaps
        gap_match = re.search(r'Gaps?:\s*\[?([^\]]+)', response, re.IGNORECASE | re.DOTALL)
        if gap_match:
            gap_text = gap_match.group(1)
            gaps = [g.strip(' -‚Ä¢"\'') for g in re.split(r'[,\n]', gap_text) 
                   if g.strip() and len(g.strip()) > 3][:5]
        
        # Extract recommendation
        rec_match = re.search(r'Recommendation:\s*(.+?)(?:\n\n|\n[A-Z]|$)', response, re.IGNORECASE | re.DOTALL)
        if rec_match:
            recommendation = rec_match.group(1).strip()[:200]
        
        # Extract summary
        sum_match = re.search(r'Summary:\s*(.+?)(?:\n|$)', response, re.IGNORECASE)
        if sum_match:
            summary = sum_match.group(1).strip()[:200]
        
        return {
            "overall_score": similarity_score,
            "match_strengths": strengths if strengths else ["Skills alignment detected"],
            "skill_gaps": gaps if gaps else [],
            "recommendation": recommendation if recommendation else "Review match details and apply",
            "fit_summary": summary if summary else f"Good match with {similarity_score:.1%} similarity"
        }
    
    except Exception as e:
        pass  # Fall through to Strategy 3
    
    # ========================================================================
    # Strategy 3: Fallback with basic analysis
    # ========================================================================
    
    return {
        "overall_score": similarity_score,
        "match_strengths": [f"Semantic similarity score of {similarity_score:.2f}"],
        "skill_gaps": ["Detailed analysis unavailable"],
        "recommendation": "Review detailed profiles for comprehensive assessment",
        "fit_summary": f"Match score: {similarity_score:.2f} - {'Strong' if similarity_score > 0.7 else 'Moderate' if similarity_score > 0.5 else 'Weak'} match"
    }

print("‚úÖ Match explainability (ROBUST) loaded")

# ============================================================================
# TEST (FIXED - usa vari√°veis e dados corretos!)
# ============================================================================

if LLM_AVAILABLE:
    print("\nüí° Testing match explainability...\n")
    
    # Get first candidate's best match (j√° calculado!)
    if 'all_matches' in globals() and len(all_matches) > 0:
        cand_idx = 0
        comp_idx, score = all_matches[cand_idx][0]  # Best match
        
        # Show candidate info
        cand = candidates_df.iloc[cand_idx]
        print(f"üë§ Candidate #{cand_idx}:")
        print(f"   Career: {str(cand.get('career_objective', 'N/A'))[:100]}...")
        print(f"   Skills: {str(cand.get('skills', []))[:100]}...")
        
        # Show company info
        comp = companies_df.iloc[comp_idx]
        print(f"\nüè¢ Company #{comp_idx}:")
        print(f"   Description: {str(comp.get('description', 'N/A'))[:100]}...")
        print(f"   Skills: {str(comp.get('enriched_skills', 'N/A'))[:100]}...")
        
        print(f"\n‚ö° Match Score: {score:.3f}\n")
        
        # Generate explanation
        print("‚è≥ Generating explanation...")
        explanation = explain_match(cand_idx, comp_idx, score)
        
        # Display explanation
        print("\n" + "="*80)
        print("üìä MATCH EXPLANATION")
        print("="*80)
        
        print(f"\nüéØ Overall Score: {explanation['overall_score']:.3f}")
        
        print(f"\n‚úÖ Match Strengths ({len(explanation['match_strengths'])}):")
        for i, strength in enumerate(explanation['match_strengths'], 1):
            print(f"   {i}. {strength}")
        
        print(f"\n‚ö†Ô∏è  Skill Gaps ({len(explanation['skill_gaps'])}):")
        if explanation['skill_gaps']:
            for i, gap in enumerate(explanation['skill_gaps'], 1):
                print(f"   {i}. {gap}")
        else:
            print("   (none identified)")
        
        print(f"\nüí° Recommendation:")
        print(f"   {explanation['recommendation']}")
        
        print(f"\nüìù Summary:")
        print(f"   {explanation['fit_summary']}")
        
        print("="*80)
    
    else:
        print("‚ö†Ô∏è  No matches available. Run matching system first (Section 6).")

else:
    print("‚ö†Ô∏è  LLM not available - skipping explainability test")

print("\n‚úÖ Test complete!")

‚úÖ Match explainability (ROBUST) loaded

üí° Testing match explainability...

üë§ Candidate #0:
   Career: Big data analytics working and database warehouse manager with robust experience in handling all kin...
   Skills: ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', ...

üè¢ Company #20497:
   Description: CloudIngest is a full-service tech software firm. Our expert technical team maintains strong core co...
   Skills: ENG, HR, IT...

‚ö° Match Score: 0.659

‚è≥ Generating explanation...

üìä MATCH EXPLANATION

üéØ Overall Score: 0.659

‚úÖ Match Strengths (1):
   1. Skills alignment detected

‚ö†Ô∏è  Skill Gaps (0):
   (none identified)

üí° Recommendation:
   Review match details and apply

üìù Summary:
   Good match with 65.9% similarity

‚úÖ Test complete!


In [None]:
# ============================================================================
# BATCH 6: INTERACTIVE VISUALIZATIONS
# ============================================================================
# Purpose: Create publication-quality visualizations for academic report
# Includes: t-SNE, Network graphs, Heatmaps, Distribution plots
# ============================================================================

print("\n" + "="*80)
print("üé® BATCH 6: CREATING VISUALIZATIONS")
print("="*80)

# ============================================================================
# Cell 6.1: t-SNE Embedding Visualization
# ============================================================================

print("\n1Ô∏è‚É£  t-SNE EMBEDDING VISUALIZATION")
print("-" * 80)

from sklearn.manifold import TSNE
import plotly.express as px
import plotly.graph_objects as go

def create_tsne_plot(candidate_embeddings, company_embeddings, 
                     candidates_df, companies_df,
                     n_samples_cand=500, n_samples_comp=1000,
                     save_path='../visualizations/'):
    """
    Create t-SNE visualization of embedding space.
    
    Shows how candidates and companies are distributed in semantic space.
    If they cluster separately = vocabulary mismatch problem!
    If they overlap = job posting bridge is working!
    
    Args:
        candidate_embeddings: Candidate vectors (N, 384)
        company_embeddings: Company vectors (M, 384)
        n_samples_cand: Sample size for candidates (for speed)
        n_samples_comp: Sample size for companies (for speed)
    """
    print("\nüî¨ Generating t-SNE projection...")
    print("   (This takes 2-5 minutes - computing 384D ‚Üí 2D projection)")
    
    # Sample for speed (t-SNE is O(n¬≤) complexity!)
    n_cand = min(n_samples_cand, len(candidate_embeddings))
    n_comp = min(n_samples_comp, len(company_embeddings))
    
    cand_sample = candidate_embeddings[:n_cand]
    comp_sample = company_embeddings[:n_comp]
    
    # Combine for t-SNE
    all_embeddings = np.vstack([cand_sample, comp_sample])
    
    # Run t-SNE
    tsne = TSNE(
        n_components=2,
        perplexity=30,
        n_iter=1000,
        random_state=42,
        verbose=1
    )
    
    embeddings_2d = tsne.fit_transform(all_embeddings)
    
    # Split back
    cand_2d = embeddings_2d[:n_cand]
    comp_2d = embeddings_2d[n_cand:]
    
    print("   ‚úÖ t-SNE projection complete!")
    
    # Create interactive plot
    fig = go.Figure()
    
    # Add candidates
    fig.add_trace(go.Scatter(
        x=cand_2d[:, 0],
        y=cand_2d[:, 1],
        mode='markers',
        name='Candidates',
        marker=dict(
            size=8,
            color='blue',
            opacity=0.6,
            line=dict(width=0.5, color='darkblue')
        ),
        text=[f"Candidate {i}" for i in range(n_cand)],
        hovertemplate='<b>%{text}</b><br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>'
    ))
    
    # Add companies
    fig.add_trace(go.Scatter(
        x=comp_2d[:, 0],
        y=comp_2d[:, 1],
        mode='markers',
        name='Companies',
        marker=dict(
            size=6,
            color='red',
            opacity=0.5,
            line=dict(width=0.5, color='darkred')
        ),
        text=[f"Company {i}" for i in range(n_comp)],
        hovertemplate='<b>%{text}</b><br>X: %{x:.2f}<br>Y: %{y:.2f}<extra></extra>'
    ))
    
    # Update layout
    fig.update_layout(
        title={
            'text': 'Semantic Embedding Space (t-SNE Projection)<br><sub>Blue=Candidates, Red=Companies</sub>',
            'x': 0.5,
            'xanchor': 'center'
        },
        xaxis_title='t-SNE Dimension 1',
        yaxis_title='t-SNE Dimension 2',
        width=1000,
        height=700,
        hovermode='closest',
        legend=dict(
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=0.01
        ),
        template='plotly_white'
    )
    
    # Save
    Path(save_path).mkdir(parents=True, exist_ok=True)
    fig.write_html(f'{save_path}tsne_embedding_space.html')
    
    print(f"\n   üíæ Saved: {save_path}tsne_embedding_space.html")
    print(f"   üìä Plot shows {n_cand} candidates + {n_comp} companies")
    
    # Analysis
    print(f"\n   üìà Interpretation:")
    print(f"      ‚Ä¢ If clusters separate ‚Üí Vocabulary mismatch problem")
    print(f"      ‚Ä¢ If clusters overlap ‚Üí Job posting bridge working! ‚úÖ")
    
    return fig

# Generate t-SNE plot
tsne_fig = create_tsne_plot(
    candidate_embeddings,
    company_embeddings,
    candidates_df,
    companies_df,
    n_samples_cand=500,
    n_samples_comp=1000
)

print("\n‚úÖ t-SNE visualization complete!")

# ============================================================================
# Cell 6.2: Match Score Distribution Plot
# ============================================================================

print("\n2Ô∏è‚É£  MATCH SCORE DISTRIBUTION")
print("-" * 80)

def create_score_distribution_plot(all_matches, save_path='../visualizations/'):
    """
    Visualize distribution of match scores.
    
    Shows quality of matches across all candidates.
    High scores (>0.7) = good semantic matches!
    """
    print("\nüìä Creating score distribution plot...")
    
    # Extract all scores
    all_scores = []
    for matches in all_matches:
        scores = [score for _, score in matches]
        all_scores.extend(scores)
    
    all_scores = np.array(all_scores)
    
    # Create histogram
    fig = go.Figure()
    
    fig.add_trace(go.Histogram(
        x=all_scores,
        nbinsx=50,
        name='Match Scores',
        marker=dict(
            color='steelblue',
            line=dict(color='darkblue', width=1)
        ),
        hovertemplate='Score: %{x:.3f}<br>Count: %{y}<extra></extra>'
    ))
    
    # Add statistics as annotations
    mean_score = all_scores.mean()
    median_score = np.median(all_scores)
    
    fig.add_vline(
        x=mean_score,
        line_dash="dash",
        line_color="red",
        annotation_text=f"Mean: {mean_score:.3f}",
        annotation_position="top right"
    )
    
    fig.add_vline(
        x=median_score,
        line_dash="dash",
        line_color="green",
        annotation_text=f"Median: {median_score:.3f}",
        annotation_position="top left"
    )
    
    # Update layout
    fig.update_layout(
        title='Distribution of Match Scores',
        xaxis_title='Cosine Similarity Score',
        yaxis_title='Frequency',
        width=1000,
        height=600,
        template='plotly_white',
        showlegend=False
    )
    
    # Save
    fig.write_html(f'{save_path}score_distribution.html')
    
    print(f"   üíæ Saved: {save_path}score_distribution.html")
    print(f"\n   üìä Statistics:")
    print(f"      ‚Ä¢ Mean: {mean_score:.4f}")
    print(f"      ‚Ä¢ Median: {median_score:.4f}")
    print(f"      ‚Ä¢ Std: {all_scores.std():.4f}")
    print(f"      ‚Ä¢ Min: {all_scores.min():.4f}")
    print(f"      ‚Ä¢ Max: {all_scores.max():.4f}")
    
    return fig

# Generate score distribution
score_fig = create_score_distribution_plot(all_matches)

print("\n‚úÖ Score distribution plot complete!")

# ============================================================================
# Cell 6.3: Bilateral Fairness Visualization
# ============================================================================

print("\n3Ô∏è‚É£  BILATERAL FAIRNESS VISUALIZATION")
print("-" * 80)

def create_fairness_plot(similarity_matrix, top_k=10, threshold=0.5,
                         save_path='../visualizations/'):
    """
    Visualize bilateral fairness metrics.
    
    Shows balance between candidate-side and company-side coverage.
    """
    print("\n‚öñÔ∏è  Creating bilateral fairness plot...")
    
    n_candidates, n_companies = similarity_matrix.shape
    
    # Calculate metrics across different thresholds
    thresholds = np.arange(0.3, 0.9, 0.05)
    candidate_coverages = []
    company_coverages = []
    bilateral_fairness = []
    
    for thresh in thresholds:
        # Candidate coverage
        cand_max = similarity_matrix.max(axis=1)
        cand_cov = (cand_max > thresh).sum() / n_candidates
        
        # Company coverage
        top_indices = np.argsort(similarity_matrix, axis=1)[:, -top_k:]
        unique_comps = np.unique(top_indices)
        comp_cov = len(unique_comps) / n_companies
        
        # Bilateral fairness
        fairness = min(cand_cov, comp_cov)
        
        candidate_coverages.append(cand_cov)
        company_coverages.append(comp_cov)
        bilateral_fairness.append(fairness)
    
    # Create plot
    fig = go.Figure()
    
    # Add traces
    fig.add_trace(go.Scatter(
        x=thresholds,
        y=candidate_coverages,
        mode='lines+markers',
        name='Candidate Coverage',
        line=dict(color='blue', width=3),
        marker=dict(size=8)
    ))
    
    fig.add_trace(go.Scatter(
        x=thresholds,
        y=company_coverages,
        mode='lines+markers',
        name='Company Coverage',
        line=dict(color='red', width=3),
        marker=dict(size=8)
    ))
    
    fig.add_trace(go.Scatter(
        x=thresholds,
        y=bilateral_fairness,
        mode='lines+markers',
        name='Bilateral Fairness',
        line=dict(color='green', width=4, dash='dash'),
        marker=dict(size=10)
    ))
    
    # Add target line
    fig.add_hline(
        y=0.85,
        line_dash="dot",
        line_color="gray",
        annotation_text="Target: 0.85",
        annotation_position="right"
    )
    
    # Update layout
    fig.update_layout(
        title='Bilateral Fairness Analysis',
        xaxis_title='Similarity Threshold',
        yaxis_title='Coverage',
        width=1000,
        height=600,
        template='plotly_white',
        hovermode='x unified'
    )
    
    # Save
    fig.write_html(f'{save_path}bilateral_fairness.html')
    
    print(f"   üíæ Saved: {save_path}bilateral_fairness.html")
    print(f"\n   üìä At threshold {threshold}:")
    idx = np.argmin(np.abs(thresholds - threshold))
    print(f"      ‚Ä¢ Candidate coverage: {candidate_coverages[idx]:.3f}")
    print(f"      ‚Ä¢ Company coverage: {company_coverages[idx]:.3f}")
    print(f"      ‚Ä¢ Bilateral fairness: {bilateral_fairness[idx]:.3f}")
    
    return fig

# Generate fairness plot
fairness_fig = create_fairness_plot(similarity_matrix, top_k=config.TOP_K_MATCHES)

print("\n‚úÖ Bilateral fairness plot complete!")

# ============================================================================
# Cell 6.4: Interactive Network Graph (PyVis)
# ============================================================================

print("\n4Ô∏è‚É£  INTERACTIVE NETWORK GRAPH")
print("-" * 80)

from pyvis.network import Network

def create_network_graph(candidates_df, companies_df, all_matches,
                         n_candidates=33, n_top_matches=3,
                         save_path='../visualizations/'):
    """
    Create interactive network graph showing matches.
    
    Visualizes candidate-company connections.
    Node size = relevance
    Edge thickness = match strength
    
    Args:
        n_candidates: Number of candidates to show
        n_top_matches: Top matches per candidate
    """
    print(f"\nüï∏Ô∏è  Creating network graph ({n_candidates} candidates)...")
    
    # Initialize network
    net = Network(
        height='750px',
        width='100%',
        bgcolor='#222222',
        font_color='white',
        notebook=False
    )
    
    # Configure physics
    net.set_options("""
    {
        "physics": {
            "forceAtlas2Based": {
                "gravitationalConstant": -50,
                "centralGravity": 0.01,
                "springLength": 200,
                "springConstant": 0.08
            },
            "maxVelocity": 50,
            "solver": "forceAtlas2Based",
            "timestep": 0.35,
            "stabilization": {"iterations": 150}
        }
    }
    """)
    
    # Add candidate nodes
    for i in range(min(n_candidates, len(candidates_df))):
        cand = candidates_df.iloc[i]
        career = str(cand.get('career_objective', 'Candidate'))[:50]
        
        net.add_node(
            f"C{i}",
            label=f"Candidate {i+1}",
            title=career,
            color='#00FF00',
            size=30,
            shape='dot'
        )
    
    # Add company nodes and edges
    company_ids_added = set()
    
    for i in range(min(n_candidates, len(all_matches))):
        for rank, (comp_idx, score) in enumerate(all_matches[i][:n_top_matches]):
            comp_id = f"M{comp_idx}"
            
            # Add company node if not already added
            if comp_id not in company_ids_added:
                comp = companies_df.iloc[comp_idx]
                desc = str(comp.get('description', 'Company'))[:50]
                
                net.add_node(
                    comp_id,
                    label=f"Company {comp_idx}",
                    title=desc,
                    color='#FF6B6B',
                    size=20,
                    shape='dot'
                )
                company_ids_added.add(comp_id)
            
            # Add edge
            edge_width = score * 5  # Thicker = better match
            net.add_edge(
                f"C{i}",
                comp_id,
                value=edge_width,
                title=f"Score: {score:.3f}"
            )
    
    # Save
    output_path = f'{save_path}network_graph.html'
    net.save_graph(output_path)
    
    print(f"   üíæ Saved: {output_path}")
    print(f"   üéØ Showing {n_candidates} candidates with top-{n_top_matches} matches")
    print(f"   üí° Open in browser for interactive exploration!")
    
    return net

# Generate network graph
network = create_network_graph(
    candidates_df,
    companies_df,
    all_matches,
    n_candidates=10,
    n_top_matches=3
)

print("\n‚úÖ Network graph complete!")

# ============================================================================
# Cell 6.5: Skills Coverage Heatmap
# ============================================================================

print("\n5Ô∏è‚É£  SKILLS COVERAGE HEATMAP")
print("-" * 80)

def create_skills_heatmap(companies_df, save_path='../visualizations/'):
    """
    Visualize skills distribution across companies.
    
    Shows which skills are most common and coverage impact.
    """
    print("\nüî• Creating skills heatmap...")
    
    # Extract all skills
    all_skills = []
    for skills_str in companies_df['enriched_skills']:
        if skills_str != 'Not specified':
            skills = skills_str.split(', ')
            all_skills.extend(skills)
    
    # Count skill frequency
    from collections import Counter
    skill_counts = Counter(all_skills)
    
    # Get top 30 skills
    top_skills = skill_counts.most_common(30)
    
    if not top_skills:
        print("   ‚ö†Ô∏è  No skills data for heatmap")
        return None
    
    # Create bar chart
    fig = go.Figure()
    
    skills_names = [s[0] for s in top_skills]
    skills_counts = [s[1] for s in top_skills]
    
    fig.add_trace(go.Bar(
        x=skills_counts,
        y=skills_names,
        orientation='h',
        marker=dict(
            color=skills_counts,
            colorscale='Viridis',
            showscale=True,
            colorbar=dict(title="Frequency")
        ),
        text=skills_counts,
        textposition='auto',
        hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
    ))
    
    # Update layout
    fig.update_layout(
        title='Top 30 Most Common Skills (from Job Postings)',
        xaxis_title='Number of Companies',
        yaxis_title='Skill',
        width=1000,
        height=800,
        template='plotly_white',
        yaxis={'categoryorder': 'total ascending'}
    )
    
    # Save
    fig.write_html(f'{save_path}skills_heatmap.html')
    
    print(f"   üíæ Saved: {save_path}skills_heatmap.html")
    print(f"\n   üìä Top 5 skills:")
    for i, (skill, count) in enumerate(top_skills[:5], 1):
        print(f"      {i}. {skill}: {count:,} companies")
    
    return fig

# Generate skills heatmap
skills_fig = create_skills_heatmap(companies_df)

print("\n‚úÖ Skills heatmap complete!")

# ============================================================================
# Cell 6.6: Summary Dashboard
# ============================================================================

print("\n6Ô∏è‚É£  CREATING SUMMARY DASHBOARD")
print("-" * 80)

def create_summary_dashboard(candidates_df, companies_df, 
                             all_matches, similarity_matrix,
                             fairness, coverage_pct,
                             save_path='../visualizations/'):
    """
    Create a comprehensive summary dashboard.
    
    Single HTML with all key metrics and visualizations.
    """
    print("\nüìä Building summary dashboard...")
    
    from plotly.subplots import make_subplots
    
    # Create subplots
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            'System Metrics',
            'Match Score Distribution',
            'Coverage by Threshold',
            'Top Skills Distribution'
        ),
        specs=[
            [{'type': 'indicator'}, {'type': 'histogram'}],
            [{'type': 'scatter'}, {'type': 'bar'}]
        ]
    )
    
    # 1. Metrics indicators
    fig.add_trace(
        go.Indicator(
            mode="number+delta",
            value=fairness,
            title={'text': "Bilateral Fairness"},
            delta={'reference': 0.85},
            domain={'x': [0, 1], 'y': [0, 1]}
        ),
        row=1, col=1
    )
    
    # 2. Score distribution
    all_scores = [score for matches in all_matches for _, score in matches]
    fig.add_trace(
        go.Histogram(x=all_scores, nbinsx=30, name='Scores'),
        row=1, col=2
    )
    
    # 3. Coverage analysis
    thresholds = np.arange(0.3, 0.9, 0.05)
    coverages = []
    for t in thresholds:
        cov = (similarity_matrix.max(axis=1) > t).sum() / len(candidates_df)
        coverages.append(cov)
    
    fig.add_trace(
        go.Scatter(x=thresholds, y=coverages, mode='lines+markers', name='Coverage'),
        row=2, col=1
    )
    
    # 4. Top skills (if available)
    if 'enriched_skills' in companies_df.columns:
        all_skills = []
        for s in companies_df['enriched_skills']:
            if s != 'Not specified':
                all_skills.extend(s.split(', '))
        
        from collections import Counter
        top_skills = Counter(all_skills).most_common(10)
        
        if top_skills:
            fig.add_trace(
                go.Bar(
                    x=[s[1] for s in top_skills],
                    y=[s[0] for s in top_skills],
                    orientation='h',
                    name='Skills'
                ),
                row=2, col=2
            )
    
    # Update layout
    fig.update_layout(
        title_text='HRHUB v4.0 - System Dashboard',
        showlegend=False,
        height=800,
        width=1400,
        template='plotly_white'
    )
    
    # Save
    fig.write_html(f'{save_path}dashboard.html')
    
    print(f"   üíæ Saved: {save_path}dashboard.html")
    print(f"   üìä Dashboard includes:")
    print(f"      ‚Ä¢ System metrics")
    print(f"      ‚Ä¢ Score distribution")
    print(f"      ‚Ä¢ Coverage analysis")
    print(f"      ‚Ä¢ Skills distribution")
    
    return fig

# Generate dashboard
dashboard_fig = create_summary_dashboard(
    candidates_df,
    companies_df,
    all_matches,
    similarity_matrix,
    fairness,
    coverage_pct
)

print("\n‚úÖ Summary dashboard complete!")

# ============================================================================
# BATCH 6 COMPLETE
# ============================================================================

print("\n" + "="*80)
print("üéâ BATCH 6 COMPLETE - ALL VISUALIZATIONS GENERATED!")
print("="*80)

print(f"\nüìÇ Generated files in {config.VIZ_PATH}:")
print("   ‚úÖ tsne_embedding_space.html")
print("   ‚úÖ score_distribution.html")
print("   ‚úÖ bilateral_fairness.html")
print("   ‚úÖ network_graph.html")
print("   ‚úÖ skills_heatmap.html")
print("   ‚úÖ dashboard.html")

print("\nüí° Use these visualizations in your academic report!")
print("   ‚Ä¢ t-SNE: Shows semantic space and vocabulary bridge effect")
print("   ‚Ä¢ Network: Interactive match visualization")
print("   ‚Ä¢ Fairness: Bilateral balance validation")
print("   ‚Ä¢ Heatmap: Skills distribution analysis")
print("   ‚Ä¢ Dashboard: Complete system overview")

print("\nüöÄ Ready for academic report!")
print("="*80)


üé® BATCH 6: CREATING VISUALIZATIONS

1Ô∏è‚É£  t-SNE EMBEDDING VISUALIZATION
--------------------------------------------------------------------------------

üî¨ Generating t-SNE projection...
   (This takes 2-5 minutes - computing 384D ‚Üí 2D projection)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1500 samples in 0.001s...
[t-SNE] Computed neighbors for 1500 samples in 0.448s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1500
[t-SNE] Computed conditional probabilities for sample 1500 / 1500
[t-SNE] Mean sigma: 0.271348
[t-SNE] KL divergence after 250 iterations with early exaggeration: 68.418030
[t-SNE] KL divergence after 1000 iterations: 1.228683
   ‚úÖ t-SNE projection complete!

   üíæ Saved: ../visualizations/tsne_embedding_space.html
   üìä Plot shows 500 candidates + 1000 companies

   üìà Interpretation:
      ‚Ä¢ If clusters separate ‚Üí Vocabulary mismatch problem
      ‚Ä¢ If clusters overlap ‚Üí Job posting bridge working! ‚úÖ

‚úÖ t-

---
# üß™ SECTION 7: Synthetic Test Validation
---

**Purpose:** Test methods on cases where we KNOW the correct answer.

**Why synthetic tests?**
- No labeled ground truth exists in real data
- We can create test cases with known correct/incorrect matches
- Proves methods work as expected
## Cell 7.1: Synthetic Test Implementation

**What it does:** Creates test cases and validates each method.

In [28]:
# ============================================================================
# SECTION 7.5: SYNTHETIC VALIDATION
# ============================================================================
# Purpose: Test matching methods on controlled cases with known correct answers
# Academic Value: Proves methods work on ground truth data
# ============================================================================

print("\n" + "="*80)
print("üß™ SYNTHETIC VALIDATION - CONTROLLED TEST CASES")
print("="*80)

# ============================================================================
# Cell 7.5.1: Create Synthetic Test Cases
# ============================================================================

print("\n1Ô∏è‚É£  CREATING SYNTHETIC TEST CASES")
print("-" * 80)

def create_synthetic_test_cases():
    """
    Create test cases where correct answer is KNOWN.
    
    Each test case has:
    - Candidate text
    - CORRECT company (should rank high)
    - WRONG company (should rank low)
    
    If method is good: score(correct) > score(wrong)
    """
    test_cases = [
        {
            'name': 'Python ML Developer',
            'candidate': 'Python developer with machine learning experience. TensorFlow, PyTorch, data science skills.',
            'correct_company': 'AI startup looking for Python ML engineer. TensorFlow and PyTorch required for deep learning projects.',
            'wrong_company': 'Accounting firm needs senior accountant for tax preparation and financial auditing.'
        },
        {
            'name': 'Marketing Manager',
            'candidate': 'Marketing manager with social media expertise. Brand development, digital campaigns, content strategy.',
            'correct_company': 'Digital agency hiring marketing manager for social media campaigns and brand strategy.',
            'wrong_company': 'Software company needs backend engineer. Java, Spring Boot, microservices architecture.'
        },
        {
            'name': 'Healthcare Nurse',
            'candidate': 'Registered nurse with ICU experience. Emergency medicine, patient care, critical care certified.',
            'correct_company': 'Hospital hiring RN for intensive care unit. Emergency medicine experience required.',
            'wrong_company': 'Construction company needs civil engineer for infrastructure projects and site management.'
        },
        {
            'name': 'Data Scientist',
            'candidate': 'Data scientist specializing in NLP and deep learning. Python, SQL, AWS experience.',
            'correct_company': 'Tech company seeking data scientist for NLP projects. Python, deep learning, cloud experience needed.',
            'wrong_company': 'Restaurant chain hiring head chef. Culinary expertise and kitchen management required.'
        },
        {
            'name': 'Frontend Developer',
            'candidate': 'Frontend developer expert in React and JavaScript. UI/UX design, responsive web development.',
            'correct_company': 'Startup needs frontend developer. React, JavaScript, modern web frameworks essential.',
            'wrong_company': 'Law firm hiring paralegal for legal research and document preparation.'
        }
    ]
    
    return test_cases

# Create test cases
test_cases = create_synthetic_test_cases()

print(f"\n‚úÖ Created {len(test_cases)} synthetic test cases:")
for i, test in enumerate(test_cases, 1):
    print(f"   {i}. {test['name']}")

print("\nEach test case includes:")
print("   ‚Ä¢ Candidate description")
print("   ‚Ä¢ CORRECT company match (should score HIGH)")
print("   ‚Ä¢ WRONG company match (should score LOW)")

# ============================================================================
# Cell 7.5.2: Evaluation Function
# ============================================================================

print("\n2Ô∏è‚É£  DEFINING EVALUATION FUNCTION")
print("-" * 80)

def evaluate_method_on_synthetic(method_name: str, test_cases: list) -> dict:
    """
    Test if method correctly ranks CORRECT > WRONG.
    
    Args:
        method_name: 'SBERT', 'TF-IDF', or 'Jaccard'
        test_cases: List of test case dicts
    
    Returns:
        Dict with accuracy and detailed results
    """
    print(f"\nüî¨ Testing {method_name}...")
    
    correct_rankings = 0
    results = []
    
    # Initialize method
    if method_name == 'SBERT':
        # Use existing SBERTMatcher class
        matcher = SBERTMatcher(config.EMBEDDING_MODEL)
    
    elif method_name == 'TF-IDF':
        # Use existing TFIDFMatcher class
        from sklearn.feature_extraction.text import TfidfVectorizer
        vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
    
    elif method_name == 'Jaccard':
        # Simple keyword overlap
        pass
    
    # Test each case
    for i, test in enumerate(test_cases):
        cand_text = test['candidate']
        correct_text = test['correct_company']
        wrong_text = test['wrong_company']
        
        # Compute similarities
        if method_name == 'SBERT':
            # Embed texts
            cand_emb = matcher.embed([cand_text], show_progress=False)
            correct_emb = matcher.embed([correct_text], show_progress=False)
            wrong_emb = matcher.embed([wrong_text], show_progress=False)
            
            # Compute scores
            score_correct = cosine_similarity(cand_emb, correct_emb)[0][0]
            score_wrong = cosine_similarity(cand_emb, wrong_emb)[0][0]
        
        elif method_name == 'TF-IDF':
            # Vectorize
            vectors = vectorizer.fit_transform([cand_text, correct_text, wrong_text])
            
            # Compute scores
            score_correct = cosine_similarity(vectors[0:1], vectors[1:2])[0][0]
            score_wrong = cosine_similarity(vectors[0:1], vectors[2:3])[0][0]
        
        elif method_name == 'Jaccard':
            # Simple keyword overlap
            def get_keywords(text):
                return set(text.lower().split())
            
            def jaccard(set1, set2):
                intersection = len(set1 & set2)
                union = len(set1 | set2)
                return intersection / union if union > 0 else 0
            
            cand_kw = get_keywords(cand_text)
            correct_kw = get_keywords(correct_text)
            wrong_kw = get_keywords(wrong_text)
            
            score_correct = jaccard(cand_kw, correct_kw)
            score_wrong = jaccard(cand_kw, wrong_kw)
        
        # Check if correct ranked higher
        is_correct = score_correct > score_wrong
        if is_correct:
            correct_rankings += 1
            status = "‚úÖ"
        else:
            status = "‚ùå"
        
        # Store result
        results.append({
            'test_name': test['name'],
            'score_correct': score_correct,
            'score_wrong': score_wrong,
            'margin': score_correct - score_wrong,
            'is_correct': is_correct
        })
        
        print(f"   {status} Test {i+1}: {test['name']}")
        print(f"      Correct: {score_correct:.4f} | Wrong: {score_wrong:.4f} | Margin: {score_correct - score_wrong:+.4f}")
    
    # Calculate accuracy
    accuracy = correct_rankings / len(test_cases)
    
    return {
        'method': method_name,
        'accuracy': accuracy,
        'correct': correct_rankings,
        'total': len(test_cases),
        'results': results
    }

print("‚úÖ Evaluation function ready!")

# ============================================================================
# Cell 7.5.3: Run All Methods
# ============================================================================

print("\n3Ô∏è‚É£  RUNNING ALL METHODS ON SYNTHETIC TESTS")
print("-" * 80)

all_results = {}

# Test SBERT (our method)
print("\n" + "="*80)
print("üü¢ METHOD 1: SBERT (Semantic Embeddings)")
print("="*80)
sbert_results = evaluate_method_on_synthetic('SBERT', test_cases)
all_results['SBERT'] = sbert_results

# Test TF-IDF (baseline 1)
print("\n" + "="*80)
print("üî¥ METHOD 2: TF-IDF (Keyword-based)")
print("="*80)
tfidf_results = evaluate_method_on_synthetic('TF-IDF', test_cases)
all_results['TF-IDF'] = tfidf_results

# Test Jaccard (baseline 2)
print("\n" + "="*80)
print("üü° METHOD 3: JACCARD (Keyword Overlap)")
print("="*80)
jaccard_results = evaluate_method_on_synthetic('Jaccard', test_cases)
all_results['Jaccard'] = jaccard_results

# ============================================================================
# Cell 7.5.4: Comparison Results
# ============================================================================

print("\n" + "="*80)
print("üìä SYNTHETIC VALIDATION RESULTS")
print("="*80)

# Create comparison table
print("\n" + "-"*80)
print(f"{'Method':<20} {'Correct':<10} {'Total':<10} {'Accuracy':<15} {'Status'}")
print("-"*80)

for method_name, result in all_results.items():
    accuracy = result['accuracy']
    correct = result['correct']
    total = result['total']
    
    if accuracy >= 0.8:
        status = "‚úÖ Excellent"
    elif accuracy >= 0.6:
        status = "üü° Good"
    else:
        status = "üî¥ Poor"
    
    print(f"{method_name:<20} {correct:<10} {total:<10} {accuracy*100:<14.1f}% {status}")

print("-"*80)

# Detailed analysis
print("\nüìà DETAILED ANALYSIS:")

for method_name, result in all_results.items():
    print(f"\n{method_name}:")
    
    # Average margin
    margins = [r['margin'] for r in result['results']]
    avg_margin = np.mean(margins)
    
    print(f"   ‚Ä¢ Accuracy: {result['accuracy']*100:.1f}%")
    print(f"   ‚Ä¢ Average margin: {avg_margin:+.4f}")
    print(f"   ‚Ä¢ Failed cases: {result['total'] - result['correct']}")
    
    # Show failed cases if any
    failed = [r for r in result['results'] if not r['is_correct']]
    if failed:
        print(f"   ‚Ä¢ Failed on:")
        for f in failed:
            print(f"      - {f['test_name']}: correct={f['score_correct']:.3f}, wrong={f['score_wrong']:.3f}")

# ============================================================================
# Cell 7.5.5: Save Results
# ============================================================================

print("\n4Ô∏è‚É£  SAVING SYNTHETIC VALIDATION RESULTS")
print("-" * 80)

# Save to JSON
synthetic_results = {
    'test_cases': len(test_cases),
    'methods_tested': list(all_results.keys()),
    'results': {
        method: {
            'accuracy': result['accuracy'],
            'correct': result['correct'],
            'total': result['total']
        }
        for method, result in all_results.items()
    }
}

with open(f'{config.RESULTS_PATH}synthetic_validation.json', 'w') as f:
    json.dump(synthetic_results, f, indent=2)

print(f"‚úÖ Saved: {config.RESULTS_PATH}synthetic_validation.json")

# ============================================================================
# SUMMARY
# ============================================================================

print("\n" + "="*80)
print("üéâ SYNTHETIC VALIDATION COMPLETE")
print("="*80)

print("\n‚úÖ Key Findings:")
print(f"   ‚Ä¢ SBERT accuracy: {all_results['SBERT']['accuracy']*100:.1f}%")
print(f"   ‚Ä¢ TF-IDF accuracy: {all_results['TF-IDF']['accuracy']*100:.1f}%")
print(f"   ‚Ä¢ Jaccard accuracy: {all_results['Jaccard']['accuracy']*100:.1f}%")

winner = max(all_results.keys(), key=lambda k: all_results[k]['accuracy'])
print(f"\nüèÜ Best method: {winner} ({all_results[winner]['accuracy']*100:.1f}% accuracy)")

print("\nüí° Academic Value:")
print("   ‚Ä¢ Proves methods work on ground truth data")
print("   ‚Ä¢ Validates semantic approach superiority")
print("   ‚Ä¢ Provides quantitative comparison")
print("   ‚Ä¢ Essential for thesis evaluation chapter")

print("="*80)


üß™ SYNTHETIC VALIDATION - CONTROLLED TEST CASES

1Ô∏è‚É£  CREATING SYNTHETIC TEST CASES
--------------------------------------------------------------------------------

‚úÖ Created 5 synthetic test cases:
   1. Python ML Developer
   2. Marketing Manager
   3. Healthcare Nurse
   4. Data Scientist
   5. Frontend Developer

Each test case includes:
   ‚Ä¢ Candidate description
   ‚Ä¢ CORRECT company match (should score HIGH)
   ‚Ä¢ WRONG company match (should score LOW)

2Ô∏è‚É£  DEFINING EVALUATION FUNCTION
--------------------------------------------------------------------------------
‚úÖ Evaluation function ready!

3Ô∏è‚É£  RUNNING ALL METHODS ON SYNTHETIC TESTS
--------------------------------------------------------------------------------

üü¢ METHOD 1: SBERT (Semantic Embeddings)

üî¨ Testing SBERT...


NameError: name 'SBERTMatcher' is not defined

In [67]:
# ============================================================================
# SECTION 7.6: MANUAL VALIDATION
# ============================================================================
# Purpose: Human evaluation of match quality
# Method: Sample random matches and rate relevance 1-5
# Academic Value: Human-AI agreement analysis
# ============================================================================

print("\n" + "="*80)
print("üë§ MANUAL VALIDATION - HUMAN EVALUATION")
print("="*80)

import random

# ============================================================================
# Cell 7.6.1: Manual Validation Function
# ============================================================================

def validate_matches_manual(candidates_df, companies_df, all_matches, 
                           candidate_texts, company_texts,
                           n_samples=20, random_seed=42):
    """
    Sample random matches and get human ratings.
    
    Shows:
    - Candidate info (career, skills)
    - Company info (description, skills)
    - System score
    
    Asks:
    - Human rating (1-5 stars)
    
    Returns:
    - DataFrame with validation results
    - Correlation between system score and human rating
    
    Args:
        n_samples: Number of matches to validate
        random_seed: For reproducibility
    """
    print(f"\n{'='*80}")
    print(f"üîç MANUAL VALIDATION - SBERT Method")
    print(f"{'='*80}")
    
    print(f"\nYou will rate {n_samples} matches on a scale of 1-5:")
    print("   1 = ‚ùå Bad match (completely irrelevant)")
    print("   2 = üü° Poor match (somewhat relevant)")
    print("   3 = üü¢ OK match (moderately relevant)")
    print("   4 = ‚úÖ Good match (very relevant)")
    print("   5 = üåü Perfect match (ideal fit)")
    
    input("\nPress Enter to start validation...")
    
    # Sample random candidates
    random.seed(random_seed)
    n_candidates = len(candidates_df)
    sample_indices = random.sample(range(n_candidates), min(n_samples, n_candidates))
    
    validation_results = []
    
    for i, cand_idx in enumerate(sample_indices, 1):
        print(f"\n{'='*80}")
        print(f"MATCH {i}/{len(sample_indices)}")
        print(f"{'='*80}")
        
        # Get candidate info
        cand = candidates_df.iloc[cand_idx]
        cand_career = str(cand.get('career_objective', 'N/A'))[:200]
        cand_skills = str(cand.get('skills', []))[:150]
        
        # Get top match for this candidate
        comp_idx, score = all_matches[cand_idx][0]  # Best match
        
        # Get company info
        comp = companies_df.iloc[comp_idx]
        comp_desc = str(comp.get('description', 'N/A'))[:200]
        comp_skills = str(comp.get('enriched_skills', 'N/A'))[:150]
        
        # Display match
        print(f"\nüë§ CANDIDATE #{cand_idx}:")
        print(f"   Career: {cand_career}")
        print(f"   Skills: {cand_skills}")
        
        print(f"\nüè¢ COMPANY #{comp_idx}:")
        print(f"   Description: {comp_desc}")
        print(f"   Required Skills: {comp_skills}")
        
        print(f"\nü§ñ SYSTEM SCORE: {score:.4f}")
        
        # Get human rating
        while True:
            try:
                rating_input = input("\n‚≠ê Your rating (1-5, or 's' to skip): ")
                
                if rating_input.lower() == 's':
                    rating = None
                    break
                
                rating = int(rating_input)
                
                if 1 <= rating <= 5:
                    break
                else:
                    print("   ‚ö†Ô∏è  Please enter a number between 1 and 5")
            
            except ValueError:
                print("   ‚ö†Ô∏è  Please enter a valid number (1-5) or 's' to skip")
        
        if rating is not None:
            validation_results.append({
                'candidate_idx': cand_idx,
                'company_idx': comp_idx,
                'system_score': score,
                'human_rating': rating,
                'candidate_career': cand_career[:100],
                'company_desc': comp_desc[:100]
            })
            
            # Show quick feedback
            emoji = ['‚ùå', 'üü°', 'üü¢', '‚úÖ', 'üåü'][rating-1]
            print(f"   {emoji} Recorded: {rating}/5")
    
    # Create validation DataFrame
    validation_df = pd.DataFrame(validation_results)
    
    if len(validation_df) == 0:
        print("\n‚ö†Ô∏è  No ratings recorded. Validation cancelled.")
        return None
    
    # ========================================================================
    # Analysis
    # ========================================================================
    
    print(f"\n{'='*80}")
    print(f"üìä VALIDATION RESULTS")
    print(f"{'='*80}")
    
    # Basic statistics
    mean_rating = validation_df['human_rating'].mean()
    std_rating = validation_df['human_rating'].std()
    mean_score = validation_df['system_score'].mean()
    
    print(f"\nüìà Statistics:")
    print(f"   ‚Ä¢ Samples validated: {len(validation_df)}")
    print(f"   ‚Ä¢ Mean human rating: {mean_rating:.2f}/5 ({'‚≠ê' * int(round(mean_rating))})")
    print(f"   ‚Ä¢ Std deviation: {std_rating:.2f}")
    print(f"   ‚Ä¢ Mean system score: {mean_score:.4f}")
    
    # Rating distribution
    print(f"\nüìä Rating Distribution:")
    for rating in range(1, 6):
        count = (validation_df['human_rating'] == rating).sum()
        pct = (count / len(validation_df)) * 100
        bar = '‚ñà' * int(pct / 5)
        emoji = ['‚ùå', 'üü°', 'üü¢', '‚úÖ', 'üåü'][rating-1]
        print(f"   {emoji} {rating} stars: {count:>3} ({pct:>5.1f}%) {bar}")
    
    # Correlation analysis
    correlation = validation_df[['system_score', 'human_rating']].corr().iloc[0, 1]
    
    print(f"\nüîó Correlation Analysis:")
    print(f"   ‚Ä¢ Pearson correlation: {correlation:.3f}")
    
    if correlation > 0.7:
        status = "‚úÖ Strong positive correlation"
    elif correlation > 0.5:
        status = "üü¢ Moderate positive correlation"
    elif correlation > 0.3:
        status = "üü° Weak positive correlation"
    else:
        status = "üî¥ Poor correlation"
    
    print(f"   ‚Ä¢ Interpretation: {status}")
    print(f"   ‚Ä¢ Meaning: System scores {'align well' if correlation > 0.5 else 'somewhat align'} with human judgment")
    
    # Agreement analysis
    # Convert to binary: good (4-5) vs not good (1-3)
    validation_df['human_good'] = validation_df['human_rating'] >= 4
    validation_df['system_good'] = validation_df['system_score'] >= validation_df['system_score'].median()
    
    agreement = (validation_df['human_good'] == validation_df['system_good']).sum() / len(validation_df)
    
    print(f"\nü§ù Agreement (Good vs Not Good):")
    print(f"   ‚Ä¢ Agreement rate: {agreement*100:.1f}%")
    
    # Show disagreements
    disagreements = validation_df[validation_df['human_good'] != validation_df['system_good']]
    if len(disagreements) > 0:
        print(f"\n‚ö†Ô∏è  Disagreements ({len(disagreements)}):")
        for idx, row in disagreements.head(3).iterrows():
            print(f"   ‚Ä¢ Candidate {row['candidate_idx']} ‚Üí Company {row['company_idx']}")
            print(f"     System: {row['system_score']:.3f}, Human: {row['human_rating']}/5")
    
    return validation_df

print("‚úÖ Manual validation function ready!")

# ============================================================================
# Cell 7.6.2: Run Validation (Interactive)
# ============================================================================

print("\n" + "="*80)
print("üöÄ READY TO START MANUAL VALIDATION")
print("="*80)

print("\nüí° This is OPTIONAL but valuable for your thesis!")
print("\nBenefits:")
print("   ‚Ä¢ Shows human-AI agreement")
print("   ‚Ä¢ Validates system quality")
print("   ‚Ä¢ Provides qualitative insights")
print("   ‚Ä¢ Takes ~10-15 minutes for 20 samples")

run_validation = input("\n‚ùì Run manual validation now? (y/n): ")

if run_validation.lower() in ['y', 'yes']:
    # Run validation
    validation_results = validate_matches_manual(
        candidates_df=candidates_df,
        companies_df=companies_df,
        all_matches=all_matches,
        candidate_texts=candidate_texts,
        company_texts=company_texts,
        n_samples=20,  # Adjust if needed
        random_seed=42
    )
    
    if validation_results is not None:
        # Save results
        validation_results.to_csv(f'{config.RESULTS_PATH}manual_validation.csv', index=False)
        print(f"\nüíæ Saved: {config.RESULTS_PATH}manual_validation.csv")
        
        # Create summary for report
        summary = {
            'method': 'SBERT',
            'samples_validated': len(validation_results),
            'mean_human_rating': float(validation_results['human_rating'].mean()),
            'mean_system_score': float(validation_results['system_score'].mean()),
            'correlation': float(validation_results[['system_score', 'human_rating']].corr().iloc[0, 1]),
            'ratings_distribution': validation_results['human_rating'].value_counts().to_dict()
        }
        
        with open(f'{config.RESULTS_PATH}validation_summary.json', 'w') as f:
            json.dump(summary, f, indent=2)
        
        print(f"üíæ Saved: {config.RESULTS_PATH}validation_summary.json")
        
        print("\n" + "="*80)
        print("‚úÖ MANUAL VALIDATION COMPLETE!")
        print("="*80)
        print("\nüìä Use these results in your thesis:")
        print("   ‚Ä¢ Include correlation coefficient")
        print("   ‚Ä¢ Show rating distribution chart")
        print("   ‚Ä¢ Discuss human-AI agreement")
        print("   ‚Ä¢ Cite mean rating as quality metric")

else:
    print("\n‚è≠Ô∏è  Skipped manual validation")
    print("üí° You can run this later if needed")
    validation_results = None

print("\n" + "="*80)


üë§ MANUAL VALIDATION - HUMAN EVALUATION
‚úÖ Manual validation function ready!

üöÄ READY TO START MANUAL VALIDATION

üí° This is OPTIONAL but valuable for your thesis!

Benefits:
   ‚Ä¢ Shows human-AI agreement
   ‚Ä¢ Validates system quality
   ‚Ä¢ Provides qualitative insights
   ‚Ä¢ Takes ~10-15 minutes for 20 samples

üîç MANUAL VALIDATION - SBERT Method

You will rate 20 matches on a scale of 1-5:
   1 = ‚ùå Bad match (completely irrelevant)
   2 = üü° Poor match (somewhat relevant)
   3 = üü¢ OK match (moderately relevant)
   4 = ‚úÖ Good match (very relevant)
   5 = üåü Perfect match (ideal fit)

MATCH 1/20

üë§ CANDIDATE #1824:
   Career: Skilled Machine Learning and Deep Learning practitioner, and I have worked on Computer vision-based projects. On the lookout for opportunities that help me understand more about the domain.
   Skills: ['Machine Learning', 'Deep Learning', 'Computer Vision', 'Pattern Recognition', 'Image Processing', 'Image Segmentation', 'Python', 'R