# 🧠 HRHUB v2.1 - Enhanced with LLM (FREE VERSION)

## 📘 Project Overview

**Bilateral HR Matching System with LLM-Powered Intelligence**

### What's New in v2.1:
- ✅ **FREE LLM**: Using Hugging Face Inference API (no cost)
- ✅ **Job Level Classification**: Zero-shot & few-shot learning
- ✅ **Structured Skills Extraction**: Pydantic schemas
- ✅ **Match Explainability**: LLM-generated reasoning
- ✅ **Flexible Data Loading**: Upload OR Google Drive

### Tech Stack:
```
Embeddings: sentence-transformers (local, free)
LLM: Hugging Face Inference API (free tier)
Schemas: Pydantic
Platform: Google Colab → VS Code
```

---

**Master's Thesis - Aalborg University**  
*Business Data Science Program*  
*December 2025*

---
## 📊 Step 1: Install Dependencies

In [1]:
# Install required packages (uncomment on first run)
#!pip install -q sentence-transformers huggingface-hub pydantic plotly pyvis nbformat scikit-learn pandas numpy python-dotenv

print("✅ All packages installed!")

✅ All packages installed!


---
## 📊 Step 2: Import Libraries

In [2]:
import pandas as pd
import numpy as np
import json
import os
from typing import List, Dict, Optional, Literal
import warnings
warnings.filterwarnings('ignore')

# ML & NLP
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# LLM Integration (FREE)
from huggingface_hub import InferenceClient
from pydantic import BaseModel, Field

# Visualization
import plotly.graph_objects as go
from IPython.display import HTML, display

# Configuration Settings
from dotenv import load_dotenv

# Load variables from .env
load_dotenv()
print("✅ Environment variables loaded from .env")

print("✅ All libraries imported!")

✅ Environment variables loaded from .env
✅ All libraries imported!


---
## 📊 Step 3: Configuration

In [3]:
class Config:
    """Centralized configuration for VS Code"""
    
    # Paths - VS Code structure
    CSV_PATH = '../csv_files/'
    PROCESSED_PATH = '../processed/'
    RESULTS_PATH = '../results/'
    
    # Embedding Model
    EMBEDDING_MODEL = 'all-MiniLM-L6-v2'
    
    # LLM Settings (FREE - Hugging Face)
    HF_TOKEN = os.getenv('HF_TOKEN', '')  # ✅ Read from .env
    LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'
    
    LLM_MAX_TOKENS = 1000
    
    # Matching Parameters
    TOP_K_MATCHES = 10
    SIMILARITY_THRESHOLD = 0.5
    RANDOM_SEED = 42

np.random.seed(Config.RANDOM_SEED)

print("✅ Configuration loaded!")
print(f"🧠 Embedding model: {Config.EMBEDDING_MODEL}")
print(f"🤖 LLM model: {Config.LLM_MODEL}")
print(f"🔑 HF Token configured: {'Yes ✅' if Config.HF_TOKEN else 'No ⚠️'}")
print(f"📂 Data path: {Config.CSV_PATH}")

✅ Configuration loaded!
🧠 Embedding model: all-MiniLM-L6-v2
🤖 LLM model: meta-llama/Llama-3.2-3B-Instruct
🔑 HF Token configured: Yes ✅
📂 Data path: ../csv_files/


---
## 🏗️ Step 3.5: Architecture - Text Builders

**HIGH COHESION:** Each class has ONE responsibility
**LOW COUPLING:** Classes don't depend on each other

In [None]:
# ============================================================================
# TEXT BUILDER CLASSES - Single Responsibility Principle
# ============================================================================

from abc import ABC, abstractmethod
from typing import List

class TextBuilder(ABC):
    """Abstract base class for text builders"""
    
    @abstractmethod
    def build(self, row: pd.Series) -> str:
        """Build text representation from DataFrame row"""
        pass
    
    def build_batch(self, df: pd.DataFrame) -> List[str]:
        """Build text representations for entire DataFrame"""
        return df.apply(self.build, axis=1).tolist()


class CandidateTextBuilder(TextBuilder):
    """Builds text representation for candidates"""
    
    def __init__(self, fields: Optional[List[str]] = None):
        self.fields = fields or [
            'Category',
            'skills',
            'career_objective',
            'degree_names',
            'positions'
        ]
    
    def build(self, row: pd.Series) -> str:
        parts = []
        
        if row.get('Category'):
            parts.append(f"Job Category: {row['Category']}")
        
        if row.get('skills'):
            parts.append(f"Skills: {row['skills']}")
        
        if row.get('career_objective'):
            parts.append(f"Objective: {row['career_objective']}")
        
        if row.get('degree_names'):
            parts.append(f"Education: {row['degree_names']}")
        
        if row.get('positions'):
            parts.append(f"Experience: {row['positions']}")
        
        return ' '.join(parts)


class CompanyTextBuilder(TextBuilder):
    """Builds text representation for companies"""
    
    def __init__(self, include_postings: bool = True):
        self.include_postings = include_postings
    
    def build(self, row: pd.Series) -> str:
        parts = []
        
        if row.get('name'):
            parts.append(f"Company: {row['name']}")
        
        if row.get('description'):
            parts.append(f"Description: {row['description']}")
        
        if row.get('industries_list'):
            parts.append(f"Industries: {row['industries_list']}")
        
        if row.get('specialties_list'):
            parts.append(f"Specialties: {row['specialties_list']}")
        
        # Include job postings data (THE BRIDGE!)
        if self.include_postings:
            if row.get('required_skills'):
                parts.append(f"Required Skills: {row['required_skills']}")
            
            if row.get('posted_job_titles'):
                parts.append(f"Job Titles: {row['posted_job_titles']}")
            
            if row.get('experience_levels'):
                parts.append(f"Experience: {row['experience_levels']}")
        
        return ' '.join(parts)


print("✅ Text Builder classes loaded")
print("   • CandidateTextBuilder")
print("   • CompanyTextBuilder")
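
One caveat with the truthiness checks in both builders: pandas represents missing values as `NaN`, and `bool(float('nan'))` is `True`, so a missing field can end up in the text as `Skills: nan` unless the frame was filled beforehand. A minimal sketch of a stricter check follows. `is_present` and `build_text` are hypothetical helpers, not part of the notebook, and the sample row is toy data.

```python
import pandas as pd


def is_present(value) -> bool:
    """Return True only for non-missing, non-blank values (hypothetical helper)."""
    if value is None:
        return False
    if isinstance(value, float) and pd.isna(value):
        return False
    return str(value).strip() != ""


def build_text(row: pd.Series, labels: dict) -> str:
    """Join 'Label: value' parts for the fields that are actually present."""
    parts = [
        f"{label}: {row[field]}"
        for field, label in labels.items()
        if field in row.index and is_present(row[field])
    ]
    return " ".join(parts)


# Toy row: one real value, one NaN, one empty string
row = pd.Series({"Category": "Data Science", "skills": float("nan"), "positions": ""})
labels = {"Category": "Job Category", "skills": "Skills", "positions": "Experience"}
print(build_text(row, labels))
```

Only the `Category` field survives here; the `NaN` and empty-string fields are skipped instead of being embedded as literal text.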

---
## 🏗️ Step 3.6: Architecture - Embedding Manager

**Responsibility:** Generate, save, and load embeddings

In [None]:
# ============================================================================
# EMBEDDING MANAGER - Handles all embedding operations
# ============================================================================

from pathlib import Path
from typing import Tuple, Optional

class EmbeddingManager:
    """Manages embedding generation, saving, and loading"""
    
    def __init__(self, model: SentenceTransformer, save_dir: str):
        self.model = model
        self.save_dir = Path(save_dir)
        self.save_dir.mkdir(parents=True, exist_ok=True)
    
    def _get_file_paths(self, entity_type: str) -> Tuple[Path, Path]:
        """Get file paths for embeddings and metadata"""
        emb_file = self.save_dir / f"{entity_type}_embeddings.npy"
        meta_file = self.save_dir / f"{entity_type}_metadata.pkl"
        return emb_file, meta_file
    
    def exists(self, entity_type: str) -> bool:
        """Check if embeddings exist for entity type"""
        emb_file, _ = self._get_file_paths(entity_type)
        return emb_file.exists()
    
    def load(self, entity_type: str) -> Tuple[np.ndarray, pd.DataFrame]:
        """Load embeddings and metadata"""
        emb_file, meta_file = self._get_file_paths(entity_type)
        
        if not emb_file.exists():
            raise FileNotFoundError(f"Embeddings not found: {emb_file}")
        
        embeddings = np.load(emb_file)
        metadata = pd.read_pickle(meta_file) if meta_file.exists() else None
        
        return embeddings, metadata
    
    def generate(self,
                texts: List[str],
                batch_size: int = 32,
                show_progress: bool = True) -> np.ndarray:
        """Generate embeddings from texts"""
        return self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=show_progress,
            normalize_embeddings=True,
            convert_to_numpy=True
        )
    
    def save(self,
            entity_type: str,
            embeddings: np.ndarray,
            metadata: pd.DataFrame) -> None:
        """Save embeddings and metadata"""
        emb_file, meta_file = self._get_file_paths(entity_type)
        
        np.save(emb_file, embeddings)
        metadata.to_pickle(meta_file)
        
        print(f"💾 Saved:")
        print(f"   {emb_file}")
        print(f"   {meta_file}")
    
    def generate_and_save(self,
                         entity_type: str,
                         texts: List[str],
                         metadata: pd.DataFrame,
                         batch_size: int = 32) -> np.ndarray:
        """Generate embeddings and save everything"""
        print(f"🔄 Generating {entity_type} embeddings...")
        print(f"   Processing {len(texts):,} items...")
        
        embeddings = self.generate(texts, batch_size=batch_size)
        self.save(entity_type, embeddings, metadata)
        
        return embeddings
    
    def load_or_generate(self,
                        entity_type: str,
                        texts: List[str],
                        metadata: pd.DataFrame,
                        force_regenerate: bool = False) -> Tuple[np.ndarray, pd.DataFrame]:
        """Load if exists, generate otherwise"""
        
        if not force_regenerate and self.exists(entity_type):
            print(f"📥 Loading {entity_type} embeddings...")
            embeddings, saved_metadata = self.load(entity_type)
            
            # Verify alignment
            if len(embeddings) != len(metadata):
                print(f"⚠️  Size mismatch! Regenerating...")
                embeddings = self.generate_and_save(
                    entity_type, texts, metadata
                )
            else:
                print(f"✅ Loaded: {embeddings.shape}")
        else:
            embeddings = self.generate_and_save(
                entity_type, texts, metadata
            )
        
        return embeddings, metadata


print("✅ EmbeddingManager class loaded")
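
The alignment check in `load_or_generate` compares only lengths, so a cache built from different texts with the same row count would still be accepted. One possible refinement, sketched here under the assumption that a small fingerprint string can be stored next to the `.npy` file, is to hash the input texts and compare hashes on load. `texts_fingerprint` is a hypothetical helper, not part of the notebook.

```python
import hashlib
from typing import List


def texts_fingerprint(texts: List[str]) -> str:
    """Order-sensitive SHA-256 fingerprint of a list of texts (hypothetical helper)."""
    digest = hashlib.sha256()
    for text in texts:
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so ['ab', 'c'] differs from ['a', 'bc']
    return digest.hexdigest()


cached = texts_fingerprint(["Skills: python", "Skills: sql"])
current = texts_fingerprint(["Skills: python", "Skills: sql"])
changed = texts_fingerprint(["Skills: python", "Skills: java"])

print("cache valid:", cached == current)
print("cache stale:", cached != changed)
```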

---
## 🏗️ Step 3.7: Architecture - Matching Engine

**Responsibility:** Calculate similarities and find matches

In [None]:
# ============================================================================
# MATCHING ENGINE - Handles similarity calculations
# ============================================================================

class MatchingEngine:
    """Calculates similarities and finds top matches"""
    
    def __init__(self,
                candidate_vectors: np.ndarray,
                company_vectors: np.ndarray,
                candidate_metadata: pd.DataFrame,
                company_metadata: pd.DataFrame):
        
        self.cand_vectors = candidate_vectors
        self.comp_vectors = company_vectors
        self.cand_metadata = candidate_metadata
        self.comp_metadata = company_metadata
        
        # Verify alignment
        assert len(candidate_vectors) == len(candidate_metadata), \
            "Candidate embeddings and metadata size mismatch"
        assert len(company_vectors) == len(company_metadata), \
            "Company embeddings and metadata size mismatch"
    
    def find_matches(self,
                    candidate_idx: int,
                    top_k: int = 10) -> List[Tuple[int, float]]:
        """Find top K company matches for a candidate"""
        
        if candidate_idx >= len(self.cand_vectors):
            raise IndexError(f"Candidate index {candidate_idx} out of range")
        
        # Get candidate vector
        cand_vec = self.cand_vectors[candidate_idx].reshape(1, -1)
        
        # Calculate similarities
        similarities = cosine_similarity(cand_vec, self.comp_vectors)[0]
        
        # Get top K
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        # Return (index, score) tuples
        return [(int(idx), float(similarities[idx])) for idx in top_indices]
    
    def get_match_details(self,
                         candidate_idx: int,
                         company_idx: int) -> dict:
        """Get detailed match information"""
        
        candidate = self.cand_metadata.iloc[candidate_idx]
        company = self.comp_metadata.iloc[company_idx]
        
        # Calculate similarity
        cand_vec = self.cand_vectors[candidate_idx].reshape(1, -1)
        comp_vec = self.comp_vectors[company_idx].reshape(1, -1)
        similarity = float(cosine_similarity(cand_vec, comp_vec)[0][0])
        
        return {
            'candidate': candidate.to_dict(),
            'company': company.to_dict(),
            'similarity_score': similarity
        }
    
    def batch_match(self,
                   candidate_indices: List[int],
                   top_k: int = 10) -> dict:
        """Find matches for multiple candidates"""
        
        results = {}
        for idx in candidate_indices:
            results[idx] = self.find_matches(idx, top_k=top_k)
        
        return results


print("✅ MatchingEngine class loaded")
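
Because `EmbeddingManager.generate` passes `normalize_embeddings=True`, the vectors have unit length, so cosine similarity reduces to a plain dot product. The sketch below illustrates that on random data and uses `np.argpartition` to avoid a full sort when only the top K are needed. `top_k_dot` is a hypothetical helper name, and the random vectors only stand in for real embeddings.

```python
import numpy as np


def top_k_dot(query: np.ndarray, matrix: np.ndarray, k: int) -> list:
    """Top-k rows of `matrix` by dot product with `query` (assumes unit-length rows)."""
    if k <= 0:
        return []
    scores = matrix @ query
    k = min(k, len(scores))
    # argpartition selects the k largest scores without sorting the whole array
    candidates = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates, highest score first
    ordered = candidates[np.argsort(scores[candidates])[::-1]]
    return [(int(i), float(scores[i])) for i in ordered]


rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 384))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length, as in the notebook
query = vectors[0]

matches = top_k_dot(query, vectors[1:], k=5)
print(matches[0])
```

For a few tens of thousands of companies the full `argsort` in `find_matches` is fast enough; the partial sort mainly matters when matching many candidates in a loop.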

---
## 📊 Step 4: Load All Datasets

In [4]:
print("📂 Loading all datasets...\n")
print("=" * 70)

# Load main datasets
candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')
print(f"✅ Candidates: {len(candidates):,} rows × {len(candidates.columns)} columns")

companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')
print(f"✅ Companies (base): {len(companies_base):,} rows")

company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')
print(f"✅ Company industries: {len(company_industries):,} rows")

company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')
print(f"✅ Company specialties: {len(company_specialties):,} rows")

employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')
print(f"✅ Employee counts: {len(employee_counts):,} rows")

postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')
print(f"✅ Postings: {len(postings):,} rows × {len(postings.columns)} columns")

# Optional datasets
try:
    job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')
    print(f"✅ Job skills: {len(job_skills):,} rows")
except FileNotFoundError:
    job_skills = None
    print("⚠️  Job skills not found (optional)")

try:
    job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')
    print(f"✅ Job industries: {len(job_industries):,} rows")
except FileNotFoundError:
    job_industries = None
    print("⚠️  Job industries not found (optional)")

print("\n" + "=" * 70)
print("✅ All datasets loaded successfully!\n")

📂 Loading all datasets...

✅ Candidates: 9,544 rows × 35 columns
✅ Companies (base): 24,473 rows
✅ Company industries: 24,375 rows
✅ Company specialties: 169,387 rows
✅ Employee counts: 35,787 rows
✅ Postings: 123,849 rows × 31 columns
✅ Job skills: 213,768 rows
✅ Job industries: 164,808 rows

✅ All datasets loaded successfully!



---
## 📊 Step 5: Merge & Enrich Company Data

In [5]:
print("🔗 Merging company data...\n")

# Aggregate industries
company_industries_agg = company_industries.groupby('company_id')['industry'].apply(
    lambda x: ', '.join(map(str, x.tolist()))
).reset_index()
company_industries_agg.columns = ['company_id', 'industries_list']
print(f"✅ Aggregated industries for {len(company_industries_agg):,} companies")

# Aggregate specialties
company_specialties_agg = company_specialties.groupby('company_id')['speciality'].apply(
    lambda x: ' | '.join(x.astype(str).tolist())
).reset_index()
company_specialties_agg.columns = ['company_id', 'specialties_list']
print(f"✅ Aggregated specialties for {len(company_specialties_agg):,} companies")

# Merge all company data
companies_merged = companies_base.copy()
companies_merged = companies_merged.merge(company_industries_agg, on='company_id', how='left')
companies_merged = companies_merged.merge(company_specialties_agg, on='company_id', how='left')
companies_merged = companies_merged.merge(employee_counts, on='company_id', how='left')

print(f"\n✅ Base company merge complete: {len(companies_merged):,} companies\n")

🔗 Merging company data...

✅ Aggregated industries for 24,365 companies
✅ Aggregated specialties for 17,780 companies

✅ Base company merge complete: 35,787 companies
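
The row count grows from 24,473 to 35,787 because `employee_counts` holds several snapshots per `company_id` (each with its own `time_recorded`), and a left merge repeats the company row once per snapshot. One way to avoid that, sketched here on toy data, is to keep a single snapshot per company before merging. Keeping the most recent one is an assumption about what the analysis wants; the cleaning cell later in the notebook uses `keep='first'` instead.

```python
import pandas as pd

# Toy stand-ins for companies_base and employee_counts
companies = pd.DataFrame({"company_id": [1009, 1016], "name": ["IBM", "GE HealthCare"]})
employee_counts = pd.DataFrame({
    "company_id": [1009, 1009, 1016],
    "employee_count": [314102, 313142, 56873],
    "time_recorded": [1712378162, 1713392385, 1712382540],
})

# Keep only the latest snapshot per company
latest_counts = (
    employee_counts
    .sort_values("time_recorded")
    .drop_duplicates(subset="company_id", keep="last")
)

# validate= makes pandas raise if the join keys are not unique on both sides
merged = companies.merge(latest_counts, on="company_id", how="left", validate="one_to_one")
print(merged)
```

With `validate="one_to_one"` the merge fails loudly if duplicates slip back in, rather than silently multiplying rows.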



---
## 📊 Step 6: Enrich with Job Postings

In [6]:
print("🌉 Enriching companies with job posting data...\n")
print("=" * 70)
print("KEY INSIGHT: Postings = 'Requirements Language Bridge'")
print("=" * 70 + "\n")

postings = postings.fillna('')
postings['company_id'] = postings['company_id'].astype(str)

# Aggregate postings per company
postings_agg = postings.groupby('company_id').agg({
    'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),
    'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),
    'skills_desc': lambda x: ' | '.join(x.dropna().astype(str).tolist()),
    'formatted_experience_level': lambda x: ' | '.join(x.dropna().unique().astype(str)),
}).reset_index()

postings_agg.columns = ['company_id', 'posted_job_titles', 'posted_descriptions', 'required_skills', 'experience_levels']

companies_merged['company_id'] = companies_merged['company_id'].astype(str)
companies_full = companies_merged.merge(postings_agg, on='company_id', how='left').fillna('')

print(f"✅ Enriched {len(companies_full):,} companies with posting data\n")

🌉 Enriching companies with job posting data...

KEY INSIGHT: Postings = 'Requirements Language Bridge'

✅ Enriched 35,787 companies with posting data



In [7]:
companies_full.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url,industries_list,specialties_list,employee_count,follower_count,time_recorded,posted_job_titles,posted_descriptions,required_skills,experience_levels
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,314102,16253625,1712378162,,,,
1,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,313142,16309464,1713392385,,,,
2,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,313147,16309985,1713402495,,,,
3,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm,IT Services and IT Consulting,Cloud | Mobile | Cognitive | Security | Resear...,311223,16314846,1713501255,,,,
4,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare,Hospitals and Health Care,Healthcare | Biotechnology,56873,2185368,1712382540,,,,
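
The posting columns in the preview above are empty, even for IBM. One possible cause (an assumption, not verified against the data) is a key mismatch: if `company_id` in `postings` was parsed as a float, `astype(str)` yields `'1009.0'`, which never equals `'1009'` on the company side. The sketch below reproduces the effect on toy data and shows a hypothetical `normalize_id` helper that renders both keys the same way. The job titles are invented for illustration.

```python
import pandas as pd


def normalize_id(series: pd.Series) -> pd.Series:
    """Render IDs as integer-like strings; missing or non-numeric values become ''."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.map(lambda v: "" if pd.isna(v) else str(int(v)))


# Toy data: the postings key is float because of a missing value
postings = pd.DataFrame({
    "company_id": [1009.0, 1009.0, None],
    "title": ["Data Engineer", "ML Engineer", "Analyst"],
})
companies = pd.DataFrame({"company_id": [1009, 1016], "name": ["IBM", "GE HealthCare"]})

# Naive string cast: '1009.0' on one side, '1009' on the other
naive_overlap = set(postings["company_id"].astype(str)) & set(companies["company_id"].astype(str))

# Normalized keys line up
postings["company_id"] = normalize_id(postings["company_id"])
companies["company_id"] = normalize_id(companies["company_id"])
normalized_overlap = set(postings["company_id"]) & set(companies["company_id"])

print("naive overlap:", naive_overlap)
print("normalized overlap:", normalized_overlap)
```

A quick check of the overlap between the two key sets before merging would confirm or rule out this explanation for the real data.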


In [19]:
## 🔍 Data Quality Check - Duplicate Detection

"""
Checking for duplicates in all datasets based on primary keys.
This cell only REPORTS duplicates, does not modify data.
"""

print("=" * 80)
print("🔍 DUPLICATE DETECTION REPORT")
print("=" * 80)
print()

# Define primary keys for each dataset
duplicate_report = []

# 1. Candidates
print("┌─ 📊 resume_data.csv (Candidates)")
print(f"│  Primary Key: Resume_ID")
cand_total = len(candidates)
cand_unique = candidates['Resume_ID'].nunique() if 'Resume_ID' in candidates.columns else len(candidates)
cand_dups = cand_total - cand_unique
print(f"│  Total rows:     {cand_total:,}")
print(f"│  Unique rows:    {cand_unique:,}")
print(f"│  Duplicates:     {cand_dups:,}")
print(f"│  Status:         {'✅ CLEAN' if cand_dups == 0 else '🔴 HAS DUPLICATES'}")
print("└─\n")
duplicate_report.append(('Candidates', cand_total, cand_unique, cand_dups))

# 2. Companies Base
print("┌─ 📊 companies.csv (Companies Base)")
print(f"│  Primary Key: company_id")
comp_total = len(companies_base)
comp_unique = companies_base['company_id'].nunique()
comp_dups = comp_total - comp_unique
print(f"│  Total rows:     {comp_total:,}")
print(f"│  Unique rows:    {comp_unique:,}")
print(f"│  Duplicates:     {comp_dups:,}")
print(f"│  Status:         {'✅ CLEAN' if comp_dups == 0 else '🔴 HAS DUPLICATES'}")
if comp_dups > 0:
    dup_ids = companies_base[companies_base.duplicated('company_id', keep=False)]['company_id'].value_counts().head(3)
    print(f"│  Top duplicates:")
    for cid, count in dup_ids.items():
        print(f"│    - company_id={cid}: {count} times")
print("└─\n")
duplicate_report.append(('Companies Base', comp_total, comp_unique, comp_dups))

# 3. Company Industries
print("┌─ 📊 company_industries.csv")
print(f"│  Primary Key: company_id + industry")
ci_total = len(company_industries)
ci_unique = len(company_industries.drop_duplicates(subset=['company_id', 'industry']))
ci_dups = ci_total - ci_unique
print(f"│  Total rows:     {ci_total:,}")
print(f"│  Unique rows:    {ci_unique:,}")
print(f"│  Duplicates:     {ci_dups:,}")
print(f"│  Status:         {'✅ CLEAN' if ci_dups == 0 else '🔴 HAS DUPLICATES'}")
print("└─\n")
duplicate_report.append(('Company Industries', ci_total, ci_unique, ci_dups))

# 4. Company Specialties
print("┌─ 📊 company_specialities.csv")
print(f"│  Primary Key: company_id + speciality")
cs_total = len(company_specialties)
cs_unique = len(company_specialties.drop_duplicates(subset=['company_id', 'speciality']))
cs_dups = cs_total - cs_unique
print(f"│  Total rows:     {cs_total:,}")
print(f"│  Unique rows:    {cs_unique:,}")
print(f"│  Duplicates:     {cs_dups:,}")
print(f"│  Status:         {'✅ CLEAN' if cs_dups == 0 else '🔴 HAS DUPLICATES'}")
print("└─\n")
duplicate_report.append(('Company Specialties', cs_total, cs_unique, cs_dups))

# 5. Employee Counts
print("┌─ 📊 employee_counts.csv")
print(f"│  Primary Key: company_id")
ec_total = len(employee_counts)
ec_unique = employee_counts['company_id'].nunique()
ec_dups = ec_total - ec_unique
print(f"│  Total rows:     {ec_total:,}")
print(f"│  Unique rows:    {ec_unique:,}")
print(f"│  Duplicates:     {ec_dups:,}")
print(f"│  Status:         {'✅ CLEAN' if ec_dups == 0 else '🔴 HAS DUPLICATES'}")
print("└─\n")
duplicate_report.append(('Employee Counts', ec_total, ec_unique, ec_dups))

# 6. Postings
print("┌─ 📊 postings.csv (Job Postings)")
print(f"│  Primary Key: job_id")
if 'job_id' in postings.columns:
    post_total = len(postings)
    post_unique = postings['job_id'].nunique()
    post_dups = post_total - post_unique
else:
    post_total = len(postings)
    post_unique = len(postings.drop_duplicates())
    post_dups = post_total - post_unique
print(f"│  Total rows:     {post_total:,}")
print(f"│  Unique rows:    {post_unique:,}")
print(f"│  Duplicates:     {post_dups:,}")
print(f"│  Status:         {'✅ CLEAN' if post_dups == 0 else '🔴 HAS DUPLICATES'}")
print("└─\n")
duplicate_report.append(('Postings', post_total, post_unique, post_dups))

# 7. Companies Full (After Merge)
print("┌─ 📊 companies_full (After Enrichment)")
print(f"│  Primary Key: company_id")
cf_total = len(companies_full)
cf_unique = companies_full['company_id'].nunique()
cf_dups = cf_total - cf_unique
print(f"│  Total rows:     {cf_total:,}")
print(f"│  Unique rows:    {cf_unique:,}")
print(f"│  Duplicates:     {cf_dups:,}")
print(f"│  Status:         {'✅ CLEAN' if cf_dups == 0 else '🔴 HAS DUPLICATES'}")
if cf_dups > 0:
    dup_ids = companies_full[companies_full.duplicated('company_id', keep=False)]['company_id'].value_counts().head(5)
    print(f"│")
    print(f"│  Top duplicate company_ids:")
    for cid, count in dup_ids.items():
        comp_name = companies_full[companies_full['company_id'] == cid]['name'].iloc[0]
        print(f"│    - {cid} ({comp_name}): {count} times")
print("└─\n")
duplicate_report.append(('Companies Full', cf_total, cf_unique, cf_dups))

# Summary
print("=" * 80)
print("📊 SUMMARY")
print("=" * 80)
print()

total_dups = sum(r[3] for r in duplicate_report)
clean_datasets = sum(1 for r in duplicate_report if r[3] == 0)
dirty_datasets = len(duplicate_report) - clean_datasets

print(f"✅ Clean datasets:          {clean_datasets}/{len(duplicate_report)}")
print(f"🔴 Datasets with duplicates: {dirty_datasets}/{len(duplicate_report)}")
print(f"🗑️  Total duplicates found:  {total_dups:,} rows")
print()

if dirty_datasets > 0:
    print("⚠️  DUPLICATES DETECTED!")
else:
    print("✅ All datasets are clean! No duplicates found.")

print("=" * 80)

🔍 DUPLICATE DETECTION REPORT

┌─ 📊 resume_data.csv (Candidates)
│  Primary Key: Resume_ID
│  Total rows:     9,544
│  Unique rows:    9,544
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 companies.csv (Companies Base)
│  Primary Key: company_id
│  Total rows:     24,473
│  Unique rows:    24,473
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 company_industries.csv
│  Primary Key: company_id + industry
│  Total rows:     24,375
│  Unique rows:    24,375
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 company_specialities.csv
│  Primary Key: company_id + speciality
│  Total rows:     169,387
│  Unique rows:    169,387
│  Duplicates:     0
│  Status:         ✅ CLEAN
└─

┌─ 📊 employee_counts.csv
│  Primary Key: company_id
│  Total rows:     35,787
│  Unique rows:    24,473
│  Duplicates:     11,314
│  Status:         🔴 HAS DUPLICATES
└─

┌─ 📊 postings.csv (Job Postings)
│  Primary Key: job_id
│  Total rows:     123,849
│  Unique rows:    123,849
│  Duplicates: 

In [22]:
"""
## 🧹 Data Cleaning - Remove Duplicates

Based on the report above, removing duplicates from datasets.
"""

print("🧹 CLEANING DUPLICATES...\n")
print("=" * 80)

# Store original counts
original_counts = {}

# 1. Clean Companies Base (if needed)
if len(companies_base) != companies_base['company_id'].nunique():
    original_counts['companies_base'] = len(companies_base)
    companies_base = companies_base.drop_duplicates(subset=['company_id'], keep='first')
    removed = original_counts['companies_base'] - len(companies_base)
    print(f"✅ companies_base:")
    print(f"   Removed {removed:,} duplicates")
    print(f"   {original_counts['companies_base']:,} → {len(companies_base):,} rows\n")
else:
    print(f"✅ companies_base: Already clean\n")

# 2. Clean Company Industries (if needed)
if len(company_industries) != len(company_industries.drop_duplicates(subset=['company_id', 'industry'])):
    original_counts['company_industries'] = len(company_industries)
    company_industries = company_industries.drop_duplicates(subset=['company_id', 'industry'], keep='first')
    removed = original_counts['company_industries'] - len(company_industries)
    print(f"✅ company_industries:")
    print(f"   Removed {removed:,} duplicates")
    print(f"   {original_counts['company_industries']:,} → {len(company_industries):,} rows\n")
else:
    print(f"✅ company_industries: Already clean\n")

# 3. Clean Company Specialties (if needed)
if len(company_specialties) != len(company_specialties.drop_duplicates(subset=['company_id', 'speciality'])):
    original_counts['company_specialties'] = len(company_specialties)
    company_specialties = company_specialties.drop_duplicates(subset=['company_id', 'speciality'], keep='first')
    removed = original_counts['company_specialties'] - len(company_specialties)
    print(f"✅ company_specialties:")
    print(f"   Removed {removed:,} duplicates")
    print(f"   {original_counts['company_specialties']:,} → {len(company_specialties):,} rows\n")
else:
    print(f"✅ company_specialties: Already clean\n")

# 4. Clean Employee Counts (if needed)
if len(employee_counts) != employee_counts['company_id'].nunique():
    original_counts['employee_counts'] = len(employee_counts)
    employee_counts = employee_counts.drop_duplicates(subset=['company_id'], keep='first')
    removed = original_counts['employee_counts'] - len(employee_counts)
    print(f"✅ employee_counts:")
    print(f"   Removed {removed:,} duplicates")
    print(f"   {original_counts['employee_counts']:,} → {len(employee_counts):,} rows\n")
else:
    print(f"✅ employee_counts: Already clean\n")

# 5. Clean Postings (if needed)
if 'job_id' in postings.columns:
    if len(postings) != postings['job_id'].nunique():
        original_counts['postings'] = len(postings)
        postings = postings.drop_duplicates(subset=['job_id'], keep='first')
        removed = original_counts['postings'] - len(postings)
        print(f"✅ postings:")
        print(f"   Removed {removed:,} duplicates")
        print(f"   {original_counts['postings']:,} → {len(postings):,} rows\n")
    else:
        print(f"✅ postings: Already clean\n")

# 6. Clean Companies Full (if needed)
if len(companies_full) != companies_full['company_id'].nunique():
    original_counts['companies_full'] = len(companies_full)
    companies_full = companies_full.drop_duplicates(subset=['company_id'], keep='first')
    removed = original_counts['companies_full'] - len(companies_full)
    print(f"✅ companies_full:")
    print(f"   Removed {removed:,} duplicates")
    print(f"   {original_counts['companies_full']:,} → {len(companies_full):,} rows\n")
else:
    print(f"✅ companies_full: Already clean\n")

print("=" * 80)
print("✅ DATA CLEANING COMPLETE!")
print("=" * 80)
print()

# Summary
if original_counts:
    total_removed = sum(original_counts[k] - globals()[k].shape[0] if k in globals() else 0 
                       for k in original_counts.keys())
    print(f"📊 Total duplicates removed: {total_removed:,} rows")
    print()
    print("Cleaned datasets:")
    for dataset, original in original_counts.items():
        current = len(globals()[dataset]) if dataset in globals() else 0
        print(f"  - {dataset}: {original:,} → {current:,}")
else:
    print("✅ No duplicates found - all datasets were already clean!")

🧹 CLEANING DUPLICATES...

✅ companies_base: Already clean

✅ company_industries: Already clean

✅ company_specialties: Already clean

✅ employee_counts:
   Removed 11,314 duplicates
   35,787 → 24,473 rows

✅ postings: Already clean

✅ companies_full:
   Removed 11,314 duplicates
   35,787 → 24,473 rows

✅ DATA CLEANING COMPLETE!

📊 Total duplicates removed: 22,628 rows

Cleaned datasets:
  - employee_counts: 35,787 → 24,473
  - companies_full: 35,787 → 24,473


---
## 📊 Step 7: Load Embedding Model & Pre-computed Vectors

In [23]:
print("🧠 Loading embedding model...\n")
model = SentenceTransformer(Config.EMBEDDING_MODEL)
embedding_dim = model.get_sentence_embedding_dimension()
print(f"✅ Model loaded: {Config.EMBEDDING_MODEL}")
print(f"📐 Embedding dimension: ℝ^{embedding_dim}\n")

print("📂 Loading pre-computed embeddings...")

try:
    # Try to load from processed folder
    cand_vectors = np.load(f'{Config.PROCESSED_PATH}candidate_embeddings.npy')
    comp_vectors = np.load(f'{Config.PROCESSED_PATH}company_embeddings.npy')
    
    print(f"✅ Loaded from {Config.PROCESSED_PATH}")
    print(f"📊 Candidate vectors: {cand_vectors.shape}")
    print(f"📊 Company vectors: {comp_vectors.shape}\n")
    
except FileNotFoundError:
    print("⚠️  Pre-computed embeddings not found!")
    print("   Embeddings will need to be generated (takes ~5-10 minutes)")
    print("   This is normal if running for the first time.\n")
    
    # You can add embedding generation code here if needed
    # For now, we'll skip to keep notebook clean
    cand_vectors = None
    comp_vectors = None

🧠 Loading embedding model...

✅ Model loaded: all-MiniLM-L6-v2
📐 Embedding dimension: ℝ^384

📂 Loading pre-computed embeddings...
✅ Loaded from ../processed/
📊 Candidate vectors: (9544, 384)
📊 Company vectors: (35787, 384)



---
## 📊 Step 8: Core Matching Function

In [24]:
# ============================================================================
# CORE MATCHING FUNCTION (SAFE VERSION)
# ============================================================================

def find_top_matches(candidate_idx: int, top_k: int = 10) -> list:
    """
    Find top K company matches for a candidate.
    
    SAFE VERSION: Handles index mismatches between embeddings and dataset
    
    Args:
        candidate_idx: Index of candidate in candidates DataFrame
        top_k: Number of top matches to return
    
    Returns:
        List of tuples: [(company_idx, similarity_score), ...]
    """
    
    # Validate candidate index
    if candidate_idx >= len(cand_vectors):
        print(f"❌ Candidate index {candidate_idx} out of range")
        return []
    
    # Get candidate vector
    cand_vec = cand_vectors[candidate_idx].reshape(1, -1)
    
    # Calculate similarities with all company vectors
    similarities = cosine_similarity(cand_vec, comp_vectors)[0]
    
    # CRITICAL FIX: Only use indices that exist in companies_full
    max_valid_idx = len(companies_full) - 1
    
    # Truncate similarities to valid range
    valid_similarities = similarities[:max_valid_idx + 1]
    
    # Get top K indices from valid range
    top_indices = np.argsort(valid_similarities)[::-1][:top_k]
    
    # Return (index, score) tuples
    results = [(int(idx), float(valid_similarities[idx])) for idx in top_indices]
    
    return results

# Test function and show diagnostics
print("✅ Safe matching function loaded!")
print(f"\n📊 DIAGNOSTICS:")
print(f"   Candidate vectors: {len(cand_vectors):,}")
print(f"   Company vectors: {len(comp_vectors):,}")
print(f"   Companies dataset: {len(companies_full):,}")

if len(comp_vectors) > len(companies_full):
    print(f"\n⚠️  INDEX MISMATCH DETECTED!")
    print(f"   Embeddings: {len(comp_vectors):,}")
    print(f"   Dataset: {len(companies_full):,}")
    print(f"   Missing rows: {len(comp_vectors) - len(companies_full):,}")
    print(f"\n💡 CAUSE: Embeddings generated BEFORE deduplication")
    print(f"\n🎯 SOLUTIONS:")
    print(f"   A. Safe functions active (current) ✅")
    print(f"   B. Regenerate embeddings after dedup")
    print(f"   C. Run collaborative filtering step")
else:
    print(f"\n✅ Embeddings and dataset are aligned!")

✅ Matching function ready
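
The diagnostics above attribute the mismatch to embeddings that were generated before deduplication. Truncating the similarity array avoids out-of-range indices, but it may still pair a vector with the wrong row if duplicates were dropped from the middle of the table. A minimal sketch of solution B, assuming the ids and the embedding matrix start out in the same order: apply the same keep-positions to both. The helper name `dedup_aligned` and the toy data are hypothetical and not part of this notebook.

```python
import numpy as np

def dedup_aligned(ids, vectors):
    """Drop duplicate ids (keeping the first occurrence) and slice the
    embedding matrix with the same positions so both stay aligned."""
    vectors = np.asarray(vectors)
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    seen = set()
    keep = []
    for pos, item_id in enumerate(ids):
        if item_id not in seen:
            seen.add(item_id)
            keep.append(pos)
    kept_ids = [ids[pos] for pos in keep]
    return kept_ids, vectors[keep]

# Toy data: company "b" appears twice.
ids = ["a", "b", "b", "c"]
vectors = np.arange(8, dtype=float).reshape(4, 2)
kept_ids, kept_vectors = dedup_aligned(ids, vectors)
print(kept_ids, kept_vectors.shape)
```

After this step, row `i` of the vector matrix again describes item `i` of the id list, so positional lookups stay valid.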


---
## 📊 Step 9: Initialize FREE LLM (Hugging Face)

### Get your FREE token: https://huggingface.co/settings/tokens

In [25]:
# Initialize Hugging Face Inference Client (FREE)
if Config.HF_TOKEN:
    try:
        hf_client = InferenceClient(token=Config.HF_TOKEN)
        print("✅ Hugging Face client initialized (FREE)")
        print(f"🤖 Model: {Config.LLM_MODEL}")
        print("💰 Cost: $0.00 (completely free!)\n")
        LLM_AVAILABLE = True
    except Exception as e:
        print(f"⚠️  Failed to initialize HF client: {e}")
        LLM_AVAILABLE = False
        hf_client = None  # keep the name defined even when initialization fails
else:
    print("⚠️  No Hugging Face token configured")
    print("   LLM features will be disabled")
    print("\n📝 To enable:")
    print("   1. Go to: https://huggingface.co/settings/tokens")
    print("   2. Create a token (free)")
    print("   3. Set: Config.HF_TOKEN = 'your-token-here'\n")
    LLM_AVAILABLE = False
    hf_client = None

def call_llm(prompt: str, max_tokens: int = 1000) -> str:
    """
    Generic LLM call using Hugging Face Inference API (FREE).
    """
    if not LLM_AVAILABLE:
        return "[LLM not available - check .env file for HF_TOKEN]"
    
    try:
        response = hf_client.chat_completion(  # ✅ chat_completion
            messages=[{"role": "user", "content": prompt}],
            model=Config.LLM_MODEL,
            max_tokens=max_tokens,
            temperature=0.7
        )
        return response.choices[0].message.content  # ✅ Extract the message content
    except Exception as e:
        return f"[Error: {str(e)}]"

print("✅ LLM helper functions ready")

✅ Hugging Face client initialized (FREE)
🤖 Model: meta-llama/Llama-3.2-3B-Instruct
💰 Cost: $0.00 (completely free!)

✅ LLM helper functions ready
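
`call_llm` reports problems by returning a bracketed string instead of raising, so any JSON parsing that follows can only fail with a generic message. A minimal sketch of a guard that callers could apply first; `is_llm_error` is a hypothetical helper, and the two prefixes are taken from the return values in the cell above.

```python
def is_llm_error(response: str) -> bool:
    """Return True when the text looks like one of call_llm's failure strings."""
    text = response.strip()
    return text.startswith("[Error:") or text.startswith("[LLM not available")

# Made-up inputs for illustration only.
print(is_llm_error("[Error: example failure]"))
print(is_llm_error('{"level": "Mid"}'))
```

Checking this before parsing makes it possible to surface the underlying error text rather than a parse failure.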


---
## 📊 Step 10: Pydantic Schemas for Structured Output

In [26]:
class JobLevelClassification(BaseModel):
    """Job level classification result"""
    level: Literal['Entry', 'Mid', 'Senior', 'Executive']
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

class SkillsTaxonomy(BaseModel):
    """Structured skills extraction"""
    technical_skills: List[str] = Field(default_factory=list)
    soft_skills: List[str] = Field(default_factory=list)
    certifications: List[str] = Field(default_factory=list)
    languages: List[str] = Field(default_factory=list)

class MatchExplanation(BaseModel):
    """Match reasoning"""
    overall_score: float = Field(ge=0.0, le=1.0)
    match_strengths: List[str]
    skill_gaps: List[str]
    recommendation: str
    fit_summary: str = Field(max_length=200)

print("✅ Pydantic schemas defined")

✅ Pydantic schemas defined
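
To illustrate what these schemas enforce, here is a small sketch that validates two dictionaries against a standalone copy of `JobLevelClassification`. The wrapper `validate_classification` is a hypothetical helper. Note that a dictionary with `"level": "Unknown"` is rejected, because the `Literal` only allows the four named levels.

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class JobLevelClassification(BaseModel):
    """Standalone copy of the schema defined above."""
    level: Literal['Entry', 'Mid', 'Senior', 'Executive']
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

def validate_classification(data: dict):
    """Return a validated model, or None if the dict violates the schema."""
    try:
        return JobLevelClassification(**data)
    except ValidationError:
        return None

good = validate_classification({"level": "Mid", "confidence": 0.8, "reasoning": "3-5 years"})
bad = validate_classification({"level": "Unknown", "confidence": 0.0, "reasoning": "n/a"})
print(good)
print(bad)
```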


---
## 📊 Step 11: Job Level Classification (Zero-Shot)

In [27]:
def classify_job_level_zero_shot(job_description: str) -> Dict:
    """
    Zero-shot job level classification.
    
    Returns classification as: Entry, Mid, Senior, or Executive
    """
    
    prompt = f"""Classify this job posting into ONE seniority level.

Levels:
- Entry: 0-2 years experience, junior roles
- Mid: 3-5 years experience, independent work
- Senior: 6-10 years experience, technical leadership
- Executive: 10+ years, strategic leadership, C-level

Job Posting:
{job_description[:500]}

Return ONLY valid JSON:
{{
    "level": "Entry|Mid|Senior|Executive",
    "confidence": 0.85,
    "reasoning": "Brief explanation"
}}
"""
    
    response = call_llm(prompt)
    
    try:
        # Extract JSON
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        elif '```' in json_str:
            json_str = json_str.split('```')[1].split('```')[0].strip()
        
        # Find JSON in response
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        result = json.loads(json_str)
        return result
    except Exception:  # malformed or missing JSON in the LLM response
        return {
            "level": "Unknown",
            "confidence": 0.0,
            "reasoning": "Failed to parse response"
        }

# Test if LLM available and data loaded
if LLM_AVAILABLE and len(postings) > 0:
    print("🧪 Testing zero-shot classification...\n")
    sample = postings.iloc[0]['description']
    result = classify_job_level_zero_shot(sample)
    
    print("📊 Classification Result:")
    print(json.dumps(result, indent=2))
else:
    print("⚠️  Skipped - LLM not available or no data")

🧪 Testing zero-shot classification...

📊 Classification Result:
{
  "level": "Unknown",
  "confidence": 0.0,
  "reasoning": "Failed to parse response"
}
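
The result above is the fallback dictionary, which means the reply could not be parsed as JSON (this also happens when `call_llm` returns one of its error strings). The slicing logic takes everything between the first `{` and the last `}`, which breaks if the model adds braces in surrounding prose. A sketch of a more tolerant alternative using only the standard library: try to decode a JSON object starting at each `{` and return the first that parses. `extract_json_object` is a hypothetical helper, not part of this notebook.

```python
import json

def extract_json_object(text: str):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None

# Made-up reply with a code fence and trailing prose that contains braces.
reply = 'Sure! ```json\n{"level": "Mid", "confidence": 0.8}\n``` Hope this {helps}.'
print(extract_json_object(reply))
print(extract_json_object("no json here"))
```

Because the same extraction is repeated in several cells, a single helper like this could replace each copy.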


---
## 📊 Step 12: Few-Shot Learning

In [28]:
def classify_job_level_few_shot(job_description: str) -> Dict:
    """
    Few-shot classification with examples.
    """
    
    prompt = f"""Classify this job posting using examples.

EXAMPLES:

Example 1 (Entry):
"Recent graduate wanted. Python basics. Mentorship provided."
→ Entry level (learning focus, 0-2 years)

Example 2 (Senior):
"7+ years backend. Lead team of 3. System architecture."
→ Senior level (technical leadership, 6-10 years)

Example 3 (Executive):
"CTO position. 15+ years. Define technical strategy."
→ Executive level (C-level, strategic)

NOW CLASSIFY:
{job_description[:500]}

Return JSON:
{{
    "level": "Entry|Mid|Senior|Executive",
    "confidence": 0.0-1.0,
    "reasoning": "Explain"
}}
"""
    
    response = call_llm(prompt)
    
    try:
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        result = json.loads(json_str)
        return result
    except Exception:  # malformed or missing JSON in the LLM response
        return {"level": "Unknown", "confidence": 0.0, "reasoning": "Parse error"}

# Compare zero-shot vs few-shot
if LLM_AVAILABLE and len(postings) > 0:
    print("🧪 Comparing Zero-Shot vs Few-Shot...\n")
    sample = postings.iloc[0]['description']
    
    zero = classify_job_level_zero_shot(sample)
    few = classify_job_level_few_shot(sample)
    
    print("📊 Comparison:")
    print(f"Zero-shot: {zero['level']} (confidence: {zero['confidence']:.2f})")
    print(f"Few-shot:  {few['level']} (confidence: {few['confidence']:.2f})")
else:
    print("⚠️  Skipped")

🧪 Comparing Zero-Shot vs Few-Shot...

📊 Comparison:
Zero-shot: Unknown (confidence: 0.00)
Few-shot:  Unknown (confidence: 0.00)
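
A single posting says little about how the two prompting styles differ. As a rough sketch, the comparison could be run over a batch of postings and summarised as an agreement rate plus label counts. `compare_classifiers` and the stub classifiers below are hypothetical stand-ins; in the notebook the two arguments would be `classify_job_level_zero_shot` and `classify_job_level_few_shot`.

```python
from collections import Counter

def compare_classifiers(texts, classify_a, classify_b):
    """Run two classifiers over the same texts and summarise how often they agree."""
    labels_a = [classify_a(t)["level"] for t in texts]
    labels_b = [classify_b(t)["level"] for t in texts]
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return {
        "agreement": agree / len(texts) if texts else 0.0,
        "counts_a": Counter(labels_a),
        "counts_b": Counter(labels_b),
    }

# Stub classifiers stand in for the LLM-backed functions.
def stub_a(text):
    return {"level": "Senior" if "lead" in text.lower() else "Entry"}

def stub_b(text):
    return {"level": "Senior" if "years" in text.lower() else "Entry"}

texts = ["Lead engineer, 8 years", "Graduate trainee", "Team lead wanted"]
summary = compare_classifiers(texts, stub_a, stub_b)
print(summary["agreement"])
```

A high share of `Unknown` labels in either count would point to parsing or API problems rather than a real difference between the prompts.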


---
## 📊 Step 13: Structured Skills Extraction

In [29]:
def extract_skills_taxonomy(job_description: str) -> Dict:
    """
    Extract structured skills using LLM + Pydantic validation.
    """
    
    prompt = f"""Extract skills from this job posting.

Job Posting:
{job_description[:800]}

Return ONLY valid JSON:
{{
    "technical_skills": ["Python", "Docker", "AWS"],
    "soft_skills": ["Communication", "Leadership"],
    "certifications": ["AWS Certified"],
    "languages": ["English", "Danish"]
}}
"""
    
    response = call_llm(prompt, max_tokens=800)
    
    try:
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        data = json.loads(json_str)
        # Validate with Pydantic
        validated = SkillsTaxonomy(**data)
        return validated.model_dump()
    except Exception:  # malformed JSON or a Pydantic validation failure
        return {
            "technical_skills": [],
            "soft_skills": [],
            "certifications": [],
            "languages": []
        }

# Test extraction
if LLM_AVAILABLE and len(postings) > 0:
    print("🔍 Testing skills extraction...\n")
    sample = postings.iloc[0]['description']
    skills = extract_skills_taxonomy(sample)
    
    print("📊 Extracted Skills:")
    print(json.dumps(skills, indent=2))
else:
    print("⚠️  Skipped")

🔍 Testing skills extraction...

📊 Extracted Skills:
{
  "technical_skills": [
    "Adobe Creative Cloud (Indesign, Illustrator, Photoshop)",
    "Microsoft Office Suite"
  ],
  "soft_skills": [
    "Communication",
    "Leadership"
  ],
  "certifications": [],
  "languages": [
    "English",
    "Danish"
  ]
}
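
In the output above, "Communication", "Leadership", "English" and "Danish" are the same values used as placeholders in the prompt's JSON template, so it may be worth checking whether the model copied them instead of reading them from the posting. A rough sketch of such a check, assuming a plain case-insensitive substring match is good enough; `keep_grounded` and the sample posting are hypothetical.

```python
def keep_grounded(skills, source_text):
    """Keep only the skills whose text occurs (case-insensitively) in the source."""
    haystack = source_text.lower()
    return [s for s in skills if s.lower() in haystack]

# Made-up posting and extraction result for illustration.
posting = "We need a designer fluent in English with strong communication skills."
extracted = ["Communication", "Leadership", "English", "Danish"]
print(keep_grounded(extracted, posting))
```

A substring match will miss paraphrased skills, so this is better suited to flagging suspicious items for review than to filtering automatically.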


---
## 📊 Step 14: Match Explainability

In [30]:
def explain_match(candidate_idx: int, company_idx: int, similarity_score: float) -> Dict:
    """
    Generate LLM explanation for why candidate matches company.
    """
    
    cand = candidates.iloc[candidate_idx]
    comp = companies_full.iloc[company_idx]
    
    cand_skills = str(cand.get('skills', 'N/A'))[:300]
    cand_exp = str(cand.get('positions', 'N/A'))[:300]
    comp_req = str(comp.get('required_skills', 'N/A'))[:300]
    comp_name = comp.get('name', 'Unknown')
    
    prompt = f"""Explain why this candidate matches this company.

Candidate:
Skills: {cand_skills}
Experience: {cand_exp}

Company: {comp_name}
Requirements: {comp_req}

Similarity Score: {similarity_score:.2f}

Return JSON:
{{
    "overall_score": {similarity_score},
    "match_strengths": ["Top 3-5 matching factors"],
    "skill_gaps": ["Missing skills"],
    "recommendation": "What candidate should do",
    "fit_summary": "One sentence summary"
}}
"""
    
    response = call_llm(prompt, max_tokens=1000)
    
    try:
        json_str = response.strip()
        if '```json' in json_str:
            json_str = json_str.split('```json')[1].split('```')[0].strip()
        
        if '{' in json_str and '}' in json_str:
            start = json_str.index('{')
            end = json_str.rindex('}') + 1
            json_str = json_str[start:end]
        
        data = json.loads(json_str)
        return data
    except Exception:  # malformed or missing JSON in the LLM response
        return {
            "overall_score": similarity_score,
            "match_strengths": ["Unable to generate"],
            "skill_gaps": [],
            "recommendation": "Review manually",
            "fit_summary": f"Match score: {similarity_score:.2f}"
        }

# Test explainability
if LLM_AVAILABLE and cand_vectors is not None and len(candidates) > 0:
    print("💡 Testing match explainability...\n")
    matches = find_top_matches(0, top_k=1)
    if matches:
        comp_idx, score = matches[0]
        explanation = explain_match(0, comp_idx, score)
        
        print("📊 Match Explanation:")
        print(json.dumps(explanation, indent=2))
else:
    print("⚠️  Skipped - requirements not met")

💡 Testing match explainability...

📊 Match Explanation:
{
  "overall_score": 0.7028058171272278,
  "match_strengths": [
    "Big Data",
    "Machine Learning",
    "Cloud",
    "Data Science",
    "Data Structures"
  ],
  "skill_gaps": [
    "TeachTown-specific skills"
  ],
  "recommendation": "Encourage the candidate to learn TeachTown-specific skills",
  "fit_summary": "The candidate has a strong background in big data, machine learning, and cloud technologies, but may need to learn TeachTown-specific skills to fully align with the company's needs."
}
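
`explain_match` returns the parsed dictionary as-is, so nothing guarantees that it fits the `MatchExplanation` schema from Step 10 (score within 0 to 1, summary of at most 200 characters, all keys present). A minimal sketch of a clean-up step that could run before validation; `coerce_explanation` is a hypothetical helper and the defaults mirror the fallback values used above.

```python
def coerce_explanation(data: dict, similarity_score: float, max_summary: int = 200) -> dict:
    """Fill missing keys, clamp the score to [0, 1] and cap the summary length."""
    try:
        score = float(data.get("overall_score", similarity_score))
    except (TypeError, ValueError):
        score = float(similarity_score)
    return {
        "overall_score": min(1.0, max(0.0, score)),
        "match_strengths": list(data.get("match_strengths") or []),
        "skill_gaps": list(data.get("skill_gaps") or []),
        "recommendation": str(data.get("recommendation", "Review manually")),
        "fit_summary": str(data.get("fit_summary", ""))[:max_summary],
    }

# Made-up LLM output with an out-of-range score and an overlong summary.
raw = {"overall_score": "1.2", "match_strengths": ["Python"], "fit_summary": "x" * 250}
clean = coerce_explanation(raw, similarity_score=0.7)
print(clean["overall_score"], len(clean["fit_summary"]))
```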


---
## 📊 Step 16: Detailed Match Visualization

In [None]:
# ============================================================================
# 🔍 DETAILED MATCH EXAMPLE
# ============================================================================

def show_detailed_match_example(candidate_idx=0, top_k=5):
    print("🔍 DETAILED MATCH ANALYSIS")
    print("=" * 100)
    
    if candidate_idx >= len(candidates):
        print(f"❌ ERROR: Candidate {candidate_idx} out of range")
        return None
    
    cand = candidates.iloc[candidate_idx]
    
    print(f"\n🎯 CANDIDATE #{candidate_idx}")
    print(f"Resume ID: {cand.get('Resume_ID', 'N/A')}")
    print(f"Category: {cand.get('Category', 'N/A')}")
    print(f"Skills: {str(cand.get('skills', 'N/A'))[:150]}...\n")
    
    matches = find_top_matches(candidate_idx, top_k=top_k)
    
    print(f"🔗 TOP {len(matches)} MATCHES:\n")
    
    for rank, (comp_idx, score) in enumerate(matches, 1):
        if comp_idx >= len(companies_full):
            continue
        
        company = companies_full.iloc[comp_idx]
        print(f"#{rank}. {company.get('name', 'N/A')} (Score: {score:.4f})")
        print(f"    Industries: {str(company.get('industries_list', 'N/A'))[:60]}...")
    
    print("\n" + "=" * 100)
    return matches

# Test
show_detailed_match_example(candidate_idx=0, top_k=5)

---
## 📊 Step 17: Bridging Concept Analysis

In [None]:
# ============================================================================
# 🌉 BRIDGING CONCEPT ANALYSIS
# ============================================================================

def show_bridging_concept_analysis():
    print("🌉 THE BRIDGING CONCEPT")
    print("=" * 90)
    
    companies_with = companies_full[companies_full['required_skills'] != '']
    companies_without = companies_full[companies_full['required_skills'] == '']
    
    print(f"\n📊 DATA REALITY:")
    print(f"   Total companies: {len(companies_full):,}")
    print(f"   WITH postings: {len(companies_with):,} ({len(companies_with)/len(companies_full)*100:.1f}%)")
    print(f"   WITHOUT postings: {len(companies_without):,}\n")
    
    print("🎯 THE PROBLEM:")
    print("   Companies: 'We are in TECH INDUSTRY'")
    print("   Candidates: 'I know PYTHON, AWS'")
    print("   → Different languages! 🚫\n")
    
    print("🌉 THE SOLUTION (BRIDGING):")
    print("   1. Extract from postings: 'Need PYTHON developers'")
    print("   2. Enrich company profile with skills")
    print("   3. Now both speak SKILLS LANGUAGE! ✅\n")
    
    print("=" * 90)
    return companies_with, companies_without

# Test
show_bridging_concept_analysis()
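
The three bridging steps printed above can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's enrichment code: `enrich_companies`, the `industry` and `skills` fields, and the `profile_text` format are hypothetical, while `company_id` and `required_skills` follow the column names used elsewhere in this notebook.

```python
from collections import defaultdict

def enrich_companies(companies, postings):
    """Attach the skills mentioned in each company's postings to its profile text."""
    skills_by_company = defaultdict(list)
    for posting in postings:
        for skill in posting["skills"]:
            if skill not in skills_by_company[posting["company_id"]]:
                skills_by_company[posting["company_id"]].append(skill)

    enriched = []
    for company in companies:
        skills = skills_by_company.get(company["company_id"], [])
        text = f"Industry: {company['industry']}"
        if skills:
            text += f". Required skills: {', '.join(skills)}"
        enriched.append({**company, "required_skills": ", ".join(skills), "profile_text": text})
    return enriched

companies = [{"company_id": 1, "industry": "Tech"}, {"company_id": 2, "industry": "Retail"}]
postings = [
    {"company_id": 1, "skills": ["Python", "AWS"]},
    {"company_id": 1, "skills": ["Python", "Docker"]},
]
enriched = enrich_companies(companies, postings)
for row in enriched:
    print(row["profile_text"])
```

Companies without postings keep an empty `required_skills` string, which is the condition the analysis above uses to split the two groups.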

---
## 📊 Step 18: Export Results to CSV

In [None]:
# ============================================================================
# 💾 EXPORT MATCHES TO CSV
# ============================================================================

def export_matches_to_csv(num_candidates=100, top_k=10):
    print(f"💾 Exporting {num_candidates} candidates (top {top_k} each)...\n")
    
    results = []
    
    n_export = min(num_candidates, len(candidates))
    for i in range(n_export):
        if i % 50 == 0:
            print(f"   Processing {i+1}/{n_export}...")
        
        matches = find_top_matches(i, top_k=top_k)
        cand = candidates.iloc[i]
        
        for rank, (comp_idx, score) in enumerate(matches, 1):
            if comp_idx >= len(companies_full):
                continue
            
            company = companies_full.iloc[comp_idx]
            
            results.append({
                'candidate_id': i,
                'candidate_category': cand.get('Category', 'N/A'),
                'company_id': company.get('company_id', 'N/A'),
                'company_name': company.get('name', 'N/A'),
                'match_rank': rank,
                'similarity_score': round(float(score), 4)
            })
    
    results_df = pd.DataFrame(results)
    os.makedirs(Config.RESULTS_PATH, exist_ok=True)  # make sure the output folder exists
    output_file = f'{Config.RESULTS_PATH}hrhub_matches.csv'
    results_df.to_csv(output_file, index=False)
    
    print(f"\n✅ Exported {len(results_df):,} matches")
    print(f"📄 File: {output_file}\n")
    
    return results_df

# Export sample
matches_df = export_matches_to_csv(num_candidates=50, top_k=5)

---
## 📊 Interactive Visualization 1: t-SNE Vector Space

Project embeddings from ℝ³⁸⁴ → ℝ² to visualize candidates and companies

In [None]:
# ============================================================================
# 🎨 T-SNE VECTOR SPACE VISUALIZATION
# ============================================================================

from sklearn.manifold import TSNE

print("🎨 VECTOR SPACE VISUALIZATION\n")
print("=" * 70)

# Sample for visualization
n_cand_viz = min(500, len(candidates))
n_comp_viz = min(2000, len(companies_full))

print(f"📊 Visualizing:")
print(f"   • {n_cand_viz} candidates")
print(f"   • {n_comp_viz} companies")
print(f"   • From ℝ^384 → ℝ² (t-SNE)\n")

# Sample vectors
cand_sample = cand_vectors[:n_cand_viz]
comp_sample = comp_vectors[:n_comp_viz]
all_vectors = np.vstack([cand_sample, comp_sample])

print("🔄 Running t-SNE (2-3 minutes)...")
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    n_iter=1000  # renamed to max_iter in scikit-learn 1.5+; switch if this warns or fails
)

vectors_2d = tsne.fit_transform(all_vectors)
cand_2d = vectors_2d[:n_cand_viz]
comp_2d = vectors_2d[n_cand_viz:]

print("\n✅ t-SNE complete!")

In [None]:
# Create interactive plot
fig = go.Figure()

# Companies (red)
fig.add_trace(go.Scatter(
    x=comp_2d[:, 0],
    y=comp_2d[:, 1],
    mode='markers',
    name='Companies',
    marker=dict(size=6, color='#ff6b6b', opacity=0.6),
    text=[f"Company: {str(companies_full.iloc[i].get('name', 'N/A'))[:30]}" 
          for i in range(n_comp_viz)],
    hovertemplate='<b>%{text}</b><extra></extra>'
))

# Candidates (green)
fig.add_trace(go.Scatter(
    x=cand_2d[:, 0],
    y=cand_2d[:, 1],
    mode='markers',
    name='Candidates',
    marker=dict(
        size=10,
        color='#00ff00',
        opacity=0.8,
        line=dict(width=1, color='white')
    ),
    text=[f"Candidate {i}" for i in range(n_cand_viz)],
    hovertemplate='<b>%{text}</b><extra></extra>'
))

fig.update_layout(
    title='Vector Space: Candidates & Companies (Enriched with Postings)',
    xaxis_title='Dimension 1',
    yaxis_title='Dimension 2',
    width=1200,
    height=800,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white')
)

fig.show()

print("\n✅ Visualization complete!")
print("💡 If green & red OVERLAP → Alignment worked!")
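
Visual overlap is hard to judge by eye, so a number can help. One possible check, sketched here under assumptions, is the share of candidate points whose nearest neighbour is a company rather than another candidate: values near 0 suggest two separate clusters, higher values suggest mixing. `cross_group_nn_rate` is a hypothetical helper; it could be applied to `cand_2d` and `comp_2d`, or to a sample of the original embeddings, since t-SNE distances are only a rough guide.

```python
import numpy as np

def cross_group_nn_rate(group_a, group_b):
    """Fraction of points in group_a whose nearest neighbour (by Euclidean
    distance, excluding the point itself) belongs to group_b."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Pairwise distances from every point in a to every point in a and in b.
    d_aa = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
    d_ab = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    np.fill_diagonal(d_aa, np.inf)  # a point is not its own neighbour
    return float(np.mean(d_ab.min(axis=1) < d_aa.min(axis=1)))

# Toy 2-D points: the groups are interleaved in the first case, separated in the second.
mixed = cross_group_nn_rate([[0, 0], [2, 0]], [[0.5, 0], [2.5, 0]])
apart = cross_group_nn_rate([[0, 0], [1, 0]], [[10, 0], [11, 0]])
print(mixed, apart)
```

The pairwise distance arrays grow with the product of the group sizes, so this is meant for samples of the size plotted here, not for the full dataset.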

---
## 📊 Interactive Visualization 2: Highlighted Match Network

Show candidate and their top matches with connection lines

In [None]:
# ============================================================================
# 🔍 HIGHLIGHTED MATCH NETWORK
# ============================================================================

target_candidate = 0

print(f"🔍 Analyzing Candidate #{target_candidate}...\n")

matches = find_top_matches(target_candidate, top_k=10)
match_indices = [comp_idx for comp_idx, score in matches if comp_idx < n_comp_viz]

# Create highlighted plot
fig2 = go.Figure()

# All companies (background)
fig2.add_trace(go.Scatter(
    x=comp_2d[:, 0],
    y=comp_2d[:, 1],
    mode='markers',
    name='All Companies',
    marker=dict(size=4, color='#ff6b6b', opacity=0.3),
    showlegend=True
))

# Top matches (highlighted)
if match_indices:
    match_positions = comp_2d[match_indices]
    fig2.add_trace(go.Scatter(
        x=match_positions[:, 0],
        y=match_positions[:, 1],
        mode='markers',
        name='Top Matches',
        marker=dict(
            size=15,
            color='#ff0000',
            line=dict(width=2, color='white')
        ),
        text=[f"Match #{rank}: {str(companies_full.iloc[idx].get('name', 'N/A'))[:30]}<br>Score: {score:.3f}" 
              for rank, (idx, score) in enumerate(matches, 1) if idx < n_comp_viz],
        hovertemplate='<b>%{text}</b><extra></extra>'
    ))

# Target candidate (star)
fig2.add_trace(go.Scatter(
    x=[cand_2d[target_candidate, 0]],
    y=[cand_2d[target_candidate, 1]],
    mode='markers',
    name=f'Candidate #{target_candidate}',
    marker=dict(
        size=25,
        color='#00ff00',
        symbol='star',
        line=dict(width=3, color='white')
    )
))

# Connection lines (top 5)
for i, match_idx in enumerate(match_indices[:5]):
    fig2.add_trace(go.Scatter(
        x=[cand_2d[target_candidate, 0], comp_2d[match_idx, 0]],
        y=[cand_2d[target_candidate, 1], comp_2d[match_idx, 1]],
        mode='lines',
        line=dict(color='yellow', width=1, dash='dot'),
        opacity=0.5,
        showlegend=False
    ))

fig2.update_layout(
    title=f'Candidate #{target_candidate} and Top Matches',
    xaxis_title='Dimension 1',
    yaxis_title='Dimension 2',
    width=1200,
    height=800,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white')
)

fig2.show()

print("\n✅ Highlighted visualization created!")
print(f"   ⭐ Green star = Candidate #{target_candidate}")
print(f"   🔴 Red dots = Top matches")
print(f"   💛 Yellow lines = Connections")

---
## 🌐 Interactive Visualization 3: Network Graph (PyVis)

Interactive network showing candidate-company connections with nodes & edges

In [None]:
# ============================================================================
# 🌐 NETWORK GRAPH WITH PYVIS
# ============================================================================

from pyvis.network import Network
import webbrowser
import os

print("🌐 Creating interactive network graph...\n")

target_candidate = 0
top_k_network = 10

# Get matches
matches = find_top_matches(target_candidate, top_k=top_k_network)

# Create network
net = Network(
    height='800px',
    width='100%',
    bgcolor='#1a1a1a',
    font_color='white',
    directed=False
)

# Configure physics
net.barnes_hut(
    gravity=-5000,
    central_gravity=0.3,
    spring_length=100,
    spring_strength=0.01
)

# Add candidate node (center)
cand = candidates.iloc[target_candidate]
cand_label = f"Candidate #{target_candidate}"
net.add_node(
    f'cand_{target_candidate}',
    label=cand_label,
    title=f"{cand.get('Category', 'N/A')}<br>Skills: {str(cand.get('skills', 'N/A'))[:100]}",
    color='#00ff00',
    size=40,
    shape='star'
)

# Add company nodes + edges
for rank, (comp_idx, score) in enumerate(matches, 1):
    if comp_idx >= len(companies_full):
        continue
    
    company = companies_full.iloc[comp_idx]
    comp_name = str(company.get('name', f'Company {comp_idx}'))[:30]
    
    # Color by score
    if score > 0.7:
        color = '#ff0000'  # Red (strong match)
    elif score > 0.5:
        color = '#ff6b6b'  # Light red (good match)
    else:
        color = '#ffaaaa'  # Pink (weak match)
    
    # Add company node
    net.add_node(
        f'comp_{comp_idx}',
        label=f"#{rank}. {comp_name}",
        title=f"Score: {score:.3f}<br>Industries: {str(company.get('industries_list', 'N/A'))[:50]}<br>Required: {str(company.get('required_skills', 'N/A'))[:100]}",
        color=color,
        size=20 + (score * 20)  # Size by score
    )
    
    # Add edge
    net.add_edge(
        f'cand_{target_candidate}',
        f'comp_{comp_idx}',
        value=float(score),
        title=f"Similarity: {score:.3f}",
        color='yellow'
    )

# Save
output_file = f'{Config.RESULTS_PATH}network_graph.html'
net.save_graph(output_file)

print(f"✅ Network graph created!")
print(f"📄 Saved: {output_file}")
print(f"\n💡 LEGEND:")
print(f"   ⭐ Green star = Candidate #{target_candidate}")
print(f"   🔴 Red nodes = Companies (size = match score)")
print(f"   💛 Yellow edges = Connections")
print(f"\nℹ️  Hover over nodes to see details")
print(f"   Drag nodes to rearrange")
print(f"   Zoom with mouse wheel\n")

# Display in notebook
from IPython.display import IFrame
IFrame(output_file, width=1000, height=800)

### 📊 Network Node Data

Detailed information about nodes and connections

In [None]:
# ============================================================================
# DISPLAY NODE DATA
# ============================================================================

print("📊 NETWORK DATA SUMMARY")
print("=" * 80)
print(f"\nTotal nodes: {1 + len(matches)}")
print(f"   - 1 candidate node (green star)")
print(f"   - {len(matches)} company nodes (red circles)")
print(f"\nTotal edges: {len(matches)}")
print(f"\n" + "=" * 80)

# Show node details
print(f"\n🎯 CANDIDATE NODE:")
print(f"   ID: cand_{target_candidate}")
print(f"   Category: {cand.get('Category', 'N/A')}")
print(f"   Skills: {str(cand.get('skills', 'N/A'))[:100]}...")

print(f"\n🏢 COMPANY NODES (Top 5):")
for rank, (comp_idx, score) in enumerate(matches[:5], 1):
    if comp_idx < len(companies_full):
        company = companies_full.iloc[comp_idx]
        print(f"\n   #{rank}. {str(company.get('name', 'N/A'))[:40]}")
        print(f"       ID: comp_{comp_idx}")
        print(f"       Score: {score:.4f}")
        print(f"       Industries: {str(company.get('industries_list', 'N/A'))[:60]}...")

print(f"\n" + "=" * 80)

---
## 🔍 Visualization 4: Display Node Data

Inspect detailed information about candidates and companies

In [None]:
# ============================================================================
# DISPLAY NODE DATA - See what's behind the graph
# ============================================================================

def display_node_data(node_id):
    print("=" * 80)
    
    if node_id.startswith('C'):
        # CANDIDATE
        cand_idx = int(node_id[1:])
        
        if cand_idx >= len(candidates):
            print(f"❌ Candidate {cand_idx} not found!")
            return
        
        candidate = candidates.iloc[cand_idx]
        
        print(f"🟢 CANDIDATE #{cand_idx}")
        print("=" * 80)
        print(f"\n📊 KEY INFORMATION:\n")
        print(f"Resume ID: {candidate.get('Resume_ID', 'N/A')}")
        print(f"Category: {candidate.get('Category', 'N/A')}")
        print(f"Skills: {str(candidate.get('skills', 'N/A'))[:200]}")
        print(f"Career Objective: {str(candidate.get('career_objective', 'N/A'))[:200]}")
    
    elif node_id.startswith('J'):
        # COMPANY
        comp_idx = int(node_id[1:])
        
        if comp_idx >= len(companies_full):
            print(f"❌ Company {comp_idx} not found!")
            return
        
        company = companies_full.iloc[comp_idx]
        
        print(f"🔴 COMPANY #{comp_idx}")
        print("=" * 80)
        print(f"\n📊 COMPANY INFORMATION:\n")
        print(f"Name: {company.get('name', 'N/A')}")
        print(f"Industries: {str(company.get('industries_list', 'N/A'))[:200]}")
        print(f"Required Skills: {str(company.get('required_skills', 'N/A'))[:200]}")
        print(f"Posted Jobs: {str(company.get('posted_job_titles', 'N/A'))[:200]}")
    
    print("\n" + "=" * 80 + "\n")

def display_node_with_connections(node_id, top_k=10):
    display_node_data(node_id)
    
    if node_id.startswith('C'):
        cand_idx = int(node_id[1:])
        
        print(f"🎯 TOP {top_k} MATCHES:")
        print("=" * 80)
        
        matches = find_top_matches(cand_idx, top_k=top_k)
        
        # FIXED: Validate indices before accessing
        valid_matches = 0
        for rank, (comp_idx, score) in enumerate(matches, 1):
            # Check if index is valid
            if comp_idx >= len(companies_full):
                print(f"⚠️  Match #{rank}: Index {comp_idx} out of range (skipping)")
                continue
            
            company = companies_full.iloc[comp_idx]
            print(f"#{rank}. {company.get('name', 'N/A')[:40]} (Score: {score:.4f})")
            valid_matches += 1
        
        if valid_matches == 0:
            print("⚠️  No valid matches found (all indices out of bounds)")
            print("\n💡 SOLUTION: Regenerate embeddings after deduplication!")
        
        print("\n" + "=" * 80)

# Example usage
display_node_with_connections('C0', top_k=5)

---
## 🕸️ Visualization 5: NetworkX Graph

Network graph using NetworkX + Plotly with force-directed layout

In [None]:
# ============================================================================
# NETWORK GRAPH WITH NETWORKX + PLOTLY
# ============================================================================

import networkx as nx

print("🕸️  Creating NETWORK GRAPH...\n")

# Create graph
G = nx.Graph()

# Sample
n_cand_sample = min(20, len(candidates))
top_k_per_cand = 5

print(f"📊 Network size:")
print(f"   • {n_cand_sample} candidates")
print(f"   • {top_k_per_cand} companies per candidate\n")

# Add nodes + edges
companies_in_graph = set()

for i in range(n_cand_sample):
    G.add_node(f"C{i}", node_type='candidate', label=f"C{i}")
    
    matches = find_top_matches(i, top_k=top_k_per_cand)
    
    for comp_idx, score in matches:
        comp_id = f"J{comp_idx}"
        
        if comp_id not in companies_in_graph:
            company_name = str(companies_full.iloc[comp_idx].get('name', 'N/A'))[:20]
            G.add_node(comp_id, node_type='company', label=company_name)
            companies_in_graph.add(comp_id)
        
        G.add_edge(f"C{i}", comp_id, weight=float(score))

print(f"✅ Network created!")
print(f"   Nodes: {G.number_of_nodes()}")
print(f"   Edges: {G.number_of_edges()}\n")

# Calculate layout
print("🔄 Calculating layout...")
pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
print("✅ Layout done!\n")

# Create edge traces
edge_trace = []
for edge in G.edges(data=True):
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    weight = edge[2]['weight']
    
    edge_trace.append(go.Scatter(
        x=[x0, x1, None],
        y=[y0, y1, None],
        mode='lines',
        line=dict(width=weight*3, color='rgba(255,255,255,0.3)'),
        hoverinfo='none',
        showlegend=False
    ))

# Candidate nodes
cand_nodes = [n for n, d in G.nodes(data=True) if d['node_type']=='candidate']
cand_x = [pos[n][0] for n in cand_nodes]
cand_y = [pos[n][1] for n in cand_nodes]
cand_labels = [G.nodes[n]['label'] for n in cand_nodes]

candidate_trace = go.Scatter(
    x=cand_x, y=cand_y,
    mode='markers+text',
    name='Candidates',
    marker=dict(size=25, color='#00ff00', line=dict(width=2, color='white')),
    text=cand_labels,
    textposition='top center',
    hovertemplate='<b>%{text}</b><extra></extra>'
)

# Company nodes
comp_nodes = [n for n, d in G.nodes(data=True) if d['node_type']=='company']
comp_x = [pos[n][0] for n in comp_nodes]
comp_y = [pos[n][1] for n in comp_nodes]
comp_labels = [G.nodes[n]['label'] for n in comp_nodes]

company_trace = go.Scatter(
    x=comp_x, y=comp_y,
    mode='markers+text',
    name='Companies',
    marker=dict(size=15, color='#ff6b6b', symbol='square'),
    text=comp_labels,
    textposition='top center',
    hovertemplate='<b>%{text}</b><extra></extra>'
)

# Create figure
fig = go.Figure(data=edge_trace + [candidate_trace, company_trace])

fig.update_layout(
    title='Network Graph: Candidates ↔ Companies',
    showlegend=True,
    width=1400, height=900,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white'),
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)

fig.show()

print("✅ NetworkX graph created!")
print("   🟢 Green = Candidates")
print("   🔴 Red = Companies")
print("   Lines = Connections (thicker = stronger)\n")
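
Beyond the picture, the same edge list can show which companies attract the most candidates. A small sketch, assuming edges are available as `(candidate_id, company_id, score)` tuples with the `C*`/`J*` id scheme used above; `company_popularity` and the sample edges are hypothetical.

```python
from collections import Counter

def company_popularity(edges):
    """Count how many distinct candidates are linked to each company node."""
    pairs = {(cand, comp) for cand, comp, _ in edges}
    return Counter(comp for _, comp in pairs)

# Hypothetical (candidate, company, score) edges.
edges = [("C0", "J5", 0.71), ("C1", "J5", 0.66), ("C1", "J9", 0.60), ("C2", "J5", 0.58)]
print(company_popularity(edges).most_common(2))
```

A company that shows up in nearly every candidate's top matches may indicate a generic profile text rather than a genuinely good fit.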

---
## 🐛 DEBUG: Why aren't candidates & companies overlapping?

Investigating the embedding space alignment

In [None]:
# ============================================================================
# DEBUG: CHECK EMBEDDING ALIGNMENT
# ============================================================================

print("🐛 DEBUGGING EMBEDDING SPACE")
print("=" * 80)

# 1. Check if vectors loaded correctly
print(f"\n1️⃣ VECTOR SHAPES:")
print(f"   Candidates: {cand_vectors.shape}")
print(f"   Companies: {comp_vectors.shape}")

# 2. Check vector norms
print(f"\n2️⃣ VECTOR NORMS (should be ~1.0 if normalized):")
cand_norms = np.linalg.norm(cand_vectors, axis=1)
comp_norms = np.linalg.norm(comp_vectors, axis=1)
print(f"   Candidates: mean={cand_norms.mean():.4f}, min={cand_norms.min():.4f}, max={cand_norms.max():.4f}")
print(f"   Companies: mean={comp_norms.mean():.4f}, min={comp_norms.min():.4f}, max={comp_norms.max():.4f}")

# 3. Sample similarity
print(f"\n3️⃣ SAMPLE SIMILARITIES:")
sample_cand = 0
matches = find_top_matches(sample_cand, top_k=5)
print(f"   Candidate #{sample_cand} top 5 matches:")
for rank, (comp_idx, score) in enumerate(matches, 1):
    print(f"      #{rank}. Company {comp_idx}: {score:.4f}")

# 4. Check text representations
print(f"\n4️⃣ TEXT REPRESENTATION SAMPLES:")
print(f"\n   📋 CANDIDATE #{sample_cand}:")
cand = candidates.iloc[sample_cand]
print(f"      Skills: {str(cand.get('skills', 'N/A'))[:100]}")
print(f"      Category: {cand.get('Category', 'N/A')}")

top_company_idx = matches[0][0]
print(f"\n   🏢 TOP MATCH COMPANY #{top_company_idx}:")
company = companies_full.iloc[top_company_idx]
print(f"      Name: {company.get('name', 'N/A')}")
print(f"      Required Skills: {str(company.get('required_skills', 'N/A'))[:100]}")
print(f"      Industries: {str(company.get('industries_list', 'N/A'))[:100]}")

# 5. Check if postings enrichment worked
print(f"\n5️⃣ POSTINGS ENRICHMENT CHECK:")
has_postings = companies_full['required_skills'].fillna('') != ''  # NaN also counts as "no postings"
companies_with_postings = int(has_postings.sum())
companies_without = int((~has_postings).sum())
print(f"   WITH postings: {companies_with_postings:,} ({companies_with_postings/len(companies_full)*100:.1f}%)")
print(f"   WITHOUT postings: {companies_without:,}")

# 6. HYPOTHESIS
print(f"\n❓ HYPOTHESIS:")
if companies_without > companies_with_postings:
    print(f"   ⚠️  Most companies DON'T have postings!")
    print(f"   ⚠️  They only have: industries, specialties, description")
    print(f"   ⚠️  This creates DIFFERENT language than candidates")
    print(f"\n   💡 SOLUTION:")
    print(f"      Option A: Filter to only companies WITH postings")
    print(f"      Option B: Use LLM to translate industries → skills")
else:
    print(f"   ✅ Most companies have postings")
    print(f"   ❓ Need to check if embeddings were generated AFTER enrichment")

print(f"\n" + "=" * 80)

---
## 📊 Step 19: Summary

### What We Built

In [31]:
print("="*70)
print("🎯 HRHUB v2.1 - SUMMARY")
print("="*70)
print("")
print("✅ IMPLEMENTED:")
print("  1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)")
print("  2. Few-Shot Learning with Examples")
print("  3. Structured Skills Extraction (Pydantic schemas)")
print("  4. Match Explainability (LLM-generated reasoning)")
print("  5. FREE LLM Integration (Hugging Face)")
print("  6. Flexible Data Loading (Upload OR Google Drive)")
print("")
print("💰 COST: $0.00 (completely free!)")
print("")
print("📈 COURSE ALIGNMENT:")
print("  ✅ LLMs for structured output")
print("  ✅ Pydantic schemas")
print("  ✅ Classification pipelines")
print("  ✅ Zero-shot & few-shot learning")
print("  ✅ JSON extraction")
print("  ✅ Transformer architecture (embeddings)")
print("  ✅ API deployment strategies")
print("")
print("="*70)
print("🚀 READY TO MOVE TO VS CODE!")
print("="*70)

🎯 HRHUB v2.1 - SUMMARY

✅ IMPLEMENTED:
  1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)
  2. Few-Shot Learning with Examples
  3. Structured Skills Extraction (Pydantic schemas)
  4. Match Explainability (LLM-generated reasoning)
  5. FREE LLM Integration (Hugging Face)
  6. Flexible Data Loading (Upload OR Google Drive)

💰 COST: $0.00 (completely free!)

📈 COURSE ALIGNMENT:
  ✅ LLMs for structured output
  ✅ Pydantic schemas
  ✅ Classification pipelines
  ✅ Zero-shot & few-shot learning
  ✅ JSON extraction
  ✅ Transformer architecture (embeddings)
  ✅ API deployment strategies

🚀 READY TO MOVE TO VS CODE!


---
## 🎯 Step 7.5: Collaborative Filtering for Companies

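**The key idea**

Companies WITHOUT postings can inherit skills from similar companies WITH postings. Similarity is the cosine similarity between profile embeddings built from name, description, industries, specialties, and employee count.

Like Netflix recommendations:
- Company A (no postings) is similar to Company B (has postings)
- → Company A inherits Company B's required skills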
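
The inheritance idea above can be sketched on toy data. The vectors and skill strings below are made up for illustration and only stand in for the profile embeddings and `required_skills` values. A single nearest neighbour is used here for brevity, whereas the notebook cells that follow aggregate the top 5.

```python
import numpy as np

# Toy unit vectors standing in for profile embeddings (illustrative only)
with_postings = np.array([
    [1.0, 0.0],  # Company B
    [0.0, 1.0],  # Company C
])
skills_with_postings = ["python, sql", "welding, machining"]

# Company A has no postings; its profile vector points mostly towards Company B
company_a = np.array([0.8, 0.6])

# For unit-length vectors the dot product equals the cosine similarity
sims = with_postings @ company_a
nearest = int(np.argmax(sims))

# Company A inherits the skills of its most similar neighbour
inherited = skills_with_postings[nearest]
print(sims, inherited)
```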

In [None]:
# ============================================================================
# COLLABORATIVE FILTERING: Companies without postings
# ============================================================================

print("🎯 COLLABORATIVE FILTERING FOR COMPANIES")
print("=" * 80)
print("\nLike Netflix: Similar companies → Similar skills needed!\n")

# Step 1: Separate companies
# Treat NaN like the empty string so such rows are not counted as having postings
has_postings = companies_full['required_skills'].fillna('') != ''
companies_with_postings = companies_full[has_postings].copy()
companies_without_postings = companies_full[~has_postings].copy()

print(f"📊 DATA SPLIT:")
print(f"   WITH postings: {len(companies_with_postings):,} companies")
print(f"   WITHOUT postings: {len(companies_without_postings):,} companies")
print(f"\n💡 Goal: Infer skills for {len(companies_without_postings):,} companies\n")

# Step 2: Build company profile vectors (BEFORE postings)
# Using ONLY: name, description, industries, specialties, employee_count
print("🔧 Building company profile vectors...")

def build_company_profile_text(row):
    """Build text representation WITHOUT postings data"""
    parts = []
    
    if row.get('name'):
        parts.append(f"Company: {row['name']}")
    
    if row.get('description'):
        parts.append(f"Description: {row['description']}")
    
    if row.get('industries_list'):
        parts.append(f"Industries: {row['industries_list']}")
    
    if row.get('specialties_list'):
        parts.append(f"Specialties: {row['specialties_list']}")
    
    if row.get('employee_count'):
        parts.append(f"Size: {row['employee_count']} employees")
    
    return ' '.join(parts)

# Generate profile embeddings
# NOTE: requires the SentenceTransformer `model` (loaded in Step 8) to exist already
with_postings_profiles = companies_with_postings.apply(build_company_profile_text, axis=1).tolist()
without_postings_profiles = companies_without_postings.apply(build_company_profile_text, axis=1).tolist()

print(f"   Encoding {len(with_postings_profiles):,} companies WITH postings...")
with_postings_embeddings = model.encode(
    with_postings_profiles,
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

print(f"   Encoding {len(without_postings_profiles):,} companies WITHOUT postings...")
without_postings_embeddings = model.encode(
    without_postings_profiles,
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

print(f"\n✅ Profile embeddings created!")
print(f"   Shape WITH: {with_postings_embeddings.shape}")
print(f"   Shape WITHOUT: {without_postings_embeddings.shape}\n")

In [None]:
# ============================================================================
# STEP 3: Find Similar Companies & Inherit Skills
# ============================================================================

print("🔍 Finding similar companies for skill inheritance...\n")

# For each company WITHOUT postings, find top-K similar WITH postings
TOP_K_SIMILAR = 5  # Use top 5 similar companies

print(f"📊 Method: Top-{TOP_K_SIMILAR} Collaborative Filtering\n")

inferred_skills = []
inferred_titles = []
inferred_levels = []

# Calculate similarities (batch processing)
print("⚙️  Calculating company-to-company similarities...")
similarities = cosine_similarity(without_postings_embeddings, with_postings_embeddings)

print(f"✅ Similarity matrix: {similarities.shape}\n")
print(f"🔄 Inferring skills for {len(companies_without_postings):,} companies...\n")

for i in range(len(companies_without_postings)):
    if i % 10000 == 0:
        print(f"   Progress: {i:,}/{len(companies_without_postings):,}")
    
    # Get top-K similar companies WITH postings
    top_k_indices = np.argsort(similarities[i])[::-1][:TOP_K_SIMILAR]
    
    # Collect skills from similar companies
    similar_skills = []
    similar_titles = []
    similar_levels = []
    
    for idx in top_k_indices:
        similar_company = companies_with_postings.iloc[idx]
        
        if similar_company.get('required_skills'):
            similar_skills.append(str(similar_company['required_skills']))
        
        if similar_company.get('posted_job_titles'):
            similar_titles.append(str(similar_company['posted_job_titles']))
        
        if similar_company.get('experience_levels'):
            similar_levels.append(str(similar_company['experience_levels']))
    
    # Aggregate (simple concatenation)
    inferred_skills.append(' | '.join(similar_skills) if similar_skills else '')
    inferred_titles.append(' | '.join(similar_titles) if similar_titles else '')
    inferred_levels.append(' | '.join(similar_levels) if similar_levels else '')

print(f"\n✅ Skill inference complete!\n")

# Add to companies_without_postings
companies_without_postings['required_skills'] = inferred_skills
companies_without_postings['posted_job_titles'] = inferred_titles
companies_without_postings['experience_levels'] = inferred_levels

# Mark as inferred
companies_without_postings['skills_source'] = 'inferred_cf'
companies_with_postings['skills_source'] = 'actual_postings'

print(f"📊 RESULTS:")
non_empty = sum(1 for s in inferred_skills if s != '')
print(f"   Successfully inferred skills: {non_empty:,}/{len(inferred_skills):,} ({non_empty/len(inferred_skills)*100:.1f}%)\n")
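
The similarity matrix in the cell above has one row per company without postings and one column per company with postings, which can take a lot of memory on large datasets. One possible alternative, sketched here under the assumption that both embedding arrays are L2-normalized, processes the rows in chunks and uses `np.argpartition` to pick the top-k without sorting every row. The name `top_k_similar` and the `chunk_size` default are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_similar(queries, corpus, k=5, chunk_size=1024):
    """Return (indices, scores) of the k most similar corpus rows for each query row.

    Hypothetical helper: assumes both arrays are L2-normalized (so the dot
    product equals cosine similarity) and that corpus is non-empty.
    """
    k = min(k, corpus.shape[0])
    idx_chunks, score_chunks = [], []
    for start in range(0, queries.shape[0], chunk_size):
        # Similarities for one chunk only: shape (chunk, n_corpus)
        block = queries[start:start + chunk_size] @ corpus.T
        # argpartition puts the k largest values in the first k columns (unordered)
        part = np.argpartition(-block, k - 1, axis=1)[:, :k]
        part_scores = np.take_along_axis(block, part, axis=1)
        # Sort just those k candidates by descending score
        order = np.argsort(-part_scores, axis=1)
        idx_chunks.append(np.take_along_axis(part, order, axis=1))
        score_chunks.append(np.take_along_axis(part_scores, order, axis=1))
    if not idx_chunks:  # no query rows at all
        return np.empty((0, k), dtype=int), np.empty((0, k))
    return np.vstack(idx_chunks), np.vstack(score_chunks)

# Small random demo (synthetic data, not the notebook's embeddings)
rng = np.random.default_rng(0)
demo_queries = rng.normal(size=(7, 8))
demo_queries /= np.linalg.norm(demo_queries, axis=1, keepdims=True)
demo_corpus = rng.normal(size=(20, 8))
demo_corpus /= np.linalg.norm(demo_corpus, axis=1, keepdims=True)

top_idx, top_scores = top_k_similar(demo_queries, demo_corpus, k=5, chunk_size=3)
print(top_idx.shape)
```

With this shape of result, row `i` of `top_idx` would play the role of `top_k_indices` in the loop above, and only one chunk of similarities is held in memory at a time.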

In [None]:
# ============================================================================
# STEP 4: Rebuild companies_full with INFERRED skills
# ============================================================================

print("🔄 Rebuilding companies_full with inferred skills...\n")

# Combine
companies_full_enhanced = pd.concat([
    companies_with_postings,
    companies_without_postings
], ignore_index=False).sort_index()

print(f"✅ Enhanced dataset created!")
print(f"   Total companies: {len(companies_full_enhanced):,}")
print(f"   With actual postings: {len(companies_with_postings):,}")
print(f"   With inferred skills: {len(companies_without_postings):,}")

# Verify
total_with_skills = companies_full_enhanced[companies_full_enhanced['required_skills'] != ''].shape[0]
print(f"\n📈 IMPROVEMENT:")
print(f"   BEFORE: {len(companies_with_postings):,} companies with skills ({len(companies_with_postings)/len(companies_full)*100:.1f}%)")
print(f"   AFTER: {total_with_skills:,} companies with skills ({total_with_skills/len(companies_full_enhanced)*100:.1f}%)")
print(f"   📊 Increase: +{total_with_skills - len(companies_with_postings):,} companies!\n")

# Replace companies_full
companies_full = companies_full_enhanced

print(f"✅ companies_full updated with collaborative filtering!\n")
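
Because embeddings are matched to rows by position, the rebuilt frame has to keep the original row order. The toy frame below (made-up data, hypothetical variable names) illustrates why `pd.concat(...)` followed by `sort_index()` restores that order after the split, as long as the original index labels are preserved.

```python
import pandas as pd

# Toy stand-in for companies_full (illustrative values only)
full = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "required_skills": ["x", "", "y", ""],
})

# Split as in Step 1; the original index labels travel with each row
with_p = full[full["required_skills"] != ""].copy()
without_p = full[full["required_skills"] == ""].copy()

# Fill the missing skills, then recombine
without_p["required_skills"] = ["inferred-1", "inferred-2"]
rebuilt = pd.concat([with_p, without_p], ignore_index=False).sort_index()

# The row order matches the original frame again
print(rebuilt["name"].tolist())
print(rebuilt.index.equals(full.index))
```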

In [None]:
# ============================================================================
# STEP 5: Regenerate Company Embeddings with INFERRED skills
# ============================================================================

print("🔄 Regenerating company embeddings with inferred skills...\n")

def build_company_text_enhanced(row):
    """Build company text WITH inferred/actual skills"""
    parts = []
    
    if row.get('name'):
        parts.append(f"Company: {row['name']}")
    
    if row.get('description'):
        parts.append(f"Description: {row['description']}")
    
    if row.get('industries_list'):
        parts.append(f"Industries: {row['industries_list']}")
    
    if row.get('specialties_list'):
        parts.append(f"Specialties: {row['specialties_list']}")
    
    # NOW INCLUDES INFERRED SKILLS!
    if row.get('required_skills'):
        parts.append(f"Required Skills: {row['required_skills']}")
    
    if row.get('posted_job_titles'):
        parts.append(f"Job Titles: {row['posted_job_titles']}")
    
    if row.get('experience_levels'):
        parts.append(f"Experience: {row['experience_levels']}")
    
    return ' '.join(parts)

# Build texts
company_texts_enhanced = companies_full.apply(build_company_text_enhanced, axis=1).tolist()

print(f"📝 Encoding {len(company_texts_enhanced):,} enhanced company profiles...\n")

comp_vectors_enhanced = model.encode(
    company_texts_enhanced,
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

print(f"\n✅ Enhanced embeddings created!")
print(f"   Shape: {comp_vectors_enhanced.shape}")

# Replace global comp_vectors
comp_vectors = comp_vectors_enhanced

print(f"\n🎯 NOW candidates & companies speak the SAME LANGUAGE!")
print(f"   All companies have skill information (actual or inferred)")
print(f"   Ready for matching!\n")

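# Save under a separate file name (Step 8 loads company_embeddings.npy, so it will not pick this file up automatically)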
np.save(f'{Config.PROCESSED_PATH}company_embeddings_cf_enhanced.npy', comp_vectors)
print(f"💾 Saved: company_embeddings_cf_enhanced.npy\n")

### 🔍 Example: Check Inferred Skills

In [None]:
# ============================================================================
# EXAMPLE: See skill inference in action
# ============================================================================

print("🔍 COLLABORATIVE FILTERING EXAMPLE")
print("=" * 80)

# Find a company that got inferred skills
inferred_companies = companies_full[companies_full['skills_source'] == 'inferred_cf']

if len(inferred_companies) > 0:
    example = inferred_companies.iloc[0]
    
    print(f"\n📋 COMPANY (skills were INFERRED):")
    print(f"   Name: {example.get('name', 'N/A')}")
    print(f"   Industries: {str(example.get('industries_list', 'N/A'))[:100]}")
    print(f"   Specialties: {str(example.get('specialties_list', 'N/A'))[:100]}")
    print(f"\n   🎯 INFERRED Required Skills:")
    print(f"      {str(example.get('required_skills', 'N/A'))[:200]}")
    print(f"\n   💼 INFERRED Job Titles:")
    print(f"      {str(example.get('posted_job_titles', 'N/A'))[:200]}")
    
    print(f"\n💡 These skills were inherited from similar companies!")
else:
    print("\n⚠️  No inferred companies found")

print("\n" + "=" * 80)

---
## 🧠 Step 8: Generate OR Load Embeddings

**Smart pipeline:**
- First run: generate embeddings (slow, ~5 min)
- Subsequent runs: load from file (fast, <5 sec)

**CRITICAL:** Embeddings must be generated AFTER deduplication so that every vector lines up with its dataset row.
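
The generate-or-load pattern described above can be condensed into one helper. This is a minimal sketch, not the notebook's implementation: `load_or_generate` and `fake_encode` are hypothetical names, and the fake encoder stands in for `model.encode` so the example runs without downloading a model.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_or_generate(path, texts, encode_fn):
    """Load cached embeddings if their row count matches, else regenerate and save.

    Hypothetical helper. `path` should end in .npy, because np.save appends
    that suffix when it is missing.
    """
    path = Path(path)
    if path.exists():
        vectors = np.load(path)
        if len(vectors) == len(texts):
            return vectors, "loaded"
    # Cache missing or stale: recompute and overwrite
    vectors = np.asarray(encode_fn(texts))
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, vectors)
    return vectors, "generated"

def fake_encode(texts):
    # Stand-in for model.encode(...): one 4-dimensional vector per text
    return np.ones((len(texts), 4), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "embeddings.npy"
    _, first = load_or_generate(cache_file, ["a", "b"], fake_encode)
    _, second = load_or_generate(cache_file, ["a", "b"], fake_encode)
    resized, third = load_or_generate(cache_file, ["a", "b", "c"], fake_encode)

print(first, second, third)
```

The third call regenerates because the cached row count no longer matches the input, which mirrors the size-mismatch check in the cells below.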

In [None]:
# ============================================================================
# EMBEDDING GENERATION + SAVE/LOAD PIPELINE
# ============================================================================

import os
from pathlib import Path

print("🧠 EMBEDDING PIPELINE")
print("=" * 80)
print()

# Ensure processed directory exists
Path(Config.PROCESSED_PATH).mkdir(parents=True, exist_ok=True)

# Define file paths
CAND_EMBEDDINGS_FILE = f'{Config.PROCESSED_PATH}candidate_embeddings.npy'
COMP_EMBEDDINGS_FILE = f'{Config.PROCESSED_PATH}company_embeddings.npy'
CAND_METADATA_FILE = f'{Config.PROCESSED_PATH}candidates_metadata.pkl'
COMP_METADATA_FILE = f'{Config.PROCESSED_PATH}companies_metadata.pkl'

# Check if embeddings already exist
cand_exists = os.path.exists(CAND_EMBEDDINGS_FILE)
comp_exists = os.path.exists(COMP_EMBEDDINGS_FILE)

print(f"📁 Checking for existing embeddings...")
print(f"   Candidates: {'✅ Found' if cand_exists else '❌ Not found'}")
print(f"   Companies: {'✅ Found' if comp_exists else '❌ Not found'}")
print()

# Load model
print(f"🔧 Loading embedding model: {Config.EMBEDDING_MODEL}")
model = SentenceTransformer(Config.EMBEDDING_MODEL)
embedding_dim = model.get_sentence_embedding_dimension()
print(f"✅ Model loaded! Dimension: {embedding_dim}\n")

In [None]:
# ============================================================================
# CANDIDATE EMBEDDINGS - Generate or Load
# ============================================================================

if cand_exists:
    print("📥 LOADING candidate embeddings from file...")
    cand_vectors = np.load(CAND_EMBEDDINGS_FILE)
    print(f"✅ Loaded: {cand_vectors.shape}")
    
    # Verify alignment
    if len(cand_vectors) != len(candidates):
        print(f"\n⚠️  WARNING: Size mismatch!")
        print(f"   Embeddings: {len(cand_vectors):,}")
        print(f"   Dataset: {len(candidates):,}")
        print(f"\n🔄 Regenerating...")
        cand_exists = False

if not cand_exists:
    print("🔄 GENERATING candidate embeddings...")
    print(f"   Processing {len(candidates):,} candidates...\n")
    
    # Build text representations
    def build_candidate_text(row):
        parts = []
        
        if row.get('Category'):
            parts.append(f"Job Category: {row['Category']}")
        
        if row.get('skills'):
            parts.append(f"Skills: {row['skills']}")
        
        if row.get('career_objective'):
            parts.append(f"Objective: {row['career_objective']}")
        
        if row.get('degree_names'):
            parts.append(f"Education: {row['degree_names']}")
        
        if row.get('positions'):
            parts.append(f"Experience: {row['positions']}")
        
        return ' '.join(parts)
    
    candidate_texts = candidates.apply(build_candidate_text, axis=1).tolist()
    
    # Generate embeddings
    cand_vectors = model.encode(
        candidate_texts,
        show_progress_bar=True,
        batch_size=32,
        normalize_embeddings=True,
        convert_to_numpy=True
    )
    
    # Save
    np.save(CAND_EMBEDDINGS_FILE, cand_vectors)
    candidates.to_pickle(CAND_METADATA_FILE)
    
    print(f"\n💾 Saved:")
    print(f"   {CAND_EMBEDDINGS_FILE}")
    print(f"   {CAND_METADATA_FILE}")

print(f"\n✅ CANDIDATE EMBEDDINGS READY")
print(f"   Shape: {cand_vectors.shape}")
print(f"   Dataset size: {len(candidates):,}")
print(f"   Alignment: {'✅ PERFECT' if len(cand_vectors) == len(candidates) else '❌ MISMATCH'}\n")

In [None]:
# ============================================================================
# COMPANY EMBEDDINGS - Generate or Load
# ============================================================================

if comp_exists:
    print("📥 LOADING company embeddings from file...")
    comp_vectors = np.load(COMP_EMBEDDINGS_FILE)
    print(f"✅ Loaded: {comp_vectors.shape}")
    
    # Verify alignment
    if len(comp_vectors) != len(companies_full):
        print(f"\n⚠️  WARNING: Size mismatch!")
        print(f"   Embeddings: {len(comp_vectors):,}")
        print(f"   Dataset: {len(companies_full):,}")
        print(f"\n🔄 Regenerating...")
        comp_exists = False

if not comp_exists:
    print("🔄 GENERATING company embeddings...")
    print(f"   Processing {len(companies_full):,} companies...")
    print(f"   IMPORTANT: Generated AFTER deduplication for alignment!\n")
    
    # Build text representations
    def build_company_text(row):
        parts = []
        
        if row.get('name'):
            parts.append(f"Company: {row['name']}")
        
        if row.get('description'):
            parts.append(f"Description: {row['description']}")
        
        if row.get('industries_list'):
            parts.append(f"Industries: {row['industries_list']}")
        
        if row.get('specialties_list'):
            parts.append(f"Specialties: {row['specialties_list']}")
        
        # Include job postings data (THE BRIDGE!)
        if row.get('required_skills'):
            parts.append(f"Required Skills: {row['required_skills']}")
        
        if row.get('posted_job_titles'):
            parts.append(f"Job Titles: {row['posted_job_titles']}")
        
        if row.get('experience_levels'):
            parts.append(f"Experience Levels: {row['experience_levels']}")
        
        return ' '.join(parts)
    
    company_texts = companies_full.apply(build_company_text, axis=1).tolist()
    
    # Generate embeddings
    comp_vectors = model.encode(
        company_texts,
        show_progress_bar=True,
        batch_size=32,
        normalize_embeddings=True,
        convert_to_numpy=True
    )
    
    # Save
    np.save(COMP_EMBEDDINGS_FILE, comp_vectors)
    companies_full.to_pickle(COMP_METADATA_FILE)
    
    print(f"\n💾 Saved:")
    print(f"   {COMP_EMBEDDINGS_FILE}")
    print(f"   {COMP_METADATA_FILE}")

print(f"\n✅ COMPANY EMBEDDINGS READY")
print(f"   Shape: {comp_vectors.shape}")
print(f"   Dataset size: {len(companies_full):,}")
print(f"   Alignment: {'✅ PERFECT' if len(comp_vectors) == len(companies_full) else '❌ MISMATCH'}\n")

In [None]:
# ============================================================================
# FINAL VERIFICATION
# ============================================================================

print("🔍 FINAL ALIGNMENT CHECK")
print("=" * 80)
print()

print(f"📊 CANDIDATES:")
print(f"   Dataset rows: {len(candidates):,}")
print(f"   Embedding vectors: {len(cand_vectors):,}")
print(f"   Status: {'✅ ALIGNED' if len(candidates) == len(cand_vectors) else '❌ MISALIGNED'}")
print()

print(f"📊 COMPANIES:")
print(f"   Dataset rows: {len(companies_full):,}")
print(f"   Embedding vectors: {len(comp_vectors):,}")
print(f"   Status: {'✅ ALIGNED' if len(companies_full) == len(comp_vectors) else '❌ MISALIGNED'}")
print()

if len(candidates) == len(cand_vectors) and len(companies_full) == len(comp_vectors):
    print("🎯 PERFECT ALIGNMENT! Ready for matching!")
    print("\n💡 Next runs will LOAD embeddings (fast!)")
else:
    print("⚠️  ALIGNMENT ISSUE DETECTED")
    print("   Delete .npy files and regenerate")

print("\n" + "=" * 80)
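
The check above compares row counts only, so embeddings cached before a content change (for example, before skills were inferred for companies without postings) would still pass. One possible safeguard, sketched here with a hypothetical helper `texts_fingerprint`, is to store a hash of the input texts next to the embeddings and compare it on load.

```python
import hashlib

def texts_fingerprint(texts):
    """Return a SHA-256 hex digest over a list of strings (hypothetical helper)."""
    digest = hashlib.sha256()
    for text in texts:
        digest.update(text.encode("utf-8"))
        # Separator byte so ["ab", "c"] and ["a", "bc"] hash differently
        digest.update(b"\x00")
    return digest.hexdigest()

original = ["Company: A Industries: IT", "Company: B Industries: Retail"]
changed = ["Company: A Industries: IT", "Company: B Industries: Retail Required Skills: sql"]

# Same number of rows, different content: the row-count check passes, the fingerprint differs
same_length = len(original) == len(changed)
same_content = texts_fingerprint(original) == texts_fingerprint(changed)
print(same_length, same_content)
```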