# 🚀 HRHUB - Complete Bilateral Matching System

## 🎯 System Architecture:

```
Candidates (9.5K) ←→ Postings (700) ←→ Companies (180K)
         ↓                ↓                  ↓
    Skills text    Job requirements    Enriched profiles
         ↓                ↓                  ↓
    Embeddings ←―――――― SAME SPACE ℝ³⁸⁴ ―――――→
```

## 🔑 Key Innovation:

**Use postings to enrich company profiles** so they speak the same language as candidates!

- Companies describe: "We are in tech industry"
- Postings translate: "We need Python, AWS, React"
- Result: Companies can match with candidates!
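
To make the idea concrete, here is a minimal sketch using made-up strings. The candidate text and the helper names `tokens` and `jaccard` are illustrative assumptions, and plain token overlap is only a rough stand-in for the embedding similarity used later in the notebook.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for what an embedding captures."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Hypothetical example texts, not taken from the datasets
company_profile = "We are in tech industry"
posting_skills = "We need Python, AWS, React"
candidate = "Skills: Python, React, SQL"

before = jaccard(company_profile, candidate)
after = jaccard(company_profile + " " + posting_skills, candidate)
print(f"overlap before enrichment: {before:.2f}")
print(f"overlap after enrichment:  {after:.2f}")
```

The company profile alone shares no vocabulary with the candidate; once the posting text is appended, the two start to overlap on skill terms.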

---

## 📦 Step 1: Install & Import

In [None]:
!pip install -q sentence-transformers plotly anthropic scikit-learn umap-learn

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import plotly.express as px
import plotly.graph_objects as go
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')

print("✅ All packages ready!")

## 📂 Step 2: Load ALL Datasets

In [None]:
print("📂 Loading all datasets...\n")
print("=" * 70)

# Load candidates
candidates = pd.read_csv('resume_data.csv')
print(f"✅ Candidates: {len(candidates):,} rows × {len(candidates.columns)} columns")

# Load companies base
companies_base = pd.read_csv('companies/companies.csv')
print(f"✅ Companies (base): {len(companies_base):,} rows")

# Load company enrichment data
company_industries = pd.read_csv('companies/company_industries.csv')
print(f"✅ Company industries: {len(company_industries):,} rows")

company_specialties = pd.read_csv('companies/company_specialties.csv')
print(f"✅ Company specialties: {len(company_specialties):,} rows")

employee_counts = pd.read_csv('companies/employee_counts.csv')
print(f"✅ Employee counts: {len(employee_counts):,} rows")

# Load POSTINGS (THE BRIDGE!)
postings = pd.read_csv('postings.csv', on_bad_lines='skip')
print(f"✅ Postings: {len(postings):,} rows × {len(postings.columns)} columns")

# Load job-related tables (optional: only a missing file is tolerated)
try:
    job_skills = pd.read_csv('jobs/job_skills.csv')
    print(f"✅ Job skills: {len(job_skills):,} rows")
except FileNotFoundError:
    job_skills = None
    print("⚠️  Job skills not found (optional)")

try:
    job_industries = pd.read_csv('jobs/job_industries.csv')
    print(f"✅ Job industries: {len(job_industries):,} rows")
except FileNotFoundError:
    job_industries = None
    print("⚠️  Job industries not found (optional)")

print("\n" + "=" * 70)
print("✅ All datasets loaded!\n")

## 🔗 Step 3: Merge Company Data
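
The left merges in this step only keep one row per company if each enrichment table has at most one row per `company_id`. If a table such as `employee_counts` holds several snapshots per company (an assumption about the data, and the `time_recorded` column below is hypothetical), a left merge would silently duplicate companies. A minimal sketch on toy data of one way to guard against that:

```python
import pandas as pd

# Toy stand-ins for companies_base and employee_counts (values are made up)
base = pd.DataFrame({"company_id": [1, 2, 3], "name": ["A", "B", "C"]})
counts = pd.DataFrame({
    "company_id": [1, 1, 2],
    "employee_count": [90, 120, 15],
    "time_recorded": [100, 200, 100],  # hypothetical snapshot timestamp
})

# Keep only the most recent snapshot per company
latest = (
    counts.sort_values("time_recorded")
          .drop_duplicates(subset="company_id", keep="last")
)

# validate="one_to_one" raises a MergeError if either side repeats a key
merged = base.merge(latest, on="company_id", how="left", validate="one_to_one")
print(merged)
```

With `validate` set, a duplicated key fails loudly at merge time instead of inflating the company count later.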

In [None]:
print("🔗 Merging company data...\n")

# Aggregate industries
company_industries_agg = company_industries.groupby('company_id')['industry_id'].apply(
    lambda x: ', '.join(map(str, x.tolist()))
).reset_index()
company_industries_agg.columns = ['company_id', 'industries_list']
print(f"✅ Aggregated industries for {len(company_industries_agg):,} companies")

# Aggregate specialties
company_specialties_agg = company_specialties.groupby('company_id')['specialty'].apply(
    lambda x: ' | '.join(x.astype(str).tolist())
).reset_index()
company_specialties_agg.columns = ['company_id', 'specialties_list']
print(f"✅ Aggregated specialties for {len(company_specialties_agg):,} companies")

# Start with base
companies_merged = companies_base.copy()

# Merge industries
companies_merged = companies_merged.merge(
    company_industries_agg, 
    on='company_id', 
    how='left'
)

# Merge specialties
companies_merged = companies_merged.merge(
    company_specialties_agg, 
    on='company_id', 
    how='left'
)

# Merge employee counts
companies_merged = companies_merged.merge(
    employee_counts, 
    on='company_id', 
    how='left'
)

print(f"\n✅ Base company merge complete: {len(companies_merged):,} companies")
print(f"📊 Columns: {companies_merged.columns.tolist()[:10]}...\n")

## 🌉 Step 4: Enrich Companies with Postings (THE BRIDGE!)

**This is the key step!** Postings tell us what companies actually need.
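
As a small illustration of the aggregate-then-merge pattern this step relies on, the sketch below runs it on toy data. The frames, their values, and the `join_non_empty` helper are assumptions made for the example, not part of the datasets.

```python
import pandas as pd

# Toy postings; company 3 has no postings at all (values are made up)
toy_postings = pd.DataFrame({
    "company_id": [1, 1, 2],
    "title": ["Data Engineer", "ML Engineer", "Frontend Developer"],
    "skills_desc": ["Python, AWS", "", "React"],
})
toy_companies = pd.DataFrame({"company_id": [1, 2, 3], "name": ["A", "B", "C"]})

def join_non_empty(values, sep=" | "):
    """Join values as strings, skipping blanks so no stray separators appear."""
    return sep.join(v for v in values.astype(str) if v.strip())

# One row per company, with its postings collapsed into text columns
agg = (
    toy_postings.groupby("company_id")
    .agg(posted_job_titles=("title", join_non_empty),
         required_skills=("skills_desc", join_non_empty))
    .reset_index()
)

# Left merge keeps companies without postings; their new columns become ''
enriched = toy_companies.merge(agg, on="company_id", how="left").fillna("")
print(enriched)
```

Companies without postings keep their row but get empty requirement fields, which is why later cells filter on `required_skills != ''`.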

In [None]:
print("🌉 Enriching companies with job posting data...\n")
print("=" * 70)
print("KEY INSIGHT: Postings contain the 'requirements language'")
print("that bridges companies and candidates!")
print("=" * 70 + "\n")

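# Clean postings: drop rows with no company, then blank out remaining NaNs
postings = postings.dropna(subset=['company_id']).fillna('')

def join_non_empty(values, sep=' | ', unique=False):
    """Join non-blank values; after fillna('') a plain dropna() would keep the blanks."""
    items = [v for v in values.astype(str) if v.strip()]
    if unique:
        items = list(dict.fromkeys(items))  # de-duplicate, keeping first-seen order
    return sep.join(items)

# Aggregate postings per company
postings_agg = postings.groupby('company_id').agg({
    'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),  # First 10 job titles
    'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),  # First 5 descriptions
    'skills_desc': join_non_empty,  # All non-empty skills
    'formatted_experience_level': lambda x: join_non_empty(x, unique=True),
    'formatted_work_type': lambda x: join_non_empty(x, unique=True)
}).reset_index()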

postings_agg.columns = [
    'company_id', 
    'posted_job_titles', 
    'posted_descriptions',
    'required_skills',
    'experience_levels',
    'work_types'
]

print(f"✅ Aggregated postings for {len(postings_agg):,} companies")
print(f"\n💡 These {len(postings_agg):,} companies have explicit requirements!\n")

# Merge postings into companies
companies_full = companies_merged.merge(
    postings_agg,
    on='company_id',
    how='left'
)

# Fill NaN
companies_full = companies_full.fillna('')

print(f"✅ ENRICHED COMPANIES CREATED!")
print(f"📊 Final: {len(companies_full):,} companies × {len(companies_full.columns)} columns")
print(f"\n📋 New columns from postings:")
print(f"   - posted_job_titles")
print(f"   - posted_descriptions")
print(f"   - required_skills ← KEY FOR MATCHING!")
print(f"   - experience_levels")
print(f"   - work_types\n")

# Show sample
print("👀 Sample enriched company:")
sample_with_postings = companies_full[companies_full['required_skills'] != ''].iloc[0]
print(f"\nCompany: {sample_with_postings.get('name', 'N/A')}")
print(f"Industries: {str(sample_with_postings.get('industries_list', ''))[:100]}...")
print(f"Required Skills: {str(sample_with_postings.get('required_skills', ''))[:100]}...")
print(f"Job Titles Posted: {str(sample_with_postings.get('posted_job_titles', ''))[:100]}...")

## 📂 Step 5: Load & Clean Candidates

In [None]:
# Clean candidates
candidates = candidates.fillna('')

print(f"✅ Candidates cleaned: {len(candidates):,} rows")
print(f"📋 Columns: {candidates.columns.tolist()[:10]}...")
candidates.head(3)

## 📝 Step 6: Create Aligned Text Representations

**CRITICAL:** Both entities must use the same vocabulary!
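
The pattern behind both text builders in this step is the same: collect labeled, non-empty fields and join them. A compact sketch of that pattern follows; `build_text` and the two field lists are hypothetical names introduced for illustration, and the real functions in the next cell cover many more fields.

```python
def build_text(record, fields, default="Profile"):
    """Join 'Label: value' parts for the non-empty fields of a record."""
    parts = []
    for key, label in fields:
        value = str(record.get(key, "")).strip()
        if value:
            parts.append(f"{label}: {value}")
    return " || ".join(parts) if parts else default

# Hypothetical field lists; labels are phrased so both sides talk about skills
CANDIDATE_FIELDS = [("skills", "Skills and expertise"),
                    ("positions", "Experience in roles")]
COMPANY_FIELDS = [("required_skills", "Looking for skills"),
                  ("posted_job_titles", "Hiring for roles")]

candidate = {"skills": "Python, Java", "positions": ""}
company = {"required_skills": "Python, AWS", "posted_job_titles": "Data Engineer"}

print(build_text(candidate, CANDIDATE_FIELDS))
print(build_text(company, COMPANY_FIELDS))
```

Because it only calls `.get`, the same helper should also work on a pandas row passed in by `DataFrame.apply(..., axis=1)`.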

In [None]:
print("📝 Creating ALIGNED text representations...\n")
print("=" * 70)
print("ALIGNMENT STRATEGY:")
print("• Candidates: Describe skills, experience, education")
print("• Companies: Describe what they NEED (from postings!)")
print("• Result: Both use 'skills language' → same vector space!")
print("=" * 70 + "\n")

# ========================================================================
# CANDIDATE TEXT - Professional offering
# ========================================================================
def make_candidate_text(row):
    """
    Candidate text focuses on:
    - What skills I have
    - What experience I bring
    - What value I offer
    """
    parts = []
    
    # Professional identity
    if row.get('career_objective'):
        parts.append(f"Professional seeking: {row['career_objective']}")
    
    if row.get('job_position_name'):
        parts.append(f"Target role: {row['job_position_name']}")
    
    # SKILLS (most important for matching!)
    all_skills = []
    if row.get('skills'): 
        all_skills.append(row['skills'])
    if row.get('related_skills_in_job'): 
        all_skills.append(row['related_skills_in_job'])
    if row.get('certification_skills'): 
        all_skills.append(row['certification_skills'])
    if row.get('skills_required'):  # Skills they're looking for in jobs
        all_skills.append(row['skills_required'])
    
    if all_skills:
        parts.append(f"Skills and expertise: {' | '.join(all_skills)}")
    
    # EXPERIENCE
    if row.get('positions'):
        parts.append(f"Experience in roles: {row['positions']}")
    
    if row.get('professional_company_names'):
        parts.append(f"Companies worked at: {row['professional_company_names']}")
    
    if row.get('responsibilities'):
        resp = str(row['responsibilities'])[:250]
        parts.append(f"Responsibilities: {resp}")
    
    # EDUCATION
    edu_parts = []
    if row.get('degree_names'): 
        edu_parts.append(row['degree_names'])
    if row.get('major_field_of_studies'): 
        edu_parts.append(f"in {row['major_field_of_studies']}")
    if row.get('educational_institution_name'): 
        edu_parts.append(f"from {row['educational_institution_name']}")
    
    if edu_parts:
        parts.append(f"Education: {' '.join(edu_parts)}")
    
    # ADDITIONAL
    if row.get('languages'):
        parts.append(f"Languages: {row['languages']}")
    
    if row.get('certification_providers'):
        parts.append(f"Certifications from: {row['certification_providers']}")
    
    if row.get('extra_curricular_activity_types'):
        parts.append(f"Activities: {row['extra_curricular_activity_types']}")
    
    return ' || '.join(parts) if parts else "Professional profile"


# ========================================================================
# COMPANY TEXT - Job requirements (enriched with postings!)
# ========================================================================
def make_company_text(row):
    """
    Company text focuses on:
    - What skills we need (from postings!)
    - What roles we're hiring for
    - What our company does
    """
    parts = []
    
    # Company identity
    if row.get('name'):
        parts.append(f"Company: {row['name']}")
    
    # REQUIRED SKILLS (from postings - KEY!)
    if row.get('required_skills'):
        parts.append(f"Looking for skills: {row['required_skills']}")
    
    # JOB TITLES (from postings)
    if row.get('posted_job_titles'):
        parts.append(f"Hiring for roles: {row['posted_job_titles']}")
    
    # EXPERIENCE LEVELS (from postings)
    if row.get('experience_levels'):
        parts.append(f"Experience levels: {row['experience_levels']}")
    
    # Industries & specialties
    if row.get('industries_list'):
        parts.append(f"Industries: {row['industries_list']}")
    
    if row.get('specialties_list'):
        parts.append(f"Specialties: {row['specialties_list']}")
    
    # Company description
    if row.get('description'):
        desc = str(row['description'])[:300]
        parts.append(f"About: {desc}")
    
    # Posted descriptions (gives context)
    if row.get('posted_descriptions'):
        posted_desc = str(row['posted_descriptions'])[:200]
        parts.append(f"Job descriptions: {posted_desc}")
    
    # Company size
    if row.get('employee_count'):
        parts.append(f"Company size: {row['employee_count']} employees")
    
    # Location
    loc = []
    if row.get('city'): loc.append(row['city'])
    if row.get('state'): loc.append(row['state'])
    if row.get('country'): loc.append(row['country'])
    if loc:
        parts.append(f"Location: {', '.join(loc)}")
    
    # Work types
    if row.get('work_types'):
        parts.append(f"Work arrangement: {row['work_types']}")
    
    return ' || '.join(parts) if parts else "Company profile"


# ========================================================================
# APPLY TO DATAFRAMES
# ========================================================================
print("🔄 Generating candidate texts...")
candidates['text'] = candidates.apply(make_candidate_text, axis=1)

print("🔄 Generating company texts...")
companies_full['text'] = companies_full.apply(make_company_text, axis=1)

print("\n✅ ALIGNED texts created!\n")

# Compare vocabularies
print("=" * 70)
print("CANDIDATE SAMPLE:")
print(candidates['text'].iloc[0][:500])
print("\n" + "=" * 70)
print("COMPANY SAMPLE (with postings data):")
# Find company with postings
company_with_postings = companies_full[companies_full['required_skills'] != ''].iloc[0]
print(company_with_postings['text'][:500])
print("=" * 70)

print("\n💡 Notice: Both now use SKILLS LANGUAGE!")
print("   Candidate: 'Skills and expertise: Python, Java'")
print("   Company: 'Looking for skills: Python, AWS'")
print("   → They can now be compared in the same space!\n")

## 🧠 Step 7: Generate Embeddings (ℝ³⁸⁴)

Transform the aligned text → vectors in the same mathematical space.
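
One optional refinement, sketched here under stated assumptions: if every vector is scaled to unit length once, cosine similarity reduces to a plain dot product. The random arrays below are stand-ins for the real `model.encode(...)` output, and `l2_normalize` is a hypothetical helper.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length; eps avoids division by zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# Random stand-ins for embeddings (384 matches all-MiniLM-L6-v2's dimension)
rng = np.random.default_rng(0)
cand = l2_normalize(rng.normal(size=(4, 384)))
comp = l2_normalize(rng.normal(size=(6, 384)))

# For unit vectors, the dot product equals cosine similarity
sims = cand @ comp.T
print(sims.shape)
```

Recent sentence-transformers releases also accept `normalize_embeddings=True` in `encode`, which should have the same effect; check the installed version before relying on it.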

In [None]:
print("🧠 Loading embedding model...\n")
model = SentenceTransformer('all-MiniLM-L6-v2')

embedding_dim = model.get_sentence_embedding_dimension()
print(f"✅ Model loaded! Embedding dimension: ℝ^{embedding_dim}\n")

print("🔄 Generating candidate vectors...")
print(f"   ({len(candidates):,} candidates, roughly 2-3 minutes)\n")
cand_vectors = model.encode(
    candidates['text'].tolist(), 
    show_progress_bar=True,
    batch_size=32
)

print("\n🔄 Generating company vectors...")
print(f"   ({len(companies_full):,} companies, roughly 15-20 minutes)\n")
comp_vectors = model.encode(
    companies_full['text'].tolist(), 
    show_progress_bar=True,
    batch_size=64
)

print("\n" + "=" * 70)
print("✅ VECTORS CREATED IN SAME SPACE!")
print("=" * 70)
print(f"📊 Candidate vectors: {cand_vectors.shape}")
print(f"📊 Company vectors: {comp_vectors.shape}")
print(f"\n🎯 Both live in ℝ^{embedding_dim}!")
print(f"🎯 Now companies with 'Python' requirements will be")
print(f"   CLOSE to candidates with 'Python' skills!\n")

## 🎯 Step 8: Matching Engine
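
When many candidates need matches at once, the ranking can be done as one matrix product. The sketch below is one possible approach rather than the notebook's own engine: `top_k_matches` is a hypothetical helper, it assumes both matrices are already L2-normalized, and the random vectors stand in for `cand_vectors` and `comp_vectors`.

```python
import numpy as np

def top_k_matches(cand_matrix, comp_matrix, k=10):
    """Return (indices, scores) of the k most similar companies per candidate.

    Both inputs are assumed L2-normalized, so a dot product is cosine similarity.
    """
    sims = cand_matrix @ comp_matrix.T                # shape (n_cand, n_comp)
    k = min(k, sims.shape[1])
    # argpartition finds the top-k columns per row without a full sort
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    top_scores = np.take_along_axis(sims, top, axis=1)
    order = np.argsort(-top_scores, axis=1)           # sort only the k winners
    return (np.take_along_axis(top, order, axis=1),
            np.take_along_axis(top_scores, order, axis=1))

# Random unit vectors as stand-ins for the real embeddings
rng = np.random.default_rng(42)
a = rng.normal(size=(3, 8))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(20, 8))
b /= np.linalg.norm(b, axis=1, keepdims=True)

idx, scores = top_k_matches(a, b, k=5)
print(idx.shape, scores.shape)
```

At the full scale of this notebook the similarity matrix would be large, so feeding candidates through in chunks is probably the safer way to use it.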

In [None]:
def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def find_top_matches(candidate_idx, top_k=10):
    """
    Find top K company matches for a candidate.

    Returns: List of (company_idx, similarity_score)
    """
    cand_vec = cand_vectors[candidate_idx]

    # Score all companies in one vectorized pass instead of a Python loop
    comp_norms = np.linalg.norm(comp_vectors, axis=1)
    denom = comp_norms * np.linalg.norm(cand_vec)
    scores = np.divide(comp_vectors @ cand_vec, denom,
                       out=np.zeros(len(comp_vectors)), where=denom > 0)

    # Highest similarity first
    top = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in top]

print("✅ Matching engine ready!")
print(f"📊 Can match {len(candidates):,} candidates with {len(companies_full):,} companies\n")

## 🔍 Step 9: Test Matching

In [None]:
print("🔍 Finding top 10 matches for Candidate #0...\n")

matches = find_top_matches(0, top_k=10)

print("🎯 Top 10 Company Matches:\n")
print("=" * 90)
print(f"{'Rank':<6} {'Score':<8} {'Company':<35} {'Skills Needed':<40}")
print("=" * 90)

for rank, (comp_idx, score) in enumerate(matches, 1):
    company = companies_full.iloc[comp_idx]
    name = company.get('name', 'N/A')[:33]
    skills = company.get('required_skills', 'N/A')[:38]
    print(f"{rank:<6} {score:.4f}   {name:<35} {skills}")

print("=" * 90)

print("\n💡 If scores are good (>0.5), the alignment worked!")
print("   High scores = Company needs match candidate skills\n")

## 📊 Step 10: Visualize Vector Space

See where candidates and companies live in ℝ³⁸⁴ (projected to ℝ²)
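
Before spending minutes on t-SNE, a quick linear projection can serve as a sanity check. The sketch below uses PCA computed with NumPy's SVD, which is a swapped-in technique and not the t-SNE projection this step runs; `pca_2d` is a hypothetical helper and the random matrix stands in for the stacked candidate and company vectors.

```python
import numpy as np

def pca_2d(vectors):
    """Project rows onto their first two principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-ins for the stacked candidate and company vectors
rng = np.random.default_rng(0)
stacked = rng.normal(size=(50, 384))
coords = pca_2d(stacked)
print(coords.shape)
```

PCA preserves global directions of variance rather than local neighborhoods, so its picture can differ from t-SNE's; it is mainly useful here because it runs in seconds.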

In [None]:
print("🎨 VECTOR SPACE VISUALIZATION\n")
print("=" * 70)

# Sample for visualization
n_cand_viz = min(500, len(candidates))
n_comp_viz = min(2000, len(companies_full))

print(f"📊 Visualizing:")
print(f"   • {n_cand_viz} candidates")
print(f"   • {n_comp_viz} companies")
print(f"   • From ℝ^{embedding_dim} → ℝ² (t-SNE projection)\n")

# Sample vectors
cand_sample = cand_vectors[:n_cand_viz]
comp_sample = comp_vectors[:n_comp_viz]

# Combine
all_vectors = np.vstack([cand_sample, comp_sample])

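print("🔄 Running t-SNE (2-3 minutes)...")
# The iteration count is left at scikit-learn's default (1000): the `n_iter`
# argument was renamed to `max_iter` in newer releases, so omitting it keeps
# this cell working across versions.
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    verbose=1
)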

vectors_2d = tsne.fit_transform(all_vectors)

# Split
cand_2d = vectors_2d[:n_cand_viz]
comp_2d = vectors_2d[n_cand_viz:]

print("\n✅ t-SNE complete!\n")

In [None]:
# Create plot
fig = go.Figure()

# Companies (red)
fig.add_trace(go.Scatter(
    x=comp_2d[:, 0],
    y=comp_2d[:, 1],
    mode='markers',
    name='Companies',
    marker=dict(
        size=6,
        color='#ff6b6b',
        opacity=0.6
    ),
    text=[f"Company {i}: {companies_full.iloc[i].get('name', 'N/A')[:30]}" 
          for i in range(n_comp_viz)],
    hovertemplate='<b>%{text}</b><extra></extra>'
))

# Candidates (green)
fig.add_trace(go.Scatter(
    x=cand_2d[:, 0],
    y=cand_2d[:, 1],
    mode='markers',
    name='Candidates',
    marker=dict(
        size=10,
        color='#00ff00',
        opacity=0.8,
        line=dict(width=1, color='white')
    ),
    text=[f"Candidate {i}" for i in range(n_cand_viz)],
    hovertemplate='<b>%{text}</b><extra></extra>'
))

fig.update_layout(
    title='Vector Space: Candidates & Companies (with Postings Enrichment)',
    xaxis_title='Dimension 1',
    yaxis_title='Dimension 2',
    width=1200,
    height=800,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white')
)

fig.show()

print("✅ Visualization complete!\n")
print("💡 KEY OBSERVATIONS:")
print("   • Green = Candidates | Red = Companies")
print("   • If they OVERLAP → Good! Alignment worked!")
print("   • If still separated → Need more postings data")
print("   • Clusters = Similar skill profiles grouped\n")

## 🔍 Step 11: Highlight Specific Candidate + Matches

In [None]:
target_candidate = 0

print(f"🔍 Analyzing Candidate #{target_candidate}...\n")

matches = find_top_matches(target_candidate, top_k=10)
match_indices = [comp_idx for comp_idx, score in matches if comp_idx < n_comp_viz]

# Create highlighted plot
fig2 = go.Figure()

# All companies (background)
fig2.add_trace(go.Scatter(
    x=comp_2d[:, 0],
    y=comp_2d[:, 1],
    mode='markers',
    name='All Companies',
    marker=dict(size=4, color='#ff6b6b', opacity=0.3),
    showlegend=True
))

# Top matches (highlighted)
if match_indices:
    match_positions = comp_2d[match_indices]
    fig2.add_trace(go.Scatter(
        x=match_positions[:, 0],
        y=match_positions[:, 1],
        mode='markers',
        name='Top Matches',
        marker=dict(
            size=15,
            color='#ff0000',
            line=dict(width=2, color='white')
        ),
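        # Pair each plotted point with its own rank and score; indexing `matches`
        # by position would misalign once a match falls outside the plotted sample.
        text=[f"Match #{rank}: {companies_full.iloc[idx].get('name', 'N/A')[:30]}<br>Score: {score:.3f}"
              for rank, (idx, score) in enumerate(matches, 1) if idx < n_comp_viz],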
        hovertemplate='<b>%{text}</b><extra></extra>'
    ))

# Target candidate
fig2.add_trace(go.Scatter(
    x=[cand_2d[target_candidate, 0]],
    y=[cand_2d[target_candidate, 1]],
    mode='markers',
    name=f'Candidate #{target_candidate}',
    marker=dict(
        size=25,
        color='#00ff00',
        symbol='star',
        line=dict(width=3, color='white')
    )
))

# Connection lines
for i, match_idx in enumerate(match_indices[:5]):
    fig2.add_trace(go.Scatter(
        x=[cand_2d[target_candidate, 0], comp_2d[match_idx, 0]],
        y=[cand_2d[target_candidate, 1], comp_2d[match_idx, 1]],
        mode='lines',
        line=dict(color='yellow', width=1, dash='dot'),
        opacity=0.5,
        showlegend=False
    ))

fig2.update_layout(
    title=f'Candidate #{target_candidate} and Top Matches',
    xaxis_title='Dimension 1',
    yaxis_title='Dimension 2',
    width=1200,
    height=800,
    plot_bgcolor='#1a1a1a',
    paper_bgcolor='#0d0d0d',
    font=dict(color='white')
)

fig2.show()

print("✅ Highlighted visualization created!")
print(f"   ⭐ Green star = Candidate #{target_candidate}")
print(f"   🔴 Red dots = Top matches")
print(f"   💛 Yellow lines = Connections in vector space\n")

## 💾 Step 12: Export Results

In [None]:
# Generate matches for sample
results = []
export_sample = min(500, len(candidates))

print(f"💾 Generating matches for {export_sample} candidates...\n")

for i in range(export_sample):
    if i % 50 == 0:
        print(f"   Progress: {i}/{export_sample}")
    
    matches = find_top_matches(i, top_k=10)
    
    for rank, (comp_idx, score) in enumerate(matches, 1):
        company = companies_full.iloc[comp_idx]
        results.append({
            'candidate_id': i,
            'company_id': company.get('company_id'),
            'company_name': company.get('name', 'N/A'),
            'rank': rank,
            'similarity_score': float(score),
            'required_skills': company.get('required_skills', 'N/A')[:100],
            'posted_jobs': company.get('posted_job_titles', 'N/A')[:100]
        })

results_df = pd.DataFrame(results)
results_df.to_csv('hrhub_matches_with_postings.csv', index=False)

print(f"\n✅ Exported {len(results_df):,} matches!")
print(f"📄 File: hrhub_matches_with_postings.csv\n")
results_df.head(20)

## 🎉 COMPLETE!

### ✅ What you have:

1. **Enriched companies** with job posting data (requirements, skills needed)
2. **Aligned text representations** (both use "skills language")
3. **Vectors in same space** ℝ³⁸⁴
4. **Cosine similarity matching**
5. **Vector space visualization**
6. **Exported results**

### 🚀 Next steps:

1. **Train LLM on patterns:** "Company in industry X historically needs skills Y"
2. **Predict for companies without postings:** Use learned patterns
3. **Add weights:** Let users tune dimension importance
4. **Build UI:** Interactive matching interface
5. **LLM explanations:** Why these matches make sense

### 💡 Key insight achieved:

**Postings bridge the gap!** They translate "what companies are" into "what companies need" - the same language candidates speak!

---