# SBERT RECOMMENDATION ENGINE (THE BRAIN)

This notebook implements the core AI logic for PathFinder+.
It uses **Sentence-BERT (SBERT)** to understand the *meaning* of skills, not just keywords.

**Features Implemented:**
1. **Semantic Similarity**: Matching User goals to Courses/Jobs.
2. **Skill Gap Analysis**: "What am I missing for this job?"
3. **Career Paths**: Finding the next logical step (e.g. Developer -> Lead).
4. **Top-Up Logic**: Recommending Degrees for professionals.

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
from pathlib import Path

# 1. Load Model (Small & Fast)
print("Loading SBERT Model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model Loaded.")

# 2. Load Data
PROCESSED_DIR = Path("../data/processed")
jobs_df = pd.read_csv(PROCESSED_DIR / "jobs_cleaned_sbert_ready.csv")
courses_df = pd.read_csv(PROCESSED_DIR / "courses_cleaned_sbert_ready.csv")

print(f"Loaded {len(jobs_df)} Jobs and {len(courses_df)} Courses.")

Loading SBERT Model...
✅ Model Loaded.
Loaded 209 Jobs and 24930 Courses.


## STEP 2: GENERATE EMBEDDINGS (The "Vectors")
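We convert each text description into a fixed-length list of numbers (a vector; 384 dimensions for this model). Texts with similar meaning map to nearby vectors, so similarity in meaning can be computed numerically.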

In [None]:
def safe_text(row, cols):
    # Join the non-missing values of `cols` into one string for full context,
    # e.g. "Title. Description."
    text = ""
    for col in cols:
        if col in row and pd.notna(row[col]):
            text += str(row[col]) + ". "
    return text.strip()

print("Generating Job Embeddings...")
job_texts = jobs_df.apply(lambda x: safe_text(x, ['job_title', 'description']), axis=1).tolist()
job_embeddings = model.encode(job_texts, show_progress_bar=True)

print("Generating Course Embeddings...")
course_texts = courses_df.apply(lambda x: safe_text(x, ['course_name', 'description', 'category']), axis=1).tolist()
course_embeddings = model.encode(course_texts, show_progress_bar=True)

print(f"✅ Embeddings Ready. Job Shape: {job_embeddings.shape}")

Generating Job Embeddings...


Generating Course Embeddings...


✅ Embeddings Ready. Job Shape: (209, 384)
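
Each row of the embedding matrix is a 384-dimensional vector, so comparing two texts reduces to comparing two vectors. The later steps do this with cosine similarity (via `util.cos_sim`). A minimal pure-Python sketch of that measure is shown here, using made-up 3-dimensional vectors rather than real SBERT output; the function name `cosine_similarity` and the zero-vector convention are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical, not real embeddings)
python_dev = [0.9, 0.1, 0.0]
data_scientist = [0.8, 0.3, 0.1]
chef = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1,
# nearly orthogonal ones score close to 0.
print(cosine_similarity(python_dev, data_scientist))
print(cosine_similarity(python_dev, chef))
```

Because the score depends only on direction, not vector length, a long description and a short one can still match closely if they are about the same thing.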


## STEP 3: FEATURE 1 - SEMANTIC MATCHING
The core function: given a user profile (free text), rank all jobs by semantic similarity to it and return the best matches.

In [3]:
def finding_pathfinder_recommendations(user_profile_text, user_level="Entry"):
    """
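    Return the 5 jobs most similar to the user's profile.

    user_profile_text: e.g. "I know Python and SQL, want to be a Data Scientist"
    user_level: Current level (Entry, Mid, Senior). Currently unused; the
        level filter below is commented out.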
    """
    # 1. Encode User
    user_emb = model.encode(user_profile_text)
    
    # 2. Find Best Job Matches
    job_scores = util.cos_sim(user_emb, job_embeddings)[0]
    
    # Add scores to dataframe
    jobs_scored = jobs_df.copy()
    jobs_scored['similarity'] = job_scores.cpu().numpy()
    
    # Filter by level (Optional rule: Don't show Senior jobs to Entry users unless requested)
    # best_jobs = jobs_scored[jobs_scored['experience_level'] == user_level] # Example filter
    best_jobs = jobs_scored.sort_values('similarity', ascending=False).head(5)
    
    return best_jobs

# TEST IT
user_query = "I am a student good at mathematics and basic coding. I want to build AI models."
print(f"User: {user_query}\n")
recs = finding_pathfinder_recommendations(user_query)
print("Recommended Jobs:")
print(recs[['job_title', 'similarity']])

User: I am a student good at mathematics and basic coding. I want to build AI models.

Recommended Jobs:
                          job_title  similarity
20                      AI Engineer    0.642653
74           AI/ML Engineer Trainee    0.553099
111  Voice/Chat AI Engineer Trainee    0.506865
207      Chatbot Development Intern    0.489125
73          Machine Learning Intern    0.481523
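
The commented-out filter in the function above hints at restricting results by experience level before ranking. One way to do that is sketched below with plain Python lists standing in for the DataFrame. The `experience_level` field, the `ALLOWED_LEVELS` mapping, and the helper `top_jobs_for_level` are assumptions for illustration; the dataset's actual column names and level values may differ.

```python
# Hypothetical rule: which job levels to show for each user level
ALLOWED_LEVELS = {
    "Entry": {"Entry"},
    "Mid": {"Entry", "Mid"},
    "Senior": {"Mid", "Senior"},
}

def top_jobs_for_level(jobs, user_level, k=5):
    """Keep jobs whose level suits the user, then return the k most similar.

    jobs: list of dicts with 'job_title', 'experience_level', 'similarity'.
    Unknown user levels fall back to no filtering.
    """
    allowed = ALLOWED_LEVELS.get(user_level)
    if allowed is not None:
        jobs = [j for j in jobs if j["experience_level"] in allowed]
    return sorted(jobs, key=lambda j: j["similarity"], reverse=True)[:k]

# Made-up rows for illustration only
sample = [
    {"job_title": "Senior ML Architect", "experience_level": "Senior", "similarity": 0.70},
    {"job_title": "ML Trainee", "experience_level": "Entry", "similarity": 0.55},
    {"job_title": "Data Intern", "experience_level": "Entry", "similarity": 0.48},
]
for job in top_jobs_for_level(sample, "Entry", k=2):
    print(job["job_title"], job["similarity"])
```

Filtering before taking the top results matters: filtering afterwards could leave fewer than `k` rows even when enough suitable jobs exist.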


## STEP 4: FEATURE 2 - SKILL GAP ANALYSIS
Once a target job is chosen, we tell the user *which* terms from its description do not appear in their own profile.

In [4]:
def analyze_skill_gap(user_text, target_job_description):
    # Simple approach: Extract keywords from target that are missing in user text
    # (A full semantic subtraction is complex, so we use a set difference proxy)
    
    user_words = set(user_text.lower().split())
    job_words = set(target_job_description.lower().split())
    
    # Filter for interesting words (simple stopword removal)
    stopwords = {'and', 'the', 'to', 'of', 'in', 'a', 'with', 'for'}
    missing = [w for w in job_words if w not in user_words and w not in stopwords and len(w) > 4]
    
    # Note: the missing words are not ranked by importance. Set iteration order is
    # arbitrary, so this returns up to 5 missing terms rather than the 5 most relevant.
    # Punctuation is not stripped either, so tokens such as 'pytorch,' can appear.
    return list(set(missing))[:5]

# DEMO
target_job = recs.iloc[0]['description']
gaps = analyze_skill_gap(user_query, target_job)
print(f"\nDetected Skill Gaps for '{recs.iloc[0]['job_title']}':")
print(gaps)


Detected Skill Gaps for 'AI Engineer':
['databases', 'llms,', 'pytorch,', 'vector', 'tensorflow,']
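
The tokens in the output above still carry trailing commas (`'llms,'`, `'pytorch,'`) because `str.split()` does not strip punctuation, and the order of the result depends on set iteration. A possible variant, offered as a sketch rather than as the notebook's method, tokenizes with a regular expression and returns missing terms in the order they appear in the job description. The names `tokenize` and `extract_missing_terms` and the sample job text are hypothetical.

```python
import re

STOPWORDS = {'and', 'the', 'to', 'of', 'in', 'a', 'with', 'for'}

def tokenize(text):
    # Lowercase, then keep runs of letters and digits; '+' and '#' are kept
    # so that terms like "c++" and "c#" survive.
    return re.findall(r"[a-z0-9+#]+", text.lower())

def extract_missing_terms(user_text, job_text, min_len=5, limit=5):
    """Return up to `limit` job terms absent from the user text, in job-text order."""
    user_words = set(tokenize(user_text))
    seen = set()
    missing = []
    for word in tokenize(job_text):
        if (word not in user_words and word not in STOPWORDS
                and len(word) >= min_len and word not in seen):
            seen.add(word)
            missing.append(word)
    return missing[:limit]

user = "I am a student good at mathematics and basic coding."
job = "Build models with PyTorch, TensorFlow, and vector databases."  # made-up text
print(extract_missing_terms(user, job))
# → ['build', 'models', 'pytorch', 'tensorflow', 'vector']
```

Generic words such as "build" still slip through, so a larger stopword list or a curated skill vocabulary would likely be needed in practice.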


## STEP 5: FEATURE 3 - CAREER PATHS & DEGREES
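Logic: if the user is a "Professional", search Postgraduate and Certificate courses; otherwise search Undergraduate and Diploma courses. The shortlisted courses are then ranked by semantic similarity to the skill gaps.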

In [5]:
def recommend_upskilling(user_level, gaps):
    """
    user_level: "Professional" or another value such as "Student"; anything
        other than "Professional" is treated as a student.
    gaps: List of missing skills found above
    """
    print(f"\nFinding Courses for {user_level} to learn: {gaps}")
    
    # 1. Filter Courses by Level Logic
    if user_level == "Professional":
        # Professionals usually want Masters or specialized certs
        candidates = courses_df[courses_df['course_level'].isin(['Postgraduate', 'Certificate'])]
    else:
        # Students need Degrees or Diplomas
        candidates = courses_df[courses_df['course_level'].isin(['Undergraduate', 'Diploma'])]
        
    # 2. Semantic Search against Gaps
    gap_text = " ".join(gaps)
    gap_emb = model.encode(gap_text)
    
    # Calculate similarity on the filtered set
    candidates = candidates.copy()
    candidate_texts = candidates.apply(lambda x: safe_text(x, ['course_name', 'description']), axis=1).tolist()
    
    if not candidate_texts:
        return pd.DataFrame()  # No candidates at this level
    
    cand_embs = model.encode(candidate_texts)
    scores = util.cos_sim(gap_emb, cand_embs)[0]
    
    candidates['match_score'] = scores.cpu().numpy()
    return candidates.sort_values('match_score', ascending=False).head(3)

# DEMO
courses_rec = recommend_upskilling("Student", gaps)
print("\nRecommended Courses to fill gaps:")
print(courses_rec[['course_name', 'course_level', 'match_score']])


Finding Courses for Student to learn: ['databases', 'llms,', 'pytorch,', 'vector', 'tensorflow,']

Recommended Courses to fill gaps:
                                            course_name course_level  \
1394  IDM International Foundation Diploma in Computing      Diploma   
2649                   Graduate Diploma in Data Science      Diploma   
1401             IDM International Diploma in Computing      Diploma   

      match_score  
1394     0.401598  
2649     0.388976  
1401     0.379758  
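
The if/else in `recommend_upskilling` treats every value other than `"Professional"` as a student. If more user types are added later, a lookup table makes the rule explicit and easier to extend. This is only a sketch: the course-level strings are taken from the code above, while the helper `course_levels_for` and its fallback parameter are hypothetical.

```python
# Course levels per user type; values mirror the filter in recommend_upskilling
COURSE_LEVELS = {
    "Professional": ["Postgraduate", "Certificate"],
    "Student": ["Undergraduate", "Diploma"],
}

def course_levels_for(user_level, default="Student"):
    """Return the course levels to search for a given user type.

    Unrecognized user types fall back to the `default` entry, which matches
    the behavior of the original else-branch.
    """
    return COURSE_LEVELS.get(user_level, COURSE_LEVELS[default])

print(course_levels_for("Professional"))
print(course_levels_for("Entry"))  # unrecognized, so falls back to "Student"
```

The returned list could then be passed to `courses_df['course_level'].isin(...)` in place of the two hard-coded lists.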
