### Computing similarity between occupations (cosine similarity)

Instead of building the full pairwise similarity matrix, we compute similarity on demand, that is, only when we need to compare one job against the others. The job-transition recommender system we are building does not need the full matrix either.
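To make the trade-off concrete, here is a minimal sketch on random stand-in vectors (not the real SBERT embeddings; `demo_embeddings`, `n_occupations`, and `dim` are made-up names and sizes for illustration). It computes cosine similarity directly with NumPy and checks that a single on-demand query gives the same scores as the matching row of the full matrix, while producing only `n` values instead of `n × n`.

```python
import numpy as np

# Stand-in data: random vectors, NOT the real SBERT embeddings.
rng = np.random.default_rng(0)
n_occupations, dim = 200, 32  # hypothetical sizes
demo_embeddings = rng.normal(size=(n_occupations, dim))

# L2-normalise each row so that a dot product equals cosine similarity.
unit = demo_embeddings / np.linalg.norm(demo_embeddings, axis=1, keepdims=True)

# Full matrix: every occupation against every other one, n x n scores.
full_matrix = unit @ unit.T

# On demand: one query occupation against all occupations, n scores.
query_idx = 5
on_demand = unit @ unit[query_idx]

print(full_matrix.shape, on_demand.shape)
print(np.allclose(on_demand, full_matrix[query_idx]))
```

The on-demand vector is just one row of the full matrix, so nothing is lost by skipping the matrix when only one occupation is queried at a time.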

In [1]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

# Load back if needed
df = pd.read_csv("ExtractedSummaries_with_idx.csv")
embeddings = np.load("SBERT_embeddings_summaries.npy")

def get_similar_occupations(
    job_code: str,
    top_k: int = 10,
    exclude_self: bool = True
) -> pd.DataFrame:
    """
    Return the top_k most similar occupations (by SBERT cosine similarity)
    to the occupation with the given O*NET-SOC Code.
    """
    # Find the rows with this code
    matches = df.index[df["O*NET-SOC Code"] == job_code].tolist()
    if not matches:
        raise ValueError(f"Job code {job_code} not found in dataframe.")
    
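    # If the code appears more than once, take the first match.
    # Note: idx is used as a position into `embeddings`, so df must keep its
    # default RangeIndex and the same row order as the embeddings file.
    idx = matches[0]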
    
    query_vec = embeddings[idx].reshape(1, -1)
    sims = cosine_similarity(query_vec, embeddings)[0]  # shape (n_occupations,)

    # Create a copy of df with similarity scores
    result = df.copy()
    result["similarity"] = sims

    if exclude_self:
        result = result[result.index != idx]

    result = result.sort_values(by="similarity", ascending=False).head(top_k)

    return result[[
        "O*NET-SOC Code",
        "Element Name",
        "similarity",
        "Summary"
    ]]
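The function above copies the whole dataframe and sorts every row on each call. If that ever becomes a bottleneck, one possible alternative is to pick the top scores with `np.argpartition` and sort only those. The helper below, `top_k_indices`, is a hypothetical sketch and not part of the notebook; it works on any 1-D array of similarity scores.

```python
from typing import Optional

import numpy as np


def top_k_indices(sims: np.ndarray, top_k: int, exclude: Optional[int] = None) -> np.ndarray:
    """Return positions of the top_k highest scores, best first (hypothetical helper)."""
    scores = np.array(sims, dtype=float)  # work on a copy, leave the input untouched
    available = scores.size
    if exclude is not None:
        scores[exclude] = -np.inf  # the query itself can never be selected
        available -= 1
    k = min(top_k, available)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition places the k largest scores first without a full sort.
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates by descending score.
    return candidates[np.argsort(-scores[candidates])]


demo_sims = np.array([0.2, 1.0, 0.9, 0.5, 0.7])
demo_top = top_k_indices(demo_sims, top_k=3, exclude=1)
print(demo_top)  # [2 4 3]
```

With the notebook's variables, `df.iloc[top_k_indices(sims, top_k, exclude=idx)]` should select the same rows as the sort-based version, up to the ordering of exact ties and assuming `df` rows are in the same order as `embeddings`.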


#### Test

In [2]:
df[df["O*NET-SOC Code"] == "41-2011.00"]


Unnamed: 0,embedding_idx,O*NET-SOC Code,Element Name,Description,Skills,Tasks,Summary
545,545,41-2011.00,Cashiers,Receive and disburse money in establishments o...,"[{'skill': 'Reading Comprehension', 'importanc...","[{'Task': 'Receive payment by cash, check, cre...",Individuals in this role interact with custome...


In [3]:
get_similar_occupations("41-2011.00", top_k=10)


Unnamed: 0,O*NET-SOC Code,Element Name,similarity,Summary
578,43-4051.00,Customer Service Representatives,0.75894,Individuals in this role engage with customers...
855,53-3031.00,Driver/Sales Workers,0.746835,Individuals in this role navigate established ...
548,41-2022.00,Parts Salespersons,0.744674,Individuals in this role engage with customers...
547,41-2021.00,Counter and Rental Clerks,0.741029,Individuals in this role interact with custome...
570,43-3041.00,Gambling Cage Workers,0.727216,These professionals handle financial exchanges...
585,43-4141.00,New Accounts Clerks,0.699859,Individuals in this role engage with prospecti...
565,43-2011.00,"Switchboard Operators, Including Answering Ser...",0.693831,Individuals in this role manage incoming and o...
499,35-3023.00,Fast Food and Counter Workers,0.685517,Individuals in this role interact directly wit...
586,43-4151.00,Order Clerks,0.672407,Individuals in this role manage incoming reque...
568,43-3021.00,Billing and Posting Clerks,0.670839,Individuals in this role are responsible for m...


In [4]:
df[df["Element Name"].str.contains("Shampoo", case=False, na=False)]

Unnamed: 0,embedding_idx,O*NET-SOC Code,Element Name,Description,Skills,Tasks,Summary
532,532,39-5093.00,Shampooers,Shampoo and rinse customers' hair.,"[{'skill': 'Reading Comprehension', 'importanc...","[{'Task': 'Massage, shampoo, and condition pat...",Professionals in this role cleanse and conditi...


In [5]:
get_similar_occupations("39-5093.00", top_k=10)

Unnamed: 0,O*NET-SOC Code,Element Name,similarity,Summary
529,39-5012.00,"Hairdressers, Hairstylists, and Cosmetologists",0.691515,Professionals in this field provide a range of...
533,39-5094.00,Skincare Specialists,0.683966,Professionals in this role interact with clien...
528,39-5011.00,Barbers,0.67857,Professionals in this role engage with patrons...
396,29-1213.00,Dermatologists,0.616749,Professionals in this field conduct thorough e...
883,53-7061.00,Cleaners of Vehicles and Equipment,0.606082,Individuals in this role are responsible for c...
508,37-2011.00,"Janitors and Cleaners, Except Maids and Housek...",0.57524,Individuals in this role are responsible for m...
506,37-1011.00,First-Line Supervisors of Housekeeping and Jan...,0.573375,These professionals oversee and coordinate the...
454,31-2022.00,Physical Therapist Aides,0.568396,Professionals in this role prepare treatment a...
531,39-5092.00,Manicurists and Pedicurists,0.561356,Professionals in this role engage with clients...
517,39-2021.00,Animal Caretakers,0.557218,Individuals in this role are responsible for t...
