### Creating embeddings for each occupation using SBERT.

1. Loading the CSV file with the LLM-extracted descriptions.

In [2]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("ExtractedSummaries.csv")

print(df.columns)
print(df.head()[["O*NET-SOC Code", "Element Name", "Summary"]])
print("Number of occupations:", len(df))


Index(['O*NET-SOC Code', 'Element Name', 'Description', 'Skills', 'Tasks',
       'Summary'],
      dtype='object')
  O*NET-SOC Code                         Element Name  \
0     11-1011.00                     Chief Executives   
1     11-1011.03        Chief Sustainability Officers   
2     11-1021.00      General and Operations Managers   
3     11-2011.00  Advertising and Promotions Managers   
4     11-2021.00                   Marketing Managers   

                                             Summary  
0  Individuals in this role determine and formula...  
1  Individuals in this role develop and execute s...  
2  Individuals in this role are responsible for p...  
3  Individuals in this role develop and execute s...  
4  Professionals in this field plan, direct, and ...  
Number of occupations: 894


As a sanity check, we verify that no summaries are missing.

In [4]:
# Checking for missing summaries
missing_summaries = df["Summary"].isnull().sum()
print(f"Number of missing summaries: {missing_summaries}")  

Number of missing summaries: 0
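`isnull()` only catches true missing values, so an empty or whitespace-only summary would still pass the check above. A minimal sketch of a stricter check follows; the `toy` frame and the helper `count_blank_summaries` are hypothetical names used for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for df (assumption: the real df comes from ExtractedSummaries.csv)
toy = pd.DataFrame({"Summary": ["Plans budgets.", "   ", "", None]})

def count_blank_summaries(frame, column="Summary"):
    """Count entries that are missing, empty, or whitespace-only."""
    # Fill missing values first, then strip whitespace before comparing to ""
    cleaned = frame[column].fillna("").astype(str).str.strip()
    return int((cleaned == "").sum())

blank_count = count_blank_summaries(toy)
print("Blank or missing summaries:", blank_count)
```

On the real data, the same helper could be called as `count_blank_summaries(df)`.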


2. Loading SBERT model

In [3]:
from sentence_transformers import SentenceTransformer

model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

3. Creating embeddings for each occupation

In [5]:
import numpy as np

# Use the LLM-generated summaries
# (fill missing values before casting: astype(str) would turn NaN into the string "nan")
texts = df["Summary"].fillna("").astype(str).tolist()

embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True  # unit-length vectors, so cosine similarity equals the dot product
)

print("Embeddings shape:", embeddings.shape)  # (894, 384) for MiniLM


Batches: 100%|██████████| 28/28 [00:08<00:00,  3.47it/s]

Embeddings shape: (894, 384)





The shape (894, 384) means 894 occupations × 384 features: all-MiniLM-L6-v2 produces 384-dimensional vectors.
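Because the embeddings were encoded with `normalize_embeddings=True`, each row has unit length, so a plain matrix product gives cosine similarities directly. The sketch below illustrates this on random toy vectors of the same width (384); `toy_embeddings` and `top_k_similar` are hypothetical names, and the random data only stands in for the real SBERT output.

```python
import numpy as np

# Toy stand-in for the SBERT matrix: 5 random vectors with 384 dimensions
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 384)).astype(np.float32)

# L2-normalize each row, mimicking normalize_embeddings=True
toy_embeddings /= np.linalg.norm(toy_embeddings, axis=1, keepdims=True)

# For unit-length rows, the dot product equals the cosine similarity
similarity = toy_embeddings @ toy_embeddings.T

def top_k_similar(matrix, query_idx, k=3):
    """Return indices and scores of the k rows most similar to row query_idx."""
    scores = matrix @ matrix[query_idx]
    scores[query_idx] = -np.inf          # exclude the query row itself
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return order, scores[order]

neighbours, neighbour_scores = top_k_similar(toy_embeddings, query_idx=0, k=3)
print("Nearest rows to row 0:", neighbours)
```

With the real matrix, `top_k_similar(embeddings, i)` would return the row positions of the occupations closest to occupation `i`.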

4. Saving embeddings

In [6]:
# Save embeddings as a .npy file
np.save("SBERT_embeddings_summaries.npy", embeddings)

# (Optional) save dataframe with an explicit index we can use later
df.reset_index(drop=False, inplace=True)
df.rename(columns={"index": "embedding_idx"}, inplace=True)
df.to_csv("ExtractedSummaries_with_idx.csv", index=False)


We save the results so that we do not need to recompute the embeddings every time.

Now we have:
- SBERT_embeddings_summaries.npy – matrix of vectors
- ExtractedSummaries_with_idx.csv – the occupations plus an embedding_idx column that matches each row of the embedding matrix
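A later notebook could reload both files and use `embedding_idx` to fetch the vector for a given occupation. The following is a minimal sketch under stated assumptions: the helpers `load_aligned` and `vector_for` are hypothetical, and the demo writes small toy files to a temporary directory instead of reading the real `SBERT_embeddings_summaries.npy` and `ExtractedSummaries_with_idx.csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def load_aligned(npy_path, csv_path):
    """Load the embedding matrix and its table, checking that row counts match."""
    vectors = np.load(npy_path)
    table = pd.read_csv(csv_path)
    if len(table) != vectors.shape[0]:
        raise ValueError("Row count of the CSV does not match the embedding matrix")
    return vectors, table

def vector_for(code, vectors, table):
    """Return the embedding of the occupation with the given O*NET-SOC code."""
    idx = table.loc[table["O*NET-SOC Code"] == code, "embedding_idx"].iloc[0]
    return vectors[int(idx)]

# Demo on toy files (assumption: same layout as the files saved above)
with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "toy_embeddings.npy")
    csv_path = os.path.join(tmp, "toy_with_idx.csv")
    np.save(npy_path, np.eye(3, dtype=np.float32))
    pd.DataFrame({
        "embedding_idx": [0, 1, 2],
        "O*NET-SOC Code": ["11-1011.00", "11-1011.03", "11-1021.00"],
    }).to_csv(csv_path, index=False)

    vectors, table = load_aligned(npy_path, csv_path)
    vec = vector_for("11-1011.03", vectors, table)

print("Loaded matrix shape:", vectors.shape)
```

The row-count check is a cheap guard against pairing an embedding file with a CSV that was regenerated separately.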