# Skill Extraction with CAG
**Objective:** Extract relevant skills from text by first retrieving a candidate set with semantic search (MPNet embeddings + FAISS), then using an LLM (Gemma via vLLM) to refine the selection from those candidates.
- **Date:** 9-June-25
- **Author:** Anket Patil

## 1. Install Dependencies
Installs the required libraries:

* `faiss-cpu`: For similarity search.
* `sentence-transformers`: For creating embeddings.

In [None]:
!pip install -q sentence-transformers faiss-cpu

## 2. Import Libraries
Imports essential Python libraries for data handling, numerical operations, model interaction, and FAISS.

In [None]:
import os
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import faiss
import time
import random


## 3. Create FAISS Index for Semantic Search
Loads the ESCO skills taxonomy, embeds each skill label with the `all-mpnet-base-v2` model, and builds a FAISS index so the most relevant skills for any course or job description can be retrieved quickly.

In [None]:
# Load ESCO skill data
esco_df = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/taxonomies/ESCO_skills_Taxonomy.csv")  # replace with your file if needed
skill_names = esco_df["preferredLabel"].tolist()

# Embed ESCO skills using SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
print("Embedding ESCO skills...")
esco_embeddings = model.encode(skill_names, convert_to_numpy=True, show_progress_bar=True)

# ⚡ Normalize & index using FAISS (inner product of L2-normalized vectors = cosine similarity)
dimension = esco_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
faiss.normalize_L2(esco_embeddings)
index.add(esco_embeddings)
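
The normalization step matters because `IndexFlatIP` ranks by raw inner product. The following is a minimal sketch, using toy 2-D vectors rather than real embeddings, of why normalizing first makes those scores equal to cosine similarity. The helper `l2_normalize` is hypothetical and only mirrors what `faiss.normalize_L2` does in place.

```python
import numpy as np

def l2_normalize(vectors):
    """Return a row-wise L2-normalized copy of a 2D array."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)  # guard against zero vectors

# Toy 2-D vectors standing in for real 768-dimensional MPNet embeddings
a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0]])

# Cosine similarity computed from its definition
cosine = float((a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product after normalization, which is what IndexFlatIP scores
inner_product = float((l2_normalize(a) @ l2_normalize(b).T).item())

print(round(cosine, 4), round(inner_product, 4))
```

The two printed values should agree up to floating-point rounding. Without the normalization step, longer vectors would receive larger scores regardless of direction.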


## 4. Find Most Relevant ESCO Skills
Takes a course description, turns it into a vector using the embedding model, and searches the FAISS index to return the top 50 most relevant ESCO skills based on similarity.

In [None]:
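def get_top_esco_skills(course_desc, top_k=50):
    # Encode as a 2D float32 batch of one, the shape FAISS expects
    emb = model.encode([course_desc], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(emb)  # in-place, so the scores are cosine similarities
    scores, indices = index.search(emb, top_k)
    # FAISS pads with -1 when fewer than top_k results exist
    return [skill_names[i] for i in indices[0] if i != -1]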
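
To make the search step concrete, here is a rough NumPy sketch of what an exhaustive inner-product search computes: score every row, then keep the highest `top_k`. The function `brute_force_top_k` and the toy names and vectors are illustrative assumptions, not part of the notebook, and FAISS performs the same comparison far more efficiently.

```python
import numpy as np

def brute_force_top_k(query_vec, corpus, names, top_k=2):
    """Rank corpus rows by cosine similarity to the query (exhaustive search)."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    m = np.asarray(corpus, dtype=np.float32)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q                       # one inner product per corpus row
    order = np.argsort(-scores)[:top_k]  # indices of the highest scores first
    return [(names[i], float(scores[i])) for i in order]

# Hypothetical toy data standing in for ESCO labels and their embeddings
toy_names = ["skill A", "skill B", "skill C"]
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ranked = brute_force_top_k([1.0, 0.2], toy_corpus, toy_names, top_k=2)
for name, score in ranked:
    print(name, round(score, 3))
```

Returning the scores alongside the names, as this sketch does, can help when choosing a similarity cutoff instead of a fixed `top_k`.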


## 5. Load Syllabi Dataset
Loads a preprocessed dataset of 50 course syllabi from OpenSyllabus.

In [None]:
syllabi_data = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/syllabi-data/preprocessed_50_opensyllabus_syllabi_data.csv")
syllabi_data

## 6. Test Skill Extraction on a Single Course Description
This cell randomly selects one course from the dataset, extracts its top 50 relevant ESCO skills using the semantic search pipeline, and displays the time taken for the operation in milliseconds.

In [None]:

# Randomly pick a valid row position (0 to len(syllabi_data) - 1)
rand_idx = random.randrange(len(syllabi_data))
row = syllabi_data.iloc[rand_idx]

course_title = row['title']
course_desc = row['description']

print(f"Course Title: {course_title}")
print(f"Course Description:\n{course_desc}\n")

start_time = time.perf_counter()  # monotonic clock, better suited to short intervals
single_skills = get_top_esco_skills(course_desc)
end_time = time.perf_counter()

time_ms = (end_time - start_time) * 1000
print(f"Time taken: {round(time_ms, 2)} ms")
print(f"\nTop 50 ESCO Skills:\n{single_skills}")
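
A single timed call can be noisy, since the first query often includes one-off warm-up cost. The sketch below shows one possible way to time a call several times and report the median. The helper name `time_call_ms` and its `repeats` parameter are assumptions introduced here for illustration.

```python
import statistics
import time

def time_call_ms(fn, *args, repeats=5, **kwargs):
    """Call fn repeatedly; return (last result, median duration in milliseconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append((time.perf_counter() - start) * 1000)
    return result, statistics.median(durations)

# Demonstration with a cheap built-in so the sketch runs on its own
total, median_ms = time_call_ms(sum, [1, 2, 3], repeats=3)
print(total, f"{median_ms:.4f} ms")
```

In this notebook it could be used as `single_skills, ms = time_call_ms(get_top_esco_skills, course_desc)`.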


## 7. Extract Top 50 ESCO Skills for Each Course
Loops through each course description in the syllabi dataset and uses the `get_top_esco_skills` function to find its top 50 relevant ESCO skills. Rows with a missing or blank description get an empty list.


In [None]:
def get_top_k_skills_bulk(df, text_col='description', top_k=50):
    top_skills_list = []
    for course_desc in df[text_col]:
        # Skip missing (NaN) or blank descriptions
        if not isinstance(course_desc, str) or not course_desc.strip():
            top_skills_list.append([])
            continue

        skills = get_top_esco_skills(course_desc, top_k=top_k)
        top_skills_list.append(skills)

    return top_skills_list




In [None]:
# ⏱️ Measure total time in milliseconds
start_time = time.time()

# Bulk processing
syllabi_data['top_50_esco_skills'] = get_top_k_skills_bulk(syllabi_data)

end_time = time.time()
total_time_ms = (end_time - start_time) * 1000
print(f"Total time taken for all rows: {round(total_time_ms, 2)} ms")
syllabi_data[['title','description','top_50_esco_skills']]
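
The loop above encodes and searches one description at a time. Since `SentenceTransformer.encode` accepts a list of sentences and a FAISS `search` accepts a 2D array of queries, a batched variant is one possible speed-up. The sketch below is an assumption-laden illustration rather than the notebook's method: the function `get_top_k_skills_batched` is hypothetical, and it takes the model, index, and label list as arguments so it does not depend on notebook globals.

```python
import numpy as np

def get_top_k_skills_batched(texts, model, index, names, top_k=50):
    """Encode all non-empty texts in one call and search the index once."""
    texts = list(texts)
    results = [[] for _ in texts]
    # Positions of usable descriptions; missing or blank ones keep an empty list
    valid = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    if not valid:
        return results

    emb = model.encode([texts[i] for i in valid], convert_to_numpy=True)
    emb = np.array(emb, dtype=np.float32)  # contiguous float32 copy for FAISS
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb /= np.maximum(norms, 1e-12)        # L2-normalize so scores are cosine

    _, indices = index.search(emb, top_k)  # one search call for every query
    for pos, row in zip(valid, indices):
        # FAISS uses -1 for missing hits when the index holds fewer than top_k items
        results[pos] = [names[j] for j in row if j >= 0]
    return results
```

With the objects defined earlier, a call might look like `get_top_k_skills_batched(syllabi_data['description'], model, index, skill_names)`. Whether this is faster in practice depends on batch size and hardware, so it is worth timing against the row-by-row version.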