# Skill Extraction with CAG
**Objective:** Extract relevant skills from text by first retrieving a candidate set with semantic search (MPNet embeddings + FAISS), then using an LLM (Gemma via vLLM) to refine the selection from those candidates.
- **Date:** 9-June-25
- **Author:** Anket Patil

## 1. Install Dependencies
Installs the required libraries:

* `faiss-cpu`: for similarity search.
* `sentence-transformers`: for creating embeddings.

In [None]:
!pip install -q sentence-transformers faiss-cpu

## 2. Import Libraries
Imports essential Python libraries for data handling, numerical operations, model interaction, and FAISS.

In [None]:
import os
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import faiss
import time
import random


## 3. Create FAISS Index for Semantic Search
Loads the ESCO skills, turns them into vectors using the `all-mpnet-base-v2` model, and builds a FAISS index so we can quickly find the most relevant skills for any course or job description.

In [None]:
# Load ESCO skill data
esco_df = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/taxonomies/ESCO_skills_Taxonomy.csv")  # replace with your file if needed
skill_names = esco_df["preferredLabel"].tolist()

# Embed ESCO skills using SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
print("Embedding ESCO skills...")
esco_embeddings = model.encode(skill_names, convert_to_numpy=True, show_progress_bar=True)

# ⚡ Normalize & index using FAISS (cosine similarity = inner product of L2-normalized vectors)
dimension = esco_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
faiss.normalize_L2(esco_embeddings)
index.add(esco_embeddings)


Embedding ESCO skills...
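The index relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity. The following is a minimal sketch of that identity using only the standard library and made-up 3-dimensional vectors; the helper names (`l2_normalize`, `dot`, `cosine`) are hypothetical and are not part of the notebook or of FAISS.

```python
import math

def l2_normalize(vec):
    """Return a copy of vec scaled to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (made-up values, not real model output)
skill_vec = [3.0, 4.0, 0.0]
query_vec = [6.0, 8.0, 0.0]   # same direction as skill_vec, different length
other_vec = [0.0, 0.0, 2.0]   # orthogonal to skill_vec

# Inner product after normalization, which is what an inner-product index scores
ip_same = dot(l2_normalize(skill_vec), l2_normalize(query_vec))
ip_orth = dot(l2_normalize(skill_vec), l2_normalize(other_vec))
```

Because vector length is removed by the normalization, `ip_same` matches `cosine(skill_vec, query_vec)` even though the two vectors differ in magnitude.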

## 4. Find Most Relevant ESCO Skills
Takes a course description, turns it into a vector using the embedding model, and searches the FAISS index to return the top 50 most relevant ESCO skills based on similarity.

In [None]:
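def get_top_esco_skills(course_desc, top_k=50):
    """Return the top_k ESCO skill labels most similar to course_desc."""
    # Encode as a batch of one so the result is a 2-D (1, dim) array
    emb = model.encode([course_desc], convert_to_numpy=True)
    # Normalize in place so the inner product equals cosine similarity
    faiss.normalize_L2(emb)
    _, indices = index.search(emb, top_k)
    return [skill_names[i] for i in indices[0]]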
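`get_top_esco_skills` returns only the labels and drops the similarity scores that `index.search` also returns. If the scores are useful for a later filtering step, a variant could return `(label, score)` pairs. The sketch below illustrates the idea with a brute-force search over toy 2-dimensional vectors, so it runs without the model or FAISS; `top_k_by_inner_product` and the toy data are hypothetical.

```python
import math

def normalize(vec):
    """Return a unit-length copy of vec."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_by_inner_product(query, vectors, labels, k):
    """Return the k (label, score) pairs with the highest cosine score."""
    q = normalize(query)
    scored = [
        (label, sum(a * b for a, b in zip(q, normalize(v))))
        for label, v in zip(labels, vectors)
    ]
    # Highest score first, mirroring the order of an inner-product search
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up vectors standing in for skill embeddings
toy_labels = ["organic chemistry", "film studies", "data ethics"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.1]

results = top_k_by_inner_product(toy_query, toy_vectors, toy_labels, k=2)
```

Keeping the score next to each label would make it possible to apply a similarity cutoff instead of always passing a fixed 50 candidates forward.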


## 5. Load Syllabi Dataset
Loads a preprocessed dataset of 50 course syllabi from OpenSyllabus.

In [None]:
syllabi_data = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/syllabi-data/preprocessed_50_opensyllabus_syllabi_data.csv")
syllabi_data

Unnamed: 0,id,title,code,in_id,institution,in_country,in_state,in_city,in_latitute,in_longitude,in_description,description,learning_outcomes,year
0,4904852663176,Film Appreciation,COMM 2366,20005,South Plains College,United States,Texas,Levelland,33.576283,-102.367622,South Plains College (SPC) is a public communi...,"survey and analysis of cinema , including hist...",communications skills — to include effective w...,2022
1,661424964946,Quality Improvement Tools,PROD1252,1963,Niagara College,Canada,Ontario,Niagara Falls,43.079483,-79.09053,The Niagara College of Applied Arts and Techno...,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,2022
2,6004364299427,"Data Privacy, Security, and Ethics",CSIS-2700,18664,Webster University,United States,Missouri,Webster Groves,38.592548,-90.357338,Webster University is a private university wit...,there is a subtle balance between improvements...,,2022
3,8014409002470,Consumer Behavior,MKT 358,18014,Upper Iowa University,United States,Iowa,Fayette,42.839714,-91.797958,Upper Iowa University (UIU) is a private unive...,this course provides a survey of research find...,an overview of consumer behavior and terminolo...,2022
4,15728170259237,General Organic Chemistry I,CHM-235,20492,Chandler–Gilbert Community College,United States,Arizona,Chandler,33.295132,-111.797127,"public community college in Chandler, Arizona,...",rigorous introduction to chemistry of carbon -...,describe the bonding properties of the element...,2022
5,16097537445411,Physical Geology,GEOL-1403,20047,Tyler Junior College,United States,Texas,Tyler,32.333752,-95.283379,Tyler Junior College (TJC) is a public communi...,introduction to the study of the materials and...,describe how the scientific method has led to ...,2022
6,15350213134595,Writing Competency through Genres,ENGL 100A,17371,San Jose State University,United States,California,San Jose,37.33939,-121.894958,San José State University (San Jose State or S...,"prepares students for 100w through drafting , ...",use correct and situationally appropriate sent...,2022
7,274877917243,ADMINISTRATIVE LAW,PAD 6605,20884,Florida Gulf Coast University,United States,Florida,Fort Myers,26.465445,-81.773735,Florida Gulf Coast University (FGCU) is a publ...,administrative law affects every aspect of ame...,examine how law affects agencies ’ policymakin...,2022
8,14001593392419,Criminal Justice Research and Writing,CJUS-230,20123,Liberty University,United States,Virginia,Lynchburg,37.352421,-79.180183,Liberty University (LU) is a private Baptist u...,this course is an introductory course to resea...,research a topic in criminal justice thoroughl...,2022
9,7181185329132,Computer literacy,INF0203,296,Vytautas Magnus University,Lithuania,,Kaunas,54.898335,23.913889,Vytautas Magnus University (VMU) (Lithuanian: ...,course introduces main concepts of computer sc...,to formalize and specify real - world problems...,2022


## 6. Test Skill Extraction on a Single Course Description
This cell randomly selects one course from the dataset, extracts its top 50 relevant ESCO skills using the semantic search pipeline, and displays the time taken for the operation in milliseconds.

In [None]:

# Randomly pick a row index (randint includes both endpoints)
rand_idx = random.randint(0, len(syllabi_data) - 1)
row = syllabi_data.loc[rand_idx]

course_title = row['title']
course_desc = row['description']

print(f"Course Title: {course_title}")
print(f"Course Description:\n{course_desc}\n")

start_time = time.time()
single_skills = get_top_esco_skills(course_desc)
end_time = time.time()

time_ms = (end_time - start_time) * 1000
print(f"Time taken: {round(time_ms, 2)} ms")
print(f"\nTop 50 ESCO Skills:\n{single_skills}")


Course Title: Financial Decision Making for Managers
Course Description:
mgmt 640 combines the study of financial accounting , finance , and managerial accounting into a concentrated one - semester course . business organizations , both for - profit and non - profit , employ financial managers in a wide variety of roles to gather and report on company financial performance ; direct investment decisions ; implement cash management strategies ; prepare budgets and establish operating performance measures ; and participate in the development and implementation of long - term business strategies . this course is an introduction to the management of a firm 's financial and operational resources . it is intended as a foundation - level course in corporate financial management for students pursuing the master of science in management with specialization , or as a perquisite for students enrolled in the financial management and accounting or healthcare degree programs . emphasis is placed on h
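A single timed call can be noisy, for example when the first call pays a one-off warm-up cost. One possible refinement, sketched below, repeats the call and reports the median using `time.perf_counter`. The helper `time_call_ms` and the stand-in `fake_search` are hypothetical; in the notebook, `get_top_esco_skills` would be passed in place of `fake_search`.

```python
import statistics
import time

def time_call_ms(fn, *args, repeats=5, **kwargs):
    """Call fn `repeats` times; return (last result, median elapsed ms)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        timings.append((time.perf_counter() - start) * 1000)
    return result, statistics.median(timings)

# Stand-in workload so the sketch runs without the model or the index
def fake_search(text, top_k=50):
    return sorted(set(text.split()))[:top_k]

skills, median_ms = time_call_ms(fake_search, "b a c a", top_k=2)
print(f"Median time: {median_ms:.3f} ms")
```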

## 7. Extract Top 50 ESCO Skills for Each Course
Loops through each course description in the syllabi dataset and uses the `get_top_esco_skills` function to find the top 50 relevant ESCO skills for each one. Rows with a missing or empty description get an empty list.


In [None]:
def get_top_k_skills_bulk(df, text_col='description', top_k=50):
    top_skills_list = []
    for i, row in df.iterrows():
        course_desc = row[text_col]
        if not isinstance(course_desc, str) or not course_desc.strip():
            top_skills_list.append([])
            continue

        skills = get_top_esco_skills(course_desc, top_k=top_k)
        top_skills_list.append(skills)

    return top_skills_list
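The loop above encodes and searches one description at a time. Since `model.encode` accepts a list of texts and `index.search` accepts a matrix of query vectors, the valid descriptions could instead be processed in one batch. The sketch below shows only the bookkeeping needed to skip empty rows and put results back in the original order; `bulk_top_k`, `search_batch`, and `fake_search_batch` are hypothetical names, and the commented wiring to the notebook's objects is an untested assumption.

```python
def bulk_top_k(descriptions, search_batch):
    """Run search_batch once on the valid texts; return one list per input row."""
    valid_positions = [
        i for i, text in enumerate(descriptions)
        if isinstance(text, str) and text.strip()
    ]
    valid_texts = [descriptions[i] for i in valid_positions]
    batch_results = search_batch(valid_texts) if valid_texts else []

    # Rows with missing or blank text keep an empty list, as in the loop version
    results = [[] for _ in descriptions]
    for pos, skills in zip(valid_positions, batch_results):
        results[pos] = list(skills)
    return results

# With the notebook's objects, search_batch might look like this (assumption):
#   embs = model.encode(texts, convert_to_numpy=True)
#   faiss.normalize_L2(embs)
#   _, idx = index.search(embs, 50)
#   return [[skill_names[j] for j in row] for row in idx]

# Stand-in so the sketch runs without the model or the index
def fake_search_batch(texts):
    return [[t.upper()] for t in texts]

demo = bulk_top_k(["chemistry", None, "  ", "law"], fake_search_batch)
```

Whether batching is noticeably faster for only 50 rows would need to be measured; the potential benefit is expected to grow with the number of descriptions.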




In [None]:
# ⏱️ Measure total time in milliseconds
start_time = time.time()

# Bulk processing
syllabi_data['top_50_esco_skills'] = get_top_k_skills_bulk(syllabi_data)

end_time = time.time()
total_time_ms = (end_time - start_time) * 1000
print(f"Total time taken for all rows: {round(total_time_ms, 2)} ms")
syllabi_data[['title','description','top_50_esco_skills']]

Total time taken for all rows: 850.72 ms


Unnamed: 0,title,description,top_50_esco_skills
0,Film Appreciation,"survey and analysis of cinema , including hist...","[film studies, film production process, develo..."
1,Quality Improvement Tools,you will be provided with a practical understa...,"[quality control systems, statistical quality ..."
2,"Data Privacy, Security, and Ethics",there is a subtle balance between improvements...,"[data ethics, business analytics, use analytic..."
3,Consumer Behavior,this course provides a survey of research find...,"[marketing principles, satisfy customers, prom..."
4,General Organic Chemistry I,rigorous introduction to chemistry of carbon -...,"[organic chemistry, develop chemical products,..."
5,Physical Geology,introduction to the study of the materials and...,"[Earth science, use earth sciences tools, geoc..."
6,Writing Competency through Genres,"prepares students for 100w through drafting , ...","[study relevant writing, teach writing, provid..."
7,ADMINISTRATIVE LAW,administrative law affects every aspect of ame...,"[make legislative decisions, legal department ..."
8,Criminal Justice Research and Writing,this course is an introductory course to resea...,"[perform copywriting, teach journalistic pract..."
9,Computer literacy,course introduces main concepts of computer sc...,"[computer technology, computer engineering, pr..."
