Module Description:
-------------------
Class to extract skills from text and align them to an existing taxonomy

Ownership:
----------
Project: Leveraging Artificial intelligence for Skills Extraction and Research (LAiSER)
Owner:  George Washington University Institute of Public Policy
        Program on Skills, Credentials and Workforce Policy
        Media and Public Affairs Building
        805 21st Street NW
        Washington, DC 20052
        PSCWP@gwu.edu
        https://gwipp.gwu.edu/program-skills-credentials-workforce-policy-pscwp

License:
--------
Copyright 2024 George Washington University Institute of Public Policy

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the “Software”), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Revision History:
-----------------
| Rev No. | Date | Author | Description |
|---------|------|--------|-------------|
| [1.0.0] | 06/05/2024 | Satya Phanindra K. | Created a standalone notebook for skill extraction |
| [1.0.1] | 06/11/2024 | Satya Phanindra K. | Added GPU support for processing |
| [1.0.1] | 06/20/2024 | Satya Phanindra K. | Added error handling and logging |
| [1.0.2] | 07/01/2024 | Satya Phanindra K. | Threshold update for similarity and AI model |
| [1.0.3] | 07/10/2024 | Satya Phanindra K. | Added a separate set of functions for LLM use cases |
| [1.0.4] | 07/13/2024 | Satya Phanindra K. | Add descriptions to each method |
| [1.0.5] | 07/18/2024 | Satya Phanindra K. | Added CONDITIONAL GPU support for LLM |
| [1.0.6] | 07/22/2024 | Satya Phanindra K. | Added support for SkillNer model for skill extraction, if GPU not available |
| [1.0.7] | 07/25/2024 | Satya Phanindra K. | Calculate cosine similarities in bulk for optimal performance. |
| [1.0.8] | 07/28/2024 | Satya Phanindra K. | Error handling for empty list outputs from extract_raw function |
| [1.0.9] | 11/24/2024 | Prudhvi Chekuri | Add functionality to extract skills from syllabi data. |
| [1.1.0] | 03/14/2025 | Deepika Reddygari | Import laiser as a python package. |
| [1.1.1] | 03/15/2025 | Bharat Khandelwal | Resolved all issues related to importing laiser as a python package. |
| [1.1.2] | 03/19/2025 | Satya Phanindra K. | Update installation with uv. |
| [1.1.3] | 04/02/2025 | Prudhvi Chekuri | Fix dependency issues. |

## Install and import LAiSER

In [1]:
!pip install uv
!uv pip install dev-laiser -q

Collecting uv
  Downloading uv-0.6.12-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.6.12-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.8/16.8 MB 78.7 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.6.12


**NOTE**: If running on Google Colab, restart the runtime to get a clean state before executing the code below. (**REQUIRED**)
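
After the restart, it can help to confirm that the package installed above is visible to the fresh kernel. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper (not part of LAiSER), and `dev-laiser` is the distribution name taken from the install command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "dev-laiser" is the name used in the install cell; the helper itself is illustrative.
version = installed_version("dev-laiser")
print("dev-laiser:", version if version else "not installed in this environment")
```

If the helper reports that the package is missing, rerunning the install cell and restarting again is a reasonable first step.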

In [1]:
from laiser.skill_extractor import Skill_Extractor
import pandas as pd
import torch

INFO 04-03 21:23:54 [__init__.py:239] Automatically detected platform cuda.


## Using the Skill Extractor

#### With Job Descriptions

In [2]:
# Import the dataset
job_sample = pd.read_csv('https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/jobs-data/linkedin_jobs_sample_36rows.csv')

job_sample = job_sample[0:1]
job_sample = job_sample[['description', 'job_id']]
print("Considering", len(job_sample), "rows for processing...")

Considering 1 rows for processing...


In [3]:
job_sample

Unnamed: 0,description,job_id
0,\nJob description\nDescription\n\nDo you have ...,1


In [4]:
print('Initializing the Skill Extractor...')
se = Skill_Extractor(AI_MODEL_ID="marcsun13/gemma-2-9b-it-GPTQ", HF_TOKEN="<YOUR_HUGGING_FACE_API_TOKEN>", use_gpu=True)
print('The Skill Extractor has been initialized successfully!')

Initializing the Skill Extractor...
Found 'en_core_web_lg' model. Loading...
Downloading 'en_core_web_lg' model...
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
GPU is available. Using GPU for Large Language model initialization...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.50k [00:00<?, ?B/s]

INFO 04-03 21:24:43 [config.py:585] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 04-03 21:24:45 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='marcsun13/gemma-2-9b-it-GPTQ', speculative_config=None, tokenizer='marcsun13/gemma-2-9b-it-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), se

tokenizer_config.json:   0%|          | 0.00/40.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

INFO 04-03 21:24:49 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-03 21:24:49 [cuda.py:288] Using XFormers backend.
INFO 04-03 21:24:50 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-03 21:24:50 [model_runner.py:1110] Starting to load model marcsun13/gemma-2-9b-it-GPTQ...
INFO 04-03 21:24:51 [weight_utils.py:265] Using model weights format ['*.safetensors']


model-00002-of-00002.safetensors:   0%|          | 0.00/1.19G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

INFO 04-03 21:25:41 [weight_utils.py:281] Time spent downloading weights for marcsun13/gemma-2-9b-it-GPTQ: 50.143623 seconds


model.safetensors.index.json:   0%|          | 0.00/134k [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 04-03 21:26:00 [loader.py:447] Loading weights took 18.92 seconds
INFO 04-03 21:26:01 [model_runner.py:1146] Model loading took 5.7838 GB and 70.245523 seconds
INFO 04-03 21:26:14 [worker.py:267] Memory profiling takes 13.09 seconds
INFO 04-03 21:26:14 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 04-03 21:26:14 [worker.py:267] model weights take 5.78GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 2.38GiB; the rest of the memory reserved for KV Cache is 5.05GiB.
INFO 04-03 21:26:15 [executor_base.py:111] # cuda blocks: 985, # CPU blocks: 780
INFO 04-03 21:26:15 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 1.92x
INFO 04-03 21:26:19 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:53<00:00,  1.54s/it]

INFO 04-03 21:27:12 [model_runner.py:1570] Graph capturing finished in 54 secs, took 0.53 GiB
INFO 04-03 21:27:12 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 71.57 seconds
The Skill Extractor has been initialized successfully!
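
The initialization cell (`In [4]`) passes a placeholder token and a fixed `use_gpu=True`. Below is a minimal sketch of resolving both values at run time instead. It assumes the token is stored in an environment variable named `HF_TOKEN`; the helpers `resolve_hf_token` and `gpu_available` are hypothetical and not part of LAiSER.

```python
import importlib.util
import os


def resolve_hf_token(env=None):
    """Return the Hugging Face token from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def gpu_available():
    """Return True only if PyTorch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the sketch also runs without PyTorch

    return torch.cuda.is_available()


token = resolve_hf_token()
use_gpu = gpu_available()
print("HF token found:", token is not None, "| GPU available:", use_gpu)
```

These values could then be passed through the `HF_TOKEN` and `use_gpu` keyword arguments shown in the initialization cell. How `Skill_Extractor` behaves when no token is supplied is not covered here, so check that `token` is set before initializing.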





In [5]:
# extract skills and align them with the taxonomy database
output = se.extractor(job_sample, 'job_id', text_columns = ['description'])

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.51s/it, est. speed input: 143.99 toks/s, output: 24.51 toks/s]


0it [00:00, ?it/s]



In [6]:
# display the extracted skills and save them to a CSV file
display(output)
output.to_csv('extracted_skills_for_sample_jobs.csv', index=False)

Unnamed: 0,Research ID,Description,Raw Skill,Knowledge Required,Task Abilities,Skill Tag,Correlation Coefficient
0,1,\nJob description\nDescription\n\nDo you have ...,Research,"[technical challenges, problem solving, data e...","[identify needs, propose solutions, conduct ex...",ESCO.1543,0.879462
1,1,\nJob description\nDescription\n\nDo you have ...,Research,"[technical challenges, problem solving, data e...","[identify needs, propose solutions, conduct ex...",ESCO.1909,0.889391
2,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.607,0.876281
3,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.663,0.899144
4,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.718,0.851187
5,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.734,0.916649
6,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.785,0.850359
7,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.786,0.866268
8,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.791,0.851294
9,1,\nJob description\nDescription\n\nDo you have ...,Software Development,"[programming languages, algorithms, data struc...","[code software, debug solutions, implement pro...",ESCO.932,0.853463
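
The table above contains several taxonomy tags per raw skill, each with its own score. One way to condense it is to keep only the highest-scoring tag for each raw skill. The sketch below works on plain dictionaries so that it runs without LAiSER; `best_tag_per_skill` is a hypothetical helper, and the sample rows are copied from the table above.

```python
def best_tag_per_skill(rows):
    """Keep, for each raw skill, the taxonomy tag with the highest score."""
    best = {}
    for row in rows:
        skill = row["Raw Skill"]
        score = float(row["Correlation Coefficient"])
        if skill not in best or score > best[skill][1]:
            best[skill] = (row["Skill Tag"], score)
    return best


# A few rows copied from the output table above.
rows = [
    {"Raw Skill": "Research", "Skill Tag": "ESCO.1543", "Correlation Coefficient": 0.879462},
    {"Raw Skill": "Research", "Skill Tag": "ESCO.1909", "Correlation Coefficient": 0.889391},
    {"Raw Skill": "Software Development", "Skill Tag": "ESCO.663", "Correlation Coefficient": 0.899144},
    {"Raw Skill": "Software Development", "Skill Tag": "ESCO.734", "Correlation Coefficient": 0.916649},
]

best = best_tag_per_skill(rows)
for skill, (tag, score) in best.items():
    print(f"{skill}: {tag} ({score:.3f})")
```

With the notebook's DataFrame, `output.to_dict("records")` yields rows in this shape, assuming the column names match those displayed above.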


#### With Syllabi

In [7]:
syllabi_sample = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/syllabi-data/preprocessed_50_opensyllabus_syllabi_data.csv")
syllabi_sample = syllabi_sample[0:1]
syllabi_sample = syllabi_sample[['id', 'description', 'learning_outcomes']]
print("Considering", len(syllabi_sample), "rows for processing...")

Considering 1 rows for processing...


In [8]:
syllabi_sample

Unnamed: 0,id,description,learning_outcomes
0,4904852663176,"survey and analysis of cinema , including hist...",communications skills — to include effective w...


In [9]:
output = se.extractor(syllabi_sample, 'id', text_columns = ['description', 'learning_outcomes'], input_type = "syllabus")

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.64s/it, est. speed input: 69.87 toks/s, output: 26.95 toks/s]


0it [00:00, ?it/s]

In [10]:
# display the extracted skills and save them to a CSV file
display(output)
output.to_csv('extracted_skills_for_sample_syllabus.csv', index=False)

Unnamed: 0,Research ID,Description,Learning Outcomes,Raw Skill,Knowledge Required,Task Abilities,Skill Tag,Correlation Coefficient


None of the raw skills extracted from this sample correlates strongly enough with the taxonomy skills, so the output above is empty. Trying a different sample...
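
Rather than picking the next sample by hand, the search for a row that yields matches can be automated. The following is a minimal sketch under stated assumptions: `first_row_with_skills` is a hypothetical helper, and `fake_extract` is a stand-in for the real extraction call so that the example runs without a model.

```python
def first_row_with_skills(extract_one, n_rows):
    """Return (row_index, result) for the first row whose result is non-empty."""
    for i in range(n_rows):
        result = extract_one(i)
        if len(result) > 0:  # len() also gives the row count of a DataFrame
            return i, result
    return None, None


# Hypothetical stand-in for the real extraction call:
# row 0 yields no matches, later rows yield two.
def fake_extract(i):
    return [] if i == 0 else ["ESCO.48", "ESCO.1191"]


index, skills = first_row_with_skills(fake_extract, n_rows=3)
print(index, skills)
```

In the notebook, `extract_one` could wrap a call to `se.extractor` on a one-row slice of the syllabi DataFrame, using the same arguments as the cells in this section. Each call runs the language model, so a small `n_rows` keeps the search cheap.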

In [11]:
syllabi = pd.read_csv("https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/syllabi-data/preprocessed_50_opensyllabus_syllabi_data.csv")
syllabi = syllabi[['id', 'description', 'learning_outcomes']]
syllabi_sample = syllabi[1:2]
print("Considering", len(syllabi_sample), "rows for processing...")

output = se.extractor(syllabi_sample, 'id', text_columns = ['description', 'learning_outcomes'], input_type = "syllabus")

# display the extracted skills and save them to a CSV file
display(output)
output.to_csv('extracted_skills_for_sample_syllabus.csv', index=False)

Considering 1 rows for processing...


Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.52s/it, est. speed input: 75.45 toks/s, output: 26.08 toks/s]


0it [00:00, ?it/s]



Unnamed: 0,Research ID,Description,Learning Outcomes,Raw Skill,Knowledge Required,Task Abilities,Skill Tag,Correlation Coefficient
0,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Quality Improvement,"[process control, data analysis, quality manag...","[problem solving, process optimization, error ...",ESCO.1513,0.876636
1,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Statistical Process Control,"[SPC charts, data interpretation, process moni...","[quality assurance, process adjustments, data-...",ESCO.48,0.852867
2,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Statistical Process Control,"[SPC charts, data interpretation, process moni...","[quality assurance, process adjustments, data-...",ESCO.1191,0.940936
3,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Statistical Process Control,"[SPC charts, data interpretation, process moni...","[quality assurance, process adjustments, data-...",ESCO.2072,0.85366
4,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Statistical Process Control,"[SPC charts, data interpretation, process moni...","[quality assurance, process adjustments, data-...",ESCO.2093,0.85652
5,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Data Techniques,"[data collection, data analysis methods, visua...","[trend identification, problem analysis, proce...",ESCO.142,0.931964
6,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Data Techniques,"[data collection, data analysis methods, visua...","[trend identification, problem analysis, proce...",ESCO.556,0.867287
7,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Data Techniques,"[data collection, data analysis methods, visua...","[trend identification, problem analysis, proce...",ESCO.792,0.855737
8,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Data Techniques,"[data collection, data analysis methods, visua...","[trend identification, problem analysis, proce...",ESCO.1005,0.852624
9,661424964946,you will be provided with a practical understa...,employ statistics to solve problems . use stat...,Data Techniques,"[data collection, data analysis methods, visua...","[trend identification, problem analysis, proce...",ESCO.1266,0.881053
